Bioinformatics and the Future Role of Computing in Biology

(Also published on-line in AgBiotechNet at URL http://agbio.cabweb.org)

BRUNO W. S. SOBRAL
National Center for Genome Resources, Santa Fe, NM, U.S.A.

"The appearance of high performance computing environments has, to a great extent, removed the problem of increasing the biological reality of the mathematical models. For the first time in the history of the field, practical biological reality is finally within the grasp of biological modelers." (Preliminary Announcement, First World Congress on Computational Medicine, Public Health and Biotechnology, April 1994.)

"These models are never perfect as describers or predictors, but they can be continually revised." (Walter Truett Anderson, 1996, Evolution Isn't What It Used to Be.)

"The ARPANET, and later the Internet, grew as much from the free availability of software and documentation as from anything else." (Katie Hafner and Matthew Lyon, 1998, Where Wizards Stay Up Late.)

"Educators and professionals must deal with whether to specialize further or to expand their horizons into interdisciplinary studies, even at the expense of rigor as academically defined and rewarded. At the personal level, we all must decide whether to trade expanded consciousness for greater secular power and money." (Hazel Henderson, Building a Win-Win World: life beyond global economic warfare.)

Introduction

There is an increasing technological convergence between biology and computer science. We constantly hear about genetic algorithms, DNA chips, neural networks, and the like: jargon from separate disciplines merged into hybrid, compound terms. Much as with businesses, the biological research world is becoming truly inseparable from the information systems needed to support scientific research and technological development. While these developments are exciting and promising, they also present the scientific community with serious challenges. The intersection between computer science and biology has recently been described, loosely, as bioinformatics (Baxevanis and Ouellette, 1998). It is this bioinformatics that must meet the challenges and deliver the paths that will allow the new century of biology to bear fruit.

A database is a structured collection of data brought together and made persistent for the sake of querying (Ryan, 1996). Until now, biological databases have mostly served a "memory" function for the biological research community. However, simply storing data in a database does not provide biological researchers with the context needed for those data to be truly useful in the discovery of new biological knowledge, rules, or principles (Sobral et al., 1999). Over the past twenty years or so, engineers have worked with molecular biologists to develop very high-throughput data generation factories. Increasing miniaturization and novel laboratory technologies, such as the polymerase chain reaction, have allowed huge leaps in the amount of data that can be gathered
from biological organisms in a short time period. As a result of these developments, biology has gone from data poor to "data poisoned" very quickly. Consequently, the bottleneck has shifted from data generation to data management. Because of such challenges and bottlenecks, information technology itself has become a significant and integral part of the research process, becoming ever more embedded in scientific thought and work.

Despite the multidisciplinary nature of the challenge of biological data management, we have a wealth of specialists in biology, computer science, mathematics and engineering to draw on for creative solutions. However, competent specialists are necessary but not sufficient for meeting the challenge effectively. It is also necessary to include in research teams individuals who can put the pieces together meaningfully, and such individuals are rare. Further catalyzing the co-development of biology with other fields also requires institutional architectures with embedded frameworks that reward teamwork and multidisciplinarity. An inspiring example of institutional barrier breaking is provided by the Santa Fe Institute (http://www.santafe.edu/).

The convergence of electronics, mathematics and biology suggests a future in which the state of technology in each field will become, initially, dependent upon the state of technology in the others. Further in the future, it is possible that the convergence between these disciplines will result in much deeper impacts on future evolution, and thus on life itself.

This short article was requested with a vision into the future, "not just five years, but twenty-five years from now". I have tried to provide a brief glimpse of what may be happening a couple of decades from now at the end of this work. To provide some framework for this speculation, I will focus on the biology of crop plant species and agriculture as a technological application area; however, I hope the generalities are understood to apply across biology, even if the fields of application are distinct. I will briefly present the current situation, then extend over the next five or so years, and finally provide some thoughts about possible developments that may be decades out. I recognize that, as with all predictions, these are likely to become outdated and obviously incorrect the moment the paper leaves my desk.

The Present

Breeders sell modified organisms that they tailor to meet specific requirements set forth by farmers and consumers. Breeding is thus a real-time information management project, in which variant organisms are produced, evaluated, and their fate decided with respect to the next generation. As breeders' tools for evaluating their material have become more precise (from raw phenotypes to gene sequences, for example), the need for integrated data management has become more and more obvious. Information systems for public breeders have been developed as independent efforts across organisms and types of information relevant to the breeders' needs (for a summary, see Sobral et al., 1999). Even in the private sector, many information system resources were developed over long periods of time using various proprietary technologies. This development process has resulted in unnecessary constraints on utilization of the information and its transformation into knowledge and products.
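To make the earlier description of breeding as a real-time information management loop concrete, here is a toy Python sketch of one selection cycle: variants are produced, evaluated, and their fate decided with respect to the next generation. The phenotype scores, selection rule, and "crossing" step are invented purely for illustration and are not drawn from any breeding program described in this article.

# Toy sketch of a breeding cycle as an information management loop.
# Phenotype scores and the selection rule are invented for illustration.
import random

random.seed(42)

def evaluate(variant):
    """Stand-in for field or lab evaluation: return a phenotype score."""
    return variant["yield_potential"] + random.gauss(0, 0.5)

def next_generation(parents, size=20):
    """Produce new variants around the selected parents (toy 'crossing')."""
    return [{"yield_potential": random.choice(parents)["yield_potential"]
                                 + random.gauss(0, 0.3)}
            for _ in range(size)]

population = [{"yield_potential": random.gauss(5.0, 1.0)} for _ in range(20)]
for cycle in range(5):
    scored = sorted(population, key=evaluate, reverse=True)
    parents = scored[:5]              # decide each variant's fate: keep the best
    population = next_generation(parents)

print(round(sum(v["yield_potential"] for v in population) / len(population), 2))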
The problem of using information housed in different databases that provide different types of access is a computer science research and application area tackled daily by businesses and other organizations; it is known as the heterogeneous database problem. Various technologies and approaches can be leveraged from the experiences of these organizations in non-biological domains. However, in biological information management we suffer from two levels of heterogeneity: one is across different systems housing the same types of information (for example, genetic maps in RiceGenes and MaizeDB), and a second is across different types of data that need to be related and available for analysis through a single
interface (for example, genetic maps are different from DNA sequences, which are different from temporal profiles of gene expression, yet all are relevant to the inquisitive breeder engineering a new variety).

Private and federal funding for plant biology has increased significantly over the last few years, led by changes in biological research brought about through genomics. Genomics embodies the technology of high-throughput engineered laboratories acquiring data on populations of cellular components, in this case DNA sequences. Extending the genomics paradigm to other subcellular components, proteomics, metabolomics, and other such terms have come into existence. Whatever the new words we create for our new approaches, two interesting points emerge from being able to develop technologies that provide increasingly detailed snapshots of complex biological systems as they respond to environmental or structural changes. One is that biological data management becomes increasingly needed as an integral part of the research effort at the outset: integral in the sense that it is necessary to model the system users, understand relationships among data, and generate and evolve data models to make biological information systems useful today and tomorrow, in an ever-changing research environment. Thus, there are social adjustments to be made in the way biological research projects are developed, implemented, and funded. The other point is that, from the capacity to analyze parts in increasing detail, the necessity arises to integrate data to provide a framework for their comprehension and consumption by varied specialists, including breeders. And this data integration can only be done inside the computer (i.e., in silico). More and more we should expect that advances in scientific knowledge, as well as downstream products of breeding, will depend on geographically distributed groups of research specialists who pool their efforts and knowledge in analyzing biological information. Thus, there is a need to provide real-time data management to geographically distributed research collaborators. Fortunately, there is the Internet, developed in the ARPA days precisely to support collaborative scientific research (Hafner and Lyon, 1998).

The Next Five Years

On the laboratory technology front, the challenge will continue to be to devise more efficient and cost-effective technologies for identifying and scoring all types of genetic variants (at the structural level) in a given genome, with the human genome taking the lead (Chakravarti, 1999). Outside the domain of structural data (such as DNA and protein sequences), new ways to assay molecular responses of living organisms to environmental changes will continue to be developed. Of special interest is the development of high-throughput methods to monitor and analyze responses at the level of regulatory and biochemical networks, which will allow an enhanced understanding of genetic control. The shift in emphasis from data accumulation to data interpretation has already begun and will continue to expand. Integration of data types, provision of unified interfaces to complex biological data sets, and provision of distributed data acquisition, storage and analysis are a current focus of many public and private efforts in the broadly defined field of bioinformatics (Sobral et al., 1999).
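To illustrate in miniature what a unified interface over heterogeneous sources might look like, the following Python sketch normalizes genetic-map records from two differently structured sources into one common schema for querying. The record layouts and field names are hypothetical and are not taken from RiceGenes, MaizeDB, or any system described here.

# Hypothetical mediator over two heterogeneous genetic-map sources.
# Field names and record layouts are invented for illustration only.

def from_source_a(record):
    # Source A stores marker positions as {"marker": ..., "chrom": ..., "cM": ...}
    return {"marker": record["marker"],
            "linkage_group": record["chrom"],
            "position_cM": float(record["cM"]),
            "source": "SourceA"}

def from_source_b(record):
    # Source B stores the same information under different names.
    return {"marker": record["locus_name"],
            "linkage_group": record["lg"],
            "position_cM": float(record["map_position"]),
            "source": "SourceB"}

def unified_map(source_a_records, source_b_records):
    """Merge both sources into a single list of records in a common schema."""
    unified = [from_source_a(r) for r in source_a_records]
    unified += [from_source_b(r) for r in source_b_records]
    return sorted(unified, key=lambda r: (r["linkage_group"], r["position_cM"]))

def markers_on(unified, linkage_group):
    """Example query against the unified view: all markers on one linkage group."""
    return [r for r in unified if r["linkage_group"] == linkage_group]

ricegenes_like = [{"marker": "RM1", "chrom": "3", "cM": "52.4"}]
maizedb_like = [{"locus_name": "umc10", "lg": "3", "map_position": "48.0"}]
print(markers_on(unified_map(ricegenes_like, maizedb_like), "3"))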
However, there will be continued chaos in appropriately balancing raw data production with its interpretation and transformation into generally useful public reference sets for different types of science and technology consumers. Further refinement of strategy and goals in public-private research and technology is needed to reduce redundant data production and to increase the collaborative efforts that generate and sustain funding for high-quality public reference sets.
Meeting the needs of information integration, analysis and management must also address the models for data acquisition, ongoing curation, and distribution. Clearly, those having biological knowledge relevant to the data at hand are the most appropriate people to integrate and annotate those data. The importance of data quality cannot be overstated: sources of error in databases are varied and must be controlled when trying to implement new methods of prediction and classification. However, we still lack consistent and scalable models for integrating large-scale biological data sets with the existing world literature and the specific knowledge of biological researchers. Experiments will continue with international collaborations, in which data curation is provided in countries that are knowledge-rich but where funding for science is suffering.

Testing, deployment, and evolution of architectures to support data integration will require a certain data density across various types of data, in order to determine what types of complex queries will be usefully handled by the system. Thus, various early opportunities will likely be explored in model organisms. Model organism databases are approaching a framework of genomic sequences that are complete or nearing completion, large-scale expression profiling under varying environmental conditions or known mutations, and so on, such that data density across various data types can be achieved soon. Long-established model organisms also typically bring with them a rich research literature, though one biased in its own specific ways by the reasoning behind the establishment of the model. Scalable methods for tying "genomic" and other large-scale data sets into the world's scientific literature need to be explored, as is required for automated "framework" annotation of data coming from large-scale data factories (for a possible prototype, see Bailey et al., 1998). Model organisms tend to help organize research communities around various aspects of the basic biology of the organism in question. Although this creates a research bias, it also strengthens the will to contribute as a community to the increase of knowledge. The role of a strong community in advancing information systems research can be observed in the building of the Internet (Hafner and Lyon, 1998). Additionally, certain breeding-specific applications will need to be modeled directly through an understanding of the workflow of breeders, who typically do not work with model organisms. An interplay will need to occur between the evolution of the information system and the evolution of breeding and evaluation methods.

As organismal information resources over the Internet evolve, they will become more like updateable encyclopedias of knowledge. Model organism information systems will also affect the future of scientific publishing. One example of an electronic journal that is providing a forum to experiment with new paradigms of serving information to the scientific community is the Journal of Agricultural Genomics (http://www.ncgr.org/ag/jag/index.html), previously known as the Journal of Quantitative Trait Loci. Electronic information resources become ever more important with the development of multiple, organismally focused genome projects, given that most of the data generated by such projects will never appear in printed journals and will only be available to the scientific community via electronic resources.
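As a purely illustrative sketch of the kind of literature linkage mentioned above (tying large-scale data sets into the scientific literature), the following Python fragment relates data records to publications through shared controlled keywords. The identifiers, keywords, and index structure are invented; this is not a description of GAIA or of any existing system.

# Hypothetical example: linking large-scale data records to literature
# through shared controlled keywords. All identifiers are invented.

sequences = [
    {"id": "seq001", "keywords": {"disease resistance", "kinase"}},
    {"id": "seq002", "keywords": {"photosynthesis"}},
]

papers = [
    {"pmid": "P1", "keywords": {"kinase", "signal transduction"}},
    {"pmid": "P2", "keywords": {"photosynthesis", "chlorophyll"}},
]

def framework_annotation(sequences, papers):
    """For each sequence record, list papers sharing at least one keyword."""
    links = {}
    for seq in sequences:
        links[seq["id"]] = [p["pmid"] for p in papers
                            if seq["keywords"] & p["keywords"]]
    return links

print(framework_annotation(sequences, papers))
# {'seq001': ['P1'], 'seq002': ['P2']}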
The development and evolution of interface tools that communicate with the evolving underlying information coming from public model organism genome projects is itself an example of evolving scientific publication. Such electronic interfaces also provide a means for integrating and viewing large amounts of biological information, for interaction with the various types of specialists who are capable of discovering and testing new relationships in the biological data objects. Additional levels of investigation will involve not only large, multidimensional data sets, but will also bring in time as a variable. One important method through which organismal information resources can help investigators find new relationships is the provision of a common biological vocabulary. Such a vocabulary can be instrumental in developing comparative approaches to using organismal data sets. Fortunately, some key organismal information resources are already collaborating with each other on standard controlled vocabularies for
function, process, and subcellular localization of gene products. Extension of collaboration on such vocabularies will increasingly provide the power of comparative biology to these information resources. In addition to agreement on controlled vocabularies from biologists, information technologists working in bioinformatics will need to begin agreeing more on common protocols, much as occurred during the development of the Internet. The purpose of agreeing on common protocols for sharing information is to bring everyone in, rather than to exclude. From the early days of the computer networking community, Requests for Comments (RFCs) have been, and continue to be, the principal means of expression and the accepted manner of recommending, reviewing and adopting new technical standards (Hafner and Lyon, 1998). Whatever the means, the adoption of standards is required to lay a foundation for a growing community of diverse specialists needing to work with information systems.

Engineering software that supports the needs of diverse specialists accessing integrated data sets will require publicly funded efforts to share approaches and to define and adhere to standards for communication, such as those provided by the Life Sciences Task Force of the Object Management Group (OMG) (see http://www.omg.org/homepages/lsr). OMG is the standard-bearer for the Common Object Request Broker Architecture (CORBA), a standard for platform- and language-independent distributed object computing. Such standards help distributed object technology and the Internet to be harnessed for biological research. A possible structure for a generic organismal information resource is shown in Figure 1. Note that the central "brain" of this system is the application server, whose goal is to accept and process requests for data from diverse clients. Because the application server is accessible to applications, community-developed extensions can be integrated easily. In this structure, client-side communication is handled using CORBA interfaces that are expressed in terms of biologically based objects, such as physical maps, DNA sequences, and phenotypes. Finally, the application server also allows control of data input.

The speed of change of information systems over the Internet will continue at a sizzling pace. Most biologists probably know that public DNA sequence information doubles approximately every 12 to 14 months. This is roughly comparable to the doubling time of electronic information as a whole, estimated at 18 months (Dhar and Stein, 1997). Fortunately, our computing capacity seems to be on a similar exponential growth pattern. The growth in accessibility of the information to users is what really causes the feeling of data poisoning. Thus, in the short term, biological information systems need to be able to find, summarize and interpret large amounts of multidimensional data. It is from these requirements that the future of organismal databases needs to extract its vision. Import of external data into organismal databases will be required to enable complex queries, insulate the databases from dependence on remote resources, increase performance, and provide security in a rapidly evolving environment. Data warehousing is one possible approach to integrating information from distributed information resources (Anahory and Murray, 1997); a prototype system was developed and tested using a virtual database for the prototype phase (Ritter, 1994).
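A minimal, hypothetical sketch of such a warehouse import step follows: records are pulled from stand-ins for remote resources and loaded into a local SQLite "warehouse" table, so that subsequent queries run locally, independent of the remote resources' availability. The source functions, schema, and accession identifiers are invented for illustration.

# Hypothetical sketch: importing records from external resources into a
# local "warehouse" table. The sources, schema, and fields are invented.
import sqlite3

def fetch_remote_ricegenes():
    # Stand-in for a call to a remote resource.
    return [("RG1", "rice", "ATGCC"), ("RG2", "rice", "GGTAA")]

def fetch_remote_maizedb():
    return [("MZ7", "maize", "TTACG")]

# A real warehouse would use a persistent file; in-memory keeps the sketch simple.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("""CREATE TABLE IF NOT EXISTS sequence
                     (accession TEXT PRIMARY KEY, organism TEXT, residues TEXT)""")

# Periodic import: complex queries then run locally.
for record in fetch_remote_ricegenes() + fetch_remote_maizedb():
    warehouse.execute("INSERT OR REPLACE INTO sequence VALUES (?, ?, ?)", record)
warehouse.commit()

for row in warehouse.execute("SELECT organism, COUNT(*) FROM sequence GROUP BY organism"):
    print(row)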
Once information from different sources is integrated into a "warehouse", it is possible to use On-Line Analytical Processing (OLAP) systems as tools to explore and package the information further, so it can be delivered in the form(s) needed (Dhar and Stein, 1997; Thomsen, 1997). The goal of data warehousing is to provide a decision support infrastructure for OLAP servers. The goal of OLAP servers is to provide views of the data that allow varying perspectives along many dimensions. Entry points can be provided based on the characteristics of the data themselves (Dhar and Stein, 1997).
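As a minimal illustration of the multidimensional views an OLAP-style tool provides, the sketch below summarizes invented gene-expression measurements along gene, tissue, and time dimensions. The data set is made up, and the pandas library is used here only as a convenient stand-in for an OLAP server.

# Illustrative only: summarizing invented expression measurements along
# several dimensions, in the spirit of an OLAP view over a warehouse.
import pandas as pd

measurements = pd.DataFrame([
    {"gene": "g1", "tissue": "leaf", "hour": 0,  "expression": 1.2},
    {"gene": "g1", "tissue": "leaf", "hour": 24, "expression": 3.4},
    {"gene": "g1", "tissue": "root", "hour": 0,  "expression": 0.8},
    {"gene": "g2", "tissue": "leaf", "hour": 0,  "expression": 2.1},
    {"gene": "g2", "tissue": "root", "hour": 24, "expression": 0.5},
])

# One "view": mean expression per gene and tissue, collapsing the time dimension.
by_gene_tissue = pd.pivot_table(measurements, values="expression",
                                index="gene", columns="tissue", aggfunc="mean")

# Another entry point into the same data: per-tissue profiles over time.
by_tissue_time = pd.pivot_table(measurements, values="expression",
                                index="tissue", columns="hour", aggfunc="mean")

print(by_gene_tissue)
print(by_tissue_time)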
[Figure 1 (diagram): a three-tiered architecture. Client tier: browser (HTML), browser (applet), OrganismDB application, and third-party applications, communicating over HTTP and CORBA. Middle tier: web server and application server with a Data Object Layer, connected via JDBC. Database tier: OrganismDB and other databases, reached through database-dependent protocols.]
FIGURE 1: A three-tiered architecture introduces a "middle tier", usually in the form of an application server, between client-side software and database servers. This design approach helps to reduce the complexity of client applications and to insulate them from changes to databases (DB). An application server can also be useful for dealing with issues relating to concurrent access by multiple users (e.g., locking to prevent "dirty writes"), and can provide a single, unified point of access to multiple databases. The Data Object Layer (DOL) is a library of software designed to serve as a convenient means for coupling object-oriented applications with a variety of database management systems. Using object-oriented techniques, it encapsulates database-related tasks inside a single class hierarchy, which can easily be extended (using inheritance) to support new and different database management systems. In theory, these database management systems (DBMS) may be relational, object-oriented, AceDB (A C. elegans DataBase), flat-file based, or of any other design. Applications that use the DOL are insulated from the particulars of the DBMS, which are considered abstractly as "persistence mechanisms". The DOL allows access to multiple heterogeneous persistence mechanisms, and would enable us to "swap out" one such mechanism for another, should we eventually desire to do so.

The importance of deploying public information system resources over the Internet should not be forgotten. Not only would this fulfill one of the original primary goals of the ARPANET, which was to tie together research laboratories through the real-time sharing of information and computational resources, it would also fulfill one of the original secondary goals of the ARPANET, which was to stem the flow of the best minds from universities into the private sector because of the disparate salary ranges (Hafner and Lyon, 1998). The concept was that university scientists would stay in the public research sector because they would have available, through distributed computing, resources similar to those of private sector scientists. This would not solve the salary differential, but it would support the continuation of cutting-edge research in universities.

From the perspective of bioinformatics, we continue to deal with large amounts of very noisy data and an absence of general theories. Thus, many approaches will be of the "learn the theory from the data" type (a good resource is Baldi and Brunak, 1998). If closely coupled with experimentation, such machine learning approaches may become quite powerful. From these endeavors, a theoretical framework for discovering new relationships among biological data may emerge and evolve quickly.
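A toy Python sketch of the "learn the rule from the data" idea follows: a single expression threshold is chosen to best separate two invented classes of noisy observations. The data, labels, and the learned rule are purely illustrative and imply no particular biological method.

# Toy "learn the rule from the data" example: fit a single expression
# threshold that best separates two invented classes of noisy observations.
import random

random.seed(1)

# Invented training data: (expression level, is_stress_responsive)
data = [(random.gauss(2.0, 0.7), 0) for _ in range(50)] + \
       [(random.gauss(4.0, 0.7), 1) for _ in range(50)]

def errors(threshold):
    """Number of misclassifications if we predict class 1 above the threshold."""
    return sum((x > threshold) != bool(y) for x, y in data)

# "Learning": pick the candidate threshold with the fewest training errors.
candidates = [x for x, _ in data]
best = min(candidates, key=errors)
print(f"learned threshold: {best:.2f}, training errors: {errors(best)}/{len(data)}")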
The continuing trend of an increasing wealth of data coupled with a (still) vanishingly small knowledge of biology means that short-term solutions to using integrated biological data sets will be mostly of an exploratory nature. Large amounts of uncertainty in the data (missing and erroneous facts) make many deductive approaches difficult to apply. However, an increase in the application of various types of modeling techniques is expected to occur, even in the deductive arena. Primarily, these will be in the areas of fuzzy logic (Zadeh, 1965; De Caluwe, 1997), neural networks (see the interesting book by Kosko, 1993), and genetic algorithms (Koza, 1992).

The Next Decades

Data integration and the development of high-quality reference sets will likely provide some calming of the chaos that will still exist in the near future with respect to public reference data sets. Perhaps many of those reference sets will be organismal in nature. Whatever their nature, a trend toward systems integration will likely lead to the opportunity to model virtual organisms extensively, using those same data sets and highly evolved (and evolving) ideas about how to put the "parts" together into a whole.

Interfaces for biological data will evolve beyond the text or even screen graphics phase, and virtual environments will exist for collaborating on real or simulated data sets, bringing together specialists from different domains into collaborative virtual research. This will push computer systems to evolve toward supporting multiple users working together in real time in the same virtual environment. Outside the video game industry, our PC-centric world was developed with individual desktops in mind, not collaborative use of computational resources. While concurrent users are supported by current systems, they typically are not working on the same data at the same time. Thus, one could imagine two scientists collaborating in real time through virtual reality helmets, working together through a data set that has just been acquired from an automated data factory running overnight somewhere else. The capacity to enter virtual reconstructions of living cells or organisms and essentially grab and change components surely will thrill molecular-minded researchers in the next decades!

If truly integrated, highly polished (from a data quality standpoint) data sets are created through new models for collaboration between software developers and biologists, then it follows that there will be an ongoing increase in the "intelligence density" of the information system. Dhar and Stein (1997) defined increasing intelligence density of data as "a heuristic measure of `army type' intelligence". Ultimately, intelligence density is meant to be an information-age attempt to measure what was thought of in the industrial age as "productivity". Thus, it refers to the fact that ways are needed to allow people to see the "important" features of the information quickly, however importance may be defined by the person in question. High-density materials thus allow users of the system to spend less time on low-level details and more time on the higher, value-added aspects. Such systems clearly will provide an advantage to their users when compared to lower-density systems. Increasing the intelligence density of information systems will be one of the enabling technologies on which our capacity to develop and test hypotheses about system-level phenomena in biological organisms will depend.
The enabling feature is the capacity to use large multidimensional data sets that are coherently presented to the specialist trying to think at that level. Increasing laboratory miniaturization and non-destructive technologies to acquire data from living systems in real time may eventually lead to intensive exploration of a comprehensive theory of life at a molecular level. For example, one promising technology could be based on the use of very small robots, such as nanobots, to enter living organisms. It may someday be possible to have specifically targeted small robots inside experimental organisms, acquiring and sending real-time information about specific subcellular components to an information system. Perturbations of those experimental organisms will
then allow acquisition of data concerning cellular and subcellular responses to the perturbation. Integration of those data over time will allow a much richer understanding of how biological systems work.

The capacity to acquire data from living organisms in real time suggests that management of organisms bred for consumption will likely change radically. It would certainly be feasible to "wire" some number of "reporter" organisms in the field or lot and then use real-time analysis of uploaded data and highly automated management environments to control the administration of chemical or, increasingly, biochemical resources to increase productivity. Such systems are the beginning of a much more explicitly cybernetic future, in which humans and their information systems as a whole become ever more integrated, dependent and, why not, co-evolving. Whatever the future holds for the promised "Century of Biology", it is very likely that only through the convergence of disciplines will the opportunity to understand the inherent complexity of biological systems occur. It certainly is an exciting time to be a part of this convergence.

Acknowledgments

I take full responsibility for this work. However, it would not have been possible without hours of excellent interactions with various scientists and engineers at NCGR. I especially thank Adam Siepel, Bill Beavis, Allan Dickerman, Rob Pecherer, David Stamper, Andrew Farmer, Mark Waugh and Pedro Mendes. Intriguing and invigorating discussions with R. W. Doerge (Purdue University) and Rob Farber (Los Alamos National Laboratory) were also instrumental in helping to shape these thoughts.

References

Anahory, S., and D. Murray, 1997. Data Warehousing in the Real World: a practical guide for building decision support systems. Addison-Wesley, London.
Anderson, W. T., 1996. Evolution Isn't What It Used to Be: the augmented animal and the wired world. W. H. Freeman & Co., New York, NY.
Baldi, P., and S. Brunak, 1998. Bioinformatics: the machine learning approach. MIT Press, Cambridge, MA.
Bailey Jr., L. C., S. Fischer, J. Schug, J. Crabtree, M. Gibson and G. C. Overton, 1998. GAIA: framework annotation of genomic sequence. Genome Res. 8:234-250.
Baxevanis, A. D., and B. F. F. Ouellette, 1998. Bioinformatics: a practical guide to the analysis of genes and proteins. John Wiley & Sons, New York, NY.
Chakravarti, A., 1999. Population genetics--making sense out of sequence. Nature Genet. 21(suppl. 1):56-60.
De Caluwe, R. (ed.), 1997. Fuzzy and Uncertain Object-Oriented Databases: concepts and models. Advances in Fuzzy Systems--Applications and Theory, vol. 13. World Scientific, Singapore.
Dhar, V., and R. Stein, 1997. Seven Methods for Transforming Corporate Data into Business Intelligence. Prentice Hall, Upper Saddle River, NJ.
Hafner, K., and M. Lyon, 1998. Where Wizards Stay Up Late: the origins of the internet. Touchstone, New York, NY.
Henderson, H., 1996. Building a Win-Win World: life beyond global economic warfare. Berrett-Koehler Publishers, San Francisco, CA.
Kosko, B., 1993. Fuzzy Thinking: the new science of fuzzy logic. Hyperion, New York, NY.
Koza, J. R., 1992. Genetic Programming: on the programming of computers by means of natural selection. MIT Press, Cambridge, MA.
Ritter, O., 1994. The integrated genome database (IGD). In: S. Suhai (ed.), Computational Methods in Genome Research. Plenum Press, New York, NY.
Ryan, T. W., 1996. Distributed Object Technology: concepts and applications. Prentice-Hall, Upper Saddle River, NJ.
Sobral, B. W. S., M. Waugh, and B. Beavis, 1999. Information systems approaches to support discovery in agricultural genomics. In: Advances in Cellular and Molecular Biology of Plants, Volume 1: DNA-Based Markers in Plants (2nd ed.), R. L. Phillips and I. K. Vasil (eds.). (In press).
Thomsen, E., 1997. OLAP Solutions: building multidimensional information systems. Wiley Computer Publishing, New York, NY.
Zadeh, L., 1965. Fuzzy Sets. Information and Control 8:338-353.
