Bioinformatics is the science of storing, retrieving and analyzing biological information, with the computer as its tool, in the service of life-science research. It is one of the important frontier fields of life science and natural science today, and will be one of the core fields of natural science in the 21st century. Its research focuses mainly on genomics and proteomics; specifically, it analyzes the biological information about structure and function expressed in nucleic acid and protein sequences.
Bioinformatics is a subject that uses computer technology to study the laws of biological systems.
At present, bioinformatics is essentially a combination of molecular biology and information technology (especially Internet technology). Its research materials and results are biological data of all kinds; its research tool is the computer; and its research methods include the search (collection and screening), processing (editing, organizing, managing and displaying) and use (computation and simulation) of biological data.
Since the 1990s, with the development of various genome sequencing projects, breakthroughs in molecular structure determination techniques and the spread of the Internet, hundreds of biological databases have mushroomed. This poses a severe challenge to bioinformatics workers: what information is contained in the hundreds of millions of ACGT bases? How does this information in the genome control the development of organisms? How did the genome itself evolve?
Another challenge of bioinformatics is to predict the structure of a protein from its amino acid sequence. This question has puzzled theoretical biologists for more than half a century, and finding its answer is becoming ever more urgent. The Nobel laureate Walter Gilbert pointed out in 1991: "The traditional way of solving problems in biology is experimental. Now, based on the fact that all genes will be known and electronically resident in databases, the starting point of the new model of biological research should be theoretical. A scientist will begin from theoretical conjecture and only then return to experiment to track or verify these theoretical hypotheses."
The main research directions of bioinformatics are genomics, proteomics, systems biology and comparative genomics.
Rather than quote a lengthy definition of bioinformatics, let us explain its core application in plain language. With the landmark progress of genome sequencing projects, including the Human Genome Project, the resulting biological data, covering the whole life cycle of organisms from birth and aging to illness and death, are growing at an unprecedented speed and now double roughly every 14 months. At the same time, with the spread of the Internet, hundreds of biological databases have mushroomed. But all this is only the acquisition of raw biological information, the primary stage of the biological information industry. At this stage, most bioinformatics enterprises make a living by selling biological databases; Celera, famous for sequencing the human genome, is a successful representative of this stage.
After the original biological information resources have been excavated, life scientists are faced with severe challenges: what information is contained in hundreds of millions of ACGT sequences? How does this information in the genome control the development of organisms? How did the genome itself evolve? The advanced stage of bioinformatics industry is reflected here, and mankind has since entered the post-genome era centered on bioinformatics. The new drug innovation project combined with bioinformatics is a typical application at this stage.
Development history
Bioinformatics is grounded in molecular biology, so to understand bioinformatics one must first have a basic picture of how molecular biology developed. Research on the structure and function of biological macromolecules in the cell began long ago. In 1866, Mendel proposed, from his experiments, the hypothesis that genes exist as discrete biological units. In 1871, Miescher isolated deoxyribonucleic acid (DNA) from the nuclei of dead white blood cells. Before Avery and McCarty proved in 1944 that DNA is the genetic material of living organisms, it was thought that chromosomal proteins carried the genes, with DNA playing only a secondary role. Chargaff then discovered the famous Chargaff rules: in DNA the amount of guanine always equals the amount of cytosine, and the amounts of adenine and thymine are likewise equal. At the same time, Wilkins and Franklin used X-ray diffraction to study the structure of DNA fibers. In 1953, James Watson and Francis Crick proposed the three-dimensional structure of DNA, the double helix, in Nature: two sugar-phosphate chains form a double helix, and, in accordance with Chargaff's rules, the bases on the deoxyribose units pair up between the two chains. The model shows that DNA has a self-complementary structure, and that by the base-pairing principle the genetic information stored in DNA can be copied accurately. Their theory laid the foundation of molecular biology. The double helix model also predicted the law of DNA replication: in 1956, Kornberg isolated DNA polymerase I from E. coli, an enzyme that links the four dNTPs into DNA, and DNA replication requires DNA as a template; Meselson and Stahl (1958) proved that DNA replication is semi-conservative. In 1958, Crick put forward the law of genetic information transfer, that DNA is the template for synthesizing RNA and RNA the template for synthesizing protein; it is called the central dogma and has played an extremely important guiding role in the later development of molecular biology and bioinformatics. Through the efforts of Nirenberg and Matthaei (1963), the genetic code for the 20 amino acids was deciphered. The discovery of restriction endonucleases and of recombinant DNA cloning laid the technical foundation of genetic engineering. Precisely because research in molecular biology has so greatly advanced the life sciences, the emergence of bioinformatics became a necessity. In February 2001, the sequencing of the Human Genome Project was completed and bioinformatics reached a climax. Owing to the rapid development of automated DNA sequencing, the amount of nucleotide sequence data in DNA databases is growing at about 10^6 bp per day, and biological information is rapidly expanding into an ocean of data. We are undoubtedly moving from an era of data accumulation to an era of data interpretation, and such huge accumulations of data often harbor the possibility of breakthrough discoveries. Bioinformatics is the interdisciplinary subject founded on this premise. Roughly speaking, the core of the field is to study how to understand DNA sequence, structure and evolution, and their relationship with biological function, more deeply through the statistical computation and analysis of DNA sequences. Its research topics involve molecular biology, molecular evolution and structural biology, statistics, computer science and many other fields.
Bioinformatics is a subject with rich content. Its core is genome informatics, including the acquisition, processing, storage, distribution and interpretation of genomic information. The key to genome informatics is to "read" the nucleotide sequence of the genome, that is, the exact position of every gene on the chromosomes and the function of each DNA segment. After new genetic information is discovered, the spatial structure of the corresponding protein can be simulated and predicted, and drugs can then be designed according to the function of the specific protein. Understanding the regulatory mechanisms of gene expression is also an important part of bioinformatics, describing, through the roles biomolecules play in gene regulation, the laws underlying the diagnosis and treatment of human disease. The research goal is to reveal "the complexity of the genome's information structure and the fundamental laws of the genetic language" and to interpret the genetic language of life. Bioinformatics has become an important component of the development of the life sciences as a whole and a frontier of life-science research.
Main research directions
Bioinformatics has formed many research directions in just over a decade. The following briefly introduces some of the major research hotspots.
1. Sequence alignment
The basic problem of sequence alignment is to compare two or more symbol sequences for similarity or dissimilarity. From a biological point of view, the problem covers the following tasks: (1) reconstructing a complete DNA sequence from overlapping sequence fragments; (2) determining physical and genetic maps from probe data under various experimental conditions; (3) storing, traversing and comparing the DNA sequences in databases; (4) comparing the similarity of two or more sequences; (5) searching databases for related sequences and subsequences; (6) finding recurring patterns of nucleotides; (7) identifying the informative components in protein and DNA sequences. Sequence alignment takes into account the biological characteristics of DNA sequences, such as local insertions and deletions (together abbreviated as indels) and substitutions; the objective function of an alignment is the minimum weighted sum of distances, or the maximum sum of similarities, over the set of mutations between the sequences. Alignment methods include global alignment, local alignment and gap penalties. Dynamic programming is commonly used to align two sequences; it works well when the sequences are short, but is unsuitable for massive gene sequences (such as human DNA, up to 10^9 bp), where even algorithms of linear complexity struggle. Heuristic methods were therefore introduced.
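To make the dynamic programming idea concrete, here is a minimal sketch of global (Needleman-Wunsch) alignment in Python; the scoring values (match +1, mismatch -1, gap -1) are illustrative assumptions, not parameters taken from the text. Its O(nm) time and space cost is exactly why heuristic methods are needed at genome scale.

```python
# Minimal global alignment (Needleman-Wunsch) sketch.
# The scoring scheme (match/mismatch/gap) is an illustrative assumption.

def global_align(a, b, match=1, mismatch=-1, gap=-1):
    n, m = len(a), len(b)
    # score[i][j] = best score aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,  # substitution/match
                              score[i - 1][j] + gap,    # deletion (indel)
                              score[i][j - 1] + gap)    # insertion (indel)
    return score[n][m]

print(global_align("ACGTT", "ACGT"))  # 3: four matches plus one gap
```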
2. Comparison and prediction of protein structure
The basic problem is to compare the spatial structures of two or more protein molecules for similarity or dissimilarity. The structure and function of a protein are closely related, and it is generally believed that proteins with similar functions have similar structures. A protein is a long chain of amino acids, ranging in length from about 50 up to 1000-3000 residues. Proteins have many functions: enzymes, storage and transport of substances, signal transmission, antibodies, and so on. The amino acid sequence inherently determines the three-dimensional structure of the protein, and proteins are generally considered to have four levels of structure. The motivations for studying protein structure and its prediction are to understand the functioning of organisms in medicine, to find docking targets for drugs, to improve crops through genetic engineering in agriculture, and to exploit enzymatic synthesis in industry. The reason for comparing protein structures directly is that the three-dimensional structure of a protein is more stable in evolution than its primary structure and carries more information than the amino acid sequence alone. Research on protein three-dimensional structure assumes that the amino acid sequence corresponds uniquely to the three-dimensional structure (which is not necessarily true); physically, this is explained by energy minimization. Structures of unknown proteins are predicted by observing and summarizing structural regularities in proteins of known structure; both homology modeling and threading fall into this category. Homology modeling is used between proteins of high sequence similarity (more than 30% identical amino acids), while threading compares protein structures across evolutionary families. However, the state of protein structure prediction still falls far short of actual needs.
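Direct comparison of two protein structures, as discussed above, often reduces to optimally superposing two coordinate sets and measuring the root-mean-square deviation (RMSD). Below is a minimal sketch using the standard Kabsch algorithm; the C-alpha coordinates are invented toy data.

```python
# Sketch: optimal superposition (Kabsch algorithm) and RMSD between two
# equal-length C-alpha coordinate sets. The toy coordinates are illustrative.
import numpy as np

def kabsch_rmsd(P, Q):
    # Center both coordinate sets at the origin.
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    # Optimal rotation from the SVD of the covariance matrix H = P^T Q.
    U, S, Vt = np.linalg.svd(P.T @ Q)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against improper reflection
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    P_rot = P @ R.T                          # apply rotation to row vectors
    return np.sqrt(((P_rot - Q) ** 2).sum() / len(P))

P = np.array([[0.0, 0, 0], [1, 0, 0], [1, 1, 0], [2, 1, 0]])
Q = np.array([[0.0, 0, 0], [0, 1, 0], [-1, 1, 0], [-1, 2, 0]])  # P rotated 90 deg
print(round(kabsch_rmsd(P, Q), 6))  # ~0.0 for a pure rotation
```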
3. Gene identification and non-coding region analysis
The basic problem of gene recognition is to identify correctly the extent and exact positions of the genes in a given genomic sequence. Non-coding regions consist of introns, which are spliced out and discarded before protein synthesis; yet experiments show that if the non-coding regions are removed entirely, gene replication cannot be completed. Evidently the DNA sequence, as a genetic language, carries information not only in the coding regions but also implicitly in the non-coding sequences, for which no general analytical method yet exists. In the human genome, not every sequence codes, i.e. serves as a template for some protein; the coding portion accounts for only 3-5% of the total sequence. Searching so large a sequence by hand is obviously unthinkable. Methods for detecting coding regions include measuring codon frequencies in coding regions, first- and second-order Markov chains, ORF (open reading frame) detection, promoter recognition, HMMs (hidden Markov models), GENSCAN, spliced alignment, and so on.
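As a concrete example of the ORF-based detection mentioned above, here is a naive sketch that scans all six reading frames for start-to-stop stretches; the minimum-length threshold and the sample sequence are illustrative assumptions, and real gene finders such as GENSCAN are far more sophisticated.

```python
# Sketch: naive ORF (open reading frame) detection in all six frames.
# The length threshold and sample sequence are illustrative assumptions.

START, STOPS = "ATG", {"TAA", "TAG", "TGA"}

def revcomp(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def orfs(seq, min_codons=3):
    found = []
    for strand in (seq, revcomp(seq)):      # both strands
        for frame in range(3):              # three frames per strand
            i = frame
            while i + 3 <= len(strand):
                if strand[i:i+3] == START:
                    for j in range(i + 3, len(strand) - 2, 3):
                        if strand[j:j+3] in STOPS:
                            if (j - i) // 3 >= min_codons:
                                found.append(strand[i:j+3])
                            break
                i += 3
    return found

print(orfs("CCATGAAATTTGGGTAACC"))  # one short ORF: ATG AAA TTT GGG TAA
```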
4. Molecular Evolution and Comparative Genomics
Molecular evolution uses the similarities and differences of the same gene sequence across species to study the evolution of organisms and to build evolutionary trees. One can use not only DNA sequences but also the amino acid sequences they encode, and even structural comparisons of related proteins, on the premise that related species are genetically similar. By comparison one can discover what is conserved between species and what differs. Early studies usually took external traits, such as size, skin color and number of limbs, as the basis for inferring evolution; in recent years, with the completion of genome sequencing for many model organisms, molecular evolution can be studied at the whole-genome level. In matching genes of different species, three cases generally have to be handled: orthologs, genes in different species with the same function; paralogs, homologous genes with different functions; and xenologs, genes transferred between organisms by other means, for example by viral insertion. The common approach in this field is to construct phylogenetic trees, using character-based methods (that is, the specific positions of bases or amino acids in DNA or protein sequences), distance-based methods (alignment scores), and some traditional clustering methods (such as UPGMA).
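To illustrate the distance-based tree construction mentioned above, here is a minimal UPGMA sketch; SciPy's average-linkage clustering on a condensed distance matrix is exactly UPGMA. The distance matrix and taxon names are invented for illustration.

```python
# Sketch: UPGMA tree from a pairwise distance matrix (e.g. distances derived
# from alignment scores). The matrix values and taxa are illustrative.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

taxa = ["human", "chimp", "mouse", "chicken"]
D = np.array([[0.0, 0.1, 0.5, 0.9],
              [0.1, 0.0, 0.5, 0.9],
              [0.5, 0.5, 0.0, 0.9],
              [0.9, 0.9, 0.9, 0.0]])

# Average linkage on a condensed distance matrix is exactly UPGMA.
tree = linkage(squareform(D), method="average")
print(tree)  # each row: the two clusters merged, merge height, new size
```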
5. Sequence contig assembly
With current sequencing technology, each reaction can read only about 500 base pairs or somewhat more. Measuring human genes with the shotgun method, for example, produces a large number of short sequences that form contigs. The process of splicing them step by step into ever longer contigs until the complete sequence is obtained is called contig assembly. From an algorithmic point of view, contig assembly is an NP-complete problem.
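A minimal sketch of the greedy flavor of contig assembly follows: repeatedly merge the pair of reads with the longest exact suffix-prefix overlap. The reads and the minimum-overlap threshold are toy assumptions; real assemblers must also handle sequencing errors, repeats and reverse complements.

```python
# Sketch: greedy contig assembly by repeatedly merging the pair of reads
# with the longest exact suffix-prefix overlap. Reads are toy data.

def overlap(a, b, min_len=3):
    # Longest suffix of a that equals a prefix of b (at least min_len).
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0

def assemble(reads):
    reads = list(reads)
    while len(reads) > 1:
        best = (0, None, None)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if k == 0:                      # no overlaps left: disjoint contigs
            break
        merged = reads[i] + reads[j][k:]
        reads = [r for n, r in enumerate(reads) if n not in (i, j)]
        reads.append(merged)
    return reads

print(assemble(["ACGTAC", "GTACGG", "ACGGTT"]))  # -> ['ACGTACGGTT']
```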
6. The origin of the genetic code
Studies of the genetic code have generally held that the correspondence between codons and amino acids arose from a chance event in the history of biological evolution and has been fixed, down to the present, in the common ancestor of modern organisms. In contrast to this "frozen accident" theory, three other theories have been proposed to explain the genetic code: selection optimization, chemical and historical. The completion of genome sequencing for many organisms provides new material for studying the origin of the genetic code and for testing the above theories.
7. Structure-based drug design
One purpose of human genome research is to understand the structures, functions and interactions of the roughly 100,000 kinds of proteins in the human body and their relationship to human diseases, and to seek methods of treatment and prevention, including drug therapy. Drug design based on biological macromolecules and small molecules is an extremely important research field in bioinformatics. To inhibit the activity of certain enzymes or proteins, candidate inhibitor molecules can be designed on the computer with molecular docking algorithms, starting from the known tertiary structure of the protein. The aim of this field is to find new drugs, and it promises great economic benefit.
8. Modeling and simulation of biological systems
With the development of large-scale experimental techniques and the accumulation of data, studying and analyzing biological systems at the global, systems level to reveal their laws of development has become another research hotspot of the post-genome era: systems biology. Its current research content includes the simulation of biological systems (Curr Opin Rheumatol, 2007, 463-70), analysis of system stability (Nonlinear Dynamics Psychol Life Sci, 2007, 413-33) and analysis of system robustness (Ernst Schering Res Found Workshop, 2007, 69-83). Modeling languages represented by SBML (Bioinformatics, 2007, 1297-8) have developed rapidly, and formalisms such as Boolean networks (PLoS Comput Biol, 2007, e163), differential equations (Mol Biol Cell, 2004, 3841-; 2007, 3262-92) and discrete dynamic event systems (Bioinformatics, 2007, 336-43) have been used to build many models, often with reference to the modeling of physical systems such as electrical circuits; many studies also try to tackle the complexity of the system from macroscopic viewpoints such as information flow, entropy and energy flow (Anal Quant Cytol Histol, 2007, 296-308). Of course, establishing a theoretical model of a biological system will still take a long time. Although experimental observations are increasing greatly, the data needed to identify a biological system model far exceed what current experiments can produce; for time-series chip data, for example, the number of sampling points is too small for traditional time-series modeling, and the enormous experimental cost is the main present difficulty of system modeling. Methods of system description and modeling also await pioneering development.
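As a toy instance of the differential-equation models cited above, the sketch below simulates a two-variable negative-feedback gene circuit with SciPy; the model form and all rate constants are invented for illustration.

```python
# Sketch: a toy two-gene negative-feedback circuit as coupled ODEs, the kind
# of differential-equation description mentioned above. All rate constants
# are invented for illustration.
import numpy as np
from scipy.integrate import solve_ivp

def model(t, y, k1=1.0, k2=1.0, d1=0.5, d2=0.5, K=1.0, n=4):
    m, p = y                                  # mRNA and repressor protein
    dm = k1 / (1 + (p / K) ** n) - d1 * m     # transcription repressed by p
    dp = k2 * m - d2 * p                      # translation and decay
    return [dm, dp]

sol = solve_ivp(model, (0, 50), [0.0, 0.0], dense_output=True)
t = np.linspace(0, 50, 5)
print(sol.sol(t)[1])   # protein trajectory settling toward a steady state
```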
9. Research on Bioinformatics Technology and Methods
Bioinformatics is not merely an arrangement of biological knowledge, nor a simple application of knowledge from mathematics, physics, information science and other disciplines. Its massive data and complex background have driven the rapid growth of machine learning, statistical data analysis and system description techniques in a bioinformatics setting. Huge computations, complex noise patterns and massive time-varying data pose great difficulties for traditional statistical analysis and call for more flexible data analysis techniques, such as nonparametric statistics (BMC Bioinformatics, 2007, 339) and cluster analysis (Qual Life Res, 2007, 1655-63). The analysis of high-dimensional data requires feature-space compression techniques such as partial least squares (PLS). In developing computer algorithms, the time and space complexity of the algorithm must be fully considered, and technologies such as parallel computing and grid computing used to extend what is practically computable.
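To illustrate the feature-space compression mentioned above, here is a minimal partial least squares (PLS) sketch using scikit-learn; the data are random toys with many more features than samples, the typical microarray-like situation.

```python
# Sketch: partial least squares (PLS) as feature-space compression for
# high-dimensional data. The data here are random toys.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 500))          # 40 samples, 500 "gene" features
y = X[:, :5].sum(axis=1) + 0.1 * rng.normal(size=40)  # depends on 5 features

pls = PLSRegression(n_components=2).fit(X, y)
X_low = pls.transform(X)                # compressed 2-component scores
print(X_low.shape, round(pls.score(X, y), 3))
```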
10. Biological images
Why do people who are not related by blood look so alike? Appearance is made up of feature points; the more of these points coincide, the more alike two faces look. Why do the feature points of two unrelated people coincide? What is the biological basis? Are their genes similar? These questions remain open for experts to answer.
11. Others
Other areas, such as gene expression profile analysis, metabolic network analysis, gene chip design and proteomics data analysis, have gradually become important new research fields of bioinformatics. In terms of disciplines, bioinformatics has spawned structural genomics, functional genomics, comparative genomics, proteomics, pharmacogenomics, traditional Chinese medicine genomics, oncogenomics, molecular epidemiology and environmental genomics. From current developments it is not hard to see that genetic engineering has entered the post-genome era. The sections below briefly discuss machine learning and mathematics, which are closely bound up with bioinformatics, and the pitfalls they may involve.
Bioinformatics and machine learning
Large-scale biological information poses new problems and challenges for data mining and demands new ideas. Conventional computer algorithms can still be applied to biological data analysis, but they are increasingly unsuited to sequence analysis, because biological systems are inherently complex and there is as yet no complete theory of life's organization at the molecular level. Simon once defined learning as changes in a system that make it more effective the next time it performs the same task. The purpose of machine learning is to acquire the corresponding theory automatically from data; using methods such as inference, model fitting and learning from samples, it is especially suited to domains that lack a general theory, have noisy patterns and involve large-scale data sets. Machine learning has therefore become a feasible approach complementary to conventional methods, making it possible to extract useful knowledge from massive biological information and to discover knowledge by computer.

Multivariate data analysis plays an ever larger role, but processing today's large gene databases requires automatic computer recognition and annotation to avoid time-consuming and labor-intensive manual processing. The early scientific method of observation and hypothesis can no longer cope, relying on human perception alone, with the demands of high data volume, fast acquisition rates and objective analysis; the combination of bioinformatics and machine learning is therefore inevitable. The most basic theoretical framework of machine learning is probabilistic; in a sense it is a continuation of statistical model fitting, its purpose being to extract useful information. Machine learning is closely related to pattern recognition and statistical inference. Learning methods include data clustering, neural network classifiers and nonlinear regression; hidden Markov models are also widely used to predict the gene structure of DNA.

Current research focuses include: 1) observing and exploring interesting phenomena. The focus here is how to visualize and mine high-dimensional vector data, generally by reducing it to a low-dimensional space, as with principal component analysis (PCA), kernel principal component analysis (KPCA), independent component analysis and locally linear embedding. 2) Generating hypotheses and formal models to explain phenomena [6]. Most clustering methods can be regarded as fitting the vector data to a mixture of simple distributions; clustering has been used in microarray data analysis in bioinformatics and in cancer-type classification, and machine learning is also used to obtain explanations of phenomena from gene databases.

Machine learning accelerates the progress of bioinformatics, but it also brings problems of its own. Most machine learning methods assume that the data conform to a relatively fixed model, whereas real data structures, especially in bioinformatics, are usually variable; a general methodology is therefore needed that finds the internal structure of a data set without relying on assumptions about it. Second, machine learning methods often operate as "black boxes", as with neural networks and hidden Markov models, and the internal mechanism by which they arrive at a particular solution remains unclear.
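As a small example of the dimensionality-reduction step described above, the following sketch computes PCA from the SVD of a centered data matrix; the data are random toys standing in for expression vectors.

```python
# Sketch: visual exploration by reducing high-dimensional expression-like
# vectors to two principal components (PCA). Toy data only.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 50))          # 100 samples, 50 features
X = X - X.mean(axis=0)                  # center before PCA

# Principal components from the SVD of the centered data matrix.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
X2 = X @ Vt[:2].T                       # project onto first two components
explained = (S ** 2) / (S ** 2).sum()   # variance explained per component
print(X2.shape, explained[:2])
```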
Mathematical problems in bioinformatics
Mathematics occupies a large place in bioinformatics. Statistics, including multivariate statistics, is one of its mathematical foundations; probability theory and the theory of stochastic processes, for example the hidden Markov model (HMM), have important applications. Other examples include operations research in sequence alignment, optimization theory in protein spatial structure prediction and molecular docking, topology in the study of DNA supercoiling, and group theory in the study of the genetic code and the symmetry of DNA sequences. In short, all kinds of mathematical theories play their part in biological research to some degree, but not every mathematical method remains universally valid when imported into bioinformatics. Two examples follow, from statistics and from metric spaces.
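To make the HMM application concrete, here is a minimal Viterbi-decoding sketch for a toy two-state model (GC-rich "coding" vs "non-coding" regions); all probabilities are invented for illustration and are not from any real gene model.

```python
# Sketch: Viterbi decoding for a toy two-state HMM over a DNA sequence.
# All probabilities are invented for illustration.
import numpy as np

states = ["noncoding", "coding"]
start = np.log([0.5, 0.5])
trans = np.log([[0.9, 0.1],     # P(next state | current state)
                [0.1, 0.9]])
emit = {                        # P(base | state); "coding" is GC-rich here
    "noncoding": np.log([0.3, 0.2, 0.2, 0.3]),  # A C G T
    "coding":    np.log([0.2, 0.3, 0.3, 0.2]),
}
idx = {b: i for i, b in enumerate("ACGT")}

def viterbi(seq):
    V = start + np.array([emit[s][idx[seq[0]]] for s in states])
    back = []
    for base in seq[1:]:
        scores = V[:, None] + trans      # scores[i, j]: from state i to j
        back.append(scores.argmax(axis=0))
        V = scores.max(axis=0) + np.array([emit[s][idx[base]] for s in states])
    path = [int(V.argmax())]
    for bp in reversed(back):            # follow backpointers
        path.append(int(bp[path[-1]]))
    return [states[i] for i in reversed(path)]

print(viterbi("ATATGCGCGCATAT"))
```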
1. The paradox of statistics
The development of mathematics has always been accompanied by paradoxes. The most obvious paradox in evolutionary tree construction and clustering concerns the mean: there are distributions of data that conventional mean-based methods cannot separate into two classes, which shows that the mean captures few of the geometric properties of the data. When data follow such distributions, commonly used evolutionary tree and clustering algorithms (such as K-means) will often reach wrong conclusions. Statistical traps of this kind usually arise from a lack of general understanding of data structures.
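The mean-value trap described above can be demonstrated directly: two concentric ring-shaped classes share the same mean, so a mean-based method such as K-means separates them no better than chance. A sketch with synthetic data:

```python
# Sketch of the mean-value trap: two concentric ring-shaped classes share
# the same mean, so K-means cannot separate them. Synthetic data only.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
theta = rng.uniform(0, 2 * np.pi, 200)
inner = np.c_[np.cos(theta[:100]), np.sin(theta[:100])] * 1.0
outer = np.c_[np.cos(theta[100:]), np.sin(theta[100:])] * 5.0
X = np.vstack([inner, outer])
y = np.array([0] * 100 + [1] * 100)     # true ring labels

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Account for arbitrary cluster numbering when scoring agreement.
agree = max((labels == y).mean(), (labels != y).mean())
print(f"agreement with true rings: {agree:.2f}")   # near 0.5, i.e. chance
```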
2. The metric space assumption
In bioinformatics the concept of a metric must be introduced to build evolutionary trees and to cluster genes: genes a small distance apart are taken to have the same function, and the genes with the smallest score in a phylogenetic tree are taken to share the same parental line. The premise of such a metric space is that the metric holds in a global sense. To ask whether this premise is universal, consider the general description: for two vectors $A$ and $B$, under the assumption that the dimensions are linearly independent, their distance can be defined as

$$d(A,B) = \sqrt{\sum_i (a_i - b_i)^2}. \qquad (1)$$

Formula (1) gives a Euclidean metric space, invariant under the orthogonal group of motions, and is the description commonly used in most of bioinformatics, i.e. it assumes the variables are linearly independent. This assumption, however, cannot describe the metric correctly, especially on high-dimensional data sets, since it ignores the nonlinear correlations between the variables, which is clearly problematic. A more correct metric can therefore be given by

$$d^2 = g_{ij}\, dx^i\, dx^j, \qquad (2)$$

where the Einstein summation convention is used and the tensor $g_{ij}$ describes the metric relationship between the variables. Formula (2) reduces to (1) when

$$g_{ij} = \delta_{ij}, \qquad (3)$$

so (2) is the more general description. The remaining problem, and the subject of ongoing research, is how to describe the nonlinear correlations between variables accurately.
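A small numerical sketch of the contrast between formulas (1) and (2): taking $g_{ij}$ to be the inverse covariance of correlated data (a Mahalanobis-style metric) gives distances quite different from the Euclidean ones. The data and test points are invented for illustration.

```python
# Sketch contrasting metric (1) and (2) above: Euclidean distance assumes
# g_ij = delta_ij, while a Mahalanobis-style distance uses a non-diagonal
# g_ij (here the inverse covariance) to account for correlated variables.
import numpy as np

rng = np.random.default_rng(3)
# Strongly correlated two-dimensional data.
X = rng.multivariate_normal([0, 0], [[1.0, 0.9], [0.9, 1.0]], size=500)

a, b = np.array([1.0, 1.0]), np.array([1.0, -1.0])
g = np.linalg.inv(np.cov(X.T))          # metric tensor g_ij, as in eq. (2)

d_euclid = np.sqrt((a - b) @ (a - b))          # eq. (1): g is the identity
d_general = np.sqrt((a - b) @ g @ (a - b))     # eq. (2): g_ij dx^i dx^j
print(round(d_euclid, 3), round(d_general, 3))  # (2) far exceeds (1) here
```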
Difficulties in the application of statistical learning theory in bioinformatics
The data sets and databases of bioinformatics are very large, yet it is generally hard to give a clear definition of the objective function; the difficulty can be described as a contradiction between the huge scale of the problem and its ill-posed definition. In general, introducing a regularization term to improve performance is unavoidable [7]. The following briefly introduces statistical learning theory, which is based on this idea, along with Kolmogorov complexity [8] and the BIC (Bayesian information criterion) [9], and their open problems.

The support vector machine (SVM) is a recently popular method whose background is Vapnik's statistical learning theory; it classifies by maximizing the margin between two data sets. For nonlinear problems, a kernel function maps the data into a high-dimensional space, with no need to describe the data explicitly there. Compared with neural networks, the method's advantage is that the choice of a network's hidden-layer parameters is reduced to the choice of a kernel function, so it has attracted wide attention, in bioinformatics as elsewhere. But choosing the kernel function is itself a very hard problem; from this point of view the optimal kernel may be just an ideal, and SVMs may prove to be, like neural networks, just another big bubble in machine learning research.

The ideas of Kolmogorov complexity and of statistical learning theory describe the essence of learning from different angles, the former from the viewpoint of coding, the latter through uniform convergence on finite samples. Kolmogorov complexity itself is uncomputable; from it the MDL (minimum description length) principle was derived, originally applicable only to discrete data and recently extended to continuous data sets, which seeks the minimum description of the model parameters from a coding perspective. Its drawback is the high complexity of the modeling, which makes it hard to apply to large data sets. The BIC criterion penalizes models of high complexity heavily and simple models lightly, implicitly embodying Occam's razor, and has been widely used in bioinformatics in recent years; its main limitations are its sensitivity to the parametric model assumption and to the choice of prior, and its slow processing of large data. There is thus still much room for exploration in this field.
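A short sketch of the kernel-choice issue discussed above: on data that are not linearly separable, an SVM's performance hinges entirely on the kernel the user chooses. The synthetic data and the gamma value are illustrative assumptions.

```python
# Sketch: the same SVM with two different kernels on non-linearly-separable
# data; the kernel choice dominates the result. Synthetic data only.
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf", gamma=1.0).fit(X, y)   # gamma chosen by hand
print("linear kernel accuracy:", round(linear.score(X, y), 2))  # poor
print("rbf kernel accuracy:   ", round(rbf.score(X, y), 2))     # near 1.0
```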