Introduction to bioinformatics
Table of Contents
1 Pinyin
2 English Reference
3 Current Main Research Contents of Bioinformatics
3.1 Obtaining complete genomes of human beings and various organisms
3.2 Discovering new genes and new single nucleotide polymorphisms
3.3 Structure and function of non-protein-coding regions in the genome
3.4 Studying biological evolution at genome level
3.5 Comparative study of complete genomes
3.6 From functional genomics to systems biology
3.7 Protein structure simulation and drug design
3.8 Research on the application and development of bioinformatics

1 Pinyin

shēng wù xìn xī xué

2 English Reference

Bioinformatics

Bioinformatics is a new interdisciplinary subject. Because it involves both biology and physics, many people assume it must be a very broad field. In fact, its content is specific and its scope is well defined. Bioinformatics arose alongside genome research, so its research content has developed in close step with genome research.

Broadly speaking, bioinformatics covers the acquisition, processing, storage, distribution, analysis and interpretation of biological information related to genome research. This definition has two sides: one is the collection, curation and serving of massive data, that is, managing the data; the other is discovering new rules in the data, that is, making good use of it.

Specifically, bioinformatics starts from the analysis of genomic DNA sequence information and looks for the coding regions that represent protein and RNA genes in the genome sequence. At the same time, it seeks to clarify the informational nature of the large non-coding regions of the genome and to decipher the rules of the genetic language hidden in the DNA sequence. On this basis, it summarizes and organizes the transcription profile and protein profile data related to the release and regulation of genomic genetic information, so as to understand the laws of metabolism, development, differentiation and evolution.

Bioinformatics also uses the coding-region information in the genome to model the spatial structure of proteins and predict their function. It combines this with physiological and biochemical information about organisms and life processes to clarify the underlying molecular mechanisms, and ultimately supports the molecular design of proteins and nucleic acids, drug design and individualized medical care.

Genome informatics, protein structure calculation and simulation, and drug design are all closely tied to the central dogma of genetic information transfer, so they must be treated as an organically connected whole.

Why does genome research need to rely on bioinformatics? First of all, genome research has caused an explosion of related information, and handling this mass of biological information has become urgent. Since scientists deciphered the 1.8-million-nucleotide genome of Haemophilus influenzae in 1995, the complete genomes of about 60 microorganisms and several eukaryotes, such as yeast, nematodes, fruit flies and Arabidopsis thaliana, have been sequenced. By the spring of 2001, scientists had published most of the sequence of the human genome, that is, the working draft of the human genome. These achievements mean that genome research is entering a new stage of information extraction and data analysis. According to international database statistics, the number of DNA bases was 3 billion in December 1999 and 6 billion in April 2000, and it has now reached 14 billion, doubling about every 14 months. Over the same period, the processing power of computer chips has doubled roughly every 18 months. Computers are therefore able to manage and process this mass of data effectively.

However, the more fundamental reason is the complexity of genome data. The genome of an organism is the sum of all its genetic material. That genetic material is a biological macromolecule called deoxyribonucleic acid (DNA), which is composed of four kinds of nucleotides connected in series, usually represented by the characters A, T, G and C. Generally speaking, the genetic code of an organism is a long linear chain of these four characters. This chain is often very long. For example, the human genetic code contains 3.2 billion characters. Printed out, they would fill an undecipherable "heavenly book" of more than 1 million pages with 3,000 characters per page. This book contains a great deal of information about the structure and function of the human body and the processes of life, but it is written with only four characters and has no morphology, syntax or punctuation. Every page looks much like every other, and how to read it is a great problem. Genome research ultimately transforms biological problems into the processing of digital symbols. To solve this problem, we must develop new analytical theories, methods, techniques and tools, and we must rely on computer information processing.

Bioinformatics research rests on several foundations. First, it needs sufficient computing power, including the corresponding software and hardware. It needs its own databases or effective links to international and domestic database systems, and it needs a well-developed and stable network. At the same time, bioinformatics needs powerful and innovative algorithms and software. Without algorithmic innovation, bioinformatics cannot develop sustainably. Finally, it should establish broad and close contact with experimental science, especially with automated, large-scale, high-throughput biological research methods and platform technologies. These technologies are not only the main source of bioinformatics data but also the key means of verifying the results of bioinformatics research. People engaged in bioinformatics research must therefore also have interdisciplinary knowledge.

China already has a certain foundation in bioinformatics research and application, so breakthrough results can be expected. This is very important for strengthening China's position in basic research and for reaching an internationally leading position in some areas. Applying the results of bioinformatics will also produce great social and economic benefits.

3 Current Main Research Contents of Bioinformatics

3.1 Obtaining complete genomes of human beings and various organisms

The primary goal of genome research is to obtain the complete human genetic code. The human genetic code has 3.2 billion bases, but current DNA sequencers can read only hundreds to thousands of bases per reaction. In other words, to obtain the whole human genetic code, the genome must first be broken into fragments, the short sequences must be read, and the fragments must then be reassembled.
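A rough worked calculation makes the scale of this task concrete. The sketch below assumes an average read length of 500 bases, which is only an illustrative value within the "hundreds to thousands" range and not a figure from the text; `reads_needed` is a hypothetical helper name.

```python
import math


def reads_needed(genome_length: int, read_length: int, coverage: float) -> int:
    """Number of reads whose combined length equals coverage x genome_length."""
    return math.ceil(genome_length * coverage / read_length)


GENOME_LENGTH = 3_200_000_000  # bases in the human genome, as stated in the text
READ_LENGTH = 500              # ASSUMPTION: illustrative average read length

# Reads needed to cover the genome once, and five times over.
for coverage in (1, 5):
    print(coverage, reads_needed(GENOME_LENGTH, READ_LENGTH, coverage))
```

Under these assumptions a single pass already needs millions of reads, and reading the genome several times over needs tens of millions.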

However, it is easy to see that if a book is torn into pieces of equal size, it can never be put back together correctly, because the context of each piece is lost. What can be done? We can take two identical copies of the book and tear each one differently. By cross-referencing the fragments and finding the words they share, the context can be partially restored. The more copies are torn, the more context is restored. Therefore, to obtain a complete human genetic code, it is not enough to read the 3.2 billion bases once; they usually have to be read many times. For example, the draft human genome published in Nature and Science in early 2001 was reported to contain about 2.9 billion bases, with 96% physical map coverage and 94% sequence coverage. More than 90% of the contigs (contiguous sequence groups) are longer than 100 thousand bases, and about 25% are 10 million bases or longer. Between 30,000 and 40,000 protein-coding genes were found in these sequences. Obtaining such a map is equivalent to reading the human genome about five times. To do this, tens of millions of small fragments must be joined by comparing them with one another, a process usually called the assembly of genome sequence data.
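The torn-book idea can be sketched as a toy greedy overlap assembler. This is a minimal illustration under strong assumptions (error-free fragments, one strand, no repeats) and not the strategy of any particular sequencing center; the function names and the example fragments are invented for the illustration.

```python
def overlap(a: str, b: str, min_len: int) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of fragments with the longest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while True:
        best = (0, -1, -1)  # (overlap length, index of left, index of right)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:  # nothing left to merge
            return frags
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]


# Two copies of the same made-up sequence, "torn" at different places.
copy_1 = ["ATGGCGT", "ACGTTAGC"]
copy_2 = ["ATGG", "CGTACGT", "TAGC"]
print(greedy_assemble(copy_1 + copy_2))
```

Neither copy alone can be reassembled, because its fragments do not overlap, but together the two sets of fragments restore the full sequence. On real genomes, repeated sequences can mislead a greedy merge of this kind.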

Every step of large-scale genome sequencing is closely tied to information analysis. From optical density sampling and analysis in the sequencer, base reading, and vector identification and removal, through sequence assembly and gap filling, to repeat identification, reading-frame prediction and gene annotation, every step depends on bioinformatics software and databases. Among these, sequence assembly and gap filling are the most critical problems. The difficulty comes not only from the sheer volume of data but also from the highly repetitive sequences in the genome. It is therefore particularly necessary to link experimental design with information analysis in this process. In addition, appropriate algorithms and software must be developed for the requirements of each step to deal with the various complications. Many well-known genome research centers have their own assembly strategies, and this work is done on supercomputers.

With a complete genome, human beings can understand themselves in more detail and more accurately. For example, only 1.1% of our genome actually encodes protein (the exons); the regions between exons (the introns) account for 24%; and the intergenic sequences account for 75%. In other words, regions that do not encode protein make up the vast majority of the human genome. It has been found that human protein-coding genes are more complex than those of other organisms and are spliced in more varied ways. Segmental duplication is very common in the genome, which reflects the complex evolutionary history of human beings. It has also been found that human chromosome 13 is relatively stable, while male chromosome 12 and female chromosome 16 are variable, and so on.

3.2 Discovering new genes and new single nucleotide polymorphisms

The discovery of new genes is a hot topic in international genome research, and bioinformatics is an important means of discovering them. For example, the complete genome of Saccharomyces cerevisiae contains about 6,000 genes, about 60% of which were obtained through information analysis.

(1) Computer cloning of genes

Using EST databases to discover new genes is also called computer cloning of genes. EST sequences are short cDNA sequences from expressed genes, and each carries information about some fragment of a complete gene. By October 2001, the EST database of GenBank held more than 3.8 million human EST sequences, covering more than 90% of human genes.

As early as 1996, China began to search for new genes by computer cloning. The principle is very simple: find all EST fragments that belong to the same gene and then join them. Because EST sequences are generated at random in many laboratories all over the world, the ESTs belonging to one gene necessarily share a large number of short repeated segments. Using these shared segments as markers, different ESTs can be joined until the full length is recovered, and we can then say that a gene has been found by computer cloning. If the gene has not been found before, it is a new gene. However, designing a computer cloning program is complicated, and the amount of computation is huge.
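The first half of this principle, collecting the ESTs that belong to the same gene, can be sketched by grouping sequences that share a short exact substring (a k-mer). This is a minimal illustration with invented sequences and an arbitrary k; real EST clustering must also handle sequencing errors, both strands and repeats, which this sketch ignores. All names here are hypothetical.

```python
from collections import defaultdict


def kmers(seq: str, k: int) -> set[str]:
    """All substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests: dict[str, str], k: int = 8) -> list[set[str]]:
    """Group EST ids that share at least one k-mer, directly or via other ESTs."""
    parent = {name: name for name in ests}

    def find(x: str) -> str:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    first_seen: dict[str, str] = {}  # k-mer -> first EST id containing it
    for name, seq in ests.items():
        for km in kmers(seq, k):
            if km in first_seen:
                parent[find(name)] = find(first_seen[km])
            else:
                first_seen[km] = name

    clusters: dict[str, set[str]] = defaultdict(set)
    for name in ests:
        clusters[find(name)].add(name)
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Made-up ESTs: est1/est2 overlap each other, as do est3/est4.
ESTS = {
    "est1": "ATGGCGTACGTTAG",
    "est2": "TACGTTAGCCTGAAGTCA",
    "est3": "GGATCCTTAACCGG",
    "est4": "CTTAACCGGTTAACG",
}
print(cluster_ests(ESTS))
```

Each resulting cluster is a candidate gene whose members could then be joined by overlap into a longer sequence.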

(2) Predicting new genes from genomic DNA sequences

Predicting new genes from genomic sequences essentially means distinguishing the regions that encode protein from the regions that do not. Theoretical methods look for mathematical and physical characteristics that differ between coding and non-coding regions. By comparing the candidate sequences with databases of known genes, new genes can then be found.
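One of the simplest signals of this kind is an open reading frame (ORF): a stretch that starts with ATG and runs in frame to a stop codon. The sketch below scans the three forward reading frames only. It is a deliberately simplified stand-in for the statistical methods the text refers to, it ignores introns and the reverse strand, and the function name, threshold and example sequence are invented for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 3) -> list[tuple[int, int, str]]:
    """Return (start, end, orf) for ATG...stop stretches on the forward strand.

    start/end are 0-based, end-exclusive; min_codons counts codons before the stop.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":
                # Walk codon by codon until the first in-frame stop codon.
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOP_CODONS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append((i, j + 3, seq[i:j + 3]))
                        i = j  # resume scanning after this stop codon
                        break
            i += 3
    return orfs


# Made-up sequence with one ORF: ATG GCT GAA ACG TAA
print(find_orfs("CCATGGCTGAAACGTAAGG"))
```

Candidate ORFs found this way would still need to be checked against known genes or other evidence before being called genes.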

The discovery of new genes will deepen our understanding of life. According to the journal Nature of December 2, 1999, 679 genes had been identified from the data for human chromosome 22, 55% of which were previously unknown. Thirty-five diseases are associated with mutations on this chromosome, including immune system diseases, congenital heart diseases and schizophrenia. However, integrating all human genes, their corresponding proteins and their functions into one complete and correct index remains a very important and arduous task. The International Human Genome Collaboration Group is working to establish a complete "integrated gene index" and a related "integrated protein index".

(3) Discovering single nucleotide polymorphisms (SNPs)

Some people smoke and drink yet live long lives, while others suffer from illness from childhood. The same anti-tumor drug is very effective for some people but completely ineffective for others. Why? The answer lies in the differences between their genomes. Many of these differences are single-base variations, that is, single nucleotide polymorphisms (SNPs).
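In the simplest case, a single-base variation can be found by comparing two sequences position by position. The sketch below assumes the two sequences are already aligned and of equal length, with no insertions or deletions. The sequences and the function name are invented, and a difference between two individual sequences is only a candidate SNP until it is confirmed across a population.

```python
def find_snps(reference: str, sample: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, reference base, sample base) for each mismatch."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be pre-aligned to the same length")
    return [
        (pos, ref_base, alt_base)
        for pos, (ref_base, alt_base) in enumerate(zip(reference, sample), start=1)
        if ref_base != alt_base
    ]


# Made-up sequences differing at positions 6 and 13.
reference = "ATGGCGTACGTTAGC"
sample = "ATGGCATACGTTGGC"
print(find_snps(reference, sample))
```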

It is generally believed that the study of SNPs is an important step towards applying the results of the Human Genome Project. This is mainly because SNPs provide a powerful tool for identifying high-risk groups, identifying disease-related genes, designing and testing drugs, and basic biological research. SNPs are widely distributed in the genome, and recent research shows that one appears about every 300 base pairs in the human genome. The large number of SNP loci gives researchers the opportunity to find genomic mutations related to various diseases, including tumors. Experimentally, it is easier to find disease-related gene mutations through SNPs than through family studies. Some SNPs do not directly cause disease, but because they lie next to disease genes they become important markers. SNPs have also played a major role in basic research. In recent years, the analysis of Y-chromosome SNPs has produced a series of important results on human evolution and on the evolution and migration of human populations.

3.3 Structure and function of non-protein-coding regions in the genome

Recent studies have shown that in microorganisms such as bacteria, the regions that do not encode protein account for only 10% to 20% of the whole genome sequence. As organisms evolved, non-coding regions became more and more extensive, and in higher organisms and humans non-coding sequences make up the vast majority of the genome. This suggests that these non-coding sequences must have important biological functions. It is generally accepted that they are related to the regulation of gene expression.

For the human genome, only the protein-coding regions (genes) of the DNA have been truly understood so far, and the latest data show that this part of the sequence accounts for only 1.1% of the genome. Research on these coding regions, which make up only 1.1% of the human genome, has produced dozens of Nobel Prize winners, so the number of results to be found in the 98% that is non-coding should be considerable. Finding the coding characteristics, information regulation and expression rules of these regions will therefore be a hot topic and a source of important results for a long time to come.

3.4 Studying biological evolution at genome level

In recent years, with the massive increase in genome sequence data, the debate on the relationship between sequence differences and evolution has become increasingly fierce. First, it has been found that phylogenetic trees reconstructed for the same group of organisms from different molecular sequences may differ. At the same time, the relationship between "vertical evolution" and "horizontal evolution" is attracting growing attention, following the discovery in recent years of the lateral transfer of genes. That is, genes can move between coexisting populations, and the result may be sequence differences that have nothing to do with evolution. The analysis of the human genome even found dozens of human genes that are related only to bacterial genes.