"There is a genuine need for this concise, informative, clearly written book. In systems biology, engineers, mathematicians, and computer scientists are collaborating increasingly with biologists and researchers in medicine. This book goes a long way toward narrowing the gap on this front, and it lays a rigorous foundation for a new discipline."—Olli Yli-Harja, Tampere University of Technology
"Overall, this book should be useful for individuals with a background in computer science and machine learning who wish to see the applications of mathematics to genomics."—Leon Glass, SIAM Review
No single agreed-upon definition seems to exist for the term bioinformatics, which has been used to mean a variety of things ranging in scope and focus. To cite but a few examples from textbooks, Lodish et al. (2000) state that "bioinformatics is the rapidly developing area of computer science devoted to collecting, organizing, and analyzing DNA and protein sequences." A more general and encompassing definition, given by Brown (2002), is that bioinformatics is "the use of computer methods in studies of genomes." More general still: "bioinformatics is the science of refining biological information into biological knowledge using computers" (Draghici, 2003). Kohane et al. (2003) observe that the "breadth of this commonly used definition of bioinformatics risks relegating it to the dustbin of labels too general to be useful" and advocate being more specific about the particular bioinformatics techniques employed.
While it is true that the field of bioinformatics has traditionally dealt primarily with biological data encoded in digital symbol sequences, such as nucleotide and amino acid sequences, in this book we will be mainly concerned with extracting information from gene expression measurements and genomic signals. By the latter we mean any measurable events, principally the production of messenger ribonucleic acid (RNA) and protein, that are carried out by the genome. The analysis, processing, and use of genomic signals for gaining biological knowledge and translating this knowledge into systems-based applications is called genomic signal processing.
In this chapter, our aim is to place this material into a proper biological context by providing the necessary background for some of the key concepts that we shall use. We cannot hope to comprehensively cover the topics of modern genetics, genomics, cell biology, and others, so we will confine ourselves to brief overviews of some of these topics. We particularly recommend the book by Alberts et al. (2002) for a more comprehensive coverage of these topics.
1.1 Genetics
Broadly speaking, genetics is the study of genes. The latter can be studied from different perspectives and on a molecular, cellular, population, or evolutionary level. A gene is composed of deoxyribonucleic acid (DNA), which is a double helix consisting of two intertwined and complementary nucleotide chains. The entire set of DNA is the genome of the organism. The DNA molecules in the genome are assembled into chromosomes, and genes are the functional regions of DNA.
Each gene encodes information about the structure and functionality of some protein produced in the cell. Proteins in turn are the machinery of the cell and the major determinants of its properties. Proteins can carry out a number of tasks, such as catalyzing reactions, transporting oxygen, regulating the production of other proteins, and many others. The way proteins are encoded by genes involves two major steps: transcription and translation. Transcription refers to the process of copying the information encoded in the DNA into a molecule called messenger RNA (mRNA). Many copies of the same RNA can be produced from only a single copy of DNA, which ultimately allows the cell to make large amounts of proteins. This occurs by means of the process referred to as translation, which converts mRNA into chains of linked amino acids called polypeptides. Polypeptides can combine with other polypeptides or act on their own to form the actual proteins. The flow of information from DNA to RNA to protein is known as the central dogma of molecular biology. Although it is mostly correct, there are a number of modifications that need to be made. These include the processes of reverse transcription, RNA editing, and RNA replication.
Briefly, reverse transcription refers to the conversion of a single-stranded RNA molecule to a double-stranded DNA molecule with the help of an enzyme aptly called reverse transcriptase. For example, the human immunodeficiency virus (HIV) has an RNA genome that is converted to DNA and inserted into the genome of the host. RNA editing refers to the alteration of RNA after it has been transcribed from DNA. As a result, the ultimate protein product that results from the edited RNA molecule does not correspond to what was originally encoded in the DNA. Finally, RNA replication is a process whereby RNA can be copied into RNA without the use of DNA. Several viruses, such as hepatitis C virus, employ this mechanism. We will now discuss some preliminary concepts in more detail.
1.1.1 Nucleic Acid Structure
Almost every cell in an organism contains the same DNA content. Every time a cell divides, this material is faithfully replicated. The information stored in the DNA is used to code for the expressed proteins by means of transcription and translation. The DNA molecule is a polymer that is strung together from monomers called deoxyribonucleotides, or simply nucleotides, each of which consists of three chemical components: a sugar (deoxyribose), a phosphate group, and a nitrogenous base. There are four possible bases: adenine, guanine, cytosine, and thymine, often abbreviated as A, G, C, and T, respectively. Adenine and guanine are purines and have bicyclic structures (two fused rings), whereas cytosine and thymine are pyrimidines, and have monocyclic structures. The sugar has five carbon atoms that are typically numbered from 1' to 5'. The phosphate group is attached to the 5'-carbon atom, whereas the base is attached to the 1' carbon. The 3' carbon also has a hydroxyl group (OH) attached to it.
Figure 1.1 illustrates the structure of a nucleotide with a thymine base. Although this figure shows one phosphate group, up to three phosphates can be attached. For example, adenosine 5'-triphosphate (ATP), which has three phosphates, is the molecule responsible for supplying energy for many biochemical cellular processes.
Ribonucleic acid is a polymer that is quite close in structure to DNA. One of the differences is that in RNA the sugar is ribose rather than deoxyribose. While the latter has a hydrogen at the 2' position (figure 1.1), ribose has a hydroxyl group at this position. Another difference is that the thymine base is replaced by the structurally similar uracil (U) base in a ribonucleotide.
The deoxyribonucleotides in DNA and the ribonucleotides in RNA are joined by the covalent linkage of a phosphate group, where one bond is between the phosphate and the 5' carbon of one sugar and the other bond is between the phosphate and the 3' carbon of the neighboring sugar (deoxyribose in DNA, ribose in RNA). This type of linkage is called a phosphodiester bond. The arrangement just described gives the molecule a 5'->3' polarity or directionality. Because of this, it is a convention to write the sequences of nucleotides starting with the 5' end at the left, for example, 5'-ATCGGCTC-3'. Figure 1.2 is a simplified diagram of the phosphodiester bonds and the covalent structure of a DNA strand.
DNA commonly occurs in nature as two strands of nucleotides twisted around in a double helix, with the repeating phosphate-deoxyribose sugar polymer serving as the backbone. This backbone is on the outside of the helix, and the bases are located in the center. The opposite strands are joined by hydrogen bonding between the bases, forming base pairs. The two backbones are in opposite or antiparallel orientations. Thus, one strand is oriented 5'->3' and the other is 3'->5'. Each base can interact with only one other type of base. Specifically, A always pairs with T (an A · T base pair), and G always pairs with C (a G · C base pair). The bases in the base pairs are said to be complementary. The A · T base pair has two hydrogen bonds, whereas the G · C base pair has three hydrogen bonds. These bonds are responsible for holding the two opposite strands together. Thus, if a DNA molecule contains many G · C base pairs, it is more stable than one containing many A · T base pairs. This also implies that DNA that is high in G · C content requires a higher temperature to separate, or denature, the two strands. Although the individual hydrogen bonds are rather weak, because the overall number of these bonds is quite high, the two strands are held together quite well.
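The pairing rules above lend themselves to a short illustration. The following Python sketch is not from the book; the function names are hypothetical, and it assumes an unambiguous sequence written 5'->3' using only A, C, G, and T. It derives the complementary strand, the G · C fraction, and the number of hydrogen bonds in the duplex (two per A · T pair, three per G · C pair).

```python
# Watson-Crick pairing as described in the text: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Hydrogen bonds contributed by the base pair that each base takes part in.
HYDROGEN_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def reverse_complement(strand: str) -> str:
    """Return the opposite (antiparallel) strand, also written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(strand.upper()))


def gc_content(strand: str) -> float:
    """Fraction of positions that are G or C (0.0 for an empty string)."""
    s = strand.upper()
    if not s:
        return 0.0
    return (s.count("G") + s.count("C")) / len(s)


def duplex_hydrogen_bonds(strand: str) -> int:
    """Total hydrogen bonds holding the strand to its complement."""
    return sum(HYDROGEN_BONDS[base] for base in strand.upper())


if __name__ == "__main__":
    seq = "ATCGGCTC"  # the example sequence used in the text
    print(reverse_complement(seq))
    print(gc_content(seq))
    print(duplex_hydrogen_bonds(seq))
```

A higher G · C fraction means more hydrogen bonds per base pair, which is the reason given above for the higher denaturation temperature of G · C-rich DNA. The sketch only counts bonds; it does not estimate a melting temperature.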
Although in this book we will focus on gene expression, which involves transcription, it is important to say a few words about how the DNA molecule duplicates. Because the two strands in the DNA double helix are complementary, they carry the same information. During replication, the strands separate and each one acts as a template for directing the synthesis of a new complementary strand. The two new double-stranded molecules are passed on to daughter cells during cell division. The DNA replication phase in the cell cycle is called the S (synthesis) phase. After the strands separate, the single bases (on each side) become exposed. Thus, they are free to form base pairs with other free (complementary) nucleotides. The enzyme that is responsible for building new strands is called DNA polymerase.
1.1.2 Genes
Genes represent the functional regions of DNA in that they can be transcribed to produce RNA. A gene contains a regulatory region at its upstream (5') end to which various proteins can bind and cause the gene to initiate transcription in the adjacent RNA-encoding region. This essentially allows the gene to receive and then respond to other signals from within or outside the genome. At the other (3') end of the gene, there is another region that signals termination of transcription.
In eukaryotes (cells that have a nucleus), many genes contain introns, which are segments of DNA that have no information for, or do not code for, any gene products (proteins). Introns are transcribed along with the coding regions, which are called exons, but are then cut out from the transcript. The exons are then spliced together to form the functional messenger RNA that leaves the nucleus to direct protein synthesis in the cytoplasm. Prokaryotes (cells without a nucleus) do not have an exon/intron structure, and their coding region is contiguous. These concepts are illustrated in figure 1.3.
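As a rough illustration of removing introns and joining exons, the sketch below treats a transcript as a string and the introns as known index ranges. The function name, the coordinate convention (0-based, end-exclusive), and the toy sequence are all assumptions made for this example; real splicing is directed by sequence signals rather than by supplied coordinates.

```python
from typing import Sequence, Tuple


def splice(pre_mrna: str, introns: Sequence[Tuple[int, int]]) -> str:
    """Cut out introns given as (start, end) ranges and join the exons.

    Ranges are 0-based and end-exclusive, and must not overlap.
    """
    exons = []
    position = 0
    for start, end in sorted(introns):
        if start < position or start >= end or end > len(pre_mrna):
            raise ValueError(f"invalid or overlapping intron range: {(start, end)}")
        exons.append(pre_mrna[position:start])  # exon preceding this intron
        position = end                          # skip over the intron
    exons.append(pre_mrna[position:])           # final exon
    return "".join(exons)


if __name__ == "__main__":
    exon1, intron, exon2 = "AUGGCC", "GUAAGUCCUAACUUUUCAG", "GAAUAA"
    transcript = exon1 + intron + exon2
    print(splice(transcript, [(len(exon1), len(exon1) + len(intron))]))
```

With an empty list of introns the function returns the transcript unchanged, which corresponds to the contiguous coding region described for prokaryotes.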
The parts of DNA that do not correspond to genes are of mostly unknown function. The amount of this type of intergenic DNA present depends on the organism. For example, mammals can contain enormous regions of intergenic DNA.
1.1.3 RNA
Before we go on to discuss the process of transcription, which is the synthesis of RNA, let us say a few words about RNA and its roles in the cell. As discussed above, most RNAs are used as an intermediary in producing proteins via the process of translation. However, some RNAs are also useful in their own right in that they can carry out a number of functions. As mentioned earlier, the RNA that is used to make proteins is called messenger RNA. The other RNAs that perform various functions are never translated; however, these RNAs are still encoded by some genes.
One such type of RNA is transfer RNA (tRNA), which transports amino acids to mRNA during translation. tRNAs are quite general in that they can transport amino acids to the mRNA corresponding to any gene. Another type of RNA is ribosomal RNA (rRNA), which along with different proteins comprises ribosomes. Ribosomes coordinate assembly of the amino acid chain in a protein. rRNAs are also general-purpose molecules and can be used to translate the mRNA of any gene. There are also a number of other types of RNA involved in splicing (snRNAs), protein trafficking (scRNAs), and other functions. We now turn to the topic of transcription.
1.1.4 Transcription
Transcription, which is the synthesis of RNA on a DNA template, is the first step in the process of gene expression, leading to the synthesis of a protein. Similarly to DNA replication, transcription relies on complementary base pairing. Transcription is catalyzed by an RNA polymerase, and RNA synthesis always occurs from the 5' to the 3' end of an RNA molecule. First, the two DNA strands separate, with one of the strands acting as a template for synthesizing RNA. Which of the two strands is used as a template depends on the gene. After separation of the DNA strands, available ribonucleotides are attached to their complementary bases on the DNA template. Recall that in RNA uracil is used in place of thymine in complementary base pairing. The RNA strand is thus a direct copy (with U instead of T) of one of the DNA strands, and that DNA strand is referred to as the sense strand. The other DNA strand, the one that is used as a template, is called the antisense strand. This is illustrated in figure 1.4.
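The relationship between the template (antisense) strand, the sense strand, and the RNA product can be checked with a few lines of Python. This is a minimal sketch with hypothetical function names, assuming both DNA strands are written 5'->3' and contain only A, C, G, and T.

```python
# RNA base that pairs with each DNA template base (U is used in place of T).
TEMPLATE_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe_from_template(template_5to3: str) -> str:
    """Build the RNA (5'->3') by pairing against the antisense strand.

    The template is read 3'->5', so we walk its 5'->3' string in reverse.
    """
    return "".join(TEMPLATE_TO_RNA[base] for base in reversed(template_5to3.upper()))


def rna_from_sense(sense_5to3: str) -> str:
    """The RNA has the same sequence as the sense strand, with U instead of T."""
    return sense_5to3.upper().replace("T", "U")


if __name__ == "__main__":
    sense = "ATCGGCTC"     # sense strand, 5'->3'
    template = "GAGCCGAT"  # its reverse complement: the antisense strand, 5'->3'
    print(transcribe_from_template(template))
    print(rna_from_sense(sense))
```

Both calls yield the same RNA string, which is the sense in which the transcript is a "direct copy" of the sense strand even though it is synthesized on the antisense strand.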
Transcription is initiated when RNA polymerase binds to the double-stranded DNA. The actual site at which RNA polymerase binds is called a promoter, which is a sequence of DNA at the start of a gene. Since RNA is synthesized in the 5'->3' direction (figure 1.5), genes are also viewed in the same orientation by convention. Therefore, the promoter is always located upstream (5' side) of the coding region. Certain sequence elements of promoters are conserved between genes. RNA polymerase binds to these common parts of the sequences in order to initiate transcription of the gene. In most prokaryotes, the same RNA polymerase is able to transcribe all types of RNAs, whereas in eukaryotes, several different types of RNA polymerases are used depending on what kind of RNA is produced (mRNA, rRNA, tRNA).
In order to allow the DNA antisense strand to be used for base pairing, the DNA helix must first be locally unwound. This unwinding process starts at the promoter site to which RNA polymerase binds. The location at which the RNA strand begins to be synthesized is called the initiation site and is defined as position +1 in the gene. As shown in figure 1.5, RNA polymerase adds ribonucleotides to the 3' end of the RNA. After this, the helix is re-formed once again.
Since the transcript must eventually terminate, how does the RNA polymerase know when to stop synthesizing RNA? This is accomplished by the recognition of certain specific DNA sequences, called terminators, that signal the termination of transcription, causing the RNA polymerase to be released from the template and ending the RNA synthesis. Although there are several mechanisms for termination, a common direct mechanism in prokaryotes is a terminator sequence arranged in such a way that it contains self-complementary regions that can form stem-loop or hairpin structures in the RNA product. Such a structure, shown in figure 1.6, can cause the polymerase to pause, thereby terminating transcription. It is interesting to note that the hairpin structure is often GC-rich, making the self-complementary base pairing stronger because of the higher stability of G · C base pairs relative to A · U base pairs. Moreover, there are usually several U bases at the end of the hairpin structure, which, because of the relatively weaker A · U base pairs, facilitates dissociation of the RNA.
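The idea of a self-complementary region can be made concrete with a deliberately naive scan. The sketch below is an illustration only, not a terminator predictor: the function names and the default stem and loop lengths are assumptions, and it looks only for perfectly paired stems separated by a short loop.

```python
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def rna_reverse_complement(seq: str) -> str:
    """Sequence that would pair with `seq` in antiparallel orientation."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(seq))


def find_hairpins(rna: str, stem_len: int = 5, min_loop: int = 3, max_loop: int = 8):
    """Return (stem_start, loop_length) for every perfect stem-loop candidate.

    A candidate is a window of `stem_len` bases whose reverse complement
    appears again after a loop of min_loop..max_loop unpaired bases.
    """
    rna = rna.upper()
    hits = []
    for i in range(len(rna) - 2 * stem_len - min_loop + 1):
        target = rna_reverse_complement(rna[i:i + stem_len])
        for loop in range(min_loop, max_loop + 1):
            j = i + stem_len + loop  # start of the would-be 3' stem arm
            if j + stem_len > len(rna):
                break
            if rna[j:j + stem_len] == target:
                hits.append((i, loop))
    return hits


if __name__ == "__main__":
    # GC-rich stem, a 4-base loop, then a run of U bases, as described in the text.
    transcript = "GCCGC" + "UUCG" + "GCGGC" + "UUUUUU"
    print(find_hairpins(transcript))
```

The toy transcript mirrors the features noted above: a GC-rich stem followed by several U bases. Real terminators tolerate mismatches and vary in stem and loop length, which this scan does not model.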
In eukaryotes, the transcriptional machinery is somewhat more complicated than in prokaryotes since the primary RNA transcript (pre-mRNA) must first be processed before being transported out of the nucleus (recall that prokaryotes have no nucleus). This processing first involves capping: the addition of a 7-methylguanosine molecule to the 5' end of the transcript, linked by a triphosphate bond. This typically occurs before the RNA chain is 30 nucleotides long. The cap structure serves to stabilize the transcript but is also important for splicing and translation. At the 3' end, a specific sequence (5'-AAUAAA-3') is recognized by an enzyme, which then cuts the RNA approximately 20 bases further downstream, and a poly(A) tail is added at the 3' end. The poly(A) tail consists of a run of up to 250 adenine nucleotides and is believed to help in the translation of mRNA in the cytoplasm. This process is called 3' cleavage and polyadenylation.
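The 3' processing step can be sketched as a string operation. The code below is a simplified illustration under stated assumptions: the function name is hypothetical, the last AAUAAA in the transcript is taken as the signal, the cut is placed exactly `offset` bases after it (the text says approximately 20), and the tail length is a parameter rather than a biological prediction.

```python
POLYA_SIGNAL = "AAUAAA"


def cleave_and_polyadenylate(pre_mrna: str, offset: int = 20, tail_len: int = 200) -> str:
    """Cut the transcript `offset` bases past the signal and append a poly(A) tail."""
    rna = pre_mrna.upper()
    signal_start = rna.rfind(POLYA_SIGNAL)  # assumption: use the last occurrence
    if signal_start == -1:
        raise ValueError("no AAUAAA polyadenylation signal found")
    # Never cut beyond the end of the transcript.
    cut = min(signal_start + len(POLYA_SIGNAL) + offset, len(rna))
    return rna[:cut] + "A" * tail_len


if __name__ == "__main__":
    transcript = "GGGCCC" + "AAUAAA" + "C" * 30
    processed = cleave_and_polyadenylate(transcript, tail_len=5)
    print(processed)
    print(len(processed))
```

In this toy transcript the signal ends at position 12, so the cut falls at position 32 and the last ten C bases are discarded before the tail is added.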
The final step in converting pre-mRNA into mature mRNA involves splicing, or removal of the introns and joining of the exons. In order for splicing to occur, certain nucleotide sequences must also be present. The 5' end of an intron almost always contains a 5'-GU-3' sequence, and the 3' end contains a 5'-AG-3' sequence. The AG sequence is preceded by a polypyrimidine tract, a pyrimidine-rich sequence. Further upstream there is a sequence called a branchpoint sequence, which is 5'-CU(A/G)A(C/U)-3' in vertebrates. The splice sites as well as the conserved sequences related to intron splicing are shown in figure 1.7.
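The sequence signals listed above translate directly into simple pattern checks. The following sketch is illustrative only: the function name, the returned dictionary keys, and the 12-base window used to measure pyrimidine richness are assumptions, and the branchpoint pattern is the vertebrate consensus quoted in the text written as a regular expression.

```python
import re

# Vertebrate branchpoint consensus from the text: 5'-CU(A/G)A(C/U)-3'.
BRANCHPOINT = re.compile(r"CU[AG]A[CU]")


def check_intron(intron: str, tract_len: int = 12) -> dict:
    """Report which of the conserved splicing signals an intron carries."""
    s = intron.upper()
    # Window of `tract_len` bases immediately upstream of the final AG.
    tract = s[-(tract_len + 2):-2]
    pyrimidines = sum(base in "CU" for base in tract)
    match = BRANCHPOINT.search(s)
    return {
        "donor_GU": s.startswith("GU"),        # 5' splice site
        "acceptor_AG": s.endswith("AG"),       # 3' splice site
        "branchpoint_at": match.start() if match else None,
        "pyrimidine_fraction": pyrimidines / len(tract) if tract else 0.0,
    }


if __name__ == "__main__":
    intron = "GUAAGUAC" + "CUAAC" + "UUUCUUUCCUUUC" + "AG"
    print(check_intron(intron))
```

A sequence passing these checks merely carries the consensus signals; whether it is actually spliced depends on factors this sketch does not consider.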
Excerpted from Genomic Signal Processing by Ilya Shmulevich and Edward R. Dougherty
Copyright © 2007 by Princeton University Press. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.
Chapter 1: Biological Foundations
1.1 Genetics
1.1.1 Nucleic Acid Structure
1.1.2 Genes
1.1.3 RNA
1.1.4 Transcription
1.1.5 Proteins
1.1.6 Translation
1.1.7 Transcriptional Regulation
1.2 Genomics
1.2.1 Microarray Technology
1.3 Proteomics
Chapter 2: Deterministic Models of Gene Networks
2.1 Graph Models
2.2 Boolean Networks
2.2.1 Cell Differentiation and Cellular Functional States
2.2.2 Network Properties and Dynamics
2.2.3 Network Inference
2.3 Generalizations of Boolean Networks
2.3.1 Asynchrony
2.3.2 Multivalued Networks
2.4 Differential Equation Models
2.4.1 A Differential Equation Model Incorporating Transcription and Translation
2.4.2 Discretization of the Continuous Differential Equation Model
Chapter 3: Stochastic Models of Gene Networks
3.1 Bayesian Networks
3.2 Probabilistic Boolean Networks
3.2.1 Definitions
3.2.2 Inference
3.2.3 Dynamics of PBNs
3.2.4 Steady-State Analysis of Instantaneously Random PBNs
3.2.5 Relationships of PBNs to Bayesian Networks
3.2.6 Growing Subnetworks from Seed Genes
3.3 Intervention
3.3.1 Gene Intervention
3.3.2 Structural Intervention
3.3.3 External Control
Chapter 4: Classification
4.1 Bayes Classifier
4.2 Classification Rules
4.2.1 Consistent Classifier Design
4.2.2 Examples of Classification Rules
4.3 Constrained Classifiers
4.3.1 Shatter Coefficient
4.3.2 VC Dimension
4.4 Linear Classification
4.4.1 Rosenblatt Perceptron
4.4.2 Linear and Quadratic Discriminant Analysis
4.4.3 Linear Discriminants Based on Least-Squares Error
4.4.4 Support Vector Machines
4.4.5 Representation of Design Error for Linear Discriminant Analysis
4.4.6 Distribution of the QDA Sample-Based Discriminant
4.5 Neural Networks Classifiers
4.6 Classification Trees
4.6.1 Classification and Regression Trees
4.6.2 Strongly Consistent Rules for Data-Dependent Partitioning
4.7 Error Estimation
4.7.1 Resubstitution
4.7.2 Cross-validation
4.7.3 Bootstrap
4.7.4 Bolstering
4.7.5 Error Estimator Performance
4.7.6 Feature Set Ranking
4.8 Error Correction
4.9 Robust Classifiers
4.9.1 Optimal Robust Classifiers
4.9.2 Performance Comparison for Robust Classifiers
Chapter 5: Regularization
5.1 Data Regularization
5.1.1 Regularized Discriminant Analysis
5.1.2 Noise Injection
5.2 Complexity Regularization
5.2.1 Regularization of the Error
5.2.2 Structural Risk Minimization
5.2.3 Empirical Complexity
5.3 Feature Selection
5.3.1 Peaking Phenomenon
5.3.2 Feature Selection Algorithms
5.3.3 Impact of Error Estimation on Feature Selection
5.3.4 Redundancy
5.3.5 Parallel Incremental Feature Selection
5.3.6 Bayesian Variable Selection
5.4 Feature Extraction
Chapter 6: Clustering
6.1 Examples of Clustering Algorithms
6.1.1 Euclidean Distance Clustering
6.1.2 Self-Organizing Maps
6.1.3 Hierarchical Clustering
6.1.4 Model-Based Cluster Operators
6.2 Cluster Operators
6.2.1 Algorithm Structure
6.2.2 Label Operators
6.2.3 Bayes Clusterer
6.2.4 Distributional Testing of Cluster Operators
6.3 Cluster Validation
6.3.1 External Validation
6.3.2 Internal Validation
6.3.3 Instability Index
6.3.4 Bayes Factor
6.4 Learning Cluster Operators
6.4.1 Empirical-Error Cluster Operator
6.4.2 Nearest-Neighbor Clustering Rule