Genomic signal processing (GSP) can be defined as the analysis, processing, and use of genomic signals to gain biological knowledge, and the translation of that knowledge into systems-based applications that can be used to diagnose and treat genetic diseases. Situated at the crossroads of engineering, biology, mathematics, statistics, and computer science, GSP requires the development of both nonlinear dynamical models that adequately represent genomic regulation, and diagnostic and therapeutic tools based on these models. This book facilitates these developments by providing rigorous mathematical definitions and propositions for the main elements of GSP and by paying attention to the validity of models relative to the data. Ilya Shmulevich and Edward Dougherty cover real-world situations and explain their mathematical modeling in relation to systems biology and systems medicine.
Genomic Signal Processing makes a major contribution to computational biology, systems biology, and translational genomics by providing a self-contained explanation of the fundamental mathematical issues facing researchers in four areas: classification, clustering, network modeling, and network intervention.
Product Details
ISBN-13: | 9781400865260 |
---|---|
Publisher: | Princeton University Press |
Publication date: | 09/08/2014 |
Series: | Princeton Series in Applied Mathematics, #18 |
Sold by: | Barnes & Noble |
Format: | eBook |
Pages: | 288 |
File size: | 4 MB |
Read an Excerpt
Genomic Signal Processing
By Ilya Shmulevich and Edward R. Dougherty
Princeton University Press
Copyright © 2007 Princeton University Press
All rights reserved.
ISBN: 978-0-691-11762-1
Chapter One: Biological Foundations
No single agreed-upon definition seems to exist for the term bioinformatics, which has been used to mean a variety of things ranging in scope and focus. To cite but a few examples from textbooks, Lodish et al. (2000) state that "bioinformatics is the rapidly developing area of computer science devoted to collecting, organizing, and analyzing DNA and protein sequences." A more general and encompassing definition, given by Brown (2002), is that bioinformatics is "the use of computer methods in studies of genomes." More general still: "bioinformatics is the science of refining biological information into biological knowledge using computers" (Draghici, 2003). Kohane et al. (2003) observe that the "breadth of this commonly used definition of bioinformatics risks relegating it to the dustbin of labels too general to be useful" and advocate being more specific about the particular bioinformatics techniques employed.
While it is true that the field of bioinformatics has traditionally dealt primarily with biological data encoded in digital symbol sequences, such as nucleotide and amino acid sequences, in this book we will be mainly concerned with extracting information from gene expression measurements and genomic signals. By the latter we mean any measurable events, principally the production of messenger ribonucleic acid (RNA) and protein, that are carried out by the genome. The analysis, processing, and use of genomic signals for gaining biological knowledge and translating this knowledge into systems-based applications is called genomic signal processing.
In this chapter, our aim is to place this material into a proper biological context by providing the necessary background for some of the key concepts that we shall use. We cannot hope to comprehensively cover the topics of modern genetics, genomics, cell biology, and others, so we will confine ourselves to brief overviews of some of these topics. We particularly recommend the book by Alberts et al. (2002) for a more comprehensive coverage of these topics.
1.1 GENETICS
Broadly speaking, genetics is the study of genes. The latter can be studied from different perspectives and on a molecular, cellular, population, or evolutionary level. A gene is composed of deoxyribonucleic acid (DNA), which is a double helix consisting of two intertwined and complementary nucleotide chains. The entire set of DNA is the genome of the organism. The DNA molecules in the genome are assembled into chromosomes, and genes are the functional regions of DNA.
Each gene encodes information about the structure and functionality of some protein produced in the cell. Proteins in turn are the machinery of the cell and the major determinants of its properties. Proteins can carry out a number of tasks, such as catalyzing reactions, transporting oxygen, regulating the production of other proteins, and many others. The way proteins are encoded by genes involves two major steps: transcription and translation. Transcription refers to the process of copying the information encoded in the DNA into a molecule called messenger RNA (mRNA). Many copies of the same RNA can be produced from only a single copy of DNA, which ultimately allows the cell to make large amounts of proteins. This occurs by means of the process referred to as translation, which converts mRNA into chains of linked amino acids called polypeptides. Polypeptides can combine with other polypeptides or act on their own to form the actual proteins. The flow of information from DNA to RNA to protein is known as the central dogma of molecular biology. Although it is mostly correct, there are a number of modifications that need to be made. These include the processes of reverse transcription, RNA editing, and RNA replication.
Briefly, reverse transcription refers to the conversion of a single-stranded RNA molecule to a double-stranded DNA molecule with the help of an enzyme aptly called reverse transcriptase. For example, HIV has an RNA genome that is converted to DNA and inserted into the genome of the host. RNA editing refers to the alteration of RNA after it has been transcribed from DNA. Therefore, the ultimate protein product that results from the edited RNA molecule does not correspond to what was originally encoded in the DNA. Finally, RNA replication is a process whereby RNA can be copied into RNA without the use of DNA. Several viruses, such as hepatitis C virus, employ this mechanism. We will now discuss some preliminary concepts in more detail.
1.1.1 Nucleic Acid Structure
Almost every cell in an organism contains the same DNA content. Every time a cell divides, this material is faithfully replicated. The information stored in the DNA is used to code for the expressed proteins by means of transcription and translation. The DNA molecule is a polymer that is strung together from monomers called deoxyribonucleotides, or simply nucleotides, each of which consists of three chemical components: a sugar (deoxyribose), a phosphate group, and a nitrogenous base. There are four possible bases: adenine, guanine, cytosine, and thymine, often abbreviated as A, G, C, and T, respectively. Adenine and guanine are purines and have bicyclic structures (two fused rings), whereas cytosine and thymine are pyrimidines, and have monocyclic structures. The sugar has five carbon atoms that are typically numbered from 1' to 5'. The phosphate group is attached to the 5'-carbon atom, whereas the base is attached to the 1' carbon. The 3' carbon also has a hydroxyl group (OH) attached to it.
Figure 1.1 illustrates the structure of a nucleotide with a thymine base. Although this figure shows one phosphate group, up to three phosphates can be attached. For example, adenosine 5'-triphosphate (ATP), which has three phosphates, is the molecule responsible for supplying energy for many biochemical cellular processes.
Ribonucleic acid is a polymer that is quite close in structure to DNA. One of the differences is that in RNA the sugar is ribose rather than deoxyribose. While the latter has a hydrogen at the 2' position (figure 1.1), ribose has a hydroxyl group at this position. Another difference is that the thymine base is replaced by the structurally similar uracil (U) base in a ribonucleotide.
The deoxyribonucleotides in DNA and the ribonucleotides in RNA are joined by the covalent linkage of a phosphate group, where one bond is between the phosphate and the 5' carbon of one sugar and the other bond is between the phosphate and the 3' carbon of the adjacent sugar (deoxyribose in DNA, ribose in RNA). This type of linkage is called a phosphodiester bond. The arrangement just described gives the molecule a 5'->3' polarity or directionality. Because of this, it is a convention to write the sequences of nucleotides starting with the 5' end at the left, for example, 5'-ATCGGCTC-3'. Figure 1.2 is a simplified diagram of the phosphodiester bonds and the covalent structure of a DNA strand.
DNA commonly occurs in nature as two strands of nucleotides twisted around in a double helix, with the repeating phosphate-deoxyribose sugar polymer serving as the backbone. This backbone is on the outside of the helix, and the bases are located in the center. The opposite strands are joined by hydrogen bonding between the bases, forming base pairs. The two backbones are in opposite or antiparallel orientations. Thus, one strand is oriented 5'->3' and the other is 3'->5'. Each base can interact with only one other type of base. Specifically, A always pairs with T (an A · T base pair), and G always pairs with C (a G · C base pair). The bases in the base pairs are said to be complementary. The A · T base pair has two hydrogen bonds, whereas the G · C base pair has three hydrogen bonds. These bonds are responsible for holding the two opposite strands together. Thus, if a DNA molecule contains many G · C base pairs, it is more stable than one containing many A · T base pairs. This also implies that DNA that is high in G · C content requires a higher temperature to separate, or denature, the two strands. Although the individual hydrogen bonds are rather weak, because the overall number of these bonds is quite high, the two strands are held together quite well.
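The pairing rules just described can be illustrated with a short Python sketch. It is not part of the book; the function names are hypothetical, and the only facts it encodes are the ones stated above (A pairs with T through two hydrogen bonds, G pairs with C through three, and the two strands run antiparallel). It reuses the example sequence 5'-ATCGGCTC-3' from the text.

```python
# Watson-Crick pairing rules from the text: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

# Hydrogen bonds contributed by the base pair each base takes part in.
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}


def reverse_complement(strand: str) -> str:
    """Return the opposite strand, also written 5'->3'.

    The strands are antiparallel, so the complement is read in reverse.
    """
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def gc_fraction(strand: str) -> float:
    """Fraction of positions that are G or C (a rough proxy for stability)."""
    return sum(base in "GC" for base in strand) / len(strand)


def hydrogen_bonds(strand: str) -> int:
    """Total hydrogen bonds holding this strand to its complement."""
    return sum(H_BONDS[base] for base in strand)


seq = "ATCGGCTC"                      # 5'-ATCGGCTC-3', the example in the text
print(reverse_complement(seq))        # GAGCCGAT
print(gc_fraction(seq))               # 0.625
print(hydrogen_bonds(seq))            # 21
```

A strand with a higher `gc_fraction` has more hydrogen bonds per base pair, which matches the observation that GC-rich DNA needs a higher temperature to denature. The sketch only counts bonds; it does not estimate a melting temperature.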
Although in this book we will focus on gene expression, which involves transcription, it is important to say a few words about how the DNA molecule duplicates. Because the two strands in the DNA double helix are complementary, they carry the same information. During replication, the strands separate and each one acts as a template for directing the synthesis of a new complementary strand. The two new double-stranded molecules are passed on to daughter cells during cell division. The DNA replication phase in the cell cycle is called the S (synthesis) phase. After the strands separate, the single bases (on each side) become exposed. Thus, they are free to form base pairs with other free (complementary) nucleotides. The enzyme that is responsible for building new strands is called DNA polymerase.
1.1.2 Genes
Genes represent the functional regions of DNA in that they can be transcribed to produce RNA. A gene contains a regulatory region at its upstream (5') end to which various proteins can bind and cause the gene to initiate transcription in the adjacent RNA-encoding region. This essentially allows the gene to receive and then respond to other signals from within or outside the genome. At the other (3') end of the gene, there is another region that signals termination of transcription.
In eukaryotes (cells that have a nucleus), many genes contain introns, which are segments of DNA that have no information for, or do not code for, any gene products (proteins). Introns are transcribed along with the coding regions, which are called exons, but are then cut out from the transcript. The exons are then spliced together to form the functional messenger RNA that leaves the nucleus to direct protein synthesis in the cytoplasm. Prokaryotes (cells without a nucleus) do not have an exon/intron structure, and their coding region is contiguous. These concepts are illustrated in figure 1.3.
The parts of DNA that do not correspond to genes are of mostly unknown function. The amount of this type of intergenic DNA present depends on the organism. For example, mammals can contain enormous regions of intergenic DNA.
1.1.3 RNA
Before we go on to discuss the process of transcription, which is the synthesis of RNA, let us say a few words about RNA and its roles in the cell. As discussed above, most RNAs are used as an intermediary in producing proteins via the process of translation. However, some RNAs are also useful in their own right in that they can carry out a number of functions. As mentioned earlier, the RNA that is used to make proteins is called messenger RNA. The other RNAs that perform various functions are never translated; however, these RNAs are still encoded by some genes.
One such type of RNA is transfer RNA (tRNA), which transports amino acids to mRNA during translation. tRNAs are quite general in that they can transport amino acids to the mRNA corresponding to any gene. Another type of RNA is ribosomal RNA (rRNA), which along with different proteins comprises ribosomes. Ribosomes coordinate assembly of the amino acid chain in a protein. rRNAs are also general-purpose molecules and can be used to translate the mRNA of any gene. There are also a number of other types of RNA involved in splicing (snRNAs), protein trafficking (scRNAs), and other functions. We now turn to the topic of transcription.
1.1.4 Transcription
Transcription, which is the synthesis of RNA on a DNA template, is the first step in the process of gene expression, leading to the synthesis of a protein. Similarly to DNA replication, transcription relies on complementary base pairing. Transcription is catalyzed by an RNA polymerase, and RNA synthesis always occurs from the 5' to the 3' end of an RNA molecule. First, the two DNA strands separate, with one of the strands acting as a template for synthesizing RNA. Which of the two strands is used as a template depends on the gene. After separation of the DNA strands, available ribonucleotides are attached to their complementary bases on the DNA template. Recall that in RNA uracil is used in place of thymine in complementary base pairing. The RNA strand is thus a direct copy (with U instead of T) of one of the DNA strands, and that DNA strand is referred to as the sense strand. The other DNA strand, the one that is used as a template, is called the antisense strand. This is illustrated in figure 1.4.
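As a minimal sketch of the base-pairing logic in this paragraph (not code from the book; the function names are hypothetical), the following Python example builds an RNA from an antisense template and checks that the result equals the sense strand with U in place of T. The template is given in the 3'->5' direction so that the RNA comes out written 5'->3'.

```python
# Pairing used during transcription: the DNA template base on the left
# recruits the ribonucleotide on the right (U is used in place of T).
TEMPLATE_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3_to_5: str) -> str:
    """Synthesize RNA (5'->3') from an antisense template read 3'->5'."""
    return "".join(TEMPLATE_TO_RNA[base] for base in template_3_to_5)


def sense_to_rna(sense_5_to_3: str) -> str:
    """Shortcut: the RNA is a copy of the sense strand with U instead of T."""
    return sense_5_to_3.replace("T", "U")


sense = "ATCGGCTC"        # 5'-ATCGGCTC-3'
antisense = "TAGCCGAG"    # 3'-TAGCCGAG-5', complementary to the sense strand

rna = transcribe(antisense)
print(rna)                          # AUCGGCUC
print(rna == sense_to_rna(sense))   # True
```

Both routes give the same RNA, which is why the transcript can be read off the sense strand even though the polymerase actually pairs ribonucleotides against the antisense strand.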
Transcription is initiated when RNA polymerase binds to the double-stranded DNA. The actual site at which RNA polymerase binds is called a promoter, which is a sequence of DNA at the start of a gene. Since RNA is synthesized in the 5'->3' direction (figure 1.5), genes are also viewed in the same orientation by convention. Therefore, the promoter is always located upstream (5' side) of the coding region. Certain sequence elements of promoters are conserved between genes. RNA polymerase binds to these common parts of the sequences in order to initiate transcription of the gene. In most prokaryotes, the same RNA polymerase is able to transcribe all types of RNAs, whereas in eukaryotes, several different types of RNA polymerases are used depending on what kind of RNA is produced (mRNA, rRNA, tRNA).
In order to allow the DNA antisense strand to be used for base pairing, the DNA helix must first be locally unwound. This unwinding process starts at the promoter site to which RNA polymerase binds. The location at which the RNA strand begins to be synthesized is called the initiation site and is defined as position +1 in the gene. As shown in figure 1.5, RNA polymerase adds ribonucleotides to the 3' end of the RNA. After this, the helix is re-formed once again.
Since the transcript must eventually terminate, how does the RNA polymerase know when to stop synthesizing RNA? This is accomplished by the recognition of certain specific DNA sequences, called terminators, that signal the termination of transcription, causing the RNA polymerase to be released from the template and ending the RNA synthesis. Although there are several mechanisms for termination, a common direct mechanism in prokaryotes is a terminator sequence arranged in such a way that it contains self-complementary regions that can form stem-loop or hairpin structures in the RNA product. Such a structure, shown in figure 1.6, can cause the polymerase to pause, thereby terminating transcription. It is interesting to note that the hairpin structure is often GC-rich, making the self-complementary base pairing stronger because of the higher stability of G · C base pairs relative to A · U base pairs. Moreover, there are usually several U bases at the end of the hairpin structure, which, because of the relatively weaker A · U base pairs, facilitates dissociation of the RNA.
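The idea of a self-complementary region can be made concrete with a small Python sketch. This is an illustration only: the terminator-like sequence below is made up for the example (it is not taken from the book or from a real gene), and the function names are hypothetical. The check is simply that one arm of the stem is the reverse complement of the other, so the RNA can fold back and pair with itself.

```python
# RNA base pairing: A pairs with U, G pairs with C.
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def rna_reverse_complement(seq: str) -> str:
    """Reverse complement of an RNA sequence, written 5'->3'."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(seq))


def forms_stem(left_arm: str, right_arm: str) -> bool:
    """True if the two arms can base-pair along their whole length."""
    return right_arm == rna_reverse_complement(left_arm)


def gc_fraction(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    return sum(base in "GC" for base in seq) / len(seq)


# Hypothetical terminator-like transcript: GC-rich stem, short loop, U run.
left_arm, loop, right_arm, u_run = "GCCGCC", "UUCG", "GGCGGC", "UUUU"
transcript_end = left_arm + loop + right_arm + u_run

print(transcript_end)                    # GCCGCCUUCGGGCGGCUUUU
print(forms_stem(left_arm, right_arm))   # True
print(gc_fraction(left_arm))             # 1.0
```

The example mirrors the two features noted above: a GC-rich stem that pairs strongly, followed by a run of U bases whose weaker A · U pairing with the template makes it easier for the RNA to dissociate.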
In eukaryotes, the transcriptional machinery is somewhat more complicated than in prokaryotes since the primary RNA transcript (pre-mRNA) must first be processed before being transported out of the nucleus (recall that prokaryotes have no nucleus). This processing first involves capping, the addition of a 7-methylguanosine molecule to the 5' end of the transcript, linked by a triphosphate bond. This typically occurs before the RNA chain is 30 nucleotides long. This cap structure serves to stabilize the transcript but is also important for splicing and translation. At the 3' end, a specific sequence (5'-AAUAAA-3') is recognized by an enzyme which then cuts off the RNA approximately 20 bases further down, and a poly(A) tail is added at the 3' end. The poly(A) tail consists of a run of up to 250 adenine nucleotides and is believed to help in the translation of mRNA in the cytoplasm. This process is called 3' cleavage and polyadenylation.
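The 3' cleavage and polyadenylation step can be sketched as a simple string operation. The Python example below is a toy model, not the book's method: the function name and the input transcript are hypothetical, the cut is placed exactly 20 bases after the end of the AAUAAA signal (the text says only "approximately 20 bases further down"), and the tail length is a parameter with the stated upper figure of 250 as its default.

```python
POLY_A_SIGNAL = "AAUAAA"  # recognition sequence named in the text


def cleave_and_polyadenylate(transcript: str,
                             offset: int = 20,
                             tail_length: int = 250) -> str:
    """Cut the transcript downstream of the signal and append a poly(A) tail.

    Assumption: `offset` is counted from the end of the signal, and the
    first occurrence of the signal is the one that is used.
    """
    signal_start = transcript.find(POLY_A_SIGNAL)
    if signal_start == -1:
        return transcript  # no signal found: leave the transcript untouched
    cut = signal_start + len(POLY_A_SIGNAL) + offset
    return transcript[:cut] + "A" * tail_length


# Made-up transcript: 4 bases, the signal, then 30 downstream bases.
pre_mrna = "GGCC" + POLY_A_SIGNAL + "C" * 30
processed = cleave_and_polyadenylate(pre_mrna, tail_length=5)

print(processed)        # GGCCAAUAAACCCCCCCCCCCCCCCCCCCCAAAAA
print(len(processed))   # 35
```

In the cell the cleavage position varies, so a fixed `offset` is a simplification chosen to keep the example deterministic.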
The final step in converting pre-mRNA into mature mRNA involves splicing, or removal of the introns and joining of the exons. In order for splicing to occur, certain nucleotide sequences must also be present. The 5' end of an intron almost always contains a 5'-GU-3' sequence, and the 3' end contains a 5'-AG-3' sequence. The AG sequence is preceded by a polypyrimidine tract, a pyrimidine-rich sequence. Further upstream there is a sequence called a branchpoint sequence, which is 5'-CU(A/G)A(C/U)-3' in vertebrates. The splice sites as well as the conserved sequences related to intron splicing are shown in figure 1.7.
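A minimal Python sketch of splicing under these rules follows. It is an illustration, not the book's algorithm: the function names and the toy pre-mRNA are hypothetical, intron positions are assumed to be known in advance, and only the GU...AG boundary rule is checked (the branchpoint and polypyrimidine tract are present in the toy intron but not verified).

```python
from typing import List, Tuple


def is_canonical_intron(intron: str) -> bool:
    """Check the boundary rule from the text: starts with GU, ends with AG."""
    return intron.startswith("GU") and intron.endswith("AG")


def splice(pre_mrna: str, introns: List[Tuple[int, int]]) -> str:
    """Remove introns, given as half-open (start, end) index pairs,
    and join the remaining exons in order."""
    exons = []
    cursor = 0
    for start, end in sorted(introns):
        if not is_canonical_intron(pre_mrna[start:end]):
            raise ValueError(f"segment {start}:{end} lacks GU...AG boundaries")
        exons.append(pre_mrna[cursor:start])  # exon preceding this intron
        cursor = end
    exons.append(pre_mrna[cursor:])           # final exon
    return "".join(exons)


# Made-up example: the intron holds a branchpoint-like CUAAC and a
# pyrimidine-rich stretch before the terminal AG.
exon1, intron, exon2 = "AUGGCC", "GUAAGUCUAACUUUUUCCAG", "UUAUAA"
pre_mrna = exon1 + intron + exon2

mature = splice(pre_mrna, [(len(exon1), len(exon1) + len(intron))])
print(mature)   # AUGGCCUUAUAA
```

Real splice-site prediction has to locate the introns first, which is a much harder problem than removing them once their coordinates are known.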
(Continues...)
Excerpted from Genomic Signal Processing by Ilya Shmulevich and Edward R. Dougherty
Copyright © 2007 by Princeton University Press. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.
Table of Contents
Preface
Biological Foundations
Genetics
Nucleic Acid Structure
Genes
RNA
Transcription
Proteins
Translation
Transcriptional Regulation
Genomics
Microarray Technology
Proteomics
Bibliography
Deterministic Models of Gene Networks
Graph Models
Boolean Networks
Cell Differentiation and Cellular Functional States
Network Properties and Dynamics
Network Inference
Generalizations of Boolean Networks
Asynchrony
Multivalued Networks
Differential Equation Models
A Differential Equation Model Incorporating Transcription and Translation
Discretization of the Continuous Differential Equation Model
Bibliography
Stochastic Models of Gene Networks
Bayesian Networks
Probabilistic Boolean Networks
Definitions
Inference
Dynamics of PBNs
Steady-State Analysis of Instantaneously Random PBNs
Relationships of PBNs to Bayesian Networks
Growing Subnetworks from Seed Genes
Intervention
Gene Intervention
Structural Intervention
External Control
Bibliography
Classification
Bayes Classifier
Classification Rules
Consistent Classifier Design
Examples of Classification Rules
Constrained Classifiers
Shatter Coefficient
VC Dimension
Linear Classification
Rosenblatt Perceptron
Linear and Quadratic Discriminant Analysis
Linear Discriminants Based on Least-Squares Error
Support Vector Machines
Representation of Design Error for Linear Discriminant Analysis
Distribution of the QDA Sample-Based Discriminant
Neural Networks Classifiers
Classification Trees
Classification and Regression Trees
Strongly Consistent Rules for Data-Dependent Partitioning
Error Estimation
Resubstitution
Cross-validation
Bootstrap
Bolstering
Error Estimator Performance
Feature Set Ranking
Error Correction
Robust Classifiers
Optimal Robust Classifiers
Performance Comparison for Robust Classifiers
Bibliography
Regularization
Data Regularization
Regularized Discriminant Analysis
Noise Injection
Complexity Regularization
Regularization of the Error
Structural Risk Minimization
Empirical Complexity
Feature Selection
Peaking Phenomenon
Feature Selection Algorithms
Impact of Error Estimation on Feature Selection
Redundancy
Parallel Incremental Feature Selection
Bayesian Variable Selection
Feature Extraction
Bibliography
Clustering
Examples of Clustering Algorithms
Euclidean Distance Clustering
Self-Organizing Maps
Hierarchical Clustering
Model-Based Cluster Operators
Cluster Operators
Algorithm Structure
Label Operators
Bayes Clusterer
Distributional Testing of Cluster Operators
Cluster Validation
External Validation
Internal Validation
Instability Index
Bayes Factor
Learning Cluster Operators
Empirical-Error Cluster Operator
Nearest-Neighbor Clustering Rule
Bibliography
Index
What People are Saying About This
There is a genuine need for this concise, informative, clearly written book. In systems biology, engineers, mathematicians, and computer scientists are collaborating increasingly with biologists and researchers in medicine. This book goes a long way toward narrowing the gap on this front, and it lays a rigorous foundation for a new discipline.
Olli Yli-Harja, Tampere University of Technology