Genomic Signal Processing

Genomic signal processing (GSP) can be defined as the analysis, processing, and use of genomic signals to gain biological knowledge, and the translation of that knowledge into systems-based applications that can be used to diagnose and treat genetic diseases. Situated at the crossroads of engineering, biology, mathematics, statistics, and computer science, GSP requires the development of both nonlinear dynamical models that adequately represent genomic regulation, and diagnostic and therapeutic tools based on these models. This book facilitates these developments by providing rigorous mathematical definitions and propositions for the main elements of GSP and by paying attention to the validity of models relative to the data. Ilya Shmulevich and Edward Dougherty cover real-world situations and explain their mathematical modeling in relation to systems biology and systems medicine.

Genomic Signal Processing makes a major contribution to computational biology, systems biology, and translational genomics by providing a self-contained explanation of the fundamental mathematical issues facing researchers in four areas: classification, clustering, network modeling, and network intervention.

Product Details

ISBN-13:	9781400865260
Publisher:	Princeton University Press
Publication date:	09/08/2014
Series:	Princeton Series in Applied Mathematics , #18
Sold by:	Barnes & Noble
Format:	eBook
Pages:	288
File size:	4 MB

About the Author

Ilya Shmulevich, an associate professor at the Institute for Systems Biology, is the coauthor of Microarray Quality Control and the coeditor of Computational and Statistical Approaches to Genomics. Edward R. Dougherty is professor of electrical and computer engineering and director of the Genomic Signal Processing Laboratory at Texas A&M University, and director of the Computational Biology Division at the Translational Genomics Research Institute. His thirteen previous books include Random Processes for Image and Signal Processing.

Read an Excerpt

Genomic Signal Processing

By Ilya Shmulevich and Edward R. Dougherty
Princeton University Press
Copyright © 2007 Princeton University Press
All rights reserved.
ISBN: 978-0-691-11762-1

Chapter One Biological Foundations

No single agreed-upon definition seems to exist for the term bioinformatics, which has been used to mean a variety of things ranging in scope and focus. To cite but a few examples from textbooks, Lodish et al. (2000) state that "bioinformatics is the rapidly developing area of computer science devoted to collecting, organizing, and analyzing DNA and protein sequences." A more general and encompassing definition, given by Brown (2002), is that bioinformatics is "the use of computer methods in studies of genomes." More general still: "bioinformatics is the science of refining biological information into biological knowledge using computers" (Draghici, 2003). Kohane et al. (2003) observe that the "breadth of this commonly used definition of bioinformatics risks relegating it to the dustbin of labels too general to be useful" and advocate being more specific about the particular bioinformatics techniques employed.

While it is true that the field of bioinformatics has traditionally dealt primarily with biological data encoded in digital symbol sequences, such as nucleotide and amino acid sequences, in this book we will be mainly concerned with extracting information from gene expression measurements and genomic signals. By the latter we mean any measurable events, principally the production of messenger ribonucleic acid (mRNA) and protein, that are carried out by the genome. The analysis, processing, and use of genomic signals for gaining biological knowledge and translating this knowledge into systems-based applications is called genomic signal processing.

In this chapter, our aim is to place this material into a proper biological context by providing the necessary background for some of the key concepts that we shall use. We cannot hope to comprehensively cover the topics of modern genetics, genomics, cell biology, and others, so we will confine ourselves to brief overviews of some of these topics. We particularly recommend the book by Alberts et al. (2002) for a more comprehensive coverage of these topics.

1.1 GENETICS

Broadly speaking, genetics is the study of genes. The latter can be studied from different perspectives and on a molecular, cellular, population, or evolutionary level. A gene is composed of deoxyribonucleic acid (DNA), which is a double helix consisting of two intertwined and complementary nucleotide chains. The entire set of DNA is the genome of the organism. The DNA molecules in the genome are assembled into chromosomes, and genes are the functional regions of DNA.

Each gene encodes information about the structure and functionality of some protein produced in the cell. Proteins in turn are the machinery of the cell and the major determinants of its properties. Proteins can carry out a number of tasks, such as catalyzing reactions, transporting oxygen, regulating the production of other proteins, and many others. The way proteins are encoded by genes involves two major steps: transcription and translation. Transcription refers to the process of copying the information encoded in the DNA into a molecule called messenger RNA (mRNA). Many copies of the same RNA can be produced from only a single copy of DNA, which ultimately allows the cell to make large amounts of proteins. This occurs by means of the process referred to as translation, which converts mRNA into chains of linked amino acids called polypeptides. Polypeptides can combine with other polypeptides or act on their own to form the actual proteins. The flow of information from DNA to RNA to protein is known as the central dogma of molecular biology. Although it is mostly correct, there are a number of modifications that need to be made. These include the processes of reverse transcription, RNA editing, and RNA replication.

Briefly, reverse transcription refers to the conversion of a single-stranded RNA molecule to a double-stranded DNA molecule with the help of an enzyme aptly called reverse transcriptase. For example, HIV has an RNA genome that is converted to DNA and inserted into the genome of the host. RNA editing refers to the alteration of RNA after it has been transcribed from DNA. Therefore, the ultimate protein product that results from the edited RNA molecule does not correspond to what was originally encoded in the DNA. Finally, RNA replication is a process whereby RNA can be copied into RNA without the use of DNA. Several viruses, such as hepatitis C virus, employ this mechanism. We will now discuss some preliminary concepts in more detail.

1.1.1 Nucleic Acid Structure

Almost every cell in an organism contains the same DNA content. Every time a cell divides, this material is faithfully replicated. The information stored in the DNA is used to code for the expressed proteins by means of transcription and translation. The DNA molecule is a polymer that is strung together from monomers called deoxyribonucleotides, or simply nucleotides, each of which consists of three chemical components: a sugar (deoxyribose), a phosphate group, and a nitrogenous base. There are four possible bases: adenine, guanine, cytosine, and thymine, often abbreviated as A, G, C, and T, respectively. Adenine and guanine are purines and have bicyclic structures (two fused rings), whereas cytosine and thymine are pyrimidines, and have monocyclic structures. The sugar has five carbon atoms that are typically numbered from 1' to 5'. The phosphate group is attached to the 5'-carbon atom, whereas the base is attached to the 1' carbon. The 3' carbon also has a hydroxyl group (OH) attached to it.

Figure 1.1 illustrates the structure of a nucleotide with a thymine base. Although this figure shows one phosphate group, up to three phosphates can be attached. For example, adenosine 5'-triphosphate (ATP), which has three phosphates, is the molecule responsible for supplying energy for many biochemical cellular processes.

Ribonucleic acid is a polymer that is quite close in structure to DNA. One of the differences is that in RNA the sugar is ribose rather than deoxyribose. While the latter has a hydrogen at the 2' position (figure 1.1), ribose has a hydroxyl group at this position. Another difference is that the thymine base is replaced by the structurally similar uracil (U) base in a ribonucleotide.

The deoxyribonucleotides in DNA and the ribonucleotides in RNA are joined by the covalent linkage of a phosphate group where one bond is between the phosphate and the 5' carbon of deoxyribose and the other bond is between the phosphate and the 3' carbon of deoxyribose. This type of linkage is called a phosphodiester bond. The arrangement just described gives the molecule a 5'->3' polarity or directionality. Because of this, it is a convention to write the sequences of nucleotides starting with the 5' end at the left, for example, 5'-ATCGGCTC-3'. Figure 1.2 is a simplified diagram of the phosphodiester bonds and the covalent structure of a DNA strand.

DNA commonly occurs in nature as two strands of nucleotides twisted around in a double helix, with the repeating phosphate-deoxyribose sugar polymer serving as the backbone. This backbone is on the outside of the helix, and the bases are located in the center. The opposite strands are joined by hydrogen bonding between the bases, forming base pairs. The two backbones are in opposite or antiparallel orientations. Thus, one strand is oriented 5'->3' and the other is 3'->5'. Each base can interact with only one other type of base. Specifically, A always pairs with T (an A · T base pair), and G always pairs with C (a G · C base pair). The bases in the base pairs are said to be complementary. The A · T base pair has two hydrogen bonds, whereas the G · C base pair has three hydrogen bonds. These bonds are responsible for holding the two opposite strands together. Thus, if a DNA molecule contains many G · C base pairs, it is more stable than one containing many A · T base pairs. This also implies that DNA that is high in G · C content requires a higher temperature to separate, or denature, the two strands. Although the individual hydrogen bonds are rather weak, because the overall number of these bonds is quite high, the two strands are held together quite well.
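The pairing rules above lend themselves to a small illustration. The following is a minimal Python sketch, not taken from the book: the function names are hypothetical, and counting hydrogen bonds is only a rough proxy for duplex stability, assuming two bonds per A · T pair and three per G · C pair as stated in the text.

```python
# Illustrative helpers; names are hypothetical and not from the book.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(strand: str) -> str:
    """Return the complementary strand, written 5'->3' like the input."""
    # The partner strand is antiparallel, so we reverse before complementing.
    return "".join(COMPLEMENT[base] for base in reversed(strand))


def hydrogen_bonds(strand: str) -> int:
    """Count hydrogen bonds in the duplex: 2 per A.T pair, 3 per G.C pair."""
    return sum(3 if base in "GC" else 2 for base in strand)


def gc_content(strand: str) -> float:
    """Fraction of positions that are G or C."""
    return sum(base in "GC" for base in strand) / len(strand)


seq = "ATCGGCTC"  # the 5'->3' example sequence used earlier in the text
print(reverse_complement(seq))  # GAGCCGAT
print(hydrogen_bonds(seq))      # 21
print(gc_content(seq))          # 0.625
```

Applying reverse_complement twice returns the original sequence, which reflects the statement that the two strands carry the same information.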

Although in this book we will focus on gene expression, which involves transcription, it is important to say a few words about how the DNA molecule duplicates. Because the two strands in the DNA double helix are complementary, they carry the same information. During replication, the strands separate and each one acts as a template for directing the synthesis of a new complementary strand. The two new double-stranded molecules are passed on to daughter cells during cell division. The DNA replication phase in the cell cycle is called the S (synthesis) phase. After the strands separate, the single bases (on each side) become exposed. Thus, they are free to form base pairs with other free (complementary) nucleotides. The enzyme that is responsible for building new strands is called DNA polymerase.

1.1.2 Genes

Genes represent the functional regions of DNA in that they can be transcribed to produce RNA. A gene contains a regulatory region at its upstream (5') end to which various proteins can bind and cause the gene to initiate transcription in the adjacent RNA-encoding region. This essentially allows the gene to receive and then respond to other signals from within or outside the genome. At the other (3') end of the gene, there is another region that signals termination of transcription.

In eukaryotes (cells that have a nucleus), many genes contain introns, which are segments of DNA that have no information for, or do not code for, any gene products (proteins). Introns are transcribed along with the coding regions, which are called exons, but are then cut out from the transcript. The exons are then spliced together to form the functional messenger RNA that leaves the nucleus to direct protein synthesis in the cytoplasm. Prokaryotes (cells without a nucleus) do not have an exon/intron structure, and their coding region is contiguous. These concepts are illustrated in figure 1.3.

The parts of DNA that do not correspond to genes are of mostly unknown function. The amount of this type of intergenic DNA present depends on the organism. For example, mammals can contain enormous regions of intergenic DNA.

1.1.3 RNA

Before we go on to discuss the process of transcription, which is the synthesis of RNA, let us say a few words about RNA and its roles in the cell. As discussed above, most RNAs are used as an intermediary in producing proteins via the process of translation. However, some RNAs are also useful in their own right in that they can carry out a number of functions. As mentioned earlier, the RNA that is used to make proteins is called messenger RNA. The other RNAs that perform various functions are never translated; however, these RNAs are still encoded by some genes.

One such type of RNA is transfer RNA (tRNA), which transports amino acids to mRNA during translation. tRNAs are quite general in that they can transport amino acids to the mRNA corresponding to any gene. Another type of RNA is ribosomal RNA (rRNA), which along with different proteins comprises ribosomes. Ribosomes coordinate assembly of the amino acid chain in a protein. rRNAs are also general-purpose molecules and can be used to translate the mRNA of any gene. There are also a number of other types of RNA involved in splicing (snRNAs), protein trafficking (scRNAs), and other functions. We now turn to the topic of transcription.

1.1.4 Transcription

Transcription, which is the synthesis of RNA on a DNA template, is the first step in the process of gene expression, leading to the synthesis of a protein. Similarly to DNA replication, transcription relies on complementary base pairing. Transcription is catalyzed by an RNA polymerase, and RNA synthesis always occurs from the 5' to the 3' end of an RNA molecule. First, the two DNA strands separate, with one of the strands acting as a template for synthesizing RNA. Which of the two strands is used as a template depends on the gene. After separation of the DNA strands, available ribonucleotides are attached to their complementary bases on the DNA template. Recall that in RNA uracil is used in place of thymine in complementary base pairing. The RNA strand is thus a direct copy (with U instead of T) of one of the DNA strands, which is referred to as the sense strand. The other DNA strand, the one that is used as a template, is called the antisense strand. This is illustrated in figure 1.4.
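The relationship between the template, the sense strand, and the RNA product can be checked with a short Python sketch. It is an illustration only: the function name and the example sequences are assumptions, and the template is written 3'->5' so that the RNA comes out 5'->3'.

```python
# Base pairing between a DNA template base and the ribonucleotide it recruits.
RNA_PAIRING = {"A": "U", "T": "A", "G": "C", "C": "G"}


def transcribe(template_3to5: str) -> str:
    """Build the RNA (5'->3') from the antisense strand given 3'->5'."""
    return "".join(RNA_PAIRING[base] for base in template_3to5)


sense = "ATCGGCTC"     # sense strand, 5'->3' (hypothetical example)
template = "TAGCCGAG"  # antisense strand, written 3'->5'

rna = transcribe(template)
print(rna)                             # AUCGGCUC
print(rna == sense.replace("T", "U"))  # True
```

The final comparison shows the point made in the text: the transcript matches the sense strand with U in place of T, even though it was synthesized from the antisense strand.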

Transcription is initiated when RNA polymerase binds to the double-stranded DNA. The actual site at which RNA polymerase binds is called a promoter, which is a sequence of DNA at the start of a gene. Since RNA is synthesized in the 5'->3' direction (figure 1.5), genes are also viewed in the same orientation by convention. Therefore, the promoter is always located upstream (5' side) of the coding region. Certain sequence elements of promoters are conserved between genes. RNA polymerase binds to these common parts of the sequences in order to initiate transcription of the gene. In most prokaryotes, the same RNA polymerase is able to transcribe all types of RNAs, whereas in eukaryotes, several different types of RNA polymerases are used depending on what kind of RNA is produced (mRNA, rRNA, tRNA).

In order to allow the DNA antisense strand to be used for base pairing, the DNA helix must first be locally unwound. This unwinding process starts at the promoter site to which RNA polymerase binds. The location at which the RNA strand begins to be synthesized is called the initiation site and is defined as position +1 in the gene. As shown in figure 1.5, RNA polymerase adds ribonucleotides to the 3' end of the RNA. After this, the helix is re-formed once again.

Since the transcript must eventually terminate, how does the RNA polymerase know when to stop synthesizing RNA? This is accomplished by the recognition of certain specific DNA sequences, called terminators, that signal the termination of transcription, causing the RNA polymerase to be released from the template and ending the RNA synthesis. Although there are several mechanisms for termination, a common direct mechanism in prokaryotes is a terminator sequence arranged in such a way that it contains self-complementary regions that can form stem-loop or hairpin structures in the RNA product. Such a structure, shown in figure 1.6, can cause the polymerase to pause, thereby terminating transcription. It is interesting to note that the hairpin structure is often GC-rich, making the self-complementary base pairing stronger because of the higher stability of G · C base pairs relative to A · U base pairs. Moreover, there are usually several U bases at the end of the hairpin structure, which, because of the relatively weaker A · U base pairs, facilitates dissociation of the RNA.
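To make the idea of a self-complementary region concrete, here is a rough Python sketch that measures how long a stem forms when the two ends of an RNA segment pair with each other. It is a simplification under stated assumptions: only Watson-Crick pairs are counted, the stem must start at the very ends of the segment, and the function name, the min_loop parameter, and the example sequence are all hypothetical.

```python
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def stem_length(rna: str, min_loop: int = 3) -> int:
    """Count base pairs formed by pairing the 5' end with the 3' end, inward.

    Pairing stops at the first mismatch, or when fewer than min_loop
    unpaired bases would remain between the two paired positions.
    """
    i, j = 0, len(rna) - 1
    paired = 0
    while j - i - 1 >= min_loop and RNA_COMPLEMENT[rna[i]] == rna[j]:
        paired += 1
        i += 1
        j -= 1
    return paired


# Hypothetical GC-rich segment: stem GCCGCC, loop UUCG, stem GGCGGC.
hairpin = "GCCGCCUUCGGGCGGC"
print(stem_length(hairpin))     # 6
print(stem_length("AUGCAUGC"))  # 0
```

A real terminator search would scan a transcript for such segments and also look for the run of U bases that follows; this sketch only tests a single candidate segment.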

In eukaryotes, the transcriptional machinery is somewhat more complicated than in prokaryotes since the primary RNA transcript (pre-mRNA) must first be processed before being transported out of the nucleus (recall that prokaryotes have no nucleus). This processing first involves capping, the addition of a 7-methylguanosine molecule to the 5' end of the transcript, linked by a triphosphate bond. This typically occurs before the RNA chain is 30 nucleotides long. This cap structure serves to stabilize the transcript but is also important for splicing and translation. At the 3' end, a specific sequence (5'-AAUAAA-3') is recognized by an enzyme which then cuts off the RNA approximately 20 bases further down, and a poly(A) tail is added at the 3' end. The poly(A) tail consists of a run of up to 250 adenine nucleotides and is believed to help in the translation of mRNA in the cytoplasm. This process is called 3' cleavage and polyadenylation.
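The 3' cleavage and polyadenylation step can be mimicked on a string with a few lines of Python. This is a toy sketch, not a model of the enzyme: the function name and parameters are hypothetical, the cut is placed exactly 20 bases after the signal although the text says "approximately", the first occurrence of the signal is used, and the tail length is a fixed number chosen by the caller.

```python
POLY_A_SIGNAL = "AAUAAA"


def cleave_and_polyadenylate(pre_mrna: str, offset: int = 20,
                             tail_length: int = 250) -> str:
    """Cut the transcript `offset` bases after the signal and add a poly(A) tail."""
    pos = pre_mrna.find(POLY_A_SIGNAL)  # simplification: first occurrence only
    if pos == -1:
        raise ValueError("no AAUAAA signal found in transcript")
    cut = pos + len(POLY_A_SIGNAL) + offset
    # Slicing past the end is harmless: a short transcript is kept whole.
    return pre_mrna[:cut] + "A" * tail_length


# Hypothetical transcript: 3 bases, the signal, then 30 downstream bases.
transcript = "GGC" + "AAUAAA" + "C" * 30
processed = cleave_and_polyadenylate(transcript, tail_length=5)
print(processed)       # GGCAAUAAACCCCCCCCCCCCCCCCCCCCAAAAA
print(len(processed))  # 34
```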

The final step in converting pre-mRNA into mature mRNA involves splicing, or removal of the introns and joining of the exons. In order for splicing to occur, certain nucleotide sequences must also be present. The 5' end of an intron almost always contains a 5'-GU-3' sequence, and the 3' end contains a 5'-AG-3' sequence. The AG sequence is preceded by a polypyrimidine tract, a pyrimidine-rich sequence. Further upstream there is a sequence called a branchpoint sequence, which is 5'-CU(A/G)A(C/U)-3' in vertebrates. The splice sites as well as the conserved sequences related to intron splicing are shown in figure 1.7.
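As a hedged illustration of the GU...AG rule, the Python sketch below removes introns whose coordinates are already known and joins the remaining exons. Everything here is an assumption made for the example: the function name, the half-open (start, end) coordinate convention, and the toy sequences. It does not attempt to find introns or to check the branchpoint or polypyrimidine tract.

```python
def splice(pre_mrna, introns):
    """Remove introns given as (start, end) half-open index pairs; join exons.

    Each intron must begin with GU and end with AG, mirroring the conserved
    splice-site sequences described in the text.
    """
    exons = []
    previous_end = 0
    for start, end in sorted(introns):
        intron = pre_mrna[start:end]
        if not (intron.startswith("GU") and intron.endswith("AG")):
            raise ValueError(f"intron at {start}-{end} lacks GU...AG boundaries")
        exons.append(pre_mrna[previous_end:start])
        previous_end = end
    exons.append(pre_mrna[previous_end:])
    return "".join(exons)


# Hypothetical pre-mRNA: exon, intron (GU ... CUAAC ... pyrimidines ... AG), exon.
exon1 = "AUGGCC"
intron = "GUAAGU" + "CUAAC" + "UUUUCCC" + "AG"
exon2 = "GAAUGA"
pre_mrna = exon1 + intron + exon2

start = len(exon1)
end = start + len(intron)
print(splice(pre_mrna, [(start, end)]))  # AUGGCCGAAUGA
```

The toy intron includes CUAAC, which matches the vertebrate branchpoint pattern 5'-CU(A/G)A(C/U)-3' quoted above, followed by a short pyrimidine-rich stretch before the terminal AG.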

(Continues...)

Excerpted from Genomic Signal Processing by Ilya Shmulevich and Edward R. Dougherty
Copyright © 2007 by Princeton University Press. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.

Table of Contents

Preface
Biological Foundations
    Genetics
    Nucleic Acid Structure
    Genes
    RNA
    Transcription
    Proteins
    Translation
    Transcriptional Regulation
    Genomics
    Microarray Technology
    Proteomics
    Bibliography
Deterministic Models of Gene Networks
    Graph Models
    Boolean Networks
    Cell Differentiation and Cellular Functional States
    Network Properties and Dynamics
    Network Inference
    Generalizations of Boolean Networks
    Asynchrony
    Multivalued Networks
    Differential Equation Models
    A Differential Equation Model Incorporating Transcription and Translation
    Discretization of the Continuous Differential Equation Model
    Bibliography
Stochastic Models of Gene Networks
    Bayesian Networks
    Probabilistic Boolean Networks
    Definitions
    Inference
    Dynamics of PBNs
    Steady-State Analysis of Instantaneously Random PBNs
    Relationships of PBNs to Bayesian Networks
    Growing Subnetworks from Seed Genes
    Intervention
    Gene Intervention
    Structural Intervention
    External Control
    Bibliography
Classification
    Bayes Classifier
    Classification Rules
    Consistent Classifier Design
    Examples of Classification Rules
    Constrained Classifiers
    Shatter Coefficient
    VC Dimension
    Linear Classification
    Rosenblatt Perceptron
    Linear and Quadratic Discriminant Analysis
    Linear Discriminants Based on Least-Squares Error
    Support Vector Machines
    Representation of Design Error for Linear Discriminant Analysis
    Distribution of the QDA Sample-Based Discriminant
    Neural Networks Classifiers
    Classification Trees
    Classification and Regression Trees
    Strongly Consistent Rules for Data-Dependent Partitioning
    Error Estimation
    Resubstitution
    Cross-validation
    Bootstrap
    Bolstering
    Error Estimator Performance
    Feature Set Ranking
    Error Correction
    Robust Classifiers
    Optimal Robust Classifiers
    Performance Comparison for Robust Classifiers
    Bibliography
Regularization
    Data Regularization
    Regularized Discriminant Analysis
    Noise Injection
    Complexity Regularization
    Regularization of the Error
    Structural Risk Minimization
    Empirical Complexity
    Feature Selection
    Peaking Phenomenon
    Feature Selection Algorithms
    Impact of Error Estimation on Feature Selection
    Redundancy
    Parallel Incremental Feature Selection
    Bayesian Variable Selection
    Feature Extraction
    Bibliography
Clustering
    Examples of Clustering Algorithms
    Euclidean Distance Clustering
    Self-Organizing Maps
    Hierarchical Clustering
    Model-Based Cluster Operators
    Cluster Operators
    Algorithm Structure
    Label Operators
    Bayes Clusterer
    Distributional Testing of Cluster Operators
    Cluster Validation
    External Validation
    Internal Validation
    Instability Index
    Bayes Factor
    Learning Cluster Operators
    Empirical-Error Cluster Operator
    Nearest-Neighbor Clustering Rule
    Bibliography
Index

What People are Saying About This

Olli Yli-Harja

There is a genuine need for this concise, informative, clearly written book. In systems biology, engineers, mathematicians, and computer scientists are collaborating increasingly with biologists and researchers in medicine. This book goes a long way toward narrowing the gap on this front, and it lays a rigorous foundation for a new discipline.
— Olli Yli-Harja, Tampere University of Technology

Editorial Reviews

"Genomic Signal Processing makes a major contribution to computational biology, systems biology, and translational genomics by providing a self-contained explanation of the fundamental mathematical issues facing researchers in four areas: classification, clustering, network modeling, and network intervention. . . . The authors' substantial accomplishments in this area will inspire researchers and students alike. The book provides a much-needed stepping stone so that researchers can cross the gap on this front in their efforts. Also assuredly, it will be a delight to read for anyone who is encountering the topic for the first time and is wishing to exploit the current findings and interpretations in systems biology."

Current Engineering Practice

"Overall, this book should be useful for individuals with a background in computer science and machine learning who wish to see the applications of mathematics to genomics."

Leon Glass, SIAM Review
