The bestselling introduction to bioinformatics and genomics – now in its third edition

Widely received in its previous editions, Bioinformatics and Functional Genomics offers the most broad-based introduction to this explosive new discipline. Now in a thoroughly updated and expanded third edition, it continues to be the go-to source for students and professionals involved in biomedical research.

This book provides up-to-the-minute coverage of the fields of bioinformatics and genomics. Features new to this edition include:
- Extensive revisions and a slight reorder of chapters for a more effective organization
- A brand new chapter on next-generation sequencing
- An expanded companion website, also updated as and when new information becomes available
- Greater emphasis on a computational approach, with clear guidance on how software tools work and introductions to the use of command-line tools such as software for next-generation sequence analysis, the R programming language, and NCBI search utilities
The book is complemented by lavish illustrations and more than 500 figures and tables - many newly-created for the third edition to enhance clarity and understanding. Each chapter includes learning objectives, a problem set, pitfalls section, boxes explaining key techniques and mathematics/statistics principles, a summary, recommended reading, and a list of freely available software. Readers may visit a related Web page for supplemental information such as PowerPoints and audiovisual files of lectures, and videocasts of how to perform many basic operations: www.wiley.com/go/pevsnerbioinformatics.

Bioinformatics and Functional Genomics, Third Edition serves as an excellent single-source textbook for advanced undergraduate and beginning graduate-level courses in the biological sciences and computer sciences. It is also an indispensable resource for biologists in a broad variety of disciplines who use the tools of bioinformatics and genomics to study particular research problems; bioinformaticists and computer scientists who develop computer algorithms and databases; and medical researchers and clinicians who want to understand the genomic basis of viral, bacterial, parasitic, or other diseases.
Edition description: New Edition
Product dimensions: 8.90(w) x 11.00(h) x 1.90(d)
About the Author
Jonathan Pevsner, PhD, is a Professor in the Department of Neurology at Kennedy Krieger Institute, an internationally recognized institution dedicated to improving the lives of children with neurodevelopmental disorders. He holds a primary faculty appointment as Professor in the Department of Psychiatry and Behavioral Sciences (Johns Hopkins University School of Medicine). He holds joint or secondary appointments in the Department of Neuroscience, the Institute of Genetic Medicine, and the Division of Health Sciences Informatics (Johns Hopkins School of Medicine), and the Department of Molecular Microbiology and Immunology (Johns Hopkins Bloomberg School of Public Health). He has taught bioinformatics courses since 2000 at the Johns Hopkins School of Medicine, and was awarded Teacher of the Year honors by the Graduate Student Association in both 2001 and 2006, the Professors’ Award for Excellence in Teaching awarded by the medical faculty (2003), Teacher of the Year (Advanced Academic Programs, 2009), and Teaching Excellence Award in the Johns Hopkins Bloomberg School of Public Health (2011). In 2013 his lab used whole genome sequencing and reported a mutation that causes a rare disease, Sturge-Weber syndrome, as well as a commonly occurring port-wine stain birthmark.
Read an Excerpt
Bioinformatics and Functional Genomics
By Jonathan Pevsner
John Wiley & Sons
Copyright © 2003 Wiley-Liss
All rights reserved.
Bioinformatics represents a new field at the interface of the twentieth-century revolutions in molecular biology and computers. A focus of this new discipline is the use of computer databases and computer algorithms to analyze proteins, genes, and the complete collection of deoxyribonucleic acid (DNA) that comprises an organism (the genome). A major challenge in biology is to make sense of the enormous quantities of sequence data and structural data that are generated by genome-sequencing projects, proteomics, and other large-scale molecular biology efforts. The tools of bioinformatics include computer programs that help to reveal fundamental mechanisms underlying biological problems related to the structure and function of macromolecules, biochemical pathways, disease processes, and evolution.
According to a National Institutes of Health (NIH) definition, bioinformatics is "research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, analyze, or visualize such data." The related discipline of computational biology is "the development and application of data-analytical and theoretical methods, mathematical modeling and computational simulation techniques to the study of biological, behavioral, and social systems."
While the discipline of bioinformatics focuses on the analysis of molecular sequences, genomics and functional genomics are two closely related disciplines. The goal of genomics is to determine and analyze the complete DNA sequence of an organism, that is, its genome. The DNA encodes genes, which can be expressed as ribonucleic acid (RNA) transcripts and then translated into protein. Functional genomics describes the use of genomewide assays to study gene and protein function.
The aim of this book is to explain both the theory and practice of bioinformatics. The book is especially designed to help the biology student use computer programs and databases to solve biological problems related to proteins, genes, and genomes. Bioinformatics is an integrative discipline, and our focus on individual proteins and genes is part of a larger effort to understand broad issues in biology such as the relationship of structure to function, development, and disease.
Organization of The Book
There are three main sections of the book. The first part explains how to access biological sequence data, particularly DNA and protein sequences (Chapter 2). Once sequences are obtained, we show how to compare two sequences (pairwise alignment; Chapter 3) and how to compare multiple sequences [primarily by the Basic Local Alignment Search Tool (BLAST); Chapters 4 and 5].
The second part of the book describes functional genomics approaches to RNA and protein. The central dogma of biology states that DNA is transcribed into RNA then translated into protein. We will examine gene expression, including a description of the emerging technology of DNA microarrays (Chapters 6 and 7). We then consider proteins from the perspective of protein families, the analysis of individual proteins, protein structure, and multiple sequence alignment (Chapters 8-10). The relationships of protein and DNA sequences that are multiply aligned can be visualized in phylogenetic trees (Chapter 11). Chapter 11 thus introduces the subject of molecular evolution.
Since 1995, the genomes have been sequenced for several hundred bacteria and archaea as well as fungi, animals, and plants. The third section of the book covers genome analysis. Chapter 12 provides an overview of the study of completed genomes and then descriptions of how the tools of bioinformatics can elucidate the tree of life. We describe bioinformatics resources for the study of viruses (Chapter 13) and bacteria and archaea (Chapter 14; these are two of the three main branches of life). Next we examine a variety of eukaryotes (from fungi to primates; Chapters 15 and 16) and then the human genome (Chapter 17). Finally, we explore bioinformatic approaches to human disease (Chapter 18).
Bioinformatics: The Big Picture
We can summarize the entire field of bioinformatics with three perspectives. The first perspective on bioinformatics is the cell (Fig. 1.1). The central dogma of molecular biology is that DNA is transcribed into RNA and translated into protein. The focus of molecular biology has been on individual genes, messenger RNA (mRNA) transcripts, and proteins. A focus of the field of bioinformatics is the complete collection of DNA (the genome), RNA (the transcriptome), and protein sequences (the proteome) that have been amassed (Henikoff, 2002). These millions of molecular sequences present both great opportunities and great challenges. A bioinformatics approach to molecular sequence data involves the application of computer algorithms and computer databases to molecular and cellular biology. Such an approach is sometimes referred to as functional genomics. This typifies the essential nature of bioinformatics: biological questions can be approached from levels ranging from single genes and proteins to cellular pathways and networks or even whole genomic responses (Ideker et al., 2001). Our goals are to understand how to study both individual genes and proteins and collections of thousands of genes/proteins.
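The flow of information described by the central dogma can be illustrated with a minimal Python sketch. It is not taken from the book: the function names, the toy sequence, and the deliberately partial codon table are all assumptions made for illustration, and a real analysis would use a complete genetic code table.

```python
# Minimal sketch of the central dogma: DNA -> mRNA -> protein.
# The codon table is intentionally partial; it only covers this toy example.
CODON_TABLE = {
    "AUG": "M",                                      # start codon (methionine)
    "UUU": "F", "UUC": "F",                          # phenylalanine
    "AAA": "K", "AAG": "K",                          # lysine
    "UGG": "W",                                      # tryptophan
    "GGU": "G", "GGC": "G", "GGA": "G", "GGG": "G",  # glycine
    "UAA": "*", "UAG": "*", "UGA": "*",              # stop codons
}


def transcribe(coding_dna: str) -> str:
    """Return the mRNA for a coding-strand DNA sequence (T replaced by U)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate an mRNA codon by codon until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        # "X" marks any codon missing from the partial table above.
        residue = CODON_TABLE.get(mrna[i:i + 3], "X")
        if residue == "*":
            break
        protein.append(residue)
    return "".join(protein)


example_dna = "ATGTTTAAATGGGGCTAA"  # hypothetical toy sequence, not a real gene
example_mrna = transcribe(example_dna)
example_protein = translate(example_mrna)
print(example_mrna, example_protein)
```

Applied to a whole genome rather than a single toy gene, the same two steps are what turn a collection of DNA sequences into a transcriptome and a proteome.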
From the cell we can focus on individual organisms, which represents the second perspective of the field of bioinformatics (Fig. 1.2). Each organism changes across different stages of development and (for multicellular organisms) across different regions of the body. For example, while we may sometimes think of genes as static entities that specify features such as eye color or height, they are in fact dynamically regulated across time and region and in response to physiological state. Gene expression varies in disease states or in response to a variety of signals, both intrinsic and environmental. Many bioinformatics tools are available to study the broad biological questions relevant to the individual: There are many databases of expressed genes and proteins derived from different tissues and conditions. One of the most powerful applications of functional genomics is the use of DNA microarrays to measure the expression of thousands of genes in biological samples.
At the largest scale is the tree of life (Fig. 1.3) (Chapter 12). There are many millions of species alive today, and they can be grouped into the three major branches of bacteria, archaea (single-celled microbes that tend to live in extreme environments), and eukaryotes. Molecular sequence databases currently hold DNA sequence from over 100,000 different organisms. The complete genome sequences of several hundred organisms will soon become available. One of the main lessons we are learning is the fundamental unity of life at the molecular level. We are also coming to appreciate the power of comparative genomics, in which genomes are compared.
Figure 1.4 on the following page presents the contents of this book in the context of the three perspectives of bioinformatics.
A Consistent Example: Retinol-Binding Protein
Throughout this book we will focus on the example of a gene and its corresponding protein product: retinol-binding protein (RBP4), a small, abundant secreted protein that binds retinol (vitamin A) in blood (Newcomer and Ong, 2000). Retinol, obtained from carrots in the form of vitamin A, is very hydrophobic. RBP4 helps transport this ligand to the eye where it is used for vision. We will study RBP4 in detail because it has a number of interesting features:
- There are many proteins that are homologous to RBP4 in a variety of species, including human, mouse, and fish ("orthologs"). We will use these as examples of how to align proteins, perform database searches, and study phylogeny (Chapters 2-11).
- There are other human proteins that are closely related to RBP4 ("paralogs"). Altogether the family that includes RBP4 is called the lipocalins, a diverse group of small ligand-binding proteins that tend to be secreted into extracellular spaces (Akerstrom et al., 2000; Flower et al., 2000). Other lipocalins with fascinating functions include apolipoprotein D (which binds cholesterol), a pregnancy-associated lipocalin, aphrodisin (an "aphrodisiac" in hamsters), and an odorant-binding protein in mucus.
- There are even bacterial lipocalins, which could have a role in antibiotic resistance (Bishop, 2000). We will explore how bacterial lipocalins could be ancient genes that entered eukaryotic genomes by a process called lateral gene transfer.
- The gene expression levels of some lipocalins are dramatically regulated (Chapters 6 and 7).
- Because the lipocalins are small, abundant, and soluble proteins, their biochemical properties have been characterized in detail. The three-dimensional protein structure has been solved for several of them by X-ray crystallography (Chapter 9).
- Some lipocalins have been implicated in human disease (Chapter 18).
Another molecule we will introduce is the pol (polymerase) gene of human immunodeficiency virus 1 (HIV-1). HIV presents one of the greatest public health challenges in the world today. Over 42 million people are infected as of the end of the year 2002 and over 16 million people have died. The HIV-1 genome encodes just nine proteins, including pol (Frankel and Young, 1998). We will examine pol throughout the book because the properties of this gene, its protein products, and the HIV-1 genome are distinct from the lipocalins.
The pol gene encodes a multidomain protein: a single polypeptide with several structurally and functionally distinct domains. The protein is 1003 amino acids long and has reverse transcriptase activity (that is, it is an RNA-dependent DNA polymerase). It is also an aspartyl protease, and it has integrase activity. These multiple activities are typical of multidomain proteins. The modular nature of the pol protein affects our ability to perform database searches (Chapters 4 and 5) and multiple sequence alignments (Chapters 8 and 10). The pol gene incorporates substitutions extremely rapidly. A typical individual infected by HIV may have over a million variants of pol. The study of the evolution of pol complements our study of the lipocalins (Chapter 11). Because pol is a viral protein, studying it gives us the opportunity to learn how to access bioinformatics resources relevant to viruses (Chapter 13). Database searches with pol will help emphasize how to restrict searches to particular domains of the tree of life.
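To make the idea of sequence variants concrete, the following Python sketch counts substitutions between two sequences that have already been aligned to the same length. It is an illustration under stated assumptions rather than a method from the book: the function names and the two short sequences are hypothetical, gaps are assumed to be written as "-", and positions with a gap in either sequence are simply ignored.

```python
def count_substitutions(seq_a: str, seq_b: str) -> int:
    """Count aligned positions where both sequences have a residue and they differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1 for a, b in zip(seq_a, seq_b)
        if a != "-" and b != "-" and a != b
    )


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of gap-free aligned positions at which the residues match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        raise ValueError("no gap-free positions to compare")
    matches = sum(a == b for a, b in compared)
    return 100.0 * matches / len(compared)


# Hypothetical toy variants (not real pol sequences) that differ at one position.
variant_1 = "ACGTTGCA"
variant_2 = "ACGATGCA"
print(count_substitutions(variant_1, variant_2))
print(percent_identity(variant_1, variant_2))
```

Comparing many variants pairwise in this way produces the kind of distance information that phylogenetic methods build on.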
Organization of The Chapters
The chapters of this book are intended to provide both the theory of bioinformatics subjects and a practical guide to using computer databases and algorithms. Web resources are provided throughout each chapter. Chapters end with brief sections called Perspective and Pitfalls. The perspective feature describes the rate of growth of the subject matter in each chapter. For example, a perspective on Chapter 2 (access to sequence information) is that the amount of DNA sequence data deposited in GenBank is undergoing an explosive rate of growth. In contrast, an area such as pairwise sequence alignment, which is fundamental to the entire field of bioinformatics (Chapter 3), was firmly established in the 1970s and 1980s.
The pitfalls section of each chapter describes some common difficulties encountered by biologists using bioinformatics tools. Some errors might seem trivial, such as searching a DNA database with a protein sequence. Other pitfalls are more subtle, such as artifacts caused by multiple sequence alignment programs depending upon the type of algorithm that is selected. Indeed, while the field of bioinformatics depends substantially on analyzing sequence data, it is important to recognize that there are many categories of errors associated with data generation, collection, storage, and analysis.
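The first pitfall mentioned above, sending a protein query to a DNA database, can be caught with a simple check before a search is launched. The Python sketch below is a heuristic offered under stated assumptions, not a feature of any particular search tool: the function names, the 0.9 threshold, and the toy sequences are all hypothetical.

```python
DNA_LETTERS = set("ACGTN")  # N is commonly used for an unknown base


def guess_sequence_type(sequence: str, threshold: float = 0.9) -> str:
    """Guess 'dna' or 'protein' from the fraction of letters that are A, C, G, T, or N."""
    letters = [ch for ch in sequence.upper() if ch.isalpha()]
    if not letters:
        raise ValueError("sequence contains no letters")
    dna_fraction = sum(ch in DNA_LETTERS for ch in letters) / len(letters)
    return "dna" if dna_fraction >= threshold else "protein"


def check_query(sequence: str, database_type: str) -> None:
    """Raise ValueError if the guessed query type does not match the database type."""
    query_type = guess_sequence_type(sequence)
    if query_type != database_type:
        raise ValueError(
            f"query looks like {query_type} but the database holds {database_type}"
        )


# Hypothetical toy sequences used only to exercise the check.
print(guess_sequence_type("ATGGCGTACGTTAGC"))
print(guess_sequence_type("MKWVWALLLLAALGSGRA"))
```

A composition-based guess like this can be fooled by very short or unusual sequences, so it is best treated as a warning rather than a hard rule.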
Each chapter offers multiple-choice quizzes, which test your understanding of the chapter materials. There are also problems that require you to apply the concepts presented in each chapter. These problems may form the basis of a computer laboratory for a bioinformatics course.
The references at the end of each chapter are accompanied by an annotated list of recommended articles. This suggested reading section includes classic papers that show how the principles described in each chapter were discovered. Particularly helpful review articles and research papers are highlighted.
Suggestions For Students and Teachers: Web Exercises And Find-a-Gene
Often, students of bioinformatics have a particular research area of interest such as a gene, a physiological process, a disease, or a genome. It is hoped that by studying RBP4 and other specific proteins and genes throughout this book, students can simultaneously apply the principles of bioinformatics to their own research questions.
In teaching a course on bioinformatics at Johns Hopkins, it has been helpful to complement lectures with computer labs. All the websites described in this book are freely available on the World Wide Web, and many of the software packages are free for academic use.
Another feature of the Johns Hopkins course is that each student is required to discover a novel gene by the last day of the course. The student must begin with any protein sequence of interest and perform database searches to identify genomic DNA that encodes a protein no one has described before. This problem is described in Chapter 5 (see Fig. 5.17). The student thus chooses the name of the gene and its corresponding protein and describes information about the organism and evidence that the gene has not been described before. Then, the student creates a multiple sequence alignment of the new protein (or gene) and creates a phylogenetic tree showing its relation to other known sequences.
Each year, some beginning students are slightly apprehensive about accomplishing this exercise, but in the end all of them succeed. A benefit of this exercise is that it requires a student to actively use the principles of bioinformatics. Most students choose a gene (or protein) relevant to their own research area, while others find new lipocalins.
Teaching bioinformatics is notable for the diversity of students learning this new discipline. Each chapter provides background on the subject matter. For more advanced students, several key research papers are cited at the end of each chapter. These papers are technical, and reading them along with the chapters will provide a deeper understanding of the material. The suggested reading section also includes review articles.
Excerpted from Bioinformatics and Functional Genomics by Jonathan Pevsner Copyright © 2003 by Wiley-Liss. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Table of Contents
Part I Analyzing DNA, RNA, and Protein Sequences
1 Introduction
2 Access to Sequence Data and Related Information
3 Pairwise Sequence Alignment
4 Basic Local Alignment Search Tool (BLAST)
5 Advanced Database Searching
6 Multiple Sequence Alignment
7 Molecular Phylogeny and Evolution
Part II Genomewide Analysis of DNA, RNA, and Protein
8 DNA: The Eukaryotic Chromosome
9 Analysis of Next-Generation Sequence Data
10 Bioinformatic Approaches to Ribonucleic Acid (RNA)
11 Gene Expression: Microarray and RNA-seq Data Analysis
12 Protein Analysis and Proteomics
13 Protein Structure
14 Functional Genomics
Part III Genome Analysis
15 Genomes Across the Tree of Life
16 Completed Genomes: Viruses
17 Completed Genomes: Bacteria and Archaea
18 Eukaryotic Genomes: Fungi
19 Eukaryotic Genomes: From Parasites to Primates
20 Human Genome
21 Human Disease
Self-Test Quiz: Solutions
Author Index
Subject Index
What People are Saying About This
"I was particularly impressed by the comprehensible and comprehensive treatment of BLAST - the best that I have seen. One is guided from choosing the appropriate type of BLAST program, database and search parameters through to refining and analysing the significance of the search results, all illustrated with clear examples."
David P. Leader, University of Glasgow
"I would not hesitate for a moment to propose Jonathan Pevsner’s new book as a standard course for biologists who need a serious, practical knowledge of modern bioinformatics. Dr. Pevsner does a masterful job at presenting virtually every major topic in bioinformatics and computational genomics, from the basics of sequence analysis, to microarray data classification, accurately and at a considerable level of detail but without any complex mathematics. In addition to being an extremely useful textbook, Pevsner’s book is a very nice read, due in large part, to carefully constructed questions and suggestions for discussion, and wonderful historical vignettes. In short, a great bioinformatics book for biologists!"
Eugene V. Koonin, Ph.D., National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland