The Barnes & Noble Review
It's hard to believe nowadays, but it wasn't long ago that biology and computing were about as far apart as two sciences could be. That was before the Human Genome Project, before the Protein Data Bank, before the explosion in biological data that has taken place in the past few years -- and the explosion in computational analysis tools for making sense of it all. Even more strikingly, biological experimentation is increasingly taking place "in silico" -- in simulations running in a computer, not in a test tube.
Put simply, programming is becoming a critical skill for more and more biologists. James Tisdall's timely Beginning Perl for Bioinformatics will help them gain the specific programming skills they'll need in their day-to-day work.
This book's examples and exercises -- and there are many -- almost all focus on real biological problems. Many of them use biological data sources working biologists will recognize. And the choice of Perl as a language for bioinformatics is apt: It has a shallow learning curve, it's portable and fast, and, if written properly, it requires relatively little maintenance.
Tisdall starts at the highest level. You've been handed a problem -- a simple one, to help you get started. You need to count the regulatory elements in DNA. Where would you start? He walks you through identifying the inputs you'll need, establishing your overall program design, planning for output, refining your design using pseudocode -- an informal program that doesn't bother with correct syntax -- and, finally, writing a real, runnable program.
By Chapter 4, you're writing programs that represent DNA and protein sequence data, transcribe DNA to RNA, concatenate sequences, make the reverse complement of sequences, and read sequence data from files. (These are not examples you'd find in the "Camel" book, O'Reilly's classic introduction to Perl -- or, for that matter, in any other Perl book we've seen!) Of course, as you're writing these programs, you're also learning how to work with scalar and array variables, handling string operations, reading from files -- techniques you'll use constantly.
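To give a flavor of those Chapter 4 tasks, here is a minimal sketch of transcription and reverse complementation in Perl. It is not taken from the book; the example sequence and the sub names transcribe and revcom are made up for illustration.

```perl
use strict;
use warnings;

# Transcribe DNA to RNA: every T becomes U (case is preserved).
sub transcribe {
    my ($dna) = @_;
    (my $rna = $dna) =~ tr/Tt/Uu/;
    return $rna;
}

# Reverse complement: reverse the string, then swap each base for its partner.
sub revcom {
    my ($dna) = @_;
    my $rc = reverse $dna;
    $rc =~ tr/ACGTacgt/TGCAtgca/;
    return $rc;
}

# Hypothetical example sequence.
my $dna = 'ACGGGAGGACGGGAAAATTACTACGGCATTAGC';
print "DNA:    $dna\n";
print "RNA:    ", transcribe($dna), "\n";
print "RevCom: ", revcom($dna), "\n";
```

Using tr/// rather than a chain of substitutions avoids the classic mistake of turning A into T and then turning that same T back into A.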
Perl is just super-duper at finding patterns, and if you're a biologist working with DNA or proteins, it won't take you long to find good applications for it. Chapter 5 teaches you how to search for motifs -- for example, regulatory elements of DNA or short stretches of protein that exist in multiple species -- and examine sequence data in detail. Along the way, you're learning how to use conditional tests, regular expressions, and string operations -- more meat-and-potatoes Perl stuff.
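As a rough illustration of that kind of motif search (a sketch, not the book's code), a regular expression with the /g modifier can report where each match starts. The sub name find_motif, the sequence, and the TATA-like motif are all invented for this example.

```perl
use strict;
use warnings;

# Return the 1-based start position of each non-overlapping match of $motif.
sub find_motif {
    my ($seq, $motif) = @_;
    my @positions;
    while ($seq =~ /$motif/ig) {
        push @positions, $-[0] + 1;   # $-[0] is the 0-based offset of the match
    }
    return @positions;
}

# Hypothetical sequence and motif.
my $seq   = 'ATGCGTATATAAGGCTATAAAC';
my $motif = 'TATA[AT]A';

my @hits = find_motif($seq, $motif);
print @hits ? "Motif found at: @hits\n" : "Motif not found\n";
```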
Mutation is a random process, and Tisdall spends a full chapter on randomization: modeling mutations with random numbers, using random numbers to generate DNA sequence data sets, repeatedly mutating DNA to understand how mutations accumulate, and more. Using hash datatypes, you'll learn how to write Perl programs that simulate how the genetic code translates DNA into proteins.
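The hash-based translation mentioned here might look something like the following sketch. It assumes the standard genetic code and builds the codon table from a compact 64-letter string, which is a shortcut chosen for brevity and not necessarily how the book writes it. The sub name translate is hypothetical.

```perl
use strict;
use warnings;

# Build the codon => amino acid hash for the standard genetic code.
# Codons are enumerated with each base cycling through T, C, A, G.
my %genetic_code;
my @bases = qw(T C A G);
my @amino = split //,
    'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $genetic_code{"$b1$b2$b3"} = shift @amino;
        }
    }
}

# Translate DNA codon by codon; '*' marks a stop, 'X' an unrecognized codon.
sub translate {
    my ($dna) = @_;
    $dna = uc $dna;
    my $protein = '';
    for (my $i = 0; $i + 3 <= length($dna); $i += 3) {
        my $codon = substr($dna, $i, 3);
        $protein .= exists $genetic_code{$codon} ? $genetic_code{$codon} : 'X';
    }
    return $protein;
}

print translate('ATGGCCTAA'), "\n";
```

A hash fits this problem because the lookup is by name (the codon) rather than by position, and each lookup takes the same time regardless of table size.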
Next, Tisdall focuses on computing restriction maps, which help biologists determine where best to cut a DNA molecule in order to insert a new gene; and on restriction digests, one of the first methods for "fingerprinting" DNA. In so doing, he helps you deepen your skills with regular expressions, and offers practical advice on representing Restriction Enzyme Database data with them.
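One way to picture the regular-expression step (a sketch under assumptions, not the book's implementation): recognition sites written with IUPAC ambiguity codes can be converted into character classes and then searched for. The sub names iupac_to_regex and find_sites and the test sequence are invented here; GAATTC (EcoRI) and GTYRAC (HincII) are the usual recognition sites for those enzymes.

```perl
use strict;
use warnings;

# IUPAC nucleotide codes mapped to regular-expression character classes.
my %iupac = (
    A => 'A',     C => 'C',     G => 'G',     T => 'T',
    R => '[GA]',  Y => '[CT]',  M => '[AC]',  K => '[GT]',
    S => '[GC]',  W => '[AT]',  B => '[CGT]', D => '[AGT]',
    H => '[ACT]', V => '[ACG]', N => '[ACGT]',
);

# Convert a recognition site such as GTYRAC into a regex string.
sub iupac_to_regex {
    my ($site) = @_;
    my $regex = '';
    for my $code (split //, uc $site) {
        die "Unknown IUPAC code: $code\n" unless exists $iupac{$code};
        $regex .= $iupac{$code};
    }
    return $regex;
}

# Return the 1-based start positions of a site, including overlapping ones.
sub find_sites {
    my ($seq, $site) = @_;
    my $regex = iupac_to_regex($site);
    my @positions;
    while ($seq =~ /(?=$regex)/ig) {   # lookahead lets matches overlap
        push @positions, $-[0] + 1;
    }
    return @positions;
}

# Hypothetical sequence.
my $seq = 'AAGAATTCGGGTCGACTT';
print "EcoRI  (GAATTC): @{[ find_sites($seq, 'GAATTC') ]}\n";
print "HincII (GTYRAC): @{[ find_sites($seq, 'GTYRAC') ]}\n";
```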
If you're working with the Genetic Sequence Data Bank (GenBank), Tisdall shows you how to extract information from it, search for patterns, parse its flat-file format to extract what you need, and create a Perl DBM database for rapid lookups on the data you work with most. There are chapters on working with the increasingly-important Protein Data Bank, which stores knowledge about the 3D structure of a growing collection of proteins; and finally, a brief introduction to the open source Bioperl modules, which can streamline sequence manipulation, access to biology databases, and other common bioinformatics tasks.
If you're a working biologist, or working on becoming one, Beginning Perl for Bioinformatics will be an invaluable resource -- and we've seen nothing like it.
Bill Camarda is a consultant, writer, and web/multimedia content developer with nearly 20 years' experience in helping technology companies deploy and market advanced software, computing, and networking products and services. He served for nearly ten years as vice president of a New Jersey-based marketing company, where he supervised a wide range of graphics and web design projects. His 15 books include Special Edition Using Word 2000 and Upgrading & Fixing Networks For Dummies®, Second Edition.
Read an Excerpt
Chapter 10: GenBank
GenBank (Genetic Sequence Data Bank) is a rapidly growing international repository of known genetic sequences from a variety of organisms. Its use is central to modern biology and to bioinformatics.
This chapter shows you how to write Perl programs to extract information from GenBank files and libraries. Exercises include looking for patterns; creating special libraries; and parsing the flat-file format to extract the DNA, annotation, and features. You will learn how to make a DBM database to create your own rapid-access lookups on selected data in a GenBank library.
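As a preview of the DBM idea, here is a minimal sketch that ties a hash to a DBM file and uses it to store the byte offset of each record in a library, keyed by accession number. It assumes the SDBM_File module that ships with Perl; the toy two-record library, the file names, and the sub name fetch_record are made up for illustration, and the chapter's own programs may be organized differently.

```perl
use strict;
use warnings;
use Fcntl;                      # supplies O_RDWR and O_CREAT
use SDBM_File;                  # a DBM implementation bundled with Perl
use File::Temp qw(tempdir);

# Build a toy two-record library (made-up records, for illustration only).
my $dir     = tempdir(CLEANUP => 1);
my $library = "$dir/toy.gb";
open(my $out, '>', $library) or die "Cannot write $library: $!";
for my $acc (qw(TOY00001 TOY00002)) {
    print $out "LOCUS       $acc\nACCESSION   $acc\nORIGIN\n        1 acgt\n//\n";
}
close($out);

# Tie a hash to a DBM file: accession number => byte offset of the record.
tie(my %offset, 'SDBM_File', "$dir/index", O_RDWR | O_CREAT, 0666)
    or die "Cannot open DBM file: $!";

open(my $in, '<', $library) or die "Cannot read $library: $!";
{
    local $/ = "//\n";          # each read returns one whole record
    my $start = tell($in);
    while (my $record = <$in>) {
        $offset{$1} = $start if $record =~ /^ACCESSION\s+(\S+)/m;
        $start = tell($in);
    }
}

# Look up one record by accession by jumping straight to its stored offset.
sub fetch_record {
    my ($accession) = @_;
    return undef unless exists $offset{$accession};
    seek($in, $offset{$accession}, 0);
    local $/ = "//\n";
    return scalar <$in>;
}

print fetch_record('TOY00002');
```

Storing offsets rather than whole records keeps the DBM file small, which matters because simple DBM implementations limit the size of each stored value.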
Perl is a great tool for dealing with GenBank files. It enables you to extract and use any of the detailed data in the sequence and in the annotation, such as in the FEATURES table and elsewhere. When I first started using Perl, I wrote a program that searched GenBank for all sequence records annotated as being located on human chromosome 22. I found many genes where that information was so deeply buried within the annotation that the major gene mapping database, Genome Database (GDB), hadn't included them in its chromosome map. I think you'll discover the same feeling of power over the information when you start applying Perl to GenBank files.
Most biologists are familiar with GenBank. Researchers can perform a search, e.g., a BLAST search on some query sequence, and collect a set of GenBank files of related sequences as a result. Because the GenBank records are maintained by the individual scientists who discovered the sequences, if you find some new sequence of interest, you can publish it in GenBank.
GenBank files have a great deal of information in them in addition to sequence data, including identifiers such as accession numbers and gene names, phylogenetic classification, and references to published literature. A GenBank file may also include a detailed FEATURES table that summarizes facts about the sequence, such as the location of the regulatory regions, the protein translation, and exons and introns.
GenBank is sometimes referred to as a databank or data store, which is different from a database. Databases typically have a relational structure imposed upon the data, including associated indices and links and a query language. GenBank in comparison is a flat file, that is, an ASCII text file that is easily readable by humans.
From its humble beginnings GenBank has rapidly grown, and the flat-file format has shown signs of strain during the growth. With a quickly advancing body of knowledge, especially one that's growing as quickly as genetic data, it's difficult for the design of a databank to keep up. Several reworkings of GenBank have been done, but the flat-file format--in all its frustrating glory--still remains.
Due to a certain flexibility in the content of some sections of a GenBank record, extracting the information you're looking for can be tricky. This flexibility is good, in that it allows you to put what you think is most important into the data's annotation. It's bad, because that same flexibility makes it harder to write programs that find and extract the desired annotations. As a result, the trend has been towards more structure in the annotations.
Since Perl's data structures and its use of regular expressions make it a good tool for manipulating flat files, Perl is especially well-suited to deal with GenBank data. Using these features in Perl and building on the skills you've developed from previous chapters, you can write programs to access the accumulated genetic knowledge of the scientific community in GenBank.
Since this is a beginning book that requires no programming experience, you should not expect to find the most finished, multipurpose software here. Instead you'll find a solid introduction to parsing and building fast lookup tables for GenBank files. If you've never done so, I strongly recommend you explore the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH) (http://www.ncbi.nlm.nih.gov). While you're at it, stop by the European Bioinformatics Institute (EBI) at http://www.ebi.ac.uk and the bioinformatics arm of the European Molecular Biology Laboratory (EMBL) at http://www.embl-heidelberg.de/. These are large, heavily funded governmental bioinformatics powerhouses, and they have (and distribute) a great deal of state-of-the-art bioinformatics software.
The primary repositories for genetic information are the NCBI GenBank, EMBL in Europe, and the DNA Data Bank of Japan (DDBJ). All have almost identical information due to international cooperative agreements. Each entry or record in GenBank or its mirror sites may contain identifying, descriptive, and genetic information in ASCII-format files. Each record is written in a specific standard format, organized so that both humans and computer programs can extract the desired information with reasonable ease.
Let's look at a relatively short GenBank record and at how the fields are defined, before writing any code. I'll save this information in a file called record.gb, for use in later programs.
LOCUS AB031069 2487 bp mRNA PRI 27-MAY-2000
DEFINITION Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
VERSION AB031069.1 GI:8100074
SOURCE Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE 1 (sites)
AUTHORS Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and
TITLE PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain,
is regulated by proteolysis
JOURNAL Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000)
REFERENCE 2 (bases 1 to 2487)
AUTHORS Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
TITLE Direct Submission
JOURNAL Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.
Tadahiro Fujino, Keio University School of Medicine, Department of
Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan
/note="a nuclear protein carrying a PHD finger and a CXXC
/product="protein containing CXXC domain 1"
BASE COUNT 564 a 715 c 768 g 440 t
1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat
301 gcgcccatct actgcatctg ccgcaaaccg gacatcaact gcttcatgat cgggtgtgac
361 aactgcaatg agtggttcca tggggactgc atccggatca ctgagaagat ggccaaggcc
421 atccgggagt ggtactgtcg ggagtgcaga gagaaagacc ccaagctaga gattcgctat
481 cggcacaaga agtcacggga gcgggatggc aatgagcggg acagcagtga gccccgggat
541 gagggtggag ggcgcaagag gcctgtccct gatccagacc tgcagcgccg ggcagggtca
601 gggacagggg ttggggccat gcttgctcgg ggctctgctt cgccccacaa atcctctccg
661 cagcccttgg tggccacacc cagccagcat caccagcagc agcagcagca gatcaaacgg
721 tcagcccgca tgtgtggtga gtgtgaggca tgtcggcgca ctgaggactg tggtcactgt
781 gatttctgtc gggacatgaa gaagttcggg ggccccaaca agatccggca gaagtgccgg
841 ctgcgccagt gccagctgcg ggcccgggaa tcgtacaagt acttcccttc ctcgctctca
901 ccagtgacgc cctcagagtc cctgccaagg ccccgccggc cactgcccac ccaacagcag
961 ccacagccat cacagaagtt agggcgcatc cgtgaagatg agggggcagt ggcgtcatca
1021 acagtcaagg agcctcctga ggctacagcc acacctgagc cactctcaga tgaggaccta
1081 cctctggatc ctgacctgta tcaggacttc tgtgcagggg cctttgatga ccatggcctg
1141 ccctggatga gcgacacaga agagtcccca ttcctggacc ccgcgctgcg gaagagggca
1201 gtgaaagtga agcatgtgaa gcgtcgggag aagaagtctg agaagaagaa ggaggagcga
1261 tacaagcggc atcggcagaa gcagaagcac aaggataaat ggaaacaccc agagagggct
1321 gatgccaagg accctgcgtc actgccccag tgcctggggc ccggctgtgt gcgccccgcc
1381 cagcccagct ccaagtattg ctcagatgac tgtggcatga agctggcagc caaccgcatc
1441 tacgagatcc tcccccagcg catccagcag tggcagcaga gcccttgcat tgctgaagag
1501 cacggcaaga agctgctcga acgcattcgc cgagagcagc agagtgcccg cactcgcctt
1561 caggaaatgg aacgccgatt ccatgagctt gaggccatca ttctacgtgc caagcagcag
1621 gctgtgcgcg aggatgagga gagcaacgag ggtgacagtg atgacacaga cctgcagatc
1681 ttctgtgttt cctgtgggca ccccatcaac ccacgtgttg ccttgcgcca catggagcgc
1741 tgctacgcca agtatgagag ccagacgtcc tttgggtcca tgtaccccac acgcattgaa
1801 ggggccacac gactcttctg tgatgtgtat aatcctcaga gcaaaacata ctgtaagcgg
1861 ctccaggtgc tgtgccccga gcactcacgg gaccccaaag tgccagctga cgaggtatgc
1921 gggtgccccc ttgtacgtga tgtctttgag ctcacgggtg acttctgccg cctgcccaag
1981 cgccagtgca atcgccatta ctgctgggag aagctgcggc gtgcggaagt ggacttggag
2041 cgcgtgcgtg tgtggtacaa gctggacgag ctgtttgagc aggagcgcaa tgtgcgcaca
2101 gccatgacaa accgcgcggg attgctggcc ctgatgctgc accagacgat ccagcacgat
2161 cccctcacta ccgacctgcg ctccagtgcc gaccgctgag cctcctggcc cggacccctt
2221 acaccctgca ttccagatgg gggagccgcc cggtgcccgt gtgtccgttc ctccactcat
2281 ctgtttctcc ggttctccct gtgcccatcc accggttgac cgcccatctg cctttatcag
2341 agggactgtc cccgtcgaca tgttcagtgc ctggtggggc tgcggagtcc actcatcctt
2401 gcctcctctc cctgggtttt gttaataaaa ttttgaagaa accaaaaaaa aaaaaaaaaa
2461 aaaaaaaaaa aaaaaaaaaa aaaaaaa
Even if you're used to seeing GenBank files, it's worth taking the time to look one over, while considering how you would write a program to extract various parts of the data. For instance, how would you extract the sequence data? What's the format of the FEATURES table and its various subfields?
There's a lot of information packed into a typical GenBank entry, and it's important to be able to separate the different parts. For instance, if you can extract the sequence, you can search for motifs, calculate statistics on the sequence, look for similarity with other sequences, and so forth. Similarly, you'll want to separate out--or parse--the various parts of the data annotation. In GenBank, this includes ID numbers, gene names, genus and species, publications, etc. The FEATURES table part of the annotation can include specific information about the DNA, such as the locations of exons, regulatory regions, important mutations, and so on.
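A first cut at that separation might look like the sketch below. It assumes the standard GenBank layout, in which the sequence follows a line beginning with ORIGIN and the record ends with a line containing only //. The toy record and the sub name parse_record are hypothetical.

```perl
use strict;
use warnings;

# Split one GenBank record into its annotation and its bare sequence.
# Returns an empty list if the record does not have the expected shape.
sub parse_record {
    my ($record) = @_;
    my ($annotation, $dna) =
        $record =~ m{^(LOCUS.*\nORIGIN[^\n]*\n)(.*?)\n//\s*$}s
        or return;
    $dna =~ s/[\s0-9]//g;       # drop line numbers, spaces, and newlines
    return ($annotation, $dna);
}

# Made-up record for illustration; real records carry far more annotation.
my $toy = <<'END';
LOCUS       TOY00001     20 bp    DNA     SYN 01-JAN-2000
DEFINITION  Made-up record for illustration.
ACCESSION   TOY00001
ORIGIN
        1 agatggcggc gctgaggggt
//
END

my ($annotation, $dna) = parse_record($toy);
print "Annotation:\n$annotation\n";
print "Sequence (", length($dna), " bases): $dna\n";
```

Once the sequence is a plain string, the motif searches and statistics from earlier chapters apply to it directly.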
The format specification of GenBank files and a great deal of other information about GenBank can be found in the GenBank release notes, gbrel.txt, on the GenBank web site at ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt.
gbrel.txt gives complete detail about the structure of GenBank files to help programmers, so you may want to refer to it as your searches become more complex. As a Perl programmer, you won't need all of the detail because you can parse data using regular expressions or the split function. You need to get the data out and make it available to your programs. The code that accomplishes this task can be fairly simple, as you will see in this chapter.
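To make the two approaches concrete, here is a small sketch that applies split to a LOCUS line and a regular expression to a VERSION line, using the values from the record shown earlier. The sub names parse_locus and parse_version are hypothetical, and the assumption that the division and date are always the last two LOCUS fields should be checked against gbrel.txt before relying on it.

```perl
use strict;
use warnings;

# Split the LOCUS line on whitespace. The middle fields vary between records,
# so take the name and length from the front, the division and date from the end.
sub parse_locus {
    my ($line) = @_;
    my @field = split ' ', $line;
    return undef unless @field >= 6 && $field[0] eq 'LOCUS';
    return {
        name     => $field[1],
        length   => $field[2],
        division => $field[-2],
        date     => $field[-1],
    };
}

# Pull the version and GI number out of a record with a regular expression.
sub parse_version {
    my ($record) = @_;
    return $record =~ /^VERSION\s+(\S+)\s+GI:(\d+)/m ? ($1, $2) : ();
}

my $record =
      "LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000\n"
    . "VERSION     AB031069.1  GI:8100074\n";

my ($locus_line) = $record =~ /^(LOCUS.*)$/m;
my $locus = parse_locus($locus_line);
my ($version, $gi) = parse_version($record);

print "$locus->{name}: $locus->{length} bp, division $locus->{division}, ",
      "dated $locus->{date}, version $version, GI $gi\n";
```

split suits lines made of whitespace-separated columns, while a regular expression is handier when you want one or two values from a line whose position in the record is not known in advance.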
GenBank is distributed as a set of libraries--flat files containing many records in succession. As of GenBank release 125.0, August 2001, there are 243 files, most of which are over 200 MB in size. Altogether, GenBank contains 12,813,516 loci and 13,543,364,296 bases from 12,813,516 reported sequences. The libraries are usually distributed compressed, which means you can download somewhat smaller files, but you need to uncompress them after you receive them. Uncompressed, this amounts to about 50 GB of data. Since 1982, the number of sequences in GenBank has doubled about every 14 months.
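Because each record in a library ends with a line containing only //, one simple way to walk through a library is to set Perl's input record separator to that terminator, so that each read returns a whole record. The sketch below assumes that convention and uses a made-up five-record library held in memory so that it runs without any files.

```perl
use strict;
use warnings;

# A made-up five-record library, built in memory for illustration.
my $library = '';
for my $n (1 .. 5) {
    $library .= "LOCUS       TOY0000$n     4 bp    DNA     SYN 01-JAN-2000\n"
              . "ORIGIN\n        1 acgt\n//\n";
}

# Open the string as if it were a file.
open(my $fh, '<', \$library) or die "Cannot open in-memory library: $!";

my ($records, $total_bp) = (0, 0);
{
    local $/ = "//\n";          # read one whole record at a time
    while (my $record = <$fh>) {
        $records++;
        # The sequence length is the number before "bp" on the LOCUS line.
        $total_bp += $1 if $record =~ /^LOCUS\s+\S+\s+(\d+)\s+bp/m;
    }
}
close($fh);

print "$records records, $total_bp bases\n";
```

Reading record by record keeps memory use proportional to the largest record rather than to the whole library, which matters for files of 200 MB or more.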
GenBank libraries are further organized into divisions by the classification of the sequences they contain, either phylogenetically or by sequencing technology. Here are the divisions:
- PRI: primate sequences
- ROD: rodent sequences
- MAM: other mammalian sequences
- VRT: other vertebrate sequences
- INV: invertebrate sequences
- PLN: plant, fungal, and algal sequences
- BCT: bacterial sequences
- VRL: viral sequences
- PHG: bacteriophage sequences
- SYN: synthetic and chimeric sequences
- UNA: unannotated sequences
- EST: EST sequences (expressed sequence tags)
- PAT: patent sequences
- STS: STS sequences (sequence tagged sites)
- GSS: GSS sequences (genome survey sequences)
- HTG: HTGS sequences (high throughput genomic sequencing data)
- HTC: HTC sequences (high throughput cDNA sequencing data)
Some divisions are very large: the largest, the EST, or expressed sequence tag division, comprises 123 library files! A portion of human DNA is stored in the PRI division, which contains (as of this writing) 13 library files, for a total of almost 3.5 GB of data. Human data is also stored in the STS, GSS, HTGS, and HTC divisions. Human data alone in GenBank makes up almost 5 million record entries with over 8 billion bases of sequence.
The public database servers such as Entrez or BLAST at http://www.ncbi.nlm.nih.gov/ give you access to well-maintained and updated sequence data and programs, but many researchers find that they need to write their own programs to manipulate and analyze the data. The problem is, there's so much data. For many purposes, you can download a selected set of records from NCBI or other locations, but sometimes you need the whole dataset.
It's possible to set up a desktop workstation (Windows, Mac, Unix, or Linux) that contains all of GenBank; just be sure to buy a very large hard disk! Getting all that data onto your hard drive, however, is more difficult. A Perl program called mirror.pl helps to address this need. Downloading all of GenBank, even with a university-standard, high-speed Internet connection, can be time-consuming; downloading an entire dataset with a modem can be an exercise in frustration. The best solution is to download only the files you need, in compressed form. The EST data, for example, is about half the entire database; don't download it unless you really need to. If you need to download GenBank, I recommend contacting the help desk at NCBI. They'll help you get the most up-to-date information.
Since you're learning to program, it makes more sense to practice on a tiny, five-record library file, but the programs you'll write will work just fine on the real files....