Chapter 10: GenBank
GenBank (Genetic Sequence Data Bank) is a rapidly growing international repository of known genetic sequences from a variety of organisms. Its use is central to modern biology and to bioinformatics.
This chapter shows you how to write Perl programs to extract information from GenBank files and libraries. Exercises include looking for patterns; creating special libraries; and parsing the flat-file format to extract the DNA, annotation, and features. You will learn how to make a DBM database to create your own rapid-access lookups on selected data in a GenBank library.
Perl is a great tool for dealing with GenBank files. It enables you to extract and use any of the detailed data in the sequence and in the annotation, such as in the FEATURES table and elsewhere. When I first started using Perl, I wrote a program that searched GenBank for all sequence records annotated as being located on human chromosome 22. I found many genes where that information was so deeply buried within the annotation that the major gene mapping database, Genome Database (GDB), hadn't included them in its chromosome map. I think you'll discover the same feeling of power over the information when you start applying Perl to GenBank files.
Most biologists are familiar with GenBank. Researchers can perform a search, e.g., a BLAST search on some query sequence, and collect a set of GenBank files of related sequences as a result. Because the GenBank records are maintained by the individual scientists who discovered the sequences, if you find some new sequence of interest, you can publish it in GenBank.
GenBank files have a great deal of information in them in addition to sequence data, including identifiers such as accession numbers and gene names, phylogenetic classification, and references to published literature. A GenBank file may also include a detailed FEATURES table that summarizes facts about the sequence, such as the location of the regulatory regions, the protein translation, and exons and introns.
GenBank is sometimes referred to as a databank or data store, which is different from a database. Databases typically have a relational structure imposed upon the data, including associated indices and links and a query language. GenBank in comparison is a flat file, that is, an ASCII text file that is easily readable by humans.1
From its humble beginnings, GenBank has grown rapidly, and the flat-file format has shown signs of strain along the way. With a quickly advancing body of knowledge, especially one that's growing as quickly as genetic data, it's difficult for the design of a databank to keep up. Several reworkings of GenBank have been done, but the flat-file format--in all its frustrating glory--still remains.
Due to a certain flexibility in the content of some sections of a GenBank record, extracting the information you're looking for can be tricky. This flexibility is good, in that it allows you to put what you think is most important into the data's annotation. It's bad, because that same flexibility makes it harder to write programs that find and extract the desired annotations. As a result, the trend has been towards more structure in the annotations.
Since Perl's data structures and its use of regular expressions make it a good tool for manipulating flat files, Perl is especially well-suited to deal with GenBank data. Using these features in Perl and building on the skills you've developed from previous chapters, you can write programs to access the accumulated genetic knowledge of the scientific community in GenBank.
Since this is a beginning book that requires no programming experience, you should not expect to find the most finished, multipurpose software here. Instead you'll find a solid introduction to parsing and building fast lookup tables for GenBank files. If you've never done so, I strongly recommend you explore the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH) (http://www.ncbi.nlm.nih.gov). While you're at it, stop by the European Bioinformatics Institute (EBI) at http://www.ebi.ac.uk and the bioinformatics arm of the European Molecular Biology Laboratory (EMBL) at http://www.embl-heidelberg.de/. These are large, heavily funded governmental bioinformatics powerhouses, and they have (and distribute) a great deal of state-of-the-art bioinformatics software.
GenBank Files
The primary repositories for genetic information are the NCBI GenBank, EMBL in Europe, and the DNA Data Bank of Japan (DDBJ). All have almost identical information due to international cooperative agreements. Each entry or record in GenBank or its mirror sites may contain identifying, descriptive, and genetic information in ASCII-format files. Each record is written in a specific standard format, organized so that both humans and computer programs can extract the desired information with reasonable ease.
Let's look at a relatively short GenBank record and at how the fields are defined, before writing any code. I'll save this information in a file called record.gb, for use in later programs.
LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
ACCESSION   AB031069
VERSION     AB031069.1  GI:8100074
KEYWORDS    .
SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
            mRNA.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (sites)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and
            Takano,T.
  TITLE     PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain,
            is regulated by proteolysis
  JOURNAL   Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000)
  MEDLINE   20261256
REFERENCE   2  (bases 1 to 2487)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
            Takano,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.
            Tadahiro Fujino, Keio University School of Medicine, Department of
            Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan
            (E-mail:fujino@microb.med.keio.ac.jp,
            Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508)
FEATURES             Location/Qualifiers
     source          1..2487
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /sex="male"
                     /cell_line="HuS-L12"
                     /cell_type="lung fibroblast"
                     /dev_stage="embryo"
     gene            229..2199
                     /gene="PCCX1"
     CDS             229..2199
                     /gene="PCCX1"
                     /note="a nuclear protein carrying a PHD finger and a CXXC
                     domain"
                     /codon_start=1
                     /product="protein containing CXXC domain 1"
                     /protein_id="BAA96307.1"
                     /db_xref="GI:8100075"
                     /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD
                     NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP
                     RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ
                     QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY
                     FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP
                     EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE
                     KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS
                     DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR
                     FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK
                     YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC
                     PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT
                     AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR"
BASE COUNT      564 a    715 c    768 g    440 t
ORIGIN
        1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
       61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
      121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
      181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
      241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat
      301 gcgcccatct actgcatctg ccgcaaaccg gacatcaact gcttcatgat cgggtgtgac
      361 aactgcaatg agtggttcca tggggactgc atccggatca ctgagaagat ggccaaggcc
      421 atccgggagt ggtactgtcg ggagtgcaga gagaaagacc ccaagctaga gattcgctat
      481 cggcacaaga agtcacggga gcgggatggc aatgagcggg acagcagtga gccccgggat
      541 gagggtggag ggcgcaagag gcctgtccct gatccagacc tgcagcgccg ggcagggtca
      601 gggacagggg ttggggccat gcttgctcgg ggctctgctt cgccccacaa atcctctccg
      661 cagcccttgg tggccacacc cagccagcat caccagcagc agcagcagca gatcaaacgg
      721 tcagcccgca tgtgtggtga gtgtgaggca tgtcggcgca ctgaggactg tggtcactgt
      781 gatttctgtc gggacatgaa gaagttcggg ggccccaaca agatccggca gaagtgccgg
      841 ctgcgccagt gccagctgcg ggcccgggaa tcgtacaagt acttcccttc ctcgctctca
      901 ccagtgacgc cctcagagtc cctgccaagg ccccgccggc cactgcccac ccaacagcag
      961 ccacagccat cacagaagtt agggcgcatc cgtgaagatg agggggcagt ggcgtcatca
     1021 acagtcaagg agcctcctga ggctacagcc acacctgagc cactctcaga tgaggaccta
     1081 cctctggatc ctgacctgta tcaggacttc tgtgcagggg cctttgatga ccatggcctg
     1141 ccctggatga gcgacacaga agagtcccca ttcctggacc ccgcgctgcg gaagagggca
     1201 gtgaaagtga agcatgtgaa gcgtcgggag aagaagtctg agaagaagaa ggaggagcga
     1261 tacaagcggc atcggcagaa gcagaagcac aaggataaat ggaaacaccc agagagggct
     1321 gatgccaagg accctgcgtc actgccccag tgcctggggc ccggctgtgt gcgccccgcc
     1381 cagcccagct ccaagtattg ctcagatgac tgtggcatga agctggcagc caaccgcatc
     1441 tacgagatcc tcccccagcg catccagcag tggcagcaga gcccttgcat tgctgaagag
     1501 cacggcaaga agctgctcga acgcattcgc cgagagcagc agagtgcccg cactcgcctt
     1561 caggaaatgg aacgccgatt ccatgagctt gaggccatca ttctacgtgc caagcagcag
     1621 gctgtgcgcg aggatgagga gagcaacgag ggtgacagtg atgacacaga cctgcagatc
     1681 ttctgtgttt cctgtgggca ccccatcaac ccacgtgttg ccttgcgcca catggagcgc
     1741 tgctacgcca agtatgagag ccagacgtcc tttgggtcca tgtaccccac acgcattgaa
     1801 ggggccacac gactcttctg tgatgtgtat aatcctcaga gcaaaacata ctgtaagcgg
     1861 ctccaggtgc tgtgccccga gcactcacgg gaccccaaag tgccagctga cgaggtatgc
     1921 gggtgccccc ttgtacgtga tgtctttgag ctcacgggtg acttctgccg cctgcccaag
     1981 cgccagtgca atcgccatta ctgctgggag aagctgcggc gtgcggaagt ggacttggag
     2041 cgcgtgcgtg tgtggtacaa gctggacgag ctgtttgagc aggagcgcaa tgtgcgcaca
     2101 gccatgacaa accgcgcggg attgctggcc ctgatgctgc accagacgat ccagcacgat
     2161 cccctcacta ccgacctgcg ctccagtgcc gaccgctgag cctcctggcc cggacccctt
     2221 acaccctgca ttccagatgg gggagccgcc cggtgcccgt gtgtccgttc ctccactcat
     2281 ctgtttctcc ggttctccct gtgcccatcc accggttgac cgcccatctg cctttatcag
     2341 agggactgtc cccgtcgaca tgttcagtgc ctggtggggc tgcggagtcc actcatcctt
     2401 gcctcctctc cctgggtttt gttaataaaa ttttgaagaa accaaaaaaa aaaaaaaaaa
     2461 aaaaaaaaaa aaaaaaaaaa aaaaaaa
//
Even if you're used to seeing GenBank files, it's worth taking the time to look one over, while considering how you would write a program to extract various parts of the data. For instance, how would you extract the sequence data? What's the format of the FEATURES table and its various subfields?
There's a lot of information packed into a typical GenBank entry, and it's important to be able to separate the different parts. For instance, if you can extract the sequence, you can search for motifs, calculate statistics on the sequence, look for similarity with other sequences, and so forth. Similarly, you'll want to separate out--or parse--the various parts of the data annotation. In GenBank, this includes ID numbers, gene names, genus and species, publications, etc. The FEATURES table part of the annotation can include specific information about the DNA, such as the locations of exons, regulatory regions, important mutations, and so on.
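As a first answer to the question of how to pull out the sequence, here is a minimal sketch. It assumes the whole record is held in one string, that the sequence begins on the line after ORIGIN, and that a line starting with // ends the record. The subroutine name split_record and the shortened sample record are invented for illustration.

```perl
use strict;
use warnings;

# Split one GenBank record into its annotation and its sequence.
# Assumption: the sequence starts on the line after ORIGIN and the
# record ends at a line beginning with "//".
sub split_record {
    my ($record) = @_;

    # /m lets ^ match at the start of each line; /s lets . match newlines.
    my ($annotation, $dna) = $record =~ m{\A(.*?^ORIGIN[^\n]*\n)(.*?)^//}ms
        or return;

    # Sequence lines carry position numbers and spaces; keep the letters only.
    $dna =~ s/[\s0-9]//g;

    return ($annotation, $dna);
}

# A shortened stand-in for record.gb (not the full record)
my $record = <<'END';
LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
ACCESSION   AB031069
ORIGIN
        1 agatggcggc gctgaggggt
       21 cttgggggct
//
END

my ($annotation, $dna) = split_record($record);
print "Sequence: $dna\n";
print "Length:   ", length($dna), "\n";
```

Once the sequence is in a plain string like this, the motif searches and statistics from earlier chapters can be applied to it directly.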
The format specification of GenBank files and a great deal of other information about GenBank can be found in the GenBank release notes, gbrel.txt, on the GenBank web site at ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt.
gbrel.txt gives complete detail about the structure of GenBank files to help programmers, so you may want to refer to it as your searches become more complex. As a Perl programmer, you won't need all of the detail because you can parse data using regular expressions or the split function. You need to get the data out and make it available to your programs. The code that accomplishes this task can be fairly simple, as you will see in this chapter.
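To illustrate the two approaches just mentioned, the following sketch pulls a few fields out of the top of a record, once with regular expressions and once with split. It assumes the annotation is held in a single string and that the LOCUS line has the seven whitespace-separated fields seen in record.gb; LOCUS lines in other records may be laid out differently, so treat the field positions as an assumption rather than a rule.

```perl
use strict;
use warnings;

# The first few annotation lines of the sample record
my $annotation = <<'END';
LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
ACCESSION   AB031069
VERSION     AB031069.1  GI:8100074
END

# Regular expression: capture the value that follows a keyword
# at the start of a line.
my ($accession) = $annotation =~ /^ACCESSION\s+(\S+)/m;

# A field may continue on indented lines; capture up to the next
# line that starts with a non-space character.
my ($definition) = $annotation =~ /^DEFINITION\s+(.*?)\n(?=\S)/ms;
$definition =~ s/\s+/ /g;    # join the continuation lines

# split: break the LOCUS line into whitespace-separated fields.
my ($locus_line) = $annotation =~ /^(LOCUS.*)$/m;
my ($keyword, $name, $length, $unit, $type, $division, $date)
    = split ' ', $locus_line;

print "Accession:  $accession\n";
print "Definition: $definition\n";
print "Length:     $length $unit\n";
print "Division:   $division\n";
```

The regular expressions key on the keyword at the start of a line, so they don't depend on exact column positions; split is convenient when a line is a simple list of fields.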
GenBank Libraries
GenBank is distributed as a set of libraries--flat files containing many records in succession.2 As of GenBank release 125.0, August 2001, there are 243 files, most of which are over 200 MB in size. Altogether, GenBank contains 12,813,516 loci and 13,543,364,296 bases from 12,813,516 reported sequences. The libraries are usually distributed compressed, which means you can download somewhat smaller files, but you need to uncompress them after you receive them. Uncompressed, this amounts to about 50 GB of data. Since 1982, the number of sequences in GenBank has doubled about every 14 months.
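Because a library is just one record after another, a program never has to hold the whole file in memory. One possible approach, sketched here, is to set Perl's input record separator $/ so that each read returns one complete record. The sketch assumes every record ends with a line containing only //; the two tiny records and their accession numbers are made up, and an in-memory filehandle stands in for a real library file.

```perl
use strict;
use warnings;

# Return the next GenBank record from an open filehandle,
# or undef at end of file.
# Assumption: each record ends with a line containing only "//".
sub next_record {
    my ($fh) = @_;
    local $/ = "//\n";    # read up to and including the record terminator
    return scalar <$fh>;
}

# Stand-in for a library file: two made-up records in a string
my $library = <<'END';
LOCUS       AA000001       10 bp    DNA             PRI       01-JAN-2000
ACCESSION   AA000001
ORIGIN
        1 acgtacgtac
//
LOCUS       AA000002       12 bp    DNA             ROD       01-JAN-2000
ACCESSION   AA000002
ORIGIN
        1 ggccggccgg cc
//
END

# Open the string as if it were a file
open my $fh, '<', \$library or die "Cannot open in-memory file: $!";

my @accessions;
while (defined(my $record = next_record($fh))) {
    next unless $record =~ /^ACCESSION\s+(\S+)/m;
    push @accessions, $1;
}
close $fh;

print "Found: @accessions\n";
```

With a real library, you would open the file by name instead of the string; any text that precedes the first LOCUS line would arrive as part of the first chunk and may need to be skipped.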
GenBank libraries are further organized into divisions by the classification of the sequences they contain, either phylogenetically or by sequencing technology. Here are the divisions:
- PRI: primate sequences
- ROD: rodent sequences
- MAM: other mammalian sequences
- VRT: other vertebrate sequences
- INV: invertebrate sequences
- PLN: plant, fungal, and algal sequences
- BCT: bacterial sequences
- VRL: viral sequences
- PHG: bacteriophage sequences
- SYN: synthetic and chimeric sequences
- UNA: unannotated sequences
- EST: EST sequences (expressed sequence tags)
- PAT: patent sequences
- STS: STS sequences (sequence tagged sites)
- GSS: GSS sequences (genome survey sequences)
- HTG: HTGS sequences (high throughput genomic sequencing data)
- HTC: HTC sequences (high throughput cDNA sequencing data)
Some divisions are very large: the largest, the EST, or expressed sequence tag division, consists of 123 library files! A portion of human DNA is stored in the PRI division, which contains (as of this writing) 13 library files, for a total of almost 3.5 GB of data. Human data is also stored in the STS, GSS, HTG, and HTC divisions. Human data alone in GenBank makes up almost 5 million record entries with over 8 billion bases of sequence.
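The division codes listed above lend themselves to a Perl hash. The following sketch stores the codes and descriptions from the list and looks up the division of a record from its LOCUS line. It assumes the division code is the next-to-last field on the LOCUS line, as it is in record.gb; the names %division_name and division_of are invented for illustration.

```perl
use strict;
use warnings;

# Division codes and descriptions, taken from the list above
my %division_name = (
    PRI => 'primate sequences',
    ROD => 'rodent sequences',
    MAM => 'other mammalian sequences',
    VRT => 'other vertebrate sequences',
    INV => 'invertebrate sequences',
    PLN => 'plant, fungal, and algal sequences',
    BCT => 'bacterial sequences',
    VRL => 'viral sequences',
    PHG => 'bacteriophage sequences',
    SYN => 'synthetic and chimeric sequences',
    UNA => 'unannotated sequences',
    EST => 'EST sequences (expressed sequence tags)',
    PAT => 'patent sequences',
    STS => 'STS sequences (sequence tagged sites)',
    GSS => 'GSS sequences (genome survey sequences)',
    HTG => 'HTGS sequences (high throughput genomic sequencing data)',
    HTC => 'HTC sequences (high throughput cDNA sequencing data)',
);

# Return the division code found on a LOCUS line, or undef if the
# expected field is not a known code.
# Assumption: the code is the next-to-last whitespace-separated field.
sub division_of {
    my ($locus_line) = @_;
    my @fields = split ' ', $locus_line;
    my $code = @fields >= 2 ? $fields[-2] : '';
    return exists $division_name{$code} ? $code : undef;
}

my $locus = 'LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000';
my $code  = division_of($locus);
print "$code: $division_name{$code}\n" if defined $code;
```

A lookup like this can be used to keep only the records from the divisions you care about while scanning a library.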
The public database servers such as Entrez or BLAST at http://www.ncbi.nlm.nih.gov/ give you access to well-maintained and updated sequence data and programs, but many researchers find that they need to write their own programs to manipulate and analyze the data. The problem is, there's so much data. For many purposes, you can download a selected set of records from NCBI or other locations, but sometimes you need the whole dataset.
It's possible to set up a desktop workstation (Windows, Mac, Unix, or Linux) that contains all of GenBank; just be sure to buy a very large hard disk! Getting all that data onto your hard drive, however, is more difficult. Downloading all of GenBank, even with a university-standard, high-speed Internet connection, can be time-consuming; downloading an entire dataset with a modem can be an exercise in frustration. A Perl program called mirror.pl helps to address this need. The best solution is to download only the files you need, in compressed form. The EST data, for example, is about half the entire database; don't download it unless you really need to. If you need to download GenBank, I recommend contacting the help desk at NCBI. They'll help you get the most up-to-date information.
Since you're learning to program, it makes more sense to practice on a tiny, five-record library file, but the programs you'll write will work just fine on the real files....