Beginning Perl for Bioinformatics

by James Tisdall



Product Details

ISBN-13: 9780596000806
Publisher: O'Reilly Media, Incorporated
Publication date: 11/28/2001
Edition description: 1ST
Pages: 386
Product dimensions: 7.10(w) x 9.08(h) x 0.86(d)

About the Author

James Tisdall has worked as a musician, a programmer at Bell Labs (where he programmed for speech research and discovered a formal language for musical rhythm), and as a bioinformaticist at Mercator Genetics in Menlo Park, California, and at Fox Chase Cancer Center in Philadelphia. He has a B.A. in mathematics from the City College of New York and an M.S. in computer science from Columbia University; he is working towards a Ph.D. in computer science at the University of Pennsylvania. In his spare time, Jim teaches computer music at the Settlement Music School in Philadelphia. He is also the author of O'Reilly's Beginning Perl for Bioinformatics.

Read an Excerpt

Chapter 10: GenBank

GenBank (Genetic Sequence Data Bank) is a rapidly growing international repository of known genetic sequences from a variety of organisms. Its use is central to modern biology and to bioinformatics.

This chapter shows you how to write Perl programs to extract information from GenBank files and libraries. Exercises include looking for patterns; creating special libraries; and parsing the flat-file format to extract the DNA, annotation, and features. You will learn how to make a DBM database to create your own rapid-access lookups on selected data in a GenBank library.

Perl is a great tool for dealing with GenBank files. It enables you to extract and use any of the detailed data in the sequence and in the annotation, such as in the FEATURES table and elsewhere. When I first started using Perl, I wrote a program that searched GenBank for all sequence records annotated as being located on human chromosome 22. I found many genes where that information was so deeply buried within the annotation that the major gene mapping database, Genome Database (GDB), hadn't included them in its chromosome map. I think you'll discover the same feeling of power over the information when you start applying Perl to GenBank files.

Most biologists are familiar with GenBank. Researchers can perform a search, e.g., a BLAST search on some query sequence, and collect a set of GenBank files of related sequences as a result. Because the GenBank records are maintained by the individual scientists who discovered the sequences, if you find some new sequence of interest, you can publish it in GenBank.

GenBank files have a great deal of information in them in addition to sequence data, including identifiers such as accession numbers and gene names, phylogenetic classification, and references to published literature. A GenBank file may also include a detailed FEATURES table that summarizes facts about the sequence, such as the location of the regulatory regions, the protein translation, and exons and introns.

GenBank is sometimes referred to as a databank or data store, which is different from a database. Databases typically have a relational structure imposed upon the data, including associated indices and links and a query language. GenBank in comparison is a flat file, that is, an ASCII text file that is easily readable by humans.

From its humble beginnings GenBank has rapidly grown, and the flat-file format has seen signs of strain during the growth. With a quickly advancing body of knowledge, especially one that's growing as quickly as genetic data, it's difficult for the design of a databank to keep up. Several reworkings of GenBank have been done, but the flat-file format--in all its frustrating glory--still remains.

Due to a certain flexibility in the content of some sections of a GenBank record, extracting the information you're looking for can be tricky. This flexibility is good, in that it allows you to put what you think is most important into the data's annotation. It's bad, because that same flexibility makes it harder to write programs that find and extract the desired annotations. As a result, the trend has been towards more structure in the annotations.

Since Perl's data structures and its use of regular expressions make it a good tool for manipulating flat files, Perl is especially well-suited to deal with GenBank data. Using these features in Perl and building on the skills you've developed from previous chapters, you can write programs to access the accumulated genetic knowledge of the scientific community in GenBank.

Since this is a beginning book that requires no programming experience, you should not expect to find the most finished, multipurpose software here. Instead you'll find a solid introduction to parsing and building fast lookup tables for GenBank files. If you've never done so, I strongly recommend you explore the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH). While you're at it, stop by the European Bioinformatics Institute (EBI) and the bioinformatics arm of the European Molecular Biology Laboratory (EMBL). These are large, heavily funded governmental bioinformatics powerhouses, and they have (and distribute) a great deal of state-of-the-art bioinformatics software.

GenBank Files

The primary repositories for genetic information are the NCBI GenBank, EMBL in Europe, and the DNA Data Bank of Japan (DDBJ). All have almost identical information due to international cooperative agreements. Each entry or record in GenBank or its mirror sites may contain identifying, descriptive, and genetic information in ASCII-format files. Each record is written in a specific standard format, organized so that both humans and computer programs can extract the desired information with reasonable ease.

Let's look at a relatively short GenBank record and at how the fields are defined, before writing any code. I'll save this information in a file for use in later programs.

LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
VERSION     AB031069.1  GI:8100074
SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (sites)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and
  TITLE     PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain,
            is regulated by proteolysis
  JOURNAL   Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000)
  MEDLINE   20261256
REFERENCE   2  (bases 1 to 2487)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
  TITLE     Direct Submission
  JOURNAL   Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.
            Tadahiro Fujino, Keio University School of Medicine, Department of
            Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan
            Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508)
FEATURES             Location/Qualifiers
     source          1..2487
                     /organism="Homo sapiens"
                     /cell_type="lung fibroblast"
     gene            229..2199
     CDS             229..2199
                     /note="a nuclear protein carrying a PHD finger and a CXXC
                     /product="protein containing CXXC domain 1"
BASE COUNT      564 a    715 c    768 g    440 t
        1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
       61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
      121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
      181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
      241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat
      301 gcgcccatct actgcatctg ccgcaaaccg gacatcaact gcttcatgat cgggtgtgac
      361 aactgcaatg agtggttcca tggggactgc atccggatca ctgagaagat ggccaaggcc
      421 atccgggagt ggtactgtcg ggagtgcaga gagaaagacc ccaagctaga gattcgctat
      481 cggcacaaga agtcacggga gcgggatggc aatgagcggg acagcagtga gccccgggat
      541 gagggtggag ggcgcaagag gcctgtccct gatccagacc tgcagcgccg ggcagggtca
      601 gggacagggg ttggggccat gcttgctcgg ggctctgctt cgccccacaa atcctctccg
      661 cagcccttgg tggccacacc cagccagcat caccagcagc agcagcagca gatcaaacgg
      721 tcagcccgca tgtgtggtga gtgtgaggca tgtcggcgca ctgaggactg tggtcactgt
      781 gatttctgtc gggacatgaa gaagttcggg ggccccaaca agatccggca gaagtgccgg
      841 ctgcgccagt gccagctgcg ggcccgggaa tcgtacaagt acttcccttc ctcgctctca
      901 ccagtgacgc cctcagagtc cctgccaagg ccccgccggc cactgcccac ccaacagcag
      961 ccacagccat cacagaagtt agggcgcatc cgtgaagatg agggggcagt ggcgtcatca
     1021 acagtcaagg agcctcctga ggctacagcc acacctgagc cactctcaga tgaggaccta
     1081 cctctggatc ctgacctgta tcaggacttc tgtgcagggg cctttgatga ccatggcctg
     1141 ccctggatga gcgacacaga agagtcccca ttcctggacc ccgcgctgcg gaagagggca
     1201 gtgaaagtga agcatgtgaa gcgtcgggag aagaagtctg agaagaagaa ggaggagcga
     1261 tacaagcggc atcggcagaa gcagaagcac aaggataaat ggaaacaccc agagagggct
     1321 gatgccaagg accctgcgtc actgccccag tgcctggggc ccggctgtgt gcgccccgcc
     1381 cagcccagct ccaagtattg ctcagatgac tgtggcatga agctggcagc caaccgcatc
     1441 tacgagatcc tcccccagcg catccagcag tggcagcaga gcccttgcat tgctgaagag
     1501 cacggcaaga agctgctcga acgcattcgc cgagagcagc agagtgcccg cactcgcctt
     1561 caggaaatgg aacgccgatt ccatgagctt gaggccatca ttctacgtgc caagcagcag
     1621 gctgtgcgcg aggatgagga gagcaacgag ggtgacagtg atgacacaga cctgcagatc
     1681 ttctgtgttt cctgtgggca ccccatcaac ccacgtgttg ccttgcgcca catggagcgc
     1741 tgctacgcca agtatgagag ccagacgtcc tttgggtcca tgtaccccac acgcattgaa
     1801 ggggccacac gactcttctg tgatgtgtat aatcctcaga gcaaaacata ctgtaagcgg
     1861 ctccaggtgc tgtgccccga gcactcacgg gaccccaaag tgccagctga cgaggtatgc
     1921 gggtgccccc ttgtacgtga tgtctttgag ctcacgggtg acttctgccg cctgcccaag
     1981 cgccagtgca atcgccatta ctgctgggag aagctgcggc gtgcggaagt ggacttggag
     2041 cgcgtgcgtg tgtggtacaa gctggacgag ctgtttgagc aggagcgcaa tgtgcgcaca
     2101 gccatgacaa accgcgcggg attgctggcc ctgatgctgc accagacgat ccagcacgat
     2161 cccctcacta ccgacctgcg ctccagtgcc gaccgctgag cctcctggcc cggacccctt
     2221 acaccctgca ttccagatgg gggagccgcc cggtgcccgt gtgtccgttc ctccactcat
     2281 ctgtttctcc ggttctccct gtgcccatcc accggttgac cgcccatctg cctttatcag
     2341 agggactgtc cccgtcgaca tgttcagtgc ctggtggggc tgcggagtcc actcatcctt
     2401 gcctcctctc cctgggtttt gttaataaaa ttttgaagaa accaaaaaaa aaaaaaaaaa
     2461 aaaaaaaaaa aaaaaaaaaa aaaaaaa

Even if you're used to seeing GenBank files, it's worth taking the time to look one over, while considering how you would write a program to extract various parts of the data. For instance, how would you extract the sequence data? What's the format of the FEATURES table and its various subfields?
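One way to approach the first question is sketched below. It is only an illustration and makes two assumptions about the record: that a line beginning with ORIGIN introduces the sequence, and that a line containing only // ends the record, as in the standard flat-file layout. The sample data adds those two lines to a fragment of the record above, and the subroutine name split_record is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: split one GenBank record into annotation and sequence.
# Assumes an ORIGIN line starts the sequence and a "//" line ends the record.
sub split_record {
    my ($record) = @_;

    # /s lets "." match newlines; /m lets "^" match at the start of each line.
    my ($annotation, $sequence) =
        $record =~ m{\A(.*?^ORIGIN[^\n]*\n)(.*?)^//}ms
        or return;

    # Remove the position numbers and all whitespace, leaving only bases.
    $sequence =~ s/[\s\d]//g;

    return ($annotation, $sequence);
}

# A fragment of the record shown above, with ORIGIN and // lines added.
my $record = <<'END';
LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
ORIGIN
        1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
       61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
//
END

my ($annotation, $sequence) = split_record($record);
print "Sequence length: ", length($sequence), "\n";
print "First 10 bases:  ", substr($sequence, 0, 10), "\n";
```

Once the sequence is a plain string of bases, the string and regular expression techniques from earlier chapters apply to it directly.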

There's a lot of information packed into a typical GenBank entry, and it's important to be able to separate the different parts. For instance, if you can extract the sequence, you can search for motifs, calculate statistics on the sequence, look for similarity with other sequences, and so forth. Similarly, you'll want to separate out--or parse--the various parts of the data annotation. In GenBank, this includes ID numbers, gene names, genus and species, publications, etc. The FEATURES table part of the annotation can include specific information about the DNA, such as the locations of exons, regulatory regions, important mutations, and so on.
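As a rough illustration of parsing the FEATURES table, the following sketch collects the quoted /name="value" qualifiers from a piece of the table shown earlier. The subroutine name parse_qualifiers is invented for this example, and the pattern is deliberately simple: it assumes every value is enclosed in double quotes and contains no embedded quotes, so unquoted qualifiers are skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: collect /qualifier="value" pairs from a FEATURES table.
# Returns a list of [name, value] pairs in the order they appear.
sub parse_qualifiers {
    my ($features) = @_;
    my @pairs;

    # A quoted value may continue over several lines, so [^"]* is allowed
    # to cross newlines.
    while ($features =~ m{^\s+/(\w+)="([^"]*)"}mg) {
        my ($name, $value) = ($1, $2);
        $value =~ s/\s+/ /g;    # fold continuation lines into a single line
        push @pairs, [$name, $value];
    }
    return @pairs;
}

my $features = <<'END';
FEATURES             Location/Qualifiers
     source          1..2487
                     /organism="Homo sapiens"
                     /cell_type="lung fibroblast"
     CDS             229..2199
                     /product="protein containing CXXC domain 1"
END

for my $pair (parse_qualifiers($features)) {
    printf "%-10s %s\n", @$pair;
}
```

Keeping the pairs in an array rather than a hash preserves their order and allows the same qualifier name to appear under more than one feature.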

The format specification of GenBank files and a great deal of other information about GenBank can be found in the GenBank release notes, gbrel.txt, on the GenBank web site.

gbrel.txt gives complete detail about the structure of GenBank files to help programmers, so you may want to refer to it as your searches become more complex. As a Perl programmer, you won't need all of the detail because you can parse data using regular expressions or the split function. You need to get the data out and make it available to your programs. The code that accomplishes this task can be fairly simple, as you will see in this chapter.
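To give a feel for how little code is involved, here is a small sketch that uses both tools on the first lines of the sample record: split for the LOCUS line and a regular expression for the DEFINITION field. The order of the LOCUS fields is taken from the sample record and is an assumption for other records.

```perl
use strict;
use warnings;

my $annotation = <<'END';
LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
VERSION     AB031069.1  GI:8100074
END

# Pull out the LOCUS line, then split it on runs of whitespace.
# The field order follows the sample record (an assumption for other records).
my ($locus_line) = $annotation =~ /^(LOCUS.*)$/m;
my (undef, $name, $length, undef, $type, $division, $date) =
    split ' ', $locus_line;

# DEFINITION may wrap onto indented continuation lines, so capture those too.
my ($definition) = $annotation =~ /^DEFINITION\s+(.*(?:\n[ \t]+.*)*)/m;
$definition =~ s/\s+/ /g;    # fold the wrapped lines into one

print "Name:       $name\n";
print "Length:     $length\n";
print "Type:       $type\n";
print "Division:   $division\n";
print "Date:       $date\n";
print "Definition: $definition\n";
```

The split approach suits lines with a fixed number of whitespace-separated fields, while the regular expression copes with a field of free text that spans several lines.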

GenBank Libraries

GenBank is distributed as a set of libraries--flat files containing many records in succession. As of GenBank release 125.0, August 2001, there are 243 files, most of which are over 200 MB in size. Altogether, GenBank contains 12,813,516 loci and 13,543,364,296 bases from 12,813,516 reported sequences. The libraries are usually distributed compressed, which means you can download somewhat smaller files, but you need to uncompress them after you receive them. Uncompressed, this amounts to about 50 GB of data. Since 1982, the number of sequences in GenBank has doubled about every 14 months.
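Because a library is just one record after another, a program can step through it a record at a time. The sketch below does this by setting Perl's input record separator, $/, to the // line that ends each record in the standard format. The two records are made-up stand-ins, the subroutine name next_record is hypothetical, and an in-memory filehandle takes the place of a real library file.

```perl
use strict;
use warnings;

# Hypothetical helper: read the next record from an open library filehandle.
# Assumes each record ends with a line containing only "//".
sub next_record {
    my ($fh) = @_;
    local $/ = "//\n";    # read up to and including the record terminator
    my $record = <$fh>;
    return $record;       # undef at end of file
}

# A two-record stand-in library (made-up records), read through an
# in-memory filehandle so the sketch needs no file on disk.
my $library = <<'END';
LOCUS       TEST0001       12 bp    DNA             PRI       01-JAN-2001
ORIGIN
        1 acgtacgtac gt
//
LOCUS       TEST0002        8 bp    DNA             ROD       01-JAN-2001
ORIGIN
        1 ttggccaa
//
END

open my $fh, '<', \$library or die "Cannot open library: $!";

my @names;
while (defined(my $record = next_record($fh))) {
    my ($name) = $record =~ /^LOCUS\s+(\S+)/m;
    push @names, $name;
}
close $fh;

print "Found records: @names\n";
```

Reading one record at a time keeps memory use small, which matters when a single library file runs to hundreds of megabytes.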

GenBank libraries are further organized into divisions by the classification of the sequences they contain, either phylogenetically or by sequencing technology. Here are the divisions:

  • PRI: primate sequences
  • ROD: rodent sequences
  • MAM: other mammalian sequences
  • VRT: other vertebrate sequences
  • INV: invertebrate sequences
  • PLN: plant, fungal, and algal sequences
  • BCT: bacterial sequences
  • VRL: viral sequences
  • PHG: bacteriophage sequences
  • SYN: synthetic and chimeric sequences
  • UNA: unannotated sequences
  • EST: EST sequences (expressed sequence tags)
  • PAT: patent sequences
  • STS: STS sequences (sequence tagged sites)
  • GSS: GSS sequences (genome survey sequences)
  • HTG: HTGS sequences (high throughput genomic sequencing data)
  • HTC: HTC sequences (high throughput cDNA sequencing data)

Some divisions are very large: the largest, the EST, or expressed sequence tag division, comprises 123 library files! A portion of human DNA is stored in the PRI division, which contains (as of this writing) 13 library files, for a total of almost 3.5 GB of data. Human data is also stored in the STS, GSS, HTG, and HTC divisions. Human data alone in GenBank makes up almost 5 million record entries with over 8 billion bases of sequence.

The public database servers such as Entrez or BLAST give you access to well-maintained and updated sequence data and programs, but many researchers find that they need to write their own programs to manipulate and analyze the data. The problem is, there's so much data. For many purposes, you can download a selected set of records from NCBI or other locations, but sometimes you need the whole dataset.

It's possible to set up a desktop workstation (Windows, Mac, Unix, or Linux) that contains all of GenBank; just be sure to buy a very large hard disk! Getting all that data onto your hard drive, however, is more difficult. A Perl program helps to address this need. Downloading GenBank, even with a university-standard, high-speed Internet connection, can be time-consuming; downloading an entire dataset with a modem can be an exercise in frustration. The best solution is to download only the files you need, in compressed form. The EST data, for example, is about half the entire database; don't download it unless you really need to. If you need to download GenBank, I recommend contacting the help desk at NCBI. They'll help you get the most up-to-date information.

Since you're learning to program, it makes more sense to practice on a tiny, five-record library file, but the programs you'll write will work just fine on the real files....
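To make the idea of a rapid-access lookup concrete, here is one possible sketch of a DBM index over a tiny library. It is an illustration under several assumptions rather than a finished tool: the two records are made up, records are assumed to end with a // line, the index stores the byte offset of each record under its LOCUS name, and SDBM_File (a DBM implementation bundled with Perl) is just one of several DBM modules that could be used. The subroutine name fetch_record is hypothetical.

```perl
use strict;
use warnings;
use Fcntl;                      # O_RDWR and O_CREAT flags
use SDBM_File;                  # a DBM implementation bundled with Perl
use File::Temp qw(tempdir);

# Write a tiny made-up library to a temporary directory.
my $dir     = tempdir(CLEANUP => 1);
my $libfile = "$dir/library.gb";
open my $out, '>', $libfile or die "Cannot write $libfile: $!";
print {$out} <<'END';
LOCUS       TEST0001       12 bp    DNA             PRI       01-JAN-2001
ORIGIN
        1 acgtacgtac gt
//
LOCUS       TEST0002        8 bp    DNA             ROD       01-JAN-2001
ORIGIN
        1 ttggccaa
//
END
close $out;

# Build the index: LOCUS name => byte offset of the record in the library.
tie my %offset, 'SDBM_File', "$dir/index", O_RDWR | O_CREAT, 0666
    or die "Cannot open DBM file: $!";

open my $in, '<', $libfile or die "Cannot read $libfile: $!";
{
    local $/ = "//\n";          # one record per read
    my $pos = tell $in;         # offset where the next record starts
    while (defined(my $record = <$in>)) {
        $offset{$1} = $pos if $record =~ /^LOCUS\s+(\S+)/m;
        $pos = tell $in;
    }
}

# Hypothetical helper: jump straight to a record using the index.
sub fetch_record {
    my ($fh, $index, $name) = @_;
    return unless exists $index->{$name};
    seek $fh, $index->{$name}, 0;    # 0 means "from the start of the file"
    local $/ = "//\n";
    return scalar <$fh>;
}

my $record = fetch_record($in, \%offset, 'TEST0002');
print $record;
```

The index stays small because it holds only names and offsets, while the records themselves remain in the library file. A longer-running program would finish by calling untie on the hash and closing the filehandle.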

Table of Contents

What Is Bioinformatics?;
About This Book;
Who This Book Is For;
Why Should I Learn to Program?;
Structure of This Book;
Conventions Used in This Book;
Comments and Questions;
Chapter 1: Biology and Computer Science;
1.1 The Organization of DNA;
1.2 The Organization of Proteins;
1.3 In Silico;
1.4 Limits to Computation;
Chapter 2: Getting Started with Perl;
2.1 A Low and Long Learning Curve;
2.2 Perl's Benefits;
2.3 Installing Perl on Your Computer;
2.4 How to Run Perl Programs;
2.5 Text Editors;
2.6 Finding Help;
Chapter 3: The Art of Programming;
3.1 Individual Approaches to Programming;
3.2 Edit—Run—Revise (and Save);
3.3 An Environment of Programs;
3.4 Programming Strategies;
3.5 The Programming Process;
Chapter 4: Sequences and Strings;
4.1 Representing Sequence Data;
4.2 A Program to Store a DNA Sequence;
4.3 Concatenating DNA Fragments;
4.4 Transcription: DNA to RNA;
4.5 Using the Perl Documentation;
4.6 Calculating the Reverse Complement in Perl;
4.7 Proteins, Files, and Arrays;
4.8 Reading Proteins in Files;
4.9 Arrays;
4.10 Scalar and List Context;
4.11 Exercises;
Chapter 5: Motifs and Loops;
5.1 Flow Control;
5.2 Code Layout;
5.3 Finding Motifs;
5.4 Counting Nucleotides;
5.5 Exploding Strings into Arrays;
5.6 Operating on Strings;
5.7 Writing to Files;
5.8 Exercises;
Chapter 6: Subroutines and Bugs;
6.1 Subroutines;
6.2 Scoping and Subroutines;
6.3 Command-Line Arguments and Arrays;
6.4 Passing Data to Subroutines;
6.5 Modules and Libraries of Subroutines;
6.6 Fixing Bugs in Your Code;
6.7 Exercises;
Chapter 7: Mutations and Randomization;
7.1 Random Number Generators;
7.2 A Program Using Randomization;
7.3 A Program to Simulate DNA Mutation;
7.4 Generating Random DNA;
7.5 Analyzing DNA;
7.6 Exercises;
Chapter 8: The Genetic Code;
8.1 Hashes;
8.2 Data Structures and Algorithms for Biology;
8.3 The Genetic Code;
8.4 Translating DNA into Proteins;
8.5 Reading DNA from Files in FASTA Format;
8.6 Reading Frames;
8.7 Exercises;
Chapter 9: Restriction Maps and Regular Expressions;
9.1 Regular Expressions;
9.2 Restriction Maps and Restriction Enzymes;
9.3 Perl Operations;
9.4 Exercises;
Chapter 10: GenBank;
10.1 GenBank Files;
10.2 GenBank Libraries;
10.3 Separating Sequence and Annotation;
10.4 Parsing Annotations;
10.5 Indexing GenBank with DBM;
10.6 Exercises;
Chapter 11: Protein Data Bank;
11.1 Overview of PDB;
11.2 Files and Folders;
11.3 PDB Files;
11.4 Parsing PDB Files;
11.5 Controlling Other Programs;
11.6 Exercises;
Chapter 12: BLAST;
12.1 Obtaining BLAST;
12.2 String Matching and Homology;
12.3 BLAST Output Files;
12.4 Parsing BLAST Output;
12.5 Presenting Data;
12.6 Bioperl;
12.7 Exercises;
Chapter 13: Further Topics;
13.1 The Art of Program Design;
13.2 Web Programming;
13.3 Algorithms and Sequence Alignment;
13.4 Object-Oriented Programming;
13.5 Perl Modules;
13.6 Complex Data Structures;
13.7 Relational Databases;
13.8 Microarrays and XML;
13.9 Graphics Programming;
13.10 Modeling Networks;
13.11 DNA Computers;
Appendix A: Resources;
A.1 Perl;
A.2 Computer Science;
A.3 Linux;
A.4 Bioinformatics;
A.5 Molecular Biology;
Appendix B: Perl Summary;
B.1 Command Interpretation;
B.2 Comments;
B.3 Scalar Values and Scalar Variables;
B.4 Assignment;
B.5 Statements and Blocks;
B.6 Arrays;
B.7 Hashes;
B.8 Operators;
B.9 Operator Precedence;
B.10 Basic Operators;
B.11 Conditionals and Logical Operators;
B.12 Binding Operators;
B.13 Loops;
B.14 Input/Output;
B.15 Regular Expressions;
B.16 Scalar and List Context;
B.17 Subroutines and Modules;
B.18 Built-in Functions;
