Beginning Perl for Bioinformatics: An Introduction to Perl for Biologists

by James Tisdall

eBook

$31.99 


Overview

With its highly developed capacity to detect patterns in data, Perl has become one of the most popular languages for biological data analysis. But if you're a biologist with little or no programming experience, starting out in Perl can be a challenge. Many biologists have a difficult time learning how to apply the language to bioinformatics. The most popular Perl programming books are often too theoretical and too focused on computer science for a non-programming biologist who needs to solve very specific problems.

Beginning Perl for Bioinformatics is designed to get you quickly over the Perl language barrier by approaching programming as an important new laboratory skill, revealing Perl programs and techniques that are immediately useful in the lab. Each chapter focuses on solving a particular bioinformatics problem or class of problems, starting with the simplest and increasing in complexity as the book progresses. Each chapter includes programming exercises and teaches bioinformatics by showing and modifying programs that deal with various kinds of practical biological problems. By the end of the book you'll have a solid understanding of Perl basics, a collection of programs for such tasks as parsing BLAST and GenBank, and the skills to take on more advanced bioinformatics programming. Some of the later chapters focus in greater detail on specific bioinformatics topics. This book is suitable for use as a classroom textbook, for self-study, and as a reference.

The book covers:

  • Programming basics and working with DNA sequences and strings
  • Debugging your code
  • Simulating gene mutations using random number generators
  • Regular expressions and finding motifs in data
  • Arrays, hashes, and relational databases
  • Regular expressions and restriction maps
  • Using Perl to parse PDB records, annotations in GenBank, and BLAST output

Product Details

ISBN-13: 9780596550479
Publisher: O'Reilly Media, Incorporated
Publication date: 10/22/2001
Sold by: Barnes & Noble
Format: eBook
Pages: 386
File size: 3 MB

About the Author

James Tisdall has worked as a musician, a programmer at Bell Labs (where he programmed for speech research and discovered a formal language for musical rhythm), and as a bioinformaticist at Mercator Genetics in Menlo Park, California, and at Fox Chase Cancer Center in Philadelphia. He has a B.A. in mathematics from the City College of New York and an M.S. in computer science from Columbia University; he is working towards a Ph.D. in computer science at the University of Pennsylvania. In his spare time Jim teaches computer music at the Settlement Music School in Philadelphia.

Read an Excerpt

Chapter 10: GenBank

GenBank (Genetic Sequence Data Bank) is a rapidly growing international repository of known genetic sequences from a variety of organisms. Its use is central to modern biology and to bioinformatics.

This chapter shows you how to write Perl programs to extract information from GenBank files and libraries. Exercises include looking for patterns; creating special libraries; and parsing the flat-file format to extract the DNA, annotation, and features. You will learn how to make a DBM database to create your own rapid-access lookups on selected data in a GenBank library.
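
To make the idea of a DBM lookup concrete, here is a minimal sketch, not the chapter's own program. It assumes each record in a library ends with a line containing only //, as GenBank flat files do, and it stores the byte offset of every record in a DBM file keyed by accession number. The second record (EXAMPLE02), the file names, and the fetch_record helper are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Write a tiny two-record library to a temporary directory.
# The first record is abbreviated from a real entry; EXAMPLE02 is hypothetical.
my $dir     = tempdir(CLEANUP => 1);
my $libfile = "$dir/library.gb";

open(my $out, '>', $libfile) or die "Cannot write $libfile: $!\n";
print $out <<'END_LIBRARY';
LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
ACCESSION   AB031069
ORIGIN
        1 agatggcggc gctgaggggt
//
LOCUS       EXAMPLE02      20 bp    DNA             SYN       01-JAN-2000
ACCESSION   EXAMPLE02
ORIGIN
        1 acgtacgtac gtacgtacgt
//
END_LIBRARY
close $out;

# Build the index: accession number => byte offset of the record.
my %offset;
dbmopen(%offset, "$dir/index", 0644) or die "Cannot open DBM file: $!\n";

open(my $lib, '<', $libfile) or die "Cannot read $libfile: $!\n";
{
    local $/ = "//\n";              # read one whole record per <$lib>
    my $start = tell($lib);         # offset where the next record begins
    while (my $record = <$lib>) {
        $offset{$1} = $start if $record =~ /^ACCESSION\s+(\S+)/m;
        $start = tell($lib);
    }
}

# Hypothetical helper: jump straight to a record using the index.
sub fetch_record {
    my ($fh, $index, $accession) = @_;
    return unless exists $index->{$accession};
    seek($fh, $index->{$accession}, 0);
    local $/ = "//\n";
    return scalar <$fh>;
}

my $hit = fetch_record($lib, \%offset, 'EXAMPLE02');
print $hit;

dbmclose(%offset);
close $lib;
```

Storing offsets rather than whole records keeps the DBM file small; the lookup costs one seek and one read, however large the library is.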

Perl is a great tool for dealing with GenBank files. It enables you to extract and use any of the detailed data in the sequence and in the annotation, such as in the FEATURES table and elsewhere. When I first started using Perl, I wrote a program that searched GenBank for all sequence records annotated as being located on human chromosome 22. I found many genes where that information was so deeply buried within the annotation that the major gene mapping database, Genome Database (GDB), hadn't included them in its chromosome map. I think you'll discover the same feeling of power over the information when you start applying Perl to GenBank files.

Most biologists are familiar with GenBank. Researchers can perform a search, e.g., a BLAST search on some query sequence, and collect a set of GenBank files of related sequences as a result. Because the GenBank records are maintained by the individual scientists who discovered the sequences, if you find some new sequence of interest, you can publish it in GenBank.

GenBank files have a great deal of information in them in addition to sequence data, including identifiers such as accession numbers and gene names, phylogenetic classification, and references to published literature. A GenBank file may also include a detailed FEATURES table that summarizes facts about the sequence, such as the location of the regulatory regions, the protein translation, and exons and introns.

GenBank is sometimes referred to as a databank or data store, which is different from a database. Databases typically have a relational structure imposed upon the data, including associated indices and links and a query language. GenBank in comparison is a flat file, that is, an ASCII text file that is easily readable by humans.

From its humble beginnings GenBank has rapidly grown, and the flat-file format has seen signs of strain during the growth. With a quickly advancing body of knowledge, especially one that's growing as quickly as genetic data, it's difficult for the design of a databank to keep up. Several reworkings of GenBank have been done, but the flat-file format--in all its frustrating glory--still remains.

Due to a certain flexibility in the content of some sections of a GenBank record, extracting the information you're looking for can be tricky. This flexibility is good, in that it allows you to put what you think is most important into the data's annotation. It's bad, because that same flexibility makes it harder to write programs that find and extract the desired annotations. As a result, the trend has been towards more structure in the annotations.

Since Perl's data structures and its use of regular expressions make it a good tool for manipulating flat files, Perl is especially well-suited to deal with GenBank data. Using these features in Perl and building on the skills you've developed from previous chapters, you can write programs to access the accumulated genetic knowledge of the scientific community in GenBank.

Since this is a beginning book that requires no programming experience, you should not expect to find the most finished, multipurpose software here. Instead you'll find a solid introduction to parsing and building fast lookup tables for GenBank files. If you've never done so, I strongly recommend you explore the National Center for Biotechnology Information (NCBI) at the National Institutes of Health (NIH) (http://www.ncbi.nlm.nih.gov). While you're at it, stop by the European Bioinformatics Institute (EBI) at http://www.ebi.ac.uk and the bioinformatics arm of the European Molecular Biology Laboratory (EMBL) at http://www.embl-heidelberg.de/. These are large, heavily funded governmental bioinformatics powerhouses, and they have (and distribute) a great deal of state-of-the-art bioinformatics software.

GenBank Files

The primary repositories for genetic information are the NCBI GenBank, EMBL in Europe, and the DNA Data Bank of Japan (DDBJ). All have almost identical information due to international cooperative agreements. Each entry or record in GenBank or its mirror sites may contain identifying, descriptive, and genetic information in ASCII-format files. Each record is written in a specific standard format, organized so that both humans and computer programs can extract the desired information with reasonable ease.

Let's look at a relatively short GenBank record and at how the fields are defined, before writing any code. I'll save this information in a file called record.gb, for use in later programs.

LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
ACCESSION   AB031069
VERSION     AB031069.1  GI:8100074
KEYWORDS    .
SOURCE      Homo sapiens embryo male lung fibroblast cell_line:HuS-L12 cDNA to
            mRNA.
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
REFERENCE   1  (sites)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,Si. and
            Takano,T.
  TITLE     PCCX1, a novel DNA-binding protein with PHD finger and CXXC domain,
            is regulated by proteolysis
  JOURNAL   Biochem. Biophys. Res. Commun. 271 (2), 305-310 (2000)
  MEDLINE   20261256
REFERENCE   2  (bases 1 to 2487)
  AUTHORS   Fujino,T., Hasegawa,M., Shibata,S., Kishimoto,T., Imai,S. and
            Takano,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (15-AUG-1999) to the DDBJ/EMBL/GenBank databases.
            Tadahiro Fujino, Keio University School of Medicine, Department of
            Microbiology; Shinanomachi 35, Shinjuku-ku, Tokyo 160-8582, Japan
            (E-mail:fujino@microb.med.keio.ac.jp,
            Tel:+81-3-3353-1211(ex.62692), Fax:+81-3-5360-1508)
FEATURES             Location/Qualifiers
     source          1..2487
                     /organism="Homo sapiens"
                     /db_xref="taxon:9606"
                     /sex="male"
                     /cell_line="HuS-L12"
                     /cell_type="lung fibroblast"
                     /dev_stage="embryo"
     gene            229..2199
                     /gene="PCCX1"
     CDS             229..2199
                     /gene="PCCX1"
                     /note="a nuclear protein carrying a PHD finger and a CXXC
                     domain"
                     /codon_start=1
                     /product="protein containing CXXC domain 1"
                     /protein_id="BAA96307.1"
                     /db_xref="GI:8100075"
                     /translation="MEGDGSDPEPPDAGEDSKSENGENAPIYCICRKPDINCFMIGCD
                     NCNEWFHGDCIRITEKMAKAIREWYCRECREKDPKLEIRYRHKKSRERDGNERDSSEP
                     RDEGGGRKRPVPDPDLQRRAGSGTGVGAMLARGSASPHKSSPQPLVATPSQHHQQQQQ
                     QIKRSARMCGECEACRRTEDCGHCDFCRDMKKFGGPNKIRQKCRLRQCQLRARESYKY
                     FPSSLSPVTPSESLPRPRRPLPTQQQPQPSQKLGRIREDEGAVASSTVKEPPEATATP
                     EPLSDEDLPLDPDLYQDFCAGAFDDHGLPWMSDTEESPFLDPALRKRAVKVKHVKRRE
                     KKSEKKKEERYKRHRQKQKHKDKWKHPERADAKDPASLPQCLGPGCVRPAQPSSKYCS
                     DDCGMKLAANRIYEILPQRIQQWQQSPCIAEEHGKKLLERIRREQQSARTRLQEMERR
                     FHELEAIILRAKQQAVREDEESNEGDSDDTDLQIFCVSCGHPINPRVALRHMERCYAK
                     YESQTSFGSMYPTRIEGATRLFCDVYNPQSKTYCKRLQVLCPEHSRDPKVPADEVCGC
                     PLVRDVFELTGDFCRLPKRQCNRHYCWEKLRRAEVDLERVRVWYKLDELFEQERNVRT
                     AMTNRAGLLALMLHQTIQHDPLTTDLRSSADR"
BASE COUNT      564 a    715 c    768 g    440 t
ORIGIN      
        1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
       61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
      121 gaagtagttg tgggcgcctt tgcaaccgcc tgggacgccg ccgagtggtc tgtgcaggtt
      181 cgcgggtcgc tggcgggggt cgtgagggag tgcgccggga gcggagatat ggagggagat
      241 ggttcagacc cagagcctcc agatgccggg gaggacagca agtccgagaa tggggagaat
      301 gcgcccatct actgcatctg ccgcaaaccg gacatcaact gcttcatgat cgggtgtgac
      361 aactgcaatg agtggttcca tggggactgc atccggatca ctgagaagat ggccaaggcc
      421 atccgggagt ggtactgtcg ggagtgcaga gagaaagacc ccaagctaga gattcgctat
      481 cggcacaaga agtcacggga gcgggatggc aatgagcggg acagcagtga gccccgggat
      541 gagggtggag ggcgcaagag gcctgtccct gatccagacc tgcagcgccg ggcagggtca
      601 gggacagggg ttggggccat gcttgctcgg ggctctgctt cgccccacaa atcctctccg
      661 cagcccttgg tggccacacc cagccagcat caccagcagc agcagcagca gatcaaacgg
      721 tcagcccgca tgtgtggtga gtgtgaggca tgtcggcgca ctgaggactg tggtcactgt
      781 gatttctgtc gggacatgaa gaagttcggg ggccccaaca agatccggca gaagtgccgg
      841 ctgcgccagt gccagctgcg ggcccgggaa tcgtacaagt acttcccttc ctcgctctca
      901 ccagtgacgc cctcagagtc cctgccaagg ccccgccggc cactgcccac ccaacagcag
      961 ccacagccat cacagaagtt agggcgcatc cgtgaagatg agggggcagt ggcgtcatca
     1021 acagtcaagg agcctcctga ggctacagcc acacctgagc cactctcaga tgaggaccta
     1081 cctctggatc ctgacctgta tcaggacttc tgtgcagggg cctttgatga ccatggcctg
     1141 ccctggatga gcgacacaga agagtcccca ttcctggacc ccgcgctgcg gaagagggca
     1201 gtgaaagtga agcatgtgaa gcgtcgggag aagaagtctg agaagaagaa ggaggagcga
     1261 tacaagcggc atcggcagaa gcagaagcac aaggataaat ggaaacaccc agagagggct
     1321 gatgccaagg accctgcgtc actgccccag tgcctggggc ccggctgtgt gcgccccgcc
     1381 cagcccagct ccaagtattg ctcagatgac tgtggcatga agctggcagc caaccgcatc
     1441 tacgagatcc tcccccagcg catccagcag tggcagcaga gcccttgcat tgctgaagag
     1501 cacggcaaga agctgctcga acgcattcgc cgagagcagc agagtgcccg cactcgcctt
     1561 caggaaatgg aacgccgatt ccatgagctt gaggccatca ttctacgtgc caagcagcag
     1621 gctgtgcgcg aggatgagga gagcaacgag ggtgacagtg atgacacaga cctgcagatc
     1681 ttctgtgttt cctgtgggca ccccatcaac ccacgtgttg ccttgcgcca catggagcgc
     1741 tgctacgcca agtatgagag ccagacgtcc tttgggtcca tgtaccccac acgcattgaa
     1801 ggggccacac gactcttctg tgatgtgtat aatcctcaga gcaaaacata ctgtaagcgg
     1861 ctccaggtgc tgtgccccga gcactcacgg gaccccaaag tgccagctga cgaggtatgc
     1921 gggtgccccc ttgtacgtga tgtctttgag ctcacgggtg acttctgccg cctgcccaag
     1981 cgccagtgca atcgccatta ctgctgggag aagctgcggc gtgcggaagt ggacttggag
     2041 cgcgtgcgtg tgtggtacaa gctggacgag ctgtttgagc aggagcgcaa tgtgcgcaca
     2101 gccatgacaa accgcgcggg attgctggcc ctgatgctgc accagacgat ccagcacgat
     2161 cccctcacta ccgacctgcg ctccagtgcc gaccgctgag cctcctggcc cggacccctt
     2221 acaccctgca ttccagatgg gggagccgcc cggtgcccgt gtgtccgttc ctccactcat
     2281 ctgtttctcc ggttctccct gtgcccatcc accggttgac cgcccatctg cctttatcag
     2341 agggactgtc cccgtcgaca tgttcagtgc ctggtggggc tgcggagtcc actcatcctt
     2401 gcctcctctc cctgggtttt gttaataaaa ttttgaagaa accaaaaaaa aaaaaaaaaa
     2461 aaaaaaaaaa aaaaaaaaaa aaaaaaa
//

Even if you're used to seeing GenBank files, it's worth taking the time to look one over, while considering how you would write a program to extract various parts of the data. For instance, how would you extract the sequence data? What's the format of the FEATURES table and its various subfields?
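
One possible answer to the first question, as a hedged sketch: in the record above the sequence sits between the ORIGIN line and the closing // line, so a single regular expression can separate it from the annotation. The split_record helper is a hypothetical name, and the sample is abbreviated to a few annotation lines and the first two sequence lines of the record.

```perl
use strict;
use warnings;

# Abbreviated copy of the record shown above (first 120 bases only).
my $record = <<'END_RECORD';
LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
ACCESSION   AB031069
ORIGIN
        1 agatggcggc gctgaggggt cttgggggct ctaggccggc cacctactgg tttgcagcgg
       61 agacgacgca tggggcctgc gcaataggag tacgctgcct gggaggcgtg actagaagcg
//
END_RECORD

# Hypothetical helper: returns (annotation, sequence) for one record.
sub split_record {
    my ($rec) = @_;

    # First capture: everything up to and including the ORIGIN line.
    # Second capture: everything after it, up to the line that starts with //.
    my ($annotation, $dna) = $rec =~ m{\A(.*^ORIGIN[^\n]*\n)(.*?)^//}ms
        or die "Record does not look like GenBank format\n";

    $dna =~ s/[\s\d]//g;    # drop the position numbers and all whitespace
    return ($annotation, $dna);
}

my ($annotation, $dna) = split_record($record);
print "Sequence length: ", length($dna), "\n";
print "First 20 bases:  ", substr($dna, 0, 20), "\n";
```

Once the sequence is a plain string of bases, it can be passed to any code that searches for motifs or counts nucleotides.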

There's a lot of information packed into a typical GenBank entry, and it's important to be able to separate the different parts. For instance, if you can extract the sequence, you can search for motifs, calculate statistics on the sequence, look for similarity with other sequences, and so forth. Similarly, you'll want to separate out--or parse--the various parts of the data annotation. In GenBank, this includes ID numbers, gene names, genus and species, publications, etc. The FEATURES table part of the annotation can include specific information about the DNA, such as the locations of exons, regulatory regions, important mutations, and so on.
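
As an illustration of parsing one part of the annotation, the following sketch breaks a fragment of a FEATURES table into individual features and their qualifiers. It assumes the layout seen in the record above: feature keys indented five spaces, and qualifiers on indented lines that begin with a slash. The data structure (a list of hashes) is an arbitrary choice for the example, and quoted values containing embedded double quotes are not handled.

```perl
use strict;
use warnings;

# Part of the FEATURES table from the record above.
my $table = <<'END_FEATURES';
     gene            229..2199
                     /gene="PCCX1"
     CDS             229..2199
                     /gene="PCCX1"
                     /note="a nuclear protein carrying a PHD finger and a CXXC
                     domain"
                     /codon_start=1
                     /product="protein containing CXXC domain 1"
END_FEATURES

my @features;

# Each feature starts with exactly five spaces followed by its key.
for my $text ($table =~ /^( {5}\S.*?)(?=^ {5}\S|\z)/msg) {
    my ($name, $location) = $text =~ /^ {5}(\S+)\s+(\S+)/;

    my %qualifier;
    while ($text =~ m{^\s+/(\w+)=("[^"]*"|\S+)}mg) {
        my ($key, $value) = ($1, $2);
        $value =~ s/^"|"$//g;         # strip the surrounding quotes
        $value =~ s/\s*\n\s+/ /g;     # join values wrapped over several lines
        $qualifier{$key} = $value;
    }

    push @features,
        { name => $name, location => $location, qualifier => \%qualifier };
}

for my $f (@features) {
    print "$f->{name} at $f->{location}\n";
    print "    $_ = $f->{qualifier}{$_}\n" for sort keys %{ $f->{qualifier} };
}
```

Joining wrapped lines with a space suits free-text qualifiers such as /note; a wrapped /translation value would need its line breaks removed without inserting spaces.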

The format specification of GenBank files and a great deal of other information about GenBank can be found in the GenBank release notes, gbrel.txt, on the GenBank web site at ftp://ncbi.nlm.nih.gov/genbank/gbrel.txt.

gbrel.txt gives complete detail about the structure of GenBank files to help programmers, so you may want to refer to it as your searches become more complex. As a Perl programmer, you won't need all of the detail because you can parse data using regular expressions or the split function. You need to get the data out and make it available to your programs. The code that accomplishes this task can be fairly simple, as you will see in this chapter.
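
A short sketch of both approaches, using the top of the record shown earlier: split breaks the LOCUS line into its whitespace-separated columns, and regular expressions pull out the ACCESSION and DEFINITION fields. The column order is taken from the sample record and is an assumption; LOCUS lines in other releases may carry additional columns.

```perl
use strict;
use warnings;

# The first few annotation lines of the record shown earlier.
my $annotation = <<'END_ANNOTATION';
LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
DEFINITION  Homo sapiens PCCX1 mRNA for protein containing CXXC domain 1,
            complete cds.
ACCESSION   AB031069
VERSION     AB031069.1  GI:8100074
END_ANNOTATION

# split: break the LOCUS line into columns (the leading keyword and the
# "bp" unit are discarded by assigning them to undef).
my ($locus_line) = $annotation =~ /^(LOCUS.*)$/m;
my (undef, $name, $length, undef, $type, $division, $date)
    = split ' ', $locus_line;

# Regular expression: a field runs until the next line that starts in
# column one, so it may continue over several indented lines.
my ($definition) = $annotation =~ /^DEFINITION\s+(.*?)\n(?=\S)/ms;
$definition =~ s/\s*\n\s+/ /g;    # join the continuation lines

my ($accession) = $annotation =~ /^ACCESSION\s+(\S+)/m;

print "Name:       $name\n";
print "Length:     $length bp\n";
print "Type:       $type\n";
print "Division:   $division\n";
print "Date:       $date\n";
print "Accession:  $accession\n";
print "Definition: $definition\n";
```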

GenBank Libraries

GenBank is distributed as a set of libraries--flat files containing many records in succession. As of GenBank release 125.0, August 2001, there are 243 files, most of which are over 200 MB in size. Altogether, GenBank contains 12,813,516 loci and 13,543,364,296 bases from 12,813,516 reported sequences. The libraries are usually distributed compressed, which means you can download somewhat smaller files, but you need to uncompress them after you receive them. Uncompressed, this amounts to about 50 GB of data. Since 1982, the number of sequences in GenBank has doubled about every 14 months.
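
Because a library is just records in succession, each ending with a line that contains only //, one simple way to step through it is to change Perl's input record separator. The sketch below reads from an in-memory string so that it is self-contained; with a real library you would open the file by name instead. The first record is abbreviated and the second (EXAMPLE02) is made up for illustration.

```perl
use strict;
use warnings;

# A tiny two-record "library". EXAMPLE02 is a hypothetical record.
my $library = <<'END_LIBRARY';
LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000
ACCESSION   AB031069
ORIGIN
        1 agatggcggc gctgaggggt
//
LOCUS       EXAMPLE02      20 bp    DNA             SYN       01-JAN-2000
ACCESSION   EXAMPLE02
ORIGIN
        1 acgtacgtac gtacgtacgt
//
END_LIBRARY

# Open the string as if it were a file.
open(my $fh, '<', \$library) or die "Cannot open in-memory library: $!\n";

my @accessions;
{
    # With $/ set to "//\n", each read returns one complete record
    # instead of one line.
    local $/ = "//\n";
    while (my $record = <$fh>) {
        next unless $record =~ /^ACCESSION\s+(\S+)/m;
        push @accessions, $1;
    }
}
close $fh;

print "Found ", scalar(@accessions), " records: @accessions\n";
```

Only one record is held in memory at a time, which matters when the library file is hundreds of megabytes.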

GenBank libraries are further organized into divisions by the classification of the sequences they contain, either phylogenetically or by sequencing technology. Here are the divisions:

  • PRI: primate sequences
  • ROD: rodent sequences
  • MAM: other mammalian sequences
  • VRT: other vertebrate sequences
  • INV: invertebrate sequences
  • PLN: plant, fungal, and algal sequences
  • BCT: bacterial sequences
  • VRL: viral sequences
  • PHG: bacteriophage sequences
  • SYN: synthetic and chimeric sequences
  • UNA: unannotated sequences
  • EST: EST sequences (expressed sequence tags)
  • PAT: patent sequences
  • STS: STS sequences (sequence tagged sites)
  • GSS: GSS sequences (genome survey sequences)
  • HTG: HTGS sequences (high throughput genomic sequencing data)
  • HTC: HTC sequences (high throughput cDNA sequencing data)
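
The division code of a record appears near the end of its LOCUS line, so a program can look up the division with a hash. The following is a minimal sketch; the division_of helper is a hypothetical name, and it assumes the division is the second-to-last column of the LOCUS line, as in the sample record.

```perl
use strict;
use warnings;

# Division codes and descriptions, copied from the list above.
my %division_name = (
    PRI => 'primate sequences',
    ROD => 'rodent sequences',
    MAM => 'other mammalian sequences',
    VRT => 'other vertebrate sequences',
    INV => 'invertebrate sequences',
    PLN => 'plant, fungal, and algal sequences',
    BCT => 'bacterial sequences',
    VRL => 'viral sequences',
    PHG => 'bacteriophage sequences',
    SYN => 'synthetic and chimeric sequences',
    UNA => 'unannotated sequences',
    EST => 'EST sequences (expressed sequence tags)',
    PAT => 'patent sequences',
    STS => 'STS sequences (sequence tagged sites)',
    GSS => 'GSS sequences (genome survey sequences)',
    HTG => 'HTGS sequences (high throughput genomic sequencing data)',
    HTC => 'HTC sequences (high throughput cDNA sequencing data)',
);

# Hypothetical helper: return the division code from a LOCUS line.
sub division_of {
    my ($locus_line) = @_;
    my @fields = split ' ', $locus_line;
    return unless @fields >= 3 && $fields[0] eq 'LOCUS';
    return $fields[-2];    # the last column is the date
}

my $locus = 'LOCUS       AB031069     2487 bp    mRNA            PRI       27-MAY-2000';
my $code  = division_of($locus);
my $description = exists $division_name{$code}
    ? $division_name{$code}
    : 'unknown division';

print "$code: $description\n";    # prints "PRI: primate sequences"
```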

Some divisions are very large: the largest, the EST (expressed sequence tag) division, comprises 123 library files! A portion of human DNA is stored in the PRI division, which contains (as of this writing) 13 library files, for a total of almost 3.5 GB of data. Human data is also stored in the STS, GSS, HTG, and HTC divisions. Human data alone in GenBank makes up almost 5 million record entries with over 8 billion bases of sequence.

The public database servers such as Entrez or BLAST at http://www.ncbi.nlm.nih.gov/ give you access to well-maintained and updated sequence data and programs, but many researchers find that they need to write their own programs to manipulate and analyze the data. The problem is, there's so much data. For many purposes, you can download a selected set of records from NCBI or other locations, but sometimes you need the whole dataset.

It's possible to set up a desktop workstation (Windows, Mac, Unix, or Linux) that contains all of GenBank; just be sure to buy a very large hard disk! Getting all that data onto your hard drive, however, is more difficult. A Perl program called mirror.pl helps to address this need. Downloading GenBank, even with a university-standard, high-speed Internet connection, can be time-consuming; downloading an entire dataset with a modem can be an exercise in frustration. The best solution is to download only the files you need, in compressed form. The EST data, for example, is about half the entire database; don't download it unless you really need to. If you need to download GenBank, I recommend contacting the help desk at NCBI. They'll help you get the most up-to-date information.

Since you're learning to program, it makes more sense to practice on a tiny, five-record library file, but the programs you'll write will work just fine on the real files....

Table of Contents

Preface
1. Biology and Computer Science
  The Organization of DNA
  The Organization of Proteins
  In Silico
  Limits to Computation
2. Getting Started with Perl
  A Low and Long Learning Curve
  Perl's Benefits
  Installing Perl on Your Computer
  How to Run Perl Programs
  Text Editors
  Finding Help
3. The Art of Programming
  Individual Approaches to Programming
  Edit--Run--Revise (and Save)
  An Environment of Programs
  Programming Strategies
  The Programming Process
4. Sequences and Strings
  Representing Sequence Data
  A Program to Store a DNA Sequence
  Concatenating DNA Fragments
  Transcription: DNA to RNA
  Using the Perl Documentation
  Calculating the Reverse Complement in Perl
  Proteins, Files, and Arrays
  Reading Proteins in Files
  Arrays
  Scalar and List Context
  Exercises
5. Motifs and Loops
  Flow Control
  Code Layout
  Finding Motifs
  Counting Nucleotides
  Exploding Strings into Arrays
  Operating on Strings
  Writing to Files
  Exercises
6. Subroutines and Bugs
  Subroutines
  Scoping and Subroutines
  Command-Line Arguments and Arrays
  Passing Data to Subroutines
  Modules and Libraries of Subroutines
  Fixing Bugs in Your Code
  Exercises
7. Mutations and Randomization
  Random Number Generators
  A Program Using Randomization
  A Program to Simulate DNA Mutation
  Generating Random DNA
  Analyzing DNA
  Exercises
8. The Genetic Code
  Hashes
  Data Structures and Algorithms for Biology
  The Genetic Code
  Translating DNA into Proteins
  Reading DNA from Files in FASTA Format
  Reading Frames
  Exercises
9. Restriction Maps and Regular Expressions
  Regular Expressions
  Restriction Maps and Restriction Enzymes
  Perl Operations
  Exercises
10. GenBank
  GenBank Files
  GenBank Libraries
  Separating Sequence and Annotation
  Parsing Annotations
  Indexing GenBank with DBM
  Exercises
11. Protein Data Bank
  Files and Folders
  PDB Files
  Parsing PDB Files
  Controlling Other Programs
  Exercises
12. BLAST
  Obtaining BLAST
  String Matching and Homology
  BLAST Output Files
  Parsing BLAST Output
  Presenting Data
  Bioperl
  Exercises
13. Further Topics
  The Art of Program Design
  Web Programming
  Algorithms and Sequence Alignment
  Object-Oriented Programming
  Perl Modules
  Complex Data Structures
  Relational Databases
  Microarrays and XML
  Graphics Programming
  Modeling Networks
  DNA Computers
A. Resources
B. Perl Summary
Index