Chapter 3: The Meaning of the Life Script
Though the two sides' claim to have sequenced the human genome sparked headlines around the world, they were celebrating victory before the battle. Each team now had to interpret its genome sequence by describing its major features and locating the genes, and from the quality of the respective reports a clear winner could emerge in the eyes of their fellow scientists.
This first analysis of the human genome was just as much a landmark in scientific history as the years spent decoding the sequence. But it was compressed into a few months' whirlwind of activity as the two sides set about interpreting the strange and enigmatic script they had wrested from the human cell. Talk of a joint annotation conference at which the two sides would compare notes in interpreting the human genome soon evaporated. Celera and the consortium worked with different groups of experts and published their reports in rival scientific journals, Celera choosing Science in Washington, D.C., and the consortium Nature in London. The sharp elbowing continued to the bitter end, with academic biologists including Eric Lander, chief author of the consortium's genome report, lobbying hard to persuade the editor of Science not to accept their rival's paper except under terms unacceptable to Celera.
The academics demanded that Celera make its genome sequence freely available through GenBank, even though GenBank could not meet Celera's condition that its commercial rivals be prevented from downloading Celera's data and reselling it. The editor of Science allowed Celera to be the custodian of its own data on condition that any part of it be made freely available to scientists for checking, a decision to which the academic biologists objected. Venter felt their real purpose was to deny him the prestige of being published in a leading academic journal.
Well before the White House announcement, both sides had started preparing to analyze their respective versions of the genome. Venter had more or less founded the art of genome interpretation when he published the first genome of a bacterium in 1995. He had learned then an important lesson: the best way of interpreting a genome is to compare it with the genome of a similar organism. He had long since decided that the mouse's genome would be a critical tool for interpreting the human genome, because comparison of these two long-separated mammalian cousins would reveal, by their regions of DNA sequence similarity, all the features that nature had found it necessary to conserve. He had risked switching his sequencing machines from human to mouse DNA at the earliest possible moment so as to have both genomes in hand for the task of locating the human genes.
Another advantage for Celera was that its version of the human genome was much less bitty than the consortium's. Celera's vast computer, the largest in civilian use, had assembled 27 million of the 500-base pieces analyzed by the sequencing machines into long, mostly continuous scaffolds that straddled the genome. The consortium's genome was divided into thousands of the small sub-jigsaws known as BACs, chunks of DNA about 150,000 bases in length. The BACs had been completed for the two shortest human chromosomes, numbers 21 and 22, but over most of the rest of the genome were still in small pieces, many 10,000 bases or so in length. It was possible to hunt through these fragments for genes, but not at all easy. The consortium had not tried to assemble them by computer because it did not see the need to do so. Robert Waterston, director of the sequencing center at Washington University in Saint Louis, had prepared a BAC map that showed how one BAC overlapped another across the genome in a complete tiling path. With this BAC map in hand, the same method by which Waterston and John Sulston at the Sanger Centre had sequenced the roundworm's genome, there seemed no need to invest in the complex computing and assembly programs that were a necessary part of Celera's strategy.
Though Sulston and Waterston had laid the scientific groundwork for the consortium's sequencing effort, it was Eric Lander, director of the Whitehead Institute's sequencing center and a mathematician by training, who took the lead in analyzing the genome. In December 1999 he started to invite computational biologists, practitioners of a new discipline devoted to computer analysis of genomes, to join in a genome analysis group. The group had no government funding, according to Lander, and did its work mostly by phone and e-mail.
Meanwhile Venter, who had convened an "annotation jamboree" of outside experts to help find the genes in the fruit fly genome, decided there was now enough expertise within Celera to undertake the first analysis of the human genome in-house, with the help of a few consultants.
The consortium might have been hopelessly outgunned in the interpretation phase of the genome race had it not been for a chance encounter, although one made possible by the consortium's open nature. One of the computational biologists approached by Lander in December 1999 was David Haussler of the University of California, Santa Cruz, whom Lander invited to help locate the genes. Haussler decided that before looking for genes, it would be best to put some order into the jumble of fragments within each BAC. He believed there was enough information, some created inadvertently by the consortium and some from other sources, for an assembly program to order and orient the intra-BAC fragments, and he at once started writing such a program.
To create the computing facility to run the program, he persuaded his university chancellor to advance him the money for a network of one hundred Pentium III computers. But the programming went slowly. In May, when a graduate student of his e-mailed to ask how the genome assembly program was going, Haussler replied that things were looking grim.
The student, James Kent, then offered to write an assembly program himself, using a simpler strategy. Haussler replied, "Godspeed." Four weeks later, Kent had completed an assembly program that in his supervisor's opinion might have taken a team of five or ten programmers six months or a year. "He had to ice his wrists at night because of the fury with which he created this extraordinarily complex piece of code," Haussler said of his student. Kent, who had run a computer animation company before returning to school to study computational biology, first used the program to assemble and order all the pieces in the consortium's genome on June 22, 2000. In doing so he gained a three-day lead on Celera, whose assembly program had encountered unexpected problems. Venter completed his first assembly of the human genome on June 25, just the night before the White House press conference.
When Celera and the consortium published their analyses of the genome in February 2001, it was clear that the consortium's rested heavily on Kent's improvised assembly program and the computer network put together by his supervisor. Venter was astonished that his competitors had at the last minute managed to extract so much sense from a genome sequence that in his view had been so hopelessly chaotic. "They used every piece of information available," he said. "It was really quite clever, given the quality of their data. So honestly, we are impressed. We were truly amazed, because we predicted, based on their raw data, that it would be nonassemblable. So what Haussler did was, he came in and saved them. Haussler put it all together."
In their first glimpse of the human genome, the two teams came to similar conclusions, of which the most surprising was the far smaller than expected number of human genes. Both found about 30,000 protein-coding human genes, far fewer than the 100,000 human genes that textbooks had estimated for many years. The 100,000 number had seemed especially credible after the C. elegans roundworm was found to have 19,098 genes and the fruit fly 13,601.
Celera said its new gene-finding program, named Otto, had predicted 26,588 human genes for sure, with another 12,731 possible genes. The consortium estimated the human instruction set at 30,000 to 40,000 genes. Both sides favor a number at the lower end of their respective estimates because gene-finding programs tend to overpredict, and 30,000 genes seems for the moment the preferred figure.
These first readings of the human life script, however awesome for biologists, were a little baffling for the lay audience. All this effort, just to find that people have only 50 percent more genes than a worm? Yet the human instruction manual was never likely to yield up all its secrets at first glance. It was hardly surprising that the first scan of its pages should produce more perplexity than enlightenment.
The human genome is written in an ancient and vastly alien language. It is designed for the cell to use, not for human eyes to make sense of. Its four-letter alphabet, represented as A, T, C, and G for the four different bases of DNA, is so hard to parse that there would be no point in printing out the whole genome sequence. If anyone were to undertake so futile an effort, the result would occupy three hundred volumes the size of those in the Encyclopaedia Britannica, each page of which would carry an almost identical-looking block of letters, unbroken by spaces, punctuation, or headings. Nature's only subdivision of the genomic script is to package it in twenty-three chapters, the chromosomes, each of which is an enormous DNA molecule festooned with the special proteins that control it. These range in size from chromosome 1, a blockbuster 282 million bases in length, to chromosome 21 at a mere 45 million bases.
The challenge of the task that faced Celera and the consortium was to decipher the 3 billion letters in the script with machines that could read only five hundred letters at a time, losing the position of each fragment as they did so. The order of the five-hundred-letter pieces had to be reconstructed largely by generating so many that they overlapped, so that through the overlaps the original chromosomal sequence could be inferred. Moreover, both teams held themselves to an eventual accuracy standard of less than one error per 10,000 bases.
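The principle of reconstructing a sequence from overlapping reads can be illustrated with a toy sketch. The fragments and the greedy merge below are invented for illustration; real assemblers such as Celera's must cope with sequencing errors, vast repeats, and tens of millions of reads, which this sketch does not attempt.

```python
def merge_reads(reads, min_overlap=3):
    """Greedily merge reads by their longest suffix-prefix overlaps.

    A toy illustration of shotgun assembly: repeatedly find the pair of
    fragments with the longest overlap and fuse them into one.
    """
    reads = list(reads)
    while len(reads) > 1:
        best = None  # (overlap_length, index_a, index_b)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i == j:
                    continue
                # longest suffix of a that matches a prefix of b
                for k in range(min(len(a), len(b)), min_overlap - 1, -1):
                    if a.endswith(b[:k]):
                        if best is None or k > best[0]:
                            best = (k, i, j)
                        break
        if best is None:
            break  # no sufficient overlaps remain
        k, i, j = best
        merged = reads[i] + reads[j][k:]
        reads = [r for idx, r in enumerate(reads) if idx not in (i, j)]
        reads.append(merged)
    return reads

# Three made-up fragments of a short "chromosome", reassembled
# purely from the overlaps between them
fragments = ["GATTACAGGT", "CAGGTCCGAT", "CCGATTTGCA"]
print(merge_reads(fragments))  # ['GATTACAGGTCCGATTTGCA']
```

The five-base overlaps between consecutive fragments are enough to recover the original twenty-base stretch, which is the essential idea behind generating reads in such excess that every junction is covered.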
The interpretation of the genomic script is likely to prove as hard as the sequencing. So far just the major features have been recognized; doubtless many details remain invisible because of biologists' still substantial ignorance as to how the genome works.
One major feature of the genome's geography, which has prevented both Celera and the consortium from estimating its exact size, is the centromere. This is a stretch of DNA, more or less in the center of each chromosome, which is recognized at the time of cell division by the machinery that pulls duplicated DNA strands apart so that each daughter cell can receive a full set. The centromere consists of the same sequence of bases repeated many times over. Along with certain other problematic regions of the chromosomes, known collectively as heterochromatin, the centromeres cannot be sequenced by present-day methods. Since they contain few or no genes, the anonymity of their DNA probably does not much matter.
Including its heterochromatic regions, the human genome now seems to be 3.1 billion bases in length. The rest of the DNA, known as euchromatin, is 2.91 billion bases in length, according to Celera.
Of the 2.91 billion bases, one quarter contains genes, but the other three quarters of the vast terrain, as far as can be seen at present, is a graveyard of fossilized DNA, evolutionary experiments that didn't work, and dead genes on the road to extinction. The original name for this non-gene DNA, "junk DNA," has been regarded as presumptuous because it assumed, without proof, that the so-called junk was useless. But at first glance, much of the non-gene region of the genome is indeed full of junk.
The principal occupants of these regions are rogue pieces of DNA that have been able to copy themselves and insert the copy elsewhere in the genome. Called mobile DNA or transposons, these parasitic elements seem in some cases to have been derived from working genes and taken on a life of their own. The copying seems to serve no purpose other than cluttering up the genome. The consortium reported finding more than 850,000 LINEs and 1,500,000 SINEs, as the two largest families of rogue DNA are called. (The acronyms stand respectively for long and short interspersed nuclear elements.) A LINE is about 7,000 bases in length and a SINE about 200 bases. These and the two other families occupy a total of 1.229 billion bases of DNA, or more than one third of the genome.
The good news about the transposons is that most are dead, in the sense that they ceased to copy themselves thousands of years ago. They clock up mutations, just like any other segment of DNA, and the older they are the more mutations they have. By use of this mutational clock, the consortium determined that only one family of LINEs is still active in the genome, together with a SINE family that uses the LINEs' copying mechanism. Transposons are a genomic hazard, though not a large one, because the copies they make of themselves are inserted back into the genome at random sites. If the insertion occurs in the middle of a gene, the gene is likely to be disrupted. The active LINE element was first noticed because it had disrupted the gene for the blood-clotting Factor VIII in a hemophilia patient. Humans, with most of their transposons long ago placed on fossil status, are for unknown reasons much better off in this respect than the mouse, in whose genome transposons are still vigorously copying themselves and cluttering up the mouse's genetic patrimony.
The Celera team calculates that the region of the genome devoted to human genes occupies just a quarter of the euchromatic DNA, with the average gene sprawling over 27,000 bases of DNA. Genes consist of alternating stretches of DNA known as introns and exons, an arrangement that now seems particularly important in explaining how human complexity is generated with so few genes. The intron-exon system is the basis of a baroque system known as alternative splicing, which works as follows. A gene is activated by a set of special proteins that assemble on its control region, just upstream of the exon-intron mosaic. This transcription complex, as it is called, moves down the double strand of DNA, pushing one strand out of the way and making a copy of the other.
The material of the copy is not DNA but its close chemical cousin, RNA or ribonucleic acid, so called because its nucleotides are composed of the chemical unit known as ribose in place of DNA's deoxyribose. These RNA transcripts are then edited by a dexterous piece of cellular machinery known as a spliceosome, which snips out the introns and splices the exons together in a much shorter transcript. The string of united exons is then exported from the nucleus of the cell to the ribosomes, the protein-making machines in the cell's periphery or cytoplasm.
As the edited RNA transcript ratchets through the ribosome, the order of its bases dictates the composition of the protein chain that is assembled in lockstep with its passage. With each three bases of the RNA transcript, one amino acid, one of the chemical units of which proteins are made, is added to the growing protein chain. Each triplet of bases codes for one of the twenty types of amino acid, although, since there are sixty-four possible combinations, some triplets code for the same amino acid, and three triplets also serve as a stop sign. This all-important relationship between the DNA/RNA world and proteins is known as the genetic code and must have been established at the dawn of life on Earth some 3.8 billion years ago.
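The triplet reading described above can be sketched in a few lines. The codon table below lists only a handful of the sixty-four codons (the assignments shown are the standard ones), and the transcript is invented for illustration.

```python
# A fragment of the standard genetic code: each RNA triplet (codon)
# maps to one amino acid, and "*" marks a stop codon. Only a few of
# the 64 codons are listed here.
CODON_TABLE = {
    "AUG": "Met", "UUU": "Phe", "UUC": "Phe", "GGU": "Gly",
    "GCA": "Ala", "UGG": "Trp", "UAA": "*", "UAG": "*", "UGA": "*",
}

def translate(rna):
    """Read an RNA transcript three bases at a time, as the ribosome
    does, adding one amino acid per codon until a stop codon appears."""
    protein = []
    for i in range(0, len(rna) - 2, 3):
        amino = CODON_TABLE[rna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return protein

# A made-up transcript: a start codon, three more codons, then a stop
print(translate("AUGUUUGGUGCAUAA"))  # ['Met', 'Phe', 'Gly', 'Ala']
```

Note how the degeneracy of the code appears even in this fragment: UUU and UUC both yield phenylalanine, and three distinct triplets all mean "stop."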
Though proteins are made as a linear sequence, the chain folds up into a very specific three-dimensional structure dictated by the order of its amino acids. The part of each amino acid that forms the protein's backbone is standard, but each has a chemically different side group that juts out from the backbone, and it is the combination of these different groups that gives protein molecules their remarkable versatility. Some proteins serve as structural materials, like the stretchy collagen fibers of the skin, some as enzymes that run the cell's metabolic reactions, and some as complex machine tools, like the amazing topoisomerases that specialize in unknotting DNA when it gets in a tangle.
A neat feature of nature's proteins is their modular design. A single protein can contain several modules, or domains, each of which performs a different function. Some of the central roles in human cells are played by single proteins with a large number of domains, each of which regulates a complex circuitry of lower-level proteins.
This is where the intron-exon structure of genes comes into play. When the spliceosome processes the RNA transcript of a gene, it can produce alternative editions, often by skipping exons, occasionally by including introns. The result is that a single gene can produce a family of different proteins, each with a different set of domains and different overall properties.
Biologists have no clear idea yet as to how cells control alternative splicing. In some cases different control regions in front of the gene may be selected by the transcription complex. Or the introns themselves may control which exons get edited out. Sometimes different splice forms are produced in different types of cells. The dystrophin gene, for example, one of the largest in the genome, occupies over 2.4 million bases of DNA that contain seventy-nine exons. It takes the cell sixteen hours just to make a transcript. The spliceosomes discard 99 percent of the transcript, but the edited version nonetheless generates a vast protein of 3,685 amino acids. Mutations that occur at various sites in the gene can impair the protein and produce the spectrum of diseases known as muscular dystrophy, which is how the protein was discovered and named. Though the manufacture of dystrophin seems an inefficient process, the body makes double use of the gene. The dystrophin gene is also activated in brain cells, but there it is alternatively spliced. Brain cells use only a subset of the seventy-nine exons and produce a much smaller protein.
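The exon-skipping arrangement can be pictured as simple list selection. The gene, its four exons, and the choice of which exons each cell type keeps are all invented for illustration; real genes like dystrophin have dozens of exons and far subtler splicing rules.

```python
def splice(exons, keep):
    """Join a chosen subset of exons into a mature transcript.

    A toy model of alternative splicing: `exons` is the ordered list of
    exon sequences left after the introns are discarded, and `keep`
    selects which of them a given cell type retains.
    """
    return "".join(exons[i] for i in keep)

# Hypothetical gene with four exons
exons = ["ATGGCC", "TTTAAA", "GGGCCC", "TAGTGA"]

# One cell type keeps every exon; another skips exon 1, producing a
# shorter protein from the very same gene
full_form = splice(exons, [0, 1, 2, 3])
short_form = splice(exons, [0, 2, 3])
print(full_form)   # ATGGCCTTTAAAGGGCCCTAGTGA
print(short_form)  # ATGGCCGGGCCCTAGTGA
```

One gene, two transcripts: this is the mechanism by which a modest gene count can yield a much larger repertoire of proteins.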
Alternative splicing, though its mysteries are only just beginning to be probed, may well hold one of the sources of mammalian complexity. Up to 60 percent of human genes may have alternative splice forms. This may enable human cells to generate perhaps five times as many proteins as the worm or fly, despite having only twice the number of genes as the fly, the consortium said in its analysis of the human genome.
Another source of mammals' complexity seems to lie in the more sophisticated architecture of their proteins, which tend to have more domains than those of the fly and worm. Most protein domains are very ancient. Only 7 percent of the domains in the human proteome (that's the genomic-age word for all the proteins made by a genome) are not found in lower animals, showing that invention of new domains was not so important in the design of mammals. The difference is that human proteins tend to have more domains, which makes possible not just more complex proteins but a much richer combination of interaction among proteins.
"The main invention seems to have been cobbling things together to make a multi-tasked protein," said Collins, the consortium's leader. "Maybe evolution designed most of the basic folds that proteins could use a long time ago, and the major advances in the last 400 million years have been to figure out how to shuffle those in interesting ways. That gives another reason not to panic," he said, referring to the unexpectedly small number of human genes his team had found.
There's another possible explanation for the small number of human genes found by Celera and the consortium, which is that both teams seriously undercounted. This is the belief of Venter's former colleague William Haseltine. Haseltine, chief executive of Human Genome Sciences, had long pegged the number of human genes at 120,000 or more, as had Randal Scott, chairman of Incyte. In fact Scott's prediction of 142,634 genes, made in September 1999, was one of the highest on record. In the wake of the new analysis, Scott said he accepted the new lower tally. But Haseltine stood firm with his high estimate, even though he was now in a minority of one against almost all the world's leading genome analysts.
Haseltine's rationale was hard to assess, since he had not published any of his findings, but also hard to dismiss. He believed that Venter and the gene finders had gotten it all wrong. To sequence the genome, to identify its genes, to discover in which cells the genes were turned on was all, in his view, a prodigious waste of effort when there existed a far simpler and surer method of identifying the human gene repertoire. The shortcut, pursued of course by Human Genome Sciences, is to let the cell read out the baffling information in the genome and then capture the cell's RNA transcripts. Human Genome Sciences had invested enormous effort in capturing the genes made by different types of human cells, including those of fetal tissues in various stages of development.
The method of capturing RNA transcripts had been exploited first by Venter and was the basis of his original partnership with Haseltine. The transcripts, which are usually sequenced only in part, are called expressed sequence tags, or ESTs, and are extremely useful for locating the genes from which they are copied. But EST collections are also known to overpredict the number of genes from which they are derived, in part because of alternative splicing and other vagaries of the spliceosome's operation. It was in large measure through recognizing that ESTs pointed to too high a number of genes that the consortium and Celera had arrived at the low number of 30,000.
But Haseltine said he was confident that he had removed the alternative splice forms and other known sources of confusion from his EST collection. With the resources of Human Genome Sciences, he had been able to determine the full-length sequence of 90,000 ESTs, which he believed represented 90,000 different genes.
When consortium biologists published the sequence of human chromosome 22 in December 1999 they reported that they had identified at least 545 genes in it. But Haseltine figured he could see twice that number of his genes located to chromosome 22; in other words, the best conventional gene finding methods had picked up only one gene in two.
"No new discoveries were made, no new genes were found, and the authors go to great length to tell us that chromosome sequence cannot be used to find genes. I call that the biggest untold secret of the Human Genome Project," he said.
Warming to the same theme a few months later, Haseltine observed that "People who sequence DNA are the least likely to know how many genes there are." Later, when almost everyone who did sequence DNA had concluded that there were only 30,000 human genes, Haseltine argued that they had erred en masse because both teams had used the same methods of gene finding. The methods are admittedly imperfect and can overlook genes. One kind of evidence the present gene-finding programs rely on is a homology search, meaning that the programs search the DNA databases for any genes from other organisms that have a DNA sequence similar to any in the human genome. Any human DNA sequence that is homologous with, or similar to, a known gene in another genome is likely to be a gene itself. This is a powerful search method but will miss any gene that is so far unique to humans and has not yet been catalogued in the databases.
Unfazed by being in a minority of one, Haseltine predicted that the number of known human genes would steadily rise to meet his projections of 100,000 to 120,000. Venter, though he disagreed with his former colleague's number, wrote in his analysis of the human genome that ultimately the only way to determine the number of genes would be to capture the genes made in different types of cells. This is indeed the approach Haseltine has taken.
The first analyses of the human genome have brought home the long familiar fact that all organisms are intimately related to one another through being twigs on the same tree of life. But even evolutionary biologists may have been surprised by the overwhelming degree of similarity of people to other forms of life at the DNA level. People would seem to have little in common with the mouse beyond being fellow mammals, separated by 100 million years from our last common ancestor. Yet Venter, having assembled the mouse genome, said that of the 26,000 confirmed human genes he could find only 300 that had no counterpart in the mouse. On this basis he expected the chimpanzee, our closest living relative, to have essentially the same set of genes, with the difference between the two species being caused by variant versions of the same genes.
The consortium, for its part, asserted that at least 100 human genes seem to have been borrowed from bacteria, presumably via some ancient infection. But this conclusion was quickly shot down. Its basis was heavily criticized by scientists at the Institute for Genomic Research and elsewhere, and Lander, the lead author of the consortium's paper, did little to defend it.
The similarity between the human and mouse genomes shows how far biologists are from being able to make the link between a genome and the organism that is based on it. Evidently small and subtle changes at the genomic level produce enormous differences at the level of the whole organism. A vast amount of research lies ahead before the human mechanism is fully understood in terms of its genetic instruction manual.
One necessary step is to continue the work that has begun on compiling a full catalog of human genes. Present gene-finding programs are powerful but not very accurate. Essentially, they look for "open reading frames," stretches of DNA that start with ATG, the triplet of bases used to initiate a protein chain, and continue for a plausible length without any of the stop-sign triplets. This method works well in bacteria, which have very compact genomes, but gene-finding programs get confused by the intron-exon structure of animal and plant genes. The junction between introns and exons is not well defined, and some of the cues the spliceosome uses for stripping out the introns are not yet understood. Nor are the rules that govern alternative splicing. Since the theoretical basis for detecting gene sequences in the genome is still incomplete, biologists supplement their prediction programs with empirical data. These include the DNA sequences inferred from known human proteins, which must surely be reflected somewhere in the human genome; EST sequences, the snippets of RNA transcripts captured from living human cells; homology searches; and a direct comparison of the human with the mouse genome. The combination of all these data helps predict the exons in the human genome that may be parts of human genes.
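The open-reading-frame search described above, in its simplest form, amounts to scanning for an ATG and walking forward three bases at a time until a stop triplet appears. The sketch below works on one strand of a made-up sequence and, as the paragraph notes, would be badly confused by real intron-riddled genes; the minimum-length cutoff is an arbitrary choice for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Scan one strand of DNA for open reading frames: an ATG start
    codon followed, in the same reading frame, by a stop codon.

    A bare-bones sketch of the method: real gene finders must also
    handle the opposite strand, introns, and statistical models of
    what coding sequence looks like.
    """
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # walk forward in triplets until a stop sign appears
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append(dna[start:i + 3])
                break
    return orfs

# A made-up stretch of sequence containing one plausible reading frame
print(find_orfs("CCATGAAATTTGGGTAACC"))  # ['ATGAAATTTGGGTAA']
```

Even this toy version shows why the method overlooks genes: an exon that lacks its own ATG, or is interrupted by an intron before its stop codon, simply never registers.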
The human genome is more than a mere list of protein parts; it also embodies the program for the operation of the human cell. The program lies in control sequences of DNA that are placed upstream of the genes and are recognized by the gene transcription complexes. Molecular biologists have identified many of these control sequences, but there are doubtless many more to be discovered. The mouse genome is again of help. When the mouse genome, suitably reorganized so as to correspond to the somewhat different arrangement of human chromosomes, is laid alongside the human genome, the regions of conserved DNA sequence that do not correspond to exons may be control sequences.
Besides exons and control sequences, there is a third category of similarities between the mouse and human genome, Venter has said. Its nature remains unknown, though any DNA sequence that nature has found worth conserving for 100 million years must be important. One possibility is that there are many more RNA-making genes to be discovered. Most genes make proteins, but many of the cell's most vital pieces of machinery are made largely of RNA molecules. These include the spliceosome that edits RNA transcripts and the ribosome that translates the transcripts into proteins. RNA molecules are known to perform certain other roles, such as silencing one of the two X chromosomes in a woman's cells, so that the dosage of genes from the X will be the same as in a man's cells. RNA molecules can twist up into elaborate 3-D shapes, as do proteins, and can also catalyze chemical reactions. Given nature's propensity for using whatever is at hand, the full extent of RNA genes may not yet be known.
While computational biologists continue to refine their identification of human genes, a new branch of biology called proteomics has emerged. In one sense proteomics is just the study of proteins on a genome-wide scale, giving plain protein chemists the chance to style themselves proteomicists. But ingenious new methods are enabling biologists to study all or many of the proteins from a cell en masse. One technique, involving advanced mass spectrometry, allows all the proteins to be identified. Another, known as the yeast two-hybrid system, shows which proteins interact with one another in the cell, an important first step to deciding what an unknown protein does.
Study of the genome and its proteins lays the groundwork for understanding the living human cell, which in turn is the basis for understanding all human disease. Although enormous progress has been made since the beginning of molecular biology in 1953, many essential features of the mechanism are only dimly understood. The human body is thought to contain around 100 trillion cells. All are the progeny of a single fertilized egg but each, by some still mysterious alchemy, has morphed into its specialist adult role.
There are about 260 different known types of human cell, with doubtless more to be discovered. All these cells share the same genome but must make use of it in different ways. Most of the genes must be permanently switched off, or the cell would be in chaos. Presumably each type of cell uses a common set of housekeeping genes and a suite of genes reserved for its use alone. But no one yet knows how to analyze the genome in terms of which sections are designated for use by a kidney cell and which by cells of skin or lung or liver. Nor is it yet clear how in the process of development each type of cell is assigned its own character and pattern of gene expression.
Most of the body's cells are probably in a state of intense molecular activity. The proteins of a cell are constantly interacting with one another, in effect performing complex calculations, the outcome of which is a decision sent to the nucleus to turn a gene on or off. Messages from both neighboring and faraway cells continually arrive at the cell's surface, bearing instructions that the receptor proteins in the cell's outer membrane convey to proteins in the interior and thence to the nucleus. Inside the nucleus, as the result of all the internal calculations and external messages, an array of transcription factors is copying genes, maybe hundreds or thousands every second.
This elaborate activity all takes place within a minute space. Imagine the smallest speck of dust you can see. This speck is the size of five average-sized human cells. These minuscule corpuscles are the protean clay of which the body is sculpted.
Within the cell are various compartments, of which the most prominent is the nucleus, which occupies a mere 10 percent of the cell's volume. The nucleus is the protected residence of the genome. In the rest of the cell, a fluid-filled compartment known as the cytoplasm, are other specialized structures such as the ribosomes, which manufacture the cell's proteins, and the mitochondria, the cell's energy production units. The mitochondria were once free-living bacteria that billions of years ago were captured and enslaved by animal cells; the mitochondria have their own, much degraded genome, a little circle of DNA containing a mere 16,569 nucleotide pairs.
Inside the nucleus the genome is packaged in the form of twenty-three pairs of chromosomes, each consisting of a single giant DNA molecule wrapped in the special proteins that protect and manage it. Despite the minuscule volume of the cell nucleus, the chromosomes are sizable objects. If fully stretched out, chromosome 1, the longest, would measure about 8.5 centimeters (3.35 inches). The forty-six chromosomes in the nucleus would stretch for just over 7 feet if laid end to end. It is an extraordinary feat of engineering for nature to have packed a 7-foot tape into so tiny a volume, yet still allow the cell unfettered access to all the parts it needs. "DNA stores 10¹¹ gigabytes per cubic centimeter; it's almost the greatest molecular packing density one could expect to get at the molecular level," says Randal W. Scott, chairman of Incyte Genomics.
Though it will be the work of decades to understand this miniature miracle of biological computing and construction, the process of translating knowledge about the genome into medical advances need not wait and has indeed already begun.
So who was the winner of the great race to sequence the human genome? Venter had accomplished much of what he set out to do, although, as discussed below, there was a serious inadequacy in the Science article describing his interpretation of the genome. He produced a very useful, though not complete, version of the human genome by February 2001, the date when most scientists could get access to at least parts of it. Without Venter's fierce competition, the consortium might well have continued on its original trajectory of providing a complete genome (complete, that is, apart from the heterochromatin DNA, which cannot at present be sequenced) by 2005, Watson's original target date.
Venter had not only brought forward the availability of the human genome by four years, he had also sequenced and assembled the mouse genome, an invaluable aid to interpreting the human genome. All in all, it was a spectacular achievement that validated the extraordinary risks that he and Michael Hunkapiller had taken in designing the project and that Tony White, chief executive officer of Applera, had taken in backing it.
But the consortium had also succeeded. Its draft version of the genome, published at the same time as Celera's, was surprisingly comparable even though Venter had been able to use both his own data and the consortium's whereas the consortium had only had its own. The consortium could thus claim a share of Venter's credit, although it had borrowed some of Venter's methods, such as the use of paired-end reads (having the sequencing machines read both ends of DNA fragments of known length).
The consortium bore the organizational burden of being spread among centers in six countries, although the brunt of its effort was borne by John Sulston at the Sanger Centre near Cambridge in England and Robert Waterston's center at Washington University, Saint Louis, later joined by Eric Lander at the Whitehead Institute in Cambridge, Massachusetts. It was also dependent on, and at times held back by, the uneven flow of government funding. Despite all these impediments, the consortium was an outstanding technical success that remained on schedule and within budget, not so common a record for a government project.
The high noon duel with Venter was in any case subsidiary to the consortium's final goal, that of producing the entire genome sequence by the revised target date of 2003. Both sides' versions of the genome contained numerous gaps, although these were mostly in the regions of repetitive DNA and probably contained few genes. The consortium had committed itself to close every gap, except those in the heterochromatin, by 2003. Although Venter too planned to improve his sequence, he had not made the same commitment to completeness.
Initial reactions from some researchers who had examined both genome versions suggested that Celera's was "more accurate, easier to read and more complete than the rival version." With twice as much sequence data to draw on as the consortium, that was not surprising. And Celera desperately needed a sequence that was in some way better than its rival's, which was available for free. Venter's secret weapon was his mouse genome sequence, available only to his subscribers, but the consortium was working busily to prepare its own mouse sequence, and its members determined that not only the genome but every important interpretive tool should be freely available for all. The consortium was also busily working to improve its own human genome sequence. With two moving targets, judgments as to a winner were difficult and likely to prove temporary.
The inadequacy in Celera's Science article, which consortium scientists were quick to seize on, lay in Venter's strange neglect of his own shotgun method. He had his computer assemble the human genome by two different methods. One was the whole shotgun assembly and the other was a hybrid method that relied not only on the consortium's data but also on knowledge of the BACs' positions on their chromosomes, a borrowing that took direct advantage of the consortium's method. With two versions of the genome in hand, Venter then ignored the shotgun genome and based all his further gene identification and genome analysis on the hybrid version genome.
The consortium's scientists, once they had digested Celera's paper, were furiously contemptuous of it. Venter had claimed all along, with much swagger and braggadocio, that he would produce the better version of the human genome, yet the version he preferred in his Science paper leaned heavily not only on the consortium's genome data but also on its method. In the consortium's view, that was tantamount to someone downloading the consortium's work, adding a little DNA data of his own, and claiming the result as his own superior product.
Even the genome version Celera prepared by its whole genome shotgun method was not wholly independent of the consortium. Because of Venter's decision to switch his sequencing machines over to the mouse genome at the earliest possible moment, he had far less human data than originally planned and needed to borrow data from the consortium. But the data the consortium made publicly available in GenBank was partially assembled. Gene Myers, Celera's software designer, had previously asked the consortium for its raw reads of the C. elegans genome but had been rebuffed. Figuring the consortium centers were unlikely to give him their human reads either, Myers decided to download the assembled human data and artificially shred it in his computer into 500-base-length fragments.
The shredding would remove the many misassemblies he suspected existed in the consortium's genome. But he needed the artificial 500-base reads to assemble themselves across any gap in his own shotgun assembly. So he shredded the data twice, with the second set of shreds overlapping the first by exactly half a length.
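The double-shredding idea can be sketched in a few lines of code. This is an illustrative reconstruction, not Celera's actual software: the function name, the fixed 500-base read length, and the treatment of the final short fragment are all assumptions made for clarity.

```python
def shred(sequence, read_len=500):
    """Cut an assembled sequence into artificial fixed-length 'reads'.

    Two passes are made. The second pass is offset by half a read
    length, so every cut point in the first pass is spanned by a read
    from the second, letting the fragments link back up in assembly.
    """
    reads = []
    # First pass: consecutive fragments of read_len bases each
    # (the final fragment may be shorter).
    for start in range(0, len(sequence), read_len):
        reads.append(sequence[start:start + read_len])
    # Second pass: the same cuts shifted by half a read length, so
    # each of these reads overlaps two adjacent first-pass reads.
    for start in range(read_len // 2, len(sequence), read_len):
        reads.append(sequence[start:start + read_len])
    return reads
```

In this sketch a 1,200-base sequence yields three first-pass reads and two offset reads; each offset read straddles one of the first pass's cut points, which is what allows the shredded data to reassemble.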
The consortium biologists now had a powerful stick with which to beat Celera. Led by Lander, they charged that their shredded data would have reassembled itself in Celera's computer, injecting the positional information of the BAC-by-BAC method into Celera's whole genome shotgun, and that the shotgun would have failed without it.
"The WGS was a flop. No ifs ands or buts," Lander wrote in an e-mail referring to the whole genome shotgun. "Celera did not independently produce a sequence of the genome at all. It rode piggyback on the HGP. I have no objection to their using the data for their commercial databases, but it does seem odd to publish a paper reporting a sequence of the genome based largely (>60%) on other people's data."
Phil Green, a computational biologist at the University of Washington, Seattle, and author of Phred and Phrap, two widely used genome programs, had arrived independently at the same conclusion. "I think basically they could not have done the human genome using a whole genome shotgun and I think they realized that at some point, which is why they depended so heavily on the public data," he said.
Olson, who had forecast catastrophic problems for the whole genome shotgun in June 1998, maintained he had been right all along. He had predicted 100,000 serious gaps, and indeed there were 120,000 gaps in the shotgun version of Celera's genome. "Celera wanted to have it both ways," he said. "There is no question you get more data quicker with a shotgun, but you do it at the price of ending up with a mess on your hands that is unmanageable. Venter in June 1998 claimed that this was not a quick and dirty approach, that it would produce a sequence that met or exceeded best current standards. That claim was absurd at the time and remains absurd."
The refrain was taken up by Collins, the consortium's leader. "Careful analysis leads most observers I have consulted to conclude that a pure whole genome shotgun approach is unworkable for a genome of the size and repeat density (50%) of the human," he wrote.
Venter was outraged at his critics' attacks, which, despite all previous experience to the contrary, he and his colleagues seem not to have expected. To have assembled two versions of the genome, both dependent to different degrees on the consortium's data, and then to have chosen the more dependent version, was not the most brilliantly conceived strategy for persuading his rivals to cry uncle. Indeed it positively invited them to cry foul.
His choice had been shaped by the competing needs of his commercial and academic goals. To sign up clients for Celera's database, he needed the best possible genome sequence. He explained that since the hybrid version of the genome contained about 2 percent more of the DNA sequence, he decided to analyze that instead of the shotgun version.
In that case, a totally independent shotgun version of the genome would have seemed essential for an article in the scientific literature written in explicit competition with the consortium's. "We didn't anticipate that anyone would say it didn't work," Myers said when asked why he didn't prepare a shotgun version of the genome using only Celera's data. "It didn't cross my mind. All we would have gotten would be to prove a point. It seemed silly to waste a lot of money on a point of pride, that's all there is to it."
Myers says he did not use any positional information from the shredded data, which reassembled itself only across gaps determined by the scaffolds in the shotgun. If so, there is probably little merit to Lander's charge that the whole genome shotgun would have failed without the public data; it would just have been a little less complete.
Myers and Venter also note that they have assembled the mouse genome, which is very similar in size and structure to the human genome, by the shotgun approach and using only Celera's data. The results with the mouse genome are very similar to the shotgun version of the human genome, further proof, they say, that the shotgun worked. To make certain, though, they started a 20,000-hour computer run to assemble the human genome using only Celera's data, just to prove the point they would have been wise to prove the first time around.
The shotgun method was a clear success with the fruit fly genome; but whether it worked better than the BAC-by-BAC method for the larger and more complex human genome may not be clear for a long time, if ever. Much depends on the yardstick applied. For completing the last possible base in the human genome, the consortium's method will probably be better: Celera is not even trying to settle the position of every base. But Celera's combined mouse and human database may prove a winner.
Should the Nobel Foundation's medical committee decide to award its prize for the human genome, it will have a hard choice. The Swedish jurors usually shy away from messy situations where credit is disputed. But there would be a neat way for them to sidestep the issue. Their prize, which can be awarded to a maximum of three people, could be split between Venter, Sulston and Waterston on the grounds that Venter sequenced the Haemophilus genome, a feat that revolutionized microbiology, while Sulston and Waterston laid the basis for sequencing animal and human genomes by completing the genome of C. elegans. And those three probably contributed most to sequencing the human genome, although the drama included several other important players, starting with Fred Sanger.
Prizes and wrangling aside, both sides, it could be said, had accomplished extraordinary feats. Each set a high goal and achieved it, with their competition serving the public interest. In such a contest, there could be no losers.
Copyright © 2001 by Nicholas Wade