
Genomic Puzzles Old and New

T. Ryan Gregory


Among the most startling discoveries in the history of genomics were these findings:

  • Fifty years ago, we learned that the total amount of DNA contained within a eukaryotic genome is independent of the complexity of the organism in which it is found. (Eukaryotes are all organisms whose cells contain nuclei and organelles.)
  • Five years ago, we learned that gene number and organismal complexity are likewise largely disconnected.
  • Both of the above findings are intriguing puzzles, not paradoxes, which when resolved will lead to improved understanding of the form, function, and evolution of genomes.

August 2006

Are common genomic characteristics universal?
Genomes are highly diverse.

“Most problems have either many answers or no answer. Only a few problems have a single answer.” Edmund C. Berkeley


Section of DNA. The bases lie horizontally between the two spiraling strands. Source: Wikimedia Commons.

The search for genetic differences among people represents one of the most active areas of research made possible by the completion of the human genome sequence. Yet the notion that there is such a thing as “the human genome” carries with it the implication that there are fundamental genomic characteristics that are universal among all members of a species. The most obvious of these relate to the quantity and arrangement of genetic material. We now know that:

  • One copy of a human’s genome contains about 3.5 picograms (pg, or 10⁻¹² grams) of DNA packaged into 23 chromosomes.
  • Chimpanzees, the closest living relatives of Homo sapiens, carry around a slightly heavier genome (3.75 pg) apportioned into 24 chromosomes.
  • An aardvark genome, by contrast, is contained within only 10 chromosomes but weighs in at 5.8 pg.1
DNA content is constant within a species.

The basic idea that the amount of DNA per chromosome set might be consistent across cells within bodies and among individuals within species was hinted at as early as 1885. An explicit “DNA constancy hypothesis,” however, was not developed until the mid-20th century,2 stemming from a 1948 report of “a remarkable constancy in the nuclear DNA content of all the cells in all the individuals within a given animal species,”3 which was interpreted as evidence in favour of DNA, rather than proteins, as the molecule responsible for inheritance.

The DNA constancy hypothesis

In the simplest terms, the DNA constancy hypothesis that emerged in the late 1940s and early 1950s consisted of two central ideas:

This hypothesis emerged in the mid 1900s.
  • The amount of DNA per chromosome set within an individual organism is constant.
  • The DNA content of a single set of chromosomes is largely invariant among members of the same species.
It remains an important idea in modern genomics.

The underlying notion of DNA constancy persists more than a half-century later, even though there are interesting exceptions to both of these postulates (which are beyond the scope of this article). In fact, DNA constancy is an important assumption in modern genome size research, because the two dominant methods of DNA quantification both rely on the use of standards of “known” DNA content for certain conversions.4-6

The C-value paradox

It is due to its constancy that the amount of DNA contained within a haploid chromosome set is commonly referred to as the “C value,” a term coined by Hewson Swift in 1950.7 One year later, scientists provided the first taxonomically broad survey of C values and noted that:

More complex species don’t necessarily have more DNA.

Comparing the largest and one of the smallest examples among vertebrates, one finds that a cell of Amphiuma, a urodele, contains 70 times as much DNA as is found in a cell of the domestic fowl, a far more highly developed animal. It seems most unlikely that Amphiuma contains 70 times as many different genes as does the fowl or that a gene of Amphiuma contains 70 times as much DNA as does one in the fowl. To make a somewhat different comparison: a cell of Amphiuma contains 170 times as much DNA as does a cell of a relatively closely related animal, the trigger fish, whereas a cell of the latter contains only nine times as much DNA as does a cell of a sponge, which is far removed phylogenetically from any vertebrate.8

This finding was deemed a confusing paradox.

It is not difficult to understand why observations such as these engendered considerable confusion for the next two decades. As C. A. Thomas put it in 1971, “It was argued that mammals display a greater developmental complexity than primitive fish, therefore, they must have more genes, yet why should the lower forms have more DNA, if DNA is the chemical basis of the gene?”9 To early researchers this seemed downright paradoxical—and indeed, Thomas dubbed the disconnect between genome size and organismal complexity the “C-value paradox.”

Even species with similar complexity exhibit the so-called “paradox”.

The C-value paradox has traditionally been described in three different ways:

  • More complex organisms do not always have larger genomes than simpler ones. “The quantity of DNA does not seem to be related to the number of genes, for the amount of DNA does not increase unequivocally with the complexity and number of hereditary characters.”10

  • Any given genome seems to contain more DNA than would be needed for the predicted gene number. “One of the problems of eukaryotic genetics is that higher organisms possess much more DNA in their genome than they are likely to need as genetic information.”11

  • Some closely related species exhibit divergent DNA contents. “The paradox is the fact that organisms at the same general level of morphological complexity, which presumably have the same genetic requirements, nevertheless often have genomes whose DNA contents differ by orders of magnitude.”12

Consider, for example, the reported genome sizes versus semi-subjective notions of complexity for some well-known organisms:

The human genome is not the largest genome.
  • Nematode worm (Caenorhabditis elegans): 0.1 pg
  • Thale cress (Arabidopsis thaliana): 0.16 pg
  • Fruit fly (Drosophila melanogaster): 0.18 pg
  • Pufferfish (Takifugu rubripes): 0.4 pg
  • Rice (Oryza sativa): 0.5 pg
  • Human (Homo sapiens): 3.5 pg
  • Leopard frog (Rana pipiens): 6.7 pg
  • Onion (Allium cepa): 16.75 pg
  • Mountain grasshopper (Podisma pedestris): 16.9 pg
  • Tiger salamander (Ambystoma tigrinum): 32 pg
  • Easter lily (Lilium longiflorum): 35.2 pg
  • Marbled lungfish (Protopterus aethiopicus): 132 pg
Humans have less DNA than a tiny salamander.
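
The scale of these differences is easier to see when each C-value is expressed relative to the human figure. The following is a minimal sketch in Python using only the values listed above. The conversion of roughly 978 million base pairs per picogram is a commonly used approximation that does not appear in this article and is included here as an assumption; the function names are illustrative.

```python
# C-values (pg of DNA per haploid genome) as listed in the article.
C_VALUES_PG = {
    "Nematode worm": 0.1,
    "Thale cress": 0.16,
    "Fruit fly": 0.18,
    "Pufferfish": 0.4,
    "Rice": 0.5,
    "Human": 3.5,
    "Leopard frog": 6.7,
    "Onion": 16.75,
    "Mountain grasshopper": 16.9,
    "Tiger salamander": 32,
    "Easter lily": 35.2,
    "Marbled lungfish": 132,
}

# Assumption (not from the article): 1 pg of DNA is roughly 978 megabase pairs.
MBP_PER_PG = 978


def relative_to_human(c_values):
    """Return each genome size as a multiple of the human C-value."""
    human = c_values["Human"]
    return {name: pg / human for name, pg in c_values.items()}


def to_megabases(pg):
    """Convert picograms to approximate megabase pairs."""
    return pg * MBP_PER_PG


ratios = relative_to_human(C_VALUES_PG)
# Print species from smallest to largest genome.
for name, pg in sorted(C_VALUES_PG.items(), key=lambda item: item[1]):
    print(f"{name:22s} {pg:7.2f} pg  ~{to_megabases(pg):8.0f} Mbp  {ratios[name]:6.2f}x human")
```

On these figures, the marbled lungfish genome is roughly 38 times the size of the human genome, while the nematode genome is about one thirty-fifth of it.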

The human genome, it turns out, is thoroughly average in size for a mammal and significantly smaller than that of various plants, amphibians, insects, and even some single-celled protozoa. Some authors apparently found this revelation bruising to the human ego, as reflected in this complaint:

Being a little chauvinistic toward our own species, we like to think that man is surely one of the most complicated species on earth and thus needs just about the maximum number of genes. However, the lowly liverwort has 18 times as much DNA as we, and the slimy, dull salamander known as Amphiuma has 26 times our complement of DNA. To further add to the insult, the unicellular Euglena has almost as much DNA as man.13

Noncoding DNA and the end of the paradox

Today, C-value differences are no longer paradoxical.

In spite of its label, the “paradox” was not so much the lack of a correlation with complexity, per se, but rather the inability of early researchers to reconcile the constancy of DNA content within species (which occurs because it is the stuff of genes) with the variation in quantity of DNA among species (which does not relate to the number of genes). Today, the solution to the paradox is widely recognized: Most eukaryotic DNA does not code for proteins, so there is no reason to expect a complex organism to have a large genome or a simple organism to have a small one.

To put it succinctly, the C-value paradox vanished the moment geneticists abandoned the concept of the genome consisting of the genes, all the genes, and nothing but the genes.

The real puzzle lies in “excess” or noncoding DNA.

Stanley K. Sessions may have said it best 20 years ago when, in a review of the influential volume The Evolution of Genome Size,14 he pointed out that:

The C-value paradox is the observation that genome size does not correspond to the amount of DNA needed for protein-coding functions. This observation is a paradox only under the expectation that genome size should be equal or proportional to gene number and should therefore increase with “organismal complexity.” This paradox has literally disappeared with the discovery that genomes contain “excess” (largely repetitive) DNA that is not transcribed into functional products. Thus it is no longer mysterious that salamanders (for example) have larger genomes than humans. The origin and precise function of the “excess” DNA (which may constitute more than 99% of the genomic DNA) remains an unsolved problem, but it is not a paradox.15

Comparatively modest in size though it is, the human genome provides an excellent illustration of the overwhelming abundance of noncoding DNA and thus the solution to the old “C-value paradox.” In 2001, the International Human Genome Sequencing Consortium revealed that each copy of the human genome consists of the following:16

The human genome has been sequenced and analyzed.
  • 1.5% protein-coding genes
  • 25.9% introns (noncoding regions within gene sequences)
  • 20.4% long interspersed nuclear elements (LINEs), including 516,000 copies of the transposable element known as LINE-1
  • 13.1% short interspersed nuclear elements (SINEs), including 1,090,000 copies of the Alu element
  • 2.9% DNA transposons (mobile DNA elements)
  • 8.3% long terminal repeat (LTR) retrotransposons (transposons copied from RNA and flanked by repeated sequences)
  • 5% segmental duplications
  • 3% simple sequence repeats
  • 11.6% miscellaneous unique sequences
  • 8% miscellaneous compacted DNA, or heterochromatin
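
The breakdown above can be tallied directly. The sketch below, in Python, sums the listed percentages and the share made up of transposable elements. Grouping LINEs, SINEs, DNA transposons, and LTR retrotransposons together as transposable elements is an interpretation of the list, and the dictionary keys and function name are illustrative.

```python
# Percentages of the human genome as listed in the article.
HUMAN_GENOME_PERCENT = {
    "protein-coding genes": 1.5,
    "introns": 25.9,
    "LINEs": 20.4,
    "SINEs": 13.1,
    "DNA transposons": 2.9,
    "LTR retrotransposons": 8.3,
    "segmental duplications": 5.0,
    "simple sequence repeats": 3.0,
    "miscellaneous unique sequences": 11.6,
    "heterochromatin": 8.0,
}

# Assumption: these four categories are treated together as transposable elements.
TRANSPOSABLE = ("LINEs", "SINEs", "DNA transposons", "LTR retrotransposons")


def total(categories=None):
    """Sum the listed percentages for the given categories (all if None)."""
    names = HUMAN_GENOME_PERCENT if categories is None else categories
    # Round to one decimal place to hide floating-point noise.
    return round(sum(HUMAN_GENOME_PERCENT[name] for name in names), 1)


all_listed = total()
transposable = total(TRANSPOSABLE)
not_protein_coding = round(all_listed - HUMAN_GENOME_PERCENT["protein-coding genes"], 1)

print(f"All listed categories:       {all_listed}%")
print(f"Transposable elements:       {transposable}%")
print(f"Other than protein-coding:   {not_protein_coding}%")
```

The listed categories account for 99.7% of the genome, of which 44.7% falls in the four transposable element categories and only 1.5% in protein-coding sequence.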

The C-value enigma

The C-value “enigma” is more apt and indicates a complex puzzle.

As Wendell L. Willkie once quipped, “a good catchword can obscure analysis for 50 years.” Despite its obvious obsolescence, and in a clear case of linguistic inertia taking precedence over scientific precision, the term “C-value paradox” continues to enjoy widespread use, often with confusion and miscommunication as the outcome. Variation in genome size is not the least bit paradoxical, but as Sessions and many others have noted, it remains a long-standing puzzle in need of resolution. As an alternative to the outdated term “C-value paradox,” which tends to inspire one-dimensional attempts at explanation, the new term “C-value enigma” has been offered in its place.17-19

As an enigma—a complex puzzle—the issue of genome size variation can be explicitly divided into several component questions, each of which must be answered if a complete understanding is to be achieved:

More intriguing questions about DNA remain.
  • What are the sources of all this noncoding DNA?
  • In what proportions are different types of noncoding DNA represented in the genomes of different species?
  • By what mechanisms is noncoding DNA gained and lost over evolutionary time?
  • What are the phenotypic implications, or in some cases perhaps even functions, of noncoding DNA?
  • Why are the genomes of some species, such as nematodes or rice, streamlined while others, such as those of lungfishes or lilies, are positively enormous?

Unraveling the enigma

While a great deal of work remains to be conducted in terms of each of the component questions of the C-value enigma, research spanning the past 50 years—from the origin of the DNA constancy hypothesis to the modern era of complete genome sequencing—has revealed many important insights regarding the nature and impacts of noncoding DNA. Among the most notable are these findings:

We now have new insights into noncoding DNA.
  • A very large fraction of many eukaryotic genomes is composed of “genomic parasites” in the form of transposable elements; in humans, nearly half of the genome consists of such “selfish DNA.” Moreover, large genomes contain a larger proportion of transposable elements and a lower proportion of protein-coding genes than smaller genomes.

  • The abundances and/or lengths of several types of both single-copy and repetitive noncoding DNA appear to increase along with genome size, including all types of transposable elements, introns, microsatellites (repetitive short nucleotide sequences), and ribosomal RNA genes. The amplification and loss of these sequence types varies, suggesting that there may be a general mechanism for DNA content modulation that applies across the genome.

  • Mechanisms exist that are capable of increasing or decreasing genome size over both short and long evolutionary timescales. For example, duplicative transposition of transposable elements and small- and large-scale duplications (from single genes to entire genomes) can add DNA to genomes, sometimes in large amounts and often very rapidly in evolutionary terms. Other processes can either add or remove DNA at a range of scales, such as the insertion or deletion of one or a few nucleotides during DNA replication, recombination events leading to the addition or loss of chromosome segments, and gains or losses of entire chromosomes.

Genome size may increase or decrease through evolution.
  • Genome size correlates positively with nucleus and cell size, and negatively with cell division rate, in a wide range of cell types and organisms. The preponderance of the evidence indicates that genome size exerts a causative influence on these cellular parameters.

  • Depending on the biology of the group in question, the cell-level effects of genome size variation may result in correlations between DNA content and body size, metabolic rate, developmental rate, organ complexity, geographical distribution, and ecological niche.

A new paradox?

The human genome contains a mere 20,000 to 25,000 genes.

Most of the early discussion surrounding the C-value paradox was predicated on the assumption that gene number and organismal complexity would be closely linked. In light of the extraordinary complexity of its bearer, the human genome in particular was expected to contain an exceptionally high number of protein-coding genes. Prior to the completion of the draft genome sequence, 100,000 genes was a common estimate; as it turns out, the human genome contains a mere 20,000 to 25,000 genes.20 Comparing this with the more than 3,000,000 copies of transposable elements present in each human genome, including more than one million copies of the SINE Alu, it is no wonder that W. Ford Doolittle once suggested, only partly facetiously, that our genomes “might be ironically viewed as vehicles for the replication of Alu sequences.”21

An examination of the genomes of other species shows that, like genome size, gene number is a poor predictor of organismal complexity:

Rice has twice as many genes as humans.
  • Fruit fly (Drosophila melanogaster): 13,500 genes
  • Nematode worm (Caenorhabditis elegans): 20,000 genes
  • Human (Homo sapiens): 20,000 to 25,000 genes
  • Pufferfish (Takifugu rubripes): 21,000 genes
  • Thale cress (Arabidopsis thaliana): 25,500 genes
  • Rice (Oryza sativa): 40,000 to 50,000 genes
Such findings led to the G-value “paradox”.
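
Setting these gene counts beside the C-values given earlier shows how loosely the two quantities track each other. The sketch below, in Python, computes a crude "genes per picogram" figure for the six species that appear in both lists. Using the midpoint of a reported range (for human and rice) is an assumption made here for illustration, as is the choice of genes per picogram as a measure; neither comes from the article.

```python
# Gene counts as listed in the article; ranges are stored as (low, high).
GENE_COUNTS = {
    "Fruit fly": (13_500, 13_500),
    "Nematode worm": (20_000, 20_000),
    "Human": (20_000, 25_000),
    "Pufferfish": (21_000, 21_000),
    "Thale cress": (25_500, 25_500),
    "Rice": (40_000, 50_000),
}

# C-values (pg per haploid genome) for the same species, from the earlier list.
C_VALUES_PG = {
    "Fruit fly": 0.18,
    "Nematode worm": 0.1,
    "Human": 3.5,
    "Pufferfish": 0.4,
    "Thale cress": 0.16,
    "Rice": 0.5,
}


def genes_per_pg(species):
    """Return estimated genes per picogram of haploid DNA."""
    low, high = GENE_COUNTS[species]
    midpoint = (low + high) / 2  # assumption: midpoint of the reported range
    return midpoint / C_VALUES_PG[species]


# Rank species from most to least gene-dense.
ranked = sorted(GENE_COUNTS, key=genes_per_pg, reverse=True)
for species in ranked:
    print(f"{species:14s} {genes_per_pg(species):10.0f} genes per pg")
```

On these figures the nematode has the most genes per picogram and the human the fewest of the six, a difference of about thirtyfold that reflects noncoding DNA rather than gene number.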

As with C-values, this observation has been the source of significant surprise among genome researchers. “How can our own supremely sophisticated species be governed by just 50% to 100% more genes than the nematode worm?” some wondered.22 Following the same formula as with genome size (simplistic expectation + contradictory data = “paradox”), this disparity between gene number and complexity has been labeled as the “G-value paradox” or “N-value paradox.”23-25

The G-value enigma

How an organism is constructed is puzzling.

Perhaps it should go without saying that the G-value “paradox,” like its C-value predecessor, is not paradoxical at all. What the data currently emerging from comparative genomics indicate is that the mechanisms by which the genome specifies the construction of an organism are complex and, for the time being, puzzling: a “G-value enigma.” And, like the C-value enigma, this new puzzle is most likely to be solved when the pieces are clearly delineated. In this case, some of the pertinent questions include these:

  • By what mechanisms are genes regulated, and how does this contribute to the high diversity of tissues constructed from a low number of genes? The recent suggestion of a second, nongenic “code” in DNA based on the positions of packaging structures called nucleosomes provides an exciting example of the sorts of discoveries that will be forthcoming in this area.26
What roles does noncoding DNA play?
  • What roles, if any, does noncoding DNA play in the link between genome and phenotype? Insights from the study of genome size in general, such as those described above, are directly relevant to this issue, as are other influences such as the position and configuration of DNA, the level of DNA compaction, and other such non-genic factors.

  • In what ways do interactions among genes account for the emergence of complex wholes from a relatively limited number of parts?

  • How many different protein products can a single gene region encode through such processes as alternative splicing, and to what extent could this explain the diverse protein products that can result from even a relatively simple protein-encoding genome?

Solving the enigma is a step toward understanding genomic form, function, and evolution.

Future perspectives

Although they may not yet be recognized explicitly as parts of a larger puzzle, each of the component questions in the G-value enigma is the subject of an increasing amount of study. To the extent that co-opted transposable elements play a role in gene regulation, that other noncoding DNA influences gene expression, that introns are involved in alternative splicing, and that bulk DNA content exerts an impact on cellular and organismal phenotypes, it is clear that the C-value and G-value enigmas are themselves part of an overarching quest to understand the form, function, and evolution of genomes. To advance this cause, a few key steps might be taken by the scientific community:

There are ways to improve the study of genomes.
  • Consider findings that contradict simplistic assumptions about genomes—most notably that one or a few linear genomic parameters should determine the complexity of organisms—as exciting challenges, rather than framing them as “paradoxical.”

  • Think of genomes as complex biological entities with their own inherent properties and evolutionary histories.

  • Characterize both the coding and noncoding components of genomes and their relative proportions in complete sequencing projects.

Problems with many answers are more stimulating than paradoxes.
  • Create greater linkages between researchers who study genome size (the C-value enigma) and those dealing with the sequences and functions of genes (the G-value enigma), and make a stronger effort to combine insights derived from the study of each of the major groups of living things and to move well beyond the current cast of model organisms.

The lesson from the past 50 years, and the most productive guiding principle for the next phase of genomic science, is that genomes are complex and strongly resistant to one-dimensional explanations. Put more simply, those wishing to shed light on the causes and consequences of genomic variation at any level should bear the following in mind: Paradoxes are frustrating, but clearly defined puzzles are stimulating.

T. Ryan Gregory, Ph.D., is assistant professor in the Department of Integrative Biology at the University of Guelph, Ontario, Canada. Gregory has been the recipient of several prestigious scholarships, fellowships, and awards, including the 2003 Howard Alper Postdoctoral Prize, the 2005 McMaster Alumni Association Arch Award, and the 2006 American Society of Naturalists Young Investigators’ Prize. Gregory’s primary research interests include large-scale genome evolution, biologic and genomic diversity, and macroevolution. He created the Animal Genome Size Database in 2001 and published The Evolution of the Genome (Elsevier) in 2005.


New DNA code discovered

In his New York Times article, “Scientists Say They’ve Found a Code Beyond Genetics in DNA,” Nicholas Wade describes a newly discovered second code in DNA in addition to the genetic code.

Genome size

Genome size refers to the total amount of DNA contained in one copy of a genome.


C-value is the amount of DNA in a single set of chromosomes (i.e., the haploid DNA content) of an organism.


The G-value is the total number of genes in a genome.

Genome Databases

Genome size data are currently available for more than 10,000 species of animals, plants, and fungi and are freely available through the following databases:

Genome sequences

Updated lists of archaeal, bacterial, and eukaryotic genome sequencing projects are available online from the following:

Read a book online

Beginner’s Guide to Molecular Biology is a good introduction to the basics of cells, DNA, and molecular engineering for the lay public, students, and teachers.

Genome size discussion forum

Discuss genome size and other genomic topics.


Teaching Resources from the Northwest Association for Biomedical Research (NWABR)

The Northwest Association for Biomedical Research (NWABR) strengthens public trust in research through education and dialogue. Its diverse membership spans academic, industry, non-profit research institutes, health care, and voluntary health organizations. Through membership and extensive education programs, it fosters a shared commitment to the ethical conduct of research and ensures the vitality of the life sciences community.

Advanced Bioinformatics: Genetic Research
This curriculum unit explores how bioinformatics is used to perform genetic research. Students examine DNA sequences from different animal species, investigate the relationship between protein structure and function, and explore evolutionary relationships among eukaryotic organisms. Throughout the unit, students are presented with a number of career options in which the tools of bioinformatics are developed or used.

Original lesson

This lesson has been written by a science educator to specifically accompany the above article. It includes article content and extension questions, as well as activity handouts for different grade levels.

Lesson Title: Bioinformatics
Levels: high school (honors/AP) - undergraduate (year 1)
Summary: Students use inquiry skills to make and test predictions about genes and their corresponding proteins, understand the use of bioinformatics programs, and pursue their own studies of genes and proteins of interest to them.


Useful links for educators

  • » Genomics Analogy Model for Educators (GAME)
    The GAME website is a tool for high school science teachers and higher education instructors who teach genomics but who do not have a molecular biology background. Useful analogies and resources are available for teachers to use in their classroom.
  • » BioInteractive
    Biointeractive is a website and a collection of biology-focused teaching materials created by the Howard Hughes Medical Institute. Many materials are available to educators for free and can be ordered from the catalog. The site’s information and links are also useful to high school seniors and college-level students.

Useful links for student research

Refer also to the “learn more” and “get involved” links above.

  1. Gregory, T. R. 2006. Animal Genome Size Database. (accessed Aug. 22, 2006)
  2. Swift, H. 1950a. The desoxyribose nucleic acid content of animal nuclei. Physiological Zoology 23: 169-198.
  3. Vendrely, R., and C. Vendrely. 1948. La teneur du noyau cellulaire en acide désoxyribonucléique à travers les organes, les individus et les espèces animales: Techniques et premiers résultats. Experientia 4: 434-436.
  4. Price, H. J., and J. S. Johnston. 1996. Analysis of plant DNA content by Feulgen micro-spectrophotometry and flow cytometry. In P. Jauhar (ed). Methods of Genome Analysis in Plants, pp. 115-132. Boca Raton, FL: CRC Press.
  5. Hardie, D. C., T. R. Gregory, and P. D. N. Hebert. 2002. From pixels to picograms: A beginners’ guide to genome quantification by Feulgen image analysis densitometry. Journal of Histochemistry and Cytochemistry 50: 735-749.
  6. DeSalle, R., T. R. Gregory, and J. S. Johnston. 2005. Preparation of samples for comparative studies of arthropod chromosomes: Visualization, in situ hybridization, and genome size estimation. Methods in Enzymology 395: 460-488.
  7. Swift, H. 1950b. The constancy of desoxyribose nucleic acid in plant nuclei. Proceedings of the National Academy of Sciences of the USA 36: 643-654.
  8. Mirsky, A. E., and H. Ris. 1951. The desoxyribonucleic acid content of animal cells and its evolutionary significance. Journal of General Physiology 34: 451-462.
  9. Thomas, C. A. 1971. The genetic organization of chromosomes. Annual Review of Genetics 5: 237-256.
  10. Vendrely, R. 1955. The deoxyribonucleic acid content of the nucleus. In E. Chargaff and J. N. Davidson (eds). The Nucleic Acids, volume 2, pp. 155-180. New York: Academic Press.
  11. MacLean, N. 1973. Suggested mechanism for increase in size of the genome. Nature, New Biology 246: 205-206.
  12. Gall, J. G. 1981. Chromosome structure and the C-value paradox. Journal of Cell Biology 91: 3s-14s.
  13. Comings, D. E. 1972. The structure and function of chromatin. Advances in Human Genetics 3: 237-431.
  14. Cavalier-Smith, T (ed). 1985. The Evolution of Genome Size. Chichester, UK: John Wiley & Sons.
  15. Sessions, S. K. 1986. Thoughts on genome size: The controversy continues. Cell 45: 473-474.
  16. International Human Genome Sequencing Consortium. 2001. Initial sequencing and analysis of the human genome. Nature 409: 860-921.
  17. Gregory, T. R. 2001. Coincidence, coevolution, or causation? DNA content, cell size, and the C-value enigma. Biological Reviews 76: 65-101.
  18. Gregory, T. R. 2004. Macroevolution, hierarchy theory, and the C-value enigma. Paleobiology 30: 179-202.
  19. Gregory, T. R. 2005. Genome size evolution in animals. In T. R. Gregory (ed). The Evolution of the Genome, pp. 3-87. San Diego, CA: Elsevier.
  20. International Human Genome Sequencing Consortium. 2004. Finishing the euchromatic sequence of the human genome. Nature 431: 931-945.
  21. Doolittle, W. F. 1997. Why we still need basic research. Annals of the Royal College of Physicians and Surgeons of Canada 30: 76-80.
  22. Harrison, P. M., A. Kumar, N. Lang, M. Snyder, and M. Gerstein. 2002. A question of size: The eukaryotic proteome and the problems in defining it. Nucleic Acids Research 30: 1083-1090.
  23. Claverie, J.-M. 2001. What if there are only 30,000 human genes? Science 291: 1255-1257.
  24. Betrán, E., and M. Long. 2002. Expansion of genome coding regions by acquisition of new genes. Genetica 115: 65-80.
  25. Hahn, M. W., and G. A. Wray. 2002. The g-value paradox. Evolution & Development 4: 73-75.
  26. Segal, E., Y. Fondufe-Mittendorf, L. Chen, A. Thastrom, Y. Field, I. K. Moore, J.-P. Z. Wang, and J. Widom. 2006. A genomic code for nucleosome positioning. Nature advance online publication (July 19, 2006). DOI: 10.1038/nature04979.

