Rocky Mountain College * Department of Computer Science * 406 208 3193

Deconvolving Sequence Variation in Mixed DNA Populations

by A. Wildenberg, S. Skiena and P. Sumazin, 6th Annual International Conference on Research in Computational Molecular Biology (RECOMB02), Washington DC, April 2002.



The need for DNA sequencing did not end with the successful public and private projects to sequence the human genome. Indeed, attention is shifting from de novo sequencing of new organisms to analyzing sequence variation for research and diagnostic purposes.

Contemporary electrophoresis-based sequencing machines produce curves registering the amount of each of the four nucleotide bases as a function of sequence position. For homogeneous DNA samples, the largest peak at each position defines the underlying sequence. However, more careful analysis of sequence trace data holds promise for determining the presence and frequency of mutations in inhomogeneous samples.
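The idea of reading a trace can be illustrated with a minimal Python sketch. It assumes the trace has already been reduced to one table of relative base signals per position. The names `call_bases` and `mixed_positions`, the 10% noise floor, and the toy trace are all hypothetical choices made for illustration, not the authors' method or data.

```python
BASES = "ACGT"

def call_bases(trace):
    """Consensus call: take the base with the largest signal at each position."""
    return "".join(max(BASES, key=lambda b: column[b]) for column in trace)

def mixed_positions(trace, noise_floor=0.10):
    """Return indices where the second-largest base exceeds the noise floor.

    noise_floor is an assumed measurement error on relative frequency.
    """
    hits = []
    for i, column in enumerate(trace):
        total = sum(column[b] for b in BASES)
        if total == 0:
            continue  # no signal at this position
        ranked = sorted((column[b] / total for b in BASES), reverse=True)
        if ranked[1] > noise_floor:
            hits.append(i)
    return hits

# Toy trace (invented values): position 1 carries a C/T mixture.
trace = [
    {"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
    {"A": 0.02, "C": 0.70, "G": 0.02, "T": 0.26},
    {"A": 0.01, "C": 0.01, "G": 0.97, "T": 0.01},
]
consensus = call_bases(trace)
mixed = mixed_positions(trace)
```

For a homogeneous sample, `mixed_positions` would return an empty list. A position is flagged only when a secondary peak rises above the assumed measurement error.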

In this paper, we look at the problem of using sequence trace data to identify sequence variants in mixed DNA populations. Our work is motivated by a new line of capillary electrophoresis sequencing machines being developed by BioPhotonics Corporation. By using advanced single-photon detectors and other technologies, BioPhotonics can not only detect each base but also determine its relative frequency at each position to within 10%, and expects to reduce this error to 1% in the near future.

This motivates a variety of questions concerning how accurately we can sequence mixed populations from a single sample using relative frequency information. Possible applications of this technology include:

  • Frequency of Acquired Mutations --

    Perhaps the greatest promise of modern genomics is that of individualized medicine, where an individual's genetic composition is established and analyzed to select the best course of treatment. New technologies such as microarrays offer promise for obtaining sequence and expression data on an individual scale. Microarray studies of leukemia and breast cancer tissues have demonstrated that cancer subtypes, and with them the prognosis for survival under various treatments, can be accurately diagnosed on the basis of genomic data.

    Such microarray studies will continue to help develop our understanding of gene expression and disease. However, the technologies used for widespread diagnostic tests may well be different, to minimize costs and increase robustness. Indeed, a major goal of BioPhotonics' efforts is to develop smaller, cheaper DNA sequencing machines, with the vision of placing them in doctors' offices for diagnostic applications.

    Particularly important for many medical applications is the need to analyze sequence from heterogeneous genomic samples. Such mixed populations naturally arise from acquired mutations, say, in cancer, where various mutations to tumor suppressor genes such as p53 can lead to dramatically different disease courses. Extensive databases of p53 mutations are being constructed.

    In this paper, we provide simulation results demonstrating our ability to identify p53 mutations as a function of mutation frequency and sequencing accuracy.

  • SNP Generation and Analysis --

    Single-nucleotide polymorphisms (SNPs) represent an important part of sequence variation in humans. Cataloging SNPs is an important problem in contemporary sequence analysis. Here we propose a potentially high-throughput technology to catalog SNPs. A pool of DNA from m distinct individuals is assembled, with a region of interest amplified using PCR. Sequencing the resulting product and deconvolving the results will be significantly more efficient than m individual sequencing runs, provided the pooled trace can still be analyzed accurately when m is large.

    In this paper, we study the potential of this approach both theoretically and through simulation. We demonstrate that, under reasonable assumptions of polymorphism rates and error probabilities, pool sizes of over 100 people can be analyzed on a single sequencing run.

  • Viral Population Analysis --

    Viruses such as HIV evolve rapidly, and each infected patient soon comes to host a variety of different strains. Our techniques make it possible to determine the various mutations present in a sample, as well as their relative frequencies. Accurate viral population frequency data will be important for determining a patient's response to a given course of treatment and which strains respond best to a given therapy.

    In this paper, we demonstrate that accurate determination of the relative frequencies of four distinct strains can be made even in the face of base-frequency error rates up to 25%.
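As a rough illustration of what estimating strain frequencies from base-frequency data can look like, the sketch below fits mixing proportions by ordinary least squares. This is a stand-in technique chosen for brevity, not the method of the paper. It assumes the candidate strain sequences are already known and aligned, and that the observed trace has been reduced to per-position base frequencies. The function names, the four toy strains, and the noise-free observations are all hypothetical.

```python
def solve_linear(matrix, rhs):
    """Solve a small dense linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("strains cannot be told apart at these positions")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def estimate_strain_frequencies(strains, observed):
    """Least-squares fit of strain proportions to observed base frequencies.

    Model (an assumption): the frequency of base b at position i is the sum of
    the proportions of the strains that carry b at position i.
    """
    n, length = len(strains), len(observed)
    # Normal equations: gram[j][k] counts positions where strains j and k agree.
    gram = [[float(sum(strains[j][i] == strains[k][i] for i in range(length)))
             for k in range(n)] for j in range(n)]
    target = [sum(observed[i].get(strains[k][i], 0.0) for i in range(length))
              for k in range(n)]
    raw = solve_linear(gram, target)
    # Crude projection: clip negatives and renormalize so proportions sum to 1.
    clipped = [max(p, 0.0) for p in raw]
    total = sum(clipped)
    if total == 0:
        raise ValueError("no strain explains the observed frequencies")
    return [p / total for p in clipped]

# Toy, noise-free example: four invented strains mixed at 40/30/20/10 percent.
strains = ["ACGT", "ACGA", "ATGT", "GCGT"]
observed = [
    {"A": 0.9, "G": 0.1},
    {"C": 0.8, "T": 0.2},
    {"G": 1.0},
    {"T": 0.7, "A": 0.3},
]
estimates = estimate_strain_frequencies(strains, observed)
```

On this noise-free input the fit recovers the mixing proportions up to floating-point error. With measurement error on the base frequencies, the clip-and-renormalize step is only a crude approximation to a properly constrained fit.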
