DNA Sequencing
DNA encodes information as a string of nucleotides. Each nucleotide is one of four types: A (Adenine), T (Thymine), C (Cytosine), or G (Guanine). The ability to read DNA is critical in biology, and one of the field's most important recent developments has been inexpensive, accurate DNA sequencing. Dozens of techniques have been developed to read DNA sequences; however, none of them reads DNA perfectly. There is always some amount of error. Knowing the error rate of a given sequencing technology lets us decide how much confidence to place in its results.
Part 1: Small Sequence
Let's assume we have a sequencing technology with the following conditional readout table, which lists the probability of each actual nucleotide type given the nucleotide the machine reports (X). For example, if we read a single nucleotide and the sequencing machine prints out a "C", there is a 0.88 probability that the nucleotide is actually a C and a 0.04 probability that the nucleotide is a G.
Reading X | P(A|X) | P(T|X) | P(C|X) | P(G|X) |
X="A" | 0.94 | 0.02 | 0.02 | 0.02 |
X="T" | 0.02 | 0.94 | 0.02 | 0.02 |
X="C" | 0.04 | 0.04 | 0.88 | 0.04 |
X="G" | 0.03 | 0.03 | 0.03 | 0.91 |
When sequencing a DNA molecule, we will assume the identity of each nucleotide is an independent event (for example, a "C" at position n has no influence on the nucleotide at position n+1).
You run a sample of DNA through your sequencing machine and it reads back "ATGCAG". Answer the following questions to three decimal places. Try not to round too much during your derivations.
- What is the probability that the DNA sequence is actually ATGCAG?:
- What is the probability that the DNA sequence is not ATGCAG?:
- What is the probability that the sequence is ATGTAG?:
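One way to check answers to the questions above is to multiply the per-position table entries, which is valid only under the independence assumption stated earlier. The following is a minimal sketch, not a prescribed solution: the names `READOUT` and `sequence_probability` are hypothetical, and the table is transcribed directly from Part 1.

```python
# Readout table from Part 1:
# READOUT[x][n] = P(actual nucleotide is n | machine reported x).
READOUT = {
    "A": {"A": 0.94, "T": 0.02, "C": 0.02, "G": 0.02},
    "T": {"A": 0.02, "T": 0.94, "C": 0.02, "G": 0.02},
    "C": {"A": 0.04, "T": 0.04, "C": 0.88, "G": 0.04},
    "G": {"A": 0.03, "T": 0.03, "C": 0.03, "G": 0.91},
}


def sequence_probability(read, actual):
    """P(true sequence is `actual` | machine output is `read`).

    Assumes each position is independent, so the joint probability
    is the product of the per-position conditional probabilities.
    """
    if len(read) != len(actual):
        raise ValueError("read and actual must have the same length")
    p = 1.0
    for x, n in zip(read, actual):
        p *= READOUT[x][n]
    return p


read = "ATGCAG"
p_same = sequence_probability(read, "ATGCAG")     # read is fully correct
p_not_same = 1.0 - p_same                         # complement: at least one error
p_variant = sequence_probability(read, "ATGTAG")  # one specific alternative

print(f"P(ATGCAG)     = {p_same:.3f}")
print(f"P(not ATGCAG) = {p_not_same:.3f}")
print(f"P(ATGTAG)     = {p_variant:.3f}")
```

Rounding is applied only when printing, so no precision is lost in the intermediate product.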
Part 2: Larger Reads
Modern-day sequencing techniques can have average raw accuracies (across all nucleotides) at or above 99.0%.
- If the single-nucleotide reading for a sequencing technique is on average 99.0% accurate, what is the probability that the output sequence generated in reading a 100-nucleotide molecule is correct?:
- If the single-nucleotide reading for a sequencing technique is on average 99.9% accurate, what is the probability that the output sequence generated in reading a 100-nucleotide molecule is correct?:
- If the single-nucleotide reading for a sequencing technique is on average 99.99% accurate, what is the probability that the output sequence generated in reading a 100-nucleotide molecule is correct?:
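The three questions above share one calculation: if every nucleotide is read correctly with the same probability and errors are independent (the assumption carried over from Part 1), then the chance that an entire read is correct is the per-nucleotide accuracy raised to the read length. A small sketch, with the helper name `full_read_accuracy` chosen here purely for illustration:

```python
def full_read_accuracy(per_base_accuracy, length):
    """Probability that all `length` nucleotides are read correctly.

    Assumes independent errors and the same accuracy at every position.
    """
    return per_base_accuracy ** length


for accuracy in (0.99, 0.999, 0.9999):
    p = full_read_accuracy(accuracy, 100)
    print(f"{accuracy:.2%} per nucleotide -> {p:.3f} for a 100-nucleotide read")
```

Because the result decays exponentially with read length, small gains in per-nucleotide accuracy translate into large gains in whole-read accuracy.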
Under the right conditions, modern-day Sanger sequencing can have raw accuracies of 99.999%, which means it can read long sequences of DNA with high confidence.