How I Learned to Stop Worrying and Love My Exome

In the spring of 2012, 23andMe offered customers the opportunity to participate in a pilot program to have their exome sequenced. The exome is the minor fraction – around 2% – of the human genome that encodes proteins. Exome sequencing is a transitional technology that has an excellent chance of finding variants responsible for inherited conditions at a significantly lower cost than whole-genome sequencing. As the cost of whole-genome sequencing falls, exome sequencing will probably disappear, but in the spring of 2012, it looked like an interesting offer.

In exome sequencing, the protein-coding portions of the human genome are selected through hybridization to an array of synthetic oligonucleotides representing all known exons, then sequenced using next-gen sequencing. 23andMe offered customers raw data without any interpretation. We were cautioned not to expect any user-friendly guides like the ones provided for SNP surveys. The price for this limited-time offer was $1,000.

My prior experience with having my genome analyzed had been outstanding. I found out about a medically actionable condition (predisposition to hemochromatosis), learned about my Neandertal ancestry, and gained a wealth of raw data to check as I read the biomedical literature. My experience helped me to add personal interest to my teaching and public outreach activity. I wasn’t sure that I had the skills to analyze the raw data from my exome sequence, but I thought that once I had my exome sequence, I would be highly motivated to learn how to analyze it. I had invested far more than $1,000 in other aspects of my education, and those investments had always paid off. I signed up.

The sample kit arrived quickly, but it seemed like forever before the results were ready (it was actually only four months). I followed the instructions for the elaborate download process designed to protect my genomic privacy. Despite having been promised no analysis whatsoever, 23andMe provided a limited analysis of the results in an accompanying PDF file. You can download my exome analysis here.

The first figure from the results is shown below.

[Figure: reads_to_variants]

Part A of the figure shows how many bases were called as a result of the exome sequencing. Some of the sequence data fails a quality filter, is duplicate data, or is off target, but after all that, there were nearly 3 billion bases of on-target exome sequence. This gives some idea of the extent to which coverage of my exome is overlapping, because there are 3 billion base pairs in the (haploid) human genome. Taking into account that I carry two genomes, one from each parent, there is about 25x coverage of the 2% of my genome that encodes proteins.
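For readers who want to check the arithmetic, here is a minimal sketch of the coverage estimate. It uses only the rounded figures quoted above (3 billion on-target bases, a 3 billion bp haploid genome, 2% coding); the variable names are mine, and the result is a back-of-the-envelope number, not anything 23andMe reported.

```python
# Rough coverage estimate from the rounded figures quoted in the text.
on_target_bases = 3e9      # "nearly 3 billion" on-target bases (rounded up)
haploid_genome_bp = 3e9    # size of the haploid human genome
exome_fraction = 0.02      # "around 2%" of the genome encodes proteins

# Size of one (haploid) copy of the exome: about 60 million bp.
haploid_exome_bp = haploid_genome_bp * exome_fraction

# Average number of times each haploid exome position was read.
coverage_haploid = on_target_bases / haploid_exome_bp

# I carry two copies of the exome, so each copy gets about half of that.
coverage_per_copy = coverage_haploid / 2

print(f"haploid exome size: {haploid_exome_bp:,.0f} bp")
print(f"coverage of the haploid exome: {coverage_haploid:.0f}x")
print(f"coverage per parental copy: {coverage_per_copy:.0f}x")
```

Two copies of a 60 million bp exome also account for the roughly 120 million base calls mentioned for Part B of the figure.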

Part B of the figure shows that the vast majority of my exome matches the reference sequence. Almost all of the over 120 million base calls are the same as the reference human genome assembly. The tiny sliver of red on top of the yellow bar represents my variants.

Part C of the figure breaks down my variants into two classes, SNPs and indels. The single nucleotide polymorphisms (SNPs) are like those from my original 23andMe analysis, but they include every variant discovered in my exome, including any not previously described in the analysis of human genomes. The standard analysis of 600,000 SNPs using 23andMe’s SNP chip detects only the variants built into the chip, all of which were described before the chip was designed. My exome sequence includes all variants, including previously unknown or even “private” variants confined to me alone. Indels (insertions or deletions) are sites of variation where parts of my genome differ from the reference assembly by the insertion or deletion of one or more bases.

There are many variants in my exome. Leaving out the small fraction that doesn’t pass quality checks, I have about 100,000 SNPs that differ from the reference sequence and almost 10,000 indels. The 100,000 SNPs seen in my exome are only a small fraction of the 6,000,000 SNPs that differ between any two people, because there are more SNPs per unit of DNA in the noncoding parts of the genome than in the exome. Variation in coding sequences can potentially result in changes to protein sequence, so much of the variation that arises by mutation in coding sequences is removed over time from the population by selection. Most variation in noncoding sequences is neutral (neither selected for nor selected against).

The exome analysis provided by 23andMe characterizes the variants in my exome by their impact on gene function, as shown in the graph below.

[Figure: characterized_variants]

In this graph, high impact variants are frameshift mutations, splice site variants, loss or gain of stop codons, and loss of start codons. In a frameshift mutation, there is an insertion or deletion of a number of bases that is not a multiple of three. Because bases are read three at a time during translation, insertion or deletion of one or two bases will cause translation downstream of the variant to take place in a different reading frame, resulting in a radically altered protein sequence that is likely to have a premature stop codon. Splice site variants may interfere with the correct splicing of RNA (removal of introns), which could cause catastrophic changes to the sequence of the encoded protein. Stop codon gain (a nonsense mutation) will cause a truncated protein product missing all amino acids downstream of the position of the new stop codon, while stop codon loss will cause a protein to have an extension of new amino acids following what is normally the end of the protein. Loss of the start codon might eliminate the protein entirely, or result in a protein with a gain or loss of amino acids at the beginning of the protein, resulting from using another start codon.

Variants of moderate impact include nonsynonymous substitutions and codon insertions and deletions. A nonsynonymous substitution will change a single amino acid in a protein sequence to a different amino acid. Codon insertion or deletion will add or remove one or more amino acids without altering the sequence of the protein downstream of the variant.
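The difference between a frameshift and a codon insertion or deletion comes down to whether the length change is a multiple of three. The sketch below illustrates that rule. The function name and the VCF-style `ref`/`alt` allele strings are my own assumptions, not part of the 23andMe report, and the rule is simplified: it ignores where the indel falls relative to codon boundaries and splice sites.

```python
def classify_indel(ref: str, alt: str) -> str:
    """Classify a coding indel by whether it preserves the reading frame.

    ref and alt are the reference and variant allele strings,
    e.g. ref="A", alt="AT" for a one-base insertion (assumed convention).
    """
    length_change = abs(len(alt) - len(ref))
    if length_change == 0:
        return "not an indel"
    if length_change % 3 == 0:
        # Whole codons added or removed: downstream reading frame is kept.
        return "codon insertion/deletion (moderate impact)"
    # One or two bases (or any non-multiple of three): frame is shifted.
    return "frameshift (high impact)"


print(classify_indel("A", "AT"))      # one base inserted
print(classify_indel("A", "ACAG"))    # three bases inserted
print(classify_indel("ACAGT", "A"))   # four bases deleted
```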

Variants of low impact include synonymous substitutions, synonymous stops, and start gains. The genetic code is degenerate, which means that there are multiple codons that encode the same amino acid. Most amino acids are encoded by multiple synonymous codons. Substitution of a synonymous codon would not be expected to have any impact on gene or protein function. Similarly, there are three stop codons. Substitution of one stop codon for another would not be expected to make any difference. In a start gain, there is a new start codon upstream of the original one, but the original start codon can still be used to encode a normal protein.
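To make the idea of synonymous codons concrete, here is a small sketch that builds the standard genetic code and labels a single-codon substitution. The function name and the category labels are my own shorthand for the classes described above. It does not handle start codon loss or gain, which depend on the codon's position in the transcript and not only on the codon itself.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order
# (third position varies fastest); "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon: str, alt_codon: str) -> str:
    """Label a codon change by its effect on the encoded amino acid."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous stop" if ref_aa == "*" else "synonymous"
    if alt_aa == "*":
        return "stop gain"
    if ref_aa == "*":
        return "stop loss"
    return "nonsynonymous"


print(classify_substitution("GAA", "GAG"))  # Glu -> Glu
print(classify_substitution("GAA", "AAA"))  # Glu -> Lys
print(classify_substitution("TGG", "TGA"))  # Trp -> stop
print(classify_substitution("TAA", "TAG"))  # stop -> stop
```

The degeneracy is visible in the table itself: 61 codons are shared among 20 amino acids, and leucine alone has six.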

Variants of unknown impact are those outside of the coding region. Some noncoding sequence (introns, 5′ UTR, 3′ UTR, and adjacent sequence that is not transcribed) is sequenced in exome sequencing. It is difficult to predict whether any of these variants would affect gene function.

Most of my variants (80%) have unknown impact and are likely neutral. Variants of low impact are not expected to influence gene function and can be ignored. Only variants of moderate or high impact are likely to have an effect on gene function, so we should examine these more closely.

Another way of analyzing variation is to determine how rare the variants are among the many human genomes that have been sequenced. The 1000 Genomes Project is an NIH-sponsored program that has cataloged variants in several thousand human genomes, allowing us to distinguish between relatively common (presumably benign) variants and rare variants that have either arisen recently or been subject to negative selection because they are deleterious.

The graph below shows the frequency of my variants as seen among the people sampled for the 1000 Genomes Project.

[Figure: rare_variants]

Most of my variants (95,660) are common, with allele frequencies of 5% or higher. About 3% of my variants have allele frequencies between 1% and 5%, while 1.4% have allele frequencies below 1%. About 8.5% of my variants are unknown, meaning that while they are already present in public databases, their allele frequency has not been calculated. About 4% of my variants are novel, meaning that they are not found in public databases.
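The frequency classes can be written down as a simple rule. This is a sketch under my own assumptions: the function name, the argument names, and the label "1-5%" for the middle bin are mine, and I have guessed how the boundaries at exactly 1% and 5% are handled, since the report does not say.

```python
from typing import Optional


def frequency_class(in_database: bool, allele_frequency: Optional[float]) -> str:
    """Assign a variant to one of the frequency classes described above.

    allele_frequency is a fraction (0.05 means 5%), or None if the variant
    is in public databases but no frequency has been calculated.
    """
    if not in_database:
        return "novel"
    if allele_frequency is None:
        return "unknown"
    if allele_frequency >= 0.05:
        return "common"
    if allele_frequency >= 0.01:
        return "1-5%"
    return "rare"


print(frequency_class(True, 0.25))
print(frequency_class(True, 0.0004))
print(frequency_class(True, None))
print(frequency_class(False, None))
```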

Evaluating both the likely impact of each variant and the allele frequency of each variant allows the analysts to filter all of my exome variation to find variants of high or moderate impact that are novel, unknown, or rare, as shown in the decision tree below.

[Figure: variant_filter]

Of my variants predicted to have high or moderate impact, 249 are rare (allele frequency less than 1%), 1107 have unknown allele frequencies, and 669 are novel. That is a lot of variation to consider, so the last filter is to check the variants against a list of 592 genes associated with inherited disorders. The last check against the gene list produces 15 uncommon variants of high or moderate impact in the gene list associated with inherited disorders. There are 1761 uncommon variants of high or moderate impact not associated with genes on the list of inherited disorders.
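The filtering logic can be sketched in a few lines. This shows the idea only and is not 23andMe's pipeline: the `Variant` record, the function name, and the example data are hypothetical. In particular, `GENE_X`, `GENE_Y`, and `GENE_Z` are made-up placeholders, and the two-gene `disease_genes` set stands in for the 592-gene list, which I do not have.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    gene: str
    impact: str            # "high", "moderate", "low", or "unknown"
    frequency_class: str   # "common", "1-5%", "rare", "unknown", or "novel"


SERIOUS_IMPACT = {"high", "moderate"}
UNCOMMON = {"rare", "unknown", "novel"}


def filter_variants(variants, disease_genes):
    """Keep uncommon variants of high or moderate impact, then split them
    by whether their gene is on the disease gene list."""
    candidates = [
        v for v in variants
        if v.impact in SERIOUS_IMPACT and v.frequency_class in UNCOMMON
    ]
    in_list = [v for v in candidates if v.gene in disease_genes]
    not_in_list = [v for v in candidates if v.gene not in disease_genes]
    return in_list, not_in_list


# Illustrative records only; GENE_X, GENE_Y, and GENE_Z are placeholders.
variants = [
    Variant("MLH1", "moderate", "rare"),
    Variant("TTN", "moderate", "rare"),
    Variant("GENE_X", "high", "novel"),
    Variant("GENE_Y", "low", "rare"),
    Variant("GENE_Z", "moderate", "common"),
]
disease_genes = {"MLH1", "TTN"}  # stand-in for the 592-gene list

in_list, not_in_list = filter_variants(variants, disease_genes)
print([v.gene for v in in_list])
print([v.gene for v in not_in_list])
```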

My exome report presents the 15 variants in a series of tables, which I have condensed into the single table below.

All 15 variants in the table below are nonsynonymous coding variants, which means that they change the amino acid sequence of the protein encoded by the gene.

As you would expect from the rarity of the variants, I am not homozygous for any of them; in each case, I am heterozygous for the variant in question. For two of the genes (NEB and TTN), I am heterozygous for two variants, but we can’t tell the phase from the sequencing technique used. This means that I might have one copy of NEB that lacks both rare variants with a second copy that carries both, or alternatively, each of my two copies of NEB could carry a different rare variant. The same is true for TTN.
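The two possible arrangements are usually called cis (both variants on the same copy) and trans (one variant on each copy). The short sketch below enumerates them for my two NEB variants; the function names are my own, and the code only restates the logic of the paragraph above.

```python
def phase_configurations(variant_a: str, variant_b: str) -> dict:
    """Return the two possible arrangements of two heterozygous variants
    across the two copies of a gene (each copy is a set of variants)."""
    return {
        "cis": ({variant_a, variant_b}, set()),   # both on one copy
        "trans": ({variant_a}, {variant_b}),      # one on each copy
    }


def copies_without_rare_variants(configuration) -> int:
    """Count gene copies that carry neither rare variant."""
    return sum(1 for copy in configuration if not copy)


configs = phase_configurations("V181I", "I287V")
for name, configuration in configs.items():
    print(name, copies_without_rare_variants(configuration))
```

In the cis arrangement one copy of the gene is free of rare variants; in the trans arrangement neither copy is. That is why phase would matter if both variants were harmful and the condition were recessive.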

Not all amino acid substitutions are equivalent. Some amino acid substitutions are conservative, meaning that they replace an amino acid with a chemically similar amino acid that is unlikely to alter the function of the protein. Other amino acid substitutions are nonconservative, substituting an amino acid with one that is dissimilar. For the technically minded reader, I should say that I used the BLOSUM62 substitution matrix (1) to determine whether a change is conservative or nonconservative. Not all conservative substitutions are harmless, and not all nonconservative substitutions affect protein function. Each amino acid position in a particular protein has its own properties.
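For the technically minded, here is a sketch of how such a classification can be done. Two caveats: the scores below are a hand-transcribed subset of BLOSUM62 covering only the amino acid pairs among the variants listed in this post, so check them against the published matrix before relying on them, and the cutoff (a positive score counts as conservative) is my assumption for this example. With that cutoff, the pairs come out with the same labels I give for my variants.

```python
# Subset of BLOSUM62 scores (hand-transcribed; the matrix is symmetric).
BLOSUM62_SUBSET = {
    ("T", "M"): -1, ("E", "Q"): 2, ("D", "N"): 1, ("E", "K"): 1,
    ("R", "W"): -3, ("R", "C"): -3, ("V", "M"): 1, ("G", "D"): -1,
    ("V", "I"): 3, ("G", "S"): 0, ("G", "R"): -2, ("V", "F"): -1,
    ("P", "S"): -1,
}


def blosum62_score(ref_aa: str, alt_aa: str) -> int:
    """Look up a score, trying both orders because the matrix is symmetric."""
    if (ref_aa, alt_aa) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(ref_aa, alt_aa)]
    return BLOSUM62_SUBSET[(alt_aa, ref_aa)]


def classify_change(aa_change: str) -> str:
    """Classify a change written like 'T122M' (reference, position, variant).

    Assumed cutoff: a score above zero counts as conservative.
    """
    ref_aa, alt_aa = aa_change[0], aa_change[-1]
    score = blosum62_score(ref_aa, alt_aa)
    return "conservative" if score > 0 else "nonconservative"


for change in ["T122M", "E2588Q", "V134M", "G108D", "I287V", "G1012S"]:
    print(change, classify_change(change))
```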

Finally, it is possible to assess the likely outcome if each variant resulted in a complete loss of protein function, because we have the information from the OMIM entry. This will tell us whether heterozygotes (carriers) of a defective allele have an altered phenotype. This is perhaps the most important assessment here. I carry variants of moderate impact in three genes that have a semidominant impact on disease (MLH1, MSH2, and PTCH1). It turns out that there is no direct evidence that any of the particular variants that I carry are pathogenic, as I detail below.

Uncommon variants of high or moderate impact in disease genes for Paul Szauter (23andMe Exome Pilot)

| Symbol | Name | OMIM link¹ | dbSNP ID² | AA change (conservative?)³ | 1K Genomes frequency⁴ | Effect on phenotype⁵ |
|---|---|---|---|---|---|---|
| BCKDHA | branched chain keto acid dehydrogenase E1, alpha polypeptide | 608348 | rs34442879 | T122M (nonconservative) | 0.00560 | recessive |
| CDH23 | cadherin-related 23 | 605516 | rs41281338 | E2588Q (conservative) | 0.00670 | recessive |
| EVC | Ellis van Creveld syndrome | 604831 | rs41269549 | D184N (conservative) | 0.00090 | recessive |
| GLE1 | GLE1 RNA export mediator homolog (yeast) | 603371 | rs138310419 | E334K (conservative) | 0.00460 | recessive |
| HSPG2 | heparan sulfate proteoglycan 2 | 142461 | rs114851469 | R2977W (nonconservative) | 0.00340 | recessive |
| ITGB4 | integrin, beta 4 | 147557 | rs145976111 | R977C (nonconservative) | 0.00140 | recessive |
| MLH1 | mutL homolog 1, colon cancer, nonpolyposis type 2 (E. coli) | 120436 | rs35831931 | V134M (conservative) | 0.00040 | semidominant |
| MSH2 | mutS homolog 2, colon cancer, nonpolyposis type 1 (E. coli) | 609309 | rs4987188 | G108D (nonconservative) | 0.00910 | semidominant |
| NEB | nebulin | 161650 | rs149881695 | V181I (conservative) | 0.00100 | recessive |
| NEB | nebulin | 161650 | N/A | I287V (conservative) | 0.00480 | recessive |
| PLG | plasminogen | 173350 | rs4252129 | R523W (nonconservative) | 0.00320 | recessive |
| PTCH1 | patched 1 | 601309 | rs113663584 | G1012S (nonconservative) | 0.00090 | semidominant |
| SP110 | SP110 nuclear body protein | 604457 | rs149485401 | G185R (nonconservative) | 0.00690 | recessive |
| TTN | titin | 188840 | rs33917087 | V2777F (nonconservative) | 0.00830 | recessive |
| TTN | titin | 188840 | rs55980498 | P13977S (nonconservative) | 0.00240 | recessive |

¹ Link to the gene page in OMIM. Links at the top of that page direct you to entries on inherited diseases.

² Link to the SNP entry at dbSNP. Note that there is no entry for NEB-I287V, a previously unknown variant.

³ Amino acid substitutions are classified as conservative or nonconservative using the BLOSUM62 substitution matrix (1).

⁴ Frequency of the variant from the 1000 Genomes Project, cited by 23andMe as of 08/26/2011.

⁵ From the OMIM gene and disease entries. This is the mode of inheritance of known pathogenic alleles, and does not imply that any of the variants shown are pathogenic.

Before going through the variants, given that I am an optimist, I have to make the glass-half-full argument first. My results showed that I did not carry variants of high or moderate impact in 579 of the 592 genes, so I got a 97.8% on my exome exam as graded by 23andMe. That’s good news. I would like to be able to get the gene list from 23andMe eventually.

The phenotypes produced by pathogenic variants in the 13 genes for which I carry variants of moderate impact are terrible, which is why they are on the list of genes responsible for inherited disease. Please bear in mind as you read these grim summaries that my health is excellent, and I am not affected by any of these conditions.

BCKDHA: This gene encodes an enzyme necessary for amino acid metabolism, specifically, the degradation of products of isoleucine, leucine, and valine catabolism. Individuals homozygous for defects in this gene have maple syrup urine disease (MSUD), which despite its funny name is a devastating childhood illness that results in physical and mental retardation if untreated. Because the disease has a devastating impact and can be treated through dietary modification, it is on the list of diseases for which newborns are screened through the analysis of blood chemistry. The disease phenotype is completely recessive, so even if this allele results in a complete loss of enzyme function, it should have no impact on the health of carriers like me.

CDH23: Mutations in this gene are associated with autosomal recessive deafness, specifically with Usher syndrome. This inherited condition is completely recessive. My variant allele is a conservative amino acid substitution not likely to lead to a loss of gene function.

EVC: Mutations in this gene are associated with recessive skeletal dysplasia with short limbs, polydactyly, and dental abnormalities. Mutations are completely recessive. My variant allele is a conservative amino acid substitution not likely to lead to a loss of gene function.

GLE1: Mutations in this gene are associated with lethal congenital contracture syndrome, with most cases associated with death around the time of birth. Mutations are completely recessive. My variant allele is a conservative amino acid substitution not likely to lead to a loss of gene function.

HSPG2: Mutations in this gene are associated with dyssegmental dysplasia, a recessive lethal form of neonatal dwarfism. The three specific variants known to cause the disease are an 89 bp duplication that causes a frameshift and two different stop codon gain mutations. Other alleles are associated with Schwartz-Jampel syndrome, a milder dwarfism syndrome that is completely recessive. One of the variants associated with Schwartz-Jampel syndrome is an amino acid substitution, C1532Y. It is not clear whether individuals homozygous for the allele that I carry, R2977W, would be affected.

ITGB4: Mutations in this gene are associated with epidermolysis bullosa, a recessive skin blistering disorder. One individual homozygous for G931D had a history of blistering and hair loss beginning in childhood when examined at age 68. One patient with a lethal form of the disease was homozygous for C61Y. Nonlethal cases include homozygotes or compound heterozygotes for R252C, C562R, and R1281W. No cases involving the allele that I carry, R977C, have been observed, so it is not clear whether individuals homozygous for this allele would be affected.

MLH1: This mutation was an immediate source of alarm to me when I obtained my results. The MLH1 gene encodes a DNA repair enzyme. Mutations in this gene are associated with dominant hereditary predisposition to colon cancer. Heterozygotes (carriers) of some alleles are at increased risk of colon cancer, while rare homozygotes or compound heterozygotes for two loss of function alleles develop colon cancer or other tumors early in life. There is no direct evidence that the V134M allele that I carry (a conservative substitution) is associated with cancer predisposition, although apart from the amino acid substitution that it causes, it is classified as a mutation in an exonic splicing enhancer that may cause the mutant exon to be skipped (2).

MSH2: Like the MLH1 variant, this one was an immediate source of concern, because MSH2 also encodes a DNA repair enzyme. Mutations in MSH2 are associated with dominant hereditary predisposition to colon cancer. Heterozygotes (carriers) of some alleles are at increased risk of colon cancer, while rare homozygotes or compound heterozygotes for two loss of function alleles develop colon cancer or other tumors early in life. There is no direct evidence that the G108D allele that I carry (a nonconservative substitution) is associated with cancer predisposition, although apart from the amino acid substitution that it causes, it is classified as a mutation in an exonic splicing enhancer that may cause the mutant exon to be skipped (2). Taking the results for MLH1 and MSH2 together, I might be more concerned had I obtained these results in my 20s. I am 58. I had a colonoscopy at 55 as part of routine medical care, and learned that everything is fine. In this case, my medical history suggests that the variant MLH1 and MSH2 alleles that I carry are not harmful.

NEB: Mutations in NEB are associated with nemaline myopathy, a recessive disorder characterized by hypotonia (low muscle tone), generally evident at birth. My two rare alleles (V181I and I287V) are both conservative substitutions, while alleles associated with the disease are typically frameshifts, nonsense mutations, or the loss of splice sites. My normal phenotype with respect to muscle tone shows that I am not, nor will I be, affected by these variants.

PLG: Mutations that inactivate plasminogen are associated with ligneous (“wood-like”) conjunctivitis of the eye, and similar lesions of other mucous membranes. The nonconservative substitution that I carry has not been associated with recessive plasminogen deficiency. In any case, carriers of known pathogenic alleles are not affected.

PTCH1: Mutations in PTCH1 are associated with holoprosencephaly, a recessive developmental abnormality of the forebrain that causes mental retardation and craniofacial abnormalities. Mutations in PTCH1 are also associated with susceptibility to cutaneous basal cell carcinoma (skin cancer caused by sun exposure). The G1012S variant (a nonconservative substitution) that I carry is not known to be pathogenic.

SP110: Mutations in SP110 are associated with recessive hepatic venoocclusive disease with immunodeficiency and recessive susceptibility to tuberculosis. The G185R allele that I carry is not known to be pathogenic.

TTN: The TTN gene encodes titin, a giant muscle protein that is an essential component of striated muscle. Defects in TTN are associated with recessive cardiomyopathy, limb-girdle muscular dystrophy, and other muscle defects. The two nonconservative substitutions that I carry (V2777F and P13977S) are not known to be pathogenic. Because of the size of this protein and because it is a structural protein rather than an enzyme, a huge number of variants are known, most not associated with any phenotype.

Getting my exome sequenced was an excellent learning experience for me. First, I got a fairly clean bill of genomic health (I have learned to stop worrying and love my variants that are not known to be pathogenic). Second, I now have a more personal sense of how vast the genome is, how little we really know, and how many genetic variants are still novel in the average genome. Finally, I donated my exome results to the Personal Genome Project, which will be the subject of my next entry in the not-too-distant future.

I might have wished for a bit more information from 23andMe, but they actually delivered more than they promised. I’d like to have their list of 592 disease genes, and I’d like to have the list of genes that correspond to my 1761 uncommon variants of high or moderate impact that don’t match their 592 disease genes. Finally, I look forward to the day when 23andMe opens up exome sequencing to its customers again and refines its analysis pipeline based on its experience with the pilot.

References

1. Henikoff, S, and JG Henikoff (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89: 10915-10919.

2. Doss, CGP and R Sethumadhavan (2009) Investigation on the role of nsSNPs in HNPCC genes – a bioinformatics approach. J Biomed Sci. 16: 42.
