How I Learned to Stop Worrying and Love My Exome

In the spring of 2012, 23andMe offered customers the opportunity to participate in a pilot program to have their exome sequenced. The exome is the minor fraction – around 2% – of the human genome that encodes proteins. Exome sequencing is a transitional technology that has an excellent chance of finding variants responsible for inherited conditions at a significantly lower cost than whole-genome sequencing. As the cost of whole-genome sequencing falls, exome sequencing will probably disappear, but in the spring of 2012, it looked like an interesting offer.

In exome sequencing, the protein-coding portions of the human genome are selected through hybridization to an array of synthetic oligonucleotides representing all known exons, then sequenced using next-gen sequencing. 23andMe offered customers raw data without any interpretation. We were cautioned not to expect any user-friendly guides like the ones provided for SNP surveys. The price for this limited-time offer was $1,000.

My prior experience with having my genome analyzed had been outstanding. I found out about a medically actionable condition (predisposition to hemochromatosis), learned about my Neandertal ancestry, and gained a wealth of raw data to check as I read the biomedical literature. My experience helped me to add personal interest to my teaching and public outreach activity. I wasn’t sure that I had the skills to analyze the raw data from my exome sequence, but I thought that once I had my exome sequence, I would be highly motivated to learn how to analyze it. I had invested far more than $1,000 in other aspects of my education, and those investments had always paid off. I signed up.

The sample kit arrived quickly, but it seemed like forever before the results were ready (it was actually only four months). I followed the instructions for the elaborate download process designed to protect my genomic privacy. Despite having been promised no analysis whatsoever, 23andMe provided a limited analysis of the results in an accompanying PDF file. You can download my exome analysis here.

The first figure from the results is shown below.


Part A of the figure shows how many bases were called as a result of the exome sequencing. Some of the sequence data fails a quality filter, is duplicate data, or is off target, but after all that, there were nearly 3 billion bases of on-target exome sequence. This gives some idea of the extent to which coverage of my exome is overlapping, because there are 3 billion base pairs in the (haploid) human genome. Taking into account that I carry two genomes, one from each parent, there is about 25x coverage of the 2% of my genome that encodes proteins.
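
For readers who like to check the arithmetic, the coverage estimate can be reproduced in a few lines of Python. This is only a back-of-the-envelope sketch using the rounded figures quoted above (3 billion on-target bases, a 3 billion base pair haploid genome, a 2% exome fraction); the function name is made up for illustration.

```python
def mean_exome_coverage(on_target_bases, haploid_genome_bp, exome_fraction, copies=2):
    """Average number of sequenced bases covering each exome position.

    copies=2 follows the reasoning in the text that a person carries two
    genomes, one from each parent.
    """
    exome_bp = haploid_genome_bp * exome_fraction * copies
    return on_target_bases / exome_bp


# Rounded figures from the text, not exact values from the report.
coverage = mean_exome_coverage(3e9, 3e9, 0.02)
print(f"about {coverage:.0f}x coverage")
```

Changing the exome fraction or the number of on-target bases shows how sensitive the estimate is to those rounded inputs.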

Part B of the figure shows that the vast majority of my exome matches the reference sequence. Almost all of the over 120 million base calls are the same as the reference human genome assembly. The tiny sliver of red on top of the yellow bar represents my variants.

Part C of the figure breaks down my variants into two classes, SNPs and indels. The Single Nucleotide Polymorphisms (SNPs) are like those from my original analysis from 23andMe, but include all variants discovered in my exome, including any variants not previously described in the analysis of human genomes. The standard analysis of 600,000 SNPs using 23andMe’s SNP chip only detects variants built into the chip; these were all described prior to the design of the chip. My exome sequence includes all variants, including previously unknown or even “private” variants confined to me alone. Indels (insertions or deletions) are sites of variation where parts of my genome differ from the reference assembly by the insertion or deletion of one or more bases.

There are many variants in my exome. Leaving out the small fraction that doesn’t pass quality checks, I have about 100,000 SNPs that differ from the reference sequence and almost 10,000 indels. The 100,000 SNPs seen in my exome are only a small fraction of the 6,000,000 SNPs that differ between any two people, because there are more SNPs per unit of DNA in the noncoding parts of the genome than in the exome. Variation in coding sequences can potentially result in changes to protein sequence, so much of the variation that arises by mutation in coding sequences is removed over time from the population by selection. Most variation in noncoding sequences is neutral (neither selected for nor selected against).

The exome analysis provided by 23andMe characterizes the variants in my exome by their impact on gene function, as shown in the graph below.


In this graph, high impact variants are frameshift mutations, splice site variants, loss or gain of stop codons, and loss of start codons. In a frameshift mutation, there is an insertion or deletion of a number of bases that is not a multiple of three. Because bases are read three at a time during translation, insertion or deletion of one or two bases will cause translation downstream of the variant to take place in a different reading frame, resulting in a radically altered protein sequence that is likely to have a premature stop codon. Splice site variants may interfere with the correct splicing of RNA (removal of introns), which could cause catastrophic changes to the sequence of the encoded protein. Stop codon gain (a nonsense mutation) will cause a truncated protein product missing all amino acids downstream of the position of the new stop codon, while stop codon loss will cause a protein to have an extension of new amino acids following what is normally the end of the protein. Loss of the start codon might eliminate the protein entirely, or result in a protein with a gain or loss of amino acids at the beginning of the protein, resulting from using another start codon.

Variants of moderate impact include nonsynonymous substitutions and codon insertions and deletions. A nonsynonymous substitution will change a single amino acid in a protein sequence to a different amino acid. Codon insertion or deletion will add or remove one or more amino acids without altering the sequence of the protein downstream of the variant.
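
The difference between a frameshift and a codon insertion or deletion comes down to whether the change in length is a multiple of three. A minimal sketch of that rule follows; the function name and the way alleles are passed in are assumptions for illustration, and the sketch ignores where the indel falls relative to codon boundaries and splice sites.

```python
def classify_coding_indel(ref, alt):
    """Classify an insertion or deletion inside a coding sequence.

    ref and alt are the reference and variant alleles as strings of bases.
    A net length change that is a multiple of three adds or removes whole
    codons; any other length change shifts the reading frame.
    """
    length_change = len(alt) - len(ref)
    if length_change == 0:
        return "not an indel"
    if length_change % 3 != 0:
        return "frameshift (high impact)"
    if length_change > 0:
        return "codon insertion (moderate impact)"
    return "codon deletion (moderate impact)"


print(classify_coding_indel("A", "AT"))      # one base inserted
print(classify_coding_indel("A", "ATTT"))    # three bases inserted
print(classify_coding_indel("ATTT", "A"))    # three bases deleted
```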

Variants of low impact include synonymous substitutions, synonymous stops, and start gains. The genetic code is degenerate, which means that there are multiple codons that encode the same amino acid. Most amino acids are encoded by multiple synonymous codons. Substitution of a synonymous codon would not be expected to have any impact on gene or protein function. Similarly, there are three stop codons. Substitution of one stop codon for another would not be expected to make any difference. In a start gain, there is a new start codon upstream of the original one, but the original start codon can still be used to encode a normal protein.
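
The degeneracy of the genetic code is easy to see in code. The sketch below builds the standard genetic code as a lookup table and sorts a single-codon substitution into the categories described above. The function name and category labels are illustrative assumptions, and start codon changes are not handled.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
# "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_substitution(ref_codon, alt_codon):
    """Compare the amino acids encoded by the reference and variant codons."""
    ref_aa = CODON_TABLE[ref_codon]
    alt_aa = CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous stop" if ref_aa == "*" else "synonymous"
    if alt_aa == "*":
        return "stop gain"
    if ref_aa == "*":
        return "stop loss"
    return "nonsynonymous"


print(classify_substitution("GAA", "GAG"))  # both codons encode glutamate
print(classify_substitution("GAA", "CAA"))  # glutamate to glutamine
print(classify_substitution("CAA", "TAA"))  # glutamine to stop
```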

Variants of unknown impact are those outside of the coding region. Some noncoding sequence (introns, 5′ UTR, 3′ UTR, and adjacent sequence that is not transcribed) is sequenced in exome sequencing. It is difficult to predict whether any of these variants would affect gene function.

Most of my variants (80%) have unknown impact and are likely neutral. Variants of low impact are not expected to influence gene function and can be ignored. Only variants of moderate or high impact are likely to have an effect on gene function, so we should examine these more closely.

Another way of analyzing variation is to determine how rare the variants are among the many human genomes that have been sequenced. The 1000 Genomes Project is an NIH-sponsored program that has cataloged variants in several thousand human genomes, allowing us to distinguish between relatively common (presumably benign) variants and rare variants that have either arisen recently or been subject to negative selection because they are deleterious.

The graph below shows the frequency of my variants as seen among the people sampled for the 1000 Genomes Project.


Most of my variants (95,660) are common, with allele frequencies of 5% or higher. About 3% of my variants have allele frequencies from 1% – 5%, while 1.4% of my variants have allele frequencies below 1%. About 8.5% of my variants are unknown, meaning that while they are already present in public databases, their allele frequency has not been calculated. About 4% of my variants are novel, meaning that they are not found in public databases.

Evaluating both the likely impact of each variant and the allele frequency of each variant allows the analysts to filter all of my exome variation to find variants of high or moderate impact that are novel, unknown, or rare, as shown in the decision tree below.


Of my variants predicted to have high or moderate impact, 249 are rare (allele frequency less than 1%), 1107 have unknown allele frequencies, and 669 are novel. That is a lot of variation to consider, so the last filter is to check the variants against a list of 592 genes associated with inherited disorders. This last check produces 15 uncommon variants of high or moderate impact in genes on that list. There are 1761 uncommon variants of high or moderate impact in genes that are not on the list.
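
The filtering logic can be sketched as follows. The report shows only the decision tree, not 23andMe's actual pipeline, so the record layout, function names, and toy data here are all assumptions for illustration. Only the MLH1 frequency comes from my results; the other records are invented placeholders.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Variant:
    gene: str
    impact: str                   # "high", "moderate", "low", or "unknown"
    in_database: bool             # already present in public databases?
    allele_freq: Optional[float]  # None when no frequency has been calculated


def frequency_class(v):
    """Assign the frequency categories used in the report."""
    if not v.in_database:
        return "novel"
    if v.allele_freq is None:
        return "unknown"
    if v.allele_freq < 0.01:
        return "rare"
    if v.allele_freq < 0.05:
        return "low frequency"
    return "common"


def filter_variants(variants, disease_genes):
    """Keep uncommon high/moderate-impact variants, split by gene list membership."""
    on_list, off_list = [], []
    for v in variants:
        if v.impact not in ("high", "moderate"):
            continue
        if frequency_class(v) not in ("rare", "unknown", "novel"):
            continue
        (on_list if v.gene in disease_genes else off_list).append(v)
    return on_list, off_list


variants = [
    Variant("MLH1", "moderate", True, 0.0004),  # frequency from my results
    Variant("GENE_X", "high", False, None),     # hypothetical novel variant
    Variant("GENE_Y", "moderate", True, 0.20),  # hypothetical common variant
    Variant("GENE_Z", "low", True, 0.001),      # hypothetical rare, low impact
]
disease_genes = {"MLH1"}  # stand-in for the 592-gene list, which I do not have

on_list, off_list = filter_variants(variants, disease_genes)
print([v.gene for v in on_list], [v.gene for v in off_list])
```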

My exome report summarizes the 15 variants in a series of tables; I have condensed them into the single table below.

All 15 variants in the table below are nonsynonymous coding variants, which means that they change the amino acid sequence of the protein encoded by the gene.

As you would expect from the rarity of the variants, I am not homozygous for any of them; in each case, I am heterozygous for the variant in question. For two of the genes (NEB and TTN), I am heterozygous for two variants, but we can’t tell the phase from the sequencing technique used. This means that I might have one copy of NEB that lacks both rare variants with a second copy that carries both, or alternatively, each of my two copies of NEB could carry a different rare variant. The same is true for TTN.

Not all amino acid substitutions are equivalent. Some amino acid substitutions are conservative, meaning that they replace an amino acid with a chemically similar amino acid that is unlikely to alter the function of the protein. Other amino acid substitutions are nonconservative, substituting an amino acid with one that is dissimilar. For the technically minded reader, I should say that I used the BLOSUM62 substitution matrix (1) to determine whether a change is conservative or nonconservative. Not all conservative substitutions are harmless, and not all nonconservative substitutions affect protein function. Each amino acid position in a particular protein has its own properties.
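
As a minimal sketch of this classification, assume that a positive BLOSUM62 score counts as conservative and a zero or negative score as nonconservative. That cutoff is an assumption, not something stated above, though it agrees with the conservative and nonconservative labels given for the variants discussed in this post. The scores are copied from the published BLOSUM62 matrix (1) for just the amino acid pairs needed here; the function name is illustrative.

```python
# BLOSUM62 scores for selected (reference, variant) amino acid pairs.
# Only the pairs needed for the examples are included.
BLOSUM62_SUBSET = {
    ("E", "Q"): 2, ("D", "N"): 1, ("E", "K"): 1, ("V", "M"): 1,
    ("G", "D"): -1, ("V", "I"): 3, ("I", "V"): 3, ("R", "W"): -3,
    ("G", "S"): 0, ("V", "F"): -1, ("P", "S"): -1,
}


def classify_change(aa_change):
    """Classify a change written like 'V134M' (reference, position, variant).

    Assumed rule: score > 0 is conservative, score <= 0 is nonconservative.
    """
    ref, alt = aa_change[0], aa_change[-1]
    score = BLOSUM62_SUBSET[(ref, alt)]
    return "conservative" if score > 0 else "nonconservative"


for change in ["E2588Q", "V134M", "G108D", "G1012S"]:
    print(change, classify_change(change))
```

A real analysis would load the full matrix from a library rather than typing in scores by hand.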

Finally, it is possible to assess the likely outcome if each variant resulted in a complete loss of protein function, because we have the information from the OMIM entry. This will tell us whether heterozygotes (carriers) of a defective allele have an altered phenotype. This is perhaps the most important assessment here. I carry variants of moderate impact in three genes that have a semidominant impact on disease (MLH1, MSH2, and PTCH1). It turns out that there is no direct evidence that any of the particular variants that I carry are pathogenic, as I detail below.

Uncommon variants of high or moderate impact in disease genes for Paul Szauter (23andMe Exome Pilot)

| Symbol | Name | OMIM link [1] | dbSNP ID [2] | AA change [3] | 1K Genomes frequency [4] | Effect on phenotype [5] |
|---|---|---|---|---|---|---|
| BCKDHA | branched chain keto acid dehydrogenase E1, alpha polypeptide | 608348 | rs34442879 | T122M | 0.00560 | recessive |
| CDH23 | cadherin-related 23 | 605516 | rs41281338 | E2588Q | 0.00670 | recessive |
| EVC | Ellis van Creveld syndrome | 604831 | rs41269549 | D184N | 0.00090 | recessive |
| GLE1 | GLE1 RNA export mediator homolog (yeast) | 603371 | rs138310419 | E334K | 0.00460 | recessive |
| HSPG2 | heparan sulfate proteoglycan 2 | 142461 | rs114851469 | R2977W | 0.00340 | recessive |
| ITGB4 | integrin, beta 4 | 147557 | rs145976111 | R977C | 0.00140 | recessive |
| MLH1 | mutL homolog 1, colon cancer, nonpolyposis type 2 (E. coli) | 120436 | rs35831931 | V134M | 0.00040 | semidominant |
| MSH2 | mutS homolog 2, colon cancer, nonpolyposis type 1 (E. coli) | 609309 | rs4987188 | G108D | 0.00910 | semidominant |
| NEB | nebulin | 161650 | rs149881695 | V181I | 0.00100 | recessive |
| NEB | nebulin | 161650 | N/A | I287V | 0.00480 | recessive |
| PLG | plasminogen | 173350 | rs4252129 | R523W | 0.00320 | recessive |
| PTCH1 | patched 1 | 601309 | rs113663584 | G1012S | 0.00090 | semidominant |
| SP110 | SP110 nuclear body protein | 604457 | rs149485401 | G185R | 0.00690 | recessive |
| TTN | titin | 188840 | rs33917087 | V2777F | 0.00830 | recessive |
| TTN | titin | 188840 | rs55980498 | P13977S | 0.00240 | recessive |

[1] Link to the gene page in OMIM. Links at the top of that page direct you to entries on inherited diseases.

[2] Link to the SNP entry at dbSNP. Note that there is no entry for NEB-I287V, a previously unknown variant.

[3] Amino acid substitutions are classified as conservative or nonconservative using the BLOSUM62 substitution matrix (1).

[4] Frequency of the variant from the 1000 Genomes Project, cited by 23andMe as of 08/26/2011.

[5] From the OMIM gene and disease entries. This is the mode of inheritance of known pathogenic alleles, and does not imply that any of the variants shown are pathogenic.

Before going through the variants, given that I am an optimist, I have to make the glass-half-full argument first. My results showed that I did not carry uncommon variants of high or moderate impact in 579 of the 592 genes, so I got a 97.8% on my exome exam as graded by 23andMe. That’s good news. I would like to be able to get the gene list from 23andMe eventually.

The phenotypes produced by pathogenic variants in the 13 genes for which I carry variants of moderate impact are terrible, which is why they are on the list of genes responsible for inherited disease. Please bear in mind as you read these grim summaries that my health is excellent, and I am not affected by any of these conditions.

BCKDHA: This gene encodes an enzyme necessary for amino acid metabolism, specifically, the degradation of products of isoleucine, leucine, and valine catabolism. Individuals homozygous for defects in this gene have maple syrup urine disease (MSUD), which despite its funny name is a devastating childhood illness that results in physical and mental retardation if untreated. Because the disease has a devastating impact and can be treated through dietary modification, it is on the list of diseases for which newborns are screened through the analysis of blood chemistry. The disease phenotype is completely recessive, so even if this allele results in a complete loss of enzyme function, it should have no impact on the health of carriers like me.

CDH23: Mutations in this gene are associated with autosomal recessive deafness, specifically with Usher syndrome. This inherited condition is completely recessive. My variant allele is a conservative amino acid substitution not likely to lead to a loss of gene function.

EVC: Mutations in this gene are associated with recessive skeletal dysplasia with short limbs, polydactyly, and dental abnormalities. Mutations are completely recessive. My variant allele is a conservative amino acid substitution not likely to lead to a loss of gene function.

GLE1: Mutations in this gene are associated with lethal congenital contracture syndrome, with most cases associated with death around the time of birth. Mutations are completely recessive. My variant allele is a conservative amino acid substitution not likely to lead to a loss of gene function.

HSPG2: Mutations in this gene are associated with dyssegmental dysplasia, a recessive lethal form of neonatal dwarfism. The three specific variants known to cause the disease are an 89 bp duplication that causes a frameshift and two different stop codon gain mutations. Other alleles are associated with Schwartz-Jampel syndrome, a milder dwarfism syndrome that is completely recessive. One of the variants associated with Schwartz-Jampel syndrome is an amino acid substitution, C1532Y. It is not clear whether individuals homozygous for the allele that I carry, R2977W, would be affected.

ITGB4: Mutations in this gene are associated with epidermolysis bullosa, a recessive skin blistering disorder. One individual homozygous for G931D had a history of blistering and hair loss beginning in childhood when examined at age 68. One patient with a lethal form of the disease was homozygous for C61Y. Nonlethal cases include homozygotes or compound heterozygotes for R252C, C562R, and R1281W. No cases involving the allele that I carry, R977C, have been observed, so it is not clear whether individuals homozygous for this allele would be affected.

MLH1: This mutation was an immediate source of alarm to me when I obtained my results. The MLH1 gene encodes a DNA repair enzyme. Mutations in this gene are associated with dominant hereditary predisposition to colon cancer. Heterozygotes (carriers) of some alleles are at increased risk of colon cancer, while rare homozygotes or compound heterozygotes for two loss of function alleles develop colon cancer or other tumors early in life. There is no direct evidence that the V134M allele that I carry (a conservative substitution) is associated with cancer predisposition, although apart from the amino acid substitution that it causes, it is classified as a mutation in an exonic splicing enhancer that may cause the mutant exon to be skipped (2).

MSH2: Like the MLH1 variant, this one was an immediate source of concern, because MSH2 also encodes a DNA repair enzyme. Mutations in MSH2 are associated with dominant hereditary predisposition to colon cancer. Heterozygotes (carriers) of some alleles are at increased risk of colon cancer, while rare homozygotes or compound heterozygotes for two loss of function alleles develop colon cancer or other tumors early in life. There is no direct evidence that the G108D allele that I carry (a nonconservative substitution) is associated with cancer predisposition, although apart from the amino acid substitution that it causes, it is classified as a mutation in an exonic splicing enhancer that may cause the mutant exon to be skipped (2). Taking the results for MLH1 and MSH2 together, I might be more concerned had I obtained these results in my 20s. I am 58. I had a colonoscopy at 55 as part of routine medical care, and learned that everything is fine. In this case, my medical history suggests that the variant MLH1 and MSH2 alleles that I carry are not harmful.

NEB: Mutations in NEB are associated with nemaline myopathy, a recessive disorder characterized by hypotonia (low muscle tone), generally evident at birth. My two rare alleles (V181I and I287V) are both conservative substitutions, while alleles associated with the disease are typically frameshifts, nonsense mutations, or the loss of splice sites. My normal phenotype with respect to muscle tone shows that I am not, nor will I be, affected by these variants.

PLG: Mutations that inactivate plasminogen are associated with ligneous (“wood-like”) conjunctivitis of the eye, and similar lesions of other mucous membranes. The nonconservative substitution that I carry has not been associated with recessive plasminogen deficiency. In any case, carriers of known pathogenic alleles are not affected.

PTCH1: Mutations in PTCH1 are associated with holoprosencephaly, a recessive developmental abnormality of the forebrain that causes mental retardation and craniofacial abnormalities. Mutations in PTCH1 are also associated with susceptibility to cutaneous basal cell carcinoma (skin cancer caused by sun exposure). The G1012S variant (a nonconservative substitution) that I carry is not known to be pathogenic.

SP110: Mutations in SP110 are associated with recessive hepatic venoocclusive disease with immunodeficiency and recessive susceptibility to tuberculosis. The G185R allele that I carry is not known to be pathogenic.

TTN: The TTN gene encodes titin, a giant muscle protein that is an essential component of striated muscle. Defects in TTN are associated with recessive cardiomyopathy, limb-girdle muscular dystrophy, and other muscle defects. The two nonconservative substitutions that I carry (V2777F and P13977S) are not known to be pathogenic. Because of the size of this protein and because it is a structural protein rather than an enzyme, a huge number of variants are known, most not associated with any phenotype.

Getting my exome sequenced was an excellent learning experience for me. First, I got a fairly clean bill of genomic health (I have learned to stop worrying and love my variants that are not known to be pathogenic). Second, I now have a more personal sense of how vast the genome is, how little we really know, and how many genetic variants are still novel in the average genome. Finally, I donated my exome results to the Personal Genome Project, which will be the subject of my next entry in the not-too-distant future.

I might have wished for a bit more information from 23andMe, but they actually delivered more than they promised. I’d like to have their list of 592 disease genes, and I’d like to have the list of genes that correspond to my 1761 uncommon variants of high or moderate impact that don’t match their 592 disease genes. Finally, I look forward to the day when 23andMe opens up exome sequencing to its customers again, and refines its analysis pipeline based on its experience with the pilot.


1. Henikoff, S, and JG Henikoff (1992) Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA 89: 10915-10919.

2. Doss, CGP and R Sethumadhavan (2009) Investigation on the role of nsSNPs in HNPCC genes – a bioinformatics approach. J Biomed Sci. 16: 42.

My Deep Ancestry

I can only recite a small bit of my ancestry. My father was born in 1909 in Hungary. His father died in 1914, forty years before I was born. His mother, who remarried, had additional children, his half-sibs. One of these, my uncle, was killed in World War II. I met my father’s mother a few times. She only spoke Hungarian and I only spoke English, so our contact was just some smiles, hugs, and laughs, no family stories.

My mother is Dutch, born in Holland. Her mother also came to the United States and lived in the city where I grew up, Youngstown, Ohio. My mother still lives in the house where I grew up, and at 89 is still going strong. She was an only child, but her mother had a brother, my great uncle, who I met on a visit to Holland in 1975. My great aunt and uncle were wonderful hosts, but we never got much into family history. As a child in a small nuclear family (father, mother, sister, brother, and grandmother), I never had much personal experience with extended families. I can’t rattle off my extended family tree, and the language of kinship – second cousins once removed, and so on – eludes me like the rules of some complex organized sport that I don’t follow.

I knew that my results from 23andMe were going to provide me with some insight into my deep ancestry. I knew that I was going to get two pieces of ancestry information right away: my mitochondrial haplotype and my Y chromosome haplotype. Each was going to give me a look at part of my ancestry going back thousands of years.

Mitochondria are subcellular organelles, found in the cells of all eukaryotic organisms (organisms with a nucleus). Plants, animals, protozoans, and fungi all have mitochondria. Mitochondria are the powerhouses of the cell, carrying out biochemical reactions that generate ATP, a chemical that provides energy for thousands of other reactions. Mitochondria have their own DNA, which encodes some of the proteins found in mitochondria. Mitochondria resemble bacteria in some ways, and are thought to be derived from endosymbionts – bacteria that a proto-eukaryote invited into its cells for mutual benefit billions of years ago.

Mitochondria and their DNA have the interesting property of being inherited exclusively from our mothers. Egg cells are much bigger than sperm cells and are packed with mitochondria. Sperm cells drop off a set of chromosomes during fertilization, but no mitochondria.

This means that my mitochondrial genome came from my mother, who got it from her mother, and so on back to my maternal great-great-great-grandmother and beyond. I have two parents, four grandparents, eight great-grandparents, and so on back through the generations to a very large number of ancestors. My nuclear genome has bits and pieces of my recent ancestors, but as you go back many generations, some of my ancestors are no longer directly represented there. Not so my mitochondrial DNA, which is an unbroken maternal line back through thousands of generations.

This would not be very informative about my ancestry, except that occasionally, mutations occur in mitochondrial DNA. Mutations in the coding sequences used to make mitochondrial proteins are usually very bad news and are eliminated through selection. There are some noncoding regions in the mitochondrial genome; mutations in these regions have no effect and are selectively neutral. Every time such a mutation occurs, it marks a new maternal lineage, branching off from the old lineage and continuing until another branch arises by mutation.

The rate at which these mutations occur is known, so we can say approximately when each new mitochondrial lineage arose. My results from 23andMe show my mitochondrial genome to be a type called H1. 23andMe provides a map of the frequency of mitochondrial lineage H1 in various populations around the world, shown below.

Clicking the history tab on this page at 23andMe tells me that haplogroup H1 originated about 13,000 years ago, not long after the end of the Ice Age. The people of Europe had been driven by ice sheets into southern France, Italy, and the Iberian Peninsula. The H1 haplotype likely arose in a woman living on the Iberian Peninsula. As the Ice Age ended, some of the descendants of this woman journeyed north all the way to Scandinavia, while others crossed into northern Africa. The blue on the map shows that H1 reaches a frequency of around 40% in Norway, far from its origins in Iberia, probably due to a founder effect. If a relatively small number of people founded the population of distant Norway, H1 could by chance have ended up overrepresented there compared to Spain. My H1 haplotype is not a big surprise given my Dutch maternal ancestry, as H1 is common in Holland.

Men get extra information about their ancestry from 23andMe because of their Y chromosomes. The Y chromosome is one of the two sex chromosomes. Men are XY, women are XX. Normal human eggs have a single X chromosome as they await fertilization, while normal human sperm will either have an X chromosome or a Y chromosome. If the sperm fertilizing the egg carries an X chromosome, the zygote is XX and will be a girl, while if the sperm fertilizing the egg carries a Y chromosome, the zygote is XY and will be a boy.

This means that only men transmit the Y chromosome, and only to their sons. My Y chromosome traces back through my father, my father’s father, and so on back thousands of generations. Just like mitochondrial DNA, the DNA of the Y chromosome is subject to variation. Each time a new variant arises, it marks the beginning of a new paternal lineage. The rate at which these variants occur is known, so we can trace the origin of my Y chromosome haplotype to a specific time and place, just like for mitochondrial DNA. The 23andMe display for my Y chromosome is shown below.

My Y chromosome haplotype is E1b1b1a2*, a subgroup of E1b1b1a2 (also called E-V13) that arose in a population that moved from eastern Africa into northeastern Africa about 14,000 years ago, during the final days of the Ice Age. 23andMe reports that it is common among men in southern Europe, especially Greeks, Bulgarians, and Albanians. About 10% of Hungarian men carry this Y chromosome haplotype.

The Hungarian language is an odd one, related linguistically to Finnish, which reflects the migration of the Finno-Ugric people from the area near the Finnish-Russian border to what is currently Hungary in the 9th and 10th centuries. A search of the web for origins of the Hungarian people reveals a colorful history of repeated clashes with the neighboring kingdoms, especially during the 10th century. The reports of the geographic distribution of the E1b1b1a2 Y chromosome suggest that it did not come from the Finno-Ugric people. My Y chromosome likely came into Hungary from the outside.

In my search for the origins of the E1b1b1a2 Y chromosome, I found an authoritative account on Dienekes’ Anthropology Blog. Dienekes Pontikos presents the following conclusion:

The age and distribution of E-V13 chromosomes suggest that expansions of the Greek world in the Bronze and later ages were the major causes of its diffusion. Who was the E-V13 patriarch in Greece? He was perhaps one of the legendary figures of Greek mythology some of whom are said to have come from abroad. For whatever reason, his progeny grew, and were around to participate in the expansion of the Mycenaean world and the subsequent Greek colonization.

I have heard from a number of people who were forced to reevaluate their identities after getting results from 23andMe. My Y chromosome haplotype was the first result that made me question my view of my own identity. It is no longer as simple as being of “European” descent; now my paternal lineage traces from a movement of people in Africa to heroes of the Bronze Age in Greece, described much later in Homer’s epic poem The Iliad.

My mitochondrial and Y chromosome results pin down exactly two of my thousands of ancestors. What about all of the others? For this, we turn to my autosomal DNA, also analyzed by 23andMe. This is most of the 3 billion base pairs that make up my genome, and there is much to learn. Half is from my mother, and half from my father. Going back to my grandparents, about one quarter of my genome should be from each of them, on average, but here is where it gets messy.

The chromosome sets that end up in sperm or egg cells are the product of an elaborate cell division process called meiosis. During meiosis, homologous chromosomes replicate, pair, and then segregate from each other. The segregation process doesn’t work correctly unless the chromosomes are held together prior to segregation. Part of what holds chromosome pairs together is the process of meiotic recombination, in which chromosomes that are partly of maternal origin and partly of paternal origin are created. Because recombination takes place after chromosomes have replicated, chromosomes segregating into the sperm or egg might be entirely maternal, entirely paternal, or composite chromosomes made up of both maternal and paternal segments.

Each chromosome pair also segregates independently of all of the other pairs. This means that I only carry an average of 25% of each of my grandparents’ genomes. Tracing back through the generations, each of my great-grandparents is represented by an average of 12.5% of my genome, my great-great-grandparents by an average of 6.25% of my genome, and so on. The casino-like mechanism of sexual reproduction means that segments from some of my more remote ancestors are entirely absent from my genome, with the exception of my mitochondrial genome and my Y chromosome.
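
The halving with each generation is simple enough to tabulate. The short sketch below prints the number of ancestors and the average autosomal share per ancestor for the first few generations; the function names are made up for illustration, and the shares are averages only, since the actual contribution of any one ancestor varies by chance.

```python
def ancestors_in_generation(n):
    """Number of ancestors n generations back (parents are n = 1)."""
    return 2 ** n


def expected_genome_share(n):
    """Average fraction of the autosomal genome from one ancestor n generations back."""
    return 1 / 2 ** n


labels = ["parents", "grandparents", "great-grandparents", "great-great-grandparents"]
for n, label in enumerate(labels, start=1):
    count = ancestors_in_generation(n)
    share = expected_genome_share(n)
    print(f"{label}: {count} ancestors, {share:.2%} each on average")
```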

Can I learn anything about my ancestry by looking at my autosomal genome? It is best to start by asking what we might learn by surveying autosomal genetic variation across a large sample of people over the entire geographic range of our species. This has been done in increasing detail in recent years, and reveals a clear story of human history that supports independent evidence from anthropology and archaeology.

Imagine a population of individuals living in a particular area for many generations. There will be a certain level of genetic variation among these people. If a small group of people leaves this population to start a new population somewhere else, they will by chance leave some of their genetic variation behind. If a small group from the new population moves on, they will by chance leave some of their newly reduced genetic variation behind, further reducing their population’s genetic variation.
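
This stepwise loss of variation can be illustrated with a standard textbook expectation from population genetics: when N diploid individuals found a new population, expected heterozygosity falls by a factor of (1 - 1/(2N)). The sketch below applies that factor once per founding event. It is a deliberately simplified model, and the starting value and founder group sizes are invented for illustration; it ignores population regrowth, new mutations, and later migration.

```python
def heterozygosity_after_founding(h_before, n_founders):
    """Expected heterozygosity after n_founders diploid individuals start a new population."""
    return h_before * (1 - 1 / (2 * n_founders))


def serial_founder_effect(h_start, founder_sizes):
    """Apply a series of founding events and return heterozygosity after each one."""
    history = [h_start]
    for n in founder_sizes:
        history.append(heterozygosity_after_founding(history[-1], n))
    return history


# Invented example: three successive migrations, each founded by 10 people.
for step, h in enumerate(serial_founder_effect(1.0, [10, 10, 10])):
    print(f"after {step} founding events: relative variation {h:.4f}")
```

Each founding event removes a little more variation, which is why populations farther along a migration route are expected to carry less genetic variation than the source population.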

If the spread of people to new areas is rapid relative to the rate at which new genetic variation arises, which it is, we should be able to trace humanity to its original geographic location. This is easily done, and it is clear that human beings originated in Africa. Around 100,000 years ago, a small group of humans moved out of Africa to the Middle East, and spread from there throughout Europe, Asia, and eventually North and South America.

When humans first arrived in the Middle East and Europe, these adventurous people encountered an existing population of Neandertals. Neandertals are a distinct species that diverged from the human lineage about 600,000 years ago. Homo neanderthalensis is well known from the fossil record. In contrast to some of the stereotypes about “cavemen,” we know that Neandertals used tools and weapons, cared for members of their population who were injured or disabled (often for decades), and buried their dead ceremonially with flowers and other objects.

It has long been of interest what happened when humans first encountered Neandertals. There are two broad hypotheses: displacement and admixture. Under the displacement hypothesis, humans outcompeted, drove off, or killed the Neandertals, driving them to extinction about 30,000 years ago. Under the admixture hypothesis, humans took a liking to their neighbors and interbred with them, preserving a bit of the Neandertal lineage among humans after Neandertals disappeared.

While it is easy to analyze fresh DNA that has been collected properly from people, analyzing DNA from fossils is not an easy task. While DNA is fairly stable, over thousands of years it breaks down into small fragments, and some of the bases undergo chemical changes. Nevertheless, some determined researchers have pushed this technique to the very limits. In 1997, Svante Pääbo and colleagues at the Max Planck Institute produced the first DNA sequences of Neandertal mitochondrial DNA (1). Mitochondrial DNA is easier to recover than nuclear DNA because there are many copies of the mitochondrial genome per cell.

The answer was clear: there is no trace of the mitochondrial DNA of Neandertals among modern humans. It appeared that our species entirely displaced the Neandertals.

Techniques for sequencing DNA, including ancient DNA, have advanced rapidly. In 2010, Svante Pääbo and colleagues announced the results of sequencing genomic DNA from Neandertals (2). DNA recovered from the bones of three individuals was sequenced, producing data that reveals the sequence of most of the Neandertal genome. The sequence is, of course, very similar to the sequence of human DNA.

At each position of known SNP variation in humans, Svante Pääbo and colleagues asked whether the Neandertal sequence more closely resembles the sequence of human populations in Africa (where Neandertals never lived), Europe, or Asia. Upon making close to 100,000 such comparisons, Pääbo and colleagues made a stunning finding: the genome of Neandertals was more closely related to Europeans and Asians than it was to Africans. The average non-African appears to contain genomic sequences from Neandertals making up about 2.5% of their genome. Some people have more, some people have less. In contrast to the results of the work on Neandertal mitochondria, these results support the admixture hypothesis. The first people out of Africa encountered Neandertals in the Middle East and mated with them. Some segments of the Neandertal genome were advantageous, and have been maintained by positive selection 30,000 years after Neandertals became extinct.

I remember when these results first hit the science news. I talked to everyone that I could about it, including people not trained in science. It is a beautiful piece of work. For many scientists, there is nothing quite as much fun as finding out that something that is widely known by everyone is just plain, flat out wrong. Imagine that, people walking around today with caveman DNA. I took delight in it in an abstract kind of way.

After I got the email that my 23andMe results were ready, I moved to the ancestry portion of the site. Would I like to find out if I carried any Neandertal DNA? Sure, I thought, and clicked the link, revealing the display shown below.

I am 2.9% Neandertal, in the 92nd percentile among 23andMe users. When I first saw this, I stared at the screen for a while. I was a little bit shocked. I looked up a comparison of humans and Neandertals, showing two complete skeletons side by side. The Neandertal is described as “robust.” They were shorter, stockier, and barrel-chested. I started to identify with the wrong skeleton. I imagined how I look in a crowd of people. Shorter, stockier. Big shoulders. I ran my finger over my eyebrows and forehead to reassure myself. No brow ridges, high forehead. Human.

This was the first result from 23andMe that I discussed with other people. I found out that I’m mixed race, I’d say. They would usually pause, aware that for some this is a delicate subject. I’m 97.1% human race and 2.9% Neandertal, I’d say. Sometimes we would move on to the jokes to break the tension. I went for a long walk yesterday, I’d say. Boy, are my knuckles sore!

Some of my friends on Facebook consoled me. Your Neandertal ancestors were big-brained, gentle people, they reminded me. We know from the genomic sequence of Neandertals that they carried the same version of FOXP2, a gene required for language, that modern humans do; it differs from the version found in our closest living relative, the chimpanzee. I began to look at many of the depictions of Neandertals in popular culture as insensitive. My Neandertal ancestors were not brutish, stupid ape-men, I thought. After a couple of weeks, I began to embrace my ancestry, boasting to others that I was probably more of a Neandertal than they were. One of my female colleagues who had heard of my ancestry told me that she tested as 3% Neandertal, and I felt disappointed, ordinary.

My colleague, Dr. Maggie Werner-Washburne, a consummate scientist, reacted well to the news.

“2.9%?” she said. She held up a hand to count on her fingers. “Let’s see: 50%, 25%, 12.5%, 6.25%,” then, touching her pinky, “3%. It hasn’t been that long for you, has it?”

The fast estimate that one of my great-great-great grandparents was a Neandertal would be a reasonable guess except that we know that Neandertals died out 30,000 years ago. This means that parts of the Neandertal genome must have been under positive selection. Neandertals had been living in Europe for a long time when modern humans arrived, and were well adapted to the conditions there. Some of the Neandertal genome segments retained by the descendants of hybrids like myself are associated with the immune system, and were better suited to conditions in Europe than the African alleles my early human ancestors carried.
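The mismatch between the finger-counting estimate and the calendar is easy to check. The sketch below inverts the halving rule to ask how many generations back a single ancestor contributing 2.9% would sit, then compares that with the number of generations since Neandertals disappeared. The 25-year generation time is an assumption chosen for illustration.

```python
import math


def naive_generations_back(fraction: float) -> float:
    """Generations back at which ONE ancestor contributes `fraction` on average."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return math.log2(1.0 / fraction)


def generations_elapsed(years: float, years_per_generation: float) -> float:
    """Number of generations that fit into a span of years."""
    return years / years_per_generation


YEARS_PER_GENERATION = 25  # assumption, not a figure from the post

print(f"naive estimate: {naive_generations_back(0.029):.1f} generations back")
print(f"generations since 30,000 years ago: "
      f"{generations_elapsed(30_000, YEARS_PER_GENERATION):.0f}")
```

The naive estimate lands at about five generations, while more than a thousand generations have actually passed, so a single recent Neandertal ancestor cannot be the explanation.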

I recently registered for the Personal Genome Project, for which I volunteered to have my entire genome sequenced and made public. I was invited to the Genomes, Environments and Traits conference (GET2012) as part of the educational aspect of the project. I looked up the conference schedule and saw the keynote speaker: Svante Pääbo. I was hooked.

The GET 2012 conference was held on April 25. There were far too many wonderful things to write about in this post, so I will stick to Svante Pääbo’s talk. He held us spellbound with his unique combination of a low-key manner, wonderful data, and a few gentle jokes. He recounted the work on mitochondria, and how he made a public statement that we would never have the nuclear genome of Neandertals. Never make statements like that, he advised us.

He presented the arguments about displacement vs. admixture, showing that while the results from mitochondrial DNA favored total displacement, the results from the analysis of nuclear DNA clearly supported limited admixture. He presented the results from the analysis of a single bone fragment from a cave in Siberia that revealed another type of archaic human, now called a Denisovan, that is distinct from Neandertals and humans (3). Denisovan DNA makes up as much as 6% of the genome of some Melanesians. There is emerging genomic evidence that there may have been admixture in Africa with another type of archaic human for which there is no fossil record. The story is changing rapidly.

Pääbo closes with experiments with laboratory mice that have been genetically altered to make their FOXP2 gene match that of humans. It is a change of only three amino acids out of 714. The mice are run through a battery of tests to see if there is anything different about them. Amazingly, their vocalizations are altered. The audience is stunned. It is not exactly that they talk, but “medium spiny neurons have increased dendrite lengths and increased synaptic plasticity.”

Then, it’s time for questions. There are some good scientific questions until finally, someone asks the question that provides the perfect closer. The questioner points out that Neandertals were in western Europe for 100,000 years, but didn’t spread much. Humans moved out of Africa and relatively quickly spread everywhere: Europe, including the British Isles, Asia, and Australia. Why didn’t Neandertals spread?

Svante Pääbo reflected for a moment, then pointed out that most of the places where Neandertals never lived required them to do something that they didn’t like to do. They didn’t like to cross water if they couldn’t see land on the other side. He pointed out that many humans in boats must have died on the open ocean before the first humans reached Australia. So one of the differences between humans and Neandertals is that humans are crazy. They set out on ocean voyages without a clear idea of where they will end up.

I looked around at the auditorium, filled with participants in the Personal Genome Project who had made their genomes public, not knowing exactly what would happen. The rest of the crowd was a collection of forward-looking scientists, engineers, and venture capitalists. It occurred to me that I was looking at a group of people who all exhibit the most distinctively human characteristic: the willingness to set out on a voyage whose final destination cannot be clearly seen. Crazy. Human.

Hard Times for My Ancestors Have Marked My Genome

In this post, I present one small part of the tale of how the tribulations of my ancestors have left their mark upon my genome. I’m not sure who is reading this blog. I don’t know the average level of understanding of genetics among my readers. I have recently participated in the online discussions at 23andMe, where it is clear that many subscribers are struggling with the basics. This post will therefore contain a bit more background information than the average genomics blog. Some of the more technical information and citations are included in the footnotes.

Inherited genetic variation makes us different from each other. We all have the same genes, but every gene comes in different “flavors,” called alleles. Different alleles of a gene differ in their DNA sequence. Under some circumstances, alleles that eliminate the function of a gene can confer a selective advantage, meaning that people carrying that allele are likely to have more offspring. Changes in the frequency of a particular allele in a population (the “allele frequency”) over time can alert researchers to interesting problems in biology and medicine.

Most of the single nucleotide polymorphisms (SNPs) that are typed by 23andMe are “neutral.” They are sequence variants in parts of the genome that do not encode proteins. It doesn’t matter which base is present at that position, so natural selection does not change the frequency of a particular variant of this type directly. A small fraction of the SNPs typed by 23andMe are diagnostic for a variant allele of a gene that changes the sequence of the protein encoded by that gene. These variants have been discovered through research on people affected by an inherited disorder. A probe specific for the disease allele has been incorporated into the tests done by 23andMe. In my last post, I discussed three such variant alleles that I carry.

The variant allele that I carry for the PEX1 gene, PEX1-G843D, has an allele frequency of 0.001-0.002 (1). This means that if we look at all of the PEX1 alleles in a population, 0.1-0.2% of the alleles are the PEX1-G843D variant. The PEX1-G843D allele is a bad thing. While carriers like myself are unaffected, homozygotes (PEX1-G843D/PEX1-G843D) usually die before they reach one year of age. Why doesn’t natural selection “get rid of” this nasty allele?

The frequency of PEX1-G843D is at most 0.002. Assume for simplicity that there are no other variant alleles of PEX1, so the frequency of the wild-type (normal) allele is 0.998. If people mate at random without regard to their PEX1 genotype, we can calculate the frequency of the three possible genotypes as PEX1/PEX1 = 99.6%, PEX1/PEX1-G843D = 0.4%, and PEX1-G843D/PEX1-G843D = 0.0004% or 4/1,000,000 (2). PEX1-G843D is subject to negative selection, but this no longer changes the allele frequency of PEX1-G843D very much. There are about 1000 times as many heterozygotes (PEX1/PEX1-G843D) as there are homozygotes (PEX1-G843D/PEX1-G843D), so the allele frequency can’t be driven much lower by selection alone.
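The genotype arithmetic above follows the Hardy-Weinberg rule (p², 2pq, q²), and it can be reproduced with a few lines of Python. This is a minimal sketch under the same simplifying assumptions as the text: random mating and only two alleles. The function names are illustrative.

```python
def hardy_weinberg(q: float) -> dict:
    """Genotype frequencies under random mating for a variant allele at
    frequency q and a single normal allele at frequency 1 - q."""
    if not 0.0 <= q <= 1.0:
        raise ValueError("allele frequency must be between 0 and 1")
    p = 1.0 - q
    return {
        "normal/normal": p * p,
        "carrier": 2.0 * p * q,
        "variant/variant": q * q,
    }


def share_of_variant_copies_in_carriers(q: float) -> float:
    """Fraction of all variant allele copies that sit in unaffected carriers."""
    g = hardy_weinberg(q)
    in_carriers = g["carrier"]                    # one variant copy per carrier
    in_homozygotes = 2.0 * g["variant/variant"]   # two variant copies per homozygote
    return in_carriers / (in_carriers + in_homozygotes)


pex1 = hardy_weinberg(0.002)
for genotype, freq in pex1.items():
    print(f"{genotype}: {freq:.6f}")
print(f"carriers per homozygote: {pex1['carrier'] / pex1['variant/variant']:.0f}")
print(f"variant copies in carriers: {share_of_variant_copies_in_carriers(0.002):.1%}")
```

With q = 0.002 there are about 998 carriers for every homozygote, and 99.8% of the variant copies are carried by healthy heterozygotes, out of reach of selection against the disease.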

What about hemochromatosis? I am a “compound heterozygote” for two different variant alleles of HFE: HFE-H63D and HFE-C282Y. HFE-H63D has an allele frequency of 0.108 (10.8%) in a large diverse population sample from the Exome Sequencing Project and an allele frequency of 0.179 (17.9%) in a sample confined to Europeans (3). It is hardly surprising that I carry one HFE-H63D allele given my European ancestry. HFE-C282Y has an allele frequency of 0.047 in the sample from the Exome Sequencing Project and an allele frequency of 0.042 in a sample confined to Europeans (4). The protein encoded by the HFE-H63D allele has greatly reduced function, while the protein encoded by the HFE-C282Y allele is almost completely nonfunctional.
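The same random-mating arithmetic extends to more than two alleles, which gives a rough idea of how common a compound heterozygote should be. The sketch below uses the European allele frequencies quoted above and assumes that H63D and C282Y never occur on the same chromosome and that every other HFE allele can be lumped together as "normal". Both are simplifications made for illustration.

```python
from itertools import combinations_with_replacement


def genotype_frequencies(allele_freqs: dict) -> dict:
    """Expected genotype frequencies under random mating for any number of alleles."""
    result = {}
    for a, b in combinations_with_replacement(sorted(allele_freqs), 2):
        product = allele_freqs[a] * allele_freqs[b]
        # Homozygotes arise one way; heterozygotes arise two ways.
        result[(a, b)] = product if a == b else 2.0 * product
    return result


european = {"H63D": 0.179, "C282Y": 0.042}
european["normal"] = 1.0 - sum(european.values())  # everything else lumped together

expected = genotype_frequencies(european)
for genotype, freq in expected.items():
    print(f"{genotype[0]}/{genotype[1]}: {freq:.2%}")
```

Under these assumptions, roughly 1.5% of people of European descent would be C282Y/H63D compound heterozygotes, compared with about 0.18% who would be C282Y homozygotes.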

We know from studies of patients with hemochromatosis that the vast majority are HFE-C282Y/HFE-C282Y. A small fraction of hemochromatosis patients are HFE-C282Y/HFE-H63D like me. The rest have other variant alleles of HFE, or variant alleles of one of four other genes that predispose to hemochromatosis (5). The high frequency of variant HFE alleles raises an interesting question. Why might being a carrier for an inherited disorder be a good thing?

There are plenty of examples of disease genes that confer an advantage on carriers. Among people of European descent, the second most common inherited disorder (after hemochromatosis) is cystic fibrosis, resulting from defects in the CFTR gene. It is the most frequent inherited disorder leading to childhood deaths. The cumulative allele frequency for all variant alleles (there are many) is around 0.03 – 0.05 in people of European descent (6). Taking the low number (0.03) gives us the following genotype frequencies: CFTR/CFTR = 94%, CFTR/CFTR-variant = 5.8%, and CFTR-variant/CFTR-variant = 0.09% or 9/10,000 (7).

Almost 1/1000 children born to parents of European descent are afflicted with cystic fibrosis, in contrast to 4/1,000,000 for PEX1 variants (Zellweger Syndrome). About one person out of twenty of European descent is a carrier of a variant allele of CFTR, while only one person in 250 is a carrier of PEX1-G843D. People who are heterozygous for a variant allele of CFTR are healthy, but have additional genetic advantages: they are resistant to cholera and typhoid fever. Although these diseases are present outside of Europe, carriers of cystic fibrosis have salty sweat, and the loss of salt in hot climates may outweigh the advantages of disease resistance.

There are other examples of disease resistance conferred to carriers of genetic disorders. Three well-known inherited diseases confer resistance to malaria: sickle-cell anemia (HBB), thalassemia (HBB), and Favism or G6PD deficiency (G6PD). In a sample of 114 chromosomes from sub-Saharan Africa, the frequency of the sickle-cell allele of HBB was 11.4%, but it is not found in European populations. In some African populations, the allele frequency of a common variant of G6PD conferring malaria resistance is 20%. Other loss-of-function alleles of G6PD are common in Mediterranean or South Asian populations (8, 9).

Why might loss of function of the HFE gene, which leads to hemochromatosis, have a selective advantage? There are two interesting ideas about this. The first idea is that reduced function of HFE was a useful adaptation to the neolithic diet (10). When our ancestors switched from being hunter-gatherers to the practice of agriculture, the amount of red meat (a great source of iron) in people’s diets fell. The switch to a grain-based diet meant that careful biological regulation of the amount of iron taken in from the diet was no longer optimal. People with a defect in the signaling mechanism controlling iron uptake (of which the HFE gene product is a part) would experience iron overload, but might have an advantage during times of iron starvation, because it takes longer to deplete their body’s supply of iron.

While this allowed the allele frequency for variant alleles of HFE to rise, it might not fully account for the high frequency of HFE-H63D among people of European descent. There is a really great story here, first proposed by Sharon Moalem in a scientific paper (11) and popularized in his book Survival of the Sickest (12). Dr. Moalem has proposed that variant alleles of HFE rose to their current high frequencies because they confer resistance to bubonic plague, also known as the Black Death.

The Black Death is a bacterial infection caused by Yersinia pestis. People are infected with the bacterium when they are bitten by infected fleas, which are carried throughout the population by rats. When the plague bacterium enters the bloodstream, it is attacked by white blood cells called macrophages, which travel to the lymph nodes to carry out the destruction of the invaders. For most bacterial infections, this is usually a good strategy. However, the plague bacterium is often able to survive as a passenger in the macrophage, permitting the bacterium to attack the lymph nodes, causing one of the ghastly symptoms of bubonic plague: lymph nodes that swell to the size of an egg, sometimes bursting through the skin.

The plague bacterium needs nutrients to thrive, and one nutrient in particular is usually in short supply: iron. While people with reduced HFE function generally experience iron overload, this does not affect all cells in the body equally. Macrophages from people carrying variant alleles of HFE are deficient in iron. The iron-poor environment in the macrophage among HFE-deficient people allows the macrophage to gain the upper hand against the bacterium. The Black Death of the 14th century killed 30-50% of the population in a number of European countries. If variant HFE alleles conferred an advantage against this devastation, the allele frequency could have risen sharply among the survivors. While the Black Death is no longer part of the landscape in Europe, there is not much selection against variant HFE alleles. People with hemochromatosis do not usually display symptoms until they are past the age where most people have already had children. Because hemochromatosis does not interfere with reproduction, there is no selection against it.
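To see how a single severe epidemic could move an allele frequency, here is a minimal sketch of one round of viability selection. The starting frequency and the survival rates are invented for illustration; they are not estimates from Dr. Moalem's work or from any other data. The only point is the direction and rough size of the shift when carriers survive somewhat better than non-carriers.

```python
def allele_frequency_after_epidemic(q: float, survival: dict) -> float:
    """Variant allele frequency among survivors, given the starting frequency q
    and survival probabilities for genotypes 'AA' (no variant allele),
    'Aa' (one variant allele), and 'aa' (two variant alleles)."""
    p = 1.0 - q
    before = {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}
    after = {genotype: before[genotype] * survival[genotype] for genotype in before}
    survivors = sum(after.values())
    # Each 'aa' survivor carries two variant copies, each 'Aa' survivor one.
    return (after["aa"] + 0.5 * after["Aa"]) / survivors


# Hypothetical: half of non-carriers survive, 70% of people with a variant allele do.
hypothetical_survival = {"AA": 0.5, "Aa": 0.7, "aa": 0.7}
q_after = allele_frequency_after_epidemic(0.05, hypothetical_survival)
print(f"before: 0.050, after: {q_after:.3f}")
```

With these made-up numbers the variant allele rises from 5% to about 6.7% in one generation. Repeated outbreaks would compound the effect.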

In my case, it is possible that there was selection for increased body stores of iron (and hence reduced HFE function) in my immediate ancestry. Both of my parents survived starvation during World War II. My mother, who is Dutch, lived in Holland through the entire war, and endured the Hongerwinter (“hunger winter”) of 1944. My father, who was Hungarian, spent the last 18 months of the war in a Soviet POW camp under harsh conditions. Both of my parents survived conditions in which there was widespread death from malnutrition, or from causes in which malnutrition was a factor.

I don’t wish to make light of my parents’ experience, or of the experience of my more remote ancestors who lived through the Black Death. Yet we have the genetic hand that we were dealt, and it is up to each of us to play it well. I intend to be evaluated for iron overload by a physician, and have already changed my diet, sharply reducing my intake of red meat. Humor is also an important aspect of health. In that spirit, I made this slide for one of my recent talks.

In my next post, I will go more deeply into my ancestry, as revealed by genetic testing.


1. The allele frequency of the PEX1-G843D allele is taken from the OMIM entry for PEX1

2. The frequency of the three PEX1 genotypes is calculated given an allele frequency of 0.002 for PEX1-G843D and 0.998 for the normal PEX1 allele as follows:
PEX1/PEX1 = 0.998 * 0.998 = 0.996 or 99.6%
PEX1/PEX1-G843D = 2 * 0.998 * 0.002 = 0.003992 = 0.4%
PEX1-G843D/PEX1-G843D = 0.002 * 0.002 = 0.000004 or 4/1,000,000

3. The allele frequency for HFE-H63D is taken from the dbSNP entry for rs1799945.

4. The allele frequency for HFE-C282Y is taken from the dbSNP entry for rs1800562.

5. Please see the OMIM entry for HFE and hemochromatosis and for four other genes causing hemochromatosis (HJV, HAMP, TFR2, and SLC40A1).

6. Please see the OMIM entry for CFTR.

7. The frequency of the three CFTR genotypes is calculated given an allele frequency of 0.03 for all CFTR-variant alleles and 0.97 for the normal CFTR allele as follows:
CFTR/CFTR = 0.97 * 0.97 = 0.9409
CFTR/CFTR-variant = 2 * 0.97 * 0.03 = 0.0582
CFTR-variant/CFTR-variant = 0.03 * 0.03 = 0.0009 or 9/10,000

8. Please see the OMIM entry for HBB for more information about sickle-cell anemia and thalassemia. 

9. Please see the OMIM entry for G6PD for more information about favism.

10. Christopher Naugler (2008) Hemochromatosis: A Neolithic adaptation to cereal grain diets. Medical Hypotheses 70: 691-692.

11. S. Moalem, M.E. Percy, T.P.A. Kruck, and R.R. Gelbart (2002) Epidemic pathogenic selection: an explanation for hereditary hemochromatosis? Medical Hypotheses 59: 325-329.

12. Sharon Moalem with Jonathan Price (2008) Survival of the Sickest, Harper Perennial.

Getting My Genome Done

I am a geneticist with a varied career that has included research and teaching at a variety of academic institutions. In 2000, I shut down my research lab and took a job in bioinformatics, just as the human and mouse genome sequences were being completed. In November of 2011, I moved to a new position at the University of New Mexico, funded in part by the National Human Genome Research Institute. My new position involves teaching and public outreach. Recent progress in human genomics has been spectacular, and it is a great story to tell. I thought that knowledge about my own genome would motivate my learning about human genetics and would also personalize my presentations, so I decided to “get my genome done.”

There are several ways of getting a look at your own genome. Of the direct-to-consumer companies, I liked 23andMe. At the 2011 SACNAS National Conference in October, I heard a talk by Dr. Joanna Mountain, Senior Director of Research at 23andMe. Dr. Mountain talked us through 23andMe’s website as seen by a user. The 23andMe website offers information on inherited health conditions and ancestry based on a person’s genome. I liked their user interface. They present results in language accessible to people without an extensive background in science. Users are only a few clicks away from full technical data, including complete raw data that can be uploaded to third-party sites for further analysis.

I signed up for 23andMe using their website, and soon received a kit in the mail for sample collection. They recover DNA from saliva using a very clever method. Following the illustrated instructions, I spit into a plastic tube equipped with a funnel. When my saliva reached the fill line, I flipped a cap into place that dumped a solution into the saliva sample. I capped the tube and inverted it a few times. As I did this, I saw the familiar sight of DNA coming out of solution in an ethanol precipitation. I have isolated plenty of DNA in my days as a researcher, but this was the first time that it was my own. I packed the tube in the postpaid return mailer, dropped it off at the Post Office, and waited.

After a few weeks, I got an email from 23andMe that my results were ready. I had purchased their only offering at the time, a survey of my genotype using the Illumina OmniExpress Plus Genotyping BeadChip. This technology allows genotyping of a human DNA sample at about one million genomic sites. The sites that are genotyped are Single Nucleotide Polymorphisms (SNPs) that have been identified as sites of variation in survey sequencing of human populations. The 23andMe chip includes some SNPs that are the sites of mutation in well-studied genetic disorders. For example, the chip tests for 31 different sequence variants of the CFTR gene associated with Cystic Fibrosis.

Although I am healthy and free from any known genetic disease, I looked at my Carrier Status. The screenshot below shows part of the 23andMe report.

There are 44 genetic disorders listed on this page. The disorders are listed in alphabetical order. If you have a variant allele for any of them, it sorts to the top of the page. I had two: Zellweger Syndrome Spectrum, and variants associated with hemochromatosis, a disorder in which excess iron is taken in from the diet.

The Zellweger Syndrome Spectrum gene tested for is PEX1, a gene required for the normal formation of peroxisomes. Peroxisomes are membrane-bound vesicles inside of cells that are required for the catabolism of fatty acids and other compounds. Fortunately for me, I am a carrier, which means that I am heterozygous. I have one working copy of PEX1 and one bad copy. There are no health consequences for carriers. People homozygous for the allele of PEX1 that I carry generally die before they are one year old. This is why this gene is listed on the Carrier Status page; no one homozygous for the mutant PEX1 allele G843D has a computer, a credit card, and a 23andMe account.

My Hemochromatosis report is more complex. I have two different variant alleles of the HFE gene, one of which slightly predisposes to hemochromatosis, while the other causes a considerable increase in risk of the disease. Here is a screenshot of the page that appears when you click on the Hemochromatosis link.

There is a link to a technical report. The technical report is very detailed. Part of it is shown below.

The good news is that my risk for developing hemochromatosis is quite low. Nevertheless, I decided to modify my diet and to ask for some specialized tests the next time I visit a doctor. I will discuss this in greater detail in another post.

There are also discussion forums at 23andMe. I participated in these for a couple of weeks before launching this blog. Not everyone has training in genetics, even among 23andMe subscribers, so I will take this opportunity to explain the language used in the technical report.

Most genes encode proteins. A protein (polypeptide) is a chain of amino acids; there are twenty primary amino acids that make up the set that can be encoded by the 64 three-base codons of the genetic code. There are single letter codes for each of the twenty primary amino acids. The HFE gene encodes a protein 348 amino acids long. The H63D allele changes the 63rd amino acid from histidine (H) to aspartic acid (D). The C282Y allele changes the 282nd amino acid from cysteine (C) to tyrosine (Y).
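Allele names such as H63D, C282Y, and G843D all follow the same pattern: original amino acid, position in the protein, replacement amino acid. The short Python sketch below decodes that notation using the standard single-letter codes. The function name is made up for illustration, and it handles only simple substitutions of this form.

```python
import re

# Standard single-letter codes for the twenty primary amino acids.
AMINO_ACIDS = {
    "A": "alanine", "R": "arginine", "N": "asparagine", "D": "aspartic acid",
    "C": "cysteine", "E": "glutamic acid", "Q": "glutamine", "G": "glycine",
    "H": "histidine", "I": "isoleucine", "L": "leucine", "K": "lysine",
    "M": "methionine", "F": "phenylalanine", "P": "proline", "S": "serine",
    "T": "threonine", "W": "tryptophan", "Y": "tyrosine", "V": "valine",
}

# Four bases at each of three codon positions.
CODON_COUNT = 4 ** 3


def describe_variant(code: str) -> str:
    """Spell out a substitution written as <original><position><replacement>."""
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", code)
    if match is None:
        raise ValueError(f"not a simple substitution: {code!r}")
    original, position, replacement = match.groups()
    if original not in AMINO_ACIDS or replacement not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid code in {code!r}")
    return f"position {position}: {AMINO_ACIDS[original]} -> {AMINO_ACIDS[replacement]}"


for variant in ["H63D", "C282Y", "G843D"]:
    print(variant, "=", describe_variant(variant))
print("possible codons:", CODON_COUNT)
```

The first two are the HFE variants discussed here, and the third is the PEX1 variant from the Carrier Status page.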

The C282Y allele results in a significant loss of function of the HFE protein. The cysteine residue at that position is highly conserved, meaning that when you look at the HFE gene in other organisms, there is usually a cysteine at that position. This is a highly significant risk allele. From the OMIM entry:

“In patients with hemochromatosis, Feder et al. (1996) identified an 845G-A transition in the HFE gene (which they referred to as HLA-H or ‘cDNA 24’), resulting in a cys282-to-tyr (C282Y) substitution. This missense mutation occurs in a highly conserved residue involved in the intramolecular disulfide bridging of MHC class I proteins, and could therefore disrupt the structure and function of this protein. Using an allele-specific oligonucleotide-ligation assay on their group of 178 patients, they detected the C282Y mutation in 85% of all HFE chromosomes. In contrast, only 10 of the 310 control chromosomes (3.2%) carried the mutation, a carrier frequency of 10/155 = 6.4%. One hundred forty-eight of 178 HH patients were homozygous for this mutation, 9 were heterozygous, and 21 carried only the normal allele. These numbers were extremely discrepant from Hardy-Weinberg equilibrium. The findings corroborated heterogeneity among the hemochromatosis patients, with 83% of cases related to C282Y homozygosity.”

In other words, looking at this from the perspective of a physician, most people who receive a clinical diagnosis of hemochromatosis are homozygous for the C282Y allele of HFE.
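The patient counts in the Feder et al. passage can be checked directly, and doing so shows what "discrepant from Hardy-Weinberg equilibrium" means in practice. The sketch below counts C282Y alleles among the 178 patients and then asks what genotype counts random mating would predict at that allele frequency. The function names are illustrative.

```python
def allele_frequency(homozygous: int, heterozygous: int, normal: int) -> float:
    """Variant allele frequency from genotype counts (two alleles per person)."""
    people = homozygous + heterozygous + normal
    return (2 * homozygous + heterozygous) / (2 * people)


def expected_counts(homozygous: int, heterozygous: int, normal: int) -> tuple:
    """Genotype counts predicted by Hardy-Weinberg at the observed allele frequency."""
    people = homozygous + heterozygous + normal
    q = allele_frequency(homozygous, heterozygous, normal)
    p = 1.0 - q
    return (people * q * q, people * 2.0 * p * q, people * p * p)


observed = (148, 9, 21)  # C282Y homozygotes, heterozygotes, normal-only patients
q = allele_frequency(*observed)
print(f"C282Y frequency among patient chromosomes: {q:.1%}")
for label, seen, predicted in zip(
    ["homozygous", "heterozygous", "normal only"], observed, expected_counts(*observed)
):
    print(f"{label}: observed {seen}, expected {predicted:.0f}")
```

Random mating would predict roughly 131 homozygotes, 44 heterozygotes, and 4 patients with only the normal allele. The observed 148, 9, and 21 are consistent with a patient group that mixes C282Y homozygotes with people whose hemochromatosis has a different cause.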

In contrast, also from the OMIM entry, the H63D allele of HFE confers a minor risk of hemochromatosis. Here is one part of the OMIM entry:

“Jouanolle et al. (1996) commented on the significance of the C282Y mutation on the basis of a group of 65 unrelated affected individuals who had been under study in France for more than 10 years and identified by stringent criteria. Homozygosity for the C282Y mutation was found in 59 of 65 patients (90.8%); 3 of the patients were compound heterozygotes for the C282Y mutation and the H63D mutation (613609.0002); 1 was homozygous for the H63D mutation; and 2 were heterozygous for H63D. These results corresponded to an allelic frequency of 93.1% for the C282Y and 5.4% for the H63D mutations, respectively. Of note, the C282Y mutation was never observed in the family-based controls, whereas it was present in 5.8% of the general Breton population. This corresponds to a theoretical frequency of about 1 per 1,000 for the disease, which is slightly lower than generally estimated. In contrast, the H63D allelic frequency was nearly the same in both control groups (15% and 16.5% in the family-based and general population controls, respectively). While the experience of Jouanolle et al. (1996) appeared to indicate a close relationship of C282Y to hemochromatosis, the implication of the H63D variant was not clear.”

So, while the H63D allele of HFE appears to alter the function of HFE, it is almost as frequent among patients lacking a diagnosis of hemochromatosis as among those who are diagnosed with hemochromatosis. People have two alleles, so “having” the H63D allele in this case usually means also having a normal allele. I should also point out that there are other genes, different from HFE, that predispose to hemochromatosis.
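The allelic frequencies quoted from Jouanolle et al. can be reproduced by counting alleles across the 65 patients. In the sketch below, the genotype labels are a reading of the quoted passage (the two patients "heterozygous for H63D" are taken to carry one normal allele), and the function name is illustrative.

```python
from collections import Counter

# Genotype counts for the 65 patients, as read from the quoted passage.
patients = {
    ("C282Y", "C282Y"): 59,
    ("C282Y", "H63D"): 3,
    ("H63D", "H63D"): 1,
    ("H63D", "normal"): 2,
}


def allele_frequencies(genotype_counts: dict) -> dict:
    """Count every allele copy (two per person) and convert to frequencies."""
    alleles = Counter()
    for genotype, people in genotype_counts.items():
        for allele in genotype:
            alleles[allele] += people
    total = sum(alleles.values())
    return {allele: count / total for allele, count in alleles.items()}


frequencies = allele_frequencies(patients)
for allele, freq in frequencies.items():
    print(f"{allele}: {freq:.1%}")
```

This gives 93.1% for C282Y and 5.4% for H63D among the 130 patient chromosomes, matching the figures in the quote.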

The Zellweger Syndrome Spectrum (PEX1) allele that I carry occurs at a frequency of around 0.2%. For hemochromatosis, among the 4,552 chromosomes sampled from the publicly-funded Exome Sequencing Project, the HFE-H63D allele occurs at a frequency of about 10.8%, while the HFE-C282Y allele occurs at a frequency of about 4.7%. Why is there such a wide range in the frequency of disease-causing alleles? I will cover that in my next post.