If some alleles in our HLA reports are flagged with the identifier “exon mismatch”, there may be a good explanation for it. This note briefly describes the most common causes.
The IMGT HLA database is a standard reference library for determining HLA genotypes. It is the largest repository of human major histocompatibility complex (MHC) sequence data available today and is growing at an exponential rate. It includes HLA sequences from highly diverse geographic and ethnic populations, and it provides a comprehensive set of Class I and Class II HLA alleles named according to the official WHO HLA Nomenclature.
HLA Typing Algorithm
We use the GenDx NGSengine application to perform HLA typing on sequenced datasets. NGSengine uses the IMGT HLA database as its sequence reference library: it aligns sample reads against the exon and intron sequences in the database, phases the reads and assigns HLA alleles. NGSengine uses statistical models to predict the genotypes in the IMGT database that best match a DNA sample. Because the matching is statistical, the assigned allele is not always identical in sequence to the sample, and any differences within exons may be reported as exon mismatches.
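To make the idea of an exon mismatch concrete, here is a minimal sketch in Python. It is not NGSengine's algorithm, which we do not reproduce here; it only assumes that a sample consensus sequence and a reference allele have already been aligned exon by exon to equal length. The function name `exon_mismatches` and the toy sequences are hypothetical.

```python
def exon_mismatches(sample_exons, reference_exons):
    """Return {exon: [1-based mismatch positions]} for exons that differ.

    Assumes each sample exon is already aligned to the reference exon
    (same length, no gaps). Illustration only, not NGSengine's method.
    """
    report = {}
    for exon, ref_seq in reference_exons.items():
        sample_seq = sample_exons.get(exon)
        if sample_seq is None:
            continue  # exon not covered in the sample; nothing to compare
        if len(sample_seq) != len(ref_seq):
            raise ValueError(f"{exon}: sequences must be pre-aligned to equal length")
        positions = [
            i + 1
            for i, (s, r) in enumerate(zip(sample_seq, ref_seq))
            if s != r
        ]
        if positions:
            report[exon] = positions
    return report


# Toy data, not real HLA sequences.
reference = {"exon2": "ACGTACGTAC", "exon3": "GGCTTAACCG"}
sample = {"exon2": "ACGTACGTAC", "exon3": "GGCTAAACCA"}

mismatch_report = exon_mismatches(sample, reference)
print(mismatch_report)  # {'exon3': [5, 10]}
```

In this toy case exon 2 matches the reference exactly, while exon 3 differs at two positions and would be counted as two exon mismatches.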
NGSengine will prominently display and highlight exon mismatches in our HLA typing reports.
For example, Table 1 shows exon mismatches in DRB4 for allele 1 (3 mismatches) and allele 2 (7 mismatches). There are no exon mismatches in the remaining genes. Typically, we’ll see one or just a few mismatches in a particular gene. It’s quite rare to see a large number of mismatches, which could indicate some other issue with DNA quality, library prep or large structural rearrangements in the gene itself.
Exon mismatches may be indicative of a novel allele that is not present in the IMGT database or a new allele created through genetic recombination.
The IMGT database is a standard reference database that contains most of the known human sequences for the MHC/HLA (major histocompatibility complex/human leukocyte antigen) chromosomal region. The current version of the database contains about 32,330 HLA alleles. Next Generation Sequencing HLA typing algorithms use this database to make typing calls for all sequences found in a sample. If a sequence found in a sample matches an HLA allele that already exists in the database, the typing algorithm can make a confident call based on that known allele sequence. However, if the algorithm encounters a sequence that is not represented in the database, it either imputes the statistically most likely matching allele from the database or declares the sequence a novel allele, i.e. one that does not exist in the database and has never been identified before. Most alleles listed in our HLA reports are found in the database; novel alleles turn up only rarely. One of our blog posts describes how we occasionally discover novel alleles. When a sample contains a novel allele that is not present in the IMGT database, its differences from the closest known allele may be reported as exon mismatches.
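The decision described above (exact match versus closest match with residual mismatches) can be sketched in a few lines of Python. This is a deliberately simplified stand-in that uses a plain mismatch count rather than the statistical models a real typing algorithm uses. The allele labels (`toy*01` and so on), the sequences and the function names are all hypothetical.

```python
def hamming(a, b):
    """Count positions at which two pre-aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    return sum(x != y for x, y in zip(a, b))


def call_allele(sample_seq, database):
    """Return (closest allele name, mismatch count, status) for a sample sequence."""
    best_name = min(database, key=lambda name: hamming(sample_seq, database[name]))
    distance = hamming(sample_seq, database[best_name])
    if distance == 0:
        status = "exact match"
    else:
        status = "closest match, possible novel allele"
    return best_name, distance, status


# Toy reference library; labels and sequences are invented for illustration.
toy_database = {
    "toy*01": "ACGTACGT",
    "toy*02": "ACGTTCGA",
    "toy*03": "TCGTACGG",
}

known_call = call_allele("ACGTTCGA", toy_database)
novel_call = call_allele("ACGTTCCA", toy_database)
print(known_call)  # ('toy*02', 0, 'exact match')
print(novel_call)  # ('toy*02', 1, 'closest match, possible novel allele')
```

The second sample has no identical entry in the toy library, so it is assigned to the nearest allele and carries one residual mismatch, which is the situation in which a report would show an exon mismatch.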
A significant number of exon mismatches may indicate large structural genomic rearrangements such as genetic recombination.
For example, we have a documented case of a recombination event for HLA-DPA1. As shown in Table 2, DPA1 initially displayed 7 exon mismatches for allele 2 in this particular sample. Based on a deeper analysis of this allele with NGSengine (beyond the scope of this note), we discovered that DPA1*01:05 was in fact a recombination of DPA1*01:03 and DPA1*02:01, and that the exon mismatches were derived from nucleotide sequences in exon 2 of DPA1. Interestingly, this putative new allele was also observed in several other American and European labs around the same time.
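One simple way to test whether a sequence looks like a recombinant of two known alleles is to build every possible chimera (the start of one parent joined to the end of the other) and check which one fits the sample best. The Python sketch below illustrates that idea under strong assumptions: a single crossover, pre-aligned sequences of equal length, and toy data. It is not the analysis we ran for DPA1, and `recombinant_fit` is a hypothetical name.

```python
def recombinant_fit(sample, parent_a, parent_b):
    """Fit the sample to chimeras parent_a[:k] + parent_b[k:] for every k.

    Returns (fewest mismatches, list of breakpoints k that achieve it).
    Assumes one crossover and pre-aligned, equal-length sequences.
    """
    if not (len(sample) == len(parent_a) == len(parent_b)):
        raise ValueError("sequences must be pre-aligned to equal length")
    scores = []
    for k in range(len(sample) + 1):
        chimera = parent_a[:k] + parent_b[k:]
        scores.append(sum(x != y for x, y in zip(sample, chimera)))
    best = min(scores)
    breakpoints = [k for k, score in enumerate(scores) if score == best]
    return best, breakpoints


# Toy sequences, not real DPA1 alleles.
parent_a = "ACGTACGTAC"
parent_b = "ATGTACGAAT"
sample = "ACGTACGAAT"  # starts like parent_a, ends like parent_b

mismatches_vs_a = sum(x != y for x, y in zip(sample, parent_a))
mismatches_vs_b = sum(x != y for x, y in zip(sample, parent_b))
fit = recombinant_fit(sample, parent_a, parent_b)
print(mismatches_vs_a, mismatches_vs_b)  # 2 1
print(fit)  # (0, [2, 3, 4, 5, 6, 7])
```

Against either parent alone the toy sample shows mismatches, but a chimera of the two explains it with none. The breakpoint can only be narrowed down to the interval between the positions where the parents differ, which is why a range of values is returned.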
Fig. 1 shows a schematic example of the evolution of paralogous genes. An ancestral human Histone H1 gene undergoes a gene duplication event, yielding two identical genes: Histone H1.1 and Histone H1.2. (In this example a speciation event also occurs, although this is not strictly required for the existence of paralogous genes.) Over time each gene copy diverges in sequence from its paralog through accumulated mutations. The mutation rate and selection pressure determine how fast and by how much the two copies diverge. Initially the paralogous genes display high sequence homology and may differ by only a single SNP, MNP or small indel. The paralogous genes may retain high homology over time if there is strong selection pressure against sequence divergence, or they may diverge significantly if selection pressure is weak. In either case we have two paralogous genes with highly similar gene sequences.
In HLA typing, the allele groups DRB1*04 and DRB4*01 are a well-known example of alleles from paralogous genes. Doxiadis et al. posit a single ancestral DRB gene common to Hominidae and Cercopithecoidea (Old World monkeys). A gene duplication event occurred before speciation to create paralogous DRB genes.
Following speciation, a complex set of evolutionary events yielded the related human HLA alleles DRB1*04 and DRB4*01 (Fig. 2). Apparently the evolution of the DRB gene family included the rearrangement of transposable elements, long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), retrotransposons and other genomic modifications. However, DRB1*04 and DRB4*01 retain high sequence homology due to their common ancestry. NGSengine may report the slight sequence differences between such paralogs as exon mismatches.