H3M2: a novel algorithm for the detection of Runs of Homozygosity from second generation
sequencing data
Magi A(1), Tattini L(1), Benelli M(2), Palombo F(3), Pippucci T(3)
(1) Department of Clinical and Experimental Medicine, University of Florence, Florence, Italy
(2) Diagnostic Genetic Unit, Careggi Hospital, Florence, Italy
(3) Dipartimento di Scienze Mediche e Chirurgiche, Università di Bologna, Bologna, Italy
Contact: [email protected]
Motivation
Runs of homozygosity (ROH) are regions of the genome where the copies inherited from our parents
are identical. The two DNA copies are identical because our parents have inherited them from a
common ancestor in the past.
This genetic transmission process generates genomic ROHs that may range from tens of thousands to
millions of bases in length.
Population history (historical bottlenecks or geographic isolation) and cultural factors (consanguineous
marriage or endogamy) can affect the length of ROH in individual genomes. The distribution of ROH
records information about genetic transmission history and may also be important medically.
The medical relevance of ROH is due to the fact that the most common mode of inheritance for a
genetic disorder is autosomal recessive, which implies that two healthy heterozygous parents each
transmit a single copy of a mutation to their affected child, who therefore carries the mutated gene in
double copy.
If individuals inherit the same ancestral mutation identically by descent, then they probably also share
adjacent DNA segments on which the mutation first arose. For a recessive phenotype, in affected
inbred individuals, the homozygous risk locus therefore probably resides in an unusually long
homozygous region. Deleterious recessive variants can thus be identified through detecting ROH.
To date, ROH discovery has been performed by using microarray-based technologies. Currently
available SNP-array platforms contain more than one million SNP markers derived from the
HapMap project and are able to detect ROH down to about 100 kb in size.
Over recent years, the advent of new high-throughput sequencing (HTS) platforms has opened many
opportunities for the study and the understanding of homozygous genomic regions.
Methods
Here we introduce a novel computational approach, H3M2, for the identification of ROH by using data
produced by HTS platforms.
To develop our method we used whole-exome sequencing (WES) data produced by the 1000 Genomes
Project (1000GP) consortium for samples that had previously been genotyped with SNP-array
technologies by the HapMap consortium.
As a measure of homozygosity/heterozygosity at each polymorphic position i, we used the B-allele
frequency (BAF), defined as the ratio between the B-allele count (N_B, the number of reads that
match the allele with minor frequency at position i) and the total number of reads mapped to that
position (N, the depth of coverage).
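The BAF definition can be illustrated with a minimal Python sketch. The function name and the read
counts are hypothetical and are not taken from the H3M2 implementation; returning None for uncovered
positions is also an assumption made here for illustration.

```python
def b_allele_frequency(n_b, depth):
    """Return BAF = N_B / N for one polymorphic position.

    n_b   -- number of reads matching the B allele (N_B)
    depth -- total number of reads mapped to the position (N)
    Returns None when no reads cover the position (an assumption of this sketch).
    """
    if depth <= 0:
        return None
    if not 0 <= n_b <= depth:
        raise ValueError("n_b must lie between 0 and depth")
    return n_b / depth


# Hypothetical (N_B, N) read counts at three polymorphic positions.
counts = [(0, 40), (19, 40), (40, 40)]
bafs = [b_allele_frequency(n_b, n) for n_b, n in counts]
```

With these illustrative counts, the first and last positions give BAF values at the extremes (0 and 1),
as expected for homozygous genotypes, while the middle position gives a value close to 0.5, as
expected for a heterozygous genotype.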
To understand the capability of such a measure to predict the homozygous/heterozygous state of each
polymorphic position, we compared BAF with the HapMap and 1000GP genotyping results.
As a further step, we studied the properties of the BAF distribution in different classes of genomic
regions. By using the results of these analyses, we developed a novel algorithm, based on a
heterogeneous hidden Markov model, that is able to recognise ROH from BAF values while taking into
account the genomic distance between consecutive polymorphic positions.
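To make the idea of a distance-aware (heterogeneous) HMM concrete, the following Python sketch
decodes a two-state model with the Viterbi algorithm. It is not the H3M2 implementation: the two
states, the Gaussian emission mixtures, the form of the switching probability
p_max * (1 - exp(-d / d_norm)) and all parameter values are assumptions chosen only for illustration.

```python
import math

STATES = ("ROH", "nonROH")


def gauss(x, mu, sigma):
    """Gaussian density, used here as a simple stand-in for the BAF emission model."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def emission(state, baf, sigma=0.1):
    """Assumed emission density: ROH strongly favours BAF near 0 or 1."""
    hom = 0.5 * gauss(baf, 0.0, sigma) + 0.5 * gauss(baf, 1.0, sigma)
    het = gauss(baf, 0.5, sigma)
    if state == "ROH":
        return 0.99 * hom + 0.01 * het  # small allowance for miscalled positions
    return 0.6 * hom + 0.4 * het


def switch_prob(distance_bp, p_max=0.1, d_norm=100_000):
    """Assumed state-change probability that grows with the distance between positions."""
    return p_max * (1.0 - math.exp(-distance_bp / d_norm))


def viterbi(positions, bafs):
    """Return the most likely state sequence for BAF values at the given positions."""
    score = {s: math.log(0.5) + math.log(emission(s, bafs[0])) for s in STATES}
    back = []
    for k in range(1, len(bafs)):
        # The transition probability differs for every pair of consecutive positions.
        p = max(switch_prob(positions[k] - positions[k - 1]), 1e-12)
        new_score, pointers = {}, {}
        for s in STATES:
            cands = {r: score[r] + math.log(1.0 - p if r == s else p) for r in STATES}
            best = max(cands, key=cands.get)
            pointers[s] = best
            new_score[s] = cands[best] + math.log(emission(s, bafs[k]))
        score = new_score
        back.append(pointers)
    # Trace back from the best final state.
    state = max(score, key=score.get)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical data: heterozygous flanks around a stretch of 40 homozygous positions.
flank = [0.5, 0.0, 0.5, 1.0, 0.5]
example_bafs = flank + [0.0, 1.0] * 20 + flank
example_positions = [10_000 * k for k in range(len(example_bafs))]
example_path = viterbi(example_positions, example_bafs)
```

Because the switching probability depends on the gap between neighbouring positions, closely spaced
polymorphic positions are more likely to share a state than positions separated by a large
uncovered interval, which matters for exome data where coverage is concentrated in targeted regions.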
Results
To test the ability of our algorithm to identify ROHs of different sizes and comprising different
numbers of SNPs, we performed an intensive simulation based on synthetic data.
Synthetic chromosomes were generated from the BAF data of six samples sequenced by the 1000
Genomes Project (1000GP) consortium.
The results of these analyses clearly show that H3M2 is able to identify ROH with an unprecedented
resolution, outperforming state-of-the-art methods based on microarray technologies.
As a further test, we applied our method to the analysis of 100 individuals sequenced by the 1000GP
consortium representing five different worldwide populations (20 Tuscans, 20 Caucasians, 20
Yoruba, 20 Japanese and 20 Chinese).
The ROH detected by our method were classified by length into three classes (short, intermediate, and
long) with a model-based clustering algorithm.
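One way to carry out such a model-based classification is sketched below in Python: a
three-component Gaussian mixture is fitted by expectation-maximisation to log10 ROH lengths, and
each ROH is assigned to its most probable component. The abstract does not state which mixture
model or transformation was used, so the Gaussian mixture, the log10 transform, the function names
and the example lengths are all assumptions of this sketch.

```python
import math


def _log_dens(x, w, m, v):
    """Log of (weight * Gaussian density) for one mixture component."""
    return math.log(w) - 0.5 * math.log(2 * math.pi * v) - 0.5 * (x - m) ** 2 / v


def fit_gmm_1d(values, k=3, n_iter=200, min_var=1e-3):
    """Fit a 1-D Gaussian mixture by EM; returns (weights, means, variances)."""
    xs = sorted(values)
    n = len(xs)
    # Initialise each component from an equal-sized chunk of the sorted data.
    chunks = [xs[i * n // k:(i + 1) * n // k] for i in range(k)]
    means = [sum(c) / len(c) for c in chunks]
    variances = [max(sum((x - m) ** 2 for x in c) / len(c), min_var)
                 for c, m in zip(chunks, means)]
    weights = [1.0 / k] * k
    for _ in range(n_iter):
        # E-step: responsibilities, computed in log space for numerical stability.
        resp = []
        for x in values:
            logs = [_log_dens(x, w, m, v) for w, m, v in zip(weights, means, variances)]
            top = max(logs)
            exps = [math.exp(l - top) for l in logs]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: update weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = max(nj / n, 1e-12)
            if nj < 1e-9:
                continue  # keep the previous mean/variance for an empty component
            means[j] = sum(r[j] * x for r, x in zip(resp, values)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, values)) / nj
            variances[j] = max(var, min_var)
    return weights, means, variances


def classify_roh_lengths(lengths_bp):
    """Label each ROH length as 'short', 'intermediate' or 'long'."""
    logs = [math.log10(x) for x in lengths_bp]
    weights, means, variances = fit_gmm_1d(logs, k=3)
    # Components are named by increasing mean length.
    order = sorted(range(3), key=lambda j: means[j])
    names = dict(zip(order, ("short", "intermediate", "long")))
    labels = []
    for x in logs:
        dens = [_log_dens(x, w, m, v) for w, m, v in zip(weights, means, variances)]
        labels.append(names[dens.index(max(dens))])
    return labels


# Hypothetical ROH lengths in base pairs (not data from the study).
example_lengths = [40_000, 55_000, 60_000, 75_000, 90_000, 110_000,
                   400_000, 450_000, 520_000, 600_000, 700_000, 800_000,
                   3_000_000, 4_000_000, 5_500_000, 7_000_000, 9_000_000, 12_000_000]
example_labels = classify_roh_lengths(example_lengths)
```

Letting the mixture model place the class boundaries, rather than fixing length thresholds in
advance, allows the boundaries to adapt to the length distribution observed in each data set.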
In accordance with previous results, for each class, the number and total length of ROHs per individual
show considerable variation across individuals and populations.
The total lengths of short, intermediate and long ROHs per individual increase with the distance of a
population from East Africa, in agreement with similar patterns previously observed for locus-wise
homozygosity and linkage disequilibrium.