BRIEFINGS IN FUNCTIONAL GENOMICS AND PROTEOMICS. VOL 8. NO 5. 395–404
doi:10.1093/bfgp/elp015
Genetic variation in South Asia:
assessing the influences of geography,
language and ethnicity for
understanding history and disease risk
Qasim Ayub and Chris Tyler-Smith
Advance Access publication date 17 June 2009
Abstract
South Asia is home to more than 1.5 billion people, almost one-quarter of humanity, representing many diverse ethnic, linguistic and religious
groups. Modern humans arrived here soon after their departure
from Africa 50 000–70 000 years before present (YBP) and several subsequent human migrations and invasions,
as well as the unique social structure of the region, have helped shape the pattern of genetic diversity currently
observed in these populations. Over the last few decades population geneticists and molecular anthropologists
have analyzed DNA variation in indigenous populations from this region in order to catalog their genetic relationships and histories. The emphasis is gradually shifting from the study of population origins to high resolution surveys
of DNA variation to address issues of population stratification and genetic susceptibility or resistance to diseases
in genome-wide association surveys. We present a historical overview of the genetic studies carried out on populations from this region in order to understand the influence of geographic, linguistic and religious factors on population diversity in this region, and discuss future prospects in light of developments in high throughput genotyping
and next generation sequencing technologies.
Keywords: South Asia; Pakistan; India; population genetics; genetic variation; HGDP-CEPH
INTRODUCTION
South Asia encompasses the Indo–Pak sub-continent
and includes India, Pakistan, Bangladesh, the
Kingdoms of Nepal and Bhutan and the islands of
Sri Lanka, the Maldives and the Chagos Archipelago.
The land mass is bordered by Iran and Afghanistan
on the west, China in the north, and Myanmar on
its eastern fringes. The Indian Ocean straddles its
entire southern coast line. Its population of around
1.5 billion individuals constitutes approximately one-fourth of humanity. Over the centuries this area has
witnessed many invasions and migrations mainly
from the West. Genetically, it contains the site of
what was once called ‘the grandest genetic experiment ever performed on Man’ and, in several surveys
of worldwide diversity, one of the most outstanding
populations making up a sixth grouping of humanity
comparable in these analyses to major continental
groupings [1, 2]. In this review, we will consider
how our understanding of the historical, cultural
and environmental factors that have shaped its inhabitants has developed, something of what we have
learned, and prospects for future developments.
This now-densely occupied land was encountered
by the first populations of modern humans that ventured out of Africa more than 50 000 years ago. It is
suggested that they arrived in this part of Asia via a
southern coastal route and continued to Southeast
Asia. The fossil and archaeological evidence for
human settlements in this part of the world prior
Corresponding author. Qasim Ayub, The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA,
UK. Tel: +44-1223-834244, Fax: +44-1223-494919, E-mail: [email protected]
Qasim Ayub is the Human Evolution Team’s Staff Scientist at the Wellcome Trust Sanger Institute, Hinxton (http://www.sanger.ac.uk/Teams/Team19/).
Chris Tyler-Smith heads the Human Evolution Team at the Wellcome Trust Sanger Institute, Hinxton (http://www.sanger.ac.uk/Teams/Team19/). Their group investigates DNA variation in populations in order to understand recent human evolution.
© The Author 2009. Published by Oxford University Press. For permissions, please email: [email protected]
to 9000 YBP is, unfortunately, sparse although settlements dating much further back are now beginning to emerge, including from Patne in western
India and Batadomba-lena in Sri Lanka, dated to
between 30 000 and 34 000 YBP [3–5]. There is evidence for indigenous animal and plant domestication
in many places in the region, the earliest being found
at a Neolithic site in Mehrgarh, in Southwest
Pakistan [6]. Subsequent epochal events included
the development and decline of the Indus Valley
civilization’s Harappan culture, the arrival of Indo-European speakers from central or west Asia, linked
to the introduction of the caste system in India and
the possible displacement of Dravidian speakers to
their current location in south central India. More
recent historical events have included Alexander’s
invasion in 327 BC, the Arab invasion and subsequent Muslim conquest and rule of India, followed
by the British Raj that lasted until the partition in
1947 [7].
More than 4000 well-defined population groups
including 500 tribal and approximately 30 hunter-gatherer groups reside in this region [8]. The overwhelming
majority are endogamous. The endogamous Hindu
caste system, and the presence of large consanguineous families in the region, especially among the
Muslim populations, provide unique resources for
unraveling the genetic basis of disease. Like elsewhere, demographic events such as genetic bottlenecks, population expansions and admixture have
shaped the genetic diversity that is observed in this
region today.
Linguistically the region comprises major groups
of Indo-European, Dravidian, Tibeto-Burman and
Austro-Asiatic speakers, minor groups like the
Andamanese, and even language-isolate groups like
the Hunza Burusho in northern Pakistan and Nihali
in Madhya Pradesh, India (Figure 1). More than 70%
of the population speak Indo-European languages.
Tibeto-Burman speakers are present in the north
and northeast and form a majority in Bhutan.
Dravidian languages are prevalent in south India,
Sri Lanka and, intriguingly, in Balochistan province
of Pakistan. Austro-Asiatic languages are spoken by
many Indians in the south and close to the border
with Myanmar [9]. The region is home to followers
of many religions, the major among them being
Islam, Hinduism, Buddhism and Sikhism. The population also includes sizeable Christian, Jewish and
Zoroastrian minorities. All have contributed to the
genetic and cultural diversity found here.
During the past few decades molecular geneticists
and anthropologists have analyzed DNA variation
among human populations in order to catalog their
genetic relationships and to glean information
about recent human evolution. Several such studies
have been carried out in populations from South
Asia, especially Pakistan, which is well-represented
in the Human Genome Diversity Project (HGDP),
and more recently India, in order to study their
origins and their susceptibility, or resistance, to disease. We first present an overview of the genetic
studies carried out on populations from this region
in order to understand the influence of geographic,
linguistic and religious patterns on population
diversity in this region and go on to discuss future
prospects for analyses of genetic variation in this
region.
PAST
Throughout the 1970s and 1980s classical serological
markers such as human blood groups, leukocyte
antigens (HLA), glucose-6-phosphate dehydrogenase
and other isozymes were used to study a limited
number of populations from South Asia [10].
The advent of polymorphic DNA markers such as
restriction fragment length polymorphisms (RFLPs),
variable number of tandem repeats (VNTRs), short
tandem repeats (STRs) or microsatellites, and single-nucleotide polymorphisms (SNPs) and the technological breakthroughs that made it relatively easy to
genotype large numbers of markers in thousands of
samples permitted the extensive analyses of genetic
variation in humans. Initially the laborious techniques of RFLP and hybridization were applied to
analyze a few hundred samples but the development
of polymerase chain reaction (PCR) amplification
made possible the extensive analyses of DNA variation in thousands of samples.
Several initial studies analyzed mitochondrial
DNA and the Y chromosome in selected populations from Pakistan and India in order to obtain
glimpses into their female and male ancestry, respectively. Sequence differences among individual mitochondrial DNAs separate them into haplotypes and
haplogroups that provide a snapshot of population
origins from the female perspective [11]. Similarly,
the male-specific part of the Y chromosome is passed
down from father to son without change, except
for the gradual accumulation of mutations which
appear as DNA polymorphisms and provide a male
Figure 1: Linguistic boundaries in South Asia.
perspective on human evolution. Human Y chromosomes are classified into haplogroups, defined by
unique-event or bi-allelic polymorphisms alone, and haplotypes, defined by
these polymorphisms in combination with Y-STRs [12].
A number of other studies analyzed autosomal STRs
and Alu insertion polymorphisms to gauge an overall
view of autosomal genetic diversity in the region and
address the issue of population origins in light of
historical context [13–16].
Migrants
Populations from Pakistan and north India share a
sizable proportion of variation with European populations, and Central and West Asians have been major
contributors to the gene pool of this region, consistent with arrival of migrants from Northwest Asia
[13, 14, 17]. In Pakistan the Karakoram Mountain
ranges that are part of the Himalayan Mountains in
the north have been a major barrier to gene flow
from China, although this does not appear to be
the case in India where Sino-Tibetan populations
entered through the north-east corridor and
Nepal [18].
Most Y-chromosomal lineages found in both
tribal and non-tribal populations from South Asia
date back to the Paleolithic period. Common haplogroups (markers) include C5 (M356), F (M89),
H1 (M52), H2 (Apt), L1 (M27; M76), R1a1
(M17) and R2 (M124) [19, 20]. All except for
R1a1, which constitutes 55% of Y chromosomes in
Pakistan, have long coalescent times and appear to be
indigenous. Y haplogroup J2 (M172) which is prevalent in Pakistan and north and west India represents
a West Asian contribution to the genetic diversity of
the sub-continent (although the migrants would of
course have carried other lineages as well) [21]. O2
(P31), O3 (M122) and Q (M242) lineages in India
and Pakistan are probably representative of migrants
from East and Southeast Asia [21, 22]. Generally
haplotype variation between populations is higher,
especially between tribes [23, 24].
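As an illustration of how haplogroup frequencies such as those quoted above might be tallied, the following minimal Python sketch maps each sample's terminal Y-chromosomal marker to the haplogroup named in this section and reports the resulting frequencies. The marker-to-haplogroup table is taken from the text; the function name, the example sample list and the simplification that each chromosome is summarized by a single terminal marker are hypothetical, and this is not the procedure of any cited study.

```python
from collections import Counter

# Marker -> haplogroup pairs as listed in this section of the review.
MARKER_TO_HAPLOGROUP = {
    "M356": "C5", "M89": "F", "M52": "H1", "Apt": "H2",
    "M27": "L1", "M76": "L1", "M17": "R1a1", "M124": "R2",
    "M172": "J2", "P31": "O2", "M122": "O3", "M242": "Q",
}


def haplogroup_frequencies(terminal_markers):
    """Return {haplogroup: frequency} for a list of terminal markers.

    Assumption: each sampled Y chromosome is represented by one terminal
    (most derived) marker; markers not in the table count as 'unassigned'.
    """
    counts = Counter(
        MARKER_TO_HAPLOGROUP.get(marker, "unassigned")
        for marker in terminal_markers
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {haplogroup: n / total for haplogroup, n in counts.items()}


if __name__ == "__main__":
    # Hypothetical sample of four Y chromosomes.
    example = ["M17", "M17", "M172", "M76"]
    for haplogroup, freq in sorted(haplogroup_frequencies(example).items()):
        print(f"{haplogroup}: {freq:.2f}")
```

A real survey would type markers hierarchically and assign each chromosome to the most derived branch it carries; the flat lookup here only shows the counting step.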
More than 60% of mitochondrial lineages in this
region are represented by haplogroup M and mitochondrial variation, especially in haplogroup M, is
largely local in origin with estimated coalescence
times in the Paleolithic [18]. Approximately 30% of
Indian mtDNA haplotypes belong to West Eurasian
haplogroups. Haplogroup R7 and U2 lineages are
specific to India and the variation in their distribution
also suggests an ancient origin [25, 26]. It has been
argued that the presence of Eurasian and Australasian
mitochondrial haplogroups M, N and R and Y
haplogroups C (M130), D (M174) and F in
South Asia supports the migration of humans out
of East Africa via a southern coastal route through
the sub-continent [27]. The Andaman and Nicobar
Islanders that comprise six tribal populations and lie
on the southern coastal route to Oceania show a
high level of population differentiation with unique
mitochondrial sequences in the Andamanese probably arising by genetic drift in an isolated population.
However, the Nicobarese share genetic relatedness
with Southeast Asians and are probably more recent
migrants in comparison with their neighbors [28].
The presence of haplogroups B5a and F1 near the
border between India and Nepal suggests movement
across the Himalaya Mountains and this possibility
is also supported by the detection of Indian haplogroup R6 in the Nepalese [18].
The Indian caste system represents a unique
socio-economic hierarchy associated with the
Hindu religion, that broadly distinguishes upper
(Brahmins/priests), middle (Kshatriyas/warriors;
Vaishyas/farmers and traders; Sudras/laborers) and
lower (Panchama/untouchables) or scheduled
castes. It is associated with the spread of Indo-European languages after 3500 YBP, but the
extent of gene flow from outside remains a matter
of debate. Early studies were interpreted to suggest
extensive gene flow from West and Central Asia but
recent data, obtained using a larger number of markers, have been used to argue that most DNA variation was indigenous in origin [20, 29]. Male
substructure has been observed in the higher (priest
and warrior) classes from Jaunpur District in central
India and male gene flow between the castes was low
(<1% per generation) there, as expected from the
social rules [24]. Some studies show a genetic affinity
between the lower castes and tribal populations, whereas
the frequency of West Eurasian haplotypes is proportional to
caste rank, being highest in the upper castes [13, 30].
Upper caste Indians share 10–20% of variation with
European populations [13, 14].
Several populations stood out as being genetically
distinct in the initial studies based upon Y chromosomal and autosomal STRs. In the Kalash population
from Pakistan, referred to at the beginning of this
review as a sixth grouping of humanity, this was
most likely due to drift in a geographically and religiously isolated group that has undergone a population bottleneck during their recent migration to their
present day settlements in the Hindu Kush Mountain
valleys in northern Pakistan from where the individuals were sampled. A larger survey that includes
populations from their ancestral homeland in Nuristan, Afghanistan, would provide more insights about
their unique genetic structure. They have a major
West Eurasian mitochondrial component along
with certain population-specific Y lineages (L3a;
PK3) with little Y-STR variation and no genetic
affiliation with East Asian populations [21, 26, 31].
The origins of the Parsi are well documented
and, although only a few thousand now live in
Pakistan, there are many in India. They migrated
to Gujarat in India after the collapse of the Sassanian
empire and their name relates to
their geographic origin, meaning ‘from Iran’. Their
Y chromosomes are closer to populations from
present-day Iran but 60% of their maternal gene
pool belongs to South Asian haplogroups not
found in Iran, highlighting their affinities with local
Gujarati women [17, 26].
A number of East African slaves were brought to
this region as involuntary migrants and among such
groups are the Makranis who have physical features
typical of African populations. The contribution of
sub-Saharan African Y chromosomes to this population was estimated to be 12% and, combined with
the presence of sub-Saharan mitochondrial haplogroups, represents the genetic legacy of the East
African slave trade that existed in this region [17, 26].
Preliminary data on the Nepalese and Bhutanese
populations using autosomal and Y-STRs show significant differences in comparison with their geographic neighbors consistent with genetic drift in
small geographically isolated populations, the differentiation being more striking in the Bhutanese [32–34].
More detailed genetic analyses of the samples
collected under the ‘Languages and Genes in the
Greater Himalaya Region’ Project should provide a
clearer picture.
Invaders
A genealogical approach based on the
hierarchical use of Y chromosomal markers in ethnic
groups from this region has provided some interesting insights into historical events. In particular, analyses of Y lineages in the Hazara population from
Pakistan established the presence of a ‘star haplotype’
that could be directly linked to Genghis Khan or his
male ancestors and that was spread by his sons throughout Eurasia [35]. An examination of their mitochondrial DNA suggests that they were accompanied by
women of East Asian ancestry and their autosomal
DNA also shows genetic relatedness with East
Asians [2, 26].
Although several populations from northern
Pakistan claimed that they were the descendants of
Greek soldiers, left behind in this region by
Alexander the Great, this was not generally borne
out by genetic analyses. Only a small proportion of
haplogroup E1b1b1a (M78) Y chromosomes in the
Pathan population of Pakistan provided strong evidence of a small Greek contribution [31].
Similarly, the Muslim invasions do not appear
to have left a readily detectable genetic imprint in
India. Islam seems to have spread by conversions and
cultural diffusion: overall the Muslims in India are
genetically closer to their non-Muslim geographical
neighbors than to other Muslim populations and
no correlation exists between genetic variation and
religious beliefs [36]. This is also true historically.
Muslims constitute more than 400 million of the
population of this region and since their advent in
the seventh century they have included many
ethnicities from Arabia, Turkey, Persia and Central
Asia. Some studies [37] have attempted to analyze
these Muslim populations based along the sectarian
divide between the two major Muslim sects (Shias
and Sunnis) but genetic analyses are not appropriate
for analyses of recent political and religious events
and any differences that are observed are likely to
relate to the ethnic or geographic origins of the
source populations.
Linguistics
Indo-European, Dravidian, Tibeto-Burman and
Austro-Asiatic languages are spoken in this region
and overall genetic relationships, as ascertained by
STRs, SNPs and other genetic markers, are dictated
primarily by geographic proximity rather than linguistic origin [9, 38].
Indo-European languages form the predominant
language group and the genetic relatedness between
the European and South Asian populations indicates
that they may well have shared a common language
superfamily, Dene-Caucasian. The Indo-European
family may have spread to South Asia 6000–10 000
YBP replacing the languages spoken earlier almost
everywhere [39, 40]. The Y haplogroup R1 (M173)
is often referred to as an Indo–European marker and
its associated haplogroup R1a1 is present at high frequency in many regions where Indo–European
speakers live. The worldwide distribution of this
haplogroup indicates frequency peaks in Eastern
Europe and West and South Asia, which fits in
with historical records of nomadic settlements in
Europe and India. However, its presence in 15% of
Dravidian speakers in India argues against a simple
correlation. Although recently several new SNPs
have been identified that refine this branch of the
tree they are restricted to a few individuals in the
same population or show genetic exchange across
the Gulf of Oman [21, Underhill et al., unpublished
data].
An elite dominance model of the Indo-European
speakers partly explains the genetic similarities
observed between the Dravidian and Indo-European groups and the seclusion of Dravidians in
southern India and parts of Sri Lanka but it does not
explain the enigma of the Brahui. This Dravidian-speaking population resides in the Balochistan province in southwestern Pakistan and is surrounded
on all sides by Indo-European speakers. Quintana-Murci
and co-workers argued that they arrived in South
Asia from southwestern Iran with the expansion
of Dravidian-speaking farmers although this was
contested by later studies using extensive Y markers
[20, 29].
The Austro-Asiatic languages may be the most
ancient in the region and preliminary analyses
of mitochondrial DNA, a limited number of
Y-chromosomal and autosomal markers had suggested that the Austro-Asiatic tribes appear to be
descendants of remnant earlier settlers and share
genetic affinity with Tibeto-Burman populations
[14]. Subsequent analyses using a higher resolution
of Y haplotypes in a larger sample of the three major
Austro-Asiatic groups of India (Mundari, Khasi-Khmuic and Mon-Khmer) demonstrated a strong
paternal genetic link amongst these populations and
those from Southeast Asia. The authors estimated
that the protohaplogroup O2a originated in the
Indian Austro-Asiatic populations 65 000 years
ago and entered Southeast Asia via the Northeast
Indian corridor. Mitochondrial DNA varied
between Austro-Asiatic groups from Southeast Asia
and India reflecting a difference in the history of the
sexes [25]. Since some estimates of the MRCA of
all extant Y chromosomes are as recent as 59 000
YBP, such time estimates should be treated with
caution [41].
Several Tibeto-Burman groups reside in north
and north east India, and in Pakistan the Balti
population from the Karakoram Mountains also
speaks a Tibeto-Burman language. Y-chromosomal
analyses of only a limited number of these Balti
speakers have been carried out and they are not
noticeably different from their neighbors in northern
Pakistan [17].
The Hunza Burusho were of particular genetic
interest because their language, Burushaski, is one
of the few remaining language isolates in the world
[9]. However, they are genetically close to their
geographic neighbors in Pakistan [38]. Their isolation in the Karakoram Mountains habitat may have
preserved their language but any differences between
them and their neighbors, who have acquired new
languages, have been greatly diluted by genetic
exchange. Analyses of the language isolate Nihali
speakers in India will show whether they exhibit
similar genetic affinity with their geographic
neighbors.
Disease Genetics
Families from South Asia have been invaluable in
understanding the genetic basis of several Mendelian
disorders. The high rate of consanguineous marriages, large family sizes and contemporary inbreeding among tribes, clans and ethnicities make them
suitable for linkage analyses [42, 43]. The genetic
basis of several single gene disorders leading to syndromic and non-syndromic blindness, deafness,
thalassemias, skeletal, hair, skin and nail disorders
has been unraveled in families and populations
from this region [44–49]. Several of these mutations
are family- or population-specific [44, 50]. The
immediate benefits of these analyses include genetic
testing to exclude such disease in the unborn child.
However, the real challenge is to translate these findings into public health education programs that will
benefit these families and communities.
Determination of HLA frequencies at higher allelic resolution than achieved by serological methods
through DNA based genotyping in ethnic groups
from this region and their association with diseases
like malaria, tuberculosis, leprosy and rheumatic
heart fever has identified several high-risk alleles
[51–53]. Genetic association studies using single,
or a few genetic variants, have also been carried
out but their associations have not been replicated
across populations possibly due to ascertainment bias,
choice of markers, insufficient statistical power, population stratification, or differences in linkage-disequilibrium patterns in patients and controls.
PRESENT
As a consequence of advances in genotyping technologies, the focus of current studies is shifting from
investigating population origins using small numbers
of loci to analyzing large numbers of SNPs, structural
and copy number variants (CNV) to address issues of
population sub-structure and group membership
that will have practical applications in the design of
disease association studies, in rationalizing use of
medicines tailored to an individual’s genetic make
up and DNA-based forensic analyses. The discovery
of a single high risk variant in the cardiac myosin
binding protein C (MYBPC3) that is restricted to
4% of South Asians and predisposes strongly to
heart failure in later life is testament to both
the promise of such studies and the distinct genetic
features of the region [50].
The wide availability of DNA samples of ethnic populations from Pakistan through the Foundation Jean Dausset’s HGDP-Centre d’Etude du
Polymorphisme Humain (CEPH) Human Genome
Diversity Cell Line Panel has permitted extensive
analyses of STRs, SNPs and CNVs in these populations [2, 54–57]. Although some similar work has
subsequently been carried out on tribal and caste
populations from India and Indian expatriates settled
in the USA, the lack of readily available cell lines or
DNA from indigenous Indian populations has greatly
hindered large-scale studies of other parts of South
Asia [58, 59].
Using samples in the HGDP panel, Conrad et al.
demonstrated that HapMap populations capture
common haplotypes well in non-HapMap South
Asian populations and that HapMap data could be
used for imputing missing genotypes in these populations [60]. South Asians were expected to have
intermediate levels of linkage disequilibrium (LD)
between Europeans and East Asians, two populations
that were part of the first phase of the HapMap
Project. Overall, Indian and Pakistani populations
are tagged most effectively by European populations
but optimal HapMap mixtures increased ‘the fraction
of polymorphic non-tag SNPs in a target population
that are in LD with at least one tag SNP above
a specified cut off point’ [61].
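The quoted criterion, the fraction of polymorphic non-tag SNPs that are in LD with at least one tag SNP above a cut-off, can be sketched in Python as follows. This is a minimal illustration only: the use of r² as the LD measure, the 0.8 default cut-off, the function names and the toy phased haplotypes are all assumptions made for the example and are not taken from the cited studies.

```python
def r_squared(snp_a, snp_b):
    """Pairwise r^2 between two bi-allelic SNPs.

    Each SNP is a list of 0/1 alleles, one per phased chromosome,
    in the same chromosome order. Returns 0.0 if either is monomorphic.
    """
    n = len(snp_a)
    p_a = sum(snp_a) / n
    p_b = sum(snp_b) / n
    p_ab = sum(1 for a, b in zip(snp_a, snp_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # coefficient of linkage disequilibrium
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0
    return d * d / denom


def tagged_fraction(snps, tag_indices, cutoff=0.8):
    """Fraction of polymorphic non-tag SNPs with r^2 >= cutoff to any tag SNP.

    `cutoff` is an assumed threshold, not a value from the source.
    """
    tags = set(tag_indices)
    non_tags = [
        i for i, snp in enumerate(snps)
        if i not in tags and 0 < sum(snp) < len(snp)  # polymorphic only
    ]
    if not non_tags:
        return 0.0
    covered = sum(
        1 for i in non_tags
        if any(r_squared(snps[i], snps[t]) >= cutoff for t in tags)
    )
    return covered / len(non_tags)


if __name__ == "__main__":
    # Toy data: four SNPs typed on four phased chromosomes.
    toy_snps = [
        [0, 0, 1, 1],  # tag SNP
        [0, 0, 1, 1],  # perfectly correlated with the tag
        [0, 1, 0, 1],  # uncorrelated with the tag
        [0, 0, 0, 0],  # monomorphic, ignored
    ]
    print(tagged_fraction(toy_snps, tag_indices=[0]))
```

In a tag-transfer analysis of the kind described, the tag SNPs would be chosen in a reference panel (for example a HapMap population) and the fraction evaluated on haplotypes from the target population.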
Analyses of populations from Pakistan in the
HGDP Panel using the Illumina HumanHap 550K
and 650K bead chips and expatriate populations from
India, Pakistan and Sri Lanka using Affymetrix
GeneChip Mapping Array 500K gave broadly similar
results to an earlier study that had employed STR
variation in this panel [56, 57, 62]. The South Asians
were separated as a distinct cluster when regional
identity was inferred for six groups using 650 000
SNPs, a better resolution than was obtained through
analyses of autosomal STRs or the 550K chip in
these populations, which could not clearly distinguish between populations from South Asia,
Europe and West Asia. As expected, the Hazara
shared ancestry with East Asians and there was a
small East Asian contribution to the gene pool of
Burusho, Pathan and Sindhi populations that was
not apparent with use of STR datasets [2, 57].
Although the Kalash individuals reportedly harbored
more CNVs than average, this was not replicated
in the most recent survey of CNV in human populations using the Illumina 650Y arrays [56, 63].
Examination of variability in Indian expatriates
in the USA using 471 insertion/deletion polymorphisms and autosomal STRs revealed low levels of
genetic divergence but these samples, like the Indian
Gujarati population from Texas that is included in
the HapMap 3 sample collection, are hardly representative of the diversity of extant Indian populations
[59]. This problem is being addressed by the Indian
Genome Variation Consortium, which recently analyzed more than 400 SNPs in 1871 indigenous
samples representing numerous geographical, linguistic and religious groups from India [58].
They observed relatively low genetic differentiation
overall, with a mean Fst value of 0.03, although this
value is still higher than in Europe (0.01), but these
results could also reflect marker or sampling bias.
The maximum genetic differences were observed
among tribal populations speaking Indo-European
languages, and certain populations and isolated
ethnic groups clustered on the basis of ethnicity or language, suggesting that care should be taken in the
selection of cases and controls while designing
Genome Wide Association Studies (GWAS) in
these populations.
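To make the Fst values quoted above concrete, the sketch below computes Wright's Fst for a bi-allelic locus as (H_T - H_S) / H_T, where H_T is the expected heterozygosity of the pooled population and H_S the mean expected heterozygosity within subpopulations. It is a minimal illustration under stated assumptions: subpopulations are weighted equally, sample-size corrections are ignored, and the function names and example frequencies are hypothetical. The estimator actually used in the cited study is not specified here and may differ.

```python
def wright_fst(allele_freqs):
    """Wright's F_ST at one bi-allelic locus.

    `allele_freqs` holds the frequency of one allele in each subpopulation.
    Subpopulations are weighted equally (an assumption of this sketch).
    Returns None when the locus is monomorphic overall (H_T == 0).
    """
    if not allele_freqs:
        raise ValueError("at least one subpopulation is required")
    k = len(allele_freqs)
    p_bar = sum(allele_freqs) / k
    h_t = 2 * p_bar * (1 - p_bar)                      # pooled heterozygosity
    if h_t == 0:
        return None
    h_s = sum(2 * p * (1 - p) for p in allele_freqs) / k  # mean within-group
    return (h_t - h_s) / h_t


def mean_fst(loci):
    """Unweighted mean of per-locus F_ST, skipping monomorphic loci."""
    values = [f for f in (wright_fst(freqs) for freqs in loci) if f is not None]
    if not values:
        return None
    return sum(values) / len(values)


if __name__ == "__main__":
    # Hypothetical allele frequencies in two subpopulations at three SNPs.
    loci = [[0.40, 0.60], [0.50, 0.50], [0.10, 0.20]]
    print(mean_fst(loci))
```

Averaging per-locus values, as done here, is only one convention; many analyses instead take the ratio of summed numerators to summed denominators across loci, which gives less weight to rare variants.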
FUTURE
Following the comprehensive documentation of
genetic stratification within South Asia, a next logical
step would be association mapping to identify
genetic variants associated with complex multigenic diseases and with resistance to pathogens
such as the malarial parasite, hepatitis viruses, HIV, or
Mycobacterium tuberculosis and Mycobacterium leprae that
afflict a large section of South Asians. The increase in
non-infectious diseases such as cardiovascular disease,
type 2 diabetes mellitus and metabolic disorder in
this region also places an enormous burden on
these developing economies and their already
stretched health care systems.
The development of dense SNP chips such as
Illumina’s Human1M-Duo BeadChip and Affymetrix v6.0 will enable an even denser and more
uniform genomic coverage of both SNPs and CNVs
to be obtained. GWAS will benefit from the identification of haplotypes that better describe association
than single SNPs. The recent success of the Wellcome Trust Case Control Consortium (WTCCC)
genetic association studies in identifying SNPs associated with cardiovascular disease, diabetes and other
disorders that are being replicated by the National
Human Genome Research Institute (NHGRI) and
other research centres provides an ideal model for
conducting collaborative large scale disease association studies in South Asian populations with a
common set of controls [64, 65]. Besides raising
ethical concerns and needing strict quality control,
these studies are expensive, and require very large
sample sizes. The presence of large South Asian
diasporas in Europe and the USA should also facilitate
these collaborative studies. The Pakistan Risk of
Myocardial Infarction Study (PROMIS) being
carried out as part of an international collaboration
plans to analyze 20 000 myocardial infarction
patients and appropriately matched controls and is
one of the largest such studies on populations from
South Asia, providing one model for future studies
[66]. Like many studies from Pakistan, it includes
Mohajirs, or Urdu ethnicities, which encompass
diverse ethnic groups from India that migrated to
Pakistan after independence and whose only commonality is the language (Urdu) that they speak.
In such studies careful consideration must be given
to ethnicity and geographic origins to
ensure that unsuspected population structure does
not confound genetic analysis.
An increasing number of studies are also focusing
on the role that selection may have played in recent
human evolution and studying the constraints
imposed by local environment of pathogens, nutrients, toxins and climate on genetic variation.
A majority of genes that are under selection due to
pathogens are involved in host invasion (e.g. FY),
innate (CASP12, CD40, TLR) or adaptive (HLA,
interleukins) immunity [67, 68]. Others include
those that aid adaptation to climate, exposure to
toxins and dietary nutrients [68]. The genetic variant
responsible for lactose tolerance (LCT –13910 T
allele), which has been under positive selection in
response to dietary milk in European populations,
appears to have a recent origin in South Asia,
being frequent in pastoral groups and present at
low frequency in non-pastorals in Pakistan [69].
Analyses of variation in drug metabolizing
enzymes such as alcohol dehydrogenase and cytochrome P450 enzymes are of immediate clinical
significance.
The technological advances in DNA sequencing
will soon make it economically feasible to resequence entire genomes of chosen individuals [70].
The current 1000 Genomes Project lacks South
Asian samples, but the HapMap3 Gujarati in
Houston sample meets the necessary international
criteria for consent and availability and, in the
absence of more suitable indigenous samples, may
provide the first insights into South Asian diversity
from whole-genome resequencing. We hope that
indigenous South Asian samples will soon be available for resequencing as well.
The complexity of human genetic variation in
South Asians and its role in gene regulation, expression and influencing disease and non-disease phenotypes in diverse populations from this region has yet
to be fully unraveled. The challenge is to obtain
clinically significant results that can translate into
benefits for the general population leading eventually
to early diagnosis, prevention and therapeutic intervention. Only then can this quarter of humanity
truly benefit from the promise of the ‘genetic
revolution’.
WEB RESOURCES
(i) 1000 Genomes Project: http://www.1000genomes.org/page.php
(ii) Ethnologue: Languages of the World: http://www.ethnologue.com/web.asp
(iii) HapMap: http://www.hapmap.org/
(iv) Human Genome Diversity Project: http://www.stanford.edu/group/morrinst/hgdp
(v) Human Genome Diversity Cell Line Panel: http://www.cephb.fr/en/hgdp/diversity.php/
(vi) Languages and Genes of the Greater Himalayan Region: http://www.sanger.ac.uk/Teams/Team19/himalayas.shtml; http://www.le.ac.uk/genetics/maj4/Himalayan_OMLLreport.pdf
(vii) Pakistan Risk of Myocardial Infarction Study (PROMIS): http://www.phpc.cam.ac.uk/MEU/PROMIS/
(viii) The Indian Genome Variation Consortium: http://www.igvdb.res.in/
(ix) The Wellcome Trust Case-Control Consortium: http://www.wtccc.org.uk/
Key Points
South Asia is home to a quarter of humanity and harbors many diverse ethnicities, linguistic and religious groups.
We examine the influence of geographic, linguistic and religious factors on genetic variation in this region and discuss the importance of population stratification and its implications for genome-wide association surveys of South Asian populations.
Future prospects are discussed in light of developments in high throughput genotyping and next generation sequencing technologies.
Acknowledgements
The authors would like to thank Daniel MacArthur and the
reviewers for their helpful comments and Ambareen for the
illustration.
FUNDING
The Wellcome Trust.
References
1. Dobzhansky T. Genetic Diversity and Human Equality.
New York: Basic Books, 1973.
2. Rosenberg NA, Pritchard JK, Weber JL, et al. Genetic structure of human populations. Science 2002;298:2381–5.
3. Petraglia MD, Allchin B. Human evolution and culture
change in the Indian subcontinent. In: MD Petraglia,
B Allchin (eds). The Evolution and History of Human
Populations in South Asia: Inter-disciplinary Studies in
Archaeology, Biological Anthropology, Linguistics and Genetics.
Dordrecht: Springer, 2007:1–22.
4. Mellars PA. Going East: new genetic and archaeological
perspectives on the modern human colonization of
Eurasia. Science 2006;313:796–800.
5. Deraniyagala SU. The Prehistory of Sri Lanka: An Ecological
Perspective. Colombo: Department of Archaeological
Survey, Government of Sri Lanka, 1992.
6. Jarrige JF. Mehrgarh: its place in the development of ancient
cultures in Pakistan. In: M Jansen, M Mulloy, G Urban
(eds). Forgotten Cities on the Indus. Early Civilization in Pakistan
From the 8th to the 2nd Millennia BC. Mainz: Verlag Philipp
von Zabern, 1991.
7. Wolpert S. A New History of India. Oxford: Oxford
University Press, 1997.
8. Majumder PP. People of India: biological diversity and
affinities. Evol Anthropol 1998;6:100–10.
9. Gordon RG (ed). Ethnologue: Languages of the World,
15th edn. Dallas: Summer Institute of Linguistics
International, 2005.
10. Cavalli-Sforza LL, Menozzi P, Piazza A. The History and
Geography of Human Genes. Princeton: Princeton University
Press, 1994.
11. Underhill PA, Kivisild T. Use of Y chromosome and mitochondrial DNA: population structure in tracing human
migrations. Annu Rev Genet 2007;41:539–642.
12. Jobling MA, Tyler-Smith C. The human Y chromosome:
an evolutionary marker comes of age. Nat Rev Genet 2003;4:
598–612.
13. Bamshad M, Kivisild T, Watkins WS, et al. Genetic evidence on the origins of Indian caste populations. Genome
Res 2001;11:994–1004.
14. Basu A, Mukherjee N, Roy S, et al. Ethnic India: a genomic
view, with special reference to peopling and structure.
Genome Res 2003;13:2277–90.
15. Mansoor A, Mazhar K, Khaliq S, et al. Investigation of the
Greek ancestry of populations from northern Pakistan. Hum
Genet 2004;114:484–90.
16. Watkins WS, Thara R, Mowry BJ, et al. Genetic variation
in South Indian castes: evidence from Y-chromosome,
mitochondrial, and autosomal polymorphisms. BMC Genet
2008;9:86.
17. Qamar R, Ayub Q, Mohyuddin A, et al. Y chromosomal
DNA variation in Pakistan. Am J Hum Genet 2002;70:
1007–24.
18. Thangaraj K, Chaubey G, Kivisild T, et al. Maternal footprints of southeast Asians in North India. Hum Hered 2008;
66:1–9.
19. Karafet TM, Mendez FL, Meilerman MB, et al. New binary
polymorphisms reshape and increase resolution of the
human Y chromosomal haplogroup tree. Genome Res
2008;18:830–8.
20. Sengupta S, Zhivotovsky LA, King R, et al. Polarity and
temporality of high resolution Y-chromosome distributions
in India identify both indigenous and exogenous expansions
and reveal minor genetic influence of Central Asian pastoralists. Am J Hum Genet 2006;78:202–20.
21. Mohyuddin A, Ayub Q, Underhill PA, et al. Detection of
novel Y SNPs provides further insights into Y chromosomal
variation in Pakistan. J Hum Genet 2006;51:375–8.
22. Carvalho-Silva D, Zerjal T, Tyler-Smith C. Ancient Indian
roots? J Biosci 2006;31:1–2.
23. Carvalho-Silva D, Tyler-Smith C. The grandest genetic
experiment ever performed on man? A Y-chromosomal
perspective on genetic variation in India. Int J Hum Genet
2008;8:21–9.
24. Zerjal T, Pandya A, Thangaraj K, et al. Y-chromosomal
insights into the genetic impact of the caste system in
India. Hum Genet 2007;121:137–44.
25. Chaubey G, Karmin M, Metspalu E, et al. Phylogeography
of mtDNA haplogroup R7 in the Indian peninsula. BMC
Evol Biol 2008;8:227.
26. Quintana-Murci L, Chaix R, Wells SR, et al. Where West
meets East: the complex mtDNA landscape of the Southwest
and Central Asian corridor. Am J Hum Genet 2004;74:827–45.
27. Kivisild T, Rootsi S, Metspalu M, et al. The genetic heritage
of the earliest settlers persists both in Indian tribal and caste
populations. Am J Hum Genet 2003;72:313–32.
28. Thangaraj K, Chaubey G, Kivisild T, et al. Reconstructing
the origin of Andaman Islanders. Science 2005;13:996.
29. Quintana-Murci L, Krausz C, Zerjal T, et al. Y-chromosome
lineages trace diffusion of people and languages in southwestern Asia. Am J Hum Genet 2001;68:537–42.
30. Thanseem I, Thangaraj K, Chaubey G, et al. Genetic
affinities among the lower castes and tribal groups of
India: inference from Y chromosome and mitochondrial
DNA. BMC Genetics 2006;7:42.
31. Firasat S, Khaliq S, Mohyuddin A, et al. Y-chromosomal
evidence for a limited Greek contribution to the Pathan
population of Pakistan. Eur J Hum Genet 2007;15:121–6.
32. Kraaijenbrink T, van Driem GL, Opgenort JR, et al. Allele
frequency distribution for 21 autosomal STR loci in Nepal.
Forensic Sci Int 2007;168:227–31.
33. Kraaijenbrink T, van Driem GL, Tshering of Gaselô K,
et al. Allele frequency distribution for 21 autosomal STR
loci in Bhutan. Forensic Sci Int 2007;170:68–72.
34. Parkin EJ, Kraayenbrink T, van Driem GL, et al. 26-Locus
Y-STR typing in a Bhutanese population sample. Forensic
Sci Int 2006;161:1–7.
35. Zerjal T, Xue Y, Bertorelle G, et al. The genetic legacy of
the Mongols. Am J Hum Genet 2003;72:717–21.
36. Gutala R, Carvalho-Silva DR, Jin L, et al. A shared
Y-chromosomal heritage between Muslims and Hindus in
India. Hum Genet 2006;120:543–51.
37. Terreros MC, Rowold D, Luis JR, et al. North Indian
Muslims: enclaves of foreign DNA or Hindu converts?
Am J Physical Anthropol 2007;133:1004–12.
38. Ayub Q, Mansoor A, Ismail M, et al. Reconstruction of
human evolutionary tree using polymorphic autosomal
microsatellites. Am J Physical Anthropol 2003;122:259–68.
39. Balter M. Search for the Indo-Europeans. Science 2004;303:
1323–6.
40. Diamond J, Bellwood P. Farmers and their languages: the
first expansions. Science 2003;300:597–603.
41. Thomson R, Pritchard JK, Shen P, et al. Recent common
ancestry of human Y chromosomes: evidence from DNA
sequence data. Proc Natl Acad Sci USA 2000;97:7360–5.
42. Hagen H-E, Carlstedt-Duke J. Building global networks for
human diseases: genes and populations. Nat Med 2004;10:
665–7.
43. Bittles AH. Endogamy, consanguinity and community
genetics. J Genet 2002;81:91–8.
44. Khaliq S. Consanguineous Marriages and the Genetics of
Eye Disorders in Pakistani Families. In: OR Ioseliani (ed).
New Developments in Eye Research. New York: Nova Science
Publishers Inc., 2006:241–67.
45. Ahmad W, Faiyaz ul Haque M, Brancolini V, et al. Alopecia
universalis associated with a mutation in the human hairless
gene. Science 1998;279:720–4.
46. Ahmed ZM, Riazuddin S, Riazuddin S, et al. The molecular
genetics of Usher syndrome. Clin Genet 2003;63:431–44.
47. Kumaramanickavel G, Joseph B, Vidhya A, et al.
Consanguinity and ocular genetic diseases in South India:
analysis of a five-year study. Community Genet 2002;5:182–5.
48. Dhavendra K. (ed). Genetic Disorders of the Indian Subcontinent.
Dordrecht: Kluwer Academic Publishers, 2004.
49. Abid A, Ismail M, Mehdi SQ, et al. Identification of novel
mutations in SEMA4A gene associated with retinal degenerative diseases. J Med Genet 2006;43:378–81.
50. Dhandapany PS, Sadayappan S, Xue Y, et al. A common
MYBPC3 (cardiac myosin binding protein C) variant associated with cardiomyopathies in South Asia. Nat Genet
2009;41:187–91.
51. Mohyuddin A, Ayub Q, Khaliq S, et al. HLA polymorphism
in six ethnic groups from Pakistan. Tissue Antigens 2002;59:
492–501.
52. Raja A. Immunology of tuberculosis. Indian J Med Res 2004;
120:213–32.
53. Rehman S, Akhtar N, Ahmad W, et al. Human leukocyte
antigen (HLA) class II association with rheumatic heart
disease in Pakistan. J Heart Valve Disease 2007;16:300–4.
54. Cann HM, de Toma C, Cazes L, et al. A human genome
diversity cell line panel. Science 2002;296:261–2.
55. Rosenberg NA. Standardized subsets of the HGDP-CEPH
Human Genome Diversity Cell Line Panel, accounting for
atypical and duplicated samples and pairs of close relatives.
Ann Hum Genet 2006;70:841–7.
56. Jakobsson M, Scholz SW, Scheet P, et al. Genotype, haplotype and copy-number variation in worldwide human
populations. Nature 2008;451:998–1003.
57. Li JZ, Absher DM, Tang H, et al. Worldwide human relationships inferred from genome-wide patterns of variation.
Science 2008;319:1100–4.
58. Indian Genome Variation Consortium. Genetic landscape
of the people of India: a canvas for disease gene exploration.
J Genet 2008;87:3–20.
59. Rosenberg NA, Mahajan S, Gonzalez-Quevedo C, et al.
Low levels of genetic divergence across geographically and
linguistically diverse populations from India. PLoS Genet
2006;2:e215.
60. Conrad DF, Jakobsson M, Coop G, et al. A worldwide
survey of haplotype variation and linkage disequilibrium
in the human genome. Nat Genet 2006;38:1251–60.
61. Pemberton TJ, Jakobsson M, Conrad DF, et al. Using population mixtures to optimize the utility of genomic databases:
linkage disequilibrium and association study design in India.
Ann Hum Genet 2008;72:535–46.
62. Auton A, Bryc K, Boyko A, et al. Global distribution
of genomic diversity underscores rich complex history
of continental human populations. Genome Res 2009;19:
795–803.
63. Itsara A, Cooper GM, Baker C, et al. Population analysis of
large copy number variants and hotspots of human genetic
disease. Am J Hum Genet 2009;84:148–61.
64. The Wellcome Trust Case Control Consortium. Genomewide association study of 14,000 cases of seven common
diseases and 3,000 shared controls. Nature 2007;447:661–78.
65. Manolio TA. Collaborative genome-wide association studies of diverse diseases: programs of the NHGRI’s office of
population genomics. Pharmacogenomics 2009;10:235–41.
66. Saleheen D, Zaidi M, Rasheed A, et al. The Pakistan
risk of myocardial infarction study: a resource for the
study of genetic, lifestyle and other determinants of myocardial infarction in South Asia. Eur J Epidemiol 2009;24:
329–38.
67. Cooke GS, Hill AV. Genetics of susceptibility to human
infectious disease. Nat Rev Genet 2001;2:967–77.
68. Hancock AM, Di Rienzo A. Detecting the genetic signature of natural selection in human populations: models,
methods, and data. Annu Rev Anthropol 2008;37:197–217.
69. Enattah NS, Jensen TG, Nielsen M, et al. Independent
introduction of two lactase-persistence alleles into human
populations reflects different history of adaptation to milk
culture. Am J Hum Genet 2008;82:57–72.
70. Mardis ER. Next-generation DNA sequencing methods.
Annu Rev Genomics Hum Genet 2008;9:387–402.