BRIEFINGS IN FUNCTIONAL GENOMICS AND PROTEOMICS. VOL 8. NO 5. 395–404
doi:10.1093/bfgp/elp015

Genetic variation in South Asia: assessing the influences of geography, language and ethnicity for understanding history and disease risk

Qasim Ayub and Chris Tyler-Smith

Advance Access publication date 17 June 2009

Abstract

South Asia is home to more than 1.5 billion people, almost one-quarter of humanity, representing many diverse ethnic, linguistic and religious groups. Modern humans arrived here soon after their departure from Africa 50 000–70 000 years before present (YBP), and several subsequent human migrations and invasions, as well as the unique social structure of the region, have helped shape the pattern of genetic diversity currently observed in these populations. Over the last few decades population geneticists and molecular anthropologists have analyzed DNA variation in indigenous populations from this region in order to catalog their genetic relationships and histories. The emphasis is gradually shifting from the study of population origins to high resolution surveys of DNA variation to address issues of population stratification and genetic susceptibility or resistance to diseases in genome-wide association surveys. We present a historical overview of the genetic studies carried out on populations from this region in order to understand the influence of geographic, linguistic and religious factors on population diversity, and discuss future prospects in light of developments in high throughput genotyping and next generation sequencing technologies.

Keywords: South Asia; Pakistan; India; population genetics; genetic variation; HGDP-CEPH

INTRODUCTION

South Asia encompasses the Indo–Pak sub-continent and includes India, Pakistan, Bangladesh, the Kingdoms of Nepal and Bhutan and the islands of Sri Lanka, the Maldives and the Chagos Archipelago.
The land mass is bordered by Iran and Afghanistan on the west, China in the north, and Myanmar on its eastern fringes. The Indian Ocean borders its entire southern coastline. Its population of around 1.5 billion individuals constitutes approximately one-fourth of humanity. Over the centuries this area has witnessed many invasions and migrations, mainly from the West. Genetically, it contains the site of what was once called ‘the grandest genetic experiment ever performed on Man’ and, in several surveys of worldwide diversity, one of the most outstanding populations making up a sixth grouping of humanity comparable in these analyses to major continental groupings [1, 2]. In this review, we will consider how our understanding of the historical, cultural and environmental factors that have shaped its inhabitants has developed, something of what we have learned, and prospects for future developments.

This now-densely occupied land was encountered by the first populations of modern humans that ventured out of Africa more than 50 000 years ago. It is suggested that they arrived in this part of Asia via a southern coastal route and continued to Southeast Asia. The fossil and archaeological evidence for human settlements in this part of the world prior to 9000 YBP is, unfortunately, sparse, although settlements dating much further back are now beginning to emerge, including from Patne in western India and Batadomba-lena in Sri Lanka, dated to between 30 000 and 34 000 YBP [3–5]. There is evidence for indigenous animal and plant domestication in many places in the region, the earliest being found at a Neolithic site in Mehrgarh, in Southwest Pakistan [6]. Subsequent epochal events included the development and decline of the Indus Valley civilization’s Harappan culture, the arrival of Indo-European speakers from central or west Asia, linked to the introduction of the caste system in India and the possible displacement of Dravidian speakers to their current location in south central India. More recent historical events have included Alexander’s invasion in 327 BC, the Arab invasion and subsequent Muslim conquest and rule of India, followed by the British Raj that lasted until the partition in 1947 [7].

Corresponding author. Qasim Ayub, The Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK. Tel: +44-1223-834244, Fax: +44-1223-494919, E-mail: [email protected]
Qasim Ayub is the Human Evolution Team’s Staff Scientist at the Wellcome Trust Sanger Institute, Hinxton (http://www.sanger.ac.uk/Teams/Team19/).
Chris Tyler-Smith heads the Human Evolution Team at the Wellcome Trust Sanger Institute, Hinxton (http://www.sanger.ac.uk/Teams/Team19/). Their group investigates DNA variation in populations in order to understand recent human evolution.
© The Author 2009. Published by Oxford University Press. For permissions, please email: [email protected]

More than 4000 well-defined population groups, including 500 tribal and approximately 30 hunter-gatherer groups, reside in this region [8]. The overwhelming majority are endogamous. The endogamous Hindu caste system, and the presence of large consanguineous families in the region, especially among the Muslim populations, provide unique resources for unraveling the genetic basis of disease. As elsewhere, demographic events such as genetic bottlenecks, population expansions and admixture have shaped the genetic diversity that is observed in this region today. Linguistically the region comprises major groups of Indo-European, Dravidian, Tibeto-Burman and Austro-Asiatic speakers, minor groups like the Andamanese, and even language-isolate groups like the Hunza Burusho in northern Pakistan and Nihali in Madhya Pradesh, India (Figure 1). More than 70% of the population speak Indo-European languages.
Tibeto-Burman speakers are present in the north and northeast and form a majority in Bhutan. Dravidian languages are prevalent in south India, Sri Lanka and, intriguingly, in Balochistan province of Pakistan. Austro-Asiatic languages are spoken by many Indians in the south and close to the border with Myanmar [9].

The region is home to followers of many religions, the major ones being Islam, Hinduism, Buddhism and Sikhism. The population also includes sizeable Christian, Jewish and Zoroastrian minorities. All have contributed to the genetic and cultural diversity found here.

During the past few decades molecular geneticists and anthropologists have analyzed DNA variation among human populations in order to catalog their genetic relationships and to glean information about recent human evolution. Several such studies have been carried out in populations from South Asia, especially Pakistan, which is well-represented in the Human Genome Diversity Project (HGDP), and more recently India, in order to study their origins and their susceptibility, or resistance, to disease. We first present an overview of the genetic studies carried out on populations from this region in order to understand the influence of geographic, linguistic and religious patterns on population diversity, and go on to discuss future prospects for analyses of genetic variation here.

PAST

Throughout the 1970s and 1980s classical serological markers such as human blood groups, leukocyte antigens (HLA), glucose-6-phosphate dehydrogenase and other isozymes were used to study a limited number of populations from South Asia [10].
The advent of polymorphic DNA markers such as restriction fragment length polymorphisms (RFLPs), variable number of tandem repeats (VNTRs), short tandem repeats (STRs) or microsatellites, and single-nucleotide polymorphisms (SNPs), and the technological breakthroughs that made it relatively easy to genotype large numbers of markers in thousands of samples, permitted the extensive analyses of genetic variation in humans. Initially the laborious techniques of RFLP and hybridization were applied to analyze a few hundred samples, but the development of polymerase chain reaction (PCR) amplification made possible the extensive analyses of DNA variation in thousands of samples.

Several initial studies analyzed mitochondrial DNA and the Y chromosome in selected populations from Pakistan and India in order to obtain glimpses into their female and male ancestry, respectively. Sequence differences among individual mitochondrial DNAs separate them into haplotypes and haplogroups that provide a snapshot of population origins from the female perspective [11]. Similarly, the male-specific part of the Y chromosome is passed down from father to son without change, except for the gradual accumulation of mutations which appear as DNA polymorphisms and provide a male perspective on human evolution. Human Y chromosomes are delineated into distinct haplogroups and haplotypes, defined by unique-event or bi-allelic polymorphisms alone or by their combination with Y-STRs, respectively [12]. A number of other studies analyzed autosomal STRs and Alu insertion polymorphisms to gauge an overall view of autosomal genetic diversity in the region and address the issue of population origins in light of historical context [13–16].

Figure 1: Linguistic boundaries in South Asia.
Migrants

Populations from Pakistan and north India share a sizable proportion of variation with European populations, and Central and West Asians have been major contributors to the gene pool of this region, consistent with the arrival of migrants from Northwest Asia [13, 14, 17]. In Pakistan the Karakoram Mountain ranges that are part of the Himalayan Mountains in the north have been a major barrier to gene flow from China, although this does not appear to be the case in India, where Sino-Tibetan populations entered through the north-east corridor and Nepal [18].

Most Y-chromosomal lineages found in both tribal and non-tribal populations from South Asia date back to the Paleolithic period. Common haplogroups (markers) include C5 (M356), F (M89), H1 (M52), H2 (Apt), L1 (M27; M76), R1a1 (M17) and R2 (M124) [19, 20]. All except for R1a1, which constitutes 55% of Y chromosomes in Pakistan, have long coalescent times and appear to be indigenous. Y haplogroup J2 (M172), which is prevalent in Pakistan and north and west India, represents a West Asian contribution to the genetic diversity of the sub-continent (although the migrants would of course have carried other lineages as well) [21]. O2 (P31), O3 (M122) and Q (M242) lineages in India and Pakistan are probably representative of migrants from East and Southeast Asia [21, 22]. Generally haplotype variation between populations is higher, especially between tribes [23, 24].

More than 60% of mitochondrial lineages in this region are represented by haplogroup M, and mitochondrial variation, especially in haplogroup M, is largely local in origin with estimated coalescence times in the Paleolithic [18]. Approximately 30% of Indian mtDNA haplotypes belong to West Eurasian haplogroups. Haplogroup R7 and U2 lineages are specific to India and the variation in their distribution also suggests an ancient origin [25, 26].
It has been argued that the presence of Eurasian and Australasian mitochondrial haplogroups M, N and R and Y haplogroups C (M130), D (M174) and F in South Asia supports the migration of humans out of East Africa via a southern coastal route through the sub-continent [27]. The Andaman and Nicobar Islanders, who comprise six tribal populations and live on the southern coastal route to Oceania, show a high level of population differentiation, with unique mitochondrial sequences in the Andamanese probably arising by genetic drift in an isolated population. However, the Nicobarese share genetic relatedness with Southeast Asians and are probably more recent migrants in comparison with their neighbors [28]. The presence of haplogroups B5a and F1 near the border between India and Nepal suggests movement across the Himalaya Mountains, and this possibility is also supported by the detection of the Indian haplogroup R6 in the Nepalese [18].

The Indian caste system represents a unique socio-economic hierarchy associated with the Hindu religion that broadly distinguishes upper (Brahmins/priests), middle (Kshatriyas/warriors; Vaishyas/farmers and traders; Sudras/laborers) and lower (Panchama/untouchables) or scheduled castes. It is associated with the spread of Indo-European languages after 3500 YBP, but the extent of gene flow from outside remains a matter of debate. Early studies were interpreted to suggest extensive gene flow from West and Central Asia, but recent data, obtained using a larger number of markers, have been used to argue that most DNA variation was indigenous in origin [20, 29]. Male substructure has been observed in the higher (priest and warrior) classes from Jaunpur District in central India, and male gene flow between the castes was low (<1% per generation) there, as expected from the social rules [24].
Some studies show a genetic affinity between the lower castes and tribal populations, and the frequency of West Eurasian haplotypes is proportional to caste rank, being highest in the upper castes [13, 30]. Upper caste Indians share 10–20% of variation with European populations [13, 14].

Several populations stood out as being genetically distinct in the initial studies based upon Y chromosomal and autosomal STRs. In the Kalash population from Pakistan, referred to at the beginning of this review as a sixth grouping of humanity, this was most likely due to drift in a geographically and religiously isolated group that has undergone a population bottleneck during its recent migration to its present-day settlements in the Hindu Kush Mountain valleys in northern Pakistan, from where the individuals were sampled. A larger survey that includes populations from their ancestral homeland in Nuristan, Afghanistan, would provide more insights about their unique genetic structure. They have a major West Eurasian mitochondrial component along with certain population-specific Y lineages (L3a; PK3) with little Y-STR variation and no genetic affiliation with East Asian populations [21, 26, 31].

The origins of the Parsi are well-documented and, although only a few thousand now live in Pakistan, there are many in India. They migrated to Gujarat in India after the collapse of the Sassanian empire, and their name relates to their geographic origin, meaning ‘from Iran’. Their Y chromosomes are closer to populations from present-day Iran but 60% of their maternal gene pool belongs to South Asian haplogroups not found in Iran, highlighting their affinities with local Gujarati women [17, 26].

A number of East African slaves were brought to this region as involuntary migrants and among such groups are the Makranis, who have physical features typical of African populations.
The contribution of sub-Saharan African Y chromosomes to this population was estimated to be 12% and, combined with the presence of sub-Saharan mitochondrial haplogroups, represents the genetic legacy of the East African slave trade that existed in this region [17, 26].

Preliminary data on the Nepalese and Bhutanese populations using autosomal and Y-STRs show significant differences in comparison with their geographic neighbors, consistent with genetic drift in small geographically isolated populations, the differentiation being more striking in the Bhutanese [32–34]. More detailed genetic analyses of the samples collected under the ‘Languages and Genes in the Greater Himalaya Region’ Project should provide a clearer picture.

Invaders

A genealogical approach based on the hierarchical use of Y chromosomal markers in ethnic groups from this region has provided some interesting insights into historical events. In particular, analyses of Y lineages in the Hazara population from Pakistan established the presence of a ‘star haplotype’ that could be directly linked to Genghis Khan or his male ancestors and that was spread by his sons throughout Eurasia [35]. An examination of their mitochondrial DNA suggests that they were accompanied by women of East Asian ancestry, and their autosomal DNA also shows genetic relatedness with East Asians [2, 26].

Although several populations from northern Pakistan claimed that they were the descendants of Greek soldiers left behind in this region by Alexander the Great, this was not generally borne out by genetic analyses. Only a small proportion of haplogroup E1b1b1a (M78) Y chromosomes in the Pathan population of Pakistan provided strong evidence of a small Greek contribution [31]. Similarly, the Muslim invasions do not appear to have left a readily detectable genetic imprint in India.
Islam seems to have spread by conversions and cultural diffusion: overall the Muslims in India are genetically closer to their non-Muslim geographical neighbors than to other Muslim populations and no correlation exists between genetic variation and religious beliefs [36]. This is also true historically. Muslims constitute more than 400 million of the population of this region and since their advent in the seventh century they have included many ethnicities from Arabia, Turkey, Persia and Central Asia. Some studies [37] have attempted to analyze these Muslim populations along the sectarian divide between the two major Muslim sects (Shias and Sunnis), but genetic analyses are not appropriate for the study of recent political and religious events, and any differences that are observed are likely to relate to the ethnic or geographic origins of the source populations.

Linguistics

Indo-European, Dravidian, Tibeto-Burman and Austro-Asiatic languages are spoken in this region and overall genetic relationships, as ascertained by STRs, SNPs and other genetic markers, are dictated primarily by geographic proximity rather than linguistic origin [9, 38]. Indo-European languages form the predominant language group and the genetic relatedness between the European and South Asian populations indicates that they may well have shared a common language superfamily, Dene-Caucasian. The Indo-European family may have spread to South Asia 6000–10 000 YBP, replacing the languages spoken earlier almost everywhere [39, 40]. The Y haplogroup R1 (M173) is often referred to as an Indo-European marker and its associated haplogroup R1a1 is present at high frequency in many regions where Indo-European speakers live. The worldwide distribution of this haplogroup indicates frequency peaks in Eastern Europe and West and South Asia, which fits in with historical records of nomadic settlements in Europe and India.
However, its presence in 15% of Dravidian speakers in India argues against a simple correlation. Although several new SNPs that refine this branch of the tree have recently been identified, they are restricted to a few individuals in the same population or show genetic exchange across the Gulf of Oman [21, Underhill et al., unpublished data].

An elite dominance model of the Indo-European speakers partly explains the genetic similarities observed between the Dravidian and Indo-European groups and the seclusion of Dravidians in southern India and parts of Sri Lanka, but it does not explain the enigma of the Brahui. This Dravidian-speaking population resides in the Balochistan province in south western Pakistan and is surrounded on all sides by Indo-European speakers. Quintana-Murci and co-workers argued that they arrived in South Asia from southwestern Iran with the expansion of Dravidian-speaking farmers, although this was contested by later studies using extensive Y markers [20, 29].

The Austro-Asiatic languages may be the most ancient in the region, and preliminary analyses of mitochondrial DNA and a limited number of Y-chromosomal and autosomal markers suggested that the Austro-Asiatic tribes are descendants of remnant earlier settlers and share genetic affinity with Tibeto-Burman populations [14]. Subsequent analyses using a higher resolution of Y haplotypes in a larger sample of the three major Austro-Asiatic groups of India (Mundari, Khasi-Khmuic and Mon-Khmer) demonstrated a strong paternal genetic link amongst these populations and those from Southeast Asia. The authors estimated that the protohaplogroup O2a originated in the Indian Austro-Asiatic populations 65 000 years ago and entered Southeast Asia via the Northeast Indian corridor. Mitochondrial DNA varied between Austro-Asiatic groups from Southeast Asia and India, reflecting a difference in the history of the sexes [25].
Since some estimates of the MRCA of all extant Y chromosomes are as recent as 59 000 YBP, such time estimates should be treated with caution [41].

Several Tibeto-Burman groups reside in north and north east India, and in Pakistan the Balti population from the Karakoram Mountains also speaks a Tibeto-Burman language. Y-chromosomal analyses of only a limited number of these Balti speakers have been carried out, and they are not noticeably different from their neighbors in northern Pakistan [17].

The Hunza Burusho were of particular genetic interest because their language, Burushaski, is one of the few remaining language isolates in the world [9]. However, they are genetically close to their geographic neighbors in Pakistan [38]. Their isolation in the Karakoram Mountains habitat may have preserved their language, but any differences between them and their neighbors, who have acquired new languages, have been greatly diluted by genetic exchange. Analyses of the language isolate Nihali speakers in India will show whether they exhibit similar genetic affinity with their geographic neighbors.

Disease Genetics

Families from South Asia have been invaluable in understanding the genetic basis of several Mendelian disorders. The high rate of consanguineous marriages, large family sizes and contemporary inbreeding among tribes, clans and ethnicities make them suitable for linkage analyses [42, 43]. The genetic basis of several single gene disorders leading to syndromic and non-syndromic blindness, deafness, thalassemias, skeletal, hair, skin and nail disorders has been unraveled in families and populations from this region [44–49]. Several of these mutations are family- or population-specific [44, 50]. The immediate benefits of these analyses include genetic testing to exclude such disease in the unborn child. However, the real challenge is to translate these findings into public health education programs that will benefit these families and communities.
DNA-based genotyping has determined HLA frequencies in ethnic groups from this region at higher allelic resolution than serological methods achieve, and studies of their association with diseases like malaria, tuberculosis, leprosy and rheumatic heart fever have identified several high risk alleles [51–53]. Genetic association studies using a single genetic variant, or a few, have also been carried out, but their associations have not been replicated across populations, possibly due to ascertainment bias, choice of markers, insufficient statistical power, population stratification, or differences in linkage-disequilibrium patterns in patients and controls.

PRESENT

As a consequence of advances in genotyping technologies, the focus of current studies is shifting from investigating population origins using small numbers of loci to analyzing large numbers of SNPs, structural and copy number variants (CNV) to address issues of population sub-structure and group membership that will have practical applications in the design of disease association studies, in rationalizing use of medicines tailored to an individual’s genetic make up and DNA-based forensic analyses. The discovery of a single high risk variant in the cardiac myosin binding protein C gene (MYBPC3) that is restricted to 4% of South Asians and predisposes strongly to heart failure in later life is testament to both the promise of such studies and the distinct genetic features of the region [50].

The wide availability of DNA samples of ethnic populations from Pakistan through the Foundation Jean Dausset’s HGDP-Centre d’Etude du Polymorphisme Humain (CEPH) Human Genome Diversity Cell Line Panel has permitted extensive analyses of STRs, SNPs and CNVs in these populations [2, 54–57].
Although some similar work has subsequently been carried out on tribal and caste populations from India and Indian expatriates settled in the USA, the lack of readily-available cell lines or DNA from indigenous Indian populations has greatly hindered large-scale studies of other parts of South Asia [58, 59].

Using samples in the HGDP panel, Conrad et al. demonstrated that HapMap populations capture common haplotypes well in non-HapMap South Asian populations and that HapMap data could be used for imputing missing genotypes in these populations [60]. South Asians were expected to have levels of linkage disequilibrium (LD) intermediate between those of Europeans and East Asians, two populations that were part of the first phase of the HapMap Project. Overall, Indian and Pakistani populations are tagged most effectively by European populations, but optimal HapMap mixtures increased ‘the fraction of polymorphic non-tag SNPs in a target population that are in LD with at least one tag SNP above a specified cut off point’ [61].

Analyses of populations from Pakistan in the HGDP Panel using the Illumina HumanHap 550K and 650K bead chips and expatriate populations from India, Pakistan and Sri Lanka using Affymetrix GeneChip Mapping Array 500K gave broadly similar results to an earlier study that had employed STR variation in this panel [56, 57, 62]. The South Asians were separated as a distinct cluster when regional identity was inferred for six groups using 650 000 SNPs, a better resolution than was obtained through analyses of autosomal STRs or the 550K chip in these populations, which could not clearly distinguish between populations from South Asia, Europe and West Asia. As expected, the Hazara shared ancestry with East Asians and there was a small East Asian contribution to the gene pool of Burusho, Pathan and Sindhi populations that was not apparent with use of STR datasets [2, 57].
Although the Kalash individuals reportedly harbored more CNVs than average, this was not replicated in the most recent survey of CNV in human populations using the Illumina 650Y arrays [56, 63].

Examination of variability in Indian expatriates in the USA using 471 insertion/deletion polymorphisms and autosomal STRs revealed low levels of genetic divergence, but these samples, like the Indian Gujarati population from Texas that is included in the HapMap 3 sample collection, are hardly representative of the diversity of extant Indian populations [59]. This problem is being addressed by the Indian Genome Variation Consortium, which recently analyzed more than 400 SNPs in an indigenous collection of 1871 samples representing numerous geographical, linguistic and religious groups from India [58]. They observed relatively low genetic differentiation overall, with a mean Fst value of 0.03, although this value is still higher than in Europe (0.01); these results could, however, also reflect marker or sampling bias. The maximum genetic differences were observed among tribal populations speaking Indo-European languages, and certain populations and isolated ethnic groups clustered on the basis of ethnicity or language, suggesting that care should be taken in the selection of cases and controls while designing Genome Wide Association Studies (GWAS) in these populations.

FUTURE

Following the comprehensive documentation of genetic stratification within South Asia, a next logical step would be association mapping to identify genetic variants associated with complex multigenic diseases and with resistance to microbial pathogens such as the malarial parasite, hepatitis viruses, HIV, or Mycobacterium tuberculosis and Mycobacterium leprae that afflict a large section of South Asians.
The increase in non-infectious diseases such as cardiovascular disease, type 2 diabetes mellitus and metabolic disorder in this region also places an enormous burden on these developing economies and their already stretched health care systems.

The development of dense SNP chips such as Illumina’s Human1M-Duo BeadChip and Affymetrix v6.0 will enable an even denser and more uniform genomic coverage of both SNPs and CNVs to be obtained. GWAS will benefit from the identification of haplotypes that better describe association than single SNPs. The recent success of the Wellcome Trust Case Control Consortium (WTCCC) genetic association studies in identifying SNPs associated with cardiovascular disease, diabetes and other disorders that are being replicated by the National Human Genome Research Institute (NHGRI) and other research centres provides an ideal model for conducting collaborative large scale disease association studies in South Asian populations with a common set of controls [64, 65]. Besides raising ethical concerns and needing strict quality control, these studies are expensive and require very large sample sizes. The presence of large South Asian diasporas in Europe and the USA should also facilitate these collaborative studies. The Pakistan Risk of Myocardial Infarction Study (PROMIS), being carried out as part of an international collaboration, plans to analyze 20 000 myocardial infarction patients and appropriately matched controls and is one of the largest such studies on populations from South Asia, providing one model for future studies [66]. Like many studies from Pakistan, it includes Mohajirs, or Urdu ethnicities, which encompass diverse ethnic groups from India that migrated to Pakistan after independence and whose only commonality is the language (Urdu) that they speak.
In such studies careful consideration must be given to ethnicity and geographic origins to ensure that unsuspected population structure does not confound genetic analysis.

An increasing number of studies are also focusing on the role that selection may have played in recent human evolution and studying the constraints imposed by the local environment of pathogens, nutrients, toxins and climate on genetic variation. A majority of genes that are under selection due to pathogens are involved in host invasion (e.g. FY), innate (CASP12, CD40, TLR) or adaptive (HLA, interleukins) immunity [67, 68]. Others include those that aid adaptation to climate, exposure to toxins and dietary nutrients [68]. The genetic variant responsible for lactose tolerance (LCT –13910 T allele), which has been under positive selection in response to dietary milk in European populations, appears to have a recent origin in South Asia, being frequent in pastoral groups and present at low frequency in non-pastorals in Pakistan [69]. Analyses of variation in drug metabolizing enzymes such as alcohol dehydrogenase and cytochrome P450 enzymes are of immediate clinical significance.

The technological advances in DNA sequencing will soon make it economically feasible to resequence entire genomes of chosen individuals [70]. The current 1000 Genomes Project lacks South Asian samples, but the HapMap3 Gujarati in Houston sample meets the necessary international criteria for consent and availability and, in the absence of more suitable indigenous samples, may provide the first insights into South Asian diversity from whole-genome resequencing. We hope that indigenous South Asian samples will soon be available for resequencing as well.

The complexity of human genetic variation in South Asians and its role in gene regulation, expression and influencing disease and non-disease phenotypes in diverse populations from this region has yet to be fully unraveled.
The challenge is to obtain clinically significant results that can translate into benefits for the general population, leading eventually to early diagnosis, prevention and therapeutic intervention. Only then can this quarter of humanity truly benefit from the promise of the ‘genetic revolution’.

WEB RESOURCES

(i) 1000 Genomes Project: http://www.1000genomes.org/page.php
(ii) Ethnologue: Languages of the World: http://www.ethnologue.com/web.asp
(iii) HapMap: http://www.hapmap.org/
(iv) Human Genome Diversity Project: http://www.stanford.edu/group/morrinst/hgdp
(v) Human Genome Diversity Cell Line Panel: http://www.cephb.fr/en/hgdp/diversity.php/
(vi) Languages and Genes of the Greater Himalayan Region: http://www.sanger.ac.uk/Teams/Team19/himalayas.shtml and http://www.le.ac.uk/genetics/maj4/Himalayan_OMLLreport.pdf
(vii) Pakistan Risk of Myocardial Infarction Study (PROMIS): http://www.phpc.cam.ac.uk/MEU/PROMIS/
(viii) The Indian Genome Variation Consortium: http://www.igvdb.res.in/
(ix) The Wellcome Trust Case-Control Consortium: http://www.wtccc.org.uk/

Key Points

South Asia is home to a quarter of humanity and harbors many diverse ethnicities, linguistic and religious groups.
We examine the influence of geographic, linguistic and religious factors on genetic variation in this region and discuss the importance of population stratification and its implication for genome-wide association surveys of South Asian populations.
Future prospects are discussed in light of developments in high throughput genotyping and next generation sequencing technologies.

Acknowledgements

The authors would like to thank Daniel MacArthur and the reviewers for their helpful comments and Ambareen for the illustration.

FUNDING

The Wellcome Trust.

References

1. Dobzhansky T. Genetic Diversity and Human Equality. New York: Basic Books, 1973.
2. Rosenberg NA, Pritchard JK, Weber JL, et al. Genetic structure of human populations. Science 2002;298:2381–5.
3. Petraglia MD, Allchin B. Human evolution and culture change in the Indian subcontinent. In: MD Petraglia, B Allchin (eds). The Evolution and History of Human Populations in South Asia: Inter-disciplinary Studies in Archaeology, Biological Anthropology, Linguistics and Genetics. Dordrecht: Springer, 2007:1–22.
4. Mellars PA. Going East: new genetic and archaeological perspectives on the modern human colonization of Eurasia. Science 2006;313:796–800.
5. Deraniyagala SU. The Prehistory of Sri Lanka: An Ecological Perspective. Colombo: Department of Archaeological Survey, Government of Sri Lanka, 1992.
6. Jarrige JF. Mehrgarh: its place in the development of ancient cultures in Pakistan. In: M Jansen, M Mulloy, G Urban (eds). Forgotten Cities on the Indus. Early Civilization in Pakistan From the 8th to the 2nd Millennia BC. Mainz: Verlag Philipp von Zabern, 1991.
7. Wolpert S. A New History of India. Oxford: Oxford University Press, 1997.
8. Majumder PP. People of India: biological diversity and affinities. Evol Anthropol 1998;6:100–10.
9. Gordon RG (ed). Ethnologue: Languages of the World, 15th edn. Dallas: Summer Institute of Linguistics International, 2005.
10. Cavalli-Sforza LL, Menozzi P, Piazza A. The History and Geography of Human Genes. Princeton: Princeton University Press, 1994.
11. Underhill PA, Kivisild T. Use of Y chromosome and mitochondrial DNA population structure in tracing human migrations. Annu Rev Genet 2007;41:539–642.
12. Jobling MA, Tyler-Smith C. The human Y chromosome: an evolutionary marker comes of age. Nat Rev Genet 2003;4:598–612.
13. Bamshad M, Kivisild T, Watkins WS, et al. Genetic evidence on the origins of Indian caste populations. Genome Res 2001;11:994–1004.
14. Basu A, Mukherjee N, Roy S, et al. Ethnic India: a genomic view, with special reference to peopling and structure. Genome Res 2003;13:2277–90.
15. Mansoor A, Mazhar K, Khaliq S, et al. Investigation of the Greek ancestry of populations from northern Pakistan. Hum Genet 2004;114:484–90.
16. Watkins WS, Thara R, Mowry BJ, et al. Genetic variation in South Indian castes: evidence from Y-chromosome, mitochondrial, and autosomal polymorphisms. BMC Genet 2008;9:86.
17. Qamar R, Ayub Q, Mohyuddin A, et al. Y chromosomal DNA variation in Pakistan. Am J Hum Genet 2002;70:1007–24.
18. Thangaraj K, Chaubey G, Kivisild T, et al. Maternal footprints of southeast Asians in North India. Hum Hered 2008;66:1–9.
19. Karafet TM, Mendez FL, Meilerman MB, et al. New binary polymorphisms reshape and increase resolution of the human Y chromosomal haplogroup tree. Genome Res 2008;18:830–8.
20. Sengupta S, Zhivotovsky LA, King R, et al. Polarity and temporality of high resolution Y-chromosome distributions in India identify both indigenous and exogenous expansions and reveal minor genetic influence of Central Asian pastoralists. Am J Hum Genet 2006;78:202–20.
21. Mohyuddin A, Ayub Q, Underhill PA, et al. Detection of novel Y SNPs provides further insights into Y chromosomal variation in Pakistan. J Hum Genet 2006;51:375–8.
22. Carvalho-Silva D, Zerjal T, Tyler-Smith C. Ancient Indian roots? J Biosci 2006;31:1–2.
23. Carvalho-Silva D, Tyler-Smith C. The grandest genetic experiment ever performed on man? A Y-chromosomal perspective on genetic variation in India. Int J Hum Genet 2008;8:21–9.
24. Zerjal T, Pandya A, Thangaraj K, et al. Y-chromosomal insights into the genetic impact of the caste system in India. Hum Genet 2007;121:137–44.
25. Chaubey G, Karmin M, Metspalu E, et al. Phylogeography of mtDNA haplogroup R7 in the Indian peninsula. BMC Evol Biol 2008;8:227.
26. Quintana-Murci L, Chaix R, Wells SR, et al. Where West meets East: the complex mtDNA landscape of the Southwest and Central Asian corridor. Am J Hum Genet 2004;74:827–45.
27. Kivisild T, Rootsi S, Metspalu M, et al. The genetic heritage of the earliest settlers persists both in Indian tribal and caste populations. Am J Hum Genet 2003;72:313–332.
28. Thangaraj K, Chaubey G, Kivisild T, et al. Reconstructing the origin of Andaman Islanders. Science 2005;13:996.
29. Quintana-Murci L, Krausz C, Zerjal T, et al. Y-chromosome lineages trace diffusion of people and languages in southwestern Asia. Am J Hum Genet 2001;68:537–42.
30. Thanseem I, Thangaraj K, Chaubey G, et al. Genetic affinities among the lower castes and tribal groups of India: inference from Y chromosome and mitochondrial DNA. BMC Genetics 2006;7:42.
31. Firasat S, Khaliq S, Mohyuddin A, et al. Y-chromosomal evidence for a limited Greek contribution to the Pathan population of Pakistan. Eur J Hum Genet 2007;15:121–6.
32. Kraaijenbrink T, van Driem GL, Opgenort JR, et al. Allele frequency distribution for 21 autosomal STR loci in Nepal. Forensic Sci Int 2007;168:227–31.
33. Kraaijenbrink T, van Driem GL, Tshering of Gaselô K, et al. Allele frequency distribution for 21 autosomal STR loci in Bhutan. Forensic Sci Int 2007;170:68–72.
34. Parkin EJ, Kraayenbrink T, van Driem GL, et al. 26-Locus Y-STR typing in a Bhutanese population sample. Forensic Sci Int 2006;161:1–7.
35. Zerjal T, Xue Y, Bertorelle G, et al. The genetic legacy of the Mongols. Am J Hum Genet 2003;72:717–21.
36. Gutala R, Carvalho-Silva DR, Jin L, et al. A shared Y-chromosomal heritage between Muslims and Hindus in India. Hum Genet 2006;120:543–51.
37. Terroros MC, Rowold D, Luis JR, et al. North Indian Muslims: enclaves of foreign DNA or Hindu converts? Am J Physical Anthropol 2007;133:1004–12.
38. Ayub Q, Mansoor A, Ismail M, et al. Reconstruction of human evolutionary tree using polymorphic autosomal microsatellites. Am J Physical Anthropol 2003;122:259–68.
39. Balter M. Search for the Indo-Europeans. Science 2004;303:1323–6.
40. Diamond J, Bellwood P. Farmers and their languages: the first expansions. Science 2003;300:597–603.
41. Thomson R, Pritchard JK, Shen P, et al. Recent common ancestry of human Y chromosomes: evidence from DNA sequence data. Proc Natl Acad Sci USA 2000;97:7360–5.
42. Hagen H-E, Carlstedt-Duke J. Building global networks for human diseases: genes and populations. Nat Med 2004;10:665–7.
43. Bittles AH. Endogamy, consanguinity and community genetics. J Genet 2002;81:91–8.
44. Khaliq S. Consanguineous Marriages and the Genetics of Eye Disorders in Pakistani Families. In: OR Ioseliani (ed). New Developments in Eye Research. New York: Nova Science Publishers Inc., 2006:241–67.
45. Ahmad W, Faiyaz ul Haque M, Brancolini V, et al. Alopecia universalis associated with a mutation in the human hairless gene. Science 1998;279:720–4.
46. Ahmed ZM, Riazuddin S, Riazuddin S, et al. The molecular genetics of Usher syndrome. Clin Genet 2003;63:431–44.
47. Kumaramanickavel G, Joseph B, Vidhya A, et al. Consanguinity and ocular genetic diseases in South India: analysis of a five-year study. Community Genet 2002;5:182–5.
48. Dhavendra K. (ed). Genetic Disorders of the Indian Subcontinent. Dordrecht: Kluwer Academic Publishers, 2004.
49. Abid A, Ismail M, Mehdi SQ, et al. Identification of novel mutations in SEMA4A gene associated with retinal degenerative diseases. J Med Genet 2006;43:378–81.
50. Dhandapany PS, Sadayappan S, Xue Y, et al. A common MYBPC3 (cardiac myosin binding protein C) variant associated with cardiomyopathies in South Asia. Nat Genet 2009;41:187–91.
51. Mohyuddin A, Ayub Q, Khaliq S, et al. HLA polymorphism in six ethnic groups from Pakistan. Tissue Antigens 2002;59:492–501.
52. Raja A. Immunology of tuberculosis. Indian J Med Res 2004;120:213–32.
53. Rehman S, Akhtar N, Ahmad W, et al. Human leukocyte antigen (HLA) class II association with rheumatic heart disease in Pakistan. J Heart Valve Disease 2007;16:300–4.
54. Cann HM, de Toma C, Cazes L, et al. A human genome diversity cell line panel. Science 2002;296:261–2.
55. Rosenberg NA. Standardized subsets of the HGDP-CEPH Human Genome Diversity Cell Line Panel, accounting for atypical and duplicated samples and pairs of close relatives. Ann Hum Genet 2006;70:841–7.
56. Jakobsson M, Scholz SW, Scheet P, et al. Genotype, haplotype and copy-number variation in worldwide human populations. Nature 2008;451:998–1003.
57. Li JZ, Absher DM, Tang H, et al. Worldwide human relationships inferred from genome-wide patterns of variation. Science 2008;319:1100–4.
58. Indian Genome Variation Consortium. Genetic landscape of the people of India: a canvas for disease gene exploration. J Genet 2008;87:3–20.
59. Rosenberg NA, Mahajan S, Gonzalez-Quevedo C, et al. Low levels of genetic divergence across geographically and linguistically diverse populations from India. PLoS Genet 2006;2:e215.
60. Conrad DF, Jakobsson M, Coop G, et al. A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nat Genet 2006;38:1251–60.
61. Pemberton TJ, Jakobsson M, Conrad DF, et al. Using population mixtures to optimize the utility of genomic databases: linkage disequilibrium and association study design in India. Ann Hum Genet 2008;72:535–46.
62. Auton A, Bryc K, Boyko A, et al. Global distribution of genomic diversity underscores rich complex history of continental human populations. Genome Res 2009;19:795–803.
63. Itsara A, Cooper GM, Baker C, et al. Population analysis of large copy number variants and hotspots of human genetic disease. Am J Hum Genet 2009;84:148–61.
64. The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007;447:661–78.
65. Manolio TA. Collaborative genome-wide association studies of diverse diseases: programs of the NHGRI’s office of population genomics. Pharmacogenomics 2009;10:235–41.
66. Saleheen D, Zaidi M, Rasheed A, et al. The Pakistan risk of myocardial infarction study: a resource for the study of genetic, lifestyle and other determinants of myocardial infarction in South Asia. Eur J Epidemiol 2009;24:329–338.
67. Cooke GS, Hill AV. Genetics of susceptibility to human infectious disease. Nat Rev Genet 2001;2:967–77.
68. Hancock AM, Di Rienzo A. Detecting the genetic signature of natural selection in human populations: models, methods, and data. Annu Rev Anthropol 2008;37:197–217.
69. Enattah NS, Jensen TG, Nielsen M, et al. Independent introduction of two lactase-persistence alleles into human populations reflects different history of adaptation to milk culture. Am J Hum Genet 2008;82:57–72.
70. Mardis ER. Next-generation DNA sequencing methods. Annu Rev Genomics Hum Genet 2008;9:387–402.