October 17 2015, Dr. S. O’Brien edition

Genome Russia Project
Research Proposal

Table of Contents

I VISION AND EXECUTIVE SUMMARY
II BACKGROUND AND RATIONALE
  1 Specific Aims
  2 Purpose and Goals
  3 Scientific Background
III STUDY DESIGN AND METHODOLOGY
  Populations and ethnic groups in Russia to target (Peoples of Russia, Atlas, 2010)
  Patient recruitment and blood collection
    IRB review and approvals
    Patient recruitment study design: requirements for the selection of study participants
    Patient Recruitment Expeditions
  DNA extraction and processing methodology
    DNA quality control
    Laboratory quality management
  4) Lymphoblastoid cell line (LCL) establishment and cell line biobank development
    Protocol for isolation and transformation of PBMCs (see Appendix 7)
    Abbreviations: LCL Transformation
  5) Whole Genome Sequence Assessment of Study Participants
  6) Computational resources, requirements, memory, capacity and security
    b) Genome Russia Server cluster
    c) Storage cluster
    d) Data access
    e) List of tools for bioinformatics analysis
    f) Genome Russia Website
    The Genome Russia Database
  7) Population whole genome sequence data analysis
    a) Read quality control and filtration
    b) Alignment and filtration
    c) Variant calling and genotype calling
      i. Strategy
      iii. Functional SNV (fSNV) detection
      iv. Validation of rare allele and disease gene mutations
    d) Genome mining for described human disease gene variant alleles
    e) Copy Number Variation (CNV) assessment for the Genome Russia Project
  8) Haplotype Map construction for the Ethnic Russian population
  9) Interpreting Russian history population exchanges using phylogeography
IV PERSONNEL AND COLLABORATORS OF GENOME RUSSIA
V WORKFLOW AND TIME TABLE OF GENOME RUSSIA
VI FORESEEABLE BENEFITS OF THE GENOME RUSSIA PROJECT TO SPSU, TO RUSSIA, AND THE WORLD
VII REFERENCES
I VISION AND EXECUTIVE SUMMARY

The genomics era has revolutionized the delivery of medical diagnoses across the world, and individual DNA assessment is today a critical component of personalized medicine. The completion of the first human genome sequence in 2003 and the subsequent rush of whole genome sequencing (estimated at close to 200,000 people by the end of 2015) offer the promise of genome-empowered medical diagnoses and treatments in the very near future. Derivative international projects such as the 1000 Genomes Project, the International HapMap Project and many others have begun the process of cataloguing and characterizing human gene diversity. The promise is to identify the determinants of hereditary diseases as well as of complex chronic and infectious diseases with a genetic underpinning (including cancers, neurological disorders, autoimmune disease, HIV/AIDS, Ebola and many others).

In the past few years, major population genome sequencing projects have got underway in several nations: UK, USA, Japan, Iceland, South Korea, Canada, Australia, Thailand, Kuwait, Qatar, Israel, Belgium, Luxembourg, and Estonia. In spite of occupying over 8% of the world’s landmass and having the ninth largest population in the world (estimated at 145,000,000 people), the Russian Federation has lagged behind in contributing to worldwide genomic database projects. Genome Russia would remedy this situation by producing a formal, genome sequence based description of genome variation across diverse population groups and ethnic minorities within the Russian Federation. As presented in detail in this report, Genome Russia will initially develop whole genome sequences from 2500 Russian volunteers and annotate common and rare DNA variants. We will offer these data to join Russia with the 1000 Genomes Consortium, the International HapMap and contributing partners.
The public release of these data shall become a national and world resource for genomic enquiry and varied cross-disciplinary applications in medical genomic research. The principal goals of Genome Russia include:
1) Documenting the sum and breadth of natural DNA variation that characterizes the population and influences any and all heritable genetic traits;
2) Assessing the function-altering genetic determinants that are already linked to hereditary pathologies, as well as new damaging variants that are yet to be described;
3) Constructing a haplotype map of ethnic Russians for use in disease gene discovery and for closing in on operative gene variation associated with complex heritable diseases and traits in populations;
4) Inspecting, with proven genetic methods, the relationships and natural history of indigenous Russian ethnicities.

All the data will be posted publicly on an open access website for interested scientists to inspect and use in the context of worldwide human population analyses. Initially, we will not collect clinical information, as has been proposed in several national studies; however, with growing support for the concept and fulfillment of Genome Russia, we will recommend such an expansion rather soon. Genome Russia will embrace the medical, human genetics and anthropology science communities across the Russian Federation. The project is timely, important and feasible with available technology and computational power. There are tangible benefits to completing this project, and we detail these in Section VI of this document. The Genome Russia Project can and should become an example of international collaboration on common ground, with the common goal of improving human health and promoting human betterment.

II BACKGROUND AND RATIONALE

1 Specific Aims

Genes are the basic “instruction book” for the cells that make up our bodies, and are made of DNA.
The DNA of a person is more than 99% the same as the DNA of any other unrelated person. But no two people have exactly the same DNA, except identical twins. Differences in DNA are called genetic variations. They explain some of the physical differences among people, and partly explain why some people get diseases like cancer, diabetes, asthma, and depression, while others do not. Such diseases may also be affected by factors like diet, exercise, smoking, and pollution in the environment, which makes it hard to determine which genes affect the diseases.

The objectives of the project “Genome Russia” are:
• to develop an open access web-based database containing anonymous information on the whole-genome sequences of at least 2,000 men and women originating from the different regions of Russia, whose ancestors have been indigenous to the region for several generations;
• to describe the genome variations in these groups;
• to detect the features that affect the spread of diseases;
• to create a database of medically relevant genomic variants characteristic of the Russian population, which would be the basis for developing the principles of future personalized medicine.

The data will be used for many purposes. However, the four immediate uses we anticipate are:
• Discover and catalogue new gene variants that are specific to Russian ethnic groups.
• Identify genetic variants that may affect the frequency of known diseases across the Russian peoples.
• Develop a Russian population based haplotype map (HapMap), required to identify disease gene markers specific to high incidence Russian diseases.
• Interpret the patterns in the variability of human DNA to decipher the historical migratory routes and settlement of humans in Russia, Europe and Asia.

The research database developed within the framework of the project will not include any personal information.
2 Purpose and Goals

The initial task is to gather blood samples from some 2500 Russian people, including several hundred family trios (DNA samples of a child and both parents). The project will create a national collection of genetic data that will engage researchers from other educational institutions and research organizations. Genome Russia will reach across Russian biomedical centers and join with the international 1000 Genomes Project, created to uncover rare gene variants in different human populations. DNA from the Russian volunteers will be subject to whole genome sequence assessment suitable for mining their genomes for secrets of their past and their future.

The objectives of the project are the description of variations in the human genome in different groups of the population of the Russian Federation, the identification of the features that affect the spread of diseases, and the creation of an information base of medically significant genomic variants specific to the population of Russia, which will be the basis for developing the principles of the medicine of the future. In other words, the knowledge of the genetic diversity of different population groups of the Russian Federation accumulated during this project will make it possible to identify, and to determine the frequency of, genetic determinants previously associated with complex diseases among the population of the Russian Federation. These estimates can also contribute to tracking historical migrations and human settlement on the territory of the Russian Federation. Similar projects have previously been conducted in other countries, and we now want to conduct a similar study in Russia.

More explicitly, the specific goals we propose to deliver with the fulfillment of Genome Russia include:
• A blood DNA biobank of human biospecimens from the major ethnic and regional population groups that live in Russia today.
• Whole genome sequence from ~2500 people of divergent ethnic background, and genotypes of 160 trios (two parents and one offspring), to catalogue DNA single nucleotide, indel, and copy number variation within ethnic Russian populations.
• Resolution of thousands of Russian-specific DNA variants not seen in other world populations. These will be deposited with the international 1000 Genomes Project as the first and only contribution from the Russian peoples.
• Explicit documentation of the pattern and distribution of common hereditary disease gene variants across the Russian peoples, plus a catalogue of function-altering genetic variants across 22,000 human genes in each individual.
• A Russian population-specific HapMap useful for disease gene association discoveries among Russian disease cohorts to be built in the future.
• Mapping of the footprints of ancient geographic movements of the ancestors of modern Russian peoples. This natural history analysis will relate modern Russian ethnic groups to each other, to their ancestors, and to the deep-seated archaic populations of early mankind, including the Denisovans and Neanderthals, that inhabited Russian lands.
• Innovative new bioinformatics analytical algorithms applicable to disease gene discoveries, including an open-access public database releasing sequences, genotypes and analytical discoveries.

3 Scientific Background

Mapping the unabridged pattern of human genetic variation across the world represents one of the greatest exploration projects undertaken since the genomics era began in 2001 with a published draft of the human genome. Driven by the availability of samples and the technological advancement of next generation sequencing techniques, whole genome sequencing has in the last decade scaled up from individual personal projects to global surveys of genomic diversity, best represented today by the 1000 Genomes Project (McVean et al., 2012; Auton et al., 2015).
The latest release of this project serves as the major global reference resource for human genetic variation, but it is not a complete genome map of humankind (Auton et al., 2015). The principal goal of this grand exploration was to uncover rare and local DNA variation in modern ethnic populations, in order to avoid the discovery bias in studies of human disease based on geographically limited datasets originating in the developed countries of North America and Europe (Figure 1, McVean et al., 2012). The common variants, made available through the efforts of dbSNP (http://www.ncbi.nlm.nih.gov/SNP), have been used to develop genotyping arrays for important applications such as the International HapMap, GWAS studies, and the Human Genome Diversity Project, HGDP (Auton et al., 2009).

In the three years since the first 1000 Genomes consortium paper on human diversity was published (McVean et al., 2012), attention has slowly shifted to national population genome projects, notably of the Icelandic and British populations (Gudbjartsson et al., 2015; Leslie et al., 2015), to help uncover the intricate natural history of these nations’ populations. National genome projects are underway with the 100,000 Genomes Project in the UK (Akst, 2012; Marx, 2015), an Asian Genome Project (Anderson, 2014), a Chinese Million Genomes endeavor (Heger, 2015), and an African Genome Sequence Variation project (Gurdasani et al., 2015), along with whole-genome sequence population studies in the Netherlands, Turkey, and Japan (Francioli et al., 2014; Alkan et al., 2014; Nagasaki et al., 2015), all meant to inform medical and natural history questions.

Figure 1 Distribution of publicly available genome sequences. Worldwide locations of population samples with whole genome data from the 1000 Genomes Project. Each circle represents the number of genome sequences publicly available at www.1000genomes.org.
ASIA: BEB Bengali in Bangladesh; CDX Chinese Dai in Xishuangbanna, China; CHB Han Chinese in Beijing, China; CHS Southern Han Chinese, China; GIH Gujarati Indian in Houston, TX; ITU Indian Telugu in the UK; JPT Japanese in Tokyo, Japan; KHV Kinh in Ho Chi Minh City, Vietnam; PJL Punjabi in Lahore, Pakistan; STU Sri Lankan Tamil in the UK.
AFRICA: ACB African Caribbean in Barbados; ASW African Ancestry in Southwest USA; ESN Esan in Nigeria; GWD Gambian in Western Division, The Gambia; LWK Luhya in Webuye, Kenya; MSL Mende in Sierra Leone; YRI Yoruba in Ibadan, Nigeria.
EUROPE: CEU Utah residents with Northern and Western European ancestry, USA; FIN Finnish in Finland; GBR British in England and Scotland; IBS Iberian in Spain; TSI Toscani in Italy.
THE AMERICAS: CLM Colombian in Medellin, Colombia; MXL Mexican Ancestry in Los Angeles, USA; PEL Peruvian in Lima, Peru; PUR Puerto Rican in Puerto Rico.
Each circle represents the number of sequences in the final release. The dotted circles indicate populations that were collected in diaspora.

The new population genomic data from targeted countries have given medical research a new roadmap and new power in disease variant discovery. In addition, disease genome collections like the International Cancer Genome Consortium (ICGC) are developing impressive global networks, bolstering collaboration and invigorating research progress in complex disease therapy. Looking at the world map with these dynamic developments in the genome sequencing of global populations, one cannot help but notice a great “wide gap” in Eurasia (Figure 1). From the Baltic Sea to the Bering Strait, Russia remains the largest swath of land (~10% of the earth’s land mass) for which the human genome landscape remains relatively unexplored.
Notably, even the larger population SNP array genotyping projects such as HGDP (~52 populations sampled worldwide) and the HapMap have scarce representation of ethnic groups in Russia (Figure 2; Auton et al., 2009, 2015). Further, the European and East Asian population groups in the 1000 Genomes Project do not capture the rich background of genomic diversity in this part of the world, partly because of differences in ancestry and partly because of the history of admixture (Figure 1). Recent population genetic studies of Russian indigenous populations have employed mtDNA, STR, Y-chromosome haplogroups and genome SNP variants in certain regional ethnic populations, but little has been achieved to date with more comprehensive whole genome sequencing of Russian people (Yunusbaev et al., 2011; Salmela et al., 2011; Khusnutdinova et al., 2012; Khrunin et al., 2013; Kharkov et al., 2013; Har’kov et al., 2014; Kushniarevich et al., 2015; Trofimova et al., 2015; Yunusbaev et al., 2015; Balanovskaya et al., 2001a, b; see Appendix 13).

Figure 2 Eastern Hemisphere locations of population samples in surveys of worldwide genetic variation (HapMap, 1000 Genomes Project, Phase 1, and HGDP).

The historic milestones that founded modern Russian populations include the northward and westward expansion of the Indo-Europeans and the Uralic people, the westward expansion of the Turkic people, and centuries of admixture between them (Figure 3). The routes for peopling Northern and Central Europe inevitably led through this territory; then the waves of the great human migrations of recorded history pushed this way for centuries, followed by the great exchange of knowledge and technology (and likely of genes) along the Silk Road. The migrations of the last millennia have created the complex patchwork of human diversity that is today’s Russia, and somewhere hidden in Siberia lies the ancestry of modern Native Americans.
Figure 3 Major human migration routes (adapted from Stewart and Chinnery, 2015) and locations of other hominid remains outside Africa. The approximate locations of major Neanderthal and Denisovan finds are indicated by glowing circles.

In the more distant past, gene exchange surely occurred between modern Homo sapiens and the prehistoric Neanderthal and Denisovan populations they encountered. The Neanderthal genetic contribution is not well studied beyond Western Europe, nor the Denisovan contribution beyond South East Asia, despite the physical remains of both being unearthed in Siberia (Prufer et al., 2014; Reich et al., 2010; Fu et al., 2014). Do Russian populations contain ancestry components that are undetected in the populations represented in the 1000 Genomes Project, or even in the comprehensive HGDP database? Perhaps. Hence, Russia needs a national genome project of its own. Genome Russia is a first step towards accomplishing these goals.

III STUDY DESIGN AND METHODOLOGY

Populations and ethnic groups in Russia to target (Peoples of Russia, Atlas, 2010).

The ethnic and racial diversity of the Russian population is an indisputable fact. According to the 2010 census, 195 ethnic groups were recorded, for most of whom Russia is the territory of residence. The diversity and ethnic admixing evolved over thousands of years as a consequence of multiple migrations, mixing and ethnic separation. The sources of these migrations were the Baltic region, the Balkans, Central Asia and the Far East. Many ethnic groups have been formed on the territory of modern Russia (Figure 3).
These processes were influenced by the diversity of landscape forms in the vast territory, which included the tundra, taiga, deciduous forests, forest-steppe, and also mountain and seaside interzonal landscapes (Trubachev, 2003; Yakupov, 2010; Zavyalov et al., 2012; Makarov, 2015; Simchenko and Tishkov, 1999; Allentoft et al., 2015; Kushniarevich et al., 2015; Molodin et al., 2014; Peoples of Russia, Atlas, 2010). In genetic terms, the carriers of mutations adaptive to a stable or changing environment would have spread as a consequence of migration and selection.

Most ethnic groups in Russia are not homogeneous panmictic populations. Ethnographic and historical reconstructions have suggested mixed original founder groups, and these suggestions are in need of confirmation. For example, the peoples who contributed to the ethnogenesis of the Bashkirs were Ugric people (today the Khanty and Mansi are the closest to the ancient Ugric ethnicity), various Turkic groups, and Mongols. The North Caucasian peoples were formed by the indigenous population along with Turkic and Iranian steppe nomads, as well as migrants from the territory of the Greater Caucasus. Perhaps most challenging is resolving the ethnic origins of the modern Russian people, to which contributed the Slavs (people from the Balkans and contemporary Polish Pomerania), the Letto-Lithuanian and Finno-Ugric peoples of the East European Plain, Turkic steppe peoples and possibly the Mongols. During the period of Russian colonization of the Volga, the Urals, Siberia and the Far East, contacts with the local population resulted in the origin of mixed (metis) groups, or of aboriginal groups admixed with the settlers.

The founder groups we mention were not homogeneous themselves. Further, historical data cannot resolve from which ethnic groups they derived, only that they were heterogeneous, as indicated in both archaeological and anthropological records.

The complex processes of forming modern population groups did not occur contemporaneously.
New waves of admixture could take place quite far apart in time, but sometimes these waves overlapped. Migrations alternated with periods of isolation; periods of mutation in the population alternated with periods of stabilization of the gene pool. Such historic perturbations imply that there are no so-called pure ethnic groups in genetic terms, and there cannot be. This statement raises a specific research problem. In the first stage, we should identify the most common alleles and thus reconstruct the overall scale of variation of the gene pool of all the peoples in Russia. In the second stage, we will identify the common and local rare variants private to separate ethnicities in Russia. These two tasks must be performed in parallel: each newly investigated group should both enhance our understanding of the total variability of genomes in Russia and characterize that particular group.

The research objectives of the ethno-genetics study would address the following important questions:
1. What drives the close genetic relationship of modern populations: territorial proximity, membership of one ethnic group, or other factors?
2. Does sympatric co-occurrence in the same locale drive the convergence of genomic variants regardless of ethnicity, or not?
3. Which ethnic groups are closely related to each other and which are very different?
4. How does the population genetic structure of the Russian people relate to that of other nations of Europe, Asia and America (indigenous Indian and Inuit peoples as well as European newcomers)?
5. Are the gene pools of individual nations isolated, or do they show evidence of admixture with neighboring populations?

The answers to these questions will present a picture of the modern gene composition of Russia, and will also shed light on the ancestral origins of contemporary ethnic groups.
Strategy of population sampling for Genome Russia:

The most numerous and widely distributed Russian ethnicities should be evaluated through multiple populations. For “ethnic Russians”, we propose to sample 12-15 sites. For the other large ethnicities (more than 1,000,000 people), 3-5 sampling sites would suffice. For ethnic populations of less than 1,000,000 people (e.g. those representative of the Urals, the northern European part, the North-West and North-East Caucasus, Altai, and the Siberian taiga), we will sample 2 locales. From the smaller ethnicities we suggest selecting only those that are the least assimilated to date, i.e., the peoples living in the taiga and tundra of Siberia and the Caucasian highlands (Appendices 1, 2a, 2b).

In Stage 1 we will collect samples to get a general impression of the genome of the Russian population as a whole. It is proposed to select the largest ethnic groups and conduct research on the perimeter of their ranges. Thus, for ethnic Russians it would be useful to collect samples from the four principal regions of Russia (Appendix 1, Figure 4):
• Northern Russians from Arkhangelsk.
• Western Russians from Pskov and Novgorod (a group formed in the first half of the 2nd millennium AD) and from the St. Petersburg metropolitan area (a group recently consolidated from many settling sources).
• Southern Russians of the Rostov, Voronezh, Krasnodar and Oryol regions, formed initially in the 16th-18th centuries. These are quite stable groups. Russians of the Central Russia regions are of less interest, since they are close to southern Russians and the stability of the populations is very low because this is an industrial zone.
• Russians of Siberia and the Far East, namely of the Omsk and Krasnoyarsk regions and Primorsky and Khabarovsk Krais. These groups were founded in the 18th-20th centuries, with 20th century introgression from European Russian groups such as the Stolypin settlers, who established their own separate villages and maintained internal marital ties.
At this stage, to a greater extent than in subsequent stages, relatively closed groups must be chosen, remote from industrial centers and highways, as the latter are points of migration-driven (mechanical) population growth. For Bashkirs we will sample three sites on the perimeter of Bashkortostan, mainly where Bashkirs live in compact groups, in the northeast, northwest and south of the republic. In addition, we will sample a few groups located outside the main range of settlement to detect local allele derivation.

From these Stage 1 samples of Russians and Bashkirs we hope to develop a preliminary outline of the population distinctions among the peoples of Russia. In addition, it may be possible to delineate the geographical boundaries of the Russian gene pool from its most western point, the Pskov region, to one of its most eastern, Khabarovsk Krai. It should also be possible to identify latitudinal differences, from high latitudes in the Arkhangelsk region to relatively low latitudes in the Don region.

In Stage 2 a more detailed and focused analysis of Russians and of the minority ethnicities of Russia will be conducted (Appendix 2a, 2b). Initially, to develop a detailed picture of European Russia, we shall sample populations in the Central Non-Black Earth Zone as well as from the Vladimir, Yaroslavl and Nizhny Novgorod regions. Further, ethnic populations from several regions will be sampled: the southern Russians from the Krasnodar, Rostov and Voronezh regions, Russians in the Kama region and the Urals, and Russians from Udmurtia and the Sverdlovsk region. We shall also gather samples from indigenous populations in Eastern Siberia, especially in its southern taiga, steppe, tundra, and northern areas, and in the Krasnoyarsk region (Figure 4).
Figure 4 Some locales of Russian ethnic groups selected for sample development in Genome Russia.

Two interesting ethnic groups who live throughout the country are Ukrainians and Tatars. The population structures and the ethnogenesis of these two groups are fundamentally different. Ukrainians are recent immigrants to the territory of Russia (18th-20th century); their ethnogenesis is inextricably linked to the ethnogenesis of Russians. Russians and Ukrainians had closer contact than Russians had with Poles, Lithuanians, and steppe Tatars, but less close contact than Russians had with Finno-Ugric peoples, while sharing a main East Slavic component. We propose to sample the following areas of concentrated residence of Ukrainians:
1. Belgorod region (the group was formed in the 17th-18th century);
2. Krasnodar region (18th-19th century);
3. From the Asian part, the Omsk region; and
4. Primorsky Krai (19th-20th century).

The Tatars result from contact between settled Turkic peoples (Bulgars), nomadic Turks (the Tatars proper) and indigenous Finno-Ugric peoples. On a genetic level, Tatars were in contact with the Russian colonists of the Volga-Kama region. We shall sample three groups along the “perimeter” of Tatarstan and one group in the diaspora in the Novosibirsk region.

Additional ethnic groups we shall sample are: the Karelians (West), Khanty, the Mordovians, Chuvash (Central), Mansi, and Komi (North), Udmurt (Volga-Kama area), Ossetians, Kabardians and Adygeis (Northwest Caucasus), Nganasans, Chechens (Northeast Caucasus), Khakases (Altai), Yakut (Siberian taiga), Ulchi, Udege, Nanai (Far East).

The Komi population is mixed with a Russian population that moved north-eastward from Moscow. Comparison of the Komi gene pool with that of the Russian population in the Arkhangelsk region will help determine the degree of admixture between the Slavs and the Finno-Ugric peoples.
We shall sample two groups, the taiga group in the Ust-Kulom district and the forest-tundra group in the Izhemsky area.

The Udmurts had been isolated longer than any other Finno-Ugric people of the Volga and Urals. Their gene structure may reflect an isolated “island population” level of differentiation. We shall investigate samples from the northern and southern Udmurts.

The population of the northwestern Caucasus traditionally had contacts with the steppe peoples, and the population of the northeastern Caucasus with the peoples of Transcaucasia and the Greater Caucasus, up to Asia Minor and the Iranian plateau. It makes sense to test the hypothesis that the gene pools of these groups have diverged. On the other hand, one should not exaggerate the degree of the highlanders’ contacts with the outside world. From these peoples we shall collect samples from Adygeis, Chechens and Kabardians (two sampling sites per ethnicity).

The Turkic peoples of Siberia have been quite isolated from each other because of the enormous distances. We propose to explore a mountain Turkic group, the Khakases (whose ethnic contacts were for a long time with the Mongols to the south and with other nomadic peoples of the Altai), and one taiga people of Yakutia (whose interethnic connections had been limited to Tungus and Paleo-Asiatic peoples).

The last group of peoples to sample represents small isolated ethnic populations. Their isolation is determined by both geography and the adverse environmental conditions of their habitat. Selected populations include: the Nenets of the Yamal tundra, the Nganasans of the Taimyr tundra, the Khanty of the Ob, the Tsakhurs of southern mountainous Dagestan, and the Chukchi inside the Arctic Circle.

Stage 3 will fill the gaps on the map of sampled locations, from the Russian North and Siberia to the peoples of Pomerania and the Indigirka. Such an approach would offer wide, expansive coverage of modern populations across the eleven time zones of today’s Russia.
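The site-allocation rule stated in the sampling strategy above (12-15 sites for ethnic Russians, 3-5 sites for other ethnicities of more than 1,000,000 people, 2 locales for smaller ethnicities) could be sketched as a small planning helper. This is only an illustration under stated assumptions: the `EthnicGroup` record, the function name and the population figures are hypothetical placeholders and not project data, and a group of exactly 1,000,000 people, which the text does not address, is treated here as a smaller ethnicity.

```python
from dataclasses import dataclass
from typing import Tuple

LARGE_GROUP_THRESHOLD = 1_000_000  # "more than 1,000,000 people" in the text


@dataclass(frozen=True)
class EthnicGroup:
    """Hypothetical planning record; field names are illustrative only."""
    name: str
    population: int
    is_ethnic_russian: bool = False


def planned_site_range(group: EthnicGroup) -> Tuple[int, int]:
    """Return (minimum, maximum) number of sampling sites for a group."""
    if group.is_ethnic_russian:
        return (12, 15)
    if group.population > LARGE_GROUP_THRESHOLD:
        return (3, 5)
    # Assumption: exactly 1,000,000 falls here, since the text leaves it open.
    return (2, 2)


if __name__ == "__main__":
    # Placeholder head counts, NOT census figures.
    examples = [
        EthnicGroup("ethnic Russians (placeholder)", 100_000_000, True),
        EthnicGroup("large ethnicity (placeholder)", 1_500_000),
        EthnicGroup("small ethnicity (placeholder)", 40_000),
    ]
    for g in examples:
        low, high = planned_site_range(g)
        print(f"{g.name}: {low}-{high} sites")
```

Returning a range rather than a single number keeps the flexibility of the proposal, where the final number of sites per ethnicity is left to the expedition planning.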
Patient recruitment and blood collection

IRB review and approvals

The Ethics Committee of SPSU has approved the informed consent document developed by the scientists of the Dobzhansky Center (Appendix 3). In the informed consent document (Appendix 4) we invite potential participants to be part of the Genome Russia Project, explaining that it will develop a research resource that researchers around the Russian Federation and the world will use. The informed consent document states:

“The overall objective of this project is to describe variations in the human genome in the genetic heterogeneity of the Russian population, to determine characteristics that influence spread of diseases, as well as to create a database of medically significant genomic variants specific to the population of Russia, which will provide a basis for developing medically significant genome variants in future. The specific objectives of the project are:
a. Collection of blood samples and DNA extracted from the blood samples, which will be kept in a repository and distributed to researchers for use in future projects
b. Data from the study of the samples, which will be kept on scientific databases available over the Internet. The resource will be used in many future studies related to health and disease.”

Researchers in several Russian institutions (including St. Petersburg State University; the Institute of Molecular Genetics; and the Center of Neurology, Russian Academy of Sciences, Moscow), plus researchers from several countries (Appendix 5, 6), are working together to develop this resource. Saint Petersburg State University is the principal sponsor of the project, and the Theodosius Dobzhansky Center for Genome Bioinformatics of St Petersburg State University is the coordinator of the research consortium established to carry out this project (Appendix 5). Several scientific and research centers are members of the consortium.
In the Informed Consent we also mention that this project will include obtaining DNA from at least 2000 men and women from different parts of Russia, whose ancestors were indigenous to the region for several generations. In order to take part, the study participant must:
• be at least 18 years of age;
• be willing to give a blood sample of ~7 ml so that researchers can read out all of the donor’s genetic information from it (a process called “sequencing”, or decoding the complete genome sequence);
• be willing to have all of the donor’s genetic information (without the donor’s name or other traditional identifying information, such as address, birth date, passport identification or Social Security number) put in scientific databases available on the Internet for scientific research;
• be willing to have many researchers around the Russian Federation and across the world study the genetic material and data from the sample for a long time, and to have the information they learn put in scientific databases on the Internet.

The Theodosius Dobzhansky Center for Genome Bioinformatics of St Petersburg State University will not collect donors’ names or any medical information. Researchers who study the material and data from the samples will be told only the sex of each donor and which ethnic or geographic group the donor came from.

The unit of blood sample collection is the family trio, i.e. two biological parents and their adult child. Ideally, we aim to collect blood samples from about 20 trios (60 individuals) from each geographic location under study. Before taking the blood samples, the researcher should be convinced that the donor is at least 18 years old, explain to him/her the essence of the project, and answer all his/her questions. The participant must sign the Informed Consent form and answer the questionnaire, in which we ask about his/her origin and ethnicity. Ideally, all four grandparents should be from the same district and share the same locality and ethnicity.
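The screening rules described above can be written down as a simple check. The following Python sketch is illustrative only: the `Candidate` record, its field names, and the strict reading of the "ideally" grandparent criterion are assumptions made for the example, not a description of the project's actual questionnaire or software.

```python
from dataclasses import dataclass
from typing import List, Tuple

MIN_AGE = 18  # minimum donor age stated in the informed consent


@dataclass
class Candidate:
    """Hypothetical screening record; field names are assumptions."""
    age: int
    consent_signed: bool
    questionnaire_done: bool
    # Self-reported (district, ethnicity) for each of the four grandparents.
    grandparents: List[Tuple[str, str]]


def is_eligible(c: Candidate) -> bool:
    """Apply the age, consent, and ancestry criteria to one donor."""
    if c.age < MIN_AGE:
        return False
    if not (c.consent_signed and c.questionnaire_done):
        return False
    if len(c.grandparents) != 4:
        return False  # incomplete ancestry information
    # Strict reading: all four grandparents share district and ethnicity.
    return len(set(c.grandparents)) == 1


def trio_is_eligible(mother: Candidate, father: Candidate, child: Candidate) -> bool:
    """A trio qualifies only if every member qualifies individually."""
    return all(is_eligible(m) for m in (mother, father, child))
```

In practice the grandparent criterion is described as an ideal, so a real screening step might flag near-misses for review instead of rejecting them outright.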
Patient recruitment study design: requirements for the selection of study participants

All members of the trio should be:
• Not younger than 18 years of age.
• Healthy: at the time of sampling they should not have serious chronic illnesses (as reported by the participants; a diagnosis is not required).
• Ethnically homogeneous and originating from a particular region, including grandparents on both sides.
• If there is doubt or a lack of information about at least one of the ancestors, the family cannot be included in the study.
• The members of a family trio selected for the study should be biological relatives:
o both parents in the trio must be the biological parents of the child.
• Families selected for the study should not be related to other trios:
o a member of a family under study must not have parents, children, sisters, brothers, grandparents, cousins, aunts or uncles in other trios selected for sample collection.

All participants must provide the information necessary to complete the questionnaire and sign the voluntary informed consent.

Patient Recruitment Expeditions

Before each expedition we arrange a letter of support for our project from the SPSU rector to the governor of the region we are visiting. On the first day after arrival at our destination we visit the city administration and introduce ourselves, explaining the importance of our project and the purpose of the expedition. We ask them for help with the local regional hospital and with local authorities. We ask the Chief Medical Officer of the hospital to nominate a nurse, who travels with us. With the help of the city authorities we ask for access to the city archives that contain registration information about the local citizens. During the first day, we work in the city register, collecting information about local citizens. Our strategy is then as follows. First, we split into several groups, each with its own car.
In the morning, we assign several villages to each group and travel to them separately. When we arrive, we first try to find an old-timer or long-term resident in each village and ask them who is an indigenous resident of the village, whose grandparents on both sides were born in the district. We give out a one-page flyer describing the project in simple words; then, if the resident shows interest in the project, we ask if s/he would be willing to donate 7 ml of his/her blood for the project. We also ask if s/he could recommend any other indigenous resident(s) in the local area. We fill out the questionnaire, and explain and sign the informed consent document. The questionnaire, informed consent and the tube for the blood sample are labeled with the same bar-code sticker in order to log the sample into the computer, where personal data are not recorded (only location, ethnicity, age and gender are recorded). We accumulate several suitable volunteers with signed consent in the morning, and then in the afternoon we pick up the nurse from the hospital and travel again to the villages where the volunteers who have filled out forms are waiting for us. In some cases we drive volunteers to the hospital to obtain the blood sample. The nurse takes a blood sample from each volunteer and we transport it back to the hospital, where one of our employees mixes the samples with pre-prepared TES buffer and then disposes of the used vacutainers in biohazard bags. Our logistics are thus well developed and carefully thought out. As a sign of our appreciation we give each participant a box of chocolates, for which they are very grateful. We strive to collect samples from around 60 individuals in each district. We learned many lessons from our first expedition and now understand how the process will work in remote sites.
We learned that:
• To identify people with a pure ethnic background (i.e., not with an admixed genotype from distant districts or nationalities) it is important to talk to residents and find an indigenous resident who knows and remembers their background and can indicate how to find the right people to include in our study. To convince people to participate in the project, a 15-20 min conversation is very important. People like to talk about themselves, so researchers should not immediately talk about the project, especially about blood sample collection, but should listen to the prospective participant talk about his/her life and family stories. After a sense of trust has been established, we gently switch the conversation to the importance of the project, emphasizing the special value of the genomes from the district under investigation so that the volunteer can appreciate his/her importance to the project. We then explain the project in more detail, confirm that the volunteer is suitable for the project (having them fill out a form on their family genealogy), and finally explain the meaning of the informed consent form and have them sign it.
• Several groups of researchers working in parallel collect samples more efficiently.
• In rural regions rumors spread quickly, so researchers should be careful about what they say to volunteers.

A draft timetable for sample collection expeditions is presented in Appendix 13.

DNA extraction and processing methodology.

Specimen collection, transport, storage and DNA extraction methods can contribute significantly to accurate whole genome sequencing results (Vaught, 2006; Troyer, 2008; Shabihkhani et al. 2014). Our laboratory has developed standard protocols to minimize the undesirable effects of pre-analytical variables on each of these steps. The protocols are described below.

Sample collection and transportation.
Blood is collected from participants into 10 ml vacutainer tubes with EDTA according to the protocol. All vacutainers are stored at 4°C until blood is transferred into 15 ml tubes with TES (Tris-EDTA-SDS) buffer in a 1:1 ratio for transportation to the laboratory.

Sample coding and database.
Each vacutainer and transport tube carries the participant's unique identifier number and barcode according to the database records. All information is anonymous: no personal data is present on the tube labels or in the database.

Sample aliquoting
When the vacutainers arrive at the laboratory, they are processed according to standard protocols. The first step is aliquoting. Aliquoting is necessary to preserve multiple samples and to avoid freeze/thaw cycles. Each blood sample produces 14 aliquots of 1 ml volume.

Aliquots coding and database
Aliquots are frozen in tubes with an alphanumeric identifier and a barcode, which allows them to be identified in the database. The database contains the following information about each aliquot: 1) participant ID, 2) tube ID, 3) tube storage location, 4) type of biomaterial, 5) volume.

Sample storage conditions
Blood with TES buffer is stored at 4°C for short-term storage. For long-term storage, blood samples are preserved by deep-freezing at -80°C. This method prevents DNA degradation in the sample. Furthermore, a portion of each sample that can be delivered to the laboratory within 24 hours will be cryopreserved in liquid nitrogen in the presence of a cryoprotectant (DMSO) in order to maintain viable cells. Subsequently, cryopreserved samples may be used to derive lymphoblastoid cell lines.

Extraction of high molecular weight DNA
The objective of this stage of the project is to extract at least 10 µg of high molecular weight DNA from each blood sample for further whole-genome sequencing and storage.
Human genomic DNA for whole-genome sequencing will be isolated from collected blood samples using the following cost-effective methods:
1) A magnetic bead-based method using the automated nucleic acid extractor MagCore HF16
2) A silica-membrane-based method using QIAGEN QIAamp DNA Blood Mini/Midi Kits

These techniques yield consistently high-quality DNA preparations, are safe, and allow us to extract DNA without using phenol and chloroform (Riemann et al 2007).

The automated nucleic acid extractor MagCore HF16 is a robotic desktop system for nucleic acid extraction from different materials: blood, cell cultures, body fluids, tissues, plants etc. Nucleic acid extraction is based on magnetic separation of MagCore particles (beads) covered with cellulose. Automating the work with blood will significantly increase safety, reduce the probability of random errors, and increase the accuracy of DNA extraction. Human genomic DNA will be extracted from 1200 µl of blood mixed with TES (Tris-EDTA-SDS) buffer in a 1:1 ratio. The MagCore HF16 system allows us to extract up to 31 µg of genomic DNA of approximately 20-30 kb in length, suitable for whole-genome sequencing.

QIAGEN QIAamp DNA Blood kits are well known and allow the extraction of high-quality DNA without organic extraction or alcohol precipitation. The QIAamp DNA Blood Mini Kit provides silica-membrane-based DNA purification. It is designed for processing up to 200 µl of fresh or frozen human whole blood, while the QIAamp DNA Blood Midi Kit allows for the processing of up to 2 ml of fresh or frozen human whole blood. The QIAamp DNA Blood Kits yield DNA sizes from 200 bp up to 50 kb, depending on the age and storage of samples. The typical yield from 200 µl of healthy whole blood is 4–12 µg, and from 1 ml the yield is 20-60 µg of DNA.

DNA quality control
Extracted DNA will be processed using the following quality control procedures:
1) Spectrophotometric analysis with the Nanodrop system to estimate RNA and protein impurities. A 260/280 nm absorbance ratio in the range 1.8-2.0 is required.
2) Accurate DNA quantification using the Qubit fluorometer. A minimum of 10 µg of DNA is required.
3) Gel electrophoresis in 0.8% agarose gels to analyze the DNA size distribution. DNA fragments should be approximately 20-40 kb in length, without smearing.

DNA samples that pass these three QC steps will be used for whole-genome sequencing and storage.

DNA samples storage conditions.
For long-term storage, DNA samples are preserved by deep-freezing at -80°C. This method prevents DNA degradation over a long time.

DNA samples coding in database
DNA samples are stored in tubes with an alphanumeric identifier and a barcode, which allows them to be identified in the database. The database contains the following information about each sample: 1) participant ID, 2) tube ID, 3) tube storage location, 4) volume, 5) concentration, 6) A260/280, 7) integrity. No personal data are placed on the tube label or in the database.

Laboratory quality management
To ensure the quality of the blood samples and extracted genomic DNA, the following practices will be implemented in the laboratory:
• standard operating procedures are written for each step of sample processing and DNA extraction
• automation of sample aliquoting and DNA extraction reduces the risk of pipetting errors
• the system of tube labeling with barcodes reduces the risk of labeling and processing errors
• the computer database allows us to manage all data related to the samples at each step and reduces manual transcription errors.

4.)
Lymphoblastoid (LCL) cell line establishment and the cell line biobank development

To yield large amounts of DNA for many genotype analyses and to provide a renewable source of DNA, it is necessary to harvest DNA and peripheral blood mononuclear cells (PBMCs) from individuals and their family members in several regions of Russia, to develop LCL cell lines from each individual, and to establish a biobank of these cell lines for their storage, maintenance and usage.

a.) Why do we need to obtain the lymphoblastoid cell lines?
Obtaining a lymphoblastoid cell line for each blood sample used for DNA sequencing is an integral part of the Genome Russia Project and is essential for several reasons:
1. Repeated blood sampling of individuals is costly and not always possible. Generating lymphoblastoid lines is the best and only way to preserve, store and replenish the genetic material of a particular person.
2. Cell lines are an inexhaustible resource of genetic material that allows the study of human genetics and genomics for scientific and medical purposes. The cell lines will allow researchers to follow up with more detailed studies of cellular phenotypes such as gene expression, epigenetic patterns, and drug response. Extensive genotype data will be available on these samples, and the trio samples from the Russian population will allow researchers to computationally map regions of the genome that affect the cellular phenotypes, and to study the heritability of these phenotypes. See http://ccr.coriell.org/sections/Collections/NHGRI/hapmap.aspx?PgId=266&coll=GM
3. Preparation and storage of human genetic material in the form of a biobank of lymphoblastoid cell lines is a requirement for participation in the international Human Genome Project. The cell lines are necessary for verification and validation of the results and for communication and exchange between the research groups.

b.)
Methods to develop the lymphoblastoid cell line

To develop the cell lines, we need to receive a blood sample from the donor and isolate the B-lymphocyte fraction of the blood. These B-lymphocytes are then transformed into a cell line using a virus, typically the Epstein-Barr virus (EBV). Many protocols have been developed for these procedures; we will use the protocol outlined in Appendix 7, developed at the US National Cancer Institute Laboratory of Genomic Diversity and used for transforming samples from over 5000 patients in HIV/AIDS, nasopharyngeal carcinoma, HBV, HCV and other complex disease gene cohort studies (O’Brien and Hendrickson 2013; Svitin et al 2014).

Human blood contains many types of cells that perform different functions, from the transport of oxygen to the production of antibodies. Blood cells are divided into red and white cell types: erythrocytes and leukocytes. Erythrocytes carry oxygen and carbon dioxide bound to hemoglobin. Leukocytes fight infection (immunity) and digest remnants of broken cells, etc. In addition, the blood contains a large number of platelets, which are involved in blood clotting. White blood cells are divided into three main groups: granulocytes, monocytes and lymphocytes. The lymphocytes are involved in the immune response and are represented by two main classes: 1) B-lymphocytes, which produce antibodies, and 2) T-lymphocytes, which kill virus-infected cells and regulate the activity of other leukocytes. Some of these cells operate solely within the circulatory system, while others use it only for transport and perform their functions in other tissues. However, the life cycles of all blood cells are similar in that 1) their life span is limited and 2) they are continuously formed in the body. Unfortunately, blood cells do not grow outside the human body. To give them the ability to proliferate in culture, B-cells must be transformed with a virus.
One of the most commonly used viruses for this purpose is the Epstein-Barr virus (EBV). EBV transforms human B-lymphocytes into stable cell lines. The method described generates LCLs from donor peripheral blood with rapid immortalization and cryopreservation times. Through the use of FK506, a T-cell immunosuppressant, and high titers of infectious virus, we are able to promote proliferation of EBV-infected B-cells from peripheral blood mononuclear cells. These interventions make the described method more efficient, resulting in the rapid expansion of cells for subsequent experiments. The transforming activity of certain strains of the virus is extremely high, reaching 90-100%. Lymphoblastoid cells are characterized by certain features: they are easy to handle, maintain a diploid karyotype and multiply at a high rate, including in large-scale cultures. Until recently, almost all human lymphoblastoid cell lines were characterized as B-type cells whose genetic information contained the Epstein-Barr virus, even if the cells did not produce its antigens. Subsequently, lymphoblastoid cells of the T-type were also prepared.

Figure 5. Workflow for generation and cryopreservation of lymphoblastoid cell lines. Peripheral blood is centrifuged through a Ficoll gradient. PBMCs present in the buffy coat of the established gradient are collected, followed by addition of EBV. EBV-exposed cells are grown at 37°C in the presence of 5% CO2 to establish and subsequently expand LCLs for cryopreservation.

Since the first reports of stable lymphoblastoid cell lines, their number has increased rapidly, and they are now extensively used in medicine, cell biotechnology and genomics. Our laboratory employees are experienced with the methods of blood fractionation and have all the necessary protocols to isolate and store the B-lymphocyte fraction.
Lymphoblastoid cell cultures are a specific kind of stable cell line. They tend to grow in suspension. The cells have a rounded shape, proliferate without being attached to the walls of the culture vessel, and do not form aggregates in a stationary culture. EBV can be obtained from infected mammalian cells. The most common source is the marmoset cell line B95-8, which can be purchased from Sigma, USA. This cell line grows readily and produces large amounts of virus inside the cells. It is then necessary to isolate and purify the EBV particles and to use them to infect the B-cell fraction isolated from the patient's blood. The protocols for this procedure are available in the laboratory, and our laboratory staff have been extensively trained abroad in these procedures. Moreover, we will acquire the EBV-containing B95-8 cells from the biobank of The Gamalei Institute of Virology in Moscow, where they have already been expanded and frozen and are ready to be shipped. However, for this project we will need to purchase an additional laboratory incubator, a water bath for heating the medium, a centrifuge, and an inverted microscope to examine the cells. Moreover, storage of cells requires an additional biobank cryopreservation system and a liquid nitrogen storage unit for the cells.

Protocol for isolation and transformation of PBMCs (See Appendix 7)

PBMCs (peripheral blood mononuclear cells) are isolated from whole blood by standard ficoll–hypaque density gradient centrifugation. Briefly, approximately 10 mL of heparinized, plasma-reduced blood is diluted with Hank’s buffered salt solution (HBSS; 1:2 dilution). Then, 15 mL of ficoll is overlaid with the diluted blood (30 mL). After 30 min of centrifugation (2000 rpm, room temperature (RT)), the PBMCs are collected.
After two washing steps and cell counting, the PBMCs are prepared for transformation with Epstein-Barr virus (EBV), which is added directly after isolation of the PBMCs. Alternatively, the isolated PBMCs are frozen and stored in a liquid nitrogen freezer for future batch transformation. PBMCs are frozen in FBS containing 10% dimethylsulfoxide (DMSO). The protocols for transformation of cells are similar whether the PBMCs are transformed after isolation or after storage in liquid nitrogen (details are provided below).

For transformation of previously frozen PBMCs, cells are thawed and washed in 10 mL of pre-warmed HBSS to remove all traces of the cryoprotectant in the freezing medium. Following centrifugation at 300 g for 5 min, the supernatant is discarded; the pellet is then re-suspended in 1 mL of complete medium (RPMI 1640, 10–20% heat-inactivated FBS, 1% penicillin–streptomycin, and 0.5% normocin or 0.1% gentamicin), and transferred to a 25 cm² flask containing 1.0–2.0 mL of EBV supernatant and 1.0 mg of cyclosporine (CSA) per mL. Approximately 6–7 × 10⁶ cells are used for the transformation of both thawed and freshly isolated PBMCs.

Freshly isolated PBMCs are suspended in 14 mL of complete medium (RPMI 1640 with Glutamax, 10% heat-inactivated FBS, 1% penicillin–streptomycin, and 0.5% normocin) in a 15 mL Falcon tube and centrifuged at 350 g for 10 min, and the supernatant is discarded. The cells are then re-suspended in 2.5 mL of EBV supernatant and 2.5 mL of complete medium, mixed carefully, incubated for at least 3 h (37 °C; 6% CO2), and transferred to a 25 cm² tissue culture flask. CSA (at a final concentration of 1 mg/mL) is used to suppress growth of T-lymphocytes. The empty 15 mL Falcon tube is rinsed with 5 mL of CSA-containing medium, which is then transferred to the cells in the flask; the flask, now holding 10 mL, is then placed in an incubator.
Alternatively, cells are resuspended in 4.0 mL of complete medium supplemented with 5 mg/mL of phytohemagglutinin-M (PHA-M) instead of CSA and then transferred to a 25 cm² tissue culture flask. EBV supernatant (1 mL) is added to the flask and mixed carefully, and then the flask is placed in a humidified incubator (37 °C; 5% CO2).

The flasks are kept in a humidified incubator at 37 °C and 5–6% CO2 throughout the culture period. They may be left undisturbed for the first 21 days, or may be subjected to additional procedures and/or observations during this time. In the latter case, on day 5, 0.3 mL of PHA solution (100 mg/mL) may be added to the flask to augment the suppression of T-lymphocytes. If the cultures are periodically examined during the first 3 weeks of incubation, they are first checked at day 5–7 by inverted phase microscopy for bright refractile clumps of cells (post-setup check). If a significant number of clumps is present, 1–3 mL of complete medium (including 5 mL of CSA per mL) is added to the flask, depending on the number of clumps, and the flask is returned to the incubator. If very few clumps of cells are visible, no medium is added, and the flask is returned to the incubator to allow further growth before repeating the post-setup check.

After 28–35 days of incubation, cell cultures are checked for sufficient cell numbers and split into two portions, one for freezing (one or more stock aliquots) and one for DNA extraction. The remaining cells (1–20 mL, depending on final culture volume) are returned to a 25- or 75-cm² flask for expansion to produce a sufficient number of cells for DNA extraction or freezing. Traditionally, growth transformation has been monitored by visualization of clusters of cells by light microscopy about a week after exposure to EBV. However, clustering of cells is not a specific indicator of EBV-mediated growth transformation.
We have previously demonstrated consistent identification of the proliferating cell population via flow cytometry, providing an accurate and specific method to determine successful outcome as early as three days after exposure of B-cells to EBV.

Abbreviations LCL Transformation
CSA: cyclosporine
DMSO: dimethylsulfoxide
dsDNA: double stranded DNA
EBV: Epstein-Barr virus
EDTA: ethylenediaminetetraacetic acid
FBS: fetal bovine serum
HBSS: Hank’s buffered salt solution
HLA: human leukocyte antigen
LCL: lymphoblastoid cell line
MHC: major histocompatibility complex
OPAs: oligonucleotide pool assays
PBMC: peripheral blood mononuclear cells
PBS: phosphate-buffered saline
PCR: polymerase chain reaction
PHA: phytohemagglutinin
RBC: red blood cell
RT: room temperature
SDS: sodium dodecyl sulfate
TE: Tris EDTA buffer

5.) Whole Genome Sequence Assessment of Study Participants

a.) Sequencing platform details
Currently there are two main high-throughput technologies which are suitable for large-scale whole-genome sequencing:

Table 1. Specifications of Illumina high-throughput platforms indicating main parameters and advantages/disadvantages of each platform (http://www.illumina.com/systems/sequencing.html).

Parameter | HiSeq 2000 | HiSeq 2000 | HiSeq 3000 | HiSeq 4000 | HiSeq X Five | HiSeq X Ten
Run mode | Rapid Run | High-Output | N/A | N/A | N/A | N/A
Flow cells per run | 1 or 2 | 1 or 2 | 1 | 1 or 2 | 1 or 2 | 1 or 2
Output range | 10-300 Gb | 50-1000 Gb | 125-750 Gb | 125-1500 Gb | 900-1800 Gb | 900-1800 Gb
Run time | 7-60 hours | <1-6 days | <1-3.5 days | <1-3.5 days | <3 days | <3 days
Reads per flow cell | 300 million | 2 billion | 2.5 billion | 2.5 billion | 3 billion | 3 billion
Maximum read length | 2x250 bp | 2x125 bp | 2x150 bp | 2x150 bp | 2x150 bp | 2x150 bp

Key advantage by platform:
HiSeq 2000: Power and efficiency for large-scale genomics.
HiSeq 3000: Maximum throughput and lowest cost for production-scale genomics.
HiSeq 4000: Maximum throughput and lowest cost for production-scale genomics.
HiSeq X Five: Maximum throughput for production-scale human whole-genome sequencing.
HiSeq X Ten: Maximum throughput and lowest cost for population-scale human whole-genome sequencing.

• The Illumina platform of sequencing instruments, from the HiSeq2000 model to the X Ten machines (Bentley et al., 2008). These next-generation sequencing machines were introduced several years ago and helped drive the genomic revolution around the world. Numerous studies, including such large-scale projects as the 1000 Genomes project, the Genome 10K project, and many other de novo whole genome assemblies and population genomics studies, were performed using these platforms. The main features of the sequencing platforms are presented in Table 1. Illumina platforms make it possible to sequence thousands of individuals per year with 30-50x coverage. The high accuracy of these instruments, along with the low cost of genome sequencing and their scalability, allows their use for population-scale sequencing of individuals and cancer samples. Several companies around the world offer access to commercial sequencers, and some of them have Illumina X Ten platforms, which allow the sequencing of thousands of human genomes (http://www.macrogen.com/eng/business/xgenome.html). Furthermore, Illumina routinely upgrades its reagents to extend read length and improve the accuracy of the sequencing reads.
• Complete Genomics (Drmanac et al., 2010) is one of the leaders in human whole genome sequencing and is based in Mountain View, California. Using its proprietary sequencing instruments, chemistry, and software, the company has sequenced more than 20,000 whole human genomes. The company’s mission is to improve human health by providing researchers and clinicians with the core technology and commercial systems to understand, prevent, diagnose, and treat diseases and conditions.
Over the past three years, Complete Genomics has initiated a large number of clinical utility studies designed to demonstrate that patients, payers, and physicians may be better off with a whole genome sequence as compared to standard care. Complete Genomics is now previewing its first commercial product, the Revolocity™ system. Unlike other vendors who focus on providing only sequencing equipment, Complete Genomics has designed the Revolocity system to be a total end-to-end genomics solution for large-scale, high-quality genomes (http://www.completegenomics.com/). The recently announced Revolocity system allows the sequencing of around 10,000 genomes per year with 50X coverage. Unlike Illumina, which has thousands of citations, there are currently few reports in the scientific literature using Complete Genomics technology (examples: Lee et al., 2015; Gilisen et al., 2014; Molenaar et al., 2014).

Table 2. Comparison of two major providers of population-scale whole genome sequencing

Feature | Macrogen/X10/Illumina | Complete Genomics
Read length | 2 X 150 bp | 2 X 28 bp*
Coverage | 30X | 50X
Cost of 1 genome | 1320 $ (with discount, including data transfer by Fedex on 2TB disk) | 1600 $ (without discount and data transfer)
Estimated cost of data storage facility | 500 000 $ | 1 000 000 $

We compared the major technical properties and the cost of sequencing whole genomes for two major genome sequencing providers (Macrogen, which has several Illumina X Ten machines, and Complete Genomics) and present the results in Table 2. According to the primary measures, Macrogen/Illumina outperforms Complete Genomics. The two main problems with Complete Genomics are:
1. The short read length (28 bp) does not allow the proper mapping of reads to a whole genome. It may result in many missed SNPs in low-complexity regions. The higher coverage of Complete Genomics will not help to resolve this problem.
In contrast, Macrogen proposes 150 bp paired-end reads.
2. The software and SNP-calling methods of Complete Genomics are not comparable with modern Illumina software. It will be difficult to compare the quality of results and the power of SNP-calling with the already published datasets from the 1000 Genomes project.

b) Study design and “Bake off” strategy to evaluate sequencing platforms
We aim to determine the best strategy of whole genome sequencing for the Russian population using 10 individuals from three family trios (mother, father, children), each of which will be sequenced on the two platforms mentioned above (Illumina X Ten – Macrogen, Revolocity – Complete Genomics). We will also sequence these individuals using the Illumina HiSeq2000 platform at the Saint-Petersburg State University Biobank. Each provider of whole genome sequencing technology presents its technology as the most accurate and the cheapest in terms of the trade-off between quality and price, but for the Genome Russia project our primary criteria will be the quality and reliability of the sequencing platforms. Sequencing the same individuals from family trios will allow us to identify and quantify sequencing errors during SNP and indel calling, genotyping errors, and each platform's ability to call complex genomic variants such as long heterozygous deletions and tandem duplications. The main indicators of genome sequencing quality will be: the percentage of high-quality bases and reads, the mappability of short reads with Phred-scale quality above 40, and the uniformity of coverage, among others. Finally, all genotypes will be tested for Mendelian transmission errors in the trio design in order to assess the incidence of genotyping errors with the different sequencing platforms and SNP calling software.
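The Mendelian transmission test mentioned above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's pipeline: it considers only autosomal sites, represents each genotype as an unphased pair of allele strings, skips sites with a missing call, and uses hypothetical function names. A production analysis would read genotypes from variant call files and handle sex chromosomes and de novo mutations separately.

```python
from typing import Iterable, Optional, Tuple

Genotype = Tuple[str, str]          # unphased diploid call, e.g. ("A", "G")
Site = Tuple[Optional[Genotype], Optional[Genotype], Optional[Genotype]]


def is_consistent(child: Genotype, mother: Genotype, father: Genotype) -> bool:
    """True if the child could have inherited one allele from each parent."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def mendelian_error_rate(sites: Iterable[Site]) -> float:
    """Fraction of fully called (child, mother, father) sites that are inconsistent."""
    checked = 0
    errors = 0
    for child, mother, father in sites:
        if child is None or mother is None or father is None:
            continue  # skip sites with a missing genotype call
        checked += 1
        if not is_consistent(child, mother, father):
            errors += 1
    return errors / checked if checked else 0.0
```

Comparing this rate for the same trio across platforms and variant callers gives one simple, platform-independent error indicator, since true Mendelian violations are expected to be very rare.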
S.O’Brien edition The overall frequency of sequencing and genotyping errors for each technology and the results of the bake-off comparison will be used to decide which sequencing technology will be used for the Genome Russia project. These results will be published as a separate methodological paper. ! 6. Computational resources, requirements, memory, capacity and security ! a.) Big Data handling, storage and accessibility Genome Russia is expected to generate whole genome sequence files and analyses for > 2500 Russian citizens in the coming 2-3 years. 60X coverage sequencing and alignment requires approximately 500Gb data files for each person. This includes raw reads .fastq files and .bam files with mapped reads. We will store, save, analyze, and provide open web-access to the consented sequence data in the future. For this purpose, we will need to build a high-speed storage unit with at least 1 petabyte (Pb) capacity, a powerful server cluster, and a high-speed network to connect servers, storage, and provide access to data. Sequence data will be delivered to us on external hard drives, which serve as an additional backup system for raw reads files. ! • ! Peterhof data center. Our server and storage system will be installed at the Peterhof SPSU Datacenter. The Peterhof SPSU Datacenter will provide several free server racks for our equipment, a cooling system, a local gas extinguishing system, electrical power protected with a powerful generator and an uninterruptible power supply system. The Peterhof SPSU Datacenter will also provide 1 gigabyte (Gb) Internet access and a 10Gb channel to connect to the Dobzhansky Center datacenter on 41 Sredniy Prospekt. For our equipment, we will need 2 racks of 48 units. Server racks will be locked to prevent unauthorized access to Genome Russia servers. b.) Genome Russia Server cluster: ! • Hardware For bioinformatics analysis, we will use six Supermicro 4U form factor servers. 
Each server will include four Intel E7-8890 v3 CPUs (72 cores in total), 3 TB of DDR4 memory, and 5 TB of fast SSD storage used as a disk cache to reduce the load on the network and the storage system.

• Software
For the server cluster setup, we will use Sun Grid Engine (SGE) software. It will be responsible for accepting, scheduling, dispatching, and managing the remote and distributed execution of large numbers of standalone, parallel or interactive user jobs (Samuel et al.). It also manages and schedules the allocation of distributed resources such as processors, memory, disk space, and software licenses. A typical Grid Engine cluster consists of a master host and one or more execution hosts. Multiple shadow masters can also be configured as hot spares, which take over the role of the master when the original master host crashes.

• Organizations using SGE:
o Sun Grid
o TSUBAME supercomputer at the Tokyo Institute of Technology
o Ranger at the Texas Advanced Computing Center (TACC). Ranger has 62,976 processor cores in 3,936 nodes and a peak performance of 504 TFlops.
o San Diego Supercomputer Center (SDSC)
o Geophysical Fluid Dynamics Laboratory (NOAA GFDL)

c.) Storage cluster:

• Hardware
The storage cluster will be based on Supermicro server nodes in the 1U form factor. Each node contains 12 large form factor 8 TB hard drives, so the total raw space per node reaches 96 TB. We will need approximately 2 PB of raw space, or 22 nodes, to build a 1 PB high-speed main storage system and a 1 PB backup system. The backup storage system will be located in the server room of the Dobzhansky Center at 41 Sredniy Prospekt. It will consist of two Supermicro storage servers with 72 8 TB hard drives each. Raidix software and RAID6 technology will be used to organize access and maintain the security of the data.
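The sizing figures above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the proposal's tooling: the function names are hypothetical, the drive counts and per-genome volume are taken from the text, and the choice between decimal (1 PB = 1000 TB) and binary (1 PB = 1024 TB) units is an assumption, since the proposal does not state which it uses. RAID6 parity overhead is not modelled.

```python
import math

# Figures quoted in the text
DRIVES_PER_NODE = 12     # large form factor drives per storage node
TB_PER_DRIVE = 8         # capacity of each drive
GENOMES = 2500           # planned number of participants
GB_PER_GENOME = 500      # FASTQ + BAM files per person


def node_raw_tb():
    """Raw capacity of one storage node in TB."""
    return DRIVES_PER_NODE * TB_PER_DRIVE


def nodes_needed(raw_target_tb):
    """Smallest whole number of nodes that reaches the raw target."""
    return math.ceil(raw_target_tb / node_raw_tb())


def dataset_tb(units_per_tb=1000):
    """Expected size of all per-person files, in TB.

    units_per_tb is 1000 for decimal units or 1024 for binary units
    (an assumption; the proposal does not say which applies).
    """
    return GENOMES * GB_PER_GENOME / units_per_tb
```

Under these assumptions `node_raw_tb()` gives 96, `nodes_needed(2048)` gives the 22 nodes quoted above (a decimal target of 2000 TB would give 21), and `dataset_tb()` gives 1250 TB. The last figure is slightly above 1 PB, so the main store may need some headroom if every FASTQ and BAM file is kept online; this is an observation from the arithmetic, not a statement made in the proposal.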
To achieve speeds above 10 Gigabit, several physical ports can be bundled into a single logical channel, as defined by the Link Aggregation Control Protocol (LACP) IEEE standard. This is also called interface bonding or teaming. Link aggregation is possible using either a single switch or stackable switches. Stackable switches allow link aggregation across the stack, which improves redundancy and resiliency (Guijarro et al., 2007).

• Software
Storage nodes will be merged into a high-speed system using the Gluster file system. Gluster is a software-only platform that provides scale-out NAS for physical and virtual environments. With Gluster, organizations can turn commodity computing and storage resources (either on-premise or in the public cloud) into a scale-on-demand, virtualized, commoditized, and centrally managed storage pool. The global namespace capability aggregates disk, CPU, I/O and memory into a single pool of resources with flexible back-end disk options, supporting direct attached, JBOD, or SAN storage. Storage server nodes can be added or removed without disruption to service, enabling storage to grow or shrink quickly in the most dynamic environments (Heath et al., 2014). Gluster is designed to distribute the workload across a large number of inexpensive servers and disks. This reduces the impact of poor performance of any single component, and dramatically reduces the impact of factors that have traditionally limited disk performance, such as spin time. In a typical Gluster deployment, relatively inexpensive disks can be combined to deliver performance equivalent to that of far more expensive, proprietary and monolithic systems at a fraction of the total cost. To scale capacity, organizations need only add inexpensive drives and will see linear gains in capacity without sacrificing performance.
To scale out performance, organizations need only add storage server nodes, and will generally see linear performance improvements. To increase availability, files can be replicated n-way across multiple storage nodes. While Gluster’s default configuration can handle most workloads, its modular design allows it to be customized for specialized workloads, so that configurations can be adjusted to achieve the optimal balance between performance, cost, manageability, and availability for particular needs. Gluster can be used for a wide variety of storage needs and performs well across a variety of workloads, including:
• Both large numbers of large files and huge numbers of small files
• Both read intensive and write intensive operations
• Both sequential and random access patterns
• Large numbers of clients simultaneously accessing files

d.) Data access:

Access to data will be organized through a web interface with an access control system, so that users can download data on request. We will also set up an Aspera server (http://asperasoft.com/). According to the vendor, the Aspera protocol moves large data sets over a WAN up to 100X faster than FTP or HTTP. The protocol is based on FASP technology, which is designed to transfer large data sets over any network at maximum speed, regardless of network conditions or distance. For example, over gigabit WANs with 1 second RTT and 5% packet loss, FASP achieves 700-800 Mbps file transfers on high-end PCs with RAID-0 and 400-500 Mbps transfers on commodity PCs. Large data sets of small files are transferred with the same efficiency as large single files. The implementation is very lightweight, and thus does not require specialized or powerful hardware in order to maintain high speeds or high concurrency.
In addition to significant transfer rate gains, FASP is able to fully utilize the available bandwidth, maximize use of the existing infrastructure and eliminate costly upgrades that may not even benefit TCP-based protocols. While FASP can fill any available bandwidth, it also includes an adaptive transmission rate control mechanism that throttles down to remain fair to standard TCP traffic and automatically ramps back up to use the unused bandwidth. This ensures that business-critical TCP traffic such as email, web, and scientific applications can function normally while FASP uses the bandwidth that remains.

e.) List of tools for bioinformatics analysis:

• GATK – tool to analyze sequencing data
• Bowtie – the read-mapping tool
• Picard – utilities that manipulate SAM and BAM files
• Samtools – a software suite to process SAM/BAM files
• Bcftools – tools to process genome variation data and handle files in the VCF format
• Vcftools – tools for processing genome variation data
• SnpEFF – genetic variant annotation and effect prediction toolbox
• PLINK – whole-genome association and variation analysis toolset
• Bioconductor – bioinformatics-related packages for R

f) Genome Russia Website

It is very important to inform people who are potential study participants about the Genome Russia Project in order to make our goals clear, to gain trust, and to encourage wide participation. Details of the goals, status and background of the Genome Russia Project are posted, with regular updates, on an open website: http://genomerussia.bio.spbu.ru/

Figure 6 Genome Russia logo.

The Genome Russia Project has a logo, located in the upper left corner of the website near the coat of arms of Saint-Petersburg State University. It will be displayed on advertising flyers, T-shirts and other promotional items.
A major purpose of the website is to announce the goals and progress of the project and to release all information about the project promptly. This ensures that every user can read a detailed description of the project, volunteer for the project and address any questions to the scientists. To provide a better understanding of how blood sample collection is performed, we include a file with the protocols for blood sample collection, DNA extraction and DNA quality control. Moreover, all volunteers receive and read a flyer with a brief description of the project, which describes the trios project, the informed consent form and the fate of their genome sequence data (Appendix 4). On the “News” tab we will publish periodic updates about events related to the Genome Russia Project. The website also has an interactive map of the Russian Federation with marked regions indicating where volunteers have been recruited for the project.

Figure 7 Genome Russia interactive map from the website

The Genome Russia Database

The Genome Russia Database is a web-based database containing anonymous information on the whole-genome sequences of at least 2,000 men and women originating from the different regions of Russia. The information includes gender, place of residence, place of birth, age, ethnicity, marital status, and date of blood collection, in addition to the database of genealogical information. This database does not contain any confidential personal identifier information.

Access to the Genome Russia Database is provided through a web application with a Python backend that is connected to our MySQL database. There are different types of users with different access rights: administrators, who are responsible for managing users’ rights and supporting the databases; moderators, who can upload and update data; and guests, who are registered to view data. Every change to the database is retained.
Moreover, the system periodically makes backups of the database on a server, so that user activity is under control and it is possible to recover lost data at any time. At present, the Genome Russia Database stores the questionnaires of 132 individuals from 5 regions:

Location | Quantity
Komi | 9
Arkhangelsk region | 33
Nizhny Novgorod region | 9
Pskov region | 60
Tver region | 21

Table 3 Inventory of study participants in the Genome Russia database

h.) The Genome Russia Database will store all the information about samples from individuals: where they are stored (room, freezer, box, tube), information about DNA extraction, and the results of QC tests and assessment. Each entity in the database has a unique bar code, which provides a key to the security and data structuring of the database (Wendi et al., 2007). For each sample collection trip, a user can generate a set of barcodes for questionnaires, informed consent forms, vacutainers and various types of tubes on the appropriate page of the web interface. After samples are received and the data are uploaded into the database, the system will generate a barcode, so that a laboratory employee can find all stored data by scanning the barcode with a barcode scanner. Files and tubes related to a common sample are interconnected through their IDs; this simplifies searching through the database.

Figure 8 An example of the web-interface questionnaire page.

One can track the progress of the Genome Russia Project or monitor the current phase of a particular sample, together with the tubes, files, and data associated with it and where they are stored. It is possible to generate reports on the data that the database stores and to download an Excel file with all samples and their associated characteristics.
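The barcode linkage described above can be illustrated with a small data model. This is a sketch under stated assumptions rather than the project's actual schema: the table names, column names, barcode format and helper functions are all hypothetical, and Python's built-in `sqlite3` module stands in for the MySQL database so that the example is self-contained.

```python
import sqlite3
import uuid

# Hypothetical schema: one row per sample, one row per physical or
# digital item (tube, questionnaire, consent form) tied to that sample.
SCHEMA = """
CREATE TABLE sample (
    sample_id INTEGER PRIMARY KEY,
    barcode   TEXT UNIQUE NOT NULL,
    region    TEXT NOT NULL
);
CREATE TABLE item (
    item_id   INTEGER PRIMARY KEY,
    barcode   TEXT UNIQUE NOT NULL,
    sample_id INTEGER NOT NULL REFERENCES sample(sample_id),
    kind      TEXT NOT NULL,
    location  TEXT
);
"""


def new_barcode(prefix):
    """Return a unique barcode string (format is an assumption)."""
    return f"{prefix}-{uuid.uuid4().hex[:12].upper()}"


def open_db():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def register_sample(conn, region):
    """Insert a sample and return (sample_id, barcode)."""
    barcode = new_barcode("GR")
    cur = conn.execute(
        "INSERT INTO sample (barcode, region) VALUES (?, ?)",
        (barcode, region),
    )
    return cur.lastrowid, barcode


def add_item(conn, sample_id, kind, location=None):
    """Attach a tube, form or file to a sample; return its barcode."""
    barcode = new_barcode(kind[:3].upper())
    conn.execute(
        "INSERT INTO item (barcode, sample_id, kind, location) "
        "VALUES (?, ?, ?, ?)",
        (barcode, sample_id, kind, location),
    )
    return barcode


def lookup(conn, barcode):
    """Return all items that share a sample with the scanned barcode."""
    row = conn.execute(
        "SELECT sample_id FROM sample WHERE barcode = ? "
        "UNION SELECT sample_id FROM item WHERE barcode = ?",
        (barcode, barcode),
    ).fetchone()
    if row is None:
        return []
    return conn.execute(
        "SELECT kind, barcode, location FROM item "
        "WHERE sample_id = ? ORDER BY item_id",
        (row[0],),
    ).fetchall()
```

Scanning any barcode that belongs to a sample, whether it is on a tube or a questionnaire, resolves to the same sample ID and returns every linked item with its storage location, which is the behaviour the text describes for the laboratory workflow.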
Discovered and validated genomic variants, along with related annotated genome features such as function-altering variants, pathological imputations, and haplotype maps, will be posted openly at the Dobzhansky Center GARField website mirror of the UCSC Genome Browser (Kent et al., 2002). Updated data, results and interpretations will be available as a track hub (Raney et al., 2014) for applications with third-party genome feature annotations.

7) Population whole genome sequence data analysis

Analysis of next generation sequencing (NGS) data requires the application of many bioinformatic tools combined in chains called pipelines. A variant calling pipeline can be divided into several stages, as shown in Figure 9:
1. quality control and filtration of input data;
2. alignment of filtered reads to the reference genome;
3. filtration of the alignment;
4. variant calling;
5. filtration of variants based on a set of indicators;
6. annotation of variants.

Figure 9 Workflow of SNV calling

These stages are described below. The full list of bioinformatics tools that will be used can be found in section 6.e.

a) Read quality control and filtration.

Whole-genome sequencing (WGS) reads will be processed with the FastQC tool (Andrews, 2010) to assess their quality. This tool calculates several indicators and parameter distributions that are useful for an initial assessment of data quality, for example:
1. per-base read quality distribution (see Figure 10 below),
2. per-read quality scores,
3. per-base GC content,
4. per-base N content,
5. sequence length distributions, and others.

Figure 10 Read Filtration QC

Reads will be filtered for adapters (technical sequences used in the preparation of libraries), contamination and low quality. Such filtration greatly improves the subsequent alignment.

b) Alignment and filtration

Filtered reads will be aligned to the human reference genome sequence (the GRCh37 assembly) using the Bowtie2 read aligner (Langmead and Salzberg, 2012) in very sensitive mode to obtain raw alignments stored in BAM files (Li et al., 2009). The BAM format was specially designed to store alignments in compressed form and retains full information about each alignment. The raw alignments will then be filtered for PCR duplicates and for ambiguously aligned or unaligned reads using Samtools (Li et al., 2009). Filtration is required to reduce errors and biases in the subsequent SNV calling. It also reduces the volume required to store alignments, as unnecessary information (for example, unaligned reads) is removed.

c) Variant calling and genotype calling

i. Strategy: BAM files of filtered read alignments will be subjected to GATK (https://www.broadinstitute.org/gatk/; McKenna et al., 2010; DePristo et al., 2011; Van der Auwera et al., 2013) analysis to call genomic variants and determine their genotypes. Following the GATK Best Practice guidelines, we will employ the joint calling and individual genotyping procedure, which provides high accuracy of results. Newly obtained samples will be added to the dynamic dataset without a high computational penalty. Joint calling of variants precludes the ambiguity of absent records in VCF files that can arise when individual variant calling is performed: such an absence can be interpreted either as the absence of variation at the respective site (i.e. the nucleotide is identical to the reference) or as the absence of data for this site (lack of coverage). Both the original raw data (FastQ files) and the analysis results (BAM and VCF files) will be stored in replicas at separate locations for future reference and for data stability.

ii.
Quality control: Further analysis of variants (for example, identification of variants unique to the Russian population, comparison of allele frequencies between different populations, etc.) requires a highly reliable set of variants. For this reason we plan to subject WGS data to the following quality control (QC):
• entries in the VCF file will be filtered according to the sequencing depth parameter (DP entry in the column of the VCF file) with a lower boundary cutoff of mean(DP)/2 (an upper boundary is not enforced since all repeats are removed at a later step)
• entries in the VCF file will be filtered by the GATK according to various "best practice" and in some cases stricter thresholds:
o MQ < 40
o QD < 2 (the variant confidence divided by the unfiltered depth of non-reference samples)
o FS > 20 (the Phred-scaled p-value from Fisher’s Exact Test to detect strand bias in the reads, i.e. the variation being seen on only the forward or only the reverse strand)
o HaplotypeScore > 13 (consistency of the site with strictly two segregating haplotypes)
o MQRankSum < -12.5 (the u-based z-approximation from the Mann-Whitney Rank Sum Test for mapping qualities: reads with reference bases vs. those with the alternate allele)
o ReadPosRankSum < -8 (the u-based z-approximation from the Mann-Whitney Rank Sum Test for the distance from the end of the read for reads with the alternate allele; if the alternate allele is only seen near the ends of reads, this is indicative of error)
• variants that fall into repeats known to the RepBase Update database (Jurka et al., 2005) will be removed
• individual entries will be checked for correct gender assignment (the phenotypic gender record should correspond to the genetic data on gender) and discrepant samples will be removed
• all variants with occurrence <2, or minor allele frequency (MAF) <3% and quality parameter (QUAL column of the VCF file) <150 will be filtered out
• all variants on autosomes will be filtered by the Hardy-Weinberg equilibrium (HWE) test with a cutoff of 0.0001
• all variants with a genotyping rate <90% will be removed
• all samples with a genotyping rate <90% will be removed
• after linkage disequilibrium (LD) pruning, variants will be filtered by IBD (identity-by-descent) with a cutoff of 0.2
• filtering by quality/occurrence/MAF, HWE, and variant and individual genotyping rate will be repeated to account for the dataset changes introduced by IBD filtering
• PCA analysis will be performed to check for population structure

iii. Functional SNV (fSNV) detection

The following types of genomic variants will be identified and included in further analysis:
• SNPs/SNVs, including nonsense variants and missense variants
• insertions, both in-frame and frameshift
• deletions, both in-frame and frameshift
• stop codons
• altered start sites
• splice junction alterations

fSNV variants will be further annotated according to their effects on the resulting gene products (transcripts and/or proteins) and their predicted influence (see prediction software below).

iv. Validation of rare allele and disease gene mutations

Whole genome sequencing with next-generation sequencing technologies is a powerful methodology that has made Genome Russia possible (see Section III-5).
However, there are significant drawbacks associated with these high-throughput methods, which affect the quality and reliability of research results. For example, amplification bias during PCR of heterogeneous mixtures can result in skewed populations [Kanagawa, 2003]. Additionally, polymerase mistakes, such as base misincorporations and rearrangements due to template switching, can result in incorrect variant calls. Furthermore, errors that arise during cluster amplification, sequencing cycles, and image analysis result in approximately 0.1–1% of bases being called incorrectly [Fox et al., 2014]. Traditional proven methods, namely Sanger sequencing and PCR, are widely employed to verify novel results acquired by these approaches. In Genome Russia we will verify all rare and function-altering variants by real-time PCR and Sanger sequencing. The primer design program should be chosen according to the detection method used in the PCR (HRM, SYBR Green, hydrolysis probes or hybridization probes). PCR primers are designed according to the assembled transcript sequences. We will confirm that the primers and probes are specific for the potential SNP and do not detect known SNPs or pseudogenes. It is desirable to use a second method to verify the results of the established assay (for example, one TaqMan probe-based assay and one HRM-based assay). All data coming from NGS should be verified on a heterozygous DNA sample by Sanger sequencing. All results should be congruent: the database search as well as all experimental data.

d) Genome Mining for described human disease gene variant alleles

i. Human Gene Mutation Database. Gene variants of known Mendelian diseases for the selected set of 173 genes with a documented clinical/phenotypic effect will be retrieved from the Human Gene Mutation Database (HGMD; http://www.hgmd.cf.ac.uk/ac/index.php; Stenson et al., 2012). Only variants with a severe and significant effect will be chosen.
The list of chosen variants from HGMD will be intersected with the merged VCF files from the whole-genome sequencing (WGS) data, and the resulting set of variants will be subjected to further analysis.

ii. 1000 Genomes project. SNV population allele frequency data from the 1000 Genomes project (McVean et al., 2012; Auton et al., 2015) will be included in the analysis alongside the WGS data obtained from the Russian population. We will also identify variants that are unique to the Russian population, as well as variants and respective genes that are present as alternative (non-reference) homozygotes or compounds. Identified variants of interest will be subjected to further analysis, e.g. comparing the effect of the respective genes with the prevalence of the respective or related diseases in the Russian population. The final validation of the assay should be performed with inter- and intra-assay specific validation.

iii. Online Mendelian Inheritance in Man (OMIM) database. The OMIM database (http://www.ncbi.nlm.nih.gov/omim, http://www.omim.org, Hamosh et al., 2005) is a comprehensive, authoritative compendium of human genes and genetic phenotypes that is updated daily. The full-text, referenced overviews in OMIM contain information on all known Mendelian disorders and over 15,000 genes. OMIM is focused on the relationship between phenotype and genotype. Intersection of variants from the OMIM database with variants obtained in the Genome Russia Project will allow us to estimate the presence and frequencies of variants connected with heritable diseases in the Russian population.

iv. De novo variant annotation. For de novo variant effect prediction, the snpEff tool (Cingolani et al., 2012) will be applied to the obtained SNVs. Based on the predicted effects, putative loss-of-function SNVs (McArthur et al., 2012) will be identified.
In addition, the effect of each missense variant on the corresponding protein will be assessed using the Sorting Intolerant From Tolerant (SIFT; http://sift.jcvi.org/; Ng and Henikoff, 2003) and Polymorphism Phenotyping (PolyPhen; http://genetics.bwh.harvard.edu/pph2/; Ramensky et al., 2002) software. For trio samples, analysis of inter-variant compounds (i.e. compounds between different variants located within the same gene) will also be carried out using their phased genotypes. Results obtained for the trios can be extended to other individuals in the study using the read-alignment phasing implemented in samtools (Li et al., 2009).

e.) Copy Number Variation (CNV) assessment for Genome Russia Project

For the copy number annotation analysis, we will employ an approach used in various robust genome annotation studies, including the analysis of human and animal genomes (Xue et al., 2015; Prado-Martinez et al., 2013; Tamazian et al., 2014; Marques-Bonet and Eichler, 2009; Alkan et al., 2009). The method uses read depth calculation and comprises several steps, including masking out the repeats in the reference genome, mapping datasets of multiple individuals to the reference genome, and screening the datasets of each individual for segmental duplications.

Step 1: Repeat masking employs tools such as Repeat Masker, Tandem Repeat Finder, and Dust Masker, which identify various categories of genomic repeats. Nonetheless, some repeats remain unmasked for various reasons, such as their absence from the Repeat Masker database. Hidden repeats are identified using a k-mer approach: the contigs and scaffolds are divided into k-mers with a fixed k (e.g., k=36), which are then mapped onto the assembly using the mrsFast software [4] to account for multi-mappings. In addition, for each masked segment, one flanking k-mer (k=36) at each of the 5’ and 3’ ends is masked out.
This is done to prevent coverage from dropping off in the regions flanking the masked regions.

The downstream analysis is based on the mrFast and mrsFast [4] software, which map short reads or k-mers to the reference genome and are designed with optimizations for Illumina short-read datasets. Both mappers exploit properties of Illumina reads such as their short length (shorter than 454 or PacBio reads), low error rate (2-3 bp along the read), and uniform read length within a single machine run. They implement a collision-free hash table to create indices of the reference genome that make efficient use of system memory. In comparison to mrFast, the mrsFast software finds only mismatches (not indels) and thus allows for an increase in mapping speed. In this analysis we will use mapped sequences 36 bp in length and optimize the parameters of the software in order to find all possible map locations.

Step 2 consists of mapping the data from multiple individuals to the reference genome in order to identify the copy numbers of genomic regions. For this purpose, reads of 100 bp are split into two consecutive k-mers of 36 bp from positions 10-46 and positions 46-81. In the case of 150 bp reads, we will split the reads into consecutive regions with the coordinates 5-41, 41-76, 76-112, and 112-148. This trims the potentially lower-quality bases at the outermost ends of the reads.

Step 3: CNVs are detected using mrCaNaVar [Alkan et al., 2009], which screens the read depth in non-overlapping windows of 1 Kb of unmasked sequence. The genome-wide read depth is calculated by iteratively excluding windows with the most extreme read depth. The remaining regions are kept as control regions.
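Steps 2 and 3 can be sketched in a few lines. The code below is an illustration under stated assumptions, not the mrsFast or mrCaNaVar implementation: the function names are hypothetical, the k-mer boundaries are treated as half-open 0-based intervals of exactly 36 bases (so they differ by a base in places from the coordinates listed in the text), and the iterative exclusion of extreme windows is approximated by trimming windows more than two standard deviations from the mean until nothing more is removed.

```python
from statistics import mean, pstdev

K = 36  # k-mer length used throughout the CNV analysis


def split_read(read, start, k=K):
    """Cut consecutive, non-overlapping k-mers from a read.

    `start` is the 0-based offset of the first k-mer; bases before it
    and any bases left over at the end are trimmed away.
    """
    return [read[i:i + k] for i in range(start, len(read) - k + 1, k)]


def control_windows(depths, z=2.0, max_iter=50):
    """Iteratively drop windows whose depth is an outlier.

    A window is dropped when its depth is more than `z` standard
    deviations from the current mean (threshold is an assumption).
    """
    kept = list(depths)
    for _ in range(max_iter):
        m, s = mean(kept), pstdev(kept)
        trimmed = [d for d in kept if abs(d - m) <= z * s]
        if len(trimmed) == len(kept) or len(trimmed) < 2:
            break
        kept = trimmed
    return kept


def copy_numbers(depths, ploidy=2):
    """Scale per-window depth so control regions sit at the ploidy."""
    baseline = mean(control_windows(depths))
    return [ploidy * d / baseline for d in depths]
```

With a start offset of 10, a 100 bp read yields two 36-mers, and with an offset of 5 a 150 bp read yields four, matching the counts described in Step 2. A window whose depth is three times the control baseline is reported with a copy number of 6 in a diploid genome.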
Step 4: Next, we will identify segmental duplications, defined as at least 5 consecutive windows of non-overlapping, non-masked sequence with a copy number value larger than the mean copy number value in control regions, with a correction for the standard deviation. By definition, these regions must also span at least 10 Kb in genomic coordinates.

The approach described briefly above was optimized by the research group at the comparative genomics laboratory under the direction of Tomas Marques-Bonet, who pioneered these methods (Xue et al., 2015; Prado-Martinez et al., 2013; Tamazian et al., 2014; Marques-Bonet and Eichler, 2009; Alkan et al., 2009) and is a collaborator of the Dobzhansky Center. To illustrate this method, we show recent results for CNV annotation from the genomes of five individuals of a novel antelope species (Figure 11; Table 4). The distribution of copy numbers in control regions is close to normal, with values centered near 2, as expected for a diploid genome (Figure 11). We found 459 regions of fixed duplications with a total length of 7,713,143 bp. The fixed duplications comprise from 67.7 to 76% of the duplications in each individual. We identified 265 common autosomal genomic duplications with more than 10 copies, belonging to about 222 annotated genes (Table 4). The methods for CNV analysis that we describe have been validated in several published studies by our group (Tamazian et al., 2014; Dobrynin et al., 2015; Cho et al., 2014), demonstrating that they are robust, reliable and appropriate for genome-wide annotation studies, and in particular for the Genome Russia project.

Figure 11 Distribution of copy number values in control regions for five individuals of sable antelope.

Table 4. Statistics on segmental duplications (SD)

Sample | Number of SD | Length of SD | Sample coverage
SB1954 | 588 | 10,494,879 | 7.65
SB134 | 573 | 10,257,894 | 8.02
SB2027 | 631 | 11,392,378 | 7.56
SB2130 | 574 | 10,116,594 | 7.54
SB1954 | 588 | 10,494,879 | 7.65

8. Haplotype Map constructions for Ethnic Russian population.

A haplotype map is a collection of common genetic variants that represents the genome structure of a population. Associations between alleles at different loci along the chromosome arise as a result of selection and/or demographic processes unique to the population (Reich 2002). The Russian genome structure must have been shaped by a unique interplay between adaptation, geographic isolation, demographic events, migration, and admixture. The proposed map of haplotypes in Russia will show what these variants are, where they are located along each of the chromosomes, how tightly associated (linked) they are with each other, and how they are distributed among people within and among populations in different parts of the Russian Federation. The resulting map will provide information for the many studies to follow that will connect genomic variants to the risk of human diseases, and will help in developing methods of diagnosis, treatment and prevention. The Russian haplotype map will describe the common patterns of variation, including associations between genetic markers, and will include tags that can substitute for sequencing in future genome-wide association studies (GWAS).

A single nucleotide polymorphism, or SNP, is a site (locus) where homologous chromosomes differ in their nucleotide bases. A sequence of alleles at SNPs observed along the chromosome is referred to as a haplotype. Haplotypes differ between populations because of the different histories of mutation, selection, drift and recombination that shape the sequence of corresponding segments of DNA. The tendency of SNPs in haplotypes to be inherited together is referred to as linkage disequilibrium or LD (Reich 2001).
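As a worked illustration of how LD between two SNPs is quantified, the following sketch computes the standard pairwise statistics D, D' and r² from phased haplotypes. The function name and the 0/1 allele coding are assumptions made for the example; the proposal does not specify which LD measure or software will be used at this step.

```python
def ld_stats(haplotypes):
    """Compute (D, D', r^2) for two biallelic SNPs.

    `haplotypes` is a list of (a, b) tuples, one per chromosome, with
    alleles coded 0 (reference) or 1 (alternate) at SNP A and SNP B.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n          # freq of allele 1 at A
    p_b = sum(b for _, b in haplotypes) / n          # freq of allele 1 at B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n

    # D: deviation of the observed haplotype frequency from independence
    d = p_ab - p_a * p_b

    # r^2: squared correlation between the two loci
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    r2 = d * d / denom if denom > 0 else float("nan")

    # D': D scaled by its maximum possible magnitude given the frequencies
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max if d_max > 0 else float("nan")

    return d, d_prime, r2
```

When the two SNPs always travel together, for example four chromosomes carrying (1, 1) and four carrying (0, 0), both D' and r² equal 1; when all four allele combinations are equally frequent, D and r² are 0.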
LD has a practical application: genotyping a subset of marker SNPs in a region provides enough information to predict the remainder of the common SNPs in that region of the chromosome, so that a limited number of 'tag' SNPs can identify each of the common haplotypes in a region. Haplotypes commonly occur in a block pattern: the chromosome region of each block has several common haplotypes and is separated from the following block by a recombination hotspot, although longer-distance haplotypes can sometimes span the shorter haplotypes in two adjoining blocks. A haplotype map shows the coordinates of haplotype blocks and lists and labels the tag SNPs that define them in each population. Genotyping these tag SNPs is sufficient because the entire haplotype sequence in each block can be recovered from only a few informative loci.

Many studies show that the chromosomal distances spanned by SNP associations in haplotype blocks are generally shorter for African populations (average ~11 kb), intermediate for European and Asian populations (average ~22 kb), and relatively long for the Native American populations that experienced the most recent founder event (Gabriel et al., 2002). Tag-SNP transferability between populations is affected primarily by the level of LD in the study population, although the genetic similarity of the reference and study populations remains important (Conrad et al., 2006). For example, when the International HapMap reference panels were used for imputation, genotypes from European HGDP samples were imputed with the highest accuracy, followed by samples from East Asia, Central and South Asia, the Americas, Oceania, the Middle East, and Africa (Huang et al., 2009). The choice of reference panels for imputation in worldwide populations should therefore follow geographic groupings and use mixtures of reference panels that maximize imputation accuracy (Huang et al., 2009).
Russian populations will therefore suffer in imputation accuracy, as most of the reference populations in the 1000 Genomes Project and in the panels designed using the International HapMap are geographically distant (Figure 1). The Russian population, in addition to carrying unique SNPs, is generally expected to show admixture between European and Asian genomes and to have haplotype lengths of the same order of magnitude. Therefore, once a haplotype map is constructed, Russian researchers will benefit immediately, as most of the haplotypes for a disease association study can be captured using no more than 300,000–1,000,000 tag SNPs (Gabriel et al., 2002). For example, the custom European array from Affymetrix contains a total of 674,518 SNPs (Hoffmann et al., 2011a), and the arrays optimized for East Asians, Africans, and Latinos include 712,950 (EAS), 893,631 (AFR) and 817,810 (LAT) SNPs, respectively (Table 5).

Table 5. The number of SNPs that are common to each of the arrays (modified from Hoffmann et al., 2011). Diagonal entries are the total number of SNPs in each array; off-diagonal entries are the numbers shared pairwise between arrays.

Array               EUR        EAS        AFR        LAT
European (EUR)      674,518    386,841    384,966    434,028
East Asian (EAS)               712,950    303,850    314,794
African (AFR)                             893,631    574,940
Latino (LAT)                                         817,810

Once genome data are available, the haplotype blocks can be constructed and the tag SNPs provided for a Russian analog of the arrays in Table 5. Using a trio (child and parents) design for sequencing will give an additional advantage in phasing haplotypes in the sequenced individuals. A number of standard software packages compute haplotype blocks and identify tag SNPs relevant to a population. To name a few, HaploView (Barrett et al., 2005) provides a common interface to compute and visualize several tasks, including LD and haplotype block analysis and haplotype population frequency estimation, while programs like Tagger (de Bakker et al., 2005) can produce a list of tag SNPs, corresponding statistical tests to capture all variants of interest, and a summary coverage report of the selected tag SNPs.

From the perspective of GWAS analysis, once a haplotype map is available, a custom array can be used to scale up genome association studies. Given this information, a large number of SNPs without any genotype data can be imputed from the tag SNPs by substituting information from locally relevant haplotype blocks specific to the Russian population (Marchini and Howie, 2010). This will help scale up association studies looking for Russian-specific determinants of human disease (Johnson et al., 2001). Using our genome data we will be able to identify local haplotype structure and ethnic admixture and compare them to existing databases such as the International HapMap and 1000 Genomes. Using trios will identify common patterns of recombination in the local population and describe major haplotype blocks that are relevant to the Russian Federation.

The aim of the Genome Russia haplotype map is to find and characterize local patterns of chromosomal variation specific to the Russian Federation by identifying common endemic sequence variants, their frequencies, and the correlations between them along the chromosomes in Russian populations. As a result, the project will provide a relevant haplotype map with markers that allow local association studies to be scaled up. Such studies evaluate risk or protection polymorphisms in functional, medically relevant genes within candidate regions suggested by either classical family-based linkage analysis or whole genome scans.
In other words, the major practical value of a haplotype map is that it eliminates the need for base-by-base sequencing of each human genome by providing informative variants that can be genotyped to predict entire fragments of chromosomes.

9. Interpreting Russian History population exchanges using phylogeography

1) Population genetics and phylogeography. To decipher the historical migratory routes and settlements of humankind in Russia, Europe and Asia, we aim to use reliable population genetic and bioinformatic techniques that make it possible to estimate the ancestry of defined populations, infer migration routes, and date them using coalescent modeling and the allele frequency spectra of populations. The analysis will be performed using whole-genome SNP datasets and sets of SNPs from the Y chromosome to infer differences in the history of population divergence and migrations. Unlike autosomal sequences, mitochondrial DNA is inherited exclusively through females and never through males. By contrast, Y-chromosome haplotypes are inherited exclusively through males and are never passed to daughters. Reconstructed sequences of full mitochondrial genomes will be used for precise coalescent modeling of the genetic history of this maternally transmitted organelle and for comparison with Y-chromosome history. The main methods we plan to use for the analysis are:

a) Ancestry inference based on clustering algorithms. Populations will be clustered by genomic ancestry using the ADMIXTURE (Alexander et al., 2009) and fastSTRUCTURE (Raj et al., 2014) software tools. These methods identify the ancestry of the studied populations at different hierarchical levels and estimate admixture between populations. Both methods have been extensively used and improved recently.

b) Principal component analysis. Principal component analysis (PCA) provides an opportunity to infer population structure without assuming any demographic model.
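To make the idea concrete, the sketch below runs a PCA on a small genotype matrix using a singular value decomposition in NumPy. It is a minimal illustration and does not reproduce the internals of any published package. The function name `genotype_pca`, the 0/1/2 genotype coding and the scaling by the binomial standard deviation are assumptions made for this example.

```python
import numpy as np

def genotype_pca(genotypes, n_components=2):
    """Project individuals onto the leading principal components.

    genotypes: (n_individuals x n_snps) array of alternate-allele counts (0, 1 or 2).
    Returns an (n_individuals x n_components) array of PC scores.
    """
    g = np.asarray(genotypes, dtype=float)
    p = g.mean(axis=0) / 2.0                 # alternate-allele frequency per SNP
    keep = (p > 0) & (p < 1)                 # drop monomorphic SNPs (zero variance)
    g, p = g[:, keep], p[keep]
    # Centre each SNP and scale by its expected binomial standard deviation
    z = (g - 2 * p) / np.sqrt(2 * p * (1 - p))
    u, s, _ = np.linalg.svd(z, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

if __name__ == "__main__":
    # Toy data: three individuals from each of two differentiated groups
    toy = [[0, 0, 1, 0], [0, 1, 0, 0], [0, 0, 0, 1],
           [2, 2, 1, 2], [2, 1, 2, 2], [2, 2, 2, 1]]
    print(genotype_pca(toy)[:, 0])           # first PC separates the two groups
```

In this toy data set the sign of the first component distinguishes the two groups; with real data, clusters along the leading components are interpreted as population structure.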
PCA was first applied to human populations in a study of human gene frequencies in Europeans (Menozzi et al., 1978). Two software implementations of PCA have been widely used for population genomic data analysis: EIGENSOFT/EIGENSTRAT (Price et al., 2006; Patterson et al., 2006), a stand-alone software package, and SNPRelate (Zheng et al., 2012), which is part of the Bioconductor framework of R bioinformatics packages. Principal component analysis of single nucleotide variants (SNVs) identified in Russian populations will be used to infer the structure of these populations and to investigate Russian genomic diversity compared to other populations. PCA is computationally efficient and can be parallelized for processing thousands of genomes.

c) SNP-phylogenetic analyses (ME, MP, ML and Bayesian). To precisely infer the history of population divergence, classical Neighbor-Joining trees (Saitou and Nei, 1987) with extensive bootstrap testing will be produced from matrices of genetic distances (Nei's Da (Nei et al., 1983), Dps (Bowcock et al., 1994)), both for individuals and for populations.

d) Demographic inference from genetic data, based on diffusion approximations to the allele frequency spectrum. The ∂a∂i software (Gutenkunst et al., 2009) implements methods for inferring demographic history and selection from genetic data, based on diffusion approximations to the allele frequency spectrum. One of ∂a∂i's main benefits is speed: fitting a two-population model typically takes around 10 minutes, and run time is independent of the number of SNPs in the data set. ∂a∂i is also flexible, handling up to three simultaneous populations, with arbitrary time courses for population size and migration, plus the possibility of admixture and population-specific selection.
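The allele frequency spectrum that serves as the input summary for this kind of inference is simply a histogram of derived-allele counts across sites. The sketch below builds an unfolded single-population spectrum with NumPy. It does not use the ∂a∂i package itself, and the function name and the 0/1/2 derived-allele coding are assumptions made for this example.

```python
import numpy as np

def site_frequency_spectrum(genotypes):
    """Unfolded site frequency spectrum for one population.

    genotypes: (n_individuals x n_snps) array of derived-allele counts (0, 1 or 2)
    for diploid individuals. Entry k of the result is the number of sites at which
    the derived allele is seen on exactly k of the 2 * n_individuals chromosomes.
    """
    g = np.asarray(genotypes, dtype=int)
    n_chrom = 2 * g.shape[0]
    derived_counts = g.sum(axis=0)           # derived-allele count per site
    return np.bincount(derived_counts, minlength=n_chrom + 1)

if __name__ == "__main__":
    # Two diploid individuals, three sites
    print(site_frequency_spectrum([[0, 1, 2],
                                   [0, 1, 2]]))
```

Because the spectrum has a fixed size determined by the sample size rather than by the number of SNPs, fitting a model to it does not become slower as more sites are added, which is consistent with the run-time property described above.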
∂a∂i has been applied to human and animal populations (Zhao et al., 2013; Lam et al., 2010; Gravel et al., 2011) because it allows fast and accurate inference; however, the number of populations studied simultaneously is limited to three.

e) Detailed inference of demographic and migration events using the Multiple Sequentially Markovian Coalescent (MSMC). The first method for inferring the demographic history of a population from a single whole genome sequence, the Pairwise Sequentially Markovian Coalescent (PSMC), was introduced recently (Li and Durbin, 2011), but it could not reliably infer the parameters of recent population history (more recent than 10-20 kyr ago). A recent extension of PSMC, called MSMC (Schiffels and Durbin, 2014), resolves the problem of the low number of recent coalescent events by using information from multiple genomes. The method also makes it possible to accurately infer the divergence time of populations, taking into account both migration and recombination events in a whole-genome coalescent framework.

f) Estimation of population tree with admixture. TreeMix is a novel method that uses large numbers of SNPs to estimate the historical relationships among populations, using a graph representation that allows both population splits and migration events (Pickrell and Pritchard, 2012).

2) Signatures of positive selection in the human genome during recent evolution. During the history of settlement, humans became adapted to local climate, diet, altitude, pathogens and other factors (Tishkoff, 2015; Fumagalli et al., 2011; Kamberov et al., 2013; Engelken et al., 2014). These adaptations play an important role in the health of locally adapted populations and can be used for developing new drugs and personalized medicine (Fumagalli et al., 2015). Exploring recently selected mutations is an important part of modern human genomics (Nielsen et al., 2007; Coop et al., 2009).
Many Russian ethnic minorities live in specific environments, such as the cold Siberian climate or the high-altitude Caucasus mountains, or have unique dietary habits, all of which can be associated with specific genomic variants. We aim to exploit the full spectrum of modern genomic techniques to infer loci that were affected by positive selection in different Russian populations, including:

a) Changes in the shape of the frequency distribution (spectrum) of genetic variation. Once a selective sweep reduces variability around a selected site, new mutations gradually appear. New mutations initially occur at low frequencies, because their chances of increasing in a population under neutral drift are very low, and it takes some time after the sweep to restore a more typical distribution of mutation frequencies in the region (a frequency spectrum) that is consistent with the action of neutral forces. This shift toward a low-frequency spectrum of polymorphism constitutes a signature of positive selection (Tajima 1989). Alternatively, balancing selection maintains a high proportion of high-frequency polymorphisms, thereby shifting the spectrum toward intermediate frequencies (Oleksyk et al., 2010).

b) Differentiation between populations (Fst). Variation in local conditions imposes differential selection pressures that shape variable adaptive landscapes (Wright 1951). Recent adaptations in populations often reflect the peculiarities of local environments. Local conditions differ from one locality to another and differ considerably between ecosystems. In some instances, given enough geographical isolation to restrict gene flow, selection signatures can differ considerably between populations. Consequently, regions experiencing selective sweeps, in addition to showing decreased variation within a population, should also display increased levels of population differentiation, a measure commonly denoted as Fst (Wright 1951).
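For a single biallelic locus, Fst can be illustrated as the proportional reduction in expected heterozygosity within populations relative to the pooled population, Fst = (H_T - H_S) / H_T. The Python sketch below applies this textbook definition to two populations of equal weight. The function name and the equal weighting are assumptions made for this example; genome scans typically rely on more elaborate estimators that correct for sample size.

```python
def wright_fst(p1, p2):
    """Fst at one biallelic locus from allele frequencies in two populations.

    p1, p2: frequency of the same allele in population 1 and population 2.
    Both populations are given equal weight (an assumption of this sketch).
    """
    p_bar = (p1 + p2) / 2.0
    h_t = 2.0 * p_bar * (1.0 - p_bar)                       # heterozygosity of the pooled population
    h_s = (2.0 * p1 * (1.0 - p1) + 2.0 * p2 * (1.0 - p2)) / 2.0   # mean within-population heterozygosity
    if h_t == 0:
        return 0.0          # locus is monomorphic in both populations
    return (h_t - h_s) / h_t

if __name__ == "__main__":
    print(wright_fst(0.3, 0.3))   # identical frequencies: no differentiation
    print(wright_fst(0.2, 0.8))   # strongly diverged frequencies
    print(wright_fst(0.0, 1.0))   # fixed differences: complete differentiation
```

A locus swept to high frequency in one population but not in another behaves like the second or third case, which is why unusually high Fst in a genomic region is treated as a candidate signal of local selection.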
Tests that look for population differentiation are based on the premise that natural selection can change the amount of differentiation between different populations of a species. Unless a selective sweep has already spread to all populations, the amount of genetic differentiation within the region that includes the selected locus will increase. Therefore, if genetic differentiation in a genomic region is greater than the level expected under neutrality, this differentiation may be a consequence of natural selection (Oleksyk et al., 2010).

c) Extended linkage disequilibrium segments. Historic selective sweeps in population data are apparent because of a hitchhiking effect described by Maynard Smith & Haigh (1974). As selection acts not on genotypes but on individuals carrying adaptive phenotypes that gain reproductive advantage, beneficial mutations are selected along with the entire genomes that carry them. However, independent assortment and recombination reshuffle chromosomes and regions distal to a selected beneficial variant. A selective sweep region would contain many neutral variants tightly linked to the beneficial mutation on haplotypes limited in length by a combination of selection strength and recombination rate. The extent of this association depends on the recombination distance, so persistence of a frequent, unusually long haplotype indicates strong, recent or ongoing selection, especially if that haplotype has risen to high frequency. Over many generations, haplotype size becomes smaller owing to recombination with other haplotypes (Oleksyk et al., 2010).

IV Personnel and Collaborators of Genome Russia.

Genome Russia is a dynamic consortium that expands on a regular basis to acquire and leverage the best possible expertise. Active collaborators and researchers are listed in:
1. Appendix 5 Dobzhansky Center
2. Appendix 6 Russian Collaborators
3. Appendix 15 International Collaborators

V Workflow and Time Table of Genome Russia.

Flow diagrams of the projected workflow and timetable are presented in Appendices 11 and 12.

VI Foreseeable benefits of the Genome Russia Project to SPSU, to Russia, and the world.

Due to the moderate costs and recent advances in next-generation sequencing technologies, a project like Genome Russia can be much less expensive than previous big-scale projects (e.g., the 1000 Genomes Project), and can bring numerous benefits to our understanding of population origins and disease on local, global and evolutionary scales. Here is how:

First, low-frequency and local variants discovered in population genome projects can be used to screen individuals with genetic disorders in genome-wide association studies (GWAS), in clinical trials, and in genome assessment of proliferating cancer cells (McVean, 2012). Russian biomedical researchers will receive the immediate benefit of an information resource, thereby building a baseline for future studies, including advances in precision/personalized medicine.

Second, the history of population admixture of the Russian people can bring forth many interesting insights. The modern Russian population comprises a melting pot with genetic contributions from three main ancestral ethnicities: European (Slavic, Baltic and Germanic), Uralic (Finno-Hungarian), and Altaian (Turkic), with a possible addition of traces from peoples that occupied the Eurasian Arctic and Siberia in the past (Figure 1c). As yet, this history of genome admixture has not been well documented, and it presents a new and unique opportunity to study population history in the wake of the great human migrations, the Black Death, the Great Silk Road diaspora, or recent demographic perturbations such as the siege of Leningrad (Smirnova, 2015).

Third, an admixture history combined with the diverse environments faced by local populations in Russia creates a unique opportunity for disease gene discoveries using mapping of admixture disequilibrium, or admixture mapping (Smith and O’Brien, 2005). This approach is known to be more powerful than a GWAS in homogeneous panmictic populations and has been used to discover a number of health-related mutations, mostly involving patients with Western European/West African or Western European/East Asian admixture components (Deo et al., 2007; Cheng et al., 2009). Given the difference in historic selection pressures, genome admixtures specific to Russia will contribute a wealth of new information, bringing forth different risk and/or protective alleles that do not exist, or are not associated with disease, elsewhere in the world.

Fourth, the studies of population ancestry and admixture in Russia would not be limited to modern humans. Recent reports have uncovered details about when Neanderthals and modern humans interbred and have even suggested important disease-fighting genes derived from those prehistoric encounters (Green et al., 2010; Abi-Rached et al., 2011; Sankararaman et al., 2014; Slatkin et al., 2014). Much of the Neanderthal heritage may still be unaccounted for, as recent reports keep discovering new genes originating from these ancient admixture events, and the spread of the Neanderthals is now documented as far as the Altai Mountains in Siberia (Prüfer et al., 2014). The geographic source of Denisovan DNA is also in Russia (Callaway, 2011), while its contribution is mainly found in Melanesia. Given that most of the genetic landscape of Russia has been little explored thus far, can we state with any certainty that another great discovery is not hidden behind that great “wide gap” on the global genetic diversity map (Figure 1 a, b)?
Fifth, engaging Russian scientists and communities in an international project like this would help integrate them into the world genomics community. Scientific output and training in Russia have diminished since the fall of the USSR in 1991, but the sustained, enormous intellectual potential has since become one of the world’s best-kept secrets. Once Genome Russia formally contributes to the International 1000 Genomes Project, the strict and widely agreed-upon ethical guidelines will illustrate the highest standards for compliance. We suggest that it is imperative that genomic research in Russia adhere to the ethical standards recently developed across the international medical genomics community. Genome Russia would also be built upon the open access philosophy, a trend that is gaining momentum but has been seen by many with suspicion, as trust between Russia and Western governments has become challenged by recent political exchanges.

The justifications for collecting, sequencing and analyzing populations from Russia in the immediate rather than distant future all recognize the enormous significance of these populations in the history of humankind and their value as a reservoir of knowledge about human health. Without filling the great “wide gap” on the genetic map of the world, we will be greatly handicapped in our further genomic endeavors and understanding. The beginnings of such a Genome Russia Project are happening with a national enthusiasm endorsed by the Russian Academy of Sciences, the Russian Ministry of Education and Science, and the Central Russian government in a concerted effort to make it happen (http://genomerussia.bio.spbu.ru/?lang=en). While political diplomacies continue, the Genome Russia Project can and should become an example of international collaboration on common ground and with the common goal of improving human health and betterment.

VII References

1.
Abi-Rached, L., Jobin, M. J., Kulkarni, S., McWhinnie, A., Dalva, K., Gragert, L., … Parham, P. (2011). The Shaping of Modern Human Immune Systems by Multiregional Admixture with Archaic Humans. Science, 334(6052), 89-94.
2. Akst, J. 100,000 British Genomes: A new initiative led by the UK's National Health Service aims to sequence the genomes of as many as 100,000 patients, a project that will cost £100 million. The Scientist. December 10, 2012 (http://www.the-scientist.com/?articles.view/articleNo/33622/title/100-000-BritishGenomes/).
3. Allentoft M, Sikora M, Sjögren K, Rasmussen S, Rasmussen M, Stenderup J, Damgaard P, Schroeder H, Ahlström T, Vinner L et al. (2015) Population genomics of Bronze Age Eurasia. Nature. Vol. 522, no. 7555, pp. 167-172.
4. Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).
5. Alkan C, Kavak P, Somel M, Gokcumen O, Ugurlu S, Saygi C, Dal E, Bugra K, Güngör T, Sahinalp SC, Özören N, Bekpen C. (2014) Whole genome sequencing of Turkish genomes reveals functional private alleles and impact of genetic interactions with Europe, Asia and Africa. BMC Genomics. 15:963.
6. Alkan, C., Jeffrey M. Kidd, Tomas Marques-Bonet, Gözde Aksay, Fereydoun Hormozdiari, Francesca Antonacci, Carl Baker, Onur Mutlu, S. Cenk Sahinalp, Richard A. Gibbs, Evan E. Eichler. "Personalized Copy-Number and Segmental Duplication Maps using Next-Gen Sequencing Technology." Nature Genet. 2009 Oct;41(10):1061-7.
7. Anderson, A. Macrogen, Seoul National University Team Spells Out Upcoming Stages of Asian Genome Project. Genome Web. Oct 07, 2014. https://www.genomeweb.com/sequencing/macrogen-seoul-national-university-teamspells-out-upcoming-stages-asian-genome
8. Andrews, S. "FastQC: A quality control tool for high throughput sequence data." (2010). http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
9.
Angiuoli, Samuel V, Malcolm Matalka, Aaron Gussman, Kevin Galens, Mahesh Vangala, David R Riley, Cesar Arze, James R White, Owen White, W Florian Fricke. CloVR: A virtual machine for automated and portable sequence analysis from the desktop using cloud computing. BMC Bioinformatics. 2011; 12: 356. Published online 2011 August 30. doi: 10.1186/1471-2105-12-356
10. Auton, A., Bryc, K., Boyko, A. R., Lohmueller, K. E., Novembre, J., Reynolds, A., … Bustamante, C. D. (2009). Global distribution of genomic diversity underscores rich complex history of continental human populations. Genome Research, 19(5), 795-803.
11. Auton A, Abecasis GR, The 1000 Genomes Consortium (2015) Global reference for human genetic variation. Nature. 2015 (in press)
12. Balanovskaia EV, Balanovskiĭ OP, Spitsyn VA, Bychkovskaia LS, Makarov SV, Paĭ GV, Rusakov AE, Subbota DS. (2001a) The Russian gene pool. Genogeography of serum genetic markers (HP, GC, PI, TF). Genetika. 37(8): 1125-37. Russian.
13. Balanovskaia EV, Balanovskiĭ OP, Spitsyn VA, Bychkovskaia LS, Makarov SV, Paĭ GV, Subbota DS. (2001b) The Russian gene pool. Genogeography of erythrocyte genetic markers (ACP1, PGM1, ESD, GLO1, 6-PGD). Genetika. 37(8):1138-51. Russian.
14. Barrett JC, Fry B, Maller J, Daly MJ. (2005) Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005 21(2):263-5.
15. Belyaeva O, Bermisheva M, Khrunin A, Slominsky P, Bebyakova N, Khusnutdinova E, Mikulich A, Limborska S. Mitochondrial DNA variations in Russian and Belorussian populations. Hum Biol. 2003 Oct;75(5):647-60
16. Bentley D. R. et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008. Vol. 456, no. 7218, pp. 53-59.
17. Bowcock, A. M. et al. High resolution of human evolutionary trees with polymorphic microsatellites. Nature 368, 455–457 (1994).
18. Callaway, E.
(2011) Ancient DNA reveals secrets of human history: Modern humans may have picked up key genes from extinct relatives. Nature 476, 136-137.
19. Cheng CY, Kao WH, Patterson N, Tandon A, Haiman CA, Harris TB, Xing C, John EM, Ambrosone CB, Brancati FL, Coresh J, Press MF, Parekh RS, Klag MJ, Meoni LA, Hsueh WC, Fejerman L, Pawlikowska L, Freedman ML, Jandorf LH, Bandera EV, Ciupak GL, Nalls MA, Akylbekova EL, Orwoll ES, Leak TS, Miljkovic I, Li R, Ursin G, Bernstein L, Ardlie K, Taylor HA, Boerwinckle E, Zmuda JM, Henderson BE, Wilson JG, Reich D. Admixture mapping of 15,280 African Americans identifies obesity susceptibility loci on chromosomes 5 and X. PLoS Genet. 2009 May;5(5):e1000490. doi: 10.1371.
20. Cingolani, Pablo, Adrian Platts, Le Lily Wang, Melissa Coon, Tung Nguyen, Luan Wang, Susan J. Land, Xiangyi Lu, and Douglas M. Ruden. "A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3." Fly 6, no. 2 (2012): 80-92.
21. Conrad DF, Jakobsson M, Coop G, Wen X, Wall JD, Rosenberg NA, Pritchard JK. (2006). A worldwide survey of haplotype variation and linkage disequilibrium in the human genome. Nat Genet 38:1251–1260
22. Coop, G. et al. The Role of Geography in Human Adaptation. PLoS Genet. 5, e1000500 (2009).
23. Cho, Yun Sung, Li Hu, Haolong Hou, Hang Lee, Jiaohui Xu, Soowhan Kwon, Sukhun Oh, Hak-Min Kim, Sungwoong Jho, Sangsoo Kim, Tae Hyung Kim, Shu-Jin Luo, Warren Johnson, Sunghoon Lee, Young-Ah Shin, Qian Zhou, Byung Chul Kim, Hyunmin Kim, Chang-uk Kim, Hyun-Ju Jung, Xiao Xu, Pryivrat Gadhvi, Pengwei Xu, Yingqi Xiong, Yadan Luo, Shengkai Pan, Caiyun Gou, Xiuhui Chu, Jilin Zhang, Sanyang Liu, Jing He, Ying Chen, Linfeng Yang, Yulan Yang, Jiaju He, Sha Liu, Junyi Wang, Chul Hong Kim, Jong-Soo Kim, Seungwoo Hwang, Junsu Ko, Chang-Bae Kim, Sangtae Kim, Damdin Bayarlkhagva, Woon Kee Paek, Seong-Jin Kim, Stephen J. O’Brien, Jun Wang, and Jong Bhak. (2013). The tiger genome and comparative analysis with other feline genomes. Nature Communications 4: 2433.
24. de Bakker, P.I.W., R. Yelensky, I. Pe'er, S.B. Gabriel, M.J. Daly, D. Altshuler (2005) Efficiency and power in genetic association studies. Nature Genetics. 37: 1217-1223.
25. Deo RC, Patterson N, Tandon A, McDonald GJ, Haiman CA, Ardlie K, Henderson BE, Henderson SO, Reich D. (2007) A high-density admixture scan in 1,670 African Americans with hypertension. PLoS Genet. 3(11): e196.
26. DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 2011 May;43(5):491-8. doi: 10.1038/ng.806. Epub 2011 Apr 10. PubMed PMID: 21478889; PubMed Central PMCID: PMC3083463.
27. Dobrynin, Pavel, Shiping Liu, Aleksey Komissarov, Gaik Tamazian, Alexey Makunin, Ksenia Krasheninnikova, Andrey Yurchenko, Sergey Kliver, Vladimir Brukhin, Klaus-Peter Koepfli, Warren Johnson, Lukas Kuderna, Raquel García-Pérez, Marc deManuel, Ricardo Godinez, Weilin Qiu, Long Zhou, Fang Li, Jian Yi, Carlos Driscoll, Agostinho Antunes, Taras K. Oleksyk, Eduardo Eizirik, Polina Perelman,
David Wildt, Mark Diekhans, Tomas Marques-Bonet, Anne Schmidt-Kuntzel, Laurie Marker, Jong Bhak, Wang Jun, Zijun Xiong, Guojie Zhang & Stephen J O’Brien. Genomic Legacy of the African Cheetah, Acinonyx jubatus. Genome Biology. In Press.
28. Drmanac R. et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science. 2010. Vol. 327, no. 5961, pp. 78-81.
29. Engelken, J. et al. Extreme Population Differences in the Human Zinc Transporter ZIP4 (SLC39A4) Are Explained by Positive Selection in Sub-Saharan Africa. PLoS Genet. 10, e1004128 (2014).
30. Filippova IN, Khrunin AV, Limborska SA. Analysis of DNA variations in GSTA and GSTM gene clusters based on the results of genome-wide data from three Russian populations taken as an example. BMC Genet. 2012;13:89.
31. Flegontova OV, Khrunin AV, Lylova OI, Tarskaia LA, Spitsyn VA, Mikulich AI, Limborska SA. Haplotype frequencies at the DRD2 locus in populations of the East European Plain. BMC Genet. 2009;10:62.
32. Fox EJ, Reid-Bayliss KS, Emond MJ, Loeb LA. Accuracy of Next Generation Sequencing Platforms. Next Generat Sequenc & Applic. 2014; 1(1):106.
33. Francioli et al. (2014) Genome of the Netherlands Consortium. Whole-genome sequence variation, population structure and demographic history of the Dutch population. Nat Genet.; 46(8):818-25.
34. Fu Q, Li H, Moorjani P, Jay F, Slepchenko SM, Bondarev AA, Johnson PL, Aximu-Petri A, Prüfer K, de Filippo C, Meyer M, Zwyns N, Salazar-García DC, Kuzmin YV, Keates SG, Kosintsev PA, Razhev DI, Richards MP, Peristov NV, Lachmann M, Douka K, Higham TF, Fumagalli M. et al. Signatures of environmental genetic adaptation pinpoint pathogens as the main selective pressure through human evolution. PLoS Genet. 2011. Vol. 7, no. 11, e1002355.
35. Fumagalli, M. et al.
Signatures of Environmental Genetic Adaptation Pinpoint Pathogens as the Main Selective Pressure through Human Evolution. PLoS Genet. 7, e1002355 (2011).
36. Fumagalli M, Moltke I, Grarup N, Racimo F, Bjerregaard P, Jørgensen ME, Korneliussen TS, Gerbault P, Skotte L, Linneberg A, Christensen C, Brandslund I, Jørgensen T, Huerta-Sánchez E, Schmidt EB, Pedersen O, Hansen T, Albrechtsen A, Nielsen R. Greenlandic Inuit show genetic signatures of diet and climate adaptation. (2015) Science. 349:1343-7.
37. Gabriel, S.B. et al. (2002) The structure of haplotype blocks in the human genome. Science 296, 2225–2229
38. Genovese G, Handsaker RE, Li H, Altemose N, Lindgren AM, Chambert K, Pasaniuc B, Price AL, Reich D, Morton CC, Pollak MR, Wilson JG, McCarroll SA. (2013) Using population admixture to help complete maps of the human genome. Nat Genet. Apr;45(4):406-14, 414e1-2.
39. Gilissen, C., Hehir-Kwa, J. Y., Thung, D. T., van de Vorst, M., van Bon, B. W., Willemsen, M. H., ... & Veltman, J. A. (2014). Genome sequencing identifies major causes of severe intellectual disability. Nature. 511, 344–347.
40. Gravel, S. et al. Demographic history and rare allele sharing among human populations. Proc. Natl. Acad. Sci. 108, 11983–11988 (2011).
41. Green, R. E., Krause, J., Briggs, A. W., Maricic, T., Stenzel, U., Kircher, M., … Pääbo, S. (2010). A Draft Sequence of the Neandertal Genome. Science, 328(5979), 710-722.
42. Gudbjartsson, D. F., Helgason, H., Gudjonsson, S. A., Zink, F., Oddson, A., Gylfason, A., … Stefansson, K. (2015). Large-scale whole-genome sequencing of the Icelandic population. Nat Genet, advance online publication.
43. Guijarro, Manuel; Ruben Gaspar et al. (2008). "Experience and Lessons learnt from running High Availability Databases on Network Attached Storage" (PDF). Journal of Physics: Conference Series. Conference Series (IOP Publishing) 119 (4): 042015. doi:10.1088/1742-6596/119/4/042015
44.
Gurdasani D, Carstensen T, Tekola-Ayele F, Pagani L, Tachmazidou I, Hatzikotoulas K, Karthikeyan S, Iles L, Pollard MO, Choudhury A, Ritchie GR, Xue Y, Asimit J, Nsubuga RN, Young EH, Pomilla C, Kivinen K, Rockett K, Kamali A, Doumatey AP, Asiki G, Seeley J, Sisay-Joof F, Jallow M, Tollman S, Mekonnen E, Ekong R, Oljira T, Bradman N, Bojang K, Ramsay M, Adeyemo A, Bekele E, Motala A, Norris SA, Pirie F, Kaleebu P, Kwiatkowski D, Tyler-Smith C, Rotimi C, Zeggini E, Sandhu MS. (2015) The African Genome Variation Project shapes medical genetics in Africa. Nature, 517(7534):327-32.
45. Gutenkunst, R. N., Hernandez, R. D., Williamson, S. H. & Bustamante, C. D. Inferring the Joint Demographic History of Multiple Populations from Multidimensional SNP Frequency Data. PLoS Genet. 5, e1000695 (2009).
46. Hamosh A. et al. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Research. 2005. Vol. 33, suppl. 1, pp. D514-D517.
47. Heger, M. BGI Plans to Launch Two NGS Systems This Year Based on Complete Genomics Technology. Genome Web. January 14, 2015 (https://www.genomeweb.com/business-news/bgi-plans-launch-two-ngs-systems-yearbased-complete-genomics-technology)
48. Har'kov VN, Hamina KV, Medvedeva OF, Simonova KV, Eremina ER, Stepanov VA. (2014) Gene pool of Buryats: clinal variability and territorial subdivision based on data of Y-chromosome markers. Genetika; 50(2): 203-13.
49. Heath AP, Greenway M, Powell R, Spring J, Suarez R, Hanley D, Bandlamudi C, McNerney ME, White KP, Grossman RL. Bionimbus: a cloud for managing, analyzing and sharing large genomics datasets. J Am Med Inform Assoc. 2014 Nov-Dec;21(6):969-75. doi: 10.1136/amiajnl-2013-002155. Epub 2014 Jan 24.
50. Hoffmann, TJ., Kvale MN., Hesselson, SE. et al. (2011a) Next generation genome-wide association tool: design and coverage of a high throughput European-optimized SNP array. Genomics. 98:79-89
51.
Hoffmann, TJ., Zhan Y., Kvale MN et al. (2011b) Design and coverage of high throughput genotyping arrays optimized for individuals of East Asian, African American, and Latino race/ethnicity using imputation and a novel hybrid SNP selection algorithm. Genomics. 98:422-430. 52. Huang L, Li Y, Singleton AB, Hardy JA, Abecasis G, Rosenberg NA, Scheet P. (2009) Genotype-imputation accuracy across worldwide human populations. Am J Hum Genet. 2009 Feb;84(2):235-50. 53. Hui-Yuen, J., McAllister, S., Koganti, S., Hill, E., Bhaduri-McIntosh, S. Establishment of Epstein-Barr Virus Growth-transformed Lymphoblastoid Cell Lines. J. Vis. Exp. (57), e3321, DOI : 10.3791/3321 (2011). 54. Johnson, G.C. et al. (2001) Haplotype tagging for the identification of common disease genes. Nat. Genet. 29, 233–237. 55. Jurka, J., Kapitonov, V.V., Pavlicek, A., Klonowski, P., Kohany, O., Walichiewicz, J. (2005) Repbase Update, a database of eukaryotic repetitive elements. Cytogentic and Genome Research 110:462-467. http://www.girinst.org/repbase/ 56. Kanagawa T. Bias and artifacts in multitemplate polymerase chain reactions (PCR). J Biosci Bioeng. 2003; 96: 317–323. 57. Kamberov, Y. G. et al. Modeling Recent Human Evolution in Mice by Expression of a Selected EDAR Variant. Cell 152, 691–702 (2013). !67 October 17 2015 Dr. S.O’Brien edition 58. Kent, W. James, Charles W. Sugnet, Terrence S. Furey, Krishna M. Roskin, Tom H. Pringle, Alan M. Zahler, and David Haussler. "The human genome browser at UCSC." Genome research 12, no. 6 (2002): 996-1006. 59. Kharkov VN, Khamina KV, Medvedeva OF, Simonova KV, Khitrinskaya IY, Stepanov VA. (2013) Gene pool structure of Tuvinians inferred from Ychromosome marker data. Genetika; 49(12): 1416-25. 60. Khusnutdinova EK, Litvinov SS, Kutuev IA, Iunusbaev BB, Khusainova RI, Akhmetova VL, Ahatova FS, Metspalu E, Rootsi S, Villems R. 
(2012) Gene pool of ethnic groups of the caucasus: results of integrated study of the Y chromosome and mitochondrial DNA and genome-wide data. Genetika. 48(6):750-61. Russian. 61. Khrunin AV, Tarskaia LA, Spitsyn VA, Lylova OI, Bebyakova NA, Mikulich AI, Limborska SA. p53 polymorphisms in Russia and Belarus: correlation of the 2-1-1 haplotype frequency with longitude. Mol Genet Genomics. 2005; 272(6): 666-72. 62. Khrunin AV, Bebiakova NA, Ivanov VP, Solodilova MA, Limborskaia SA. Polymorphism of Y-chromosomal microsatellites in Russian populations from the northern and southern Russia as exemplified by the populations of Kursk and Arkhangel'sk Oblast. Genetika. 2005; 41(8):1125-31. 63. Khrunin AV, Khokhrin DV, Limborskaia SA. Glutathione-S-transferase gene polymorphism in Russian populations of European Russia. Genetika. 2008;44(10): 1429-34 64. Khrunin A, Verbenko D, Nikitina K, Limborska S. Regional differences in the genetic variability of Finno-Ugric speaking Komi populations. Am J Hum Biol. 2007;19(6):741-50. 65. Khrunin A, Mihailov E, Nikopensius T, Krjutskov K, Limborska S, Metspalu A. Analysis of allele and haplotype diversity across 25 genomic regions in three Eastern European populations. ! 66. Khrunin AV, Firsov SIu, Limborskaia SA. Polymorphisms of DNA repair genes ERCC2 and XRCC1 in populations of Russia. Genetika. 2011;47(11):1565-8. 67. Khrunin AV, Khokhrin DV, Filippova IN, Esko T, Nelis M, Bebyakova NA, Bolotova NL, Klovins J, Nikitina-Zake L, Rehnström K, Ripatti S, Schreiber S, Franke A, Macek M, Krulišová V, Lubinski J, Metspalu A, Limborska SA. (2013) A genome-wide analysis of populations from European Russia reveals a new pole of genetic diversity in northern Europe. PLoS One 8 (3) : e58552 !68 October 17 2015 Dr. S.O’Brien edition 68. Krause, Johannes; Fu, Qiaomei; Good, Jeffrey M.; Viola, Bence; Shunkov, Michael V.; Derevianko, Anatoli P. 
& Pääbo, Svante (2010), "The complete mitochondrial DNA genome of an unknown hominin from southern Siberia", Nature 464 (7290): 894–897 69. Kushniarevich A, Utevska O, Chuhryaeva M, Agdzhoyan A, Dibirova K, Uktveryte I, Möls M, Mulahasanovic L, Pshenichnov A, Frolova S, Shanko A, Metspalu E, Reidla M, Tambets K, Tamm E, Koshel S, Zaporozhchenko V, Atramentova L, Kučinskas V, Davydenko O, Goncharova O, Evseeva I, Churnosov M, Pocheshchova E, Yunusbayev B, Khusnutdinova E, Marjanović D, Rudan P, Rootsi S, Yankovsky N, Endicott P, Kassian A, Dybo A; Genographic Consortium, Tyler-Smith C, Balanovska E, Metspalu M, Kivisild T, Villems R, Balanovsky O. (2015) Genetic Heritage of the Balto-Slavic Speaking Populations: A Synthesis of Autosomal, Mitochondrial and Y-Chromosomal Data. PLoS One 10 (9). 70. Lam, H.-M. et al. Resequencing of 31 wild and cultivated soybean genomes identifies patterns of genetic diversity and selection. Nat. Genet. 42, 1053–1059 (2010). ! 71. Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012, 9:357-359. 72. Lee, D., Hormozdiari, F., Xin, H., Hach, F., Mutlu, O., & Alkan, C. (2015). Fast and accurate mapping of Complete Genomics reads. Methods, 79, 3-10. 73. Leslie, S., Winney, B., Hellenthal, G., Davison, D., Boumertit, A., Day, T., … Bodmer, W. (2015). The fine-scale genetic structure of the British population. Nature, 519(7543), 309-314. 74. Li, Heng, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer, Gabor Marth, Goncalo Abecasis, and Richard Durbin. "The sequence alignment/ map format and SAMtools." Bioinformatics 25, no. 16 (2009): 2078-2079. 75. Li, H. & Durbin, R. Inference of human population history from individual wholegenome sequences. Nature 475, 493–496 (2011). 76. Limborska SA, Khrunin AV, Flegontova OV, Tasitz VA, Verbenko DA.Specificity of genetic diversity in D1S80 revealed by SNP-VNTR haplotyping. Ann Hum Biol. 2011;38(5):564-9. !69 October 17 2015 Dr. S.O’Brien edition 77. 
MacArthur, Daniel G., Suganthi Balasubramanian, Adam Frankish, Ni Huang, James Morris, Klaudia Walter, Luke Jostins et al. "A systematic survey of loss-offunction variants in human protein-coding genes." Science 335, no. 6070 (2012): 823-828. 78. Makarov, NA, LA Belyaev, AV Engovatova (2015) Archaeology in Contemporary Russia: Prospects and Challenges // Russian archaeologist. 2: 5-15 (Russian) 79. Marques-Bonet, T. and E. Eichler. “The Evolution of Human Segmental Duplications and the Core Duplicon Hypothesis.” Cold Spring Harb Symp Quant Biol. 2009;74:355-62. Epub 2009 Aug 28. 80. Marx, V (2015) The DNA of a Nation. Nature 524:503. 81. McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome Res. 2010 Sep;20(9):1297-303. doi: 10.1101/gr. 107524.110. Epub 2010 Jul 19. PubMed PMID: 20644199; PubMed Central PMCID: PMC2928508. 82. McLean, C. Y., Reno, P. L., Pollen, A. A., Bassan, A. I., Capellini, T. D., Guenther, C., … Kingsley, D. M. (2011). Human-specific loss of regulatory DNA and the evolution of human-specific traits. Nature, 471(7337), 216-219. 83. McVean GA., The 1000 Genomes Consortium (2012) An integrated map of genetic variation from 1,092 human genomes. Nature. 2012; 491(7422):56-65. 84. Menozzi, P., Piazza, a & Cavalli-Sforza, L. Synthetic maps of human gene frequencies in Europeans. Science (80-. ). 201, 786–792 (1978). 85. Mirabal S, Regueiro M, Cadenas AM, Cavalli-Sforza LL, Underhill PA, Verbenko DA, Limborska SA, Herrera RJ. Y-chromosome distribution within the geolinguistic landscape of northwestern Russia. Eur J Hum Genet. 2009 Oct;17(10): 1260-73. 86. Molenaar, J. J., Koster, J., Zwijnenburg, D. A., van Sluis, P., Valentijn, L. J., van der Ploeg, I., ... & Versteeg, R. (2012). Sequencing of neuroblastoma identifies chromothripsis and defects in neuritogenesis genes. 
Nature, 483: 589-593. 87. Molodin VI (2014) Ethnocultural mosaic in western Baraba (Late Bronze Age - a time of transition from the Bronze Age to the Iron Age. XIV-VIII century BC) // Archaeology, Ethnology and Anthropology of Eurasia. 4 (60) pp 54 - 63. (Russian) !70 October 17 2015 Dr. S.O’Brien edition 88. Movsesyan AA (2012) Paleophenetic analysis of modern and ancient population of Chukotka // Archaeology, Ethnology and Anthropology of Eurasia. 3 (51) pp 130 - 137. (Russian) 89. Nagasaki M, Yasuda J, Katsuoka F, Nariai N, Kojima K, Kawai Y, YamaguchiKabata Y, Yokozawa J, Danjoh I, Saito S, Sato Y, Mimori T, Tsuda K, Saito R, Pan X, Nishikawa S, Ito S, Kuroki Y, Tanabe O, Fuse N, Kuriyama S, Kiyomoto H, Hozawa A, Minegishi N, Douglas Engel J, Kinoshita K, Kure S, Yaegashi N; ToMMo Japanese Reference Panel Project, Yamamoto M. (2015) Rare variant discovery by deep whole-genome sequencing of 1,070 Japanese individuals. Nat Commun. 2015 Aug 21;6:8018. doi: 10.1038/ncomms9018. 90. Nei, M., Tajima, F. & Tateno, Y. Accuracy of estimated phylogenetic trees from molecular data. J Mol Evol 19, 153–170 (1983). 91. Nelis M, Esko T, Mägi R, Zimprich F, Zimprich A, Toncheva D, Karachanak S, Piskácková T, Balascák I, Peltonen L, Jakkula E, Rehnström K, Lathrop M, Heath S, Galan P, Schreiber S, Meitinger T, Pfeufer A, Wichmann HE, Melegh B, Polgár N, Toniolo D, Gasparini P, D'Adamo P, Klovins J, Nikitina-Zake L, Kucinskas V, Kasnauskiene J, Lubinski J, Debniak T, Limborska S, Khrunin A, Estivill X, Rabionet R, Marsal S, Julià A, Antonarakis SE, Deutsch S, Borel C, Attar H, Gagnebin M, Macek M, Krawczak M, Remm M, Metspalu A. Genetic structure of Europeans: a view from the North-East. PLoS ONE. 2009;4(5):e5472. 92. Novogilov AG (2009) The population of the Pskov-Pechora region as an ethnolocal group // Vestnik St. Petersburg University. 2. History Series. SPb., 3: 94-110. (Russian) 93. Ng PC, Henikoff S. SIFT: Predicting amino acid changes that affect protein function. 
Nucleic Acids Res. 2003 Jul 1;31(13):3812-4. PubMed PMID: 12824425; PubMed Central PMCID: PMC168916. 94. O’Brien SJ and Hendrickson, S. Host Genomic Influences on HIV/AIDS. Genome Biology Genome Biology, 14:201-214. 2013. 95. Patterson, N., Price, A. L. & Reich, D. Population Structure and Eigenanalysis. PLoS Genet. 2, e190 (2006). 96. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–909 (2006). 97. Pesik VY, Fedunin AA, Agdzhoyan AT, Utevska OM, Chukhraeva MI, Evseeva IV, Churnosov MI, Lependina IN, Bogunov YV, Bogunova AA, Ignashkin !71 October 17 2015 Dr. S.O’Brien edition MA,Yankovsky NK, Balanovska EV, Orekhov VA, Balanovsky OP. (2014) Analysis of genetic diversity of Russian regional populations based on common STR markers used in DNA identification. Genetika. 50(6): 715-23. Russian. 98. Marchini J, and B. Howie (2010) Genotype imputation for genome-wide association studies. Nature Reviews Genetics 11:499-511 99. Marques-Bonet, T. and E. Eichler. “The Evolution of Human Segmental Duplications and the Core Duplicon Hypothesis.” Cold Spring Harb Symp Quant Biol. 2009;74:355-62. Epub 2009 Aug 28. 100. Nielsen, R., Hellmann, I., Hubisz, M., Bustamante, C. & Clark, A. G. Recent and ongoing selection in the human genome. Nat Rev Genet 8, 857–868 (2007). 101. Oleksyk, T. K., Smith, M. W. & O’Brien, S. J. Genome-wide scans for footprints of natural selection. Philos. Trans. R. Soc. B Biol. Sci. 365, 185–205 (2009). 102. Pickrell, J. K. & Pritchard, J. K. Inference of Population Splits and Mixtures from Genome-Wide Allele Frequency Data. PLoS Genet. 8, e1002967 (2012). 103. Popova SN, Slominsky PA, Pocheshnova EA, Balanovskaya EV, Tarskaya LA, Bebyakova NA, Bets LV, Ivanov VP, Livshits LA, Khusnutdinova EK, Spitcyn VA, Limborska SA. Polymorphism of trinucleotide repeats in loci DM, DRPLA and SCA1 in East European populations. Eur J Hum Genet. 2001;9(11):829-35. 104. 
Prado-Martinez, Javier, Peter H. Sudmant, Jeffrey M. Kidd, Heng Li, Joanna L. Kelley, Belen Lorente-Galdos, Krishna R. Veeramah et al. "Great ape genetic diversity and population history." Nature 499, no. 7459 (2013): 471-475. 105. Prufer, K., Racimo, F., Patterson, N., Jay, F., Sankararaman, S., Sawyer, S., … Paabo, S. (2014). The complete genome sequence of a Neanderthal from the Altai Mountains. Nature, 505(7481), 43-49. 106. Raj, A., Stephens, M. & Pritchard, J. K. fastSTRUCTURE: Variational Inference of Population Structure in Large SNP Data Sets. Genetics 197, 573–589 (2014). 107. Ramensky V, Bork P, Sunyaev S. Human non-synonymous SNPs: server and survey. Nucleic Acids Res. 2002 Sep 1;30(17):3894-900. PubMed PMID: 12202775; PubMed Central PMCID: PMC137415. 108. Raney, Brian J., Timothy R. Dreszer, Galt P. Barber, Hiram Clawson, Pauline A. Fujita, Ting Wang, Ngan Nguyen et al. "Track data hubs enable visualization of user-defined genome-wide annotations on the UCSC Genome Browser." Bioinformatics 30, no. 7 (2014): 1003-1005. !72 October 17 2015 Dr. S.O’Brien edition 109. Reich, D.E. et al. (2001) Linkage disequilibrium in the human genome. Nature 411, 199–204 110. Reich, D.E. et al. (2002) Human genome sequence variation and the influence of gene history, mutation and recombination. Nat. Genet. 32, 135–142 111. Reich, D., Green, R. E., Kircher, M., Krause, J., Patterson, N., Durand, E. Y., … Paabo, S. (2010). Genetic history of an archaic hominin group from Denisova Cave in Siberia. Nature, 468(7327), 1053-1060. 112. ! 113. Riemann K, Adamzik M, Frauenrath S, Egensperger R, Schmid KW, Brockmeyer NH, Siffert W. Comparison of manual and automated nucleic acid extraction from whole-blood samples. J Clin Lab Anal. 2007;21(4):244-8. 114. Rosinger S, Nutland S, Mickelson E, Varney M D, Bernard O Boehm, Gary J Olsem, John A Hansen, Ian Nicholson, Joan E Hilner, Letitia H Perdue, June J Pierce, Beena Akolkar, Concepcion Nierras, Michael W Steffes and the T1DGC. 
Collection and processing of whole blood for transformation of peripheral blood mononuclear cells and extraction of DNA: the Type 1 Diabetes Genetics Consortium. Clinical Trials 2010; 7: S65–S74. 115. Russian / Institute of Ethnology and Anthropology. N. Maclay RAS / Series "Peoples and Cultures", t. I. / Editors Series: Doctor. hist. YB Simchenko Sciences, Doctor. hist. Sciences VA Tishkov. - M .: Nauka, 1999. - 828 p.: ill. (Russian). 116. Russian peoples of Russia. Atlas of cultures and religions (2010) - M .: Design. Information. Cartography. – 320 p. (Russian). 117. Sankararaman, S., Mallick, S., Dannemann, M., Prufer, K., Kelso, J., Paabo, S., … Reich, D. (2014). The genomic landscape of Neanderthal ancestry in present-day humans. Nature, 507(7492), 354-357. 118. Salmela E, Lappalainen T, Liu J, Sistonen P, Andersen PM, Schreiber S, Savontaus ML, Czene K, Lahermo P, Hall P, Kere J. (2011) Swedish population substructure revealed by genome-wide single nucleotide polymorphism data. PLoS One. 6(2): e16747. 119. Schiffels, S. & Durbin, R. Inferring human population size and separation history from multiple genome sequences. Nat Genet 46, 919–925 (2014). !73 October 17 2015 Dr. S.O’Brien edition 120. Semino O, Passarino G, Oefner PJ, Lin AA, Arbuzova S, Beckman LE, De Benedictis G, Francalacci P, Kouvatsi A, Limborska S, Marcikiae M, Mika A, Mika B, Primorac D, Santachiara-Benerecetti AS, Cavalli-Sforza LL, Underhill PA. The genetic legacy of Paleolithic Homo sapiens sapiens in extant Europeans: a Y chromosome perspective. Science. 2000;290(5494):1155-9. 121. Shabihkhani M, Lucey GM, Wei B, Mareninov S, Lou JJ, Vinters HV, Singer EJ, Cloughesy TF, Yong WH. The procurement, storage, and quality assurance of frozen blood and tissue biospecimens in pathology, biorepository, and biobank settings. Clin Biochem. 2014 Mar;47(4-5):258-66. 122. Shabrova EV, Khusnutdinova EK, Tarskaia LA, Mikulich AI, Abolmasov NN, Limborska SA. 
DNA diversity of human populations from Eastern Europe and Siberia studied by multilocus DNA fingerprinting. Mol Genet Genomics. 2004;271(3):291-7. 123. Slatkin M, Hublin JJ, Reich D, Kelso J, Viola TB, Pääbo S. (2014) Genome sequence of a 45,000-year-old modern human from western Siberia. Nature. 514:445-9. 124. Sminova, Y. (2015) Did good genes help people outlast the brutal Leningrad siege. Science 348:1068 125. Smith, M.W., and O’Brien, S.J.: (2005). Mapping by Admixture Disequilibrium: Advances, Limits and Guidelines. Nature Genetics Reviews. 6: 623-632,. 126. Smith, J. M. & Haigh, J. The hitch-hiking effect of a favourable gene. Genet. Res. 23, 23 (1974). 127. Spitsyn VA1, Khorte MV, Pogoda TV, Slominsky PA, Nurbaev SD, Agapova RK, Limborska SA.Apolipoprotein B 3'-VNTR polymorphism in the Udmurt population. Hum Hered. 2000;50(4):224-6. 128. Stenson PD, Ball EV, Mort M, Phillips AD, Shaw K, Cooper DN. The Human Gene Mutation Database (HGMD) and its exploitation in the fields of personalized genomics and molecular evolution. Curr Protoc Bioinformatics. 2012 Sep;Chapter 1:Unit1.13. doi: 10.1002/0471250953.bi0113s39. PubMed PMID: 22948725. 129. Stewart JB, Chinnery PF. Nat Rev Genet. (2015 ) . The dynamics of mitochondrial DNA heteroplasmy: implications for human health and disease. 16(9):530-42. 130. Sudmant, Peter H. Jeffrey M. Kidd, Heng Li, Joanna L. Kelley, Belen LorenteGaldos, Krishna R. Veeramah, August E. Woerner, Timothy D. O’Connor, Gabriel !74 October 17 2015 Dr. S.O’Brien edition Santpere, Alexander Cagan, Christoph Theunert, Ferran Casals, Hafid Laayouni, Kasper Munch, Asger Hobolth, Anders E. Halager, Maika Malig, Jessica Hernandez-Rodriguez, Irene Hernando-Herraez, Kay Prüfer, Marc Pybus, Laurel Johnstone, Michael Lachmann, Can Alkan et al. Great ape genetic diversity and population history (2013)Nature 499, 471–475. 131. 
Svitin Anton, Malov Sergey, Cherkasov Nikolay, Geerts Paul, Rotkevich Mikhail, Dobrynin Pavel, Shevchenko Andrey, Guan Li, Troyer Jennifer, Hendrickson Sher, Hutcheson Dilks Holli, Oleksyk K. Taras, Donfield Sharyne, Gomperts Edward, Jabs A. Douglas, Sezgin Efe, Van Natta Mark, Harrigan P. Richard, Brumme L. Zabrina, O'Brien J. Stephen GWATCH: a web platform for automated gene association discovery analysis. (2014 ) GigaScience 3:18. 132. Tajima, F. "Statistical method for testing the neutral mutation hypothesis by DNA polymorphism." Genetics 123.3 (1989): 585-595. 133. Tamazian, T; Simonov, S; Dobrynin, P; Makunin, A; Logachev, A; Komissarov, A; Shevchenko, A; Brukhin, V; Cherkasov, N; Svitin, A; Koepfli, KP; Pontius, J; Driscoll, CA; Blackistone, K; Barr, C; Goldman, D; Antunes, A; Quilez, J; Lorente-Galdos, B; Alkan, C; Marques-Bonet, T; Menotti-Raymond, M; David, V; Narfstrom, K; O'Brien, SJ (2014): Annotated Features of the Domestic Cat (Felis catus) Genome. GigaScience 3:13 . doi:10.1186/2047-217X-3-13 134. http://www.gigasciencejournal.com/content/3/1/13 2014. 135. Tishkoff, Sarah. "Strength in small numbers." Science 349.6254 (2015): 1282-1283. 136. Trofimova NV, Litvinov SS, Khusainova RI, Penkin LN, Akhmetova VL, Akhatova FS, Khusnutdinova ÉK. (2015) Genetic characterization of populations of the Volga-Ural region according to the variability of the Y-chromosome. Genetika. 2015; 51(1): 120-7. 137. Troyer D. Biorepository standards and protocols for collecting, processing, and storing human tissues. Methods Mol Biol. 2008;441:193-220. 138. Trubachov ON (2003) Ethnogenesis and culture of ancient Slavs: linguistic studies. - M .: Nauka, 489 pp. (Russian) 139. Ulyanov MV, Lavryashina MB , Nikolaev VV, Octyabrskaya IV, Druzhinin VG (2014) The indigenous population of the northern regions of the Altai: the reflection of the demographic processes of XIX - beginning of the XXI century in !75 October 17 2015 Dr. 
S.O’Brien edition the dynamics of the family structure // Archaeology, Ethnology and Anthropology of Eurasia 3 (59): 128 - 140. (Russian). 140. Van der Auwera GA, Carneiro M, Hartl C, Poplin R, del Angel G, LevyMoonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella K, Altshuler D, Gabriel S, DePristo M, 2013 From FastQ Data to High-Confidence Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline CURRENT PROTOCOLS IN BIOINFORMATICS 43:11.10.1-11.10.33 141. Vaught JB. Blood collection, shipment, processing, and storage. Cancer Epidemiol Biomarkers Prev. 2006 Sep;15(9):1582-4. 142. Verbenko DA, Pogoda TV, Spitsyn VA, Mikulich AI, Bets LV, Bebyakova NA, Ivanov VP, Abolmasov NN, Pocheshkhova EA, Balanovskaya EV, Tarskaya LA, Sorensen MV, Limborska SA.. Apolipoprotein B 3'-VNTR polymorphism in Eastern European populations. Eur J Hum Genet. 2003;11(6):444-451. 143. Verbenko DA, Knjazev AN, Mikulich AI, Khusnutdinova EK, Bebyakova NA, Limborska SA. Variability of the 3'APOB minisatellite locus in Eastern Slavonic populations. Hum Hered. 2005;60(1):10-8. 144. Verbenko DA, Slominsky PA, Spitsyn VA, Bebyakova NA, Khusnutdinova EK, Mikulich AI, Tarskaia LA, Sorensen MV, Ivanov VP, Bets LV, Limborska SA. Polymorphisms at locus D1S80 and other hypervariable regions in the analysis of Eastern European ethnic group relationships. Ann Hum Biol. 2006; 33(5-6): 570-84. 145. Xue Y, Prado-Martinez J, Sudmant PH, Narasimhan V4, Ayub Q1, Szpak M1, Frandsen P5, Chen Y1, Yngvadottir B1, Cooper DN6, de Manuel M2, HernandezRodriguez J2, Lobon I2, Siegismund HR5, Pagani L7, Quail MA1, Hvilsom C8, Mudakikwa A9, Eichler EE10, Cranfield MR11, Marques-Bonet T12, Tyler-Smith C13, Scally A14. Mountain gorilla genomes reveal the impact of long-term population decline and inbreeding. Science. 2015;348: 242-5. 146. Wendl MC, Smith S, Pohl CS, et al. Design and implementation of a generalized laboratory data model. BMC Bioinformatics. 2007;8:362. 
doi: 10.1186/1471-2105-8-362. 147. Wright. "Genetical structure of populations." Nature 166 (1950): 247-49. 148. Yakupov RI (2010) Trying to understand the ethnic history of Eurasia (RG Kuzeev memory) // Ethnographic Review. 2: 110-119. (Russian). !76 October 17 2015 Dr. S.O’Brien edition 149. Zavyalov VI, LS Rozanov, NN Terekhova (2012) Ethno-cultural interaction in the era of migration of peoples: arheometallographic data (based on the sites of the Volga-Kama and Poochya). // Russian archeology. 1. (Russian) 150. Zhang, W. & Sun, Z. Random local neighbor joining: A new method for reconstructing phylogenetic trees. Mol. Phylogenet. Evol. 47, 117–128 (2008). 151. Zhao, S. et al. Whole-genome sequencing of giant pandas provides insights into demographic history and local adaptation. Nat. Genet. 45, 67–71 (2012). 152. Zheng, X. et al. A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 28, 3326–3328 (2012). ! ! ! ! ! ! ! !