Genome Russia Project
Research Proposal
October 17 2015
Dr. S. O’Brien edition
Table of Contents
I VISION AND EXECUTIVE SUMMARY
II BACKGROUND AND RATIONALE
1 Specific AIMS
2 Purpose and Goals
3 Scientific Background
III STUDY DESIGN AND METHODOLOGY
Populations and ethnic groups in Russia to target (Peoples of Russia – Atlas, 2010).
Patient recruitment and blood collection
IRB review and approvals
Patient recruitment study design- requirements for the selection of study participants
Patient Recruitment Expeditions
DNA extraction and processing methodology.
DNA quality control
Laboratory quality management
4.) Lymphoblastoid (LCL) cell line establishment and the cell line biobank development
Protocol for isolation and transformation of PBMCs (See Appendix 7)
Abbreviations LCL Transformation
5.) Whole Genome Sequence Assessment of Study Participants
6. Computational resources, requirements, memory, capacity and security
b.) Genome Russia Server cluster:
c.) Storage cluster:
d.) Data access:
e.) List of tools for bioinformatics analysis:
f) Genome Russia Website
The Genome Russia Database
7) Population whole genome sequence data analysis
a) Read quality control and filtration.
b) Alignment and filtration
c) Variant calling and genotype calling
i. Strategy:
iii. Functional SNV (fSNV) detection
iv. Validation of rare allele and disease gene mutations
d) Genome Mining for described human disease gene variant alleles
e.) Copy Number Variation (CNV) assessment for Genome Russia Project.
8. Haplotype Map constructions for Ethnic Russian population.
9. Interpreting Russian History population exchanges using phylogeography
IV PERSONNEL AND COLLABORATORS OF GENOME RUSSIA.
V WORKFLOW AND TIME TABLE OF GENOME RUSSIA.
VI FORESEEABLE BENEFITS OF THE GENOME RUSSIA PROJECT TO SPSU, TO RUSSIA, AND THE WORLD.
VII REFERENCES
I VISION AND EXECUTIVE SUMMARY
The genomics era has revolutionized the delivery of medical diagnoses across the
world, and individual DNA assessment is today a critical component of personalized
medicine. The completion of the first human genome sequence in 2003 and the
subsequent rush of whole genome sequencing (estimated at close to 200,000 people by
the end of 2015) offer the promise of genome-empowered medical diagnoses and
treatments in the very near future. Derivative international projects such as the 1000
Genomes Project, the International HapMap Project and many others have begun the
process of cataloguing and characterizing human gene diversity. The promise is to identify
the determinants of hereditary diseases as well as of complex chronic and infectious
diseases with a genetic underpinning (including cancers, neurological disorders,
autoimmune disease, HIV/AIDS, Ebola and many others).
In the past few years major population genome sequencing projects have gotten underway
in several nations: UK, USA, Japan, Iceland, South Korea, Canada, Australia, Thailand,
Kuwait, Qatar, Israel, Belgium, Luxembourg, and Estonia. In spite of occupying over 10%
of the world’s landmass, and having the ninth largest national population (estimated at
145,000,000 people), the Russian Federation has lagged behind in contributing to
worldwide genomic database projects. Genome Russia would remedy this situation by
producing a formal, sequence-based description of genome variation across diverse
population groups and ethnic minorities within the Russian Federation. As presented in
detail in this report, Genome Russia will initially develop whole genome sequences from
2500 Russian volunteers and annotate common and rare DNA variants. We will offer
these data to join Russia with the 1000 Genomes Consortium, the International HapMap and
contributing partners. The public release of these data shall become a national and world
resource for genomic enquiry and varied cross-disciplinary applications in medical
genomic research.
The principal goals of Genome Russia include: 1) Documenting the sum and
breadth of natural DNA variation that characterizes the population and influences any and
all heritable genetic traits; 2) Assessing the function-altering genetic determinants that are
already linked to hereditary pathologies as well as new damaging variants that are yet
to be described; 3) Constructing a haplotype map of ethnic Russians for use in disease gene
discovery and for closing in on operative gene variation associated with complex
heritable diseases and traits in populations; 4) Inspecting with proven genetic methods the
relationships and natural history of indigenous Russian ethnicities. All the data will be
posted publicly on an open access web-site for interested scientists to inspect and utilize
in the context of world-wide human population analyses. Initially, we will not collect
clinical information, as has been proposed in several national studies; however, with
growing support for the concept and fulfillment of Genome Russia, we will recommend
such an expansion rather soon.
Genome Russia will embrace the medical, human genetics and anthropology
science communities across the Russian Federation. The project is timely, important and
feasible with available technology and computational power. There are tangible benefits
to completing this project and we detail these in Section VI of this document. The
Genome Russia Project can and should become an example of international collaboration
on common ground, with the common goal of improving human health and
well-being.
II BACKGROUND AND RATIONALE
1 Specific AIMS
Genes are the basic “instruction book” for the cells that make up our bodies, and
are made out of DNA. The DNA of a person is more than 99% the same as the DNA of
any other unrelated person. But no two people have exactly the same DNA except
identical twins. Differences in DNA are called genetic variations. They explain some of
the physical differences among people, and partly explain why some people get diseases
like cancer, diabetes, asthma, and depression, while others do not. Such diseases may also
be affected by factors like diet, exercise, smoking, and pollution in the environment,
which makes it hard to determine which genes affect the diseases.
The objectives of the project “Genome Russia” are to develop an open access
web-based database containing anonymous information on the whole-genome sequences
of at least 2,000 men and women originating from the different regions of Russia, whose
ancestors have been indigenous to the region for several generations; to describe
the genome variation in these groups; to detect the features that affect the spread
of diseases; and to create a database of medically relevant genomic variants
characteristic of the Russian population, which would be the basis for developing the
principles of future personalized medicine.
The data will be used for many purposes. However, the four immediate uses we
anticipate are to:
• Discover and catalogue new gene variants that are specific to Russian ethnic groups.
• Identify genetic variants that may affect the frequency of known diseases
across the Russian peoples.
• Develop a Russian population-based haplotype map (HapMap), required to identify
disease gene markers specific to diseases of high incidence in Russia.
• Interpret the patterns in the variability of human DNA to decipher historical
migratory routes and settlement of humans in Russia, Europe and Asia.
The research database developed within the framework of the project will not include any
personal information.
2 Purpose and Goals
The initial task is to gather blood samples from some 2500 Russian people, including
several hundred family trios (DNA samples of a child and both parents). The project will
create a national collection of genetic data that will engage researchers from other
educational institutions and research organizations. Genome Russia will reach across
Russian biomedical centers and join with the international 1000 Genomes Project, created to
uncover rare gene variants in different human populations. DNA from the Russian
volunteers will be subject to whole genome sequence assessment suitable for mining their
genomes for secrets of their past and their future.
The objectives of the project are the description of variations in the human
genome in different groups of the population of the Russian Federation, identification of
the features that affect the spread of diseases, and the creation of an information
base of medically significant genomic variants specific to the population of Russia, which
will be the basis for developing the principles of the medicine of the future. In other words,
knowledge of the genetic diversity of different groups of the population of the Russian
Federation, accumulated in the implementation of this project, will make it possible to
identify and determine the frequency of genetic determinants previously associated with
complex diseases among the population of the Russian Federation. These estimates can
also contribute to tracking historical migrations and human settlement on the territory of
the Russian Federation. Similar projects have previously been conducted in other countries,
and we now propose to conduct such a study in Russia.
More explicitly, the specific goals we propose to deliver with fulfillment of
Genome Russia include:
• A blood DNA biobank of human biospecimens from the major ethnic
and regional population groups that live in Russia today.
• Whole genome sequences from ~2500 people of divergent ethnic backgrounds, plus
genotyping of 160 trios (two parents and one offspring), to catalogue DNA single
nucleotide, indel, and copy number variation within ethnic Russian populations.
• Resolution of thousands of Russian-specific DNA variants not seen in other world
populations. These will be deposited with the international 1000 Genomes Project as
the first and only contribution from the Russian peoples.
• Explicit documentation of the pattern and distribution of common hereditary disease
gene variants across the Russian peoples, plus a catalogue of function-altering
genetic variants across 22,000 human genes in each individual.
• A Russian population-specific HapMap useful for disease gene association
discoveries among Russian disease cohorts to be built in the future.
• Mapping of the footprints of ancient geographic movements of the ancestors of
modern Russian peoples. This natural history analysis will relate modern Russian
ethnic groups to each other, to their ancestors, and to the archaic populations
of early mankind, including the Denisovans and Neanderthals, that
inhabited Russian lands.
• Innovative new bioinformatics analytical algorithms applicable to disease gene
discoveries, including an open-access public database releasing sequences,
genotypes and analytical discoveries.
3 Scientific Background
Mapping the unabridged pattern of human genetic variation across the world
represents one of the greatest exploration projects undertaken since the genomics era
began in 2001 with a published draft of the human genome. Driven by the availability of
samples and the technological advancement of next generation sequencing techniques, in
the last decade whole genome sequencing has scaled up from individual personal genome
projects to global surveys of genomic diversity, best represented today by the 1000 Genomes
Project (McVean et al., 2012; Auton et al., 2015). The latest release of this project is
expected to become the major global reference resource for human genetic variation, but it
is not a complete genome map of humankind (Auton et al., 2015).
The principal goal of this grand exploration was to uncover rare and local DNA
variation in modern ethnic populations in order to avoid the discovery bias in studies
of human disease based on geographically limited datasets originating in the developed
countries of North America and Europe (Figure 1, McVean et al., 2012). The common
variants, made available through dbSNP (http://www.ncbi.nlm.nih.gov/SNP), have
been used to develop genotyping arrays for important applications such as the
International HapMap, GWAS studies, and the Human Genome Diversity Project (HGDP;
Auton et al., 2009).
In the three years since the first 1000 Genomes consortium paper on human
diversity was published (McVean et al., 2012), attention has slowly shifted to national
population genome projects, notably of the Icelandic and British populations (Gudbjartsson
et al., 2015; Leslie et al., 2015), to help uncover the intricate natural history of these
nations’ populations. National genome projects are underway with the UK 100,000 Genomes
Project (Akst, 2012; Marx, 2015), an Asian Genome Project (Anderson, 2014), a Chinese
Million Genomes endeavor (Heger, 2015), an African Genome Sequence Variation project
(Gurdasani et al., 2015), along with whole-genome sequence population studies in the
Netherlands, Turkey, and Japan (Francioli et al., 2014; Alkan et al., 2014; Nagasaki et al., 2015),
all meant to inform medical and natural history questions.
Figure 1 Distribution of publicly available genome sequences. Worldwide locations of population samples with
whole genome data from the 1000 Genomes Project. Each circle represents the number of genome sequences
in the final release, publicly available at www.1000genomes.org. ASIA: BEB Bengali in Bangladesh; CDX Chinese Dai in
Xishuangbanna, China; CHB Han Chinese in Beijing, China; CHS Southern Han Chinese, China; GIH Gujarati Indian in
Houston, TX; ITU Indian Telugu in the UK; JPT Japanese in Tokyo, Japan; KHV Kinh in Ho Chi Minh City, Vietnam; PJL
Punjabi in Lahore, Pakistan; STU Sri Lankan Tamil in the UK. AFRICA: ACB African Caribbean in Barbados; ASW African
Ancestry in Southwest USA; ESN Esan in Nigeria; GWD Gambian in Western Division, The Gambia; LWK Luhya in Webuye,
Kenya; MSL Mende in Sierra Leone; YRI Yoruba in Ibadan, Nigeria. EUROPE: CEU Utah residents with Northern
and Western European ancestry, USA; FIN Finnish in Finland; GBR British in England and Scotland; IBS Iberian in
Spain; TSI Toscani in Italy. THE AMERICAS: CLM Colombian in Medellin, Colombia; MXL Mexican Ancestry in
Los Angeles, USA; PEL Peruvian in Lima, Peru; PUR Puerto Rican in Puerto Rico.
The dotted circles indicate populations that were collected in diaspora.
With the new population genomic data from targeted countries, medical research has
gained a new roadmap and greater power for disease variant discovery. In addition,
disease genome collections like the International Cancer Genome Consortium (ICGC) are
developing impressive global networks, bolstering collaboration and invigorating research
progress in complex disease therapy.
Looking at the world map with these dynamic developments in genome
sequencing of global populations, one cannot help but notice a great “wide gap” in
Eurasia (Figure 1). From the Baltic Sea to the Bering Strait, Russia remains the largest
swath of land (~10% of the earth’s land mass) for which the human genome
landscape remains relatively unexplored. Notably, even the larger population SNP array
genotyping projects such as HGDP (~52 populations sampled worldwide) and the
HapMap have scarce representation of ethnic groups in Russia (Figure 2; Auton et al.,
2009, 2015). Further, the European and East Asian population groups in the 1000
Genomes Project do not capture the rich background of genomic diversity in this part of
the world, partly because of differences in ancestry and partly because of the history
of admixture (Figure 1). Recent population genetic studies of Russian indigenous
populations have employed mtDNA, STR, Y-chromosome haplogroups and genome SNP
variants in certain regional ethnic populations, but little has been achieved with more
comprehensive whole genome sequencing of Russian people to date (Yunusbaev et al.,
2011; Salmela et al., 2011; Khusnutdinova et al., 2012; Khrunin et al., 2013; Kharkov et al.,
2013; Har’kov et al., 2014; Kushniarevich et al., 2015; Trofimova et al., 2015; Yunusbaev
et al., 2015; Balanovskaya et al., 2001a, b; see Appendix 13).
Figure 2 Eastern Hemisphere locations of population samples in surveys of worldwide genetic variation
(HapMap, 1000 Genomes Project, Phase 1, and HGDP).
The historic milestones that founded modern Russian populations include the
northward and westward expansion of the Indo-Europeans and the Uralic people, the
westward expansion of the Turkic people, and centuries of admixture between them
(Figure 3). The routes for peopling Northern and Central Europe inevitably led through
this territory; later, the waves of great human migrations of recorded history
pushed this way for centuries, followed by the great exchange of knowledge and
technology (and likely genes) along the Silk Road. The migrations of the last
millennia have created the complex patchwork of human diversity that is today’s Russia, and
somewhere in Siberia lie the ancestral roots of modern Native Americans.
Figure 3 Major human migration routes (adapted from Stewart and Chinnery, 2015) and locations of other
hominid remains out of Africa. The approximate locations of major Neanderthal and Denisovan finds are
indicated by glowing circles.
In the more distant past, gene exchange surely occurred between modern
Homo sapiens and the prehistoric Neanderthal and Denisovan populations they
encountered. The Neanderthal and Denisovan genetic contribution is not well studied
beyond Western Europe for the former or South East Asia for the latter, despite their
physical remains being unearthed in Siberia (Prüfer et al., 2014; Reich et al., 2010; Fu et
al., 2014). Do Russian populations contain ancestry components that are undetected in
populations represented in the 1000 Genomes or even in the comprehensive HGDP
database? Perhaps. Hence, Russia needs a national genome project of its own. Genome
Russia is a first step towards accomplishing these goals.
III STUDY DESIGN AND METHODOLOGY
Populations and ethnic groups in Russia to target (Peoples of Russia – Atlas, 2010).
The ethnic and racial diversity of the Russian population is an indisputable fact.
The 2010 census recorded 195 ethnic groups, for most of whom
Russia is the main territory of residence.
This diversity and ethnic admixture evolved over thousands of years as a
consequence of multiple migrations, mixing and ethnic separation. The sources of these
migrations were the Baltic region, the Balkans, Central Asia and the Far East. Many
ethnic groups have been formed on the territory of modern Russia (Figure 3). These
processes were influenced by the diversity of landscapes in the vast territory, which
includes tundra, taiga, deciduous forest, forest-steppe and also mountain and seaside
interzonal landscapes (Trubachev, 2003; Yakupov, 2010; Zavyalov et al., 2012; Makarov,
2015; Simchenko and Tishkov, 1999; Allentoft et al., 2015; Kushniarevich et al., 2015;
Molodin et al., 2014; Peoples of Russia – Atlas, 2010).
In genetic terms, the carriers of mutations adaptive to the stable or changing
environment would have spread as a consequence of migration and selection.
Most ethnic groups in Russia are not homogeneous panmictic populations.
Ethnographic and historical reconstructions have suggested mixed original founder groups,
a proposal still in need of confirmation. For example, the peoples who contributed to the
ethnogenesis of the Bashkirs were Ugric people (today the Khanty and Mansi are the
closest to the ancient Ugric ethnicity), various Turkic groups, and Mongols. The North
Caucasian peoples were formed by the indigenous population along with Turkic and Iranian
steppe nomads, as well as migrants from the territory of the Greater Caucasus.
Perhaps most challenging is resolving the ethnogenesis of the modern Russian people, with
contributions from the Slavs (people from the Balkans and contemporary Polish Pomerania),
the Letto-Lithuanian and Finno-Ugric peoples of the East European Plain, Turkic steppe
peoples and possibly the Mongols. During the period of Russian colonization of the Volga,
Urals, Siberia and the Far East, contacts with the local population resulted in the origin of
mixed (métis) groups, or aborigines admixed with the settlers.
The founder groups we mention were not homogeneous themselves. Furthermore,
historical data cannot resolve from which ethnic groups they derived, only that they were
heterogeneous, as indicated in both archaeological and anthropological records.
The complex processes that formed modern population groups did not occur
contemporaneously. New waves of admixture could take place quite far apart in time from
one another, but sometimes these waves overlapped. Migrations alternated with periods of
isolation; the processes of mutation in the population alternated with the process of
stabilization of the gene pool.
Such historic perturbations imply that there are no so-called pure
ethnic groups in genetic terms, and there cannot be. This statement raises a certain
research problem. In the first stage we should identify the most common alleles and thus
reconstruct the overall scale of variation of the gene pool of all the peoples in Russia. In
the second stage, we will identify the common and local rare variants private to separate
ethnicities in Russia. These two tasks must be pursued in parallel: each
newly investigated group should enhance our understanding of the total variability of
the genomes in Russia and characterize each particular group.
Research objectives of the ethno-genetics study would address the following
important questions:
1. What drives the close genetic relationship of modern populations: territorial
proximity, affiliation to one ethnic group, or other factors?
2. Does sympatric co-occurrence in the same locale lead to convergence of genomic
variants, regardless of ethnicity?
3. Which ethnic groups are closely related to each other and which are very different?
4. How does the population genetic structure of the Russian people relate to other
nations of Europe, Asia and America (both indigenous Indian and Inuit peoples and
European newcomers)?
5. Are the gene pools of individual nations isolated or do they show evidence of
admixture with neighboring populations?
The answers to these questions will present a picture of the modern gene
composition of Russia, and will also shed light on the ancestral origins of contemporary
ethnic groups.
Strategy of population sampling for Genome Russia: The most numerous and
widely distributed Russian ethnicities should be evaluated at multiple sampling sites. For
“ethnic Russians”, we propose to sample at 12-15 sites. For the other large
ethnicities (more than 1,000,000 people) 3-5 sampling sites would suffice. For ethnic
populations of less than 1,000,000 people (e.g. representatives of the Urals, the northern
European part, the North-West and North-East Caucasus, Altai, and the Siberian taiga) we
will sample 2 locales. From the smaller ethnicities we suggest selecting only those that are
the least assimilated to date, i.e., the peoples living in the taiga and tundra of Siberia and
the Caucasian highlands (Appendices 1, 2a, 2b).
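The site-allocation rule just described can be written down compactly. The following Python sketch is illustrative only: the function name `sampling_sites` and the flag `is_ethnic_russian` are hypothetical and not part of the project protocol. Only the thresholds (12-15 sites for ethnic Russians, 3-5 for other ethnicities of more than 1,000,000 people, 2 for smaller ones) are taken from the text. The text does not say how a group of exactly 1,000,000 people is treated, so the sketch assumes it falls in the smaller class.

```python
from typing import Tuple

# Population cut-off between "large" and "small" ethnicities, as stated in the text.
LARGE_ETHNICITY_THRESHOLD = 1_000_000


def sampling_sites(population: int, is_ethnic_russian: bool = False) -> Tuple[int, int]:
    """Return the (minimum, maximum) number of sampling sites proposed for a group.

    Hypothetical helper: it only encodes the ranges quoted in the proposal.
    """
    if population <= 0:
        raise ValueError("population must be a positive number of people")
    if is_ethnic_russian:
        return (12, 15)  # ethnic Russians: 12-15 sites
    if population > LARGE_ETHNICITY_THRESHOLD:
        return (3, 5)    # other large ethnicities: 3-5 sites
    # Assumption: a group of exactly 1,000,000 is treated as "small".
    return (2, 2)        # smaller ethnicities: 2 locales


# Example with made-up population sizes.
print(sampling_sites(5_000_000))  # (3, 5)
print(sampling_sites(40_000))     # (2, 2)
```

The rule returns a range rather than a single number because the proposal leaves the exact number of sites per group open at this stage.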
In Stage 1 we will collect samples to get a general impression of the
genome of the Russian population as a whole. We propose to select the largest ethnic
groups and sample along the perimeter of their ranges. Thus, for ethnic Russians it would be
useful to collect samples from the four principal regions of Russia (Appendix 1, Figure 4):
• Northern Russians from Arkhangelsk.
• Western Russians from Pskov, Novgorod, and a group formed in the first half of the 2nd
millennium AD and recently consolidated from many sources settling in the St. Petersburg
metropolitan area.
• Southern Russians of the Rostov, Voronezh, Krasnodar and Oryol regions, formed initially in
the 16th-18th centuries. These are quite stable groups. Russians of the Central Russia
regions are of less interest, since they are close to southern Russians and the stability of
their populations is very low, as it is an industrial zone.
• Russians in Siberia and the Far East, namely from the Omsk and Krasnoyarsk regions and
Primorsky and Khabarovsk Krais, in groups founded in the 18th-20th centuries with 20th
century introgression from European Russian groups such as the Stolypin settlers, who
established their separate villages maintaining internal marital ties.
At this stage, to a greater extent than in subsequent stages, relatively closed groups must
be chosen, remote from industrial centers and highways, as the latter are points of
migration-driven population growth.
For Bashkirs we will sample three sites in Bashkortostan on the perimeter of the region,
mainly where Bashkirs live in compact groups, in the northeast, northwest and south of the
republic. In addition we will sample a few groups located outside the main range of
settlement to detect locally derived alleles.
From these Stage 1 samples of Russians and Bashkirs we hope to develop a
preliminary outline of the population distinctions among the peoples in Russia. In addition,
it may be possible to delineate the geographical boundaries of the Russian gene pool from
its most western point, the Pskov region, to one of the most eastern, Khabarovsk Krai. It
should also be possible to identify latitudinal differences, from high latitudes in the
Arkhangelsk region to relatively low latitudes in the Don region.
In Stage 2 a more detailed, focused analysis of Russians and the minority ethnicities of
Russia will be conducted (Appendix 2a, 2b). Initially, to develop a detailed picture of
European Russia we shall sample populations in the Central Non-Black Earth Zone as
well as from the Vladimir, Yaroslavl and Nizhny Novgorod regions. Further, ethnic
populations from several regions will be sampled: the southern Russians from the Krasnodar,
Rostov and Voronezh regions, Russians in the Kama region and the Urals, and Russians from
the Udmurtia and Sverdlovsk regions. We shall also gather samples from indigenous
populations in Eastern Siberia, especially in its southern taiga, steppe tundra, and northern
areas, and in the Krasnoyarsk region (Figure 4).
Figure 4 Some locales of Russian ethnic groups selected for sample development in Genome Russia
Two interesting ethnic groups who live throughout the country are Ukrainians and
Tatars. The population structures and the ethnogenesis of these two groups are
fundamentally different. Ukrainians are recent immigrants to the territory of Russia
(18th-20th centuries); their ethnogenesis is inextricably linked to the ethnogenesis of
Russians. Russians had closer contact with Ukrainians than with
Poles, Lithuanians, and steppe Tatars, but less close contact than with
Finno-Ugric peoples, while sharing a main East Slavic component. We propose to sample
the following areas of concentrated residence of Ukrainians:
1. Belgorod region (the group was formed in the 17th-18th centuries);
2. Krasnodar region (18th-19th centuries);
3. Omsk region, in the Asian part of Russia; and
4. Primorsky Krai (19th-20th centuries).
The Tatars result from contact between settled Turkic peoples (Bulgars), nomadic Turks
(the Tatars proper) and indigenous Finno-Ugric peoples. On a genetic level Tatars were in
contact with Russian colonists of the Volga-Kama region. We shall sample three groups
along the "perimeter" of Tatarstan and one group in the diaspora in the Novosibirsk
region.
Additional ethnic groups we shall sample are: the Karelians (West), Khanty, the
Mordovians, Chuvash (Central), Mansi, and Komi (North), Udmurt (Volga-Kama area),
Ossetians, Kabardians and Adygeis (Northwest Caucasus), Nganasans, Chechens
(Northeast Caucasus), Khakas (Altai), Yakut (Siberian taiga), Ulchi, Udege, Nanai (Far
East).
Komi is made up of a mixed Russian population that moved north-eastward
from Moscow. Comparison of the Komi gene pool with the gene pool of
the Russian population in the Arkhangelsk region will determine the degree of admixture
between the Slavs and the Finno-Ugric peoples. We shall sample two groups,
the taiga group in the Ust-Kulom district and the forest-tundra group in the Izhemsky district.
The Udmurts have been isolated longer than any other Finno-Ugric people of the Volga
and Urals. Their gene structure may reflect an isolated “island population” level of
differentiation. We shall investigate samples from the northern and southern Udmurts.
The population of the northwestern Caucasus traditionally had contacts with the
steppe peoples, and the population of the northeastern Caucasus with the peoples of
Transcaucasia and the Greater Caucasus, as far as Asia Minor and the Iranian plateau. It
makes sense to test the hypothesis of divergence of the gene pools of these groups. On the
other hand, one should not exaggerate the degree of the highlanders’ contacts with the
outside world. From these peoples we shall collect samples from Adygeis, Chechens and
Kabardians (two sampling sites per ethnicity).
The Turkic peoples of Siberia have been quite isolated from each other because of the
enormous distances. We propose to explore a mountain Turkic group, the Khakas (whose
ethnic contacts for a long time were with the Mongols to the south and with other
nomadic peoples of the Altai), and one taiga people of Yakutia (whose interethnic
connections had been limited to Tungus and Paleo-Asiatic peoples).
The last group of people to sample represents small isolated ethnic populations.
Their isolation is determined by both geography and the adverse environmental conditions
of their habitat. Selected populations include: Nenets of the Yamal tundra, Nganasans of the
Taimyr tundra, Khanty of the Ob, Tsakhurs of southern mountainous Dagestan, and
Chukchi inside the Arctic Circle.
Stage 3 will fill the gaps on the map of sampled locations, from the Russian North
and Siberians to Pomerania and Indigirka peoples. Such an approach would offer wide
coverage of modern populations across the eleven time zones of today’s Russia.
Patient recruitment and blood collection
IRB review and approvals
The Ethics Committee of SPSU has approved the informed consent document
developed by the scientists of the Dobzhansky Center (Appendix 3). In the informed
consent document (Appendix 4) we invite potential participants to be part of the Genome
Russia Project, explaining that it will develop a research resource that researchers around
the Russian Federation and the world will use. The informed consent document states:
“The overall objective of this project is to describe variations in the human
genome in the genetic heterogeneity of the Russian population, to determine
characteristics that influence spread of diseases, as well as to create a database of
medically significant genomic variants specific to the population of Russia, which
will provide a basis for developing medically significant genome variants in
future. The specific objectives of the project are:
a. Collection of blood samples and DNA extracted from the blood samples,
which will be kept in a repository and distributed to researchers for use in
future projects
b. Data from the study of the samples, which will be kept on scientific
databases available over the Internet. The resource will be used in many
future studies related to health and disease.”
Researchers in several Russian institutions (including St. Petersburg State
University; Institute of Molecular Genetics, Center of Neurology, Russian Academy of
Sciences, Moscow), plus researchers from several other countries (Appendix 5, 6), are working
together to develop this resource. Saint Petersburg State University is the principal
sponsor of the project, and the Theodosius Dobzhansky Center for Genome
Bioinformatics of St Petersburg State University is the coordinator of the research
consortium established to carry out this project (Appendix 5). Several scientific and
research centers are members of the consortium.
Also in the Informed Consent we mention that this project will include obtaining
DNA from at least 2000 men and women from different parts of Russia, whose ancestors
were indigenous to the region for several generations. In order to take part, the study
participant must:
• be at least 18 years of age;
• be willing to give a blood sample of ~7 ml so that researchers can read out all of the
donor’s genetic information from it (a process called “sequencing”, or decoding
the complete genome sequence);
• be willing to have all of the donor’s genetic information (without the donor’s name or other
traditional identifying information, such as address, birth date, passport
identification or Social Security number) put in scientific databases available on
the Internet for scientific research;
• be willing to have many researchers around the Russian Federation and across the
world study the genetic material and data from the sample for a long time, and to
have the information they learn put in scientific databases on the Internet.
Theodosius Dobzhansky Center for Genome Bioinformatics of St Petersburg
State University will not collect donors’ names or any medical information. Researchers
who study the material and data from the samples will be told only the sex of each donor
and which ethnic or geographic group the donor came from.
The unit of blood sample collection is the family trio, i.e. two biological
parents and their adult child. Ideally, we aim to collect blood samples from
about 20 trios (60 individuals) from each geographic location under study.
Before taking the blood samples the researcher should be convinced that the donor
is at least 18 years old, explain to him/her the essence of the project, and answer all his/
her questions. The participant must sign the Informed Consent form and answer the
questionnaire where we ask about his/her origin and ethnicity. Ideally all four
grandparents should be from the same district and share the same locality and ethnicity.
Patient recruitment study design- requirements for the selection of
study participants
All members of the trio should be:
• Not younger than 18 years of age.
• Healthy: at the time of sampling, participants should not have serious chronic illnesses
(as reported by the participants; a diagnosis is not required).
• Ethnically homogeneous, originating from a particular region, including
grandparents on both sides.
• If there is doubt or a lack of information about at least one of the ancestors, the
family cannot be included in the study.
• The members of a family trio selected for the study should be biological relatives:
o both parents in the trio must be the biological parents of the child.
• Families selected for the study should not be related to other trios:
o a member of a family under study must not have parents, children,
sisters, brothers, grandparents, cousins, aunts or uncles in other trios
selected for sample collection.
• All participants must provide the information necessary to complete the
questionnaire and sign the voluntary informed consent.
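The selection requirements above can be expressed as a simple screening check. The following is a minimal sketch only: the `Participant` fields, the function names and the problem messages are hypothetical and are not part of the project's actual questionnaire or database, and relatedness between different trios is not modeled here.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

MIN_AGE = 18  # minimum age stated in the selection requirements


@dataclass
class Participant:
    age: int
    serious_chronic_illness: bool  # self-reported; no diagnosis required
    region: str
    ethnicity: str
    # One (region, ethnicity) entry per grandparent; None marks an unknown ancestor.
    grandparents: List[Optional[Tuple[str, str]]]
    questionnaire_complete: bool
    consent_signed: bool


def participant_problems(p: Participant) -> List[str]:
    """Return the selection requirements that this participant fails."""
    problems = []
    if p.age < MIN_AGE:
        problems.append("younger than 18")
    if p.serious_chronic_illness:
        problems.append("reports a serious chronic illness")
    if len(p.grandparents) != 4 or any(g is None for g in p.grandparents):
        problems.append("missing ancestry information")
    elif any(g != (p.region, p.ethnicity) for g in p.grandparents):
        problems.append("grandparents not from the same region and ethnicity")
    if not (p.questionnaire_complete and p.consent_signed):
        problems.append("questionnaire or informed consent incomplete")
    return problems


def trio_is_eligible(mother: Participant, father: Participant,
                     child: Participant,
                     biological_parents_confirmed: bool) -> bool:
    """A trio qualifies only if all three members pass every check."""
    if not biological_parents_confirmed:
        return False
    return all(not participant_problems(p) for p in (mother, father, child))
```

Returning the list of failed requirements, rather than a bare yes/no, lets the field team record why a family was excluded.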
Patient Recruitment Expeditions
Before each expedition we arrange a letter of support for our project from the
SPSU rector to the governor of the region we will visit. On the first day after arriving at our
destination we visit the city administration and introduce ourselves, explaining the
importance of our project and the purpose of the expedition. We ask them for help with the
local regional hospital and with local authorities. We ask the chief medical officer of the
hospital to nominate a nurse, who travels with us.
With the help of the city’s authorities we ask for access to the city archives
that contain registration information about the local citizens. During the first day, we work
in the city register, collecting information about local citizens.
Our strategy is then as follows. First, we split into several groups, each with one car.
In the morning, we assign several villages to each group and travel to them
separately. When we arrive, we first try to find an old-timer or long-term resident in each
village and ask them who is an indigenous resident of the village, whose grandparents on
both sides were born in the district. We give a one-page flyer describing the project in
simple words; then, if the resident shows interest in the project, we ask if s/he would be
willing to donate 7 ml of his/her blood for the project. We also ask if s/he could
recommend any other indigenous resident(s) in the local area. We fill out the
questionnaire, explain the informed consent document and have it signed. The questionnaire,
informed consent and the tube for the blood sample are labeled with the same bar-code
sticker in order to log them into the computer, where personal data are not recorded (only
location, ethnicity, age and gender are recorded). In the morning we gather several suitable
volunteers with signed consent, and in the afternoon we pick up the
nurse from the hospital and travel again to the villages where the volunteers
who have filled out forms are waiting for us. In some cases we drive volunteers to the
hospital to obtain the blood sample. The nurse takes a blood sample from each volunteer
and we transport it back to the hospital, where one of our employees mixes the samples with
pre-prepared TES buffer and then disposes of the used vacutainers in biohazard bags.
Our logistics are thus well developed and carefully thought out. As a sign of
our appreciation we give each participant a box of chocolates, which is very well
received. We strive to collect samples from around 60 individuals in each district.
We have learned many lessons from our first expedition and understood how it will
work in remote sites. We learned that:
• To identify people with a pure ethnic background (i.e., not with admixed genotype
from distant districts or nationalities) it is important to talk to residents and find an
indigenous resident who knows and remembers their background and can indicate
how to find the right people to include in our study. To convince people to participate
in the project, a 15-20 min conversation is very important. People like to talk about
themselves, so researchers should not immediately talk about the project, especially
about blood sample collection, but should listen to the prospective participant talk about his/her life
and family stories. After a sense of trust has been established, we gently steer the
conversation to the importance of the project, emphasizing the special value of the
genomes from the district under investigation so that the volunteer can realize his/her
importance to the project. We then explain the project in more detail, confirm that the
volunteer is suitable for the project (having them fill out a form on their family
genealogy), and finally explain the meaning of the informed consent form and have
them sign it.
• Several groups of researchers working in parallel collect samples more efficiently.
• In rural regions rumors spread quickly, so researchers should be careful as to what
they say to volunteers.
A draft timetable for sample collection expeditions is presented in Appendix 13
DNA extraction and processing methodology.
Specimen collection, transport, storage and DNA extraction methods can
contribute significantly to accurate whole genome sequencing results (Vaught, 2006;
Troyer, 2008; Shabihkhani et al. 2014). Our laboratory has developed standard protocols
to minimize the undesirable effects of pre-analytical variables on each of these steps. The
protocols are described below.
Sample collection and transportation. Blood is collected from participants into
10 ml vacutainer tubes with EDTA according to the protocol. All vacutainers are stored at
4°C until blood is transferred into 15 ml tubes with TES (Tris-EDTA-SDS) buffer in a 1:1
ratio for transportation to the laboratory.
Sample coding and database. Each vacutainer and transport tube has the
participant's unique identifier number and barcode according to database records. All
information is anonymous - no personal data is present on the tube labels or in the
database.
Sample aliquoting. When the vacutainers arrive at the laboratory, they are
processed according to standard protocols. The first step is aliquoting. Aliquoting is
necessary to preserve multiple samples and to avoid repeated freeze/thaw cycles. Each blood sample
produces 14 aliquots of 1 ml volume.
Aliquot coding and database. Aliquots are frozen in tubes with an alphanumeric
identifier and a barcode, which allows them to be identified in the database. The database
contains the following information about each aliquot: 1) participant ID, 2) tube ID, 3)
tube storage location, 4) type of biomaterial, 5) volume.
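As an illustration, the five aliquot fields listed above could be modeled as a small record type. This is a sketch under stated assumptions, not the laboratory's actual database schema: the class name, the tube ID format and the default biomaterial label are hypothetical; only the five fields, the 14 aliquots per sample and the 1 ml volume come from the text.

```python
from dataclasses import dataclass
from typing import List

ALIQUOTS_PER_SAMPLE = 14  # stated in the text
ALIQUOT_VOLUME_ML = 1.0   # stated in the text


@dataclass(frozen=True)
class Aliquot:
    participant_id: str    # anonymous identifier, no personal data
    tube_id: str           # alphanumeric identifier encoded in the barcode
    storage_location: str
    biomaterial: str
    volume_ml: float


def make_aliquots(participant_id: str, storage_location: str,
                  biomaterial: str = "blood + TES buffer") -> List[Aliquot]:
    """Create the aliquot records for one blood sample.

    The tube ID format (participant ID plus a two-digit suffix) is an
    assumption made for this example only.
    """
    return [
        Aliquot(
            participant_id=participant_id,
            tube_id=f"{participant_id}-A{i:02d}",
            storage_location=storage_location,
            biomaterial=biomaterial,
            volume_ml=ALIQUOT_VOLUME_ML,
        )
        for i in range(1, ALIQUOTS_PER_SAMPLE + 1)
    ]
```

Keeping the records immutable (`frozen=True`) reflects the idea that a tube's identity should not change once its barcode has been printed.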
Sample storage conditions. Blood with TES buffer is stored at 4°C for short-term
storage. For long-term storage, blood samples are preserved by deep-freezing at -80°C.
This method prevents DNA degradation in the sample. Furthermore, a portion of each
sample that can be delivered to the laboratory in 24 hours will be cryopreserved in liquid
nitrogen in order to maintain viable cells in the presence of a cryoprotectant (DMSO).
Subsequently, cryopreserved samples may be used to derive lymphoblastoid cell lines.
Extraction of high molecular weight DNA
The objective of this stage of the project is to extract at least 10 µg of high
molecular weight DNA from each blood sample for further whole-genome sequencing
and storage.
Human genomic DNA for whole-genome sequencing will be isolated from collected
blood samples using the following cost-effective methods:
1) A magnetic bead-based method using automated nucleic acid extractor MagCore
HF16
2) A silica-membrane-based method using QIAGEN QIAamp DNA Blood Mini/Midi
Kits
These techniques yield consistently high-quality DNA
preparations, are safe, and allow us to extract DNA without using phenol and
chloroform (Riemann et al 2007).
The automated nucleic acid extractor MagCore HF16 is a robotic desktop system
for nucleic acid extraction from different materials: blood, cell cultures, body fluids,
tissues, plants etc. Nucleic acid extraction is based on magnetic separation of MagCore
particles (beads) covered with cellulose. Automation of work with blood will significantly
increase safety, reduce the probability of random errors, and increase the accuracy of
DNA extraction. Human genomic DNA will be extracted from 1200 µl of blood mixed
with TES (Tris-EDTA-SDS) buffer in a 1:1 ratio. The MagCore HF16 system allows us to
extract up to 31 µg of genomic DNA of approximately 20-30 kb in length suitable for
whole-genome sequencing.
QIAGEN QIAamp DNA Blood kits are well known and allow extracting high-quality
DNA without organic extraction or alcohol precipitation. The QIAamp DNA
Blood Mini Kit provides silica-membrane-based DNA purification. The QIAamp DNA
Blood Mini Kit is designed for processing up to 200 µl of fresh or frozen human whole
blood, while the QIAamp DNA Blood Midi Kit allows for the processing of up to 2 ml
fresh or frozen human whole blood. The QIAamp DNA Blood Kits yield DNA sizes from
200 bp up to 50 kb, depending on the age and storage of samples. The typical yield from
200 µl healthy whole blood is 4–12 µg, and from 1 ml the yield is 20-60 µg of DNA.
DNA quality control
Extracted DNA will be processed using quality control procedures:
1) Spectrophotometric analysis with the Nanodrop system to estimate RNA and
protein impurities. A 260/280 nm ratio in the range 1.8-2.0 is required.
2) Accurate DNA quantification using the Qubit fluorospectrometer.
A minimum of 10 µg DNA is required.
3) Gel electrophoresis in 0.8% agarose gels to analyze the DNA size
distribution. DNA fragments should be approximately 20-40 kb in length, without
smear.
DNA samples that pass these three QC steps will be used for whole-genome sequencing
and storage.
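The three QC criteria can be combined into a single pass/fail check, sketched below. The thresholds are those given in the text; the record layout, field names and failure messages are assumptions made for illustration, and reducing a gel image to one approximate fragment size plus a smear flag is a simplification.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DnaQcRecord:
    a260_280: float     # Nanodrop absorbance ratio
    total_dna_ug: float # Qubit quantification, total yield in micrograms
    fragment_kb: float  # approximate main band size read from the agarose gel
    smear: bool         # True if the gel lane shows a smear


def qc_failures(rec: DnaQcRecord) -> List[str]:
    """Return the QC steps that a DNA sample fails (empty list means pass)."""
    failures = []
    if not 1.8 <= rec.a260_280 <= 2.0:
        failures.append("260/280 ratio outside 1.8-2.0")
    if rec.total_dna_ug < 10.0:
        failures.append("less than 10 ug DNA")
    if not 20.0 <= rec.fragment_kb <= 40.0 or rec.smear:
        failures.append("fragment size outside 20-40 kb or smear present")
    return failures


def passes_qc(rec: DnaQcRecord) -> bool:
    """A sample goes forward to sequencing only if all three steps pass."""
    return not qc_failures(rec)
```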
DNA sample storage conditions. For long-term storage, DNA samples are
preserved by deep-freezing at -80°C. This method prevents DNA degradation over a long
time.
DNA sample coding in database. DNA samples are stored in tubes with an
alphanumeric identifier and a barcode, which allows them to be identified in the database.
The database contains the following information about each sample: 1) participant ID, 2)
tube ID, 3) tube storage location, 4) volume, 5) concentration, 6) A260/280, 7) integrity.
No personal data are placed on the tube label or in the database.
Laboratory quality management
To ensure the quality of the blood samples and extracted genomic DNA, the following
practices will be implemented in the laboratory:
• standard operating procedures are written for each step of sample processing and
DNA extraction
• automation of sample aliquoting and DNA extraction reduces the risk of pipetting
errors
• the system of tube labeling with barcodes reduces the risk of labeling and processing
errors
• the computer database allows us to manage all data related to the samples at each step
and reduces manual transcription errors.
4.) Lymphoblastoid (LCL) cell line establishment and the cell line
biobank development
To yield large amounts of DNA for many genotype analyses and to provide a
renewable source of DNA, it is necessary to harvest DNA and peripheral blood
mononuclear cells (PBMCs) from individuals and their family members in several regions
of Russia, to develop LCL cell lines from each individual, and to establish the biobank of
these cell lines for their storage, maintenance and usage.
a.) Why do we need to obtain the lymphoblastoid cell lines? Obtaining lymphoblastoid
cell lines for each blood sample, which is used for DNA sequencing, is an integral part of
the Genome Russia Project and is absolutely essential for several reasons:
1. Repeated blood sampling of individuals is costly and not always possible.
Generating lymphoblastoid lines is the best and only way to preserve, store and
replenish the genetic material of a particular person.
2. Cell lines are an inexhaustible resource of genetic material that allows the study of
human genetics and genomics for scientific and medical purposes. The cell lines will
allow researchers to follow up with more detailed studies of cellular phenotypes
such as gene expression, epigenetic patterns, and drug response. Extensive
genotype data will be available on these samples, and the trio samples from the
Russian population will allow researchers to computationally map regions of the
genome that affect the cellular phenotypes, and to study the heritability of
these phenotypes. See http://ccr.coriell.org/sections/Collections/NHGRI/hapmap.aspx?PgId=266&coll=GM
3. Preparation and storage of human genetic material in the form of a biobank of
lymphoblastoid cell lines is a requirement for participation in the international Human
Genome Project. They are necessary for verification and validation of the results and
for communication and exchange between the research groups.
b.) Methods to develop the lymphoblastoid cell line
To develop the cell lines, we need to obtain a blood sample from the
donor and isolate the B-lymphocyte fraction of the blood. Then, using a virus,
specifically the Epstein-Barr virus (EBV), these B-lymphocytes are transformed
into a cell line. There are many protocols developed for these procedures; we will use the
protocol outlined in Appendix 7, developed at the US National Cancer Institute
Laboratory of Genomic Diversity and used for transforming samples from over 5000 patients in
HIV/AIDS, nasopharyngeal carcinoma, HBV, HCV and other complex disease gene cohort
studies (O’Brien and Hendrickson 2013; Svitin et al 2014).
Human blood contains many types of cells that perform different functions - from
the transport of oxygen to the production of antibodies. Blood cells are divided into red
and white cell types - erythrocytes and leukocytes. Erythrocytes carry oxygen and carbon
dioxide associated with hemoglobin. Leukocytes fight infection (immunity) and digest
remnants of broken cells, etc. In addition, the blood contains a large number of platelets,
which are involved in blood clotting.
White blood cells are divided into three main groups: granulocytes, monocytes
and lymphocytes. The lymphocytes are involved in the immune response and are
represented by two main classes: 1) B-lymphocytes produce antibodies, 2) T-lymphocytes
kill virus-infected cells, and regulate the activity of other leukocytes. Some of these cells
operate solely within the circulatory system, while others are used only for transport, with
functions performed in other tissues. However, the life cycle of all blood cells is similar to
some extent in that 1) their life cycle is limited; 2) in the body they are continuously
formed.
Unfortunately, blood cells do not grow outside the human body. To give them the
ability to proliferate in culture, it is necessary to transform B-cells with a virus. One of the
most commonly used viruses for this purpose is the Epstein-Barr virus (EBV), which
transforms human B-lymphocytes into stable cell lines.
The method described generates LCLs from donor peripheral blood with rapid
immortalization and cryopreservation times. Through the use of FK506, a T-cell
immunosuppressant, and high titers of infectious virus, we are able to promote
proliferation of EBV-infected B-cells from peripheral blood mononuclear cells. These
interventions make the described method more efficient, resulting in the rapid expansion
of cells for subsequent experiments.
The transforming activity of certain strains of virus is extremely high, reaching
90-100%. Lymphoblastoid cells are characterized by certain features: they are easy to
handle, maintain a diploid karyotype and multiply at a high rate, including in large-scale
cultures. Until recently, almost all human lymphoblastoid cell lines were characterized as
B-type cells and carried Epstein-Barr virus genetic information, even if they did not
produce the viral antigens. Subsequently, T-type lymphoblastoid cells were also prepared.
Figure 5. Workflow for generation and cryopreservation of lymphoblastoid cell lines. Peripheral blood is
centrifuged through a Ficoll gradient. PBMCs present in the buffy coat of the established gradient are collected,
followed by addition of EBV. EBV-exposed cells are grown at 37°C in the presence of 5% CO2 to establish and
subsequently expand LCLs for cryopreservation.
Since the first reports of stable lymphoblastoid cell lines, their
numbers have increased rapidly, and they are now extensively used in medicine, cell biotechnology
and genomics. Laboratory employees are experienced with the methods of blood
fractionation and have all the necessary protocols to isolate and store the fraction of
B-lymphocytes. The lymphoblastoid cell cultures are a specific kind of stable cell line.
They tend to grow in suspension. The cells have a rounded shape, proliferate without being
attached to the walls of the culture vessel, and in a stationary culture they do not form
aggregates.
EBV can be obtained from infected mammalian cells. The most common
source is the marmoset cell line B95-8, which can be purchased from Sigma,
USA. This cell line grows and produces large amounts of virus inside the cells. It
is then necessary to isolate and purify the virus and to use it to infect the
isolated B-cell fraction of the patient's blood cells. The protocols for this procedure are
available in the laboratory and our laboratory staff have been extensively trained abroad
to succeed in these procedures.
Moreover, we will acquire the EBV-containing B95-8 cells from the biobank of
The Gamalei Institute of Virology in Moscow, which have already been expanded and frozen
and are ready to be shipped. However, for this project we will need to purchase an additional
laboratory incubator, a water bath for heating the medium, a centrifuge, and an inverted
microscope to examine the cells. Moreover, storage of cells requires an additional
biobank cryopreservation system and a liquid nitrogen storage unit for the cells.
Protocol for isolation and transformation of PBMCs (See Appendix 7)
PBMCs (Peripheral blood mononuclear cells) are isolated from whole blood by
standard ficoll–hypaque density gradient centrifugation. Briefly, approximately 10 mL of
heparinized, plasma-reduced blood is diluted with Hank’s buffered salt solution (HBSS;
1:2 dilution). Then, 15 mL of ficoll is covered with a layer of diluted blood (30 mL). After
30 min of centrifugation (2000 rpm, room temperature (RT)), the PBMCs are collected.
After two washing steps and cell counting, the PBMCs are prepared for transformation
with Epstein-Barr virus (EBV) added directly after isolation of the PBMCs. Alternatively,
the isolated PBMCs are frozen and stored in a liquid nitrogen freezer for future batch
transformation. PBMCs are frozen in FBS containing 10% dimethylsulfoxide (DMSO).
The protocols for transformation of cells are similar whether the PBMCs are transformed
after isolation or after storage in liquid nitrogen (details are provided below).
For transformation of previously frozen PBMCs, cells are thawed and washed in
10 mL of pre-warmed HBSS to remove all traces of the cryo-protectant in the freezing
medium. Following centrifugation at 300 g for 5 min, the supernatant is discarded; the
pellet is then re-suspended in 1 mL of complete medium (RPMI 1640, 10–20% heat
inactivated FBS, 1% penicillin–streptomycin, and 0.5% normocin or 0.1% gentamicin),
and transferred to a 25 cm² flask containing 1.0–2.0 mL of EBV supernatant and 1.0 mg
of cyclosporine (CSA) per mL. Approximately 6–7 × 10⁶ cells are used for the
transformation of both thawed and freshly isolated PBMCs.
Freshly isolated PBMCs are suspended in 14 mL of complete medium (RPMI
1640 with Glutamax, 10% heat inactivated FBS, 1% penicillin–streptomycin, and 0.5%
normocin) in a 15 mL Falcon tube, centrifuged at 350 g for 10 min, and the supernatant
is discarded. The cells are then re-suspended in 2.5 mL of EBV supernatant and 2.5 mL of
complete medium, mixed carefully, incubated for at least 3 h (37°C; 6% CO2), and
transferred to a 25 cm² tissue culture flask. CSA (at a final concentration of 1 mg/mL) is
used to suppress growth of T-lymphocytes. The empty 15 mL Falcon tube is rinsed with 5
mL of CSA-containing medium before transferring the CSA medium to the cells in the
flask; and then the 10 mL flask is placed in an incubator. Alternatively, cells are
resuspended in 4.0 mL of complete medium supplemented with 5 mg/mL of
phytohemagglutinin-M (PHA-M) instead of CSA and then transferred to a 25 cm² tissue
culture flask. EBV supernatant (1 mL) is added to the flask, mixed carefully, and then the
flask is placed in a humidified incubator (37°C; 5% CO2).
The flasks are kept in a humidified incubator at 37°C and 5–6% CO2 throughout
the culture period. They may be left undisturbed for the first 21 days, or may be subjected
to additional procedures and/or observations during this time. In the latter case, on day 5,
0.3 mL of PHA solution (100 mg/mL) may be added to the flask to augment the
suppression of T-lymphocytes. If the cultures are periodically examined during the first
3 weeks of incubation, they are first checked at day 5–7 by inverted phase microscopy for
bright refractile clumps of cells (post-setup check). If a significant number of
clumps is present, 1–3 mL of complete medium (including 5 mL of CSA per mL) is added to
the flask, depending on the number of clumps, and the flask is returned to the incubator. If
very few clumps of cells are visible, no medium is added, and the flask is returned to the
incubator to allow further growth before repeating the post-setup check.
After 28–35 days of incubation, cell cultures are checked for sufficient cell
numbers and split into two portions, one for freezing (one or more stock aliquots) and one
for DNA extraction. The remaining cells (1–20 mL, depending on final culture volume)
are returned to a 25- or 75-cm² flask for expansion to produce a sufficient number of cells
for DNA extraction or freezing. Traditionally, growth transformation has been monitored
by visualization of clusters of cells by light microscopy about a week after exposure to
EBV (6, 15). However, clustering of cells is not a specific indicator of EBV-mediated
growth transformation. We have previously demonstrated consistent identification of the
proliferating cell population via flow cytometry, providing an accurate and specific
method to determine successful outcome as early as three days after exposure of B-cells
to EBV.
Abbreviations LCL Transformation
1. CSA - cyclosporine
2. DMSO - dimethylsulfoxide
3. dsDNA - double stranded DNA
4. EBV - Epstein-Barr virus
5. EDTA - ethylenediaminetetraacetic acid
6. FBS - fetal bovine serum
7. HBSS - Hank’s buffered salt solution
8. HLA - human leukocyte antigen
9. LCL - lymphoblastoid cell line
10. MHC - major histocompatibility complex
11. OPAs - oligonucleotide pool assays
12. PBMC - peripheral blood mononuclear cells
13. PBS - phosphate-buffered saline
14. PCR - polymerase chain reaction
15. PHA - phytohemagglutinin
16. RBC - red blood cell
17. RT - room temperature
18. SDS - sodium dodecyl sulfate
19. TE - Tris EDTA buffer
5.) Whole Genome Sequence Assessment of Study Participants
a.) Sequencing platform details. Currently there are two main high-throughput
technologies which are suitable for large-scale whole-genome sequencing:
Table 1. Specifications of Illumina high-throughput platforms indicating main parameters
and advantages/disadvantages of each platform (http://www.illumina.com/systems/sequencing.html).
Parameter | HiSeq 2000 (Rapid Run) | HiSeq 2000 (High-Output) | HiSeq 3000 | HiSeq 4000 | HiSeq X Five | HiSeq X Ten
Run mode | Rapid Run | High-Output | N/A | N/A | N/A | N/A
Flow cells per run | 1 or 2 | 1 or 2 | 1 | 1 or 2 | 1 or 2 | 1 or 2
Output range | 10-300 Gb | 50-1000 Gb | 125-750 Gb | 125-1500 Gb | 900-1800 Gb | 900-1800 Gb
Run time | 7-60 hours | <1-6 days | <1-3.5 days | <1-3.5 days | <3 days | <3 days
Reads per flow cell | 300 million | 2 billion | 2.5 billion | 2.5 billion | 3 billion | 3 billion
Maximum read length | 2x250 bp | 2x125 bp | 2x150 bp | 2x150 bp | 2x150 bp | 2x150 bp
Key advantages:
HiSeq 2000: Power and efficiency for large-scale genomics.
HiSeq 3000: Maximum throughput and lowest cost for production-scale genomics.
HiSeq 4000: Maximum throughput and lowest cost for production-scale genomics.
HiSeq X Five: Maximum throughput for production-scale human whole-genome sequencing.
HiSeq X Ten: Maximum throughput and lowest cost for population-scale human whole-genome sequencing.
• The Illumina platform of sequencing instruments ranges from the HiSeq2000 model to the X
Ten machines (Bentley et al., 2008). These next-generation sequencing machines were
presented several years ago and helped generate the genomic revolution around the
world. Numerous studies, including such large-scale projects as the 1000 Genomes
project, the Genome 10K project, and many other de novo whole genome assemblies
and population genomics studies, were performed using these platforms. In Table 1,
the main features of the sequencing platforms are presented. Illumina platforms make
it possible to sequence thousands of individuals per year with 30-50x coverage. The
high accuracy of these instruments, along with a low cost of genome sequencing and
scalability, allows their use for population-scale sequencing of individuals and cancer
samples. Several companies around the world allow access to commercial sequencers
and some of them have Illumina X Ten platforms, which allow the sequencing of
thousands of human genomes (http://www.macrogen.com/eng/business/xgenome.html).
Furthermore, Illumina routinely upgrades their reagents to extend
read length and improve the accuracy of the sequencing reads.
• Complete Genomics (Drmanac et al., 2010) is one of the leaders in human whole
genome sequencing and is based in Mountain View, California. Using its proprietary
sequencing instruments, chemistry, and software, the company has sequenced more
than 20,000 whole human genomes. The company’s mission is to improve human
health by providing researchers and clinicians with the core technology and
commercial systems to understand, prevent, diagnose, and treat diseases and
conditions.
Over the past three years, Complete Genomics has initiated a large number of clinical
utility studies designed to demonstrate that patients, payers, and physicians may be better
off with a whole genome sequence as compared to standard care.
Complete Genomics is now previewing its first commercial product, the Revolocity™
system. Unlike other vendors who focus on providing only sequencing equipment,
Complete Genomics has designed the Revolocity system to be a total end-to-end
genomics solution for large-scale, high-quality genomes
(http://www.completegenomics.com/). The recently announced Revolocity system allows the
sequencing of around 10,000 genomes per year with 50X coverage.
Unlike Illumina, which has thousands of citations, there are currently few reports in the
scientific literature using Complete Genomics technology (examples: Lee et al.,
2015; Gilissen et al., 2014; Molenaar et al., 2014).
Table 2. Comparison of two major providers of population-scale whole genome
sequencing
Feature | Macrogen/X10/Illumina | Complete Genomics
Read length | 2 X 150 bp | 2 X 28 bp*
Coverage | 30X | 50X
Cost of 1 genome | 1320 $ (with discount, including data transfer by Fedex on 2TB disk) | 1600 $ (without discount and data transfer)
Estimated cost of data storage facility | 500 000 $ | 1 000 000 $
We compared the major technical properties and the cost of whole genome sequencing
for two major genome sequencing providers (Macrogen, which has several Illumina X
Ten machines, and Complete Genomics) and present the results in Table 2.
According to the primary measures, Macrogen/Illumina outperforms Complete
Genomics. The two main problems with Complete Genomics are:
1. Short read length (28 bp) does not allow the proper mapping of reads to a whole
genome. It may result in many missed SNPs in low-complexity regions. The
higher coverage of Complete Genomics will not help to resolve this problem. In
contrast, Macrogen proposes 150 bp paired-end reads.
2. Software and SNP-calling methods of Complete Genomics are not comparable
with modern Illumina software. It will be difficult to compare the quality of results
and the power of SNP-calling with the already published datasets from the 1000
Genomes project.
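To make the cost figures in Table 2 easier to compare at project scale, the short sketch below multiplies the per-genome price by a cohort size and adds the estimated storage facility cost. The prices are those listed in Table 2; the cohort size is a free parameter, and the names used here are illustrative only. Note that the two quotes are not strictly like-for-like, since they differ in coverage (30X versus 50X) and in whether a discount and data transfer are included.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    cost_per_genome_usd: int   # from Table 2
    storage_facility_usd: int  # from Table 2


MACROGEN = Provider("Macrogen/X10/Illumina", 1320, 500_000)
COMPLETE_GENOMICS = Provider("Complete Genomics", 1600, 1_000_000)


def total_cost_usd(provider: Provider, n_genomes: int) -> int:
    """Sequencing cost for n_genomes plus the one-off storage facility cost."""
    if n_genomes < 0:
        raise ValueError("n_genomes must be non-negative")
    return provider.cost_per_genome_usd * n_genomes + provider.storage_facility_usd


if __name__ == "__main__":
    # 2000 is the minimum cohort size mentioned in the informed consent text.
    for provider in (MACROGEN, COMPLETE_GENOMICS):
        print(provider.name, total_cost_usd(provider, 2000))
```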
b) Study design and “Bake off” strategy to evaluate sequencing platforms
We aim to determine the best strategy of whole genome sequencing for the
Russian population using 10 individuals from three family trios (mother, father, children),
each of which will be sequenced on the two platforms mentioned above (Illumina X Ten –
Macrogen, Revolocity – Complete Genomics). We will also sequence these using the
Illumina HiSeq2000 platform at the Saint-Petersburg State University Biobank. Each
provider of whole genome sequencing technology represents their technology as the most
accurate and cheap in terms of trade-off between quality and price, but for the Genome
Russia project our primary criteria will be the quality and reliability of the sequencing
platforms. Sequencing of the same individuals from family trios will allow us to identify
and quantify sequencing errors during SNP and indel calling, genotyping errors, and each
platform's ability to call complex genomic variants such as long heterozygous deletions and tandem
duplications. The main indicators of genome sequencing quality will be: percent of high-quality
bases and reads, mappability of short reads with Phred-scale quality above 40, and
uniformity of coverage, among others. Finally, all genotypes will be tested for Mendelian
transmission errors in the trio design in order to assess the incidence of genotyping errors
with the different sequencing platforms and SNP calling software.
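The Mendelian transmission test can be illustrated with a minimal sketch. It assumes biallelic genotypes encoded as pairs of allele codes (0 = reference, 1 = alternate); the function names and the example sites are hypothetical and are not part of the project pipeline.

```python
from itertools import product

def is_mendelian_consistent(mother, father, child):
    """Return True if the child's genotype can be formed by taking
    one allele from the mother and one allele from the father.

    Genotypes are tuples of two allele codes, e.g. (0, 1) for a heterozygote.
    """
    child_sorted = tuple(sorted(child))
    return any(
        tuple(sorted((m, f))) == child_sorted
        for m, f in product(mother, father)
    )

def mendelian_error_rate(trio_genotypes):
    """Fraction of tested sites at which a trio violates Mendelian transmission.

    trio_genotypes: iterable of (mother, father, child) genotype tuples.
    Sites with a missing genotype (None) are skipped.
    """
    tested = errors = 0
    for mother, father, child in trio_genotypes:
        if None in (mother, father, child):
            continue
        tested += 1
        if not is_mendelian_consistent(mother, father, child):
            errors += 1
    return errors / tested if tested else 0.0

# Hypothetical example sites for one trio
sites = [
    ((0, 0), (0, 1), (0, 1)),  # consistent
    ((0, 0), (0, 0), (0, 1)),  # inconsistent (genotyping error or de novo mutation)
    ((1, 1), (0, 1), (0, 0)),  # inconsistent: the mother can only transmit allele 1
    ((0, 1), None, (1, 1)),    # skipped: the father's genotype is missing
]
print(mendelian_error_rate(sites))
```

Comparing this rate between platforms for the same trio gives one simple estimate of their relative genotyping error, although true de novo mutations also contribute a small number of apparent violations.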
The overall frequency of sequencing and genotyping errors for each technology
and the results of the bake-off comparison will be used to decide which sequencing
technology will be used for the Genome Russia project. These results will be published as
a separate methodological paper.
6. Computational resources, requirements, memory, capacity and security
a.) Big Data handling, storage and accessibility
Genome Russia is expected to generate
whole genome sequence files and analyses for > 2500 Russian citizens in the coming 2-3
years. Sequencing at 60X coverage and alignment require approximately 500 GB of data files
for each person. This includes the .fastq files of raw reads and the .bam files of mapped reads.
We will store, analyze, and in the future provide open web access to the consented sequence data.
For this purpose, we will need to build a high-speed storage unit with at least
1 petabyte (PB) capacity, a powerful server cluster, and a high-speed network to connect
servers and storage and to provide access to data. Sequence data will be delivered to us on
external hard drives, which will serve as an additional backup system for raw read files.
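The storage requirement can be checked with a short calculation. This is a rough sketch that assumes decimal units (1 PB = 1,000,000 GB) and takes the two figures quoted above (2500 genomes, about 500 GB each) at face value; the function name is hypothetical.

```python
def dataset_size_pb(n_genomes, gb_per_genome, gb_per_pb=1_000_000):
    """Total dataset size in petabytes, assuming decimal units."""
    return n_genomes * gb_per_genome / gb_per_pb

# Figures from the text: > 2500 participants, ~500 GB of FASTQ + BAM per person
print(f"{dataset_size_pb(2500, 500):.2f} PB")
```

Under these assumptions the full dataset slightly exceeds 1 PB, so the 1 PB figure is best read as a lower bound on the main storage capacity.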
• Peterhof data center. Our server and storage system will be installed at the
Peterhof SPSU Datacenter. The Peterhof SPSU Datacenter will provide several
free server racks for our equipment, a cooling system, a local gas extinguishing
system, and electrical power protected by a powerful generator and an
uninterruptible power supply system. The Peterhof SPSU Datacenter will also
provide 1 gigabit (Gb) Internet access and a 10 Gb channel to connect to the
Dobzhansky Center datacenter on 41 Sredniy Prospekt. For our equipment, we
will need 2 racks of 48 units. Server racks will be locked to prevent unauthorized
access to Genome Russia servers.
b.) Genome Russia Server cluster:
• Hardware. For bioinformatics analysis, we will use six Supermicro 4U form factor
servers. Each server will include 4 Intel E7-8890v3 CPUs with 72 cores in total, 3 TB of
DDR4 memory and 5 TB of fast SSD drives organized as a disk cache to reduce
the load on the network and storage system.
• Software. For Server Cluster setup, we will use Sun Grid Engine (SGE) software. It
will be responsible for accepting, scheduling, dispatching, and managing the remote
and distributed execution of large numbers of standalone, parallel or interactive user
jobs (Samuel et al.). It also manages and schedules the allocation of distributed
resources such as processors, memory, disk space, and software licenses. A typical
Grid Engine cluster consists of a master host and one or more execution hosts.
Multiple shadow masters can also be configured as hot spares, which take over the
role of the master when the original master host crashes.
• Organizations using SGE:
o Sun Grid
o TSUBAME supercomputer at the Tokyo Institute of Technology,
o Ranger at the Texas Advanced Computing Center (TACC). Ranger has
62,976 processor cores in 3,936 nodes and a peak performance of
504TFlops.
o San Diego Supercomputer Center (SDSC)
o Geophysical Fluid Dynamics Laboratory (NOAA GFDL)
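To show how an analysis job could be handed to SGE, the following sketch builds the text of a job script with embedded `#$` directives, which would then be submitted with `qsub`. The parallel environment name `smp`, the slot count and the memory limit are assumptions, since they depend on how the cluster is configured; the helper function is hypothetical.

```python
def sge_job_script(job_name, command, slots=8, mem_per_slot="4G",
                   parallel_env="smp"):
    """Build the text of an SGE job script with embedded '#$' directives."""
    lines = [
        "#!/bin/bash",
        f"#$ -N {job_name}",               # job name
        "#$ -cwd",                         # run in the submission directory
        "#$ -j y",                         # merge stderr into stdout
        f"#$ -pe {parallel_env} {slots}",  # parallel environment and slot count
        f"#$ -l h_vmem={mem_per_slot}",    # hard virtual memory limit
        command,
    ]
    return "\n".join(lines) + "\n"

# $NSLOTS is set by SGE to the number of slots granted to the job.
script = sge_job_script(
    "align_sample1",
    "bowtie2 --very-sensitive -p $NSLOTS -x GRCh37 "
    "-1 sample1_R1.fastq -2 sample1_R2.fastq -S sample1.sam",
    slots=16,
)
print(script)
```

Writing the result to a file such as `align_sample1.sh` and running `qsub align_sample1.sh` would queue the job; the file names here are placeholders.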
c.) Storage cluster:
• Hardware. The storage cluster will be based on Supermicro 1U form factor
server nodes. Each node contains 12 large form factor 8 TB hard drives, so the total raw
space per node reaches 96 TB. We will need approximately 2 PB of raw space, or 22
nodes, to build a 1 PB high-speed main storage system and a 1 PB backup system. A
backup storage system will be organized in the server room of the Dobzhansky
Center on 41 Sredniy Prospekt. The backup system will consist of 2 Supermicro
Storage servers with 72 8 TB hard drives each. Raidix software and RAID6
technology will be used to organize access and maintain security of the data.
To achieve speeds above 10 Gigabit, several physical
ports can be bundled into a single logical channel, as defined by the Link
Aggregation Control Protocol (LACP) IEEE standard. This is also called interface bonding
or teaming. Link aggregation is possible either using a single switch or stackable
switches. Stackable switches allow link aggregation across the stack, which enables
improved redundancy and resiliency (Guijarro et al. 2007).
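The node count can be reproduced with a small calculation. The sketch assumes binary units (1 PB = 1024 TB), which is what yields the 22 nodes quoted above, and assumes one RAID6 array per node for the usable-capacity estimate; neither assumption is stated in the text, and the function names are hypothetical.

```python
import math

def nodes_needed(raw_tb_required, drives_per_node=12, tb_per_drive=8):
    """Number of storage nodes needed to provide the requested raw capacity."""
    raw_per_node = drives_per_node * tb_per_drive  # 96 TB per node in the text
    return math.ceil(raw_tb_required / raw_per_node)

def raid6_usable_tb(drives, tb_per_drive):
    """Usable capacity of one RAID6 array: two drives' worth is used for parity."""
    return (drives - 2) * tb_per_drive

print(nodes_needed(2 * 1024))   # 2 PB of raw space, binary units
print(raid6_usable_tb(12, 8))   # usable TB per node if each node is one RAID6 array
```

Because RAID6 parity reduces usable space below the raw figure, the usable capacity of the main system would be somewhat less than 1 PB under these assumptions.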
• Software. Storage nodes will be merged into a high-speed system using the
Gluster file system. Gluster is a software-only platform that provides scale-out
NAS for physical and virtual environments. With Gluster, organizations can turn
commodity computing and storage resources (either on-premise or in the public
cloud) into a scale-on-demand, virtualized, commoditized, and centrally managed
storage pool. The global namespace capability aggregates disk, CPU, I/O and
memory into a single pool of resources with flexible back-end disk options,
supporting direct attached, JBOD, or SAN storage. Storage server nodes can be
added or removed without disruption to service - enabling storage to grow or
shrink quickly in the most dynamic environments (Heath et al, 2014).
Gluster is designed to distribute the workload across a large number of inexpensive
servers and disks. This reduces the impact of poor performance of any single component,
and dramatically reduces the impact of factors that have traditionally limited disk
performance, such as spin time. In a typical Gluster deployment, relatively inexpensive
disks can be combined to deliver performance that is equivalent to far more expensive,
proprietary and monolithic systems at a fraction of the total cost. To scale capacity,
organizations need simply add additional, inexpensive drives and will see linear gains in
capacity without sacrificing performance. To scale out performance, organizations need
simply add additional storage server nodes, and will generally see linear performance
improvements. To increase availability, files can be replicated n-way across multiple
storage nodes.
While Gluster’s default configuration can handle most workloads, its modular
design allows it to be customized for particular and specialized workloads, so that
configurations can easily be adjusted to achieve the optimal balance between performance, cost,
manageability, and availability for a particular need. Gluster can be used for a wide
variety of storage needs and performs well across a variety of workloads, including:
• Both large numbers of large files and huge numbers of small files
• Both read intensive and write intensive operations
• Both sequential and random access patterns
• Large numbers of clients simultaneously accessing files
d.) Data access:
Access to data will be organized through a web interface with an access control system, so
users can download data on request. We will also set up an Aspera server
(http://asperasoft.com/). The Aspera protocol allows the movement of large data sets over the
WAN at high speed (reportedly up to 100X faster than FTP or HTTP). This protocol is based on
FASP technology. FASP enables large data set transfers over any network at maximum
speed, regardless of network conditions or distance. For example, over gigabit WANs
with 1 second RTT and 5% packet loss, FASP achieves 700-800 Mbps file transfers on
high-end PCs with RAID-0 and 400-500 Mbps transfers on commodity PCs. Large data
sets of small files are transferred with the same efficiency as large single files. The
implementation is very lightweight, and thus does not require specialized or powerful
hardware in order to maintain high speeds or high concurrency.
In addition to significant transfer rate gains, FASP is able to fully utilize the
available bandwidth, maximize use of the existing infrastructure and eliminate costly
upgrades that may not even benefit TCP-based protocols.
While FASP can fill any available bandwidth, it also includes an intelligent
adaptive transmission rate control mechanism that throttles down for precision fairness to
standard TCP traffic, and automatically ramps back up to fully utilize the unused
bandwidth. This ensures that business-critical TCP traffic such as email, web, and
scientific applications can function normally while allowing FASP to utilize unused
bandwidth.
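To put the quoted transfer rates in context, the sketch below estimates how long one genome's data (about 500 GB, as stated in section 6.a) would take to transfer. It assumes decimal units and a sustained rate with no protocol overhead, so the results are optimistic estimates; the function name is hypothetical.

```python
def transfer_hours(size_gb, rate_mbps):
    """Hours needed to move size_gb gigabytes at a sustained rate in Mbps."""
    bits = size_gb * 8e9                 # decimal gigabytes to bits
    seconds = bits / (rate_mbps * 1e6)   # megabits per second to bits per second
    return seconds / 3600

# 400 and 700 Mbps are the lower FASP figures quoted above; 1000 Mbps is a full gigabit link
for rate in (400, 700, 1000):
    print(f"{rate} Mbps: {transfer_hours(500, rate):.1f} h per 500 GB genome")
```

At 700 Mbps a single genome takes roughly an hour and a half, so bulk downloads of many genomes would still take days even at the best quoted rates.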
e.) List of tools for bioinformatics analysis:
• GATK – tool to analyze sequencing data
• Bowtie – the read-mapping tool
• Picard – utilities that manipulate SAM and BAM files
• Samtools – a software suite to process SAM/BAM files
• Bcftools – tools to process genome variation data and handle files in the VCF format
• Vcftools – tools for processing genome variation data
• SnpEFF – Genetic variant annotation and effect prediction toolbox
• PLINK – whole-genome association and variation analysis toolset
• Bioconductor – bioinformatics-related packages for R
f) Genome Russia Website
It is very important to inform people who are potential study participants about the
Genome Russia Project, to make our goals clear, to gain trust, and to encourage wide
participation. Details of the goals, status and background of the Genome Russia Project are
posted, with regular updates, on an open website: http://genomerussia.bio.spbu.ru/
Figure 6 Genome Russia logo.
The Genome Russia Project has a logo, located in the upper left corner near the coat of
arms of Saint-Petersburg State University. It will be displayed on advertising flyers, T-shirts
and other promotional items.
A major purpose of the website is to announce the goals and progress of the
project and to release all information about the project promptly. This ensures that every user
can read a detailed description of the project, volunteer for the project and address any
questions to the scientists. To provide a better understanding of how blood sample
collection is performed, we include a file with the protocols for blood sample collection,
DNA extraction and DNA quality control. Moreover, all volunteers receive a flyer
with a brief description of the project, which describes the trios project, the informed
consent form and the fate of their genome sequence data (Appendix 4).
On the “News” tab we will publish periodic updates about events related to the
Genome Russia Project. On the website is an interactive map of the Russian Federation
with marked regions indicating where volunteers have been recruited for the project.
Figure 7 Genome Russia interactive map from the website
The Genome Russia Database
The Genome Russia Database is a web-based database containing anonymous information
on the whole-genome sequences of at least 2,000 men and women originating from the
different regions of Russia. The information includes: gender, place of residence, place of
birth, age, ethnicity, marital status, date of blood collection, in addition to the database of
genealogical information. This database does not contain any confidential personal
identifier information.
Access to the Genome Russia Database is provided through a web application
with a Python backend that is connected to our MySQL database. There are different
types of users with different access rights: administrators, who are responsible for
managing users’ rights and supporting the databases; moderators, who can upload and update
data; and guests, who are registered to view data.
Every change to the database is retained. Moreover, the system periodically makes
backups of the database on a server, so that user activity is under control and it is possible
to recover lost data at any time.
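The combination of role-based access and a retained change history can be sketched as follows. This is an illustration only: it uses the standard-library sqlite3 module as a stand-in for MySQL so that it runs without a server, and the table names, columns and permission names are hypothetical rather than taken from the actual Genome Russia schema.

```python
import sqlite3

# Hypothetical mapping of the three user types to permissions
PERMISSIONS = {
    "administrator": {"read", "write", "manage_users"},
    "moderator": {"read", "write"},
    "guest": {"read"},
}
EDITABLE_FIELDS = {"region", "age"}  # whitelist: column names cannot be bound as parameters

def can(role, action):
    """Return True if the given role is allowed to perform the action."""
    return action in PERMISSIONS.get(role, set())

def update_field(conn, user, role, participant_id, field, new_value):
    """Update one questionnaire field and record the change in change_log."""
    if not can(role, "write"):
        raise PermissionError(f"role '{role}' may not modify data")
    if field not in EDITABLE_FIELDS:
        raise ValueError(f"unknown field: {field}")
    old = conn.execute(
        f"SELECT {field} FROM participant WHERE id = ?", (participant_id,)
    ).fetchone()[0]
    conn.execute(
        f"UPDATE participant SET {field} = ? WHERE id = ?",
        (new_value, participant_id),
    )
    conn.execute(
        "INSERT INTO change_log (user, participant_id, field, old_value, new_value) "
        "VALUES (?, ?, ?, ?, ?)",
        (user, participant_id, field, str(old), str(new_value)),
    )
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE participant (id INTEGER PRIMARY KEY, region TEXT, age INTEGER)")
conn.execute(
    "CREATE TABLE change_log (id INTEGER PRIMARY KEY AUTOINCREMENT, user TEXT, "
    "participant_id INTEGER, field TEXT, old_value TEXT, new_value TEXT, "
    "changed_at TEXT DEFAULT CURRENT_TIMESTAMP)"
)
conn.execute("INSERT INTO participant VALUES (1, 'Pskov region', 41)")
update_field(conn, "moderator_1", "moderator", 1, "age", 42)
print(conn.execute("SELECT user, field, old_value, new_value FROM change_log").fetchall())
```

Keeping the old value next to the new one in the log is what makes it possible to reconstruct or roll back any record, independently of the periodic full backups.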
At present, the Genome Russia Database stores the questionnaires of 132
individuals from 5 regions:

Location | Quantity
Komi | 9
Arkhangelsk region | 33
Nizhny Novgorod Region | 9
Pskov region | 60
Tver region | 21

Table 3 Inventory of study participants in the Genome Russia database
h.) The Genome Russia Database will store all the information about samples from
individuals: where they are stored (room, freezer, box, tube), information about DNA
extraction, and the results of QC tests and assessment.
Each entity in the database has a unique bar code, which provides a key to the
security and data structuring of the database (Wendi et al 2007). For each sample
collection trip, a user can generate a set of barcodes for questionnaires, informed consent
forms, vacutainers and various types of tubes on the appropriate page of the web
interface. After receiving samples and uploading data into the database, the system will
generate a barcode, so an employee of the laboratory can find all stored data by scanning
a barcode with a barcode scanner. Files and tubes related to a common sample are
interconnected with their ID’s; this simplifies searching through the database.
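One way such linked barcodes could be generated is sketched below. The barcode format, the item types and their one-letter codes are hypothetical, since the text does not specify the actual scheme; the point is only that every item derived from one participant shares a common sample ID.

```python
# Hypothetical item types and one-letter codes; the real scheme is not given in the text.
ITEM_CODES = {
    "questionnaire": "Q",
    "consent_form": "C",
    "vacutainer": "V",
    "dna_tube": "D",
}

def trip_barcodes(trip_id, n_participants):
    """Return {sample_id: {item_type: barcode}} for one sample collection trip."""
    barcodes = {}
    for n in range(1, n_participants + 1):
        sample_id = f"GR{trip_id:03d}{n:04d}"
        barcodes[sample_id] = {
            item: f"{sample_id}-{code}" for item, code in ITEM_CODES.items()
        }
    return barcodes

def sample_of(barcode):
    """Recover the sample ID that a scanned barcode belongs to."""
    return barcode.split("-")[0]

codes = trip_barcodes(trip_id=7, n_participants=2)
print(codes["GR0070001"])
print(sample_of("GR0070002-D"))
```

With a shared prefix, scanning any tube or form is enough to look up every other record for the same participant.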
Figure 8 An example of the web-interface questionnaire page.
One can track the overall progress of the Genome Russia Project, or monitor the
current phase of a particular sample together with the tubes, files, and data associated with it
and where they are stored.
It is possible to generate reports on the data that the database stores and download
an Excel file with all samples and their associated characteristics.
Discovered and validated genomic variants, along with related annotated genome
features such as function altering variants, pathological imputations, and haplotype maps,
will be posted openly at the Dobzhansky Center GARField website mirror of the UCSC
Genome Browser (Kent et al., 2002). Updated data, results and interpretations will be
available as a track hub (Raney et al, 2014) for applications with third-party genome
feature annotations.
7) Population whole genome sequence data analysis
Analysis of next generation sequencing (NGS) data requires the combined application of
various bioinformatic tools chained into pipelines. The variant calling
pipeline can be divided into several stages, as shown in Figure 9:
1. quality control and filtration of input data;
2. alignment of filtered reads to reference genome;
3. filtration of alignment;
4. variant calling;
5. filtration of variants based on set of indicators;
6. annotation of variants.

Figure 9 Workflow of SNV calling

These stages are described below. The full list of bioinformatics tools that will be used
can be found in section 6.e.
a) Read quality control and filtration.
Whole-genome sequencing (WGS) reads will be processed with the FastQC tool
(Andrews, 2010) to assess their quality. This tool calculates several indicators and
distributions of parameters useful for initial assessment of data quality, for example:
1. per-base read quality distribution (see Figure 10 below),
2. per-read quality scores,
3. per-base GC content,
4. per-base N content,
5. sequence length distributions, and others.

Figure 10 Read Filtration QC

Reads will be filtered to remove adapters (technical sequences used in preparation of
libraries), contamination and low-quality reads. Such filtration greatly improves the subsequent
alignment.
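The idea behind quality-based read filtration can be shown with a minimal sketch. It assumes Phred+33 quality encoding and uses illustrative thresholds (mean quality of at least 20, at most 10% N bases) that are not specified in the text; adapter and contamination removal are not covered here.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed) into integer scores."""
    return [ord(ch) - offset for ch in quality_string]

def read_fastq(lines):
    """Yield (name, sequence, quality) from an iterable of FASTQ lines (4 per record)."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)  # '+' separator line
        qual = next(it).strip()
        yield header.strip()[1:], seq, qual

def passes_filter(seq, qual, min_mean_q=20, max_n_fraction=0.1):
    """Keep a read if its mean quality is high enough and it has few N bases."""
    scores = phred_scores(qual)
    mean_q = sum(scores) / len(scores)
    n_fraction = seq.upper().count("N") / len(seq)
    return mean_q >= min_mean_q and n_fraction <= max_n_fraction

# Two toy records: read1 is high quality, read2 is low quality with many N bases
example = [
    "@read1", "ACGTACGT", "+", "IIIIIIII",   # 'I' encodes Q40
    "@read2", "ACGTNNNN", "+", "########",   # '#' encodes Q2
]
kept = [name for name, seq, qual in read_fastq(example) if passes_filter(seq, qual)]
print(kept)
```

In practice this step would be carried out by dedicated trimming tools; the sketch only makes explicit what "low-quality read" means numerically.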
b) Alignment and filtration
Filtered reads will be aligned to the human reference genome sequence (the
GRCh37 assembly) using the Bowtie2 read aligner (Langmead and Salzberg, 2012) in
very sensitive mode to obtain raw alignments stored in BAM files (Li et al, 2009). The BAM
format was specially designed for storage of alignments in compressed form and stores
full information about each alignment. The raw alignments will then be filtered to remove PCR duplicates
and ambiguously aligned or unaligned reads using Samtools (Li et al, 2009). Filtration is
required to reduce errors and biases in the subsequent SNV calling. It also reduces the volume
required for storage of alignments, as unnecessary information (for example, unaligned
reads) is removed.
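The alignment and filtration steps above could be expressed as the following command sequence. This is a sketch rather than the project's actual pipeline: the file names, thread count and the mapping quality cutoff of 20 (used here as a proxy for "ambiguously aligned") are assumptions, and the commands are only assembled and printed, not executed.

```python
import shlex

def alignment_commands(index, fq1, fq2, sample, threads=8, min_mapq=20):
    """Build the command lines for alignment and alignment filtration."""
    return [
        # Paired-end alignment in very sensitive mode
        ["bowtie2", "--very-sensitive", "-p", str(threads), "-x", index,
         "-1", fq1, "-2", fq2, "-S", f"{sample}.sam"],
        # Coordinate sort and conversion to BAM
        ["samtools", "sort", "-o", f"{sample}.sorted.bam", f"{sample}.sam"],
        # Removal of PCR duplicates
        ["samtools", "rmdup", f"{sample}.sorted.bam", f"{sample}.rmdup.bam"],
        # Drop unaligned reads (flag 4) and reads below the mapping quality cutoff
        ["samtools", "view", "-b", "-F", "4", "-q", str(min_mapq),
         "-o", f"{sample}.filtered.bam", f"{sample}.rmdup.bam"],
    ]

cmds = alignment_commands("GRCh37", "s1_R1.fastq", "s1_R2.fastq", "s1")
for cmd in cmds:
    print(shlex.join(cmd))
```

Each command could be passed to `subprocess.run(cmd, check=True)` or wrapped in a cluster job script, one sample per job.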
c) Variant calling and genotype calling
i. Strategy:
BAM files of filtered read alignments will be subjected to GATK
(https://www.broadinstitute.org/gatk/; McKenna et al 2010; DePristo et al 2011; Van der Auwera
et al., 2013) analysis to call genomic variants and determine their genotypes. Following
GATK Best Practice guidelines, we will employ a joint calling and individual genotyping
procedure, which provides high accuracy of results. Newly obtained samples will be
added to the dynamic dataset without a high computational penalty. Joint calling of variants
precludes the ambiguity of absent records in VCF files that can arise when individual
variant calling is performed: such an absence can be interpreted either as the absence of variation at the
respective site (i.e. the nucleotide is identical to the reference) or as the absence of data at this
site (lack of coverage). Both original raw data (FastQ files) and analysis results
(BAM and VCF files) will be stored as replicas in separate locations for future
reference and for data stability.
ii. Quality control: Further analysis of variants (for example, identification of
variants unique to the Russian population, comparison of allele frequencies between different
populations etc.) requires a highly reliable set of variants. For this reason we plan to
subject WGS data to the following quality control (QC):
• entries in the VCF file will be filtered according to the sequencing depth
parameter (DP entry in the column of VCF file) with the lower boundary
cutoff of mean(DP)/2 (upper boundary is not enforced since all repeats are
removed at a later step)
• entries in the VCF file will be filtered by the GATK according to various "best
practice" and in some cases more strict thresholds:
o MQ < 40
o QD < 2 (variant confidence divided by the unfiltered depth of
non-reference samples)
o FS > 20 (phred-scaled p-value using Fisher’s Exact Test to detect
strand bias (the variation being seen on only the forward or only the
reverse strand) in the reads)
o HaplotypeScore > 13 (consistency of the site with strictly two
segregating haplotypes)
o MQRankSum < -12.5 (the u-based z-approximation from the
Mann-Whitney Rank Sum Test for mapping qualities (reads with ref
bases vs. those with the alternate allele))
o ReadPosRankSum < -8 (the u-based z-approximation from the
Mann-Whitney Rank Sum Test for the distance from the end of the
read for reads with the alternate allele; if the alternate allele is only
seen near the ends of reads, this is indicative of error)
• variants that fall into repeats known to the RepBase Update database (Jurka et
al., 2005) will be removed
• individual entries will be checked for the correct gender assignment
(phenotypic gender record should correspond to the genetic data on gender)
and discrepant samples will be removed
• all variants with occurrence <2, or minor allele frequency (MAF) <3% and
quality parameter (QUAL column of VCF file) <150 will be filtered out
• all variants on autosomes will be filtered by the Hardy-Weinberg equilibrium
(HWE) test with cutoff 0.0001
• all variants with genotyping rate <90% will be removed
• all samples with the genotyping rate <90% will be removed
• after linkage disequilibrium (LD) pruning, variants will be filtered by IBD
(identity-by-descent) with the cutoff of 0.2
• filtering by quality/occurrence/MAF, HWE, variant and individual genotyping
rate will be repeated to account for the dataset changes introduced by
IBD filtering
• PCA analysis will be performed to check for the population structure
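Two of the QC rules listed above can be sketched in code: the hard filters on VCF INFO annotations and the HWE test. The thresholds are copied from the list; everything else is an assumption. In particular, the HWE function uses a one-degree-of-freedom chi-square approximation, which is not necessarily the test the project will use (tools such as PLINK implement an exact test), and the function names are hypothetical.

```python
import math

# A variant FAILS a filter if the condition returns True (thresholds from the list above).
HARD_FILTERS = {
    "MQ": lambda v: v < 40,
    "QD": lambda v: v < 2,
    "FS": lambda v: v > 20,
    "HaplotypeScore": lambda v: v > 13,
    "MQRankSum": lambda v: v < -12.5,
    "ReadPosRankSum": lambda v: v < -8,
}

def failed_filters(info, mean_dp):
    """Return the names of the filters that a variant fails.

    info: dict of INFO annotations for one VCF record; missing annotations are skipped.
    mean_dp: mean sequencing depth over all records, used for the DP cutoff.
    """
    failed = [name for name, fails in HARD_FILTERS.items()
              if name in info and fails(info[name])]
    if info.get("DP", mean_dp) < mean_dp / 2:
        failed.append("DP")
    return failed

def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 d.f.) p-value for deviation from Hardy-Weinberg proportions."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    # Survival function of the chi-square distribution with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2))

print(failed_filters({"MQ": 35, "QD": 10, "FS": 3, "DP": 100}, mean_dp=90))
print(hwe_chi2_pvalue(50, 0, 50) < 0.0001)  # no heterozygotes at all: strong deviation
```

A variant would be retained only if `failed_filters` returns an empty list and its HWE p-value is above the 0.0001 cutoff.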
iii. Functional SNV (fSNV) detection
The following types of genomic variant will be identified and included into further
analysis:
• SNPs/SNVs, including nonsense variants, missense variants
• insertions, both in-frame and frameshifts
• deletions, both in-frame and frameshifts
• Stop codons
• Altered Start Sites
• Splice junction alterations
fSNV variants will be further annotated according to their effects on the resulting gene
products (transcripts and/or proteins) by their predicted influence (see prediction software
below).
iv. Validation of rare allele and disease gene mutations
Whole genome sequencing with next-generation sequencing is a powerful
technology that has made Genome Russia possible (see Section III-5).
However, there are significant drawbacks associated with these high throughput methods,
which affect the quality of research results and their reliability. For example, amplification
bias during PCR of heterogeneous mixtures can result in skewed populations [Kanagawa,
2003]. Additionally, polymerase mistakes, such as base mis-incorporations and
rearrangements due to template switching, can result in incorrect variant calls.
Furthermore, errors arising during cluster amplification, sequencing cycles, and image
analysis result in approximately 0.1–1% of bases being called incorrectly [Fox et al 2014].
Traditional proven methods, namely Sanger sequencing and PCR, are widely
employed to verify novel results acquired by these approaches. In Genome Russia we
will verify all rare and function-altering variants by real-time PCR and Sanger sequencing.
The primer design program should be chosen depending on the detection method that is
used in the PCR (HRM, Sybr-green, hydrolysis
probes or hybridization probes). PCR primers
are designed according to the assembled transcript sequences.
We will confirm that the primers and probes are specific for the potential
SNP and do not detect known SNPs or pseudogenes. It is desirable to use a second
method to verify the results of the established assay (for example one Taqman probe
based assay and one HRM-based assay). All data coming from NGS should be verified on
a heterozygous DNA sample by Sanger sequencing. All results should be congruent: the
database search as well as all experimental data.
d) Genome Mining for described human disease gene variant alleles
i. Human Gene Mutation Database. Gene variants of known Mendelian diseases for
the selected set of 173 genes with documented clinical/phenotypic effect will be retrieved
from the Human Gene Mutation Database (HGMD; http://www.hgmd.cf.ac.uk/ac/index.php;
Stenson et al 2012). Only variants with severe and significant effect will be
chosen. The list of chosen variants from HGMD will be intersected with the merged VCF
files from the whole-genome sequencing (WGS) data and the resulting set of variants will
be subjected to further analysis.
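The intersection step can be sketched as an exact match on chromosome, position, reference allele and alternate allele. The coordinates below are invented for illustration and do not come from HGMD; the function names are hypothetical, and the sketch assumes that both sources use the same reference assembly and that indels are already normalized in the same way.

```python
def variant_key(chrom, pos, ref, alt):
    """Normalise a variant to a hashable key; a leading 'chr' prefix is dropped."""
    chrom = str(chrom)
    if chrom.lower().startswith("chr"):
        chrom = chrom[3:]
    return (chrom, int(pos), ref.upper(), alt.upper())

def parse_vcf_records(lines):
    """Yield variant keys from VCF text lines, one key per ALT allele."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            yield variant_key(chrom, pos, ref, alt)

def intersect(disease_variants, vcf_lines):
    """Return the disease variants that are present in the VCF records."""
    wanted = {variant_key(*v) for v in disease_variants}
    return sorted(wanted & set(parse_vcf_records(vcf_lines)))

# Invented example data
disease = [("chr1", 1000, "A", "G"), ("2", 500, "C", "T")]
vcf = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t1000\t.\tA\tG,T",
    "3\t5\t.\tG\tA",
]
print(intersect(disease, vcf))
```

Splitting multi-allelic records into one key per ALT allele avoids missing a disease allele that shares a site with another variant.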
ii. 1000 Genomes project. SNV population allele frequency data from 1000 Genomes
project (McVean et al., 2012; Auton et al., 2015) will be included into the analysis
alongside the WGS data obtained from Russian population. We will also identify variants
that are unique for Russian population, as well as variants and respective genes that are
present as alternative (non-reference) homozygotes or compounds. Identified variants of
interest will be subjected to further analysis, e.g. comparing effect of respective genes to
their prevalence of respective/related diseases in Russian population. The final validation
of the assay should be performed with inter- and intra-assay specific validation.
iii. Online Mendelian Inheritance in Man (OMIM) database. OMIM database
(http://www.ncbi.nlm.nih.gov/omim, http://www.omim.org, Hamosh et al 2005) is a
comprehensive, authoritative compendium of human genes and genetic phenotypes that is
updated daily. The full-text, referenced overviews in OMIM contain information on all
known mendelian disorders and over 15,000 genes. OMIM is focused on the relationship
between phenotype and genotype.
Intersection of variants from the OMIM database with variants obtained in the Genome Russia
Project will allow us to estimate the presence and frequencies of variants connected with
heritable diseases in the Russian population.
iv. De novo variant annotation. For de novo variant effect prediction, the snpEff tool
(Cingolani et al., 2012) will be applied to the obtained SNVs. Based on the predicted
effects, putative loss-of-function SNVs (McArthur et al., 2012) will be identified. Also, for
each missense variant, its effect on the corresponding protein will be assessed using Sorting
Intolerant From Tolerant (SIFT; http://sift.jcvi.org/; Ng and Henikoff, 2003) and
Polymorphism Phenotyping (PolyPhen; http://genetics.bwh.harvard.edu/pph2/; Ramensky
et al 2002) software. For trio samples, analysis of inter-variant compounds (i.e. compounds
between different variants located within the same gene) will also be carried out using
their phased genotypes. Results obtained for the trios can be
extended to other individuals in the study using read-alignment phasing implemented in
samtools (Li et al., 2009).
e.) Copy Number Variation (CNV) assessment for Genome Russia Project.
For the copy number annotation analysis, we will employ an approach used in
various robust genome annotation studies, including the analysis of human and animal
genomes (Xue, et al. 2015; Prado-Martinez et al 2013; Tamazian et al 2014; Marques-Bonet
and Eichler, 2009; Alkan et al 2009). The method utilizes read depth calculation
and comprises various steps, including masking out the repeats in the reference genome,
mapping datasets of multiple individuals to the reference genome, and screening the
datasets of each individual for segmental duplications.
Step 1: Repeat masking employs tools like Repeat Masker, Tandem Repeat
Finder, and Dust Masker that allow identification of various categories of genomic
repeats. Nonetheless, some repeats remain unmasked for various reasons, such as their
absence from the Repeat Masker database. Hidden repeats are identified using a k-mer
approach by dividing the contigs and scaffolds into k-mers with a fixed k (e.g., k=36) and
then mapping these k-mers onto the assembly using mrsFast software [4] to account for
multi-mappings. Also, for each masked segment, one flanking k-mer (k=36) at each of the 5’
and 3’ ends is masked out. This is done to prevent the coverage from
dropping off in regions flanking the masked regions.
The later downstream analysis is based on mrFast and mrsFast [4] software, which
perform mapping of short reads or k-mers to the reference genome and are designed with
some optimizations for Illumina short-read datasets. Both mappers exploit properties of
Illumina reads such as their short length (shorter than 454 or Pacbio), low error rate (2-3
bp along the read), and uniform read length within a single machine run. Basically, they
implement a collision-free hash table to create indices of the reference genome that can
efficiently utilize the system memory. In comparison to mrFast, the mrsFast software
finds only mismatches (not indels) and thus allows for an increase of the mapping speed.
In this analysis we will use the length of mapped sequences of 36bp and optimize the
parameters of the software in order to find all possible map locations.
Step 2 consists of mapping the data from multiple individuals to the reference
genome in order to identify the copy numbers of genomic regions. For this purpose, reads
of 100bp are split into two consecutive k-mers of 36 bp from positions 10-46 and
positions 46-81. In the case of 150bp reads, we will split reads into the consecutive regions
with the coordinates: 5-41, 41-76, 76-112, 112-148. This trims the potentially
lower quality bases at the outermost regions of each read.
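The read-splitting step can be sketched as follows. The function cuts exact, non-overlapping 36-base windows starting at a given offset (half-open intervals), so its coordinates can differ by one base from some of those quoted above; the function name and the use of an explicit offset parameter are assumptions.

```python
def split_read(read, k=36, offset=0):
    """Split a read into consecutive non-overlapping k-mers starting at offset.

    Bases before the offset and any leftover bases at the 3' end are discarded,
    which trims the read ends where base quality tends to be lowest.
    """
    n_kmers = (len(read) - offset) // k
    return [read[offset + i * k: offset + (i + 1) * k] for i in range(n_kmers)]

read_100 = "ACGT" * 25                  # toy 100 bp read
read_150 = "ACGTA" * 30                 # toy 150 bp read
kmers_100 = split_read(read_100, offset=10)   # windows [10, 46) and [46, 82)
kmers_150 = split_read(read_150, offset=5)    # four windows starting at position 5
print(len(kmers_100), len(kmers_150))
```

Each resulting 36-mer is then mapped independently, so a single read contributes up to two or four depth observations.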
Step 3: CNVs are detected using mrCaNaVar [Alkan, et al 2009] by screening the
read depth in non-overlapping windows of 1Kb of unmasked sequence. Also, the genome-wide
read depth is calculated by iteratively excluding windows with the most extreme
read depth. The remaining regions are kept as control regions.
Step 4: Next, we will identify segmental duplications that are defined as at least 5
consecutive windows of non-overlapping, non-masked sequence with the copy number
value larger than the mean copy number value in control regions with a correction to
standard deviation. These regions must also span at least 10Kb in genomic coordinates by
definition.
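Steps 3 and 4 can be sketched in simplified form. This is not the mrCaNaVar implementation: the three-standard-deviation rule for excluding extreme windows and the way the cutoff is passed in are assumptions, since the text does not give the exact correction, and the function names are hypothetical.

```python
from statistics import mean, stdev

def control_windows(depths, n_sd=3.0, max_iter=20):
    """Iteratively drop windows whose depth is more than n_sd SDs from the mean."""
    kept = list(depths)
    for _ in range(max_iter):
        if len(kept) < 2:
            break
        m, s = mean(kept), stdev(kept)
        filtered = [d for d in kept if abs(d - m) <= n_sd * s]
        if len(filtered) == len(kept):
            break
        kept = filtered
    return kept

def copy_numbers(depths, control):
    """Scale depths so that the mean control window has copy number 2 (diploid)."""
    m = mean(control)
    return [2.0 * d / m for d in depths]

def segmental_duplications(windows, cn, cutoff, min_windows=5, min_span=10_000):
    """Call runs of >= min_windows consecutive windows with copy number > cutoff.

    windows: list of (start, end) genomic coordinates, ordered along one chromosome.
    A run is reported only if it spans at least min_span bases of genomic sequence.
    """
    calls, run = [], []
    for i, value in enumerate(cn + [float("-inf")]):  # sentinel closes the last run
        if value > cutoff:
            run.append(i)
            continue
        if len(run) >= min_windows:
            start, end = windows[run[0]][0], windows[run[-1]][1]
            if end - start >= min_span:
                calls.append((start, end))
        run = []
    return calls

# Toy example: 12 windows, each covering 2.5 kb of genomic sequence
windows = [(i * 2500, i * 2500 + 2500) for i in range(12)]
cn = [2.0, 2.0, 5.0, 5.0, 5.0, 5.0, 5.0, 2.0, 5.0, 5.0, 2.0, 2.0]
print(segmental_duplications(windows, cn, cutoff=3.0))
```

In a real analysis the cutoff would be derived from the control regions, for example as their mean copy number plus a multiple of the standard deviation.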
The approach described briefly above was optimized by the research group at the
comparative genomics laboratory under the direction of Tomas Marques-Bonet, who
pioneered these methods (Xue, et al. 2015; Prado-Martinez et al 2013; Tamazian et al
2014; Marques-Bonet and Eichler, 2009; Alkan, et al 2009) and is a collaborator of the
Dobzhansky Center. To illustrate this method, we show recent results for CNV annotation
from the genomes of five individuals of a novel antelope species (Figure 11; Table 4).
The distribution of copy numbers in control regions is close to normal with
expected values close to 2, as expected for a diploid genome (Figure 11). We found
459 regions of fixed duplications with a total length of 7,713,143 bp. The fixed
duplications comprise from 67.7 to 76% of the total segmental duplication length in each individual. We identified 265 common
autosomal genomic duplications with more than 10 copies, belonging to
about 222 annotated genes (Table 4). The methods for CNV analyses that we describe
have been validated in several published studies by our group (Tamazian et al 2014;
Dobrynin et al 2015; Cho et al 2014), thereby demonstrating that they are robust, reliable and
appropriate for genome-wide annotation studies, in particular for the Genome Russia
project.
Figure 11 Distribution of copy number values in control regions for five sable antelope individuals.

Table 4. Statistics on segmental duplications (SD)

Sample | Number of SD | Length of SD | Sample coverage
SB1954 | 588 | 10,494,879 | 7.65
SB134 | 573 | 10,257,894 | 8.02
SB2027 | 631 | 11,392,378 | 7.56
SB2130 | 574 | 10,116,594 | 7.54
SB1954 | 588 | 10,494,879 | 7.65

8. Haplotype Map constructions for Ethnic Russian population.
A haplotype map is a collection of common genetic variants that represent the genome
structure of a population. Associations between alleles at different loci along the
chromosome arise as a result of selection and/or demographic processes unique to the
population (Reich 2002). The Russian genome structure must have been shaped by a
unique interplay between adaptation, geographic isolation, demographic events,
migration, and admixture. The proposed map of haplotypes in Russia will show what
these variants are, where they are located along each of the chromosomes, how tightly
associated (linked) they are with each other, and how they are distributed among people
within and among populations in different parts of the Russian Federation. The resulting
map will provide information for the many studies to follow that will connect genomic
variants to the risk of human diseases, and help develop methods of diagnosis,
treatment and prevention. The Russian haplotype map will describe the common patterns of
variation, including associations between genetic markers, and will include tags that can
substitute for sequencing in future genome-wide association studies (GWAS).
A single nucleotide polymorphism, or SNP, is a site (locus) where homologous
chromosomes differ in their nucleotide bases. A sequence of alleles at SNPs observed
along the chromosome is referred to as a haplotype. Haplotypes differ between
populations because of the different histories of mutation, selection, drift and recombination
shaping the sequence of corresponding segments of DNA. The tendency of SNPs in
haplotypes to be inherited together is referred to as linkage disequilibrium or LD (Reich
2001). LD has a practical application: genotyping a subset of marker SNPs in a
region provides enough information to predict the remainder of the common SNPs in that
region of the chromosome, so that a limited number of 'tag' SNPs can identify each of the
common haplotypes in a region. Haplotypes commonly occur in a block pattern: the
chromosome region of each block has several common haplotypes and is separated from the
following block by a recombination hotspot, although longer-distance haplotypes
sometimes span the shorter haplotypes of two adjoining blocks. A
haplotype map shows the coordinates of haplotype blocks, and lists and labels the tag SNPs
that define them in each population. Only these tag SNPs need to be genotyped, because the entire
haplotype sequence in each block can be recovered from only a few informative loci.
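The relationship between LD and tag SNPs described above can be illustrated with a minimal Python sketch. It computes the squared correlation (r^2) between pairs of biallelic SNPs from phased haplotypes and then picks tags greedily, so that every SNP is correlated with at least one tag. This is not the algorithm used by HaploView or Tagger; the function names, the 0/1 haplotype encoding and the r^2 threshold of 0.8 are assumptions made for the example.

```python
from itertools import combinations


def r_squared(haplotypes, i, j):
    """Squared correlation (r^2) between biallelic SNPs i and j.

    haplotypes: list of equal-length strings of '0'/'1' (phased chromosomes).
    """
    n = len(haplotypes)
    p_i = sum(h[i] == "1" for h in haplotypes) / n
    p_j = sum(h[j] == "1" for h in haplotypes) / n
    p_ij = sum(h[i] == "1" and h[j] == "1" for h in haplotypes) / n
    denom = p_i * (1 - p_i) * p_j * (1 - p_j)
    if denom == 0:  # at least one site is monomorphic in this sample
        return 0.0
    d = p_ij - p_i * p_j  # classical LD coefficient D
    return d * d / denom


def greedy_tag_snps(haplotypes, threshold=0.8):
    """Choose tag SNPs so that every SNP has r^2 >= threshold with a tag.

    Hypothetical greedy heuristic: repeatedly pick the untagged SNP that
    captures the most still-untagged SNPs (ties go to the lowest index).
    """
    n_snps = len(haplotypes[0])
    captures = {i: {i} for i in range(n_snps)}  # each SNP captures itself
    for i, j in combinations(range(n_snps), 2):
        if r_squared(haplotypes, i, j) >= threshold:
            captures[i].add(j)
            captures[j].add(i)
    untagged, tags = set(range(n_snps)), []
    while untagged:
        best = max(sorted(untagged), key=lambda s: len(captures[s] & untagged))
        tags.append(best)
        untagged -= captures[best]
    return tags
```

In this toy setting, two SNPs that always travel together on the same haplotypes collapse to a single tag, which is the sense in which a few informative loci can stand in for a whole block.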
Many studies show that the chromosomal distances spanned by SNP associations in
haplotype blocks are generally shorter in African populations (average ~11 kb),
intermediate in European and Asian populations (average ~22 kb), and relatively long in
the Native American populations that experienced the most recent founder event (Gabriel
et al., 2002). Tag-SNP transferability between populations is affected primarily by the
level of LD in the study population, although genetic similarity of the reference and study
populations remains important (Conrad et al., 2006). For example, when the
International HapMap reference panels were used for imputation, genotypes from European HGDP
samples were imputed with the highest accuracy, followed by samples from East Asia,
Central and South Asia, the Americas, Oceania, the Middle East, and Africa (Huang et al.,
2009). Therefore, the choice of reference panels for imputation in worldwide
populations should follow geographic groupings and use mixtures of reference
panels that maximize imputation accuracy (Huang et al., 2009). Russian populations will
therefore suffer in imputation accuracy, as most of the reference populations in the
1000 Genomes Project and the panels designed using the International HapMap are geographically
distant (Figure 1).
The Russian population, in addition to its unique SNPs, is generally expected to show
admixture between the European and the Asian genomes, and to have haplotype
lengths of the same order of magnitude. Therefore, once a haplotype map is constructed,
Russian researchers will immediately benefit, as most of the haplotypes for a disease
association study can be captured using no more than 300,000–1,000,000 tag SNPs (Gabriel et al.,
2002). For example, the custom European array from Affymetrix contains a total of
674,518 SNPs (Hoffman et al., 2011a), and the arrays optimized for East Asians, Africans,
and Latinos include 712,950 (EAS), 893,631 (AFR) and 817,810 (LAT) SNPs,
respectively (Table 5).
!
Table 5. The number of SNPs that are common to each of the arrays (modified from
Hoffman et al., 2011). Diagonal entries are the total number of SNPs in each array;
off-diagonal entries are the numbers shared pairwise between arrays.
Array               EUR        EAS        AFR        LAT
European (EUR)      674,518    386,841    384,966    434,028
East Asian (EAS)               712,950    303,850    314,794
African (AFR)                             893,631    574,940
Latino (LAT)                                         817,810
!
Once genome data are available, the haplotype blocks can be constructed and the tag
SNPs provided for the Russian analog of the arrays in Table 5. Using a trio (child and
parents) design for sequencing will give an additional advantage in phasing haplotypes in
the sequenced individuals. A number of standard software packages compute
haplotype blocks and identify tag SNPs relevant to the population. To name a
few, HaploView (Barrett et al., 2005) is designed as a common interface to compute and
visualize several tasks, including LD and haplotype block analysis and haplotype population
frequency estimation, and programs like Tagger (de Bakker et al., 2005) can produce a list
of tag SNPs and corresponding statistical tests to capture all variants of interest, and a
summary coverage report of the selected tag SNPs.
From the perspective of GWAS analysis, once a haplotype map is available, a
custom array can be used to scale up genome association studies. Given the information from
this development, a large number of SNPs without any genotype data can be imputed
from the tag SNPs by substituting information from locally relevant haplotype blocks
specific to the Russian population (Marchini and Howie, 2010). This information will
help scale up association studies looking for Russian-specific determinants of human
disease (Johnson et al., 2001).
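The idea of imputing untyped SNPs from tag SNPs can be sketched with a toy example in Python. The sketch below fills in the untyped positions of one haplotype by looking up reference-panel haplotypes that carry the same alleles at the tag positions and taking the most common allele among them. Production imputation software relies on probabilistic models rather than exact matching, so this is only an illustration of the principle; the function name, the string encoding of haplotypes and the '?' placeholder are assumptions.

```python
from collections import Counter


def impute_from_tags(reference, tag_positions, observed_tags):
    """Fill in untyped SNPs of one haplotype from a reference panel.

    reference:      list of equal-length '0'/'1' strings (phased panel).
    tag_positions:  indices of the genotyped tag SNPs.
    observed_tags:  alleles observed at those positions, in the same order.
    Returns the imputed haplotype; untyped sites are '?' when no panel
    haplotype matches the observed tag alleles.
    """
    matches = [h for h in reference
               if all(h[p] == a for p, a in zip(tag_positions, observed_tags))]
    observed = dict(zip(tag_positions, observed_tags))
    imputed = []
    for pos in range(len(reference[0])):
        if pos in observed:
            imputed.append(observed[pos])  # genotyped directly
        elif matches:
            # most frequent allele among matching panel haplotypes
            allele, _count = Counter(h[pos] for h in matches).most_common(1)[0]
            imputed.append(allele)
        else:
            imputed.append("?")
    return "".join(imputed)
```

The example also shows why a locally relevant panel matters: when the panel contains no haplotype matching the observed tags, nothing can be predicted at the untyped sites.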
Using our genome data we will be able to identify local haplotype structure and
ethnic admixture, and compare them to existing databases such as the International
HapMap and 1000 Genomes. Using trios will identify common patterns of
recombination in the local population and describe major haplotype blocks that are
relevant to the Russian Federation.
The aim of the Genome Russia haplotype map is to find and characterize local patterns
of chromosomal variation specific to the Russian Federation, by identifying common
endemic sequence variants, their frequencies and the correlations between them along the
chromosomes in Russian populations. As a result, the project will provide genome tools
that could be used to bring specific benefits of having a relevant haplotype map with
markers that will allow scaling up of local association studies that evaluate risk or protection
polymorphisms in functional, medically relevant genes within candidate regions
suggested by either classical family-based linkage analysis or whole genome scans.
In other words, the major practical value of a haplotype map is that it eliminates the need
for base-by-base sequencing of each human genome by providing informative variants
that can be genotyped to predict entire fragments of chromosomes.
!
9. Interpreting Russian History population exchanges using
phylogeography
!
1) Population genetics and phylogeography. To decipher historical migratory routes
and settlement of humankind in Russia, Europe and Asia, we aim to use reliable
population genetic and bioinformatic techniques that allow us to estimate the ancestry of
defined populations, infer routes of migration, and date them using coalescent
modeling and the allele frequency spectra of populations. The analysis will be
performed using whole-genome SNP datasets and sets of SNPs from the Y chromosome
to infer differences in the history of population divergence and migrations. Unlike
autosomal sequences, mitochondrial DNA is inherited exclusively through females and
never through males. By contrast, Y chromosome haplotypes are passed exclusively
from fathers to sons and never to daughters. Reconstructed sequences of full mitochondrial
genomes will be used for precise coalescent modeling of the genetic history of this
maternally transmitted organelle and for comparison with Y-chromosome history. The
main methods we plan to use for the analysis are:
a) Ancestry inference based on clustering algorithms. Population clustering based
on genomic ancestry will be performed using the ADMIXTURE (Alexander et al., 2009)
and fastSTRUCTURE (Raj et al., 2014) software tools. These methods identify the
ancestry of the studied populations at different hierarchical levels and estimate admixture
between the populations. Both methods have been extensively used and improved
recently.
b) Principal component analysis. Principal component analysis (PCA) provides
an opportunity to infer population structure without assuming any demographic model.
The method was first applied to human populations in a study of human gene
frequencies in Europeans (Menozzi et al., 1978). Two software implementations of PCA
have been widely used for population genomic data analysis: EIGENSOFT/EIGENSTRAT
(Price et al., 2006; Patterson et al., 2006), which is a stand-alone software
package, and SNPRelate (Zheng et al., 2012), which is part of the Bioconductor framework
of R bioinformatics packages. Principal component analysis of single nucleotide variants
(SNVs) identified in Russian populations will be used to infer the structure of Russian
populations and investigate Russian genomic diversity compared to other populations.
PCA is a computationally efficient procedure and can be parallelized for optimal
processing of thousands of genomes.
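As an illustration of the PCA step, the following minimal Python sketch centers a small genotype matrix (individuals by SNPs, coded as 0/1/2 allele counts) and extracts the first principal axis by power iteration on the individual-by-individual covariance. It is not the EIGENSOFT or SNPRelate implementation: the function name, the starting vector and the fixed number of iterations are assumptions, and tools such as EIGENSOFT typically also scale each SNP by a function of its allele frequency, which is omitted here.

```python
import math


def first_principal_axis(genotypes, iterations=200):
    """Coordinates of individuals on the first principal axis (unit length).

    genotypes: list of rows, one per individual, each a list of 0/1/2
    allele counts per SNP. Pure-Python power iteration, for illustration.
    """
    n_ind, n_snp = len(genotypes), len(genotypes[0])
    means = [sum(row[j] for row in genotypes) / n_ind for j in range(n_snp)]
    centered = [[row[j] - means[j] for j in range(n_snp)] for row in genotypes]
    # Gram matrix between individuals; its top eigenvector gives PC1.
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered]
            for r1 in centered]
    vec = [float(i + 1) for i in range(n_ind)]  # arbitrary starting vector
    for _ in range(iterations):
        nxt = [sum(g * v for g, v in zip(row, vec)) for row in gram]
        norm = math.sqrt(sum(x * x for x in nxt))
        if norm == 0.0:  # no variation at all in the data
            break
        vec = [x / norm for x in nxt]
    return vec
```

With two genetically distinct groups in the input, individuals from the two groups are expected to fall on opposite sides of zero along this axis; the overall sign of the axis is arbitrary.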
!
c) SNP-based phylogenetic analyses (ME, MP, ML and Bayesian). To precisely infer the
history of population divergence, classical Neighbor-Joining trees (Saitou and Nei, 1987)
with extensive bootstrap testing will be produced based on matrices of genetic distances
(Nei's Da (Nei et al., 1983) and Dps (Bowcock et al., 1994)), both for individuals and
populations.
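The distance matrices that feed a Neighbor-Joining tree can be sketched as follows. The Python example computes Nei's Da distance between populations from per-locus allele frequencies, using Da = 1 - (1/L) * sum over loci and alleles of sqrt(x * y), where x and y are the frequencies of the same allele in the two populations and L is the number of loci. Only the distance step is shown; the tree building and bootstrap steps are left to dedicated software. The function names and the input layout are assumptions made for the example.

```python
import math


def nei_da(freqs_x, freqs_y):
    """Nei's Da distance between two populations.

    freqs_x, freqs_y: one list per locus of allele frequencies; alleles are
    listed in the same order in both populations and sum to 1 per locus.
    """
    shared = sum(math.sqrt(x * y)
                 for locus_x, locus_y in zip(freqs_x, freqs_y)
                 for x, y in zip(locus_x, locus_y))
    return 1.0 - shared / len(freqs_x)


def distance_matrix(populations):
    """Pairwise Da distances for a dict {population name: allele frequencies}."""
    names = sorted(populations)
    return {(a, b): nei_da(populations[a], populations[b])
            for a in names for b in names}
```

Populations with identical allele frequencies have a distance of zero, and populations fixed for different alleles at every locus have a distance of one.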
d) Demographic inference from genetic data, based on diffusion approximations to
the allele frequency spectrum. The ∂a∂i software (Gutenkunst et al., 2009) implements
methods for inferring demographic history and selection from genetic data, based on
diffusion approximations to the allele frequency spectrum. One of ∂a∂i's main benefits is
speed: fitting a two-population model typically takes around 10 minutes, and run time is
independent of the number of SNPs in the data set. ∂a∂i is also flexible, handling up to
three simultaneous populations, with arbitrary time courses for population size and
migration, plus the possibility of admixture and population-specific selection. The method
has been applied to human and animal populations (Zhao et al., 2013; Lam et al., 2010;
Gravel et al., 2011) because it allows fast and accurate inference; however, the number of studied
populations is limited to three.
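The allele frequency spectrum that this kind of inference starts from is a simple table of counts, which is one way to understand why run time need not grow with the number of SNPs: however many SNPs are observed, they are summarized in a table whose size depends only on the sample sizes. The Python sketch below builds a joint (two-population) spectrum from derived-allele counts. It does not use the ∂a∂i API; the function name and input layout are assumptions made for the example.

```python
def joint_sfs(counts_pop1, counts_pop2, n1, n2):
    """Joint (two-dimensional) site frequency spectrum for two populations.

    counts_pop1[k], counts_pop2[k]: derived-allele counts of SNP k in samples
    of n1 and n2 chromosomes. Entry [i][j] of the result is the number of
    SNPs seen i times in population 1 and j times in population 2.
    """
    sfs = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i, j in zip(counts_pop1, counts_pop2):
        sfs[i][j] += 1
    return sfs
```

The table has (n1 + 1) by (n2 + 1) entries regardless of how many SNPs were counted into it.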
e) Detailed inference of demographic and migration events using the Multiple Sequentially
Markovian Coalescent (MSMC). The first method for inferring the demographic
history of a population from one whole genome sequence, the Pairwise Sequentially Markovian
Coalescent (PSMC), was introduced by Li and Durbin (2011), but the method could not
reliably infer the parameters of recent population history (more recent than 10-20 kyr ago). The
recent extension of the PSMC method, called MSMC (Schiffels and Durbin, 2014),
resolves the problem of the low number of recent coalescent events using information from
multiple genomes. The method also allows accurate inference of the divergence time of
populations, taking into account both migration and recombination events in a whole-genome coalescent framework.
f) Estimation of population tree with admixture. TreeMix software is a novel
method that uses large numbers of SNPs to estimate the historical relationships among
populations, using a graph representation that allows both population splits and migration
events (Pickrell and Pritchard, 2012).
!
2) Signatures of positive selection in the human genome during recent evolution.
During the history of settlement, humans became adapted to local climate, diet, altitude,
pathogens and other factors (Tishkoff, 2015; Fumagalli et al., 2011; Kamberov et al.,
2013; Engelken et al., 2014). These adaptations play an important role in the health of locally
adapted populations and can be used for the development of new drugs and personalized medicine
(Fumagalli et al., 2015). Exploring recently selected mutations is an important part of
modern human genomics (Nielsen et al., 2007; Coop et al 2009). Many Russian ethnic
minorities live in specific environments, such as the cold Siberian climate or the high-altitude
Caucasus mountains, or have unique dietary habits, which can be associated with specific genomic
variants. We aim to exploit the full spectrum of modern genomic techniques for inferring
loci that were affected by positive selection in different Russian populations, including:
!
a) Changes in the shape of the frequency distribution (spectrum) of genetic
variation. Once a selective sweep reduces variability around a selected site, new
mutations will gradually appear. New mutations would initially occur at low
frequencies because their chances of increasing in a population under neutral drift are
very low, and it takes some time after the sweep to restore a more typical distribution
of mutation frequencies in a region (a frequency spectrum) that is consistent with the
action of neutral forces. This shift to a low-frequency spectrum of polymorphism
constitutes a signature of positive selection (Tajima 1989). Alternatively, balancing
selection maintains a high proportion of the high-frequency polymorphisms, thereby
shifting the spectrum to the intermediate frequencies (Oleksyk et al., 2010).
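The frequency-spectrum signature described above is commonly summarized by Tajima's D (Tajima 1989), which compares the mean number of pairwise differences with the number of segregating sites. A minimal Python sketch under the standard formulas is given below: negative values indicate an excess of low-frequency variants, positive values an excess of intermediate-frequency variants. The function name, the string encoding of haplotypes, and the conventions of returning None when there are no segregating sites and requiring at least four sequences are assumptions made for the example.

```python
import math
from itertools import combinations


def tajimas_d(haplotypes):
    """Tajima's D for a sample of aligned haplotypes (equal-length strings)."""
    n = len(haplotypes)
    if n < 4:
        raise ValueError("need at least 4 sequences for a defined variance")
    # Columns with more than one allele are segregating sites.
    s = sum(1 for col in zip(*haplotypes) if len(set(col)) > 1)
    if s == 0:
        return None  # D is undefined without polymorphism
    n_pairs = n * (n - 1) / 2
    pi = sum(sum(a != b for a, b in zip(h1, h2))
             for h1, h2 in combinations(haplotypes, 2)) / n_pairs
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)
    return (pi - s / a1) / math.sqrt(e1 * s + e2 * s * (s - 1))
```

In a genome scan, such a statistic would typically be computed in windows along each chromosome and compared against the genome-wide distribution rather than interpreted in isolation.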
b) Differentiating between populations (Fst). Variation of local conditions imposes
differential selection pressures shaping variable adaptive landscapes (Wright 1951).
Recent adaptations in populations often reflect the peculiarities of local environments.
Local conditions are different from one locality to another and differ considerably
between ecosystems. In some instances, given enough geographical isolation
restricting gene flow, selection signatures could differ considerably between
populations. Consequently, regions experiencing selective sweeps, in addition to the
decreased variation within the population, should also display increased levels of
population differentiation, a measure commonly denoted as Fst (Wright 1951). Tests
that look for population differentiation are based on the premise that natural selection
can change the amount of differentiation between different populations of a species.
Unless a selective sweep has already spread to all populations, the amount of genetic
differentiation within the region that includes the selected locus will increase. Therefore,
if genetic differentiation in a genomic region is greater than the level expected under
neutrality, this differentiation may be a consequence of natural selection (Oleksyk et
al., 2010).
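The differentiation test can be sketched for the simplest case of one biallelic SNP typed in two populations of equal weight, using Wright's definition Fst = (H_T - H_S) / H_T, where H_T is the expected heterozygosity of the pooled population and H_S is the mean expected heterozygosity within populations. The Python example below is an illustration only: the function names and the fixed outlier threshold are assumptions, and in practice the neutral expectation would be taken from the empirical genome-wide distribution.

```python
def fst_biallelic(p1, p2):
    """Wright's Fst at one biallelic SNP for two equally weighted populations.

    p1, p2: frequency of the same allele in population 1 and population 2.
    """
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)  # heterozygosity of the pooled population
    if h_t == 0:  # allele fixed or absent in both populations
        return 0.0
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population
    return (h_t - h_s) / h_t


def fst_outliers(freqs_pop1, freqs_pop2, threshold):
    """Indices of SNPs whose Fst is at or above a chosen threshold."""
    return [k for k, (p1, p2) in enumerate(zip(freqs_pop1, freqs_pop2))
            if fst_biallelic(p1, p2) >= threshold]
```

SNPs with equal frequencies in both populations give Fst = 0, and SNPs fixed for different alleles give Fst = 1.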
!
c) Extended linkage disequilibrium segments. Historic selective sweeps in
population data are apparent because of a hitchhiking effect described by Maynard Smith & Haigh (1974). As selection acts not on genotypes but on individuals carrying
adaptive phenotypes that gain reproductive advantage, beneficial mutations, along
with the entire genomes, are selected. However, independent assortment and
recombination reshuffle chromosomes and regions distal to a selected beneficial
variant. A selective sweep region would contain many neutral variants tightly linked to
the beneficial mutation on haplotypes limited in length by a combination of selection
strength and recombination rate. The extent of this association depends on the
recombination distance, so persistence of a frequent, unusually long haplotype
indicates strong, recent or ongoing selection, especially if that haplotype has risen to
high frequency. Over many generations, haplotype size becomes smaller owing to
recombination with other haplotypes (Oleksyk et al., 2010).
IV Personnel and Collaborators of Genome Russia.
Genome Russia is a dynamic consortium that is expanding on a regular basis to
acquire and leverage the best possible expertise. Active collaborators and researchers are
listed in:
1. Appendix 5 Dobzhansky Center
2. Appendix 6 Russian Collaborators
3. Appendix 15 International Collaborators
V Workflow and Time Table of Genome Russia.
!
Flow diagrams of the projected workflow and time table are presented in Appendices 11 and
12.
!
VI Foreseeable benefits of the Genome Russia Project to
SPSU, to Russia, and the world.
!
Owing to recent advances in next-generation sequencing technologies and their moderate
costs, a project like Genome Russia can be much less expensive than previous
large-scale projects (e.g., the 1000 Genomes Project), and can bring numerous benefits to our
understanding of population origins and disease on local, global and evolutionary scales.
Here is how:
First, low-frequency and local variants discovered in population genome projects can
be used to screen individuals with genetic disorders in genome-wide association studies
(GWAS), in clinical trials, and in genome assessment of proliferating cancer cells
(McVean, 2012). Russian biomedical researchers will receive the immediate benefit of an
information resource, thereby building a baseline for future studies, including advances
in precision/personalized medicine.
!
Second, the history of population admixture of the Russian people can bring forth many
interesting insights. The modern Russian population comprises a melting pot with genetic
contributions from three main ancestral ethnicities: European (Slavic, Baltic and
Germanic), Uralic (Finno-Hungarian), and Altaian (Turkic), with a possible addition of
traces from peoples that occupied the Eurasian Arctic and Siberia in the past (Figure 1c).
As yet, this history of genome admixture has not been well documented, and presents a
new and unique opportunity to study population history in the wake of the great human
migrations, the Black Death, the Great Silk Road diaspora, or recent demographic
perturbations such as the siege of Leningrad (Smirnova, 2015).
!
Third, an admixture history combined with the diverse environments faced by the local
populations in Russia creates a unique opportunity for disease gene discoveries using
mapping by admixture linkage disequilibrium, or admixture mapping (Smith and O’Brien, 2005).
This approach can be more powerful than a GWAS in homogeneous panmictic
populations and has been used to discover a number of health-related mutations, mostly
involving patients with Western European/West African or Western European/East Asian
admixture components (Deo et al 2007; Cheng et al 2009). Given the difference in
historic selection pressures, genome admixtures specific to Russia will contribute a wealth
of new information, bringing forth different risk and/or protective alleles that do not exist,
or are not associated with disease, elsewhere in the world.
!
Fourth, the studies of population ancestry and admixture in Russia would not be limited
to modern humans. Recent reports have uncovered details about when
Neanderthals and modern humans interbred and have even suggested important disease-fighting
genes derived from those prehistoric encounters (Green et al., 2010; Abi-Rached
et al., 2011; Sankararaman et al., 2014; Slatkin et al 2014). Much of the Neanderthal
heritage may still be unaccounted for, as recent reports keep discovering new genes
originating from these ancient admixture events, and the range of the Neanderthals is now
documented as far as the Altai Mountains in Siberia (Prüfer et al., 2014). The geographic
source of Denisovan DNA is also in Russia (Callaway, 2011), while its
contribution is mainly found in Melanesia. Given that most of the genetic landscape of
Russia has been little explored thus far, can we state with any certainty that another great
discovery is not hidden behind that great “wide gap” on the global genetic diversity map
(Figure 1 a, b)?
!
Fifth, engaging Russian scientists and communities in an international project like this
would help integrate them into the world genomics community. Scientific
output and training in Russia have diminished since the fall of the USSR in 1991, but the
sustained enormous intellectual potential has since become one of the world’s best-kept secrets.
Once Genome Russia formally contributes to the International 1000 Genomes Project, the
strict and widely agreed-upon ethical guidelines will illustrate the highest standards for
compliance. We suggest that it is imperative that genomic research in Russia adhere to
the prescient ethical standards recently developed across the international medical
genomics community. Genome Russia would also be built upon the open access
philosophy, a trend that is gaining momentum, but has been seen by many with suspicion,
as trust between Russia and Western governments has become challenged by recent
political exchanges.
!
The justifications for collecting, sequencing and analyzing populations from Russia in the
immediate rather than distant future all recognize the enormous significance of these
populations in the history of humankind and their value as a reservoir of knowledge about
human health. Without filling the great “wide gap” on the genetic map of the world, we
will be greatly handicapped in our further genomic endeavors and understanding. The
beginnings of such a Genome Russia Project are happening with a national enthusiasm
endorsed by the Russian Academy of Sciences, the Russian Ministry of Education and
Science, and the central Russian government in a concerted effort to make it happen
(http://genomerussia.bio.spbu.ru/?lang=en). While political diplomacy continues, the
Genome Russia Project can and should become an example of international collaboration
on common ground and with the common goal of improving human health and
well-being.
!
!
!
VII References
!
1.
Abi-Rached, L., Jobin, M. J., Kulkarni, S., McWhinnie, A., Dalva, K., Gragert, L.,
… Parham, P. (2011). The Shaping of Modern Human Immune Systems by
Multiregional Admixture with Archaic Humans. Science , 334 (6052 ), 89-94.
2.
Akst, J. 100,000 British Genomes: A new initiative led by the UK's National
Health Service aims to sequence the genomes of as many as 100,000 patients, a
project that will cost £100 million. The Scientist. December 10, 2012
(http://www.the-scientist.com/?articles.view/articleNo/33622/title/100-000-British-Genomes/).
3.
Allentoft M, Sikora M, Sjögren K, Rasmussen S, Rasmussen M, Stenderup J,
Damgaard P, Schroeder H, Ahlström T, Vinner L et al. (2015) Population
genomics of Bronze Age Eurasia. Nature. Vol. 522. No. 7555. P. 167-172.
4.
Alexander, D. H., Novembre, J. & Lange, K. Fast model-based estimation of
ancestry in unrelated individuals. Genome Res. 19, 1655–1664 (2009).
5.
Alkan C, Kavak P, Somel M, Gokcumen O, Ugurlu S, Saygi C, Dal E, Bugra K,
Güngör T, Sahinalp SC, Özören N, Bekpen C. (2014) Whole genome sequencing
of Turkish genomes reveals functional private alleles and impact of genetic
interactions with Europe, Asia and Africa. BMC Genomics. 15:963.
6.
Alkan, C., Jeffrey M. Kidd, Tomas Marques-Bonet, Gözde Aksay, Fereydoun
Hormozdiari, Francesca Antonacci, Carl Baker, Onur Mutlu, S. Cenk Sahinalp,
Richard A. Gibbs, Evan E. Eichler.”Personalized Copy-Number and Segmental
Duplication Maps using Next-Gen Sequencing Technology.” Nature Genet.
2009 Oct;41(10):1061-7
7.
Anderson, A. Macrogen, Seoul National University Team Spells Out Upcoming
Stages of Asian Genome Project. GenomeWeb. Oct 07, 2014.
https://www.genomeweb.com/sequencing/macrogen-seoul-national-university-team-spells-out-upcoming-stages-asian-genome
8.
Andrews, S. "FastQC: A quality control tool for high throughput sequence
data." (2010). http://www.bioinformatics.babraham.ac.uk/projects/fastqc/.
9.
Angiuoli, Samuel V Malcolm Matalka, Aaron Gussman, Kevin Galens, Mahesh
Vangala, David R Riley, Cesar Arze, James R White, Owen White, W Florian
Fricke. CloVR: A virtual machine for automated and portable sequence analysis
from the desktop using cloud computing. BMC Bioinformatics. 2011; 12: 356.
Published online 2011 August 30. doi: 10.1186/1471-2105-12-356
10.
Auton, A., Bryc, K., Boyko, A. R., Lohmueller, K. E., Novembre, J., Reynolds, A.,
… Bustamante, C. D. (2009). Global distribution of genomic diversity underscores
rich complex history of continental human populations. Genome Research, 19(5),
795-803.
11.
Auton A, Abecasis GR, The 1000 Genomes Consortium (2015) Global reference
for human genetic variation. Nature. 2015 (in press)
12.
Balanovskaia EV, Balanovskiĭ OP, Spitsyn VA, Bychkovskaia LS, Makarov SV,
Paĭ GV, Rusakov AE, Subbota DS. (2001a) The Russian gene pool.
Genogeography of serum genetic markers (HP, GC, PI, TF). Genetika. 37(8):
1125-37. Russian.
13.
Balanovskaia EV, Balanovskiĭ OP, Spitsyn VA, Bychkovskaia LS, Makarov SV,
Paĭ GV, Subbota DS. (2001b) The Russian gene pool. Genogeography of
erythrocyte genetic markers (ACP1, PGM1, ESD, GLO1, 6-PGD). Genetika.
37(8):1138-51. Russian.
14.
Barrett JC, Fry B, Maller J, Daly MJ. (2005) Haploview: analysis and
visualization of LD and haplotype maps. Bioinformatics. 2005 21(2):263-5.
15.
Belyaeva O, Bermisheva M, Khrunin A, Slominsky P, Bebyakova N,
Khusnutdinova E, Mikulich A, Limborska S. Mitochondrial DNA variations in
Russian and Belorussian populations. Hum Biol. 2003 Oct;75(5):647-60
16.
Bentley D. R. et al. Accurate whole human genome sequencing using reversible
terminator chemistry. Nature. 2008. Vol. 456. No. 7218. P. 53-59.
17.
Bowcock, A. M. et al. High resolution of human evolutionary trees with
polymorphic microsatellites. Nature 368, 455–457 (1994).
18.
Callaway, E. (2011) Ancient DNA reveals secrets of human history: Modern
humans may have picked up key genes from extinct relatives. Nature 476,
136-137.
19.
Cheng CY, Kao WH, Patterson N, Tandon A, Haiman CA, Harris TB, Xing C,
John EM, Ambrosone CB, Brancati FL, Coresh J, Press MF, Parekh RS, Klag MJ,
Meoni LA, Hsueh WC, Fejerman L, Pawlikowska L, Freedman ML, Jandorf LH,
Bandera EV, Ciupak GL, Nalls MA, Akylbekova EL, Orwoll ES, Leak TS,
Miljkovic I, Li R, Ursin G, Bernstein L, Ardlie K, Taylor HA, Boerwinckle E,
Zmuda JM, Henderson BE, Wilson JG, Reich D. Admixture mapping of 15,280
African Americans identifies obesity susceptibility loci on chromosomes 5 and X.
PLoS Genet. 2009 May;5(5):e1000490. doi: 10.1371.
20.
Cingolani, Pablo, Adrian Platts, Le Lily Wang, Melissa Coon, Tung Nguyen, Luan
Wang, Susan J. Land, Xiangyi Lu, and Douglas M. Ruden. "A program for
annotating and predicting the effects of single nucleotide polymorphisms, SnpEff:
SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3." Fly 6,
no. 2 (2012): 80-92.
21.
Conrad DF, Jakobsson M, Coop G, Wen X, Wall JD, Rosenberg NA, Pritchard JK.
(2006). A worldwide survey of haplotype variation and linkage disequilibrium in
the human genome. Nat Genet 38:1251–1260
22.
Coop, G. et al. The Role of Geography in Human Adaptation. PLoS Genet. 5,
e1000500 (2009).
23.
Cho, Yun Sung, Li Hu, Haolong Hou, Hang Lee, Jiaohui Xu, Soowhan Kwon, Sukhun Oh,
Hak-Min Kim, Sungwoong Jho, Sangsoo Kim, Tae Hyung Kim, Shu-Jin Luo, Warren
Johnson, Sunghoon Lee, Young-Ah Shin, Qian Zhou, Byung Chul Kim, Hyunmin Kim,
Chang-uk Kim, Hyun-Ju Jung, Xiao Xu, Pryivrat Gadhvi, Pengwei Xu, Yingqi Xiong,
Yadan Luo, Shengkai Pan, Caiyun Gou, Xiuhui Chu, Jilin Zhang, Sanyang Liu, Jing He,
Ying Chen, Linfeng Yang, Yulan Yang, Jiaju He, Sha Liu, Junyi Wang, Chul Hong Kim,
Jong-Soo Kim, Seungwoo Hwang, Junsu Ko, Chang-Bae Kim, Sangtae Kim, Damdin
Bayarlkhagva, Woon Kee Paek, Seong-Jin Kim, Stephen J. O’Brien, Jun Wang, and
Jong Bhak. (2013). The tiger genome and comparative analysis with other feline
genomes. Nature Communications 4: 2433.
24.
de Bakker, P.I.W., R. Yelensky, I. Pe'er, S.B. Gabriel, M.J. Daly, D. Altshuler
(2005) Efficiency and power in genetic association studies. Nature Genetics. 37:
1217-1223.
25.
Deo RC, Patterson N, Tandon A, McDonald GJ, Haiman CA, Ardlie K, Henderson
BE, Henderson SO, Reich D. (2007) A high-density admixture scan in 1,670
African Americans with hypertension. PLoS Genet. 3(11): e196.
26.
DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis
AA, del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM,
Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ. A framework for
variation discovery and genotyping using next-generation DNA sequencing data.
Nat Genet. 2011 May;43(5):491-8. doi: 10.1038/ng.806. Epub 2011 Apr 10.
PubMed PMID: 21478889; PubMed Central PMCID: PMC3083463.
27.
Dobrynin, Pavel, Shiping Liu, Aleksey Komissarov, Gaik Tamazian, Alexey
Makunin, Ksenia Krasheninnikova, Andrey Yurchenko, Sergey Kliver,
Vladimir Brukhin, Klaus-Peter Koepfli, Warren Johnson, Lukas Kuderna,
Raquel García-Pérez, Marc de Manuel, Ricardo Godinez, Weilin Qiu, Long
Zhou, Fang Li, Jian Yi, Carlos Driscoll, Agostinho Antunes, Taras K.
Oleksyk, Eduardo Eizirik, Polina Perelman, David Wildt, Mark
Diekhans, Tomas Marques-Bonet, Anne Schmidt-Kuntzel, Laurie
Marker, Jong Bhak, Wang Jun, Zijun Xiong, Guojie Zhang &
Stephen J O’Brien. Genomic Legacy of the African Cheetah, Acinonyx jubatus.
Genome Biology, in press.
28.
Drmanac R. et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science. 2010. Vol. 327. No. 5961. P. 78-81.
29.
Engelken, J. et al. Extreme Population Differences in the Human Zinc Transporter
ZIP4 (SLC39A4) Are Explained by Positive Selection in Sub-Saharan Africa.
PLoS Genet. 10, e1004128 (2014).
30.
Filippova IN, Khrunin AV, Limborska SA. Analysis of DNA variations in GSTA
and GSTM gene clusters based on the results of genome-wide data from three
Russian populations taken as an example. BMC Genet. 2012;13:89.
31.
Flegontova OV, Khrunin AV, Lylova OI, Tarskaia LA, Spitsyn VA, Mikulich AI,
Limborska SA. Haplotype frequencies at the DRD2 locus in populations of the
East European Plain. BMC Genet. 2009;10:62.
32.
Fox EJ, Reid-Bayliss KS, Emond MJ, Loeb LA. Accuracy of Next Generation
Sequencing Platforms. Next Generat Sequenc & Applic. 2014; 1(1):106.
33.
Francioli et al. (2014) Genome of the Netherlands Consortium. Whole-genome
sequence variation, population structure and demographic history of the Dutch
population. Nat Genet.; 46(8):818-25.
34.
Fu Q, Li H, Moorjani P, Jay F, Slepchenko SM, Bondarev AA, Johnson PL,
Aximu-Petri A, Prüfer K, de Filippo C, Meyer M, Zwyns N, Salazar-García DC,
Kuzmin YV, Keates SG, Kosintsev PA, Razhev DI, Richards MP, Peristov NV,
Lachmann M, Douka K, Higham TF, Fumagalli M. et al. Signatures of
environmental genetic adaptation pinpoint pathogens as the main selective
pressure through human evolution //PLoS Genet. – 2011. – Т. 7. – №. 11. – С.
e1002355.
35.
Fumagalli, M. et al. Signatures of Environmental Genetic Adaptation Pinpoint
Pathogens as the Main Selective Pressure through Human Evolution. PLoS Genet.
7, e1002355 (2011).
36.
Fumagalli M, Moltke I, Grarup N, Racimo F, Bjerregaard P, Jørgensen ME,
Korneliussen TS, Gerbault P, Skotte L, Linneberg A, Christensen C, Brandslund I,
Jørgensen T, Huerta-Sánchez E, Schmidt EB, Pedersen O, Hansen T, Albrechtsen
A, Nielsen R. Greenlandic Inuit show genetic signatures of diet and climate
adaptation. (2015) Science. 349:1343-7.
37.
Gabriel, S.B. et al. (2002) The structure of haplotype blocks in the human genome.
Science 296, 2225–2229
38.
Genovese G, Handsaker RE, Li H, Altemose N, Lindgren AM, Chambert K,
Pasaniuc B, Price AL, Reich D, Morton CC, Pollak MR, Wilson JG, McCarroll
SA. (2013) Using population admixture to help complete maps of the human
genome. Nat Genet. Apr;45(4):406-14, 414e1-2.
39.
Gilissen, C., Hehir-Kwa, J. Y., Thung, D. T., van de Vorst, M., van Bon, B. W.,
Willemsen, M. H., ... & Veltman, J. A. (2014). Genome sequencing identifies
major causes of severe intellectual disability. Nature. 511, 344–347.
40.
Gravel, S. et al. Demographic history and rare allele sharing among human
populations. Proc. Natl. Acad. Sci. 108, 11983–11988 (2011).
41.
Green, R. E., Krause, J., Briggs, A. W., Maricic, T., Stenzel, U., Kircher, M., …
Pääbo, S. (2010). A Draft Sequence of the Neandertal Genome. Science, 328(5979),
710-722.
42.
Gudbjartsson, D. F., Helgason, H., Gudjonsson, S. A., Zink, F., Oddson, A.,
Gylfason, A., … Stefansson, K. (2015). Large-scale whole-genome sequencing of
the Icelandic population. Nat Genet, advance online publication.
43.
Guijarro, Manuel; Ruben Gaspar et al. (2008). "Experience and Lessons learnt
from running High Availability Databases on Network Attached Storage" (PDF).
Journal of Physics: Conference Series (IOP Publishing) 119 (4): 042015.
doi:10.1088/1742-6596/119/4/042015
44.
Gurdasani D, Carstensen T, Tekola-Ayele F, Pagani L, Tachmazidou I,
Hatzikotoulas K, Karthikeyan S, Iles L, Pollard MO, Choudhury A, Ritchie GR,
Xue Y, Asimit J, Nsubuga RN, Young EH, Pomilla C, Kivinen K, Rockett K,
Kamali A, Doumatey AP, Asiki G, Seeley J, Sisay-Joof F, Jallow M, Tollman S,
Mekonnen E, Ekong R, Oljira T, Bradman N, Bojang K, Ramsay M, Adeyemo A,
Bekele E, Motala A, Norris SA, Pirie F, Kaleebu P, Kwiatkowski D, Tyler-Smith
C, Rotimi C, Zeggini E, Sandhu MS. (2015)
The African Genome Variation Project shapes medical genetics in Africa. Nature,
517(7534):327-32.
45.
Gutenkunst, R. N., Hernandez, R. D., Williamson, S. H. & Bustamante, C. D.
Inferring the Joint Demographic History of Multiple Populations from
Multidimensional SNP Frequency Data. PLoS Genet. 5, e1000695 (2009).
46.
Hamosh A. et al. Online Mendelian Inheritance in Man (OMIM), a knowledgebase
of human genes and genetic disorders. Nucleic Acids Res. 2005;33(suppl 1):
D514-D517.
47.
Heger, M. BGI Plans to Launch Two NGS Systems This Year Based on Complete
Genomics Technology. GenomeWeb. January 14, 2015
(https://www.genomeweb.com/business-news/bgi-plans-launch-two-ngs-systems-year-based-complete-genomics-technology)
48.
Har'kov VN, Hamina KV, Medvedeva OF, Simonova KV, Eremina ER, Stepanov
VA. (2014)
Gene pool of Buryats: clinal variability and territorial subdivision based on data of
Y-chromosome markers. Genetika; 50(2): 203-13.
49.
Heath AP, Greenway M, Powell R, Spring J, Suarez R, Hanley D, Bandlamudi C,
McNerney ME, White KP, Grossman RL. Bionimbus: a cloud for managing,
analyzing and sharing large genomics datasets. J Am Med Inform Assoc. 2014
Nov-Dec;21(6):969-75. doi: 10.1136/amiajnl-2013-002155. Epub 2014 Jan 24.
50.
Hoffmann, TJ., Kvale MN., Hesselson, SE. et al. (2011a) Next generation
genome-wide association tool: design and coverage of a high throughput
European-optimized SNP array. Genomics. 98:79-89
51.
Hoffmann, TJ., Zhan Y., Kvale MN et al. (2011b) Design and coverage of high
throughput genotyping arrays optimized for individuals of East Asian, African
American, and Latino race/ethnicity using imputation and a novel hybrid SNP
selection algorithm. Genomics. 98:422-430.
52.
Huang L, Li Y, Singleton AB, Hardy JA, Abecasis G, Rosenberg NA, Scheet P.
(2009) Genotype-imputation accuracy across worldwide human populations. Am J
Hum Genet. 2009 Feb;84(2):235-50.
53.
Hui-Yuen, J., McAllister, S., Koganti, S., Hill, E., Bhaduri-McIntosh, S.
Establishment of Epstein-Barr Virus Growth-transformed Lymphoblastoid Cell
Lines. J. Vis. Exp. (57), e3321, doi:10.3791/3321 (2011).
54.
Johnson, G.C. et al. (2001) Haplotype tagging for the identification of common
disease genes. Nat. Genet. 29, 233–237.
55.
Jurka, J., Kapitonov, V.V., Pavlicek, A., Klonowski, P., Kohany, O., Walichiewicz,
J. (2005) Repbase Update, a database of eukaryotic repetitive elements.
Cytogenetic and Genome Research 110:462-467. http://www.girinst.org/repbase/
56.
Kanagawa T. Bias and artifacts in multitemplate polymerase chain reactions
(PCR). J Biosci Bioeng. 2003; 96: 317–323.
57.
Kamberov, Y. G. et al. Modeling Recent Human Evolution in Mice by Expression
of a Selected EDAR Variant. Cell 152, 691–702 (2013).
58.
Kent, W. James, Charles W. Sugnet, Terrence S. Furey, Krishna M. Roskin, Tom
H. Pringle, Alan M. Zahler, and David Haussler. "The human genome browser at
UCSC." Genome research 12, no. 6 (2002): 996-1006.
59.
Kharkov VN, Khamina KV, Medvedeva OF, Simonova KV, Khitrinskaya IY,
Stepanov VA. (2013) Gene pool structure of Tuvinians inferred from
Y-chromosome marker data. Genetika. 49(12): 1416-25.
60.
Khusnutdinova EK, Litvinov SS, Kutuev IA, Iunusbaev BB, Khusainova RI,
Akhmetova VL, Ahatova FS, Metspalu E, Rootsi S, Villems R. (2012) Gene pool
of ethnic groups of the Caucasus: results of integrated study of the Y chromosome
and mitochondrial DNA and genome-wide data. Genetika. 48(6):750-61. Russian.
61.
Khrunin AV, Tarskaia LA, Spitsyn VA, Lylova OI, Bebyakova NA, Mikulich AI,
Limborska SA. p53 polymorphisms in Russia and Belarus: correlation of the 2-1-1
haplotype frequency with longitude. Mol Genet Genomics. 2005; 272(6): 666-72.
62.
Khrunin AV, Bebiakova NA, Ivanov VP, Solodilova MA, Limborskaia SA.
Polymorphism of Y-chromosomal microsatellites in Russian populations from the
northern and southern Russia as exemplified by the populations of Kursk and
Arkhangel'sk Oblast. Genetika. 2005; 41(8):1125-31.
63.
Khrunin AV, Khokhrin DV, Limborskaia SA. Glutathione-S-transferase gene
polymorphism in Russian populations of European Russia. Genetika. 2008;44(10):
1429-34
64.
Khrunin A, Verbenko D, Nikitina K, Limborska S. Regional differences in the
genetic variability of Finno-Ugric speaking Komi populations. Am J Hum Biol.
2007;19(6):741-50.
65.
Khrunin A, Mihailov E, Nikopensius T, Krjutskov K, Limborska S, Metspalu A.
Analysis of allele and haplotype diversity across 25 genomic regions in three
Eastern European populations.
66.
Khrunin AV, Firsov SIu, Limborskaia SA. Polymorphisms of DNA repair genes
ERCC2 and XRCC1 in populations of Russia. Genetika. 2011;47(11):1565-8.
67.
Khrunin AV, Khokhrin DV, Filippova IN, Esko T, Nelis M, Bebyakova NA,
Bolotova NL, Klovins J, Nikitina-Zake L, Rehnström K, Ripatti S, Schreiber S,
Franke A, Macek M, Krulišová V, Lubinski J, Metspalu A, Limborska SA. (2013)
A genome-wide analysis of populations from European Russia reveals a new pole
of genetic diversity in northern Europe. PLoS One 8 (3) : e58552
68.
Krause, Johannes; Fu, Qiaomei; Good, Jeffrey M.; Viola, Bence; Shunkov,
Michael V.; Derevianko, Anatoli P. & Pääbo, Svante (2010), "The complete
mitochondrial DNA genome of an unknown hominin from southern
Siberia", Nature 464 (7290): 894–897
69.
Kushniarevich A, Utevska O, Chuhryaeva M, Agdzhoyan A, Dibirova K,
Uktveryte I, Möls M, Mulahasanovic L, Pshenichnov A, Frolova S, Shanko A,
Metspalu E, Reidla M, Tambets K, Tamm E, Koshel S, Zaporozhchenko V,
Atramentova L, Kučinskas V, Davydenko O, Goncharova O, Evseeva I,
Churnosov M, Pocheshchova E, Yunusbayev B, Khusnutdinova E, Marjanović D,
Rudan P, Rootsi S, Yankovsky N, Endicott P, Kassian A, Dybo A; Genographic
Consortium, Tyler-Smith C, Balanovska E, Metspalu M, Kivisild T, Villems R,
Balanovsky O. (2015) Genetic Heritage of the Balto-Slavic Speaking Populations:
A Synthesis of Autosomal, Mitochondrial and Y-Chromosomal Data. PLoS One 10
(9).
70.
Lam, H.-M. et al. Resequencing of 31 wild and cultivated soybean genomes
identifies patterns of genetic diversity and selection. Nat. Genet. 42, 1053–1059
(2010).
71.
Langmead B, Salzberg S. Fast gapped-read alignment with Bowtie 2. Nature
Methods. 2012, 9:357-359.
72.
Lee, D., Hormozdiari, F., Xin, H., Hach, F., Mutlu, O., & Alkan, C. (2015). Fast
and accurate mapping of Complete Genomics reads. Methods, 79, 3-10.
73.
Leslie, S., Winney, B., Hellenthal, G., Davison, D., Boumertit, A., Day, T., …
Bodmer, W. (2015). The fine-scale genetic structure of the British population.
Nature, 519(7543), 309-314.
74.
Li, Heng, Bob Handsaker, Alec Wysoker, Tim Fennell, Jue Ruan, Nils Homer,
Gabor Marth, Goncalo Abecasis, and Richard Durbin. "The sequence alignment/
map format and SAMtools." Bioinformatics 25, no. 16 (2009): 2078-2079.
75.
Li, H. & Durbin, R. Inference of human population history from individual
whole-genome sequences. Nature 475, 493–496 (2011).
76.
Limborska SA, Khrunin AV, Flegontova OV, Tasitz VA, Verbenko DA. Specificity
of genetic diversity in D1S80 revealed by SNP-VNTR haplotyping. Ann Hum
Biol. 2011;38(5):564-9.
77.
MacArthur, Daniel G., Suganthi Balasubramanian, Adam Frankish, Ni Huang,
James Morris, Klaudia Walter, Luke Jostins et al. "A systematic survey of
loss-of-function variants in human protein-coding genes." Science 335, no. 6070
(2012): 823-828.
78.
Makarov, NA, LA Belyaev, AV Engovatova (2015) Archaeology in Contemporary
Russia: Prospects and Challenges // Russian archaeologist. 2: 5-15 (Russian)
79.
Marques-Bonet, T. and E. Eichler. “The Evolution of Human Segmental
Duplications and the Core Duplicon Hypothesis.” Cold Spring Harb Symp Quant
Biol. 2009;74:355-62. Epub 2009 Aug 28.
80.
Marx, V (2015) The DNA of a Nation. Nature 524:503.
81.
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A,
Garimella K, Altshuler D, Gabriel S, Daly M, DePristo MA. The Genome
Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA
sequencing data. Genome Res. 2010 Sep;20(9):1297-303.
doi: 10.1101/gr.107524.110. Epub 2010 Jul 19. PubMed PMID: 20644199;
PubMed Central PMCID: PMC2928508.
82.
McLean, C. Y., Reno, P. L., Pollen, A. A., Bassan, A. I., Capellini, T. D., Guenther,
C., … Kingsley, D. M. (2011). Human-specific loss of regulatory DNA and the
evolution of human-specific traits. Nature, 471(7337), 216-219.
83.
McVean GA., The 1000 Genomes Consortium (2012) An integrated map of
genetic variation from 1,092 human genomes. Nature. 2012; 491(7422):56-65.
84.
Menozzi, P., Piazza, A. & Cavalli-Sforza, L. Synthetic maps of human gene
frequencies in Europeans. Science 201, 786–792 (1978).
85.
Mirabal S, Regueiro M, Cadenas AM, Cavalli-Sforza LL, Underhill PA, Verbenko
DA, Limborska SA, Herrera RJ. Y-chromosome distribution within the
geo-linguistic landscape of northwestern Russia. Eur J Hum Genet. 2009
Oct;17(10):1260-73.
86.
Molenaar, J. J., Koster, J., Zwijnenburg, D. A., van Sluis, P., Valentijn, L. J., van
der Ploeg, I., ... & Versteeg, R. (2012). Sequencing of neuroblastoma identifies
chromothripsis and defects in neuritogenesis genes. Nature, 483: 589-593.
87.
Molodin VI (2014) Ethnocultural mosaic in western Baraba (Late Bronze Age - a
time of transition from the Bronze Age to the Iron Age. XIV-VIII century BC) //
Archaeology, Ethnology and Anthropology of Eurasia. 4 (60) pp 54 - 63. (Russian)
88.
Movsesyan AA (2012) Paleophenetic analysis of modern and ancient population
of Chukotka // Archaeology, Ethnology and Anthropology of Eurasia. 3 (51) pp
130 - 137. (Russian)
89.
Nagasaki M, Yasuda J, Katsuoka F, Nariai N, Kojima K, Kawai Y,
Yamaguchi-Kabata Y, Yokozawa J, Danjoh I, Saito S, Sato Y, Mimori T, Tsuda K,
Saito R, Pan X, Nishikawa S, Ito S, Kuroki Y, Tanabe O, Fuse N, Kuriyama S,
Kiyomoto H, Hozawa A, Minegishi N, Douglas Engel J, Kinoshita K, Kure S,
Yaegashi N; ToMMo Japanese Reference Panel Project, Yamamoto M. (2015) Rare
variant discovery by deep whole-genome sequencing of 1,070 Japanese individuals.
Nat Commun. 2015 Aug 21;6:8018. doi: 10.1038/ncomms9018.
90.
Nei, M., Tajima, F. & Tateno, Y. Accuracy of estimated phylogenetic trees from
molecular data. J Mol Evol 19, 153–170 (1983).
91.
Nelis M, Esko T, Mägi R, Zimprich F, Zimprich A, Toncheva D, Karachanak S,
Piskácková T, Balascák I, Peltonen L, Jakkula E, Rehnström K, Lathrop M, Heath
S, Galan P, Schreiber S, Meitinger T, Pfeufer A, Wichmann HE, Melegh B, Polgár
N, Toniolo D, Gasparini P, D'Adamo P, Klovins J, Nikitina-Zake L, Kucinskas V,
Kasnauskiene J, Lubinski J, Debniak T, Limborska S, Khrunin A, Estivill X,
Rabionet R, Marsal S, Julià A, Antonarakis SE, Deutsch S, Borel C, Attar H,
Gagnebin M, Macek M, Krawczak M, Remm M, Metspalu A. Genetic structure of
Europeans: a view from the North-East. PLoS ONE. 2009;4(5):e5472.
92.
Novogilov AG (2009) The population of the Pskov-Pechora region as an
ethnolocal group // Vestnik St. Petersburg University. 2. History Series. SPb., 3:
94-110. (Russian)
93.
Ng PC, Henikoff S. SIFT: Predicting amino acid changes that affect protein
function. Nucleic Acids Res. 2003 Jul 1;31(13):3812-4. PubMed PMID:
12824425; PubMed Central PMCID: PMC168916.
94.
O’Brien SJ and Hendrickson, S. Host Genomic Influences on HIV/AIDS.
Genome Biology, 14:201-214. 2013.
95.
Patterson, N., Price, A. L. & Reich, D. Population Structure and Eigenanalysis.
PLoS Genet. 2, e190 (2006).
96.
Price, A. L. et al. Principal components analysis corrects for stratification in
genome-wide association studies. Nat. Genet. 38, 904–909 (2006).
97.
Pesik VY, Fedunin AA, Agdzhoyan AT, Utevska OM, Chukhraeva MI, Evseeva
IV, Churnosov MI, Lependina IN, Bogunov YV, Bogunova AA, Ignashkin
MA, Yankovsky NK, Balanovska EV, Orekhov VA, Balanovsky OP. (2014)
Analysis of genetic diversity of Russian regional populations based on common
STR markers used in DNA identification. Genetika. 50(6): 715-23. Russian.
98.
Marchini J, and B. Howie (2010) Genotype imputation for genome-wide
association studies. Nature Reviews Genetics 11:499-511
100. Nielsen, R., Hellmann, I., Hubisz, M., Bustamante, C. & Clark, A. G. Recent and
ongoing selection in the human genome. Nat Rev Genet 8, 857–868 (2007).
101. Oleksyk, T. K., Smith, M. W. & O’Brien, S. J. Genome-wide scans for footprints
of natural selection. Philos. Trans. R. Soc. B Biol. Sci. 365, 185–205 (2009).
102. Pickrell, J. K. & Pritchard, J. K. Inference of Population Splits and Mixtures from
Genome-Wide Allele Frequency Data. PLoS Genet. 8, e1002967 (2012).
103. Popova SN, Slominsky PA, Pocheshnova EA, Balanovskaya EV, Tarskaya LA,
Bebyakova NA, Bets LV, Ivanov VP, Livshits LA, Khusnutdinova EK, Spitcyn
VA, Limborska SA. Polymorphism of trinucleotide repeats in loci DM, DRPLA
and SCA1 in East European populations. Eur J Hum Genet. 2001;9(11):829-35.
104. Prado-Martinez, Javier, Peter H. Sudmant, Jeffrey M. Kidd, Heng Li, Joanna L.
Kelley, Belen Lorente-Galdos, Krishna R. Veeramah et al. "Great ape genetic
diversity and population history." Nature 499, no. 7459 (2013): 471-475.
105. Prüfer, K., Racimo, F., Patterson, N., Jay, F., Sankararaman, S., Sawyer, S., …
Pääbo, S. (2014). The complete genome sequence of a Neanderthal from the Altai
Mountains. Nature, 505(7481), 43-49.
106. Raj, A., Stephens, M. & Pritchard, J. K. fastSTRUCTURE: Variational Inference
of Population Structure in Large SNP Data Sets. Genetics 197, 573–589 (2014).
107. Ramensky V, Bork P, Sunyaev S. Human non-synonymous SNPs: server and
survey. Nucleic Acids Res. 2002 Sep 1;30(17):3894-900. PubMed PMID:
12202775; PubMed Central PMCID: PMC137415.
108. Raney, Brian J., Timothy R. Dreszer, Galt P. Barber, Hiram Clawson, Pauline A.
Fujita, Ting Wang, Ngan Nguyen et al. "Track data hubs enable visualization of
user-defined genome-wide annotations on the UCSC Genome Browser."
Bioinformatics 30, no. 7 (2014): 1003-1005.
109. Reich, D.E. et al. (2001) Linkage disequilibrium in the human genome. Nature
411, 199–204
110. Reich, D.E. et al. (2002) Human genome sequence variation and the influence of
gene history, mutation and recombination. Nat. Genet. 32, 135–142
111. Reich, D., Green, R. E., Kircher, M., Krause, J., Patterson, N., Durand, E. Y., …
Pääbo, S. (2010). Genetic history of an archaic hominin group from Denisova
Cave in Siberia. Nature, 468(7327), 1053-1060.
113. Riemann K, Adamzik M, Frauenrath S, Egensperger R, Schmid KW, Brockmeyer
NH, Siffert W. Comparison of manual and automated nucleic acid extraction from
whole-blood samples. J Clin Lab Anal. 2007;21(4):244-8.
114. Rosinger S, Nutland S, Mickelson E, Varney MD, Boehm BO, Olsem GJ,
Hansen JA, Nicholson I, Hilner JE, Perdue LH, Pierce JJ, Akolkar B, Nierras C,
Steffes MW and the T1DGC. Collection and processing of whole blood for
transformation of peripheral blood mononuclear cells and extraction of DNA: the
Type 1 Diabetes Genetics Consortium. Clinical Trials 2010; 7: S65–S74.
115. Russians / N. N. Miklukho-Maklai Institute of Ethnology and Anthropology, RAS /
Series "Peoples and Cultures", vol. I / Series editors: Dr. hist. sci. YB Simchenko,
Dr. hist. sci. VA Tishkov. Moscow: Nauka, 1999. 828 p.: ill. (Russian).
116. Russian peoples of Russia. Atlas of cultures and religions (2010). Moscow:
Design. Information. Cartography. 320 p. (Russian).
117. Sankararaman, S., Mallick, S., Dannemann, M., Prüfer, K., Kelso, J., Pääbo, S., …
Reich, D. (2014). The genomic landscape of Neanderthal ancestry in present-day
humans. Nature, 507(7492), 354-357.
118. Salmela E, Lappalainen T, Liu J, Sistonen P, Andersen PM, Schreiber S, Savontaus
ML, Czene K, Lahermo P, Hall P, Kere J. (2011) Swedish population substructure
revealed by genome-wide single nucleotide polymorphism data. PLoS One. 6(2):
e16747.
119. Schiffels, S. & Durbin, R. Inferring human population size and separation history
from multiple genome sequences. Nat Genet 46, 919–925 (2014).
120. Semino O, Passarino G, Oefner PJ, Lin AA, Arbuzova S, Beckman LE, De
Benedictis G, Francalacci P, Kouvatsi A, Limborska S, Marcikiae M, Mika A,
Mika B, Primorac D, Santachiara-Benerecetti AS, Cavalli-Sforza LL, Underhill
PA. The genetic legacy of Paleolithic Homo sapiens sapiens in extant Europeans: a
Y chromosome perspective. Science. 2000;290(5494):1155-9.
121. Shabihkhani M, Lucey GM, Wei B, Mareninov S, Lou JJ, Vinters HV, Singer EJ,
Cloughesy TF, Yong WH. The procurement, storage, and quality assurance of
frozen blood and tissue biospecimens in pathology, biorepository, and biobank
settings. Clin Biochem. 2014 Mar;47(4-5):258-66.
122. Shabrova EV, Khusnutdinova EK, Tarskaia LA, Mikulich AI, Abolmasov NN,
Limborska SA. DNA diversity of human populations from Eastern Europe and
Siberia studied by multilocus DNA fingerprinting. Mol Genet Genomics.
2004;271(3):291-7.
123. Slatkin M, Hublin JJ, Reich D, Kelso J, Viola TB, Pääbo S. (2014) Genome
sequence of a 45,000-year-old modern human from western Siberia. Nature.
514:445-9.
124. Sminova, Y. (2015) Did good genes help people outlast the brutal Leningrad
siege. Science 348:1068
125. Smith, M.W., and O’Brien, S.J. (2005). Mapping by Admixture Disequilibrium:
Advances, Limits and Guidelines. Nature Reviews Genetics. 6: 623-632.
126. Smith, J. M. & Haigh, J. The hitch-hiking effect of a favourable gene. Genet. Res.
23, 23 (1974).
127. Spitsyn VA, Khorte MV, Pogoda TV, Slominsky PA, Nurbaev SD, Agapova RK,
Limborska SA. Apolipoprotein B 3'-VNTR polymorphism in the Udmurt
population. Hum Hered. 2000;50(4):224-6.
128. Stenson PD, Ball EV, Mort M, Phillips AD, Shaw K, Cooper DN. The Human
Gene Mutation Database (HGMD) and its exploitation in the fields of personalized
genomics and molecular evolution. Curr Protoc Bioinformatics. 2012 Sep;Chapter
1:Unit1.13. doi: 10.1002/0471250953.bi0113s39. PubMed PMID: 22948725.
129. Stewart JB, Chinnery PF. (2015) The dynamics of mitochondrial DNA
heteroplasmy: implications for human health and disease. Nat Rev Genet.
16(9):530-42.
130. Sudmant, Peter H., Jeffrey M. Kidd, Heng Li, Joanna L. Kelley, Belen
Lorente-Galdos, Krishna R. Veeramah, August E. Woerner, Timothy D. O’Connor,
Gabriel Santpere, Alexander Cagan, Christoph Theunert, Ferran Casals, Hafid
Laayouni, Kasper Munch, Asger Hobolth, Anders E. Halager, Maika Malig, Jessica
Hernandez-Rodriguez, Irene Hernando-Herraez, Kay Prüfer, Marc Pybus, Laurel
Johnstone, Michael Lachmann, Can Alkan et al. Great ape genetic diversity and
population history (2013) Nature 499, 471–475.
131. Svitin A, Malov S, Cherkasov N, Geerts P, Rotkevich M, Dobrynin P,
Shevchenko A, Guan L, Troyer J, Hendrickson S, Dilks HH, Oleksyk TK,
Donfield S, Gomperts E, Jabs DA, Sezgin E, Van Natta M, Harrigan PR,
Brumme ZL, O'Brien SJ. (2014) GWATCH: a web platform for automated gene
association discovery analysis. GigaScience 3:18.
132. Tajima, F. "Statistical method for testing the neutral mutation hypothesis by DNA
polymorphism." Genetics 123.3 (1989): 585-595.
133. Tamazian, G; Simonov, S; Dobrynin, P; Makunin, A; Logachev, A; Komissarov, A;
Shevchenko, A; Brukhin, V; Cherkasov, N; Svitin, A; Koepfli, KP; Pontius, J;
Driscoll, CA; Blackistone, K; Barr, C; Goldman, D; Antunes, A; Quilez, J;
Lorente-Galdos, B; Alkan, C; Marques-Bonet, T; Menotti-Raymond, M; David, V;
Narfstrom, K; O'Brien, SJ (2014): Annotated Features of the Domestic Cat (Felis
catus) Genome. GigaScience 3:13. doi:10.1186/2047-217X-3-13
134. http://www.gigasciencejournal.com/content/3/1/13 2014.
135. Tishkoff, Sarah. "Strength in small numbers." Science 349.6254 (2015):
1282-1283.
136. Trofimova NV, Litvinov SS, Khusainova RI, Penkin LN, Akhmetova
VL, Akhatova FS, Khusnutdinova ÉK. (2015)
Genetic characterization of populations of the Volga-Ural region according to
the variability of the Y-chromosome. Genetika. 2015; 51(1): 120-7.
137. Troyer D. Biorepository standards and protocols for collecting, processing, and
storing human tissues. Methods Mol Biol. 2008;441:193-220.
138. Trubachov ON (2003) Ethnogenesis and culture of ancient Slavs: linguistic
studies. - M .: Nauka, 489 pp. (Russian)
139. Ulyanov MV, Lavryashina MB , Nikolaev VV, Octyabrskaya IV, Druzhinin VG
(2014) The indigenous population of the northern regions of the Altai: the
reflection of the demographic processes of XIX - beginning of the XXI century in
the dynamics of the family structure // Archaeology, Ethnology and Anthropology
of Eurasia 3 (59): 128 - 140. (Russian).
140. Van der Auwera GA, Carneiro M, Hartl C, Poplin R, del Angel G,
Levy-Moonshine A, Jordan T, Shakir K, Roazen D, Thibault J, Banks E, Garimella K,
Altshuler D, Gabriel S, DePristo M. (2013) From FastQ Data to High-Confidence
Variant Calls: The Genome Analysis Toolkit Best Practices Pipeline. Curr Protoc
Bioinformatics 43:11.10.1-11.10.33.
141. Vaught JB. Blood collection, shipment, processing, and storage. Cancer Epidemiol
Biomarkers Prev. 2006 Sep;15(9):1582-4.
142. Verbenko DA, Pogoda TV, Spitsyn VA, Mikulich AI, Bets LV, Bebyakova NA,
Ivanov VP, Abolmasov NN, Pocheshkhova EA, Balanovskaya EV, Tarskaya LA,
Sorensen MV, Limborska SA. Apolipoprotein B 3'-VNTR polymorphism in
Eastern European populations. Eur J Hum Genet. 2003;11(6):444-451.
143. Verbenko DA, Knjazev AN, Mikulich AI, Khusnutdinova EK, Bebyakova NA,
Limborska SA. Variability of the 3'APOB minisatellite locus in Eastern Slavonic
populations. Hum Hered. 2005;60(1):10-8.
144. Verbenko DA, Slominsky PA, Spitsyn VA, Bebyakova NA, Khusnutdinova EK,
Mikulich AI, Tarskaia LA, Sorensen MV, Ivanov VP, Bets LV, Limborska SA.
Polymorphisms at locus D1S80 and other hypervariable regions in the analysis of
Eastern European ethnic group relationships. Ann Hum Biol. 2006; 33(5-6):
570-84.
145. Xue Y, Prado-Martinez J, Sudmant PH, Narasimhan V, Ayub Q, Szpak M,
Frandsen P, Chen Y, Yngvadottir B, Cooper DN, de Manuel M,
Hernandez-Rodriguez J, Lobon I, Siegismund HR, Pagani L, Quail MA, Hvilsom C,
Mudakikwa A, Eichler EE, Cranfield MR, Marques-Bonet T, Tyler-Smith C,
Scally A. Mountain gorilla genomes reveal the impact of long-term
population decline and inbreeding. Science. 2015;348: 242-5.
146. Wendl MC, Smith S, Pohl CS, et al. Design and implementation of a generalized
laboratory data model. BMC Bioinformatics. 2007;8:362. doi:
10.1186/1471-2105-8-362.
147. Wright, S. "Genetical structure of populations." Nature 166 (1950): 247-49.
148. Yakupov RI (2010) Trying to understand the ethnic history of Eurasia (RG Kuzeev
memory) // Ethnographic Review. 2: 110-119. (Russian).
149. Zavyalov VI, LS Rozanov, NN Terekhova (2012) Ethno-cultural interaction in the
era of migration of peoples: archaeometallographic data (based on the sites of the
Volga-Kama and Poochya). // Russian archeology. 1. (Russian)
150. Zhang, W. & Sun, Z. Random local neighbor joining: A new method for
reconstructing phylogenetic trees. Mol. Phylogenet. Evol. 47, 117–128 (2008).
151. Zhao, S. et al. Whole-genome sequencing of giant pandas provides insights into
demographic history and local adaptation. Nat. Genet. 45, 67–71 (2012).
152. Zheng, X. et al. A high-performance computing toolset for relatedness and
principal component analysis of SNP data. Bioinformatics 28, 3326–3328 (2012).