Analysis of Recombination in Campylobacter
jejuni from MLST Population Data
Paul Fearnhead1,5, Nick Smith1, Mishele Barrigas2, Andrew Fox3 and Nigel
French4
1. Department of Mathematics and Statistics, Lancaster University, Lancaster, LA1 4YF, UK
2. University of Staffordshire, Stafford, UK
3. Health Protection Agency North West, Manchester Medical Microbiology
Partnership, Manchester Royal Infirmary, Manchester, UK
4. Institute of Veterinary, Animal and Biomedical Sciences, Massey University, New Zealand
5. Correspondence should be addressed to Paul Fearnhead.
(e-mail: [email protected]).
Summary: We analyse recombination in C. jejuni using MLST data from isolates taken from wild birds, cattle, wild rabbits and water in a 100 km² study
region in Cheshire, UK. We use a recent approximate likelihood method for inference, based on combining likelihood information from all pairs of segregating
(polymorphic) sites in the data. We find substantial evidence for recombination,
but only for recombination with short tract lengths, of around 225bps–750bps.
We estimate that the rate of recombination is of a similar magnitude to the rate
of mutation.
Keywords: Bacteria, Campylobacter, Helicobacter, MLST, Recombination Rate,
Recombination Tract Length
1 Introduction
Campylobacter jejuni is one of the most common agents of bacterial gastroenteritis. It is carried asymptomatically by many domestic and wild animals, and
is also present in environmental locations such as soil and water. We present an
analysis of multilocus sequence typing (MLST) data from C. jejuni isolates taken
from a range of animal and environmental sources (French et al., 2004), with the
aim of learning about the recombination process of the bacteria.
Recombination is a major source of genetic diversity in pathogens, and perhaps the major mechanism responsible for genetic diversity. Learning about
the recombination process in pathogens is important for understanding the epidemiology of disease, and therefore ultimately for its control and prevention. For
example, recombination has been responsible for capsular polysaccharide switching in Neisseria meningitidis and Streptococcus pneumoniae, which may result
in vaccine failures (Claus et al., 2002). Within C. jejuni, recombination has
been responsible for substantially increased genetic diversity at the unc and pgm loci
(French et al., 2004).
There have been previous analyses of MLST data from C. jejuni (Suerbaum
et al., 2001), but these have primarily focussed on evidence for recombination,
and qualitative measures of the amount of recombination such as the homoplasy
ratio.
We present a likelihood-based analysis of the data, based on using the likelihood information from all pairs of segregating sites. Likelihoods are calculated
under a coalescent model. We present formal tests for the presence of recombination, and estimates of both the relative amount of recombination as compared
to mutation, and the mean recombination tract length. We find substantial evidence of recombination, with mean tract lengths in the region of 225bps–750bps;
and with recombination occurring at a similar rate to mutation.
2 Materials and Methods
Campylobacter Data.
The data is taken from French et al. (2004), and consists of Multi-Locus sequence types (MLSTs) from 172 isolates of Campylobacter jejuni. The isolates
were taken from a variety of farm and wildlife sources over a 100 km² area of
predominantly dairy farmland.
For each isolate the data consist of the DNA sequence for approximately 500bp
fragments of 7 house-keeping genes which are randomly located along the 1.6Mb
genome. A small number of isolates had an allele at one of the genes which is
substantially divergent from the majority of the alleles at that gene, and may
have originated in a different Campylobacter species. (For example Colles et al.,
2003, suggest that allele 17 at gene unc is from C. coli). These were alleles 113
at gene pgm (found in 1 rabbit and 1 water isolate), and alleles 17, 38 and 42 at
gene unc (found in 18 cattle isolates). A summary of the data is given in Table
1.
Helicobacter Data.
To validate our methods, we also analysed MLST data from Helicobacter pylori, for which there is independent inference about the recombination process
(Falush et al., 2001). We analysed data at 5 loci (atpA, efp, ppa, trpC and
ureI) for 18 isolates collected from New Zealand Maoris, and 12 isolates collected from Korea. The data was obtained from the pubMLST isolate database
(http://pubmlst.org/helicobacter/).
Approximate Likelihood Methods.
We used an extension of the pairwise likelihood method of Hudson (2001) to
include a finite-sites mutation model (McVean et al., 2002). This is an approximate likelihood method which (i) calculates the likelihood for each pair of
polymorphic sites; and (ii) multiplies together the likelihoods for all such pairs
of sites. The approximation in this method is that (ii) would only be valid if the
data from each pair of sites were independent, which is clearly not the case.

Genetic Diversity of C. jejuni Isolates.
Gene      Fragment      Polymorphic   Distinct      Tajima's D
          length (bp)   sites         haplotypes
asp       477           19            14             0.57
gln       477           34            25            -0.69
glt       402           23(1)         23             0.31
gly       507           42(1)         24             1.88†
pgm       498           94(10)        29            -0.38
pgm∗                    47(2)         28             1.77†
tkt       459           43(2)         29             0.03
unc       489           81(5)         20             0.43
unc∗                    27(1)         17            -0.62
Total     3,309         336(19)       65             0.22
Total∗                  233(7)        58             0.60
Table 1: Summary of genetic diversity of the C. jejuni isolates. The numbers
of the polymorphic sites with three or four alleles for each gene are given in
brackets. ∗ Excluding isolates with inter-strain alleles. For pgm, 2 isolates with
allele pgm113 were removed; for unc, 18 isolates with alleles unc17, unc38 or
unc42 were removed; while for Total, all these 20 isolates were removed.
† Significantly different from 0 at the 5% level.

The likelihoods in (i) are calculated using computational statistical techniques
(Fearnhead and Donnelly, 2001), assuming a neutral, constant-sized, panmictic
population coalescent model. Our mutation model is based on a 4-allele model
at each site, with mutations between all pairs of alleles being equally likely, and a
constant mutation rate across all sites. While this model is simplistic, simulation
studies show that inferences are robust to deviations from the model assumptions
(Smith and Fearnhead, 2004; McVean et al., 2004).
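The two steps of the pairwise likelihood can be sketched in a few lines of Python. This is only an illustration of the bookkeeping: `pair_loglik` is a hypothetical callable standing in for the coalescent two-site likelihood (computed in the analysis by the methods of Fearnhead and Donnelly, 2001), and the names `positions` and `min_sep` are introduced here for the example.

```python
from itertools import combinations
from typing import Callable, Sequence


def composite_loglik(
    positions: Sequence[int],
    pair_loglik: Callable[[int, int], float],
    min_sep: int = 0,
) -> float:
    """Sum two-site log-likelihoods over all pairs of polymorphic sites.

    positions   : positions (bp) of the polymorphic sites
    pair_loglik : hypothetical function giving the log-likelihood of the data
                  at the sites with indices a and b
    min_sep     : pairs closer together than this many bp are skipped
    """
    total = 0.0
    for a, b in combinations(range(len(positions)), 2):
        if abs(positions[b] - positions[a]) < min_sep:
            continue
        # Multiplying likelihoods (step ii) is adding log-likelihoods.
        total += pair_loglik(a, b)
    return total


# Toy stand-in in which every pair contributes -1.0.
sites = [10, 40, 200, 450]
print(composite_loglik(sites, lambda a, b: -1.0))              # all 6 pairs
print(composite_loglik(sites, lambda a, b: -1.0, min_sep=50))  # drops the pair (10, 40)
```

Working on the log scale avoids numerical underflow when many pairs are combined; the independence approximation described above is what allows the terms simply to be added.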
While theory is limited for the pairwise likelihood approach (see Fearnhead, 2003,
for some results), it has been widely used for analysing recombination processes
from population data (see Stumpf and McVean, 2003; Awadalla, 2003), due to its
speed of computation and the flexibility it allows for the underlying recombination process (e.g. variable recombination rates, and distributional assumptions
for tract lengths).
Testing for Recombination
Our data contains information about the recombination rates over both short
(within genes - up to 500bp) and long (between genes, of the order of 40-800kb)
distances. Using permutation tests we tested the hypotheses of (i) presence of
recombination in C. jejuni and (ii) the presence of recombination acting over
large distances.
To test (i) we fitted a model in which the recombination rate scales linearly with physical
distance. The pairwise likelihood for all pairs of polymorphic sites within each
gene was then maximised under this model. To calculate the significance of this
value, we then analysed 1,000 permuted data sets, where we shuffled the positions
of the polymorphic sites.
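A minimal sketch of such a permutation test is given below. It assumes a hypothetical function `statistic` that returns the maximised pairwise likelihood (or any other test statistic) for a given assignment of positions to sites, and it assumes that larger values are the more extreme ones; neither detail is specified above.

```python
import random
from typing import Callable, List, Sequence


def permutation_p_value(
    positions: Sequence[int],
    statistic: Callable[[Sequence[int]], float],
    n_perm: int = 1000,
    seed: int = 0,
) -> float:
    """Proportion of shuffled data sets whose statistic is >= the observed one."""
    rng = random.Random(seed)
    observed = statistic(positions)
    shuffled: List[int] = list(positions)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # reassign the positions to the sites at random
        if statistic(shuffled) >= observed:
            hits += 1
    return hits / n_perm
```

With 1,000 permutations, a result of zero hits would be reported as a p-value below 0.001.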
To test (ii) we fitted a model in which the recombination rate between sites in different
genes, i and j, is $\rho_{ij} = a + b\,d_{ij}(G - d_{ij})$, where $d_{ij}$ is the distance between
genes i and j, and G is the size of the genome (1.6Mb). The constant term in this
expression reflects the effect of recombination with short tract lengths, assuming
that the tract lengths are smaller than the distances between the genes. The
effect of such recombination will be the same between any pair of genes. The
$b\,d_{ij}(G - d_{ij})$ term is consistent with recombination where the two breakpoints are
uniformly distributed about the genome. We maximised the pairwise likelihood
for all pairs of polymorphic sites in different genes under this model. To assess
the significance of this value we then analysed permuted data sets obtained by
shuffling the positions of the 7 genes.
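The form of the distance-dependent term can be checked with a short sketch. It treats the genome as a circle of length G (an assumption of this example): two independent, uniformly placed breakpoints then separate two loci a distance d apart with probability 2(d/G)((G − d)/G), which is proportional to d(G − d). The function names are illustrative only.

```python
def between_gene_rate(a: float, b: float, d: float, G: float = 1.6e6) -> float:
    """rho_ij = a + b * d * (G - d) for two genes a distance d (bp) apart."""
    return a + b * d * (G - d)


def prob_split(d: float, G: float = 1.6e6) -> float:
    """Chance that two independent uniform breakpoints on a circle of length G
    fall on opposite sides of two loci that are d apart."""
    return 2.0 * (d / G) * ((G - d) / G)


# The long-range term is symmetric in d and G - d, as expected on a circle,
# and is largest for genes on opposite sides of the genome (d = G / 2).
print(between_gene_rate(0.0, 1.0, 4e5) == between_gene_rate(0.0, 1.0, 1.2e6))
print(prob_split(8e5))
```

Setting b = 0 leaves only the constant a, the case in which all recombination has tracts that are short relative to the distances between genes.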
The power of both these tests is unknown, but should be comparable. If anything
the power for testing (ii) may be larger - as there are more pairs of sites in different
genes than within the same gene. While for each test we assume a specific model
for the relation of recombination to physical distance, it is hoped that these
models will give us power to detect a range of models which are consistent with
the presence of (i) recombination or (ii) recombination acting over long ranges.
Recombination Model
Our model for recombination, assuming that recombination tract lengths are
small compared to the size of the genome, is as follows. Let X be the (random
variable of the) tract length of a recombination event, and ρ be the rate at which
recombination events occur (per bp). Then the recombination rate for two sites
at a distance y (where y is much smaller than the size of the genome) apart is
$$2\rho \sum_{x=1}^{y} \Pr(X \ge x).$$
The factor of 2 appears due to our definition of ρ, as for each recombination event
there will be two recombination breaks. The sum in this expression appears due
to the need for precisely one of the recombination break points to lie between the
two sites. Using evidence from Drosophila (Hilliker et al., 1994), it is common to
assume X has an exponential distribution (e.g. Wiuf and Hein, 2000; Falush et al.,
2001), and in this case our expression simplifies to previously derived expressions
(see for example Frisse et al., 2001). However in order for our inferences to be
robust to deviations from this model, we assume that X has a discretised gamma
distribution, which is a generalisation of the exponential distribution.
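The behaviour of this expression at short and long distances can be illustrated numerically. The sketch below uses a geometric tract-length distribution as a simple discrete stand-in for the exponential case (this choice, and the parameter values, are assumptions of the example; the discretised gamma used in the analysis would be supplied as a different `survival` function).

```python
from typing import Callable


def pair_recomb_rate(rho: float, y: int, survival: Callable[[int], float]) -> float:
    """2 * rho * sum_{x=1}^{y} Pr(X >= x) for two sites y bp apart."""
    return 2.0 * rho * sum(survival(x) for x in range(1, y + 1))


def geometric_survival(mean: float) -> Callable[[int], float]:
    """Pr(X >= x) for a geometric tract length on 1, 2, ... with the given mean."""
    q = 1.0 - 1.0 / mean
    return lambda x: q ** (x - 1)


rho, mu = 0.005, 500.0  # illustrative values only (per bp, and bp)
surv = geometric_survival(mu)
for y in (10, 500, 50_000):
    # Compare with the two limits: 2*rho*y (y << mu) and 2*rho*mu (y >> mu).
    print(y, pair_recomb_rate(rho, y, surv), 2 * rho * min(y, mu))
```

For sites much closer together than the mean tract length the rate grows almost linearly in y and depends mainly on ρ, whereas for sites far apart it levels off at 2ρµ, the product of the event rate and the mean tract length.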
Let µ be the mean tract length and α be the shape parameter of the gamma
distribution. We calculate the pairwise log-likelihood l(ρ, µ, α) over a grid of ρ,
µ and α values, but use the profile log-likelihood $pl(\rho, \mu) = \max_{\alpha} l(\rho, \mu, \alpha)$ to
perform inference. This approach accounts for uncertainty in the distribution of
the tract length when making inference for ρ and µ.
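The profiling step amounts to taking, for each (ρ, µ) grid point, the largest log-likelihood over the α grid. A minimal sketch follows, with `loglik` a hypothetical callable for l(ρ, µ, α) and a toy surface used purely to exercise the code.

```python
from typing import Callable, Dict, Sequence, Tuple


def profile_over_alpha(
    loglik: Callable[[float, float, float], float],
    rhos: Sequence[float],
    mus: Sequence[float],
    alphas: Sequence[float],
) -> Dict[Tuple[float, float], float]:
    """pl(rho, mu) = max over alpha of l(rho, mu, alpha), evaluated on a grid."""
    return {
        (rho, mu): max(loglik(rho, mu, a) for a in alphas)
        for rho in rhos
        for mu in mus
    }


# Toy log-likelihood surface with its peak at rho = 5, mu = 500, alpha = 1.
def toy(rho: float, mu: float, alpha: float) -> float:
    return -((rho - 5) ** 2) - ((mu - 500) / 100) ** 2 - (alpha - 1) ** 2


pl = profile_over_alpha(toy, [2.5, 5, 10], [250, 500, 750], [0.5, 1, 2])
best = max(pl, key=pl.get)
print(best, pl[best])
```

The maximiser of the profile over the (ρ, µ) grid is the same as the joint maximiser over all three parameters, so no information about the point estimates is lost by profiling.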
Confidence Intervals
To obtain an approximate confidence interval for µ (and similarly ρ) we used the
following scaled Likelihood Ratio statistic
$$LR(\mu) = \frac{2\,\bigl(pl(\hat{\mu}) - pl(\mu)\bigr)}{S},$$
where $pl(\mu) = \max_{\rho} pl(\rho, \mu)$ is the profile log-likelihood for µ, and S is the number
of segregating sites. The scaling is to account for the fact that the number of
pairs of sites increases quadratically with S, whereas the amount of information
should increase at best linearly with S. We use simulation to obtain the empirical
distribution of the statistic.
Data was simulated (see below) under a fitted model, where the parameters of
the recombination process were fixed at their mles, and the mutation process
was chosen to produce (on average) the same number of segregating sites in the
data. For each simulated data set we obtained a value of the scaled Likelihood
Ratio statistic evaluated at the true parameter value. Repeatedly simulating
data and calculating the scaled Likelihood Ratio statistic enables us to build up
an empirical distribution for the statistic. We approximated the 95th percentile
of the true distribution by the 95th percentile of the empirical distribution; and
included in a 95% confidence interval all values of the parameter for which the
scaled Likelihood Ratio statistic (for the real data) was less than this percentile.
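The construction of the interval can be sketched as follows. The inputs are assumed to be already available: `pl` maps each grid value of µ to its profile log-likelihood for the real data, and `simulated_lr` holds the scaled statistics from the simulated data sets. All names, the quantile convention and the toy numbers are illustrative assumptions.

```python
from typing import Dict, List, Sequence


def scaled_lr(pl: Dict[float, float], n_seg_sites: int) -> Dict[float, float]:
    """LR(mu) = 2 * (pl(mu_hat) - pl(mu)) / S for each grid value of mu."""
    pl_max = max(pl.values())
    return {mu: 2.0 * (pl_max - v) / n_seg_sites for mu, v in pl.items()}


def empirical_quantile(values: Sequence[float], q: float) -> float:
    """Order-statistic quantile (one of several possible conventions)."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


def confidence_set(
    pl: Dict[float, float],
    n_seg_sites: int,
    simulated_lr: Sequence[float],
    level: float = 0.95,
) -> List[float]:
    """Grid values of mu whose scaled LR lies below the simulated percentile."""
    cutoff = empirical_quantile(simulated_lr, level)
    lr = scaled_lr(pl, n_seg_sites)
    return sorted(mu for mu, stat in lr.items() if stat < cutoff)


# Toy inputs: four grid values of mu and 100 simulated statistics.
toy_pl = {250: -12.0, 500: -10.0, 750: -10.5, 1500: -16.0}
toy_sim = [i / 100 for i in range(100)]
print(confidence_set(toy_pl, n_seg_sites=10, simulated_lr=toy_sim))
```

The reported interval would then run from the smallest to the largest grid value in the returned set.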
Simulation.
Sequence data was simulated for a linear sequence of 7 loci of length 500 bp
separated by 10 kb gaps for Campylobacter and 5 loci of length 500 bp separated
by 10 kb gaps for Helicobacter. Details of the sequence simulations (e.g. numbers
of samples and segregating sites) were chosen to correspond roughly with the
values for the data set of interest, except for the gap length which does not affect
patterns of variation when gap length is greatly in excess of mean tract length.
First, the ms program of Hudson (2002) was used to construct a treefile (consisting
of a set of genealogies for different portions of the sequence) under the standard
neutral model, assuming gene conversion with an exponentially distributed
tract length. DNA sequence data was then generated from the treefile with the Seq-Gen program of Rambaut and Grassly (1997) under the Jukes-Cantor model of
DNA substitution, with rate variation between sites corresponding to a gamma
distribution with shape parameter 0.5.
3 Results
Interspecies Haplotypes.
A qualitative picture of the recombination process in C. jejuni can be obtained
by examining the haplotypes of isolates with alleles at either pgm or unc which
appear to have come from other strains of Campylobacter (see Table 2 for the
allelic profiles of these isolates). We shall call these the “interspecies haplotypes”,
and the respective alleles at pgm and unc “interspecies alleles”. (This general
approach to learning about recombination is similar in principle to that of Feil
et al., 2000).
For the isolates with haplotypes A1–A5 (see Table 2), there appears to be a
simple picture. Alleles unc38 and unc42 each differ from unc17 at a single site,
and thus haplotypes A3 and A4 each appear to be derived from haplotype A1 by
a single mutation. Haplotypes A2 and A5 differ from haplotype A1 at a single
gene, with A2 being produced by a recombination event in gene gly with a tract
length of at least 217bps (allele gly2 has frequency 22 in the sample); while A5
has been produced by a recombination event in gene unc of at least 484bps (the
allelic profile of A5 at the remaining 6 genes is at frequency 24 in the sample).

Allelic Profile of Interspecies Haplotypes.
ID   asp   gln   glt   gly   pgm   tkt   unc   Number   Source
A1   1     4     2     4     6     3     17    14       Cattle
A2   1     4     2     2     6     3     17    1        Cattle
A3   1     4     2     4     6     3     38    1        Cattle
A4   1     4     2     4     6     3     42    1        Cattle
A5   2     1     1     3     2     1     17    1        Cattle
B1   18    85    22    104   113   105   6     1        Water
B2   18    100   22    104   113   105   6     1        Rabbit
Table 2: The allelic profile, frequencies and sources of the interspecies haplotypes.
The interspecies alleles are 17, 38 and 42 at unc (top of table, haplotypes A1–A5)
and 113 at pgm (bottom of table, haplotypes B1 and B2).

Inferring the history of haplotypes B1 and B2 (see Table 2) is more difficult.
These haplotypes differ solely at gln: allele gln85 appears only on haplotype B1,
while gln100 appears in one further isolate. There are five mutational differences
between the two gln alleles; including two mutations that only appear in gln100
and one mutation that only appears in gln85. There are no possible recombinations between either of these two interspecies gln alleles and one of the gln alleles
in the sample that would produce the other interspecies allele. Perhaps most
likely is that haplotype B2 has evolved from B1 via a recombination in gene gln
(of at least 142 bps). It is impossible to tell whether the mutation only found in
gln85 occurred before or after the recombination event with pgm113.
Detection of Recombination.
A qualitative picture of the recombination process in C. jejuni can be obtained
from a plot of pairwise Linkage Disequilibrium for all segregating sites whose
minor allele frequency is greater than 10% (see Figure 1). There is evidence for
greater LD within than between genes (the higher proportion of red for comparisons of pairs of sites within genes), but not for greater LD for genes which are
closer together on the genome (the approximate exchangeability of the patterns
for sites in different genes). This suggests a recombination process with short
(compared to the inter-gene distances) tract lengths.
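For readers who wish to reproduce such a plot, the sketch below computes |D′|, one common measure of pairwise LD, for two biallelic sites and applies a minor-allele-frequency filter. It is a generic textbook calculation, not the code used for Figure 1; sites with three or four alleles are not handled, and the 0/1 allele coding is an assumption of the example.

```python
from typing import Sequence, Tuple


def minor_allele_freq(alleles: Sequence[int]) -> float:
    """Minor allele frequency at one biallelic site coded as 0/1."""
    p = sum(alleles) / len(alleles)
    return min(p, 1.0 - p)


def d_prime(haplotypes: Sequence[Tuple[int, int]]) -> float:
    """|D'| for two biallelic sites; each haplotype is (allele at site 1, allele at site 2)."""
    n = len(haplotypes)
    p_a = sum(h[0] for h in haplotypes) / n   # frequency of allele 1 at site 1
    p_b = sum(h[1] for h in haplotypes) / n   # frequency of allele 1 at site 2
    p_ab = sum(1 for h in haplotypes if h[0] == 1 and h[1] == 1) / n
    d = p_ab - p_a * p_b                      # raw disequilibrium coefficient
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max if d_max > 0 else 0.0


complete_ld = [(0, 0), (0, 0), (1, 1), (1, 1)]
no_ld = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(d_prime(complete_ld), d_prime(no_ld))
print(minor_allele_freq([h[0] for h in complete_ld]) > 0.10)
```

Restricting attention to sites with a minor allele frequency above 10% avoids pairs for which |D′| is close to 1 simply because one allele is very rare.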
To formally test for the presence of recombination acting over different distances we used an exact permutation test (see MATERIALS and METHODS).
We found significant evidence for recombination acting within genes (p-value
< 0.001, based on permutation of polymorphic sites), but no evidence for recombination acting over large distances (of the order of 100kb; p-value 0.64, based
on permutation of genes).
Inference of Recombination Process: Validation of Method.
To test our method for estimating the rate of recombination and the mean tract
length we analysed both (i) simulated data sets and (ii) data from H. pylori
for which there is independent evidence of the recombination tract length (see
MATERIALS and METHODS).
For (i) our data was simulated assuming the recombination tract length had an
exponential distribution, and under a finite-sites mutation model that had a
large degree of rate variation; data was simulated under a model consistent with
the data from the cattle isolates. In analysing the data we assumed a gamma
distribution for the tract length, with a shape parameter ranging between 1/2
and 2. (The exponential distribution corresponds to the gamma distribution
with a shape parameter of 1; for a fixed mean, the variance of the tract lengths
doubles as compared to the exponential distribution when a shape parameter
of 1/2 is used, and halves with a shape parameter of 2.) For our analysis we
assumed a constant mutation rate for each site.
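The statement about the variance follows from the standard moments of a gamma distribution with mean µ and shape parameter α (ignoring the discretisation, which is a simplification made here for the worked equation):

```latex
\operatorname{E}[X] = \mu, \qquad
\operatorname{Var}(X) = \frac{\mu^{2}}{\alpha}
\quad\Longrightarrow\quad
\operatorname{Var}(X)\big|_{\alpha = 1/2} = 2\mu^{2}, \qquad
\operatorname{Var}(X)\big|_{\alpha = 1} = \mu^{2}, \qquad
\operatorname{Var}(X)\big|_{\alpha = 2} = \tfrac{1}{2}\mu^{2}.
```

The case α = 1 is the exponential distribution, so α = 1/2 doubles and α = 2 halves its variance at a fixed mean.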
The presence of mutation-rate variation in the simulated data biased our inference method towards smaller tract lengths and larger recombination rates. This
bias appeared to be caused primarily by the contribution to our pairwise
likelihood of nearby pairs. To make inference more robust to this mutation-rate
variation we excluded from the pairwise likelihood all pairs of sites which were
less than 50bps apart. Histograms of the mles of the tract length and recombination rate from 1,000 simulated data sets are given in Figure 2. The average
values of these estimates were 470bps and 6.3 per kb respectively (compared to true values
of 500bps and 5 per kb).
We then tested our approach on MLST data from 5 gene fragments in H. pylori.
Falush et al. (2001) have obtained estimates of the mean recombination tract
length in H. pylori of 417bps (95% credible interval 259-732bps). These estimates
are based on data from serial isolates and are independent from the MLST data
described in MATERIALS and METHODS. We repeated the approach above
to estimate the tract length from the Maori isolates, the Korean isolates and a
sample consisting of both Maori and Korean isolates. Results are given in Table
3, with confidence intervals as described in MATERIALS and METHODS.
For the individual Maori and Korean populations, the estimates of tract length
are consistent with those of Falush et al. (2001); but the analysis of the combined
data set appears to under-estimate the mean tract length. One explanation for this is that the combined data set strongly violates the random-mating
assumption of the model (for H. pylori, population structure appears to correspond strongly to geographical location; see Falush et al., 2003). The presence of
population structure will lead to LD decaying more slowly with genetic distance,
and thus to relative underestimates of the amount of recombination between genes as
compared to within genes (Smith and Fearnhead, 2004). As the recombination
rate between genes is proportional to the product of the actual rate of recombination events and the mean tract length, the net effect will be an underestimate
of the mean tract length. For H. pylori the true recombination rate between
genes is large (of the order of 25–50 for the combined data set) and so the effect
of structure will be particularly pronounced (Smith and Fearnhead, 2004).
Estimates of Tract Length in Helicobacter.
Source   µ̂ (bps)   95% confidence interval
Maori    500        (225–1400)
Korea    250        (50–1200)
All      175        (100–325)
Table 3: Estimates and, in brackets, approximate 95% confidence intervals of the
mean tract length (µ) for H. pylori isolates.

Inference about Recombination Process in C. jejuni.
When analysing the C. jejuni isolates we removed all isolates which contained
any of the alleles at pgm or unc which are potential recombinants with a different
strain of Campylobacter. Our inference method is based on a model of random mating, namely that C. jejuni recombines with other members of C. jejuni at
equal rates but never with other species, and the presence of alleles which are
descended from a different Campylobacter strain would substantially violate such
an assumption. So as to both better fit a model of random mating (there is
evidence that C. jejuni is exchanged between sources of the same type at a
faster rate than between sources of different types; French et al., 2004), and to
potentially detect any differences between them, we analysed the isolates from
each of the four main sources (birds, cattle, rabbits and water) separately.
Table 4 shows the estimated parameters of the recombination process for isolates from each of the four main sources. There is substantial variation in both
the estimates of the recombination rates and the mean tract lengths across the
different sources, but approximate 95% confidence intervals overlap for both parameters across the four sources except for the recombination rates in cattle and
water. The variation in recombination rate may be caused by differences in the
effective population size of isolates in different sources; however the estimated
mutation rate based on pairwise differences (and also based on the number of
segregating sites; data not shown) is similar across the four sources. Estimates
of the recombination rate between genes also show variation across the four
sources (mles of 11, 6, 4 and 7 for birds, cattle, rabbits and water respectively).
In general the recombination rates are smaller than the mutation rates, and if
there is considerable mutation rate variation then the estimates of the mutation
rates are likely to be under-estimates of the true mutation rate. However, under
our definition of recombination rate, each recombination event produces two
recombination break points, and thus the effective rate of recombination breaks
is twice that given in Table 4.

Recombination Parameter Estimates.
Source    θ̂ per kb   ρ̂ per kb          µ̂ (bps)
Birds     10.3        12  (6.7–22)       450 (250–800)
Cattle    12.1        3.7 (1.7–6.7)      750 (300–1500)
Rabbits   13.7        6.7 (2.7–11.8)     300 (140–1500)
Water     14.4        15  (6.7–23)       225 (80–500)
Table 4: Estimates and, in brackets, approximate 95% confidence intervals of
the recombination rate (ρ) and mean tract length (µ) for C. jejuni isolates from
4 different sources. Isolates with alleles at pgm or unc which are potentially
descended from a different strain of Campylobacter were omitted from the analysis. For comparison, an estimate of the mutation rate (θ), based on mean pairwise
differences, is given for each source.

4 Discussion
We have used population data to make inferences about the recombination process in C. jejuni. There is very strong evidence both for recombination, and for
recombination tract lengths that are small compared to the distances between
genes. Our estimates suggest mean tract lengths that are in the region of 225bp–750bp, and a recombination rate that is similar in magnitude to the mutation
rate. These results are consistent with the qualitative patterns we observed in the
interspecies haplotypes.
Our estimate of tract length is substantially smaller than the 3.3kb estimate of
Schouls et al. (2003), using the method of Feil et al. (2000). This method is
based on finding closely related isolates within populations (for example isolates
with identical alleles at 6 of the 7 genes), and analysing the genetic differences
of such closely related strains. Specific genetic differences are classified as due
to either mutation or recombination, with any single base change being classified as a mutation. If recombination tract lengths roughly have an exponential
distribution, then many recombination events will change small genomic regions,
and thus may alter the DNA only at a single base. As a result, classifying all
single base changes as mutations will lead to an overestimate of the recombination tract length, and may explain the larger estimate of Schouls et al. (2003).
Furthermore, the estimate of Schouls et al. (2003) is based on a small amount of
data and they state that “its validity is somewhat questionable”.
Our results suggest that tract lengths in C. jejuni are similar to those in H.
pylori, and much shorter than in other bacteria (where estimates range from 2kb
to 14kb; Falush et al., 2001). The similarity with H. pylori is not surprising given
the biological similarities of the two pathogens, including the main mechanism
of recombination being transformation (Suerbaum et al., 2001).
Laboratory studies (De Boer et al., 2002) have reported examples of complete
genes being moved via recombination. Such events require recombination tract
lengths of several kilobases. Our results suggest that while such large recombination events can occur, they are likely to be rare (for example if mean tract
length is 500bps, the probability under an exponential model for a tract length
in excess of 3kb is around 0.25%), and that the vast majority of recombination
events affect much smaller regions of the genome.
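The quoted probability can be checked directly from the tail of an exponential distribution, Pr(X > x) = exp(−x/µ). The function name below is illustrative.

```python
import math


def exp_tail(threshold_bp: float, mean_bp: float) -> float:
    """Pr(X > threshold) for an exponential tract length with the given mean."""
    return math.exp(-threshold_bp / mean_bp)


# Mean tract length 500 bp, threshold 3 kb: exp(-6).
print(f"{exp_tail(3000, 500):.4%}")  # prints 0.2479%
```

The same function shows how quickly the tail falls away: each additional 500bp of tract length reduces the probability by a further factor of e.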
Our recombination model allowed for uncertainty in the distribution of the tract-length size, whereas a more common approach is to assume an exponential distribution. In practice this made little difference to the estimates of the mean tract
length (the estimates are around 20% smaller than if the data were analysed under an exponential model), though it produces wider confidence intervals which
allow for this extra uncertainty. The data contained little information about the
shape parameter.
The information in the data about the recombination rate and tract length is
obtained by having Linkage Disequilibrium (LD) information on two different
scales: between and within genes. The LD between genes is informative about
the product of the recombination rate and tract length; whereas the LD within
genes is governed primarily by just the recombination rate (because the gene
fragment sizes are similar to the tract length). As a result we have much greater
power for estimating both the recombination rate and tract length than from a
contiguous region of DNA of the same size (Wall, 2004). Note, it should be
possible to improve the accuracy of studies such as the one presented here by
using the three-site likelihoods presented in Wall (2004).
The two main assumptions of our inference method are that (i) mutation rates
are constant across sites; and (ii) the sample of isolates is taken from a randomly mating population. The problem with (i) is that repeat mutation can
lead to over-estimates of recombination rates, particularly between nearby sites.
However, we have demonstrated our robustness to (i) through simulation, where
we are able to accurately estimate tract lengths (and recombination rates) in
the presence of considerable rate variations; and through obtaining reasonable
estimates of tract lengths for H. pylori.
The problem with (ii) is that population structure affects the decay of LD, with there being excess LD over long distances. This could cause an excess of
LD between genes relative to within genes, and hence an under-estimate of the
mean tract length. Some evidence of this effect was noted in the analysis of
the H. pylori data, with smaller estimates of tract length obtained for a mixed
population of Korean and Maori isolates. To minimise this problem we analysed
the isolates from each of the four sources separately, and removed the interspecies
haplotypes.
We further tested the robustness of our approach to population structure via a
simulation study, with parameters fixed to those estimated for the water isolates.
We assumed a two-island demographic model (Donnelly and Tavaré, 1995), with
three different migration rates, 20, 5 and 1. These produce FST values ranging
from approximately 0.01 to 0.2 (Hudson et al., 1992); the larger values of FST
correspond to stronger population structure. The mean of the estimates of the
mean tract length across 100 simulated data sets for each scenario varied from
245 to 230 (compared to the truth of 225; the over-estimation is due to the skewed
distribution of the mles in each case). These results suggest that the method is
robust to the degree of population structure present in these simulations. As a
rough comparison, the FST values for C. jejuni isolates presented in Colles et al.
(2003) vary from 0.005 to 0.094 for a comparison of human and animal isolates.
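As a rough check on the quoted range of FST values, the sketch below uses the approximation FST ≈ 1/(1 + 4M) for a symmetric two-island model. This relation, and the reading of the migration rates 20, 5 and 1 as the scaled parameter M, are assumptions made for this example; the exact parameterisation used in the simulations is not stated above.

```python
def fst_two_island(M: float) -> float:
    """Assumed approximation F_ST ~ 1 / (1 + 4M) for two islands with scaled migration rate M."""
    return 1.0 / (1.0 + 4.0 * M)


for M in (20, 5, 1):
    print(M, round(fst_two_island(M), 3))
```

Under this assumed relation the three migration rates give values of roughly 0.012, 0.048 and 0.2, in line with the range of approximately 0.01 to 0.2 quoted in the text.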
Acknowledgements This work was supported by EPSRC grant GR/S18786/01,
and by the Environment Institute and the Department for Environment, Food
and Rural Affairs (DEFRA). We thank Daniel Falush for helpful comments.
References
Awadalla, P. (2003). The evolutionary genomics of pathogen recombination. Nature Reviews Genetics 4, 50–60.
Claus, H., Maiden, M. C. J., Maag, R., Frosch, M. and Vogel, U. (2002). Many
carried meningococci lack the genes required for capsule synthesis and transport. Microbiology-SGM 148, 1813–1819.
Colles, F. M., Jones, K., Harding, R. M. and Maiden, M. C. J. (2003). Genetic
diversity of Campylobacter jejuni isolates from farm animals and the farm environment. Applied and Environmental Microbiology 69, 7409–7413.
De Boer, P., Wagenaar, J. A., Achterberg, R. P., van Putten, J. P. M., Schouls,
L. M. and Duim, B. (2002). Generation of Campylobacter jejuni genetic diversity in vivo. Molecular Microbiology 44, 351–359.
Donnelly, P. and Tavaré, S. (1995). Coalescents and genealogical structure under
neutrality. Annual Review of Genetics 29, 401–421.
Falush, D., Kraft, C., Taylor, N. S., Correa, P., Fox, J. G., Achtman, M. and
Suerbaum, S. (2001). Recombination and mutation during long-term gastric
colonization by Helicobacter pylori: Estimates of clock rates, recombination size,
and minimal age. PNAS 98, 15056–15061.
Falush, D., Wirth, T., Linz, B., Pritchard, J. K., Stephens, M., Kidd, M., Blaser,
M. J., Graham, D. Y., Vacher, S., Perez-Perez, G. I., Yamaoka, Y., Megraud,
F., Otto, K., Reichard, U., Katzowitsch, E., Wang, X. Y., Achtman, M. and
Suerbaum, S. (2003). Traces of human migrations in Helicobacter pylori populations. Science 299, 1582–1585.
Fearnhead, P. (2003). Consistency of estimators of the population-scaled recombination rate. Theoretical Population Biology 64, 67–79.
Fearnhead, P. and Donnelly, P. (2001). Estimating recombination rates from
population genetic data. Genetics 159, 1299–1318.
Feil, E. J., Smith, J. M., Enright, M. C. and Spratt, B. G. (2000). Estimating recombinational parameters in Streptococcus pneumoniae from multilocus
sequence typing data. Genetics 154, 1439–1450.
French, N. P., Barrigas, M., Brown, P., Ribiero, P., Williams, N. J., Leatherbarrow, H., Birtles, R., Bolton, E., Fearnhead, P. and Fox, A. (2004). Spatial epidemiology and natural population structure of Campylobacter jejuni colonising
a farmland ecosystem. Submitted to Environmental Microbiology.
Frisse, L., Hudson, R. R., Bartoszewicz, A., Wall, J. D., Donfack, J. and Di
Rienzo, A. (2001). Gene conversion and different population histories may
explain the contrast between polymorphism and linkage disequilibrium levels.
American Journal of Human Genetics 69, 831–843.
Hilliker, A. J., Harauz, G., Reaume, A. G., Gray, M., Clark, S. H. and Chovnick,
A. (1994). Meiotic gene conversion tract length distribution within the rosy
locus of Drosophila melanogaster. Genetics 137, 1019–1026.
Hudson, R. R. (2001). Two-locus sampling distributions and their application.
Genetics 159, 1805–1817.
Hudson, R. R. (2002). Generating samples under a Wright-Fisher neutral model
of genetic variation. Bioinformatics 18, 337–338.
Hudson, R. R., Slatkin, M. and Maddison, W. P. (1992). Estimation of levels of
gene flow from DNA-sequence data. Genetics 132, 583–589.
McVean, G. A. T., Awadalla, P. and Fearnhead, P. (2002). A coalescent method
for detecting recombination from gene sequences. Genetics 160, 1231–1241.
McVean, G. A. T., Myers, S. R., Hunt, S., Deloukas, P., Bentley, D. R. and
Donnelly, P. (2004). The fine-scale structure of recombination rate variation
in the human genome. Science 304, 581–584.
Rambaut, A. and Grassly, N. C. (1997). Seq-Gen: an application for the Monte
Carlo simulation of DNA sequence evolution along phylogenetic trees. Comput
Appl Biosci 13, 235–238.
Schouls, L. M., Reulen, S., Duim, B., Wagenaar, J. A., Willems, R. J. L., Dingle,
K. E., Colles, F. M. and Van Embden, J. D. A. (2003). Comparative genotyping
of Campylobacter jejuni by amplified fragment length polymorphism, multilocus sequence typing, and short repeat sequencing: Strain diversity, host range,
and recombination. Journal of Clinical Microbiology 41, 15–26.
Smith, N. G. C. and Fearnhead, P. (2004). Comparative performance and robustness of three estimators of the recombination rate. In preparation.
Stumpf, M. P. H. and McVean, G. A. T. (2003). Estimating recombination rates
from population-genetic data. Nature Reviews Genetics 4, 959–968.
Suerbaum, S., Lohrengel, M., Sonnevend, A., Ruberg, F. and Kist, M. (2001).
Allelic diversity and recombination in Campylobacter jejuni. Journal of Bacteriology 183, 2553–2559.
Wall, J. D. (2004). Estimating recombination rates using three site likelihoods.
Genetics 167, 1461–1473.
Wiuf, C. and Hein, J. (2000). The coalescent with gene conversion. Genetics 155,
451–462.
Figure 1: Pairwise plot of Linkage Disequilibrium (LD; as measured by D′ below
the diagonal and the Likelihood Ratio statistic for linkage equilibrium above the
diagonal). The MLST data was concatenated (in the order the genes appear on
the genome: asp, gln, glt, pgm, unc, gly and tkt), and the boundary of each gene
is marked by a black line. Each box represents a pair of segregating sites; the
boundaries of the boxes are equidistant between neighbouring segregating sites,
and the colour of the box shows the amount of LD (ranging from red, high LD, to white, low LD) between the two sites. Only sites with minor allele frequency
greater than 10% are included in the plot.
Figure 2: Histograms of the mles of the tract length (left) and the recombination rate per kb
(right) from 1,000 simulated data sets; the vertical axes show frequency. The true values were 500bp and 5 per kb
respectively.