Multiple Transcription Factor Binding Sites
Predict AID Targeting in Non-Ig Genes
This information is current as
of June 16, 2017.
Jamie L. Duke, Man Liu, Gur Yaari, Ashraf M. Khalil, Mary
M. Tomayko, Mark J. Shlomchik, David G. Schatz and
Steven H. Kleinstein
J Immunol 2013; 190:3878-3888; Prepublished online 20
March 2013;
doi: 10.4049/jimmunol.1202547
http://www.jimmunol.org/content/190/8/3878
References
Subscription
Permissions
Email Alerts
http://www.jimmunol.org/content/suppl/2013/03/20/jimmunol.120254
7.DC1
This article cites 48 articles, 25 of which you can access for free at:
http://www.jimmunol.org/content/190/8/3878.full#ref-list-1
Information about subscribing to The Journal of Immunology is online at:
http://jimmunol.org/subscription
Submit copyright permission requests at:
http://www.aai.org/About/Publications/JI/copyright.html
Receive free email-alerts when new articles cite this article. Sign up at:
http://jimmunol.org/alerts
The Journal of Immunology is published twice each month by
The American Association of Immunologists, Inc.,
1451 Rockville Pike, Suite 650, Rockville, MD 20852
Copyright © 2013 by The American Association of
Immunologists, Inc. All rights reserved.
Print ISSN: 0022-1767 Online ISSN: 1550-6606.
Downloaded from http://www.jimmunol.org/ by guest on June 16, 2017
Supplementary
Material
The Journal of Immunology
Multiple Transcription Factor Binding Sites Predict AID
Targeting in Non-Ig Genes
Jamie L. Duke,* Man Liu,†,1 Gur Yaari,‡ Ashraf M. Khalil,x Mary M. Tomayko,{
Mark J. Shlomchik,†,x David G. Schatz,†,‖ and Steven H. Kleinstein*,‡
S
omatic hypermutation (SHM) occurs in germinal center
(GC) B cells, resulting in the introduction of point mutations into Ig genes. Although SHM provides an important
source of genetic diversity, capable of producing specific Abs for
quickly evolving pathogens, the process also poses a severe threat
to genomic stability. Activation-induced cytidine deaminase (AID),
*Interdepartmental Program in Computational Biology and Bioinformatics, Yale
University, New Haven, CT 06511; †Department of Immunobiology, Yale University
School of Medicine, New Haven, CT 06510; ‡Department of Pathology, Yale University School of Medicine, New Haven, CT 06510; xDepartment of Laboratory
Medicine, Yale University School of Medicine, New Haven, CT 06510; {Department
of Dermatology, Yale University School of Medicine, New Haven, CT 06510; and
‖
Howard Hughes Medical Institute, New Haven, CT 06510
1
Current address: Drinker Biddle & Reath LLP, Washington, D.C.
Received for publication September 26, 2012. Accepted for publication February 15,
2013.
J.L.D. was supported in part by the Pharmaceutical Research and Manufacturers of
America Foundation and National Institutes of Health Grant T15 LM07056 from the
National Library of Medicine. D.G.S. is an investigator of the Howard Hughes
Medical Institute. Computational resources were provided by the Yale University
Biomedical High Performance Computing Center (National Institutes of Health
Grant RR19895).
J.L.D. and S.H.K. designed the analyses; M.L. and D.G.S. designed the RNA polymerase II ChIP-Seq experiment; M.L. performed the ChIP portion of the experiment;
M.M.T. and M.J.S. provided the microarray expression data; J.L.D. and G.Y. wrote
software for the analyses; J.L.D. performed the analyses; and J.L.D., D.G.S., and S.H.K.
wrote the manuscript. All authors commented on the manuscript.
The microarray data presented in this article have been submitted to the Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44260)
under accession number GSE44260.
Address correspondence and reprint requests to Dr. Steven H. Kleinstein, Yale University School of Medicine, 300 George Street, Suite 505, New Haven, CT 06511.
E-mail address: [email protected]
The online version of this article contains supplemental material.
Abbreviations used in this article: AID, activation-induced cytidine deaminase; ChIP,
chromatin immunoprecipitation; ChIP-Seq, chromatin immunoprecipitation followed
by massively parallel sequencing; CSR, class switch recombination; dKO, double
knockout; FDR, false-discovery rate; GC, germinal center; GSEA, gene set enrichment analysis; KO, knockout; KW, Kruskal–Wallis test; MWU, Mann–Whitney U
test; NB, negative binomial; RefSeq, National Center for Biotechnology Information
Reference Sequence; SHM, somatic hypermutation; TC-Seq, translocation-capture
sequencing; TSS, transcription start site; ZI-NB, zero-inflated negative binomial.
Copyright Ó 2013 by The American Association of Immunologists, Inc. 0022-1767/13/$16.00
www.jimmunol.org/cgi/doi/10.4049/jimmunol.1202547
the enzyme that deaminates cytosines to initiate SHM, can act
outside of the Ig locus. In a previous sequencing study, we showed
that .45% of expressed genes in GC B cells are targeted by AID in
Ung2/2 Msh22/2 double-knockout (dKO) mice, where the absence
of DNA repair reveals the “footprint” of AID. Even among genes
that were targeted by AID, this study revealed a wide range of
mutation frequencies observed across 83 genes (1). In this study,
we seek to address two basic questions that are raised by the former
study: Why are some genes targeted by AID, whereas others are
not? and How do the genes targeted by AID accumulate different
levels of mutation? The main hypothesis we pursue is that sequence
features of each gene are responsible for this differential targeting.
The current model of SHM proposes two phases (2). In the first
phase, AID converts a cytosine (C) residue to a uracil (U) in ssDNA
created during the process of transcription, which, if left unrepaired,
leads to a C to T (thymine) transition mutation when the DNA is
replicated for cell division (3). The second phase of SHM begins
when DNA repair mechanisms attempt to remove the uracil lesion
from the DNA. The repair of the uracil happens via two pathways:
base excision repair with UNG and mismatch repair facilitated by
the MSH2/MSH6 complex, both of which are capable of working
in an error-prone fashion and contributing to the observed mutation frequency (4). In the dKO setting, the second phase of SHM
is unavailable, thus revealing the underlying “footprint” of AID,
where the expectation is primarily C → T transition mutations. We
previously sequenced 83 non-Ig genes from dKO mice on average
70 times per gene over a 1-kb region downstream of the transcription start site (TSS) (1). Mutation frequencies varied widely,
ranging from ,1 3 1025 to 116.1 3 1025 mutations/bp, but they
were highly predictable for the same gene across samples from
multiple mice. In the same system, sequencing of an IgH positive
control, specifically the VhJ558-Jh4 intron 39 flanking region
(hereafter referred to as the Jh4 intron), found a mutation frequency
of 9.96 3 1023 mutations/bp. Each gene represents a unique genomic context in which to explore the various properties associated
with AID targeting.
Differential AID activity in non-Ig genes may be influenced by
multiple underlying mechanisms. A higher transcription rate may
be associated with an increased mutation frequency. Genes with
Downloaded from http://www.jimmunol.org/ by guest on June 16, 2017
Aberrant targeting of the enzyme activation-induced cytidine deaminase (AID) results in the accumulation of somatic mutations
in ∼25% of expressed genes in germinal center B cells. Observations in Ung2/2 Msh22/2 mice suggest that many other genes
efficiently repair AID-induced lesions, so that up to 45% of genes may actually be targeted by AID. It is important to understand
the mechanisms that recruit AID to certain genes, because this mistargeting represents an important risk for genome instability.
We hypothesize that several mechanisms combine to target AID to each locus. To resolve which mechanisms affect AID targeting,
we analyzed 7.3 Mb of sequence data, along with the regulatory context, from 83 genes in Ung2/2 Msh22/2 mice to identify
common properties of AID targets. This analysis identifies three transcription factor binding sites (E-box motifs, along with YY1
and C/EBP-b binding sites) that may work together to recruit AID. Based on previous knowledge and these newly discovered
features, a classification tree model was built to predict genome-wide AID targeting. Using this predictive model, we were able to
identify a set of 101 high-interest genes that are likely targets of AID. The Journal of Immunology, 2013, 190: 3878–3888.
The Journal of Immunology
a higher mutation frequency may contain a large number of AID
hotspots, such as WRC (W = A/T; R = A/G), and/or few AID
coldspots, such as SYC (S = C/G; Y = C/T), where the C is the
mutated position (5, 6). Clonal recruitment of AID to certain
genes may lead to an increased mutation frequency (7). Finally,
the genes for which high mutation frequencies are observed may
share functional elements, like transcription factor binding sites,
which recruit AID to the locus for mutation. In this study, we first
examine each of the possible mechanisms independently and then
develop an integrated model to predict targeting of AID in the
non-Ig genes.
Materials and Methods
Stratification of dKO genes
Gene-expression analysis
BALB/c background mice carrying IgH-transgenic alleles and targeted
deletions of the endogenous JH locus (VH186.2-Tg JH KO and V23-Tg JH
KO) were immunized i.p. with 50 mg NP25-chicken g globulin precipitated
in alum, as described (9). Fourteen days later, splenocytes were isolated
and stained with fluorescently labeled peanut agglutinin (Vector Laboratories), anti-l (goat polyclonal; SouthernBiotech), and anti-CD45R/B220
(RA3-6B2) to identify l+ PNA+ B220+ GC B cells. Cells were purified on
a FACSAria cell sorter (BD), as previously described (14). Live/dead
discrimination was accomplished based on forward/side scatter and propidium iodide exclusion. Cells were kept at 4˚C in buffers containing
0.05% sodium azide to minimize alterations in gene expression. All animal
immunizations and experiments were approved by the Yale Institutional
Animal Care and Use Committee.
The mRNA expression levels among sorted cells were determined using
Affymetrix Mouse 430 2.0 microarray and previously described methods
(14, 15). Annotation files were obtained directly from Affymetrix (v. 31)
and were used to associate the gene symbols and National Center for
Biotechnology Information Reference Sequence (RefSeq) (16) identifiers
with each probe set. The samples were normalized using GC Robust Multiarray Average algorithm. The average expression value reported for each
gene was determined by taking the log2 of the average expression values
across all probes associated with each RefSeq identifier for all samples.
These microarray data are available through the Gene Expression Omnibus: http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44260.
Polymerase II chromatin immunoprecipitation followed by
massively parallel sequencing
Mice carrying an mVH186.2-Tg IgH Tg allele and targeted deletions of
the JH locus (17) were immunized with NP25-chicken g globulin, and
spleens were removed 12 d postimmunization. B cells were enriched by an
EasySep customized B cell negative selection kit (without anti-CD43
Ab; Stem Cell Technologies). Cells were stained with anti-mouse B220–
Cy-Chrome, anti-mouse CD19-PE, and anti-mouse GL7-FITC (all from
Becton Dickinson Biosciences) Abs and sorted on a DakoCytomation
MoFlo cell sorter at the Yale Immunobiology Cell Sorting Facility.
Dynabeads Protein A (Invitrogen) were incubated with RNA polymerase
II Ab N20 (Santa Cruz Biotechnologies) or normal rabbit serum. Excess Ab
was washed away. Then Ab-bound beads were incubated with chromatin
from 20 million sorted spleen GC B cells (previously cross-linked with 1%
HCHO and then sonicated to shear the DNA fragments to 100–300 bp) at 4˚C
overnight. Beads were washed, chromatin was eluted, and the cross-linking
was reversed. DNA was purified, precipitated, and redissolved in TE buffer.
Precipitated DNA was quantified using a PicoGreen dsDNA quantification
kit (Molecular Probe). A total of 200 ng chromatin immunoprecipitation
(ChIP) DNA (from 40 million cells) ends was repaired using polynucleotide
kinase and Klenow enzyme, followed by treatment with Taq polymerase to
generate a protruding 39 A base used for adaptor ligation. Following ligation
of a pair of Solexa adaptors to the repaired ends, the ChIP DNA was amplified using the adaptor primers for 17 cycles, and the fragments around
220 bp (mononucleosome + adaptors) were isolated from agarose gel. The
purified DNA was used directly for cluster generation and sequencing analysis using the Solexa 1G Genome Analyzer, following the manufacturer’s
protocols.
The resulting 25-bp reads were aligned against the mouse genome (mm8)
using Efficient Local Alignment of Nucleotide Data (Illumina), allowing up
to two mismatches against the reference. The reads kept for further analysis
had to map uniquely to the genome, and a maximum of three copies of the
same read was kept to reduce PCR artifact. The reads were then converted
to browser extensible data format. The TSSs for all genes were identified
using the RefSeq identifier (16) and were obtained through the UCSC
Genome Table Browser (18) for the mouse mm8 genome build. For each
group, genes were aligned by the TSS, and the maximum peak of overlapping reads was determined for each gene in the region 100 bases around
the TSS. Differences among the three groups were determined using the
Kruskal–Wallis test (KW), and the Mann–Whitney U test (MWU) was
used for determining differences among pairs of groups.
Mutability
Mutations and all positions sequenced that occurred at C on either the top or
bottom strands were considered for analysis if the residue and the two
residues directly upstream passed the neighborhood quality standard criteria
of Liu et al. (1). Both mutations and positions that did not meet this criterion were excluded from the analysis, resulting in an observed mutation
frequency for each gene that is slightly different from the previously
published figures. Mutability was calculated in a manner similar to that of
Shapiro et al. (19), restricted to the case when the mutation is a C residue
in the third position (or G in the first position for the reverse complement).
To do so, the frequency of mutations in each of the 16 possible dinucleotides upstream of the mutated C residue was tabulated for each sequence and normalized by the total number of mutations in that sequence.
The same was done for all sequenced C residues to determine the background. The mutability for each sequence was calculated by dividing by
the normalized frequency of the dinucleotide motifs for the mutated residues by the background of the sequenced region. The overall mutability
for a set of sequences was calculated as the mean mutability in individual
sequences weighted by the total number of mutations in each sequence.
Error bars were calculated by bootstrapping the original set of sequences
10,000 times. The p values were calculated for each motif by comparing
the bootstrapped mutability values and the observed mutability values for
each sample. An aggregated p value for all motifs was computed using the
Fisher method for combining p values of individual tests.
Hot/cold-spot analysis
For each gene, the region sequenced was analyzed for occurrences of the
AID hotspot WRC and the AID coldspot SYC on both the top and bottom
strands. The total number of hotspots found in the sequenced region was
normalized by the number of C and G residues in the sequence to account
for the unique composition of each gene.
Negative binomial analysis
The number of mutations from each sequence was determined for each of
the 15 genes in group A and for the Jh4 intron. The negative binomial (NB)
distribution was fit to each gene using the fitdistr method available through
the MASS package for R (20), and a x2 goodness-of-fit test was applied. In
the cases in which the NB did not fit the data (p , 0.05), a zero-inflated
NB (ZI-NB) distribution was fit as defined through the emdbook package
for R (21) and tested using a x2 goodness-of-fit test.
Determining transcription factor binding sites
The region 2 kb around the TSS for each sequenced gene was calculated
based on RefSeq gene definitions (accessed July 2009). The corresponding
aligned sequences for both mouse (mm9) and human (hg18) were obtained
Downloaded from http://www.jimmunol.org/ by guest on June 16, 2017
Genes were selected for sequencing in our previous study based on multiple
criteria (1): expression of the gene determined through microarray studies
in both mouse and human B cells (8–11), because it is expected that
expressed genes are undergoing transcription, a requirement for AID targeting (3, 12); a well-defined TSS and chromosomal location; and a high
level of homology between the mouse and human genes for the first exon.
Genes known to be involved in the immune response, tumorigenesis, cell
proliferation, or apoptosis or known to undergo chromosomal translocations or deletions, especially in B cell tumors, were given preference.
Finally, the set of genes chosen was also selected to provide good coverage
across the set of mouse chromosomes. A total of 83 non-Ig genes was
sequenced from Peyer’s patches of Ung2/2 Msh22/2 dKO mice, in addition to the positive control of the Ig Jh4 intron. Each gene was sequenced, as previously described, in a 1-kb region directly downstream of
the TSS (1). Mutations were determined through use of the neighborhood
quality standard algorithm (13), using previously defined criteria (1). The
resulting mutation frequency of each gene in the dKO setting was compared with the background mutation rate determined by sequencing 31
genes from Aicda2/2 knockout (KO) mice using the Fisher exact test.
False-discovery rate (FDR) q-values were determined for each gene (l = 0)
and used as the basis for ranking the genes and determining group
assignments: group A = q-value , 1026; group B = 1026 # q-value , 1022;
and group C = q-value $ 1022.
3879
3880
AID TARGETING IN NON-Ig GENES
from the multiple sequence alignment, multiz30way (22), for the mouse
genome available from the UCSC Genome Bioinformatics Site (23). Repetitive sequences were masked from both the mouse and human sequences using RepeatMasker (24). With regard to the identification of Ebox core sequences, two consensus sequences defined the site: CASSTG
(where S = C or G) and CANNTG (where n = A, C, G or T). In addition,
the MATCH (25) program was used to scan the sequences for hits of
TRANSFAC-defined binding sites (26) (including all high-quality vertebrate binding sites used for v. 2010.3) using the cutoff values defining a hit
to minimize the sum of false positives and false negatives. Only hits that
occurred in the same position in the mouse and human aligned sequences
were retained for analysis. Note that this step of evolutionary conservation
does not impose upon the model an assumption that mutation spectrum
from AID targeting in the species used for the conservation will correlate
perfectly, or at all, with mouse.
Gene set enrichment analysis
Location analysis
Binding site locations were compared between sets of genes using a Bonferroni-corrected one-tailed MWU. For each gene, the binding site closest to
the TSS (in the region 6 7.5 kb) was selected for analysis. Two sets of
control genes were included. The first set of control genes was defined as
the top 10% of genes with the highest average expression (across all probe
sets and samples) in the GC B cell microarray data previously described.
The second set of control genes included the set of 17 nonexpressed genes
that was sequenced in the wild-type setting (1).
Colocation and trilocation analysis
For the pair-wise analysis of transcription factor binding site locations
performed, a score was computed for each possible pair of binding sites
between two transcription factors (Eq. 1),
Colocation score5dðSÞ 1 dðTÞ;
ð1Þ
where d(S) is the distance between the binding sites for each factor, and
d(T) is the distance of the binding site furthest from the TSS. Genes
without a predicted binding site in the analyzed region defined were
assigned a distance equal to the maximal distance from the TSS. The genes
were again split into groups A, B, and C and compared against the two sets
of control genes from above. In computing the trilocation score of three
factors, d(S) was defined as the maximal distance between all three sites. A
Bonferroni-corrected one-tailed MWU was used to determine the differences between the groups of genes and the control sets.
Estimation of variable importance
To determine the significance of the variables that were used in this analysis,
the measurements were obtained for all group A, B, and C genes for all
independent attributes measured across all genes. In addition, the total
number of binding sites for each significant factor within the region 2 kb
around the TSS was included for analysis. The cforest algorithm (28)
(available through the party package for R) was used to generate a random
forest of classification trees to determine the variable importance using
an unbiased approach. A forest of 1000 trees was built with the following
parameters specified: minimum samples in a node for splitting = 15,
minimum number of samples in a leaf = 5, and number of variables randomly selected at each node = 5. Variable importance was calculated from
the random forest, defined by the mean decrease in prediction accuracy
when values are permuted for each variable. Important variables were defined as having a variable importance score greater than the absolute value
of the least important variable (29).
Classification tree model to predict genome-wide SHM
The classification tree model was generated using the rpart package for R
(30). The model was constructed using the default parameters, with the
where the columns correspond to the true group A, B, or C and the rows
correspond to predicting group A, B, or C. Thus, a gene from group B
predicted to be in group A results in a loss (or penalty) for misclassification of 10. Ten-fold cross-validation was performed during the
construction of the model and used to prune the full tree by choosing the
tree that minimized the cross-validated error to create the final model
(31).
Results
Mutation frequency is correlated weakly with transcription
Transcription is required for SHM at the Ig locus, and we previously
observed that nontranscribed genes did not accumulate mutations
in the genome-wide setting (1). We asked whether, for expressed
genes, there is a correlation with mutation frequency. We previously determined the mutation frequency for 83 non-Ig genes
isolated from Ung2/2 Msh22/2 dKO Peyer’s patch B cells and
classified these into three groups: A, B, and C (1) (see Materials
and Methods). In this setting, group A genes had the highest
mutation frequency, percentage of C → T transition mutations,
and hotspot focusing, followed by genes in groups B and C, with
decreases in each subsequent group, respectively. The relative
mRNA expression of these genes was compared in GC B cells; we
found a minimal correlation between a gene’s log2 mRNA expression level, as measured by Affymetrix microarray, and its
mutation frequency (r = +0.079) (Supplemental Fig. 1A). When
genes were grouped based on their p value for AID targeting
(group A and B genes accumulate mutations that are significantly
different from the background and show strong signs of AID targeting, whereas genes in group C are minimally mutated compared
with the background), we observed a positive trend between average log2 expression and mutation frequency (Fig. 1A). However,
there is not a significant difference between these groups (p =
0.064, KW). Thus, although AID-targeted genes seem to have
higher average log2 mRNA expression at the group level, this was
not statistically significant, and the correlation between mutation
and steady-state transcript levels is very weak.
Because steady-state mRNA levels are not a direct measure of
the rate of transcription and performing a nuclear run-on assay
would be difficult with GC B cells (32), RNA polymerase II ChIP
followed by massively parallel sequencing (ChIP-Seq) was performed on GC B cells to better address the relationship between
transcription rate and mutation frequency. For each gene, transcription was quantified by the maximum peak of overlapping
polymerase II tags in the region 100 bp around the TSS. A statistically significant difference was found among the genes in groups
A, B, and C (p = 0.04, KW), indicating an unequal distribution of
polymerase II (Fig. 1B). A further examination of the polymerase II
levels among the groups revealed that group A had a significantly
higher amount of polymerase II compared with both group B
(p = 0.0045, MWU) and group C (p = 4.6 3 1024). As a control,
group A, B, and C genes all had significantly higher polymerase II
levels compared with the set of 17 nonexpressed genes that were
also analyzed in the previous study (1) (p values: group A = 8.2 3
1027; group B = 9.0 3 1025; group C = 5.1 3 1024, MWU). Thus,
AID targeting is correlated positively with higher levels of polymerase II near the TSS of a gene. However, as with mRNA expression, polymerase II levels are only weakly predictive of mutation
frequency at the individual gene level (Fig. 1C).
Downloaded from http://www.jimmunol.org/ by guest on June 16, 2017
Gene set enrichment analysis (GSEA) was performed using the GSEA
Preranked tool with classic weighting within the GSEA tool (27). Genes
from the dKO setting were ranked according to the FDR q-value, comparing
the dKO mutation frequency with the background mutation frequency from
Liu et al. (1). A total of 10,000 permutations was used to determine the null
distribution of enrichment scores for each gene set and to calculate the
p value. The FDR q-value corrects for multiple hypothesis testing and is
influenced by the total number of gene sets tested. In the case of E-boxes and
YY1 sites, the total number of gene sets tested was 6; in the general test
of all other transcription factor binding sites (including C/EBP-b), the total
number of gene sets tested was 705.
exception of the following parameters: the minimum number of samples
in each node to allow a split was set to 9, the splitting index used was
“information,” and the loss matrix used was:
3
2
0 3 15
4 10 0 10 5;
20 1 0
The Journal of Immunology
3881
Increased AID hotspot density does not lead to higher mutation
frequency
AID preferentially targets WRC/GYW motifs, so it is possible that
genes with more of these motifs may accumulate a greater number
of mutations independent of AID recruitment to the locus. To test
this hypothesis, we first confirmed that the hot/cold-spot targeting
of AID, which was previously defined at the Ig locus, was conserved in the genome-wide setting. It was shown previously that
the observed mutations, especially in the group A genes, were
biased toward the SHM hotspot WRCY/RGYW motif (1). To
extend this analysis to cover the full spectrum of DNA motifs,
focusing on the AID hotspot, and to increase coverage across
all motifs given the relatively low number of mutations in the
genome-wide dataset, the relative mutability of each of the 16
possible dinucleotide combinations in the motif NNC/GNN was
calculated. Under the null hypothesis of no hot/cold-spot targeting, the mutability for each motif will be focused around a value
of 1. However, we observed significant deviations from this null
hypothesis for both the group A genes and the positive control Jh4
intron. In each case, the same general trend is observed: trinucleotide motifs representing the classic hotspot WRC have a
higher mutability than random targeting, whereas those representing the classic coldspot SYC have a lower mutability than
random targeting (Fig. 2A). In addition, inspection of the bootstrap interval for the mutability of each motif shows a high overlap between the two datasets. Thus, we conclude that AID displays
a range of hot/cold-spot preferences in the genome-wide setting
that is consistent with those at the Ig locus.
To test whether differences in the mutability of each gene could
account for differential mutation accumulation, we compared the
hot-/coldspot densities to the mutation frequency across the set
of 83 genes from the dKO setting. Supplemental Fig. 1B shows
a minimal negative relationship between hotspot frequency and
mutation frequency (r = 20.16), and this was not significant when
the genes were split into groups A/B/C (p = 0.33, KW) (Fig. 2B).
The AID coldspot SYC/GRS displayed a weak positive relationship with the mutation frequency (r = +0.2118, Supplemental Fig.
1C), but once again this was not statistically significant (p = 0.31)
for groups A/B/C (Fig. 2C). We conclude that the density of AID
hot-/coldspots does not account for the differential targeting of
AID to group A/B/C genes.
Clonal recruitment of AID is not preserved in non-Ig genes
Clonal recruitment of SHM to the Ig locus was suggested based on
transgene studies (7). According to this model, once a particular
gene in a cell is targeted, the gene remains marked as accessible to
FIGURE 2. Genome-wide AID hot/cold-spots mirror the Ig locus but do not influence overall targeting. (A) The relative mutability of each DNA triplet
motif ending in C based on group A genes and the Jh4 intron. The error bars represent the 95% confidence interval, as determined through bootstrapping of
the original sequences. (B) Normalized WRC/GYW hotspot frequency for each group of genes. (C) Normalized SYC/GRS coldspot frequency for each
group of genes.
Downloaded from http://www.jimmunol.org/ by guest on June 16, 2017
FIGURE 1. Transcription is correlated weakly with mutation frequency. (A) Normalized mRNA expression levels (averaged over all probe sets and
microarray samples) for genes in groups A, B, and C. A set of 17 nonexpressed genes (NE) is included as a control. (B) The maximum peak for RNA
polymerase II binding based on ChIP-Seq of GC B cells for each group of genes, including the control of nonexpressed genes (NE). (C) Comparison of the
maximum peak of RNA polymerase II and the observed mutation frequency for the 83 individual genes sequenced in the dKO setting.
3882
E-boxes, YY1, and C/EBP-b binding sites are associated with
AID targeting
We previously demonstrated that evolutionarily conserved E-box
motifs (CASSTG), which bind various E-proteins, including E12
and E47, were enriched in the region 2 kb around the TSS for highly
mutated genes (1). Extending these results, we also found moderate
enrichment for a more general form of the motif (CANNTG)
(GSEA p value = 0.006; FDR q-value = 0.061). This association
between E-box motifs and highly mutated genes is further supported
by a shift in the location of the motif in group A genes toward the
TSS (CASSTG, p = 0.0089; CANNTG, p = 0.0092, MWU). Previous studies showed that binding sites located closer to the TSS
have a higher probability of being true regulatory sites (37, 38).
These results strongly support a role for E-boxes in recruiting AID.
We next tested whether YY1 binding sites were associated
with AID targeting, because this transcription factor was recently
shown to be a regulator of the GC gene-expression program (39).
Four separate definitions exist for a YY1 binding site in the
TRANSFAC database, where each defined YY1 binding site has
a different length and slightly different consensus motif. Two of
these four binding site definitions (YY1_Q6 and YY1_Q6_02)
were significantly enriched in the promoter regions of group A
genes (GSEA p , 0.05; FDR q-value , 0.1). Further supporting
the involvement of YY1, these binding sites tended to be farther
from the TSS in group C genes (Fig. 4). This shift in location was
significant for the YY1_Q6_02 binding site definition (p = 0.014),
and we refer to this definition as a YY1 binding site throughout
the rest of this article. Thus, along with E-box motifs, YY1 motifs
are associated with the recruitment of AID.
To identify additional transcription factors influencing AID
targeting, we screened the remaining set of 705 high-quality vertebrate TRANSFAC transcription factor binding sites. This analysis
identified an additional factor associated with AID targeting:
evolutionarily conserved binding sites for C/EBP-b (CCAATenhancer binding protein, b), also known as NF-IL6 (p , 0.001;
FDR q-value = 0.094). As with both E-box and YY1 sites, the
location of C/EBP-b binding sites tends to drift away from the TSS
with the decrease in mutation frequency (Fig. 4), and this is significant for group C genes (p = 0.0086). At no point in our analysis
did we find enrichment for binding sites inhibiting AID targeting.
Altogether, our transcription factor binding site screen and location
analysis identified three binding sites that could be involved in the
recruitment of AID: E-boxes, YY1, and C/EBP-b.
Colocation of E-boxes, YY1, and C/EBP-b sites in AIDtargeted genes
Because we showed that evolutionarily conserved binding sites for
E-boxes, YY1, and C/EBP-b have a nonrandom location distri-
FIGURE 3. AID is clonally recruited to the Ig locus but not to non-Ig genes. (A) The frequency distribution of the total number of mutations per Jh4
intron sequence was fit to both the NB (s) and the ZI-NB (n). A similar analysis was carried out for group A genes, including Myc (B), Bcl6 (C), H2afx (D),
and Eif4a2 (E).
Downloaded from http://www.jimmunol.org/ by guest on June 16, 2017
SHM in the cell’s progeny. If true, the differential ability of genes
to initiate clonal recruitment could contribute to the wide range of
observed mutation frequencies in the genome-wide setting. We first
sought to confirm the idea of clonal recruitment at the Ig locus by
analyzing the distribution of mutation counts in the Jh4 intronic
region. If mutation acts randomly among all of the cells in a population, the number of mutations per sequence should fit an NB
distribution (33, 34). In contrast, the process of clonal recruitment
implies that only some of the cells will accumulate mutations
efficiently, whereas others that fail to recruit AID will not accumulate mutations at all. In this case, we propose that the number of
mutations per sequence should fit a ZI-NB distribution, which has
the capability of modeling the distribution of the number of mutations per sequence for cells that have recruited AID separately
from those with no mutations present. Fig. 3A clearly shows that
the NB distribution is a poor model for the observed mutations per
sequence in the Jh4 intron (p = 2.6 3 1028, x2 test), whereas the
ZI-NB distribution is appropriate (p . 0.05) (35, 36). Thus, these
data support the idea of clonal recruitment of AID at the Ig locus.
In contrast, the NB distribution was an appropriate fit to all genes
in group A (p . 0.05) (Fig. 3, Supplemental Fig. 2). This was true
for both genes with a low mutation frequency, like Eif4a2 (Fig.
3E), as well as the most mutated genes, like Myc, Bcl6, and H2afx
(Fig. 3B–D). Thus, unlike the Ig locus, we find no evidence for
clonal recruitment of AID in the genome-wide setting, and this
process is unlikely to play a role in the differential accumulation
of mutations.
AID TARGETING IN NON-Ig GENES
The Journal of Immunology
3883
FIGURE 4. E-box, YY1, and C/EBP-b binding
sites exhibit a nonrandom location distribution with
respect to the TSS of group A genes. Each gene is
represented by the binding site that is closest to the
TSS within the window 7.5 kb in either direction.
For groups A, B and C, comparisons were made
against two control groups: a set of highly expressed
genes in GC B cells (HC) and a set of nonexpressed
genes (NE). A statistically significant shift in binding site locations is indicated by the group’s name at
the top of the panel (p , 0.05, versus HC, Bonferroni-corrected MWU). No significant difference
was detected for any group compared with the set of
NE genes.
CASSTG E-boxes, YY1, and C/EBP-b sites are trilocated in
highly mutated genes
The observation that both YY1 and C/EBP-b binding sites are
colocated with E-box motifs suggests that the three binding sites
may be located together, perhaps forming a cis-regulatory module
located close to the TSS of the gene. To characterize the organi-
zation of the three sites together, we developed a trilocation score
that incorporates the maximal distances between the three binding
sites and the location from the TSS of the gene in a manner similar
to the colocation score (see Materials and Methods). We find that
both group A and group B have significantly lower trilocation scores
compared with control genes (p , 0.05, Fig. 6). In addition, both
groups A and B have significantly lower trilocation scores compared
with group C (p = 1.7 3 1024 and p = 9.3 3 1024, respectively) but
are not distinguishable from one another (p = 0.07). These results
suggest that E-box motifs, YY1, and C/EBP-b binding sites form
a regulatory module that is capable of recruiting AID to target the
gene for mutation.
A combination of features can predict AID targeting genome
wide
Although none of the features investigated above could individually separate genes in groups A, B, and C (note the high overlap
in box plots for each group), we hypothesized that a combination
of these features could accurately predict AID targeting. As a first
step, importance analysis based on random forests was used to
select a subset of variables to be included in the final model. A
list of variables included in this analysis is shown in Supplemental
Fig. 3. Note that some colocation scores were excluded to reduce
redundancy. Variable importance analysis (see Materials and
Methods) identified seven features as significant for separating
group A, B, and C genes: CASSTG E-box and YY1 colocation,
CASSTG E-box and C/EBP-b colocation, YY1 and C/EBP-b
colocation, number of C/EBP-b sites (6 2 kb around TSS), location of the C/EBP-b site closest to the TSS, and the maximum
polymerase II peak (6 100 bp around the TSS) (Supplemental
Fig. 3).
To predict whether each gene belongs to group A, B, or C, the
seven variables identified as important were used to create a classification tree. In generating this model, we strongly penalized the
FIGURE 5. E-box, YY1, and C/EBP-b binding
sites are colocated in group A genes. For each pair
of enriched transcription factors, the colocation
score combining the distance between factors and
the distance to the TSS was calculated for each
gene. Groups A, B, and C were compared separately with each set of control genes (HC [highly
expressed control] and NE lines at top of plot, see
Fig. 4 legend). Statistically significant shifts in
colocation scores are indicated above the relevant
group. Single letter: p , 0.05, double letter: p ,
0.01, triple letter: p , 0.001, Bonferroni-corrected
MWU.
Downloaded from http://www.jimmunol.org/ by guest on June 16, 2017
bution, we set out to test whether these sites are colocated near
one another. A colocation score was defined that combines the
distance between pairs of binding sites with the distance of the
pair from the TSS of the gene, such that a low colocation score
reflects a pair of binding sites that are located close to one another
and close to the TSS (see Materials and Methods). Using this
scoring metric, we found that CASSTG E-box motifs were colocated both with YY1 and C/EBP-b binding sites in group A genes
at a statistically significant level compared with control genes
(p , 0.01, Fig. 5). In addition, these two pairs of sites are significantly colocated in group A compared with either group B
genes or group C genes (p , 6.5 3 1024 and 6.0 3 1023, respectively). Although a similar trend was observed in the colocation scores of the CANNTG E-box with both YY1 and C/EBP-b,
the colocation signal is much stronger for the CASSTG E-box
motif.
In addition to being colocated with E-box motifs, YY1 and
C/EBP-b sites were found to be colocated with one another in both
group A and B genes (Fig. 5, far right panel), and both groups have
significantly lower colocation scores compared with group C genes
(p = 5.7 3 1024 and 5.0 3 1024, respectively). Interestingly, the
colocation scores from genes in groups A and B are indistinguishable from one another (p . 0.8). Because the strong colocation of YY1 and C/EBP-b binding sites in groups A and B can
distinguish genes that mutate from nonmutating genes (group C), it
seems that more information is required to explain the differential
level of mutation observed between the two groups.
3884
misclassification between groups A and C. The penalty for misclassification of group B genes was deemed less significant for two
reasons. First, the original classification of a gene in this group
could change based on where the initial cut points between groups
was drawn by Liu et al. (1). Second, group B genes are still
considered to be targeted by AID, albeit at a lower level than
group A genes; therefore, misclassification would not be as severe
for this middle group. The full model generated from the data was
pruned to an appropriate size based on the 10-fold cross-validation
error, producing a final model containing five terminal nodes with
four decision points (Fig. 7A). For the 83 genes used to build the
model, the overall misclassification rate was 25%, with the majority of these misclassified genes from group B (as expected by
the penalty setup). Importantly, the misclassification rate between
groups A and C was only 8%. It was determined through crossvalidation that the sensitivity of the model for correctly predicting
a group A gene was 55%, with a specificity of 84% in predicting
non–group A genes. Consequently, this model allows for the accurate separation of genome-wide AID targets (groups A and B)
compared with nontargeted genes (group C).
The classification tree model was applied genome wide to predict
additional AID targets. Of the 22,492 RefSeq-defined mouse genes
that were not used to build the model, 3,648 (16%) were predicted
to be in group A, 1,949 (9%) were predicted to be in group B, and
the remaining 16,895 (75%) were predicted to be group C (Supplemental Table I). To generate a list of high-quality AID targets
that would be of interest for experimental validation, predicted
group A genes were filtered to include only those that have an
average gene expression in the top 25% of all genes in GC B cells,
as determined through mRNA expression; have human homologs
[defined by the Mouse Genome Database (40)] annotated as either
oncogenes or tumor suppressors [identified by an Entrez query
through the CancerGenes tool (41)], or are found within the
COSMIC Cancer Gene Census (42) version 56. This filtering
produced a set of 101 high-interest genes shown in Table I. Some
notable members of this list are immune-related genes: Bcl7a,
Btg1, Cd82, Cdk4, Cxcr4, Foxo1, Foxp1, Irf1, Irf8, Raf1, Rela,
Stat3, and Tcf4.
We carried out two validations of the model by comparing the
predicted AID targets with sets of experimentally identified genes
that should be enriched for actual AID targets. In the first validation, the set of actual AID targets was based on results from
a translocation-capture sequencing (TC-Seq) study (43). This study
identified 92 genes that could translocate with either IgH or Myc
in an AID-dependent manner in ex vivo B cells. Of the 92 TC-Seq
genes, 10 were excluded from the validation because they had been
used to generate the model, 11 were not in our RefSeq database,
and 1 was translocated to Myc in the AID2/2 control. Of the
remaining 70 TC-Seq genes, the model predicted 19 (27%) as
group A, 13 (19%) as group B, and 38 (54%) as group C (Supplemental Table I). When compared with the set of predictions
from all mouse RefSeq-defined genes, the frequency of predicted
group A and B genes was significantly higher among the TC-Seq
genes than expected by chance (p = 0.00019, x2 test) (Fig. 7B).
This overrepresentation of predicted group A and B genes remained
significant, even when the background was restricted to genes that
were highly expressed in GC B cells and, therefore, were more
likely to accumulate AID-induced mutations (p = 0.0031, x2 test).
Overall, these results demonstrate that the classification model
predictions are enriched for actual AID targets.
In the second validation, the set of actual AID targets was defined
by AID binding, according to a ChIP-Seq study (44). Although the
B cells used in this study undergo class-switch recombination
(CSR) and not SHM, we expect that because the two processes are
linked in their use of AID we would still observe enrichment of
AID-binding genes for SHM. This assumption is supported by the
observation that group A, B, and C genes differed in the mean
number of AID sequence tags that were bound (p = 0.013,
ANOVA), with the highest frequency of AID tags found in group
A genes (Fig. 7C). Differential AID binding was also observed
among highly expressed genes that were predicted to be in groups
A, B, and C, with predicted group A genes exhibiting the highest
frequency of AID tags (p = 1.9 3 1029, ANOVA) (Fig. 7C). An
even higher frequency of AID binding was found among the 101
high-interest predicted group A genes. The significantly higher
AID binding among targets predicted by the classification model
may be even more significant because CSR and SHM are dependent on different regions of the AID protein and may involve
process-specific cofactors that can change the targeting (3, 45), so
that complete overlap with the model predictions was not expected. Taken together, these results further support the validity of
the classification tree model to predict genome-wide targets of
AID and highlight the rationale for choosing the set of 101 highinterest genes put forth as candidates for experimental validation.
Discussion
The mechanisms that underlie AID targeting to the Ig locus and
mistargeting to non-Ig loci are poorly understood. We analyzed the
sequence context of 83 non-Ig genes that are targeted by AID to
varying degrees in dKO mice to define several mechanisms responsible for genome-wide AID targeting. Our data confirm the
association of AID targeting with transcription and the presence of
E-box motifs. E-boxes were found to increase the frequency of SHM
in several systems (12, 46, 47). Our data also implicate several new
factors as relevant to the targeting of AID. In particular, YY1 and
C/EBP-b binding sites are associated with increased mutation accumulation, and these sites, together with E-box motifs, are colocated near the TSS of AID targets, perhaps forming a cis-regulatory
module. YY1 is of high interest because it is active in GC B cells
and was recently identified as a regulator for the GC program (39).
Intriguingly, YY1 was recently shown to interact directly with
AID, to influence levels of AID in the nucleus, and to enhance CSR
(48). C/EBP-b mRNA and protein levels were reported to increase
in abundance as B cells mature, and C/EBP-g, the negative regu-
Downloaded from http://www.jimmunol.org/ by guest on June 16, 2017
FIGURE 6. E-boxes (CASSTG), C/EBP-b and YY1 binding sites trilocate in AID targets. For each gene, the trilocation score was calculated
by combining the distance between all three sites and the distance from the
TSS. Groups A, B, and C were compared separately with each set of
control genes (HC [highly expressed control] and NE lines at top of plot,
see Fig. 4 legend). Statistically significant shifts in trilocation scores are
indicated above the relevant group. Single letter: p , 0.05, double letter:
p , 0.01, triple letter: p , 0.001, Bonferroni-corrected MWU.
AID TARGETING IN NON-Ig GENES
The Journal of Immunology
3885
lator of C/EBP-b, decreases with B cell maturity (49). In addition,
a C/EBP-b site was recently identified in a mutational enhancer
element in the Ig L chain of DT40 cells that is conserved in the
condor and zebra finch (50). These results support a role for shared
functional elements in explaining differential AID activity in nonIg genes.
The contribution of AID hot/cold-spots and gene transcription to
mutation accumulation was also evaluated. Although known AID
hotspots are specifically targeted in non-Ig genes (1), the frequency
of hotspots (or coldspots) in the sequenced region did not correlate
with the overall gene-mutation frequency. This suggests that AID
recruitment to the locus is a rate-limiting step. Gene transcription
Table I. High-interest predicted group A genes from classification tree model
Entrez Identifier
Oncogene
Tumor suppressor
Model Leaf
1
3
4
1
3
4
Tumor suppressor and oncogene
Neither
1
3
4
1
3
4
Genes
a
Golga5 , Lmo4, Rab18, Rab30, Rab5c, Snrpe, Ubtf
Actb, Fusa
Cblbb, Mybl1, Nae1, Pfdn5, Rapgef1, Sh3kbp1, Tpd52, Vav1
Anp32b, Ccna2, Cd82, Cdk2ap2, Cdkn2cb, Cfl1, Eif2s1, Flna, Foxp1a, G3bp2, Gabarap,
Gtf2e1, Ing3, Ltf, Mef2c, Nbr1, Nfkbia, Rbbp5, Tcf4, Uhrf1, Zbtb33
Btg2, Ddx3x, Ddx5a, Dhx9, G3bp1, Irf1, Irf8, Klf10, Tnfaip3b, Uvrag
Anxa7, Chfr, Foxo1, Msh2b, Ndufa13, Raf1a, Rnf129, Rnf40, Rtn4, Sdhbb, Sept9, Spint2,
Stk4, Tes, Tusc2, Ywhah, Ywhaq, Zfp238
Ccdn3a, Ptenb, Rela, Rhoa
Ctnnb1a, Cxcr4, Hif1a, Lyn, Stat3
Mapk1, Prkar1aa,b, Smarcb1
Bcl7aa, Brd4a, Ep300a,b, Eps15a, Gnasb, Herpud1a, Ikzf1b, Lasp1a, Myh9a, Ncoa1a, Nina,
Nsd1a, Thrap3a
Btg1a
Cdk4b, Chic2a, Crebbpa,b, Ncoa2a, Nfe2l2b, Numa1a, Sept6a, Sh3gl1a, Tpm3a
Genes are listed according to the Entrez Query Identifier and by the leaf of the model which led to the gene being predicted as a strong AID target.
a
Identified in the COSMIC database as translocation type.
b
Identified in the COSMIC database as other mutation type.
Downloaded from http://www.jimmunol.org/ by guest on June 16, 2017
FIGURE 7. Classification tree model to predict genome-wide targets of AID. (A) Classification tree to predict AID targeting for individual genes. For
each decision node in the tree (gray circle), the gene proceeds down the left branch if it satisfies the decision condition. The leaves of the tree indicate the
predicted group (A, B, or C), along with the number of genes falling into this group from the 83 genes sequenced in the dKO setting (group A/group B/
group C). For example, genes in leaf #1 are predicted to be in group A, and this leaf has a total of nine genes from the training data: six group A genes, three
group B genes, and zero group C genes. (B) AID-mediated translocated genes (43) are enriched among genes predicted by the model to be AID targets
(group A) compared with the set of all genes (22,492 genes) and the set of highly expressed genes comprised of the top 25% of all genes by mRNA
expression levels (4,403 genes). (C) Based on the ChIP-Seq study by Yamane et al. (44), group A and predicted group A genes show greater recruitment of
AID than do either group B or C. The mean and 95% confidence intervals of AID tags per gene are shown for each group of genes. The set of 17
nonexpressed genes is shown as a negative control. Also shown separately is the set of 101 high-interest predicted group A genes. ANOVA was used to
compare the number of AID tags per gene for the group A/B/C genes used to build the model, as well as for the genes predicted by the model to be in
groups A/B/C, with the p value shown above each comparison.
3886
of AID at the Ig locus, at least some features seem to be shared.
Several of the transcription factors that we implicate in genomewide targeting have been associated with targeting at the Ig locus
(46, 47, 53). In addition, we find that AID hot/cold-spots for
mutation are generally shared between the Ig and non-Ig genes.
Interestingly, there is a significant difference in the relative mutability of the four individual trinucleotide motifs that form the
WRC hotspot (Fig. 2A). This may be the result of varying cofactors
present in each unique genomic setting that slightly alters the
specificity of AID to its preferred nucleotide target.
One significant difference found between AID targeting at Ig and
non-Ig loci is the occurrence of clonal recruitment of AID. For a
random mutation process with hot-/coldspots, the distribution of the
number of mutations per sequence should fit an NB distribution (33).
Indeed, we find this model to be an appropriate fit for the non-Ig
genes. In contrast, the NB is a poor fit for the Jh4 intron mutation
data, because this model significantly underestimates the frequency
of unmutated sequences, given the mutation distribution among
sequences that do acquire mutations. Instead, these data can be fit
by a ZI-NB distribution, which separately models the process of
mutation accumulation from the induction of AID activity. This
suggests that these sequences from the Jh4 intron were derived
from two subsets of cells: one subset in which AID had been effectively recruited to the gene being sequenced, resulting in a high
degree of mutation, and a second subset in which AID had not (yet)
been recruited. One caveat of this analysis is that we may not observe enough mutations in the non-Ig genes to be able to detect
whether clonal recruitment of AID is occurring, and further sequencing is necessary to increase the number of observed mutations
in each non-Ig gene. Overall, these results provide additional support for the idea that AID is clonally recruited to the Ig loci but that
clonal recruitment is not a property of genome-wide AID targeting.
The evidence presented in this article supporting clonal recruitment of AID to the Ig genes has potential implications with
respect to Ab diversity during the affinity-maturation process.
Current thought predisposes B cells with receptors of initial high
affinity to the Ag as having a competitive advantage for expansion
within GCs (54). The addition of clonal recruitment may allow for
the B cells with receptors of lower initial affinity, if able to recruit
AID early, to go through enough rounds of mutation, selection, and
expansion to yield mutated receptors with equal or higher affinity
than the unmutated higher-affinity cells. These lower initial affinity,
but early AID-recruiting, B cells could then be competitive in the
GC environment. Thus, depending on the time of clonal recruitment of AID in B cells, the expected competitive advantage of
B cells with high-affinity receptors may be diminished enough to
allow for a broader spectrum of Abs to be produced from B cells
with a wider range of initial affinities to the Ag.
A limitation of this study is the focus on the promoter region
when analyzing transcription factor binding site associations. At
the Ig locus, enhancer regions were shown to play a role in SHM
that is separate from influencing the rate of transcription (12, 50,
55). Nevertheless, through this approach we identified three important DNA elements that can effectively differentiate group A
and C genes. It is likely that future work incorporating additional
elements in the enhancer regions will improve these predictions
and may also distinguish group B genes, which were not a major
focus of this work. However, there are still many problems to be
worked out in the identification of enhancer boundaries and their
association(s) with particular genes.
In conclusion, we identified several mechanisms that are significantly associated with AID targeting to non-Ig genes, including
polymerase II binding along with the presence of E-boxes, YY1, and
C/EBP-b binding sites. A classification model integrating these
Downloaded from http://www.jimmunol.org/ by guest on June 16, 2017
is also important, and some transcription is required for mutation
accumulation in non-Ig genes, because nonexpressed genes do not
accumulate mutations (1). Indeed, there was a small, but significant, increase in polymerase II binding among genes with the
highest mutation frequencies (group A). However, there was only
a weak correlation at the level of individual genes, and highly
mutated genes could be identified with widely varying levels of
polymerase II binding. Notably, the polymerase II–associated
pause/elongation factor Spt5, a component of the DRB-sensitivity
inducing complex, interacts with AID and with Ig and non-Ig
targets of AID (51). Although our data indicate that total polymerase II levels do not correlate strongly with AID targeting,
a better correlation might be seen upon examination of stalled
polymerase II and Spt5 in GC B cells. A recent study found that
levels of stalled polymerase II, but not Spt5, correlated with AIDmediated mutation of an artificial mutation reporter (52). Overall,
although our analysis identifies several associations that are statistically significant when comparing group A, B, and C genes,
none of these features alone is sufficient to predict the differential
accumulation of mutations by individual genes.
We hypothesized that several mechanisms work together to
promote AID targeting to individual genes. By combining the
features that were each associated with AID targeting, we were able
to construct a classification tree model that separated highly mutating genes (group A) from genes that do not mutate (group C) with
an accuracy of 92%. It is instructive to consider the genes for which
AID targeting is incorrectly predicted. In the training data, there are
five genes that are considered major misclassifications: Pax5 exon
1b, Il21r, Fas, B2m, and Mll1. For the group A genes misclassified
as group C genes (Pax5 exon 1b, Il21r, and Fas) the lack of strong
evolutionary conservation between mouse and human near the TSS
diminishes the ability of the model to accurately predict these
genes as true group A genes. Pax5 exon 1b presents unique challenges to this analysis because the sequencing was focused on an
alternative promoter, which may not be adhering to the same
conventions of the other genes within the group. The two group C
genes were predicted as group A genes for different reasons: B2m
had a high peak of polymerase II near the TSS (max peak = 20),
and Mll1 had a conserved YY1 and E-box that had a very low
colocation score (score = 11), reflecting two sites that were overlapping and located at the TSS of the gene. Thus, many features of
each gene contribute in a small way to AID targeting. In predicting
AID targets, there was not a single reason underlying false positive
and negative predictions, and improvements in accuracy are likely
to come from the inclusion of additional features in the model.
The model was applied genome wide to predict AID targets. We
identified a set of 101 high-interest genes that fulfilled three criteria: they are predicted to be in group A, they are highly expressed
in GC B cells, and they have a known association with cancer.
These targets are significantly enriched for genes identified through
TC-Seq of B cells (43). They also tend to bind higher levels of AID
in an ex vivo model of CSR (44). About 25% of the high-interest
genes are known to undergo translocations in oncogenic settings
as identified through the Cancer Gene Census. Several of these
genes undergo translocations with other genes known to sustain
high levels of AID targeting: Ccnd3 translocates with IgH, Bcl7a
and Btg1 each translocate with Myc, and Foxp1 translocates with
Pax5. Additionally, two genes, Btg1 and Raf1, were previously
identified as sustaining high levels of SHM in the wild-type setting
(1) but were not sequenced in dKO mice. Thus, we have high
confidence in the validity of our model to predict AID-induced
mutations and suggest these high-interest genes as targets for future experiments. Although it is not clear whether a common
mechanism underlies genome-wide targeting of AID and targeting
AID TARGETING IN NON-Ig GENES
The Journal of Immunology
features captures ∼55% of highly targeted (group A) genes with
the benefit of being specific for AID targets. This model was used
to predict AID targeting genome wide, and a set of 101 additional
high-interest AID targets was identified for further experimental
study.
Acknowledgments
We thank Dustin Schones (Department of Cancer Biology, Beckman Research Institute, City of Hope, Duarte, CA), Kairong Cui (Systems Biology
Center, National Heart, Lung, and Blood Institute, National Institutes of
Health, Bethesda, MD), and Keji Zhao (Systems Biology Center, National
Heart, Lung, and Blood Institute, National Institutes of Health) for expertise, sequencing, and read mapping of the RNA polymerase II ChIP-Seq
experiment and Annette Molinaro (Departments of Neurological Surgery
and of Epidemiology and Biostatistics, University of California, San Francisco, San Francisco, CA) for guidance in developing the classification tree
model.
Disclosures
The authors have no financial conflicts of interest.
1. Liu, M., J. L. Duke, D. J. Richter, C. G. Vinuesa, C. C. Goodnow,
S. H. Kleinstein, and D. G. Schatz. 2008. Two levels of protection for the B cell
genome during somatic hypermutation. Nature 451: 841–845.
2. Rada, C., M. R. Ehrenstein, M. S. Neuberger, and C. Milstein. 1998. Hot spot
focusing of somatic hypermutation in MSH2-deficient mice suggests two stages
of mutational targeting. Immunity 9: 135–141.
3. Peled, J. U., F. L. Kuang, M. D. Iglesias-Ussel, S. Roa, S. L. Kalis,
M. F. Goodman, and M. D. Scharff. 2008. The biochemistry of somatic hypermutation. Annu. Rev. Immunol. 26: 481–511.
4. Rada, C., J. M. Di Noia, and M. S. Neuberger. 2004. Mismatch recognition and
uracil excision provide complementary paths to both Ig switching and the A/Tfocused phase of somatic mutation. Mol. Cell 16: 163–171.
5. Pham, P., R. Bransteitter, J. Petruska, and M. F. Goodman. 2003. Processive
AID-catalysed cytosine deamination on single-stranded DNA simulates somatic
hypermutation. Nature 424: 103–107.
6. Bransteitter, R., P. Pham, P. Calabrese, and M. F. Goodman. 2004. Biochemical
analysis of hypermutational targeting by wild type and mutant activationinduced cytidine deaminase. J. Biol. Chem. 279: 51612–51621.
7. Goyenechea, B., N. Klix, J. Yélamos, G. T. Williams, A. Riddell,
M. S. Neuberger, and C. Milstein. 1997. Cells strongly expressing Ig(kappa)
transgenes show clonal recruitment of hypermutation: a role for both MAR and
the enhancers. EMBO J. 16: 3987–3994.
8. Yu, D., M. C. Cook, D.-M. Shin, D. G. Silva, J. Marshall, K.-M. Toellner,
W. L. Havran, P. Caroni, M. P. Cooke, H. C. Morse, et al. 2008. Axon growth and
guidance genes identify T-dependent germinal centre B cells. Immunol. Cell
Biol. 86: 3–14.
9. Anderson, S. M., A. Khalil, M. Uduman, U. Hershberg, Y. Louzoun,
A. M. Haberman, S. H. Kleinstein, and M. J. Shlomchik. 2009. Taking advantage: high-affinity B cells in the germinal center have lower death rates, but
similar rates of division, compared to low-affinity cells. J. Immunol. 183: 7314–
7325.
10. Klein, U., Y. Tu, G. A. Stolovitzky, J. L. Keller, J. Haddad, Jr., V. Miljkovic,
G. Cattoretti, A. Califano, and R. Dalla-Favera. 2003. Transcriptional analysis of
the B cell germinal center reaction. Proc. Natl. Acad. Sci. USA 100: 2639–2644.
11. Alizadeh, A., M. Eisen, R. E. Davis, C. Ma, H. Sabet, T. Tran, J. I. Powell,
L. Yang, G. E. Marti, D. T. Moore, et al. 1999. The lymphochip: a specialized
cDNA microarray for the genomic-scale analysis of gene expression in normal
and malignant lymphocytes. Cold Spring Harb. Symp. Quant. Biol. 64: 71–78.
12. Odegard, V. H., and D. G. Schatz. 2006. Targeting of somatic hypermutation.
Nat. Rev. Immunol. 6: 573–583.
13. Altshuler, D., V. J. Pollara, C. R. Cowles, W. J. Van Etten, J. Baldwin, L. Linton,
and E. S. Lander. 2000. An SNP map of the human genome generated by reduced representation shotgun sequencing. Nature 407: 513–516.
14. Tomayko, M. M., S. M. Anderson, C. E. Brayton, S. Sadanand, N. C. Steinel,
T. W. Behrens, and M. J. Shlomchik. 2008. Systematic comparison of gene
expression between murine memory and naive B cells demonstrates that memory
B cells have unique signaling capabilities. J. Immunol. 181: 27–38.
15. Affymetrix, Inc. 2004. Eukaryotic sample and array processing. In GeneChip
Expression Analysis Technical Manual, 701021 Rev. 5. Affymetrix, Inc., Santa
Clara, CA, 2.1.3–2.3.18.
16. Pruitt, K. D. 2004. NCBI Reference Sequence (RefSeq): a curated non-redundant
sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 33:
D501–D504.
17. Hannum, L. G., A. M. Haberman, S. M. Anderson, and M. J. Shlomchik. 2000.
Germinal center initiation, variable gene region hypermutation, and mutant
B cell selection without detectable immune complexes on follicular dendritic
cells. J. Exp. Med. 192: 931–942.
18. Karolchik, D., A. S. Hinrichs, T. S. Furey, K. M. Roskin, C. W. Sugnet,
D. Haussler, and W. J. Kent. 2004. The UCSC Table Browser data retrieval tool.
Nucleic Acids Res. 32: D493–D496.
19. Shapiro, G. S., K. Aviszus, J. Murphy, and L. J. Wysocki. 2002. Evolution of
Ig DNA sequence to target specific base positions within codons for somatic
hypermutation. J. Immunol. 168: 2302–2306.
20. Venables, W. N., and B. D. Ripley. 2002. Modern Applied Statistics with S, 4th
Ed. Springer, New York.
21. Bolker, B. M. 2008. Ecological Models and Data in R. Princeton University
Press, Princeton, NJ.
22. Blanchette, M., W. J. Kent, C. Riemer, L. Elnitski, A. F. Smit, K. M. Roskin,
R. Baertsch, K. Rosenbloom, H. Clawson, E. D. Green, et al. 2004. Aligning
multiple genomic sequences with the threaded blockset aligner. Genome Res. 14:
708–715.
23. Kent, W. J., C. W. Sugnet, T. S. Furey, K. M. Roskin, T. H. Pringle, A. M. Zahler,
and D. Haussler. 2002. The human genome browser at UCSC. Genome Res. 12:
996–1006.
24. Smit, A. F. A., R. Hubley, and P. Green. 2011. RepeatMasker. Available at: http://
repeatmasker.org. Accessed: March 20, 2011.
25. Kel, A. E., E. Gössling, I. Reuter, E. Cheremushkin, O. V. Kel-Margoulis, and
E. Wingender. 2003. MATCH: A tool for searching transcription factor binding
sites in DNA sequences. Nucleic Acids Res. 31: 3576–3579.
26. Matys, V., E. Fricke, R. Geffers, E. Gössling, M. Haubrock, R. Hehl,
K. Hornischer, D. Karas, A. E. Kel, O. V. Kel-Margoulis, et al. 2003.
TRANSFAC: transcriptional regulation, from patterns to profiles. Nucleic Acids
Res. 31: 374–378.
27. Subramanian, A., P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert,
M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, and
J. P. Mesirov. 2005. Gene set enrichment analysis: A knowledge-based approach
for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA
102: 15545–15550.
28. Strobl, C., A.-L. Boulesteix, A. Zeileis, and T. Hothorn. 2007. Bias in random
forest variable importance measures: illustrations, sources and a solution. BMC
Bioinformatics 8: 25.
29. Strobl, C., J. Malley, and G. Tutz. 2009. An introduction to recursive partitioning: rationale, application, and characteristics of classification and regression
trees, bagging, and random forests. Psychol. Methods 14: 323–348.
30. Therneau, T. M., and E. J. Atkinson. 1997. Mayo Foundation. An introduction to
recursive partitioning using the RPART routines. Available at: http://www.mayo.
edu/hsr/techrpt/61.pdf. Accessed: January 4, 2011.
31. Breiman, L., J. Friedman, C. J. Stone, and R. A. Olshen. 1984. Classification and
Regression Trees. Wadsworth International Group, Belmont, CA.
32. Garcı́a-Martı́nez, J., A. Aranda, and J. E. Pérez-Ortı́n. 2004. Genomic run-on
evaluates transcription rates for all yeast genes and identifies gene regulatory
mechanisms. Mol. Cell 15: 303–313.
33. Uzzell, T., and K. W. Corbin. 1971. Fitting discrete probability distributions to
evolutionary events. Science 172: 1089–1096.
34. Storb, U., E. L. Klotz, J. Hackett, K. Kage, G. Bozek, and T. E. Martin. 1998. A
hypermutable insert in an immunoglobulin transgene contains hotspots of somatic mutation and sequences predicting highly stable structures in the RNA
transcript. J. Exp. Med. 188: 689–698.
35. Zuur, A. F., E. N. Ieno, N. Walker, A. A. Saneliev, and G. M. Smith. 2009. Mixed
Effects Models and Extension in Ecology with R (Statistics for Biology and
Health), 1st Ed. Springer, New York.
36. Lord, D., S. Washington, and J. Ivan. 2005. Poisson, Poisson-gamma and zeroinflated regression models of motor vehicle crashes: balancing statistical fit and
theory. Accid. Anal. Prev. 37: 35–46.
37. Liu, R., R. C. McEachin, and D. J. States. 2003. Computationally identifying
novel NF-kappa B-regulated immune genes in the human genome. Genome Res.
13: 654–661.
38. Tabach, Y., R. Brosh, Y. Buganim, A. Reiner, O. Zuk, A. Yitzhaky,
M. Koudritsky, V. Rotter, and E. Domany. 2007. Wide-scale analysis of human
functional transcription factor binding reveals a strong bias towards the transcription start site. PLoS One 2: e807.
39. Green, M. R., S. Monti, R. Dalla-Favera, L. Pasqualucci, N. C. Walsh, M. SchmidtSupprian, J. L. Kutok, S. J. Rodig, D. S. Neuberg, K. Rajewsky, et al. 2011. Signatures of murine B-cell development implicate Yy1 as a regulator of the germinal
center-specific program. Proc. Natl. Acad. Sci. USA 108: 2873–2878.
40. Blake, J. A., C. J. Bult, J. A. Kadin, J. E. Richardson, and J. T. Eppig. 2011. The
Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics. Nucleic Acids Res. 39: D842–D848.
41. Higgins, M. E., M. Claremont, J. E. Major, C. Sander, and A. E. Lash. 2007.
CancerGenes: a gene selection resource for cancer genome projects. Nucleic
Acids Res. 35: D721–D726.
42. Santarius, T., J. Shipley, D. Brewer, M. R. Stratton, and C. S. Cooper. 2010. A
census of amplified and overexpressed human cancer genes. Nat. Rev. Cancer
10: 59–64.
43. Klein, I. A., W. Resch, M. Jankovic, T. Oliveira, A. Yamane, H. Nakahashi,
M. Di Virgilio, A. Bothmer, A. Nussenzweig, D. F. Robbiani, et al. 2011.
Translocation-capture sequencing reveals the extent and nature of chromosomal
rearrangements in B lymphocytes. Cell 147: 95–106.
44. Yamane, A., W. Resch, N. Kuo, S. Kuchen, Z. Li, H. Sun, D. F. Robbiani,
K. McBride, M. C. Nussenzweig, and R. Casellas. 2011. Deep-sequencing
identification of the genomic targets of the cytidine deaminase AID and its cofactor RPA in B lymphocytes. Nat. Immunol. 12: 62–69.
45. Teng, G., and F. N. Papavasiliou. 2007. Immunoglobulin somatic hypermutation.
Annu. Rev. Genet. 41: 107–120.
Downloaded from http://www.jimmunol.org/ by guest on June 16, 2017
References
3887
3888
46. Michael, N., H. M. Shen, S. Longerich, N. Kim, A. Longacre, and U. Storb.
2003. The E box motif CAGGTG enhances somatic hypermutation without
enhancing transcription. Immunity 19: 235–242.
47. Tanaka, A., H. M. Shen, S. Ratnam, P. Kodgire, and U. Storb. 2010. Attracting
AID to targets of somatic hypermutation. J. Exp. Med. 207: 405–415.
48. Zaprazna, K., and M. L. Atchison. 2012. YY1 controls immunoglobulin class
switch recombination and nuclear activation-induced deaminase levels. Mol.
Cell. Biol. 32: 1542–1554.
49. Cooper, C. L., A. L. Berrier, C. Roman, and K. L. Calame. 1994. Limited expression of C/EBP family proteins during B lymphocyte development. Negative
regulator Ig/EBP predominates early and activator NF-IL-6 is induced later. J.
Immunol. 153: 5049–5058.
50. Kothapalli, N. R., K. M. Collura, D. D. Norton, and S. D. Fugmann. 2011.
Separation of mutational and transcriptional enhancers in Ig genes. J. Immunol.
187: 3247–3255.
AID TARGETING IN NON-Ig GENES
51. Pavri, R., A. Gazumyan, M. Jankovic, M. Di Virgilio, I. Klein, C. AnsarahSobrinho, W. Resch, A. Yamane, B. R. San-Martin, V. Barreto, et al. 2010.
Activation-induced cytidine deaminase targets DNA at sites of RNA polymerase
II stalling by interaction with Spt5. Cell 143: 122–133.
52. Kohler, K. M., J. J. McDonald, J. L. Duke, H. Arakawa, S. Tan, S. H. Kleinstein,
J.-M. Buerstedde, and D. G. Schatz. 2012. Identification of core DNA elements
that target somatic hypermutation. J. Immunol. 189: 5314–5326.
53. Liu, M., and D. G. Schatz. 2009. Balancing AID and DNA repair during somatic
hypermutation. Trends Immunol. 30: 173–181.
54. Shlomchik, M. J., and F. Weisel. 2012. Germinal center selection and the development of memory B and plasma cells. Immunol. Rev. 247: 52–63.
55. Blagodatski, A., V. Batrak, S. Schmidl, U. Schoetz, R. B. Caldwell,
H. Arakawa, and J.-M. Buerstedde. 2009. A cis-acting diversification activator
both necessary and sufficient for AID-mediated hypermutation. PLoS Genet.
5: e1000332.
Downloaded from http://www.jimmunol.org/ by guest on June 16, 2017
© Copyright 2026 Paperzz