MAGICBOX - Automated primer assay design for haploid,
diploid and polyploid species.
Jonathan Curry, Markus Kietzmann, Steve Smith and Susan Kirby
P0367
LGC - Genomics Division, Hoddesdon UK; [email protected]
MAGICBOX Example Assay Improvement
Introduction
With any PCR-based technique, sequence specificity is paramount to obtaining
good, accurate and usable data. Accurate PCR primer design is key and depends upon
the state of the genomic reference sequence information available for the target species. This
information is needed to select target sequence that is either totally unique for the length of the
primer, or different enough from similar but non-target sequences that off-target primer
binding is negligible. Where PCR targets are highly similar to other regions of the genome and
specificity is an issue, non-specific binding can be reduced further by optimising cycling
conditions or buffer components such as Mg2+, or with additives such as DMSO. For large
genotyping projects this is impractical. The uniquely modified Taq polymerase used in our KASP™
genotyping chemistry has a pronounced 3’ complementarity requirement, providing an additional,
integral specificity mechanism that adds accuracy to our genotyping reactions. This is used as
the basis for our SNP allelic discrimination assays, where the two allele-specific primers differ
only in the SNP allele present at the 3’ end of the primer. We have found that the principle is not
confined to the allele-specific detection primers but can also be applied to the reverse common
primer for an extra level of target specificity, which is useful for highly homologous targets. In fact,
genome location specificity is driven by the reverse common primer; only a single unique base is
required to discriminate between two near-identical loci in any given genome, an approach we call
primer ‘anchoring’.
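The idea of anchoring can be illustrated with a small sketch. It is not the Kraken primer picker: it assumes that anchoring means placing the 3’ end of the reverse common primer on the unique base, and the function names, the example sequence and the primer length are all hypothetical.

```python
# Minimal sketch, assuming "anchoring" places the primer's 3' end on the
# unique base. All names and the example sequence are hypothetical.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def anchored_reverse_primer(top_strand, anchor_index, length=20):
    """Build a reverse primer whose 3' base pairs with top_strand[anchor_index].

    The primer is the reverse complement of the top-strand segment that
    starts at the anchor, so the anchor base ends up at the primer's 3' end.
    """
    segment = top_strand[anchor_index:anchor_index + length]
    if len(segment) < length:
        raise ValueError("not enough sequence downstream of the anchor")
    return reverse_complement(segment)


target = "GATTACAGGCTTAAGCCTGACCATGGTCAAGTCC"  # made-up sequence
primer = anchored_reverse_primer(target, anchor_index=10, length=12)
print(primer)  # GGTCAGGCTTAA (3' base A pairs with the T at index 10)
```

A real design would also check melting temperature, primer length limits and secondary structure, none of which are modelled here.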
Unanchored
Workflow – Batch import with BLAST
• Import sequence files with variant plus surrounding flanking sequence.
These can be submitted in either our standard delimited text format or FASTA
• Select BLAST database for the appropriate organism / make custom
database
• Imported sequences are stored by name and submitted to BLAST by
sequence importer
• Reports are passed to BLAST interpreter and matched to the sequence by
name
• Each sequence is analysed by BLAST interpreter as follows:
»» Compares query with each returned hit and creates a
histogram from scoring matching / mismatching bases along
the sequence for each row
»» Preserves the order returned by BLAST and uses this to
analyse each row to compare scores with query and first
returned hit
»» Screens each column for differences based on the histogram
results
»» Alters the score based on differences including gaps and
inserted sequence strings.
»» Once unique base(s) are found the sequence is annotated with
our standard chevron submission format <>
»» Multiple identical sequences are reported as are results where
no anchors are found.
• Annotated sequence is passed to primer picker and assays are designed.
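The column screen described in this workflow can be sketched as follows. This is a simplified illustration, not the MAGICBOX BLAST interpreter: it assumes the query and the off-target hits have already been aligned to equal length with '-' marking gaps, it ignores scoring and hit order, and the function name `find_anchor_columns` is hypothetical.

```python
def find_anchor_columns(query, off_target_hits):
    """Return 0-based query columns whose base differs from every off-target hit.

    Assumption: all strings are pre-aligned to the same length, with '-'
    marking a gap. Query columns that are gaps are skipped (no base to anchor).
    """
    for hit in off_target_hits:
        if len(hit) != len(query):
            raise ValueError("all aligned sequences must have equal length")
    anchors = []
    for col, base in enumerate(query):
        if base == "-":
            continue
        # A candidate anchor must be unique to the query in this column.
        if all(hit[col] != base for hit in off_target_hits):
            anchors.append(col)
    return anchors


query = "ACGTACGTAC"  # made-up target sequence
hits = ["ACGTACCTAC", "ACGTACTTAC", "ACGAACATAC"]  # made-up near-identical loci
print(find_anchor_columns(query, hits))  # [6]
```

Column 3 differs in only one of the three hits, so it is not reported; column 6 differs in all of them and is the only candidate anchor.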
Workflow – Batch import with prior BLAST submission
• BLAST analysis and design can be performed prior to sequence
submission using any available BLAST database:
»» NCBI
»» Online Resource
»» Output from CLOUD_BLAST implementation
»» Use the “-outfmt 1” BLAST command or reformat output as
query-anchored, showing identities, no gaps in query
»» Private resources – there is no need to disclose large genomic
data set other than the alignment files which are easily stored
and shared
• Files are combined and passed to BLAST interpreter and anchor base(s)
annotated
• Assays are designed using primer picker
• This is a simple way of allowing large processing jobs to be performed
using off-site public or pay-for computing resources such as Amazon
Elastic Compute Cloud (EC2)
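As a rough sketch of producing such alignment files outside Kraken, the BLAST+ `blastn` program can be called with `-outfmt 1` and `-task blastn` (traditional BLASTN rather than the default MegaBLAST). The helper name and the file and database names below are hypothetical, and the command is only assembled here, not executed.

```python
import shlex


def build_blastn_command(query_fasta, database, output_path):
    """Assemble a blastn call that writes a query-anchored alignment report.

    Assumption: NCBI BLAST+ is installed and `database` was built beforehand.
    """
    return [
        "blastn",
        "-task", "blastn",   # traditional BLASTN, not the default megablast
        "-query", query_fasta,
        "-db", database,
        "-outfmt", "1",      # query-anchored, showing identities
        "-out", output_path,
    ]


# Hypothetical file and database names, for illustration only.
cmd = build_blastn_command("markers.fasta", "wheat_contigs", "markers_aln.txt")
print(shlex.join(cmd))
# To execute: subprocess.run(cmd, check=True)
```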
Any Genome
• Sequences can be derived from any source and built into a database
using the blastdb functions
• Databases are easily created and therefore easily updated
• Create databases with early stage contig or scaffold assemblies, or map
and GBS data to improve assays
• Run on standard 32- or 64-bit personal computers / connect to servers
containing data and BLAST executables
• On a 32-bit Windows PC with 4 GB of RAM, large projects can be screened for
design prior to development: on average 15 seconds for BLAST analysis
and design, or around 1 second for BLAST interpretation.
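Building a custom nucleotide database from any FASTA file might look like the sketch below, which assembles a call to the BLAST+ `makeblastdb` program. The helper name and the file and database names are hypothetical; rebuilding after an assembly update is simply a matter of running the same command on the new file.

```python
def build_makeblastdb_command(fasta_path, db_name):
    """Assemble a makeblastdb call for a nucleotide database.

    Assumption: NCBI BLAST+ is installed and on the PATH.
    """
    return [
        "makeblastdb",
        "-in", fasta_path,
        "-dbtype", "nucl",   # nucleotide database
        "-out", db_name,
        "-title", db_name,
    ]


# Hypothetical early-stage contig assembly, for illustration only.
db_cmd = build_makeblastdb_command("contigs_v1.fasta", "contigs_v1")
print(" ".join(db_cmd))
# To execute: subprocess.run(db_cmd, check=True)
```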
Anchored - HCAR2 / rs590447
Figure 2. Example of closely related sequences in human: HCAR2/3. Presenting the correct target sequence to MAGICBOX allows the
algorithm to detect the correct sequence and identify anchors. Plot A shows data from the unanchored assay, which detects alleles at
both HCAR2 and HCAR3, giving data with an incorrect frequency. Plot B shows data when the HCAR3 assay is anchored to the A base (shown at
the position below the pink chevrons), which gives the correct genotyping for rs55718746. Plot C shows data from an assay for HCAR2
anchored to the unique G base, giving the correct genotyping for rs590447.
Unanchored
In order to find unique bases for anchoring, a screen of available genomic sequence resources is
required using BLASTN (rather than MegaBLAST (1) or BLAT (2), which are too stringent). The key
is to find as much divergent but potentially similar sequence as possible in order to find base(s)
unique to the target locus and so enhance target amplification specificity. Performing this manually,
especially for the large numbers of markers we design and the diversity of organisms we
genotype, is a time-consuming yet hugely beneficial exercise. An automated software solution
exists for hexaploid wheat but uses database files that are not easily updated (3). To overcome this
we have created MAGICBOX, an automated pipeline extension for Kraken™, our LIMS assay design and
genotyping software, to perform and interpret BLAST analysis for any available
genome. It communicates directly with easily installed BLAST executables and databases, and can
also accept publicly or privately obtained BLAST alignment files. Here we outline the
MAGICBOX pipeline and how it is accessed, illustrating the technology with assay data from
different organisms to highlight its effectiveness.
MAGICBOX - Automated primer design
Anchored - HCAR3 / rs55718746
Anchored - 2BS
Figure 3. An example from hexaploid wheat showing homology on the same chromosome and across other genomes. The target is
2BS for SNP BS00003365. Plot A shows data from the unanchored assay, which produces off-target amplification. Plot B shows data produced
by the anchored assay after processing through MAGICBOX.
Conclusions
We designed and created a simple way of
performing automated analysis for the design
of target-specific KASP assays. Normally
this would need to be done manually in order
to ensure specificity when ordering assays
for use by customers in their own lab, or for
running in our genotyping service laboratories.
Figure 1: MAGICBOX example workflow (panels Ai, Aii, B and C)
Making use of BLAST, with its flexibility,
familiarity to the whole genomics community,
ubiquity and large, easily
accessible data, made the choice very simple.
The MAGICBOX system is part of Kraken
but will be accessible using BLAST output
and sequence files (either text or FASTA) as the
universal currency. As publicly available
BLAST resources can be used to improve
assays, local installations of the software
and database are not necessary to submit
sequence for design. Updating the databases
is simple; they can be created / downloaded
as needed. Because BLAST interpreter
recognises the format rather than a specific
file type, problems of software
updating/obsolescence have been minimised.
The power of identifying and using unique
bases to design and anchor primers was
illustrated with before and after data. We
chose examples from diploid human and
hexaploid wheat to illustrate that the algorithm
can be applied to any organism that has
sequence information. The specificity of KASP
is again highlighted by the selection of different
loci in the human example, and the genotyping
plots show the change in frequency for their
different positions. For wheat, one of the most
important aspects is being able to correctly
select the target locus and discriminate against
highly similar homeologues. The above example is
one of a set of 65 assays we assessed using
MAGICBOX.
To expand on the in silico – in vitro translation
of MAGICBOX we present further data in
poster P0368 (to your right).
Figure 1 illustrates a typical three-stage workflow, which includes an importer for either sequence alone, or
sequence plus BLAST analysis output (Ai and Aii).
• BLAST importer collates all the data into the correct format and runs the BLAST command line scripts.
• These can be in any location either locally on a PC, or on a networked server.
• BLAST interpreter matches the sequence to the alignment, reads the alignment to identify any unique
bases for anchoring, and annotates using our <> convention (B).
• The sequence is then annotated to identify the base(s) that are potential anchoring points (C). This is
passed to our primer picker software in Kraken for assay design.
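The annotation step (B to C) can be pictured with the short sketch below. The exact layout of the <> submission format is not described here, so wrapping each anchor base in chevrons is purely an assumption for illustration, and the function name `annotate_anchors` is hypothetical.

```python
def annotate_anchors(sequence, anchor_positions):
    """Wrap each anchor base in chevrons, e.g. ACG<T>AC.

    Assumption: the real <> convention may differ; this only illustrates
    marking candidate anchor bases in a sequence string.
    """
    marked = set(anchor_positions)
    for pos in marked:
        if not 0 <= pos < len(sequence):
            raise IndexError(f"anchor position {pos} is outside the sequence")
    return "".join(
        f"<{base}>" if i in marked else base
        for i, base in enumerate(sequence)
    )


print(annotate_anchors("ACGTACGTAC", [6]))  # ACGTAC<G>TAC
```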
1. J. Curry, M. Kietzmann et al. MAGIC BOX – Automated Primer Assay Design for Haploid, Diploid and Polyploid Species. Poster P0367. PAG 2016.
2. Mol Breeding (2009) 23:13-22. doi: 10.1007/s11032-008-9209-z
3. Mol Genet Genomics. 2015 Apr;290(2):531-44. doi: 10.1007/s00438-014-0933-2
No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording or any retrieval system,
without the written permission of the copyright holder. © LGC Limited, 2016. All rights reserved. FSo/0116