SEQUENCE SIMILARITY SEARCHING

Kristi Holmes, PhD
[email protected]
March 15, 2010
Information directories
Nucleic Acids Research Database Issue
The 2010 Nucleic Acids Research Database Issue and online Database Collection: a community of data resources. Cochrane
GR, Galperin MY. Nucleic Acids Res. 2010 Jan;38(Database issue):D1-4. Epub 2009 Dec 3. PMID: 19965766
Complete table of contents for the NAR database issue (Tip: to see the table of contents from the database issue for a
previous year, just reduce the volume number in the URL (to the complete table of contents) by one.)
Searchable database of summary papers
Nucleic Acids Research Web Server Issue
2009 Web Server complete table of contents
Searchable database of web server summaries
Nucleic Acids Research Methods index
Bioinformatics Links Directory (described in an NAR article, July 2007 web server issue)
ExPASy Life Science Directory
>1000 links on a single page, organized by category
BioMed Central Databases collection
Biocatalog by EBI
database providing summary and access information for a wide range of molecular biology databases and software; browse
category of interest or search complete db with EMBL SRS server
Online Bioinformatics Resources Collection (OBRC) from the Health Sciences Library System, University
of Pittsburgh.
-- From NAWBIS Information Hubs for Molecular Biology Databases and Software, Renata Geer
Tutorials
BioInformatics Tutorials Series (BITS) from
Countway Library of Medicine at Harvard
and the Engineering and Science Libraries at
MIT by Paul Bain, Courtney Crummett and
David Osterbur
Comparative Genomics, Volume 1 (Nicholas
Bergman, editor) – available on the NCBI
Bookshelf
9 BLAST QuickStart: Example-Driven Web-Based BLAST
Tutorial David Wheeler and Medha Bhagwat
10 PSI-BLAST Tutorial Medha Bhagwat and L. Aravind
EBI Tutorials: BLAST similarity search
Tutorials from Nicola Gaedeke at Max Planck
- B1: Nucleotide BLAST B2: Protein BLAST
OpenWetWare BLAST tutorial
Digitalworldbiology.com BLAST tutorial
introduction
information
tutorial
guide
PSI-BLAST tutorial
more information
similarity searching
rules of thumb
glossary
reference list
BIT 2.1: BLAST Link (7:24)
BIT 2.2: Do I Need BLAST? The Use of the Related
Sequences Tool (6:53)
BIT 2.3: Nucleotide BLAST (5:46)
BIT 2.4: Nucleotide BLAST: Algorithm Comparison
(6:14)
NCBI Tutorials: BLAST information guide
What is BLAST?
Basic Local Alignment Search Tool
• A sequence comparison algorithm optimized for speed used
to search sequence databases for optimal local alignments to
a query.
• query is a DNA or protein sequence, not a text term (query
sequence in FASTA Format)
• character string comparison against all the sequences in the
target database
• rigorous statistics used to identify statistically significant
matches
• STATISTICS OF SEQUENCE SIMILARITY SEARCHING
Why BLAST?
Global vs. Local vs. Local vs. Local Alignments
• Needleman/Wunsch – Finds the best Global alignment between any two sequences.
– CPU and time intensive.
– Often misses domain and/or motif alignments in sequences.
• Smith/Waterman – An extension of Needleman/Wunsch that compares segments of all possible
lengths (Local) between two sequences to maximize alignment.
– Very sensitive search.
– CPU and time intensive.
• FASTA – Local alignment. Uses a lookup table to increase speed. Sensitivity and speed are
determined by the size of the "word" used for the initial lookup table.
– Sensitive and fast.
• BLAST – Local alignment.
– Fairly sensitive search.
– Very fast.
A good comparison of how these algorithms function can be found at Aoife McLysaght's web site
under Tutorials and guides (http://www.gen.tcd.ie/molevol/)
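To make the global vs. local distinction concrete, here is a minimal Python sketch of the two dynamic-programming scores. The scoring values (match +2, mismatch -1, linear gap -2) and the function names are illustrative assumptions, not the parameters of any particular program. Only the score is computed; no traceback is done.

```python
def needleman_wunsch_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best GLOBAL alignment score: both sequences are aligned end to end."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):          # leading gaps are penalized
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            F[i][j] = max(diag, F[i - 1][j] + gap, F[i][j - 1] + gap)
    return F[-1][-1]


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best LOCAL alignment score: any pair of segments may be aligned."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The floor at 0 lets an alignment restart anywhere.
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best


a = "CCCGATTACACCC"
b = "TTTGATTACATTT"
print(needleman_wunsch_score(a, b))  # whole sequences, flanks included
print(smith_waterman_score(a, b))    # only the shared GATTACA segment
```

The local score is higher here because the unrelated flanking residues are simply left out of the alignment, which is the behavior wanted when searching for shared domains or motifs.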
Relevant Citations
Needleman, S.B. and Wunsch, C.D. A general method applicable to the search for similarities in the amino acid sequence of
two proteins. J Mol Biol. 48(3):443-53 (1970).
Smith, T.F. and Waterman, M.S. Identification of common molecular subsequences. J Mol Biol. 147(1):195-7 (1981).
Pearson, W.R. and Lipman, D.J. Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A. 85(8):2444-8
(1988).
Altschul, S.F. et al. Basic local alignment search tool. J Mol Biol. 215(3):403-10 (1990).
What is BLAST?
(cont.)
The initial search is done for a
word of length “W” that scores
at least “T” when compared to
the query using a substitution
matrix.
Word hits are then extended in
either direction in an attempt to
generate an alignment with a
score exceeding the threshold
of “S”. The “T” parameter
dictates the speed and
sensitivity of the search.
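The word-hit-then-extend idea can be sketched in a few lines of Python. This is a simplified illustration, not the BLAST implementation: it uses exact word matches (as in nucleotide searches) rather than neighborhood words scoring at least T, it extends without gaps, and the names w, x_drop and s_cutoff are assumptions chosen for the example.

```python
def extend_one_direction(query, subject, qi, si, step, match, mismatch, x_drop):
    """Walk away from the seed; return (best score gain, residues kept)."""
    gain = best_gain = 0
    length = kept = 0
    while 0 <= qi < len(query) and 0 <= si < len(subject):
        gain += match if query[qi] == subject[si] else mismatch
        length += 1
        if gain > best_gain:
            best_gain, kept = gain, length
        elif best_gain - gain > x_drop:
            break  # score fell too far below the best seen: stop extending
        qi += step
        si += step
    return best_gain, kept


def seed_and_extend(query, subject, w=4, match=1, mismatch=-3, x_drop=5, s_cutoff=6):
    """Find exact word hits of length w, extend them, keep segments scoring >= s_cutoff."""
    # Lookup table of every word of length w in the subject.
    index = {}
    for si in range(len(subject) - w + 1):
        index.setdefault(subject[si:si + w], []).append(si)

    hits = set()
    for qi in range(len(query) - w + 1):
        for si in index.get(query[qi:qi + w], []):
            left_gain, left_len = extend_one_direction(
                query, subject, qi - 1, si - 1, -1, match, mismatch, x_drop)
            right_gain, right_len = extend_one_direction(
                query, subject, qi + w, si + w, +1, match, mismatch, x_drop)
            score = w * match + left_gain + right_gain
            if score >= s_cutoff:
                hits.add((score, qi - left_len, si - left_len,
                          left_len + w + right_len))
    # Each hit is (score, query start, subject start, length).
    return sorted(hits, reverse=True)


print(seed_and_extend("GGGATTACAGGG", "CCCATTACACCC"))
```

A shorter word length finds more seeds (more sensitive, slower); a longer one finds fewer (faster, less sensitive), which mirrors the speed and sensitivity trade-off described above.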
PAM and BLOSUM:
The Matrices of Choice
PAM
Percent Accepted Mutation. A unit introduced by Dayhoff et al. to quantify the amount
of evolutionary change in a protein sequence. 1.0 PAM unit is the amount of evolution
which will change, on average, 1% of amino acids in a protein sequence. A PAM(x)
substitution matrix is a look-up table in which scores for each amino acid substitution
have been calculated based on the frequency of that substitution in closely related
proteins that have experienced a certain amount (x) of evolutionary divergence.
[Taken from the NCBI Glossary – PAM definition].
BLOSUM
Blocks Substitution Matrix. A substitution matrix in which scores for each position are
derived from observations of the frequencies of substitutions in blocks of local
alignments in related proteins. Each matrix is tailored to a particular evolutionary
distance. In the BLOSUM62 matrix, for example, the alignment from which scores were
derived was created using sequences sharing no more than 62% identity. Sequences
more identical than 62% are represented by a single sequence in the alignment so as to
avoid over-weighting closely related family members. (Henikoff and Henikoff)
[Taken from the NCBI Glossary – BLOSUM definition].
An amino acid scoring matrix: BLOSUM62
www.acm.org
Differences between PAM and
BLOSUM
• BLOSUM matrices tend to be more sensitive to distant relationships than PAM.
• BLOSUM tends to give higher scores to substitutions involving hydrophilic amino acids
and lower scores to substitutions involving hydrophobic amino acids than PAM.
• Substitutions of rare amino acids are more tolerated by BLOSUM.
• The differences in derivation lead to some general rules:
– Use higher PAM or lower BLOSUM matrices for more divergent sequences
– Use lower PAM or higher BLOSUM matrices for more closely related sequences
• In tests, BLOSUM 62 performed better than PAM 80, 100, 120, 140, 160, 200 and 250, so BLOSUM is generally preferred.
• Though BLOSUM performed better than PAM in these performance tests, there were
instances in each trial where PAM detected significant similarities that BLOSUM did not.
• It is generally better to perform multiple initial searches with varied
parameters.
Scoring About Nucleic Acids
• The scoring matrix for nucleic acid searches is much simpler than the substitution
matrices for proteins. It is based upon the assumption that all positions are equally
mutable and that all substitutions occur at an equal frequency.
– The assumption that all point mutations occur at equal frequencies is not true. The rate of
transition mutations (purine to purine or pyrimidine to pyrimidine) is approximately 1.5-5X
that of transversion mutations (purine to pyrimidine or vice-versa) in all genomes where it has
been measured (see e.g. Wakeley, Mol Biol Evol 11(3):436-42, 1994). This is another good
reason to use protein BLAST rather than nucleic acid BLAST searches if at all possible. Matrices
that take this mutation frequency bias into account have been constructed (States et al.
Methods 3(1):66-70, 1991) and may be used in other similarity searching algorithms.
• The matrix used for BLASTn is an identity matrix – any substitutions result in a negative
score at the changed position.
– An identity results in a score of +1 at that position.
– A mismatch results in a score of -3 at that position.
• Degeneracy in the code can also cause problems for BLAST searches of nucleic acids.
– Serine, for example, is encoded by six codons in two families (UCN and AGU/AGC), so two
serine codons such as UCU and AGC share no bases. This further
reduces the sensitivity of BLAST searches using protein coding nucleotide sequences. If at all
possible use the protein sequence of your gene for BLAST searches.
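A short Python sketch of the +1/-3 identity scoring described above, applied to the two serine codons. The function names are hypothetical; the transition/transversion helper is included only to illustrate the distinction that the identity matrix ignores.

```python
PURINES = {"A", "G"}  # C, T and U are pyrimidines


def identity_score(seq_a, seq_b, reward=1, penalty=-3):
    """Score an ungapped nucleotide alignment: +1 per identity, -3 per mismatch."""
    if len(seq_a) != len(seq_b):
        raise ValueError("ungapped scoring needs sequences of equal length")
    return sum(reward if x == y else penalty for x, y in zip(seq_a, seq_b))


def is_transition(x, y):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return x != y and (x in PURINES) == (y in PURINES)


# Two codons for the same amino acid (serine) that share no bases:
print(identity_score("UCU", "AGC"))   # three mismatches
# The identity matrix penalizes a transition and a transversion equally:
print(is_transition("A", "G"), is_transition("A", "C"))
```

At the protein level both codons are simply serine, which is one reason a protein search can detect relationships that a nucleotide search misses.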
Mind the Gap
Life does not evolve by point mutation alone
What about the Gaps?
Local Alignment vs. Global Alignment
• If alignment is local, why allow gaps?
– Stretches of similarity will not be broken.
– Alignments will be more accurate.
– Alignments should reflect biological relationships more closely.
• The scoring for gaps in BLAST must reflect the fact that such mutations are not infrequent and
may be relatively long but must also take into account that introduction of gaps could
introduce errors in alignments. Gap penalties in BLAST reflect these facts by having large
existence (establishment) penalties and smaller extension penalties. This is one of the areas of
BLASTp where a set of standard choices is easily available.
Allowable Gap Values
What else about
the Gaps?
Large gap penalties for gap existence
and smaller extension penalties. The
rationale for this is that insertion and
deletion events are rare, but, when they
do occur, several adjacent residues may
be involved.
Affine Gap Penalty: G + Ln
G: Gap opening penalty
L: Gap extension penalty
n: Length of gap
if G = -5, L = -2, n = 6, Then total
penalty for this gap = -5+(-2)*6 = -17
Selection of gap parameters is
highly empirical.
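The affine penalty above is a one-line calculation. The sketch below follows the slide's convention (G + L*n, with both penalties given as negative numbers); note, as an aside, that some programs instead charge the extension penalty only for residues after the first.

```python
def affine_gap_penalty(n, gap_open=-5, gap_extend=-2):
    """Total penalty for a gap of length n under the G + L*n convention."""
    if n <= 0:
        return 0  # no gap, no penalty
    return gap_open + gap_extend * n


# The worked example from the slide: G = -5, L = -2, n = 6
print(affine_gap_penalty(6))

# One long gap costs less than several short gaps of the same total length:
print(affine_gap_penalty(6), 3 * affine_gap_penalty(2))
```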
Factors that influence the statistical
parameters of BLAST
1. The length of the segments being compared
2. The size of the search space (database)
3. The accuracy and biological significance of BLAST results are
dependent on the statistical veracity of the substitution
matrices used
4. Gaps – there is no general theory capable of fully
accounting for gap statistics.
5. Low complexity regions can play havoc with alignment stats.
BLAST filters out these regions by default.
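To give a feel for what "low complexity" means, here is a rough Python sketch that masks windows with low Shannon entropy. This is an illustrative stand-in only: it is not the filtering algorithm BLAST uses, and the window size, threshold and function names are assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    total = len(window)
    counts = Counter(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def mask_low_complexity(seq, window=12, threshold=1.0, mask_char="X"):
    """Replace every residue that falls in a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        if shannon_entropy(seq[start:start + window]) < threshold:
            for k in range(start, start + window):
                masked[k] = mask_char
    return "".join(masked)


# A run of serines is compositionally biased and gets masked:
print(mask_low_complexity("MKVLA" + "S" * 12 + "TWERQ"))
```

Masked residues appear as X in the query, which matches how filtered regions are displayed in protein BLAST alignments.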
Steps of BLAST
Step 1. Choose the program to use and the database to
search.
Step 2. Input the data.
Step 3. Set the program options or choose defaults.
Step 4. Set the output formatting options
Step 5. Perform the search
Sequence format
Sequence format: Identifiers,
FASTA, and Bare Sequence
1. Identifiers:
If you know the Accession number or the GI of a
sequence in GenBank, you can use this as the query
sequence in a BLAST search.
– Accession numbers, accession version, or NCBI
sequence identifiers
– No spaces allowed or input will be treated as
sequence data
– If you DO enter an improper format, you’ll receive
an error message to that effect…
Sequence format, continued
2. FASTA format
A sequence in FASTA format begins with a single-line description, followed by lines
of sequence data. The description line is distinguished from the sequence
data by a greater-than (">") symbol in the first column. It is recommended
that all lines of text be shorter than 80 characters in length.
An example sequence in FASTA format is:
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSE
NRTQIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLR
LRQAWCHFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANL
WFNCHGEFFYCKMDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTY
VACHIRSVIIWLETISKKTYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESI
WAAELDRYKLVEITPIGFAPTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXX
XXXXVQSQHLLAGILQQQKNLLAAVEAQQQMLKLTIWGVK
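A FASTA record like the one above is easy to read programmatically. The following is a minimal Python sketch (the function name parse_fasta is an assumption for this example); it splits text into (description, sequence) pairs and joins the wrapped sequence lines.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # ignore blank lines
        if line.startswith(">"):          # ">" in the first column starts a record
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence data may span many lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


example = """>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSE
NRTQIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLR
"""
for description, sequence in parse_fasta(example):
    print(description, len(sequence))
```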
Sequence format, continued
3. Bare Sequence
Sequence that has been cut and pasted into the search box
– The sequence contains no FASTA definition line.
– Sequence data should contain no blank lines.
– Sequence data may be in the format as seen in GenBank sequence reports with
interspersed numbers and spaces.
– Sequence data should correspond to the standard IUB/IUPAC codes for amino
acids and nucleotides.
Raw Sequence
mkaliilglvllsvtvqgkifercelartlkklgldgykgvslanwvclakwesgyntdatnynpgdestdygi
fqinsrywcngktpgavnachiscnallqnniadavacakrvvsdpqgirawvawkkhcqnrdvsqyv
egcgv
GenBank Format
1 mkaliilglv llsvtvqgki fercelartl kklgldgykg vslanwvcla kwesgyntda
61 tnnpgdest dygifqinsr ywcnngktpg avnachiscn allqniada vacakrvvsd
121 pqgirawvaw kkhcqnrdvs gvvegcgv
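If you want to tidy a GenBank-style block yourself before pasting it elsewhere, stripping the position numbers and spaces is a one-liner. A minimal Python sketch follows (the function name is hypothetical, and it keeps letters only, so symbols such as * or - would also be dropped):

```python
def clean_bare_sequence(text):
    """Remove position numbers, spaces and line breaks; return uppercase residues."""
    return "".join(ch for ch in text if ch.isalpha()).upper()


genbank_style = """
        1 mkaliilglv llsvtvqgki
       21 fercelartl kklgldgykg
"""
print(clean_bare_sequence(genbank_style))
```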
Short discussion of BLAST output
before we continue…
A report is generated as a result of performing BLAST
1. Header – contains info about the query sequence
(graphical overview on the web)
2. One-line descriptions of each database sequence found to
match the query sequence; these provide a quick
overview for browsing
3. The alignments for each database sequence
matched
The Header
The BLAST report header. The top line gives information about the type of program (in this case,
BLASTP), the version (2.2.1), and a version release date. The research paper that describes BLAST is
then cited, followed by the request ID (issued by QBLAST), the query sequence definition line, and a
summary of the database searched. The Taxonomy reports link displays this BLAST result on the
basis of information in the Taxonomy database
The Header – graphical depiction
generated from web-based BLAST
The query sequence is represented by the numbered red bar at the top of the figure.
Database hits are shown aligned to the query, below the red bar. Of the aligned
sequences, the most similar are shown closest to the query. In this case, there are three
high-scoring database matches that align to most of the query sequence. The next
twelve bars represent lower-scoring matches that align to two regions of the query, from
about residues 3–60 and residues 220–500. The cross-hatched parts of these bars
indicate that the two regions of similarity are on the same protein, but that this
intervening region does not match. The remaining bars show lower-scoring alignments.
Mousing over a bar displays the definition line for that sequence in the
window above the graphic.
One-line descriptions in the BLAST report
Each line is composed of four fields:
(a) the gi number, database designation, Accession
number, and locus name for the matched
sequence, separated by vertical bars
(b) a brief textual description of the sequence, the
definition. This usually includes information on the
organism from which the sequence was derived,
the type of sequence (e.g., mRNA or DNA), and
some information about function or phenotype.
The definition line is often truncated in the one-line
descriptions to keep the display compact;
(c) the alignment score in bits. Higher scoring hits are
found at the top of the list; and
(d) the E-value, which provides an estimate of
statistical significance.
For the first hit in the list, the gi number is 116365, the
database designation is sp (for SWISS-PROT), the
Accession number is P26374, the locus name is
RAE2_HUMAN, the definition line is Rab proteins,
the score is 1216, and the E-value is 0.0. Note that
the first 17 hits have very low E-values (much less
than 1) and are either RAB proteins or GDP
dissociation inhibitors. The other database matches
have much higher E-values, 0.5 and above, which
means that these sequences may have been
matched by chance alone.
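The relationship between the bit score and the E-value can be written as E = m * n * 2^(-S'), where S' is the bit score and m and n are the query and database lengths. The Python sketch below applies that formula directly. It is a simplification: BLAST uses effective (edge-corrected) lengths, and the lengths used here are made-up numbers for illustration.

```python
def expect_value(bit_score, query_len, db_len):
    """Expected number of chance hits scoring at least bit_score."""
    return query_len * db_len * 2.0 ** (-bit_score)


query_len = 450            # hypothetical query length
db_len = 1_000_000_000     # hypothetical total residues in the database

for bits in (30, 40, 60, 100):
    print(bits, expect_value(bits, query_len, db_len))

# The same bit score is less significant in a larger search space:
print(expect_value(40, query_len, db_len) < expect_value(40, query_len, 10 * db_len))
```

Each additional bit halves the E-value, so a very high bit score gives an E-value too small to display, which is why the top hits are reported with an E-value of 0.0.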
The Alignments
The alignment is preceded by the sequence identifier, the full definition line, and the length of the matched sequence, in amino acids. Next
comes the bit score (the raw score is in parentheses) and then the E-value. The following line contains information on the number of
identical residues in this alignment (Identities), the number of conservative substitutions (Positives), and if applicable, the number of gaps
in the alignment. Finally, the actual alignment is shown, with the query on top, and the database match is labeled as Sbjct, below. The
numbers at left and right refer to the position in the amino acid sequence. One or more dashes (–) within a sequence indicate insertions or
deletions. Amino acid residues in the query sequence that have been masked because of low complexity are replaced by Xs (see, for
example, the fourth and last blocks). The line between the two sequences indicates the similarities between the sequences. If the query
and the subject have the same amino acid at a given location, the residue itself is shown. Conservative substitutions, as judged by the
substitution matrix, are indicated with +.
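The Identities and Gaps figures in the alignment header can be recomputed from the two aligned rows. A minimal Python sketch, assuming the rows are equal-length strings with "-" for gaps (Positives are omitted because they require the substitution matrix):

```python
def alignment_stats(query_row, subject_row):
    """Return (identities, gaps, alignment length) for two aligned rows."""
    if len(query_row) != len(subject_row):
        raise ValueError("aligned rows must have the same length")
    pairs = list(zip(query_row, subject_row))
    identities = sum(1 for q, s in pairs if q == s and q != "-")
    gaps = sum(1 for q, s in pairs if q == "-" or s == "-")
    return identities, gaps, len(pairs)


identities, gaps, length = alignment_stats("MKV-LA", "MKVQLS")
print(f"Identities = {identities}/{length}, Gaps = {gaps}/{length}")
```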
Let’s take a look at the BLAST
website…
BLAST (http://www.ncbi.nlm.nih.gov/BLAST/)
1. BLAST help
2. BLAST assembled genomes
3. Basic BLASTs (BLASTn, BLASTp, BLASTx, tBLASTn,
tBLASTx)
4. Specialized BLAST (search trace archives, find
conserved domains and conserved domain
architecture, GEO profiles, immunoglobulins,
SNPs, align two sequences, and screen for vector
contamination)
BLAST help
• NCBI BLAST documentation
• BLAST tips
BLAST assembled genomes
There are a couple of ways to get to the genome
information that you need…
BLAST homepage
NCBI Map Viewer
Basic BLASTs: BLASTn
Search a nucleotide database using a nucleotide query
Algorithms (see BLASTn page)
• Blastn (somewhat similar sequences): slow, but allows a wordsize down to seven bases.
• Megablast (highly similar sequences): intended for comparing a
query to closely related sequences and works best if the target
percent identity is 95% or more but is very fast.
• Discontinuous megablast (more dissimilar sequences): uses an
initial seed that ignores some bases (allowing mismatches) and is
intended for cross-species comparisons.
Basic BLASTs: BLASTp
Search protein databases using a protein query
Algorithms (see BLASTp page)
• BLASTp: simply compares a protein query to a protein
database.
• PSI-BLAST: allows the user to build a PSSM (position-specific scoring matrix) using the results of the first
BlastP run.
• PHI-BLAST: performs the search but limits alignments to
those that match a pattern in the query.
Basic BLASTs: BLASTp
Sample User Question
I am interested in studying the evolution of a protein sequence (Chaperone protein dnaK (Heat
shock protein 70) Accession Number NP_212652). I would like to find the closest matches in
higher animals. What is the easiest way to do this?
Analysis/Comments
You could run a BLAST search for the protein, then look at the results and choose the metazoan
representatives. An easier method is to use BLink and eliminate all the taxonomic groups except
the one of interest.
Flow Chart
• NP_212652 Entrez Protein Record - Look at record and access BLink from here
• NP_212652 BLink Record
Step By Step Guide
Entrez Protein
• Find Entrez Protein entry for appropriate protein
• Examine the record and be sure it is the appropriate entry
• Click the BLink Link
BLink
• Click the underlined name of each taxonomic group that you would like to eliminate from the
list or click in the colored box preceding the group that you would like to select.
• BLink will remove all groups but the one chosen.
PSI-BLAST
&
PHI-BLAST
PSI-BLAST
It works in two major steps:
1. With a scoring matrix (BLOSUM62) and a query, a database is searched
with a standard BLAST
2. With the set of matches from the search, a PSSM is constructed by
aligning the matches to the query. Then the next round of database
search is performed with the PSSM instead of the standard scoring matrix.
• The Position-Specific Iterated BLAST (PSI-BLAST) program performs iterative searches
with a protein query, in which sequences found in one round of search are used to
build a custom score model for the next round.
To run this search, "PSI-BLAST" must be checked.
• Inclusion Threshold
– This sets the statistical significance threshold for including a sequence in the model used by
PSI-BLAST to create the PSSM on the next iteration.
PSI-BLAST
• PSI-BLAST is the most sensitive BLAST program, making it useful for finding very distantly related
proteins or new members of a protein family. Use PSI-BLAST when your standard protein-protein
BLAST search either failed to find significant hits, or returned hits with descriptions such as
"hypothetical protein" or "similar to...".
• The first round of PSI-BLAST is a standard protein-protein BLAST search. The program builds a
position-specific scoring matrix (PSSM or profile) from a multiple alignment of the sequences
returned with Expect values better (lower) than the inclusion threshold (default=0.005).
• The PSSM will be used to evaluate the alignment in the next iteration of search. Any new database
hits below the inclusion threshold are included in the construction of the new PSSM.
• A PSI-BLAST search is said to have converged when no more matches to new database sequences are
found in subsequent iterations. You can add database hits that fall outside the inclusion threshold to
your PSSM for the next round by checking the box next to the hit. Already selected hits can also be
removed from the selection by unchecking the checkbox.
• A PSSM is query specific. You can save a PSSM created during a PSI-BLAST search of one database and
use it to search a different database with the same query. To do this, click on “reformat these results”
(top of results page) and change "Alignment" to "PSSM" in a pull-down menu (at any iteration after
the first). Then format the search, copy the resulting ASCII-encoded PSSM and paste it into the PSSM
window (located in the “algorithm parameters” section of a new PSI-BLAST search page).
Position Specific Score Matrices
(PSSM)
• The matrix is constructed from a multiple alignment of the highest scoring
hits in the original BLAST search.
• Substitutions of the same amino acid are scored differently, depending on
position
• Amino acids in highly conserved positions score higher than those in
weakly conserved positions. This matrix is used to perform the next BLAST
search (run PSI-BLAST iteration 2).
• The results of the next set of alignments are used to refine the matrix for
the next iteration of the search.
• This is an effective method of finding new protein family members.
• It can also help infer the function of hypothetical proteins that are
unannotated in the database.
• NP_659187
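The idea of position-dependent scores can be illustrated with a toy PSSM built from a small ungapped alignment. This Python sketch is deliberately simplified and is not how PSI-BLAST computes its matrix: it assumes a uniform background frequency of 1/20, a fixed pseudocount, and no sequence weighting. All names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(aligned_seqs, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)   # assumption: uniform background
    pssm = []
    for column in zip(*aligned_seqs):
        counts = Counter(res for res in column if res in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def score_with_pssm(pssm, segment):
    """Sum the position-specific scores of a segment the same length as the PSSM."""
    return sum(column[res] for column, res in zip(pssm, segment))


# Column 1 is fully conserved (G); column 2 is mostly K; column 3 is mostly T.
pssm = build_pssm(["GKS", "GRT", "GKT"])
print(round(pssm[0]["G"], 2), round(pssm[1]["K"], 2), round(pssm[0]["A"], 2))
print(score_with_pssm(pssm, "GKT") > score_with_pssm(pssm, "AAA"))
```

A residue observed in every sequence at a position scores higher than one observed in only some, and the same residue scores differently at different positions, which is the property a single substitution matrix cannot capture.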
PSSM
Using the PSSM box in BLAST
PSI-BLAST can save the PSSM to be used in other protein searches. The PSSM can be stored in a
text file and copied and pasted into the PSSM field.
To save a PSSM file:
1. Run a protein BLAST search
2. Check the PSI-BLAST box on main page.
3. Click the “Format” Button.
4. On the PSI-BLAST results page, click “Run PSI-BLAST Iteration 2”.
5. Click on “reformat these results” (top of results page) and change "Alignment" to "PSSM"
in a pull-down menu (at any iteration after the first). This will display text output with the
ASCII-encoded PSSM.
6. Save or copy the resulting text. The “Save as…” option of the browser
can be used to save this to a plain text file on your hard drive.
To use the PSSM in a new protein BLAST search against other databases:
1. Copy the above PSSM from the browser
2. Open a new protein BLAST page
3. Paste the PSSM in the PSSM field in the page
4. Provide the SAME query in the search box
5. Select a different target database
6. Click the "BLAST" button to start the search
PSSM text file
PHI-BLAST
Pattern Hit Initiated BLAST
• PHI-BLAST (Pattern-Hit Initiated BLAST) is a search program that combines
matching of regular expressions with local alignments surrounding the match.
Given a protein sequence S and a regular expression pattern P occurring in S, PHI-BLAST helps answer the question:
What other protein sequences both contain an occurrence of P and are
homologous to S in the vicinity of the pattern occurrences?
• PHI-BLAST may be preferable to just searching for pattern occurrences because it
filters out those cases where the pattern occurrence is probably random and not
indicative of homology.
• PHI-BLAST allows a researcher to emphasize regions of a protein sequence as
especially important in a search.
• PHI-BLAST does this by searching the database for sequences that match the query
sequence as well as match a pattern of amino acids that is specified by the user.
PHI-BLAST
• Pattern-Hit Initiated (PHI)-BLAST is designed to search for proteins that
contain a pattern specified by the user AND are similar to the query
sequence in the vicinity of the pattern. This dual requirement is intended
to reduce the number of database hits that contain the pattern, but are
likely to have no true homology to the query.
To run PHI-BLAST, enter your query (which contains one or more instances
of the pattern) into the "Search" box, and enter your pattern into the "PHI
pattern" box in the "Options" section of the page. Patterns must follow
the syntax conventions of PROSITE. Only one pattern can be used in a
given search. Pattern syntax is described here.
Accepted Parameters for Other
Advanced Field
• -G Cost to open gap [Integer]: default = 5 for nucleotides / 11 for proteins
• -E Cost to extend gap [Integer]: default = 2 for nucleotides / 1 for proteins
• -q Penalty for nucleotide mismatch [Integer]: default = -3
• -r Reward for nucleotide match [Integer]: default = 1
• -e Expect value [Real]: default = 10
• -W Word size [Integer]: default = 11 for nucleotides / 28 for megablast / 3 for
proteins
• -y Dropoff (X) for BLAST extensions in bits: default = 20 for blastn / 7 for others
• -X X dropoff value for gapped alignment (in bits): default = 15 for all programs, not
applicable to blastn
• -Z Final X dropoff value for gapped alignment (in bits): 50 for blastn / 25 for
others
Only limited values for gap existence and extension are supported for BLAST
programs. For protein BLAST, see the pulldown menu displayed next to the Matrix for
details.
PHI-BLAST example
• >gi|4758958|ref|NP_004148.1| Human cAMP-dependent protein kinase
MSHIQIPPGLTELLQGYTVEVLRQQPPDLVEFAVEYFTRLREARAPASVLPAATPRQSLGHP
PPEPGPDRVADAKGDSESEEDEDLEVPVPSRFNRRVSVCAETYNPDEEEEDTDPRVIHPKT
DEQRCRLQEACKDILLFKNLDQEQLSQVLDAMFERIVKADEHVIDQGDDGDNFYVIERGT
YDILVTKDNQTRSVGQYDNRGSFGELALMYNTPRAATIVATSEGSLWGLDRVTFRRIIVKN
NAKKRKMFESFIESVPLLKSLEVSERMKIVDVIGEKIYKDGERIITQGEKADSFYIIESGEVSILI
RSRTKSNKDGGNQEVEIARCHKGQYFGELALVTNKPRAASAYAVGDVKCLVMDVQAFERL
LGPCMDIMKRNISHYEEQLVKMFGSSVDLGNLGQ
• [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV].
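To see what a PROSITE-style pattern like the one above actually matches, it can be translated into an ordinary regular expression. The Python sketch below handles only a common subset of the syntax (x, [..], {..}, repeat counts, and the < and > anchors) and is an illustration of the pattern-matching half of PHI-BLAST only; it does no alignment. The function name is an assumption, and the test fragment is copied from the example sequence above.

```python
import re


def prosite_to_regex(pattern):
    """Convert a PROSITE-style pattern (common subset) to a Python regex string."""
    pieces = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):        # anchor at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # anchor at the C-terminus
            element, suffix = element[:-1], "$"
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        if match is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                    # x = any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]" # {..} = any residue except these
        repeat = ""
        if low is not None:                # x(5,11) -> .{5,11}
            repeat = "{" + low + ("," + high if high else "") + "}"
        pieces.append(prefix + core + repeat + suffix)
    return "".join(pieces)


pattern = "[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]."
regex = prosite_to_regex(pattern)
print(regex)

fragment = "RGSFGELALMYNTPRAATIVATSEG"   # part of the kinase sequence above
hit = re.search(regex, fragment)
print(hit.start(), hit.group(0))
```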
Basic BLASTs: BLASTx
Search protein databases using a translated nucleotide query
Useful when trying to find homologous proteins to a nucleotide
coding region
Blastx compares translational products of the nucleotide query
sequence to a protein database. Because blastx translates the
query sequence in all six reading frames and provides combined
significance statistics for hits to different frames, it is particularly
useful when the reading frame of the query sequence is unknown
or it contains errors that may lead to frame shifts or other coding
errors.
Blastx is often the first analysis performed with a newly determined
nucleotide sequence and is used extensively in analyzing EST
sequences. This search is more sensitive than nucleotide blast
since the comparison is performed at the protein level.
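Six-frame translation, the first step of a blastx search, can be sketched in Python as follows. The codon table is the standard genetic code written in the usual TCAG order; the function names are assumptions, and codons containing anything other than A, C, G or T are translated as X.

```python
from itertools import product

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"   # first base T
         "LLLLPPPPHHQQRRRR"   # first base C
         "IIIMTTTTNNKKSSRR"   # first base A
         "VVVVAAAADDEEGGGG")  # first base G
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons from the first base; * marks a stop codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna):
    """Return {frame: protein} for frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])
        frames[-(offset + 1)] = translate(rc[offset:])
    return frames


for frame, protein in sorted(six_frame_translations("ATGGCCATTGTAATGGGCCGC").items()):
    print(f"{frame:+d} {protein}")
```

Because every frame is searched, an insertion or deletion error that shifts the reading frame partway through the query still leaves matching segments in another frame.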
Basic BLASTs: tBLASTn
Search translated nucleotide database using a protein query
useful for finding protein homologs in unannotated nucleotide data
A tblastn search allows you to compare a protein sequence to the six-frame translations of a
nucleotide database. It can be a very productive way of finding homologous protein
coding regions in unannotated nucleotide sequences such as expressed sequence tags
(ESTs) and draft genome records (HTG), located in the BLAST databases est and htgs,
respectively.
ESTs are short, single-read cDNA sequences. They comprise the largest pool of sequence
data for many organisms and contain portions of transcripts from many uncharacterized
genes. Since ESTs have no annotated coding sequences, there are no corresponding
protein translations in the BLAST protein databases. Hence a tblastn search is the only
way to search for these potential coding regions at the protein level. The HTG
sequences, draft sequences from various genome projects or large genomic clones, are
another large source of unannotated coding regions.
Like all translating searches, the tblastn search is especially suited to working with error
prone data like ESTs and draft genomic sequences from HTG because it combines BLAST
statistics for hits to multiple reading frames and thus is robust to frame shifts introduced
by sequencing error.
Basic BLASTs: tBLASTx
Search translated nucleotide database using a translated nucleotide query
useful for identifying novel genes in error prone nucleotide query sequences.
tblastx takes a nucleotide query sequence, translates it in all six frames, and
compares those translations to the database sequences dynamically translated
in all six frames. This effectively performs a more sensitive blastp search
without doing the manual translation.
tblastx gets around the potential frame-shift and ambiguities that may prevent
certain open reading frames from being detected. This is very useful in
identifying potential proteins encoded by single pass read ESTs. In addition, it
can be a good tool for identifying novel genes.
tblastx is computationally intensive and should be used only as a last resort. Searching with
large genomic queries is NOT recommended. For users with a regular or batch
need for this type of search, the best way is to install standalone BLAST and
perform the search locally.
BLINK
Why BLAST when you can just Blink?
What does Blinking do for me?
Blink displays the graphical output of precomputed blastp results
against the protein nonredundant (nr) databases. This graphical
output includes:
• Alignment of up to 200 BLAST hits on
the query sequence
• Best hits to each organism
• List of known protein domains in the
query sequence
• Filter hits by selecting the BLAST
cutoff score
• Distribution of hits by taxonomic
grouping
• Display of similar sequences with
known 3D structure
• Filter hits by database and/or by
taxonomic grouping
• Display a taxonomic tree of all
organisms with similar sequences
Blink “BLAST Link”
Displays the results of BLAST searches that have been done for every protein
sequence in the Entrez Proteins data domain.
To access it, follow the BLink link displayed beside any hit in the results of an Entrez
Proteins search.
In contrast to Entrez's "Related Sequences" feature, which lists the titles of similar
sequences, BLink displays the graphical output of pre-computed blastp results
against the protein non-redundant (nr) database. The output includes the positions
of up to 200 BLAST hits on the query sequence, scores, and alignments. (View sample
BLink output for human MLH1 protein.)
BLink offers a variety of display options, including the distribution of hits by
taxonomic grouping, the best hit to each organism, the protein domains in the query
sequence, similar sequences that have known 3-D structures, and more.
Additional options allow you to specify which taxa you would like to exclude, increase
or decrease the BLAST cutoff score, or filter the BLAST hits to show only those from a
specific source database, such as RefSeq or Swiss-Prot. Click on a display option
below for additional information about it.
http://www.ncbi.nlm.nih.gov/sutils/static/blinkhelp.html
BLink, continued.
Blink Advantages
• Quickly and easily done
• Large amount and diversity of information available from single
link
• Manipulation of data choices is available after Blink is selected
Blink Disadvantages
• Not possible if sequence of interest is not in Entrez protein
database
• Each search is run with pre-set parameters – no choices are
available
• List of hits is restricted to the top 200 BLAST scores – may be
too few to see significant results for a given protein
Specialized BLAST
1. Find conserved domains
2. Search GEO profiles
3. Search immunoglobulins
4. Search SNPs
5. Align two sequences
6. Screen for vector contamination
Specialized BLAST:
Find conserved domains using RPS-BLAST
Find conserved domains in the query
• What regions of my protein have significant similarity to conserved
domains (CDs)?
• Proteins often contain several modules or domains, each with a
distinct evolutionary origin and function. NCBI's Conserved Domain
Database is a collection of multiple sequence alignments for
ancient domains and full-length proteins. The CD-Search service
may be used to identify the conserved domains present in a
protein query sequence
• 1007208A. Reports epidermal growth ...[gi:224020]
Specialized BLAST:
Find conserved domains using RPS-BLAST
Find conserved domains in the query
• Reverse Position-Specific BLAST (RPS-BLAST) searches your query against a database of
pre-compiled PSSMs. It is used to identify the conserved domains in the query
sequence.
• RPS-BLAST is a variant of the PSI-BLAST program ("Position-Specific Iterated BLAST"). PSI-BLAST finds
sequences significantly similar to the query in a database search and uses the resulting
alignments to build a Position-Specific Score Matrix (PSSM) for the query. With this PSSM
the database is scanned again to pull in more significant hits and further
refine the scoring model.
• RPS-BLAST uses the query sequence to search a database of pre-calculated PSSMs and
reports significant hits in a single pass. The role of the PSSM has changed from "query" to
"subject", hence the term "reverse" in RPS-BLAST.
• RPS-BLAST is the search tool used in the CD-Search service. The CD-Search service
provides a web interface to the RPS-BLAST program, the CD-Search databases, and
interactive alignment visualization including 3D structures.
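To make the "reverse" idea concrete, the sketch below scores a query against a small collection of pre-built PSSMs in a single pass. It is a toy illustration only: the matrices, the names `toy_domain_A` and `toy_domain_B`, the default penalty, and the threshold are invented, and the scan is ungapped and exhaustive, whereas RPS-BLAST itself relies on BLAST heuristics and statistics.

```python
def score_window(pssm, window, default=-1):
    """Sum the per-column scores of one query window against a PSSM.

    Each PSSM column is a dict mapping residue -> score; residues not
    listed in a column receive the (assumed) default penalty.
    """
    return sum(col.get(res, default) for col, res in zip(pssm, window))


def best_match(pssm, query):
    """Return (score, start) of the best ungapped placement, or None."""
    width = len(pssm)
    best = None
    for start in range(len(query) - width + 1):
        s = score_window(pssm, query[start:start + width])
        if best is None or s > best[0]:
            best = (s, start)
    return best


def rps_scan(query, pssm_db, threshold):
    """Single pass: the query is fixed, the PSSMs are the 'subjects'."""
    hits = []
    for name, pssm in pssm_db.items():
        match = best_match(pssm, query)
        if match is not None and match[0] >= threshold:
            hits.append((name, match[0], match[1]))
    return sorted(hits, key=lambda h: h[1], reverse=True)


# Invented three-column and two-column matrices.
pssm_db = {
    "toy_domain_A": [{"C": 5}, {"A": 2, "S": 2}, {"C": 5}],
    "toy_domain_B": [{"W": 6}, {"W": 6}],
}
hits = rps_scan("MKCACLL", pssm_db, threshold=8)
print(hits)
```

In PSI-BLAST the loop would run the other way: one PSSM built from the query is scanned across many database sequences, and the matrix is rebuilt after each round.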
RPS-BLAST Results
Jagged edges on a domain graphic indicate that more than 20% of the
domain's N- or C-terminus (or both) is missing from the alignment. This
may indicate a truncated query sequence, a false hit for that particular
domain, or an unusual domain architecture.
Specialized BLAST: GEO
Search sequences that have gene expression profiles
• The GEO BLAST tool enables retrieval of expression profiles on
the basis of nucleotide sequence similarity. Paste in a
nucleotide sequence, or specify a sequence accession number,
and a BLAST query is performed against all GenBank identifiers
represented on microarray Platforms or SAGE libraries in GEO.
The output resembles conventional BLAST output with each
alignment receiving a quality score.
• Each retrieval has an expression ‘E’ icon that links directly to
corresponding Entrez GEO Profiles.
Specialized BLAST:
Immunoglobulins
Search Immunoglobulin sequences
• IgBLAST was developed at NCBI to facilitate analysis of immunoglobulin V
region sequences in GenBank. It uses the BLAST search algorithm.
In addition to performing a regular BLAST search, IgBLAST has several additional
functions:
1. Reports the germline V, D and J gene matches to the query sequence.
2. Annotates the immunoglobulin domains (FWR1 through FWR3).
3. Matches the returned hits (for databases other than germline genes) to the
closest germline V genes, making it easier to identify related sequences.
4. Reveals the V(D)J junction details, such as nucleotide homology between the
ends of V(D)J segments and N nucleotide insertions.
Note: D and J genes are reported only for nucleotide sequence searches, and only when
the query shares a stretch of five or more identical nucleotides with the D or J gene.
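The five-nucleotide requirement can be illustrated with a small check for the longest run of identical nucleotides shared by two sequences. This is a simplified sketch under the assumption that "stretch of identity" means a contiguous exact match; it is not IgBLAST's actual procedure, and the function names and test sequences are made up.

```python
def longest_shared_stretch(query, gene):
    """Length of the longest contiguous run present in both sequences.

    Classic longest-common-substring dynamic programming, keeping only
    the previous row to save memory.
    """
    best = 0
    prev = [0] * (len(gene) + 1)
    for q in query:
        cur = [0] * (len(gene) + 1)
        for j, g in enumerate(gene, start=1):
            if q == g:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def reportable(query, gene, min_identity_run=5):
    """True when the shared run meets the (assumed) five-nucleotide minimum."""
    return longest_shared_stretch(query, gene) >= min_identity_run


print(reportable("ACGTTGCA", "GGTTGCC"))  # shares GTTGC
print(reportable("ACGTTGCA", "GGTTAC"))   # longest shared run is GTT
```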
Specialized BLAST: SNPs
BLAST the SNP database
BLASTn and tBLASTn are available as well as MEGABLAST. Numerous databases are
available as well as filter options.
Specialized BLAST:
BLAST two sequences to align
Use BLAST 2 Sequences to look for mismatches between two
nucleotide sequences
There are many instances where the ability to compare two
sequences without having to do a complete BLAST search of a
database is useful, especially in an instance like this one where
the function of a construct may be in question.
Specialized BLAST:
BLAST two sequences to align
Sample User Question
• Use the program bl2seq to compare the RefSeq mRNA for CCAAT/enhancer binding protein
(C/EBP), delta CEBPD (NM_005195) with the model transcript (XM_005023) predicted from the
human genome. (Note: You will need to turn filtering off to get the complete alignment). Are there
any mismatches? Do these mismatches affect the model protein sequence compared to the
reference sequence?
Step By Step Guide
• BLAST 2 Sequences
• Paste the accession numbers from NM_005195 and XM_005023 in the appropriate boxes on the
bl2seq page.
• Click the check box next to the Filter option to turn filtering off.
• Click the Align button.
• View the one difference in the two sequences. This difference is in the coding region of the gene
and would cause a frame-shift mutation in the translation.
• To see the results of the frame-shift between the two resulting protein sequences run BLAST 2
Sequences again with the translation products of the two nucleotide sequences. (Translation
products can be generated using the Translate Tool from ExPASy.)
• What about NM_131887?
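Reading the bl2seq output amounts to walking the alignment column by column and noting where the two sequences differ. The sketch below does this for two already-aligned strings that use "-" for gaps. The sequences are short made-up examples, not NM_005195 or XM_005023, and the frame-shift test assumes every gap lies inside the coding region.

```python
def alignment_differences(aligned_a, aligned_b):
    """List (column, base_a, base_b, kind) for every differing column."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    diffs = []
    for column, (a, b) in enumerate(zip(aligned_a, aligned_b), start=1):
        if a != b:
            kind = "gap" if "-" in (a, b) else "mismatch"
            diffs.append((column, a, b, kind))
    return diffs


def shifts_reading_frame(aligned_a, aligned_b):
    """True when the net insertion/deletion length is not a multiple of 3."""
    net_indel = aligned_a.count("-") - aligned_b.count("-")
    return net_indel % 3 != 0


# Made-up pair: the second sequence lacks one base relative to the first.
reference = "ATGGCCAGGAAGTGA"
model     = "ATGGCC-GGAAGTGA"
print(alignment_differences(reference, model))
print(shifts_reading_frame(reference, model))
```

A substitution changes at most one codon, while a one-base gap changes every codon downstream of it, which is why the guide suggests comparing the translation products as a second step.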
Finally…taking out the Garbage
VecScreen is used to find pieces of cloning vector that may have been included in the
sequence of interest (usually as pieces of the linker).
VecScreen is only for nucleotide sequences.
http://upload.wikimedia.org/wikipedia/commons/7/78/Vuilnis.JPG
http://www.ncbi.nlm.nih.gov/VecScreen/VecScreen.html
Basic BLASTs: BLASTp vs. BLink
Sample User Question
• I am interested in studying the evolution of a protein sequence (Chaperone protein dnaK (Heat
shock protein 70), Accession Number NP_212652). I would like to find the closest matches in
higher animals. What is the easiest way to do this?
Analysis/Comments
• You could run a BLAST search for the protein, then look at the results and choose the metazoan
representatives. An easier method is to use BLink and eliminate all the taxonomic groups except
the one of interest.
Flow Chart
• NP_212652 Entrez Protein Record - Look at record and access BLink from here
• NP_212652 BLink Record
Step By Step Guide
• Entrez Protein
– Find the Entrez Protein entry for the protein of interest
– Examine the record and be sure it is the appropriate entry
– Click the BLink link
• BLink
– Click the underlined name of each taxonomic group that you would like to eliminate from the
list, or click in the colored box preceding the group that you would like to select.
BLink will remove all groups but the one chosen.
Use nucleotide-nucleotide BLAST
http://microbewiki.kenyon.edu/index.php/Image:Acetobacter.JPG
Sample User Question
• Michael Crichton's fantasy about cloning dinosaurs, Jurassic Park, contains a putative
dinosaur DNA sequence. Use nucleotide-nucleotide BLAST against the default nucleotide
database, nr, to identify the real source of the following sequence. Select, copy and paste it
into the BLAST form window.
• This is probably the most common use of nucleotide-nucleotide BLAST: sequence
identification, establishing whether an exact match for a sequence is already present in the
database.
• ftp://ftp.ncbi.nih.gov/pub/FieldGuide/jurassic.txt
>DinoDNA from JURASSIC PARK p. 103 nt 1-1200
GCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGC GGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCG
TGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGC TGCTCACGCTGTACCTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTG
CCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAA AGTAGGACAGGTGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAG
ATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGTCACT CCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCT
GGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATGATTCTTCTCGCTTCCGGCGG CCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAA
CGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCG CACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAA
CAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAA GCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGG
CTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTG ACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCA
ACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCC GCGGTGCATGGAGCCGGGCCACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGG
CCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGG CCATCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT
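Sequences copied from slides or documents often pick up stray spaces and line breaks. The sketch below shows one way to tidy FASTA text before pasting it into a search form. It is a minimal illustration with hypothetical function names; it only normalizes formatting and cannot restore bases lost in copying, so the original file at the FTP address above remains the reliable source.

```python
def parse_fasta(text):
    """Return {header: sequence} with whitespace removed and bases uppercased."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            # Drop internal spaces left over from multi-column layouts.
            records[header].append("".join(line.split()).upper())
    return {h: "".join(parts) for h, parts in records.items()}


def invalid_bases(seq, alphabet="ACGTN"):
    """Characters in seq that are not in the expected nucleotide alphabet."""
    return sorted(set(seq) - set(alphabet))


# Made-up input with an internal space and mixed case.
records = parse_fasta(">toy\nACGT ACGT\nttga\n")
print(records)
print(invalid_bases(records["toy"]))
```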
Use translating BLAST 2 (blastx)
Sample User Question, continued
• NCBI scientist Mark Boguski noticed this obvious "contaminant" and supplied Crichton with a
better sequence, shown below, for the sequel, The Lost World. Identify the most likely source of
this sequence using nucleotide-nucleotide BLAST. Mark embedded his name in the sequence he
provided. To see Mark's name, use the translating BLAST (blastx) page with the sequence below.
(Look for MARK WAS HERE NIH). The proper use of the translating BLAST services is to look for
similar proteins (identify potential homologues) in other species.
• ftp://ftp.ncbi.nih.gov/pub/FieldGuide/lostworld.txt
>DinoDNA from THE LOST WORLD p. 135
GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGCATCAGATACAGTTGGAGATAAGGACG
GACGTGTGGCAGCTCCCGCAGAGGATTCACTGGAAGTGCATTACCTATCCCATGGGAGCC
ATGGAGTTCGTGGCGCTGGGGGGGCCGGATGCGGGCTCCCCCACTCCGTTCCCTGATGAA
GCCGGAGCCTTCCTGGGGCTGGGGGGGGGCGAGAGGACGGAGGCGGGGGGGCTGCTGGCC
TCCTACCCCCCCTCAGGCCGCGTGTCCCTGGTGCCGTGGGCAGACACGGGTACTTTGGGG
ACCCCCCAGTGGGTGCCGCCCGCCACCCAAATGGAGCCCCCCCACTACCTGGAGCTGCTG
CAACCCCCCCGGGGCAGCCCCCCCCATCCCTCCTCCGGGCCCCTACTGCCACTCAGCAGC
GGGCCCCCACCCTGCGAGGCCCGTGAGTGCGTCATGGCCAGGAAGAACTGCGGAGCGACG
GCAACGCCGCTGTGGCGCCGGGACGGCACCGGGCATTACCTGTGCAACTGGGCCTCAGCC
TGCGGGCTCTACCACCGCCTCAACGGCCAGAACCGCCCGCTCATCCGCCCCAAAAAGCGC
CTGCTGGTGAGTAAGCGCGCAGGCACAGTGTGCAGCCACGAGCGTGAAAACTGCCAGACA
TCCACCACCACTCTGTGGCGTCGCAGCCCCATGGGGGACCCCGTCTGCAACAACATTCAC
GCCTGCGGCCTCTACTACAAACTGCACCAAGTGAACCGCCCCCTCACGATGCGCAAAGAC
GGAATCCAAACCCGAAACCGCAAAGTTTCCTCCAAGGGTAAAAAGCGGCGCCCCCCGGGG
GGGGGAAACCCCTCCGCCACCGCGGGAGGGGGCGCTCCTATGGGGGGAGGGGGGGACCCC
TCTATGCCCCCCCCGCCGCCCCCCCCGGCCGCCGCCCCCCCTCAAAGCGACGCTCTGTAC
GCTCTCGGCCCCGTGGTCCTTTCGGGCCATTTTCTGCCCTTTGGAAACTCCGGAGGGTTT
TTTGGGGGGGGGGCGGGGGGTTACACGGCCCCCCCGGGGCTGAGCCCGCAGATTTAAATA
ATAACTCTGACGTGGGCAAGTGGGCCTTGCTGAGAAGACAGTGTAACATAATAATTTGCA
CCTCGGCAATTGCAGAGGGTCGATCTCCACTTTGGACACAACAGGGCTACTCGGTAGGAC
CAGATAAGCACTTTGCTCCCTGGACTGAAAAAGAAAGGATTTATCTGTTTGCTTCTTGCT
GACAAATCCCTGTGAAAGGTAAAAGTCGGACACAGCAATCGATTATTTCTCGCCTGTGTG
AAATTACTGTGAATATTGTAAATATATATATATATATATATATATCTGTATAGAACAGCC
TCGGAGGCGGCATGGACCCAGCGTAGATCATGCTGGATTTGTACTGCCGGAATTC
• Which has the best E value?
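A translating search such as blastx compares the conceptual translations of a nucleotide query, in all six reading frames, against protein sequences. The sketch below shows only the translation step, using the standard genetic code. The function names are illustrative, and the two short test strings are made up for the example rather than taken from a claim about where the message sits in the sequence above.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(seq):
    """Complement each base, then reverse the string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; unknown codons become 'X', stops '*'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frames(seq):
    """Return translations for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


forward = six_frames("ATGGCCAGGAAG")
reverse = six_frames("CTTCCTGGCCAT")
print(forward["+1"], reverse["-1"])
```

Scanning each of the six translated strings for a word of interest is a quick way to see which frame carries it; the blastx report provides the same information through the frame listed for each alignment.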
Use BLink
Sample User Question
• What can BLink tell you about the results from the previous
query?
LOCUS       NP_990795    304 aa    linear    VRT 18-OCT-2005
DEFINITION  erythroid-specific transcription factor eryf1 [Gallus gallus].
ACCESSION   NP_990795
VERSION     NP_990795.1  GI:45382623
DBSOURCE    REFSEQ: accession NM_205464.1
KEYWORDS    .
SOURCE      Gallus gallus (chicken)
  ORGANISM  Gallus gallus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Archosauria; Aves; Neognathae; Galliformes; Phasianidae;
            Phasianinae; Gallus.
REFERENCE   1 (residues 1 to 304)
  AUTHORS   Mott,B.H., Bassman,J. and Pikaart,M.J.
  TITLE     A molecular dissection of the interaction between the
            transcription factor Gata-1 zinc finger and DNA
  JOURNAL   Biochem. Biophys. Res. Commun. 316 (3), 910-917 (2004)
  PUBMED    15033488
• BLink ("BLAST Link") displays the results of BLAST searches that
have been done for every protein sequence in the Entrez Proteins
data domain.
Take a look at the Conserved Domains
Sample User Question
• What can CDD tell you about the results from the previous query?
LOCUS       NP_990795    304 aa    linear    VRT 18-OCT-2005
DEFINITION  erythroid-specific transcription factor eryf1 [Gallus gallus].
ACCESSION   NP_990795
VERSION     NP_990795.1  GI:45382623
DBSOURCE    REFSEQ: accession NM_205464.1
KEYWORDS    .
SOURCE      Gallus gallus (chicken)
  ORGANISM  Gallus gallus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Archosauria; Aves; Neognathae; Galliformes; Phasianidae;
            Phasianinae; Gallus.
REFERENCE   1 (residues 1 to 304)
  AUTHORS   Mott,B.H., Bassman,J. and Pikaart,M.J.
  TITLE     A molecular dissection of the interaction between the
            transcription factor Gata-1 zinc finger and DNA
  JOURNAL   Biochem. Biophys. Res. Commun. 316 (3), 910-917 (2004)
  PUBMED    15033488
• Shows the domains in the query protein sequence, based on a
comparison of that sequence to the Conserved Domain Database
(CDD). A brief description of CDD is available on the NCBI Site
Map. The CDD Help document provides detailed information.
References
• Baxevanis, A.D. and Ouellette, B.F.F., eds. Bioinformatics: A Practical Guide to the
Analysis of Genes and Proteins, third edition. Wiley, 2005. ISBN 0-471-47878-4.
• Chen, Y.B., Chattopadhyay, A., Bergen, P., Gadd, C. and Tannery, N. 2007. The online
Bioinformatics resources collection at the University of Pittsburgh Health Sciences
Library System - A one-stop gateway to online Bioinformatics databases and
software tools. Nucleic Acids Research 2007 Database Issue, 35:D780-D785.
http://www.hsls.pitt.edu/guides/genetics/obrc [date cited March 15, 2010]
• Geer, R.C., Messersmith, D.J., Alpi, K., Bhagwat, M., Chattopadhyay, A., Gaedeke, N.,
Lyon, J., Minie, M.E., Morris, R.C., Ohles, J.A., Osterbur, D.L. and Tennant, M.R. 2002.
NCBI Advanced Workshop for Bioinformatics Information Specialists. [Online]
Similarity Searching. http://www.ncbi.nlm.nih.gov/Class/NAWBIS/ [date revised
July 23, 2006; date cited March 15, 2010]