Advanced Topics in Bioinformatics Lab 2: Statistical Aspects of BLAST

Maoying Wu
Department of Bioinformatics & Biostatistics
Shanghai Jiao Tong University
November 12, 2014
Inferring the Scoring Matrices
Dayhoff similarity between two amino acids i and j:
$s_{ij} = \log_2\left(\frac{q_{ij}}{p_i p_j}\right)$
where $p_i$ and $p_j$ are their individual probabilities, and their
frequency of pairing is $q_{ij}$.
For example, if $p_M = 0.01$, $p_L = 0.1$ and $q_{ML} = 1/500 = 0.002$, then
$s(M, L) = \log_2\left(\frac{0.002}{0.01 \times 0.1}\right) = \log_2 2 = +1$.
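The log-odds score above can be checked numerically. The following is a minimal Python sketch; the function name log_odds_score is an illustrative choice, not part of the lab material.

```python
import math

def log_odds_score(q_ij, p_i, p_j):
    """Log-odds score in bits: log2 of observed pairing frequency over chance."""
    return math.log2(q_ij / (p_i * p_j))

# Values from the slide: p(M) = 0.01, p(L) = 0.1, q(M, L) = 1/500
score_ml = log_odds_score(q_ij=0.002, p_i=0.01, p_j=0.1)
print(f"s(M, L) = {score_ml:.3f} bits")
```

A positive score means the pair is seen more often than chance predicts, and a negative score means it is seen less often.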
Two Categories of Scoring Matrices
PAM (Percent Accepted Mutations)
a.k.a. Dayhoff matrices, named after Margaret Dayhoff
Have a strong theoretical component and make a few evolutionary
assumptions.
The PAM1 matrix was constructed with a set of proteins that
were all 85% or more identical to one another.
PAMn = (PAM1)^n: the PAM1 mutation probability matrix is raised
to the n-th power (extrapolation approach)
BLOSUM (BLOck SUbstitution Matrix)
Use a more empirical approach
blocks = ungapped segments in a set of multiply-aligned
protein families
Blocks are clustered on the basis of their percent identity.
BLOSUM62 is derived from blocks having at least 62%
identity to one another.
BLOSUM matrices generally outperform PAM matrices in BLAST
applications.
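The PAM extrapolation step can be illustrated with a toy example. The sketch below, in Python, raises a hypothetical two-letter mutation probability matrix to the 250th power. The matrix values are invented for illustration and are not real PAM1 data, which would be a 20 × 20 matrix.

```python
def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    size = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(size)) for j in range(size)]
            for i in range(size)]

def mat_pow(matrix, power):
    """Raise a square matrix to a non-negative integer power by repeated multiplication."""
    size = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(size)] for i in range(size)]
    for _ in range(power):
        result = mat_mul(result, matrix)
    return result

# Hypothetical 2-letter "PAM1": each letter mutates to the other with probability 0.01
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

toy_pam250 = mat_pow(TOY_PAM1, 250)
for row in toy_pam250:
    print([round(x, 4) for x in row])
```

Each row still sums to 1 after extrapolation, and at large n the entries drift toward the background composition, which is why high-numbered PAM matrices suit distant relationships.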
Example: BLOSUM62 Matrix
Computing the scoring-matrix-specific λ
A raw score can be a misleading quantity because the scaling
factor of a scoring matrix is arbitrary.
A normalized score is a more useful measure.
Converting a raw score to a normalized score requires a
matrix-specific constant called λ.
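$\lambda S_{ij} = \ln\left(\frac{q_{ij}}{p_i p_j}\right)$
$q_{ij} = p_i p_j e^{\lambda S_{ij}}$
$\sum_{i=1}^{n} \sum_{j=1}^{i} q_{ij} = 1$
Thus, the key to solving for λ is:
$\sum_{i=1}^{n} \sum_{j=1}^{i} q_{ij} = \sum_{i=1}^{n} \sum_{j=1}^{i} p_i p_j e^{\lambda S_{ij}} = 1$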
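The normalization condition has no closed-form solution in general, so λ is usually found numerically. The following Python sketch uses bisection. It assumes the sum runs over all ordered letter pairs and that the expected score is negative, so that exactly one positive root exists. The function name solve_lambda and the toy nucleotide scoring system are illustrative assumptions.

```python
import math

def solve_lambda(freqs, score, tol=1e-9):
    """Find the positive lambda with sum_ij p_i p_j exp(lambda * S_ij) = 1 by bisection."""
    letters = list(freqs)
    pairs = [(a, b) for a in letters for b in letters]

    def total(lam):
        return sum(freqs[a] * freqs[b] * math.exp(lam * score(a, b)) for a, b in pairs)

    expected = sum(freqs[a] * freqs[b] * score(a, b) for a, b in pairs)
    if expected >= 0 or max(score(a, b) for a, b in pairs) <= 0:
        raise ValueError("need a negative expected score and at least one positive score")

    # Grow the bracket until the sum exceeds 1, then bisect inside it.
    low, high = 0.0, 1.0
    while total(high) <= 1.0:
        low, high = high, high * 2.0
    while high - low > tol:
        mid = (low + high) / 2.0
        if total(mid) > 1.0:
            high = mid
        else:
            low = mid
    return (low + high) / 2.0

# Toy system: uniform nucleotide frequencies, match = +1, mismatch = -1
uniform = {base: 0.25 for base in "ACGT"}
lam = solve_lambda(uniform, lambda a, b: 1 if a == b else -1)
print(f"lambda = {lam:.4f} nats")
```

For this toy system the condition reduces to 0.25 e^λ + 0.75 e^(-λ) = 1, whose positive root is λ = ln 3, which gives a convenient check on the solver.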
Relative entropy of a scoring matrix
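The expected score of a scoring matrix:
$E = \sum_{i=1}^{20} \sum_{j=1}^{i} p_i p_j S_{ij}$
The relative entropy of a scoring matrix:
$H = \sum_{i=1}^{20} \sum_{j=1}^{i} q_{ij} \lambda S_{ij}$
Because $\lambda S_{ij} = \ln(q_{ij} / p_i p_j)$, H is non-negative.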
PAM1 has greater H than PAM120; similarly, BLOSUM80 has
a greater H than BLOSUM62.
Which PAM matrix is most similar to the BLOSUM45 matrix? To
answer this question, all we need to do is compare their
relative entropies.
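Both quantities are straightforward to compute once λ is known. The Python sketch below assumes sums over all ordered letter pairs and uses a toy nucleotide system (uniform frequencies, match = +1, mismatch = -1) with λ = ln 3, which satisfies the normalization condition for that system. The function name is an illustrative choice.

```python
import math

def expected_score_and_entropy(freqs, score, lam):
    """Return (E, H): expected score per pair and relative entropy in nats."""
    expected = 0.0
    entropy = 0.0
    for a in freqs:
        for b in freqs:
            background = freqs[a] * freqs[b]               # p_i * p_j
            s = score(a, b)
            target = background * math.exp(lam * s)        # q_ij
            expected += background * s
            entropy += target * lam * s
    return expected, entropy

uniform = {base: 0.25 for base in "ACGT"}
exp_score, rel_entropy = expected_score_and_entropy(
    uniform, lambda a, b: 1 if a == b else -1, lam=math.log(3))
print(f"E = {exp_score:.3f}")
print(f"H = {rel_entropy:.4f} nats = {rel_entropy / math.log(2):.4f} bits")
```

The expected score is negative while H is positive. A larger H means each aligned position carries more information, which is why high-H matrices suit short, closely related alignments.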
A Perl script for estimating λ in a match-mismatch scoring
system
#!/usr/bin/perl -w
use strict;
use constant Pn => 0.25; # probability of any nucleotide
die "usage: $0 <match> <mismatch>\n" unless @ARGV == 2;
my ($match, $mismatch) = @ARGV;
my $expected_score = $match * 0.25 + $mismatch * 0.75;
die "illegal scores\n" if $match <= 0 or $expected_score >= 0;
# calculate lambda
my ($lambda, $high, $low) = (1, 2, 0); # initial estimates
while ($high - $low > 0.001) { # precision
    # calculate the sum of all normalized scores
    my $sum = Pn * Pn * exp($lambda * $match) * 4
            + Pn * Pn * exp($lambda * $mismatch) * 12;
    # refine guess at lambda
    if ($sum > 1) {
        $high = $lambda; $lambda = ($lambda + $low) / 2;
    }
    else {
        $low = $lambda; $lambda = ($lambda + $high) / 2;
    }
}
# compute target frequency and H
my $targetID = Pn * Pn * exp($lambda * $match) * 4;
my $H = $lambda * $match * $targetID
      + $lambda * $mismatch * (1 - $targetID);
# output
print "expscore: $expected_score\n";
print "lambda: $lambda nats (", $lambda / log(2), " bits)\n";
print "H: $H nats (", $H / log(2), " bits)\n";
print "%ID: ", $targetID * 100, "\n";
Karlin-Altschul equation
$E = k m n e^{-\lambda S}$
k - a small adjustment factor that accounts for the fact
that optimal local alignment scores for alignments starting
at different positions in the two sequences may be highly
correlated;
m - effective query size;
n - effective database size;
S - raw score;
λ - scoring-matrix-specific parameter.
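The Karlin-Altschul equation is easy to evaluate directly. The Python sketch below also computes the normalized (bit) score S' = (λS - ln k) / ln 2, from which the same E-value follows as m n 2^(-S'). All parameter values are hypothetical and chosen only to keep the arithmetic easy to follow; real values of k and λ depend on the scoring system.

```python
import math

def expect_value(k, m, n, lam, raw_score):
    """Karlin-Altschul E-value: E = k * m * n * exp(-lambda * S)."""
    return k * m * n * math.exp(-lam * raw_score)

def bit_score(k, lam, raw_score):
    """Normalized score in bits: S' = (lambda * S - ln k) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)

# Hypothetical parameters (not taken from any real scoring matrix)
K, M, N, LAM, S = 0.1, 100, 1000, math.log(2), 10

e_value = expect_value(K, M, N, LAM, S)
bits = bit_score(K, LAM, S)
print(f"E = {e_value:.4f}, bit score = {bits:.4f}")
# The same E-value recovered from the bit score alone
print(f"m * n * 2^-S' = {M * N * 2 ** (-bits):.4f}")
```

Because the bit score absorbs both k and λ, bit scores from searches with different scoring matrices can be compared directly, whereas raw scores cannot.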
Correction of search space
The search space is less than m × n due to the edge effect.
$l = \ln(k m n) / H$
$m' = m - l$
$n' = n - (l \times \text{number of sequences in database})$
l is referred to as the expected HSP length, which is used to
indicate the minimum sequence length for producing an
alignment with a significant Expect.
For many searches, m' is routinely set to 1/k.
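The length correction can be sketched in a few lines of Python. The function name and all input values below are hypothetical. The sketch applies only the three formulas for l, m' and n', and does not implement the 1/k adjustment.

```python
import math

def effective_search_space(k, m, n, h, num_sequences):
    """Return (l, m_eff, n_eff) after the edge-effect correction."""
    length = math.log(k * m * n) / h           # expected HSP length l
    m_eff = m - length                         # effective query length m'
    n_eff = n - length * num_sequences         # effective database length n'
    return length, m_eff, n_eff

# Hypothetical search: 1000-letter query against 1000 sequences totalling 1e6 letters
hsp_len, m_eff, n_eff = effective_search_space(
    k=0.1, m=1000, n=1_000_000, h=1.0, num_sequences=1000)
print(f"l = {hsp_len:.2f}, m' = {m_eff:.2f}, n' = {n_eff:.2f}")
print(f"search space shrinks from {1000 * 1_000_000:.3e} to {m_eff * n_eff:.3e}")
```

The correction matters most for short queries and for databases with many short sequences, where l is a large fraction of each sequence length.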
Probability Versus Expectation
$P = 1 - e^{-E}$
$E = -\ln(1 - P)$
An E-value tells you how many alignments with a score at least
as large as the given score are expected by chance.
A P-value tells you the probability of seeing at least one such
alignment by chance.
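The two conversions can be sketched in Python as follows. The function names are illustrative. math.expm1 and math.log1p are used because they stay accurate when E or P is very small.

```python
import math

def p_from_e(e_value):
    """P = 1 - exp(-E), computed stably for small E."""
    return -math.expm1(-e_value)

def e_from_p(p_value):
    """E = -ln(1 - P), computed stably for small P."""
    return -math.log1p(-p_value)

for e in (0.001, 0.1, 1.0, 10.0):
    p = p_from_e(e)
    print(f"E = {e:<6} P = {p:.6f}  round trip E = {e_from_p(p):.6f}")
```

For small values the two are nearly equal (P ≈ E), and they diverge as E grows because P can never exceed 1.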