Better Filtering with
Gapped q-grams
S. Burkhardt
Center for Bioinformatics, Saarbrücken
J. Kärkkäinen
Max-Planck-Institut für Informatik, Saarbrücken
Outline
Motivation
The `classic` q-gram Lemma
q-shapes
Measuring Filter quality/speed
Experimental Results
Conclusion
The k-mismatches problem
For a pattern P, a string S, and a value k:
find all occurrences of P in S with at
most k character replacements.
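As a baseline for the problem statement above, here is a minimal sketch of a direct (unfiltered) search that compares P against every window of S. The function name `k_mismatch_occurrences` and the short test string are illustrative assumptions, not part of the slides.

```python
def k_mismatch_occurrences(pattern, text, k):
    """Return all start positions where pattern matches text with at most k mismatches."""
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for a, b in zip(pattern, text[start:start + m]):
            if a != b:
                mismatches += 1
                if mismatches > k:
                    break  # this window already has too many replacements
        if mismatches <= k:
            hits.append(start)
    return hits


# Hypothetical example: ACTG at position 2 differs from ACTC in one character.
print(k_mismatch_occurrences("ACTC", "GGACTGA", 1))  # [2]
```

This scan costs O(|P| * |S|) character comparisons in the worst case, which is the work a filter tries to avoid by discarding most of S before verification.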
Filter Algorithms
Filtration Stage:
Examine S with a filter criterion
Return areas with potential matches
Verification Stage:
Verify which areas have true matches
Pattern P: ACTC
k = 1
String S: GCATTCGATGGACTGGACTAGTGATTGAGT
Find occurrences of P with at most k errors
The q-gram Lemma
For a pattern P, a string S, a value k:
Matches to P in S with at most k errors
contain at least
|P|-q+1-(kq)
substrings of length q (q-grams) from S.
Example: P = TCGATTAC, |P| = 8, q = 3, k = 1
q-grams of P: TCG, CGA, GAT, ATT, TTA, TAC
# of q-grams: |P| - q + 1 = 6
=> t = 8 - 3 + 1 - (1*3) = 3
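The lemma's bound can be checked with a few lines of code. This is a minimal sketch; the helper names `qgrams`, `qgram_threshold` and `shared_qgrams`, and the mutated string TCGGTTAC (one replacement in TCGATTAC), are assumptions made for illustration.

```python
def qgrams(s, q):
    """All overlapping substrings of length q, left to right."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def qgram_threshold(m, q, k):
    """Lower bound from the q-gram lemma: m - q + 1 - k*q."""
    return m - q + 1 - k * q


def shared_qgrams(p, w, q):
    """Count positions where p and an equally long window w have the same q-gram."""
    return sum(1 for a, b in zip(qgrams(p, q), qgrams(w, q)) if a == b)


print(qgrams("TCGATTAC", 3))                     # ['TCG', 'CGA', 'GAT', 'ATT', 'TTA', 'TAC']
print(qgram_threshold(8, 3, 1))                  # 3
print(shared_qgrams("TCGATTAC", "TCGGTTAC", 3))  # 3
```

A single replacement can destroy at most q of the |P| - q + 1 q-grams, which is why the formula subtracts kq; in the example one replacement in the middle removes exactly three.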
GCATTCGATGGACTGGACTAGTGAATCAGT
Error number k :
at least
t = |P| - q + 1 - (qk)
common q-grams in |P| letters
In the DP matrix, one can count the number
of matching q-grams per diagonal
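Counting per diagonal does not require building the DP matrix. The sketch below, under the assumption of the k-mismatches setting (no insertions or deletions), indexes the q-grams of P and tallies each hit under its diagonal d = j - i, where i is the position in P and j the position in S. The function name `qgram_filter` and the test strings are hypothetical.

```python
from collections import defaultdict


def qgram_filter(pattern, text, q, k):
    """Return candidate start positions whose diagonal holds at least t q-gram hits."""
    m = len(pattern)
    t = m - q + 1 - k * q
    if t <= 0:
        raise ValueError("threshold t <= 0: the q-gram filter cannot exclude anything")

    # Map every q-gram of the pattern to the positions where it starts.
    index = defaultdict(list)
    for i in range(m - q + 1):
        index[pattern[i:i + q]].append(i)

    # A q-gram at text position j that equals the pattern q-gram at i lies on diagonal j - i.
    hits = defaultdict(int)
    for j in range(len(text) - q + 1):
        for i in index.get(text[j:j + q], ()):
            hits[j - i] += 1

    last_start = len(text) - m
    return sorted(d for d, count in hits.items() if count >= t and 0 <= d <= last_start)


# Hypothetical example: the text contains TCGGTTAC (one replacement) at position 2.
print(qgram_filter("TCGATTAC", "GGTCGGTTACAA", 3, 1))  # [2]
```

Each returned diagonal is only a potential match and still has to pass the verification stage.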
General idea:
Use substrings with gaps (q-shapes)
compute correct threshold t
total length s is called span
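Example: |Q| = 11, k = 3 (O = match, X = mismatch)

3-shape ##.# (span s = 4, 1 gap): t = 1
alignment: OOOXXOOXOOO
shape placements: OO.X, OO.X, OX.O, XX.O, XO.X, OO.O, OX.O, XO.O
(only OO.O matches)

3-gram ###: t = 0, no filter!
alignment: OOXOOXOOXOO
q-gram placements: OOX, OXO, XOO, OOX, OXO, XOO, OOX, OXO, XOO
(no q-gram matches)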
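The two thresholds in the example above can be reproduced by brute force. The sketch below simply tries every placement of k mismatches and keeps the smallest number of surviving shape placements; it is an exhaustive check that is only practical for small |Q| and k, not the DP-based method the authors refer to. The function name `shape_threshold` is an assumption.

```python
from itertools import combinations


def shape_threshold(shape, m, k):
    """Smallest number of matching shape placements over all ways to place k mismatches."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    placements = range(m - len(shape) + 1)
    best = None
    for mismatches in combinations(range(m), k):
        bad = set(mismatches)
        # A placement matches if none of its '#' positions is a mismatch.
        hits = sum(1 for p in placements if all(p + o not in bad for o in offsets))
        if best is None or hits < best:
            best = hits
    return best


print(shape_threshold("##.#", 11, 3))  # 1
print(shape_threshold("###", 11, 3))   # 0
```

For the contiguous shape the result agrees with the q-gram lemma (11 - 3 + 1 - 9 = 0), while the gapped shape keeps a usable threshold with the same number of matching characters.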
Judging the quality of q-shapes I
We developed a DP-based
approach for computing the
threshold t given a q-shape
and a query length |P|
Observation: The threshold t is not
the only factor that influences the
behaviour of a q-shape
Judging the quality of q-shapes II
##.#
 ##.#
For t=2 and the 3-shape ##.#
the minimum coverage is 5
We define the minimum
coverage as the minimum
number of matching
characters for any
arrangement of t matching
q-shapes in P and a
substring of length |P| in S
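Read literally, this definition can be evaluated by enumerating every choice of t shape placements and counting the distinct positions they cover. The following is a minimal sketch of that reading; the function name `minimum_coverage` is hypothetical, and the enumeration grows too quickly to be used for long patterns or large t.

```python
from itertools import combinations


def minimum_coverage(shape, m, t):
    """Fewest distinct positions covered by any t placements of shape in a length-m pattern."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    placements = range(m - len(shape) + 1)
    # Raises ValueError if t exceeds the number of placements (nothing to minimise over).
    return min(
        len({p + o for p in chosen for o in offsets})
        for chosen in combinations(placements, t)
    )


print(minimum_coverage("##.#", 11, 2))  # 5
print(minimum_coverage("###", 11, 2))   # 4
```

Two copies of ##.# can share at most one position, so two hits always cover five characters, whereas two adjacent 3-grams cover only four.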
Judging the quality of q-shapes III
3-shape: ##.#
Σ = {A,C,G,T}
Expected number of occurrences of a single 3-shape in S:
occ = |S| / |Σ|^q
The value q (i.e., the
number of matching
characters in a shape)
determines the expected
number of occurrences in
a random string S
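To get a feeling for the numbers, the estimate can be evaluated directly. This sketch assumes a uniformly random string over a 4-letter alphabet and uses |S| divided by the alphabet size to the power q; the function name `expected_occurrences` and the chosen text length are illustrative.

```python
def expected_occurrences(text_length, alphabet_size, q):
    """Expected number of positions in a uniform random text that match one fixed q-shape."""
    return text_length / alphabet_size ** q


# A 50 million character text over {A,C,G,T}, for a few values of q.
for q in (3, 6, 9, 12):
    print(q, expected_occurrences(50_000_000, 4, q))
```

Each additional matching character divides the expected number of hits by the alphabet size, independent of where the gaps are.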
Judging the quality of q-shapes IV
Speed:
value of q
Efficiency:
minimum coverage
The speed of the filter step
is influenced by the expected
number of matching q-shapes
in S. The efficiency of the
filtration correlates closely
with the minimum coverage
Judging the quality of q-shapes V
Shapes with maximal
minimum coverage for:
|Q| = 50, k=5
q=6 : ##......#..#..#.#
q=9 : ###..#..#.#...#.##
q=10: ###..#..#.#..###.#
q=11: #######.##.##
q=12: ###.#..###.#..###.#
Good shapes are
not necessarily
regular or
predictable in
their form.
Evaluating q-shapes
Experimental setup for q-shapes:
• 50 million character random (Bernoulli) string S
• 1000 random queries of length 500
• queries have no approximate matches in S
• compute threshold for |Q|=50
• actual value of |Q| is 500! (to reduce runtime of tests)
Experiments show 10x reduced filter efficiency;
relative performance between shapes unaffected
Evaluating q-shapes
What we measured for every shape and all queries:
A) The total number of occurrences of all shapes
Good indicator of the total work for the filter phase
B) The number of diagonals containing at least t shapes
Good indicator of the filter efficiency
The experiments show a good correlation between
A and the predicted values as well as B and the minimum
coverage
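The two measurements can be sketched for a single query as follows. This is a simplified illustration, not the authors' test harness: it counts hits over whole diagonals, and the function name `measure_shape`, the seed, and the small random inputs in the demo are assumptions.

```python
import random
from collections import defaultdict


def shape_key(s, pos, offsets):
    """Characters of s at the '#' positions of a shape placed at pos."""
    return "".join(s[pos + o] for o in offsets)


def measure_shape(shape, query, text, t):
    """Return (A, B): total shape hits, and number of diagonals with at least t hits."""
    offsets = [i for i, c in enumerate(shape) if c == "#"]
    span = len(shape)

    # Index the query by the characters seen through the shape.
    index = defaultdict(list)
    for i in range(len(query) - span + 1):
        index[shape_key(query, i, offsets)].append(i)

    per_diagonal = defaultdict(int)
    total_hits = 0
    for j in range(len(text) - span + 1):
        for i in index.get(shape_key(text, j, offsets), ()):
            per_diagonal[j - i] += 1
            total_hits += 1

    candidates = sum(1 for count in per_diagonal.values() if count >= t)
    return total_hits, candidates


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed only to make the demo repeatable
    text = "".join(rng.choice("ACGT") for _ in range(100_000))
    query = "".join(rng.choice("ACGT") for _ in range(50))
    print(measure_shape("###..#..#.#...#.##", query, text, 2))
```

A is driven by q alone, while B also depends on how the hits are distributed over diagonals, which is where the shape's form comes in.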
Our work….
• An analysis of q-grams with gaps (q-shapes)
• Results include:
• experimental evidence for their superiority
when compared to standard q-grams
• a method to roughly judge their quality, the
minimum coverage
• a way to calculate the parameters required to
use them in a filter algorithm
Todo….
• an algorithm to predict the best shapes
• improve the quality measure for q-grams
• extension to the k-differences problem (with
insertions and deletions)
• a thorough analysis of filter behaviour for
> k differences (use as a heuristic filter)