CQP
ANNIS
Tools for Annotating and Searching Corpora
Practical Session 4: Searching Corpora
Stefanie Dipper
Institute of Linguistics
Ruhr-University Bochum
Corpus Linguistics Fest (CLiF)
June 6-10, 2016
Indiana University, Bloomington
Stefanie Dipper
Tools for annotating and searching
1 / 32
Today’s session
We learn in detail how to work with two corpus search tools
CQP
ANNIS
Outline
1. CQP
2. ANNIS
CQP
Ref: Christ, Schulze, Hofmann, and König (1999)
‘Corpus Query Processor’
Free, open source: http://cwb.sourceforge.net/
Mainly developed by Stefan Evert (Erlangen)
Very efficient, widely used corpus search tool (e.g. BNCweb)
Demo websites: http://cwb.sourceforge.net/demos.php, http://corpora.linguistik.uni-erlangen.de/demos/CQP/cqpdemo.html
CQP demo corpus: Dickens
We first work with the demo corpus ‘Dickens’
A collection of novels by Charles Dickens (from Project Gutenberg)
3.4 million tokens
Automatically tagged and lemmatized (plus further annotations)
Tagset: Penn Treebank with a few modifications
  proper nouns: NP and NPS instead of NNP and NNPS
  personal pronouns: PP instead of PRP
  SENT for end-of-sentence punctuation
CQP demo corpus: Dickens
Go to http://corpora.linguistik.uni-erlangen.de/demos/CQP
Select the option ‘CQP query’ for the DICKENS corpus
A new window opens: the search window
I first give a short introduction to CQP
Then there is time for practicing
CQP Internals
Internal representation of the data: table
cpos   word          pos   lemma
0      I             PP    I
1      have          VBP   have
2      endeavoured   VBN   endeavour
3      in            IN    in
4      this          DT    this
5      Ghostly       JJ    ghostly
6      little        JJ    little
7      book          NN    book
Remember: Personal pronouns: tagged as PP
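One way to picture this table is as a list of rows, one per token, where the list index plays the role of cpos. The following is only a minimal Python sketch of that idea: the names `Token`, `CORPUS` and `positions_where` are illustrative and are not part of CQP, which stores and indexes the columns in its own way.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str
    pos: str
    lemma: str


# The list index corresponds to cpos (corpus position) in the table above.
CORPUS = [
    Token("I", "PP", "I"),
    Token("have", "VBP", "have"),
    Token("endeavoured", "VBN", "endeavour"),
    Token("in", "IN", "in"),
    Token("this", "DT", "this"),
    Token("Ghostly", "JJ", "ghostly"),
    Token("little", "JJ", "little"),
    Token("book", "NN", "book"),
]


def positions_where(attribute: str, value: str) -> list[int]:
    """Return every cpos whose given column equals value exactly."""
    return [cpos for cpos, tok in enumerate(CORPUS)
            if getattr(tok, attribute) == value]


print(positions_where("pos", "JJ"))           # rows tagged JJ
print(positions_where("lemma", "endeavour"))  # rows with lemma 'endeavour'
```

Each query on a single token then amounts to selecting rows of this table by the values in one or more columns.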
CQP syntax: individual tokens
Search for a token pattern with [..]
Each pattern [..] matches exactly one token, i.e. one row in the table
A pattern can specify the word form and/or its annotations
Connectives: & (and), | (or), ! (not)
Example queries:
  [word="off" & pos="IN"]
  [pos="JJ"]
  [word="saw" & pos!="VBD"]
[ ]: maximally underspecified token (matches any token)
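The connectives can be read as ordinary Boolean operations on tests over one table row. The Python sketch below mimics that reading with small helper functions; `attr`, `both`, `either`, `negate` and the four sample tokens are hypothetical names and data made up for illustration, not CQP functionality.

```python
import re


def attr(name, regex):
    """Test in the spirit of name="regex": the regex must cover the whole value."""
    pattern = re.compile(regex)
    return lambda tok: pattern.fullmatch(tok[name]) is not None


def both(p, q):      # counterpart of &
    return lambda tok: p(tok) and q(tok)


def either(p, q):    # counterpart of |
    return lambda tok: p(tok) or q(tok)


def negate(p):       # counterpart of !
    return lambda tok: not p(tok)


def any_token(tok):  # counterpart of [ ]
    return True


# Hypothetical rows, not taken from the Dickens corpus.
tokens = [
    {"word": "saw", "pos": "VBD"},
    {"word": "saw", "pos": "NN"},
    {"word": "off", "pos": "IN"},
    {"word": "off", "pos": "RP"},
]

# Analogue of [word="saw" & pos!="VBD"]
saw_not_past = both(attr("word", "saw"), negate(attr("pos", "VBD")))
hits = [t for t in tokens if saw_not_past(t)]
print(hits)
```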
BNCweb’s Simple search vs. CQP queries
‘Simple’   ‘CQP’          Description              CQP example query
X          [word="X"]     word form                [word="saw"]
{X}        [lemma="X"]    lemma                    [lemma="see"]
_X         [pos="X"]      fine-grained POS class   [pos="NNS"]
_{X}       [pos=<RE>]     major POS class          [pos="N.*"]
CQP regular expressions (selection)
RE       Description        Example query and matches
.        a single char      [word="doo."] → door, doom; [pos="NN."] → NNS (also NNP in the unmodified Penn tagset)
x?       zero or one        [word="an?"] → a, an
x*       zero to infinite   [word="oo*"] → o, oo, ooo, . . . ; [pos="JJ.*"] → catholic
x+       one to infinite    [word=".*o+.*"] → pooosh
x{n}     n repetitions      [word="o{3}"] → ooo
x{n,m}   n–m repetitions    [word="o{3,5}"] → ooo, oooo, ooooo
[xy]     character class    [word=".*[äöü].*"] → begrüße
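Note that a CQP regular expression has to match the entire attribute value, which is why the examples above need a leading or trailing .* to find substrings. The effect can be tried out with Python's `re.fullmatch`, as in the sketch below. Python's regex dialect is not identical to the one CQP uses, and the helper `matches` and the word lists are made up for illustration.

```python
import re


def matches(regex: str, values: list[str]) -> list[str]:
    """Keep only the values that the regex matches in full."""
    return [v for v in values if re.fullmatch(regex, v)]


print(matches(r"doo.", ["do", "doo", "door", "doom", "doors"]))
print(matches(r"an?", ["a", "an", "and"]))
print(matches(r"o{3,5}", ["oo", "ooo", "ooooo", "oooooo"]))
print(matches(r".*[äöü].*", ["begrüße", "Haus"]))
```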
CQP regular expressions (selection)
RE         Description   Example query
&          conjunction   [word="walk" & pos="VB"]
|          disjunction   [pos="(PRP|WP)"], [pos="VB(P|Z)"]
!          negation      [word="walk" & pos!="VB"]
(. . . )   grouping      [A & (B | C)]
CQP syntax: token sequences
Token patterns can be combined into sequences, using regular-expression operators
Concatenation:
[pos="DT"] [pos="JJ"]
With repetition operators:
[pos="DT"]? [pos="JJ"]* [pos="NN.*"]
[pos="JJ.*"]{3}
With alternatives:
([pos="PP"] | [pos="DT"]) [pos="V.*"]
equivalent: ([pos="(PP|DT)"]) [pos="V.*"]
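To see why token sequences behave like regular expressions over whole tokens, one can write the POS column as a single string and translate a sequence query into an ordinary regex over that string. The Python sketch below does this for [pos="DT"]? [pos="JJ"]* [pos="NN.*"]. The tokens are hypothetical, and Python's `finditer` reports non-overlapping leftmost matches, which need not coincide with the match list CQP returns under its own matching strategy.

```python
import re

# Hypothetical tagged tokens (word, pos); not taken from the Dickens corpus.
tokens = [("the", "DT"), ("old", "JJ"), ("grey", "JJ"), ("house", "NN"),
          ("stood", "VBD"), ("empty", "JJ")]

# Encode the POS column as one string; every tag is followed by one space.
tag_string = "".join(pos + " " for _, pos in tokens)

# Character offset at which each token's tag starts, to map matches back to cpos.
starts = []
offset = 0
for _, pos in tokens:
    starts.append(offset)
    offset += len(pos) + 1

# [pos="DT"]? [pos="JJ"]* [pos="NN.*"] rewritten over the tag string.
# (?<!\S) forces a match to begin at the start of a tag.
sequence = re.compile(r"(?<!\S)(?:DT )?(?:JJ )*(?:NN\S* )")

found = []
for m in sequence.finditer(tag_string):
    first = starts.index(m.start())
    last = first + m.group().count(" ") - 1
    found.append((first, last))
    print(first, last, " ".join(word for word, _ in tokens[first:last + 1]))
```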
CQP exercises
Search for:
1. Words that contain exactly 5 o's (not necessarily adjacent)
2. Three adjacent adjectives
3. Words whose lemma form ends with -ity
4. Words ending with -atious that are not adjectives
5. A general NP
6. Ditransitive constructions
7. WH-extraction
8. V topicalization
CQP solutions from class I
1. Words that contain exactly 5 o's (not necessarily adjacent)
   (this search: at least 5 o's)
   [word="(.*o){5}.*"]
   (version with exactly 5 o's; [^o] means: any character other than 'o')
   [word="([^o]*o){5}[^o]*"]
2. Three adjacent adjectives
   [pos="J.*"]{3}
3. Words whose lemma form ends with -ity
   [lemma=".*ity"]
4. Words ending with -atious that are not adjectives
   [word=".*atious" & pos!="JJ.*"]
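The difference between the two regular expressions in solution 1 can be checked outside CQP. The Python sketch below applies both to a few made-up strings (real words with five o's are rare, so the test items are artificial), again using `re.fullmatch` to imitate whole-value matching.

```python
import re

at_least_five = re.compile(r"(.*o){5}.*")
exactly_five = re.compile(r"([^o]*o){5}[^o]*")

# Artificial test strings with 5, 6, 5 and 2 o's respectively.
candidates = ["ooooo", "oooooo", "oxoxoxoxo", "book"]

loose = [w for w in candidates if at_least_five.fullmatch(w)]
strict = [w for w in candidates if exactly_five.fullmatch(w)]

print("at least five:", loose)
print("exactly five: ", strict)
```

The first pattern lets .* absorb additional o's, so the string with six o's is accepted. The second allows only non-o characters between and around the five o's, so it is rejected.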
CQP solutions from class II
5. A general NP
   (this search: no postnominal modifiers included; $ is escaped because it is a regex metacharacter)
   ([pos="DT"] | [pos="PP\$"])? [pos="CD"]?
   ([pos="RB"]? [pos="J.*"])*
   [pos="N.*"] [pos="N.*"]?
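6. Ditransitive constructions
   (the NP alternative is cut off in the source after the second [pos="N.*; its ending is restored here from the NP pattern in 5)
   [pos="VB.*" & lemma!="([bB]e|[hH]ave|[dD]o)"]
   ((([pos="DT"] | [pos="PP\$"])? [pos="CD"]?
   ([pos="RB"]? [pos="J.*"])* [pos="N.*"] [pos="N.*"]?)
   | [pos="PP" & word!="(I|he|she|they|we)"]){2}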
7. WH-extraction
8. V topicalization
CQP macros
For expressions that you need over and over again, you can define macros
MACRO np(0)   # a very simple NP
  [pos = "DT"]
  [pos = "JJ"]?
  [pos = "NN"]
;
Call: /np[] /np[]   (two NPs in a row)
Note: macros have to be defined in a separate file and loaded into CQP (this does not work with the online demo)
Exercise: refine the query above so that it matches more realistic NPs (the result could be saved as a macro)
Further options
‘Distribution’ shows the match frequencies in different subcorpora
  shows whether a term is general or domain-/genre-specific
‘Frequencies’ shows the match frequencies
  especially interesting if more than one word form matches the query → individual frequencies are displayed
  e.g. a query for adjectives lists all adjective forms in the corpus, sorted by frequency
Further options
Menu item ‘Tools’ allows us to get frequencies of unigrams (‘Word list’) or general n-grams (‘Phrases’) of word forms, lemmas or POS tags
Word lists can be filtered by a simplified regex
  e.g. attribute=lemma, filter=*ity
  e.g. attribute=POS, filter=NN*
Phrases can be filtered by open/closed class, nouns, verbs, none
  e.g. attribute=POS, filter=none
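What these two tools compute can be imitated in a few lines of Python. The sketch below assumes that the simplified filter behaves like a shell-style wildcard (which is what *ity and NN* suggest, but this is an assumption about the demo interface). The functions `word_list` and `phrase_list` and the sample data are illustrative only.

```python
from collections import Counter
from fnmatch import fnmatchcase


def word_list(values, wildcard="*"):
    """Unigram frequencies of the values matching a wildcard, most frequent first."""
    counts = Counter(v for v in values if fnmatchcase(v, wildcard))
    return counts.most_common()


def phrase_list(values, n):
    """Frequencies of all n-grams over the given column, most frequent first."""
    grams = (tuple(values[i:i + n]) for i in range(len(values) - n + 1))
    return Counter(grams).most_common()


# Hypothetical lemma and POS columns.
lemmas = ["city", "be", "city", "charity", "pity", "be", "charity", "city"]
tags = ["DT", "JJ", "NN", "DT", "NN", "DT", "JJ", "NN"]

print(word_list(lemmas, "*ity"))   # attribute=lemma, filter=*ity
print(phrase_list(tags, 2))        # attribute=POS, bigrams, filter=none
```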
CQP demo corpus: Europarl
Go back to http://corpora.linguistik.uni-erlangen.de/demos/CQP/cqpdemo.html
Now select the option ‘CQP query’ for the EUROPARL corpus
EUROPARL
  a parallel corpus of the proceedings of the European Parliament
  here: 6 languages (EN, DE, FR, ES, IT, NL)
  40 million words per language
  pairwise sentence alignments → for cross-linguistic research
CQP demo corpus: Europarl
In the search window, specify the source language that you
would like to query (option ‘lang=. . . ’)
Tick the languages that you want to see aligned (boxes
EN, DE, etc.)
tick 2 languages, e.g. DE, FR
Enter a query and run it
CQP import
How do I import my data into CQP?
You need to install CWB, the Corpus Workbench
Your corpus has to be in a tab-separated format: one token per line, one column per annotation
There are scripts that create the binary files for the tool
Finally, import the data into the database
. . . and start CQP to query your data
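The tab-separated input can be produced with a short script. The Python sketch below writes one token per line with the columns word, pos and lemma, and wraps each sentence in `<s>` tags on their own lines. The function name `to_vertical`, the column order and the use of `<s>` are choices made for this illustration, and the subsequent encoding step with the CWB command-line tools (such as cwb-encode) is not shown.

```python
def to_vertical(sentences):
    """Render sentences as tab-separated 'vertical' text, one token per line."""
    lines = []
    for sentence in sentences:
        lines.append("<s>")                   # sentence start on its own line
        for word, pos, lemma in sentence:
            lines.append("\t".join((word, pos, lemma)))
        lines.append("</s>")                  # sentence end on its own line
    return "\n".join(lines) + "\n"


# Hypothetical input: a list of sentences, each a list of (word, pos, lemma).
sentences = [
    [("I", "PP", "I"), ("have", "VBP", "have"), ("endeavoured", "VBN", "endeavour")],
]

vertical = to_vertical(sentences)
print(vertical, end="")
```

The column order chosen here has to be the same one that is later declared when the corpus is encoded.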
ANNIS
Ref: Krause and Zeldes (2014)
Mainly developed by Anke Lüdeling’s group (Berlin)
Focus on deeply-annotated corpora rather than efficiency
Open source: http://corpus-tools.org/annis/
Demo website: https://corpling.uis.georgetown.edu/annis-corpora/
ANNIS demo corpus: GUM
Go to https://corpling.uis.georgetown.edu/annis-corpora/
Select the treebank GUM
A new window opens: the search window
Look at the metadata (click on “i”)
I first give an introduction to the ANNIS query language
Then there is time for practicing
ANNIS querying concept
ANNIS takes a different approach from CQP
You first specify individual constraints and then relate them:
  pos="NN"
  & lemma="saw"
  & #1 _=_ #2
‘#1’ refers to the expression specified in the first line/conjunct
“Take a noun (line 1), take the lemma ‘saw’ (line 2); both cover the same span (_=_)”
Abbreviated version:
  pos="NN" _=_ lemma="saw"
Elements can be named by variables:
  N1#pos="NNP" & N2#pos="NNP"
  & #N1 . #N2
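The "specify first, relate afterwards" idea can be sketched in Python: each constraint selects a set of annotations independently, and the relation then filters the pairs. The `Anno` tuple, the `search` and `same_span` helpers and the three-token example ("saw a saw") are hypothetical and only mirror the logic of the query, not how ANNIS evaluates it internally.

```python
from typing import NamedTuple


class Anno(NamedTuple):
    layer: str   # annotation name, e.g. "pos" or "lemma"
    value: str
    start: int   # index of the first covered token
    end: int     # index of the last covered token


# Hypothetical annotations for the tokens "saw a saw".
annos = [
    Anno("pos", "VBD", 0, 0), Anno("lemma", "see", 0, 0),
    Anno("pos", "DT", 1, 1), Anno("lemma", "a", 1, 1),
    Anno("pos", "NN", 2, 2), Anno("lemma", "saw", 2, 2),
]


def search(layer, value):
    """All annotations satisfying one constraint, found independently."""
    return [a for a in annos if a.layer == layer and a.value == value]


def same_span(a, b):
    """Counterpart of _=_ : both annotations cover the same tokens."""
    return (a.start, a.end) == (b.start, b.end)


# pos="NN" & lemma="saw" & #1 _=_ #2
first = search("pos", "NN")
second = search("lemma", "saw")
matches = [(a, b) for a in first for b in second if same_span(a, b)]
print(matches)
```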
Defining elements
For spelled-out feature values, use quotes: tok="on"
For regular expressions, use slashes: tok=/.*on/
tok denotes the underspecified token
node denotes underspecified nodes in trees
To get a first look at a corpus, search for tok or node
ANNIS operators (selection)
Ref: http://corpus-tools.org/annis/aql.html
Operator   Description
.          direct precedence
.*         indirect precedence
^          direct adjacency
^3,4       3–4 tokens away
_=_        identical coverage
_i_        inclusion
_o_        overlap
>          direct dominance
>*         indirect dominance
>@l        left-most child
>@r        right-most child
->LABEL    labeled pointing relation
$          siblings
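The span-based operators can be read as simple comparisons of token ranges. The Python sketch below gives one simplified reading of ., .*, _=_, _i_ and _o_ over inclusive token spans. The `Span` type and the function names are invented for this illustration, and the behaviour of ANNIS itself may differ in edge cases.

```python
from typing import NamedTuple


class Span(NamedTuple):
    start: int  # first covered token
    end: int    # last covered token (inclusive)


def directly_precedes(a, b):   # a . b
    return a.end + 1 == b.start


def precedes(a, b):            # a .* b
    return a.end < b.start


def identical(a, b):           # a _=_ b
    return a == b


def includes(a, b):            # a _i_ b
    return a.start <= b.start and b.end <= a.end


def overlaps(a, b):            # a _o_ b
    return a.start <= b.end and b.start <= a.end


# Hypothetical spans over a five-token sentence.
np_span, adj, verb, pp = Span(0, 2), Span(1, 1), Span(3, 3), Span(2, 4)

print(directly_precedes(np_span, verb))  # NP ends right before the verb
print(includes(np_span, adj))            # the adjective lies inside the NP
print(overlaps(np_span, pp))             # NP and PP share token 2
```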
ANNIS exercises
Search for:
1. Words that contain exactly 5 o's (not necessarily adjacent)
2. Words whose lemma form ends with -ity
3. Words ending with -atious that are not adjectives
4. A general NP
5. Three adjacent adjectives
6. Ditransitive constructions
7. WH-extraction
8. V topicalization
ANNIS solutions from class I
1. Words that contain exactly 5 o's (not necessarily adjacent)
   (this search: exactly 5; [^o] means: any character other than 'o')
   tok=/([^o]*o){5}[^o]*/
2. Words whose lemma form ends with -ity
   lemma=/.*ity/
3. Words ending with -atious that are not adjectives
   tok=/.*atious/ &
   pos != /JJ.*/ &
   #1 _=_ #2
4. A general NP
   cat="NP"
5. Three adjacent adjectives
ANNIS solutions from class II
6. Ditransitive constructions
   VP#cat="VP" > NP1#cat="NP" &
   #VP > NP2#cat="NP" &
   #NP1 .* #NP2
7. WH-extraction
8. V topicalization
References I
Christ, O., B. M. Schulze, A. Hofmann, and E. König (1999). The IMS Corpus Workbench: Corpus Query Processor (CQP) user’s manual. Technical report, IMS, University of Stuttgart, Germany.
Krause, T. and A. Zeldes (2014). ANNIS3: A new architecture for generic corpus query and visualization. Digital Scholarship in the Humanities. http://dsh.oxfordjournals.org/cgi/content/abstract/fqu057?ijkey=GJBr0LhNfKW1g8i&keytype=ref