Regular expressions and CQP syntax
Corpus Linguistics
Heike Zinsmeister
3.3.-5.3.2010
Remark
• These slides are a translation of the original German slides by Stefan Evert for a tutorial at DGfS 2010 (see next slide).
• They have been slightly adapted for a compact course in corpus linguistics at the University of Konstanz, 3.3.-5.3.2010.
Regular Expressions
Heike Zinsmeister
Stefan Evert
Stefanie Dipper
Berlin, 23.2.2010
Patterns for words
• Query for a set of words that follow a pattern
– Words that end in -ung or -ungen
– Acronyms like E.S.S.T. and S.O.S.
– Words with more than four adjacent consonants, etc.
– Numeral adjectives like 27-prozentig, 5-fach
– Are there any words with 6 o’s?
Regular Expressions
• Any character: . (instead of ?)
• Suffix/prefix: .*ung (instead of *ung)
• One or more characters: .+ (instead of +)
• Alternative: (auf|ab) (instead of [auf,ab])
• Regular expressions are compositional
– combining elementary operators results in complex queries
– (ha)+ → ha, haha, hahaha, …
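The operators above can be tried out in any regular expression engine. The following is a minimal sketch in Python, not part of the original slides. The word list is invented for illustration, and `re.fullmatch` is used on the assumption that a pattern has to cover the whole word, as it does for a corpus token.

```python
import re

# Hypothetical word list, invented for illustration only.
words = ["Zeitung", "Zeitungen", "aufgeben", "abgeben", "haha", "hah"]


def matching(pattern, candidates):
    """Return the candidates that the pattern matches from start to end."""
    return [w for w in candidates if re.fullmatch(pattern, w)]


ung_words = matching(r".*ung", words)                # suffix -ung
particle_verbs = matching(r"(auf|ab)geben", words)   # alternative (auf|ab)
laughter = matching(r"(ha)+", words)                 # repeated group

print(ung_words)        # ['Zeitung']
print(particle_verbs)   # ['aufgeben', 'abgeben']
print(laughter)         # ['haha']
```

Note that "Zeitungen" is not matched by `.*ung`, because the pattern must reach the end of the word; the slides' -ungen case needs a second alternative.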
RE: Individual Characters
• any character: .
• metacharacter taken literally: \., \?, …
• set of characters: [aeiou], [a-z], …
– Note: [a-z] doesn’t include umlauts and other non-English characters, e.g. ä, ö, ü, ß!
• negated set, e.g. exclusion of digits: [^0-9]
– matches any character that is not a digit (anything else!)
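As a small illustration of these character-level building blocks, here is a sketch in Python (chosen for this example; the slides themselves are tool-neutral). The test strings are invented, and the helper `full` is a hypothetical name.

```python
import re


def full(pattern, text):
    """True if the pattern matches the whole string."""
    return re.fullmatch(pattern, text) is not None


ascii_only = full(r"[a-z]+", "haus")             # True
umlaut_missed = full(r"[a-z]+", "häuser")        # False: ä is outside a-z
umlaut_listed = full(r"[a-zäöüß]+", "häuser")    # True: umlauts listed explicitly
non_digits = full(r"[^0-9]+", "a.b!")            # True: no digit anywhere
has_digit = full(r"[^0-9]+", "a1")               # False: "1" is excluded
literal_dot = full(r"\.", "a")                   # False: \. is a literal dot
any_char = full(r".", "a")                       # True: . matches any character
```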
RE: Repetition
• Operators for repetition
– (…)?      optional (zero or one)
– (…)*      arbitrarily many repetitions (including zero)
– (…)+      one or more repetitions
– (…){n}    exactly n repetitions
– (…){n,m}  at least n, at most m repetitions
– (…){n,}   n or more repetitions
RE: Repetition
• Operators apply to the element directly before them:
– .+        one or more arbitrary characters
– z+        z, zz, zzz, zzzz, zzzzz, …
– [0-9]+    one or more digits
– ha+       ha, haa, haaa, haaaa, …
– (ha)+     ha, haha, hahaha, …
– (ta|tü)+  ta, tü, tatü, tatütata, …
– (…)+      repetitions of a complex pattern (…)
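The difference between ha+ and (ha)+, and the counting operators, can be checked with a short script. This is a sketch in Python with invented test strings; `full` and `checks` are hypothetical names used only here.

```python
import re


def full(pattern, text):
    """True if the pattern matches the whole string, not just a part of it."""
    return re.fullmatch(pattern, text) is not None


checks = [
    (r"ha+", "haaa"),            # + applies only to the "a"
    (r"ha+", "haha"),            # so a repeated "ha" is not covered
    (r"(ha)+", "haha"),          # parentheses make + apply to "ha" as a unit
    (r"(ta|tü)+", "tatütata"),   # repetition of an alternative
    (r"z{3}", "zz"),             # exactly three z are required
    (r"z{2,4}", "zzzz"),         # between two and four z
    (r"z{2,}", "zzzzz"),         # two or more z
    (r"(ha)?", ""),              # optional: zero occurrences are fine
]

results = [full(pattern, text) for pattern, text in checks]
for (pattern, text), ok in zip(checks, results):
    print(f"{pattern!r:14} {text!r:12} {ok}")
```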
RE: Example
• Looking for acronyms like S.O.S.
– split the pattern into elementary patterns
– 2 or more repetitions of A., B., C., D., E., … = [A-Z]\.
– Operator: (…){2,}
– combined: ([A-Z]\.){2,}
• Corpus query (EUROPARL-DE): I.F., W.G., S.A., O.K., U.S., S.O.S., …
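The combined pattern can be tested outside a corpus tool as well. The sketch below uses Python; the token list is a small invented sample (only S.O.S., O.K. and U.S. appear on the slide), not output from EUROPARL-DE.

```python
import re

# The pattern built up step by step above: a capital letter plus a literal
# dot, repeated two or more times.
ACRONYM = re.compile(r"([A-Z]\.){2,}")

# Hypothetical sample tokens; "A.", "SOS", "z.B." and "e.V." are invented
# counterexamples.
tokens = ["S.O.S.", "O.K.", "U.S.", "A.", "SOS", "z.B.", "e.V."]

acronyms = [t for t in tokens if ACRONYM.fullmatch(t)]
print(acronyms)  # ['S.O.S.', 'O.K.', 'U.S.']
```

"A." fails because of the {2,} lower bound, and "z.B." and "e.V." fail because [A-Z] does not cover the lower-case first letter.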
Advantages of Regular Expressions
• For the user: complex patterns with few metacharacters
• For the computer: reduction to the metacharacters |, * and (…)
• Implementation with finite state automata (FSA) is very efficient
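To give an idea of why automata are efficient, here is a hand-built deterministic automaton for the pattern (ha)+, sketched in Python. It is an illustration only: the state numbering and the names `TRANSITIONS`, `ACCEPTING` and `accepts` are assumptions made for this example, and real tools compile such automata from the expression automatically.

```python
# States: 0 = start, 1 = just read "h", 2 = just read "ha" (accepting).
# A missing entry in the table means "no transition", so the input is rejected.
TRANSITIONS = {
    (0, "h"): 1,
    (1, "a"): 2,
    (2, "h"): 1,
}
ACCEPTING = {2}


def accepts(text):
    """Run the automaton once over the text, one table lookup per character."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING


print(accepts("ha"))      # True
print(accepts("hahaha"))  # True
print(accepts("hah"))     # False
print(accepts(""))        # False
```

The input is read exactly once from left to right, so the running time grows linearly with the length of the word, regardless of how complex the original expression was.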
CQP query syntax
CQP
• CQP is the Corpus Query Processor of the IMS Open Corpus Workbench (CWB)
– fast search on large text corpora with linguistic annotation
• http://cwb.sourceforge.net/
CQP & regular expressions
• regular expressions on character level
– "([A-Z]\.){2,}"
– "[0-9]+-[a-z]+" %cd for numeral comp.
• %c ignores case
• %d ignores diacritics, so the pattern also matches umlauts and accented characters
• character-level expressions apply only to individual words!
• regular expressions on word level
– e.g. PP = Prep (Det)? ((Adv)? Adj)* N
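The effect of the %c and %d flags can be imitated roughly outside CQP. The following Python sketch is an approximation under stated assumptions, not CQP's actual implementation: it folds case with `re.IGNORECASE` and removes diacritics from the token via Unicode decomposition. The function names and the token "3-jährig" are invented for illustration.

```python
import re
import unicodedata


def strip_diacritics(text):
    """Remove combining marks, e.g. ä -> a, é -> e (ß is left unchanged)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def match_token(pattern, token, ignore_case=False, ignore_diacritics=False):
    """Match the pattern against one whole token, with optional folding."""
    if ignore_diacritics:
        token = strip_diacritics(token)
    flags = re.IGNORECASE if ignore_case else 0
    return re.fullmatch(pattern, token, flags) is not None


PATTERN = r"[0-9]+-[a-z]+"

print(match_token(PATTERN, "27-prozentig"))                      # True
print(match_token(PATTERN, "5-Fach"))                            # False
print(match_token(PATTERN, "5-Fach", ignore_case=True))          # True
print(match_token(PATTERN, "3-jährig"))                          # False
print(match_token(PATTERN, "3-jährig", ignore_diacritics=True))  # True
```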
CQP & table format
cpos   word             pos     lemma
0      Um               KOUI    um
1      den              ART     d
2      linguistischen   ADJA    linguistisch
3      Reichtum         NN      Reichtum
4      zu               PTKZU   zu
5      beweisen         VVINF   beweisen
6      ,                $,      ,
7      welchen          PRELS   welch
…      …                …       …
46     .                $.      .
CQP syntax
• Token pattern […] → row in table
– Access to arbitrary annotation: [pos = "VV.*"], [lemma = ".*ung"]
– "[0-9]+" is short for [word = "[0-9]+"]
– logical connectors (Boolean operators): & (and), | (or), ! (not), != (does not match)
– e.g. [lemma = "unter.*" & pos = "VV.*"]
– direct comparison of attributes: [lemma != word]
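One way to picture a token pattern is as a test that is applied to each row of the table. The Python sketch below imitates this with rows 0 to 6 of the example sentence; it is a toy model, not how CQP stores or searches a corpus, and the helpers `attr` and `find` are hypothetical names.

```python
import re

# Rows 0 to 6 of the example table: (word, pos, lemma). The list index plays
# the role of the corpus position (cpos).
rows = [
    ("Um", "KOUI", "um"),
    ("den", "ART", "d"),
    ("linguistischen", "ADJA", "linguistisch"),
    ("Reichtum", "NN", "Reichtum"),
    ("zu", "PTKZU", "zu"),
    ("beweisen", "VVINF", "beweisen"),
    (",", "$,", ","),
]
tokens = [dict(zip(("word", "pos", "lemma"), row)) for row in rows]


def attr(name, pattern):
    """Build a token test in the spirit of [name = "pattern"]."""
    regex = re.compile(pattern)
    return lambda tok: regex.fullmatch(tok[name]) is not None


def find(test):
    """Return the corpus positions of all tokens that pass the test."""
    return [cpos for cpos, tok in enumerate(tokens) if test(tok)]


# [pos = "VV.*"]
verbs = find(attr("pos", "VV.*"))
# [lemma = ".*en" & pos = "VV.*"]
en_verbs = find(lambda t: attr("lemma", ".*en")(t) and attr("pos", "VV.*")(t))
# [lemma != word]
inflected = find(lambda t: t["lemma"] != t["word"])

print(verbs)      # [5]
print(en_verbs)   # [5]
print(inflected)  # [0, 1, 2]
```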
CQP syntax
• Regular expressions for token patterns
– […] corresponds to a character (set), [] corresponds to . („match all“)
– Operators: ?, *, +, {m,n}; alternatives (…|…|…), which can be nested
– Example: simple NP with an NN ending in -ung
– [pos = "ART"]? [pos = "ADJA"]* [pos = "NN" & lemma = ".+ung"]
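The NP query can be read as a regular expression whose "characters" are whole tokens. The Python sketch below spells this out by hand for this one query. It is an illustration under assumptions: the tagged sentence is invented, the function names are hypothetical, and the sketch reports leftmost, non-overlapping matches, which may differ from CQP's own matching strategy.

```python
import re

# Hypothetical tagged sentence (STTS tags), invented for illustration.
tokens = [
    {"word": "Die", "pos": "ART", "lemma": "d"},
    {"word": "neue", "pos": "ADJA", "lemma": "neu"},
    {"word": "Regierung", "pos": "NN", "lemma": "Regierung"},
    {"word": "plant", "pos": "VVFIN", "lemma": "planen"},
    {"word": "eine", "pos": "ART", "lemma": "ein"},
    {"word": "Reform", "pos": "NN", "lemma": "Reform"},
    {"word": ".", "pos": "$.", "lemma": "."},
]


def match_np(toks, start):
    """Return the end position (exclusive) of an NP starting at start, or None."""
    i = start
    if i < len(toks) and toks[i]["pos"] == "ART":       # [pos = "ART"]?
        i += 1
    while i < len(toks) and toks[i]["pos"] == "ADJA":   # [pos = "ADJA"]*
        i += 1
    # [pos = "NN" & lemma = ".+ung"]
    if (i < len(toks) and toks[i]["pos"] == "NN"
            and re.fullmatch(r".+ung", toks[i]["lemma"])):
        return i + 1
    return None


def find_nps(toks):
    """Collect leftmost, non-overlapping NP matches as strings."""
    hits, i = [], 0
    while i < len(toks):
        end = match_np(toks, i)
        if end is None:
            i += 1
        else:
            hits.append(" ".join(t["word"] for t in toks[i:end]))
            i = end
    return hits


print(find_nps(tokens))  # ['Die neue Regierung']
```

Greedy matching without backtracking is sufficient here because the three token tests (ART, ADJA, NN) are mutually exclusive; a general query needs a proper automaton.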