Regular Expressions in Haskell

(Ares)
Juan Pedro Villa Isaza
(Modified: August 13, 2012)
Ares: Contents
- Introduction
  - Motivation
  - Applications of regular expressions
  - Regular expressions in UNIX
  - Mechanics of regular expressions
- Regular expressions in Haskell
- Miscellany
- Bibliography
Ares: Introduction
Motivation motivation
- $ egrep -i '\<([a-z]+)\s+\1\>' ares.tex
  \framesubtitle{Motivation motivation}
  ...
- $ egrep -i '\<([a-z]+)\s+\1\>' pg11.txt pg12.txt
  pg12.txt:...that the WHITE kitten had had nothing to do with
  pg12.txt:had had quite a long argument with her sister only...
  pg12.txt:...singing it a long long time!’
  pg12.txt:(They had had quite enough of the subject of age...
  pg12.txt:...The name really IS "THE AGED AGED
  pg12.txt:...I saw an aged aged man,
Ares: Introduction
Motivation
To master regular expressions is to master your data.
Once you learn regular expressions, you’ll realize that
they’re an invaluable part of your toolkit, and you’ll
wonder how you could ever have gotten by without them.
Regular expressions are easy!
...
Jeffrey E. F. Friedl
Ares: Introduction
Applications of regular expressions
A regular expression that gives a “picture” of the pattern
we want to recognize is the medium of choice for
applications that search for patterns in text.
The regular expressions are then compiled, behind the
scenes, into deterministic or nondeterministic automata,
which are then simulated to produce a program that
recognizes patterns in text.
...
John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman
Ares: Introduction
Regular expressions in UNIX: egrep
- $ man egrep
GREP(1)                                                  GREP(1)
NAME
grep, egrep, fgrep, rgrep - print lines matching a pattern
SYNOPSIS
grep [OPTIONS] PATTERN [FILE...]
grep [OPTIONS] [-e PATTERN | -f FILE] [FILE...]
DESCRIPTION
grep searches the named input FILEs (or standard input if no
files are named, or if a single hyphen-minus (-) is given as
file name) for lines containing a match to the given PATTERN.
By default, grep prints the matching lines.
...
Ares: Introduction
Regular expressions in UNIX: egrep metacharacters
Items to match a single character
- ‘.’ (dot) matches any one character.
  Exercise
  - If Σ = {0, 1}, ‘.’ stands for the regular expression... 0 + 1.
- ‘[···]’ (character class) matches any one character listed.
  Examples
  - ‘[01]’ stands for the regular expression 0 + 1.
  - The digits can be expressed ‘[0-9]’.
  - The lowercase letters can be expressed ‘[a-z]’.
Ares: Introduction
Regular expressions in UNIX: egrep metacharacters
Items to match a single character
- ‘[^···]’ (negated character class) matches any one character not listed.
  Example
  - $ egrep -i 'j[^aeiou]' pg11.txt pg12.txt
    occasional exclamation of ‘Hjckrrh!’...
- When char is a metacharacter, or the escaped combination is not otherwise special, ‘\char’ (escaped character) matches the literal char.
Ares: Introduction
Regular expressions in UNIX: egrep metacharacters
Items appended to provide “counting”
- ‘?’ (question) means “one allowed, but it is optional.”
  Example
  - $ egrep -i 'colou?r' pg11.txt pg12.txt
    right colour, and...
    ‘I don’t care about the colour,’ the...
- ‘*’ (star) means “any number allowed, but all are optional.”
  Exercises
  - ‘u?’ is the same as the regular expression... ε + u.
  - If Σ = {0, 1}, ‘.*’ stands for the regular expression... (0 + 1)*.
Ares: Introduction
Regular expressions in UNIX: egrep metacharacters
Items appended to provide “counting”
- ‘+’ (plus) means “at least one required, additional are optional.”
  Exercise
  - ‘[0-9]+’ stands for the regular expression... (0 + ··· + 9)(0 + ··· + 9)*.
- ‘{min,max}’ (specified range) means “min required, max allowed.”
  Example
  - ‘[0-9]{3,4}’ matches ‘2011’ in ‘august 25, 2011’.
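The exercises above translate egrep metacharacters into the formal operators (ε, +, concatenation, and *). The following is a minimal sketch of that translation, not part of the original slides. The type Regex and the helpers oneOf, optional, plus, and accepts are hypothetical names chosen for illustration. Matching is done with Brzozowski derivatives and tests the whole string, whereas egrep searches for a match anywhere in a line.

```haskell
-- Formal regular expressions over Char (illustrative names only).
data Regex
  = Empty              -- matches nothing
  | Eps                -- matches the empty string (ε)
  | Sym Char           -- a single symbol
  | Alt Regex Regex    -- r + s
  | Cat Regex Regex    -- rs
  | Star Regex         -- r*

-- egrep-style metacharacters expressed with the formal operators.
oneOf :: [Char] -> Regex      -- [abc]  =  a + b + c
oneOf = foldr (Alt . Sym) Empty

optional :: Regex -> Regex    -- r?  =  ε + r
optional r = Alt Eps r

plus :: Regex -> Regex        -- r+  =  r r*
plus r = Cat r (Star r)

-- Does the expression accept the empty string?
nullable :: Regex -> Bool
nullable Empty     = False
nullable Eps       = True
nullable (Sym _)   = False
nullable (Alt r s) = nullable r || nullable s
nullable (Cat r s) = nullable r && nullable s
nullable (Star _)  = True

-- Brzozowski derivative: what remains to be matched after reading c.
deriv :: Char -> Regex -> Regex
deriv _ Empty     = Empty
deriv _ Eps       = Empty
deriv c (Sym d)   = if c == d then Eps else Empty
deriv c (Alt r s) = Alt (deriv c r) (deriv c s)
deriv c (Cat r s)
  | nullable r = Alt (Cat (deriv c r) s) (deriv c s)
  | otherwise  = Cat (deriv c r) s
deriv c (Star r)  = Cat (deriv c r) (Star r)

-- Whole-string acceptance: take derivatives symbol by symbol.
accepts :: Regex -> String -> Bool
accepts r = nullable . foldl (flip deriv) r

main :: IO ()
main = do
  let digits = plus (oneOf ['0' .. '9'])   -- [0-9]+
      colour = foldr1 Cat                  -- colou?r
        [Sym 'c', Sym 'o', Sym 'l', Sym 'o', optional (Sym 'u'), Sym 'r']
  print (accepts digits "2011")    -- True
  print (accepts digits "")        -- False
  print (accepts colour "color")   -- True
  print (accepts colour "colour")  -- True
```

The sketch covers only the operators that have a formal counterpart; anchors, word boundaries, and backreferences are deliberately left out.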
Ares: Introduction
Regular expressions in UNIX: egrep metacharacters
Items that match a position
- ‘^’ (caret) matches the position at the start of the line.
- ‘$’ (dollar) matches the position at the end of the line.
- ‘\<’ (word boundary) matches the position at the start of a word.
- ‘\>’ (word boundary) matches the position at the end of a word.
  Example
  - ‘\<begin\>’ matches ‘begin’ in ‘begin at the beginning’.
  Exercise
  - How would egrep interpret ‘^begin$’, ‘^$’, or ‘^’?
Ares: Introduction
Regular expressions in UNIX: egrep metacharacters
Other
- ‘|’ (alternation) matches either expression it separates.
  Example
  - ‘0|1’ stands for the regular expression 0 + 1.
- ‘(···)’ (parentheses) limits scope of alternation, provides grouping for the quantifiers, and “captures” for backreferences.
- ‘\1’, ‘\2’, etc. (backreference), matches text previously matched within first, second, etc., set of parentheses.
Ares: Introduction
Mechanics of regular expressions
Rule 1
- The match that begins earliest (leftmost) wins.
  Example
  - When matching ‘curious’ against ‘curiouser and curiouser!’, the match is in the first ‘curiouser’.
  Exercise
  - Where does ‘head|her|with’ match in the string ‘off with her head!’?
    - ‘with’.
Ares: Introduction
Mechanics of regular expressions
Rule 2
- The standard quantifiers (?, *, +, and {min,max}) are greedy.
  Examples
  - ‘[0-9]+’ matches ‘27’ in ‘january 27’.
  - ‘^.*([0-9][0-9])’ matches ‘chapter 11’ in ‘chapter 11. who stole the tarts?’ after more than 20 cycles.
  Exercise
  - Where does ‘^.*([0-9]+)’ match in the string ‘copyright 2004’?
    - ‘copyright 2004’.
Ares: Introduction
Mechanics of regular expressions
Exercise
- Where does ‘".*"’ match in the string
  ‘"i see what i eat" is the same thing as "i eat what i see"!’?
  - ‘"i see what i eat" is the same thing as "i eat what i see"’.
- Where does ‘"[^"]*"’ match in the string
  ‘"i see what i eat" is the same thing as "i eat what i see"!’?
  - ‘"i see what i eat"’ and ‘"i eat what i see"’.
Ares: Introduction
Mechanics of regular expressions
- “Understanding how the regex engine really works is the key to really understanding.”
- For practical purposes, there are three types of regex engines:
  - DFA (POSIX or not)
  - Traditional NFA
  - POSIX NFA
NFA engine: regex-directed
- When matching ‘to(nite|night)’ against ‘tonight’, control moves within the regex from component to component.
Ares: Introduction
Mechanics of regular expressions
DFA engine: text-directed
- When matching ‘to(nite|night)’ against ‘tonight’, the engine keeps track of all matches “currently in the works.”
  [Diagram: a caret marks the current position in the text ‘tonight’ and the corresponding position in the regex ‘to(nite|night)’; further along in the text, two positions in the regex are marked at once.]
- Each character scanned from the text controls the engine.
- Rule 1: The longest leftmost match wins.
  $ echo caterpillar | egrep -o 'cat(erpillar)?'
  caterpillar
  $ echo caterpillar | egrep -o 'cat|caterpillar'
  caterpillar
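The difference between the two versions of Rule 1 can be sketched with a toy matcher. This is an illustration under stated assumptions, not how egrep or any Haskell backend is implemented: the type Regex and the functions match, firstMatch, and longestMatch are hypothetical, only literals, alternation, concatenation, and ‘?’ are supported, and matching is anchored at the start of the input. A traditional NFA engine is modelled as taking the first successful attempt in regex order, and a DFA or POSIX engine as taking the longest of all possible attempts.

```haskell
import Data.List (isPrefixOf)

-- A tiny regex language (illustrative names only).
data Regex
  = Lit String         -- literal text
  | Alt Regex Regex    -- r|s
  | Cat Regex Regex    -- rs
  | Opt Regex          -- r?

-- Every way the expression can match a prefix of the input, given as the
-- unconsumed remainder, in the order a backtracking engine would try them.
match :: Regex -> String -> [String]
match (Lit s) input
  | s `isPrefixOf` input = [drop (length s) input]
  | otherwise            = []
match (Alt r s) input = match r input ++ match s input
match (Cat r s) input = concatMap (match s) (match r input)
match (Opt r)   input = match r input ++ [input]   -- greedy: try r first

-- Regex-directed (traditional NFA): the first successful attempt wins.
firstMatch :: Regex -> String -> Maybe String
firstMatch r input = case match r input of
  (rest : _) -> Just (take (length input - length rest) input)
  []         -> Nothing

-- Text-directed (DFA) or POSIX: the longest of all attempts wins.
longestMatch :: Regex -> String -> Maybe String
longestMatch r input = case match r input of
  []    -> Nothing
  rests -> Just (take (length input - minimum (map length rests)) input)

main :: IO ()
main = do
  let alternation = Alt (Lit "cat") (Lit "caterpillar")      -- cat|caterpillar
      optionalEnd = Cat (Lit "cat") (Opt (Lit "erpillar"))   -- cat(erpillar)?
  print (firstMatch   alternation "caterpillar")  -- Just "cat"
  print (longestMatch alternation "caterpillar")  -- Just "caterpillar"
  print (firstMatch   optionalEnd "caterpillar")  -- Just "caterpillar"
  print (longestMatch optionalEnd "caterpillar")  -- Just "caterpillar"
```

Only the alternation distinguishes the two strategies here: the greedy ‘?’ already tries the longer option first, so both strategies agree on ‘cat(erpillar)?’.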
Ares: Introduction
Mechanics of regular expressions: Theory versus reality
In theory, NFA and DFA engines should match exactly
the same text and have exactly the same features.
In practice, the desire for richer, more expressive regular
expressions has caused their semantics to diverge.
Jeffrey E. F. Friedl
Ares: Regular expressions in Haskell
Backends
- regex-posix (0.95.2)
- regex-tdfa (1.1.8)
- regex-pcre (0.94.4)
- pcre-light (0.4)
- regex-parsec, regex-tre, regex-dfa, ...
- regexpr, regexpr-symbolic, ...
Ares: Regular expressions in Haskell
Backends
- $ cabal install regex-posix
  $ cabal install regex-tdfa
  $ cabal install ...
  $ ghci
  Prelude> :set prompt "ghci> "
  ghci> :module +Text.Regex.Posix
- (ghci> :module +Data.ByteString.Char8)
Ares: Regular expressions in Haskell
The (=~) operator
(=~)
- ghci> :type (=~)
  (=~)
    :: (RegexMaker Regex CompOption ExecOption source,
        RegexContext Regex source1 target) =>
       source1 -> source -> target
- ((=~) :: text -> regex -> result)
Ares: Regular expressions in Haskell
The (=~) operator
(=~)
- ghci> "poor little thing!" =~ "little"
  <interactive>:1:22:
      No instance for (RegexContext Regex [Char] target0)
        arising from a use of ‘=~’
      Possible fix:
        add an instance declaration for (RegexContext Regex [Char] target0)
      In the expression: "poor little thing!" =~ "little"
      In an equation for ‘it’: it = "poor little thing!" =~ "little"
- ghci> :info RegexContext
  class RegexLike regex source => RegexContext regex source target where
    match :: regex -> source -> target
    matchM :: Monad m => regex -> source -> m target
      -- Defined in Text.Regex.Base.RegexLike
  instance RegexContext Regex String String
    -- Defined in Text.Regex.Posix.String
  ...
Ares: Regular expressions in Haskell
Instances for the first match
Bool
- With a result of type Bool, (=~) returns True if there is a match.
- ghci> let text = "poor little thing!"
  ghci> text =~ "little" :: Bool
  True
- ghci> text =~ "LITTLE" :: Bool
  False
Ares: Regular expressions in Haskell
Instances for the first match
String
- With a result of type String, (=~) returns the text of the whole match.
- ghci> let text = "a mad tea party"
  ghci> let regex = "[aeiou][aeiou]"
  ghci> text =~ regex :: String
  "ea"
- ghci> text =~ "alice" :: String
  ""
- ghci> text =~ "" :: String
  ""
Ares: Regular expressions in Haskell
Instances for the first match
ByteString
- With a result of type ByteString, (=~) returns the text of the whole match.
- ghci> let text = "a mad tea party"
  ghci> let regex = "[aeiou][aeiou]"
  ghci> pack text =~ regex :: ByteString
  "ea"
- ghci> text =~ regex :: ByteString
  <interactive>:1:6:
      No instance for (RegexContext Regex [Char] ByteString)
        arising from a use of ‘=~’
      Possible fix:
        add an instance declaration for
        (RegexContext Regex [Char] ByteString)
      In the expression: text =~ regex :: ByteString
      In an equation for ‘it’: it = text =~ regex :: ByteString
Ares: Regular expressions in Haskell
Instances for the first match
(MatchOffset, MatchLength)
- With a result of type (MatchOffset, MatchLength) (i.e., (Int, Int)), (=~) returns the initial index and length of the whole match.
- ghci> let text = "contrariwise"
  ghci> text =~ "wise" :: (MatchOffset, MatchLength)
  (8,4)
- ghci> text =~ "WISE" :: (Int, Int)
  (-1,0)
Ares: Regular expressions in Haskell
Instances for the first match
(String, String, String)
- With a result of type (String, String, String), (=~) returns the text before the match, the text of the match, and the text after the match.
- ghci> let text = "it's my own invention"
  ghci> text =~ "own" :: (String, String, String)
  ("it's my ","own"," invention")
- ghci> text =~ "OWN" :: (String, String, String)
  ("it's my own invention","","")
Ares: Regular expressions in Haskell
Instances for all the matches
Int
- With a result of type Int, (=~) returns the number of matches.
- ghci> let text = "twinkle, twinkle, winkle, twinkle"
  ghci> text =~ "t?winkle" :: Int
  4
  Exercise
  - ghci> text =~ ".*" :: Int
    2
  - Which ones?
Ares: Regular expressions in Haskell
Instances for all the matches
[(MatchOffset, MatchLength)]
- A result of type [(MatchOffset, MatchLength)] is a list of results of type (MatchOffset, MatchLength).
- ghci> let text = "tweedledum and tweedledee"
  ghci> let regex = "tweedle"
  ghci> getAllMatches (text =~ regex) :: [(MatchOffset, MatchLength)]
  [(0,7),(15,7)]
- ghci> let text = "i'm sure i'm very sorry"
  ghci> getAllMatches (text =~ regex) :: [(Int, Int)]
  []
Ares: Regular expressions in Haskell
Instances for all the matches
[String]
- A result of type [String] is a list of results of type String.
- ghci> let text = "THE KING AND QUEEN OF HEARTS"
  ghci> let regex = "QUEEN|KING"
  ghci> getAllTextMatches (text =~ regex) :: [String]
  ["KING","QUEEN"]
- ghci> let regex = "queen|king"
  ghci> getAllTextMatches (text =~ regex) :: [String]
  []
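Outside ghci, the result type of (=~) is usually fixed by an ordinary type signature rather than an inline annotation. The following is a minimal sketch of that style, assuming the regex-posix package is installed as on the Backends slide. The function names isMatch, firstSpan, countMatches, and allMatches are hypothetical wrappers introduced here; the inputs and expected results are the ones from the ghci sessions in this section.

```haskell
import Text.Regex.Posix ((=~), getAllTextMatches, MatchOffset, MatchLength)

-- Each signature selects a different RegexContext instance for (=~).

isMatch :: String -> String -> Bool
isMatch text regex = text =~ regex

firstSpan :: String -> String -> (MatchOffset, MatchLength)
firstSpan text regex = text =~ regex

countMatches :: String -> String -> Int
countMatches text regex = text =~ regex

allMatches :: String -> String -> [String]
allMatches text regex = getAllTextMatches (text =~ regex)

main :: IO ()
main = do
  print (isMatch "poor little thing!" "little")                       -- True
  print (firstSpan "contrariwise" "wise")                             -- (8,4)
  print (countMatches "twinkle, twinkle, winkle, twinkle" "t?winkle") -- 4
  print (allMatches "THE KING AND QUEEN OF HEARTS" "QUEEN|KING")      -- ["KING","QUEEN"]
```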
Ares: Contents
- Introduction
- Regular expressions in Haskell
- Miscellany
  - Regular expressions in Java
  - Curiouser and curiouser?
- Bibliography
Ares: Miscellany
Regular expressions in Java
Example
$ cat ares.java
import java.util.regex.*;

class ares {
  public static void main(String[] args) {
    String text = "why is a raven like a writing-desk?";
    String regex = "(a|an)\\s+\\w+";
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(text);
    // Print each match with its offset and length.
    while (matcher.find()) {
      String string = matcher.group();
      int matchOffset = matcher.start();
      int matchLength = matcher.end() - matchOffset;
      System.out.println("\"" + string + "\" (" + matchOffset + ", " + matchLength + ").");
    }
  }
}
$ javac ares.java
$ java ares
"a raven" (7, 7).
"a writing" (20, 9).
Ares: Miscellany
Curiouser and curiouser?
Download Project Gutenberg’s Alice’s Adventures in Wonderland (pg11.txt) and Through the Looking-Glass (pg12.txt), by Lewis Carroll, and play around with regular expressions in Haskell!
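One possible starting point is a small egrep-like program. This is a sketch, not part of the original slides, and it assumes the regex-posix package from the Backends slide is installed. The function grepLines and the program name AresGrep are hypothetical. Unlike the egrep -i commands earlier, matching here is case-sensitive.

```haskell
import System.Environment (getArgs)
import Text.Regex.Posix ((=~))

-- Lines of the input that contain a match for the pattern.
-- filter fixes the result type of (=~) to Bool.
grepLines :: String -> String -> [String]
grepLines regex = filter (=~ regex) . lines

main :: IO ()
main = do
  args <- getArgs
  case args of
    [regex, file] -> readFile file >>= mapM_ putStrLn . grepLines regex
    _             -> putStrLn "usage: AresGrep REGEX FILE"
```

Saved as AresGrep.hs (a hypothetical file name), it could be run as: runghc AresGrep.hs 'colou?r' pg11.txt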
Ares: Bibliography (1)
Jeffrey E. F. Friedl.
Mastering Regular Expressions.
O’Reilly, 3rd edition, 2006.
Tony Stubblebine.
Regular Expression Pocket Reference.
O’Reilly, 2003.
Ares: Bibliography (2)
John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman.
Introduction to Automata Theory, Languages, and
Computation, pages 109-114.
Addison-Wesley, 3rd edition, 2006.
Bryan O’Sullivan, John Goerzen, and Don Stewart.
Real World Haskell, chapter 8 (File processing, regular
expressions, and filename matching), pages 193-212.
O’Reilly, 2008.
http://books.realworldhaskell.org/
Ares: Bibliography (3)
Lewis Carroll.
Alice’s Adventures in Wonderland.
http://www.gutenberg.org/ebooks/11
Lewis Carroll.
Through the Looking-Glass.
http://www.gutenberg.org/ebooks/12
Sir John Tenniel.
The Tenniel Illustrations for Carroll’s Alice in Wonderland.
http://www.gutenberg.org/ebooks/114
Ares: Links
GitHub repository:
https://github.com/jpvillaisaza/Ares
PDF:
http://goo.gl/Ilsaj