Regular Expressions in Haskell (Ares)
Juan Pedro Villa Isaza
(Modified: August 13, 2012)

Ares: Contents
- Introduction
  - Motivation
  - Applications of regular expressions
  - Regular expressions in UNIX
  - Mechanics of regular expressions
- Regular expressions in Haskell
- Miscellany
- Bibliography

Ares: Introduction - Motivation motivation
    $ egrep -i '\<([a-z]+)\s+\1\>' ares.tex
    \framesubtitle{Motivation motivation}
    ...
    $ egrep -i '\<([a-z]+)\s+\1\>' pg11.txt pg12.txt
    pg12.txt:...that the WHITE kitten had had nothing to do with
    pg12.txt:had had quite a long argument with her sister only...
    pg12.txt:...singing it a long long time!'
    pg12.txt:(They had had quite enough of the subject of age...
    pg12.txt:...The name really IS "THE AGED AGED
    pg12.txt:...I saw an aged aged man,

Ares: Introduction - Motivation
"To master regular expressions is to master your data. Once you learn regular expressions, you'll realize that they're an invaluable part of your toolkit, and you'll wonder how you could ever have gotten by without them. Regular expressions are easy! ..."
Jeffrey E. F. Friedl

Ares: Introduction - Applications of regular expressions
"A regular expression that gives a “picture” of the pattern we want to recognize is the medium of choice for applications that search for patterns in text. The regular expressions are then compiled, behind the scenes, into deterministic or nondeterministic automata, which are then simulated to produce a program that recognizes patterns in text. ..."
John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman

Ares: Introduction - Regular expressions in UNIX: egrep
    $ man egrep
    GREP(1)                                                    GREP(1)
    NAME
        grep, egrep, fgrep, rgrep - print lines matching a pattern
    SYNOPSIS
        grep [OPTIONS] PATTERN [FILE...]
        grep [OPTIONS] [-e PATTERN | -f FILE] [FILE...]
    DESCRIPTION
        grep searches the named input FILEs (or standard input if no
        files are named, or if a single hyphen-minus (-) is given as
        file name) for lines containing a match to the given PATTERN.
        By default, grep prints the matching lines.
    ...

Ares: Introduction - Regular expressions in UNIX: egrep metacharacters
Items to match a single character
- ‘.’ (dot) matches any one character.
  Exercise
  - If Σ = {0, 1}, ‘.’ stands for the regular expression... 0 + 1.
- ‘[···]’ (character class) matches any one character listed.
  Examples
  - ‘[01]’ stands for the regular expression 0 + 1.
  - The digits can be expressed ‘[0-9]’.
  - The lowercase letters can be expressed ‘[a-z]’.
- ‘[^···]’ (negated character class) matches any one character not listed.
  Example
      $ egrep -i 'j[^aeiou]' pg11.txt pg12.txt
      occasional exclamation of 'Hjckrrh!'...
- When char is a metacharacter, or the escaped combination is not otherwise special, ‘\char’ (escaped character) matches the literal char.
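The single-character items above can also be tried from Haskell. The following is a minimal sketch, assuming the regex-posix package (module Text.Regex.Posix, introduced later in these slides) is installed; the helper name firstMatch is hypothetical and only wraps the (=~) operator with a String result.

```haskell
import Text.Regex.Posix

-- | Text of the first match of a POSIX extended regex, or "" if none.
-- (Hypothetical helper; it only fixes the result type of (=~) to String.)
firstMatch :: String -> String -> String
firstMatch regex text = text =~ regex

main :: IO ()
main = do
  putStrLn (firstMatch "." "01")                -- dot: prints 0
  putStrLn (firstMatch "[0-9]" "chapter 7")     -- character class: prints 7
  putStrLn (firstMatch "j[^aeiou]" "Hjckrrh!")  -- negated class: prints jc
  putStrLn (firstMatch "\\." "a.b")             -- escaped dot: prints .
```

Note that the backslash of an escaped character has to be doubled inside a Haskell string literal.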
Ares: Introduction - Regular expressions in UNIX: egrep metacharacters
Items appended to provide “counting”
- ‘?’ (question) means “one allowed, but it is optional.”
  Example
      $ egrep -i 'colou?r' pg11.txt pg12.txt
      right colour, and...
      'I don't care about the colour,' the...
- ‘*’ (star) means “any number allowed, but all are optional.”
  Exercises
  - ‘u?’ is the same as the regular expression... ε + u.
  - If Σ = {0, 1}, ‘.*’ stands for the regular expression... (0 + 1)*.
- ‘+’ (plus) means “at least one required, additional are optional.”
  Exercise
  - ‘[0-9]+’ stands for the regular expression... (0 + ··· + 9)(0 + ··· + 9)*.
- ‘{min,max}’ (specified range) means “min required, max allowed.”
  Example
  - ‘[0-9]{3,4}’ matches ‘2011’ in ‘august 25, 2011’.

Ares: Introduction - Regular expressions in UNIX: egrep metacharacters
Items that match a position
- ‘^’ (caret) matches the position at the start of the line.
- ‘$’ (dollar) matches the position at the end of the line.
- ‘\<’ (word boundary) matches the position at the start of a word.
- ‘\>’ (word boundary) matches the position at the end of a word.

Example
- ‘\<begin\>’ matches ‘begin’ in ‘begin at the beginning’.
Exercise
- How would egrep interpret ‘^begin$’, ‘^$’, or ‘^’?

Ares: Introduction - Regular expressions in UNIX: egrep metacharacters
Other
- ‘|’ (alternation) matches either expression it separates.
  Example
  - ‘0|1’ stands for the regular expression 0 + 1.
- ‘(···)’ (parentheses) limits scope of alternation, provides grouping for the quantifiers, and “captures” for backreferences.
- ‘\1’, ‘\2’, etc. (backreference), matches text previously matched within first, second, etc., set of parentheses.

Ares: Introduction - Mechanics of regular expressions
Rule 1
- The match that begins earliest (leftmost) wins.
Example
- When matching ‘curious’ against ‘curiouser and curiouser!’, the match is in the first ‘curiouser’.
Exercise
- Where does ‘head|her|with’ match in the string ‘off with her head!’?
  - ‘with’.

Ares: Introduction - Mechanics of regular expressions
Rule 2
- The standard quantifiers (?, *, +, and {min,max}) are greedy.
Examples
- ‘[0-9]+’ matches ‘27’ in ‘january 27’.
- ‘^.*([0-9][0-9])’ matches ‘chapter 11’ in ‘chapter 11. who stole the tarts?’ after more than 20 cycles.
Exercise
- Where does ‘^.*([0-9]+)’ match in the string ‘copyright 2004’?
  - ‘copyright 2004’.
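Rules 1 and 2 can be checked from Haskell as well. This is a minimal sketch, assuming the regex-posix package is installed; firstMatchParts is a hypothetical helper that asks (=~) for the text before the match, the match itself, the text after it, and the captured groups.

```haskell
import Text.Regex.Posix

-- | (before, whole match, after, captured groups) for the first match.
-- (Hypothetical helper; it only fixes the result type of (=~).)
firstMatchParts :: String -> String -> (String, String, String, [String])
firstMatchParts regex text = text =~ regex

main :: IO ()
main = do
  -- Rule 1: 'with' starts earlier in the text than 'her' or 'head'.
  print (firstMatchParts "head|her|with" "off with her head!")
  -- prints ("off ","with"," her head!",[])

  -- Rule 2: the greedy '+' takes both digits.
  print (firstMatchParts "[0-9]+" "january 27")
  -- prints ("january ","27","",[])

  -- Rule 2 again: the greedy '.*' leaves only one digit for the group.
  print (firstMatchParts "^.*([0-9]+)" "copyright 2004")
  -- prints ("","copyright 2004","",["4"])
```

The last call shows why the whole match is ‘copyright 2004’ while the parentheses capture only the final digit: ‘.*’ takes as much as it can and gives back just enough for ‘[0-9]+’ to succeed.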
Ares: Introduction - Mechanics of regular expressions
Exercise
- Where does ‘".*"’ match in the string ‘"i see what i eat" is the same thing as "i eat what i see"!’?
  - ‘"i see what i eat" is the same thing as "i eat what i see"’.
- Where does ‘"[^"]*"’ match in the string ‘"i see what i eat" is the same thing as "i eat what i see"!’?
  - ‘"i see what i eat"’ and ‘"i eat what i see"’.

Ares: Introduction - Mechanics of regular expressions
- “Understanding how the regex engine really works is the key to really understanding.”
- For practical purposes, there are three types of regex engines:
  - DFA (POSIX or not)
  - Traditional NFA
  - POSIX NFA
NFA engine: regex-directed
- When matching ‘to(nite|night)’ against ‘tonight’, control moves within the regex from component to component.

Ares: Introduction - Mechanics of regular expressions
DFA engine: text-directed
- When matching ‘to(nite|night)’ against ‘tonight’, the engine keeps track of all matches “currently in the works.”
  [Diagram: two snapshots of the match, each marking the current position in the text ‘tonight’ and in the regex ‘to(nite|night)’; the first snapshot marks one position in the regex, the second marks two.]
- Each character scanned from the text controls the engine.
- Rule 1: The longest leftmost match wins.
      $ echo caterpillar | egrep -o 'cat(erpillar)?'
      caterpillar
      $ echo caterpillar | egrep -o 'cat|caterpillar'
      caterpillar

Ares: Introduction - Mechanics of regular expressions: Theory versus reality
"In theory, NFA and DFA engines should match exactly the same text and have exactly the same features. In practice, the desire for richer, more expressive regular expressions has caused their semantics to diverge."
Jeffrey E. F. Friedl

Ares: Regular expressions in Haskell - Backends
- regex-posix (0.95.2)
- regex-tdfa (1.1.8)
- regex-pcre (0.94.4)
- pcre-light (0.4)
- regex-parsec, regex-tre, regex-dfa, ...
- regexpr, regexpr-symbolic, ...

Ares: Regular expressions in Haskell - Backends
    $ cabal install regex-posix
    $ cabal install regex-tdfa
    $ cabal install ...
    $ ghci
    Prelude> :set prompt "ghci> "
    ghci> :module +Text.Regex.Posix
    (ghci> :module +Data.ByteString.Char8)

Ares: Regular expressions in Haskell - The (=~) operator
    ghci> :type (=~)
    (=~)
      :: (RegexMaker Regex CompOption ExecOption source,
          RegexContext Regex source1 target) =>
         source1 -> source -> target
- ((=~) :: text -> regex -> result)

Ares: Regular expressions in Haskell - The (=~) operator
    ghci> "poor little thing!" =~ "little"
    <interactive>:1:22:
        No instance for (RegexContext Regex [Char] target0)
          arising from a use of ‘=~’
        Possible fix:
          add an instance declaration for
          (RegexContext Regex [Char] target0)
        In the expression: "poor little thing!" =~ "little"
        In an equation for ‘it’: it = "poor little thing!"
          =~ "little"

    ghci> :info RegexContext
    class RegexLike regex source =>
          RegexContext regex source target where
      match :: regex -> source -> target
      matchM :: Monad m => regex -> source -> m target
            -- Defined in Text.Regex.Base.RegexLike
    instance RegexContext Regex String String
      -- Defined in Text.Regex.Posix.String
    ...

Ares: Regular expressions in Haskell - Instances for the first match
Bool
- With a result of type Bool, (=~) returns True if there is a match.
      ghci> let text = "poor little thing!"
      ghci> text =~ "little" :: Bool
      True
      ghci> text =~ "LITTLE" :: Bool
      False

Ares: Regular expressions in Haskell - Instances for the first match
String
- With a result of type String, (=~) returns the text of the whole match.
      ghci> let text = "a mad tea party"
      ghci> let regex = "[aeiou][aeiou]"
      ghci> text =~ regex :: String
      "ea"
      ghci> text =~ "alice" :: String
      ""
      ghci> text =~ "" :: String
      ""

Ares: Regular expressions in Haskell - Instances for the first match
ByteString
- With a result of type ByteString, (=~) returns the text of the whole match.
      ghci> let text = "a mad tea party"
      ghci> let regex = "[aeiou][aeiou]"
      ghci> pack text =~ regex :: ByteString
      "ea"
      ghci> text =~ regex :: ByteString
      <interactive>:1:6:
          No instance for (RegexContext Regex [Char] ByteString)
            arising from a use of ‘=~’
          Possible fix:
            add an instance declaration for
            (RegexContext Regex [Char] ByteString)
          In the expression: text =~ regex :: ByteString
          In an equation for ‘it’: it = text =~ regex :: ByteString

Ares: Regular expressions in Haskell - Instances for the first match
(MatchOffset, MatchLength)
- With a result of type (MatchOffset, MatchLength) (i.e., (Int, Int)), (=~) returns the initial index and length of the whole match.
      ghci> let text = "contrariwise"
      ghci> text =~ "wise" :: (MatchOffset, MatchLength)
      (8,4)
      ghci> text =~ "WISE" :: (Int, Int)
      (-1,0)

Ares: Regular expressions in Haskell - Instances for the first match
(String, String, String)
- With a result of type (String, String, String), (=~) returns the text before the match, the text of the match, and the text after the match.
      ghci> let text = "it's my own invention"
      ghci> text =~ "own" :: (String, String, String)
      ("it's my ","own"," invention")
      ghci> text =~ "OWN" :: (String, String, String)
      ("it's my own invention","","")

Ares: Regular expressions in Haskell - Instances for all the matches
Int
- With a result of type Int, (=~) returns the number of matches.
      ghci> let text = "twinkle, twinkle, winkle, twinkle"
      ghci> text =~ "t?winkle" :: Int
      4
Exercise
      ghci> text =~ ".*" :: Int
      2
- Which ones?

Ares: Regular expressions in Haskell - Instances for all the matches
[(MatchOffset, MatchLength)]
- A result of type [(MatchOffset, MatchLength)] is a list of results of type (MatchOffset, MatchLength).
      ghci> let text = "tweedledum and tweedledee"
      ghci> let regex = "tweedle"
      ghci> getAllMatches (text =~ regex) :: [(MatchOffset, MatchLength)]
      [(0,7),(15,7)]
      ghci> let text = "i'm sure i'm very sorry"
      ghci> getAllMatches (text =~ regex) :: [(Int, Int)]
      []

Ares: Regular expressions in Haskell - Instances for all the matches
[String]
- A result of type [String] is a list of results of type String.
      ghci> let text = "THE KING AND QUEEN OF HEARTS"
      ghci> let regex = "QUEEN|KING"
      ghci> getAllTextMatches (text =~ regex) :: [String]
      ["KING","QUEEN"]
      ghci> let regex = "queen|king"
      ghci> getAllTextMatches (text =~ regex) :: [String]
      []

Ares: Contents
- Introduction
- Regular expressions in Haskell
- Miscellany
  - Regular expressions in Java
  - Curiouser and curiouser?
- Bibliography

Ares: Miscellany - Regular expressions in Java
Example
    $ cat ares.java
    import java.util.regex.*;

    class ares {
        public static void main(String[] args) {
            String text = "why is a raven like a writing-desk?";
            String regex = "(a|an)\\s+\\w+";
            Pattern pattern = Pattern.compile(regex);
            Matcher matcher = pattern.matcher(text);
            while (matcher.find()) {
                String string = matcher.group();
                int matchOffset = matcher.start();
                int matchLength = matcher.end() - matchOffset;
                System.out.println("\"" + string + "\" (" + matchOffset
                    + ", " + matchLength + ").");
            }
        }
    }
    $ javac ares.java
    $ java ares
    "a raven" (7, 7).
    "a writing" (20, 9).

Ares: Miscellany - Curiouser and curiouser?
Download Project Gutenberg's Alice's Adventures in Wonderland (pg11.txt) and Through the Looking-Glass (pg12.txt), by Lewis Carroll, and play around with regular expressions in Haskell!

Ares: Bibliography
- Jeffrey E. F. Friedl. Mastering Regular Expressions. O'Reilly, 3rd edition, 2006.
- Tony Stubblebine. Regular Expression Pocket Reference. O'Reilly, 2003.
- John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation, pages 109-114. Addison-Wesley, 3rd edition, 2006.
- Bryan O'Sullivan, John Goerzen, and Don Stewart. Real World Haskell, chapter 8 (File processing, regular expressions, and filename matching), pages 193-212. O'Reilly, 2008. http://books.realworldhaskell.org/
- Lewis Carroll. Alice's Adventures in Wonderland. http://www.gutenberg.org/ebooks/11
- Lewis Carroll. Through the Looking-Glass. http://www.gutenberg.org/ebooks/12
- Sir John Tenniel. The Tenniel Illustrations for Carroll's Alice in Wonderland. http://www.gutenberg.org/ebooks/114

Ares: Links
- GitHub repository: https://github.com/jpvillaisaza/Ares
- PDF: http://goo.gl/Ilsaj
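As a starting point for the invitation in “Curiouser and curiouser?”, here is a possible Haskell counterpart to the Java example in the Miscellany section. It is only a sketch, assuming the regex-posix package is installed: the names matchesWithSpans and articleRegex are hypothetical, and the POSIX bracket classes [[:space:]] and [[:alnum:]_] are swapped in for Java's \s and \w, which are not part of standard POSIX extended syntax.

```haskell
import Text.Regex.Posix

-- | Every match of the regex in the text, paired with its (offset, length).
-- (Hypothetical helper built from getAllTextMatches and getAllMatches.)
matchesWithSpans :: String -> String -> [(String, (MatchOffset, MatchLength))]
matchesWithSpans regex text =
  zip (getAllTextMatches (text =~ regex)) (getAllMatches (text =~ regex))

-- | An article followed by whitespace and a word, written with POSIX classes.
articleRegex :: String
articleRegex = "(a|an)[[:space:]]+[[:alnum:]_]+"

main :: IO ()
main =
  -- Print one line per match, in the same shape as the Java program's output.
  mapM_ report (matchesWithSpans articleRegex "why is a raven like a writing-desk?")
  where
    report (s, (offset, len)) =
      putStrLn (show s ++ " (" ++ show offset ++ ", " ++ show len ++ ").")
```

For the sample sentence this should report "a raven" at (7, 7) and "a writing" at (20, 9), the same spans as the Java program. To try it on pg11.txt or pg12.txt, the text could be read with readFile and passed to matchesWithSpans in place of the literal.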