NFA

CS510
Compiler
Lecture 2
Phase Ordering of Front-Ends
• Lexical analysis (lexer)
• Break input string into “words” called tokens
• Syntactic analysis (parser)
• Recover structure from the text and put it in a parse tree
• Semantic Analysis
• Discover “meaning” (e.g., type-checking)
• Prepare for code generation
• Works with a symbol table
What is a Token?
• A syntactic category
• In English:
• Noun, verb, adjective, …
• In a programming language:
• Identifier, Integer, Keyword, White-space, …
• A token corresponds to a set of strings
Terms
• Token
• Syntactic “atoms” that are “terminal” symbols in the
grammar from the source language
• A data structure (or pointer to it) returned by lexer
• Pattern
• A “rule” that defines strings corresponding to a token
• Lexeme
• A string in the source code that matches a pattern
Terms
Token classes or pattern correspond to sets of strings.
Identifier:
–strings of letters or digits, starting with a letter
Integer:
a non-empty string of digits
Keyword:
–“else” or “if” or “begin” or …
Whitespace:
–a non-empty sequence of blanks, newlines, and tabs
An implementation must do two things:
1.Recognize substrings corresponding to tokens
The lexemes
2.Identify the token class of each lexeme
An Example of these Terms
• int foo = 100;
• The lexeme matched by the pattern for the token
represents a string of characters in the source program
that can be treated as a lexical unit
Example
x = x*c^2
<token class or Pattern, lexeme> = token
<id,x>
<assign-op ,=>
<id,x>
<mult-op,*>
<id,c>
<exp-op,^>
<num,2>
Lexical Analysis Problem
• Partition input string of characters into disjoint
substrings which are tokens
• Useful tokens here: identifier, keyword, relop,
integer, white space, (, ), =, ;
Designing Lexical Analyzer
• First, define a set of tokens
• Tokens should describe all items of interest
• Choice of tokens depends on the language and the design of the
parser
• Then, describe what strings belongs to each token by
providing a pattern for it
Main points
1.The goal is to partition the string. This is implemented
by reading left-to-right, recognizing one token at a time
Partition the input string into lexemes
Identify the token of each lexeme
2.“Lookahead” may be required to decide where one token
ends and the next token begins
Review
• Left-to-right scan, sometimes requiring lookahead
• We still need
• A way to describe the lexemes of each token: pattern
• A way to resolve ambiguities
• Is “==“ two equal signs “=“ “=“ or a single relational op?
13
The Scanning Process
• Input: source code as a stream of characters
• Output: meaningful units called Tokens
• Keywords: such as IF and THEN which represent the
strings of characters “if” and “then”
• Symbols: such as PLUS and MINUS which represent
the characters “+” and “-”
• Variable tokens: such as ID and NUM which represent
identifiers and numbers
14
The Scanning Process
• Although the task of the scanner is to convert the
entire source program into a sequence of tokens,
the scanner will rarely do this all at once
• Instead, the scanner will operate under the
control of the parser, returning the single next
token from the input on demand via a function
that will have the C declaration:
TokenType getToken();
15
The Scanning Process
• The scanning process is a special case of pattern
matching, so we need to study methods of:
• Pattern specification
We will use regular expressions for pattern specification
• Pattern recognition
We will use finite automata for recognizing patterns
• We turn now to the study of regular expressions
and finite automata
TOPICS FOR SCANNING
PROCESS
17
Topics
•
•
Regular Expressions
Finite Automata
•
Implementation of DFA
Regular Expression
• Regular Expression
• Regular Language
• Regular Definition
Regular Expressions
• There are several formalisms for specifying tokens but the
most popular one is “regular languages”
• A regular expression is an expression that matches sets
of strings (the “language” of the regular expression)
Formal Language Theory
• Alphabet ∑ : a finite set of symbols (characters)
• Ex: {a,b}, an ASCII character set
• String: a finite sequence of symbols over ∑
• Ex: abab, aabb, a over {a,b}; “hello” over ASCII
• Empty string є: zero-length string
• є≠Ø≠ {є}
• Language: a set of strings over ∑
• Ex: {a, b, abab} over {a,b}
• Ex: a set of all valid C programs over ASCII
21
Regular Expressions
• In its basic form, a regular expression is built up out of
basic expressions (individual symbols) and the
operations:
• choice (|),
• concatenation (no operator), and
• repetition (*)
Operations on Strings
• Concatenation (·):
• a · b = ab, “hello” · ”there” = “hellothere”
• Denoted by α · β = αβ
• Exponentiation:
• hello3 = hello · hello · hello = hellohellohello, hello0 = є
23
Regular Expressions (Continued)
• Examples of Regular Expressions
• a(b|c)* : strings beginning with a, followed by any number of b’s
and c’s
• ab|c* : the string ab, or the strings , c, cc, ….
• (a|c)*b(a|c)*: strings contain exactly one b
• (b|)(a|ab)* : strings of a’s and b’s with no two consecutive b’s
Regular Languages over ∑
• Definition of regular languages over ∑
• Ø is regular
• {a} is regular
• {є} is regular
• R∪S is regular if R, S are regular
• R·S is regular if R, S are regular
• Nothing else
Regular Expressions
• Easy way to express a language that is accepted by
FSA
• Rules:
• ε is a regular expression
• Any symbol in Σ is a regular expression
If r and s are any regular expressions then so is:
• r|s denotes union e.g. “r or s”
• rs denotes r followed by s (concatination)
• (r)* denotes concatination of r with itself zero or more times
(Kleene closer)
• () used for controlling order of operations
26
Regular Expressions (Continued)
• Common Extensions of R.E. operations:
• + : one or more repetitions of
(r+ is equivalent to rr*)
• [] : range of characters
([abcd] = [a-d] = a|b|c|d)
• . : any character
(.*b.*: strings that contain at least one b)
• ^ : negate a set
([^abc] = any character except a, b, or c)
• ? : optional sub-expression
((+|-)?[0-9] = [0-9] | +[0-9] | -[0-9])
• \ : “escape” an operator or meta-symbol
Example
reg.exp: a|b  language : {a,b}
(a|b)(a|b)  {aa,ab, ba, bb}
a*  {€,a, aa, aaa, …..}
You Try
• What is the language described by each Regular
Expression?
a*
(a|b)*
a|a*b
(a|b)(a|b)
aa|ab|ba|bb
(+|-|ε)(0|1|2|3|4|5|6|7|8|9)*
Exercise
Which of the following matches regexp
/a(ab)*a/
1)
2)
3)
4)
5)
abababa
aaba
aabbaa
aba
aabababa
• 2 and 5
Which of the following matches
Reg-exp : /a.[bc]+/
1)
2)
3)
4)
5)
6)
abc
abbbbbbbb
azc
abcbcbcbbc
ac
asccbbbbcbcccc
•1
RE Shorthands
• r? = r|Є (zero or one instance of r)
• r+ = r·r* (positive closure)
• Charater class: [abc] = a|b|c, [a-z] = a|b|c|…|z
• Ex: ([ab]c?)+ = {a, b, aa, ab, ac, ba, bb, bc,…}
Regular Definitions
If Σ is an alphabet of basic symbols, then a regular
definition is a sequence of definitions of the form:
D1 → R1
D2 → R2
…
1. Each di is a new symbol not in Σ and
Dn → Rn
not the same as any other of the d’s.
2. Each ri is a regular expression over
Σ U (d1,d2,…,di-1)
Regular Definitions Example
Example C identifiers:
Σ = ASCII
letter_
digit →
id
→
a|b|c|…|z|A|B|C|…|Z|_
0|1|2|…|9
→
letter_(letter_|digit)*
Regular Definitions Example
Example Unsigned Numbers (integer or float):
Σ = ASCII
digit
→
digits
→
optionalFraction →
optionalExponent →
number
→
0|1|2|…|9
digit digit*
. digits | ε
(E(+|-| ε)digits)| ε
digits optionalFraction optionalExponent
37
FINITE AUTOMATE
38
Patterns for token
• digit  [0-9]
• digits  digit+
• Letter  [A –Z a-z]
• If if
• then then
• else else
• Relop <|<=|=|<>|>|>=
• Idletter(letter|digit)*
• Numdigits+ (.digits+)?(E(+|-)?digits+)?
• ws (blank|tab|newline)+
39
Transition Diagram
• A transition diagram is a stylized flowchart.
• Transition diagram is used to keep track of information
about characters that are seen as the forward pointer
scans the input.
• We do so by moving from position to position in the
diagrams as characters are read.
40
Transition Diagram
• Have a collection of circles
called states .
• Each state represents a
condition that could occur
during the process of
scanning the input looking for
a lexeme that matches one
of several patterns .
•
Start state
• Accept state
• Any state
41
Transition Diagram
• Edge : represent transitions from one state to another as
caused by the input
42
43
Finite automata: why?
• Regular expressions = specification
• Finite automata = implementation
44
Definition : Finite Automaton
A finite automaton is a 5 - tupple (Q, ,  , q , F ), where
0
1. Q is a finite set called the states,
2.  is a finite set called the alphabet ,
3.  :QQ is the transition function,
4. q0Q is the start state, and
5. F  Q is the set of accept states.
45
Exercise
• A finite automaton that accepts only “a”
stat
input
1
a
2 Is
accepted
b
?
a
1
2
• Language of FS is the set of accepted states
46
Finite Automata
• Two types – both describe what are called
regular languages
• Deterministic (DFA) – There is a fixed
number of states and we can only be in one
state at a time
• Nondeterministic (NFA) –There is a fixed
number of states but we can be in multiple
states at one time
• NFA and DFA stand for nondeterministic
finite automaton and deterministic finite
automaton, respectively.
47
Finite Automata
48
State Diagram
1
0
q1
1
0
q2
q3
0, 1
49
Finite Automata
• Transition
• s1 a s2
Is read
• In state s1 on input a go to state s2
• •If end of input and in accepting state => accept
• •Otherwise => reject
50
Data Representation
1. Q  {q1 , q2 , q3 }
2.   {0,1}
3.  is described as
4. q1 is the start state, and
5. F  {q2 }.
0
1
q1
q1
q2
q2
q3
q2
q3
q2
q2
,
51
NFA
52
Test youself
• Construct an NFA to accept all strings terminating in 01
53
Answer
54
Test yourself 2
• Construct an NFA to accept those strings containing three
consecutive zeroes
55
Answer 2
56
DFA
57
DFA
• Is special case of NFA where:
• There are no moves on input Ԑ
• For each state s and input symbol a , there exactly one edge out of
a labeled a.
58
DFA
• Deterministic Finite Automata (DFA)
• One transition per input per state
• No Ԑ-moves
• Nondeterministic Finite Automata (NFA)
• Can have multiple transitions for one input in a given state
• Can have -moves
59
Test yourself (DFA1)
1- Construct a DFA to accept a string containing a zero
followed by a one
60
Answer
1-
61
Test yourself (DFA2)
• 2- Construct a DFA to accept a string containing two
consecutive zeroes followed by two consecutive ones
62
answer
• 2-
63
Test yourself (DFA3)
3- Construct a DFA to accept a string containing even
number of zeroes and any number of ones
64
Answer
•3
65
Test yourself (DFA4)
• 4- Construct a DFA to accept all strings which do not
contain three consecutive zeroes
66
Answer
• 4-
67
Why NFA and DFA?
•
From Regular Expression to DFA
•
•
From Regular Expression to NFA
From NFA to DFA