CS 404
Introduction to Compiler Design
Lecture 2
Ahmed Ezzat
Finite Automata, Context Free Grammar (CFG)

Finite Automata
Evaluate regular expressions
Recognize certain languages and reject others
Two kinds of FA:
– Non-deterministic FA (NFA)
– Deterministic FA (DFA)

A Finite Automaton Consists of
An input alphabet, e.g., Σ = {a, b, …}
A set of states, e.g., S = {s0, s1, s2, …}
A set of transitions from states to states, labeled by elements of Σ or ε
A start state, e.g., s0
A set of final states, e.g., F = {s1, s2}

FA and Language
An FA accepts a string x if and only if there is some path in the transition graph from the start state to a final state such that the edge labels along this path spell x.
The set of strings an FA accepts is said to be the language defined by this FA.

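The acceptance rule can be sketched in Python by tracking every state reachable along some path whose labels spell the input read so far. The dictionary encoding, the use of the empty string as the ε label, and the example automaton (a hypothetical NFA for a*b) are assumptions made for illustration; they do not come from the lecture.

```python
EPSILON = ""  # assumed label for ε-transitions in this sketch


def epsilon_closure(states, transitions):
    """Return every state reachable from `states` using only ε-edges."""
    closure, stack = set(states), list(states)
    while stack:
        state = stack.pop()
        for target in transitions.get((state, EPSILON), ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


def accepts(transitions, start, finals, text):
    """True if some path from `start` to a final state spells `text`."""
    current = epsilon_closure({start}, transitions)
    for symbol in text:
        moved = set()
        for state in current:
            moved |= set(transitions.get((state, symbol), ()))
        current = epsilon_closure(moved, transitions)
    return bool(current & set(finals))


# Hypothetical NFA for a*b: s0 loops on a, s0 --ε--> s1, s1 --b--> s2.
EXAMPLE = {
    ("s0", "a"): {"s0"},
    ("s0", EPSILON): {"s1"},
    ("s1", "b"): {"s2"},
}

if __name__ == "__main__":
    for word in ["b", "aab", "a", "ba"]:
        print(repr(word), accepts(EXAMPLE, "s0", {"s2"}, word))
```

Keeping a set of current states avoids enumerating paths one by one; the string is accepted exactly when that set contains a final state after the last symbol.
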
NFA
DFA and NFA
What we have defined is called an NFA
A DFA is a special case of an NFA:
– No state has an ε transition.
– For each state S and input symbol a, there is at most one edge labeled a leaving S.

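The two DFA conditions can be checked mechanically. The sketch below assumes a transition table stored as a dictionary from (state, label) to a set of target states, with the empty string standing for ε; both example tables are hypothetical.

```python
EPSILON = ""  # assumed label for ε-transitions in this sketch


def is_dfa(transitions):
    """transitions maps (state, label) to a set of target states."""
    for (state, label), targets in transitions.items():
        if label == EPSILON and targets:
            return False  # violates "no state has an ε transition"
        if len(targets) > 1:
            return False  # two edges with the same label leave one state
    return True


# Hypothetical examples: the first has two a-edges leaving s0.
NFA = {("s0", "a"): {"s0", "s1"}, ("s0", "b"): {"s0"}, ("s1", "b"): {"s2"}}
DFA = {("q0", "a"): {"q1"}, ("q0", "b"): {"q0"}, ("q1", "b"): {"q2"}}

if __name__ == "__main__":
    print(is_dfa(NFA), is_dfa(DFA))
```
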
NFA to DFA
[Figure: transition graph of an example NFA with states 1 to 4 and edges labeled a, b, c, and ε]
[Table: chart representing the graph]
[Figure: the equivalent DFA, with combined states such as {2,1} and {4,3}, shown with sample input strings (abab, cc, cb, caa, ccab, ccacc, ccac, ab, abacab) and Yes/No acceptance labels]

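One common way to carry out the NFA to DFA conversion is the subset construction, in which each DFA state is a set of NFA states. The sketch below is not a reconstruction of the slide's example: the NFA used here (for (a|b)*ab), the dictionary encoding, and the function names are all assumptions made for illustration.

```python
from collections import deque

EPSILON = ""  # assumed label for ε-transitions in this sketch


def epsilon_closure(states, nfa):
    """States reachable from `states` through ε-edges only."""
    closure, stack = set(states), list(states)
    while stack:
        state = stack.pop()
        for target in nfa.get((state, EPSILON), ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return frozenset(closure)


def nfa_to_dfa(nfa, start, finals, alphabet):
    """Subset construction: each DFA state is a frozenset of NFA states."""
    dfa_start = epsilon_closure({start}, nfa)
    dfa, seen, queue = {}, {dfa_start}, deque([dfa_start])
    while queue:
        subset = queue.popleft()
        for symbol in alphabet:
            moved = set()
            for state in subset:
                moved |= set(nfa.get((state, symbol), ()))
            if not moved:
                continue  # no edge; the dead state is left implicit
            target = epsilon_closure(moved, nfa)
            dfa[(subset, symbol)] = target
            if target not in seen:
                seen.add(target)
                queue.append(target)
    dfa_finals = {subset for subset in seen if subset & set(finals)}
    return dfa, dfa_start, dfa_finals


def dfa_accepts(dfa, start, finals, text):
    """Follow the single possible edge per symbol; reject if none exists."""
    state = start
    for symbol in text:
        state = dfa.get((state, symbol))
        if state is None:
            return False
    return state in finals


# Hypothetical NFA for (a|b)*ab.
NFA = {("s0", "a"): {"s0", "s1"}, ("s0", "b"): {"s0"}, ("s1", "b"): {"s2"}}
DFA, DFA_START, DFA_FINALS = nfa_to_dfa(NFA, "s0", {"s2"}, "ab")

if __name__ == "__main__":
    for (subset, symbol), target in DFA.items():
        print(sorted(subset), symbol, "->", sorted(target))
```

Only subsets reachable from the start state are generated, so the result is usually far smaller than the full power set of NFA states.
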
NFA, DFA and Regular Expressions
A DFA is an NFA (without ε)
Each NFA can be converted into a DFA
One can construct an NFA from a regular expression
FAs are used by the lexical analyzer to recognize tokens

Syntax Analysis
Syntax analysis is also called parsing
Create hierarchical structures (parse trees)
Use “grammars” to define the structures
Compared with the lexer, the parser only accepts syntactically correct sentences

Grammars
A grammar is a formal way to specify a set of valid sentences in a language L
– Just like a regular expression is a formal way to define a token in a language L
A syntax analyzer (or parser) is a software tool that recognizes all valid sentences in L
– Just like a lexical analyzer is a software tool that recognizes all valid lexemes in a language L

Context Free Grammars (CFG)
A context free grammar has four components:
– A set of terminal symbols, e.g., T = {a, b, …}
– A set of non-terminal symbols, e.g., N = {S, A, B, …}
– A set of productions, each consisting of a non-terminal on the left side and a string of terminals and non-terminals on the right-hand side, e.g., A → aB
– A start symbol, which is a non-terminal, e.g., S

Formal Definition of a CFG
There is a finite set of symbols that form the strings, i.e., there is a finite alphabet. The alphabet symbols are called terminals (think of a parse tree: the terminals are its leaves).
There is a finite set of variables, sometimes called non-terminals or syntactic categories. Each variable represents a language (i.e., a set of strings).
One of the variables is the start symbol. Other variables may exist to help define the language.
There is a finite set of productions, or production rules, that represent the recursive definition of the language. Each production rule is defined as follows:
1. Has a single variable that is being defined to the left of the production symbol
2. Has the production symbol →
3. Has a string of zero or more terminals or variables, called the body of the production. To form strings, we substitute a production body for its variable wherever that variable appears.

CFG Notations
A CFG G may then be represented by these four components, denoted G = (V, T, R, S)
– V is the set of variables
– T is the set of terminals
– R is the set of production rules (also written P)
– S is the start symbol

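The four components map naturally onto a small record type. The following is a sketch under stated assumptions: the class name `CFG`, its field names, and the `problems` helper are hypothetical, and production bodies are stored as tuples of symbols so that the empty tuple can stand for ε. The example instance encodes the grammar P → ε | 0 | 1 | 0P0 | 1P1 from the sample parse tree slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CFG:
    variables: frozenset  # V
    terminals: frozenset  # T
    rules: tuple          # R: pairs (head, body), body is a tuple of symbols
    start: str            # S

    def problems(self):
        """List violations of the definition; an empty list means well-formed."""
        found = []
        if self.variables & self.terminals:
            found.append("V and T overlap")
        if self.start not in self.variables:
            found.append("start symbol is not a variable")
        symbols = self.variables | self.terminals
        for head, body in self.rules:
            if head not in self.variables:
                found.append(f"head {head!r} is not a variable")
            for symbol in body:
                if symbol not in symbols:
                    found.append(f"unknown symbol {symbol!r} in body of {head!r}")
        return found


# P → ε | 0 | 1 | 0P0 | 1P1, with () as the ε body.
PALINDROMES = CFG(
    variables=frozenset({"P"}),
    terminals=frozenset({"0", "1"}),
    rules=(("P", ()), ("P", ("0",)), ("P", ("1",)),
           ("P", ("0", "P", "0")), ("P", ("1", "P", "1"))),
    start="P",
)

if __name__ == "__main__":
    print(PALINDROMES.problems())
```
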
Sample CFG
1. E → I // Expression is an identifier
2. E → E+E // Add two expressions
3. E → E*E // Multiply two expressions
4. E → (E) // Add parentheses
5. I → L // Identifier is a Letter
6. I → ID // Identifier + Digit
7. I → IL // Identifier + Letter
8. D → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 // Digits
9. L → a | b | c | … | A | B | … | Z // Letters
Note: identifiers are regular; they could be described as (letter)(letter + digit)*

Recursive Inference
The process of coming up with strings that satisfy individual productions and then concatenating them together according to more general rules is called recursive inference.
This is a bottom-up process
For example, parsing the identifier “r5”:
– Rule 8 tells us that D → 5
– Rule 9 tells us that L → r
– Rule 5 tells us that I → L, so I ⇒* r
– Apply recursive inference using rule 6 for I → ID and get I ⇒* rD. Use D → 5 to get I ⇒* r5.
– Finally, we know from rule 1 that E → I, so r5 is also an expression.

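Recursive inference can be sketched as a fixed-point computation: start with nothing known, then keep applying rules to strings already inferred until no new fact appears. The version below is an illustration only. It assumes single-character symbols, restricts letters to lowercase so that terminals never collide with the variable names E, I, D, L, and keeps only substrings of the target so the search stays finite.

```python
import string

VARIABLES = {"E", "I", "D", "L"}
# Rules 1-9 of the sample CFG. Assumption: letters are limited to lowercase
# here so that terminals never collide with the uppercase variable names.
RULES = (
    [("E", "I"), ("E", "E+E"), ("E", "E*E"), ("E", "(E)"),
     ("I", "L"), ("I", "ID"), ("I", "IL")]
    + [("D", digit) for digit in string.digits]
    + [("L", letter) for letter in string.ascii_lowercase]
)


def infer(target):
    """Bottom-up: record which substrings of `target` each variable derives."""
    substrings = {target[i:j] for i in range(len(target))
                  for j in range(i + 1, len(target) + 1)}
    known = {variable: set() for variable in VARIABLES}
    changed = True
    while changed:
        changed = False
        for head, body in RULES:
            # Build the strings this body yields from facts inferred so far.
            candidates = {""}
            for symbol in body:
                options = known[symbol] if symbol in VARIABLES else {symbol}
                candidates = {c + o for c in candidates for o in options}
            new = (candidates & substrings) - known[head]
            if new:
                known[head] |= new
                changed = True
    return known


if __name__ == "__main__":
    facts = infer("r5")
    print({variable: sorted(strings) for variable, strings in facts.items()})
```

For the target r5 the loop first infers the facts for D and L, then those for I, and finally those for E, which mirrors the order of the steps listed on the slide.
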
Derivations
A derivation is a sequence of applications of rules from P, resulting in a string of terminals (i.e., a sentence)
Basically, we treat a production as a rewriting rule and replace the non-terminal on the LHS with the RHS
There can be more than one derivation for a sentence

More on derivations
⇒ derives in one step
⇒* derives in zero or more steps
⇒+ derives in one or more steps
α ⇒* α for any string α
If α ⇒* β and β ⇒ γ, then α ⇒* γ

Derivation
Similar to recursive inference, but top-down instead of bottom-up
– Expand the start symbol first and work our way down in such a way that it matches the input string
For example, given a*(a+b1) we can derive this by:
– E ⇒ E*E ⇒ I*E ⇒ L*E ⇒ a*E ⇒ a*(E) ⇒ a*(E+E) ⇒ a*(I+E) ⇒ a*(L+E) ⇒ a*(a+E) ⇒ a*(a+I) ⇒ a*(a+ID) ⇒ a*(a+LD) ⇒ a*(a+bD) ⇒ a*(a+b1)
Note that at each step of the derivation we could have chosen any one of the variables to replace with a more specific rule.

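Each arrow in a derivation rewrites exactly one variable using one production, which is easy to check by machine. The sketch below replays the derivation of a*(a+b1) step by step. It assumes single-character symbols and lowercase letters only; the function names are hypothetical.

```python
import string

# Rules 1-9 of the sample CFG; assumption: lowercase letters only,
# and every symbol is a single character.
RULES = {
    "E": ["I", "E+E", "E*E", "(E)"],
    "I": ["L", "ID", "IL"],
    "D": list(string.digits),
    "L": list(string.ascii_lowercase),
}


def one_step(before, after, leftmost_only=False):
    """True if `after` follows from `before` by rewriting one variable."""
    for pos, symbol in enumerate(before):
        if symbol not in RULES:
            continue
        for body in RULES[symbol]:
            if before[:pos] + body + before[pos + 1:] == after:
                return True
        if leftmost_only:
            return False  # only the first variable may be rewritten
    return False


STEPS = ("E E*E I*E L*E a*E a*(E) a*(E+E) a*(I+E) a*(L+E) a*(a+E) "
         "a*(a+I) a*(a+ID) a*(a+LD) a*(a+bD) a*(a+b1)").split()


def check(steps, leftmost_only=False):
    """True if every consecutive pair of forms is a valid derivation step."""
    return all(one_step(x, y, leftmost_only) for x, y in zip(steps, steps[1:]))


if __name__ == "__main__":
    print(check(STEPS), check(STEPS, leftmost_only=True))
```

Passing `leftmost_only=True` additionally requires that the rewritten variable is always the first one in the string, which this particular derivation satisfies.
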
Multiple Derivation
We saw an example of ⇒ in deriving a*(a+b1)
We could have used ⇒* to condense the derivation.
– E.g., we could just go straight to E ⇒* E*(E+E) or even straight to the final step E ⇒* a*(a+b1)
– Going straight to the end is not recommended on a homework or exam problem if you are supposed to show the derivation

Leftmost Derivation
In the previous example we used a derivation called a leftmost derivation. We can specifically denote a leftmost derivation using the subscript “lm”, as in: ⇒lm or ⇒*lm
A leftmost derivation is simply one in which we always replace the leftmost variable in the current string by one of its production bodies, working our way from left to right.

Rightmost Derivation
Not surprisingly, we also have a rightmost derivation, which we can specifically denote via: ⇒rm or ⇒*rm
A rightmost derivation is one in which we always replace the rightmost variable by one of its production bodies, working our way from right to left.

Rightmost Derivation Example
a*(a+b1) was already shown previously using a leftmost derivation.
We can also come up with a rightmost derivation, but we must make replacements in a different order:
– E ⇒rm E*E ⇒rm E*(E) ⇒rm E*(E+E) ⇒rm E*(E+I) ⇒rm E*(E+ID) ⇒rm E*(E+I1) ⇒rm E*(E+L1) ⇒rm E*(E+b1) ⇒rm E*(I+b1) ⇒rm E*(L+b1) ⇒rm E*(a+b1) ⇒rm I*(a+b1) ⇒rm L*(a+b1) ⇒rm a*(a+b1)

Left or Right?
Does it matter which method you use?
Answer: No
Any derivation has an equivalent leftmost and rightmost derivation. That is, for a string of terminals w, A ⇒* w iff A ⇒*lm w and A ⇒*rm w.

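One way to see the equivalence is to start from a single parse tree and read off both derivations from it: expanding the leftmost pending node gives the leftmost derivation, and expanding the rightmost pending node gives the rightmost one. The sketch below does this for a*(a+b1). The encoding of a tree node as a (variable, children) tuple and the helper names are assumptions made for illustration.

```python
def node(variable, *children):
    """An internal node is (variable, [children]); a leaf is a plain string."""
    return (variable, list(children))


def render(form):
    """Write a sentential form, showing unexpanded nodes by their variable."""
    return "".join(item[0] if isinstance(item, tuple) else item for item in form)


def derivation(tree, leftmost=True):
    """List the sentential forms obtained by expanding one node at a time."""
    form = [tree]
    steps = [render(form)]
    while True:
        pending = [i for i, item in enumerate(form) if isinstance(item, tuple)]
        if not pending:
            return steps
        i = pending[0] if leftmost else pending[-1]
        form[i:i + 1] = form[i][1]  # replace the variable by its children
        steps.append(render(form))


def ident(letter):
    """Subtree for a one-letter identifier: I → L → letter."""
    return node("I", node("L", letter))


# Parse tree for a*(a+b1) under the sample CFG.
TREE = node(
    "E",
    node("E", ident("a")), "*",
    node("E", "(", node("E",
                        node("E", ident("a")), "+",
                        node("E", node("I", ident("b"), node("D", "1")))), ")"),
)

if __name__ == "__main__":
    print(" => ".join(derivation(TREE, leftmost=True)))
    print(" => ".join(derivation(TREE, leftmost=False)))
```

Both runs use the same productions and differ only in the order in which they are applied, so both end in the same terminal string.
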
Language of Context Free Grammar
The language represented by a CFG G = (V, T, P, S), denoted L(G), is a Context Free Language (CFL) and contains all terminal strings x such that S ⇒* x
In other words, L(G) consists of terminal strings that have derivations from the start symbol:
L(G) = { w in T* | S ⇒*G w }
Note that the CFL L(G) consists solely of terminals from G.

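For a small grammar, the definition of L(G) can be explored directly by searching over sentential forms and collecting those that contain only terminals. The sketch below uses the grammar P → ε | 0 | 1 | 0P0 | 1P1 and a length bound. It is an illustration under assumptions: symbols are single characters, the empty string stands for ε, and the search terminates only because every recursive body in this grammar adds terminals; it is not a general algorithm for arbitrary CFGs.

```python
from collections import deque

RULES = {"P": ["", "0", "1", "0P0", "1P1"]}  # "" stands for ε


def language_up_to(max_len, start="P"):
    """Collect every terminal string w with S ⇒* w and len(w) <= max_len."""
    seen, queue, words = {start}, deque([start]), set()
    while queue:
        form = queue.popleft()
        pos = next((i for i, s in enumerate(form) if s in RULES), None)
        if pos is None:
            words.add(form)  # no variables left: a sentence of L(G)
            continue
        for body in RULES[form[pos]]:
            new = form[:pos] + body + form[pos + 1:]
            terminal_count = sum(1 for s in new if s not in RULES)
            if terminal_count <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return words


if __name__ == "__main__":
    print(sorted(language_up_to(3), key=lambda w: (len(w), w)))
```
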
Parse Tree
A parse tree is a graphical (top-down) representation for a derivation:
– The root is the start symbol
– Each leaf is a terminal symbol or ε
– Each internal node is a non-terminal
– If A is an internal node and X1, X2, …, Xn are A’s children nodes, then A → X1X2…Xn is a production used in the derivation

Sample Parse Tree
Sample parse tree for 1110111 under the CFG:
P → ε | 0 | 1 | 0P0 | 1P1

P
  1
  P
    1
    P
      1
      P
        0
      1
    1
  1

1. E → I
2. E → E+E
3. E → E*E
4. E → (E)
5. I → L
6. I → ID
7. I → IL
8. D → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
9. L → a | b | c | … | A | B | … | Z

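For the grammar P → ε | 0 | 1 | 0P0 | 1P1, a parse tree can be built by peeling matching symbols off both ends of the input. The following sketch assumes a tree is stored as a (variable, children) tuple with plain strings as leaves; the function names are hypothetical.

```python
def parse(text):
    """Return a parse tree for `text` under P → ε | 0 | 1 | 0P0 | 1P1, or None."""
    if text == "":
        return ("P", ["ε"])
    if text in ("0", "1"):
        return ("P", [text])
    if len(text) >= 2 and text[0] == text[-1] and text[0] in "01":
        inner = parse(text[1:-1])
        if inner is not None:
            return ("P", [text[0], inner, text[-1]])
    return None


def yield_of(tree):
    """Concatenate the leaves from left to right (ε contributes nothing)."""
    if isinstance(tree, str):
        return "" if tree == "ε" else tree
    return "".join(yield_of(child) for child in tree[1])


def depth(tree):
    """Number of internal nodes on the longest root-to-leaf path."""
    return 0 if isinstance(tree, str) else 1 + max(depth(c) for c in tree[1])


if __name__ == "__main__":
    print(parse("1110111"))
```

Reading the leaves of the returned tree from left to right gives back the input string, which matches the rule that every leaf is a terminal or ε.
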
Ambiguity
A grammar that produces more than one parse tree for some sentence is said to be ambiguous
Sometimes we can rewrite the rules in P to make a grammar unambiguous
– Example: write rules to reflect the precedence of the operators
– S → AS | ε
– A → A1 | 0A1 | 01

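Ambiguity can be observed by counting parse trees. The sketch below compares two grammars that are assumptions for illustration, not taken from the slides: a simplified expression grammar in which the single terminal a stands in for any identifier, and the usual textbook rewrite with one variable per precedence level (E, T, F). The counter assumes single-character symbols, no ε-productions, and no cycles of unit productions.

```python
from functools import lru_cache

# Assumed simplification of the sample CFG: identifiers collapse to "a".
AMBIGUOUS = {"E": ("E+E", "E*E", "(E)", "a")}
# Assumed textbook rewrite with one variable per precedence level.
LAYERED = {"E": ("E+T", "T"), "T": ("T*F", "F"), "F": ("(E)", "a")}


def count_parse_trees(grammar, start, text):
    """Count parse trees for `text`; assumes no ε-productions or unit cycles."""

    @lru_cache(maxsize=None)
    def trees(symbol, i, j):
        if symbol not in grammar:  # terminal: must match exactly one character
            return 1 if text[i:j] == symbol else 0
        return sum(ways(body, i, j) for body in grammar[symbol])

    @lru_cache(maxsize=None)
    def ways(body, i, j):
        if len(body) == 1:
            return trees(body, i, j)
        total = 0
        # The first symbol covers text[i:k]; every later symbol needs >= 1 char.
        for k in range(i + 1, j - len(body) + 2):
            first = trees(body[0], i, k)
            if first:
                total += first * ways(body[1:], k, j)
        return total

    return trees(start, 0, len(text))


if __name__ == "__main__":
    for name, grammar in [("ambiguous", AMBIGUOUS), ("layered", LAYERED)]:
        print(name, count_parse_trees(grammar, "E", "a+a*a"))
```

A count above one for some sentence shows that the grammar is ambiguous. A count of one for a few test sentences is only evidence, not proof, that a grammar is unambiguous.
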
Other Types of Grammars
Regular Grammars (RG)
Context Free Grammars (CFG)
Context Sensitive Grammars (CSG)
Unrestricted Grammars (UG)
L(RG) ⊆ L(CFG) ⊆ L(CSG) ⊆ L(UG)

Use CFG and Parsing
CFG is used to define the structure of a program (a language)
Parsing is used to test whether a sentence belongs to the language
– Parsing can be done by hand
– Parsing algorithms (next lecture)

END