Context-Free Grammars, Derivations, and Parse Trees

CS 404
Introduction to Compiler Design
Lecture 3
Ahmed Ezzat
Top-Down Parsing LL(1)
Review of Context-Free Grammars
• Context-free language (CFL): a language L is context-free if there exists a CFG G such that L = L(G).
• Every regular language (i.e., one that can be generated by a regular grammar) is also context-free; the regular languages are a proper subclass of the CFLs.
• A CFG describes a language with production rules rather than regular expressions.
• CFGs:
  – Can describe the syntax of most programming languages
  – Are good at nested structures
  – Can be efficiently implemented
  – Can guide parser generation
[Figure: the regular languages drawn inside the CFLs]
Tasks That Cannot Be Done by a CFG
• Some checks must wait until semantic analysis, i.e., parsing needs to be done first:
  – Matching name uses against declarations
  – Verifying that a function is called with the right number of arguments
  – Type checking in expressions
Write a Parser for Language L
1. Write a CFG G for L (e.g., C, C++) and verify that G generates all strings in L.
2. Eliminate ambiguity (there are no formal rules for this).
3. Eliminate left recursion: the special case of recursion in which a string is recognized as part of a language because it decomposes into a string from that same language on the left (the nonterminal is left recursive) and a suffix on the right:
   A → Aα
   where A is a nonterminal and α is a string of grammar symbols.
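The usual way to remove immediate left recursion is to rewrite A → Aα | β as A → βA' and A' → αA' | ε. The sketch below shows that rewrite for a single nonterminal. The function name, the list-of-symbols encoding, and the primed name for the new nonterminal are assumptions made for illustration, and indirect left recursion is not handled.

```python
def eliminate_immediate_left_recursion(nonterminal, alternatives):
    """Rewrite A -> A a1 | ... | b1 | ... as A -> b A', A' -> a A' | epsilon.

    Each alternative is a list of symbols; the empty list stands for epsilon.
    """
    new_nt = nonterminal + "'"  # hypothetical name for the fresh nonterminal
    # Tails (alpha) of the alternatives that start with the nonterminal itself.
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == nonterminal]
    # Alternatives (beta) that do not start with the nonterminal.
    others = [alt for alt in alternatives if not alt or alt[0] != nonterminal]
    if not recursive:
        return {nonterminal: alternatives}  # nothing to do
    return {
        nonterminal: [beta + [new_nt] for beta in others],
        new_nt: [alpha + [new_nt] for alpha in recursive] + [[]],
    }


# E -> E + T | T   becomes   E -> T E',  E' -> + T E' | epsilon
result = eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]])
print(result)
```

The rewritten grammar generates the same strings, but the recursion now sits on the right, so a top-down parser no longer loops on A.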
Write a Parser for Language L
4. Left-factor the grammar: factor out the common prefix that appears in two productions of the same nonterminal, so that the parser does not need to backtrack.
   Example: A → qB | qC
   where A, B, C are nonterminals and q is a string of terminals.
   Here the parser cannot tell which of the two production rules to choose after seeing q. After left factoring, the grammar becomes:
   A → qD
   D → B | C
   The choice of production for A is no longer ambiguous.
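As a rough sketch of the transformation above, the function below factors out the longest prefix shared by all alternatives of one nonterminal. The function name and the encoding of productions as lists of symbols are assumptions for illustration; a full left-factoring pass would also group alternatives that share a prefix only with some of the others and would repeat until nothing changes.

```python
def left_factor(nonterminal, alternatives, new_nt):
    """Factor the prefix common to ALL alternatives into a new nonterminal (one level only)."""
    prefix = []
    for symbols in zip(*alternatives):  # walk the alternatives column by column
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    if not prefix or len(alternatives) < 2:
        return {nonterminal: alternatives}  # no common prefix to factor
    tails = [alt[len(prefix):] for alt in alternatives]
    return {nonterminal: [prefix + [new_nt]], new_nt: tails}


# A -> q B | q C   becomes   A -> q D,  D -> B | C
factored = left_factor("A", [["q", "B"], ["q", "C"]], new_nt="D")
print(factored)
```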
Parsing Approaches (Top-down)
• The syntax analysis phase of a compiler verifies that the sequence of tokens produced by the scanner represents a valid sentence in the grammar of the programming language.
• There are 2 major parsing approaches:
  – Top-down: start with the start symbol and apply production rules until you arrive at the desired string
      S → AB
      A → aA | ε
      B → b | bB
• Show that the string aaab is generated by the above grammar:
      Sentential form    Rule applied
      S
      AB                 S → AB
      aAB                A → aA
      aaAB               A → aA
      aaaAB              A → aA
      aaaεB              A → ε
      aaab               B → b
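The derivation above can be replayed mechanically. The sketch below applies each listed production to the leftmost nonterminal of the current sentential form and records every form. The function name and the one-character-per-symbol encoding are assumptions made for illustration; ε is written as the empty string, so the form shown as aaaεB on the slide appears as aaaB.

```python
def leftmost_derivation(start, steps, nonterminals):
    """Apply each (lhs, rhs) production to the leftmost nonterminal; return all sentential forms."""
    form = [start]
    forms = ["".join(form)]
    for lhs, rhs in steps:
        # Position of the leftmost nonterminal in the current sentential form.
        pos = next(i for i, sym in enumerate(form) if sym in nonterminals)
        if form[pos] != lhs:
            raise ValueError(f"leftmost nonterminal is {form[pos]}, not {lhs}")
        form[pos:pos + 1] = list(rhs)  # an empty rhs stands for epsilon
        forms.append("".join(form))
    return forms


steps = [("S", "AB"), ("A", "aA"), ("A", "aA"), ("A", "aA"), ("A", ""), ("B", "b")]
forms = leftmost_derivation("S", steps, nonterminals={"S", "A", "B"})
print(" => ".join(forms))
```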
Parsing Approaches (Bottom-up)
• Bottom-up: start with the string and reduce it to the start symbol, i.e., it works in reverse.
      Sentential form    Reduction applied
      aaab
      aaaεb              (insert ε)
      aaaAb              A → ε
      aaAb               A → aA
      aAb                A → aA
      Ab                 A → aA
      AB                 B → b
      S                  S → AB
  – Handles a larger set of grammars
Top-Down Parsing
• A parser is top-down if it discovers a parse tree from top to bottom:
  – A top-down parse corresponds to a preorder traversal of the parse tree
  – A leftmost derivation is applied at each derivation step
• Top-down parsers come in 2 forms:
  – Predictive parsers: predict the production rule to be applied using lookahead tokens
  – Backtracking parsers: try different productions, backing up when a parse fails
• Predictive parsers are much faster than backtracking ones:
  – Predictive parsers operate in linear time (our focus)
  – Backtracking parsers can take exponential time in the worst case (not considered here)
• Two kinds of top-down parsing techniques:
  – Recursive-descent parsing (used to construct the syntax tree)
  – LL parsing
Top-Down Parsing
• Start with the grammar's start symbol
• Apply rules until the desired sentence is generated
• Build the parse tree down from the root
• Easy with simple grammars
• Easy to apply by hand
Top-down Parsing
• Predictive: try to guess which production rule to apply next, given
  – The current nonterminal symbol
  – One or more look-ahead terminal symbols
• Two ways to do predictive parsing
  – Use recursive procedures
  – Use a predictive parsing table
Top-down Parsing:
Construction of a Syntax Tree
• Although recursive descent is a top-down parsing technique:
  – The construction of the syntax tree for expressions is bottom-up
  – Tracing the construction verifies the precedence and associativity of operators
• The tree construction for a – b + c * (b + d) is given below:
      ptr1 = symtable.lookup(a)
      ptr2 = symtable.lookup(b)
      ptr3 = new node('–', ptr1, ptr2)
      ptr4 = symtable.lookup(c)
      ptr2 = symtable.lookup(b)
      ptr5 = symtable.lookup(d)
      ptr6 = new node('+', ptr2, ptr5)
      ptr7 = new node('*', ptr4, ptr6)
      ptr8 = new node('+', ptr3, ptr7)
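To see how the nodes above come out of recursive procedures, here is a minimal recursive-descent sketch, assuming the usual expression grammar (expr → term {(+|-) term}, term → factor {(*|/) factor}, factor → id | ( expr )). The function names, the pre-split token list, and the use of nested tuples in place of `new node(...)` are illustrative assumptions, and identifiers stand in for symbol-table lookups.

```python
def parse_expression(tokens):
    """Recursive-descent parser that returns a nested-tuple syntax tree."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        if peek() == "(":
            advance()
            node = expr()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        tok = peek()
        if tok is None or not tok.isalnum():
            raise SyntaxError(f"unexpected token {tok!r}")
        return advance()  # an identifier is a leaf

    def term():
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())  # operands are built before their parent
        return node

    def expr():
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())  # the loop makes + and - left-associative
        return node

    tree = expr()
    if peek() is not None:
        raise SyntaxError(f"unexpected token {peek()!r}")
    return tree


tree = parse_expression(["a", "-", "b", "+", "c", "*", "(", "b", "+", "d", ")"])
print(tree)
```

Each tuple is created only after both operand subtrees exist, which is why the tree is built bottom-up even though the procedures are called top-down.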
LL(1) Grammar
• A restricted set of grammars that can be parsed with no need to backtrack
• The parser uses an explicit stack rather than recursive calls
• LL(k) parsing means that k tokens of lookahead are used
• LL(1):
  – L: scan the input string from left to right
  – L: a leftmost derivation is applied at each step
  – 1: one input symbol of lookahead
LL(1) Grammar
• An LL parser consists of:
  – A parser stack that holds grammar symbols: nonterminals and tokens
  – A parsing table that specifies the parser action
  – A driver function that interacts with the parser stack, the parsing table, and the scanner
FIRST and FOLLOW sets
• FIRST is defined for a terminal, a nonterminal, or a string of symbols
• FIRST(α) contains any terminal that might begin a string derived from α
• If we have a rule X → α, t is in FIRST(α), and the lookahead symbol is t, then X → α may be the right rule to apply
Compute FIRST
• If X is a terminal, then FIRST(X) = {X}
• If X → ε is a production, then add ε to FIRST(X)
• If X is a nonterminal and X → Y1 Y2 … Yk, then add terminal z to FIRST(X) if, for some i, z is in FIRST(Yi) and ε is in FIRST(Yj) for all j < i
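These rules are usually applied repeatedly until no FIRST set changes. The sketch below is one such fixed-point loop, under the assumption that a grammar is a dict from each nonterminal to a list of right-hand sides (lists of symbols, with the empty list standing for ε). The grammar used is the S → A c B example that appears later in these slides; the function and variable names are illustrative.

```python
EPS = "ε"


def compute_first(grammar):
    """Return FIRST sets for every nonterminal of the grammar."""
    first = {nt: set() for nt in grammar}

    def first_of(symbol):
        # A symbol that is not a key of the grammar is treated as a terminal.
        return first[symbol] if symbol in grammar else {symbol}

    changed = True
    while changed:  # repeat until a full pass adds nothing
        changed = False
        for nt, alternatives in grammar.items():
            for rhs in alternatives:
                before = len(first[nt])
                for symbol in rhs:
                    first[nt] |= first_of(symbol) - {EPS}
                    if EPS not in first_of(symbol):
                        break  # this Yi cannot vanish, so later symbols are hidden
                else:
                    first[nt].add(EPS)  # every symbol of the rhs can derive ε
                changed |= len(first[nt]) != before
    return first


grammar = {
    "S": [["A", "c", "B"]],
    "A": [["a", "A"], []],
    "B": [["b", "B", "S"], []],
}
first = compute_first(grammar)
for nt, symbols in first.items():
    print(nt, sorted(symbols))
```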
Compute FIRST
• Suppose we have the following grammar:
      S → A a | B b
      A → D c | C A
      B → d A | e
      C → f C | b
      D → h | i
  – The right-hand sides of the productions of S do not begin with terminals
  – The parser has no immediate guidance on which production to apply to expand S
  – Following all possible derivations of S shows which tokens can come first
• We predict S → A a when
  – The first token is h, i, f, or b: First(A a) = {h, i, f, b}
• We predict S → B b when
  – The first token is d or e: First(B b) = {d, e}
• Otherwise, we have an error
Use of FIRST(α)
• If we have two rules X → α | β, we use FIRST(α) and FIRST(β) to pick which rule to apply
• If the lookahead t is in FIRST(α) and not in FIRST(β), pick X → α
• If FIRST(α) and FIRST(β) share a symbol, one token of lookahead cannot choose between the two rules, so predictive parsing is not possible
FOLLOW for a Nonterminal
• FOLLOW(A) includes all symbols that could appear immediately after A in a valid sentence
• FOLLOW is needed because FIRST alone cannot determine which rule to apply in some cases, e.g., when a right-hand side can derive ε
Compute FOLLOW
• Suppose we have the following grammar:
      S → A c B
      A → a A
      A → ε
      B → b B S
      B → ε
• We predict A → a A when
  – The next token is a, because First(a A) = {a}
• We predict A → ε when
  – The next token is c, because Follow(A) = {c}
• Similarly, we predict B → b B S when
  – The next token is b, because First(b B S) = {b}
• We predict B → ε when
  – The next token is a, c, or $ (the end-of-file token), because Follow(B) = {a, c, $}
Compute FOLLOW
• Put $ in FOLLOW(S), where S is the start symbol ($ is called the endmarker)
• If A → αBβ, then put FIRST(β), except ε, into FOLLOW(B)
• If A → αB, or A → αBβ and β ⇒* ε, then put FOLLOW(A) into FOLLOW(B)
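The three rules can be iterated to a fixed point in the same way as FIRST. The sketch below assumes the FIRST sets are already known and are passed in as a dict; the grammar encoding (nonterminal to list of right-hand sides, empty list for ε) and all names are illustrative assumptions. It uses the S → A c B grammar from the preceding example.

```python
EPS, END = "ε", "$"


def compute_follow(grammar, first, start):
    """Return FOLLOW sets for every nonterminal, given precomputed FIRST sets."""
    follow = {nt: set() for nt in grammar}
    follow[start].add(END)  # rule 1: $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for lhs, alternatives in grammar.items():
            for rhs in alternatives:
                for i, symbol in enumerate(rhs):
                    if symbol not in grammar:  # only nonterminals have FOLLOW sets
                        continue
                    before = len(follow[symbol])
                    beta_nullable = True
                    for nxt in rhs[i + 1:]:  # rule 2: FIRST(beta) minus ε
                        nxt_first = first[nxt] if nxt in grammar else {nxt}
                        follow[symbol] |= nxt_first - {EPS}
                        if EPS not in nxt_first:
                            beta_nullable = False
                            break
                    if beta_nullable:  # rule 3: beta is empty or derives ε
                        follow[symbol] |= follow[lhs]
                    changed |= len(follow[symbol]) != before
    return follow


grammar = {
    "S": [["A", "c", "B"]],
    "A": [["a", "A"], []],
    "B": [["b", "B", "S"], []],
}
# FIRST sets assumed to have been computed beforehand for this grammar.
first = {"S": {"a", "c"}, "A": {"a", EPS}, "B": {"b", EPS}}
follow = compute_follow(grammar, first, "S")
for nt, symbols in follow.items():
    print(nt, sorted(symbols))
```

The result agrees with the sets used in the example: Follow(A) = {c} and Follow(B) = {a, c, $}.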
Determine Predict Set
• The predict set of a production A → α is defined as follows:
  – If α is not nullable, then Predict(A → α) = First(α)
  – If α is nullable, then Predict(A → α) = (First(α) – {ε}) ∪ Follow(A)
• This is the set of lookahead tokens that will cause the selection of A → α
• Example on determining the predict set:
      E → T Q        Predict(E → T Q) = First(T Q) = First(T) = {(, id}
      Q → + T Q      Predict(Q → + T Q) = First(+ T Q) = {+}
      Q → – T Q      Predict(Q → – T Q) = First(– T Q) = {–}
      Q → ε          Predict(Q → ε) = Follow(Q) = {$, )}
      T → F R        Predict(T → F R) = First(F R) = First(F) = {(, id}
      R → * F R      Predict(R → * F R) = First(* F R) = {*}
      R → / F R      Predict(R → / F R) = First(/ F R) = {/}
      R → ε          Predict(R → ε) = Follow(R) = {+, –, $, )}
      F → ( E )      Predict(F → ( E )) = {(}
      F → id         Predict(F → id) = {id}
Construct LL(1) Parsing Table
• The predict sets can be represented in an LL(1) parse table
  – The rows are indexed by the nonterminals
  – The columns are indexed by the tokens
• If A is a nonterminal and tok is the lookahead token, then
  – Table[A][tok] indicates which production rule to predict
  – If no production rule can be used, Table[A][tok] gives an error value
• Table[A][tok] = A → α iff tok ∈ Predict(A → α)
• Example on constructing the LL(1) parsing table:
      1: S → A c B     Predict(1) = {a, c}
      2: A → a A       Predict(2) = {a}
      3: A → ε         Predict(3) = {c}
      4: B → b B S     Predict(4) = {b}
      5: B → ε         Predict(5) = {$, a, c}
  Filling in Table[A][tok] from these predict sets (entries are production numbers; blank = error):
             a     b     c     $
      S      1           1
      A      2           3
      B      5     4     5     5
Use Parsing Table to Parse
• Push $S onto the stack (S on top) and append $ to the end of the input string. x is the symbol on top of the stack, a is the current input symbol
• If x = a = $, success
• If x = a ≠ $, pop x and advance the input
• If x is a nonterminal
  – If M[x, a] = {x → UVW}, replace x by WVU (U on top)
  – If M[x, a] has no rule, error
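The driver loop described above might look like the following sketch. It hard-codes the productions and predict sets of the S → A c B example as a table keyed by (nonterminal, token); the names `PRODUCTIONS`, `TABLE`, and `ll1_parse` are assumptions for illustration, and each input character is treated as one token.

```python
END = "$"

# Production numbers follow the S → A c B example; [] stands for ε.
PRODUCTIONS = {
    1: ("S", ["A", "c", "B"]),
    2: ("A", ["a", "A"]),
    3: ("A", []),
    4: ("B", ["b", "B", "S"]),
    5: ("B", []),
}
# TABLE[(x, a)] is the production to apply; a missing key is an error entry.
TABLE = {
    ("S", "a"): 1, ("S", "c"): 1,
    ("A", "a"): 2, ("A", "c"): 3,
    ("B", "b"): 4, ("B", "a"): 5, ("B", "c"): 5, ("B", END): 5,
}
NONTERMINALS = {"S", "A", "B"}


def ll1_parse(tokens, start="S"):
    """Return the list of production numbers applied, or raise SyntaxError."""
    stack = [END, start]            # the top of the stack is the end of the list
    tokens = list(tokens) + [END]
    pos = 0
    applied = []
    while True:
        x, a = stack[-1], tokens[pos]
        if x == END and a == END:
            return applied          # x = a = $: success
        if x in NONTERMINALS:
            rule = TABLE.get((x, a))
            if rule is None:
                raise SyntaxError(f"no rule for ({x}, {a})")
            stack.pop()
            # Push the right-hand side reversed so its first symbol ends up on top.
            stack.extend(reversed(PRODUCTIONS[rule][1]))
            applied.append(rule)
        elif x == a:
            stack.pop()             # matching terminal: pop and advance the input
            pos += 1
        else:
            raise SyntaxError(f"expected {x}, found {a}")


print(ll1_parse("aac"))  # [1, 2, 2, 3, 5]
```

The returned production numbers, read in order, spell out the leftmost derivation of the input.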
LL(1) and Predictive Parsing
• The parsing table of an LL(1) grammar has no multiply defined entries
• No ambiguous or left-recursive grammar can be LL(1)
Parsing Errors
• The top of the stack is a terminal, but it does not match the current input symbol
• The top of the stack is a nonterminal, but the parsing table has no rule for the current input symbol
Handling Parsing Errors
• Report
  – Report expected vs. found symbols
  – Fill the empty entries in the parsing table with error messages
• Patch up
  – Insert missing symbols
• Skip
  – To the next delimiter
  – Until a matching parenthesis is found
  – Until a } is found
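The "skip" strategy can be sketched as a small helper that discards tokens until it reaches a synchronizing token. The function name and the default set of synchronizing tokens are assumptions for illustration; a real parser would choose them per nonterminal.

```python
def skip_to_sync(tokens, pos, sync_tokens=(";", ")", "}")):
    """Skip tokens from pos until one in sync_tokens (or the end); return the new position."""
    while pos < len(tokens) and tokens[pos] not in sync_tokens:
        pos += 1
    return pos


tokens = ["x", "=", "+", "*", "3", ";", "y", "=", "1", ";"]
error_pos = 2                       # suppose the parser failed at the stray '+'
resume_pos = skip_to_sync(tokens, error_pos)
print(f"error at token {tokens[error_pos]!r}, resuming at {tokens[resume_pos]!r}")
```

After skipping, the parser can pop its stack to a matching state and continue, so that later errors in the same input are still reported.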
END
Compute FIRST for a String
• For α = X1 X2 … Xn
  – Add all non-ε symbols of FIRST(X1) to FIRST(α)
  – Add all non-ε symbols of FIRST(Xj) to FIRST(α) if ε is in FIRST(Xi) for all i < j
  – Add ε to FIRST(α) if ε is in FIRST(Xi) for all i
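A direct transcription of these three steps is sketched below. It assumes the FIRST sets of the individual nonterminals are given as a dict and that any symbol missing from the dict is a terminal; the function name and the sample sets (taken from the S → A c B grammar) are illustrative.

```python
EPS = "ε"


def first_of_string(symbols, first):
    """FIRST of X1 X2 ... Xn, given FIRST sets for the individual nonterminals."""
    result = set()
    for symbol in symbols:
        symbol_first = first.get(symbol, {symbol})  # a terminal's FIRST is itself
        result |= symbol_first - {EPS}
        if EPS not in symbol_first:
            return result  # Xj cannot vanish, so later symbols contribute nothing
    result.add(EPS)  # every Xi can derive ε (this also covers the empty string)
    return result


first = {"S": {"a", "c"}, "A": {"a", EPS}, "B": {"b", EPS}}
print(sorted(first_of_string(["A", "c", "B"], first)))  # ['a', 'c']
print(sorted(first_of_string(["A", "B"], first)))       # ['a', 'b', 'ε']
```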
Predictive Parsing Table
• For each production rule A → α
  – For each terminal a in FIRST(α), add A → α to M[A, a]
  – If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A) (b can be $)
  – Undefined entries of M are error entries
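The table-filling procedure can be sketched as follows, assuming FIRST and FOLLOW have already been computed and are passed in as dicts. Each cell holds a list of production numbers so that a cell with more than one entry exposes a grammar that is not LL(1). The names, the numbering, and the S → A c B grammar used as input are illustrative assumptions.

```python
EPS, END = "ε", "$"


def build_ll1_table(productions, first, follow):
    """Map (nonterminal, token) to the list of production numbers predicted there."""
    table = {}
    for number, (lhs, rhs) in productions.items():
        lookaheads = set()
        for symbol in rhs:  # FIRST(α), computed symbol by symbol
            symbol_first = first.get(symbol, {symbol})
            lookaheads |= symbol_first - {EPS}
            if EPS not in symbol_first:
                break
        else:
            lookaheads |= follow[lhs]  # ε is in FIRST(α): use FOLLOW(A)
        for tok in lookaheads:
            table.setdefault((lhs, tok), []).append(number)
    return table


productions = {
    1: ("S", ["A", "c", "B"]),
    2: ("A", ["a", "A"]),
    3: ("A", []),
    4: ("B", ["b", "B", "S"]),
    5: ("B", []),
}
# FIRST and FOLLOW sets assumed to have been computed beforehand.
first = {"S": {"a", "c"}, "A": {"a", EPS}, "B": {"b", EPS}}
follow = {"S": {"a", "c", END}, "A": {"c"}, "B": {"a", "c", END}}

table = build_ll1_table(productions, first, follow)
conflicts = [cell for cell, rules in table.items() if len(rules) > 1]
print(sorted(table.items()))
print("conflicts:", conflicts)
```

Any key absent from the dict plays the role of an error entry, and an empty `conflicts` list means no cell is multiply defined.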