The role of the parser

source code → scanner → tokens → parser → IR
(both phases report errors)

The parser
• performs context-free syntax analysis
• guides context-sensitive analysis
• constructs an intermediate representation
• produces meaningful error messages
• attempts error correction

Copyright © 2007 by Antony L. Hosking. Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or fee. Request permission to publish from [email protected].

CS502 Parsing

Syntax analysis

Context-free syntax is specified with a context-free grammar.

Formally, a CFG G is a 4-tuple (Vt, Vn, S, P), where:

• Vt is the set of terminal symbols in the grammar. For our purposes, Vt is the set of tokens returned by the scanner.

• Vn, the nonterminals, is a set of syntactic variables that denote sets of (sub)strings occurring in the language. These are used to impose a structure on the grammar.

• S is a distinguished nonterminal (S ∈ Vn) denoting the entire set of strings in L(G). This is sometimes called the goal symbol.

• P is a finite set of productions specifying how terminals and non-terminals can be combined to form strings in the language. Each production must have a single non-terminal on its left-hand side.

The set V = Vt ∪ Vn is called the vocabulary of G.
Notation and terminology
• a, b, c, . . . ∈ Vt
• A, B, C, . . . ∈ Vn
• U, V, W, . . . ∈ V
• α, β, γ, . . . ∈ V∗
• u, v, w, . . . ∈ Vt∗

If A → γ then αAβ ⇒ αγβ is a single-step derivation using A → γ.

Similarly, ⇒∗ and ⇒+ denote derivations of ≥ 0 and ≥ 1 steps.

If S ⇒∗ β then β is said to be a sentential form of G.

L(G) = {w ∈ Vt∗ | S ⇒+ w}; w ∈ L(G) is called a sentence of G.

Note: L(G) = {β ∈ V∗ | S ⇒∗ β} ∩ Vt∗.

Syntax analysis

Grammars are often written in Backus-Naur form (BNF).

Example:

1 ⟨goal⟩ ::= ⟨expr⟩
2 ⟨expr⟩ ::= ⟨expr⟩⟨op⟩⟨expr⟩
3         | num
4         | id
5 ⟨op⟩   ::= +
6         | −
7         | ∗
8         | /

This describes simple expressions over numbers and identifiers.

In a BNF for a grammar, we represent
1. non-terminals with angle brackets or capital letters
2. terminals with typewriter font or underline
3. productions as in the example
Scanning vs. parsing

Where do we draw the line?

term ::= [a-zA-Z]([a-zA-Z] | [0-9])∗
       | 0 | [1-9][0-9]∗
op   ::= + | − | ∗ | /
expr ::= (term op)∗ term

Regular expressions are used to classify:
• identifiers, numbers, keywords
• REs are more concise and simpler for tokens than a grammar
• more efficient scanners can be built from REs (DFAs) than from grammars

Context-free grammars are used to count:
• brackets: (), begin. . . end, if. . . then. . . else
• imparting structure: expressions

Syntactic analysis is complicated enough: a grammar for C has around 200 productions. Factoring out lexical analysis as a separate phase makes the compiler more manageable.

Derivations

We can view the productions of a CFG as rewriting rules.

Using our example CFG:

⟨goal⟩ ⇒ ⟨expr⟩
       ⇒ ⟨expr⟩⟨op⟩⟨expr⟩
       ⇒ ⟨expr⟩⟨op⟩⟨expr⟩⟨op⟩⟨expr⟩
       ⇒ ⟨id,x⟩⟨op⟩⟨expr⟩⟨op⟩⟨expr⟩
       ⇒ ⟨id,x⟩ + ⟨expr⟩⟨op⟩⟨expr⟩
       ⇒ ⟨id,x⟩ + ⟨num,2⟩⟨op⟩⟨expr⟩
       ⇒ ⟨id,x⟩ + ⟨num,2⟩ ∗ ⟨expr⟩
       ⇒ ⟨id,x⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩

We have derived the sentence x + 2 ∗ y. We denote this ⟨goal⟩ ⇒∗ id + num ∗ id.

Such a sequence of rewrites is a derivation or a parse. The process of discovering a derivation is called parsing.
Derivations
At each step, we choose a non-terminal to replace. This choice can lead to different derivations.

Two are of particular interest:

leftmost derivation: the leftmost non-terminal is replaced at each step
rightmost derivation: the rightmost non-terminal is replaced at each step

The previous example was a leftmost derivation.

Rightmost derivation

For the string x + 2 ∗ y:

⟨goal⟩ ⇒ ⟨expr⟩
       ⇒ ⟨expr⟩⟨op⟩⟨expr⟩
       ⇒ ⟨expr⟩⟨op⟩⟨id,y⟩
       ⇒ ⟨expr⟩ ∗ ⟨id,y⟩
       ⇒ ⟨expr⟩⟨op⟩⟨expr⟩ ∗ ⟨id,y⟩
       ⇒ ⟨expr⟩⟨op⟩⟨num,2⟩ ∗ ⟨id,y⟩
       ⇒ ⟨expr⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩
       ⇒ ⟨id,x⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩

Again, ⟨goal⟩ ⇒∗ id + num ∗ id.
Precedence

These two derivations point out a problem with the grammar: it has no notion of precedence, or implied order of evaluation.

The parse tree from the rightmost derivation groups x + 2 first:

⟨goal⟩
  ⟨expr⟩
    ⟨expr⟩
      ⟨expr⟩ → <id,x>
      ⟨op⟩   → +
      ⟨expr⟩ → <num,2>
    ⟨op⟩   → *
    ⟨expr⟩ → <id,y>

Treewalk evaluation computes (x + 2) ∗ y — the "wrong" answer! Should be x + (2 ∗ y).

Precedence

To add precedence takes additional machinery:

1 ⟨goal⟩   ::= ⟨expr⟩
2 ⟨expr⟩   ::= ⟨expr⟩ + ⟨term⟩
3           | ⟨expr⟩ − ⟨term⟩
4           | ⟨term⟩
5 ⟨term⟩   ::= ⟨term⟩ ∗ ⟨factor⟩
6           | ⟨term⟩ / ⟨factor⟩
7           | ⟨factor⟩
8 ⟨factor⟩ ::= num
9           | id

This grammar enforces a precedence on the derivation:
• terms must be derived from expressions
• forces the "correct" tree
Precedence

Now, for the string x + 2 ∗ y:

⟨goal⟩ ⇒ ⟨expr⟩
       ⇒ ⟨expr⟩ + ⟨term⟩
       ⇒ ⟨expr⟩ + ⟨term⟩ ∗ ⟨factor⟩
       ⇒ ⟨expr⟩ + ⟨term⟩ ∗ ⟨id,y⟩
       ⇒ ⟨expr⟩ + ⟨factor⟩ ∗ ⟨id,y⟩
       ⇒ ⟨expr⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩
       ⇒ ⟨term⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩
       ⇒ ⟨factor⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩
       ⇒ ⟨id,x⟩ + ⟨num,2⟩ ∗ ⟨id,y⟩

Again, ⟨goal⟩ ⇒∗ id + num ∗ id, but this time, we build the desired tree:

⟨goal⟩
  ⟨expr⟩
    ⟨expr⟩ → ⟨term⟩ → ⟨factor⟩ → <id,x>
    +
    ⟨term⟩
      ⟨term⟩ → ⟨factor⟩ → <num,2>
      ∗
      ⟨factor⟩ → <id,y>

Treewalk evaluation computes x + (2 ∗ y).
Ambiguity

If a grammar has more than one derivation for a single sentential form, then it is ambiguous.

Example:

⟨stmt⟩ ::= if ⟨expr⟩ then ⟨stmt⟩
        | if ⟨expr⟩ then ⟨stmt⟩ else ⟨stmt⟩
        | other stmts

Consider deriving the sentential form:

if E1 then if E2 then S1 else S2

It has two derivations. This ambiguity is purely grammatical. It is a context-free ambiguity.

Ambiguity

We may be able to eliminate ambiguities by rearranging the grammar:

⟨stmt⟩      ::= ⟨matched⟩
             | ⟨unmatched⟩
⟨matched⟩   ::= if ⟨expr⟩ then ⟨matched⟩ else ⟨matched⟩
             | other stmts
⟨unmatched⟩ ::= if ⟨expr⟩ then ⟨stmt⟩
             | if ⟨expr⟩ then ⟨matched⟩ else ⟨unmatched⟩

This generates the same language as the ambiguous grammar, but applies the common-sense rule:

match each else with the closest unmatched then

This is most likely the language designer's intent.
Ambiguity

Ambiguity is often due to confusion in the context-free specification. Context-sensitive confusions can arise from overloading.

Example:

a = f(17)

In many Algol-like languages, f could be a function or a subscripted variable. Disambiguating this statement requires context:
• need values of declarations
• not context-free
• really an issue of type

Rather than complicate parsing, we will handle this separately.

Parsing: the big picture

grammar → parser generator → parser code
tokens → parser → IR

Our goal is a flexible parser generator system.
Top-down versus bottom-up

Top-down parsers
• start at the root of the derivation tree and fill in
• pick a production and try to match the input
• may require backtracking
• some grammars are backtrack-free (predictive)

Bottom-up parsers
• start at the leaves and fill in
• start in a state valid for legal first tokens
• as input is consumed, change state to encode possibilities (recognize valid prefixes)
• use a stack to store both state and sentential forms

Top-down parsing

A top-down parser starts with the root of the parse tree, labelled with the start or goal symbol of the grammar.

To build a parse, it repeats the following steps until the fringe of the parse tree matches the input string:

1. At a node labelled A, select a production A → α and construct the appropriate child for each symbol of α.
2. When a terminal is added to the fringe that doesn't match the input string, backtrack.
3. Find the next node to be expanded (it must have a label in Vn).

The key is selecting the right production in step 1
⇒ it should be guided by the input string.
Simple expression grammar

Recall our grammar for simple expressions:

1 ⟨goal⟩   ::= ⟨expr⟩
2 ⟨expr⟩   ::= ⟨expr⟩ + ⟨term⟩
3           | ⟨expr⟩ − ⟨term⟩
4           | ⟨term⟩
5 ⟨term⟩   ::= ⟨term⟩ ∗ ⟨factor⟩
6           | ⟨term⟩ / ⟨factor⟩
7           | ⟨factor⟩
8 ⟨factor⟩ ::= num
9           | id

Consider the input string x − 2 ∗ y.

Example

A backtracking top-down parse of x − 2 ∗ y (↑ marks the current input position):

Prod'n  Sentential form               Input
–       ⟨goal⟩                        ↑x − 2 ∗ y
1       ⟨expr⟩                        ↑x − 2 ∗ y
2       ⟨expr⟩ + ⟨term⟩               ↑x − 2 ∗ y
4       ⟨term⟩ + ⟨term⟩               ↑x − 2 ∗ y
7       ⟨factor⟩ + ⟨term⟩             ↑x − 2 ∗ y
9       id + ⟨term⟩                   ↑x − 2 ∗ y
–       id + ⟨term⟩                   x ↑− 2 ∗ y
–       ⟨expr⟩                        ↑x − 2 ∗ y
3       ⟨expr⟩ − ⟨term⟩               ↑x − 2 ∗ y
4       ⟨term⟩ − ⟨term⟩               ↑x − 2 ∗ y
7       ⟨factor⟩ − ⟨term⟩             ↑x − 2 ∗ y
9       id − ⟨term⟩                   ↑x − 2 ∗ y
–       id − ⟨term⟩                   x ↑− 2 ∗ y
–       id − ⟨term⟩                   x − ↑2 ∗ y
7       id − ⟨factor⟩                 x − ↑2 ∗ y
8       id − num                      x − ↑2 ∗ y
–       id − num                      x − 2 ↑∗ y
–       id − ⟨term⟩                   x − ↑2 ∗ y
5       id − ⟨term⟩ ∗ ⟨factor⟩        x − ↑2 ∗ y
7       id − ⟨factor⟩ ∗ ⟨factor⟩      x − ↑2 ∗ y
8       id − num ∗ ⟨factor⟩           x − ↑2 ∗ y
–       id − num ∗ ⟨factor⟩           x − 2 ↑∗ y
–       id − num ∗ ⟨factor⟩           x − 2 ∗ ↑y
9       id − num ∗ id                 x − 2 ∗ ↑y
–       id − num ∗ id                 x − 2 ∗ y↑

The parser backtracks twice: once after mispredicting + for −, and once after matching num as a complete ⟨term⟩ too early.
Example

Another possible parse for x − 2 ∗ y:

Prod'n  Sentential form                     Input
–       ⟨goal⟩                              ↑x − 2 ∗ y
1       ⟨expr⟩                              ↑x − 2 ∗ y
2       ⟨expr⟩ + ⟨term⟩                     ↑x − 2 ∗ y
2       ⟨expr⟩ + ⟨term⟩ + ⟨term⟩            ↑x − 2 ∗ y
2       ⟨expr⟩ + ⟨term⟩ + · · ·             ↑x − 2 ∗ y
2       · · ·                               ↑x − 2 ∗ y

If the parser makes the wrong choices, expansion doesn't terminate. This isn't a good property for a parser to have. (Parsers should terminate!)

Left-recursion

Top-down parsers cannot handle left-recursion in a grammar.

Formally, a grammar is left-recursive if

∃A ∈ Vn such that A ⇒+ Aα for some string α

Our simple expression grammar is left-recursive.
Eliminating left-recursion

To remove left-recursion, we can transform the grammar.

Consider the grammar fragment:

⟨foo⟩ ::= ⟨foo⟩α
       | β

where α and β do not start with ⟨foo⟩.

We can rewrite this as:

⟨foo⟩ ::= β⟨bar⟩
⟨bar⟩ ::= α⟨bar⟩
       | ε

where ⟨bar⟩ is a new non-terminal. This fragment contains no left-recursion.

Example

Our expression grammar contains two cases of left-recursion:

⟨expr⟩ ::= ⟨expr⟩ + ⟨term⟩
        | ⟨expr⟩ − ⟨term⟩
        | ⟨term⟩
⟨term⟩ ::= ⟨term⟩ ∗ ⟨factor⟩
        | ⟨term⟩ / ⟨factor⟩
        | ⟨factor⟩

Applying the transformation gives:

⟨expr⟩  ::= ⟨term⟩⟨expr′⟩
⟨expr′⟩ ::= +⟨term⟩⟨expr′⟩ | −⟨term⟩⟨expr′⟩ | ε
⟨term⟩  ::= ⟨factor⟩⟨term′⟩
⟨term′⟩ ::= ∗⟨factor⟩⟨term′⟩ | /⟨factor⟩⟨term′⟩ | ε

With this grammar, a top-down parser will
• terminate
• backtrack on some inputs
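The rewrite ⟨foo⟩ ::= ⟨foo⟩α | β into ⟨foo⟩ ::= β⟨bar⟩, ⟨bar⟩ ::= α⟨bar⟩ | ε is mechanical. Here is a minimal Python sketch, assuming a grammar stored as a dict from nonterminal to a list of right-hand-side tuples; the dict representation, the helper name, and the prime-suffix naming of the new nonterminal are all illustrative choices, not from the slides:

```python
# Eliminate immediate left-recursion for one nonterminal:
# split A -> A alpha | beta into A -> beta A' and A' -> alpha A' | eps.
def eliminate_left_recursion(prods, nt):
    recursive = [rhs[1:] for rhs in prods[nt] if rhs and rhs[0] == nt]
    other = [rhs for rhs in prods[nt] if not rhs or rhs[0] != nt]
    if not recursive:
        return prods                       # nothing to do
    new_nt = nt + "'"                      # illustrative name for <bar>
    prods[nt] = [beta + (new_nt,) for beta in other]
    # the empty tuple () stands for an epsilon production
    prods[new_nt] = [alpha + (new_nt,) for alpha in recursive] + [()]
    return prods

# expr ::= expr + term | expr - term | term
g = {"expr": [("expr", "+", "term"), ("expr", "-", "term"), ("term",)]}
g = eliminate_left_recursion(g, "expr")
```

After the call, `g["expr"]` holds the single production term expr′, and `g["expr'"]` holds the two recursive alternatives plus ε, matching the transformed grammar above.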
Example

This cleaner grammar defines the same language:

1 ⟨goal⟩   ::= ⟨expr⟩
2 ⟨expr⟩   ::= ⟨term⟩ + ⟨expr⟩
3           | ⟨term⟩ − ⟨expr⟩
4           | ⟨term⟩
5 ⟨term⟩   ::= ⟨factor⟩ ∗ ⟨term⟩
6           | ⟨factor⟩ / ⟨term⟩
7           | ⟨factor⟩
8 ⟨factor⟩ ::= num
9           | id

It is
• right-recursive
• free of ε-productions

Unfortunately, it generates a different associativity. Same syntax, different meaning.

Example

Our long-suffering expression grammar. Recall, we factored out left-recursion:

1  ⟨goal⟩   ::= ⟨expr⟩
2  ⟨expr⟩   ::= ⟨term⟩⟨expr′⟩
3  ⟨expr′⟩  ::= +⟨term⟩⟨expr′⟩
4            | −⟨term⟩⟨expr′⟩
5            | ε
6  ⟨term⟩   ::= ⟨factor⟩⟨term′⟩
7  ⟨term′⟩  ::= ∗⟨factor⟩⟨term′⟩
8            | /⟨factor⟩⟨term′⟩
9            | ε
10 ⟨factor⟩ ::= num
11           | id
How much lookahead is needed?

We saw that top-down parsers may need to backtrack when they select the wrong production.

Do we need arbitrary lookahead to parse CFGs?
• in general, yes
• use the Earley or Cocke-Younger-Kasami algorithms

Fortunately,
• large subclasses of CFGs can be parsed with limited lookahead
• most programming language constructs can be expressed in a grammar that falls in these subclasses

Among the interesting subclasses are:
LL(1): left-to-right scan, leftmost derivation, 1-token lookahead; and
LR(1): left-to-right scan, rightmost derivation, 1-token lookahead

Predictive parsing

Basic idea:
For any two productions A → α | β, we would like a distinct way of choosing the correct production to expand.

For some RHS α ∈ G, define FIRST(α) as the set of tokens that appear first in some string derived from α. That is, for some w ∈ Vt∗, w ∈ FIRST(α) iff α ⇒∗ wγ.

Key property:
Whenever two productions A → α and A → β both appear in the grammar, we would like

FIRST(α) ∩ FIRST(β) = ∅

This would allow the parser to make a correct choice with a lookahead of only one symbol! The example grammar has this property!
Left factoring

What if a grammar does not have this property? Sometimes, we can transform a grammar to have this property.

For each non-terminal A, find the longest prefix α common to two or more of its alternatives.

if α ≠ ε then replace all of the A productions
  A → αβ1 | αβ2 | · · · | αβn
with
  A → αA′
  A′ → β1 | β2 | · · · | βn
where A′ is a new non-terminal.

Repeat until no two alternatives for a single non-terminal have a common prefix.

Example

Consider a right-recursive version of the expression grammar:

1 ⟨goal⟩   ::= ⟨expr⟩
2 ⟨expr⟩   ::= ⟨term⟩ + ⟨expr⟩
3           | ⟨term⟩ − ⟨expr⟩
4           | ⟨term⟩
5 ⟨term⟩   ::= ⟨factor⟩ ∗ ⟨term⟩
6           | ⟨factor⟩ / ⟨term⟩
7           | ⟨factor⟩
8 ⟨factor⟩ ::= num
9           | id

To choose between productions 2, 3, & 4, the parser must see past the num or id and look at the +, −, ∗, or /.

FIRST(2) ∩ FIRST(3) ∩ FIRST(4) ≠ ∅

This grammar fails the test. Note: This grammar is right-associative.
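One round of the left-factoring transformation can be sketched in a few lines of Python. This is a sketch under the same illustrative dict-of-tuples grammar representation as before (one factoring step per call; a full implementation would repeat until no common prefix remains, as the slide says):

```python
def common_prefix(a, b):
    """Longest common prefix of two RHS tuples."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return tuple(out)

def left_factor(prods, nt):
    """Replace A -> alpha b1 | alpha b2 | ... with A -> alpha A', A' -> b1 | b2 | ..."""
    alts = prods[nt]
    best = ()
    for i in range(len(alts)):             # find the longest prefix shared
        for j in range(i + 1, len(alts)):  # by two or more alternatives
            p = common_prefix(alts[i], alts[j])
            if len(p) > len(best):
                best = p
    if not best:
        return prods                       # already prefix-free
    new_nt = nt + "'"
    factored = [rhs[len(best):] for rhs in alts if rhs[:len(best)] == best]
    rest = [rhs for rhs in alts if rhs[:len(best)] != best]
    prods[nt] = [best + (new_nt,)] + rest
    prods[new_nt] = factored               # () again stands for epsilon
    return prods

# expr ::= term + expr | term - expr | term
g = {"expr": [("term", "+", "expr"), ("term", "-", "expr"), ("term",)]}
g = left_factor(g, "expr")
```

This yields expr ::= term expr′ and expr′ ::= +expr | −expr | ε, as in the worked example that follows.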
Example

There are two nonterminals that must be left-factored:

⟨expr⟩ ::= ⟨term⟩ + ⟨expr⟩
        | ⟨term⟩ − ⟨expr⟩
        | ⟨term⟩
⟨term⟩ ::= ⟨factor⟩ ∗ ⟨term⟩
        | ⟨factor⟩ / ⟨term⟩
        | ⟨factor⟩

Applying the transformation gives us:

⟨expr⟩  ::= ⟨term⟩⟨expr′⟩
⟨expr′⟩ ::= +⟨expr⟩
         | −⟨expr⟩
         | ε
⟨term⟩  ::= ⟨factor⟩⟨term′⟩
⟨term′⟩ ::= ∗⟨term⟩
         | /⟨term⟩
         | ε

Example

Substituting back into the grammar yields:

1  ⟨goal⟩   ::= ⟨expr⟩
2  ⟨expr⟩   ::= ⟨term⟩⟨expr′⟩
3  ⟨expr′⟩  ::= +⟨expr⟩
4            | −⟨expr⟩
5            | ε
6  ⟨term⟩   ::= ⟨factor⟩⟨term′⟩
7  ⟨term′⟩  ::= ∗⟨term⟩
8            | /⟨term⟩
9            | ε
10 ⟨factor⟩ ::= num
11           | id

Now, selection requires only a single-token lookahead. Note: This grammar is still right-associative.
Example

Parsing x − 2 ∗ y with the left-factored grammar (↑ marks the current input position):

Prod'n  Sentential form                          Input
–       ⟨goal⟩                                   ↑x − 2 ∗ y
1       ⟨expr⟩                                   ↑x − 2 ∗ y
2       ⟨term⟩⟨expr′⟩                            ↑x − 2 ∗ y
6       ⟨factor⟩⟨term′⟩⟨expr′⟩                   ↑x − 2 ∗ y
11      id⟨term′⟩⟨expr′⟩                         ↑x − 2 ∗ y
–       id⟨term′⟩⟨expr′⟩                         x ↑− 2 ∗ y
9       id⟨expr′⟩                                x ↑− 2 ∗ y
4       id − ⟨expr⟩                              x ↑− 2 ∗ y
–       id − ⟨expr⟩                              x − ↑2 ∗ y
2       id − ⟨term⟩⟨expr′⟩                       x − ↑2 ∗ y
6       id − ⟨factor⟩⟨term′⟩⟨expr′⟩              x − ↑2 ∗ y
10      id − num⟨term′⟩⟨expr′⟩                   x − ↑2 ∗ y
–       id − num⟨term′⟩⟨expr′⟩                   x − 2 ↑∗ y
7       id − num ∗ ⟨term⟩⟨expr′⟩                 x − 2 ↑∗ y
–       id − num ∗ ⟨term⟩⟨expr′⟩                 x − 2 ∗ ↑y
6       id − num ∗ ⟨factor⟩⟨term′⟩⟨expr′⟩        x − 2 ∗ ↑y
11      id − num ∗ id⟨term′⟩⟨expr′⟩              x − 2 ∗ ↑y
–       id − num ∗ id⟨term′⟩⟨expr′⟩              x − 2 ∗ y↑
9       id − num ∗ id⟨expr′⟩                     x − 2 ∗ y↑
5       id − num ∗ id                            x − 2 ∗ y↑

The next symbol determined each choice correctly.

Back to left-recursion elimination

Given a left-factored CFG, to eliminate left-recursion:

if ∃ A → Aα then replace all of the A productions
  A → Aα | β | . . . | γ
with
  A → NA′
  N → β | . . . | γ
  A′ → αA′ | ε
where N and A′ are new non-terminals.

Repeat until there are no left-recursive productions.
Generality

Question:
By left factoring and eliminating left-recursion, can we transform an arbitrary context-free grammar to a form where it can be predictively parsed with a single-token lookahead?

Answer:
Given a context-free grammar that doesn't meet our conditions, it is undecidable whether an equivalent grammar exists that does meet our conditions.

Many context-free languages do not have such a grammar:

{aⁿ0bⁿ | n ≥ 1} ∪ {aⁿ1b²ⁿ | n ≥ 1}

Must look past an arbitrary number of a's to discover the 0 or the 1 and so determine the derivation.

Recursive descent parsing

Now, we can produce a simple recursive descent parser from the (right-associative) grammar.

goal:
  token ← next_token();
  if (expr() = ERROR or token ≠ EOF) then
    return ERROR;

expr:
  if (term() = ERROR) then
    return ERROR;
  else return expr_prime();

expr_prime:
  if (token = PLUS) then
    token ← next_token();
    return expr();
  else if (token = MINUS) then
    token ← next_token();
    return expr();
  else return OK;
Recursive descent parsing

term:
  if (factor() = ERROR) then
    return ERROR;
  else return term_prime();

term_prime:
  if (token = MULT) then
    token ← next_token();
    return term();
  else if (token = DIV) then
    token ← next_token();
    return term();
  else return OK;

factor:
  if (token = NUM) then
    token ← next_token();
    return OK;
  else if (token = ID) then
    token ← next_token();
    return OK;
  else return ERROR;

Building the tree

One of the key jobs of the parser is to build an intermediate representation of the source code.

To build an abstract syntax tree, we can simply insert code at the appropriate points:
• factor() can stack nodes id, num
• term_prime() can stack nodes ∗, /
• term() can pop 3, build and push subtree
• expr_prime() can stack nodes +, −
• expr() can pop 3, build and push subtree
• goal() can pop and return tree
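The pseudocode above transcribes almost directly into Python. This is a sketch of the recognizer only (no tree building), assuming tokens arrive as (kind, text) pairs — the token representation and the boolean OK/ERROR convention are our own illustrative choices:

```python
# Recursive descent recognizer for the left-factored grammar:
#   expr ::= term expr' ; expr' ::= +expr | -expr | eps
#   term ::= factor term' ; term' ::= *term | /term | eps
#   factor ::= num | id
def parse(tokens):
    toks = list(tokens) + [("EOF", "")]
    pos = 0

    def peek():
        return toks[pos][0]

    def advance():
        nonlocal pos
        pos += 1

    def factor():
        if peek() in ("NUM", "ID"):
            advance()
            return True
        return False                      # ERROR

    def term():
        return factor() and term_prime()

    def term_prime():
        if peek() in ("MULT", "DIV"):     # <term'> ::= *<term> | /<term>
            advance()
            return term()
        return True                       # <term'> ::= eps

    def expr():
        return term() and expr_prime()

    def expr_prime():
        if peek() in ("PLUS", "MINUS"):   # <expr'> ::= +<expr> | -<expr>
            advance()
            return expr()
        return True                       # <expr'> ::= eps

    return expr() and peek() == "EOF"     # goal

ok = parse([("ID", "x"), ("MINUS", "-"), ("NUM", "2"),
            ("MULT", "*"), ("ID", "y")])  # x - 2 * y
```

Each nonterminal becomes one function, and the single-token lookahead (`peek()`) selects the production, exactly as on the slides.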
Non-recursive predictive parsing

Observation:
Our recursive descent parser encodes state information in its run-time stack, or call stack.

Using recursive procedure calls to implement a stack abstraction may not be particularly efficient.

This suggests other implementation methods:
• explicit stack, hand-coded parser
• stack-based, table-driven parser

Non-recursive predictive parsing

Now, a predictive parser looks like:

source code → scanner → tokens → table-driven parser → IR
(the parser consults an explicit stack and a set of parsing tables)

Rather than writing code, we build tables. Building tables can be automated!
Table-driven parsers

A parser generator system often looks like:

grammar → parser generator → parsing tables
source code → scanner → tokens → table-driven parser → IR
(the parser consults an explicit stack and the generated parsing tables)

This is true for both top-down (LL) and bottom-up (LR) parsers.

Non-recursive predictive parsing

Input: a string w and a parsing table M for G

tos ← 0
Stack[tos] ← EOF
Stack[++tos] ← Start Symbol
token ← next_token()
repeat
  X ← Stack[tos]
  if X is a terminal or EOF then
    if X = token then
      pop X
      token ← next_token()
    else error()
  else /* X is a non-terminal */
    if M[X, token] = X → Y1Y2 · · ·Yk then
      pop X
      push Yk, Yk−1, · · ·, Y1
    else error()
until X = EOF
Non-recursive predictive parsing

What we need now is a parsing table M.

Our expression grammar:

1  ⟨goal⟩   ::= ⟨expr⟩
2  ⟨expr⟩   ::= ⟨term⟩⟨expr′⟩
3  ⟨expr′⟩  ::= +⟨expr⟩
4            | −⟨expr⟩
5            | ε
6  ⟨term⟩   ::= ⟨factor⟩⟨term′⟩
7  ⟨term′⟩  ::= ∗⟨term⟩
8            | /⟨term⟩
9            | ε
10 ⟨factor⟩ ::= num
11           | id

Its parse table:

            id    num   +     −     ∗     /     $†
⟨goal⟩      1     1     –     –     –     –     –
⟨expr⟩      2     2     –     –     –     –     –
⟨expr′⟩     –     –     3     4     –     –     5
⟨term⟩      6     6     –     –     –     –     –
⟨term′⟩     –     –     9     9     7     8     9
⟨factor⟩    11    10    –     –     –     –     –

† we use $ to represent EOF

FIRST

For a string of grammar symbols α, define FIRST(α) as:
• the set of terminal symbols that begin strings derived from α: {a ∈ Vt | α ⇒∗ aβ}
• If α ⇒∗ ε then ε ∈ FIRST(α)

FIRST(α) contains the set of tokens valid in the initial position in α.

To build FIRST(X):

1. If X ∈ Vt then FIRST(X) is {X}
2. If X → ε then add ε to FIRST(X)
3. If X → Y1Y2 · · ·Yk:
  (a) Put FIRST(Y1) − {ε} in FIRST(X)
  (b) ∀i : 1 < i ≤ k, if ε ∈ FIRST(Y1) ∩ · · · ∩ FIRST(Yi−1) (i.e., Y1 · · ·Yi−1 ⇒∗ ε) then put FIRST(Yi) − {ε} in FIRST(X)
  (c) If ε ∈ FIRST(Y1) ∩ · · · ∩ FIRST(Yk) then put ε in FIRST(X)

Repeat until no more additions can be made.
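The FIRST construction is a straightforward fixed-point computation. Below is a Python sketch, assuming a grammar given as a list of (lhs, rhs-tuple) productions with the empty tuple for ε and the string `"eps"` as our marker for ε (both representation choices are ours, not from the slides):

```python
EPS = "eps"

def first_sets(prods, terminals):
    """Compute FIRST(X) for every grammar symbol by iterating to a fixed point."""
    first = {t: {t} for t in terminals}     # rule 1: FIRST(a) = {a}
    for lhs, _ in prods:
        first.setdefault(lhs, set())
    changed = True
    while changed:                          # repeat until no more additions
        changed = False
        for lhs, rhs in prods:
            before = len(first[lhs])
            if not rhs:                     # rule 2: X -> eps
                first[lhs].add(EPS)
            for sym in rhs:                 # rule 3: scan Y1 ... Yk
                first[lhs] |= first[sym] - {EPS}
                if EPS not in first[sym]:
                    break                   # Yi cannot vanish; stop here
            else:
                if rhs:                     # every Yi derives eps
                    first[lhs].add(EPS)
            changed |= len(first[lhs]) != before
    return first

# A reduced version of the expression grammar (without - and /):
prods = [("E", ("T", "E'")), ("E'", ("+", "E")), ("E'", ()),
         ("T", ("F", "T'")), ("T'", ("*", "T")), ("T'", ()),
         ("F", ("id",)), ("F", ("num",))]
fs = first_sets(prods, {"+", "*", "id", "num"})
```

On this grammar the computation yields FIRST(E) = {id, num}, FIRST(E′) = {+, ε}, and FIRST(T′) = {∗, ε}, matching the sets tabulated later in these notes.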
FOLLOW

For a non-terminal A, define FOLLOW(A) as

the set of terminals that can appear immediately to the right of A in some sentential form.

Thus, a non-terminal's FOLLOW set specifies the tokens that can legally appear after it. A terminal symbol has no FOLLOW set.

To build FOLLOW(A):

1. Put $ in FOLLOW(⟨goal⟩)
2. If A → αBβ:
  (a) Put FIRST(β) − {ε} in FOLLOW(B)
  (b) If β = ε (i.e., A → αB) or ε ∈ FIRST(β) (i.e., β ⇒∗ ε) then put FOLLOW(A) in FOLLOW(B)

Repeat until no more additions can be made.

LL(1) grammars

Previous definition:
A grammar G is LL(1) iff for all non-terminals A, each distinct pair of productions A → β and A → γ satisfy the condition FIRST(β) ∩ FIRST(γ) = ∅.

What if A ⇒∗ ε?

Revised definition:
A grammar G is LL(1) iff for each set of productions A → α1 | α2 | · · · | αn:

1. FIRST(α1), FIRST(α2), . . . , FIRST(αn) are all pairwise disjoint
2. If αi ⇒∗ ε then FIRST(αj) ∩ FOLLOW(A) = ∅, ∀ 1 ≤ j ≤ n, i ≠ j.

If G is ε-free, condition 1 is sufficient.
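The FOLLOW construction is another fixed-point pass. A Python sketch, using the same list-of-productions representation and hand-written FIRST sets for the reduced grammar (no − or /); the representation and names are illustrative:

```python
EPS, EOF = "eps", "$"

def follow_sets(prods, start, first):
    """Compute FOLLOW(A) for every nonterminal by iterating to a fixed point."""
    follow = {lhs: set() for lhs, _ in prods}
    follow[start].add(EOF)                  # rule 1: $ in FOLLOW(<goal>)
    changed = True
    while changed:
        changed = False
        for lhs, rhs in prods:
            for i, b in enumerate(rhs):
                if b not in follow:         # terminals have no FOLLOW set
                    continue
                before = len(follow[b])
                nullable = True             # does beta = rhs[i+1:] derive eps?
                for sym in rhs[i + 1:]:
                    follow[b] |= first[sym] - {EPS}   # rule 2(a)
                    if EPS not in first[sym]:
                        nullable = False
                        break
                if nullable:                           # rule 2(b)
                    follow[b] |= follow[lhs]
                changed |= len(follow[b]) != before
    return follow

first = {"+": {"+"}, "*": {"*"}, "id": {"id"}, "num": {"num"},
         "E": {"id", "num"}, "E'": {"+", EPS},
         "T": {"id", "num"}, "T'": {"*", EPS}, "F": {"id", "num"}}
prods = [("S", ("E",)), ("E", ("T", "E'")), ("E'", ("+", "E")), ("E'", ()),
         ("T", ("F", "T'")), ("T'", ("*", "T")), ("T'", ()),
         ("F", ("id",)), ("F", ("num",))]
fol = follow_sets(prods, "S", first)
```

The result — FOLLOW(E) = {$}, FOLLOW(T) = {+, $}, FOLLOW(F) = {∗, +, $} — agrees with the full-grammar sets shown later, minus the omitted − and / tokens.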
LL(1) grammars

Provable facts about LL(1) grammars:

1. No left-recursive grammar is LL(1)
2. No ambiguous grammar is LL(1)
3. Some languages have no LL(1) grammar
4. An ε-free grammar where each alternative expansion for A begins with a distinct terminal is a simple LL(1) grammar.

Example:
• S → aS | a is not LL(1) because FIRST(aS) = FIRST(a) = {a}
• S → aS′
  S′ → aS′ | ε
  accepts the same language and is LL(1)

LL(1) parse table construction

Input: Grammar G
Output: Parsing table M

Method:
1. ∀ productions A → α:
  (a) ∀a ∈ FIRST(α), add A → α to M[A, a]
  (b) If ε ∈ FIRST(α):
    i. ∀b ∈ FOLLOW(A), add A → α to M[A, b]
    ii. If $ ∈ FOLLOW(A) then add A → α to M[A, $]
2. Set each undefined entry of M to error.

If ∃ M[A, a] with multiple entries then the grammar is not LL(1).

Note: recall a, b ∈ Vt, so a, b ≠ ε.
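The table construction can be sketched directly from the method above. This assumes per-symbol FIRST and FOLLOW sets are already available (hand-written here for the reduced grammar without − and /); a conflict — multiple entries in one cell — is reported by raising an exception:

```python
EPS, EOF = "eps", "$"

first = {"+": {"+"}, "*": {"*"}, "id": {"id"}, "num": {"num"},
         "E": {"id", "num"}, "E'": {"+", EPS},
         "T": {"id", "num"}, "T'": {"*", EPS}, "F": {"id", "num"}}
follow = {"E": {EOF}, "E'": {EOF}, "T": {"+", EOF},
          "T'": {"+", EOF}, "F": {"*", "+", EOF}}

def first_of(rhs, first):
    """FIRST of a whole right-hand side (a string of grammar symbols)."""
    out = set()
    for sym in rhs:
        out |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return out
    out.add(EPS)                        # every symbol can derive epsilon
    return out

def build_table(prods, first, follow):
    table = {}
    for lhs, rhs in prods:
        f = first_of(rhs, first)
        targets = f - {EPS}             # step 1(a): FIRST columns
        if EPS in f:                    # step 1(b): FOLLOW columns (incl. $)
            targets |= follow[lhs]
        for a in targets:
            if (lhs, a) in table:       # multiple entries: not LL(1)
                raise ValueError("grammar is not LL(1)")
            table[(lhs, a)] = rhs
    return table

prods = [("E", ("T", "E'")), ("E'", ("+", "E")), ("E'", ()),
         ("T", ("F", "T'")), ("T'", ("*", "T")), ("T'", ()),
         ("F", ("id",)), ("F", ("num",))]
M = build_table(prods, first, follow)
```

For instance the ε-production E′ → ε lands in M[E′, $] because $ ∈ FOLLOW(E′), mirroring column $ of the table shown earlier.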
Example

Our long-suffering expression grammar, in shorthand:

S → E           T → FT′
E → TE′         T′ → ∗T | /T | ε
E′ → +E | −E | ε    F → id | num

Its FIRST and FOLLOW sets:

      FIRST            FOLLOW
S     {num, id}        {$}
E     {num, id}        {$}
E′    {ε, +, −}        {$}
T     {num, id}        {+, −, $}
T′    {ε, ∗, /}        {+, −, $}
F     {num, id}        {+, −, ∗, /, $}
id    {id}             −
num   {num}            −
∗     {∗}              −
/     {/}              −
+     {+}              −
−     {−}              −

The parse table:

     id       num      +        −        ∗        /        $
S    S→E      S→E      –        –        –        –        –
E    E→TE′    E→TE′    –        –        –        –        –
E′   –        –        E′→+E    E′→−E    –        –        E′→ε
T    T→FT′    T→FT′    –        –        –        –        –
T′   –        –        T′→ε     T′→ε     T′→∗T    T′→/T    T′→ε
F    F→id     F→num    –        –        –        –        –

Building the tree

Again, we insert code at the right points:

tos ← 0
Stack[tos] ← EOF
Stack[++tos] ← root node
Stack[++tos] ← Start Symbol
token ← next_token()
repeat
  X ← Stack[tos]
  if X is a terminal or EOF then
    if X = token then
      pop X
      token ← next_token()
      pop and fill in node
    else error()
  else /* X is a non-terminal */
    if M[X, token] = X → Y1Y2 · · ·Yk then
      pop X
      pop node for X
      build node for each child and make it a child of node for X
      push nk, Yk, nk−1, Yk−1, · · ·, n1, Y1
    else error()
until X = EOF
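The table-driven driver from the earlier slide can be exercised against (a reduced slice of) this table. A Python sketch, recognizing only, with the id/num/+/∗ columns transcribed by hand (the dict encoding is our own choice):

```python
EOF = "$"
# M[(nonterminal, lookahead)] = right-hand side; () is epsilon
table = {
    ("S", "id"): ("E",), ("S", "num"): ("E",),
    ("E", "id"): ("T", "E'"), ("E", "num"): ("T", "E'"),
    ("E'", "+"): ("+", "E"), ("E'", EOF): (),
    ("T", "id"): ("F", "T'"), ("T", "num"): ("F", "T'"),
    ("T'", "+"): (), ("T'", "*"): ("*", "T"), ("T'", EOF): (),
    ("F", "id"): ("id",), ("F", "num"): ("num",),
}
NONTERMS = {"S", "E", "E'", "T", "T'", "F"}

def parse(tokens):
    stack = [EOF, "S"]                     # EOF below the start symbol
    toks = list(tokens) + [EOF]
    pos = 0
    while True:
        x = stack[-1]
        if x not in NONTERMS:              # terminal or EOF on top
            if x != toks[pos]:
                return False               # error()
            if x == EOF:
                return True                # whole input consumed
            stack.pop()
            pos += 1
        else:
            rhs = table.get((x, toks[pos]))
            if rhs is None:
                return False               # error(): blank table entry
            stack.pop()
            stack.extend(reversed(rhs))    # push Yk ... Y1
```

The stack discipline — push the RHS reversed so Y1 surfaces first — is exactly the `push Yk, Yk−1, · · ·, Y1` step of the pseudocode.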
A grammar that is not LL(1)

⟨stmt⟩ ::= if ⟨expr⟩ then ⟨stmt⟩
        | if ⟨expr⟩ then ⟨stmt⟩ else ⟨stmt⟩
        | . . .

Left-factored:

⟨stmt⟩  ::= if ⟨expr⟩ then ⟨stmt⟩ ⟨stmt′⟩ | . . .
⟨stmt′⟩ ::= else ⟨stmt⟩ | ε

Now, FIRST(⟨stmt′⟩) = {ε, else}
Also, FOLLOW(⟨stmt′⟩) = {else, $}
But, FIRST(⟨stmt′⟩) ∩ FOLLOW(⟨stmt′⟩) = {else} ≠ ∅

On seeing else, there is a conflict between choosing

⟨stmt′⟩ ::= else ⟨stmt⟩ and ⟨stmt′⟩ ::= ε

⇒ the grammar is not LL(1)!

The fix:
Put priority on ⟨stmt′⟩ ::= else ⟨stmt⟩ to associate else with the closest previous then.

Error recovery

Key notion:
• For each non-terminal, construct a set of terminals on which the parser can synchronize
• When an error occurs looking for A, scan until an element of SYNCH(A) is found

Building SYNCH:
1. a ∈ FOLLOW(A) ⇒ a ∈ SYNCH(A)
2. place keywords that start statements in SYNCH(A)
3. add symbols in FIRST(A) to SYNCH(A)

If we can't match a terminal on top of stack:
1. pop the terminal
2. print a message saying the terminal was inserted
3. continue the parse
(i.e., SYNCH(a) = Vt − {a})
LL(k) parsing

Definition:
G is strong LL(k) iff for A → β and A → γ (β ≠ γ):

FIRSTk(β FOLLOWk(A)) ∩ FIRSTk(γ FOLLOWk(A)) = ∅

LL(1) = Strong LL(1), but Strong LL(k) ⊂ LL(k) for k > 1.

Definition:
G is LL(k) iff:
1. S ⇒∗lm wAα ⇒lm wβα ⇒∗ wx
2. S ⇒∗lm wAα ⇒lm wγα ⇒∗ wy
3. FIRSTk(x) = FIRSTk(y)
⇒ β = γ

i.e., knowing w, A, and the next k input tokens, FIRSTk(x) = FIRSTk(y) accurately predicts the production to use: A → β (= γ).

• LL(k) ⊂ LL(k + 1)
• Strong LL(k) ⊂ Strong LL(k + 1)
• Strong LL(k) ⊂ LL(k), k > 1

Strong LL(k): example

Consider the grammar G:

G → S$
S → aAa     A → b
S → bAba    A → ε

L(G) = {aba$, aa$, bbba$, bba$}

G is NOT LL(1):
Predict(A → b) = FIRST1(b) = {b}
Predict(A → ε) = FOLLOW1(A) = {a, b}

G is NOT Strong LL(2):
Predict(A → b) = FIRST2(b FOLLOW2(A)) = {ba, bb}
Predict(A → ε) = FIRST2(ε FOLLOW2(A)) = {a$, ba}

G is LL(2):
In aAa: Predict(A → b) = {ba} and Predict(A → ε) = {a$}
In bAba: Predict(A → b) = {bb} and Predict(A → ε) = {ba}

Some definitions

Recall:
For a grammar G, with start symbol S, any string α such that S ⇒∗ α is called a sentential form.
• If α ∈ Vt∗, then α is called a sentence in L(G)
• Otherwise it is just a sentential form (not a sentence in L(G))

A left-sentential form is a sentential form that occurs in the leftmost derivation of some sentence.
A right-sentential form is a sentential form that occurs in the rightmost derivation of some sentence.

Bottom-up parsing

Goal:
Given an input string w and a grammar G, construct a parse tree by starting at the leaves and working to the root.

The parser repeatedly matches a right-sentential form from the language against the tree's upper frontier. At each match, it applies a reduction to build on the frontier:
• each reduction matches an upper frontier of the partially built tree to the RHS of some production
• each reduction adds a node on top of the frontier

The final result is a rightmost derivation, in reverse.
Example

Consider the grammar

1 S → aABe
2 A → Abc
3   | b
4 B → d

and the input string abbcde.

Prod'n  Sentential Form
3       a b bcde
2       a Abc de
4       aA d e
1       aABe
–       S

The trick appears to be scanning the input and finding valid sentential forms.

Handles

What are we trying to find?

A substring α of the tree's upper frontier that matches some production A → α where reducing α to A is one step in the reverse of a rightmost derivation.

We call such a string a handle.

Formally:
a handle of a right-sentential form γ is a production A → β and a position in γ where β may be found and replaced by A to produce the previous right-sentential form in a rightmost derivation of γ.

i.e., if S ⇒∗rm αAw ⇒rm αβw then A → β in the position following α is a handle of αβw.

Because γ is a right-sentential form, the substring to the right of a handle contains only terminal symbols.
Handles

The handle A → β in the parse tree for αβw: the tree is rooted at S, with frontier α to the left, the subtree for A deriving β in the middle, and the remaining terminals w to the right.

Handles

Theorem:
If G is unambiguous then every right-sentential form has a unique handle.

Proof: (by definition)
1. G is unambiguous ⇒ the rightmost derivation is unique
2. ⇒ a unique production A → β applied to take γi−1 to γi
3. ⇒ a unique position k at which A → β is applied
4. ⇒ a unique handle A → β
Example

The left-recursive expression grammar (original form):

1 ⟨goal⟩   ::= ⟨expr⟩
2 ⟨expr⟩   ::= ⟨expr⟩ + ⟨term⟩
3           | ⟨expr⟩ − ⟨term⟩
4           | ⟨term⟩
5 ⟨term⟩   ::= ⟨term⟩ ∗ ⟨factor⟩
6           | ⟨term⟩ / ⟨factor⟩
7           | ⟨factor⟩
8 ⟨factor⟩ ::= num
9           | id

A rightmost derivation of x − 2 ∗ y:

Prod'n  Sentential Form
–       ⟨goal⟩
1       ⟨expr⟩
3       ⟨expr⟩ − ⟨term⟩
5       ⟨expr⟩ − ⟨term⟩ ∗ ⟨factor⟩
9       ⟨expr⟩ − ⟨term⟩ ∗ id
7       ⟨expr⟩ − ⟨factor⟩ ∗ id
8       ⟨expr⟩ − num ∗ id
4       ⟨term⟩ − num ∗ id
7       ⟨factor⟩ − num ∗ id
9       id − num ∗ id

Handle-pruning

The process to construct a bottom-up parse is called handle-pruning.

To construct a rightmost derivation

S = γ0 ⇒ γ1 ⇒ γ2 ⇒ · · · ⇒ γn−1 ⇒ γn = w

we set i to n and apply the following simple algorithm:

for i = n downto 1
  1. find the handle Ai → βi in γi
  2. replace βi with Ai to generate γi−1

This takes 2n steps, where n is the length of the derivation.
Stack implementation
One scheme to implement a handle-pruning, bottom-up parser is called a shift-reduce parser.

Shift-reduce parsers use a stack and an input buffer:

1. initialize stack with $
2. Repeat until the top of the stack is the goal symbol and the input token is $:
  a) find the handle
     if we don't have a handle on top of the stack, shift an input symbol onto the stack
  b) prune the handle
     if we have a handle A → β on the stack, reduce:
     i) pop |β| symbols off the stack
     ii) push A onto the stack

Example: back to x − 2 ∗ y

Using the left-recursive expression grammar:

1 ⟨goal⟩   ::= ⟨expr⟩
2 ⟨expr⟩   ::= ⟨expr⟩ + ⟨term⟩
3           | ⟨expr⟩ − ⟨term⟩
4           | ⟨term⟩
5 ⟨term⟩   ::= ⟨term⟩ ∗ ⟨factor⟩
6           | ⟨term⟩ / ⟨factor⟩
7           | ⟨factor⟩
8 ⟨factor⟩ ::= num
9           | id

Stack                          Input           Action
$                              id − num ∗ id   shift
$ id                           − num ∗ id      reduce 9
$ ⟨factor⟩                     − num ∗ id      reduce 7
$ ⟨term⟩                       − num ∗ id      reduce 4
$ ⟨expr⟩                       − num ∗ id      shift
$ ⟨expr⟩ −                     num ∗ id        shift
$ ⟨expr⟩ − num                 ∗ id            reduce 8
$ ⟨expr⟩ − ⟨factor⟩            ∗ id            reduce 7
$ ⟨expr⟩ − ⟨term⟩              ∗ id            shift
$ ⟨expr⟩ − ⟨term⟩ ∗            id              shift
$ ⟨expr⟩ − ⟨term⟩ ∗ id                         reduce 9
$ ⟨expr⟩ − ⟨term⟩ ∗ ⟨factor⟩                   reduce 5
$ ⟨expr⟩ − ⟨term⟩                              reduce 3
$ ⟨expr⟩                                       reduce 1
$ ⟨goal⟩                                       accept

1. Shift until the top of the stack is the right end of a handle
2. Find the left end of the handle and reduce

5 shifts + 9 reduces + 1 accept
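The stack discipline in this trace can be replayed mechanically. The sketch below drives the stack from the listed action sequence rather than from a handle-finding oracle (finding handles is exactly what the LR tables on the following slides automate); the production numbering matches the grammar above, and the script encoding is our own:

```python
# Production number -> (LHS, RHS) for the left-recursive expression grammar
prods = {1: ("goal", ("expr",)),
         3: ("expr", ("expr", "-", "term")),
         4: ("expr", ("term",)),
         5: ("term", ("term", "*", "factor")),
         7: ("term", ("factor",)),
         8: ("factor", ("num",)),
         9: ("factor", ("id",))}

def run(tokens, script):
    """Replay a shift/reduce action script; return True on a successful parse."""
    stack, toks = ["$"], list(tokens)
    for act in script:
        if act == "shift":
            stack.append(toks.pop(0))
        else:                                        # ("reduce", n)
            lhs, rhs = prods[act[1]]
            assert tuple(stack[-len(rhs):]) == rhs   # handle is on top of stack
            del stack[-len(rhs):]                    # pop |rhs| symbols
            stack.append(lhs)                        # push the LHS
    return stack == ["$", "goal"] and not toks       # accept condition

# The 5 shifts + 9 reduces from the trace above:
script = ["shift", ("reduce", 9), ("reduce", 7), ("reduce", 4),
          "shift", "shift", ("reduce", 8), ("reduce", 7), "shift",
          "shift", ("reduce", 9), ("reduce", 5), ("reduce", 3), ("reduce", 1)]
done = run(["id", "-", "num", "*", "id"], script)
```

The in-loop assertion checks the defining property of a handle: the right end of the handle is always on top of the stack when we reduce.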
Shift-reduce parsing

Shift-reduce parsers are simple to understand.

A shift-reduce parser has just four canonical actions:

1. shift — the next input symbol is shifted onto the top of the stack
2. reduce — the right end of the handle is on top of the stack; locate the left end of the handle within the stack; pop the handle off the stack and push the appropriate non-terminal LHS
3. accept — terminate parsing and signal success
4. error — call an error recovery routine

Key insight: recognize handles with a DFA:
• DFA transitions shift states instead of symbols
• accepting states trigger reductions

LR parsing

The skeleton parser:

push s0
token ← next_token()
repeat forever
  s ← top of stack
  if action[s, token] = "shift si" then
    push si
    token ← next_token()
  else if action[s, token] = "reduce A → β" then
    pop |β| states
    s′ ← top of stack
    push goto[s′, A]
  else if action[s, token] = "accept" then
    return
  else error()

This takes k shifts, l reduces, and 1 accept, where k is the length of the input string and l is the length of the reverse rightmost derivation.
Example tables

The Grammar:

1 ⟨goal⟩   ::= ⟨expr⟩
2 ⟨expr⟩   ::= ⟨term⟩ + ⟨expr⟩
3           | ⟨term⟩
4 ⟨term⟩   ::= ⟨factor⟩ ∗ ⟨term⟩
5           | ⟨factor⟩
6 ⟨factor⟩ ::= id

Note: This is a simple little right-recursive grammar; not the same as in previous lectures.

The tables:

               ACTION                     GOTO
state    id    +     ∗     $      ⟨expr⟩  ⟨term⟩  ⟨factor⟩
0        s4    –     –     –      1       2       3
1        –     –     –     acc    –       –       –
2        –     s5    –     r3     –       –       –
3        –     r5    s6    r5     –       –       –
4        –     r6    r6    r6     –       –       –
5        s4    –     –     –      7       2       3
6        s4    –     –     –      –       8       3
7        –     –     –     r2     –       –       –
8        –     r4    –     r4     –       –       –

Example using the tables

Stack            Input            Action
$ 0              id ∗ id + id $   s4
$ 0 4            ∗ id + id $      r6
$ 0 3            ∗ id + id $      s6
$ 0 3 6          id + id $        s4
$ 0 3 6 4        + id $           r6
$ 0 3 6 3        + id $           r5
$ 0 3 6 8        + id $           r4
$ 0 2            + id $           s5
$ 0 2 5          id $             s4
$ 0 2 5 4        $                r6
$ 0 2 5 3        $                r5
$ 0 2 5 2        $                r3
$ 0 2 5 7        $                r2
$ 0 1            $                acc
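The skeleton parser and these tables fit together in a few lines of Python. A sketch, with the ACTION and GOTO entries transcribed from the tables above into dicts (the tuple encoding of actions is our own choice); blank table cells simply have no dict entry, which is the error case:

```python
# Production number -> (LHS, |RHS|), enough to pop states and consult GOTO
prods = {2: ("expr", 3), 3: ("expr", 1), 4: ("term", 3),
         5: ("term", 1), 6: ("factor", 1)}

action = {(0, "id"): ("s", 4), (1, "$"): ("acc",), (2, "+"): ("s", 5),
          (2, "$"): ("r", 3), (3, "+"): ("r", 5), (3, "*"): ("s", 6),
          (3, "$"): ("r", 5), (4, "+"): ("r", 6), (4, "*"): ("r", 6),
          (4, "$"): ("r", 6), (5, "id"): ("s", 4), (6, "id"): ("s", 4),
          (7, "$"): ("r", 2), (8, "+"): ("r", 4), (8, "$"): ("r", 4)}

goto = {(0, "expr"): 1, (0, "term"): 2, (0, "factor"): 3,
        (5, "expr"): 7, (5, "term"): 2, (5, "factor"): 3,
        (6, "term"): 8, (6, "factor"): 3}

def parse(tokens):
    stack = [0]                           # push s0
    toks = list(tokens) + ["$"]
    pos = 0
    while True:
        act = action.get((stack[-1], toks[pos]))
        if act is None:
            return False                  # error(): blank ACTION entry
        if act[0] == "s":                 # shift si
            stack.append(act[1])
            pos += 1
        elif act[0] == "r":               # reduce A -> beta
            lhs, n = prods[act[1]]
            del stack[-n:]                # pop |beta| states
            stack.append(goto[(stack[-1], lhs)])
        else:
            return True                   # accept
```

Running `parse(["id", "*", "id", "+", "id"])` walks through exactly the fourteen actions of the trace above and accepts.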
LR(k) grammars

Informally, we say that a grammar G is LR(k) if, given a rightmost derivation

S = γ0 ⇒ γ1 ⇒ γ2 ⇒ · · · ⇒ γn = w,

we can, for each right-sentential form in the derivation,

1. isolate the handle of each right-sentential form, and
2. determine the production by which to reduce

by scanning γi from left to right, going at most k symbols beyond the right end of the handle of γi.

LR(k) grammars

Formally, a grammar G is LR(k) iff:

1. S ⇒∗rm αAw ⇒rm αβw, and
2. S ⇒∗rm γBx ⇒rm αβy, and
3. FIRSTk(w) = FIRSTk(y)
⇒ αAy = γBx

i.e., assume sentential forms αβw and αβy, with common prefix αβ and common k-symbol lookahead FIRSTk(y) = FIRSTk(w), such that αβw reduces to αAw and αβy reduces to γBx. But the common prefix means αβy also reduces to αAy, for the same result. Thus αAy = γBx.
Why study LR grammars?

LR(1) grammars are often used to construct parsers.
We call these parsers LR(1) parsers.

• virtually all context-free programming language constructs can be expressed in an LR(1) form
• LR grammars are the most general grammars parsable by a deterministic, bottom-up parser
• efficient parsers can be implemented for LR(1) grammars
• LR parsers detect an error as soon as possible in a left-to-right scan of the input
• LR grammars describe a proper superset of the languages recognized by predictive (i.e., LL) parsers

LL(k): recognize use of a production A → β seeing first k symbols derived from β
LR(k): recognize the handle β after seeing everything derived from β plus k lookahead symbols

LR parsing

Three common algorithms to build tables for an “LR” parser:

1. SLR(1)
   • smallest class of grammars
   • smallest tables (number of states)
   • simple, fast construction
2. LR(1)
   • full set of LR(1) grammars
   • largest tables (number of states)
   • slow, large construction
3. LALR(1)
   • intermediate sized set of grammars
   • same number of states as SLR(1)
   • canonical construction is slow and large
   • better construction techniques exist

An LR(1) parser for either Algol or Pascal has several thousand states, while an SLR(1) or LALR(1) parser for the same language may have several hundred states.
LR(k) items

The table construction algorithms use sets of LR(k) items or configurations to represent the possible states in a parse.

An LR(k) item is a pair [α, β], where
  α is a production from G with a • at some position in the RHS, marking how much of the RHS of a production has already been seen
  β is a lookahead string containing k symbols (terminals or $)

Two cases of interest are k = 0 and k = 1:
  LR(0) items (no lookahead) play a key role in the SLR(1) table construction algorithm.
  LR(1) items play a key role in the LR(1) and LALR(1) table construction algorithms.

Example

The • indicates how much of an item we have seen at a given state in the parse:
  [A → •XY Z] indicates that the parser is looking for a string that can be derived from XY Z
  [A → XY • Z] indicates that the parser has seen a string derived from XY and is looking for one derivable from Z

A → XY Z generates 4 LR(0) items:
1. [A → •XY Z]
2. [A → X • Y Z]
3. [A → XY • Z]
4. [A → XY Z•]
The characteristic finite state machine (CFSM)

The CFSM for a grammar is a DFA which recognizes viable prefixes of right-sentential forms:
A viable prefix is any prefix that does not extend beyond the handle.
It accepts when a handle has been discovered and needs to be reduced.

To construct the CFSM we need two functions:
• closure0(I) to build its states
• goto0(I, X) to determine its transitions

closure0

Given an item [A → α • Bβ], its closure contains the item and any other items that can generate legal substrings to follow α.
Thus, if the parser has viable prefix α on its stack, the input should reduce to Bβ (or γ for some other item [B → •γ] in the closure).

function closure0(I)
  repeat
    if [A → α • Bβ] ∈ I
      add [B → •γ] to I
  until no more items can be added to I
  return I
goto0

Let I be a set of LR(0) items and X be a grammar symbol.
Then, GOTO(I, X) is the closure of the set of all items [A → αX • β] such that [A → α • Xβ] ∈ I.
If I is the set of valid items for some viable prefix γ, then GOTO(I, X) is the set of valid items for the viable prefix γX.
GOTO(I, X) represents the state after recognizing X in state I.

function goto0(I, X)
  let J be the set of items [A → αX • β]
    such that [A → α • Xβ] ∈ I
  return closure0(J)

Building the LR(0) item sets

We start the construction with the item [S′ → •S$], where
  S′ is the start symbol of the augmented grammar G′
  S is the start symbol of G
  $ represents EOF

To compute the collection of sets of LR(0) items:

function items(G′)
  s0 ← closure0({[S′ → •S$]})
  S ← {s0}
  repeat
    for each set of items s ∈ S
      for each grammar symbol X
        if goto0(s, X) ≠ ∅ and goto0(s, X) ∉ S
          add goto0(s, X) to S
  until no more item sets can be added to S
  return S
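Both functions can be sketched the same way (grammar as a dict of RHS tuples, items as (lhs, rhs, dot) triples; closure0 is included so the fragment stands alone; names and encoding are illustrative):

```python
# Expression grammar S -> E$, E -> E+T | T, T -> id | (E).
GRAMMAR = {
    "S": [("E", "$")],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("id",), ("(", "E", ")")],
}

def closure0(items):
    closure, work = set(items), list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            for gamma in GRAMMAR[rhs[dot]]:
                if (rhs[dot], gamma, 0) not in closure:
                    closure.add((rhs[dot], gamma, 0))
                    work.append((rhs[dot], gamma, 0))
    return frozenset(closure)

def goto0(I, X):
    """Advance the dot over X in every item of I that expects X, then close."""
    J = {(lhs, rhs, dot + 1)
         for (lhs, rhs, dot) in I
         if dot < len(rhs) and rhs[dot] == X}
    return closure0(J)

def items():
    """Collect all LR(0) item sets reachable from closure0({[S -> .E$]})."""
    symbols = {sym for prods in GRAMMAR.values() for rhs in prods for sym in rhs}
    s0 = closure0({("S", ("E", "$"), 0)})
    S, work = {s0}, [s0]
    while work:
        s = work.pop()
        for X in symbols:
            t = goto0(s, X)
            if t and t not in S:
                S.add(t)
                work.append(t)
    return S
```

For this grammar the construction yields exactly the ten sets I0 through I9 of the LR(0) example.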
LR(0) example

1 S → E$
2 E → E + T
3   | T
4 T → id
5   | (E)

I0 : S → •E$
     E → •E + T
     E → •T
     T → •id
     T → •(E)
I1 : S → E • $
     E → E • +T
I2 : S → E$•
I3 : E → E + •T
     T → •id
     T → •(E)
I4 : E → E + T •
I5 : T → id•
I6 : T → (•E)
     E → •E + T
     E → •T
     T → •id
     T → •(E)
I7 : T → (E•)
     E → E • +T
I8 : T → (E)•
I9 : E → T •

The corresponding CFSM (transitions read off the item sets via goto0):
  state 0: id→5, (→6, E→1, T→9
  state 1: +→3, $→2
  state 3: id→5, (→6, T→4
  state 6: id→5, (→6, E→7, T→9
  state 7: +→3, )→8

Constructing the LR(0) parsing table

1. construct the collection of sets of LR(0) items for G′.
2. state i of the CFSM is constructed from Ii
   (a) [A → α • aβ] ∈ Ii and goto0(Ii, a) = Ij
       ⇒ ACTION[i, a] ← “shift j”
   (b) [A → α•] ∈ Ii, A ≠ S′
       ⇒ ACTION[i, a] ← “reduce A → α”, ∀a
   (c) [S′ → S$•] ∈ Ii
       ⇒ ACTION[i, a] ← “accept”, ∀a
3. goto0(Ii, A) = Ij
   ⇒ GOTO[i, A] ← j
4. set undefined entries in ACTION and GOTO to “error”
5. initial state of parser s0 is closure0([S′ → •S$])
LR(0) example

state  ACTION                    GOTO
       id   (    )    +    $     S  E  T
0      s5   s6   –    –    –     –  1  9
1      –    –    –    s3   s2    –  –  –
2      acc  acc  acc  acc  acc   –  –  –
3      s5   s6   –    –    –     –  –  4
4      r2   r2   r2   r2   r2    –  –  –
5      r4   r4   r4   r4   r4    –  –  –
6      s5   s6   –    –    –     –  7  9
7      –    –    s8   s3   –     –  –  –
8      r5   r5   r5   r5   r5    –  –  –
9      r3   r3   r3   r3   r3    –  –  –

Conflicts in the ACTION table
If the LR(0) parsing table contains any multiply-defined ACTION entries then G is not LR(0).

Two conflicts arise:
  shift-reduce: both shift and reduce possible in same item set
  reduce-reduce: more than one distinct reduce action possible in same item set

Conflicts can be resolved through lookahead in ACTION. Consider:
• A → ε | aα
  ⇒ shift-reduce conflict
• a:=b+c*d
  requires lookahead to avoid shift-reduce conflict after shifting c
  (need to see * to give precedence over +)
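The ACTION/GOTO table above can be exercised by the standard table-driven shift-reduce driver. A sketch, hard-coding the LR(0) table for S → E$, E → E+T | T, T → id | (E); the driver itself and its names are illustrative assumptions, not given in the slides:

```python
# Productions 2..5 as (lhs, |rhs|); production 1 (S -> E$) is subsumed
# by the accept action in state 2.
PROD = {2: ("E", 3), 3: ("E", 1), 4: ("T", 1), 5: ("T", 3)}

ACTION = {
    0: {"id": ("s", 5), "(": ("s", 6)},
    1: {"+": ("s", 3), "$": ("s", 2)},
    2: {t: ("acc", 0) for t in ("id", "(", ")", "+", "$")},
    3: {"id": ("s", 5), "(": ("s", 6)},
    4: {t: ("r", 2) for t in ("id", "(", ")", "+", "$")},
    5: {t: ("r", 4) for t in ("id", "(", ")", "+", "$")},
    6: {"id": ("s", 5), "(": ("s", 6)},
    7: {")": ("s", 8), "+": ("s", 3)},
    8: {t: ("r", 5) for t in ("id", "(", ")", "+", "$")},
    9: {t: ("r", 3) for t in ("id", "(", ")", "+", "$")},
}
GOTO = {(0, "E"): 1, (0, "T"): 9, (3, "T"): 4, (6, "E"): 7, (6, "T"): 9}

def parse(tokens):
    stack, pos, trace = [0], 0, []
    while True:
        # once $ has been shifted we keep presenting $ so state 2 can accept
        tok = tokens[pos] if pos < len(tokens) else "$"
        kind, arg = ACTION[stack[-1]].get(tok, ("error", 0))
        if kind == "error":
            raise SyntaxError(f"unexpected {tok!r}")
        trace.append(kind + (str(arg) if kind in ("s", "r") else ""))
        if kind == "s":                    # shift: push state, advance input
            stack.append(arg)
            pos += 1
        elif kind == "r":                  # reduce: pop |rhs| states, take GOTO
            lhs, n = PROD[arg]
            del stack[len(stack) - n:]
            stack.append(GOTO[(stack[-1], lhs)])
        else:                              # accept
            return trace
```

For example, `parse(["id", "+", "id", "$"])` yields the action sequence s5, r4, r3, s3, s5, r4, r2, s2, acc, in the same style as the trace table earlier in these notes.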
SLR(1): simple lookahead LR
Add lookaheads after building LR(0) item sets.

Constructing the SLR(1) parsing table:

1. construct the collection of sets of LR(0) items for G′.
2. state i of the CFSM is constructed from Ii
   (a) [A → α • aβ] ∈ Ii and goto0(Ii, a) = Ij
       ⇒ ACTION[i, a] ← “shift j”, ∀a ≠ $
   (b) [A → α•] ∈ Ii, A ≠ S′
       ⇒ ACTION[i, a] ← “reduce A → α”, ∀a ∈ FOLLOW(A)
   (c) [S′ → S • $] ∈ Ii
       ⇒ ACTION[i, $] ← “accept”
3. goto0(Ii, A) = Ij
   ⇒ GOTO[i, A] ← j
4. set undefined entries in ACTION and GOTO to “error”
5. initial state of parser s0 is closure0([S′ → •S$])

From previous example

1 S → E$
2 E → E + T
3   | T
4 T → id
5   | (E)

FOLLOW(E) = FOLLOW(T) = {$, +, )}

state  ACTION                    GOTO
       id   (    )    +    $     S  E  T
0      s5   s6   –    –    –     –  1  9
1      –    –    –    s3   acc   –  –  –
2      –    –    –    –    –     –  –  –
3      s5   s6   –    –    –     –  –  4
4      –    –    r2   r2   r2    –  –  –
5      –    –    r4   r4   r4    –  –  –
6      s5   s6   –    –    –     –  7  9
7      –    –    s8   s3   –     –  –  –
8      –    –    r5   r5   r5    –  –  –
9      –    –    r3   r3   r3    –  –  –
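The FOLLOW sets used here can be computed by the usual fixpoint. A minimal sketch for this grammar; it assumes an ε-free grammar (true here), so FIRST of a symbol never contains ε. Function names and the encoding are illustrative:

```python
# Expression grammar S -> E$, E -> E+T | T, T -> id | (E),
# encoded as nonterminal -> list of RHS tuples.
GRAMMAR = {
    "S": [("E", "$")],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("id",), ("(", "E", ")")],
}

def first_sets():
    """FIRST for each nonterminal, by fixpoint (epsilon-free grammar)."""
    first = {A: set() for A in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for A, prods in GRAMMAR.items():
            for rhs in prods:
                X = rhs[0]
                add = first[X] if X in GRAMMAR else {X}
                if not add <= first[A]:
                    first[A] |= add
                    changed = True
    return first

def follow_sets():
    """FOLLOW(B) gains FIRST of what follows B, or FOLLOW(A) when B
    ends a production of A; iterate to a fixpoint."""
    first = first_sets()
    follow = {A: set() for A in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for A, prods in GRAMMAR.items():
            for rhs in prods:
                for i, X in enumerate(rhs):
                    if X not in GRAMMAR:
                        continue
                    if i + 1 < len(rhs):
                        nxt = rhs[i + 1]
                        add = first[nxt] if nxt in GRAMMAR else {nxt}
                    else:
                        add = follow[A]
                    if not add <= follow[X]:
                        follow[X] |= add
                        changed = True
    return follow
```

Running `follow_sets()` reproduces FOLLOW(E) = FOLLOW(T) = {$, +, )} quoted above.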
Example: A grammar that is not LR(0)

1 S → E$
2 E → E + T
3   | T
4 T → T ∗ F
5   | F
6 F → id
7   | (E)

FOLLOW
E {+, ), $}
T {+, ∗, ), $}
F {+, ∗, ), $}

I0 : S → •E$
     E → •E + T
     E → •T
     T → •T ∗ F
     T → •F
     F → •id
     F → •(E)
I1 : S → E • $
     E → E • +T
I2 : S → E$•
I3 : E → E + •T
     T → •T ∗ F
     T → •F
     F → •id
     F → •(E)
I4 : T → F•
I5 : F → id•
I6 : F → (•E)
     E → •E + T
     E → •T
     T → •T ∗ F
     T → •F
     F → •id
     F → •(E)
I7 : E → T •
     T → T • ∗F
I8 : T → T ∗ •F
     F → •id
     F → •(E)
I9 : T → T ∗ F•
I10 : F → (E)•
I11 : E → E + T •
      T → T • ∗F
I12 : F → (E•)
      E → E • +T

Example: But it is SLR(1)

state  ACTION                              GOTO
       id   (    )    +    ∗    $     S  E   T   F
0      s5   s6   –    –    –    –     –  1   7   4
1      –    –    –    s3   –    acc   –  –   –   –
2      –    –    –    –    –    –     –  –   –   –
3      s5   s6   –    –    –    –     –  –   11  4
4      –    –    r5   r5   r5   r5    –  –   –   –
5      –    –    r6   r6   r6   r6    –  –   –   –
6      s5   s6   –    –    –    –     –  12  7   4
7      –    –    r3   r3   s8   r3    –  –   –   –
8      s5   s6   –    –    –    –     –  –   –   9
9      –    –    r4   r4   r4   r4    –  –   –   –
10     –    –    r7   r7   r7   r7    –  –   –   –
11     –    –    r2   r2   s8   r2    –  –   –   –
12     –    –    s10  s3   –    –     –  –   –   –

Example: A grammar that is not SLR(1)

Consider:
S → L = R
  | R
L → ∗R
  | id
R → L

Its LR(0) item sets:
I0 : S′ → •S$
     S → •L = R
     S → •R
     L → •∗R
     L → •id
     R → •L
I1 : S′ → S • $
I2 : S → L• = R
     R → L•
I3 : S → R•
I4 : L → id•
I5 : L → ∗ • R
     R → •L
     L → •∗R
     L → •id
I6 : S → L = •R
     R → •L
     L → •∗R
     L → •id
I7 : L → ∗R•
I8 : R → L•
I9 : S → L = R•

Now consider I2: = ∈ FOLLOW(R) (S ⇒ L = R ⇒ ∗R = R), so the SLR(1) table has both “shift =” (from S → L• = R) and “reduce R → L” on lookahead =: a shift-reduce conflict that FOLLOW cannot resolve.

LR(1) items

Recall: An LR(k) item is a pair [α, β], where
  α is a production from G with a • at some position in the RHS, marking how much of the RHS of a production has been seen
  β is a lookahead string containing k symbols (terminals or $)

What about LR(1) items?
• All the lookahead strings are constrained to have length 1
• Look something like [A → X • Y Z, a]
LR(1) items

What’s the point of the lookahead symbols?
• carry along to choose the correct reduction when there is a choice
• lookaheads are bookkeeping, unless item has • at right end:
  – in [A → X • Y Z, a], a has no direct use
  – in [A → XY Z•, a], a is useful
• allows use of grammars that are not uniquely invertible†

The point: For [A → α•, a] and [B → α•, b], we can decide between reducing to A or B by looking at limited right context.

† No two productions have the same RHS

closure1(I)

Given an item [A → α • Bβ, a], its closure contains the item and any other items that can generate legal substrings to follow α.
Thus, if the parser has viable prefix α on its stack, the input should reduce to Bβ (or γ for some other item [B → •γ, b] in the closure).

function closure1(I)
  repeat
    if [A → α • Bβ, a] ∈ I
      add [B → •γ, b] to I, where b ∈ FIRST(βa)
  until no more items can be added to I
  return I
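A Python sketch of closure1, using the expression grammar S → E, E → E+T | T, T → T∗F | F, F → id | (E). Since the grammar is ε-free, FIRST(βa) is FIRST of the first symbol of β, or {a} when β is empty. The encoding and names are illustrative assumptions:

```python
GRAMMAR = {
    "S": [("E",)],
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("id",), ("(", "E", ")")],
}

def first(sym):
    """FIRST of one symbol; the grammar is epsilon-free, and the guard
    skips directly left-recursive alternatives (enough here)."""
    if sym not in GRAMMAR:
        return {sym}
    out = set()
    for rhs in GRAMMAR[sym]:
        if rhs[0] != sym:
            out |= first(rhs[0])
    return out

def closure1(items):
    """LR(1) items are (lhs, rhs, dot, lookahead). For [A -> a.Bb, la],
    add [B -> .g, x] for every production of B and x in FIRST(b la)."""
    closure, work = set(items), list(items)
    while work:
        lhs, rhs, dot, la = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            beta = rhs[dot + 1:]
            lookaheads = first(beta[0]) if beta else {la}
            for gamma in GRAMMAR[rhs[dot]]:
                for b in lookaheads:
                    item = (rhs[dot], gamma, 0, b)
                    if item not in closure:
                        closure.add(item)
                        work.append(item)
    return frozenset(closure)

I0 = closure1({("S", ("E",), 0, "$")})
```

`I0` has 17 items: the kernel item, the two E-productions with lookaheads {+, $}, and the T- and F-productions with lookaheads {∗, +, $}.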
goto1(I)

Let I be a set of LR(1) items and X be a grammar symbol.
Then, GOTO(I, X) is the closure of the set of all items [A → αX • β, a] such that [A → α • Xβ, a] ∈ I.
If I is the set of valid items for some viable prefix γ, then GOTO(I, X) is the set of valid items for the viable prefix γX.
goto(I, X) represents the state after recognizing X in state I.

function goto1(I, X)
  let J be the set of items [A → αX • β, a]
    such that [A → α • Xβ, a] ∈ I
  return closure1(J)

Building the LR(1) item sets for grammar G

We start the construction with the item [S′ → •S, $], where
  S′ is the start symbol of the augmented grammar G′
  S is the start symbol of G
  $ represents EOF

To compute the collection of sets of LR(1) items:

function items(G′)
  s0 ← closure1({[S′ → •S, $]})
  S ← {s0}
  repeat
    for each set of items s ∈ S
      for each grammar symbol X
        if goto1(s, X) ≠ ∅ and goto1(s, X) ∉ S
          add goto1(s, X) to S
  until no more item sets can be added to S
  return S
Constructing the LR(1) parsing table

Build lookahead into the DFA to begin with:

1. construct the collection of sets of LR(1) items for G′.
2. state i of the LR(1) machine is constructed from Ii
   (a) [A → α • aβ, b] ∈ Ii and goto1(Ii, a) = Ij
       ⇒ ACTION[i, a] ← “shift j”
   (b) [A → α•, a] ∈ Ii, A ≠ S′
       ⇒ ACTION[i, a] ← “reduce A → α”
   (c) [S′ → S•, $] ∈ Ii
       ⇒ ACTION[i, $] ← “accept”
3. goto1(Ii, A) = Ij
   ⇒ GOTO[i, A] ← j
4. set undefined entries in ACTION and GOTO to “error”
5. initial state of parser s0 is closure1([S′ → •S, $])

Back to previous example (∉ SLR(1))

S → L = R
  | R
L → ∗R
  | id
R → L

I0 : S′ → •S, $
     S → •L = R, $
     S → •R, $
     L → •∗R, =
     L → •id, =
     R → •L, $
     L → •∗R, $
     L → •id, $
I1 : S′ → S•, $
I2 : S → L• = R, $
     R → L•, $
I3 : S → R•, $
I4 : L → ∗ • R, =$
     R → •L, =$
     L → •∗R, =$
     L → •id, =$
I5 : L → id•, =$
I6 : S → L = •R, $
     R → •L, $
     L → •∗R, $
     L → •id, $
I7 : L → ∗R•, =$
I8 : R → L•, =$
I9 : S → L = R•, $
I10 : R → L•, $
I11 : L → ∗ • R, $
      R → •L, $
      L → •∗R, $
      L → •id, $
I12 : L → id•, $
I13 : L → ∗R•, $

I2 no longer has a shift-reduce conflict: reduce on $, shift on =
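That claim can be checked mechanically by building the full collection of LR(1) item sets for this grammar. A sketch, with items encoded as (lhs, rhs, dot, lookahead) tuples; the encoding and names are illustrative:

```python
GRAMMAR = {
    "S'": [("S",)],
    "S": [("L", "=", "R"), ("R",)],
    "L": [("*", "R"), ("id",)],
    "R": [("L",)],
}

def first(sym):
    # FIRST of one symbol; grammar is epsilon-free, guard skips
    # directly left-recursive alternatives (none occur here).
    if sym not in GRAMMAR:
        return {sym}
    out = set()
    for rhs in GRAMMAR[sym]:
        if rhs[0] != sym:
            out |= first(rhs[0])
    return out

def closure1(items):
    closure, work = set(items), list(items)
    while work:
        lhs, rhs, dot, la = work.pop()
        if dot < len(rhs) and rhs[dot] in GRAMMAR:
            beta = rhs[dot + 1:]
            las = first(beta[0]) if beta else {la}
            for gamma in GRAMMAR[rhs[dot]]:
                for b in las:
                    it = (rhs[dot], gamma, 0, b)
                    if it not in closure:
                        closure.add(it)
                        work.append(it)
    return frozenset(closure)

def goto1(I, X):
    return closure1({(l, r, d + 1, a) for (l, r, d, a) in I
                     if d < len(r) and r[d] == X})

def items1():
    syms = {s for ps in GRAMMAR.values() for rhs in ps for s in rhs}
    s0 = closure1({("S'", ("S",), 0, "$")})
    S, work = {s0}, [s0]
    while work:
        s = work.pop()
        for X in syms:
            t = goto1(s, X)
            if t and t not in S:
                S.add(t)
                work.append(t)
    return S

states = items1()
# The state containing [S -> L.=R, $] is I2; its only reduce item
# [R -> L., $] has lookahead $, while the shift is on =.
I2 = next(s for s in states if ("S", ("L", "=", "R"), 1, "$") in s)
```

The construction yields the fourteen sets I0 through I13 listed above, and `I2` contains exactly the two items [S → L• = R, $] and [R → L•, $]: shift on =, reduce only on $.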
Example: back to SLR(1) expression grammar

In general, LR(1) has many more states than LR(0)/SLR(1):

1 S → E
2 E → E + T
3   | T
4 T → T ∗ F
5   | F
6 F → id
7   | (E)

LR(1) item sets:

I0 :
  S → •E, $
  E → •E + T, +$
  E → •T, +$
  T → •T ∗ F, ∗+$
  T → •F, ∗+$
  F → •id, ∗+$
  F → •(E), ∗+$

I0′ (after shifting ():
  F → (•E), ∗+$
  E → •E + T, +)
  E → •T, +)
  T → •T ∗ F, ∗+)
  T → •F, ∗+)
  F → •id, ∗+)
  F → •(E), ∗+)

I0″ (after shifting ( again):
  F → (•E), ∗+)
  E → •E + T, +)
  E → •T, +)
  T → •T ∗ F, ∗+)
  T → •F, ∗+)
  F → •id, ∗+)
  F → •(E), ∗+)

Another example

Consider:
0 S′ → S
1 S → CC
2 C → cC
3   | d

LR(1) item sets:
I0 : S′ → •S, $
     S → •CC, $
     C → •cC, cd
     C → •d, cd
I1 : S′ → S•, $
I2 : S → C • C, $
     C → •cC, $
     C → •d, $
I3 : C → c • C, cd
     C → •cC, cd
     C → •d, cd
I4 : C → d•, cd
I5 : S → CC•, $
I6 : C → c • C, $
     C → •cC, $
     C → •d, $
I7 : C → d•, $
I8 : C → cC•, cd
I9 : C → cC•, $

state  ACTION        GOTO
       c   d   $     S  C
0      s3  s4  –     1  2
1      –   –   acc   –  –
2      s6  s7  –     –  5
3      s3  s4  –     –  8
4      r3  r3  –     –  –
5      –   –   r1    –  –
6      s6  s7  –     –  9
7      –   –   r3    –  –
8      r2  r2  –     –  –
9      –   –   r2    –  –
LALR(1) parsing

Define the core of a set of LR(1) items to be the set of LR(0) items derived by ignoring the lookahead symbols.
Thus, the two sets
• {[A → α • β, a], [A → α • β, b]}, and
• {[A → α • β, c], [A → α • β, d]}
have the same core.

Key idea:
If two sets of LR(1) items, Ii and Ij, have the same core, we can merge the states that represent them in the ACTION and GOTO tables.

LALR(1) table construction

To construct LALR(1) parsing tables, we can insert a single step into the LR(1) algorithm:

(1.5) For each core present among the set of LR(1) items, find all sets having that core and replace these sets by their union.

The goto function must be updated to reflect the replacement sets.
The resulting algorithm has large space requirements.
LALR(1) table construction

The revised (and renumbered) algorithm:

1. construct the collection of sets of LR(1) items for G′.
2. for each core present among the set of LR(1) items, find all sets having that core and replace these sets by their union (update the goto function incrementally)
3. state i of the LALR(1) machine is constructed from Ii
   (a) [A → α • aβ, b] ∈ Ii and goto1(Ii, a) = Ij
       ⇒ ACTION[i, a] ← “shift j”
   (b) [A → α•, a] ∈ Ii, A ≠ S′
       ⇒ ACTION[i, a] ← “reduce A → α”
   (c) [S′ → S•, $] ∈ Ii ⇒ ACTION[i, $] ← “accept”
4. goto1(Ii, A) = Ij ⇒ GOTO[i, A] ← j
5. set undefined entries in ACTION and GOTO to “error”
6. initial state of parser s0 is closure1([S′ → •S, $])

Example

Reconsider:
0 S′ → S
1 S → CC
2 C → cC
3   | d

I0 : S′ → •S, $
     S → •CC, $
     C → •cC, cd
     C → •d, cd
I1 : S′ → S•, $
I2 : S → C • C, $
     C → •cC, $
     C → •d, $
I3 : C → c • C, cd
     C → •cC, cd
     C → •d, cd
I4 : C → d•, cd
I5 : S → CC•, $
I6 : C → c • C, $
     C → •cC, $
     C → •d, $
I7 : C → d•, $
I8 : C → cC•, cd
I9 : C → cC•, $

Merged states:
I36 : C → c • C, cd$
      C → •cC, cd$
      C → •d, cd$
I47 : C → d•, cd$
I89 : C → cC•, cd$

state  ACTION           GOTO
       c    d    $      S  C
0      s36  s47  –      1  2
1      –    –    acc    –  –
2      s36  s47  –      –  5
36     s36  s47  –      –  8
47     r3   r3   r3     –  –
5      –    –    r1     –  –
89     r2   r2   r2     –  –
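The merging step itself is a small operation. A sketch, with item sets as frozensets of (lhs, rhs, dot, lookahead) tuples; the encoding and names are illustrative:

```python
def core(item_set):
    """The LR(0) core: the items with their lookaheads stripped."""
    return frozenset((lhs, rhs, dot) for (lhs, rhs, dot, _) in item_set)

def merge_by_core(item_sets):
    """Group LR(1) item sets by core and union each group, which
    unions the lookaheads of items sharing an LR(0) core."""
    groups = {}
    for s in item_sets:
        groups.setdefault(core(s), set()).update(s)
    return [frozenset(s) for s in groups.values()]

# From the example: I4 = {[C -> d., c], [C -> d., d]} and
# I7 = {[C -> d., $]} share a core, so they merge into I47.
I4 = frozenset({("C", ("d",), 1, "c"), ("C", ("d",), 1, "d")})
I7 = frozenset({("C", ("d",), 1, "$")})
merged = merge_by_core([I4, I7])
```

Applied to all ten LR(1) sets of the example, this collapses I3/I6, I4/I7 and I8/I9, giving the seven LALR(1) states of the table above.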
More efficient LALR(1) construction

Observe that we can:
• represent Ii by its basis or kernel: items that are either [S′ → •S, $] or do not have • at the left of the RHS
• compute shift, reduce and goto actions for the state derived from Ii directly from its kernel

This leads to a method that avoids building the complete canonical collection of sets of LR(1) items.

The role of precedence

Precedence and associativity can be used to resolve shift/reduce conflicts in ambiguous grammars:
• lookahead with higher precedence ⇒ shift
• same precedence, left associative ⇒ reduce

Advantages:
• more concise, albeit ambiguous, grammars
• shallower parse trees ⇒ fewer reductions

Classic application: expression grammars
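The two rules can be sketched as a small tie-breaker, applied when a state allows both shifting the lookahead and reducing by a production ending in some operator. The precedence levels, the name `resolve`, and the fallback cases are illustrative yacc-style assumptions, not from the slides:

```python
# Illustrative precedence/associativity tables (higher binds tighter).
PREC = {"+": 1, "-": 1, "*": 2, "/": 2}
ASSOC = {"+": "left", "-": "left", "*": "left", "/": "left"}

def resolve(prod_op, la):
    """Shift/reduce tie-breaker: prod_op is the operator of the
    production we could reduce by, la is the lookahead token."""
    if PREC[la] > PREC[prod_op]:
        return "shift"       # lookahead has higher precedence => shift
    if PREC[la] == PREC[prod_op] and ASSOC[la] == "left":
        return "reduce"      # same precedence, left associative => reduce
    if PREC[la] < PREC[prod_op]:
        return "reduce"      # the handle on the stack binds tighter
    return "shift"           # same precedence, right associative => shift
```

So in a:=b+c*d, with b+c on the stack and ∗ as lookahead, `resolve("+", "*")` says shift, which gives ∗ precedence over +.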
The role of precedence

With precedence and associativity, we can use:

E → E ∗ E
  | E/E
  | E + E
  | E − E
  | (E)
  | -E
  | id
  | num

This eliminates useless reductions (single productions).

Error recovery in shift-reduce parsers

The problem:
• encounter an invalid token
• bad pieces of tree hanging from stack
• incorrect entries in symbol table

We want to parse the rest of the file.

Restarting the parser:
• find a restartable state on the stack
• move to a consistent place in the input
• print an informative message to stderr (line number)
Error recovery in yacc/bison/Java CUP

The error mechanism:
• designated token error
• valid in any production
• error shows synchronization points

When an error is discovered, the parser:
• pops the stack until error is legal
• skips input tokens (yychar) until it successfully shifts 3 tokens
• invokes yyerror("syntax error")
• error productions can have actions

This mechanism is fairly general.
See §Error Recovery of the on-line CUP manual.

Example

Using error:

stmt_list : stmt
    | stmt_list ; stmt

can be augmented with error:

stmt_list : stmt
    | error
    | stmt_list ; stmt

This should:
• throw out the erroneous statement
• synchronize at “;” or “end”

Other “natural” places for errors:
• all the “lists”: FieldList, CaseList
• missing parentheses or brackets
• extra operator or missing operator
Left versus right recursion

Right Recursion:
• needed for termination in predictive parsers
• requires more stack space
• right associative operators

Left Recursion:
• works fine in bottom-up parsers
• limits required stack space
• left associative operators

Rule of thumb:
• right recursion for top-down parsers
• left recursion for bottom-up parsers

Parsing review

Recursive descent
  A hand coded recursive descent parser directly encodes a grammar (typically an LL(1) grammar) into a series of mutually recursive procedures. It has most of the linguistic limitations of LL(1).

LL(k)
  An LL(k) parser must be able to recognize the use of a production after seeing only the first k symbols of its right hand side.

LR(k)
  An LR(k) parser must be able to recognize the occurrence of the right hand side of a production after having seen all that is derived from that right hand side with k symbols of lookahead.
Complexity of parsing: grammar hierarchy

The hierarchy, from most to least general, with the cost of recognition:

type-0 (unrestricted): productions α → β
type-1 (context-sensitive): productions αAβ → αγβ; linear-bounded automaton: PSPACE-complete
type-2 (context-free): productions A → α; Earley’s algorithm: O(n³) for ambiguous grammars, O(n²) for unambiguous grammars
  LR(k), LL(k); Knuth’s algorithm: O(n)
    with subclasses LR(1), LALR(1), SLR(1), LR(0) and LL(1), LL(0)
type-3 (regular): productions A → wX; DFA: O(n)

Note: this is a hierarchy of grammars, not languages.

Language vs. grammar

For example, every regular language has a grammar that is LL(1), but not all regular grammars are LL(1). Consider:

S → ab
S → ac

Without left-factoring, this grammar is not LL(1).