
CHAPTER 2: INTRODUCTION TO PARSING
Parsers have already been identified as one of the most important and sophisticated elements of a compiler. Their role was highlighted in an earlier figure.
The parsing methods commonly used in compilers can
be classified as being either top-down or bottom-up
depending on the way they build the parse tree; from
the root to the leaves or from the leaves to the root
respectively.
The most efficient parsing methods work only for subclasses of grammars. However, these subclasses are expressive enough to describe most syntactic constructs in modern programming languages.
Typically the most complex grammars processed by a parser are the context-free grammars (CFGs), which generate the context-free languages (CFLs).
Compiler's phases at a glance: The Parser
(assume that variables p, p0 and r below are all real numbers)

Source statement:          p = p0 + r*60
After lexical analysis:    id1 = id2 + id3 * 60

[Figure: the token stream flows from the lexical analyzer into the parser, which builds the parse tree for id1 = id2 + id3 * 60. The semantic analyzer then produces the "semantic" tree, inserting an int-to-float conversion (itof) for the constant 60. All phases share the symbol & attribute tables (with their access routines), the error handler, and the O.S. interfaces.]

Intermediate code produced by the intermediate code generator:
tmp1 = itof(60)
tmp2 = id3 * tmp1
tmp3 = id2 + tmp2
id1 = tmp3

"Improved" intermediate code produced by the optimizer:
tmp2 = id3 * 60.0
id1 = id2 + tmp2

Object code produced by the code generator:
MOVF id3, R2
MULF #60.0, R2
MOVF id2, R1
ADDF R2, R1
MOVF R1, id1

(ASU Figs. 1.9, 1.10, 1.11)
A grammar for a given language uses two finite disjoint sets of symbols: the set N of nonterminal symbols, usually represented by capital letters, and the set Σ of terminal symbols, represented by lower-case letters and all the other symbols.
The heart of a grammar is a finite set P of formation
rules, or productions as we shall call them, which
describe how the sentences of the language are to be
generated (see Jose’s Theory notes).
Example (examples 4.5 & 4.6 in ALSU)
Given the following grammar (expressed in pseudo-BNF notation) and represented by its productions P:
expression → expression + term
expression → expression - term
expression → term
term → term * factor
term → term / factor
term → factor
factor → (expression)
factor → id
Obviously this representation of the grammar is rather cumbersome to work with.
Therefore, a “more mathematical” notation to represent the set P of productions would be:
E → E + T
E → E - T
E → T
T → T * F
T → T / F
T → F
F → (E)
F → id
which in turn can be written in a much shorter notation using the vertical line “|” as a separator, as follows:
E → E + T | E - T | T
T → T * F | T / F | F
F → (E) | id
From this example we can distinguish the grammar's 4-tuple definition, where in this case the set N of nonterminals is
N = {E, T, F},
the set Σ of terminals is
Σ = {+, -, *, /, (, ), id},
the start symbol is the non-terminal E, and the set P of productions is
P = {E → E + T | E - T | T;  T → T * F | T / F | F;  F → (E) | id}.
Derivations and the parse tree.
Deriving the words of the language from the grammar rules is quite straightforward.
For example, using the grammar above, let’s determine the derivations necessary to build the expression (word)
id + id*(id - id)
We proceed as follows:
E ⇒ E + T ⇒ T + T ⇒ F + T*F ⇒ id + F*(E)
  ⇒ id + id*(E - T) ⇒ id + id*(T - F)
  ⇒ id + id*(T - id) ⇒ id + id*(id - id).
Obviously there may be very complicated derivations
and the process may become quite cumbersome.
The corresponding parse tree is given below:
[Parse tree for id + id*(id - id): E at the root with children E, +, T; the left E derives T ⇒ F ⇒ id; the right T derives T * F, where the inner T derives F ⇒ id and the F derives ( E ), the parenthesized E deriving E - T, with each side finally deriving id.]
For more information refer to the cs3186 class or any textbook on the Theory of Automata.
&&&&&&&&&&&&&&&&&&&&&&&&&&&&
STOP HERE AND DISCUSS THE PROJECT!!!
&&&&&&&&&&&&&&&&&&&&&&&&&&&&
Example of a Parser at work: Non-Recursive Predictive
Parser - A Table-Driven Parser
• The key problem during predictive parsing is determining
the production to be applied for a non-terminal.
• A table-driven predictive parser has:
- an input buffer, containing the string w to be parsed
followed by a $ symbol used as an end marker,
- a stack, containing a sequence of grammar symbols with $
on the bottom as a bottom of the stack indicator,
- a parsing table, which is a 2-D array M[A, a] where A is a
non-terminal and a is a terminal or the symbol $, and
- an output stream (the production used or error call).
[Figure: the predictive parsing program reads the INPUT BUFFER (e.g. x y + * ... z $) through pointer ip, maintains a STACK of grammar symbols (X Y Z ... $, with X on top), consults the parsing table M, and writes to the OUTPUT stream.]
Running the Program
The program considers X the symbol on the top of the stack
and a, the current input symbol. There are three possible
parser actions:
1) If X = a = $, the parser halts, having successfully completed
the parsing process.
2) If X = a ≠ $, the parser pops X off the stack and advances
the input pointer to the next input symbol.
3) If X is a nonterminal, the program uses (consults) the
parsing table for the input M[X, a]. This entry will be
either an X-production of the grammar or an error entry. If
it is a production, the top of the stack X gets replaced by
the production.
For example, if M[X, a] = {X → PQR}, the parser replaces X
on top of the stack by RQP (with P on top).
The behavior of the parser can be described in terms of its
configurations, which give the stack contents and the
remaining input.
An algorithm to perform Nonrecursive predictive parsing is
listed below.
Algorithm: Non-recursive Predictive Parsing
Input. A string w and a parsing table M for grammar G.
Output. If w is in L(G), a leftmost derivation of w; otherwise,
an error indication.
Method.
Initially, the parser is in a configuration in which it has $S
on the stack with S, the start symbol of G on top, and w$ in
the input buffer. The program that utilizes the predictive
parsing table M to produce a parse for the input is given
below. (Table M will be generated later).
set pointer ip to point to the first symbol of w$;
repeat
    let X be the top stack symbol and a the symbol pointed to by ip;
    if X is a terminal or $ then
        if X = a then
            pop X from the stack and advance ip
        else error()
    else /* X is a nonterminal */
        if M[X, a] = X → Y1 Y2 ... Yk then begin
            pop X from the stack;
            push Yk, Yk-1, ..., Y1 onto the stack, with Y1 on top;
            output the production X → Y1 Y2 ... Yk
        end
        else error()
until X = $ /* stack is empty */
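As a concrete illustration, the loop above can be sketched in Python (an illustrative transcription, not production code). The parsing table M is hardcoded for the grammar E → TE', E' → +TE' | ε, T → FT', T' → *FT' | ε, F → (E) | id used in the example that follows; ε-productions appear as empty right-hand sides.

```python
# Parsing table M[X, a] -> right-hand side to push; missing entries are errors.
M = {
    ('E',  'id'): ['T', "E'"], ('E',  '('): ['T', "E'"],
    ("E'", '+'):  ['+', 'T', "E'"], ("E'", ')'): [], ("E'", '$'): [],
    ('T',  'id'): ['F', "T'"], ('T',  '('): ['F', "T'"],
    ("T'", '+'):  [], ("T'", '*'): ['*', 'F', "T'"], ("T'", ')'): [], ("T'", '$'): [],
    ('F',  'id'): ['id'], ('F',  '('): ['(', 'E', ')'],
}
NONTERMINALS = {'E', "E'", 'T', "T'", 'F'}

def predictive_parse(tokens, start='E'):
    """Return the list of productions used, or raise SyntaxError."""
    stack = ['$', start]            # $ marks the bottom of the stack
    tokens = tokens + ['$']         # $ marks the end of the input
    ip = 0
    output = []
    while True:
        X, a = stack[-1], tokens[ip]
        if X == '$' and a == '$':          # action 1: accept
            return output
        if X not in NONTERMINALS:          # X is a terminal (or $)
            if X == a:                     # action 2: match, pop and advance ip
                stack.pop(); ip += 1
            else:
                raise SyntaxError(f"expected {X}, got {a}")
        else:                              # action 3: expand using M[X, a]
            if (X, a) not in M:
                raise SyntaxError(f"no entry M[{X}, {a}]")
            rhs = M[(X, a)]
            stack.pop()
            stack.extend(reversed(rhs))    # push RHS with leftmost symbol on top
            output.append((X, rhs))
```

Calling predictive_parse(['id', '+', 'id', '*', 'id']) yields the eleven productions of the leftmost derivation traced in the moves table shown later in this section.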
Example
Consider the grammar G with productions: {E → TE'; E' → +TE' | ε; T → FT'; T' → *FT' | ε; F → (E) | id} (the empty string is written using a Greek letter, e.g. ε or λ). A predictive parsing table for this grammar is given below.
Non-     |                        Input Symbol
terminal |  id        +           *           (         )        $
---------+---------------------------------------------------------------
E        |  E → TE'                           E → TE'
E'       |            E' → +TE'                         E' → ε   E' → ε
T        |  T → FT'                           T → FT'
T'       |            T' → ε      T' → *FT'             T' → ε   T' → ε
F        |  F → id                            F → (E)

Rules: E → TE';  E' → +TE' | ε;  T → FT';  T' → *FT' | ε;  F → (E) | id
Blanks are error entries; non-blanks indicate a production
with which to expand the top nonterminal on the stack. Note
that we have not yet indicated how these entries could be
selected, but we shall do so shortly.
With input id + id * id the predictive parser makes the
sequence of moves shown on the next page.
The input pointer points to the leftmost symbol of the string
in the INPUT column.
If we observe the actions of this parser carefully, we see that
it is tracing out a leftmost derivation for the input, that is,
the productions output are those of a leftmost derivation.
The input symbols that have already been scanned, followed
by the grammar symbols on the stack (from the top to
bottom), make up the sentential forms in the derivation.
The moves made by the predictive parser on input
id + id * id are shown below.
STACK       INPUT            OUTPUT
$E          id + id * id $
$E'T        id + id * id $   E → TE'
$E'T'F      id + id * id $   T → FT'
$E'T'id     id + id * id $   F → id
$E'T'       + id * id $
$E'         + id * id $      T' → ε
$E'T+       + id * id $      E' → +TE'
$E'T        id * id $
$E'T'F      id * id $        T → FT'
$E'T'id     id * id $        F → id
$E'T'       * id $
$E'T'F*     * id $           T' → *FT'
$E'T'F      id $
$E'T'id     id $             F → id
$E'T'       $
$E'         $                T' → ε
$           $                E' → ε
The construction of predictive parsers is aided by two
functions associated with a grammar: FIRST and FOLLOW.
These are the two most important functions used in
predictive parsing.
Consider a string α of grammar symbols and a nonterminal variable A.
Definition:
NULLABLE(A)
is true if A derives the empty string ε, that is, A =*=> ε.
FIRST(α)
is the set of all terminal symbols that can begin any string derivable from α. If α derives the empty string ε (that is, α =*=> ε), then ε is in FIRST(α). It is easy to visualize the set FIRST(α) by building a derivation tree.
FOLLOW(A)
is the set of terminals that immediately follow A in any given sentential form, that is, the set of terminals a such that there exists a derivation of the form S =*=> XAaY for some strings X and Y. Note that at some time during the derivation some symbols may have been between A and a, but they disappeared after ε-derivations. If A can be the rightmost symbol in some sentential form, then $ is in FOLLOW(A).
Example
Consider the grammar
S → Ab | Bc
A → Dg | CA
B → hA | f
C → dC | c
D → i | j
N = {S, A, B, C, D}
T = {b, c, g, h, f, d, i, j}
Building a tree, find FIRST(S) and FIRST(Ab).
[Derivation tree for S:
S ⇒ Ab ⇒ Dgb ⇒ igb | jgb
S ⇒ Ab ⇒ CAb ⇒ dCAb ... | cAb ...
S ⇒ Bc ⇒ hAc | fc]
From the tree above we obtain:
FIRST(Ab) = {c, d, i, j};
FIRST(S) = {c, d, f, h, i, j}; etc.
String recognition code for "S":
    if next_token in ['c', 'd', 'i', 'j'] then try S → Ab end
    else if next_token in ['f', 'h'] then try S → Bc end
    else error('S'); S = false end;
Similar code can be written for A, B, etc.
Algorithm to Compute FIRST(α)
Case 1: α is a single symbol or ε:
    if α is a terminal y then FIRST(α) = {y}
    else if α is ε then FIRST(α) = {ε}
    else if α is a non-terminal with productions α → β1 | β2 | ... | βk
        then FIRST(α) = union of FIRST(βi) for i = 1 to k
Case 2: α = X1 X2 ... Xn:
    FIRST(α) = {};
    j = 0;
    repeat
        j = j + 1;
        include FIRST(Xj) - {ε} in FIRST(α)
    until Xj not nullable or j = n;
    if Xn is nullable then add {ε} to FIRST(α)
Example
Given the productions for a grammar
G = {S → ABCd; A → f | g | ε; B → h | i | ε; C → p | q}
find FIRST(ABCd) = FIRST(S).
It's easy to verify (using a tree) that
FIRST(S) = FIRST(ABCd) = {f, g, h, i, p, q}.
Applying the algorithm we have:
FIRST(ABCd) = {f, g}        [i.e., FIRST(A) - {ε}]
            ∪ {h, i}        [i.e., FIRST(B) - {ε}]
            ∪ {p, q}        [i.e., FIRST(C)]
            = {f, g, h, i, p, q}.
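The fixpoint computation of FIRST can be sketched in Python (an illustrative transcription of the algorithm above, not the notes' exact code); the empty string '' plays the role of ε and the grammar is the example just given:

```python
# Grammar as {nonterminal: [list of alternative bodies]}; [] is epsilon.
GRAMMAR = {
    'S': [['A', 'B', 'C', 'd']],
    'A': [['f'], ['g'], []],
    'B': [['h'], ['i'], []],
    'C': [['p'], ['q']],
}

def compute_first(grammar):
    """Return (first, first_of): FIRST sets of nonterminals and a helper
    that computes FIRST of a symbol string; '' stands for epsilon."""
    first = {A: set() for A in grammar}

    def first_of(symbols):
        result = set()
        for X in symbols:
            if X not in grammar:          # Case 1: a terminal begins the string
                result.add(X)
                return result
            result |= first[X] - {''}     # include FIRST(Xj) minus epsilon
            if '' not in first[X]:        # Xj not (yet) nullable: stop scanning
                return result
        result.add('')                    # every symbol was nullable
        return result

    changed = True
    while changed:                        # iterate until nothing new is added
        changed = False
        for A, bodies in grammar.items():
            for body in bodies:
                f = first_of(body)
                if not f <= first[A]:
                    first[A] |= f
                    changed = True
    return first, first_of
```

With this grammar, compute_first reproduces FIRST(ABCd) = FIRST(S) = {f, g, h, i, p, q} from the worked example.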
Algorithm to Compute FOLLOW(X)
(1) If X is the starting symbol then put the end marker $ into FOLLOW(X).
(2) If there is a production A → αXβ, then
a) If β begins with a terminal q, then q is in FOLLOW(X).
b) Otherwise FOLLOW(X) includes FIRST(β) - {ε}.
c) If β = ε (that is, X comes at the end of A → αX), or if β is nullable, then include FOLLOW(A) in FOLLOW(X).
Example
Given { E → TQ
        Q → +TQ | -TQ | ε
        T → FR
        R → *FR | /FR | ε
        F → (E) | i }
find FIRST() and FOLLOW() for all non-terminals.
Answer:
FIRST(E) = {(, i} = FIRST(T) = FIRST(F)
FIRST(Q) = {+, -, ε}
FIRST(R) = {*, /, ε}
FOLLOW(E) = {$, )}
FOLLOW(Q) = {$, )}
FOLLOW(T) = {+, -, $, )} = FOLLOW(R)
FOLLOW(F) = {*, /, +, -, $, )}
By (1), FOLLOW(E) includes the symbol $, and from F → (E) it also includes the symbol ), that is:
FOLLOW(E) = {$, )}
FOLLOW(Q) = FOLLOW(E)                          [by rule 2c]
FOLLOW(T) = [FIRST(Q) - {ε}] ∪ FOLLOW(E)       [by 2b & 2c]
          = {+, -} ∪ {), $} = {+, -, ), $}
FOLLOW(R) = FOLLOW(T)                          [by rule 2c, from T → FR]
FOLLOW(F) = [FIRST(R) - {ε}] ∪ FOLLOW(T)       [by 2b & 2c]
          = {*, /} ∪ {+, -, ), $} = {+, -, *, /, ), $}
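Rules (1) through (2c), applied repeatedly until nothing changes, can be sketched in Python for this same grammar (an illustrative sketch; the FIRST computation is repeated so the snippet is self-contained, and '' stands for ε):

```python
# Grammar as {nonterminal: [alternative bodies]}; [] is the epsilon alternative.
GRAMMAR = {
    'E': [['T', 'Q']],
    'Q': [['+', 'T', 'Q'], ['-', 'T', 'Q'], []],
    'T': [['F', 'R']],
    'R': [['*', 'F', 'R'], ['/', 'F', 'R'], []],
    'F': [['(', 'E', ')'], ['i']],
}
START = 'E'

def compute_first(grammar):
    first = {A: set() for A in grammar}
    def first_of(symbols):
        out = set()
        for X in symbols:
            if X not in grammar:          # terminal begins the string
                out.add(X); return out
            out |= first[X] - {''}
            if '' not in first[X]:        # X not nullable: stop
                return out
        out.add(''); return out           # whole string nullable
    changed = True
    while changed:
        changed = False
        for A, bodies in grammar.items():
            for body in bodies:
                f = first_of(body)
                if not f <= first[A]:
                    first[A] |= f; changed = True
    return first, first_of

def compute_follow(grammar, start):
    first, first_of = compute_first(grammar)
    follow = {A: set() for A in grammar}
    follow[start].add('$')                # rule (1): $ follows the start symbol
    changed = True
    while changed:
        changed = False
        for A, bodies in grammar.items():
            for body in bodies:
                for i, X in enumerate(body):
                    if X not in grammar:  # only nonterminals get FOLLOW sets
                        continue
                    f = first_of(body[i + 1:])
                    add = f - {''}        # rules (2a)/(2b): FIRST of the tail
                    if '' in f:           # rule (2c): tail empty or nullable
                        add |= follow[A]
                    if not add <= follow[X]:
                        follow[X] |= add; changed = True
    return follow
```

Running compute_follow(GRAMMAR, START) reproduces the FOLLOW sets computed by hand above.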
HOMEWORK
1. Given the following grammar productions:
1. S → aSA
2. S → A
3. A → bS
4. A → c
(use the order S, A, a, b, c). Find FIRST and FOLLOW for S and A.
2. Consider the following grammar: {S → AS | b; A → SA | a}. Find all the FIRST and FOLLOW functions. Prove that this is an ambiguous grammar.
3. Given the following grammar X → XX | (X) | ( ), find its FIRST and FOLLOW functions.
Construction of Predictive Parsing Tables
Building the parsing table completes our review of top-down parsers (see the algorithm below).
Idea:
Suppose that A → α is a production with a in FIRST(α). The parser will expand A by α when the current input symbol is a.
The only complication occurs when α = ε or α =*=> ε.
In this case, we should again expand A by α if the current input symbol is in FOLLOW(A), or if the $ on the input has been reached and $ is in FOLLOW(A).
Algorithm : Construction of a Predictive Parsing Table
Input: Grammar G
Output: Parsing Table M
Method:
(1) For each production A → α of G, do steps (2) & (3).
(2) For each terminal a in FIRST(α), add A → α to M[A, a].
(3) If ε is in FIRST(α), add A → α to M[A, b] for each terminal b in FOLLOW(A). If ε is in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $].
(4) Make each undefined entry of M be error.
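Steps (1)-(4) can be sketched in Python. The grammar is E → TE' etc. from the worked example that follows, and the FIRST/FOLLOW sets are hardcoded from that example (illustrative choices; '' stands for ε):

```python
PRODUCTIONS = [
    ('E',  ['T', "E'"]),
    ("E'", ['+', 'T', "E'"]), ("E'", []),        # [] is the epsilon alternative
    ('T',  ['F', "T'"]),
    ("T'", ['*', 'F', "T'"]), ("T'", []),
    ('F',  ['(', 'E', ')']), ('F',  ['id']),
]
FIRST = {'E': {'(', 'id'}, "E'": {'+', ''}, 'T': {'(', 'id'},
         "T'": {'*', ''}, 'F': {'(', 'id'}}
FOLLOW = {'E': {')', '$'}, "E'": {')', '$'}, 'T': {'+', ')', '$'},
          "T'": {'+', ')', '$'}, 'F': {'+', '*', ')', '$'}}

def first_of_string(symbols):
    """FIRST of a right-hand side, from the FIRST sets of single symbols."""
    out = set()
    for X in symbols:
        f = FIRST.get(X, {X})            # a terminal is its own FIRST set
        out |= f - {''}
        if '' not in f:
            return out
    out.add('')                          # the whole string can derive epsilon
    return out

def build_table(productions, follow):
    M = {}
    for A, alpha in productions:         # step (1)
        f = first_of_string(alpha)
        for a in f - {''}:               # step (2)
            M[(A, a)] = (A, alpha)
        if '' in f:                      # step (3), including b = $ via FOLLOW
            for b in follow[A]:
                M[(A, b)] = (A, alpha)
    return M                             # step (4): missing entries are errors
```

build_table(PRODUCTIONS, FOLLOW) reproduces the predictive parsing table built by hand in the example that follows.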
Example
Given the grammar G with productions
{1. E → TE';
 2. E' → +TE' | ε;
 3. T → FT';
 4. T' → *FT' | ε;
 5. F → (E) | id}
build its predictive parsing table.
Using the corresponding FIRST() and FOLLOW() algorithms for all the non-terminals we get:
FIRST(E) = FIRST(T) = FIRST(F) = {(, id}
FIRST(E') = {+, ε}
FIRST(T') = {*, ε}
FOLLOW(E) = FOLLOW(E') = {), $}
FOLLOW(T) = FOLLOW(T') = {+, ), $}
FOLLOW(F) = {+, *, ), $}
and these functions can be used together with the corresponding algorithm to build the table below.
Non-     |                        Input Symbol
terminal |  id        +           *           (         )        $
---------+---------------------------------------------------------------
E        |  E → TE'                           E → TE'
E'       |            E' → +TE'                         E' → ε   E' → ε
T        |  T → FT'                           T → FT'
T'       |            T' → ε      T' → *FT'             T' → ε   T' → ε
F        |  F → id                            F → (E)

Rules: E → TE';  E' → +TE' | ε;  T → FT';  T' → *FT' | ε;  F → (E) | id
Details of the Construction of the Table
Since FIRST(TE’) = FIRST(T) = {(, id}, the 1st production
should go under these two terminal symbols.
The 2nd production E' → +TE' goes under the + terminal. The alternative E' → ε causes the production E' → ε to go under the terminals ) and $ in FOLLOW(E').
For the 3rd production, since FIRST(FT') = FIRST(F) = {(, id}, the 3rd production should go under these two terminal symbols.
The 4th production T' → *FT' goes under the * terminal. The alternative T' → ε causes the production T' → ε to go under the terminals +, ) and $ in FOLLOW(T').
For the last production, since FIRST((E)) = {(}, the production generating (E) should go under terminal (. Since FIRST(id) = {id}, the production generating id should go under terminal id.
Computation of FIRST() and FOLLOW() for the previous
problem.
Trivially, from the 2nd production, FIRST(E') = {+, ε}; from the 4th, FIRST(T') = {*, ε}; and from the last production, FIRST(F) = {(, id}.
Now we can use the 3rd production to obtain FIRST(T) =
FIRST(F) = {(, id}.
Lastly, we can use the 1st production to find FIRST(E) =
FIRST(T) = {(, id}.
Since E is the starting symbol, $ is in FOLLOW(E) (by rule 1). Also, from the 5th production it is clear that ")" follows E (rule 2a).
Hence we have FOLLOW(E) = {), $}.
Now, from the 1st production and rule 2c, FOLLOW(E') includes FOLLOW(E), and there are no more E' possibilities. Thus FOLLOW(E') = FOLLOW(E) = {), $}.
By the 3rd production and rule 2c, FOLLOW(T') includes FOLLOW(T), and there are no more T' possibilities.
By the 2nd production and rule 2b, FOLLOW(T) includes FIRST(E') - {ε}, and by the 1st production, when E' ⇒ ε we get E ⇒ T, so FOLLOW(T) also includes FOLLOW(E); thus:
FOLLOW(T) = [FIRST(E') - {ε}] ∪ FOLLOW(E) = {+} ∪ {), $} = {+, ), $}
Similarly, FOLLOW(F) = {*, +, ), $}, since it includes FIRST(T') - {ε} by rule 2b and the 4th production, plus FOLLOW(T) by rule 2c and the 3rd production when T' → ε is used.
Example
Find the parsing table for the grammar G with productions
{ S → if E then S Q | a | b
  E → x | y
  Q → else S | ε }
N = {S, E, Q}
T = {if, then, a, b, x, y, else}
starting symbol = S
empty symbol = ε.
[We will use f() for FIRST() and F() for FOLLOW().]
f(S) = {if, a, b};
f(E) = {x, y};
f(Q) = {else, ε}.
F(S) = {$} ∪ [f(Q) - {ε}] = {else, $}   [rules 1 & 2b]
F(Q) = F(S) = {else, $}                 [by rule 2c]
F(E) = {then}
By f(S) we put the production "S → if E then S Q" under terminal if, the production "S → a" under terminal a, and the production "S → b" under terminal b.
By f(E) we put the production "E → x" under terminal x and the production "E → y" under terminal y.
By f(Q) we put the production "Q → else S" under terminal else.
Now the production "Q → ε" must go under the terminals in F(Q), that is, under terminals else and $.
The parsing table is shown below.
     |  if                 x       y       then   a       b       else                 $
-----+-----------------------------------------------------------------------------------
S    |  S → if E then SQ                          S → a   S → b
E    |                     E → x   E → y
Q    |                                                            Q → else S | Q → ε   Q → ε
The grammar is ambiguous and the ambiguity is manifested
by the multiple entry!
The problem is which one to choose!
This problem exceeds what the parser for this grammar can
do and thus other techniques may have to be used or
alternatively a different parser must be used.
INTRODUCTION TO BOTTOM-UP PARSING
(shift-reduce parsing)
Goal: Given an input string w and a grammar G,
construct a parse tree starting at the leaves and working
towards the root.
Strategy: Repeatedly match the RHS of a production against a substring in the current right-sentential form.
At each match, apply a reduction to build the tree:
- each reduction replaces the matched substring with the
nonterminal on the LHS of the production.
- each reduction adds an internal node to the parse tree.
- the result is another right-sentential form.
Result: A rightmost derivation in reverse
Example
Consider the grammar G given by the productions:
{S → aABf; A → Abc | b; B → d}
Scanning the sentence abbcdf from left to right, it can be reduced to S by the following steps:
abbcdf
aAbcdf
aAdf
aABf
S
The reductions trace out the following rightmost derivation in reverse:
S ⇒ aABf ⇒ aAdf ⇒ aAbcdf ⇒ abbcdf
INTRODUCTION TO LR PARSERS
• This technique is called LR(k) parsing [the "L" stands for left-to-right scanning of the input, the "R" for constructing a rightmost derivation in reverse, and the k for the number of input symbols of lookahead used in making parsing decisions. When (k) is omitted, k is assumed to be 1].
• LR parsers can be constructed to recognize virtually all
programming language constructs for which context-free
grammars can be written. [i.e., it can be used to parse a large
class of context-free grammars].
• The LR parsing method is the most general non-backtracking shift-reduce parsing method known, yet it can be implemented as efficiently as other shift-reduce methods.
• The class of grammars that can be parsed using LR
methods is a proper superset of the class of grammars that
can be parsed with predictive parsers.
• An LR parser can detect a syntactic error as soon as it is
possible to do so on a left-to-right scan of the input.
Cons:
• It is too much work to construct an LR parser by hand for a typical programming-language grammar. Fortunately, many LR parser generator tools are available (Yacc/Bison).
Model of an LR Parser: Push-Down Automaton.
[Figure: the LR parsing program (driver) reads the INPUT BUFFER a1 ... ai ... an $ through pointer ip; the stack holds s0 X1 s1 ... Xm-1 sm-1 Xm sm, with state sm on top; the driver consults the action and goto fields of parsing table M.]
• Each Xi is a grammar symbol
• Each si is a state
• LR consults function action [sm, ai] on the table.
Four possible values:
1) Shift state n: Advance input one token; push n and
token onto the stack.
2) Reduce rule K: Pop the stack as many (symbol, state) pairs as there are symbols on the RHS of rule K; let X be the LHS symbol of rule K; in the state now on top of the stack look up X to get goto n, and push X and n.
3) Accept: Stop parsing and report success
4) Error: Stop parsing and report failure
• Function goto takes a state and a grammar symbol and
produces a state.
Algorithm: LR Parsing Algorithm
place an end marker $ at the end of the input string;
push state s0 (the starting state) onto the stack.
Start:
set ip to point to the first symbol of w$;
repeat forever begin
    let s be the state on top of the stack and a the symbol pointed to by ip;
    if action[s, a] = shift s' then begin
        push a, then push s' on the stack;
        advance ip to the next input symbol
    end
    else if action[s, a] = reduce A → β then begin
        pop 2*|β| symbols off the stack;
        let s' be the state now on top of the stack;
        push A, then push goto[s', A] on top of the stack;
        output the production A → β
    end
    else if action[s, a] = accept then
        return
    else error()
end
Fig. 4.30. LR parsing program.
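A Python sketch of this driver (an illustrative transcription), with the action and goto functions hardcoded from the expression-grammar table of Fig. 4.31 in the next example; sN means "shift to state N", and reduce entries carry the LHS and the length of the RHS:

```python
ACTION = {
    (0, 'id'): ('s', 5), (0, '('): ('s', 4),
    (1, '+'): ('s', 6), (1, '$'): ('acc',),
    (2, '+'): ('r', 'E', 1), (2, '*'): ('s', 7), (2, ')'): ('r', 'E', 1), (2, '$'): ('r', 'E', 1),
    (3, '+'): ('r', 'T', 1), (3, '*'): ('r', 'T', 1), (3, ')'): ('r', 'T', 1), (3, '$'): ('r', 'T', 1),
    (4, 'id'): ('s', 5), (4, '('): ('s', 4),
    (5, '+'): ('r', 'F', 1), (5, '*'): ('r', 'F', 1), (5, ')'): ('r', 'F', 1), (5, '$'): ('r', 'F', 1),
    (6, 'id'): ('s', 5), (6, '('): ('s', 4),
    (7, 'id'): ('s', 5), (7, '('): ('s', 4),
    (8, '+'): ('s', 6), (8, ')'): ('s', 11),
    (9, '+'): ('r', 'E', 3), (9, '*'): ('s', 7), (9, ')'): ('r', 'E', 3), (9, '$'): ('r', 'E', 3),
    (10, '+'): ('r', 'T', 3), (10, '*'): ('r', 'T', 3), (10, ')'): ('r', 'T', 3), (10, '$'): ('r', 'T', 3),
    (11, '+'): ('r', 'F', 3), (11, '*'): ('r', 'F', 3), (11, ')'): ('r', 'F', 3), (11, '$'): ('r', 'F', 3),
}
GOTO = {(0, 'E'): 1, (0, 'T'): 2, (0, 'F'): 3,
        (4, 'E'): 8, (4, 'T'): 2, (4, 'F'): 3,
        (6, 'T'): 9, (6, 'F'): 3, (7, 'F'): 10}

def lr_parse(tokens):
    """Return the sequence of (LHS, RHS length) reductions, or raise SyntaxError."""
    stack = [0]                       # state s0 on the stack
    tokens = tokens + ['$']
    ip = 0
    output = []
    while True:
        s, a = stack[-1], tokens[ip]
        act = ACTION.get((s, a))
        if act is None:
            raise SyntaxError(f"no action for state {s} on {a!r}")
        if act[0] == 's':             # shift: push symbol a and new state
            stack += [a, act[1]]
            ip += 1
        elif act[0] == 'r':           # reduce A -> beta: pop 2*|beta| entries
            _, A, n = act
            del stack[len(stack) - 2 * n:]
            stack += [A, GOTO[(stack[-1], A)]]  # exposed state, then goto
            output.append((A, n))
        else:                         # accept
            return output
```

On input id * id + id, lr_parse produces exactly the reduce moves of the trace shown two pages below (F → id, T → F, F → id, T → T*F, E → T, F → id, T → F, E → E + T).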
Example 4.33
The table below (Figure 4.31) shows the parsing action and
goto functions of an LR parsing table for the grammar
(1) E → E + T
(2) E → T
(3) T → T * F
(4) T → F
(5) F → (E)
(6) F → id
for arithmetic expressions with binary operators + and *
The codes for the actions are:
1. si means shift and stack state i,
2. rj means reduce by production numbered j,
3. acc means accept,
4. blank means error.
State |              action               |      goto
      |  id    +     *     (     )    $   |   E    T    F
------+-----------------------------------+--------------
  0   |  s5                s4             |   1    2    3
  1   |        s6                    acc  |
  2   |        r2    s7          r2  r2   |
  3   |        r4    r4          r4  r4   |
  4   |  s5                s4             |   8    2    3
  5   |        r6    r6          r6  r6   |
  6   |  s5                s4             |        9    3
  7   |  s5                s4             |            10
  8   |        s6                s11      |
  9   |        r1    s7          r1  r1   |
 10   |        r3    r3          r3  r3   |
 11   |        r5    r5          r5  r5   |

Fig. 4.31. Parsing table for expression grammar
Note that the value of goto[s, a] for terminal a is found in the
action field connected with the shift action on input a for
state s. The goto field gives goto[s, A] for nonterminals A.
Example
On input id * id + id, the sequence of stack and input
contents is shown on the table below.
For example, at line (1) the LR parser is in state 0 with id
the first input symbol.
The action in row 0 and column id of the action field of the
table is s5, meaning shift and cover the stack with state 5.
That is what has happened at line (2): the first token id and
the state symbol 5 have both been pushed onto the stack, and
id has been removed from the input.
Then, * becomes the current input symbol, and the action of state 5 on input * is to reduce by F → id.
Two symbols are popped off the stack (one state symbol and one grammar symbol). State 0 is then exposed. Since the goto of state 0 on F is 3, F and 3 are pushed onto the stack.
We now have the configuration in line (3).
Each of the remaining moves is determined similarly.
The details are listed right after the table below.
     STACK               INPUT            ACTION
(1)  0                   id * id + id $   shift
(2)  0 id 5              * id + id $      reduce by F → id
(3)  0 F 3               * id + id $      reduce by T → F
(4)  0 T 2               * id + id $      shift
(5)  0 T 2 * 7           id + id $        shift
(6)  0 T 2 * 7 id 5      + id $           reduce by F → id
(7)  0 T 2 * 7 F 10      + id $           reduce by T → T*F
(8)  0 T 2               + id $           reduce by E → T
(9)  0 E 1               + id $           shift
(10) 0 E 1 + 6           id $             shift
(11) 0 E 1 + 6 id 5      $                reduce by F → id
(12) 0 E 1 + 6 F 3       $                reduce by T → F
(13) 0 E 1 + 6 T 9       $                reduce by E → E + T
(14) 0 E 1               $                accept
line 3: s3 sees *, action: r4 (T → F), pop state & symbol F;
        s0 is exposed & goto under T is 2, push T & 2
line 4: s2 sees *, action: s7, push * & 7 onto the stack
line 5: s7 sees id, action: s5, push id & 5 onto the stack
line 6: s5 sees +, action: r6 (F → id), pop state 5 & id;
        s7 exposed & its goto under F is 10, push F & 10
line 7: s10 sees +, action: r3 (T → T*F), pop 3 states & 3 symbols;
        s0 exposed & goto under T is 2, push T & 2
line 8: s2 sees +, action: r2 (E → T), pop 2 & T;
        s0 exposed & its goto under E is 1, push E & 1
line 9: s1 sees +, action: s6, push + and 6 onto the stack
......
Once we have obtained all the productions we can list them
in reverse order and then do a top-down processing by hand
to get a better understanding of the parser function.
step #13:  E → E + T
step #12:  T → F
step #11:  F → id
step # 8:  E → T
step # 7:  T → T*F
step # 6:  F → id
step # 3:  T → F
step # 2:  F → id
The corresponding parse tree is shown below.
[Parse tree for id * id + id: E at the root with children E, +, T; the left E derives T ⇒ T * F, where the inner T derives F ⇒ id and the F derives id; the right T derives F ⇒ id.]
Handles
A handle is a right-hand side of a production that we can reduce to get the preceding step in the derivation.
If αβw is a sentential form, then for β to be the handle it must be the RHS of a production; that is, if A → β, then αAw ⇒ αβw in a rightmost derivation.
For αAw to be the correct preceding step, we must also have S =*=> αAw, because A → β must remove the entire subtree rooted in A without disturbing the rest of the tree; if S cannot derive αAw, then we have reached a dead end.
Example
Let G = {S → aABf; A → Abc | b; B → d} and consider the sentence abbcdf with reduction sequence abbcdf, aAbcdf, aAdf, ...; then abbcdf is a right-sentential form whose handle is b (from A → b) at position 2.
The handle of the right-sentential form aAbcdf at position 2 is Abc (from A → Abc).
[Figure: a parse tree with S at the root; S derives αAw, and A derives β.
The handle A → β in the parse tree for αβw.]
Constructing SLR Parsing Tables
Three methods, varying in their power and ease of
implementation:
• SLR or "Simple LR”: weakest in terms of the number of
grammars for which it succeeds, but easiest to implement.
• The canonical LR parser (the hardest and most powerful).
• The lookahead LR, LALR (the most commonly used).
Definition:
An LR(0) item (item for short) of a grammar G is a
production of G with a dot at some position of the right
side.
Production A → XYZ yields the four LR(0) items:
A → •XYZ; A → X•YZ; A → XY•Z; and A → XYZ•.
Production A → ε generates only the item A → •.
Definition:
An LR(0) state is a set of LR(0) items.
To construct the LR(0) states, the basis for constructing SLR parsers, we need to define an augmented grammar and two functions, closure and goto.
Definition:
The augmented grammar for G is G', which includes the additional start production S' → S, whose purpose is to announce acceptance when the parser is about to process it.
Closure Operation
If I is a set of items for a grammar G, then closure(I) is the
set of items constructed from I by the two rules:
1) Initially, every item in I is added to closure(I).
2) If A → α•Bβ is in closure(I) and B → γ is a production, then add the item B → •γ to closure(I), if it is not already there. We apply this rule until no more new items can be added to closure(I).
Intuitively, A → α•Bβ in closure(I) indicates that, at some point in the parsing process, we think we might next see a substring derivable from Bβ as input. If B → γ is a production, we also expect we might see a substring derivable from γ at this point. For this reason we also include B → •γ in closure(I).
Algorithm: Computation of Closure
function closure(I);
begin
    J = I;
    repeat
        for each item A → α•Bβ in J and each production
                B → γ of G such that B → •γ is not in J do
            add B → •γ to J
    until no more items can be added to J;
    return J
end
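In Python, with an item represented as an (LHS, RHS-tuple, dot-position) triple (a representation chosen here for illustration), closure can be sketched as follows for the augmented expression grammar of the next example:

```python
GRAMMAR = {
    "E'": [['E']],
    'E': [['E', '+', 'T'], ['T']],
    'T': [['T', '*', 'F'], ['F']],
    'F': [['(', 'E', ')'], ['id']],
}

def closure(item_set, grammar):
    """item_set: set of (lhs, rhs_tuple, dot) triples; returns its closure."""
    J = set(item_set)
    changed = True
    while changed:
        changed = False
        for (A, rhs, dot) in list(J):
            if dot < len(rhs) and rhs[dot] in grammar:   # dot before nonterminal B
                for body in grammar[rhs[dot]]:
                    item = (rhs[dot], tuple(body), 0)    # add B -> . gamma
                    if item not in J:
                        J.add(item); changed = True
    return frozenset(J)
```

closure({("E'", ('E',), 0)}, GRAMMAR) yields the seven items worked out by hand in the example below.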
Example
Consider the augmented grammar:
0) E' → E
1) E → E + T
2) E → T
3) T → T*F
4) T → F
5) F → (E)
6) F → id
If I is the set of the single item {[E' → •E]}, then closure(I) contains the items:
E' → •E      {E' → •E is put in closure(I) by rule (1)}
E → •E+T
E → •T       {since there is an E immediately to the right of a dot, by rule (2) we add the E-productions with dots at the left end}
T → •T*F
T → •F       {now there is a T immediately to the right of a dot, so we add T → •T*F and T → •F}
F → •(E)
F → •id      {next, the F to the right of a dot forces F → •(E) and F → •id to be added}
The Goto Operation
Function goto(I, X), where I is a set of items and X is a grammar symbol, is defined to be the closure of the set of all items [A → αX•β] such that [A → α•Xβ] is in I.
Intuitively, if I is the set of items that are valid for some viable prefix γ, then goto(I, X) is the set of items that are valid for the viable prefix γX.
For every grammar G the goto function defines a DFA that recognizes the viable prefixes of G, where the states are sets of LR(0) items.
The set of prefixes of right sentential forms that can appear on the stack of a shift-reduce parser are called viable prefixes.
Definition: Valid items
Item A → β1•β2 is valid for a viable prefix αβ1 if there is a derivation S' =*=> αAw => αβ1β2w.
In general, an item will be valid for many viable prefixes.
The fact that A → β1•β2 is valid for αβ1 tells us a lot about whether to shift or reduce when we find αβ1 on the parsing stack.
In particular, if β2 ≠ ε, then it suggests that we have not yet shifted the handle onto the stack, so shift is our move.
If β2 = ε, then it looks as if A → β1 is the handle, and we should reduce by this production.
Of course, two valid items may tell us to do different things
for the same viable prefix.
Some of these conflicts can be resolved by looking at the next
input symbol, and others can be resolved by other methods.
We should not suppose that all parsing action conflicts can
be resolved if the LR method is used to construct a parsing
table for an arbitrary grammar.
Example
Consider the augmented grammar:
{ E' → E,
  E → E + T | T,
  T → T*F | F,
  F → (E) | id }.
Compute goto(I, +) if I is the set of only two LR(0) items {[E' → E•], [E → E• + T]}.
We compute goto(I, +) by examining I for items with + immediately to the right of the dot.
E' → E• is not such an item, but E → E• + T is.
We move the dot over the + to get {E → E + •T} and then take the closure of this set:
1) E → E + •T;  2) T → •T * F;  3) T → •F;  4) F → •(E);  5) F → •id
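Continuing the (LHS, RHS-tuple, dot) item representation used in the closure sketch (an illustrative choice), goto is a one-liner on top of closure:

```python
GRAMMAR = {
    "E'": [['E']],
    'E': [['E', '+', 'T'], ['T']],
    'T': [['T', '*', 'F'], ['F']],
    'F': [['(', 'E', ')'], ['id']],
}

def closure(item_set, grammar):
    J = set(item_set)
    changed = True
    while changed:
        changed = False
        for (A, rhs, dot) in list(J):
            if dot < len(rhs) and rhs[dot] in grammar:
                for body in grammar[rhs[dot]]:
                    item = (rhs[dot], tuple(body), 0)
                    if item not in J:
                        J.add(item); changed = True
    return frozenset(J)

def goto(I, X, grammar):
    # move the dot over X in every item of I that has X right after the dot,
    # then close the resulting set
    return closure({(A, rhs, dot + 1) for (A, rhs, dot) in I
                    if dot < len(rhs) and rhs[dot] == X}, grammar)
```

goto({[E' → E•], [E → E• + T]}, +) then yields exactly the five items listed in the example above.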
Algorithm: "The set-of-items construction”
Construct C, the canonical collection of sets of LR(0) items
for an augmented grammar G'.
procedure items(G');
begin
    C = {closure({[S' → •S]})};
    repeat
        for each set of items I in C and each grammar symbol X
                such that goto(I, X) is not empty and not in C do
            add goto(I, X) to C
    until no more sets of items can be added to C
end
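A self-contained Python sketch of the set-of-items construction (closure and goto are repeated so the snippet runs on its own; items are (LHS, RHS-tuple, dot) triples, an illustrative representation):

```python
GRAMMAR = {
    "E'": [['E']],
    'E': [['E', '+', 'T'], ['T']],
    'T': [['T', '*', 'F'], ['F']],
    'F': [['(', 'E', ')'], ['id']],
}

def closure(item_set, grammar):
    J = set(item_set)
    changed = True
    while changed:
        changed = False
        for (A, rhs, dot) in list(J):
            if dot < len(rhs) and rhs[dot] in grammar:
                for body in grammar[rhs[dot]]:
                    item = (rhs[dot], tuple(body), 0)
                    if item not in J:
                        J.add(item); changed = True
    return frozenset(J)

def goto(I, X, grammar):
    return closure({(A, rhs, dot + 1) for (A, rhs, dot) in I
                    if dot < len(rhs) and rhs[dot] == X}, grammar)

def items(grammar, start="E'"):
    """Canonical collection C of sets of LR(0) items for the augmented grammar."""
    C = {closure({(start, tuple(grammar[start][0]), 0)}, grammar)}
    symbols = {X for bodies in grammar.values() for body in bodies for X in body}
    changed = True
    while changed:
        changed = False
        for I in list(C):
            for X in symbols:
                g = goto(I, X, grammar)
                if g and g not in C:      # add only nonempty, new sets
                    C.add(g); changed = True
    return C
```

For the augmented expression grammar, items(GRAMMAR) produces the twelve sets I0 through I11 listed in the example that follows.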
Example
Find the canonical collection of sets of LR(0) items for the
augmented grammar:
E' → E
E → E + T | T
T → T*F | F
F → (E) | id
Also show the goto function for this set of items as a
transition diagram of a deterministic finite automaton D.
First let’s organize the grammar productions as follows:
0) E' → E
1) E → E + T
2) E → T
3) T → T*F
4) T → F
5) F → (E)
6) F → id
The LR(0) items are:
(a) E' → •E        (b) E' → E•
(c) E → •E + T     (d) E → E• + T     (e) E → E + •T     (f) E → E + T•
(g) E → •T         (h) E → T•
(i) T → •T * F     (j) T → T• * F     (k) T → T * •F     (l) T → T * F•
(m) T → •F         (n) T → F•
(o) F → •(E)       (p) F → (•E)       (q) F → (E•)       (r) F → (E)•
(s) F → •id        (t) F → id•
The canonical LR(0) collection for the augmented grammar
G' above is given in the table below.
I0: E' → •E
    E → •E + T
    E → •T
    T → •T * F
    T → •F
    F → •(E)
    F → •id

I1: E' → E•
    E → E• + T

I2: E → T•
    T → T• * F

I3: T → F•

I4: F → (•E)
    E → •E + T
    E → •T
    T → •T * F
    T → •F
    F → •(E)
    F → •id

I5: F → id•

I6: E → E + •T
    T → •T * F
    T → •F
    F → •(E)
    F → •id

I7: T → T * •F
    F → •(E)
    F → •id

I8: F → (E•)
    E → E• + T

I9: E → E + T•
    T → T• * F

I10: T → T * F•

I11: F → (E)•

The corresponding DFA D is given below.
Transition diagram of DFA D for viable prefixes.
[DFA D, transitions by grammar symbol:
I0: on E to I1, on T to I2, on F to I3, on ( to I4, on id to I5
I1: on + to I6
I2: on * to I7
I4: on E to I8, on T to I2, on F to I3, on ( to I4, on id to I5
I6: on T to I9, on F to I3, on ( to I4, on id to I5
I7: on F to I10, on ( to I4, on id to I5
I8: on ) to I11, on + to I6
I9: on * to I7]
Example
Given the augmented grammar:
E' → E
E → E + T | T
T → T*F | F
F → (E) | id
find all items valid for the viable prefix E + T*.
First we look at the DFA on the previous page to verify that the string E + T* is a viable prefix of the grammar.
Starting at state I0, symbol E takes us to I1, + takes us to I6, T takes us to I9, and lastly * takes us to state I7.
State I7 contains the items:
(i) T → T * •F;  (ii) F → •(E);  (iii) F → •id
which are precisely the items valid for E + T*.
To see this, consider the following three rightmost derivations:
E' ⇒ E ⇒ E + T ⇒ E + T*F
E' ⇒ E ⇒ E + T ⇒ E + T*F ⇒ E + T*(E)
E' ⇒ E ⇒ E + T ⇒ E + T*F ⇒ E + T*id
These derivations prove the validity of the items in (i), (ii) and (iii) above for the viable prefix E + T*. (No other valid items exist.)
Algorithm: Constructing an SLR parsing table.
Input: Augmented grammar G'.
Output: SLR parsing table functions action and goto for G'.
Method:
1. Construct C = {I0, I1, . . ., In} the collection of sets of
LR(0) items for G'.
2. State i is constructed from Ii. The parsing actions for
state i are determined as follows:
a) If [A → α•aβ] is in Ii and goto(Ii, a) = Ij, then set action[i, a] to "shift j". Here a must be a terminal.
b) If [A → α•] is in Ii, then set action[i, a] to "reduce A → α" for all a in FOLLOW(A). Here A may not be S'.
c) If [S' → S•] is in Ii, then set action[i, $] to "accept".
3. The goto transitions for state i are constructed for all
nonterminals A using the rule: If goto(Ii, A) = Ij, then
goto[i, A] = j.
4. All entries not defined by rules (2) and (3) are made
"error."
5. The initial state of the parser is the one constructed from
the set of items containing [S'  •S].
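The whole construction can be sketched in Python for the expression grammar. In this sketch (illustrative choices, not the notes' exact algorithm) the FOLLOW sets are hardcoded from the earlier computation, states are the item sets themselves rather than numbers, and conflicting entries are collected instead of aborting:

```python
GRAMMAR = {
    "E'": [['E']],
    'E': [['E', '+', 'T'], ['T']],
    'T': [['T', '*', 'F'], ['F']],
    'F': [['(', 'E', ')'], ['id']],
}
FOLLOW = {'E': {'+', ')', '$'}, 'T': {'+', '*', ')', '$'}, 'F': {'+', '*', ')', '$'}}

def closure(item_set, grammar):
    J = set(item_set)
    changed = True
    while changed:
        changed = False
        for (A, rhs, dot) in list(J):
            if dot < len(rhs) and rhs[dot] in grammar:
                for body in grammar[rhs[dot]]:
                    item = (rhs[dot], tuple(body), 0)
                    if item not in J:
                        J.add(item); changed = True
    return frozenset(J)

def goto(I, X, grammar):
    return closure({(A, rhs, dot + 1) for (A, rhs, dot) in I
                    if dot < len(rhs) and rhs[dot] == X}, grammar)

def items(grammar, start):
    C = {closure({(start, tuple(grammar[start][0]), 0)}, grammar)}
    symbols = {X for bodies in grammar.values() for body in bodies for X in body}
    changed = True
    while changed:
        changed = False
        for I in list(C):
            for X in symbols:
                g = goto(I, X, grammar)
                if g and g not in C:
                    C.add(g); changed = True
    return C

def slr_table(grammar, start, follow):
    """Rules (2a)-(2c) and (3) of the algorithm; missing entries are errors (4)."""
    def record(table, key, entry, conflicts):
        if key in table and table[key] != entry:
            conflicts.append(key)          # a rule would overwrite a different entry
        table[key] = entry

    action, goto_tab, conflicts = {}, {}, []
    for I in items(grammar, start):
        for (A, rhs, dot) in I:
            if dot < len(rhs):
                a = rhs[dot]
                if a not in grammar:                                   # (2a) shift
                    record(action, (I, a), ('s', goto(I, a, grammar)), conflicts)
            elif A == start:                                           # (2c) accept
                record(action, (I, '$'), ('acc',), conflicts)
            else:                                                      # (2b) reduce
                for a in follow[A]:
                    record(action, (I, a), ('r', A, rhs), conflicts)
        for A in grammar:                                              # (3) gotos
            g = goto(I, A, grammar)
            if g:
                goto_tab[(I, A)] = g
    return action, goto_tab, conflicts
```

For this grammar the conflict list comes back empty, confirming that the expression grammar is SLR(1); running the same sketch on the S → L = R grammar of the last example would instead report the shift/reduce conflict discussed there.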
Remarks
If any conflicting actions are generated by the above rules, we say the grammar is not SLR(1). In that case the algorithm fails to produce a valid parser.
The parsing table consisting of the parsing action and goto
functions determined by Algorithm II-8 is called the SLR(1)
table for G.
An LR parser using the SLR(1) table for G is called the
SLR(1) parser for G, and a grammar having an SLR(1)
parsing table is said to be SLR(1).
We usually omit the "(1)" after the "SLR," since we shall
not deal here with parsers having more than one symbol of
lookahead.
Every SLR(1) grammar is unambiguous. The converse is not true.
The SLR parser is not powerful enough to solve conflicts of
the shift/reduce type even for unambiguous grammars.
We will show with an example that the SLR parser is not
powerful enough to parse even an unambiguous grammar.
Building an SLR Parsing Table
We have already found the canonical LR(0) collection for an
augmented grammar G'.
Also, we have built a DFA for the LR(0) set goto function.
We will generate the following steps using the last Algorithm.
STEPS:
1. Consider item set I0.
For the action entries we only consider terminals; the goto entries come from the nonterminal transitions.
One of the items with a terminal after the dot is F → •(E), which gives action[0, (] = ?, and the ? is resolved by looking at the DFA transition diagram: action[0, (] = shift 4.
The other such item is F → •id, which gives action[0, id] = shift 5. No other item in I0 yields actions.
2. Consider item set I1.
The initial production (look at the picture of DFA D) E' → E• yields action[1, $] = accept. The second item yields action[1, +] = shift 6.
3. Consider item set I2 [Note: FOLLOW(E) = {+, ), $}].
The first item yields: action[2, +] = action[2, )] = action[2, $] = reduce using E → T.
The second item yields action[2, *] = shift 7.
The table built is now reproduced below.
The table below shows the parsing action and goto functions of
an LR parsing table for the grammar
(1) E → E + T
(2) E → T
(3) T → T * F
(4) T → F
(5) F → (E)
(6) F → id
for arithmetic expressions with binary operators + and *.
The codes for the actions are:
1. si means shift and stack state i,
2. rj means reduce by production numbered j,
3. acc means accept,
4. blank means error.
FOLLOW(E) = {+, ), $}; FOLLOW(T) = {+, *, ), $}
State |              action               |      goto
      |  id    +     *     (     )    $   |   E    T    F
------+-----------------------------------+--------------
  0   |  s5                s4             |   1    2    3
  1   |        s6                    acc  |
  2   |        r2    s7          r2  r2   |
  3   |        r4    r4          r4  r4   |
  4   |  s5                s4             |   8    2    3
  5   |        r6    r6          r6  r6   |
  6   |  s5                s4             |        9    3
  7   |  s5                s4             |            10
  8   |        s6                s11      |
  9   |        r1    s7          r1  r1   |
 10   |        r3    r3          r3  r3   |
 11   |        r5    r5          r5  r5   |
Example
The grammar
G = { S → L = R; S → R; L → *R; L → id; R → L }
is NOT ambiguous. (L and R stand for l-value and r-value, and * is the "contents of" operator.)
The canonical collection of sets of LR(0) items for this
grammar is given below.
I0: S' → •S
    S → •L = R
    S → •R
    L → •*R
    L → •id
    R → •L

I1: S' → S•

I2: S → L• = R
    R → L•

I3: S → R•

I4: L → *•R
    R → •L
    L → •*R
    L → •id

I5: L → id•

I6: S → L = •R
    R → •L
    L → •*R
    L → •id

I7: L → *R•

I8: R → L•

I9: S → L = R•
Consider the item set I2.
The first item in this set makes action[2, =] "shift 6". Since FOLLOW(R) contains "=", the second item sets action[2, =] to "reduce R → L".
Thus entry action[2, =] is multiply defined and state 2 has a
shift/reduce conflict on input symbol =. (A solution is to assign
higher priority to the shift operation).
This conflict reflects the fact that the SLR construction does not remember enough left context to decide what action the parser should take.