COMBINING COMPATIBLE STATES DURING LR(1) PARSER CONSTRUCTION

The LR(0) algorithm for constructing parsers is one in which
contexts are not evaluated, and states are considered identical
if they consist of the same set of marked productions.
But this algorithm is insufficient for actual programming
languages, producing parsers with numerous conflicts.
On the other hand the LR(1) algorithm, which you made use of in
your last assignment, produces much larger parsers. When applied
to the large grammars of real computer languages, such as those
for Java or C++, it results in a parsing machine that is an order
of magnitude or more larger than the one produced by the LR(0)
algorithm for the same grammar.
As a compromise, various methods,
including the one employed by
Yacc, have been devised for
subsets of the LR(1) languages,
using a hybrid approach.
This works well for most programming languages,
but imposes a greater responsibility on the
compiler writer, to come up with a grammar that
does not lead to conflicts (i.e. to cases where more
than one action is defined at a parsing machine
state for the same next input symbol).
These methods only work for a subset of the LR(1)
grammars, and there are applications, including
ones involving natural language processing, for
which they are inadequate.
However, one can employ a definition of compatibility between
states which works for all LR(1) grammars, and which produces
parsers of the same size as those produced by the hybrid methods
referred to above.
DEFINITION. The nucleus of a state consists of
the configurations in the state in which the
marker is in a position greater than zero.
Example
A configuration in a state of the form
A → bc.d, {x,y}
would be a member of its nucleus, but a
configuration such as
A → .bcd, {x,y}
would not be a member.
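The definition of the nucleus can be sketched in a few lines of Python. The names `Config` and `nucleus` are illustrative and not taken from the slides; the sketch assumes a configuration is stored as a production, a marker position and a set of contexts.

```python
from typing import FrozenSet, List, NamedTuple, Tuple


class Config(NamedTuple):
    """One configuration: a marked production plus its set of contexts."""
    lhs: str
    rhs: Tuple[str, ...]
    dot: int                      # marker position; 0 means before the first symbol
    contexts: FrozenSet[str]


def nucleus(state: List[Config]) -> List[Config]:
    """Keep only the configurations whose marker position is greater than zero."""
    return [c for c in state if c.dot > 0]


# The two configurations from the example.
state = [
    Config("A", ("b", "c", "d"), 2, frozenset({"x", "y"})),  # A → bc.d, {x,y}
    Config("A", ("b", "c", "d"), 0, frozenset({"x", "y"})),  # A → .bcd, {x,y}
]

for c in nucleus(state):
    print(c.lhs, c.rhs, c.dot, sorted(c.contexts))
```

Only the first configuration survives the filter, since its marker has moved past two symbols.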
DEFINITION OF COMPATIBILITY BETWEEN LR(1) STATES
Let S and S’ be two states in an LR(1) parsing machine
whose nuclei consist of the same marked productions, which
we will denote as P1,…,Pn.
For 1 ≤ t ≤ n, let Ut denote the set of contexts associated with
marked production Pt in state S, and let U’t denote the set of
contexts associated with that marked production in state S’.
Then states S and S’ are compatible if, for all 1 ≤ i < j ≤ n, at
least one of the following conditions holds:
(a) Ui ∩ U’j = ∅ and U’i ∩ Uj = ∅
(∅ is the empty set, i.e. the intersections involved are both empty)
(b) Ui ∩ Uj ≠ ∅
(c) U’i ∩ U’j ≠ ∅
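As a minimal sketch, the test in the definition can be written as a Python function. The representation is an assumption made for illustration: each nucleus is a dict that maps a marked production (here just a label such as "P1") to its set of contexts, and the function name `compatible` is hypothetical.

```python
from itertools import combinations


def compatible(s, s_prime):
    """Apply conditions (a), (b), (c) to every pair of marked productions.

    s and s_prime map each marked production of the nucleus to the set
    of contexts it has in state S and in state S' respectively.
    """
    if s.keys() != s_prime.keys():
        return False                      # nuclei must hold the same marked productions
    for p, q in combinations(list(s), 2):
        cond_a = not (s[p] & s_prime[q]) and not (s_prime[p] & s[q])
        cond_b = bool(s[p] & s[q])
        cond_c = bool(s_prime[p] & s_prime[q])
        if not (cond_a or cond_b or cond_c):
            return False                  # this pair fails the test
    return True


# Made-up context sets, chosen only to exercise the conditions.
s = {"P1": {"a"}, "P2": {"b"}}
s_bad = {"P1": {"b"}, "P2": {"c"}}   # b is shared across P1 in s_bad and P2 in s
s_ok = {"P1": {"a"}, "P2": {"c"}}

print(compatible(s, s_bad))
print(compatible(s, s_ok))
```

With `s_bad`, condition (a) fails because b appears both in the contexts of P1 in one state and of P2 in the other, and neither state has a context shared by P1 and P2, so (b) and (c) fail as well.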
Note
If states S and S’ are as described above, and
their nuclei consist of only a single
configuration, then according to the above
definition they are compatible, since there is
no pair i < j left to test.
In the case where S and S’ as described above
are compatible, one can combine the states
into a single state whose nucleus consists of
the same marked productions listed above,
while for 1 ≤ t ≤ n, the set of contexts associated
with marked production Pt is Ut ∪ U’t.
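The combining step amounts to a union of context sets, production by production. A small sketch, assuming the same illustrative dict representation (marked production label → set of contexts) and assuming the two states have already been found compatible; `merge` is a hypothetical name:

```python
def merge(s, s_prime):
    """Combine two nuclei by uniting the contexts of each marked production.

    The caller is assumed to have checked compatibility beforehand.
    The inputs are left unchanged; a new dict is returned.
    """
    if s.keys() != s_prime.keys():
        raise ValueError("nuclei must consist of the same marked productions")
    return {p: s[p] | s_prime[p] for p in s}


# Made-up context sets for two marked productions P1 and P2.
s = {"P1": {"a"}, "P2": {"b"}}
s_prime = {"P1": {"a", "e"}, "P2": {"c"}}

merged = merge(s, s_prime)
for production, contexts in merged.items():
    print(production, sorted(contexts))
```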
One way of looking at the definition is to say
that every pair of configurations in the nuclei
must pass a test, and that two states are
compatible only if they all in fact pass.
Fortunately, in grammars for actual
programming languages such as Java, C++,
etc., there are at most 6 configurations in the
nucleus of any state.
The states may be large, with many
immediate successors, but the nuclei are all
quite small.
EXAMPLES
We show only the nucleus of the states in
these examples, since, according to the
definition, states are compatible if and only if
their nuclei are.
        S                        S’
A → ab.c   {x,y}         A → ab.c   {d}
B → b.n    {s,t}         B → b.n    {s}
C → rb.ed  {u,v}         C → rb.ed  {x,v}
The above two states are not compatible
because the pair consisting of the first and last
configurations fails the test.
For this pair condition (a) of the defn. is not true,
since the context set of the first configuration of S
contains an x, and so does the context set of the third
configuration of S’.
In addition neither of conditions (b) or (c) is true.
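To see which pair is responsible, the test can be run pair by pair on the two nuclei of this example. This is only a sketch: the dict representation and the name `failing_pairs` are assumptions made for illustration.

```python
from itertools import combinations


def failing_pairs(s, s_prime):
    """List the pairs of marked productions for which (a), (b) and (c) all fail."""
    bad = []
    for p, q in combinations(list(s), 2):
        cond_a = not (s[p] & s_prime[q]) and not (s_prime[p] & s[q])
        cond_b = bool(s[p] & s[q])
        cond_c = bool(s_prime[p] & s_prime[q])
        if not (cond_a or cond_b or cond_c):
            bad.append((p, q))
    return bad


# Nuclei of S and S' from the first example.
S = {"A → ab.c": {"x", "y"}, "B → b.n": {"s", "t"}, "C → rb.ed": {"u", "v"}}
S_prime = {"A → ab.c": {"d"}, "B → b.n": {"s"}, "C → rb.ed": {"x", "v"}}

print(failing_pairs(S, S_prime))
```

The only pair reported is the first and third configurations; the other two pairs pass by condition (a).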
        S                        S’
A → ab.c   {x,y}         A → ab.c   {x,y,d}
B → b.n    {s,t}         B → b.n    {s}
C → rb.ed  {x,v}         C → rb.ed  {x,v}
The first and third configurations in this case pass the test
because condition (b) of the defn. applies to the first and
third configurations of S: both of these configurations
contain x in their sets of contexts. The other two pairs pass
by condition (a). The states in this case are therefore
compatible.
Remember that, while every pair of configurations in the
nucleus must pass the test, it is only required that one of
conditions (a), (b) or (c) be true for a given pair for it to
pass.
Since the states are compatible, they can be
combined to form one whose nucleus is:
A → ab.c   {x,y,d}
B → b.n    {s,t}
C → rb.ed  {x,v}
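The check and the combination for this second example can be reproduced with a short sketch. It uses the context sets exactly as listed for S and S’ in the example; the dict representation and the names `pair_passes` and `combine` are illustrative assumptions.

```python
from itertools import combinations


def pair_passes(ui, uj, vi, vj):
    """True if at least one of conditions (a), (b), (c) holds for the pair."""
    cond_a = not (ui & vj) and not (vi & uj)
    return cond_a or bool(ui & uj) or bool(vi & vj)


def combine(s, s_prime):
    """Return the combined nucleus, or None if the states are not compatible."""
    for p, q in combinations(list(s), 2):
        if not pair_passes(s[p], s[q], s_prime[p], s_prime[q]):
            return None
    return {p: s[p] | s_prime[p] for p in s}


# Nuclei of S and S' from the second example.
S = {"A → ab.c": {"x", "y"}, "B → b.n": {"s", "t"}, "C → rb.ed": {"x", "v"}}
S_prime = {"A → ab.c": {"x", "y", "d"}, "B → b.n": {"s"}, "C → rb.ed": {"x", "v"}}

merged = combine(S, S_prime)
for production, contexts in merged.items():
    print(production, sorted(contexts))
```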
Note.
In the figure on the next slide, where we omit
the context set of various configurations (i.e.
only show the marked production involved), the
implication is that those context sets are irrelevant
to the assertions being made about the figure.
States 2 and 8 are not compatible, since the first
configuration of state 2 has d as a context in common
with the second configuration of state 8. In fact, if we
were to combine states 2 and 8, the result would have a
combination of states 3 and 9 as its u-successor. This
state would have a conflict, in that it would have reduce
actions, when the next input symbol is d, for both
Z → tu and V → ε
Now consider the altered machine obtained if
the production X → aYd were replaced by
(say) X → aYa. In this case the first
configuration of state 2 would be Y → t.W {a}.
It would then follow that states 2 and 8 were
compatible and could safely be combined to
form:
Y → t.W {a, e}
Z → t.u {c, d}
W → .uV
The journal paper describing this method
of combining states contains a formal
proof of its correctness. But since ours
is a practically oriented course, we will just
consider an informal justification based on
a few examples, to give a flavor of the
reasoning involved.
The main argument is that if the parsing
machine containing the states S and S’, as
described in the defn. of compatibility, has
no conflicts, and S and S’ are compatible,
then the parsing machine obtained by
combining them will also have no conflicts.
The argument is by contradiction. Let’s
consider examples of the various ways
that two configurations in the combination
of S and S’ could have conflicts or lead to
conflicts between other pairs of
configurations in states reachable from it.
In each case we hope to show that either
the parsing machine as it was before S
and S’ were combined contained conflicts
in the first place, or that S and S’ could not
in fact have been compatible.
Case 1. Let configs 1 and 2 of the combined state formed from
states S and S’ be:
A → r B.uv {a,b}
C → t B.uv {a,c}
Seeing that the machine as it was before the
combination contained no conflicts, and specifically did not
contain a conflict in the uv-successor of these states, either
(1) state S must have contained the a in its version of
config 1, while state S’ contained the a in its version of
config 2, or
(2) vice-versa.
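In either case condition (a) of the defn. would not be true
for the two configs, since the a is shared between config 1
in one state and config 2 in the other. Conditions (b) and (c)
are also not true, since a context shared by the two configs
within S or within S’ would itself have caused a conflict at
the uv-successor. So states S and S’ could not have been
compatible in the first place.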
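Case 1 can be played through on one concrete choice of context sets. The sets below are an assumption: they are just one way of splitting {a,b} and {a,c} between S and S’ that matches situation (1). The helper names are hypothetical.

```python
def reduce_conflict(u1, u2):
    """Contexts on which two completed configurations would both call for a reduction."""
    return u1 & u2


def pair_passes(ui, uj, vi, vj):
    """True if at least one of conditions (a), (b), (c) holds for the pair."""
    cond_a = not (ui & vj) and not (vi & uj)
    return cond_a or bool(ui & uj) or bool(vi & vj)


p, q = "A → r B.uv", "C → t B.uv"

# Assumed split for situation (1): S has the a in config 1, S' has it in config 2.
S = {p: {"a", "b"}, q: {"c"}}
S_prime = {p: {"b"}, q: {"a", "c"}}

# The uv-successor inherits the contexts, so a shared context there means
# two reductions are defined for the same next input symbol.
conflict_in_S = reduce_conflict(S[p], S[q])
conflict_in_S_prime = reduce_conflict(S_prime[p], S_prime[q])

merged = {k: S[k] | S_prime[k] for k in S}
conflict_in_merged = reduce_conflict(merged[p], merged[q])

print(conflict_in_S, conflict_in_S_prime, conflict_in_merged)
print(pair_passes(S[p], S[q], S_prime[p], S_prime[q]))
```

Neither original state leads to a conflict, the merged state would conflict on a, and the pair fails the compatibility test, so the two states would never have been combined.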
Case 2. Let configs 1 and 2 of the combined
state be:
A → r B.uv {a,b}
D → t B.Ca
C → .uv {a}
Either S or S’ must contain A → r B.uv {a.. },
in which case the original parsing machine
would have had a conflict at the uv-successor
of that state.
This is in contradiction to our assumption that
the original parsing machine was conflict-free.
Case 3. Let configs 1 and 2 be:
A → s B.Ea
E → .uv {a}
D → t B.Ca
C → .uv {a}
Here again the original parsing machine would
have had conflicts in the uv-successors of both
S and S’.
Case 4. Let configs 1 and 2 be:
A → r B.uv
D → t B.uvr
Here too the original parsing machine would
have had conflicts in the uv-successors of both
S and S’. In this case the conflict would have
been between a reduction and a transition.
EXERCISE
Construct an LR(1) parsing machine for the
following grammar, combining
compatible states as you encounter them.
program → main ; statement_list end main ;
statement_list → statement_list statement
               | statement
statement → assign_statement
          | while_statement
          | do_statement
assign_statement → identifier = identifier
while_statement → while ( condition ) statement_list wend
condition → identifier = identifier
do_statement → do identifier = number to number ; statement_list end do ;