Normal Forms - postech mlg

Context-Free Grammars: Normal Forms
Seungjin Choi
Department of Computer Science and Engineering
Pohang University of Science and Technology
77 Cheongam-ro, Nam-gu, Pohang 37673, Korea
[email protected]
1 / 41
Outline
I
Simplification of CFG: Transform an arbitrary CFG into an
equivalent form that satisfies certain restrictions on its form
I
I
I
I
I
Substitution rule
Removing useless productions
Removing -productions
Removing unit-productions
Normal forms
I
I
Chomsky normal form
Greibach normal form
2 / 41
I
Throughout the lecture, we only consider languages that do not
contain the empty string , since it plays a rather singular role in
many theorems and proofs.
I
In doing so, we do not lose generality.
Let G be a CFG for L − {}.
Then we add a new variable S0 to V and add productions to P:
S0 → S | .
Any non-trivial conclusion that we can make for L − {} will almost
certainly transfer to L. Given any CFG G there is a method of obtaining
Gb such that L(Gb) = L(G ) − {}.
3 / 41
Substitution Rule
Theorem
Let G = (V , T , S, P) be a CFG. Suppose that P contains a production of
the form
A → x1 Bx2 .
Assume that A and B are different variables and that
B → y1 | y2 | · · · | yn
is the set of all productions in P which have B as the left side. Let
b be the grammar in which Pb is constructed by deleting
Gb = (V , T , S, P)
A → x1 Bx2 from P and adding to it
A → x1 y1 x2 | x1 y2 x2 | · · · | x1 yn x2 .
Then L(Gb) = L(G ).
4 / 41
Proof. What you need to show is that
b)
I if w ∈ L(G ) then w ∈ L(G
I
if w ∈ L(Gb) then w ∈ L(G )
5 / 41
∗
Suppose that w ∈ L(G ), i.e., S ⇒ w .
G
If this derivation does not involve the production A → x1 Bx2 , then obviously
∗
S ⇒ w.
Gb
If it does, then look at the derivation the first time A → x1 Bx2 is used. The B
so introduced eventually has to be replaced; we lose nothing by assuming that
this is done immediately. Thus,
∗
S ⇒ u1 Au2 ⇒ u1 x1 Bx2 u2 ⇒ u1 x1 yj x2 u2 .
G
G
G
But with grammar Gb we can get
∗
S ⇒ u1 Au2 ⇒ u1 x1 yj x2 u2 .
Gb
Gb
Thus we can reach the same sentential form with G and Gb. If A → x1 Bx2 is
used again later, we can repeat the argument. If follows then, by induction on
the number of times the production is applied, that
∗
S ⇒ w.
Gb
Therefore, if w ∈ L(G ) then w ∈ L(Gb).
By similar reasoning, we can show that if w ∈ L(Gb) then w ∈ L(G ), which
completes the proof.
6 / 41
Example: Consider G = ({A, B}, {a, b, c}, A, P) with productions
A →
B
→
a | aaA | abBc,
abbA | b.
Substitutions for the variable B, lead to the grammar Gb with productions
A →
B
→
a | aaA | ababbAc | abbc,
abbA | b. (unnecessary)
The string aaabbc has the derivation
A ⇒ aaA ⇒ aaabBc ⇒ aaabbc
in G
and
A ⇒ aaA ⇒ aaabbc
in Gb.
7 / 41
Useful?
Definition
Let G = (V , T , S, P) be a CFG. A variable A ∈ V is said to be useful if
and only if there is at least one w ∈ L(G ) such that
∗
S ⇒ xAy
reachable
∗
⇒ w
generating
with x, y ∈ (V ∪ T )∗ .
In other words, a variable is useful if and only if it occurs in at least one
derivation.
Two requirements to be ”useful”:
I
I
Can derive a terminal string?: A symbol X ∈ V ∪ T is generating if
∗
X ⇒ w for some w ∈ T ∗ .
Can be reached from the start variable?: A symbol X ∈ V ∪ T is
∗
reachable if there is a derivation S ⇒ αX β for some α and β.
8 / 41
Removing Useless Productions
I
I
I
A variable that is not useful is called useless.
A production is useless if it involves any useless variable.
Eliminate the symbols that are not generating first, and then
eliminate from the remaining grammars those symbols that are not
reachable. Note that the other way around does not work!
Example: Eliminate useless variables and productions from G where
V = {S, A, B, C } and T = {a, b} with
S
→
A →
B
C
→
→
aS | A | C ,
a,
aa,
aCb.
9 / 41
First, identify the set of variables that can lead to a terminal string
A
B
S
C
→ a,
(OK)
→ aa,
(OK)
⇒ A ⇒ a,
(OK)
(No!)
Remove C , then we have V1 = {S, A, B} and T = {a} with
S
→
aS | A,
→
aa.
A →
B
a,
10 / 41
Next, eliminate the variables that cannot be reached from the start
variable. To this end we draw a dependency graph where its vertices are
labeled with variables and an edge between C and D is connected if and
only if there is a production form C → xDy .
S
A
B
b where
The variable B is useless. Therefore, Gb = (Vb , Tb , S, P)
Vb = {S, A} and Tb = {a} with
S
A
→ aS | A,
→ a.
11 / 41
Theorem
Let G = (V , T , S, P) be a CFG. Then there exists an equivalent
b that does not contain any useless variables or
grammar Gb = (Vb , Tb , S, P)
productions.
Proof. The algorithm Gb can be generated from G by an algorithm
consisting of two parts. In the first part we construct an intermediate
grammar G1 = (V1 , T1 , S, P1 ) such that V1 contain only variables A for
∗
which A ⇒ w ∈ T ∗ is possible.
1. Set V1 to φ.
2. Repeat the following step until no more variables are added to V1 .
For every A ∈ V for which P has a production of the form
A → x1 x2 · · · xn
with all xi ∈ V1 ∪ T ,
add A to V1 .
3. Take P1 as all the productions in P whose symbols are all in V1 ∪ T .
12 / 41
Clearly this procedure terminates. It is equally clear that if A ∈ V1 then
∗
A ⇒ w ∈ T ∗ is a possible derivation with G1 . The remaining issue is
∗
whether every A for which A ⇒ w is added to V1 before the procedure
terminates.
Consider a parse tree
A
Level k − 2
Aj
c
Ai
a
b
Level k − 1
Level k
13 / 41
At level k, there are only terminals, so every variable Ai at level k − 1 will
be added to V1 on the first pass through step 2. Any variable at level
k − 2 will then be added to V1 on the second pass through step 2. The
third time through step 2, all variables at level k − 2 will be added, and
so on. The algorithm cannot terminate while there are variable in the
tree that are not yet in V1 . Hence A will be eventually added to V1 .
In the second part of the construction, we get the final answer Gb from
G1 . We draw the variable dependency graph for G1 and from it find all
variables that cannot be reached from S. We can also eliminate any
terminal that does not occur in some useful productions. The result is
b
the grammar Gb = (Vb , Tb , S, P).
Because of the construction, Gb does not contain any useless symbols or
productions. Also, for each w ∈ L(G ) we have a derivation
∗
∗
S ⇒ xAy ⇒ w .
14 / 41
Since the construction of Gb retains A and all associated productions, we
have everything needed to make the derivation
∗
∗
Gb
Gb
S ⇒ xAy ⇒ w .
The grammar Gb is constructed from G by removal of productions,
implying Pb ⊆ P, which gives
L(Gb) ⊆ L(G ).
Putting the two results together, we see that G and Gb are equivalent. 15 / 41
Removing -Productions
Definition
Any production of a CFG that is of the form
A→
is called -production.
Definition
Any variable A for which the derivation
∗
A⇒
is possible, is called nullable.
16 / 41
Example: Consider the CFG
S
S1
→ aS1 b,
→ aS1 b | .
This leads to L = {an b n | n ≥ 1}.
The -production, S1 → , can be removed by the substituting for S1 ,
leading to
S
S1
→
→
aS1 b | ab,
aS1 b | ab.
which also generate L = {an b n | n ≥ 1}.
17 / 41
Theorem
Let G be any CFG with not in L(G ). Then there exists an equivalent
grammar Gb having no -productions.
Proof. Find the set VN of all nullable variables of G .
1. Find all productions A → , and put A into VN .
2. Repeat the following step until no further variables are added to VN .
2.1 For all productions
B → A1 A2 · · · An ,
where A1 , A2 , . . . , An ∈ VN , put B into VN .
18 / 41
b We look
Once the set VN has been found, we are ready to construct P.
at all productions in P of the form
A → x1 x2 · · · xm ,
m ≥ 1,
where xi ∈ V ∪ T .
For each such production of P, we put into Pb that productions as well as
all those generated by replacing nullable variables with in all possible
combinations.
For example, if xi and xj are both nullable, there will be one production
in Pb with xi replaced with , one in which xj is replaced with , and one
in which both xi and xj are replaced with . There is one exception: if all
b If is
xi are nullable, the production A → is not put into P.
b
straightforward to see G is equivalent to G .
19 / 41
Example: Consider the CFG G with
S
→ ABaC ,
A → BC ,
B
C
D
→ b | ,
→ D | ,
→ d.
Identify nullable variables. First we recognize VN = {B, C } since B → and C → . In addition, since we have A → BC , the set of nullable
variables is VN = {A, B, C }.
20 / 41
Then we construct Gb (with no -productions)
S
A
B
C
D
→ ABaC | BaC | AaC | ABa | aC | Aa | Ba | a,
→ BC | B | C ,
→ b,
→ D,
→ d.
21 / 41
Removing Unit-Productions
Definition
Any production of a CFG of the form
A → B,
where A, B ∈ V is called a unit-production.
We use the substitution rule to remove unit-productions.
22 / 41
Theorem
Let G = (V , T , S, P) be any CFG without -productions. Then there
b that does not have any unit-productions
exists a CFG Gb = (Vb , Tb , S, P)
and that is equivalent to G .
Proof. Obviously, any unit production of the form A → A can be
removed from the grammar without effect. We need only consider
A → B where A and B are different variables.
We first find, for each A all variables such that
∗
A ⇒ B.
We can do this by drawing a dependency graph with an edge (C , D)
∗
whenever the grammar has a unit production C → D. A ⇒ B holds
whenever there is a walk between A and B.
23 / 41
The new grammar Gb is generated by first putting into Pb all non-unit
∗
productions of P. Next, for all A and B satisfying A ⇒ B, we add to Pb
A → y1 | y2 | · · · | yn ,
where B → y1 | y2 | · · · | yn is the set of all rules in P with B on the left.
b none of the yi can
Note that since B → y1 | y2 | · · · | yn is taken from P,
be a single variable, so that non-unit productions are created by the last
step.
To show that the resulting grammar is equivalent to the original one we
can follow the same line of reasoning as in Theorem 1.
24 / 41
Example: Remove all unit-productions from
S
B
→
→
A →
Aa | B,
A | bb,
a | bc | B.
We first draw a dependency graph for the unit-productions.
S
A
B
25 / 41
It follows from the dependency graph that we have
S
S
B
∗
⇒ A,
∗
⇒ B,
∗
⇒ A,
∗
A ⇒ B.
The new rules are
S
→
A →
B →
a | bc | bb,
bb,
a | bc.
26 / 41
The original non-unit productions are
S
B
→
→
A →
Aa,
bb,
a | bc.
Therefore, the equivalent grammar (without unit-productions) are
constructed by adding the new rules to the original non-unit productions:
S
A
B
→ a | bc | bb | Aa,
→ bb | a | bc,
→ a | bc | bb.
27 / 41
Theorem
Let L be CFG that does not contain . Then there exists a CFG that
generates L and that does not have any useless productions,
-productions, or unit-productions.
28 / 41
Summary
To clean up a grammar we can
1. Eliminate -productions
2. Eliminate unit-productions
3. Eliminate useless productions
in this order.
29 / 41
Chomsky Normal Form
Definition
A CFG is in Chomsky normal form (CNF) if all productions are of the
form
A → BC ,
or
A → a,
where A, B, C ∈ V and a ∈ T .
I
The number of symbols on the right of a production are strictly
limited.
I
The string on the right of a production consists of no more than two
symbols.
30 / 41
Example: The grammar with the following productions
S
→
A →
AS | a,
SA | b,
is in CNF.
The grammar with the following productions
S
→
A →
AS | AAS,
SA | aa,
is not in CNF.
31 / 41
Theorem
b
Any CFG G with ∈ L(G ) has an equivalent grammar Gb = (Vb , Tb , S, P)
in CNF.
32 / 41
Example: Consider a CFG G with productions
S
A
B
→ ABa,
→ aab,
→ Ac.
Convert G to CNF.
33 / 41
Step 1: Introduce new variables, Ba , Bb , Bc , for terminals a, b, c,
respectively. Then we have
S
→ ABBa ,
A → Ba Ba Bb ,
B
Ba
Bb
Bc
→ ABc ,
→ a,
→ b,
→ c.
34 / 41
Step 2: Introduce additional variables to take care of the first two
productions:
S
D1
→ AD1 ,
→ BBa ,
A → Ba D2 ,
D2
B
Ba
Bb
Bc
→ Ba Bb ,
→ ABc ,
→ a,
→ b,
→ c.
35 / 41
Greibach Normal Form
Definition
A CFG is said to be in Greibach normal form (GNF) if all productions
have the form
A → ax,
where a ∈ T and x ∈ V ∗ .
The form A → ax is common to both Greibach normal form and
s-grammars but GNF does not carry the restrictions that the pair (A, a)
occur at most once.
Theorem
For every CFG G with ∈ L(G ), there exists an equivalent Gb in GNF.
36 / 41
From CFG to GNF
I
I
Not trivial
A simple idea is to eliminate the following:
1. Front recursions: A → AB
2. Front non-terminals
3. Non-front terminals
37 / 41
Remove Front Recursion
Consider a grammar involving the front recursion
A → AB | CD,
We convert this into the following equivalent grammar:
A →
N
→
CDN | CD,
BN | B.
38 / 41
Remove Front Non-Terminal
Consider a grammar involving the front non-terminal
A →
B
→
BbC | aA,
cDA | aE ,
We convert this into the following equivalent grammar:
A → cDAbC | aEbC | aA,
B
→ cDA | aE .
39 / 41
Remove Non-Front Terminal
Consider a grammar involving the non-front terminal
A → cDAbC ,
We convert this into the following equivalent grammar:
A
N
→ cDANC ,
→ b.
40 / 41
Example: Consider a CFG with
S
→ SA | BeA,
A → AS | a,
B → b.
Find an equivalent GNF?
Solution. We directly apply the technique explained in previous 3 slides.
The GNF is given by
S
C
→
→
A →
D
E
→
→
bEAC | bEA,
aDC | aD | aC | a,
aD | a,
bEACD | bEAC | bEAD | bEA,
e.
41 / 41