Context-Free Grammars: Normal Forms Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea [email protected] 1 / 41 Outline I Simplification of CFG: Transform an arbitrary CFG into an equivalent form that satisfies certain restrictions on its form I I I I I Substitution rule Removing useless productions Removing -productions Removing unit-productions Normal forms I I Chomsky normal form Greibach normal form 2 / 41 I Throughout the lecture, we only consider languages that do not contain the empty string , since it plays a rather singular role in many theorems and proofs. I In doing so, we do not lose generality. Let G be a CFG for L − {}. Then we add a new variable S0 to V and add productions to P: S0 → S | . Any non-trivial conclusion that we can make for L − {} will almost certainly transfer to L. Given any CFG G there is a method of obtaining Gb such that L(Gb) = L(G ) − {}. 3 / 41 Substitution Rule Theorem Let G = (V , T , S, P) be a CFG. Suppose that P contains a production of the form A → x1 Bx2 . Assume that A and B are different variables and that B → y1 | y2 | · · · | yn is the set of all productions in P which have B as the left side. Let b be the grammar in which Pb is constructed by deleting Gb = (V , T , S, P) A → x1 Bx2 from P and adding to it A → x1 y1 x2 | x1 y2 x2 | · · · | x1 yn x2 . Then L(Gb) = L(G ). 4 / 41 Proof. What you need to show is that b) I if w ∈ L(G ) then w ∈ L(G I if w ∈ L(Gb) then w ∈ L(G ) 5 / 41 ∗ Suppose that w ∈ L(G ), i.e., S ⇒ w . G If this derivation does not involve the production A → x1 Bx2 , then obviously ∗ S ⇒ w. Gb If it does, then look at the derivation the first time A → x1 Bx2 is used. The B so introduced eventually has to be replaced; we lose nothing by assuming that this is done immediately. Thus, ∗ S ⇒ u1 Au2 ⇒ u1 x1 Bx2 u2 ⇒ u1 x1 yj x2 u2 . G G G But with grammar Gb we can get ∗ S ⇒ u1 Au2 ⇒ u1 x1 yj x2 u2 . Gb Gb Thus we can reach the same sentential form with G and Gb. If A → x1 Bx2 is used again later, we can repeat the argument. If follows then, by induction on the number of times the production is applied, that ∗ S ⇒ w. Gb Therefore, if w ∈ L(G ) then w ∈ L(Gb). By similar reasoning, we can show that if w ∈ L(Gb) then w ∈ L(G ), which completes the proof. 6 / 41 Example: Consider G = ({A, B}, {a, b, c}, A, P) with productions A → B → a | aaA | abBc, abbA | b. Substitutions for the variable B, lead to the grammar Gb with productions A → B → a | aaA | ababbAc | abbc, abbA | b. (unnecessary) The string aaabbc has the derivation A ⇒ aaA ⇒ aaabBc ⇒ aaabbc in G and A ⇒ aaA ⇒ aaabbc in Gb. 7 / 41 Useful? Definition Let G = (V , T , S, P) be a CFG. A variable A ∈ V is said to be useful if and only if there is at least one w ∈ L(G ) such that ∗ S ⇒ xAy reachable ∗ ⇒ w generating with x, y ∈ (V ∪ T )∗ . In other words, a variable is useful if and only if it occurs in at least one derivation. Two requirements to be ”useful”: I I Can derive a terminal string?: A symbol X ∈ V ∪ T is generating if ∗ X ⇒ w for some w ∈ T ∗ . Can be reached from the start variable?: A symbol X ∈ V ∪ T is ∗ reachable if there is a derivation S ⇒ αX β for some α and β. 8 / 41 Removing Useless Productions I I I A variable that is not useful is called useless. A production is useless if it involves any useless variable. Eliminate the symbols that are not generating first, and then eliminate from the remaining grammars those symbols that are not reachable. Note that the other way around does not work! Example: Eliminate useless variables and productions from G where V = {S, A, B, C } and T = {a, b} with S → A → B C → → aS | A | C , a, aa, aCb. 9 / 41 First, identify the set of variables that can lead to a terminal string A B S C → a, (OK) → aa, (OK) ⇒ A ⇒ a, (OK) (No!) Remove C , then we have V1 = {S, A, B} and T = {a} with S → aS | A, → aa. A → B a, 10 / 41 Next, eliminate the variables that cannot be reached from the start variable. To this end we draw a dependency graph where its vertices are labeled with variables and an edge between C and D is connected if and only if there is a production form C → xDy . S A B b where The variable B is useless. Therefore, Gb = (Vb , Tb , S, P) Vb = {S, A} and Tb = {a} with S A → aS | A, → a. 11 / 41 Theorem Let G = (V , T , S, P) be a CFG. Then there exists an equivalent b that does not contain any useless variables or grammar Gb = (Vb , Tb , S, P) productions. Proof. The algorithm Gb can be generated from G by an algorithm consisting of two parts. In the first part we construct an intermediate grammar G1 = (V1 , T1 , S, P1 ) such that V1 contain only variables A for ∗ which A ⇒ w ∈ T ∗ is possible. 1. Set V1 to φ. 2. Repeat the following step until no more variables are added to V1 . For every A ∈ V for which P has a production of the form A → x1 x2 · · · xn with all xi ∈ V1 ∪ T , add A to V1 . 3. Take P1 as all the productions in P whose symbols are all in V1 ∪ T . 12 / 41 Clearly this procedure terminates. It is equally clear that if A ∈ V1 then ∗ A ⇒ w ∈ T ∗ is a possible derivation with G1 . The remaining issue is ∗ whether every A for which A ⇒ w is added to V1 before the procedure terminates. Consider a parse tree A Level k − 2 Aj c Ai a b Level k − 1 Level k 13 / 41 At level k, there are only terminals, so every variable Ai at level k − 1 will be added to V1 on the first pass through step 2. Any variable at level k − 2 will then be added to V1 on the second pass through step 2. The third time through step 2, all variables at level k − 2 will be added, and so on. The algorithm cannot terminate while there are variable in the tree that are not yet in V1 . Hence A will be eventually added to V1 . In the second part of the construction, we get the final answer Gb from G1 . We draw the variable dependency graph for G1 and from it find all variables that cannot be reached from S. We can also eliminate any terminal that does not occur in some useful productions. The result is b the grammar Gb = (Vb , Tb , S, P). Because of the construction, Gb does not contain any useless symbols or productions. Also, for each w ∈ L(G ) we have a derivation ∗ ∗ S ⇒ xAy ⇒ w . 14 / 41 Since the construction of Gb retains A and all associated productions, we have everything needed to make the derivation ∗ ∗ Gb Gb S ⇒ xAy ⇒ w . The grammar Gb is constructed from G by removal of productions, implying Pb ⊆ P, which gives L(Gb) ⊆ L(G ). Putting the two results together, we see that G and Gb are equivalent. 15 / 41 Removing -Productions Definition Any production of a CFG that is of the form A→ is called -production. Definition Any variable A for which the derivation ∗ A⇒ is possible, is called nullable. 16 / 41 Example: Consider the CFG S S1 → aS1 b, → aS1 b | . This leads to L = {an b n | n ≥ 1}. The -production, S1 → , can be removed by the substituting for S1 , leading to S S1 → → aS1 b | ab, aS1 b | ab. which also generate L = {an b n | n ≥ 1}. 17 / 41 Theorem Let G be any CFG with not in L(G ). Then there exists an equivalent grammar Gb having no -productions. Proof. Find the set VN of all nullable variables of G . 1. Find all productions A → , and put A into VN . 2. Repeat the following step until no further variables are added to VN . 2.1 For all productions B → A1 A2 · · · An , where A1 , A2 , . . . , An ∈ VN , put B into VN . 18 / 41 b We look Once the set VN has been found, we are ready to construct P. at all productions in P of the form A → x1 x2 · · · xm , m ≥ 1, where xi ∈ V ∪ T . For each such production of P, we put into Pb that productions as well as all those generated by replacing nullable variables with in all possible combinations. For example, if xi and xj are both nullable, there will be one production in Pb with xi replaced with , one in which xj is replaced with , and one in which both xi and xj are replaced with . There is one exception: if all b If is xi are nullable, the production A → is not put into P. b straightforward to see G is equivalent to G . 19 / 41 Example: Consider the CFG G with S → ABaC , A → BC , B C D → b | , → D | , → d. Identify nullable variables. First we recognize VN = {B, C } since B → and C → . In addition, since we have A → BC , the set of nullable variables is VN = {A, B, C }. 20 / 41 Then we construct Gb (with no -productions) S A B C D → ABaC | BaC | AaC | ABa | aC | Aa | Ba | a, → BC | B | C , → b, → D, → d. 21 / 41 Removing Unit-Productions Definition Any production of a CFG of the form A → B, where A, B ∈ V is called a unit-production. We use the substitution rule to remove unit-productions. 22 / 41 Theorem Let G = (V , T , S, P) be any CFG without -productions. Then there b that does not have any unit-productions exists a CFG Gb = (Vb , Tb , S, P) and that is equivalent to G . Proof. Obviously, any unit production of the form A → A can be removed from the grammar without effect. We need only consider A → B where A and B are different variables. We first find, for each A all variables such that ∗ A ⇒ B. We can do this by drawing a dependency graph with an edge (C , D) ∗ whenever the grammar has a unit production C → D. A ⇒ B holds whenever there is a walk between A and B. 23 / 41 The new grammar Gb is generated by first putting into Pb all non-unit ∗ productions of P. Next, for all A and B satisfying A ⇒ B, we add to Pb A → y1 | y2 | · · · | yn , where B → y1 | y2 | · · · | yn is the set of all rules in P with B on the left. b none of the yi can Note that since B → y1 | y2 | · · · | yn is taken from P, be a single variable, so that non-unit productions are created by the last step. To show that the resulting grammar is equivalent to the original one we can follow the same line of reasoning as in Theorem 1. 24 / 41 Example: Remove all unit-productions from S B → → A → Aa | B, A | bb, a | bc | B. We first draw a dependency graph for the unit-productions. S A B 25 / 41 It follows from the dependency graph that we have S S B ∗ ⇒ A, ∗ ⇒ B, ∗ ⇒ A, ∗ A ⇒ B. The new rules are S → A → B → a | bc | bb, bb, a | bc. 26 / 41 The original non-unit productions are S B → → A → Aa, bb, a | bc. Therefore, the equivalent grammar (without unit-productions) are constructed by adding the new rules to the original non-unit productions: S A B → a | bc | bb | Aa, → bb | a | bc, → a | bc | bb. 27 / 41 Theorem Let L be CFG that does not contain . Then there exists a CFG that generates L and that does not have any useless productions, -productions, or unit-productions. 28 / 41 Summary To clean up a grammar we can 1. Eliminate -productions 2. Eliminate unit-productions 3. Eliminate useless productions in this order. 29 / 41 Chomsky Normal Form Definition A CFG is in Chomsky normal form (CNF) if all productions are of the form A → BC , or A → a, where A, B, C ∈ V and a ∈ T . I The number of symbols on the right of a production are strictly limited. I The string on the right of a production consists of no more than two symbols. 30 / 41 Example: The grammar with the following productions S → A → AS | a, SA | b, is in CNF. The grammar with the following productions S → A → AS | AAS, SA | aa, is not in CNF. 31 / 41 Theorem b Any CFG G with ∈ L(G ) has an equivalent grammar Gb = (Vb , Tb , S, P) in CNF. 32 / 41 Example: Consider a CFG G with productions S A B → ABa, → aab, → Ac. Convert G to CNF. 33 / 41 Step 1: Introduce new variables, Ba , Bb , Bc , for terminals a, b, c, respectively. Then we have S → ABBa , A → Ba Ba Bb , B Ba Bb Bc → ABc , → a, → b, → c. 34 / 41 Step 2: Introduce additional variables to take care of the first two productions: S D1 → AD1 , → BBa , A → Ba D2 , D2 B Ba Bb Bc → Ba Bb , → ABc , → a, → b, → c. 35 / 41 Greibach Normal Form Definition A CFG is said to be in Greibach normal form (GNF) if all productions have the form A → ax, where a ∈ T and x ∈ V ∗ . The form A → ax is common to both Greibach normal form and s-grammars but GNF does not carry the restrictions that the pair (A, a) occur at most once. Theorem For every CFG G with ∈ L(G ), there exists an equivalent Gb in GNF. 36 / 41 From CFG to GNF I I Not trivial A simple idea is to eliminate the following: 1. Front recursions: A → AB 2. Front non-terminals 3. Non-front terminals 37 / 41 Remove Front Recursion Consider a grammar involving the front recursion A → AB | CD, We convert this into the following equivalent grammar: A → N → CDN | CD, BN | B. 38 / 41 Remove Front Non-Terminal Consider a grammar involving the front non-terminal A → B → BbC | aA, cDA | aE , We convert this into the following equivalent grammar: A → cDAbC | aEbC | aA, B → cDA | aE . 39 / 41 Remove Non-Front Terminal Consider a grammar involving the non-front terminal A → cDAbC , We convert this into the following equivalent grammar: A N → cDANC , → b. 40 / 41 Example: Consider a CFG with S → SA | BeA, A → AS | a, B → b. Find an equivalent GNF? Solution. We directly apply the technique explained in previous 3 slides. The GNF is given by S C → → A → D E → → bEAC | bEA, aDC | aD | aC | a, aD | a, bEACD | bEAC | bEAD | bEA, e. 41 / 41
© Copyright 2026 Paperzz