Formal Grammars and Languages

Context-Free Grammars and Languages
Derivations and Derivation Trees
A recursive definition of derivation trees from XN in a CFG G=(N,T,P,S)
(also called syntax trees) The yields of the trees will be defined at the same time.
1) Basis:
X
is a d.t. (a tree with a single node whose label is X) from X in G. The
corresponding derivation in G is X (a derivation of length 0). X is the yield of the
tree.
2) Inductive step: If
X
 Y
is a d.t. from X in G, where

i) X is the label of the root,
ii) Y is the label of a leaf,
iii) the string obtained by taking the labels of the leaves from left to right is Y,
i.e.,
the yield of the tree is Y,
iv) the corresponding derivation is X*Y, and
v) YY1…Yk is a production in P with each Yi in NT ,
then
X
 Y
is a d.t. from X in G, and a corresponding derivation is

…
Y1
Yk
X*YY1…Yk. Y1…Yk is the yield of the resulting tree.
In particular, in case of k=0, i.e., Y1…Yk=, then
X
is a d.t. from X in
G,
 Y


and a corresponding derivation is X*Y.  is the yield of the resulting
tree.
Examples
#007
1
Context-Free Grammars and Languages
Leftmost Derivations and Ambiguity
Let G=(N,T,P,S) be a CFG.
01…n is called a leftmost derivation if for each i, i=xiXii and
i+1=xiii for some xi in T*, i in V*, and Xii in P. For leftmost
derivations, we sometimes use lm instead of .
are defined similarly.
Rightmost derivations
THEOREM Let G be a CFG and X be a nonterminal. Then
(1) X* holds in G iff there exists a d.t. from X in G whose yield is .
(2) There is a 1:1 correspondence between the leftmost derivations in G
and the d.t.’s in G.
G is said to be ambiguous if there exists x in L(G) that is generated by two
distinct leftmost derivations from S, in other words, if there are two
different derivation trees for x.
Examples
1. Identifiers:
Gid=(Nid,Tid,Pid,id)
Unambiguous
Nid={id,A,L,N}, Tid={a,…,z,A,…Z}{0,1,…9},
Pid={idL|id A, AL|N, La|b|…|z|A|…|Z, N0|1|…|9}.
id
idlmid Almid AAlmL AA
id
lmAAAlmALAlmAbA
A
id
A
N
L
L
1
A
b
2. Arithmetic expressions:
lmAbNlmAb1
Gexp1=({E},{x,y,z},Pexp1,E)
Pexp1={Ex|y|z|E+E|E-E|E*E|E/E}
Ambiguous  There are two different derivation trees for x*y+z:
E
E
E
E
E
E
E
E
E
#007
E
2
Context-Free Grammars and Languages
x
*
y
+
z
x
(x*y)+z
*
y
+
z
x*(y+z)
CFG s= BNFs
BNF (Backus-Naur form or Backus’ notation)
Expressions to describe the syntax of programming languages
<…>  metasymbol to represent grammatical element …
::=   is defined to be of the form 

 either  or 
Examples
1. Identifiers
<identifier>::=<letter>|<identifier><alphanumeral>
<alphanumeral>::=<letter>|<numeral>
<letter>::=a|b|…|z|A|B|…|Z
<numeral>::=0|1|…|9
2. Arith. Exps. (an ambiguous grammar)
<arith exp 1>::=x|y|z|<arith exp 1>+<arith exp 1>|
<arith exp 1><arith exp 1>|<arith exp 1>*<arith exp 1>|
<arith exp 1>/<arith exp 1>
3. Another definition of arith. exps. (unambiguous one!)
<arith exp 2>::=<arith exp 2>+<term>|<arit exp 2><term>|<term>
<term>::=<term>*<factor>|<term>/<factor>|<factor>
<factor>::=<factor>|(<arith exp 2>)|<variable>|<constant>
<variable>::=<identifier>
<constant>::=<numeral>
4. Statements in a Pascal-like language
<statement>::=<for statement>|<if statement>|<while statement>|…
<for statement>::=for <variable><arith exp 2> to <arith exp 2> do <statement>
<if statement>::=if <condition> then <statement>|
if <condition> then <statement> else <statement>
The dangling else problem
#007
3
Context-Free Grammars and Languages
Which does “if p then if q then a else b” mean,
“if p then begin if q then a else b end” or “if p then begin if q then a end else b” ?
The Equivalence of BNFs and CFGs
metavariables  nonterminals
::=
basic symbols

 terminals
Exercises
1. Define the following in terms of BNFs:
a) “Numbers” in language C
b) “while statements” in language C
c)
d)
Language AnBn
Even integers as strings over alphabet {0,1,…,9}
e)
f)
2. Draw the syntax trees for the following:
a) Two different syntax trees for “if p then if q then a else b” according to
the definition of if statements in the Pascal-like language given
above.
b) Arith. exps. x+y*z/x, x+y*z, etc.
3. Show an equivalent CFG for each of the following BNFs:
a)
b)
4. Prove the following:
a) If L and L’ are CFLs, then so is LL’.
In fact, ℒ2 is closed under , ,
*, R, substitution, homomorphism, intersection with regular sets, and
so on.
b)
c)
#007
4
Context-Free Grammars and Languages
5. Give CFGs for the following languages:
a)
b)
The Pumping Lemma for CFLs
A CFG G=(N,T,P,S) is in Chomsky normal form if every production in P is
either of the form XYZ or of the form Xa, where X,Y,Z are in N and a in
T.
Lemma For each context-free language L there exists a CFG in Chomsky
normal form generating L{}.
THEOREM
For any CFL L, there exists an integer n satisfying the
following condition: If zL and |z|n, then there exist u, v, w, and x which
satisfy the following properties:
1) x=uvwxy,
2) |vwx|n,
3) vx,
4) uviwxiyL for any integer i0.
Proof. Let G=(N,T,P,S) be a CFG in Chomsky normal form generating L{}.
Let k=|N|, the number of nonterminals of G, and put n=2k+1.
Assume that zL and |z|n and consider any derivation tree, say T, in G
from S generating z. T is a binary tree, since G is in Chomsky normal form.
Thus the height of T is at least k+1, which means that there is a path, say p,
from the root to a leaf whose length  k+1. Then p must have two identical
nonterminals, say X, on it:
S
Choose a pair of X’s such
segment
of
path
p
contains
nonterminals except for the two X’s. Then
to 4)
hold.
that the dotted
no
identical
properties 1)
X
T
X
#007
p
5
Context-Free Grammars and Languages
u
v
w
x
y
Applications of the Pumping Lemma
1. AnBnCn={anbncn | n0} is not a CFL.
Assume the contrary. Then there exists an integer n satisfying the P.L.
Consider z=anbncnAnBnCn. Since |z|=3nn, z can be decomposed into
z=uvwxy such that properties 1) to 4) of the P.L. hold. Since |vwx|n, vx
contains at most two letters from a,b,c, that is, vxa*b*b*c*.
Case 1: If vxa+b* (note that vx) then #a(uv2wx2y)>#c(uv2wx2y), which
contradicts the fact that uv2wx2y is in AnBnCn (Property 4)). A similar
contradiction can be derived from any one of the following possible
assumptions: vxa*b+, vxb+c*, vxb*c+.
Case2: If vxa+b+, then uv2wx2y is not of the form anbncn, a contradiction to
property 4). Similar contradictions can be derived from vxb+c+.
In any case we obatined a contradiction. Thus AnBnCn is not a CFL.
{aibjaibj | i,j0} can be shown not to be CF in a similar way.
2. {ap | p is prime} is not a CFL.
Assume the contrary and consider z=an for the n in the P.L. For
z=vuwxy,
|uviwxiy|=n(1+|vx|) for i=n+1, a contradiction, since n(1+|vx|) is not
prime (note that vx).
3. L={aibj | j=i2} is not CF.
If it were, there exists n satisfying the P.L. Consider z=anbmL, where
m=n2. Consider any decomposition z=uvwxy satisfying the P.L. v cannot
contain both a and b; because, otherwise we have a contradiction that
uv2wx2y is not of the form aibj. A similar argument shows that x does not
contain both a and b.
Case 1: If vwx is in a+, then uv2wx2y=an+|vx|bm and this word is in L by the
P.L. However m=n2(n+|vx|)2, since |vx|1, a contradiction.
Case 2: If vwx is in b+, then uv2wx2y=anbm+|vx| and it is in L. However,
#007
6
Context-Free Grammars and Languages
m+|vx| =n2+|vx|n2, a contradiction.
Case 3: If v is in a+ and x in b+, then z’=uvi+1wxi+1y=an+i|v|bm+i|x| is in L for
any i. Note that |v|>0 and |x|>0. Taking i=|x|, we have (n+i|v|)2 =
(n+|x||v|)2 = n2+2n|x||v|+|x|2|v|2 > n2+|x|2 =m+i|x|, a contradiction.
#007
7