The pumping lemma for context-free languages

CSE 3813
Introduction to Formal Languages and Automata
Chapter 8
Properties of Context-free Languages
These class notes are based on material from our textbook, An
Introduction to Formal Languages and Automata, 4th ed.,
by Peter Linz, published by Jones and Bartlett Publishers, Inc.,
Sudbury, MA, 2006. They are intended for classroom use
only and are not a substitute for reading the textbook.
• Suppose you have a CFG G in which the variable
A is used in two different rules, to derive two
different strings, e.g.,
(1) S → vAz
(2) A → wAy
(3) A → x
• We can use these rules, applying rule 2
recursively, to generate the following string:
S ⇒ vAz ⇒ vwAyz ⇒ vwwAyyz ⇒
vwwwAyyyz ⇒ ... ⇒ v w^n A y^n z ⇒ v w^n x y^n z.
The pumping lemma for CFLs
Of course, we can apply rule 3 at any point
along the way to bring the process to a halt.
Thus, the following strings are all legitimate
strings in the language:
vwxyz, vwwxyyz, vwwwxyyyz, etc.
In fact, with rules 2 and 3 in the grammar,
there is no way to prevent the language from
containing an infinite number of strings of the
form v w^n x y^n z (provided w and y are not both empty).
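The derivation above can be mimicked in a few lines of Python. This is only an illustrative sketch: the function name derive and the one-letter terminal strings standing in for v, w, x, y, z are assumptions made for the example, not part of the notes.

```python
def derive(n, v="a", w="b", x="c", y="d", z="e"):
    """Apply rule 1 once, rule 2 n times, then rule 3; return the terminal string.

    The placeholder strings must not contain the letters S or A,
    because those letters are used here as the grammar's variables.
    """
    sentential = "S".replace("S", v + "A" + z)             # rule 1: S -> vAz
    for _ in range(n):
        sentential = sentential.replace("A", w + "A" + y)  # rule 2: A -> wAy
    return sentential.replace("A", x)                      # rule 3: A -> x

for n in range(4):
    print(n, derive(n))
```

Each extra application of rule 2 adds one copy of w on the left of A and one copy of y on the right, which is why the two pumped pieces always grow together.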
Remember the definition of Chomsky Normal
Form grammars: A CFG is in Chomsky
Normal Form if every production is of one of
these two types:
A → BC
A → a
Remember also that we can put any CFG
into CNF (omitting the null string, if
it belongs to the original language).
If a grammar is in CNF, then its derivation tree will
be binary; that is, every node will have at most two
children. Why? There are only 3 possibilities:
(1) The node represents the first type of rule
above, in which a single variable produces two
variables.
(2) The node represents the second type of rule
above, in which a single variable produces a single
terminal.
(3) The node is a terminal node and so has no
children.
•A path in a binary tree is either empty, or consists
of a node, one of its descendants, and all of the
nodes in between.
•The length of a path is the number of nodes it
contains (for this class, we will use this definition;
however, length and height are more commonly
measured in edges rather than nodes).
•The height of a binary tree is the length of its
longest path.
• You could create a very tall binary tree by
having all branches be unary.
• You can create the shortest possible binary
tree by having all of its branches be binary,
except possibly for some or all of the
branches at the bottom level of the tree.
• What is the smallest height possible in a binary tree
of 7 nodes? How many leaf nodes does it have?
height = 3
num. leaves = 4
•What is the smallest height possible in a binary tree of
15 nodes? How many leaf nodes does it have?
height = 4
num. leaves = 8
• What is the smallest height possible in a binary tree
of 31 nodes? How many leaf nodes does it have?
height = 5
num. leaves = 16
•What is the smallest height possible in a
binary tree of 2^n - 1 nodes? How many leaf
nodes does it have?
• height = n
• num. leaves = 2^(n-1)
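The answers to the questions above can be checked mechanically. The following is a minimal sketch; the helper names min_height and leaves_when_full are made up for this example, and height is counted in nodes, as in these notes.

```python
def min_height(num_nodes):
    """Smallest height (counted in nodes) of a binary tree with num_nodes nodes."""
    height = 0
    capacity = 0                      # a tree of this height holds at most 2**height - 1 nodes
    while capacity < num_nodes:
        height += 1
        capacity = 2 ** height - 1
    return height

def leaves_when_full(height):
    """Number of leaves in a completely filled binary tree of the given height."""
    return 2 ** (height - 1)

for nodes in (7, 15, 31):
    h = min_height(nodes)
    print(nodes, h, leaves_when_full(h))
```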
Note the pattern here:
In a completely filled binary tree with
2^n - 1 nodes, half of the nodes (rounding
up) will be leaves. That is, 2^n / 2 nodes will
be leaf nodes. And we can rewrite 2^n / 2 as
2^(n-1).
This leads us to the following lemma:
Lemma:
For any h ≥ 1, a binary tree which has more
than 2^(h-1) leaf nodes must have a height
greater than h.
Example:
If a binary tree has 17 leaf nodes, can it have
a height of 5?
No; a complete binary tree of height 5 has
only 16 leaf nodes. A binary tree with 17 leaves
must have a height greater than 5.
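The lemma can be read as a bound on how many leaves fit under a given height. Here is a small sketch of that reading; max_leaves and min_height_for_leaves are names invented for this example.

```python
def max_leaves(height):
    """Most leaves a binary tree of this height (counted in nodes) can have."""
    return 2 ** (height - 1)

def min_height_for_leaves(num_leaves):
    """Smallest height whose leaf capacity reaches num_leaves."""
    height = 1
    while max_leaves(height) < num_leaves:
        height += 1
    return height

print(max_leaves(5))              # capacity at height 5
print(min_height_for_leaves(17))  # height forced by 17 leaves
```

With 17 leaves the loop has to go past height 5, which is exactly the situation in the example above.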
Here is the point of all this:
If the derivation tree for a given string in the
language has height h, its longest path contains
h - 1 variables followed by one terminal. If the
grammar has fewer than h - 1 variables, then by
the pigeonhole principle at least one variable
must recur on that path.
For a variable to recur farther down in the
same path, it must be either:
• self-recursive (e.g., A → aA)
or
• path-recursive (e.g., A → aB, and B → bA)
In either case, this variable may be pumped
an unrestricted number of times.
Theorem 8.1
Let L be a CFL. Then there is an integer m so
that for any w ∈ L satisfying |w| ≥ m, there
are strings u, v, x, y, and z satisfying
w = uvxyz
|vy| > 0
|vxy| ≤ m
for any i ≥ 0, u v^i x y^i z ∈ L
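To make the conditions of the theorem concrete, here is a small sketch that pumps one decomposition of a string from the context-free language {a^n b^n}. The language, the value m = 3, the particular split, and the helper names pump and in_anbn are all assumptions chosen for illustration.

```python
def pump(u, v, x, y, z, i):
    """Return the string u v^i x y^i z."""
    return u + v * i + x + y * i + z

def in_anbn(s):
    """Membership test for the example language {a^n b^n : n >= 0}."""
    n = len(s) // 2
    return s == "a" * n + "b" * n

# One decomposition of w = aaabbb that meets the conditions with m = 3.
u, v, x, y, z = "aa", "a", "", "b", "bb"
assert u + v + x + y + z == "aaabbb"
assert len(v + y) > 0 and len(v + x + y) <= 3

print([pump(u, v, x, y, z, i) for i in range(4)])
print(all(in_anbn(pump(u, v, x, y, z, i)) for i in range(10)))
```

Note that i = 0 is allowed: removing v and y altogether must also leave a string in the language.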
• We can use the pumping lemma for
context-free languages to prove that a
given language is not context-free.
• We do this by assuming that the language is
context-free; this means that there must be an
m satisfying the conditions given above.
• If we find a string w in the language with |w| ≥ m
for which every allowed decomposition leads to a
contradiction, then we know the language can’t
be a CFL.
Proof
• Given the language L = {a^i b^i c^i | i ≥ 1},
assume that L is context-free.
• Let w = a^m b^m c^m, so that |w| = 3m ≥ m.
• According to Theorem 8.1, |vy| > 0. Thus, v
and y together must contain at least one
symbol.
• According to Theorem 8.1, |vxy| ≤ m. Thus,
the string vxy can contain at most two distinct
types of symbols.
Proof
• The string vxy can’t contain all three symbols, a,
b, and c. (Why? Because |vxy| ≤ m, and the m b’s
separate the a’s from the c’s.)
• The string u v^2 x y^2 z contains additional occurrences
of the symbols in v and y, but no additional occurrences
of the symbol type missing from vxy.
• Therefore, u v^2 x y^2 z cannot contain equal numbers
of all three symbols.
• But the pumping lemma says that u v^2 x y^2 z must be
a legitimate string in L. This is a
contradiction.
• Consequently, L cannot be a context-free
language.
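The case analysis in this proof can be sanity-checked by brute force for small values of m. The sketch below is not a proof (the lemma only says that some m exists, and only a few values are tried here); the function names are invented for the example.

```python
def in_L(s):
    """Membership test for L = {a^i b^i c^i : i >= 1}."""
    i = len(s) // 3
    return i >= 1 and s == "a" * i + "b" * i + "c" * i

def decompositions(w, m):
    """Yield every split w = uvxyz with |vy| > 0 and |vxy| <= m."""
    n = len(w)
    for start in range(n):                                    # vxy = w[start:end]
        for end in range(start + 1, min(start + m, n) + 1):
            for p in range(start, end + 1):                   # v = w[start:p]
                for q in range(p, end + 1):                   # x = w[p:q], y = w[q:end]
                    v, y = w[start:p], w[q:end]
                    if v or y:
                        yield w[:start], v, w[p:q], y, w[end:]

def some_split_survives(m):
    """True if some allowed split of a^m b^m c^m stays in L for both i = 0 and i = 2."""
    w = "a" * m + "b" * m + "c" * m
    for u, v, x, y, z in decompositions(w, m):
        if in_L(u + x + z) and in_L(u + v * 2 + x + y * 2 + z):
            return True
    return False

for m in range(1, 6):
    print(m, some_split_survives(m))
```

For each m tried, every allowed split breaks when pumped, which matches the argument that vxy can touch at most two of the three symbol types.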
Example
Given the language L = {a^i b^i c^i | i ≥ 1}, how
would you try to recognize this language using
a push-down automaton?
We can ensure that we have an equal number
of a’s and b’s, by pushing the a’s onto the
stack one at a time, then popping them off
and matching them up with the b’s one by
one.
Example
• However, once we have done that, we don’t
have anything left to match the c’s with, so
we can’t guarantee that we have the same
number of c’s as a’s and b’s.
• We can’t solve this problem by pushing a’s
or b’s back onto the stack.
• This is due to the limitations of the type of
memory we have in a PDA.
Pumping lemma (again)
• The pumping lemma for regular languages
states: every sufficiently long string in a
regular language contains a short substring
that can be pumped.
• The pumping lemma for context-free
languages states: every sufficiently long
string in a context-free language contains
two short (and close-together) substrings that
can be pumped (the same number of times).
Formal statement (again)
Let L be a context-free language. Then there
exists some positive integer m such that any
string w ∈ L of length |w| ≥ m can be
decomposed into substrings u, v, x, y, z such
that w = uvxyz, and
|vxy| ≤ m,
|v| > 0 or |y| > 0,
u v^k x y^k z ∈ L, for all k ≥ 0
Informal statement
Every context-free language has a “pumping
length” such that every string in the language
that is at least this long can be pumped to
yield another string in the language.
The string can be divided into five parts such
that the second and fourth parts can be
repeated together, or “pumped,” any number
of times, and the resulting string remains in
the language.
What is m?
In the pumping lemma for regular languages,
the “pumping length” m reflects the number
of states of the finite automaton.
In the pumping lemma for context-free
languages, what does m reflect? Roughly, it
is the length of the longest string that can be
generated by a parse tree in which the same
nonterminal never occurs twice on the same
path through the tree.
In a sufficiently large parse tree, some
nonterminal must repeat along some path
from the root. This follows from the
pigeonhole principle.
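[Figure: parse tree rooted at S in which the nonterminal A occurs twice on one path; read left to right, the leaves spell u, v, x, y, z.]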
Proof Idea
The repetition of some nonterminal along a path
through the parse tree allows us to replace the
subtree under the last occurrence of the
nonterminal with the subtree under an earlier
occurrence of the nonterminal and still get a
valid parse tree.
• This corresponds to pumping v and y.
• Note that the parse tree shown above
corresponds to the following derivation:
S ⇒* uAz ⇒* uvAyz ⇒* uvxyz
Important to remember
You can use a pumping lemma to prove
that a language is not context-free (or
regular).
You cannot use a pumping lemma to prove
that a language is context-free (or regular).
Exercise
The language L = {ww | w ∈ {a, b}*} is not
context-free.
Pick a string in L. Try a^m b^m a^m b^m. Then note that
you must consider three cases: because |vxy| ≤ m,
vxy must be a substring of the prefix a^m b^m, or the
“middle” b^m a^m, or the suffix a^m b^m.
Intuitively, why can’t a PDA accept this language,
although it can accept the language {w w^R | w ∈
{a, b}*}?
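The choice of string matters in this exercise. As a hedged illustration (not part of the notes), the sketch below compares the suggested string a^m b^m a^m b^m with a different string, a^m b a^m b, for m = 4: the second one can be pumped and so yields no contradiction. The names is_ww, pump and pumpable are invented for this example, and only i = 0 and i = 2 are tested.

```python
def is_ww(s):
    """Membership test for {ww : w in {a, b}*}."""
    half = len(s) // 2
    return len(s) % 2 == 0 and s[:half] == s[half:]

def pump(u, v, x, y, z, i):
    """Return the string u v^i x y^i z."""
    return u + v * i + x + y * i + z

def pumpable(w, m, tries=(0, 2)):
    """True if some split with |vxy| <= m and |vy| > 0 stays in the language for every i in tries."""
    n = len(w)
    for s in range(n):                                   # vxy = w[s:e]
        for e in range(s + 1, min(s + m, n) + 1):
            for p in range(s, e + 1):                    # v = w[s:p]
                for q in range(p, e + 1):                # y = w[q:e]
                    if p == s and q == e:
                        continue                         # v and y both empty
                    parts = (w[:s], w[s:p], w[p:q], w[q:e], w[e:])
                    if all(is_ww(pump(*parts, i)) for i in tries):
                        return True
    return False

m = 4
weak_choice = "a" * m + "b" + "a" * m + "b"
good_choice = "a" * m + "b" * m + "a" * m + "b" * m
print(pumpable(weak_choice, m))   # a split such as v = a, x = b, y = a works here
print(pumpable(good_choice, m))
```

For the weaker string, taking v as the last a of the first block, x as the b, and y as the first a of the second block keeps both halves equal under pumping, so that string cannot be used to finish the proof.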
Pumping Lemma for Linear Languages
Let L be an infinite linear language. Then
there exists some positive integer m, such
that any w ∈ L with |w| ≥ m can be
decomposed as w = uvxyz with
|uvyz| ≤ m
|vy| ≥ 1
such that
u v^i x y^i z ∈ L
for all i = 0, 1, 2, …
Pumping Lemma for Linear Languages
Note that the conclusion of this theorem is different
from that of Theorem 8.1, since in 8.1 we have
|vxy| ≤ m
and in Theorem 8.2 we have
|uvyz| ≤ m
This implies that the strings v and y to be pumped must
now be within m symbols of the left and right ends of w,
respectively. The middle string x can be of arbitrary
length.
Theorem 8.2 helps establish the fact that the family of
linear languages is a proper subset of the family of
context-free languages.
Closure properties for context-free languages
The family of context-free languages is
closed under the operations of:
Union
Concatenation
Kleene closure
but not under the operations of
Intersection
Complementation
Definition
A context-free grammar (CFG) is a 4-tuple
G = (V, T, S, P) where V and T are disjoint
sets, S ∈ V, and P is a finite set of rules of the
form A → x, where A ∈ V and x ∈ (V ∪ T)*.
V = non-terminals or variables
T = terminals
S = Start symbol
P = Productions or grammar rules
Closure properties of CFLs
CFLs are closed under Union, Concatenation
and Kleene closure.
Proof by construction:
Let
G1 = (V1, T1, S1, P1) and
G2 = (V2, T2, S2, P2)
with
L1 = L(G1) and
L2 = L(G2)
Union
We create grammar Gu = (Vu, T1 ∪ T2, Su, Pu)
generating
L1 ∪ L2
1. Rename the elements of V2 if necessary so that
V1 ∩ V2 = ∅.
2. Create a new start symbol Su, not already in V1
or V2.
3. Set Vu = V1 ∪ V2 ∪ {Su}
4. Set Pu = P1 ∪ P2 ∪ {Su → S1 | S2}
Construction completed.
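The four steps of the union construction translate almost directly into code. This is a sketch under an assumed representation: a grammar is a dictionary with keys "V", "T", "S" and "P", where "P" maps each variable to a list of right-hand sides given as tuples of symbols. The representation, the function name union_grammar, and the two sample grammars are choices made for this example only.

```python
def union_grammar(g1, g2, new_start="Su"):
    """Build Gu from G1 and G2, assuming step 1 (renaming) has already been done."""
    assert not (g1["V"] & g2["V"]), "variable sets must be disjoint (step 1)"
    assert new_start not in g1["V"] | g2["V"]           # step 2: fresh start symbol
    return {
        "V": g1["V"] | g2["V"] | {new_start},           # step 3
        "T": g1["T"] | g2["T"],
        "S": new_start,
        "P": {**g1["P"], **g2["P"],                     # step 4: Su -> S1 | S2
              new_start: [(g1["S"],), (g2["S"],)]},
    }

# Sample grammars: S1 -> a S1 b | λ   and   S2 -> c S2 | c
g1 = {"V": {"S1"}, "T": {"a", "b"}, "S": "S1", "P": {"S1": [("a", "S1", "b"), ()]}}
g2 = {"V": {"S2"}, "T": {"c"}, "S": "S2", "P": {"S2": [("c", "S2"), ("c",)]}}

gu = union_grammar(g1, g2)
print(gu["S"], gu["P"]["Su"])
```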
Concatenation
We create grammar Gc = (Vc, T1 ∪ T2, Sc, Pc)
generating L1L2
1. Rename the elements of V2 if necessary so that
V1 ∩ V2 = ∅.
2. Create a new start symbol Sc, not already in V1
or V2.
3. Set Vc = V1 ∪ V2 ∪ {Sc}
4. Set Pc = P1 ∪ P2 ∪ {Sc → S1S2}
Construction completed.
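The concatenation construction can be sketched the same way, this time together with a small enumerator that lists the short strings of the resulting language. Everything here is an assumption made for illustration: the dictionary representation of a grammar, the names concat_grammar and language, and the sample grammars. The length cap inside language is a crude pruning rule that is adequate for these tiny grammars, not a general algorithm.

```python
from collections import deque

def concat_grammar(g1, g2, new_start="Sc"):
    """Build Gc from G1 and G2, assuming step 1 (renaming) has already been done."""
    assert not (g1["V"] & g2["V"])
    assert new_start not in g1["V"] | g2["V"]            # step 2
    return {
        "V": g1["V"] | g2["V"] | {new_start},            # step 3
        "T": g1["T"] | g2["T"],
        "S": new_start,
        "P": {**g1["P"], **g2["P"], new_start: [(g1["S"], g2["S"])]},  # step 4
    }

def language(g, max_len):
    """Terminal strings of length <= max_len found by leftmost derivations."""
    found, seen = set(), set()
    queue = deque([(g["S"],)])
    while queue:
        form = queue.popleft()
        idx = next((k for k, sym in enumerate(form) if sym in g["V"]), None)
        if idx is None:                                  # no variables left
            found.add("".join(form))
            continue
        for rhs in g["P"][form[idx]]:
            new = form[:idx] + tuple(rhs) + form[idx + 1:]
            terminals = sum(1 for sym in new if sym not in g["V"])
            if terminals <= max_len and len(new) <= 2 * max_len + 2 and new not in seen:
                seen.add(new)
                queue.append(new)
    return found

# Sample grammars: S1 -> a S1 b | λ   and   S2 -> c S2 | c
g1 = {"V": {"S1"}, "T": {"a", "b"}, "S": "S1", "P": {"S1": [("a", "S1", "b"), ()]}}
g2 = {"V": {"S2"}, "T": {"c"}, "S": "S2", "P": {"S2": [("c", "S2"), ("c",)]}}

gc = concat_grammar(g1, g2)
print(sorted(language(gc, 4)))
```

Every string printed is a string of L1 followed by a string of L2, as the construction intends.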
Closure under Kleene star
Let G1 be any context-free grammar with
the starting symbol S, where S does not appear
on the right-hand side of any rule. Adding the rules
S → λ and
S → SS
creates a new context-free grammar G2
such that L(G2) is the result of applying the
Kleene star operator to L(G1).
Kleene Closure
We create grammar G* = (V*, T1, S, P*)
generating L1*
1. Create a new start symbol S, not already
in V1.
2. Set V* = V1 ∪ {S}
3. Set P* = P1 ∪ {S → S1 S | λ}
Construction completed. (See text for
justification.)
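The Kleene closure construction follows the same pattern as the other two. A minimal sketch, again assuming a dictionary representation of grammars ("V", "T", "S", "P", with right-hand sides as tuples and the empty tuple standing for λ); the name star_grammar and the sample grammar are hypothetical.

```python
def star_grammar(g1, new_start="S"):
    """Build G* from G1 using a fresh start symbol."""
    assert new_start not in g1["V"]                      # step 1: fresh start symbol
    return {
        "V": g1["V"] | {new_start},                      # step 2
        "T": set(g1["T"]),
        "S": new_start,
        "P": {**g1["P"],                                 # step 3: S -> S1 S | λ
              new_start: [(g1["S"], new_start), ()]},
    }

# Sample grammar: S1 -> a S1 b | ab
g1 = {"V": {"S1"}, "T": {"a", "b"}, "S": "S1",
      "P": {"S1": [("a", "S1", "b"), ("a", "b")]}}

g_star = star_grammar(g1)
print(g_star["P"]["S"])
```

Using a fresh start symbol keeps the new rules from interacting with any recursive use of S1 inside G1's own productions.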
Not closed under intersection
The context-free languages are not closed
under Intersection. However, the
intersection of a context-free language
with a regular language is always a
context-free language.
The context-free languages are not closed
under Complementation
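A standard witness for the failure of closure under intersection, added here as an illustration rather than taken from the slides, uses L1 = {a^n b^n c^k | n, k ≥ 1} and L2 = {a^k b^n c^n | n, k ≥ 1}. Both are context-free, but their intersection is {a^n b^n c^n | n ≥ 1}, which the pumping lemma proof earlier in these notes shows is not context-free. The sketch below checks the intersection claim on all short strings; the function names are invented for the example.

```python
import re
from itertools import product

def exponents(s):
    """Return (i, j, k) if s = a^i b^j c^k with i, j, k >= 1, else None."""
    match = re.fullmatch(r"(a+)(b+)(c+)", s)
    return tuple(len(group) for group in match.groups()) if match else None

def in_L1(s):
    """L1 = {a^n b^n c^k : n, k >= 1}."""
    e = exponents(s)
    return e is not None and e[0] == e[1]

def in_L2(s):
    """L2 = {a^k b^n c^n : n, k >= 1}."""
    e = exponents(s)
    return e is not None and e[1] == e[2]

def in_target(s):
    """{a^n b^n c^n : n >= 1}."""
    e = exponents(s)
    return e is not None and e[0] == e[1] == e[2]

# Every string over {a, b, c} of length at most 6.
words = ["".join(t) for n in range(7) for t in product("abc", repeat=n)]
print(all((in_L1(s) and in_L2(s)) == in_target(s) for s in words))
```

Since the context-free languages are closed under union, closure under complementation would imply closure under intersection by De Morgan's laws, so complementation must fail as well.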
Corollary:
Are Regular Languages context free?
Yes.
Why?
We can express any Regular language in the
form of a CFG.
The regular languages are a proper subset of
the context-free languages.
Are Regular Languages context free?
Proof:
According to your textbook, the set of regular
languages is the smallest set that contains the
languages ∅, {λ}, and {a} (for every a ∈ Σ)
and is closed under the operations of union,
concatenation, and Kleene*. We just
demonstrated that the operations of union,
concatenation, and Kleene* on CFLs produce
CFLs, so all we need to do is show that the
languages ∅, {λ}, and {a} have CFGs.
Are Regular Languages context free?
The empty language can be generated by the single rule
S → S
The language consisting of the null string can be generated by
S → λ
The language consisting of a single character a can be
generated by
S → a
QED
Decision properties of
context-free languages
Can decide:
Membership (is a given string w in L?)
Emptiness (is L empty?)
Infiniteness (is L infinite?)
But there is no algorithm for deciding
whether two CFGs generate the same
language!
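One well-known way to decide membership for a grammar in Chomsky Normal Form is the CYK algorithm. The sketch below is a plain implementation of that idea, offered as an illustration rather than as the method these notes have in mind. The rule encoding (pairs (A, a) for A → a and triples (A, B, C) for A → BC), the function name cyk, and the sample grammar for {a^n b^n | n ≥ 1} are assumptions made for this example.

```python
def cyk(word, start, unit_rules, pair_rules):
    """Decide whether a CNF grammar derives word.

    unit_rules: set of (A, a) meaning A -> a
    pair_rules: set of (A, B, C) meaning A -> BC
    """
    n = len(word)
    if n == 0:
        return False                  # CNF as defined in these notes omits the null string
    # table[i][l - 1] holds the variables that derive word[i:i + l]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(word):
        table[i][0] = {A for (A, a) in unit_rules if a == ch}
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for (A, B, C) in pair_rules:
                    if B in left and C in right:
                        table[i][length - 1].add(A)
    return start in table[0][n - 1]

# Sample CNF grammar for {a^n b^n : n >= 1}:
#   S -> AB | AX,  X -> SB,  A -> a,  B -> b
unit_rules = {("A", "a"), ("B", "b")}
pair_rules = {("S", "A", "B"), ("S", "A", "X"), ("X", "S", "B")}

for w in ("ab", "aabb", "aab", "abab"):
    print(w, cyk(w, "S", unit_rules, pair_rules))
```

The table has one entry per substring, and each entry is filled from shorter substrings, so the procedure always terminates; that is what makes membership decidable.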