Automata and Formal Languages
Context Free Grammars
Sipser pages 100 - 109
Lecture 11
Tim Sheard
1
Formal Languages
1. Context free languages provide a convenient
notation for recursive description of languages.
2. The original goal of formalizing the structure of
natural languages is still elusive, but CFGs are
now the universally accepted formalism for
definition of (the syntax of) programming
languages.
3. Writing parsers has become an almost fully
automated process thanks to this theory.
A Simple Grammar for English
Example taken from Floyd & Beigel.
<Sentence>
→
<Subject> <Predicate>
<Subject>
→
<Pronoun1> | <Pronoun2>
<Pronoun1>
→
I | we | you | he | she | it | they
<Noun Phrase>
→
<Simple Noun Phrase> | <Article> <Noun Phrase>
<Article>
→
a | an | the
<Simple Noun Phrase> →
<Noun> | <Adjective> <Simple Noun Phrase>
<Predicate>
→
<Verb> | <Verb> <Object>
<Object>
→
<Pronoun2> | <Noun Phrase>
<Pronoun2>
→
me | us | you | him | her | it | them
<Noun>
→
...
<Verb>
→
...
1.
2.
3.
4.
Each rule is called a production
lhs → lhs
lhs is a symbol, rhs is a string of symbols
Variable (or non-terminal) < … >
Symbol (or terminal) bold
Example
Derive the sentence “She drives a shiny black car” from these
rules.
sentence
subject
predicate
pronoun1
object
verb
Noun phrase
article
Simple Noun phrase
adjective
Simple Noun phrase
adjective
Simple noun phrase
Noun
She
drives
a
shiny
black
car
Derivation rules
1. Write down the start symbol
rule, unless otherwise stated)
(lhs of the first
2. Find a variable, v, and a corresponding
rule where: v → rhs
3. Replace the variable with the string rhs
4. Repeat, starting at 2. until no more
variables remain.
Derivation
<sentence> ⇒
I Have underlined
<subject> <predicate> ⇒
the variable I will
replace in the next
<pronoun> <predicate> ⇒
step.
She <predicate> ⇒
She <verb> <object> ⇒
She drives <object> ⇒
She drives <simple noun phrase> ⇒
She drives <article> <noun phrase> ⇒
She drives a <noun phrase> ⇒
She drives a <adjective> <noun phrase> ⇒
She drives a shiny <noun phrase> ⇒
She drives a shiny <adjective> <simple noun phrase> ⇒
She drives a shiny black <simple noun phrase> ⇒
She drives a shiny black <noun> ⇒
She drives a shiny black car
A Grammar for Expressions
<Expression> → <Term> | <Expression> + <Term>
<Term>
→ <Factor> | <Term> * <Factor>
<Factor>
→ <Identifer> | ( <Expression> )
<Identifier>
→ x|y|z|…
In class exercise: Derive
• x + ( y * 3)
•x+z*w+q
Definition of Context-Free-Grammars
A CFG is a quadruple G= (V,Σ,R,S), where
– V is a finite set of variables (nonterminals,
syntactic categories)
– Σ is a finite set of terminals
– R is a finite set of productions -- rules of the
form X→a, where X∈V and a∈(V ∪ Σ)*
– S, the start symbol, is an element of V
Vertical bar (|), as used in the examples on the previous slide, is
used to denote a set of several productions (with the same lhs).
Example
<Expression>
→
<Term> | <Expression> + <Term>
<Term>
→
<Factor> | <Term> * <Factor>
<Factor>
→
<Identifer> | ( <Expression> )
<Identifier>
→
x|y|z|…
V = {<Expression>, <Term>, <Factor>, <Identifier>}
Σ = {+, *, (, ), x, y, z, …}
R={
}
<Expression> → <Term>
<Expression> → <Expression> + <Term>
<Term>
→ <Factor>
<Term>
→ <Term> * <Factor>
<Factor>
→ <Identifer>
<Factor>
→ ( <Expression> )
<Identifier>
→x
<Identifier>
→y
<Identifier>
→z
<Identifier> → …
S = <Expression>
Notational Conventions
a,b,c, … (lower case, beginning of alphabet)
are concrete terminals;
u,v,w,x,y,z (lower case, end of alphabet) are
for strings of terminals
α,β,γ, …
(Greek letters) are for strings over
(T ∪ V) (sentential forms)
A,B,C, … (capitals, beginning of alphabet) are
for variables (for non-terminals).
X,Y,Z are for variables standing for terminals.
Short-hand
Note. We often abbreviate a context free grammar,
such as:
G2 = ( V={S},
Σ={(,)},
R={ S → ε, S → SS, S → (S) },
S=S}
By giving just its productions
S → ε | SS |(S)
And by using the following conventions.
1) The start symbol is the lhs of the first production.
2) Multiple production for the same lhs non-terminal can be
grouped together by using vertical bar ( | )
3) Non-terminals are capitalized.
4) Terminal-symbols are lower case or non-alphabetic.
Definitions
The single-step derivation relation ⇒ on (V∪ T)* is defined
by:
1. α ⇒ β iff β is obtained from α by replacing an
occurrence of the lhs of a production with its rhs. That
is, α'Aα'' ⇒ α'γα'' is true iff A → γ is a production. We
say α'Aα'' yields α'γα''
2. We write α ⇒∗ β when β can be obtained from α
through a sequence of several (possibly zero) derivation
steps.
3. The language of the CFG , G, is the set
L(G) = {w∈T* | S ⇒∗ w} (where S is the start symbol of G)
1. Context-free languages are languages of the form L(G)
Example 1
The familiar non-regular language
L = { akbk | k ≥ 0 }
is context-free.
The grammar G1 for it is given by T={a,b}, V={S},
and productions:
1. S → Λ
2. S → a S b
Here is a derivation showing a3b3∈ L(G):
S ⇒2 aSb ⇒2 aaSbb ⇒2 aaaSbbb ⇒1 aaabbb
(Note: we sometimes label the arrow with a subscript which tells the
production used to enable the transformation)
Example 1 continued
Note, however, that the fact L=L(G1) is not totally
obvious. We need to prove set inclusion both
ways.
To prove L ⊆ L(G1) we must show that there exists
a derivation for every string akbk; this is done by
induction on k.
For the converse, L(G1) ⊆ L, we need to show that
if S ⇒∗ w and w∈T*, then w∈ L. This is done by
induction on the length of derivation of w.
Example 2
The language of balanced parentheses is
context-free. It is generated by the
following grammar:
G2 = ( V={S},
Σ={(,)},
R={ S → ε | SS |(S)},
S=S}
Example 3
Consider the grammar:
S → AS | ε
A → 0A1 | A1 | 01
The derivation:
S⇒ AS ⇒ A1S ⇒ 011S ⇒ 011AS ⇒
0110A1S ⇒ 0110011S ⇒ 0110011
shows that 0110011 ∈ L(G3).
Example 3 notes
The language L(G3) consists of strings
w∈{0,1}* such that:
P(w): Either w=ε, or w begins with 0, and
every block of 0's in w is followed by at
least as many 1's
Again, the proof that G3 generates all and
only strings that satisfy P(w) is not
obvious. It requires a two-part inductive
proof.
Leftmost and Rightmost Derivations
The same string w usually has many possible
derivations S ≡ a0⇒a1⇒a2⇒ … ⇒ an ≡ w
We call a derivation leftmost if in every step
ai⇒ai+1, it is the first (leftmost) variable in ai$
that is being replaced with the rhs of a
production. Similarly, in a rightmost derivation, it
is always the last variable that gets replaced.
The above derivation of the string 0110011 in the
grammar G3 is leftmost. Here is a rightmost
derivation of the same string:
S ⇒ AS ⇒ AAS ⇒ AA ⇒ A0A1 ⇒ A0011 ⇒
A10011 ⇒ 0110011
S → AS | ε
A → 0A1 | A1 | 01
Facts
Every Regular Language is also a Context
Free Language
How might we prove this?
Choose one of the many specifications for
regular languages
Show that every instance of that kind of
specification has a total mapping into a
Context Free Grammar
What is an appropriate choice?
Designing CFGs
Break the language into simpler (disjoint)
parts with Grammars A B C . Then put
them together S → StartA | StartB | StartC
If a Language fragment is Regular construct
a DFA or RE, use these to guide you.
Infinite languages use rules like
R→aR
R→Rb
Languages with linked parts use rules like
R→BxB
R→xRx
In Class Exercise
Map the Haskell Regular Expression
datatype into a Context Free language.
data RegExp a
= Lambda
| Empty
| One a
| Union (RegExp a) (RegExp a)
| Cat (RegExp a) (RegExp a)
| Star (RegExp a)
-------
the empty string ""
the empty set
a singleton set {a}
union of two RegExp
Concatenation
Kleene closure
Find CFG for these languages
{an b an | n Є Nat}
{ w | w Є {a,b}*,
and w is a palindrome of even length}
{an bk | n,k Є Nat, n ≤ k}
{an bk | n,k Є Nat, n ≥ k}
{ w | w Є {a,b}*,
w has equal number of a’s and b’s
}
Ambiguity
A language is ambiguous if one of its strings
has 2 or more leftmost (or rightmost)
derivations.
• Consider the grammar
1. E -> E + E
2. E -> E * E
3. E -> x | y
• And the string:
x+x*y
E ⇒ 1 E + E ⇒ 3 x + E ⇒2 x + E * E ⇒ 3 x + x * E ⇒ 3 x + x * y
E ⇒2 E * E ⇒1 E + E * E ⇒3 x + E * E ⇒3 x + x * E ⇒3 x + x * y
Note we use the productions in a different order.
Common Grammars with ambiguity
Expression grammars with infix operators
with different precedence levels.
Nested if-then-else statements
st -> if exp then st else st
| if exp then st
| id := exp
if x=2 then if x=3 then y:=2 else y := 4
if x=2 then (if x=3 then y:=2 ) else y := 4
if x=2 then (if x=3 then y:=2 else y := 4)
Removing ambiguity.
Adding levels to a grammar
E -> E + E | E * E | id | ( E )
Transform to an equivalent grammar
E -> E + T | T
T -> T * F | F
F -> id | ( E )
Levels make formal the notion of precedence.
Operators that bind “tightly” are on the lowest levels
Chomsky Normal Form defined
There are many CFG's for any given CFL. When reasoning
about CFL's, it often helps to assume that a grammar for
it has some particularly simple form.
A grammar is in Chomsky normal form
(CNF) if every rule is of the form
1. A → BC
where B,C are variables
2. A → a
3. S → ε
where a is a terminal
is allowed (only S can be nullable)
1.
B and C can’t be S
Remove ε-rules
A→ε
Remove unit rules
A→B
Transform rules with too many symbols on rhs
A→BBxC
Finding a Chomsky Normal Form
Every grammar has a equivalent grammar
(that generates the same language) in
Chomsky normal form.
This grammar can be constructed using a
few simple (language preserving)
grammar transformations
ε-Productions
A variable A is nullable if A ⇒∗ ε. We can modify a
given grammar G and obtain a grammar G' in
which there are no nullable variables and which
satisfies L(G') = L(G) - {ε}.
Find nullable symbols iteratively, using these facts:
1. If A → ε is a production, then A is nullable.
2. If A → B1B2 … Bk is a production and B1,B2, … ,Bk
are all nullable, then A is nullable.
Once nullable symbols are known, we get G' as
follows:
1. For every production A → α, add new
productions A → α’ , where α’ is obtained by
deleting some (or all) nullable symbols from α.
2. Remove all productions A → ε
Example. If G contains a production A → BC and
both B and C are nullable, then we add
A→B|C
to G'.
Unit Productions
These are of the form A → B, where A,B are variables.
Assuming the grammar has no ε−productions, we can
eliminate unit productions as follows.
1. Find all pairs of variables such that A ⇒∗ B. (This
happens iff B can be obtained from A by a chain of
unit productions.)
2. Add new production A → α whenever A ⇒∗ B ⇒ α.
3. Remove all unit productions.
Theorem. For every CFG G, there exists a CFG G'
in CNF such that L(G')=L(G) - {ε}
The first three steps of getting G' are elimination of
ε-productions, elimination of unit productions,
and elimination of useless symbols (in that
order). There remain two steps:
1. Arrange that all productions are of the form A
→ α, where α is a terminal, or contains only
variables.
2. Break up every production A → α with | α |>2
into productions whose rhs has length two.
For the first part, introduce a new variable C for each
terminal c that occurs in the rhs of some
production, add the production C → c (unless
such a production already exists), and replace c
with C in all other productions.
For example, the production A → 0B1 would be
replaced with A0 → 0, A1 → 1, A → A0BA1.
An example explains the second part. The production
A → BCDE is replaced by three others,
1. A → BA1,
2. A1 → CA2,
3. A2 → DE,
using two new variables A1, A2.
© Copyright 2026 Paperzz