Context-Free Grammars

Context-Free Grammars
24 October 2013
OSU CSE
1
BL Compiler Structure
• Tokenizer: string of characters (source code) → string of tokens (“words”)
• Parser: string of tokens (“words”) → abstract program
• Code Generator: abstract program → string of integers (object code)
• The parser is arguably the most interesting, and most difficult, piece of the BL compiler.
Plan for the BL Parser
• Design a context-free grammar (CFG) to
specify syntactically valid BL programs
• Use the grammar to implement a
recursive-descent parser (i.e., an
algorithm to parse a BL program and
construct the corresponding Program
object)
• A grammar is a set of formation rules for strings in a language.
• A grammar is context-free if it satisfies certain technical conditions described herein.
Languages
• A language is a set of strings over some
alphabet Σ
• If L is a language, then mathematically it is
a set of string of Σ
Aside: Characters vs. Tokens
• In the following examples of CFGs, we
deal with languages over the alphabet of
individual characters (e.g., Java’s char
values)
Σ = character
• In the BL project, we deal with languages
over an alphabet of tokens (to be
explained later)
Example: Real-Number Constants
• Some syntactically valid real-number
constants (i.e., some strings in the
“language of valid real-number
constants”):
37.044
615.22E16
99241.
18.E-93
CFG Rewrite Rules
real-const → digit-seq . digit-seq |
             digit-seq . digit-seq exponent |
             digit-seq . |
             digit-seq . exponent
exponent → E digit-seq |
           E + digit-seq |
           E - digit-seq
digit-seq → digit digit-seq |
            digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• Each of the entries above (for real-const, exponent, digit-seq, and digit) is a rewrite rule (a replacement rule), which describes how strings in the language may be formed.
• A name on the left of a rewrite rule is called a non-terminal symbol.
• The special CFG symbol → means “can be rewritten as” or “can be replaced by”.
• The special CFG symbol | means “or”, i.e., there are multiple possible “rewrites” for the same non-terminal.
• So this ...
real-const → digit-seq . digit-seq |
             digit-seq . digit-seq exponent |
             digit-seq . |
             digit-seq . exponent
• ... means exactly the same thing as these four separate rewrite rules:
real-const → digit-seq . digit-seq
real-const → digit-seq . digit-seq exponent
real-const → digit-seq .
real-const → digit-seq . exponent
• One non-terminal symbol (normally in the first rewrite rule) is called the start symbol.
• A symbol from the alphabet on the right-hand side of a rewrite rule is called a terminal symbol.
• To remember the name: terminal symbols are what you end up with when generating strings in the language (see below).
Four Components of a CFG
• Non-terminal symbols for this CFG:
– real-const, exponent, digit-seq, digit
• Terminal symbols for this CFG:
– ., E, +, -, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9
• Start symbol for this CFG:
– real-const
• Rewrite rules for this CFG:
– (see previous slides)
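The four components can be written down as plain data. The following is only a sketch in Python (the slides themselves give no code); the names RULES, START, NON_TERMINALS, and TERMINALS are hypothetical, and the dictionary layout is one possible encoding among many.

```python
# Hypothetical encoding of the real-number-constant CFG from the slides.
# Each non-terminal maps to its list of alternatives (the "|" choices);
# each alternative is a list of symbols.
RULES = {
    "real-const": [
        ["digit-seq", ".", "digit-seq"],
        ["digit-seq", ".", "digit-seq", "exponent"],
        ["digit-seq", "."],
        ["digit-seq", ".", "exponent"],
    ],
    "exponent": [
        ["E", "digit-seq"],
        ["E", "+", "digit-seq"],
        ["E", "-", "digit-seq"],
    ],
    "digit-seq": [
        ["digit", "digit-seq"],
        ["digit"],
    ],
    "digit": [[d] for d in "0123456789"],
}

# The start symbol is the non-terminal of the first rewrite rule.
START = "real-const"

# Non-terminals are the names on the left of a rewrite rule.
NON_TERMINALS = set(RULES)

# Terminals are the right-hand-side symbols that are not non-terminals.
TERMINALS = {
    symbol
    for alternatives in RULES.values()
    for alternative in alternatives
    for symbol in alternative
    if symbol not in NON_TERMINALS
}

print(sorted(NON_TERMINALS))
print(sorted(TERMINALS))
```

Deriving the terminal set from the rules, rather than listing it by hand, keeps the four components consistent with one another.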
Derivations
• A derivation of a string of terminal
symbols consists of a sequence of specific
rewrite-rule applications that begin with the
start symbol and continue until only
terminal symbols remain
– A string is in the language of the CFG iff
there is a derivation that leads to it
• The symbol ⇒ indicates a derivation step,
i.e., a specific rewrite-rule application
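To make the definition concrete, here is a minimal Python sketch that produces a derivation by repeatedly rewriting the leftmost non-terminal with a randomly chosen alternative until only terminal symbols remain. The function name derive, the RULES table, and the leftmost-first and random choices are all assumptions made for illustration; the definition above does not prescribe any particular order or choice.

```python
import random

# Hypothetical encoding of the real-number-constant CFG:
# non-terminal -> list of alternatives, each a list of symbols.
RULES = {
    "real-const": [
        ["digit-seq", ".", "digit-seq"],
        ["digit-seq", ".", "digit-seq", "exponent"],
        ["digit-seq", "."],
        ["digit-seq", ".", "exponent"],
    ],
    "exponent": [
        ["E", "digit-seq"],
        ["E", "+", "digit-seq"],
        ["E", "-", "digit-seq"],
    ],
    "digit-seq": [["digit", "digit-seq"], ["digit"]],
    "digit": [[d] for d in "0123456789"],
}


def derive(start, rng):
    """Return the list of forms visited by one derivation from `start`.

    Each step rewrites the leftmost non-terminal using an alternative
    picked by `rng`; the loop stops when only terminals remain.
    """
    form = [start]
    steps = [list(form)]
    while any(symbol in RULES for symbol in form):
        index = next(i for i, s in enumerate(form) if s in RULES)
        replacement = rng.choice(RULES[form[index]])
        form = form[:index] + replacement + form[index + 1:]
        steps.append(list(form))
    return steps


# Print one derivation, one step per line.
for form in derive("real-const", random.Random(0)):
    print("=>", " ".join(form))
```

Every string this sketch prints on its last line is in the language of the CFG, because a derivation leading to it has just been exhibited.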
Example: Derivation of 5.6E10
• Begin with the start symbol:
real-const
• ... and pick one possible rewrite (which rewrite is appropriate to derive 5.6E10?):
real-const → digit-seq . digit-seq |
             digit-seq . digit-seq exponent |
             digit-seq . |
             digit-seq . exponent
• This is the first step of the derivation:
real-const ⇒ digit-seq . digit-seq exponent
• Choose a non-terminal to rewrite (here, the first digit-seq):
real-const ⇒ digit-seq . digit-seq exponent
• ... and pick one possible rewrite (which rewrite is appropriate to derive 5.6E10?):
digit-seq → digit digit-seq |
            digit
• This is the second step of the derivation:
real-const ⇒ digit-seq . digit-seq exponent
⇒ digit . digit-seq exponent
• Choose a non-terminal to rewrite (here, digit):
real-const ⇒ digit-seq . digit-seq exponent
⇒ digit . digit-seq exponent
• ... and pick one possible rewrite:
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• This is the third step of the derivation:
real-const ⇒ digit-seq . digit-seq exponent
⇒ digit . digit-seq exponent
⇒ 5 . digit-seq exponent
• Choose a non-terminal to rewrite (here, digit-seq):
real-const ⇒ digit-seq . digit-seq exponent
⇒ digit . digit-seq exponent
⇒ 5 . digit-seq exponent
• ... and pick one possible rewrite:
digit-seq → digit digit-seq |
            digit
One Derivation of 5.6E10
real-const ⇒ digit-seq . digit-seq exponent
⇒ digit . digit-seq exponent
⇒ 5 . digit-seq exponent
⇒ 5 . digit exponent
⇒ 5 . 6 exponent
⇒ 5 . 6 E digit-seq
⇒ 5 . 6 E digit digit-seq
⇒ 5 . 6 E 1 digit-seq
⇒ 5 . 6 E 1 digit
⇒ 5 . 6 E 1 0 = 5.6E10
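The derivation above can be replayed mechanically. In this Python sketch, each step is recorded as a pair (position of the non-terminal to rewrite, index of the alternative to use). The RULES table, the rewrite function, and the STEPS list are hypothetical names introduced for illustration only.

```python
# Hypothetical encoding of the real-number-constant CFG:
# non-terminal -> list of alternatives, each a list of symbols.
RULES = {
    "real-const": [
        ["digit-seq", ".", "digit-seq"],
        ["digit-seq", ".", "digit-seq", "exponent"],
        ["digit-seq", "."],
        ["digit-seq", ".", "exponent"],
    ],
    "exponent": [
        ["E", "digit-seq"],
        ["E", "+", "digit-seq"],
        ["E", "-", "digit-seq"],
    ],
    "digit-seq": [["digit", "digit-seq"], ["digit"]],
    "digit": [[d] for d in "0123456789"],
}


def rewrite(form, index, choice):
    """One derivation step: replace the non-terminal at form[index]
    with alternative number `choice` of its rewrite rule."""
    symbol = form[index]
    return form[:index] + RULES[symbol][choice] + form[index + 1:]


# (position to rewrite, alternative to use) for each of the ten steps.
STEPS = [
    (0, 1),  # real-const => digit-seq . digit-seq exponent
    (0, 1),  # first digit-seq => digit
    (0, 5),  # digit => 5
    (2, 1),  # digit-seq => digit
    (2, 6),  # digit => 6
    (3, 0),  # exponent => E digit-seq
    (4, 0),  # digit-seq => digit digit-seq
    (4, 1),  # digit => 1
    (5, 1),  # digit-seq => digit
    (5, 0),  # digit => 0
]

form = ["real-const"]
for index, choice in STEPS:
    form = rewrite(form, index, choice)
    print("=>", " ".join(form))

print("".join(form))  # 5.6E10
```

Reordering the entries of STEPS (with positions adjusted accordingly) gives other derivations of the same string.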
• Note that a derivation is used in this way to generate a string in the language of the CFG.
Another Derivation of 5.6E10
real-const ⇒ digit-seq . digit-seq exponent
⇒ digit-seq . digit-seq E digit-seq
⇒ digit-seq . digit-seq E digit digit-seq
⇒ digit-seq . digit-seq E digit digit
⇒ digit-seq . digit-seq E digit 0
⇒ digit-seq . digit-seq E 1 0
⇒ digit-seq . digit E 1 0
⇒ digit-seq . 6 E 1 0
⇒ digit . 6 E 1 0
⇒ 5 . 6 E 1 0 = 5.6E10
Derivation Trees
• A derivation tree depicts a derivation
(such as those above) in a tree
• Note that the order in which rewrites are
done is sometimes arbitrary
– A tree captures the required temporal order of
rewrites from top-to-bottom
– A tree captures the required spatial order
among terminal symbols from left-to-right
A Derivation Tree for 5.6E10
(children are indented under their parent and listed left to right)
real-const
  digit-seq
    digit
      5
  .
  digit-seq
    digit
      6
  exponent
    E
    digit-seq
      digit
        1
      digit-seq
        digit
          0
• This tree captures both derivations previously illustrated (and all others) for 5.6E10.
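As a small illustration of the spatial (left-to-right) order a tree captures, the sketch below stores the derivation tree for 5.6E10 as nested Python tuples and reads off its leaves. The tuple layout and the names TREE and leaves are assumptions chosen for this example.

```python
# A derivation tree as nested tuples: (symbol, child, child, ...).
# A leaf (terminal symbol) is a plain string.
TREE = (
    "real-const",
    ("digit-seq", ("digit", "5")),
    ".",
    ("digit-seq", ("digit", "6")),
    ("exponent",
        "E",
        ("digit-seq",
            ("digit", "1"),
            ("digit-seq", ("digit", "0")))),
)


def leaves(node):
    """Return the terminal symbols of the tree, left to right."""
    if isinstance(node, str):
        return [node]
    result = []
    for child in node[1:]:  # node[0] is the non-terminal's name
        result.extend(leaves(child))
    return result


print("".join(leaves(TREE)))  # 5.6E10
```

The tree does not record which of two unrelated subtrees was expanded first, which is why several derivations share it.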
Other Examples
• Can you find a derivation tree for 5.E3?
– If so, it’s in the language of the CFG;
otherwise it’s not in that language
• Can you find a derivation tree for .6E10?
– If so, it’s in the language of the CFG;
otherwise it’s not in that language
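One way to check answers to questions like these is a small recognizer that follows the rewrite rules directly. The Python sketch below is an assumption-laden illustration, not the course's parser: the function names are hypothetical, the recursion in digit-seq is replaced by a loop, and the minus sign is taken to be the plain "-" character.

```python
def parse_digit_seq(text, pos):
    """digit-seq -> digit digit-seq | digit.

    Returns the position just past the digits starting at pos,
    or None if no digit starts there.
    """
    end = pos
    while end < len(text) and text[end] in "0123456789":
        end += 1
    return end if end > pos else None


def parse_exponent(text, pos):
    """exponent -> E digit-seq | E + digit-seq | E - digit-seq."""
    if pos >= len(text) or text[pos] != "E":
        return None
    pos += 1
    if pos < len(text) and text[pos] in "+-":
        pos += 1
    return parse_digit_seq(text, pos)


def is_real_const(text):
    """True iff text can be derived from real-const."""
    # Every alternative begins with: digit-seq .
    pos = parse_digit_seq(text, 0)
    if pos is None or pos >= len(text) or text[pos] != ".":
        return False
    pos += 1
    # The digit-seq after the point is present in only some alternatives.
    after_digits = parse_digit_seq(text, pos)
    if after_digits is not None:
        pos = after_digits
    if pos == len(text):
        return True  # alternatives without an exponent
    return parse_exponent(text, pos) == len(text)


for candidate in ["5.E3", ".6E10", "37.044", "18.E-93"]:
    print(candidate, is_real_const(candidate))
```

Under these assumptions the sketch accepts 5.E3 and rejects .6E10, since every alternative for real-const must begin with a digit-seq.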
A Famous CFG
expr → expr add-op term | term
term → term mult-op factor | factor
factor → ( expr ) | digit-seq
add-op → + | -
mult-op → * | DIV | REM
digit-seq → digit digit-seq | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Example: 4+6*2
• Find a derivation tree for 4+6*2
A Derivation Tree for 4+6*2
(children are indented under their parent and listed left to right)
expr
  expr
    term
      factor
        digit-seq
          digit
            4
  add-op
    +
  term
    term
      factor
        digit-seq
          digit
            6
    mult-op
      *
    factor
      digit-seq
        digit
          2
Example: (4+6)*2
• Find a derivation tree for (4+6)*2
• How is it different from the previous one?
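One way to see the difference is to evaluate each expression by following the grammar's structure. The following Python sketch is a hypothetical recursive-descent style evaluator for the expr/term/factor grammar. Several things in it are assumptions: the function names, the regular-expression tokenizer (which silently skips characters it does not recognize), the use of loops in place of the left-recursive rules, and the reading of DIV and REM as integer quotient and remainder.

```python
import re


def tokenize(text):
    """Split text into digit sequences, DIV, REM, and + - * ( )."""
    return re.findall(r"\d+|DIV|REM|[-+*()]", text)


def parse_expr(tokens, pos):
    """expr -> expr add-op term | term (left recursion done as a loop)."""
    value, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("+", "-"):
        op = tokens[pos]
        right, pos = parse_term(tokens, pos + 1)
        value = value + right if op == "+" else value - right
    return value, pos


def parse_term(tokens, pos):
    """term -> term mult-op factor | factor (again as a loop)."""
    value, pos = parse_factor(tokens, pos)
    while pos < len(tokens) and tokens[pos] in ("*", "DIV", "REM"):
        op = tokens[pos]
        right, pos = parse_factor(tokens, pos + 1)
        if op == "*":
            value = value * right
        elif op == "DIV":
            value = value // right
        else:
            value = value % right
    return value, pos


def parse_factor(tokens, pos):
    """factor -> ( expr ) | digit-seq."""
    if pos >= len(tokens):
        raise ValueError("unexpected end of input")
    token = tokens[pos]
    if token == "(":
        value, pos = parse_expr(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise ValueError("expected ')'")
        return value, pos + 1
    if token.isdigit():
        return int(token), pos + 1
    raise ValueError(f"unexpected token {token!r}")


def evaluate(text):
    """Evaluate a whole expression; all tokens must be consumed."""
    tokens = tokenize(text)
    value, pos = parse_expr(tokens, 0)
    if pos != len(tokens):
        raise ValueError("unexpected trailing input")
    return value


print(evaluate("4+6*2"))    # 16
print(evaluate("(4+6)*2"))  # 20
```

Because term sits below expr in the grammar, the multiplication in 4+6*2 is grouped first, while the parentheses in (4+6)*2 force the addition into a single factor.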
A Simpler CFG for Expressions
expr → expr op expr | ( expr ) | digit-seq
op → + | - | * | DIV | REM
digit-seq → digit digit-seq | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
One Derivation Tree for 4+6*2
(children are indented under their parent and listed left to right)
expr
  expr
    digit-seq
      digit
        4
  op
    +
  expr
    expr
      digit-seq
        digit
          6
    op
      *
    expr
      digit-seq
        digit
          2
Another Derivation Tree for 4+6*2
(children are indented under their parent and listed left to right)
expr
  expr
    expr
      digit-seq
        digit
          4
    op
      +
    expr
      digit-seq
        digit
          6
  op
    *
  expr
    digit-seq
      digit
        2
Ambiguity
• The second (simpler) CFG for arithmetic
expressions is ambiguous because some
strings in the language of the CFG have
more than one derivation tree
• As is often the case, ambiguity is bad
– If you want to use the derivation tree as the
basis for evaluating the expression, only one
of the derivation trees shown above results in
the right answer (which one?)
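The effect of ambiguity on evaluation can be checked with a few lines of Python. In this sketch the two derivation trees for 4+6*2 are reduced to their expr/op structure as nested tuples; the names TREE_ONE, TREE_TWO, and evaluate are hypothetical, and only + and * are handled because those are the only operators in the example.

```python
# Each tree is either an int (a digit-seq) or a tuple (left, op, right).
TREE_ONE = (4, "+", (6, "*", 2))  # the op at the root is +
TREE_TWO = ((4, "+", 6), "*", 2)  # the op at the root is *


def evaluate(tree):
    """Evaluate a tree bottom-up, following its structure exactly."""
    if isinstance(tree, int):
        return tree
    left, op, right = tree
    a, b = evaluate(left), evaluate(right)
    return a + b if op == "+" else a * b


print(evaluate(TREE_ONE))  # 16
print(evaluate(TREE_TWO))  # 20
```

Both trees yield the same string of terminal symbols, yet they evaluate to different values; under the usual convention that multiplication binds tighter than addition, 16 is the expected value of 4+6*2.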
Resources
• Wikipedia: Context-Free Grammar
– http://en.wikipedia.org/wiki/Context-free_grammar