Ambiguous Grammar

Syntax of Programming Languages
Grammar
Author: 陳鍾誠
Affiliation: Kinmen Institute of Technology, Department of Information Management
Email: [email protected]
URL: http://ccc.kmit.edu.tw
Date: 2017/7/14
Grammar
Language
Recursive Definition
Mathematical Expression
Structure of Expressions
Formal Language
Backus Naur Form (BNF)
1960 by J. Backus and P. Naur
EBNF (Extended BNF)
BNF → EBNF
BNF
EBNF
Formalism
(Formal notation)

N. Chomsky

Father of modern structural linguistics
Differing structural trees
for the same expression
Problem of Different
structural trees
No Ambiguous Sentence
Context Free Language




Syntactic equations of the form defined in EBNF generate context-free languages.
The term "context-free" is due to Chomsky and stems from the fact
that substitution of the symbol left of = by a sequence derived from
the expression to the right of = is always permitted, regardless of the
context in which the symbol is embedded within the sentence.
It has turned out that this restriction to context freedom (in the sense
of Chomsky) is quite acceptable for programming languages, and
that it is even desirable.
Context dependence in another sense, however, is indispensable.
We will return to this topic in Chapter 8.
Regular Expression

A language is regular, if its syntax can be
expressed by a single EBNF expression.

The requirement that a single equation
suffices also implies that only terminal
symbols occur in the expression.

Such an expression is called a regular
expression.
Syntax Analysis vs.
Regular Expression

The reason for our interest in regular languages
lies in the fact that programs for the recognition
of regular sentences are particularly simple and
efficient. By "recognition" we mean the
determination of the structure of the sentence,
and thereby naturally the determination of
whether the sentence is well formed, that is, it
belongs to the language. Sentence recognition
is called syntax analysis.
Regular Expression vs.
State Machine

For the recognition of regular sentences a finite
automaton, also called a state machine, is necessary
and sufficient. In each step the state machine reads the
next symbol and changes state. The resulting state is
solely determined by the previous state and the symbol
read. If the resulting state is unique, the state machine is
deterministic, otherwise nondeterministic. If the state
machine is formulated as a program, the state is
represented by the current point of program execution.
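
A minimal sketch of such a state machine, assuming a regular language of identifiers (a letter followed by letters or digits). The state names and the helpers `classify` and `accepts` are hypothetical and not taken from the slides.

```python
# Deterministic finite automaton for identifiers: letter { letter | digit }.
# State names and helper functions are illustrative assumptions.

def classify(ch):
    """Map a character to the input class used by the transition table."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# (state, input class) -> next state; a missing entry means "reject".
TRANSITIONS = {
    ("start", "letter"): "in_ident",
    ("in_ident", "letter"): "in_ident",
    ("in_ident", "digit"): "in_ident",
}
ACCEPTING = {"in_ident"}

def accepts(text):
    """Return True if the DFA ends in an accepting state after reading text."""
    state = "start"
    for ch in text:
        # The next state depends only on the previous state and the symbol read.
        state = TRANSITIONS.get((state, classify(ch)))
        if state is None:      # no transition defined: the sentence is rejected
            return False
    return state in ACCEPTING

print(accepts("x1"), accepts("1x"), accepts(""))
```

Here the state is kept in an explicit variable and a table. The alternative mentioned above, where the state is the current point of program execution, would replace the table with nested loops and conditionals.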
EBNF → Program

The analyzing program can be derived directly from the
defining syntax in EBNF. For each EBNF construct K
there exists a translation rule which yields a program
fragment Pr(K). The translation rules from EBNF to
program text are shown below. Therein sym denotes a
global variable always representing the symbol last read
from the source text by a call to procedure next.
Procedure error terminates program execution, signaling
that the symbol sequence read so far does not belong to
the language.
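
The idea can be sketched for one concrete case. The EBNF expression `"x" { "y" } [ "z" ]` below is an assumed example, and the class `Recognizer` is hypothetical; only the roles of `sym`, `next` and `error` follow the description above. One deviation: `error` raises an exception instead of terminating the program, so the sketch can be tested.

```python
# Hand translation of the single EBNF expression  "x" { "y" } [ "z" ]
# into a recognizer. The expression itself is an illustrative assumption.

class ParseError(Exception):
    """Raised by error(): the symbols read so far are not in the language."""

class Recognizer:
    END = "<end>"                      # sentinel symbol after the last character

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.sym = None
        self.next()                    # load the first symbol

    def next(self):
        """Advance: sym always holds the symbol last read from the source."""
        if self.pos < len(self.text):
            self.sym = self.text[self.pos]
            self.pos += 1
        else:
            self.sym = self.END

    def error(self):
        raise ParseError(f"unexpected {self.sym!r} at position {self.pos}")

    def expect(self, terminal):
        # Terminal symbol: if sym matches, read the next symbol, else error.
        if self.sym == terminal:
            self.next()
        else:
            self.error()

    def parse(self):
        self.expect("x")               # "x"
        while self.sym == "y":         # { "y" }: loop while sym can start "y"
            self.expect("y")
        if self.sym == "z":            # [ "z" ]: optional part
            self.expect("z")
        if self.sym != self.END:       # nothing may follow the sentence
            self.error()

def recognizes(text):
    try:
        Recognizer(text).parse()
        return True
    except ParseError:
        return False

print(recognizes("xyyz"), recognizes("x"), recognizes("xzy"))
```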
Analyzing program
EBNF with only 1 rule
First()
Precondition
Lexical Analysis for
Identifier
Lexical Analysis for Integer
Scanner

The process of syntax analysis is based on a
procedure to obtain the next symbol. This
procedure in turn is based on the definition of
symbols in terms of sequences of one or
more characters. This latter procedure is
called a scanner, and syntax analysis on this
second, lower level, lexical analysis.
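
As a rough illustration of a scanner, the sketch below groups characters into identifiers, integers and single-character symbols. The symbol kinds (`ident`, `integer`, `op`) and the function name `scan` are assumptions made for this example, and ASCII input is assumed.

```python
# Minimal scanner: groups characters into symbols (identifiers, integers,
# single-character operators). The symbol kinds are illustrative assumptions.

def scan(text):
    """Yield (kind, value) pairs for each symbol found in text."""
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():                       # skip blanks between symbols
            i += 1
        elif ch.isalpha():                     # identifier = letter {letter | digit}
            start = i
            while i < len(text) and text[i].isalnum():
                i += 1
            yield ("ident", text[start:i])
        elif ch.isdigit():                     # integer = digit {digit}
            start = i
            while i < len(text) and text[i].isdigit():
                i += 1
            yield ("integer", int(text[start:i]))
        else:                                  # any other character is its own symbol
            i += 1
            yield ("op", ch)

print(list(scan("x1 = 42+y")))
```

The parser then asks the scanner for one symbol at a time and never looks at individual characters.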
Lexical Analysis vs.
Syntax Analysis
A Scanner Example

As an example we show a scanner for a
parser of EBNF. Its terminal symbols and
their definition in terms of characters are
Procedure GetSym() –(1)
Procedure GetSym() –(2)
Procedure GetSym() –(3)
Syntax Analysis Overview


Goal: determine if the input token stream satisfies the syntax of the program
What do we need to do this?
- An expressive way to describe the syntax
- A mechanism that determines if the input token stream satisfies the syntax description
For lexical analysis
- Regular expressions describe tokens
- Finite automata = mechanisms to generate tokens from input stream
Just Use Regular
Expressions?

REs can expressively describe tokens


Easy to implement via DFAs
So just use them to describe the syntax of a
programming language


NO! They don't have enough power to express any non-trivial syntax
Example: nested constructs (blocks, expressions, statements)
- Detect balanced braces: {{} {} {{} { }}}
- We need unbounded counting!
- FSAs cannot count except in a strictly modulo fashion
[Figure: deeply nested braces, { { { { { ... } } } } }]
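
The need for unbounded counting can be made concrete with a small sketch. The function name `balanced` is hypothetical; the point is that `depth` is an integer that may grow without limit, which a finite set of states cannot represent.

```python
def balanced(text):
    """Check that every '{' has a matching '}' using an unbounded counter."""
    depth = 0
    for ch in text:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth < 0:          # a closing brace with no open partner
                return False
    return depth == 0              # all opened braces must be closed

print(balanced("{{} {} {{} { }}}"), balanced("{{}"), balanced("}{"))
```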
Context-Free Grammars

Consist of 4 components:




Terminal symbols = token or ε
Non-terminal symbols = syntactic variables
Start symbol S = special non-terminal
Productions of the form LHS → RHS
- LHS = single non-terminal
- RHS = string of terminals and non-terminals
- Specify how non-terminals may be expanded
Example:
S → a S a
S → T
T → b T b
T → ε
Language generated by a grammar is the set of
strings of terminals derived from the start symbol by
repeatedly applying the productions

L(G) = language generated by grammar G
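
To illustrate L(G), the sketch below enumerates the short sentences of the example grammar S → a S a | T, T → b T b | ε by repeatedly applying productions. The encoding (a dict of strings, upper-case letters as non-terminals) and the function name `language` are assumptions for this example; the pruning by terminal count is enough for this grammar but not for arbitrary ones.

```python
from collections import deque

# Grammar from the example: S -> a S a | T ;  T -> b T b | ε.
# Upper-case letters are non-terminals, lower-case letters are terminals.
GRAMMAR = {
    "S": ["aSa", "T"],
    "T": ["bTb", ""],          # "" stands for ε
}

def language(grammar, start, max_len):
    """Return all strings of L(G) with at most max_len terminals."""
    results, seen = set(), {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Find the leftmost non-terminal, if any.
        idx = next((i for i, c in enumerate(form) if c in grammar), None)
        if idx is None:                      # only terminals left: a sentence
            results.add(form)
            continue
        for rhs in grammar[form[idx]]:
            new = form[:idx] + rhs + form[idx + 1:]
            terminals = sum(1 for c in new if c not in grammar)
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(results, key=lambda s: (len(s), s))

print(language(GRAMMAR, "S", 4))
```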
CFG - Example

Grammar for balanced-parentheses language
S → (S)S
S → ε
? Why is the final S required?
1 non-terminal: S
2 terminals: “(”, “)”
Start symbol: S
2 productions
If grammar accepts a string, there is a derivation of that string using the productions
“(())”
S = (S) ε = ((S) S) ε = (() ε) ε = (())
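
A small recognizer written directly from the two productions may help here. The function names `parse_S` and `accepts` are hypothetical, and the recursive style is one possible choice, not the slides' method.

```python
def parse_S(text, pos=0):
    """Apply S -> ( S ) S | ε at pos; return the position after the match."""
    if pos < len(text) and text[pos] == "(":
        pos = parse_S(text, pos + 1)              # inner S
        if pos >= len(text) or text[pos] != ")":
            raise ValueError(f"expected ')' at position {pos}")
        return parse_S(text, pos + 1)             # the final S: whatever follows
    return pos                                    # S -> ε

def accepts(text):
    try:
        return parse_S(text) == len(text)
    except ValueError:
        return False

print(accepts("(())"), accepts("()()"), accepts("(()"))
```

The second call hints at the role of the final S: it is what lets one balanced group follow another, as in ()().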
More on CFGs

Shorthand notation: vertical bar for multiple productions
S → a S a | T
T → b T b | ε
CFGs are powerful enough to express the syntax of
most programming languages
Derivation = successive application of productions
starting from S
Acceptance? = Determine if there is a derivation for an
input token stream




A Parser
[Diagram: a context-free grammar G and a token stream s (from the lexer) go into the Parser; the Parser answers "Yes, if s in L(G)" or "No, otherwise", and emits error messages]
Syntax analyzers (parsers) = CFG acceptors which also
output the corresponding derivation when the token stream
is accepted
Various kinds: LL(k), LR(k), SLR, LALR
RE is a Subset of CFG
Can inductively build a grammar for each RE

Regular expression    Grammar
ε                     S → ε
a                     S → a
R1 R2                 S → S1 S2
R1 | R2               S → S1 | S2
R1*                   S → S1 S | ε
Where
G1 = grammar for R1, with start symbol S1
G2 = grammar for R2, with start symbol S2
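
The inductive construction can be sketched as a recursive function. The tuple encoding of regular expressions (`"eps"`, `"sym"`, `"cat"`, `"alt"`, `"star"`) and the generated names S1, S2, ... are assumptions made for this example.

```python
import itertools

def re_to_cfg(regex):
    """Return (start_symbol, productions) for a regex given as nested tuples."""
    counter = itertools.count(1)
    productions = []                       # list of (lhs, rhs tuple)

    def build(node):
        s = f"S{next(counter)}"            # fresh start symbol for this sub-expression
        kind = node[0]
        if kind == "eps":                  # ε      : S -> ε
            productions.append((s, ()))
        elif kind == "sym":                # a      : S -> a
            productions.append((s, (node[1],)))
        elif kind == "cat":                # R1 R2  : S -> S1 S2
            s1, s2 = build(node[1]), build(node[2])
            productions.append((s, (s1, s2)))
        elif kind == "alt":                # R1|R2  : S -> S1 | S2
            s1, s2 = build(node[1]), build(node[2])
            productions.append((s, (s1,)))
            productions.append((s, (s2,)))
        elif kind == "star":               # R1*    : S -> S1 S | ε
            s1 = build(node[1])
            productions.append((s, (s1, s)))
            productions.append((s, ()))
        else:
            raise ValueError(f"unknown node kind: {kind}")
        return s

    return build(regex), productions

# (a|b)* written as a tuple tree
start, prods = re_to_cfg(("star", ("alt", ("sym", "a"), ("sym", "b"))))
for lhs, rhs in prods:
    print(lhs, "->", " ".join(rhs) or "ε")
```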
Grammar for Sum
Expression

Grammar
S → E + S | E
E → number | (S)
Expanded
S → E + S
S → E
E → number
E → (S)
4 productions
2 non-terminals (S, E)
4 terminals: “(”, “)”, “+”, number
start symbol: S
Constructing a Derivation



Start from S (the start symbol)
Use productions to derive a sequence of
tokens
For arbitrary strings α, β, γ and for a
production: A → β
A single step of the derivation is
α A γ ⇒ α β γ (substitute β for A)
Example
S → E + S
(S + E) + E ⇒ (E + S + E) + E
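
A single derivation step is just a list substitution. In the sketch below, sentential forms are lists of symbols and `derive_step` is a hypothetical helper name.

```python
def derive_step(form, index, production):
    """Rewrite the non-terminal at form[index] using production (A, β)."""
    lhs, rhs = production
    if form[index] != lhs:
        raise ValueError(f"symbol at {index} is {form[index]!r}, not {lhs!r}")
    # α A γ  =>  α β γ
    return form[:index] + rhs + form[index + 1:]

# Sentential forms are lists of symbols; S and E are the non-terminals.
form = ["(", "S", "+", "E", ")", "+", "E"]
form = derive_step(form, 1, ("S", ["E", "+", "S"]))
print(" ".join(form))
```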
Class Problem



S → E + S | E
E → number | (S)
Derive: (1 + 2 + (3 + 4)) + 5
Parse Tree
[Figure: parse tree of (1 + 2 + (3 + 4)) + 5]
• Parse tree = tree representation of the derivation
• Leaves of the tree are terminals
• Internal nodes are non-terminals
• No information about the order of the derivation steps
Parse Tree vs Abstract
Syntax Tree
Parse tree also called “concrete syntax”
AST discards (abstracts) unneeded information: more compact format
[Figure: parse tree and AST of (1 + 2 + (3 + 4)) + 5, side by side]
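
The difference between the two trees can be seen in a small parser for S → E + S | E, E → number | (S) that returns both. The tuple encodings and function names are assumptions for this sketch, and the tokenizer simply ignores characters it does not know.

```python
import re

def tokenize(text):
    return re.findall(r"\d+|[()+]", text)

def parse_S(tokens, pos):
    """S -> E + S | E. Returns (parse_tree, ast, next_pos)."""
    e_tree, e_ast, pos = parse_E(tokens, pos)
    if pos < len(tokens) and tokens[pos] == "+":
        s_tree, s_ast, pos = parse_S(tokens, pos + 1)
        return ("S", e_tree, "+", s_tree), ("+", e_ast, s_ast), pos
    return ("S", e_tree), e_ast, pos

def parse_E(tokens, pos):
    """E -> number | ( S )."""
    if pos < len(tokens) and tokens[pos] == "(":
        s_tree, s_ast, pos = parse_S(tokens, pos + 1)
        if pos >= len(tokens) or tokens[pos] != ")":
            raise SyntaxError("expected ')'")
        # The parse tree keeps the parentheses; the AST drops them.
        return ("E", "(", s_tree, ")"), s_ast, pos + 1
    if pos < len(tokens) and tokens[pos].isdigit():
        return ("E", tokens[pos]), int(tokens[pos]), pos + 1
    raise SyntaxError(f"unexpected token at position {pos}")

def parse(text):
    tokens = tokenize(text)
    tree, ast, pos = parse_S(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree, ast

tree, ast = parse("(1 + 2 + (3 + 4)) + 5")
print(ast)
```

The parse tree records every S and E node and every parenthesis, while the AST keeps only operators and numbers.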
Derivation Order



Can choose to apply productions in any order: select a
non-terminal and substitute the RHS of a production
Two standard orders: leftmost and rightmost
Leftmost derivation
- In the string, find the leftmost non-terminal and apply a
production to it
- E + S ⇒ 1 + S
Rightmost derivation
- Same, but find rightmost non-terminal
- E + S ⇒ E + E + S
Leftmost/Rightmost
Derivation Examples
» S → E + S | E
» E → number | (S)
» Leftmost derive: (1 + 2 + (3 + 4)) + 5
S ⇒ E+S ⇒ (S)+S ⇒ (E+S)+S ⇒ (1+S)+S ⇒ (1+E+S)+S ⇒
(1+2+S)+S ⇒ (1+2+E)+S ⇒ (1+2+(S))+S ⇒ (1+2+(E+S))+S ⇒
(1+2+(3+S))+S ⇒ (1+2+(3+E))+S ⇒ (1+2+(3+4))+S ⇒
(1+2+(3+4))+E ⇒ (1+2+(3+4))+5
» Now, rightmost derive the same input string
S ⇒ E+S ⇒ E+E ⇒ E+5 ⇒ (S)+5 ⇒ (E+S)+5 ⇒ (E+E+S)+5 ⇒
(E+E+E)+5 ⇒ (E+E+(S))+5 ⇒ (E+E+(E+S))+5 ⇒
(E+E+(E+E))+5 ⇒ (E+E+(E+4))+5 ⇒ (E+E+(3+4))+5 ⇒
(E+2+(3+4))+5 ⇒ (1+2+(3+4))+5
Result: Same parse tree: same productions chosen, but in a different order
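
That one parse tree yields both derivation orders can be checked mechanically. The sketch below walks a hand-built parse tree of the shorter input 1+2 (an assumed example); `derivation` and `render` are hypothetical helper names.

```python
def derivation(tree, leftmost=True):
    """List the sentential forms obtained by expanding tree in the given order."""
    form = [tree]                                   # nodes are tuples, terminals are str
    steps = [render(form)]
    while any(isinstance(n, tuple) for n in form):
        idxs = [i for i, n in enumerate(form) if isinstance(n, tuple)]
        i = idxs[0] if leftmost else idxs[-1]       # pick leftmost or rightmost non-terminal
        form[i:i + 1] = form[i][1:]                 # replace the node by its children
        steps.append(render(form))
    return steps

def render(form):
    return "".join(n[0] if isinstance(n, tuple) else n for n in form)

# Parse tree of 1+2 under S -> E + S | E, E -> number | (S)
TREE = ("S", ("E", "1"), "+", ("S", ("E", "2")))

print(" => ".join(derivation(TREE, leftmost=True)))
print(" => ".join(derivation(TREE, leftmost=False)))
```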
Class Problem



S → E + S | E
E → number | (S) | -S
Do the rightmost derivation of : 1 + (2 + -(3 + 4)) + 5
Ambiguous Grammars


In the sum expression grammar, leftmost and
rightmost derivations produced identical
parse trees
+ operator associates to the right in parse
tree regardless of derivation order
[Figure: tree for (1+2+(3+4))+5, with each + nesting to the right]
An Ambiguous Grammar


+ associates to the right because of the right-recursive production: S → E + S
Consider another grammar


S → S + S | S * S | number
Ambiguous grammar = different derivations
produce different parse trees

More specifically, G is ambiguous if there are 2
distinct leftmost (rightmost) derivations for some
sentence
Ambiguous Grammar Example
S → S + S | S * S | number
Consider the expression: 1 + 2 * 3
Derivation 1: S ⇒ S+S ⇒ 1+S ⇒ 1+S*S ⇒ 1+2*S ⇒ 1+2*3
Derivation 2: S ⇒ S*S ⇒ S+S*S ⇒ 1+S*S ⇒ 1+2*S ⇒ 1+2*3
[Figure: the two parse trees, one with + at the root and * below it, the other with * at the root and + below it]
Obviously not equal!
Impact of Ambiguity


Different parse trees correspond to different
evaluations!
Thus, program meaning is not defined!!
[Figure: the two trees for 1 + 2 * 3; 1 + (2 * 3) = 7 and (1 + 2) * 3 = 9]
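
The two evaluations can be reproduced by enumerating every parse tree the ambiguous grammar S → S + S | S * S | number allows for a token list. The function names `trees` and `evaluate` are hypothetical, and the input is assumed to alternate numbers and operators.

```python
def trees(tokens):
    """All parse trees of tokens under S -> S + S | S * S | number."""
    if len(tokens) == 1:
        return [int(tokens[0])]
    result = []
    for i in range(1, len(tokens), 2):          # operators sit at odd positions
        for left in trees(tokens[:i]):
            for right in trees(tokens[i + 1:]):
                result.append((tokens[i], left, right))
    return result

def evaluate(tree):
    if isinstance(tree, int):
        return tree
    op, left, right = tree
    l, r = evaluate(left), evaluate(right)
    return l + r if op == "+" else l * r

for t in trees(["1", "+", "2", "*", "3"]):
    print(t, "=", evaluate(t))
```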
Can We Get Rid of
Ambiguity?



Ambiguity is a function of the grammar, not the
language!
A context-free language L is inherently
ambiguous if all grammars for L are
ambiguous
Every deterministic CFL has an unambiguous
grammar



So, no deterministic CFL is inherently ambiguous
No inherently ambiguous programming languages
have been invented
To construct a useful parser, must devise an
unambiguous grammar
Eliminating Ambiguity

Often can eliminate ambiguity by adding
nonterminals and allowing recursion only on
right or left




S → S + T | T
T → T * num | num
T non-terminal enforces precedence
Left-recursion; left associativity
[Figure: parse tree of 1 + 2 * 3 under this grammar, with 2 * 3 grouped under T]
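
A sketch of a parser for this unambiguous grammar follows. Because a recursive-descent procedure cannot call itself on a left-recursive rule without looping forever, each left-recursive production is written as a loop here; this is an implementation choice of the sketch, and the names `parse_sum`, `term` and `total` are hypothetical.

```python
import re

def parse_sum(text):
    """Parse text with S -> S + T | T and T -> T * num | num; return a tree."""
    tokens = re.findall(r"\d+|[+*]", text)
    pos = 0

    def num():
        nonlocal pos
        if pos >= len(tokens) or not tokens[pos].isdigit():
            raise SyntaxError(f"number expected at token {pos}")
        pos += 1
        return int(tokens[pos - 1])

    def term():                        # T -> T * num | num, written as a loop
        nonlocal pos
        node = num()
        while pos < len(tokens) and tokens[pos] == "*":
            pos += 1
            node = ("*", node, num())  # previous result becomes the LEFT child
        return node

    def total():                       # S -> S + T | T, written as a loop
        nonlocal pos
        node = term()
        while pos < len(tokens) and tokens[pos] == "+":
            pos += 1
            node = ("+", node, term())
        return node

    tree = total()
    if pos != len(tokens):
        raise SyntaxError("trailing input")
    return tree

print(parse_sum("1 + 2 * 3"))
print(parse_sum("1 + 2 + 3"))
```

The first tree groups 2 * 3 below the + (precedence), and the second nests to the left (left associativity).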
A Closer Look at
Eliminating Ambiguity

Precedence enforced by



Introduce distinct non-terminals for each
precedence level
Operators for a given precedence level are
specified as RHS for the production
Higher precedence operators are accessed by
referencing the next-higher precedence non-terminal
Associativity

An operator is either left, right or non
associative




Left:  a + b + c = (a + b) + c
Right: a ^ b ^ c = a ^ (b ^ c)
Non:   a < b < c is illegal (thus undefined)
Position of the recursion relative to the
operator dictates the associativity


Left (right) recursion → left (right) associativity
Non: Don’t be recursive, simply reference next
higher precedence non-terminal on both sides
of operator
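
The effect of associativity on the result is easy to see numerically. The sketch below uses subtraction for the left-associative case (an assumed example, since a + b + c gives the same value either way) and exponentiation for the right-associative case; `fold_left` and `fold_right` are hypothetical helper names.

```python
from functools import reduce

def fold_left(op, operands):
    """Left associative grouping: ((a op b) op c) ..."""
    return reduce(op, operands)

def fold_right(op, operands):
    """Right associative grouping: a op (b op (c ...))"""
    return reduce(lambda acc, x: op(x, acc), reversed(operands))

sub = lambda a, b: a - b
power = lambda a, b: a ** b

# (10 - 4) - 3 versus 10 - (4 - 3)
print(fold_left(sub, [10, 4, 3]), fold_right(sub, [10, 4, 3]))
# (2 ^ 3) ^ 2 versus 2 ^ (3 ^ 2)
print(fold_left(power, [2, 3, 2]), fold_right(power, [2, 3, 2]))
```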
Class Problem (Tough)
S → S + S | S - S | S * S | S / S | (S) | -S | S ^ S | number
Enforce the standard arithmetic precedence rules and remove
all ambiguity from the above grammar
Precedence (high to low)
(), unary -
^
*, /
+, -
Associativity
^ = right
rest are left