Parsing
Discrete Mathematics and
Its Applications
Baojian Hua
[email protected]
Derivations
A string is valid in a language if and only if there exists a derivation from the start symbol that produces it
Begin with the start symbol, and apply grammar rules until you produce the string
Note that the final string (sentence) consists of only terminals
Question


Given a formal grammar G and a
sentence (program) p, is p derivable
from grammar G ?
Or equivalently, is a given program p
valid according to some language’s
syntax (say C)?
Example:
Context-Free Grammar
S ::= x A
| y B
A ::= u C
| v C
B ::= t
C ::= w
| z
// derivable?
xum
xuwz
xwu
xuz
Lexical Analyzer
The lexical analyzer translates the source program into a stream of lexical tokens
Source program: stream of (ASCII or Unicode) characters
Lexical token: compiler data structure that represents the occurrence of a terminal symbol
A valid sentence consists of only allowable terminals
Example:
Context-Free Grammar
S ::= x A
| y B
A ::= u C
| v C
B ::= t
C ::= w
| z
// all terminals
T={x, y, u, v, t, w, z}
// allowable strings
T*
Predictive Parsing
Parsing: recognizing a string and doing something useful with it
The most naïve approach to implementing a parser is recursive descent
A form of top-down parsing
Not as powerful as other methods, but easy enough to implement by hand
Predictive Parsing
S ::= x A
| y B
A ::= u C
| v C
B ::= t
C ::= w
| z
// Valid?
xum
xuwz
xwu
xuz
A Predictive Parser in C
(Sketch)
tokenTy token;   // current lookahead token

void parseS ()
{
  switch (token.kind)
  {
    // S ::= x A
    case x: token = nextToken (); parseA ();
            break;
    // S ::= y B
    case y: token = nextToken (); parseB ();
            break;
    default: error (…);
  }
}
// other functions are similar
Output:
Abstract Syntax Tree
xuz
    S
   / \
  x   A
     / \
    u   C
        |
        z
A Predictive Parser Emitting
AST in C (Sketch)
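tokenTy token;   // current lookahead token

S parseS ()
{
  A a;  B b;     // subtrees; types A and B assumed by analogy with S
  switch (token.kind)
  {
    // S ::= x A
    case x: token = nextToken (); a = parseA ();
            return newS1 (x, a);
    // S ::= y B
    case y: token = nextToken (); b = parseB ();
            return newS2 (y, b);
    default: error (…);
  }
}
// other functions are similar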
Predictive Parsing Difficulties
S ::= x A
| x B
A ::= u C
| v C
B ::= t
C ::= w
| z
// derivable?
xuz
Or Even Worse
1 E ::= id
2     | num
3     | E + E
4     | E * E
5     | ( E )

15*(3+4)
E
By 4 => E * E
By 5 => E * (E + E)
By 2 => E * (E + 4)
By 2 => E * (3 + 4)
By 2 => 15 * (3 + 4)
Or Even Worse
15*(3+4)

rightmost derivation    leftmost derivation
E                       E
E * E                   E * E
E * (E + E)             15 * E
E * (E + 4)             15 * (E + E)
E * (3 + 4)             15 * (3 + E)
15 * (3 + 4)            15 * (3 + 4)
Ambiguous grammars
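A grammar is ambiguous if there is a sentence with more than one parse tree

15 * 3 + 4

Parse tree 1, read as (15 * 3) + 4:
        E
      / | \
     E  +  E
   / | \   |
  E  *  E  4
  |     |
  15    3

Parse tree 2, read as 15 * (3 + 4):
      E
    / | \
   E  *  E
   |   / | \
   15 E  +  E
      |     |
      3     4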
Eliminating ambiguity
In programming language syntax, ambiguity often arises from missing operator precedence or associativity
  Does * have higher precedence than +?
  Are * and + left associative?
We can sometimes rewrite the grammar to remove the ambiguity
  Beyond the scope of this course
Unambiguous Grammar
Ambiguous:
E ::= id
    | num
    | E + E
    | E * E
    | ( E )

Unambiguous:
E ::= E + T
    | T
T ::= T * F
    | F
F ::= id
    | num
    | ( E )

Accepts the same language, but parses unambiguously
Limitations with Predictive
Parsing
Rewriting the grammar to resolve ambiguity:
  Grammars/trees are ugly
But… easy to write code by hand, and very good for error reporting
Doing better
We can do better
We can use a parsing algorithm that handles a much larger class of context-free grammars than predictive parsing
  (though still not all context-free grammars)
Remember: a context-free language might have many different context-free grammars
The Yacc Tool
semantic analyzer specification -> Yacc -> parser
Originally developed for C; now almost every mainstream language has its own Yacc-like tool:
bison (C), ml-yacc (SML), Cup (Java), GPPG (C#), …
Whole Structure
source code -> lexical analyzer -> tokens -> parser -> abstract syntax tree -> other part -> Pentium