CS3012: Formal Languages and Compilers

CS3012 Formal Languages
Course Notes
UNIVERSITY OF ABERDEEN
Department of Computing Science
CS3012: Formal Languages & Compilers
Course Notes 2006
Frank Guerin [1]
31/07/2017
About CS3012
Formal languages underlie all of Computing
Science: if a language is not formally defined it
is difficult to use a computer to process it in a
consistent manner. This course provides an
introduction to how formal languages may be
defined and describes how computers may be
used to manipulate such languages, with
particular reference to compiling programming
languages. You will gain practical experience of
tools that are widely used in industrial
applications to generate parsers and lexical
analysers.
Attendance
You are expected to attend all meetings of
CS3012. The course introduces many formal
concepts, and builds upon them each week. The
only way to do well on this course is to work
steadily throughout the term, building up your
familiarity with the different formalisms. If you
can't do the exercises in the tutorials, you will
not be able to understand the lectures. You will
not be able to cram in all the information in the
weeks before the exam if you have not been
working steadily. Attendance will be taken at
tutorials. If you are not attending at least 75% of
the tutorial classes, you will be reported to the
Senate Office as being "at risk".
Motivation for the Course
A compiler takes a program we have written in a
particular language, and converts it into another
format ready to be executed on a particular
computer. In order to write efficient programs,
we must understand how a compiler works. The
basic ideas of compiling are used in many
different areas in computing, including user
interfaces, software design, and intelligent
agents.
Before we can write a compiler, we must establish the language that we will be compiling. That is, we
have to determine what programs are valid. Then, we must be able to recognise valid programs, and
process them. We face similar problems understanding natural language. For example, which of the
following five examples are proper sentences?
1. large red cars go quickly,
2. Large red cars go quickly.
3. Colourless green ideas sleep furiously.
4. Go cars large red quickly.
5. Coches rojos grandes marchan rapidamente.
Even if we can't state clearly which of these are valid sentences, we can still make a good guess at
what some of them mean. Unfortunately, we can't get away with hand-waving like this when we are
dealing with computer programs - we have to be precise and unambiguous, and the only effective way
we have of interpreting programs requires us to be very strict about what is and what is not a valid
program. We will then use this strict specification to help us recognise the structure of a program.
Once we have recognised the structure, we can start worrying about what it means, and what actions
we have to take to execute the program as expected.
To specify the valid programs, we need the concept of a formal language - languages where the
programs (or sentences) that are valid are defined solely in terms of the form (or shape, or
structure) of the program. The course will begin by looking at some simple ways to define a formal
language, and some algorithms for recognising sentences in those languages. We will see how to define
simple examples, and we will investigate how powerful the methods are. We will then look at more
powerful methods, which will allow us to specify programming languages, and we will see how to
recognise whether or not a program is valid.
Once we have recognised a valid program, we have to start translating it into an executable form.
First, we will check that the program is meaningful - i.e. the instructions make sense, and obey the
conventions of the language. We will then start translating the original instructions into a
different format, and we will show how to create the necessary structures in memory. We will also
look at the runtime environment, showing how different programming language styles manipulate the
memory of the computer as programs are executed. We will use the standard UNIX tools of Lex and Yacc
to build parts of a working compiler. A schematic model of a compiler is shown below:
[Figure: schematic model of a compiler. Phases: source, lexical analysis, syntax analysis, semantic
analysis, intermediate code generation, code optimisation, code generation, target. Also shown:
symbol table, error handling, error messages.]
[1] These notes are mostly from Ken Brown's original notes.
Course Content
basic formal language theory
1. Alphabets, Strings, Languages and Machines
2. Finite State Automata
3. Regular Expressions and Regular Languages
4. Finite State Automata and Regular Languages
5. Finite State Automata with Output
lexical analysis
6. Lex: A Lexical Analysis Tool
grammar theory
7. Languages and Grammars
8. Derivations and Ambiguity
9. Parsing
compilation
10. Yacc: A Parser Generator
11. Error Handling
12. Syntax Directed Translation
13. Symbol Table
14. Type Checking
15. Run-time Environment
16. Intermediate Code Generation
A note on the style of the course
This handout, together with the exercise sheets
from the problem classes, contains all of the
required material for the course. However, the
handout is concise, may be difficult to read, and
will not contain much in the way of discussion,
motivation or examples.
The exercises in the problem classes are
designed to test your understanding of the
material and to give you practice applying the
definitions and algorithms in solving problems.
In order to survive on the course, you are
strongly advised to attend the problem classes
and attempt the exercises. Much of this material
is formal and abstract, and without practice you
will quickly fall behind. Make sure you do
attempt the exercises for yourself, and don't wait
for the solutions - reading solutions is no
substitute for trying to solve problems. If there is
anything you don't understand in the exercises
(or the notes), or you can't see how to generate
the solutions for yourself, then ask for help. Do
not sit in silence hiding the fact that you are
struggling.
The lectures will consist of slides, spoken
material, and additional examples given on the
blackboard. In order to understand the
algorithms and the reasons for studying the
material, you will need to attend the lectures and
take notes to supplement the handout. This is
your responsibility. If there is anything you do
not understand during the lectures, then ask,
either during or after the lecture. If the lectures
are covering the material too quickly, then say
so. If there is anything you do not understand in
the handout, then ask, either at a lecture or in the
problem classes.
There are also continuous assessment exercises.
These put into practice the theory you will have
learned on the course. They are part of the
formal assessment for the course, so it is
obviously important that you make serious
attempts at solving them. If you do not complete
the assessments, you risk failing the course.
The textbooks contain additional material, and
may motivate the material in different ways.
They are useful background to the course, and
may help you understand it.
Finite State Automata
1. Alphabets, Strings and Languages
Definitions 1.0: Set and Mathematical Notation
We need to define some mathematical notation before we start, so that we can talk about things concisely,
without having to write a paragraph every time.
A ⊆ B        A is a subset of B - every element of A is also in B
A ∪ B        the union of A and B - a set containing every element in either A or B
A ∩ B        the intersection of A and B - a set containing only those elements in both A and B
∃            there exists
∀            for all
2^P          the set of all subsets of P
∈            is an element of
∉            is not an element of
A\B          A minus B: all elements of A, except those that are also in B
A × B        the set of all pairs of elements, where the first is an element of A and the second is an
             element of B
∅            the empty set - i.e. the set with no members
¬            not
{ x | p(x) } the set of all elements x such that the sentence p(x) is true
             Example: {x | x ∈ Z, x > 0, x < 10} means the set of all integers strictly between 0 and
             10, and is the same set as {1,2,3,4,5,6,7,8,9}
f : X → Y :: x ↦ y
             f is a function that maps elements of set X to elements of set Y, and maps a
             particular element x to a particular element y
             Example: f : N → N :: x ↦ x^2 is the "square" function for positive integers. It may
             also be written f(x) = x^2. Thus f(2) = 4, f(3) = 9, etc.
Definitions 1.1
A symbol is a basic unit.
An alphabet is a finite set of symbols.
A string over an alphabet T is a finite sequence of symbols from T. This may be shortened to T-string, or, if the context is clear or unimportant, simply string.
Example 1.2
If T = {a, b, c, d} is an alphabet, then abd, aaaa and abaabc are T-strings.

Definitions 1.3
The empty string is the string with no symbols, denoted ε.
The length of a string w is the number of symbols in the sequence, denoted |w|.
Two strings, w and v, are equal if they have exactly the same sequence of symbols, denoted w = v.
The concatenation of two strings, w and v, is the string consisting of the sequence of symbols in w
followed by the sequence of symbols in v, denoted wv.
Note: concatenation is not commutative - vw and wv need not be equal - but is associative - (uv)w =
u(vw).
Example 1.4
If w = abb and v = bab then wv = abbbab and wε = abb.

Definitions 1.5
A string u is a substring of w if there exist strings x and y such that (s.t.) w = xuy.
If u is a substring of w as above, and x = ε, then u is a prefix of w. If u ≠ w, then u is a proper prefix.
If u is a substring of w as above, and y = ε, then u is a suffix of w. If u ≠ w, then u is a proper suffix.
Note that ε is a substring of every string.
Example 1.6
ba is both a prefix and a suffix of babba.

Definition 1.7
If T is an alphabet, then T* is the set of all strings over T.
Example 1.8
T = {a,b} and T* = {ε, a, b, aa, ab, ba, bb, aaa, aab, ...}

Definitions 1.9
T+ is T* without ε.
If a is a symbol, then
(i) a^n (n ≥ 0) is the string consisting of n a's. Note that a^n a^m = a^(n+m).
(ii) a* = {ε, a, aa, aaa, ...}
(iii) a+ = {a, aa, aaa, ...}
Definition 1.10
Language over T
A language over an alphabet T is a set of strings over T. This may be abbreviated to T-language, or
simply language.
Note that L is a T-language if and only if (iff) L ⊆ T*.
Example 1.11
If T = {a, b}, then {ε, ab, babba, bbbbbbb} is a T-language.

Definitions 1.12
Let A and B be languages over an alphabet T.
A+B (or A ∪ B) denotes the set union of A and B.
A ∩ B denotes the set intersection of A and B.
A' denotes the complement of A - i.e. all the strings in T* but not in A.
AB denotes the concatenation of A and B - all strings uv s.t. u ∈ A and v ∈ B.
Note that language concatenation is associative, but not commutative.
A^n denotes the concatenation of A with itself n times ( = AA...A). Note: A^0 = {ε}.
A* = A^0 + A^1 + A^2 + ... i.e. the set of all strings consisting of the concatenation of strings from A. This
operation is called the "Kleene closure". Note: A** = A*.
A+ = A^1 + A^2 + ...
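The operations of Definitions 1.12 can be tried out on small finite languages. The following is a minimal Python sketch, not part of the original notes: languages are modelled as Python sets of strings, the empty string "" stands for ε, and the function names `concat`, `power` and `star_up_to` are our own. Since A* is usually infinite, `star_up_to` only computes the finite part A^0 + A^1 + ... + A^n.

```python
def concat(A, B):
    """AB: every string uv with u in A and v in B."""
    return {u + v for u in A for v in B}

def power(A, n):
    """A^n: A concatenated with itself n times; A^0 = {ε}."""
    result = {""}                 # "" plays the role of ε
    for _ in range(n):
        result = concat(result, A)
    return result

def star_up_to(A, n):
    """Finite approximation of the Kleene closure: A^0 + A^1 + ... + A^n."""
    result = set()
    for k in range(n + 1):
        result |= power(A, k)
    return result

A = {"a", "ab"}
B = {"b"}
AB = concat(A, B)      # {"ab", "abb"}
BA = concat(B, A)      # {"ba", "bab"}: concatenation is not commutative
```

Comparing `AB` with `BA` illustrates the note that language concatenation is not commutative.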
Definitions 1.13
Let T be an alphabet with an ordering on its symbols. Say T = {t1, t2, t3, ...}. Strings over T can be
ordered in two ways:
Dictionary Order
All strings beginning t1 are ordered before strings beginning t2, and t2 before t3, etc. Within the group
of strings beginning t1, strings are ordered by the second symbol, etc. ε is always the first string.
Lexical Order
Strings are ordered by their length. Within each group of strings of the same length, strings are
ordered by dictionary order. Again, ε is the first string.
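Lexical order is the order in which the elements of T* were listed in Example 1.8. As an illustration (our own sketch, with the hypothetical name `lexical_order`), a Python generator can enumerate T* in this order; the alphabet is passed as a string whose characters are already in the chosen symbol ordering.

```python
from itertools import count, islice, product

def lexical_order(T):
    """Yield every string over the ordered alphabet T, shortest first.

    Within one length, itertools.product yields tuples in dictionary order
    with respect to the order of the symbols in T.
    """
    for n in count(0):                        # lengths 0, 1, 2, ...
        for symbols in product(T, repeat=n):
            yield "".join(symbols)            # n = 0 gives "", i.e. ε

first_seven = list(islice(lexical_order("ab"), 7))
```

Dictionary order cannot be enumerated in the same way over an alphabet with more than one symbol, because infinitely many strings (a, aa, aaa, ...) come before b.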
2. Finite State Automata
Definition 2.1 Finite State Automata
A Finite-State Automaton (FSA) is a quintuple (Q,I,F,T,E), where
Q is a finite set (whose elements are called states)
I is a subset of Q (whose elements are the initial states)
F is a subset of Q (whose elements are the final states)
T is an alphabet, and
E is a subset of Q × (T + ε) × Q (whose elements are called edges)
Essentially, a FSA is a labelled, directed graph - that is, it is a set of nodes with directed arcs between the
nodes, where arcs may have labels from an alphabet.
Notation: We will sketch a FSA as a graph, where the edges of the FSA are the arcs of the graph and the
states are the nodes (drawn as circles). The initial states will be drawn with a short incoming arrow, and
the final states will be drawn as double circles.
Example 2.2
The FSA A1:
Q = {1,2,3,4}, I = {1}
F = {4}, T = {a, b}
E = {(1,a,2), (1,b,4), (2,b,4),
(2,a,3), (3,a,3), (3,b,3),
(4,a,2), (4,b,4)}
can be sketched as shown:
[Figure: state diagram of A1, with one node for each state in Q and one labelled arc for each edge in E]

Example 2.3
The FSA of 2.2 (A1) can be interpreted as follows:
The machine starts in state "1". From there it can move either to state "2", by action labelled "a", or it
can move to state "4", by action labelled "b". From state "2" it can move to state "3", by action
labelled "a", or it can move to state "4", by action labelled "b". From state "3", it can stay in state "3"
by actions labelled "a" or "b". From state "4", it can move to state "2", by action labelled "a", or it can
stay in state "4", by action labelled "b". The machine can stop successfully in state "4".

Definitions 2.4
If (x,a,y) is an edge in a FSA, then x is the start state of the edge and y is the end state.
A path in a FSA is a sequence of edges, such that the end state of one is the start state of the next.
A cycle in a FSA is a path, such that the start state and the end state are the same.
A path is successful if its first state is an initial state, and its last state is a final state.
The label of a path is the sequence of labels of the edges in the path.
A string is accepted by a FSA if it is the label of a successful path. A string is rejected if it is not the
label of a successful path.
Definition 2.5 Language accepted by a FSA
The language accepted by a FSA, A, is the set of strings accepted by A. Denote the language L(A).
Example 2.6
Consider the FSA A1 (example 2.2):
(i) p1 = (2,b,4), (4,a,2), (2,a,3) is a path;
(ii) p2 = (2,b,4), (4,b,4), (4,a,2) is a cycle;
(iii) p3 = (1,b,4), (4,a,2), (2,b,4), (4,b,4) is a successful path;
(iv) The label of p1 = baa;
(v) babb is accepted by A1;
(vi) baa is rejected by A1;
(vii) A1 accepts the language of strings of a's and b's which end in a b, and in which no two a's are
adjacent.

Definition 2.7
The transition function of a FSA, A, is the function
δ : (x,t) ↦ {y : ∃ edge (x,t,y) in A}.
Definition 2.8
If A = (Q,I,F,T,E) is a FSA, then a transition matrix for A is a matrix which has one row for each
state in Q and one column for each symbol in T s.t. the entry in row q and column t is δ(q,t) (∈ 2^Q).
Notation: A transition matrix will be drawn as a table, labelling the rows and columns with states and
symbols. Each entry in the table will be the set of states as defined above, or will be left blank in the case
of the empty set. Additionally, rows corresponding to initial states will be labelled with an "in" arrow, and
final states with an "out" arrow.
Example 2.9
The transition matrix for A1 is:
           a      b
-> 1      {2}    {4}
   2      {3}    {4}
   3      {3}    {3}
   4 ->   {2}    {4}

Definition 2.10
A FSA, A, is non-deterministic if
(i) there are edges labelled with ε, or
(ii) there are two edges (x,t,y) and (x,t,z) in A s.t. y ≠ z, or
(iii) there is more than one initial state.
Conversely, if none of (i), (ii) or (iii) hold, then A is a deterministic FSA.
Non-deterministic and deterministic FSA's will be denoted NDFSA and DFSA respectively.
Example 2.11
A1 is a DFSA

Note: For a DFSA, every entry in the transition matrix is either a singleton set or the empty set.
Algorithm 2.12
Recognition Algorithm (DFSA)
Problem: Given a DFSA, A = (Q,I,F,T,E) and a string w, determine whether w ∈ L(A).
begin
  Add symbol # to end of w
  q := initial state
  t := first symbol of w#
  while (t ≠ # & q ≠ {}) do begin   /* while the current symbol is not the end marker and we are in a proper state */
    q := δ(q, t)                    /* get the next state from the transition table */
    t := next symbol in w#          /* get the next symbol from the input string */
  end
  return ((t == #) & (q ∈ F))       /* if the current symbol is the end marker and the current state is a finish state, return true, else false */
end
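As an illustration of Algorithm 2.12, here is a minimal Python sketch run on the DFSA A1 of Example 2.2. It is not the notes' own code: the dictionary `DELTA` and the function name `accepts` are assumptions of this sketch. A Python `for` loop replaces the end marker #, and a missing table entry (returned as `None`) plays the role of the empty set.

```python
# Transition table of A1 (Example 2.2): DELTA[state][symbol] is the next state.
DELTA = {
    1: {"a": 2, "b": 4},
    2: {"a": 3, "b": 4},
    3: {"a": 3, "b": 3},
    4: {"a": 2, "b": 4},
}
INITIAL = 1
FINAL = {4}

def accepts(w, delta=DELTA, initial=INITIAL, final=FINAL):
    """Return True iff the DFSA accepts the string w."""
    q = initial
    for t in w:                         # read the input one symbol at a time
        q = delta.get(q, {}).get(t)     # look up the next state
        if q is None:                   # no edge: we are no longer in a proper state
            return False
    return q in final                   # input exhausted: are we in a final state?
```

With this table, `accepts("babb")` is true and `accepts("baa")` is false, matching Example 2.6 (v) and (vi).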

Theorem 2.13 DFSA = NDFSA
Let L be a language. L is accepted by a NDFSA iff L is accepted by a DFSA.

Algorithm 2.14
NDFSA -> DFSA
Problem: Given a NDFSA, A, create a DFSA, A'
begin
create unique initial state
remove ε-edges
remove edge choices
end

Algorithm 2.15 create unique initial state
Given a NDFSA, A = (Q,I,F,T,E), create FSA A' with a single initial state s.t. L(A) = L(A')
begin
  Q := Q ∪ {i} (where i ∉ Q)                /* add a new initial state i */
  for each q ∈ I do add edge (i,ε,q) to E
  I := {i}                                  /* reset the initial set to be just i */
  return (Q,I,F,T,E)
end

Algorithm 2.16 remove ε-edges
Given a NDFSA, A = (Q,I,F,T,E), create FSA A' with no ε-edges
begin
  remove all edges of form (q,ε,q) from E
  while there are cycles of ε-edges in E do begin
    select a cycle
    merge all states in cycle into single state, keeping all edges in/out of cycle
  end
  while there are ε-edges in E do begin
    select an ε-edge (p,ε,q)
    for each edge (q,t,r) ∈ E do add edge (p,t,r) to E
    if q ∈ F then add p to F
    remove (p,ε,q) from E
  end
  return (Q,I,F,T,E)
end
Algorithm 2.17 remove edge choices (subset construction)
Given a NDFSA, A = (Q,I,F,T,E), create DFSA A' s.t. L(A) = L(A')
begin
  I' := {I}                           /* Note: I is a set; I' is a set with one member, I */
  F' := {}
  if I ∩ F ≠ {} then F' := {I}        /* the new initial state is final if it contains an old final state */
  E' := {}
  S := {I}
  Q' := {I}
  while S is not empty do begin
    select X ∈ S
    for each t ∈ T do begin
      S' := {q ∈ Q : (p,t,q) ∈ E, for some p ∈ X}
      if S' ≠ {} then begin
        if S' ∩ F ≠ {} then F' := F' ∪ {S'}
        E' := E' ∪ {(X,t,S')}
        S := S ∪ ({S'}\Q')            /* if we haven't seen S' before, add to S */
        Q' := Q' ∪ {S'}               /* if S' is already in Q', Q' doesn't change */
      end
    end
    S := S\{X}
  end
  return (Q',I',F',T,E')
end

Alternative description of Algorithm 2.17
Make a new initial state, I', representing all the old initial states.
Make sets F' and E', for the new finish states and new edges. Both start empty, except that if any
old initial state was a finish state, F' starts with the new initial state in it.
Create a set S of states we haven't expanded yet, initially containing just I'
Create a set Q' of all new states, initially containing just I'.
While there are states we haven't expanded yet (i.e. still states left in S)
  Pick one of those states, and call it X
  For each symbol in the alphabet
    Find all the old states that make up X
    Find all the old states we could have got to from those states by reading in the
    current alphabet symbol
    Group all those old states into a new state, S'
    If S' is not empty (i.e. there is at least one old state making up S')
      If any of the states making up S' were old finish states
        Make S' a new finish state (i.e. add to F')
      Add a new edge from X to S' for the current symbol (i.e. add to E')
      If we hadn't seen S' before, add S' to S
      If we hadn't seen S' before, add S' to Q'
  Take X out of S
Return the new FSA we have just created, where Q' is the set of states, I' is the set of initial
states, F' is the set of finish states, T is the alphabet, and E' is the set of edges.
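The subset construction can be sketched in Python as follows. This is an illustration under our own assumptions rather than the notes' code: the function name `subset_construction` is hypothetical, each new state is represented as a `frozenset` of old states, and the input is assumed to have no ε-edges. The sketch also treats the new initial state as final whenever it contains an old final state.

```python
def subset_construction(initial, final, alphabet, edges):
    """Determinise an FSA without ε-edges.

    initial, final: sets of old states; edges: set of (p, t, q) triples.
    Returns (states, start, finals, new_edges); new states are frozensets.
    """
    final = set(final)
    start = frozenset(initial)
    states = {start}
    finals = {start} if start & final else set()
    new_edges = set()
    todo = [start]                       # new states not yet expanded (the set S)
    while todo:
        X = todo.pop()
        for t in alphabet:
            # all old states reachable from a member of X by reading t
            target = frozenset(q for (p, s, q) in edges if p in X and s == t)
            if not target:
                continue                 # empty set: no edge is added
            new_edges.add((X, t, target))
            if target not in states:     # first time we see this new state
                states.add(target)
                todo.append(target)
                if target & final:
                    finals.add(target)
    return states, start, finals, new_edges

# The NDFSA with states {1,2,3}, initial state 1, final state 3 and these edges:
EDGES = {(1, "a", 1), (1, "a", 2), (1, "b", 1), (2, "b", 3), (3, "a", 3), (3, "b", 3)}
states, start, finals, new_edges = subset_construction({1}, {3}, "ab", EDGES)
```

For this input the sketch produces four new states, {1}, {1,2}, {1,3} and {1,2,3}, of which {1,3} and {1,2,3} are final.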

Example 2.18
A = ({1,2,3},{1},{3},{a,b},{(1,a,1),(1,a,2),(1,b,1),(2,b,3),(3,a,3),(3,b,3)})
Convert A into a DFSA.
[Figure: state diagram of A, with states 1, 2 and 3]
Initially I' = {{1}}, S = Q' = {{1}} and F' = E' = {}. Each row below is one pass of the inner loop of
Algorithm 2.17, showing what is added to each set:
X          t    S'         added to S and Q'    added to F'    added to E'
{1}        a    {1,2}      {1,2}                               ({1},a,{1,2})
{1}        b    {1}                                            ({1},b,{1})
{1,2}      a    {1,2}                                          ({1,2},a,{1,2})
{1,2}      b    {1,3}      {1,3}                {1,3}          ({1,2},b,{1,3})
{1,3}      a    {1,2,3}    {1,2,3}              {1,2,3}        ({1,3},a,{1,2,3})
{1,3}      b    {1,3}                                          ({1,3},b,{1,3})
{1,2,3}    a    {1,2,3}                                        ({1,2,3},a,{1,2,3})
{1,2,3}    b    {1,3}                                          ({1,2,3},b,{1,3})
A' =
({{1},{1,2},{1,3},{1,2,3}},
{{1}},
{{1,3},{1,2,3}},
{a,b},
{({1},a,{1,2}),({1},b,{1}),({1,2},a,{1,2}),({1,2},b,{1,3}),
({1,3},a,{1,2,3}),({1,3},b,{1,3}),({1,2,3},a,{1,2,3}),
({1,2,3},b,{1,3})})

Definition 2.19
Let A = (Q,I,F,T,E) be a FSA. For any two strings x, y ∈ T*, x and y are distinguishable w.r.t. A if
there is a string z ∈ T* s.t. exactly one of xz and yz is in L(A). We say z distinguishes x and y
w.r.t. A.
Theorem 2.20
Let L be a language over T, accepted by a FSA A. If, for some integer n, there are n elements of T* s.t.
any two are distinguishable w.r.t. A, then any DFSA that recognises L must have at least n states.

Theorem 2.21
For a given regular language L, there exists a minimal DFSA accepting L, and it is unique up to the naming of its states.

Algorithm 2.22
DFSA -> minimal DFSA
Given a DFSA A = (Q,I,F,T,E), create a DFSA A' s.t. A' is minimal over all DFSAs accepting L(A).
begin
  R := {}
  for all edges (p,t,q) ∈ E do add (q,t,p) to R          /* reverse every edge */
  remove edge choices from (Q,F,I,T,R) to get (Q',I',F',T,E')
  Z := equivalent_states(Q, Q')
  (Q'',I'',F'',T,E'') := merge(Z,Q,I,F,T,E)
  return (Q'',I'',F'',T,E'')
end
Algorithm 2.23 equivalent states
Given two sets of states Q and Q', produce the set of groups of states of Q that are equivalent in Q'.
M is a 2d array, indexed by Q and {ε} ∪ Q. The entry M[p][ε] records whether p still has to be placed
in a group.
begin
  set all cells of M to t
  for all p ∈ Q do
    for all sets S ∈ Q' do
      if p ∈ S then
        for each q ∈ Q do begin
          if q ∉ S then M[p][q] := f
        end
      else
        for each q ∈ Q do
          if q ∈ S then M[p][q] := f
  S := {}
  for each p ∈ Q do begin
    Z := {}
    if M[p][ε] = t then begin
      for each q ∈ Q do
        if M[p][q] = t then begin
          add q to Z
          M[q][ε] := f
        end
      add Z to S
    end
  end
  return S
end
Algorithm 2.24 merge
Given a DFSA A and a set of sets of states Z, return a new DFSA A'
begin
  for all S ∈ Z do begin
    select p ∈ S
    for all q ∈ S s.t. q ≠ p do begin
      delete q from Q
      for each edge (q,t,w) ∈ E do begin
        delete (q,t,w) from E
        if (p,t,w) ∉ E then add (p,t,w) to E
      end
      for each edge (w,t,q) ∈ E do begin
        delete (w,t,q) from E
        if (w,t,p) ∉ E then add (w,t,p) to E
      end
      if q ∈ I then begin
        delete q from I
        if p ∉ I then add p to I
      end
      if q ∈ F then begin
        delete q from F
        if p ∉ F then add p to F
      end
    end
  end
  return (Q,I,F,T,E)
end
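The idea behind Algorithms 2.22 and 2.23 is that two states are equivalent exactly when they belong to the same sets of Q', the states of the determinised reverse automaton. The sketch below is our own Python illustration of that idea: the function names are hypothetical, and instead of filling in the matrix M it groups states by the collection of subsets they belong to, which gives the same groups. The merging step of Algorithm 2.24 is not included.

```python
def reachable_subsets(start, alphabet, edges):
    """Subset construction: every non-empty state set reachable from `start`."""
    start = frozenset(start)
    seen, todo = {start}, [start]
    while todo:
        X = todo.pop()
        for t in alphabet:
            Y = frozenset(q for (p, s, q) in edges if p in X and s == t)
            if Y and Y not in seen:
                seen.add(Y)
                todo.append(Y)
    return seen

def equivalent_state_groups(states, final, alphabet, edges):
    """Group the states of a DFSA into classes of equivalent states."""
    reversed_edges = {(q, t, p) for (p, t, q) in edges}      # the set R
    subsets = reachable_subsets(final, alphabet, reversed_edges)  # the set Q'
    groups = {}
    for p in states:
        # p and q are equivalent iff they lie in exactly the same members of Q'
        signature = frozenset(S for S in subsets if p in S)
        groups.setdefault(signature, set()).add(p)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical DFSA for "strings ending in b", with redundant states 2 and 4.
EDGES = {(1, "a", 2), (1, "b", 3), (2, "a", 2), (2, "b", 3),
         (3, "a", 2), (3, "b", 4), (4, "a", 2), (4, "b", 4)}
groups = equivalent_state_groups({1, 2, 3, 4}, {3, 4}, "ab", EDGES)
```

Here `groups` comes out as `[[1, 2], [3, 4]]`: states 1 and 2 can be merged, as can states 3 and 4.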

Example 2.25 Minimise the following DFSA:
[Figure: the DFSA to be minimised, with states 1 to 5 and arcs labelled a and b]
Reverse:
[Figure: the reversed automaton, with the same states and every arc reversed]
The set Q' from "remove edge choices" is { {3,4}, {2,3,4}, {2,3,4,5}, {1,2,3,4,5} }
The matrix M, initially and after the states have been compared:
      1  2  3  4  5                 1  2  3  4  5
ε     t  t  t  t  t           ε     t  t  t  t  t
1     t  t  t  t  t           1     t  f  f  f  f
2     t  t  t  t  t     =>    2     f  t  f  f  f
3     t  t  t  t  t           3     f  f  t  t  f
4     t  t  t  t  t           4     f  f  t  t  f
5     t  t  t  t  t           5     f  f  f  f  t
giving merge set Z = { {1}, {2}, {3,4}, {5} }:
[Figure: the minimised DFSA, with states 1, 2, {3,4} and 5]

3. Regular Expressions and Regular Languages
Definition 3.1 Regular Expressions
Let T be an alphabet. A regular expression over T defines a language over T as follows:
(i) ∅ denotes {}, ε denotes {ε}, and t denotes {t} for t ∈ T;
(ii) if r and s are regular expressions denoting languages R and S, then
(r + s) denoting R + S,
(rs) denoting RS, and
(r*) denoting R* are regular expressions; and
(iii) nothing else is a regular expression over T.
Note: when writing regular expressions, if we give the operators +, . and * ascending priorities, then we
can omit most of the brackets. For example, the regular expression
((a)* + ((b)* + (c))*)((b) + (c))
can be written as
(a* + (b* + c)*)(b + c)
Precedence:
r + st should be interpreted as r + (st)
r + st* should be interpreted as r + (s (t*))
Notation: If T is an alphabet, then T also denotes the regular language of strings over T of length 1.
t^n denotes ttt...t (n times).
Example 3.2
(i) The regular expression (a* + (b* + c)*)(b + c) denotes a set, some of whose members are:
aaaab, b, bbbcbbbbcccccc, etc.
(ii) aT* denotes the language consisting of all strings over T starting with a.
(iii) T*(a^2 + b^2)T* denotes the set of all strings over T with a substring of aa or bb.
(iv) 0 + 1(0 + 1)* denotes the set of all binary numbers.
(v) The set of strings over {0,1} not containing two adjacent 0's is (1 + 01)*(ε + 0)
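Example (v) can be checked by brute force with Python's `re` module. This is a sketch of our own, not part of the notes: in Python's syntax the union r + s is written r|s, and (ε + 0) becomes the optional 0?. The names `NO_DOUBLE_ZERO`, `matches` and `agrees_up_to` are hypothetical.

```python
import re
from itertools import product

# (1 + 01)*(ε + 0) in Python's regular expression syntax
NO_DOUBLE_ZERO = re.compile(r"(1|01)*0?")

def matches(s):
    """True iff the whole string s is in the language of the expression."""
    return NO_DOUBLE_ZERO.fullmatch(s) is not None

def agrees_up_to(n):
    """Compare the expression with the direct test on all binary strings of length <= n."""
    for k in range(n + 1):
        for bits in product("01", repeat=k):
            s = "".join(bits)
            if matches(s) != ("00" not in s):
                return False
    return True
```

A check such as `agrees_up_to(8)` only tests finitely many strings, so it is evidence for the claim rather than a proof.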

Example 3.3
Application of Regular Expressions (I)
Searching for strings of characters in UNIX using ex and other editors, and using grep and egrep.
The ex command /a*[abc]/ means: find any line containing a substring that consists of any number of
a's followed by an a, a b, or a c.
Application of Regular Expressions (II)
Lexical analysis, the initial phase of compiling, divides the source code into "tokens". The
definition of what constitutes the different tokens is given by regular expressions.

Definition 3.4
A language L over T is a regular language iff there is a regular expression defining it.
Theorem 3.5
If A and B are regular languages, then so are A+B, AB and A*.

Notation: A' denotes the complement of A: i.e. the set of all strings in T* not in A.
Theorem 3.6
If A and B are regular languages, then so are A ∩ B and A'.

Theorem 3.7
Any finite language is regular.

4. Finite State Automata and Regular Languages
Theorem 4.1
Kleene's Theorem
A language L is accepted by a FSA iff L is regular.

Algorithm 4.2 Regular Expression -> NDFSA
Given a regular language, L, over T, defined by a regular expression, r, create a NDFSA, A, s.t. L = L(A).
begin
  if r == ε, then A := ({q},{q},{q},T,{})
  else if r == ∅, then A := ({q},{q},{},T,{})
  else if r == t, then A := ({p,q},{p},{q},T,{(p,t,q)})
  else if r == r1 + r2 then begin
    obtain A1 = (Q1,{i1},{f1},T,E1), L1 = L(A1)
    obtain A2 = (Q2,{i2},{f2},T,E2), L2 = L(A2)
    A := (Q1 ∪ Q2 ∪ {i,f},{i},{f},T,E1 ∪ E2 ∪ {(i,ε,i1),(i,ε,i2),(f1,ε,f),(f2,ε,f)})
  end
  else if r == r1r2 then begin
    obtain A1 and A2 as above
    A := (Q1 ∪ Q2,{i1},{f2},T,E1 ∪ E2 ∪ {(f1,ε,i2)})
  end
  else if r == r1* then begin
    obtain A1 as above
    A := (Q1 ∪ {i,f},{i},{f},T,E1 ∪ {(i,ε,i1),(i,ε,f),(f1,ε,f),(f1,ε,i1)})
  end
  return A
end
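The construction can be sketched in Python as below. This is an illustration under our own assumptions, not the notes' code: a regular expression is given as a nested tuple ("sym", t), ("eps",), ("alt", r1, r2), ("cat", r1, r2) or ("star", r1); the label `None` stands for ε; and the names `build`, `closure` and `accepts` are hypothetical. The ∅ case is left out, since its automaton has no final state. To test the result, `accepts` simulates the NDFSA directly by following ε-edges, which is a different technique from Algorithms 2.14 to 2.17.

```python
from itertools import count

_fresh = count(1)          # supplies new, unused state names

def build(r):
    """Return (initial, final, edges) of an NDFSA for the expression r."""
    kind = r[0]
    if kind == "eps":
        q = next(_fresh)
        return q, q, set()
    if kind == "sym":
        p, q = next(_fresh), next(_fresh)
        return p, q, {(p, r[1], q)}
    if kind == "cat":
        i1, f1, e1 = build(r[1])
        i2, f2, e2 = build(r[2])
        return i1, f2, e1 | e2 | {(f1, None, i2)}
    i, f = next(_fresh), next(_fresh)      # new initial and final states
    if kind == "alt":
        i1, f1, e1 = build(r[1])
        i2, f2, e2 = build(r[2])
        return i, f, e1 | e2 | {(i, None, i1), (i, None, i2),
                                (f1, None, f), (f2, None, f)}
    if kind == "star":
        i1, f1, e1 = build(r[1])
        return i, f, e1 | {(i, None, i1), (i, None, f),
                           (f1, None, f), (f1, None, i1)}
    raise ValueError(kind)

def closure(states, edges):
    """All states reachable from `states` using ε-edges only."""
    seen, todo = set(states), list(states)
    while todo:
        p = todo.pop()
        for (x, t, y) in edges:
            if x == p and t is None and y not in seen:
                seen.add(y)
                todo.append(y)
    return seen

def accepts(r, w):
    """True iff the NDFSA built for r has a successful path labelled w."""
    i, f, edges = build(r)
    current = closure({i}, edges)
    for t in w:
        step = {y for (x, s, y) in edges if x in current and s == t}
        current = closure(step, edges)
    return f in current

U = ("alt", ("sym", "b"), ("cat", ("sym", "a"), ("sym", "b")))   # b + ab
R = ("cat", U, ("star", U))                                      # (b+ab)(b+ab)*
```

For instance, `accepts(R, "babb")` is true and `accepts(R, "baa")` is false.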

Example 4.3 Regular Expression ->NDFSA
Let L = (b+ab)(b+ab)*, T = {a, b}
Find NDFSA's for: (i) a (ii) b (iii) ab (iv) (b+ab) (v) (b+ab)* (vi) (b+ab)(b+ab)*
(i) ({1,2},{1},{2},T,{(1,a,2)})
(ii) ({3,4},{3},{4},T,{(3,b,4)})
(iii) ({1,2,3,4},{1},{4},T,{(1,a,2),(2,ε,3),(3,b,4)})
(ii)' ({5,6},{5},{6},T,{(5,b,6)})
(iv) ({1,2,3,4,5,6,7,8},{7},{8},T,{(7,ε,1),(7,ε,5),(1,a,2),(2,ε,3),(3,b,4),(5,b,6),(4,ε,8),(6,ε,8)})
(iv)' ({9,10,11,12,13,14,15,16},{15},{16},T,
{(15,ε,9),(15,ε,13),(9,a,10),(10,ε,11),(11,b,12),(13,b,14),(12,ε,16),(14,ε,16)})
(v) ({9,10,11,12,13,14,15,16,17,18},{17},{18},T,
{(17,ε,15),(17,ε,18),(15,ε,9),(15,ε,13),(9,a,10),(10,ε,11),(11,b,12),
(13,b,14),(12,ε,16),(14,ε,16),(16,ε,18),(16,ε,15)})
(vi) ({1,2,...,18},{7},{18},T,
{(7,ε,1),(7,ε,5),(1,a,2),(2,ε,3),(3,b,4),(5,b,6),(4,ε,8),(6,ε,8),(8,ε,17),(17,ε,15),(17,ε,18),
(15,ε,9),(15,ε,13),(9,a,10),(10,ε,11),(11,b,12),(13,b,14),(12,ε,16),(14,ε,16),(16,ε,18),(16,ε,15)})
[Figure: state diagram of the NDFSA obtained in (vi), with states 1 to 18]

Algorithm 4.4 FSA -> Regular Expression
Given a FSA, A, create a regular expression defining L(A)
begin
create unique initial state
create unique final state
unique FSA -> regular expression
end

Algorithm 4.5 create unique final state
Given a NDFSA, A = (Q,I,F,T,E), create FSA A' with a single final state s.t. L(A) = L(A')
begin
Q := Q ∪ {f} (where f ∉ Q)
for each q ∈ F do add (q,ε,f) to E
F := {f}
return (Q,I,F,T,E)
end

Definition 4.6
A regular finite state automaton (RFSA) is a FSA where the edge labels may be regular
expressions. An edge labelled with the regular expression r indicates that we can move along that
edge on input of any string defined by r.
Algorithm 4.7 unique FSA -> regular expression
Given a FSA, A = (Q,{i},{f},T,E), with unique initial and final states, create a regular expression r
defining L(A).
begin
  convert A to a RFSA                       /* trivial */
  while Q\{i,f} is not empty do begin
    for each state p ∈ Q with more than one edge (p,rk,p) (k ≤ n) do
      replace all those edges by (p,r1+r2+...+rn,p)
    for each pair p,q ∈ Q with more than one edge (p,rk,q) (k ≤ n) do
      replace all those edges by (p,r1+r2+...+rn,q)
    select s ∈ Q\{i,f}
    for each pair p,q ∈ Q (p,q ≠ s) s.t. there are edges (p,r1,s) and (s,r2,q) do
      if there is an edge (s,r3,s) then add the edge (p,r1r3*r2,q)
      else add the edge (p,r1r2,q)
    remove all edges to or from s
    remove all states and edges with no path from i
  end
  if there is more than one edge (i,rk,f) (k ≤ n) then replace all those edges by (i,r1+r2+...+rn,f)
  return r, where E = {(i,r,f)}
end

Example 4.8 FSA -> Regular Expression
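[Figure: the FSA to convert, with states 1 to 4, initial state 1, final state 4, and edges
(1,a,2), (1,b,4), (2,a,3), (2,b,4), (3,a,3), (3,b,3), (4,a,2), (4,b,4)]
create unique initial and final states
  add (i,ε,1) and (4,ε,f); the two loops on state 3 are written as the single edge (3,a+b,3)
remove state 2 - edges are 1->3, 1->4, 4->3, 4->4
  add (1,aa,3), (1,ab,4), (4,aa,3), (4,ab,4)
remove edge pairs
  (1,b,4) and (1,ab,4) become (1,b+ab,4); (4,b,4) and (4,ab,4) become (4,b+ab,4)
remove state 3 - no edges
  no edge leaves state 3 for another state, so nothing is added; this leaves
  (i,ε,1), (1,b+ab,4), (4,b+ab,4), (4,ε,f)
remove state 4 - edge is 1->f
  add (1,(b+ab)(b+ab)*,f)
remove state 1 - edge is i->f
  add (i,(b+ab)(b+ab)*,f)
expression is (b+ab)(b+ab)*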

Theorem 4.9
The Pumping Lemma
If L is a regular language, then there exists an integer N s.t. for any w ∈ L with
|w| ≥ N, there are strings x, u and y s.t.
w = xuy
|xu| ≤ N
|u| > 0
and s.t. for any m ≥ 0, xu^m y ∈ L.
Proof:
Since L is a regular language, there must be a DFSA A which accepts L. Let N be the number of states in
A. Suppose there is a string w of length ≥ N which is accepted by A (i.e. w ∈ L). Since w is accepted by A,
the accepting path must make |w|+1 visits to states, and hence must make > N visits to states. But A has
only N states, and so at least one state must be visited at least twice. Let s be the first state which is visited
twice on the accepting path. We can split the path into three sub-paths: from the start to the first visit to s,
from the first visit to s to the second visit to s, and from the second visit to s to the end. Let x, u and y be
the substrings of w corresponding to these three subpaths (so w = xuy). The subpath xu does not visit any
states more than once except s, and so makes at most N+1 visits, and so must have length ≤ N (so |xu| ≤
N). For the two visits to s to be separate, there must be at least one character accepted in that subpath (so
|u| > 0).
Now, if we are in state s, an input of u will take us back to s, and an input of y will take us to the finish
state. From the start state, an input of x will take us to state s. Therefore, the input xy will be accepted (x
takes us to s, and y takes us to the finish state), and so will xuy, xuuy, xuuuy, etc. Therefore, for any
m ≥ 0, xu^m y will be accepted by A. But, by definition of A, any string accepted by A is in L. Therefore, for
any m ≥ 0, xu^m y ∈ L.

Example 4.10 Using the Pumping Lemma
Show L = {a^n b^n : n ≥ 0} is not regular.
Suppose L is regular.
Then, by the pumping lemma, there exists some integer N s.t. for any w ∈ L with
|w| ≥ N, there are strings x, u and y s.t.
w = xuy
|xu| ≤ N
|u| > 0
and ∀ m ≥ 0, xu^m y ∈ L
Choose i > N/2.
Let w be the string a^i b^i. Then w has length > N.
By the pumping lemma, w can be split into substrings xuy, s.t. |xu| ≤ N and |u| > 0.
Now u must be of the form a^n, or a^n b^m, or b^m, for some n, m > 0.
If u = a^n, then w = xuy = a^j a^n a^k b^i, where j+n+k = i. So xu^2y = a^j a^n a^n a^k b^i, which is not
in L, because it has more a's than b's.
The same argument works for u = b^m.
If u = a^n b^m, then w = xuy = a^j a^n b^m b^k, and xu^2y = a^j a^n b^m a^n b^m b^k, which is not in L,
because it has b's before a's.
Thus in no case is xu^2y in L.
But by the Pumping Lemma, xu^2y ∈ L. Contradiction.
Therefore our first assumption must have been wrong, so L is not regular.
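The case analysis can be checked mechanically for small values of N. The following Python sketch is our own illustration, with the hypothetical names `in_L` and `pumpable_splits`: it tries every split w = xuy allowed by the lemma and keeps those for which xu^2y is still in L. The argument above says that no such split exists.

```python
def in_L(s):
    """Membership test for L = {a^n b^n : n >= 0}."""
    n = len(s) // 2
    return s == "a" * n + "b" * n

def pumpable_splits(w, N):
    """All splits w = x u y with |xu| <= N and |u| > 0 for which x u^2 y is in L."""
    splits = []
    for j in range(N + 1):                            # j = |x|
        for k in range(j + 1, min(N, len(w)) + 1):    # k = |xu|, so |u| = k - j > 0
            x, u, y = w[:j], w[j:k], w[k:]
            if in_L(x + u + u + y):
                splits.append((x, u, y))
    return splits

def counterexample(N):
    """The string a^i b^i from the example, for the smallest integer i > N/2."""
    i = N // 2 + 1
    return "a" * i + "b" * i
```

For each N tried, `pumpable_splits(counterexample(N), N)` is empty, which is the contradiction used in the proof. The check is a finite experiment and does not replace the argument.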

5. Finite State Automata with Output
Definition 5.1 Moore Machine
A Moore Machine is a 6-tuple (Q,I,T,E,Γ,O), where
Q, I, T and E are as for DFSA's
Γ is an alphabet (called the output alphabet), and
O is a subset of Q × Γ (called the output function)
Notation: if (q,x) ∈ O, then sketch state q by a circle labelled q/x.
The output function defines the output of the machine whenever the machine enters a particular state.
Example 5.2
A Moore machine which prints out a "1" every time an aab substring is input:
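[State diagram not reproduced: four states, labelled 0/, 1/, 2/ and 3/ in state/output form, joined by arcs labelled a and b. The outputs and arc targets are not recoverable from this copy.]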
The input aaababaaab gives the output 11.
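
A Moore machine of this kind can be simulated in a few lines. The sketch below is an assumption rather than a transcription of the diagram: it uses the usual four-state "aab" detector, where state k records how much of aab has been matched and only state 3 has a non-empty output. The names TRANSITIONS, OUTPUT and run_moore are illustrative.

```python
# Assumed reconstruction of the aab detector (not copied from the diagram).
TRANSITIONS = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 2, (1, "b"): 0,
    (2, "a"): 2, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
# Output function O: one output per state; "" stands for the empty output.
OUTPUT = {0: "", 1: "", 2: "", 3: "1"}


def run_moore(word, start=0):
    """Feed word to the machine and collect the output of every state entered."""
    state = start
    out = OUTPUT[state]          # a Moore machine also emits on its initial state
    for ch in word:
        state = TRANSITIONS[(state, ch)]
        out += OUTPUT[state]
    return out


print(run_moore("aaababaaab"))
```

With these assumed transitions the input aaababaaab yields 11, agreeing with the example.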

Definition 5.3 Mealy Machine
A Mealy Machine is a 6-tuple (Q,I,T,E,Δ,O) where
Q, I, T and E are as for DFSA's
Δ is an alphabet (called the output alphabet), and
O is a subset of Q × T × Δ (called the output function).
Notation: if (q,t,x) ∈ O, then for any arc (q,t,p) ∈ E, label the arc by t/x.
The output function defines the output of the machine whenever the machine leaves a particular state
through a particular labelled action.
Example 5.4
A Mealy Machine which takes reversed binary numbers as input, and prints as output the reversed
number one larger:
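[State diagram not reproduced: the arcs carry the input/output labels 0/0, 1/1 (on one arc), 0/1, 0/1, 1/0 and 1/0. The states they connect are not recoverable from this copy.]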
The input 11101 gives the output 00011.
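
The behaviour of such a machine can be sketched as follows. This is an assumed two-state version (a "carry" state while the 1 is still being added, and a "done" state afterwards), not necessarily the machine drawn in the notes; ARCS and run_mealy are illustrative names.

```python
# (state, input symbol) -> (next state, output symbol); output sits on the arc.
ARCS = {
    ("carry", "1"): ("carry", "0"),   # 1 + carry = 0, carry again
    ("carry", "0"): ("done", "1"),    # 0 + carry = 1, carry absorbed
    ("done", "0"): ("done", "0"),     # copy the remaining bits unchanged
    ("done", "1"): ("done", "1"),
}


def run_mealy(bits, start="carry"):
    """Read a reversed binary number and emit the reversed successor."""
    state, out = start, []
    for b in bits:
        state, symbol = ARCS[(state, b)]
        out.append(symbol)
    return "".join(out)


print(run_mealy("11101"))
```

The input 11101 (23 written least significant bit first) gives 00011 (24 written the same way). An input consisting only of 1s overflows, since the machine emits exactly one output symbol per input symbol.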

Definitions 5.5
Let M be a Moore machine or a Mealy Machine, with output alphabet Δ. Define M_o(w) to be the
output of M on w.
Let M1 = (Q1,I1,T1,E1,Δ1,O1) be a Moore Machine, and M2 = (Q2,I2,T2,E2,Δ2,O2) be a Mealy
Machine. Let M1_o(ε) = b (the output M1 produces in its initial state, before reading any input).
M1 and M2 are equivalent if T1 = T2 and for all strings w ∈ T1*, M1_o(w) = b M2_o(w).
Theorem 5.6 Moore-Mealy Equivalence
If M1 is a Moore Machine, then there exists a Mealy Machine M2 equivalent to M1.
If M2 is a Mealy Machine, then there exists a Moore Machine M1 equivalent to M2.

6. Lex: A Lexical Analysis Tool
Lex is a program generator, accepting a series of regular expression definitions, and producing a program
which analyses input to identify lexical tokens defined by those regular expressions.
A Lex script has three sections, separated by a line containing only "%%":
... definitions ...
%%
... regular expression / action pairs ...
%%
... user-defined functions ...
Lex Syntax
Let c be a character, x,y regular expressions, s a string, m,n integers, and i an identifier.
regular expressions
    c         any character except meta characters
    [...]     any of the list of characters enclosed (may be a range of characters)
    [^...]    any of the characters not in the list enclosed
    .         any ASCII character except newline
    xy        the concatenation of x and y
    x*        zero or more occurrences of x (the Kleene star x*)
    x+        one or more occurrences of x
    x?        an optional x (same as x + ε, where + denotes union)
    x|y       x or y
    {i}       the definition of i
    x/y       x, but only if followed by y (and y is not read from the input)
    x{m,n}    m to n occurrences of x
    ^x        x, but only at the beginning of a line
    x$        x, but only at the end of a line
    "s"       exactly what is in the quotes (except for "\" and the following character)
Precedence: brackets, then unary operators (+,?,*), then concatenation, then |, then /.
Regular expressions are terminated by a space or a tab.
If more than one regular expression matches, Lex chooses the one giving the longest match; if
several give a match of the same length, it chooses the one defined first.
meta characters (do not match themselves)
    ()[]{}<>+/,^*|.\"$?-%
A match with a meta-character can be obtained by preceding it with "\".
Backslash, tab and newline are represented by \\, \t and \n respectively.
Actions
An action is a C language statement (followed by ";").
For example:
    [0-9]+       printf("Integer\n");
    [a-zA-Z]+    printf("String\n");
will print out "Integer" after receiving a digit string as input, and "String" after receiving a
character string.
Thus the input
12+19=sum
will result in
Integer
+Integer
=String
Note that the text matched by a regular expression is held in the string variable yytext, and its length is
held in the integer variable yyleng.
Any input not recognised by the regular expression section will simply be echoed to the screen.
Definition Section
If
    identifier  string
appears in the definition section, then string will replace {identifier} whenever
{identifier} appears in the regular expression section.
Thus
    L [a-zA-Z]
    %%
    {L}+
is equivalent to
    %%
    [a-zA-Z]+
Anything enclosed between %{ ... %} in this section will be copied into the output program.
#include and #define statements, all variable declarations, all function definitions and any
comments should be so enclosed.
Functions Section
The section should contain the user-defined "main" routine, and any other required functions,
written as C code.
A simple "main" routine is found in the lex library, and will be used if no user-defined "main" is
supplied.
Running Lex
The command lex calls the lex program on the specified file (usually with a ".l" suffix). The
output, a C file, is called lex.yy.c. This program must then be compiled with the lex library (using
the -ll option) with the object file renamed if required. To run the program, simply type the
name of the object file.
For example, to compile and run the lex script "example.l", type:
lex example.l
cc lex.yy.c -o example.o -ll
example.o
Example Lex Program
The following program specifies a simple word recognition lexical analyser.
    %{
    /* simple word recognition program */
    %}
    L           [a-zA-Z]
    %%
    [ \t]+      ;    /* ignore whitespace */
    is|are      printf("verb: %s; ", yytext);
    a|the       printf("determiner: %s; ", yytext);
    dog         |
    cat         |
    male        |
    female      printf("noun: %s; ", yytext);
    {L}+        printf("unknown: %s; ", yytext);
    .|\n        ECHO;
    %%
    int main()
    {
        yylex();
        return 0;
    }
Running this program as above would give the following (user input is underlined)
% word.o
the dog is a male <cr>
determiner: the; noun: dog; verb: is; determiner: a; noun: male;
female cat dog is <cr>
noun: female; noun: cat; noun: dog; verb: is;
catdog is male <cr>
unknown: catdog; verb: is; noun: male;
<ctrl-d>
%
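
The two conflict rules (longest match first, then earliest definition) explain why "catdog" is reported as unknown while "cat" is a noun. The sketch below imitates that behaviour in Python; it is an illustration only and not how Lex is implemented (Lex compiles the rules into a finite automaton). RULES, longest_match and tokenize are illustrative names.

```python
import re

# Same rules, in the same order, as the Lex script above.
RULES = [
    (r"[ \t]+", None),                  # whitespace: matched but ignored
    (r"is|are", "verb"),
    (r"a|the", "determiner"),
    (r"dog|cat|male|female", "noun"),
    (r"[a-zA-Z]+", "unknown"),
    (r".|\n", "echo"),
]


def longest_match(pattern, text, pos):
    """Length of the longest prefix of text[pos:] matched by pattern."""
    for end in range(len(text), pos, -1):
        if re.fullmatch(pattern, text[pos:end]):
            return end - pos
    return 0


def tokenize(text):
    pos, out = 0, []
    while pos < len(text):
        best_len, best_kind = 0, None
        for pattern, kind in RULES:
            n = longest_match(pattern, text, pos)
            if n > best_len:            # strict ">" keeps the earliest rule on ties
                best_len, best_kind = n, kind
        if best_kind is not None:
            out.append((best_kind, text[pos:pos + best_len]))
        pos += best_len
    return out


print(tokenize("catdog is male"))
```

For "catdog" the noun rule matches only three characters while {L}+ matches six, so the longer match wins. For "male" both rules match four characters, so the earlier noun rule wins.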
Practical Class: Using Lex
Write a lexical analyser using Lex, for the language C-, defined below.
What is required?
A "y.tab.h" file will be supplied, defining all the different tokens to be used. It is linked from the web
page for the practical, and should be copied to your own filespace before beginning the practical. Note
that "KEY_REAL_T" is intended for the "real" keyword, and "REAL_T" is intended for real numbers.
You have to write a Lex script, containing a definition section, a regular expression/action pair section,
and a function section. The script, when run through Lex, should create a program which takes a file as
input, reads the file, and outputs the result of a lexical analysis (either to another file or to the screen).
The output from the analyser should be in the form of <token, attribute> pairs. Every element of an input
program should be classified. Thus, on receiving input of z := y*27; the output should be something
like:
<ID_T,z>
<BECOMES_T,:=>
<ID_T,y>
<MUL_T,*>
<INT_T,27>
<SEMI_T,;>
The input must be described by regular expressions, and you must use Lex. You are advised to use the
skeleton file "lexer.l" in the above directory. Note that you are only asked to do word recognition, and not
check syntax. The action for each regular expression should be a simple "return" statement.
You have to decide what to do with errors, but do not allow something to pass as a token if it should not
pass, do not misclassify tokens, and do not allow valid tokens to pass as errors.
C- language definition
NOTE: A program is a sequence of function declarations and variable declarations.

Each function and each variable must be declared before use. A variable is declared by stating the
variable type followed by a non-empty space-separated sequence of identifiers or array specifications,
ending with a semi-colon. A function is declared by stating its return type, followed by the function
name (an identifier), a (possibly empty) comma-separated list of parameter declarations between "("
and ")", followed by the code block between "{" and "}". A parameter declaration is a variable type, a
space, and an identifier.

Each program must have a "main" function block, which must be the last function to be declared. It has
no parameters, and no return type.

The code block is a sequence of statements. Each statement may be a variable declaration, an
assignment, a call to a function, a print statement, a code block between "{" and "}", a while statement,
an if-then statement, an if-then-else statement, or a return statement. All except the code block and
while and if statements must be terminated by a semi-colon.

Possible variable and return types are "real" and "int" (and "void" as a special return type).

An identifier is a sequence of letters. An array specification is an identifier followed by an integer
between "[" and "]" representing the size of the array. An array reference is an identifier followed by an
integer-valued expression between "[" and "]".

An assignment has an array reference or an identifier on the left-hand side, a ":=", and an expression on
the right-hand side. Expressions are built from the "+", "-", "*" and "/" operators, and the basic factors
are reals, integers, identifiers, array references, function calls, or expressions inside "(" and ")".

A call to a function is the function name, and then the argument list between "(" and ")". The argument
list is a possibly empty comma-separated sequence of expressions.

A print statement is the keyword "print" followed by an expression between "(" and ")".

A while statement is the keyword "while" followed by a test between "(" and ")" followed by the
keyword "do" followed by a statement.

An if-then statement is the keyword "if" followed by a test between "(" and ")" followed by the keyword
"then" followed by a statement. An if-then-else statement is an if-then statement followed by the
keyword "else" followed by a statement.

A test is two expressions with a relational operator in between. The relational operators are "<", ">",
"=<", "=>", "=" and "!=" (standing for "not equals").

A return statement is the keyword "return" followed by an expression.

Integers have an optional sign followed by one or more digits. Reals have an optional sign, one or more
digits, a decimal point, and one or more digits.

Variables or functions which are used before being declared will give an error. Trying to use a return
value of a void function gives an error. Using a return statement inside a void function gives an error.
Variables are either declared outside a function, and can be accessed by all functions which follow
them, or are declared inside a function, and can only be used inside that function. Redeclaring a variable
within its scope gives an error. All parameters are passed by value (i.e. the value of a passed parameter
does not change once the function has completed).

A small example C- program is given below:

    int a b c ;
    int g[5];

    int testFunc(int x) {
        real y;
        y := (x+a)/2;
        print(y);
        return a;
    }

    main() {
        a := 1;
        while (a < 3) do {
            testFunc(a);
            a := a + 1;
        }
    }
7. Languages and Grammars
Definitions 7.1
Grammar
A grammar is a 4-tuple, G = (N,T,S,P), where
N is a finite alphabet (called the non-terminals);
T is a finite alphabet (called the terminals);
N ∩ T = ∅;
S ∈ N is the start symbol; and
P is a finite set of productions of the form
α -> β,
where α ∈ (N ∪ T)+, α has at least one member from N, and β ∈ (N ∪ T)*.
Let G = (N,T,S,P) be a grammar.
If s, t, x, y, u and v are strings s.t. s = xuy, t = xvy, and u -> v ∈ P, then s directly derives t, written
s => t.
If there is a sequence of strings s0, s1, ..., sn s.t. s0 => s1 => ... => s(n-1) => sn, then
s0 derives sn, written s0 =>* sn.
A sentential form of G is a string w ∈ (N ∪ T)* s.t. S =>* w.
A sentence of G is a sentential form w ∈ T* - i.e. w has only terminal symbols.
Definition 7.2
Language defined by a grammar
The language defined by G is the set of all sentences of G, denoted L(G).
Example 7.3
Let G = ({S}, {a,b}, S, {S -> ε, S -> aSb}).
G has one non-terminal: S
The terminals of G are a and b.
The start symbol of G is S.
G has two productions.
aaaSbbb => aaaaSbbbb.
S =>* aaaabbbb.
aaaSbbb is a sentential form of G
aaaabbbb is a sentence of G.
L(G) = {ε, ab, aabb, aaabbb, ...}, which is {a^n b^n : n ≥ 0}
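
The definition of L(G) as "all sentences derivable from S" can be turned directly into a search over sentential forms. The sketch below is illustrative only (PRODUCTIONS and sentences are names introduced here); it expands the single non-terminal S of this grammar and collects every sentence up to a length bound.

```python
from collections import deque

PRODUCTIONS = [("S", ""), ("S", "aSb")]   # S -> ε | aSb


def sentences(max_len):
    """Breadth-first search from S over sentential forms; return all
    sentences (forms without non-terminals) of length <= max_len."""
    seen, found = {"S"}, set()
    queue = deque(["S"])
    while queue:
        form = queue.popleft()
        if "S" not in form:
            found.add(form)               # only terminals left: a sentence
            continue
        i = form.find("S")                # the (only) non-terminal to replace
        for lhs, rhs in PRODUCTIONS:
            new = form[:i] + rhs + form[i + 1:]
            terminals = len(new.replace("S", ""))
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return sorted(found, key=len)


print(sentences(6))
```

The search finds the empty string, ab, aabb and aaabbb, i.e. exactly the strings a^n b^n with 2n ≤ 6.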

Notation: We will not normally write the grammar as a tuple, but will use the following conventions:
Non-terminals will be uppercase
Terminals will be lowercase
Unless stated otherwise, the start symbol will be S.
The set of productions may be numbered.
If x => y using production number i, then we write x =>i y.
α -> β1 | β2 | ... | βn will be shorthand for the n productions α -> βi (1 ≤ i ≤ n).
Definition 7.4 Context-Free Grammars and Languages
A context-free grammar (denoted CFG) is a grammar in which all productions are of the form
α -> β,
where α ∈ N - i.e. the left hand side is a single non-terminal.
A context-free language (denoted CFL) is one defined by a context-free grammar.
Example 7.5
A Grammar of Algebraic Expressions: G0
G0 = ({S}, {a, +, *, (, )}, S, {
1) S -> S + S
2) S -> S * S
3) S -> (S)
4) S -> a
})
Example derivation:
S =>2 S * S =>4 a * S =>3 a * (S) =>1 a * (S + S) =>4 a * (a + S) =>4 a * (a + a).
Note that there are many other ways of deriving the same string.

Definition 7.6 Regular Grammar
A grammar is regular if each production is of the form:
(i) A -> t ,
(ii) A -> tB, or
(iii) A -> ε
where A, B ∈ N, t ∈ T.
Example 7.7
S -> aA | bB
A -> aS | a
B -> bS | b
S => aA => aaS => aaaA => aaaaS => aaaabB => aaaabb
The language generated by this grammar is the same as (aa + bb)+.
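
The claimed equality with (aa + bb)+ can be spot-checked on all short strings. This is a sketch with illustrative names (GRAMMAR, derives); it relies on the fact that every right-hand side here is a terminal optionally followed by one non-terminal, and it writes the union + of the notes as | in Python's regular expression syntax.

```python
import re
from itertools import product

GRAMMAR = {"S": ["aA", "bB"], "A": ["aS", "a"], "B": ["bS", "b"]}


def derives(nonterminal, s):
    """True if nonterminal =>* s. Works for regular grammars whose
    right-hand sides are 't' or 'tB' (no ε-productions)."""
    for rhs in GRAMMAR[nonterminal]:
        if s[:1] != rhs[0]:
            continue                      # first terminal does not match
        if len(rhs) == 1:
            if len(s) == 1:
                return True               # production A -> t uses up the string
        elif derives(rhs[1], s[1:]):
            return True                   # production A -> tB, continue with B
    return False


# Compare the grammar with the regular expression on every string up to length 8.
for n in range(9):
    for letters in product("ab", repeat=n):
        s = "".join(letters)
        assert derives("S", s) == bool(re.fullmatch(r"(aa|bb)+", s))
print("grammar and regular expression agree on all strings up to length 8")
```

Agreement on short strings is evidence rather than proof; Theorem 7.8 below is what guarantees that a regular grammar always has an equivalent regular expression.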

Theorem 7.8
A language is regular iff it can be defined by a regular grammar.

Techniques for constructing grammars
To create sequences of a symbol (e.g. aaa...a):
    A -> aA | ε
or
    A -> Aa | ε
Example: A => aA => aaA => ... => aaaaaA => aaaaa
To "bracket" a string (e.g. axxx...xb):
    A -> aBb
    B -> xB | ε
or
    A -> Cb
    C -> ax | Cx
Example: A => aBb => axBb => axxBb => ... => axxxxxBb => axxxxxb
To create a nested structure (e.g. aaa...<.....>...bbb):
    A -> aAb | B
    B -> xB | ε
Example: A => aAb => aaAbb => ... => aaaaaAbbbbb => aaaaaBbbbbb => aaaaaxBbbbbb
=> aaaaaxxBbbbbb => aaaaaxxxbbbbb
Example 7.9
Construct a grammar for the language consisting of all strings of the form abccc...cab or
abab...abccc...cabab...ab, where ab is repeated n times before the c's and n times after them.
A -> abAab | abBab
B -> cB | c

8. Derivations and Ambiguity
Recognition problem
Given a grammar, G, and a string, w, is w ∈ L(G)?
Parsing Problem
Given a grammar, G, and a string, w ∈ L(G), how is w derived in G?
Definition 8.1
Derivation tree
Let (S =) w0 =>i1 w1 =>i2 w2 =>i3 ... =>in wn be a derivation. We construct the corresponding
derivation tree as follows. It has w0 as its root. Every time a symbol a is replaced by a substring β,
a branch is added from a to every symbol in β, in the same order in which they appear in β.
Example 8.2
Let S => S+S => S+(S) => S+(S*S) => S+(S*a) => S+(a*a) => a+(a*a) be a derivation in the
grammar G0. Its corresponding derivation tree is
              S
           /  |  \
          S   +   S
          |     / | \
          a    (  S  )
                / | \
               S  *  S
               |     |
               a     a

Definitions 8.3
A derivation in which, at each step, the rightmost non-terminal is replaced is a right-derivation.
A CFG is ambiguous if there is at least one string in L(G) having two or more different right
derivations.
Note: A string has two different right derivations iff it has two different derivation trees.
Example 8.4
G0 is ambiguous, since the string a+a*a has two different right derivations:
1. S => S+S => S+S*S => S+S*a => S+a*a => a+a*a
2. S => S*S => S*a => S+S*a => S+a*a => a+a*a
with the two derivation trees:
    1.      S                  2.        S
          / | \                        / | \
         S  +  S                      S  *  S
         |   / | \                  / | \   |
         a  S  *  S                S  +  S  a
            |     |                |     |
            a     a                a     a
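
Since a string has two right derivations exactly when it has two derivation trees, ambiguity can be demonstrated by counting trees. The following sketch counts the derivation trees of a string in G0 by trying every possible top-level production; count_trees is an illustrative name and the method is a simple memoised recursion, not an algorithm from the notes.

```python
from functools import lru_cache


def count_trees(s):
    """Number of distinct derivation trees for s in G0:
    S -> S+S | S*S | (S) | a."""
    @lru_cache(maxsize=None)
    def count(i, j):                      # trees deriving s[i:j] from S
        if j <= i:
            return 0                      # S never derives the empty string
        if j - i == 1:
            return 1 if s[i] == "a" else 0          # S -> a
        total = 0
        if s[i] == "(" and s[j - 1] == ")":
            total += count(i + 1, j - 1)            # S -> (S)
        for k in range(i + 1, j - 1):               # k = position of the operator
            if s[k] in "+*":
                total += count(i, k) * count(k + 1, j)   # S -> S+S or S*S
        return total

    return count(0, len(s))


print(count_trees("a+a*a"))
```

The result for a+a*a is 2, matching the two trees above, whereas a fully bracketed string such as (a+a)*a has exactly one tree.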

Example 8.5
An unambiguous grammar of algebraic expressions G
1) S -> S + T
2) S -> T
3) T -> T * F
4) T -> F
5) F -> (S)
6) F -> a
The only right derivation of a+a*a in G is
S => S+T => S+T*F => S+T*a => S+F*a => S+a*a => T+a*a => F+a*a => a+a*a
with the derivation tree:
            S
          / | \
         S  +  T
         |   / | \
         T  T  *  F
         |  |     |
         F  F     a
         |  |
         a  a

Definition 8.6
A language for which every defining grammar is ambiguous is inherently ambiguous.
9. Parsing
Definition 9.1
Top-down parsing creates a derivation tree for a given string by expanding from the start symbol
by applying productions.
Definition 9.2
Recursive-descent parsing is a top-down parsing method that associates a recursive procedure
with each non-terminal of the grammar.
Predictive parsing is recursive-descent parsing where it is possible to determine which procedure
to call at each stage by examining the next symbol of the input.
Example 9.3
Consider the following grammar:
    Type   -> Simple | array [Simple] of Type
    Simple -> int | num .. num
We can write procedures for Type and Simple as follows:
    procedure type
    begin
        if token ∈ {int, num} then simple
        else if token = array then begin
            match(array)
            match('[')
            simple
            match(']')
            match(of)
            type
        end
        else error
    end
    procedure simple
    begin
        if token = int then match(int)
        else if token = num then begin
            match(num)
            match(..)
            match(num)
        end
        else error
    end
    procedure match (t:token)
    begin
        if token = t then token := nexttoken
        else error
    end
A parse of "array[3 .. 11] of int" then consists of the following procedure calls:
    procedure calls:          token:
    type
        match(array)          array
        match('[')            [
        simple
            match(num)        num
            match(..)         ..
            match(num)        num
        match(']')            ]
        match(of)             of
        type
            simple
                match(int)    int
The corresponding derivation tree (children indented under their parent):
    Type
        array
        [
        Simple
            num
            ..
            num
        ]
        of
        Type
            Simple
                int
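
The procedures of Example 9.3 translate almost line for line into a runnable program. The sketch below is one possible rendering: the class and method names, the "#" end marker and the call log are assumptions made for illustration, and the input is taken to be already tokenised (each number appears as the token num).

```python
class ParseError(Exception):
    pass


class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["#"]   # "#" marks end of input (assumed)
        self.pos = 0
        self.calls = []                      # trace of procedure calls

    @property
    def token(self):                         # the single lookahead symbol
        return self.tokens[self.pos]

    def match(self, t):
        self.calls.append(f"match({t})")
        if self.token != t:
            raise ParseError(f"expected {t}, found {self.token}")
        self.pos += 1

    def type_(self):                         # Type -> Simple | array [Simple] of Type
        self.calls.append("type")
        if self.token in ("int", "num"):
            self.simple()
        elif self.token == "array":
            self.match("array")
            self.match("[")
            self.simple()
            self.match("]")
            self.match("of")
            self.type_()
        else:
            raise ParseError(f"unexpected {self.token}")

    def simple(self):                        # Simple -> int | num .. num
        self.calls.append("simple")
        if self.token == "int":
            self.match("int")
        elif self.token == "num":
            self.match("num")
            self.match("..")
            self.match("num")
        else:
            raise ParseError(f"unexpected {self.token}")


def parse(tokens):
    p = Parser(tokens)
    p.type_()
    if p.token != "#":
        raise ParseError(f"trailing input at {p.token}")
    return p.calls


print(parse(["array", "[", "num", "..", "num", "]", "of", "int"]))
```

The returned list reproduces the sequence of procedure calls shown above. The parser is predictive because each procedure decides what to do by looking only at the current token.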

Definition 9.4
LL(1) parsing means:
(i) read the input from the left (to the right)
(ii) generate a left derivation
(iii) using 1 lookahead symbol.
LL(1) parsing for a given grammar requires a 2D table with a column for each terminal plus a new
symbol #, and a row for each non-terminal. Each cell is a single production from the grammar.
Algorithm 9.5
LL(1) Parsing Algorithm
Given a string, a grammar and an LL(1) parse table, parse the string using the table.
Variables: z - a string (the parsing stack),
w - a string (the input),
M - the LL(1) table
Throughout, q denotes the first symbol in z (the top of the stack) and t the first symbol in w.
    begin
        z := start symbol concatenated with #
        w := input string concatenated with #
        while q ≠ # do begin
            if q = a and t = a then begin                          % a is a terminal: 'match'
                remove a from front of w
                remove a from front of z
            end
            else if q = N and t = a and M[N,a] = p then begin      % p = N -> β
                remove N from front of z
                put β onto the front of z
            end
            else error
        end while
        if q = # and t = # then accept                             % input ∈ L(G)
        else error
    end

Example 9.6
Grammar:
    1) S -> ( S ) S
    2) S -> ε
LL(1) table:
            (    )    #
    S:      1    2    2
Parse of the string ():
    z (parsing stack)    w (input stack)    action
    S#                   ()#                S -> ( S ) S
    (S)S#                ()#                match
    S)S#                 )#                 S -> ε
    )S#                  )#                 match
    S#                   #                  S -> ε
    #                    #                  accept
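
Algorithm 9.5 with the table of Example 9.6 can be sketched as below. The dictionary encoding of the table and the names PRODUCTIONS, TABLE and ll1_parse are illustrative choices; each grammar symbol is a single character, as in the example.

```python
PRODUCTIONS = {1: ("S", "(S)S"), 2: ("S", "")}     # 2) is S -> ε
TABLE = {("S", "("): 1, ("S", ")"): 2, ("S", "#"): 2}
NONTERMINALS = {"S"}


def ll1_parse(text):
    """Return the list of actions taken, or None if text is rejected."""
    z = "S#"                    # parsing stack, top at the front
    w = text + "#"              # remaining input
    steps = []
    while z[0] != "#":
        q, t = z[0], w[0]
        if q not in NONTERMINALS:
            if q != t:
                return None                 # terminal on stack does not match input
            steps.append("match")
            z, w = z[1:], w[1:]
        elif (q, t) in TABLE:
            lhs, rhs = PRODUCTIONS[TABLE[(q, t)]]
            steps.append(f"{lhs} -> {rhs or 'ε'}")
            z = rhs + z[1:]                 # replace N by the right-hand side
        else:
            return None                     # blank table entry: error
    return steps if w == "#" else None


print(ll1_parse("()"))
```

For the input () the actions are S -> (S)S, match, S -> ε, match, S -> ε, the same sequence as in the table above, after which the stack and the input both hold only # and the string is accepted.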

Both recursive-descent parsing and LL(1) parsing require first(N) and follow(N) to be known for all
non-terminal symbols N. first(N) is the set of all tokens which could appear as the first symbol in a token
substring derived from N, while follow(N) is the set of all tokens which could appear as the next token
once N's token substring is finished. Algorithms to compute these sets are known (but are omitted from
the course).
Neither recursive-descent nor LL(1) parsing can be used on grammars which are left recursive, or which
have two or more productions for the one non-terminal where the right-hand side starts with the same
substring.
Definition 9.7
Bottom-up parsing constructs a derivation tree from the input string, applying productions in
reverse (called reductions) until the start symbol is reached.
Algorithm 9.8 Basic Shift-Reduce Parsing
Given a string and a grammar, construct a derivation of the string.
variables:
z - a string (the stack), w - a string (the input), h - a substring (the handle)
    begin
        z := ε
        w := input string concatenated with #
        while z ≠ S or w ≠ # do begin
            obtain the handle h (corresponding to production A -> h)
            if z does not end in h, then move first symbol of w to end of z    % shift
            else begin                                                          % reduce
                remove h from z
                put A on end of z
            end
        end
    end

Example 9.9
Parse a+a*a in grammar G
    Stack      Input      Action
               a+a*a      shift
    a          +a*a       reduce (6)
    F          +a*a       reduce (4)
    T          +a*a       reduce (2)
    S          +a*a       shift
    S+         a*a        shift
    S+a        *a         reduce (6)
    S+F        *a         reduce (4)
    S+T        *a         shift
    S+T*       a          shift
    S+T*a                 reduce (6)
    S+T*F                 reduce (3)
    S+T                   reduce (1)
    S                     accept

Definition 9.10
LR(k) Parse Table
An LR(k) parse table is a 2D matrix, with rows indexed by integers and columns indexed by
length k strings of grammar symbols plus an endmarker. The entries of the table are of five types:
    Rp    (reduce by production p)
    Sn    (shift, go to state n)
    n     (go to state n)
    A     (accept)     /* there is only one 'A' entry */
    E     (error)      /* Notation: appear blank in the table */
"LR(k)" means:
(i) read the input from the left (to the right)
(ii) generate a right derivation
(iii) using k lookahead symbols.
Example 9.11
LR(1) parse table for G
          S    T    F    a    +    *    (    )    #
     0    1    2    3    S5             S4
     1                        S6                  A
     2                        R2   S7        R2   R2
     3                        R4   R4        R4   R4
     4    8    2    3    S5             S4
     5                        R6   R6        R6   R6
     6         9    3    S5             S4
     7              10   S5             S4
     8                        S6             S11
     9                        R1   S7        R1   R1
    10                        R3   R3        R3   R3
    11                        R5   R5        R5   R5

Algorithm 9.12 LR(1) Parsing Algorithm
Given a string, a grammar and an LR(1) parse table, parse the string using the table.
    begin
        z := 0
        w := input string concatenated with #
        loop
            q := last symbol in z                          % top state in stack
            t := first symbol in w
            if M[q, t] = Sn then begin                     % row q, col t in table
                remove t from front of w
                put n on end of z
            end
            else if M[q, t] = Rp then begin
                take the grammar rule numbered with p
                let the left hand side of it be called B
                and let the right hand side be called β
                i.e. the grammar rule has the form: p = B -> β
                remove |β| symbols from end of z
                q := last symbol in z                      % top state in stack
                put M[q,B] on end of z                     % new state
            end
            else if M[q, t] = A then return true           % input ∈ L(G)
            else return false                              % input ∉ L(G)
        end
    end
Example 9.13 Parse a+a*a using the table of 9.11 and grammar G
1) S -> S + T
2) S -> T
3) T -> T * F
4) T -> F
5) F -> (S)
6) F -> a
Each reduction is shown on two rows: the first looks up Rp and pops the stack, the second (where the B
column is filled in) looks up M[q,B] to find the new state.
    Symbol stack  Stack (z)   Input (w)   q    t    B    Action   Grammar Rule
                  0           a+a*a#      0    a         S5
    a             05          +a*a#       5    +         R6       6) F -> a
    a             0           +a*a#       0         F    R6
    F             03          +a*a#       3    +         R4       4) T -> F
    F             0           +a*a#       0         T    R4
    T             02          +a*a#       2    +         R2       2) S -> T
    T             0           +a*a#       0         S    R2
    S             01          +a*a#       1    +         S6
    S+            016         a*a#        6    a         S5
    S+a           0165        *a#         5    *         R6       6) F -> a
    S+a           016         *a#         6         F    R6
    S+F           0163        *a#         3    *         R4       4) T -> F
    S+F           016         *a#         6         T    R4
    S+T           0169        *a#         9    *         S7
    S+T*          01697       a#          7    a         S5
    S+T*a         016975      #           5    #         R6       6) F -> a
    S+T*a         01697       #           7         F    R6
    S+T*F         01697 10    #           10   #         R3       3) T -> T * F
    S+T*F         016         #           6         T    R3
    S+T           0169        #           9    #         R1       1) S -> S + T
    S+T           0           #           0         S    R1
    S             01          #           1    #         A
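
Algorithm 9.12 can be sketched in runnable form using the table of Example 9.11. In this illustration the table is split into two dictionaries, ACTION for terminal columns and GOTO for non-terminal columns; that split and the names RULES, ACTION, GOTO and lr_parse are choices made here and are not part of the notes.

```python
# Grammar G: rule number -> (left-hand side, length of right-hand side)
RULES = {1: ("S", 3), 2: ("S", 1), 3: ("T", 3), 4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}

GOTO = {(0, "S"): 1, (0, "T"): 2, (0, "F"): 3, (4, "S"): 8, (4, "T"): 2,
        (4, "F"): 3, (6, "T"): 9, (6, "F"): 3, (7, "F"): 10}

ACTION = {(1, "+"): "S6", (1, "#"): "A", (8, "+"): "S6", (8, ")"): "S11"}
for state in (0, 4, 6, 7):                       # states that expect an operand
    ACTION[(state, "a")] = "S5"
    ACTION[(state, "(")] = "S4"
for state, reduce_ in ((2, "R2"), (9, "R1")):    # reduce unless '*' follows
    ACTION[(state, "*")] = "S7"
    for t in "+)#":
        ACTION[(state, t)] = reduce_
for state, reduce_ in ((3, "R4"), (5, "R6"), (10, "R3"), (11, "R5")):
    for t in "+*)#":
        ACTION[(state, t)] = reduce_


def lr_parse(text):
    """Return the rule numbers used, in order of reduction, or None on error."""
    z = [0]                        # stack of states
    w = text + "#"
    reductions = []
    while True:
        action = ACTION.get((z[-1], w[0]))
        if action is None:
            return None            # blank entry: error
        if action == "A":
            return reductions
        n = int(action[1:])
        if action[0] == "S":       # shift: consume input, push state n
            z.append(n)
            w = w[1:]
        else:                      # reduce by rule n = B -> β
            lhs, size = RULES[n]
            del z[len(z) - size:]  # remove |β| states
            z.append(GOTO[(z[-1], lhs)])
            reductions.append(n)


print(lr_parse("a+a*a"))
```

For a+a*a the reductions are 6, 4, 2, 6, 4, 6, 3, 1, the same order as the Grammar Rule column of Example 9.13. Read backwards, this sequence is a right derivation of the string.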

Definitions 9.14
A grammar is LR(k) if we can construct a deterministic LR(k) parse table for it.
A language is LR(k) if it has an LR(k) grammar.
10. Yacc: A Parser Generator
Yacc is a parser generator, accepting a context-free grammar, and producing a program which
analyses input to check whether it conforms to the syntax of the grammar. Yacc constructs the
LR(1) parse table, and implements the LR(1) parsing algorithm (in fact, LALR(1) - a slight
restriction of LR(1) - and not LR(1)). The input must first be converted to a stream of integer
tokens, using a function yylex(). The function yylex() can be hand-written, or generated by Lex.

A Yacc script has three sections, separated by lines containing only %%:
    ... definitions ...
    %%
    ... production rules ...
    %%
    ... user-defined functions ...

Definitions section
As in Lex, anything in this section enclosed between %{ and %} will be copied into the output
program. Any #include or #define statements or variable or function declarations required for the
user-defined functions should be enclosed here.
In this section must appear a set of "token" declarations, and there must be a token for each
terminal which will appear in the grammar. For example:
    %token VERB_T
    %token NOUN_T
declares two terminals for use in a grammar. A useful convention is to use uppercase ending in
"_T" for token names, and to use mixed case, starting with a capital letter, for non-terminals.

Productions section
Instead of writing
    A -> a b c | e f g
we now write
    A : a b c | e f g ;
and liberal use of white space is encouraged to improve readability, and to make it easier to
update scripts. For example, the two productions above would be better written as
    A   :   a b c
        |   e f g
        ;
YACC will take the left-hand symbol of the first rule in this section, and make it the start symbol.
Comments can be included in 'C' format. For example:
    /* A can be rewritten to abc or to efg */

Functions section
As in Lex, this section should contain the user-defined main() routine, and any other required
functions. The usual functions to include here apart from main() are:
lexerr() - defining what to do if the lexical analyser finds an undefined token. This requires that
the default case in the lexer has a call to this function as its associated action.
yyerror(char*) - defining what to do if the parser cannot recognise the syntax of part of the
input. This function will be called by the parser, which passes a string describing the type of
error. Note that when an error occurs, the line number of the input is held in yylineno, and the
last token read when the error is reached is held in yytext.

Running Yacc
The command yacc calls the Yacc program on the specified input. Using the "-d" option forces
Yacc to create a file y.tab.h, which contains the #define statements for all the tokens declared in
the definitions section. If we need to use the integer values of these tokens in the user-defined
functions, we can then place #include "y.tab.h" between the %{ and %} lines of the definitions
section. Using the "-v" option forces Yacc to create a file y.output, which contains information on
the parse table useful for debugging. The output of the yacc command is a file y.tab.c, which
contains the 'C' source for the parser.
If we have written a Lex script for the lexical analyser, we must also create lex.yy.c as before.
To obtain executable code for the complete parser, we then must link the object files, using both
the yacc library, "-ly" and the Lex library, "-ll".

Error Messages
Yacc can only accept grammars of a particular sort. Specifically, it cannot handle ambiguous
grammars, nor can it handle grammars requiring two or more symbols of lookahead for parsing.
The two messages resulting from ambiguous grammars that you will see most often are:
    shift-reduce conflict
or
    reduce-reduce conflict
Example productions giving rise to these messages are:
    Expr    :   TOKEN
            |   Expr + Expr
            ;
for the first case, where TOKEN+TOKEN+TOKEN could be parsed in two ways, and
    Animal  :   Dog
            |   Cat
            ;
    Dog     :   FRED_T ;
    Cat     :   FRED_T ;
for the second, where FRED_T could be parsed two ways.
If these messages appear, then your grammar is not suitable. In most cases, by carefully studying
the grammar (using the information in y.output), you can find a different set of productions which
Yacc can handle. The two simplest cases are given above. In particular, note that productions of
the form E -> E+E are guaranteed to produce conflicts.
Occasionally, it may turn out that the language you are trying to define is inherently ambiguous,
in which case Yacc is of no use; however, this is very unlikely. If the language is easy to
understand, then, generally, it is easy to write a simple, unambiguous grammar for it. Remember
that Yacc can handle even large and relatively complex languages like PASCAL and C - in fact,
the Berkeley PASCAL and Sun C compilers are written in Yacc.
If Yacc does output the above messages, do not let your grammar go uncorrected. Although a
parser will be generated, it will probably not define the language you intend, and will fail in
mysterious ways.

Example 10.1
Write a Yacc script to construct a parser for sentences from the natural language grammar below.
    S   -> NP VP
    NP  -> Det NP1 | PN
    NP1 -> Adj NP1 | N
    Det -> a | the
    PN  -> peter | paul | mary
    Adj -> large | grey
    N   -> dog | cat | male | female
    VP  -> V NP
    V   -> is | likes | hates
First, we will accept files consisting of multiple sentences. Each sentence will be delimited by a
".". Therefore, change the first production to read:
    S   -> NP VP .
and we also add two new productions describing "documents" in terms of sentences:
    D   -> S D | ε
Note that we are only trying to parse sentences, and not understand them - therefore, our lexical
analysis only needs to be to the level of the parts of speech (i.e. we only need to recognise nouns
and verbs, and not individual words).
The lexical analyser is a modification of the example Lex program given in section 6. Instead of printing
a message, each action returns a token. Therefore, in the definitions section, we have a line which includes
the token list which will be created by Yacc.
%{
/* simple part of speech lexer */
#include "y.tab.h"
%}
L [a-zA-Z]
%%
In the regular expression section, we need expressions for each part of speech, plus special symbols and
unknown input.
    [ \t\n]+            /* ignore whitespace */;
    is|likes|hates      return VERB_T;
    a|the               return DET_T;
    dog                 |
    cat                 |
    male                |
    female              return NOUN_T;
    peter               |
    paul                |
    mary                return PROPER_T;
    large|grey          return ADJ_T;
    \.                  return PERIOD_T;
    {L}+                lexerr();
    .                   lexerr();
    %%
We will use the standard yylex() function created by Lex, and so we don't need user-defined functions.
In the definitions section of the Yacc script, we need to declare the variables we will use in the error
functions, as well as all the tokens we expect to be passed by the lexer.
    %{
    /* a Yacc script for a simple natural language grammar */
    #include <stdio.h>
    #include "y.tab.h"
    extern int yyleng;
    extern char yytext[];
    extern int yylineno;
    extern int yyval;
    extern int yyparse();
    %}
    %token DET_T
    %token NOUN_T
    %token PROPER_T
    %token VERB_T
    %token ADJ_T
    %token PERIOD_T
    %%
The grammar rules are straightforward.
    /* a document is a sentence and the rest of the document,
       or is empty */
    Doc             :   Sent Doc
                    |   /* empty */
                    ;
    /* a sentence is a noun phrase, verb phrase, and a period */
    Sent            :   NounPhrase VerbPhrase PERIOD_T ;
    /* a noun phrase is a determiner and an undetermined noun phrase,
       or a proper noun */
    NounPhrase      :   DET_T NounPhraseUn
                    |   PROPER_T
                    ;
    /* an undetermined noun phrase is an adjective and an undetermined
       noun phrase, or a noun */
    NounPhraseUn    :   ADJ_T NounPhraseUn
                    |   NOUN_T
                    ;
    /* a verb phrase is a verb and a noun phrase */
    VerbPhrase      :   VERB_T NounPhrase ;
    %%
In the user-defined functions section, we need to handle errors from the lexical analysis and errors from
the syntax analysis, as well as defining the output from successful parsing.
    void lexerr()
    {
        printf("Invalid input '%s' at line %i\n", yytext, yylineno);
        exit(1);
    }
    void yyerror(char *s)
    {
        (void)fprintf(stderr, "%s at line %i, last token: %s\n",
                      s, yylineno, yytext);
    }
    int main()
    {
        if (yyparse() == 0)
            printf("Parse OK\n");
        else
            printf("Parse Failed\n");
        return 0;
    }
To compile the program, we type:
yacc -d -v parser.y
cc -c y.tab.c
lex parser.l
cc -c lex.yy.c
cc y.tab.o lex.yy.o -o parser -ly -ll
Suppose we have three different input files, file1, file2 and file3, as follows:
file1:
peter is a large grey cat.
the dog is a female.
paul is peter.
file2:
the cat is mary.
a dogcat is a male.
file3:
peter is male.
mary is a female.
Typing the following commands gives the following results:
% parser < file1
Parse OK
% parser < file2
Invalid input 'dogcat' at line 2
% parser < file3
syntax error at line 1, last token: male
%
The second sentence of file2 contains unknown input - the word "dogcat".
The first sentence of file3 has a syntax error - we have defined the word "male" to be a noun, and it must
be preceded by a determiner.
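
The three outcomes can be reproduced without Lex and Yacc by a small hand-written recogniser for the same grammar. Note that this sketch uses recursive descent rather than the LALR(1) method Yacc generates, and that it tokenises the whole input before parsing; LEXICON, LexError, SyntaxErr, lex and parse are illustrative names.

```python
import re

LEXICON = {}
for kind, words in (("VERB_T", "is likes hates"), ("DET_T", "a the"),
                    ("NOUN_T", "dog cat male female"),
                    ("PROPER_T", "peter paul mary"), ("ADJ_T", "large grey")):
    for word in words.split():
        LEXICON[word] = kind


class LexError(Exception):
    pass


class SyntaxErr(Exception):
    pass


def lex(text):
    """Turn the text into part-of-speech tokens, as the Lex script does."""
    tokens = []
    for word in re.findall(r"[a-zA-Z]+|\S", text):
        if word == ".":
            tokens.append("PERIOD_T")
        elif word in LEXICON:
            tokens.append(LEXICON[word])
        else:
            raise LexError(word)            # plays the role of lexerr()
    return tokens


def parse(text):
    toks = lex(text) + ["EOF"]
    pos = 0

    def expect(kind):
        nonlocal pos
        if toks[pos] != kind:
            raise SyntaxErr(f"expected {kind}, found {toks[pos]}")
        pos += 1

    def noun_phrase():                      # NounPhrase and NounPhraseUn
        if toks[pos] == "PROPER_T":
            expect("PROPER_T")
        else:
            expect("DET_T")
            while toks[pos] == "ADJ_T":
                expect("ADJ_T")
            expect("NOUN_T")

    while toks[pos] != "EOF":               # Doc -> Sent Doc | empty
        noun_phrase()                       # Sent -> NounPhrase VerbPhrase PERIOD_T
        expect("VERB_T")
        noun_phrase()
        expect("PERIOD_T")
    return True


print(parse("peter is a large grey cat. the dog is a female. paul is peter."))
```

On the contents of file1 the recogniser succeeds; on file2 it raises LexError for dogcat, and on file3 it raises SyntaxErr because a noun appears where a determiner or proper noun is required.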
11. Error Handling
never be reached during the parse of a
syntactically correct string. This condition is
used to trigger an error recovery procedure
which reports the error and then tries to return
the parser to a state where it can continue.
Error Handling
It is part of the task of a compiler to assist in the
identification, location and correction of errors.
Errors can occur at any stage in the process, and
it is desirable for each component of the
compiler to report (and maybe recover from) the
errors corresponding to its operation.
Error recovery
Once an error has been detected, the aim is to put
the parser in a state such that it can continue
processing input with a reasonable hope that
subsequent correct input will be parsed, and
subsequent errors will be detected.
Lexical errors
Very few errors can be detected during lexical
analysis, because the analyser has a very local
view of the code. The main type of error is when
the analysis halts because the input cannot be
matched to any of the declared regular
expressions - i.e. there is an invalid character or
sequence of characters in the program.
If the parser is not returned to a good state, there
will be an avalanche of spurious errors, which
are not actually errors in the source program, but
were introduced by the changes made to the state
of parser. Even if the rest of the input is
accepted, there is no guarantee that it doesn't
contain errors
The easiest way to recover from this type of error
(after reporting it) is simply to delete the
offending characters from the input, and
continue processing. This is not very
satisfactory, however, as it is uncontrolled, and
may cause confusion during later stages of
compilation.
Strategies
panic mode - ignore all input symbols until a
designated "synchronising" token is reached - for
example, end or ";". Start processing again after
this token. This method often skips large parts of
code without checking for errors, but it is simple,
and it does not enter infinite loops.
Parsing errors
The error handler in the parser should:
• report errors clearly and accurately
phrase level - locally correct the input - that is,
replace a prefix of the current input by
something that would allow the parser to
continue. Commonly, this involves replacing,
inserting or deleting delimiters. Care must be
taken, however, that the parser does not start to
loop - a possibility if it always adds input onto
the front rather than replaces input. The method
also has a problem if the error actually occurred
before the current point on the input stack
• recover from each error quickly enough
to detect subsequent errors
• not significantly slow down the
compilation.
The design of parser error handling requires
finding a balance between these three objectives.
Error detection
error productions - if certain errors are known to
happen frequently, it is possible to include in the
grammar what are called error productions. The
grammar then caters for these errors, includes
likely recovery, and allows specific diagnostics
to be output.
The LR-parsing method has the advantage that it
detects the errors at the earliest possible point in
the input. The errors are detected by the parser
reaching a blank (or "E") entry in the parse table,
indicating that this (state,lookahead) pair can
47
CS3012 Formal Languages and Compilers
Error Handling
global correction - ideally, we would want the
compiler to carry out the minimum of changes to
the input in order to jump over an error. Given
an incorrect input string x, and a grammar G, it is
possible to find a parse tree for a related string y,
such that the number of changes made to x to get
y is minimised. However, this method is very
expensive in time and space, and so, generally, is
not used in practice.
Si). We then discard input symbols until we
reach one, a, say, which is in follow(A).
Normally, we restrict the possibilities for A to be
major program components - e.g. statement - and
then a might be a semi-colon or an end. We
remove the states above the selected one from
the stack, and place i on the stack. Basically, we
assume that a string derivable from A contains
the error. Part of this string has already been
processed (the states above s), and part remains
on the input (the symbols to be discarded). The
parser tries to skip over the error by assuming
that A has been parsed successfully, and jumping
to a symbol that should follow it.
Error recovery in LR Parsing
The SLR parser may make a few erroneous
reductions before discovering an error, but will
never shift an erroneous token from the input
onto the stack. We can implement the first two
error recovery strategies in the following ways:
phrase level - for this mode, we study each error
entry in the table (the blanks or "E"s), and decide
on the most likely cause. We then implement
recovery procedures which assume that cause
and take the appropriate action to modify the
input.
panic mode - scan down the stack until we find a
state, s, which has a shift command for
particular non-terminals (A, say, with shift action
Example 11.1 phrase-level error recovery in LR(1) parsing
Consider the LR(1) parse table for the grammar G augmented with error procedures:
State |  S   T   F  |  a    +    *    (    )    #
------+-------------+------------------------------
  0   |  1   2   3  |  S5   e1   e1   S4   e2   e1
  1   |             |  e3   S6   e4   e3   e2   A
  2   |             |  e3   R2   S7   e3   R2   R2
  3   |             |  e3   R4   R4   e3   R4   R4
  4   |  8   2   3  |  S5   e1   e1   S4   e2   e1
  5   |             |  e3   R6   R6   e3   R6   R6
  6   |      9   3  |  S5   e1   e1   S4   e2   e1
  7   |          10 |  S5   e1   e1   S4   e2   e1
  8   |             |  e3   S6   e4   e3   S11  e5
  9   |             |  e3   R1   S7   e3   R1   R1
 10   |             |  e3   R3   R3   e3   R3   R3
 11   |             |  e3   R5   R5   e3   R5   R5
e1: /* called from states 0, 4, 6 or 7, that are expecting the beginning of an operand
       (either an a or a "("), but instead a "+", "*" or "#" is found */
    put 5 on top of the stack                 /* assumes a has been found */
    issue message "missing operand"
e2: /* called from states 0, 1, 4, 6 or 7, which find an unexpected ")" */
    remove ")" from input                     /* simply ignore it */
    issue message "unmatched right parenthesis"
e3: /* called from states 1 or 8 which expect "+", but find an a or a "(" */
    put 6 on to the stack                     /* assume a "+" has been found */
    issue message "missing '+'"
e4: /* called from states 1 or 8 which expect "+" but find "*" */
    put 6 on top of stack                     /* assume a "+" has been found */
    remove "*" from input                     /* assume it was a "+" */
    issue message "'*' instead of '+'"
e5: /* called from state 8 which expects a ")" but finds # */
    put 11 on stack                           /* assume ")" is found */
    issue message "missing right parenthesis"
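The table and the five error procedures of Example 11.1 can be tried out with a small table-driven driver. The following is only a sketch under stated assumptions: the function name parse, the string encoding of table entries, and the fixed-size stack are illustrative choices, not part of the notes. The ")" entry for state 7 is taken to be e2, as the comment on e2 suggests. Blank goto entries are stored as -1 and the driver gives up if it ever reaches one.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static const char COLS[] = "a+*()#";   /* column order of ACTION */

/* "Sn" = shift to n, "Rn" = reduce by rule n, "A" = accept,
   "en" = call error procedure n (entries copied from Example 11.1). */
static const char *ACTION[12][6] = {
    /*  0 */ {"S5", "e1", "e1", "S4", "e2",  "e1"},
    /*  1 */ {"e3", "S6", "e4", "e3", "e2",  "A"},
    /*  2 */ {"e3", "R2", "S7", "e3", "R2",  "R2"},
    /*  3 */ {"e3", "R4", "R4", "e3", "R4",  "R4"},
    /*  4 */ {"S5", "e1", "e1", "S4", "e2",  "e1"},
    /*  5 */ {"e3", "R6", "R6", "e3", "R6",  "R6"},
    /*  6 */ {"S5", "e1", "e1", "S4", "e2",  "e1"},
    /*  7 */ {"S5", "e1", "e1", "S4", "e2",  "e1"},
    /*  8 */ {"e3", "S6", "e4", "e3", "S11", "e5"},
    /*  9 */ {"e3", "R1", "S7", "e3", "R1",  "R1"},
    /* 10 */ {"e3", "R3", "R3", "e3", "R3",  "R3"},
    /* 11 */ {"e3", "R5", "R5", "e3", "R5",  "R5"}
};

/* GOTO[state][0 = S, 1 = T, 2 = F]; -1 marks a blank entry. */
static const int GOTO[12][3] = {
    { 1,  2,  3}, {-1, -1, -1}, {-1, -1, -1}, {-1, -1, -1},
    { 8,  2,  3}, {-1, -1, -1}, {-1,  9,  3}, {-1, -1, 10},
    {-1, -1, -1}, {-1, -1, -1}, {-1, -1, -1}, {-1, -1, -1}
};

static const int RULE_LEN[7] = {0, 3, 1, 3, 1, 3, 1};  /* length of each right-hand side */
static const int RULE_LHS[7] = {0, 0, 0, 1, 1, 2, 2};  /* 0 = S, 1 = T, 2 = F */

/* Parse input, which must end in '#'. Returns the number of errors
   reported, or -1 if the driver cannot continue. */
int parse(const char *input)
{
    int stack[256], top = 0, errors = 0;
    const char *p = input;

    stack[0] = 0;
    for (;;) {
        const char *col = strchr(COLS, *p);
        if (*p == '\0' || col == NULL || top > 250)
            return -1;                       /* bad character or stack too deep */
        const char *act = ACTION[stack[top]][col - COLS];
        int n = atoi(act + 1);               /* state or rule number, 0 for "A" */

        switch (act[0]) {
        case 'A':
            return errors;
        case 'S':
            stack[++top] = n;
            p++;
            break;
        case 'R': {
            top -= RULE_LEN[n];              /* pop the right-hand side */
            if (top < 0)
                return -1;
            int g = GOTO[stack[top]][RULE_LHS[n]];
            if (g < 0)
                return -1;                   /* blank goto entry */
            stack[++top] = g;
            break;
        }
        case 'e':
            errors++;
            switch (n) {
            case 1: stack[++top] = 5;       puts("missing operand"); break;
            case 2: p++;                    puts("unmatched right parenthesis"); break;
            case 3: stack[++top] = 6;       puts("missing '+'"); break;
            case 4: stack[++top] = 6; p++;  puts("'*' instead of '+'"); break;
            case 5: stack[++top] = 11;      puts("missing right parenthesis"); break;
            }
            break;
        }
    }
}
```

Calling parse("a+*a#") would report one "missing operand" error and still reach the accept entry, because e1 pushes state 5 as if an a had been read.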

Error recovery in Yacc
The easiest way to recover from errors in Yacc is to use error productions. In practice, this corresponds
more to the idea of phrase level recovery discussed above. You must decide which non-terminals will
have error recovery procedures associated with them, and then add to the grammar productions of the
form A -> error α, where α is a string of grammar symbols (possibly empty). When Yacc finds an error,
it scans down the stack until it finds a state whose items include one of the form A -> . error α.
The parser then "shifts" the fictitious token error, and scans through the input until it finds a
substring matching α; once found, it removes everything up to the end of that substring from the input.
The parser then reduces to A, and continues. For example, an error production
Statement -> error ;
tells Yacc to skip beyond the next semi-colon and assume a statement had been parsed.
An appropriate error message can be generated at this point.
12. Syntax-directed Translation
Translation is the process of taking some input and converting it into some other form whose structure
and content is dependent on the structure and content of the input. We will do this for programming
languages by associating actions with the productions of the grammar defining the programming language.
Example 12.1 translating from infix expressions to postfix expressions
The following actions convert expressions from the grammar G to postfix notation:
number  rule          action
1       S -> S + T    print("+")
2       S -> T
3       T -> T * F    print("*")
4       T -> F
5       F -> ( S )
6       F -> a        print(a)
Parse a + a*a + a
a + a * a + a <=6 F + a * a + a <= T + a * a + a <= S + a * a + a <=6 S + F * a + a
<= S + T * a + a <=6 S + T * F + a <=3 S + T + a <=1 S + a <=6 S + F <= S + T <=1 S
Printing output in the order in which the reductions were applied (6, 6, 6, 3, 1, 6, 1) gives aaa*+a+,
which is the corresponding postfix expression.

The Value Stack
A more general scheme is to associate values with each symbol on the parsing stack. On the stack,
therefore, we have pairs of <symbol, value>, so we can think of this as two separate stacks, the symbol
stack and the value stack. We can then associate with each reduction some action to be carried out on
the value stack. The end result of a parse is then a report on whether the input had the correct syntax,
and a value derived from the input's structure.
Suppose we are about to apply the reduction A -> x1 x2 ... xn. The parsing stack then has the symbols
x1, x2, ... xn on the right. The values corresponding to these symbols we will call $1, $2, ... $n. On
performing the reduction, we remove the n symbols from the symbol stack (and eventually replace by A):
therefore, we will remove the top n symbols from the value stack, and replace by some new value defined
by the rule augmentation. Call this new value $$. The most general form of this action is then a
function, such that $$ = f($1, $2, ..., $n). In practice, this function might be an actual function, or
a sequence of lower level actions which take the $i values as parameters. We don't need to use all of
the $i.
Putting values on the stack
There are basically two cases:
• Putting on a non-terminal, and
• Putting on a terminal.
A non-terminal only goes on during a reduction (or the shift immediately following a reduction). This
corresponds to the evaluation of the function defined above. The values of the terminal symbols, on the
other hand, generally come from the lexical analysis.
Example 12.2 computing the values of expressions
Assume a lexical analyser returns the value of an integer along with the ID_T token.
1) S -> S + T      $$ := $1 + $3
2) S -> T          $$ := $1
3) T -> T * F      $$ := $1 * $3
4) T -> F          $$ := $1
5) F -> ( S )      $$ := $2
6) F -> a          $$ := $1
Parsing 1 + 2 * 3 is then as follows (a • marks a symbol that carries no value):
Symbol   Values   Stack           Input    Action
                  0               1+2*3#   S5
a        1        0 5             +2*3#    R6
F        1        0 3             +2*3#    R4
T        1        0 2             +2*3#    R2
S        1        0 1             +2*3#    S6
S+       1•       0 1 6           2*3#     S5
S+a      1•2      0 1 6 5         *3#      R6
S+F      1•2      0 1 6 3         *3#      R4
S+T      1•2      0 1 6 9         *3#      S7
S+T*     1•2•     0 1 6 9 7       3#       S5
S+T*a    1•2•3    0 1 6 9 7 5     #        R6
S+T*F    1•2•3    0 1 6 9 7 10    #        R3
S+T      1•6      0 1 6 9         #        R1
S        7        0 1             #        A
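The value-stack side of this trace can be replayed in a few lines of C. This is a sketch only: the names push_val, reduce and eval_1_plus_2_times_3 are hypothetical, the shifts and reductions are written out by hand in the order shown in the Action column rather than driven by a parse table, and operator symbols are given a dummy value of 0 where the trace shows •.

```c
#include <stdio.h>

static int val[64];      /* the value stack */
static int vtop = -1;    /* index of its top element */

static void push_val(int v) { val[++vtop] = v; }

/* Perform the action attached to the given rule: read $1..$n from the
   top n entries, pop them, and push the result $$. */
static void reduce(int rule)
{
    static const int rhs_len[7] = {0, 3, 1, 3, 1, 3, 1};
    int n = rhs_len[rule];
    int base = vtop - n + 1;          /* val[base] is $1, val[base + 2] is $3 */
    int result;

    switch (rule) {
    case 1:  result = val[base] + val[base + 2]; break;   /* $$ := $1 + $3 */
    case 3:  result = val[base] * val[base + 2]; break;   /* $$ := $1 * $3 */
    case 5:  result = val[base + 1];             break;   /* $$ := $2 */
    default: result = val[base];                 break;   /* $$ := $1 */
    }
    vtop -= n;
    push_val(result);
}

/* Replays the shift/reduce sequence of the trace for 1 + 2 * 3. */
int eval_1_plus_2_times_3(void)
{
    vtop = -1;
    push_val(1); reduce(6); reduce(4); reduce(2);   /* a, F, T, S */
    push_val(0);                                    /* "+" carries no value */
    push_val(2); reduce(6); reduce(4);              /* a, F, T */
    push_val(0);                                    /* "*" carries no value */
    push_val(3); reduce(6);                         /* a, F */
    reduce(3);                                      /* T -> T * F */
    reduce(1);                                      /* S -> S + T */
    return val[vtop];
}
```

After the final reduction the stack holds a single value, matching the last row of the trace.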
The Value Stack in Lex and Yacc
Lex
Yacc assumes values are passed to it in the global variable yylval. Lex places the lookahead in yytext;
it also must assign values to yylval. There are a number of possibilities:
1. The lookahead is a digit string. The internal value must be computed and placed in yylval.
2. The lookahead is a character string. It must be copied from yytext to a safe place, usually either:
   (i) a much larger string array, and the value placed in yylval is the position in which it starts in
       that larger array, or
   (ii) a dynamically allocated character string, and the value placed in yylval is the pointer to that
       string.
3. The lookahead is a string representing a real number. It should be converted to a floating point, and
   stored in a real array. Again, its position in the array will be passed to yylval.
4. The lookahead is an identifier or a keyword. User-defined identifiers must be stored as for strings
   (but only one copy should be kept).
Yacc
Yacc allows us to place an action after any production. This action will be performed at the moment the
reduction is performed (which is before the values are removed from the stack). The action is a C
statement within {...}. Values should be represented by the $i notation described above. When the
statement is reached by Yacc, it will translate the $i's into their appropriate values or array positions.
Example 12.3 using Yacc's value stack
S will be represented by "Expr", T by "Term" and F by "Factor".
%%
Finish : Expr                  { printf("%d", $1); }
       ;
Expr   : Expr PLUS_T Term      { $$ = $1 + $3; }
       | Term
       ;
Term   : Term MUL_T Factor     { $$ = $1 * $3; }
       | Factor
       ;
Factor : OB_T Expr CB_T        { $$ = $2; }
       | INT_T
       ;
%%

Definition 12.4
Attribute Grammar
With each symbol in the grammar, we associate a set of attributes. An attribute can represent any form
of information we require, including data type, number, pointer or string. The semantic rules we
associate with each production determine how the values of the attributes are computed.
In an attribute grammar, each grammar production A -> α has associated with it a set of semantic rules
of the form b := f(c1, c2, ..., cn), where f is a function, and c1, c2, ..., cn are attributes of any
of the grammar symbols in the production.
If b is an attribute of A, and the ci are attributes of symbols in α, then b is a synthesised attribute.
If b is an attribute of one of the symbols in α, then b is an inherited attribute.
Example 12.5 computing the value of expressions
    Production        Semantic Rules
1)  S -> E            print(E.val)
2)  E1 -> E2 + T      E1.val := E2.val + T.val
3)  E -> T            E.val := T.val
4)  T1 -> T2 * F      T1.val := T2.val * F.val
5)  T -> F            T.val := F.val
6)  F -> ( E )        F.val := E.val
7)  F -> digit        F.val := digit.lexval
The subscripts on symbols are simply to distinguish which symbol in the semantic rule refers to
which symbol in the syntax. The symbol digit is a terminal (or token), and it is assumed to have a
single attribute, returned by the lexical analyser. In this case, it will be the value of the particular
number token.

Definition 12.6
A syntax-directed definition which uses only synthesised attributes is called an S-attributed
definition.
We can augment parse trees for attribute grammars with the attribute values at each node: for S-attributed
definitions, we can evaluate all the attribute values by starting at the leaf nodes and applying the semantic
rules from the bottom to the top.
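Bottom-up evaluation of an S-attributed definition amounts to a post-order walk: each node's val is computed only after the vals of its children. The sketch below is illustrative, not the notes' implementation. The PNode type and function names are assumptions, and the chains of single productions (E -> T, T -> F, F -> digit) are collapsed so that only operator nodes and digit leaves remain.

```c
#include <stddef.h>

typedef struct PNode {
    char sym;                    /* '+', '*', or 'd' for a digit leaf */
    int val;                     /* synthesised attribute; lexval for a leaf */
    struct PNode *left, *right;  /* operands, NULL for a leaf */
} PNode;

/* Post-order evaluation: children first, then this node's semantic rule. */
int eval(PNode *n)
{
    if (n->sym == 'd')
        return n->val;                          /* F.val := digit.lexval */
    int l = eval(n->left);
    int r = eval(n->right);
    n->val = (n->sym == '+') ? l + r : l * r;   /* rules 2 and 4 of Example 12.5 */
    return n->val;
}

/* Builds the tree for 6+2*3 and evaluates it. */
int eval_6_plus_2_times_3(void)
{
    PNode six   = {'d', 6, NULL, NULL};
    PNode two   = {'d', 2, NULL, NULL};
    PNode three = {'d', 3, NULL, NULL};
    PNode mul   = {'*', 0, &two, &three};
    PNode add   = {'+', 0, &six, &mul};
    return eval(&add);
}
```

The value returned for 6+2*3 is the one that rule 1 would print at the root.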
Example 12.7
Annotated parse tree for the expression 6+2*3 (children are indented beneath their parent):
S  val = 12
  E  val = 12
    E  val = 6
      T  val = 6
        F  val = 6
          digit  lexval = 6
    +
    T  val = 6
      T  val = 2
        F  val = 2
          digit  lexval = 2
      *
      F  val = 3
        digit  lexval = 3

An inherited attribute is one whose value is determined by the values of the attributes of its parent or
siblings. They are useful for describing the way in which the meaning of a symbol depends upon the
context in which it appears. For example, we can use an inherited attribute to keep track of which side of
an assignment statement an identifier appears on, so that we know whether to use its address or value
during processing.
Example 12.8 using inherited attributes
The following grammar defines a language of integer or real variable declarations. The semantic
rules determine how the symbol table is to be updated, by passing the values of the inherited
attributes down from the attribute of the T symbol (which is synthesised from rules 2 and 3).
    Production        Semantic Rules
1)  D -> T L          L.t := T.t
2)  T -> int          T.t := integer
3)  T -> real         T.t := real
4)  L1 -> L2 , id     L2.t := L1.t, addtype(id, L1.t)
5)  L -> id           addtype(id, L.t)
The augmented parse tree for the declaration real id1, id2, id3 is shown below (children are indented
beneath their parent):
D
  T  t = real
    real
  L  t = real
    L  t = real
      L  t = real
        id1
      ,
      id2
    ,
    id3
The flow of information between the attributes is T.t -> L.t -> L.t -> L.t, from the top L down to the
lowest one, with each L.t also passed to addtype for the id that belongs to that L.

Dependency graphs
If the value of an attribute b depends on the value of attribute c, then the semantic rule for b must be
evaluated after that for c. The interdependencies between the attributes can be drawn as a dependency
graph (as above). A topological sort of a graph is an ordering of the attributes of the graph such that
all edges in the graph go from the attributes earlier in the ordering to attributes later. A topological
sort gives a valid order in which to evaluate the semantic rules.
There are a number of different methods for evaluating semantic rules.
1. Parse-tree based. At compile time, construct a parse tree, then a dependency graph, and then obtain a
   topological sort. Use the sort to determine the order in which to process the rules. This method works
   for all dependency graphs with no cycles.
2. Rule based. When the compiler is constructed, analyse the rules for dependencies. The order in which
   rules are to be evaluated is then fixed before compilation starts.
3. Oblivious. The compiler simply selects an evaluation order without analysing the dependencies. This
   obviously limits the class of attribute grammars that can be implemented.
Methods 2 and 3 are more efficient, in that no compile-time analysis is required.
Abstract Syntax Trees
A useful form of intermediate representation of a program is a syntax tree. Using syntax trees allows the
translation process to be separated from the parsing process. This is particularly useful for two reasons:
1. A grammar that is suitable for parsing might not explicitly represent the hierarchical nature of the
   programs it describes.
2. The parsing method constrains the order in which the nodes are considered. This may not be the best
   order for translation.
A syntax tree is a condensed parse tree, where the operators and keywords do not appear as leaves, but
are associated with the interior nodes that would have been their parent node in the parse tree. Also,
chains of single productions can be collapsed into a single branch.
Example 12.9 abstract syntax trees
The derivation step S => if B then S1 else S2 would have the syntax tree:
if then else
  B
  S1
  S2
The parse tree for 6+2*3 is shown first, followed by its syntax tree (children are indented beneath their
parent):
E
  E
    T
      F
        6
  +
  T
    T
      F
        2
    *
    F
      3

+
  6
  *
    2
    3

Example 12.10
creating abstract syntax trees
We will use the following functions, which return pointers to the newly created nodes:
mknode(op, left, right): creates an internal node for the operator "op", with two fields
containing pointers to the left and right operands;
mkleaf_id(id, string): creates a leaf node for the identifier "id", and a field containing a
pointer to a string for that identifier;
mkleaf_num(num, val): creates a leaf node, labelled "num", with a field containing the value
of the number.
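One possible C rendering of these three functions is sketched below. The Node layout and the Kind tags are assumptions made for illustration. The label arguments ("id" and "num") are dropped because each function fixes its own label, so mkleaf_id and mkleaf_num take one argument here instead of two. Allocation failure simply aborts.

```c
#include <stdlib.h>
#include <string.h>

typedef enum { N_OP, N_ID, N_NUM } Kind;

typedef struct Node {
    Kind kind;
    char op;                     /* N_OP: the operator */
    struct Node *left, *right;   /* N_OP: pointers to the operands */
    char *string;                /* N_ID: private copy of the identifier */
    int val;                     /* N_NUM: value of the number */
} Node;

static Node *alloc_node(Kind kind)
{
    Node *n = calloc(1, sizeof *n);   /* all fields start as 0 / NULL */
    if (n == NULL)
        abort();
    n->kind = kind;
    return n;
}

/* Internal node for operator op with two operand subtrees. */
Node *mknode(char op, Node *left, Node *right)
{
    Node *n = alloc_node(N_OP);
    n->op = op;
    n->left = left;
    n->right = right;
    return n;
}

/* Leaf for an identifier; the string is copied so the caller's buffer
   (for example yytext) can be reused. */
Node *mkleaf_id(const char *string)
{
    Node *n = alloc_node(N_ID);
    n->string = malloc(strlen(string) + 1);
    if (n->string == NULL)
        abort();
    strcpy(n->string, string);
    return n;
}

/* Leaf for a number. */
Node *mkleaf_num(int val)
{
    Node *n = alloc_node(N_NUM);
    n->val = val;
    return n;
}

/* The tree that would be built bottom-up for the input 6+2*x. */
Node *build_example(void)
{
    return mknode('+', mkleaf_num(6),
                  mknode('*', mkleaf_num(2), mkleaf_id("x")));
}
```

Because every function returns the pointer to the node it created, the pointers can be passed upwards as attribute values, which is how the leaves end up attached to the operator nodes.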
The grammar and semantic rules are given below. Each non-terminal in the grammar has an attribute
ptr, which keeps track of the pointers returned by the functions:
    Production        Semantic Rules
1)  E1 -> E2 + T      E1.ptr := mknode('+', E2.ptr, T.ptr)
2)  E -> T            E.ptr := T.ptr
3)  T1 -> T2 * F      T1.ptr := mknode('*', T2.ptr, F.ptr)
4)  T -> F            T.ptr := F.ptr
5)  F -> ( E )        F.ptr := E.ptr
6)  F -> id           F.ptr := mkleaf_id(id, id.string)
7)  F -> num          F.ptr := mkleaf_num(num, num.val)
The parse tree for 6+2*x is shown below, with the constructed syntax tree on the right.
[Figure: parse tree for 6+2*x in which every E, T and F node carries a ptr attribute pointing into the
syntax tree.]
The constructed syntax tree (children are indented beneath their parent):
+
  num 6
  *
    num 2
    id -> string for x


Example 12.11
The following grammar specifies compound statements:
CStat -> Stat ; CStat | Stat
Stat -> s
The string s ; s ; s ; s has the parse tree below, followed by one possible syntax tree (children are
indented beneath their parent):
CStat
  Stat
    s
  ;
  CStat
    Stat
      s
    ;
    CStat
      Stat
        s
      ;
      CStat
        Stat
          s

;
  s
  ;
    s
    ;
      s
      s
The semi-colon serves only to bind the statements into a sequence. A more natural tree is shown first
below. However, this requires each node to have arbitrarily many children. A better tree is shown second,
where statements are joined as siblings. This requires only one extra field in our syntax tree nodes.
seq
  s
  s
  s
  s

seq
  s -> s -> s -> s      (each s points to its next sibling)

A note on implementing attribute grammars in Yacc and Lex
The value stack in Yacc maintains a single value for each symbol on the symbol stack. In the
syntax-directed definitions in these notes, however, we use multiple attributes of different types for
the symbols. We can implement this in Yacc as follows.
Symbol Types
Internally, Yacc declares each value as a C union. List all types that will be required in a %union
declaration in the definitions section of the Yacc script - e.g.
%union
{
    int intval;
    char *strptr;
    struct table *tblptr;
}
which declares three symbol types - an integer value, a string pointer, and a pointer to some table
structure (which would have to be declared elsewhere).
Each token must be declared to use one of the types from the union, using the %token declaration - e.g.
%token <intval> INT_T
%token <tblptr> ID_T
Non-terminals must also be declared, using the %type declaration - e.g.
%type <intval> Expr Term Factor
Referring to a value using the $$, $1,... notation causes Yacc to use the appropriate field of the union.
Multiple Attributes
Multiple attributes can be implemented using the symbol table (see Section 13) which is defined to have
a number of attribute places for each entry. Instead of referring explicitly to values on the value
stack, we would then refer to the symbol table entry, and extract the appropriate attribute value as
required.
Inherited Attributes
The Yacc value stack is designed for synthesised attributes - that is, when a rule is used as a reduction,
the values of all symbols on the right hand side are known. In some cases, however, we would like to use
inherited attributes to assign values to symbols on the right hand side. Yacc does allow us to do this,
by accessing symbols on the internal stack to the left of the current rule, using the notation $0, $-1,
$-2, ... . Thus we might have the rules:
Decl   : Type Idlist ;
Type   : KEY_REAL_T        {$$ = 1;}
       | KEY_INT_T         {$$ = 2;}
       ;
Idlist : Idlist ID_T       {action($0, $2);}
       | ID_T              {action($0, $1);}
       ;
where action(...) is some function which assigns type information. The symbol Type, which contains that
information, always occurs one place to the left of the Idlist non-terminal, and this symbol's value is
referred to by the $0 notation. If we wanted to refer to the symbol two places to the left, we would use
$-1, etc. Note that these rules are different from the ones given in lectures (p48) - here the values are
not passed in the Decl rules, but in the Idlist rule.
This use of the value stack is not recommended. It is very hard to keep track of positions in more
complicated grammars - to use this notation, every time we use Idlist, we must be confident that the
symbol one place to the left in the internal stack has the appropriate value.
The preferred method of dealing with inherited values is to create a list of pointers to the attribute
places and maintain this list as an attribute of Idlist. The Decl rule would then use this list to assign
the correct attribute values to the various identifiers.
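The preferred method can be pictured in plain C, independently of Yacc. The sketch below assumes a hypothetical Entry type standing in for a symbol table entry and two hypothetical helpers: idlist_add would be called from the actions of the two Idlist rules, and decl_assign from the action of the Decl rule, where both the type and the complete list are available as ordinary synthesised values.

```c
#include <stdlib.h>

/* Hypothetical symbol-table entry with one attribute place for the type. */
typedef struct Entry {
    const char *name;
    int type;
} Entry;

/* The value carried by Idlist: a list of pointers to attribute places. */
typedef struct IdList {
    Entry *entry;
    struct IdList *next;
} IdList;

/* Action for "Idlist : Idlist ID_T" (rest = $1) and for
   "Idlist : ID_T" (rest = NULL): add one more identifier to the list. */
IdList *idlist_add(IdList *rest, Entry *e)
{
    IdList *cell = malloc(sizeof *cell);
    if (cell == NULL)
        abort();
    cell->entry = e;
    cell->next = rest;
    return cell;
}

/* Action for "Decl : Type Idlist": the type is now known, so store it
   in every listed entry and release the list. */
void decl_assign(int type, IdList *ids)
{
    while (ids != NULL) {
        IdList *next = ids->next;
        ids->entry->type = type;
        free(ids);
        ids = next;
    }
}
```

Nothing here looks to the left of the current rule on the stack, so the Idlist rules stay correct wherever Idlist is used.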
13. Symbol Table
The symbol table stores information about various source language constructs. Information is built up
during the analysis stages of compiling, and is used in succeeding stages. Finally, the code generation
phase uses the information in the table to generate the target code.
The symbol table is central to the work of the compiler. In practice, efficient methods of manipulating and
storing the table must be used. In this course, though, we will not consider efficiency - we will use a
linked list and some operations for manipulating the information.
In some compilers, the symbol table is used extensively during lexical analysis and parsing, to represent
information and resolve ambiguities. In other cases, lexical analysis and parsing simply construct a
complete abstract syntax tree. This tree is then analysed to produce the symbol table.
There are three main functions we need to implement:
• lookup(s):
determines whether a particular string has already been stored; returns the index of the
table entry, or 0 (or -1 in some systems) if it has not been stored
• insert(s,t):
inserts a new string (of token t) into the table; returns the index of the new entry
• delete(s):
deletes an entry from the table (or, typically, hides it)
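A minimal linked-list version of these three operations might look as follows. It is a sketch under assumptions: the names sym_lookup, sym_insert and sym_delete stand for lookup, insert and delete, new entries are added at the front of the list instead of the end, each entry owns a copy of its string instead of pointing into a shared string array, and -1 is used for "not stored".

```c
#include <stdlib.h>
#include <string.h>

typedef struct SymEntry {
    int index;               /* position in insertion order, starting at 1 */
    int token;               /* token code, e.g. the value of ID_T */
    char *str;               /* private copy of the string */
    struct SymEntry *next;
} SymEntry;

static SymEntry *first = NULL;   /* most recently inserted entry */
static int length = 0;           /* number of entries ever inserted */

/* Returns the index of the most recent entry for s, or -1 if none. */
int sym_lookup(const char *s)
{
    for (SymEntry *e = first; e != NULL; e = e->next)
        if (strcmp(e->str, s) == 0)
            return e->index;
    return -1;
}

/* Stores s with the given token and returns the index of the new entry. */
int sym_insert(const char *s, int token)
{
    SymEntry *e = malloc(sizeof *e);
    if (e == NULL)
        abort();
    e->str = malloc(strlen(s) + 1);
    if (e->str == NULL)
        abort();
    strcpy(e->str, s);
    e->token = token;
    e->index = ++length;
    e->next = first;
    first = e;
    return e->index;
}

/* Removes the most recent entry for s, if there is one. */
void sym_delete(const char *s)
{
    for (SymEntry **p = &first; *p != NULL; p = &(*p)->next) {
        if (strcmp((*p)->str, s) == 0) {
            SymEntry *dead = *p;
            *p = dead->next;
            free(dead->str);
            free(dead);
            return;
        }
    }
}
```

Inserting at the front means a lookup always meets the newest entry for a name first, which is convenient once the same name can be declared more than once.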
Example 13.1 A Simple Symbol Table Implementation
An initial node will point to the first and last entries, and store the length of the table. A separate array
will store all the string identifiers. Each node will be of the form:
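index   token   atts   next   strPtr
7       ID_T    •      •      •
(atts, next and strPtr are pointers: to the attributes, to the next node, and into the string array)
The table will be of the form:
Table:  first -> node 1    length = 78    last -> node 78

index   token   atts   next           strPtr
1       ID_T    •      -> node 2      -> "count#"
2       ID_T    •      -> ...         -> "i#"
...
78      ID_T    •      (none)         -> "name#"

String array:  c o u n t # i # ... n a m e # ...
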
Declarations
There are four basic kinds of declaration that may require entries in the Symbol Table:
Constant:   e.g. const int MAX = 10000;
Type:       e.g. struct Entry {
                     int index;
                     char *strPtr;
                 };
Variable:   e.g. int count, marks[100];
Function:   e.g. int gcd(int n, int m) {
                     if (m == 0) return n;
                     else return gcd(m, n % m);
                 }
Constant and variable declarations can be stored in the table in the style shown above. Type declarations
may require more work, while function declarations are normally indexed by their name, with the code
being treated separately. In some compilers, separate symbol tables are used for each different kind of
declaration; in others, each separate region of the program (e.g. functions) may be given a separate table.
The attributes stored with each entry will depend on the kind of declaration: constant declarations will
typically have value bindings; type, variable and function declarations will have type signatures. Variables
will have pointers to allocated memory for storing values. Functions may have pointers to code
representations. All kinds may have scope attributes, defining when memory should be allocated, and
when it should be accessible.
Example 13.2 Single scope
In languages with very restricted scoping rules (and in other situations) it is possible to construct
the symbol table during lexical analysis.
Augment the lexical analysis rule for recognising identifiers as follows:
{L}+    { entry = lookup(yytext);
          if (entry == -1)
              entry = insert(yytext, ID_T);
          yylval.entry = entry;    /* pass the table index to the parser */
          return ID_T; }
This will insert an entry the first time an identifier is encountered. Ensuring that, for example, it is
properly declared will be a function of the semantic analysis phase.

Scoping Rules
Many languages allow programs to be constructed from blocks. In C, blocks are files, function
declarations and compound statements (between "{" and "}"). Also, structures and unions can be
considered to be blocks. The use of blocks complicates the symbol table, as the same identifier can refer
to different data objects depending on the position it occurs in the code, and the scoping rules. In this
situation, it is not sufficient to use lookup during lexical analysis.
Example 13.3 Nested scope
int i;
int f1(int k) {
    int j;
    ...
    print(i);
}
int f2() {
    int j;
    ...
}
In the above C program fragment, i is a global (integer) variable, normally accessible throughout
the code. When f1 is entered, a new (integer) entry is required for k. Immediately, a new
(integer) entry for j is also required. Inside the function, i -- in print(i) -- refers to the global
variable. Once f1 is exited, the entries for j and k are deleted. Once f2 is entered, a new (integer)
entry for j is required - note that this is a different variable from the one inside f1.

To implement nested scopes, the lookup function must find the most recently inserted declaration, the
insert function must not overwrite previous declarations of the same name, but should hide them,
while the delete function should only delete (or hide) the most recent declaration and uncover the
previous one. The symbol table should thus behave as a stack.
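The stack behaviour can be illustrated with an array in which every entry records the nesting level at which it was declared. This is a simplified sketch: the names declare, find, enter_scope and exit_scope are hypothetical, entries are physically removed on leaving a block instead of being hidden, and the table stores the caller's string pointers without copying them.

```c
#include <string.h>

#define MAX_SYMS 256

typedef struct {
    const char *name;   /* not copied: the caller keeps the string alive */
    int level;          /* nesting level of the declaration */
} ScopedSym;

static ScopedSym table[MAX_SYMS];
static int count = 0;   /* number of live entries; the top is count - 1 */
static int level = 0;   /* current nesting level, 0 = global */

void enter_scope(void) { level++; }

/* Leaving a block pops every declaration made inside it, which
   uncovers any outer declarations of the same names. */
void exit_scope(void)
{
    if (level == 0)
        return;
    while (count > 0 && table[count - 1].level == level)
        count--;
    level--;
}

/* Pushes a declaration; returns its index, or -1 if the table is full. */
int declare(const char *name)
{
    if (count == MAX_SYMS)
        return -1;
    table[count].name = name;
    table[count].level = level;
    return count++;
}

/* Searches from the top so that the most recent declaration wins. */
int find(const char *name)
{
    for (int i = count - 1; i >= 0; i--)
        if (strcmp(table[i].name, name) == 0)
            return i;
    return -1;
}
```

Replaying Example 13.3 with these functions, the j declared in f1 disappears at the end of f1, and the j declared in f2 is a fresh entry, while i stays visible throughout.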
Example 13.4 Nesting Level
One possible way of obeying scope rules while constructing the symbol table during the first pass
of the compiler is to use explicit nesting level and scope variables. Use an explicit stack where the
top entry represents the current nesting level and scoping identifier. We also need last, the index
of the last entry in the table. Initially, the top of the stack is set to (0,0), and last is set to 0.
Consider the following grammar fragment for recognising programs similar to 13.3:
Prog  -> Dec Prog
Prog  -> Main
Dec   -> VDec ;
Dec   -> FDec
VDec  -> int id
FDec  -> SFDec Par ) { CStat }        { decrement(stack); }
SFDec -> int id (                     { increment(stack); }
Par   -> ε
Par   -> VDec
Par   -> PList , VDec
PList -> VDec
PList -> VDec , PList
The lexical analysis action then becomes
{L}+    { entry = lookup(yytext, stack);
          if (entry == -1) insert(yytext, ID_T, stack); }
insert now places an entry at the end of the table, and associates the pair of values at top of the
stack as nesting level and scope attributes.
lookup now searches the symbol table for a matching string. When it finds a match, it checks the
nesting level, and then moves down the stack until it finds the entry with the same nesting level. If
the index of the match is less than the corresponding scope value, it ignores it and continues with
the search. If no appropriate match is found, return -1.
decrement simply deletes the top element of the stack.
increment adds a new element to the top of the stack, incrementing the nesting level, and
assigning the last index as the scope value.
A parse tree and associated symbol table for 13.3 are shown below.
[Figure: parse tree for the program of 13.3. The four points at which a stack action is triggered are
numbered: 1 (SFDec for f1), 2 (FDec for f1), 3 (SFDec for f2) and 4 (FDec for f2).]

Index   Str   Nest   Scope   Atts
0       i     0      0       ...
1       f1    0      0       ...
2       k     1      1       ...
3       j     1      1       ...
4       f2    0      0       ...
5       j     1      4       ...

The changes in the stack are as follows (top on the right):
Last (successive values):                 0, 1, 1, 2, 3, 3, 4, 4, 5
Stack (Nest,Scope), successive states:    (0,0)
                                          (0,0), (1,1)
                                          (0,0)
                                          (0,0), (1,4)
                                          (0,0)
Instead of attempting to complete all compilation in a single pass, it is often easier to make a number of
passes through the program. Using the techniques of section 12, an abstract syntax tree can be constructed
during the parsing phase. This tree can then be processed to build the symbol table and to support the later
phases of the compilation. Although this may be slower, it can result in more natural grammars, and
simpler translation and analysis routines.
Example 13.5
A possible abstract syntax tree for the program of 13.3 is shown below. From this, it should be
easy to see the nesting levels and scope of the different declarations.
Prog
  VDec
    int
    id i
  func
    int
    id f1
    VDec                  (parameter)
      int
      id k
    VDec
      int
      id j
    print
      id i
  func
    int
    id f2
    ε                     (no parameters)
    VDec
      int
      id j

14. Type Checking
The final part of the analysis phases of compilation we will consider is type checking, where the compiler
checks that operators, functions and procedures are not applied to objects of incompatible datatypes.
Definition 14.1
A type checker verifies that the type of a construct matches that expected by its context.
Example 14.2 type signature
The PASCAL arithmetic operator "mod" requires two integer operands, and returns an integer. We
can describe this by a signature:
_mod_ : integer × integer → integer
The underscores on either side of the mod operator indicate that it is an infix operator (that is, it is
placed between its two operands). After the colon is the signature, which here indicates that the
operator takes two integers, and returns an integer.

Type information will be required when intermediate code is generated. Operators like "+" can be used in
a number of different ways, and the particular way depends on the context. Four different signatures can
be given for "+":
_+_ : integer × integer -> integer
_+_ : integer × real -> real
_+_ : real × integer -> real
_+_ : real × real -> real
In the second and third case, some form of type translation will be required, in order to allow the integers
and reals to be added together. Once the types have been determined, the intermediate code generator can
put in the required conversion operations.
The "+" operator is an example of an overloaded operator - that is, an operator which represents different
operations in different contexts.
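A minimal Python sketch of how a checker might resolve the overloaded "+" from the four signatures above. The names PLUS_SIGNATURES and result_type are illustrative assumptions, not part of these notes; the value "type_error" anticipates the special basic type introduced in Definition 14.3.

```python
# Signatures of "+" as listed above: (left operand, right operand) -> result.
PLUS_SIGNATURES = {
    ("integer", "integer"): "integer",
    ("integer", "real"): "real",
    ("real", "integer"): "real",
    ("real", "real"): "real",
}

def result_type(left, right):
    """Return the result type of left + right, or "type_error" if no signature matches."""
    return PLUS_SIGNATURES.get((left, right), "type_error")

print(result_type("integer", "real"))   # mixed operands: the integer must be converted
print(result_type("char", "integer"))   # no signature applies
```

Once the matching signature is known, the code generator can tell whether a conversion is needed: it is needed exactly when an operand type differs from the result type.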
Type Expressions
In Pascal and C, types are either basic or constructed. Basic types have no internal structure as far as the
programmer is concerned - for example, boolean, character and integer in Pascal. Constructed types are
built from basic types and other constructed types, such as arrays, records and sets in Pascal.
Each language construct has a type associated with it implicitly; this will be denoted by a type expression.
Definition 14.3
A type expression is either a basic type, or is formed by applying an operator called a type
constructor to other type expressions.
1. A basic type is a type expression (e.g. boolean, char, integer). A special basic type called
type_error will indicate an error found during type checking. A basic type, void, indicates
the absence of a value, allowing constructs with no type to be checked.
2. A type name is a type expression.
3. A type constructor applied to type expressions is a type expression. Constructors include:
(a) arrays - if T is a type expression, then array(I,T) is a type expression denoting the type
of an array with elements of type T and index set I. For example, the Pascal declaration
    var A : array[1..10] of integer;
associates the type expression array(1..10,integer) with A.
(b) products - if T1 and T2 are type expressions, then so is T1 × T2
(c) records - a record is a product with names for its fields. The record type constructor
will apply to a tuple formed from field names and types. E.g.:
    type row = record
        address : integer;
        lexeme : array[1..15] of char;
    end;
declares the type name row representing the type expression:
    record((address × integer) × (lexeme × array(1..15,char)))
(d) functions - mathematically a function maps elements of one set to another set. We will
treat functions in programming languages as mapping from a type D to a type R (from
domain to range). The type will be denoted D -> R. E.g.:
    function f(a, b : char) : integer;
has the type expression
    char × char -> integer

It is sometimes convenient to represent type expressions as graphs. We can use abstract syntax trees, with
nodes for type constructors, and leaves for basic types and names.
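The tree view of type expressions can be sketched in Python, with one node class per type constructor and basic types as leaves. The class names (Basic, Array, Product, Record, Function) and the helper show are assumptions made for this illustration only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Basic:
    name: str                # e.g. "char", "integer", "void", "type_error"

@dataclass(frozen=True)
class Array:
    index: tuple             # index set as (low, high), e.g. (1, 10)
    elem: object             # type expression of the elements

@dataclass(frozen=True)
class Product:
    left: object
    right: object

@dataclass(frozen=True)
class Record:
    fields: tuple            # tuple of (field name, type expression) pairs

@dataclass(frozen=True)
class Function:
    domain: object
    result: object

CHAR, INTEGER = Basic("char"), Basic("integer")

# var A : array[1..10] of integer;
a_type = Array((1, 10), INTEGER)
# type row = record address : integer; lexeme : array[1..15] of char; end;
row_type = Record((("address", INTEGER), ("lexeme", Array((1, 15), CHAR))))
# function f(a, b : char) : integer;
f_type = Function(Product(CHAR, CHAR), INTEGER)

def show(t):
    """Render a type expression in the notation of Definition 14.3."""
    if isinstance(t, Basic):
        return t.name
    if isinstance(t, Array):
        return f"array({t.index[0]}..{t.index[1]},{show(t.elem)})"
    if isinstance(t, Product):
        return f"{show(t.left)} × {show(t.right)}"
    if isinstance(t, Record):
        return "record(" + " × ".join(f"({n} × {show(ft)})" for n, ft in t.fields) + ")"
    return f"{show(t.domain)} -> {show(t.result)}"

print(show(a_type))
print(show(row_type))
print(show(f_type))
```

Because the classes are frozen dataclasses, two type expressions compare equal exactly when their trees have the same structure, which is convenient for the equality tests a type checker performs.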
Example 14.4 type trees
Possible trees for the type expressions of 14.3 are:

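[Figure: type trees for the type expressions of Definition 14.3. Panels: array (children: index range
and element type), product (children T1 and T2), record (fields address and lexeme, with their types),
and function f (argument types char and char, return type integer).]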
Note the use of siblings to represent the different elements of products, but the use of child nodes to
represent the function name and return type.

Type Systems
A type system is a collection of rules for assigning type expressions to the different parts of a program. A
type checker implements a type system.
Since type checking has the potential for discovering errors in programs, it is important for a type checker
to do something reasonable when an error is discovered. The compiler must report the nature and location
of the error, but again the checker should recover from the error so that the rest of the input can be
processed. A type checker able to handle errors may result in a more complicated grammar than that
required solely for processing correct programs. Again, for that reason, some type checkers operate on the
abstract syntax tree rather than during parsing.
Example 14.5 Specifying the type checker
We now specify a type checker for a simple language which requires declaration of identifiers
before their use. The grammar below generates programs, represented by the non-terminal P,
consisting of a sequence of declarations D followed by a single expression E.
P -> D ; E
D -> D ; D
D -> id : T
T -> char | integer | array[num] of T
E -> num | id | E mod E | E [E] | id := E
The language has two basic types: char and integer. The two special basic types type_error and
void are used to signal errors and the absence of a type respectively. Arrays are assumed to start at
index 1, so the declaration
array [256] of char
leads to the type expression array(1..256,char).
In the translation scheme given below (for a one-pass compiler), actions add type information to
the symbol table entry for the identifiers.
P  -> D ; E
D  -> D ; D
D  -> id : T               addtype(id.entry, T.type)
T  -> char                 T.type := char
T  -> integer              T.type := integer
T1 -> array[num] of T2     T1.type := array(1..num.value, T2.type)
These actions allow the type of all declared identifiers to be added to the symbol table. The
expressions can now be checked.
E  -> num            E.type := integer
E  -> id             E.type := lookup(id.entry)
E1 -> E2 mod E3      E1.type := if E2.type = integer and E3.type = integer
                                then integer
                                else type_error
E1 -> E2 [E3]        E1.type := if E3.type = integer and E2.type = array(s,t)
                                then t
                                else type_error
E1 -> id := E2       E1.type := if lookup(id.entry) = E2.type
                                then void
                                else type_error
Numbers are of type integer.
"lookup(x)" searches the symbol table and returns the stored type of entry x.
The mod operator requires that both its operands are of type integer. If so, then the resulting
expression is also of type integer; if not, then there is an error.
For the array lookups, the index to the array must be of type integer, and the type of the array
name must be an array type. If both of these conditions are met, then the type of the
expression is the same as the type of the elements of the array.
For the assignment expressions, the type of the identifier must match the type of the expression. If
so, then the special type void is returned; otherwise, the value type_error is returned.
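The rules of Example 14.5 can be tried out with a small Python sketch that checks an abstract syntax tree instead of running during parsing. The tuple encoding of expressions and types, and the guard for undeclared identifiers, are assumptions of this sketch; addtype and lookup follow the names used in the translation scheme.

```python
# Type expressions: "char", "integer", "void", "type_error", or ("array", low, high, elem).
symbol_table = {}                        # identifier -> type expression

def addtype(name, t):
    symbol_table[name] = t

def lookup(name):
    # Assumption: an undeclared identifier is reported as type_error.
    return symbol_table.get(name, "type_error")

def array_of(num, elem):
    return ("array", 1, num, elem)       # arrays are assumed to start at index 1

def check(e):
    """Return the type of expression e, following the rules of Example 14.5."""
    kind = e[0]
    if kind == "num":                    # E -> num
        return "integer"
    if kind == "id":                     # E -> id
        return lookup(e[1])
    if kind == "mod":                    # E1 -> E2 mod E3
        ok = check(e[1]) == "integer" and check(e[2]) == "integer"
        return "integer" if ok else "type_error"
    if kind == "index":                  # E1 -> E2 [E3]
        arr, idx = check(e[1]), check(e[2])
        if idx == "integer" and isinstance(arr, tuple) and arr[0] == "array":
            return arr[3]
        return "type_error"
    if kind == "assign":                 # E1 -> id := E2
        target, value = lookup(e[1]), check(e[2])
        if target == "type_error":       # added guard, not part of the rule in the notes
            return "type_error"
        return "void" if target == value else "type_error"
    raise ValueError(f"unknown expression kind: {kind}")

# Declarations:  n : integer ; c : char ; a : array[256] of char
addtype("n", "integer")
addtype("c", "char")
addtype("a", array_of(256, "char"))

print(check(("mod", ("id", "n"), ("num", 3))))                      # integer
print(check(("assign", "c", ("index", ("id", "a"), ("id", "n")))))  # void
print(check(("mod", ("id", "c"), ("num", 2))))                      # type_error
```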

15. Runtime Environment
After the analysis phases are complete, the compiler must generate executable code. In particular, the
compiler must generate code to maintain the structure of the target machine's registers and memory during
execution. In this section, we consider the types of environment that are required.
In most compiled languages executable code is stored in a fixed area of RAM which cannot be changed
during execution. The code for each different function or procedure is stored separately, at a known
address (or at a known offset from a base address). Static data (e.g. constants or strings known at compile
time) and global variables can also be stored in this fixed area. The remainder of the data, plus
bookkeeping information for control flow, will be stored in areas that will be allocated dynamically during
execution.
Example 15.1 Simple runtime storage structure
entry address ->  code for function 1
entry address ->  code for function 2
                  ...
entry address ->  code for function n
                  global/static area
                  stack
                  free space
                  heap
The stack is used for data that can be allocated in a last-in, first-out manner, while the heap area is
used for other data (e.g. memory allocated dynamically and accessed through C pointers).

Definition 15.2
A procedure activation record is a section of memory allocated each time a procedure is called. It
contains space for arguments, local data and local temporary variables, together with pointers to the
code area and to the activation record of the procedure which called it.

Definition 15.3
In a fully static environment, no procedures can be called recursively, there are no pointers, and no
dynamic memory allocation - for example, FORTRAN77. In such an environment, we only ever
need to maintain one procedure activation record for each procedure, as it is not possible for more
than one copy of a single procedure to be in use simultaneously. Thus, at compile time, we can
construct a procedure activation record for each procedure. Each time a procedure is called, we
compute its arguments and store them in the appropriate record, and store the address of the
calling procedure. We then jump to the start of the code for the current procedure, execute it, using
the space in the current record for maintaining data, and on exit, jump back to the return address.
Example 15.4 A simple static environment
1  int i = 10;
2  int f1(int j) {
3    int k;
4    k = 3 * j;
5    if (k < i) print(i);
6    else print(j);
7  }
8  main() {
9    int k = 1;
10   while (k < 5) {
11     f1(k);
12     k = k+1;
13   }
14 }
                            initial        on entry       on reaching
                            environment    to f1          line 14
global area
  i (int):                                 10             10
activation record: main
  k (int):                                 1              5
  start code ptr:           8              8              8
  current code ptr:                        11             14
activation record: f1
  j (int):                                 1
  start code ptr:           2              2              2
  current code ptr:                        2
  return address:                          11
  k (int):
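The following Python sketch mimics the fully static environment of Example 15.4: exactly one record per procedure exists before execution starts, and every call to f1 overwrites the same record. The dictionary layout and the names globals_area, records and output are assumptions of the sketch; Python's own call mechanism stands in for the jumps.

```python
# One fixed record per procedure, created before execution starts.
globals_area = {"i": 10}
records = {
    "main": {"k": None},
    "f1": {"j": None, "k": None, "return_address": None},
}
output = []                               # stands in for print()

def call_f1(argument, return_address):
    rec = records["f1"]                   # always the same record: no recursion possible
    rec["j"] = argument
    rec["return_address"] = return_address
    rec["k"] = 3 * rec["j"]               # line 4
    if rec["k"] < globals_area["i"]:      # line 5
        output.append(globals_area["i"])
    else:                                 # line 6
        output.append(rec["j"])
    return rec["return_address"]

def main():
    rec = records["main"]
    rec["k"] = 1                          # line 9
    while rec["k"] < 5:                   # line 10
        call_f1(rec["k"], return_address=11)   # line 11
        rec["k"] = rec["k"] + 1           # line 12

main()
print(output)            # [10, 10, 10, 4]
print(records["main"])   # {'k': 5}
```

If f1 called itself, the second call would overwrite j, k and the return address of the first, which is why recursion is ruled out in this kind of environment.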

Definition 15.5
In a stack-based environment, procedures may be called recursively. It is not sufficient to maintain
a single activation record for each procedure. A stack is required, onto which new records are
placed each time a procedure is called, and from which old records are deleted when procedures
exit. Each procedure may have several records on the stack at any one time. Each activation record
should maintain a pointer to the previous activation record, to allow it to be recovered on exit. The
environment requires a pointer to the current activation record, and a pointer to the last allocated
position on the stack.
Example 15.6 A simple stack-based environment
1  int x, y;
2  int gcd(int u, int v) {
3    if (v == 0) return u;
4    else return gcd(v, u % v);
5  }
6  main() {
7    scanf("%d%d", &x, &y);
8    printf("%d\n", gcd(x,y));
9  }
In each snapshot, fp points to the current activation record and sp to the last allocated position on
the stack.

initial environment
  global area
    x (int):
    y (int):
  activation record: main                  <- fp
    start code address: 6
    current code address:                  <- sp
  free space

on 1st entry to gcd
  global area
    x (int): 15
    y (int): 10
  activation record: main
    start code address: 6
    current code address: 8
  activation record: gcd                   <- fp
    u (int): 15
    v (int): 10
    start code address: 2
    current code address: 2
    return pointer: -> record of main
    return address: 8                      <- sp
  free space

on 3rd entry to gcd
  global area
    x (int): 15
    y (int): 10
  activation record: main
    start code address: 6
    current code address: 8
  activation record: gcd
    u (int): 15
    v (int): 10
    start code address: 2
    current code address: 4
    return pointer: -> record of main
    return address: 8
  activation record: gcd
    u (int): 10
    v (int): 5
    start code address: 2
    current code address: 4
    return pointer: -> previous record of gcd
    return address: 4
  activation record: gcd                   <- fp
    u (int): 5
    v (int): 0
    start code address: 2
    current code address: 2
    return pointer: -> previous record of gcd
    return address: 4                      <- sp
  free space

about to exit main
  global area
    x (int): 15
    y (int): 10
  activation record: main                  <- fp
    start code address: 6
    current code address: 9                <- sp
  free space
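A small Python sketch of the stack-based environment of Example 15.6, keeping an explicit list of activation records alongside Python's own recursion. The record layout (a dictionary with a control_link field playing the role of the return pointer) and the counter max_depth are assumptions made for illustration.

```python
stack = []                 # the runtime stack of activation records
max_depth = 0              # deepest stack seen, for illustration only

def push(name, return_address, **data):
    """Allocate a record on top of the stack and return its index (the new fp)."""
    global max_depth
    control_link = len(stack) - 1 if stack else None   # previous activation record
    stack.append({"proc": name, "control_link": control_link,
                  "return_address": return_address, **data})
    max_depth = max(max_depth, len(stack))
    return len(stack) - 1

def pop():
    """Delete the top record when a procedure exits."""
    return stack.pop()

def gcd(u, v, return_address):
    fp = push("gcd", return_address, u=u, v=v)
    rec = stack[fp]
    if rec["v"] == 0:                                   # line 3
        result = rec["u"]
    else:                                               # line 4
        result = gcd(rec["v"], rec["u"] % rec["v"], return_address=4)
    pop()
    return result

def main(x, y):
    push("main", return_address=None)
    result = gcd(x, y, return_address=8)                # line 8
    pop()
    return result

print(main(15, 10))    # 5
print(max_depth)       # 4: main plus three gcd records
```

The maximum depth of four records corresponds to the snapshot taken on the third entry to gcd.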

The details of how much space to allocate for an activation record and the offsets to be computed to reach
the appropriate data items must be provided by the compiler. This is covered in the next chapter. Note that
the stack-based environment presented here is particularly simple - there is no discussion of variable
length data, temporary variables, internal blocks and nested declarations, local procedures (as in Pascal),
nor procedures as arguments. Detailed descriptions of methods for dealing with such situations are given
by Aho et al. (1986).
Definition 15.7
In a dynamic environment, activation records are not maintained on a stack, but must exist and be
accessible for as long as all references to them exist, and must be capable of being dynamically
deallocated when they become inaccessible (a process called garbage collection).
Example 15.8 dangling references
int *dangle() {
    int x;
    return &x;
}
In a stack-based environment, the variable x is allocated space only during the lifetime of the
function dangle(). Once the function has returned, the space is reclaimed, and will be reused by
subsequent procedure calls. However, the address of x has been returned, and will be assumed to
point to an integer, although the memory location could now contain anything. C uses a stack-based
environment, so using the pointer returned by the above procedure is a logical error (undefined
behaviour in C). Other languages do not have this restriction, and so require dynamic environments.

Dynamic allocation is handled in the heap area. A heap provides two operations: allocate and free.
Allocate takes a size parameter, and returns the address of a block of memory of the correct size. Free
takes an address, and marks it as being free. The main problem in heap management is that the memory
can quickly become fragmented, unless contiguous free memory blocks are combined into a whole. A
second problem is in ensuring that free is only ever applied to the start of an allocated block of the
appropriate size, or corruption can result.
Example 15.9 simple heap management
Maintain a circular linked list of allocated memory blocks. Each block to be allocated is headed by
some bookkeeping information, with the address of the next allocated block, the size of the used space,
and the size of the following free space. The first element of the list is the top of the heap, which
also has a pointer to a block with some free space.
To allocate a new block, move round the list until we find an element with enough free space. Create a
new element, insert it into the list after the selected element's used space, set the new element's
free size to the selected element's free size minus the size of the new element, and set the selected
element's free size to null.
To free a block, move to the start of the list, and step through until the appropriate address is
found - if it is not found, the address is invalid. Add the current block size and its free size to its
predecessor's free size, and delete from the list.
The figure below shows the heap during a sequence of allocations and deallocations.
[Figure: four snapshots of the heap during a sequence of allocations and deallocations. Each snapshot
shows the list header (last, next) followed by blocks, each consisting of a block header (next, used sz,
free sz), its used space and any following free space.]
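The allocate and free steps of Example 15.9 can be sketched in Python as follows. This is a simplification under stated assumptions: a Python list replaces the circular linked list, the search always starts at the top of the heap, header sizes are ignored, and a free size of 0 stands for "null". The names Block and Heap are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Block:
    addr: int      # start address of the used space
    used: int      # size of the used space
    free: int      # size of the free space that follows the used space

class Heap:
    def __init__(self, size):
        # The first element is the top of the heap: nothing used, everything free.
        self.blocks = [Block(addr=0, used=0, free=size)]

    def allocate(self, size):
        """Return the address of a new block of the given size, or None if none fits."""
        for pos, b in enumerate(self.blocks):
            if b.free >= size:
                new = Block(addr=b.addr + b.used, used=size, free=b.free - size)
                b.free = 0
                self.blocks.insert(pos + 1, new)   # new element follows the selected one
                return new.addr
        return None

    def free(self, addr):
        """Free the block starting at addr; return False if the address is invalid."""
        for pos in range(1, len(self.blocks)):     # position 0 is the top of the heap
            b = self.blocks[pos]
            if b.addr == addr:
                self.blocks[pos - 1].free += b.used + b.free
                del self.blocks[pos]
                return True
        return False

heap = Heap(100)
a = heap.allocate(30)
b = heap.allocate(20)
c = heap.allocate(10)
heap.free(b)               # the 20 freed units are added to the free size of block a
d = heap.allocate(15)      # reuses the gap after block a
print(a, b, c, d)          # 0 30 50 30
```

Because freeing a block adds its space to its predecessor's free size, adjacent free areas are merged automatically, which addresses the fragmentation problem mentioned above.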

Note that the heap management system in Example 15.9 is for dealing with explicit manual allocation and
deallocation commands, and as such is required in stack-based languages like C.
Fully dynamic languages require additional routines for garbage collection.
16. Intermediate Code Generation
After a program has been parsed and statically checked, the compiler converts it to an intermediate
language, and then optimises the code before producing the final executable version. The main advantage
of developing this intermediate code is machine independence - the analysis techniques can be developed
without concern for the target language, a single optimisation procedure can be used, and porting the
compiler to new machines only requires changing the final component.
In this section, we build on the previous material, and consider syntax-directed methods of generating the
intermediate code, in the language known as three-address code.
Three Address Code
Statements in this language take the general form:
x := y op z
where x, y and z are names, constants or compiler-generated temporaries, and op stands for any operator.
An expression like a+b*c has to be translated into the sequence
t1 := b * c
t2 := a + t1
where t1 and t2 are compiler-generated temporary names. Unravelling complicated arithmetical
expressions allows them to be optimised effectively and translated easily to the target language, as
three-address code is similar to assembly language.
Example 16.1 three-address code
Three-address code is a linearised form of postfix expressions and syntax trees. For example,
consider the statement
a := b * c + b / c
In postfix, this is
a b c * b c / + :=
As a syntax tree, it becomes:
:=
├── a
└── +
    ├── *
    │   ├── b
    │   └── c
    └── /
        ├── b
        └── c
In three-address code, it is
t1 := b * c
t2 := b / c
t3 := t1 + t2
a := t3

The three-address statements used in this chapter are shown below:
• assignment statement - x := y op z
• unary assignment - x := op y            (e.g. unary minus, negation, type conversion)
• copy - x := y
• unconditional jump - goto L             (L is the label of a statement)
• conditional jump - if x relop y goto L  (relop is a relational operator: <, ≤, =, ...)
• procedure call - param x                (defines x as a parameter)
                 - call p n               (call procedure p, passing the last n declared parameters)
                 - return y               (optional)
• indexed assignments - x := y[i]         (x is set to the value at i memory locations after y)
                        x[i] := y         (i memory locations after x is set to y)
We assume that statements in three-address code can be labelled (labels are referred to by the goto
statement).
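To make the meaning of these statements concrete, here is a minimal Python interpreter for a subset of them (assignment, copy, goto, conditional jump and labels). The tuple encoding of instructions and the function names value and run are assumptions of this sketch; procedure calls and indexed assignments are left out.

```python
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "<": operator.lt, "=": operator.eq}

def value(env, x):
    """A name is looked up in env; anything else is taken as a constant."""
    return env[x] if isinstance(x, str) else x

def run(program, env):
    """Execute a list of instruction tuples and return the final environment."""
    labels = {ins[1]: pc for pc, ins in enumerate(program) if ins[0] == "label"}
    pc = 0
    while pc < len(program):
        ins = program[pc]
        kind = ins[0]
        if kind == "assign":                     # x := y op z
            _, x, y, op, z = ins
            env[x] = OPS[op](value(env, y), value(env, z))
        elif kind == "copy":                     # x := y
            env[ins[1]] = value(env, ins[2])
        elif kind == "goto":                     # goto L
            pc = labels[ins[1]]
            continue
        elif kind == "if":                       # if x relop y goto L
            _, x, relop, y, target = ins
            if OPS[relop](value(env, x), value(env, y)):
                pc = labels[target]
                continue
        pc += 1                                  # labels themselves do nothing
    return env

# s := 0; k := 1; L1: if 4 < k goto L2; s := s + k; k := k + 1; goto L1; L2:
program = [
    ("copy", "s", 0),
    ("copy", "k", 1),
    ("label", "L1"),
    ("if", 4, "<", "k", "L2"),
    ("assign", "s", "s", "+", "k"),
    ("assign", "k", "k", "+", 1),
    ("goto", "L1"),
    ("label", "L2"),
]
print(run(program, {})["s"])    # 10
```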
The choice of allowable memory operators is the critical issue in the design of an intermediate language.
The operator set must be sufficiently rich to implement the operations in the source language, and
expressive enough that the code generator does not need to generate long sequences of instructions to
implement each operator.
Example 16.2 A Syntax-directed Translation
The syntax-directed definition given below translates assignment statements into three address
code.
The synthesised attribute S.code in the definition that follows represents the three-address code
fragment for assignment S. The non-terminal E has two attributes:
E.place - the name that will hold the value of E
E.code - the three-address code fragment for E
The notation gen(x ":=" y "+" z) represents the three address code statement x := y+z.
Expressions appearing instead of the variables (x,y,z) are evaluated before being passed to gen,
and the quoted strings are taken literally.
The notation <code fragment> || expression means concatenate the expression onto the end of the
code fragment.
newtemp() creates a new temporary variable.
Production         Semantic Rules
S -> id := E       S.code := E.code || gen(id.place ":=" E.place)
E1 -> E2 + E3      E1.place := newtemp();
                   E1.code := E2.code || E3.code || gen(E1.place ":=" E2.place "+" E3.place)
E1 -> E2 * E3      E1.place := newtemp();
                   E1.code := E2.code || E3.code || gen(E1.place ":=" E2.place "*" E3.place)
E1 -> -E2          E1.place := newtemp();
                   E1.code := E2.code || gen(E1.place ":=" "uminus" E2.place)
E1 -> (E2)         E1.place := E2.place;
                   E1.code := E2.code
E -> id            E.place := id.place;
                   E.code := ""
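The syntax-directed definition above can be sketched in Python by computing the pair (place, code) bottom-up over an expression tree. The tuple encoding of the tree and the function names translate_expr and translate_assign are assumptions of the sketch; newtemp and gen follow the notes, and the operands of a binary node are translated left to right before its own temporary is created.

```python
import itertools

_counter = itertools.count(1)

def newtemp():
    """Create a new temporary name: t1, t2, ..."""
    return f"t{next(_counter)}"

def gen(*parts):
    """Build one three-address statement from its pieces."""
    return " ".join(parts)

def translate_expr(e):
    """Return (place, code) for expression e, where code is a list of statements."""
    kind = e[0]
    if kind == "id":                              # E -> id
        return e[1], []
    if kind == "paren":                           # E1 -> (E2)
        return translate_expr(e[1])
    if kind == "uminus":                          # E1 -> -E2
        p2, c2 = translate_expr(e[1])
        place = newtemp()
        return place, c2 + [gen(place, ":=", "uminus", p2)]
    if kind in ("+", "*"):                        # E1 -> E2 + E3 | E2 * E3
        p2, c2 = translate_expr(e[1])
        p3, c3 = translate_expr(e[2])
        place = newtemp()
        return place, c2 + c3 + [gen(place, ":=", p2, kind, p3)]
    raise ValueError(f"unknown node: {kind}")

def translate_assign(name, e):                    # S -> id := E
    place, code = translate_expr(e)
    return code + [gen(name, ":=", place)]

# a := b*c + b*-c
tree = ("+", ("*", ("id", "b"), ("id", "c")),
             ("*", ("id", "b"), ("uminus", ("id", "c"))))
code = translate_assign("a", tree)
print("\n".join(code))
```

Here list concatenation plays the role of the || operator.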

Example 16.3
The parse tree for a := b*c + b*-c is:
S
├── a
├── :=
└── E8
    ├── E3
    │   ├── E1 ── b
    │   ├── *
    │   └── E2 ── c
    ├── +
    └── E7
        ├── E4 ── b
        ├── *
        └── E6
            ├── -
            └── E5 ── c
The attributes are constructed as follows:
Symbol   place   code
E1       b
E2       c
E3       t1      E1.code || E2.code || t1 := b * c
E4       b
E5       c
E6       t2      E5.code || t2 := uminus c
E7       t3      E4.code || E6.code || t3 := b * t2
E8       t4      E3.code || E7.code || t4 := t1 + t3
S                E8.code || a := t4
Expanding the code attribute for S then gives us the three address code:
t1 := b * c
t2 := uminus c
t3 := b * t2
t4 := t1 + t3
a := t4

Example 16.4 flow of control
We can extend the language defined by 16.2 by including flow of control statements:
Production              Semantic Rules
S1 -> while E do S2     S1.begin := newlabel();
                        S1.after := newlabel();
                        S1.code := gen(S1.begin ":") || E.code ||
                                   gen("if" E.place "=" "0" "goto" S1.after) ||
                                   S2.code || gen("goto" S1.begin) || gen(S1.after ":")
Note that we have introduced new attributes, "begin" and "after", which will hold labels, and that the
function newlabel() creates a new label and returns it. A schematic drawing of the code created
by this semantic rule is shown below:
labels         code
S1.begin :     E.code
               if E.place = 0 goto S1.after
               S2.code
               goto S1.begin
S1.after :     ...
We assume that if the expression E is non-zero, it is true, and thus if the expression, evaluated by
E.code, is false, control shifts to S1.after; if the expression is true, S2.code is executed, then control
shifts back to S1.begin, and the expression is evaluated again.
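The layout of the generated code for a while statement can be reproduced with a short Python sketch. The function gen_while and its parameters are hypothetical; it takes the already generated code and place of E and the code of S2, and the body shown was written by hand for the statement n := n + -d.

```python
import itertools

_labels = itertools.count(1)

def newlabel():
    """Create a new label: L1, L2, ..."""
    return f"L{next(_labels)}"

def gen_while(e_code, e_place, s2_code):
    """Code for S1 -> while E do S2, given E.code, E.place and S2.code."""
    begin = newlabel()                        # S1.begin
    after = newlabel()                        # S1.after
    return ([f"{begin}:"]
            + e_code
            + [f"if {e_place} = 0 goto {after}"]
            + s2_code
            + [f"goto {begin}"]
            + [f"{after}:"])

# while n do n := n + -d   (E is just the identifier n, so E.code is empty)
body = ["t1 := uminus d", "t2 := n + t1", "n := t2"]
loop = gen_while([], "n", body)
print("\n".join(loop))
```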

Assignment statements
The previous sections assumed that when variable names were used, they represented pointers to the
symbol table. This section demonstrates how names corresponding to the terminal id are looked up in the
symbol table - the function lookup(id.name) returns a pointer to the entry of the identifier if it is in the
symbol table, or nil if not.
Example 16.5
We now redo the semantic rules of 16.2 to show the use of the lookup function. Instead of
concatenating the code together in the attributes of the symbols, we now output the intermediate
code to a file, using the emit() function.
Production         Semantic Rules
S -> id := E       p := lookup(id.name);
                   if p ≠ nil then emit(p ":=" E.place)
                   else error
E1 -> E2 + E3      E1.place := newtemp();
                   emit(E1.place ":=" E2.place "+" E3.place)
E1 -> E2 * E3      E1.place := newtemp();
                   emit(E1.place ":=" E2.place "*" E3.place)
E1 -> -E2          E1.place := newtemp();
                   emit(E1.place ":=" "uminus" E2.place)
E1 -> (E2)         E1.place := E2.place;
E -> id            p := lookup(id.name);
                   if p ≠ nil then E.place := p
                   else error
Parsing the fragment
res := a * (alpha + -b)
assuming that res and alpha have already been declared and placed in the symbol table:
index   lexptr     token   attributes
:       :          :
5       -> res     ID_T
6       -> a       ID_T
7       -> alpha   ID_T
8       -> b       ID_T
gives the following sequence:
processed string             attributes          output
res := a * (alpha + -b)
res := E1 * (alpha + -b)     E1.place = <6>
res := E1 * (E2 + -b)        E2.place = <7>
res := E1 * (E2 + -E3)       E3.place = <8>
res := E1 * (E2 + E4)        E4.place = <9>      <9> := uminus <8>
res := E1 * (E5)             E5.place = <10>     <10> := <7> + <9>
res := E1 * E6               E6.place = <10>
res := E7                    E7.place = <11>     <11> := <6> * <10>
S                                                <5> := <11>

For the remainder of the chapter, we will dispense with the <i> notation, and simply refer to identifiers by
their name.
Arrays
We can access the elements of an array quickly if we store them in a block of consecutive locations. Let A
be an array, the width of each array element be w, the lower bound of the index be low, and the address of
the storage for A be base. The ith element of A then begins at location:
base + (i - low) × w.
To speed up the access of array elements, we can partially evaluate this address at compile time by
rewriting it as:
i × w + (base - low × w)
and evaluating the subexpression (base - low × w). This value, c, say, is then stored in the table with A,
and the relative address of an element A[i] can then be found by adding i × w to c.
We can also do something similar for multi-dimensional arrays. Two-dimensional arrays can be stored
either row by row or column by column. For arrays stored row by row, the relative address of A[i,j] can
be calculated by the formula:
base + ((i - low1) × n2 + j - low2) × w
where low1 and low2 are the lower bounds on i and j, and n2 is the number of values that j can take.
Assuming that i and j are the only two values not known at compile time, we can rewrite this as:
((i × n2) + j) × w + (base - ((low1 × n2) + low2) × w)        (**)
As before, the last term can be pre-computed at compile time.
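A quick numerical check of the two-dimensional formulae, as a Python sketch. The function names are illustrative, and the base address 1000 is an arbitrary assumption; the array sizes (10 by 20, lower bounds 1, width 4) are the ones used later in Example 16.7.

```python
def address_direct(base, i, j, low1, low2, n2, w):
    """base + ((i - low1) * n2 + j - low2) * w"""
    return base + ((i - low1) * n2 + j - low2) * w

def constant_part(base, low1, low2, n2, w):
    """The last term of (**), which can be evaluated at compile time."""
    return base - (low1 * n2 + low2) * w

def address_split(c, i, j, n2, w):
    """((i * n2) + j) * w + c, the part left for run time."""
    return (i * n2 + j) * w + c

base, low1, low2, n1, n2, w = 1000, 1, 1, 10, 20, 4
c = constant_part(base, low1, low2, n2, w)
print(c)     # 916, i.e. base - 84

# Both forms give the same address for every element of the array.
agree = all(
    address_direct(base, i, j, low1, low2, n2, w) == address_split(c, i, j, n2, w)
    for i in range(low1, low1 + n1)
    for j in range(low2, low2 + n2)
)
print(agree)  # True
```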
The chief problem in generating code for array references is to relate the computation of the positions of
elements in an array to a grammar of array references. A grammar may be given as follows:
L -> id[Elist] | id
Elist -> Elist, E | E
It is useful to re-write this grammar to allow the dimensional limits of the array to be available as the
index expressions are grouped into an Elist:
L -> Elist] | id
Elist -> Elist, E | id [E
These productions allow a pointer to the symbol table entry for the array name to be passed as a
synthesised attribute of Elist.
The following attributes are used below:
Elist.ndim: the number of dimensions of Elist;
limit(array,j): function returning the number of elements along the jth dimension of the array;
Elist.place: temporary variable holding a value computed from Elist;
L.place: position in the symbol table;
L.offset: an offset into the array, or is null to indicate that the l-value is a simple name rather than
an array reference.
c(Elist.array): a function returning the pre-computed expression of (**) above
width(array): a function returning w in (**) above
Example 16.6 translation scheme for addressing array elements
   Production               Semantic Rules
1  S -> L := E              if L.offset = null then emit(L.place ":=" E.place)
                            else emit(L.place "[" L.offset "]" ":=" E.place)
2  E1 -> E2 + E3            E1.place := newtemp();
                            emit(E1.place ":=" E2.place "+" E3.place)
3  E1 -> (E2)               E1.place := E2.place;
4  E -> L                   if L.offset = null then E.place := L.place
                            else begin
                              E.place := newtemp();
                              emit(E.place ":=" L.place "[" L.offset "]")
                            end
5  L -> Elist]              L.place := newtemp();
                            L.offset := newtemp();
                            emit(L.place ":=" c(Elist.array));
                            emit(L.offset ":=" Elist.place "*" width(Elist.array))
6  L -> id                  L.place := id.place;
                            L.offset := null
7  Elist1 -> Elist2, E      t := newtemp();
                            m := Elist2.ndim + 1;
                            emit(t ":=" Elist2.place "*" limit(Elist2.array,m));
                            emit(t ":=" t "+" E.place);
                            Elist1.array := Elist2.array;
                            Elist1.place := t;
                            Elist1.ndim := m
8  Elist -> id [E           Elist.array := id.place;
                            Elist.place := E.place;
                            Elist.ndim := 1

Example 16.7 generating code using the scheme of 16.6
Let A be a 10 × 20 array with low1 = low2 = 1. Therefore n1 = 10 and n2 = 20. Take w to be 4. The
assignment x := A[y,z] is parsed and translated as follows:
sentential forms      attributes              generated code
x := A[y, z]
L1 := A[y, z]         L1.place = x
                      L1.offset = null
L1 := A[L2, z]        L2.place = y
                      L2.offset = null
L1 := A[E1, z]        E1.place = y
L1 := Elist1, z]      Elist1.array = A
                      Elist1.place = y
                      Elist1.ndim = 1
L1 := Elist1, L3]     L3.place = z
                      L3.offset = null
L1 := Elist1, E2]     E2.place = z
L1 := Elist2]         <t = t1>                t1 := y * 20
                      <m = 2>                 t1 := t1 + z
                      Elist2.array = A
                      Elist2.place = t1
                      Elist2.ndim = 2
L1 := L4              L4.place = t2           t2 := c    /* baseA - 84 */
                      L4.offset = t3          t3 := t1 * 4
L1 := E3              E3.place = t4           t4 := t2[t3]
S                                             x := t4
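The code of Example 16.7 can be regenerated with a Python sketch that applies the Elist rules of Example 16.6 to a list of index places rather than during parsing. The dictionary arrays, which stands in for the symbol table, and the functions translate_ref and translate_fetch are assumptions of the sketch; newtemp, emit and limit follow the notes.

```python
import itertools

_temps = itertools.count(1)
code = []

def newtemp():
    return f"t{next(_temps)}"

def emit(*parts):
    code.append(" ".join(str(p) for p in parts))

# Hypothetical symbol-table information for A: limits per dimension, width, constant name.
arrays = {"A": {"limits": [10, 20], "width": 4, "c": "c"}}

def limit(array, j):
    """Number of elements along the jth dimension of the array."""
    return arrays[array]["limits"][j - 1]

def translate_ref(array, index_places):
    """Emit code for array[index_places...] and return (L.place, L.offset)."""
    place, ndim = index_places[0], 1            # Elist -> id [E
    for e_place in index_places[1:]:            # Elist1 -> Elist2, E
        t = newtemp()
        m = ndim + 1
        emit(t, ":=", place, "*", limit(array, m))
        emit(t, ":=", t, "+", e_place)
        place, ndim = t, m
    l_place, l_offset = newtemp(), newtemp()    # L -> Elist]
    emit(l_place, ":=", arrays[array]["c"])
    emit(l_offset, ":=", place, "*", arrays[array]["width"])
    return l_place, l_offset

def translate_fetch(target, array, index_places):
    """target := array[...], using the rules for E -> L and S -> L := E."""
    l_place, l_offset = translate_ref(array, index_places)
    e_place = newtemp()
    emit(e_place, ":=", f"{l_place}[{l_offset}]")
    emit(target, ":=", e_place)

translate_fetch("x", "A", ["y", "z"])
print("\n".join(code))
```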

Type Conversions
In practice, there are many different types of variables and constants, and the programmer may wish to
combine their use in a single expression where appropriate; it is the task of the compiler to generate
appropriate type conversion instructions. In the above, suppose there are reals and integers, and that
integers can be converted to reals. The semantic rules for the arithmetic operations (and most of the
other productions) must be modified to generate three-address statements which carry out the type
conversion where necessary. We also include with the operator some indication of whether we intend
fixed point or floating point operations.
Example 16.8 arithmetic type conversions
(for E1 -> E2 + E3). We need one additional attribute, E.type, which is either integer or real.
E1.place := newtemp();
if E2.type = integer and E3.type = integer then
begin
    emit(E1.place ":=" E2.place "int+" E3.place);
    E1.type := integer
end
else if E2.type = real and E3.type = real then
begin
    emit(E1.place ":=" E2.place "real+" E3.place);
    E1.type := real
end
else if E2.type = integer and E3.type = real then
begin
    u := newtemp();
    emit(u ":=" "inttoreal" E2.place);
    emit(E1.place ":=" u "real+" E3.place);
    E1.type := real
end
else if E2.type = real and E3.type = integer then
begin
    u := newtemp();
    emit(u ":=" "inttoreal" E3.place);
    emit(E1.place ":=" E2.place "real+" u);
    E1.type := real
end
else
    E1.type := type_error;

Similar semantic functions are required for E -> E*E, replacing "int+" with "int*" etc.
Example 16.9
Parsing and translating the string x := y + i * j, where x and y are reals, and i and j are integers
symbols                  attributes              code
x := y + i * j
x := E1 + i * j          E1.place = y
                         E1.type = real
x := E1 + E2 * j         E2.place = i
                         E2.type = integer
x := E1 + E2 * E3        E3.place = j
                         E3.type = integer
x := E1 + E4             E4.place = t1           t1 := i int* j
                         E4.type = integer
x := E5                  E5.place = t2           t3 := inttoreal t1
                         <u = t3>                t2 := y real+ t3
                         E5.type = real
S                                                x := t2
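The semantic rule of Example 16.8 can be sketched in Python and used to reproduce the code of Example 16.9. The function binary, which handles both "+" and "*" by parameterising the operator, and the representation of each E as a (place, type) pair are assumptions of the sketch. As in the rule, the temporary for E1.place is created before the temporary u.

```python
import itertools

_temps = itertools.count(1)
code = []

def newtemp():
    return f"t{next(_temps)}"

def emit(text):
    code.append(text)

def binary(op, e2, e3):
    """Translate E1 -> E2 op E3; each E is a (place, type) pair."""
    (p2, t2), (p3, t3) = e2, e3
    place = newtemp()
    if t2 == "integer" and t3 == "integer":
        emit(f"{place} := {p2} int{op} {p3}")
        return place, "integer"
    if t2 == "real" and t3 == "real":
        emit(f"{place} := {p2} real{op} {p3}")
        return place, "real"
    if t2 == "integer" and t3 == "real":
        u = newtemp()
        emit(f"{u} := inttoreal {p2}")       # convert the left operand
        emit(f"{place} := {u} real{op} {p3}")
        return place, "real"
    if t2 == "real" and t3 == "integer":
        u = newtemp()
        emit(f"{u} := inttoreal {p3}")       # convert the right operand
        emit(f"{place} := {p2} real{op} {u}")
        return place, "real"
    return place, "type_error"

# x := y + i * j, where x and y are reals, and i and j are integers
e4 = binary("*", ("i", "integer"), ("j", "integer"))
e5 = binary("+", ("y", "real"), e4)
emit(f"x := {e5[0]}")
print("\n".join(code))
```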