1
Study of a Simple Compiler
This chapter studies a simple compiler and walks through
the steps needed to build one. It serves as an introduction
to the rest of the course.
2
Arithmetic expression processing
using the stack
The stack operations are:
• Push(x): puts the value x on the top of the stack.
• Pop(): removes the value on the top of the stack and
returns it.
Before using the stack to process an arithmetic
expression, we have to translate the expression
from infix form to postfix form.
3
Examples of expression translation
Infix          Postfix
1+5            15+
1+5*2          152*+
(1+5) * 2      15+2*
9–5+2          95–2+
4
Processing of expression
To evaluate an arithmetic expression in postfix form using the stack,
follow these steps:
1) Read the expression from left to right.
2) When you read a number, put it on the top of the
stack (using push).
3) When you read an operator:
Get the first number from the top of the stack (using pop);
this is the right operand.
Get the second number from the top of the stack (using
pop); this is the left operand.
Apply the operator as (second number) op (first number);
the order matters for – and /.
Put the result on the top of the stack (using push).
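The steps above can be sketched in C as follows. This is a minimal illustration, not code from the course: the function name eval_postfix, the fixed stack size, and the restriction to single-digit operands and the ASCII operators + - * / are all assumptions made for the example.

```c
#include <ctype.h>

#define STACK_MAX 64   /* assumed stack capacity for this sketch */

/* Evaluates a postfix string of single-digit operands and + - * /.
   Returns 0 and stores the value in *result on success; returns -1 on
   malformed input (stack underflow or overflow, unknown character,
   division by zero, or leftover operands). */
int eval_postfix(const char *s, int *result)
{
    int stack[STACK_MAX];
    int top = 0;                          /* number of values on the stack */

    for (; *s != '\0'; s++) {
        unsigned char c = (unsigned char)*s;
        if (c == ' ')
            continue;                     /* spaces between symbols are ignored */
        if (isdigit(c)) {
            if (top == STACK_MAX)
                return -1;
            stack[top++] = c - '0';       /* push the number */
        } else {
            int right, left;
            if (top < 2)
                return -1;
            right = stack[--top];         /* first pop: right operand */
            left  = stack[--top];         /* second pop: left operand */
            switch (c) {
            case '+': left += right; break;
            case '-': left -= right; break;
            case '*': left *= right; break;
            case '/':
                if (right == 0)
                    return -1;
                left /= right;
                break;
            default:
                return -1;
            }
            stack[top++] = left;          /* push the result */
        }
    }
    if (top != 1)
        return -1;
    *result = stack[0];
    return 0;
}
```

For example, calling eval_postfix("152*+", &r) returns 0 and sets r to 11, and eval_postfix("95-2+", &r) sets r to 6.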
5
If we process the following expression:
Infix: 1+5*2
Translation (postfix): 152*+

Operations                               Stack (top first)
push 1                                   1
push 5                                   5 1
push 2                                   2 5 1
pop r1; pop r2; mult r2,r1; push r2      10 1
pop r1; pop r2; add r2,r1; push r2       11
6
Exercise
1) Process the other expressions in the above table (page 3) using
the stack.
2) Complete the following table.
Infix              Postfix
1-5
1+5-2
9 – 3 / (1+2)
(9-3)/1+2
7
Simple compiler structure
Character stream (infix representation)
  → Lexical analyzer
  → Token stream
  → Syntax-directed translator
  → Intermediate representation (postfix representation)
8
Grammar
A context-free grammar (CFG) consists of:
1) A set of tokens (called terminal symbols)
2) A set of non-terminals
3) A set of rules (productions), each of which has
   a left part (a non-terminal)
   an arrow
   a right part (a sequence (string) of tokens and/or non-terminal
   symbols)
4) A start symbol (one of the non-terminal symbols)
9
1) Example 1:
list → list + digit
list → list – digit
list → digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
This may be written as follows:
list → list + digit | list – digit | digit
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
10
- Terminal symbols (tokens)
+ - 0 1 2 3 4 5 6 7 8 9
- Non-terminals
digit, list
- Start symbol
list
String of tokens: a sequence of zero or more tokens
(terminal symbols). A string containing zero tokens
is called the empty string and is written ε.
The set of all token strings that can be derived from
the start symbol of a grammar forms the language
defined by that grammar.
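To make the idea of "the language of a grammar" concrete, here is a small sketch in C that tests whether a character string belongs to the language of the list grammar from Example 1 (a digit followed by zero or more "+ digit" or "- digit" pairs). The function name in_list_language is hypothetical, and the sketch assumes each token is a single ASCII character with no spaces.

```c
#include <ctype.h>

/* Returns 1 if s belongs to the language of
       list  -> list + digit | list - digit | digit
       digit -> 0 | 1 | ... | 9
   and 0 otherwise. */
int in_list_language(const char *s)
{
    if (!isdigit((unsigned char)*s))
        return 0;                       /* every string starts with a digit */
    s++;
    while (*s == '+' || *s == '-') {
        s++;                            /* consume the operator */
        if (!isdigit((unsigned char)*s))
            return 0;                   /* an operator must be followed by a digit */
        s++;
    }
    return *s == '\0';                  /* nothing may be left over */
}
```

With this sketch, "9-5+2" and "7" are accepted, while "9-", "95", and the empty string are rejected. The empty string is rejected because this grammar has no ε production.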
11
Exercise
Example 2)
1. Determine the non-terminal symbols and the
terminal symbols of the following grammar.
2. Determine the start symbol.
3. Give three token strings derived from this
grammar:
block → begin compound_stmts end
compound_stmts → stmt_list | ε
stmt_list → stmt_list ; stmt | stmt
stmt → a | c | b
12
Parse Tree
• Shows how the start symbol of a grammar can derive
a string in the language
• A tree with the following properties:
1- the root is the start symbol
2- each internal node is a Non-terminal
3- each leaf is a token or ε.
4- If A is the label of an interior node, and
X1, X2, …, Xn (nonterminals or tokens) are the labels of
its children, then the following production must exist:
A → X1 X2 … Xn

         A
      /  |   \
    X1   X2 ... Xn
13
Example
S → S S + | S S * | a
1) Derive the following string: aa+a*
S ⇒ S S * ⇒ S a * ⇒ S S + a * ⇒ S a + a * ⇒ a a + a *
Productions used, in order:
S → S S *
S → a
S → S S +
S → a
S → a
14
2) Draw the parse tree of the derivation:
S ⇒ S S * ⇒ S a * ⇒ S S + a * ⇒ S a + a * ⇒ a a + a *

            S
         /  |  \
        S   S   *
      / | \  |
     S  S  + a
     |  |
     a  a
15
Ambiguous Grammars
• If some string in the language has more than one parse tree,
the grammar is said to be ambiguous
• Ambiguity must be avoided for compilation, since an ambiguous
string can have more than one meaning
• List of digits separated by plus or minus signs:
string → string + string | string – string | 0 |
1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
• This example merges the notions of digit and list into a single
nonterminal, string
• The same strings are derivable, but some strings have multiple
parse trees (possible meanings)
16
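Two Parse Trees: 9 – 5 + 2
One tree groups the string as (9 – 5) + 2, which evaluates to 6;
the other groups it as 9 – (5 + 2), which evaluates to 2.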
17
Precedence and Associativity
• Precedence
– Determines the order in which different operators are evaluated
when they occur in the same expression
– Operators of higher precedence are applied before operators of
lower precedence
• Associativity
– Determines the order in which operators of equal precedence are
evaluated when they occur in the same expression
– Most operators have a left-to-right associativity, but some have
right-to-left associativity
18
Precedence and Associativity
Example: Arithmetic Expression
We start with the lowest level in the grammar (highest priority):
digit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Then the next higher level (lower priority):
factor → digit | ( expr )
Then the next higher level (lower priority):
term → term * factor | term / factor | factor
Then the highest level (lowest priority):
expr → expr + term | expr – term | term
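The layering of expr, term, and factor is what gives * and / higher precedence than + and –. The following C sketch evaluates an infix string with one function per nonterminal. It is an illustration under assumptions, not part of the course code: the names eval_infix, parse_expr, parse_term, and parse_factor are hypothetical, operands are single digits, the input is assumed to be well formed with no spaces, and there is no error reporting. The left-recursive rules are implemented as loops, which keeps the operators left-associative.

```c
#include <ctype.h>

static const char *p;                 /* next unread character */

static int parse_expr(void);

/* factor -> digit | ( expr ) */
static int parse_factor(void)
{
    int v = 0;
    if (*p == '(') {
        p++;                          /* skip '(' */
        v = parse_expr();
        if (*p == ')')
            p++;                      /* skip ')' */
    } else if (isdigit((unsigned char)*p)) {
        v = *p++ - '0';
    }
    return v;
}

/* term -> term * factor | term / factor | factor */
static int parse_term(void)
{
    int v = parse_factor();
    while (*p == '*' || *p == '/') {
        char op = *p++;
        int r = parse_factor();
        if (op == '*')
            v *= r;
        else if (r != 0)              /* division by zero is silently skipped */
            v /= r;
    }
    return v;
}

/* expr -> expr + term | expr - term | term */
static int parse_expr(void)
{
    int v = parse_term();
    while (*p == '+' || *p == '-') {
        char op = *p++;
        int r = parse_term();
        v = (op == '+') ? v + r : v - r;
    }
    return v;
}

/* Evaluates a well-formed infix string such as "1+5*2". */
int eval_infix(const char *s)
{
    p = s;
    return parse_expr();
}
```

Here eval_infix("1+5*2") yields 11 because the multiplication is handled one level lower, inside parse_term, and eval_infix("9-5+2") yields 6 because the loop in parse_expr applies the operators from left to right.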
19
Postfix Notation
• Formal rules, infix → postfix
– If E is variable or constant, E → E
– If E is expression of form E1 op E2, where op is binary
operator, E1 → E1’, and E2 → E2’, then E → E1’ E2’ op
– If E is expression of form (E1) and E1 → E1’, then E → E1’
• Parentheses are not needed!
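The formal rules above amount to a post-order walk of the expression. The C sketch below applies them to an expression tree; the struct node layout and the function name to_postfix are assumptions made for this illustration. Rule 3 needs no code here, since a tree records grouping by its shape and has no parenthesis nodes.

```c
#include <stddef.h>

/* A hypothetical expression tree node: a leaf holds one operand
   character; an interior node holds a binary operator. */
struct node {
    char op;                          /* operator, or 0 for a leaf */
    char leaf;                        /* operand character when op == 0 */
    const struct node *left, *right;  /* children when op != 0 */
};

/* Writes the postfix form of e at out and returns a pointer to the
   terminating '\0'. The caller must provide a large enough buffer. */
char *to_postfix(const struct node *e, char *out)
{
    if (e->op == 0) {
        *out++ = e->leaf;                   /* rule 1: E is a variable or constant */
    } else {
        out = to_postfix(e->left, out);     /* rule 2: E1' */
        out = to_postfix(e->right, out);    /*         E2' */
        *out++ = e->op;                     /*         op  */
    }
    *out = '\0';
    return out;
}
```

For the tree of (1+5)*2, with * at the root and the + node as its left child, the function writes 15+2*, which matches the table of translation examples.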
20
Translation Schemes
• A translation scheme extends a CFG
• It includes “semantic actions” embedded within
productions
Example Translation Scheme
expr → expr + term { print(‘+’) }
expr → expr – term { print(‘-’) }
expr → term
term → 0 { print(‘0’) }
term → 1 { print(‘1’) }
…
term → 9 { print(‘9’) }
21
Equivalent Translation Scheme
expr → term rest
rest → + term { print(‘+’) } rest
rest → - term { print(‘-’) } rest
rest → ε
term → 0 { print(‘0’) }
term → 1 { print(‘1’) }
…
term → 9 { print(‘9’) }
22
Parsing
• Parsing is the process of determining if a string of
tokens can be generated by a grammar
23
Top-down Parsing
• Recursively apply the following steps:
– At node n with nonterminal A, select a production for A
– Construct children at n for symbols on right side of selected
production
– Find next node for which subtree needs to be constructed
• Top-down parsing uses a “lookahead” symbol
• Selecting production may involve trial-and-error and
backtracking
24
Predictive Parsing
• Recursive-descent parsing is a recursive, top-down
approach to parsing
• A procedure is associated with each nonterminal
of the grammar
• Predictive parsing
– Special case of recursive-descent parsing
– The lookahead symbol unambiguously determines the
procedure for each nonterminal
25
Procedures for Nonterminals
• Production with right side α used if lookahead is in
FIRST(α)
– FIRST(α) is set of all symbols that can be first symbol of α
– If lookahead symbol is not in FIRST set for any production, can
use production with right side of ε
– If two or more possibilities, can not use this method
– If no possibilities, an error is declared
• Nonterminals on right side of selected production are
recursively expanded
26
Left Recursion
• Left-recursive productions can cause recursive-descent parsers to loop forever
• Example: expr → expr + term
• Left recursion can be eliminated by rewriting
A → A α | β
as
A → β R
R → α R | ε
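As a small illustration of why the rewritten form suits recursive descent, the C sketch below instantiates the general rule with α = 'a' and β = 'b'; these concrete symbols and the names parse_A and parse_R are assumptions for the example. A procedure written directly for A → A a | b would call itself before consuming any input and never return, whereas each procedure below consumes a character before recursing.

```c
/* R -> a R | ε : returns the position after the matched input */
static const char *parse_R(const char *s)
{
    if (*s == 'a')
        return parse_R(s + 1);    /* R -> a R: consume 'a', then recurse */
    return s;                     /* R -> ε: consume nothing */
}

/* A -> b R : returns 1 if the whole string s can be derived from A */
int parse_A(const char *s)
{
    if (*s != 'b')
        return 0;                 /* A must start with β = 'b' */
    s = parse_R(s + 1);
    return *s == '\0';
}
```

Both the original and the rewritten grammar generate a 'b' followed by zero or more 'a' characters, so parse_A accepts "b" and "baaa" and rejects "ab" and "bab".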
27
Eliminating Left Recursion
Before (left-recursive):
expr → expr + term { print(‘+’) }
expr → expr – term { print(‘-’) }
expr → term
term → 0 { print(‘0’) }
term → 1 { print(‘1’) }
…
term → 9 { print(‘9’) }
After:
expr → term rest
rest → + term { print(‘+’) } rest
rest → - term { print(‘-’) } rest
rest → ε
term → 0 { print(‘0’) }
term → 1 { print(‘1’) }
…
term → 9 { print(‘9’) }
28
Infix to Postfix Code: Part 1
#include <stdio.h>
#include <stdlib.h>   /* for exit() */
#include <ctype.h>

int lookahead;        /* current input character */

void expr(void);
void rest(void);
void term(void);
void match(int);
void error(void);

int main(void)
{
    lookahead = getchar();
    expr();
    putchar('\n');    /* adds trailing newline character */
    return 0;
}
…
29
Infix to Postfix Code: Part 2
…
void expr(void)       /* expr → term rest */
{
    term();
    rest();
}

void term(void)       /* term → digit, printed as it is read */
{
    if (isdigit(lookahead)) {
        putchar(lookahead);
        match(lookahead);
    }
    else
        error();
}
…
30
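Infix to Postfix Code: Part 3
…
void rest(void)       /* rest → + term rest | - term rest | ε */
{
    if (lookahead == '+') {
        match('+');
        term();
        putchar('+');   /* operator is printed after its right operand */
        rest();
    }
    else if (lookahead == '-') {
        match('-');
        term();
        putchar('-');
        rest();
    }
    /* otherwise rest → ε: do nothing */
}
…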
31
Infix to Postfix Code: Part 4
…
void match(int t)     /* consume the expected character t */
{
    if (lookahead == t)
        lookahead = getchar();
    else
        error();
}

void error(void)
{
    printf("syntax error\n");   /* print error message */
    exit(1);                    /* then halt */
}
32
Code Optimization 1
void rest(void)
{
REST:                   /* the tail-recursive calls to rest() become jumps */
    if (lookahead == '+') {
        match('+');
        term();
        putchar('+');
        goto REST;
    }
    else if (lookahead == '-') {
        match('-');
        term();
        putchar('-');
        goto REST;
    }
}
33
Code Optimization 2
void expr(void)         /* rest() is folded into expr() as a loop */
{
    term();
    while (1) {
        if (lookahead == '+') {
            match('+');
            term();
            putchar('+');
        }
        else if (lookahead == '-') {
            match('-');
            term();
            putchar('-');
        }
        else
            break;
    }
}
34
Improvements Remaining
• Want to ignore whitespace
• Allow numbers
• Allow identifiers
• Allow additional operators (multiplication and
division)
• Allow multiple expressions (separated by
semicolons)
35
Lexical Analyzer
• Eliminates whitespace (and comments)
• Reads numbers (not just single digits)
• Reads identifiers and keywords
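A rough sketch of these three tasks in C is shown below. It scans from a string rather than from standard input, and the names next_token, struct token, and the TOK_ constants are hypothetical, chosen only for this illustration. Comments are not handled, keywords are not distinguished from identifiers, and for an identifier the sketch records only its length.

```c
#include <ctype.h>

enum { TOK_NUM = 256, TOK_ID, TOK_DONE };   /* hypothetical token codes */

struct token {
    int type;    /* TOK_NUM, TOK_ID, TOK_DONE, or a single character */
    int value;   /* number value for TOK_NUM, length for TOK_ID */
};

/* Scans one token starting at *src and advances *src past it. */
struct token next_token(const char **src)
{
    struct token tok = { TOK_DONE, 0 };
    const char *s = *src;

    while (*s == ' ' || *s == '\t' || *s == '\n')
        s++;                                    /* eliminate whitespace */

    if (isdigit((unsigned char)*s)) {           /* a number, not just one digit */
        tok.type = TOK_NUM;
        while (isdigit((unsigned char)*s))
            tok.value = tok.value * 10 + (*s++ - '0');
    } else if (isalpha((unsigned char)*s)) {    /* letter followed by letters and digits */
        tok.type = TOK_ID;
        while (isalnum((unsigned char)*s)) {
            tok.value++;
            s++;
        }
    } else if (*s != '\0') {
        tok.type = (unsigned char)*s++;         /* any other character is its own token */
    }
    *src = s;
    return tok;
}
```

Called repeatedly on the string "  12 + count", the function returns a TOK_NUM token with value 12, then the character '+', then a TOK_ID token of length 5, and finally TOK_DONE.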
36
Implementing the Lexical Analyzer
37
Allowable Tokens
• expected tokens: +, -, *, /, DIV, MOD, (, ), ID,
NUM, DONE
• ID represents an identifier, NUM represents a
number, DONE represents EOF
38
Tokens and Attributes
LEXEME                                   TOKEN             ATTRIBUTE VALUE
white space                              ---               ---
sequence of digits                       NUM               numeric value of sequence
div                                      DIV               ---
mod                                      MOD               ---
letter followed by letters and digits    ID                index into symbol table
EOF                                      DONE              ---
any other character                      that character    NONE
39
A Simple Symbol Table
• Each record of symbol table contains a token type and a
string (lexeme or keyword)
• Symbol table has fixed size
• All lexemes in array of fixed size
• Will be able to insert and search for tokens:
– insert(s, t): creates entry with string s and token t, returns
index into symbol table
– lookup(s): searches for entry with string s, returns index if
found, 0 otherwise
• Keywords (div and mod) are inserted into the symbol
table in advance, so they cannot be used as identifiers
40
Updated Translation Scheme
start → list eof
list → expr ; list | ε
expr → expr + term { print(‘+’) }
     | expr – term { print(‘-’) }
     | term
term → term * factor { print(‘*’) }
     | term / factor { print(‘/’) }
     | term div factor { print(‘DIV’) }
     | term mod factor { print(‘MOD’) }
     | factor
factor → ( expr )
       | id { print(id.lexeme) }
       | num { print(num.value) }
41
After Eliminating Left Recursion
start → list eof
list → expr ; list | ε
expr → term moreterms
moreterms → + term { print(‘+’) } moreterms
          | - term { print(‘-’) } moreterms
          | ε
term → factor morefactors
morefactors → * factor { print(‘*’) } morefactors
            | / factor { print(‘/’) } morefactors
            | div factor { print(‘DIV’) } morefactors
            | mod factor { print(‘MOD’) } morefactors
            | ε
factor → ( expr )
       | id { print(id.lexeme) }
       | num { print(num.value) }
42
Final Code
• About 250 lines of C
• Pretty sloppy, otherwise would be longer
43
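********** File global.h *************
#include <stdio.h>    /* i/o routines */
#include <stdlib.h>   /* exit() */
#include <string.h>   /* strlen(), strcmp(), strcpy() */
#include <ctype.h>    /* character test routines */

#define BSIZE 128     /* buffer size */
#define NONE  -1
#define EOS   '\0'

#define NUM   256
#define DIV   257
#define MOD   258
#define ID    259
#define DONE  260

extern int tokenval;  /* value of token attribute (defined in lexer.c) */
extern int lineno;    /* current line number (defined in lexer.c) */

struct entry {        /* form of a symbol table entry */
    char *lexptr;
    int token;
};

extern struct entry symtable[];   /* symbol table (defined in symbol.c) */

/* functions shared between the source files */
int  lexan(void);
int  insert(char s[], int tok);
int  lookup(char s[]);
void init(void);
void parse(void);
void emit(int t, int tval);
void error(char *m);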
44
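********** Init.c *************
#include "global.h"

struct entry keywords[] = {
    { "div", DIV },
    { "mod", MOD },
    { 0, 0 }
};

void init(void)       /* loads keywords into symtable */
{
    struct entry *p;
    for (p = keywords; p->token; p++)
        insert(p->lexptr, p->token);
}

Array symtable (after inserting the keywords and the identifiers count and i)
index    lexptr points to    token
0        (unused)
1        div                 DIV
2        mod                 MOD
3        count               ID
4        i                   ID

Array lexemes
d i v EOS m o d EOS c o u n t EOS i EOS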
45
The lexical analyzer:
- Calls the lookup function to search for a symbol in the symbol
table.
- Calls the insert function to add a symbol to the symbol
table.
- Adds 1 to the line counter when the end-of-line
character is found.
46
********** symbol.c *************
#include "global.h"

#define STRMAX 999    /* size of lexemes array */
#define SYMMAX 100    /* size of symtable */

char lexemes[STRMAX];
int lastchar = -1;              /* last used position in lexemes */
struct entry symtable[SYMMAX];
int lastentry = 0;              /* last used position in symtable */

int lookup(char s[])            /* returns position of entry for s, 0 if absent */
{
    int p;
    for (p = lastentry; p > 0; p = p-1)
        if (strcmp(symtable[p].lexptr, s) == 0)
            return p;
    return 0;
}

int insert(char s[], int tok)   /* returns position of the new entry for s */
{
    int len;
    len = strlen(s);
    if (lastentry + 1 >= SYMMAX)
        error("symbol table full");
    if (lastchar + len + 1 >= STRMAX)
        error("lexemes array full");
    lastentry = lastentry + 1;
    symtable[lastentry].token = tok;
    symtable[lastentry].lexptr = &lexemes[lastchar + 1];
    lastchar = lastchar + len + 1;
    strcpy(symtable[lastentry].lexptr, s);
    return lastentry;
}
47
********** lexer.c *************
#include "global.h"

char lexbuf[BSIZE];
int lineno = 1;
int tokenval = NONE;

int lexan(void)                 /* lexical analyzer */
{
    int t;
    while (1) {
        t = getchar();
        if (t == ' ' || t == '\t')
            ;                   /* strip out white space */
        else if (t == '\n')
            lineno = lineno + 1;
        else if (isdigit(t)) {  /* t is a digit: read the whole number */
            ungetc(t, stdin);
            scanf("%d", &tokenval);
            return NUM;
        }
        else if (isalpha(t)) {  /* t is a letter: read an identifier or keyword */
            int p, b = 0;
            while (isalnum(t)) {
                lexbuf[b] = t;
                t = getchar();
                b = b + 1;
                if (b >= BSIZE)
                    error("compiler error");
            }
            lexbuf[b] = EOS;
            if (t != EOF)
                ungetc(t, stdin);
            p = lookup(lexbuf);
            if (p == 0)
                p = insert(lexbuf, ID);
            tokenval = p;
            return symtable[p].token;
        }
        else if (t == EOF)
            return DONE;
        else {
            tokenval = NONE;
            return t;
        }
    }
}
48
********** emitter.c *************
#include "global.h"

void emit(int t, int tval)      /* generates output */
{
    switch (t) {
    case '+': case '-': case '*': case '/':
        printf("%c", t);
        break;
    case DIV:
        printf(" DIV ");
        break;
    case MOD:
        printf(" MOD ");
        break;
    case NUM:
        printf("%d", tval);     /* note: no separator is printed after a number */
        break;
    case ID:
        printf(" %s ", symtable[tval].lexptr);
        break;
    default:
        printf("token %d, tokenval %d\n", t, tval);
    }
}
49
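********** parse.c *************
#include "global.h"

int lookahead;                  /* current token */

/* parsing routines local to parse.c */
void expr(void);
void term(void);
void factor(void);
void match(int t);

void parse(void)                /* parses and translates an expression list */
{
    lookahead = lexan();
    while (lookahead != DONE) {
        expr(); match(';');
    }
}

void expr(void)
{
    int t;
    term();
    while (1)
        switch (lookahead) {
        case '+': case '-':
            t = lookahead;
            match(lookahead); term(); emit(t, NONE);
            continue;           /* continues the enclosing while loop */
        default:
            return;
        }
}

void term(void)
{
    int t;
    factor();
    while (1)
        switch (lookahead) {
        case '*': case '/': case DIV: case MOD:
            t = lookahead;
            match(lookahead); factor(); emit(t, NONE);
            continue;           /* continues the enclosing while loop */
        default:
            return;
        }
}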
50
********** parse.c (Cont'd) **********
void factor(void)
{
    switch (lookahead) {
    case '(':
        match('('); expr(); match(')');
        break;
    case NUM:
        emit(NUM, tokenval);
        match(NUM);
        break;
    case ID:
        emit(ID, tokenval);
        match(ID);
        break;
    default:
        error("syntax error");
    }
}

void match(int t)               /* consume the expected token t */
{
    if (lookahead == t)
        lookahead = lexan();
    else
        error("syntax error");
}
51
*** error.c ***
#include "global.h"

void error(char *m)             /* prints the line number and message, then halts */
{
    fprintf(stderr, "line %d: %s\n", lineno, m);
    exit(1);
}
*** main.c ***
#include "global.h"

int main(void)
{
    init();
    parse();
    return 0;                   /* successful termination */
}