Midterm Review Notes:
Chapter 1 Introduction
1.1 Overview and History
Compilers are fundamental to modern computing.
They act as translators, transforming human-oriented programming languages into computer-oriented machine
languages.
Programming Language (source) ---> Compiler ---> Machine Code (executable/target)
The first real compilers were the FORTRAN compilers of the late 1950s (18 person-years to build).
Compiler technology is more broadly applicable and has been employed in rather unexpected areas:
1. Text-formatting languages, like nroff and troff; preprocessor packages like eqn, tbl, pic
2. Silicon compilers for the creation of VLSI circuits
3. Command languages of operating systems
4. Query languages of database systems
1.2 What Do Compilers Do?
Compilers may be distinguished according to the kind of target code they generate:
1. Pure Machine Code
2. Augmented Machine Code
3. Virtual Machine Code (e.g., JVM bytecode; P-code for a virtual stack machine)
Another way that compilers differ from one another is in the format of the target machine code they generate:
1. Assembly Language Format
2. Relocatable Binary Format (a linkage step is required)
3. Memory-Image (Load-and-Go) Format
Another kind of language processor, called an interpreter, differs from a compiler in that it executes
programs without explicitly performing a translation
1.3 The Structure of a Compiler
Any compiler must perform two major tasks: analysis of the source program, and synthesis of a machine-language program.
Source Program --> Scanner --(tokens)--> Parser --(syntactic structure)--> Semantic Routines --(intermediate representation)--> Optimizer --> Code Generator --> Target Machine Code
• Scanner (lexical analysis)
– The scanner begins the analysis of the source program by reading the input, character by character, and grouping characters into individual words and symbols (tokens, or other features such as color, shape, texture, ontology definitions...)
– The tokens are encoded and then fed to the parser for syntactic analysis (tokens are specified by regular expressions)
– Scanner generators exist to build scanners automatically
• Parser
– Given a formal syntax specification (typically as a context-free grammar [CFG]), the parser reads tokens and groups them into units as specified by the productions of the CFG being used.
– While parsing, the parser verifies correct syntax, and if a syntax error is found, it issues a
suitable diagnostic.
– As syntactic structure is recognized, the parser either calls corresponding semantic routines
directly or builds a syntax tree.
• Semantic Routines
– Perform two functions:
1. Check the static semantics of each construct (is it legal and meaningful?)
2. Do the actual translation (generate intermediate representation code, associated with individual productions of the CFG or subtrees of the syntax tree)
– The heart of a compiler
•
Optimizer
– The IR code generated by the semantic routines is analyzed and transformed into functionally
equivalent but improved IR code.
– This phase can be very complex and slow (it often involves numerous subphases)
– Peephole optimization attempts simple, often machine-dependent, code improvements: e.g., eliminating multiplication by 1 or addition of 0, removing a load whose value the previous instruction just stored from a register to that memory location, and replacing a sequence of instructions by a single instruction with the same effect.
• One-pass compiler
– No optimization is required
– Merges code generation with the semantic routines and eliminates the use of an IR
• Compiler writing tools
– Compiler generators or compiler-compilers
– E.g., scanner and parser generators
1.4 The Syntax and Semantics of Programming Languages
Syntax (Structure)
Semantics (Meaning)
Chapter 2 A Simple Compiler
2.1 Micro compiler
• Micro: a very simple language
– Only integers
– No declarations
– Variables consist of A..Z, 0..9, and are at most 32 characters long.
– Comments begin with -- and end with end-of-line.
– Three kinds of statements:
• assignments, e.g., a := b + c
• read(list of ids), e.g., read(a, b)
• write(list of exps), e.g., write(a+b)
– begin, end, read, and write are reserved words.
– Tokens may not extend to the following line.
One-pass type, no explicit intermediate representations used
The structure of a Syntax-Directed Compiler (figure): Source Program (character stream) → Scanner → (tokens) → Parser → (syntactic structure) → Semantic Routines → (intermediate representation) → Optimizer → Code Generator → Target Machine Code. The symbol and attribute tables are used by all phases of the compiler.
• The interface
– Parser is the main routine.
– Parser calls the scanner to get the next token.
– Parser calls semantic routines at appropriate times.
– Semantic routines produce output in assembly language.
– A simple symbol table is used by the semantic routines.
2.2 A Micro Scanner
• How to handle reserved words?
– Reserved words are similar to identifiers.
– Two approaches:
• Use a separate table of reserved words
• Put all reserved words into the symbol table initially
• Provision for saving the characters of a token as they are scanned
– token_buffer, buffer_char(), clear_buffer(), check_reserved()
• Handle end of file
– feof(stdin)
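The table-of-reserved-words approach above can be sketched as follows. This is a minimal sketch: the token names and the contents of the RESERVED table are illustrative, not the book's actual code.

```python
# Sketch of Micro's reserved-word check: scan an identifier into a buffer,
# then look the spelling up in a separate table of reserved words.
# Token names (BEGIN, ID, ...) are illustrative stand-ins.

RESERVED = {"begin": "BEGIN", "end": "END", "read": "READ", "write": "WRITE"}

token_buffer = []

def clear_buffer():
    token_buffer.clear()

def buffer_char(c):
    token_buffer.append(c)

def check_reserved():
    """Return the reserved-word token if the buffered spelling matches,
    otherwise the generic identifier token ID."""
    return RESERVED.get("".join(token_buffer), "ID")
```

For example, buffering the characters of "read" yields READ, while buffering "radius" yields ID.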
2.3 The Syntax of Micro
• Micro's syntax is defined by a context-free grammar (CFG)
– A CFG is also called a BNF (Backus-Naur Form) grammar
– A CFG consists of a set of production rules of the form
A → B C D Z
where A is the LHS and B C D Z is the RHS
– The LHS must be a single nonterminal
– The RHS consists of 0 or more terminals or nonterminals
• Two kinds of symbols
– Nonterminals
• Delimited by < and >
• Represent syntactic structures
– Terminals
• Represent tokens
• E.g. <program> → begin <statement list> end
• Start or goal symbol
• λ: the empty or null string
<statement list> → <statement> <statement tail>
<statement tail> → λ
<statement tail> → <statement> <statement tail>
• Extended BNF: some abbreviations
1. optional: [ ] means 0 or 1 occurrence
<stmt> → if <exp> then <stmt>
<stmt> → if <exp> then <stmt> else <stmt>
can be written as
<stmt> → if <exp> then <stmt> [ else <stmt> ]
2. repetition: { } means 0 or more occurrences
<stmt list> → <stmt> <tail>
<tail> → λ
<tail> → <stmt> <tail>
can be written as
<stmt list> → <stmt> { <stmt> }
3. alternative: |, meaning "or"
<stmt> → <assign>
<stmt> → <if stmt>
can be written as
<stmt> → <assign> | <if stmt>
• Extended BNF == BNF
– Either can be transformed into the other.
– Extended BNF is more compact and readable.
• The derivation of
begin ID := ID + (INTLITERAL - ID); end
• A grammar fragment can define a precedence relationship among operators.
• Action Symbols
Chapter 3 Scanning
3.2 Regular Expressions
Tokens are built from symbols of a finite vocabulary.
A common vocabulary includes a-z, A-Z, 0-9, and punctuation such as ; = < >.
We use regular expressions to define the structure of tokens, e.g. Delimit = ( '(' | ')' | := | '+' | - | '*' )
Because some characters are both in the vocabulary and used as operators in regular expressions, they must be properly quoted when used literally.
( ) + * = | are meta-characters.
Regular expressions:
φ (the empty set)
λ (the null string)
c (any single character c of the vocabulary)
If A and B are regular expressions, so are
A | B (alternation)
A B (concatenation)
A* (Kleene closure)
• Some notational conveniences:
P+ == P P*
Not(A) == V - A (for a single character; V is the vocabulary)
Not(S) == V* - S
A^k == A A … A (k times)
3.3 Finite Automata (or finite state machines)
– regular expressions ⇔ finite automata (FA)
– An FA consists of
* a set of states
* an initial state
* some accepting states
* transitions between states, labeled with characters
Two kinds of finite automata (FA):
* deterministic: the next transition is unique
* non-deterministic: otherwise
Another example:
comment = -- Not(eol)* eol
Ex: (0|1)* 0 (0|1) (0|1)
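The example DFA for (0|1)*0(0|1)(0|1) can be simulated directly. A compact encoding (an assumption of this sketch, not from the notes) keeps the last three input symbols as the state, giving an 8-state DFA:

```python
# DFA for (0|1)*0(0|1)(0|1): accept iff the input has length >= 3 and the
# third symbol from the end is '0'.
# State = the last three symbols read, initially "111" (padding that can
# never cause a false accept, since acceptance requires a '0' in the window).

def step(state, c):
    assert c in "01"
    return state[1:] + c   # slide the 3-symbol window

def accepts(s):
    state = "111"
    for c in s:
        state = step(state, c)
    return state[0] == "0"
```

For example, accepts("010") is True, while accepts("110") and any string shorter than 3 symbols are rejected.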
• We may perform some actions during a state transition, e.g. save the input character in a buffer.
• Ex:
Quote = " ( Not(") | "" )* "
where T means toss away (the marked character is not saved).
3.6 Regular Expression → Finite Automata (FA)
• Three steps:
1. regular expression → non-deterministic FA
2. non-deterministic FA → deterministic FA
3. minimize the number of states
• Two cases in which an FA is nondeterministic:
1. some state has more than one transition on the same character
2. it has λ-transitions
NFA to DFA:
Step 1. Regular Expression → NFA
We need to review the definition of a regular expression:
1. λ (the null string)
2. a (a character of the vocabulary)
3. A | B (alternation)
4. A B (concatenation)
5. A* (repetition)
Example:
Step 2. NFA → DFA
-- Use the "subset construction".
-- Let N be an NFA, M the equivalent DFA.
-- Each state of M corresponds to a set of states of N. If we run M and N on the same input string, M is in state {x,y,z} iff N can be in state x, y, or z.
-- Thus, M keeps track of all possible paths N might take.
-- A state {x, y, …} of M is an accepting state if at least one of x, y, … is an accepting state of N.
-- The initial state of M is the set of states reachable from the initial state of N by λ-transitions.
Given an NFA N, let S be a set of states of N; define the operation
close(S) = { q | q ∈ S, or ∃ p ∈ S such that p →λ q }
Algorithm:
S := close({ initial state of N })
repeat {
  choose a DFA state S (a set of NFA states) and a character c
  T := { q | ∃ p ∈ S such that p →c q }
  T := close(T)   /* T is a new DFA state */
} until no more states can be added.
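The close() operation and the subset-construction loop above can be sketched as follows. This is a minimal sketch; representing the NFA as plain dictionaries is an encoding assumed for illustration.

```python
def close(S, lam):
    """lambda-closure: all states reachable from S via lambda-transitions.
    lam maps a state to an iterable of its lambda-successors."""
    stack, closure = list(S), set(S)
    while stack:
        p = stack.pop()
        for q in lam.get(p, ()):
            if q not in closure:
                closure.add(q)
                stack.append(q)
    return frozenset(closure)

def subset_construction(start, delta, lam, alphabet):
    """delta maps (state, char) to the set of NFA successor states.
    Returns the DFA start state, transition table, and set of DFA states."""
    start_set = close({start}, lam)
    dfa_delta, states, worklist = {}, {start_set}, [start_set]
    while worklist:
        S = worklist.pop()
        for c in alphabet:
            T = set()
            for p in S:
                T |= delta.get((p, c), set())
            T = close(T, lam)            # T is a (possibly new) DFA state
            dfa_delta[(S, c)] = T
            if T not in states:
                states.add(T)
                worklist.append(T)
    return start_set, dfa_delta, states
```

For instance, for an NFA of (a|b)*ab with states 0,1,2, transitions (0,a)→{0,1}, (0,b)→{0}, (1,b)→{2}, and no lambda-transitions, the construction produces the three DFA states {0}, {0,1}, {0,2}.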
Example
• The subset construction always terminates, since there is only a finite number of subsets.
• The DFA is equivalent to the NFA in that they accept the same set of strings.
• The DFA may contain up to 2^n states, where n is the number of states of N!
Homework:
1. ab*a|ba*b
2. a(cda+bcda)*
3. ab*c
4. (a|(bc)*d)+
5. ((0|1*(2|3)+|0011
Step 3. Minimize the number of states.
-- Every DFA has a unique smallest equivalent DFA.
-- Use splitting to construct the equivalent minimal DFA.
Algorithm:
• Initially, there are two sets: one consisting of all accepting states of M, the other of the remaining states.
repeat {
  Choose a set A = { s1, s2, …, sn } and a character c.
  Split A into A1, A2, …, Ak so that for all i, j: if si →c ti and sj →c tj, then si and sj remain in the same set iff ti and tj are in the same set.
} until no more change.
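The splitting loop can be sketched as a simple partition refinement. This sketch assumes a total transition function given as a dictionary; it is not an efficient implementation (Hopcroft's algorithm would be), just the notes' algorithm made concrete.

```python
def minimize_partition(states, alphabet, delta, accepting):
    """Split-based minimization: start with {accepting, rest} and split a
    block whenever two of its states go to different blocks on some char.
    Returns the final partition (each block = one state of the minimal DFA)."""
    parts = [set(accepting), set(states) - set(accepting)]
    parts = [p for p in parts if p]
    changed = True
    while changed:
        changed = False
        for A in parts:
            for c in alphabet:
                # group states of A by the block containing their c-successor
                groups = {}
                for s in A:
                    t = delta[(s, c)]
                    block = next(i for i, p in enumerate(parts) if t in p)
                    groups.setdefault(block, set()).add(s)
                if len(groups) > 1:
                    parts.remove(A)          # split A into the groups
                    parts.extend(groups.values())
                    changed = True
                    break
            if changed:
                break
    return parts
```

For a DFA over {a} with states 0, 1, 2 where both 0 and 1 step to the accepting state 2 (which loops to itself), states 0 and 1 never get split apart, so the minimal DFA merges them into one state.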
Issues:
1. exponential blow-up
2. the balanced-parentheses problem - needs a CFG
3. Ada's attributes
4. ambiguity
5. the lookahead problem
Chapter 4 Grammars and Parsing
4.1 CFG (context-free grammar)
G = (Vt, Vn, S, P)
Vt: terminals
Vn: nonterminals
S: start symbol
P: productions
• Vt ∩ Vn = φ (disjoint)
• S ∈ Vn
• P: a set of rules of the form
A → X1 X2 … Xm
where A ∈ Vn, m ≥ 0, and Xi ∈ Vt ∪ Vn
L(G) = { u ∈ Vt* | S ⇒* u }
derivation : parser
leftmost derivation : top-down parser
rightmost derivation : bottom-up parser
e.g. Consider the grammar:
S → E $
E → E + T | T
T → ID | ( E )
Show the leftmost and rightmost derivations of (ID)+(ID) (and also a third derivation).
• Derivations and parse trees
Each sentence may have several derivations, but it should have a unique parse tree.
e.g. ID+(ID+ID)$
leftmost:
S ⇒ E $
⇒ E + T $
⇒ T + T $
⇒ ID + T $
⇒ ID + ( E ) $
⇒ ID + ( E + T ) $
⇒ ID + ( T + T ) $
⇒ ID + ( ID + T ) $
⇒ ID + ( ID + ID ) $
rightmost:
S ⇒ E $
⇒ E + T $
⇒ E + ( E ) $
⇒ E + ( E + T ) $
⇒ E + ( E + ID ) $
⇒ E + ( T + ID ) $
⇒ E + ( ID + ID ) $
⇒ T + ( ID + ID ) $
⇒ ID + ( ID + ID ) $
The same number of steps, but a different order.
• Not all syntactic rules are expressible using a CFG, e.g. variable-declaration rules. These are considered static semantics.
• Context-sensitive grammars: production rules may have the form
αAβ → αδβ
• Type 0 grammars: no restrictions
α → β
• CFG is a balance between expressive power and ease of parsing.
• Chomsky hierarchy:
type 0 (unrestricted): α → β
type 1 (context-sensitive): αAβ → αγβ
type 2 (context-free): A → α
type 3 (regular): A → a or A → aB
4.2 Debugging a CFG
1. useless nonterminals: nonterminals that derive no terminal strings or are unreachable from the start symbol.
e.g.
S → A | B
A → a
B → b B
C → a
B and C are useless.
Useless symbols may be removed; the resulting CFG is called reduced.
2. ambiguous grammars: a sentence may have more than one parse tree.
e.g. E → E - E
E → ID
Consider: ID - ID - ID
• Ambiguous grammars are seldom used.
• In general, it is not possible to decide whether a CFG is ambiguous. However, for certain classes of CFGs, we can.
3. grammar equivalence (two grammars generating the same language)
extended BNF → BNF:
• for A → α [ β ] γ:
A → α N γ
N → β
N → λ
• for A → α { β } γ:
A → α N γ
N → β N
N → λ
4.4 Parsers and Recognizers
• recognizers: answer YES/NO only
• parsers: also produce structure (a parse tree)
• top-down: predictive, preorder, leftmost derivation (lm), LL
bottom-up: postorder, rightmost derivation in reverse (rm), LR
• In "LL(1)": the first L says the token sequence is parsed from left to right, the second L says a leftmost derivation is produced, and the 1 means 1 token of look-ahead.
4.5 Analyzing CFGs
1. Does A ⇒* λ? (show an example)
S := φ
repeat {
  for each production P do {
    if the RHS of P contains no terminals and all nonterminals of P's RHS are members of S
    then add P's LHS to S
  }
} until no more change
2. follow(A) = { b ∈ Vt | S ⇒+ … A b … } ∪ { λ | if S ⇒+ … A }
The follow sets are useful for producing look-aheads.
3. first(α) = { b ∈ Vt | α ⇒* b … } ∪ { λ | if α ⇒* λ }
• Does A ⇒* λ? Example:
S → A B C
A → D E
B → λ
B → b A
C → B E
D → d B E
D → λ
E → λ
Running the algorithm: add B, add D, add E, then add A, add C, and finally add S.
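The marker algorithm and this example can be checked in code. A minimal sketch, assuming productions are encoded as (lhs, rhs-tuple) pairs with λ as the empty tuple; terminals are the lowercase strings:

```python
def derives_lambda(productions):
    """Marker-set algorithm: productions is a list of (lhs, rhs) pairs,
    rhs a tuple of symbols, lambda the empty tuple. Terminals never enter
    S, so the single membership test covers both 'no terminals on the RHS'
    and 'all RHS nonterminals already marked'."""
    S = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            if lhs not in S and all(x in S for x in rhs):
                S.add(lhs)
                changed = True
    return S

# The example grammar from the notes.
grammar = [
    ("S", ("A", "B", "C")),
    ("A", ("D", "E")),
    ("B", ()), ("B", ("b", "A")),
    ("C", ("B", "E")),
    ("D", ("d", "B", "E")), ("D", ()),
    ("E", ()),
]
```

Running derives_lambda(grammar) marks B, D, E on the first pass, A and C on the second, and S on the third, so every nonterminal derives λ.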
Ex. Let α = X0 X1 … Xk
1. First(α) = First(X0) - { λ }
∪ (if X0 ⇒* λ then First(X1) - { λ })
∪ (if X0 X1 ⇒* λ then First(X2) - { λ })
…
∪ (if X0 … Xk-1 ⇒* λ then First(Xk))
2. if Xi ∈ Vt then First(Xi) = { Xi }
3. if Xi ∈ Vn then for each production Xi → β,
First(Xi) = First(Xi) ∪ First(β)
Computing first sets. Initially,
if X ∈ Vt then First(X) = { X }
if X ∈ Vn then
  if X → λ then First(X) = { λ } else First(X) = φ
repeat {
  for each production X → α do
    First(X) = First(X) ∪ First(α)
} until no more change.
where, letting α = X0 X1 … Xk,
First(α) = First(X0) - { λ }
∪ (if X0 ⇒* λ then First(X1) - { λ })
∪ (if X0 X1 ⇒* λ then First(X2) - { λ })
…
∪ (if X0 … Xk-1 ⇒* λ then First(Xk))
To compute the follow sets:
Initially, Follow(START_SYMBOL) = { λ }; all other follow sets are empty.
repeat {
  for each production A → … B β do {
    Follow(B) = Follow(B) ∪ (First(β) - { λ })
    if λ ∈ First(β) then Follow(B) = Follow(B) ∪ Follow(A)
  }
} until no more change
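The two fixed-point computations above can be sketched together. Productions are encoded as (lhs, rhs-tuple) pairs and λ is modeled by a sentinel string; both are encoding assumptions of this sketch.

```python
LAM = "λ"   # sentinel standing for the null string

def first_of_string(symbols, first):
    """First(X0...Xk) as defined in the notes: take First(Xi) - {λ} while
    every earlier symbol can derive λ; include λ only if all of them can."""
    out = set()
    for X in symbols:
        out |= first[X] - {LAM}
        if LAM not in first[X]:
            return out
    out.add(LAM)          # reached only if every Xi can derive λ
    return out

def compute_first(productions, nonterminals):
    first = {A: set() for A in nonterminals}
    for _, rhs in productions:
        for X in rhs:
            if X not in nonterminals:
                first[X] = {X}   # a terminal's first set is itself
    changed = True
    while changed:
        changed = False
        for A, rhs in productions:
            f = first_of_string(rhs, first)   # empty rhs yields {λ}
            if not f <= first[A]:
                first[A] |= f
                changed = True
    return first

def compute_follow(productions, nonterminals, start, first):
    follow = {A: set() for A in nonterminals}
    follow[start] = {LAM}
    changed = True
    while changed:
        changed = False
        for A, rhs in productions:
            for i, B in enumerate(rhs):
                if B not in nonterminals:
                    continue
                f = first_of_string(rhs[i + 1:], first)
                add = f - {LAM}
                if LAM in f:                  # β can vanish: inherit Follow(A)
                    add |= follow[A]
                if not add <= follow[B]:
                    follow[B] |= add
                    changed = True
    return follow
```

For the expression grammar S → E $, E → E + T | T, T → ID | ( E ), this computes First(E) = { ID, ( } and Follow(E) = { +, $, ) }.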
Chapter 5 Grammars and Parsers
• LL(1)
– left-to-right scanning, leftmost derivation, 1-token lookahead
• Parser generators: parsing becomes the easiest! Modifying parsers is also convenient.
Example (a coffee-machine command language; the terminals are Chinese words):
(1) 咖啡機請幫我泡 1 杯熱拿鐵 ("Coffee machine, please make me 1 cup of hot latte")
(2) 咖啡機關機 / 關機 ("Coffee machine, power off" / "power off")
(3) 咖啡機拿鐵 ("Coffee machine, latte")
(4) 咖啡機請幫我泡 2 杯熱的 3 杯冰的拿鐵 ("Coffee machine, please make me 2 cups of hot and 3 cups of iced latte")
Nonterminal glosses: 受詞 = object, 數量 = quantity, 形容詞 = adjective, 飲料 = beverage, 命令 = command, 祈使語 = polite request, 人 = person, 人名 = name, 動詞 = verb, 品味 = taste, 溫度 = temperature, 攝氏度 = Celsius, 華氏度 = Fahrenheit, 咖啡 = coffee, 茶 = tea.
1 <program>→咖啡機<statement list>end
2 <statement list> → <statement><statement tail>
3 <statement tail> → <statement><statement tail>
4 <statement tail> → λ
// 5 <statement> → <受詞><數量><形容詞><飲料>
5.1 <statement> → <受詞><數量形容詞>
5.2 <數量形容詞> → <數量形容詞> <數量形容詞 tail>
5.3 <數量形容詞 tail > → <數量形容詞> <數量形容詞 tail>
5.4 <數量形容詞 tail > → λ
5.5 <數量形容詞> → <數量><形容詞> <飲料>
6< statement > →<命令>
7<受詞> → <祈使語><人><動詞>
7.1 <動詞> → 泡 | 沖 | 煮
8<受詞> → λ
9<數量> → 1杯|2杯|3杯|4杯|5杯
10<數量>→ λ
11 <形容詞>→<品味>
12<品味> → <品味> <溫度>
13 <品味> → λ
14<飲料> →<咖啡>
15 <飲料> →<茶>
15.1 <飲料> → λ
16<命令> →開機
17<命令> →關機
18 <溫度> →冰的(5)
19 <溫度> →熱的(50~70)
20<溫度> →<攝氏度>
21<溫度> →<華氏度>
22 <攝氏度> →Integer(-5~100)
23<溫度> →λ
24<咖啡> →拿鐵
25<咖啡> →藍山
26 <咖啡> →咖啡(綜合)
27 <咖啡> → 曼特寧
28<茶> →烏龍
29 <茶> →普洱
30<茶> →花茶
31 <祈使語> →請幫助 | 幫助 | 請幫
32 <祈使語> →
33 <祈使語> → λ
34<人> →我
35 <人> →<人名>
36 <人名> →鄧有光 / …
…
• An LL(1) parse table
T: Vn × Vt → P ∪ { Error }
• The definition of T:
T[A][t] = A → X1 … Xm, if t ∈ Predict(A → X1 … Xm);
T[A][t] = Error, otherwise
Building Recursive Descent Parsers from LL(1) Tables
• The form of a parsing procedure:
• E.g. a parsing procedure for <statement> in Micro
• gen_actions()
– Takes the grammar symbols and generates the actions necessary to match them in a recursive descent parse
– During parsing, the appearance of an action symbol in a production serves to initiate the corresponding semantic action - a call to the corresponding semantic routine.
gen_action("ID := <expression> #gen_assign;") generates:
match(ID);
match(ASSIGN);
exp();
assign();
match(semicolon);
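The generated matching actions can be mocked up as a small recursive-descent routine. A sketch only: the token stream as a plain list, the one-token exp(), and the placeholder assign() are all assumptions for illustration, not the book's routines.

```python
class MicroParser:
    """Recursive-descent sketch for Micro's assignment statement, mirroring
    the generated actions match(ID); match(ASSIGN); exp(); assign();
    match(SEMICOLON)."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0
        self.output = []            # stand-in for emitted target code

    def match(self, expected):
        if self.pos >= len(self.tokens) or self.tokens[self.pos] != expected:
            raise SyntaxError(f"expected {expected}")
        self.pos += 1

    def exp(self):
        self.match("ID")            # a one-token <expression> for the sketch

    def assign(self):               # placeholder semantic routine
        self.output.append("gen_assign")

    def statement(self):
        self.match("ID")
        self.match("ASSIGN")
        self.exp()
        self.assign()
        self.match("SEMICOLON")
```

MicroParser(["ID", "ASSIGN", "ID", "SEMICOLON"]).statement() parses a := b ; successfully, while leaving out the semicolon raises a SyntaxError, which is where the real parser would issue its diagnostic.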
• Major LL(1) prediction conflicts
– Common prefixes
– Left recursion
• Common prefixes
<stmt> → if <exp> then <stmt>
<stmt> → if <exp> then <stmt> else <stmt>
• Solution: the factoring transform
– See figure 5.12
<stmt> → if <exp> then <stmt list> <if suffix>
<if suffix> → end if ;
<if suffix> → else <stmt list> end if ;
• Other transformations may be needed
– a grammar with no common prefixes and no left recursion may still not be LL(1)
1 <stmt> → <label> <unlabeled stmt>
2 <label> → ID :
3 <label> → λ
4 <unlabeled stmt> → ID := <exp> ;
This grammar would require looking ahead 2 tokens. Transformed:
<stmt> → ID <suffix>
<suffix> → : <unlabeled stmt>
<suffix> → := <exp> ;
<unlabeled stmt> → ID := <exp> ;
• Example
A : B := C ;
B := C ;
• In Ada, we may declare arrays as
A: array(I .. J, BOOLEAN)
• A straightforward grammar for array bounds:
<array bound> → <expr> .. <expr>
<array bound> → ID
• Solution:
<array bound> → <expr> <bound tail>
<bound tail> → .. <expr>
<bound tail> → λ
• First try
– G1:
S → [ S CL
S → λ
CL → ]
CL → λ
• G1 is ambiguous: e.g., [[]
• Second try
– G2:
S → [ S
S → S1
S1 → [ S1 ]
S1 → λ
• G2 is not ambiguous: e.g., [[]
• The problem is
[ ∈ First([ S) and [ ∈ First(S1)
[[ ∈ First2([ S) and [[ ∈ First2(S1)
– G2 is not LL(1), nor is it LL(k) for any k.