
Lexical Analysis
Compiler
Baojian Hua
[email protected]
Compiler

  source program → compiler → target program
Front and Back Ends

  source program → front end → IR → back end → target program
Front End

  source code → lexical analyzer → tokens → parser
    → abstract syntax tree → semantic analyzer → IR
Lexical Analyzer

The lexical analyzer translates the source program into a stream of lexical tokens.

Source program:
  - a stream of characters
  - varies from language to language (ASCII or Unicode, or …)

Lexical token:
  - a compiler-internal data structure that represents the occurrence of a terminal symbol
  - varies from compiler to compiler

Conceptually:
  character sequence → lexical analyzer → token sequence
Example

Recall the min-ML language in “code3”:

prog -> decs
decs -> dec; decs
      |
dec  -> val id = exp
      | val _ = printInt exp
exp  -> id
      | num
      | exp + exp
      | true
      | false
      | if (exp) then exp else exp
      | (exp)
Example

val x = 3;
val y = 4;
val z = if (2)
        then (x)
        else y;
val _ = printInt z;

        ↓ lexical analysis

VAL IDENT(x) ASSIGN INT(3) SEMICOLON
VAL IDENT(y) ASSIGN INT(4) SEMICOLON
VAL IDENT(z) ASSIGN IF LPAREN INT(2) RPAREN THEN
  LPAREN IDENT(x) RPAREN ELSE IDENT(y) SEMICOLON
VAL UNDERSCORE ASSIGN PRINTINT IDENT(z) SEMICOLON
EOF
Lexer Implementation

Options:
  - Write a lexer by hand from scratch
    - boring, error-prone, and too much work
    - see dragon book sec. 3.4
  - Use an automatic lexer generator
    - quick and easy
Lexer Implementation

  declarative specification → lexer generator → lexical analyzer
Regular Expressions

How to specify a lexer?
  - develop another language: regular expressions
What’s a lexer generator?
  - another compiler…
Basic Definitions

  - Alphabet: the character set (say ASCII or Unicode)
  - String: a finite sequence of characters from the alphabet
  - Language: a set of strings
    - finite or infinite
    - say, the C language
Regular Expression (RE)

Construction by induction:
  - the empty string ε, denoting {ε}
  - each c ∈ alphabet, denoting {c}
  - alternation: for REs M and N, M|N
      e.g., (a|b) = {a, b}
  - concatenation: for REs M and N, MN
      e.g., (a|b)(c|d) = {ac, ad, bc, bd}
  - Kleene closure: for an RE M, M*
      e.g., (a|b)* = {ε, a, aa, b, ab, abb, baa, …}
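The inductive cases can be sketched directly as operations on finite sets of strings. This is a toy illustration (the helper names `alt`, `concat`, and `star` are my own), with Kleene closure approximated up to a bounded number of repetitions since the full language is infinite:

```python
def alt(m, n):        # M|N : union of the two languages
    return m | n

def concat(m, n):     # MN : pairwise concatenation
    return {x + y for x in m for y in n}

def star(m, k=3):     # M* approximated up to k repetitions
    lang, layer = {""}, {""}
    for _ in range(k):
        layer = concat(layer, m)
        lang |= layer
    return lang

a, b, c, d = {"a"}, {"b"}, {"c"}, {"d"}
print(sorted(concat(alt(a, b), alt(c, d))))  # ['ac', 'ad', 'bc', 'bd']
print(sorted(star(alt(a, b), k=2)))  # ['', 'a', 'aa', 'ab', 'b', 'ba', 'bb']
```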
Regular Expression

Or more formally:

e -> {}
   | c
   | e | e
   | e e
   | e*
Example

C’s identifier:
  - starts with a letter (“_” counts as a letter)
  - followed by zero or more letters or digits

  (…) (…)
  (_|a|b|…|z|A|B|…|Z) (…)
  (_|a|b|…|z|A|B|…|Z)(_|a|b|…|z|A|B|…|Z|0|…|9)
  (_|a|b|…|z|A|B|…|Z)(_|a|b|…|z|A|B|…|Z|0|…|9)*

It’s really error-prone and tedious…
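For a quick sanity check, the same identifier RE can be tried out with Python’s `re` module, using its character-class shorthand in place of the long alternations:

```python
import re

# C identifier: a letter ("_" counts), then zero or more letters/digits
c_identifier = re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")

print(bool(c_identifier.fullmatch("_tmp1")))  # True
print(bool(c_identifier.fullmatch("1tmp")))   # False: starts with a digit
```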
Syntax Sugar

More syntax sugar:
  - [a-z]   == a|b|…|z
  - e+      == one or more of e
  - e?      == zero or one of e
  - “a*”    == the literal string a* itself
  - e{i,j}  == at least i and at most j occurrences of e
  - .       == any char except \n

All of these can be translated into core RE.
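The translation into core RE is mechanical. A sketch over a tiny tuple-based RE AST (the encoding and the tag names `plus`, `opt`, etc. are my own), handling two of the sugared forms:

```python
# Core forms: ("eps",), ("chr", c), ("alt", e1, e2), ("cat", e1, e2), ("star", e)
# Sugared forms handled here: ("plus", e) for e+ and ("opt", e) for e?
def desugar(e):
    tag = e[0]
    if tag in ("eps", "chr"):
        return e
    if tag in ("alt", "cat"):
        return (tag, desugar(e[1]), desugar(e[2]))
    if tag == "star":
        return ("star", desugar(e[1]))
    if tag == "plus":                 # e+ == e e*
        d = desugar(e[1])
        return ("cat", d, ("star", d))
    if tag == "opt":                  # e? == e | ε
        return ("alt", desugar(e[1]), ("eps",))
    raise ValueError(tag)

print(desugar(("plus", ("chr", "a"))))
# ('cat', ('chr', 'a'), ('star', ('chr', 'a')))
```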
Example Revisited

C’s identifier:
  - starts with a letter (“_” counts as a letter)
  - followed by zero or more letters or digits

  (…) (…)
  (_|a|b|…|z|A|B|…|Z) (…)
  (_|a|b|…|z|A|B|…|Z)(_|a|b|…|z|A|B|…|Z|0|…|9)
  [_a-zA-Z][_a-zA-Z0-9]*

What about the keyword “if”?
Ambiguous Rules

  - A single RE is not ambiguous
  - But in a language, there may be many REs:
    - [_a-zA-Z][_a-zA-Z0-9]*
    - “if”
  - So, for a given string, which RE should match?
Ambiguous Rules

Two conventions:
  - Longest match: the regular expression that matches the longest string takes precedence.
  - Rule priority: the regular expressions identifying tokens are written down in sequence. If two regular expressions match the same (longest) string, the first regular expression in the sequence takes precedence.
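Both conventions can be seen in a small hand-rolled tokenizer. This is a sketch, not how a generated lexer works internally (a real lexer runs a DFA); the token names are hypothetical:

```python
import re

# Rules in priority order: "if" before the identifier rule
rules = [("IF", re.compile(r"if")),
         ("IDENT", re.compile(r"[_a-zA-Z][_a-zA-Z0-9]*")),
         ("NUM", re.compile(r"[0-9]+")),
         ("WS", re.compile(r"[ \t\n]+"))]

def tokenize(s):
    pos, toks = 0, []
    while pos < len(s):
        best = None                      # (length, name, text)
        for name, rx in rules:
            m = rx.match(s, pos)
            # strictly longer wins (longest match);
            # ties go to the earlier rule (rule priority)
            if m and (best is None or len(m.group()) > best[0]):
                best = (len(m.group()), name, m.group())
        if best is None:
            raise SyntaxError(f"bad char at {pos}")
        if best[1] != "WS":
            toks.append((best[1], best[2]))
        pos += best[0]
    return toks

print(tokenize("if ifx"))
# [('IF', 'if'), ('IDENT', 'ifx')] -- "ifx" is an identifier by longest match
```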
Lexer Generator History

  - Lexical analysis was once a performance bottleneck
    - certainly not true today!
  - As a result, early research investigated methods for efficient lexical analysis
  - While the performance concerns are largely irrelevant today, the tools resulting from this research are still in wide use
History: A Long-standing Goal

In this early period, a considerable amount of study went into the goal of creating an automatic compiler generator (aka compiler-compiler):

  declarative compiler specification → compiler-compiler → compiler
History: Unix and C

  - In the mid-1960’s at Bell Labs, Ritchie and others were developing Unix
  - A key part of this project was the development of C and a compiler for it
  - Johnson, in 1968, proposed the use of finite state machines for lexical analysis and developed Lex [CACM 11(12), 1968]
    - read the accompanying paper on the course page
  - Lex realized a part of the compiler-compiler goal by automatically generating fast lexical analyzers
The Lex Tool

  - The original Lex generated lexers written in C (C in C)
  - Today every major language has its own lex tool(s):
    - sml-lex, ocamllex, JLex, C#lex, …
  - Our topic next:
    - sml-lex
    - concepts and techniques apply to other tools
SML-Lex Specification

Lexical specification consists of 3 parts (yet another programming language):

  User Declarations
    (plain SML types, values, functions)
  %%
  SML-Lex Definitions
    (RE abbreviations, special stuff)
  %%
  Rules (association of REs with tokens;
    each token will be represented in plain SML)
User Declarations

User can define various values that are available to the action fragments.
Two values must be defined in this section:
  - type lexresult
    - type of the value returned by each rule action
  - fun eof ()
    - called by the lexer when the end of the input stream (EOF) is reached
SML-Lex Definitions

User can define regular expression abbreviations:

  digits = [0-9]+;
  letter = [a-zA-Z];

User can also define multiple lexers that work together, each given a unique name:

  %s lex1 lex2 lex3;
Rules

A rule has the form:

  <lexerList> regularExp => (action) ;

A rule consists of a pattern and an action:
  - the pattern is a regular expression
  - the action is a fragment of ordinary SML code
  - longest match & rule priority are used for disambiguation
  - rules may be prefixed with the list of lexers that are allowed to use this rule
Rules

Rule actions can use any value defined in the User Declarations section, including:
  - type lexresult: type of value returned by each rule action
  - val eof : unit -> lexresult: called by the lexer when the end of the input stream is reached
  - special variables:
    - yytext: input substring matched by the regular expression
    - yypos: file position of the beginning of the matched string
    - continue (): doesn’t return a token; recursively calls the lexer
Example #1

(* A language called Toy *)
prog   -> word prog
       ->
word   -> symbol
       -> number
symbol -> [_a-zA-Z][_0-9a-zA-Z]*
number -> [0-9]+
Example #1
(* Lexer Toy; see the accompanying code for details *)
datatype token = Symbol of string * int
| Number of string * int
exception End
type lexresult = unit
fun eof () = raise End
fun output x = …;
%%
letter = [_a-zA-Z];
digit = [0-9];
ld = {letter}|{digit};
symbol = {letter} {ld}*;
number = {digit}+;
%%
<INITIAL>{symbol} =>(output (Symbol(yytext, yypos)));
<INITIAL>{number} =>(output (Number(yytext, yypos)));
Example #2

(* Expression Language.
 * C-style comments, i.e., /* … */
 *)
prog -> stms
stms -> stm; stms
     ->
stm  -> id = e
     -> print e
e    -> id
     -> num
     -> e bop e
     -> (e)
bop  -> + | - | * | /
Sample Program
x = 4;
y = 5;
z = x+y*3;
print z;
Example #2 in Lex

(* Expression language; see the accompanying code
 * for details.
 * Part 1: user code
 *)
datatype token
  = Id of string * int
  | Number of string * int
  | Print of string * int
  | Plus of string * int
  | … (* all other tokens *)
exception End
type lexresult = unit
fun eof () = raise End
fun output x = …;
Example #2 in Lex, cont’

(* Expression language; see the accompanying code
 * for details.
 * Part 2: lex definitions
 *)
%%
letter = [_a-zA-Z];
digit = [0-9];
ld = {letter}|{digit};
sym = {letter} {ld}*;
num = {digit}+;
ws = [\ \t];
nl = [\n];
Example #2 in Lex, cont’

(* Expression language; see the accompanying code
 * for details.
 * Part 3: rules
 *)
%%
<INITIAL>{ws}  => (continue ());
<INITIAL>{nl}  => (continue ());
<INITIAL>"+"   => (output (Plus (yytext, yypos)));
<INITIAL>"-"   => (output (Minus (yytext, yypos)));
<INITIAL>"*"   => (output (Times (yytext, yypos)));
<INITIAL>"/"   => (output (Divide (yytext, yypos)));
<INITIAL>"("   => (output (Lparen (yytext, yypos)));
<INITIAL>")"   => (output (Rparen (yytext, yypos)));
<INITIAL>"="   => (output (Assign (yytext, yypos)));
<INITIAL>";"   => (output (Semi (yytext, yypos)));
Example #2 in Lex, cont’

(* Expression language; see the accompanying code
 * for details.
 * Part 3: rules, cont’
 *)
<INITIAL>"print" => (output (Print (yytext, yypos)));
<INITIAL>{sym}   => (output (Id (yytext, yypos)));
<INITIAL>{num}   => (output (Number (yytext, yypos)));
<INITIAL>"/*"    => (YYBEGIN COMMENT; continue ());
<COMMENT>"*/"    => (YYBEGIN INITIAL; continue ());
<COMMENT>{nl}    => (continue ());
<COMMENT>.       => (continue ());
<INITIAL>.       => (error (…));
Lex Implementation

  - Lex accepts regular expressions (along with other things)
  - So SML-lex is a compiler from REs to a lexer
  - Internals:
    RE → NFA → DFA → table-driven algorithm
Finite-state Automata (FA)

  input string → M → {Yes, No}

M = (Σ, S, q0, F, δ)
  - Σ: input alphabet
  - S: state set
  - q0: initial state
  - F: final states
  - δ: transition function
Transition Functions

  - DFA:  δ : S × Σ → S
  - NFA:  δ : S × Σ → ℘(S)
DFA Example

Which strings of a’s and b’s are accepted?

[diagram: states 0, 1, 2; b self-loops at 0 and 1, a-edges 0 -> 1 -> 2,
a,b self-loop at accepting state 2]

Transition function:
  { (q0,a)→q1, (q0,b)→q0,
    (q1,a)→q2, (q1,b)→q1,
    (q2,a)→q2, (q2,b)→q2 }
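Running this DFA is a dictionary lookup per character. A sketch, assuming state 2 is the single final state (so the machine accepts exactly the strings containing at least two a’s):

```python
# transition function from the slide, as a Python dict
delta = {(0, "a"): 1, (0, "b"): 0,
         (1, "a"): 2, (1, "b"): 1,
         (2, "a"): 2, (2, "b"): 2}

def accepts(s, start=0, final={2}):
    q = start
    for c in s:
        q = delta[(q, c)]     # DFA: exactly one next state
    return q in final

print(accepts("babab"))   # True: two a's
print(accepts("bbab"))    # False: only one a
```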
NFA Example

[diagram: two states 0 and 1, with edges as given by the transition
function below]

Transition function:
  { (q0,a)→{q0,q1}, (q0,b)→{q1},
    (q1,a)→∅, (q1,b)→{q0,q1} }
RE -> NFA:
Thompson Algorithm

  - Break the RE down into atoms
    - construct small NFAs directly for the atoms
    - inductively construct larger NFAs from the small NFAs
  - Easy to implement
    - a small recursive algorithm
RE -> NFA:
Thompson Algorithm

e -> ε
   | c
   | e1 e2
   | e1 | e2
   | e1*

[diagrams: ε gives a single ε-edge; c gives a single c-edge;
e1 e2 chains e1’s NFA into e2’s by an ε-edge; e1 | e2 forks by ε-edges
into e1 and e2 and joins their accepting states by ε-edges; e1* wraps
e1 with ε-edges for skipping and repeating]
Example

%%
letter = [_a-zA-Z];
digit = [0-9];
id = {letter} ({letter}|{digit})* ;
%%
<INITIAL>"if" => (IF (yytext, yypos));
<INITIAL>{id} => (Id (yytext, yypos));
(* Equivalent to:
 *   "if" | {id}
 *)
Example

<INITIAL>"if" => (IF (yytext, yypos));
<INITIAL>{id} => (Id (yytext, yypos));

[diagram: an NFA with an ε-fork from the start state; one branch
spells i then f for "if", the other is the {id} NFA]
NFA -> DFA:
Subset Construction Algorithm

(* subset construction: worklist algorithm *)
q0 <- ε-closure (n0)
Q <- {q0}
workList <- {q0}
while (workList != ∅)
  remove q from workList
  foreach (character c)
    t <- ε-closure (move (q, c))
    D[q, c] <- t
    if (t ∉ Q)
      add t to Q and workList
NFA -> DFA:
ε-closure

(* ε-closure: fixpoint algorithm *)
(* Dragon Fig 3.33 gives a DFS-like algorithm.
 * Here we give a recursive version. (Simpler)
 *)
X <- ∅
fun eps (t) =
  X <- X ∪ {t}
  foreach (s ∈ one-eps(t))
    if (s ∉ X)
    then eps (s)
NFA -> DFA:
ε-closure

(* ε-closure: fixpoint algorithm *)
(* Dragon Fig 3.33 gives a DFS-like algorithm.
 * Here we give a recursive version. (Simpler)
 *)
fun ε-closure (T) =
  X <- T
  foreach (t ∈ T)
    X <- X ∪ eps(t)
NFA -> DFA:
ε-closure

(* ε-closure: fixpoint algorithm *)
(* A BFS-like algorithm. *)
fun ε-closure (T) =
  Q <- T
  X <- T
  while (Q not empty)
    q <- deQueue (Q)
    foreach (s ∈ one-eps(q))
      if (s ∉ X)
        enQueue (Q, s)
        X <- X ∪ {s}
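Both pieces fit together in a short runnable sketch. The NFA representation (a list of `(src, label, dst)` edges, with `None` labels for ε) is my own; DFA states are frozensets of NFA states, as in the algorithm above:

```python
def e_closure(states, edges):
    X, stack = set(states), list(states)
    while stack:                              # walk ε-edges only
        q = stack.pop()
        for src, lab, dst in edges:
            if src == q and lab is None and dst not in X:
                X.add(dst)
                stack.append(dst)
    return frozenset(X)

def subset_construct(n0, edges, alphabet):
    q0 = e_closure({n0}, edges)
    Q, D, work = {q0}, {}, [q0]               # worklist algorithm
    while work:
        q = work.pop()
        for c in alphabet:
            move = {dst for src, lab, dst in edges if src in q and lab == c}
            t = e_closure(move, edges)
            D[(q, c)] = t
            if t not in Q:
                Q.add(t)
                work.append(t)
    return q0, Q, D

# tiny hand-written NFA for a|b: ε-edges from 0 to the two branches
edges = [(0, None, 1), (0, None, 3), (1, "a", 2), (3, "b", 4)]
q0, Q, D = subset_construct(0, edges, "ab")
print(sorted(q0))            # [0, 1, 3]
print(sorted(D[(q0, "a")]))  # [2]
```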
Example

<INITIAL>"if" => (IF (yytext, yypos));
<INITIAL>{id} => (Id (yytext, yypos));

[NFA diagram, states 0 to 8: ε-edges from start 0 to 1 and 5; the "if"
branch is 1 -i-> 2 -ε-> 3 -f-> 4; the {id} branch is
5 -[_a-zA-Z]-> 6 -ε-> 7, with a [_a-zA-Z0-9] loop at 7 and 7 -ε-> 8]
Example

q0 = {0, 1, 5}
D[q0, “i”] = {2, 3, 6, 7, 8} = q1
D[q0, _]   = {6, 7, 8}       = q2
D[q1, “f”] = {4, 7, 8}       = q3

Q = {q0} ∪ {q1} ∪ {q2} ∪ {q3}

[DFA under construction: q0 -“i”-> q1, q0 -_-> q2, q1 -“f”-> q3]
Example

D[q1, _] = {7, 8} = q4
D[q2, _] = {7, 8}
D[q3, _] = {7, 8}
D[q4, _] = {7, 8}

Q = Q ∪ {q4}

[DFA under construction: q1, q2, q3, and q4 each step to q4 on the
remaining identifier characters]
Example

q0 = {0, 1, 5}      q1 = {2, 3, 6, 7, 8}
q2 = {6, 7, 8}      q3 = {4, 7, 8}
q4 = {7, 8}

[resulting DFA: q0 -“i”-> q1, q0 -letter-“i”-> q2, q1 -“f”-> q3,
q1 -ld-“f”-> q4, q2 -ld-> q4, q3 -ld-> q4, q4 -ld-> q4]
Table-driven Algorithm

  - Conceptually, an FA is a directed graph
  - Pragmatically, there are many different strategies to encode an FA:
    - matrix (adjacency matrix; used by sml-lex)
    - array of lists (adjacency list)
    - hash table
    - jump table (switch statements; used by flex)
  - Balance between time and space
Example

<INITIAL>"if" => (IF (yytext, yypos));
<INITIAL>{id} => (Id (yytext, yypos));

state\char   “i”   “f”   letter-“i”-“f”   …   other
q0           q1    q2    q2               …   error
q1           q4    q3    q4               …   error
q2           q4    q4    q4               …   error
q3           q4    q4    q4               …   error
q4           q4    q4    q4               …   error

[DFA diagram: q0 -“i”-> q1, q0 -letter-“i”-> q2, q1 -“f”-> q3,
q1 -ld-“f”-> q4, and q2, q3, q4 each loop to q4 on ld]

state    q0    q1    q2    q3    q4
action         Id    Id    IF    Id
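The matrix encoding for this DFA can be sketched directly. The character-class indices and the `classify` helper are my own; the rows follow the table above, and the driver remembers the longest accepting prefix (longest match):

```python
LETTER_I, LETTER_F, LETTER_OTHER, DIGIT, OTHER = range(5)
ERROR = -1

def classify(c):
    if c == "i": return LETTER_I
    if c == "f": return LETTER_F
    if c == "_" or c.isalpha(): return LETTER_OTHER
    if c.isdigit(): return DIGIT
    return OTHER

#         "i"  "f"  letter  digit  other
table = [[1,   2,   2,      ERROR, ERROR],   # q0
         [4,   3,   4,      4,     ERROR],   # q1
         [4,   4,   4,      4,     ERROR],   # q2
         [4,   4,   4,      4,     ERROR],   # q3
         [4,   4,   4,      4,     ERROR]]   # q4
action = {1: "Id", 2: "Id", 3: "IF", 4: "Id"}

def run(s):
    q, last = 0, None
    for i, c in enumerate(s):
        q = table[q][classify(c)]
        if q == ERROR:
            break
        if q in action:
            last = (action[q], s[:i + 1])    # longest accepting prefix so far
    return last

print(run("if"))     # ('IF', 'if')
print(run("if9"))    # ('Id', 'if9')
```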
DFA Minimization:
Hopcroft’s Algorithm

[DFA: q0 -“i”-> q1, q0 -letter-“i”-> q2, q1 -“f”-> q3, q1 -ld-“f”-> q4,
and q2, q3, q4 each step to q4 on ld]

state    q0    q1    q2    q3    q4
action         Id    Id    IF    Id
DFA Minimization:
Hopcroft’s Algorithm

[minimized DFA: q2 and q4 merge into one state; q0 -“i”-> q1,
q0 -letter-“i”-> {q2,q4}, q1 -“f”-> q3, q1 -ld-“f”-> {q2,q4},
q3 -ld-> {q2,q4}, and {q2,q4} loops on ld]

state    q0    q1    {q2, q4}    q3
action         Id    Id          IF
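The merge of q2 and q4 can be computed by partition refinement. This sketch uses the simpler Moore-style refinement, not Hopcroft’s “smaller half” worklist, which computes the same partition faster; the concrete characters standing in for the character classes are my own:

```python
def minimize(states, alphabet, delta, action):
    # initial partition: group states by action (None = non-accepting)
    groups = {}
    for q in states:
        groups.setdefault(action.get(q), set()).add(q)
    partition = list(groups.values())

    changed = True
    while changed:
        changed = False
        for block in partition:
            for c in alphabet:
                # split a block whose members disagree on the target block
                targets = {}
                for q in block:
                    dst = delta.get((q, c))
                    idx = next((i for i, b in enumerate(partition)
                                if dst in b), None)
                    targets.setdefault(idx, set()).add(q)
                if len(targets) > 1:
                    partition.remove(block)
                    partition.extend(targets.values())
                    changed = True
                    break
            if changed:
                break
    return partition

# the example DFA: "x" stands for other letters, "9" for digits
delta = {(0, "i"): 1, (0, "f"): 2, (0, "x"): 2}
for q in (1, 2, 3, 4):
    for c in "ifx9":
        delta[(q, c)] = 4
delta[(1, "f")] = 3

parts = minimize({0, 1, 2, 3, 4}, "ifx9", delta,
                 {1: "Id", 2: "Id", 3: "IF", 4: "Id"})
print(sorted(map(sorted, parts)))   # [[0], [1], [2, 4], [3]]
```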
Summary

A lexer:
  - input: stream of characters
  - output: stream of tokens

Writing lexers by hand is boring, so we use a lexer generator: ml-lex
  - RE -> NFA -> DFA -> table-driven algorithm

Moral: don’t underestimate your theory classes!
  - a great application of cool theory developed in mathematics
  - we’ll see more cool apps as the course progresses