Role of lexical analyzer

Chapter 3
Lexical analyzer
Zhang Jing, Yu SiLiang
College of Computer Science & Technology
Harbin Engineering University

This chapter deals with the techniques of
lexical analysis: how to build a lexical
analyzer, how to construct a symbol table
that holds the tokens taken from the source
program, and how to produce a lexical
analyzer efficiently.
[email protected]
Role of lexical analyzer

The role of the lexical analyzer is to recognize the words
(tokens) of the source program. Its input is the source
program and its output is a sequence of tokens. To perform
lexical analysis, we first identify the tokens and remove
white space, line breaks, comments and other information that
is not relevant to parsing and code generation. Second, we
classify the tokens, that is, decide whether each one is an
identifier, constant, literal string, operator, keyword or
punctuation symbol (parentheses, commas and semicolons).
Third, we translate each token into an internal representation
according to its type. Finally, we enter the tokens into the
symbol table.

The place of the lexical analysis phase in the compiler
is shown in Fig. 3.1.

1. Types and representation of tokens
The tokens in a program can be divided into 5 types:
keywords, identifiers, constants, operators and
punctuation symbols (parentheses, commas and semicolons).
Type 1: Keywords. They are the words that define
commands, such as “IF” and “FOR”.
Type 2: Identifiers. They are the names of variables,
procedures, functions and so on, such as “index” and “count”.
Type 3: Constants, such as “65”, “-0.993” and “123.4”.
Type 4: Operators. For example, “+”, “*” and “>”
are all operators.
Type 5: Punctuation symbols, such as “,”, “:” and “;”.

Example 3.1

Consider one statement in a program:
Index := 2 * count +17;
After lexical analysis, its tokens are as shown in
Table 3.1.
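The tokenization in Example 3.1 can be sketched in Python. This is a minimal sketch, not the authors' program: the token class names, the regular expressions, the choice to treat ":=" as an operator, and the keyword set (only "IF" and "FOR", the two keywords mentioned earlier) are assumptions made for illustration, and the output is not meant to reproduce Table 3.1.

```python
import re

# Hypothetical token classes following the five token types of this chapter.
KEYWORDS = {"IF", "FOR"}  # assumption: only the two keywords the text mentions
TOKEN_SPEC = [
    ("CONSTANT",    r"\d+(?:\.\d+)?"),           # unsigned integers and decimals
    ("IDENTIFIER",  r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OPERATOR",    r":=|[+\-*/<>=]"),           # ":=" is tried before ":" below
    ("PUNCTUATION", r"[(),:;]"),
    ("SKIP",        r"\s+"),                     # white space is discarded
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

def tokenize(text):
    """Return a list of (type, lexeme) pairs for the input text."""
    tokens = []
    pos = 0
    while pos < len(text):
        m = MASTER.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        kind, lexeme = m.lastgroup, m.group()
        pos = m.end()
        if kind == "SKIP":
            continue
        if kind == "IDENTIFIER" and lexeme.upper() in KEYWORDS:
            kind = "KEYWORD"      # keywords look like identifiers until checked
        tokens.append((kind, lexeme))
    return tokens

for token in tokenize("Index := 2 * count +17;"):
    print(token)
```

A real lexical analyzer would also enter identifiers and constants into the symbol table; that step is left out here.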

Role of buffer in lexical analyzing


The lexical analyzer needs a buffer throughout the
compilation of the source program, because it must
look ahead several characters to decide whether they
belong to the same token. In addition, a great deal of time
is spent locating characters. Buffering techniques can
reduce the time spent scanning input characters; here
we outline only one of them.
The buffer is divided into two halves of N characters
each. While scanning, we check whether the pointer
“forward” has reached the end of one half of the buffer;
if so, we load the other half and continue scanning there.
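The two-half buffer can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the algorithm shown on the slides: the class name TwoHalfBuffer, the helper read_all and the use of an empty string to mark an empty slot are all hypothetical, and the sketch does not handle retracting the pointer across a half boundary.

```python
import io

class TwoHalfBuffer:
    """Buffer of 2*N characters that is refilled one half at a time."""

    def __init__(self, source, n=4):
        self.source = source          # any object with read(n) -> str
        self.n = n
        self.buf = [""] * (2 * n)     # "" marks a slot that holds no character
        self.forward = 0              # index of the next character to return
        self._load(0)                 # fill the first half before scanning

    def _load(self, start):
        """Read up to N characters into the half that begins at index start."""
        chunk = self.source.read(self.n)
        for i in range(self.n):
            self.buf[start + i] = chunk[i] if i < len(chunk) else ""

    def next_char(self):
        """Return the next character, or "" at the end of the input."""
        ch = self.buf[self.forward]
        if ch == "":
            return ""
        self.forward += 1
        if self.forward == self.n:           # end of the first half reached
            self._load(self.n)               # load the second half
        elif self.forward == 2 * self.n:     # end of the second half reached
            self._load(0)                    # reload the first half
            self.forward = 0                 # and wrap around
        return ch

def read_all(text, n=4):
    """Scan the whole text through a buffer with halves of n characters."""
    buf = TwoHalfBuffer(io.StringIO(text), n)
    out = []
    while (ch := buf.next_char()) != "":
        out.append(ch)
    return "".join(out)

print(read_all("Index := 2 * count +17;"))
```

With n=4 the statement passes through the buffer four characters at a time, so only 8 characters are held in memory at any moment.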

Example 3.2
Consider a statement in the source program:
Index := 2 * count +17;
The buffer that stores the statement is divided into
two halves of 4 characters each, as described below.

The algorithm for storing the statement in the buffer is
shown as follows.
Design of lexical analyzer

Before designing a lexical analyzer, we should
first draw a transition diagram. We give
several examples to explain how to draw the
state diagram and how to use it for lexical
analysis.



1. Grammar of the form U::=aW|a
The state diagram of the rule U::=aW is an arc
labeled a from state U to state W.
Similarly, the state diagram of the rule U::=a
is an arc labeled a from state U to the leaving state.

Example 3.3
Grammar G[S]:
S::=aA | bB
A::=aS | bC
B::=bS | aC
C::=bA | aB|
The state diagram of example 3.3 is shown by Fig.3.2.

Example 3.4
Grammar G[S]:
S::=+N | -N
S::=dN | d
N::=dN | d
The state diagram of example 3.4 is shown by Fig.3.3.
Note: ◎ in Figures 3.3 and 3.4 marks the leaving
(final) state.
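For right-linear rules such as those of Examples 3.3 and 3.4, the state diagram can be built mechanically: a rule U::=aW gives an arc labeled a from U to W, and a rule U::=a gives an arc labeled a from U to the leaving state. A minimal Python sketch for Example 3.4 follows. The function names and the state name "$final" are hypothetical, and d is treated as a single input symbol.

```python
FINAL = "$final"  # hypothetical name for the leaving state

def right_linear_to_diagram(rules):
    """rules: (U, a, W) for U::=aW, or (U, a, None) for U::=a.
    Returns a dict mapping (state, symbol) to the set of next states."""
    arcs = {}
    for left, symbol, right in rules:
        target = FINAL if right is None else right
        arcs.setdefault((left, symbol), set()).add(target)
    return arcs

def accepts(arcs, start, symbols):
    """Follow every possible arc; accept if the leaving state is reached."""
    current = {start}
    for a in symbols:
        current = {t for s in current for t in arcs.get((s, a), ())}
    return FINAL in current

# Grammar G[S] of Example 3.4
RULES_3_4 = [
    ("S", "+", "N"), ("S", "-", "N"),
    ("S", "d", "N"), ("S", "d", None),
    ("N", "d", "N"), ("N", "d", None),
]
ARCS_3_4 = right_linear_to_diagram(RULES_3_4)

print(accepts(ARCS_3_4, "S", "+dd"))  # True: a sign followed by two digits
print(accepts(ARCS_3_4, "S", "+"))    # False: a sign alone is not a sentence
```

Because S has two arcs labeled d (to N and to the leaving state), the sketch tracks a set of current states instead of a single state.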

2. Grammar of the form U::=a|Wa
For the regular grammar rule
U::=Wa
the state diagram is an arc labeled a from state W to state U.
For the rule U::=a, the state diagram is an arc labeled a
from the start state S to state U.
For this kind of grammar, we add a start state S (S ∉ VN) to the
state diagram.

Example 3.5
Grammar G[Z]:
Z::=Za|Aa|Bb
A::=Ba|a
B::=Ab|b
We want to construct a state diagram from this
grammar and decide whether the string
“ababaaa” belongs to the language. Fig. 3.4 shows the
construction of the state diagram of
Example 3.5 from beginning to end.

Starting from the start state S, we input the characters of
“ababaaa” one by one and finally reach the end state Z. So the
string “ababaaa” is a sentence of the grammar.
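The same check can be sketched in Python. For a left-linear grammar, a rule U::=Wa becomes an arc labeled a from W to U, and a rule U::=a becomes an arc labeled a from the added start state S to U. The function names below are hypothetical and the sketch is only an illustration of Example 3.5.

```python
START = "S"  # the added start state, assumed not to be a nonterminal

def left_linear_to_diagram(rules):
    """rules: (U, W, a) for U::=Wa, or (U, None, a) for U::=a.
    Returns a dict mapping (state, symbol) to the set of next states."""
    arcs = {}
    for left, prefix, symbol in rules:
        source = START if prefix is None else prefix
        arcs.setdefault((source, symbol), set()).add(left)
    return arcs

def recognizes(arcs, goal, text):
    """Input the characters one by one and test whether goal is reached."""
    current = {START}
    for ch in text:
        current = {t for s in current for t in arcs.get((s, ch), ())}
    return goal in current

# Grammar G[Z] of Example 3.5
RULES_3_5 = [
    ("Z", "Z", "a"), ("Z", "A", "a"), ("Z", "B", "b"),
    ("A", "B", "a"), ("A", None, "a"),
    ("B", "A", "b"), ("B", None, "b"),
]
ARCS_3_5 = left_linear_to_diagram(RULES_3_5)

print(recognizes(ARCS_3_5, "Z", "ababaaa"))  # True
```

The goal state is the start symbol Z of the grammar, because in a left-linear grammar the derivation ends where the state diagram finishes.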
Finite Automata

The aim of studying languages and grammars is
to create a lexical analyzer. Given a language
and its grammar, we can construct a transition
diagram from it. In this section we go on to form
an automaton from the transition diagram, and then
design a program that realizes the automaton,
namely a lexical analyzer.




Deterministic Finite Automata (DFA)
A finite automaton is a mathematical model of state
transitions. It is described by five elements:
(K, VT, M, S, Z)
where K is a set of states; VT is a set of input symbols;
S is the start state, S∈K; Z is the nonempty set of leaving
states, Z⊆K; and M is a transition function defined on
state-symbol pairs K×VT. M (W, a)=U means that when the
present state W accepts the input symbol “a”, the automaton
moves to the next state U.
If every move from one state to another has a unique and
definite next state, the FA is called a deterministic finite
automaton (DFA).

Example 3.5 can be described by DFA and it is shown
below.
({S,Z,A,B},{a,b},M,S,{Z})
M:
M(S,a)=A
M(S,b)=B
M(A,a)=Z
M(A,b)=B
M(B,a)=A
M(B,b)=Z
M(Z,a)=Z
Now we can decide by deduction whether the string “ababaaa”
is recognized by the DFA.
M(S, ababaaa)=M(M(S, a), babaaa)=M(A, babaaa)
=M(M(A, b), abaaa)=M(B, abaaa)=M(A, baaa)
=M(B, aaa)=M(A, aa)=M(Z, a)=Z
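The deduction applies the rule M(W, tx) = M(M(W, t), x) one character at a time. A minimal Python sketch of this recursion for the DFA of Example 3.5 follows; the function names are hypothetical, and an undefined transition is represented by None.

```python
# Transition function M of the DFA for Example 3.5
M = {
    ("S", "a"): "A", ("S", "b"): "B",
    ("A", "a"): "Z", ("A", "b"): "B",
    ("B", "a"): "A", ("B", "b"): "Z",
    ("Z", "a"): "Z",
}
FINAL_STATES = {"Z"}

def extended_m(state, text):
    """M(W, tx) = M(M(W, t), x); returns None if a transition is undefined."""
    if state is None or text == "":
        return state
    return extended_m(M.get((state, text[0])), text[1:])

def recognized(text):
    return extended_m("S", text) in FINAL_STATES

print(extended_m("S", "ababaaa"))  # prints Z
print(recognized("ababaaa"))       # prints True
```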

Example 3.6
FA=({0, 1, 2, 3},{a, b}, M, 0,{3})
where M:
M (0, a) = 1
M (0, b) = 2
M (1, a) = 3
M (2, b) = 3
M (3, a) = 3
M (3, b) = 3
The state set is K={0, 1, 2, 3}, the input symbol set is
VT={a, b}, the start state is 0 and the leaving state set is
{3}. To decide whether the string “aab” is accepted by the
FA, we apply the transition function M:
M (0, a) = 1
M (1, a) = 3
M (3, b) = 3

So the string “aab” is accepted by the FA.
Similarly, you can check whether the string “abab” is
recognized by the FA.

Example 3.7
FA=({A, B, C},{a , b}, M, A,{C})
where M:
M (A, a) = B
M (A, b) = A
M (B, a) = B
M (B, b) = C
M (C, a) = B
M (C, b) = A
“abab” is accepted by the FA, because the deduction
from the start state is:
M (A, a)=B
M (B, b)=C
M (C, a)=B
M (B, b)=C

The deduction can also be written as
M (A, abab) = M (M (A, a) , bab) = M (B,
bab) = M (M (B, b), ab) = M (C, ab) = M (M (C,
a) , b) = M (B, b) = C
Example 3.8
There is FA=({W, S, P},{t, x, ε}, M, W,{P})
M:
M (W,ε) = W
M (W,t) = S
M (S,x) = P
The question is whether “tx” is recognized by the FA.
The deduction is as follows:
M (W, ε) = W
M (W, tx) = M (M (W, t), x) = M (S, x) = P
Because P is a leaving state, “tx” is recognized by the FA.

The algorithm of DFA
Let “x” be the input string, S0 the start state,
S the state set and G the set of leaving states.

FA Program
Consider the FA=({0,1,2,3}, {a,b}, M, 0, {3})
M:
M(0,a)=1 M(0,b)=2
M(1,a)=3 M(1,b)=2
M(2,a)=1 M(2,b)=3
M(3,a)=3 M(3,b)=3
The question is whether the string “abbb”
is accepted by the FA.
The FA program is as follows.
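A table-driven version of such a program might look like the following Python sketch. It is an illustration only, not the original program of these slides; the names TABLE, SYMBOLS and run are assumptions.

```python
SYMBOLS = "ab"            # column 0 is input a, column 1 is input b
TABLE = [
    [1, 2],               # M(0,a)=1  M(0,b)=2
    [3, 2],               # M(1,a)=3  M(1,b)=2
    [1, 3],               # M(2,a)=1  M(2,b)=3
    [3, 3],               # M(3,a)=3  M(3,b)=3
]
START_STATE = 0
LEAVING_STATES = {3}

def run(text):
    """Return (accepted, trace), where trace lists the states visited."""
    state = START_STATE
    trace = [state]
    for ch in text:
        col = SYMBOLS.find(ch)
        if col < 0:                     # character outside the input alphabet
            return False, trace
        state = TABLE[state][col]
        trace.append(state)
    return state in LEAVING_STATES, trace

print(run("abbb"))  # (True, [0, 1, 2, 3, 3])
```

Storing M as a two-dimensional table keeps the scanning loop to one lookup per input character.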

The result of the FA program is shown in Fig. 3.7.

Nondeterministic Finite Automata (NFA)
Consider a grammar G with the rules
U::=Wa and V::=Wa
In the transition diagram of G, two arcs labeled a leave
state W, one to state U and one to state V.

The FA of G:
M (W, a) = U and M (W, a) = V
or
M (W, a) = {U, V}
Since the next state for a state-symbol pair is not
unique, the FA is called a nondeterministic finite
automaton (NFA).

The definition of an NFA is
(K, VT, M, S, Z)
where K is the state set; VT is a set of input symbols;
S is the start state, S∈K; Z is the nonempty set of leaving
states, Z⊆K; and M maps state-symbol pairs in K×VT to
subsets of K. M is extended to strings in VT* by
M (W, ε) = {W}
M (W, tx) = M (P1, x)∪M (P2, x)∪…∪M (Pn, x)
where {P1, P2, …, Pn}=M (W, t), t∈VT and x∈VT*.

Example 3.9
Regular grammar G[Z]:
P:
Z::=U1|V0|Z0|Z1
U::=Q1|1
V::=Q0|0
Z::=Q1
Q::=0
The state transition diagram of Example 3.9 is
shown in Figure 3.8; Z is the leaving state and S is the
start state.

From the transition diagram of Example 3.9 we see
that the next state of a state-symbol pair under M is not
unique, so G[Z] is described by an NFA.
NFA=({S, Q, U, V, Z},{0, 1}, M,{S},{Z})
where M:
M (S, 0) ={V, Q}
M (S, 1) ={U}
M (U, 0) =Φ
M (U, 1) ={Z}
M (V, 0) ={Z}
M (V, 1) =Φ
M (Q, 0) ={V}
M (Q, 1) ={U,Z}
M (Z, 0) ={Z}
M (Z, 1) ={Z}

Φ is the empty set, which does not include any state.
The deduction of the string “0111” begins from the start
state S:
M (S, 0111) = M (V, 111)∪M (Q, 111)
=Φ∪M (U, 11) ∪ M (Z, 11)
= M (Z, 1) ∪ M (Z, 1)
= M (Z, 1)
={Z}
So M (S, 0111) ={Z}. State Z is the leaving state, so the
string “0111” is accepted by the NFA.
You can try the string “101” yourself to decide whether it
is accepted by the NFA.
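The NFA deduction, where each step is the union over all possible next states, can be sketched in Python as follows. The function names are hypothetical; pairs whose value is Φ are simply left out of the dictionary.

```python
# Transition function M of the NFA in Example 3.9 (pairs mapped to Φ are omitted)
NFA_M = {
    ("S", "0"): {"V", "Q"}, ("S", "1"): {"U"},
    ("U", "1"): {"Z"},
    ("V", "0"): {"Z"},
    ("Q", "0"): {"V"},      ("Q", "1"): {"U", "Z"},
    ("Z", "0"): {"Z"},      ("Z", "1"): {"Z"},
}
LEAVING = {"Z"}

def nfa_m(state, text):
    """M(W, ε) = {W};  M(W, tx) = union of M(P, x) for every P in M(W, t)."""
    if text == "":
        return {state}
    result = set()
    for p in NFA_M.get((state, text[0]), set()):
        result |= nfa_m(p, text[1:])
    return result

def nfa_accepts(text):
    return bool(nfa_m("S", text) & LEAVING)

print(nfa_m("S", "0111"))   # {'Z'}
print(nfa_accepts("101"))   # the exercise left to the reader above
```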

Constructing DFA from NFA
Any NFA N=(K, VT, M, S, F) has a corresponding
DFA N’=(K’, VT, M’, S’, F’) with L(N)=L(N’), where:
K’ consists of subsets of K; each element of K’ is
written [Q1,Q2,…,Qm] with Qi∈K;
M’([R1,R2,…,Ri],T)=[Q1,Q2,…,Qj], where
[R1,R2,…,Ri] is an element of K’ and T∈VT;
S’=[S1, S2, …, Sn];
F’={[Sj, Sk, …, Sl] | [Sj, Sk, …, Sl]∈K’,
{Sj, Sk, …, Sl}∩F≠φ}.

Example 3.10
Grammar G[Z]:
Z::=Za|Aa|Bb
A::=Ba|Za|a
B::=Ab|Ba|b
The state set is K={S, A, B, Z}; the NFA of the
grammar is shown in Figure 3.9.
The NFA of grammar Z is
N=({S,A,B,Z},{a,b},M,{S},{Z})
M:
M(S,a)={A} M(S,b)={B}
M(A,a)={Z} M(A,b)={B}
M(B,a)={A,B} M(B,b)={Z}
M(Z,a)={A,Z}

Now we construct the DFA from the NFA,
beginning from the start state S.
K’={[S]}
M([S],a)=[A] M([S],b)=[B]
K’={[S],[A],[B]}
M([A],a)=[Z] M([A],b)=[B]
M([B],a)=[AB] M([B],b)=[Z]
K’={[S],[A],[B],[Z],[AB]}
M([Z],a)=[AZ] M([Z],b)=φ
M([AB],a)=[ABZ] M([AB],b)=[BZ]
K’={[S],[A],[B],[Z],[AB],[AZ],[BZ],[ABZ]}
M([AZ],a)=[AZ] M([AZ],b)=[B]
M([BZ],a)=[ABZ] M([BZ],b)=[Z]
M([ABZ],a)=[ABZ] M([ABZ],b)=[BZ]
From the state transitions above we obtain the state
set of the DFA, shown in the left column of Table 3.2:
K’={[S],[A],[B],[Z],[AB],[AZ],[BZ],[ABZ]}
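The subset construction of Example 3.10 can be sketched in Python. This is a minimal sketch under stated assumptions: the function names subset_construction and label are hypothetical, and DFA states are represented as frozensets of NFA states.

```python
from collections import deque

# NFA of Example 3.10 (pairs with no next state are omitted)
NFA_3_10 = {
    ("S", "a"): {"A"},      ("S", "b"): {"B"},
    ("A", "a"): {"Z"},      ("A", "b"): {"B"},
    ("B", "a"): {"A", "B"}, ("B", "b"): {"Z"},
    ("Z", "a"): {"A", "Z"},
}

def subset_construction(nfa, start, finals, alphabet):
    """Return (states, transitions, leaving states) of the equivalent DFA."""
    start_set = frozenset([start])
    seen = {start_set}
    queue = deque([start_set])
    dfa = {}
    while queue:
        current = queue.popleft()
        for a in alphabet:
            nxt = frozenset(t for s in current for t in nfa.get((s, a), ()))
            if not nxt:                 # no next state: leave the pair undefined
                continue
            dfa[(current, a)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    dfa_finals = {q for q in seen if q & finals}
    return seen, dfa, dfa_finals

def label(subset):
    """Write a subset such as {A, B} in the bracket notation [AB]."""
    return "[" + "".join(sorted(subset)) + "]"

states, dfa, dfa_finals = subset_construction(NFA_3_10, "S", {"Z"}, "ab")
print(sorted(label(q) for q in states))
print(sorted(label(q) for q in dfa_finals))
```

Only subsets reachable from [S] are generated, which is why the result has 8 states instead of all 16 subsets of K.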

The start state is still [S]. The leaving states are the
states that include the leaving state Z of K, namely
[Z], [AZ], [BZ] and [ABZ]. The DFA is shown in
Fig. 3.10.

Minimum DFA
In this section we simplify the DFA, that is, we
minimize the DFA. First we introduce some concepts:
(1) Equivalent states: states whose next states belong
to the same state set for every input character.
(2) Terminal states: states that include a leaving state.
(3) Nonterminal states: states that do not include any
leaving state.
(4) Dead states: nonterminal states from which no
terminal state can be reached.
(5) Unreachable states: states that cannot be reached from
the start state.

The algorithm for minimizing a DFA:
(1) Divide the states into two state sets, namely the
terminal states and the nonterminal states.
(2) Decide which states are equivalent, and merge the
equivalent states.
(3) Remove dead states and unreachable states.

(1) The states are divided into two state sets: the
nonterminal state set, which includes states 0, 1, 2 and 3,
and the terminal states, which are equivalent and are
merged into state 4, as shown in Fig. 3.11.
Fig. 3.11 The DFA divided into a nonterminal state set and a terminal state set
[Figure: transition diagram with states 0, 1, 2, 3 and 4 and arcs labeled a and b]
(2) Decide whether the nonterminal states are
equivalent. For the nonterminal states 0, 1, 2 and 3 we
input the characters a and b.
M(0,a)={1} M(1,a)={1}
M(2,a)={1} M(3,a)={1}
M(0,b)={2} M(1,b)={3}
M(2,b)={2} M(3,b)={4}

From the above, the next state of state 3 on input
“b” is not in the nonterminal state set, so the
nonterminal state set is divided into {3} and {0, 1, 2}.
Within {0, 1, 2}, the next state of state 1 on input “b”
is state 3, which is outside the set, so {0, 1, 2} is
divided into {1} and {0, 2}. States 0 and 2 are
therefore equivalent and should be merged, as shown in
Fig. 3.12.

Constructing DFA from State Subsets with ε-CLOSURE
Consider an NFA N=(K, VT, M, S, F) that also contains
arcs labeled “ε”.
The definition of ε-CLOSURE is:
If I is a subset of K, then ε-CLOSURE(I) is the set such that
(1) if P∈I, then P∈ε-CLOSURE(I);
(2) if P∈I, then P’∈ε-CLOSURE(I) for every state P’
that can be reached from P along a path of ε arcs.

The definition of Ia is:
Let I and J be subsets of K, where J is the set of next
states reached from the states of I along an arc labeled a
(possibly passing over ε arcs before or after that arc).
Then
Ia=ε-CLOSURE(J)
Note: ε-CLOSURE(I) is a subset of K.




Example 3.11
The NFA of the regular expression (a|b)*ab is
shown below.
Fig. 3.13 The translation system of the regular
expression (a|b)*ab
To construct a DFA from the translation system, we
remove the ε arcs in Fig. 3.13 and reconstruct it.
The state set is K={1,2,3,4,5,6,7,8,9,10,11,12}.
If subset I={1}, then ε-CLOSURE(I)= {1, 2, 3, 4,
8, 9}= Q0
M(Q0, a) = Ia={5,7,8,9,2,3,4,10,11} = Q1
M (Q0, b) = Ib = {6,7,8,9,2,3,4} = Q2
M (Q1, a) = Ia = Q1
M (Q1, b) = Ib = {6,7,8,9,2,3,4,12} = Q3
M (Q2, a) = Ia = Q1
M (Q2, b) = Ib = Q2
M (Q3, a) = Ia = Q1
M (Q3, b) = Ib = Q2
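The ε-CLOSURE and Ia computations above can be reproduced with a short Python sketch. Fig. 3.13 is not reproduced in this text, so the arcs below are an assumption: they form one Thompson-style NFA for (a|b)*ab that yields exactly the subsets Q0 to Q3 listed above. The function names are hypothetical.

```python
EPS = ""   # label used here for ε arcs

# Assumed arcs of the NFA for (a|b)*ab (the actual Fig. 3.13 may differ)
NFA_3_11 = {
    (1, EPS): {2, 8}, (2, EPS): {3, 4},
    (3, "a"): {5},    (4, "b"): {6},
    (5, EPS): {7},    (6, EPS): {7},
    (7, EPS): {2, 8}, (8, EPS): {9},
    (9, "a"): {10},   (10, EPS): {11},
    (11, "b"): {12},
}

def eps_closure(states):
    """All states reachable from the given states along ε arcs only."""
    closure = set(states)
    stack = list(closure)
    while stack:
        p = stack.pop()
        for q in NFA_3_11.get((p, EPS), ()):
            if q not in closure:
                closure.add(q)
                stack.append(q)
    return frozenset(closure)

def i_sub(subset, a):
    """Ia = ε-CLOSURE(J), where J is reached from subset on one arc labeled a."""
    return eps_closure(t for s in subset for t in NFA_3_11.get((s, a), ()))

def build_dfa(start, alphabet="ab"):
    """Name the subsets Q0, Q1, ... in order of discovery and tabulate M."""
    q0 = eps_closure({start})
    names = {q0: "Q0"}
    order = [q0]
    table = {}
    for subset in order:              # the list grows while it is traversed
        for a in alphabet:
            nxt = i_sub(subset, a)
            if not nxt:
                continue
            if nxt not in names:
                names[nxt] = f"Q{len(names)}"
                order.append(nxt)
            table[(names[subset], a)] = names[nxt]
    return names, table

names, table = build_dfa(1)
for key in sorted(table):
    print(key, "->", table[key])
```

Q3 is the only subset that contains state 12, the leaving state of the assumed NFA, so it is the leaving state of the DFA.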

The reconstructed state set is shown in
Figure 3.14.
Now we minimize the DFA above.
(1) The states are divided into two state sets: the terminal
state set {3} and the nonterminal state set {0, 1, 2}.
(2) Decide whether states 0, 1 and 2 are equivalent.
M (Q0, a) = Q1
M (Q0, b) = Q2
M (Q1, a) = Q1
M (Q1, b) = Q3
M (Q2, a) = Q1
M (Q2, b) = Q2
The next state of state 1 on input b is not in the
nonterminal state set, so the nonterminal state set is
divided into {1} and {0, 2}.


States 0 and 2 behave the same way, so they are
equivalent states and can be merged.
The minimized DFA is shown in Fig. 3.15.
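The splitting of state sets used in this minimization can be sketched in Python as a partition refinement. This is a minimal sketch, not the authors' program: the function name minimize is hypothetical, and the sketch covers only the division into terminal and nonterminal states and the merging of equivalent states; it does not remove dead or unreachable states.

```python
def minimize(states, alphabet, delta, finals):
    """Split the state sets until all states in a set behave the same way.
    Returns the final partition as a set of frozensets."""
    partition = [set(finals), set(states) - set(finals)]
    partition = [block for block in partition if block]
    changed = True
    while changed:
        changed = False
        block_of = {s: i for i, block in enumerate(partition) for s in block}
        new_partition = []
        for block in partition:
            groups = {}
            for s in block:
                # Signature: the index of the set each input character leads to
                key = tuple(block_of.get(delta.get((s, a))) for a in alphabet)
                groups.setdefault(key, set()).add(s)
            if len(groups) > 1:
                changed = True
            new_partition.extend(groups.values())
        partition = new_partition
    return {frozenset(block) for block in partition}

# DFA of Example 3.11 before minimization
DELTA_3_11 = {
    ("Q0", "a"): "Q1", ("Q0", "b"): "Q2",
    ("Q1", "a"): "Q1", ("Q1", "b"): "Q3",
    ("Q2", "a"): "Q1", ("Q2", "b"): "Q2",
    ("Q3", "a"): "Q1", ("Q3", "b"): "Q2",
}
result = minimize(["Q0", "Q1", "Q2", "Q3"], "ab", DELTA_3_11, {"Q3"})
print(sorted(sorted(block) for block in result))
# [['Q0', 'Q2'], ['Q1'], ['Q3']]
```

Each set in the result becomes one state of the minimized DFA, so Q0 and Q2 are merged as described above.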


An example of constructing a DFA from regular
expressions
Example 3.12
An unsigned number has the form “d…d.d…dESd…d”. It
covers four regular expressions:
dd* {A1}
d*·dd* {A2}
d*ESdd* {A3}
d*·dd*ESdd* {A4}
where
VT={0, … , 9, · , +, - , E}
d = 0|1|…|9
S = +|-|ε
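The four regular expressions can be tried out directly with Python's re module before any automaton is built. This is only a sketch for experimenting with A1 to A4: it assumes that the symbol "·" stands for the decimal point ".", and the names PATTERNS and classify are hypothetical.

```python
import re

D = r"[0-9]"    # d = 0|1|...|9
S = r"[+-]?"    # S = +|-|ε

# The four expressions exactly as written in Example 3.12
PATTERNS = {
    "A1": rf"{D}{D}*",
    "A2": rf"{D}*\.{D}{D}*",
    "A3": rf"{D}*E{S}{D}{D}*",
    "A4": rf"{D}*\.{D}{D}*E{S}{D}{D}*",
}

def classify(text):
    """Return the name of the first expression that matches the whole text."""
    for name, pattern in PATTERNS.items():
        if re.fullmatch(pattern, text):
            return name
    return None

for sample in ["123", "3.14", "2E-3", "1.5E+10", "1."]:
    print(sample, classify(sample))
```

Note that A3 as written allows an empty digit string before E, so a string such as "E5" also matches it.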

We want to translate the expressions into a DFA.
The translation has three steps: first, create the
transition diagrams for A1, A2, A3 and A4; second,
convert the NFA to a DFA; finally, minimize the DFA.
(1) The transition diagrams of A1, A2, A3 and A4.
The NFA for unsigned numbers is shown in Fig. 3.16.
(2) Convert the NFA to a DFA
The states of the NFA are {0, 1, … ,15}, the start
state is {0}, the leaving states are {2, 5, 9, 15}, and
the input symbols are VT={d, ·, E, S}.
ε-CLOSURE ({0}) ={0, 1, 3, 6, 10}
Id, I·, IS and IE are shown in Table 3.3. The
combination of the four FAs is shown in Fig. 3.17.

The reconstruction of Table 3.3 is shown in Table
3.4.

So the DFA is as shown in Figure 3.17.
(3) Minimize the DFA
K={A, B, C, D, E, F, G, H, I, J}
The subsets are
K1={A, C, D, F, G, I}
K2={B, E, H, J}
In K1, states F and I are equivalent, and states D and G
are equivalent, so K1 is divided into four subsets:
{A}, {C}, {D}, {F}
Similarly, in K2, states H and J are
equivalent, so K2 is divided into three subsets:
{B}, {E}, {H}

The minimized states are shown in Figure 3.18
and Table 3.5.

The program and result of Example 3.12 are
shown below.

The result of Example 3.12 is shown in
Fig. 3.20.