Chapter 3 Lexical analyzer
Zhang Jing, Yu SiLiang
College of Computer Science & Technology, Harbin Engineering University

This chapter deals with the techniques of lexical analysis: how to build a lexical analyzer, how to construct a symbol table that holds the tokens of the source language, and how to produce a lexical analyzer efficiently.

Role of lexical analyzer
The role of the lexical analyzer is to recognize the words (tokens) of the source program. Its input is the source program and its output is a sequence of tokens. A lexical analyzer works in four steps. First, it identifies tokens and removes white space, line breaks, comments and other information that is not related to parsing and code generation. Second, it divides the tokens into types, namely identifiers, constants, literal strings, operators, keywords and punctuation symbols (parentheses, commas and semicolons). Third, it translates the tokens of each type into an internal representation. Finally, it enters them into the symbol table.

The phase of the lexical analyzer in a compiler is shown in Fig. 3.1.

1. Types and expression of token
The tokens in a program can be divided into 5 types: keywords, identifiers, constants, operators and punctuation symbols (parentheses, commas and semicolons).
Type 1: Keywords. They are the words that define commands, such as “IF”, “FOR”.
Type 2: Identifiers. They are the names of variables, procedures, functions and so on, such as “index”, “count”.
Type 3: Constants, such as “65”, “-0.993”, “123.4”.
Type 4: Operators. For example, “+”, “*” and “>” are all operators.
Type 5: Punctuation symbols, such as “,”, “:”, “;”.

Example 3.1
This is one instruction in a program.
Index := 2 * count +17;
After lexical analysis, its tokens are as shown in Table 3.1.
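The classification in Example 3.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the keyword set, the regular expressions, the treatment of “:=” as an operator and the function name tokenize are not taken from the chapter, and a leading sign such as the one in “-0.993” is reported here as a separate operator.

```python
import re

# Hypothetical keyword set; the chapter only names "IF" and "FOR" as examples.
KEYWORDS = {"IF", "FOR"}

# One named pattern per token type (assumed patterns, not from the chapter).
TOKEN_SPEC = [
    ("CONSTANT",    r"\d+(?:\.\d+)?"),
    ("IDENTIFIER",  r"[A-Za-z_][A-Za-z0-9_]*"),
    ("OPERATOR",    r":=|[+\-*/<>=]"),      # ":=" is tried before single characters
    ("PUNCTUATION", r"[,:;()]"),
    ("SKIP",        r"\s+"),                # white space is removed, not reported
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(text):
    """Return a list of (token type, lexeme) pairs for the given source text."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = MASTER.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at position {pos}")
        kind, lexeme = match.lastgroup, match.group()
        pos = match.end()
        if kind == "SKIP":
            continue
        # A word that looks like an identifier may actually be a keyword.
        if kind == "IDENTIFIER" and lexeme.upper() in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, lexeme))
    return tokens


for kind, lexeme in tokenize("Index := 2 * count +17;"):
    print(f"{kind:12} {lexeme}")
```

The sketch reports eight tokens for the instruction: two identifiers, three operators, two constants and one punctuation symbol.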
Role of buffer in lexical analyzing
The lexical analyzer needs a buffer throughout compilation, because it must look ahead several characters to decide whether they belong to the same token. In addition, a great deal of time is spent reading the input characters. Buffering techniques reduce the time spent scanning input characters; here we outline only one of them.
The buffer is divided into two halves of N characters each. While scanning, we check whether the pointer “forward” has reached the end of the first half; if so, we load the other half.

Example 3.2
There is a sentence in the source program:
Index := 2 * count +17;
The buffer that stores the sentence is separated into two halves. The first half holds 4 characters and the second half also holds 4 characters. It is illustrated below.

The algorithm for storing the sentence in the buffer is as follows.

Design of lexical analyzer
Before designing a lexical analyzer, we should first draw a transition diagram. We give several examples to explain how to draw the state diagram and how to carry out lexical analysis with it.

1. Grammar of U::=aW|a
The state diagram of the rule U::=aW is:
Similarly, the state diagram of the rule U::=a is:

Example 3.3
Grammar G[S]:
S::=aA | bB
A::=aS | bC
B::=bS | aC
C::=bA | aB|
The state diagram of Example 3.3 is shown in Fig. 3.2.

Example 3.4
Grammar G[S]:
S::=+N | -N
S::=dN | d
N::=dN | d
The state diagram of Example 3.4 is shown in Fig. 3.3.
Note: ◎ in Figure 3.3 and 3.4 represents the output (leaving) state.

2. Grammar of U::=a|Wa
Consider the regular grammar rule U::=Wa. Its state diagram is:
The state diagram of the rule U::=a is:
For this grammar, we add a start state S (S ∉ VN) to the state diagram.
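The construction for grammars of the form U::=Wa|a can be sketched as code. The sketch assumes the usual reading of the two rules: U::=Wa gives an arc labeled a from state W to state U, and U::=a gives an arc labeled a from the added start state S to state U. The function name diagram_from_left_linear and the dictionary encoding are hypothetical, every symbol is assumed to be a single character, and the grammar used as input is the one from Example 3.5.

```python
def diagram_from_left_linear(rules, start="S"):
    """Build the arcs of a state diagram from rules of the form U::=Wa or U::=a.

    rules maps each nonterminal to a list of right-hand sides, e.g. "Za" or "a".
    The result maps (source state, input symbol) to the set of target states.
    """
    arcs = {}
    for target, alternatives in rules.items():
        for rhs in alternatives:
            if len(rhs) == 2:        # U ::= Wa  gives the arc  W --a--> U
                source, symbol = rhs[0], rhs[1]
            elif len(rhs) == 1:      # U ::= a   gives the arc  S --a--> U
                source, symbol = start, rhs
            else:
                raise ValueError(f"not of the form Wa or a: {rhs!r}")
            arcs.setdefault((source, symbol), set()).add(target)
    return arcs


# Grammar G[Z] of Example 3.5.
grammar = {
    "Z": ["Za", "Aa", "Bb"],
    "A": ["Ba", "a"],
    "B": ["Ab", "b"],
}

for (source, symbol), targets in sorted(diagram_from_left_linear(grammar).items()):
    print(f"M({source}, {symbol}) = {sorted(targets)}")
```

Each (state, symbol) pair ends up with exactly one target for this grammar, so the resulting diagram can be read directly as a deterministic automaton.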
Example 3.5
Grammar G[Z]:
Z::=Za|Aa|Bb
A::=Ba|a
B::=Ab|b
We want to construct a state diagram from this grammar and decide whether the string “ababaaa” belongs to the language. Fig. 3.4 shows the procedure of generating the state diagram of Example 3.5 from beginning to end.

Starting from the start state S, we input the characters of “ababaaa” one by one and finally reach the end state Z. So the string “ababaaa” is a sentence of the grammar.

Finite Automata
We study languages and grammars in order to create a lexical analyzer. Given a language and its grammar, we can construct a transition diagram from it. In this section we go on to form an automaton from the transition diagram, and then design a program that realizes the automaton, namely the lexical analyzer.

Deterministic Finite Automata (DFA)
A finite automaton is a mathematical model of state transitions. It is described by five elements:
(K, VT, M, S, Z)
where K is a set of states; VT is a set of input symbols; S is the start state, S∈K; Z is a nonempty set of leaving states, Z ⊆ K; and M is a transition function defined on state-symbol pairs K×VT, written M(W, a)=U. Here W is the present state: when W accepts the input symbol “a”, the automaton moves to the next state U.
If every move from one state to another has a unique and definite next state, the FA is called a deterministic finite automaton (DFA).

Example 3.5 can be described by the following DFA.
({S,Z,A,B},{a,b},M,S,{Z})
M:
M(S,a)=A M(S,b)=B
M(A,a)=Z M(A,b)=B
M(B,a)=A M(B,b)=Z
M(Z,a)=Z
Now we can check, step by step, whether the string “ababaaa” is recognized by the DFA.
M(S, ababaaa) = M(M(S, a), babaaa) = M(A, babaaa) = M(M(A, b), abaaa) = M(B, abaaa) = M(A, baaa) = M(B, aaa) = M(A, aa) = M(Z, a) = Z
Since Z is the leaving state, the string is recognized.

Example 3.6
FA=({0, 1, 2, 3},{a, b}, M, 0,{3})
where M:
M(0, a) = 1 M(0, b) = 2
M(1, a) = 3 M(2, b) = 3
M(3, a) = 3 M(3, b) = 3
The state set is K={0, 1, 2, 3}, the input symbol set is VT={a, b}, the start state is 0, and the leaving state set is {3}.
To decide whether the string “aab” is accepted by the FA, we apply the transition function M:
M(0, a) = 1
M(1, a) = 3
M(3, b) = 3
The FA ends in the leaving state 3, so the string “aab” is accepted. Similarly, you can check whether the string “abab” is recognized by the FA.

Example 3.7
FA=({A, B, C},{a, b}, M, A,{C})
where M:
M(A, a) = B M(A, b) = A
M(B, a) = B M(B, b) = C
M(C, a) = B M(C, b) = A
“abab” is accepted by the FA, because the deduction from the start state is
M(A, a)=B
M(B, b)=C
M(C, a)=B
M(B, b)=C
The deduction can also be written as
M(A, abab) = M(M(A, a), bab) = M(B, bab) = M(M(B, b), ab) = M(C, ab) = M(M(C, a), b) = M(B, b) = C

Example 3.8
There is an FA=({W, S, P},{t, x, ε}, M, W,{P})
M:
M(W, ε) = W
M(W, t) = S
M(S, x) = P
The question is whether “tx” is recognized by the FA. The deduction is as follows:
M(W, ε) = W
M(W, tx) = M(M(W, t), x) = M(S, x) = P
Because P∈Z, “tx” is recognized by the FA.

The algorithm of DFA
There is an input string “x”; the start state is S0, S is the state set, and G is the set of leaving states.

FA Program
There is an FA=({0,1,2,3}, {a,b}, M, 0, {3})
M:
M(0,a)=1 M(0,b)=2
M(1,a)=3 M(1,b)=2
M(2,a)=1 M(2,b)=3
M(3,a)=3 M(3,b)=3
The question is whether the string “abbb” is accepted by the FA. The FA program is as follows.

The result of the FA program is shown in Fig. 3.7.
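A program of this kind can be sketched briefly in Python. This is a minimal sketch, not the chapter's own listing: the function name run_dfa and the dictionary encoding of M are assumptions made for illustration. It follows the DFA algorithm described above: start in the start state, apply M once per input character, and accept if the final state is a leaving state.

```python
def run_dfa(transitions, start, accepting, text):
    """Return True if the DFA ends in an accepting state after reading text."""
    state = start
    for symbol in text:
        state = transitions.get((state, symbol))
        if state is None:          # no transition defined: the string is rejected
            return False
    return state in accepting


# Transition function M of the FA ({0,1,2,3}, {a,b}, M, 0, {3}).
M = {
    (0, "a"): 1, (0, "b"): 2,
    (1, "a"): 3, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 3, (3, "b"): 3,
}

# 0 -a-> 1 -b-> 2 -b-> 3 -b-> 3, and 3 is the leaving state.
print(run_dfa(M, 0, {3}, "abbb"))   # prints True
```

Storing M as a dictionary keyed by (state, symbol) keeps the loop independent of any particular automaton, so the same function can be reused for Examples 3.6 and 3.7 by swapping in their tables.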
Nondeterministic Finite Automata (NFA)
There is a grammar G:
U::=Wa and V::=Wa
The transition diagram of G is

The FA of G:
M(W, a) = U and M(W, a) = V
or
M(W, a) = {U, V}
The next state for the state-symbol pair (W, a) is not unique, so the FA is called a nondeterministic finite automaton (NFA).

The definition of an NFA is
(K, VT, M, S, Z)
where K is the state set; VT is a set of input symbols; S is the start state, S∈K; Z is a nonempty set of leaving states, Z ⊆ K; and M maps pairs in K×VT* to subsets of K:
M(W, ε) = {W}
M(W, tx) = M(P1, x) ∪ M(P2, x) ∪ … ∪ M(Pn, x)
where {P1, …, Pn} = M(W, t), t∈VT and x∈VT*.

Example 3.9
Regular grammar G[Z]:
P: Z::=U1|V0|Z0|Z1
U::=Q1|1
V::=Q0|0
Z::=Q1
Q::=0
The state transition diagram of Example 3.9 is shown in Figure 3.8; Z is the leaving state and S is the start state.

From the state transition diagram of Example 3.9, we see that the next state of a state-symbol pair of M is not unique, so G[Z] is described by an NFA.
NFA=({S, Q, U, V, Z},{0, 1}, M,{S},{Z})
where M:
M(S, 0) ={V, Q} M(S, 1) ={U}
M(U, 0) =Φ M(U, 1) ={Z}
M(V, 0) ={Z} M(V, 1) =Φ
M(Q, 0) ={V} M(Q, 1) ={U,Z}
M(Z, 0) ={Z} M(Z, 1) ={Z}

Φ is the empty state set; it does not contain any state. The deduction of the string “0111” begins from the start state S:
M(S, 0111) = M(V, 111) ∪ M(Q, 111)
= Φ ∪ M(U, 11) ∪ M(Z, 11)
= M(Z, 1) ∪ M(Z, 1)
= M(Z, 1)
= {Z}
So M(S, 0111) = {Z}. State Z is the leaving state, so the string “0111” is accepted by the NFA. You can try the string “101” yourself to decide whether it is accepted by the NFA.

Constructing DFA from NFA
For any NFA N=(K, VT, M, S, F) there is a corresponding DFA N’=(K’, VT, M’, S’, F’), where K’ is a set of subsets of K.
Each element of K’ is written [Q1, Q2, …, Qm] with Qi∈K; M’([R1, R2, …, Ri], T) = [Q1, Q2, …, Qj], where [R1, R2, …, Ri] is an element of K’ and T∈VT; S’=[S1, S2, …, Sn]; F’={[Sj, Sk, …, Sl] | [Sj, Sk, …, Sl]∈K’, [Sj, Sk, …, Sl]∩F≠φ}; and L(N)=L(N’).

Example 3.10
Grammar G[Z]:
Z::=Za|Aa|Bb
A::=Ba|Za|a
B::=Ab|Ba|b
The state set is K={S, A, B, Z}; the NFA of the grammar is shown in Figure 3.9.

The NFA of grammar G[Z] is
N=({S,A,B,Z},{a,b},M,{S},{Z})
M:
M(S,a)={A} M(S,b)={B}
M(A,a)={Z} M(A,b)={B}
M(B,a)={A,B} M(B,b)={Z}
M(Z,a)={A,Z}

Now we construct the DFA from the NFA, beginning from the start state S.
K’={[S]}
M([S],a)=[A] M([S],b)=[B]
K’={[S],[A],[B]}
M([A],a)=[Z] M([A],b)=[B]
M([B],a)=[AB] M([B],b)=[Z]
K’={[S],[A],[B],[Z],[AB]}
M([Z],a)=[AZ] M([Z],b)=φ
M([AB],a)=[ABZ] M([AB],b)=[BZ]
K’={[S],[A],[B],[Z],[AB],[AZ],[BZ],[ABZ]}
M([AZ],a)=[AZ] M([AZ],b)=[B]
M([BZ],a)=[ABZ] M([BZ],b)=[Z]
M([ABZ],a)=[ABZ] M([ABZ],b)=[BZ]
From the state transitions above we obtain the state set of the DFA, listed in the left column of Table 3.2:
K’={[S],[A],[B],[Z],[AB],[AZ],[BZ],[ABZ]}

The start state is still S. The leaving states are the states of K’ that include the leaving state Z of K, namely [Z], [AZ], [BZ], [ABZ]. The DFA is shown in Fig. 3.10.

Minimum DFA
In this section we want to make the DFA as small as possible, namely, to minimize the DFA. First we introduce some concepts:
(1) Equivalent states: states whose next states belong to the same state set for every input character.
(2) Terminal states: states that include a leaving state.
(3) Nonterminal states: states that do not include any leaving state.
(4) Dead state: a nonterminal state that cannot reach any terminal state.
(5) Unreachable state: a state that cannot be reached from the start state.
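Concepts (4) and (5) can both be computed by a simple search over the transition table. The sketch below is an illustration under stated assumptions: it uses the DFA of Example 3.10, completed with an explicit state named EMPTY for M([Z],b)=φ, and adds one hypothetical state X that exists only to show what an unreachable state looks like. The function names are likewise hypothetical.

```python
def reachable_from(start, transitions):
    """States that can be reached from the start state (forward search)."""
    seen, stack = {start}, [start]
    while stack:
        state = stack.pop()
        for (source, _symbol), target in transitions.items():
            if source == state and target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


def can_reach_terminal(transitions, terminal):
    """States from which some terminal state can be reached (backward search)."""
    alive = set(terminal)
    changed = True
    while changed:
        changed = False
        for (source, _symbol), target in transitions.items():
            if target in alive and source not in alive:
                alive.add(source)
                changed = True
    return alive


# DFA of Example 3.10; EMPTY stands for the empty set, X is a hypothetical extra state.
table = {
    "S": ("A", "B"), "A": ("Z", "B"), "B": ("AB", "Z"), "Z": ("AZ", "EMPTY"),
    "AB": ("ABZ", "BZ"), "AZ": ("AZ", "B"), "BZ": ("ABZ", "Z"), "ABZ": ("ABZ", "BZ"),
    "EMPTY": ("EMPTY", "EMPTY"), "X": ("S", "X"),
}
transitions = {}
for state, (on_a, on_b) in table.items():
    transitions[(state, "a")] = on_a
    transitions[(state, "b")] = on_b

states = set(table)
terminal = {"Z", "AZ", "BZ", "ABZ"}

unreachable = states - reachable_from("S", transitions)
dead = states - can_reach_terminal(transitions, terminal)
print("unreachable:", sorted(unreachable))   # prints unreachable: ['X']
print("dead:", sorted(dead))                 # prints dead: ['EMPTY']
```

Note that X is unreachable but not dead, because it still has a path to a terminal state; the two properties are independent of each other.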
The algorithm of minimum DFA:
(1) Divide the states into two state sets, namely, terminal states and nonterminal states.
(2) Decide which states are equivalent, and merge equivalent states.
(3) Remove dead states and unreachable states.

The steps are illustrated on the DFA of Fig. 3.11.
(1) The states are divided into two state sets: the nonterminal state set, which includes states 0, 1, 2, 3, and the terminal states, which are equivalent and are merged into state 4, as shown in Fig. 3.11.
Fig. 3.11 The DFA divided into a nonterminal state set and a terminal state set

(2) Decide whether the nonterminal states are equivalent. For the nonterminal states 0, 1, 2, 3, we input the characters a and b.
M(0,a)={1} M(1,a)={1} M(2,a)={1} M(3,a)={1}
M(0,b)={2} M(1,b)={3} M(2,b)={2} M(3,b)={4}

From the above, the next state of state 3 on input “b” is not in the nonterminal state set, so the nonterminal state set is divided into the set {3} and the set {0, 1, 2}. Within {0, 1, 2}, the next state of state 1 on input “b” is not in the set, so {0, 1, 2} is divided into {1} and {0, 2}. States 0 and 2 are therefore equivalent states and are merged, as shown in Fig. 3.12.

Constructing DFA from State Subset of ε-CLOSURE
Consider an NFA N=(K, VT, M, S, F) that also contains arcs labeled “ε”. The definition of ε-CLOSURE is: if I is a subset of K, then ε-CLOSURE(I) is the set such that
(1) if P∈I, then P∈ε-CLOSURE(I);
(2) if P∈I, then P’∈ε-CLOSURE(I), where P’ is any state reached from P along a path of ε arcs.

The definition of Ia is: let I and J be subsets of K, where J is the set of states reached from the states of I along an arc labeled a (possibly passing over ε arcs before or after it). Then
Ia=ε-CLOSURE(J)
Note: ε-CLOSURE(I) is a subset of K.
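The two definitions translate almost directly into code. The sketch below is an illustration, not the chapter's own program. The arc lists EPS and ARCS are an assumption: they describe one ε-NFA over the states 1 to 12 that is consistent with the state sets listed in Example 3.11 for (a|b)*ab, and are not copied from Fig. 3.13. The function names are hypothetical as well.

```python
from collections import deque

# Assumed ε arcs and labeled arcs (one NFA consistent with Example 3.11).
EPS = {1: {2}, 2: {3, 4, 8}, 5: {7}, 6: {7}, 7: {2}, 8: {9}, 10: {11}}
ARCS = {(3, "a"): {5}, (4, "b"): {6}, (9, "a"): {10}, (11, "b"): {12}}


def epsilon_closure(states):
    """All states in `states` plus every state reachable along ε arcs."""
    closure = set(states)
    stack = list(states)
    while stack:
        p = stack.pop()
        for q in EPS.get(p, ()):
            if q not in closure:
                closure.add(q)
                stack.append(q)
    return frozenset(closure)


def step(states, symbol):
    """I_a = ε-CLOSURE(J), where J is reached from I by one arc labeled symbol."""
    j = set()
    for p in states:
        j |= ARCS.get((p, symbol), set())
    return epsilon_closure(j)


def subset_construction(start, alphabet):
    """Name each reachable state subset Q0, Q1, ... and build the DFA table."""
    q0 = epsilon_closure({start})
    names = {q0: "Q0"}
    table = {}
    queue = deque([q0])
    while queue:
        current = queue.popleft()
        for symbol in alphabet:
            target = step(current, symbol)
            if not target:             # empty set: no transition
                continue
            if target not in names:
                names[target] = f"Q{len(names)}"
                queue.append(target)
            table[(names[current], symbol)] = names[target]
    return names, table


names, table = subset_construction(1, "ab")
for subset, name in names.items():
    print(name, sorted(subset))
for (state, symbol), target in sorted(table.items()):
    print(f"M({state}, {symbol}) = {target}")
```

Using frozenset for each subset lets the subsets serve as dictionary keys, which is what makes it possible to detect that a newly computed Ia has already been seen.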
Example 3.11
The NFA of the regular expression (a|b)*ab is shown below.
Fig. 3.13 The translation system of regular expression (a|b)*ab

To construct a DFA from the translation system, we should remove the ε arcs in Fig. 3.13 and reconstruct it.
The state set is K={1,2,3,4,5,6,7,8,9,10,11,12}.
If the subset I={1}, then
ε-CLOSURE(I) = {1, 2, 3, 4, 8, 9} = Q0
M(Q0, a) = Ia = {5,7,8,9,2,3,4,10,11} = Q1
M(Q0, b) = Ib = {6,7,8,9,2,3,4} = Q2
M(Q1, a) = Ia = Q1
M(Q1, b) = Ib = {6,7,8,9,2,3,4,12} = Q3
M(Q2, a) = Ia = Q1
M(Q2, b) = Ib = Q2
M(Q3, a) = Ia = Q1
M(Q3, b) = Ib = Q2

The reconstructed state set is shown in Figure 3.14.

Now we minimize the DFA above.
(1) The states are divided into two state sets: the terminal state set {3}, and the nonterminal state set that includes states 0, 1, 2.
(2) Decide whether states 0, 1, 2 are equivalent states.
M(Q0, a) = Q1 M(Q0, b) = Q2
M(Q1, a) = Q1 M(Q1, b) = Q3
M(Q2, a) = Q1 M(Q2, b) = Q2
The next state of state 1 on input b is not in the nonterminal state set, so the nonterminal state set is divided into the set {1} and the set {0, 2}.

States 0 and 2 behave the same way, so they are equivalent states and can be merged. The minimum DFA is shown in Fig. 3.15.

An example for constructing DFA from regular expression
Example 3.12
An unsigned number has the form “d…d.d…dESd…d”. It covers four regular expressions:
dd* {A1}
d*·dd* {A2}
d*ESdd* {A3}
d*·dd*ESdd* {A4}
where VT={0, … , 9, · , +, - , E}
d = 0|1|…|9
S = +|-|ε

We want to translate the expressions into a DFA. The translation includes three steps. First, create the transition diagrams for A1, A2, A3 and A4. Second, convert the NFA to a DFA. Finally, minimize the DFA.
(1) The transition diagrams of A1, A2, A3 and A4.
The NFA for unsigned numbers is shown in Fig. 3.16.
(2) Convert the NFA to a DFA.
The states of the NFA are {0, 1, … , 15}, the start state is {0}, the leaving states are {2, 5, 9, 15}, and the input symbols are VT={d, ·, E, S}.
ε-CLOSURE({0}) = {0, 1, 3, 6, 10}
Id, I·, IS and IE are shown in Table 3.3. The combination of the four FA is shown in Fig. 3.17.

The reconstruction of Table 3.3 is shown in Table 3.4.

So the DFA is as shown in Figure 3.17.

(3) Minimize the DFA.
K={A, B, C, D, E, F, G, H, I, J}
The subsets are
K1={A, C, D, F, G, I}
K2={B, E, H, J}
Within K1, F and I are equivalent, and D and G are equivalent, so K1 is divided into four subsets:
{A}, {C}, {D}, {F}
Similarly, within K2, H and J are equivalent, so K2 is divided into three subsets:
{B}, {E}, {H}

The minimum states are shown in Figure 3.18 and Table 3.5.

The program and result of Example 3.12 are shown below.

The result of Example 3.12 is shown in Fig. 3.20.
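A recognizer for the unsigned numbers of Example 3.12 can be sketched as a table-driven DFA. This is a minimal sketch under stated assumptions, not the chapter's program: the state numbers 0 to 6, the transition table and the function names are written by hand from the four regular expressions A1 to A4 rather than taken from Table 3.5, and the ASCII character "." stands in for the symbol “·”. The table has seven states, the same number as the subsets obtained in the minimization above.

```python
def classify(ch):
    """Map an input character to one of the symbol classes d, ., E, S."""
    if ch in "0123456789":
        return "d"
    if ch == ".":
        return "."
    if ch == "E":
        return "E"
    if ch in "+-":
        return "S"
    return None


# Hand-built transition table (assumed state numbering).
TRANSITIONS = {
    (0, "d"): 1, (0, "."): 2, (0, "E"): 4,   # start
    (1, "d"): 1, (1, "."): 2, (1, "E"): 4,   # integer part read (A1)
    (2, "d"): 3,                             # after the point, a digit is required
    (3, "d"): 3, (3, "E"): 4,                # fraction read (A2)
    (4, "d"): 6, (4, "S"): 5,                # after E, optional sign
    (5, "d"): 6,                             # after the sign, a digit is required
    (6, "d"): 6,                             # exponent digits read (A3, A4)
}
ACCEPTING = {1, 3, 6}


def is_unsigned_number(text):
    """Return True if text matches one of the expressions A1 to A4."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, classify(ch)))
        if state is None:
            return False
    return state in ACCEPTING


for sample in ["17", "123.4", "2E5", "1.5E-3", "12.", "1E+"]:
    print(f"{sample!r}: {is_unsigned_number(sample)}")
```

The first four samples are accepted and the last two are rejected, because A2 requires at least one digit after the point and A3 requires at least one digit after the optional sign.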