Chapter 2-I
Scanning
Sung-Dong Kim,
Dept. of Computer Engineering,
Hansung University.
Abstract (1)
Scanning = lexical analysis
Read source program as a file of characters
Divide it up into tokens
Token
Words of a natural language
Unit of information
Examples
Keywords: if, while, …
identifiers
Special symbols: +, -, <, >,…
(2010-1) Compiler
Abstract (2)
Pattern matching
Pattern specification → regular expression
Pattern recognition → finite automata
Efficient operation
Practical details of the scanner structure
Abstract (3)
What to do
Overview
Structure
Concepts
Regular expression
Finite state machines = finite automata
FA out of RE
Abstract (4)
Practical method for writing programs for recognition process
Complete implementation of a scanner
Scanner generator: LEX
1. Scanning Process (1)
Token
typedef enum
{IF, THEN, ELSE, PLUS, MINUS, NUM, ID, …}
TokenType;
Reserved words: IF, THEN, …
Special symbols: PLUS, MINUS, …
Multiple strings: NUM, ID, …
String value = lexeme
“if”, “then”, “a”, “index”, “100”, …
1. Scanning Process (2)
Attribute of the token
Value associated with a token
String value attribute: “ksd”, “123”
Numeric value attribute: 123
Token record
typedef struct
{ TokenType tokenval;
char * stringval;
int numval;
} TokenRecord;
Record fields: tokenval (TokenType), stringval (char *), numval (int)
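The token record above can be filled in as follows. This is a minimal sketch, not the course's implementation: the enum is cut down to the token names listed on the slide, `stringval` is declared `const char *` so that string literals can be stored safely, and the helpers `makeNumToken` and `makeIdToken` are hypothetical names introduced only for illustration.

```c
#include <stdlib.h>

/* Token kinds named on the slides; a real scanner would list more. */
typedef enum { IF, THEN, ELSE, PLUS, MINUS, NUM, ID } TokenType;

typedef struct {
    TokenType tokenval;     /* which kind of token was recognized */
    const char *stringval;  /* the lexeme, e.g. "123" or "ksd" */
    int numval;             /* numeric value, meaningful only for NUM */
} TokenRecord;

/* Hypothetical helper: build the record for a NUM token from its lexeme. */
TokenRecord makeNumToken(const char *lexeme) {
    TokenRecord t;
    t.tokenval = NUM;
    t.stringval = lexeme;
    t.numval = atoi(lexeme);  /* string value attribute -> numeric value attribute */
    return t;
}

/* Hypothetical helper: build the record for an ID token. */
TokenRecord makeIdToken(const char *lexeme) {
    TokenRecord t;
    t.tokenval = ID;
    t.stringval = lexeme;
    t.numval = 0;             /* unused for identifiers */
    return t;
}
```

For the lexeme "123" the record carries both attributes: the string "123" and the integer 123.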
1. Scanning Process (3)
TokenType getToken(void)
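Input: a[index] = 4 + 2
Before the call, the scanner is positioned at the first character: a
After return ID, the scanner is positioned at the next unread character: [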
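The behaviour of getToken on the input a[index] = 4 + 2 can be sketched as below. This is an illustrative hand-written scanner, not the one developed in the course: the token names LBRACKET, RBRACKET, ASSIGN, ENDINPUT and ERRTOKEN, the global `lexeme` buffer, and the helper `setInput` are all assumptions added for the example.

```c
#include <ctype.h>
#include <stddef.h>

typedef enum { PLUS, MINUS, NUM, ID, LBRACKET, RBRACKET, ASSIGN,
               ENDINPUT, ERRTOKEN } TokenType;

static const char *input = "";  /* rest of the input, not yet consumed */
char lexeme[64];                /* string value of the most recent token */

/* Hypothetical helper: point the scanner at a new input string. */
void setInput(const char *s) { input = s; }

TokenType getToken(void) {
    size_t n = 0;
    /* White space only separates tokens; skip it. */
    while (*input == ' ' || *input == '\t' || *input == '\n') input++;
    if (*input == '\0') { lexeme[0] = '\0'; return ENDINPUT; }
    if (isalpha((unsigned char)*input)) {       /* letter(letter|digit)* */
        while (isalnum((unsigned char)*input)) {
            if (n < sizeof lexeme - 1) lexeme[n++] = *input;
            input++;
        }
        lexeme[n] = '\0';
        return ID;
    }
    if (isdigit((unsigned char)*input)) {       /* digit digit* */
        while (isdigit((unsigned char)*input)) {
            if (n < sizeof lexeme - 1) lexeme[n++] = *input;
            input++;
        }
        lexeme[n] = '\0';
        return NUM;
    }
    lexeme[0] = *input;                         /* single-character symbols */
    lexeme[1] = '\0';
    switch (*input++) {
    case '+': return PLUS;
    case '-': return MINUS;
    case '[': return LBRACKET;
    case ']': return RBRACKET;
    case '=': return ASSIGN;
    default:  return ERRTOKEN;
    }
}
```

After `setInput("a[index] = 4 + 2")`, the first call returns ID with lexeme "a" and leaves the scanner at the `[`, matching the picture on the slide.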
2. Regular Expressions (1)
RE: r
Represent patterns of strings of characters
L(r)
Language generated by the RE r
2. Regular Expressions (2)
Language
Set of strings
Depends on the character set
ASCII set
More general set
Symbols: elements of the set
Alphabet (Σ): set of legal symbols
2. Regular Expressions (3)
Elements of RE
Characters from the alphabet: match themselves
a: matches a
Characters with special meanings
Metacharacters, metasymbols
Escape character
Turn off the special meaning of a metacharacter
\
2.1 Definition (1)
Language by RE r = L(r)
Basic regular expression
Set of strings
Single characters
a: L(a) = {a}
Empty string ε: L(ε) = {ε}
Empty set φ: L(φ) = { }
Regular expression operations
Alternatives: |
Concatenation
Repetition (closure)
2.1 Definition (2)
Alternatives
L(r | s) = L(r) ∪ L(s)
Example
L(a | b) = L(a) ∪ L(b) = {a} ∪ {b} = {a, b}
L(a | ε) = L(a) ∪ L(ε) = {a} ∪ {ε} = {a, ε}
More than one alternative
L(a | b | c | d) = {a, b, c, d}
Others
a | b | ... | z
2.1 Definition (3)
Concatenation
L(r s) = L(r) L(s)
Example
L(a b) = L(a) L(b) = {a}{b} = {ab}
L((a|b) c) = L((a|b)) L(c) = {a,b}{c} = {ac,bc}
L((aa|b)(a|bb)) = L((aa|b)) L((a|bb)) = … = {aa,b}{a,bb} = {aaa,aabb,ba,bbb}
Extension
L(a b c d) = {abcd}
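The concatenation of two finite languages can be computed directly, which reproduces the {aa,b}{a,bb} example. A minimal sketch: the function name `concatSets` and the fixed string length MAXLEN are assumptions for the illustration, and duplicates are not removed, so the result is a set only when no two pairs produce the same string.

```c
#include <stddef.h>
#include <string.h>

#define MAXLEN 16  /* assumed upper bound on the length of a result string, plus 1 */

/* Writes every string xy with x in r and y in s into out, in order.
   The caller must supply at least nr * ns rows. Returns the number written;
   pairs whose concatenation would not fit in MAXLEN are skipped. */
size_t concatSets(const char *r[], size_t nr,
                  const char *s[], size_t ns,
                  char out[][MAXLEN]) {
    size_t i, j, k = 0;
    for (i = 0; i < nr; i++) {
        for (j = 0; j < ns; j++) {
            if (strlen(r[i]) + strlen(s[j]) >= MAXLEN) continue;
            strcpy(out[k], r[i]);  /* x */
            strcat(out[k], s[j]);  /* followed by y */
            k++;
        }
    }
    return k;
}
```

Called with r = {"aa", "b"} and s = {"a", "bb"}, it fills out with aaa, aabb, ba, bbb.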
2.1 Definition (4)
Repetition
L(r*) = L(r)*
S* = {ε} ∪ S ∪ SS ∪ SSS ∪ ...
Example
L(a*) = {ε, a, aa, aaa, aaaa,...}
L((a|bb)*) = L((a|bb))* = {a,bb}* = {ε,a,bb,aa,abb,bba,bbbb,aaa,aabb,abba,abbbb,bbaa,...}
2.1 Definition (5)
Precedence
* > concatenation > |
Name for RE
(0|1|2|…|9) (0|1|2|…|9)*
digit = 0|1|2|…|9
digit digit*
Example
Description → RE
RE → description
2.1 Definition (6)
Example 2.1
Σ = {a,b,c}
Set of all strings that contain exactly one b
(a|c)*b(a|c)*
Example 2.2
Σ = {a,b,c}
Set of all strings that contain at most one b
(a|c)*|(a|c)*b(a|c)*
(a|c)*(b|ε)(a|c)*
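Examples 2.1 and 2.2 can be tried out with the POSIX regular expression functions `regcomp`, `regexec` and `regfree` from `<regex.h>`. This is a sketch that assumes a POSIX system; the wrapper name `matchesWhole` is hypothetical. The anchors `^` and `$` force the whole string to match, and the optional b is written `b?` in POSIX extended syntax.

```c
#include <regex.h>
#include <stddef.h>

/* Returns 1 if s matches the POSIX extended RE pattern, 0 if it does not,
   and -1 if the pattern itself fails to compile. */
int matchesWhole(const char *pattern, const char *s) {
    regex_t re;
    int rc;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return -1;
    rc = regexec(&re, s, 0, NULL, 0);  /* 0 means a match was found */
    regfree(&re);
    return rc == 0;
}
```

With the pattern "^(a|c)*b(a|c)*$" (exactly one b), "acbca" is accepted while "ac" and "abcb" are rejected. With "^(a|c)*b?(a|c)*$" (at most one b), "ac" and the empty string are accepted as well.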
2.1 Definition (7)
Example 2.5
Σ = {a,b,c}
r = ((b|c)*a(b|c)*a)*(b|c)*
L(r) = all strings containing an even number of a’s
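The language of Example 2.5 can also be recognized without any regular expression machinery, by remembering only whether the number of a's seen so far is even. The sketch below is a direct two-state check, a preview of the finite automata introduced later; the function name `hasEvenAs` is an assumption.

```c
/* Returns 1 if s is a string over {a,b,c} with an even number of a's,
   0 otherwise (including when s contains any other character). */
int hasEvenAs(const char *s) {
    int even = 1;                 /* state: zero a's seen so far, which is even */
    for (; *s != '\0'; s++) {
        if (*s == 'a')
            even = !even;         /* each a flips the state */
        else if (*s != 'b' && *s != 'c')
            return 0;             /* symbol outside the alphabet */
    }
    return even;
}
```

The strings "", "bcb" and "abca" are accepted; "a" and "abaca" are rejected.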
2.2 Extensions (1)
One or more repetitions
Binary number
(0|1)*: wrong, also matches the empty string
(0|1)(0|1)*: correct
(0|1)+: same language, written with +
Any character
.
Strings that contain at least one b: .*b.*
2.2 Extensions (2)
Range of characters
[a-z] = a|b|…|z
[0-9] = 0|1|…|9
[abc] = a|b|c
[a-zA-Z] = [A-Za-z]
[A-z] is not the same set: in ASCII it also includes [ \ ] ^ _ `
Any character not in a given set
~
~(a|b|c) = [^abc]
2.2 Extensions (3)
Optional subexpressions
r?: strings matched by r are optional
Without ?:
natural = [0-9]+
signedNatural = natural | + natural | - natural
With ?:
natural = [0-9]+
signedNatural = (+|-)? natural
2.3 RE for PL Tokens (1)
Token categories
Reserved words (keywords)
Fixed strings that have special meaning in the language
if, while, do, …
Special symbols
Single character: =, …
Multiple characters: :=, ++, …
2.3 RE for PL Tokens (2)
Identifiers
Sequences of letters and digits beginning with a letter
Literals (constants)
42, 3.14159, …
“hello, world”, “a”, …
2.3 RE for PL Tokens (3)
Numbers
Natural numbers
Decimal numbers
Numbers with an exponent
nat = [0-9]+
signedNat = (+|-)? nat
number = signedNat ("." nat)? (E signedNat)?
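The three definitions translate almost line by line into small matching functions, one per named regular expression. A minimal sketch, assuming the whole string must be a number; the function names `matchNat`, `matchSignedNat` and `isNumber` are hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/* nat = [0-9]+ : returns a pointer just past the digits, or NULL if there are none. */
static const char *matchNat(const char *p) {
    if (!isdigit((unsigned char)*p)) return NULL;
    while (isdigit((unsigned char)*p)) p++;
    return p;
}

/* signedNat = (+|-)? nat */
static const char *matchSignedNat(const char *p) {
    if (*p == '+' || *p == '-') p++;  /* the optional sign */
    return matchNat(p);
}

/* number = signedNat ("." nat)? (E signedNat)?
   Returns 1 only if the entire string s is a number. */
int isNumber(const char *s) {
    const char *p = matchSignedNat(s);
    const char *q;
    if (p == NULL) return 0;
    if (*p == '.') {                  /* optional fraction */
        q = matchNat(p + 1);
        if (q == NULL) return 0;
        p = q;
    }
    if (*p == 'E') {                  /* optional exponent */
        q = matchSignedNat(p + 1);
        if (q == NULL) return 0;
        p = q;
    }
    return *p == '\0';
}
```

Under these definitions "42", "-3.14" and "+6E-2" are numbers, while "3.", ".5" and "1E" are not.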
2.3 RE for PL Tokens (4)
Reserved words
reserved = if | while | do | …
Identifiers
letter = [a-zA-Z]
digit = [0-9]
identifier = letter(letter|digit)*
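The identifier definition can be checked character by character. A short sketch, assuming the default C locale so that `isalpha` and `isalnum` correspond to [a-zA-Z] and [a-zA-Z0-9]; the function name `isIdentifier` is hypothetical.

```c
#include <ctype.h>

/* identifier = letter(letter|digit)*
   Returns 1 if the entire string s is an identifier, 0 otherwise. */
int isIdentifier(const char *s) {
    if (!isalpha((unsigned char)*s)) return 0;      /* must begin with a letter */
    for (s++; *s != '\0'; s++)
        if (!isalnum((unsigned char)*s)) return 0;  /* then letters or digits only */
    return 1;
}
```

"index" and "x2" are identifiers; "2x", "a_b" and the empty string are not.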
2.3 RE for PL Tokens (5)
Comments
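Easy cases
{this is a Pascal comment}
{ (~})* }
-- this is an Ada comment
--(~newline)*
Much more difficult case
/* this is a C comment */
With ba and ab standing in for the C delimiters: ba(~(ab))*ab
Not a legal RE, since ~ is usually restricted to single characters
~(ab) written out: b*(a*~(a|b)b*)*a*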
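Because the regular expression for C comments is hard to read, scanners commonly handle them with a few lines of ad hoc code instead. The sketch below shows one such approach; the function name `skipCComment` is an assumption, and nested comments are not supported, as in C itself.

```c
#include <stddef.h>

/* If p starts a C comment, returns a pointer just past its closing
   delimiter. Returns NULL if p does not start a comment or if the
   comment is never closed. */
const char *skipCComment(const char *p) {
    if (p[0] != '/' || p[1] != '*') return NULL;  /* no opening delimiter */
    p += 2;
    while (*p != '\0') {
        if (p[0] == '*' && p[1] == '/')           /* closing delimiter found */
            return p + 2;
        p++;
    }
    return NULL;                                  /* unterminated comment */
}
```

The loop corresponds to the (~(ab))* part of the pattern: it consumes characters until the first occurrence of the closing two-character sequence.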
2.3 RE for PL Tokens (6)
Ambiguity
Keyword or identifier: if, while
Single token or two tokens: <>
Disambiguating rules
Keyword > identifier: reserved word
Single-token > several-token: principle of longest substring
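Both disambiguating rules are easy to express in code. In this sketch a word that matches the identifier pattern is first looked up in a reserved-word table, and a relational operator is scanned by trying the two-character form before the one-character forms. The token names LT, GT, NE and ERRTOKEN and the function names `classifyWord` and `scanRelop` are assumptions; the table lists only the reserved words named on the slides.

```c
#include <stddef.h>
#include <string.h>

typedef enum { IF, WHILE, DO, ID, LT, GT, NE, ERRTOKEN } TokenType;

/* Reserved-word table (only the words named on the slides). */
static const struct { const char *word; TokenType tok; } reserved[] = {
    { "if", IF }, { "while", WHILE }, { "do", DO }
};

/* Keyword > identifier: a lexeme found in the table is a reserved word. */
TokenType classifyWord(const char *lexeme) {
    size_t i;
    for (i = 0; i < sizeof reserved / sizeof reserved[0]; i++)
        if (strcmp(lexeme, reserved[i].word) == 0)
            return reserved[i].tok;
    return ID;
}

/* Longest substring: prefer the two-character token <> over < followed by >.
   Stores the number of characters consumed in *len. */
TokenType scanRelop(const char *p, size_t *len) {
    if (p[0] == '<' && p[1] == '>') { *len = 2; return NE; }
    if (p[0] == '<') { *len = 1; return LT; }
    if (p[0] == '>') { *len = 1; return GT; }
    *len = 0;
    return ERRTOKEN;
}
```

"if" is classified as IF, while "iffy" remains an ID because the whole lexeme is compared. On the input "<>x", `scanRelop` consumes two characters and returns NE.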
2.3 RE for PL Tokens (7)
Token delimiter
Characters that are unambiguously part of other tokens
xtemp=ytemp
Blanks, newlines, tab characters, …
Comments
while x …
do/**/if
White space pseudotoken
whitespace = (newline|blank|tab|comment)+
2.3 RE for PL Tokens (8)
Free format language
Discard white space after checking for any token delimiting effects
Lookahead
Delimiters
End token string
Not part of the token itself
xtemp=ytemp
2.3 RE for PL Tokens (9)
FORTRAN
Fixed-format language
Blanks are ignored: I F ( X 2 . EQ. 0) THE N is the same as IF(X2.EQ.0)THEN
No reserved words
IF(IF.EQ.0)THENTHEN=1.0
Lookahead needed
DO99I=1,10 (DO loop)
DO99I=1.10 (assignment to the variable DO99I)
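The DO example shows why a FORTRAN scanner cannot decide what DO means until it has looked far ahead. The sketch below is a deliberately simplified heuristic, not a real FORTRAN scanner: it removes blanks, then treats a statement beginning with DO as a loop only if a comma appears after the = outside any parentheses. The function name `looksLikeDoLoop` and the 128-character buffer are assumptions.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if stmt looks like a DO loop header (DO99I=1,10),
   0 if it looks like something else, such as an assignment (DO99I=1.10).
   Statements longer than the buffer are truncated. */
int looksLikeDoLoop(const char *stmt) {
    char buf[128];
    size_t n = 0;
    const char *p, *eq;
    int depth = 0;

    /* Blanks are not significant, so drop them first. */
    for (p = stmt; *p != '\0' && n < sizeof buf - 1; p++)
        if (*p != ' ') buf[n++] = *p;
    buf[n] = '\0';

    if (strncmp(buf, "DO", 2) != 0) return 0;
    eq = strchr(buf, '=');
    if (eq == NULL) return 0;

    /* Look ahead past the = for a comma that is not inside parentheses. */
    for (p = eq + 1; *p != '\0'; p++) {
        if (*p == '(') depth++;
        else if (*p == ')') depth--;
        else if (*p == ',' && depth == 0) return 1;
    }
    return 0;
}
```

The two statements on the slide differ in a single character, and only the scan past the = can tell them apart.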