Chapter 2-I
Scanning
Sung-Dong Kim,
Dept. of Computer Engineering,
Hansung University.
Abstract (1)


Scanning = lexical analysis

Read source program as a file of characters

Divide it up into tokens
Token

Words of a natural language

Unit of information

Examples

Keywords: if, while, …

identifiers

Special symbols: +, -, <, >,…
(2010-1) Compiler
Abstract (2)


Pattern matching

Pattern specification → regular expression

Pattern recognition → finite automata
Efficient operation

Practical details of the scanner structure
Abstract (3)

What to do

Overview

Structure

Concepts

Regular expression

Finite state machines = finite automata

FA out of RE
Abstract (4)

Practical method for writing programs for the recognition process

Complete implementation of a scanner

Scanner generator: LEX
1. Scanning Process (1)

Token
typedef enum
{IF, THEN, ELSE, PLUS, MINUS, NUM, ID, …}
TokenType;


Reserved words: IF, THEN, …

Special symbols: PLUS, MINUS, …

Multiple strings: NUM, ID, …
String value = lexeme

“if”, “then”, “a”, “index”, “100”, …
1. Scanning Process (2)


Attribute of the token

Value associated with a token

String value attribute: “ksd”, “123”

Numeric value attribute: 123
Token record
typedef struct
{ TokenType tokenval;
char * stringval;
int numval;
} TokenRecord;
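
As an illustration of how the three fields work together, here is a minimal sketch that fills a TokenRecord from a token kind and its lexeme. The helper name makeToken is hypothetical (it does not appear in the slides), and the elided "…" entries of TokenType are simply left out.

```c
#include <stdlib.h>
#include <string.h>

/* Token kinds from the slide; the elided entries are omitted here. */
typedef enum { IF, THEN, ELSE, PLUS, MINUS, NUM, ID } TokenType;

typedef struct {
    TokenType tokenval;
    char *stringval;   /* lexeme, stored as a heap-allocated copy */
    int numval;        /* meaningful only when tokenval == NUM */
} TokenRecord;

/* Hypothetical helper: build a record from a token kind and its lexeme. */
TokenRecord makeToken(TokenType kind, const char *lexeme)
{
    TokenRecord t;
    size_t n = strlen(lexeme);

    t.tokenval = kind;
    t.stringval = malloc(n + 1);
    if (t.stringval != NULL)
        memcpy(t.stringval, lexeme, n + 1);   /* copy including the '\0' */
    /* Only NUM tokens carry a numeric value attribute. */
    t.numval = (kind == NUM) ? atoi(lexeme) : 0;
    return t;
}
```

The caller owns stringval and releases it with free() when the record is no longer needed.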
1. Scanning Process (3)

TokenType getToken(void)
Input: a[index] = 4 + 2

After return ID: the lexeme “a” has been consumed, and the next unread character is [
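
The behavior described above can be sketched as a small hand-written getToken. This is only a sketch under assumptions: the token names ASSIGN, LBRACKET, RBRACKET, ENDFILE and ERRTOK, the global input pointer, and the helpers setInput and lastLexeme are all hypothetical and chosen just to cover the input a[index] = 4 + 2.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical token set, just large enough for: a[index] = 4 + 2 */
typedef enum { ID, NUM, PLUS, ASSIGN, LBRACKET, RBRACKET, ENDFILE, ERRTOK } TokenType;

static const char *input = "";   /* points at the next unread character */
static char lexeme[64];          /* string value of the most recent token */

void setInput(const char *s) { input = s; }
const char *lastLexeme(void) { return lexeme; }

TokenType getToken(void)
{
    size_t n = 0;

    /* Skip white space before the token. */
    while (*input == ' ' || *input == '\t' || *input == '\n')
        input++;
    if (*input == '\0') {
        lexeme[0] = '\0';
        return ENDFILE;
    }
    if (isalpha((unsigned char)*input)) {        /* letter(letter|digit)* */
        while (isalnum((unsigned char)*input)) {
            if (n < sizeof lexeme - 1)
                lexeme[n++] = *input;
            input++;
        }
        lexeme[n] = '\0';
        return ID;
    }
    if (isdigit((unsigned char)*input)) {        /* digit digit* */
        while (isdigit((unsigned char)*input)) {
            if (n < sizeof lexeme - 1)
                lexeme[n++] = *input;
            input++;
        }
        lexeme[n] = '\0';
        return NUM;
    }
    /* Single-character special symbols. */
    lexeme[0] = *input;
    lexeme[1] = '\0';
    switch (*input++) {
    case '+': return PLUS;
    case '=': return ASSIGN;
    case '[': return LBRACKET;
    case ']': return RBRACKET;
    default:  return ERRTOK;
    }
}
```

Each call consumes exactly one token, so after the first call returns ID for “a” the input pointer rests on the [ character, as in the slide.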
2. Regular Expressions (1)

RE: r


Represent patterns of strings of characters
L(r)

Language generated by the RE r
2. Regular Expressions (2)

Language

Set of strings

Depends on the character set

ASCII set

More general set

Symbols: elements of the set

Alphabet (Σ): set of legal symbols
2. Regular Expressions (3)

Elements of RE

Characters from alphabet: indicate patterns

a: a

Characters with special meanings

Metacharacters, metasymbols

Escape character

Turn off the special meaning of a metacharacter

\
2.1 Definition (1)

Language by RE r = L(r)

Set of strings

Basic regular expression

Single characters

a: a → L(a) = {a}

Empty string: ε → L(ε) = {ε}

Empty set: φ → L(φ) = { }
Regular expression operations

Alternatives: |

Concatenation

Repetition (closure)
2.1 Definition (2)

Alternatives

L(r | s) = L(r) ∪ L(s)

Example


L(a | b) = L(a) ∪ L(b) = {a} ∪ {b} = {a, b}

L(a | ε) = L(a) ∪ L(ε) = {a} ∪ {ε} = {a, ε}
More than one alternative


L(a | b | c | d) = {a, b, c, d}
Others

a | b | ... | z
2.1 Definition (3)

Concatenation

L(r s) = L(r) L(s)

Example

L(a b) = L(a) L(b) = {a}{b} = {ab}

L((a|b) c) = L((a|b)) L(c) = {a,b}{c} = {ac,bc}

L((aa|b)(a|bb)) = L((aa|b)) L((a|bb)) = … = {aa,b}{a,bb} = {aaa,aabb,ba,bbb}

Extension

L(a b c d) = {abcd}
2.1 Definition (4)

Repetition

L(r*) = L(r)*

S* = {ε} ∪ S ∪ SS ∪ SSS ∪ ...

Example

L(a*) = {ε, a, aa, aaa, aaaa,...}

L((a|bb)*) = L((a|bb))* = {a,bb}* = {ε,a,bb,aa,abb,bba,bbbb,aaa,aabb,abba,abbbb,bbaa,...}
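
To make the closure example concrete, the following sketch tests whether a string belongs to L((a|bb)*). The function name inClosure is hypothetical. A left-to-right scan is enough here because every string in {a,bb}* splits into pieces a and bb in only one way.

```c
#include <stdbool.h>

/* Returns true if s is in L((a|bb)*), i.e. s is a concatenation of
   zero or more pieces, each piece being "a" or "bb". */
bool inClosure(const char *s)
{
    while (*s != '\0') {
        if (s[0] == 'a')
            s += 1;                          /* consume one "a" */
        else if (s[0] == 'b' && s[1] == 'b')
            s += 2;                          /* consume one "bb" */
        else
            return false;                    /* lone b or foreign symbol */
    }
    return true;                             /* includes the empty string ε */
}
```

For instance, inClosure("abba") holds, while inClosure("aba") does not because the b stands alone.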
2.1 Definition (5)

Precedence



* > concatenation > |
Name for RE

(0|1|2|…|9) (0|1|2|…|9)*

digit = 0|1|2|…|9

digit digit*
Example

Description → RE

RE → description
2.1 Definition (6)

Example 2.1

Σ = {a,b,c}

Set of all strings that contain exactly one b
(a|c)*b(a|c)*

Example 2.2

Σ = {a,b,c}

Set of all strings that contain at most one b
(a|c)*|(a|c)*b(a|c)*
(a|c)*(b|ε)(a|c)*
2.1 Definition (7)

Example 2.5

Σ = {a,b,c}

r = ((b|c)*a(b|c)*a)*(b|c)*

L(r) = all strings containing an even number of a’s
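
The languages of Examples 2.1, 2.2 and 2.5 can also be described by counting symbols, which gives an independent way to check a candidate string against the English description. The sketch below does that directly; it is not a regular expression matcher, and the function names are hypothetical.

```c
#include <stdbool.h>

/* Counts occurrences of ch in s.
   Returns -1 if s contains a symbol outside Σ = {a,b,c}. */
static int countIn(const char *s, char ch)
{
    int n = 0;
    for (; *s != '\0'; s++) {
        if (*s != 'a' && *s != 'b' && *s != 'c')
            return -1;
        if (*s == ch)
            n++;
    }
    return n;
}

/* Example 2.1: strings that contain exactly one b. */
bool exactlyOneB(const char *s) { return countIn(s, 'b') == 1; }

/* Example 2.2: strings that contain at most one b. */
bool atMostOneB(const char *s)
{
    int n = countIn(s, 'b');
    return n == 0 || n == 1;
}

/* Example 2.5: strings that contain an even number of a's. */
bool evenNumberOfAs(const char *s)
{
    int n = countIn(s, 'a');
    return n >= 0 && n % 2 == 0;
}
```

The empty string is accepted by atMostOneB and evenNumberOfAs but not by exactlyOneB, which matches the three regular expressions.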
2.2 Extensions (1)

One or more repetitions


Binary number

(0|1)* → wrong (also matches the empty string)

(0|1)(0|1)*

(0|1)+
Any character

.

Strings that contain at least one b: .*b.*
2.2 Extensions (2)


Range of characters

[a-z] = a|b|…|z

[0-9] = 0|1|…|9

[abc] = a|b|c

[a-zA-Z]

[A-Za-z] ≠ [A-z]
Any character not in a given set

~

~(a|b|c) = [^abc]
2.2 Extensions (3)

Optional subexpressions

r?: strings matched by r are optional
Without ?:
natural = [0-9]+
signedNatural = natural | + natural | - natural
With ?:
natural = [0-9]+
signedNatural = (+|-)? natural
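
The extended notations in this section (+, ., character ranges, [^...] and ?) can be tried out with the POSIX regex.h library, assuming a POSIX system is available. The wrapper name matches is hypothetical. Note that the ~ operator of the slides has no direct POSIX counterpart other than [^...], and that + must be escaped as \+ when it is meant literally, which is the escape character at work.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if s contains a match for the POSIX extended regular
   expression `pattern`. Anchor the pattern with ^ and $ to require
   that the whole string matches. An invalid pattern yields false. */
bool matches(const char *pattern, const char *s)
{
    regex_t re;
    bool ok;

    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return false;
    ok = (regexec(&re, s, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

With this wrapper, matches("^(0|1)+$", "1011") holds, and signedNatural can be written as "^(\\+|-)?[0-9]+$" in a C string literal.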
2.3 RE for PL Tokens (1)

Token categories


Reserved words (keywords)

Fixed strings that have special meaning in the language

if, while, do, …
Special symbols

Single character: =, …

Multiple characters: :=, ++, …
2.3 RE for PL Tokens (2)

Identifiers


Sequences of letters and digits beginning with a letter
Literals (constants)

42, 3.14159, …

“hello, world”, “a”, …
2.3 RE for PL Tokens (3)

Numbers

Natural numbers

Decimal numbers

Numbers with an exponent
nat = [0-9]+
signedNat = (+|-)? nat
number = signedNat (“.” nat) ? (E signedNat) ?
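
The three definitions translate almost line by line into a hand-written recognizer, one function per named regular expression. This is a sketch with hypothetical function names; it accepts only an uppercase E, exactly as the definition of number is written.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/* nat = [0-9]+ : returns a pointer just past the digits, or NULL. */
static const char *matchNat(const char *p)
{
    if (!isdigit((unsigned char)*p))
        return NULL;
    while (isdigit((unsigned char)*p))
        p++;
    return p;
}

/* signedNat = (+|-)? nat */
static const char *matchSignedNat(const char *p)
{
    if (*p == '+' || *p == '-')
        p++;
    return matchNat(p);
}

/* number = signedNat ("." nat)? (E signedNat)?
   Returns true only if the whole string s is a number. */
bool isNumber(const char *s)
{
    const char *p = matchSignedNat(s);
    const char *q;

    if (p == NULL)
        return false;
    if (*p == '.') {                 /* optional fraction */
        q = matchNat(p + 1);
        if (q == NULL)
            return false;            /* "." must be followed by nat */
        p = q;
    }
    if (*p == 'E') {                 /* optional exponent */
        q = matchSignedNat(p + 1);
        if (q == NULL)
            return false;            /* "E" must be followed by signedNat */
        p = q;
    }
    return *p == '\0';
}
```

Under these definitions 42, -3.14159 and 2.71E-2 are numbers, while 3. and .5 are not, because nat requires at least one digit on each side of the decimal point.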
2.3 RE for PL Tokens (4)

Reserved words
reserved = if | while | do | …

Identifiers
letter = [a-zA-Z]
digit = [0-9]
identifier = letter(letter|digit)*
2.3 RE for PL Tokens (5)

Comments

Easy cases
{this is a Pascal comment}
{ (~})* }
-- this is an Ada comment
--(~newline)*

Much more difficult case
/* this is a C comment */
ba(~(ab))*ab, writing b for / and a for *
~ is restricted to single characters, so ~(ab) must be spelled out:
b*(a*~(a|b)b*)*a*
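
Because the regular expression for C comments is awkward, a scanner can instead recognize them with a few lines of ad hoc code. The sketch below takes that alternative approach; it is not a translation of the regular expression above, and the function name skipCComment is hypothetical.

```c
#include <stddef.h>

/* If p starts with a C comment, returns a pointer just past the
   closing delimiter; otherwise (no comment here, or the comment is
   unterminated) returns NULL. */
const char *skipCComment(const char *p)
{
    if (p[0] != '/' || p[1] != '*')
        return NULL;
    p += 2;                                  /* skip the opening delimiter */
    while (*p != '\0') {
        if (p[0] == '*' && p[1] == '/')
            return p + 2;                    /* first closing delimiter ends it */
        p++;
    }
    return NULL;                             /* end of input inside comment */
}
```

Stopping at the first closing delimiter mirrors the requirement that the comment body contain no occurrence of ab.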
2.3 RE for PL Tokens (6)

Ambiguity

Keyword or identifier: if, while

Single token or two tokens: <>

Disambiguating rules

Keyword > identifier: reserved word

Single-token > several-token: principle of longest substring
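
Both disambiguating rules can be sketched in a few lines. The token names (NE for <>, LT, GT, ERRTOK) and the function names are hypothetical, and classifyWord assumes its argument has already been matched by the identifier pattern.

```c
#include <stddef.h>
#include <string.h>

typedef enum { IF, WHILE, DO, ID, LT, GT, NE, ERRTOK } TokenType;

/* Rule 1: a word that is both an identifier and a reserved word
   is treated as the reserved word. */
TokenType classifyWord(const char *word)
{
    static const struct { const char *text; TokenType tok; } reserved[] = {
        { "if", IF }, { "while", WHILE }, { "do", DO }
    };
    size_t i;

    for (i = 0; i < sizeof reserved / sizeof reserved[0]; i++)
        if (strcmp(word, reserved[i].text) == 0)
            return reserved[i].tok;
    return ID;
}

/* Rule 2: principle of longest substring. Tries the two-character
   token <> before the one-character tokens < and >.
   Stores the number of characters consumed in *len. */
TokenType scanRelop(const char *p, size_t *len)
{
    if (p[0] == '<' && p[1] == '>') { *len = 2; return NE; }
    if (p[0] == '<')                { *len = 1; return LT; }
    if (p[0] == '>')                { *len = 1; return GT; }
    *len = 0;
    return ERRTOK;
}
```

Checking the longer alternative first is what makes <> come out as one token rather than < followed by >.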
2.3 RE for PL Tokens (7)

Token delimiter

Characters that are unambiguously part of other tokens

xtemp=ytemp

Blanks, newlines, tab characters, …

while x …

Comments

do/**/if
White space pseudotoken
whitespace = (newline|blank|tab|comment)+
2.3 RE for PL Tokens (8)

Free format language

Discard white space after checking for any token delimiting effects

Lookahead


Delimiters

End token string

Not part of the token itself
xtemp=ytemp
2.3 RE for PL Tokens (9)

FORTRAN

Fixed-format language
Blanks are not significant:
I F ( X 2 . EQ. 0) THE N

is the same as

IF(X2.EQ.0)THEN
No reserved words
IF(IF.EQ.0)THENTHEN=1.0
DO99I=1,10 (DO loop)
DO99I=1.10 (assignment to the variable DO99I)
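
The last two lines show why a FORTRAN scanner needs lookahead: DO cannot be classified until the scanner has seen whether a comma follows the = sign. The sketch below captures only that one decision with a deliberately simplified heuristic (it is not a FORTRAN scanner), and the function name looksLikeDoLoop is hypothetical.

```c
#include <stdbool.h>

/* Simplified heuristic: ignoring blanks, the statement starts with DO
   and has a comma after an '=' sign, both outside parentheses.
   Returns true for DO99I=1,10 and false for DO99I=1.10. */
bool looksLikeDoLoop(const char *stmt)
{
    int depth = 0;        /* current parenthesis nesting */
    int k = 0;            /* index among the non-blank characters */
    bool seenEq = false;

    for (; *stmt != '\0'; stmt++) {
        char c = *stmt;
        if (c == ' ')
            continue;                         /* blanks are not significant */
        if (k == 0 && c != 'D') return false;
        if (k == 1 && c != 'O') return false;
        k++;
        if (c == '(')
            depth++;
        else if (c == ')')
            depth--;
        else if (c == '=' && depth == 0)
            seenEq = true;
        else if (c == ',' && depth == 0 && seenEq)
            return true;                      /* the comma decides: DO loop */
    }
    return false;                             /* no comma: assignment or other */
}
```

The scanner has to read all the way to the comma or the period before it knows whether the first two characters form a keyword.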