CS 404
Introduction to Compiler Design
Lecture 1
Ahmed Ezzat
Lexical Analysis & Regular Expressions
Lecture 1
Administration
Introduction to Compilers
– What is a compiler
– Compiler phases
Lexical Analysis
– Tokens, lexemes
– Regular expressions
An Introduction to Compilers
A compiler is a program translator
Source language → Target language
Compiler examples: cc, gcc, javac
Phases of a Compiler
Front end == analysis
– Lexical analysis, Syntax analysis, Semantic analysis
– Language dependent, machine independent
Back end == synthesis
– Intermediate code generation, Optimization, Target code generation
– Language independent, machine dependent
Examples of Compiler Phases
Lexical analysis
– Scanning: transforms characters into “tokens”
Syntax analysis
– Parsing: transforms token streams into “parse trees”
– Structural analysis
Examples of Compiler Phases (2)
Semantic analysis: checks whether the input program “makes sense”
– Example: type checking
Intermediate code generation
– Example: Three Address Code (TAC)
Code optimization
Target code generation
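As a rough illustration of three-address code, the sketch below flattens a nested expression into instructions that each use at most one operator. The tuple-based expression tree, the `to_tac` name, and the `t1, t2, ...` temporaries are assumptions made for this example, not something the lecture specifies.

```python
from itertools import count


def to_tac(expr):
    """Return (instructions, result_name) for a nested-tuple expression.

    A leaf is a variable name or constant given as a str; an inner node is
    (operator, left, right). This representation is hypothetical.
    """
    code = []
    temps = count(1)

    def walk(node):
        if isinstance(node, str):
            return node                       # a leaf is already an address
        op, left, right = node
        left_addr = walk(left)
        right_addr = walk(right)
        temp = f"t{next(temps)}"              # fresh temporary for this result
        code.append(f"{temp} = {left_addr} {op} {right_addr}")
        return temp

    return code, walk(expr)


# Source statement: x = a + b * c
code, result = to_tac(("+", "a", ("*", "b", "c")))
code.append(f"x = {result}")
print("\n".join(code))
# t1 = b * c
# t2 = a + t1
# x = t2
```

Each emitted line has one operator and at most three addresses, which is what gives the form its name.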
Compiler Issues
Symbol table: a data structure containing a record for each identifier, with attributes of the identifier
Error handling: detection, reporting, recovery
Compiler passes: one pass versus multiple passes
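A symbol table as described above can be sketched as a dictionary keyed by identifier name. The attribute set (`type_name`, `line`) and the class and method names are assumptions chosen for illustration; a real compiler typically records more and handles nested scopes.

```python
from dataclasses import dataclass


@dataclass
class Symbol:
    """One record per identifier; the attributes here are illustrative."""
    name: str
    type_name: str
    line: int          # line on which the identifier was declared


class SymbolTable:
    def __init__(self):
        self._symbols = {}

    def declare(self, name, type_name, line):
        # Error detection: the same name declared twice in this table.
        if name in self._symbols:
            raise KeyError(f"redeclaration of {name!r}")
        self._symbols[name] = Symbol(name, type_name, line)

    def lookup(self, name):
        # Returns None when the identifier was never declared.
        return self._symbols.get(name)


table = SymbolTable()
table.declare("count2", "int", 3)
print(table.lookup("count2"))
```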
Working Together With Compilers
Pre-processors: macros, file handling
Assembler: from assembly code to machine code
Loaders: place instructions and data in memory
Linkers: link several target programs together
Lexical Analysis
Source language → token streams
Token: e.g., identifier, constant, keyword
– Classes of sequences of characters
– Satisfy certain patterns (or rules)
– Data structure returned by the lexical analyzer
Lexeme: e.g., my_id, count2
– A string that matches a pattern
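To make the token/lexeme distinction concrete, here is a minimal scanner sketch built on Python's `re` module. The token classes follow the slide (identifier, constant, keyword); the specific keyword list, the `OP` class, and the names `TOKEN_SPEC` and `tokenize` are assumptions for this example only.

```python
import re

# (token class, pattern) pairs; earlier entries win, so keywords come first.
TOKEN_SPEC = [
    ("KEYWORD",    r"\b(?:if|else|while|return)\b"),
    ("IDENTIFIER", r"[A-Za-z_][A-Za-z0-9_]*"),
    ("CONSTANT",   r"\d+"),
    ("OP",         r"[=+\-*/<>();]"),
    ("SKIP",       r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (token, lexeme) pairs for the source text."""
    pos = 0
    tokens = []
    while pos < len(source):
        m = MASTER.match(source, pos)
        if m is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at {pos}")
        if m.lastgroup != "SKIP":             # whitespace produces no token
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens


print(tokenize("while count2 < 10"))
# [('KEYWORD', 'while'), ('IDENTIFIER', 'count2'), ('OP', '<'), ('CONSTANT', '10')]
```

In each pair the first element is the token (the class) and the second is the lexeme (the actual matched string).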
Describe Patterns: Regular Expression
Patterns or rules to identify lexemes
Precise specification of sets of strings
There exists a computational model to evaluate them (Finite Automata)
There exist tools to process them (LEX)
Regular Expression Notations
Symbols: e.g., a, b, c, 1, 2
Alphabet: finite set of symbols, Σ (sigma)
– e.g., Σ = {a,b}
String: a sequence of symbols from the set of alphabet characters
– e.g., hello, ε (epsilon, empty string)
Language: a set of strings over an alphabet
– e.g., {a, ab, ba}
– e.g., the set of all valid C programs
Regular Expression Definition
Every symbol of Σ ∪ {ε} is a regular expression
If r1 and r2 are regular expressions, so are
– Concatenation: r1r2
– Alternation: r1 | r2
– Repetition: r1*
Nothing else is a regular expression
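The three operators can be tried out with Python's `re` module, as in the sketch below. Python's pattern language is richer than the formal definition, so only concatenation, `|`, `*`, and grouping parentheses are used here; the helper name `matches` is an assumption for this example.

```python
import re


def matches(pattern, text):
    """True if the whole of text belongs to the language of pattern."""
    return re.fullmatch(pattern, text) is not None


# Concatenation: r1r2
print(matches("ab", "ab"))            # True
# Alternation: r1 | r2
print(matches("a|b", "b"))            # True
# Repetition: r1* (zero or more, so the empty string is included)
print(matches("a*", ""))              # True
# Operators combined: any string of a's and b's that ends in abb
print(matches("(a|b)*abb", "babb"))   # True
print(matches("(a|b)*abb", "abab"))   # False
```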
Regular Expression Extended
a+ : one or more a’s
a* : zero or more a’s
a? : zero or one a
a{n}: a repeats exactly n times
a{n,}: a repeats at least n times
a{n,m}: a repeats at least n but no more than m times
… and more
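The extended forms add convenience rather than power: each can be rewritten with concatenation, alternation, and `*` alone. The sketch below checks a few such rewrites with Python's `re` on strings of a's up to a small length; the pairings in `EQUIVALENTS` and the helper name `same_language` are choices made for this illustration, and the check is a spot test, not a proof.

```python
import re

# extended form -> equivalent form using only the three basic operators
EQUIVALENTS = {
    r"a+":     r"aa*",
    r"a?":     r"(|a)",      # the empty alternative plays the role of ε
    r"a{3}":   r"aaa",
    r"a{2,}":  r"aaa*",
    r"a{2,3}": r"aa|aaa",
}


def same_language(p, q, max_len=6):
    """Compare p and q on the strings '', 'a', 'aa', ... up to max_len."""
    return all(
        bool(re.fullmatch(p, "a" * n)) == bool(re.fullmatch(q, "a" * n))
        for n in range(max_len + 1)
    )


for extended, basic in EQUIVALENTS.items():
    print(extended, "vs", basic, "->", same_language(extended, basic))
```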
Regular Expressions Cannot Do
Arithmetic expressions with arbitrarily nested parentheses
Set of strings over {(,)} with matched parentheses
Strings over {a,b} in which a run of a’s is followed by an equal number of b’s
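What these languages have in common is that recognizing them requires counting without an upper bound, which a machine with a fixed, finite set of states cannot do. The sketch below recognizes matched parentheses with an integer counter; that counter is exactly the unbounded memory a regular expression lacks. The function name `balanced` is an assumption for this example.

```python
def balanced(s):
    """True if s, a string over the alphabet {(, )}, has matched parentheses."""
    depth = 0                       # can grow without bound with the input
    for ch in s:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:           # a ')' with no '(' left to match
                return False
        else:
            raise ValueError(f"symbol {ch!r} is not in the alphabet")
    return depth == 0


print(balanced("(()())"))   # True
print(balanced("(()"))      # False
print(balanced("())("))     # False
```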
Regular Definitions
Give names to regular expressions and use them as shorthand
Must avoid recursive definitions
Examples
– digit -> 0 | 1 | 2 | … | 9
– int -> digit+
– letter -> A | B | … | Z
– id -> letter (letter | digit)*
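Regular definitions map naturally onto named pattern strings that are substituted into later ones. The sketch below mirrors the four definitions using Python's `re`; it assumes `digit` covers 0 through 9 and `letter` covers uppercase A through Z, and the trailing underscores in `int_` and `id_` only avoid clashing with Python built-ins.

```python
import re

digit = r"[0-9]"
int_ = rf"{digit}+"                           # int -> digit+
letter = r"[A-Z]"
id_ = rf"{letter}(?:{letter}|{digit})*"       # id -> letter (letter | digit)*

# Each name is defined only in terms of earlier names, so plain textual
# substitution terminates; a recursive definition would never bottom out.
print(re.fullmatch(id_, "X25") is not None)   # True
print(re.fullmatch(id_, "2X") is not None)    # False
print(re.fullmatch(int_, "404") is not None)  # True
```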
Finite Automata
Evaluate regular expressions
Recognize certain languages and reject others
Two kinds of FA:
– Non-deterministic FA (NFA)
– Deterministic FA (DFA)
FA and Language
An FA accepts string x if and only if there is some path in the transition graph from the start state to a final state, such that the edge labels along this path spell x
The set of strings an FA accepts is said to be the language defined by this FA.
An NFA Consists of
An input alphabet, e.g., Σ = {a,b}
A set of states, e.g., S = {s0, s1, s2}
A set of transitions from states to states, labeled by elements of Σ or ε (empty string)
A start state, e.g., s0
A set of final states, e.g., F = {s1, s2}
An NFA can simultaneously be in multiple states. Each state can progress to multiple other states for a given input character.
NFA Example: start in state A, read one character, and transition to B and C. Then read the 2nd character: state B transitions to {C, D} and state C transitions to {D, E}. After reading the 2nd character you are in the states {C, D} ∪ {D, E} = {C, D, E}.
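The A, B, C example above can be simulated by tracking the set of states the NFA might currently be in. In this sketch the transition table is a dictionary from (state, symbol) to a set of states, with the empty string standing for ε. The slide does not say which characters are read, so the input symbol "x" and the choice of E as the final state are assumptions, as are the function names.

```python
def epsilon_closure(states, delta):
    """All states reachable from `states` using only ε ("") transitions."""
    closure = set(states)
    stack = list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, ""), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return closure


def step(states, symbol, delta):
    """Advance every current state on `symbol`, then follow ε transitions."""
    moved = set()
    for s in states:
        moved |= set(delta.get((s, symbol), ()))
    return epsilon_closure(moved, delta)


def accepts(text, delta, start, finals):
    current = epsilon_closure({start}, delta)
    for ch in text:
        current = step(current, ch, delta)
    return bool(current & finals)         # some path ends in a final state


# Transitions from the example; the symbol "x" is hypothetical.
delta = {
    ("A", "x"): {"B", "C"},
    ("B", "x"): {"C", "D"},
    ("C", "x"): {"D", "E"},
}
after_one = step({"A"}, "x", delta)       # {'B', 'C'}
after_two = step(after_one, "x", delta)   # {'C', 'D', 'E'}
print(sorted(after_one), sorted(after_two))
print(accepts("xx", delta, "A", {"E"}))   # True
```

Tracking a set of states is one way to "try all paths" at once without backtracking.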
An NFA Consists of: Example
[Figures: NFA examples; images not recovered. One figure carries the label “only in NFA”.]
One way to think about an NFA is to try all paths.
Deterministic Finite Automata (DFA)
A DFA is a special case of an NFA
A DFA can be in only one state at any given time, and for a given input it progresses to only one other state.
No state has an ε transition
For each state s and input symbol a, there is at most one edge labeled a leaving s
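Because a DFA has at most one edge per (state, symbol) pair, its transition table can map each pair to a single state, and running it needs only one current state. The sketch below uses a hypothetical four-state DFA over {a, b} that accepts strings ending in abb; the automaton and the names `DELTA` and `dfa_accepts` are assumptions for this example, not taken from the slides.

```python
# (state, symbol) -> exactly one next state; a missing key means "no edge".
DELTA = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 1, (1, "b"): 2,
    (2, "a"): 1, (2, "b"): 3,
    (3, "a"): 1, (3, "b"): 0,
}
START = 0
FINALS = {3}


def dfa_accepts(text, delta=DELTA, start=START, finals=FINALS):
    state = start
    for ch in text:
        if (state, ch) not in delta:      # no edge labeled ch: reject
            return False
        state = delta[(state, ch)]
    return state in finals


print(dfa_accepts("babb"))   # True
print(dfa_accepts("abab"))   # False
print(dfa_accepts("abc"))    # False, since no state has an edge labeled c
```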
NFA vs. DFA
[Figure not recovered]
NFA to DFA
[Figure: an NFA with states 1–4 over the symbols a, b, c and ε; a table captioned “Chart representing the graph”; and the equivalent DFA, whose states include the subsets {2,1} and {4,3}. Sample inputs checked against the automaton with Yes/No outcomes: abab, a, cc, cb, c, caa, ccab, ccacc, ccac, ab, abacab. The diagrams and table cells were not recoverable.]
NFA, DFA and Regular Expressions
A DFA is an NFA
Each NFA can be converted into a DFA
– An NFA with n states can, at any point, be summarized by which of the 2^n subsets of states it is currently in.
– If each possible subset of NFA states is represented by one state in a DFA, you can build a DFA that recognizes the same language as the NFA.
One can construct an NFA from a regular expression
FAs are used by the lexical analyzer to recognize tokens
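The NFA-to-DFA conversion described on this slide, where each DFA state stands for a subset of NFA states, is commonly called the subset construction. A minimal sketch follows. The NFA used to exercise it (strings over {a, b} ending in abb) is hypothetical, as are the function names, and the empty string again stands for ε.

```python
def epsilon_closure(states, delta):
    closure, stack = set(states), list(states)
    while stack:
        s = stack.pop()
        for t in delta.get((s, ""), ()):
            if t not in closure:
                closure.add(t)
                stack.append(t)
    return frozenset(closure)            # hashable, so usable as a DFA state


def subset_construction(delta, start, finals, alphabet):
    """Build a DFA whose states are the reachable subsets of NFA states."""
    start_set = epsilon_closure({start}, delta)
    dfa_delta, seen, worklist = {}, {start_set}, [start_set]
    while worklist:
        current = worklist.pop()
        for a in alphabet:
            moved = set()
            for s in current:
                moved |= set(delta.get((s, a), ()))
            target = epsilon_closure(moved, delta)
            if not target:
                continue                 # no edge on a: the DFA rejects here
            dfa_delta[(current, a)] = target
            if target not in seen:
                seen.add(target)
                worklist.append(target)
    dfa_finals = {S for S in seen if S & finals}
    return dfa_delta, start_set, dfa_finals, seen


def dfa_accepts(text, dfa_delta, start_set, dfa_finals):
    state = start_set
    for ch in text:
        state = dfa_delta.get((state, ch))
        if state is None:
            return False
    return state in dfa_finals


# Hypothetical NFA: state 0 loops on a and b and "guesses" where abb begins.
nfa = {(0, "a"): {0, 1}, (0, "b"): {0}, (1, "b"): {2}, (2, "b"): {3}}
dfa_delta, dfa_start, dfa_finals, dfa_states = subset_construction(nfa, 0, {3}, "ab")
print(len(dfa_states))                                      # 4
print(dfa_accepts("babb", dfa_delta, dfa_start, dfa_finals))  # True
```

Only subsets reachable from the start state are built, so this 4-state NFA yields 4 DFA states rather than all 2^4 = 16 possible subsets.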
END