Translator Architecture
string of characters (source code)
    -> Tokenizer ->
string of tokens
    -> Parser ->
abstract program
    -> Code Generator ->
string of integers (object code)
A Tokenizing Machine
Another great machine!
[Figure: a machine with an In slot (Chars), an Out slot (Tokens), and a “Ready to Dispense?” light.]
Tokenizing Machine Continued…
1: Characters go in here (In)
2: Light comes on (Ready to Dispense?)
3: Tokens come out here (Out)
Tokenizing Machine Continued…
Type
BL_Tokenizing_Machine_Kernel is modeled by (
        buffer: string of character
        ready_to_dispense: boolean
    )
    constraint …
Initial Value
(empty_string, false)
Tokenizing Machine Continued…
Operations
m.Insert (ch)
m.Dispense (token_text, token_kind)
m.Is_Ready_To_Dispense ()
m.Flush_A_Token (token_text, token_kind)
m.Size ()
A State-Transition View
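States: not ready to dispense, ready to dispense
not ready to dispense:
    Insert -> not ready to dispense, or ready to dispense
    Flush_A_Token, Size, Is_Ready_To_Dispense -> not ready to dispense
ready to dispense:
    Dispense -> not ready to dispense
    Size, Is_Ready_To_Dispense -> ready to dispense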
Tokenizing BL Programs
Token Types
KEYWORD
IDENTIFIER
CONDITION
WHITE_SPACE
COMMENT
ERROR
A Very Useful Extension
procedure_body Get_Next_Token (
        alters Character_IStream& str,
        produces Text& token_text,
        produces Integer& token_kind
    )
{
    while (not self.Is_Ready_To_Dispense () and not str.At_EOS ())
    {
        object Character ch;
        str.Read (ch);
        self.Insert (ch);
    }
    if (self.Is_Ready_To_Dispense ())
    {
        self.Dispense (token_text, token_kind);
    }
    else
    {
        self.Flush_A_Token (token_text, token_kind);
    }
}
Another Useful Extension
procedure_body Get_Next_Non_Separator_Token (
        alters Character_IStream& str,
        produces Text& token_text,
        produces Integer& token_kind
    )
{
    self.Get_Next_Token (str, token_text, token_kind);
    while ((token_kind == WHITE_SPACE) or
           (token_kind == COMMENT))
    {
        self.Get_Next_Token (str, token_text, token_kind);
    }
}
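The skipping loop above can be sketched in Python. This is only an illustration: the callable `get_next_token` and the list of sample tokens are hypothetical stand-ins for `self.Get_Next_Token (str, ...)`, and the sketch assumes the token source eventually returns a non-separator token.

```python
WHITE_SPACE, COMMENT, IDENTIFIER = "WHITE_SPACE", "COMMENT", "IDENTIFIER"


def get_next_non_separator_token(get_next_token):
    """Call get_next_token() until it yields a token that is neither
    white space nor a comment, and return that token."""
    token_text, token_kind = get_next_token()
    while token_kind in (WHITE_SPACE, COMMENT):
        token_text, token_kind = get_next_token()
    return token_text, token_kind


# Hypothetical token source standing in for self.Get_Next_Token (str, ...).
tokens = iter([
    ("  ", WHITE_SPACE),
    ("# note", COMMENT),
    ("\n", WHITE_SPACE),
    ("move", IDENTIFIER),
])
result = get_next_non_separator_token(lambda: next(tokens))
```

Here `result` is the first token that is not a separator, so a parser built on this operation never sees white space or comments.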
How Does Insert Work?
#m:
    buffer: “PROGRAM”
    ready_to_dispense: false
Here’s another character: ‘ ’
m:
    buffer: “PROGRAM ”
    ready_to_dispense: true
The Specification of Insert
procedure Insert (
        preserves Character ch
    ) is_abstract;
/*!
    requires
        self.ready_to_dispense = false
    ensures
        self.buffer = #self.buffer * <ch> and
        self.ready_to_dispense =
            IS_COMPLETE_TOKEN_TEXT (#self.buffer, ch)
!*/
An Important Math Operation
math definition IS_COMPLETE_TOKEN_TEXT (
        s: string of character,
        c: character
    ): boolean is
    ( s is in OK_STRINGS and
      s * <c> is not in OK_STRINGS ) or
    ( <c> is in PREFIX (OK_STRINGS) and
      s * <c> is not in PREFIX (OK_STRINGS) )
First case: s is a complete “valid” token
Second case: c can start a “valid” token, but s * <c> doesn’t start a “valid” token
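The definition can be tried out with a small Python sketch. The token set here is a simplified assumption, not the BL definition: identifier-like words, runs of white space, and ‘#’ comments that stop before the newline stand in for OK_STRINGS. The helper names are hypothetical.

```python
import re

# Simplified stand-in for OK_STRINGS (an assumption): identifier-like words,
# runs of white space, and '#' comments that stop before the newline.
_OK = re.compile(r"[A-Za-z][A-Za-z0-9-]*|[ \n\t]+|#[^\n]*")


def in_ok_strings(s):
    return _OK.fullmatch(s) is not None


def in_prefix_of_ok_strings(s):
    # For this simplified token set, every non-empty prefix of an OK string
    # is itself an OK string, so PREFIX adds only the empty string.
    return s == "" or in_ok_strings(s)


def is_complete_token_text(s, c):
    # First case: s is a complete valid token and c does not extend it.
    valid_token_ended = in_ok_strings(s) and not in_ok_strings(s + c)
    # Second case: c can start a valid token, but s * <c> cannot start one.
    error_token_ended = (in_prefix_of_ok_strings(c)
                         and not in_prefix_of_ok_strings(s + c))
    return valid_token_ended or error_token_ended
```

Under these assumptions, `is_complete_token_text("PROGRAM", " ")` holds because the blank cannot extend the word, while `is_complete_token_text("PROGRA", "M")` does not. The second case is what ends an error token: `is_complete_token_text("$$", "a")` holds because ‘a’ can start a new token.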
Other Math Definitions
OK_STRINGS =
    {s: string of character (IS_KEYWORD (s))} union
    {s: string of character (IS_IDENTIFIER (s))} union
    {s: string of character (IS_CONDITION_NAME (s))} union
    {s: string of character (IS_WHITE_SPACE (s))} union
    {s: string of character (IS_COMMENT (s))}
PREFIX (s_set) =
{x: string of character
(there exists y: string of character
(x * y is in s_set))}
PREFIX Examples
s_set = {“abc”}
PREFIX (s_set) =
{“”, “a”, “ab”, “abc”}
s_set = {“abc”, “de”}
PREFIX (s_set) =
{“”, “a”, “ab”, “abc”, “d”, “de”}
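For a finite set of strings, PREFIX can be computed directly. The function below is a minimal sketch for checking the examples above; OK_STRINGS itself is infinite, so this is not how a tokenizer would test prefix membership.

```python
def prefix(s_set):
    """Return every string x such that x * y is in s_set for some y."""
    return {s[:i] for s in s_set for i in range(len(s) + 1)}


one = prefix({"abc"})
two = prefix({"abc", "de"})
```

Every string contributes its empty prefix and itself, which is why “” and the full strings appear in both results.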
Tokenizing Machine: Implementation
Obvious Representation
    Text buffer_rep
    Boolean token_ready
Insert (ch)?
    check if IS_COMPLETE_TOKEN_TEXT (self[buffer_rep], ch), and set self[token_ready] accordingly
    append ch at end of self[buffer_rep]
Tokenizing Machine: Implementation Continued…
Dispense (token_text, token_kind)?
    set token_text to all but the last character of self[buffer_rep]
    set token_kind to the value of WHICH_KIND (token_text)
    set self[token_ready] to false
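The Insert and Dispense steps for the obvious representation can be sketched in Python as follows. The two callables passed to the constructor are hypothetical stand-ins for IS_COMPLETE_TOKEN_TEXT and WHICH_KIND, and the toy rules used in the example are not the BL rules. The sketch also assumes that Dispense leaves the last character in the buffer as the start of the next token, which the steps above do not state.

```python
class TokenizingMachine:
    def __init__(self, is_complete_token_text, which_kind):
        self.buffer_rep = ""
        self.token_ready = False
        self._is_complete = is_complete_token_text  # stand-in predicate
        self._which_kind = which_kind               # stand-in classifier

    def insert(self, ch):
        assert not self.token_ready, "Insert requires ready_to_dispense = false"
        # Check against the buffer as it was before ch is appended.
        self.token_ready = self._is_complete(self.buffer_rep, ch)
        self.buffer_rep += ch

    def dispense(self):
        assert self.token_ready
        token_text = self.buffer_rep[:-1]
        token_kind = self._which_kind(token_text)
        # Assumption: the last character stays behind to start the next token.
        self.buffer_rep = self.buffer_rep[-1]
        self.token_ready = False
        return token_text, token_kind


# Toy rules (assumptions): a token ends when the next character switches
# between white space and non-white space.
m = TokenizingMachine(
    is_complete_token_text=lambda s, c: s != "" and s[-1].isspace() != c.isspace(),
    which_kind=lambda t: "WHITE_SPACE" if t.isspace() else "OTHER",
)
for ch in "PROGRAM ":
    m.insert(ch)
```

After the loop the buffer holds “PROGRAM ” and the machine is ready to dispense, matching the Insert example earlier in the slides. The order inside `insert` matters: the completeness check uses the old buffer, so the append has to come second.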
Tokenizing Machine: Implementation Continued…
How do we “check if IS_COMPLETE_TOKEN_TEXT (self[buffer_rep], ch)”?
How do we determine “WHICH_KIND (token_text)”?
How do we do these things quickly?
Making Decisions Quickly
Keep track of the “state” of the buffer by adding one field to the representation:
    Text buffer_rep
    Boolean token_ready
    Integer buffer_state
Possible Buffer States
How many interestingly different buffer states do you think there may be?
Let’s start enumerating them…
Buffer States Continued…
Initial state (empty buffer)
How many states after inserting the first character?
‘B’, ‘D’, ‘E’, ‘I’, ‘P’, ‘T’, ‘W’, ‘n’, ‘r’, ‘t’,
identifier (any other letter)
white_space (‘ ’, ‘\n’, ‘\t’)
comment (‘#’)
error (any other character)
Buffer States Continued…
How many states after inserting the second character?
“BE”, “DO”, “EL”, “EN”, “IF”, “IN”, “IS”, “PR”, “TH”, “WH”, “ne”, “ra”, “tr”,
identifier (any other id character)
white_space (‘ ’, ‘\n’, ‘\t’)
comment (any other character but ‘\n’)
error (any character that cannot start a new “good” token)
A State Transition Diagram:
Transitions Out of ‘empty’ Only
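empty --‘B’--> B
empty --‘D’--> D
empty --‘E’--> E
empty --‘I’--> I
empty --‘P’--> P
empty --‘T’--> T
empty --‘W’--> W
empty --‘n’--> n
empty --‘r’--> r
empty --‘t’--> t
empty --any other letter--> identifier
empty --‘ ’, ‘\n’, ‘\t’--> white_space
empty --‘#’--> comment
empty --any other character--> error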
Structure of Body of Insert
case_select (self[buffer_state])
{
    case empty:
        // case for buffer = empty_string
    case B:
        // case for buffer = “B”
    case D:
        // case for buffer = “D”
    case E:
        // case for buffer = “E”
    ...
    case error:
        // case for buffer holding an error token
}
A Simplified View
Buffer States
EMPTY_BS
ID_OR_KEYWORD_OR_CONDITION_BS
WHITE_SPACE_BS
COMMENT_BS
ERROR_BS
The State Transition Diagram
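EMPTY_BS --‘a’..‘z’, ‘A’..‘Z’--> ID_OR_KEYWORD_OR_CONDITION_BS
ID_OR_KEYWORD_OR_CONDITION_BS --‘a’..‘z’, ‘A’..‘Z’, ‘0’..‘9’, ‘-’--> ID_OR_KEYWORD_OR_CONDITION_BS
EMPTY_BS --‘ ’, ‘\n’, ‘\t’--> WHITE_SPACE_BS
WHITE_SPACE_BS --‘ ’, ‘\n’, ‘\t’--> WHITE_SPACE_BS
EMPTY_BS --‘#’--> COMMENT_BS
COMMENT_BS --any character except ‘\n’--> COMMENT_BS
EMPTY_BS --any other character--> ERROR_BS
ERROR_BS --any character except ‘a’..‘z’, ‘A’..‘Z’, ‘ ’, ‘\n’, ‘\t’, ‘#’--> ERROR_BS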
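The simplified diagram translates into a small transition function. The Python sketch below is one possible reading, with hypothetical function names. It assumes that a character with no matching self-loop ends the buffered token, and that the buffer state is left unchanged until Dispense so the kind of the finished token can still be determined.

```python
import string

EMPTY_BS = "EMPTY_BS"
ID_OR_KEYWORD_OR_CONDITION_BS = "ID_OR_KEYWORD_OR_CONDITION_BS"
WHITE_SPACE_BS = "WHITE_SPACE_BS"
COMMENT_BS = "COMMENT_BS"
ERROR_BS = "ERROR_BS"

LETTERS = set(string.ascii_letters)
ID_CHARS = LETTERS | set(string.digits) | {"-"}
WHITE = {" ", "\n", "\t"}


def state_from_empty(ch):
    """State reached from EMPTY_BS on the first character."""
    if ch in LETTERS:
        return ID_OR_KEYWORD_OR_CONDITION_BS
    if ch in WHITE:
        return WHITE_SPACE_BS
    if ch == "#":
        return COMMENT_BS
    return ERROR_BS


def stays_in_state(bs, ch):
    """True if ch follows the self-loop of state bs in the diagram."""
    if bs == ID_OR_KEYWORD_OR_CONDITION_BS:
        return ch in ID_CHARS
    if bs == WHITE_SPACE_BS:
        return ch in WHITE
    if bs == COMMENT_BS:
        return ch != "\n"
    if bs == ERROR_BS:
        return state_from_empty(ch) == ERROR_BS
    raise ValueError(f"no self-loop defined for {bs}")


def insert(bs, ch):
    """Return (buffer state, ready_to_dispense) after inserting ch."""
    if bs == EMPTY_BS:
        return state_from_empty(ch), False
    if stays_in_state(bs, ch):
        return bs, False
    # ch ends the buffered token; keep bs so the token kind is still known.
    return bs, True


bs, ready = EMPTY_BS, False
for ch in "IF ":
    bs, ready = insert(bs, ch)
```

Each decision looks only at the current state and the new character, never at the whole buffer, which is the point of adding the buffer_state field.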
Useful Private Functions
Is_White_Space_Character (ch)
Is_Digit_Character (ch)
Is_Alphabetic_Character (ch)
Is_Identifier_Character (ch)
Can_Start_Token (ch)
Id_Or_Keyword_Or_Condition (t)
Buffer_Type (ch)
Token_Kind (bs, str)
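As an illustration of the last two functions, here is a hedged Python sketch. The slides give only the names, so the behavior shown is an assumption: Token_Kind is taken to map a buffer state and token text to a token type, and Id_Or_Keyword_Or_Condition to classify a word. The keyword and condition lists are also assumptions, chosen to be consistent with the first letters and two-letter prefixes listed under the buffer states.

```python
KEYWORD, IDENTIFIER, CONDITION = "KEYWORD", "IDENTIFIER", "CONDITION"
WHITE_SPACE, COMMENT, ERROR = "WHITE_SPACE", "COMMENT", "ERROR"

# Assumed word lists (not given on the slides).
KEYWORDS = {
    "BEGIN", "DO", "ELSE", "END", "IF",
    "INSTRUCTION", "IS", "PROGRAM", "THEN", "WHILE",
}
CONDITIONS = {
    "next-is-empty", "next-is-not-empty",
    "next-is-enemy", "next-is-not-enemy",
    "next-is-friend", "next-is-not-friend",
    "next-is-wall", "next-is-not-wall",
    "random", "true",
}


def id_or_keyword_or_condition(t):
    """Classify a word that consists only of identifier characters."""
    if t in KEYWORDS:
        return KEYWORD
    if t in CONDITIONS:
        return CONDITION
    return IDENTIFIER


def token_kind(bs, text):
    """Map a buffer state and the token text to a token type."""
    if bs == "ID_OR_KEYWORD_OR_CONDITION_BS":
        return id_or_keyword_or_condition(text)
    if bs == "WHITE_SPACE_BS":
        return WHITE_SPACE
    if bs == "COMMENT_BS":
        return COMMENT
    return ERROR
```

Only the word-like state needs to look at the text; for the other states the buffer state alone determines the token type.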