11-TokenizingMachine.pps

Translator Architecture
string of characters (source code)
    -> Tokenizer ->
string of tokens
    -> Parser ->
abstract program
    -> Code Generator ->
string of integers (object code)
A Tokenizing Machine
[Figure: "Another great machine!" It has an In slot for characters, an Out slot for tokens, and a "Ready to Dispense?" light.]
Tokenizing Machine Continued…
1: Characters go in here (In)
2: Light comes on (Ready to Dispense?)
3: Tokens come out here (Out)
Tokenizing Machine Continued…

Type

BL_Tokenizing_Machine_Kernel is modeled by (
    buffer : string of character
    ready_to_dispense : boolean
)
constraint …

Initial Value

(empty_string, false)
Tokenizing Machine Continued…

Operations
    m.Insert (ch)
    m.Dispense (token_text, token_kind)
    m.Is_Ready_To_Dispense ()
    m.Flush_A_Token (token_text, token_kind)
    m.Size ()
A State-Transition View
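States: "not ready to dispense" (the initial state) and "ready to dispense"

From "not ready to dispense":
    Insert                -> not ready to dispense, or ready to dispense
    Flush_A_Token         -> not ready to dispense
    Size                  -> not ready to dispense
    Is_Ready_To_Dispense  -> not ready to dispense

From "ready to dispense":
    Dispense              -> not ready to dispense
    Size                  -> ready to dispense
    Is_Ready_To_Dispense  -> ready to dispense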
Tokenizing BL Programs

Token Types
    KEYWORD
    IDENTIFIER
    CONDITION
    WHITE_SPACE
    COMMENT
    ERROR
A Very Useful Extension
procedure_body Get_Next_Token (
        alters Character_IStream& str,
        produces Text& token_text,
        produces Integer& token_kind
    )
{
    // Feed characters until a token is complete or the input runs out
    while (not self.Is_Ready_To_Dispense () and not str.At_EOS ())
    {
        object Character ch;
        str.Read (ch);
        self.Insert (ch);
    }
    if (self.Is_Ready_To_Dispense ())
    {
        self.Dispense (token_text, token_kind);
    }
    else
    {
        // Input ended before the light came on: flush what is buffered
        self.Flush_A_Token (token_text, token_kind);
    }
}
Another Useful Extension
procedure_body Get_Next_Non_Separator_Token (
        alters Character_IStream& str,
        produces Text& token_text,
        produces Integer& token_kind
    )
{
    self.Get_Next_Token (str, token_text, token_kind);
    // Skip separators: white space and comments
    while ((token_kind == WHITE_SPACE) or
           (token_kind == COMMENT))
    {
        self.Get_Next_Token (str, token_text, token_kind);
    }
}
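The two extensions above can be sketched in Python as follows. This is only an illustration, not the course's RESOLVE/C++ code: ToyTokenizingMachine is a hypothetical stand-in for the kernel that lumps keywords, identifiers and conditions into a single WORD kind, and its behavior for Flush_A_Token on an empty buffer (returning empty text and no kind) is an assumption, since the slides do not specify it.

```python
import io
import string

# Toy token kinds (assumption: keywords, identifiers, conditions are all WORD).
WHITE_SPACE, COMMENT, WORD, ERROR = "WHITE_SPACE", "COMMENT", "WORD", "ERROR"


def _kind_of_first(ch):
    """Classify a token by its first character."""
    if ch in " \n\t":
        return WHITE_SPACE
    if ch == "#":
        return COMMENT
    if ch in string.ascii_letters:
        return WORD
    return ERROR


def _extends(kind, ch):
    """True if ch may be appended to a token of the given kind."""
    if kind == WHITE_SPACE:
        return ch in " \n\t"
    if kind == COMMENT:
        return ch != "\n"
    if kind == WORD:
        return ch in string.ascii_letters or ch in string.digits or ch == "-"
    return _kind_of_first(ch) == ERROR  # error tokens absorb more bad characters


class ToyTokenizingMachine:
    """Hypothetical stand-in for the kernel: a buffer plus a ready light."""

    def __init__(self):
        self.buffer = ""
        self.ready = False

    def is_ready_to_dispense(self):
        return self.ready

    def insert(self, ch):
        assert not self.ready, "Insert requires ready_to_dispense = false"
        if self.buffer:
            self.ready = not _extends(_kind_of_first(self.buffer[0]), ch)
        self.buffer += ch

    def dispense(self):
        assert self.ready, "Dispense is only called when the light is on"
        # The token is all but the last character; that character stays behind.
        text, self.buffer = self.buffer[:-1], self.buffer[-1]
        self.ready = False
        return text, _kind_of_first(text[0])

    def flush_a_token(self):
        assert not self.ready
        text, self.buffer = self.buffer, ""
        return text, (_kind_of_first(text[0]) if text else None)


def get_next_token(m, stream):
    """Mirror of Get_Next_Token: insert until ready or end of stream."""
    while not m.is_ready_to_dispense():
        ch = stream.read(1)
        if ch == "":  # end of stream
            break
        m.insert(ch)
    if m.is_ready_to_dispense():
        return m.dispense()
    return m.flush_a_token()


def get_next_non_separator_token(m, stream):
    """Mirror of Get_Next_Non_Separator_Token: skip white space and comments."""
    text, kind = get_next_token(m, stream)
    while kind in (WHITE_SPACE, COMMENT):
        text, kind = get_next_token(m, stream)
    return text, kind
```

In this sketch the results are returned as a (token_text, token_kind) pair rather than produced through parameters, which is the more usual Python idiom.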
How Does Insert Work?
Before the call (#m):
    buffer: “PROGRAM”
    ready_to_dispense: false

“Here’s another character.”
    m.Insert (‘ ’)

After the call (m):
    buffer: “PROGRAM ”
    ready_to_dispense: true
The Specification of Insert
procedure Insert (
        preserves Character ch
    ) is_abstract;
/*!
    requires
        self.ready_to_dispense = false
    ensures
        self.buffer = #self.buffer * <ch> and
        self.ready_to_dispense =
            IS_COMPLETE_TOKEN_TEXT (#self.buffer, ch)
!*/
An Important Math Operation
math definition IS_COMPLETE_TOKEN_TEXT (
        s: string of character,
        c: character
    ): boolean is
    ( s is in OK_STRINGS and
      s * <c> is not in OK_STRINGS ) or
    ( <c> is in PREFIX (OK_STRINGS) and
      s * <c> is not in PREFIX (OK_STRINGS) )

First case: s is a complete “valid” token, and appending c does not give another one.
Second case: c can start a “valid” token, but s * <c> doesn’t start a “valid” token.
Other Math Definitions

OK_STRINGS =
    {s: string of character (IS_KEYWORD (s))} union
    {s: string of character (IS_IDENTIFIER (s))} union
    {s: string of character (IS_CONDITION_NAME (s))} union
    {s: string of character (IS_WHITE_SPACE (s))} union
    {s: string of character (IS_COMMENT (s))}

PREFIX (s_set) =
    {x: string of character
        (there exists y: string of character
            (x * y is in s_set))}
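To see how IS_COMPLETE_TOKEN_TEXT behaves with these sets, here is a rough Python sketch. The regular expressions are assumptions standing in for IS_KEYWORD, IS_IDENTIFIER, IS_CONDITION_NAME, IS_WHITE_SPACE and IS_COMMENT, whose exact definitions the slides do not give: keywords, identifiers and condition names are all taken to be a letter followed by letters, digits or ‘-’, and a comment is ‘#’ followed by anything but a newline. Under those assumptions every prefix of an OK string is either empty or itself an OK string, which is what in_prefix_of_ok_strings relies on.

```python
import re

# Assumed lexical shapes (not taken from the slides).
_PATTERNS = (
    re.compile(r"[A-Za-z][A-Za-z0-9-]*"),  # keyword, identifier, condition name
    re.compile(r"[ \n\t]+"),               # white space
    re.compile(r"#[^\n]*"),                # comment, up to but excluding newline
)


def in_ok_strings(s):
    """s is in OK_STRINGS: the whole string is one valid token."""
    return any(p.fullmatch(s) for p in _PATTERNS)


def in_prefix_of_ok_strings(s):
    """s is in PREFIX (OK_STRINGS), given the assumed shapes above."""
    return s == "" or in_ok_strings(s)


def is_complete_token_text(s, c):
    """Direct transcription of the two cases of IS_COMPLETE_TOKEN_TEXT."""
    return (
        (in_ok_strings(s) and not in_ok_strings(s + c))
        or (in_prefix_of_ok_strings(c) and not in_prefix_of_ok_strings(s + c))
    )
```

For instance, is_complete_token_text("PROGRAM", " ") holds by the first case, while is_complete_token_text("$$", "a") holds by the second: the buffer is an error token and ‘a’ can start a new valid token.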
PREFIX Examples




s_set = {“abc”}
PREFIX (s_set) = {“”, “a”, “ab”, “abc”}

s_set = {“abc”, “de”}
PREFIX (s_set) = {“”, “a”, “ab”, “abc”, “d”, “de”}
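For finite sets, the two examples above can be checked mechanically. The following Python sketch uses a hypothetical helper named prefix; it applies only to finite s_set, whereas the math definition also covers infinite sets such as OK_STRINGS.

```python
def prefix(s_set):
    """Return every x for which some y makes x + y a member of s_set."""
    # Each prefix of s is s[:i] for some i from 0 to len(s), inclusive.
    return {s[:i] for s in s_set for i in range(len(s) + 1)}
```

Note that the empty string is a prefix of every string, so it belongs to the result whenever s_set is non-empty.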
Tokenizing Machine: Implementation

Obvious Representation
    Text buffer_rep
    Boolean token_ready

Insert (ch)?
    check if IS_COMPLETE_TOKEN_TEXT (self[buffer_rep], ch), and
        set self[token_ready] accordingly
    append ch at end of self[buffer_rep]
Tokenizing Machine: Implementation Continued…

Dispense (token_text, token_kind)?
    set token_text to all but the last character of self[buffer_rep]
    set token_kind to the value of WHICH_KIND (token_text)
    set self[token_ready] to false
Tokenizing Machine: Implementation Continued…

    How do we “check if IS_COMPLETE_TOKEN_TEXT (self[buffer_rep], ch)”?
    How do we determine “WHICH_KIND (token_text)”?
    How do we do these things quickly?
Making Decisions Quickly

Keep track of the “state” of the buffer by adding one field to the representation:
    Text buffer_rep
    Boolean token_ready
    Integer buffer_state
Possible Buffer States


How many interestingly different buffer states do you think there may be?
Let’s start enumerating them…
Buffer States Continued…


Initial state (empty buffer)
How many states after inserting the first character?
    ‘B’, ‘D’, ‘E’, ‘I’, ‘P’, ‘T’, ‘W’, ‘n’, ‘r’, ‘t’
    identifier (any other letter)
    white_space (‘ ’, ‘\n’, ‘\t’)
    comment (‘#’)
    error (any other character)
Buffer States Continued…

How many states after inserting the second character?
    “BE”, “DO”, “EL”, “EN”, “IF”, “IN”, “IS”, “PR”, “TH”, “WH”, “ne”, “ra”, “tr”
    identifier (any other id character)
    white_space (‘ ’, ‘\n’, ‘\t’)
    comment (any other character but ‘\n’)
    error (any character that cannot start a new “good” token)
A State Transition Diagram:
Transitions Out of ‘empty’ Only
From state empty:
    ‘B’ -> B
    ‘D’ -> D
    ‘E’ -> E
    ‘I’ -> I
    ‘P’ -> P
    ‘T’ -> T
    ‘W’ -> W
    ‘n’ -> n
    ‘r’ -> r
    ‘t’ -> t
    any other letter -> identifier
    ‘ ’, ‘\n’, ‘\t’ -> white_space
    ‘#’ -> comment
    any other character -> error
Structure of Body of Insert
case_select (self[buffer_state])
{
    case empty:
        // case for buffer = empty_string
    case B:
        // case for buffer = “B”
    case D:
        // case for buffer = “D”
    case E:
        // case for buffer = “E”
    ...
    case error:
        // case for buffer holding an error token
}
A Simplified View

Buffer States
    EMPTY_BS
    ID_OR_KEYWORD_OR_CONDITION_BS
    WHITE_SPACE_BS
    COMMENT_BS
    ERROR_BS
The State Transition Diagram
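From EMPTY_BS:
    ‘a’..‘z’, ‘A’..‘Z’ -> ID_OR_KEYWORD_OR_CONDITION_BS
    ‘ ’, ‘\n’, ‘\t’ -> WHITE_SPACE_BS
    ‘#’ -> COMMENT_BS
    any other character -> ERROR_BS

Self-loops (the character extends the current token):
    ID_OR_KEYWORD_OR_CONDITION_BS: ‘a’..‘z’, ‘A’..‘Z’, ‘0’..‘9’, ‘-’
    WHITE_SPACE_BS: ‘ ’, ‘\n’, ‘\t’
    COMMENT_BS: any character except ‘\n’
    ERROR_BS: any character except ‘a’..‘z’, ‘A’..‘Z’, ‘ ’, ‘\n’, ‘\t’, ‘#’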
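The simplified diagram can be written down as a small transition function. The Python sketch below is one possible reading of it: next_state is a hypothetical helper that returns None when the character does not extend the buffered token, which corresponds to the token being complete. buffer_type borrows its name from the Buffer_Type (ch) private function listed on the slides, but its behavior here is an assumption, and ch is assumed to be a single character.

```python
import string
from enum import Enum, auto


class BufferState(Enum):
    EMPTY_BS = auto()
    ID_OR_KEYWORD_OR_CONDITION_BS = auto()
    WHITE_SPACE_BS = auto()
    COMMENT_BS = auto()
    ERROR_BS = auto()


def buffer_type(ch):
    """State of a buffer that holds only ch (transitions out of EMPTY_BS)."""
    if ch in string.ascii_letters:
        return BufferState.ID_OR_KEYWORD_OR_CONDITION_BS
    if ch in " \n\t":
        return BufferState.WHITE_SPACE_BS
    if ch == "#":
        return BufferState.COMMENT_BS
    return BufferState.ERROR_BS


def next_state(bs, ch):
    """Return the state after inserting ch, or None if ch ends the token."""
    if bs is BufferState.EMPTY_BS:
        return buffer_type(ch)
    if bs is BufferState.ID_OR_KEYWORD_OR_CONDITION_BS:
        stays = ch in string.ascii_letters or ch in string.digits or ch == "-"
    elif bs is BufferState.WHITE_SPACE_BS:
        stays = ch in " \n\t"
    elif bs is BufferState.COMMENT_BS:
        stays = ch != "\n"
    else:
        # ERROR_BS: keep absorbing characters that cannot start a token.
        stays = buffer_type(ch) is BufferState.ERROR_BS
    return bs if stays else None
```

With five states instead of one state per keyword prefix, each Insert needs only one comparison on the state and a character-class test.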
Useful Private Functions








Is_White_Space_Character (ch)
Is_Digit_Character (ch)
Is_Alphabetic_Character (ch)
Is_Identifier_Character (ch)
Can_Start_Token (ch)
Id_Or_Keyword_Or_Condition (t)
Buffer_Type (ch)
Token_Kind (bs, str)
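As an illustration of the last two helpers, here is a hedged Python sketch of Id_Or_Keyword_Or_Condition (t) and Token_Kind (bs, str). The keyword and condition lists are assumptions: they are inferred from the two-letter prefixes listed on the buffer-state slides and are not spelled out anywhere in this deck. Buffer states and token kinds are represented as plain strings to keep the example short.

```python
# Assumed BL keywords and condition names (not listed in the slides).
KEYWORDS = frozenset({
    "PROGRAM", "IS", "BEGIN", "END", "INSTRUCTION",
    "IF", "THEN", "ELSE", "WHILE", "DO",
})
CONDITIONS = frozenset({
    "next-is-empty", "next-is-not-empty",
    "next-is-wall", "next-is-not-wall",
    "next-is-friend", "next-is-not-friend",
    "next-is-enemy", "next-is-not-enemy",
    "random", "true",
})


def id_or_keyword_or_condition(t):
    """Decide among the three kinds that share one buffer state."""
    if t in KEYWORDS:
        return "KEYWORD"
    if t in CONDITIONS:
        return "CONDITION"
    return "IDENTIFIER"


def token_kind(bs, text):
    """Map a buffer state (and, where needed, the token text) to a token kind."""
    if bs == "ID_OR_KEYWORD_OR_CONDITION_BS":
        return id_or_keyword_or_condition(text)
    by_state = {
        "WHITE_SPACE_BS": "WHITE_SPACE",
        "COMMENT_BS": "COMMENT",
        "ERROR_BS": "ERROR",
    }
    return by_state[bs]  # EMPTY_BS has no token kind and raises KeyError here
```

Only the ID_OR_KEYWORD_OR_CONDITION_BS state needs to look at the token text; for the other states the buffer state alone determines the kind.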