Token patterns for names, reserved words, and integer literals

Chapter 4
Lexical and Syntax Analysis
4.1 Introduction
Three approaches to implement programming languages
•
Compilation (C, C++, Compiler)
•
Pure interpretation (JavaScript, Interpreter)
•
Hybrid implementation (Java)
Compiler front end consists of
a Lexical Analyzer and a Syntax Analyzer
The Lexical Analyzer is the first phase of a compiler. Its main task
is to read the input characters and produce as output a sequence
of that the syntax analyzer uses for syntax analysis (dealing with
names and numerical literals).
The Syntax Analyzer obtains a string of tokens from the lexical
analyzer and verifies that the string can be generated by the
grammar for the source language (dealing with expressions,
statements, and so on).
Read
character
input
Lexical
Analyzer
Pass token and
its attributes
Parser
Push back
character
Syntax
Analysis
Lexical
Analysis
4.2 Lexical Analysis
The lexical analyzer extract lexemes from the stream of input characters and
produce the corresponding tokens.
4.2.1 Example: the statement from the source code
sum = oldsum - value / 100;
would be grouped into the following tokens:
Token
Lexeme
1.
The identifier (IDENT)
sum
2.
The assignment symbol (ASSIGN_OP)
=
3.
The identifier (IDENT)
oldsum
4.
The minus sign (SUBTRACT_OP)
-
5.
The identifier (IDENT)
value
6.
The division sign (DIVISION)
/
7.
The number (INT_LIT)
100
8.
The semicolon (SEMICOLON)
;
4.2.2 Functions of Lexical Analysis
•
Removing White Space and Comments
“white space” (blanks, tabs, and newlines)
•
Recognizing Constants
Collect a sequence of digits into integers.
For example: the input
31 + 28 + 59
Let num be the token representing an integer.
The Lexical Analyzer should group 31, 28, 59 into the token num and pass
num to the syntax analyzer, also pass these values along as attributes of num.
•
Recognizing Identifiers and Keywords
Identifiers are used as names of variables, arrays, functions, an so on. A
grammar for a language treats an identifier as a token (id)
For example: the input
count = count + increament
The Lexical Analyzer should convert this statement into
id = id + id
The Lexical Analyzer is also able to recognize the keywords reserved for the
programming language.
4.3 A simpler lexical analyzer:
Only recognize program names, reserved words, and integer literals
Token patterns for names, reserved words, and integer literals.
•
Names: consists of strings of uppercase letters, lowercase letters, and
digits, but begin with a letter.
•
Reserved words: use a lookup in a table of reserved words to determine
which names are reserved words.
•
Integer literal: begin with ten different characters.
Subprograms of the lexical analyzer
•
getChar( ) gets the next character of from the input program,
puts its in a global variable (nextChar), determine
the character class and puts it in charClass.
•
addChar( ) puts nextChar into lexeme (buffer)
•
lookup( )
check whether the current contents of lexeme is a
reserved word or name
Finite automata: State diagrams of form are used for lexical analyzers.
A state diagram to recognize names, reserved words, and integer literals
Implementation
/* lex-a simple lexical analyzer */
int lex( ) {
getChar( );
switch(charClass) {
/* Parse identifiers and reserved words
*/
case LETTER:
addChar( );
getChar( );
while (charClass == LETTER || charClass == DIGIT) {
addChar( );
getChar( );
}
return lookup(lexeme);
break;
/* Parse integer literals */
case DIGIT:
addChar( );
getChar( );
while (charClass == DIGIT) {
addChar( )
getChar( )
}
return INT_LIT;
break;
}/* End of switch */
}/* End of function lex */
4.3 Syntax Analysis (Parsing)
Parsers for programming languages are used to construct parse trees for
given programs.
Top-down Parsers: The tree is built from the root downward to the leaves.
Bottom-up Parsers: The tree is built from the leaves upward to the root.
•
Terminal symbols – Lowercase letters at the beginning of the alphabet
(a, b, …)
•
Nonterminal symbols - Uppercase letters at the beginning of the
alphabet (A, B, …)
•
Terminal or Nonterminals - Uppercase letters at the end of the
alphabet (W, X, Y, Z)
•
Mixed strings (Terminal and/or Nonterminals) - Lowercase
Greek letters (a, b, d, g)
4.4 Recursive-Descent Parsing (Top-down Parsing)
Example 1: Consider the grammar
S  cAd
A  ab | a
and the input string cad
1. Create a tree
consisting of the root S
3. Point to a and expand
the tree by first RHS of
the production for A
S
c
A
S
d
2. Input c by a pointer
and expand the tree by
the first production for S
c
a
A
S
d
b
4. Point to d and it does
not match tree generated
by the first RHS of the
production for A
c
A
d
a
5. Go back to A, reset
pointer to d and expand
the tree generated by the
second RHS of the
production for A
Step 1: The top-down construction of a parse tree is done by starting with root, labeled
with the starting nonterminal, and repeatedly performing the following the next two
steps
Step 2: At node n, labeled with nonterminal A, select one of the productions for A and
construct children at n for the symbols on the right side of the production.
Step 3: Find the next node at which a subtree is to be constructed.
Example 2: Consider the grammar
type  simple | id | array [ simple ] of type
simple  integer | char | num dotdot num
and the input string array [ num dotdot num ] of integer
Implementation of the recursive-descent parser
Each non-terminal has a subprogram
For example:
<expr>  <term> { (+ | -) <term> }
<term>  <factor> { (*|/) <factor> }
<factor>  id | ( <expr> )
The LL Grammar algorithms
The first L in LL specifies a left-to-right scan of the input; the second L specifies that a
leftmost derivation is generated.
Problem 1: Left recursion
Example:
Approach
Problem 2: Choice of the correct RHS
Approach
4.5 Bottom-up Parsing
Example: Consider the grammar
S  aABe
A  Abc | b
Bd
and the sentence abbcde
S
=>
aABe
=>
aAde
=>
aAbcde
=>
abbcde