Chapter 4
Lexical and Syntax Analysis
ISBN 0-321-19362-8
Chapter 4 Topics
• Introduction
• Lexical Analysis
• The Parsing Problem
• Recursive-Descent Parsing
• Bottom-Up Parsing
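lexical analysis
The part of a compiler that analyzes the words of the source program, checks
that they are well formed, and converts them into an internal representation
that is passed to the other parts of the compiler (such as the parser).
parsing
The process of breaking the statements of a program down into basic units
that can be translated into machine instructions. The language processor
carries this out according to the rules fixed by the grammar.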
Copyright © 2004 Pearson Addison-Wesley. All rights reserved.
Introduction
• Language implementation systems must
analyze source code, regardless of the specific
implementation approach
• Nearly all syntax analysis is based on a formal
description of the syntax of the source
language (BNF)
Introduction
• The syntax analysis portion of a language
processor nearly always consists of two parts:
– A low-level part called a lexical analyzer
(mathematically, a finite automaton based on a regular grammar)
– A high-level part called a syntax analyzer, or
parser (mathematically, a push-down automaton based on a
context-free grammar, or BNF)
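lexical analyzer
A basic component of a compiler. It reads the characters of the source
program, usually scanning them from left to right, builds the words or
symbols of the source program, and passes those symbols on to the parser,
deleting comments along the way. The scanner can also enter identifiers into
the symbol table and perform various simple tasks that do not require
analyzing the source program.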
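syntax
The relationships among characters or groups of characters, independent of
the meaning of the groups themselves and of how they are interpreted and used.
The structure of expressions in a computer language.
The rules that determine the structure of a language.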
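parser
In PC (personal computer) games, a piece of software that interprets the
player's responses and lets the computer understand user input given in the
form of English sentences.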
Introduction
• Reasons to use BNF to describe syntax:
– Provides a clear and concise syntax description
– The parser can be based directly on the BNF
– Parsers based on BNF are easy to maintain
Introduction
• Reasons to separate lexical and syntax
analysis:
– Simplicity - less complex approaches can be used for
lexical analysis; separating them simplifies the parser
– Efficiency - separation allows optimization of the lexical
analyzer
– Portability - parts of the lexical analyzer may not be
portable, but the parser can always be portable: the syntax
analyzer can be platform independent, whereas the lexical
analyzer is somewhat platform dependent
Lexical Analysis
The first phase of the compiler
Lexical Analyzer
• The lexical analyzer is also called a scanner.
• A lexical analyzer is a pattern matcher
– It finds a substring of a given string of characters
that matches a given pattern of characters
– It does this by reading the source program one
character at a time
Lexical Analyzer
• Lexical analysis can be viewed as very low-level
syntax analysis
• The lexical analyzer collects characters into logical
groupings and assigns internal codes to them
according to their structure
• The character groupings are called lexemes
• The internal codes are called tokens, and are usually
coded as integer values
Lexical Analyzer
• The lexical analyzer’s job is to transform the
source program into a stream of tokens that
will be the input to the syntax analyzer (also
known as the parser)
• The lexical analyzer is usually implemented
as a function that produces the next token and
returns it to the caller (the parser).
Tokens and lexemes
• sum = oldsum - value / 100;
Token          Lexeme
IDENT          sum
ASSIGN_OP      =
IDENT          oldsum
SUBTRACT_OP    -
IDENT          value
DIVIDE_OP      /
INT_LITERAL    100
SEMICOLON      ;
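The token and lexeme lists above can be reproduced with a small hand-written scanner. The sketch below is only an illustration: the token names follow the slide, while `ERROR_TOK`, `TokenLexeme`, `tokenize`, and `demoTokenize` are names assumed here, and only the token kinds needed for this one statement are handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Token codes named after the slide; ERROR_TOK is an added assumption.
enum Token { IDENT, ASSIGN_OP, SUBTRACT_OP, DIVIDE_OP, INT_LITERAL, SEMICOLON, ERROR_TOK };

// One scanned unit: the internal code plus the characters that produced it.
struct TokenLexeme {
    Token token;
    std::string lexeme;
};

// Split a statement into (token, lexeme) pairs, reading one character at a time.
std::vector<TokenLexeme> tokenize(const std::string& src) {
    std::vector<TokenLexeme> out;
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char ch = static_cast<unsigned char>(src[i]);
        if (std::isspace(ch)) { ++i; continue; }          // skip white space
        if (std::isalpha(ch)) {                           // letter (letter | digit)*
            std::size_t start = i;
            while (i < src.size() && std::isalnum(static_cast<unsigned char>(src[i]))) ++i;
            out.push_back({IDENT, src.substr(start, i - start)});
        } else if (std::isdigit(ch)) {                    // digit digit*
            std::size_t start = i;
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) ++i;
            out.push_back({INT_LITERAL, src.substr(start, i - start)});
        } else {                                          // single-character tokens
            Token t = ERROR_TOK;
            switch (src[i]) {
                case '=': t = ASSIGN_OP;   break;
                case '-': t = SUBTRACT_OP; break;
                case '/': t = DIVIDE_OP;   break;
                case ';': t = SEMICOLON;   break;
                default:  break;
            }
            out.push_back({t, std::string(1, src[i])});
            ++i;
        }
    }
    return out;
}

// Usage: the statement from the slide yields eight token/lexeme pairs.
void demoTokenize() {
    std::vector<TokenLexeme> toks = tokenize("sum = oldsum - value / 100;");
    assert(toks.size() == 8);
    assert(toks[3].token == SUBTRACT_OP && toks[3].lexeme == "-");
}
```

Several different lexemes (`sum`, `oldsum`, `value`) share the single token `IDENT`, which is why the scanner keeps the lexeme alongside the token code.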
Building a lexical analyzer
• Three approaches to building a lexical analyzer:
– Use a scanner generator, such as lex
– Design a state transition diagram that describes the
token patterns of the language, and then write a
program that implements the diagram. We will
illustrate this approach
– Design a state transition diagram that describes the
token patterns of the language, and then hand-construct
a table-driven implementation of the diagram
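To make the third approach concrete, here is a minimal sketch of a table-driven recognizer. It is an assumption-laden illustration, not the implementation the slides go on to develop: the state names, the character classes, the `transition` table, and the functions `classify`, `run`, and `demoTableDriven` are all invented here, and the automaton only distinguishes identifiers from unsigned integers.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Character classes (table columns) and states (table rows); all names are illustrative.
enum CharClass { C_LETTER, C_DIGIT, C_OTHER, NUM_CLASSES };
enum State { S_START, S_IDENT, S_INT, S_ERROR, NUM_STATES };

// transition[state][class] gives the next state of the automaton.
const State transition[NUM_STATES][NUM_CLASSES] = {
    /* S_START */ { S_IDENT, S_INT,   S_ERROR },
    /* S_IDENT */ { S_IDENT, S_IDENT, S_ERROR },
    /* S_INT   */ { S_ERROR, S_INT,   S_ERROR },
    /* S_ERROR */ { S_ERROR, S_ERROR, S_ERROR },
};

// Map a character to its class so that the table needs only three columns.
CharClass classify(char ch) {
    unsigned char c = static_cast<unsigned char>(ch);
    if (std::isalpha(c)) return C_LETTER;
    if (std::isdigit(c)) return C_DIGIT;
    return C_OTHER;
}

// Run the automaton over the whole string and report the state it ends in.
State run(const std::string& text) {
    State state = S_START;
    for (char ch : text) {
        CharClass cls = classify(ch);
        assert(cls < NUM_CLASSES);   // classify never returns the sentinel
        state = transition[state][cls];
    }
    return state;
}

// Usage: the final state tells which pattern, if any, the text matched.
void demoTableDriven() {
    assert(run("oldsum") == S_IDENT);
    assert(run("100") == S_INT);
    assert(run("1x") == S_ERROR);
}
```

In this style the driver loop stays the same for every language; changing the token patterns means changing only the table.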
State transition diagrams
• State transition diagrams are sometimes referred to as
finite automata. They are basically directed graphs
with the following features:
– The nodes of a state diagram are labeled with state
names
– The arcs of a state diagram represent transitions
from one state to another. They are labeled with
input characters that cause the transitions.
– An arc may also be labeled with actions the lexical
analyzer must perform when the transition is taken
Finite automaton example
• We can design finite automata to recognize the tokens of a
programming language. Here is a finite automaton that
recognizes numeric literals:
[Figure not reproduced: finite automaton for numeric literals]
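Because the diagram itself is not available in this text, the following sketch assumes one plausible definition of a numeric literal: one or more digits, optionally followed by a decimal point and one or more digits. The state names and the functions `step`, `isNumericLiteral`, and `demoNumericLiterals` are hypothetical; the automaton on the original slide may accept a different set of strings.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// States of the assumed automaton: digits, optionally followed by '.' and more digits.
enum class NumState { Start, IntPart, Dot, FracPart, Reject };

// One transition: the next state for the current state and input character.
NumState step(NumState s, char ch) {
    bool digit = std::isdigit(static_cast<unsigned char>(ch)) != 0;
    switch (s) {
        case NumState::Start:
            return digit ? NumState::IntPart : NumState::Reject;
        case NumState::IntPart:
            if (digit) return NumState::IntPart;
            return ch == '.' ? NumState::Dot : NumState::Reject;
        case NumState::Dot:
            return digit ? NumState::FracPart : NumState::Reject;
        case NumState::FracPart:
            return digit ? NumState::FracPart : NumState::Reject;
        case NumState::Reject:
            return NumState::Reject;
    }
    return NumState::Reject;
}

// The text is accepted only if the automaton stops in an accepting state.
bool isNumericLiteral(const std::string& text) {
    NumState s = NumState::Start;
    for (char ch : text) s = step(s, ch);
    return s == NumState::IntPart || s == NumState::FracPart;
}

// Usage: a few strings checked against the assumed pattern.
void demoNumericLiterals() {
    assert(isNumericLiteral("100"));
    assert(isNumericLiteral("3.14"));
    assert(!isNumericLiteral("3."));
}
```

`IntPart` and `FracPart` are the accepting states; `Dot` is not, so a trailing decimal point is rejected under this assumed definition.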
A simple scanner: finite automaton
[Figure not reproduced: finite automaton for the simple scanner]
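A simple scanner: declarations
// SimpleLex.cpp
#include <iostream>
#include <fstream>
#include <string>
#include <cctype>
using namespace std;

// Token codes returned by NextToken
enum Token {IDENT, NUM_LITERAL, ASSIGNOP,
            ADDOP, MULOP, LPAREN, RPAREN, SEMICOLON,
            EOFTOK, ERRTOK};
// States of the scanner's finite automaton
enum States {START, INIDENT, INNUMBER};

string Lexeme;   // characters of the token being recognized
Token curtok;    // most recently returned token
ifstream fin;    // source file being scanned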
A simple scanner
NextToken: state = START (1)
Token NextToken() {
    Lexeme = "";
    char ch;
    States state = START;
    while (fin.get(ch)) {
        switch (state) {
        case START:
            if (isspace(ch))
                state = START;
            else if (isalpha(ch)) {
                Lexeme = Lexeme + ch;
                state = INIDENT;
            }
A simple scanner
NextToken: state = START (2)
            else if (isdigit(ch)) {
                Lexeme = Lexeme + ch;
                state = INNUMBER;
            }
            else if (ch == '=')
                return ASSIGNOP;
            else if (ch == '+')
                return ADDOP;
            else if (ch == '*')
                return MULOP;
            else if (ch == '(')
                return LPAREN;
A simple scanner
NextToken: state = START (3)
            else if (ch == ')')
                return RPAREN;
            else if (ch == ';')
                return SEMICOLON;
            else
                return ERRTOK;
            break;
            // end state == START
A simple scanner
NextToken: state = INIDENT
        case INIDENT:
            if (isalnum(ch))
                Lexeme = Lexeme + ch;
            else {
                fin.putback(ch);   // ch belongs to the next token
                return IDENT;
            }
            break;
A simple scanner
NextToken: state = INNUMBER
        case INNUMBER:
            if (isdigit(ch))
                Lexeme = Lexeme + ch;
            else {
                fin.putback(ch);   // ch belongs to the next token
                return NUM_LITERAL;
            }
            break;
        } // end switch(state)
    } // end while (fin.get(ch))
A simple scanner
NextToken (after while loop)
    // End of input: return a token still in progress, otherwise EOFTOK
    if (state == INIDENT)
        return IDENT;
    else if (state == INNUMBER)
        return NUM_LITERAL;
    else
        return EOFTOK;
}
State transition diagrams
• In many cases, transitions can be combined to
simplify the state diagram
– When recognizing an identifier, all uppercase and
lowercase letters are equivalent
• Use a character class that includes all letters
– When recognizing an integer literal, all digits are
equivalent - use a digit class
– Reserved words and identifiers can be recognized
together (rather than having a part of the diagram
for each reserved word)
• Use a table lookup to determine whether a possible
identifier is in fact a reserved word
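The table lookup for reserved words might look like the sketch below. The token codes and the three reserved words are examples chosen here, not taken from the slides, and an `std::unordered_map` is just one convenient way to hold the table.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>

// Token codes; the reserved words chosen here are only examples.
enum Token { IDENT, IF_KW, WHILE_KW, RETURN_KW };

// Classify a lexeme that was already scanned as letter (letter | digit)*:
// a reserved word gets its own code, anything else is an ordinary identifier.
Token lookup(const std::string& lexeme) {
    static const std::unordered_map<std::string, Token> reserved = {
        {"if", IF_KW}, {"while", WHILE_KW}, {"return", RETURN_KW},
    };
    auto it = reserved.find(lexeme);
    return it != reserved.end() ? it->second : IDENT;
}

// Usage: the scanner calls lookup once the whole lexeme has been collected.
void demoReservedWords() {
    assert(lookup("while") == WHILE_KW);
    assert(lookup("whilex") == IDENT);
}
```

With this arrangement the state diagram needs only one path for names, and adding a reserved word means adding one table entry.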
State transition diagrams
• Convenient utility subprograms:
– getChar - gets the next character of input,
puts it in nextChar, determines its class and
puts the class in charClass
– addChar - appends the character in nextChar
to lexeme, the string in which the lexeme is
being accumulated
– lookup - determines whether the string in lexeme
is a reserved word (returns a code)
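A minimal sketch of the first two utilities follows, assuming the source text is held in an in-memory string. The variables `input` and `pos`, the class codes `UNKNOWN` and `END_OF_INPUT`, and the function `demoUtilities` are assumptions introduced for the illustration; `nextChar`, `charClass`, and `lexeme` are the names used on the slide. The `lookup` subprogram is left out to keep the sketch short.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

enum CharClass { LETTER, DIGIT, UNKNOWN, END_OF_INPUT };

// Scanner state, kept global as the slide's description suggests.
std::string input;                   // assumed: the whole source text in memory
std::size_t pos = 0;                 // index of the next unread character
char nextChar = '\0';                // character most recently read
CharClass charClass = END_OF_INPUT;  // class of nextChar
std::string lexeme;                  // lexeme being accumulated

// getChar: fetch the next character into nextChar and record its class in charClass.
void getChar() {
    if (pos >= input.size()) {
        nextChar = '\0';
        charClass = END_OF_INPUT;
        return;
    }
    nextChar = input[pos++];
    unsigned char c = static_cast<unsigned char>(nextChar);
    if (std::isalpha(c))      charClass = LETTER;
    else if (std::isdigit(c)) charClass = DIGIT;
    else                      charClass = UNKNOWN;
}

// addChar: append nextChar to the lexeme being accumulated.
void addChar() {
    lexeme += nextChar;
}

// Usage: collect one run of letters and digits from the front of the input.
void demoUtilities() {
    input = "sum1 = 2";
    pos = 0;
    lexeme.clear();
    getChar();
    while (charClass == LETTER || charClass == DIGIT) {
        addChar();
        getChar();
    }
    assert(lexeme == "sum1");
    assert(nextChar == ' ' && charClass == UNKNOWN);
}
```

Note that the loop always reads one character past the end of the lexeme; that character stays in `nextChar` for whatever the scanner does next.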
State Diagram
[Figure not reproduced: state diagram]
Lexical Analysis
• Implementation (assume initialization):
int lex() {
    getChar();
    switch (charClass) {
        case LETTER:
            addChar();
            getChar();
            while (charClass == LETTER || charClass == DIGIT) {
                addChar();
                getChar();
            }
            return lookup(lexeme);
            break;
        …
Lexical Analysis
        case DIGIT:
            addChar();
            getChar();
            while (charClass == DIGIT) {
                addChar();
                getChar();
            }
            return INT_LIT;
            break;
    } /* End of switch */
} /* End of function lex */