ACSC373 – Compiler Writing
Chapter 4 – Compilers
4 - Syntax, Semantics and Translation
5 - Lexical Analysis
6 - Syntax and Semantic Analysis
7 - Code generation
The design of compilers
3 phases of the compilation process:
(1) lexical analysis: reads high-level source program and divides it into a stream of basic
lexical ‘tokens’.
(2) syntax and semantic analysis phase: then combines these tokens into data structures
reflecting the form of the source program in terms of the syntactic structures of the
language.
(3) code generator: converts these data structures into code for the target machine.
SYNTAX, SEMANTICS AND TRANSLATION
Language Translation: The Compilation Process
Compilers: large and complex programs
Task of a compiler: 1. the analysis of the source program (lexical syntax and semantic
analysis)
2. the synthesis of the object program (a single code-generation
phase)
The phases of the compilation process
source program → lexical analysis → syntax analysis → semantic analysis → code generation → object program
Lexical Analysis
Reads the characters of the source program and recognizes the tokens or basic syntactic
components that they represent. It is able to distinguish and pass on, as single units,
objects such as
Numbers,
Punctuation symbols,
Operators,
Reserved keywords,
Identifiers and so on.
Effect: it simplifies the syntax analyser, effectively reducing the size of the grammar the
syntax analyser has to handle.
In a free-format language, the lexical analyser ignores spaces, newlines, tabs and other
layout characters, as well as comments.
e.g. for i := 1 to 10 do sum := sum + term[i]; (*sum array*)
will be transformed by the lexical analyser into the sequence of tokens:
for   i   :=   1   to   10   do   sum   :=   sum   +   term   [   i   ]   ;
(the comment and the layout characters produce no tokens).
A Pascal compiler maintains a list of reserved words so that they can be distinguished from
identifiers and passed to the next phase of the compiler in the form of, for example, a
short integer code (or, alternatively, via a symbol table).
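As an illustration only, a minimal lexical analyser for this fragment might be sketched in Python as follows (the token names, the patterns and the reserved-word list are assumptions chosen for this example, not those of any particular compiler):

import re

# Assumed (illustrative) reserved words and token patterns for a tiny Pascal-like fragment.
RESERVED = {"for", "to", "do", "begin", "end"}

TOKEN_SPEC = [
    ("COMMENT",    r"\(\*.*?\*\)"),      # (* ... *) comments are discarded
    ("WHITESPACE", r"[ \t\r\n]+"),       # layout characters are ignored
    ("NUMBER",     r"\d+"),
    ("IDENT",      r"[A-Za-z][A-Za-z0-9]*"),
    ("ASSIGN",     r":="),
    ("PUNCT",      r"[\[\];+\-*/()]"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(source: str):
    """Split the source text into (kind, text) tokens, dropping comments and layout.
    Unrecognised characters are simply skipped in this sketch."""
    for match in MASTER.finditer(source):
        kind, text = match.lastgroup, match.group()
        if kind in ("COMMENT", "WHITESPACE"):
            continue                      # free-format: layout and comments carry no tokens
        if kind == "IDENT" and text.lower() in RESERVED:
            kind = "RESERVED"             # reserved words are distinguished from identifiers
        yield kind, text

print(list(lex("for i := 1 to 10 do sum := sum + term[i]; (*sum array*)")))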
Syntax Analysis
The syntax analyser or parser has to determine how the tokens returned by the lexical
analyser should be grouped and structured according to the syntax rules of the language.
Output: representation of the syntactic structure often expressed in the form of a tree (the
‘parse tree’).
Usually, the lexical analyser is responsible for all the simple syntactic constructs,
such as identifiers, reserved words and numbers, while the syntax analyser deals with all
the other structures.
Semantic Analysis
The semantic analyser determines the semantics or meaning of the source program (the
translation phase). It may cope with tasks involving declarations and scopes of identifiers,
storage allocation, type checking, selection of appropriate polymorphic operators, addition
of automatic type transfers, etc.
This phase is often followed by a process that takes the parse tree from the syntax
analyser and produces a linear sequence of instructions equivalent to the original source
program (instructions for a virtual machine).
Code Generation
The final phase takes the output from the semantic analyser or translator and outputs
machine code or assembly language for the target hardware (knowledge of the machine's
architecture is required to write a good code generator).
It may also perform code improvement or code optimisation.
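As a sketch of the idea only (the tuple representation of the tree and the PUSH/ADD/MUL instruction names are illustrative assumptions, not a real instruction set), a code generator for arithmetic expressions might walk the parse tree and emit instructions for a simple stack-based virtual machine:

# Expression trees are represented as nested tuples, e.g. a + b * c:
tree = ("+", "a", ("*", "b", "c"))

def generate(node, code):
    """Emit stack-machine instructions (hypothetical PUSH/ADD/MUL opcodes) for an expression tree."""
    if isinstance(node, str):            # a leaf: a variable or constant
        code.append(f"PUSH {node}")
    else:
        op, left, right = node
        generate(left, code)             # code for the left operand
        generate(right, code)            # code for the right operand
        code.append({"+": "ADD", "*": "MUL"}[op])
    return code

print(generate(tree, []))                # ['PUSH a', 'PUSH b', 'PUSH c', 'MUL', 'ADD']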
Syntax Specification
Role: to define the set of valid programs.
Sets
e.g. a digit in Pascal could be defined as {0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
or, {xy^n | n > 0}, i.e. the set of strings xy, xyy, xyyy, …
that is, a string that starts with a single x followed by any number
(greater than zero) of ys.
Backus-Naur Form (BNF)
A formal metalanguage that is frequently used in the definition of the syntax of
programming languages (introduced in the definition of ALGOL 60).
A technique for representing rules that can be used to derive sentences of the language.
(If a finite set of these rules can be used to derive all sentences of a language, then this set
of rules constitutes a formal definition of the syntax of the language).
Example A
<sentence> ::= <subject> <predicate>
<subject> ::= <noun> | <pronoun>
<predicate> ::= <transitive verb> <object> | <intransitive verb>
<noun> ::= cats | dogs | sheep
<pronoun> ::= I | we | you | they
<transitive verb> ::= like | hate | eat
<object> ::= biscuits | grass | sunshine
<intransitive verb> ::= sleep | talk | run
Using the complete set of rules, sentences can be generated by making random choices.
e.g. <sentence>
<subject> <predicate>
<noun> <predicate>
sheep <transitive verb> <object>
sheep eat <object>
sheep eat biscuits
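A small Python sketch of this 'generation by random choices' process for Example A (the dictionary encoding of the productions is just one possible representation):

import random

# The productions of Example A: each non-terminal maps to a list of alternatives,
# and each alternative is a sequence of terminals and non-terminals.
RULES = {
    "sentence":          [["subject", "predicate"]],
    "subject":           [["noun"], ["pronoun"]],
    "predicate":         [["transitive verb", "object"], ["intransitive verb"]],
    "noun":              [["cats"], ["dogs"], ["sheep"]],
    "pronoun":           [["I"], ["we"], ["you"], ["they"]],
    "transitive verb":   [["like"], ["hate"], ["eat"]],
    "object":            [["biscuits"], ["grass"], ["sunshine"]],
    "intransitive verb": [["sleep"], ["talk"], ["run"]],
}

def generate(symbol):
    """Expand a symbol: non-terminals are replaced by a randomly chosen alternative."""
    if symbol not in RULES:                       # a terminal symbol: emit it unchanged
        return [symbol]
    alternative = random.choice(RULES[symbol])    # random choice among the alternatives
    return [word for part in alternative for word in generate(part)]

print(" ".join(generate("sentence")))             # e.g. "sheep eat biscuits"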
Other possible sentences in this language are:
I sleep,
Dogs hate grass,
We like sunshine
…
Note: the meaning of these sentences is not considered; e.g. "cats eat sunshine" is
syntactically correct but makes no sense, which is of no concern to the BNF rules.
Two distinct symbol types in BNF:
1. Symbols such as <sentence>, <pronoun> and <intransitive verb> are called non-terminal symbols (i.e. they can appear on the left-hand side of a BNF production).
2. Symbols such as cats, dogs, I and grass are called terminal symbols, since they
cannot be expanded further (the set of symbols of the language being defined).
In the notation used above, non-terminals are enclosed in angle brackets.
Other conventions
- Non-terminals are either enclosed in angle brackets or are single upper-case letters.
- Terminals are represented as single lower-case letters, digits, special symbols (such as
+, * or =), punctuation symbols or strings in bold type (such as begin).
Another language
Set of rules: S ::= S + T | T
              T ::= a | b
S and T are non-terminals, whereas
a, b and + are terminals
(S is being defined recursively).
S
S + T            (using S ::= S + T)
S + T + T        (using S ::= S + T)
T + T + T        (using S ::= T)
b + T + T        (using T ::= b)
b + a + T        (using T ::= a)
b + a + a        (using T ::= a)
Example
<expression> ::= <term> | <expression> + <term>
<term> ::= <primary> | <term> * <primary>
<primary> ::= a | b | c
a + b * c is generated as follows:
<expression>
<expression> + <term>
<term> + <term>
<primary> + <term>
a + <term>
a + <term> * <primary>
a + <primary> * <primary>
a + b * <primary>
a + b * c
or, a + b + c is generated as follows:
<expression>
<expression> + <term>
<expression> + <term> + <term>
<term> + <term> + <term>
<primary> + <term> + <term>
a + <term> + <term>
a + <primary> + <term>
a + b + <term>
a + b + <primary>
a + b + c
Syntax diagrams
Pictorial notation
A set of syntax diagrams, each defining a specific language construct.
e.g. the definition of a constant (in Pascal):
[Syntax diagram: an optional '+' or '-' sign followed by either a constant identifier or an
unsigned number; or, alternatively, a character string.]
i.e. a rectangular box denotes a non-terminal symbol;
the terminal symbols '+' and '-' are enclosed in circles.
(Any path through the diagram yields a syntactically correct structure.)
Compact specification (e.g. Pascal in only a couple of sheets of paper).
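A small Python sketch of following a path through this diagram (purely illustrative, and based on the reconstruction above; it assumes tokens are already classified into the (kind, text) form used earlier, with an assumed STRING kind for character strings):

def is_constant(tokens):
    """Follow the syntax diagram: either a single character string, or an optional
    '+'/'-' sign followed by a single constant identifier or unsigned number."""
    if len(tokens) == 1 and tokens[0][0] == "STRING":
        return True
    if tokens and tokens[0][1] in ("+", "-"):     # take the optional sign branch
        tokens = tokens[1:]
    return len(tokens) == 1 and tokens[0][0] in ("IDENT", "NUMBER")

print(is_constant([("PUNCT", "-"), ("NUMBER", "10")]))   # True
print(is_constant([("STRING", "'abc'")]))                # True
print(is_constant([("PUNCT", "+"), ("PUNCT", "+")]))     # False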
EBNF (Extended Backus-Naur Form)
- used in the ISO Pascal Standard
- differs from BNF in several ways
- principal changes: the set of metasymbols (symbols used for special purposes in the
metalanguage)
i.e. - terminal symbols enclosed in double quotation marks
- a full stop is used to end each production
- equals sign is used to separate the non-terminal from its definition
- parentheses are used for grouping
e.g.
AssignmentStatement = (Variable | FunctionIdentifier) ":=" Expression .
For options and repetition: [x] implies zero or one instance of x (i.e. x is optional)
                            {x} implies zero or more instances of x
e.g. Identifier = Letter {Letter | Digit} .
i.e. a simpler, much clearer definition of the syntax.
The same in BNF could be:
<identifier> ::= <letter> | <identifier> <letter>
| <identifier> <digit>
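For instance, this EBNF rule corresponds directly to a regular expression; a small illustrative check in Python (assuming, for simplicity, that only the unaccented letters A–Z and a–z count as letters):

import re

# Identifier = Letter {Letter | Digit} .  — one letter followed by zero or more letters or digits.
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9]*\Z")

for text in ["sum", "term2", "2term", "x_y"]:
    print(text, bool(IDENTIFIER.match(text)))   # sum True, term2 True, 2term False, x_y False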
Grammars
The study of grammars started long before programming languages and was based
primarily on the study of natural languages.
It then found direct relevance in the formal study of programming languages.
The work of Noam Chomsky has had a great influence here.
(A set of BNF rules forms only part of the definition of the grammar of a language, and
BNF is suitable only for a very restricted class of languages.)
The grammar – formally defined as a 4-tuple
G = (N, T, S, P)
N – the finite set of non-terminal symbols
T – the finite set of terminal symbols
S – the starting symbol (must be a member of N)
P – the set of productions (general form: α → β)
i.e. any occurrence of the string α in the string to be transformed can be
replaced by the string β.
e.g. the set of BNF productions presented in example A above forms a part of the
definition of the grammar of a language. The remainder of the definition is:
N = {sentence, subject, predicate, noun, pronoun, intransitive verb, transitive verb,
object}
T = {cats, dogs, sheep, I, we, you, they, like, hate, eat, biscuits, grass, sunshine, sleep,
talk, run}
S = sentence
Still, the definition of the grammar is not quite complete, the strings α and β must have
some relationship to the sets N and T.
Suppose U is the set of all terminal and non-terminal symbols of the language; that is,
U = N ∪ T.
U+ – the (positive) closure of U, i.e. the set of all non-empty strings of symbols from U
U* – the closure of U, i.e. U+ ∪ {ε}, where ε denotes the empty string
Sentential form: any string that can be derived from the starting symbol
Sentence: a sentential form that does not contain any non-terminal symbols (just
terminals, no expansion).
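As a purely illustrative representation, the grammar of Example A can be written down directly as such a 4-tuple, e.g. in Python (the encoding of productions as pairs is an assumption made for this sketch):

# G = (N, T, S, P) for Example A; productions are (left-hand side, right-hand side) pairs.
N = {"sentence", "subject", "predicate", "noun", "pronoun",
     "transitive verb", "intransitive verb", "object"}
T = {"cats", "dogs", "sheep", "I", "we", "you", "they",
     "like", "hate", "eat", "biscuits", "grass", "sunshine",
     "sleep", "talk", "run"}
S = "sentence"
P = [
    ("sentence",  ("subject", "predicate")),
    ("subject",   ("noun",)),  ("subject", ("pronoun",)),
    ("predicate", ("transitive verb", "object")),
    ("predicate", ("intransitive verb",)),
    ("noun",      ("cats",)),  ("noun", ("dogs",)),  ("noun", ("sheep",)),
    # ... the remaining single-terminal productions follow the same pattern.
]
G = (N, T, S, P)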
We have seen how sentences can be generated very simply using a set of BNF
productions.
The reverse process, that of determining how the BNF rules were applied to generate a
given sentence, is much harder. This process of determining the syntactic structure of a
sentence is called parsing, or syntax analysis (a major part of the compilation of
high-level language programs).
Chomsky Classification
A grammar with productions of the general form
α → β (no restrictions on the strings α and β)
is called a Chomsky type 0 or free grammar
(too general)
If the productions are restricted to the form
αAβ → αγβ
where α, β and γ are members of U*,
γ is not null, and
A is a single non-terminal symbol,
then the resulting grammar is of type 1, the context-sensitive grammars.
Alternatively, if every production α → β satisfies |α| ≤ |β|, where |α| denotes the length
of the string α, the grammar is also classed as context-sensitive.
(A is transformed to γ only when it occurs in the context of being preceded by α and
followed by β).
A grammar is of type 2, or context-free, if all productions are of the form
A → γ
where A is a single non-terminal (since A can always be transformed into
γ without any concern for its context).
(Immense importance in programming language design – corresponds directly to the
BNF notation, where each production has a single non-terminal symbol on its left-hand
side, and so any grammar that is expressed in BNF must be a context-free grammar).
e.g. Pascal and ALGOL 60 – context-free or type 2
A grammar is of type 3 (also called a finite, finite-state or regular grammar) if all
productions are of the form:
A → α
or
A → αB
where A and B are non-terminals and α is a terminal symbol
(too restricted)
Suitable only for defining some of the structures used as components of most programming
languages, e.g. identifiers, numbers, etc.
That is, hierarchically, the grammar classes are nested:
type 3 (regular grammars) ⊂ type 2 (context-free grammars) ⊂ type 1 (context-sensitive grammars) ⊂ type 0 (free grammars),
with complexity increasing from type 3 out to type 0.
e.g. all type 3 languages are also type 1 languages
In progressing from type 0 to type 3 grammars, language complexity, and hence the
complexity of recognisers, parsers or compilers, decreases.
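A rough Python sketch of how the form of a production determines its class (simplified to the context-free and regular forms only; symbols are plain strings and the set of non-terminals is taken as given):

def is_context_free(lhs, rhs, nonterminals):
    """Type 2 form: a single non-terminal on the left-hand side."""
    return len(lhs) == 1 and lhs[0] in nonterminals

def is_regular(lhs, rhs, nonterminals):
    """Type 3 form: A -> a  or  A -> aB (a terminal, optionally followed by one non-terminal)."""
    if not is_context_free(lhs, rhs, nonterminals):
        return False
    if len(rhs) == 1:
        return rhs[0] not in nonterminals
    return (len(rhs) == 2
            and rhs[0] not in nonterminals
            and rhs[1] in nonterminals)

NT = {"S", "T"}
print(is_context_free(["S"], ["S", "+", "T"], NT))   # True  (type 2 form)
print(is_regular(["S"], ["S", "+", "T"], NT))        # False (not of type 3 form)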
Type 2 and, especially, type 3 grammars are easier to handle; type 3 grammars can be
recognised by finite-state automata.
Protonotions (a concept from two-level grammars, such as the grammar used to define
ALGOL 68): sequences of "small syntactic marks" composed essentially of lower-case
letters (with spaces inserted for readability), used to represent terminal symbols,
e.g. 'letter a symbol' – the symbol with representation 'a'.
In each rule, the left-hand side is followed by a colon, alternatives are separated by
semicolons (;), members of an alternative are separated by commas (,) and the rule ends
with a full stop (.).
Semantics
Semantic rules specify the meanings or actions of all valid programs. (Techniques for
semantic specification are much more difficult than those for syntax specification and are
not yet as well developed.)
Compiler is concerned with two processes:
1. the analysis of the source program (concern of syntax)
2. the synthesis of the object program.
Syntax is largely concerned with the analysis phase
Semantics is largely concerned with the synthesis phase (+ sometimes syntax)
Specification of semantics:
- the operational approach
- denotational semantics
- the axiomatic approach
Parsing
e.g. <expression> ::= <term> | <expression> + <term>
<term> ::= <primary> | <term> * <primary>
<primary> ::= a | b | c
These can be used to generate expressions such as a * b + c.
However, the compiler has to reverse this process, that is, perform a syntax analysis of
the string, to determine whether a string such as a * b + c is a valid expression and, if so,
how it is structured in terms of the units <term> and <primary>.
The parse tree
Example (how the string a * b + c may be reduced)
Given the three productions defining <expression>, <term> and <primary>, the string a *
b + c can be reduced as:
a * b + c
<primary> * b + c                      (<primary> ::= a)
<primary> * <primary> + c              (<primary> ::= b)
<primary> * <primary> + <primary>      (<primary> ::= c)
<term> * <primary> + <primary>         (<term> ::= <primary>)
<term> * <primary> + <term>            (<term> ::= <primary>)
<term> + <term>                        (<term> ::= <term> * <primary>)
<expression> + <term>                  (<expression> ::= <term>)
<expression>                           (<expression> ::= <expression> + <term>)
However, if the productions are used differently,
a * b + c
<primary> * b + c                      (<primary> ::= a)
<primary> * <primary> + c              (<primary> ::= b)
<primary> * <primary> + <primary>      (<primary> ::= c)
<primary> * <term> + <primary>         (<term> ::= <primary>)
<primary> * <expression> + <primary>   (<expression> ::= <term>)
<primary> * <expression> + <term>      (<term> ::= <primary>)
<primary> * <expression>               (<expression> ::= <expression> + <term>)
<term> * <expression>                  (<term> ::= <primary>)
<expression> * <expression>            (<expression> ::= <term>)
the process then becomes stuck, implying the false deduction that a * b + c is not a
sentence.
The parsing process is not a trivial matter!
Syntactic structure of the string a * b + c (the parse tree):
<expression>
    <expression>
        <term>
            <term>
                <primary>
                    a
            *
            <primary>
                b
    +
    <term>
        <primary>
            c
The tree combines the relevant information contained in the set of productions, together
with the content of the original sentence, in a structure that is self-contained and which
can be used by subsequent phases of the compiler.
Parsing strategies
One (impractical) approach to the problem of parsing: take the starting symbol of the
language and generate all possible sentences from it.
If the input matches one of these sentences, then the input string is a valid sentence of
the language (impractical because there are infinitely many possible sentences).
Two categories:
1. Top-down parsers: starting at the root (the starting symbol) and proceeding to the
leaves.
2. Bottom-up parsers: start at the leaves and move up towards the root.
Top-down parsers are easy to write – the actual code can be derived directly from the
production rules – but the approach cannot always be applied.
Bottom-up parsers can handle a larger set of grammars.
Top-down parsing
The parsing process starts at the root of the parse tree; it first considers the starting
symbol of the grammar.
The goal: to produce, from this starting symbol, the sequence of terminal symbols that
have been presented as input to the parser.
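For the expression grammar used earlier, a top-down parser can be sketched as a recursive descent parser, one function per non-terminal. (This is only an illustrative sketch: the left-recursive rule <expression> ::= <expression> + <term> has been rewritten as a loop, a standard transformation, since direct left recursion cannot be handled top-down in this way.)

def parse_expression(tokens, pos=0):
    """<expression> ::= <term> { + <term> } — returns (parse tree, next position)."""
    tree, pos = parse_term(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "+":
        right, next_pos = parse_term(tokens, pos + 1)
        tree, pos = ("+", tree, right), next_pos
    return tree, pos

def parse_term(tokens, pos):
    """<term> ::= <primary> { * <primary> }"""
    tree, pos = parse_primary(tokens, pos)
    while pos < len(tokens) and tokens[pos] == "*":
        right, next_pos = parse_primary(tokens, pos + 1)
        tree, pos = ("*", tree, right), next_pos
    return tree, pos

def parse_primary(tokens, pos):
    """<primary> ::= a | b | c"""
    if pos < len(tokens) and tokens[pos] in ("a", "b", "c"):
        return tokens[pos], pos + 1
    raise SyntaxError(f"expected a primary at position {pos}")

tree, end = parse_expression(["a", "*", "b", "+", "c"])
print(tree, end == 5)        # ('+', ('*', 'a', 'b'), 'c') True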
Bottom-up parsing
Starts with the input string and repeatedly replaces strings on the right-hand sides of
productions by the corresponding strings on the left-hand sides of productions, until,
hopefully, just the starting symbol remains.
It is necessary to determine which strings should be replaced and in what order the
replacements should occur.
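A deliberately naive Python sketch of this idea: try every possible reduction and backtrack when a choice leads to a dead end, which is exactly the difficulty illustrated by the 'stuck' reduction sequence above; practical bottom-up parsers use parsing tables to choose the correct handle instead. (E, T and P abbreviate <expression>, <term> and <primary>.)

# Productions of the expression grammar, written as (left-hand side, right-hand side).
PRODUCTIONS = [
    ("E", ("E", "+", "T")), ("E", ("T",)),
    ("T", ("T", "*", "P")), ("T", ("P",)),
    ("P", ("a",)), ("P", ("b",)), ("P", ("c",)),
]

def reduce_once(form):
    """Yield every sentential form obtained by one reduction of some substring."""
    for lhs, rhs in PRODUCTIONS:
        for i in range(len(form) - len(rhs) + 1):
            if form[i:i + len(rhs)] == rhs:
                yield form[:i] + (lhs,) + form[i + len(rhs):]

def bottom_up_parse(tokens, start="E"):
    """Search over reduction sequences until only the starting symbol remains."""
    seen, stack = set(), [tuple(tokens)]
    while stack:
        form = stack.pop()
        if form == (start,):
            return True
        for reduced in reduce_once(form):
            if reduced not in seen:
                seen.add(reduced)
                stack.append(reduced)
    return False

print(bottom_up_parse(("a", "*", "b", "+", "c")))   # True
print(bottom_up_parse(("a", "+", "+", "b")))        # False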
Process of derivation (leftmost, rightmost).
Leftmost derivation of a * b + c from <expression>:
<expression>
<expression> + <term>
<term> + <term>
<term> * <primary> + <term>
<primary> * <primary> + <term>
a * <primary> + <term>
a * b + <term>
a * b + <primary>
a * b + c

Rightmost derivation:
<expression>
<expression> + <term>
<expression> + <primary>
<expression> + c
<term> + c
<term> * <primary> + c
<term> * b + c
<primary> * b + c
a * b + c
Handle: the substring that is reduced, replaced by the left-hand side of the corresponding
production.
Example
Parse of a * b + c
Sentential form            | Handle                  | Production used                          | Reduced sentential form
a * b + c                  | a                       | <primary> ::= a                          | <primary> * b + c
<primary> * b + c          | <primary>               | <term> ::= <primary>                     | <term> * b + c
<term> * b + c             | b                       | <primary> ::= b                          | <term> * <primary> + c
<term> * <primary> + c     | <term> * <primary>      | <term> ::= <term> * <primary>            | <term> + c
<term> + c                 | <term>                  | <expression> ::= <term>                  | <expression> + c
<expression> + c           | c                       | <primary> ::= c                          | <expression> + <primary>
<expression> + <primary>   | <primary>               | <term> ::= <primary>                     | <expression> + <term>
<expression> + <term>      | <expression> + <term>   | <expression> ::= <expression> + <term>   | <expression>
i.e. the canonical parse (canonical derivation)
Notes
Compilers can often be conveniently structured into four phases: lexical analysis, syntax
analysis, semantic analysis and code generation. The first three of these phases are
concerned with the analysis of the source program whereas code generation is concerned
with the synthesis of the object program.
There are several widely used techniques for the specification of the syntax of
programming languages. BNF is a particularly popular metalanguage used for this purpose.
The major part of the formal specification of the grammar of a language is the set of
productions. The Chomsky classification groups grammars according to the form of their
productions.
It may be possible to make use of two-level grammars to express language features, such
as context sensitivity. There are several other approaches to grammar specification which
are sometimes used.
Formal techniques are also available for the specification of semantics.
Parsers can be broadly classified into two groups: top-down parsers and bottom-up
parsers. Top-down parsers try to achieve the goal of recognising the starting symbol by
repeatedly subdividing the goal until terminal symbols from the input string can be
matched. Bottom-up parsers work directly on the input string, repeatedly replacing
symbols matching the right-hand sides of productions by the corresponding symbols on the
left-hand sides of productions until just the starting symbol remains.