Language Processors (E2.15)
Lecture 2: Language Grammars
Objectives
To introduce:
• The concept of grammars
• Types of grammars
• Chomsky’s hierarchy
• Examples; string derivations; parse trees
• Regular expressions
• Notations associated with grammars
E2.15 - Language Processors
(Lect 2)
2
Introduction to Grammars (1)
What are they?
• Grammars are the essential formalism for describing the structure of programs in a programming language.
• They provide a precise, yet easy to understand, syntactic specification of a programming language.
• Although in principle they describe only the syntax of a language, they are instrumental in the definition of its semantics.
Introduction to Grammars (2)
Why have them?
• [by definition] To describe a language precisely and unambiguously.
• From certain classes of grammars we can automatically construct an efficient parser that determines whether a source program is syntactically well formed [we will see how in coming lectures].
Introduction to Grammars (3)
Where is the challenge?
• If a language contained only a finite number of elements (strings), its definition would be straightforward (albeit tedious), since all the strings in the language could be listed.
• All languages of interest contain an infinite number of strings; therefore some means of representing an infinite number of strings in a finite manner is required.
• We will examine different notations, with varying expressive power.
Grammar Basics (1)
Form of a grammar (1)
• A grammar consists mainly of a set of production rules [full definition of grammar coming up].
• Production rules have two parts, a left-hand and a right-hand side, separated by a left-to-right arrow; the left-hand side is the name of the syntactic construct; the right-hand side shows a possible form of the syntactic construct.
• The right-hand side of a production rule can contain two kinds of symbols, terminal and non-terminal.
Grammar Basics (1)
Form of a grammar (2)
(e.g. of a production rule)
Expression -> ‘(’ Expression Operator Expression ‘)’
• Non-terminal symbols (here Expression and Operator) [usually denoted with capital letters (A, B, C)]
• Terminal symbols (here ‘(’ and ‘)’) [usually denoted with small letters (a, b, c)]
• Sequences of grammar symbols are usually denoted with Greek letters (α, β, γ...); the empty sequence is denoted by ε (epsilon)
Grammar Basics (2)
Example of a grammar (1)
[1] Expression -> ‘(’ Expression Operator Expression ‘)’
[2] Expression -> ‘1’
[3] Operator -> ‘+’
[4] Operator -> ‘*’
From these production rules you can produce various strings, for example:
(1*(1+1))
This is a sentential form of a “program text”, and the steps of the production process that lead to this string are called the derivation of that string.
Grammar Basics (2)
Example of a grammar (2) – derivation of a string
[1] Expression -> ‘(’ Expression Operator Expression ‘)’
[2] Expression -> ‘1’
[3] Operator -> ‘+’
[4] Operator -> ‘*’
(1*(1+1))
The (leftmost) derivation of the string was as follows; a label r@p means that rule r was applied to the symbol at position p of the preceding line:
    Expression    <- “Start Symbol”
1@1 ‘(’ Expression Operator Expression ‘)’
2@2 ‘(’ ‘1’ Operator Expression ‘)’
4@3 ‘(’ ‘1’ ‘*’ Expression ‘)’
1@4 ‘(’ ‘1’ ‘*’ ‘(’ Expression Operator Expression ‘)’ ‘)’
2@5 ‘(’ ‘1’ ‘*’ ‘(’ ‘1’ Operator Expression ‘)’ ‘)’
3@6 ‘(’ ‘1’ ‘*’ ‘(’ ‘1’ ‘+’ Expression ‘)’ ‘)’
2@7 ‘(’ ‘1’ ‘*’ ‘(’ ‘1’ ‘+’ ‘1’ ‘)’ ‘)’
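The derivation can be replayed mechanically. The following is a minimal Python sketch, not part of the original slides: the names RULES, STEPS and derive are illustrative, and each step is read as a (rule, position) pair with 1-based positions.

```python
# Rules numbered as on the slide: rule number -> (left-hand side, right-hand side).
RULES = {
    1: ("Expression", ["(", "Expression", "Operator", "Expression", ")"]),
    2: ("Expression", ["1"]),
    3: ("Operator", ["+"]),
    4: ("Operator", ["*"]),
}

def derive(start, steps):
    """Apply (rule, position) steps in order and return every sentential form."""
    form = [start]
    history = [list(form)]
    for rule, pos in steps:
        lhs, rhs = RULES[rule]
        if form[pos - 1] != lhs:
            raise ValueError(f"rule {rule} does not apply at position {pos}")
        form[pos - 1:pos] = rhs          # replace one symbol by the right-hand side
        history.append(list(form))
    return history

# The seven steps of the derivation, written as (rule, position).
STEPS = [(1, 1), (2, 2), (4, 3), (1, 4), (2, 5), (3, 6), (2, 7)]
history = derive("Expression", STEPS)
for form in history:
    print(" ".join(form))
sentence = "".join(history[-1])
```

Each printed line corresponds to one line of the derivation, and the last form contains only terminal symbols.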
Grammar Basics (3)
Derivations and Parse trees
Parse tree of the derivation [the order is no longer visible]:
Expression
    ‘(’
    Expression
        ‘1’
    Operator
        ‘*’
    Expression
        ‘(’
        Expression
            ‘1’
        Operator
            ‘+’
        Expression
            ‘1’
        ‘)’
    ‘)’
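A parse tree for this grammar can also be built directly from a string. The sketch below is one possible way to do it in Python (a small recursive-descent parser); the nested-tuple representation and the function names are assumptions made for illustration, not something the lecture prescribes.

```python
def parse_expression(text, i=0):
    """Parse an Expression starting at text[i]; return (tree, next index)."""
    if text[i:i + 1] == "1":                       # rule 2: Expression -> '1'
        return ("Expression", "1"), i + 1
    if text[i:i + 1] != "(":
        raise SyntaxError(f"expected '(' or '1' at index {i}")
    left, i = parse_expression(text, i + 1)        # rule 1: '(' Expression ...
    if text[i:i + 1] not in ("+", "*"):            # rules 3 and 4
        raise SyntaxError(f"expected an operator at index {i}")
    operator = ("Operator", text[i])
    right, i = parse_expression(text, i + 1)
    if text[i:i + 1] != ")":
        raise SyntaxError(f"expected ')' at index {i}")
    return ("Expression", "(", left, operator, right, ")"), i + 1

def parse(text):
    tree, end = parse_expression(text)
    if end != len(text):
        raise SyntaxError("unexpected trailing input")
    return tree

def leaves(node):
    """Concatenate the terminal symbols of a tree from left to right."""
    return "".join(c if isinstance(c, str) else leaves(c) for c in node[1:])

tree = parse("(1*(1+1))")
```

Reading the leaves of the tree from left to right gives back the original string, while the tree itself records which rule produced each part but not the order in which the rules were applied.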
Formal aspects (1)
Formal definition of a grammar
A grammar is defined as a quadruple (VT, VN, P, S) where
• VT is the set of terminal symbols
• VN is the set of non-terminal symbols
• VT and VN have no symbols in common, and V is the union of VT and VN
• P is the set of production rules, each element of which consists of a pair (α, β) (where α is in V+ [one or more elements of V] and β is in V* [zero or more elements of V]), and a production has the form: α -> β
• S is a member of VN, known as the start or sentence symbol; it is the starting point in the generation of any string in the language
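The quadruple maps naturally onto a small data structure. The following Python sketch is illustrative only: the class name Grammar, its field names and the method is_well_formed are assumptions, and the check covers just the conditions listed in the definition.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grammar:
    terminals: frozenset      # VT
    nonterminals: frozenset   # VN
    productions: tuple        # P: pairs (alpha, beta), each a tuple of symbols
    start: str                # S

    def is_well_formed(self):
        vocabulary = self.terminals | self.nonterminals      # V = VT union VN
        return (
            self.terminals.isdisjoint(self.nonterminals)     # no common symbols
            and self.start in self.nonterminals              # S is in VN
            and all(
                len(alpha) >= 1                              # alpha is in V+
                and set(alpha) <= vocabulary
                and set(beta) <= vocabulary                  # beta is in V*
                for alpha, beta in self.productions
            )
        )

# The expression grammar from the earlier example, written as a quadruple.
EXPR = Grammar(
    terminals=frozenset({"(", ")", "1", "+", "*"}),
    nonterminals=frozenset({"Expression", "Operator"}),
    productions=(
        (("Expression",), ("(", "Expression", "Operator", "Expression", ")")),
        (("Expression",), ("1",)),
        (("Operator",), ("+",)),
        (("Operator",), ("*",)),
    ),
    start="Expression",
)
```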
Formal aspects (2)
Types of grammar - Chomsky’s hierarchy
The grammar we defined has no restrictions on the
types of productions that may appear.
Chomsky defined four classes of grammar (types 0–3),
where type-0 is the unrestricted grammar we saw,
and the other types are derived by imposing
restrictions on the form of productions that may be
used
Type-0 grammars are equivalent to Turing machines.
Formal aspects (3)
Type-1 grammars
The first restriction that may be imposed on a grammar is that for each of its productions α -> β, it is necessary that |α| ≤ |β|.
Type-1 grammars (also known as context-sensitive grammars) are equivalent to linear-bounded automata.
Formal aspects (4)
Type-2 grammars
A further restriction that may be imposed on a grammar is that only a single non-terminal may appear on the left-hand side of a production.
These are also known as “context-free” (CF) grammars (compiler theory is based almost entirely on CF grammars), and are equivalent to push-down automata.
Formal aspects (5)
Type-3 grammars
If we further impose the restriction that productions should either all be left-linear, or all be right-linear, then the grammar is a type-3 grammar.
• “Right-linear”: every production is of the form A -> a or A -> bC
• “Left-linear”: every production is of the form A -> a or A -> Bc
Type-3 grammars are also known as regular grammars, and are equivalent to finite state automata.
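The restrictions for types 1 to 3 can be checked mechanically. The Python sketch below is a rough illustration under stated assumptions: every symbol is a single character, capital letters are non-terminals and everything else is a terminal, left-hand sides are non-empty, and the restrictions are applied cumulatively in the order the slides introduce them. The function names are hypothetical.

```python
def is_nonterminal(symbol):
    # Assumed convention: capital letters are non-terminals, the rest terminals.
    return symbol.isupper()

def is_right_linear(rhs):
    # A -> a  or  A -> bC
    return (len(rhs) == 1 and not is_nonterminal(rhs)) or (
        len(rhs) == 2 and not is_nonterminal(rhs[0]) and is_nonterminal(rhs[1]))

def is_left_linear(rhs):
    # A -> a  or  A -> Bc
    return (len(rhs) == 1 and not is_nonterminal(rhs)) or (
        len(rhs) == 2 and is_nonterminal(rhs[0]) and not is_nonterminal(rhs[1]))

def chomsky_type(productions):
    """Return the highest type (0-3) whose restrictions all hold."""
    if not all(len(lhs) <= len(rhs) for lhs, rhs in productions):
        return 0                                   # some |alpha| > |beta|
    if not all(len(lhs) == 1 and is_nonterminal(lhs) for lhs, _ in productions):
        return 1                                   # left side is not one non-terminal
    rhss = [rhs for _, rhs in productions]
    if all(map(is_right_linear, rhss)) or all(map(is_left_linear, rhss)):
        return 3
    return 2

print(chomsky_type([("S", "aS"), ("S", "b")]))     # all productions right-linear
print(chomsky_type([("S", "aSb"), ("S", "ab")]))   # context-free, not linear
```

Note that mixing left-linear and right-linear productions in one grammar does not satisfy the type-3 restriction, so such a grammar is reported as type 2.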
Formal aspects (6)
Backus-Naur Form (BNF)
• The notation used so far is sufficient, but in practice a richer notation, called the Backus-Naur form, is used.
• The right-hand sides of rules with the same left-hand side are combined, and separated by vertical bars.
• E.g. the rules
      N -> α, N -> β, N -> γ
  are replaced with
      N -> α | β | γ
• It is very suitable for expressing nesting and recursion
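Combining rules that share a left-hand side is a simple grouping step. As a small Python sketch (the function name to_bnf and the string encoding of right-hand sides are assumptions for illustration):

```python
def to_bnf(rules):
    """Combine rules with the same left-hand side, separating alternatives with '|'."""
    alternatives = {}
    for lhs, rhs in rules:
        alternatives.setdefault(lhs, []).append(rhs)
    return [f"{lhs} -> {' | '.join(rhss)}" for lhs, rhss in alternatives.items()]

# The N rules from the slide, and the four rules of the expression grammar.
n_rules = [("N", "α"), ("N", "β"), ("N", "γ")]
expression_rules = [
    ("Expression", "'(' Expression Operator Expression ')'"),
    ("Expression", "'1'"),
    ("Operator", "'+'"),
    ("Operator", "'*'"),
]
for line in to_bnf(n_rules) + to_bnf(expression_rules):
    print(line)
```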
Formal aspects (7)
Extended Backus-Naur Form (1) – the additional operators
• However, BNF is less convenient for expressing repetition and optionality.
• The extended BNF introduces three additional notations, each in the form of a postfix operator, to remedy this:
    • R+ indicates the occurrence of one or more Rs, to express repetition [the Kleene cross]
    • R? indicates the occurrence of zero or one Rs, to express optionality
    • R* indicates the occurrence of zero or more Rs, to express optional repetition [the Kleene star]
• Parentheses are added if these postfix operators are to operate on more than one grammar symbol.
Formal aspects (7)
Extended Backus-Naur Form (2) – example
The production rule
    Expression -> ‘way’* ‘too’ ‘boring’ (‘for’ (‘me’|‘us’))?
can produce the following sentences
    way too boring
    way way too boring
    way too boring for me
    way way too boring for us
    too boring
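Because this rule uses only repetition, optionality and alternation, it can be checked with a regular expression. The Python sketch below is one way to write it; it assumes that words are separated by single spaces, and the names RULE and produced are illustrative.

```python
import re

# 'way'* 'too' 'boring' ('for' ('me'|'us'))?  with single spaces between words
RULE = re.compile(r"(way )*too boring( for (me|us))?")

def produced(sentence):
    """True if the whole sentence can be produced by the rule."""
    return RULE.fullmatch(sentence) is not None

examples = [
    "way too boring",
    "way way too boring",
    "way too boring for me",
    "way way too boring for us",
    "too boring",
]
for sentence in examples:
    print(sentence, produced(sentence))
```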
Regular expressions (1)
Notation (1)
• Languages generated by type-3 grammars (regular grammars) can also be generated using another important notation, that of regular expressions.
• One of their most common applications is in advanced text searches [have a look at your favourite search engine].
• The simplest kind is a sequence of simple characters, for example (using Perl notation): /student/
• We can then use the Kleene star and cross, as well as the ? operator, to express repetition and optionality.
Regular expressions (1)
Notation (2)
The following are also used:
• The operators [ ] to express a range, for example [0-9] to mean any digit between 0 and 9 (also known as character classes)
• Other operators are also sometimes used; many are language or application specific. For example, Perl also has the /./ operator to mean any character (other than a newline, by default), and the /^/ operator (caret), placed directly after the opening square bracket, to specify what a pattern cannot be (for example, [^0-9] means not a digit), etc.
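These notations can be tried out quickly. The sketch below uses Python's re module rather than Perl, on the assumption that the reader has Python to hand; for the three notations shown here the two behave the same way. The sample strings are arbitrary.

```python
import re

# [0-9] matches any single digit.
digits = re.findall(r"[0-9]", "E2.15 Lect 2")

# [^0-9] matches any single character that is not a digit.
non_digits = re.findall(r"[^0-9]", "E2.15")

# An unescaped . matches any character, so it accepts 'x' where a dot was meant;
# escaping it as \. matches only a literal dot.
loose = re.fullmatch(r"E2.15", "E2x15") is not None
strict = re.fullmatch(r"E2\.15", "E2x15") is not None

print(digits, non_digits, loose, strict)
```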
Regular expressions (2)
Example – SheepTalk
• Consider the language of certain kinds of sheep, which consists of strings that look like:
    • baa!
    • baaa!
    • baaaa!
    • …
• The language consists of strings with a b, followed by at least two a’s, followed by an exclamation point.
• So the regular expression /baa+!/ generates this language, as does the regular expression /baaa*!/.
(example from Chapter 2 of Speech and Language Processing, Jurafsky and Martin)
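That the two expressions agree can be spot-checked on a handful of strings. This Python sketch only tests a few candidates and is not a proof of equivalence; the names PLUS, STAR and candidates are illustrative.

```python
import re

PLUS = re.compile(r"baa+!")    # b, a, then one or more a's, then !
STAR = re.compile(r"baaa*!")   # b, a, a, then zero or more a's, then !

# 'b' followed by 0..5 a's and '!', plus a few malformed strings.
candidates = ["b" + "a" * n + "!" for n in range(6)] + ["baa", "aa!", "Baa!"]

accepted = [s for s in candidates if PLUS.fullmatch(s)]
agree = all(
    (PLUS.fullmatch(s) is None) == (STAR.fullmatch(s) is None) for s in candidates
)
print(accepted, agree)
```

Matching is case-sensitive by default, so a string starting with a capital B is rejected by both expressions.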
Regular expressions (3)
Advantages / Limitations
• Regular expressions are powerful but have limitations
    • To generate the language {x^m y^n, with m, n ≥ 0} we can use the regular expression x*y*, but to generate the language {x^m y^m, m ≥ 0} there is no regular expression.
• However, they are simpler to use, so they are frequently used in the lexical analysis part of the input processing in language processors.
• Also, the theoretical constructs (Finite State Automata) associated with regular expressions are well studied, so a number of proofs can be made about the expressions and the languages that are generated by them.
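The difference between the two languages can be seen in a short Python sketch. The regular expression x*y* handles the first language; for the second, the sketch falls back on a recursive check that mirrors the context-free rule S -> x S y | ε. That rule and the function name in_xmym are assumptions introduced for illustration, not taken from the slides.

```python
import re

XSTAR_YSTAR = re.compile(r"x*y*")   # {x^m y^n, m, n >= 0}

def in_xmym(s):
    """Membership in {x^m y^m, m >= 0}, following S -> x S y | epsilon."""
    if s == "":
        return True                  # S -> epsilon
    # S -> x S y: strip one x from the front and one y from the back.
    return s[0] == "x" and s[-1] == "y" and in_xmym(s[1:-1])

for s in ["", "xy", "xxyy", "xxy", "yx"]:
    print(repr(s), XSTAR_YSTAR.fullmatch(s) is not None, in_xmym(s))
```

The string xxy is accepted by x*y* but rejected by the recursive check, because counting matching numbers of x and y needs the nesting that a context-free rule provides.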
Summary
Language Grammars:
• Are necessary for specifying the language in a formal and compact way
• Can be classified according to Chomsky’s hierarchy
• Type-3 grammars, or the equivalent Regular Expression notation, are used for the lexical analysis part of language processing, while CF grammars are more frequently associated with the syntax analysis part.
Next lectures:
More on Grammars
A complete example of a simple language processor
Recommended Reading
• Section 1.9 of Grune et al.
• Section 4.2 of Aho et al.
• Chapters 5, 8 of Terry
• Chapter 2 of “The Essence of Compilers” by Robin Hunter