Lecture - 03: Introduction to Language Theory

Introduction to Language Theory
Programming Language Concepts
Lecture 3
Prepared by
Manuel E. Bermúdez, Ph.D.
Associate Professor
University of Florida
Introduction to Language Theory
Definition: An alphabet (or vocabulary) Σ is a
finite set of symbols.
Example: Alphabet of Pascal:
+ - * / < …           (operators)
begin end if var      (keywords)
<identifier>          (identifiers)
<string>              (strings)
<integer>             (integers)
; : , ( ) [ ]         (punctuators)
Note: All identifiers are represented by one
symbol, because Σ must be finite.
Introduction to Language Theory
Definition: A sequence t = t₁t₂…tₙ of symbols
from an alphabet Σ is a string.
Definition: The length of a string t = t₁t₂…tₙ
(denoted |t|) is n. If n = 0, the string is ε, the
empty string.
Definition: Given strings s = s₁s₂…sₙ and
t = t₁t₂…tₘ, the concatenation of s and t,
denoted st, is the string s₁s₂…sₙt₁t₂…tₘ.
Introduction to Language Theory
Note: εu = u = uε, uεv = uv, for any
strings u,v (including ε)
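The definitions of length and concatenation can be tried out directly. The following is a minimal sketch, assuming strings are modeled as Python tuples of symbols so that a multi-character symbol such as begin counts as one symbol; the names EPSILON, length and concat are illustrative only.

```python
# Strings modeled as tuples of symbols (an assumption of this sketch),
# so that "begin" is ONE symbol of the alphabet, not five characters.
EPSILON = ()  # the empty string ε


def length(t):
    """|t|: the number of symbols in t."""
    return len(t)


def concat(s, t):
    """The concatenation st."""
    return s + t


u = ("begin", "<identifier>")
v = (";", "end")

print(length(EPSILON))                                # 0
print(concat(u, v))                                   # ('begin', '<identifier>', ';', 'end')
print(concat(EPSILON, u) == u == concat(u, EPSILON))  # True
```

The last line checks the note εu = u = uε for one particular u.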
Definition: Σ* is the set of all strings of
symbols from Σ.
Note: Σ* is called the reflexive, transitive
closure of Σ.
Σ* is described by the graph (Σ*, ·),
where “·” denotes concatenation, and
there is a designated “start” node, ε.
Introduction to Language Theory
Example: Σ = {a, b}.
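(Σ*, ·), with edges labeled by the symbol appended:
ε  --a--> a      ε  --b--> b
a  --a--> aa     a  --b--> ab
b  --a--> ba     b  --b--> bb
ab --a--> aba    ab --b--> abb
…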
Σ* is countably infinite, so we cannot compute all of
Σ*; we can only compute finite subsets of it.
We can, however, compute whether a given string
is in Σ*.
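Both claims can be illustrated with a short sketch, assuming single-character symbols and ordinary Python strings. The generator below never finishes on its own, so only a finite prefix of Σ* is ever taken; the function names are illustrative.

```python
from itertools import count, islice, product


def sigma_star(alphabet):
    """Yield every string over `alphabet`, shortest first (ε comes first)."""
    symbols = sorted(alphabet)
    for n in count(0):                       # lengths 0, 1, 2, ...
        for combo in product(symbols, repeat=n):
            yield "".join(combo)


def in_sigma_star(s, alphabet):
    """A string is in Σ* exactly when each of its symbols is in Σ."""
    return all(ch in alphabet for ch in s)


first = list(islice(sigma_star({"a", "b"}), 7))
print(first)                                 # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
print(in_sigma_star("abba", {"a", "b"}))     # True
print(in_sigma_star("abc", {"a", "b"}))      # False
```

Listing the strings in order of length is what makes Σ* countable: every string appears at some finite position in the enumeration.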
Introduction to Language Theory
Example: Σ = Pascal vocabulary.
Σ* = all possible alleged Pascal
programs, i.e. all possible inputs to
Pascal compiler.
Need to specify L ⊆ Σ*, the correct
Pascal programs.
Definition: A language L over an
alphabet Σ is a subset of Σ*.
Introduction to Language Theory
Example: Σ = {a, b}.
L1 = ø is a language
L2 = {ε} is a language
L3 = {a} is a language
L4 = {a, ba, bbab} is a language
L5 = {aⁿbⁿ | n ≥ 0} is a language
where aⁿ = aa…a, n times
L6 = {a, aa, aaa, …} is a language
Note: L5 is an infinite language, but
described finitely.
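A finite description of L5 also gives a finite inclusion test. One possible sketch, assuming strings over Σ = {a, b} are plain Python strings (the name in_L5 is illustrative):

```python
def in_L5(s):
    """Inclusion test for L5 = {a^n b^n | n >= 0}."""
    n, remainder = divmod(len(s), 2)
    # The length must be even, and s must be n a's followed by n b's.
    return remainder == 0 and s == "a" * n + "b" * n


for s in ["", "ab", "aabb", "aab", "ba", "abab"]:
    print(repr(s), in_L5(s))
```

The test always terminates even though L5 itself is infinite.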
Introduction to Language Theory
THIS IS THE MAIN GOAL OF LANGUAGE
SPECIFICATION:
To describe (infinite) programming
languages finitely, and to provide
corresponding finite inclusion-test
algorithms.
Language Constructors
Definition: The catenation (or product) of two
languages L1 and L2, denoted L1L2, is the set
{uv | u ∈ L1, v ∈ L2}.
Example: L1 = {ε, a, bb}, L2 = {ac, c}
L1L2 = {ac, c, aac, ac, bbac, bbc}
= {ac, c, aac, bbac, bbc}
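For finite languages the catenation can be computed directly. A small sketch, assuming languages are Python sets of strings and ε is the empty string "":

```python
def catenate(L1, L2):
    """L1L2 = {uv | u in L1, v in L2}."""
    return {u + v for u in L1 for v in L2}


L1 = {"", "a", "bb"}   # "" plays the role of ε
L2 = {"ac", "c"}

print(sorted(catenate(L1, L2)))   # ['aac', 'ac', 'bbac', 'bbc', 'c']
```

Because the result is a set, the duplicate ac (from ε·ac and a·c) appears only once, as in the example.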
Language Constructors
Definition: Lⁿ = LL…L (n times),
and L⁰ = {ε}.
Example: L = {a, bb}
L³ = {aaa, aabb, abba,
abbbb, bbaa, bbabb, bbbba, bbbbbb}
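The power of a language is repeated catenation starting from {ε}. A minimal sketch under the same assumptions as before (sets of Python strings, "" for ε; function names are illustrative):

```python
def catenate(L1, L2):
    """L1L2 = {uv | u in L1, v in L2}."""
    return {u + v for u in L1 for v in L2}


def power(L, n):
    """L^n, with L^0 = {ε}."""
    result = {""}
    for _ in range(n):
        result = catenate(result, L)
    return result


L = {"a", "bb"}
print(len(power(L, 3)))    # 8
print(sorted(power(L, 3)))
```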
Language Constructors
Definition: The union of two languages L1 and
L2 is the set L1 ∪ L2 = {u | u ∈ L1} ∪ {v | v ∈ L2}.
Definition: The Kleene star (L*) of a language
L is the set L* = ∪ Lⁿ, n ≥ 0.
Example: L = {a, bb}
L* = {any string composed of a’s and
bb’s}
Definition: The Transitive Closure (L+) of a
language L is the set L+ = ∪ Lⁿ, n ≥ 1.
Language Constructors
Note:
In general, L* = L+ ∪ {ε}, but L+ need not equal L* – {ε}.
For example, consider L = {ε}. Then
{ε} = L+ ≠ L* – {ε} = {ε} – {ε} = ø.
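Since L* is usually infinite, a program can only build the part of it up to some length. The sketch below assumes finite languages given as sets of Python strings, uses "" for ε, and relies on the identity L+ = L L*; star_upto and plus_upto are illustrative names.

```python
def catenate(L1, L2):
    """L1L2 = {uv | u in L1, v in L2}."""
    return {u + v for u in L1 for v in L2}


def star_upto(L, max_len):
    """The strings of L* whose length is at most max_len."""
    result = {""}            # L^0 = {ε}
    frontier = {""}          # strings found in the previous round
    while frontier:
        longer = {w for w in catenate(frontier, L) if len(w) <= max_len}
        frontier = longer - result
        result |= frontier
    return result


def plus_upto(L, max_len):
    """The strings of L+ = L L* whose length is at most max_len."""
    return {w for w in catenate(L, star_upto(L, max_len)) if len(w) <= max_len}


L = {"a", "bb"}
print(sorted(star_upto(L, 3)))                        # ['', 'a', 'aa', 'aaa', 'abb', 'bb', 'bba']
print(plus_upto(L, 3) == star_upto(L, 3) - {""})      # True: ε is not in L

E = {""}                                              # L = {ε}
print(plus_upto(E, 3), star_upto(E, 3) - {""})        # {''} set()
```

The last line reproduces the counterexample: for L = {ε}, L+ still contains ε while L* – {ε} is empty.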
Grammars
Goal: Providing a means for describing
languages finitely.
Method: Provide a subgraph (Σ*, →*) of
(Σ*, ·), and a start node S, such that
the set of reachable nodes (from S)
are the strings in the language.
Grammars
Example: Σ = {a, b}
L = {aⁿbⁿ | n ≥ 0}
[Figure: the graph (Σ*, ·) for Σ = {a, b}, drawn from the start node ε with edges labeled a and b; nodes shown include aa, ab, ba, bb, aaa, aab, bba, bbb, aaba, aabb, bbaa, bbab.]
Grammars
“=>” (derives) is a relation defined by a
finite set of rewrite rules known as
productions.
Definition: Given a vocabulary V, a
production is a pair (u, v) ∈ V* × V*,
denoted u → v. u is called the left-part;
v is called the right-part.
Grammars
Example: Pseudo-English.
V = {Sentence, NP, VP, Adj, N, V, boy, girl, the,
tall, jealous, hit, bit}
Sentence → NP VP
NP       → N
NP       → Adj NP
N        → boy
N        → girl
Adj      → the
Adj      → tall
Adj      → jealous
VP       → V NP
V        → hit
V        → bit
(each line is one production)
Note: English is much too complicated to be described this way.
Grammars
Definition:
Given a finite set of productions P ⊆ V* × V*, the
relation => is defined such that
∀ α, β, u, v ∈ V*, αuβ => αvβ iff
u → v ∈ P is a production.
Example:
Sentence → NP VP
NP       → N
NP       → Adj NP
N        → boy
N        → girl
Adj      → the
Adj      → tall
Adj      → jealous
VP       → V NP
V        → hit
V        → bit
Grammars
Sentence => NP VP
         => Adj NP VP
         => the NP VP
         => the Adj NP VP
         => the jealous NP VP
         => the jealous N VP
         => the jealous girl VP
         => the jealous girl V NP
         => the jealous girl hit NP
         => the jealous girl hit Adj NP
         => the jealous girl hit the NP
         => the jealous girl hit the N
         => the jealous girl hit the boy
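The relation => can be computed mechanically from the productions. The following sketch assumes sentential forms are tuples of vocabulary symbols and productions are (left, right) pairs of such tuples; derives and is_derivation are illustrative names. It checks that each step in the derivation of "the jealous girl hit the boy" rewrites exactly one occurrence of a left-part.

```python
PRODUCTIONS = [
    (("Sentence",), ("NP", "VP")),
    (("NP",), ("N",)),
    (("NP",), ("Adj", "NP")),
    (("N",), ("boy",)),
    (("N",), ("girl",)),
    (("Adj",), ("the",)),
    (("Adj",), ("tall",)),
    (("Adj",), ("jealous",)),
    (("VP",), ("V", "NP")),
    (("V",), ("hit",)),
    (("V",), ("bit",)),
]


def derives(form, productions):
    """Return every form w such that form => w in exactly one step."""
    results = set()
    for left, right in productions:
        k = len(left)
        for i in range(len(form) - k + 1):
            if form[i:i + k] == left:          # form = alpha + left + beta
                results.add(form[:i] + right + form[i + k:])
    return results


def is_derivation(steps, productions):
    """Check that each listed form derives the next one in one step."""
    forms = [tuple(s.split()) for s in steps]
    return all(b in derives(a, productions) for a, b in zip(forms, forms[1:]))


steps = [
    "Sentence",
    "NP VP",
    "Adj NP VP",
    "the NP VP",
    "the Adj NP VP",
    "the jealous NP VP",
    "the jealous N VP",
    "the jealous girl VP",
    "the jealous girl V NP",
    "the jealous girl hit NP",
    "the jealous girl hit Adj NP",
    "the jealous girl hit the NP",
    "the jealous girl hit the N",
    "the jealous girl hit the boy",
]
print(is_derivation(steps, PRODUCTIONS))   # True
```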
Grammars
Definition: A grammar is a 4-tuple G = (Φ, Σ, P, S)
where
Φ is a finite set of nonterminals,
Σ is a finite set of terminals,
V = Φ ∪ Σ is the grammar’s vocabulary,
S ∈ Φ is called the start or goal symbol,
and P ⊆ V* × V* is a finite set of productions.
Example: Grammar for {aⁿbⁿ | n ≥ 0}.
G = (Φ, Σ, P, S), where
Φ = {S},
Σ = {a, b},
and P = {S → aSb, S → ε}
Grammars
Derivations:
S => aSb => aaSbb => aaaSbbb => aaaaSbbbb => …
Applying S → ε at any point ends the derivation:
S => ε
aSb => ab
aaSbb => aabb
aaaSbbb => aaabbb
aaaaSbbbb => aaaabbbb
Note: Normally, grammars are given by simply listing
the productions.
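The derivations of this grammar follow a fixed pattern: apply S → aSb some number of times, then S → ε once. A minimal sketch, assuming forms are Python strings with the single nonterminal written as "S" (derive is an illustrative name):

```python
def derive(n):
    """Apply S -> aSb n times, then S -> ε; return every form visited."""
    forms = ["S"]
    for _ in range(n):
        forms.append(forms[-1].replace("S", "aSb", 1))
    forms.append(forms[-1].replace("S", "", 1))
    return forms


print(" => ".join(w or "ε" for w in derive(2)))
# S => aSb => aaSbb => aabb
```

For every n the last form is aⁿbⁿ, which matches the language the grammar is meant to describe.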
Grammar Conventions
TWS convention
1. Upper case letter (identifier) – nonterminal
2. Lower case letter (string) – terminal
3. Lower case Greek letter – strings in V*
4. Left part of the first production is assumed to
be the start symbol, e.g.
S → aSb
S → ε
5. Left part omitted if same as for preceding
production, e.g.
S → aSb
  → ε
Grammars
Example: Grammar for identifiers.
Identifier → Letter
           → Identifier Letter
           → Identifier Digit
Letter     → ‘a’ → ‘A’
           → ‘b’ → ‘B’
             .
             .
           → ‘z’ → ‘Z’
Digit      → ‘0’
           → ‘1’
             .
             .
           → ‘9’
Grammars
Definition: The language generated by a
grammar G is the set L(G) = {α ∈ Σ*
| S =>* α}.
Definition: A sentential form generated
by a grammar G is any string α such
that S =>* α.
Definition: A sentence generated by a
grammar G is any sentential form α
such that α ∈ Σ*.
Grammars
Example:
sentential forms:
S => aSb => aaSbb => aaaSbbb => aaaaSbbbb => …
sentences (each obtained from a form above by S → ε):
S => ε, aSb => ab, aaSbb => aabb, aaaSbbb => aaabbb, aaaaSbbbb => aaaabbbb
Lemma: L(G) = {α | α is a sentence}
Proof: Trivial.
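The difference between sentential forms and sentences can be seen by searching the derivation graph. This is a sketch only, assuming forms are Python strings, productions are (left, right) pairs, and forms longer than a chosen bound are discarded so that the search stops; all names are illustrative.

```python
from collections import deque

PRODUCTIONS = [("S", "aSb"), ("S", "")]   # S -> aSb, S -> ε
NONTERMINALS = {"S"}


def sentential_forms(start, productions, max_len):
    """Breadth-first search over =>, keeping forms of at most max_len symbols."""
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        for left, right in productions:
            i = form.find(left)
            while i != -1:                      # rewrite each occurrence of left
                new = form[:i] + right + form[i + len(left):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(left, i + 1)
    return seen


def is_sentence(form):
    """A sentence is a sentential form with no nonterminals left."""
    return not any(ch in NONTERMINALS for ch in form)


forms = sentential_forms("S", PRODUCTIONS, 5)
sentences = {w for w in forms if is_sentence(w)}
print(sorted(forms))       # ['', 'S', 'aSb', 'aaSbb', 'aabb', 'ab']
print(sorted(sentences))   # ['', 'aabb', 'ab']
```

Every sentence is also a sentential form; the sentences are exactly the forms that lie in Σ*.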
Grammars
Example: A → aABC
→ aBC
aB → ab
bB → bb
bC → bc
CB → BC
cC → cc
Grammars
Derivations:
A => aBC => abC => abc
A => aABC => aaBCBC => aabCBC => aabBCC => aabbCC => aabbcC => aabbcc
A => aABC => aaABCBC => aaaBCBCBC =>(2) aaaBBCBCC => aaaBBBCCC
  => aaabBBCCC =>(2) aaabbbCCC => aaabbbcCC =>(2) aaabbbccc
A => aABC => aaABCBC => …
(“=>(2)” marks two applications of the same production.)
L(G) = {aⁿbⁿcⁿ | n ≥ 1}
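The claim about L(G) can be checked for short strings by exhaustive search. The sketch below assumes forms are Python strings with upper-case nonterminals and lower-case terminals, and it prunes forms longer than a bound; that pruning is safe for this grammar because no production makes a form shorter. The function name sentences_upto is illustrative.

```python
from collections import deque

PRODUCTIONS = [
    ("A", "aABC"), ("A", "aBC"),
    ("aB", "ab"), ("bB", "bb"), ("bC", "bc"),
    ("CB", "BC"), ("cC", "cc"),
]


def sentences_upto(start, productions, max_len):
    """Terminal strings of length <= max_len derivable from start."""
    # Length pruning below is only valid for non-shrinking productions.
    assert all(len(left) <= len(right) for left, right in productions)
    seen = {start}
    queue = deque([start])
    while queue:
        form = queue.popleft()
        for left, right in productions:
            i = form.find(left)
            while i != -1:                      # rewrite each occurrence of left
                new = form[:i] + right + form[i + len(left):]
                if len(new) <= max_len and new not in seen:
                    seen.add(new)
                    queue.append(new)
                i = form.find(left, i + 1)
    # Sentences contain terminals (lower-case letters) only.
    return {w for w in seen if w.islower()}


print(sorted(sentences_upto("A", PRODUCTIONS, 9)))
# ['aaabbbccc', 'aabbcc', 'abc']
```

Forms such as aabcBC are reached by the search but lead nowhere, since no production rewrites cB; they never become sentences.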
The Chomsky Hierarchy
A hierarchy of grammars, the languages
they generate, and the machines that
accept those languages.
The Chomsky Hierarchy
Type  Language Name               Grammar Name                    Restrictions on grammar                  Accepting Machine
0     Recursively Enumerable      Unrestricted re-writing system  None                                     Turing Machine
1     Context-Sensitive Language  Context-Sensitive Grammar       For all α → β, |α| ≤ |β|                 Linear Bounded Automaton
2     Context-Free Language       Context-Free Grammar            For all α → β, α ∈ Φ                     Push-Down Automaton (parser)
3     Regular Language            Regular Grammar                 For all α → β, α ∈ Φ, β ∈ Σ ∪ ΣΦ ∪ {ε}   Finite-State Automaton
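The restrictions in the table can be turned into a simple check on a grammar's productions. This is a sketch under stated assumptions: single-character symbols, productions as (left, right) pairs of Python strings, and the regular restriction taken in right-linear form (right-part is ε, one terminal, or one terminal followed by one nonterminal). The function grammar_type is an illustrative name, and it classifies the grammar as written, not the language it generates.

```python
def grammar_type(productions, nonterminals, terminals):
    """Return the highest type number whose restriction all productions meet."""

    def is_regular(left, right):
        # Assumed right-linear shape: A -> ε, A -> a, or A -> aB.
        return left in nonterminals and (
            right == ""
            or right in terminals
            or (len(right) == 2
                and right[0] in terminals
                and right[1] in nonterminals)
        )

    def is_context_free(left, right):
        return left in nonterminals          # left-part is one nonterminal

    def is_context_sensitive(left, right):
        return len(left) <= len(right)       # |α| <= |β|

    checks = ((3, is_regular), (2, is_context_free), (1, is_context_sensitive))
    for type_number, ok in checks:
        if all(ok(left, right) for left, right in productions):
            return type_number
    return 0


an = [("S", "aS"), ("S", "")]
anbn = [("S", "aSb"), ("S", "")]
anbncn = [("A", "aABC"), ("A", "aBC"), ("aB", "ab"), ("bB", "bb"),
          ("bC", "bc"), ("CB", "BC"), ("cC", "cc")]

print(grammar_type(an, {"S"}, {"a"}))                           # 3
print(grammar_type(anbn, {"S"}, {"a", "b"}))                    # 2
print(grammar_type(anbncn, {"A", "B", "C"}, {"a", "b", "c"}))   # 1
```

A language may have grammars of several types; its place in the hierarchy is given by the most restricted grammar that generates it.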
Language Hierarchy
0: Recursively Enumerable Languages
1: Context-Sensitive Languages, e.g. {aⁿbⁿcⁿ | n > 0}
2: Context-free Languages, e.g. {aⁿbⁿ | n > 0}
3: Regular Languages, e.g. {aⁿ | n > 0}
English?
We will deal with type 2 (syntax) and
type 3 (lexicon) languages.