Formal languages - SisInf Lab

Formal Languages
and Compilers
Master’s Degree Course in
Computer Engineering
A.Y. 2016/2017
FORMAL LANGUAGES AND COMPILERS
Formal languages
Floriano Scioscia
DEI – Politecnico di Bari
Translator
• A translator is a program which automatically translates from one
language to another.
• Languages used in computing are formal.
• Given a sentence s1 in the formal language L1 (source language),
the translator constructs a sentence s2 in the formal language L2
(target or sink language).
• Sentence s2 must “correspond” to s1.
Formal languages - Floriano Scioscia
Formal language
• In mathematics, logic, linguistics and computer science, a formal
language is a set of finite-length strings constructed over a finite
alphabet, that is over a finite set of simple objects named characters,
symbols or letters.
Alphabet
• An alphabet is a finite set of elements, named symbols or terminal
characters.
Examples:
{a, b, c}
{0, 1}
{α, β, γ, δ}
• The cardinality of an alphabet is the number of symbols in it.
• If Σ denotes an alphabet, then |Σ| denotes its cardinality.
Examples:
|{a, b, c}| = 3
|{0, 1}| = 2
|{α, β, γ, δ}| = 4
Strings
• A string or word s over an alphabet is a sequence (or list) of symbols
belonging to the alphabet.
– Examples:
aabb, cac, cba, abba are strings over the alphabet {a, b, c}
binary numbers are strings over the alphabet {0, 1}
– Two words differing only in the order of their symbols are different: aabb and
abba are different words.
– Two words are equal if and only if they have the same length and their characters coincide, position by position, when read in left-to-right order.
• The length of a string s, denoted with |s|, is the number of its
characters.
– Examples: |aabb| = 4
|cac| = 3
|101011| = 6
– Equal strings have the same length (the converse is not true in general).
• The empty string ε (sometimes denoted with λ) is the string which
does not contain any symbol.
– The length of the empty string is zero: |ε| = 0
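The definitions of alphabet, string, length and empty string can be tried out with a minimal Python sketch. It assumes that an alphabet is modelled as a set of one-character strings and a word as a Python str; the names SIGMA, EPSILON and is_over are hypothetical and chosen only for illustration.

```python
# Assumption: an alphabet is a set of one-character strings, a word is a str.
SIGMA = {"a", "b", "c"}
EPSILON = ""  # the empty string


def is_over(word: str, alphabet: set[str]) -> bool:
    """Return True if every symbol of word belongs to alphabet."""
    return all(symbol in alphabet for symbol in word)


print(len(SIGMA))                 # cardinality of the alphabet
print(is_over("abba", SIGMA))     # abba is a string over {a, b, c}
print(is_over("abd", SIGMA))      # d does not belong to the alphabet
print(len("aabb"), len(EPSILON))  # string lengths
print("aabb" == "abba")           # same symbols, different order
```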
Language
• A language over an alphabet is a set of strings over that alphabet.
Examples:
– {aabb, cac, cba, abba} is a language over the alphabet {a, b, c}
– the set of binary numbers is a language over the alphabet {0, 1}
– the set of palindrome strings containing only the symbols a, b, c is a language over
the alphabet {a, b, c}
Note that the first and third examples share the same alphabet. In general,
many different languages, finite or infinite, can be defined over a given alphabet.
• Ø, the empty set, is a language.
• {ε} is the language containing only the empty string.
• The set of all possible C programs is a language.
• The set of all possible identifiers in a programming language is a
language.
Cardinality of a language
• The cardinality of a language is the number of its strings.
• If L denotes a language, then |L| denotes its cardinality. Examples:
|{aabb, cac, cba, abba}| = 4
|the set of numbers in the binary system| = ∞
• A language is finite if its cardinality is finite.
• A language is infinite if its cardinality is infinite.
• The empty language (denoted with Ø) is the language containing no
strings.
|Ø| = 0
Warning:
Ø ≠ {ε}, since |Ø| = 0 ≠ |{ε}| = 1
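The difference between the empty language and the language containing only the empty string is easy to check in code. The sketch below assumes that a finite language is modelled as a Python set of strings; the variable names are hypothetical.

```python
# Assumption: a finite language is a set of str.
L = {"aabb", "cac", "cba", "abba"}
EMPTY_LANGUAGE: set[str] = set()  # Ø contains no strings
ONLY_EPSILON = {""}               # {ε} contains one string, the empty one

print(len(L), len(EMPTY_LANGUAGE), len(ONLY_EPSILON))
print(EMPTY_LANGUAGE == ONLY_EPSILON)  # the two languages differ
```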
Operations on strings:
concatenation (1/3)
• The concatenation of 2 strings is the string composed of all the
symbols of the first string followed by all the ones of the second string.
x.y or xy denotes the concatenation of strings x and y
If x = a1 … ah and y = b1 … bk then xy = a1 … ah b1 … bk
Examples:
nano.technology = nanotechnology
tele.vision = television
• Concatenation is not commutative: in general x.y ≠ y.x
vision.tele = visiontele ≠ television
• Concatenation is associative: x.(y.z) = (x.y).z
nano.(techno.logy) = nano.technology = nanotechnology
(nano.techno).logy = (nanotechno).logy = nanotechnology
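As a quick check of these properties, the following sketch uses Python's + operator on str as the concatenation operation (an assumption of this illustration, not part of the course material).

```python
# Python's + on str plays the role of concatenation.
x, y, z = "nano", "techno", "logy"

print(x + y + z)                               # nanotechnology
print((x + y) + z == x + (y + z))              # associativity holds
print("tele" + "vision" == "vision" + "tele")  # commutativity fails here
```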
Operations on strings:
concatenation (2/3)
• The length of the concatenation of two strings is the sum of the
lengths of the strings: |x.y| = |x| + |y|
|television| = 10 = 4 + 6 = |tele| + |vision|
• The empty string is the identity element of concatenation: εx = x = xε
• String y is a substring of string x if there exist strings u, v such that
x = uyv
• String y is a prefix of string x if there exists a string u such that
x = yu
• String y is a suffix of string x if there exists a string u such that x = uy
• A substring (resp. prefix, suffix) of a string is proper if it does not
coincide with the given string.
• If |x| ≥ k we denote with k:x the prefix of x with length k (start of
length k of x).
Operations on strings:
concatenation (3/3)
Examples:
• The substrings of abbc are {ε, a, b, c, ab, bb, bc, abb, bbc, abbc}
• The proper substrings of abbc are {ε, a, b, c, ab, bb, bc, abb, bbc}
• The prefixes of abbc are {ε, a, ab, abb, abbc}
• The proper prefixes of abbc are {ε, a, ab, abb}
• The suffixes of abbc are {ε, c, bc, bbc, abbc}
• The proper suffixes of abbc are {ε, c, bc, bbc}
• 2:abbc = ab
• 3:abbc = abb
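The examples above can be reproduced with a short Python sketch. The function names (prefixes, suffixes, substrings, proper, start) are hypothetical helpers introduced for illustration; strings are modelled as str and sets of strings as Python sets.

```python
def prefixes(x: str) -> set[str]:
    """All y such that x = yu for some string u."""
    return {x[:k] for k in range(len(x) + 1)}


def suffixes(x: str) -> set[str]:
    """All y such that x = uy for some string u."""
    return {x[k:] for k in range(len(x) + 1)}


def substrings(x: str) -> set[str]:
    """All y such that x = uyv for some strings u, v."""
    return {x[i:j] for i in range(len(x) + 1) for j in range(i, len(x) + 1)}


def proper(parts: set[str], x: str) -> set[str]:
    """Drop the string x itself from a set of its substrings."""
    return parts - {x}


def start(k: int, x: str) -> str:
    """k:x, the prefix of x with length k (requires 0 <= k <= |x|)."""
    if not 0 <= k <= len(x):
        raise ValueError("k must satisfy 0 <= k <= |x|")
    return x[:k]


print(sorted(substrings("abbc"), key=lambda s: (len(s), s)))
print(start(2, "abbc"), start(3, "abbc"))
```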
Operations on strings:
reversal
• The reverse of a string is the string obtained by writing the characters
in inverse order.
• x^R denotes the reverse of string x
• (a1 … ah)^R = ah … a1
• (abbc)^R = cbba
• Reversal is an involution: (x^R)^R = x
• The reverse of the concatenation of two strings is the concatenation of
their reverses in inverse order: (xy)^R = y^R x^R
• The reverse of the empty string is the empty string: ε^R = ε
• Reversal has precedence over concatenation: abbc^R = abb(c^R) = abbc
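The properties of reversal can be verified on sample strings with a small sketch. The helper name reverse is hypothetical; it relies on Python slice notation with a negative step.

```python
def reverse(x: str) -> str:
    """Return the string x written in inverse order."""
    return x[::-1]


x, y = "abbc", "ca"
print(reverse(x))                                # cbba
print(reverse(reverse(x)) == x)                  # reversing twice gives x back
print(reverse(x + y) == reverse(y) + reverse(x)) # reverse of a concatenation
print("abb" + reverse("c"))                      # reversal applied to c only
```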
Operations on strings:
exponentiation
• The m-th power of string x (notation: x^m) is the concatenation of
m copies of x.
x^m = ε if m = 0, otherwise x^(m-1) x
• Examples:
(abbc)^3 = abbcabbcabbc
(abbc)^6 = abbcabbcabbcabbcabbcabbc
• Exponentiation has precedence over concatenation: abbc^3 = abb(c^3) = abbccc
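The recursive definition of the power of a string translates almost literally into code. This is a sketch with a hypothetical function name, power; in everyday Python the built-in expression x * m gives the same result.

```python
def power(x: str, m: int) -> str:
    """m-th power of x: the empty string if m = 0, else power(x, m - 1) followed by x."""
    if m < 0:
        raise ValueError("the exponent must be non-negative")
    return "" if m == 0 else power(x, m - 1) + x


print(power("abbc", 3))        # abbcabbcabbc
print("abb" + power("c", 3))   # exponent applied to c only: abbccc
```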
Operations on languages
(1/5)
• Languages are sets!
• The union L1 ∪ L2 of languages L1 and L2 is the set of strings in L1 or
in L2
L1 ∪ L2 = {x | x ∈ L1 ∨ x ∈ L2}
• The intersection L1 ∩ L2 of languages L1 and L2 is the set of strings in
both L1 and L2
L1 ∩ L2 = {x | x ∈ L1 ∧ x ∈ L2}
• The difference L1 ∖ L2 of language L1 minus language L2 is the set of
strings in L1 which are not in L2
L1 ∖ L2 = {x | x ∈ L1 ∧ x ∉ L2}
Examples:
{ab, abc} ∪ {ab, aa, cb} = {ab, abc, aa, cb}
{ab, abc} ∩ {ab, aa, cb} = {ab}
{ab, abc} ∖ {ab, aa, cb} = {abc}
{ab, aa, cb} ∖ {ab, abc} = {aa, cb}
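Since languages are sets, the examples above map directly onto Python's built-in set operators. The sketch assumes finite languages modelled as sets of str.

```python
# Finite languages as Python sets of strings.
L1 = {"ab", "abc"}
L2 = {"ab", "aa", "cb"}

print(L1 | L2)  # union
print(L1 & L2)  # intersection
print(L1 - L2)  # difference L1 minus L2
print(L2 - L1)  # difference L2 minus L1
```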
Operations on languages
(2/5)
• Inclusion: language L1 is included in language L2 (notation: L1 ⊆ L2)
if all the strings in L1 are in L2
L1 ⊆ L2 ⇔ ∀x ∈ L1 : x ∈ L2
• The language L1 is properly included in language L2 (notation: L1 ⊂
L2) if all the strings in L1 are in L2 and at least one string in L2 is not in
L1
L1 ⊂ L2 ⇔ (∀x ∈ L1 : x ∈ L2) ∧ (∃y ∈ L2 . y ∉ L1)
• Two languages are equal if they contain the same set of strings
L1 = L2 ⇔ L1 ⊆ L2 ∧ L2 ⊆ L1
• Examples:
L1 ⊆ L1 ∪ L2
L1 ∩ L2 ⊆ L1
L1 ∖ L2 ⊆ L1
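Inclusion and proper inclusion can be checked on finite languages with Python's set comparison operators (<= for inclusion, < for proper inclusion). The sample languages below are chosen only for illustration.

```python
L1 = {"ab", "abc"}
L2 = {"ab", "aa", "cb"}

print(L1 <= L1 | L2)  # L1 is included in the union
print(L1 & L2 <= L1)  # the intersection is included in L1
print(L1 - L2 <= L1)  # the difference is included in L1
print({"ab"} < L1)    # proper inclusion
print(L1 < L1)        # a language is not properly included in itself

# Equality as mutual inclusion.
A, B = {"ab", "abc"}, {"abc", "ab"}
print(A <= B and B <= A)
```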
Operations on languages
(3/5)
• The reverse of language L (notation: L^R) is the set of the reverses of
the strings in L
L^R = {x^R | x ∈ L}
{ab, abc}^R = {ba, cba}
• The concatenation of languages L1 and L2 (notation: L1L2) is the set
obtained by concatenating in all possible ways the strings in L1 with
the strings in L2
L1L2 = {xy | x ∈ L1 ∧ y ∈ L2}
• Examples:
{ab, abc}{ab, aa, cb} = {abab, abaa, abcb, abcab, abcaa, abccb}
LØ = Ø = ØL
L{ε} = L = {ε}L
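For finite languages, reversal and concatenation can be sketched with set comprehensions. The function names reverse_language and concat are hypothetical.

```python
def reverse_language(L: set[str]) -> set[str]:
    """Set of the reverses of the strings in L."""
    return {x[::-1] for x in L}


def concat(L1: set[str], L2: set[str]) -> set[str]:
    """All strings xy with x in L1 and y in L2."""
    return {x + y for x in L1 for y in L2}


L = {"ab", "abc"}
print(reverse_language(L))
print(concat(L, {"ab", "aa", "cb"}))
print(concat(L, set()))  # concatenation with the empty language
print(concat(L, {""}))   # concatenation with the language {ε}
```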
Operations on languages
(4/5)
• The m-th power of language L (notation: L^m) is the concatenation of L
with itself m times.
L^m = {ε} if m = 0, otherwise L^(m-1) L
{ab, abc}^2 = {abab, ababc, abcab, abcabc}
Ø^0 = {ε}
• The Kleene closure of the language L (notation: L^*) is the union of all
powers of L
L^* = ⋃ (h = 0, 1, 2, …) L^h = {ε} ∪ L^1 ∪ L^2 ∪ …
{ab, abc}^* = {ε, ab, abc, abab, ababc, abcab, abcabc, …}
• Properties
– L ⊆ L^* (monotonicity)
– (x ∈ L^*) ∧ (y ∈ L^*) ⇒ xy ∈ L^* (closure with respect to concatenation)
– (L^*)^* = L^* (idempotence)
– (L^*)^R = (L^R)^* (commutativity of closure and reversal)
– Ø^* = {ε}
– {ε}^* = {ε}
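The Kleene closure is in general infinite, so a program can only compute a finite approximation. The sketch below is one possible approach, assuming finite languages as sets of str: it unions the powers of L up to a chosen exponent. The names language_power and kleene_star_up_to are hypothetical.

```python
def concat(L1: set[str], L2: set[str]) -> set[str]:
    """All strings xy with x in L1 and y in L2."""
    return {x + y for x in L1 for y in L2}


def language_power(L: set[str], m: int) -> set[str]:
    """m-th power of L: {ε} for m = 0, otherwise the (m-1)-th power concatenated with L."""
    if m < 0:
        raise ValueError("the exponent must be non-negative")
    result = {""}
    for _ in range(m):
        result = concat(result, L)
    return result


def kleene_star_up_to(L: set[str], max_power: int) -> set[str]:
    """Finite approximation of the Kleene closure: union of the powers 0..max_power."""
    result: set[str] = set()
    for h in range(max_power + 1):
        result |= language_power(L, h)
    return result


print(language_power({"ab", "abc"}, 2))
print(kleene_star_up_to({"ab", "abc"}, 2))
print(kleene_star_up_to(set(), 5))  # closure of the empty language
```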
Operations on languages
(5/5)
• The positive closure of language L (notation: L^+) is the union of all
positive powers of L
L^+ = ⋃ (h = 1, 2, …) L^h = L^1 ∪ L^2 ∪ …
{ab, abc}^+ = {ab, abc, abab, ababc, abcab, abcabc, . . .}
L^* = L^+ ∪ {ε}
L^+ = L L^* = L^* L
• The (right) quotient of languages L1 and L2 (notation: L1 / L2) is the
set of prefixes which produce strings in L1 when concatenated with
strings in L2
L1 / L2 = {x | ∃y ∈ L2 : xy ∈ L1}
{ab, abc}/{bc} = {a}
{bc}/{ab, abc} = Ø
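For finite languages the right quotient can be computed directly from its definition: for every string w in L1 and every y in L2 that is a suffix of w, keep what remains of w once y is removed. The function name right_quotient is hypothetical.

```python
def right_quotient(L1: set[str], L2: set[str]) -> set[str]:
    """All x such that xy is in L1 for some y in L2."""
    return {
        w[: len(w) - len(y)]  # strip the suffix y from w
        for w in L1
        for y in L2
        if w.endswith(y)
    }


print(right_quotient({"ab", "abc"}, {"bc"}))
print(right_quotient({"bc"}, {"ab", "abc"}))
```

The slice uses len(w) - len(y) rather than a negative index so that the case y = ε leaves w unchanged.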
Linguistic universe
(free monoid)
• The linguistic universe (or free monoid) of an alphabet Σ is the
(infinite) set of all strings over the alphabet.
• It can be defined as the limit of the exponentiation function, that is the
Kleene closure Σ^* of the alphabet
{a, b}^* = {ε, a, b, aa, bb, ab, ba, . . .}
• Every language over an alphabet Σ is included in Σ^*
• The complement of a language L (notation: ¬L) over an alphabet Σ
is the difference between Σ^* and L
¬L = Σ^* ∖ L
¬{ab, ba} = {ε, a, b, aa, bb, aaa, …} (over the alphabet {a, b})
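Both the linguistic universe and the complement of a finite language are infinite, so the sketch below only lists their strings up to a chosen maximum length. It uses itertools.product from the standard library; the names universe_up_to and complement_up_to are hypothetical.

```python
from itertools import product


def universe_up_to(alphabet: set[str], max_len: int) -> set[str]:
    """All strings over alphabet whose length is at most max_len."""
    return {
        "".join(symbols)
        for n in range(max_len + 1)
        for symbols in product(sorted(alphabet), repeat=n)
    }


def complement_up_to(L: set[str], alphabet: set[str], max_len: int) -> set[str]:
    """Strings of length at most max_len that are over alphabet but not in L."""
    return universe_up_to(alphabet, max_len) - L


print(sorted(universe_up_to({"a", "b"}, 2), key=lambda s: (len(s), s)))
print(sorted(complement_up_to({"ab", "ba"}, {"a", "b"}, 2), key=lambda s: (len(s), s)))
```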
Basics of formal language
theory
• Formal language theory studies the sets of strings, i.e. the
sentences of a language, in order to establish their correctness and
meaning.
• Specifying a language completely and rigorously is no easy task. One
cannot, in general, list all valid sentences, since they may be infinitely many and
of a priori unbounded length.
• One must use an algorithm which makes it possible to produce all the (possibly
infinite) sentences of the language, or to check their correctness,
through a finite set of rules.
The specification of a
language
• Generative approach
– We use an enumeration algorithm, which can produce all the (possibly
infinite) sentences of the language through a finite set of computation
rules.
– The rules of the enumeration algorithm compose the so-called generative
grammar/generative syntax of the language.
• Recognition approach
– We use an algorithm which recognizes whether a sentence is correct or
not, and determines its meaning.
– In practice, rather than an algorithm, a more abstract description is
preferred, based on the notion of recognizing automaton.
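The two approaches can be contrasted on a toy language. The sketch below assumes, purely as an illustration not taken from the course material, the language of strings made of n symbols a followed by n symbols b (n ≥ 0): an enumerator produces its sentences one after another, while a recognizer decides membership of a given string. Both function names are hypothetical.

```python
from itertools import count, islice
from typing import Iterator


def enumerate_anbn() -> Iterator[str]:
    """Generative view: produce every sentence of the language, one at a time."""
    for n in count(0):
        yield "a" * n + "b" * n


def recognize_anbn(sentence: str) -> bool:
    """Recognition view: decide whether sentence belongs to the language."""
    n, remainder = divmod(len(sentence), 2)
    return remainder == 0 and sentence == "a" * n + "b" * n


print(list(islice(enumerate_anbn(), 4)))  # first four sentences
print(recognize_anbn("aabb"), recognize_anbn("abab"))
```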
Other approaches to
language specification
• The above-mentioned approaches are not the only ones investigated in formal
language theory.
• Generative approach: a language is defined as the set of all and only the
strings produced by a generative grammar or by another rewriting system.
This approach is usually the most frequently used one in the manual or in
documents which describe the language.
• Recognition approach: a language is defined as the set of all and only the
strings accepted by an automaton. It is the analytical approach used to
describe a compiler (or an interpreter) for the language.
• Denotational approach: a language is defined through compact symbolic
expressions, such as regular expressions, which denote (all and only) the
strings in a concise form.
• Algebraic approach: a language is defined through its algebraic properties.
• Transformational approach: a language is defined as the result of the
transformation of another (usually simpler) language.
Focus in this course
• Within the realm of the generative approach, based on the definition
of grammars, we will limit ourselves to a simple and widespread grammar
class: context-free grammars (type 2 in Chomsky’s hierarchy), and to
regular expressions, which describe a subclass of the context-free languages (the regular languages).
• Within the realm of the recognition approach, based on
recognizing automata or analyzers, we will focus on finite-state and
push-down automata.
• Afterwards we will highlight that the two approaches are
interchangeable.
Generative vs recognition
approach
• The two approaches are dual and equivalent: for the language classes considered in this course it is always
possible to switch from an enumeration algorithm (i.e., a grammar) to
a recognition one (i.e., an automaton), in a fully mechanical way.
• The rules which allow a grammar to enumerate the sentences of a
language can allow an automaton to recognize the membership of a
sentence in a language.
• Syntax analyzer design is a clear example of that.
From correctness check to
sentence translation
• In addition to the primary demand of recognizing whether a sentence
is correct, a further goal is to translate (transform) the sentence, like a
compiler (or an interpreter) does when it converts a program from a
high-level language to a processor machine language.
• A translation is a mapping (more specifically, a function) from
sentences in the source language to ones in the target language.
• Two approaches exist for translation, too:
– The generative one exploits syntax schemes to generate source-target pairs of
sentences, which correspond in the translation; a syntax scheme is actually a
coupling of two generative grammars
– The recognition (translation) one exploits translating automata, which
differ from recognizing ones in their ability to emit the desired translation
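To give a feel for what a translating automaton does, here is a minimal sketch of a finite-state transducer. The example is an assumption made for illustration, not a construction from the course: it reads a binary string and, for each input symbol, emits the parity of the number of 1s read so far. The table TRANSITIONS, the state names and the function translate are all hypothetical.

```python
# (state, input symbol) -> (next state, emitted symbol)
TRANSITIONS = {
    ("even", "0"): ("even", "0"),
    ("even", "1"): ("odd", "1"),
    ("odd", "0"): ("odd", "1"),
    ("odd", "1"): ("even", "0"),
}


def translate(source: str, start: str = "even") -> str:
    """Map a source sentence to a target sentence, one symbol at a time."""
    state = start
    output = []
    for symbol in source:
        if (state, symbol) not in TRANSITIONS:
            # The sentence is not in the source language: no translation exists.
            raise ValueError(f"no transition from {state!r} on {symbol!r}")
        state, emitted = TRANSITIONS[(state, symbol)]
        output.append(emitted)
    return "".join(output)


print(translate("1101"))
```

Unlike a pure recognizer, which would only answer yes or no, this automaton also builds the target sentence while it scans the source one.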
Translation via syntactic or
semantic methods
• As in linguistics, in artificial languages too the distinction between
syntax and semantics is not clear-cut, and sometimes it is arbitrary.
• In linguistics the two words are used to distinguish between structure
and content, but on closer examination this distinction becomes
elusive.
• To simplify, the difference between the two methods lies in:
– formalism: syntax uses elements (alphabets, character strings, …) and operations
(concatenation, iteration, substitution, homomorphism, …), while semantics takes
into account also numbers, arithmetic operations, propositional and predicate
calculus;
– the higher complexity of semantic algorithms: the languages we focus on are
almost only the ones which can be recognized and translated in linear time (that is,
proportional to the length of the considered sentence); this is a huge simplification,
excluding cases which would require at least quadratic complexity.