CS2 Language Processing note 1 Languages and finite automata

CS2 Language Processing note 1
CS2Ah 1.10.2004
This thread of the course is concerned with formal languages, that is, artificial
languages defined by formal rules. Programming languages are, of course, prime
examples of such languages.
The thread is divided into two halves.
The first set of 6 lectures (by Don Sannella) studies certain particularly simple
formal languages: the so-called regular languages. Regular languages and the tools
associated with them (finite automata, regular expressions, ...) have numerous
applications in computer science.
The second set of 5 lectures (by John Longley) will consider more complex
languages (e.g. context-free languages) of the sort actually used for programming
languages. They will cover the formal grammars used to specify such languages,
and techniques used for parsing languages (as used by compilers).
The lecture notes for this thread were originally created by Stuart Anderson,
and have been subsequently modified by Vivek Gore, Martin Grohe, Don Sannella, Alex Simpson, Ian Stark and Colin Stirling.
Finite automata
Rather than starting directly with languages, we begin instead by looking at
finite state machines, as introduced in CS1Ah. That these have a connection
with formal languages should already be familiar to you from CS1Ah, where you
saw how a finite state machine can sometimes be used to recognise a language.
In fact, we shall define the regular languages to be exactly those languages that
can be recognised by finite state machines. However, saying this is jumping
ahead to the next lecture. In this lecture, we review the notion of a finite state
machine, and some of the different applications to which finite state machines
can be put. However, we shall use the term finite automaton (plural automata)
instead of finite state machine. This may seem pretentious, but it is standard
terminology, and fits in well with more specialised terminology we shall need
later.
In CS1Ah you were introduced to finite automata as simple models of reactive
systems. Such systems receive inputs from the outside world and react to these
inputs in a specified way.
In their basic form, finite automata are used to describe the states of a
system and the transitions that are possible between these states.
Often, we want to use automata to describe systems that not only change
their state depending on their input, but also produce an output. Finite
automata that produce an output are called transducers. The Parking Ticket
Machine of [CS1Ah, LN12] (see footnote 1) and the Cruise Control of [CS1Ah, LN15] are
examples.
We have also seen automata that accept certain sequences of inputs (and
consequently reject all other sequences of inputs). Such automata are called
acceptors. The automaton in Example 1.1 is an acceptor; more examples
can be found in [CS1Ah, LN12] and [CS1Ah, LN13].
Example 1.1: A combination lock. We want to describe a lock that receives
input from a numeric keypad (with keys 0, 1, ..., 9) and opens whenever the
last 4 keys pressed have been 1102. An automaton modelling such a lock is
displayed in Figure 1. It has one accepting state ‘open’ which is reached whenever
the last four keys pressed have been 1102. The automaton has 4 intermediate
states which are reached depending on how many of the correct keys have been
pressed: If the last 3 keys pressed have been 110, then the automaton is in state
‘110’. If the last 2 keys pressed have been 11, then the automaton is in state
‘11’. If the last key pressed has been 1, and the last two keys have not been 11,
then the automaton is in state ‘1’. Otherwise, it is in state ‘none’.
[Figure 1: Finite automaton for combination lock. The states are ‘none’, ‘1’, ‘11’, ‘110’ and ‘open’; the arrows are labelled with single keys or key ranges such as 0,2-9.]
Footnote 1: [CS1Ah, LN12] means “Lecture Note 12 of CS1Ah”.
Languages
In the following lectures, we will use finite automata as language acceptors. In
order to do this (and in order to handle the other types of language considered
later in the thread), it is necessary to have a general notion of language.
Here are two examples. The Java programming language can be considered
as the set of all strings over the alphabet of ASCII characters that represent a
syntactically correct Java program. The English language may be considered as
the set of all strings over the usual alphabet (perhaps with punctuation symbols
included) that represent grammatically correct sentences in English. These are
both rather complex examples (and the latter is even poorly specified, as no
two speakers of English are likely to agree entirely on its grammar). In the
forthcoming lecture notes, we shall encounter many much simpler examples of
languages.
To refer to languages concisely, it is useful to develop a convenient notation
system.
In general we shall use the Greek letter $\Sigma$ (capital “sigma”) to denote a finite
alphabet of symbols from which the language will be built.
Natural examples of alphabets are the set $\{\mathtt{a}, \mathtt{b}, \ldots, \mathtt{z}\}$ of all letters of the
English alphabet, the set of all ASCII characters, or just $\{0, 1\}$, the alphabet
consisting just of the two letters ‘0’ and ‘1’, which is the alphabet of binary
numbers or bit strings. In our examples, we shall often use small, more
artificial alphabets such as $\{\mathtt{a}, \mathtt{b}\}$ or $\{\mathtt{a}, \mathtt{b}, \mathtt{c}\}$.
For any finite alphabet $\Sigma$ we write $\Sigma^*$ for the set of all finite strings
(sequences or words) of members of $\Sigma$. The set $\Sigma^*$ contains a special element,
the empty string, for which we use the symbol $\epsilon$, the Greek letter “epsilon”,
rather than the notation "" often used in programming languages.
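For example, if $\Sigma = \{\mathtt{a}, \mathtt{b}\}$, then
$$\Sigma^* = \{\epsilon, \mathtt{a}, \mathtt{b}, \mathtt{aa}, \mathtt{ab}, \mathtt{ba}, \mathtt{bb}, \mathtt{aaa}, \mathtt{aab}, \mathtt{aba}, \ldots\}.$$
For two strings $u$ and $v$ in $\Sigma^*$, $uv$ denotes their concatenation. For example,
if $u = \mathtt{abab}$ and $v = \mathtt{baa}$, then $uv = \mathtt{ababbaa}$. In particular, for every string
$u \in \Sigma^*$ we have $u\epsilon = \epsilon u = u$.
A prefix of a string $u$ is an initial substring of $u$, that is, a string $v$ for which there exists a string $w$ such that $u = vw$.
For example, if $u = \mathtt{abbc}$, then the prefixes of $u$ are
$$\epsilon,\ \mathtt{a},\ \mathtt{ab},\ \mathtt{abb},\ \mathtt{abbc}.$$
Similarly, a suffix of a string $u$ is a string $v$ for which there exists a string $w$
such that $u = wv$.
For example, the suffixes of abbc are
$$\mathtt{abbc},\ \mathtt{bbc},\ \mathtt{bc},\ \mathtt{c},\ \epsilon.$$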
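The definitions of prefix and suffix can be checked mechanically. The following is a minimal Python sketch, not part of the original notes: Python strings stand in for strings over the alphabet, the empty Python string "" plays the role of the empty string, and the helper names prefixes and suffixes are chosen here purely for illustration.

```python
def prefixes(u: str) -> list[str]:
    """All strings v such that u = v + w for some string w, shortest first."""
    return [u[:i] for i in range(len(u) + 1)]


def suffixes(u: str) -> list[str]:
    """All strings v such that u = w + v for some string w, longest first."""
    return [u[i:] for i in range(len(u) + 1)]


print(prefixes("abbc"))  # ['', 'a', 'ab', 'abb', 'abbc']
print(suffixes("abbc"))  # ['abbc', 'bbc', 'bc', 'c', '']
```

Note that the empty string and the whole string always appear in both lists, so a string of length n has n + 1 prefixes and n + 1 suffixes.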
The length of a string $u$ is denoted $|u|$. For example, $|\epsilon| = 0$, $|\mathtt{a}| = 1$, $|\mathtt{abba}| = 4$.
Note that for all strings $u, v \in \Sigma^*$ we have
$$|uv| = |u| + |v|.$$
For a string $u \in \Sigma^*$ and a natural number $n$ we use $u^n$ to denote the string
$uu \cdots u$ where $u$ is concatenated $n$ times with itself. Thus $u^0 = \epsilon$, $u^1 = u$
and
$$u^{n+1} = u^n u = u\,u^n.$$
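As a quick sanity check of the length and power notation, here is a small Python sketch. It is an illustration only: the names EPSILON and power are assumptions introduced here, and Python's built-in len and string concatenation stand in for string length and concatenation.

```python
EPSILON = ""  # the empty string


def power(u: str, n: int) -> str:
    """u concatenated with itself n times: power(u, 0) is the empty string,
    and each further step appends one more copy of u."""
    result = EPSILON
    for _ in range(n):
        result = result + u
    return result


u, v = "abab", "baa"
print(len(EPSILON), len("a"), len("abba"))  # 0 1 4
print(len(u + v) == len(u) + len(v))        # True
print(power("ab", 3))                       # ababab
print(power("ab", 0) == EPSILON)            # True
```

Python also offers the built-in shorthand "ab" * 3 for the same repeated concatenation.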
For any alphabet $\Sigma$, a language (over $\Sigma$) is a subset $L$ of $\Sigma^*$.
For example, a language over $\{\mathtt{a}, \mathtt{b}\}$ is any set of strings of as and bs,
such as the set
$$L_1 = \{\epsilon, \mathtt{ab}, \mathtt{abba}\}$$
(consisting of the strings $\epsilon$, ab, and abba). Unlike $L_1$, most interesting
languages consist of infinitely many strings. For example, the set $L_2$ of all
strings over the alphabet $\{\mathtt{a}, \mathtt{b}\}$ that start with an a is a language over the
alphabet $\{\mathtt{a}, \mathtt{b}\}$ that consists of infinitely many strings. Formally, we may
specify $L_2$ as follows:
$$L_2 = \{\, \mathtt{a}w \mid w \in \{\mathtt{a}, \mathtt{b}\}^* \,\}.$$
Be careful not to confuse the empty language $\emptyset$ with the language $\{\epsilon\}$ that
just consists of the empty string.
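To make these notions concrete, the following Python sketch (an illustration under assumptions, not part of the notes) enumerates the strings over an alphabet in order of length, represents a finite language as a Python set, and represents the infinite language of strings over a and b that start with an a by a membership test. The names all_strings and starts_with_a are hypothetical helpers chosen for this example.

```python
from itertools import count, islice, product


def all_strings(alphabet):
    """Yield every finite string over `alphabet`, shortest first."""
    for n in count(0):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)


def starts_with_a(w: str) -> bool:
    """Membership test for the language of strings over {a, b} starting with a."""
    return w.startswith("a") and set(w) <= {"a", "b"}


FINITE_LANGUAGE = {"", "ab", "abba"}  # three strings, one of them empty
EMPTY_LANGUAGE = set()                # contains no strings at all
ONLY_EPSILON = {""}                   # contains exactly one string

first = list(islice(all_strings("ab"), 7))
print(first)                                   # ['', 'a', 'b', 'aa', 'ab', 'ba', 'bb']
print([w for w in first if starts_with_a(w)])  # ['a', 'aa', 'ab']
print(len(EMPTY_LANGUAGE), len(ONLY_EPSILON))  # 0 1
```

An infinite language cannot be stored as a set of strings, which is why it is represented here by a predicate. Finite automata, introduced next, are one systematic way of describing such predicates.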
Note that we generally use the letters a, b, c to denote individual symbols and $u$,
$v$, and $w$ to denote strings of symbols. A single symbol like a also doubles as a
string of length 1.
Finite Automata as Language Acceptors
Suppose that $M$ is an acceptor automaton whose inputs are letters from an
alphabet $\Sigma$. Then sequences of inputs for $M$ are strings from $\Sigma^*$, some of which
are accepted by $M$ and some of which are rejected. We think of the automaton
as reading the input string letter by letter and changing its states accordingly.
After the whole string is read, it is accepted or rejected, depending on whether
the automaton is in an accepting state or not.
Now suppose $L$ is a language over a finite alphabet $\Sigma$. We say that $M$ recognises (or accepts) $L$ if $M$ accepts precisely the strings that are contained in $L$,
that is, if the following equivalence holds for all strings $w \in \Sigma^*$:
$$w \in L \iff M \text{ accepts } w.$$
Example 1.2: Binary strings with an even number of zeros. We consider
the language $L$ over the alphabet $\{0, 1\}$ that consists of all strings with an even
number of 0s. Figure 2 shows an automaton recognising this language. To
see how this automaton works, we simply observe that after reading a prefix
$u \in \{0, 1\}^*$ of the input string, the automaton is in state ‘even’ if $u$ contains an
even number of 0s and in state ‘odd’ otherwise.
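[Figure 2: An automaton recognising strings with an even number of 0s. It has two states, ‘even’ and ‘odd’; each 0 moves the automaton to the other state, and each 1 leaves the state unchanged.]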
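The behaviour of this two-state automaton can be traced in a few lines of Python. This is a minimal sketch for illustration: the function name has_even_zeros is an assumption introduced here, and the choice to raise an error on symbols other than 0 and 1 is a design decision of the sketch, not something the notes specify.

```python
def has_even_zeros(w: str) -> bool:
    """Run the two-state automaton of Example 1.2 on the bit string w."""
    state = "even"  # before reading anything, zero 0s have been seen
    for symbol in w:
        if symbol == "0":
            # Reading a 0 flips the parity of the number of 0s seen so far.
            state = "odd" if state == "even" else "even"
        elif symbol != "1":
            # Reading a 1 leaves the state unchanged; anything else is not
            # in the alphabet {0, 1}.
            raise ValueError(f"not a binary symbol: {symbol!r}")
    return state == "even"  # 'even' is the only accepting state


print(has_even_zeros(""))      # True
print(has_even_zeros("1001"))  # True
print(has_even_zeros("10"))    # False
```

The loop maintains exactly the invariant stated in the example: after each prefix, the variable state records whether that prefix contains an even or odd number of 0s.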
Example 1.1 revisited. The finite automaton for the combination lock in
Example 1.1 recognises the language $L$ consisting of all words over $\{0, \ldots, 9\}$ that end
with the four “letters” 1102. Formally, we may specify this language by
$$L = \{\, w\,1102 \mid w \in \{0, \ldots, 9\}^* \,\}$$
or
$$L = \{\, w \in \{0, \ldots, 9\}^* \mid 1102 \text{ is a suffix of } w \,\}.$$
Deterministic finite automata
We now give a formal presentation of a special class of finite automata: deterministic finite automata. In the next lecture, we shall use these to define the notion
of a regular language. Deterministic finite automata are characterised by the
following two features:
there is a unique starting state; and
from every state there is exactly one transition for each possible input symbol.
Systems described by such automata are completely predictable. Given some
input string, the sequence of states that is visited is completely determined.
We now give a formal definition of deterministic finite automata.
Definition 1.3. A deterministic finite automaton (or DFA) is a tuple
$$M = (Q, \Sigma, q_0, F, \delta)$$
consisting of:
1. a finite set $Q$ of states,
2. a finite alphabet $\Sigma$,
3. a distinguished starting state $q_0 \in Q$,
4. a set $F \subseteq Q$ of final states (the ones that indicate acceptance), and
5. a description $\delta$ of all the possible transitions.
In order for the automaton to be deterministic, $\delta$ must be given by a table
that answers the following question: “Given a state $q$ and an input symbol $a$,
what is the next state?” There must be an answer to this no matter what $q$
and $a$ are. (In mathematical terms $\delta$ is a function from the set $Q \times \Sigma$ to the
set $Q$ and is usually referred to as the transition function.)
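The five components of a DFA translate directly into a small data structure. The sketch below is one possible Python rendering, offered as an illustration rather than a definitive implementation: the class name DFA, its field names, and the decision to store the transition function as a dictionary keyed by (state, symbol) pairs are all assumptions made here.

```python
from dataclasses import dataclass


@dataclass
class DFA:
    states: frozenset    # the finite set of states
    alphabet: frozenset  # the finite alphabet
    start: str           # the starting state
    finals: frozenset    # the final (accepting) states
    delta: dict          # transition table: (state, symbol) -> next state

    def __post_init__(self):
        # Determinism: there must be exactly one entry for every
        # combination of a state and an input symbol.
        expected = {(q, a) for q in self.states for a in self.alphabet}
        if set(self.delta) != expected:
            raise ValueError("delta must be defined exactly on states x alphabet")
        if self.start not in self.states or not self.finals <= self.states:
            raise ValueError("start must be a state and finals a subset of states")
        if not set(self.delta.values()) <= self.states:
            raise ValueError("delta must map into the set of states")

    def accepts(self, w) -> bool:
        """Read w symbol by symbol and report whether the run ends in a final state."""
        state = self.start
        for a in w:
            state = self.delta[(state, a)]
        return state in self.finals


even_zeros = DFA(
    states=frozenset({"even", "odd"}),
    alphabet=frozenset({"0", "1"}),
    start="even",
    finals=frozenset({"even"}),
    delta={
        ("even", "0"): "odd", ("even", "1"): "even",
        ("odd", "0"): "even", ("odd", "1"): "odd",
    },
)
print(even_zeros.accepts("1001"))  # True
print(even_zeros.accepts("10"))    # False
```

Because the table has exactly one entry per (state, symbol) pair, the sequence of states visited on any input string is completely determined, which is the defining property of a deterministic automaton.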
Example 1.2 revisited. The finite automaton of Example 1.2 is a DFA formally
specified by
$$(\{\text{even}, \text{odd}\},\ \{0, 1\},\ \text{even},\ \{\text{even}\},\ \delta)$$
where the transition function $\delta$ is given by the following table:

  δ     | 0     | 1
  ------+-------+------
  even  | odd   | even
  odd   | even  | odd
Example 1.1 re-revisited. The finite automaton of Example 1.1 is a DFA
described by
$$(\{\text{none}, 1, 11, 110, \text{open}\},\ \{0, \ldots, 9\},\ \text{none},\ \{\text{open}\},\ \delta)$$
where the transition function $\delta$ is given by the following table:

  δ     | 0     | 1    | 2     | 3-9
  ------+-------+------+-------+------
  none  | none  | 1    | none  | none
  1     | none  | 11   | none  | none
  11    | 110   | 11   | none  | none
  110   | none  | 1    | open  | none
  open  | none  | 1    | none  | none
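The transition table for the combination lock can be checked against the language description "all strings ending in 1102" by brute force. The following Python sketch does this for every key sequence of length at most 5. The names DIGITS, STATES, DELTA and lock_opens are assumptions of this illustration, and the table is built by first sending every key back to ‘none’ and then filling in the exceptions.

```python
from itertools import product

DIGITS = "0123456789"
STATES = ["none", "1", "11", "110", "open"]

# Default: every key leads back to 'none'. The entries that differ from
# this default are then overwritten one by one.
DELTA = {(q, d): "none" for q in STATES for d in DIGITS}
for q in ("none", "110", "open"):
    DELTA[(q, "1")] = "1"
DELTA[("1", "1")] = "11"
DELTA[("11", "1")] = "11"
DELTA[("11", "0")] = "110"
DELTA[("110", "2")] = "open"


def lock_opens(keys: str) -> bool:
    """Run the lock automaton on a sequence of key presses."""
    state = "none"
    for k in keys:
        state = DELTA[(state, k)]
    return state == "open"


# Compare the automaton with the suffix test on all strings up to length 5.
for n in range(6):
    for t in product(DIGITS, repeat=n):
        w = "".join(t)
        assert lock_opens(w) == w.endswith("1102")

print(lock_opens("91102"), lock_opens("11102"), lock_opens("1120"))  # True True False
```

A finite check like this is evidence rather than proof, but it does catch a mistyped table entry quickly.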
References
A basic reference for the material of Lectures 1–6 of the Language Processing
Thread is Chapter 10 of Foundations of Computer Science (C Edition) by A. V. Aho
and J. D. Ullman, Computer Science Press, 1995.
More advanced material can be found in Chapters 2 and 3 of Introduction to
Automata Theory, Languages, and Computation (2nd Edition) by J. E. Hopcroft,
R. Motwani, and J. D. Ullman, Addison-Wesley, 2001.
Exercises
1. Design a DFA that recognises the language $L$ consisting of all strings over
$\{0, 1\}$ with an even number of 0s and an odd number of 1s.
2. Design a DFA that recognises the language $L$ consisting of all strings over
$\{\mathtt{a}, \mathtt{b}, \mathtt{c}\}$ starting with an a and ending either with an a or a c.
3. We want to modify the combination lock of Example 1.1 in such a way
that several people can open it, each of them using his or her own number.
Say, there are three people who want to be able to open the lock using
the numbers 1102, 0102, 1110, respectively. Design a DFA describing a
combination lock that opens whenever the last four keys pressed are either
1102, 0102, or 1110.
More formally, this means that you are supposed to design a DFA that
recognises the language $L$ consisting of all strings over $\{0, \ldots, 9\}$ whose
last four letters are either 1102, or 0102, or 1110.
Don Sannella