Lecture Notes on
Regular Languages
and Finite Automata
for Part IA of the Computer Science Tripos
Prof. Andrew M. Pitts
Cambridge University Computer Laboratory
© A. M. Pitts, 1998-2002
First Edition 1998.
Revised 1999, 2000, 2001, 2002.
Contents
Learning Guide
1 Regular Expressions
1.1 Alphabets, strings, and languages
1.2 Pattern matching
1.3 Some questions about languages
1.4 Exercises
2 Finite State Machines
2.1 Finite automata
2.2 Determinism, non-determinism, and ε-transitions
2.3 A subset construction
2.4 Summary
2.5 Exercises
3 Regular Languages, I
3.1 Finite automata from regular expressions
3.2 Decidability of matching
3.3 Exercises
4 Regular Languages, II
4.1 Regular expressions from finite automata
4.2 An example
4.3 Complement and intersection of regular languages
4.4 Exercises
5 The Pumping Lemma
5.1 Proving the Pumping Lemma
5.2 Using the Pumping Lemma
5.3 Decidability of language equivalence
5.4 Exercises
6 Grammars
6.1 Context-free grammars
6.2 Backus-Naur Form
6.3 Regular grammars
6.4 Exercises
Lectures Appraisal Form
Learning Guide
The notes are designed to accompany six lectures on regular languages and finite automata
for Part IA of the Cambridge University Computer Science Tripos. The aim of this short
course will be to introduce the mathematical formalisms of finite state machines, regular
expressions and grammars, and to explain their applications to computer languages. As such,
it covers some basic theoretical material which Every Computer Scientist Should Know.
Direct applications of the course material occur in the various CST courses on compilers.
Further and related developments will be found in the CST Part IB courses Computation
Theory and Semantics of Programming Languages and the CST Part II course Topics in
Concurrency.
This course contains the kind of material that is best learned through practice. The books
mentioned below contain a large number of problems of varying degrees of difficulty, and
some contain solutions to selected problems. A few exercises are given at the end of each
section of these notes and relevant past Tripos questions are indicated there. At the end
of the course students should be able to explain how to convert between the three ways of
representing regular sets of strings introduced in the course; and be able to carry out such
conversions by hand for simple cases. They should also be able to prove whether or not a
given set of strings is regular.
Recommended books Textbooks which cover the material in this course also tend to
cover the material you will meet in the CST Part IB courses on Computation Theory and
Complexity Theory, and the theory underlying parsing in various courses on compilers.
There is a large number of such books. Three recommended ones are listed below.
D. C. Kozen, Automata and Computability (Springer-Verlag, New York, 1997).
T. A. Sudkamp, Languages and Machines (Addison-Wesley Publishing Company,
Inc., 1988).
J. E. Hopcroft and J. D. Ullman, Introduction to Automata Theory, Languages and
Computation (Addison-Wesley Publishing Company, Inc., 1979).
Note The material in these notes has been drawn from several different sources, including
the books mentioned above and previous versions of this course by the author and by others.
Any errors are of course all the author’s own work. A list of corrections will be available
from the course web page (follow links from www.cl.cam.ac.uk/Teaching/). A lecture(r)
appraisal form is included at the end of the notes. Please take time to fill it in and return it.
Alternatively, fill out an electronic version of the form via the URL www.cl.cam.ac.uk/cgibin/lr/login.
Andrew Pitts
[email protected]
1 Regular Expressions
Doubtless you have used pattern matching in the command-line shells of various operating
systems (Slide 1) and in the search facilities of text editors. Another important example of
the same kind is the ‘lexical analysis’ phase in a compiler during which the text of a program
is divided up into the allowed tokens of the programming language. The algorithms which
implement such pattern-matching operations make use of the notion of a finite automaton
(which is Greeklish for finite state machine). This course reveals (some of!) the beautiful
theory of finite automata (yes, that is the plural of ‘automaton’) and their use for recognising
when a particular string matches a particular pattern.
Pattern matching
What happens if, at a Unix/Linux shell prompt, you type
[...]
and press return?
Suppose the current directory contains files called [...], and (strangely) [...]. What happens if you type
[...]
and press return?
Slide 1
1.1 Alphabets, strings, and languages
The purpose of Section 1 is to introduce a particular language for patterns, called regular
expressions, and to formulate some important problems to do with pattern-matching which
will be solved in the subsequent sections. But first, here is some notation and terminology to
do with character strings that we will be using throughout the course.
Alphabets
An alphabet is specified by giving a finite set, $\Sigma$, whose elements are called symbols. For us, any set qualifies as a possible alphabet, so long as it is finite.
Examples:
$\Sigma_1 = \{0, 1, 2, 3, 4, 5, 6, 7, 8, 9\}$: 10-element set of decimal digits.
$\Sigma_2 = \{a, b, c, \ldots, x, y, z\}$: 26-element set of lower-case characters of the English language.
$\Sigma_3 = \{S \mid S \subseteq \Sigma_1\}$: $2^{10}$-element set of all subsets of the alphabet of decimal digits.
Non-example:
$\mathbb{N} = \{0, 1, 2, 3, \ldots\}$: the set of all non-negative whole numbers is not an alphabet, because it is infinite.
Slide 2
Strings over an alphabet
A string of length $n$ ($\geq 0$) over an alphabet $\Sigma$ is just an ordered $n$-tuple of elements of $\Sigma$, written without punctuation.
Example: if $\Sigma = \{a, b, c\}$, then $a$, $ab$, $aac$, and $bbac$ are strings over $\Sigma$ of lengths one, two, three and four respectively.
$\Sigma^* \stackrel{\mathrm{def}}{=}$ set of all strings over $\Sigma$ of any finite length.
N.B. there is a unique string of length zero over $\Sigma$, called the null string (or empty string) and denoted $\varepsilon$ (no matter which $\Sigma$ we are talking about).
Slide 3
Concatenation of strings
The concatenation of two strings $u, v \in \Sigma^*$ is the string $uv$ obtained by joining the strings end-to-end.
Examples: If $u = ab$, $v = ra$ and $w = cad$, then $vu = raab$, $uu = abab$ and $wv = cadra$.
This generalises to the concatenation of three or more strings. E.g. $uvwuv = abracadabra$.
Slide 4
Slides 2 and 3 define the notions of an alphabet $\Sigma$, and the set $\Sigma^*$ of finite strings over an alphabet. The length of a string $u$ will be denoted by $\mathit{length}(u)$. Slide 4 defines the operation of concatenation of strings. We make no notational distinction between a symbol $a \in \Sigma$ and the corresponding string of length one over $\Sigma$: so $\Sigma$ can be regarded as a subset of $\Sigma^*$. Note that $\Sigma^*$ is never empty: it always contains the null string, $\varepsilon$, the unique string of length zero. Note also that for any $u, v, w \in \Sigma^*$
\[ u\varepsilon = u = \varepsilon u \quad\text{and}\quad (uv)w = uvw = u(vw) \]
and $\mathit{length}(uv) = \mathit{length}(u) + \mathit{length}(v)$.
Example 1.1.1. Examples of $\Sigma^*$ for different $\Sigma$:
(i) If $\Sigma = \{a\}$, then $\Sigma^*$ contains
\[ \varepsilon, a, aa, aaa, aaaa, \ldots \]
(ii) If $\Sigma = \{a, b\}$, then $\Sigma^*$ contains
\[ \varepsilon, a, b, aa, ab, ba, bb, aaa, aab, aba, abb, baa, bab, bba, bbb, \ldots \]
(iii) If $\Sigma = \emptyset$ (the empty set, the unique set with no elements), then $\Sigma^* = \{\varepsilon\}$, the set just containing the null string.
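As an informal illustration of Example 1.1.1, the following Python sketch enumerates the strings over an alphabet in order of increasing length. The function name `strings_over` is our own choice for this sketch, and Python's empty string `""` plays the role of the null string.

```python
from itertools import count, islice, product

def strings_over(alphabet):
    """Yield every string over `alphabet`, shortest first."""
    if not alphabet:
        # Over the empty alphabet the null string is the only string.
        yield ""
        return
    for n in count(0):  # n = 0, 1, 2, ... is the length of the strings
        for symbols in product(alphabet, repeat=n):
            yield "".join(symbols)

# The first few strings over the alphabet {a, b}, as in Example 1.1.1(ii).
first = list(islice(strings_over("ab"), 7))

# Concatenation is Python's +, and length is additive.
u, v = "ab", "ra"
assert u + "" == u == "" + u
assert len(u + v) == len(u) + len(v)
```

Since the set of all strings is infinite whenever the alphabet is non-empty, the generator never terminates by itself; `islice` is used to take a finite prefix of the enumeration.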
1.2 Pattern matching
Slide 5 defines the patterns, or regular expressions, over an alphabet $\Sigma$ that we will use. Each such regular expression, $r$, represents a whole set (possibly an infinite set) of strings in $\Sigma^*$ that match $r$. The precise definition of this matching relation is given on Slide 6. It might seem odd to include a regular expression $\emptyset$ that is matched by no strings at all, but it is technically convenient to do so. Note that the regular expression $\varepsilon$ is in fact equivalent to $\emptyset^*$, in the sense that a string $u$ matches $\emptyset^*$ iff it matches $\varepsilon$ (iff $u = \varepsilon$).
Regular expressions over an alphabet $\Sigma$
• each symbol $a \in \Sigma$ is a regular expression
• $\varepsilon$ is a regular expression
• $\emptyset$ is a regular expression
• if $r$ and $s$ are regular expressions, then so is $(r|s)$
• if $r$ and $s$ are regular expressions, then so is $rs$
• if $r$ is a regular expression, then so is $(r)^*$
Every regular expression is built up inductively, by finitely many applications of the above rules.
(N.B. we assume $\varepsilon$, $\emptyset$, $($, $)$, $|$, and $*$ are not symbols in $\Sigma$.)
Slide 5
Remark 1.2.1 (Binding precedence in regular expressions). In the definition on Slide 5 we assume implicitly that the alphabet $\Sigma$ does not contain the six symbols
\[ \varepsilon \quad \emptyset \quad ( \quad ) \quad | \quad * \]
Then, concretely speaking, the regular expressions over $\Sigma$ form a certain set of strings over the alphabet obtained by adding these six symbols to $\Sigma$. However it makes things more readable if we adopt a slightly more abstract syntax, dropping as many brackets as possible and using the convention that
\[ -^{*} \text{ binds more tightly than } -\,-, \text{ which binds more tightly than } -|-. \]
So, for example, $r|st^*$ means $(r|s(t)^*)$, not $(r|s)(t)^*$, or $((r|st))^*$, etc.
Matching strings to regular expressions
• $u$ matches $a \in \Sigma$ iff $u = a$
• $u$ matches $\varepsilon$ iff $u = \varepsilon$
• no string matches $\emptyset$
• $u$ matches $r|s$ iff $u$ matches either $r$ or $s$
• $u$ matches $rs$ iff it can be expressed as the concatenation of two strings, $u = vw$, with $v$ matching $r$ and $w$ matching $s$
• $u$ matches $r^*$ iff either $u = \varepsilon$, or $u$ matches $r$, or $u$ can be expressed as the concatenation of two or more strings, each of which matches $r$
Slide 6
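The clauses of Slide 6 can be transcribed almost literally into a (very inefficient) recursive program. The following Python sketch is only an illustration: the class names `Sym`, `Eps`, `Empty`, `Alt`, `Cat` and `Star` and the function `matches` are our own, hypothetical names for the six forms of regular expression on Slide 5, and the method is brute-force search over all ways of splitting the string, not the automaton-based algorithm developed later in the course.

```python
from dataclasses import dataclass

class Regex:
    """Base class for the six forms of regular expression on Slide 5."""

@dataclass(frozen=True)
class Sym(Regex):      # a single symbol a
    a: str

@dataclass(frozen=True)
class Eps(Regex):      # the null-string pattern
    pass

@dataclass(frozen=True)
class Empty(Regex):    # the pattern matched by no string
    pass

@dataclass(frozen=True)
class Alt(Regex):      # r|s
    r: Regex
    s: Regex

@dataclass(frozen=True)
class Cat(Regex):      # rs
    r: Regex
    s: Regex

@dataclass(frozen=True)
class Star(Regex):     # r*
    r: Regex

def matches(u: str, r: Regex) -> bool:
    """Decide whether the string u matches r, clause by clause."""
    if isinstance(r, Sym):
        return u == r.a
    if isinstance(r, Eps):
        return u == ""
    if isinstance(r, Empty):
        return False
    if isinstance(r, Alt):
        return matches(u, r.r) or matches(u, r.s)
    if isinstance(r, Cat):
        # Try every way of writing u = vw.
        return any(matches(u[:i], r.r) and matches(u[i:], r.s)
                   for i in range(len(u) + 1))
    if isinstance(r, Star):
        if u == "":
            return True
        # Split off a non-empty first piece matching r; the rest must match r*.
        return any(matches(u[:i], r.r) and matches(u[i:], r)
                   for i in range(1, len(u) + 1))
    raise TypeError(f"not a regular expression: {r!r}")
```

In the `Star` clause the first piece is required to be non-empty. This does not change which strings match (empty pieces can always be dropped from a concatenation), but it guarantees that the recursion terminates.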
The definition of ‘$u$ matches $r^*$’ on Slide 6 is equivalent to saying
for some $n \geq 0$, $u$ can be expressed as a concatenation of $n$ strings, $u = u_1 u_2 \ldots u_n$, where each $u_i$ matches $r$.
The case $n = 0$ just means that $u = \varepsilon$ (so $\varepsilon$ always matches $r^*$); and the case $n = 1$ just means that $u$ matches $r$ (so any string matching $r$ also matches $r^*$). For example, if $\Sigma = \{a, b, c\}$ and $r = ab$, then the strings matching $r^*$ are
\[ \varepsilon, ab, abab, ababab, \text{ etc.} \]
Note that we didn’t include a regular expression for the ‘$*$’ occurring in the UNIX examples on Slide 1. However, once we know which alphabet we are referring to, $\Sigma = \{a_1, a_2, \ldots, a_n\}$ say, we can get the effect of $*$ using the regular expression
\[ (a_1|a_2|\ldots|a_n)^* \]
which is indeed matched by any string in $\Sigma^*$ (because $a_1|a_2|\ldots|a_n$ is matched by any symbol in $\Sigma$).
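To see the last remark in action, here is a small Python sketch that builds the pattern $(a_1|a_2|\ldots|a_n)^*$ for a given finite alphabet and tests it with the standard `re` module. The helper name `any_string_pattern` is our own, and Python's pattern syntax is richer than that of Slide 5, which is why each symbol is passed through `re.escape`.

```python
import re

def any_string_pattern(alphabet):
    """Build the pattern (a1|a2|...|an)* for a finite set of one-character symbols."""
    return "(" + "|".join(re.escape(a) for a in sorted(alphabet)) + ")*"

pattern = any_string_pattern({"a", "b", "c"})

# re.fullmatch asks whether the WHOLE string matches, as in Slide 6.
assert re.fullmatch(pattern, "abca") is not None
assert re.fullmatch(pattern, "abd") is None   # d is not in the alphabet
```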
Examples of matching, with $\Sigma = \{0, 1\}$
• $0|1$ is matched by each symbol in $\Sigma$
• $1(0|1)^*$ is matched by any string in $\Sigma^*$ that starts with a ‘$1$’
• $((0|1)(0|1))^*$ is matched by any string of even length in $\Sigma^*$
• $(0|1)^*(0|1)^*$ is matched by any string in $\Sigma^*$
• $(\varepsilon|0)(\varepsilon|1)|11$ is matched by just the strings $\varepsilon$, $0$, $1$, $01$, and $11$
• $\emptyset 1|0$ is just matched by $0$
Slide 7
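Most of the examples on Slide 7 can be checked mechanically on short strings. The sketch below uses Python's `re.fullmatch`; the helper names `binary_strings` and `language` are our own. Two caveats: Python has no direct counterpart of the pattern $\emptyset$, so the last example is omitted, and $\varepsilon$ is written as an empty alternative, as in `(|0)`.

```python
import re
from itertools import product

def binary_strings(max_len):
    """All strings over {0, 1} of length at most max_len."""
    return ["".join(p) for n in range(max_len + 1) for p in product("01", repeat=n)]

def language(pattern, max_len=4):
    """The strings of length at most max_len that match the whole pattern."""
    return {u for u in binary_strings(max_len) if re.fullmatch(pattern, u)}

short = binary_strings(4)
assert language("0|1") == {"0", "1"}
assert language("1(0|1)*") == {u for u in short if u.startswith("1")}
assert language("((0|1)(0|1))*") == {u for u in short if len(u) % 2 == 0}
assert language("(0|1)*(0|1)*") == set(short)
assert language("(|0)(|1)|11") == {"", "0", "1", "01", "11"}
```

Checking strings up to a fixed length is of course only a sanity test, not a proof that the stated descriptions are correct for all strings.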
Notation 1.2.2. The notation $r + s$ is quite often used for what we write as $r|s$.
The notation $r^n$, for $n \geq 0$, is an abbreviation for the regular expression obtained by concatenating $n$ copies of $r$. Thus:
\[ r^0 \stackrel{\mathrm{def}}{=} \varepsilon, \qquad r^{n+1} \stackrel{\mathrm{def}}{=} r(r^n). \]
Thus $u$ matches $r^*$ iff $u$ matches $r^n$ for some $n \geq 0$.
We use $r^+$ as an abbreviation for $rr^*$. Thus $u$ matches $r^+$ iff it can be expressed as the concatenation of one or more strings, each one matching $r$.
1.3 Some questions about languages
Slide 8 defines the notion of a formal language over an alphabet. We take a very extensional
view of language: a formal language is completely determined by the ‘words in the
dictionary’, rather than by any grammatical rules. Slide 9 gives some important questions
about languages, regular expressions, and the matching relation between strings and regular
expressions.
Languages
A (formal) language $L$ over an alphabet $\Sigma$ is just a set of strings in $\Sigma^*$. Thus any subset $L \subseteq \Sigma^*$ determines a language over $\Sigma$.
The language determined by a regular expression $r$ over $\Sigma$ is
\[ L(r) \stackrel{\mathrm{def}}{=} \{u \in \Sigma^* \mid u \text{ matches } r\}. \]
Two regular expressions $r$ and $s$ (over the same alphabet) are equivalent iff $L(r)$ and $L(s)$ are equal sets (i.e. have exactly the same members).
Slide 8
Some questions
(a) Is there an algorithm which, given a string $u$ and a regular expression $r$ (over the same alphabet), computes whether or not $u$ matches $r$?
(b) In formulating the definition of regular expressions, have we missed out some practically useful notions of pattern?
(c) Is there an algorithm which, given two regular expressions $r$ and $s$ (over the same alphabet), computes whether or not they are equivalent? (Cf. Slide 8.)
(d) Is every language of the form $L(r)$?
Slide 9
The answer to question (a) on Slide 9 is ‘yes’. Algorithms for deciding such pattern-matching questions make use of finite automata. We will see this during the next few sections.
If you have used the UNIX utility grep, or a text editor with good facilities for regular expression based search, like emacs, you will know that the answer to question (b) on Slide 9 is also ‘yes’: the regular expressions defined on Slide 5 leave out some forms of pattern that one sees in such applications. However, the answer to the question is also ‘no’, in the sense that (for a fixed alphabet) these extra forms of regular expression are definable, up to equivalence, from the basic forms given on Slide 5. For example, if the symbols of the alphabet are ordered in some standard way, it is common to provide a form of pattern for naming ranges of symbols: for example [a-z] might denote a pattern matching any lower-case letter. It is not hard to see how to define a regular expression (albeit a rather long one) which achieves the same effect. However, some other commonly occurring kinds of pattern are much harder to describe using the rather minimalist syntax of Slide 5. The principal example is complementation, $\sim(r)$:
\[ u \text{ matches } \sim(r) \text{ iff } u \text{ does not match } r. \]
It will be a corollary of the work we do on finite automata (and a good measure of its power) that every pattern making use of the complementation operation $\sim(-)$ can be replaced by an equivalent regular expression just making use of the operations on Slide 5. But why do we stick to the minimalist syntax of regular expressions on that slide? The answer is that it reduces the amount of work we will have to do to show that, in principle, matching strings against patterns can be decided via the use of finite automata.
The answer to question (c) on Slide 9 is ‘yes’ and once again this will be a corollary of the work we do on finite automata. (See Section 5.3.)
Finally, the answer to question (d) is easily seen to be ‘no’, provided the alphabet $\Sigma$ contains at least one symbol. For in that case $\Sigma^*$ is countably infinite; and hence the number of languages over $\Sigma$, i.e. the number of subsets of $\Sigma^*$, is uncountable. (Recall Cantor’s diagonal argument.) But since $\Sigma$ is a finite set, there are only countably many regular expressions over $\Sigma$. (Why?) So the answer to (d) is ‘no’ for cardinality reasons. However, even amongst the countably many languages that are ‘finitely describable’ (an intuitive notion that we will not formulate precisely) many are not of the form $L(r)$ for any regular expression $r$. For example, in Section 5.2 we will use the ‘Pumping Lemma’ to see that
\[ \{a^n b^n \mid n \geq 0\} \]
is not of this form.
1.4 Exercises
Exercise 1.4.1. Write down an ML data type declaration for a type constructor 'a regExp whose values correspond to the regular expressions over an alphabet 'a.
Exercise 1.4.2. Find regular expressions over $\{0, 1\}$ that determine the following languages:
(a) $\{u \mid u \text{ contains an even number of } 1\text{’s}\}$
(b) $\{u \mid u \text{ contains an odd number of } 0\text{’s}\}$
Exercise 1.4.3. For which alphabets $\Sigma$ is the set $\Sigma^*$ of all finite strings over $\Sigma$ itself an alphabet?
Exercise 1.4.4. If an alphabet $\Sigma$ contains only one symbol, what does $\Sigma^*$ look like? (Answer: see Example 1.1.1(i).) So if $\Sigma = \{\varepsilon\}$ is the set consisting of just the null string, what is $\Sigma^*$? (The answer is not $\{\varepsilon\}$.) Moral: the punctuation-free notation we use for the concatenation of two strings can sometimes be confused with the punctuation-free notation we use to denote strings of individual symbols.
Tripos questions
1999.2.1(s)
1997.2.1(q)
1996.2.1(i)
1993.5.12
2 Finite State Machines
We will be making use of mathematical models of physical systems called finite automata,
or finite state machines to recognise whether or not a string is in a particular language.
This section introduces this idea and gives the precise definition of what constitutes a finite
automaton. We look at several variations on the definition (to do with the concept of
determinism) and see that they are equivalent for the purpose of recognising whether or not
a string is in a given language.
2.1 Finite automata
Example of a finite automaton
[Diagram: transition graph on the states $q_0$, $q_1$, $q_2$, $q_3$ with edges labelled $a$ and $b$.]
States: $q_0$, $q_1$, $q_2$, $q_3$.
Input symbols: $a$, $b$.
Transitions: as indicated above.
Start state: $q_0$.
Accepting state(s): $q_3$.
Slide 10
The key features of this abstract notion of ‘machine’ are listed below and are illustrated by the example on Slide 10.
• There are only finitely many different states that a finite automaton can be in. In the example there are four states, labelled $q_0$, $q_1$, $q_2$, and $q_3$.
• We do not care at all about the internal structure of machine states. All we care about is which transitions the machine can make between the states. A symbol from some fixed alphabet $\Sigma$ is associated with each transition: we think of the elements of $\Sigma$ as input symbols. Thus all the possible transitions of the finite automaton can be specified by giving a finite graph whose vertices are the states and whose edges have both a direction and a label (drawn from $\Sigma$). In the example $\Sigma = \{a, b\}$ and the only possible transitions from state $q_1$ are
\[ q_1 \xrightarrow{b} q_0 \quad\text{and}\quad q_1 \xrightarrow{a} q_2. \]
In other words, in state $q_1$ the machine can either input the symbol $b$ and enter state $q_0$, or it can input the symbol $a$ and enter state $q_2$. (Note that transitions from a state back to the same state are allowed: e.g. $q_3 \xrightarrow{a} q_3$ in the example.)
• There is a distinguished start state [1]. In the example it is $q_0$. In the graphical representation of a finite automaton, the start state is usually indicated by means of an unlabelled arrow.
• The states are partitioned into two kinds: accepting states [2] and non-accepting states. In the graphical representation of a finite automaton, the accepting states are indicated by double circles round the name of each such state, and the non-accepting states are indicated using single circles. In the example there is only one accepting state, $q_3$; the other three states are non-accepting. (The two extreme possibilities that all states are accepting, or that no states are accepting, are allowed; it is also allowed for the start state to be accepting.)
The reason for the partitioning of the states of a finite automaton into ‘accepting’ and ‘non-accepting’ has to do with the use to which one puts finite automata, namely to recognise whether or not a string $u \in \Sigma^*$ is in a particular language (= subset of $\Sigma^*$). Given $u$ we begin in the start state of the automaton and traverse its graph of transitions, using up the symbols in $u$ in the correct order reading the string from left to right. If we can use up all the symbols in $u$ in this way and reach an accepting state, then $u$ is in the language ‘accepted’ (or ‘recognised’) by this particular automaton; otherwise $u$ is not in that language. This is summed up on Slide 11.
[1] The term initial state is a common synonym for ‘start state’.
[2] The term final state is a common synonym for ‘accepting state’.
$L(M)$, language accepted by a finite automaton $M$
consists of all strings $u$ over its alphabet of input symbols satisfying $q_0 \xrightarrow{u}{}^{*} q$ with $q_0$ the start state and $q$ some accepting state. Here
\[ q_0 \xrightarrow{u}{}^{*} q \]
means, if $u = a_1 a_2 \ldots a_n$ say, that for some states $q_1, q_2, \ldots, q_n = q$ (not necessarily all distinct) there are transitions of the form
\[ q_0 \xrightarrow{a_1} q_1 \xrightarrow{a_2} q_2 \xrightarrow{a_3} \cdots \xrightarrow{a_n} q_n = q. \]
N.B.
case $n = 0$: $q \xrightarrow{\varepsilon}{}^{*} q'$ iff $q = q'$
case $n = 1$: $q \xrightarrow{a}{}^{*} q'$ iff $q \xrightarrow{a} q'$.
Slide 11
Example 2.1.1. Let $M$ be the finite automaton pictured on Slide 10. Using the notation introduced on Slide 11 we have:
\[ q_0 \xrightarrow{aaab}{}^{*} q_3 \quad (\text{so } aaab \in L(M)) \]
\[ q_0 \xrightarrow{abaa}{}^{*} q \text{ iff } q = q_2 \quad (\text{so } abaa \notin L(M)) \]
\[ q_2 \xrightarrow{baaa}{}^{*} q \text{ iff } q = q_3 \quad (\text{no conclusion about } L(M)). \]
In fact in this case
\[ L(M) = \{u \mid u \text{ contains three consecutive } a\text{’s}\}. \]
(For $q_i$ ($i = 0, 1, 2$) corresponds to the state in the process of reading a string in which the last $i$ symbols read were all $a$’s.) So $L(M)$ coincides with the language $L(r)$ determined by the regular expression
\[ r = (a|b)^* aaa (a|b)^* \]
(cf. Slide 8).
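The calculations in Example 2.1.1 can be replayed by simulating the automaton. The Python sketch below encodes a transition table for the machine of Slide 10. Only the transitions out of $q_1$ and the loop $q_3 \xrightarrow{a} q_3$ are stated explicitly in the text; the remaining entries are our assumption, chosen to agree with the description of the states in Example 2.1.1. The names `DELTA`, `run` and `accepts` are likewise our own.

```python
from itertools import product

# Transition table (partly an assumption, see the lead-in above).
DELTA = {
    ("q0", "a"): "q1", ("q0", "b"): "q0",
    ("q1", "a"): "q2", ("q1", "b"): "q0",
    ("q2", "a"): "q3", ("q2", "b"): "q0",
    ("q3", "a"): "q3", ("q3", "b"): "q3",
}
START = "q0"
ACCEPT = {"q3"}

def run(state, u):
    """Return the state reached from `state` after reading the string u."""
    for a in u:
        state = DELTA[(state, a)]
    return state

def accepts(u):
    """u is accepted iff reading it from the start state ends in an accepting state."""
    return run(START, u) in ACCEPT

# Compare with "u contains three consecutive a's" on all strings of length <= 6.
short = ["".join(p) for n in range(7) for p in product("ab", repeat=n)]
agree = all(accepts(u) == ("aaa" in u) for u in short)
```

With this table, `run("q0", "abaa")` ends in `"q2"` and `run("q2", "baaa")` ends in `"q3"`, matching the example.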
A non-deterministic finite automaton (NFA), $M$, is specified by
• a finite set $\mathit{States}_M$ (of states)
• a finite set $\Sigma_M$ (the alphabet of input symbols)
• for each $q \in \mathit{States}_M$ and each $a \in \Sigma_M$, a subset $\Delta_M(q, a) \subseteq \mathit{States}_M$ (the set of states that can be reached from $q$ with a single transition labelled $a$)
• an element $s_M \in \mathit{States}_M$ (the start state)
• a subset $\mathit{Accept}_M \subseteq \mathit{States}_M$ (of accepting states)
Slide 12
2.2 Determinism, non-determinism, and ε-transitions
Slide 12 gives the formal definition of the notion of finite automaton. Note that the function $\Delta_M$ gives a precise way of specifying the allowed transitions of $M$, via: $q \xrightarrow{a} q'$ iff $q' \in \Delta_M(q, a)$.
The reason for the qualification ‘non-deterministic’ on Slide 12 is because in general, for each state $q \in \mathit{States}_M$ and each input symbol $a \in \Sigma_M$, we allow the possibilities that there are no, one, or many states that can be reached in a single transition labelled $a$ from $q$, corresponding to the cases that $\Delta_M(q, a)$ has no, one, or many elements. For example, if $M$ is the NFA pictured on Slide 13, then
• $\Delta_M(q_1, b) = \emptyset$, i.e. in $M$, no state can be reached from $q_1$ with a transition labelled $b$;
• $\Delta_M(q_1, a) = \{q_2\}$, i.e. in $M$, precisely one state can be reached from $q_1$ with a transition labelled $a$;
• $\Delta_M(q_0, a) = \{q_0, q_1\}$, i.e. in $M$, precisely two states can be reached from $q_0$ with a transition labelled $a$.
Example of a non-deterministic finite automaton
Input alphabet: $\{a, b\}$.
States, transitions, start state, and accepting states as shown:
[Diagram: transition graph on the states $q_0$, $q_1$, $q_2$, $q_3$ with edges labelled $a$ and $b$.]
The language accepted by this automaton is the same as for the automaton on Slide 10, namely
\[ \{u \in \{a, b\}^* \mid u \text{ contains three consecutive } a\text{’s}\}. \]
Slide 13
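One way to run an NFA on a string is to keep track of the set of all states it could possibly be in after each input symbol. The Python sketch below does this for the automaton of Slide 13. The three values of $\Delta_M$ quoted in the text are used as given; the other transitions, and the names `DELTA`, `delta` and `accepts`, are our assumptions, chosen so that the accepted language is the one stated on the slide.

```python
from itertools import product

# Delta_M as a dictionary; a missing key stands for the empty set.
DELTA = {
    ("q0", "a"): {"q0", "q1"}, ("q0", "b"): {"q0"},
    ("q1", "a"): {"q2"},
    ("q2", "a"): {"q3"},
    ("q3", "a"): {"q3"}, ("q3", "b"): {"q3"},
}
START = "q0"
ACCEPT = {"q3"}

def delta(q, a):
    """The set of states reachable from q by one transition labelled a."""
    return DELTA.get((q, a), set())

def accepts(u):
    """Track every state the NFA could be in after each symbol of u."""
    current = {START}
    for a in u:
        current = set().union(*(delta(q, a) for q in current))
    return bool(current & ACCEPT)

# Compare with "u contains three consecutive a's" on all strings of length <= 6.
short = ["".join(p) for n in range(7) for p in product("ab", repeat=n)]
agree = all(accepts(u) == ("aaa" in u) for u in short)
```

The set `current` may become empty, in which case no sequence of transitions uses up the remaining input and the string is rejected.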
When each subset $\Delta_M(q, a)$ has exactly one element we say that $M$ is deterministic. This is a particularly important case and is singled out for definition on Slide 14.
The finite automaton pictured on Slide 10 is deterministic. But note that if we took the same graph of transitions but insisted that the alphabet of input symbols was $\{a, b, c\}$ say, then we have specified an NFA not a DFA, since for example $\Delta_M(q_0, c) = \emptyset$. The moral of this is: when specifying an NFA, as well as giving the graph of state transitions, it is important to say what is the alphabet of input symbols (because some input symbols may not appear in the graph at all).
When constructing machines for matching strings with regular expressions (as we will do in Section 3) it is useful to consider finite state machines exhibiting an ‘internal’ form of non-determinism in which the machine is allowed to change state without consuming any input symbol. One calls such transitions $\varepsilon$-transitions and writes them as
\[ q \xrightarrow{\varepsilon} q'. \]
This leads to the definition on Slide 15. Note that in an NFA$^{\varepsilon}$, $M$, we always assume that $\varepsilon$ is not an element of the alphabet $\Sigma_M$ of input symbols.
A deterministic finite automaton (DFA)
is an NFA $M$ with the property that for each $q \in \mathit{States}_M$ and $a \in \Sigma_M$, the finite set $\Delta_M(q, a)$ contains exactly one element, call it $\delta_M(q, a)$.
Thus in this case transitions in $M$ are essentially specified by a next-state function, $\delta_M$, mapping each (state, input symbol)-pair $(q, a)$ to the unique state $\delta_M(q, a)$ which can be reached from $q$ by a transition labelled $a$:
\[ q \xrightarrow{a} q' \text{ iff } q' = \delta_M(q, a) \]
Slide 14
An NFA with $\varepsilon$-transitions (NFA$^{\varepsilon}$)
is specified by an NFA $M$ together with a binary relation, called the $\varepsilon$-transition relation, on the set $\mathit{States}_M$. We write $q \xrightarrow{\varepsilon} q'$ to indicate that the pair of states $(q, q')$ is in this relation.
Example (with input alphabet = $\{a, b\}$):
[Diagram: transition graph with edges labelled $a$, $b$ and $\varepsilon$.]
Slide 15
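$L(M)$, language accepted by an NFA$^{\varepsilon}$ $M$
consists of all strings $u$ over the alphabet $\Sigma_M$ of input symbols satisfying $q_0 \stackrel{u}{\Rightarrow} q$ with $q_0$ the initial state and $q$ some accepting state. Here $\cdot \stackrel{\cdot}{\Rightarrow} \cdot$ is defined by:
• $q \stackrel{\varepsilon}{\Rightarrow} q'$ iff $q = q'$ or there is a sequence $q \xrightarrow{\varepsilon} \cdots \xrightarrow{\varepsilon} q'$ of one or more $\varepsilon$-transitions in $M$ from $q$ to $q'$
• $q \stackrel{a}{\Rightarrow} q'$ (for $a \in \Sigma_M$) iff $q \stackrel{\varepsilon}{\Rightarrow} \cdot \xrightarrow{a} \cdot \stackrel{\varepsilon}{\Rightarrow} q'$
• $q \stackrel{ab}{\Rightarrow} q'$ (for $a, b \in \Sigma_M$) iff $q \stackrel{\varepsilon}{\Rightarrow} \cdot \xrightarrow{a} \cdot \stackrel{\varepsilon}{\Rightarrow} \cdot \xrightarrow{b} \cdot \stackrel{\varepsilon}{\Rightarrow} q'$
and similarly for longer strings
Slide 16
When using an NFA$^{\varepsilon}$ $M$ to accept a string $u \in \Sigma^*$ of input symbols, we are interested in sequences of transitions in which the symbols in $u$ occur in the correct order, but with zero or more $\varepsilon$-transitions before or after each one. We write
\[ q \stackrel{u}{\Rightarrow} q' \]
to indicate that such a sequence exists from state $q$ to state $q'$ in the NFA$^{\varepsilon}$. Then, by definition $u$ is accepted by the NFA$^{\varepsilon}$ $M$ iff $q_0 \stackrel{u}{\Rightarrow} q$ holds for $q_0$ the start state and $q$ some accepting state: see Slide 16. For example, for the NFA$^{\varepsilon}$ on Slide 15, it is not too hard to see that the language accepted consists of all strings which either contain two consecutive $a$’s or contain two consecutive $b$’s, i.e. the language determined by the regular expression $(a|b)^*(aa|bb)(a|b)^*$.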
2.3 A subset construction
Note that every DFA is an NFA (whose transition relation is deterministic) and that every NFA is an NFA$^{\varepsilon}$ (whose $\varepsilon$-transition relation is empty). It might seem that non-determinism and $\varepsilon$-transitions allow a greater range of languages to be characterised as recognisable by a finite automaton, but this is not so. We can use a construction, called the subset construction, to convert an NFA$^{\varepsilon}$ $M$ into a DFA $PM$ accepting the same language (at the expense of increasing the number of states, possibly exponentially). Slide 17 gives an example of this construction. The name ‘subset construction’ refers to the fact that there is one state of $PM$ for each subset of the set $\mathit{States}_M$ of states of $M$. Given two subsets $S, S' \subseteq \mathit{States}_M$, there is a transition $S \xrightarrow{a} S'$ in $PM$ just in case $S'$ consists of all the $M$-states $q'$ reachable from states $q$ in $S$ via the $\cdot \stackrel{a}{\Rightarrow} \cdot$ relation defined on Slide 16, i.e. such that we can get from $q$ to $q'$ in $M$ via finitely many $\varepsilon$-transitions followed by an $a$-transition followed by finitely many $\varepsilon$-transitions.
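Example of the subset construction
[Diagram: the NFA$^{\varepsilon}$ $M$, with states $q_0$, $q_1$, $q_2$ and transitions labelled $a$, $b$ and $\varepsilon$.]
\[
\begin{array}{l|ll}
\delta_{PM} & a & b \\ \hline
\emptyset & \emptyset & \emptyset \\
\{q_0\} & \{q_0, q_1, q_2\} & \{q_2\} \\
\{q_1\} & \{q_1\} & \emptyset \\
\{q_2\} & \emptyset & \{q_2\} \\
\{q_0, q_1\} & \{q_0, q_1, q_2\} & \{q_2\} \\
\{q_0, q_2\} & \{q_0, q_1, q_2\} & \{q_2\} \\
\{q_1, q_2\} & \{q_1\} & \{q_2\} \\
\{q_0, q_1, q_2\} & \{q_0, q_1, q_2\} & \{q_2\}
\end{array}
\]
Slide 17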
By definition, the start state of $PM$ is the subset of $\mathit{States}_M$ whose elements are the states reachable by $\varepsilon$-transitions from the start state of $M$; and a subset $S \subseteq \mathit{States}_M$ is an accepting state of $PM$ iff some accepting state of $M$ is an element of $S$. Thus in the example on Slide 17 the start state is $\{q_0, q_1, q_2\}$ and
\[ ab \in L(M) \text{ because in } M\text{: } q_0 \xrightarrow{a} q_0 \xrightarrow{\varepsilon} q_2 \xrightarrow{b} q_2 \]
\[ ab \in L(PM) \text{ because in } PM\text{: } \{q_0, q_1, q_2\} \xrightarrow{a} \{q_0, q_1, q_2\} \xrightarrow{b} \{q_2\}. \]
Indeed, in this case $L(M) = L(a^* b^*) = L(PM)$. The fact that $M$ and $PM$ accept the same language in this case is no accident, as the Theorem on Slide 18 shows. That slide also gives the definition of the subset construction in general.
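Theorem. For each NFA$^{\varepsilon}$ $M$ there is a DFA $PM$ with the same alphabet of input symbols and accepting exactly the same strings as $M$, i.e. with $L(PM) = L(M)$.
Definition of $PM$ (refer to Slide 12):
• $\mathit{States}_{PM} \stackrel{\mathrm{def}}{=} \{S \mid S \subseteq \mathit{States}_M\}$
• $\Sigma_{PM} \stackrel{\mathrm{def}}{=} \Sigma_M$
• $S \xrightarrow{a} S'$ in $PM$ iff $S' = \delta_{PM}(S, a)$, where $\delta_{PM}(S, a) \stackrel{\mathrm{def}}{=} \{q' \mid \exists q \in S\, (q \stackrel{a}{\Rightarrow} q' \text{ in } M)\}$
• $s_{PM} \stackrel{\mathrm{def}}{=} \{q \mid s_M \stackrel{\varepsilon}{\Rightarrow} q\}$
• $\mathit{Accept}_{PM} \stackrel{\mathrm{def}}{=} \{S \in \mathit{States}_{PM} \mid \exists q \in S\, (q \in \mathit{Accept}_M)\}$
Slide 18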
To prove the theorem on Slide 18, given any NFA$^{\varepsilon}$ $M$ we have to show that $L(M) = L(PM)$. We split the proof into two halves.
Proof that $L(M) \subseteq L(PM)$. Consider the case of $\varepsilon$ first: if $\varepsilon \in L(M)$, then $s_M \stackrel{\varepsilon}{\Rightarrow} q$ for some $q \in \mathit{Accept}_M$, hence $s_{PM} \in \mathit{Accept}_{PM}$ and thus $\varepsilon \in L(PM)$. Now given any non-null string $u = a_1 a_2 \ldots a_n$, if $u$ is accepted by $M$ then there is a sequence of transitions in $M$ of the form
\[ s_M \stackrel{a_1}{\Rightarrow} q_1 \stackrel{a_2}{\Rightarrow} \cdots \stackrel{a_n}{\Rightarrow} q_n \in \mathit{Accept}_M. \qquad (1) \]
Since it is deterministic, feeding $a_1 a_2 \ldots a_n$ to $PM$ results in the sequence of transitions
\[ s_{PM} \xrightarrow{a_1} S_1 \xrightarrow{a_2} \cdots \xrightarrow{a_n} S_n \qquad (2) \]
where $S_1 = \delta_{PM}(s_{PM}, a_1)$, $S_2 = \delta_{PM}(S_1, a_2)$, etc. By definition of $\delta_{PM}$ (Slide 18), from (1) we deduce
\[ q_1 \in \delta_{PM}(s_{PM}, a_1) = S_1 \]
\[ \text{so } q_2 \in \delta_{PM}(S_1, a_2) = S_2 \]
\[ \ldots \]
\[ \text{so } q_n \in \delta_{PM}(S_{n-1}, a_n) = S_n \]
and hence $S_n \in \mathit{Accept}_{PM}$ (because $q_n \in \mathit{Accept}_M$). Therefore (2) shows that $u$ is accepted by $PM$.
Proof that $L(PM) \subseteq L(M)$. Consider the case of $\varepsilon$ first: if $\varepsilon \in L(PM)$, then $s_{PM} \in \mathit{Accept}_{PM}$ and so there is some $q \in s_{PM}$ with $q \in \mathit{Accept}_M$, i.e. $s_M \stackrel{\varepsilon}{\Rightarrow} q \in \mathit{Accept}_M$ and thus $\varepsilon \in L(M)$. Now given any non-null string $u = a_1 a_2 \ldots a_n$, if $u$ is accepted by $PM$ then there is a sequence of transitions in $PM$ of the form (2) with $S_n \in \mathit{Accept}_{PM}$, i.e. with $S_n$ containing some $q_n \in \mathit{Accept}_M$. Now since $q_n \in S_n = \delta_{PM}(S_{n-1}, a_n)$, by definition of $\delta_{PM}$ there is some $q_{n-1} \in S_{n-1}$ with $q_{n-1} \stackrel{a_n}{\Rightarrow} q_n$ in $M$. Then since $q_{n-1} \in S_{n-1} = \delta_{PM}(S_{n-2}, a_{n-1})$, there is some $q_{n-2} \in S_{n-2}$ with $q_{n-2} \stackrel{a_{n-1}}{\Rightarrow} q_{n-1}$. Working backwards in this way we can build up a sequence of transitions like (1) until, at the last step, from the fact that $q_1 \in S_1 = \delta_{PM}(s_{PM}, a_1)$ we deduce that $s_M \stackrel{a_1}{\Rightarrow} q_1$. So we get a sequence of transitions (1) with $q_n \in \mathit{Accept}_M$, and hence $u$ is accepted by $M$.
2.4 Summary
The important concepts in Section 2 are those of a deterministic finite automaton (DFA) and
the language of strings that it accepts. Note that if we know that a language $L$ is of the form
$L = L(M)$ for some DFA $M$, then we have a method for deciding whether or not any given
string $u$ (over the alphabet of $L$) is in $L$: begin in the start state of $M$ and carry out
the sequence of transitions given by reading $u$ from left to right (at each step the next state is
uniquely determined because $M$ is deterministic); if the final state reached is accepting, then
$u$ is in $L$, otherwise it is not.
We also introduced other kinds of finite automata (with non-determinism and $\varepsilon$-transitions) and proved that they determine exactly the same class of languages as DFAs.
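The decision method just described can be sketched in a few lines of Python. This is only an illustration under assumed conventions: the next-state function is a dictionary keyed by (state, symbol) pairs, and the example automaton (strings over {a, b} with an even number of a's) is hypothetical, not one taken from these notes.

```python
def dfa_accepts(delta, start, accepting, u):
    """Run a DFA on the string u and report whether the final state is accepting.

    delta     -- dict mapping (state, symbol) to the unique next state
    start     -- the start state
    accepting -- set of accepting states
    """
    state = start
    for symbol in u:                      # read u from left to right
        state = delta[(state, symbol)]    # next state is uniquely determined
    return state in accepting

# Hypothetical example DFA: accepts strings with an even number of a's.
EVEN_A = {
    ("even", "a"): "odd",  ("even", "b"): "even",
    ("odd", "a"): "even",  ("odd", "b"): "odd",
}

if __name__ == "__main__":
    for u in ["", "ab", "abab", "bbb"]:
        print(repr(u), dfa_accepts(EVEN_A, "even", {"even"}, u))
```

The running time is proportional to the length of the input string, which is why reducing a matching problem to a DFA is attractive once the DFA has been built.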
2.5 Exercises
Exercise 2.5.1. For each of the two languages mentioned in Exercise 1.4.2 find a DFA that
accepts exactly that set of strings.
Exercise 2.5.2. The example of the subset construction given on Slide 17 constructs a DFA
with eight states whose language of accepted strings happens to be $L(a^*b^*)$. Give a DFA
with the same language of accepted strings, but fewer states. Give an NFA with even fewer
states that does the same job.
Exercise 2.5.3. Given a DFA $M$, construct a new DFA $M'$ with the same alphabet of input
symbols $\Sigma$ and with the property that for all $u \in \Sigma^*$, $u$ is accepted by $M'$ iff $u$ is not
accepted by $M$.
Exercise 2.5.4. Given two DFAs $M_1$, $M_2$ with the same alphabet $\Sigma$ of input symbols,
construct a third such DFA $M$ with the property that $u \in \Sigma^*$ is accepted by $M$ iff it is
accepted by both $M_1$ and $M_2$. [Hint: take the states of $M$ to be ordered pairs $(q_1, q_2)$ of
states with $q_1 \in \mathit{States}_{M_1}$ and $q_2 \in \mathit{States}_{M_2}$.]
Tripos questions
2001.2.1(d)
2000.2.1(b)
1998.2.1(s)
1995.2.19
1992.4.9(a)
3 Regular Languages, I
Slide 19 defines the notion of a regular language, which is a set of strings of the form $L(M)$
for some DFA $M$ (cf. Slides 11 and 14). The slide also gives the statement of Kleene’s
Theorem, which connects regular languages with the notion of matching strings to regular
expressions introduced in Section 1: the collection of regular languages coincides with the
collection of languages determined by matching strings with regular expressions. The aim of
this section is to prove part (a) of Kleene’s Theorem. We will tackle part (b) in Section 4.

    Definition
    A language is regular iff it is the set of strings accepted by some
    deterministic finite automaton.

    Kleene’s Theorem
    (a) For any regular expression $r$, $L(r)$ is a regular language (cf. Slide 8).
    (b) Conversely, every regular language is of the form $L(r)$ for some regular expression $r$.

    Slide 19
3.1 Finite automata from regular expressions
Given a regular expression $r$, over an alphabet $\Sigma$ say, we wish to construct a DFA $M$ with
alphabet of input symbols $\Sigma$ and with the property that for each $u \in \Sigma^*$, $u$ matches $r$ iff $u$ is
accepted by $M$, so that $L(r) = L(M)$.
Note that by the Theorem on Slide 18 it is enough to construct an NFA$^\varepsilon$ $N$ with the
property $L(N) = L(r)$. For then we can apply the subset construction to $N$ to obtain a DFA
$M = PN$ with $L(M) = L(PN) = L(N) = L(r)$. Working with finite automata that
are non-deterministic and have $\varepsilon$-transitions simplifies the construction of a suitable finite
automaton from $r$.
Let us fix on a particular alphabet $\Sigma$ and from now on only consider finite automata
whose set of input symbols is $\Sigma$. The construction of an NFA$^\varepsilon$ for each regular expression $r$
over $\Sigma$ proceeds by recursion on the syntactic structure of the regular expression, as follows.
(i) For each atomic form of regular expression, $a$ ($a \in \Sigma$), $\varepsilon$, and $\emptyset$, we give an NFA$^\varepsilon$
accepting just the strings matching that regular expression.
(ii) Given any NFA$^\varepsilon$s $M_1$ and $M_2$, we construct a new NFA$^\varepsilon$, $\mathit{Union}(M_1, M_2)$, with the
property
    $L(\mathit{Union}(M_1, M_2)) = \{u \mid u \in L(M_1) \text{ or } u \in L(M_2)\}$.
Thus $L(r_1|r_2) = L(\mathit{Union}(M_1, M_2))$ when $L(r_1) = L(M_1)$ and $L(r_2) = L(M_2)$.
(iii) Given any NFA$^\varepsilon$s $M_1$ and $M_2$, we construct a new NFA$^\varepsilon$, $\mathit{Concat}(M_1, M_2)$, with the
property
    $L(\mathit{Concat}(M_1, M_2)) = \{u_1 u_2 \mid u_1 \in L(M_1) \text{ and } u_2 \in L(M_2)\}$.
Thus $L(r_1 r_2) = L(\mathit{Concat}(M_1, M_2))$ when $L(r_1) = L(M_1)$ and $L(r_2) = L(M_2)$.
(iv) Given any NFA$^\varepsilon$ $M$, we construct a new NFA$^\varepsilon$, $\mathit{Star}(M)$, with the property
    $L(\mathit{Star}(M)) = \{u_1 u_2 \ldots u_n \mid n \ge 0 \text{ and each } u_i \in L(M)\}$.
Thus $L(r^*) = L(\mathit{Star}(M))$ when $L(r) = L(M)$.
Thus starting with step (i) and applying the constructions in steps (ii)–(iv) over and over
again, we eventually build NFA$^\varepsilon$s with the required property for every regular expression $r$.
Put more formally, one can prove the statement
    for all $n \ge 0$, and for all regular expressions $r$ of size $\le n$, there exists an NFA$^\varepsilon$
    $M$ such that $L(r) = L(M)$
by mathematical induction on $n$, using step (i) for the base case and steps (ii)–(iv) for the
induction steps. Here we can take the size of a regular expression to be the number of
occurrences of union ($-|-$), concatenation ($--$), or star ($-^*$) in it.
Step (i) Slide 20 gives NFAs whose languages of accepted strings are respectively $L(a) = \{a\}$
(any $a \in \Sigma$), $L(\varepsilon) = \{\varepsilon\}$, and $L(\emptyset) = \emptyset$.
    NFAs for atomic regular expressions
    [Diagram: $q_0 \xrightarrow{a} q_1$, with $q_1$ accepting] just accepts the one-symbol string $a$
    [Diagram: a single accepting state $q_0$] just accepts the null string, $\varepsilon$
    [Diagram: a single non-accepting state $q_0$] accepts no strings
    Slide 20

Step (ii) Given NFA$^\varepsilon$s $M_1$ and $M_2$, the construction of $\mathit{Union}(M_1, M_2)$ is pictured on
Slide 21. First, renaming states if necessary, we assume that $\mathit{States}_{M_1}$ and $\mathit{States}_{M_2}$ are
disjoint. Then the states of $\mathit{Union}(M_1, M_2)$ are all the states in either $M_1$ or $M_2$, together
with a new state, called $q_0$ say. The start state of $\mathit{Union}(M_1, M_2)$ is this $q_0$ and its accepting
states are all the states that are accepting in either $M_1$ or $M_2$. Finally, the transitions of
$\mathit{Union}(M_1, M_2)$ are given by all those in either $M_1$ or $M_2$, together with two new
$\varepsilon$-transitions out of $q_0$ to the start states of $M_1$ and $M_2$.
Thus if $u \in L(M_1)$, i.e. if we have $s_{M_1} \stackrel{u}{\Rightarrow} q_1$ for some $q_1 \in \mathit{Accept}_{M_1}$, then we
get $q_0 \xrightarrow{\varepsilon} s_{M_1} \stackrel{u}{\Rightarrow} q_1$, showing that $u \in L(\mathit{Union}(M_1, M_2))$. Similarly for $M_2$. So
$L(\mathit{Union}(M_1, M_2))$ contains the union of $L(M_1)$ and $L(M_2)$. Conversely if $u$ is accepted by
$\mathit{Union}(M_1, M_2)$, there is a transition sequence $q_0 \stackrel{u}{\Rightarrow} q$ with $q \in \mathit{Accept}_{M_1}$ or $q \in \mathit{Accept}_{M_2}$.
Clearly, in either case this transition sequence has to begin with one or other of the
$\varepsilon$-transitions from $q_0$, and thereafter we get a transition sequence entirely in one or other of
$M_1$ or $M_2$ finishing in an acceptable state for that one. So if $u \in L(\mathit{Union}(M_1, M_2))$, then
either $u \in L(M_1)$ or $u \in L(M_2)$. So we do indeed have
    $L(\mathit{Union}(M_1, M_2)) = \{u \mid u \in L(M_1) \text{ or } u \in L(M_2)\}$.
    $\mathit{Union}(M_1, M_2)$
    [Diagram: a new start state $q_0$ with $\varepsilon$-transitions to $s_{M_1}$ (the start state of $M_1$)
    and to $s_{M_2}$ (the start state of $M_2$)]
    Set of accepting states is union of $\mathit{Accept}_{M_1}$ and $\mathit{Accept}_{M_2}$.
    Slide 21

Step (iii) Given NFA$^\varepsilon$s $M_1$ and $M_2$, the construction of $\mathit{Concat}(M_1, M_2)$ is pictured on
Slide 22. First, renaming states if necessary, we assume that $\mathit{States}_{M_1}$ and $\mathit{States}_{M_2}$ are
disjoint. Then the states of $\mathit{Concat}(M_1, M_2)$ are all the states in either $M_1$ or $M_2$. The start
state of $\mathit{Concat}(M_1, M_2)$ is the start state of $M_1$. The accepting states of $\mathit{Concat}(M_1, M_2)$
are the accepting states of $M_2$. Finally, the transitions of $\mathit{Concat}(M_1, M_2)$ are given by all
those in either $M_1$ or $M_2$, together with new $\varepsilon$-transitions from each accepting state of $M_1$ to
the start state of $M_2$ (only one such new transition is shown in the picture).
Thus if $u_1 \in L(M_1)$ and $u_2 \in L(M_2)$, there are transition sequences $s_{M_1} \stackrel{u_1}{\Rightarrow} q_1$ in $M_1$
with $q_1 \in \mathit{Accept}_{M_1}$, and $s_{M_2} \stackrel{u_2}{\Rightarrow} q_2$ in $M_2$ with $q_2 \in \mathit{Accept}_{M_2}$. These combine to yield
    $s_{M_1} \stackrel{u_1}{\Rightarrow} q_1 \xrightarrow{\varepsilon} s_{M_2} \stackrel{u_2}{\Rightarrow} q_2$
in $\mathit{Concat}(M_1, M_2)$, witnessing the fact that $u_1 u_2$ is accepted by $\mathit{Concat}(M_1, M_2)$. Conversely,
it is not hard to see that every $v \in L(\mathit{Concat}(M_1, M_2))$ is of this form. For any
transition sequence witnessing the fact that $v$ is accepted starts out in the states of $M_1$ but
finishes in the disjoint set of states of $M_2$. At some point in the sequence one of the new
$\varepsilon$-transitions occurs to get from $M_1$ to $M_2$ and thus we can split $v$ as $v = u_1 u_2$ with $u_1$
accepted by $M_1$ and $u_2$ accepted by $M_2$. So we do indeed have
    $L(\mathit{Concat}(M_1, M_2)) = \{u_1 u_2 \mid u_1 \in L(M_1) \text{ and } u_2 \in L(M_2)\}$.
    $\mathit{Concat}(M_1, M_2)$
    [Diagram: $M_1$ with start state $s_{M_1}$, an $\varepsilon$-transition from an accepting state of $M_1$
    to $s_{M_2}$, the start state of $M_2$]
    Set of accepting states is $\mathit{Accept}_{M_2}$.
    Slide 22

Step (iv) Given an NFA$^\varepsilon$ $M$, the construction of $\mathit{Star}(M)$ is pictured on Slide 23. The
states of $\mathit{Star}(M)$ are all those of $M$ together with a new state, called $q_0$ say. The start state
of $\mathit{Star}(M)$ is $q_0$ and this is also the only accepting state of $\mathit{Star}(M)$. Finally, the transitions
of $\mathit{Star}(M)$ are all those of $M$ together with new $\varepsilon$-transitions from $q_0$ to the start state of $M$
and from each accepting state of $M$ to $q_0$ (only one of this latter kind of transition is shown
in the picture).
Clearly, $\mathit{Star}(M)$ accepts $\varepsilon$ (since its start state is accepting) and any concatenation of
one or more strings accepted by $M$. Conversely, if $v$ is accepted by $\mathit{Star}(M)$, the occurrences
of $q_0$ in a transition sequence witnessing this fact allow us to split $v$ into the concatenation of
zero or more strings, each of which is accepted by $M$. So we do indeed have
    $L(\mathit{Star}(M)) = \{u_1 u_2 \ldots u_n \mid n \ge 0 \text{ and each } u_i \in L(M)\}$.
    $\mathit{Star}(M)$
    [Diagram: a new state $q_0$ with an $\varepsilon$-transition to $s_M$, the start state of $M$, and an
    $\varepsilon$-transition from an accepting state of $M$ back to $q_0$]
    The only accepting state of $\mathit{Star}(M)$ is $q_0$.
    Slide 23

This completes the proof of part (a) of Kleene’s Theorem (Slide 19). Figure 1 shows how
the step-by-step construction applies in the case of the regular expression $(a|b)^*a$ to produce
an NFA$^\varepsilon$ $M$ satisfying $L(M) = L((a|b)^*a)$. Of course an automaton with fewer states and
$\varepsilon$-transitions doing the same job can be crafted by hand. The point of the construction is that
it provides an automatic way of producing automata for any given regular expression.
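Steps (i)–(iv) translate almost line for line into code. The following Python sketch is an illustration only: the class name NFA, the helper names, the use of None as the label of an ε-transition, and the integer state names are all assumptions made for this example, not notation from the notes.

```python
import itertools
from collections import defaultdict

_fresh = itertools.count()          # source of new state names, so state sets stay disjoint

class NFA:
    """An NFA with epsilon-transitions; trans maps (state, label) to a set of states."""
    def __init__(self, start, accept, trans):
        self.start, self.accept, self.trans = start, set(accept), trans

def _merge(*tables):
    out = defaultdict(set)
    for table in tables:
        for key, targets in table.items():
            out[key] |= targets
    return out

def atom(a):                        # step (i): accepts just the one-symbol string a
    q0, q1 = next(_fresh), next(_fresh)
    return NFA(q0, {q1}, _merge({(q0, a): {q1}}))

def eps():                          # step (i): accepts just the null string
    q0 = next(_fresh)
    return NFA(q0, {q0}, _merge())

def empty():                        # step (i): accepts no strings
    return NFA(next(_fresh), set(), _merge())

def union(m1, m2):                  # step (ii): new start state with two epsilon-transitions
    q0 = next(_fresh)
    new = {(q0, None): {m1.start, m2.start}}
    return NFA(q0, m1.accept | m2.accept, _merge(m1.trans, m2.trans, new))

def concat(m1, m2):                 # step (iii): accepting states of m1 link to start of m2
    new = {(q, None): {m2.start} for q in m1.accept}
    return NFA(m1.start, m2.accept, _merge(m1.trans, m2.trans, new))

def star(m):                        # step (iv): new state q0 is start and sole accepting state
    q0 = next(_fresh)
    new = {(q, None): {q0} for q in m.accept}
    new[(q0, None)] = {m.start}
    return NFA(q0, {q0}, _merge(m.trans, new))

def closure(m, states):
    """All states reachable from the given ones by epsilon-transitions alone."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for q2 in m.trans.get((q, None), ()):
            if q2 not in seen:
                seen.add(q2)
                stack.append(q2)
    return seen

def accepts(m, u):
    current = closure(m, {m.start})
    for a in u:
        step = set()
        for q in current:
            step |= m.trans.get((q, a), set())
        current = closure(m, step)
    return bool(current & m.accept)

# The regular expression (a|b)*a of Figure 1, built step by step.
FIGURE_1 = concat(star(union(atom("a"), atom("b"))), atom("a"))
```

Each automaton value should be used only once as an argument, since the constructions rely on the state sets of their arguments being disjoint.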
3.2 Decidability of matching
The proof of part (a) of Kleene’s Theorem provides us with a positive answer to question (a)
on Slide 9. In other words, it provides a method that, given any string $u$ and regular expression
$r$, decides whether or not $u$ matches $r$. The method is:
• construct a DFA $M$ satisfying $L(M) = L(r)$;
• beginning in $M$’s start state, carry out the sequence of transitions in $M$ corresponding
  to the string $u$, reaching some state $q$ of $M$ (because $M$ is deterministic, there is a
  unique such transition sequence);
• check whether $q$ is accepting or not: if it is, then $u \in L(M) = L(r)$, so $u$ matches $r$;
  otherwise $u \notin L(M) = L(r)$, so $u$ does not match $r$.
Note. The subset construction used to convert the NFA$^\varepsilon$ resulting from steps (i)–(iv) of
Section 3.1 to a DFA produces an exponential blow-up of the number of states. ($PM$ has
$2^n$ states if $M$ has $n$.) This makes the method described above very inefficient. (Much more
efficient algorithms exist.)

Figure 1: Steps in constructing an NFA$^\varepsilon$ for $(a|b)^*a$
[Diagrams not reproduced. The figure shows, in order: a step of type (i) for $a$; a step of type (i)
for $b$; a step of type (ii) for $a|b$; a step of type (iv) for $(a|b)^*$; a step of type (iii) for $(a|b)^*a$.]
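To make the first step of the method concrete, here is a hedged Python sketch of a subset construction. It departs from the definition of $PM$ in one respect that should be kept in mind: it only generates the subsets reachable from the start state, rather than all $2^n$ subsets. The dictionary representation, the use of None for ε, and the two-state example NFA for $(a|b)^*a$ are assumptions of this sketch.

```python
def eps_closure(trans, states):
    """States reachable from `states` using only epsilon-transitions (label None)."""
    seen, stack = set(states), list(states)
    while stack:
        q = stack.pop()
        for q2 in trans.get((q, None), ()):
            if q2 not in seen:
                seen.add(q2)
                stack.append(q2)
    return seen

def subset_construction(trans, start, accept, alphabet):
    """Return (delta, start_subset, accepting_subsets) for the reachable part of PM."""
    s0 = frozenset(eps_closure(trans, {start}))
    delta, seen, todo = {}, {s0}, [s0]
    while todo:
        S = todo.pop()
        for a in alphabet:
            step = set()
            for q in S:
                step |= trans.get((q, a), set())
            T = frozenset(eps_closure(trans, step))
            delta[(S, a)] = T
            if T not in seen:
                seen.add(T)
                todo.append(T)
    accepting = {S for S in seen if S & accept}   # subsets containing an accepting state
    return delta, s0, accepting

def matches(delta, s0, accepting, u):
    S = s0
    for a in u:
        S = delta[(S, a)]
    return S in accepting

# Hand-crafted NFA for (a|b)*a: state 0 loops on a and b, and moves to state 1 on a.
NFA_TRANS = {(0, "a"): {0, 1}, (0, "b"): {0}}
DFA = subset_construction(NFA_TRANS, 0, {1}, "ab")
```

For this example only two subsets are reachable, {0} and {0, 1}, which illustrates that the exponential bound is a worst case rather than the typical outcome.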
3.3 Exercises
Exercise 3.3.1. Why can’t the automaton $\mathit{Star}(M)$ required in step (iv) of Section 3.1 be
constructed simply by taking $M$ and adding new $\varepsilon$-transitions back from each accepting state
to its start state?
Exercise 3.3.2. Work through the steps in Section 3.1 to construct an NFA$^\varepsilon$ $M$ satisfying
$L(M) = L((\varepsilon|b)^*aab^*)$. Do the same for some other regular expressions.
Exercise 3.3.3. Show that any finite set of strings is a regular language.
Tripos question
1992.4.9(b) [needs Slide 26 from Section 4]
4 Regular Languages, II
The aim of this section is to prove part (b) of Kleene’s Theorem (Slide 19).
4.1 Regular expressions from finite automata
Given any DFA $M$, we have to find a regular expression $r$ (over the alphabet of input symbols
of $M$) satisfying $L(r) = L(M)$. In fact we do something more general than this, as described
in the Lemma on Slide 24 (see footnote 1). Note that if we can find such regular expressions $r^S_{q,q'}$ for any
choice of $S \subseteq \mathit{States}_M$ and $q, q' \in \mathit{States}_M$, then the problem is solved. For taking $S$ to be the
whole of $\mathit{States}_M$ and $q$ to be the start state, $s$ say, then by definition of $r^S_{s,q'}$, a string $u$ matches
this regular expression iff there is a transition sequence $s \xrightarrow{u}{}^{*} q'$ in $M$. As $q'$ ranges over the
finitely many accepting states, $q_1, \ldots, q_k$ say, then we match exactly all the strings accepted by $M$.
In other words the regular expression $r^S_{s,q_1} | \cdots | r^S_{s,q_k}$ has the property we want for part (b)
of Kleene’s Theorem. (In case $k = 0$, i.e. there are no accepting states in $M$, then $L(M)$ is
empty and so we can use the regular expression $\emptyset$.)

    Lemma Given an NFA $M$, for each subset $S \subseteq \mathit{States}_M$
    and each pair of states $q, q' \in \mathit{States}_M$, there is a regular
    expression $r^S_{q,q'}$ satisfying
        $L(r^S_{q,q'}) = \{u \in (\Sigma_M)^* \mid q \xrightarrow{u}{}^{*} q' \text{ in } M \text{ with all intermediate states of the sequence in } S\}$.
    Hence $L(M) = L(r)$, where $r = r_1 | \cdots | r_k$ and
        $k$ = number of accepting states,
        $r_i = r^S_{s,q_i}$ with $S = \mathit{States}_M$,
        $s$ = start state,
        $q_i$ = $i$th accepting state.
    (In case $k = 0$, take $r$ to be the regular expression $\emptyset$.)
    Slide 24

Footnote 1: The lemma works just as well whether $M$ is deterministic or non-deterministic; it also works for
NFA$^\varepsilon$s, provided we replace $\xrightarrow{u}{}^{*}$ by $\stackrel{u}{\Rightarrow}$ (cf. Slide 16).

Proof of the Lemma on Slide 24. The regular expression $r^S_{q,q'}$ can be constructed by induction
on the number of elements in the subset $S$.
Base case, $S$ is empty. In this case, for each pair of states $q, q'$, we are looking for a regular
expression to describe the set of strings
    $\{u \mid q \xrightarrow{u}{}^{*} q' \text{ with no intermediate states}\}$.
So each element of this set is either a single input symbol $a$ (if $q \xrightarrow{a} q'$ holds in $M$) or
possibly $\varepsilon$, in case $q = q'$. If there are no input symbols that take us from $q$ to $q'$ in $M$, we
can simply take
    $r^{\emptyset}_{q,q'} = \emptyset$ if $q \ne q'$, and $r^{\emptyset}_{q,q'} = \varepsilon$ if $q = q'$.
On the other hand, if there are some such input symbols, $a_1, \ldots, a_k$ say, we can take
    $r^{\emptyset}_{q,q'} = a_1 | \cdots | a_k$ if $q \ne q'$, and $r^{\emptyset}_{q,q'} = a_1 | \cdots | a_k | \varepsilon$ if $q = q'$.
Induction step. Suppose we have defined the required regular expressions for all subsets
of states with $n$ elements. If $S$ is a subset with $n+1$ elements, choose some element $q_0 \in S$
and consider the $n$-element set $S \setminus \{q_0\} = \{q \in S \mid q \ne q_0\}$. Then for any pair of states
$q, q' \in \mathit{States}_M$, by inductive hypothesis we have already constructed the regular expressions
    $r_1 = r^{S \setminus \{q_0\}}_{q,q'}$,  $r_2 = r^{S \setminus \{q_0\}}_{q,q_0}$,  $r_3 = r^{S \setminus \{q_0\}}_{q_0,q_0}$,  and  $r_4 = r^{S \setminus \{q_0\}}_{q_0,q'}$.
Consider the regular expression
    $r = r_1 | r_2 (r_3)^* r_4$.
Clearly every string matching $r$ is in the set
    $\{u \mid q \xrightarrow{u}{}^{*} q' \text{ with all intermediate states in this sequence in } S\}$.
Conversely, if $u$ is in this set, consider the number of times the sequence of transitions
$q \xrightarrow{u}{}^{*} q'$ passes through state $q_0$. If this number is zero then $u \in L(r_1)$ (by definition of
$r_1$). Otherwise this number is $k \ge 1$ and the sequence splits into $k+1$ pieces: the first piece
is in $L(r_2)$ (as the sequence goes from $q$ to the first occurrence of $q_0$), the next $k-1$ pieces
are in $L(r_3)$ (as the sequence goes from one occurrence of $q_0$ to the next), and the last piece
is in $L(r_4)$ (as the sequence goes from the last occurrence of $q_0$ to $q'$). So in this case $u$ is in
$L(r_2 (r_3)^* r_4)$. So in either case $u$ is in $L(r)$. So to complete the induction step we can define
$r^S_{q,q'}$ to be this regular expression $r = r_1 | r_2 (r_3)^* r_4$.
4.2 An example
Perhaps an example will help to understand the rather clever argument in Section 4.1. The
example will also demonstrate that we do not have to pursue the inductive construction of the
regular expression to the bitter end (the base case $S = \emptyset$): often it is possible to find some of
the regular expressions $r^S_{q,q'}$ one needs by ad hoc arguments.
Note also that at the inductive steps in the construction of a regular expression for $M$
we are free to choose which state $q_0$ to remove from the current state set $S$. A good rule of
thumb is: choose a state that disconnects the automaton as much as possible.

    Example
    [State diagram not reproduced: an NFA with states 0, 1 and 2 over the input symbols $a$ and $b$,
    in which 0 is the start state and the only accepting state.]
    Direct inspection yields:
        $r^{\{0\}}_{1,0} = \emptyset$,  $r^{\{0\}}_{1,1} = \varepsilon$,  $r^{\{0\}}_{1,2} = a$,
        $r^{\{0\}}_{2,0} = aa^*$,  $r^{\{0\}}_{2,1} = a^*b$,  $r^{\{0\}}_{2,2} = \varepsilon$,
        $r^{\{0,2\}}_{0,0} = a^*$,  $r^{\{0,2\}}_{0,1} = a^*b$.
    Slide 25

As an example, consider the NFA shown on Slide 25. Since the start state is 0 and this
is also the only accepting state, the language of accepted strings is that determined by the
regular expression $r^{\{0,1,2\}}_{0,0}$. Choosing to remove state 1 from the state set, we have
    $L(r^{\{0,1,2\}}_{0,0}) = L(r^{\{0,2\}}_{0,0} \,|\, r^{\{0,2\}}_{0,1} (r^{\{0,2\}}_{1,1})^* r^{\{0,2\}}_{1,0})$.    (3)
Direct inspection shows that $L(r^{\{0,2\}}_{0,0}) = L(a^*)$ and $L(r^{\{0,2\}}_{0,1}) = L(a^*b)$. To calculate
$L(r^{\{0,2\}}_{1,1})$ and $L(r^{\{0,2\}}_{1,0})$, we choose to remove state 2:
    $L(r^{\{0,2\}}_{1,1}) = L(r^{\{0\}}_{1,1} \,|\, r^{\{0\}}_{1,2} (r^{\{0\}}_{2,2})^* r^{\{0\}}_{2,1})$
    $L(r^{\{0,2\}}_{1,0}) = L(r^{\{0\}}_{1,0} \,|\, r^{\{0\}}_{1,2} (r^{\{0\}}_{2,2})^* r^{\{0\}}_{2,0})$.
These regular expressions can all be determined by inspection, as shown on Slide 25. Thus
    $L(r^{\{0,2\}}_{1,1}) = L(\varepsilon \,|\, a(\varepsilon)^*(a^*b))$
and it’s not hard to see that this is equal to $L(\varepsilon | aa^*b)$; and
    $L(r^{\{0,2\}}_{1,0}) = L(\emptyset \,|\, a(\varepsilon)^*(aa^*))$
which is equal to $L(aaa^*)$. Substituting all these values into (3), we get
    $L(r^{\{0,1,2\}}_{0,0}) = L(a^* \,|\, a^*b(\varepsilon | aa^*b)^*aaa^*)$.
So $a^* \,|\, a^*b(\varepsilon | aa^*b)^*aaa^*$ is a regular expression whose matching strings comprise the
language accepted by the NFA on Slide 25. (Clearly, one could simplify this to a smaller, but
equivalent regular expression (in the sense of Slide 8), but we do not bother to do so.)
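The induction in Section 4.1 is itself a recursive algorithm, and a short Python sketch can be used to check a hand calculation like the one above. Everything in it is an assumption made for illustration: regular expressions are represented as Python `re` pattern strings, with None standing for $\emptyset$ and the empty string for $\varepsilon$; the function names are invented; and the transition set EDGES is a guess at the NFA of Slide 25, inferred from the regular expressions quoted in the text rather than copied from the slide.

```python
import re

def alt(r1, r2):                      # r1 | r2, with None playing the role of the empty set
    if r1 is None:
        return r2
    if r2 is None or r1 == r2:
        return r1
    return f"(?:{r1}|{r2})"

def cat(r1, r2):                      # concatenation; anything concatenated with the empty set is empty
    return None if r1 is None or r2 is None else r1 + r2

def star(r):                          # the star of the empty set or of epsilon is epsilon
    return "" if not r else f"(?:{r})*"

def path_regex(edges, S, q, q2):
    """Pattern for strings taking q to q2 with all intermediate states in the tuple S."""
    if not S:                         # base case: no intermediate states allowed
        r = "" if q == q2 else None
        for (p, a, p2) in sorted(edges):
            if p == q and p2 == q2:
                r = alt(r, re.escape(a))
        return r
    q0, rest = S[0], S[1:]            # induction step: remove q0 from S
    r1 = path_regex(edges, rest, q, q2)
    r2 = path_regex(edges, rest, q, q0)
    r3 = path_regex(edges, rest, q0, q0)
    r4 = path_regex(edges, rest, q0, q2)
    return alt(r1, cat(cat(r2, star(r3)), r4))

def nfa_accepts(edges, start, accepting, w):
    current = {start}
    for a in w:
        current = {p2 for (p, b, p2) in edges if p in current and b == a}
    return bool(current & accepting)

# Assumed transitions (state, symbol, state); start and only accepting state is 0.
EDGES = {(0, "a", 0), (0, "b", 1), (1, "a", 2), (2, "a", 0), (2, "b", 1)}
# Remove state 1 first, then 2, then 0, mirroring the order used in the text.
DERIVED = path_regex(EDGES, (1, 2, 0), 0, 0)
# The final expression from the text, transcribed into Python syntax.
FROM_TEXT = r"a*|a*b(?:|aa*b)*aaa*"
```

The derived pattern is much longer than the one obtained by hand, because the sketch performs no simplification beyond dropping $\emptyset$; the two can nevertheless be compared on all short strings with `re.fullmatch`.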
4.3 Complement and intersection of regular languages
We saw in Section 3.2 that part (a) of Kleene’s Theorem allows us to answer question (a)
on Slide 9. Now that we have proved the other half of the theorem, we can say more about
question (b) on that slide.
Complementation Recall that on page 8 we mentioned that for each regular expression $r$
over an alphabet $\Sigma$, we can find a regular expression $\sim(r)$ that determines the complement
of the language determined by $r$:
    $L(\sim(r)) = \{u \in \Sigma^* \mid u \notin L(r)\}$.
As we now show, this is a consequence of Kleene’s Theorem.

    $\mathit{Not}(M)$
    • $\mathit{States}_{\mathit{Not}(M)} = \mathit{States}_M$
    • $\Sigma_{\mathit{Not}(M)} = \Sigma_M$
    • transitions of $\mathit{Not}(M)$ = transitions of $M$
    • start state of $\mathit{Not}(M)$ = start state of $M$
    • $\mathit{Accept}_{\mathit{Not}(M)} = \{q \in \mathit{States}_M \mid q \notin \mathit{Accept}_M\}$.
    Provided $M$ is a deterministic finite automaton, then $u$ is
    accepted by $\mathit{Not}(M)$ iff it is not accepted by $M$:
        $L(\mathit{Not}(M)) = \{u \in \Sigma^* \mid u \notin L(M)\}$.
    Slide 26

Lemma 4.3.1. If $L$ is a regular language over alphabet $\Sigma$, then its complement
$\{u \in \Sigma^* \mid u \notin L\}$ is also regular.
Proof. Since $L$ is regular, by definition there is a DFA $M$ such that $L = L(M)$. Let $\mathit{Not}(M)$
be the DFA constructed from $M$ as indicated on Slide 26. Then $\{u \in \Sigma^* \mid u \notin L\}$ is the set
of strings accepted by $\mathit{Not}(M)$ and hence is regular.
Given a regular expression $r$, by part (a) of Kleene’s Theorem there is a DFA $M$ such
that $L(r) = L(M)$. Then by part (b) of the theorem applied to the DFA $\mathit{Not}(M)$, we can
find a regular expression $\sim(r)$ so that $L(\sim(r)) = L(\mathit{Not}(M))$. Since
    $L(\mathit{Not}(M)) = \{u \in \Sigma^* \mid u \notin L(M)\} = \{u \in \Sigma^* \mid u \notin L(r)\}$,
this $\sim(r)$ is the regular expression we need for the complement of $r$.
Note. The construction given on Slide 26 can be applied to a finite automaton $M$ whether or
not it is deterministic. However, for $L(\mathit{Not}(M))$ to equal $\{u \in \Sigma^* \mid u \notin L(M)\}$ we need
$M$ to be deterministic. See Exercise 4.4.2.
Intersection As another example of the power of Kleene’s Theorem, given regular expressions
$r_1$ and $r_2$ we can show the existence of a regular expression $(r_1 \& r_2)$ with the property:
    $u$ matches $(r_1 \& r_2)$ iff $u$ matches $r_1$ and $u$ matches $r_2$.
This can be deduced from the following lemma.
Lemma 4.3.2. If $L_1$ and $L_2$ are regular languages over an alphabet $\Sigma$, then their
intersection
    $L_1 \cap L_2 = \{u \in \Sigma^* \mid u \in L_1 \text{ and } u \in L_2\}$
is also regular.
Proof. Since $L_1$ and $L_2$ are regular languages, there are DFAs $M_1$ and $M_2$ such that
$L_i = L(M_i)$ ($i = 1, 2$). Let $\mathit{And}(M_1, M_2)$ be the DFA constructed from $M_1$ and $M_2$ as on
Slide 27. It is not hard to see that $\mathit{And}(M_1, M_2)$ has the property that any $u \in \Sigma^*$
is accepted by $\mathit{And}(M_1, M_2)$ iff it is accepted by both $M_1$ and $M_2$. Thus $L_1 \cap L_2 = L(\mathit{And}(M_1, M_2))$
is a regular language.
    $\mathit{And}(M_1, M_2)$
    • states of $\mathit{And}(M_1, M_2)$ are all ordered pairs $(q_1, q_2)$ with
      $q_1 \in \mathit{States}_{M_1}$ and $q_2 \in \mathit{States}_{M_2}$
    • alphabet of $\mathit{And}(M_1, M_2)$ is the common alphabet of $M_1$ and $M_2$
    • $(q_1, q_2) \xrightarrow{a} (q_1', q_2')$ in $\mathit{And}(M_1, M_2)$ iff $q_1 \xrightarrow{a} q_1'$ in $M_1$
      and $q_2 \xrightarrow{a} q_2'$ in $M_2$
    • start state of $\mathit{And}(M_1, M_2)$ is $(s_{M_1}, s_{M_2})$
    • $(q_1, q_2)$ accepting in $\mathit{And}(M_1, M_2)$ iff $q_1$ accepting in $M_1$
      and $q_2$ accepting in $M_2$.
    Slide 27

Thus given regular expressions $r_1$ and $r_2$, by part (a) of Kleene’s Theorem we can find
DFAs $M_1$ and $M_2$ with $L(r_i) = L(M_i)$ ($i = 1, 2$). Then by part (b) of the theorem we can
find a regular expression $r_1 \& r_2$ so that $L(r_1 \& r_2) = L(\mathit{And}(M_1, M_2))$. Thus $u$ matches
$r_1 \& r_2$ iff $\mathit{And}(M_1, M_2)$ accepts $u$, iff both $M_1$ and $M_2$ accept $u$, iff $u$ matches both $r_1$ and
$r_2$, as required.
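The constructions on Slides 26 and 27 are simple enough to write out directly. The Python sketch below assumes a DFA is represented as a tuple (states, delta, start, accepting) with delta a dictionary on (state, symbol) pairs; that representation, the function names, and the two example automata are illustrative choices, not part of the notes.

```python
def dfa_accepts(dfa, u):
    states, delta, start, accepting = dfa
    q = start
    for a in u:
        q = delta[(q, a)]
    return q in accepting

def dfa_not(dfa):
    """Slide 26: same states, transitions and start state; accepting states complemented."""
    states, delta, start, accepting = dfa
    return states, delta, start, states - accepting

def dfa_and(m1, m2, alphabet):
    """Slide 27: run both machines in parallel on ordered pairs of states."""
    (s1, d1, i1, a1), (s2, d2, i2, a2) = m1, m2
    states = {(p, q) for p in s1 for q in s2}
    delta = {((p, q), a): (d1[(p, a)], d2[(q, a)])
             for (p, q) in states for a in alphabet}
    accepting = {(p, q) for (p, q) in states if p in a1 and q in a2}
    return states, delta, (i1, i2), accepting

# Hypothetical examples over {a, b}.
EVEN_A = ({0, 1},                                    # even number of a's
          {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1},
          0, {0})
ENDS_B = ({0, 1},                                    # last symbol is b
          {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1},
          0, {1})

BOTH = dfa_and(EVEN_A, ENDS_B, "ab")
ODD_A = dfa_not(EVEN_A)
```

Note that `dfa_not` is only correct because `delta` is a total function: this is the determinism proviso on Slide 26.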
4.4 Exercises
Exercise 4.4.1. Use the construction in Section 4.1 to find a regular expression for the DFA
$M$ whose state set is $\{0, 1, 2\}$, whose start state is 0, whose only accepting state is 2, whose
alphabet of input symbols is $\{a, b\}$, and whose next-state function is given by the following
table.
    $\delta_M$ | $a$ | $b$
    0          |  1  |  2
    1          |  2  |  1
    2          |  2  |  1
Exercise 4.4.2. The construction $M \mapsto \mathit{Not}(M)$ given on Slide 26 applies to both DFA and
NFA; but for $L(\mathit{Not}(M))$ to be the complement of $L(M)$ we need $M$ to be deterministic.
Give an example of an alphabet $\Sigma$ and an NFA $M$ with set of input symbols $\Sigma$, such that
$\{u \in \Sigma^* \mid u \notin L(M)\}$ is not the same set as $L(\mathit{Not}(M))$.
Exercise 4.4.3. Let $r = (a|b)^*ab(a|b)^*$. Find a complement for $r$ over the alphabet
$\Sigma = \{a, b\}$, i.e. a regular expression $\sim(r)$ over the alphabet $\Sigma$ satisfying
$L(\sim(r)) = \{u \in \Sigma^* \mid u \notin L(r)\}$. Do the same for the alphabet $\{a, b, c\}$.
Tripos questions
1995.2.20
1994.3.3
1988.2.3
2000.2.7
5 The Pumping Lemma
In the context of programming languages, a typical example of a regular language (Slide 19)
is the set of all strings of characters which are well-formed tokens (basic keywords, identifiers,
etc) in a particular programming language, Java say. By contrast, the set of all strings which
represent well-formed Java programs is a typical example of a language that is not regular.
Slide 28 gives some simpler examples of non-regular languages. For example, there is no
way to use a search based on matching a regular expression to find all the palindromes in a
piece of text (although of course there are other kinds of algorithm for doing this).
    Examples of non-regular languages
    • The set of strings over $\{(, ), a, b, \ldots, z\}$ in which the
      parentheses ‘(’ and ‘)’ occur well-nested.
    • The set of strings over $\{a, b, \ldots, z\}$ which are palindromes,
      i.e. which read the same backwards as forwards.
    • $\{a^n b^n \mid n \ge 0\}$
    Slide 28

The intuitive reason why the languages listed on Slide 28 are not regular is that a machine
for recognising whether or not any given string is in the language would need infinitely many
different states (whereas a characteristic feature of the machines we have been using is that
they have only finitely many states). For example, to recognise that a string is of the form
$a^n b^n$ one would need to remember how many $a$s had been seen before the first $b$ is encountered,
requiring countably many states of the form ‘just seen $n$ $a$s’. This section makes this intuitive
argument rigorous and describes a useful way of showing that languages such as these are
not regular.
The fact that a finite automaton does only have finitely many states means that as we look
at longer and longer strings that it accepts, we see a certain kind of repetition: the pumping
lemma property given on Slide 29.
    The Pumping Lemma
    For every regular language $L$, there is a number $\ell \ge 1$ satisfying
    the pumping lemma property:
    all $w \in L$ with $\mathit{length}(w) \ge \ell$ can be expressed as a
    concatenation of three strings, $w = u_1 v u_2$, where $u_1$, $v$ and $u_2$ satisfy:
    • $\mathit{length}(v) \ge 1$ (i.e. $v \ne \varepsilon$)
    • $\mathit{length}(u_1 v) \le \ell$
    • for all $n \ge 0$, $u_1 v^n u_2 \in L$
      (i.e. $u_1 u_2 \in L$, $u_1 v u_2 \in L$ [but we knew that anyway],
      $u_1 v v u_2 \in L$, $u_1 v v v u_2 \in L$, etc).
    Slide 29

5.1 Proving the Pumping Lemma
Since $L$ is regular, it is equal to the set $L(M)$ of strings accepted by some DFA $M$. Then we
can take the number $\ell$ mentioned on Slide 29 to be the number of states in $M$. For suppose
$w = a_1 a_2 \ldots a_n$ with $n \ge \ell$. If $w \in L(M)$, then there is a transition sequence as shown at
the top of Slide 30. Then $w$ can be split into three pieces as shown on that slide. Note that
by choice of $i$ and $j$, $\mathit{length}(v) = j - i \ge 1$ and $\mathit{length}(u_1 v) = j \le \ell$. So it just remains
to check that $u_1 v^n u_2 \in L$ for all $n \ge 0$. As shown on the lower half of Slide 30, the string
$v$ takes the machine $M$ from state $q_i$ back to the same state (since $q_i = q_j$). So for any $n$,
$u_1 v^n u_2$ takes us from the initial state $s_M = q_0$ to $q_i$, then $n$ times round the loop from $q_i$ to
itself, and then from $q_i$ to $q_n \in \mathit{Accept}_M$. Therefore for any $n \ge 0$, $u_1 v^n u_2$ is accepted by
$M$, i.e. $u_1 v^n u_2 \in L$.
Note. In the above construction it is perfectly possible that $i = 0$, in which case $u_1$ is the
null-string, $\varepsilon$.
    If $n \ge \ell$ = number of states of $M$, then in
        $s_M = q_0 \xrightarrow{a_1} q_1 \xrightarrow{a_2} q_2 \cdots \xrightarrow{a_\ell} q_\ell \cdots \xrightarrow{a_n} q_n \in \mathit{Accept}_M$
    the $\ell + 1$ states $q_0, \ldots, q_\ell$ can’t all be distinct states. So $q_i = q_j$ for some
    $\ell \ge j > i \ge 0$. So the above transition sequence looks like
        $s_M = q_0 \xrightarrow{u_1}{}^{*} q_i = q_j \xrightarrow{u_2}{}^{*} q_n \in \mathit{Accept}_M$, with a loop $q_i \xrightarrow{v}{}^{*} q_j = q_i$,
    where
        $u_1 = a_1 \ldots a_i$,   $v = a_{i+1} \ldots a_j$,   $u_2 = a_{j+1} \ldots a_n$.
    Slide 30

Remark 5.1.1. One consequence of the pumping lemma property of $L$ and $\ell$ is that if there
is any string $w$ in $L$ of length $\ge \ell$, then $L$ contains arbitrarily long strings. (We just ‘pump
up’ $w$ by increasing $n$.)
If you did Exercise 3.3.3, you will know that if $L$ is a finite set of strings then it is regular.
In this case, what is the number $\ell$ with the property on Slide 29? The answer is that we can
take any $\ell$ strictly greater than the length of any string in the finite set $L$. Then the Pumping
Lemma property is trivially satisfied because there are no $w \in L$ with $\mathit{length}(w) \ge \ell$ for
which we have to check the condition!
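The proof in Section 5.1 is constructive: running the DFA on $w$ and stopping at the first repeated state yields the split $w = u_1 v u_2$. The following Python sketch illustrates this, assuming a next-state function stored as a dictionary; the function name pump_split and the example DFA (even number of a's) are hypothetical.

```python
def pump_split(delta, start, w):
    """Split w as (u1, v, u2) at the first state that the run of the DFA visits twice.

    Returns None if no state repeats, which can only happen when len(w) is
    smaller than the number of states.
    """
    first_seen = {start: 0}          # state -> number of symbols read when first visited
    state = start
    for j, a in enumerate(w, start=1):
        state = delta[(state, a)]
        if state in first_seen:      # q_i = q_j with j > i
            i = first_seen[state]
            return w[:i], w[i:j], w[j:]
        first_seen[state] = j
    return None

def dfa_accepts(delta, start, accepting, u):
    state = start
    for a in u:
        state = delta[(state, a)]
    return state in accepting

# Hypothetical two-state DFA: accepts strings with an even number of a's.
DELTA = {(0, "a"): 1, (0, "b"): 0, (1, "a"): 0, (1, "b"): 1}

if __name__ == "__main__":
    u1, v, u2 = pump_split(DELTA, 0, "abab")
    for n in range(4):
        print(n, u1 + v * n + u2, dfa_accepts(DELTA, 0, {0}, u1 + v * n + u2))
```

Because the split is taken at the first repetition, the length of $u_1 v$ never exceeds the number of states, matching the second condition on Slide 29.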
5.2 Using the Pumping Lemma
The Pumping Lemma (Slide 29) says that every regular language has a certain property,
namely that there exists a number $\ell$ with the pumping lemma property. So to show that
a language $L$ is not regular, it suffices to show that no $\ell \ge 1$ possesses the pumping
lemma property for the language $L$. Because the pumping lemma property involves quite a
complicated alternation of quantifiers, it will help to spell out explicitly what is its negation.
This is done on Slide 31. Slide 32 gives some examples.

    How to use the Pumping Lemma to prove
    that a language $L$ is not regular
    For each $\ell \ge 1$, find some $w \in L$ of length $\ge \ell$ so that
    $(\dagger)$  no matter how $w$ is split into three, $w = u_1 v u_2$,
                 with $\mathit{length}(u_1 v) \le \ell$ and $\mathit{length}(v) \ge 1$,
                 there is some $n \ge 0$ for which $u_1 v^n u_2$ is not in $L$.
    Slide 31

    Examples
    (i) $L_1 = \{a^n b^n \mid n \ge 0\}$ is not regular.
        [For each $\ell \ge 1$, $a^\ell b^\ell \in L_1$ is of length $\ge \ell$ and has property $(\dagger)$
        on Slide 31.]
    (ii) $L_2 = \{w \in \{a, b\}^* \mid w \text{ a palindrome}\}$ is not regular.
        [For each $\ell \ge 1$, $a^\ell b a^\ell \in L_2$ is of length $\ge \ell$ and has property $(\dagger)$.]
    (iii) $L_3 = \{a^p \mid p \text{ prime}\}$ is not regular.
        [For each $\ell \ge 1$, we can find a prime $p$ with $p > 2\ell$ and then
        $a^p \in L_3$ has length $\ge \ell$ and has property $(\dagger)$.]
    Slide 32
Proof of the examples on Slide 32. We use the method on Slide 31.
(i) For any $\ell \ge 1$, consider the string $w = a^\ell b^\ell$. It is in $L_1$ and has length $\ge \ell$. We show
that property $(\dagger)$ holds for this $w$. For suppose $w = a^\ell b^\ell$ is split as $w = u_1 v u_2$ with
$\mathit{length}(u_1 v) \le \ell$ and $\mathit{length}(v) \ge 1$. Then $u_1 v$ must consist entirely of $a$s, so $u_1 = a^r$
and $v = a^s$ say, and hence $u_2 = a^{\ell - r - s} b^\ell$. Then the case $n = 0$ of $u_1 v^n u_2$ is not in $L_1$ since
    $u_1 v^0 u_2 = u_1 u_2 = a^r (a^{\ell - r - s} b^\ell) = a^{\ell - s} b^\ell$
and $a^{\ell - s} b^\ell \notin L_1$ because $\ell - s \ne \ell$ (since $s = \mathit{length}(v) \ge 1$).
(ii) The argument is very similar to that for example (i), but starting with the palindrome
$w = a^\ell b a^\ell$. Once again, the $n = 0$ case of $u_1 v^n u_2$ yields a string $u_1 u_2 = a^{\ell - s} b a^\ell$ which is
not a palindrome (because $\ell - s \ne \ell$).
(iii) Given $\ell \ge 1$, since there are infinitely many primes $p$, we can certainly find one satisfying
$p > 2\ell$. I claim that $w = a^p$ has property $(\dagger)$. For suppose $w = a^p$ is split as $w = u_1 v u_2$
with $\mathit{length}(u_1 v) \le \ell$ and $\mathit{length}(v) \ge 1$. Letting $r = \mathit{length}(u_1)$ and $s = \mathit{length}(v)$, so
that $\mathit{length}(u_2) = p - r - s$, we have
    $u_1 v^{p-s} u_2 = a^r a^{s(p-s)} a^{p-r-s} = a^{sp - s^2 + p - s} = a^{(s+1)(p-s)}$.
Now $(s+1)(p-s)$ is not prime, because $s + 1 > 1$ (since $s = \mathit{length}(v) \ge 1$) and
$p - s > 2\ell - \ell = \ell \ge 1$ (since $p > 2\ell$ by choice, and $s \le r + s = \mathit{length}(u_1 v) \le \ell$).
Therefore $u_1 v^n u_2 \notin L_3$ when $n = p - s$.
Remark 5.2.1. Unfortunately, the method on Slide 31 can’t cope with every non-regular
language. This is because the pumping lemma property is a necessary, but not a sufficient
condition for a language to be regular. In other words there do exist languages $L$ for which a
number $\ell \ge 1$ can be found satisfying the pumping lemma property on Slide 29, but which
nonetheless are not regular. Slide 33 gives an example of such an $L$.
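Property $(\dagger)$ quantifies over every admissible split of $w$, and for small $\ell$ this can be checked exhaustively. The Python sketch below does so for examples (i) and (iii); the function names and the bound max_n on the exponents tried are assumptions of the sketch, and a successful check for a few values of $\ell$ is of course only a sanity test, not a substitute for the proofs above.

```python
def in_L1(w):
    """Membership in { a^n b^n | n >= 0 }."""
    n = len(w) // 2
    return w == "a" * n + "b" * n

def in_L3(w):
    """Membership in { a^p | p prime }."""
    p = len(w)
    return set(w) == {"a"} and p >= 2 and all(p % d for d in range(2, int(p ** 0.5) + 1))

def splits(w, ell):
    """All ways of writing w = u1 v u2 with length(u1 v) at most ell and v non-empty."""
    for j in range(1, min(ell, len(w)) + 1):
        for i in range(j):
            yield w[:i], w[i:j], w[j:]

def has_dagger(w, ell, member, max_n):
    """True if every admissible split can be pumped out of the language with some n up to max_n."""
    return all(
        any(not member(u1 + v * n + u2) for n in range(max_n + 1))
        for u1, v, u2 in splits(w, ell)
    )

if __name__ == "__main__":
    for ell in range(1, 6):
        print(ell, has_dagger("a" * ell + "b" * ell, ell, in_L1, max_n=0))
    print(has_dagger("a" * 7, 3, in_L3, max_n=7))   # the prime 7 exceeds 2 * 3
```

For example (i) the bound max_n=0 suffices, in line with the proof, which only uses the case $n = 0$; for example (iii) the proof uses $n = p - s$, so max_n is taken to be $p$.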
    Example of a non-regular language
    that satisfies the ‘pumping lemma property’
        $L = \{c^m a^n b^n \mid m \ge 1 \text{ and } n \ge 0\} \cup \{a^m b^n \mid m, n \ge 0\}$
    satisfies the pumping lemma property on Slide 29 with $\ell = 1$.
    [For any $w \in L$ of length $\ge 1$, can take $u_1 = \varepsilon$, $v$ = first letter of $w$,
    $u_2$ = rest of $w$.]
    But $L$ is not regular. [See Exercise 5.4.2.]
    Slide 33
5.3 Decidability of language equivalence
The proof of the Pumping Lemma provides us with a positive answer to question (c) on Slide 9. In other words, it provides a method that, given any two regular expressions $r_1$ and $r_2$ (over the same alphabet $\Sigma$) decides whether or not the languages they determine are equal, $L(r_1) = L(r_2)$.
First note that this problem can be reduced to deciding whether or not the set of strings accepted by any given DFA is empty. For $L(r_1) = L(r_2)$ iff $L(r_1) \subseteq L(r_2)$ and $L(r_2) \subseteq L(r_1)$. Using the results about complementation and intersection in Section 4.3, we can reduce the question of whether or not $L(r_1) \subseteq L(r_2)$ to the question of whether or not $L(r_1 \,\&\, {\sim}(r_2)) = \emptyset$, since
$$L(r_1) \subseteq L(r_2) \quad\text{iff}\quad L(r_1) \cap \{u \in \Sigma^* \mid u \notin L(r_2)\} = \emptyset.$$
By Kleene’s theorem, given $r_1$ and $r_2$ we can first construct regular expressions $r_1 \,\&\, {\sim}(r_2)$ and $r_2 \,\&\, {\sim}(r_1)$, then construct DFAs $M_1$ and $M_2$ such that $L(M_1) = L(r_1 \,\&\, {\sim}(r_2))$ and $L(M_2) = L(r_2 \,\&\, {\sim}(r_1))$. Then $r_1$ and $r_2$ are equivalent iff the languages accepted by $M_1$ and by $M_2$ are both empty.
The fact that, given any DFA $M$, one can decide whether or not $L(M) = \emptyset$ follows from the Lemma on Slide 34. For then, to check whether or not $L(M)$ is empty, we just have to check whether or not any of the finitely many strings of length less than the number of states of $M$ is accepted by $M$.
Lemma If a DFA $M$ accepts any string at all, it accepts one
whose length is less than the number of states in $M$.
Proof. Suppose $M$ has $\ell$ states (so $\ell \geq 1$). If $L(M)$ is not
empty, then we can find an element of it of shortest length,
$a_1 a_2 \ldots a_n$ say (where $n \geq 0$). Thus there is a transition
sequence
$$s_M = q_0 \xrightarrow{a_1} q_1 \xrightarrow{a_2} q_2 \cdots \xrightarrow{a_n} q_n \in \mathit{Accept}_M.$$
If $n \geq \ell$, then not all the $n + 1$ states in this sequence can be
distinct and we can shorten it as on Slide 30. But then we would
obtain a strictly shorter string in $L(M)$, contradicting the choice of
$a_1 a_2 \ldots a_n$. So we must have $n < \ell$.
Slide 34
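The decision method of Section 5.3 can be sketched in code. The following assumes we already have total DFAs over a common alphabet, and it replaces the route via the regular expressions $r_1 \,\&\, {\sim}(r_2)$ by a direct product construction on DFAs; that substitution, the `DFA` class and the helper `count_a_mod` are choices made for this sketch, not the notes' own presentation. The emptiness test is exactly the one justified by the Lemma on Slide 34: try every string shorter than the number of states.

```python
from dataclasses import dataclass
from itertools import product


@dataclass
class DFA:
    states: frozenset
    alphabet: frozenset
    delta: dict          # maps (state, symbol) -> state; assumed to be total
    start: object
    accepting: frozenset

    def accepts(self, word) -> bool:
        q = self.start
        for a in word:
            q = self.delta[(q, a)]
        return q in self.accepting


def is_empty(m: DFA) -> bool:
    """L(m) is empty iff m accepts no string of length < number of states."""
    symbols = sorted(m.alphabet)
    for n in range(len(m.states)):
        for word in product(symbols, repeat=n):
            if m.accepts(word):
                return False
    return True


def difference(m1: DFA, m2: DFA) -> DFA:
    """Product DFA accepting L(m1) intersected with the complement of L(m2)."""
    states = frozenset(product(m1.states, m2.states))
    delta = {((p, q), a): (m1.delta[(p, a)], m2.delta[(q, a)])
             for (p, q) in states for a in m1.alphabet}
    accepting = frozenset((p, q) for (p, q) in states
                          if p in m1.accepting and q not in m2.accepting)
    return DFA(states, m1.alphabet, delta, (m1.start, m2.start), accepting)


def equivalent(m1: DFA, m2: DFA) -> bool:
    """L(m1) = L(m2) iff both difference languages are empty."""
    return is_empty(difference(m1, m2)) and is_empty(difference(m2, m1))


def count_a_mod(k: int, accepting: set) -> DFA:
    """Illustrative DFA over {a, b} tracking the number of a's modulo k."""
    states = frozenset(range(k))
    delta = {}
    for q in states:
        delta[(q, "a")] = (q + 1) % k
        delta[(q, "b")] = q
    return DFA(states, frozenset("ab"), delta, 0, frozenset(accepting))


even_2 = count_a_mod(2, {0})      # even number of a's, two states
even_4 = count_a_mod(4, {0, 2})   # the same language, four states
odd_2 = count_a_mod(2, {1})       # odd number of a's

print(equivalent(even_2, even_4))  # True
print(equivalent(even_2, odd_2))   # False
```

Enumerating all short strings is exponential in the number of states; a reachability search over the transition graph would decide emptiness more cheaply, but the brute-force version mirrors the argument in the text.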
5.4 Exercises
Exercise 5.4.1. Show that the first language mentioned on Slide 28 is not regular.
Exercise 5.4.2. Show that there is no DFA $M$ for which $L(M)$ is the language on Slide 33. [Hint: argue by contradiction. If there were such an $M$, consider the DFA $M'$ with the same states as $M$, with alphabet of input symbols just consisting of $a$ and $b$, with transitions all those of $M$ which are labelled by $a$ or $b$, with start state $\delta_M(s_M, c)$ (where $s_M$ is the start state of $M$), and with the same accepting states as $M$. Show that the language accepted by $M'$ has to be $\{a^n b^n \mid n \geq 0\}$ and deduce that no such $M$ can exist.]
Exercise 5.4.3. Check the claim made on Slide 33 that the language mentioned there satisfies the pumping lemma property of Slide 29 with $\ell = 1$.
Tripos questions 2001.2.7 1999.2.7 1998.2.7 1996.2.1(j) 1996.2.8 1995.2.27 1993.6.12 1991.4.6 1989.6.12
6 Grammars
We have seen that regular languages can be specified in terms of finite automata that accept
or reject strings, and equivalently, in terms of patterns, or regular expressions, which strings
are to match. This section briefly introduces an alternative, ‘generative’ way of specifying
languages.
6.1 Context-free grammars
Some production rules for ‘English’ sentences
SENTENCE → SUBJECT VERB OBJECT
SUBJECT → ARTICLE NOUNPHRASE
OBJECT → ARTICLE NOUNPHRASE
ARTICLE → a
ARTICLE → the
NOUNPHRASE → NOUN
NOUNPHRASE → ADJECTIVE NOUN
ADJECTIVE → big
ADJECTIVE → small
NOUN → cat
NOUN → dog
VERB → eats
Slide 35
Slide 35 gives an example of a context-free grammar for generating strings over the seven element alphabet
$\Sigma = \{$a, big, cat, dog, eats, small, the$\}$.
The elements of the alphabet are called terminals for reasons that will emerge below. The grammar uses finitely many extra symbols, called non-terminals, namely the eight symbols
ADJECTIVE, ARTICLE, NOUN, NOUNPHRASE, OBJECT, SENTENCE, SUBJECT, VERB.
One of these is designated as the start symbol. In this case it is SENTENCE (because we are interested in generating sentences). Finally, the context-free grammar contains a finite set of production rules, each of which consists of a pair, written $x \to u$, where $x$ is one of the non-terminals and $u$ is a string of terminals and non-terminals. In this case there are twelve productions, as shown on the slide.
The idea is that we begin with the start symbol SENTENCE and use the productions to continually replace non-terminal symbols by strings. At successive stages in this process we have a string which may contain both terminals and non-terminals. We choose one of the non-terminals in the string and a production which has that non-terminal as its left-hand side. Replacing the non-terminal by the right-hand side of the production we obtain the next string in the sequence, or derivation as it is called. The derivation stops when we obtain a string containing only terminals. The set of strings over $\Sigma$ that may be obtained in this way from the start symbol is by definition the language generated by the context-free grammar.
A derivation
_SENTENCE_
→ _SUBJECT_ VERB OBJECT
→ _ARTICLE_ NOUNPHRASE VERB OBJECT
→ the NOUNPHRASE _VERB_ OBJECT
→ the _NOUNPHRASE_ eats OBJECT
→ the _ADJECTIVE_ NOUN eats OBJECT
→ the big _NOUN_ eats OBJECT
→ the big cat eats _OBJECT_
→ the big cat eats _ARTICLE_ NOUNPHRASE
→ the big cat eats a _NOUNPHRASE_
→ the big cat eats a _ADJECTIVE_ NOUN
→ the big cat eats a small _NOUN_
→ the big cat eats a small dog
Slide 36
For example, the string
the big cat eats a small dog
is in this language, as witnessed by the derivation on Slide 36, in which we have indicated left-hand sides of production rules by underlining. On the other hand, the string
the dog a    (4)
is not in the language, because there is no derivation from SENTENCE to the string. (Why?)
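Because the grammar of Section 6.1 has no recursive productions, its language is finite and can be enumerated outright. The sketch below assumes the twelve productions are SENTENCE → SUBJECT VERB OBJECT, SUBJECT → ARTICLE NOUNPHRASE, OBJECT → ARTICLE NOUNPHRASE, ARTICLE → a | the, NOUNPHRASE → NOUN | ADJECTIVE NOUN, ADJECTIVE → big | small, NOUN → cat | dog and VERB → eats; the dictionary encoding and the function name `expand` are choices made here for illustration.

```python
from itertools import product

# Each non-terminal maps to the list of its right-hand sides.
GRAMMAR = {
    "SENTENCE":   [["SUBJECT", "VERB", "OBJECT"]],
    "SUBJECT":    [["ARTICLE", "NOUNPHRASE"]],
    "OBJECT":     [["ARTICLE", "NOUNPHRASE"]],
    "ARTICLE":    [["a"], ["the"]],
    "NOUNPHRASE": [["NOUN"], ["ADJECTIVE", "NOUN"]],
    "ADJECTIVE":  [["big"], ["small"]],
    "NOUN":       [["cat"], ["dog"]],
    "VERB":       [["eats"]],
}


def expand(symbol: str) -> set[tuple[str, ...]]:
    """All strings of terminals derivable from `symbol`.

    Terminates only because this particular grammar has no recursion.
    """
    if symbol not in GRAMMAR:          # a terminal derives only itself
        return {(symbol,)}
    results = set()
    for rhs in GRAMMAR[symbol]:
        parts = [expand(s) for s in rhs]
        for choice in product(*parts):
            results.add(tuple(word for piece in choice for word in piece))
    return results


LANGUAGE = {" ".join(words) for words in expand("SENTENCE")}

print(len(LANGUAGE))                               # 144
print("the big cat eats a small dog" in LANGUAGE)  # True
print("the dog a" in LANGUAGE)                     # False
```

The count 144 arises as 12 possible subjects times one verb times 12 possible objects. Exhaustive enumeration confirms that string (4) is absent, but it does not by itself explain why; that is still the point of the question above.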
Remark 6.1.1. The phrase ‘context-free’ refers to the fact that in a derivation we are allowed to replace an occurrence of a non-terminal by the right-hand side of a production without regard to the strings that occur on either side of the occurrence (its ‘context’). A more general form of grammar (a ‘type 0 grammar’ in the Chomsky hierarchy; see page 257 of Kozen’s book, for example) has productions of the form $u \to v$ where $u$ and $v$ are arbitrary strings of terminals and non-terminals. For example a production of the form
a ADJECTIVE cat → dog
would allow occurrences of ‘ADJECTIVE’ that occur between ‘a’ and ‘cat’ to be replaced by ‘dog’, deleting the surrounding symbols at the same time. This kind of production is not permitted in a context-free grammar.
Example of Backus-Naur Form (BNF)
Terminals: x ' + − ∗ ( )
Non-terminals: id op exp
Start symbol: exp
Productions:
id ::= x | id'
op ::= + | − | ∗
exp ::= id | exp op exp | (exp)
Slide 37
6.2 Backus-Naur Form
It is quite likely that the same non-terminal will appear on the left-hand side of several productions in a context-free grammar. Because of this, it is common to use a more compact notation for specifying productions, called Backus-Naur Form (BNF), in which all the productions for a given non-terminal are specified together, with the different right-hand sides being separated by the symbol ‘|’. BNF also tends to use the symbol ‘::=’ rather than ‘→’ in the notation for productions. An example of a context-free grammar in BNF is given on Slide 37. Written out in full, the context-free grammar on this slide has eight productions, namely:
id → x
id → id'
op → +
op → −
op → ∗
exp → id
exp → exp op exp
exp → (exp)
The language generated by this grammar is supposed to represent certain arithmetic expressions. For example
x + (x'')    (5)
is in the language, but
x + (x)''    (6)
is not. (See Exercise 6.4.2.)
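Membership in the language of the grammar on Slide 37 can be tested mechanically. The sketch below assumes the productions id ::= x | id', op ::= + | − | ∗ and exp ::= id | exp op exp | (exp), writes the operators in ASCII as `+`, `-`, `*`, and ignores spaces. Since the grammar as given is left-recursive, the recogniser works with a reformulation that generates the same strings: an expression is a non-empty sequence of atoms separated by operators, where an atom is an identifier or a parenthesised expression. That reformulation and the name `in_language` are choices made for this sketch.

```python
def in_language(s: str) -> bool:
    """Decide membership in the language of the Slide 37 grammar (spaces ignored)."""
    toks = [c for c in s if not c.isspace()]
    pos = 0

    def parse_atom() -> bool:
        # atom ::= x followed by any number of primes | ( exp )
        nonlocal pos
        if pos < len(toks) and toks[pos] == "x":
            pos += 1
            while pos < len(toks) and toks[pos] == "'":
                pos += 1
            return True
        if pos < len(toks) and toks[pos] == "(":
            pos += 1
            if not parse_exp():
                return False
            if pos < len(toks) and toks[pos] == ")":
                pos += 1
                return True
        return False

    def parse_exp() -> bool:
        # exp ::= atom (op atom)*
        nonlocal pos
        if not parse_atom():
            return False
        while pos < len(toks) and toks[pos] in "+-*":
            pos += 1
            if not parse_atom():
                return False
        return True

    return parse_exp() and pos == len(toks)


print(in_language("x + (x'')"))   # True:  string (5)
print(in_language("x + (x)''"))   # False: string (6)
```

A program answering "no" is not a proof that (6) has no derivation; Exercise 6.4.2 asks for an invariant that establishes this.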
A context-free grammar
for the language $\{a^n b^n \mid n \geq 0\}$
Terminals: a b
Non-terminal: I
Start symbol: I
Productions:
I ::= ε | a I b
Slide 38
6.3 Regular grammars
A language $L$ over an alphabet $\Sigma$ is context-free iff $L$ is the set of strings generated by some context-free grammar (with set of terminals $\Sigma$). The context-free grammar on Slide 38 generates the language $\{a^n b^n \mid n \geq 0\}$. We saw in Section 5.2 that this is not a regular language. So the class of context-free languages is not the same as the class of regular languages. Nevertheless, as Slide 39 points out, every regular language is context-free. For the grammar defined on that slide clearly has the property that derivations from the start symbol to a string in $\Sigma^*$ must be of the form of a finite number of productions of the first kind followed by a single production of the second kind, i.e.
$$s_M \to a_1 q_1 \to a_1 a_2 q_2 \to \cdots \to a_1 a_2 \ldots a_n q_n \to a_1 a_2 \ldots a_n$$
where in $M$ the following transition sequence holds
$$s_M \xrightarrow{a_1} q_1 \xrightarrow{a_2} \cdots \xrightarrow{a_n} q_n \in \mathit{Accept}_M.$$
Thus a string is in the language generated by the grammar iff it is accepted by $M$.
Every regular language is context-free
Given a DFA $M$, the set $L(M)$ of strings accepted by $M$ can be
generated by the following context-free grammar:
set of terminals = $\Sigma_M$
set of non-terminals = $\mathit{States}_M$
start symbol = start state of $M$
productions of two kinds:
$q \to a\,q'$ whenever $q \xrightarrow{a} q'$ in $M$
$q \to \varepsilon$ whenever $q \in \mathit{Accept}_M$
Slide 39
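The construction on Slide 39 is easy to mechanise. The sketch below assumes a DFA given by its states, alphabet, a total transition function as a dictionary, start state and accepting states; the tuple encoding of productions and the function names `dfa_to_grammar` and `derive` are hypothetical. The example automaton (strings over {a, b} ending in ab) is likewise an illustration, not one from the notes.

```python
def dfa_to_grammar(states, alphabet, delta, start, accepting):
    """Return (start_symbol, productions) for the grammar described on Slide 39.

    A production is a pair (q, rhs): rhs == (a, q2) encodes q -> a q2,
    and rhs == () encodes q -> epsilon.
    """
    productions = []
    for q in states:
        for a in alphabet:
            productions.append((q, (a, delta[(q, a)])))   # first kind
        if q in accepting:
            productions.append((q, ()))                    # second kind
    return start, productions


def derive(word, start, productions):
    """Return the derivation of `word` as a list of strings, or None if there is none."""
    prods = set(productions)
    q, done, steps = start, "", [start]
    for a in word:
        nxt = [rhs[1] for (lhs, rhs) in prods if lhs == q and rhs and rhs[0] == a]
        if not nxt:
            return None
        q = nxt[0]            # unique, because the automaton is deterministic
        done += a
        steps.append(done + " " + q)
    if (q, ()) not in prods:
        return None           # the final state is not accepting
    steps.append(done)
    return steps


# Illustrative DFA over {a, b} accepting the strings that end in "ab".
states = ["q0", "q1", "q2"]
alphabet = ["a", "b"]
delta = {("q0", "a"): "q1", ("q0", "b"): "q0",
         ("q1", "a"): "q1", ("q1", "b"): "q2",
         ("q2", "a"): "q1", ("q2", "b"): "q0"}
start, productions = dfa_to_grammar(states, alphabet, delta, "q0", {"q2"})

print(derive("aab", start, productions))
# ['q0', 'a q1', 'aa q1', 'aab q2', 'aab']
```

Each derivation step mirrors one transition of the automaton, and the final step uses the ε-production of an accepting state.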
Definition A context-free grammar is regular iff all its
productions are of the form
$x \to u\,y$ or $x \to u$
where $u$ is a string of terminals and $x$ and $y$ are non-terminals.
Theorem
(a) Every language generated by a regular grammar is a
regular language (i.e. is the set of strings accepted by some
DFA).
(b) Every regular language can be generated by a regular
grammar.
Slide 40
It is possible to single out context-free grammars of a special form, called regular (or
right linear), which do generate regular languages. The definition is on Slide 40. Indeed, as
the theorem on that slide states, this type of grammar generates all possible regular languages.
Proof of the Theorem on Slide 40. First note that part (b) of the theorem has already been proved, because the context-free grammar generating $L(M)$ on Slide 39 is a regular grammar (of a special kind).
To prove part (a), given a regular grammar we have to construct a DFA $M$ whose set of accepted strings coincides with the strings generated by the grammar. By the Subset Construction (Theorem on Slide 18), it is enough to construct an NFA$^\varepsilon$ with this property. This makes the task much easier. We take the states of $M$ to be the non-terminals, augmented by some extra states described below. Of course the alphabet of input symbols of $M$ should be the set of terminal symbols of the grammar. The start state is the start symbol. Finally, the transitions and the accepting states of $M$ are defined as follows.
(i) For each production of the form $q \to u\,q'$ with $\mathrm{length}(u) \geq 1$, say $u = a_1 a_2 \ldots a_n$ with $n \geq 1$, we add $n - 1$ fresh states $q_1, q_2, \ldots, q_{n-1}$ to the automaton and transitions
$$q \xrightarrow{a_1} q_1 \xrightarrow{a_2} q_2 \xrightarrow{a_3} \cdots\ q_{n-1} \xrightarrow{a_n} q'.$$
(ii) For each production of the form $q \to u\,q'$ with $\mathrm{length}(u) = 0$, i.e. with $u = \varepsilon$, we add an $\varepsilon$-transition
$$q \xrightarrow{\varepsilon} q'.$$
(iii) For each production of the form $q \to u$ with $\mathrm{length}(u) \geq 1$, say $u = a_1 a_2 \ldots a_n$ with $n \geq 1$, we add $n$ fresh states $q_1, q_2, \ldots, q_n$ to the automaton and transitions
$$q \xrightarrow{a_1} q_1 \xrightarrow{a_2} q_2 \xrightarrow{a_3} \cdots \xrightarrow{a_n} q_n.$$
Moreover we make the state $q_n$ accepting.
(iv) For each production of the form $q \to u$ with $\mathrm{length}(u) = 0$, i.e. with $u = \varepsilon$, we do not add in any new states or transitions, but we do make $q$ an accepting state.
If we have a transition sequence in $M$ of the form $s_M \stackrel{u}{\Rightarrow} q$ with $q \in \mathit{Accept}_M$, we can divide it up into pieces according to where non-terminals occur and then convert each piece into a use of one of the production rules, thereby forming a derivation of $u$ in the grammar. Reversing this process, every derivation of a string of terminals can be converted into a transition sequence in the automaton from the start state to an accepting state. Thus this NFA$^\varepsilon$ does indeed accept exactly the set of strings generated by the given regular grammar.
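The four cases of the construction translate directly into code. The sketch below assumes a regular grammar given as a list of triples `(q, u, q2)`, where `u` is a string of single-character terminals and `q2` is either a non-terminal (production $q \to u\,q'$) or `None` (production $q \to u$); this encoding, the function names, and the example grammar are all hypothetical and chosen only for illustration.

```python
from itertools import count


def grammar_to_nfa(start, productions):
    """Build an NFA with epsilon-transitions from a regular grammar.

    Returns (start, transitions, accepting). `transitions` is a set of
    (state, label, state) triples; the label "" stands for epsilon.
    """
    fresh = count()
    transitions, accepting = set(), set()
    for q, u, target in productions:
        if target is None and u == "":        # case (iv): q itself accepts
            accepting.add(q)
            continue
        if target is None:                    # case (iii): last fresh state accepts
            target = ("fresh", next(fresh))
            accepting.add(target)
        if u == "":                           # case (ii): epsilon-transition
            transitions.add((q, "", target))
            continue
        current = q                           # cases (i) and (iii): chain of states
        for a in u[:-1]:
            nxt = ("fresh", next(fresh))
            transitions.add((current, a, nxt))
            current = nxt
        transitions.add((current, u[-1], target))
    return start, transitions, accepting


def accepts(nfa, word) -> bool:
    """Simulate the automaton on `word`, tracking the set of reachable states."""
    start, transitions, accepting = nfa

    def closure(states):
        stack, seen = list(states), set(states)
        while stack:
            q = stack.pop()
            for (p, label, r) in transitions:
                if p == q and label == "" and r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen

    current = closure({start})
    for a in word:
        current = closure({r for (p, label, r) in transitions
                           if p in current and label == a})
    return bool(current & accepting)


# Hypothetical grammar: S -> ab S, S -> T, T -> c, T -> epsilon.
nfa = grammar_to_nfa("S", [("S", "ab", "S"), ("S", "", "T"),
                           ("T", "c", None), ("T", "", None)])
print(accepts(nfa, "ababc"))  # True
print(accepts(nfa, "ba"))     # False
```

The example grammar generates the strings consisting of zero or more copies of ab followed by an optional c, and the simulated automaton accepts exactly those among the strings tried.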
6.4 Exercises
Exercise 6.4.1. Why is the string (4) not in the language generated by the context-free
grammar in Section 6.1?
Exercise 6.4.2. Give a derivation showing that (5) is in the language generated by the context-free grammar on Slide 37. Prove that (6) is not in that language. [Hint: show that if $u$ is a string of terminals and non-terminals occurring in a derivation of this grammar and that ‘'’ occurs in $u$, then it does so in a substring of the form $v$', or $v$'', or $v$''', etc., where $v$ is either x or id.]
Exercise 6.4.3. Give a context-free grammar generating all the palindromes over the alphabet $\{a, b\}$ (cf. Slide 28).
Exercise 6.4.4. Give a context-free grammar generating all the regular expressions over the alphabet $\{a, b\}$.
Exercise 6.4.5. Using the construction given in the proof
W of part (a) of the Theorem on
Slide 40, convert the regular grammar with start symbol â and productions
W
âW æ
âW æ
â & æ
â
æ
DE W
F &â
Dâ E
into an NFA$^\varepsilon$ whose language is that generated by the grammar.
Exercise 6.4.6. Is the language generated by the context-free grammar on Slide 35 a regular
language? What about the one on Slide 37?
Tripos questions
1997.2.7
1996.2.1(k)
1994.4.3
1991.3.8
1990.6.10