CS2 Language Processing note 1
CS2Ah 1.10.2004

Languages and finite automata

This thread of the course is concerned with formal languages, that is, artificial languages defined by formal rules. Programming languages are, of course, prime examples of such languages. The thread is divided into two halves. The first set of 6 lectures (by Don Sannella) studies certain particularly simple formal languages: the so-called regular languages. Regular languages and the tools associated with them (finite automata, regular expressions, ...) have numerous applications in computer science. The second set of 5 lectures (by John Longley) will consider more complex languages (e.g. context-free languages) of the sort actually used for programming languages. They will cover the formal grammars used to specify such languages, and techniques used for parsing languages (as used by compilers).

The lecture notes for this thread were originally created by Stuart Anderson, and have been subsequently modified by Vivek Gore, Martin Grohe, Don Sannella, Alex Simpson, Ian Stark and Colin Stirling.

Finite automata

Rather than starting directly with languages, we begin instead by looking at finite state machines, as introduced in CS1Ah. That these have a connection with formal languages should already be familiar to you from CS1Ah, where you saw how a finite state machine can sometimes be used to recognise a language. In fact, we shall define the regular languages to be exactly those languages that can be recognised by finite state machines. However, saying this is jumping ahead to the next lecture. In this lecture, we review the notion of a finite state machine, and some of the different applications to which finite state machines can be put. We shall, however, use the term finite automaton (plural automata) instead of finite state machine. This may seem pretentious, but it is standard terminology, and fits in well with more specialised terminology we shall need later.
In CS1Ah you were introduced to finite automata as simple models of reactive systems. Such systems receive inputs from the outside world and react to these inputs in a specified way. In their basic form, finite automata are used to describe the states of a system and the transitions that are possible between these states.

Often, we want to use automata to describe systems that not only change their state depending on their input, but also produce an output. Finite automata that produce an output are called transducers. The Parking Ticket Machine of [CS1Ah, LN12] (that is, Lecture Note 12 of CS1Ah) and the Cruise Control of [CS1Ah, LN15] are examples.

We have also seen automata that accept certain sequences of inputs (and consequently reject all other sequences of inputs). Such automata are called acceptors. The automaton in Example 1.1 is an acceptor; more examples can be found in [CS1Ah, LN12] and [CS1Ah, LN13].

Example 1.1: A combination lock. We want to describe a lock that receives input from a numeric keypad (with keys 0, 1, ..., 9) and opens whenever the last 4 keys pressed have been 1102. An automaton modelling such a lock is displayed in Figure 1.

[Figure 1: Finite automaton for combination lock]

It has one accepting state ‘open’, which is reached whenever the last four keys pressed have been 1102. The automaton has 4 intermediate states, which are reached depending on how many of the correct keys have been pressed:

- If the last 3 keys pressed have been 110, then the automaton is in state ‘110’.
- If the last 2 keys pressed have been 11, then the automaton is in state ‘11’.
- If the last key pressed has been 1, and the last two keys have not been 11, then the automaton is in state ‘1’.
- Otherwise, it is in state ‘none’.
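The behaviour of the lock in Example 1.1 can be sketched in code. The following Python fragment is only an illustration and is not part of the original note: the function names `next_state` and `run_lock` are hypothetical, and the transitions are derived from the description of the states above, under the assumption that each state records the longest run of recently pressed keys that forms a beginning of 1102.

```python
# Hypothetical sketch of the combination lock automaton of Example 1.1.
# State names follow the text; the function names are illustrative assumptions.

def next_state(state: str, key: str) -> str:
    """Return the state reached from `state` after pressing `key` ('0'..'9')."""
    if key == "1":
        # After a 1, at least one correct key has been pressed;
        # two if the previous key was also a 1.
        return "11" if state in ("1", "11") else "1"
    if key == "0" and state == "11":
        return "110"
    if key == "2" and state == "110":
        return "open"
    # Any other key destroys the progress made so far.
    return "none"


def run_lock(keys: str) -> str:
    """Feed a sequence of key presses to the lock, starting in state 'none'."""
    state = "none"
    for key in keys:
        state = next_state(state, key)
    return state


print(run_lock("931102"))  # open
print(run_lock("11021"))   # 1  (the lock closes again after a further key)
```

The lock is open exactly when `run_lock` returns `"open"`, that is, when the last four keys pressed were 1102.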
Languages

In the following lectures, we will use finite automata as language acceptors. In order to do this (and in order to handle the other types of language considered later in the thread), it is necessary to have a general notion of language. Here are two examples.

- The Java programming language can be considered as the set of all strings over the alphabet of ASCII characters that represent a syntactically correct Java program.
- The English language may be considered as the set of all strings over the usual alphabet (perhaps with punctuation symbols included) that represent grammatically correct sentences in English.

These are both rather complex examples (and the latter is even poorly specified, as no two speakers of English are likely to agree entirely on its grammar). In the forthcoming lecture notes, we shall encounter many much simpler examples of languages.

To refer to languages concisely, it is useful to develop a convenient notation system. In general we shall use the Greek letter $\Sigma$ (capital “sigma”) to denote a finite alphabet of symbols from which the language will be built. Natural examples of alphabets are the set $\{a, b, \ldots, z\}$ of all letters of the English alphabet, the set of all ASCII characters, or just $\{0, 1\}$, the alphabet consisting just of the two letters ‘0’ and ‘1’, which is the alphabet of binary numbers or bit strings. In our examples, we shall often use small, more artificial alphabets such as $\{a, b\}$ or $\{a, b, c\}$.

For any finite alphabet $\Sigma$ we write $\Sigma^*$ for the set of all finite strings (sequences or words) of members of $\Sigma$. The set $\Sigma^*$ contains a special element, the empty string, for which we use the symbol $\varepsilon$, the Greek letter “epsilon”, rather than the notation "" often used in programming languages. For example, if $\Sigma = \{a, b\}$, then

$$\Sigma^* = \{\varepsilon, a, b, aa, ab, ba, bb, aaa, aab, aba, \ldots\}$$

For two strings $u$ and $v$ in $\Sigma^*$, $uv$ denotes their concatenation. For example, if $u = abab$ and $v = baa$, then $uv = ababbaa$. In particular, for every string $w \in \Sigma^*$ we have $\varepsilon w = w \varepsilon = w$.

A prefix of a string $w$ is an initial substring of $w$, that is, a string $u$ for which there exists a string $v$ such that $w = uv$. For example, if $w = abbc$, then the prefixes of $w$ are

$$\varepsilon, \; a, \; ab, \; abb, \; abbc$$

Similarly, a suffix of a string $w$ is a string $v$ for which there exists a string $u$ such that $w = uv$. For example, the suffixes of $abbc$ are

$$abbc, \; bbc, \; bc, \; c, \; \varepsilon$$

The length of a string $w$ is denoted $|w|$. For example, $|\varepsilon| = 0$, $|a| = 1$, $|abba| = 4$. Note that for all strings $u, v \in \Sigma^*$, we have $|uv| = |u| + |v|$.

For a string $w$ and a natural number $n$ we use $w^n$ to denote the string $ww \cdots w$ where $w$ is concatenated $n$ times with itself. Thus $w^0 = \varepsilon$, $w^1 = w$ and $w^{n+1} = w^n w$.

For any alphabet $\Sigma$, a language (over $\Sigma$) is a subset $L$ of $\Sigma^*$. For example, a language over $\Sigma = \{a, b\}$ is any set of strings of as and bs, such as the set

$$L_1 = \{\varepsilon, ab, abba\}$$

(consisting of the strings $\varepsilon$, $ab$, and $abba$). Unlike $L_1$, most interesting languages consist of infinitely many strings. For example, the set $L_2$ of all strings over the alphabet $\{a, b\}$ that start with an a is a language over the alphabet $\{a, b\}$ that consists of infinitely many strings. Formally, we may specify $L_2$ as follows:

$$L_2 = \{\, aw \mid w \in \{a, b\}^* \,\}$$

Be careful not to confuse the empty language $\emptyset$ with the language $\{\varepsilon\}$ that just consists of the empty string.

Note that we generally use the letters a, b, c to denote individual symbols and $u$, $v$, and $w$ to denote strings of symbols. A single symbol like a also doubles as a string of length 1.

Finite Automata as Language Acceptors

Suppose that $M$ is an acceptor automaton whose inputs are letters from an alphabet $\Sigma$. Then sequences of inputs for $M$ are strings from $\Sigma^*$, some of which are accepted by $M$ and some of which are rejected. We think of the automaton as reading the input string letter by letter and changing its states accordingly. After the whole string is read, it is accepted or rejected, depending on whether the automaton is in an accepting state or not.

Now suppose $L$ is a language over a finite alphabet $\Sigma$.
We say that $M$ recognises (or accepts) $L$ if $M$ accepts precisely the strings that are contained in $L$, that is, if the following equivalence holds for all strings $w \in \Sigma^*$:

$$w \in L \iff M \text{ accepts } w$$

Example 1.2: Binary strings with an even number of zeros. We consider the language $L$ over the alphabet $\{0, 1\}$ that consists of all strings with an even number of 0s. Figure 2 shows an automaton recognising this language. To see how this automaton works, we simply observe that after reading a prefix $u \in \{0, 1\}^*$ of the input string, the automaton is in state ‘even’ if $u$ contains an even number of 0s and in state ‘odd’ otherwise.

[Figure 2: An automaton recognising strings with an even number of 0s]

Example 1.1 revisited. The finite automaton for the combination lock in Example 1.1 recognises the language $L$ consisting of all words over $\{0, \ldots, 9\}$ that end with the four “letters” 1102. Formally, we may specify this language by

$$L = \{\, w1102 \mid w \in \{0, \ldots, 9\}^* \,\}$$

or

$$L = \{\, w \in \{0, \ldots, 9\}^* \mid 1102 \text{ is a suffix of } w \,\}$$

Deterministic finite automata

We now give a formal presentation of a special class of finite automata: deterministic finite automata. In the next lecture, we shall use these to define the notion of a regular language. Deterministic finite automata are characterised by the following two features:

- there is a unique starting state; and
- from every state there is exactly one transition for each possible input symbol.

Systems described by such automata are completely predictable. Given some input string, the sequence of states that is visited is completely determined. We now give a formal definition of deterministic finite automata.

Definition 1.3. A deterministic finite automaton (or DFA) is a tuple $M = (Q, \Sigma, q_0, F, \delta)$ consisting of:

1. a finite set $Q$ of states,
2. a finite alphabet $\Sigma$,
3. a distinguished starting state $q_0 \in Q$,
4. a set $F \subseteq Q$ of final states (the ones that indicate acceptance), and
5. a description $\delta$ of all the possible transitions. In order for the automaton to be deterministic, $\delta$ must be given by a table that answers the following question: “Given a state $q$ and an input symbol a, what is the next state?” There must be an answer to this no matter what $q$ and a are. (In mathematical terms $\delta$ is a function from the set $Q \times \Sigma$ to the set $Q$ and is usually referred to as the transition function.)

Example 1.2 revisited. The finite automaton of Example 1.2 is a DFA formally specified by

$$(\{\text{even}, \text{odd}\}, \{0, 1\}, \text{even}, \{\text{even}\}, \delta)$$

where the transition function $\delta$ is given by the following table:

$\delta$    0       1
even        odd     even
odd         even    odd

Example 1.1 re-revisited. The finite automaton of Example 1.1 is a DFA described by

$$(\{\text{none}, 1, 11, 110, \text{open}\}, \{0, \ldots, 9\}, \text{none}, \{\text{open}\}, \delta)$$

where the transition function $\delta$ is given by the following table:

$\delta$    0       1       2       3, ..., 9
none        none    1       none    none
1           none    11      none    none
11          110     11      none    none
110         none    1       open    none
open        none    1       none    none

References

A basic reference for the material of Lectures 1–6 of the Language Processing Thread is Chapter 10 of Foundations of Computer Science (C Edition) by A. V. Aho and J. D. Ullman, Computer Science Press, 1995. More advanced material can be found in Chapters 2 and 3 of Introduction to Automata Theory, Languages, and Computation (2nd Edition) by J. E. Hopcroft, R. Motwani, and J. D. Ullman, Addison-Wesley, 2001.

Exercises

1. Design a DFA that recognises the language $L$ consisting of all strings over $\{0, 1\}$ with an even number of 0s and an odd number of 1s.
2. Design a DFA that recognises the language $L$ consisting of all strings over $\{a, b, c\}$ starting with an a and ending either with an a or a c.
3. We want to modify the combination lock of Example 1.1 in such a way that several people can open it, each of them using his or her own number. Say, there are three people who want to be able to open the lock using the numbers 1102, 0102, 1110, respectively.
Design a DFA describing a combination lock that opens whenever the last four keys pressed are either 1102, 0102, or 1110. More formally, this means that you are supposed to design a DFA that recognises the language $L$ consisting of all strings over $\{0, \ldots, 9\}$ whose last four letters are either 1102, or 0102, or 1110.

Don Sannella
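When working on these exercises, it can help to test a candidate DFA mechanically. The following Python sketch is not part of the original note. It encodes a DFA in the sense of Definition 1.3, storing the transition function as a dictionary, and instantiates it with the automaton of Example 1.2. The class name `DFA`, its field names, and the methods `is_total` and `accepts` are hypothetical choices made for illustration only.

```python
from dataclasses import dataclass


@dataclass
class DFA:
    """A DFA in the sense of Definition 1.3 (field names are assumptions)."""
    states: frozenset    # the finite set of states
    alphabet: frozenset  # the finite alphabet
    start: str           # the starting state, a member of `states`
    finals: frozenset    # the final states, a subset of `states`
    delta: dict          # transition table: (state, symbol) -> state

    def is_total(self) -> bool:
        """Check that the table names a valid next state for every (state, symbol) pair."""
        return all(
            self.delta.get((q, a)) in self.states
            for q in self.states
            for a in self.alphabet
        )

    def accepts(self, word: str) -> bool:
        """Read `word` symbol by symbol; accept iff the state reached is final."""
        state = self.start
        for symbol in word:
            state = self.delta[(state, symbol)]
        return state in self.finals


# The automaton of Example 1.2: strings over {0, 1} with an even number of 0s.
even_zeros = DFA(
    states=frozenset({"even", "odd"}),
    alphabet=frozenset({"0", "1"}),
    start="even",
    finals=frozenset({"even"}),
    delta={
        ("even", "0"): "odd", ("even", "1"): "even",
        ("odd", "0"): "even", ("odd", "1"): "odd",
    },
)

print(even_zeros.is_total())       # True
print(even_zeros.accepts("1001"))  # True: two 0s
print(even_zeros.accepts("10"))    # False: one 0
```

In this sketch a symbol outside the alphabet raises a `KeyError` instead of being rejected, which matches the requirement that the transition table has an entry for every state and every input symbol of the alphabet, and nothing else.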