Formal Languages and Finite State Machines: Finite Automata
Section A:
Formal Language Theory
The modern theory of computing has evolved into three areas of study:
•
formal language theory and finite state machines
Turing Machines are divided into specific categories according to the formal
language it accepts.
•
decidability
We have already encountered several such problems. Some of our examples were
decidable; others were not. We expand this study and identify several additional
categories of problem. We then determine whether or not each category is
decidable.
•
performance
If a problem is decidable, then the next obvious question would be: How long
does it take to actually solve it? This final area of study focuses on the execution
time and the space requirements necessary to solve a given problem.
Formal language theory covers a broad range of topics. At times formal languages will appear to
be very similar to natural language and sentence structure. At times formal languages will appear
to be nothing more than structured gibberish! However, formal languages play a very important
role in computing, including pattern matching, lexical analysis, and programming language
syntax.
Building Block Components
Def: An alphabet Σ is a finite set of symbols.
Examples
(1)
(2)
(3)
(4)
Σ={0,1}
Σ={a,b,c}
Σ={a,b,c,…,z}
Σ = { boy , girl , a , the , wears , pants , skirt }
II-A-1
Formal Languages and Finite State Machines: Finite Automata
Def: A word w is a finite string of symbols taken from the underlying alphabet Σ.
Examples
(1)
(2)
(3)
(4)
0101 , 1000, 0 , 1 , e = empty string!
abc , cba , abba , cab , baaaa
dog , cat , zebra , blahblahblah
the boy wears pants , a girl wears a skirt
Comment:
Depending on the underlying alphabet, a word w might look like any of the following:
•
•
•
•
a string of bits, i.e., a binary number
a word in the dictionary
an English sentence
a JAVA program
Def: A language L is a collection of words built from the underlying alphabet Σ.
Examples
(1)
(2)
(3)
(4)
L = { w over {0,1} | #0s = #1s }
L = { w over {a,b,c} | #c’s = #a’s + #b’s }
L = { w over {a,b,…,z} | w is a palindrome }
L = { syntactically correct JAVA program }
Obviously formal languages run a wide range of possibilities from the very simple to the very
complex. As we continue our discussion of formal languages, we will fill in the details for
Chomsky’s Hierarchy of Languages:
II-A-2
Formal Languages and Finite State Machines: Finite Automata
level
language name
finite state machine
3
regular
finite automaton
2
context free
push down automaton
1
context sensitive
linearly bounded automaton
0
unrestricted
Turing machine
As we progress through the subsequent chapters and sections, we will give meaning and
substance to each of the terms above. At this point it is sufficient to understand that different
categories of Turing machine are associated with each of the different level of languages.
II-A-3
Formal Languages and Finite State Machines: Finite Automata
Section B:
Finite Automata
Def: A finite automaton M is a 5-tuple ( Q, S, d, q0, F ) where Q and S are both finite sets
and
(1) Q is a set of states
(2) S is a set of symbols (alphabet)
(3) d is a transition function
d ( q , s ) = snext
(4) q0 is a unique start state, q0 ∈ Q
(5) F ⊆ Q is a set of accept states
This scaled down Turing Machine satisfies the following conditions:
•
•
•
•
•
•
input only, no output is generated
processes input from left to right in single steps (no backtracking)
machine stops when no more input is available
if the final state is an element of F, then accept; if not, then reject
transition function d is “hard-wired” into the machine, same as a standard Turing
Machine
any undefined transition immediately halts and rejects
II-B-1
Formal Languages and Finite State Machines: Finite Automata
Examples
(1) a finite automaton which generates a finite number of words (one word to be specific!)
L = { 001 }
II-B-2
Formal Languages and Finite State Machines: Finite Automata
(2) a finite automaton which generates an infinite number of words
L = { e , 0 , 1 , 00 , 11 , 000 , 111 , 0000 , 1111 , 00000 , 11111 , … }
II-B-3
Formal Languages and Finite State Machines: Finite Automata
Def: We say that a finite automaton M accepts the word w provided:
•
•
w = w1 w2 … wk
wj ∈ S
there exists a sequence r0, r1, … , rk ∈ Q such that
• r0 = q 0
start state
• rj = d ( rj-1 , wj )
j = 1, 2, … k
• rk ∈ F
accept state
Def: The language recognized by a finite automaton M is the collection
L (M) = { words w | M accepts w }.
Examples
(1) easy example:
L (M) = { w over {a,b,c} | w begins with a }
II-B-4
Formal Languages and Finite State Machines: Finite Automata
(2) harder example:
L (M) = { 1 , 101 , 10101 , 1010101 , 101010101 , … }
At this point in our discussion we come to a very important new topic – the differentiation
between a deterministic machine and a nondeterministic machine.
The previous definition and examples should more precisely be referred to as deterministic finite
automata (DFA rather than FA) in that the transition function d uniquely determines the next
state. There are no alternative options and no potential choices for the machine to consider. The
concept of nondeterminism introduces the possibility of the machine following multiple branches
at any given configuration of state and symbol.
II-B-5
Formal Languages and Finite State Machines: Finite Automata
Example
our first nondeterministic finite automaton
(a) at state q0
d ( q0 , 1 ) = q0 or d ( q0 , 1 ) = q1
which state should I move to – q0 or q1?
(b) at state q1
d ( q1 , 0 ) = q2 and d ( q1 , e ) = q2
should I remove the input symbol 0? or should I remove nothing?
either way I would move to state q2
The special symbol e represents the empty string (no characters). A transition incorporating the
symbol e (called an e-transition) does not remove any symbol from the input line.
Nondeterministic machines are especially nice for hiding all the complexity in a given task. Let
the machine work out all the details generated by the various choices. We incorporate this new
concept into another definition.
II-B-6
Formal Languages and Finite State Machines: Finite Automata
Please note that the updated definitions include the special symbol e and that the transition
function no longer defines a single state but rather a set of possible states.
Def: A nondeterministic finite automaton M is a 5-tuple ( Q, Se, d, q0, F ) where Q and Se
are both finite sets and
(1) Q is a set of states
(2) Se is a set of symbols (alphabet)
augmented with the empty string e = “”
(3) d is a transition function
d ( q , s ) = set of possible next states
(4) q0 is a unique start state, q0 ∈ Q
(5) F ⊆ Q is a set of accept states
Def: We say that a nondeterministic finite automaton M accepts the word w provided:
•
•
w = w1 w2 … wk
w j ∈ Se
there exists a sequence r0, r1, … , rk ∈ Q such that
• r0 = q 0
start state
• rj ∈ d ( rj-1 , wj )
j = 1, 2, … k
• rk ∈ F
accept state
Def: The language recognized by a nondeterministic finite automaton M is the collection
L (M) = { words w | M accepts w }.
Please note that a deterministic finite automaton (DFA) is automatically a nondeterministic finite
automaton (NFA)! A transition function d that identifies a single state identifies a set of states
(not a large set, just one). Hence
{ DFAs } ⊆ { NFAs }
So what does nondeterminism look like in practice? How does a finite automaton keep track of
the various choices as they are encountered? We illustrate the process in the following example
using our first nondeterministic example.
II-B-7
Formal Languages and Finite State Machines: Finite Automata
Example
Consider the following input string w = 010110. Does M accept or reject this word?
As we encounter choices, we generate a tree structure; as we find undefined transitions, we prune
the tree structure.
After considering the last symbol in the input string, we look to the various states that have not
been pruned. If any one of them is an accept state, then the nondeterministic finite automaton
accepts the string!
II-B-8
Formal Languages and Finite State Machines: Finite Automata
Theorem:
Every nondeterministic finite automaton M has an equivalent deterministic finite automaton M’,
i.e., both machines recognize exactly the same language,
L (M) = L (M’).
Proof:
For any finite set A, the power set ℘(A) = { S | S ⊆ A }. If #A = k, then #℘(A) = 2k.
Step One: The original start state becomes the start state for the resulting deterministic finite
automaton.
Step Two: For every element in ℘(Q) other than the empty set, determine its images via the
transition function d, excluding any e-transitions. This image will itself be a set of states.
Step Three: For every e-transition, augment the image sets generated in Step Two. Repeat Step
Three as needed.
Step Four: Delete any element in ℘(Q) having no incoming transition. Repeat Step Four as
needed.
Note: The start state is considered as having an incoming transition!
Step Five: Any remaining element in ℘(Q) containing an original accept state becomes an
accept state for the resulting deterministic finite automaton.
QED
II-B-9
Formal Languages and Finite State Machines: Finite Automata
Example
Our original nondeterministic finite automaton again!
II-B-10
Formal Languages and Finite State Machines: Finite Automata
The above scratch work yields the following deterministic finite automaton:
Using JFLAP it is possible to do the conversion process with just a few clicks of the mouse
button (see next page!).
Question: Now that we have two flavors of finite automata, which one will we work with?
Answer: BOTH! The problem at hand will often determine which of the two is easier to work
with.
•
•
DFAs are easier to understand and implement due to their simplicity
NFAs are easier to manipulate and combine due to their flexibility
II-B-11
Formal Languages and Finite State Machines: Finite Automata
II-B-12
Formal Languages and Finite State Machines: Finite Automata
Section C:
Regular Languages and Regular Expressions
Def: A regular language is a language L which is recognized by a finite automaton M, i.e.,
L = L (M).
Note: M may be either deterministic or nondeterministic!
six fundamental operations with languages
•
•
•
union
intersection
complement
where
L1 ∪ L2
L1 ∩ L2
LC
•
•
•
difference
concatenation
Kleene closure
L1 ° L2
= { w = w1 w2 | w1 ∈ L1 and w2 ∈ L2 }
L*
= { w = w1 w2 … wk | wj ∈ L and k ≥ 0 }
and
Theorem:
Regular Languages are closed under the six fundamental operations.
Proof:
(union and intersection – we use the basic definitions)
L1 is recognized by M1 = ( Q’ , S’ , d’ , q0’ , F’ )
L2 is recognized by M2 = ( Q” , S” , d” , q0” , F” )
Define M = ( Q , S , d , q0 , F ) as follows:
•
•
•
•
•
Q = Q’ x Q” = { (q’,q”) | q’ ∈ Q’ and q” ∈ Q” )
S = S’ x S”
q0 = ( q0’ , q0” )
d (( q’ , q” ) , s ) = ( d’ ( q’ , s ) , d” ( q” , s ) )
F will be defined shortly!
Note: We are running both machines simultaneously on the input stream!
II-C-1
L1 ∼ L2
L1 ° L2
L*
Formal Languages and Finite State Machines: Finite Automata
For union,
F = F’ x Q” ∪ Q’ x F”
i.e., either machine may accept!
For intersection
F = F’ x F”
i.e., both machines must accept!
QED
Proof:
(complement and set difference – we use some tricks)
For complement,
L is recognized by M = ( Q , S , d , q0 , F )
LC is recognized by M’ = ( Q , S , d , q0 , FC ), where FC = Q ∼ F
For set difference,
L1 ∼ L2 = L1 ∩ L2C
complement and intersection are both closed
QED
Proof:
(concatenation and Kleene closure – we use NFAs)
QED
II-C-2
Formal Languages and Finite State Machines: Finite Automata
Regular languages have been defined in terms of the finite automata that generate / recognize
them. These machines can sometimes be quite cumbersome and confusing. An alternate
approach to describing a language would be with a pattern or a template describing its words.
This is the idea behind regular expressions.
Def: A regular expression R describing the words in a language L is defined inductively:
•
•
•
∅ is a regular expression
ε is a regular expression
s ∈ S is a regular expression
the empty language
the empty string
any single character
•
if R1 and R2 are regular expressions
then so is R1 ∪ R2
if R1 and R2 are regular expressions
then so is R1 ° R2
if R is a regular expression
then so is R*
union
•
•
concatenation
Kleene closure
Notation
The notation may take several forms
•
•
•
( a ∪ b ∪ c ) ° b* ° d ° e
( a + b + c ) ° b* ° d ° e
{ a , b , c } b* d e
traditional
arithmetic
implied concatenation
Remember:
•
•
∅
ε
•
•
•
a∪∅=a
a°∅=∅
∅* = ε
= empty language
= empty string
=> no words
=> no characters, but is a word
so
•
•
•
II-C-3
a∪ε={a,ε}
a°ε=a
ε* = ε
Formal Languages and Finite State Machines: Finite Automata
Def: If R is a regular expression, then the language L generated by R is
L = L (R) = { words w | w matches the pattern defined by R }
Examples
(1) R = c ° a ° t
(2) R = ( a ∪ b )*
L ( R ) = { cat }
L ( R ) = { w over { a , b } | with | w | ≥ 0 }
Lemma 1:
If R is a regular expression, the L (R) is a regular language.
Proof:
(inductive proof modeled on the definition of a regular expression)
QED
II-C-4
Formal Languages and Finite State Machines: Finite Automata
Lemma 2:
If L is a regular language, then there exists a regular expression R such that
L = L ( R ).
Proof:
The proof of this lemma requires one more type of finite automaton – a generalized
nondeterministic finite automaton (GNFA).
Def: A generalized nondeterministic finite automaton (GNFA) M is a 5-tuple
( Q , S , d , qs , qa ) where Q and S are both finite sets and
(1) Q is a set of states
(2) S is a set of symbols (alphabet)
(3) d is a transition function
d : ( Q – { qa } ) X ( Q - { qs } ) → ℜ (regular expressions)
transitions are represented by regular expressions
rather than individual symbols in Sε
(4) qs is a unique start state, qs ∈ Q, with no incoming arcs
(5) qa is a unique accept state, qa ∈ Q, with no outgoing arcs
Let M represent the deterministic finite automaton such that L = L (M).
•
•
•
•
•
augment M, if necessary, with a unique start state qs having no incoming arcs
augment M, if necessary, with a unique accept state qa having no outgoing arcs
a single transition from state q i to qj is represented by the regular expression of a single
symbol
multiple transitions from state q i to qj is represented by the union of single symbol regular
expressions
if qi and qj are two states with no transition symbol, then represent a transition from state
qi to qj by the regular expression ∅
II-C-5
Formal Languages and Finite State Machines: Finite Automata
We have now converted a DFA M with k states into a GNFA M’ with k+2 states. Furthermore,
transitions in M’ are all represented by a single regular express, albeit many having the value ∅.
Our goal now is the reduce our GNFA having k+2 states to an equivalent GNFA having exactly 2
states – qs and qa. The regular expression representing the transition from qs to qa will be the
desired generator for the language.
We do this reduction by eliminating states one at a time using a surprisingly simple process.
QED
Before we present an example of this process, we summarize all our results regarding regular
languages in the following theorem.
Theorem:
(1)
(2)
(3)
(4)
(5)
The following five statements are all equivalent:
L is a regular language.
There exists a DFA M such that L = L (M).
There exists a NFA M such that L = L (M).
There exists a GNFA M such that L = L (M).
The exists a regular expression R such that L = L (R).
II-C-6
Formal Languages and Finite State Machines: Finite Automata
Examples
(1) Example 1:
II-C-7
Formal Languages and Finite State Machines: Finite Automata
(2) Example 2:
II-C-8
Formal Languages and Finite State Machines: Finite Automata
Section D:
The Pumping Lemma
In this section we present the pumping lemma which describes a very important property which
belongs to all regular languages. It is interesting that the primary use of the property is not in
recognizing words within a regular language but rather in proving a given language can not be
regular!
Theorem:
The Pumping Lemma
If L is a regular language, then there exists an integer p (called the pumping length) such that if
w is any string in L with | w | ≥ p then w may be divided into three substrings w = x y z with
(1) | y | > 0
(2) | x y | ≤ p
(3) x yk z ∈ L
i.e., y is not e
for k = 0, 1, 2, …
Proof:
Let p = #Q = number of states.
Let w = s1 s2 … sk with | w | = k ≥ p #Q.
q0 -s1→ q1 -s2→ q2 - … → qk
Since k ≥ p = #Q the “pigeon hole principle” tells us that there must be at least one pair of
duplicate states in the sequence! Find the first such pair: 0 ≤ i < j ≤ p.
Set x = s1 s2 … si, y = si+1 si+2 … sj, and z = sj+1 sj+2 … sk
Observe
(1) | y | = j – i > 0
(2) | x y | = j ≤ p
(3) q0 -x→ qi -y→ qj -z→ qk
=>
x yk z ∈ L
with qi = qj
for k = 0, 1, 2, …
QED
II-D-1
Formal Languages and Finite State Machines: Finite Automata
Examples
Remember that the Pumping Lemma is typically used not to identify strings that belong to
regular languages but rather to identify languages which can not be regular!
(1) L = { w = 0n 1n | n ≥ 0 } = { e, 01, 0011, 000111, 00001111, … }
Assume that L is a regular language and that p is its pumping length.
Consider the word w = 0p 1p ∈ L. According to the Pumping Lemma, w = x y z.
| y | > 0 and | x y | ≤ p implies that y = 0 … 0 = 0m, m ≥ 1. So x y2 z = 0m+p 1p ∈ L. However,
the two exponents are different and x y2 z can not be a word in L. A contradiction.
Therefore L is not a regular language.
(2) L = { w e {0,1}* | #0s = #1s } = { e, 01,10, 0011, 0101, 0110, 1100, 1010, 1001, … }
Assume that L is a regular language and that p is its pumping length.
Once again consider the word w = 0p 1p ∈ L. According to the Pumping Lemma, w = x y z.
| y | > 0 and | x y | ≤ p implies that y = 0 … 0 = 0m, m ≥ 1. So x y2 z = 0m+p 1p ∈ L. However,
the #0s and the #1s are different and x y2 z can not be a word in L. A contradiction.
Therefore L is not a regular language.
(3) L = { w = 0i 1j | i > j ≥ 0 } = { 0, 00, 001, 000, 0001, 00011, … }
Assume that L is a regular language and that p is its pumping length.
Consider the word w = 0p+1 1p ∈ L. According to the Pumping Lemma, w = x y z.
| y | > 0 and | x y | ≤ p implies that y = 0 … 0 = 0m, m ≥ 1. So x z = x y0 z = 0p+1-m 1p ∈ L.
However, the #1s is at least as large as the #0s and x y0 z can not be a word in L. A
contradiction.
Therefore L is not a regular language.
II-D-2
Formal Languages and Finite State Machines: Finite Automata
Section E:
Applications of Regular Expressions
(1) *nix tools: sed, grep, and awk
sed, grep, and awk are true *nix tools, known for their awkward names and equally awkward
syntax. “sed” is an abbreviation for “string editor”; “grep” comes from the “ed” command
“g/re/p” (globally search a regular expression and print); and “awk” has its name derived from its
authors (Aho, Weinberger, and Kernighan).
They represent the most immediate access to working with regular expressions. Even their most
recent competitor, Perl, is known for producing very powerful but incomprehensible code.
Though I acknowledge their awkward natures, their usefulness cannot be ignored, and learning
how to use each will aid you in your ascension to line processing supremacy. Each is best used in
the following manner:
•
•
•
grep
sed
awk
pattern matching
replacement and line manipulation
advanced line processing
II-E-1
Formal Languages and Finite State Machines: Finite Automata
(2) transition diagrams for various internet protocols
II-E-2
Formal Languages and Finite State Machines: Finite Automata
(3) lexical analysis
Identifying the basic components of a programming language within an input stream of ASCII
characters is an important first step in building a compiler for a programming language. This is
the role of a lexical analyzer.
program fragment
if ( a > 0 )
then b = c + a
else b = 7 ;
becomes
token stream
IF
LPAREN
ID
GT
INTEGER
RPAREN
THEN
ID
ASSIGN
ID
ADDOP
ID
ELSE
ID
ASSIGN
INTEGER
SEMICOLON
if
(
a
>
0
)
then
b
=
c
+
a
else
b
=
7
;
a token is a pair
- syntactic element
- value
the value of a token is called a lexeme
II-E-3
Formal Languages and Finite State Machines: Finite Automata
Token definitions are typically given by regular expressions!
INTEGER
IDENTIFIER
ADDOP
IF
THEN
=
=
=
=
=
( + , - , e } { 1 , 2 , 3 , … , 9 } { 0 , 1 , 2 , 3 , … , 9 }*
{ A , … , Z , a , … z } {A , … , Z , a , … , z , 0 , … , 9 }*
{+,-}
if
then
II-E-4
© Copyright 2026 Paperzz