INTERACTIVE PARTNER AGENTS A PRACTICAL INTRODUCTION

INTERACTIVE PARTNER AGENTS
A PRACTICAL INTRODUCTION
MATTIAS WAHDE
Department of Applied Mechanics
CHALMERS UNIVERSITY OF TECHNOLOGY
Göteborg, Sweden 2017
Interactive partner agents
A practical introduction
MATTIAS WAHDE
© MATTIAS WAHDE, 2017.
All rights reserved. No part of these lecture notes may be reproduced or transmitted in any form or by any means, electronic or mechanical, without permission in writing from the author.
Department of Applied Mechanics
Chalmers University of Technology
SE–412 96 Göteborg
Sweden
Telephone: +46 (0)31–772 1000
Contents

1 Introduction

2 Agent structure
2.1 Agent components
2.2 Distributed programming
2.3 The Communication library
2.3.1 The Server class
2.3.2 The Client class
2.3.3 The DataPacket class
2.3.4 A simple example

3 Decision-making, memory, and dialogue
3.1 A simple example
3.2 The AgentLibrary
3.2.1 The Agent class
3.2.2 The Memory class
3.2.3 The DialogueProcess class
3.3 Demonstration application
3.3.1 TestAgent1
3.3.2 TestAgent2 and TestAgent3
3.3.3 TestAgent4

4 Computer vision
4.1 Digital images
4.1.1 Color spaces
4.1.2 Color histograms
4.2 The ImageProcessing library
4.2.1 The ImageProcessor class
4.2.2 The Camera class
4.3 Basic image processing
4.3.1 Contrast and brightness
4.3.2 Grayscale conversion
4.3.3 Binarization
4.3.4 Image convolution
4.3.5 Obtaining histograms
4.3.6 Histogram manipulation
4.3.7 Edge detection
4.3.8 Integral image
4.3.9 Connected component extraction
4.3.10 Morphological image processing
4.4 Advanced image processing
4.4.1 Adaptive thresholding
4.4.2 Motion detection
4.4.3 Face detection and recognition
4.5 Demonstration applications
4.5.1 The ImageProcessing application
4.5.2 The VideoProcessing application

5 Visualization and animation
5.1 Three-dimensional rendering
5.1.1 Triangles and normal vectors
5.1.2 Rendering objects
5.2 The ThreeDimensionalVisualization library
5.2.1 The Viewer3D class
5.2.2 The Object3D class
5.3 Faces
5.3.1 Visualization
5.3.2 Animation
5.4 Demonstration applications
5.4.1 The Sphere3D application
5.4.2 The FaceEditor application

6 Speech synthesis
6.1 Computer-generated sound
6.1.1 The WAV sound format
6.1.2 The AudioLibrary
6.2 Basic sound processing
6.2.1 Low-pass filtering
6.2.2 High-pass filtering
6.3 Formant synthesis
6.3.1 Generating voiced sounds
6.3.2 Generating unvoiced sounds
6.3.3 Amplitude and voicedness
6.3.4 Generating sound transitions
6.3.5 Sound properties
6.3.6 Emphasis and emotion
6.4 The SpeechSynthesis library
6.5 The VoiceGenerator application

7 Speech recognition
7.1 Isolated word recognition
7.1.1 Preprocessing
7.1.2 Feature extraction
7.1.3 Time scaling and feature sampling
7.1.4 Training a speech recognizer
7.1.5 Word recognition
7.2 Recording sounds
7.3 The SpeechRecognitionLibrary
7.4 Demonstration applications
7.4.1 The IWR application
7.4.2 The Listener application

8 Internet data acquisition
8.1 The InternetDataAcquisition library
8.1.1 Downloading data
8.2 Parsing data
8.2.1 The HTMLParser class
8.2.2 RSS feeds
8.3 The RSSReader application

A Programming in C#
A.1 Using the C# IDE
A.2 Classes
A.3 Generic lists
A.4 Threading
A.5 Concurrent reading and writing
A.6 Event handlers
A.7 Serialization and de-serialization

Bibliography
Chapter 1
Introduction
Intelligent (software) agents are computer programs able to process incoming
information, make a suitable decision based on all the available information,
and take the appropriate action, often, but not always, in interaction with a
user or even another agent. Such programs are becoming more and more common. Examples include automatic reservation systems (for example for travel reservations), the personal assistants available on mobile phones (and in some operating systems), decision-support systems, for instance in medicine or finance, and driver support systems in vehicles.
An important special case, which is at the heart of this course, is that of interactive partner agents (IPAs), which are specifically designed to interact with
human users in a friendly and human-like manner. In addition to all the applications listed above, a typical application example (though by no means
the only one) specifically for IPAs is in health and elderly care, where such an
agent might assist in gathering information and replying to questions on some
particular topic.
IPAs must thus not only be capable of outputting factual information, but must also be able to do so in a way that is at least reminiscent of human interaction. Their mode of interaction therefore normally includes speech, gestures, facial expressions, and so on. An IPA generally runs on a computer with a web camera, a microphone, and a loudspeaker, and also features a three-dimensional animated
face displayed on the screen. A schematic view of an IPA is shown in Fig. 1.1.
Building an IPA thus requires knowledge of many different domains, including human-computer interaction (for example dialogue), speech recognition,
speech synthesis, image processing (for example for gesture detection), three-dimensional visualization and animation, and information gathering (from the
internet etc.). In addition, one must also be able to put together the various
parts, and make them work in cooperation with each other in order to generate a complete IPA.
Figure 1.1: A schematic illustration of a typical IPA setup, with several modalities for interaction: A camera for vision, a microphone for speech input, loudspeakers for speech output, and an animated three-dimensional cartoon-like face for displaying emotions.

The aim of this compendium is to give students of IPAs a general and practical introduction to the various topics listed above. One could approach this
task in different ways, one possibility being to obtain a set of third-party, black-box solutions and simply to put them together. Here, however, the aim is to
give the reader a thorough understanding of the basics of each relevant component in an IPA. Thus the reader will have access to the full source code of
a set of software libraries (henceforth referred to as the IPA libraries) as well
as a set of demonstration applications, and the text will go through each agent
component in detail.
The IPA libraries are:

(1) the AgentLibrary, described in Chapter 3;
(2) the AudioLibrary, used in Chapters 6 and 7;
(3) the CommunicationLibrary, described in Chapter 2;
(4) the ImageProcessingLibrary, discussed in Chapter 4;
(5) the InternetDataAcquisitionLibrary, considered in Chapter 8;
(6) the MathematicsLibrary, an auxiliary library used by several other libraries;
(7) the ObjectSerializerLibrary, used for serializing (saving) and de-serializing (loading) objects; see also Appendix A, Sect. A.7;
(8) the PlotLibrary, used, for example, in one of the applications regarding speech recognition;
(9) the SpeechRecognitionLibrary, described in Chapter 7;
(10) the SpeechSynthesisLibrary, used in Chapter 6; and
(11) the ThreeDimensionalVisualizationLibrary, described in Chapter 5.
As for the programming language, C# .NET, included in the Visual Studio integrated development environment (IDE) by Microsoft, has been
chosen, and the code is thus intended primarily for computers running Windows. Of course, many other programming languages could have been selected, but C# .NET offers some compelling advantages (at least in the author’s
view), one of them being the elegance, robustness, and high speed of execution of code written in C# .NET. Moreover, by using the .NET framework, one
also opens up the possibility of writing applications in other .NET languages
(e.g. C++ or Visual Basic) while still being able to use the IPA libraries. Also,
with the integration of Xamarin in the 2015 version of Visual Studio, it is possible to deploy code written in C# .NET on mobile devices, both under Android
and iOS. Using the Mono framework it is also possible to run applications
developed in C# .NET under Linux.
The compendium has been developed for a seven-week university course
at Chalmers University of Technology. It has been assumed that the reader
has an engineering background, covering at least engineering mathematics
as well as programming in some high-level language (though not necessarily C# .NET). Prior familiarity with .NET is recommended, but not absolutely
required. However, a reader unfamiliar with .NET will need to study this topic
alongside the other topics in the compendium. Appendix A provides a brief
introduction to C# .NET, but it is not a complete description.
Needless to say, there is a limit on how much one can do in seven weeks.
Thus, some tradeoffs have been necessary, especially since each of the topics
in Chapters 2 to 8 could easily fill a university course. Hopefully, a suitable
balance between depth and breadth has been found, but it should be noted that
the aim, again, is to give a deep understanding of the basics of each topic rather
than trying to build a state-of-the-art IPA. It is the author’s hope and belief
that, with the knowledge obtained from reading the compendium, a reader
will have a solid foundation for further studies in any of the topics considered
here.
Chapter 2
Agent structure
This chapter gives a general overview of the logical structure of the interactive
partner agents used here. Already at this point, some familiarity with C# .NET
(and its IDE) is assumed. Thus, readers unfamiliar with this programming
language should start by reading Appendix A.
Fig. 2.1 shows the structure of the interactive partner agents. As can be seen
in the figure, the agent consists of a main program (an executable file, with the
suffix .exe under Windows), along with a set of additional components
(which are also executable files), each handling some important aspect of the
agent and communicating with the main program.
Of course, an IPA could have been written as a single standalone executable.
However, the distributed structure shown in the figure has multiple advantages. First of all, if the agent were to be written as a single program, it is
not unlikely that there would be some code entanglement between the various parts, making it more difficult to replace or upgrade some part (e.g. speech
recognition or vision). With the distributed structure, such problems are avoided.
Moreover, it is possible that some components of the agent would require
substantial computational power. Written as a single program, an agent would
have to divide the computational power (of the computer on which it runs)
between the various components. By contrast, in the distributed structure, as
will be shown below, it is possible to run the agent’s constituent programs on
different computers, connected to one another over a wireless network (for example) so that the agent, as a whole, can summon the computational power of
several computers. In most cases, however, and certainly in the cases considered here, the computational power of a single computer will be sufficient.
Another advantage is that, with the distributed structure in Fig. 2.1, it is
possible for the main program to monitor the other programs and to restart a
particular program (e.g. the one handling speech recognition) if it should stop
running for some reason. Alternatively, a simple monitoring program can be
set to run in the background on each computer running any component of the
Figure 2.1: The logical structure of the interactive partner agents used here. The main
program acts as a server (see Sect. 2.2) whereas all other programs are clients. The various
components are introduced in Sect. 2.1.
agent, making sure that any crashed component restarts automatically. In either case, the probability of crashing the entire agent will be much smaller than
if it were written as a single executable. Finally, for the purpose of a university
course, the distributed structure is excellent, as it allows different developers
(e.g. students) to work completely independently on various parts of the agent,
once an agreement has been made regarding the type (and perhaps amount)
of information that is to be transmitted between the various components.
2.1 Agent components
As can be seen in Fig. 2.1, the structure consists of six components:

(i) A main program, also referred to as the agent program, responsible for coordinating, (selectively) storing, and processing the information obtained from, or sent to, the other components. This program maintains a working memory that stores important recent information, and it is responsible for decision-making and dialogue with the user(s); its structure also allows the developer to store a set of (artificial) brain processes, of which dialogues constitute an important special case. Optionally, this program can also maintain a long-term memory, the contents of which are loaded into the working memory when it is started;

(ii) a vision program that receives, processes, and interprets the continuous flow of visual information from (web) cameras. The information transferred to the main program is in the form of text strings; for example, if the vision program suddenly detects the face of a known person, the information might consist of the person's name along with some information regarding the person's current facial expression;

(iii) a listener program that continuously listens to external input, either in the form of typed text or in the form of sounds recorded by a connected microphone (if any). The listener then processes the information, applying speech recognition in the case of sounds, and generates textual information that is sent to the main program;

(iv) an internet data acquisition program that, for example, reads and processes information from news feeds before transferring it to the main program;

(v) a speech program that receives text strings from the main program and then produces the corresponding sounds (with textual output as an option as well); and, finally,

(vi) a visualizer program that handles visualization and animation of the agent's face, based on textual information (e.g. smile or blink) obtained from the main program.

Note that the long-term memory of the agent is, in fact, distributed as well: The listener program (for example) requires information for speech recognition, which must thus be loaded upon startup. Similarly, the speech program must load the appropriate parameter settings for representing the agent's voice, and so on.
In the following chapters, the components just listed will be described in
detail. First, however, the topic of distributed programming will be discussed.
2.2 Distributed programming
The distributed IPA structure follows the client-server model, in which there is
a central component (the server) that handles the information flow to and from
the other components (the clients). An alternative approach would be to use a
peer-to-peer model, in which case there would be no central server. While it is
certainly possible to implement an IPA using the peer-to-peer model, here we
shall only use the client-server model.
An obvious point to consider is the fact that the server cannot control the
flow of information from its various clients. For example, in the case of the IPA,
the vision program and the speech recognition program may both provide input to the agent program at any time, completely independently of each other.
Thus, the server must be able to reliably handle asynchronous communication.
The client-server structure has been implemented as a C# class library,
the CommunicationLibrary. This class library, in turn, makes use of the
System.Net.Sockets namespace, one of the standard namespaces available in C# .NET, which contains classes for handling the low-level aspects of
communication between computers. Those low-level aspects involve, for example, the rules by which computers connect to each other, the format of the
data sent or received, as well as error handling. Those aspects will not be considered in detail here. Suffice it to say that the communication will be handled
using the common TCP/IP protocol. Next, a brief description of the communication library will be given.
Listing 2.1: The constructor and the Connect() method of the Server class.
public Server()
{
    clientStateList = new List<ClientState>();
    clientIndex = 0;
    serverSocket = new Socket(AddressFamily.InterNetwork, SocketType.Stream,
        ProtocolType.Tcp);
    backLog = DEFAULT_BACKLOG;
    bufferSize = DEFAULT_BUFFER_SIZE;
}

...

public void Connect(string ipAddressString, int serverPort)
{
    Boolean ok = Bind(ipAddressString, serverPort);
    if (ok)
    {
        if (serverSocket.IsBound)
        {
            connected = true;
            OnProgress(CommunicationAction.Connect, name + " connected");
            Listen();
        }
        else { connected = false; }
    }
    else { connected = false; }
}
2.3 The Communication library
The most important classes in this library are the Server and Client classes.
These classes, in turn, make use of several other classes. A typical sequence of operations would be as follows: First, the server is instantiated and then it establishes a connection to a given IP address and a given port. In case the server
and clients all run on the same computer, the IP address will be taken as the
loopback address, namely 127.0.0.1. The port can, in principle, be any integer in the range 0 to 65535. However, if the client and server run on different
computers (so that the IP address would be different from 127.0.0.1), one
should keep in mind that some ports are used by other programs, so that the
port number must be selected with some care. Next, the server begins listening for clients. Once a client has been started, it can attempt to connect to the
server, provided that it knows the IP address and the port used by the server.
When the connection is established, the server adds the client to its list of available clients. In the communication library, each client is assigned a unique ID.
The client and server are then able to exchange data. Note that the server also
listens for new clients continuously, making it possible to connect additional
clients at any time.
Listing 2.2: The fields defined in the ClientState class.
public class ClientState
{
    private string clientName;     // The client name communicated by the client.
    private string clientID;       // The (unique) client ID assigned by the server.
    private Boolean connected;
    private byte[] receiveBuffer;
    private byte[] sendBuffer;
    private Socket clientSocket;
    ...
}
2.3.1 The Server class
The constructor of the Server class instantiates a list of objects containing information about the clients (see below) and also establishes the server socket
that (simplifying somewhat) acts as an end point for the communication with
the clients, much as an electrical socket acts as an end point for the electric grid. The constructor is shown in the upper half of Listing 2.1. The server
maintains a list of client information, such that each client is defined using an
instance of the ClientState class. The client state, in turn, maintains the
name and ID for the clients, a Boolean variable determining whether or not
the connection is valid, buffers for sending and receiving data, as well as the
actual client socket. The fields defined in the ClientState class are shown
in Listing 2.2. Once the server has been instantiated, the Connect() method,
shown in the lower half of Listing 2.1, is generally called first. As can be seen
in the listing, if a connection is established, the server triggers an event (by
calling OnProgress), which can then be handled by the program making use
of the server, for example in order to display the information regarding the
established connection.
In fact, the server defines four different events: (i) Progress, which monitors the (amount of) information sent to, or received from, the clients; (ii)
Error, which is triggered if, for example, a communication error occurs; (iii)
Received, which is triggered whenever data are received (in the form of a
DataPacket, see below); and (iv) ClientConnected, which is triggered
when a new client connects to the server. Each of these events makes use of
custom EventArgs classes (see also Appendix A.6), which are all included in
the communication library.
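In C#, events of this kind are typically declared and raised as in the sketch below. The event names and EventArgs type names are those used elsewhere in the communication library (see Listing 2.6), but the declarations themselves, and in particular the constructor arguments of CommunicationProgressEventArgs, are assumptions made for illustration; see the source code of the Server class for the actual definitions.

// A sketch of how the server events might be declared (assumed declarations).
public event EventHandler<CommunicationProgressEventArgs> Progress;
public event EventHandler<CommunicationErrorEventArgs> Error;
public event EventHandler<DataPacketEventArgs> Received;

// Raising the Progress event: any subscribed event handlers are invoked
// with information about the action that just took place.
protected virtual void OnProgress(CommunicationAction action, string message)
{
    if (Progress != null)
    { Progress(this, new CommunicationProgressEventArgs(action, message)); }
}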
Once the server has been connected and is listening, the next step is to accept clients. Of course, the server cannot control at what time the incoming
connection request(s) will come. Thus, it needs to listen continuously for any
new connection requests. In the communication library, this is achieved using
Listing 2.3: An illustration of the use of asynchronous callback methods, in this case involving a server’s procedure for accepting incoming connection requests from clients.
public void AcceptClients()
{
    serverSocket.BeginAccept(new AsyncCallback(AcceptClientsCallBack), null);
}

...

private void AcceptClientsCallBack(IAsyncResult asyncResult)
{
    if (!connected) { return; }
    Socket clientSocket = null;
    try
    {
        clientSocket = serverSocket.EndAccept(asyncResult);
        ClientState clientState = new ClientState(bufferSize, clientSocket);
        OnProgress(CommunicationAction.Connect, "Client detected");
        Receive(clientState);
        AcceptClients();
    }
    catch (SocketException ex)
    {
        OnError(ex.Message);
        if (clientSocket != null) { clientSocket.Close(); }
    }
}
a programming pattern involving asynchronous callback methods. A brief
description of this approach is given in Listing 2.3. The server calls a method
BeginAccept that (when an incoming connection request is received) triggers the asynchronous callback method specified in the call to BeginAccept,
in this case a method called AcceptClientsCallBack, also shown in the
listing. In this method, a call is made to EndAccept and, if the connection
is successfully established, the server then receives the connection message
(if any) provided by the client. Next, the AcceptClients method is called
again, so that the server can continue listening for additional connection requests. A similar programming pattern is used also for sending and receiving
messages; see the source code for the Server class for additional information.
2.3.2 The Client class
The Client class maintains a client socket, and it is the information regarding
this socket that is transmitted to the server when the client connects to it, thus
establishing the connection. The client also defines a set of events, similar to
the ones used in the server, which are triggered, for example, when a message
is sent or received.
Listing 2.4: The two methods in the Client class responsible for receiving messages from
the server.
private void Receive()
{
    clientSocket.BeginReceive(receiveBuffer, 0, receiveBuffer.Length,
        SocketFlags.None, new AsyncCallback(ReceiveCallBack), null);
}

...

private void ReceiveCallBack(IAsyncResult asyncResult)
{
    try
    {
        if (connected)
        {
            int receivedMessageSize = clientSocket.EndReceive(asyncResult);
            byte[] messageAsBytes = new byte[receivedMessageSize];
            Array.Copy(receiveBuffer, messageAsBytes, receivedMessageSize);
            DataPacket dataPacket = new DataPacket();
            Boolean ok = dataPacket.Generate(messageAsBytes);
            if (ok)
            {
                OnReceived(dataPacket, "Server");
                OnProgress(CommunicationAction.Receive, "Received " +
                    receivedMessageSize.ToString() + " bytes from server");
            }
            else
            {
                OnError("Corrupted message received");
            }
            Receive();
        }
    }
    catch (SocketException ex)
    {
        connected = false;
        OnConnectionClosed();
        OnError(ex.Message);
    }
}
Like the Server class, the Client class also makes use of asynchronous callback methods for connecting to a server, and for sending and receiving messages. As an illustration, Listing 2.4 shows the two methods responsible for
receiving information from the server. The messages defined in the communication library are stored in instances of the DataPacket class, described
below. As can be seen from the listing, the pattern is very similar, in general,
to the one shown in Listing 2.3. Provided that the data packet arrives and is
not corrupted, the client processes the message, triggering its Received event
(by calling the OnReceived method) and the Progress event, and then calls
Receive again.
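To make the typical sequence of operations described at the beginning of Sect. 2.3 concrete, a client-side counterpart to the server-side code shown later in Listing 2.6 might look roughly as follows. This is only a sketch: the member names Name, Connected, Connect, and Send, the event subscription, and the handler HandleClientReceived are assumptions made for illustration; see the source code of the Client class for its actual interface.

// A hypothetical client-side startup sequence (assumed member names). The
// client connects to a server running on the same computer, using the
// loopback address and the port from Listing 2.6, and then sends a message.
Client client = new Client();
client.Name = "VisionClient";
client.Received += new EventHandler<DataPacketEventArgs>(HandleClientReceived);
client.Connect("127.0.0.1", 7);
if (client.Connected) { client.Send("Vision client started"); }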
2.3.3 The DataPacket class
In the TCP/IP protocol, messages are sent as simple arrays of bytes. For a user,
it is of course more convenient to send (or receive) a readable string. Moreover,
Listing 2.5: The four fields of the DataPacket class, along with the AsBytes method,
which combines the fields and converts the resulting string to a byte array.
public class DataPacket
{
    private DateTime timeStamp;
    private string senderName;
    private string message;
    private int checkSum;
    ...
    public byte[] AsBytes()
    {
        string tmpString = timeStamp.ToString("yyMMddHHmmssfff") + ":" + senderName +
            ":" + message + ":";
        byte[] dataAsBytes = Encoding.ASCII.GetBytes(tmpString);
        int checkSum = GetCheckSum(dataAsBytes);
        string dataPacketAsString = tmpString + checkSum.ToString();
        byte[] dataPacketAsBytes = Encoding.ASCII.GetBytes(dataPacketAsString);
        return dataPacketAsBytes;
    }
    ...
}
it is useful to know the time stamp of the message (meaning the date and time
at which the packet was generated). Furthermore, since a server might have
many clients connected, the identity of the sender should also be provided.
As mentioned above, messages are packaged in instances of the DataPacket
class, which contains four fields, shown in Listing 2.5. As is shown in the
listing, in addition to the fields just mentioned, the DataPacket also contains
a checksum, which is obtained by simply summing the ASCII values for each
byte that is sent.
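A checksum of this kind can be computed in just a few lines. The sketch below shows one way of doing so; the method name GetCheckSum is the one appearing in Listing 2.5, but its body here is an assumption, and the actual implementation in the DataPacket class may differ.

// Computes a simple checksum by summing the (ASCII) byte values of the data.
private int GetCheckSum(byte[] dataAsBytes)
{
    int sum = 0;
    for (int ii = 0; ii < dataAsBytes.Length; ii++) { sum += dataAsBytes[ii]; }
    return sum;
}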
The contents of a data packet can then be converted to a byte array by calling the AsBytes method. Before conversion, a special character, here chosen
as :, is inserted as a separator between the items. Thus, this particular character is not allowed as a part of a message. Of course, one could have chosen
another character as the separator, or even have allowed the user to define the separator character, but the scheme shown in the listing will be sufficient here. In
the uncommon case that one must send a colon character one can, for example,
send it as the word colon, perhaps surrounded by brackets to indicate that the
word between the brackets is to be interpreted in some fashion, rather than
taken literally.
Of course, the DataPacket class also contains a method (called Generate)
for the reverse operation, i.e. for obtaining the four fields described above,
given the byte array; see the source code of the DataPacket class for a detailed description of this method.
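As a rough indication of what the reverse operation involves, the sketch below splits the received byte array at the separator character, restores the four fields, and verifies the checksum. It assumes exactly the packet layout produced by AsBytes in Listing 2.5 and uses the System.Globalization namespace; the actual Generate method in the DataPacket class may differ in its details and error handling.

// A sketch of the reverse operation: parse the byte array produced by
// AsBytes and verify the checksum. Returns false if the packet is corrupted.
public Boolean Generate(byte[] dataPacketAsBytes)
{
    try
    {
        string packetAsString = Encoding.ASCII.GetString(dataPacketAsBytes);
        string[] parts = packetAsString.Split(':');
        if (parts.Length != 4) { return false; }
        timeStamp = DateTime.ParseExact(parts[0], "yyMMddHHmmssfff",
            CultureInfo.InvariantCulture);
        senderName = parts[1];
        message = parts[2];
        checkSum = int.Parse(parts[3]);
        // Recompute the checksum over the first three fields (including the
        // trailing separator, as in AsBytes) and compare it to the received value.
        string withoutCheckSum = parts[0] + ":" + parts[1] + ":" + parts[2] + ":";
        return (GetCheckSum(Encoding.ASCII.GetBytes(withoutCheckSum)) == checkSum);
    }
    catch (Exception) { return false; }
}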
Figure 2.2: A simple illustration of the client-server model implemented in the communication library. The server can accept any number of connected clients (two, in the case shown
here), and can then send a simple message (hello) to all clients. Similarly, each client can send
the same message back to the server.
2.3.4 A simple example
Fig. 2.2 shows the GUI of a server and two clients in a minimalistic usage
example involving the CommunicationLibrary. The code for this simple
example is contained in the CommunicationSolution. Here, the server is
connected to the loopback IP address (127.0.0.1) and the two clients, both
running on the same computer as the server, then establish connections to the
server. Of course, any number of clients could be used. However, for the
purpose of demonstrating the code, using two clients is sufficient. When the
user presses the button marked Hello on the server, a hello message is sent to all
the clients, who then promptly report that they received this message from the
server. Similarly, if the corresponding button is clicked in any of the clients, the
hello message is sent to the server, which then acknowledges that it received the
message, and also displays the identity of the sender. Both the server and client
can also handle the case in which the counterpart is unavailable. Moreover, a
client can be disconnected and then connected again. Similarly, the server can
be disconnected and then connected again, but in this case the clients must
also once more connect to the server.
A brief code snippet (Listing 2.6) shows the code for generating and connecting the server, and for starting to listen for incoming connection requests.
Of course, one would not normally define hard-coded constants for the IP address and port: those values are included here just to complete the example. The three methods (event handlers) HandleServerProgress, HandleServerError, and
HandleServerReceived must be defined as well. As an example, consider
the event handler for received messages, shown in Listing 2.7. This event handler simply formats and prints (in a ListBox called messageListBox) the
message contained in the data packet (that, in turn, is represented as a property
Listing 2.6: A brief code snippet, showing how the server is generated and started. Note that
event handlers are specified for handling the Progress, Error, and Received events,
respectively.
...
string ipAddressString = "127.0.0.1";
int port = 7;
server = new Server();
server.Name = "Server";
server.Progress += new EventHandler<CommunicationProgressEventArgs>
    (HandleServerProgress);
server.Error += new EventHandler<CommunicationErrorEventArgs>(HandleServerError);
server.Received += new EventHandler<DataPacketEventArgs>(HandleServerReceived);
server.Connect(ipAddressString, port);
if (server.Connected) { server.AcceptClients(); }
...
Listing 2.7: An event handler that processes (and displays) messages received by the server.
private void HandleServerReceived(object sender, DataPacketEventArgs e)
{
    string information = e.DataPacket.TimeStamp.ToString("yyyyMMdd HHmmss.fff: ") +
        e.DataPacket.Message + " from " + e.SenderID.ToString();
    if (InvokeRequired) { this.BeginInvoke(new MethodInvoker(() =>
        messageListBox.Items.Insert(0, information))); }
    else { messageListBox.Items.Insert(0, information); }
}
in the DataPacketEventArgs; see the corresponding code for a full description). Note that the server does not run in the GUI thread. In order to avoid
illegal cross-thread operations (see Appendix A, especially Sect. A.4) one must
therefore use the BeginInvoke pattern.
Chapter 3
Decision-making, memory, and dialogue
One of the most fundamental requirements on an IPA is that it should be able,
within reasonable limits, to carry out a meaningful dialogue with a human,
using the various input and output modalities (speech, typing, gestures, facial
expressions etc.) that it might have at its disposal, processing incoming information to generate a suitable decision (perhaps consulting its memory in the
process), and then executing that decision.
Now, for all the other subfields that will be studied in the coming chapters,
there is normally quite a bit of theory available, usually rooted in (human) biology. Moreover, in those cases, there often exist implementable mathematical
models that can be used directly in an agent. As one example among many,
the formant speech synthesis in Chapter 6 uses a mathematical model based
on a simplified description of the human vocal tract.
However, regarding the processes of decision-making, memory, and dialogue, there are fewer (useful) theories available. Of course, quite a large number of theories (or perhaps hypotheses, rather) regarding the workings of the
mind have been presented in the field of psychology. However, those theories
are generally not associated with implementable models as would be needed
here. More detailed approaches can be found in the field of neurobiology, but
those often concern the microscopic level (neuron assemblies or even individual neurons) rather than the brain as a whole.
Even though theories of the brain as a whole rarely are presented in implementable form, for good reason, one can still make use of such theories as
an inspiration when formulating a simplified implementable model. Thus, for
example, the use of the working memory in the Agent class (see below) has
been inspired by models of working memory in humans. The same can be said
for the IPA structure as a whole, namely the fact that it is implemented as a set
of separate processes with strong, asynchronous interaction.
Then there is the question of whether one, even in principle, can generate a
truly intelligent agent, regardless of the method used. Here, such aspects will
not be considered: Instead, the semblance of intelligence is sufficient. That is,
the goal is to generate an IPA that can handle, for example, a basic dialogue
with a human. The brain of the IPAs considered here will thus be modelled
as a collection of simple, and rather rigid, dialogues. Returning to the topic
of theory, it should be noted that the brain of an IPA could of course have
been implemented in many different ways. The implementation described in
Sect. 3.2 was selected with the aim of making it easy to set up a set of dialogues
while, at the same time, maintaining flexibility for further development.
3.1 A simple example
In order to illustrate some of the difficulties encountered when implementing
decision-making, memory, and dialogue in an IPA, a simple example will now
be given. Consider a situation involving an IPA that is supposed to retrieve
and read news stories to a user, perhaps a visually impaired person using
speech (or, possibly, gestures) as the input modality, although the example would be equally valid in the case of text input (typing). The beginning of a
specific dialogue of this kind can be seen in Fig. 3.1.
First of all, the user must get the attention of the IPA. In the simplest situation, the IPA may have only a news reader dialogue, in which case this dialogue
could simply wait for input from the user to start the discussion. However,
a slightly more advanced IPA could contain numerous dialogues (and, perhaps, non-dialogue processes as well). In such situations the user must somehow trigger the dialogue, starting with getting the attention of the IPA. How
should that part be implemented? Already here, many options present themselves: One could, for example, have a loop running in the agent, checking for
inputs. However, apart from being inelegant, such an implementation would
make use of computational resources even when there is no reason for doing
so, since it would involve constant checking.
A better approach is to use an event-based system, in which event handlers
stand by (without any looping), waiting for something to happen. Then, the
next problem appears: What should that something be, and which processes
should be standing by? Should all dialogues check for suitable input? What if
more than one dialogue finds that the input matches its starting condition, thus
perhaps triggering two dialogues to run simultaneously? Regarding the first
question, one possible approach (among several) with some biological justification, is to trigger events via changes in the IPA’s working memory. Thus, the
IPA must be fitted with a working memory that, among other things, would
contain the various inputs (e.g. speech, text, or gestures) as well as an event
that should be triggered whenever there is a change in the working memory,
i.e. when a new item is added. Next, rather than having all dialogues standing by, the triggering of a dialogue can be taken care of by an event handler in the agent itself.

User: Hello!
[The agent detects and recognizes the face of the user (Mattias)]
[The user's input is detected by a separate Listener program,
and is then transferred to the agent's working memory.
A top-level dialogue (for topic selection) is activated.]
Agent: Hello Mattias. How can I be of service?
[The agent's statement is sent to a Speech program, if available.]
User: I would like to hear the news, please.
[The agent processes the input, disables the top-level dialogue,
and triggers a news dialogue.]
Agent: OK. Which topic are you interested in?
User: Economy, please.
[The agent searches its working memory for items of interest]
Agent: I have three new items, which arrived in the last hour.
User: OK, list them for me.
Agent: Item 1: The bank of England today announced ..
User: Skip that one.
Agent: OK. Item 2: U.S. jobless claims down more than expected.
User: Read that one, please.
etc. etc.

Figure 3.1: A partially annotated example of (the beginning of) a simple human-agent dialogue. As described in the main text, even a simple and somewhat robotic conversation of this kind requires quite a complex implementation, at least if some variety is to be allowed in the dialogue.
Several new problems then appear: How should a dialogue respond to
events (user input) and how should the dialogue be structured? Regarding
the first question, one could in principle let the event handler just described
pass the information that a new user input has been received to the currently
active dialogue, and then let the dialogue handle it. However, this would be
slightly inelegant as it would involve passing information (unnecessarily, as
will be demonstrated) between different parts of the agent. Moreover, it is
likely that such an implementation would result in a very complex event handler as it would now have to handle not only the triggering of dialogues but
also passing information to (and, perhaps, from) dialogues. An alternative approach (used here) is to let the active dialogue itself subscribe to (i.e. respond
to) the event triggered when the working memory is changed, thus removing
the need of passing the user input from the agent itself to the active dialogue.
One must then make sure to unsubscribe to the event whenever a dialogue is
de-activated, to avoid triggering actions from inactivated dialogues.
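In C#, such a subscription amounts to attaching (and later detaching) an event handler, roughly as in the sketch below. The event name MemoryChanged is the one used in the AgentLibrary (see Sect. 3.2), whereas the method names Activate and Deactivate, the field active, and the handler signature are assumptions made for illustration.

// A sketch of activation and deactivation of a dialogue process: the process
// only reacts to working-memory changes while it is active (assumed names).
public void Activate(Memory workingMemory)
{
    workingMemory.MemoryChanged += HandleWorkingMemoryChanged;
    active = true;
}

public void Deactivate(Memory workingMemory)
{
    workingMemory.MemoryChanged -= HandleWorkingMemoryChanged;
    active = false;
}

private void HandleWorkingMemoryChanged(object sender, EventArgs e)
{
    // React to the newly added memory item here (see Sect. 3.2.3).
}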
As for the second question, choosing a representation for the structure of a
dialogue involves a difficult trade-off between simplicity, on the one hand, and
flexibility, on the other. A dialogue between two humans generally involves a
strong degree of flexibility: In any given part of the exchange, many different
statements would be valid as input to the other person and the dialogue would
take different directions (including complete changes of topic) depending on
the sequences of statements made by the two participants. Moreover, both
participants would, in all likelihood, have a clear understanding of the context in which the dialogue takes place and also share certain common-sense
knowledge. None of those things apply, at least not a priori, to an IPA. As a
simple example, when giving an affirmative, verbal response, a human might
simply say yes. However, it is also possible to respond in other, equivalent
ways, e.g. ok, sure, fine etc. In order for an IPA to handle even this simple case,
it must be provided with the knowledge that those responses represent the
same thing. Of course, one can easily envision many examples that are considerably more complex than the one just given. For instance, just include any
form of joke or humor into a sentence, and it is easy to understand how an
IPA, devoid of context and without a sense of humor, will be lost.
Here, a rather simple approach has been taken, in which a dialogue is built
up as a finite-state machine (FSM), consisting of a number of states (referred
to as dialogue items) in which a given input is mapped to a specific output.
The inputs are retrieved from the agent’s working memory. The easiest way
to do that is simply to take the most recent item in the working memory when
a change is detected (i.e. when an item is added to the working memory) as
the user’s input. However, there are cases in which multiple sources may add
memory items that are not related to the user’s input. For example, an agent
equipped with a camera may detect and recognize a face (in a separate program, as described in Chapter 2) and then place the corresponding person's
name in the working memory, which the agent might then mistake as the response to its statement. In order to avoid such problems, each memory item is
equipped with a memory item tag that consists of a string that can be used for
identifying and classifying memory items. Thus, a memory item from the face
recognition program may contain a tag such as Vision:FaceRecognized
where the first part identifies the process responsible for generating the memory item and the second part describes the category of the memory item. Each
memory item also has a content string that, in this particular example, would
contain the name of the person whose face was recognized. Moreover, each
memory item is associated with a time stamp so that old memory items, which
have not been accessed for a long time or have become obsolete for some other
reason, can be removed.
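As a concrete, hypothetical example, the face recognition case just mentioned could lead to a memory item of the following kind being placed in the working memory. The MemoryItem class and the InsertItems method are described in Sect. 3.2.2 (Listings 3.1 and 3.2); the constructor arguments shown here are assumptions made for illustration.

// A hypothetical memory item generated when the vision client recognizes a face
// (the MemoryItem constructor signature is an assumption).
MemoryItem faceItem = new MemoryItem(DateTime.Now, "Vision:FaceRecognized", "Mattias");
workingMemory.InsertItems(new List<MemoryItem>() { faceItem });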
As will be shown below, some flexibility of the agent’s response has been
added by using an implementation that allows for alternative inputs as well
as alternative (but equivalent) outputs. The latter is also important for the perception of the dialogue from the user’s point of view: If the IPA constantly uses
the same style of replying, the dialogue will appear very rigid and unnatural.
By adding a bit of variety, one can to some extent reduce such perceptions.
The dialogues considered in the implementation used here are rather limited, and it is required that the user should stay on the topic, rather than trying
to wander off into other topics as frequently happens in human-to-human dialogues. However, even a simple IPA must somehow be able to deal with
cases in which the user gives an incorrect response. Thus, another problem
appears. One solution, of course, is simply to wait until the user gives a response that the agent can understand. Such an approach will quickly become
annoying for the user, though, who might not know or even be able to guess
precisely what input the agent requires. Of course, one could make the agent
list the allowed inputs but that, too, would represent a strong deviation from
a human-to-human dialogue. A better approach might be to include, in every
dialogue item involving human-agent interaction, a separate method for handling incorrect or unexpected responses, asking for a clarification a few times
(at most), before perhaps giving up on the dialogue and instead returning to a
resting state, awaiting input from the user. This is indeed the approach chosen
here.
Now, returning to the beginning of the example, the aim was to generate
an agent that could read the news in interaction with the user. Thus, in addition to handling the dialogue, the IPA must also be able to retrieve news items
upon request. As shown in Chapter 8, one can write a separate program for
obtaining and parsing news, and then sending them to the agent. How, then,
should the agent handle the news items, in a dialogue with a human? Where
should they be stored, and how should they be accessed? Here, again, many
possibilities present themselves to the programmer. One can include, for example, a specialized state in the dialogue that would actively retrieve news
items, on a given topic, by sending a request to the corresponding program
and then receiving and processing the response. However, in that case, the
user may have to wait a little bit for this procedure to be completed. Sending
and receiving the data over the network is usually very fast, but (for example)
reloading a web page from which the news are obtained might take some time.
Even a delay of 0.5 s will generally be perceived as annoying by a human user.
An alternative approach, used here, is to let the program responsible for
downloading the news send the information about incoming news items to
the agent as soon as they become available. The agent can then store the news
items in its working memory, so that they can be retrieved on demand in the
dialogue, thus eliminating any delays. As above, the memory item tags can
then be used for distinguishing between, say, user input and a news item. In a
dialogue, the agent may then be able to retrieve, for example, all memory items
regarding sport news, received in the last half hour.
As will hopefully now be clear from this example, even generating a very
simple human-agent dialogue involves quite a number of complex problems
that, moreover, can be solved in many different ways. There are many additional refinements that can be made, of course. As one example among many,
in cases where the agent starts reading a long news item (or any other text),
the user might quickly realize that he or she is not interested and wishes to
move on. Then, ideally (and as in a human-to-human conversation) it should
be possible to interrupt the agent so that one can direct it to a topic of greater
interest.
The next section contains a brief description of the main classes implemented in the AgentLibrary. When reading that description, it is very useful to keep the example above in mind.
3.2 The AgentLibrary
The AgentLibrary contains the necessary classes for setting up the brain of
an agent, including a set of dialogues (and, possibly, other non-dialogue brain
processes), as well as the agent's working memory and (optionally) its
long-term memory.
3.2.1 The Agent class
The Agent class contains a list of brain processes, as well as a working memory and a long-term memory. There is a Start method responsible for setting up a server, initializing the working memory and loading the long-term
memory (if available), and also starting the client programs (see Fig. 2.1). It
is possible to modify the structure of the IPA by excluding some of the client
programs. For example, a simple agent may just define a Listener client and
a Speech client. Upon startup, the agent also checks which brain processes
should be active initially. Those processes are then started.
From this point onward, most of the agent’s work is carried out by the
HandleWorkingMemoryChanged event handler, which is triggered whenever there is any change in the agent’s working memory. There is also a Stop
method that shuts down all client processes, and then also the agent’s server.
The HandleWorkingMemoryChanged event handler consists of four blocks
of code: (i) First, it checks (using the memory item tags of the available items in
the working memory) whether any new speech memory item has been added
to the working memory. If so, the content of the corresponding memory item
is sent to the Speech client. The agent also keeps track of the time at which the
speech output was sent, to avoid repeating the same output again. (ii) Next,
it repeats the procedure, but this time concerning facial expressions. Any new
facial expression memory item (again identified using the appropriate memory item tag) is sent to the Visualizer client. Then (iii) it checks whether any brain process should be activated or (iv) deactivated.
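Schematically, the event handler might therefore have roughly the following shape. This is only an illustration: apart from HandleWorkingMemoryChanged itself and GetLastItemByTag (described in Sect. 3.2.2), all names below, including the tag strings, the SendToClient helper, and the fields keeping track of the latest output times, are assumptions, and the actual Agent class is considerably more elaborate.

// A schematic sketch of the four steps (i)-(iv) described above (hypothetical names).
private void HandleWorkingMemoryChanged(object sender, EventArgs e)
{
    // (i) Forward any new speech output to the Speech client.
    MemoryItem speechItem = workingMemory.GetLastItemByTag("Speech");
    if ((speechItem != null) && (speechItem.CreationDateTime > latestSpeechOutputTime))
    {
        SendToClient("Speech", speechItem.Content);
        latestSpeechOutputTime = speechItem.CreationDateTime;
    }
    // (ii) Forward any new facial expression to the Visualizer client.
    MemoryItem faceItem = workingMemory.GetLastItemByTag("FacialExpression");
    if ((faceItem != null) && (faceItem.CreationDateTime > latestFacialExpressionTime))
    {
        SendToClient("Visualizer", faceItem.Content);
        latestFacialExpressionTime = faceItem.CreationDateTime;
    }
    // (iii), (iv) Activate or deactivate brain processes, as required.
    foreach (BrainProcess brainProcess in brainProcessList)
    {
        if (brainProcess.ShouldBeActivated(workingMemory)) { brainProcess.Activate(workingMemory); }
        else if (brainProcess.ShouldBeDeactivated(workingMemory)) { brainProcess.Deactivate(workingMemory); }
    }
}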
3.2.2 The Memory class

The Memory class, used for defining the working memory of an agent, simply contains a list of memory items, of type MemoryItem. Each memory item contains (i) the date and time at which the item was generated; (ii) the tag; and (iii) the contents of the memory item, as shown in Listing 3.1.

Listing 3.1: The fields defined in the MemoryItem class.
public class MemoryItem
{
    private DateTime creationDateTime;
    private string tag;
    private string content;
    ...
}

Items are inserted into the working memory by using the InsertItems method shown in Listing 3.2. This method also makes sure that the items are inserted in the order in which they were generated, with the most recent item at the first index (0) of the list. Finally, an event (MemoryChanged) is triggered to indicate that there has been a change in the working memory. This event, in turn, is then handled by several event handlers: The agent's event handler described above, as well as event handlers in any active brain process (see below). There are also several methods for accessing memory items. For example, the GetLastItemByTag method retrieves the most recent item (if any) matching an input tag.

Listing 3.2: The InsertItems method of the Memory class.
public void InsertItems(List<MemoryItem> insertedItemList)
{
    Monitor.Enter(lockObject);
    for (int ii = 0; ii < insertedItemList.Count; ii++)
    {
        MemoryItem item = insertedItemList[ii];
        DateTime itemCreationDateTime = item.CreationDateTime;
        int insertionIndex = 0;
        while (insertionIndex < itemList.Count)
        {
            if (itemList[insertionIndex].CreationDateTime < itemCreationDateTime)
            { break; }
            insertionIndex++;
        }
        itemList.Insert(insertionIndex, item);
    }
    OnMemoryChanged();
    Monitor.Exit(lockObject);
}

The insertion and access methods all make use of the Monitor construct (see Appendix A.5) in order to handle the fact that several asynchronous processes (brain processes or external clients) act upon the working memory. Thus, for example, during the (very brief) time interval when the agent is accessing the most recent speech-related memory item, it has exclusive access to the working memory.
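As an indication of what such an access method might look like, the sketch below returns the most recent item with a given tag, relying on the fact that the list is kept sorted with the most recent item at index 0 (see Listing 3.2). It assumes a Tag property exposing the tag field; the actual GetLastItemByTag method in the Memory class may differ in its details.

// A sketch of tag-based access: since itemList is sorted with the most
// recent item first, the first match is also the most recent one. The
// Monitor construct gives exclusive access to the list while searching.
public MemoryItem GetLastItemByTag(string tag)
{
    Monitor.Enter(lockObject);
    MemoryItem foundItem = null;
    for (int ii = 0; ii < itemList.Count; ii++)
    {
        if (itemList[ii].Tag == tag) { foundItem = itemList[ii]; break; }
    }
    Monitor.Exit(lockObject);
    return foundItem;
}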
3.2.3 The DialogueProcess class
This class is derived from the base class BrainProcess and defines a specific kind of brain process aimed at handling human-agent dialogue. Each dialogue process, in turn, contains a set of dialogue items, which are either of type
InteractionItem for those items that handle direct interaction between the
agent and a user, or of type MemoryAccessItem for those items that handle
other aspects of the dialogue (such as accessing and processing information).
At any given time, exactly one dialogue item is active, and it is referred to as
the current dialogue item. Each dialogue item, in turn, contains a list of objects derived from the type DialogueAction, and each such object defines a
target dialogue item to which the dialogue process will jump, if the action in
question is executed; see also below.
Whenever a dialogue process is activated, a subscription is established
with respect to the MemoryChanged event of the working memory. Similarly, whenever a dialogue process is deactivated, the subscription is removed.
Thus, only active brain processes react to changes in the agent’s working memory. The HandleWorkingMemoryChanged method in the DialogueProcess
class is a bit complex. Summarizing briefly, it first accesses the current dialogue item. If that item is an interaction item, it checks whether or not the
item requires input. If it does not, the output obtained from its first dialogue
action is simply placed in the agent’s working memory, and the current dialogue item is set as specified by that dialogue action. If the dialogue item does
require input, the next step is to check whether or not the actual input matches
any of the required inputs for the current dialogue item, by going through the
available dialogue actions until a match is found. If a match is found, the corresponding output is placed in the agent’s working memory, and the index of
the current dialogue item is updated as specified by the matching dialogue action. Note that the input can come either from the Listener client or, in cases
where input in the form of gestures is allowed, from the Vision client. If
the current dialogue item is instead a memory access item, the CheckMemory
method of the MemoryAccessItem is called. This method goes through the
various dialogue actions, generating lists of memory items based on the tag
specified in each dialogue action. The memory items are then placed in the
working memory, thus triggering the MemoryChanged event. For example,
the ReadByTagAction retrieves from working memory all memory items
matching a given tag (News, say), not older than a pre-specified time interval, and then generates an output memory item with the tag Speech for one
of those items (if any) based on a user-specified index. If the index is set to 0,
the most recent item is selected. The output item is then placed in the agent’s
working memory, thus triggering the MemoryChanged event so that, in turn,
the agent’s event handler can send the output to the Speech client.
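As a schematic sketch (not the actual library code) of the input-matching step just described, the loop over dialogue actions might take roughly the following form. The DialogueAction properties used here (InputList, OutputList, TargetDialogueItemName) appear in Listing 3.3, whereas receivedInput and the two helper methods are hypothetical names used only for illustration:
// Schematic sketch of the matching step; PlaceOutputInWorkingMemory and
// SetCurrentItem are hypothetical helper names, not library methods.
foreach (DialogueAction dialogueAction in currentDialogueItem.ActionList)
{
    bool inputMatches = false;
    foreach (string allowedInput in dialogueAction.InputList)
    {
        if (string.Equals(allowedInput, receivedInput, StringComparison.OrdinalIgnoreCase))
        { inputMatches = true; break; }
    }
    if (inputMatches)
    {
        PlaceOutputInWorkingMemory(dialogueAction.OutputList); // hypothetical helper
        SetCurrentItem(dialogueAction.TargetDialogueItemName); // hypothetical helper
        break;
    }
}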
Most dialogues are not just a linear sequence of input-output mappings.
For example, when the agent asks a question, the next dialogue item (responsible for processing the input) can take different actions depending on whether
the input is affirmative (e.g. yes) or negative (e.g. no). There is a great degree of
flexibility here. For example, a single dialogue process may contain two different paths, one handling an affirmative input and one handling a negative
input. Alternatively, upon receiving the input, the current dialogue process
can deactivate itself and also activate another process or, possibly, either of
two processes (one for affirmative input and one for negative input).
The AgentLibrary contains a few dialogue action types, derived from the
DialogueAction base class. Those types can handle the most basic forms of
dialogue, but for more advanced dialogues additional derived dialogue action
classes might be needed. However, such classes can easily be added, without,
of course, having to change the rather complex framework described above.
3.3 Demonstration application
The AgentDevelopmentSolution contains a simple demonstration application that illustrates the basic aspects of human-agent dialogue in the AgentLibrary.
In addition to an agent program, this solution also contains a very simple listener program, which reads only text input, and an equally simple speech program that only outputs text. The agent program contains a menu that gives
the user access to four hard-coded simple dialogue examples, which will be
described next. In all cases, the dialogues are incomplete, and the examples
are merely intended to show how the various classes in the AgentLibrary
can be used.
3.3.1 TestAgent1
This agent is generated if the user chooses the menu actions File - New agent
- Test agent 1. In this case, the generated agent handles the beginning of
Listing 3.3: The code that generates TestAgent1.
private void GenerateTestAgent1()
{
    SetUpAgent();
    DialogueProcess dialogue1 = new DialogueProcess();
    dialogue1.Name = "Dialogue1";
    agent.BrainProcessList.Add(dialogue1);
    dialogue1.ActiveOnStartup = true;
    InteractionItem dialogueItem1 = new InteractionItem();
    dialogueItem1.Name = "Item1";
    dialogueItem1.MaximumRepetitionCount = 2;
    ResponseAction action1 = new ResponseAction();
    action1.InputList.Add("Hello");
    action1.InputList.Add("Hi");
    action1.TargetDialogueItemName = "Item2";
    action1.OutputList.Add("Hello user");
    dialogueItem1.ActionList.Add(action1);
    dialogue1.ItemList.Add(dialogueItem1);
    InteractionItem dialogueItem2 = new InteractionItem();
    dialogueItem2.MillisecondDelay = 500;
    dialogueItem2.Name = "Item2";
    OutputAction action2 = new OutputAction();
    action2.OutputList.Add("How can I be of service?");
    action2.BrainProcessToDeactivate = dialogue1.Name;
    dialogueItem2.ActionList.Add(action2);
    dialogue1.ItemList.Add(dialogueItem2);
    FinalizeSetup();
}
a greeting dialogue, by first activating a dialogue item that waits for a greeting (e.g. hello) from the user. If a greeting is received, the agent moves to the
next dialogue item, in which it asks if it can be of service, and that concludes
this simple example. Despite its simplicity, the example is sufficient for illustrating several aspects of the AgentLibrary. The code defining TestAgent1
is shown in Listing 3.3. The SetUpAgent method sets up the server and file
paths to the listener and speech programs. Next, the dialogue is defined. The
Dialogue1 process is set to be active as soon as the agent starts. By construction, the first dialogue item (in this case named Item1) becomes the current
dialogue item when the dialogue is started. The agent then awaits user input,
in this case requiring that the input should be either Hello or Hi (note that the
input-matching is case-insensitive, so either Hello or hello would work). The
agent then responds with the phrase Hello user and proceeds to the next dialogue item (Item2). Here, it waits for 0.5 s before outputting the phrase How
can I be of service?. Next, the dialogue process is deactivated, and the agent
stops responding to input. Since the second dialogue item does not require
any input, an OutputAction (unconditional output) was used instead of the ResponseAction used in the first dialogue item. The same effect could also have been
achieved by using a ResponseAction (in Item2) with an empty input list.
Note that if the agent cannot understand the user’s reply, i.e. if the input
is anything except Hello and Hi, the current dialogue item will handle the situation by asking for a clarification. By default, this is done twice. If the user
still fails to give a comprehensible answer the third time, the dialogue is deactivated, and a user-specified dialogue (for example one that simply waits for
the user to start over) is activated instead, provided that such a dialogue exists,
of course. The allowed number of failed answers can also be modified by the
user and can differ between dialogue items.
Ideally, an agent should be able to understand any greeting that a human
would understand (i.e. not just Hello and Hi). One can of course extend the
list of allowed inputs to obtain a better approximation of human behavior.
Moreover, for some particular cases, a set of default input strings has been
defined. Thus, for example, if one wants to add a response action taking affirmative input, instead of listing all affirmative answers (yes, sure etc.) for each
such action, one can simply use the SetAffirmativeInput method in the
ResponseAction class. Similar methods exist also for negative inputs and for
greetings; see also the source code for TestAgent3 below.
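For example, a response action that accepts any of the default affirmative inputs might be set up as follows. This is a hedged usage sketch: the ResponseAction and SetAffirmativeInput names are taken from the text above (SetAffirmativeInput is assumed here to take no arguments), whereas the output phrase, the target item name, and someDialogueItem are merely illustrative:
// Hedged usage sketch; someDialogueItem stands for an existing InteractionItem.
ResponseAction confirmAction = new ResponseAction();
confirmAction.SetAffirmativeInput(); // adds the default affirmative inputs (yes, sure, etc.)
confirmAction.OutputList.Add("Very well");
confirmAction.TargetDialogueItemName = "Item3"; // illustrative item name
someDialogueItem.ActionList.Add(confirmAction);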
3.3.2 TestAgent2 and TestAgent3
These two examples illustrate the fact that a dialogue can be implemented in
several different ways. In this case, the agent again awaits a greeting from the
user. If the greeting is received, the agent asks about the user’s health: How
are you today? If the user gives a positive answer (Fine) the agent activates a
path within the current dialogue for handling that answer, and if the answer
is negative (Not so good), the agent instead activates another path, still within
the same dialogue for handling that answer. Listing 3.4 shows a small part of
the definition of TestAgent2, namely the dialogue item that handles the user’s
response to the question How are you today?. As can be seen in the listing, the
dialogue item defines two different actions, which are selected based on the
mood, negative or positive, of the user’s reply.
By contrast, in TestAgent3, if a positive answer is received from the user,
the initial dialogue is deactivated and another dialogue is activated for handling that particular case. If instead a negative answer is received, the initial
dialogue is also deactivated, and yet another dialogue is activated for handling
the negative answer.
3.3.3 TestAgent4
This example illustrates memory access. If the user asks to hear the news (Read
the news, please), the agent searches its memory for memory items that carry
the tag News. The agent then selects the first item, somewhat arbitrarily, and
sends the corresponding text to the speech program. Now, normally, the news
Listing 3.4: A small part of the code for TestAgent2. The dialogue item shown here contains
two dialogue actions. The first action is triggered if the user gives a negative input, in which
case the agent then moves to a dialogue item (not shown) called NegativeItem1. If instead
the user gives a positive input, the agent moves to another dialogue item (not shown either)
called PositiveItem1, in both cases after first giving an appropriate output.
...
InteractionItem dialogueItem3 = new InteractionItem();
dialogueItem3.Name = "Item3";
ResponseAction action31 = new ResponseAction();
action31.InputList.Add("Not so good");
action31.OutputList.Add("I'm sorry to hear that");
action31.TargetDialogueItemName = "NegativeItem1";
dialogueItem3.ActionList.Add(action31);
ResponseAction action32 = new ResponseAction();
action32.InputList.Add("Fine");
action32.OutputList.Add("I'm happy to hear that");
action32.TargetDialogueItemName = "PositiveItem1";
dialogueItem3.ActionList.Add(action32);
dialogue1.ItemList.Add(dialogueItem3);
...
items would have been obtained by an internet data acquisition program (see
Chapter 2) that would read news continuously, and then send any new items
to the agent so that the latter can include them in its working memory. Here,
for simplicity, a few artificial news items have simply been hardcoded into the
agent’s working memory.
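As a rough sketch of what such hardcoding might look like (the Tag and Content property names, as well as the WorkingMemory reference, are assumptions based on the description in Subsect. 3.2.2, not the actual source code; the InsertItems method is the one mentioned in connection with Listing 3.2):
// Rough sketch, with assumed property names, of adding an artificial news item.
MemoryItem newsItem = new MemoryItem();
newsItem.CreationDateTime = DateTime.Now;
newsItem.Tag = "News";                                         // assumed property name
newsItem.Content = "An artificial news item used for testing"; // assumed property name
agent.WorkingMemory.InsertItems(new List<MemoryItem>() { newsItem }); // assumed reference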
As mentioned above, only a few DialogueAction classes have been included
in the AgentLibrary. It is likely that, for more advanced dialogues than the
ones considered here, the user will have to write additional dialogue action
classes.
Chapter 4
Computer vision
The ability to see is, of course, of great importance for many animal species.
Similarly, computer vision, generated by the use of one or several (video) cameras, can play a very important role in IPAs as well as other kinds of intelligent
agents. However, one of the main difficulties in using vision in intelligent
agents is the fact that cameras typically provide very large amounts of information that must be processed quickly in order to be relevant for the agent’s
decision-making.
This chapter starts with a general description of digital images, followed by
a description of the ImageProcessing library, which contains source code
for basic image processing as well as code for reading video streams. The
basic image processing operations are then described in some detail. Next,
a brief overview is given regarding more advanced image processing operations, such as adaptive thresholding, motion detection, and face detection.
Two simple demonstration programs are then introduced.
4.1 Digital images
A digital image consists of picture elements called pixels. In a color image,
the color of each pixel is normally specified using three numbers, defining the
pixel's location in a color space. An example of such a space is the red-green-blue (RGB) color space, in which the three numbers (henceforth denoted R, G,
and B) specify the levels of the red, green, and blue components for the pixel
in question. These components typically take values in the range [0, 255]. In
other words, for each pixel, three bytes are required to determine the color of
the pixel. In some cases, a fourth byte is used, defining an alpha channel that
determines the level of transparency of a pixel.
In a grayscale image, only a single value (in the range [0, 255]) is required
for each pixel, such that 0 corresponds to a completely black pixel and 255 to a
completely white pixel, and where intermediate values provide levels of gray.
Thus, a grayscale image requires only one third of the information required for
a color image. The conversion of an RGB image to a grayscale image is often
carried out as
Γ(i, j) = 0.299R(i, j) + 0.587G(i, j) + 0.114B(i, j),   (4.1)
where Γ(i, j), the gray level for pixel (i, j), is then rounded to the nearest integer. In the remainder of this chapter, the indices (i, j) will normally be omitted,
for brevity, except in those cases (e.g. convolutions, see below) where the indices are really needed to avoid confusion. A more complete description of
grayscale conversion can be found in Subsect. 4.3.2 below.
Taking the information reduction one step further, one can also binarize
an image, in which case each pixel is described by only one bit (rather than a
byte), such that 0 corresponds to a black pixel and 1 to a white pixel, and where
there are no intermediate values. The process of binarization is described in
Subsect. 4.3.3 below.
4.1.1 Color spaces
In addition to the RGB color space, there are also other color spaces, some of
the most common being CMY(K) (cyan, magenta, yellow, often augmented
with black (K)), HSV (hue-saturation-value), and YCbCr, consisting of a luma
component (Y ) and two chrominance components (Cb and Cr ). The YCbCr
color space has, for example, been used in face detection, since skin color pixels
generally tend to fall in a rather narrow range in Cb and Cr. In its simplest
form, the YCbCr color scheme is given by
Y = 0.299R + 0.587G + 0.114B   (4.2)
Cb = B − Y   (4.3)
Cr = R − Y   (4.4)
Note that the luma component (Y ) corresponds to the standard grayscale defined above. However, the conversion from RGB to YCbCr normally takes a
slightly different form. In the Rec.601 standard for video signals, the Y component takes (in the case of eight-bit encoding) integer values in the range
[16, 235] (leaving the remainder of the ranges [0, 15] and [236, 255] for image
processing purposes, such as carrying information about transparency). Furthermore, Cb and Cr also take integer values, in the range [16, 240], with the
center position at 128. The equations relating RGB to this definition of YCbCr
take the form
\[
\begin{pmatrix} Y \\ C_b \\ C_r \end{pmatrix} =
\begin{pmatrix} 16 \\ 128 \\ 128 \end{pmatrix} +
\begin{pmatrix}
 0.25679 &  0.50413 &  0.09791 \\
-0.14822 & -0.29099 &  0.43922 \\
 0.43922 & -0.36779 & -0.07143
\end{pmatrix}
\begin{pmatrix} R \\ G \\ B \end{pmatrix}, \qquad (4.5)
\]
Figure 4.1: An example of the YCbCr color space. The upper left panel shows the original
image, whereas the upper right panel shows the luma (Y) component. The lower panels show
the Cb (left) and Cr (right) components. When plotting any of the YCbCr components the
other components were set to the center of their range. Thus, for example, for the Cb plot, Y
was set to 126 and Cr to 128. See also Eq. (4.6). Photo by the author.
where the resulting values are rounded to the nearest integer. The inverse
transformation can easily be derived from Eq. (4.5), and takes the form
\[
\begin{pmatrix} R \\ G \\ B \end{pmatrix} =
\begin{pmatrix}
1.16438 &  0.00000 &  1.59603 \\
1.16438 & -0.39176 & -0.81297 \\
1.16438 &  2.01723 &  0.00000
\end{pmatrix}
\begin{pmatrix} Y - 16 \\ C_b - 128 \\ C_r - 128 \end{pmatrix}. \qquad (4.6)
\]
An example of the YCbCr color space is shown in Fig. 4.1. In the remainder
of the chapter, unless otherwise specified, the RGB color space will be used.
However, the YCbCr color space will be revisited in connection with the discussion on face detection in Subsect. 4.4.3.
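As a small, self-contained illustration of Eq. (4.5), the following method (a sketch, not part of the ImageProcessing library) converts a single RGB pixel, with components in [0, 255], to Rec.601 YCbCr; for valid inputs, Y ends up in [16, 235] and Cb and Cr in [16, 240]:
// Self-contained sketch of Eq. (4.5) for a single pixel (not library code).
public static void RgbToYCbCr(int r, int g, int b, out int y, out int cb, out int cr)
{
    y  = (int)Math.Round( 16.0 + 0.25679 * r + 0.50413 * g + 0.09791 * b);
    cb = (int)Math.Round(128.0 - 0.14822 * r - 0.29099 * g + 0.43922 * b);
    cr = (int)Math.Round(128.0 + 0.43922 * r - 0.36779 * g - 0.07143 * b);
}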
4.1.2 Color histograms
The information in an image can be summarized in different ways. For example, one can form color histograms measuring the distribution of colors over an
Figure 4.2: An example of image histograms. The panels on the right show, from top to
bottom, the red, green, blue, and gray histograms, respectively. Photo by the author.
image, also referred to as the color spectrum. A color histogram (for a given
color channel, for example, red) is formed by counting the number of pixels
taking any given value in the allowed range [0,255], and then (optionally) normalizing the histogram by dividing by the total number of pixels in the image.
Thus, a (normalized) histogram for a given color channel can be viewed as a
set of 256 bins, each bin measuring the fraction of the image pixels taking the
color encoded by the bin number; see also Subsect. 4.3.6 below. Of course, it is
also possible to generate a gray scale histogram.
An example is shown in Fig. 4.2. Here, the histograms for the red, green,
and blue channels were extracted for the image on the left. Next, the image
was converted to grayscale, and the gray histogram was generated as well.
Note that the histogram plots use a relative scale, so that the bin with maximum content (for each channel) extends to the top of the corresponding plot.
Note also that the blue histogram has a strong spike at 0, making the rest of
that histogram look rather flat.
Listing 4.1: The constructor and the Lock method of the ImageProcessor class.
public ImageProcessor(Bitmap bitmap)
{
    this.bitmap = new Bitmap(bitmap);
    Lock();
}

private void Lock()
{
    bitmapData = this.bitmap.LockBits(
        new Rectangle(0, 0, this.bitmap.Width, this.bitmap.Height),
        ImageLockMode.ReadWrite, this.bitmap.PixelFormat);
    isLocked = true;
}
4.2 The ImageProcessing library
The IPA libraries include a C# class library for image processing, namely the
ImageProcessingLibrary. In order to speed up the various image processing tasks, this library makes use of two important concepts, namely locked
bitmaps and parallel processing. Locking a bitmap in memory allows the program to access and manipulate the image pixels (much) faster than with the
GetPixel and SetPixel methods of the Bitmap class. Moreover, in some
(but not all) cases, the pixel operations necessary to process an image occur in
sequence and independently of each other. In such cases, one can make use of
the parallel processing methods (available in the System.Threading.Tasks
namespace) to further speed up the processing.
4.2.1 The ImageProcessor class
When an instance of the ImageProcessor class is generated, it begins by making a copy of the bitmap, and then locking the copy in memory, using the
LockBits method, as described in Listing 4.1. The image processor is then
ready to carry out various operations on the locked bitmap. The list of public
methods in the ImageProcessor class is given in Table 4.1.
Locking the bitmap takes some time, since a copy of the bitmap is made before locking occurs (so that the original bitmap can be used for other purposes
while the image processor uses the copy). Thus, the normal usage is to first
generate the image processor by calling the constructor, then carrying out a
sequence of operations, of the kinds described below, then calling the Release
method, reading off the processed bitmap, and then disposing the image processor. The last step is important since, even though the garbage collector in
.NET will eventually dispose of the image processor (and, more importantly, the
associated bitmap), it may take some time before it does so. If one is processing
Listing 4.2: An example of the typical usage of the ImageProcessor class. In this example, it
is assumed that a bitmap is available. The first few lines just define some input variables, in
order to avoid ugly hard-coding of numerical parameters as inputs to the various methods.
double relativeContrast = 1.2;
double relativeBrightness = 0.9;
int binarizationThreshold = 127;
ImageProcessor imageProcessor = new ImageProcessor(bitmap);
imageProcessor.ChangeContrast(relativeContrast);
imageProcessor.ChangeBrightness(relativeBrightness);
imageProcessor.ConvertToStandardGrayscale();
imageProcessor.Binarize(binarizationThreshold);
imageProcessor.Release();
Bitmap processedBitmap = imageProcessor.Bitmap;
imageProcessor.Dispose();
Listing 4.3: A code snippet showing the setup of a camera. Once the relevant parameters
have been specified, the camera is started. Moreover, a pointer to the camera is passed to a
CameraViewControl which is responsible for showing the image stream from the camera.
...
camera = new Camera();
camera.DeviceName = Camera.GetDeviceNames()[0];
camera.ImageWidth = 640;
camera.ImageHeight = 480;
camera.FrameRate = 25;
camera.Start();
cameraViewControl.SetCamera(camera);
cameraViewControl.Start();
...
a video stream generating, say, 25 images per second, the memory usage may
become very large (even causing an out-of-memory error) before the garbage
collector has time to remove the image processors.
An example of a typical usage of the ImageProcessor class is given in Listing 4.2. In this example, an image processor is generated that first changes the
contrast and the brightness of the image, then converts it to grayscale before,
finally, carrying out binarization. The various methods shown in this example
are described in the text below. Even though it is not immediately evident from
the code in Listing 4.2, the step in which the processed bitmap is obtained also
involves copying the image residing in the image processor, so that the latter
can then safely be disposed.
Method: Description
ChangeContrast: Changes the (relative) contrast of an image.
ChangeBrightness: Changes the (relative) brightness of an image.
ConvertToGrayscale: Converts a color image to grayscale using parameters specified by the user.
ConvertToStandardGrayscale: Converts a color image to grayscale using default parameters.
Binarize: Binarizes a grayscale image, using a single (non-adaptive) threshold.
GenerateHistogram: Generates the histogram for a given color channel (red, green, blue, or gray).
Convolve: Convolves an image with an N × N mask, where N ≥ 3 is an odd number.
BoxBlur3x3: Blurs an image, using a 3 × 3 box convolution mask.
GaussianBlur3x3: Blurs an image, using a 3 × 3 Gaussian convolution mask.
Sharpen3x3: Sharpens the image, using a convolution mask of size 3 × 3.
SobelEdgeDetect: Carries out Sobel edge detection on a grayscale image.
StretchHistogram: Stretches the histogram of the image, in order to enhance the contrast.
Table 4.1: Brief summary of (some of) the public methods in the ImageProcessor class. For
more complete descriptions, see Sect. 4.3.
4.2.2 The Camera class
The Camera class is used for reading an image stream from a video camera,
for example a web camera. This class makes use of the CaptureDevice class
that, in turn, uses classes from the DirectShowLib library that contains the
methods required for low-level camera access. In the Camera class, a separate thread is started that reads the current image from the capture device and
stores it in a bitmap that can be accessed in a thread-safe manner (see Sect. A.4)
by other classes, for example the CameraViewControl user control, which is
also included in the ImageProcessing library, and which uses a separate
thread for displaying the most recent bitmap available in the corresponding
Camera instance. Thus, it can run with a different updating frequency compared to the camera itself. The ImageProcessing library also contains a
CameraSetupControl user control, in which the user can set the various
parameters (e.g. brightness, contrast etc.) of a camera.
Listing 4.3 shows a code snippet in which a camera is set up, in this case
using the first available camera device (there might of course be several cameras available).
Listing 4.4: The ChangeContrast method.
public void ChangeContrast(double alpha)
{
    unsafe
    {
        int bytesPerPixel = Bitmap.GetPixelFormatSize(bitmap.PixelFormat) / 8;
        int widthInBytes = bitmapData.Width * bytesPerPixel;
        byte* PtrFirstPixel = (byte*)bitmapData.Scan0;
        Parallel.For(0, bitmapData.Height, y =>
        {
            byte* currentLine = PtrFirstPixel + (y * bitmapData.Stride);
            for (int x = 0; x < widthInBytes; x = x + bytesPerPixel)
            {
                double oldBlue = currentLine[x];
                double oldGreen = currentLine[x + 1];
                double oldRed = currentLine[x + 2];
                int newBlue = (int)Math.Round(128 + (oldBlue - 128) * alpha);
                int newGreen = (int)Math.Round(128 + (oldGreen - 128) * alpha);
                int newRed = (int)Math.Round(128 + (oldRed - 128) * alpha);
                if (newBlue < 0) { newBlue = 0; }
                else if (newBlue > 255) { newBlue = 255; }
                if (newGreen < 0) { newGreen = 0; }
                else if (newGreen > 255) { newGreen = 255; }
                if (newRed < 0) { newRed = 0; }
                else if (newRed > 255) { newRed = 255; }
                currentLine[x] = (byte)newBlue;
                currentLine[x + 1] = (byte)newGreen;
                currentLine[x + 2] = (byte)newRed;
            }
        });
    }
}
Once the resolution and frame rate have been set, the camera
is started. A pointer to the camera is then passed to a CameraViewControl
that, once started, displays the camera image, in this case with the same frame
rate as the camera.
4.3 Basic image processing
This section introduces and describes some common image processing operations, which are often used as parts of the more advanced image processing
tasks considered in Sect. 4.4 below. Here, the value of a pixel in an unspecified
color channel (i.e. either red, green, or blue) is generally denoted P ≡ P(i, j).
Thus, for a given pixel and a given color channel, P is an integer in the range
[0, 255]. Some of the operations below may result in non-integer values. The
pixel value is then set as the nearest integer. If an operation results in a value
smaller than 0, the pixel value is set to 0. Similarly, if a value larger than 255
is obtained, the pixel value is set to 255. For grayscale images, the gray level
(also in the range [0, 255]) is denoted Γ(i, j).
Method: Computation time (s)
Locked bitmaps, parallel processing (Listing 4.4): 0.0124
Locked bitmap, sequential processing: 0.0522
Direct pixel access, using GetPixel and SetPixel: 2.37
Table 4.2: A speed comparison involving three different methods for changing the contrast
of an image with 1600 × 1067 pixels, using a computer with an Intel Core i7 processor running at 3.4 GHz. The parallel method given in Listing 4.4 reduces the computation time by
around 76% compared to a sequential method, and by more than 99% compared to the method
involving direct pixel access.
4.3.1 Contrast and brightness
The contrast and brightness of an image can be controlled using a simple linear
transformation, even though non-linear transformations exist as well. For a
given pixel value (for some color channel), the transformation
P ← α(P − 128) + 128 + β,   (4.7)
transforms both the contrast (controlled by α) and the brightness (controlled
by β) of an image. The method ChangeContrast takes α as input, and
changes the image using the transformation in Eq. (4.7), with β = 0, whereas
the ChangeBrightness method takes the relative brightness br as input, from
which β is obtained as
β = 255(br − 1),   (4.8)
after which Eq. (4.7) is applied, with α = 1. Note that β is, of course, rounded
to the nearest integer. It should also be noted that operations which change
contrast or brightness are not necessarily reversible, since any pixel value above
255 will be set to 255, and any value below 0 will be set to 0.
Listing 4.4 shows the implementation of the ChangeContrast method,
and also illustrates the syntax for parallel processing. As can be seen in the
listing, the method begins with the unsafe keyword, which should be applied when carrying out pointer operations (such as accessing the bytes of a
locked bitmap). The method then runs through the lines of the image, changing the contrast of each pixel as described above. The Parallel.For syntax
implies that different rows are processed in parallel. Note that the transformations applied to a pixel are independent of the transformations applied to any
other pixel. This is important since, with a parallel for-loop, the operations
may occur in any order. Of course, one could have used a standard (sequential) for-loop as well, but the parallel syntax does lead to a rather significant
speedup. To illustrate this, two additional methods were tested, one that runs
through the locked bitmap as in Listing 4.4, but with a standard for-loop instead of the parallel for-loop, and one that directly accessed the pixels of the
image (without even locking the bitmap), using the GetPixel and SetPixel
methods. The results are summarized in Table 4.2. As is evident from the table,
the parallel method is by far the fastest.
4.3.2 Grayscale conversion
The transformation of a color image to a grayscale image involves compressing
the information in the three color channels (red, green, and blue) into a single
channel (gray). In practice, a gray value is computed, and that single value is
then applied to the three color channels. The general transformation can be
written
Γ = fr R + fg G + fb B,   (4.9)
where fr , fg , and fb are the red, green, and blue fractions, respectively. The
method ConvertToGrayscale takes these three fractions (all in the range
[0, 1], and with a sum of 1) as inputs, and then carries out the transformation
in Eq. (4.9), rounding the values of Γ to the nearest integer. As mentioned in
Sect. 4.1, the settings fr = 0.299, fg = 0.587, and fb = 0.114 are commonly used
in grayscale conversion. The method ConvertToStandardGrayscale, which
does not take any inputs, uses these values.
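As a minimal, self-contained illustration of Eq. (4.9) for a single pixel (a sketch only; the library methods instead operate on locked bitmaps, as in Listing 4.4):
// Sketch of Eq. (4.9); call with (0.299, 0.587, 0.114) for the standard conversion.
public static int ToGray(int r, int g, int b, double fr, double fg, double fb)
{
    int gray = (int)Math.Round(fr * r + fg * g + fb * b);
    if (gray < 0) { gray = 0; }
    else if (gray > 255) { gray = 255; }
    return gray;
}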
4.3.3 Binarization
In its simplest form, the process of binarization uses a single threshold (the
binarization threshold), and sets the color of any pixel whose gray level value
is below the threshold to black. All other pixels are set to white. Note that
this process should be applied to a grayscale image, rather than a color image.
The ImageProcessor class implements this simple form of binarization in
its Binarize method.
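A minimal sketch of this operation, applied to a grayscale image stored as a two-dimensional array, could look as follows (the library's Binarize method instead operates on a locked bitmap):
// Sketch of single-threshold binarization (not the library implementation).
public static void Binarize(int[,] gray, int threshold)
{
    for (int i = 0; i < gray.GetLength(0); i++)
    {
        for (int j = 0; j < gray.GetLength(1); j++)
        {
            // Below the threshold: black (0); otherwise: white (255).
            gray[i, j] = (gray[i, j] < threshold) ? 0 : 255;
        }
    }
}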
However, in practical applications, one must often handle brightness variations across the image. Thus, some form of adaptive threshold is required,
something that will be discussed in Subsect. 4.4.1 below.
4.3.4 Image convolution
Many image operations, e.g. blurring and sharpening, can be formulated as
a convolution, i.e. a process in which one passes a matrix (the convolution
mask) over an image and changes the value of the center pixel using matrix
multiplication. More precisely, convolution using an N × N mask (denoted C)
changes the value P (i, j) of pixel (i, j) as
\[
P(i,j) \leftarrow \sum_{k=1}^{N} \sum_{m=1}^{N} C(k,m)\, P(i - \nu + k - 1,\; j - \nu + m - 1), \qquad (4.10)
\]
Figure 4.3: An example of sharpening, using the convolution mask Cs . Photo by the author.
where ν = (N − 1)/2, and N is assumed to be odd. The mask is passed over
each pixel in the image1 , changing the value of the central pixel in each step.
The pixel value is then rounded to the nearest integer. The ImageProcessor
class contains a method Convolve, which takes as input a convolution mask
(in the form of a List<List<double>>), and then carries out convolution as
in Eq. (4.10).
Of course, the result of a convolution depends on the elements in the convolution mask. By setting those elements to appropriate values (see below),
one can carry out, for example, blurring and sharpening. It should be noted,
however, that convolutions can be computationally costly, since a matrix multiplication must be carried out for each pixel.
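As a self-contained illustration of Eq. (4.10) (a sketch only, not the library's Convolve method, which operates on locked bitmaps), the following method convolves a grayscale image stored as a two-dimensional array, leaving boundary pixels unchanged:
// Sketch of Eq. (4.10) for a grayscale array (not the library implementation).
public static int[,] Convolve(int[,] image, double[,] mask)
{
    int height = image.GetLength(0), width = image.GetLength(1);
    int n = mask.GetLength(0);              // N (odd)
    int nu = (n - 1) / 2;                   // the quantity ν in Eq. (4.10)
    int[,] result = (int[,])image.Clone();  // boundary pixels are left unchanged
    for (int i = nu; i < height - nu; i++)
    {
        for (int j = nu; j < width - nu; j++)
        {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
            {
                for (int m = 0; m < n; m++)
                {
                    sum += mask[k, m] * image[i - nu + k, j - nu + m];
                }
            }
            int value = (int)Math.Round(sum);
            result[i, j] = Math.Max(0, Math.Min(255, value));
        }
    }
    return result;
}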
Blurring
Consider the convolution mask
\[
C_b = \frac{1}{9} \begin{pmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{pmatrix}. \qquad (4.11)
\]
When this mask is passed over the image, the value of any pixel is set to the average in a 3 × 3 region centered around the pixel in question, resulting in a distinct blurring of the image. The matrix Cb defines so called box blurring. This
kind of blurring, with N = 3, is implemented in the ImageProcessor class as
BoxBlur3x3. Of course, one can use a larger convolution mask (e.g. N = 5),
and pass it to the Convolve method described above. The BoxBlur3x3 simply provides a convenient shortcut to achieve blurring with N = 3, which is
1 Except boundary pixels, for which the mask would extend outside the image. Such pixels
are normally ignored, i.e. their values are left unchanged. Alternatively, one can extend the
image (a process called padding) by adding a frame, (N − 1)/2 pixels wide, around it.
usually sufficient. Blurring can be achieved in different ways. For example, one may instead use the mask
\[
C_g = \frac{1}{16} \begin{pmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{pmatrix}, \qquad (4.12)
\]
thus obtaining Gaussian blurring, so called since the matrix approximates a
two-dimensional Gaussian. The method GaussianBlur3x3 carries out such
blurring, using the matrix defined in Eq. (4.12).
Sharpening
For any γ > 0, the mask
\[
C_s = \begin{pmatrix}
-\gamma/8 & -\gamma/8 & -\gamma/8 \\
-\gamma/8 & 1+\gamma  & -\gamma/8 \\
-\gamma/8 & -\gamma/8 & -\gamma/8
\end{pmatrix}, \qquad (4.13)
\]
results in a sharpening of the image. The parameter γ is here referred to as
the sharpening factor. Sharpening using a 3 × 3 mask is implemented in the
method Sharpen3x3, which takes the sharpening factor as input. An example
is shown in Fig. 4.3.
4.3.5 Obtaining histograms
The ImageProcessor class contains a method for obtaining histograms, namely
GenerateHistogram, which takes a color channel as input (represented as
a ColorChannel enum object, with the possible values Red, Green, Blue,
and Gray). Note that the method does not carry out grayscale conversion.
Thus, in order to obtain the gray histogram, one must first convert the image
to grayscale, then apply the GenerateHistogram method. The method will
then pick an arbitrary channel (in this case, blue) and generate the histogram.
Generating the histogram for any color channel is straightforward, except
for one thing: If the histogram is to be generated using a parallel for-loop, one
must be careful when incrementing the contents of the histogram bins. This is
so, since the standard ++ operator in C# is not thread-safe: Whenever this operator is called, the value contained at the memory location in question is loaded,
then incremented, and then the new value (i.e. the old value plus one) is assigned to the memory location. However, since the increment takes some time,
it is perfectly possible for a situation to occur where the same value is loaded by
two different threads, incremented (in each thread), and then assigned again,
so that the total increment is one, not two. In order to avoid such errors, one
can use the lock keyword in C#. However, for simple operations, such as
Listing 4.5: The GenerateHistogram method, illustrating the use of the Interlocked class for
thread-safe increments.
public ImageHistogram GenerateHistogram(ColorChannel colorChannel)
{
    ImageHistogram imageHistogram = new ImageHistogram();
    int[] pixelNumberArray = new int[256];
    unsafe
    {
        int bytesPerPixel = Bitmap.GetPixelFormatSize(bitmap.PixelFormat) / 8;
        int widthInBytes = bitmapData.Width * bytesPerPixel;
        byte* PtrFirstPixel = (byte*)bitmapData.Scan0;
        Parallel.For(0, bitmapData.Height, y =>
        {
            byte* currentLine = PtrFirstPixel + (y * bitmapData.Stride);
            for (int x = 0; x < widthInBytes; x = x + bytesPerPixel)
            {
                byte pixelValue = 0;
                if (colorChannel == ColorChannel.Red)
                { pixelValue = currentLine[x + 2]; }
                else if (colorChannel == ColorChannel.Green)
                { pixelValue = currentLine[x + 1]; }
                else { pixelValue = currentLine[x]; }
                Interlocked.Increment(ref pixelNumberArray[(int)pixelValue]);
            }
        });
    }
    imageHistogram.PixelNumberList = pixelNumberArray.ToList();
    return imageHistogram;
}
incrementing, there is a faster way, namely to use the Increment method in
the static Interlocked class. This method makes sure to carry out what is
known as an atomic (thread-safe) increment, meaning that no increments are
omitted. The use of this method is illustrated in Listing 4.5. Note that, here,
the increment is carried out on the elements of an array rather than just a single integer variable. This is allowed for arrays (of fixed length), but not for a
generic List (e.g. List<int>). Thus, as shown in the listing, the counting of
pixel values is carried out in an array of length 256 and, at the very end, this
array is converted to a list, which is then assigned to the image histogram.
4.3.6 Histogram manipulation
Images taken in, for example, adverse lighting conditions are often too bright
or too dark relative to an image taken under perfect conditions. Consider the
image in the left panel of Fig. 4.4. Here, while reasonably sharp, the image
still appears somewhat hazy and rather pale. The histogram, shown below the
image, confirms this: The image contains only grayscale values in the range 68
to 251. In order to improve the contrast, one can of course apply the methods
described in Subsect. 4.3.1. However, those methods do not provide prescriptions
Figure 4.4: Histogram stretching. The panels on the left show an image with poor contrast,
along with its histogram. The panels on the right show the image after stretching with p =
0.025, as well as the resulting histogram.
for the suitable parameter settings. Thus, there is a risk that one might
increase (or decrease) the contrast too much. There are several methods for
automatically changing the contrast and brightness in an image, in a way that
will give good (or at least acceptable) results over a large set of lighting conditions. These methods are generally applied to grayscale images.
One such method is histogram stretching. In this method, one first generates the (grayscale) histogram H(j), j = 0, . . . , 255. Then, normalization is
applied, resulting in the normalized histogram
\[
H_n(j) = \frac{H(j)}{\sum_{j'=0}^{255} H(j')}, \qquad j = 0, \ldots, 255. \qquad (4.14)
\]
Finally, the cumulative histogram is generated according to Hc(0) = Hn(0) and
\[
H_c(j) = H_c(j-1) + H_n(j), \qquad j = 1, \ldots, 255. \qquad (4.15)
\]
Thus, for any j, Hc (j) determines the fraction of pixels having gray level j or
darker. Next, one identifies the bin index jlow corresponding to a given fraction
p of the total number of pixels, as well as the bin index jhigh corresponding to
the fraction 1 − p. Thus, jlow is the smallest j such that Hc (j) > p, while jhigh
is the largest j such that Hc (j) < 1 − p. Then, any pixel with gray level below
jlow is set to black (i.e. gray level 0) and any pixel with gray level above jhigh is
set to white (gray level 255). For pixels with gray levels in the range [jlow , jhigh ]
new gray levels are generated as
\[
\Gamma_{\rm new} = 255\, \frac{\Gamma - j_{\rm low}}{j_{\rm high} - j_{\rm low}}. \qquad (4.16)
\]
Thus, after this stretching, the histogram will cover the entire range from 0 to
255. An example is shown in the right-hand part of Fig. 4.4, where the upper
panel shows the image after stretching with p = 0.025 and the lower panel
shows the corresponding histogram. In this particular case, jlow was found to
be 108, and jhigh was found to be 197.
The reason for using a value of p > 0 (but smaller than 0.5) is that, even for
an image with poor contrast, there might be a few pixels with gray level 0 and
a few pixels with gray level 255, in which case the stretching would have no
effect, as can easily be seen from Eq. (4.16). By choosing a small positive value
of p, as in the example above, the stretching will produce a non-trivial result.
Typical values of p fall in the range from 0.01 to 0.05.
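A self-contained sketch of the procedure, operating on a grayscale image stored as a two-dimensional array, could look as follows (the StretchHistogram method in the ImageProcessor class instead operates on a locked bitmap):
// Sketch of histogram stretching, Eqs. (4.14)-(4.16) (not the library implementation).
public static void StretchHistogram(int[,] gray, double p)
{
    int height = gray.GetLength(0), width = gray.GetLength(1);
    double total = height * width;
    double[] histogram = new double[256];
    foreach (int value in gray) { histogram[value] += 1.0; }
    // Cumulative, normalized histogram, Eqs. (4.14) and (4.15):
    double[] cumulative = new double[256];
    double sum = 0.0;
    for (int k = 0; k < 256; k++) { sum += histogram[k] / total; cumulative[k] = sum; }
    // jLow: smallest j such that Hc(j) > p; jHigh: largest j such that Hc(j) < 1 - p.
    int jLow = 0, jHigh = 255;
    for (int k = 0; k < 256; k++) { if (cumulative[k] > p) { jLow = k; break; } }
    for (int k = 255; k >= 0; k--) { if (cumulative[k] < 1.0 - p) { jHigh = k; break; } }
    if (jHigh <= jLow) { return; } // degenerate case: nothing to stretch
    // Remap the gray levels according to Eq. (4.16):
    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            if (gray[i, j] < jLow) { gray[i, j] = 0; }
            else if (gray[i, j] > jHigh) { gray[i, j] = 255; }
            else { gray[i, j] = (int)Math.Round(255.0 * (gray[i, j] - jLow) / (jHigh - jLow)); }
        }
    }
}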
The method just described stretches the histogram, but does not change it
in any other way. An alternative approach is to apply histogram equalization, in which one attempts to make the histogram as flat as possible, i.e. with
roughly equal number of pixels in each bin. This method will not be described
in detail here, however.
4.3.7 Edge detection
In edge detection, the aim is to locate sharp changes in intensity (i.e. edges) that
usually define the boundaries of objects. Edge detection is thus an important
step in (some methods for) object detection. It is also biologically motivated:
Evidence from neurophysiology indicates that sharp edges play a central role
in object detection in animals. There are many edge detection methods, and
they typically make use of convolutions of the kind described above. However, as will be shown below, one can sometimes summarize the results of
repeated convolution by carrying out a single so called pseudo-convolution
over the image.
One of the most successful edge detection methods is the Canny edge detector [4]. In addition to carrying out some convolutions, this method also uses
a few pre- and post-processing steps. For example, in Canny edge detection,
one blurs the image before carrying out edge detection, in order to remove
noise. Here, only the central component of the Canny edge detector will be
studied, namely the convolutions. Consider the two convolution masks
\[
C_1 = \begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix} \qquad (4.17)
\]
and
\[
C_2 = \begin{pmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{pmatrix}. \qquad (4.18)
\]
These two masks detect horizontal and vertical edges, respectively. Together
(and sometimes augmented by two additional masks for detecting diagonal
edges), the masks define the so called Sobel operator for edge detection. By
convolving a given (normally grayscale) image Γ using C1 and then convolving (the original) image using C2 one obtains two images Γx and Γy , whose
pixel values can then be combined to form an edge image Γe as
\[
\Gamma_e(i,j) = \sqrt{\Gamma_x(i,j)^2 + \Gamma_y(i,j)^2}. \qquad (4.19)
\]
However, since this computation can be a bit time-consuming, one often uses
the simpler procedure
Γe (i, j) = |Γx (i, j)| + |Γy (i, j)|   (4.20)
instead. In that case, one can generate the edge image by a single pass (a
pseudo-convolution) through the original image, setting the value of a given
pixel as
\[
\begin{aligned}
\Gamma_e(i,j) \leftarrow\; & \bigl| \bigl(\Gamma(i-1,j-1) + 2\Gamma(i,j-1) + \Gamma(i+1,j-1)\bigr)
 - \bigl(\Gamma(i-1,j+1) + 2\Gamma(i,j+1) + \Gamma(i+1,j+1)\bigr) \bigr| \\
 +\; & \bigl| \bigl(\Gamma(i+1,j-1) + 2\Gamma(i+1,j) + \Gamma(i+1,j+1)\bigr)
 - \bigl(\Gamma(i-1,j-1) + 2\Gamma(i-1,j) + \Gamma(i-1,j+1)\bigr) \bigr|. \qquad (4.21)
\end{aligned}
\]
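As a self-contained illustration of the pseudo-convolution in Eq. (4.21), operating on a grayscale image stored as a two-dimensional array and skipping boundary pixels (a sketch only, not the library's SobelEdgeDetect method):
// Sketch of the Sobel pseudo-convolution, Eq. (4.21) (not the library implementation).
public static int[,] SobelEdgeDetect(int[,] g)
{
    int height = g.GetLength(0), width = g.GetLength(1);
    int[,] edges = new int[height, width]; // boundary pixels are left at zero
    for (int i = 1; i < height - 1; i++)
    {
        for (int j = 1; j < width - 1; j++)
        {
            // Horizontal and vertical responses, as in Eq. (4.21):
            int gx = (g[i - 1, j - 1] + 2 * g[i, j - 1] + g[i + 1, j - 1])
                   - (g[i - 1, j + 1] + 2 * g[i, j + 1] + g[i + 1, j + 1]);
            int gy = (g[i + 1, j - 1] + 2 * g[i + 1, j] + g[i + 1, j + 1])
                   - (g[i - 1, j - 1] + 2 * g[i - 1, j] + g[i - 1, j + 1]);
            edges[i, j] = Math.Min(255, Math.Abs(gx) + Math.Abs(gy));
        }
    }
    return edges;
}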
4.3.8 Integral image
The integral image (or summed area table) concept is useful in cases where
one needs to form the sum (or average) over many regions in an image. Once
an integral image has been formed, the sum of the pixel values within any
given rectangular region of the image can be obtained with one addition and
two subtractions. Consider, for simplicity, the case of a binary image, in which
pixels P (i, j) either take the value 0 (black) or the value 1 (white), as illustrated
in Fig. 4.5. In such images, a white pixel is also called a foreground pixel,
whereas a black pixel is referred to as a background pixel. The integral image
I(i, j) is defined as the sum of all pixels above and to the left of (i, j). Thus,
\[
I(i,j) = \sum_{i' \leq i,\, j' \leq j} P(i', j'). \qquad (4.22)
\]
The integral image can be formed in a single pass through the image, using
the difference equation
I(i, j) = P(i, j) + I(i − 1, j) + I(i, j − 1) − I(i − 1, j − 1).   (4.23)
Figure 4.5: The left panel shows a small black-and-white image with 5 × 5 pixels. The
corresponding integral image is shown in the middle panel. For the panel on the right, the sum
of the pixel values (7) in the red rectangle can be obtained using Eq. (4.24).
Note that, in the right-hand side of this equation, I is set to zero in case either
(or both) indices are negative. Once the integral image has been obtained, the
sum of pixels in a given rectangular region can be obtained as
\[
\sum_{\substack{i_0 < i \leq i_1 \\ j_0 < j \leq j_1}} P(i,j) = I(i_0, j_0) + I(i_1, j_1) - I(i_1, j_0) - I(i_0, j_1). \qquad (4.24)
\]
Obviously, if all that is needed is a sum of pixels in one or a few regions,
the computational effort needed to compute the integral image may be prohibitive. However, in cases where pixel sums (or averages) are needed for
many regions in the image, as for example in some face detection algorithms
(such as the Viola-Jones algorithm; see Subsect. 4.4.3 below), the integral image is a rapid way of obtaining the required information. As a specific example, consider the right panel of Fig. 4.5. Using Eq. (4.24), the sum of the pixel
values in the red rectangle can be computed as
\[
\sum_{\substack{0 < i \leq 3 \\ 0 < j \leq 3}} P(i,j) = I(0,0) + I(3,3) - I(3,0) - I(0,3) = 0 + 11 - 2 - 2 = 7. \qquad (4.25)
\]
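A self-contained sketch of Eqs. (4.23) and (4.24), for an image stored as a two-dimensional array, could look as follows (not library code):
// Builds the integral image using the difference equation, Eq. (4.23).
public static long[,] ComputeIntegralImage(int[,] p)
{
    int height = p.GetLength(0), width = p.GetLength(1);
    long[,] integral = new long[height, width];
    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            long left = (i > 0) ? integral[i - 1, j] : 0;
            long above = (j > 0) ? integral[i, j - 1] : 0;
            long diagonal = (i > 0 && j > 0) ? integral[i - 1, j - 1] : 0;
            integral[i, j] = p[i, j] + left + above - diagonal;
        }
    }
    return integral;
}

// Sum of pixels with i0 < i <= i1 and j0 < j <= j1, as in Eq. (4.24).
public static long RectangleSum(long[,] integral, int i0, int j0, int i1, int j1)
{
    return integral[i0, j0] + integral[i1, j1] - integral[i1, j0] - integral[i0, j1];
}
For the example in Fig. 4.5, RectangleSum(integral, 0, 0, 3, 3) returns 0 + 11 − 2 − 2 = 7, in agreement with Eq. (4.25).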
4.3.9 Connected component extraction
In many image processing tasks, e.g. object detection, text interpretation (for
example, reading zip codes) etc., it is often necessary to find groups of pixels
that are connected to each other (forming, for example, a face or a letter), using
some measure of connectivity. In image processing, such groups of pixels are
referred to as connected components. The concept of connected components
is most easily illustrated for the case of binary images, which is the only case
that will be considered here, even though the process can be generalized to
gray scale or even color images.
Figure 4.6: An example of object detection using connected components. The upper panels
show the preprocessing steps, resulting in a binarized image. After removing all but the two
largest connected components, the image in the lower left panel is obtained. The lower middle
and lower right panels show the connected components labeled 1 and 2, respectively. Photo by
the author.
The measure of connectivity is typically taken as either 4-connectivity or 8-connectivity. In the case of 4-connectivity, all foreground (white) pixels P(i, j)
are compared with the neighbors P (i − 1, j), P (i, j + 1), P (i + 1, j), and P (i, j −
1). If any of the neighbors also is a foreground pixel, that pixel and P (i, j)
belong to the same connected component. In the case of 8-connectivity, all
foreground (white) pixels P (i, j) are compared with the neighbors used in the
4-connectivity case, but also the neighbors P (i − 1, j − 1), P (i + 1, j + 1), P (i −
1, j + 1), and P (i + 1, j − 1).
There are several algorithms for finding the connected components in a
given (binarized) image, the details of which will not be given here, however.
Once the connected components have been found, one may apply additional
operators. For example, in order to find the dominant object in an image (for
example, a face), one may wish to remove all connected components except
the largest one. Obviously, such a simple procedure will not work under all
circumstances; if the image contains a (bright) object that is larger than the
face, the result may be an incorrect identification. Nevertheless, connected
component extraction is an important first step in many object detection tasks.
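As an illustration only (not the implementation used in the library, which may differ), a simple queue-based flood fill for labeling 4-connected components in a binary image could be sketched as follows:
// Illustrative sketch: labeling 4-connected components, where foreground pixels
// have the value 1. Each foreground pixel receives the positive label of its component.
public static int[,] LabelConnectedComponents(int[,] binary)
{
    int height = binary.GetLength(0), width = binary.GetLength(1);
    int[,] labels = new int[height, width];
    int[] di = { -1, 1, 0, 0 };
    int[] dj = { 0, 0, -1, 1 };
    int nextLabel = 0;
    Queue<int> queue = new Queue<int>();
    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            if (binary[i, j] != 1 || labels[i, j] != 0) { continue; }
            nextLabel++;
            labels[i, j] = nextLabel;
            queue.Enqueue(i * width + j);
            while (queue.Count > 0)
            {
                int index = queue.Dequeue();
                int ci = index / width, cj = index % width;
                for (int k = 0; k < 4; k++)
                {
                    int ni = ci + di[k], nj = cj + dj[k];
                    if (ni >= 0 && ni < height && nj >= 0 && nj < width &&
                        binary[ni, nj] == 1 && labels[ni, nj] == 0)
                    {
                        labels[ni, nj] = nextLabel;
                        queue.Enqueue(ni * width + nj);
                    }
                }
            }
        }
    }
    return labels;
}
If needed, the resulting labels can afterwards be renumbered in order of decreasing component size, as assumed in the example that follows.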
An example is shown in Fig. 4.6. Here, the objective was to identify the pixels constituting the blue pot shown in the upper left panel. The image was first
inverted (since the pot is rather dark), as shown in the upper middle panel, and
was then converted to grayscale. Next, the image was binarized (with a threshold
of 150, in this case), resulting in the image shown in the upper right panel.
Then, the connected components were extracted. In the final step, the results
of which are shown in the lower left panel, only the two largest connected
components have been kept. When the connected components have been extracted, the pixels belonging to a given connected component are labeled (even
though this is not visible in the figure) using an integer, e.g. 1 for the pot and 2
for the island, in this case, assuming that the labels have been sorted according
to the size (number of pixels) of the connected components. In the lower middle panel, the pixels of the connected component labeled 1 (the pot) are shown,
whereas the lower right panel shows the pixels of connected component 2 (the
island). Once the connected components are available, other techniques, such
as matching the pixels in the connected components to a pot-shaped template,
can be used for determining which of the two connected components represents the pot.
4.3.10 Morphological image processing
In morphological image processing, a particular shape, referred to as a structuring element2, is passed over an image such that, at each step, the value of
a given pixel (relative to the position of the structuring element) is changed
if certain conditions are met. Typically, morphological image processing is
applied to binary (black-and-white) images. Even though generalizations to
grayscale and color images exist, here only binary images will be considered.
In general, a structuring element consists of pixels taking either the value
1 (white) or the value 0 (black). Consider Fig. 4.7. The structuring element is
shown to the left of the image. Now, when the structuring element is placed
at a given position over the image, one can compare the pixel values of the
structuring element to the pixel values of the part of the underlying image
covered by the structuring element. Thus, considering the structuring element
as a whole, one of three things can happen, illustrated in the figure, using the
three colors green, yellow, and red. The structuring element may (i) completely
match the part of the image that it covers. That is, every pixel in the structuring element matches its corresponding image pixel (green); (ii) partially match
the covered part of the image (yellow); or (iii) not match the covered part at
all (red). As the structuring element is passed over an image, there will, in general, be
some positions where it matches, some where it partially matches, and some
where it does not match at all. Depending on which of those situations that occurs for a given position of the structuring element, some action (exemplified
below) is applied to an image pixel at a given position relative to the structuring element (indicated by a ring in Fig. 4.7), referred to as the origin of the
structuring element. In the case of a symmetric structuring element, that position is
2 Note that a structuring element need not be a square; it can take any shape.
Figure 4.7: An example of a structuring element, shown in the left part of the figure. The
origin of the structuring element is at its center. The right part of the figure shows three different cases (i) one (green) in which the structuring element completely matches the foreground
(white) pixels, (ii) one (yellow) in which there is a partial match, and (iii) one in which there is
no match (red).
often, but not always, the pixel under the center of the structuring element.
Erosion
In erosion, for any (i, j) for which the structuring element completely matches
the covered part of the image, the origin is colored white. If the structuring element does not match completely, the origin is colored black. As the name
implies, erosion tends to chip away pixels at the inner and outer boundaries
in regions of foreground pixels, resulting in larger gaps between regions as
well as removal of small regions. An example of erosion is shown in Fig. 4.8.
Here, the structuring element in the left panel has been applied to the image
in the center panel, resulting in the eroded image shown in the right panel.
Dilation
In dilation, for any (i, j) such that the structuring element partially or completely matches
the image, the origin is set to white. If the structuring element does not match
at all, the origin is set to black. Dilation can be seen as the inverse of erosion,
as it tends to grow (and sometimes join) foreground regions of the image. An
example of dilation is shown in Fig. 4.9.
Figure 4.8: The right panel shows the results of carrying out erosion on the image shown in
the middle panel, using the structuring element shown in the left panel. The ring in the center
of the structuring element indicates the pixel currently under study.
Figure 4.9: The right panel shows the results of carrying out dilation on the image shown in
the middle panel, using the structuring element shown in the left panel.
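As a self-contained sketch (not the library implementation) of binary erosion, where image and structuring element pixels take the values 0 (black) and 1 (white), non-white element pixels are ignored as described above, and the origin of the element is assumed to be at its center:
// Sketch of binary erosion (not library code).
public static int[,] Erode(int[,] image, int[,] element)
{
    int height = image.GetLength(0), width = image.GetLength(1);
    int eh = element.GetLength(0), ew = element.GetLength(1);
    int oi = eh / 2, oj = ew / 2; // origin assumed at the center of the element
    int[,] result = new int[height, width];
    for (int i = 0; i + eh <= height; i++)
    {
        for (int j = 0; j + ew <= width; j++)
        {
            bool completeMatch = true;
            for (int k = 0; k < eh && completeMatch; k++)
            {
                for (int m = 0; m < ew; m++)
                {
                    // Every white element pixel must cover a foreground pixel.
                    if (element[k, m] == 1 && image[i + k, j + m] != 1)
                    { completeMatch = false; break; }
                }
            }
            result[i + oi, j + oj] = completeMatch ? 1 : 0;
        }
    }
    return result;
}
Dilation can be obtained analogously, by instead setting the origin to white whenever at least one white element pixel covers a foreground pixel.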
Other operators
In addition to erosion and dilation, there are also many other morphological
operators, for example opening (erosion followed by dilation) and closing (dilation followed by erosion).
Another common operation is hit-and-miss. In this case, one uses a more
complex structuring element: In the description above, the relevant parts of
the structuring element were the white pixels. For the hit-and-miss transform,
one needs structuring elements with pixels taking one of three different values, namely 1 (white, foreground pixel), 0 (black, background pixel), and x
(ignored pixel). Thus, to be strict, the non-white pixels in the structuring elements for erosion and dilation above should really have the value x rather than
0 but for simplicity they are often drawn as in the figures above. In any case,
in erosion and dilation, the non-white pixels are ignored. In the hit-and-miss
transform, by contrast, both the foreground pixels (1s) and the background
pixels (0s) must match in order for the pixel under the origin (see above) of the
structuring element to be set to the foreground color (white). Otherwise, it is
Figure 4.10: Left panel: A text image obtained by holding a single sheet of paper in front
of a web camera; middle panel: The result of binarization, using the (subjectively chosen) best
possible threshold; right panel: The result of adaptive thresholding using Sauvola’s method
with j = 7 and k = 0.23.
set to the background color. The hit-and-miss transform is typically used for
finding corners of foreground shapes.
Finally, the thinning operator, which is related to the hit-and-miss operator,
is used for reducing the width of lines and edges down to a single pixel (at
which point the thinning operation will no longer change the image).
4.4 Advanced image processing
4.4.1 Adaptive thresholding
Thresholding is the process of reducing the number of colors used in an image. A very important special case, which will be considered from now on, is
binarization, in which a (grayscale) image is converted into a two-color (black
and white) image, as discussed in Subsect. 4.3.3 above. However, in the common case in which the lighting varies over an image, using a global threshold,
as in Subsect. 4.3.3, may not produce very good results, as illustrated in the left
and middle panels of Fig. 4.10. Instead, one must use some form of adaptive
thresholding, in which the binarization threshold varies over the image.
An important application, in the case of IPAs, is the problem of reading
text in a (low-quality) image. For example, one can consider an IPA whose
task it is to help a visually impaired person to read a document held in front
of a (web) camera. Due to variations in lighting as well as the fact that light
generally shines through a single sheet of paper, the quality of the resulting
image is often quite low, as can be seen in the left panel of Fig. 4.10.
Many methods have been suggested for adaptive thresholding of (text)
images. Here, only two such methods will be introduced, namely Niblack’s
method [11] and Sauvola’s method [12]. In Niblack’s method one measures
the mean m and standard deviation σ over the area (j × j pixels, where j is an
odd integer larger than 1) surrounding the pixel under consideration. Then,
the binarization threshold Tn (for that pixel) is set at k standard deviations
above the mean:

$$T_n = m + k\sigma. \qquad (4.26)$$

This procedure is repeated for all pixels in the image. In Sauvola's method, the local binarization threshold Ts is instead computed as

$$T_s = m\left[1 + k\left(\frac{\sigma}{R} - 1\right)\right], \qquad (4.27)$$
where k is a parameter, and R is the maximum value of the standard deviation
over all of the j × j areas considered. Typical values are j = 9–21 and k = 0.2–0.5. Sauvola's method generally produces good results, even though there
are other methods that outperform it slightly (see e.g. [20]). The right panel of
Fig. 4.10 shows the results obtained with Sauvola’s method, with j = 7 and
k = 0.23. The apparent frame around the image is caused by the fact that the
j ×j matrix cannot be applied at the edges. Of course, one can easily solve that
problem by padding the image with white pixels, but this has not been done
here.
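As a concrete illustration of Sauvola's method, a minimal sketch is given below, written directly from the description above; the method name, the use of a byte array for the grayscale image, and the two-pass structure are choices made here for illustration, not code from the ImageProcessing library. Border pixels, where the j × j window does not fit, are left black, which produces exactly the frame effect mentioned above.

// Sauvola binarization: the first pass computes the local mean and standard deviation
// for every interior pixel (and the maximum standard deviation R over all windows);
// the second pass applies the threshold Ts = m * (1 + k * (sigma / R - 1)).
public static byte[,] SauvolaBinarize(byte[,] gray, int j, double k)
{
    int height = gray.GetLength(0);
    int width = gray.GetLength(1);
    int half = j / 2;
    double[,] mean = new double[height, width];
    double[,] stdDev = new double[height, width];
    double R = 0.0;
    for (int y = half; y < height - half; y++)
    {
        for (int x = half; x < width - half; x++)
        {
            double sum = 0.0, sumOfSquares = 0.0;
            for (int dy = -half; dy <= half; dy++)
            {
                for (int dx = -half; dx <= half; dx++)
                {
                    double value = gray[y + dy, x + dx];
                    sum += value;
                    sumOfSquares += value * value;
                }
            }
            double n = j * j;
            double m = sum / n;
            double sigma = Math.Sqrt(Math.Max(0.0, sumOfSquares / n - m * m));
            mean[y, x] = m;
            stdDev[y, x] = sigma;
            if (sigma > R) { R = sigma; }
        }
    }
    if (R <= 0.0) { R = 1.0; }   // guard against a completely flat image
    byte[,] result = new byte[height, width];
    for (int y = half; y < height - half; y++)
    {
        for (int x = half; x < width - half; x++)
        {
            double threshold = mean[y, x] * (1.0 + k * (stdDev[y, x] / R - 1.0));
            result[y, x] = (byte)(gray[y, x] > threshold ? 255 : 0);
        }
    }
    return result;
}

For Niblack's method, the only change would be to compute the threshold as mean[y, x] + k * stdDev[y, x] instead.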
4.4.2 Motion detection
Many applications involving video streams require detection of motion. For
example, an IPA may be required to determine whether or not a person has
just sat down in front of the IPA's camera(s) and then detect and recognize the
user’s gestures.
In its simplest form, motion detection consists of comparing a camera image at a given time with an image taken earlier. Consider, for simplicity, a
gray scale image, whose gray levels at time t will be denoted Γ(i, j; t). One can
then compare this image with an earlier image, with gray levels Γ(i, j; t − 1),
the rationale being that those pixels that differ will belong to moving objects.
Introducing a threshold T for the minimum required difference, one can determine which pixels fulfil the inequality
|Γ(i, j; t) − Γ(i, j; t − 1)| > T
(4.28)
and then, for example, set those pixels to white, and all other pixels to black.
However, this simple method will typically be quite brittle, as even in a supposedly static scene, there are almost always small brightness variations, some
of which typically exceed the threshold T , leading to incorrect detections. The
reverse problem appears as well: A person who sits absolutely still in the
camera’s field of view will fade to invisibility (unless, of course, the motion
detection method is combined with other approaches, e.g. face detection; see
below).
Background subtraction
A common special case, particularly relevant for IPAs, is the situation in which one can assume the existence of an essentially fixed background. Anything that causes the view to differ from the background (such as, for example, a person moving in the camera's field of view) will then constitute the foreground. In such situations, the problem of motion detection is often referred to as background subtraction. Background subtraction can be carried out both in color images and grayscale images. Here, only grayscale background subtraction will be considered. The (gray) intensity of pixels belonging to the background will be denoted B(i, j). Pixels that do not belong to the background are, by definition, foreground pixels.
A simple approach to background subtraction, based on frame differencing, is to start from an image which is known to represent only the background. In the case of an IPA, one may use the scene visible to the agent before
any person sits down in front of it. The inequality
|Γ(i, j; t) − B(i, j)| > T
(4.29)
will then find the pixels whose gray level differs from their background values
and which can thus be taken to represent the foreground. However, this naïve
approach does not, in general, work very well, partly because the background
pixel intensity will never be completely constant so that, even if there is no foreground object, some pixels will differ from their supposed background values
by an amount that exceeds the threshold T (unless a very high value is used
for T, but in that case one risks being unable to detect actual foreground objects!). Another serious problem with this approach is that, over the course of
a day, the light level of the background will change so that background pixels
will gradually drift into the foreground. This can of course happen instantaneously as well, if someone turns on a light, for example. The problem can
be somewhat reduced by forming the background image as an average over a
number of images (again without any foreground object present in the scene),
but even so, the method is rather error-prone.
A more robust approach is to make use of exponential Gaussian averaging, in which one maintains a probability density function for the intensity of
each pixel. Here, a pixel is considered as foreground only if its intensity differs
from the (average) background intensity by a certain number of standard deviations. Let µ(i, j; t) denote the average intensity of pixel (i, j), and σ²(i, j; t)
its variance. In order to initialize this method, one would set the average at
t = 0 as the current gray level of an image (containing background only), i.e.
µ(i, j; 0) = Γ(i, j; 0).
(4.30)
Figure 4.11: An example of background subtraction. Left panel: Snapshot from the camera stream; right panel: The result of background subtraction, just after the user completed a gesture (raising the right hand). Foreground pixels are shown in white. Here, the background was subtracted using exponential Gaussian averaging, with ρ = 0.025 and α = 1.3.

The initial variance value can be set, for example, as the variance computed using the pixels adjacent to (i, j). Then, the average and variance can be updated using a running (exponential) average

µ(i, j; t) = ρΓ(i, j; t) + (1 − ρ)µ(i, j; t − 1)  (4.31)

and

σ²(i, j; t) = ρδ(i, j)² + (1 − ρ)σ²(i, j; t − 1),  (4.32)

where ρ is the exponential averaging parameter, which takes values in the open range ]0, 1[, and

δ(i, j) = |Γ(i, j; t) − µ(i, j; t)|.  (4.33)
Using these equations, one can then compare the pixel intensity Γ(i, j; t) with
the average µ(i, j; t) and set a pixel as foreground if
|Γ(i, j; t) − µ(i, j; t)| > ασ(i, j; t).
(4.34)
That is, a pixel is considered to be in the foreground if it differs from its (running) average by more than α standard deviations. In addition to the method
just described, there are also methods (so called Gaussian mixture models)
that make use of multiple Gaussians for each pixel [14].
Background subtraction is also very much an active research field, in which
new methods appear continuously. One such method, with excellent performance, is the ViBe method [3]. For a recent review of background subtraction methods, see, for example, [13]. In addition to the background subtraction itself, many approaches also involve a degree of post-processing in order to improve precision, for example a sequence of morphological operations to remove noise, followed by connected component extraction.
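A minimal sketch of grayscale background subtraction based on exponential Gaussian averaging is given below. The class and field names, as well as the use of plain double arrays for the per-pixel mean and variance, are assumptions made here; the ImageProcessing library has its own implementation.

// Per-pixel running mean and variance, updated according to Eqs. (4.31)-(4.33);
// a pixel is classified as foreground if it deviates from its running mean by more
// than alpha standard deviations, Eq. (4.34).
public class ExponentialGaussianBackgroundModel
{
    private readonly double rho;    // exponential averaging parameter, in ]0, 1[
    private readonly double alpha;  // number of standard deviations for the foreground test
    private readonly double[,] mu;
    private readonly double[,] sigmaSquared;

    public ExponentialGaussianBackgroundModel(byte[,] backgroundFrame, double rho, double alpha, double initialVariance)
    {
        this.rho = rho;
        this.alpha = alpha;
        int height = backgroundFrame.GetLength(0);
        int width = backgroundFrame.GetLength(1);
        mu = new double[height, width];
        sigmaSquared = new double[height, width];
        for (int i = 0; i < height; i++)
        {
            for (int j = 0; j < width; j++)
            {
                mu[i, j] = backgroundFrame[i, j];        // Eq. (4.30)
                sigmaSquared[i, j] = initialVariance;    // simplified initialization
            }
        }
    }

    // Updates the model with a new frame and returns a foreground mask (true = foreground).
    public bool[,] Update(byte[,] frame)
    {
        int height = frame.GetLength(0);
        int width = frame.GetLength(1);
        bool[,] foreground = new bool[height, width];
        for (int i = 0; i < height; i++)
        {
            for (int j = 0; j < width; j++)
            {
                double gamma = frame[i, j];
                double delta = Math.Abs(gamma - mu[i, j]);
                foreground[i, j] = delta > alpha * Math.Sqrt(sigmaSquared[i, j]);
                mu[i, j] = rho * gamma + (1.0 - rho) * mu[i, j];
                sigmaSquared[i, j] = rho * delta * delta + (1.0 - rho) * sigmaSquared[i, j];
            }
        }
        return foreground;
    }
}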
Gesture recognition
In addition to mere detection of foreground objects, one may also wish to interpret their movements. For example, an IPA may have gestures as an input
modality, and must thus be able to carry out gesture recognition. This is also
an active topic for research, but it is beyond the scope of this text. A review
can be found in [10]. It should be noted that depth cameras, such as (the sensor used in) Microsoft's Kinect, are becoming more and more common in such
applications [17] even though many approaches are based on ordinary camera
images.
4.4.3 Face detection and recognition
Face detection and recognition are important processes in many IPAs. These
are also large and active research topics, in which new results appear continuously. Here, only a brief introduction will be given, with some references to
further reading.
Skin pixel detection
By itself, detection of skin pixels is not sufficient for finding (and tracking)
faces. However, in combination with other methods, such as edge detection
and connected component extraction, skin pixel detection often plays an important part in face detection.
Many methods have been defined for detecting skin pixels based on pixel
color. In the RGB color space, a common, but somewhat inelegant, rule3 for
detecting (mainly Caucasian) skin pixels says that a pixel represents skin if all
of the following rules are satisfied: (i) R > 95, (ii) G > 40, (iii) B > 20, (iv)
max{R, G, B} − min{R, G, B} > 15, (v) |R − G| > 15, (vi) |R − G| < 75, (vii)
R > G and, finally, (viii) R > B.
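The rule just given is straightforward to implement; a small sketch (a hypothetical helper, not part of the ImageProcessing library) is shown below.

// Returns true if the (R, G, B) triple satisfies conditions (i)-(viii) above.
public static bool IsSkinRGB(int r, int g, int b)
{
    int max = Math.Max(r, Math.Max(g, b));
    int min = Math.Min(r, Math.Min(g, b));
    return r > 95 && g > 40 && b > 20 &&
           (max - min) > 15 &&
           Math.Abs(r - g) > 15 && Math.Abs(r - g) < 75 &&
           r > g && r > b;
}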
However, this set of rules is rather sensitive to lighting conditions. Moreover, skin pixel detection must of course work over a range of different skin
colors. It turns out that more robust rules can be found in the YCbCr space (see Subsect. 4.1.1 above), by excluding the luminance (Y) component and instead focusing on Cb and Cr. Here, skin pixels typically fulfil 77 ≤ Cb ≤ 127 and
133 ≤ Cr ≤ 173, using the definition of the YCbCr space found in Eq. (4.5).
A more sophisticated approach consists of generating a two-dimensional
histogram Hs (Cb , Cr ) of skin pixels. In this approach, one would measure the
values of Cb and Cr for a large number of known skin pixels, taken from many
different images with skin of different color and tone, and with different lighting conditions. For every occurrence of a particular combination (Cb , Cr ), one
would then increment the contents of the corresponding histogram bin by one.
3 This rule is a slight modification of the rule found in [8].
Figure 4.12: Examples of skin detection. Upper left panel: The original picture; upper right
panel: Skin detection using the RGB-based method described in the main text; lower left panel:
Skin detection using the method based on simple ranges in Cb and Cr ; lower right panel:
Skin detection based on the method using a two-dimensional distribution of skin pixels, with a
threshold T = 0.04. In the three latter images, all non-skin pixels have been set to black color,
whereas the identified skin pixels were left unchanged.
Using the range [16, 240] for Cb and Cr , defined in Subsect. 4.1.1, this histogram
will thus have 225 × 225 bins (elements). Once a sufficient number of skin pixels have been considered, the histogram would then be (linearly) normalized
such that the bin contents for the most frequent combination(s) (Cb , Cr ) would
be set to 1. Thus, for all other combinations, the bin contents would be smaller
than 1. For any given pixel, one can then obtain Cb and Cr , and then classify
the pixel as skin if the corresponding bin contents exceed a certain threshold
T.
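The resulting classifier is very simple; a sketch is shown below (a hypothetical helper, assuming that the histogram is stored as a 225 × 225 array indexed from Cb = 16 and Cr = 16, following the description above).

// skinHistogram is normalized so that its largest element equals 1, and is indexed
// as skinHistogram[Cb - 16, Cr - 16]. A pixel is classified as skin if the bin
// contents exceed the threshold T.
public static bool IsSkinCbCr(int cb, int cr, double[,] skinHistogram, double T)
{
    if (cb < 16 || cb > 240 || cr < 16 || cr > 240) { return false; }
    return skinHistogram[cb - 16, cr - 16] > T;
}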
An example is shown in Fig. 4.12. Here, an image of a person is shown,
along with the results obtained for the three different skin detection methods.
For the last method described above, the histograms were obtained by manually labelling thousands of skin pixels for a set of face images. As is evident
from the figure, no matter which of the three methods above is used, one
can generally achieve quite a high fraction of true positives (also known as
sensitivity), but also a rather high fraction of false positives. That is, pixels that are supposed to be identified as skin are indeed so identified, but many
non-skin pixels are erroneously classified as skin pixels as well.
Face detection
Note that even with a very sophisticated skin pixel detection, there will always
be misclassifications since many objects that are not skin can still fall in a similar color range. For example, light, unpainted wood is often misclassified as
skin. In order to reliably detect a face, one must therefore either combine skin
pixel detection with other methods, or even use methods that do not involve
specific skin pixel detection.
Moreover, the level of sophistication required depends on the application
at hand. Thus, for example, it is much more difficult to find an unknown
number of faces, with arbitrary orientation, in a general image compared to
finding a single face that, in the case of an IPA, most often is seen directly from
the front. For the latter case, a possible approach is to binarize the image based
on skin pixels, i.e. setting skin pixels to white and everything else to black, and
then extracting the connected components (see Subsect. 4.3.9 above). The face
is then generally the largest connected component, provided that the background does not contain too many skin-colored pixels. Of course, this technique can also be combined with background subtraction as discussed above,
in order to increase its reliability.
It is also possible, though computationally costly, to match the skin pixel
regions against a face template, provided that the face is seen (almost) from
the front. In this approach, one can begin by detecting the eyes (or, rather, a
set of candidate feature pairs of which one hopefully does represent the eyes),
and thereby obtaining the orientation of the face. One can then match other
features, such as the mouth, the eyebrows etc., using the template [16].
Perhaps the most widely used method for face detection, however, is the
Viola-Jones face detection algorithm [18]. This detector operates on grayscale
images, and considers a large set of simple classifiers (so-called weak classifiers) that, by themselves, are not capable of detecting an entire face, but at
least a part of it. The classifiers are based on features that simply compute the
difference between the pixel intensities in two adjacent rectangles, and then
compare the result to a threshold parameter. If, and only if, the result is above
the threshold, a match is obtained for the feature in question.
The sum of pixel intensities can be computed very quickly using the concept of integral images discussed in Subsect. 4.3.8 above and it is thus both
possible and essential (for this method) to consider a large number of weak
classifiers. The weak classifiers can then be combined to a strong classifier.
The main difficulty lies in the fact that the number of possible weak classifiers
is generally very large. In the Viola-Jones algorithm, the training is carried
out using the so called Adaboost training algorithm, and the resulting strong
classifiers contain surprisingly few weak classifiers (at least relative to the total
number of available weak classifiers) and are still able to carry out face detection rather reliably.
Moreover, it is possible to generate a cascade of strong classifiers, with close
to 100% true positive detection rate, and with progressively lower (cumulative) false positive rates. Thus, for example, the first element in the cascade
may have a false positive rate of around 50%, so that, while letting almost all
actual faces through, it also lets quite a number of non-faces through. The second element of the cascade typically has a smaller false positive rate (30% say),
while still having a true positive rate near 100%. After passing through both
classifiers, the cumulative false positive rate will drop to 0.5 × 0.3 = 0.15 etc.
Viola and Jones [18] presented a cascaded detector containing 38 elements and
a total of 6060 weak classifiers. This kind of detector cascade generally does
very well on faces seen from the front, but is less accurate for, say, faces shown
from the side, or with partial occlusion.
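To make the role of the integral image concrete, the sketch below (hypothetical code, not taken from [18] or from the ImageProcessing library) evaluates a single two-rectangle feature: the sum over any axis-aligned rectangle requires only four lookups in the integral image, and the weak classifier simply compares the difference between two adjacent rectangle sums to a threshold.

// integral[y, x] is assumed to hold the sum of all pixel intensities above and to the
// left of (x, y), inclusive. The rectangle is given by its top-left corner (x1, y1)
// and its bottom-right corner (x2, y2), both inclusive.
public static long RectangleSum(long[,] integral, int x1, int y1, int x2, int y2)
{
    long a = (x1 > 0 && y1 > 0) ? integral[y1 - 1, x1 - 1] : 0;
    long b = (y1 > 0) ? integral[y1 - 1, x2] : 0;
    long c = (x1 > 0) ? integral[y2, x1 - 1] : 0;
    long d = integral[y2, x2];
    return d - b - c + a;
}

// A two-rectangle weak classifier: the feature value is the difference between the
// sums over the left and right halves of a window; a match is obtained if and only
// if the value exceeds the threshold.
public static bool TwoRectangleFeatureMatch(long[,] integral, int x, int y, int width, int height, long threshold)
{
    long leftSum = RectangleSum(integral, x, y, x + width / 2 - 1, y + height - 1);
    long rightSum = RectangleSum(integral, x + width / 2, y, x + width - 1, y + height - 1);
    return (leftSum - rightSum) > threshold;
}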
Face detection is still very much an active field of research, with new methods appearing continuously. Recent progress involves the use of deep learning for face detection; see e.g. [15, 5]. General reviews on face detection can be
found, for example, in [22] and [21].
Face recognition
After detecting the presence of a face in its field of view, an IPA may also
need to carry out face recognition in order to determine to whom the face
belongs. A variety of methods have been defined, one of the most commonly
used being the eigenface method, in which facial templates are generated from
a large set of images. The face of any given person is then represented as
a linear combination of the facial templates, with numerical weights for each
template. Thus, in this approach face recognition can be reduced to comparing
the weights obtained (via, for example, principal component analysis) for a
detected face to a database of stored weights for those faces that the system is
required to recognize.
There are also approaches that make use of artificial neural networks (combined, more recently, with deep learning). Another alternative is to find and
use invariant properties of a face (for example, the relative location of salient
facial features). One of the difficulties in trying to achieve reliable face recognition is the fact that a person can, of course, present a range of facial expressions. Thus, methods for face recognition often require rather large data sets,
with several views of the same person. The details of the many available face
recognition methods will not be given here. Reviews that include the methods briefly described above, as well as other methods, can be found in [23, 2].
Figure 4.13: A screenshot from the ImageProcessing application. Photo by the author.
There are also plenty of publicly available face databases, which can be used
for training face recognition systems.
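As a small illustration of the final matching step in the eigenface approach described above, the sketch below (hypothetical code; the projection of the detected face onto the eigenfaces, which produces the weight vector, is not shown) compares the weight vector of a detected face to a database of stored weight vectors and returns the index of the closest match, or -1 if no stored face is sufficiently close.

// Nearest-neighbour matching of eigenface weight vectors using Euclidean distance.
public static int RecognizeFace(double[] probeWeights, List<double[]> storedWeights, double maxDistance)
{
    int bestIndex = -1;
    double bestDistance = double.MaxValue;
    for (int i = 0; i < storedWeights.Count; i++)
    {
        double sumOfSquares = 0.0;
        for (int k = 0; k < probeWeights.Length; k++)
        {
            double difference = probeWeights[k] - storedWeights[i][k];
            sumOfSquares += difference * difference;
        }
        double distance = Math.Sqrt(sumOfSquares);
        if (distance < bestDistance) { bestDistance = distance; bestIndex = i; }
    }
    return bestDistance <= maxDistance ? bestIndex : -1;
}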
4.5 Demonstration applications
4.5.1 The ImageProcessing application
The ImageProcessing application demonstrates some important aspects of
the ImageProcessingLibrary. The user can load an image, and then apply
a sequence of image processing operations. The sequence of images thus generated is stored and shown in a list box, so that the user can return to an earlier
image in the sequence. It is also possible to zoom and pan in the displayed
image using the mouse wheel (zooming) and mouse movements (panning).
A screenshot from this application is shown in Fig. 4.13. Here, the user has
loaded an image, and applied a few operations to improve the contrast and
sharpness of the image.
4.5.2 The VideoProcessing application
This application illustrates the use of the Camera class and the associated user
controls. Provided that at least one web camera is attached to the computer,
clicking the Start button will start the camera and display the image stream in a CameraViewControl. The user can then select the Camera setup tab page to view and change the camera settings, as shown in Fig. 4.14. It is often necessary to change the settings, since only default settings are applied when the camera is started, and the suitable settings can vary strongly between cameras. The program also features background subtraction using exponential Gaussian averaging, as described above.

Figure 4.14: The CameraSetupControl, which allows the user to modify the settings of a (running) camera.
Chapter 5
Visualization and animation
A large part of human communication is non-verbal and involves facial expressions, gestures etc. Similarly, the visual representation of the face (and,
possibly, body) of an IPA can also play an important role in human-agent communication. Being met by a friendly and well-animated face (smiling, nodding
in understanding etc.) strongly affects a person’s perception of a discussion.
While a two-dimensional rendering would certainly be possible, with a
three-dimensional rendering one can obtain a much more life-like representation of a face, including variations in lighting as parts of the face move. As
will be illustrated below, even very complex objects are usually rendered in the
form of (many) triangle shapes. Of course, since the screen is two-dimensional,
three-dimensional objects, or rather their constituent triangles, must be projected in two dimensions, something that requires a large number of matrix operations, as does any rotation of a three-dimensional object. There are several
software libraries specifically designed to quickly carry out all the necessary
operations, processing thousands or even millions of triangles per second, determining the necessary projections and colors for each pixel. Some examples
are OpenGL and Direct3D. Here, OpenGL will be used or, more specifically, a
C# library called OpenTK that serves as a bridge between C# and OpenGL.
This chapter begins with a general description of three-dimensional rendering. Next, the ThreeDimensionalVisualization library, used here
for such rendering, is described in some detail. This description is followed by
an introduction to the special case of facial visualization and animation, and
the chapter is concluded with a brief description of two demonstration applications, one for illustrating various levels of visualization and shading, and
one for illustrating the FaceEditor user control.
5.1 Three-dimensional rendering
Rendering a three-dimensional (3D) object onto a two-dimensional (2D) screen
involves a sequence of matrix operations, using 4 × 4 matrices that combine
translation and rotation. Here, only a brief introduction will be given regarding these transformations; for an excellent detailed description of the transformations below, see [1]. The first step is to rotate and translate the object
to its position in space, i.e. to transform it from model coordinates measured
relative to the center of the object to world coordinates, measured relative to
the origin of the modeled world. This transformation is handled by the model
matrix. Next, the location and orientation of the camera must be accounted for.
Clearly, the appearance of an object depends on where it is located relative to
the camera viewing the scene. Thus, another transformation is applied, using
the view matrix such that the scene will now be in camera coordinates1 . For
convenience, the model and view matrices are often combined (multiplied) to
form a single modelview matrix. The final step is to account for the properties
of the camera, by carrying out a perspective projection such that, for example,
objects along a line parallel to the camera’s line-of-sight will appear closer to
the center of the view the further away they are from the camera. This final
transformation is carried out by the projection matrix.
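Using OpenTK, the chain of transformations just described can be sketched as follows. The specific rotation, camera parameters, and method name are chosen here purely for illustration; the ThreeDimensionalVisualization library described below wraps these steps in its own classes.

using OpenTK;

// A sketch of the transformation chain: model -> world (model matrix),
// world -> camera (view matrix), and camera -> screen (projection matrix).
// OpenTK uses row vectors, so the matrices are composed by right-multiplication.
public static class TransformationSketch
{
    public static Matrix4 BuildModelViewProjection(float rotationAngle, Vector3 objectPosition,
        Vector3 cameraPosition, Vector3 cameraTarget, float aspectRatio)
    {
        Matrix4 model = Matrix4.CreateRotationZ(rotationAngle) * Matrix4.CreateTranslation(objectPosition);
        Matrix4 view = Matrix4.LookAt(cameraPosition, cameraTarget, Vector3.UnitY);
        Matrix4 projection = Matrix4.CreatePerspectiveFieldOfView(MathHelper.PiOver4, aspectRatio, 0.1f, 100.0f);
        return model * view * projection;   // the modelview matrix combined with the projection
    }
}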
5.1.1 Triangles and normal vectors
It is common to render three-dimensional objects as a set of triangles. This is
so since a triangle, i.e. an object defined by three vertices in space or, equivalently, by a vertex and two (non-collinear) vectors emanating from that vertex, is
a planar surface, with a uniquely defined (surface) normal vector that, moreover, is easily obtained by computing the cross product of the two vectors just
mentioned. Objects with more than three vertices may be planar, but not necessarily, meaning that the computation of normal vectors for such objects will,
in general, be more complex.
The normal vectors matter greatly, since they are used in determining the
intensity of the light reflected from a given surface and are also needed for determining the shading of the pixels in the triangle (at least if a smooth shading
model is used, as discussed below). Fig. 5.1 shows a triangle, defined by three
vertices (points) p1 , p2 , and p3 . The three vertices can be used for generating
the vectors v21 = p2 − p1 and v31 = p3 − p1 that, together with p1 , uniquely
define the triangle and its position in space. Given the two vectors, one can
form the (normalized) normal vector as

$$\mathbf{n} = \frac{\mathbf{v}_{21} \times \mathbf{v}_{31}}{|\mathbf{v}_{21} \times \mathbf{v}_{31}|}. \qquad (5.1)$$

1 It should be noted that, in OpenGL, the camera is, in fact, always located at the origin, looking down OpenGL's negative z-axis (into the screen, as it were). Thus, in OpenGL, to achieve the effect of placing the camera in a given position, instead of moving the camera, one moves the entire world in the opposite direction.

Figure 5.1: A triangle in three-dimensional space, along with its normal vector. The enumerated vertices are shown as filled discs.
Note that this normal vector points towards the reader. One can of course also
generate a normal vector pointing in the other direction, by reversing the order of the vectors in the cross product. This matters since, in OpenGL, one can
render both sides of any triangle using, for example, different color properties of the two sides, and the normal vector determines which of the two sides
of a triangle is visible from a given camera position. In OpenGL a counterclockwise convention is used such that, if the camera looks at a given side of a
polygon, the correct normal will be the one obtained by the right-hand rule if
the vertices (of a triangle) are traversed in the counter-clockwise order. Moreover, OpenGL uses a right-handed coordinate system such that x and y are
in the plane of the screen, and z points towards the user. In the implementation of the three-dimensional visualization library (see below) the axes are
instead such that x and z are in the plane of the screen, and y points away from
the user (again forming a right-handed coordinate system), which perhaps is
more natural.
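In code, the normal vector of Eq. (5.1) can be computed directly from the three vertices; a minimal sketch using OpenTK's Vector3d type is given below (the function name is an assumption made here).

// Computes the (normalized) normal vector of the triangle (p1, p2, p3), following Eq. (5.1).
// The vertices are assumed to be given in counter-clockwise order as seen from the side
// towards which the normal should point.
public static Vector3d TriangleNormal(Vector3d p1, Vector3d p2, Vector3d p3)
{
    Vector3d v21 = p2 - p1;
    Vector3d v31 = p3 - p1;
    Vector3d crossProduct = Vector3d.Cross(v21, v31);
    return crossProduct / crossProduct.Length;
}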
5.1.2 Rendering objects
There are many options available when rendering an object. The user can control the color of an object, its lighting and shading, whether or not the object is
translucent etc.
Figure 5.2: Left panel: A triangle with equal colors (yellow) for all three vertices. Right
panel: A triangle with different colors (red, green, and blue) for the vertices, resulting in color
interpolation over the surface. Note that lighting was not used in either case. The vertices are
shown as black discs.
The simplest form of rendering is obtained if no light is used. In this case,
the colors assigned to the vertices are used for determining the pixel colors of
the rendered object. If all vertices have the same color (e.g. yellow) as in the
left panel of Fig. 5.2, the object will be uniform in color. If instead the vertices
have been assigned different colors, e.g. red, green, and blue, as in the right
panel of the figure, OpenGL will interpolate the colors.
In most instances of three-dimensional rendering, however, one uses some
form of light source2 . With lighting, an object takes on a distinct 3D appearance, even if the object is uniform in color. In order to compute the correct
color of a pixel, a lighting model is required. In OpenGL, the lighting model
requires three colors: The ambient color, the diffuse color, and the specular color. Simplifying somewhat, the ambient component provides a kind of
background lighting, such that an object will not be completely dark even if
the light from the light source does not hit it directly, whereas the diffuse component determines the light reflected in all directions from a surface, and thus
outlines the shape of a three-dimensional object. Finally, the specular component also handles light reflection but more in the manner of a mirror, giving a
certain shininess to the surfaces. In fact, an additional parameter, called shininess, is required to determine how shiny the surface is. Moreover, in OpenGL one can define the ambient, diffuse, and specular lighting components both
for the light source and for each object independently. Thus, for example, it is
possible to bathe a white object in blue light etc.
2 OpenGL implementations generally support the use of multiple lights, eight as a minimum.

Figure 5.3: A schematic illustration of vertex and triangle normals. Here, a part of a 3D object is seen edge-on, with vertices shown as filled discs and triangle sides represented by solid lines connecting the vertices. The solid arrows show the triangle normals used in flat shading, whereas the dashed arrows show the vertex normals used in smooth shading, obtained by averaging the triangle normals over the triangles (in this case, two) connected to a given vertex. Note that, in the case of smooth shading, the normal vector also varies over a triangle, since it is formed by interpolating the three normal vectors (one for each vertex) for the triangle in question, even though that particular aspect is not visible in this edge-on figure. See also Panels (ii) and (iii) in Fig. 5.7.

Another important concept is shading, a procedure that determines how the light intensity varies over a surface. The normal vectors (see above) play
an important role in shading. OpenGL defines two standard shading models, namely flat shading and smooth shading. In flat shading, surfaces are
rendered with a uniform color, determined by the interaction between the
light, the surface material, and the normal vector of the surface in question. In
smooth shading, by contrast, OpenGL interpolates the normal vectors over the
surface, a procedure that requires a normal vector for each vertex rather than
the surface (triangle) normal vector. As indicated in Fig. 5.3, the vertex normals are obtained by interpolating the surface normals for all those triangles
that are connected to a given vertex; see also the description of the Object3D
class below.
In the figure, a surface is shown edge-on (for simplicity), along with the
triangle and vertex normals. For flat shading the triangle normals are sufficient. However, consider now smooth shading: If one were to use the surface
normals, one would run into trouble at the edge connecting several triangles,
since one would have multiple different normal vectors to choose from! If instead the vertex normals are used, as indeed they are in smooth shading, one
can find an interpolated normal vector at any point on a triangle (thus, effectively, generating a smoothly curved surface rather than a flat one!). The effect
is striking, as shown in Fig. 5.7 below. At this point, the reader may wish to
study the various rendering options for a 3D object by running the Sphere3D example application, described in Subsect. 5.4.1.

Listing 5.1: The paint event handler in the Viewer3D class.

private void HandlePaint(object sender, PaintEventArgs e)
{
    GL.Clear(ClearBufferMask.ColorBufferBit | ClearBufferMask.DepthBufferBit);
    GL.MatrixMode(MatrixMode.Modelview);
    GL.LoadMatrix(ref cameraMatrix);
    SetLights();
    if (showOpenGLAxes) { DrawOpenGLAxes(); }
    if (showWorldAxes) { DrawWorldAxes(); }
    RenderObjects();
    SwapBuffers();
}
5.2 The ThreeDimensionalVisualization library
5.2.1 The Viewer3D class
The Viewer3D class is a visualizer that handles the visualization and animation of a three-dimensional scene. The objects in a scene, as well as the lights
illuminating the scene, are contained in the object scene, of type Scene3D.
The visualizer contains event handlers for rotating and zooming a scene,
i.e. for moving the camera in response to mouse actions. The event handler
for redrawing the view (an event that is triggered whenever the user control's Invalidate method is called) takes the form shown in Listing 5.1. The first line clears the view, and the following two set the appropriate transformation matrix. Next, the lights are set by calling the SetLights method (also defined
in the Viewer3D class; see the source code for details). Then, depending on
settings, the viewer can visualize either the axes defined in OpenGL’s standard
coordinate system (described above), with the x−axis shown in red, the y−axis
in green, and the z−axis in blue, or the axes defined in the coordinate system
used in the three-dimensional visualization library (also described above, with
the same color settings as for the OpenGL axes). The objects are then rendered,
by calling the RenderObjects method that, in turn, simply calls the Render
method of each object in the scene (see below). Finally, the resulting view is
then pasted onto the viewing surface by calling the SwapBuffers method.
In a dynamic scene, where some objects change their position or orientation
(or both), one must also handle animation. In the Viewer3D class, animation
runs as a separate thread that simply invalidates the scene at regular intervals,
thus triggering the Paint event that, in turn, is handled by the paint event
handler shown in Listing 5.1. The two methods used for starting and running the animation are shown in Listing 5.2. There is, of course, a method for stopping the animation as well (not shown).

Listing 5.2: The two methods for running an animation in the Viewer3D visualizer. The first method sets the frame rate (or, rather, the frame duration), and launches the thread. The second method simply calls the Invalidate method at regular intervals, causing the scene to be redrawn.

public void StartAnimation()
{
    millisecondAnimationSleepInterval = (int) Math.Round(1000 / framesPerSecond);
    animationThread = new Thread(new ThreadStart(() => AnimationLoop()));
    animationRunning = true;
    animationThread.Start();
}

private void AnimationLoop()
{
    while (animationRunning)
    {
        Thread.Sleep(millisecondAnimationSleepInterval);
        Invalidate();
    }
}
5.2.2 The Object3D class
A scene (stored in an instance of Scene3D) contains a set of three-dimensional
objects (not to be confused with objects in the programming sense!) as well as
the light sources illuminating the scene. The library contains class definitions
for several types of 3D objects, e.g. a sphere, a (planar) rectangle etc., which are
all derived from the base class Object3D. Each instance of this class consists of a set of vertices, as well as a set of index triplets (each containing the indices of three vertices) that define the triangles (see above). Moreover, the triangle
normal vectors are stored as well. In cases where smooth shading is used (see
Subsect. 5.1.2 above) the vertex normals are required instead, and they can be
computed by calling the ComputeVertexNormalVectors method.
For most 3D objects, the definition of the triangles is obtained by using the
Generate method, which takes as input a list of parameters (of type double)
that define the specific details of the 3D object in question, e.g. the radius as
well as the number of vertices in the case of a sphere. For all but the simplest
objects, this method usually is rather complex. A case in point is the Face
class that defines a rotationally symmetric structure, somewhat reminiscent of
a face, which typically contains thousands of triangles. The face structure can
then be edited in a face editor, as will be discussed below.
A simpler case is the Rectangle3D that, in fact, defines a two-dimensional
object (which then can be oriented in any way in three dimensions). It consists of only two triangles, each defined using three vertices. The vertices are ordered in a counterclockwise manner, as shown in Fig. 5.4.
Figure 5.4: The definition of the two triangles used in the Rectangle3D class.
Listing 5.3: The Generate method of the Rectangle3D class. In this case, the method
takes two parameters as input, determining the size of the rectangle. Next, the four vertices
are generated, and then the two triangles as well as the triangle and vertex normal vectors.
public override void Generate(List<double> parameterList)
{
    base.Generate(parameterList);
    if (parameterList == null) { return; }
    if (parameterList.Count < 2) { return; }
    sideLength1 = parameterList[0];
    sideLength2 = parameterList[1];
    Vertex3D vertex1 = new Vertex3D(-sideLength1 / 2, -sideLength2 / 2, 0);
    Vertex3D vertex2 = new Vertex3D(sideLength1 / 2, -sideLength2 / 2, 0);
    Vertex3D vertex3 = new Vertex3D(sideLength1 / 2, sideLength2 / 2, 0);
    Vertex3D vertex4 = new Vertex3D(-sideLength1 / 2, sideLength2 / 2, 0);
    vertexList.Add(vertex1);
    vertexList.Add(vertex2);
    vertexList.Add(vertex3);
    vertexList.Add(vertex4);
    TriangleIndices triangleIndices1 = new TriangleIndices(0, 1, 2);
    triangleIndicesList.Add(triangleIndices1);
    TriangleIndices triangleIndices2 = new TriangleIndices(0, 2, 3);
    triangleIndicesList.Add(triangleIndices2);
    GenerateTriangleConnectionLists();
    ComputeTriangleNormalVectors();
    ComputeVertexNormalVectors();
}
Listing 5.4: The Render method of the Object3D class.
public void Render()
{
    if (!visible) { return; }
    GL.PushMatrix();
    GL.Translate(position[0], position[2], -position[1]);
    GL.Rotate(rotation[2], new Vector3d(0f, 1f, 0f));
    GL.Rotate(rotation[1], new Vector3d(0f, 0f, -1f));
    GL.Rotate(rotation[0], new Vector3d(1f, 0f, 0f));
    GL.BlendFunc(BlendingFactorSrc.SrcAlpha, BlendingFactorDest.OneMinusSrcAlpha);
    if (alpha < 1) GL.Enable(EnableCap.Blend);
    if (showSurfaces) { RenderSurfaces(); }
    if (showWireFrame) { RenderWireFrame(); }
    if (showVertices) { RenderVertices(); }
    if (alpha < 1) GL.Disable(EnableCap.Blend);
    if (object3DList != null)
    {
        foreach (Object3D object3D in object3DList)
        {
            object3D.Render();
        }
    }
    GL.PopMatrix();
}
The Generate
method for the Rectangle3D class is shown in Listing 5.3. The method first
checks that it has a sufficient number of parameters, and it then uses the two
parameters to set the side lengths of the rectangle. Then the vertices are defined and added to the list of vertices. Next, the two triangles are formed, by
specifying the indices of the (three) vertices constituting each triangle. The
GenerateTriangleConnectionLists method generates a list that keeps
track of the triangles in which each vertex is included. In this particular case,
vertices 0 and 2 are included in both triangles, whereas vertex 1 is only included in the first triangle, and vertex 3 only in the second. Next, the normal
vectors are computed for each triangle, by simply computing the (normalized)
cross product as discussed above; see Eq. (5.1). Finally, the vertex normal vectors are computed, by averaging (and then re-normalizing) the triangle normal vectors of all triangles in which a given vertex is included. The three
methods just mentioned are called at the end of the Generate method of any
three-dimensional object, and once they have been called, all the necessary information is available for both flat and smooth shading. Note that one can, of course, define a rectangle using more than two triangles. In general, three-dimensional objects often consist of hundreds or thousands of triangles.
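The vertex normal computation just described amounts to averaging and re-normalizing triangle normals; a sketch with assumed data structures (this is not the actual ComputeVertexNormalVectors implementation) is given below.

// triangleNormals holds one (normalized) normal per triangle, and vertexTriangleLists[v]
// holds the indices of all triangles that contain vertex v.
public static List<Vector3d> ComputeVertexNormals(List<Vector3d> triangleNormals, List<List<int>> vertexTriangleLists)
{
    List<Vector3d> vertexNormals = new List<Vector3d>();
    foreach (List<int> triangleIndices in vertexTriangleLists)
    {
        Vector3d sum = Vector3d.Zero;
        foreach (int triangleIndex in triangleIndices)
        {
            sum += triangleNormals[triangleIndex];
        }
        vertexNormals.Add(sum / sum.Length);   // re-normalize the averaged normal
    }
    return vertexNormals;
}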
The Render method is a crucial part of the Object3D class, as it determines the current visual appearance of the object in the scene. This method is
shown in Listing 5.4. The method first checks if the object is visible. If it is not,
the method returns directly. If the object is visible, the next step is to position
and render the object. OpenGL operates as a state machine. Thus, when visualizing any object, at a given position and orientation, one first accesses the
current modelview matrix using the PushMatrix command. Next, the appropriate rotations and translations are carried out for the object in question.
The object is rendered and then, finally, the old (stored) transformation matrix
is set again, by calling PopMatrix, so that the next object can be rendered etc.
Note that the transformations occur in the inverse order in which they are
presented, since they are carried out using post-multiplication. Thus, the rotations take place first, and then the translation. If, for example, only the zrotation (rotation[2]) is non-zero, the object will first be rotated around the
z-axis, and then translated to the current position. Note that if the operations
were carried out in the opposite order, a different result would be obtained. It
is thus important to fully understand the order in which these operations take
place. It should also be noted that the GL.Rotate method is applied in the
OpenGL coordinate system rather than in the coordinate system used in the
three-dimensional visualization library, as is evident from the rotation vectors
shown in the code above.
The lines involving the GL.BlendFunc and the EnableCap.Blend are
needed to deal with translucent objects. The parameter alpha is equal to one
for an opaque object, and 0 for a completely transparent (thus invisible) object.
For values of alpha between 0 and 1, a translucent object is obtained. In order
to handle translucent objects properly, the objects in the scene must be rendered in the appropriate order, namely from high alpha to low alpha. There is
method in Scene3D that handles this issue.
The next few lines contain calls to methods that render the surfaces, wireframe, and vertices, respectively. The wireframe rendering consists of straight
lines connecting the vertices of a triangle. The three methods RenderSurfaces,
RenderWireFrame, and RenderVertices will not be described in detail
here, but it is a useful exercise to study those methods in the source code.
Finally, as can be seen in the listing, there is a possibility to use nested
definitions, such that a 3D object contains its own list (object3DList) of 3D
objects. This makes it possible to rotate and translate an entire group of objects
as a unit, rather than having to move and rotate each 3D object separately. This
type of nested definition can be used to any desired depth. Thus, the objects
in an object list may themselves contain objects in their respective object lists.
For example, the face and eyes of an agent may be contained in the object list
of an object representing the head of the agent, and each eye, in turn, may
contain (in its object list) the objects representing the iris, pupil, and eyelid. It
should be noted that the positions and orientations of objects in an object list
are measured relative to the object on the preceding level.
5.3 Faces
As evidenced by (for example) animated movies, modern computer technology is sufficiently advanced to generate (almost) photo-realistic representations of any object, including a human face.
However, while humans quickly find (or at least ascribe) human features
to any artificial rendering of a living system that is even remotely human-looking (such as a cartoon character), once an artificial system (for example,
an IPA or a robot) attempts to mimic a human face exactly, including all the
minute changes in facial expressions that are subconsciously detected during
a conversation, that system is often perceived as eerie and frightening. This is
known as the uncanny valley phenomenon [9]. Thus, in other words, unless
it is possible to render an artificial face with such a level of detail in all its
expressions that it is indistinguishable from a real human face, it is most often
better to use a more cartoon-like face, with human-like features but without
an attempt to mimic a human face exactly.
5.3.1 Visualization
Conceptually, a 3D head is no different from any other 3D object. In practice,
however, generating and animating a 3D head is not easy. The most realistic
renditions can be obtained by using a model very similar to a biological face,
that is, by generating a skeleton (skull), adding muscles attached to the head,
and then finally a skin layer that, of course, is what the user will see. Here,
however, a slightly simpler approach will be taken, in which the face consists
only of the skin layer and where animation is limited to movements of the
entire head (such as looking left or right) as well as movements of the eyes
and eyebrows. An obvious additional step would be to add a movable jaw
and a mouth. This can certainly be done, but it is beyond the scope of this
text. Still, a surprising range of emotions can be expressed even by the simple
animations just described.
The heads considered here consist of seven distinct 3D objects (each of
which, of course, contains hundreds or thousands of triangles): The actual
face (and, optionally, neck), the two eyes, the two eyelids, and the two eyebrows. The face object can be generated using the face editor program, described in Subsect. 5.4.2 below. The resulting face will, by construction, be
symmetric around an axis (in this case, the y−axis, if the face is not rotated)
and should have two deep indentations for the eyes. An eye can be generated
as a white sphere, with an iris and a pupil each consisting of a spherical sector
with slightly larger radius than the eye, rotated 90 degrees around the x−axis.
An eyelid is in the form of a semi-sphere, such that, with proper rotation, it
can completely cover the eye. Finally, an eyebrow consists of an elongated
structure in the form of a toroidal sector. As an example, Fig. 5.5 shows a dismembered rendering of a head, in which the parts just described have been dislocated a bit, for individual inspection.

Figure 5.5: The parts used here for rendering a head. Left panel: The head, without eyes and eyebrows; right panel: An eye, consisting of three distinct objects: the eyeball, the iris (in this case, green), and the pupil. Also shown are the eyelid, in the form of a semi-sphere, and the eyebrow.
Fig. 5.6 shows a few examples of expressions that can be generated with
this 3D head; see also the next subsection.
5.3.2 Animation
As mentioned in Subsect. 5.2.1 above, the Viewer3D is able to run a separate thread in which the entire scene is rendered at regular intervals. Thus, to
achieve animation, all that is required is to change the position and the rotation
of the objects in a gradual manner. Since the rotation of a 3D object is carried
out before it is translated, the rotations will occur around the axes that meet
at the origin. Thus, the manner in which these objects are defined (before any
rotation and translation) greatly influences the effects of rotation.
For example, the semi-sphere (as well as any other spherical segment) has
been defined as if it were a sphere centered at (0, 0, 0) from which some parts
have been removed to generate the segment in question. Thus, when rotated,
such a segment will move as if it were sliding over a sphere of the same radius
and centered in (0, 0, 0). The alternative would be to define a 3D object such
that its center-of-mass would be located at the origin. In that case, however,
in order to achieve the effect of, say, a semi-sphere (such as an eyelid) moving
over a sphere (such as an eyeball) one would have to both rotate and translate the semi-sphere relative to the center of the eyeball. Clearly, by combining rotations and translations, one can achieve the same effect using either definition.

Figure 5.6: A few examples of mental states and facial expressions generated with a face of the kind described in the main text. Top row, from left to right: Awake (neutral), sleepy, and asleep. Bottom row, from left to right: Surprised, angry, fearful.
For the application considered here, the first option makes animation easier
and it is thus the approach chosen.
For a head of the kind described above, some typical animations are (i)
moving the eyes, an effect that can be achieved by rotating the eye around the
z−axis, noting that the iris and pupil can be appended in the objectList
of the eyeball 3D object, so that they will rotate with the eyeball; (ii) blinking,
which can be carried out by rotating the eyelids around the x−axis; and (iii)
moving the eyebrows, an action that, in its simplest form, consists of a translation (up or down). Of course, more sophisticated movements can be achieved
by allowing the eyebrows to rotate and deform as well.
The actual movements are generated in separate threads that gradually
move the appropriate objects. Note that the motion is completely independent
of the rendering, which is handled by the animation thread in the Viewer3D.
Listing 5.5 shows an example, namely a thread that carries out blinking (of
both eyes), with a given duration.
Listing 5.5: An example of animation, illustrating blinking of the two eyes. Note that the animationStepDuration is defined elsewhere in the code, and is typically set to 0.01–0.02 s. The fullClosureAngle is typically set to 90 (degrees).

public void Blink(double duration)
{
    blinkThread = new Thread(new ThreadStart(() => BlinkLoop(duration)));
    blinkThread.Start();
}

public void BlinkLoop(double duration)
{
    double halfDuration = duration / 2;
    int numberOfSteps = (int) Math.Round(halfDuration / animationStepDuration);
    double deltaAngle = fullClosureAngle / numberOfSteps;
    Object3D leftEyelid = viewer3D.Scene.GetObject("LeftEyelid");
    Object3D rightEyelid = viewer3D.Scene.GetObject("RightEyelid");
    // Close the eyelids gradually ...
    for (int iStep = 0; iStep < numberOfSteps; iStep++)
    {
        leftEyelid.RotateX(deltaAngle);
        rightEyelid.RotateX(deltaAngle);
        // Pause between steps so that the blink takes (roughly) the requested duration
        // (this pacing is assumed here; without it, the loop would complete instantly).
        Thread.Sleep((int) Math.Round(1000 * animationStepDuration));
    }
    // ... and then open them again.
    for (int iStep = 0; iStep < numberOfSteps; iStep++)
    {
        leftEyelid.RotateX(-deltaAngle);
        rightEyelid.RotateX(-deltaAngle);
        Thread.Sleep((int) Math.Round(1000 * animationStepDuration));
    }
}
5.4 Demonstration applications
In this section, two applications will be described that illustrate the properties
and capabilities of the three-dimensional visualization library. The first application is a very simple demonstration of various aspects of rendering, lighting,
and shading. The second application is more advanced, especially the highly
complex FaceEditor user control, which can be used for generating a face
shape starting from a simple rotationally symmetric structure.
5.4.1 The Sphere3D application
This application simply shows a green sphere, under various conditions of
rendering, lighting, and shading. The GUI contains a sequence of menu items,
allowing the user to visualize a sphere (i) without lighting; (ii) with lighting and flat shading; (iii) with lighting and smooth shading; (iv) as a wireframe structure; (v) as vertices; (vi) as (iii) but with vertices overlaid; (vii) as (iii) but with vertices and wireframe overlaid; (viii) as a translucent object (in this case with another blue sphere inside); and, finally, (ix) with a texture added. All
cases are shown in Fig. 5.7.
Figure 5.7: Nine examples of rendering a sphere. The first row of images shows, from left to right, cases (i)-(iii) described in the main text, the second row cases (iv)-(vi), and the third row cases (vii)-(ix).

Even though the Viewer3D does support texture mapping, i.e. pasting
(parts of) one or several images over the surface of a 3D object, this topic has
been deliberately avoided above, as it is not needed for the applications considered here. However, the interested reader should study the source code
for the textured sphere just described, as well as the rendering method in the
Sphere3D class and the corresponding method in the base class Object3D.
Note that texture mapping requires a specification of which part of an image
is to be mapped onto a given triangle. This information is stored in the
TextureCoordinates field of each vertex.
Figure 5.8: A screenshot of the FaceEditor application, showing the three-dimensional
face object, along with a slice plane (shown in green color) as well as a two-dimensional view
of the slice under consideration, in which the user has selected and moved a few control points
(shown in red). Note that left-right symmetry is enforced, such that the points on the opposite side of the slice move together with the selected points, but in the opposite (horizontal)
direction.
5.4.2 The FaceEditor application
This application is intended to simplify the process of generating the threedimensional face of an IPA. The FaceEditor application makes use of an
advanced user control, the FaceEditor, which is included as a part of the
ThreeDimensionalVisualization library, and which does most of the
work in this application. A screenshot from the application is shown in Fig. 5.8.
Except for the menu strip the entire form is covered by the face editor, which
has three tool strips at the top and two panels below. Upon initialization, the
face editor provides the user with a starting point in the form of a rotationally symmetric three-dimensional Face object (shown in a Viewer3D control,
on the left side of the face editor), with a shape similar to that of a human
head, with a neck just below the head, but without any other particular features such as nose, ears, or eye sockets. The right panel of the face editor contains a BezierCurveViewer that shows a horizontal slice through the threedimensional object. Each such slice is defined as a closed composite Bézier
curve that in turn consists of a set of two-dimensional cubic Bézier splines
given by
$$\mathbf{x}(u) = \mathbf{P}_0(1-u)^3 + 3\mathbf{P}_1 u(1-u)^2 + 3\mathbf{P}_2 u^2(1-u) + \mathbf{P}_3 u^3, \qquad (5.2)$$
where x = (x, y), Pj are 2-dimensional control points, and u is a parameter
ranging from 0 to 1. By default, each slice is defined using 32 splines, each
with four control points. The last control point of a given spline coincides
with the first control point of the next spline, so that the effective number of
control points is smaller. A detailed description of Bézier splines will not be
given here, but note that the control points that define the smooth spline curve
do not necessarily lie on the curve itself, as can be seen in Fig. 5.8. One can
use the mouse to grab any set of control points in a given slice, and then drag
those points to generate any desired (left-right symmetric) shape for the slice in
question. In the figure, the user has grabbed a few points (shown in red), and
started moving them inwards. As can be seen, left-right symmetry is enforced,
such that the points on the opposite side of the slice move together with the
selected points. It is also possible to zoom in, so that the points can be moved
with greater precision. When the points in any slice are moved, the three-dimensional representation is also updated simultaneously, so that one can
easily assess the result. In the particular case shown in the figure, the green
slice plane (used for keeping track of which slice is being edited) obscures the
view. However, the user can hide the slice plane in order to see the effects
on the three-dimensional shape. One can also move between slice planes, by
clicking on the three-dimensional viewer and then using the arrow keys to
move up or down. Moreover, the user can both insert and remove slices.
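For reference, evaluating Eq. (5.2) for a single cubic segment is straightforward; the sketch below (a hypothetical helper, not the BezierCurveViewer code) returns the point on the curve for a given value of the parameter u. Sampling a slice then amounts to evaluating each of its segments at a number of equally spaced u values.

// Evaluates a single cubic Bezier segment, Eq. (5.2), at a parameter value u in [0, 1].
// controlPointsX and controlPointsY hold the x- and y-coordinates of P0, ..., P3.
public static double[] EvaluateCubicBezier(double[] controlPointsX, double[] controlPointsY, double u)
{
    double oneMinusU = 1.0 - u;
    double b0 = oneMinusU * oneMinusU * oneMinusU;
    double b1 = 3.0 * u * oneMinusU * oneMinusU;
    double b2 = 3.0 * u * u * oneMinusU;
    double b3 = u * u * u;
    double x = b0 * controlPointsX[0] + b1 * controlPointsX[1] + b2 * controlPointsX[2] + b3 * controlPointsX[3];
    double y = b0 * controlPointsY[0] + b1 * controlPointsY[1] + b2 * controlPointsY[2] + b3 * controlPointsY[3];
    return new double[] { x, y };
}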
The three-dimensional shape is obtained by interpolating (sampling) the
composite Bézier curves defining the slice planes. The user can specify the
number of points used. This is a global measure, i.e. the same number of interpolated points is generated for all slices. Note that the number of interpolated points need not equal the number of control points for the splines: These
curves can be interpolated with arbitrary precision, using hundreds of points if
desired. Typically, 50-100 points per slice is sufficient. The interpolated points
are then used as vertices for the three-dimensional Face object. In order to
generate triangles, the interpolation is shifted so that the interpolated points
of odd-numbered slice planes appear (horizontally) midway between the interpolated points of even-numbered slice planes. An illustration is shown in
the figure, where the wireframe representation has been overlaid on the three-dimensional shape so that the triangles are clearly visible. Generating the appropriate triangle indices is a bit complicated, since each triangle will involve two slice planes; for details, see the definition of the Face class.
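To make the interpolation step concrete, the following minimal sketch (an illustrative helper, not the actual Face or BezierCurveViewer code) evaluates Eq. (5.2) for a single cubic Bézier spline; sampling u at evenly spaced values in [0, 1] produces interpolated points of the kind used as vertices.

using System;

public struct Point2D
{
    public double X, Y;
    public Point2D(double x, double y) { X = x; Y = y; }
}

public static class CubicBezier
{
    // Evaluates Eq. (5.2) for the control points p0..p3 at parameter u in [0, 1].
    public static Point2D Evaluate(Point2D p0, Point2D p1, Point2D p2, Point2D p3, double u)
    {
        double w0 = Math.Pow(1 - u, 3);
        double w1 = 3 * u * Math.Pow(1 - u, 2);
        double w2 = 3 * u * u * (1 - u);
        double w3 = u * u * u;
        return new Point2D(w0 * p0.X + w1 * p1.X + w2 * p2.X + w3 * p3.X,
                           w0 * p0.Y + w1 * p1.Y + w2 * p2.Y + w3 * p3.Y);
    }
}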
With this application the user can quickly generate a cartoon-like face for
use in an IPA, and then save the corresponding Face object in XML format.
There are some limitations. For example, the application does not generate the
eyes of the IPA (instead, the user must define a face with eye sockets, in which
the eyes can be added later), and neither does it generate a movable jaw. The
example face used earlier in this chapter (see the left panel of Fig. 5.5) was
generated using the FaceEditor application.
Chapter 6
Speech synthesis
In principle, speech synthesis is simple, as it can be approached as a mere
playback of recorded sounds. However, in practice, it is not easy to generate a high-quality synthetic voice capable of displaying all the subtleties and
emotions of human speech.
Speech synthesis can be approached in many different ways. Two of the
main approaches are concatenative synthesis and formant synthesis. As the
name implies, concatenative synthesis consists of pasting together previously
recorded sounds in order to form a given sentence or word. This is a process
that also involves considerable modification of the recorded sounds, in order
to make sure that an utterance formed by concatenation sounds natural: Simply pasting together a sequence of recorded words will not produce
a natural-sounding sentence at all, even if each word is perfectly uttered (in
isolation).
Many state-of-the-art speech synthesis systems use the approach of modifying and pasting together recorded snippets of sounds. However, an alternative approach is to generate all sounds as they are needed, in which case no human voice recording is required at all. In this approach, known as formant
speech synthesis, one uses instead a model of the human vocal tract, in which
a train of pulses excites a set of oscillators (corresponding to the oscillating vocal cords) in order to produce a vowel sound. Consonants are produced in a
slightly different way, but with the same model.
Whereas concatenative synthesis can be made to generate sounds that resemble those of a human voice very closely, formant synthesis produces a more
artificial, robotic-sounding voice but, if done well, with surprising clarity. Moreover, a formant voice requires much less (storage) memory space than a concatenative voice, something that also explains the popularity of formant voices
in the early days of personal computers.
One may certainly argue that concatenative synthesis is superior as regards
the quality of the generated voice, but one can also make the argument that for
Figure 6.1: The structure of a RIFF chunk: a chunk ID (4 bytes), followed by the chunk data size (4 bytes), followed by the chunk data.
an agent with a cartoon-like face, similar to the example shown in Chapter 5,
a perfect human-sounding voice would be somewhat out of place. Moreover,
as formant synthesis does provide interesting insights regarding both human
sound generation and signal processing, this has been the approach chosen
here.
6.1 Computer-generated sound
Ultimately, any sound is of course simply a variation (over time) in air pressure. Computers generate sounds from a set of discrete values, known as samples. The number of samples handled per second is known as the sampling
frequency or sample rate, and the range of allowed values for the samples
is known as the sample width. The sampling frequency for a CD is 44100
Hz, whereas lower sampling frequencies are used in telephones. Acceptable
sound quality can be obtained with sampling frequencies of 8000 Hz or above.
The sample width is typically 16 bits, meaning that samples can range from -32768 to 32767. The digital signal is then converted to an analog signal (voltages) using a digital-to-analog (D/A) converter, whose output is then passed to an
amplifier that in turn drives a speaker. In systems with multiple speakers, one
may wish to send different signals to different speakers. A common case, of
course, is two-channel or stereo sound. For speech synthesis, single-channel
or mono sound is often sufficient, however.
6.1.1 The WAV sound format
Sounds can be stored in different formats. A common format under Windows
is the Waveform audio format (WAV). In this format, the samples can be stored
either in uncompressed or compressed form. For simplicity, only the uncompressed format will be considered here.
Figure 6.2: The structure of a WAV sound file: the four bytes R I F F (0x52494646), the chunk data size (4 bytes), the four bytes W A V E (0x57415645), the fmt subchunk (header and data), an optional fact subchunk (header and data), and the data subchunk (header and data). Note that both the main chunk and the subchunks follow the RIFF chunk format shown in Fig. 6.1. The main chunk's data section begins with four bytes encoding the word "WAVE", after which the subchunks follow, each consisting of a header section (8 bytes) and a data section.
A WAV sound is built using the concept of RIFF chunks that contain an eight-byte header followed by data, as
shown in Fig. 6.1. The first four bytes of a chunk encode the chunk ID, and
the following four bytes encode the number of bytes in the data part of the
chunk. As illustrated in Fig. 6.2, strictly speaking, a WAV sound contains a
main chunk that encloses all the other chunks (which therefore normally are
referred to as subchunks) in its data section.
The required subchunks for uncompressed WAV sounds are the fmt (format) and data subchunks. For compressed WAV sounds, a third subchunk,
namely the fact subchunk, must be included, and it is normally placed between the two other subchunks. Here, only uncompressed WAV sounds will
be considered. However, some uncompressed WAV sounds contain an unnecessary fact subchunk. Thus, a program for reading WAV sounds must be able
to cope with the potential presence of a fact subchunk, regardless of whether
the sound is compressed or not.
Figure 6.3: The fmt subchunk: the four bytes f m t (0x666d7420), the chunk data size (4 bytes), the compression code (2 bytes), the number of channels (2 bytes), the sample rate (4 bytes), the number of bytes per second (4 bytes), the block align (2 bytes), the number of bits per sample (2 bytes), an optional field giving the number of extra format bytes (2 bytes), and the extra format bytes (if any). In the absence of extra format bytes, the chunk data size is either 16 or 18, depending on whether the two bytes specifying the number of extra format bytes are included or not.
The first four bytes of a WAV sound file contain the word "RIFF" (in uppercase letters), represented as four ASCII bytes (taking hexadecimal values
0x52, 0x49, 0x46 and 0x46). The following four bytes encode the file size (n)
minus 8 (i.e. the size of the header). In other words, those four bytes determine the number of bytes contained in the sound file, after the header. All
integers in a WAV file are stored using little endian format, i.e. with the least significant byte first.
(Hexadecimal numbers are written 0xnn...nn, where each n takes a value in the set {0, ..., 9, A, ..., F}; the letters A, ..., F represent the numbers 10, ..., 15.)
Figure 6.4: The data subchunk: the four bytes d a t a (0x64617461), the chunk data size (4 bytes), and the interlaced sample data. The data part of this subchunk contains the actual sound samples.
The first four data bytes of the main chunk determine the RIFF type, which always takes the value "WAVE" (hexadecimal representation: 0x57415645). The remaining n − 12 bytes contain the subchunks. Since
each subchunk begins with a chunk ID, the subchunks can, in principle, be
placed in any order. However, it is customary to place the fmt subchunk first,
followed by the fact subchunk (if needed) and then the data subchunk.
The fmt subchunk
The fmt subchunk, illustrated in Fig. 6.3, begins with the chunk ID "fmt " (with
a space at the end!), with hexadecimal representation 0x666D7420. The next
four bytes encode the subchunk data size, which for the fmt subchunk equals
either 16 + k or 18 + k, where k is the number of extra format bytes (normally zero; see below).
After the eight-byte header, the following two bytes encode the compression code (or, somewhat confusingly, audio format) for the WAV sound. For
uncompressed WAV sounds, the compression code is equal to 1. The next two
bytes encode the number of channels (i.e. two, for stereo sound, or one, for
mono sound). The following four bytes encode the sample rate (or sampling
frequency) of the WAV sound file.
Figure 6.5: The data part of the data subchunk for a stereo sound: Sample 1, Channel 1 (left), 2 bytes; Sample 1, Channel 2 (right), 2 bytes; Sample 2, Channel 1 (left), 2 bytes; Sample 2, Channel 2 (right), 2 bytes; and so on. The samples are stored in an interlaced fashion, as described in the main text.

The next eight bytes of the fmt subchunk encode (i) the (average) number of bytes per second of the sound, (ii) the block align, and (iii) the number of bits per sample. The three numbers derived from these bytes are partly redundant: Once the number of bits per sample n_s has been specified (which requires two bytes), the block align b_a can be computed as

b_a = n_s n_c / 8,   (6.1)

where n_c is the number of channels. Thus, the block align measures the number of bytes needed to store the data from all channels of one sample. The number of bytes per second b, which requires four bytes in the fmt subchunk, can simply be computed as

b = s b_a,   (6.2)
where s is the sample rate. Here, only 16-bit sound formats will be used.
The next two bytes indicate the number of (optional) extra format bytes.
For uncompressed WAV sounds, normally no extra format bytes are used. In
such cases, sometimes the two bytes determining the number of extra format
bytes are omitted as well, so that the data size of the fmt subchunk becomes 16
rather than 18.
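As an illustration of how the fields of the fmt subchunk fit together, the sketch below (an illustrative helper, not taken from the AudioLibrary) assembles the subchunk for an uncompressed sound, computing the block align and the number of bytes per second according to Eqs. (6.1) and (6.2).

using System;
using System.IO;
using System.Text;

public static class FmtSubchunkExample
{
    public static byte[] BuildFmtSubchunk(int sampleRate, short numberOfChannels, short bitsPerSample)
    {
        short blockAlign = (short)(bitsPerSample * numberOfChannels / 8);  // Eq. (6.1)
        int bytesPerSecond = sampleRate * blockAlign;                      // Eq. (6.2)
        using (MemoryStream stream = new MemoryStream())
        using (BinaryWriter writer = new BinaryWriter(stream))
        {
            writer.Write(Encoding.ASCII.GetBytes("fmt "));  // chunk ID, with a trailing space
            writer.Write(16);                // chunk data size (no extra format bytes)
            writer.Write((short)1);          // compression code: 1 = uncompressed
            writer.Write(numberOfChannels);  // 1 = mono, 2 = stereo
            writer.Write(sampleRate);
            writer.Write(bytesPerSecond);
            writer.Write(blockAlign);
            writer.Write(bitsPerSample);
            writer.Flush();
            return stream.ToArray();         // BinaryWriter stores the integers in little endian format
        }
    }
}

For example, BuildFmtSubchunk(44100, 1, 16) produces the fmt subchunk for a 16-bit mono sound at CD sample rate.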
The data subchunk
The data subchunk has a rather simple format, illustrated in Fig. 6.4. The first
four bytes encode the chunk ID, which for the data subchunk simply consists
of the string ”data”, with hexadecimal representation 0x64617461. The next
four bytes encode the data subchunk size, i.e. the number of bytes of actual
sample data available in the WAV sound file. The samples from the various
channels (two, in the case of stereo sound) are stored in an interlaced fashion,
as illustrated in Fig. 6.5: for any given time slice, the samples from the different
channels appear in sequence, followed by the samples from the next time slice
etc.
The individual samples are stored as two's complement signed integers (this applies to formats using 16 bits per sample or more; if the format uses only 8 bits per sample, each sample is stored as an unsigned integer) which, in the case of 16-bit samples, take values in the range [−32768, 32767], such that the middle point (0) corresponds to silence. The procedure of generating the
numerical value from a 16-bit sample is as follows: Let bi and bi+1 denote two
consecutive bytes defining the sample. Taking into account the little endian
storage format, these two bytes are decoded to form a temporary value vtmp as
v_tmp = 2^8 b_{i+1} + b_i.   (6.3)
The final sample value is then obtained as

v = { v_tmp            if v_tmp ≤ 32767,
      v_tmp − 65536    otherwise.          (6.4)
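As a concrete illustration of Eqs. (6.3) and (6.4), the following minimal sketch (an illustrative helper, not part of the WAVSound class) decodes one 16-bit sample from the data subchunk; on a little-endian machine, the same result can be obtained with BitConverter.ToInt16.

public static short DecodeSample(byte[] data, int i)
{
    int vTmp = 256 * data[i + 1] + data[i];          // Eq. (6.3): 2^8 b_{i+1} + b_i (little endian)
    int v = (vTmp <= 32767) ? vTmp : vTmp - 65536;   // Eq. (6.4): two's complement interpretation
    return (short)v;
}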
Other subchunks
As mentioned above, even uncompressed WAV sounds sometimes contain an
unnecessary fact subchunk. This subchunk begins with four bytes encoding
the string "fact", followed by four bytes specifying the data size of the subchunk. In the case of an uncompressed WAV sound, any data contained in
the fact subchunk can be ignored. However, for compressed WAV sounds, the
fact subchunk contains crucial information regarding the decoding procedure
needed for playback. In addition to the subchunks just discussed, additional
subchunk types exist as well, e.g. the slnt subchunk which can be used for
defining periods of silence (thus reducing the size of the WAV sound file, provided, of course, that periods of silence are present in the sound in question).
6.1.2 The AudioLibrary
The AudioLibrary contains classes for storing, manipulating, and visualizing sounds in WAV format. The WAVSound class stores a byte array defining
both the header and the data of a WAVSound, as described above. This byte
array is the actual sound and is used, for example, by the SoundPlayer class
for playing the sound. However, a byte array defining both a header and a
sequence of data is hardly human-readable. Thus, the byte array containing
the data can also be converted to one or two arrays of samples (depending on
whether the sound is in mono or stereo format), which can then be visualized
using, for example, the SoundVisualizer user control. This class is also defined in the AudioLibrary and has been used throughout this chapter in the figures displaying sound samples.

Method                 Description
LoadFromFile           Loads a WAV sound from a file.
SaveToFile             Saves a WAV sound to a file.
GenerateFromSamples    Generates the byte array required by the WAV format, based on sound samples.
Extract                Extracts (in a new instance) a part of a sound.
Join                   Joins a set of sounds, with optional periods of silence between consecutive sounds, to form a single sound.
LowPassFilter          Carries out low-pass filtering of a sound; see Eq. (6.6).
HighPassFilter         Carries out high-pass filtering of a sound; see Eq. (6.8).
SetRelativeVolume      Increases or decreases the volume of a sound, depending on the value of the input parameter.

Table 6.1: Brief summary of some public methods in the WAVSound class.
The AudioLibrary also contains classes that are more relevant to speech
recognition (such as a WAVRecorder class) but still logically belong to the
AudioLibrary. These classes will be considered in the next chapter.
Some of the most important methods in the WAVSound class are shown in
Table 6.1. The class also contains a constructor that generates a WAVSound
header (see Subsect. 6.1.1 above) with given values for the sample rate, the
number of channels, and the number of bits per sample (the sample width).
Provided that a sound header has been generated by calling this constructor, one can then generate a WAV sound from a set of samples, using the GenerateFromSamples method. The SaveToFile method only saves the byte array; all other relevant properties can be regenerated from it. Consequently, the LoadFromFile method loads the byte array, and then generates the header and the sound samples in human-readable form.
Whenever the samples of a sound are modified, for example when modifying the volume, the byte array representing the sound (as per the WAV
format described above) must be regenerated, something that is handled by
a (private) method, namely GenerateSoundDataFromSamples. If the number of samples is also changed, for example when appending samples, one
must call the (private) ExtractInformation method that also re-extracts
the header of the sound (to reflect the fact that the number of samples has
changed). These two operations are generally handled automatically in the
various public methods, but must be taken into account if a user wishes to
write additional methods for manipulating WAV sounds.
As an example, consider the method SetRelativeVolume, shown in Listing 6.1. Here, the sound samples are scaled by a given factor (the input parameter).
Listing 6.1: The SetRelativeVolume method of the WAVSound class. Note the call
to the GenerateSoundDataFromSamples method in the final step, which generates the
byte array representing the sound.
public void SetRelativeVolume(double relativeVolume)
{
    for (int iChannel = 0; iChannel < samples.Count; iChannel++)
    {
        for (int jj = 0; jj < samples[iChannel].Count; jj++)
        {
            double newDoubleSample =
                Math.Truncate(relativeVolume * samples[iChannel][jj]);
            if (newDoubleSample > MAXIMUM_SAMPLE)
            { newDoubleSample = MAXIMUM_SAMPLE; }
            else if (newDoubleSample < MINIMUM_SAMPLE)
            { newDoubleSample = MINIMUM_SAMPLE; }
            samples[iChannel][jj] = (Int16)Math.Round(newDoubleSample);
        }
    }
    GenerateSoundDataFromSamples();
}
Listing 6.2: An example of the usage of the SoundPlayer class for playing WAV sounds.
SoundPlayer soundPlayer = new SoundPlayer();
sound.GenerateMemoryStream();
sound.WAVMemoryStream.Position = 0;  // Manually rewind stream
soundPlayer.Stream = sound.WAVMemoryStream;
soundPlayer.PlaySync();
However, just scaling the samples will not affect the byte array; thus, the
method ends with a call to GenerateSoundDataFromSamples.
For playback, one can use the SoundPlayer class from the System.Media namespace, which is part of the .NET framework. An example is shown in Listing 6.2. Note
that one must manually rewind the stream to ensure correct playback.
6.2 Basic sound processing
In many cases, for example as a precursor to speech recognition (see Chapter 7), one can apply a sequence of operations to an input sound, in order to
remove noise, increase contrast etc. Many of these operations can be represented as digital filters that, in turn, can be represented either in the frequency
domain (using Z-transforms in the case of discrete-time signals of the kind used
here) or in the time domain. Here, only time-domain analysis will be used, in
which case a (linear) digital filter can be represented in the form of a difference
equation of the form
s(k) + a_1 s(k − 1) + a_2 s(k − 2) + . . . + a_p s(k − p) = b_0 x(k) + b_1 x(k − 1) + . . . + b_q x(k − q),   (6.5)
where ai , i = 1, 2, . . . and bi , i = 0, 1, . . ., as well as p and q, are constants, s(k)
is the output at time step k, and x(k) is the input.
6.2.1 Low-pass filtering
The purpose of low-pass filtering is to remove the high-frequency parts (typically noise) of a signal. This is achieved by applying an exponential moving
average:
s(k) = (1 − α_L) s(k − 1) + α_L x(k),   (6.6)
so that, using the notation above, a1 = −(1 − αL ), b0 = αL (and, therefore
p = 1, q = 0). As is evident from the equation, if αL is close to 0, the sample
s(k) will be close to s(k − 1), meaning that the signal changes slowly or, in
other words, that the high-frequency components are removed. Thus, if this
filter is applied to a digital signal x(k), the resulting output will be a signal
that is basically unchanged for low frequencies, but attenuated for frequencies
around and above a certain cutoff frequency fc . One can show that αL is related
to f_c as

α_L = 2πΔt f_c / (2πΔt f_c + 1),   (6.7)
where ∆t is the sampling interval (the inverse of the sampling frequency).
Hence, this filter is also referred to as a (first-order) low-pass filter.
6.2.2 High-pass filtering
A high-pass filter removes the low-frequency parts of a signal. In the time
domain, a (first-order) high-pass filter takes the form
s(k) = α_H s(k − 1) + α_H (x(k) − x(k − 1)),   (6.8)
where αH is a parameter, which is related to the cutoff frequency as
α_H = 1 / (2πΔt f_c + 1).   (6.9)
After passing through this filter, the signal will be attenuated for low frequencies (below the cutoff frequency) but largely unchanged at higher frequencies.
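The two filters can be implemented in just a few lines. The sketch below (illustrative standalone helpers, not the LowPassFilter and HighPassFilter methods of the WAVSound class) applies Eqs. (6.6)-(6.9) to an array of samples, assuming zero initial conditions.

using System;

public static class FirstOrderFilters
{
    // Eq. (6.6): s(k) = (1 - alphaL) s(k-1) + alphaL x(k), with alphaL from Eq. (6.7).
    public static double[] LowPass(double[] x, double cutoffFrequency, double samplingFrequency)
    {
        double dt = 1.0 / samplingFrequency;
        double alphaL = 2 * Math.PI * dt * cutoffFrequency / (2 * Math.PI * dt * cutoffFrequency + 1);
        double[] s = new double[x.Length];
        s[0] = alphaL * x[0];
        for (int k = 1; k < x.Length; k++)
        {
            s[k] = (1 - alphaL) * s[k - 1] + alphaL * x[k];
        }
        return s;
    }

    // Eq. (6.8): s(k) = alphaH s(k-1) + alphaH (x(k) - x(k-1)), with alphaH from Eq. (6.9).
    public static double[] HighPass(double[] x, double cutoffFrequency, double samplingFrequency)
    {
        double dt = 1.0 / samplingFrequency;
        double alphaH = 1.0 / (2 * Math.PI * dt * cutoffFrequency + 1);
        double[] s = new double[x.Length];
        s[0] = alphaH * x[0];
        for (int k = 1; k < x.Length; k++)
        {
            s[k] = alphaH * s[k - 1] + alphaH * (x[k] - x[k - 1]);
        }
        return s;
    }
}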
Figure 6.6: Unvoiced and voiced sounds: Here, an unvoiced sound (in this case s) precedes
a voiced sound (in this case a long o), to generate the word so. The difference between the
noise-driven first half of the word, and the more oscillatory second half of the word, is easy to
spot.
6.3 Formant synthesis
In formant synthesis all the spoken sounds are generated based on a model
of the human vocal tract. Formant speech synthesizers use a so-called source-filter model, in which the source component corresponds to the excitation of the vocal cords, and the filter components model the (resonances of the) vocal
tract.
A useful analogue is that of an oscillating spring-damper system, i.e. a mechanical system described by the equation
s''(t) + 2ζω s'(t) + ω^2 s(t) = x(t),   (6.10)
where x(t) is the input (forcing) signal and s(t) is the output. With appropriately selected values of ζ and ω, such a system will exhibit oscillations in the
form of a damped sinusoid.
In particular, provided that ζ < 1, the response of the system to a discrete pulse in the form of a delta function (x(t) = δ(t)), which leads to an instantaneous velocity s'(0) = v_0 ≡ A √(1 − ζ^2) ω, equals

s(t) = A e^{−ζωt} sin(√(1 − ζ^2) ωt),   (6.11)
for t ≥ 0. In connection with sound signals, it is more common to use the
(equivalent) form
s(t) = α e^{−βπt} sin(2πf t),   (6.12)
where α is the amplitude, β the bandwidth and f the frequency. In computer-generated speech, time is discrete. In order to use a damped sinusoid in such
c 2017, Mattias Wahde, [email protected]
88
CHAPTER 6. SPEECH SYNTHESIS
a context, one therefore needs a discrete version, which can be written
s(k) = α e^{−βπ k Δt} sin(2πf k Δt),   (6.13)
where k enumerates the samples, and ∆t = 1/ν is the inverse of the sampling
frequency.
Looking at the representation of the voiced speech signal in the rightmost
path of Fig. 6.6, one can see clear similarities with a sequence of damped sinusoids. How can such a signal be generated? Of course, it is possible to generate a sequence of damped sinusoids directly from Eq. (6.13), by computing
y(k), k = 0, 1, . . . for a certain number of samples, then resetting k and repeating. However, a more elegant way is to represent the discrete signal using a
difference equation. Such an equation can be derived in several ways (either
by discretizing the differential equation directly, or by using Laplace and Z
transforms). The details will not be given here, but the resulting difference
equation for generating the signal given by Eq. (6.13) takes the form

s(k) = −a_1 s(k − 1) − a_2 s(k − 2) + b_1 x(k − 1),   (6.14)

where

a_1 = −2α e^{−βπΔt} cos(2πf Δt),   (6.15)
a_2 = α e^{−2βπΔt},   (6.16)
b_1 = α e^{−βπΔt} sin(2πf Δt),   (6.17)
and x(k) is the input signal, described in the following subsections. As noted
above, the differential equation (6.10) responds with a single damped sinusoid
if it is subjected to a delta pulse. Similarly, the discrete version in Eq. (6.14),
which will henceforth be referred to as a damped sinusoid filter, will generate
a damped sinusoid if the input consists of a single pulse, namely x(k) = 1 for
k = 0 and 0 otherwise.
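The following minimal sketch (an assumed helper, not the actual SpeechSynthesizer code) implements the damped sinusoid filter of Eq. (6.14), with the coefficients computed from Eqs. (6.15)-(6.17); feeding it a single pulse produces a damped sinusoid, and feeding it the pulse trains described below produces voiced or unvoiced sounds.

using System;

public class DampedSinusoidFilter
{
    private readonly double a1, a2, b1;
    private double sPrev1 = 0.0, sPrev2 = 0.0, xPrev1 = 0.0;

    public DampedSinusoidFilter(double amplitude, double frequency, double bandwidth, double samplingFrequency)
    {
        double dt = 1.0 / samplingFrequency;
        a1 = -2.0 * amplitude * Math.Exp(-bandwidth * Math.PI * dt) * Math.Cos(2.0 * Math.PI * frequency * dt); // Eq. (6.15)
        a2 = amplitude * Math.Exp(-2.0 * bandwidth * Math.PI * dt);                                             // Eq. (6.16)
        b1 = amplitude * Math.Exp(-bandwidth * Math.PI * dt) * Math.Sin(2.0 * Math.PI * frequency * dt);        // Eq. (6.17)
    }

    // Computes s(k) from the current input x(k), using the difference equation (6.14).
    public double Step(double x)
    {
        double s = -a1 * sPrev1 - a2 * sPrev2 + b1 * xPrev1;
        sPrev2 = sPrev1;
        sPrev1 = s;
        xPrev1 = x;
        return s;
    }
}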
6.3.1 Generating voiced sounds
As noted above, a single damped sinusoid is generated if the input to the
filter represented by the difference equation in Eq. (6.14) consists of a single
pulse. By providing pulses repeatedly, one can generate a recurrent pattern of
damped sinusoids, similar to the pattern seen in the rightmost part of Fig. 6.6.
Thus, in this case, the pulse train takes the form

x(k) = { 1   if k mod n = 0,
         0   otherwise,          (6.18)
where n is the spacing between pulses. In terms of the source-filter model
mentioned above, the pulse train x(k) is the source, and the filter is given by
Eq. (6.14).
Symbol        Ex.     F0    f1    β1    f2    β2    f3    β3
ee (female)   see     180   310   100   2990  100   3310  100
ee (male)     see     120   270   100   2790  100   3010  100
oo (female)   loose   180   370   100   950   100   2670  100
oo (male)     loose   120   300   100   870   100   2240  100
aw (female)   saw     180   590   100   920   100   2710  100
aw (male)     saw     120   570   100   840   100   2410  100

Table 6.2: Fundamental frequency F0 (see Eq. (6.19)) as well as sinusoid frequencies and bandwidths (in Hz) for some (English) vowels, for both male and female voices.
In general, for a given human voice, one can define a fundamental frequency, denoted F0 , that represents the frequency of pulses generated by the
oscillating vocal cords. Thus, for the discrete representation in Eq. (6.14), one
can write
n = ν / F_0,   (6.19)
where ν, again, is the sampling frequency. For male voices, a typical value of
F0 is around 120 whereas, for a typical female voice, F0 is around 180.
Now, in most cases, more than one sinusoid is required to capture all aspects of a spoken voiced sound. In the model used here, the vocal tract is
modelled using a linear superposition of three damped sinusoid filters. Thus,
to generate basic vowels, one need only set the fundamental frequency as well
as the amplitudes, frequencies, and bandwidths of the three sinusoids (10 parameters in total), and then generate a pulse train with repeated pulses, driving three instances of the discrete damped sinusoid filter, and then, finally,
summing the resulting three oscillations to form a sequence of samples representing the voiced sound.
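A sketch of this procedure is given below (illustrative only, with assumed helper names; the parameter handling in the actual SpeechSynthesizer class is richer). It drives three damped sinusoid filters of the kind sketched above with the pulse train of Eq. (6.18), whose spacing follows from Eq. (6.19), and sums their outputs.

public static double[] GenerateVowelSamples(
    double fundamentalFrequency, double duration, double samplingFrequency,
    DampedSinusoidFilter[] filters)  // three filters, one per damped sinusoid
{
    int numberOfSamples = (int)(duration * samplingFrequency);
    int n = (int)(samplingFrequency / fundamentalFrequency);  // Eq. (6.19): pulse spacing
    double[] samples = new double[numberOfSamples];
    for (int k = 0; k < numberOfSamples; k++)
    {
        double x = (k % n == 0) ? 1.0 : 0.0;   // Eq. (6.18): voiced pulse train
        double s = 0.0;
        foreach (DampedSinusoidFilter filter in filters)
        {
            s += filter.Step(x);               // linear superposition of the three sinusoids
        }
        samples[k] = s;
    }
    return samples;  // rescale and insert into a WAVSound, as described in Sect. 6.1.2
}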
The procedure is illustrated in Fig. 6.7. Some typical settings for a few vowels are given in Table 6.2. The amplitudes, which are not specified in the
table, are generally set to values smaller than 1, such that the resulting samples
fall in the range [−1, 1]. Normally, the damped sinusoid with lowest frequency
has the highest amplitude. Of course, the samples must then be rescaled to
an appropriate interval and then inserted in an object of type WAVSound, as
described in Sect. 6.1.2 above.
(Regarding vowel notation: Here, a simplified notation is used, in which a short vowel (such as a in cat) is written with a single letter (i.e. a) and a long vowel (such as a in large) is written using double letters (i.e. aa). Moreover, the symbol - represents a short period of silence. Thus, for example, the word cat would be written ka - t, whereas the word card can (somewhat depending on pronunciation, though) be written kaa - - d.)

Figure 6.7: Voiced sounds: The pulse train on the left is given as input to three sinusoid filters. The output of the three filters is then added to form the samples of the voiced sound.

It is not only vowels that have the shape of repeated, damped sinusoids. Some consonants, e.g. the nasal consonants m and n, can also be represented in this way. However, even though the model presented here can represent those sounds very well, the comparison with the biological counterpart is somewhat
diminished in those cases: In a human voice, nasal consonants are generated
in a complex interplay between the vocal tract and the nasal cavity. In a more
biologically plausible formant synthesizer, such as the one introduced already
by Klatt [7], one can model both the vocal tract and the nasal cavity (as well
as other body parts involved in speech, such as the lips). However, here, it
is sufficient that the synthesizer is able to generate all sounds that occur in
speech, even at the price of a slight reduction in biological plausibility of the
model.
6.3.2 Generating unvoiced sounds
Returning to Fig. 6.6, it is clear that unvoiced sounds bear little obvious resemblance to voiced sounds. In fact, using only visual inspection, it might be
difficult to distinguish an unvoiced sound from noise! However, if the sound
is played, one can clearly hear a consonant, rather than noise. How can such
sounds be generated in a speech synthesizer of the kind used here? In fact,
one may as well ask the question of how humans can generate such sounds: As
was illustrated above, the human vocal tract effectively acts as a set of damped
sinusoid filters. How can such a system generate signals of the kind seen in
the leftmost part of Fig. 6.6?
Listing 6.3: The properties defined in the FormantSettings class.
public class FormantSettings
{
    public double Duration { get; set; }
    public double TopAmplitude { get; set; }
    public double RelativeStartAmplitude { get; set; }
    public double RelativeEndAmplitude { get; set; }
    public double TopStart { get; set; }
    public double TopEnd { get; set; }
    public double TransitionStart { get; set; }
    public double VoicedFraction { get; set; }
    public double Amplitude1 { get; set; }
    public double Frequency1 { get; set; }
    public double Bandwidth1 { get; set; }
    public double Amplitude2 { get; set; }
    public double Frequency2 { get; set; }
    public double Bandwidth2 { get; set; }
    public double Amplitude3 { get; set; }
    public double Frequency3 { get; set; }
    public double Bandwidth3 { get; set; }
}
The answer lies not so much in the properties of the vocal tract as in the
properties of the pulse train used for initiating the oscillations in the first place.
As noted by Hillenbrand and Houde [6], the model presented here is perfectly
capable of generating unvoiced sounds if, instead of the pulse train given by
Eq. (6.18), one uses a pulse train consisting of randomly generated Gaussian
pulses, such that, for any k,

x(k) = { N(0, σ)   with probability p,
         0         with probability 1 − p,          (6.20)
where N(0, σ) denotes random Gaussian samples, with mean 0 and standard
deviation σ. Thus, to generate an unvoiced sound, one can use the procedure
described above but with the pulse train generated by Eq. (6.20) instead of
Eq. (6.18).
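The random pulse train of Eq. (6.20) can be generated as in the sketch below (an illustrative helper; the Gaussian samples are obtained here with the Box-Muller transform, which is an implementation choice rather than something prescribed by the model).

using System;

public static class UnvoicedPulseTrain
{
    public static double[] Generate(int numberOfSamples, double probability, double sigma, Random random)
    {
        double[] x = new double[numberOfSamples];
        for (int k = 0; k < numberOfSamples; k++)
        {
            if (random.NextDouble() < probability)
            {
                // Box-Muller: converts two uniform samples into one N(0, sigma) sample.
                double u1 = 1.0 - random.NextDouble();
                double u2 = random.NextDouble();
                x[k] = sigma * Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Cos(2.0 * Math.PI * u2);
            }
            // Otherwise x[k] remains 0, which occurs with probability 1 - p.
        }
        return x;
    }
}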
6.3.3 Amplitude and voicedness
In the model used here, a given sound is generated by specifying an instance
of FormantSettings, as described in Listing 6.3, which is included in the
SpeechSynthesis library; see also Sect. 6.4 below. This class contains
specifications for the amplitudes (ai , i = 1, 2, 3), frequencies (fi , i = 1, 2, 3), and
bandwidths (bi , i = 1, 2, 3) of the three sinusoids, as well as the duration d of
the sound. The number of samples required (n) then equals d × ν.
Moreover, there is an overall amplitude TopAmplitude that multiplies
the sum of the three sinusoids, allowing the user to control the global amplitude of the sound with only one parameter. Thus, the amplitudes of the
individual sinusoids can perhaps best be viewed as relative amplitudes. Of
course, there is some redundancy here: One could remove one of the amplitude constants without loss of generality, but the representation used in the
FormantSettings class makes it easier to control the overall amplitude.
In addition, one can represent a situation in which the (global) amplitude of
a sound starts at a given value, defined by the RelativeStartAmplitude,
rises to a maximum and stays there for a while, and then tapers off towards another value, defined by the RelativeEndAmplitude. This is achieved using
also two relative time parameters, namely TopStart and TopEnd. Letting
Atop denote the TopAmplitude, astart and aend the relative start and end amplitudes, respectively, and τ1 and τ2 the TopStart and TopEnd parameters,
respectively, one can compute the two time parameters T_1 and T_2 as

T_1 = d τ_1,   (6.21)

and

T_2 = d τ_2,   (6.22)

where d is the duration of the sound. The (absolute) start and end amplitudes A_start and A_end are computed as

A_start = A_top a_start,   (6.23)

and

A_end = A_top a_end.   (6.24)
Then, the variation in the global amplitude of the sound is given by

A(t) = { A_start + (A_top − A_start) t/T_1             for t < T_1,
         A_top                                         for T_1 ≤ t ≤ T_2,
         A_top − (A_top − A_end) (t − T_2)/(d − T_2)   for t > T_2.       (6.25)
In practice, the global amplitude is sampled at discrete times, as in the case
of the sinusoids; see above. With a sampling frequency of ν samples per second, the mapping between elapsed time t (for the sound in question) and the
sample index k is given by
t = k/ν.   (6.26)
Of course, one can avoid modifying the amplitude altogether, by simply setting τ1 = 0 and τ2 = 1 or, equivalently, astart = aend = 1.
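A minimal sketch of the amplitude envelope computation, following Eqs. (6.21)-(6.25), is given below (an illustrative helper with assumed parameter names, not the actual library code).

public static double GlobalAmplitude(
    double t, double d, double topAmplitude,
    double relativeStartAmplitude, double relativeEndAmplitude,
    double topStart, double topEnd)
{
    double t1 = d * topStart;                                        // Eq. (6.21)
    double t2 = d * topEnd;                                          // Eq. (6.22)
    double startAmplitude = topAmplitude * relativeStartAmplitude;   // Eq. (6.23)
    double endAmplitude = topAmplitude * relativeEndAmplitude;       // Eq. (6.24)

    // Eq. (6.25). Note that the first and third branches are never reached
    // when topStart = 0 or topEnd = 1, respectively, so no division by zero occurs.
    if (t < t1)
    {
        return startAmplitude + (topAmplitude - startAmplitude) * t / t1;
    }
    else if (t <= t2)
    {
        return topAmplitude;
    }
    else
    {
        return topAmplitude - (topAmplitude - endAmplitude) * (t - t2) / (d - t2);
    }
}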
The VoicedFraction parameter (here denoted v) determines the fraction of the input x(k) that is voiced: Before any sound is generated, two pulse trains are defined, namely one voiced pulse train x_v(k), k = 0, 1, . . . , n, with samples obtained from Eq. (6.18), and one unvoiced pulse train x_u(k), k = 0, 1, . . . , n, whose samples are given by Eq. (6.20). The two pulse trains are then combined to form the complete input signal as

x(k) = v x_v(k) + (1 − v) x_u(k).   (6.27)
When generated in isolation, vowels are typically completely voiced (v = 1)
whereas (many) consonants are completely unvoiced (v = 0). However, as
will be illustrated below, in the transition between two sounds, e.g. a vowel
and a consonant, one mixes the parameters, including the parameter v. Also,
even for vowels, one may include a certain unvoiced component to generate a
hoarse voice.
6.3.4 Generating sound transitions
Even though some speech sounds (for example, a vowel such as a short a) can
be used separately, normal speech obviously involves sequences of sounds
that form words. It would be possible to generate the sounds letter by letter
and then paste those sounds together. However, the result would, in general,
not sound natural at all. Instead, the normal procedure in speech synthesis
is to generate sounds that represent more than one letter. A common choice
is to use diphones that usually represent two letters (technically phones but
that distinction will not be made here). In practice, one generally uses both
diphones and phones in speech synthesis.
For example, the word can can be generated by playing two sounds in rapid
sequence, namely a diphone representing ka followed by a phone representing n. Alternatively, one could combine the diphones ka and an. However,
the latter alternative would involve not only handling the transitions between
phones within a diphone, but also the transition between the diphones themselves. Thus, here, the former alternative will generally be used. The second
alternative is commonly used in connection with speech recognition, however.
Diphones can be generated by transitioning from one set of parameters to
another, for example by linear interpolation. In fact, even when generating parameters for single consonant sounds, it helps to have a vowel included, either
before or after the consonant sound, in order to properly hear the consonant.
For example, without an adjacent vowel, it is sometimes difficult to distinguish
between s and f or between p and b. Once the consonant has been generated in
this way, one can simply cut away the vowel and thus obtain the consonant in
isolation. It should also be noted, however, that different parameters may be
needed for a given consonant, depending on the situation. For example, the
set of parameters needed to generate the t in the word take may differ from
the parameters needed for the t in at.
In any case, when generating a sequence of two sounds (i.e. a diphone)
one must specify not only the settings for each of the two sounds, but also the
transition between them. The TransitionStart parameter (see Listing 6.3), denoted τ_s, determines the point at which a transition to the following sound
begins. This, too, is a relative time measure, so that the actual time (Ts ) at
which the transition starts equals
T_s = d τ_s.   (6.28)
The transition affects only the amplitudes (a_i), frequencies (f_i), and bandwidths (b_i), as well as the voicedness (v). Let p_1 and p_2 denote the values of
any such parameter in two adjacent sounds. For the first sound, the parameter
value p1 is then used until time t = Ts,1 , i.e. the transition start time for the first
sound. Then, until time t = d_1 (i.e. the duration of the first sound), the mixed
parameter value
p = (1 − λ) p_1 + λ p_2   (6.29)
is used, where

λ = (t − T_s,1) / (d − T_s,1),   (6.30)
runs from 0 to 1, thus generating a smooth parameter transition from the first
sound to the second. Once t = d1 has been reached, t is again set to 0, and the
parameter value p2 is used, until t = Ts,2 , at which point the transition from the
second sound to a third sound (if any) begins. Note that, for the last sound in a
sequence, no transition is carried out, of course. In order to paste two sounds
together without any transition, one can simply set τs to 1, in which case p = p1
for the entire duration of the first sound.
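A minimal sketch of the parameter mixing is given below (an illustrative helper; in the actual library, this interpolation is handled by the GetInterpolatedSettings method of the FormantSpecification class, mentioned in Sect. 6.4).

// Mixes a single formant parameter between two adjacent sounds, following
// Eqs. (6.29) and (6.30). Before the transition start time, the first sound's value is used.
public static double InterpolateParameter(double p1, double p2, double t,
                                          double transitionStartTime, double duration)
{
    if (t <= transitionStartTime) { return p1; }
    // Eq. (6.30): lambda runs from 0 at the transition start to 1 at the end of the sound.
    double lambda = (t - transitionStartTime) / (duration - transitionStartTime);
    return (1.0 - lambda) * p1 + lambda * p2;   // Eq. (6.29)
}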
6.3.5 Sound properties
In some cases, one may wish to change the properties of a sound. For example, if one has generated a particular voice, one might want to generate another voice that is either darker (low-pitched) or brighter (high-pitched) and,
perhaps, speaks either slower or faster than the original voice. The main modifiable properties involve a sound’s volume, pitch, and duration. Fig. 6.8 shows
a few examples of sound modification.
Volume
Changing the volume of a sound is simple, at least in the case of linear volume
scaling: One simply rescales every sample by a constant factor. One has to be
careful, however, to make sure that no sample exceeds the maximum value
that can be represented (32767 and -32768, for 16-bit sounds). Samples that
exceed the limit will be automatically clipped to the corresponding limit, and
if that happens for many samples, the quality of the sound will be reduced.
Moreover, once clipping has occurred, the procedure is irreversible, should
one, later on, wish to reduce the volume.
The WAVSound class contains a method SetRelativeVolume for setting
the volume relative to the current volume, using linear scaling. Moreover,
Figure 6.8: Examples of sound modification. Upper left panel: The original sound, in this
case a long vowel (a); The sound was then modified in three different ways. Upper right panel:
Increased volume; lower left panel: Decreased pitch; lower right panel: Increased duration.
there is a method SetMaximumNonClippingVolume which sets the maximum possible volume, under the condition that no clipping should take place.
Pitch
There are general procedures for modifying pitch (and duration) of spoken
sounds that can be applied regardless of the method used for generating the
sounds in the first place. For example, the pitch of a sound can be changed
by finding the fundamental frequency (which, of course, may vary across a
sound), extracting pitch periods, i.e. the sound samples in the interval between two successive peaks (pitch marks) corresponding, in the case of formant synthesis, to the pulse train excitations for voiced sounds, and then either moving the intervals (samples) between pitch marks closer (for higher
pitch) or further apart (for lower pitch) using, for example, a method called
time-domain pitch-synchronous overlap-and-add (TD-PSOLA).
However, in the case of formant synthesis, the procedure is even easier
since, in that case, one controls the production of the sound in the first place.
Thus, in order to modify the pitch, one need only change the spacing of the
pulses in the pulse train, i.e. the fundamental frequency. Pitch changes have
the largest effect on voiced sounds, which are dominated by their pulse train
rather than the more random excitation used for unvoiced sounds.
Duration
TD-PSOLA can be used also for changing the duration of a sound, either by
removing pitch periods (for shorter duration) or by adding pitch periods (for
longer duration). In the case of formant synthesis, changing the duration of
a sound is even more straightforward, since the duration is indeed one of the
parameters in the formant settings. By increasing the value of the duration
parameter, one simply instructs the synthesizer to apply the pulse train for a
longer time, resulting in a sound of longer duration, and vice versa for sounds
of shorter duration.
In general, changes of duration are mostly applied to vowels (even though
some consonants, such as s, can be extended as well). Many consonants, such as, for example, the t in the word cat, need not be extended much in human
speech, even though a formant synthesizer can, in principle, generate a t (or
any other sound) of any duration.
6.3.6 Emphasis and emotion
Even though one can define an average fundamental frequency for a given
voice, it is not uncommon in human speech to vary volume, pitch, and duration in order to emphasize a word or to express an emotion. If all words are
always read with the same intonation, the result is a very robotic-sounding
voice for which one has to use context, rather than simply listening, to distinguish, say, a statement from a question: In normal speech, the variation in
emphasis over a sentence can be used to make subtle changes in the meaning
of the sentence. For example, there is a difference between the sentences Did
you see the cat? (as opposed to, for example, just hearing it meowing), and Did
you see the cat? (as opposed to seeing something else).
In fact, the speech synthesizer defined here does not include features such
as emphasis. However, the formant method certainly supports such features.
For example, in order to raise the pitch towards the end of a word, one need
only generate a pulse train in which the pulse period is shortened gradually
over the word, rather than being constant over the entire word. In addition,
one must also change the word-to-sound mappings (see Sect. 6.4 below), by
adding symbols that can be used for distinguishing between a normal utterance of a word, and an utterance involving emphasis.
Listing 6.4: A simple usage example for the GenerateWordSequence method in the
SpeechSynthesizer class. Here, the sentence Hello, how are you? is generated, provided, of course, that the speech synthesizer contains the required word-to-sound mappings for the four words as well as the corresponding formant settings required to generate each word.
...
List<string> wordList = new List<string>() { "hello", "how", "are", "you" };
List<double> silenceList = new List<double>() { 0.10, 0.02, 0.02 };
WAVSound sentenceSound = speechSynthesizer.GenerateWordSequence(wordList, silenceList);
...
6.4 The SpeechSynthesis library
This library contains classes for generating speech using formant synthesis. As
shown above, the FormantSettings class is used for holding the parameters (also described above) of a sound. In cases where a sound requires several
different settings, as in the case of a diphone involving two distinct sounds,
the FormantSpecification acts as a container class for a list of formant
settings. This class also contains a method (GetInterpolatedSettings)
that is used during the transition between two sounds, as described in Subsect. 6.3.4.
The actual synthesis is carried out in the SpeechSynthesizer class, which
contains a method GenerateSound that takes a formant specification as input. In this method, a pulse train, for use in voiced sounds, is generated with
the appropriate pulse interval. Moreover, a random pulse train is generated
as well, for use in unvoiced sounds. The pulse trains are then combined as
in Eq. (6.27), and the resulting combined pulse train is then fed to the three
damped sinusoids, for which the current parameter settings are used (obtained
via interpolation in the case of transition between two sounds, as mentioned
above). The resulting set of samples is then used for generating a WAVSound
that is returned by the method.
A speech synthesizer must also contain specifications of which sounds to
combine, and in what order, so as to produce a specific word. Thus, the
SpeechSynthesizer class contains a list of WordtoSoundMapping objects
that, in turn, map a word to a list of sound names. The SpeechSynthesizer
class contains two additional methods. GenerateWord takes a string (the word specification) as input, finds the appropriate sounds (or, rather, the formant specifications required to generate those sounds), and then produces the corresponding sounds. The GenerateWordSequence method generates a sequence of words (for example, but not necessarily, a complete sentence), with (optional) intervals of silence between the words; see also Listing 6.4.
Figure 6.9: The sound editor tab page of the VoiceGenerator application. In this case,
the user has set the parameters so that they approximately generate a long o.
6.5 The VoiceGenerator application
The description above shows how the various parameters are used when forming a sound. However, an important question still remains, namely which
parameter settings are required for generating a given sound? Table 6.2 offers some guidance regarding a few vowels, but in order to generate an entire
voice, capable of uttering any word in a given language, one must of course
find parameter settings for all sounds used in the language in question. Needless to say, these sounds will differ between languages, even though some
sounds are found in almost all languages.
In fact, while the parameters for a given sound, especially a vowel, can be
estimated using knowledge of the human vocal tract (and its typical formant
frequencies), a more efficient way might be to use an interactive evolutionary
algorithm (IEA), which is a form of subjective optimization, i.e. a procedure
where a human assesses and scores the different alternatives. Of course, sound
generation is particularly suitable for such an approach, since a human can quickly assess whether or not a given sound corresponds to the desired sound.
This kind of optimization is implemented in the VoiceGenerator demonstration application. In this case, starting from a given set of parameters, the
user is presented with nine sounds, whose samples are shown graphically on
the screen, in a 3 × 3 matrix, with the initial sound in the center. For the remaining eight sounds, the parameters have been slightly modified, based on
the parameters of the sound at the center of the matrix. The user then listens to
Figure 6.10: The interactive optimization tab page of the VoiceGenerator application.
Starting from the sound shown in Fig. 6.9, the user has inserted a randomized sound before an already optimized vowel, and has begun the process of optimizing the sound by modifying
only the first part in order to turn it into a consonant.
the nine sounds (or a subset of them), and selects (by double-clicking) the one
that is least different from the desired sound. That sound then appears in the
center of the matrix, surrounded by eight sounds whose parameters are slight
variations of the parameters of the selected sound. This process is repeated
until the desired sound has been obtained. For a person unfamiliar with IEA,
this might seem like a very slow and tedious process. However, it is actually
quite fast: Starting from any parameter settings, with some experience one can
typically find parameters for any vowel in 10-20 selection steps or less. Consonants may require a few more steps but, overall, the process is rather efficient.
The program does allow manual editing of parameters as well. The GUI
contains three tabs, one for interactive optimization as described above, one
for manual editing of sounds, and one for defining and synthesizing the various words stored in a speech synthesizer. Fig. 6.9 shows the sound editor tab
page. Here, the user can experiment with various (manually defined) formant
settings, in order to generate a starting point for the IEA. In the particular example shown in the figure, the parameters have been set so as to generate a
long o sound. Fig. 6.10 shows the interactive optimization tab page, during the
optimization of a vowel sound. The currently selected sound is shown in the
center frame, whereas the eight surrounding frames display modified versions
of that sound. The user can select the parameters that the optimizer is allowed
to modify. For example, in the case of a vowel, one would normally start from
a sound that is completely voiced (voiced fraction equal to 1), and then disallow changes in the voiced fraction during optimization.
The user can also select the scope of modification, a possibility that is relevant in cases where the sound is generated from a formant specification containing more than one formant setting.
For example, a common approach for
generating consonant-vowel combinations (e.g. kaa, taa etc.) is to first generate
the vowel using the IEA, and then assign the sound to the sound editor (by
clicking the appropriate button). Next, in the sound editor tab page, one would
copy the vowel (by clicking on the append button), thus obtaining a sound defined by a sequence of two formant settings. Then, one would randomize the
first formant settings, and assign the sound to the optimizer. At this point,
before starting optimization, one can set the scope of modification such that it
only affects the first formant settings (that are now random, but are supposed
to generate a consonant after optimization), and then begin using the IEA to
find the appropriate settings for the consonant; see also Fig. 6.10.
Of course, if one uses a random starting point for every sound generated,
the resulting set of sounds may not form a coherent voice. In other words,
when the sounds are used to form words, they will not be perceived as belonging to a single voice. One should therefore use the following method:
First, generate a vowel (say, a long a, denoted aa, as in large). Next, insert a
randomized sound before the vowel, and use the optimizer to generate suitable consonants to form consonant-vowel diphones such as baa, daa, gaa etc.,
every time using the same formant settings for the vowel and just optimizing
the formant settings for the consonant. A similar procedure can be used for
generating vowel-consonant diphones, by keeping the first sound (the vowel)
constant.
Chapter 7
Speech recognition
Speech recognition can be divided into two main cases, namely isolated word
recognition (IWR) and continuous speech recognition (CSR). As is easily understood, CSR is more difficult than IWR, for example due to the fact that, in
continuous speech, the brief periods of silence that separate spoken sounds
do not generally occur at word boundaries. Knowing that a sound constitutes
a single word, as might be the case (though not necessarily) in IWR, greatly
simplifies the recognition process.
There are many approaches to CSR, for example dynamic time warping
(DTW), a deterministic technique that attempts to match (non-linearly) two
different time series in order to find similarity between the two series; Hidden Markov models (HMMs) that, simplifying somewhat, can be seen as a
stochastic alternative to DTW; and artificial neural networks (ANNs) that can
be used for recognizing patterns in general, not just speech. The different approaches can be combined: HMMs have long dominated CSR research and
many modern HMM-based speech recognizers make use of (deep) ANNs instead of the so called Gaussian mixture models (GMMs) that were earlier used
in connection with HMM-based speech recognition.
In IWR, one normally considers a rather limited vocabulary and the speech
recognizer can therefore be trained on instances of entire words. In CSR, by
contrast, the number of possible words is so large (around 80000 for fluently
spoken English, for example) that one must instead base the recognition of
speech on smaller units of sounds, namely phones (see Chapter 6), along with
diphones and even triphones that involve the transitions between phones.
When combined, such units form words and sentences.
However, regardless of which approach is used, on the most fundamental level, speech recognition involves finding speech features in a sound and
then comparing them to stored feature values from sounds used during training of the speech recognizer. In this chapter, the aim will be to describe the
steps involved in extracting the features of spoken sounds, and then matching
them using a linear scaling (instead of DTW), as will be described below.
The approach will be limited to IWR, as this is sufficient for the applications
considered here.
7.1 Isolated word recognition
There are four basic steps in the approach to IWR considered here [19]: First,
the sound is subjected to preprocessing and frame splitting (see below). Then,
a number of features are extracted to form a feature vector for each frame, thus
resulting in a time series for each feature. Next, the time scale is (linearly) normalized to range from 0 to 1, and the time series are resampled at fixed values
of normalized time. Finally, the feature vector is compared to stored feature
vectors, one for each sound that the IWR system has been trained to recognize,
in order to determine whether or not the spoken sound is recognizable and,
if so, return information regarding the recognized sound. In the description
of this process, it will be assumed that the (input) sound constitutes a single
word. However, later on, the process of splitting a sound and concatenating
the parts in various different ways before applying IWR will be considered
briefly as well.
7.1.1 Preprocessing
As in Chapter 6, here s(k) denotes the samples of a sound. The first step consists of removing the so-called DC component by setting the mean (s̄) of the
sound samples to zero. Thus, the samples are transformed as
s(k) ← s(k) − s̄.   (7.1)
Assuming, again, that the sound contains a single spoken word (but with periods of silence or, rather, noise before and after the word), the next step is to
extract the samples belonging to the word. This is done by first moving forward along the sound samples, starting from the µth sample, and forming a
moving average involving (the modulus of) µ sound samples. Once this moving average exceeds a threshold tp , the corresponding sample, with index ks ,
is taken as the start of the word. The procedure is then repeated, starting with
sample m − µ + 1, where m is the number of recorded samples, forming the
moving average as just described, and then moving backward, towards lower
indices. When a sample (with index ke ) is found for which the moving average
exceeds tp , the end point has been found. The sound containing the ke − ks + 1
samples is then extracted.
The sound is then pre-emphasized, by applying a digital filter that, in the
time domain, takes the form
s(k) ← s(k) − c s(k − 1),   (7.2)
where c is a parameter with a typical value slightly below 1. As is evident from
this equation, low frequencies, for which s(k) is not very different from s(k−1),
are de-emphasized, whereas high frequencies are emphasized, improving the
signal-to-noise ratio.
Next, frame splitting is applied. Here, snippets of duration τ are extracted,
with consecutive snippets shifted by δτ . δτ is typically smaller than τ , so that
adjacent frames partially overlap. Finally, each frame is subjected to (Hamming) windowing such that
s(k) ← s(k) v(k),   (7.3)

with

v(k) = (1 − α) − α cos(2πk/n),   (7.4)

where n is the number of samples in the frame, and α is yet another parameter, typically set to around 0.46.
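The preprocessing steps can be summarized in code as follows (illustrative standalone helpers, not the actual library methods; the default value of the pre-emphasis parameter c is an assumption, chosen slightly below 1 as suggested above).

using System;
using System.Linq;

public static class Preprocessing
{
    public static void RemoveDCComponent(double[] s)
    {
        double mean = s.Average();
        for (int k = 0; k < s.Length; k++) { s[k] -= mean; }                // Eq. (7.1)
    }

    public static void PreEmphasize(double[] s, double c = 0.97)
    {
        // Iterate backwards so that the original s(k - 1) is used in each update.
        for (int k = s.Length - 1; k >= 1; k--) { s[k] -= c * s[k - 1]; }   // Eq. (7.2)
    }

    public static void ApplyHammingWindow(double[] frame, double alpha = 0.46)
    {
        int n = frame.Length;
        for (int k = 0; k < n; k++)
        {
            double v = (1 - alpha) - alpha * Math.Cos(2 * Math.PI * k / n); // Eq. (7.4)
            frame[k] *= v;                                                  // Eq. (7.3)
        }
    }
}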
7.1.2 Feature extraction
Once the word has been preprocessed as described above, resulting in a set
of frames, sound features are computed for each frame. A sound feature is a
mapping from the set of samples s(k) of a frame to a single number describing
that frame. Suitable sound features are those that capture properties of a frame
that are (ideally) independent of the speaker and also of the intensity (volume)
of speech etc. One can define many different kinds of features. Here, four
types will be used, namely (i) the autocorrelation coefficients, (ii) the linear
predictive coding (LPC) coefficients, (iii) the cepstral coefficients, and (iv) the
relative number of zero crossings. These feature types will now be described
in some detail.
Autocorrelation coefficients
The autocorrelation of a time series measures its degree of self-similarity over
a certain sample distance (the lag) and can thus be used for finding repeated
sequences in a signal. Here, the normalized autocorrelation, defined as
aN
i
=
n−i
X
(s(k) − s)(s(k + i) − s)
k=1
σ2
,
(7.5)
is used, where s again is the mean of the samples1 and σ 2 is their variance. The
number of extracted autocorrelation coefficients (i.e., the number of values of
i used, starting from i = 1) is referred to as the autocorrelation order.
¹ Note that, while the mean was removed from the original sound in the first preprocessing step, this does not imply that the mean of every frame is necessarily equal to zero.
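A minimal sketch of the computation of the normalized autocorrelation coefficients in Eq. (7.5) is given below. It is not the library implementation; here the σ² normalization is interpreted as the sum of squared deviations from the mean, so that the coefficient at lag zero would equal 1.

// Minimal sketch: normalized autocorrelation coefficients for lags i = 1, ..., order.
// The denominator is the sum of squared deviations from the mean (one reading of the
// sigma^2 normalization in Eq. (7.5)); a zero-energy frame is assumed not to occur.
public static double[] ComputeNormalizedAutoCorrelation(double[] s, int order)
{
    int n = s.Length;
    double mean = 0.0;
    for (int k = 0; k < n; k++) { mean += s[k]; }
    mean /= n;
    double sumOfSquares = 0.0;
    for (int k = 0; k < n; k++) { sumOfSquares += (s[k] - mean) * (s[k] - mean); }
    double[] a = new double[order];
    for (int i = 1; i <= order; i++)
    {
        double sum = 0.0;
        for (int k = 0; k < n - i; k++) { sum += (s[k] - mean) * (s[k + i] - mean); }
        a[i - 1] = sum / sumOfSquares;
    }
    return a;
}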
LPC coefficients
Provided that a sound is (quasi-)stationary², something that often applies to a sound frame of the kind considered here (provided that the frame duration is sufficiently short), linear predictive coding (LPC) can be used as a method for compressing the information in the sound frame. In LPC, one determines the coefficients l_i that provide the best possible linear approximation of the sound, that is, an approximation of the form
\hat{s}(k) = \sum_{i=1}^{p} l_i s(k-i),    (7.6)
such that the error e(k) = s(k) − \hat{s}(k) is minimal in the least-squares sense. Here, p is referred to as the LPC order. The LPC coefficients can be computed from the (non-normalized) autocorrelation coefficients a_i (defined as the normalized autocorrelation coefficients, but without the σ² denominator). The equation for the prediction error e(k) can be written
e(k) = s(k) - \sum_{i=1}^{p} l_i s(k-i).    (7.7)
The total squared error E then becomes
E = \sum_{k=-\infty}^{\infty} e^2(k) = \sum_{k=-\infty}^{\infty} \left( s(k) - \sum_{i=1}^{p} l_i s(k-i) \right)^2.    (7.8)
Thus, the minimum of E is found at the stationary point where
\frac{\partial E}{\partial l_j} = 0, \quad j = 1, \ldots, p.    (7.9)
Taking the derivative of E, one finds
\frac{1}{2}\frac{\partial E}{\partial l_j} = \sum_{k=-\infty}^{\infty} s(k-j)s(k) - \sum_{i=1}^{p} l_i \sum_{k=-\infty}^{\infty} s(k-i)s(k-j) = 0.    (7.10)
Using the definition of the autocorrelation coefficients, this expression can be rewritten as
\sum_{i=1}^{p} l_i a_{|j-i|} = a_j,    (7.11)
² In general, a stationary time series is one in which the mean, variance etc. are constant across the time series.
a set of equations called the Yule-Walker equations. This expression can be written in matrix form as
A \cdot l = a,    (7.12)
where l = (l_1, \ldots, l_p), a = (a_1, \ldots, a_p), and A is given by
A = \begin{pmatrix} a_0 & a_1 & \cdots & a_{p-1} \\ a_1 & a_0 & \cdots & a_{p-2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{p-1} & a_{p-2} & \cdots & a_0 \end{pmatrix}.    (7.13)
This symmetric matrix is a so called Toeplitz matrix. There exists an efficient
way of solving the Yule-Walker equations using so called Levinson-Durbin
recursion that has been implemented in the MathematicsLibrary included
in the IPA libraries.
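For completeness, a minimal sketch of the Levinson-Durbin recursion is shown below. It is not the MathematicsLibrary implementation; the input array r holds the non-normalized autocorrelation coefficients a_0, ..., a_p, and the output holds the LPC coefficients l_1, ..., l_p.

// Minimal sketch of Levinson-Durbin recursion for the Yule-Walker equations (7.11)-(7.13).
// r[0..p] holds a_0 ... a_p (with a_0 > 0 assumed); the returned array holds l_1 ... l_p.
public static double[] LevinsonDurbin(double[] r, int p)
{
    double[] l = new double[p + 1];   // l[0] is unused; l[1..p] are the LPC coefficients.
    double error = r[0];
    for (int i = 1; i <= p; i++)
    {
        double acc = r[i];
        for (int j = 1; j < i; j++) { acc -= l[j] * r[i - j]; }
        double k = acc / error;       // reflection coefficient
        double[] previous = (double[])l.Clone();
        l[i] = k;
        for (int j = 1; j < i; j++) { l[j] = previous[j] - k * previous[i - j]; }
        error *= (1.0 - k * k);
    }
    double[] result = new double[p];
    Array.Copy(l, 1, result, 0, p);
    return result;
}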
Cepstral coefficients
The cepstral coefficients (CCs) represent the envelope (the enclosing hull) of
the signal’s spectrum and are thus useful as a compact representation of the
signal’s overall characteristics. The CCs can be computed as (the first coefficients of) the inverse (discrete) Fourier transform of the logarithm of the (discrete) Fourier spectrum of the signal. However, one can show that, starting from the autocorrelation coefficients a_i and the LPC coefficients l_i, one can also compute the CCs (denoted c_i) as follows: The first coefficients are c_0 = a_0 and c_1 = l_1. Then,
c_i = l_i + \sum_{k=1}^{i-1} \frac{k}{i} c_k l_{i-k}, \quad i = 1, \ldots, p,    (7.14)
and
c_i = \sum_{k=i-p}^{i-1} \frac{k}{i} c_k l_{i-k}, \quad i > p.    (7.15)
The number of cepstral coefficients used in a given situation is referred to as
the cepstral order. In addition to the cepstral coefficients, it is common also to
define so-called mel-frequency cepstral coefficients (MFCCs), which attempt
to mimic human auditory perception more closely by applying a non-linear
frequency scale rather than the linear frequency scale used in computing the
cepstral coefficients. The MFCCs will not be further considered here, however.
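A minimal sketch of the recursion (7.14)-(7.15) is given below (illustration only, not the library implementation). The LPC coefficients are assumed to be stored with l_1 at index 0, and m cepstral coefficients c_1, ..., c_m are returned; c_0 = a_0 is used internally.

// Minimal sketch: cepstral coefficients from the LPC coefficients and a_0, Eqs. (7.14)-(7.15).
public static double[] ComputeCepstralCoefficients(double a0, double[] lpc, int m)
{
    int p = lpc.Length;
    double[] c = new double[m + 1];
    c[0] = a0;  // c_0 = a_0
    for (int i = 1; i <= m; i++)
    {
        double sum = 0.0;
        int kMin = (i <= p) ? 1 : i - p;                 // lower summation limit
        for (int k = kMin; k <= i - 1; k++) { sum += (k / (double)i) * c[k] * lpc[i - k - 1]; }
        c[i] = (i <= p) ? lpc[i - 1] + sum : sum;        // Eq. (7.14) or Eq. (7.15)
    }
    double[] result = new double[m];
    Array.Copy(c, 1, result, 0, m);
    return result;
}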
Number of zero crossings
The number of zero crossings z, i.e. the number of times that the signal changes
from negative to positive or the other way around, can help in distinguishing
different sounds from each other. For example, as can be seen in Fig. 6.6, a
voiced sound, represented as a superposition of sinusoidal waveforms, typically has fewer zero crossings than an unvoiced sound. Here, a zero crossing
occurs if either
s(k)s(k − 1) < 0    (7.16)
or
s(k − 1) = 0 and s(k)s(k − 2) < 0,    (7.17)
where s(k) again denotes the sound samples. In order to make the measure
independent of the duration of the sound, the relative number of zero crossings, obtained by dividing z by the number of samples, is used instead.
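A minimal sketch of this computation, following Eqs. (7.16)-(7.17), could look as follows (again, not the library implementation).

// Minimal sketch: relative number of zero crossings of a frame.
public static double ComputeRelativeNumberOfZeroCrossings(double[] s)
{
    int z = 0;
    for (int k = 1; k < s.Length; k++)
    {
        bool signChange = s[k] * s[k - 1] < 0;                                  // Eq. (7.16)
        bool throughZero = (k >= 2) && (s[k - 1] == 0) && (s[k] * s[k - 2] < 0); // Eq. (7.17)
        if (signChange || throughZero) { z++; }
    }
    return z / (double)s.Length;  // relative number of zero crossings
}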
7.1.3 Time scaling and feature sampling
For any given sound, one can carry out the steps above, dividing the sound
into frames and computing the various sound features for each frame. However, with given values of the frame duration and the frame shift, the number
of frames will vary between sounds, depending on their duration. Thus, in
order to compare the features from one sound to those from another, one must
first obtain a uniform time scale. As mentioned above, a common approach to
rescaling the time variable so that the sounds can be compared is to use DTW.
On the other hand, at least for IWR, simple linear scaling of the time typically
works as well as DTW [19]. Thus, linear scaling, illustrated in Fig. 7.1, has been
used here. The figure shows three instances of the same sound, uttered with
different speed (and, therefore, different duration). The time scale of the sound
features extracted for each sound was then linearly rescaled to unit (relative)
time, such that the first feature value occurs at relative time 0 and the last at
relative time 1. In order to illustrate that the linear scaling works well, the panels on the right show the time series for the first LPC coefficient (l_1) for each sound without time rescaling. By contrast, in the bottom panel, the three LPC time series obtained with (linear) time rescaling are shown together. As can be seen, there is a high degree of similarity between the three time series.
Now, with or without time rescaling, the feature time series will contain different numbers of points. For example, with a frame duration of 0.03 s and a frame shift of 0.01 s, a sound with 0.15 s duration will provide time series (for any feature) with 13 feature values, whereas a sound with 0.25 s duration will provide time series with 23 feature values. Moreover, after time rescaling, the spacing (in relative time) will differ between the two time series. However, once the time has been rescaled, to produce time series of the kind shown in the bottom panel of Fig. 7.1, one can of course resample the time series, using linear interpolation (between measured points) to obtain a time series, with any number of values, at equal spacing in relative time. Thus, with linear time rescaling followed by resampling, one can make sure that any sound will generate time series (for each feature) with a given number of values, equidistantly spaced in relative time.
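A minimal sketch of such resampling, using linear interpolation between measured points, is shown below. It is not the library's Interpolate implementation; the relative times are assumed to be sorted, to start at 0, to end at 1, and to contain at least two points.

// Minimal sketch: resample a feature time series, given at rescaled relative times in [0,1],
// to ns equidistant points using linear interpolation.
public static double[] ResampleEquidistant(double[] relativeTimes, double[] values, int ns)
{
    double[] resampled = new double[ns];
    for (int i = 0; i < ns; i++)
    {
        double t = (ns == 1) ? 0.0 : i / (double)(ns - 1);  // equidistant relative time
        // Find the interval [relativeTimes[j], relativeTimes[j+1]] containing t.
        int j = 0;
        while (j < relativeTimes.Length - 2 && relativeTimes[j + 1] < t) { j++; }
        double t0 = relativeTimes[j];
        double t1 = relativeTimes[j + 1];
        double w = (t1 > t0) ? (t - t0) / (t1 - t0) : 0.0;
        resampled[i] = (1.0 - w) * values[j] + w * values[j + 1];
    }
    return resampled;
}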
Figure 7.1: Top three rows: Three instances of a sound, namely the Swedish word ekonomi (economy), uttered with different speed. For each sound, the left panel shows the sound samples, and the right panel shows the first LPC coefficient, plotted against relative time t. The bottom row shows the three LPC time series superposed, after linear time rescaling.
7.1.4 Training a speech recognizer
After preprocessing, feature extraction, time rescaling, and resampling, the final step consists of comparing the feature time series thus obtained with stored
time series obtained during training. Before describing that step, the training
procedure will be defined. The training procedure is simple, but in order to
obtain a general representation of a word, one should use multiple instances
of the word, and then form averages of the resulting feature vectors.
The procedure is thus as follows: First, decide how many autocorrelation
coefficients, LPC coefficients, and cepstral coefficients should be computed. A
typical choice is to use an autocorrelation order of 8, an LPC order of 8, and
a cepstral order of 12. Adding also the relative number of zero crossings, the
total number of features (nf ) will be 29. For each word that the speech recognizer is supposed to learn, generate n instances. For each instance, carry out
the preprocessing and then compute the feature time series for the autocorrelation, LPC, and cepstral coefficients, as well as the (relative) number of zero crossings. Then rescale and resample each time series at equidistant relative times, generating ns (typically around 40 or so) samples per time series. Finally, form the average (over the instances) of the resampled time series for each feature. At the end of this procedure, each word will be represented by np = nf × ns parameters. With the numerical values exemplified above, np would be equal to 29 × 40 = 1160. This is a fairly large number of parameters, but still smaller than the number of samples in a typical sound instance (and, of course, providing a better representation of the sound's identity than the samples themselves). Moreover, as will be shown below, not all parameters are necessarily used during speech recognition. An illustration of (a part of) the training procedure is shown in Fig. 7.2.

Figure 7.2: Top row: The left panel shows an instance of the Swedish word nästa (next), along with the (unscaled) time series for the first LPC coefficient, for five instances of this word. The bottom panel shows the average time series for the same LPC coefficient, generated by linearly rescaling and resampling the feature time series for each instance, and then forming the average. Note that the standard deviation (over the instances used for generating the average) is shown as well.
7.1.5 Word recognition
Once the training has been completed, the speech recognizer is ready for use.
The recognition procedure starts from a sound instance, which is first preprocessed as described above and then subjected to feature extraction, as
well as time rescaling and resampling. Next, the recognizer runs through all
the words stored during training, computing a distance measure di between
the current sound instance and the stored word i = 1, . . . , nw (where nw is the
number of stored words) as
d_i = \frac{1}{n_u n_f} \sum_{j=1}^{n_f} w_j \left( \sum_{k=1}^{n_s} (F_{ijk} - \varphi_{jk})^2 \right),    (7.18)
where F_{ijk} denotes the kth sample of feature time series j for stored word i, and φ_{jk} is the kth sample of feature time series j for the current sound instance.
The inner sum (over k) covers the samples of each feature time series (40 in
the example above). The outer sum runs over the features. wj ≥ 0 are the
feature weights and nu denotes the number of features used, i.e. the number of
features for which wj > 0. The feature weights are thus additional parameters
of the speech recognizer, and must be set by the user. The easiest option is to
set all wj to 1. However, as shown in [19] where the weights were set using an
evolutionary algorithm, (slightly) better performance can in fact be obtained
by using only 5 of the 29 features defined above, namely cepstral coefficients
3, 7, 8, 11, and 12.
Expressed more simply, one computes the mean square distance between
the feature time series for the stored sounds and the current sound instance.
The (index of the) recognized word wi is then taken as
ir = argmini di , i = 1, . . . , nw ,
(7.19)
provided that the minimum distance, i.e.
dmin = mini di , i = 1, . . . , nw ,
(7.20)
does not exceed a given threshold T (another parameter of the speech recognizer). If dmin > T , the recognizer does not produce a result. This can happen if
the sound instance is garbled or incorrectly extracted (meaning that the sound
contains more, or less, than one word) or, of course, if it represents a sound
that the recognizer does not have in its database.
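Taken together, Eqs. (7.18)-(7.20) amount to the following sketch, in which the stored feature time series, the current sound's feature time series, the weights, and the threshold are passed as plain arrays (illustration only; the names F, phi, w, and T are not the library's API).

// Minimal sketch of Eqs. (7.18)-(7.20): weighted mean-square distance between stored
// feature time series F[i][j][k] and the current sound's series phi[j][k], followed by
// selection of the closest stored word, subject to the threshold T.
public static string Recognize(double[][][] F, string[] wordNames, double[][] phi,
                               double[] w, double T)
{
    int nWords = F.Length;
    int nf = phi.Length;
    int nu = 0;
    for (int j = 0; j < nf; j++) { if (w[j] > 0) { nu++; } }
    if (nu == 0) { return null; }  // no features in use
    int bestIndex = -1;
    double bestDistance = double.MaxValue;
    for (int i = 0; i < nWords; i++)
    {
        double d = 0.0;
        for (int j = 0; j < nf; j++)
        {
            if (w[j] <= 0) { continue; }   // unused feature
            double inner = 0.0;
            int ns = phi[j].Length;
            for (int k = 0; k < ns; k++)
            {
                double diff = F[i][j][k] - phi[j][k];
                inner += diff * diff;
            }
            d += w[j] * inner;
        }
        d /= (nu * nf);                     // normalization as in Eq. (7.18)
        if (d < bestDistance) { bestDistance = d; bestIndex = i; }
    }
    return (bestDistance <= T) ? wordNames[bestIndex] : null;  // null: no result produced
}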
7.2 Recording sounds
In order to train and use a speech recognizer, one must of course have some
way of recording sounds. The AudioLibrary contains a class WAVRecorder
for this purpose. For the purpose of training, a recorder that can record for a given duration (2 s, say) would perhaps be sufficient. However, since the IPA will have to listen continuously, the recorder must be able to record continuously.
The WAVRecorder has indeed been implemented in this way, and it makes
use of several external methods available in the winmm DLL that is an integral
part of Windows. The source code will not be shown here. Suffice it to say
that, basically, the WAVRecorder defines a recording buffer in the form of a
byte array, and then opens a channel for input sound (using the waveInOpen
method in winmm) and also defines a callback method that is triggered whenever the recording buffer is full.
The recorded bytes are then transferred, in a thread-safe manner, to a list
of byte arrays (timeRecordingList), which also keeps track of the time
stamp at which the byte array was acquired from the recorder. Moreover, in
order to prevent the amount of stored sound data from growing without bound, the first (oldest) element of timeRecordingList is removed whenever a new element is added, once the number of elements in timeRecordingList reaches a certain user-defined threshold.
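A minimal sketch of such a thread-safe, bounded list of time-stamped recording buffers is given below (illustration only, not the WAVRecorder source; the class and member names are chosen for this sketch, and using System and System.Collections.Generic are assumed).

// Minimal sketch: a thread-safe, bounded list of time-stamped recording buffers.
public class TimeStampedRecordingList
{
    private readonly object lockObject = new object();
    private readonly List<Tuple<DateTime, byte[]>> timeRecordingList = new List<Tuple<DateTime, byte[]>>();
    private readonly int maximumNumberOfElements;

    public TimeStampedRecordingList(int maximumNumberOfElements)
    {
        this.maximumNumberOfElements = maximumNumberOfElements;
    }

    public void Add(byte[] buffer)
    {
        lock (lockObject)
        {
            // Remove the oldest element once the user-defined maximum size is reached.
            if (timeRecordingList.Count >= maximumNumberOfElements) { timeRecordingList.RemoveAt(0); }
            timeRecordingList.Add(new Tuple<DateTime, byte[]>(DateTime.Now, buffer));
        }
    }

    public byte[] GetAllBytes()
    {
        lock (lockObject)
        {
            List<byte> allBytes = new List<byte>();
            foreach (Tuple<DateTime, byte[]> element in timeRecordingList) { allBytes.AddRange(element.Item2); }
            return allBytes.ToArray();
        }
    }
}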
The WAVRecorder class also contains a method for extracting all the available, recorded bytes in the form of a single array, which can then be converted
to a WAVSound (see also Sect. 6.1.2). In addition, the WAVRecorder of course
also contains a method for stopping the recording.
7.3 The SpeechRecognitionLibrary
This library contains the IsolatedWordRecognizer class that, in turn, contains methods for training the recognizer on a given word and for recognizing
an input sound. The training method (AppendSound) takes as input the name
(identity) of the sound as well as a list of instances of the sound. The method,
which also makes use of the SoundFeature and SoundFeatureSet classes,
defined in the AudioLibrary, is shown in Listing 7.1. The first step is to
preprocess the sound as described above. The method does assume that each
instance contains a single word, but it must still be preprocessed, for example
to remove initial and final periods of silence and also, of course, to generate
the sound frames. Next, the feature time series are obtained for the current instance, and are then rescaled and resampled. Once feature time series are available for all instances, the average sound feature set is generated and stored for
later use.
The RecognizeSingle method, which takes a sound instance as input,
is shown in Listing 7.2. Here, too, the sound is processed as in the training method, resulting in a set of feature time series, with the same number
of samples as in the stored series. The method then computes the distance
(deviation) between the input sound and each of the stored sounds, and
then returns the list of deviations in the form of a RecognitionResult that contains both the computed feature time series and a list of sound identifiers (i.e. the text string name for each stored sound) and distances di for each stored sound, sorted in ascending order. From the RecognitionResult it is
then easy to check whether or not the first element (i.e. the one with smallest
distance) has a distance value below the threshold T .
Listing 7.1: The AppendSound method in the IsolatedWordRecognizer class.
public override void AppendSound(string name, List<WAVSound> instanceList)
{
    List<SoundFeatureSet> soundFeatureSetList = new List<SoundFeatureSet>();
    foreach (WAVSound soundInstance in instanceList)
    {
        soundInstance.SubtractMean();
        double startTime = soundInstance.GetFirstTimeAboveThreshold
            (0, soundExtractionMovingAverageLength, soundExtractionThreshold);
        double endTime = soundInstance.GetLastTimeAboveThreshold
            (0, soundExtractionMovingAverageLength, soundExtractionThreshold);
        WAVSound extractedInstance = soundInstance.Extract(startTime, endTime);
        extractedInstance.PreEmphasize(preEmphasisThresholdFrequency);
        WAVFrameSet frameSet = new WAVFrameSet(extractedInstance, frameDuration, frameShift);
        frameSet.ApplyHammingWindows(alpha);
        SoundFeatureSet soundFeatureSet = new SoundFeatureSet();
        List<SoundFeature> autoCorrelationFeatureList = frameSet.
            GetAutoCorrelationSeries("AutoCorrelation", autoCorrelationOrder);
        soundFeatureSet.FeatureList.AddRange(autoCorrelationFeatureList);
        List<SoundFeature> lpcAndCepstralFeatureList = frameSet.
            GetLPCAndCepstralSeries("LPC", lpcOrder, "Cepstral", cepstralOrder);
        soundFeatureSet.FeatureList.AddRange(lpcAndCepstralFeatureList);
        SoundFeature relativeNumberOfZeroCrossingsFeature = frameSet.
            GetRelativeNumberOfZeroCrossingsSeries("RNZC");
        soundFeatureSet.FeatureList.Add(relativeNumberOfZeroCrossingsFeature);
        soundFeatureSet.SetNormalizedTime();
        soundFeatureSet.Interpolate(numberOfValuesPerFeature);
        soundFeatureSetList.Add(soundFeatureSet);
    }
    SoundFeatureSet averageSoundFeatureSet = SoundFeatureSet.GenerateAverage(soundFeatureSetList);
    averageSoundFeatureSet.Information = name;
    if (averageSoundFeatureSetList == null)
    { averageSoundFeatureSetList = new List<SoundFeatureSet>(); }
    averageSoundFeatureSetList.Add(averageSoundFeatureSet);
    averageSoundFeatureSetList.Sort((a, b) =>
        a.Information.CompareTo(b.Information));
    OnAvailableSoundsChanged();
}
Listing 7.2: The RecognizeSingle method in the IsolatedWordRecognizer class.
public override RecognitionResult RecognizeSingle(WAVSound sound)
{
    sound.SubtractMean();
    double startTime = sound.GetFirstTimeAboveThreshold(0,
        soundExtractionMovingAverageLength, soundExtractionThreshold);
    double endTime = sound.GetLastTimeAboveThreshold(0,
        soundExtractionMovingAverageLength, soundExtractionThreshold);
    WAVSound extractedInstance = sound.Extract(startTime, endTime);
    extractedInstance.PreEmphasize(preEmphasisThresholdFrequency);
    WAVFrameSet frameSet = new WAVFrameSet(extractedInstance, frameDuration, frameShift);
    frameSet.ApplyHammingWindows(alpha);
    SoundFeatureSet soundFeatureSet = new SoundFeatureSet();
    List<SoundFeature> autoCorrelationFeatureList = frameSet.
        GetAutoCorrelationSeries("AutoCorrelation", autoCorrelationOrder);
    soundFeatureSet.FeatureList.AddRange(autoCorrelationFeatureList);
    List<SoundFeature> lpcAndCepstralFeatureList = frameSet.
        GetLPCAndCepstralSeries("LPC", lpcOrder, "Cepstral", cepstralOrder);
    soundFeatureSet.FeatureList.AddRange(lpcAndCepstralFeatureList);
    SoundFeature relativeNumberOfZeroCrossingsFeature = frameSet.
        GetRelativeNumberOfZeroCrossingsSeries("RNZC");
    soundFeatureSet.FeatureList.Add(relativeNumberOfZeroCrossingsFeature);
    soundFeatureSet.SetNormalizedTime();
    soundFeatureSet.Interpolate(numberOfValuesPerFeature);
    RecognitionResult recognitionResult = new RecognitionResult();
    recognitionResult.SoundFeatureSet = soundFeatureSet;
    if (averageSoundFeatureSetList != null)
    {
        foreach (SoundFeatureSet averageSoundFeatureSet in averageSoundFeatureSetList)
        {
            double deviation = SoundFeatureSet.GetDeviation
                (averageSoundFeatureSet, soundFeatureSet, weightList);
            string soundName = averageSoundFeatureSet.Information;
            recognitionResult.DeviationList.Add
                (new Tuple<string, double>(soundName, deviation));
        }
        recognitionResult.DeviationList.Sort((a, b) =>
            a.Item2.CompareTo(b.Item2));
    }
    return recognitionResult;
}
Figure 7.3: The speech recognizer tab of the IWR application. Here, the time series for
the third cepstral coefficient (average and standard deviation) are shown for two of the stored
words, namely end and yes.
7.4 Demonstration applications
Two applications have been written for the purpose of demonstrating how
the SpeechRecognitionLibrary can be used, namely (i) an Isolated word
recognizer (IWR) application that allows the user to train a speech recognizer
using a set of instances for each word, and then to use the speech recognizer
either by loading a sound instance from a file or by recording it; and (ii) a Listener application, which continuously records from a microphone and applies
speech recognition whenever a new sound is available.
7.4.1 The IWR application
Figs. 7.3 and 7.4 show the GUI of the IWR application. The form contains a tab
control with two tabs, a speech recognizer tab and a usage tab. In order to train
a speech recognizer, the user first sets the appropriate parameter values for
preprocessing and feature extraction (which then remain fixed, regardless of
how many words are added to the database). Next, a new speech recognizer is
generated, and the user can then train it by loading a set of instances for each
word and forming the time normalized and resampled feature time series as
described above. In Fig. 7.3, the speech recognizer has been trained on the
Figure 7.4: The usage tab of the IWR application. The user has recorded the word yes, and
then applied the speech recognizer, which correctly identified the word. The graph in the lower
right part of the figure shows the first autocorrelation coefficient time series for the recorded
sound.
words back, end, next, no, and yes. Once a few words have been added to the
database, the user can view the time series (averages and standard deviation)
for each feature, and for each stored word. The figure shows two time series,
namely for the third cepstral coefficient for the two (selected) words, end and
yes.
This program assumes that the recorded sound consists of a single word, possibly with some periods of silence at the beginning and end of the recording. Thus, it does not make any attempt to distinguish, for example, a true
recorded sound from noise. Fig. 7.4 shows the usage tab page. Here, the
user has recorded a word, namely yes, and then applied the speech recognizer,
which correctly identified the word. Here, the recognition threshold was set
to 0.0333, and the distance dmin (obtained for the word yes) was 0.0219. Quite
correctly, no other stored word reached a distance value below the threshold.
The form also shows a plot of the feature time series for the recorded sound.
7.4.2 The Listener application
As mentioned above, the Listener application records continuously and, moreover, is able to act as a client connecting to an agent program, as described in Chapter 2.
Figure 7.5: The GUI of the Listener application. In the situation shown here, the incoming
sound has been split at six split points (shown as yellow vertical lines), and the listener has
recognized the word back, ignoring the noise seen between the first two split points.
Thus, the output of the program is a string representation of the recognized word, along with a time stamp, making it possible for the agent to determine the appropriate response, if any.
However, for this application, it is not sufficient to use the simple approach
employed in the IWR application, where it was assumed that the recorded
sound contains precisely one word: In continuous recording, even though the
WAVRecorder that is responsible for the actual recording does have a limited
memory, its current recording may still contain several words and also partial
words, i.e. a word that the speaker has begun, but not yet finished, uttering.
Note also that the recording buffer does have a certain size, so even after the
user has completely uttered a given word, there is a (small) delay until the
corresponding sound samples are available to the speech recognizer. Moreover, when recording sounds continuously, it is inevitable that there will be
occasional noise in the signal. Thus, some form of intelligent processing is
required. In the Listener program, one such procedure has been implemented.
Here, the incoming sound is split into pieces by considering short snippets
(typical duration: 0.02 s) and then defining split points at those snippets that
contain only silence (based on a given threshold). This gives a set of k split
points. The program then builds all possible sounds such that the start of the
sound occurs at split point i, i = 1, . . . , k − 1, and the end at split point j,
j = i + 1, . . . , k. Next, the speech recognizer is applied to all such sounds,
resulting in a set of dmin values, one for each sound. The word identified (if
any) for the sound corresponding to the lowest value of dmin (provided that
this value is below the threshold T) is then taken as the output and is sent to
the agent program (if the latter is available).
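A sketch of how this split-point procedure could be combined with the classes described earlier in this chapter is shown below (illustration only, not the Listener source; it assumes that the split points are given as a list of times, that the recognizer has already been trained, and that the relevant library namespaces are referenced).

// Minimal sketch: try all sounds between pairs of split points, and return the word
// with the overall lowest distance, provided that it is below the threshold T.
public static string RecognizeFromSplitPoints(WAVSound sound, List<double> splitTimes,
                                              IsolatedWordRecognizer recognizer, double T)
{
    string bestWord = null;
    double bestDistance = double.MaxValue;
    for (int i = 0; i < splitTimes.Count - 1; i++)
    {
        for (int j = i + 1; j < splitTimes.Count; j++)
        {
            WAVSound candidate = sound.Extract(splitTimes[i], splitTimes[j]);
            RecognitionResult result = recognizer.RecognizeSingle(candidate);
            if (result.DeviationList.Count == 0) { continue; }
            Tuple<string, double> best = result.DeviationList[0];  // sorted in ascending order
            if (best.Item2 < bestDistance) { bestDistance = best.Item2; bestWord = best.Item1; }
        }
    }
    return (bestDistance <= T) ? bestWord : null;  // null: nothing recognized
}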
An example, which also illustrates the GUI of the Listener application, is
shown in Fig. 7.5. Here, the word back is present, but so is a small sequence of noise preceding the word. There is a total of six split points, resulting in 15
different possible sounds according to the procedure just described. Of those
combinations, the one involving the third and the sixth (last) split point gave
the lowest dmin , which also was below the detection threshold for the word
back, resulting in recognition of this word, as can be seen in the figure.
Chapter 8
Internet data acquisition
In addition to sensing its immediate surroundings using cameras and microphones, as described in some of the previous chapters, an IPA may also need
to access information from the internet. For example, one can envision an
agent with the task of downloading news or weather reports from the internet, and then presenting the results, perhaps along with pictures and videos,
to the user, either spontaneously when some news item (of interest to the user)
appears, or upon request from the user.
The procedure of accessing data from the internet can be divided into two
logical steps: First, the agent must download the raw data. Next, it must parse
the raw data to produce a meaningful and easily interpretable result. Of course,
neither an agent nor its user(s) can control the formatting of a given web
page. Thus, any specific method for parsing the contents of a general web
page is likely to be brittle and prone to error if the structure of the web page
is changed, for some reason. However, there are sites (especially for news, weather, and similar topics) that operate as so called Really Simple Syndication (RSS) feeds and are formatted in a well-defined manner, so that they can easily be parsed.
At this point, it is important to note that not all sites welcome (or even
allow) access by artificial agents. In fact, some even take countermeasures
such as requiring information to confirm that the user is indeed human, or
banning access from an IP number that tries to reload a page too frequently.
Of course, one must respect those restrictions, and only let an agent access
sites that allow downloads by artificial agents. Here, again, the RSS feeds are
important, since they are specifically designed for repeated, automatic access,
and therefore rarely carry restrictions of the kind just described.
For the IPAs considered here, a specific library, namely the InternetDataAcquisition library (described next), has been implemented for downloading and parsing information from web sites.
Listing 8.1: The DownloadLoop method in the HTMLDownloader class.
private void DownLoadLoop()
{
    while (running)
    {
        Stopwatch stopWatch = new Stopwatch();
        stopWatch.Start();
        using (WebClient webClient = new WebClient())
        {
            try
            {
                string html = webClient.DownloadString(url);
                DateTime dateTime = DateTime.Now;
                Boolean newDataStored = StoreData(dateTime, html);
                if (newDataStored) { OnNewDataAvailable(); }
            }
            catch (WebException e)
            {
                running = false;
                OnError(e.Status.ToString());
            }
        }
        stopWatch.Stop();
        double elapsedSeconds = stopWatch.ElapsedTicks / (double)Stopwatch.Frequency;
        int elapsedMilliseconds = (int)Math.Round(elapsedSeconds *
            MILLISECONDS_PER_SECOND);
        int sleepInterval = millisecondDownloadInterval - elapsedMilliseconds;
        if (sleepInterval > 0) { Thread.Sleep(sleepInterval); }
        if (!runRepeatedly) { running = false; }
    }
}
8.1 The InternetDataAcquisition library
The classes in this library provide implementations of the two steps described
above, namely downloading and then parsing data.
8.1.1 Downloading data
Most of the low-level code required to access web pages is available in the standard libraries distributed with C#. Thus, the user can focus on more high-level aspects of data downloads. Here, two data downloaders have been implemented, namely the HTMLDownloader and the RSSDownloader (the latter making use of the CustomXmlReader, described below).
The HTMLDownloader class
This class allows repeated downloads of the raw HTML code of any web page,
using the WebClient class available in the System.Net namespace. Here,
the HTML code is placed in a single string. The download is handled by
a separate thread that also is responsible for storing the downloaded string
(along with a time stamp) if it differs from the most recently downloaded string. Listing 8.1 shows the DownLoadLoop method executing in
the download thread. Two event handlers are defined, one for signalling
the arrival of new data and one for indicating download errors. Unless the
runRepeatedly variable is set to false, download attempts are carried out
with a user-specified frequency.
Here, again, it is important to note that not all web sites allow this kind
of repeated, automatic downloads. A specific example is Google, which actively prevents a user from accessing (for example) image search results by
direct download of the (links in the) HTML code of the search page. However,
Google does allow access via their own C# API, which can be downloaded from
their web site. Thus, in this particular case, it is still possible to obtain the information without violating any rules, but this is not the case for all web pages.
It is the user’s responsibility to check any restrictions on automatic downloads
before attempting to apply such methods.
The RSSDownloader class
As mentioned above, some sites are specifically designed for repeated automatic downloads. RSS feeds constitute an important special case. The RSSDownloader
class has been written specifically to deal with this case. RSS pages are generated in XML format and can thus be accessed using the XmlTextReader
class, available in the System.Xml namespace.
Among the various information items specified in an RSS item is the (publish) date of the item in question. Somewhat surprisingly, the standard Xml
(text) reader class (i.e. the XmlTextReader) does not handle all date formats.
Two common ways to format a date (that, along with several others, are handled by the standard XML reader) are ddd, dd MMM yyyy hh:mm:ss (example: Fri, 21 Oct 2016 07:14:17) and ddd, dd MMM yyyy hh:mm:ss 'GMT' (example: Fri, 21 Oct 2016 07:18:53 GMT)¹. However, a format such as ddd MMM dd yyyy hh:mm:ss 'GMT+0000' (example: Fri Oct 21 2016 07:28:19 GMT+0000), which (along with several other formats) often occurs in RSS feeds that are not based in the US, cannot be handled by the standard XML reader.
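As an illustration of the underlying problem (not of the CustomXmlReader itself), a date string of the latter kind can be parsed with standard .NET functionality by supplying the format explicitly; note that HH (24-hour clock) is used here, and that using System.Globalization is assumed.

// Illustration only: parsing a publish date of the problematic format above.
private DateTimeOffset ParsePublishDate(string publishDate)
{
    // Example input: "Fri Oct 21 2016 07:28:19 GMT+0000"
    string format = "ddd MMM dd yyyy HH:mm:ss 'GMT+0000'";
    return DateTimeOffset.ParseExact(publishDate, format,
        CultureInfo.InvariantCulture, DateTimeStyles.AssumeUniversal);
}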
For that reason, an alternative approach is required. Of course, one could
just read the web page using the HTMLDownloader described above and then
write a custom parser (see below). However, a better approach is simply to
write a custom XML reader class that implements all the aspects of the standard XML reader, while also handling different date formats. This is the approach chosen here, with the implementation of the CustomXmlReader.
¹ In C#, there are many different ways of formatting DateTime or DateTimeOffset instances. See e.g. MSDN for more information.
Listing 8.2: The RunLoop method in the RSSDownloader class. The ProcessFeed
method, not shown here, simply stores the various items in the SyndicationFeed in a
thread-safe manner, to allow asynchronous access.
private void RunLoop()
{
    while (running)
    {
        Stopwatch stopWatch = new Stopwatch();
        stopWatch.Start();
        using (CustomXmlReader xmlReader = new CustomXmlReader(url))
        {
            xmlReader.SetCustomDateTimeFormat(customDateTimeFormat);
            xmlReader.Read();
            SyndicationFeed feed = SyndicationFeed.Load(xmlReader);
            ProcessFeed(feed);
        }
        stopWatch.Stop();
        double elapsedSeconds = stopWatch.ElapsedTicks / (double)Stopwatch.Frequency;
        int elapsedMilliseconds = (int)Math.Round(elapsedSeconds *
            MILLISECONDS_PER_SECOND);
        int sleepInterval = millisecondDownloadInterval - elapsedMilliseconds;
        if (sleepInterval > 0) { Thread.Sleep(sleepInterval); }
    }
}
The CustomXmlReader class operates precisely as the standard XML reader, except that the user can also specify the date format, which is required in cases where it differs from the formats that can be handled by the standard XML reader.
The RSSDownloader class makes use of the CustomXmlReader to download RSS feeds at regular intervals, and to store all the items for later access.
The parsing of an RSS feed will be described below. Listing 8.2 shows the
thread (in the RSSDownloader) responsible for executing repeated downloads of RSS feeds.
8.2 Parsing data
Parsing is the process by which an encoded piece of information, such as a
web page in HTML format, is converted into standard, readable text. Here,
two approaches will be described briefly, namely general parsing of HTML
code, and parsing of the XML code in an RSS feed.
8.2.1 The HTMLParser class
This class provides generic processing of any information (initially) stored in a
single string. The class contains a Split method that simply splits the string
(in the first call to the method, after assigning the initial string) or the list of
Listing 8.3: Code snippet for setting up and starting an RSSDownloader. The values
of the three parameters (url, dateFormat, and downloadInterval) are obtained, for
example, via text boxes in the GUI of the application in question.
...
rssDownloader = new RSSDownloader(url);
rssDownloader.SetCustomDateTimeFormat(dateFormat);
rssDownloader.DownloadInterval = downloadInterval;
rssDownloader.Start();
...
strings resulting from an earlier application of the same method. The user provides the method with a list of so called split strings. Whenever such a split
string is encountered, the string in which it was found is split into two (and
the split string itself is removed). A typical HTML page contains characters
(HTML tags) used when formatting the HTML code for display in a browser,
such as, for example, <p> and </p> to indicate the start and end of a paragraph, or <b> and </b> to indicate the start and end of the use of a bold font.
In order to convert an HTML page to plain text, a common step is thus to remove such tags, by applying the appropriate call to the Split method. Other
methods are also defined in this class, for example to extract all strings fulfilling some conditions. For instance, one may wish to extract all web page links
to PDF documents, by finding strings that start with http:// and end with
.pdf.
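A generic illustration of these two operations, using only standard string methods (and thus not the HTMLParser API itself), could look as follows; note that the PDF-link extraction is deliberately naive.

// Illustration only: splitting a string on a list of split strings (which are removed),
// and extracting all links starting with http:// and ending with .pdf.
private List<string> SplitOnStrings(string text, List<string> splitStrings)
{
    return new List<string>(text.Split(splitStrings.ToArray(), StringSplitOptions.RemoveEmptyEntries));
}

private List<string> ExtractPDFLinks(string html)
{
    List<string> links = new List<string>();
    int index = html.IndexOf("http://");
    while (index >= 0)
    {
        int end = html.IndexOf(".pdf", index);
        if (end < 0) { break; }
        links.Add(html.Substring(index, end - index + 4));
        index = html.IndexOf("http://", end);
    }
    return links;
}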
8.2.2 RSS feeds
As mentioned above, RSS feeds are in XML format and, more specifically, define certain fields that can easily be accessed. When the contents of a (custom)
Xml reader are passed to a SyndicationFeed instance, the result is a list of
objects of type SyndicationItem, each of which defines several fields, such as
Title, Summary, PublishDate², etc. Once the syndication feed items have
been generated, very little additional parsing is required. Thus, no specific
class has been written for parsing RSS feeds. However, a usage example will
be given in the next section.
² Note that, in the SyndicationItem class, the PublishDate is defined as a DateTimeOffset (rather than a DateTime), the difference being that the DateTimeOffset measures coordinated universal time (UTC) and also, for example, makes comparisons between instances based on UTC, whereas DateTime generally refers to the date and time in a given time zone.
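A minimal sketch of iterating over the items of a loaded feed is shown below (it assumes a reference to System.ServiceModel.Syndication and a SyndicationFeed obtained as in Listing 8.2).

// Minimal sketch: accessing the fields of each item in a SyndicationFeed.
foreach (SyndicationItem item in feed.Items)
{
    DateTimeOffset publishDate = item.PublishDate;
    string title = item.Title.Text;
    string summary = (item.Summary != null) ? item.Summary.Text : "";
    Console.WriteLine(publishDate.ToString() + ": " + title + " - " + summary);
}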
Figure 8.1: The GUI of the RSSReader application. Here, a single news item, with its
publish date and title shown in green, arrived between the two most recent updates of the
RSSDownloader.
8.3 The RSSReader application
As its name implies, the RSSReader application reads from an RSS feed, and
displays the publish date and the title of each item on the screen. The program
defines an RSSDownloader that carries out downloads in its own thread, with
a user-specified interval between downloads. Moreover, the program is capable of sending the corresponding information to the agent program, if the latter
is available. A separate thread (independent of the RSSDownloader) handles
the display of news items to the screen. Listing 8.3 shows the code snippet for
setting up and starting the RSSDownloader. Fig. 8.1 shows the GUI of the
RSSReader application. New items, i.e. those that have been published since
the last update of the RSSDownloader, are shown in green, whereas older
items are shown in gray.
Appendix A
Programming in C#
In this appendix, several important aspects of C# .NET are introduced. The
aim is not to give a complete description of either the C# language or its IDE,
but rather to describe some concepts that anyone developing IPAs in C# (for
example using the IPA libraries) must know. In addition to reading the text
below, the reader should also study the various demonstration applications
distributed together with the IPA libraries.
There are also several excellent books on C#. In addition, the answers to
many questions regarding C# can be found either at the Microsoft Developer Network (MSDN) web site¹ or in various internet fora such as StackOverflow². In fact, given the number of skilled people working with C# .NET,
finding an answer to a given question is not so difficult; the problem is to ask
the right question, something that requires a bit of experience. The first three
sections below describe basic, fundamental concepts of C#, whereas the remaining sections describe more advanced topics.
As mentioned in Chapter 1, C# .NET is a part of Microsoft’s Visual Studio. The illustrations below will be given in the 2010 version of Visual Studio,
running under Windows 7. However, the appearance and use of the IDE is
essentially the same for newer versions of Visual Studio (e.g. the 2015 version)
and for newer versions of Windows (e.g. Windows 10). The code in the IPA libraries has been tested under Windows versions 7, 8, and 10, and Visual Studio
versions 2010 and 2015. A detailed introduction to the C# IDE can be found at
MSDN³.
¹ msdn.microsoft.com
² stackoverflow.com
³ https://msdn.microsoft.com/en-us/library/ms173064(v=vs.90).aspx
Figure A.1: The window of the C# IDE, showing (1) the Solution Explorer, (2) the Windows
Form Designer and Code Editor, (3) the Properties panel and (4) the Toolbox.
A.1 Using the C# IDE
In .NET, the source code of an application is contained in one or several projects
that, in turn, are contained in a solution. A specific example is the SpeechProcessing
solution distributed along with the IPA libraries. This solution contains two
applications (i.e. projects that define a standalone executable, an .exe file),
but also many other projects (from the IPA libraries) in the form of class libraries. A class library is simply a set of classes (see Sect. A.2 below) that can
be used in one or several applications.
When the C# IDE is opened, a window similar to the one shown in Fig. A.1
appears⁴. In the specific case shown in the figure, the user has opened the
DemonstrationSolution containing the source code described in this appendix. Now, in the IDE main window, there are many subwindows that assist the user with various aspects of code development. Some of the most
important subwindows have been highlighted in the figure, namely (1) the
Solution Explorer, (2) the Windows Forms Designer and Code Editor, (3)
the Properties Window and (4) the Toolbox.
As can be seen in the Solution Explorer, the solution contains several projects.
Most of those projects are applications, but one (the ObjectSerializer library) is a class library used in connection with serialization; see Sect. A.7. The
user can start an application by right-clicking on a project, and then selecting
⁴ The exact appearance of the IDE is somewhat version-dependent and can also be customized to fit the user's preferences.
Figure A.2: The window of the C# IDE after the user has opened the code associated with the
form of the FirstExample application.
Debug - Start new instance. In every solution (containing at least one
executable application), exactly one project is the startup project, i.e. the application that will run by default, if the user presses the green arrow in the
tool strip (near the top of the window) or simply presses F5. In this case,
FirstExample is the startup project. The user can easily change the startup
project, by right-clicking on any application and selecting Set as StartUp
Project.
The Windows Forms Designer allows the user to generate the layout of the
so called forms (i.e. the windows) of an application. A form is a special case of
a control, i.e. a graphical component. The Properties window allows the user
to set various parameters associated with a control. In Fig. A.1, the user has
scrolled down to view the Text property that determines the caption of the
application’s form. The Toolbox allows the user to select and add additional
controls (e.g. buttons, text boxes etc.) to a form.
The window referred to as the Windows Forms Designer above is also used
as Code Editor. In Fig. A.2, the user has opened the code associated with the
main form of the FirstExample, by right-clicking on FirstExampleMainForm.cs
in the solution explorer, and selecting View Code. The code can then be
edited as necessary. Note also that the IDE can help the user by auto-generating
some parts of the code. For example, in the case of a button, some action
should be taken when the user clicks on it. If a button is double-clicked in the
Windows Forms Editor, the IDE will generate skeleton code for the method
associated with the button click. The user must then fill the method with the
Listing A.1: The code in the FirstExample main form.
using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Windows.Forms;

namespace FirstExample
{
    public partial class FirstExampleMainForm : Form
    {
        public FirstExampleMainForm()
        {
            InitializeComponent();
        }

        private string GenerateResponse()
        {
            string hello = "Hello user. Today is a ";
            string dayOfWeek = DateTime.Now.DayOfWeek.ToString();
            hello += dayOfWeek + ".";
            return hello;
        }

        private void helloButton_Click(object sender, EventArgs e)
        {
            string response = GenerateResponse();
            responseTextBox.Text = response;
        }

        private void exitButton_Click(object sender, EventArgs e)
        {
            Application.Exit();
        }
    }
}
necessary code for responding appropriately to the user’s action. Note that
every control is in fact associated with a large number of events, of which the
button click is one example.
When running an application from within the IDE (for example by pressing
F5) it is possible also to pause the code using breakpoints. A breakpoint
(shown as a red filled disc in the IDE) can be inserted either by clicking on
the left frame of the Code Editor, or by right-clicking on a line of code in the
Code Editor and selecting BreakPoint - Insert Breakpoint. When the
application reaches a breakpoint, execution is paused. If F5 is pressed, the
application then continues to the next breakpoint (if any). One can also use
the F10 (step over) and F11 (step into) keys to step through the code. When
execution is paused, the corresponding line of code is shown in yellow, and
the user can investigate the values of variables etc., by placing the mouse over
a given statement in the code.
The entire listing for the FirstExample main form is given in Listing A.1.
The listing begins with a set of using clauses, which are specifications of class
libraries necessary for the code associated with the control. In this particular
case, these clauses all involve code included in the System namespace. However, in many cases, one may need to use code that is not included in the standard distribution of C#. A specific example is the use of the ObjectSerializer
library in the SerializationExample (see Sect. A.7 below). In such cases,
one must first add a reference before instructing C# that the code in a specific
namespace should be used. In order to do so, the user must right-click on
the folder marked References in the solution explorer (for the project in question), and then select the appropriate file. Once the reference has been added,
the corresponding code (or, to be exact, its public methods and properties; see
Sect. A.2 below) will be available for use.
The remainder of the listing defines the methods associated with the main
form of the FirstExample. Summarizing briefly, the code responds to a click
on the Hello button, by printing, in the text box, the text Hello user followed
by a specification of the current weekday. If instead the user clicks the Exit
button, the application terminates.
A.2 Classes
C# .NET is an object-oriented programming language (as are many other
modern programming languages), in which one defines and uses objects that,
in turn, are instances of classes. In general, a class contains the fields (variables) and methods relevant for objects of the type in question. Object-oriented
programming is a very large topic and, as mentioned earlier in this chapter,
here only a very brief description will be given.
As a specific example, consider flat, two-dimensional shapes, such as rectangles, circles, triangles etc. Such shapes share some characteristics. For example, they all have a certain surface area, even though its detailed computation
varies between the different shapes. A common approach is to define a so
called abstract (base) class, from which other classes are derived.
Consider now the ClassExample application. Here, a simple base class
has been defined⁵ for representing shapes. Moreover, a derived class has been
defined as well (see below). As can be seen in Listing A.2, the base class
(Shape) contains one field, namely hasCorners. Note that, by convention,
⁵ In order to add a class to a project, one right-clicks on the application in the Solution Explorer, and then selects Add - Class.... To rename a class, one should right-click on the class and select Rename. Finally, to rename a field, one should right-click on it in the Code Editor, and then select Refactor - Rename...; the IDE then makes sure that all instances of the field are correctly renamed.
Listing A.2: The (abstract) Shape class, from which classes that implement specific shapes
are derived.
public abstract class Shape
{
    protected Boolean hasCorners;

    public abstract double ComputeArea();

    public Boolean HasCorners
    {
        get { return hasCorners; }
    }
}
Listing A.3: The Rectangle class, derived from the Shape class.
public class Rectangle : Shape
{
    private double sideLengthX;
    private double sideLengthY;

    public Rectangle()  // Constructor
    {
        hasCorners = true;
    }

    public override double ComputeArea()
    {
        double area = sideLengthX * sideLengthY;
        return area;
    }

    public double SideLengthX
    {
        get { return sideLengthX; }
        set { sideLengthX = value; }
    }

    public double SideLengthY
    {
        get { return sideLengthY; }
        set { sideLengthY = value; }
    }

    public double Area
    {
        get { return sideLengthX * sideLengthY; }
    }
}
fields always start with a small letter. It also defines an abstract method called
ComputeArea. The class itself is marked as abstract, as is the method just
mentioned, meaning that this method must be implemented in the classes derived from the Shape class. The method is also public meaning that it is
visible in other classes (for example, but not limited to, classes derived from
the Shape class). This method should return the area as its output, which
would be a number of type double, and this is also specified in the code.
In general, method names begin with a capital letter. Note also that since a
method is intended to actively carry out some action, in this case computing
the area of a shape, the name should reflect this by including a verb. Thus,
AreaComputation would not be a suitable name for this method.
The field is listed as protected, meaning that it is visible to any classes
derived from the Shape class, but not to other classes. The Shape class also
defines a property which is public, meaning that it is visible to other classes.
In this particular case, the property is very simple, but in other cases a property
may involve more complex operations, including method calls. By convention,
properties always begin with a capital letter.
Listing A.3 shows a derived class, namely Rectangle. The first line in
the class indicates that the Rectangle class is derived from the Shape class,
and therefore can access its (protected) fields. Note that the Shape class is
not explicitly derived from any class but it is implicit in C# that all classes are
derived from a generic base class called Object. In this case, each derived
class must define additional fields that are specific to the shape in question,
and which are then used in the respective ComputeArea methods, in order
to compute the area. Note that the field hasCorners is visible to the derived
classes, since it is marked as being protected. The fields introduced in the
derived classes are marked as private, meaning that they are not visible to
other classes. The use of these keywords (private, protected, public
etc.) makes it possible for a developer to determine which parts should be
visible to other users, who may not perhaps have access to the source code,
but instead only a dynamic-link library (DLL)⁶. An external user will only be
able to access public methods and properties.
The derived class also has a constructor which is called whenever a
corresponding object (i.e. an instance of the class) is generated. In this simple
case, the constructor simply sets the parameter that determines whether or not
the shape in question has any corners. This parameter is not, of course,
needed for the computation of the area; it is included only to demonstrate the
use of fields in derived classes. The ComputeArea method of the Rectangle
⁶ During compilation of a C# application, the various class libraries are compiled into DLLs so that they can be used by the application. In cases where one does not have access to the source code of a class library, one can still make use of the class library, provided that one has the corresponding DLL. If so, one can add a reference to the DLL, just as one would add a reference to a class library.
Listing A.4: A simple example showing the use of the Rectangle class. First, a rectangle
with side lengths 3 and 2 is generated. Next, its area is obtained and printed. Then, the longer
of the two sides is shortened to 1 length unit, and the (new) area is again obtained and printed.
private void runExampleButton_Click(object sender, EventArgs e)
{
    Rectangle rectangle = new Rectangle();
    rectangle.SideLengthX = 3;
    rectangle.SideLengthY = 2;
    double area = rectangle.Area;
    classExampleTextBox.Text = "Side lengths: " + rectangle.SideLengthX.ToString() +
        ", " + rectangle.SideLengthY.ToString() + ", Area: " + area.ToString() + "\r\n";
    rectangle.SideLengthX = 1;
    area = rectangle.ComputeArea();
    classExampleTextBox.Text += "Side lengths: " + rectangle.SideLengthX.ToString() +
        ", " + rectangle.SideLengthY.ToString() + ", Area: " + area.ToString() + "\r\n";
}
class is prefixed with the keyword override, meaning that it overrides (replaces) the abstract method defined in the base class.
Note that the properties of the Rectangle class are a bit more complex
than for the base class. Here, one can both retrieve the side lengths (x and y)
and also set their values. Moreover, an Area property is defined, which computes the area. Note that this property is redundant: One might as well use the
ComputeArea method to obtain the area. The property has been introduced
here only to illustrate a more complex case, where a certain computation (beyond mere assignment) is carried out in a property, and where the property, in
fact, does not have a corresponding field (e.g. area). Here, it is better not to
define an area field, particularly if the user would be allowed to set it directly.
In that case a user might, say, update the side lengths and then incorrectly set
the area! It is not possible to make such a mistake with the code shown in
Listing A.3: The user can access or compute the area, but cannot set it directly. A
suitable exercise for the reader is now to implement, say, a Circle class, with
the corresponding fields, the ComputeArea method, and the Area property.
Listing A.4 shows a simple method (the button click event handler in the
form (window) of the application) that instantiates a rectangle shape, computes and prints the area, then changes the length of one side, and then computes (in a different way) and prints the area again. Clearly, one can define
many other fields and methods relevant to shapes, for example fields that set
the color, position, orientation etc. of a shape, and methods that, for instance,
grow, shrink, move, or rotate the shape. As another exercise, the reader should
add a few additional fields and their respective properties, along with a few
methods of the kind just mentioned.
Note also that fields can themselves consist of objects. In this example, all
fields were so called simple types, i.e. types that are available as an integral
Listing A.5: An example of the use of generic lists, in this case a simple list of integers. The
ShowList method (not shown here) simply prints the elements of the list to the screen, along
with a comment.
private void runExample1Button_Click(object sender, EventArgs e)
{
    List<int> integerList1 = new List<int>();  // => { }
    integerList1.Add(5);        // => {5}
    integerList1.Add(8);        // => {5, 8}
    integerList1.Add(-1);       // => {5, 8, -1}
    ShowList("Addition of elements: ", integerList1);
    integerList1.Sort();        // => {-1, 5, 8}
    ShowList("Sorting: ", integerList1);
    integerList1.Reverse();     // => {8, 5, -1}
    ShowList("Reversal: ", integerList1);
    integerList1.Insert(0, 3);  // => {3, 8, 5, -1}
    ShowList("Insertion: ", integerList1);
    integerList1.RemoveAt(2);   // => {3, 8, -1}
    ShowList("Removal at index 2: ", integerList1);
    List<int> integerList2 = integerList1;  // integerList2 points to integerList1!
    ShowList("Pointer to list: ", integerList2);
    integerList1[1] = 2;  // => Assigns 2 to integerList1[1] AND integerList2[1]
                          //    (both are the same list!)
    ShowList("List 1, element 1 modified: ", integerList1);
    ShowList("... and list 2: ", integerList2);
    List<int> integerList3 = new List<int>();  // A new instance...
    foreach (int element in integerList1) { integerList3.Add(element); }
    integerList3[1] = 7;  // => Assigns 7 to integerList3[1] but NOT integerList1[1]
    ShowList("List 1, again: ", integerList1);
    ShowList("... and list 3: ", integerList3);
}
part of the C# language. However, one could very well define a class containing fields that are instances of any of the shape classes just defined, or even
lists of such classes (see also the next section).
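As a minimal sketch of this idea, one might define a class along the following lines, using the Rectangle class from above together with a generic list of such objects (the Drawing class and its members are illustrative names, not part of the actual code; generic lists require the System.Collections.Generic namespace):

// Sketch only: a class whose fields are themselves objects, namely a single
// Rectangle and a generic list of Rectangle objects.
public class Drawing
{
    private Rectangle frame = new Rectangle();                    // an object field
    private List<Rectangle> rectangles = new List<Rectangle>();   // a list of objects

    public Rectangle Frame
    {
        get { return frame; }
    }

    public void AddRectangle(Rectangle rectangle)
    {
        rectangles.Add(rectangle);
    }

    public double TotalArea()
    {
        double totalArea = 0;
        foreach (Rectangle rectangle in rectangles) { totalArea += rectangle.ComputeArea(); }
        return totalArea;
    }
}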
A.3 Generic lists
The .NET framework includes the concept of generic lists, i.e. lists containing
instances of any kind of object, with operations that are common to a list
regardless of its contents, such as addition, insertion, removal etc. Moreover,
there are generic methods for certain common operations, such as sorting.
An example showing some of the many uses of generic lists can be found
in the GenericListExample application. This application contains four buttons, one for each example. The code for the first example (the leftmost button
on the form) is shown in Listing A.5. In this case, a simple list of integers is
generated, and it is then sorted and reversed. Next, a new element is inserted
(at index 0), and then the element at index 2 is removed. A new list is then
generated that points to the first list, so that if one makes changes in one of the
lists, those changes also affect the other list. Finally, a new list is generated as
Listing A.6: The TestClass used in the second, third, and fourth examples.
public class TestClass
{
    private int integerField;
    private double doubleField;

    public TestClass Copy()
    {
        TestClass copiedObject = new TestClass();
        copiedObject.IntegerProperty = integerField;
        copiedObject.DoubleProperty = doubleField;
        return copiedObject;
    }

    public string AsString()
    {
        string objectAsString = integerField.ToString() + " " +
            doubleField.ToString();
        return objectAsString;
    }

    public int IntegerProperty
    {
        get { return integerField; }
        set { integerField = value; }
    }

    public double DoubleProperty
    {
        get { return doubleField; }
        set { doubleField = value; }
    }
}
a new instance, such that any changes made to it do not affect the other list.
The situation becomes a bit more complex if the elements of a list are not
simple types, i.e. types such as int, double etc. Consider now the second
example (second button from the left on the form). In this case, a generic list
of objects (of type TestClass) is defined. The definition of this simple class is
given in Listing A.6. The class also contains an explicit Copy method, which
generates a new instance identical to the one being copied.⁷
In Example 1 above, sorting the list was easy, as the process of comparing
two integers to determine which one is larger is, of course, well-defined. But
what about the list of objects in Example 2? As shown in the code for this
⁷ Note that copying can be handled automatically (using the so-called ICloneable interface), but one must then be careful to distinguish between a shallow copy and a deep copy. In the
case of a shallow copy, the copied fields (other than simple types) do not consist of new instances but
are instead references to instances in the original object. For this reason, it is often a good idea to
write an explicit copying method, which copies the necessary fields as required by the application at hand. This is especially true in cases where the source code is provided, so that the
programmer can easily see exactly what parts are being copied.
Listing A.7: The two methods required for the second example. Here, a list (list1) of
TestClass objects is generated, and the list is then sorted in two different ways. The
ShowTestClassList method displays the elements of a list of TestClass objects on
the screen.
private void GenerateList1()
{
    list1 = new List<TestClass>();
    TestClass testObject1 = new TestClass();
    testObject1.IntegerProperty = 4;
    testObject1.DoubleProperty = 0.5;
    list1.Add(testObject1);
    TestClass testObject2 = new TestClass();
    testObject2.IntegerProperty = 2;
    testObject2.DoubleProperty = 1.5;
    list1.Add(testObject2);
    TestClass testObject3 = new TestClass();
    testObject3.IntegerProperty = 5;
    testObject3.DoubleProperty = -1.5;
    list1.Add(testObject3);
    TestClass testObject4 = new TestClass();
    testObject4.IntegerProperty = 2;
    testObject4.DoubleProperty = -0.5;
    list1.Add(testObject4);
}

private void runExample2Button_Click(object sender, EventArgs e)
{
    displayTextBox.Text = "";
    GenerateList1();
    ShowTestClassList("Initial list", list1);
    list1.Sort((a, b) => a.DoubleProperty.CompareTo(b.DoubleProperty));
    ShowTestClassList("List sorted (DoubleProperty)", list1);
    list1 = (List<TestClass>)list1.OrderBy(a => a.IntegerProperty).
        ThenBy(b => b.DoubleProperty).ToList();
    ShowTestClassList("List sorted (IntegerProperty, then DoubleProperty)", list1);
}
Listing A.8: An example of a shallow copy of a list of objects.
private void runExample3Button_Click(object sender, EventArgs e)
{
    displayTextBox.Text = "";
    GenerateList1();   // See example 2
    ShowTestClassList("List 1", list1);
    // Shallow copy
    list2 = new List<TestClass>();
    list2.Add(list1[0]);
    list2.Add(list1[1]);
    list2.Add(list1[2]);
    list2.Add(list1[3]);
    ShowTestClassList("List 2", list2);
    list2[0].DoubleProperty = -1;   // Changes list2[0] AND list1[0].
    ShowTestClassList("List 2 again", list2);
    ShowTestClassList("List 1 again", list1);
}
Listing A.9: An example of a deep copy of a list of objects.
private void runExample4Button_Click(object sender, EventArgs e)
{
    displayTextBox.Text = "";
    GenerateList1();
    ShowTestClassList("List 1", list1);
    // Deep copy
    list3 = new List<TestClass>();
    list3.Add(list1[0].Copy());
    list3.Add(list1[1].Copy());
    list3.Add(list1[2].Copy());
    list3.Add(list1[3].Copy());
    ShowTestClassList("List 3", list3);
    list3[0].DoubleProperty = -5;   // Changes ONLY list3[0].
    ShowTestClassList("List 3 again", list3);
    ShowTestClassList("List 1 again", list1);
}
example (Listing A.7) one can certainly sort such a list as well, but one must
first tell C# how it is to be sorted. Two sortings are carried out here: First, the list
is sorted based on the values of the DoubleProperty. Next, the list is sorted
first based on the values of the IntegerProperty, and then all elements that
have the same value of the IntegerProperty are sorted on the basis of their
DoubleProperty value. The reader should now click on the button marked
Run example 2 to view the results.
In the third and fourth examples, the difference between a shallow copy
and a deep copy is illustrated. In the third example, a shallow copy is made:
A new list is instantiated (i.e. it does not just point to the original list), but
its elements are not explicitly copied; instead, they simply point to the
elements of the original list. This means that if one changes a property of one of
those elements (see Listing A.8), the corresponding property
(of the element with the same index) in the original list changes as well. In
the fourth example (see Listing A.9), by contrast, the elements of the new list
are explicitly copied before being added. In this case, changing a property of
an element in the new list does not change the corresponding property (of the
element with the same index) in the original list.
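As an aside, the element-by-element copying in Listing A.9 can be written more compactly using LINQ (this requires a using System.Linq directive); a minimal sketch, assuming the same TestClass with its Copy method:

// Deep copy in a single statement: Copy() is called for each element, so the
// new list holds new instances rather than references to the originals.
List<TestClass> list3 = list1.Select(element => element.Copy()).ToList();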
A.4 Threading
The concept of (multi-)threading is crucial to all but the simplest applications.
A program may start and run any number of threads, i.e. sequences of computational instructions that may share memory resources, but otherwise operate
as independent units executing in parallel. On processors with multiple cores
(i.e. all modern processors) different threads can run truly in parallel, on different cores. However, often the number of threads greatly exceeds the number
Listing A.10: A method that (unwisely) runs a lengthy computation in the GUI thread. Note
that the progress information will, in fact, only be shown at the very end of the computation.
private void runInSingleThreadButton_Click(object sender, EventArgs e)
{
    runInSingleThreadButton.Enabled = false;
    runMultiThreadedButton.Enabled = false;
    progressListBox.Items.Clear();
    progressListBox.Items.Add("Starting");
    for (int k = 1; k <= UPPER_LIMIT; k++)
    {
        double sum = 0;
        for (int j = 1; j <= k; j++) { sum += j * j; }
        if (k % PRINT_INTERVAL == 0) { ShowProgress("k = " + k.ToString()); }
    }
    progressListBox.Items.Add("Done");
    runInSingleThreadButton.Enabled = true;
    runMultiThreadedButton.Enabled = true;
}

private void ShowProgress(string progressInformation)
{
    progressListBox.Items.Add(progressInformation);
}
of cores. Thus, the operating system is responsible for assigning time slices
to each thread and rapidly switching between the threads, giving the illusion
(from the user’s point of view) of parallel computation for all threads, whether
or not they run on different processor cores.
Writing a program that makes proper use of multithreading is a non-trivial
task, especially in cases where communication between threads is required.
Here, only a simple example will be given. There are plenty of additional
examples in the various IPA libraries; see also the next section.
Now, consider the ThreadingExample application. The application's form
contains two buttons, one for single-threaded execution and one for execution
using multithreading. In this case, the computation consists of computing the
sum of the squares of all integers from 1 to k, for k = 1, 2, . . . , 100,000. As is evident when running the single-threaded version, the GUI of the application becomes
frozen and unresponsive during the calculation. Moreover, the progress information is only printed to the screen after the computation has been completed.
This is not particularly elegant: It should be possible for a user to access the
GUI, and to get progress information, even while the computation is running.
Perhaps there are other tasks that the user may wish to launch? Alternatively,
the user may wish to abort the computation before it is completed. The problem, in this case, is that the computation is started on the same thread as the
GUI. Since the computer will try to run the computation as fast as possible, it
will be difficult for it also to respond to user commands (for example, attempts
to move the window using the mouse). The code is shown in Listing A.10.
Listing A.11: In this case, the computationally expensive loop is executed in a separate
thread, so that the progress information can be displayed continuously on the screen. The
ThreadSafeHandleDone method is available in the source code, but is not shown here.
private void runMultiThreadedButton_Click(object sender, EventArgs e)
{
    runInSingleThreadButton.Enabled = false;
    runMultiThreadedButton.Enabled = false;
    progressListBox.Items.Clear();
    progressListBox.Items.Add("Starting");
    computationThread = new Thread(new ThreadStart(() => ComputationLoop()));
    computationThread.Start();
}

private void ComputationLoop()
{
    for (int k = 1; k <= UPPER_LIMIT; k++)
    {
        double sum = 0;
        for (int j = 1; j <= k; j++) { sum += j * j; }
        if (k % PRINT_INTERVAL == 0)
        { ThreadSafeShowProgress("k = " + k.ToString()); }
    }
    ThreadSafeHandleDone();
}

private void ThreadSafeShowProgress(string progressInformation)
{
    if (InvokeRequired) { BeginInvoke(new MethodInvoker(() =>
        ShowProgress(progressInformation))); }
    else { ShowProgress(progressInformation); }
}
This is where multithreading comes in: If the user instead clicks the other
button (for multithreaded execution), a separate thread will be started for carrying out the computation, leaving the GUI (which, again, runs on its own
thread) free to do other things. In this case, two methods are used: One for
starting the thread in which the computation is to be carried out, and one
(ComputationLoop) for running the actual computation. Now the GUI responds nicely to any user actions, and the progress information is printed to
the screen during the computation. The code is shown in Listing A.11.
However, there is a price to be paid: Since the computation now runs in
a separate thread, and any output to the screen (or other GUI actions) requires access to the GUI thread, one must handle the corresponding operations
with care: Accessing the GUI directly from another thread is not thread-safe. In
.NET, thread-safe access to the GUI thread, from another thread, is achieved
by means of the BeginInvoke method, which is defined for any object derived
from the Control class (for example, the Form class). Printing the progress
during the computation and updating the Enabled property of the buttons
(at the end of the computation) both require access to the GUI thread; hence, the
BeginInvoke method (for the form) is used, as shown in the code listing.
Listing A.12: The two methods used for handling concurrent access to a generic list. The
accessLockObject is defined in the class, but the definition is not shown here.
public void AddElement()
{
    Monitor.Enter(accessLockObject);
    integerList.Add(1);
    integerList.RemoveAt(0);
    Monitor.Exit(accessLockObject);
}

public int GetCheckSum()
{
    int checkSum = 0;
    Monitor.Enter(accessLockObject);
    checkSum = integerList.Count;
    Monitor.Exit(accessLockObject);
    return checkSum;
}
Note that only the method that handles the progress update is shown in the
listing. The method ThreadSafeHandleDone, which is called when the computation is complete, is available in the source code, though.
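Since that method is not shown, a sketch of what it might look like is given below; this is an assumption based on the behavior described above (re-enabling the buttons and reporting that the computation is done), not the actual code of the example application.

// Sketch only (not the actual implementation): marshal the final GUI updates
// onto the GUI thread, in the same way as ThreadSafeShowProgress does.
private void ThreadSafeHandleDone()
{
    if (InvokeRequired) { BeginInvoke(new MethodInvoker(() => HandleDone())); }
    else { HandleDone(); }
}

private void HandleDone()
{
    progressListBox.Items.Add("Done");
    runInSingleThreadButton.Enabled = true;
    runMultiThreadedButton.Enabled = true;
}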
This example shows the basics of multithreading. However, multithreading is a large and, at times, complex subject. There are plenty of examples of
the use of multithreading in the various IPA libraries, which should be studied
carefully by the reader.
A.5 Concurrent reading and writing
In a program that uses multiple threads or, as in the case of an IPA, communicates asynchronously with several other programs, it is not uncommon that
one must both write to, and read from, a given object, for example a generic
list. One must then be careful, as shown in the ConcurrentAccessExample.
Here, a simple object is generated, which contains a list of 10 integers (all equal
to 1). Next, two threads are started: The first thread (additionThread) adds
another 1 to the list, and then removes the first element of the list so that, again,
it consists of 10 (equal) elements. The second thread (checkThread) simply
measures the length of the list. Now, since the two threads run independently
of each other, it can happen that the length computation (in the checkThread)
occurs after the addition of a 1 (in the additionThread), but before the removal of the first element of the list. If this happens, the checkThread will
find a list of length 11, rather than 10.
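A minimal sketch of how the two threads might be set up is given below. The AddElement and GetCheckSum methods are those of Listing A.12 (shown below), whereas the names listHolder and running are illustrative assumptions rather than the actual code of the example application.

// Sketch only: two threads operating on the same object. Without locking,
// checkThread may read the list between the Add and RemoveAt calls in AddElement.
additionThread = new Thread(new ThreadStart(() =>
{
    while (running) { listHolder.AddElement(); }
}));
checkThread = new Thread(new ThreadStart(() =>
{
    while (running)
    {
        if (listHolder.GetCheckSum() != 10) { /* erroneous length detected */ }
    }
}));
additionThread.Start();
checkThread.Start();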
One can avoid these problems by locking the list both during the addition and removal operations and during the checking operation. A procedure
(there are several ways) for doing so, using the Monitor class, is shown in
Listing A.12. Here, a lock object is defined, and whenever a code snippet encounters a Monitor.Enter method, the program will temporarily halt execution if another code snippet has acquired the lock. Execution will be halted
until the lock is released, using the Monitor.Exit method. Thus, in this particular case, even if the GetCheckSum method gets called between the two list
operations in AddElement, the actual checking will not take place until the
AddElement method releases the lock, thus avoiding the problem described
above.
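Incidentally, C# also provides the lock statement, which wraps Monitor.Enter and Monitor.Exit in a try-finally block. As a sketch, the AddElement method in Listing A.12 could thus equivalently be written as follows (same behavior, except that the lock is released even if an exception is thrown):

public void AddElement()
{
    // lock (x) { ... } is shorthand for Monitor.Enter(x) and Monitor.Exit(x)
    // wrapped in try-finally.
    lock (accessLockObject)
    {
        integerList.Add(1);
        integerList.RemoveAt(0);
    }
}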
The reader should run the ConcurrentAccessExample in order to investigate the two cases, first clicking the left button (running without locking)
a few times, and then clicking the right button. As can be seen, in the first case
(without locking) the erroneous length is invariably found, albeit at different
iterations in different runs, again illustrating the fact that the two threads run
independently of each other. In the second case (with locking) the error never
occurs.
Note that, in newer versions of .NET, there are libraries for handling concurrent access to objects (such as lists). Still, it is good to know how to handle
concurrent access explicitly, as just illustrated.
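As a point of reference, a minimal sketch using one such class, namely ConcurrentQueue from the System.Collections.Concurrent namespace (standard .NET, not part of the IPA libraries), is given here:

// Sketch only: a thread-safe queue that several threads can access without
// any explicit locking (requires using System.Collections.Concurrent).
private void ConcurrentQueueExample()
{
    ConcurrentQueue<int> queue = new ConcurrentQueue<int>();
    queue.Enqueue(1);
    queue.Enqueue(2);
    int firstElement;
    if (queue.TryDequeue(out firstElement))
    {
        Console.WriteLine("Dequeued: " + firstElement.ToString());
    }
}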
A.6 Event handlers
The concept of event handlers is used frequently in the IPA libraries. Consider, for example, the arrival of new information in the working memory of an
IPA. While it would be possible, in theory, to check continuously (with a loop)
whether or not a new memory item has been added to (or removed from) the
working memory, it would not be very elegant to do so. Moreover, it would be
a computationally expensive procedure. A better approach would be to let the
working memory itself trigger an event whenever a new memory item arrives,
and to let other parts of the agent program (that might need to use the items
in the working memory) respond accordingly by subscribing to the event by
means of an event handler.
As another example, note that events and event handlers are used frequently in connection with GUI operations: Any user action on a GUI (such
as a button click or a mouse movement) triggers one or several events, which
can then be handled by the appropriate event handler. For example, if a user
calls the Invalidate method on a control, the result will be that the control’s
Paint event is triggered, so that the user can repaint whatever is shown in the
control, via an event handler (called, for example, HandlePaint).
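A minimal sketch of that pattern is given below (standard WinForms; the drawing code is purely illustrative):

// In the form's constructor: subscribe to the Paint event, for example with
// this.Paint += new PaintEventHandler(HandlePaint);

// The event handler repaints the control whenever the Paint event fires,
// for example after a call to Invalidate().
private void HandlePaint(object sender, PaintEventArgs e)
{
    e.Graphics.DrawEllipse(Pens.Black, 10, 10, 50, 50);
}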
A simple example of event handling is given in the EventHandlerExample
application. In this example, a separate thread is started inside an object of
type EventTestClass, which computes the sums of all integers from 1 to k,
for k = 1, 2, . . . , 100,000. Two events are defined, namely Started, which is triggered
Listing A.13: The EventTestClass with its two events, Started and Progress.
The Progress event makes use of a custom EventArgs class, shown in Listing A.15.
public class EventTestClass
{
    private const int UPPER_LIMIT = 100000;
    private const int PROGRESS_REPORT_INTERVAL = 2500;
    private Thread runThread;

    public event EventHandler Started = null;
    public event EventHandler<ProgressEventArgs> Progress = null;

    private void RunLoop()
    {
        OnStarted();
        for (int ii = 1; ii <= UPPER_LIMIT; ii++)
        {
            double sum = 0;
            for (int jj = 1; jj <= ii; jj++) { sum += jj; }
            if (ii % PROGRESS_REPORT_INTERVAL == 0) { OnProgress(ii); }
        }
    }

    public void Run()
    {
        runThread = new Thread(new ThreadStart(() => RunLoop()));
        runThread.Start();
    }

    private void OnStarted()
    {
        if (Started != null)
        {
            EventHandler handler = Started;
            handler(this, EventArgs.Empty);
        }
    }

    private void OnProgress(int sumsCompleted)
    {
        if (Progress != null)
        {
            EventHandler<ProgressEventArgs> handler = Progress;
            ProgressEventArgs e = new ProgressEventArgs(sumsCompleted);
            handler(this, e);
        }
    }
}
Listing A.14: Three relevant methods defined in the code for the form of the EventHandlerExample application.
private void runButton_Click(object sender, EventArgs e)
{
    EventTestClass eventTestObject = new EventTestClass();
    eventTestObject.Started += new EventHandler(HandleStarted);
    eventTestObject.Progress += new EventHandler<ProgressEventArgs>(HandleProgress);
    eventTestObject.Run();
}

private void HandleStarted(object sender, EventArgs e)
{
    string startInformationString = "Started";
    if (InvokeRequired) { BeginInvoke(new MethodInvoker(() =>
        progressListBox.Items.Add(startInformationString))); }
    else { progressListBox.Items.Add(startInformationString); }
}

private void HandleProgress(object sender, ProgressEventArgs e)
{
    string progressInformationString = "Sums completed: " +
        e.SumsCompleted.ToString();
    if (InvokeRequired) { BeginInvoke(new MethodInvoker(() =>
        progressListBox.Items.Add(progressInformationString))); }
    else { progressListBox.Items.Add(progressInformationString); }
}
when the operation starts, and Progress, which is triggered at regular
intervals (in this example, for every 2,500 values of k). The definition of the
EventTestClass is shown in Listing A.13. For any event, the nomenclature
is such that the event is triggered using a method with the same name as the
event, but prefixed by the word On. Thus, for example, the Started event is
triggered in the beginning of the RunLoop, by calling the OnStarted method.
This method takes no input since all that is required is for the program to report that a particular operation was started. In the OnStarted method, the
program first checks whether or not there are any subscribers to this event (see
below). If that is the case, the event is fired.
In order to understand the concept of event subscription, consider Listing A.14. This listing shows the three user-defined methods defined in the
code for the application’s form. When the user clicks the Run button on the
form, an object of type EventTestClass is generated. The next two lines set
up the event handlers that subscribe to the Started and Progress events.
Note that, here, the nomenclature is such that, for a given event, the event handler carries the same name as the event, but with the prefix Handle. Note also
that the event handler is appended to the invocation list of the event, which
keeps track of the number (and identity) of all subscribers. Thus, it would be
possible to define additional methods that would also subscribe to the same
event. In this case, the method HandleStarted simply prints a string (by
Listing A.15: The ProgressEventArgs class, which is derived from the EventArgs
class.
public class ProgressEventArgs : EventArgs
{
    private int sumsCompleted;

    public ProgressEventArgs(int sumsCompleted)
    {
        this.sumsCompleted = sumsCompleted;
    }

    public int SumsCompleted
    {
        get { return sumsCompleted; }
    }
}
adding it as an item in a list box on the form) that tells the user that the computation has been started. Note that, since the computation runs in a separate
thread, the addition of the string to the list box must be done in a thread-safe
manner; see also Sect. A.4 above.
Next, consider the slightly more complex Progress event. In many cases,
it is not sufficient just to learn that some event took place; one may also need
some additional information about the event, something that can be achieved
by defining a custom EventArgs class. In this particular case, the required information is the value of k (stored in the variable sumsCompleted). When the
event is triggered, this variable is sent as input to the OnProgress method.
Next, an object of type ProgressEventArgs is instantiated (see also Listing A.15) and is assigned the value of k. The subscriber (HandleProgress)
can then extract and display the corresponding value.
Additional events could certainly be added. A suitable exercise for the
reader would be to add an event Completed which would be triggered at the
end of the RunLoop.
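One possible sketch of such an addition is given below (this is not the author's solution, and the reader is encouraged to attempt the exercise first); it follows the same pattern as the Started event in Listing A.13:

// In EventTestClass: a Completed event, following the pattern of Started.
public event EventHandler Completed = null;

private void OnCompleted()
{
    if (Completed != null)
    {
        EventHandler handler = Completed;
        handler(this, EventArgs.Empty);
    }
}

// Call OnCompleted() as the last statement of RunLoop(), and subscribe in the
// form with: eventTestObject.Completed += new EventHandler(HandleCompleted);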
A.7 Serialization and de-serialization
Most programs require some form of input data before they can run. As an
example, a program for visualizing and animating a three-dimensional rendering of a face requires information about the detailed appearance of the face
(i.e. the vertices of all the triangles constituting the face etc.) and, similarly, a
program for speech recognition needs detailed information about the parameters of the speech recognizer etc.
While it is certainly possible to write methods for loading and saving the
properties of any object, it is often a tedious and complex procedure, especially
Listing A.16: The SerializationTestClass.
[DataContract]
public class SerializationTestClass
{
    private int intParameter;
    private double doubleParameter;
    private double doubleParameter2;
    private List<int> integerList;

    [DataMember]
    public int IntParameter
    {
        get { return intParameter; }
        set { intParameter = value; }
    }

    [DataMember]
    public double DoubleParameter
    {
        get { return doubleParameter; }
        set { doubleParameter = value; }
    }

    public double DoubleParameter2
    {
        get { return doubleParameter2; }
        set { doubleParameter2 = value; }
    }

    [DataMember]
    public List<int> IntegerList
    {
        get { return integerList; }
        set { integerList = value; }
    }
}
Listing A.17: The two methods used for de-serialization and serialization in the SerializationTestExample application.
private void loadObjectToolStripMenuItem_Click(object sender, EventArgs e)
{
    using (OpenFileDialog openFileDialog = new OpenFileDialog())
    {
        openFileDialog.Filter = ".XML files (*.xml)|*.xml";
        if (openFileDialog.ShowDialog() == DialogResult.OK)
        {
            serializationTestObject = (SerializationTestClass)ObjectXmlSerializer.
                ObtainSerializedObject(openFileDialog.FileName,
                typeof(SerializationTestClass));
            ShowTestObject();
        }
    }
}

private void saveObjectToolStripMenuItem_Click(object sender, EventArgs e)
{
    using (SaveFileDialog saveFileDialog = new SaveFileDialog())
    {
        saveFileDialog.Filter = ".XML files (*.xml)|*.xml";
        if (saveFileDialog.ShowDialog() == DialogResult.OK)
        {
            ObjectXmlSerializer.SerializeObject(saveFileDialog.FileName,
                serializationTestObject);
        }
    }
}
for classes that contain, for example, lists of objects that, in turn, may contain
additional objects etc. Fortunately, there are methods for saving (serializing)
and loading (de-serializing) the properties of any object, provided that certain
attributes are defined. The general code for serialization is contained in the
System.Runtime.Serialization namespace, which must thus be referenced if one wants to make use of serialization. One must also add a reference to a custom-made ObjectSerializerLibrary, which contains specific code for serializing and de-serializing an object in XML format.
Consider now the SerializationExample. Here, a simple class is defined (SerializationTestClass) that contains a few fields as shown in
Listing A.16. Note that the class itself is marked with the DataContract attribute, which tells the program that this class can be serialized.⁸ Three of the
four properties (which must have both the get and the set parts defined, for
serialization and de-serialization) are marked with the DataMember attribute,
meaning that they will be considered in serialization and de-serialization. The
fourth property (DoubleParameter2) is not thus marked, and will therefore
⁸ Note that serialization and de-serialization can be implemented in various different ways.
Here, however, only the methods implemented in the ObjectSerializerLibrary will be
used.
not be considered. It is not uncommon that some properties are omitted during
serialization, for example properties whose values are obtained dynamically
when the corresponding program is running.
The code for actual serialization and de-serialization is contained in the
ObjectSerializerLibrary. Listing A.17 shows the two event handlers for
the Load and Save menu items, respectively. Note that, during de-serialization,
one must specify the type of the object being de-serialized, and an explicit cast
(to (SerializationTestClass)) must then also be applied. The method
ShowTestObject (listing not shown here) simply prints the values of the
various parameters.
When the program is started, a SerializationTestClass object is instantiated and its properties are assigned some arbitrary values, which are
then shown in a list box. The user can then save (serialize) the object by
selecting the Save menu item. If one then loads (de-serializes) the object
by selecting the Load menu item, the parameter values of the loaded object are again shown in the list box. In this example, note that the value of
DoubleParameter2 changes (upon loading) from 2 to 0. This is because
DoubleParameter2 was not serialized, and is therefore assigned the default
value 0.
For serialization and de-serialization as described above, C# requires information regarding the serializable types. This information is gathered in
the ObtainSerializableTypes method in the ObjectXMLSerializer.
However, this method only extracts the types available in the current assembly.⁹ For example, when serializing an agent, i.e. an instance of the Agent
class in the AgentLibrary, the serializable types obtained by a call to the
ObtainSerializableTypes method will be all the types in the AgentLibrary.
If, on the other hand, one wants to add types (classes) outside the AgentLibrary, derived from classes in that library (for example, a new class derived from the
base class DialogueAction), C# will not automatically know how to handle those classes in serialization and de-serialization. Thus, in such cases,
one must explicitly specify that the added types are serializable too. Methods for serialization and de-serialization in such cases are also available in the
ObjectXMLSerializer.
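As a point of reference, the standard .NET serializer for classes marked with the DataContract attribute, the DataContractSerializer, accepts an explicit list of known types. A minimal sketch is given below, independently of the ObjectXMLSerializer (whose internals are not shown here); MyDialogueAction is an illustrative name for a user-defined class derived from DialogueAction.

// Sketch only: passing additional serializable types to a DataContractSerializer
// (requires using System, System.Collections.Generic and System.Runtime.Serialization).
private DataContractSerializer CreateAgentSerializer()
{
    List<Type> knownTypes = new List<Type>();
    knownTypes.Add(typeof(MyDialogueAction));   // a class defined outside the AgentLibrary
    return new DataContractSerializer(typeof(Agent), knownTypes);
}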
⁹ Simplifying somewhat, one can say that the classes in a class library (or, rather, the corresponding DLL) together constitute an assembly.
Index
2s complement signed integer, 83
4-connectivity, 44
8-connectivity, 44
Adaboost, 55
agent program, 6
alpha channel, 27
application (C#), 124
artificial neural network (ANN), 101
assembly (C#), 144
atomic operation, 39
attribute (C#), 143
autocorrelation, 103
normalized, 103
autocorrelation coefficient, 103
autocorrelation order, 103
background subtraction, 50
exponential Gaussian averaging, 50
frame differencing, 50
Gaussian mixture model, 51
ViBe, 51
bandwidth, 87
binarization, 28
binarization threshold, 36
block align, 81
blurring, 37
box, 37
Gaussian, 38
brain process, 6
breakpoint (C#), 126
callback
asynchronous, 10
camera coordinates, 60
Canny edge detector, 41
cepstral coefficient, 103, 105
cepstral order, 105
chrominance, 28
chunk ID, 79
class (C#), 127
abstract, 127
derived, 127
class library (C#), 124
client-server model, 7
CMYK, 28
color
ambient, 62
diffuse, 62
specular, 62
color histogram, 29
color space, 27
color spectrum, 30
composite Bézier curve, 74
compression code, 81
concatenative synthesis, 77
connected components, 43
constructor (C#), 129
continuous speech recognition (CSR), 101
control (C#), 125
convolution, 36
convolution mask, 36
coordinated universal time (UTC), 121
cubic Bézier splines, 74
damped sinusoid, 87
damped sinusoid filter, 88
data parsing, 120
DC component, 102
de-serializing (C#), 143
depth camera, 52
dialogue item, 18
difference equation, 86, 88
digital filter, 85
high-pass, 86
low-pass, 86
diphone, 93
Direct3D, 59
dynamic time warping (DTW), 101
dynamic-link library (DLL), 129
eigenface method, 55
event (C#), 126, 138
subscription, 138
event handler (C#), 138
event-based system, 16
exponential moving average, 86
face recognition, 55
face template, 54
feature vector (speech), 102
field (C#), 127
finite-state machine (FSM), 18
form (C#), 125
formant synthesis, 77, 87
frame splitting, 103
fundamental frequency, 89
Gaussian mixture model (GMM), 101
gesture recognition, 52
Hamming windowing, 103
hidden Markov model (HMM), 101
histogram
cumulative, 40
normalized, 40
histogram equalization, 41
histogram stretching, 40
HSV, 28
HTML tag, 121
image, color, 27
image, grayscale, 27
integral image, 42
integrated development environment (IDE), 3
interactive evolutionary algorithm, 98
interactive partner agent (IPA), 1
internet data acquisition program, 7
invocation list (C#), 140
IPA libraries, 2
isolated word recognition (IWR), 101
lag (autocorrelation), 103
Levinson-Durbin recursion, 105
lighting model, 62
linear predictive coding, 104
listener program, 6
locked bitmap, 31
LPC coefficient, 103
LPC order, 104
luma, 28
mel-frequency cepstral coefficients, 105
memory
long-term, 6
working, 6
memory item tag, 18
method (C#), 125
abstract, 129
external, 109
model coordinates, 60
model matrix, 60
modelview matrix, 60
Mono, 3
mono sound, 78
morphological image processing, 45
closing, 47
dilation, 46
erosion, 46
hit-and-miss, 47
opening, 47
thinning, 48
motion detection
background, 50
foreground, 50
multithreading, 134
namespace (C#), 127
Niblack’s method, 48
number of zero crossings, 105
relative, 106
object (C#), 127
object-oriented programming, 127
OpenGL, 59
OpenTK, 59
overlap-and-add (TD-PSOLA), 95
padding, 37
path (dialogue), 23
peer-to-peer model, 7
perspective projection, 60
phone (speech), 93
pitch mark, 95
pitch period, 95
pixel, 27
background, 42
foreground, 42
post-multiplication, 68
pre-emphasis, 102
project (C#), 124
projection matrix, 60
property (C#), 129
Really simple syndication (RSS), 117
recording buffer, 110
reference (C#), 127
RGB, 27
RIFF chunk, 79
sample (sound), 78
sample rate, 78
sample width, 78
sampling frequency, 78
Sauvola’s method, 48
sensitivity, 54
serializing (C#), 143
shading, 62
flat, 63
smooth, 63
shading model, 63
sharpening, 38
sharpening factor, 38
shininess, 62
simple type (C#), 130
Sobel operator, 42
socket, 9
solution (C#), 124
solution explorer (C#), 124
speech feature, 101
speech program, 7
startup project (C#), 125
stationary time series, 104
stereo sound, 78
strong classifier, 54
structuring element, 45
origin, 45
subchunk (WAV)
data, 79
fact, 79
fmt, 79
subjective optimization, 98
summed area table, 42
TCP/IP protocol, 7
thread
safe access, 136
thresholding, 48
adaptive, 48
Toeplitz matrix, 105
triphone, 101
uncanny valley phenomenon, 69
using clause (C#), 127
view matrix, 60
Viola-Jones algorithm, 54
vision program, 6
Visual Studio, 3
visualizer program, 7
Waveform audio format (WAV), 78
weak classifier, 54
world coordinates, 60
Xamarin, 3
XML format, 143