INTERACTIVE PARTNER AGENTS
A PRACTICAL INTRODUCTION

MATTIAS WAHDE
Department of Applied Mechanics
CHALMERS UNIVERSITY OF TECHNOLOGY
Göteborg, Sweden 2017

Interactive partner agents
A practical introduction
MATTIAS WAHDE

© MATTIAS WAHDE, 2017. All rights reserved. No part of these lecture notes may be reproduced or transmitted in any form or by any means, electronic or mechanical, without permission in writing from the author.

Department of Applied Mechanics
Chalmers University of Technology
SE–412 96 Göteborg
Sweden
Telephone: +46 (0)31–772 1000

Contents

1 Introduction

2 Agent structure
2.1 Agent components
2.2 Distributed programming
2.3 The Communication library
2.3.1 The Server class
2.3.2 The Client class
2.3.3 The DataPacket class
2.3.4 A simple example

3 Decision-making, memory, and dialogue
3.1 A simple example
3.2 The AgentLibrary
3.2.1 The Agent class
3.2.2 The Memory class
3.2.3 The DialogueProcess class
3.3 Demonstration application
3.3.1 TestAgent1
3.3.2 TestAgent2 and TestAgent3
3.3.3 TestAgent4

4 Computer vision
4.1 Digital images
4.1.1 Color spaces
4.1.2 Color histograms
4.2 The ImageProcessing library
4.2.1 The ImageProcessor class
4.2.2 The Camera class
4.3 Basic image processing
4.3.1 Contrast and brightness
4.3.2 Grayscale conversion
4.3.3 Binarization
4.3.4 Image convolution
4.3.5 Obtaining histograms
4.3.6 Histogram manipulation
4.3.7 Edge detection
4.3.8 Integral image
4.3.9 Connected component extraction
4.3.10 Morphological image processing
4.4 Advanced image processing
4.4.1 Adaptive thresholding
4.4.2 Motion detection
4.4.3 Face detection and recognition
4.5 Demonstration applications
4.5.1 The ImageProcessing application
4.5.2 The VideoProcessing application

5 Visualization and animation
5.1 Three-dimensional rendering
5.1.1 Triangles and normal vectors
5.1.2 Rendering objects
5.2 The ThreeDimensionalVisualization library
5.2.1 The Viewer3D class
5.2.2 The Object3D class
5.3 Faces
5.3.1 Visualization
5.3.2 Animation
5.4 Demonstration applications
5.4.1 The Sphere3D application
5.4.2 The FaceEditor application

6 Speech synthesis
6.1 Computer-generated sound
6.1.1 The WAV sound format
6.1.2 The AudioLibrary
6.2 Basic sound processing
6.2.1 Low-pass filtering
6.2.2 High-pass filtering
6.3 Formant synthesis
6.3.1 Generating voiced sounds
6.3.2 Generating unvoiced sounds
6.3.3 Amplitude and voicedness
6.3.4 Generating sound transitions
6.3.5 Sound properties
6.3.6 Emphasis and emotion
6.4 The SpeechSynthesis library
6.5 The VoiceGenerator application

7 Speech recognition
7.1 Isolated word recognition
7.1.1 Preprocessing
7.1.2 Feature extraction
7.1.3 Time scaling and feature sampling
7.1.4 Training a speech recognizer
7.1.5 Word recognition
7.2 Recording sounds
7.3 The SpeechRecognitionLibrary
7.4 Demonstration applications
7.4.1 The IWR application
7.4.2 The Listener application

8 Internet data acquisition
8.1 The InternetDataAcquisition library
8.1.1 Downloading data
8.2 Parsing data
8.2.1 The HTMLParser class
8.2.2 RSS feeds
8.3 The RSSReader application

A Programming in C#
A.1 Using the C# IDE
A.2 Classes
A.3 Generic lists
A.4 Threading
A.5 Concurrent reading and writing
A.6 Event handlers
A.7 Serialization and de-serialization

Bibliography

Chapter 1
Introduction

Intelligent (software) agents are computer programs able to process incoming information, make a suitable decision based on all the available information, and take the appropriate action, often, but not always, in interaction with a user or even another agent. Such programs are becoming more and more common. Examples include automatic systems for reservations (for example travel reservations), the personal assistants available on mobile phones (and some operating systems), systems for decision-support, for instance in medicine or finance, driver support systems in vehicles, etc. An important special case, which is at the heart of this course, is interactive partner agents (IPAs), which are specifically designed to interact with human users in a friendly and human-like manner. In addition to all the applications listed above, a typical application example (though by no means the only one) specifically for IPAs is in health and elderly care, where such an agent might assist in gathering information and replying to questions on some particular topic. IPAs must thus not only be capable of outputting factual information, but must also be able to do so in a way that is at least reminiscent of human interaction. Thus, their mode of interaction normally includes speech, gestures, facial expressions etc. An IPA generally runs on a computer with a web camera, a microphone, and a loudspeaker, and also features a three-dimensional animated face displayed on the screen. A schematic view of an IPA is shown in Fig. 1.1.

Figure 1.1: A schematic illustration of a typical IPA setup, with several modalities for interaction: a camera for vision, a microphone for speech input, loudspeakers for speech output, and an animated three-dimensional cartoon-like face for displaying emotions.

Building an IPA thus requires knowledge of many different domains, including human-computer interaction (for example dialogue), speech recognition, speech synthesis, image processing (for example for gesture detection), three-dimensional visualization and animation, and information gathering (from the internet etc.). In addition, one must also be able to put together the various parts, and make them work in cooperation with each other in order to generate a complete IPA.

The aim of this compendium is to give students of IPAs a general and practical introduction to the various topics listed above. One could approach this task in different ways, one possibility being to obtain a set of third-party, black-box solutions and simply put them together. Here, however, the aim is to give the reader a thorough understanding of the basics of each relevant component in an IPA. Thus the reader will have access to the full source code of a set of software libraries (henceforth referred to as the IPA libraries) as well as a set of demonstration applications, and the text will go through each agent component in detail.
The IPA libraries are (1) the AgentLibrary described in Chapter 3; (2) the AudioLibrary used in Chapters 6 and 7; (3) the CommunicationLibrary described in Chapter 2; (4) the ImageProcessingLibrary discussed in Chapter 4; (5) the InternetDataAcquisitionLibrary considered in Chapter 8; (6) the MathematicsLibrary, which is an auxiliary library used by several other libraries; (7) the ObjectSerializerLibrary, which is used for serializing (saving) and de-serializing (loading) objects; see also Appendix A, Sect. A.7; (8) the PlotLibrary, which is used, for example, in one of the applications regarding speech recognition; (9) the SpeechRecognitionLibrary, which is described in Chapter 7; (10) the SpeechSynthesisLibrary, which is used in Chapter 6; and (11) the ThreeDimensionalVisualizationLibrary described in Chapter 5.

As for the programming language, C# .NET, included in the Visual Studio integrated development environment (IDE) by Microsoft, has been chosen here, and the code is thus intended primarily for computers running Windows. Of course, many other programming languages could have been selected, but C# .NET offers some compelling advantages (at least in the author's view), one of them being the elegance, robustness, and high speed of execution of code written in C# .NET. Moreover, by using the .NET framework, one also opens up the possibility of writing applications in other .NET languages (e.g. C++ or Visual Basic) while still being able to use the IPA libraries. Also, with the integration of Xamarin in the 2015 version of Visual Studio, it is possible to deploy code written in C# .NET on mobile devices, both under Android and iOS. Using the Mono framework it is also possible to run applications developed in C# .NET under Linux.

The compendium has been developed for a seven-week university course at Chalmers University of Technology. It has been assumed that the reader has an engineering background, covering at least engineering mathematics as well as programming in some high-level language (though not necessarily C# .NET). Prior familiarity with .NET is recommended, but not absolutely required. However, a reader unfamiliar with .NET will need to study this topic alongside the other topics in the compendium. Appendix A provides a brief introduction to C# .NET, but it is not a complete description.

Needless to say, there is a limit on how much one can do in seven weeks. Thus, some tradeoffs have been necessary, especially since each of the topics in Chapters 2 to 8 could easily fill a university course. Hopefully, a suitable balance between depth and breadth has been found, but it should be noted that the aim, again, is to give a deep understanding of the basics of each topic rather than trying to build a state-of-the-art IPA. It is the author's hope and belief that, with the knowledge obtained from reading the compendium, a reader will have a solid foundation for further studies in any of the topics considered here.

Chapter 2
Agent structure

This chapter gives a general overview of the logical structure of the interactive partner agents used here. Already at this point, some familiarity with C# .NET (and its IDE) is assumed. Thus, readers unfamiliar with this programming language should start by reading Appendix A. Fig. 2.1 shows the structure of the interactive partner agents.
Figure 2.1: The logical structure of the interactive partner agents used here. The main program acts as a server (see Sect. 2.2) whereas all other programs are clients. The various components are introduced in Sect. 2.1.

As can be seen in the figure, the agent consists of a main program (an executable file, with the suffix .exe under Windows), along with a set of additional components (which are also executable files), each handling some important aspect of the agent and communicating with the main program. Of course, an IPA could have been written as a single standalone executable. However, the distributed structure shown in the figure has multiple advantages. First of all, if the agent were to be written as a single program, it is not unlikely that there would be some code entanglement between the various parts, making it more difficult to replace or upgrade some part (e.g. speech recognition or vision). With the distributed structure, such problems are avoided. Moreover, it is possible that some components of the agent would require strong computational power. Written as a single program, an agent would have to divide the computational power (of the computer on which it runs) between the various components. By contrast, in the distributed structure, as will be shown below, it is possible to run the agent's constituent programs on different computers, connected to one another over a wireless network (for example), so that the agent, as a whole, can summon the computational power of several computers. In most cases, however, and certainly in the cases considered here, the computational power of a single computer will be sufficient. Another advantage is that, with the distributed structure in Fig. 2.1, it is possible for the main program to monitor the other programs and to restart a particular program (e.g. the one handling speech recognition) if it should stop running for some reason. Alternatively, a simple monitoring program can be set to run in the background on each computer running any component of the agent, making sure that any crashed component restarts automatically. In either case, the probability of crashing the entire agent will be much smaller than if it were written as a single executable. Finally, for the purpose of a university course, the distributed structure is excellent, as it allows different developers (e.g. students) to work completely independently on various parts of the agent, once an agreement has been made regarding the type (and perhaps amount) of information that is to be transmitted between the various components.

2.1 Agent components

As can be seen in Fig. 2.1, the structure consists of six components: (i) A main program, also referred to as the agent program, responsible for coordinating, (selectively) storing, and processing the information obtained from, or sent to, the other components. This program maintains a working memory that stores important recent information, and it is responsible for decision-making and dialogue with the user(s). Its structure also allows the developer to store a set of (artificial) brain processes, of which dialogues constitute an important special case. Optionally, this program can also maintain a long-term memory, the contents of which are loaded into the working memory when it is started; (ii) a vision program that receives, processes, and interprets the continuous flow of visual information from (web) cameras.
The information transferred to the main program is in the form of text strings; for example, if the vision program suddenly detects the face of a known person, the information might consist of the person's name along with some information regarding the person's current facial expression; (iii) a listener program that continuously listens to external input, either in the form of typed text or in the form of sounds recorded by a connected microphone (if any). The listener then processes the information, applying speech recognition in the case of sounds, and generates textual information that is sent to the main program; (iv) an internet data acquisition program that, for example, reads and processes information from news feeds before transferring it to the main program; (v) a speech program that receives text strings from the main program and then produces the corresponding sounds (with textual output as an option as well); and, finally, (vi) a visualizer program that handles visualization and animation of the agent's face, based on textual information (e.g. smile or blink) obtained from the main program.

Note that the long-term memory of the agent is, in fact, distributed as well: The listener program (for example) requires information for speech recognition, which must thus be loaded upon startup. Similarly, the speech program must load the appropriate parameter settings for representing the agent's voice etc. In the following chapters, the components just listed will be described in detail. First, however, the topic of distributed programming will be discussed.

2.2 Distributed programming

The distributed IPA structure follows the client-server model, in which there is a central component (the server) that handles the information flow to and from the other components (the clients). An alternative approach would be to use a peer-to-peer model, in which case there would be no central server. While it is certainly possible to implement an IPA using the peer-to-peer model, here we shall only use the client-server model.

An obvious point to consider is the fact that the server cannot control the flow of information from its various clients. For example, in the case of the IPA, the vision program and the speech recognition program may both provide input to the agent program at any time, completely independently of each other. Thus, the server must be able to reliably handle asynchronous communication.

The client-server structure has been implemented as a C# class library, the CommunicationLibrary. This class library, in turn, makes use of the System.Net.Sockets namespace, one of the standard namespaces available in C# .NET, which contains classes for handling the low-level aspects of communication between computers. Those low-level aspects involve, for example, the rules by which computers connect to each other, the format of the data sent or received, as well as error handling. Those aspects will not be considered in detail here. Suffice it to say that the communication will be handled using the common TCP/IP protocol. Next, a brief description of the communication library will be given.

2.3 The Communication library

The most important classes in this library are the Server and Client classes. These, in turn, make use of other classes. A typical sequence of operations would be as follows: First, the server is instantiated and then it establishes a connection to a given IP address and a given port. In case the server and clients all run on the same computer, the IP address will be taken as the loopback address, namely 127.0.0.1. The port can, in principle, be any integer in the range 0 to 65535. However, if the client and server run on different computers (so that the IP address would be different from 127.0.0.1), one should keep in mind that some ports are used by other programs, so that the port number must be selected with some care. Next, the server begins listening for clients. Once a client has been started, it can attempt to connect to the server, provided that it knows the IP address and the port used by the server. When the connection is established, the server adds the client to its list of available clients. In the communication library, each client is assigned a unique ID. The client and server are then able to exchange data. Note that the server also listens for new clients continuously, making it possible to connect additional clients at any time.

2.3.1 The Server class

The constructor of the Server class instantiates a list of objects containing information about the clients (see below) and also establishes the server socket that (simplifying somewhat) acts as an end point for the communication with the clients, much as an electrical socket acts as an end point for the electric grid. The constructor is shown in the upper half of Listing 2.1.

Listing 2.1: The constructor and the Connect() method of the Server class.

public Server()
{
    clientStateList = new List<ClientState>();
    clientIndex = 0;
    serverSocket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    backLog = DEFAULT_BACKLOG;
    bufferSize = DEFAULT_BUFFER_SIZE;
}

...

public void Connect(string ipAddressString, int serverPort)
{
    Boolean ok = Bind(ipAddressString, serverPort);
    if (ok)
    {
        if (serverSocket.IsBound)
        {
            connected = true;
            OnProgress(CommunicationAction.Connect, name + " connected");
            Listen();
        }
        else { connected = false; }
    }
    else { connected = false; }
}

The server maintains a list of client information, such that each client is defined using an instance of the ClientState class. The client state, in turn, maintains the name and ID of the client, a Boolean variable determining whether or not the connection is valid, buffers for sending and receiving data, as well as the actual client socket. The fields defined in the ClientState class are shown in Listing 2.2.

Listing 2.2: The fields defined in the ClientState class.

public class ClientState
{
    private string clientName;   // The client name communicated by the client.
    private string clientID;     // The (unique) client ID assigned by the server.
    private Boolean connected;
    private byte[] receiveBuffer;
    private byte[] sendBuffer;
    private Socket clientSocket;
    ...
}

Once the server has been instantiated, the Connect() method, shown in the lower half of Listing 2.1, is generally called first. As can be seen in the listing, if a connection is established, the server triggers an event (by calling OnProgress), which can then be handled by the program making use of the server, for example in order to display the information regarding the established connection.
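As a minimal sketch of the startup sequence described in Sect. 2.3, the snippet below instantiates a server, subscribes to the Progress event just mentioned, connects to the loopback address, and starts accepting clients. It uses only members that appear in the listings of this chapter; the port number and the (empty) handler body are arbitrary choices made here for illustration, and the complete demonstration code is given in Listings 2.6 and 2.7.

// Minimal usage sketch (see Listing 2.6 for the complete demonstration code).
Server server = new Server();
server.Name = "Server";
// React to the Progress event, e.g. when the server reports that it has connected.
server.Progress += new EventHandler<CommunicationProgressEventArgs>(HandleServerProgress);
server.Connect("127.0.0.1", 7000);   // Loopback address; the port is an arbitrary choice here.
if (server.Connected)
{
    server.AcceptClients();          // Start listening for incoming connection requests (see below).
}

...

private void HandleServerProgress(object sender, CommunicationProgressEventArgs e)
{
    // The event arguments carry the progress information; how it is displayed
    // (e.g. in a list box, as in the demonstration application) is up to the program.
}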
In fact, the server defines four different events: (i) Progress, which monitors the (amount of) information sent to, or received from, the clients; (ii) Error, which is triggered if, for example, a communication error occurs; (iii) Received, which is triggered whenever data are received (in the form of a DataPacket, see below); and (iv) ClientConnected, which is triggered when a new client connects to the server. Each of these events makes use of custom EventArgs classes (see also Appendix A.6), which are all included in the communication library.

Once the server has been connected and is listening, the next step is to accept clients. Of course, the server cannot control at what time the incoming connection request(s) will come. Thus, it needs to listen continuously for any new connection requests. In the communication library, this is achieved using a programming pattern involving asynchronous callback methods. A brief description of this approach is given in Listing 2.3. The server calls a method BeginAccept that (when an incoming connection request is received) triggers the asynchronous callback method specified in the call to BeginAccept, in this case a method called AcceptClientsCallBack, also shown in the listing. In this method, a call is made to EndAccept and, if the connection is successfully established, the server then receives the connection message (if any) provided by the client. Next, the AcceptClients method is called again, so that the server can continue listening for additional connection requests. A similar programming pattern is used also for sending and receiving messages; see the source code for the Server class for additional information.

Listing 2.3: An illustration of the use of asynchronous callback methods, in this case involving a server's procedure for accepting incoming connection requests from clients.

public void AcceptClients()
{
    serverSocket.BeginAccept(new AsyncCallback(AcceptClientsCallBack), null);
}

...

private void AcceptClientsCallBack(IAsyncResult asyncResult)
{
    if (!connected) { return; }
    Socket clientSocket = null;
    try
    {
        clientSocket = serverSocket.EndAccept(asyncResult);
        ClientState clientState = new ClientState(bufferSize, clientSocket);
        OnProgress(CommunicationAction.Connect, "Client detected");
        Receive(clientState);
        AcceptClients();
    }
    catch (SocketException ex)
    {
        OnError(ex.Message);
        if (clientSocket != null) { clientSocket.Close(); }
    }
}

2.3.2 The Client class

The Client class maintains a client socket, and it is the information regarding this socket that is transmitted to the server when the client connects to it, thus establishing the connection. The client also defines a set of events, similar to the ones used in the server, which are triggered, for example, when a message is sent or received.

Like the Server class, the Client class also makes use of asynchronous callback methods for connecting to a server, and for sending and receiving messages. As an illustration, Listing 2.4 shows the two methods responsible for receiving information from the server. The messages defined in the communication library are stored in instances of the DataPacket class, described below. As can be seen from the listing, the pattern is very similar, in general, to the one shown in Listing 2.3. Provided that the data packet arrives and is not corrupted, the client processes the message, triggering its Received event (by calling the OnReceived method) and the Progress event, and then calls Receive again.

Listing 2.4: The two methods in the Client class responsible for receiving messages from the server.

private void Receive()
{
    clientSocket.BeginReceive(receiveBuffer, 0, receiveBuffer.Length, SocketFlags.None,
        new AsyncCallback(ReceiveCallBack), null);
}

...

private void ReceiveCallBack(IAsyncResult asyncResult)
{
    try
    {
        if (connected)
        {
            int receivedMessageSize = clientSocket.EndReceive(asyncResult);
            byte[] messageAsBytes = new byte[receivedMessageSize];
            Array.Copy(receiveBuffer, messageAsBytes, receivedMessageSize);
            DataPacket dataPacket = new DataPacket();
            Boolean ok = dataPacket.Generate(messageAsBytes);
            if (ok)
            {
                OnReceived(dataPacket, "Server");
                OnProgress(CommunicationAction.Receive, "Received " + receivedMessageSize.ToString() + " bytes from server");
            }
            else
            {
                OnError("Corrupted message received");
            }
            Receive();
        }
    }
    catch (SocketException ex)
    {
        connected = false;
        OnConnectionClosed();
        OnError(ex.Message);
    }
}

2.3.3 The DataPacket class

In the TCP/IP protocol, messages are sent as simple arrays of bytes. For a user, it is of course more convenient to send (or receive) a readable string. Moreover, it is useful to know the time stamp of the message (meaning the date and time at which the packet was generated). Furthermore, since a server might have many clients connected, the identity of the sender should also be provided. As mentioned above, messages are packaged in instances of the DataPacket class, which contains four fields, shown in Listing 2.5.

Listing 2.5: The four fields of the DataPacket class, along with the AsBytes method, which combines the fields and converts the resulting string to a byte array.

public class DataPacket
{
    private DateTime timeStamp;
    private string senderName;
    private string message;
    private int checkSum;

    ...

    public byte[] AsBytes()
    {
        string tmpString = timeStamp.ToString("yyMMddHHmmssfff") + ":" + senderName + ":" + message + ":";
        byte[] dataAsBytes = Encoding.ASCII.GetBytes(tmpString);
        int checkSum = GetCheckSum(dataAsBytes);
        string dataPacketAsString = tmpString + checkSum.ToString();
        byte[] dataPacketAsBytes = Encoding.ASCII.GetBytes(dataPacketAsString);
        return dataPacketAsBytes;
    }

    ...
}

As is shown in the listing, in addition to the fields just mentioned, the DataPacket also contains a checksum, which is obtained by simply summing the ASCII values of each byte that is sent. The contents of a data packet can then be converted to a byte array by calling the AsBytes method. Before conversion, a special character, here chosen as :, is inserted as a separator between the items. Thus, this particular character is not allowed as a part of a message. Of course, one could have chosen another character as the separator, or even allow the user to define the separator character, but the scheme shown in the listing will be sufficient here.
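The GetCheckSum method called in AsBytes is not reproduced here. As a minimal sketch of the idea described above (summing the values of the bytes to be sent), it might look as follows; this is an assumption about its behavior, and the actual implementation in the library may differ in detail.

// Sketch of the checksum idea described above: sum the values of the bytes
// obtained from the ASCII-encoded string. See the source code of the
// DataPacket class for the actual implementation.
private int GetCheckSum(byte[] dataAsBytes)
{
    int sum = 0;
    foreach (byte b in dataAsBytes)
    {
        sum += b;   // Each byte contributes its (ASCII) value to the sum.
    }
    return sum;
}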
In the uncommon case that one must send a colon character one can, for example, send it as the word colon, perhaps surrounded by brackets to indicate that the word between the brackets is to be interpreted in some fashion, rather than taken literally. Of course, the DataPacket class also contains a method (called Generate) for the reverse operation, i.e. for obtaining the four fields described above, given the byte array; see the source code of the DataPacket class for a detailed description of this method.

2.3.4 A simple example

Fig. 2.2 shows the GUI of a server and two clients in a minimalistic usage example involving the CommunicationLibrary. The code for this simple example is contained in the CommunicationSolution.

Figure 2.2: A simple illustration of the client-server model implemented in the communication library. The server can accept any number of connected clients (two, in the case shown here), and can then send a simple message (hello) to all clients. Similarly, each client can send the same message back to the server.

Here, the server is connected to the loopback IP address (127.0.0.1) and the two clients, both running on the same computer as the server, then establish connections to the server. Of course, any number of clients could be used. However, for the purpose of demonstrating the code, using two clients is sufficient. When the user presses the button marked Hello on the server, a hello message is sent to all the clients, who then promptly report that they received this message from the server. Similarly, if the corresponding button is clicked in any of the clients, the hello message is sent to the server, which then acknowledges that it received the message, and also displays the identity of the sender. Both the server and client can also handle the case in which the counterpart is unavailable. Moreover, a client can be disconnected and then connected again. Similarly, the server can be disconnected and then connected again, but in this case the clients must also once more connect to the server.

A brief code snippet (Listing 2.6) shows the code for generating and connecting the server, and for starting to listen for incoming connection requests. Of course, one would not normally define hard-coded constants for the IP address and port: those values are included just to complete the example.

Listing 2.6: A brief code snippet, showing how the server is generated and started. Note that event handlers are specified for handling the Progress, Error, and Received events, respectively.

...
string ipAddressString = "127.0.0.1";
int port = 7;
server = new Server();
server.Name = "Server";
server.Progress += new EventHandler<CommunicationProgressEventArgs>(HandleServerProgress);
server.Error += new EventHandler<CommunicationErrorEventArgs>(HandleServerError);
server.Received += new EventHandler<DataPacketEventArgs>(HandleServerReceived);
server.Connect(ipAddressString, port);
if (server.Connected) { server.AcceptClients(); }
...

The three methods (event handlers) HandleServerProgress, HandleServerError, and HandleServerReceived must be defined as well. As an example, consider the event handler for received messages, shown in Listing 2.7. This event handler simply formats and prints (in a ListBox called messageListBox) the message contained in the data packet (that, in turn, is represented as a property in the DataPacketEventArgs; see the corresponding code for a full description). Note that the server does not run in the GUI thread. In order to avoid illegal cross-thread operations (see Appendix A, especially Sect. A.4) one must therefore use the BeginInvoke pattern.

Listing 2.7: An event handler that processes (and displays) messages received by the server.

private void HandleServerReceived(object sender, DataPacketEventArgs e)
{
    string information = e.DataPacket.TimeStamp.ToString("yyyyMMdd HHmmss.fff: ") +
        e.DataPacket.Message + " from " + e.SenderID.ToString();
    if (InvokeRequired)
    {
        this.BeginInvoke(new MethodInvoker(() => messageListBox.Items.Insert(0, information)));
    }
    else
    {
        messageListBox.Items.Insert(0, information);
    }
}

Chapter 3
Decision-making, memory, and dialogue

One of the most fundamental requirements on an IPA is that it should be able, within reasonable limits, to carry out a meaningful dialogue with a human, using the various input and output modalities (speech, typing, gestures, facial expressions etc.) that it might have at its disposal, processing incoming information to generate a suitable decision (perhaps consulting its memory in the process), and then executing that decision.

Now, for all the other subfields that will be studied in the coming chapters, there is normally quite a bit of theory available, usually rooted in (human) biology. Moreover, in those cases, there often exist implementable mathematical models that can be used directly in an agent. As one example among many, the formant speech synthesis in Chapter 6 uses a mathematical model based on a simplified description of the human vocal tract. However, regarding the processes of decision-making, memory, and dialogue, there are fewer (useful) theories available. Of course, quite a large number of theories (or perhaps hypotheses, rather) regarding the workings of the mind have been presented in the field of psychology. However, those theories are generally not associated with implementable models as would be needed here. More detailed approaches can be found in the field of neurobiology, but those often concern the microscopic level (neuron assemblies or even individual neurons) rather than the brain as a whole. Even though theories of the brain as a whole are rarely presented in implementable form, for good reason, one can still make use of such theories as an inspiration when formulating a simplified implementable model. Thus, for example, the use of the working memory in the Agent class (see below) has been inspired by models of working memory in humans. The same can be said for the IPA structure as a whole, namely the fact that it is implemented as a set of separate processes with strong, asynchronous interaction.

Then there is the question of whether one, even in principle, can generate a truly intelligent agent, regardless of the method used. Here, such aspects will not be considered: Instead, the semblance of intelligence is sufficient. That is, the goal is to generate an IPA that can handle, for example, a basic dialogue with a human. The brain of the IPAs considered here will thus be modelled as a collection of simple, and rather rigid, dialogues. Returning to the topic of theory, it should be noted that the brain of an IPA could of course have been implemented in many different ways.
The implementation described in Sect. 3.2 was selected with the aim of making it easy to set up a set of dialogues while, at the same time, maintaining flexibility for further development.

3.1 A simple example

In order to illustrate some of the difficulties encountered when implementing decision-making, memory, and dialogue in an IPA, a simple example will now be given. Consider a situation involving an IPA that is supposed to retrieve and read news stories to a user, perhaps a visually impaired person using speech (or, possibly, gestures) as the input modality, even though the example would be valid also in the case of text input (typing). The beginning of a specific dialogue of this kind can be seen in Fig. 3.1.

User: Hello!
[The agent detects and recognizes the face of the user (Mattias)]
[The user's input is detected by a separate Listener program, and is then transferred to the agent's working memory. A top-level dialogue (for topic selection) is activated.]
Agent: Hello Mattias. How can I be of service?
[The agent's statement is sent to a Speech program, if available.]
User: I would like to hear the news, please.
[The agent processes the input, disables the top-level dialogue, and triggers a news dialogue.]
Agent: OK. Which topic are you interested in?
User: Economy, please.
[The agent searches its working memory for items of interest]
Agent: I have three new items, which arrived in the last hour.
User: OK, list them for me.
Agent: Item 1: The Bank of England today announced ..
User: Skip that one.
Agent: OK. Item 2: U.S. jobless claims down more than expected.
User: Read that one, please.
etc. etc.

Figure 3.1: A partially annotated example of (the beginning of) a simple human-agent dialogue. As described in the main text, even a simple and somewhat robotic conversation of this kind requires quite a complex implementation, at least if some variety is to be allowed in the dialogue.

First of all, the user must get the attention of the IPA. In the simplest situation, the IPA may have only a news reader dialogue, in which case this dialogue could simply wait for input from the user to start the discussion. However, a slightly more advanced IPA could contain numerous dialogues (and, perhaps, non-dialogue processes as well). In such situations the user must somehow trigger the dialogue, starting with getting the attention of the IPA. How should that part be implemented? Already here, many options present themselves: One could, for example, have a loop running in the agent, checking for inputs. However, apart from being inelegant, such an implementation would make use of computational resources even when there is no reason for doing so, since it would involve constant checking. A better approach is to use an event-based system, in which event handlers stand by (without any looping), waiting for something to happen. Then, the next problem appears: What should that something be, and which processes should be standing by? Should all dialogues check for suitable input? What if more than one dialogue finds that the input matches its starting condition, thus perhaps triggering two dialogues to run simultaneously?

Regarding the first question, one possible approach (among several), with some biological justification, is to trigger events via changes in the IPA's working memory. Thus, the IPA must be fitted with a working memory that, among other things, would contain the various inputs (e.g. speech, text, or gestures), as well as an event that should be triggered whenever there is a change in the working memory, i.e. when a new item is added. Next, rather than having all dialogues standing by, the triggering of a dialogue can be taken care of by an event handler in the agent itself.

Several new problems then appear: How should a dialogue respond to events (user input) and how should the dialogue be structured? Regarding the first question, one could in principle let the event handler just described pass the information that a new user input has been received to the currently active dialogue, and then let the dialogue handle it. However, this would be slightly inelegant as it would involve passing information (unnecessarily, as will be demonstrated) between different parts of the agent. Moreover, it is likely that such an implementation would result in a very complex event handler, as it would now have to handle not only the triggering of dialogues but also the passing of information to (and, perhaps, from) dialogues. An alternative approach (used here) is to let the active dialogue itself subscribe to (i.e. respond to) the event triggered when the working memory is changed, thus removing the need to pass the user input from the agent itself to the active dialogue. One must then make sure to unsubscribe from the event whenever a dialogue is deactivated, to avoid triggering actions from deactivated dialogues.

As for the second question, choosing a representation for the structure of a dialogue involves a difficult trade-off between simplicity, on the one hand, and flexibility, on the other. A dialogue between two humans generally involves a strong degree of flexibility: In any given part of the exchange, many different statements would be valid as input to the other person and the dialogue would take different directions (including complete changes of topic) depending on the sequences of statements made by the two participants. Moreover, both participants would, in all likelihood, have a clear understanding of the context in which the dialogue takes place and also share certain common-sense knowledge. None of those things apply, at least not a priori, to an IPA. As a simple example, when giving an affirmative, verbal response, a human might simply say yes. However, it is also possible to respond in other, equivalent ways, e.g. ok, sure, fine etc. In order for an IPA to handle even this simple case, it must be provided with the knowledge that those responses represent the same thing. Of course, one can easily envision many examples that are considerably more complex than the one just given. For instance, just include any form of joke or humor into a sentence, and it is easy to understand how an IPA, devoid of context and without a sense of humor, will be lost.

Here, a rather simple approach has been taken, in which a dialogue is built up as a finite-state machine (FSM), consisting of a number of states (referred to as dialogue items) in which a given input is mapped to a specific output. The inputs are retrieved from the agent's working memory. The easiest way to do that is simply to take the most recent item in the working memory when a change is detected (i.e. when an item is added to the working memory) as the user's input.
However, there are cases in which multiple sources may add memory items that are not related to the user's input. For example, an agent equipped with a camera may detect and recognize a face (in a separate program, as described in Chapter 2) and then place the corresponding person's name in the working memory, which the agent might then mistake as the response to its statement. In order to avoid such problems, each memory item is equipped with a memory item tag that consists of a string that can be used for identifying and classifying memory items. Thus, a memory item from the face recognition program may contain a tag such as Vision:FaceRecognized, where the first part identifies the process responsible for generating the memory item and the second part describes the category of the memory item. Each memory item also has a content string that, in this particular example, would contain the name of the person whose face was recognized. Moreover, each memory item is associated with a time stamp so that old memory items, which have not been accessed for a long time or have become obsolete for some other reason, can be removed.

As will be shown below, some flexibility of the agent's response has been added by using an implementation that allows for alternative inputs as well as alternative (but equivalent) outputs. The latter is also important for the perception of the dialogue from the user's point of view: If the IPA constantly uses the same style of replying, the dialogue will appear very rigid and unnatural. By adding a bit of variety, one can to some extent reduce such perceptions.

The dialogues considered in the implementation used here are rather limited, and it is required that the user should stay on the topic, rather than trying to wander off into other topics as frequently happens in human-to-human dialogues. However, even a simple IPA must somehow be able to deal with cases in which the user gives an incorrect response. Thus, another problem appears. One solution, of course, is simply to wait until the user gives a response that the agent can understand. Such an approach will quickly become annoying for the user, though, who might not know or even be able to guess precisely what input the agent requires. Of course, one could make the agent list the allowed inputs, but that, too, would represent a strong deviation from a human-to-human dialogue. A better approach might be to include, in every dialogue item involving human-agent interaction, a separate method for handling incorrect or unexpected responses, asking for a clarification a few times (at most), before perhaps giving up on the dialogue and instead returning to a resting state, awaiting input from the user. This is indeed the approach chosen here.

Now, returning to the beginning of the example, the aim was to generate an agent that could read the news in interaction with the user. Thus, in addition to handling the dialogue, the IPA must also be able to retrieve news items upon request. As shown in Chapter 8, one can write a separate program for obtaining and parsing news items, and then sending them to the agent. How, then, should the agent handle the news items, in a dialogue with a human? Where should they be stored, and how should they be accessed? Here, again, many possibilities present themselves to the programmer.
One can include, for example, a specialized state in the dialogue that would actively retrieve news items, on a given topic, by sending a request to the corresponding program and then receiving and processing the response. However, in that case, the user may have to wait a little bit for this procedure to be completed. Sending and receiving the data over the network is usually very fast, but (for example) reloading a web page from which the news is obtained might take some time. Even a delay of 0.5 s will generally be perceived as annoying by a human user. An alternative approach, used here, is to let the program responsible for downloading the news send the information about incoming news items to the agent as soon as they become available. The agent can then store the news items in its working memory, so that they can be retrieved on demand in the dialogue, thus eliminating any delays. As above, the memory item tags can then be used for distinguishing between, say, user input and a news item. In a dialogue, the agent may then be able to retrieve, for example, all memory items regarding sport news, received in the last half hour.

As will hopefully now be clear from this example, even generating a very simple human-agent dialogue involves quite a number of complex problems that, moreover, can be solved in many different ways. There are many additional refinements that can be made, of course. As one example among many, in cases where the agent starts reading a long news item (or any other text), the user might quickly realize that he or she is not interested and wishes to move on. Then, ideally (and as in a human-to-human conversation), it should be possible to interrupt the agent so that one can direct it to a topic of greater interest.

The next section contains a brief description of the main classes implemented in the AgentLibrary. When reading that description, it is very useful to keep the example above in mind.

3.2 The AgentLibrary

The AgentLibrary contains the necessary classes for setting up the brain of an agent, including a set of dialogues (and, possibly, other non-dialogue brain processes), as well as the agent's working memory and (optionally) its long-term memory.

3.2.1 The Agent class

The Agent class contains a list of brain processes, as well as a working memory and a long-term memory. There is a Start method responsible for setting up a server, initializing the working memory and loading the long-term memory (if available), and also starting the client programs (see Fig. 2.1). It is possible to modify the structure of the IPA by excluding some of the client programs. For example, a simple agent may just define a Listener client and a Speech client. Upon startup, the agent also checks which brain processes should be active initially. Those processes are then started. From this point onward, most of the agent's work is carried out by the HandleWorkingMemoryChanged event handler, which is triggered whenever there is any change in the agent's working memory. There is also a Stop method that shuts down all client processes, and then also the agent's server.

The HandleWorkingMemoryChanged event handler consists of four blocks of code: (i) First, it checks (using the memory item tags of the available items in the working memory) whether any new speech memory item has been added to the working memory. If so, the content of the corresponding memory item is sent to the Speech client.
The agent also keeps track of the time at which the speech output was sent, to avoid repeating the same output again. (ii) Next, it repeats the procedure, but this time concerning facial expressions. Any new facial expression memory item (again identified using the appropriate memory item tag) is sent to the Visualizer client. Then (iii) it checks whether any brain process should be activated or (iv) deactivated.

3.2.2 The Memory class

The Memory class, used for defining the working memory of an agent, simply contains a list of memory items, of type MemoryItem. Each memory item contains (i) the date and time at which the item was generated; (ii) the tag; and (iii) the contents of the memory item, as shown in Listing 3.1.

Listing 3.1: The fields defined in the MemoryItem class.

public class MemoryItem
{
    private DateTime creationDateTime;
    private string tag;
    private string content;
    ...
}

Items are inserted into the working memory by using the InsertItems method shown in Listing 3.2. This method also makes sure that the items are inserted in the order in which they were generated, with the most recent item at the first index (0) of the list. Finally, an event (MemoryChanged) is triggered to indicate that there has been a change in the working memory. This event, in turn, is then handled by several event handlers: the agent's event handler described above, as well as event handlers in any active brain process (see below).

Listing 3.2: The InsertItems method of the Memory class.

public void InsertItems(List<MemoryItem> insertedItemList)
{
    Monitor.Enter(lockObject);
    for (int ii = 0; ii < insertedItemList.Count; ii++)
    {
        MemoryItem item = insertedItemList[ii];
        DateTime itemCreationDateTime = item.CreationDateTime;
        int insertionIndex = 0;
        while (insertionIndex < itemList.Count)
        {
            if (itemList[insertionIndex].CreationDateTime < itemCreationDateTime) { break; }
            insertionIndex++;
        }
        itemList.Insert(insertionIndex, item);
    }
    OnMemoryChanged();
    Monitor.Exit(lockObject);
}

There are also several methods for accessing memory items. For example, the GetLastItemByTag method retrieves the most recent item (if any) matching an input tag. The insertion and access methods all make use of the Monitor construct (see Appendix A.5) in order to handle the fact that several asynchronous processes (brain processes or external clients) act upon the working memory. Thus, for example, during the (very brief) time interval when the agent is accessing the most recent speech-related memory item, it has exclusive access to the working memory.

3.2.3 The DialogueProcess class

This class is derived from the base class BrainProcess and defines a specific kind of brain process aimed at handling human-agent dialogue. Each dialogue process, in turn, contains a set of dialogue items, which are either of type InteractionItem, for those items that handle direct interaction between the agent and a user, or of type MemoryAccessItem, for those items that handle other aspects of the dialogue (such as accessing and processing information). At any given time, exactly one dialogue item is active, and it is referred to as the current dialogue item.
Each dialogue item, in turn, contains a list of objects derived from the type DialogueAction, and each such object defines a target dialogue item to which the dialogue process will jump, if the action in question is executed; see also below. Whenever a dialogue process is activated, a subscription is established with respect to the MemoryChanged event of the working memory. Similarly, whenever a dialogue process is deactivated, the subscription is removed. Thus, only active brain processes react to changes in the agent's working memory.

The HandleWorkingMemoryChanged method in the DialogueProcess class is a bit complex. Summarizing briefly, it first accesses the current dialogue item. If that item is an interaction item, it checks whether or not the item requires input. If it does not, the output obtained from its first dialogue action is simply placed in the agent's working memory, and the current dialogue item is set as specified by that dialogue action. If the dialogue item does require input, the next step is to check whether or not the actual input matches any of the required inputs for the current dialogue item, by going through the available dialogue actions until a match is found. If a match is found, the corresponding output is placed in the agent's working memory, and the index of the current dialogue item is updated as specified by the matching dialogue action. Note that the input can come either from the Listener client or, in cases where input in the form of gestures is allowed, from the Vision client.

If the current dialogue item is instead a memory access item, the CheckMemory method of the MemoryAccessItem is called. This method goes through the various dialogue actions, generating lists of memory items based on the tag specified in each dialogue action. The memory items are then placed in the working memory, thus triggering the MemoryChanged event. For example, the ReadByTagAction retrieves from working memory all memory items matching a given tag (News, say), not older than a pre-specified time interval, and then generates an output memory item with the tag Speech for one of those items (if any) based on a user-specified index. If the index is set to 0, the most recent item is selected (a sketch of this kind of tag-based retrieval is given at the end of this subsection). The output item is then placed in the agent's working memory, thus triggering the MemoryChanged event so that, in turn, the agent's event handler can send the output to the Speech client.

Most dialogues are not just a linear sequence of input-output mappings. For example, when the agent asks a question, the next dialogue item (responsible for processing the input) can take different actions depending on whether the input is affirmative (e.g. yes) or negative (e.g. no). There is a great degree of flexibility here. For example, a single dialogue process may contain two different paths, one handling an affirmative input and one handling a negative input. Alternatively, upon receiving the input, the current dialogue process can deactivate itself and also activate another process or, possibly, either of two processes (one for affirmative input and one for negative input). The AgentLibrary contains a few dialogue action types, derived from the DialogueAction base class. Those types can handle the most basic forms of dialogue, but for more advanced dialogues additional derived dialogue action classes might be needed.
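As an illustration of the tag-based, time-limited retrieval performed by an action such as ReadByTagAction, the following sketch filters a list of MemoryItem objects by tag and age. It is only meant to convey the idea: the helper method is hypothetical (it is not part of the AgentLibrary), the Tag property is assumed to expose the tag field shown in Listing 3.1, and the actual library code may differ.

// Hypothetical helper, not part of the AgentLibrary: returns the memory items whose tag
// matches the given tag and that are no older than the given maximum age, assuming (as in
// Listing 3.2) that the list is sorted with the most recent item first.
private static List<MemoryItem> GetRecentItemsByTag(List<MemoryItem> itemList, string tag, TimeSpan maximumAge)
{
    List<MemoryItem> matchingItems = new List<MemoryItem>();
    DateTime oldestAllowedTime = DateTime.Now - maximumAge;
    foreach (MemoryItem item in itemList)
    {
        if (item.CreationDateTime < oldestAllowedTime) { break; }  // All remaining items are older.
        if (item.Tag == tag) { matchingItems.Add(item); }
    }
    return matchingItems;
}

An index of 0 into the resulting list would then correspond to the most recent matching item, as described above.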
However, such classes can easily be added, without, of course, having to change the rather complex framework described above.

3.3 Demonstration application

The AgentDevelopmentSolution contains a simple demonstration application that illustrates the basic aspects of human-agent dialogue in the AgentLibrary. In addition to an agent program, this solution also contains a very simple listener program, which reads only text input, and an equally simple speech program that only outputs text. The agent program contains a menu that gives the user access to four hard-coded simple dialogue examples, which will be described next. In all cases, the dialogues are incomplete, and the examples are merely intended to show how the various classes in the AgentLibrary can be used.

3.3.1 TestAgent1

This agent is generated if the user chooses the menu actions File - New agent - Test agent 1. In this case, the generated agent handles the beginning of a greeting dialogue, by first activating a dialogue item that waits for a greeting (e.g. hello) from the user. If a greeting is received, the agent moves to the next dialogue item, in which it asks if it can be of service, and that concludes this simple example. Despite its simplicity, the example is sufficient for illustrating several aspects of the AgentLibrary. The code defining TestAgent1 is shown in Listing 3.3.

Listing 3.3: The code that generates TestAgent1.

private void GenerateTestAgent1()
{
    SetUpAgent();
    DialogueProcess dialogue1 = new DialogueProcess();
    dialogue1.Name = "Dialogue1";
    agent.BrainProcessList.Add(dialogue1);
    dialogue1.ActiveOnStartup = true;
    InteractionItem dialogueItem1 = new InteractionItem();
    dialogueItem1.Name = "Item1";
    dialogueItem1.MaximumRepetitionCount = 2;
    ResponseAction action1 = new ResponseAction();
    action1.InputList.Add("Hello");
    action1.InputList.Add("Hi");
    action1.TargetDialogueItemName = "Item2";
    action1.OutputList.Add("Hello user");
    dialogueItem1.ActionList.Add(action1);
    dialogue1.ItemList.Add(dialogueItem1);
    InteractionItem dialogueItem2 = new InteractionItem();
    dialogueItem2.MillisecondDelay = 500;
    dialogueItem2.Name = "Item2";
    OutputAction action2 = new OutputAction();
    action2.OutputList.Add("How can I be of service?");
    action2.BrainProcessToDeactivate = dialogue1.Name;
    dialogueItem2.ActionList.Add(action2);
    dialogue1.ItemList.Add(dialogueItem2);
    FinalizeSetup();
}

The SetUpAgent method sets up the server and file paths to the listener and speech programs. Next, the dialogue is defined. The Dialogue1 process is set to be active as soon as the agent starts. By construction, the first dialogue item (in this case named Item1) becomes the current dialogue item when the dialogue is started. The agent then awaits user input, in this case requiring that the input should be either Hello or Hi (note that the input matching is case-insensitive, so either Hello or hello would work). The agent then responds with the phrase Hello user and proceeds to the next dialogue item (Item2). Here, it waits for 0.5 s before outputting the phrase How can I be of service?. Next, the dialogue process is deactivated, and the agent stops responding to input.
Since the second dialogue item does not require any input, an OutputAction (unconditional output) was used instead of the ResponseAction used in the first dialogue item. The same effect could also have been achieved by using a ResponseAction (in Item2) with an empty input list.

Note that if the agent cannot understand the user's reply, i.e. if the input is anything except Hello and Hi, the current dialogue item will handle the situation by asking for a clarification. By default, this is done twice. If the user still fails to give a comprehensible answer the third time, the dialogue is deactivated, and a user-specified dialogue (for example, one that simply waits for the user to start over) is activated instead, provided that such a dialogue exists, of course. The allowed number of failed answers can also be modified by the user and can differ between dialogue items.

Ideally, an agent should be able to understand any greeting that a human would understand (i.e. not just Hello and Hi). One can of course extend the list of allowed inputs to obtain a better approximation of human behavior. Moreover, for some particular cases, a set of default input strings has been defined. Thus, for example, if one wants to add a response action taking affirmative input, instead of listing all affirmative answers (yes, sure etc.) for each such action, one can simply use the SetAffirmativeInput method in the ResponseAction class. Similar methods exist also for negative inputs and for greetings; see also the source code for TestAgent3 below.

3.3.2 TestAgent2 and TestAgent3

These two examples illustrate the fact that a dialogue can be implemented in several different ways. In this case, the agent again awaits a greeting from the user. If the greeting is received, the agent asks about the user's health: How are you today? If the user gives a positive answer (Fine), the agent activates a path within the current dialogue for handling that answer, and if the answer is negative (Not so good), the agent instead activates another path, still within the same dialogue, for handling that answer.

Listing 3.4 shows a small part of the definition of TestAgent2, namely the dialogue item that handles the user's response to the question How are you today?. As can be seen in the listing, the dialogue item defines two different actions, which are selected based on the mood, negative or positive, of the user's reply.

Listing 3.4: A small part of the code for TestAgent2. The dialogue item shown here contains two dialogue actions. The first action is triggered if the user gives a negative input, in which case the agent then moves to a dialogue item (not shown) called NegativeItem1. If instead the user gives a positive input, the agent moves to another dialogue item (not shown either) called PositiveItem1, in both cases after first giving an appropriate output.

...
InteractionItem dialogueItem3 = new InteractionItem();
dialogueItem3.Name = "Item3";
ResponseAction action31 = new ResponseAction();
action31.InputList.Add("Not so good");
action31.OutputList.Add("I'm sorry to hear that");
action31.TargetDialogueItemName = "NegativeItem1";
dialogueItem3.ActionList.Add(action31);
ResponseAction action32 = new ResponseAction();
action32.InputList.Add("Fine");
action32.OutputList.Add("I'm happy to hear that");
action32.TargetDialogueItemName = "PositiveItem1";
dialogueItem3.ActionList.Add(action32);
dialogue1.ItemList.Add(dialogueItem3);
...

By contrast, in TestAgent3, if a positive answer is received from the user, the initial dialogue is deactivated and another dialogue is activated for handling that particular case. If instead a negative answer is received, the initial dialogue is also deactivated, and yet another dialogue is activated for handling the negative answer.

3.3.3 TestAgent4

This example illustrates memory access. If the user asks to hear the news (Read the news, please), the agent searches its memory for memory items that carry the tag News. The agent then selects the first item, somewhat arbitrarily, and sends the corresponding text to the speech program. Now, normally, the news
items would have been obtained by an internet data acquisition program (see Chapter 2) that would read news continuously, and then send any new items to the agent so that the latter can include them in its working memory. Here, for simplicity, a few artificial news items have simply been hard-coded into the agent's working memory.

As mentioned above, only a few DialogueAction classes have been included in the AgentLibrary. It is likely that, for more advanced dialogues than the ones considered here, the user will have to write additional dialogue action classes.

Chapter 4

Computer vision

The ability to see is, of course, of great importance for many animal species. Similarly, computer vision, generated by the use of one or several (video) cameras, can play a very important role in IPAs as well as other kinds of intelligent agents. However, one of the main difficulties in using vision in intelligent agents is the fact that cameras typically provide very large amounts of information that must be processed quickly in order to be relevant for the agent's decision-making.

This chapter starts with a general description of digital images, followed by a description of the ImageProcessing library, which contains source code for basic image processing as well as code for reading video streams. The basic image processing operations are then described in some detail. Next, a brief overview is given regarding more advanced image processing operations, such as adaptive thresholding, motion detection, and face detection. Two simple demonstration programs are then introduced.

4.1 Digital images

A digital image consists of picture elements called pixels. In a color image, the color of each pixel is normally specified using three numbers, defining the pixel's location in a color space. An example of such a space is the red-green-blue (RGB) color space, in which the three numbers (henceforth denoted R, G, and B) specify the levels of the red, green, and blue components for the pixel in question. These components typically take values in the range [0, 255]. In other words, for each pixel, three bytes are required to determine the color of the pixel. In some cases, a fourth byte is used, defining an alpha channel that determines the level of transparency of a pixel.

In a grayscale image, only a single value (in the range [0, 255]) is required for each pixel, such that 0 corresponds to a completely black pixel and 255 to a completely white pixel, and where intermediate values provide levels of gray.
Thus, a grayscale image requires only one third of the information required for a color image. The conversion of an RGB image to a grayscale image is often carried out as

Γ(i, j) = 0.299R(i, j) + 0.587G(i, j) + 0.114B(i, j),    (4.1)

where Γ(i, j), the gray level for pixel (i, j), is then rounded to the nearest integer. In the remainder of this chapter, the indices (i, j) will normally be omitted, for brevity, except in those cases (e.g. convolutions, see below) where the indices are really needed to avoid confusion. A more complete description of grayscale conversion can be found in Subsect. 4.3.2 below.

Taking the information reduction one step further, one can also binarize an image, in which case each pixel is described by only one bit (rather than a byte), such that 0 corresponds to a black pixel and 1 to a white pixel, and where there are no intermediate values. The process of binarization is described in Subsect. 4.3.3 below.

4.1.1 Color spaces

In addition to the RGB color space, there are also other color spaces, some of the most common being CMY(K) (cyan, magenta, yellow, often augmented with black (K)), HSV (hue-saturation-value), and YCbCr, consisting of a luma component (Y) and two chrominance components (Cb and Cr). The YCbCr color space has, for example, been used in face detection, since skin color pixels generally tend to fall in a rather narrow range in Cb and Cr. In its simplest form, the YCbCr color scheme is given by

Y = 0.299R + 0.587G + 0.114B    (4.2)
Cb = B − Y    (4.3)
Cr = R − Y    (4.4)

Note that the luma component (Y) corresponds to the standard grayscale defined above. However, the conversion from RGB to YCbCr normally takes a slightly different form. In the Rec. 601 standard for video signals, the Y component takes (in the case of eight-bit encoding) integer values in the range [16, 235] (leaving the remainder of the ranges, [0, 15] and [236, 255], for image processing purposes, such as carrying information about transparency). Furthermore, Cb and Cr also take integer values, in the range [16, 240], with the center position at 128. The equations relating RGB to this definition of YCbCr take the form

    ( Y  )   ( 16  )   (  0.25679   0.50413   0.09791 ) ( R )
    ( Cb ) = ( 128 ) + ( −0.14822  −0.29099   0.43922 ) ( G ),    (4.5)
    ( Cr )   ( 128 )   (  0.43922  −0.36779  −0.07143 ) ( B )

where the resulting values are rounded to the nearest integer. The inverse transformation can easily be derived from Eq. (4.5), and takes the form

    ( R )   ( 1.16438   0.00000   1.59603 ) ( Y − 16   )
    ( G ) = ( 1.16438  −0.39176  −0.81297 ) ( Cb − 128 ).    (4.6)
    ( B )   ( 1.16438   2.01723   0.00000 ) ( Cr − 128 )

An example of the YCbCr color space is shown in Fig. 4.1.

Figure 4.1: An example of the YCbCr color space. The upper left panel shows the original image, whereas the upper right panel shows the luma (Y) component. The lower panels show the Cb (left) and Cr (right) components. When plotting any of the YCbCr components, the other components were set to the center of their range. Thus, for example, for the Cb plot, Y was set to 126 and Cr to 128. See also Eq. (4.6). Photo by the author.

In the remainder of the chapter, unless otherwise specified, the RGB color space will be used. However, the YCbCr color space will be revisited in connection with the discussion on face detection in Subsect. 4.4.3.
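As a small illustration of Eq. (4.5), the following sketch converts a single RGB pixel to YCbCr. The rounding follows the text; the final clamping to the nominal ranges ([16, 235] for Y and [16, 240] for Cb and Cr) is an added safeguard, not something prescribed by the text.

// A sketch of the RGB-to-YCbCr conversion in Eq. (4.5), for a single pixel.
private void ConvertRGBToYCbCr(int r, int g, int b, out int y, out int cb, out int cr)
{
    y  = (int)Math.Round( 16 + 0.25679 * r + 0.50413 * g + 0.09791 * b);
    cb = (int)Math.Round(128 - 0.14822 * r - 0.29099 * g + 0.43922 * b);
    cr = (int)Math.Round(128 + 0.43922 * r - 0.36779 * g - 0.07143 * b);
    // Clamp to the nominal Rec. 601 ranges (added safeguard):
    y  = Math.Min(235, Math.Max(16, y));
    cb = Math.Min(240, Math.Max(16, cb));
    cr = Math.Min(240, Math.Max(16, cr));
}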
4.1.2 Color histograms

The information in an image can be summarized in different ways. For example, one can form color histograms measuring the distribution of colors over an image, also referred to as the color spectrum. A color histogram (for a given color channel, for example, red) is formed by counting the number of pixels taking any given value in the allowed range [0, 255], and then (optionally) normalizing the histogram by dividing by the total number of pixels in the image. Thus, a (normalized) histogram for a given color channel can be viewed as a set of 256 bins, each bin measuring the fraction of the image pixels taking the color encoded by the bin number; see also Subsect. 4.3.6 below.

Of course, it is also possible to generate a grayscale histogram. An example is shown in Fig. 4.2. Here, the histograms for the red, green, and blue channels were extracted for the image on the left. Next, the image was converted to grayscale, and the gray histogram was generated as well. Note that the histogram plots use a relative scale, so that the bin with maximum content (for each channel) extends to the top of the corresponding plot. Note also that the blue histogram has a strong spike at 0, making the rest of that histogram look rather flat.

Figure 4.2: An example of image histograms. The panels on the right show, from top to bottom, the red, green, blue, and gray histograms, respectively. Photo by the author.

4.2 The ImageProcessing library

The IPA libraries include a C# class library for image processing, namely the ImageProcessingLibrary. In order to speed up the various image processing tasks, this library makes use of two important concepts, namely locked bitmaps and parallel processing. Locking a bitmap in memory allows the program to access and manipulate the image pixels (much) faster than with the GetPixel and SetPixel methods of the Bitmap class. Moreover, in some (but not all) cases, the pixel operations necessary to process an image occur in sequence and independently of each other. In such cases, one can make use of the parallel processing methods (available in the System.Threading.Tasks namespace) to further speed up the processing.

4.2.1 The ImageProcessor class

When an instance of the ImageProcessor class is generated, it begins by making a copy of the bitmap, and then locking the copy in memory, using the LockBits method, as described in Listing 4.1. The image processor is then ready to carry out various operations on the locked bitmap. The list of public methods in the ImageProcessor class is given in Table 4.1.

Listing 4.1: The constructor and the Lock method of the ImageProcessor class.

public ImageProcessor(Bitmap bitmap)
{
    this.bitmap = new Bitmap(bitmap);
    Lock();
}

private void Lock()
{
    bitmapData = this.bitmap.LockBits(
        new Rectangle(0, 0, this.bitmap.Width, this.bitmap.Height),
        ImageLockMode.ReadWrite, this.bitmap.PixelFormat);
    isLocked = true;
}

Locking the bitmap takes some time, since a copy of the bitmap is made before locking occurs (so that the original bitmap can be used for other purposes while the image processor uses the copy). Thus, the normal usage is to first generate the image processor by calling the constructor, then carrying out a sequence of operations, of the kinds described below, then calling the Release method, reading off the processed bitmap, and then disposing the image processor.
The last step is important since, even though the garbage collector in .NET will eventually dispose of the image processor (and, more importantly, the associated bitmap), it may take some time before it does so. If one is processing a video stream generating, say, 25 images per second, the memory usage may become very large (even causing an out-of-memory error) before the garbage collector has time to remove the image processors.

An example of a typical usage of the ImageProcessor class is given in Listing 4.2. In this example, an image processor is generated that first changes the contrast and the brightness of the image, then converts it to grayscale before, finally, carrying out binarization. The various methods shown in this example are described in the text below. Even though it is not immediately evident from the code in Listing 4.2, the step in which the processed bitmap is obtained also involves copying the image residing in the image processor, so that the latter can then safely be disposed.

Listing 4.2: An example of the typical usage of the ImageProcessor class. In this example, it is assumed that a bitmap is available. The first few lines just define some input variables, in order to avoid ugly hard-coding of numerical parameters as inputs to the various methods.

double relativeContrast = 1.2;
double relativeBrightness = 0.9;
int binarizationThreshold = 127;
ImageProcessor imageProcessor = new ImageProcessor(bitmap);
imageProcessor.ChangeContrast(relativeContrast);
imageProcessor.ChangeBrightness(relativeBrightness);
imageProcessor.ConvertToStandardGrayscale();
imageProcessor.Binarize(binarizationThreshold);
imageProcessor.Release();
Bitmap processedBitmap = imageProcessor.Bitmap;
imageProcessor.Dispose();

Method                        Description
ChangeContrast                Changes the (relative) contrast of an image.
ChangeBrightness              Changes the (relative) brightness of an image.
ConvertToGrayscale            Converts a color image to grayscale using parameters specified by the user.
ConvertToStandardGrayscale    Converts a color image to grayscale using default parameters.
Binarize                      Binarizes a grayscale image, using a single (non-adaptive) threshold.
GenerateHistogram             Generates the histogram for a given color channel (red, green, blue, or gray).
Convolve                      Convolves an image with an N × N mask, where N ≥ 3 is an odd number.
BoxBlur3x3                    Blurs an image, using a 3 × 3 box convolution mask.
GaussianBlur3x3               Blurs an image, using a 3 × 3 Gaussian convolution mask.
Sharpen3x3                    Sharpens the image, using a convolution mask of size 3 × 3.
SobelEdgeDetect               Carries out Sobel edge detection on a grayscale image.
StretchHistogram              Stretches the histogram of the image, in order to enhance the contrast.

Table 4.1: Brief summary of (some of) the public methods in the ImageProcessor class. For more complete descriptions, see Sect. 4.3.
4.2.2 The Camera class

The Camera class is used for reading an image stream from a video camera, for example a web camera. This class makes use of the CaptureDevice class that, in turn, uses classes from the DirectShowLib library, which contains the methods required for low-level camera access. In the Camera class, a separate thread is started that reads the current image from the capture device and stores it in a bitmap that can be accessed in a thread-safe manner (see Sect. A.4) by other classes, for example the CameraViewControl user control, which is also included in the ImageProcessing library, and which uses a separate thread for displaying the most recent bitmap available in the corresponding Camera instance. Thus, it can run with a different updating frequency compared to the camera itself. The ImageProcessing library also contains a CameraSetupControl user control, in which the user can set the various parameters (e.g. brightness, contrast etc.) of a camera.

Listing 4.3 shows a code snippet in which a camera is set up, in this case using the first available camera device (there might of course be several cameras available). Once the resolution and frame rate have been set, the camera is started. A pointer to the camera is then passed to a CameraViewControl that, once started, displays the camera image, in this case with the same frame rate as the camera.

Listing 4.3: A code snippet showing the setup of a camera. Once the relevant parameters have been specified, the camera is started. Moreover, a pointer to the camera is passed to a CameraViewControl which is responsible for showing the image stream from the camera.

...
camera = new Camera();
camera.DeviceName = Camera.GetDeviceNames()[0];
camera.ImageWidth = 640;
camera.ImageHeight = 480;
camera.FrameRate = 25;
camera.Start();
cameraViewControl.SetCamera(camera);
cameraViewControl.Start();
...

4.3 Basic image processing

This section introduces and describes some common image processing operations, which are often used as parts of the more advanced image processing tasks considered in Sect. 4.4 below. Here, the value of a pixel in an unspecified color channel (i.e. either red, green, or blue) is generally denoted P ≡ P(i, j). Thus, for a given pixel and a given color channel, P is an integer in the range [0, 255]. Some of the operations below may result in non-integer values. The pixel value is then set as the nearest integer. If an operation results in a value smaller than 0, the pixel value is set to 0. Similarly, if a value larger than 255 is obtained, the pixel value is set to 255. For grayscale images, the gray level (also in the range [0, 255]) is denoted Γ(i, j).
4.3.1 Contrast and brightness

The contrast and brightness of an image can be controlled using a simple linear transformation, even though non-linear transformations exist as well. For a given pixel value (for some color channel), the transformation

P ← α(P − 128) + 128 + β    (4.7)

transforms both the contrast (controlled by α) and the brightness (controlled by β) of an image. The method ChangeContrast takes α as input, and changes the image using the transformation in Eq. (4.7), with β = 0, whereas the ChangeBrightness method takes the relative brightness br as input, from which β is obtained as

β = 255(br − 1),    (4.8)

after which Eq. (4.7) is applied, with α = 1. Note that β is, of course, rounded to the nearest integer. It should also be noted that operations which change contrast or brightness are not necessarily reversible, since any pixel value above 255 will be set to 255, and any value below 0 will be set to 0.

Listing 4.4 shows the implementation of the ChangeContrast method, and also illustrates the syntax for parallel processing.

Listing 4.4: The ChangeContrast method.

public void ChangeContrast(double alpha)
{
    unsafe
    {
        int bytesPerPixel = Bitmap.GetPixelFormatSize(bitmap.PixelFormat) / 8;
        int widthInBytes = bitmapData.Width * bytesPerPixel;
        byte* PtrFirstPixel = (byte*)bitmapData.Scan0;
        Parallel.For(0, bitmapData.Height, y =>
        {
            byte* currentLine = PtrFirstPixel + (y * bitmapData.Stride);
            for (int x = 0; x < widthInBytes; x = x + bytesPerPixel)
            {
                double oldBlue = currentLine[x];
                double oldGreen = currentLine[x + 1];
                double oldRed = currentLine[x + 2];
                int newBlue = (int)Math.Round(128 + (oldBlue - 128) * alpha);
                int newGreen = (int)Math.Round(128 + (oldGreen - 128) * alpha);
                int newRed = (int)Math.Round(128 + (oldRed - 128) * alpha);
                if (newBlue < 0) { newBlue = 0; } else if (newBlue > 255) { newBlue = 255; }
                if (newGreen < 0) { newGreen = 0; } else if (newGreen > 255) { newGreen = 255; }
                if (newRed < 0) { newRed = 0; } else if (newRed > 255) { newRed = 255; }
                currentLine[x] = (byte)newBlue;
                currentLine[x + 1] = (byte)newGreen;
                currentLine[x + 2] = (byte)newRed;
            }
        });
    }
}

As can be seen in the listing, the method begins with the unsafe keyword, which should be applied when carrying out pointer operations (such as accessing the bytes of a locked bitmap). The method then runs through the lines of the image, changing the contrast of each pixel as described above. The Parallel.For syntax implies that different rows are processed in parallel. Note that the transformations applied to a pixel are independent of the transformations applied to any other pixel. This is important since, with a parallel for-loop, the operations may occur in any order. Of course, one could have used a standard (sequential) for-loop as well, but the parallel syntax does lead to a rather significant speedup. To illustrate this, two additional methods were tested, one that runs through the locked bitmap as in Listing 4.4, but with a standard for-loop instead of the parallel for-loop, and one that directly accesses the pixels of the image (without even locking the bitmap), using the GetPixel and SetPixel methods. The results are summarized in Table 4.2. As is evident from the table, the parallel method is by far the fastest.

Method                                               Computation time (s)
Locked bitmaps, parallel processing (Listing 4.4)    0.0124
Locked bitmap, sequential processing                 0.0522
Direct pixel access, using GetPixel and SetPixel     2.37

Table 4.2: A speed comparison involving three different methods for changing the contrast of an image with 1600 × 1067 pixels, using a computer with an Intel Core i7 processor running at 3.4 GHz. The parallel method given in Listing 4.4 reduces the computation time by around 76% compared to a sequential method, and by more than 99% compared to the method involving direct pixel access.

4.3.2 Grayscale conversion

The transformation of a color image to a grayscale image involves compressing the information in the three color channels (red, green, and blue) into a single channel (gray). In practice, a gray value is computed, and that single value is then applied to the three color channels. The general transformation can be written

Γ = fr R + fg G + fb B,    (4.9)

where fr, fg, and fb are the red, green, and blue fractions, respectively. The method ConvertToGrayscale takes these three fractions (all in the range [0, 1], and with a sum of 1) as inputs, and then carries out the transformation in Eq. (4.9), rounding the values of Γ to the nearest integer. As mentioned in Sect. 4.1, the settings fr = 0.299, fg = 0.587, and fb = 0.114 are commonly used in grayscale conversion. The method ConvertToStandardGrayscale, which does not take any inputs, uses these values.
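As a small illustration of Eq. (4.9) (not the library implementation, which operates on a locked bitmap), the sketch below computes the gray value for a single pixel; the fractions are assumed to be non-negative and to sum to 1, as stated above.

// Illustration only: the grayscale transformation of Eq. (4.9) for one pixel.
private byte ToGray(byte r, byte g, byte b, double fr, double fg, double fb)
{
    double gamma = fr * r + fg * g + fb * b;   // Eq. (4.9)
    int gray = (int)Math.Round(gamma);
    if (gray < 0) { gray = 0; } else if (gray > 255) { gray = 255; }
    return (byte)gray;
}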
4.3.3 Binarization

In its simplest form, the process of binarization uses a single threshold (the binarization threshold), and sets the color of any pixel whose gray level value is below the threshold to black. All other pixels are set to white. Note that this process should be applied to a grayscale image, rather than a color image. The ImageProcessor class implements this simple form of binarization in its Binarize method. However, in practical applications, one must often handle brightness variations across the image. Thus, some form of adaptive threshold is required, something that will be discussed in Subsect. 4.4.1 below.

4.3.4 Image convolution

Many image operations, e.g. blurring and sharpening, can be formulated as a convolution, i.e. a process in which one passes a matrix (the convolution mask) over an image and changes the value of the center pixel using matrix multiplication. More precisely, convolution using an N × N mask (denoted C) changes the value P(i, j) of pixel (i, j) as

P(i, j) ← Σ_{k=1}^{N} Σ_{m=1}^{N} C(k, m) P(i − ν + k − 1, j − ν + m − 1),    (4.10)

where ν = (N − 1)/2, and N is assumed to be odd. The mask is passed over each pixel in the image¹, changing the value of the central pixel in each step. The pixel value is then rounded to the nearest integer. The ImageProcessor class contains a method Convolve, which takes as input a convolution mask (in the form of a List<List<double>>), and then carries out convolution as in Eq. (4.10). Of course, the result of a convolution depends on the elements in the convolution mask. By setting those elements to appropriate values (see below), one can carry out, for example, blurring and sharpening. It should be noted, however, that convolutions can be computationally costly, since a matrix multiplication must be carried out for each pixel.

¹ Except boundary pixels, for which the mask would extend outside the image. Such pixels are normally ignored, i.e. their values are left unchanged. Alternatively, one can extend the image (a process called padding) by adding a frame, (N − 1)/2 pixels wide, around it.

Blurring

Consider the convolution mask

           ( 1  1  1 )
Cb = (1/9) ( 1  1  1 ).    (4.11)
           ( 1  1  1 )

When this mask is passed over the image, the value of any pixel is set to the average in a 3 × 3 region centered around the pixel in question, resulting in a distinct blurring of the image. The matrix Cb defines so-called box blurring. This kind of blurring, with N = 3, is implemented in the ImageProcessor class as BoxBlur3x3. Of course, one can use a larger convolution mask (e.g. N = 5), and pass it to the Convolve method described above. The BoxBlur3x3 method simply provides a convenient shortcut to achieve blurring with N = 3, which is usually sufficient.

Blurring can be achieved in different ways. For example, one may instead use the mask

            ( 1  2  1 )
Cg = (1/16) ( 2  4  2 ),    (4.12)
            ( 1  2  1 )

thus obtaining Gaussian blurring, so called since the matrix approximates a two-dimensional Gaussian. The method GaussianBlur3x3 carries out such blurring, using the matrix defined in Eq. (4.12).
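The following sketch implements the convolution in Eq. (4.10) for a grayscale image stored as a two-dimensional array. It is an illustration only, not the Convolve method of the ImageProcessor class (which operates on a locked bitmap); boundary pixels, for which the mask would extend outside the image, are left unchanged, as described in the text.

// A sketch of Eq. (4.10): convolution of a grayscale array with an odd-sized mask.
private int[,] ConvolveGrayscale(int[,] image, double[,] mask)
{
    int height = image.GetLength(0);
    int width = image.GetLength(1);
    int n = mask.GetLength(0);        // mask size N (assumed odd, N >= 3)
    int nu = (n - 1) / 2;
    int[,] result = (int[,])image.Clone();   // boundary pixels keep their original values
    for (int i = nu; i < height - nu; i++)
    {
        for (int j = nu; j < width - nu; j++)
        {
            double sum = 0;
            for (int k = 0; k < n; k++)
            {
                for (int m = 0; m < n; m++)
                {
                    sum += mask[k, m] * image[i - nu + k, j - nu + m];
                }
            }
            int value = (int)Math.Round(sum);
            if (value < 0) { value = 0; } else if (value > 255) { value = 255; }
            result[i, j] = value;
        }
    }
    return result;
}

With this sketch, box blurring as in Eq. (4.11) simply corresponds to passing a 3 × 3 mask in which every element equals 1/9.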
Sharpening

For any γ > 0, the mask

     ( −γ/8   −γ/8   −γ/8 )
Cs = ( −γ/8   1 + γ  −γ/8 )    (4.13)
     ( −γ/8   −γ/8   −γ/8 )

results in a sharpening of the image. The parameter γ is here referred to as the sharpening factor. Sharpening using a 3 × 3 mask is implemented in the method Sharpen3x3, which takes the sharpening factor as input. An example is shown in Fig. 4.3.

Figure 4.3: An example of sharpening, using the convolution mask Cs. Photo by the author.

4.3.5 Obtaining histograms

The ImageProcessor class contains a method for obtaining histograms, namely GenerateHistogram, which takes a color channel as input (represented as a ColorChannel enum object, with the possible values Red, Green, Blue, and Gray). Note that the method does not carry out grayscale conversion. Thus, in order to obtain the gray histogram, one must first convert the image to grayscale, then apply the GenerateHistogram method. The method will then pick an arbitrary channel (in this case, blue) and generate the histogram.

Generating the histogram for any color channel is straightforward, except for one thing: if the histogram is to be generated using a parallel for-loop, one must be careful when incrementing the contents of the histogram bins. This is so, since the standard ++ operator in C# is not thread-safe: whenever this operator is called, the value contained at the memory location in question is loaded, then incremented, and then the new value (i.e. the old value plus one) is assigned to the memory location. However, since the increment takes some time, it is perfectly possible for a situation to occur where the same value is loaded by two different threads, incremented (in each thread), and then assigned again, so that the total increment is one, not two. In order to avoid such errors, one can use the lock keyword in C#. However, for simple operations, such as incrementing, there is a faster way, namely to use the Increment method in the static Interlocked class. This method makes sure to carry out what is known as an atomic (thread-safe) increment, meaning that no increments are omitted. The use of this method is illustrated in Listing 4.5. Note that, here, the increment is carried out on the elements of an array rather than just a single integer variable. This is allowed for arrays (of fixed length), but not for a generic List (e.g. List<int>). Thus, as shown in the listing, the counting of pixel values is carried out in an array of length 256 and, at the very end, this array is converted to a list, which is then assigned to the image histogram.

Listing 4.5: The GenerateHistogram method, illustrating the use of the Interlocked class for thread-safe increments.

public ImageHistogram GenerateHistogram(ColorChannel colorChannel)
{
    ImageHistogram imageHistogram = new ImageHistogram();
    int[] pixelNumberArray = new int[256];
    unsafe
    {
        int bytesPerPixel = Bitmap.GetPixelFormatSize(bitmap.PixelFormat) / 8;
        int widthInBytes = bitmapData.Width * bytesPerPixel;
        byte* PtrFirstPixel = (byte*)bitmapData.Scan0;
        Parallel.For(0, bitmapData.Height, y =>
        {
            byte* currentLine = PtrFirstPixel + (y * bitmapData.Stride);
            for (int x = 0; x < widthInBytes; x = x + bytesPerPixel)
            {
                byte pixelValue = 0;
                if (colorChannel == ColorChannel.Red) { pixelValue = currentLine[x + 2]; }
                else if (colorChannel == ColorChannel.Green) { pixelValue = currentLine[x + 1]; }
                else { pixelValue = currentLine[x]; }
                Interlocked.Increment(ref pixelNumberArray[(int)pixelValue]);
            }
        });
    }
    imageHistogram.PixelNumberList = pixelNumberArray.ToList();
    return imageHistogram;
}
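To see why the atomic increment matters, the following small demonstration (not part of the ImageProcessing library) compares a plain increment with Interlocked.Increment inside a parallel loop; with the plain increment, some increments are typically lost.

// Demonstration only: lost updates with ++ versus atomic increments.
private void DemonstrateInterlocked()
{
    int unsafeCounter = 0;
    int safeCounter = 0;
    Parallel.For(0, 1000000, i =>
    {
        unsafeCounter++;                         // not thread-safe: increments may be lost
        Interlocked.Increment(ref safeCounter);  // atomic (thread-safe) increment
    });
    Console.WriteLine("Plain increment: " + unsafeCounter);   // usually smaller than 1000000
    Console.WriteLine("Interlocked increment: " + safeCounter); // always 1000000
}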
4.3.6 Histogram manipulation

Images taken in, for example, adverse lighting conditions are often too bright or too dark relative to an image taken under perfect conditions. Consider the image in the left panel of Fig. 4.4. Here, while reasonably sharp, the image still appears somewhat hazy and rather pale. The histogram, shown below the image, confirms this: the image contains only grayscale values in the range 68 to 251. In order to improve the contrast, one can of course apply the methods described in Subsect. 4.3.1. However, those methods do not provide prescriptions for suitable parameter settings. Thus, there is a risk that one might increase (or decrease) the contrast too much.

Figure 4.4: Histogram stretching. The panels on the left show an image with poor contrast, along with its histogram. The panels on the right show the image after stretching with p = 0.025, as well as the resulting histogram.

There are several methods for automatically changing the contrast and brightness in an image, in a way that will give good (or at least acceptable) results over a large set of lighting conditions. These methods are generally applied to grayscale images. One such method is histogram stretching. In this method, one first generates the (grayscale) histogram H(j), j = 0, ..., 255. Then, normalization is applied, resulting in the normalized histogram

Hn(j) = H(j) / Σ_{j'=0}^{255} H(j'),   j = 0, ..., 255.    (4.14)

Finally, the cumulative histogram is generated according to Hc(0) = Hn(0) and

Hc(j) = Hc(j − 1) + Hn(j),   j = 1, ..., 255.    (4.15)

Thus, for any j, Hc(j) determines the fraction of pixels having gray level j or darker. Next, one identifies the bin index jlow corresponding to a given fraction p of the total number of pixels, as well as the bin index jhigh corresponding to the fraction 1 − p. Thus, jlow is the smallest j such that Hc(j) > p, while jhigh is the largest j such that Hc(j) < 1 − p. Then, any pixel with gray level below jlow is set to black (i.e. gray level 0) and any pixel with gray level above jhigh is set to white (gray level 255). For pixels with gray levels in the range [jlow, jhigh], new gray levels are generated as

Γnew = 255 (Γ − jlow) / (jhigh − jlow).    (4.16)

Thus, after this stretching, the histogram will cover the entire range from 0 to 255. An example is shown in the right-hand part of Fig. 4.4, where the upper panel shows the image after stretching with p = 0.025 and the lower panel shows the corresponding histogram. In this particular case, jlow was found to be 108, and jhigh was found to be 197.

The reason for using a value of p > 0 (but smaller than 0.5) is that, even for an image with poor contrast, there might be a few pixels with gray level 0 and a few pixels with gray level 255, in which case the stretching would have no effect, as can easily be seen from Eq. (4.16). By choosing a small positive value of p, as in the example above, the stretching will produce a non-trivial result. Typical values of p fall in the range from 0.01 to 0.05.

The method just described stretches the histogram, but does not change it in any other way. An alternative approach is to apply histogram equalization, in which one attempts to make the histogram as flat as possible, i.e. with a roughly equal number of pixels in each bin. This method will not be described in detail here, however.
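The following sketch carries out histogram stretching, Eqs. (4.14)-(4.16), for a grayscale image stored as a two-dimensional array; it is an illustration only, and not the StretchHistogram method of the ImageProcessor class, which operates on a locked bitmap.

// A sketch of histogram stretching with cut-off fraction p (typically 0.01-0.05).
private int[,] StretchGrayscaleHistogram(int[,] grayImage, double p)
{
    int height = grayImage.GetLength(0);
    int width = grayImage.GetLength(1);
    // Normalized cumulative histogram, Eqs. (4.14)-(4.15):
    double[] cumulative = new double[256];
    double[] histogram = new double[256];
    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++) { histogram[grayImage[i, j]] += 1; }
    }
    double total = (double)height * width;
    double runningSum = 0;
    for (int j = 0; j < 256; j++) { runningSum += histogram[j] / total; cumulative[j] = runningSum; }
    // jLow: smallest j with Hc(j) > p; jHigh: largest j with Hc(j) < 1 - p:
    int jLow = 0;
    while (jLow < 255 && cumulative[jLow] <= p) { jLow++; }
    int jHigh = 255;
    while (jHigh > 0 && cumulative[jHigh] >= 1 - p) { jHigh--; }
    if (jHigh <= jLow) { return (int[,])grayImage.Clone(); }   // degenerate case: no stretching
    // Remap gray levels according to Eq. (4.16):
    int[,] result = new int[height, width];
    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            int gray = grayImage[i, j];
            if (gray < jLow) { result[i, j] = 0; }
            else if (gray > jHigh) { result[i, j] = 255; }
            else { result[i, j] = (int)Math.Round(255.0 * (gray - jLow) / (jHigh - jLow)); }
        }
    }
    return result;
}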
4.3.7 Edge detection

In edge detection, the aim is to locate sharp changes in intensity (i.e. edges) that usually define the boundaries of objects. Edge detection is thus an important step in (some methods for) object detection. It is also biologically motivated: evidence from neurophysiology indicates that sharp edges play a central role in object detection in animals. There are many edge detection methods, and they typically make use of convolutions of the kind described above. However, as will be shown below, one can sometimes summarize the results of repeated convolution by carrying out a single so-called pseudo-convolution over the image.

One of the most successful edge detection methods is the Canny edge detector [4]. In addition to carrying out some convolutions, this method also uses a few pre- and post-processing steps. For example, in Canny edge detection, one blurs the image before carrying out edge detection, in order to remove noise. Here, only the central component of the Canny edge detector will be studied, namely the convolutions. Consider the two convolution masks

     ( −1  0  1 )
C1 = ( −2  0  2 )    (4.17)
     ( −1  0  1 )

and

     (  1   2   1 )
C2 = (  0   0   0 ).    (4.18)
     ( −1  −2  −1 )

These two masks detect horizontal and vertical edges, respectively. Together (and sometimes augmented by two additional masks for detecting diagonal edges), the masks define the so-called Sobel operator for edge detection. By convolving a given (normally grayscale) image Γ using C1 and then convolving (the original) image using C2, one obtains two images Γx and Γy, whose pixel values can then be combined to form an edge image Γe as

Γe(i, j) = sqrt( Γx(i, j)² + Γy(i, j)² ).    (4.19)

However, since this computation can be a bit time-consuming, one often uses the simpler procedure

Γe(i, j) = |Γx(i, j)| + |Γy(i, j)|    (4.20)

instead. In that case, one can generate the edge image by a single pass (a pseudo-convolution) through the original image, setting the value of a given pixel as

Γe(i, j) ← | (Γ(i − 1, j − 1) + 2Γ(i, j − 1) + Γ(i + 1, j − 1)) − (Γ(i − 1, j + 1) + 2Γ(i, j + 1) + Γ(i + 1, j + 1)) |
         + | (Γ(i + 1, j − 1) + 2Γ(i + 1, j) + Γ(i + 1, j + 1)) − (Γ(i − 1, j − 1) + 2Γ(i − 1, j) + Γ(i − 1, j + 1)) |.    (4.21)

4.3.8 Integral image

The integral image (or summed area table) concept is useful in cases where one needs to form the sum (or average) over many regions in an image. Once an integral image has been formed, the sum of the pixel values within any given rectangular region of the image can be obtained with one addition and two subtractions.

Consider, for simplicity, the case of a binary image, in which pixels P(i, j) either take the value 0 (black) or the value 1 (white), as illustrated in Fig. 4.5. In such images, a white pixel is also called a foreground pixel, whereas a black pixel is referred to as a background pixel. The integral image I(i, j) is defined as the sum of all pixels above and to the left of (i, j). Thus,

I(i, j) = Σ_{i' ≤ i, j' ≤ j} P(i', j').    (4.22)

The integral image can be formed in a single pass through the image, using the difference equation

I(i, j) = P(i, j) + I(i − 1, j) + I(i, j − 1) − I(i − 1, j − 1).    (4.23)

Note that, in the right-hand side of this equation, I is set to zero in case either (or both) indices are negative. Once the integral image has been obtained, the sum of pixels in a given rectangular region can be obtained as

Σ_{i0 < i ≤ i1, j0 < j ≤ j1} P(i, j) = I(i0, j0) + I(i1, j1) − I(i1, j0) − I(i0, j1).    (4.24)

Obviously, if all that is needed is a sum of pixels in one or a few regions, the computational effort needed to compute the integral image may be prohibitive. However, in cases where pixel sums (or averages) are needed for many regions in the image, as for example in some face detection algorithms (such as the Viola-Jones algorithm; see Subsect. 4.4.3 below), the integral image is a rapid way of obtaining the required information.

Figure 4.5: The left panel shows a small black-and-white image with 5 × 5 pixels. The corresponding integral image is shown in the middle panel. For the panel on the right, the sum of the pixel values (7) in the red rectangle can be obtained using Eq. (4.24).

As a specific example, consider the right panel of Fig. 4.5. Using Eq. (4.24), the sum of the pixel values in the red rectangle can be computed as

Σ_{0 < i ≤ 3, 0 < j ≤ 3} P(i, j) = I(0, 0) + I(3, 3) − I(3, 0) − I(0, 3) = 0 + 11 − 2 − 2 = 7.    (4.25)
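The sketch below computes the integral image of a binary array using the recurrence in Eq. (4.23), together with a helper implementing the rectangle sum in Eq. (4.24); it is an illustration only, not the implementation used in the ImageProcessing library.

// A sketch of integral image computation, Eq. (4.23).
private int[,] ComputeIntegralImage(int[,] image)
{
    int height = image.GetLength(0);
    int width = image.GetLength(1);
    int[,] integral = new int[height, width];
    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            int above = (i > 0) ? integral[i - 1, j] : 0;        // I is zero for negative indices
            int left = (j > 0) ? integral[i, j - 1] : 0;
            int aboveLeft = (i > 0 && j > 0) ? integral[i - 1, j - 1] : 0;
            integral[i, j] = image[i, j] + above + left - aboveLeft;
        }
    }
    return integral;
}

// Sum of pixel values over the rectangle i0 < i <= i1, j0 < j <= j1, as in Eq. (4.24).
private int RectangleSum(int[,] integral, int i0, int j0, int i1, int j1)
{
    return integral[i0, j0] + integral[i1, j1] - integral[i1, j0] - integral[i0, j1];
}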
4.3.9 Connected component extraction

In many image processing tasks, e.g. object detection, text interpretation (for example, reading zip codes) etc., it is often necessary to find groups of pixels that are connected to each other (forming, for example, a face or a letter), using some measure of connectivity. In image processing, such groups of pixels are referred to as connected components. The concept of connected components is most easily illustrated for the case of binary images, which is the only case that will be considered here, even though the process can be generalized to grayscale or even color images.

The measure of connectivity is typically taken as either 4-connectivity or 8-connectivity. In the case of 4-connectivity, all foreground (white) pixels P(i, j) are compared with the neighbors P(i − 1, j), P(i, j + 1), P(i + 1, j), and P(i, j − 1). If any of the neighbors also is a foreground pixel, that pixel and P(i, j) belong to the same connected component. In the case of 8-connectivity, all foreground (white) pixels P(i, j) are compared with the neighbors used in the 4-connectivity case, but also the neighbors P(i − 1, j − 1), P(i + 1, j + 1), P(i − 1, j + 1), and P(i + 1, j − 1).

There are several algorithms for finding the connected components in a given (binarized) image, the details of which will not be given here, however. Once the connected components have been found, one may apply additional operators. For example, in order to find the dominant object in an image (for example, a face), one may wish to remove all connected components except the largest one. Obviously, such a simple procedure will not work under all circumstances; if the image contains a (bright) object that is larger than the face, the result may be an incorrect identification. Nevertheless, connected component extraction is an important first step in many object detection tasks.
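As one simple example of such an algorithm (an illustration only, not necessarily the algorithm used in the ImageProcessing library), the sketch below labels the connected components of a binary image using 4-connectivity and a breadth-first search.

// A sketch of connected component labeling (4-connectivity). Each foreground pixel
// receives a component label (1, 2, ...) in the returned array; background pixels keep 0.
private int[,] LabelConnectedComponents(int[,] binaryImage)
{
    int height = binaryImage.GetLength(0);
    int width = binaryImage.GetLength(1);
    int[,] labels = new int[height, width];
    int currentLabel = 0;
    int[] di = { -1, 1, 0, 0 };
    int[] dj = { 0, 0, -1, 1 };
    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            if (binaryImage[i, j] != 1 || labels[i, j] != 0) { continue; }
            currentLabel++;
            Queue<Tuple<int, int>> queue = new Queue<Tuple<int, int>>();
            labels[i, j] = currentLabel;
            queue.Enqueue(Tuple.Create(i, j));
            while (queue.Count > 0)
            {
                Tuple<int, int> pixel = queue.Dequeue();
                for (int n = 0; n < 4; n++)
                {
                    int ni = pixel.Item1 + di[n];
                    int nj = pixel.Item2 + dj[n];
                    if (ni < 0 || ni >= height || nj < 0 || nj >= width) { continue; }
                    if (binaryImage[ni, nj] == 1 && labels[ni, nj] == 0)
                    {
                        labels[ni, nj] = currentLabel;
                        queue.Enqueue(Tuple.Create(ni, nj));
                    }
                }
            }
        }
    }
    return labels;
}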
An example is shown in Fig. 4.6. Here, the objective was to identify the pixels constituting the blue pot shown in the upper left panel. The image was first inverted (since the pot is rather dark), as shown in the upper middle panel, and was then converted to grayscale. Next, the image was binarized (with a threshold of 150, in this case), resulting in the image shown in the upper right panel. Then, the connected components were extracted. In the final step, the results of which are shown in the lower left panel, only the two largest connected components have been kept. When the connected components have been extracted, the pixels belonging to a given connected component are labeled (even though this is not visible in the figure) using an integer, e.g. 1 for the pot and 2 for the island, in this case, assuming that the labels have been sorted according to the size (number of pixels) of the connected components. In the lower middle panel, the pixels of the connected component labeled 1 (the pot) are shown, whereas the lower right panel shows the pixels of connected component 2 (the island). Once the connected components are available, other techniques, such as matching the pixels in the connected components to a pot-shaped template, can be used for determining which of the two connected components represents the pot.

Figure 4.6: An example of object detection using connected components. The upper panels show the preprocessing steps, resulting in a binarized image. After removing all but the two largest connected components, the image in the lower left panel is obtained. The lower middle and lower right panels show the connected components labeled 1 and 2, respectively. Photo by the author.

4.3.10 Morphological image processing

In morphological image processing, a particular shape, referred to as a structuring element², is passed over an image such that, at each step, the value of a given pixel (relative to the position of the structuring element) is changed if certain conditions are met. Typically, morphological image processing is applied to binary (black-and-white) images. Even though generalizations to grayscale and color images exist, here only binary images will be considered.

² Note that a structuring element need not be a square; it can take any shape.

In general, a structuring element consists of pixels taking either the value 1 (white) or the value 0 (black). Consider Fig. 4.7. The structuring element is shown to the left of the image. Now, when the structuring element is placed at a given position over the image, one can compare the pixel values of the structuring element to the pixel values of the part of the underlying image covered by the structuring element. Thus, considering the structuring element as a whole, one of three things can happen, illustrated in the figure using the three colors green, yellow, and red. The structuring element may (i) completely match the part of the image that it covers, that is, every pixel in the structuring element matches its corresponding image pixel (green); (ii) partially match the covered part of the image (yellow); or (iii) not match the covered part at all (red).

As the structuring element is passed over an image there will, in general, be some positions where it matches, some where it partially matches, and some where it does not match at all. Depending on which of those situations occurs for a given position of the structuring element, some action (exemplified below) is applied to an image pixel at a given position relative to the structuring element (indicated by a ring in Fig.
4.7), referred to as the origin of the structuring element. In the case of a symmetric structuring element, that position is often, but not always, the pixel under the center of the structuring element.

Figure 4.7: An example of a structuring element, shown in the left part of the figure. The origin of the structuring element is at its center. The right part of the figure shows three different cases: (i) one (green) in which the structuring element completely matches the foreground (white) pixels, (ii) one (yellow) in which there is a partial match, and (iii) one in which there is no match.

Erosion

In erosion, for any (i, j) for which the structuring element completely matches the covered part of the image, the origin is colored white. If the structuring element does not match completely, the origin is colored black. As the name implies, erosion tends to chip away pixels at the inner and outer boundaries in regions of foreground pixels, resulting in larger gaps between regions as well as removal of small regions. An example of erosion is shown in Fig. 4.8. Here, the structuring element in the left panel has been applied to the image in the center panel, resulting in the eroded image shown in the right panel.

Figure 4.8: The right panel shows the results of carrying out erosion on the image shown in the middle panel, using the structuring element shown in the left panel. The ring in the center of the structuring element indicates the pixel currently under study.

Dilation

In dilation, for any (i, j) such that the structuring element partially or completely matches the image, the origin is set to white. If the structuring element does not match at all, the origin is set to black. Dilation can be seen as the inverse of erosion, as it tends to grow (and sometimes join) foreground regions of the image. An example of dilation is shown in Fig. 4.9.

Figure 4.9: The right panel shows the results of carrying out dilation on the image shown in the middle panel, using the structuring element shown in the left panel.

Other operators

In addition to erosion and dilation, there are also many other morphological operators, for example opening (erosion followed by dilation) and closing (dilation followed by erosion). Another common operation is hit-and-miss. In this case, one uses a more complex structuring element: in the description above, the relevant parts of the structuring element were the white pixels. For the hit-and-miss transform, one needs structuring elements with pixels taking any of three different values, namely 1 (white, foreground pixel), 0 (black, background pixel), and x (ignored pixel). Thus, to be strict, the non-white pixels in the structuring elements for erosion and dilation above should really have the value x rather than 0, but for simplicity they are often drawn as in the figures above. In any case, in erosion and dilation, the non-white pixels are ignored. In the hit-and-miss transform, by contrast, both the foreground pixels (1s) and the background pixels (0s) must match in order for the pixel under the origin (see above) of the structuring element to be set to the foreground color (white). Otherwise, it is set to the background color. The hit-and-miss transform is typically used for finding corners of foreground shapes. Finally, the thinning operator, which is related to the hit-and-miss operator, is used for reducing the width of lines and edges down to a single pixel (at which point the thinning operation will no longer change the image).
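As a concrete illustration of erosion (a sketch only, not the library implementation), the method below uses a square structuring element consisting entirely of foreground pixels, with its origin at the center; the size is assumed to be odd.

// A sketch of binary erosion with an all-white, square structuring element.
private int[,] Erode(int[,] binaryImage, int structuringElementSize)
{
    int height = binaryImage.GetLength(0);
    int width = binaryImage.GetLength(1);
    int halfSize = (structuringElementSize - 1) / 2;   // size assumed odd
    int[,] result = new int[height, width];
    for (int i = halfSize; i < height - halfSize; i++)
    {
        for (int j = halfSize; j < width - halfSize; j++)
        {
            bool completeMatch = true;
            for (int k = -halfSize; k <= halfSize && completeMatch; k++)
            {
                for (int m = -halfSize; m <= halfSize && completeMatch; m++)
                {
                    if (binaryImage[i + k, j + m] == 0) { completeMatch = false; }
                }
            }
            // The origin is set to white only on a complete match:
            result[i, j] = completeMatch ? 1 : 0;
        }
    }
    return result;
}

A corresponding dilation sketch would instead set the origin to white as soon as at least one covered image pixel is a foreground pixel, i.e. on a partial or complete match.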
4.4 Advanced image processing

4.4.1 Adaptive thresholding

Thresholding is the process of reducing the number of colors used in an image. A very important special case, which will be considered from now on, is binarization, in which a (grayscale) image is converted into a two-color (black and white) image, as discussed in Subsect. 4.3.3 above. However, in the common case in which the lighting varies over an image, using a global threshold, as in Subsect. 4.3.3, may not produce very good results, as illustrated in the left and middle panels of Fig. 4.10. Instead, one must use some form of adaptive thresholding, in which the binarization threshold varies over the image.

Figure 4.10: Left panel: A text image obtained by holding a single sheet of paper in front of a web camera; middle panel: The result of binarization, using the (subjectively chosen) best possible threshold; right panel: The result of adaptive thresholding using Sauvola's method with j = 7 and k = 0.23.

An important application, in the case of IPAs, is the problem of reading text in a (low-quality) image. For example, one can consider an IPA whose task it is to help a visually impaired person to read a document held in front of a (web) camera. Due to variations in lighting, as well as the fact that light generally shines through a single sheet of paper, the quality of the resulting image is often quite low, as can be seen in the left panel of Fig. 4.10.

Many methods have been suggested for adaptive thresholding of (text) images. Here, only two such methods will be introduced, namely Niblack's method [11] and Sauvola's method [12]. In Niblack's method, one measures the mean m and standard deviation σ over the area (j × j pixels, where j is an odd integer larger than 1) surrounding the pixel under consideration. Then, the binarization threshold Tn (for that pixel) is set at k standard deviations above the mean:

Tn = m + kσ.    (4.26)

This procedure is repeated for all pixels in the image. In Sauvola's method, the local binarization threshold Ts is instead computed as

Ts = m [ 1 + k (σ/R − 1) ],    (4.27)

where k is a parameter, and R is the maximum value of the standard deviation over all of the j × j areas considered. Typical values are j = 9 to 21 and k = 0.2 to 0.5. Sauvola's method generally produces good results, even though there are other methods that outperform it slightly (see e.g. [20]). The right panel of Fig. 4.10 shows the results obtained with Sauvola's method, with j = 7 and k = 0.23. The apparent frame around the image is caused by the fact that the j × j matrix cannot be applied at the edges. Of course, one can easily solve that problem by padding the image with white pixels, but this has not been done here.
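The sketch below applies Sauvola's method, Eq. (4.27), to a grayscale image stored as a two-dimensional array. It computes the local mean and standard deviation naively over each j × j window; in practice, one would typically speed this up, for example using integral images (Subsect. 4.3.8). Border pixels, where the window does not fit, are simply left white here, which mimics the frame visible in Fig. 4.10.

// A sketch of Sauvola's method. windowSize corresponds to j; k is the Sauvola parameter.
private int[,] SauvolaBinarize(int[,] grayImage, int windowSize, double k)
{
    int height = grayImage.GetLength(0);
    int width = grayImage.GetLength(1);
    int half = (windowSize - 1) / 2;               // windowSize assumed odd and > 1
    double[,] mean = new double[height, width];
    double[,] stdDev = new double[height, width];
    double maxStdDev = 0;                          // R in Eq. (4.27)
    for (int i = half; i < height - half; i++)
    {
        for (int j = half; j < width - half; j++)
        {
            double sum = 0, sumOfSquares = 0;
            for (int a = -half; a <= half; a++)
            {
                for (int b = -half; b <= half; b++)
                {
                    double value = grayImage[i + a, j + b];
                    sum += value;
                    sumOfSquares += value * value;
                }
            }
            double count = windowSize * windowSize;
            mean[i, j] = sum / count;
            stdDev[i, j] = Math.Sqrt(Math.Max(0, sumOfSquares / count - mean[i, j] * mean[i, j]));
            if (stdDev[i, j] > maxStdDev) { maxStdDev = stdDev[i, j]; }
        }
    }
    if (maxStdDev == 0) { maxStdDev = 1; }         // guard against a completely flat image
    int[,] result = new int[height, width];
    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            double threshold = mean[i, j] * (1 + k * (stdDev[i, j] / maxStdDev - 1));  // Eq. (4.27)
            result[i, j] = (grayImage[i, j] < threshold) ? 0 : 255;   // below threshold: black
        }
    }
    return result;
}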
4.4.2 Motion detection

Many applications involving video streams require detection of motion. For example, an IPA may be required to determine whether or not a person has just sat down in front of the IPA's camera(s) and then detect and recognize the user's gestures. In its simplest form, motion detection consists of comparing a camera image at a given time with an image taken earlier. Consider, for simplicity, a grayscale image, whose gray levels at time t will be denoted Γ(i, j; t). One can then compare this image with an earlier image, with gray levels Γ(i, j; t − 1), the rationale being that those pixels that differ will belong to moving objects. Introducing a threshold T for the minimum required difference, one can determine which pixels fulfil the inequality

|Γ(i, j; t) − Γ(i, j; t − 1)| > T    (4.28)

and then, for example, set those pixels to white, and all other pixels to black. However, this simple method will typically be quite brittle, as even in a supposedly static scene, there are almost always small brightness variations, some of which typically exceed the threshold T, leading to incorrect detections. The reverse problem appears as well: a person who sits absolutely still in the camera's field of view will fade to invisibility (unless, of course, the motion detection method is combined with other approaches, e.g. face detection; see below).

Background subtraction

A common special case, particularly relevant for IPAs, is background subtraction, in which one assumes the existence of an essentially fixed background. Anything that causes the view to differ from the background (such as, for example, a person moving in the camera's field of view) will then constitute the foreground. In such situations, the problem of motion detection is often referred to as background subtraction. Background subtraction can be carried out both in color images and grayscale images. Here, only grayscale background subtraction will be considered. The (gray) intensity of pixels belonging to the background will be denoted B(i, j). Pixels that do not belong to the background are, per definition, foreground pixels.

A simple approach to background subtraction, based on frame differencing, is to start from an image which is known to represent only the background. In the case of an IPA, one may use the scene visible to the agent before any person sits down in front of it. The inequality

|Γ(i, j; t) − B(i, j)| > T    (4.29)

will then find the pixels whose gray level differs from their background values and which can thus be taken to represent the foreground. However, this naïve approach does not, in general, work very well, partly because the background pixel intensity will never be completely constant, so that, even if there is no foreground object, some pixels will differ from their supposed background values by an amount that exceeds the threshold T (unless a very high value is used for T, but in that case one risks being unable to detect actual foreground objects!). Another serious problem with this approach is that, over the course of a day, the light level of the background will change, so that background pixels will gradually drift into the foreground. This can of course happen instantaneously as well, if someone turns on a light, for example. The problem can be somewhat reduced by forming the background image as an average over a number of images (again without any foreground object present in the scene), but even so, the method is rather error-prone.

A more robust approach is to make use of exponential Gaussian averaging, in which one maintains a probability density function for the intensity of each pixel. Here, a pixel is considered as foreground only if its intensity differs from the (average) background intensity by a certain number of standard deviations. Let µ(i, j; t) denote the average intensity of pixel (i, j), and σ²(i, j; t) its variance. In order to initialize this method, one would set the average at t = 0 as the current gray level of an image (containing background only), i.e.

µ(i, j; 0) = Γ(i, j; 0).    (4.30)

The initial variance value can be set, for example, as the variance computed using the pixels adjacent to (i, j). Then, the average and variance can be updated using a running (exponential) average

µ(i, j; t) = ρΓ(i, j; t) + (1 − ρ)µ(i, j; t − 1)    (4.31)

and

σ²(i, j; t) = ρδ(i, j)² + (1 − ρ)σ²(i, j; t − 1),    (4.32)

where ρ is the exponential averaging parameter that takes values in the open range ]0, 1[ and

δ(i, j) = |Γ(i, j; t) − µ(i, j; t)|.    (4.33)

Using these equations, one can then compare the pixel intensity Γ(i, j; t) with the average µ(i, j; t) and set a pixel as foreground if

|Γ(i, j; t) − µ(i, j; t)| > ασ(i, j; t).    (4.34)

That is, a pixel is considered to be in the foreground if it differs from its (running) average by more than α standard deviations. An example is shown in Fig. 4.11.

Figure 4.11: An example of background subtraction. Left panel: Snapshot from the camera stream; right panel: The result of background subtraction, just after the user completed a gesture (raising the right hand). Foreground pixels are shown in white. Here, the background was subtracted using exponential Gaussian averaging, with ρ = 0.025 and α = 1.3.
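The following sketch carries out one update step of this procedure for a grayscale frame stored as a two-dimensional array. It assumes that the fields mu and sigmaSquared (of type double[,]) have been initialized from a background-only frame, as described above; those field names, and the choice to update the model before classifying each pixel, are illustrative assumptions.

// A sketch of exponential Gaussian averaging, Eqs. (4.30)-(4.34). Returns a binary
// foreground mask (1 = foreground). mu and sigmaSquared are assumed class fields.
private int[,] UpdateAndClassify(int[,] grayImage, double rho, double alpha)
{
    int height = grayImage.GetLength(0);
    int width = grayImage.GetLength(1);
    int[,] foregroundMask = new int[height, width];
    for (int i = 0; i < height; i++)
    {
        for (int j = 0; j < width; j++)
        {
            double newMu = rho * grayImage[i, j] + (1 - rho) * mu[i, j];                    // Eq. (4.31)
            double delta = Math.Abs(grayImage[i, j] - newMu);                               // Eq. (4.33)
            double newSigmaSquared = rho * delta * delta + (1 - rho) * sigmaSquared[i, j];  // Eq. (4.32)
            mu[i, j] = newMu;
            sigmaSquared[i, j] = newSigmaSquared;
            // Eq. (4.34): foreground if the deviation exceeds alpha standard deviations.
            foregroundMask[i, j] = (delta > alpha * Math.Sqrt(newSigmaSquared)) ? 1 : 0;
        }
    }
    return foregroundMask;
}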
Let µ(i, j; t) denote the average intensity of pixel (i, j), and σ2(i, j; t) its variance. In order to initialize this method, one would set the average at t = 0 as the current gray level of an image (containing background only), i.e.

µ(i, j; 0) = Γ(i, j; 0). (4.30)

The initial variance value can be set, for example, as the variance computed using the pixels adjacent to (i, j). Then, the average and variance can be updated using a running (exponential) average

µ(i, j; t) = ρΓ(i, j; t) + (1 − ρ)µ(i, j; t − 1) (4.31)

and

σ2(i, j; t) = ρδ(i, j)2 + (1 − ρ)σ2(i, j; t − 1), (4.32)

where ρ is the exponential averaging parameter that takes values in the open range ]0, 1[ and

δ(i, j) = |Γ(i, j; t) − µ(i, j; t)|. (4.33)

Using these equations, one can then compare the pixel intensity Γ(i, j; t) with the average µ(i, j; t) and set a pixel as foreground if

|Γ(i, j; t) − µ(i, j; t)| > ασ(i, j; t). (4.34)

That is, a pixel is considered to be in the foreground if it differs from its (running) average by more than α standard deviations.

Figure 4.11: An example of background subtraction. Left panel: Snapshot from the camera stream; right panel: The result of background subtraction, just after the user completed a gesture (raising the right hand). Foreground pixels are shown in white. Here, the background was subtracted using exponential Gaussian averaging, with ρ = 0.025 and α = 1.3.

In addition to the method just described, there are also methods (so-called Gaussian mixture models) that make use of multiple Gaussians for each pixel [14]. Background subtraction is also very much an active research field, in which new methods appear continuously. One such method, with excellent performance, is the ViBe method [3]. For a recent review of background subtraction methods see, for example, [13]. In addition to the background subtraction itself, many approaches also involve a degree of postprocessing in order to improve precision, such as connected component extraction following a sequence of morphological operations to remove noise.

Gesture recognition

In addition to mere detection of foreground objects, one may also wish to interpret their movements. For example, an IPA may have gestures as an input modality, and must thus be able to carry out gesture recognition. This is also an active topic for research, but it is beyond the scope of this text. A review can be found in [10]. It should be noted that depth cameras, such as (the sensor used in) Microsoft's Kinect, are becoming more and more common in such applications [17], even though many approaches are based on ordinary camera images.

4.4.3 Face detection and recognition

Face detection and recognition are important processes in many IPAs. These are also large and active research topics, in which new results appear continuously. Here, only a brief introduction will be given, with some references to further reading.

Skin pixel detection

By itself, detection of skin pixels is not sufficient for finding (and tracking) faces. However, in combination with other methods, such as edge detection and connected component extraction, skin pixel detection often plays an important part in face detection. Many methods have been defined for detecting skin pixels based on pixel color.
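Before turning to specific skin-detection rules, the exponential Gaussian averaging just described (Eqs. 4.30-4.34) is summarized in the sketch below; the class layout and names are illustrative and are not taken from the ImageProcessing library.

using System;

// Per-pixel running Gaussian background model.
public class RunningGaussianBackground
{
    private readonly double[,] mu;     // running mean, updated as in Eq. (4.31)
    private readonly double[,] sigma2; // running variance, updated as in Eq. (4.32)
    private readonly double rho;       // exponential averaging parameter in ]0, 1[
    private readonly double alpha;     // foreground threshold, in standard deviations

    public RunningGaussianBackground(double[,] backgroundFrame, double rho = 0.025, double alpha = 1.3)
    {
        this.rho = rho;
        this.alpha = alpha;
        mu = (double[,])backgroundFrame.Clone(); // Eq. (4.30)
        int h = backgroundFrame.GetLength(0), w = backgroundFrame.GetLength(1);
        sigma2 = new double[h, w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                sigma2[y, x] = 100; // some nonzero initial variance (the text suggests using neighboring pixels)
    }

    // Updates the model with a new frame and returns the foreground mask (Eq. 4.34).
    public bool[,] Update(double[,] frame)
    {
        int h = frame.GetLength(0), w = frame.GetLength(1);
        var foreground = new bool[h, w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
            {
                mu[y, x] = rho * frame[y, x] + (1 - rho) * mu[y, x];           // Eq. (4.31)
                double delta = Math.Abs(frame[y, x] - mu[y, x]);               // Eq. (4.33)
                sigma2[y, x] = rho * delta * delta + (1 - rho) * sigma2[y, x]; // Eq. (4.32)
                foreground[y, x] = delta > alpha * Math.Sqrt(sigma2[y, x]);    // Eq. (4.34)
            }
        return foreground;
    }
}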
In the RGB color space, a common, but somewhat inelegant, rule3 for detecting (mainly Caucasian) skin pixels says that a pixel represents skin if all of the following rules are satisfied: (i) R > 95, (ii) G > 40, (iii) B > 20, (iv) max{R, G, B} − min{R, G, B} > 15, (v) |R − G| > 15, (vi) |R − G| < 75, (vii) R > G and, finally, (viii) R > B. However, this set of rules is rather sensitive to lighting conditions. Moreover, skin pixel detection must of course work over a range of different skin colors. It turns out that more robust rules can be found in the YCbCr space (see Subsect. 4.1.1 above), by excluding the luminance (Y) component and instead focusing on the Cb and Cr components. Here, skin pixels typically fulfil 77 ≤ Cb ≤ 127 and 133 ≤ Cr ≤ 173, using the definition of the YCbCr space found in Eq. (4.5).

A more sophisticated approach consists of generating a two-dimensional histogram Hs(Cb, Cr) of skin pixels. In this approach, one would measure the values of Cb and Cr for a large number of known skin pixels, taken from many different images with skin of different color and tone, and with different lighting conditions. For every occurrence of a particular combination (Cb, Cr), one would then increment the contents of the corresponding histogram bin by one.

3 This rule is a slight modification of the rule found in [8].

Using the range [16, 240] for Cb and Cr, defined in Subsect. 4.1.1, this histogram will thus have 225 × 225 bins (elements). Once a sufficient number of skin pixels have been considered, the histogram would then be (linearly) normalized such that the bin contents for the most frequent combination(s) (Cb, Cr) would be set to 1. Thus, for all other combinations, the bin contents would be smaller than 1. For any given pixel, one can then obtain Cb and Cr, and then classify the pixel as skin if the corresponding bin contents exceed a certain threshold T.

An example is shown in Fig. 4.12. Here, an image of a person is shown, along with the results obtained for the three different skin detection methods. For the last method described above, the histograms were obtained by manually labelling thousands of skin pixels for a set of face images. As is evident from the figure, no matter which of the three methods above is used, one can generally achieve quite a high fraction of true positives (also known as sensitivity), but also with a rather high fraction of false positives. That is, pixels that are supposed to be identified as skin are indeed so identified, but many non-skin pixels are erroneously classified as skin pixels as well.

Figure 4.12: Examples of skin detection. Upper left panel: The original picture; upper right panel: Skin detection using the RGB-based method described in the main text; lower left panel: Skin detection using the method based on simple ranges in Cb and Cr; lower right panel: Skin detection based on the method using a two-dimensional distribution of skin pixels, with a threshold T = 0.04. In the three latter images, all non-skin pixels have been set to black color, whereas the identified skin pixels were left unchanged.

Face detection

Note that even with a very sophisticated skin pixel detection, there will always be misclassifications since many objects that are not skin can still fall in a similar color range. For example, light, unpainted wood is often misclassified as skin.
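As a concrete illustration, the sketch below implements the simple Cb/Cr range rule given above. The RGB-to-YCbCr conversion used here is the common ITU-R BT.601 form with Cb and Cr in [16, 240]; the exact constants may differ slightly from those of Eq. (4.5), and the names are illustrative.

public static class SkinDetectionDemo
{
    // Classifies a pixel as skin if its Cb and Cr values fall in the ranges
    // 77 <= Cb <= 127 and 133 <= Cr <= 173.
    public static bool IsSkin(byte r, byte g, byte b)
    {
        double cb = 128.0 + (-37.797 * r - 74.203 * g + 112.0 * b) / 256.0;
        double cr = 128.0 + (112.0 * r - 93.786 * g - 18.214 * b) / 256.0;
        return cb >= 77 && cb <= 127 && cr >= 133 && cr <= 173;
    }
}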
In order to reliably detect a face, one must therefore either combine skin pixel detection with other methods, or even use methods that do not involve specific skin pixel detection. Moreover, the level of sophistication required depends on the application at hand. Thus, for example, it is much more difficult to find an unknown number of faces, with arbitrary orientation, in a general image compared to finding a single face that, in the case of an IPA, most often is seen directly from the front.

For the latter case, a possible approach is to binarize the image based on skin pixels, i.e. setting skin pixels to white and everything else to black, and then extracting the connected components (see Subsect. 4.3.9 above). The face is then generally the largest connected component, provided that the background does not contain too many skin-colored pixels. Of course, this technique can also be combined with background subtraction as discussed above, in order to increase its reliability. It is also possible, though computationally costly, to match the skin pixel regions against a face template, provided that the face is seen (almost) from the front. In this approach, one can begin by detecting the eyes (or, rather, a set of candidate feature pairs of which one hopefully does represent the eyes), thereby obtaining the orientation of the face. One can then match other features, such as the mouth, the eyebrows etc., using the template [16].

Perhaps the most widely used method for face detection, however, is the Viola-Jones face detection algorithm [18]. This detector operates on grayscale images, and considers a large set of simple classifiers (so-called weak classifiers) that, by themselves, are not capable of detecting an entire face, only a part of it. The classifiers are based on features that simply compute the difference between the pixel intensities in two adjacent rectangles, and then compare the result to a threshold parameter. If, and only if, the result is above the threshold, a match is obtained for the feature in question. The sum of pixel intensities can be computed very quickly using the concept of integral images discussed in Subsect. 4.3.8 above, and it is thus both possible and essential (for this method) to consider a large number of weak classifiers. The weak classifiers can then be combined into a strong classifier. The main difficulty lies in the fact that the number of possible weak classifiers is generally very large. In the Viola-Jones algorithm, the training is carried out using the so-called AdaBoost training algorithm, and the resulting strong classifiers contain surprisingly few weak classifiers (at least relative to the total number of available weak classifiers) and are still able to carry out face detection rather reliably. Moreover, it is possible to generate a cascade of strong classifiers, with close to 100% true positive detection rate, and with progressively lower (cumulative) false positive rates. Thus, for example, the first element in the cascade may have a false positive rate of around 50%, so that, while letting almost all actual faces through, it also lets quite a number of non-faces through. The second element of the cascade typically has a smaller false positive rate (30%, say), while still having a true positive rate near 100%. After passing through both classifiers, the cumulative false positive rate will drop to 0.5 × 0.3 = 0.15, and so on.
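To make the notion of a weak classifier concrete, the schematic sketch below evaluates a single two-rectangle feature using an integral image (see Subsect. 4.3.8). It is only an illustration of the idea, not the Viola-Jones implementation, and all names are hypothetical.

public static class WeakClassifierDemo
{
    // Sum of pixel intensities over the rectangle with top-left corner (y0, x0) and
    // bottom-right corner (y1, x1), inclusive. The integral image ii is assumed to
    // satisfy ii[y, x] = sum of all pixels (y', x') with y' <= y and x' <= x.
    private static double RectangleSum(double[,] ii, int y0, int x0, int y1, int x1)
    {
        double a = (y0 > 0 && x0 > 0) ? ii[y0 - 1, x0 - 1] : 0;
        double b = (y0 > 0) ? ii[y0 - 1, x1] : 0;
        double c = (x0 > 0) ? ii[y1, x0 - 1] : 0;
        return ii[y1, x1] - b - c + a;
    }

    // A two-rectangle feature: the intensity sum in one rectangle minus the sum in
    // the adjacent rectangle of equal size, compared to a threshold.
    public static bool TwoRectangleFeature(double[,] ii, int y, int x, int height, int width, double threshold)
    {
        double left = RectangleSum(ii, y, x, y + height - 1, x + width - 1);
        double right = RectangleSum(ii, y, x + width, y + height - 1, x + 2 * width - 1);
        return (left - right) > threshold;
    }
}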
Viola and Jones [18] presented a cascaded detector containing 38 elements and a total of 6060 weak classifiers. This kind of detector cascade generally does very well on faces seen from the front, but is less accurate for, say, faces shown from the side, or with partial occlusion. Face detection is still very much an active field of research, with new methods appearing continuously. Recent progress involves the use of deep learning for face detection; see e.g. [15, 5]. General reviews on face detection can be found, for example, in [22] and [21]. Face recognition After detecting the presence of a face in its field of view, an IPA may also need to carry out face recognition in order to determine to whom the face belongs. A variety of methods have been defined, one of the most commonly used being the eigenface method, in which facial templates are generated from a large set of images. The face of any given person is then represented as a linear combination of the facial templates, with numerical weights for each template. Thus, in this approach face recognition can be reduced to comparing the weights obtained (via, for example, principal component analysis) for a detected face to a database of stored weights for those faces that the system is required to recognize. There are also approaches that make use of artificial neural networks (combined, more recently, with deep learning). Another alternative is to find and use invariant properties of a face (for example, the relative location of salient facial features). One of the difficulties in trying to achieve reliable face recognition is the fact that a person can, of course, present a range of facial expressions. Thus, methods for face recognition often require rather large data sets, with several views of the same person. The details of the many available face recognition methods will not be given here. Reviews that include the methods briefly described above, as well as other methods, can be found in [23, 2]. c 2017, Mattias Wahde, [email protected] 56 CHAPTER 4. COMPUTER VISION Figure 4.13: A screenshot from the ImageProcessing application. Photo by the author. There are also plenty of publicly available face databases, which can be used for training face recognition systems. 4.5 Demonstration applications 4.5.1 The ImageProcessing application The ImageProcessing application demonstrates some important aspects of the ImageProcessingLibrary. The user can load an image, and then apply a sequence of image processing operations. The sequence of images thus generated is stored and shown in a list box, so that the user can return to an earlier image in the sequence. It is also possible to zoom and pan in the displayed image using the mouse wheel (zooming) and mouse movements (panning). A screenshot from this application is shown in Fig. 4.13. Here, the user has loaded an image, and applied a few operations to improve the contrast and sharpness of the image. 4.5.2 The VideoProcessing application This application illustrates the use of the Camera class and the associated user controls. Provided that at least one web camera is attached to the computer, clicking the Start button will start the camera and display the image stream c 2017, Mattias Wahde, [email protected] CHAPTER 4. COMPUTER VISION 57 Figure 4.14: The CameraSetupControl, which allows the user to modify the settings of a (running) camera. in a CameraViewControl. The user can then select the Camera setup tab page to view and change the camera settings, as shown in Fig. 4.14. 
It is often necessary to change the settings, since only default settings are applied when the camera is started, and the suitable settings can vary strongly between cameras. The program also features background subtraction using exponential Gaussian averaging, as described above. c 2017, Mattias Wahde, [email protected] 58 CHAPTER 4. COMPUTER VISION c 2017, Mattias Wahde, [email protected] Chapter 5 Visualization and animation A large part of human communication is non-verbal and involves facial expressions, gestures etc. Similarly, the visual representation of the face (and, possibly, body) of an IPA can also play an important role in human-agent communication. Being met by a friendly and well-animated face (smiling, nodding in understanding etc.) strongly affects a person’s perception of a discussion. While a two-dimensional rendering would certainly be possible, with a three-dimensional rendering one can obtain a much more life-like representation of a face, including variations in lighting as parts of the face move. As will be illustrated below, even very complex objects are usually rendered in the form of (many) triangle shapes. Of course, since the screen is two-dimensional, three-dimensional objects, or rather their constituent triangles, must be projected in two dimensions, something that requires a large number of matrix operations, as does any rotation of a three-dimensional object. There are several software libraries specifically designed to quickly carry out all the necessary operations, processing thousands or even millions of triangles per second, determining the necessary projections and colors for each pixel. Some examples are OpenGL and Direct3D. Here, OpenGL will be used or, more specifically, a C# library called OpenTK that serves as a bridge between C# and OpenGL. This chapter begins with a general description of three-dimensional rendering. Next, the ThreeDimensionalVisualization library, used here for such rendering, is described in some detail. This description is followed by an introduction to the special case of facial visualization and animation, and the chapter is concluded with a brief description of two demonstration applications, one for illustrating various levels of visualization and shading, and one for illustrating the FaceEditor user control. 59 60 CHAPTER 5. VISUALIZATION AND ANIMATION 5.1 Three-dimensional rendering Rendering a three-dimensional (3D) object onto a two-dimensional (2D) screen involves a sequence of matrix operations, involving 4x4 matrices that combine translation and rotation. Here, only a brief introduction will be given regarding these transformations; for an excellent detailed description of the transformations below, see [1]. The first step is to rotate and translate the object to its position in space, i.e. to transform it from model coordinates measured relative to the center of the object to world coordinates, measured relative to the origin of the modeled world. This transformation is handled by the model matrix. Next, the location and orientation of the camera must be accounted for. Clearly, the appearance of an object depends on where it is located relative to the camera viewing the scene. Thus, another transformation is applied, using the view matrix such that the scene will now be in camera coordinates1 . For convenience, the model and view matrices are often combined (multiplied) to form a single modelview matrix. 
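As a small illustration, the snippet below builds a modelview matrix using OpenTK's Matrix4 helper methods. It assumes OpenTK's row-vector convention, in which transforms compose from left to right, and it is not taken from the ThreeDimensionalVisualization library.

using OpenTK;

public static class ModelViewDemo
{
    public static Matrix4 BuildModelView()
    {
        // Model matrix: rotate the object about the y-axis, then place it in the world.
        Matrix4 model = Matrix4.CreateRotationY(MathHelper.PiOver4) *
                        Matrix4.CreateTranslation(0f, 0f, -5f);

        // View matrix: a camera at (0, 2, 5) looking at the origin, with y as the up direction.
        Matrix4 view = Matrix4.LookAt(new Vector3(0f, 2f, 5f), Vector3.Zero, Vector3.UnitY);

        // The combined modelview matrix.
        return model * view;
    }
}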
The final step is to account for the properties of the camera, by carrying out a perspective projection such that, for example, objects along a line parallel to the camera's line-of-sight will appear closer to the center of the view the further away they are from the camera. This final transformation is carried out by the projection matrix.

1 It should be noted that, in OpenGL, the camera is, in fact, always located at the origin, looking down OpenGL's negative z-axis (into the screen, as it were). Thus, in OpenGL, to achieve the effect of placing the camera in a given position, instead of moving the camera, one moves the entire world in the opposite direction.

5.1.1 Triangles and normal vectors

It is common to render three-dimensional objects as a set of triangles. This is so since a triangle, i.e. an object defined by three vertices in space or, equivalently, by a vertex and two (non-collinear) vectors emanating from that vertex, is a planar surface, with a uniquely defined (surface) normal vector that, moreover, is easily obtained by computing the cross product of the two vectors just mentioned. Objects with more than three vertices may be planar, but not necessarily, meaning that the computation of normal vectors for such objects will, in general, be more complex. The normal vectors matter greatly, since they are used in determining the intensity of the light reflected from a given surface and are also needed for determining the shading of the pixels in the triangle (at least if a smooth shading model is used, as discussed below).

Figure 5.1: A triangle in three-dimensional space, along with its normal vector. The enumerated vertices are shown as filled discs.

Fig. 5.1 shows a triangle, defined by three vertices (points) p1, p2, and p3. The three vertices can be used for generating the vectors v21 = p2 − p1 and v31 = p3 − p1 that, together with p1, uniquely define the triangle and its position in space. Given the two vectors, one can form the (normalized) normal vector as

n = (v21 × v31) / |v21 × v31|. (5.1)

Note that this normal vector points towards the reader. One can of course also generate a normal vector pointing in the other direction, by reversing the order of the vectors in the cross product. This matters since, in OpenGL, one can render both sides of any triangle using, for example, different color properties of the two sides, and the normal vector determines which of the two sides of a triangle is visible from a given camera position. In OpenGL a counterclockwise convention is used such that, if the camera looks at a given side of a polygon, the correct normal will be the one obtained by the right-hand rule if the vertices (of a triangle) are traversed in counterclockwise order. Moreover, OpenGL uses a right-handed coordinate system such that x and y are in the plane of the screen, and z points towards the user. In the implementation of the three-dimensional visualization library (see below) the axes are instead such that x and z are in the plane of the screen, and y points away from the user (again forming a right-handed coordinate system), which perhaps is more natural.

5.1.2 Rendering objects

There are many options available when rendering an object. The user can control the color of an object, its lighting and shading, whether or not the object is translucent, etc.
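A small, self-contained sketch of the normal-vector computation in Eq. (5.1), using plain coordinate arrays rather than the library's Vertex3D class:

using System;

public static class NormalVectorDemo
{
    // Computes the normalized triangle normal n = (v21 x v31) / |v21 x v31| (Eq. 5.1).
    // Each point is given as a three-element array {x, y, z}.
    public static double[] TriangleNormal(double[] p1, double[] p2, double[] p3)
    {
        double[] v21 = { p2[0] - p1[0], p2[1] - p1[1], p2[2] - p1[2] };
        double[] v31 = { p3[0] - p1[0], p3[1] - p1[1], p3[2] - p1[2] };

        // Cross product v21 x v31.
        double nx = v21[1] * v31[2] - v21[2] * v31[1];
        double ny = v21[2] * v31[0] - v21[0] * v31[2];
        double nz = v21[0] * v31[1] - v21[1] * v31[0];

        double length = Math.Sqrt(nx * nx + ny * ny + nz * nz);
        return new[] { nx / length, ny / length, nz / length };
    }
}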
VISUALIZATION AND ANIMATION Figure 5.2: Left panel: A triangle with equal colors (yellow) for all three vertices. Right panel: A triangle with different colors (red, green, and blue) for the vertices, resulting in color interpolation over the surface. Note that lighting was not used in either case. The vertices are shown as black discs. The simplest form of rendering is obtained if no light is used. In this case, the colors assigned to the vertices are used for determining the pixel colors of the rendered object. If all vertices have the same color (e.g. yellow) as in the left panel of Fig. 5.2, the object will be uniform in color. If instead the vertices have been assigned different colors, e.g. red, green, and blue, as in the right panel of the figure, OpenGL will interpolate the colors. In most instances of three-dimensional rendering, however, one uses some form of light source2 . With lighting, an object takes on a distinct 3D appearance, even if the object is uniform in color. In order to compute the correct color of a pixel, a lighting model is required. In OpenGL, the lighting model requires three colors: The ambient color, the diffuse color, and the specular color. Simplifying somewhat, the ambient component provides a kind of background lighting, such that an object will not be completely dark even if the light from the light source does not hit it directly, whereas the diffuse component determines the light reflected in all directions from a surface, and thus outlines the shape of a three-dimensional object. Finally, the specular component also handles light reflection but more in the manner of a mirror, giving a certain shininess to the surfaces. In fact, an additional parameter, called shininess is required to determine how shiny the surface is. Moreover, in OpenGL one can define the ambient, diffuse, and specular lightling components both for the light source and for each object independently. Thus, for example, it is possible to bathe a white object in blue light etc. Another important concept is shading, a procedure that determines how 2 OpenGL implementations generally support the use of multiple lights, eight as a minimum. c 2017, Mattias Wahde, [email protected] CHAPTER 5. VISUALIZATION AND ANIMATION 63 Figure 5.3: A schematic illustration of vertex and triangle normals. Here, a part of a 3D object is seen edge-on, with vertices shown as filled discs and triangle sides represented by solid lines connecting the vertices. The solid arrows show the triangle normals used in flat shading, whereas the dashed arrows show the vertex normals used in smooth shading, obtained by averaging the triangle normals over the triangles (in this case, two) connected to a given vertex. Note that, in the case of smooth shading, the normal vector also varies over a triangle, since it is formed by interpolating the three normal vectors (one for each vertex) for the triangle in question, even though that particular aspect is not visible in this edge-on figure. See also Panels (ii) and (iii) in Fig. 5.7. the light intensity varies over a surface. The normal vectors (see above) play an important role in shading. OpenGL defines two standard shading models, namely flat shading and smooth shading. In flat shading, surfaces are rendered with a uniform color, determined by the interaction between the light, the surface material, and the normal vector of the surface in question. 
In smooth shading, by contrast, OpenGL interpolates the normal vectors over the surface, a procedure that requires a normal vector for each vertex rather than the surface (triangle) normal vector. As indicated in Fig. 5.3, the vertex normals are obtained by interpolating the surface normals for all those triangles that are connected to a given vertex; see also the description of the Object3D class below. In the figure, a surface is shown edge-on (for simplicity), along with the triangle and vertex normals. For flat shading the triangle normals are sufficient. However, consider now smooth shading: If one were to use the surface normals, one would run into trouble at the edge connecting several triangles, since one would have multiple different normal vectors to choose from! If instead the vertex normals are used, as indeed they are in smooth shading, one can find an interpolated normal vector at any point on a triangle (thus, effectively, generating a smoothly curved surface rather than a flat one!). The effect is striking, as shown in Fig. 5.7 below. At this point, the reader may wish to study the various rendering options for a 3D object by running the Sphere3D example application, described in Subsect. 5.4.1.

5.2 The ThreeDimensionalVisualization library

5.2.1 The Viewer3D class

The Viewer3D class is a visualizer that handles the visualization and animation of a three-dimensional scene. The objects in a scene, as well as the lights illuminating the scene, are contained in the object scene, of type Scene3D. The visualizer contains event handlers for rotating and zooming a scene, i.e. for moving the camera in response to mouse actions. The event handler for redrawing the view (an event that is triggered whenever the user control's Invalidate method is called) takes the form shown in Listing 5.1.

Listing 5.1: The paint event handler in the Viewer3D class.

private void HandlePaint(object sender, PaintEventArgs e)
{
    GL.Clear(ClearBufferMask.ColorBufferBit | ClearBufferMask.DepthBufferBit);
    GL.MatrixMode(MatrixMode.Modelview);
    GL.LoadMatrix(ref cameraMatrix);
    SetLights();
    if (showOpenGLAxes) { DrawOpenGLAxes(); }
    if (showWorldAxes) { DrawWorldAxes(); }
    RenderObjects();
    SwapBuffers();
}

The first line clears the view, and the following two set the appropriate transformation matrix. Next, the lights are set by calling the SetLights method (also defined in the Viewer3D class; see the source code for details). Then, depending on settings, the viewer can visualize either the axes defined in OpenGL's standard coordinate system (described above), with the x-axis shown in red, the y-axis in green, and the z-axis in blue, or the axes defined in the coordinate system used in the three-dimensional visualization library (also described above, with the same color settings as for the OpenGL axes). The objects are then rendered, by calling the RenderObjects method that, in turn, simply calls the Render method of each object in the scene (see below). Finally, the resulting view is pasted onto the viewing surface by calling the SwapBuffers method.

In a dynamic scene, where some objects change their position or orientation (or both), one must also handle animation.
In the Viewer3D class, animation runs as a separate thread that simply invalidates the scene at regular intervals, thus triggering the Paint event that, in turn, is handled by the paint event handler shown in Listing 5.1. The two methods used for starting and running the animation are shown in Listing 5.2.

Listing 5.2: The two methods for running an animation in the Viewer3D visualizer. The first method sets the frame rate (or, rather, the frame duration), and launches the thread. The second method simply calls the Invalidate method at regular intervals, causing the scene to be redrawn.

public void StartAnimation()
{
    millisecondAnimationSleepInterval = (int)Math.Round(1000 / framesPerSecond);
    animationThread = new Thread(new ThreadStart(() => AnimationLoop()));
    animationRunning = true;
    animationThread.Start();
}

private void AnimationLoop()
{
    while (animationRunning)
    {
        Thread.Sleep(millisecondAnimationSleepInterval);
        Invalidate();
    }
}

There is, of course, a method for stopping the animation as well (not shown).

5.2.2 The Object3D class

A scene (stored in an instance of Scene3D) contains a set of three-dimensional objects (not to be confused with objects in the programming sense!) as well as the light sources illuminating the scene. The library contains class definitions for several types of 3D objects, e.g. a sphere, a (planar) rectangle etc., which are all derived from the base class Object3D. Each instance of this class consists of a set of vertices, as well as a set of triangles (each containing the indices of three vertices) that define the triangles (see above). Moreover, the triangle normal vectors are stored as well. In cases where smooth shading is used (see Subsect. 5.1.2 above) the vertex normals are required instead, and they can be computed by calling the ComputeVertexNormalVectors method.

For most 3D objects, the definition of the triangles is obtained by using the Generate method, which takes as input a list of parameters (of type double) that define the specific details of the 3D object in question, e.g. the radius as well as the number of vertices in the case of a sphere. For all but the simplest objects, this method usually is rather complex. A case in point is the Face class that defines a rotationally symmetric structure, somewhat reminiscent of a face, which typically contains thousands of triangles. The face structure can then be edited in a face editor, as will be discussed below.

A simpler case is the Rectangle3D that, in fact, defines a two-dimensional object (which then can be oriented in any way in three dimensions). It consists of only two triangles, each defined using three vertices. The vertices are ordered in a counterclockwise manner, as shown in Fig. 5.4.

Figure 5.4: The definition of the two triangles used in the Rectangle3D class.

The Generate method for the Rectangle3D class is shown in Listing 5.3.

Listing 5.3: The Generate method of the Rectangle3D class. In this case, the method takes two parameters as input, determining the size of the rectangle. Next, the four vertices are generated, and then the two triangles as well as the triangle and vertex normal vectors.

public override void Generate(List<double> parameterList)
{
    base.Generate(parameterList);
    if (parameterList == null) { return; }
    if (parameterList.Count < 2) { return; }
    sideLength1 = parameterList[0];
    sideLength2 = parameterList[1];
    Vertex3D vertex1 = new Vertex3D(-sideLength1 / 2, -sideLength2 / 2, 0);
    Vertex3D vertex2 = new Vertex3D(sideLength1 / 2, -sideLength2 / 2, 0);
    Vertex3D vertex3 = new Vertex3D(sideLength1 / 2, sideLength2 / 2, 0);
    Vertex3D vertex4 = new Vertex3D(-sideLength1 / 2, sideLength2 / 2, 0);
    vertexList.Add(vertex1);
    vertexList.Add(vertex2);
    vertexList.Add(vertex3);
    vertexList.Add(vertex4);
    TriangleIndices triangleIndices1 = new TriangleIndices(0, 1, 2);
    triangleIndicesList.Add(triangleIndices1);
    TriangleIndices triangleIndices2 = new TriangleIndices(0, 2, 3);
    triangleIndicesList.Add(triangleIndices2);
    GenerateTriangleConnectionLists();
    ComputeTriangleNormalVectors();
    ComputeVertexNormalVectors();
}

The method first checks that it has a sufficient number of parameters, and it then uses the two parameters to set the side lengths of the rectangle. Then the vertices are defined and added to the list of vertices. Next, the two triangles are formed, by specifying the indices of the (three) vertices constituting each triangle. The GenerateTriangleConnectionLists method generates a list that keeps track of the triangles in which each vertex is included. In this particular case, vertices 0 and 2 are included in both triangles, whereas vertex 1 is only included in the first triangle, and vertex 3 only in the second. Next, the normal vectors are computed for each triangle, by simply computing the (normalized) cross product as discussed above; see Eq. (5.1). Finally, the vertex normal vectors are computed, by averaging (and then re-normalizing) the triangle normal vectors of all triangles in which a given vertex is included. The three methods just mentioned are called at the end of the Generate method of any three-dimensional object and once they have been called, all the necessary information is available for both flat and smooth shading. Note that one can, of course, define a rectangle using more than two triangles. In general, three-dimensional objects often consist of hundreds or thousands of triangles.

Listing 5.4: The Render method of the Object3D class.

public void Render()
{
    if (!visible) { return; }
    GL.PushMatrix();
    GL.Translate(position[0], position[2], -position[1]);
    GL.Rotate(rotation[2], new Vector3d(0f, 1f, 0f));
    GL.Rotate(rotation[1], new Vector3d(0f, 0f, -1f));
    GL.Rotate(rotation[0], new Vector3d(1f, 0f, 0f));
    GL.BlendFunc(BlendingFactorSrc.SrcAlpha, BlendingFactorDest.OneMinusSrcAlpha);
    if (alpha < 1) GL.Enable(EnableCap.Blend);
    if (showSurfaces) { RenderSurfaces(); }
    if (showWireFrame) { RenderWireFrame(); }
    if (showVertices) { RenderVertices(); }
    if (alpha < 1) GL.Disable(EnableCap.Blend);
    if (object3DList != null)
    {
        foreach (Object3D object3D in object3DList)
        {
            object3D.Render();
        }
    }
    GL.PopMatrix();
}
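Before discussing the Render method in detail, the vertex-normal computation mentioned above (the ComputeVertexNormalVectors step) can be sketched as follows; the data layout and names are illustrative rather than the library's own.

using System;
using System.Collections.Generic;

public static class VertexNormalDemo
{
    // Vertex normals obtained as the re-normalized average of the unit normals of
    // all triangles connected to each vertex. triangleIndices[t] holds the three
    // vertex indices of triangle t, and triangleNormals[t] its unit normal (Eq. 5.1).
    public static double[][] ComputeVertexNormals(int vertexCount,
        List<int[]> triangleIndices, List<double[]> triangleNormals)
    {
        var vertexNormals = new double[vertexCount][];
        for (int v = 0; v < vertexCount; v++) vertexNormals[v] = new double[3];

        // Accumulate the normals of all triangles that contain a given vertex.
        for (int t = 0; t < triangleIndices.Count; t++)
            foreach (int v in triangleIndices[t])
                for (int c = 0; c < 3; c++)
                    vertexNormals[v][c] += triangleNormals[t][c];

        // Re-normalize the accumulated vectors.
        for (int v = 0; v < vertexCount; v++)
        {
            double[] n = vertexNormals[v];
            double length = Math.Sqrt(n[0] * n[0] + n[1] * n[1] + n[2] * n[2]);
            if (length > 0) { n[0] /= length; n[1] /= length; n[2] /= length; }
        }
        return vertexNormals;
    }
}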
The Render method is a crucial part of the Object3D class, as it determines the current visual appearance of the object in the scene. This method is c 2017, Mattias Wahde, [email protected] 68 CHAPTER 5. VISUALIZATION AND ANIMATION shown in Listing 5.4. The method first checks if the object is visible. If it is not, the method returns directly. If the object is visible, the next step is to position and render the object. OpenGL operates as a state machine. Thus, when visualizing any object, at a given position and orientation, one first accesses the current modelview matrix using the PushMatrix command. Next, the appropriate rotations and translations are carried out for the object in question. The object is rendered and then, finally, the old (stored) transformation matrix is set again, by calling PopMatrix, so that the next object can be rendered etc. Note that the transformations occur in the inverse order in which they are presented, since they are carried out using post-multiplication. Thus, the rotations take place first, and then the translation. If, for example, only the zrotation (rotation[2]) is non-zero, the object will first be rotated around the z-axis, and then translated to the current position. Note that if the operations were carried out in the opposite order, a different result would be obtained. It is thus important to fully understand the order in which these operations take place. It should also be noted that the GL.Rotate method is applied in the OpenGL coordinate system rather than in the coordinate system used in the three-dimensional visualization library, as is evident from the rotation vectors shown in the code above. The lines involving the GL.BlendFunc and the EnableCap.Blend are needed to deal with translucent objects. The parameter alpha is equal to one for an opaque object, and 0 for a completely transparent (thus invisible) object. For values of alpha between 0 and 1, a translucent object is obtained. In order to handle translucent objects properly, the objects in the scene must be rendered in the appropriate order, namely from high alpha to low alpha. There is method in Scene3D that handles this issue. The next few lines contain calls to methods that render the surfaces, wireframe, and vertices, respectively. The wireframe rendering consists of straight lines connecting the vertices of a triangle. The three methods RenderSurfaces, RenderWireFrame, and RenderVertices will not be described in detail here, but it is a useful exercise to study those methods in the source code. Finally, as can be seen in the listing, there is a possibility to use nested definitions, such that a 3D object contains its own list (objectList) of 3D objects. This makes it possible to rotate and translate an entire group of objects as a unit, rather than having to move and rotate each 3D object separately. This type of nested definition can be used to any desired depth. Thus, the objects in an object list may themselves contain objects in their respective object lists. For example, the face and eyes of an agent may be contained in the object list of an object representing the head of the agent, and each eye, in turn, may contain (in its object list) the objects representing the iris, pupil, and eyelid. It should be noted that the positions and orientations of objects in an object list are measured relative to the object on the preceding level. c 2017, Mattias Wahde, [email protected] CHAPTER 5. 
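To see concretely why the order of rotation and translation discussed above matters, the small example below applies a 90-degree rotation about the z-axis and a translation by (0, 2, 0) to the point (1, 0, 0), in both orders; the two results differ.

using System;

public static class TransformOrderDemo
{
    // A 90-degree rotation about the z-axis maps (x, y, z) to (-y, x, z).
    private static double[] RotateZ90(double[] p) => new[] { -p[1], p[0], p[2] };

    private static double[] Translate(double[] p, double[] t) =>
        new[] { p[0] + t[0], p[1] + t[1], p[2] + t[2] };

    public static void Main()
    {
        double[] point = { 1.0, 0.0, 0.0 };
        double[] translation = { 0.0, 2.0, 0.0 };

        double[] a = Translate(RotateZ90(point), translation); // rotate first: (0, 3, 0)
        double[] b = RotateZ90(Translate(point, translation)); // translate first: (-2, 1, 0)

        Console.WriteLine($"Rotate, then translate: ({a[0]}, {a[1]}, {a[2]})");
        Console.WriteLine($"Translate, then rotate: ({b[0]}, {b[1]}, {b[2]})");
    }
}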
5.3 Faces

As evidenced by (for example) animated movies, modern computer technology is sufficiently advanced to generate (almost) photo-realistic representations of any object, including a human face. However, while humans quickly find (or at least ascribe) human features to any artificial rendering of a living system that is even remotely human-looking (such as a cartoon character), once an artificial system (for example, an IPA or a robot) attempts to mimic a human face exactly, including all the minute changes in facial expressions that are subconsciously detected during a conversation, that system is often perceived as eerie and frightening. This is known as the uncanny valley phenomenon [9]. Thus, in other words, unless it is possible to render an artificial face with such a level of detail in all its expressions that it is indistinguishable from a real human face, it is most often better to use a more cartoon-like face, with human-like features but without an attempt to mimic a human face exactly.

5.3.1 Visualization

Conceptually, a 3D head is no different from any other 3D object. In practice, however, generating and animating a 3D head is not easy. The most realistic renditions can be obtained by using a model very similar to a biological face, that is, by generating a skeleton (skull), adding muscles attached to the head, and then finally a skin layer that, of course, is what the user will see. Here, however, a slightly simpler approach will be taken, in which the face consists only of the skin layer and where animation is limited to movements of the entire head (such as looking left or right) as well as movements of the eyes and eyebrows. An obvious additional step would be to add a movable jaw and a mouth. This can certainly be done, but it is beyond the scope of this text. Still, a surprising range of emotions can be expressed even by the simple animations just described.

The heads considered here consist of seven distinct 3D objects (each of which, of course, contains hundreds or thousands of triangles): The actual face (and, optionally, neck), the two eyes, the two eyelids, and the two eyebrows. The face object can be generated using the face editor program, described in Subsect. 5.4.2 below. The resulting face will, by construction, be symmetric around an axis (in this case, the y-axis, if the face is not rotated) and should have two deep indentations for the eyes. An eye can be generated as a white sphere, with an iris and a pupil each consisting of a spherical sector with slightly larger radius than the eye, rotated 90 degrees around the x-axis. An eyelid is in the form of a semi-sphere, such that, with proper rotation, it can completely cover the eye. Finally, an eyebrow consists of an elongated structure in the form of a toroidal sector. As an example, Fig. 5.5 shows a dismembered rendering of a head, in which the parts just described have been dislocated a bit, for individual inspection. Fig. 5.6 shows a few examples of expressions that can be generated with this 3D head; see also the next subsection.

Figure 5.5: The parts used here for rendering a head. Left panel: The head, without eyes and eyebrows. Right panel: An eye, consisting of three distinct objects: The eyeball, the iris (in this case, green), and the pupil. Also shown is the eyelid, in the form of a semi-sphere, and the eyebrow.

5.3.2 Animation

As mentioned in Subsect.
5.2.1 above, the Viewer3D is able to run a separate thread in which the entire scene is rendered at regular intervals. Thus, to achieve animation, all that is required is to change the position and the rotation of the objects in a gradual manner. Since the rotation of a 3D object is carried out before it is translated, the rotations will occur around the axes that meet at the origin. Thus, the manner in which these objects are defined (before any rotation and translation) greatly influences the effects of rotation. For example, the semi-sphere (as well as any other spherical segment) has been defined as if it were a sphere centered at (0, 0, 0) from which some parts have been removed to generate the segment in question. Thus, when rotated, such a segment will move as if it were sliding over a sphere of the same radius and centered in (0, 0, 0). The alternative would be to define a 3D object such that its center-of-mass would be located at the origin. In that case, however, in order to achieve the effect of, say, a semi-sphere (such as an eyelid) moving over a sphere (such as an eyeball) one would have to both rotate and translate the semi-sphere relative to the center of the eyeball. Clearly, by combining rotations and translations, one can achieve the same effect using either definition. For the application considered here, the first option makes animation easier and it is thus the approach chosen.

Figure 5.6: A few examples of mental states and facial expressions generated with a face of the kind described in the main text. Top row, from left to right: Awake (neutral), sleepy, and asleep. Bottom row, from left to right: Surprised, angry, fearful.

For a head of the kind described above, some typical animations are (i) moving the eyes, an effect that can be achieved by rotating the eye around the z-axis, noting that the iris and pupil can be appended in the objectList of the eyeball 3D object, so that they will rotate with the eyeball; (ii) blinking, which can be carried out by rotating the eyelids around the x-axis; and (iii) moving the eyebrows, an action that, in its simplest form, consists of a translation (up or down). Of course, more sophisticated movements can be achieved by allowing the eyebrows to rotate and deform as well. The actual movements are generated in separate threads that gradually move the appropriate objects. Note that the motion is completely independent of the rendering, which is handled by the animation thread in the Viewer3D. Listing 5.5 shows an example, namely a thread that carries out blinking (of both eyes), with a given duration.

Listing 5.5: An example of animation, illustrating blinking of the two eyes. Note that the animationStepDuration is defined elsewhere in the code, and is typically set to 0.01-0.02 s. The fullClosureAngle is typically set to 90 (degrees).

public void Blink(double duration)
{
    blinkThread = new Thread(new ThreadStart(() => BlinkLoop(duration)));
    blinkThread.Start();
}

public void BlinkLoop(double duration)
{
    double halfDuration = duration / 2;
    int numberOfSteps = (int)Math.Round(halfDuration / animationStepDuration);
    double deltaAngle = fullClosureAngle / numberOfSteps;
    Object3D leftEyelid = viewer3D.Scene.GetObject("LeftEyelid");
    Object3D rightEyelid = viewer3D.Scene.GetObject("RightEyelid");
    for (int iStep = 0; iStep < numberOfSteps; iStep++)
    {
        leftEyelid.RotateX(deltaAngle);
        rightEyelid.RotateX(deltaAngle);
    }
    for (int iStep = 0; iStep < numberOfSteps; iStep++)
    {
        leftEyelid.RotateX(-deltaAngle);
        rightEyelid.RotateX(-deltaAngle);
    }
}

5.4 Demonstration applications

In this section, two applications will be described that illustrate the properties and capabilities of the three-dimensional visualization library. The first application is a very simple demonstration of various aspects of rendering, lighting, and shading. The second application is more advanced, especially the highly complex FaceEditor user control, which can be used for generating a face shape starting from a simple rotationally symmetric structure.

5.4.1 The Sphere3D application

This application simply shows a green sphere, under various conditions of rendering, lighting, and shading. The GUI contains a sequence of menu items, allowing the user to visualize a sphere (i) without lighting; (ii) with lighting and flat shading; (iii) with lighting and smooth shading; (iv) as a wireframe structure; (v) as vertices; (vi) as (iii) but with vertices overlaid; (vii) as (iii) but with vertices and wireframe overlaid; (viii) as a translucent object (in this case with another blue sphere inside); and, finally, (ix) with a texture added. All cases are shown in Fig. 5.7.

Figure 5.7: Nine examples of rendering a sphere. The first row of images shows, from left to right, cases (i)-(iii) described in the main text, the second row cases (iv)-(vi), and the third row cases (vii)-(ix).

Even though the Viewer3D does support texture mapping, i.e. pasting (parts of) one or several images over the surface of a 3D object, this topic has been deliberately avoided above, as it is not needed for the applications considered here. However, the interested reader should study the source code for the textured sphere just described, as well as the rendering method in the Sphere3D class and the corresponding method in the base class Object3D. Note that texture mapping requires a specification of which part of an image is to be mapped onto a given triangle. This information is stored in the TextureCoordinates field of each vertex.

Figure 5.8: A screenshot of the FaceEditor application, showing the three-dimensional face object, along with a slice plane (shown in green color) as well as a two-dimensional view of the slice under consideration, in which the user has selected and moved a few control points (shown in red). Note that left-right symmetry is enforced, such that the points on the opposite side of the slice move together with the selected points, but in the opposite (horizontal) direction.

5.4.2 The FaceEditor application

This application is intended to simplify the process of generating the three-dimensional face of an IPA. The FaceEditor application makes use of an advanced user control, the FaceEditor, which is included as a part of the ThreeDimensionalVisualization library, and which does most of the work in this application. A screenshot from the application is shown in Fig. 5.8. Except for the menu strip, the entire form is covered by the face editor, which has three tool strips at the top and two panels below.
Upon initialization, the face editor provides the user with a starting point in the form of a rotationally symmetric three-dimensional Face object (shown in a Viewer3D control, on the left side of the face editor), with a shape similar to that of a human head, with a neck just below the head, but without any other particular features such as nose, ears, or eye sockets. The right panel of the face editor contains a BezierCurveViewer that shows a horizontal slice through the threedimensional object. Each such slice is defined as a closed composite Bézier curve that in turn consists of a set of two-dimensional cubic Bézier splines given by x(u) = P0 (1 − u)3 + 3P1 u(1 − u)2 + 3P2u2 (1 − u) + P3 u3 , (5.2) where x = (x, y), Pj are 2-dimensional control points, and u is a parameter ranging from 0 to 1. Per default, each slice is defined using 32 splines, each c 2017, Mattias Wahde, [email protected] CHAPTER 5. VISUALIZATION AND ANIMATION 75 with four control points. The last control point of a given spline coincides with the first control point of the next spline, so that the effective number of control points is smaller. A detailed description of Bézier splines will not be given here, but note that the control points that define the smooth spline curve do not necessarily lie on the curve itself, as can be seen in Fig. 5.8. One can use the mouse to grab any set of control points in a given slice, and then drag those points to generate any desired (left-right symmetric) shape for the slice in question. In the figure, the user has grabbed a few points (shown in red), and started moving them inwards. As can be seen, left-right symmetry is enforced, such that the points on the opposite side of the slice move together with the selected points. It is also possible to zoom in, so that the points can be moved with greater precision. When the points in any slice are moved, the threedimensional representation is also updated simultaneously, so that one can easily assess the result. In the particular case shown in the figure, the green slice plane (used for keeping track of which slice is being edited) obscures the view. However, the user can hide the slice plane in order to see the effects on the three-dimensional shape. One can also move between slice planes, by clicking on the three-dimensional viewer and then using the arrow keys to move up or down. Moreover, the user can both insert and remove slices. The three-dimensional shape is obtained by interpolating (sampling) the composite Bézier curves defining the slice planes. The user can specify the number of points used. This is a global measure, i.e. the same number of interpolated points is generated for all slices. Note that the number of interpolated points need not equal the number of control points for the splines: These curves can be interpolated with arbitrary precision, using hundreds of points if desired. Typically, 50-100 points per slice is sufficient. The interpolated points are then used as vertices for the three-dimensional Face object. In order to generate triangles, the interpolation is shifted so that the interpolated points of odd-numbered slice planes appear (horizontally) midway between the interpolated points of even-numbered slice planes. An illustration is shown in the figure, where the wireframe representation has been overlaid on the threedimensional shape so that the triangles are clearly visible. 
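As an aside, a point on one of the cubic Bézier splines of Eq. (5.2) can be evaluated as in the small sketch below; the control points are given as (x, y) pairs, and the names are illustrative. Sampling u at regular intervals over each spline of a slice yields the interpolated points mentioned above.

public static class BezierDemo
{
    // Evaluates a point on a two-dimensional cubic Bezier spline (Eq. 5.2)
    // for a parameter value u in [0, 1]. Each control point is an (x, y) pair.
    public static double[] Evaluate(double[] p0, double[] p1, double[] p2, double[] p3, double u)
    {
        double v = 1 - u;
        double b0 = v * v * v;     // (1 - u)^3
        double b1 = 3 * u * v * v; // 3u(1 - u)^2
        double b2 = 3 * u * u * v; // 3u^2(1 - u)
        double b3 = u * u * u;     // u^3
        return new[]
        {
            b0 * p0[0] + b1 * p1[0] + b2 * p2[0] + b3 * p3[0],
            b0 * p0[1] + b1 * p1[1] + b2 * p2[1] + b3 * p3[1]
        };
    }
}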
Generating the appropriate triangle indices is a bit complicated, since each triangle will involve two slice planes; For details, see the definition of the Face class. With this application the user can quickly generate a cartoon-like face for use in an IPA, and then save the corresponding Face object in XML format. There are some limitations. For example, the application does not generate the eyes of the IPA (instead, the user must define a face with eye sockets, in which the eyes can be added later), and neither does it generate a movable jaw. The example face used earlier in this chapter (see the left panel of Fig. 5.5) was generated using the FaceEditor application. c 2017, Mattias Wahde, [email protected] 76 CHAPTER 5. VISUALIZATION AND ANIMATION c 2017, Mattias Wahde, [email protected] Chapter 6 Speech synthesis In principle, speech synthesis is simple, as it can be approached as a mere playback of recorded sounds. However, in practice, it is not easy to generate a high-quality synthetic voice capable of displaying all the subtleties and emotions of human speech. Speech synthesis can be approached in many different ways. Two of the main approaches are concatenative synthesis and formant synthesis. As the name implies, concatenative synthesis consists of pasting together previously recorded sounds in order to form a given sentence or word. This is a process that also involves considerable modification of the recorded sounds, in order to make sure that an utterance formed by concatenation should sound natural: Simply pasting together a sequence of recorded words will not produce a natural-sounding sentence at all, even if each word is perfectly uttered (in isolation). Many state-of-the-art speech synthesis systems use the approach of modifying and pasting together recorded snippets of sounds. However, an alternative approach is to generate all sounds as they are needed, in which no human voice recording is required at all. In this approach, known as formant speech synthesis, one uses instead a model of the human vocal tract, in which a train of pulses excites a set of oscillators (corresponding to the oscillating vocal cords) in order to produce a vowel sound. Consonants are produced in a slightly different way, but with the same model. Whereas concatenative synthesis can be made to generate sounds that resemble those of a human voice very closely, formant synthesis produces a more artificial, robotic-sound voice but, if done well, with surprising clarity. Moreover, a formant voice requires much less (storage) memory space than a concatenative voice, something that also explains the popularity of formant voices in the early days of personal computers. One may certainly argue that concatenative synthesis is superior as regards the quality of the generated voice, but one can also make the argument that for 77 78 CHAPTER 6. SPEECH SYNTHESIS Chunk ID (4 bytes) Chunk data size (4 bytes) Chunk data Figure 6.1: The structure of a RIFF chunk. an agent with a cartoon-like face, similar to the example shown in Chapter 5, a perfect human-sounding voice would be somewhat out of place. Moreover, as formant synthesis does provide interesting insights regarding both human sound generation and signal processing, this has been the approach chosen here. 6.1 Computer-generated sound Ultimately, any sound is of course simply a variation (over time) in air pressure. Computers generate sounds from a set of discrete values, known as samples. 
The number of samples handled per second is known as the sampling frequency or sample rate, and the range of allowed values for the samples is known as the sample width. The sampling frequency for a CD is 44100 Hz, whereas lower sampling frequencies are used in telephones. Acceptable sound quality can be obtained with sampling frequencies of 8000 Hz or above. The sampling width is typically 16 bits, meaning that samples can range from -32768 to 32767. The digital signal is then converted to an analog signal (voltages) using a digital-to-analog (D/A) converter, which is then passed to an amplifier that in turn drives a speaker. In systems with multiple speakers, one may wish to send different signals to different speakers. A common case, of course, is two-channel or stereo sound. For speech synthesis, single-channel or mono sound is often sufficient, however. 6.1.1 The WAV sound format Sounds can be stored in different formats. A common format under Windows is the Waveform audio format (WAV). In this format, the samples can be stored either in uncompressed or compressed form. For simplicity, only the uncomc 2017, Mattias Wahde, [email protected] CHAPTER 6. SPEECH SYNTHESIS 79 R I F F (0x52494646) Chunk data size (4 bytes) W A V E (0x57415645) fmt subchunk (header and data) (Optional) fact subchunk (header and data) Data subchunk (header and data) Figure 6.2: The structure of a WAV sound file. Note that both the main chunk and the subchunk all follow the RIFF chunk format shown in Fig. 6.1. The main chunk’s data section begins with four bytes encoding the word ”WAVE”, after which the subchunks follow, each consisting of a header section (8 bytes) and a data section. pressed format will be considered here. A WAV sound is built using the concept of RIFF chunks that contain an eight-byte header followed by data, as shown in Fig 6.1. The first four bytes of a chunk encode the chunk ID, and the following four bytes encode the number of bytes in the data part of the chunk. As illustrated in Fig. 6.2, strictly speaking, a WAV sound contains a main chunk that encloses all the other chunks (which therefore normally are referred to as subchunks) in its data section. The required subchunks for uncompressed WAV sounds are the fmt (format) and data subchunks. For compressed WAV sounds, a third subchunk, namely the fact subchunk, must be included, and it is normally placed between the two other subchunks. Here, only uncompressed WAV sounds will be considered. However, some uncompressed WAV sounds contain an unnecessary fact subchunk. Thus, a program for reading WAV sounds must be able to cope with the potential presence of a fact subchunk, regardless of whether the sound is compressed or not. c 2017, Mattias Wahde, [email protected] 80 CHAPTER 6. SPEECH SYNTHESIS f m t (0x666d7420) Chunk data size (4 bytes) Compression code (2 bytes) Number of channels (2 bytes) Sample rate (4 bytes) Bytes per second (4 bytes) Block align (2 bytes) Bits per sample (2 bytes) (Optional) # of extra format bytes (2 bytes) Extra format bytes (if any) Figure 6.3: The fmt subchunk. In the absence of extra format bytes, the chunk data size is either 16 or 18, depending on whether the two bytes specifying the number of extra format bytes are included or not. The first four bytes of a WAV sound file contain the word ”RIFF” (in uppercase letters), represented as four ASCII bytes (taking hexadecimal values1 0x52, 0x49, 0x46 and 0x46). The following four bytes encode the file size (n) minus 8 (i.e. the size of the header). 
In other words, those four bytes determine the number of bytes contained in the sound file, after the header. All integers in a WAV file are stored using little endian format, i.e. with the least significant byte first. The first four data bytes of the main chunk determine the RIFF type, which always takes the value "WAVE" (hexadecimal representation: 0x57415645). The remaining n − 12 bytes contain the subchunks. Since each subchunk begins with a chunk ID, the subchunks can, in principle, be placed in any order. However, it is customary to place the fmt subchunk first, followed by the fact subchunk (if needed) and then the data subchunk.

The fmt subchunk

The fmt subchunk, illustrated in Fig. 6.3, begins with the chunk ID "fmt " (with a space at the end!), with hexadecimal representation 0x666D7420. The next four bytes encode the subchunk data size, which for the fmt subchunk equals either 16 + k or 18 + k, where k is the number of extra format bytes (normally zero; see below). After the eight-byte header, the following two bytes encode the compression code (or, somewhat confusingly, audio format) of the WAV sound. For uncompressed WAV sounds, the compression code is equal to 1. The next two bytes encode the number of channels (i.e. two, for stereo sound, or one, for mono sound). The following four bytes encode the sample rate (or sampling frequency) of the WAV sound file. The next eight bytes of the fmt subchunk encode (i) the (average) number of bytes per second of the sound sample, (ii) the block align, and (iii) the number of bits per sample. The three numbers derived from these bytes are partly redundant: Once the number of bits per sample ns has been specified (which requires two bytes), the block align ba can be computed as

ba = ns nc / 8, (6.1)

where nc is the number of channels. Thus, the block align measures the number of bytes needed to store the data from all channels of one sample. The number of bytes b per second, which requires four bytes in the fmt subchunk, can simply be computed as

b = s ba, (6.2)

where s is the sample rate. Here, only 16-bit sound formats will be used. The next two bytes indicate the number of (optional) extra format bytes. For uncompressed WAV sounds, normally no extra format bytes are used. In such cases, sometimes the two bytes determining the number of extra format bytes are omitted as well, so that the data size of the fmt subchunk becomes 16 rather than 18.

Figure 6.4: The data subchunk: the chunk ID "data" (0x64617461), the chunk data size (4 bytes), and the interlaced sample data. The data part of this subchunk contains the actual sound samples.

Figure 6.5: The data part of the data subchunk for a stereo sound: sample 1, channel 1 (left, 2 bytes); sample 1, channel 2 (right, 2 bytes); sample 2, channel 1 (left, 2 bytes); sample 2, channel 2 (right, 2 bytes); and so on. The samples are stored in an interlaced fashion, as described in the main text.

The data subchunk

The data subchunk has a rather simple format, illustrated in Fig. 6.4. The first four bytes encode the chunk ID, which for the data subchunk simply consists of the string "data", with hexadecimal representation 0x64617461. The next four bytes encode the data subchunk size, i.e.
the number of bytes of actual sample data available in the WAV sound file. The samples from the various channels (two, in the case of stereo sound) are stored in an interlaced fashion, as illustrated in Fig. 6.5: for any given time slice, the samples from the different channels appear in sequence, followed by the samples from the next time slice, etc. The individual samples are stored as 2's complement signed integers (this applies to formats using 16 or more bits per sample; if the format uses only 8 bits per sample, each sample is stored as an unsigned integer) which, in the case of 16-bit samples, take values in the range [−32768, 32767], such that the midpoint (0) corresponds to silence. The procedure for generating the numerical value from a 16-bit sample is as follows: Let bi and bi+1 denote two consecutive bytes defining the sample. Taking into account the little endian storage format, these two bytes are decoded to form a temporary value vtmp as

vtmp = 2^8 bi+1 + bi. (6.3)

The final sample value is then obtained as

v = vtmp if vtmp ≤ 32767, and v = −65536 + vtmp otherwise. (6.4)

Other subchunks

As mentioned above, even uncompressed WAV sounds sometimes contain an unnecessary fact subchunk. This subchunk begins with four bytes encoding the string "fact", followed by four bytes specifying the data size of the subchunk. In the case of an uncompressed WAV sound, any data contained in the fact subchunk can be ignored. However, for compressed WAV sounds, the fact subchunk contains crucial information regarding the decoding procedure needed for playback. In addition to the subchunks just discussed, additional subchunk types exist as well, e.g. the slnt subchunk, which can be used for defining periods of silence (thus reducing the size of the WAV sound file provided, of course, that periods of silence are present in the sound in question).

6.1.2 The AudioLibrary

The AudioLibrary contains classes for storing, manipulating, and visualizing sounds in WAV format. The WAVSound class stores a byte array defining both the header and the data of a WAV sound, as described above. This byte array is the actual sound and is used, for example, when playing back the sound via the SoundPlayer class (see Listing 6.2 below). However, a byte array defining both a header and a sequence of data is hardly human-readable. Thus, the byte array containing the data can also be converted to one or two arrays of samples (depending on whether the sound is in mono or stereo format), which can then be visualized using, for example, the SoundVisualizer user control. This class is also defined in the AudioLibrary and has been used throughout this chapter in the figures displaying sound samples.
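To make the sample decoding concrete, the sketch below converts the raw data bytes of a 16-bit mono WAV sound into sample values, following Eqs. (6.3) and (6.4). It is a minimal sketch only: the method name is chosen for the example, and it is assumed that the bytes of the data subchunk have already been extracted; it is not the decoding routine used in the WAVSound class.

  // Minimal sketch (assumed helper, not from the AudioLibrary): decode 16-bit
  // mono samples from the raw bytes of the data subchunk, using Eqs. (6.3)-(6.4).
  public static List<Int16> DecodeMonoSamples(byte[] dataBytes)
  {
      List<Int16> samples = new List<Int16>();
      for (int i = 0; i + 1 < dataBytes.Length; i += 2)
      {
          // Little endian: the least significant byte comes first (Eq. 6.3).
          int vTmp = 256 * dataBytes[i + 1] + dataBytes[i];
          // Two's complement interpretation of the 16-bit value (Eq. 6.4).
          int v = (vTmp <= 32767) ? vTmp : vTmp - 65536;
          samples.Add((Int16)v);
      }
      return samples;
  }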
The AudioLibrary also contains classes that are more relevant to speech recognition (such as a WAVRecorder class) but still logically belong to the AudioLibrary. These classes will be considered in the next chapter.

Some of the most important methods in the WAVSound class are shown in Table 6.1.

  LoadFromFile: Loads a WAV sound from a file.
  SaveToFile: Saves a WAV sound to a file.
  GenerateFromSamples: Generates the byte array required by the WAV format, based on sound samples.
  Extract: Extracts (in a new instance) a part of a sound.
  Join: Joins a set of sounds, with optional periods of silence between consecutive sounds, to form a single sound.
  LowPassFilter: Carries out low-pass filtering of a sound; see Eq. (6.6).
  HighPassFilter: Carries out high-pass filtering of a sound; see Eq. (6.8).
  SetRelativeVolume: Increases or decreases the volume of a sound, depending on the value of the input parameter.

Table 6.1: Brief summary of some public methods in the WAVSound class.

The class also contains a constructor that generates a WAVSound header (see Subsect. 6.1.1 above) with given values for the sample rate, the number of channels, and the number of bits per sample (the sample width). Provided that a sound header has been generated by calling this constructor, one can then generate a WAV sound from a set of samples, using the GenerateFromSamples method. The SaveToFile method only saves the byte array; all other relevant properties can be generated from it. Consequently, the LoadFromFile method loads the byte array, and then generates the header and the sound samples in human-readable form. Whenever the samples of a sound are modified, for example when modifying the volume, the byte array representing the sound (as per the WAV format described above) must be regenerated, something that is handled by a (private) method, namely GenerateSoundDataFromSamples. If the number of samples is also changed, for example when appending samples, one must call the (private) ExtractInformation method that also re-extracts the header of the sound (to reflect the fact that the number of samples has changed). These two operations are generally handled automatically in the various public methods, but must be taken into account if a user wishes to write additional methods for manipulating WAV sounds.

As an example, consider the method SetRelativeVolume, shown in Listing 6.1. Here, the sound samples are scaled by a given factor (the input parameter). However, just scaling the samples will not affect the byte array; thus, the method ends with a call to GenerateSoundDataFromSamples.

Listing 6.1: The SetRelativeVolume method of the WAVSound class. Note the call to the GenerateSoundDataFromSamples method in the final step, which generates the byte array representing the sound.

  public void SetRelativeVolume(double relativeVolume)
  {
      for (int iChannel = 0; iChannel < samples.Count; iChannel++)
      {
          for (int jj = 0; jj < samples[iChannel].Count; jj++)
          {
              double newDoubleSample = Math.Truncate(relativeVolume * samples[iChannel][jj]);
              if (newDoubleSample > MAXIMUM_SAMPLE) { newDoubleSample = MAXIMUM_SAMPLE; }
              else if (newDoubleSample < MINIMUM_SAMPLE) { newDoubleSample = MINIMUM_SAMPLE; }
              samples[iChannel][jj] = (Int16)Math.Round(newDoubleSample);
          }
      }
      GenerateSoundDataFromSamples();
  }

For playback, one can use the SoundPlayer class from the System.Media namespace, which is included in the .NET framework. An example is shown in Listing 6.2. Note that one must manually rewind the stream to ensure correct playback.

Listing 6.2: An example of the usage of the SoundPlayer class for playing WAV sounds.

  SoundPlayer soundPlayer = new SoundPlayer();
  sound.GenerateMemoryStream();
  sound.WAVMemoryStream.Position = 0; // Manually rewind the stream
  soundPlayer.Stream = sound.WAVMemoryStream;
  soundPlayer.PlaySync();

6.2 Basic sound processing

In many cases, for example as a precursor to speech recognition (see Chapter 7), one can apply a sequence of operations to an input sound, in order to remove noise, increase contrast, etc.
Many of these operations can be represented as digital filters that, in turn, can be represented either in the frequency domain (using Z-transforms in the case of discrete-time signals of the kind used here) or in the time domain. Here, only time-domain analysis will be used, in which case a (linear) digital filter can be represented in the form of a difference equation of the form

s(k) + a1 s(k − 1) + a2 s(k − 2) + . . . + ap s(k − p) = b0 x(k) + b1 x(k − 1) + . . . + bq x(k − q), (6.5)

where ai, i = 1, 2, . . . , p and bi, i = 0, 1, . . . , q, as well as p and q, are constants, s(k) is the output at time step k, and x(k) is the input.

6.2.1 Low-pass filtering

The purpose of low-pass filtering is to remove the high-frequency parts (typically noise) of a signal. This is achieved by applying an exponential moving average:

s(k) = (1 − αL) s(k − 1) + αL x(k), (6.6)

so that, using the notation above, a1 = −(1 − αL) and b0 = αL (and, therefore, p = 1, q = 0). As is evident from the equation, if αL is close to 0, the sample s(k) will be close to s(k − 1), meaning that the signal changes slowly or, in other words, that the high-frequency components are removed. Thus, if this filter is applied to a digital signal x(k), the resulting output will be a signal that is basically unchanged for low frequencies, but attenuated for frequencies around and above a certain cutoff frequency fc. One can show that αL is related to fc as

αL = 2π∆t fc / (2π∆t fc + 1), (6.7)

where ∆t is the sampling interval (the inverse of the sampling frequency). Hence, this filter is also referred to as a (first-order) low-pass filter.

6.2.2 High-pass filtering

A high-pass filter removes the low-frequency parts of a signal. In the time domain, a (first-order) high-pass filter takes the form

s(k) = αH s(k − 1) + αH (x(k) − x(k − 1)), (6.8)

where αH is a parameter, which is related to the cutoff frequency as

αH = 1 / (2π∆t fc + 1). (6.9)

After passing through this filter, the signal will be attenuated for low frequencies (below the cutoff frequency) but largely unchanged at higher frequencies.

Figure 6.6: Unvoiced and voiced sounds: Here, an unvoiced sound (in this case s) precedes a voiced sound (in this case a long o), to generate the word so. The difference between the noise-driven first half of the word and the more oscillatory second half of the word is easy to spot.
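As a concrete illustration of Eqs. (6.6)-(6.9), the sketch below applies the two first-order filters to a sequence of samples stored as doubles. It is only a minimal sketch under these assumptions (the method names and the double-valued samples are chosen for the example); it is not the implementation behind the LowPassFilter and HighPassFilter methods of the WAVSound class.

  // Minimal sketch (assumed helpers): first-order low-pass and high-pass
  // filtering of a sample sequence, following Eqs. (6.6)-(6.9).
  public static double[] LowPassFilterSketch(double[] x, double cutoffFrequency, double samplingFrequency)
  {
      double dt = 1.0 / samplingFrequency;
      double alphaL = 2 * Math.PI * dt * cutoffFrequency / (2 * Math.PI * dt * cutoffFrequency + 1); // Eq. (6.7)
      double[] s = new double[x.Length];
      s[0] = alphaL * x[0]; // assuming s(-1) = 0
      for (int k = 1; k < x.Length; k++) { s[k] = (1 - alphaL) * s[k - 1] + alphaL * x[k]; } // Eq. (6.6)
      return s;
  }

  public static double[] HighPassFilterSketch(double[] x, double cutoffFrequency, double samplingFrequency)
  {
      double dt = 1.0 / samplingFrequency;
      double alphaH = 1.0 / (2 * Math.PI * dt * cutoffFrequency + 1); // Eq. (6.9)
      double[] s = new double[x.Length];
      s[0] = alphaH * x[0]; // assuming s(-1) = x(-1) = 0
      for (int k = 1; k < x.Length; k++) { s[k] = alphaH * s[k - 1] + alphaH * (x[k] - x[k - 1]); } // Eq. (6.8)
      return s;
  }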
6.3 Formant synthesis

In formant synthesis, all the spoken sounds are generated based on a model of the human vocal tract. Formant speech synthesizers use a so-called source-filter model, in which the source component corresponds to the excitation of the vocal cords, and the filter components model the (resonances of the) vocal tract. A useful analogue is that of an oscillating spring-damper system, i.e. a mechanical system described by the equation

s′′(t) + 2ζω s′(t) + ω² s(t) = x(t), (6.10)

where x(t) is the input (forcing) signal and s(t) is the output. With appropriately selected values of ζ and ω, such a system will exhibit oscillations in the form of a damped sinusoid. In particular, provided that ζ < 1, the response of the system to a discrete pulse in the form of a delta function (x(t) = δ(t)), which leads to an instantaneous velocity s′(0) = v0 ≡ A√(1 − ζ²) ω, equals

s(t) = A e^(−ζωt) sin(√(1 − ζ²) ωt), for t ≥ 0. (6.11)

In connection with sound signals, it is more common to use the (equivalent) form

s(t) = α e^(−βπt) sin 2πf t, (6.12)

where α is the amplitude, β the bandwidth and f the frequency. In computer-generated speech, time is discrete. In order to use a damped sinusoid in such a context, one therefore needs a discrete version, which can be written

s(k) = α e^(−βπk∆t) sin 2πf k∆t, (6.13)

where k enumerates the samples, and ∆t = 1/ν is the inverse of the sampling frequency. Looking at the representation of the voiced speech signal in the rightmost part of Fig. 6.6, one can see clear similarities with a sequence of damped sinusoids. How can such a signal be generated? Of course, it is possible to generate a sequence of damped sinusoids directly from Eq. (6.13), by computing s(k), k = 0, 1, . . . for a certain number of samples, then resetting k and repeating. However, a more elegant way is to represent the discrete signal using a difference equation. Such an equation can be derived in several ways (either by discretizing the differential equation directly, or by using Laplace and Z transforms). The details will not be given here, but the resulting difference equation for generating the signal given by Eq. (6.13) takes the form

s(k) = −a1 s(k − 1) − a2 s(k − 2) + b1 x(k − 1), (6.14)

where

a1 = −2 e^(−βπ∆t) cos 2πf∆t, (6.15)
a2 = e^(−2βπ∆t), (6.16)
b1 = α e^(−βπ∆t) sin 2πf∆t, (6.17)

and x(k) is the input signal, described in the following subsections. As noted above, the differential equation (6.10) responds with a single damped sinusoid if it is subjected to a delta pulse. Similarly, the discrete version in Eq. (6.14), which will henceforth be referred to as a damped sinusoid filter, will generate a damped sinusoid if the input consists of a single pulse, namely x(k) = 1 for k = 0 and 0 otherwise.

6.3.1 Generating voiced sounds

As noted above, a single damped sinusoid is generated if the input to the filter represented by the difference equation in Eq. (6.14) consists of a single pulse. By providing pulses repeatedly, one can generate a recurrent pattern of damped sinusoids, similar to the pattern seen in the rightmost part of Fig. 6.6. Thus, in this case, the pulse train takes the form

x(k) = 1 if k mod n = 0, and x(k) = 0 otherwise, (6.18)

where n is the spacing between pulses. In terms of the source-filter model mentioned above, the pulse train x(k) is the source, and the filter is given by Eq. (6.14).

  Symbol       Ex.    F0   f1   β1   f2    β2   f3    β3
  ee (female)  see    180  310  100  2990  100  3310  100
  ee (male)    see    120  270  100  2790  100  3010  100
  oo (female)  loose  180  370  100  950   100  2670  100
  oo (male)    loose  120  300  100  870   100  2240  100
  aw (female)  saw    180  590  100  920   100  2710  100
  aw (male)    saw    120  570  100  840   100  2410  100

Table 6.2: Fundamental frequency (see Eq. (6.19)) as well as sinusoid frequencies and bandwidths for some (English) vowels, for both male and female voices.

In general, for a given human voice, one can define a fundamental frequency, denoted F0, that represents the frequency of pulses generated by the oscillating vocal cords. Thus, for the discrete representation in Eq. (6.14), one can write

n = ν/F0, (6.19)

where ν, again, is the sampling frequency. For male voices, a typical value of F0 is around 120 whereas, for a typical female voice, F0 is around 180.
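Before moving on to the full vowel model, the sketch below shows how a single damped sinusoid filter, Eq. (6.14), can be driven by the pulse train of Eq. (6.18), with the pulse spacing given by Eq. (6.19). The method name and the use of double-valued samples are assumptions made for this example; the actual SpeechSynthesizer implementation may be organized differently.

  // Minimal sketch (assumed helper): one damped sinusoid filter, Eq. (6.14),
  // driven by the pulse train of Eq. (6.18), with pulse spacing n = nu/F0 (Eq. 6.19).
  public static double[] GenerateDampedSinusoid(double amplitude, double frequency, double bandwidth,
      double fundamentalFrequency, double duration, double samplingFrequency)
  {
      int numberOfSamples = (int)(duration * samplingFrequency);
      double dt = 1.0 / samplingFrequency;
      int n = (int)(samplingFrequency / fundamentalFrequency); // Eq. (6.19)

      // Filter coefficients, Eqs. (6.15)-(6.17).
      double a1 = -2 * Math.Exp(-bandwidth * Math.PI * dt) * Math.Cos(2 * Math.PI * frequency * dt);
      double a2 = Math.Exp(-2 * bandwidth * Math.PI * dt);
      double b1 = amplitude * Math.Exp(-bandwidth * Math.PI * dt) * Math.Sin(2 * Math.PI * frequency * dt);

      double[] s = new double[numberOfSamples];
      double sPrev1 = 0, sPrev2 = 0, xPrev = 0;
      for (int k = 0; k < numberOfSamples; k++)
      {
          double x = (k % n == 0) ? 1.0 : 0.0;            // pulse train, Eq. (6.18)
          s[k] = -a1 * sPrev1 - a2 * sPrev2 + b1 * xPrev;  // Eq. (6.14)
          sPrev2 = sPrev1;
          sPrev1 = s[k];
          xPrev = x;
      }
      return s;
  }

Summing the output of three such filters, using for example the frequencies and bandwidths in Table 6.2, and rescaling the result to the 16-bit sample range then yields a basic vowel, as described next.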
Now, in most cases, more than one sinusoid is required to capture all aspects of a spoken voiced sound. In the model used here, the vocal tract is modelled using a linear superposition of three damped sinusoid filters. Thus, to generate basic vowels, one need only set the fundamental frequency as well as the amplitudes, frequencies, and bandwidths of the three sinusoids (10 parameters in total), then generate a pulse train with repeated pulses, driving three instances of the discrete damped sinusoid filter, and then, finally, sum the resulting three oscillations to form a sequence of samples representing the voiced sound. The procedure is illustrated in Fig. 6.7.

Figure 6.7: Voiced sounds: The pulse train on the left is given as input to three sinusoid filters. The output of the three filters is then added to form the samples of the voiced sound.

Some typical settings for a few vowels are given in Table 6.2. (Here, a simplified notation is used, in which a short vowel, such as the a in cat, is written with a single letter (a), and a long vowel, such as the a in large, is written using double letters (aa). Moreover, the symbol - represents a short period of silence. Thus, for example, the word cat would be written ka - t, whereas the word card can, somewhat depending on pronunciation, be written kaa - - d.) The amplitudes, which are not specified in the table, are generally set to values smaller than 1, such that the resulting samples fall in the range [−1, 1]. Normally, the damped sinusoid with the lowest frequency has the highest amplitude. Of course, the samples must then be rescaled to an appropriate interval and then inserted in an object of type WAVSound, as described in Sect. 6.1.2 above.

It is not only vowels that have the shape of repeated, damped sinusoids. Some consonants, e.g. the nasal consonants m and n, can also be represented in this way. However, even though the model presented here can represent those sounds very well, the comparison with the biological counterpart is somewhat diminished in those cases: In a human voice, nasal consonants are generated in a complex interplay between the vocal tract and the nasal cavity. In a more biologically plausible formant synthesizer, such as the one introduced already by Klatt [7], one can model both the vocal tract and the nasal cavity (as well as other body parts involved in speech, such as the lips). However, here, it is sufficient that the synthesizer is able to generate all sounds that occur in speech, even at the price of a slight reduction in the biological plausibility of the model.

6.3.2 Generating unvoiced sounds

Returning to Fig. 6.6, it is clear that unvoiced sounds bear little obvious resemblance to voiced sounds. In fact, using only visual inspection, it might be difficult to distinguish an unvoiced sound from noise! However, if the sound is played, one can clearly hear a consonant, rather than noise. How can such sounds be generated in a speech synthesizer of the kind used here? In fact, one may just as well ask how humans can generate such sounds: As was illustrated above, the human vocal tract effectively acts as a set of damped sinusoid filters. How can such a system generate signals of the kind seen in the leftmost part of Fig. 6.6?

Listing 6.3: The properties defined in the FormantSettings class.
  public class FormantSettings
  {
      public double Duration { get; set; }
      public double TopAmplitude { get; set; }
      public double RelativeStartAmplitude { get; set; }
      public double RelativeEndAmplitude { get; set; }
      public double TopStart { get; set; }
      public double TopEnd { get; set; }
      public double TransitionStart { get; set; }
      public double VoicedFraction { get; set; }
      public double Amplitude1 { get; set; }
      public double Frequency1 { get; set; }
      public double Bandwidth1 { get; set; }
      public double Amplitude2 { get; set; }
      public double Frequency2 { get; set; }
      public double Bandwidth2 { get; set; }
      public double Amplitude3 { get; set; }
      public double Frequency3 { get; set; }
      public double Bandwidth3 { get; set; }
  }

The answer lies not so much in the properties of the vocal tract as in the properties of the pulse train used for initiating the oscillations in the first place. As noted by Hillenbrand and Houde [6], the model presented here is perfectly capable of generating unvoiced sounds if, instead of the pulse train given by Eq. (6.18), one uses a pulse train consisting of randomly generated Gaussian pulses, such that, for any k,

x(k) = N(0, σ) with probability p, and x(k) = 0 with probability 1 − p, (6.20)

where N(0, σ) denotes random Gaussian samples with mean 0 and standard deviation σ. Thus, to generate an unvoiced sound, one can use the procedure described above, but with the pulse train generated by Eq. (6.20) instead of Eq. (6.18).
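A sketch of the random pulse train of Eq. (6.20) is given below. The method name and the Random argument are assumptions made for the example; the Gaussian samples are obtained with the Box-Muller transform, since the Random class only provides uniformly distributed values.

  // Minimal sketch (assumed helper): an unvoiced pulse train according to Eq. (6.20),
  // i.e. Gaussian pulses N(0, sigma) occurring with probability p, and 0 otherwise.
  public static double[] GenerateUnvoicedPulseTrain(int numberOfSamples, double p, double sigma, Random random)
  {
      double[] x = new double[numberOfSamples];
      for (int k = 0; k < numberOfSamples; k++)
      {
          if (random.NextDouble() < p)
          {
              // Box-Muller transform: two uniform samples give one Gaussian sample.
              double u1 = 1.0 - random.NextDouble(); // in (0, 1], avoids log(0)
              double u2 = random.NextDouble();
              x[k] = sigma * Math.Sqrt(-2.0 * Math.Log(u1)) * Math.Sin(2.0 * Math.PI * u2);
          }
          else { x[k] = 0.0; }
      }
      return x;
  }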
6.3.3 Amplitude and voicedness

In the model used here, a given sound is generated by specifying an instance of FormantSettings, as described in Listing 6.3, which is included in the SpeechSynthesis library; see also Sect. 6.4 below. This class contains specifications for the amplitudes (ai, i = 1, 2, 3), frequencies (fi, i = 1, 2, 3), and bandwidths (bi, i = 1, 2, 3) of the three sinusoids, as well as the duration d of the sound. The number of samples required (n) then equals d × ν. Moreover, there is an overall amplitude TopAmplitude that multiplies the sum of the three sinusoids, allowing the user to control the global amplitude of the sound with only one parameter. Thus, the amplitudes of the individual sinusoids can perhaps best be viewed as relative amplitudes. Of course, there is some redundancy here: One could remove one of the amplitude constants without loss of generality, but the representation used in the FormantSettings class makes it easier to control the overall amplitude.

In addition, one can represent a situation in which the (global) amplitude of a sound starts at a given value, defined by the RelativeStartAmplitude, rises to a maximum and stays there for a while, and then tapers off towards another value, defined by the RelativeEndAmplitude. This is achieved using two additional relative time parameters, namely TopStart and TopEnd. Letting Atop denote the TopAmplitude, astart and aend the relative start and end amplitudes, respectively, and τ1 and τ2 the TopStart and TopEnd parameters, respectively, one can compute the two time parameters T1 and T2 as

T1 = dτ1, (6.21)

and

T2 = dτ2, (6.22)

where d is the duration of the sound. The (absolute) start and end amplitudes Astart and Aend are computed as

Astart = Atop astart, (6.23)

and

Aend = Atop aend. (6.24)

Then, the variation in the global amplitude of the sound is given by

A(t) = Astart + (Atop − Astart) t/T1              for t < T1,
A(t) = Atop                                       for T1 ≤ t ≤ T2,
A(t) = Atop − (Atop − Aend)(t − T2)/(d − T2)      for t > T2.     (6.25)

In practice, the global amplitude is sampled at discrete times, as in the case of the sinusoids; see above. With a sampling frequency of ν samples per second, the mapping between elapsed time t (for the sound in question) and the sample index k is given by

t = k/ν. (6.26)

Of course, one can avoid modifying the amplitude altogether, by simply setting τ1 = 0 and τ2 = 1 or, equivalently, astart = aend = 1.

The VoicedFraction parameter (here denoted v) determines the fraction of the input x(k) that is voiced: Before any sound is generated, two pulse trains are defined, namely one voiced pulse train xv(k), k = 0, 1, . . . , n, with samples obtained from Eq. (6.18), and one unvoiced pulse train xu(k), k = 0, 1, . . . , n, whose samples are given by Eq. (6.20). The two pulse trains are then combined to form the complete input signal as

x(k) = v xv(k) + (1 − v) xu(k). (6.27)

When generated in isolation, vowels are typically completely voiced (v = 1) whereas (many) consonants are completely unvoiced (v = 0). However, as will be illustrated below, in the transition between two sounds, e.g. a vowel and a consonant, one mixes the parameters, including the parameter v. Also, even for vowels, one may include a certain unvoiced component to generate a hoarse voice.

6.3.4 Generating sound transitions

Even though some speech sounds (for example, a vowel such as a short a) can be used separately, normal speech obviously involves sequences of sounds that form words. It would be possible to generate the sounds letter by letter and then paste those sounds together. However, the result would, in general, not sound natural at all. Instead, the normal procedure in speech synthesis is to generate sounds that represent more than one letter. A common choice is to use diphones, which usually represent two letters (technically, phones, but that distinction will not be made here). In practice, one generally uses both diphones and phones in speech synthesis. For example, the word can can be generated by playing two sounds in rapid sequence, namely a diphone representing ka followed by a phone representing n. Alternatively, one could combine the diphones ka and an. However, the latter alternative would involve not only handling the transitions between phones within a diphone, but also the transitions between the diphones themselves. Thus, here, the former alternative will generally be used. The second alternative is commonly used in connection with speech recognition, however.

Diphones can be generated by transitioning from one set of parameters to another, for example by linear interpolation. In fact, even when generating parameters for single consonant sounds, it helps to have a vowel included, either before or after the consonant sound, in order to properly hear the consonant. For example, without an adjacent vowel, it is sometimes difficult to distinguish between s and f or between p and b. Once the consonant has been generated in this way, one can simply cut away the vowel and thus obtain the consonant in isolation. It should also be noted, however, that different parameters may be needed for a given consonant, depending on the situation.
For example, the set of parameters needed to generate the t in the word take may differ from the parameters needed for the t in at. In any case, when generating a sequence of two sounds (i.e. a diphone), one must specify not only the settings for each of the two sounds, but also the transition between them. The TransitionStart parameter (see Listing 6.3), denoted τs, determines the point at which a transition to the following sound begins. This, too, is a relative time measure, so that the actual time (Ts) at which the transition starts equals

Ts = dτs. (6.28)

The transition affects only the amplitudes (ai), frequencies (fi), and bandwidths (bi), as well as the voicedness (v). Let p1 and p2 denote the values of any such parameter in two adjacent sounds. For the first sound, the parameter value p1 is then used until time t = Ts,1, i.e. the transition start time for the first sound. Then, until time t = d1 (i.e. the duration of the first sound), the mixed parameter value

p = λ p1 + (1 − λ) p2 (6.29)

is used, where

λ = (t − Ts,1) / (d1 − Ts,1) (6.30)

runs from 0 to 1, thus generating a smooth parameter transition from the first sound to the second. Once t = d1 has been reached, t is again set to 0, and the parameter value p2 is used, until t = Ts,2, at which point the transition from the second sound to a third sound (if any) begins. Note that, for the last sound in a sequence, no transition is carried out, of course. In order to paste two sounds together without any transition, one can simply set τs to 1, in which case p = p1 for the entire duration of the first sound.

6.3.5 Sound properties

In some cases, one may wish to change the properties of a sound. For example, if one has generated a particular voice, one might want to generate another voice that is either darker (low-pitched) or brighter (high-pitched) and, perhaps, speaks either slower or faster than the original voice. The main modifiable properties are a sound's volume, pitch, and duration. Fig. 6.8 shows a few examples of sound modification.

Figure 6.8: Examples of sound modification. Upper left panel: The original sound, in this case a long vowel (a). The sound was then modified in three different ways. Upper right panel: Increased volume; lower left panel: Decreased pitch; lower right panel: Increased duration.

Volume

Changing the volume of a sound is simple, at least in the case of linear volume scaling: One simply rescales every sample by a constant factor. One has to be careful, however, to make sure that no sample exceeds the maximum value that can be represented (32767 and -32768, for 16-bit sounds). Samples that exceed the limit will be automatically clipped to the corresponding limit, and if that happens for many samples, the quality of the sound will be reduced. Moreover, once clipping has occurred, the procedure is irreversible, should one, later on, wish to reduce the volume. The WAVSound class contains a method SetRelativeVolume for setting the volume relative to the current volume, using linear scaling. Moreover, there is a method SetMaximumNonClippingVolume, which sets the maximum possible volume under the condition that no clipping should take place.

Pitch

There are general procedures for modifying the pitch (and duration) of spoken sounds that can be applied regardless of the method used for generating the sounds in the first place.
For example, the pitch of a sound can be changed by finding the fundamental frequency (which, of course, may vary across a sound) and extracting pitch periods, i.e. the sound samples in the interval between two successive peaks (pitch marks), which, in the case of formant synthesis, correspond to the pulse train excitations for voiced sounds. The intervals (samples) between pitch marks are then either moved closer together (for higher pitch) or further apart (for lower pitch) using, for example, a method called time-domain pitch-synchronous overlap-and-add (TD-PSOLA). However, in the case of formant synthesis, the procedure is even easier since, in that case, one controls the production of the sound in the first place. Thus, in order to modify the pitch, one need only change the spacing of the pulses in the pulse train, i.e. the fundamental frequency. Pitch changes have the largest effect on voiced sounds, which are dominated by their pulse train rather than the more random excitation used for unvoiced sounds.

Duration

TD-PSOLA can also be used for changing the duration of a sound, either by removing pitch periods (for shorter duration) or by adding pitch periods (for longer duration). In the case of formant synthesis, changing the duration of a sound is even more straightforward, since the duration is indeed one of the parameters in the formant settings. By increasing the value of the duration parameter, one simply instructs the synthesizer to apply the pulse train for a longer time, resulting in a sound of longer duration, and vice versa for sounds of shorter duration. In general, changes of duration are mostly applied to vowels (even though some consonants, such as s, can be extended as well). Many consonants, such as the t in the word cat, need not be extended much in human speech, even though a formant synthesizer can, in principle, generate a t (or any other sound) of any duration.

6.3.6 Emphasis and emotion

Even though one can define an average fundamental frequency for a given voice, it is not uncommon in human speech to vary volume, pitch, and duration in order to emphasize a word or to express an emotion. If all words are always read with the same intonation, the result is a very robotic-sounding voice for which one has to use context, rather than simply listening, to distinguish, say, a statement from a question: In normal speech, the variation in emphasis over a sentence can be used to make subtle changes in the meaning of the sentence. For example, there is a difference between the sentences Did you see the cat?, with emphasis on see (as opposed to, for example, just hearing it meowing), and Did you see the cat?, with emphasis on cat (as opposed to seeing something else). In fact, the speech synthesizer defined here does not include features such as emphasis. However, the formant method certainly supports such features. For example, in order to raise the pitch towards the end of a word, one need only generate a pulse train in which the pulse period is shortened gradually over the word, rather than being constant over the entire word. In addition, one must also change the word-to-sound mappings (see Sect. 6.4 below), by adding symbols that can be used for distinguishing between a normal utterance of a word and an utterance involving emphasis.

Listing 6.4: A simple usage example for the GenerateWordSequence method in the SpeechSynthesizer class. Here, the sentence Hello, how are you?
is generated provided, of course, that the speech synthesizer contains the required word-to-sound mappings for the four words, as well as the corresponding formant settings required to generate each word.

  ...
  List<string> wordList = new List<string>() { "hello", "how", "are", "you" };
  List<double> silenceList = new List<double>() { 0.10, 0.02, 0.02 };
  WAVSound sentenceSound = speechSynthesizer.GenerateWordSequence(wordList, silenceList);
  ...

6.4 The SpeechSynthesis library

This library contains classes for generating speech using formant synthesis. As shown above, the FormantSettings class is used for holding the parameters (also described above) of a sound. In cases where a sound requires several different settings, as in the case of a diphone involving two distinct sounds, the FormantSpecification class acts as a container for a list of formant settings. This class also contains a method (GetInterpolatedSettings) that is used during the transition between two sounds, as described in Subsect. 6.3.4.

The actual synthesis is carried out in the SpeechSynthesizer class, which contains a method GenerateSound that takes a formant specification as input. In this method, a pulse train, for use in voiced sounds, is generated with the appropriate pulse interval. Moreover, a random pulse train is generated as well, for use in unvoiced sounds. The pulse trains are then combined as in Eq. (6.27), and the resulting combined pulse train is then fed to the three damped sinusoids, for which the current parameter settings are used (obtained via interpolation in the case of a transition between two sounds, as mentioned above). The resulting set of samples is then used for generating a WAVSound that is returned by the method.

A speech synthesizer must also contain specifications of which sounds to combine, and in what order, so as to produce a specific word. Thus, the SpeechSynthesizer class contains a list of WordtoSoundMapping objects that, in turn, map a word to a list of sound names. The SpeechSynthesizer class contains two additional methods. GenerateWord takes a string (the word specification) as input, finds the appropriate sounds (or, rather, the formant specifications required to generate those sounds), and then produces the corresponding sounds. The GenerateWordSequence method generates a sequence of words (for example, but not necessarily, a complete sentence), with (optional) intervals of silence between the words; see also Listing 6.4.

Figure 6.9: The sound editor tab page of the VoiceGenerator application. In this case, the user has set the parameters so that they approximately generate a long o.

6.5 The VoiceGenerator application

The description above shows how the various parameters are used when forming a sound. However, an important question still remains, namely which parameter settings are required for generating a given sound? Table 6.2 offers some guidance regarding a few vowels, but in order to generate an entire voice, capable of uttering any word in a given language, one must of course find parameter settings for all sounds used in the language in question. Needless to say, these sounds will differ between languages, even though some sounds are found in almost all languages.
In fact, while the parameters for a given sound, especially a vowel, can be estimated using knowledge of the human vocal tract (and its typical formant frequencies), a more efficient way might be to use an interactive evolutionary algorithm (IEA), which is a form of subjective optimization, i.e. a procedure where a human assesses and scores the different alternatives. Of course, sound generation is particularly suitable for such an approach, since a human can quickly assess whether or not a given sound corresponds to the desired sound.

This kind of optimization is implemented in the VoiceGenerator demonstration application. In this case, starting from a given set of parameters, the user is presented with nine sounds, whose samples are shown graphically on the screen, in a 3 × 3 matrix, with the initial sound in the center. For the remaining eight sounds, the parameters have been slightly modified, based on the parameters of the sound at the center of the matrix. The user then listens to the nine sounds (or a subset of them), and selects (by double-clicking) the one that is least different from the desired sound. That sound then appears in the center of the matrix, surrounded by eight sounds whose parameters are slight variations of the parameters of the selected sound. This process is repeated until the desired sound has been obtained. For a person unfamiliar with IEAs, this might seem like a very slow and tedious process. However, it is actually quite fast: Starting from any parameter settings, with some experience one can typically find parameters for any vowel in 10-20 selection steps or less. Consonants may require a few more steps but, overall, the process is rather efficient. The program does allow manual editing of parameters as well.

Figure 6.10: The interactive optimization tab page of the VoiceGenerator application. Starting from the sound shown in Fig. 6.9, the user has inserted a randomized sound before an already optimized vowel, and has begun the process of optimizing the sound by modifying only the first part in order to turn it into a consonant.

The GUI contains three tabs, one for interactive optimization as described above, one for manual editing of sounds, and one for defining and synthesizing the various words stored in a speech synthesizer. Fig. 6.9 shows the sound editor tab page. Here, the user can experiment with various (manually defined) formant settings, in order to generate a starting point for the IEA. In the particular example shown in the figure, the parameters have been set so as to generate a long o sound. Fig. 6.10 shows the interactive optimization tab page, during the optimization of a vowel sound. The currently selected sound is shown in the center frame, whereas the eight surrounding frames display modified versions of that sound. The user can select the parameters that the optimizer is allowed to modify. For example, in the case of a vowel, one would normally start from a sound that is completely voiced (voiced fraction equal to 1), and then disallow changes in the voiced fraction during optimization.

The user can also select the scope of modification, a possibility that is relevant in cases where the sound is generated from a formant specification containing more than one formant setting. For example, a common approach for generating consonant-vowel combinations (e.g. kaa, taa etc.)
is to first generate the vowel using the IEA, and then assign the sound to the sound editor (by clicking the appropriate button). Next, in the sound editor tab page, one would copy the vowel (by clicking on the append button), thus obtaining a sound defined by a sequence of two formant settings. Then, one would randomize the first formant settings, and assign the sound to the optimizer. At this point, before starting the optimization, one can set the scope of modification such that it only affects the first formant settings (which are now random, but are supposed to generate a consonant after optimization), and then begin using the IEA to find the appropriate settings for the consonant; see also Fig. 6.10.

Of course, if one uses a random starting point for every sound generated, the resulting set of sounds may not form a coherent voice. In other words, when the sounds are used to form words, they will not be perceived as belonging to a single voice. One should therefore use the following method: First, generate a vowel (say, a long a, denoted aa, as in large). Next, insert a randomized sound before the vowel, and use the optimizer to generate suitable consonants to form consonant-vowel diphones such as baa, daa, gaa etc., every time using the same formant settings for the vowel and just optimizing the formant settings for the consonant. A similar procedure can be used for generating vowel-consonant diphones, by keeping the first sound (the vowel) constant.

Chapter 7
Speech recognition

Speech recognition can be divided into two main cases, namely isolated word recognition (IWR) and continuous speech recognition (CSR). As is easily understood, CSR is more difficult than IWR, for example due to the fact that, in continuous speech, the brief periods of silence that separate spoken sounds do not generally occur at word boundaries. Knowing that a sound constitutes a single word, as might be the case (though not necessarily) in IWR, greatly simplifies the recognition process. There are many approaches to CSR, for example dynamic time warping (DTW), a deterministic technique that attempts to match (non-linearly) two different time series in order to find the similarity between the two series; hidden Markov models (HMMs) that, simplifying somewhat, can be seen as a stochastic alternative to DTW; and artificial neural networks (ANNs) that can be used for recognizing patterns in general, not just speech. The different approaches can be combined: HMMs have long dominated CSR research, and many modern HMM-based speech recognizers make use of (deep) ANNs instead of the so-called Gaussian mixture models (GMMs) that were earlier used in connection with HMM-based speech recognition.

In IWR, one normally considers a rather limited vocabulary, and the speech recognizer can therefore be trained on instances of entire words. In CSR, by contrast, the number of possible words is so large (around 80000 for fluently spoken English, for example) that one must instead base the recognition of speech on smaller units of sound, namely phones (see Chapter 6), along with diphones and even triphones that involve the transitions between phones. When combined, such units form words and sentences. However, regardless of which approach is used, on the most fundamental level, speech recognition involves finding speech features in a sound and then comparing them to stored feature values from sounds used during training of the speech recognizer.
In this chapter, the aim will be to describe the steps involved in extracting the features of spoken sounds, and then matching them using linear scaling (instead of DTW), as will be described below. The approach will be limited to IWR, as this is sufficient for the applications considered here.

7.1 Isolated word recognition

There are four basic steps in the approach to IWR considered here [19]: First, the sound is subjected to preprocessing and frame splitting (see below). Then, a number of features are extracted to form a feature vector for each frame, thus resulting in a time series for each feature. Next, the time scale is (linearly) normalized to range from 0 to 1, and the time series are resampled at fixed values of normalized time. Finally, the feature vector is compared to stored feature vectors, one for each sound that the IWR system has been trained to recognize, in order to determine whether or not the spoken sound is recognizable and, if so, return information regarding the recognized sound. In the description of this process, it will be assumed that the (input) sound constitutes a single word. However, later on, the process of splitting a sound and concatenating the parts in various different ways before applying IWR will be considered briefly as well.

7.1.1 Preprocessing

As in Chapter 6, here s(k) denotes the samples of a sound. The first step consists of removing the so-called DC component by setting the mean (s̄) of the sound samples to zero. Thus, the samples are transformed as

s(k) ← s(k) − s̄. (7.1)

Assuming, again, that the sound contains a single spoken word (but with periods of silence or, rather, noise before and after the word), the next step is to extract the samples belonging to the word. This is done by first moving forward along the sound samples, starting from the µth sample, and forming a moving average involving (the modulus of) µ sound samples. Once this moving average exceeds a threshold tp, the corresponding sample, with index ks, is taken as the start of the word. The procedure is then repeated, starting with sample m − µ + 1, where m is the number of recorded samples, forming the moving average as just described, and then moving backward, towards lower indices. When a sample (with index ke) is found for which the moving average exceeds tp, the end point has been found. The sound containing the ke − ks + 1 samples is then extracted.

The sound is then pre-emphasized, by applying a digital filter that, in the time domain, takes the form

s(k) ← s(k) − c s(k − 1), (7.2)

where c is a parameter with a typical value slightly below 1. As is evident from this equation, low frequencies, for which s(k) is not very different from s(k − 1), are de-emphasized, whereas high frequencies are emphasized, improving the signal-to-noise ratio. Next, frame splitting is applied. Here, snippets of duration τ are extracted, with consecutive snippets shifted by δτ. δτ is typically smaller than τ, so that adjacent frames partially overlap. Finally, each frame is subjected to (Hamming) windowing, such that

s(k) ← s(k) v(k), (7.3)

with

v(k) = (1 − α) − α cos(2πk/n), (7.4)

where n is the number of samples in the frame, and α is yet another parameter, typically set to around 0.46.
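The pre-emphasis and windowing steps are easy to express in code. The sketch below operates on double-valued samples and is only an illustration of Eqs. (7.2)-(7.4); the method names are chosen for the example and do not correspond to the library implementation.

  // Minimal sketch (assumed helpers): pre-emphasis, Eq. (7.2), and Hamming
  // windowing of a single frame, Eqs. (7.3)-(7.4).
  public static double[] PreEmphasize(double[] s, double c)
  {
      double[] result = new double[s.Length];
      result[0] = s[0];
      for (int k = 1; k < s.Length; k++) { result[k] = s[k] - c * s[k - 1]; } // c typically slightly below 1
      return result;
  }

  public static double[] ApplyHammingWindow(double[] frame, double alpha)
  {
      int n = frame.Length;
      double[] result = new double[n];
      for (int k = 0; k < n; k++)
      {
          double v = (1 - alpha) - alpha * Math.Cos(2 * Math.PI * k / n); // Eq. (7.4), alpha around 0.46
          result[k] = frame[k] * v;                                       // Eq. (7.3)
      }
      return result;
  }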
7.1.2 Feature extraction

Once the word has been preprocessed as described above, resulting in a set of frames, sound features are computed for each frame. A sound feature is a mapping from the set of samples s(k) of a frame to a single number describing that frame. Suitable sound features are those that capture properties of a frame that are (ideally) independent of the speaker and also of the intensity (volume) of speech, etc. One can define many different kinds of features. Here, four types will be used, namely (i) the autocorrelation coefficients, (ii) the linear predictive coding (LPC) coefficients, (iii) the cepstral coefficients, and (iv) the relative number of zero crossings. These feature types will now be described in some detail.

Autocorrelation coefficients

The autocorrelation of a time series measures its degree of self-similarity over a certain sample distance (the lag) and can thus be used for finding repeated sequences in a signal. Here, the normalized autocorrelation, defined as

a_i^N = (1/σ²) Σ_{k=1}^{n−i} (s(k) − s̄)(s(k + i) − s̄), (7.5)

is used, where s̄ again is the mean of the samples (note that, while the mean was removed from the original sound in the first preprocessing step, this does not imply that the mean of every frame is necessarily equal to zero) and σ² is their variance. The number of extracted autocorrelation coefficients (i.e., the number of values of i used, starting from i = 1) is referred to as the autocorrelation order.

LPC coefficients

Provided that a sound is (quasi-)stationary (in general, a stationary time series is one in which the mean, variance, etc. are constant across the time series), something that often applies to a sound frame of the kind considered here (provided that the frame duration is sufficiently short), linear predictive coding (LPC) can be used as a method for compressing the information in the sound frame. In LPC, one determines the coefficients li that provide the best possible linear approximation of the sound, that is, an approximation of the form

ŝ(k) = Σ_{i=1}^{p} l_i s(k − i), (7.6)

such that the error e(k) = s(k) − ŝ(k) is minimal in the least-squares sense. Here, p is referred to as the LPC order. The LPC coefficients can be computed from the (non-normalized) autocorrelation coefficients ai (defined as the normalized autocorrelation coefficients, but without the σ² denominator). The equation for the prediction error e(k) can be written

e(k) = s(k) − Σ_{i=1}^{p} l_i s(k − i). (7.7)

The total squared error E then becomes

E = Σ_{k=−∞}^{∞} e²(k) = Σ_{k=−∞}^{∞} ( s(k) − Σ_{i=1}^{p} l_i s(k − i) )². (7.8)

Thus, the minimum of E is found at the stationary point where

∂E/∂l_j = 0, j = 1, . . . , p. (7.9)

Taking the derivative of E, one finds

(1/2) ∂E/∂l_j = Σ_{i=1}^{p} l_i Σ_{k=−∞}^{∞} s(k − i) s(k − j) − Σ_{k=−∞}^{∞} s(k) s(k − j) = 0. (7.10)

Using the definition of the autocorrelation coefficients, this expression can be rewritten as

Σ_{i=1}^{p} l_i a_{|j−i|} = a_j, (7.11)

a set of equations called the Yule-Walker equations. This expression can be written in matrix form as

A · l = a, (7.12)

where l = (l1, . . . , lp), a = (a1, . . . , ap), and A is given by

A = | a_0      a_1      . . .  a_{p−1} |
    | a_1      a_0      . . .  a_{p−2} |
    | . . .    . . .    . . .  . . .   |
    | a_{p−1}  a_{p−2}  . . .  a_0     |    (7.13)

This symmetric matrix is a so-called Toeplitz matrix. There exists an efficient way of solving the Yule-Walker equations, the so-called Levinson-Durbin recursion, which has been implemented in the MathematicsLibrary included in the IPA libraries.
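As an illustration of how the Yule-Walker equations can be solved, the sketch below implements a standard textbook version of the Levinson-Durbin recursion for Eq. (7.11); it is written for this example and is not the code found in the MathematicsLibrary. The input array is assumed to hold the non-normalized autocorrelation coefficients a_0, ..., a_p.

  // Minimal sketch (not the MathematicsLibrary implementation): Levinson-Durbin
  // recursion for the Yule-Walker equations (7.11), returning l_1, ..., l_p.
  public static double[] LevinsonDurbin(double[] a, int p)
  {
      double[] l = new double[p + 1]; // index 0 unused; l[i] holds l_i
      double error = a[0];            // prediction error energy
      for (int i = 1; i <= p; i++)
      {
          double acc = a[i];
          for (int j = 1; j < i; j++) { acc -= l[j] * a[i - j]; }
          double k = acc / error;     // reflection coefficient (error assumed nonzero)
          double[] lNew = (double[])l.Clone();
          lNew[i] = k;
          for (int j = 1; j < i; j++) { lNew[j] = l[j] - k * l[i - j]; }
          l = lNew;
          error *= (1.0 - k * k);
      }
      double[] result = new double[p];
      Array.Copy(l, 1, result, 0, p);
      return result;
  }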
Cepstral coefficients

The cepstral coefficients (CCs) represent the envelope (the enclosing hull) of the signal's spectrum and are thus useful as a compact representation of the signal's overall characteristics. The CCs can be computed as (the first coefficients of) the inverse (discrete) Fourier transform of the logarithm of the (discrete) Fourier spectrum of the signal. However, one can show that, starting from the autocorrelation coefficients ai and the LPC coefficients li, one can compute the CCs (denoted ci) also as follows: The first coefficients are c0 = a0 and c1 = l1. Then,

c_i = l_i + Σ_{k=1}^{i−1} (k/i) c_k l_{i−k}, i = 1, . . . , p, (7.14)

and

c_i = Σ_{k=i−p}^{i−1} (k/i) c_k l_{i−k}, i > p. (7.15)

The number of cepstral coefficients used in a given situation is referred to as the cepstral order. In addition to the cepstral coefficients, it is common also to define so-called mel-frequency cepstral coefficients (MFCCs), which attempt to mimic human auditory perception more closely by applying a non-linear frequency scale rather than the linear frequency scale used in computing the cepstral coefficients. The MFCCs will not be further considered here, however.

Number of zero crossings

The number of zero crossings z, i.e. the number of times that the signal changes from negative to positive or the other way around, can help in distinguishing different sounds from each other. For example, as can be seen in Fig. 6.6, a voiced sound, represented as a superposition of sinusoidal waveforms, typically has fewer zero crossings than an unvoiced sound. Here, a zero crossing occurs if either

s(k) s(k − 1) < 0 (7.16)

or

s(k − 1) = 0 and s(k) s(k − 2) < 0, (7.17)

where s(k) again denotes the sound samples. In order to make the measure independent of the duration of the sound, the relative number of zero crossings, obtained by dividing z by the number of samples, is used instead.

7.1.3 Time scaling and feature sampling

For any given sound, one can carry out the steps above, dividing the sound into frames and computing the various sound features for each frame. However, with given values of the frame duration and the frame shift, the number of frames will vary between sounds, depending on their duration. Thus, in order to compare the features from one sound to those from another, one must first obtain a uniform time scale. As mentioned above, a common approach to rescaling the time variable so that the sounds can be compared is to use DTW. On the other hand, at least for IWR, simple linear scaling of the time typically works as well as DTW [19]. Thus, linear scaling, illustrated in Fig. 7.1, has been used here. The figure shows three instances of the same sound, uttered with different speed (and, therefore, different duration). The time scale of the sound features extracted for each sound was then linearly rescaled to unit (relative) time, such that the first feature value occurs at relative time 0 and the last at relative time 1. In order to illustrate that the linear scaling works well, the panels on the right show the time series for the first LPC coefficient (l1) for each sound without time rescaling. By contrast, in the bottom panel, the three LPC time series obtained with (linear) time rescaling are shown together. As can be seen, there is a high degree of similarity between the three time series. Now, with or without time rescaling, the feature time series will contain different numbers of points.
For example, with a frame duration of 0.03 s and a frame shift of 0.01 s, a sound with 0.15 s duration will provide time series (for any feature) with 13 feature values, whereas a sound with 0.25 s duration will provide time series with 23 feature values. Moreover, after time rescaling, the spacing (in relative time) will differ between the two time series. However, once the time scale has been rescaled, one can of course resample the time series, using linear interpolation (between measured points), to obtain a time series with any number of values at equal spacing in relative time, thus producing time series of the kind shown in the bottom panel of Fig. 7.1. Thus, with linear time rescaling followed by resampling, one can make sure that any sound will generate time series (for each feature) with a given number of values, equidistantly spaced in relative time.

Figure 7.1: Top three rows: Three instances of a sound, namely the Swedish word ekonomi (economy), uttered with different speed. For each sound, the left panel shows the sound samples, and the right panel shows the first LPC coefficient. The bottom row shows the three LPC time series superposed, after linear time rescaling.

7.1.4 Training a speech recognizer

After preprocessing, feature extraction, time rescaling, and resampling, the final step consists of comparing the feature time series thus obtained with stored time series obtained during training. Before describing that step, the training procedure will be defined. The training procedure is simple, but in order to obtain a general representation of a word, one should use multiple instances of the word, and then form averages of the resulting feature vectors. The procedure is thus as follows: First, decide how many autocorrelation coefficients, LPC coefficients, and cepstral coefficients should be computed. A typical choice is to use an autocorrelation order of 8, an LPC order of 8, and a cepstral order of 12. Adding also the relative number of zero crossings, the total number of features (nf) will be 29. For each word that the speech recognizer is supposed to learn, generate n instances. For each instance, carry out the preprocessing and then compute the feature time series for the autocorrelation, LPC, and cepstral coefficients, as well as the (relative) number of zero crossings. Then rescale and resample each time series at equidistant relative times, generating ns (typically around 40 or so) samples per time series. Finally, form the average (over the instances) of the resampled time series for each feature.

Figure 7.2: Top row: The left panel shows an instance of the Swedish word nästa (next), along with the (unscaled) time series for the first LPC coefficient, for five instances of this word. The bottom panel shows the average time series for the same LPC coefficient, generated by linearly rescaling and resampling the feature time series for each instance, and then forming the average. Note that the standard deviation (over the instances used for generating the average) is shown as well.
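The rescaling and resampling step can be illustrated as follows. The sketch below resamples one feature time series at ns equidistant points in relative time, using linear interpolation; the method name is chosen for the example, and both the input series and ns are assumed to contain at least two values.

  // Minimal sketch (assumed helper): resample a feature time series at nS
  // equidistant points in relative time [0, 1], using linear interpolation.
  public static double[] ResampleFeatureTimeSeries(double[] values, int nS)
  {
      double[] resampled = new double[nS];
      for (int i = 0; i < nS; i++)
      {
          double relativeTime = (double)i / (nS - 1);            // 0, ..., 1
          double position = relativeTime * (values.Length - 1);  // position in the original series
          int index = (int)Math.Floor(position);
          if (index >= values.Length - 1) { resampled[i] = values[values.Length - 1]; }
          else
          {
              double fraction = position - index;
              resampled[i] = (1 - fraction) * values[index] + fraction * values[index + 1];
          }
      }
      return resampled;
  }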
Finally, form the average (over the instances) of the resampled time series for each feature. At the end of this procedure, each word will be represented by n_p = n_f × n_s parameters. With the numerical values exemplified above, n_p would be equal to 29 × 40 = 1160. This is a fairly large number of parameters, but still smaller than the number of samples in a typical sound instance (and, of course, providing a better representation of the sound's identity than the samples themselves). Moreover, as will be shown below, not all parameters are necessarily used during speech recognition. An illustration of (a part of) the training procedure is shown in Fig. 7.2.

7.1.5 Word recognition

Once the training has been completed, the speech recognizer is ready for use. The recognition procedure starts from a sound instance, which is first preprocessed as described above and then subjected to feature extraction, as well as time rescaling and resampling. Next, the recognizer runs through all the words stored during training, computing a distance measure d_i between the current sound instance and the stored word i = 1, \ldots, n_w (where n_w is the number of stored words) as

d_i = \frac{1}{n_u n_f} \sum_{j=1}^{n_f} w_j \left( \sum_{k=1}^{n_s} \left( F_{ijk} - \varphi_{jk} \right)^2 \right),    (7.18)

where F_{ijk} denotes the kth sample of feature time series j for stored word i, and \varphi_{jk} is the kth sample of feature time series j for the current sound instance. The inner sum (over k) covers the samples of each feature time series (40 in the example above). The outer sum runs over the features. w_j ≥ 0 are the feature weights and n_u denotes the number of features used, i.e. the number of features for which w_j > 0. The feature weights are thus additional parameters of the speech recognizer, and must be set by the user. The easiest option is to set all w_j to 1. However, as shown in [19], where the weights were set using an evolutionary algorithm, (slightly) better performance can in fact be obtained by using only 5 of the 29 features defined above, namely cepstral coefficients 3, 7, 8, 11, and 12. Expressed more simply, one computes the mean square distance between the feature time series for the stored sounds and the current sound instance. The index i_r of the recognized word is then taken as

i_r = \mathrm{argmin}_i \, d_i,   i = 1, \ldots, n_w,    (7.19)

provided that the minimum distance, i.e.

d_{\min} = \min_i d_i,   i = 1, \ldots, n_w,    (7.20)

does not exceed a given threshold T (another parameter of the speech recognizer). If d_{\min} > T, the recognizer does not produce a result. This can happen if the sound instance is garbled or incorrectly extracted (meaning that the sound contains more, or less, than one word) or, of course, if it represents a sound that the recognizer does not have in its database.

7.2 Recording sounds

In order to train and use a speech recognizer, one must of course have some way of recording sounds. The AudioLibrary contains a class WAVRecorder for this purpose. For the purpose of training, a recorder that could record for a given duration (say, 2 s) would perhaps be sufficient. However, for the purpose of listening (continuously), as the IPA will have to do, the recorder must be able to record continuously. The WAVRecorder has indeed been implemented in this way, and it makes use of several external methods available in the winmm DLL that is an integral part of Windows. The source code will not be shown here. Suffice it to say c 2017, Mattias Wahde, [email protected] 110 CHAPTER 7.
SPEECH RECOGNITION that, basically, the WAVRecorder defines a recording buffer in the form of a byte array, and then opens a channel for input sound (using the waveInOpen method in winmm) and also defines a callback method that is triggered whenever the recording buffer is full. The recorded bytes are then transferred, in a thread-safe manner, to a list of byte arrays (timeRecordingList), which also keeps track of the time stamp at which the byte array was acquired from the recorder. Moreover, in order to prevent the recorder from storing sound data that can grow without an upper bound, the first (oldest) element of timeRecordingList is removed whenever a new element is added, once the number of elements in timeRecordingList reaches a certain user-defined threshold. The WAVRecorder class also contains a method for extracting all the available, recorded bytes in the form of a single array, which can then be converted to a WAVSound (see also Sect. 6.1.2). In addition, the WAVRecorder of course also contains a method for stopping the recording. 7.3 The SpeechRecognitionLibrary This library contains the IsolatedWordRecognizer class that, in turn, contains methods for training the recognizer on a given word and for recognizing an input sound. The training method (AppendSound) takes as input the name (identity) of the sound as well as a list of instances of the sound. The method, which also makes use of the SoundFeature and SoundFeatureSet classes, defined in the AudioLibrary, is shown in Listing 7.1. The first step is to preprocess the sound as described above. The method does assume that each instance contains a single word, but it must still be preprocessed, for example to remove initial and final periods of silence and also, of course, to generate the sound frames. Next, the feature time series are obtained for the current instance, and are then rescaled and resampled. Once feature time series are available for all instances, the average sound feature set is generated and stored for later use. The RecognizeSingle method, which takes a sound instance as input, is shown in Listing 7.2. Here, too, the sound is processed as in the training method, resulting in a set of feature time series, with the same number of samples as in the stored series. The method then computes the distance (deviation) between the input sound and each of the stored sounds, and then returns the list of deviations in the form of a RecognitionResult that contains both the computed feature time series, as well as a list of sound identifiers (i.e. the text string name for each stored sound) and distances di for each stored sound, sorted in ascending order. From the RecognitionResult it is then easy to check whether or not the first element (i.e. the one with smallest distance) has a distance value below the threshold T . c 2017, Mattias Wahde, [email protected] CHAPTER 7. SPEECH RECOGNITION 111 Listing 7.1: The AppendSound method in the IsolatedWordRecognizer class. p u b l i c o v e r r i d e void AppendSound ( s t r i n g name , L i s t <WAVSound> instanceList ) { L i s t <SoundFeatureSet> soundFeatureSetList = new L i s t <SoundFeatureSet >() ; f o r e a c h (WAVSound soundInstance i n instanceList ) { soundInstance . SubtractMean ( ) ; double startTime = soundInstance . GetFirstTimeAboveThreshold ( 0 , soundExtractionMovingAverageLength, soundExtractionThreshold) ; double endTime = soundInstance . 
GetLastTimeAboveThreshold ( 0 , soundExtractionMovingAverageLength, soundExtractionThreshold) ; WAVSound extractedInstance = soundInstance . Extract ( startTime , endTime ) ; extractedInstance . PreEmphasize ( preEmphasisThresholdFrequency) ; WAVFrameSet frameSet = new WAVFrameSet( extractedInstance , frameDuration , frameShift ) ; frameSet . ApplyHammingWindows( alpha ) ; SoundFeatureSet soundFeatureSet = new SoundFeatureSet ( ) ; L i s t <SoundFeature> autoCorrelationFeatureList = frameSet . GetAutoCorrelationSeries( ” A u t o C o r r e l a t i o n ” , autoCorrelationOrder) ; soundFeatureSet . FeatureList . AddRange ( autoCorrelationFeatureList) ; L i s t <SoundFeature> lpcAndCepstralFeatureList = frameSet . GetLPCAndCepstralSeries( ”LPC” , lpcOrder , ” C e p s t r a l ” , cepstralOrder ) ; soundFeatureSet . FeatureList . AddRange ( lpcAndCepstralFeatureList) ; SoundFeature relativeNumberOfZeroCrossingsFeature = frameSet . GetRelativeNumberOfZeroCrossingsSeries( ”RNZC” ) ; soundFeatureSet . FeatureList . Add ( relativeNumberOfZeroCrossingsFeature) ; soundFeatureSet . SetNormalizedTime ( ) ; soundFeatureSet . Interpolate ( numberOfValuesPerFeature) ; soundFeatureSetList . Add ( soundFeatureSet) ; } SoundFeatureSet averageSoundFeatureSet = SoundFeatureSet . GenerateAverage ( soundFeatureSetList) ; averageSoundFeatureSet . Information = name ; i f ( averageSoundFeatureSetList == n u l l ) { averageSoundFeatureSetList = new L i s t <SoundFeatureSet >() ; } averageSoundFeatureSetList . Add ( averageSoundFeatureSet) ; averageSoundFeatureSetList . Sort ( ( a , b ) => a . Information . CompareTo ( b . Information ) ) ; OnAvailableSoundsChanged ( ) ; } c 2017, Mattias Wahde, [email protected] 112 CHAPTER 7. SPEECH RECOGNITION Listing 7.2: The RecognizeSingle method in the IsolatedWordRecognizer class. p u b l i c o v e r r i d e R e c o g n i t i o n R e s u l t RecognizeSingle (WAVSound sound ) { sound . SubtractMean ( ) ; double startTime = sound . GetFirstTimeAboveThreshold( 0 , soundExtractionMovingAverageLength , soundExtractionThreshold) ; double endTime = sound . GetLastTimeAboveThreshold( 0 , soundExtractionMovingAverageLength , soundExtractionThreshold) ; WAVSound extractedInstance = sound . Extract ( startTime , endTime ) ; extractedInstance . PreEmphasize ( preEmphasisThresholdFrequency) ; WAVFrameSet frameSet = new WAVFrameSet( extractedInstance , frameDuration , frameShift ) ; frameSet . ApplyHammingWindows( alpha ) ; SoundFeatureSet soundFeatureSet = new SoundFeatureSet ( ) ; L i s t <SoundFeature> autoCorrelationFeatureList = frameSet . GetAutoCorrelationSeries( ” A u t o C o r r e l a t i o n ” , autoCorrelationOrder) ; soundFeatureSet . FeatureList . AddRange ( autoCorrelationFeatureList) ; L i s t <SoundFeature> lpcAndCepstralFeatureList = frameSet . GetLPCAndCepstralSeries( ”LPC” , lpcOrder , ” C e p s t r a l ” , cepstralOrder ) ; soundFeatureSet . FeatureList . AddRange ( lpcAndCepstralFeatureList) ; SoundFeature relativeNumberOfZeroCrossingsFeature = frameSet . GetRelativeNumberOfZeroCrossingsSeries( ”RNZC” ) ; soundFeatureSet . FeatureList . Add ( relativeNumberOfZeroCrossingsFeature) ; soundFeatureSet . SetNormalizedTime ( ) ; soundFeatureSet . Interpolate ( numberOfValuesPerFeature) ; R e c o g n i t i o n R e s u l t recognitionResult = new R e c o g n i t i o n R e s u l t ( ) ; recognitionResult . 
SoundFeatureSet = soundFeatureSet ; i f ( averageSoundFeatureSetList != n u l l ) { f o r e a c h ( SoundFeatureSet averageSoundFeatureSet i n averageSoundFeatureSetList) { double deviation = SoundFeatureSet . GetDeviation ( averageSoundFeatureSet , soundFeatureSet , weightList ) ; s t r i n g soundName = averageSoundFeatureSet . Information ; recognitionResult . DeviationList . Add ( new Tuple<s t r i n g , double >(soundName , deviation ) ) ; } recognitionResult . DeviationList . Sort ( ( a , b ) => a . Item2 . CompareTo ( b . Item2 ) ) ; } r e t u r n recognitionResult ; } c 2017, Mattias Wahde, [email protected] CHAPTER 7. SPEECH RECOGNITION 113 Figure 7.3: The speech recognizer tab of the IWR application. Here, the time series for the third cepstral coefficient (average and standard deviation) are shown for two of the stored words, namely end and yes. 7.4 Demonstration applications Two applications have been written for the purpose of demonstrating how the SpeechRecognitionLibrary can be used, namely (i) an Isolated word recognizer (IWR) application that allows the user to train a speech recognizer using a set of instances for each word, and then to use the speech recognizer either by loading a sound instance from a file or by recording it; and (ii) a Listener application, which continuously records from a microphone and applies speech recognition whenever a new sound is available. 7.4.1 The IWR application Figs. 7.3 and 7.4 show the GUI of the IWR application. The form contains a tab control with two tabs, a speech recognizer tab and a usage tab. In order to train a speech recognizer, the user first sets the appropriate parameter values for preprocessing and feature extraction (which then remain fixed, regardless of how many words are added to the database). Next, a new speech recognizer is generated, and the user can then train it by loading a set of instances for each word and forming the time normalized and resampled feature time series as described above. In Fig. 7.3, the speech recognizer has been trained on the c 2017, Mattias Wahde, [email protected] 114 CHAPTER 7. SPEECH RECOGNITION Figure 7.4: The usage tab of the IWR application. The user has recorded the word yes, and then applied the speech recognizer, which correctly identified the word. The graph in the lower right part of the figure shows the first autocorrelation coefficient time series for the recorded sound. words back, end, next, no, and yes. Once a few words have been added to the database, the user can view the time series (averages and standard deviation) for each feature, and for each stored word. The figure shows two time series, namely for the third cepstral coefficient for the two (selected) words, end and yes. This program assumes that the recorded sound consists of a single word, possibly with some periods of silence in the beginning and end of the recording. Thus, it does not make any attempt to distinguish, for example a true recorded sound from noise. Fig. 7.4 shows the usage tab page. Here, the user has recorded a word, namely yes, and then applied the speech recognizer, which correctly identified the word. Here, the recognition threshold was set to 0.0333, and the distance dmin (obtained for the word yes) was 0.0219. Quite correctly, no other stored word reached a distance value below the threshold. The form also shows a plot of the feature time series for the recorded sound. 
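To summarize how these pieces fit together outside the GUI, the sketch below indicates how the IsolatedWordRecognizer might be used programmatically, assuming references to the SpeechRecognitionLibrary and the AudioLibrary. It is further assumed that a recognizer instance has already been created and configured, that lists of recorded WAVSound instances are available for each word (obtained, for example, via the WAVRecorder or by loading WAV files), and that a recognition threshold has been chosen; all variable names are placeholders used for illustration only.

using System;
using System.Collections.Generic;

public static class RecognizerUsageSketch
{
    public static void RunExample(IsolatedWordRecognizer recognizer,
                                  List<WAVSound> yesInstances,
                                  List<WAVSound> noInstances,
                                  WAVSound recordedSound,
                                  double recognitionThreshold)
    {
        // Training: one call per word, each with a list of instances of that word.
        recognizer.AppendSound("yes", yesInstances);
        recognizer.AppendSound("no", noInstances);

        // Recognition: the deviation list is sorted in ascending order, so the
        // first element corresponds to the best-matching stored word.
        RecognitionResult result = recognizer.RecognizeSingle(recordedSound);
        Tuple<string, double> best = result.DeviationList[0];
        if (best.Item2 < recognitionThreshold)
        {
            Console.WriteLine("Recognized word: " + best.Item1);
        }
        else
        {
            // The minimum distance exceeds the threshold T: no result is produced.
            Console.WriteLine("No word recognized.");
        }
    }
}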
7.4.2 The Listener application

As mentioned above, the Listener application records continuously and, moreover, is able to act as a client connecting to an agent program as described in Chapter 2. Thus, the output of the program is a string representation of the recognized word, along with a time stamp, making it possible for the agent to determine the appropriate response, if any.

Figure 7.5: The GUI of the Listener application. In the situation shown here, the incoming sound has been split at six split points (shown as yellow vertical lines), and the listener has recognized the word back, ignoring the noise seen between the first two split points.

However, for this application, it is not sufficient to use the simple approach employed in the IWR application, where it was assumed that the recorded sound contains precisely one word: In continuous recording, even though the WAVRecorder that is responsible for the actual recording does have a limited memory, its current recording may still contain several words and also partial words, i.e. a word that the speaker has begun, but not yet finished, uttering. Note also that the recording buffer does have a certain size, so even after the user has completely uttered a given word, there is a (small) delay until the corresponding sound samples are available to the speech recognizer. Moreover, when recording sounds continuously, it is inevitable that there will be occasional noise in the signal. Thus, some form of intelligent processing is required.

In the Listener program, one such procedure has been implemented. Here, the incoming sound is split into pieces by considering short snippets (typical duration: 0.02 s) and then defining split points at those snippets that contain only silence (based on a given threshold). This gives a set of k split points. The program then builds all possible sounds such that the start of the sound occurs at split point i, i = 1, \ldots, k - 1, and the end at split point j, j = i + 1, \ldots, k. Next, the speech recognizer is applied to all such sounds, resulting in a set of d_min values, one for each sound. The word identified (if any) for the sound corresponding to the lowest value of d_min (provided that this value is below the threshold T) is then taken as the output and is sent to the agent program (if the latter is available).

An example, which also illustrates the GUI of the Listener application, is shown in Fig. 7.5. Here, the word back is present, but so is a small sequence of noise preceding the word. There is a total of six split points, resulting in 15 different possible sounds according to the procedure just described. Of those combinations, the one involving the third and the sixth (last) split points gave the lowest d_min, which also was below the detection threshold for the word back, resulting in recognition of this word, as can be seen in the figure.

Chapter 8

Internet data acquisition

In addition to sensing its immediate surroundings using cameras and microphones, as described in some of the previous chapters, an IPA may also need to access information from the internet.
For example, one can envision an agent with the task of downloading news or weather reports from the internet, and then presenting the results, perhaps along with pictures and videos, to the user, either spontaneously when some news item (of interest to the user) appears, or upon request from the user. The procedure of accessing data from the internet can be divided into two logical steps: First, the agent must download the raw data. Next, it must parse the raw data to produce a meaningful and easily interpretable result. Of course, neither an agent nor its user(s) can control the formatting of a given web page. Thus, any specific method for parsing the contents of a general web page is likely to be brittle and prone to error if the structure of the web page is changed, for some reason. However, there are sites (especially for news, weather, and similar topics) that operate as so called Really simple syndication (RSS) feeds and are formatted in a well-defined manner, so that they can easily be parsed. At this point, it is important to note that not all sites welcome (or even allow) access by artificial agents. In fact, some even take countermeasures such as requiring information to confirm that the user is indeed human, or banning access from an IP number that tries to reload a page too frequently. Of course, one must respect those restrictions, and only let an agent access sites that allow downloads by artificial agents. Here, again, the RSS feeds are important, since they are specifically designed for repeated, automatic access, and therefore rarely carry restrictions of the kind just described. For the IPAs considered here, a specific library that will be described next, namely the InternetDataAcquisition library has been implemented for downloading and parsing information from web sites. 117 118 CHAPTER 8. INTERNET DATA ACQUISITION Listing 8.1: The DownloadLoop method in the HTMLDownloader class. p r i v a t e void DownLoadLoop ( ) { while ( running ) { Stopwatch stopWatch = new Stopwatch ( ) ; stopWatch . Start ( ) ; using ( WebClient webClient = new WebClient ( ) ) { try { s t r i n g html = webClient . DownloadString( url ) ; DateTime dateTime = DateTime . Now ; Boolean newDataStored = StoreData ( dateTime , html ) ; i f ( newDataStored ) { OnNewDataAvailable ( ) ; } } c a t c h ( WebException e ) { running = f a l s e ; OnError ( e . Status . ToString ( ) ) ; } } stopWatch . Stop ( ) ; double elapsedSeconds = stopWatch . ElapsedTicks / ( double ) Stopwatch . Frequency ; i n t elapsedMilliseconds = ( i n t ) Math . Round ( elapsedSeconds∗ MILLISECONDS_PER_SECOND) ; i n t sleepInterval = millisecondDownloadInterval − elapsedMilliseconds ; i f ( sleepInterval > 0 ) { Thread . Sleep ( sleepInterval ) ; } i f ( ! runRepeatedly ) { running = f a l s e ; } } } 8.1 The InternetDataAcquisition library The classes in this library provide implementations of the two steps described above, namely downloading and then parsing data. 8.1.1 Downloading data Most of the low-level code required to access web pages is available in the standard libraries distributed with C#. Thus, the user can focus on more highlevel aspects of data downloads. Here, two data downloaders have been implemented, the HTMLDownloader and the CustomXMLReader. The HTMLDownloader class This class allows repeated downloads of the raw HTML code of any web page, using the WebClient class available in the System.Net namespace. Here, the HTML code is placed in a single string. 
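Before turning to the repeated, threaded downloads in Listing 8.1, it may be helpful to see the WebClient call in isolation. The snippet below simply downloads the raw HTML code of a single page into a string; the URL is a placeholder used for illustration only.

using System;
using System.Net;

public static class SingleDownloadSketch
{
    public static void Run()
    {
        string url = "http://example.com/index.html";  // Placeholder URL.
        using (WebClient webClient = new WebClient())
        {
            // DownloadString returns the entire HTML code of the page as a single string.
            string html = webClient.DownloadString(url);
            Console.WriteLine("Downloaded " + html.Length + " characters.");
        }
    }
}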
The download is handled by a separate thread that also is responsible for storing the downloaded string c 2017, Mattias Wahde, [email protected] CHAPTER 8. INTERNET DATA ACQUISITION 119 (along with a time stamp) if it differs from the most recent already downloaded string. Listing 8.1 shows the Download loop method executing in the download thread. Two event handlers are defined, one for signalling the arrival of new data and one for indicating download errors. Unless the runRepeatedly variable is set to false, download attempts are carried out with a user-specified frequency. Here, again, it is important to note that not all web sites allow this kind of repeated, automatic downloads. A specific example is Google, which actively prevents a user from accessing (for example) image search results by direct download of the (links in the) HTML code of the search page. However, Google does allow access via their own C# API, which can be downloaded from their web site. Thus, in this particular case, it is still possible to obtain the information without violating any rules, but this is not the case for all web pages. It is the user’s responsibility to check any restrictions on automatic downloads before attempting to apply such methods. The RSSDownloader class As mentioned above, some sites are specifically designed for repeated automatic downloads. RSS feeds constitute an important special case. The RSSDownloader class has been written specifically to deal with this case. RSS pages are generated in XML format and can thus be accessed using the XmlTextReader class, available in the System.Xml namespace. Among the various information items specified in an RSS item is the (publish) date of the item in question. Somewhat surprisingly, the standard Xml (text) reader class (i.e. the XmlTextReader) does not handle all date formats. Two common ways to format a date (that, along with several others, are handled by the standard XML reader) are ddd, dd MMM yyyy hh:mm:ss (example: Fri, 21 Oct 2016 07:14:17) and ddd, dd MMM yyyy hh:mm:ss ’GMT’ (example: Fri, 21 Oct 2016 07:18:53 GMT)1 However, a format such as ddd MMM dd yyyy hh:mm:ss ’GMT+0000’ (example: Fri Oct 21 2016 07:28:19 GMT+0000), which (along with several other formats) often occur in RSS feeds that are not based in the US, cannot be handled by the standard XML reader. For that reason, an alternative approach is required. Of course, one could just read the web page using the HTMLDownloader described above and then write a custom parser (see below). However, a better approach is simply to write a custom XML reader class that implements all the aspects of the standard XML reader, while also handling different date formats. This is the approach chosen here, with the implementation of the CustomXmlReader. This 1 In C#, there are many different ways of formatting a DateTime or DateTimeOffset instances. See e.g. MSDN for more information. c 2017, Mattias Wahde, [email protected] 120 CHAPTER 8. INTERNET DATA ACQUISITION Listing 8.2: The RunLoop method in the RSSDownloader class. The ProcessFeed method, not shown here, simply stores the various items in the SyndicationFeed in a thread-safe manner, to allow asynchronous access. p r i v a t e void RunLoop ( ) { while ( running ) { Stopwatch stopWatch = new Stopwatch ( ) ; stopWatch . Start ( ) ; using ( CustomXmlReader xmlReader = new CustomXmlReader ( url ) ) { xmlReader . SetCustomDateTimeFormat( customDateTimeFormat) ; xmlReader . 
Read ( ) ; S y n d i ca t i o n F e e d feed = S y n d i ca t i o n F e e d . Load ( xmlReader ) ; ProcessFeed ( feed ) ; } stopWatch . Stop ( ) ; double elapsedSeconds = stopWatch . ElapsedTicks / ( double ) Stopwatch . Frequency ; i n t elapsedMilliseconds = ( i n t ) Math . Round ( elapsedSeconds ∗ MILLISECONDS_PER_SECOND) ; i n t sleepInterval = millisecondDownloadInterval − elapsedMilliseconds ; i f ( sleepInterval > 0 ) { Thread . Sleep ( sleepInterval ) ; } } } class operates precisely as the standard XML reader, except that the user can also specify the date format, which is required in cases where it differs from the format that can be handled by the standard XML reader. The RSSDownloader class makes use of the CustomXmlReader to download RSS feeds at regular intervals, and to store all the items for later access. The parsing of an RSS feed will be described below. Listing 8.2 shows the thread (in the RSSDownloader) responsible for executing repeated downloads of RSS feeds. 8.2 Parsing data Parsing is the process by which an encoded piece of information, such as a web page in HTML format, is converted into standard, readable text. Here, two approaches will be described briefly, namely general parsing of HTML code, and parsing of the XML code in an RSS feed. 8.2.1 The HTMLParser class This class provides generic processing of any information (initially) stored in a single string. The class contains a Split method that simply splits the string (in the first call to the method, after assigning the initial string) or the list of c 2017, Mattias Wahde, [email protected] CHAPTER 8. INTERNET DATA ACQUISITION 121 Listing 8.3: Code snippet for setting up and starting an RSSDownloader. The values of the three parameters (url, dateFormat, and downloadInterval) are obtained, for example, via text boxes in the GUI of the application in question. ... rssDownloader = new RSSDownloader ( url ) ; rssDownloader . SetCustomDateTimeFormat( dateFormat ) ; rssDownloader . DownloadInterval = downloadInterval ; rssDownloader . Start ( ) ; ... strings resulting from an earlier application of the same method. The user provides the method with a list of so called split strings. Whenever such a split string is encountered, the string in which it was found is split into two (and the split string itself is removed). A typical HTML page contains characters (HTML tags) used when formatting the HTML code for display in a browser, such as, for example, <p> and </p> to indicate the start and end of a paragraph, or <b> and </b> to indicate the start and end of the use of a bold font. In order to convert an HTML page to plain text, a common step is thus to remove such tags, by applying the appropriate call to the Split method. Other methods are also defined in this class, for example to extract all strings fulfilling some conditions. For instance, one may wish to extract all web page links to PDF documents, by finding strings that start with http:// and end with .pdf. 8.2.2 RSS feeds As mentioned above, RSS feeds are in XML format and, more specifically, define certain fields that can easily be accessed. When the contents of a (custom) Xml reader are passed to a SyndicationFeed instance, the result is a list of object of type SyndicationItem that, in turn, defines several fields, such as Title, Summary, PublishDate2 etc. Once the syndication feed items have been generated, very little additional parsing is required. Thus, no specific class has been written for parsing RSS feeds. 
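As a minimal, self-contained illustration of this last point, the sketch below reads a feed with the standard XmlReader (which is sufficient as long as the feed uses a date format that the standard reader accepts; otherwise, the CustomXmlReader described above would be used instead) and prints the publish date and title of each item. The feed URL is a placeholder, and a reference to System.ServiceModel (which contains the System.ServiceModel.Syndication namespace) is assumed.

using System;
using System.ServiceModel.Syndication;
using System.Xml;

public static class FeedParsingSketch
{
    public static void Run()
    {
        string url = "http://example.com/rss.xml";  // Placeholder feed URL.
        using (XmlReader xmlReader = XmlReader.Create(url))
        {
            // Load the feed and print the publish date and title of each item.
            SyndicationFeed feed = SyndicationFeed.Load(xmlReader);
            foreach (SyndicationItem item in feed.Items)
            {
                Console.WriteLine(item.PublishDate.ToString() + ": " + item.Title.Text);
            }
        }
    }
}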
A usage example, in the context of the RSSReader application, will be given in the next section. Note that, in the SyndicationItem class, the PublishDate is defined as a DateTimeOffset (rather than a DateTime), the difference being that the DateTimeOffset measures coordinated universal time (UTC) and also, for example, makes comparisons between instances based on UTC, whereas DateTime generally refers to the date and time in a given time zone.

Figure 8.1: The GUI of the RSSReader application. Here, a single news item, with its publish date and title shown in green, arrived between the two most recent updates of the RSSDownloader.

8.3 The RSSReader application

As its name implies, the RSSReader application reads from an RSS feed, and displays the publish date and the title of each item on the screen. The program defines an RSSDownloader that carries out downloads in its own thread, with a user-specified interval between downloads. Moreover, the program is capable of sending the corresponding information to the agent program, if the latter is available. A separate thread (independent of the RSSDownloader) handles the display of news items on the screen. Listing 8.3 shows the code snippet for setting up and starting the RSSDownloader. Fig. 8.1 shows the GUI of the RSSReader application. New items, i.e. those that have been published since the last update of the RSSDownloader, are shown in green, whereas older items are shown in gray.

Appendix A

Programming in C#

In this appendix, several important aspects of C# .NET are introduced. The aim is not to give a complete description of either the C# language or its IDE, but rather to describe some concepts that anyone developing IPAs in C# (for example using the IPA libraries) must know. In addition to reading the text below, the reader should also study the various demonstration applications distributed together with the IPA libraries. There are also several excellent books on C#. In addition, the answers to many questions regarding C# can be found either at the Microsoft Developer Network (MSDN) web site (msdn.microsoft.com) or in various internet fora such as StackOverflow (stackoverflow.com). In fact, given the number of skilled people working with C# .NET, finding an answer to a given question is not so difficult; the problem is to ask the right question, something that requires a bit of experience.

The first three sections below describe basic, fundamental concepts of C#, whereas the remaining sections describe more advanced topics. As mentioned in Chapter 1, C# .NET is a part of Microsoft's Visual Studio. The illustrations below will be given in the 2010 version of Visual Studio, running under Windows 7. However, the appearance and use of the IDE is essentially the same for newer versions of Visual Studio (e.g. the 2015 version) and for newer versions of Windows (e.g. Windows 10). The code in the IPA libraries has been tested under Windows versions 7, 8, and 10, and Visual Studio versions 2010 and 2015. A detailed introduction to the C# IDE can be found at MSDN (https://msdn.microsoft.com/en-us/library/ms173064(v=vs.90).aspx).

Figure A.1: The window of the C# IDE, showing (1) the Solution Explorer, (2) the Windows Form Designer and Code Editor, (3) the Properties panel and (4) the Toolbox.
A.1 Using the C# IDE In .NET, the source code of an application is contained in one or several projects that, in turn, are contained in a solution. A specific example is the SpeechProcessing solution distributed along with the IPA libraries. This solution contains two applications, i.e. projects that define a standalone executable (an .exe file) but also many other projects (from the IPA libraries) in the form of class libraries. A class library is simply a set of classes (see Sect. A.2 below) that can be used in one or several applications. When the C# IDE is opened, a window similar to the one shown in Fig. A.1 appears4 . In the specific case shown in the figure, the user has opened the DemonstrationSolution containing the source code described in this appendix. Now, in the IDE main window, there are many subwindows that assist the user with various aspects of code development. Some of the most important subwindows have been highlighted in the figure, namely (1) the Solution Explorer, (2) the Windows Forms Designer and Code Editor, (3) the Properties Window and (4) the Toolbox. As can be seen in the Solution Explorer, the solution contains several projects. Most of those projects are applications, but one (the ObjectSerializer library) is a class library used in connection with serialization; see Sect. A.7. The user can start an application by right-clicking on a project, and then selecting 4 The exact appearance of the IDE is somewhat version-dependent and can also be customized to fit the user’s preferences. c 2017, Mattias Wahde, [email protected] APPENDIX A. PROGRAMMING IN C# 125 Figure A.2: The window of the C# IDE after the user has opened the code associated with the form of the FirstExample application. Debug - Start new instance. In every solution (containing at least one executable application), exactly one project is the startup project, i.e. the application that will run by default, if the user presses the green arrow in the tool strip (near the top of the window) or simply presses F5. In this case, FirstExample is the startup project. The user can easily change the startup project, by right-clicking on any application and selecting Set as StartUp Project. The Windows Forms Designer allows the user to generate the layout of the so called forms (i.e. the windows) of an application. A form is a special case of a control, i.e. a graphical component. The Properties window allows the user to set various parameters associated with a control. In Fig. A.1, the user has scrolled down to view the Text property that determines the caption of the application’s form. The Toolbox allows the user to select and add additional controls (e.g. buttons, text boxes etc.) to a form. The window referred to as the Windows Forms Designer above is also used as Code Editor. In Fig. A.2, the user has opened the code associated with the main form of the FirstExample, by right-clicking on FirstExampleMainForm.cs in the solution explorer, and selecting View Code. The code can then be edited as necessary. Note also that the IDE can help the user by auto-generating some parts of the code. For example, in the case of a button, some action should be taken when the user clicks on it. If a button is double-clicked in the Windows Forms Editor, the IDE will generate skeleton code for the method associated with the button click. The user must then fill the method with the c 2017, Mattias Wahde, [email protected] 126 APPENDIX A. PROGRAMMING IN C# Listing A.1: The code in the FirstExample main form. 
using using using using using using using using System ; System . Collections . Generic ; System . ComponentModel ; System . Data ; System . Drawing ; System . Linq ; System . Text ; System . Windows . Forms ; namespace FirstExample { p u b l i c partial c l a s s FirstExampleMainForm : Form { p u b l i c FirstExampleMainForm ( ) { InitializeComponent ( ) ; } p r i v a t e s t r i n g GenerateResponse ( ) { s t r i n g hello = ” Hello u s e r . Today i s a ” ; s t r i n g dayOfWeek = DateTime . Now . DayOfWeek . ToString ( ) ; hello += dayOfWeek + ” . ” ; r e t u r n hello ; } p r i v a t e void helloButton_Click( o b j e c t sender , EventArgs e ) { s t r i n g response = GenerateResponse ( ) ; responseTextBox . Text = response ; } p r i v a t e void exitButton_Click( o b j e c t sender , EventArgs e ) { A p p l i ca t i o n . Exit ( ) ; } } } necessary code for responding appropriately to the user’s action. Note that every control is in fact associated with a large number of events, of which the button click is one example. When running an application from within the IDE (for example by pressing F5) it is possible also to pause the code using breakpoints. A breakpoint (shown as a red filled disc in the IDE) can be inserted either by clicking on the left frame of the Code Editor, or by right-clicking on a line of code in the Code Editor and selecting BreakPoint - Insert Breakpoint. When the application reaches a breakpoint, execution is paused. If F5 is pressed, the application then continues to the next breakpoint (if any). One can also use the F10 (step over) and F11 (step into) keys to step through the code. When execution is paused, the corresponding line of code is shown in yellow, and the user can investigate the values of variables etc., by placing the mouse over c 2017, Mattias Wahde, [email protected] APPENDIX A. PROGRAMMING IN C# 127 a given statement in the code. The entire listing for the FirstExample main form is given in Listing A.1. The listing begins with a set of using clauses, which are specifications of class libraries necessary for the code associated with the control. In this particular case, these clauses all involve code included in the System namespace. However, in many cases, one may need to use code that is not included in the standard distribution of C#. A specific example is the use of the ObjectSerializer library in the SerializationExample (see Sect. A.7 below). In such cases, one must first add a reference before instructing C# that the code in a specific namespace should be used. In order to do so, the user must right-click on the folder marked References in the solution explorer (for the project in question), and then select the appropriate file. Once the reference has been added, the corresponding code (or, to be exact, its public methods and properties; see Sect. A.2 below) will be available for use. The remainder of the listing defines the methods associated with the main form of the FirstExample. Summarizing briefly, the code responds to a click on the Hello button, by printing, in the text box, the text Hello user followed by a specification of the current weekday. If instead the user clicks the Exit button, the application terminates. A.2 Classes C# .NET is an object-oriented programming language (as are many other modern programming languages), in which one defines and uses objects that, in turn, are instances of classes. In general, a class contains the fields (variables) and methods relevant for objects of the type in question. 
Object-oriented programming is a very large topic and, as mentioned earlier in this chapter, here only a very brief description will be given. As a specific example, consider flat, two-dimensional shapes, such as rectangles, circles, triangles etc. Such shapes share some characteristics. For example, they all have a certain surface area, even though its detailed computation varies between the different shapes. A common approach is to define a so called abstract (base) class, from which other classes are derived. Consider now the ClassExample application. Here, a simple base class has been defined5 for representing shapes. Moreover, a derived class has been defined as well (see below). As can be seen in Listing A.2, the base class (Shape) contains one field, namely hasCorners. Note that, by convention, 5 In order to add a class to a project, one right-clicks on the application in the Solution Explorer, and then one selects Add - Class.... To rename a class, one should right-click on the class and select Rename. Finally, to rename a field, one should right-click on it in the Code Editor, and then select Refactor - Rename.... the IDE then makes sure that all instances of the field are correctly renamed. c 2017, Mattias Wahde, [email protected] 128 APPENDIX A. PROGRAMMING IN C# Listing A.2: The (abstract) Shape class, from which classes that implement specific shapes are derived. p u b l i c a b s t r a c t c l a s s Shape { p r o t e c t e d Boolean hasCorners ; p u b l i c a b s t r a c t double ComputeArea ( ) ; p u b l i c Boolean HasCorners { g e t { r e t u r n hasCorners ; } } } Listing A.3: The Rectangle class, derived from the Shape class. p u b l i c c l a s s Re ct a n g l e : Shape { p r i v a t e double sideLengthX ; p r i v a t e double sideLengthY ; p u b l i c Re ct a n g l e ( ) // C o n s t r u ct o r { hasCorners = t r u e ; } p u b l i c o v e r r i d e double ComputeArea ( ) { double area = sideLengthX ∗ sideLengthY ; r e t u r n area ; } p u b l i c double SideLengthX { g e t { r e t u r n sideLengthX ; } s e t { sideLengthX = value ; } } p u b l i c double SideLengthY { g e t { r e t u r n sideLengthY ; } s e t { sideLengthY = value ; } } p u b l i c double Area { g e t { r e t u r n sideLengthX ∗ sideLengthY ; } } } c 2017, Mattias Wahde, [email protected] APPENDIX A. PROGRAMMING IN C# 129 fields always start with a small letter. It also defines an abstract method called ComputeArea. The class itself is marked as abstract, as is the method just mentioned, meaning that this method must be implemented in the classes derived from the Shape class. The method is also public meaning that it is visible in other classes (for example, but not limited to, classes derived from the Shape class). This method should return the area as its output, which would be a number of type double, and this is also specified in the code. In general, method names begin with a capital letter. Note also that since a method is intended to actively carry out some action, in this case computing the area of a shape, the name should reflect this by including a verb. Thus, AreaComputation would not be a suitable name for this method. The field is listed as protected, meaning that it is visible to any classes derived from the Shape class, but not to other classes. The Shape class also defines a property which is public, meaning that it is visible to other classes. In this particular case, the property is very simple, but in other cases a property may involve more complex operations, including method calls. 
By convention, properties always begin with a capital letter. Listing A.3 shows a derived class, namely Rectangle. The first line in the class indicates that the Rectangle class is derived from the Shape class, and therefore can access its (protected) fields. Note that the Shape class is not explicitly derived from any class but it is implicit in C# that all classes are derived from a generic base class called Object. In this case, each derived class must define additional fields that are specific to the shape in question, and which are then used in the respective ComputeArea methods, in order to compute the area. Note that the field hasCorners is visible to the derived classes, since it is marked as being protected. The fields introduced in the derived classes are marked as private, meaning that they are not visible to other classes. The use of these keywords (private, protected, public etc.) makes it possible for a developer to determine which parts should be visible to other users, who may not perhaps have access to the source code, but instead only a dynamic-link library (DLL)6 . An external user will only be able to access public methods and properties. The derived class also has a constructor which is called whenever a corresponding object (i.e. an instance of the class) is generated. In this simple case, the constructor simply sets the parameter that determines whether or not the shape in question has any corners. This parameter that is not, of course, needed for the computation of the area; it is included only to demonstrate the use of fields in derived classes. The ComputeArea method of the Rectangle 6 During compilation of a C# application, the various class libraries are compiled into DLLs so that they can be used by the application. In cases where one does not have access to the source code of a class library, one can still make use of the class library, provided that one has the corresponding DLL. If so, one can add a reference to the DLL, just as one would add a reference to a class library. c 2017, Mattias Wahde, [email protected] 130 APPENDIX A. PROGRAMMING IN C# Listing A.4: A simple example showing the use of the Rectangle class. First, a rectangle with side lengths 3 and 2 is generated. Next, its area is obtained and printed. Then, the longer of the two sides is shortened to 1 length unit, and the (new) area is again obtained and printed. p r i v a t e void runExampleButton_Click( o b j e c t sender , EventArgs e ) { Re ct a n g l e rectangle = new Re ct a n g l e ( ) ; rectangle . SideLengthX = 3 ; rectangle . SideLengthY = 2 ; double area = rectangle . Area ; classExampleTextBox . Text = ” S i d e l e n g t h s : ” + rectangle . SideLengthX . ToString ( ) + ” , ” + rectangle . SideLengthY . ToString ( ) + ” , Area : ” + area . ToString ( ) + ”\ r \n” ; rectangle . SideLengthX = 1 ; area = rectangle . ComputeArea ( ) ; classExampleTextBox . Text += ” S i d e l e n g t h s : ” + rectangle . SideLengthX . ToString ( ) + ” , ” + rectangle . SideLengthY . ToString ( ) + ” , Area : ” + area . ToString ( ) + ”\ r \n” ; } class is prefixed with the keyword override, meaning that it overrides (replaces) the abstract method defined in the base class. Note that the properties of the Rectangle class are a bit more complex than for the base class. Here, one can both retrieve the side lengths (x and y) and also set their values. Moreover, an Area property is defined, which computes the area. 
Note that this property is redundant: One might as well use the ComputeArea method to obtain the area. The property has been introduced here only to illustrate a more complex case, where a certain computation (beyond mere assignment) is carried out in a property, and where the property, in fact, does not have a corresponding field (e.g. area). Here, it is better not to define an area field, particularly if the user would be allowed to set it directly. In that case a user might, say, update the side lengths and then incorrectly set the area! It is not possible to make such a mistake with the code shown in Listing A.3: The user can access or compute the area, but cannot set it directly. A suitable exercise for the reader is now to implement, say, a Circle class, with the corresponding fields, the ComputeArea method, and the Area property. Listing A.4 shows a simple method (the button click event handler in the form (window) of the application) that instantiates a rectangle shape, computes and prints the area, then changes the length of one side, and then computes (in a different way) and prints the area again. Clearly, one can define many other fields and methods relevant to shapes, for example fields that set the color, position, orientation etc. of a shape, and methods that, for instance, grow, shrink, move, or rotate the shape. As another exercise, the reader should add a few additional fields and their respective properties, along with a few methods of the kind just mentioned. Note also that fields can themselves consist of objects. In this example, all fields were so called simple types, i.e. types that are available as an integral c 2017, Mattias Wahde, [email protected] APPENDIX A. PROGRAMMING IN C# 131 Listing A.5: An example of the use of generic lists, in this case a simple list of integers. The ShowList method (not shown here) simply prints the elements of the list to the screen, along with a comment. p r i v a t e void runExample1Button_Click( o b j e c t sender , EventArgs e ) { L i s t <i n t > integerList1 = new L i s t <i n t >() ; // => { } integerList1 . Add ( 5 ) ; // => {5} integerList1 . Add ( 8 ) ; // => { 5 , 8 } integerList1 . Add( −1) ; // => { 5 , 8 , 1 } ShowList ( ” Addition o f elements : ” , integerList1 ) ; integerList1 . Sort ( ) ; // => {−1, 5 , 8} ShowList ( ” S o r t i n g : ” , integerList1 ) ; integerList1 . Reverse ( ) ; // => { 8 , 5 −1} ShowList ( ” Re v e r s a l : ” , integerList1 ) ; integerList1 . Insert ( 0 , 3 ) ; // => { 3 , 8 , 5 , −1} ShowList ( ” I n s e r t i o n : ” , integerList1 ) ; integerList1 . RemoveAt ( 2 ) ; // => {3 ,8 , −1} ShowList ( ”Removal a t index 2 : ” , integerList1 ) ; L i s t <i n t > integerList2 = integerList1 ; // i n t e g e r L i s t 2 p o i n t s t o i n t e g e r L i s t 1 ! ShowList ( ” P o i n t e r t o l i s t : ” , integerList2 ) ; integerList1 [ 1 ] = 2 ; // => Assigns 2 t o i n t e g e r L i s t 1 [ 1 ] AND i n t e g e r L i s t 2 [ 1 ] // ( both a r e t h e same l i s t ! ) ShowList ( ” L i s t 1 , element 1 modified : ” , integerList1 ) ; ShowList ( ” . . . and l i s t 2 : ” , integerList2 ) ; L i s t <i n t > integerList3 = new L i s t <i n t >() ; // A new i n s t a n c e . . . f o r e a c h ( i n t element i n integerList1 ) { integerList3 . Add ( element ) ; } integerList3 [ 1 ] = 7 ; // => Assigns 7 t o i n t e g e r L i s t 3 [ 1 ] but NOT i n t e g e r L i s t 1 [ 1 ] ShowList ( ” L i s t 1 , again : ” , integerList1 ) ; ShowList ( ” . . . 
and l i s t 3 : ” , integerList3 ) ; } part of the C# language. However, one could very well define a class containing fields that are instances of any of the shape classes just defined, or even lists of such classes (see also the next section). A.3 Generic lists The .NET framework includes the concept of generic lists, i.e. lists containing instances of any kind of object, and with operations that are common to a list regardless of its contents, such as addition, insertion, removal etc. Moreover, there are generic operators for certain common operations, such as sorting. An example showing some of the many uses of generic lists can be found in the GenericListExample application. This application contains four buttons, one for each example. The code for the first example (the leftmost button on the form) is shown in Listing A.5. In this case, a simple list of integers is generated, and it is then sorted and reversed. Next a new element is inserted (at index 0), and then the element at index 2 is removed. A new list is then generated that points to the first list, so that if one makes changes is one of the lists, those changes also affect the other list. Finally, a new list is generated as c 2017, Mattias Wahde, [email protected] 132 APPENDIX A. PROGRAMMING IN C# Listing A.6: The TestClass used in the second, third, and fourth examples. public class TestClass { p r i v a t e i n t integerField ; p r i v a t e double doubleField ; p u b l i c T e s t C l a s s Copy ( ) { T e s t C l a s s copiedObject = new T e s t C l a s s ( ) ; copiedObject . IntegerProperty = integerField ; copiedObject . DoubleProperty = doubleField ; r e t u r n copiedObject ; } p u b l i c s t r i n g AsString ( ) { s t r i n g objectAsString = integerField . ToString ( ) + ” ” + doubleField . ToString ( ) ; r e t u r n objectAsString ; } p u b l i c i n t IntegerProperty { g e t { r e t u r n integerField ; } s e t { integerField = value ; } } p u b l i c double DoubleProperty { g e t { r e t u r n doubleField ; } s e t { doubleField = value ; } } } a new instance, such that any changes made to it do not affect the other list. The situation becomes a bit more complex if the elements of a list are not simple types i.e. types such as int, double etc. Consider now the second example (second button from the left, on the form). In this case, a generic list of objects (of type TestClass) is defined. The definition of this simple class is given in Listing A.6. The class also contains an explicit Copy method, which generates a new instance identical to the one being copied7 In Example 1 above, sorting the list was easy, as the process of comparing two integers to determine which is one larger is, of course, well-defined. But what about the list of objects in Example 2? As shown in the code for this 7 Note that copying can be handled automatically (using the so called ICloneable interface), but one must be careful to distinguish between a shallow copy and a deep copy. In the case of a shallow copy, not all copied fields (except simple types) consist of new instances but instead references to instances in the original object. For this reason, it is often a good idea to write an explicit copying method, which copies the necessary fields as required by the application at hand. This is especially true in cases where the source code is provided, so that the programmer easily can see exactly what parts are being copied. c 2017, Mattias Wahde, [email protected] APPENDIX A. 
PROGRAMMING IN C# 133 Listing A.7: The two methods required for the second example. Here, a list (list1) of TestClass objects is generated, and the list is then sorted in two different ways. The ShowTestClassList method displays the elements of a list of TestClass objects on the screen. p r i v a t e void GenerateList1 ( ) { list1 = new L i s t <T e s t C l a s s >() ; T e s t C l a s s testObject1 = new T e s t C l a s s testObject1 . IntegerProperty = 4 ; testObject1 . DoubleProperty = 0 . 5 ; list1 . Add ( testObject1 ) ; T e s t C l a s s testObject2 = new T e s t C l a s s testObject2 . IntegerProperty = 2 ; testObject2 . DoubleProperty = 1 . 5 ; list1 . Add ( testObject2 ) ; T e s t C l a s s testObject3 = new T e s t C l a s s testObject3 . IntegerProperty = 5 ; testObject3 . DoubleProperty = −1.5; list1 . Add ( testObject3 ) ; T e s t C l a s s testObject4 = new T e s t C l a s s testObject4 . IntegerProperty = 2 ; testObject4 . DoubleProperty = −0.5; list1 . Add ( testObject4 ) ; } () ; () ; () ; () ; p r i v a t e void runExample2Button_Click( o b j e c t sender , EventArgs e ) { displayTextBox . Text = ” ” ; GenerateList1 ( ) ; ShowTestClassList( ” I n i t i a l l i s t ” , list1 ) ; list1 . Sort ( ( a , b ) => a . DoubleProperty . CompareTo ( b . DoubleProperty ) ) ; ShowTestClassList( ” L i s t s o r t e d ( DoubleProperty ) ” , list1 ) ; list1 = ( L i s t <T e s t C l a s s >)list1 . OrderBy ( a => a . IntegerProperty) . ThenBy ( b => b . DoubleProperty ) . ToList ( ) ; ShowTestClassList( ” L i s t s o r t e d ( I n t e g e r P r o p e r t y , then DoubleProperty ) ” , list1 ) ; } Listing A.8: An example of a shallow copy of a list of objects. p r i v a t e void runExample3Button_Click( o b j e c t sender , EventArgs e ) { displayTextBox . Text = ” ” ; GenerateList1 ( ) ; // See example 1 ShowTestClassList( ” L i s t 1 ” , list1 ) ; // Shallow copy list2 = new L i s t <TestClass>() ; list2 . Add ( list1 [ 0 ] ) ; list2 . Add ( list1 [ 1 ] ) ; list2 . Add ( list1 [ 2 ] ) ; list2 . Add ( list1 [ 3 ] ) ; ShowTestClassList( ” L i s t 2 ” , list1 ) ; list2 [ 0 ] . DoubleProperty = −1; // Changes l i s t 2 [ 0 ] AND l i s t 1 [ 0 ] . ShowTestClassList( ” L i s t 2 again ” , list2 ) ; ShowTestClassList( ” L i s t 1 again ” , list1 ) ; } c 2017, Mattias Wahde, [email protected] 134 APPENDIX A. PROGRAMMING IN C# Listing A.9: An example of a deep copy of a list of objects. p r i v a t e void runExample4Button_Click( o b j e c t sender , EventArgs e ) { displayTextBox . Text = ” ” ; GenerateList1 ( ) ; ShowTestClassList( ” L i s t 1 ” , list1 ) ; // Deep copy list3 = new L i s t <TestClass>() ; list3 . Add ( list1 [ 0 ] . Copy ( ) ) ; list3 . Add ( list1 [ 1 ] . Copy ( ) ) ; list3 . Add ( list1 [ 2 ] . Copy ( ) ) ; list3 . Add ( list1 [ 3 ] . Copy ( ) ) ; ShowTestClassList( ” L i s t 3 ” , list1 ) ; list3 [ 0 ] . DoubleProperty = −5; // Changes ONLY l i s t 3 [ 0 ] . ShowTestClassList( ” L i s t 3 again ” , list3 ) ; ShowTestClassList( ” L i s t 1 again ” , list1 ) ; } example (Listing A.7) one can certainly sort such a list as well, but one must first tell C# how it is to be sorted. Two sortings are carried out here: First, the list is sorted based on the values of the DoubleProperty. Next, the list is sorted first based on the values of the IntegerProperty, and then all elements that have the same value of the IntegerProperty are sorted on the basis of their DoubleProperty value. The reader should now click on the button marked Run example 2 to view the results. 
In the third and fourth examples, the difference between a shallow copy and a deep copy is illustrated. In the third example, a shallow copy is made: A new list is instantiated (i.e. it does not just point to the original list) but the elements of the list are not explicitly copied, but instead simply point to the elements of the original list. This means that if one changes a property in one of those elements in one of the lists (see Listing A.8), the corresponding property (of the element with the same index) in the original list changes as well. In the fourth example (see Listing A.9), by contrast, the elements of the new list are explicitly copied before being added. In this case, changing a property of an element in the new list does not change the corresponding property (of the element with the same index) in the original list. A.4 Threading The concept of (multi-)threading is crucial to all but the simplest applications. A program may start and run any number of threads, i.e. sequences of computational instructions that may share memory resources, but otherwise operate as independent units executing in parallel. On processors with multiple cores (i.e. all modern processors) different threads can run truly in parallel, on different cores. However, often the number of threads greatly exceeds the number c 2017, Mattias Wahde, [email protected] APPENDIX A. PROGRAMMING IN C# 135 Listing A.10: A method that (unwisely) runs a lengthy computation in the GUI thread. Note that the progress information will, in fact, only be shown at the very end of the computation. p r i v a t e void runInSingleThreadButton_Click( o b j e c t sender , EventArgs e ) { runInSingleThreadButton . Enabled = f a l s e ; runMultiThreadedButton . Enabled = f a l s e ; progressListBox . Items . Clear ( ) ; progressListBox . Items . Add ( ” S t a r t i n g ” ) ; f o r ( i n t k = 1 ; k <= UPPER_LIMIT ; k++) { double sum = 0 ; f o r ( i n t j = 1 ; j <= k ; j++) { sum += j∗j ; } i f ( k % PRINT_INTERVAL == 0 ) { ShowProgress ( ”k = ” + k . ToString ( ) ) ; } } progressListBox . Items . Add ( ”Done” ) ; runInSingleThreadButton . Enabled = t r u e ; runMultiThreadedButton . Enabled = t r u e ; } p r i v a t e void ShowProgress ( s t r i n g progressInformation) { progressListBox . Items . Add ( progressInformation) ; } of cores. Thus, the operating system is responsible for assigning time slices to each thread and rapidly switching between the threads, giving the illusion (from the user’s point of view) of parallel computation for all threads, whether or not they run on different processor cores. Writing a program that makes proper use of multithreading is a non-trivial task, especially in cases where communication between threads is required. Here, only a simple example will be given. There are plenty of additional examples in the various IPA libraries; see also the next section. Now, consider the ThreadingExample application. The application’s form contains two buttons, one for single-thread execution and one for execution using multithreading. In this case, the computation consists of computing the sum of the square of all integers from 0 to k, for k = 1, 2, . . . , 100000. As is evident when running the single-thread version, the GUI of the application gets frozen and unresponsive during the calculation. Moreover, the progress information is only printed to the screen after the computation has been completed. 
This is not particularly elegant: It should be possible for a user to access the GUI, and to obtain progress information, even while the computation is running. Perhaps there are other tasks that the user may wish to launch; alternatively, the user may wish to abort the computation before it is completed. The problem, in this case, is that the computation is started on the same thread as the GUI. Since the computer will try to run the computation as fast as possible, it will be difficult for it also to respond to user commands (for example, attempts to move the window using the mouse). The code is shown in Listing A.10.

Listing A.10: A method that (unwisely) runs a lengthy computation in the GUI thread. Note that the progress information will, in fact, only be shown at the very end of the computation.

private void runInSingleThreadButton_Click(object sender, EventArgs e)
{
    runInSingleThreadButton.Enabled = false;
    runMultiThreadedButton.Enabled = false;
    progressListBox.Items.Clear();
    progressListBox.Items.Add("Starting");
    for (int k = 1; k <= UPPER_LIMIT; k++)
    {
        double sum = 0;
        for (int j = 1; j <= k; j++)
        {
            sum += j * j;
        }
        if (k % PRINT_INTERVAL == 0)
        {
            ShowProgress("k = " + k.ToString());
        }
    }
    progressListBox.Items.Add("Done");
    runInSingleThreadButton.Enabled = true;
    runMultiThreadedButton.Enabled = true;
}

private void ShowProgress(string progressInformation)
{
    progressListBox.Items.Add(progressInformation);
}

This is where multithreading comes in: If the user instead clicks the other button (for multithreaded execution), a separate thread is started for carrying out the computation, leaving the GUI (which, again, runs on its own thread) free to do other things. In this case, two methods are used: one for starting the thread in which the computation is to be carried out, and one (ComputationLoop) for running the actual computation. Now the GUI responds nicely to any user actions, and the progress information is printed to the screen during the computation. The code is shown in Listing A.11.

Listing A.11: In this case, the computationally expensive loop is executed in a separate thread, and the progress information is displayed continuously on the screen. The ThreadSafeHandleDone method is available in the source code, but is not shown here.

private void runMultiThreadedButton_Click(object sender, EventArgs e)
{
    runInSingleThreadButton.Enabled = false;
    runMultiThreadedButton.Enabled = false;
    progressListBox.Items.Clear();
    progressListBox.Items.Add("Starting");
    computationThread = new Thread(new ThreadStart(() => ComputationLoop()));
    computationThread.Start();
}

private void ComputationLoop()
{
    for (int k = 1; k <= UPPER_LIMIT; k++)
    {
        double sum = 0;
        for (int j = 1; j <= k; j++)
        {
            sum += j * j;
        }
        if (k % PRINT_INTERVAL == 0)
        {
            ThreadSafeShowProgress("k = " + k.ToString());
        }
    }
    ThreadSafeHandleDone();
}

private void ThreadSafeShowProgress(string progressInformation)
{
    if (InvokeRequired)
    {
        BeginInvoke(new MethodInvoker(() => ShowProgress(progressInformation)));
    }
    else
    {
        ShowProgress(progressInformation);
    }
}

However, there is a price to be paid: Since the computation now runs in a separate thread, and any output to the screen (or other GUI actions) requires access to the GUI thread, one must handle the corresponding operations with care: Accessing the GUI from another thread is not thread-safe. In .NET, thread-safe access to the GUI thread, from another thread, is achieved by means of the BeginInvoke method, which is defined for any object derived from the Control class (for example, the Form class). Printing the progress during the computation and updating the Enabled property of the buttons (at the end of the computation) both require access to the GUI thread; hence, the BeginInvoke method (for the form) is used, as shown in the code listing.
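The ThreadSafeHandleDone method, called at the end of ComputationLoop, is not listed here. A minimal sketch, assuming that it simply performs the same completion steps as the single-threaded version in Listing A.10 (reporting that the computation is done and re-enabling the buttons), but via the GUI thread, might look as follows; the helper name HandleDone is just one possible choice:

// Possible sketch; the actual method is available in the example's source code.
private void ThreadSafeHandleDone()
{
    if (InvokeRequired)
    {
        BeginInvoke(new MethodInvoker(() => HandleDone()));
    }
    else
    {
        HandleDone();
    }
}

private void HandleDone()
{
    progressListBox.Items.Add("Done");
    runInSingleThreadButton.Enabled = true;
    runMultiThreadedButton.Enabled = true;
}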
Note that Listing A.11 only shows the method handling the progress update; the actual ThreadSafeHandleDone method, which is called when the computation is complete, is available in the source code.

This example shows the basics of multithreading. However, multithreading is a large and, at times, complex subject. There are plenty of examples of the use of multithreading in the various IPA libraries, which should be studied carefully by the reader.

A.5 Concurrent reading and writing

In a program that uses multiple threads or, as in the case of an IPA, communicates asynchronously with several other programs, it is not uncommon that one must both write to, and read from, a given object, for example a generic list. One must then be careful, as shown in the ConcurrentAccessExample. Here, a simple object is generated, which contains a list of 10 integers (all equal to 1). Next, two threads are started: The first thread (additionThread) adds another 1 to the list and then removes the first element of the list so that, again, it consists of 10 (equal) elements. The second thread (checkThread) simply measures the length of the list. Now, since the two threads run independently of each other, it can happen that the length computation (in the checkThread) occurs after the addition of a 1 (in the additionThread), but before the removal of the first element in the list. If this happens, the checkThread will find a list of length 11, rather than 10.

One can avoid this problem by locking the list both during the addition and removal operations and during the checking operation. A procedure (there are several ways) for doing so, using the Monitor class, is shown in Listing A.12.

Listing A.12: The two methods used for handling concurrent access to a generic list. The accessLockObject is defined in the class, but the definition is not shown here.

public void AddElement()
{
    Monitor.Enter(accessLockObject);
    integerList.Add(1);
    integerList.RemoveAt(0);
    Monitor.Exit(accessLockObject);
}

public int GetCheckSum()
{
    int checkSum = 0;
    Monitor.Enter(accessLockObject);
    checkSum = integerList.Count;
    Monitor.Exit(accessLockObject);
    return checkSum;
}

Here, a lock object is defined, and whenever a piece of code calls the Monitor.Enter method, the program will temporarily halt execution if another piece of code has already acquired the lock. Execution will be halted until the lock is released, using the Monitor.Exit method. Thus, in this particular case, even if the GetCheckSum method gets called between the two list operations in AddElement, the actual checking will not take place until the AddElement method releases the lock, thus avoiding the problem described above. The reader should run the ConcurrentAccessExample in order to investigate the two cases, first clicking the left button (running without locking) a few times, and then clicking the right button. As can be seen, in the first case (without locking) the erroneous length is invariably found, albeit at different iterations in different runs, again illustrating the fact that the two threads run independently of each other. In the second case (with locking), the error never occurs. Note that newer versions of .NET provide libraries for handling concurrent access to objects (such as lists). Still, it is good to know how to handle concurrent access explicitly, as just illustrated.
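As an aside, the same protection can be written more compactly using C#'s built-in lock statement, which the compiler expands into calls to Monitor.Enter and Monitor.Exit wrapped in a try/finally block. A sketch equivalent to Listing A.12, assuming the same accessLockObject and integerList fields:

public void AddElement()
{
    lock (accessLockObject)
    {
        integerList.Add(1);
        integerList.RemoveAt(0);
    }
}

public int GetCheckSum()
{
    lock (accessLockObject)
    {
        return integerList.Count;
    }
}

An advantage of the lock statement is that the lock is released even if an exception is thrown inside the protected block.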
A.6 Event handlers

The concept of event handlers is used frequently in the IPA libraries. Consider, for example, the arrival of new information in the working memory of an IPA. While it would be possible, in theory, to check continuously (with a loop) whether or not a new memory item has been added to (or removed from) the working memory, it would not be very elegant to do so. Moreover, it would be a computationally expensive procedure. A better approach would be to let the working memory itself trigger an event whenever a new memory item arrives, and to let other parts of the agent program (that might need to use the items in the working memory) respond accordingly by subscribing to the event by means of an event handler.

As another example, note that events and event handlers are used frequently in connection with GUI operations: Any user action on a GUI (such as a button click or a mouse movement) triggers one or several events, which can then be handled by the appropriate event handler. For example, if a user calls the Invalidate method on a control, the result will be that the control's Paint event is triggered, so that the user can repaint whatever is shown in the control, via an event handler (called, for example, HandlePaint).
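As a small illustration of this pattern, consider the following hypothetical, self-contained form (not taken from the IPA code): it subscribes to its own Paint event and then triggers repainting by calling Invalidate, here from a timer.

using System;
using System.Drawing;
using System.Windows.Forms;

public class PaintExampleForm : Form
{
    private Timer updateTimer;   // System.Windows.Forms.Timer

    public PaintExampleForm()
    {
        Paint += new PaintEventHandler(HandlePaint);        // Subscribe to the Paint event.
        updateTimer = new Timer();
        updateTimer.Interval = 1000;                        // Milliseconds.
        updateTimer.Tick += (sender, e) => Invalidate();    // Triggers the Paint event.
        updateTimer.Start();
    }

    private void HandlePaint(object sender, PaintEventArgs e)
    {
        // Repaint whatever should be shown; here, simply the current time.
        e.Graphics.DrawString(DateTime.Now.ToLongTimeString(), Font, Brushes.Black, 10, 10);
    }
}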
A simple example of event handling is given in the EventHandlerExample application. In this example, a separate thread is executed in an object of type EventTestClass, which computes the sums of all integers from 1 to k, for k = 1, 2, . . . , 100000. Two events are defined, namely Started, which is triggered when the operation starts, and Progress, which is triggered at regular intervals (in this example, for every 2,500 values of k). The definition of the EventTestClass is shown in Listing A.13.

Listing A.13: The EventTestClass with its two events, Started and Progress. The Progress event makes use of a custom EventArgs class, shown in Listing A.15.

public class EventTestClass
{
    private const int UPPER_LIMIT = 100000;
    private const int PROGRESS_REPORT_INTERVAL = 2500;
    private Thread runThread;

    public event EventHandler Started = null;
    public event EventHandler<ProgressEventArgs> Progress = null;

    private void RunLoop()
    {
        OnStarted();
        for (int ii = 1; ii <= UPPER_LIMIT; ii++)
        {
            double sum = 0;
            for (int jj = 1; jj <= ii; jj++)
            {
                sum += jj;
            }
            if (ii % PROGRESS_REPORT_INTERVAL == 0)
            {
                OnProgress(ii);
            }
        }
    }

    public void Run()
    {
        runThread = new Thread(new ThreadStart(() => RunLoop()));
        runThread.Start();
    }

    private void OnStarted()
    {
        if (Started != null)
        {
            EventHandler handler = Started;
            handler(this, EventArgs.Empty);
        }
    }

    private void OnProgress(int sumsCompleted)
    {
        if (Progress != null)
        {
            EventHandler<ProgressEventArgs> handler = Progress;
            ProgressEventArgs e = new ProgressEventArgs(sumsCompleted);
            handler(this, e);
        }
    }
}

For any event, the nomenclature is such that the event is triggered using a method with the same name as the event, but prefixed by the word On. Thus, for example, the Started event is triggered at the beginning of the RunLoop, by calling the OnStarted method. This method takes no input, since all that is required is for the program to report that a particular operation was started. In the OnStarted method, the program first checks whether or not there are any subscribers to this event (see below). If that is the case, the event is fired.

In order to understand the concept of event subscription, consider Listing A.14, which shows the three user-defined methods in the code for the application's form.

Listing A.14: Three relevant methods defined in the code for the form of the EventHandlerExample application.

private void runButton_Click(object sender, EventArgs e)
{
    EventTestClass eventTestObject = new EventTestClass();
    eventTestObject.Started += new EventHandler(HandleStarted);
    eventTestObject.Progress += new EventHandler<ProgressEventArgs>(HandleProgress);
    eventTestObject.Run();
}

private void HandleStarted(object sender, EventArgs e)
{
    string startInformationString = "Started";
    if (InvokeRequired)
    {
        BeginInvoke(new MethodInvoker(() => progressListBox.Items.Add(startInformationString)));
    }
    else
    {
        progressListBox.Items.Add(startInformationString);
    }
}

private void HandleProgress(object sender, ProgressEventArgs e)
{
    string progressInformationString = "Sums completed: " + e.SumsCompleted.ToString();
    if (InvokeRequired)
    {
        BeginInvoke(new MethodInvoker(() => progressListBox.Items.Add(progressInformationString)));
    }
    else
    {
        progressListBox.Items.Add(progressInformationString);
    }
}

When the user clicks the Run button on the form, an object of type EventTestClass is generated. The next two lines set up the event handlers that subscribe to the Started and Progress events. Note that, here, the nomenclature is such that, for a given event, the event handler carries the same name as the event, but with the prefix Handle. Note also that the event handler is appended to the invocation list of the event, which keeps track of the number (and identity) of all subscribers. Thus, it would be possible to define additional methods that would also subscribe to the same event. In this case, the method HandleStarted simply prints a string (by adding it as an item in a list box on the form) that tells the user that the computation has been started. Note that since the computation runs in a separate thread, the addition of the string to the list box must be done in a thread-safe manner; see also Sect. A.4 above.

Next, consider the slightly more complex Progress event. In many cases, it is not sufficient just to learn that some event took place; one may also need some additional information about the event, something that can be achieved by defining a custom EventArgs class. In this particular case, the required information is the value of k (stored in the variable sumsCompleted). When the event is triggered, this variable is sent as input to the OnProgress method. Next, an object of type ProgressEventArgs is instantiated (see Listing A.15) and is assigned the value of k. The subscriber (HandleProgress) can then extract and display the corresponding value.

Listing A.15: The ProgressEventArgs class, which is derived from the EventArgs class.

public class ProgressEventArgs : EventArgs
{
    private int sumsCompleted;

    public ProgressEventArgs(int sumsCompleted)
    {
        this.sumsCompleted = sumsCompleted;
    }

    public int SumsCompleted
    {
        get { return sumsCompleted; }
    }
}

Additional events could certainly be added. A suitable exercise for the reader would be to add an event Completed, which would be triggered at the end of the RunLoop.
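One possible solution to this exercise, following exactly the same pattern as the Started event (the handler name and output string below are, of course, just one choice), is sketched here:

// In EventTestClass: declare the event and a corresponding trigger method,
// and call OnCompleted() as the last statement of RunLoop.
public event EventHandler Completed = null;

private void OnCompleted()
{
    if (Completed != null)
    {
        EventHandler handler = Completed;
        handler(this, EventArgs.Empty);
    }
}

// In the form: subscribe with
//   eventTestObject.Completed += new EventHandler(HandleCompleted);
// and handle the event in a thread-safe manner, as for the other two events.
private void HandleCompleted(object sender, EventArgs e)
{
    if (InvokeRequired)
    {
        BeginInvoke(new MethodInvoker(() => progressListBox.Items.Add("Completed")));
    }
    else
    {
        progressListBox.Items.Add("Completed");
    }
}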
A.7 Serialization and de-serialization

Most programs require some form of input data before they can run. As an example, a program for visualizing and animating a three-dimensional rendering of a face requires information about the detailed appearance of the face (i.e. the vertices of all the triangles constituting the face, etc.) and, similarly, a program for speech recognition needs detailed information about the parameters of the speech recognizer. While it is certainly possible to write methods for loading and saving the properties of any object, it is often a tedious and complex procedure, especially for classes that contain, for example, lists of objects that, in turn, may contain additional objects. Fortunately, there are methods for saving (serializing) and loading (de-serializing) the properties of any object, provided that certain attributes are defined. The general code for serialization is contained in the System.Runtime.Serialization namespace, which must thus be referenced if one wants to make use of serialization. One must also add a reference to the custom-made ObjectSerializerLibrary, which contains specific code for serializing and de-serializing an object in XML format.

Consider now the SerializationExample. Here, a simple class (SerializationTestClass) is defined, containing a few fields and corresponding properties, as shown in Listing A.16.

Listing A.16: The SerializationTestClass.

[DataContract]
public class SerializationTestClass
{
    private int intParameter;
    private double doubleParameter;
    private double doubleParameter2;
    private List<int> integerList;

    [DataMember]
    public int IntParameter
    {
        get { return intParameter; }
        set { intParameter = value; }
    }

    [DataMember]
    public double DoubleParameter
    {
        get { return doubleParameter; }
        set { doubleParameter = value; }
    }

    public double DoubleParameter2
    {
        get { return doubleParameter2; }
        set { doubleParameter2 = value; }
    }

    [DataMember]
    public List<int> IntegerList
    {
        get { return integerList; }
        set { integerList = value; }
    }
}

Note that the class itself is marked with the DataContract attribute, which tells the program that this class can be serialized. (Serialization and de-serialization can be implemented in various different ways; here, however, only the methods implemented in the ObjectSerializerLibrary will be used.)
Three of the four properties (which must have both the get and the set parts defined, for serialization and de-serialization) are marked with the DataMember attribute, meaning that they will be considered in serialization and de-serialization. The fourth property (DoubleParameter2) is not thus marked, and will therefore not be considered. It is not uncommon that some properties are omitted during serialization, for example properties whose values are obtained dynamically when the corresponding program is running.

The code for actual serialization and de-serialization is contained in the ObjectSerializerLibrary. Listing A.17 shows the two event handlers for the Load and Save menu items, respectively.

Listing A.17: The two methods used for de-serialization and serialization in the SerializationTestExample application.

private void loadObjectToolStripMenuItem_Click(object sender, EventArgs e)
{
    using (OpenFileDialog openFileDialog = new OpenFileDialog())
    {
        openFileDialog.Filter = ".XML files (*.xml)|*.xml";
        if (openFileDialog.ShowDialog() == DialogResult.OK)
        {
            serializationTestObject = (SerializationTestClass)ObjectXmlSerializer.ObtainSerializedObject(
                openFileDialog.FileName, typeof(SerializationTestClass));
            ShowTestObject();
        }
    }
}

private void saveObjectToolStripMenuItem_Click(object sender, EventArgs e)
{
    using (SaveFileDialog saveFileDialog = new SaveFileDialog())
    {
        saveFileDialog.Filter = ".XML files (*.xml)|*.xml";
        if (saveFileDialog.ShowDialog() == DialogResult.OK)
        {
            ObjectXmlSerializer.SerializeObject(saveFileDialog.FileName, serializationTestObject);
        }
    }
}

Note that, during de-serialization, one must specify the type of the object being de-serialized, and explicit casting (as (SerializationTestClass)) must then also be applied. The method ShowTestObject (listing not shown here) simply prints the values of the various parameters. When the program is started, a SerializationTestClass object is instantiated and its properties are assigned some arbitrary values, which are then shown in a list box. The user can then save (serialize) the object by selecting the Save menu item. If one then loads (de-serializes) the object by selecting the Load menu item, the parameter values of the loaded object are again shown in the list box. In this example, note that the value of DoubleParameter2 changes (upon loading) from 2 to 0. This is so since DoubleParameter2 was not serialized, and is therefore assigned the default value of 0.

For serialization and de-serialization as described above, C# requires information regarding the serializable types. This information is gathered in the ObtainSerializableTypes method in the ObjectXmlSerializer. However, this method only extracts the types available in the current assembly (simplifying somewhat, one can say that the classes in a class library, or rather the corresponding DLL, together constitute an assembly). For example, when serializing an agent, i.e. an instance of the Agent class in the AgentLibrary, the serializable types obtained by a call to the ObtainSerializableTypes method will be all the types in the AgentLibrary. However, if one wants to add types (classes) outside the AgentLibrary, derived from classes in that library (for example, a new class derived from the base class DialogueAction), C# will not automatically know how to handle those classes in serialization and de-serialization. Thus, in such cases, one must explicitly specify that the added types are serializable too. Methods for serialization and de-serialization in such cases are also available in the ObjectXmlSerializer.
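The ObjectSerializerLibrary itself is not listed in these notes. As an indication of what XML serialization of a [DataContract] class can look like using only standard framework classes, the following hypothetical sketch uses DataContractSerializer from System.Runtime.Serialization; the class and method names mirror those in Listing A.17, but the code is an assumption, not the actual library implementation:

using System;
using System.Runtime.Serialization;
using System.Xml;

public static class SimpleXmlObjectSerializer
{
    public static void SerializeObject(string fileName, object obj)
    {
        DataContractSerializer serializer = new DataContractSerializer(obj.GetType());
        XmlWriterSettings settings = new XmlWriterSettings();
        settings.Indent = true;
        using (XmlWriter writer = XmlWriter.Create(fileName, settings))
        {
            serializer.WriteObject(writer, obj);   // Writes the [DataMember] properties as XML.
        }
    }

    public static object ObtainSerializedObject(string fileName, Type type)
    {
        DataContractSerializer serializer = new DataContractSerializer(type);
        using (XmlReader reader = XmlReader.Create(fileName))
        {
            return serializer.ReadObject(reader);   // The caller casts the result to the concrete type.
        }
    }
}

For the case with derived types defined outside the original assembly, DataContractSerializer provides a constructor overload that accepts a collection of known types, which corresponds to explicitly specifying the additional serializable types as described above.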
c 2017, Mattias Wahde, [email protected] Index asynchronous, 10 camera coordinates, 60 Canny edge detector, 41 cepstral coefficient, 103, 105 cepstral order, 105 chrominance, 28 chunk ID, 79 class (C#), 127 abstract, 127 derived, 127 class library (C#), 124 client-server model, 7 CMYK, 28 color ambient, 62 diffuse, 62 specular, 62 color histogram, 29 color space, 27 color spectrum, 30 composite Bézier curve, 74 compression code, 81 concatenative synthesis, 77 connected components, 43 constructor (C#), 129 continuous speech recognition (CSR), 101 control (C#), 125 convolution, 36 convolution mask, 36 coordinated universal time (UTC), 121 cubic Bézier splines, 74 2s complement signed integer, 83 4-connectivity, 44 8-connectivity, 44 Adaboost, 55 agent program, 6 alpha channel, 27 application (C#), 124 artificial neural network (ANN), 101 assembly (C#), 144 atomic operation, 39 attribute (C#), 143 autocorrelation, 103 normalized, 103 autocorrelation coefficient, 103 autocorrelation order, 103 background subtraction, 50 exponential Gaussian averaging, 50 frame differencing, 50 Gaussian mixture model, 51 ViBe, 51 bandwidth, 87 binarization, 28 binarization threshold, 36 block align, 81 blurring, 37 box, 37 Gaussian, 38 brain process, 6 breakpoint (C#), 126 callback 147 148 INDEX damped sinusoid, 87 damped sinusoid filter, 88 data parsing, 120 DC component, 102 de-serializing (C#), 143 depth camera, 52 dialogue item, 18 difference equation, 86, 88 digital filter, 85 high-pass, 86 low-pass, 86 diphone, 93 Direct3D, 59 dynamic time warping (DTW), 101 dynamic-link library (DLL), 129 eigenface method, 55 event (C#), 126, 138 subscription, 138 event handler (C#), 138 event-based system, 16 exponential moving average, 86 face recognition, 55 face template, 54 feature vector (speech), 102 field (C#), 127 finite-state machine (FSM), 18 form (C#), 125 formant synthesis, 77, 87 frame splitting, 103 fundamental frequency, 89 Gaussian mixture model (GMM), 101 gesture recognition, 52 Hamming windowing, 103 hidden Markov model (HMM), 101 histogram cumulative, 40 normalized, 40 histogram equalization, 41 histogram stretching, 40 HSV, 28 HTML tag, 121 image, color, 27 image, grayscale, 27 integral image, 42 integrated development environment (IDE), 3 interactive evolutionary algorithm, 98 interactive partner agent (IPA), 1 internet data acquisition program, 7 invocation list (C#), 140 IPA libraries, 2 isolated word recognition (IWR), 101 lag (autocorrelation), 103 Levinson-Durbin recursion, 105 lighting model, 62 linear predictive coding, 104 listener program, 6 locked bitmap, 31 LPC coefficient, 103 LPC order, 104 luma, 28 mel-frequency cepstral coefficients, 105 memory long-term, 6 working, 6 memory item tag, 18 method (C#), 125 abstract, 129 external, 109 model coordinates, 60 model matrix, 60 modelview matrix, 60 Mono, 3 mono sound, 78 morphological image processing, 45 closing, 47 dilation, 46 erosion, 46 hit-and-miss, 47 opening, 47 thinning, 48 c 2017, Mattias Wahde, [email protected] INDEX 149 motion detection background, 50 foreground, 50 multithreading, 134 namespace (C#), 127 Niblack’s method, 48 number of zero crossings, 105 relative, 106 object (C#), 127 object-oriented programming, 127 OpenGL, 59 OpenTK, 59 overlap-and-add (TD-PSOLA), 95 padding, 37 path (dialogue), 23 peer-to-peer model, 7 perspective projection, 60 phone (speech), 93 pitch mark, 95 pitch period, 95 pixel, 27 background, 42 foreground, 42 post-multiplication, 68 pre-emphasis, 102 project (C#), 124 projection matrix, 60 property 
(C#), 129 Really simple syndication (RSS), 117 recording buffer, 110 reference (C#), 127 RGB, 27 RIFF chunk, 79 sample (sound), 78 sample rate, 78 sample width, 78 sampling frequency, 78 Sauvola’s method, 48 sensitivity, 54 serializing (C#), 143 shading, 62 flat, 63 smooth, 63 shading model, 63 sharpening, 38 sharpening factor, 38 shininess, 62 simple type (C#), 130 Sobel operator, 42 socket, 9 solution (C#), 124 solution explorer (C#), 124 speech feature, 101 speech program, 7 startup project (C#), 125 stationary time series, 104 stereo sound, 78 strong classifier, 54 structuring element, 45 origin, 45 subchunk (WAV) data, 79 fact, 79 fmt, 79 subjective optimization, 98 summed area table, 42 TCP/IP protocol, 7 thread safe access, 136 thresholding, 48 adaptive, 48 Toeplitz matrix, 105 triphone, 101 uncanny valley phenomenon, 69 using clause (C#), 127 view matrix, 60 Viola-Jones algorithm, 54 vision program, 6 Visual Studio, 3 visualizer program, 7 c 2017, Mattias Wahde, [email protected] 150 INDEX Waveform audio format (WAV), 78 weak classifier, 54 world coordinates, 60 Xamarin, 3 XML format, 143 c 2017, Mattias Wahde, [email protected]