Doctoral Dissertation

Knowledge Based Perceptual Anchoring: Grounding percepts to concepts in cognitive robots

Marios Daoutis
Computer Science

Örebro Studies in Technology 55
Örebro 2013

© Marios Daoutis, 2013

Title: Knowledge Based Perceptual Anchoring: Grounding percepts to concepts in cognitive robots
Publisher: Örebro University, 2013
www.publications.oru.se
[email protected]
Printer: Örebro University, Repro 12/2012
Typeset with LaTeX.

ISSN 1650-8580
ISBN 978-91-7668-912-7

Statement of Original Authorship

This thesis is submitted to the School of Science and Technology, Örebro University, in fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science. I certify that I am the sole author of this thesis and that the work contained herein is entirely my own, describes my own research and has not previously been submitted to meet requirements for an award at this or any other higher education institution. To the best of my knowledge and belief, the thesis contains no ideas or material of others, previously published or written by another person, except where due reference is made. I acknowledge that I have read and understood the University's rules, requirements, procedures and policy related to my higher degree research award and to the present thesis, and I certify that I have complied with the Good Research Practice guide [1] of the Swedish Research Council (Vetenskapsrådet).

Marios Daoutis
Örebro, April 2013

[1] B. Gustafsson, G. Hermeren, and B. Petersson. Good research practice – what is that? Swedish Science Research Council, 2005.
Abstract

A successful artificial cognitive agent needs to integrate its perception of the environment with reasoning and actuation. A key aspect of this integration is the perceptual-symbolic correspondence, which aims to give meaning to the concepts the agent refers to – a process known as anchoring. However, perceptual representations alone (e.g., feature lists) cannot provide sufficient abstraction and richness to deal with the complex nature of the concepts' meanings. On the other hand, plain symbol manipulation does not appear capable of attributing the desired intrinsic meaning either. We approach this integration in the context of cognitive robots which operate in the physical world. Specifically, we investigate the challenge of establishing the connection between percepts and concepts referring to objects, their relations and properties. We examine how knowledge representation can be used together with an anchoring framework so as to complement the meaning of percepts while supporting linguistic interaction. This implies that robots need to represent both their perceptual and semantic knowledge, which is often expressed at different abstraction levels and may originate from different modalities. The solution proposed in this thesis concerns the specification, design and implementation of a hybrid cognitive computational model, which extends a classical anchoring framework in order to address the creation and maintenance of the perceptual-symbolic correspondences.
The model is based on four main aspects: (a) robust perception, relying on state-of-the-art techniques from computer vision and mobile robot localisation; (b) symbol grounding, using top-down and bottom-up information acquisition processes as well as multi-modal representations; (c) knowledge representation and reasoning techniques, in order to establish a common language and semantics regarding physical objects, their properties and relations, to be used between heterogeneous robotic agents and humans; and (d) commonsense information, in order to enable high-level reasoning as well as to enhance the semantic descriptions of objects. The resulting system and the proposed integration have the potential to strengthen and expand the knowledge of a cognitive robot. Specifically, by providing more robust percepts it is possible to cope better with the ambiguity and uncertainty of the perceptual data. In addition, the framework is able to exploit the mutual interaction between different levels of representation while integrating different sources of information. By modelling and using semantic & perceptual knowledge, the robot can acquire, exchange and reason formally about concepts, while prior knowledge can become a cognitive bias in the acquisition of novel concepts.

Keywords: anchoring, knowledge representation, cognitive perception, symbol grounding, common-sense information.

Acknowledgements

First I would like to express my gratitude to Prof. Alessandro Saffiotti, who offered me the opportunity to conduct my PhD studies at the Cognitive Robotic Systems Lab in the Centre for Applied Autonomous Sensor Systems Research (AASS) at Örebro University. I am even more thankful to my supervisors, Prof. Silvia Coradeschi and Dr. Amy Loutfi, for teaching me how to conduct scientific research, but most importantly for letting me become a goal-oriented and independent researcher. A special thanks goes to Dr.
Mathias Broxvall for his valuable support and fruitful discussions, but most importantly for being there when it was really needed. His decisive intervention marked a turning point in the development of this thesis. Of course this thesis would not have been possible without the help of many other people. I would like to thank my colleagues Sahar Asadi, Abdelbaki Bouguerra and Kevin LeBlanc, because on various occasions they either showed me a way out of a problem or sparked eureka moments during constructive conversations. There is one person to whom I am deeply indebted, Athanasia Louloudi, for standing by me in both the tough and the joyful moments. Finally I would like to thank my parents and sister for their support during my studies.

This work was supported by Vetenskapsrådet (VR) through the project "Anchoring in Symbiotic Robotic System" and by the Centre for Applied Autonomous Sensor Systems (AASS).

List of Publications

The contents and results of this thesis are partially reported in several refereed papers, some of which are appended to this thesis. The author of this thesis is also the main author of the following publications (except for Paper I). The papers are referred to throughout the thesis by their Roman numerals, as presented below.

Publications included in the thesis

Journal Articles

✷ Paper I A. Loutfi, S. Coradeschi, M. Daoutis, and J. Melchert. "Using Knowledge Representation for Perceptual Anchoring in a Robotic System". Int. Journal on Artificial Intelligence Tools (IJAIT) 17, pp. 925–944, 2008.

✷ Paper II M. Daoutis, S. Coradeschi, A. Loutfi. "Grounding commonsense knowledge in intelligent systems". Journal of Ambient Intelligence and Smart Environments (JAISE) 1, pp. 311–321, 2009.

✷ Paper III M. Daoutis, S. Coradeschi, A. Loutfi. "Cooperative Knowledge Based Perceptual Anchoring". Int. Journal on Artificial Intelligence Tools (IJAIT) 21, p. 1250012, 2012.

✷ Paper IV M. Daoutis, S. Coradeschi, A. Loutfi.
"Towards concept anchoring for cognitive robots". Intelligent Service Robotics 5, pp. 213–228, 2012.

Other publications from the author

Conference Paper

✷ Paper V M. Daoutis, S. Coradeschi, A. Loutfi. "Integrating Common Sense in Physically Embedded Intelligent Systems". In: Intelligent Environments 2009. Ed. by V. Callaghan, A. Kameas, A. Reyes, D. Royo, et al. Vol. 2. Ambient Intelligence and Smart Environments. IE'09 Best Conference Paper Award. IOS Press, 2009, pp. 212–219.

Book Chapter

✷ Paper VI M. Daoutis, A. Loutfi, S. Coradeschi. "Knowledge Representation for Anchoring Symbolic Concepts to Perceptual Data". In: Bridges between the Methodological and Practical Work of the Robotics and Cognitive Systems Communities – From Sensors to Concepts. Ed. by T. Amirat, A. Chibani, G. P. Zarri. Vol. 21. Springer Publishing (in press), 2012.

Contents

Abstract
Acknowledgements
List of Publications
Contents

1 Introduction
  1.1 Cognitive robotics
  1.2 Context of the research
    1.2.1 The symbol grounding problem
    1.2.2 Anchoring Symbols to Sensor Data
  1.3 Motivation
    1.3.1 Uncertainty and Ambiguity
    1.3.2 Binding
    1.3.3 Representation
    1.3.4 Perceptual & Cognitive Bias
  1.4 Contributions
  1.5 Outline
    1.5.1 Part A: Grounding percepts to concepts in cognitive robots
    1.5.2 Part B: Publications

A Grounding percepts to concepts in cognitive robots

2 Background & Related Work
  2.1 Computational models of cognition
    2.1.1 Sub-symbolic modelling
    2.1.2 Symbolic modelling
    2.1.3 Hybrid Modelling
    2.1.4 Intelligent agent & the environment
  2.2 Perceptual Systems
    2.2.1 Visual Perception
    2.2.2 Computer Vision for cognitive perception
    2.2.3 Spatial Perception
  2.3 Symbol Systems
    2.3.1 Models for Knowledge Representation
    2.3.2 Challenges related to KR&R
  2.4 Commonsense Information
  2.5 Perceptual-symbolic integration and the Symbol Grounding Problem
    2.5.1 Harnad's Symbol Grounding Problem and solution
    2.5.2 Physical Symbol Grounding
    2.5.3 Social Symbol Grounding
  2.6 Foundations of Perceptual Anchoring
  2.7 Perceptual Anchoring in Cognitive Systems
    2.7.1 Knowledge Based approaches to Anchoring
    2.7.2 Anchoring with commonsense information
    2.7.3 Cooperative Anchoring
    2.7.4 Anchoring for Human-Robot Interaction
    2.7.5 Other approaches to Anchoring
    2.7.6 Summary of Related Approaches

3 Methods
  3.1 Introduction
  3.2 Anchoring Model
    3.2.1 Components
    3.2.2 Functionalities
    3.2.3 Relation to Publications
  3.3 Sensors
    3.3.1 Laser Scanner
    3.3.2 Camera
  3.4 Mobile Robot Localisation
  3.5 Vision
    3.5.1 Image Representation
    3.5.2 Vector Quantisation
    3.5.3 Classification of Visual Features
  3.6 Semantic Knowledge
    3.6.1 Description Logics
    3.6.2 Inference Support
  3.7 Commonsense Knowledge Bases
    3.7.1 Cyc
    3.7.2 Open Mind Common Sense
    3.7.3 ConceptNet
    3.7.4 Other approaches
    3.7.5 Discussion
  3.8 Conclusions

4 Summary of Publications
  4.1 Anchoring with spatial relations
    4.1.1 Semantic modelling of spatial relations
    4.1.2 System validation
    4.1.3 Contributions
  4.2 Common Sense Knowledge and Anchoring
    4.2.1 Linguistic interaction
    4.2.2 Knowledge synchronisation
    4.2.3 Evaluation
    4.2.4 Contributions
  4.3 Cooperative Anchoring using Semantic & Perceptual Knowledge
    4.3.1 Semantic integration
    4.3.2 Implementation
    4.3.3 Contributions
  4.4 Towards the WWW and Conceptual Anchoring
    4.4.1 Representation
    4.4.2 Preliminary results
    4.4.3 Contributions

5 Discussion
  5.1 Research Outlook
  5.2 Critical view on the original framework
  5.3 Summary of the Research
  5.4 Concluding Remarks & Considerations
  5.5 Recommendations for Future Work

References

B Publications

I Using Knowledge Representation for Perceptual Anchoring in a Robotic System
  1 Introduction
  2 Perceptual Anchoring
    2.1 Creation and Maintenance of Anchors
  3 Knowledge Representation
  4 Description of the Framework
    4.1 Simple Inferences
    4.2 Spatial Relations
    4.3 Managing Object Properties
    4.4 Human-Robot Interaction (HRI)
  5 Experiments
    5.1 TestBed
    5.2 Case-Studies
  6 Related Work
  7 Future Work and Conclusions
  References

II Grounding Commonsense Knowledge in Intelligent Systems
  1 Introduction
  2 System Overview
    2.1 Perception
    2.2 Physical Symbol Grounding
    2.3 Management & Perceptual Memory
    2.4 Commonsense Knowledge Representation and Reasoning
    2.5 NL Communication
  3 Case Studies
    3.1 Perceptual Information & Properties
    3.2 Qualitative Spatial Relations
    3.3 Commonsense Reasoning
  4 Conclusion & Future Work
  References

III Cooperative Knowledge Based Perceptual Anchoring
  1 Introduction
    1.1 Symbol Grounding in Robotics
    1.2 Perceptual Anchoring
  2 The Anchoring Framework
    2.1 Computational Framework Description
    2.2 Global Perceptual Anchoring
    2.3 Anchor Management
  3 Knowledge Representation in Anchoring
    3.1 Cyc Knowledge Base
    3.2 Ontology
    3.3 Semantic Integration Overview
  4 Instantiation of the Perceptual Anchoring Framework
    4.1 Perception
    4.2 Grounding Relations
    4.3 User Interface & Natural Language Interaction
  5 Experimental Scenarios
    5.1 Perceptual Anchoring Framework Performance
    5.2 Knowledge Representation and Reasoning Evaluation
  6 Related Work
  7 Conclusion & Future Work
  References

IV Towards concept anchoring for cognitive robots
  1 Introduction
  2 Conceptual Anchoring Framework
    2.1 Model
    2.2 Anchoring
    2.3 Anchor Perceptual Indiscernibility Relation
    2.4 Bottom-up & Top-down information acquisition
  3 Implementation
  4 Semantic knowledge & commonsense information
    4.1 Cyc commonsense knowledge-base
    4.2 Ontology
    4.3 Perceptual-Semantic knowledge-base
  5 Conceptual Information Acquisition
    5.1 Linguistic & Semantic Search
    5.2 Perceptual learning
  6 Physical Perception
    6.1 Visual perception
    6.2 Grounding Relations
    6.3 User Interface & Natural Language Interaction
  7 Example Scenario
  8 Discussion
  References

"I believe that at the end of the century the use of words and the general educated opinion will have altered so much that one will be able to speak of machines thinking without expecting to be contradicted."
Alan Turing (1912–1954)

1 Introduction

1.1 Cognitive robotics

A robot can generally be conceived as a programmable, self-controlled device which is able to exhibit intelligent behaviour and manipulate its environment. Research in classical robotics focuses on the technology for combining sensors, controllers and actuators via software and algorithms, for programmable automation. Typical domains emphasise low-level sensing and control tasks, such as sensor processing, path planning and manipulator control. Classical robotics can thus be considered largely restricted to sensor-based reactivity. Cognition, on the other hand, refers to the process of acquiring and using knowledge (information) about the world for goal-oriented purposes. Cognitive Robotics is a field which combines results from the traditional robotics discipline with those from artificial intelligence (AI) and cognitive science, so as to design artificial agents that deal with cognitive phenomena such as perception, memory, reasoning or learning. The general research focus of this thesis is concerned with endowing robots with intelligent behaviour via a uniform theoretical and methodological framework. The aim is to allow robots to perceive, reason and interact in changing, partially known and unpredictable environments (i.e. the physical world). The topic of this dissertation touches upon several diverse fields, such as perception and vision, knowledge representation and symbol grounding¹.
¹ Symbol grounding denotes the grounding of symbols to their sensorimotor representations.

Figure 1.1: Research overview, depicting the main disciplines and concepts involved in this research: perception (vision, localisation, olfaction and other modalities), perceptual anchoring (linking percepts to symbols) and knowledge representation (models and formulae used to communicate, plan, learn and infer).

This thesis aims to be a step towards the integration of the fields mentioned above. In more detail, this thesis examines how perceptual representations can be grounded to their corresponding conceptual representations in the context of cognitive robots. This chapter introduces the background (§ 1.2) and motivation (§ 1.3) behind this integration. The research questions and contributions of the thesis follow in § 1.4. The thesis outline is summarised in § 1.5.

1.2 Context of the research

This research lies at the heart of hybrid cognitive modelling and concerns the application of the intelligent agent paradigm to cognitive robots which operate in the 'real world' (e.g., service robots). Autonomous cognitive robots are often envisioned to perform difficult tasks in real environments, or to interact effortlessly with humans. This view represents one aspect of the long-term goal of AI and robotics: to create artificial systems that match and eventually exceed human-level competence in a multitude of tasks. In this setting, I attempt to investigate the integration of perceptual information with semantic and formally expressed commonsense knowledge, via a framework for symbol grounding [63]. Fig. 1.1 depicts a high-level view of the three main disciplines and the corresponding concepts used in this thesis. As artificial systems progress steadily from sensor-based reactivity to cognitive computational models that enable them to perceive, learn and reason, the underlying models often have to deal with many open problems.
In this research we consider robots (or intelligent agents in general) interacting with humans. Therefore, this thesis investigates a) how sensory and perceptual information is acquired and grounded to symbols; b) how the meaning of these symbols is represented; and c) how to express this meaning within a vast amount of formally defined commonsense knowledge. Below I introduce the symbol grounding [63] and perceptual anchoring [35] problems, which are two of the main aspects of the thesis.

1.2.1 The symbol grounding problem

The Symbol Grounding Problem (SGP) is the problem of intrinsically linking the symbols used by a cognitive agent to their corresponding meanings, by grounding language in perception and action. This aspect is central to computational cognitive modelling. It is important to find ways to combine the typically sub-symbolic perceptual representations with the symbolic conceptual representations used by the different cognitive processes of an intelligent agent, in order to establish the link between perceptual and conceptual knowledge. A linguistic symbol representing some meaning needs to be grounded in how the world is perceived; specifically, the internal iconic and categorical sensorimotor representations must be grounded to symbolic representations that represent the real-world phenomena to which they refer [63]. The SGP focuses on how symbols are connected to their meanings, assuming that the cognitivist symbols reside in the mind of the interpreter and always require the semantic interpretation of an external observer. The meaning of "meaning" is itself non-trivial; however, symbols should be tied to percepts in a way that is meaningful to the agent itself, be they objects or features in the world that it can sense, or actions it can perform (affordances) [59].
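As a toy illustration of this idea (a sketch, not code from the thesis), tying a colour symbol to percepts can be viewed as mapping a raw sensor measurement onto the symbol whose perceptual category matches it; the hue intervals below are entirely hypothetical:

```python
# Illustrative sketch: grounding colour symbols in sub-symbolic sensor
# data, here simplified to a hue measurement in degrees. Each symbol's
# categorical representation is a set of hue intervals (made-up values).

HUE_CATEGORIES = {
    "red":   [(0, 20), (340, 360)],
    "green": [(80, 160)],
    "blue":  [(200, 260)],
}

def ground_hue(hue_deg):
    """Return the colour symbol grounded in a raw hue measurement,
    or None when the percept falls outside every known category."""
    for symbol, intervals in HUE_CATEGORIES.items():
        if any(lo <= hue_deg <= hi for lo, hi in intervals):
            return symbol
    return None

print(ground_hue(215))  # a hue of 215 degrees grounds to the symbol "blue"
print(ground_hue(50))   # an unmodelled hue grounds to no symbol (None)
```

Even this trivial mapping already exhibits the two ingredients discussed above: a sub-symbolic (numeric) perceptual representation and a symbolic one, connected through a categorical representation in between.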
Intuitively, information has to be represented both perceptually and semantically, describing concrete as well as abstract conceptual entities. In an artificial cognitive agent this perceptual-semantic correspondence essentially attempts to define the meaning of the concepts processed by the agent. It further implies the presence of symbol and perceptual systems, where the symbol grounding problem is then solved by grounding the sensorimotor and perceptual representations to the symbolic concepts representing the meaning of the perceptions. In a multi-robot (or, in general, multi-agent) system the problem becomes more complex. Not only do symbols have to be grounded, but they also have to be commonly shared by convention in order to allow the sharing of information. This is related to the social grounding problem [171], another aspect of the SGP and a fundamental aspect in the development of cognitive systems. A growing amount of research in interactive intelligent systems and cognitive robotics focuses on the close integration of language with other cognitive capabilities [2, 19, 128].

1.2.2 Anchoring Symbols to Sensor Data

A cognitive robot operating in the environment simultaneously faces many diverse practical aspects and hard challenges, such as those that accompany perception and sensing, the frame, expressivity and decidability problems concerning logic and knowledge representation, as well as the well-known problem of representing and reasoning with common sense. On a technical level, some of the challenges mentioned above are investigated under the scope of anchoring, where anchoring has a dual meaning: that of a problem as well as that of a methodology. Coradeschi and Saffiotti formalised the anchoring problem, which concerns ". . . the connection between abstract and physical-level representations of objects in artificial autonomous systems embedded in a physical environment" [38].
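In this formalisation an anchor is commonly described as a structure that connects, at each moment in time, a symbol, the percept currently matched to it, and a signature holding the current estimates of the object's observable properties. The sketch below is purely illustrative; the names and the naive property-matching rule are assumptions, not the implementation used in this thesis:

```python
# Minimal sketch of an anchor and of a "Find"-style functionality:
# create an anchor for a symbol by selecting a percept whose observed
# properties are compatible with the symbolic description.

from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Anchor:
    symbol: str                      # symbolic name, e.g. "cup-22"
    percept: Optional[dict]          # latest sensor data matched to the symbol
    signature: dict = field(default_factory=dict)  # property estimates

def find(symbol: str, description: dict, percepts: list) -> Anchor:
    """Return an anchor grounding `symbol` in the first compatible
    percept; the anchor stays ungrounded when nothing matches."""
    for p in percepts:
        if all(p.get(k) == v for k, v in description.items()):
            return Anchor(symbol, p, dict(p))
    return Anchor(symbol, None)

percepts = [{"shape": "cylinder", "colour": "white"},
            {"shape": "box", "colour": "red"}]
anchor = find("cup-22", {"colour": "white"}, percepts)
print(anchor.signature["shape"])     # the white cylinder percept was selected
```

Maintaining such a structure over time (re-matching percepts as the object moves or disappears from view) is exactly what distinguishes anchoring from a one-shot grounding step.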
Vogt and Divina characterise this problem as a special case of a technical aspect of symbol grounding, since it deals with grounding symbols to specific sensory patterns [171]. However, anchoring studies the problem of creating and maintaining in time the correspondences between symbols and sensor data that refer specifically to one physical object. Implicitly, intelligent robotic systems with a symbol system and a perceptual system need to solve the anchoring problem so as to connect the information present in symbolic form with the sensor data that the robot obtains from the physical world, both in single- and cooperative (multi-)robot cases, as well as in multi-modal contexts. On the other hand, anchoring can be seen as an integrative solution and methodology [100] which mediates the link between the perceptual and sensing abilities on one side, and the representations and conceptual knowledge manipulated by the various cognitive abilities of an intelligent system on the other. It can be described as a hybrid cognitive model and representation schema which incorporates both sub-symbolic and symbolic components, together with information processing mechanisms for creating and maintaining the links between the high-level semantic representations and the corresponding perceptual representations used by the various cognitive functions of an agent. The following section presents the motivations and theoretical aspects that underpin our investigation of anchoring.

1.3 Motivation

This thesis investigates the problem of anchoring grounded interaction in cognitive robots. To get a better understanding of how anchoring is involved even in very simple scenarios, consider the example shown in Fig. 1.2. A cognitive robot observes a novel object and speculates, according to its prior perceptual knowledge, that the object in question might be a "mug".
We can easily identify already that the robot needs to be able to sense and perceive a complete and coherent representation of the sensed object, binding its features together despite the existing ambiguity. Then, in an effort to identify the novel object, it attempts to associate this new object with past perceptual experiences – a form of cognitive bias. It appears quite trivial when the human asserts that he owns this "coffee-cup"; however, this linguistic assertion presupposes the presence of related knowledge along with the corresponding semantics. The role of this knowledge is not only to represent the domain of discourse (common sense), but also to support formal reasoning and inference. Below we see in more detail the underlying principles behind this motivational example.

Figure 1.2: This scenario depicts the main focus of this thesis, which concerns anchoring grounded interaction. A mobile robot is attempting to identify an unknown object, while a human linguistically asserts semantic information regarding this object. The end result is the corresponding knowledge (perceptual and semantic) which the robot asserts and manipulates in its "mind".

1.3.1 Uncertainty and Ambiguity

Information from the environment comes inherently with uncertainty, which can be caused by sensor noise, limited observability, incomplete knowledge or false recognition of patterns. Moreover, in human communication we often meet ambiguities, incomplete information or information that needs to be inferred. Ambiguity, sometimes also referred to as "second-order uncertainty", results in uncertain definitions of uncertain states or outcomes. It has been argued that ambiguity is always avoidable, while uncertainty (of the "first-order" kind) is not necessarily avoidable.
For instance, consider that uncertainty might occur from noise in the visual patterns as a result of sudden illumination changes or motion blur. This uncertainty propagates through the vision system, thus increasing the error in recognising the colour or shape of an object. Ambiguity is then a consequence of the increased error: when the agent wants to reason using the colour of that specific object, its assumption is based on an uncertain state. In another form, uncertainty may be purely a consequence of a lack of knowledge of obtainable facts, as it is present only in the human definitions and concepts and is not an objective fact of nature. In such cases, uncertainty can be removed with further analysis and processing. The computational model has to account for cases where processing can be used to resolve ambiguities, in an attempt to cope with the inherent uncertainty.

1.3.2 Binding

The binding problem [159] arises as soon as we have a system with two pieces of information in different places that refer to the same entity, and need to bind the information components together into a unified whole. For example, when we see a "blue sphere" and a "red cube", binding (in the biological sense) instructs the neural mechanisms to ensure that the sensation of "blue" is coupled with that of "sphere" and that of "red" with that of "cube". Binding combines the different representations, properties and descriptions of an object, while enabling the different sub-systems of the cognitive model to process information about the object in a holistic context. It is suggested that incoming visual (and other) data get allocated object labels in some way, such that "blue" and "sphere" get tagged as "object number 1" [130]. In this form the binding problem is also an issue of memory. How do we remember the associations among different elements of an event (e.g. that the colour of the sphere turned from blue to green)?
How are these associations created and maintained? In the feature integration theory, Treisman and Gelade suggested that binding between features is mediated by the features' links to a common location [160]. In a comprehensive discussion of the many aspects of binding, the same authors point out that objects and locations appear to be separately coded in the ventral and dorsal pathways respectively, raising what may be the most basic binding problem: linking what to where.

1.3.3 Representation

An intelligent robot operating in the physical world needs to combine information from its different sensors to form a coherent and more complete representation of the environment. When binding perceptual information as well as conceptual knowledge, the cognitive model has to support mutual interaction between the different representations, while integrating different sources of information both horizontally (cross-modal and temporal) and vertically (i.e. top-down & bottom-up). According to Roy and Reiter, cross-modal representations are a necessary requirement, needed to support both the semantic and sensory-motor sub-systems [137]. As mentioned earlier in § 1.3.1, the representations should be able to a) handle ambiguity and uncertainty in the perceptual data; and b) support sound inference using symbolic information. Of course, incomplete information may always exist in the knowledge base, and in many cases new knowledge might need to be inferred from prior knowledge through reasoning. A non-trivial issue is then how to represent knowledge in a way that is machine computable, but also expressive enough to capture linguistic structures. The cognitive model should allow the manipulation of complex information structures while maintaining consistency between the various kinds of data, at different abstraction levels.
Finally, the representation has to deal with grounding the interpretations of the semantic representations in perceptual information (sensor data) from the environment.

1.3.4 Perceptual & Cognitive Bias

A holistic view suggests that what someone perceives is the effect of an interplay between past experiences and the interpretation of current perceptions. Therefore, ambiguity in percepts can be rectified by a-priori knowledge. When someone perceives an object about which he holds a preconceived concept, he tends to use the preconceived information and associate it with what he perceives – an inherent cognitive bias of previous knowledge. We consider that the perceptual model should be able to store and interrelate past perceptual experiences, so as to influence how it interprets sensory patterns into percepts. In this context, the semantic model should also be able to store and interrelate past conceptions, so as to influence not only how the world is perceived, but also the reasoning processes. It is generally accepted that human perception relies heavily on the concept of similarity as the basis for categorisation [134]. The class of an object is determined by its similarity to (a set of) prototypes which define each category, allowing for varying degrees of membership. Inspired by the Exemplar theory [113, 124], which argues that a concept can be implicitly formed via all its observed instances, the aspects of cognitive bias and similarity are also related to memory, in the context of how associations among the different elements of an object or event are established and processed.

1.4 Contributions

The research behind this thesis is focused on anchoring, and specifically its application to cognitive systems. As we see in § 1.2.2, anchoring is established both as a problem and as a methodology for solving certain aspects of the perceptual-symbolic integration in cognitive systems.
Although the basic elements of an anchoring system are formally introduced in various articles [31, 32, 35, 100], several aspects needed further consideration, including the technical aspects behind the design and implementation of a cognitive perceptual system which is based on an anchoring framework for its computational model. More specifically, this thesis investigates a) how the different perceptual modalities and grounding mechanisms can be modelled in a cognitive perceptual system; b) how its perceptual-semantic knowledge is defined and represented; and c) how a-priori knowledge can be used as a cognitive bias. In addition, the capabilities of such a system are presented, while identifying the benefits of using anchoring as the main computational model. Having in mind the general research questions mentioned above, this thesis presents some technical contributions related to the design and implementation of the proposed perceptual anchoring framework:

⊲ Design and implementation of a complete and novel anchoring framework for cognitive robots acting in a realistic setting. [Paper I, II, III & IV]
⊲ Modelling of perceptual routines which are based on visual, spatial and topological sensing modalities. [Paper II & III]
⊲ Grounding mechanisms which bind the sensorimotor representations to symbolic conceptual knowledge. [Paper II & III]
⊲ A perceptual-semantic knowledge base and processing mechanisms for storing the symbolic information acquired from perceptual experiences via the anchoring framework. [Paper II & III]
⊲ Integration of the proposed framework with a large-scale deterministic commonsense knowledge base. [Paper II, III & IV]
⊲ Evaluation of the implemented system in a number of scenarios which include linguistic interaction, cooperative and concept acquisition aspects during the anchoring process.
[Paper II, III & IV]

Besides the technical contributions mentioned above, this thesis also discusses some theoretical contributions, related to the extension of the anchoring framework so as to address the following:

⊲ Extend the anchoring framework to support the processing of multi-modal perceptual and semantic knowledge in a grounded conceptual structure which can be accessed both bottom-up and top-down. [Paper III & IV]
⊲ Define the nature of the symbol system using standard knowledge representation and reasoning methods, while integrating an ontology to be used as a shared vocabulary. [Paper II & III]
⊲ Extend the anchoring functionalities to support the integration of the symbol system proposed above. [Paper II & III]
⊲ Introduce a cooperative semantic anchoring model for grounding collectively acquired perceptual knowledge in a multi-robot context. [Paper III]
⊲ Introduce a novel model for representing, grounding and reasoning with spatial and topological relations using fused visual and spatial sensing modalities. [Paper III]
⊲ Define the similarity measure for matching between anchors, both perceptually and semantically. [Paper IV]
⊲ Define an extension of the anchoring space which supports the embedded ontology, integrates knowledge from multiple agents, enables multi-modal integration, defines the similarity-based grounding and allows logical inference. [Paper I, II, III & IV]

In sum, this thesis introduces a knowledge-based perceptual anchoring framework, which is essentially the product of a series of studies that address the points mentioned above. In the different studies performed in the context of the thesis, we attempt to address the challenging problem of how high-level descriptions of visual information may be automatically derived from multiple sensing modalities, while also considering the applicability of the model in real-world scenarios.
The modelling of perceptual-symbolic correspondences is based on grounding the interpretations of the semantic representations in perceptual information coming from the environment, as well as other information sources.

1.5 Outline

This thesis is composed of two parts. The first part (Part A) is a comprehensive introduction, intended to provide insight into the different aspects, contexts and methods which concern this thesis. The second part (Part B) contains the attached publications which support the contributions presented in § 1.4. The remaining parts of this thesis are summarised as follows:

1.5.1 Part A: Grounding percepts to concepts in cognitive robots

The first part of this thesis is an introduction which includes an overview of the background and related work (Chap. 2), followed by the main chapter (Chap. 3), which presents in detail the different methods used in this thesis. A summary chapter (Chap. 4) presents an overview of the research and the related publications, while in the final chapter (Chap. 5) the conclusions of the thesis are presented, followed by suggestions for future development. A brief summary of each chapter follows.

⊲ Chapter 2 This chapter introduces the background behind perceptual & symbol systems, followed by the (sub-)symbolic integration and related extensions which led to the formulation of the symbol grounding problem. Then, a brief discussion summarises the work related to anchoring, concluding with some anchoring approaches in the context of cognitive robots.
⊲ Chapter 3 This chapter introduces the different aspects, methodologies and tools which were used during the development of the anchoring framework: in particular, the basic anchoring framework; the algorithms that were used for perception (vision and localisation); the framework used for knowledge representation; and a discussion regarding commonly available commonsense knowledge bases.
⊲ Chapter 4 This chapter summarises the included publications of the research project and the contributions of the author.
⊲ Chapter 5 This chapter concludes the development of the proposed anchoring framework by assessing the research goals and findings of the thesis as a whole. Finally, a few possible research directions for future work are outlined.

1.5.2 Part B: Publications

The contents and results of this thesis are partially reported in conference proceedings, journal papers and a book chapter. Reprint versions of four publications (Papers I, II, III & IV) are included in the second part (Part B) of this thesis. All publications have been fully peer-reviewed before being accepted for publication. Paper II is an extended version of the results presented in Paper V. Material from Paper VI, which is a summary of the results presented in all the included papers, has been used for the preparation of Chap. 4 of Part A. Finally, the versions of the included papers contain typographical and layout changes in order to match the style of the thesis.
Thesis MindMap: [figure giving a graphical overview of the thesis structure: the introduction topics (symbol grounding, anchoring, motivation, contributions), the Part A chapters (Background & Related Work; Grounding Percepts to Concepts in Cognitive Robots; Summary of Publications; Discussion) and the Part B publications (Papers I–IV).]

Part A: Grounding percepts to concepts in cognitive robots

2 Background & Related Work

2.1 Computational models of cognition

Cognition, the process of thought, is abstractly described as the processing of information. However, most computational models require a (mathematically and logically) formal representation of a problem. Implicitly, the question then becomes: how should we represent information in a cognitive system? And most importantly: what would be the right frame for describing how the cognitive system (biological or artificial) processes information?
Cognitive computational models are used both in simulation and in experimental verification of different specific and general properties of intelligence, where three main perspectives are considered, namely the sub-symbolic, symbolic and hybrid approaches.

2.1.1 Sub-symbolic modelling

Sub-symbolic approaches mainly concern connectionist (neural network) models, which typically follow the neural and associative properties of the human brain. They mainly focus on emergent behaviour, dynamical systems and neural learning. Connectionism is founded on the idea that the mind is composed of simple nodes, and that the power of the system comes primarily from the existence and manner of connections between these simple nodes. Neural nets are textbook implementations of this approach. Some critics argue that even though connectionist models approach biological reality as a representation of how the system works, they lack explanatory power, because complicated systems of connections, even with simple rules, are extremely complex and often less interpretable than the system they model.

2.1.2 Symbolic modelling

Symbolic modelling, on the other hand, is focused on the abstract mental functions of an intelligent mind, based on the principle that it operates using symbols. This approach originates from knowledge-based systems merged with a philosophical perspective, that of "Good Old-Fashioned Artificial Intelligence" (GOFAI). Initially, such systems were developed by the first cognitive researchers and later used in information engineering and expert systems [163]. Symbolic models assume that the mind is built like a digital computer (serial processing), where hypothetical mental representations are symbols which are serially processed using sets of rules.
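The idea of serially rewriting symbolic structures with rules can be sketched as follows. This is a minimal illustrative sketch in Python; the facts, the rule format and the function name are assumptions made for the example, not part of the thesis or any particular cognitive architecture:

```python
# Minimal sketch of a physical-symbol-system style rule engine:
# a working memory of symbolic facts is serially rewritten by rules
# until no rule can add anything new (a fixed point).

def forward_chain(memory, rules):
    """Serially apply rules: a rule fires when all of its condition
    symbols are present in memory, adding its conclusion symbol."""
    memory = set(memory)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            if conditions <= memory and conclusion not in memory:
                memory.add(conclusion)
                changed = True
    return memory

# Hypothetical facts and a single hypothetical rule.
memory = forward_chain(
    {"red(cube-1)", "on(cube-1, table-1)"},
    [({"red(cube-1)", "on(cube-1, table-1)"}, "visible(cube-1)")],
)
```

The point of the sketch is only that the "mental representations" are opaque symbol strings, and that new expressions are produced purely by rule application over their form, exactly the property the symbolic paradigm assumes.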
A physical symbol system (also called a formal system) takes physical patterns (symbols) and combines them into structures (expressions), which can be manipulated (using processes) to produce new expressions. Newell and Simon's Physical Symbol System Hypothesis [122] notably argues that symbolic computation is both necessary and sufficient for general intelligent action [138]. Since symbolic computation is Turing-complete, this is trivially true in principle, but criticism of symbolic (rule-based) AI maintains that a purely symbolic system does not constitute a feasible practical approach, either because discrete symbols are technically insufficient [12, 13] or because such a system usually lacks grounding in a physical environment [143].

2.1.3 Hybrid Modelling

This paradigm emerged as a solution to the debate of whether the mind is best viewed as a huge array of small but individually feeble elements (i.e. neurons), or as a collection of higher-level structures such as symbols, schemata, plans and rules. Both the symbolic and the associationistic approaches have their advantages and disadvantages. As Gärdenfors aptly argues:

". . . they are often presented as competing paradigms in artificial intelligence, but since they attack cognitive problems on different levels, they should rather be seen as complementary methodologies." (P. Gärdenfors [58])

Hybrid models often draw from machine learning, and include techniques which put symbolic and sub-symbolic models into correspondence. These models tend to be generalised and take the form of integrated computational models of abstract (synthetic) intelligence, in order to benefit from the advantages of both symbolic and sub-symbolic models.
Figure 2.1: Schematic diagram of a rational intelligent agent merged with the PEAS (Performance measure, Environment, Actuators, Sensors) paradigm from a cognitive perspective. The agent perceives its environment through sensors, where the complete set of its inputs at a given time is called a percept. Then, the current percept, or a sequence of percepts, can influence the actions of the agent through the performance measure.

A system with both symbolic and sub-symbolic components is a hybrid intelligent system.

2.1.4 Intelligent agent & the environment

A very popular paradigm in hybrid systems, which has also been widely accepted since the 1990s in cognitive robotics, is the notion of the Intelligent Agent (IA) [138]. An agent can be simply described as an autonomous entity which perceives the environment so as to acquire new information, and uses this knowledge to direct its activity towards achieving its goals via acting in the same environment. It is thus a particularly well-suited model for cognitive robots, which can be very simple or arbitrarily complex. A reflex machine such as a thermostat can be represented as an intelligent agent, as can a humanoid robot or even a swarm of robotic systems working together towards a goal. An agent that solves a specific problem can use any approach that works – some agents are symbolic and logical; some use sub-symbolic approaches or neural networks, while others may use mixed approaches. The agent's experience, depending on the environment in which the agent exists, is characterised by some properties. The real world is, of course, partially observable, stochastic, sequential, dynamic, continuous and multi-agent [138].
All the elements mentioned so far are typically unified in the PEAS paradigm (see Fig. 2.1), which can be briefly described as having three major (possibly overlapping) sub-systems with different functions, defining how sensory data are processed and distributed through the system and where decisions are made: a perceptual sub-system which gains information from the environment; an action sub-system that emits energy in various forms; and, between those sub-systems, an arbitrarily complex collection of central mechanisms that interact with the perceptual and action sub-systems but may also be involved in other cognitive processes.

2.2 Perceptual Systems

Sensing involves gathering information about the environment using sensors. Robots can be given the ability to see through a camera image sensor and perceive light intensity data. Besides vision, robots can rely on other sensors, for example laser range-finders, with which they can perceive the distance to objects. From a traditional point of view, perception is the cognitive process of attaining awareness or understanding of sensory information. The perceptual system constructs an internal representation of the world, which is further processed and related to other information [123]. The study of perception is mainly concerned with exteroception, and quite often the word perception simply means exteroception. In modelling perception we are concerned with how sensory stimuli are observed, represented and interpreted. A stimulus may be the occurrence of an object at some distance from the robot. The perceptual system transforms stimuli (sensory data) from the world into percepts (e.g. colours, shapes, textures), which are then relayed to the cognitive model. This transformation presupposes the existence of a complex perceptual system which mainly relies on visual and spatial perception and their fusion.
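The stimulus-to-percept transformation described above can be sketched minimally as follows. The `Percept` structure, the feature names and the toy heuristics are all hypothetical assumptions made for illustration; a real perceptual system would of course use far richer sensor data and recognition methods:

```python
from dataclasses import dataclass

@dataclass
class Percept:
    """An interpreted percept relayed to the cognitive model."""
    colour: str
    shape: str
    distance_m: float

def perceive(pixel_rgb, contour_sides, range_reading_m):
    """Turn raw sensory data (a pixel, a contour, a range reading)
    into a symbolic percept, using deliberately naive heuristics."""
    r, g, b = pixel_rgb
    colour = "red" if r >= g and r >= b else "blue" if b >= g else "green"
    shape = {3: "triangle", 4: "cube"}.get(contour_sides, "sphere")
    return Percept(colour, shape, range_reading_m)

percept = perceive(pixel_rgb=(200, 30, 40), contour_sides=4,
                   range_reading_m=1.2)
```

Note the abstraction step: the numeric stimulus (intensities, side counts, ranges) is gone from the output, and only the percept-level features remain for the cognitive model to bind and reason over.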
2.2.1 Visual Perception

Generally, vision is a complex task which involves various scientific topics such as signal processing and pattern recognition. However, vision has the potential to provide an abundance of information about the surroundings, where the main task is to derive the structure of a scene, and other information about the environment, from one or several images. The most difficult task robots are required to solve is to answer the question: "What do I see, and where?". In the computer vision community this task can be summarised in the following two problems:

⊲ Image classification or image labelling is the task of classifying an image according to its content.
⊲ Object detection is a more complex problem, that of detecting and locating instances of a certain object class inside an image, with the best possible accuracy.

As we see, the modelling of visual perception is a difficult task, and achieving human-level results is far from the current state-of-the-art. But why is vision so difficult? Humans typically solve the difficult visual problems mentioned above effortlessly, in fractions of a second and with very small error. However, an artificial vision system must be able to address a wide range of challenges, which are often common across visual processes and regard representing and categorising the visual content in the presence of variations. Possible variations of object appearances, especially in real-world scenarios, can be divided into two main classes: image and object variations.

2.2.1.1 Image Variations

Figure 2.2: Examples of image variations.

⊲ Illumination variation greatly influences the appearance of objects, due to environmental changes in the lighting conditions or the occurrence of shadows.
⊲ Scale variation may be caused by the imaging device or may be due to perspective transformation.
⊲ Viewpoint variation causes changes in the appearance of objects due to camera position changes in relation to the object.
⊲ Clutter results in confusion between foreground and background objects. Object features are likely to occur in the background, thereby producing false matches and wrong evidence about the presence of objects.
⊲ Occlusion is caused by other objects or by truncation at the image border. Self-occlusion describes the situation where parts of the object occlude other parts of the same object.
⊲ Motion blur occurs due to fast motion of the camera or low-light conditions. Slow shutter speed, moving objects or camera shake contribute to this blurring, and the obtained image is degraded by motion blur.

2.2.1.2 Object Variations

Figure 2.3: Examples of object variations.

⊲ Interclass variation refers to the variation between objects of different classes that have relatively little appearance variation between them.
⊲ Intraclass variation refers to the variation among objects that belong to the same class.
⊲ Articulation variation occurs when the appearance of an object varies because of different poses or positions of its parts.
⊲ Pose variation exists when objects occur in different poses, which may result in completely different appearances.
⊲ Size variation concerns the cases where the size of objects can significantly influence the similarity to other object classes and increase the variance within one object class.

2.2.2 Computer Vision for cognitive perception

Many computer vision techniques are involved in robot vision systems. A typical vision system includes methods for: visual interest point detection, feature extraction and matching, detection and segmentation, as well as high-level processing such as object recognition.
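The chain of methods listed above can be sketched as a pipeline of stages. Each stage below is a deliberately naive placeholder standing in for a real algorithm (corner detectors, descriptors, classifiers); the images, thresholds and prototype values are assumptions made for the example:

```python
# A toy vision pipeline: interest points -> features -> recognition.

def detect_interest_points(image):
    """Stage 1: return coordinates of 'salient' (here: bright) pixels."""
    return [(x, y) for y, row in enumerate(image)
            for x, value in enumerate(row) if value > 0.8]

def extract_features(image, points):
    """Stage 2: compute a descriptor per point (here: raw intensity)."""
    return [image[y][x] for (x, y) in points]

def recognise(features, prototypes):
    """Stage 3: match against class prototypes (here: the prototype
    whose intensity is closest to the mean descriptor)."""
    if not features:
        return "unknown"
    mean = sum(features) / len(features)
    return min(prototypes, key=lambda label: abs(prototypes[label] - mean))

image = [[0.1, 0.9],
         [0.95, 0.2]]
points = detect_interest_points(image)
features = extract_features(image, points)
label = recognise(features, {"mug": 0.9, "table": 0.2})
```

The design point the sketch makes is that each stage consumes the previous stage's output and raises the level of abstraction, from pixels to points, to descriptors, to a symbolic label.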
This dissertation focuses on computer vision methods, and specifically object recognition, in order to help with processing the increasing volume of image data inherent to the domain of cognitive robots (and robots in general). Besides the challenges mentioned in § 2.2.1, the real difficulty comes in interpreting the data retrieved from the camera sensor. This is because a robot must make sense of the raw data, which are represented as an array of numbers indicating the light intensity at each pixel, making image understanding computationally very demanding. In the early 1960s, computer vision research started as an artificial intelligence problem, the goal of which was to understand images by detecting objects and defining the relationships between them [132]. In the same context, Marr presented his famous view that, given visual information, a complete geometric reconstruction of the observed scene is required for visual perception, and that the vision system should compute the shape, orientation and colour of every object in the scene while recognising them in one image [108]. It was not until the late 1980s that a more focused study of the field started, when computers could manage the processing of large data sets such as images.

2.2.3 Spatial Perception

In mobile robotic applications we have to consider another important aspect, localisation, which usually refers to the estimation of the position and orientation of a robot in a reference frame. There are two types of localisation problems: a) global localisation – where is the robot located, given that we do not know where the robot started; and b) position tracking – where is the robot located, given that we know where the robot started. Furthermore, from a cognitive perspective there is also the aspect of spatial perception, where the robot needs to locate the objects it manipulates and simultaneously represent their spatial distribution in its environment.
This should of course come in relation to its own position (egocentric frame), but also in relation to other objects' positions (allocentric frame).

2.3 Symbol Systems

Cognitive robots need to represent the knowledge about the relevant parts of the world they inhabit. Semantic knowledge (or meaning), according to standard theories of cognition, appears to reside in a semantic memory separate from the multi-modal systems of perception and action. Representations from the modal systems are transduced into amodal symbols, representing conceptual knowledge about the robot's experience. Although little empirical evidence supports the presence of amodal symbols in cognition, they were widely adopted because they provided elegant and powerful formalisms for knowledge representation (KR). The reason behind this is that they capture important intuitions about the symbolic character of cognition. Therefore, many approaches in the lines of cognitivism adopt the symbolic paradigm for representing and processing information, as cognition is considered a form of computation and hence formal symbol manipulation [122]. The fundamental goal is to represent knowledge in a way that facilitates reasoning. This is done by analysing how to formally think and how to use a symbol system for representing a domain of discourse, using functions that allow inference and manipulation of this knowledge. In turn, this knowledge can be relayed backwards, toward the perceptual systems, while also being used by various other cognitive processes. Thus, semantic representations should be adequate for higher-level cognitive processes and consistent across different levels of abstraction. With the use of semantics, we should be able to express the different modalities involved in perception as well as conceptual descriptions.
2.3.1 Models for Knowledge Representation

Knowledge representation is a family of techniques for describing (or encoding) information about a particular description of the world in a precise and unambiguous way. Furthermore, the problem concerns not only the representation of information, but also how these representations are to be reasoned upon, so as to achieve intelligent behaviour. The term "intelligent" refers to the ability of a system to deduce implicit knowledge out of its explicit knowledge. Models of knowledge representation were initially developed in the field of AI, originating from theories of human information processing. The most commonly used way to encode knowledge is through a logical formalism, which supplies both the formal semantics of how reasoning functions apply to concepts in the domain of discourse, and operators (i.e. quantifiers, modal operators, etc.) that, along with an interpretation theory, give meaning to the formulae expressed in the logical formalism. Early models attempting to represent knowledge include the works of Boole (1815–1864) and Frege (1848–1925), which resulted in the well-known Propositional and First-Order Logics (FOL) respectively. Drawbacks such as undecidability led to the introduction of non-logic-based representation languages, such as Semantic Networks and Frame Systems. A brief summary follows [56]:

⊲ Object–Attribute–Value triplets are a technique used to represent facts about objects and their attributes. More precisely, an O-A-V triplet asserts an attribute value of an object.
⊲ Logics aim at emulating the laws of thought by providing a mechanism to represent statements about the world. The representation language is defined by its syntax and semantics, which specify the structure and the meaning of statements, respectively.
The most widely used and understood logic is First-Order Logic (FOL), which assumes 1) the existence of facts, objects (individual entities) and relationships among objects; and 2) the truth values true, false and unknown for statements.

⊲ Semantic networks are directed graphs, consisting of vertices that represent concepts and edges that encode semantic relations between them. Concepts can be arranged into taxonomic hierarchies and have associated properties.

Logics of various kinds, and logical reasoning and representation languages such as Prolog and KL-ONE, have been popular tools for knowledge modelling, for example in expert systems. However, the user is required to be a logic expert. This problem led to the introduction of "ideal" formalisms which integrate well-defined semantics into frame systems and semantic networks, benefiting from the positive features of both frame systems and logic-based formalisms. They are known as Description Logics (DL) and are particularly useful for representing domain knowledge through the specification of concepts and their relationships. DLs are the most commonly used formalism today in several disciplines, such as natural language processing and web-based information systems. Most notably, they provide a logical formalism for authoring ontologies on the Semantic Web¹. Below we summarise the related challenges.

2.3.2 Challenges related to KR&R

The problems of linking and encapsulating real-world concepts by means of logic and related formalisms became apparent around the 1980s, when formal computer knowledge representation languages and systems arose. They attempted to encode wide bodies of general (common-sense) knowledge, as for example the Cyc project, which attempted to encode the information necessary for understanding encyclopaedic content.
The difficulty of knowledge representation came to be better appreciated as the gap widened between the early claims of AI enthusiasts and the actual performance of implemented systems. First, we meet the symbol grounding problem (see § 2.5), which seems inevitable when relating abstract symbols to aspects of the real world. In the same context, another key difficulty for cognitive agents is to be able to represent predicate or function symbols that can change their values so as to reflect the dynamic nature of the world. This is known as the frame problem, which originally concerned the difficulty of using logic to represent which facts about the environment change and which remain constant over time. In a broader sense, the frame problem can be thought of as finding appropriate KR&R mechanisms to make inferences within the resource constraints, in order to limit the number of assertions and inferences that an artificial cognitive agent must make to solve a given task in a particular context. Another important aspect which affects the modelling of knowledge representation systems concerns expressivity and the trap of decidability. Brachman and Levesque argue that "there is a trade-off between the expressiveness of a representation language and the difficulty of reasoning over the representations built using that language" [10]. The more expressive the language, the easier and more compact it is to represent a sentence in the knowledge base. However, in more expressive languages it is computationally more difficult to derive inferences. An example of a less expressive KR would be propositional logic, while a more expressive one would be an auto-epistemic temporal modal logic.

¹ All the major KR languages and standards, such as the Resource Description Framework (RDF), the RDF Schema, the Web Ontology Language (OWL) and the Ontology Inference Layer (OIL), are solely based on DLs.
Less expressive KRs may be both complete and consistent (formally less expressive than set theory), in contrast to more expressive KRs, which may be neither complete nor consistent. The expressivity of the language is also determined by the application domain and the level of abstraction of the knowledge intended to be expressed in the knowledge base. For instance, in a simulated single-agent scenario we would probably need propositional logic or binary decision diagrams to model the agent’s low-level perceptions, thus avoiding undecidability. On the other hand, in a multi-agent scenario in the real world, where the cognitive agents manipulate commonsense information or communicate via natural language with humans, we might need extensions of second-order predicate calculus, again with the risk of getting trapped in undecidable situations.

2.4 Commonsense Information

Knowledge, perceptual or semantic, is required in abundance by the cognitive agent to accomplish a specific task or to model a particular domain. Commonly we assume that any person possesses a large and compatible body of knowledge known as common-sense, which acts as the communication medium. But what actually is commonsense knowledge? Commonsense consists of what people “sense” as their “common” natural understanding, referring to beliefs or propositions that most people would consider wise and sound. Thus “common-sense” (in this view) equates to the knowledge and experience which most people are assumed to already have. The concept originates from Aristotle2, who considered “common-sense” to be the internal sensation that is formed after the external senses are united and judged. Most notably, Locke argued that it is the sense of things in common between disparate impressions (integrated sense-data).

2 Aristotle, De Anima, Book III, Part 2, accessed online at: http://classics.mit.edu/Aristotle/soul.html
Most of the empiricist philosophers approach the problem of the unification of sense-data by positing a sense within human understanding that sees commonality and does the combining. While to the average person the term “common-sense” is regarded as synonymous with “good judgement”, in the AI community it is used in a technical way, to refer to the millions of basic facts and understandings possessed by most people. It is considered to span a huge portion of human experience, thus encompassing knowledge about the a) spatial; b) physical; c) social; d) temporal; and e) psychological aspects of typical everyday life. Similarly, robots or intelligent agents in general, in order to be capable of comprehensive communication with a human, have to possess a vast amount of commonsense knowledge. It is remarkable that such knowledge is typically omitted from social communication, simply because every person is assumed to possess the common-sense needed to comprehend. Humans effortlessly assume or infer the omitted underlying information. In contrast, robots interacting with humans have no general knowledge about the world, nor can they reason through a problem using esoteric and exoteric knowledge, logic and inference. The problem is considered to be among the hardest of all in AI research, because the breadth and detail involved are enormous. Arguably, McCarthy’s advice-taker proposal [111] is the first scheme to represent commonsense knowledge in mathematical logic. It uses an automated theorem prover to derive answers to questions expressed in logical form, and is the precursor of Logic Programming and modern computational logic languages such as Prolog and LISP. Winograd’s SHRDLU system was also one of the early natural language understanding computer programs which conversed with a human in English [172].
Although SHRDLU was overly simplified, the result was a tremendously successful demonstration of AI (at the time), thus encouraging many other researchers to pursue systems dealing with more realistic situations and with real-world ambiguity and complexity. One of the milestones in the development of common-sense oriented KR systems is the SNePS system developed and maintained by Stuart C. Shapiro and his group [145, 146]. SNePS is simultaneously a logic-, frame- and network-based KR&R system used both as a stand-alone system and as an implementation of the mind of intelligent agents (cognitive robots), under the scope of the GLAIR agent architecture (a Grounded Layered Architecture with Integrated Reasoning). SNePS has been used for a variety of KR&R tasks such as commonsense reasoning and natural language understanding and generation, in the context of cognitive robotics. However, the projects mentioned above were not mainly intended for robotic use, and not many works exist that focus on modelling the common-sense informatic situation in mobile robots, except for two notable examples [141, 144].

2.5 Perceptual-symbolic integration and the Symbol Grounding Problem

In the context of hybrid cognitive modelling and subsymbolic integration, we meet one very central argument, Searle’s well-known “Chinese Room” argument [143], which challenged the assumptions of classical AI. Searle proposed a situation where the symbolic task of responding to questions in Chinese, without knowing the language, can apparently be performed well while at the same time the meaning of the questions and answers is not understood. He claims that symbol systems have some fundamental limitations that prevent them from exhibiting true cognitive behaviour (due to the use of ungrounded symbols).

Figure 2.4: Visual metaphor of Searle’s Chinese room argument.
What he really meant is that a symbol-manipulating system in isolation will never be able to truly “understand” anything about the physical world. Searle’s argument still provokes debate to this day; however, one of the most influential responses is Harnad’s [64]. Harnad argues that a system with sensors and connectionist (or non-symbolic) components classifying raw sensor data would not be subject to the “Chinese Room” argument, when using an approach which grounds symbols in low-level perceptions of the environment.

2.5.1 Harnad’s Symbol Grounding Problem and solution

Harnad first proposed the symbol grounding problem as a question concerning semantics: “How can the semantic interpretation of a formal symbol system be made intrinsic to the system, rather than just parasitic on the meanings in our heads?” (S. Harnad [63]), arguing that information fusion at the symbolic level will face the grounding problem at two levels: a) modality-inherent grounding within each modality; and b) cross-modal grounding between modalities. As a solution approach, Harnad proposes a hybrid sub-symbolic architecture where there is no longer any autonomous symbolic level at all. Instead, there is an intrinsically dedicated symbol system whose elementary symbols (names) are connected to non-symbolic representations via connectionist networks that extract invariant features of their analogue sensory projections. The introduction of the SGP spawned a great debate around the subject, and many strategies have attempted to solve the problem from different perspectives. Representationalist strategies approach the SGP by grounding symbols in the representations arising from the manipulation of perceptual data. Here lies also the hybrid model of Harnad, which implements a mixture of features characteristic both of symbolic and of connectionist systems.
According to Harnad, the symbols in question can be grounded by connecting them to the perceptual data they denote through a bottom-up, invariantly categorising processing of sensorimotor signals, where symbols are grounded in three stages: a) iconisation – the process of transforming analogue signals into iconic representations; b) discrimination – the process of judging whether two inputs are the same or how much they differ; and c) identification – the process of assigning a unique name to a class of inputs, treating them as equivalent or invariant. He also argues that three kinds of representations are necessary: a) iconic representations, which are the sensory projections of the perceived entities; b) categorical representations, which are learned and innate feature detectors that pick out the invariant features of object and event categories from their sensory projections; and c) symbolic representations, which consist of symbol strings describing category membership. Both the iconic and the categorical representations are assumed to be non-symbolic. Finally, he concludes that a connectionist network is the most suitable mechanism for learning the invariant features underlying categorical representations, and thus for connecting names to the icons of the entities they stand for. The function of the network, then, is to pick out the objects to which the symbols refer. Concerning Harnad’s approach, one can note that although it seems clear that a pure symbolic system does not suffice (since sensors do not provide symbolic representations), the approach of adopting connectionist networks alone appears limited as well. Subsequent work by Cangelosi et al. and Cangelosi and Harnad provides a detailed description of the mechanisms for the transformation of categorical perception (CP) into grounded low-level labels and subsequently into higher-level symbols [17, 22].
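The three stages can be caricatured in a few lines of code. The signals, the distance measure and the category prototypes below are all invented for illustration, standing in for the connectionist feature extractors Harnad actually proposes:

```python
# Toy sketch of Harnad's three grounding stages (illustrative only).

def iconise(analog_signal):
    """Iconisation: project an analogue signal onto an iconic representation."""
    return tuple(analog_signal)

def discriminate(icon_a, icon_b):
    """Discrimination: judge how much two inputs differ (0 = same)."""
    return sum(abs(a - b) for a, b in zip(icon_a, icon_b))

def identify(icon, categories, tolerance=0.5):
    """Identification: assign the name of the category whose invariant
    features (here, a prototype) the icon matches."""
    for name, prototype in categories.items():
        if discriminate(icon, prototype) <= tolerance:
            return name
    return "unknown"

# Invented category prototypes playing the role of invariant features.
categories = {"horse": (0.9, 0.1), "zebra": (0.9, 0.9)}

icon = iconise([0.85, 0.82])
print(identify(icon, categories))  # -> zebra
```

Discrimination compares inputs pairwise, while identification collapses a whole region of inputs onto one name, which is exactly the invariance the categorical representations are meant to capture.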
They call grounding transfer the phenomenon of acquiring new symbols from the combination of already grounded symbols. Then, they show how such processes can be implemented with neural networks [22]. According to Cangelosi et al., the functional role of CP in symbol grounding is to define the interaction between discrimination and identification. Cangelosi and Harnad outline two methods for the acquisition of new categories. They call the first method sensorimotor toil and the second symbolic theft, in order to stress the benefit (enjoyed by the system) of not being forced to learn from direct sensorimotor experience whenever a new category is in question [20]. Finally, they provide a simulation of the process of CP, for the acquisition of grounded names and for the learning of new higher-order symbols from grounded ones [22].

Figure 2.5: The semiotic landscape: the sign and its relations between a meaning, a form and a referent. The figure emphasises the temporal evolution of the semiotic triangle: starting from the original definition of Ogden and Richards [125], we move to the Peircean definition of symbols [129]. Ullmann’s triangle represents the relation between things in reality and the constructs of a language [162], followed by Harnad’s and Vogt’s adoptions [63, 168].

2.5.2 Physical Symbol Grounding

Vogt connects the solution proposed by Harnad [63] with embodied robotics [12, 13] and with the semiotic definition of symbols [129] (see Fig. 2.5).
His approach to the SGP originates from embodied cognitive science and grounds the symbolic system in the sensorimotor activities of the robot, thus transforming the SGP into the Physical Symbol Grounding Problem (PhSGP). He then solves the PhSGP by relying on two conceptual tools: semiotic symbol systems and the guess game. Vogt defines symbols as a structural pair of sensorimotor activities and environmental data [167, 168]. According to the semiotic definition, symbols have a) a form (Peirce’s “representamen”), which is the physical shape taken by the actual sign; b) a meaning (Peirce’s “interpretant”), which is the semantic content of the sign; and c) a referent (Peirce’s “object”), which is the object to which the sign refers. Following this Peircean definition, a symbol always comprises a form, a meaning and a referent, with the meaning arising from a functional relation between the form and the referent, through the process of semiosis or interpretation. The whole semiotic symbol system then grounds the meaning of the symbols in the sensorimotor activities, thus solving the PhSGP [167, 169, 170]. The solution to the PhSGP is based on the guess game [152], a technique used to study the development of a common language by situated robots. Here a Game is defined as: “a routinised sequence of interactions between two agents involving a shared situation in the world” [150]. Steels and Vogt implemented the previously mentioned language games on simple Lego robots equipped with light and infra-red sensors. Later, Steels [149] studied the emergence of shared languages in a group of autonomous cognitive robots that learn the categories of objects. He argues that Language Games are a useful way of modelling the acquisition or sharing of language [149, 150, 153].
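The Peircean form–meaning–referent structure described above can be sketched as a simple record. The field contents are illustrative placeholders, not Vogt’s actual data structures:

```python
from dataclasses import dataclass

@dataclass
class SemioticSymbol:
    form: str      # Peirce's representamen: the physical shape of the sign
    meaning: dict  # Peirce's interpretant: the semantic content of the sign
    referent: str  # Peirce's object: the entity the sign refers to

# A symbol always comprises all three parts; the meaning arises from the
# functional relation between form and referent (semiosis).
word = SemioticSymbol(form="cup",
                      meaning={"hue": 12, "area": 310},
                      referent="cup-22")
print(word.form, "->", word.referent)  # -> cup -> cup-22
```

The point of the triad is that the word “cup” alone is not a symbol; it only becomes one together with a semantic content and the entity it stands for.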
As Vogt points out, several critics argue that: “robots cannot use semiotic symbols meaningfully, since they are not rooted in the robot”. To this he replies: “. . . it will be assumed that robots, once they can construct semiotic symbols, do so meaningfully.” (P. Vogt [168]) Coradeschi and Saffiotti identified and formalised a problem similar to the PhSGP, which they term anchoring. The problem concerns “the connection between abstract and physical-level representations of objects in artificial autonomous systems embedded in a physical environment” [38]. A detailed description of anchoring is presented in the next section (§ 2.6). Vogt and Divina characterise this problem as a technical aspect of symbol grounding, since it deals with grounding symbols to specific sensory images [171], and mention that the symbol grounding problem in general has to deal with anchoring to abstractions, while including philosophical issues related to meaning. Here it is important to highlight the differences between perceptual anchoring and physical symbol grounding. Anchoring can indeed be seen as a special case of symbol grounding, which concerns primarily physical objects [35, 38], actions and plans [14, 15, 34, 74, 105], but not abstract concepts. Even though anchoring elaborates on certain aspects of physical symbol grounding and can be seen as another solution to it, it deviates from the scope of language development (according to Vogt [170]). For example, in symbol grounding we would consider ways for an intelligent agent to ground the concepts of “stripes” or “redness” in perception in general, while in anchoring we would instead consider how to model the perceptual experiences of a cognitive agent into a structure that a) describes an object (e.g., a zebra), its properties and attributes, both symbolically and perceptually; and b) maintains these associations over time.
Finally, a popular class of representations for symbols which are naturally grounded is that of affordance-based representations [60]. Such a representation encapsulates what the robot can afford to do with respect to objects. For example, Roy has studied the grounding of words by manipulator robots in both perceptual and affordance features [135]. He has also surveyed the area of language formation for models that ground meaning in perception, and has noted the importance of future work on discourse and conversational models, based on human studies, for inspiration in aligning these models between communicating partners [136].

2.5.3 Social Symbol Grounding

The multi-agent perspective of symbol grounding, where symbols have to be agreed upon when there is more than one agent involved, is called “social symbol grounding” [166]. Methods explicitly dealing with the alignment of separate grounded representations learned individually originate from language formation. Moreover, they have been studied extensively in linguistics [151]. Steels and Kaplan describe a “guessing game”, in which two simulated agents view a common screen with various shapes and colours while attempting to obtain similar symbols to characterise them. It is important to note that they also use semiotic symbols [167] (see § 2.5.2 and Fig. 2.5). Here, instead of just having the symbol and meaning, there is a form (or word), a meaning (which can be a grounded aspect of the referent, e.g., colour) and a referent (the object itself), which must be aligned between the two agents. Also in the context of social symbol grounding, Cangelosi and collaborators have studied the emergence of language in multi-agent systems performing navigation and foraging tasks [18] and object manipulation tasks [21, 96].
2.6 Foundations of Perceptual Anchoring

The concept of anchoring in robotics is closely tied to the PhSGP, as it constitutes the bridge between the symbolic (conceptual) and the subsymbolic (perceptual) models of the cognitive agent. Perceptual Anchoring is the grounding of symbols that refer to specific object entities, such as a “cup” (or, more specifically, “cup-22”), and to general properties, such as “blue”. Anchoring is rather a subset of symbol grounding, which is limited to physical objects (concrete instances) and does not concern abstract concepts or concrete classes (groups). It emphasises addressing the physical symbol grounding problem [168] by providing a solution to it. Anchoring was first presented by Saffiotti, who identified the need to ground, or better anchor, the descriptions and perceptual properties of the physical objects surrounding an intelligent agent into internal representations that are used during the execution of actions [139]. With anchoring he refers to the execution of two perceptual capabilities: i) finding an object by matching its percepts with its symbolic description; and ii) acquiring new information about the properties of this object. Later, Coradeschi and Saffiotti gave a formal account of the problem of anchoring, where they acknowledge that: “Every intelligent agent embedded in physical environments needs the ability to connect, or anchor, the symbols used to perform abstract reasoning to the physical entities which these symbols refer to.” (S. Coradeschi and A. Saffiotti [37])

Figure 2.6: Graphical illustration of the anchoring problem [35]. The figure shows a symbolic reasoning system with individual symbols (e.g., “table1”, “door5”, “cup22”, “room3”) and predicate symbols (e.g., “red”, “large”), a sensori-motoric system producing percepts with measured attributes (e.g., hue = 12, area = 310), and a predicate grounding relation g mapping predicates such as “red”, “orange” and “blue” to hue value ranges (e.g., [-20, 20], [20, 30], [220, 260]).

They introduced the concept of the anchor3 and its functionalities, while also outlining a few difficulties, specifically regarding indexical and objective references [37], definite and indefinite descriptions [36], as well as the existence of uncertainty and ambiguities, which are inevitable in the presence of sensor data [15, 31, 36]. Through the early development, the most prevalent definition assumes that anchoring is the “. . . process of creating and maintaining the correspondence between symbols and percepts that refer to the same physical object” [32, 35]. In this fundamental work a domain-independent definition and computational theory of anchoring is proposed, via practical examples in the robot navigation and aerial surveillance domains. After the establishment of anchoring both as a problem and as a methodology, subsequent work by Coradeschi and Saffiotti investigated how anchoring relates to different contexts, while also concretely framing the computational theory behind it. A brief summary of the most representative iteration of an anchoring framework includes the following definitions: ⊲ Perceptual System Contains a set of percepts (collections of measurements assumed to originate from the same object) and a set of attributes (measurable properties of percepts). ⊲ Predicate Grounding Relation Embodies the correspondence between the unary predicates in the symbol system and the attributes in the perceptual system.

3 The term anchor originates from studies in situation semantics, where anchors for parameters provide a formal mechanism for linking parameters to actual entities. An anchor for a set A of basic parameters is a function f defined on A, which assigns to each parameter Tn in A an object of type T [3].
⊲ Symbol System Contains a) a set of symbols which denote objects (e.g., “cup-22”); b) a set of unary predicate symbols, which describe symbolic properties of objects (e.g., “green”); and c) an inference mechanism which uses these components. The perceptual system continuously generates percepts, such as regions in images, and associates them with measurable attributes of the corresponding objects, such as colour values. The symbol system assigns unary predicates, such as “green”, to symbols which denote objects, such as “cup-22”. The associations between symbols and percepts are reified via structures called anchors. Each anchor contains symbols, percepts and estimates of one object’s properties. Anchors are time-indexed, since their contents can change over time, and are manipulated with three basic functionalities: i) find; ii) (re)acquire; and iii) track [32]. Visual perception is perhaps one of the most important aspects closely related to the scope of this thesis. An anchoring technique based on fuzzy set theory is presented in [31], which deals with representing uncertainty and the degree of matching between perceptual signatures of objects observed by the vision system of an unmanned helicopter. Another aspect, considered in a position paper, is the role of learning in anchoring [33], with respect to learning the predicate grounding relation, learning property dynamics and, finally, learning to generate referring expressions. In the context of planning, we see autonomous robots that incorporate a symbolic planner for their operation and use anchoring in order to associate the terms of plans to the relevant sensor data [34]. In addition, extensions of research in planning and anchoring (including action properties and partial matching) often deal with the grounding of symbols to actions [38]. Finally, an approach using a novel anchoring modality was presented by Loutfi et al. [100].
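A minimal sketch ties these definitions together: a predicate grounding relation g maps unary predicates to admissible attribute values, and a time-indexed anchor pairs a symbol with a percept. The hue ranges follow the illustrative values of Fig. 2.6; the code itself is a hypothetical sketch, not the framework’s implementation:

```python
# Predicate grounding relation g: predicate -> (attribute, admissible range).
g = {
    "red":    ("hue", (-20, 20)),
    "orange": ("hue", (20, 30)),
    "blue":   ("hue", (220, 260)),
}

def matches(predicate, attributes):
    """Does a percept's attribute vector satisfy the predicate under g?"""
    attr, (lo, hi) = g[predicate]
    return attr in attributes and lo <= attributes[attr] <= hi

class Anchor:
    """Time-indexed correspondence between a symbol and a percept."""
    def __init__(self, symbol, percept, t):
        self.symbol, self.percept, self.t = symbol, percept, t

    def reacquire(self, percept, t):
        """(Re)acquire: refresh the anchor with a new observation."""
        self.percept, self.t = percept, t

percept = {"hue": 12, "area": 310}          # attribute values from Fig. 2.6
print(matches("red", percept))              # -> True
anchor = Anchor("cup-22", percept, t=0)
anchor.reacquire({"hue": 14, "area": 305}, t=1)
print(anchor.symbol, anchor.t)              # -> cup-22 1
```

The find and track functionalities would be built on the same pieces: find searches current percepts for one matching a symbolic description under g, while track updates an existing anchor from the percept stream.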
There, olfactory data were used in a shared representation that linked several sub-systems of the robot, such as the planner with the motion control and the olfactory sensors, while addressing both top-down and bottom-up processes [100]. In bottom-up processes, sensor data determine the initiation of an anchoring process, whereas top-down functions may initiate an anchoring process upon request.

2.7 Perceptual Anchoring in Cognitive Systems

The framework proposed by Coradeschi and Saffiotti provides a simple yet concrete theoretical point of view for addressing the PhSGP. It is also a platform for addressing the creation and maintenance of the symbol-percept and anchor-object correspondences in cognitive agents. In the literature, several approaches study aspects of the anchoring problem from very diverse perspectives. Here I present the most notable clusters of work related to anchoring and to some aspects of this dissertation.

2.7.1 Knowledge Based approaches to Anchoring

An important challenge for cognitive systems is the establishment of a shared ontology, where symbols and concepts referring to objects are structured in a way that allows conceptual symbolic reasoning. There exist many approaches in the context of symbol grounding which consider the use of KR&R techniques or knowledge-bases in order to enable logical inference. Work by Bonarini et al. investigates a model to represent the knowledge of an agent, showing that the anchoring problem can be successfully dealt with using well-known AI techniques. They present a model which supports the instantiation of concepts, affected by uncertainty and heterogeneity from the perceptual system, in a multi-agent context [6, 7]. One of the earliest clusters of work in the context of knowledge-based approaches is led by Chella et al.
where they present a knowledge-based anchoring framework for building high-level conceptual representations from visual input, based on the notion of Gärdenfors’ conceptual spaces [58]. Their approach makes it possible to define conceptual semantics for the symbolic representations of the vision system, where the symbols can be grounded to sensor data. They further develop their approach to anchor conceptual representations in dynamic scenarios using situation calculus [27, 29]. In a collaboration with Coradeschi and Saffiotti [26], they explicitly investigate how to formalise a computational model for perceptual anchoring that is unified with Gärdenfors’ conceptual spaces theory, in the context of bridging the gap between the symbolic and subsymbolic components. Finally, Chella et al. present an extension of their cognitive architecture for learning by imitation, where a rich conceptual representation of the observed actions is built [28]. Another interesting knowledge-based approach for cognitive robots is the one from Shapiro and Ismail, called the GLAIR architecture (Grounded Layered Architecture with Integrated Reasoning), which consists of three levels. The knowledge level (KL) is the one in which conscious reasoning takes place [145]. The KL is implemented by the SNePS and SNeRE (SNePS Relational Engine) logic and knowledge representation and reasoning systems [83, 146], both based on Common Lisp. They evaluate their approach using the robot Cassie, which anchors the abstract symbolic terms that denote the agent’s mental entities in the domains of: a) perceivable entities and properties; b) actions; c) time; and d) language, in the lower-level architectures used by the embodied agent to operate in the real world. Lopes et al. describe a way to utilise the KR&R component for knowledge acquisition and information disambiguation [99]. Similarly, work by Melchert et al.
presents results using symbolic knowledge representation and reasoning for perceptual anchoring. They extend the anchoring framework from [32] using the LOOM KR&R system, to incorporate a subset of the DOLCE ontology regarding physical entities. In this way they manage the symbolic information while exploiting the acquired knowledge and inference mechanisms in order to recover from failures in the anchoring process. In a simulated scenario, their system communicates with a user to interactively resolve an ambiguous description using the knowledge-base [114, 115]. Modayil and Kuipers examine unsupervised learning approaches in order to bootstrap an ontology of objects to sensor input from a robot. Four learning stages are combined, in which an object is first individualised, then tracked and described (using shape models), and finally categorised [119]. Work by Mendoza et al. builds further on Harnad’s ‘solution’ [63] to the SGP by designing an architecture for object categorisation on the basis of features and context. They use simple iconic and categorical representations that are causally connected to the robot’s sensory sub-systems, thus providing an elementary grounding upon which Semantic Web technologies are applied in the domain of robot soccer [116]. Mozos et al. present an integrated approach for creating conceptual representations for spatial and functional properties of typical indoor environments using mobile robots [121]. Their multi-layered model represents maps at different levels of abstraction, using laser and vision sensors for place and object recognition. Their system is endowed with an OWL-based commonsense ontology of an indoor environment, which describes taxonomies (is-A relations) of room types and typical objects found therein, through has-A relations. Zender et al.
present another integrated instance of the system with functionalities such as perception of the world, natural language, learning and reasoning, thus integrating inter-disciplinary state-of-the-art components into a mobile robot system (CoSy Explorer)4 [174]. Their work is highly focused on cross-modal integration, ontology-based mediation and multiple levels of abstraction of perception. Tenorth and Beetz present a practical approach to robot knowledge representation and processing [154, 157]. KNOWROB is a first-order knowledge representation system, which is based on description logics and focuses on the automated acquisition of grounded (action-centred) concept representations through observation and experience. Their framework copes with inference under uncertainty in autonomous robot control. The robot collects experiences while executing actions and uses them to learn models and aspects of action-related concepts anchored in the robot’s perception and action system. They use the fraction of Cyc’s upper ontology which describes the relations needed for mobile manipulation tasks [156]. They further extended their approach to KNOWROB-MAP [158], a system for building environment models by combining spatial information about objects in the environment with semantic knowledge (perceptual, encyclopaedic & common-sense). Finally, a logic-based anchoring approach is presented by Fichtner and Thielscher, however from a theoretical standpoint. Their approach for addressing the anchoring problem is based on the Fluent Calculus, and they present preliminary results via an example dealing with multiple hypotheses for correspondences [52, 53].

4 Cognitive Systems for Cognitive Assistants - CoSy, http://www.cognitivesystems.org

2.7.2 Anchoring with commonsense information

Commonsense knowledge plays an important role in our everyday lives, and we humans apply it so many times that we do not even think of it explicitly.
Such knowledge encodes our default assumptions and expectations about possible experiences. Robots as well need to have default assumptions and expectations. Projects dealing with modelling commonsense in machines include between others, the Cyc project [93, 94] or OMCS (Open Mind Common Sense) [147] (see § 3.7). Early approaches to symbol grounding using commonsense information, used primarily simple robots and considered basic commonsense information from simple sensors [81, 141, 144]. However approaches in cognitive robotics, recently began to consider aspects of how to truly integrate or exploit large scale commonsense knowledge-bases, mainly due to the recent explosion of the Semantic Web technologies and ontology languages. Specifically the research behind this thesis can be considered to be among the first efforts which consider integrating a large-scale deterministic commonsense knowledge-base with a cognitive robotic system perceiving the physical world. Of the most notable related approaches is the one by Tenorth et al., which concerned household robots that use commonsense knowledge to accomplish everyday tasks. They present their integrated system which focuses on the generation of complex plans. They propose to transform task descriptions from web sites such as ehow.com, into executable robot plans by using methods for converting the instructions from natural language into a formal logic-based representation, while then resolving the word senses using the WordNet lexical database and a fraction of the Cyc upper ontology [155]. An extension of their framework considers adding collections of commonsense knowledge into the KB, in order to enable flexible inference over control decisions, under changing environmental conditions. Their system converts commonsense knowledge which is included in the OMCS [147] database, from natural language into a Description Logic (DL ) representation using the Web Ontology Language 35 CHAPTER 2. 
(OWL) [84]. Finally, they show how to ground the abstract task descriptions in the perception and action systems of the robot. This aspect was also studied in another instance, where they report a top-down guided, 3D-CAD model-based vision algorithm which is influenced by the commonsense knowledge obtained from the WWW, in order to ground objects (tableware and cutlery) in an assistive household environment [126]. Finally, their approaches have been integrated with KNOWROB, a robot knowledge processing system (see § 2.7.1). Another emerging approach which considers a commonsense ontology is presented by Lemaignan et al. in a knowledge processing framework for robotics called the OpenRobots Ontology kernel (ORO), which allows previously acquired symbols to be turned into concepts, linked to each other, thus enabling reasoning [92]. Knowledge in ORO is represented in a first-order logic formalism, as RDF triples (e.g. <robot isIn kitchen>) in OWL-DL, and the knowledge-base is implemented using the Jena semantic web framework5 in conjunction with the Pellet6 reasoner. The OpenRobots Common Sense ontology is closely aligned with the open-source OpenCyc7 upper ontology, defining classes and predicates focused on concepts useful for interaction with humans.

2.7.3 Cooperative Anchoring

So far, we have considered anchoring only in the context of single robotic systems. However, in cases where symbolic descriptions or perceptual data are distributed between multiple robotic agents with heterogeneous sensors, we speak of cooperative anchoring [86]. Besides extracting and exchanging object descriptions, cooperative robots may directly exchange representations based on perceptual data. The idea of cooperative anchoring itself stems from the philosophical roots of the social symbol grounding problem (see § 2.5.3).
Yet most of the related work in cooperative anchoring disregards the aspects of language development and focuses mostly on the technical aspects of distributed perception, where the different agents exchange anchors and perceptual data coming from their embedded sensors and from other robots (multi-robot systems). The most closely related cluster of work in cooperative anchoring comes from LeBlanc and Saffiotti, where the authors propose an anchoring framework for both single-robot and cooperative anchoring. They primarily consider the problem of fusing pieces of information from a distributed system to create the global notion of an anchor in a shared multi-dimensional domain, called the anchoring space. The framework represents information using a conceptual spaces [58] approach, allowing various types of object descriptions to be associated with uncertain and heterogeneous perceptual information. Their implementation uses fuzzy sets to represent, compare and combine information, and includes a cooperative object localisation method that takes into account uncertainty in both observations and self-localisation. Experiments using simulated and real robots are used to validate the proposed framework and the cooperative object localisation method [85–89].

5 Jena Semantic Web Framework, http://jena.apache.org/
6 Pellet: OWL 2 Reasoner for Java, http://clarkparsia.com/pellet/
7 OpenCyc KB, http://www.opencyc.org

Kira presents the problem of how heterogeneous robots with largely different capabilities can share experiences in order to speed up learning. His work focuses specifically on differences in sensing and perception, which can be used both for perceptual categorisation tasks and for determining actions based on environmental features.
He studied methods and representations which allow perceptually heterogeneous robots to a) represent concepts via grounded properties; b) learn grounded property representations such as colour or texture categories; and c) build models of their similarities and differences, in order to map their respective representations. He proposes an approach using Gärdenfors' conceptual spaces [58] representation, where object properties are learned and represented as Gaussian Mixture Models in a metric space. He then uses confusion matrices, obtained in a shared context and built using instances from each robot, in order to learn the mappings between the properties of each robot. These mappings are then used to transfer a concept from one robot to another, where the receiving robot has not been previously trained on the specific instances of objects. He finally shows that the abstraction of raw sensory data into an intermediate representation can be used not only to aid learning, but also to facilitate the transfer of knowledge between heterogeneous robots. Furthermore, he utilises statistical metrics to determine which underlying properties are to be shared between the robots. Using the methods described above, two heterogeneous robots with different sensors and representations are able to successfully transfer support vector machine (SVM) classifiers between each other, resulting in considerable speed-ups during learning [76–79]. Another instance of work is presented by Bonarini et al. [8], where the authors extend their solution for single-robot symbol grounding to a multi-agent approach, combining the information from different agents in a global representation at the conceptual level, using a fusion model based on clustering techniques. Mastrogiovanni et al. [110] also present a distributed knowledge representation and data fusion system for an ambient intelligence environment, which consists of several cognitive agents with different capabilities.
The architecture itself is based on the idea of an ecosystem of interacting artificial entities, where the framework for collaborating agents allows them to perform intelligent multi-sensor data fusion. Despite its simplicity, it is able to manage heterogeneous information at different levels, thus "closing the loop" between sensors and actuators.

2.7.4 Anchoring for Human-Robot Interaction

Human-robot interaction (HRI) is a recent and growing field that studies how humans interact with robots and ways of making this interaction more effective [54, 131]. An emerging trend in the context of anchoring is to study the problem from the perspective of the interaction that occurs, in order to facilitate communication with humans. Anchoring is well suited for HRI applications, especially when humans communicate with robots using grounded symbolic information. For example, a dialogue system for human-robot collaboration in which the dialogues concern physical objects can be thought of as an instance of the anchoring problem. Chella et al. present a system for advanced verbal interaction between humans and artificial agents, with the aim of learning a simple language in which words and their meanings are grounded in the sensory-motor experiences of the agent. The system learns grounded language models from examples, with minimal user intervention and without feedback. It has then been used to understand, and subsequently to generate, appropriate natural language descriptions of real objects, engaging in verbal interactions with a human partner [30]. Lemaignan et al. present their work on grounded verbal interaction [90]. They propose a knowledge-oriented architecture, where perceptions from different points of view (of the robot itself or the human) are turned into symbolic facts, stored in different cognitive models.
The framework includes a component for natural language interpretation that relies on these structured symbolic models of the world [91]. They also present a set of strategies that allow a robot to identify the referent when the human partner refers to an object with incomplete information (i.e. an ambiguous description) [133]. Another example of an implemented dialogue system is explored in [82, 106], in the context of situated dialogue processing. This approach couples incremental processing with the notion of bidirectional connectivity, inspired by how humans process visually situated language in both top-down and bottom-up fashion. Information about the object state, as well as a history of the object state, is used to describe changes in a scene. Furthermore, complementary research presented in [97] adds a framework for constructing rich belief models of the robot's environment, using Markov Logic as a unified framework for inference over these beliefs. The approach is integrated into an implementation of a distributed cognitive architecture for mobile robots interacting with humans using spoken dialogue. The constructed belief models evolve dynamically over time and incorporate various contextual information, such as spatio-temporal framing, multi-agent epistemic status and saliency measures. Two aspects important for HRI concern the generation of referring expressions and spatial relations. The architecture of Kruijff and Brenner mentioned above has also been used by Zender et al., who present an approach for generating and resolving referring expressions (REs) in conversational mobile robots, which identifies and distinguishes spatial entities in a large-scale space (e.g., an office environment). Their approach is based on a spatial knowledge-base encompassing both robot- and human-centric representations.
An important feature of this approach is that it considers descriptions that contain spatial relations among the objects [175–177]. Spatial relations are very important when describing and sharing information about objects. Melchert et al. present a framework for computing the spatial relations between anchors in the context of HRI [114]. In this work they extend an anchoring framework with a set of binary spatial relations, which were used not only to provide meaningful object descriptions but also to facilitate human participation in the anchoring process, using human interaction to disambiguate between visually similar objects (i.e. anchors). Another aspect of HRI that has been explored in the context of anchoring is person tracking. Kleinehagenbrock et al. present a hybrid approach integrating vision with distance information in order to track a human [80]. They use laser range data to extract the legs of a person, and camera images of the person's upper body to extract skin-coloured faces. They combine the resulting percepts, which originate from the same person, into their symbolic counterparts using the standard anchoring processes as defined in [38].

2.7.5 Other approaches to Anchoring

Work by Karlsson et al. studied anchoring from the perspective of a robot monitoring the execution of its plans and detecting failures [15, 73]. In their work they present a schema which responds to failures occurring before or while an action is executed. The failures they address are ambiguous situations which arise when the robot attempts to anchor symbolic descriptions (relevant to a plan action) in perceptual information. They present an extension of the original anchoring framework that deals robustly with ambiguous situations, by providing general methods for detection and recovery using planning-based methods. Loutfi et al.
discuss the problem of maintaining coherent perceptual information in a mobile robotic system which operates over extended periods of time. Their system interacts with a human and uses multiple sensing modalities in order to gather information about the environment or specific objects [100]. Their anchoring extension relies on a challenging modality, olfaction, and is capable of tracking and acquiring information both from observations derived from sensor data and from a priori symbolic concepts. In the same context they present how they integrate an electronic nose in a multi-sensing mobile robot [101]. Another aspect of anchoring concerns semantic mapping, where Galindo et al. apply a multi-hierarchical approach that enables a mobile robot to acquire spatial information from its sensors and link it to semantic information via anchoring [57]. The spatial component of the map is used to plan and execute robot navigation tasks, while the semantic component enables the robot to perform symbolic reasoning. Anchoring is also addressed in a number of works which approach the problem from a different perspective. For instance, Heintz et al. present a stream-based hierarchical anchoring framework extending their DyKnow knowledge processing middleware, which is used to process perceptual information. They make this component available to the other cognitive layers, and further extend their approach to support reasoning in the context of successfully performing complex missions with unmanned aerial vehicles [68, 69].

2.7.6 Summary of Related Approaches

Comparing the related clusters of work, we primarily see that even though there is much heterogeneity in the domains that study or relate to anchoring, only a few approaches follow an integrative stance when grounding symbols to sensor data in cognitive systems.
Work in the cooperative anchoring domain leans toward low-level sensing and grounding, thus emphasising uncertainty and multi-modal aspects. However, it tends to miss the high-level symbolic aspects related to knowledge representation and reasoning [78, 88], with the exception of Mastrogiovanni et al. and their distributed data fusion approach [110], which explicitly considers symbolic knowledge representation and reasoning. We observe that integrative approaches that consider anchoring, knowledge grounding and symbolic reasoning may rely on diverse representations of knowledge which vary according to the application. Gärdenfors' conceptual spaces representations are generally favoured in the context of vision and anchoring [25–28, 86, 88]. On the symbolic side, description logics are routinely used as the logical formalism behind the representation of knowledge [92, 154, 177]. Even though some knowledge-based models intended for cognitive robots introduce commonsense information in specific contexts, such as everyday tasks [155], important aspects like multi-modal integration with cross-modal representations, memory and grounding are not examined in detail. An exception is the ORO knowledge management platform [92], which explicitly deals with grounding linguistic interaction. In the context of linguistic interaction, anchoring approaches primarily study the generation of referring expressions and spatial relations [114, 175–177]. Finally, approaches tend to focus on logical representations rather than on the practical perceptual aspects behind grounding and anchoring. However, with the advancement of semantic web technologies and on-line commonsense knowledge repositories, we can expect future knowledge-enabled robots to eventually communicate using grounded representations.
3 Methods

3.1 Introduction

In this chapter we review the methods used in the different studies of the thesis, where the goal was to specify, design and implement a complete anchoring framework capable of addressing the percept-to-concept correspondence in an artificial cognitive agent operating in the physical world. Typical components of such a system concern the sensor processing and perceptual algorithms behind the perceptual system, an appropriate knowledge representation formalism for the semantic system, and finally a novel augmentation of the anchoring framework capable of supporting the integration of the two heterogeneous systems. In this chapter, I first briefly introduce the initial anchoring model which was used as a template for the augmented anchoring framework presented in the different studies (Papers I-IV). I then consider the sensors and data processing parts, while reviewing some of the relevant state-of-the-art object detection and classification methods used in this thesis (Papers II-IV). Subsequently, I briefly outline the principles behind the localisation algorithm used by the mobile robot (Papers II & III). The semantic knowledge is represented via description logics (in Papers III & IV) and I dedicate a section to it. Finally, I describe, compare and contrast some widely known commonsense knowledge-base systems, to complement the understanding of the corresponding scientific papers (Papers I-IV) collected in the thesis.

3.2 Anchoring Model

The present thesis extends the computational model of anchoring presented in [32]. In this section we summarise the initial anchoring framework and its basic functionalities from [32]. The following definitions allow us to define objects in terms of their (symbolic and perceptual) properties. We consider an environment E in which there is a set O of x physical objects, denoted O = {o1, ..., ox}, where the value of x is unknown.
The set of objects O can change over time, since objects can be inserted into or removed from the environment. There are also n > 0 agents, denoted ag = {ag1, ..., agn}. If n = 1 the problem reduces to single-robot anchoring, while for n > 1 we deal with the cooperative anchoring problem. Each agent ag in the environment E has a perceptual anchoring system Aag.

3.2.1 Components

A typical anchoring system (Aag) as defined in [32] is composed of a perceptual system, a predicate grounding relation and a symbol system. Anchoring focuses on the problem of creating and maintaining the correspondences between the symbol and perceptual systems using the grounding relations. Below is a formal definition of these elements, which allow us to characterise an object ox in terms of its symbolic and perceptual properties.

A perceptual system Ξ includes a set Π = {π1, π2, ...} of percepts, a set Φ = {φ1, φ2, ...} of attributes, and perceptual routines. A percept π is a structured collection of measurements assumed to originate from the same physical object; an attribute φ is a measurable property of a percept, with values in the domain D(φ). We let:

D(Φ) =def ⋃_{φ∈Φ} D(φ). (3.1)

The perceptual system continuously generates percepts and associates each percept with the observed values of a set of measurable attributes.

Definition 1. A perceptual signature γ : Φ → D(Φ) is a partial function from attributes to attribute values. The set of attributes on which γ is defined is denoted by feat(γ).

A symbol system Σ includes a set X = {x1, x2, ...} of individual symbols (variables and constants) and a set P = {p1, p2, ...} of predicate symbols, used to denote objects in the world. The symbol system assigns unary predicates to symbols which denote objects.

Definition 2. A symbolic description σ ∈ 2^P is a set of unary predicates.
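To make Definitions 1 and 2 concrete, the following minimal sketch (illustrative only, not the thesis implementation; all attribute and predicate names are hypothetical examples) models a perceptual signature as a partial function from attributes to values, and a symbolic description as a set of unary predicates:

```python
# Illustrative sketch of Definitions 1 and 2; the attribute and
# predicate names ("hue", "red", ...) are hypothetical examples.

# A perceptual signature gamma: a partial function Phi -> D(Phi),
# modelled as a dict defined only on the measured attributes.
gamma = {
    "hue": 0.12,            # value in D(hue)
    "width": 0.25,          # metric width in metres
    "position": (1.4, 2.0), # 2D position in the map frame
}

def feat(signature):
    """feat(gamma): the set of attributes on which gamma is defined."""
    return set(signature)

# A symbolic description sigma: a set of unary predicates that are
# considered relevant to the perceptual identification of an object.
sigma = {"red", "small", "cup"}

print(feat(gamma))
```

The dict's key set plays the role of feat(γ): an attribute outside it is simply undefined for this percept, mirroring the partiality of the function.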
Intuitively, a symbolic description lists the predicates that are considered relevant to the perceptual identification of an object, and a perceptual signature gives the values of the measured attributes of a percept. A predicate grounding relation g ⊆ P × Φ × D(Φ) embodies the correspondence between unary predicates and values of measurable attributes. The g relation can be used to match a symbolic description σ and a perceptual signature γ as follows.

match(σ, γ) ⇔ ∀p ∈ σ. ∃φ ∈ feat(γ). g(p, φ, γ(φ)) (3.2)

Here it is important to mention that the grounding relation g concerns properties, whereas anchoring concerns objects. The correspondence between symbols and percepts for a specific object ox in the environment E is reified in an internal data structure called an anchor, α(ox, t). The anchoring process is responsible for creating and maintaining anchors, and since new percepts are generated continuously within the perceptual system, anchors are indexed by time. At every moment t, α(ox, t) contains: a unique symbol meant to denote the object ox; the perceptual signature γ; and the symbolic description σ of an object.

α(t) = ⟨x, π, γ⟩ ∈ X × Π × (Φ → D(Φ)) (3.3)

We denote the components of an anchor α(ox, t) by α^sym_{ox,t}, α^per_{ox,t} and α^val_{ox,t} respectively. If the object is not observed at time t, then α^per_{ox,t} is the 'null' percept ∅ (by convention, ∀t : ∅ ∉ Vt), while α^val_{ox,t} still contains the best available estimate. In order for an anchor to satisfy its intended meaning, the symbol and the percept in it should refer to the same physical object. This requirement cannot be formally stated inside the system. What can be stated is the following (recall that Vt is the set of percepts which are perceived at t).

Definition 3. An anchor α(ox, t) is grounded at time t iff both α^per_{ox,t} ∈ Vt and match(∆t(α^sym_{ox,t}), St(α^per_{ox,t})).

We informally say that an anchor α is referentially correct if, whenever α is grounded at t, the physical object ox denoted by α^sym_{ox,t} is the same as the one that generates the percept α^per_{ox,t}. The anchoring problem, then, is the problem of finding referentially correct anchors. The anchors are stored in the anchoring space.

Definition 4. An anchoring space A is a multi-modal space whose dimensions are the qualities of interest in the application domain.

3.2.2 Functionalities

Manipulation of anchors, according to [32], is done mainly through three abstract functionalities which: a) create a grounded anchor the first time the object denoted by x is perceived; b) continuously update the anchor while observing the object; and c) update the anchor when we need to reacquire the object after some time during which it has not been observed. t denotes the time at which the functionality is called.

Algorithm 1: Acquire
Input: symbol x
begin
  α ← anchor for x;
  γ ← Predict(α^val_{t−k}, x, t);
  π ← Select{π′ ∈ Vt | Verify(St(π′), ∆t(x), γ)};
  if π ≠ ∅ then γ ← Update(γ, St(π), x);
  α(t) ← ⟨x, π, γ⟩;
  return α(t)

(Re)Acquire establishes the symbol-percept association for an object o which has not been observed for some time. It takes as input a symbol x with an anchor α defined for t − k and extends α's definition to t. It first predicts a new signature γ; it then checks whether there is a new percept that is compatible with both the prediction and the symbolic description; if so, γ is updated. Prediction, verification of compatibility, and updating are domain dependent; verification should typically use match.

Algorithm 2: Track
Input: α(t − k)
begin
  x ← α^sym_{t−1};
  γ ← OneStepPredict(α^val_{t−1}, x);
  π ← Select{π′ ∈ Vt | Verify(St(π′), ∆t(x), γ)};
  if π ≠ ∅ then γ ← Update(γ, St(π), x);
  α(t) ← ⟨x, π, γ⟩;
  return α(t)

Track updates an existing anchor by incorporating new perceptual information.
The track functionality takes as input an anchor α(ox, t) defined for t − k and extends its definition to t. Track ensures that the percept pointed to by the anchor is the most recent and adequate perceptual representation of the object. Signatures can be updated as well as replaced, but by preserving the anchor structure we affirm the persistence of the object, so that the anchor can be used even when the object is out of view. This facilitates the maintenance of information while the agent is moving, as well as a longer-term and stable representation of the world on a symbolic level that is not disturbed by perceptual glitches.

Algorithm 3: Find
Input: symbol x
begin
  π ← Select{π′ ∈ Vt | match(∆t(x), St(π′))};
  if π = ∅ then fail;
  else α(t) ← ⟨x, π, St(π)⟩;
  return α(t)

Find accepts as input a symbol x and a symbolic description and returns a grounded anchor α(ox, t) defined at t. It checks whether existing anchors that have already been created by Acquire satisfy the symbolic description and, in that case, selects one. Otherwise, it performs a similar check against existing percepts (in case the description does not satisfy the constraints on percepts considered by Acquire).

3.2.3 Relation to Publications

The anchoring framework described in this section was initially implemented for the studies behind Papers I, II and V. Apart from the implementation, Papers II and V deal with: a) the formation of percepts from the visual and spatial modalities; b) the definition of the grounding relations used in the linguistic interaction scenarios; and c) the extension of the framework to support the integration with the large-scale knowledge-base. The knowledge management, synchronisation and memory components also complement the implemented framework.
In Paper III the initial implementation was revised to account for the cooperative aspect, borrowing concepts from the cooperative anchoring framework of LeBlanc [88], although from a semantic point of view. This iteration extended the anchoring framework to account for bottom-up and top-down information acquisition and addressed the integration of a perceptual-semantic knowledge-base between different perceptual agents. Furthermore, the structure of the anchor was defined, especially in the context of processing multi-modal perceptual information. In Paper IV the framework was extended to define concept anchors, which were used as intermediate structures between different perceptual systems within a single agent. In this context the conceptual anchoring framework defined the multi-modal part of the anchor in more detail, and introduced the perceptual and semantic distances used in modelling the perceptual and semantic descriptions respectively. Finally, the modelling of the anchor indiscernibility, tolerance and nearness relations led to the anchor nearness measure.

3.3 Sensors

Sensors are an essential part of every robotic system, allowing it to acquire information from the environment, whether the robot is trying to localise itself, navigate, recognise objects or understand a scene. One central requirement in modelling perception in cognitive robots is to be able to identify high-level features, such as objects and doors, in perceptual data. These high-level features can be acquired using a variety of sensing devices. Sensors can be classified as proprioceptive or exteroceptive, and as active or passive. Proprioceptive sensors take internal measurements, while exteroceptive sensors take measurements of the robot's environment. These measurements can further be used to extract features that are representative of the surrounding environment.
In this thesis, the following two main sensors were used: a laser range-finder (active sensor) and a camera (passive sensor).

3.3.1 Laser Scanner

A popular sensor among robotics researchers, the Laser Range Scanner (LRS) is well suited for making distance measurements on a two-dimensional plane (or in three dimensions in modern devices). A single laser scanner takes measurements in an arc rotating around the sensor centre. These measurements are used for navigation, obstacle avoidance, localisation and the detection of free paths to reach a goal point. The advantage of this sensor is that it makes distance measurements with high accuracy, with errors in the range of centimetres. The range is limited to tens of metres. The reflectance of the surface at the point where the laser beam hits the target not only affects the effective range, but also introduces measurement error, giving a windowing effect. The laser beam sweeps the plane in an arc and makes distance measurements at discrete angles. The resolution of the bearing measurements depends on how many measurements occur in one sweep. The laser scanner used in this thesis is a SICK LMS-200, interfaced using a high-speed USB-serial connection. This allows a scan rate of up to 72 scans per second with 181 measurements per scan. The range used is 8 m, at which the absolute range accuracy is about 1 cm. For the experiments in this thesis the laser scanner is mounted approximately 20 cm above the ground, on the base of the robot.

3.3.2 Camera

Digital cameras are the most common vision-based sensors mounted on mobile robots. Common camera technologies include CCD and CMOS sensors, which have seen widespread use since the earliest experiments in robotics. Usually cameras mounted on a mobile robot work with light in the visible spectrum, although there are cameras that can work with other portions of the electromagnetic
spectrum (infra-red).

Figure 3.1: The pinhole model (left). The robot and sensors used in the thesis (right).

A simple model of what happens when a light ray strikes a camera is the pinhole camera model [67] (see Fig. 3.1). In this model, light from a scene point passes through a single point (e.g., a small aperture) and projects an inverted image on a plane called the image plane. In the pinhole camera model, the small aperture is also the origin C of a 3D coordinate system whose Z axis lies along the optical axis (i.e. the viewing direction of the camera). The sensors used for the experiments in the thesis are: the Sony PTZ robotic camera embedded in an ActivMedia PeopleBot platform, and two Logitech QuickCam Vision Pro cameras, one mounted on the mobile robot and one on the ceiling (Fig. 3.1).

3.4 Mobile Robot Localisation

The localisation of the robot is important in order to establish the positions of the anchored objects. We use an implementation of the Adaptive Monte-Carlo Localisation (AMCL)1 algorithm described by Fox et al. [55].

1 Implemented in Player/Stage (http://playerstage.sourceforge.net)

Essentially it is an implementation of the particle filter applied to robot localisation, and it has become very popular in the robotics literature. The Monte-Carlo method can be used to determine the position of a robot given a map of its environment. At a conceptual level, the algorithm maintains a probability distribution over the set of all possible robot poses and updates this distribution using data from odometry, sonar and/or laser range-finders. A large number of hypothetical current configurations are initially randomly scattered in the configuration space. With each sensor update, the probability that each hypothetical configuration
is correct is updated based on a statistical model of the sensors and Bayes' theorem. Similarly, every motion the robot undergoes is applied in a statistical sense to the hypothetical configurations, based on a statistical motion model. When the probability of a hypothetical configuration becomes very low, it is replaced with a new random configuration. At the implementation level, the method represents the probability distribution using a particle filter. The filter is "adaptive" because it dynamically adjusts the number of particles: a) when the robot's pose is highly uncertain, the number of particles is increased; and b) when the robot's pose is well determined, the number of particles is decreased. The driver is therefore able to trade off processing speed against localisation accuracy. As mentioned earlier, the driver also requires a pre-defined map of the environment against which to compare observed sensor values, as it is the only item that indicates to the robot how the surroundings should look if it had perfect perception abilities. A location-based metric map describes the entire operating environment by defining all the occupied and non-occupied spaces; an example of such a map is an occupancy grid. The map is used to estimate the localisation state when sensor measurements are available. Range measurements to obstacles are sensor representations that may work well with a location-based map. Fig. 3.2 shows the (semantic) map provided for the experiments, as well as the actual output of the AMCL algorithm.

Figure 3.2: The semantic map of the PEIS-Home used in the experiments (left) is a combination of the metric map of the environment together with segmented spatial regions (Kitchen, Living room, Bedroom). Examples of the AMCL algorithm output from the mobile robot in the PEIS-Home.
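The update cycle described above can be sketched as follows. This is a schematic toy example, not the Player/Stage AMCL driver: the sensor model is supplied by the caller, the noise parameter is a placeholder, and the adaptive particle-count adjustment is omitted.

```python
# Schematic sketch of one Monte-Carlo localisation cycle (a toy
# stand-in, not the Player/Stage AMCL driver). The sensor model is
# supplied by the caller; the noise parameter is a placeholder.
import math
import random

def mcl_step(particles, odom, measurement, sensor_model, motion_noise=0.05):
    """particles: list of (x, y, theta, weight) pose hypotheses.
    Applies the motion model, reweights with the sensor model, and
    resamples. Returns a new particle set of the same size."""
    dx, dy, dtheta = odom
    moved = []
    for (x, y, th, _w) in particles:
        # Motion update: apply odometry plus sampled noise (statistical
        # motion model) to every hypothetical configuration.
        x += dx + random.gauss(0.0, motion_noise)
        y += dy + random.gauss(0.0, motion_noise)
        th += dtheta + random.gauss(0.0, motion_noise)
        # Sensor update: reweight each hypothesis by the likelihood of
        # the observation (Bayes' theorem via importance weights).
        moved.append((x, y, th, sensor_model(x, y, th, measurement)))
    # Resample in proportion to the normalised weights.
    total = sum(p[3] for p in moved) or 1.0
    weights = [p[3] / total for p in moved]
    resampled = random.choices(moved, weights=weights, k=len(moved))
    return [(x, y, th, 1.0) for (x, y, th, _) in resampled]
```

An adaptive variant would additionally grow or shrink the particle set between calls, depending on how uncertain the current pose estimate is.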
3.5 Vision

3.5.1 Image Representation

A typical vision algorithm requires that the input images are represented in a way that allows the algorithm to find similarities and differences between them. This problem can be summarised in two constraints. First, we are bound to reason in terms of proximity (or equivalently, distance), not equality between images. In other words, we should not try to determine whether visual objects are equal, but whether they are close to each other in a certain (yet unknown) space. Second, there is an imperative need to drastically reduce the dimensionality of images by dismissing all the non-relevant or redundant pieces of information they contain. One common method to achieve these two goals is to consider finite-dimensional representations of sub-images, or image regions. The computer vision community has in recent years turned strongly towards image features, also called visual features or points of interest. Loosely defined, a feature is a point of interest surrounded by a region, selected by a region detector. The region detector detects an "interesting" part which can be identified multiple times. The study of visual features focuses on two main aspects: i) the detection, or sampling, of finite sets of points that are relevant to the image (interest point detection); and ii) the description of the visual neighbourhood of these points as an accumulation of certain local visual characteristics into a finite-dimensional vector (feature description). For good performance in visual recognition, it is essential that the features are robust. The most important quality criteria for descriptors are a compact representation and high precision and recall when matching descriptors against a database of images (i.e. descriptiveness). It is also important that the detector is able to detect the same features in similar images (i.e. high repeatability).
To ensure that features are repeatable for a particular class of images, some combination of the following properties is required: a) rotation invariance; b) scale invariance; c) perspective / affine invariance; d) illumination invariance; and e) robustness to motion blur and sensor noise (refer to the challenges mentioned in § 2.2.1.1 & 2.2.1.2).

Interest point detection A good interest point detector has to locate points that can be detected repeatedly even if the original image is modified or the same scene is shown under varying conditions. Interest point detectors fall broadly into two categories. Edge and corner detectors are the early interest point algorithms, which only detected points of interest (e.g., the Harris corner detector [65]). Newer methods tend to determine regions of interest that fulfil certain invariance properties (e.g., Difference-of-Gaussians (DoG) [103]), and are hence called affine detectors.

Figure 3.3: Schema of the Difference-of-Gaussians computation, with s + 3 images per octave separated by a scale factor k = 2^(1/s), each octave down-sampled by a factor of two (left). The SIFT algorithm divides the local patch into sub-regions with image gradients and histogram construction (right). Figures reconstructed from [104].

Feature description Feature descriptors are used to describe the detected regions of interest. The majority of recent successful object recognition algorithms make use of interest regions to select relevant and informative features of the objects to learn. The "support" region for these feature descriptors is defined by the output of the interest point / region detectors described above. Extracting and describing these features is a computationally demanding task that can be prohibitive for applications that require close to real-time performance (e.g., robot vision).
Fortunately, work is being devoted to faster methods and to implementations of algorithms that run on special hardware. For example, Heymann et al. [70] designed a version of the SIFT feature detector and descriptor [104] that runs at approx. 20 frames per second on a Graphics Processing Unit (GPU) at 640 × 480 pixel resolution.

3.5.1.1 Scale Invariant Feature Transform

The Scale Invariant Feature Transform, or simply SIFT, introduced by Lowe, is the de-facto standard method for a) interest point detection; b) feature description; and c) matching, because of its robustness to small displacements, lighting changes and orientation changes [104]. Originally SIFT was based on a model proposed by Edelman et al. [49], who found that complex neurons in the primary visual cortex respond to a gradient at a particular orientation and spatial frequency [49]. To mimic this biological model, Lowe used the assumption that nearest-neighbour, correlation-based similarity computation in the space of outputs of complex-type receptive fields can support robust recognition of 3D objects. However, Lowe's implementation of SIFT has a different computational model, which uses a scale-invariant region detector based on the DoG together with the SIFT descriptor, in order to provide a similarity-invariant feature detector. This involves convolving the initial image with a Gaussian at several scales, to create images separated by a scale factor k, thus creating the so-called scale-space pyramid of convolved images (as seen in Fig. 3.3). These images are gathered in octaves. Each octave represents a doubling of the standard deviation σ. The number of images in each octave is s + 3, where s is an arbitrary number of intervals per octave. The scale factor is computed as k = 2^(1/s).
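For concreteness, the blur schedule implied by these definitions (s + 3 images per octave, consecutive scales separated by k = 2^(1/s), each octave doubling σ) can be computed as below. The base value σ₀ = 1.6 is a conventional choice for illustration, not a value specified in the text.

```python
def scale_space_sigmas(sigma0=1.6, s=3, octaves=3):
    """Blur levels of the scale-space pyramid: s + 3 images per octave,
    consecutive scales separated by k = 2**(1/s), octaves doubling sigma."""
    k = 2 ** (1.0 / s)
    pyramid = []
    for o in range(octaves):
        base = sigma0 * (2 ** o)                 # each octave doubles sigma
        pyramid.append([base * (k ** i) for i in range(s + 3)])
    return pyramid

pyr = scale_space_sigmas()
# pyr[0] has s + 3 = 6 entries; the (s+1)-th scale of octave 0 coincides
# with the first scale of octave 1, since k**s == 2.
```

This overlap between consecutive octaves is what allows the down-sampled octave to continue the scale space seamlessly.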
Let the difference-of-Gaussian function convolved with the image be:

DoG(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y) = L(x, y, kσ) − L(x, y, σ) (3.4)

where DoG(x, y, σ) is the difference, G(x, y, σ) is the Gaussian, I is the image, and L(x, y, σ) is the image I convolved with the Gaussian G. The DoG is known to provide a good approximation of the Laplacian of Gaussian (LoG) operator (LoG(x, y, σ) ≜ ∆G_σ(x, y)), which has been shown to detect stable image features [118]. Interest points are now detected by selecting points in the image which are stable across scales. A candidate point in DoG(x, y, σ) is compared to its eight neighbours in the current image and to the nine neighbours in each of the scales above and below. It is selected as a feature only if it has the largest or smallest value among these 27 points. The selected local extrema are then the interest points. For the descriptor, a region is defined around each interest point and divided into orientation histograms over 4 × 4 pixel neighbourhoods. The most dominant orientations are determined by creating a radial histogram of gradients in a circular neighbourhood around the detected point. The maxima of this histogram determine the orientation of the point and thus enable rotation invariance. The orientation histograms are expressed relative to the key-point orientation. Each histogram contains 8 bins, and each descriptor contains a 4 × 4 array of 16 histograms around the key-point. These sixteen histograms of eight bins each yield a vector of 128 dimensions (4 × 4 × 8 = 128 elements) (see Fig. 3.3). This vector is normalised to unit length, so as to enhance invariance to changes in illumination.
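The 27-point extremum test can be sketched directly. The function below takes three adjacent DoG layers as nested lists (an assumed representation, for illustration only) and returns the strict extrema of the middle layer.

```python
def dog_extrema(below, current, above):
    """(row, col) points of `current` that are strictly the largest or
    smallest of the 27 values in their 3x3x3 scale-space neighbourhood."""
    h, w = len(current), len(current[0])
    points = []
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            v = current[r][c]
            cube = [layer[r + dr][c + dc]
                    for layer in (below, current, above)
                    for dr in (-1, 0, 1)
                    for dc in (-1, 0, 1)]
            # v itself is one of the 27 values; require a unique extremum
            if (v == max(cube) or v == min(cube)) and cube.count(v) == 1:
                points.append((r, c))
    return points

zeros = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
peak = [[0, 0, 0], [0, 5, 0], [0, 0, 0]]
```

On the toy layers above, only the centre pixel of `peak` survives the test. The full SIFT detector additionally refines and filters these candidates (sub-pixel interpolation, edge-response rejection), which is omitted here.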
Recently, Mikolajczyk and Schmid presented a review evaluating the performance of more than ten different descriptors for image classification under affine transformations, rotation, scale changes, JPEG compression, illumination changes and blur [117]. The authors conclude that SIFT descriptors performed best, followed by their variant, the Gradient Location and Orientation Histogram (GLOH).

Figure 3.4: Comparison of speed (Hz) vs image resolution for SIFT computation with GLSL and CUDA, with the initial octave set to -1 and 0 [173]. Resolutions range from 320x240 to 2048x1536; at 640x480 with initial octave 0, the GLSL and CUDA versions run at approx. 23.0 and 27.1 Hz respectively.

The main advantages of SIFT descriptors are that they are built from simple linear Gaussian derivatives (i.e. more stable) and contain more components in the feature vector (128), a potentially more discriminative representation. One disadvantage, however, is their high dimensionality. One way of reducing the dimensionality is to apply Principal Component Analysis (PCA) to the raw 128-dimensional SIFT vector (PCA-SIFT [75]). Such dimensionality reduction is not applied in this thesis, however. The implementation used in this thesis follows the original model by Lowe (Papers II, III, IV and V). We also used the open source SiftGPU² library [173] (Papers III & IV). SiftGPU is implemented in C and uses either the GL Shading Language or CUDA to process pixels in parallel on the GPU, so as to build Gaussian pyramids and detect DoG key-points. Figure 3.4 shows a comparison of the computation time needed by the SiftGPU implementation at different resolutions.
3.5.1.2 Histogram of Oriented Gradients

The Histogram of Oriented Gradients (HOG) descriptor was first introduced by Dalal and Triggs in their CVPR'05 paper [41] for the problem of pedestrian detection in static images. Since then, it has been widely used in the context of object detection. The idea behind HOG descriptors is that local shape and appearance within an image can be described by counting the occurrences of intensity gradient orientations (or edge directions) in localised portions of the image.

² SiftGPU: A GPU Implementation of Scale Invariant Feature Transform (SIFT) – http://cs.unc.edu/~ccwu/siftgpu

Figure 3.5: The steps of computing the HOG descriptor on an example image: normalise gamma and compute gradients, build 8-orientation cell histograms, normalise over overlapping blocks, and concatenate into the feature vector f.

The computation of HOG feature descriptors can be summarised in four steps, as depicted in Fig. 3.5. The first step is the computation of the gradient values. The most common method is to simply apply the 1D centred discrete derivative masks, in either one or both of the horizontal and vertical directions. This method filters the colour or intensity data of the image with the following kernels: i) [−1, 0, 1]; and ii) [−1, 0, 1]ᵀ. The second step involves creating the cell histograms. Each pixel within a cell casts a weighted vote for an orientation-based histogram channel, based on the values found in the gradient computation. The cells themselves can be either rectangular or radial in shape, and the histogram channels are evenly spread over 0° to 180° or 0° to 360°, depending on whether the gradient is "unsigned" or "signed".
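The first two steps (gradient computation with the [−1, 0, 1] masks and orientation voting within a cell) can be sketched as follows; the cell size, bin count and the toy edge patch are illustrative choices, not values fixed by the method.

```python
import math

def cell_histogram(patch, bins=8, signed=False):
    """Orientation histogram for one cell: gradients from the [-1, 0, 1]
    derivative masks, each pixel voting with its gradient magnitude."""
    span = 360.0 if signed else 180.0    # "signed" vs "unsigned" gradient
    hist = [0.0] * bins
    h, w = len(patch), len(patch[0])
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = patch[r][c + 1] - patch[r][c - 1]
            gy = patch[r + 1][c] - patch[r - 1][c]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % span
            hist[int(angle / span * bins) % bins] += magnitude
    return hist

# A vertical step edge: all gradient energy falls into the first bin.
edge = [[0, 0, 10, 10] for _ in range(4)]
```

The remaining steps (block grouping, contrast normalisation and concatenation into the final vector f) operate on collections of such cell histograms.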
In the third step, in order to account for changes in illumination and contrast, the gradient strengths are locally normalised, which requires grouping the cells into larger, spatially connected blocks. The authors note that this local contrast normalisation with overlapping descriptor blocks is crucial for good results. The HOG descriptor is then the vector of the components of the normalised cell histograms from all of the block regions. These blocks typically overlap, meaning that each cell contributes more than once to the final descriptor. The final step in object recognition using HOG descriptors is to feed the descriptors into a recognition system based on supervised learning. The Support Vector Machine classifier (SVM, see § 3.5.3.1) [164] is the standard binary classifier for deciding on the presence of an object. In our implementation we used the freely available SVMLight³ software package [72] in conjunction with HOG descriptors to train object models. The HOG descriptor can be seen as a dense version of the SIFT descriptor, and HOG and its variants routinely deliver state-of-the-art performance for object detection and image classification.

³ SVMlight Support Vector Machine – http://svmlight.joachims.org

Figure 3.6: The typical components of a bag-of-visual-words image classification pipeline: local feature detection and description, dictionary (codebook) generation, histogram generation over codewords, and classification (Ψ) into labels such as [cup, coffee, book].

3.5.1.3 Bag-Of-Words

The bag of visual words model has been commonly used in recent years for object categorisation, classification and image retrieval. It directly relates to the bag-of-words (BoW) model originally used in text retrieval [66], a popular method for representing documents while ignoring word order.
In the computer vision community the BoW model was introduced by Sivic and Zisserman [148], who used "term frequency-inverse document frequency" (tf-idf) weighting on the visual-word counts produced by interest point detection and SIFT features, to retrieve frames in videos containing a query object. Tf-idf is a standard text retrieval tool used to compute the similarity of text documents. Representing an image with the BoW model usually involves three steps: feature detection, feature description and codebook generation. The BoW model can be defined as a "histogram representation based on independent features" [51]. Given an image, feature detection extracts several local patches (or regions), which are considered candidates for basic elements (e.g., "words"). This can be done either with a regular grid, where the image is evenly segmented by horizontal and vertical lines, or with traditional interest point detectors that find salient patches such as edges or corners. Well-known detectors include the Harris affine region detector [65] and Lowe's DoG detector [104]. After feature detection, each image is abstracted by several local patches, which are represented as numerical vectors (feature descriptors). A good descriptor that handles intensity, rotation, scale and affine variations to some extent is SIFT [104]. After this step, each image is a collection of vectors of the same dimension (128 for SIFT), where the order of the vectors is of no importance. Eventually each interest point is represented by an ID which indexes the vector into a visual codebook, or visual vocabulary (see Fig. 3.6). An image is then modelled as a bag of these so-called visual words and is described by a vector h which stores the distribution of all assigned codebook IDs. Note that this discards the spatial distribution of the image features.
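The mapping from local descriptors to the vector h can be sketched as below; the nearest-codeword assignment and the normalisation are the standard choices, while the two-codeword toy codebook is invented for illustration.

```python
def bow_histogram(descriptors, codebook):
    """Assign each descriptor to its nearest codeword (Euclidean) and
    accumulate a normalised visual-word histogram h."""
    def nearest(d):
        return min(range(len(codebook)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(d, codebook[i])))
    h = [0.0] * len(codebook)
    for d in descriptors:
        h[nearest(d)] += 1.0          # spatial layout is discarded here
    total = sum(h) or 1.0
    return [v / total for v in h]

codebook = [[0.0, 0.0], [10.0, 10.0]]                  # two toy codewords
descriptors = [[1.0, 0.0], [9.0, 9.0], [0.0, 1.0], [1.0, 1.0]]
```

Three of the four toy descriptors quantise to the first codeword and one to the second, giving h = [0.75, 0.25].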
In contrast, the image descriptions introduced in the previous sections still carry spatial information. Especially the dense ones (e.g., HOG) are often used directly to provide a spatial description of the objects. The final step of the BoW model is to convert the vector-represented patches into "codewords" (analogous to words in text documents), which also produces a "codebook" (analogous to a dictionary). A codeword can be considered a representative of several similar patches. As suggested, we use a clustering (vector-quantisation) method to quantise the feature descriptors. One simple method is to perform K-means clustering over the dataset (or a subset of it) [95] (see § 3.5.2), where each cluster stands for a visual codeword. The value of k depends on the application, ranging from a few hundred or thousand entries for object class recognition up to one million for retrieval of specific objects from large databases. Codewords are then defined as the centres of the learned clusters, and the number of clusters is the codebook size (analogous to the size of the word dictionary). The size of the vocabulary is chosen according to how much variability is desired in the individual visual words. After the construction of the visual vocabulary, an image is described as a vector (or histogram) that stores the distribution of all assigned codebook IDs or visual words; in short, a bag of visual words. Once the bags of words of the images have been computed, the images can be classified using any standard machine learning algorithm. Current state-of-the-art results use discriminative support vector machines (see § 3.5.3.1), but Naïve Bayes methods have also been commonly used [40].

3.5.2 Vector Quantisation

The previous sections present a range of widely used, low-level feature descriptors. Many of these can be very high dimensional, such as SIFT features with their 128-dimensional feature space.
Vector quantisation of these raw feature descriptors is a common step in object or texture recognition. One reason for the quantisation is the large range of values and their sensitivity to small image perturbations. The quantisation thus introduces robustness and opens up the possibility for a variety of object-class models, some of which are introduced in the next sections. It involves data clustering, i.e. partitioning a dataset into groups of related samples. The most widely used method is the K-means clustering algorithm [95] in Euclidean vector space. K-means starts with k randomly selected data points, called cluster centres, and finds a partitioning of N points from a vector space into k < N groups, where k is typically specified by the user. The objective is to minimise the squared error:

V = argmin_S Σ_{i=1}^{k} Σ_{x_j ∈ S_i} ‖x_j − µ_i‖² (3.5)

where S_1, ..., S_k are the k clusters and µ_i is the mean of all points x_j ∈ S_i. The first step assigns each of the remaining data points to the closest cluster centre. We use the Euclidean distance as the distance measure, but it is also possible to use other distances such as the Mahalanobis distance. The next step recomputes the cluster centres as the mean of each cluster. These two steps are alternated until convergence. While k is the only parameter that needs to be specified for K-means, its choice is not trivial, since it affects the outcome of the clustering. A common way to handle this problem is to try several values for k. For large datasets, however, this approach is too time-consuming because k can vary over a wide range. The time complexity of the K-means algorithm is O(Nkld) for N data points of dimension d and l iterations. Here l depends on the distribution of the data in the feature space and on the initial centres. In our work we use K-means to cluster local visual features into visual vocabularies.
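The two alternating steps can be sketched in a few lines (Lloyd's algorithm). Random initialisation from the data and a fixed iteration budget are simplifications of practical implementations; the five 2-D toy points are invented for the example.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain K-means: alternate a) assigning each point to its closest
    centre and b) recomputing centres as the means of their clusters."""
    rng = random.Random(seed)
    centres = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda j: sum((a - b) ** 2
                                      for a, b in zip(p, centres[j])))
            clusters[i].append(p)
        for j, cluster in enumerate(clusters):
            if cluster:                       # keep old centre if emptied
                centres[j] = [sum(xs) / len(cluster)
                              for xs in zip(*cluster)]
    return centres

points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (10.0, 10.0), (10.5, 10.0)]
centres = sorted(kmeans(points, 2))
```

On the toy data the algorithm recovers the two obvious groups, with centres at the group means; for visual vocabularies the same loop runs over thousands of 128-dimensional SIFT vectors instead.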
3.5.3 Classification of Visual Features

Once we have established an image representation based on visual features, we can use these features to train an appropriate classifier. Here we review some machine learning techniques for classification, adapted to the feature-based image representations used in this thesis. Classifiers are a subset of pattern recognition methods that, given a training set T of n data points T = {x_1, ..., x_n} with their feature descriptions and class labels y_1, ..., y_n, predict the class label y for a previously unseen test data point x [4, 48]. The feature description x can take any form, but it is often represented in a Euclidean vector space. It can also be transformed into a different vector space using a transformation function ϕ. This transformation into (often) higher-dimensional vector spaces can improve the discrimination between classes. Examples of such feature descriptions are the previously described visual-word histograms in the BoW model, or the 128-dimensional SIFT vectors embedded in Euclidean vector space.

Figure 3.7: Codewords in 2-dimensional space (left): input vectors are marked with a triangle, codewords with red circles, and the Voronoi regions are separated by boundary lines. An illustration of the margin 2/‖w‖ between the separating hyperplane wx − b = 0 and the training samples, with the support vectors lying on wx − b = ±1 (right). Image courtesy of [9].

In the next two subsections we focus on two discriminative classifiers: i) support vector machines (SVM), which are widely used in the machine learning and computer vision communities; and ii) nearest neighbours (NN), which see wide adoption in computer vision applications.
In addition to discriminative classifiers there are also generative classifiers. Generally speaking, discriminative classifiers model the class probability given the feature description directly, i.e. P(y = c|x), also called the posterior probability. Generative classifiers instead model what is common within each class, i.e. P(x|y), called the likelihood. The classification is then determined using Bayes' rule: P(y|x) = P(x|y)P(y)/P(x), where P(y) denotes the class prior and P(x) the normalisation factor, also called the evidence.

3.5.3.1 Support Vector Machines

Support vector machines (SVM) were originally developed for pattern recognition and are motivated by results from statistical learning theory. The basic idea behind the SVM is to learn a hyperplane in some feature space that separates the positive and negative training examples with a maximum margin; SVMs are therefore called maximum margin classifiers. Consider the problem of separating a set of p training samples (x_1, y_1), ..., (x_p, y_p) into two classes, where x_i ∈ ℝⁿ is a feature vector and y_i ∈ {−1, +1} is its class label, depending on whether sample i belongs to the negative or positive class. Suppose that the two classes can be separated by a hyperplane (w · x) + b = 0. Then the optimal hyperplane H is the one which maximises the width of the margin (H_1, H_2) between the two classes. This can be formulated via the following constraints:

x_i · w + b ≥ +1 for y_i = +1 (3.6)
x_i · w + b ≤ −1 for y_i = −1 (3.7)
y_i(x_i · w + b) − 1 ≥ 0 ∀i (3.8)

The distance separating the two classes is 2/‖w‖. Hence, H is the hyperplane with minimal ‖w‖ such that ∀(x_i, y_i): y_i(w · x_i + b) − 1 ≥ 0, subject to the constraints (Eq. 3.8). Introducing Lagrange multipliers α_i (i = 1, ..., p) results in the classification function:

f(x) = sign( Σ_{i=1}^{p} α_i y_i x_i · x + b ) (3.9)

where the α_i and b are found by Sequential Minimal Optimisation (SMO) [39, 165].
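Once the α_i and b have been obtained (e.g., via SMO), applying Eq. 3.9 is straightforward. The sketch below assumes the multipliers are already given; the one-dimensional toy values are invented for illustration and the default linear kernel can be swapped for a non-linear one.

```python
def svm_decide(x, support_vectors, alphas, labels, b, kernel=None):
    """Evaluate f(x) = sign(sum_i alpha_i * y_i * K(x_i, x) + b).
    Only the support vectors (alpha_i != 0) contribute to the sum."""
    if kernel is None:                        # linear kernel: dot product
        kernel = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
    s = sum(a * y * kernel(sv, x)
            for a, y, sv in zip(alphas, labels, support_vectors))
    return 1 if s + b >= 0 else -1

# Two 1-D support vectors at +1 and -1 with equal multipliers:
svs, alphas, labels, b = [[1.0], [-1.0]], [0.5, 0.5], [1, -1], 0.0
```

The sum visibly runs only over the support vectors, which also makes concrete why classification cost grows with their number and with the kernel's evaluation cost.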
Even though most of the α_i take the value 0, the points x_i for which α_i ≠ 0 are the so-called support vectors, because they define the decision boundary⁴. Because the weighting term can vary to the benefit of either discrimination or generalisation, SVMs are considered a flexible training tool which can be adapted to a wide range of situations. Straightforward classification using kernelised SVMs requires evaluating the kernel for the test vector and each of the support vectors [107]. The complexity of classification for an SVM with a non-linear kernel is the number of support vectors times the complexity of evaluating the kernel function; the latter is generally an increasing function of the dimension of the feature vectors. Since testing is expensive with non-linear kernels, linear-kernel SVMs have become popular for real-time applications, as they enjoy both faster training and faster classification, with significantly lower memory requirements than non-linear kernels, due to the compact representation of the decision function. However, this comes at the expense of accuracy, and linear SVMs cannot be used on tough, complex datasets like PASCAL VOC [50].

⁴ This is for linearly separable datasets. For non-separable datasets, a support vector can be within the margin or even on the wrong side of the decision boundary.

3.5.3.2 Nearest Neighbour Classification

Nearest neighbour matching is a good baseline method that is still often used because of its simplicity and relatively good performance. In particular, the k nearest neighbours (k-NN) variant of the nearest-neighbour classifier is popular in the field of computer vision. Essentially, it assigns a data point to the class with the most exemplars among the k nearest neighbours. One interesting point to note is that the k-NN classifier with majority vote is universally consistent [47].
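The majority-vote rule itself can be sketched in a few lines; the toy training set and labels are invented for the example.

```python
from collections import Counter

def knn_classify(query, data, labels, k=3):
    """k-NN with majority vote: rank training points by squared Euclidean
    distance to the query, take the most common label among the top k."""
    sq_dist = lambda p: sum((a - b) ** 2 for a, b in zip(p, query))
    ranked = sorted(range(len(data)), key=lambda i: sq_dist(data[i]))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

data = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6)]
labels = ["cup", "cup", "cup", "book", "book"]
```

Note that every training point is consulted at query time, which illustrates the memory and lookup cost mentioned below; tree or hashing index structures are the usual remedy in practice.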
A classifier is called consistent if, with increasing data, its risk approaches the Bayes risk of the underlying distribution. It is called universally consistent if this property holds for all data distributions. In the simple nearest neighbour case (1-NN), it has been shown that the probability of error cannot exceed twice the Bayes probability of error as the number of training samples becomes infinite. Thus a large amount of label information is contained in the nearest neighbour. This, along with the simplicity and versatility of k-NN, explains the popularity of this classifier. k-NN is a non-parametric classifier and hence has several advantages over parametric counterparts, such as not requiring a training phase and generally avoiding over-fitting. However, it often requires all training data points to be stored in memory, which can be costly.

3.6 Semantic Knowledge

Semantic knowledge refers to the meaning of concepts in terms of their properties and relations to other concepts. Concepts in robotics, and in artificial systems in general, should be represented in a machine-interpretable and reusable way. Cognitive robots should be able to use this semantic knowledge to represent their perception of the environment, both from a metrical and topological point of view and from a high-level conceptual knowledge point of view. As we saw in the previous chapter (see § 2.7.1 & § 2.7.2), there is an increasing number of approaches which attempt to model and represent high-level semantic knowledge, in the context of semantic mapping [57]; cognitive vision [27, 29, 121]; scene understanding [82, 106]; HRI [90, 114, 177]; and learning [28, 78, 119]. In this thesis we are interested in the autonomous acquisition and exploitation of semantic information, as well as the integration of this perceptual semantic information with commonsense knowledge (see § 3.7).
Typically, a semantic approach is based on a formalism which includes two standard elements: i) the ontology (taxonomy); and ii) the knowledge base. The theoretical foundation for modelling semantic domain information and perceptual semantic knowledge in the anchoring framework is the logical representation of Description Logics (DLs), which lie in a decidable fragment of first-order logic (FOL). They are specifically tailored for representing and reasoning about the terminological knowledge of an application domain in a structured way, while providing a good trade-off between expressivity and tractability. In this dissertation we use the most standard DL formalism, the prototypical family of languages ALC (Attributive Concept Language with Complements), originally introduced by Schmidt-Schauß and Smolka [142].

Table 3.1: DL syntax rules.

C, D → A                  (atomic concept)
     | ⊤                  (universal concept)
     | ⊥                  (bottom concept)
     | ¬C                 (complement)
     | C ⊓ D              (conjunction)
     | C ⊔ D              (disjunction)
     | ∀R.C               (universal quantification)
     | ∃R.⊤               (existential quantification)
     | (≥ nR) | (≤ nR)    (number restrictions)

3.6.1 Description Logics

In DLs, as in any other mathematical logic, the two most important notions are the language (syntax) and the semantics (meaning). The language uses elements such as predicate symbols, variables, constants and function symbols to construct formulae (sentences) intended to represent concepts or notions of the world. DLs deal with three categories of syntactic primitives: concepts, which denote classes of objects; roles, which denote binary relations; and individuals, which stand for objects.

3.6.1.1 Syntax

Definition 5 (Signature). Σ = ⟨C, R, I⟩ is a DL signature iff C is a set of concept names, R is a set of role names, and I is a set of individual names. The three sets are pairwise disjoint.
Using the above-mentioned syntactic primitives together with concept constructors, we can formalise the relevant notions of an application domain by (elementary or complex)⁵ concept descriptions. The most common logical constructors used in DLs include: ∧ (conjunction), ∨ (disjunction), → (implication), ∃ (exists) and ∀ (for all).

Definition 6 (Concept Description). Let A be an atomic concept, C and D two arbitrary concept descriptions, and R an atomic role. The set of concept descriptions is formed by means of the syntax rule in Table 3.1.

⁵ Elementary concept descriptions consist only of atomic concepts, atomic roles and individuals. Complex descriptions can be built from elementary descriptions by using concept constructors.

DLs are distinguished by the subset of constructors they provide. The language AL (attributive language) was introduced as a minimal language of practical interest. The other languages of this family are extensions of AL; by adding more constructors we obtain more expressive description languages. For example, including the union C ⊔ D of concepts (also called disjunction), as interpreted in Table 3.2, yields the language ALU. Similarly, by allowing full existential quantification, written ∃R.C, we obtain the description language ALE, and with full negation (negation of arbitrary concepts), written ¬C, we obtain the description language ALC, and so on. By combining the above constructors we can define more complex languages, denoted by strings of the form AL[U][E][C][N]. ALC and ALCN are considered today the most basic description logics. By using a small set of epistemologically adequate concept constructors we can keep the language decidable.

3.6.1.2 Semantics

The notion of interpretation is used to assign meaning to the syntactic primitives and the constructors of the language.

Definition 7 (Interpretation).
A terminological interpretation over a signature Σ = ⟨C, R, I⟩ is a tuple I = ⟨Δ^I, ·^I⟩, where the set Δ^I is called the domain and ·^I the interpretation function. The interpretation function maps every individual a to an element a^I ∈ Δ^I, every concept to a subset of Δ^I and every role name to a subset of Δ^I × Δ^I. The extension of ·^I to arbitrary concept descriptions is inductively defined as shown in the upper part of Table 3.2.

3.6.1.3 Knowledge Base

Although the constructors introduced so far can be used to form quite complex concepts, we have as yet no means to describe relationships between them. To describe the relationships that hold between concepts we use a terminological component (TBox), which defines the terminology of an application domain. To describe knowledge about facts in the world we use an assertional component (ABox). Combining a TBox with an ABox results in a knowledge base K.

Definition 8 (Terminological axioms). A general concept inclusion (GCI) has the form C ⊑ D where C and D are concepts. A concept equality has the form C ≡ D where C and D are concepts; we write C ≐ D when C ⊑ D and D ⊑ C. A finite set of GCIs and equalities is called a TBox.

Definition 9 (Assertional axioms). A concept assertion is a statement of the form a : C where a ∈ I and C is a concept. A role assertion is a statement of the form (a, b) : R where a, b ∈ I and R is a role. An ABox is a finite set of assertional axioms.

Table 3.2: Syntax and semantics of commonly used constructors, with the corresponding letter used to denote a particular constructor.

Constructor           Syntax     Semantics                                        Logic
Top                   ⊤          Δ^I                                              AL
Bottom                ⊥          ∅                                                AL
Intersection          C ⊓ D      C^I ∩ D^I                                        AL
Union                 C ⊔ D      C^I ∪ D^I                                        U
Negation              ¬C         Δ^I \ C^I                                        C
Value Restriction     ∀R.C       {x ∈ Δ^I : ∀y. (x, y) ∈ R^I → y ∈ C^I}           AL
Existential Quant.    ∃R.C       {x ∈ Δ^I : ∃y. (x, y) ∈ R^I ∧ y ∈ C^I}           E
Number Restriction    ≥ n R      {x ∈ Δ^I : |{y : (x, y) ∈ R^I}| ≥ n}             N
                      ≤ n R      {x ∈ Δ^I : |{y : (x, y) ∈ R^I}| ≤ n}             N
Concept Definition    A ≡ C      A^I = C^I
Concept Inclusion     C ⊑ D      C^I ⊆ D^I
Concept Assertion     C(α)       α^I ∈ C^I
Role Assertion        R(α, β)    (α^I, β^I) ∈ R^I

Definition 10 (Knowledge Base). A knowledge base K is an ordered pair (T, A) where T is a TBox and A is an ABox.

Definition 11 (Model). We say that an interpretation I is a model of a TBox T iff it satisfies all concept definitions in T, i.e., A^I = C^I holds for all A ≡ C in T. It is a model of a general TBox T iff it satisfies all concept inclusions in T, i.e., C^I ⊆ D^I holds for all C ⊑ D in T. Similarly, we say that an interpretation I is a model of an ABox A iff it satisfies all concept and role assertions, i.e., for every C(a) in A, a^I ∈ C^I holds, and for every r(a, b) in A, (a^I, b^I) ∈ r^I holds. A TBox T and an ABox A are satisfiable if they have a model. I is said to be a model of the knowledge base K = ⟨T, A⟩ iff the interpretation satisfies all terminological statements in T and all assertional statements in A.

3.6.2 Inference Support

Once we have a description of the application domain in a DL as described above, we are able to make inferences, i.e., deduce implicit consequences from the explicitly represented knowledge. The most common decision problems are basic database-query-like questions, such as instance checking, which evaluates whether a particular instance (member of an ABox) is a member of a given concept. Relation checking evaluates whether a relation/role holds between two instances, in other words whether individual a has property b.
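The interpretation function and the instance-checking problem can be made concrete with a small sketch. The following Python fragment (purely illustrative; all names and the toy domain are hypothetical, not from the thesis) evaluates ALC concept descriptions over a single, finite, explicitly given interpretation:

```python
# Evaluate ALC concept descriptions over a finite interpretation
# I = (domain, concept extensions, role extensions).
# Domain and vocabulary below are illustrative only.

DOMAIN = {"cup1", "cup2", "kitchen"}
CONCEPTS = {"Cup": {"cup1", "cup2"}, "Room": {"kitchen"}}
ROLES = {"locatedIn": {("cup1", "kitchen")}}

def ext(c):
    """Return the extension C^I of a concept description c."""
    kind = c[0]
    if kind == "atom":   return CONCEPTS.get(c[1], set())
    if kind == "top":    return set(DOMAIN)
    if kind == "bottom": return set()
    if kind == "not":    return DOMAIN - ext(c[1])
    if kind == "and":    return ext(c[1]) & ext(c[2])
    if kind == "or":     return ext(c[1]) | ext(c[2])
    if kind == "exists":  # ∃R.C
        r = ROLES.get(c[1], set())
        return {x for x in DOMAIN
                if any((x, y) in r and y in ext(c[2]) for y in DOMAIN)}
    if kind == "forall":  # ∀R.C
        r = ROLES.get(c[1], set())
        return {x for x in DOMAIN
                if all((x, y) not in r or y in ext(c[2]) for y in DOMAIN)}
    raise ValueError(kind)

def instance_of(a, c):
    """Instance checking in this one interpretation: a^I ∈ C^I?"""
    return a in ext(c)

# "a Cup located in some Room", i.e. Cup ⊓ ∃locatedIn.Room
desc = ("and", ("atom", "Cup"), ("exists", "locatedIn", ("atom", "Room")))
print(instance_of("cup1", desc))  # True: cup1 is a cup in the kitchen
print(instance_of("cup2", desc))  # False: cup2 has no location
```

Note that this evaluates membership in one fixed interpretation; instance checking with respect to a TBox and an ABox quantifies over all common models and requires a proper reasoner.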
Another kind of inference resembles more global database questions, such as subsumption, which evaluates whether a concept is a subset of another concept, and concept consistency, which evaluates whether there is no contradiction among the definitions or chains of definitions. The language is capable of automatically inferring hierarchies between concepts from what is explicitly stated, as well as instance relations between individuals and concepts. Another interesting feature of DLs is that they maintain the intensional knowledge (TBox), which represents general knowledge about a domain, separate from the extensional knowledge (ABox), which represents a specific state of affairs. The semantics of the ABox is an open-world semantics, i.e., the absence of information about an individual is not interpreted as negative information but only indicates lack of knowledge. Notice also that ALC is a fragment of first-order predicate logic, since everything that one can express in ALC can be expressed in first-order predicate logic. Entailment of terminological axioms in ALC is ExpTime-complete6 (for cyclic TBoxes or GCIs) [1]. Modern DL systems, such as FaCT [71] (and its successor FaCT++ [161]), Racer [62] (and its successor RacerPro), Pellet [127] and KAON2 [120], provide inference services that solve the inference problems mentioned above, which are also known as standard inferences. The basic inference on concept descriptions is subsumption. Given two concept descriptions C and D, the subsumption problem C ⊑ D is the problem of checking whether the concept description D is more general than the concept description C. In other words, it is the problem of determining whether the first concept always (i.e., in every interpretation) denotes a subset of the set denoted by the second. We say that C is subsumed by D w.r.t. a TBox T if, in every model of T, D is more general than C, i.e.
the interpretation of C is a subset of the interpretation of D. We denote this as C ⊑T D. Another typical inference on concept descriptions is satisfiability: the problem of checking whether there is an interpretation that interprets a given concept description as a non-empty set. In fact, the satisfiability problem can be reduced to the subsumption problem: a concept is unsatisfiable iff it is subsumed by ⊥ (the empty concept). The typical inference problems for ABoxes are instance checking and consistency. Consistency of an ABox w.r.t. a TBox is the problem of checking whether the ABox and the TBox have a common model. Instance checking w.r.t. a TBox and an ABox is the problem of deciding whether the interpretation of a given individual is an element of the interpretation of a given concept in every common model of the TBox and the ABox.

3.7 Commonsense Knowledge Bases

Representing and amassing commonsense knowledge has been a topic of interest since the conception of artificial intelligence, as we have already seen in a previous section (§ 2.4). Efforts stemming from the SHRDLU spirit focused on providing programs with considerably more information from which conclusions can be drawn. The vast collection of facts and information is typically stored in a special KB. Such a database consists mainly of a schema that characterises the domain (an ontology, of which the most general partition is called the upper ontology), expressed in formal logic.

6 ExpTime might seem discouraging at first; however, there exist DL systems which can cope with very large ontologies [61].
Typically, information in a commonsense KB includes the following: i) an ontology of classes and individuals; ii) properties, functions, locations, parts and materials of objects; iii) locations, durations, preconditions and postconditions (effects) of actions and events; iv) subjects and objects of actions; and v) various other complementary information, such as stereotypical situations or scripts, human goals and needs, emotions, plans and strategies in multiple contexts. According to the literature, the most notable leading efforts that intended to build large-scale, general-purpose semantic knowledge bases are OMCS [147], ConceptNet [98] and Cyc [93, 94]. However, in the past decade the number of commonsense knowledge bases has been steadily increasing, with many of the approaches embodied in the (semantic) web [24]. Here I present the commonsense knowledge base and ontology used in the thesis (Cyc), along with some other commonsense knowledge-base systems. I then discuss their suitability for use in robots and cognitive agents.

3.7.1 CYC

The Cyc project was started in 1984 by Lenat as part of the Microelectronics and Computer Technology Corporation, where the project attempted to formalise commonsense knowledge into a logical framework, by codifying in machine-readable form the millions of pieces of knowledge that comprise human commonsense [93, 94]. Since 1994 it has been developed and maintained by Cycorp, and its most important assets include an emphasis on knowledge engineering, the representation of hand-crafted facts about the world and the implementation of efficient inference mechanisms. Moreover, there is a clear focus on the ability to communicate with end-users in natural language and to assist with the knowledge formation process via machine learning. The underlying formal language representation used in Cyc is called CycL.
It is essentially an augmentation of first-order predicate calculus and LISP, which includes extensions that enhance the expressiveness of the language in order to permit commonsense knowledge expressions. Such extensions include handling of equality, default reasoning and skolemisation, as well as extensions for modal operators and higher-order quantification (over predicates and sentences). CycL uses a form of circumscription, includes the unique names assumption, and can thus make use of the closed-world assumption where appropriate. CycL is used to represent the knowledge stored in the knowledge base. The KB is based on the Cyc upper ontology, which sets the foundations for the vocabulary and consists of terms, expressions and contexts. The set of terms can be divided into constants (concepts), non-atomic terms (NATs) and variables. Constants can be individuals, collections and (truth) functions. Constants start with “#$” and are case-sensitive. The terms are combined into more complex, meaningful CycL expressions (assertions, rules, axioms, . . . ) that are used to express facts and support inference in the CycKB. The most heavily used predicates include #$isa and #$genls. The first (#$isa) states that one item is an instance of some collection (specialisation), while the second (#$genls) states that one collection is a sub-collection of another (generalisation). Rule sentences may also contain variables (strings starting with “?”). Finally, the several thousand concepts represented in Cyc are grouped into several contexts, which in CycL are represented as “Microtheories”. These contexts are employed in order to find the truth or falsity of a CycL sentence while avoiding contradictory inferences.
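The #$isa/#$genls distinction and the context mechanism can be sketched in a few lines of Python. This is a hypothetical, heavily simplified illustration (CycL itself is Lisp-like and far richer; the collection and microtheory names below are invented): a query consults only the assertions visible in the chosen context, #$genls is transitive, and #$isa is membership in a collection or any of its generalisations:

```python
# Minimal sketch of isa/genls queries scoped by microtheory (context).
# Names and data are illustrative, not taken from the actual Cyc KB.

genls = {  # collection -> set of direct super-collections (generalisation)
    "DrinkingCup": {"Container"},
    "Container": {"ManMadeThing"},
}
isa_by_mt = {  # microtheory -> instance -> asserted collections
    "KitchenMt": {"cup1": {"DrinkingCup"}},
}

def super_collections(coll):
    """Transitive closure of genls."""
    seen, stack = set(), [coll]
    while stack:
        for sup in genls.get(stack.pop(), set()):
            if sup not in seen:
                seen.add(sup)
                stack.append(sup)
    return seen

def holds_isa(mt, inst, coll):
    """isa query: is inst an instance of coll, in context mt?
    True if asserted directly or reachable via a genls chain."""
    for c in isa_by_mt.get(mt, {}).get(inst, set()):
        if c == coll or coll in super_collections(c):
            return True
    return False

print(holds_isa("KitchenMt", "cup1", "ManMadeThing"))    # True, via genls chain
print(holds_isa("LivingRoomMt", "cup1", "DrinkingCup"))  # False: wrong context
```

The context scoping mirrors how a fact can hold in one microtheory without being globally visible, which is how Cyc tolerates locally inconsistent knowledge.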
Each microtheory (Mt) contains collections of concepts and facts typically pertaining to one particular realm of knowledge, and its name (which is a regular constant) contains the string “Mt” by convention (e.g., #$BaseKBMt). The inference engine of Cyc performs general logical deduction, including modus ponens, modus tollens, universal quantification and existential quantification [93, 94]. The latest version of Cyc includes the entire ontology, which contains thousands of terms and (mainly taxonomic) assertions7 relating the terms to each other. On top of this taxonomic information, Cyc builds a significantly larger portion of semantic knowledge (i.e. additional facts) about the concepts in the taxonomy, and includes a large lexicon for English, parsing and generation tools, as well as Java-based interfaces for knowledge editing and querying. Finally, there already exist efforts to interconnect the Cyc upper ontology with other standard upper ontologies (e.g. UMBEL, YAGO-NAGA or Freebase) and also to map the Cyc ontology to other central semantic resources on the internet (such as DBpedia [5] and Wikipedia [112]). The source code written in CycL released with the OpenCyc system is licensed as open source, to increase its usefulness in supporting the semantic web. To use Cyc in reasoning about text, it is necessary to first map the text into its proprietary logical representation, described by its own language, CycL. However, this mapping process is quite complex, because all of the inherent ambiguity in natural language must be resolved to produce the unambiguous logical formulation required by CycL.

7 Roughly 47,000 concepts and 306,000 facts.

3.7.2 Open Mind Common Sense

The Open Mind Common Sense (OMCS) project attempts to construct and utilise a large commonsense KB from contributions of human users via the internet. In the period of 12 years since its initiation (1999-2011) the project
managed to accumulate more than a million English facts from more than 15,000 contributors. There are many different types of knowledge in OMCS. Some statements convey relationships between objects or events, expressed as simple phrases of natural language (e.g., “The sun is very hot”), while others contain information about peoples’ desires and goals. Much of OMCS’s software is built on three interconnected representations: i) the natural language corpus that people interact with directly; ii) a semantic network built from this corpus, called ConceptNet (see § 3.7.3); and iii) a matrix-based representation of ConceptNet, called AnalogySpace, that can infer new knowledge using dimensionality reduction [147]. The OMCS project resembles Cyc, except that it depends on contributions from thousands of individuals across the internet, instead of knowledge carefully hand-crafted by knowledge engineers.

3.7.3 ConceptNet

ConceptNet is another commonsense knowledge project, primarily based on the OMCS database (see § 3.7.2), where natural-language phrases are collected from ordinary people who contribute to the database. ConceptNet is essentially a semantic network expressed as a directed graph, whose nodes are concepts and whose edges represent commonsense assertions about the concepts. ConceptNet captures a wide range of commonsense concepts and relations, such as those in Cyc. Its simple semantic network structure lends it an ease of use comparable to WordNet, thus making it suitable for use in natural language processing and intelligent agents.

3.7.4 Other approaches

Probase

Probase is a very large web-fed knowledge base from Microsoft. It is a very recent endeavour which includes concept hierarchies and amasses 2.7 million concepts, 4.5 million subclass relations and 16 million instances. It is claimed to cover all major knowledge sources, such as Freebase, WordNet, Cyc and DBpedia.
Unfortunately, none of the data is linked in any way, publicly available, or compliant with a standard format.

Wolfram Alpha

Wolfram Alpha is an on-line answering engine that answers factual queries directly, by computing the answer from structured data rather than providing a list of documents or web pages that might contain the answer, as a search engine would. Wolfram Alpha is built on Mathematica, a complete functional programming package which encompasses computer algebra, symbolic and numerical computation, visualisation and statistics capabilities. The database currently includes hundreds of datasets, accumulated over approximately two years, which include current and historical weather, drug data, star charts, currency conversion and others.

True Knowledge

True Knowledge is mainly a linguistic answering engine that aims to directly answer questions posed in plain English text (similar in spirit to Wolfram Alpha). The knowledge is organised in a database of discrete facts, where the engine attempts to comprehend posed questions by disambiguating among all possible meanings of the words in the question, in order to find the most likely meaning of the question being asked. Information is gathered in two ways: by importing it from “credible” external databases (e.g., Wikipedia) and from user submissions following a consistent format and a detailed input process. The system itself assesses the submitted information, rejecting any facts that are semantically incompatible with other approved knowledge.

NELL: Never-Ending Language Learning

NELL is a research project that attempts to build a never-ending semantic machine learning system with the ability to extract structured information from unstructured web pages.
It runs continuously, attempting to perform two tasks each day: i) “reading”, i.e., extracting facts (e.g., playsInstrument(George_Harrison, guitar)) from text over hundreds of millions of web pages, looking for connections with the information it already knows; and ii) attempting to improve its reading competence by making new connections, in a manner intended to mimic the way humans learn new information. It comprises a knowledge base of structured information that mirrors the content of the web, so far accumulating more than 848,802 beliefs acquired from various web pages.

DBpedia

The DBpedia project aims to extract structured content from the information created as part of the Wikipedia project. It uses RDF to represent the extracted information and the SPARQL query language to access it. DBpedia allows users to query relationships and properties associated with Wikipedia resources, including links to other related datasets. The main dataset describes more than 3.64 million things, of which 1.83 million are classified in a consistent ontology, which includes persons, places, films, games, organisations, species and diseases. The dataset also features links to images and to external web pages, as well as to other Open Data RDF datasets such as Freebase, Cyc and UMBEL. This enables applications to enrich DBpedia data with data from the datasets mentioned above, thus making it one of the central interlinking hubs of the emerging Web of Data.

Freebase

Freebase is a large on-line collaborative knowledge base, consisting of structured metadata gathered mainly from individual “wiki” contributors, as well as other sources. The project aims to create a global resource which allows people (and machines) to access common information more effectively. The data structure is defined as a set of nodes and a set of links that establish relationships between the nodes.
Because its data structure is non-hierarchical, Freebase can model much more complex relationships between individual elements than a conventional database. Queries to the database are made in the Metaweb Query Language (MQL).

3.7.5 Discussion

While ConceptNet and Cyc purport to capture general-purpose world-semantic knowledge, the qualitative differences in their knowledge representations make them suitable for very different purposes. Because Cyc represents commonsense in a formalised logical framework, it excels in careful deductive reasoning and is appropriate for situations which can be posed precisely, unambiguously and in a context-sensitive way. While logic is a highly precise and expressive language, it has difficulty modelling the imprecise way that humans categorise and compare things based on prototypes, and emulating human reasoning, which is largely inductive and highly associative. Cyc is mainly hand-crafted by knowledge engineers, while ConceptNet is automatically mined from a knowledge base of English sentences, the Open Mind Common Sense (OMCS) corpus, contributed by 14,000 people over the web. Cyc, posed in logical notation, does indeed contain much more practical knowledge. The ability to quantify the semantic similarity between two symbols permits reasoning inductively over concepts and categorising objects and events. However, natural language is considered a weak representation for commonsense knowledge. Ambiguity is a particular aspect that needs to be dealt with when reasoning in natural language. Whereas human language is full of ambiguity, a logical symbol is concise; there may exist many different natural language fragments which mean essentially the same thing and seem equally suitable to include. In this sense logic can be seen as more economical. However, it is sometimes difficult to be precise in natural language. For example, what is the precise colour of a “red apple”?
In logic, we might be able to formally represent the range in the colour spectrum corresponding to the “red” of the apple, whereas in natural language the word “red” is imprecise and has various interpretations. As ConceptNet and Cyc employ different knowledge representations, cross-representational comparisons may provide a measure for evaluation. However, neither of the two systems (nor any other) is suitable for direct use in robotics. The reason is that there exists a very delicate balance between representational, expressive and explanatory power, which none of the systems mentioned above acknowledges or complies with. We may consider the systems mentioned in this section primarily as knowledge resources which can complement the robotic domain with additional information. For example, other projects including NELL, Freebase and DBpedia, which all explore web-based approaches for collecting and distributing knowledge, may provide additional concepts and inference services, even though they live entirely on the web.

3.8 Conclusions

In this chapter I have first briefly outlined the initial anchoring model which was used as a basis in the papers appended to this thesis. Furthermore, the implemented architecture involved the integration of several techniques and processing algorithms, in order to support the perceptual signatures and semantic descriptions of objects. This is what made anchoring possible in the scenarios presented in the studies of the thesis. The integration involved the described image representation and classification algorithms with respect to visual perception; Adaptive Monte-Carlo Localisation with respect to mobile robot localisation; the DL framework with respect to modelling the perceptual-semantic knowledge of the agents; and the Cyc system as the commonsense knowledge base.
With the components mentioned herein it has been possible to achieve meaningful anchoring experiments, as shown in the different studies. Further details regarding the integration are presented in the corresponding papers (I, II, III, IV and V).

4 Summary of Publications

4.1 Anchoring with spatial relations

In Paper I we used symbolic knowledge representation and reasoning capabilities to enrich perceptual anchoring. The use of KR&R is also advocated to allow the human to assist the robot in simple anchoring tasks, such as the disambiguation of objects, thereby exploring a deeper form of mutual HRI. The knowledge base consists of two parts: i) the terminological component (TBox), which contains the description of the relevant concepts and their relations; and ii) an assertional component (ABox), for storing concept instances and assertions on those instances. For the anchoring domain we require an ontology that covers all the physical entities and the corresponding perceptual properties which are recognised by the perceptual system and may occur during an anchoring scenario. In addition, we used knowledge that was inferred from basic knowledge contained in the anchors, as well as knowledge collected from external sources using other cognitive capabilities, such as other anchoring processes or linguistic perception. Modelling an ontology is in general a difficult task; therefore, in this particular instance we adopt an ontology based on a subset of the ontological framework DOLCE (A Descriptive Ontology for Linguistic and Cognitive Engineering) [109], an upper-level ontology developed for the Semantic Web. The knowledge base is implemented using the LOOM knowledge representation system [11].

4.1.1 Semantic modelling of spatial relations

Spatial relations were used in the symbolic description of objects and allowed objects to be distinguished by their location with respect to other objects.
They also played an important role when it came to HRI. Two classes of binary spatial relations between a reference object and a located object were considered: a) the topological (distance) relations “at” and “near”; and b) the projective (directional) relations “in-front of”, “behind”, “right” and “left”. The interpretation of a projective relation depends on a frame of reference; for simplicity we assume a deictic frame of reference with an egocentric origin coinciding with the robot platform. We finally modelled the spatial relations as concepts in the ontology, where each spatial relation, a sub-concept of “Abstract Relation”, has as properties a reference object, a located object and a spatial region. Using the LOOM KR&R system we were able to perform some basic inferences with the information provided by the anchoring component. Thus, we could focus on the high-level aspects behind the logical representation and the dynamic properties of the objects, instead of directly invoking sensory information. The user interface was based on a plain text-based application, where the user can type in sentences in simple English. The sentence is analysed by a recursive descent parser and translated into a symbolic description. The grammar allows commands of the form FIND . . . followed by the description of the object. The description consists of a main part that can be followed by sub-clauses describing objects that are spatially related to that object. The main part and each of the sub-clauses can be either a definite or an indefinite description, indicated by the article “a” or “the”. It also includes the object’s class, for example “cup”, and optionally its colour and smell. The smell of an object is inferred from the clause “with . . .” following the object’s class, indicating that it contains a liquid; for example, “the cup with coffee” is assumed to be a cup containing coffee and, as such, smelling of coffee.
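A toy parser in the same spirit as the FIND grammar described above can be sketched as follows. This is an illustrative reconstruction, not the parser used in Paper I: the relation vocabulary, colour list and output structure are all hypothetical simplifications, and the “with . . .” smell clause is omitted:

```python
# Toy parser for a simplified FIND grammar:
#   FIND (a|the) [colour] class [ relation (a|the) [colour] class ]
# Illustrative only; the actual grammar in Paper I is richer.

RELATIONS = ["in front of", "behind", "left of", "right of", "near", "at"]
COLOURS = {"red", "green", "blue"}

def parse_np(phrase):
    """Parse '(a|the) [colour] class' into a symbolic description."""
    words = phrase.split()
    definite = words[0] == "the"
    if words[0] in ("a", "the"):
        words = words[1:]
    colour = words[0] if words[0] in COLOURS else None
    return {"class": words[-1], "colour": colour, "definite": definite}

def parse_find(sentence):
    """Parse a FIND command into a main description plus sub-clauses."""
    tokens = sentence.lower().split()
    assert tokens[0] == "find", "command must start with FIND"
    desc, relations = " ".join(tokens[1:]), []
    for rel in RELATIONS:  # split on the first spatial relation, if any
        head, sep, tail = desc.partition(" " + rel + " ")
        if sep:
            desc = head
            relations.append((rel.replace(" ", "-"), parse_np(tail)))
            break
    result = parse_np(desc)
    result["relations"] = relations
    return result

q = parse_find("FIND the green cup near a red ball")
# q describes a definite green cup, spatially related ("near") to an
# indefinite red ball; this structure is then turned into a KB query.
```

The resulting dictionary plays the role of the symbolic description that is subsequently matched against the anchors stored in the KB.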
The derived symbolic description is used to construct a query for the KB. The main functionality is realised by the FIND routine, which collects candidates from the KB that match the description. If there is more than one candidate, the anchoring module checks for further properties in the given description, apart from shape and colour, and filters the candidates using those additional properties. If an ambiguity still persists, the system asks the user which of the candidate objects to select.

4.1.2 System validation

The validation of the system was done in the context of an intelligent home environment, which was used for ambient assisted living for the elderly or disabled. In this smart home, an existing framework called the PEIS-Ecology [140] is used to coordinate the exchange of information between the robot, other pervasive technologies in the environment and the human user. To illustrate the utility of KR&R in anchoring, three case studies were presented. The first focussed on the inclusion of spatial relations in the anchoring framework. The set of binary spatial relations mentioned above was used in the validation. As spatial prepositions are inherently rather vague, we used fuzzy sets to define granular spatial relations. The proposed method computes a spatial-relations network for anchored symbols and stores it in the KB. The robot surveys a static scene with three objects (two green garbage cans and a red ball) and the anchoring module creates anchors for these objects in a bottom-up way (as soon as new percepts are identified by the vision system). We examined the possibility of using spatial relations to query for an object, for example: “find the green garbage to the left of the ball”. Similarly, a human user can be asked to resolve an ambiguity in a find request. For instance, the query “Find the green garbage” returns more than one anchor.
Therefore, the anchoring module determines an anchored object that is spatially related to these anchors as a reference object and presents the user with a choice, enumerating the returned anchors and their spatial relation(s) to the reference object. The query is then reformulated using the selected relation(s) as well. A second case used multi-modal information about objects, including both spatial information and, in this case, olfactory information given by an electronic nose. The KR&R system is then used to assist the robot in determining which perceptual actions can be taken to collect further information about the properties of the object. As in the previous case, an ambiguity is introduced; here the robotic system uses its olfactory module to resolve the ambiguity among different cups. The robot is given the command “Find a green cup with coffee”. Four candidates were found in the KB, which were asserted to be cups, i.e. vessels containing liquid with an associated smell. This triggers the task planner to generate a plan to visit each candidate and collect an odour property. The ambiguity is then resolved once a first match is found. The third case investigated the possibility of reasoning about object properties in order to determine an optimal candidate. We considered a case where some fruits were scattered on the floor. The robot navigated around the floor and correctly perceived and asserted the instances into the KB. We then asked the robot to pick the orange that was to the left of the apple (spatial misinformation). Since this was not the case, it responded with a valid proposition: that there is no orange to the left of the specific apple, but instead there is one to the right of the banana. The user confirms that the alternative candidate is indeed the requested object. As a next step we asked the robot to pick a banana to the right of the apple. Here there are two candidates, both being to the right of the apple.
However, one of them was classified as rotten and therefore not edible. The system informs the user about the situation and, after removing this option, reports that there is one banana to the right of the apple (suitable for consumption) and returns it.

4.1.3 Contributions

A. Loutfi, S. Coradeschi, M. Daoutis, and J. Melchert. “Using Knowledge Representation for Perceptual Anchoring in a Robotic System”. Int. Journal on Artificial Intelligence Tools (IJAIT) 17, pp. 925–944, 2008.

The author of this thesis was responsible for extending a previous version of this article ([115]), which was invited for publication in a special issue of the International Journal on Artificial Intelligence Tools (IJAIT). More specifically, he designed and performed the extended experiments and helped with the preparation of the second half of the manuscript.

4.2 Common Sense Knowledge and Anchoring

In Paper II we investigated the integration of a large KR&R system with a perceptual system consisting of networked sensors. The role of the KR&R system is to maintain a world model consisting of the collection of the semantic information perceived by the robotic assistant and the smart home. The commonsense KR&R system used in this work is Cyc (see § 3.7.1). Operations such as assertions, retractions, modifications and queries are stated using Cyc’s formal language, CycL. Such a system, resting on a large-scale general-purpose knowledge base, can potentially manage tasks that require world knowledge (or commonsense).

4.2.1 Linguistic interaction

To combine Cyc’s natural language capabilities and enhance the communication with the user, we allow Cyc to translate the results of queries or inferences into natural language.
The main challenge of integrating a large KR&R system like Cyc is to synchronise the information with the perceptual data coming from multiple sensors, which is inherently subject to incompleteness, glitches and errors. The anchoring module provides stable symbolic information despite fluctuating quantitative data. Instances of objects must be created in Cyc, and as objects can be dynamic (e.g., their properties may change), proper updating of information needs to be managed. To enable the synchronisation between the KR&R and anchoring layers, three aspects were considered: a) defining anchoring in the KR&R system; b) handling of assertions; and c) handling of ambiguities. Cyc is not consistent globally but rather attempts to be consistent locally, by exploiting the use of different contexts, which are expressed as MicroTheories. Through the Anchoring MicroTheory it is possible to connect concepts about objects that are currently present in the anchoring module to the structured hierarchical knowledge in Cyc, and concurrently inherit the commonsense reasoning about this knowledge. For instance, if the location of the anchor of the object “cup” is the kitchen, then the object and the kitchen are instantiated into Cyc, inheriting all the properties related to the generalised concepts Kitchen and Cup, such as ManMadeThing.

4.2.2 Knowledge synchronisation

To synchronise perceptual data with the knowledge in Cyc, anchored objects from both the working and archive memories were instantiated. The knowledge manager then keeps the Cyc symbolic system coherent with the symbolic descriptions of each anchor. This is done by translating the symbolic description of each anchor into a set of logical formulae that represent the agent’s perception of the object. When a particular predicate of the symbolic description changes (e.g., the location of an object), an update of the predicate asserts the new information into the KB.
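The synchronisation step just described can be sketched as follows. This is a minimal, hypothetical Python fragment: the actual system emits CycL assertions through Cyc's interfaces, and the predicate names and anchor identifiers below are invented for illustration:

```python
# Minimal sketch of keeping a KB coherent with anchor descriptions.
# Predicate and anchor names are illustrative, not the CycL vocabulary.

kb = {}  # (anchor_id, predicate) -> value; stands in for KB assertions

def sync_anchor(anchor_id, description):
    """Translate an anchor's symbolic description into KB assertions,
    re-asserting only predicates whose value has changed."""
    for predicate, value in description.items():
        key = (anchor_id, predicate)
        if kb.get(key) != value:
            # in the real system: retract the old sentence, assert the new one
            kb[key] = value

# the anchoring module reports a cup in the kitchen...
sync_anchor("cup-1", {"isa": "Cup", "location": "kitchen", "visible": True})
# ...then the cup moves out of view; only changed predicates are updated
sync_anchor("cup-1", {"isa": "Cup", "location": "livingroom", "visible": False})
print(kb[("cup-1", "location")])  # livingroom
```

Note how the `visible` flag lets the KB retain epistemic knowledge about an object (its last known location) even when it is no longer perceived, matching the discussion of visibility below.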
Ambiguities may arise during the synchronisation process; they are resolved interactively with the user through the denotation tool. The denotation tool is an interactive step during the grounding process, in which the user is asked to disambiguate the grounded symbol using concepts from the KB which denote this specific symbol. Another aspect we consider in the synchronisation process is information about the visibility of the object. This is important as the agent still needs to maintain knowledge about an object even when the corresponding perceptual information is not currently available to it. For example, the robot still needs to maintain the knowledge that "a cup is located in the kitchen", even though the "cup" might be out of its current field of view (epistemic knowledge). This was achieved by second-order rule assertions.

4.2.3 Evaluation

As part of the test-bed we used a physical facility resembling a typical bachelor apartment of about 25 m². It consists of a living room, a bedroom and a small kitchen. We initially presented the robot with 15 objects, capturing 2 to 5 training images per object from different viewpoints. It is important to mention that the robot recognised instances of objects, not object categories. We placed those objects around the smart home in a random manner, either close or far, covering a large number of different combinations of spatial relations. We then allowed the robot to navigate around the environment so as to recognise and anchor those objects. We then allowed the human to communicate with the system in natural language, by typing questions (related to the perception of the robot as well as commonsense) or communication commands into the console-based graphical user interface. Excerpts of dialogues that reflect perceptual information from the mobile robot and the environment were presented in Paper II. We were also able to use the visual sensors in the ceiling to observe things that the robot could not perceive or that were occluded. Another instance of the object recogniser was running on the large objects of the smart home (e.g., television set, couch, table, sink). Through the knowledge-base assertions, which contained a larger collection of objects of the environment, the robot was able to compensate for its limited perceptual ability. This spawned an interesting problem of how to coordinate this cooperative perception with respect to the semantic knowledge-base, in the context of anchoring, and was the main motivation for the next development, Paper III. In sum, the clear advantage of using a KB that contains commonsense information is that we can exploit this information to infer things that were not directly asserted perceptually. This information can also be useful when queries about functions and properties of objects are made. Due to the expressiveness of Cyc we could differentiate perceptual and epistemic knowledge, while keeping both coherent. Using linguistic interaction we could study how the anchors were stored, while probing the symbolic content. As described in the previous section (§ 4.1), the use of spatial relations contributed greatly towards the disambiguation between objects.

4.2.4 Contributions

M. Daoutis, S. Coradeschi, A. Loutfi. "Grounding commonsense knowledge in intelligent systems". Journal of Ambient Intelligence and Smart Environments (JAISE) 1, pp. 311–321, 2009.

This article was an invited submission extending a previous version (Paper V [43], not included in the thesis), which had received the best conference paper award of Intelligent Environments 2009. The author of this thesis specified, designed and implemented the framework described in the papers mentioned above. He also devised and performed the evaluation, and prepared and revised both manuscripts.
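The advantage noted in the evaluation above, inferring facts that were never asserted perceptually, can be illustrated with a minimal sketch. The hand-written generalisation hierarchy below is a toy stand-in for Cyc's genls taxonomy, not an excerpt of the actual KB.

```python
# Toy sketch of commonsense inference over a generalisation hierarchy:
# perception only asserts the category of an object, yet more general
# facts follow from the taxonomy. The GENL table is illustrative only.

GENL = {  # "genls" links: concept -> its more general concept
    "Cup": "DrinkingVessel",
    "DrinkingVessel": "Container",
    "Container": "ManMadeThing",
}

def isa(category, concept):
    """True if `category` specialises `concept`, following genls links."""
    current = category
    while current is not None:
        if current == concept:
            return True
        current = GENL.get(current)
    return False

# Only "Cup" was asserted perceptually; "ManMadeThing" is inferred.
print(isa("Cup", "ManMadeThing"))
print(isa("Cup", "Kitchen"))
```

This is the mechanism, in miniature, by which instantiating an anchored "cup" into the KB lets the system answer queries about properties that were never observed directly.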
4.3 Cooperative Anchoring using Semantic & Perceptual Knowledge

In Paper III we propose a model for semantic cooperative perceptual anchoring, with which we intend to capture the underlying processes for systematically transforming collectively acquired perceptual information into semantically expressed commonsense knowledge, across a number of different agents. We enable the different robotic systems to work cooperatively, acquiring perceptual information and contributing to the collective semantic KB. We adopt the notion of local and global anchoring spaces from [86]. The different perceptual agents maintain their local anchoring spaces, while the global anchoring space is coordinated by a centralised anchoring component which also hosts the commonsense KB. The reason for the centralised module is the computational complexity of the commonsense KB.

4.3.1 Semantic integration

This integration, which relies on commonsense information, is capable of supporting better (closer to linguistic) human-robot communication and generally improves the knowledge and reasoning abilities of the agent. It provides an abstract semantic layer that forms the basis for establishing a common language and semantics (i.e. an ontology) to be used among robotic agents and humans. Semantic integration takes place at two levels, in accordance with the anchoring framework. We have developed a local KB which is part of the anchoring process of every agent. Our solution involves a local knowledge representation layer which aims at a higher degree of autonomy and portability.
More specifically, it: a) is a lightweight component; b) adheres to the formally defined semantics of the global KB; c) provides simple and fast inference; d) allows communication with other components (e.g., the global anchoring agent, Cyc or the Semantic Web); and e) is based on the same principles as our core knowledge representation system (Cyc). Each local KB is a subset of a global KB, which is managed by the global anchoring agent that coordinates the different local anchoring processes. The global knowledge representation is used as the central conceptual database and a) defines the context of perception for the perceptual agents; b) is also based on formally defined semantics; c) provides more complex reasoning; and d) contains the full spectrum of knowledge of the KB. This duality of knowledge representation systems is due to the fact that we want to enable semantic knowledge and reasoning capabilities on each perceptual agent, yet cannot (yet) afford to embed a complete commonsense knowledge-base instance on every agent, owing to the limited processing power of each perceptual agent. The anchoring process is tightly interconnected with commonsense knowledge about the world. As in the previous paper (Paper II), the commonsense KB we adopt in this work for knowledge representation is Cyc.

4.3.2 Implementation

We have implemented our approach using the PEIS distributed framework [140], with the application domain being a smart home for the elderly. The implementation involved the mobile robot, which is capable of perceiving the world through different typical heterogeneous sensors and acts as the mobile anchoring agent. An ambient anchoring agent comprised sensors and actuators embedded in the environment, such as observation cameras, localisation systems and RFID tag readers. Finally, a computational unit, a high-end desktop computer, undertakes all the computationally demanding tasks related to the commonsense knowledge base (Cyc). The evaluation addressed how well the system performs when recognising, anchoring and maintaining in time the anchors referring to the objects in the environment. We tested several scenarios; in particular, we considered the case in which the knowledge from the perceptual anchoring agents populating the Cyc knowledge-base was used to enable inference upon the instantiated perceptual concepts and sentences. It was possible to use an object's category, colour, spatial or topological relations, along with other grounded perceptual information, to 1) refer to objects; 2) query about objects; and 3) infer new knowledge about objects. In addition, we considered the use of commonsense knowledge in different contexts, through the multi-context inference capabilities of the Cyc system. In order to monitor and allow interaction with the anchoring framework, we developed an interface which provides three ways of accessing information. It provides a primitive natural language interface for interacting with the different agents and the KB, displays the state of the system, and provides runtime information about the performance of anchoring. The main component and observation feature of the tool is the visualisation of the 3D augmented virtual environment: a scene of the environment rendered in a 3D engine which encapsulates every non-perceptual object in the environment, such as the surrounding walls. Information acquired from the global anchoring space was then used to virtually reconstruct the perception of each agent in the rendered scene-graph. Finally, we have shown how the different perceptual agents can communicate using logical sentences, in order to exchange information with each other in a cooperative manner.
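The use of grounded properties and relations to refer to and query about objects could be sketched as follows. The anchor records, attribute names and relation vocabulary are hypothetical stand-ins for the grounded predicates maintained by the framework, chosen only to make the idea concrete.

```python
# Illustrative sketch: resolving a symbolic description such as
# "the blue cup on the table" against grounded anchors and relations.
# All names below are hypothetical, not the thesis implementation.

anchors = [
    {"id": "cup-1",   "category": "Cup",   "colour": "red"},
    {"id": "cup-2",   "category": "Cup",   "colour": "blue"},
    {"id": "table-1", "category": "Table", "colour": "brown"},
]
relations = {("cup-2", "on", "table-1"), ("cup-1", "leftOf", "cup-2")}

def refer(category=None, colour=None, rel=None):
    """Return ids of anchors matching the given grounded description."""
    result = []
    for a in anchors:
        if category and a["category"] != category:
            continue
        if colour and a["colour"] != colour:
            continue
        # rel is a (relation, landmark-id) pair constraining the anchor.
        if rel and (a["id"], rel[0], rel[1]) not in relations:
            continue
        result.append(a["id"])
    return result

# "the blue cup on the table" resolves to a single anchor:
print(refer(category="Cup", colour="blue", rel=("on", "table-1")))  # ['cup-2']
```

In the actual system such queries are posed in CycL and answered by the KB; the sketch only shows how combining category, colour and a spatial relation narrows the reference down to one anchor.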
We focused on the locally maintained knowledge of each perceptual agent in the context of querying other available agents, to complement the limited perceptual knowledge and finally infer the answers to the queries.

4.3.3 Contributions

M. Daoutis, S. Coradeschi, A. Loutfi. "Cooperative Knowledge Based Perceptual Anchoring". Int. Journal on Artificial Intelligence Tools (IJAIT) 21, p. 1250012, 2012.

The author of this thesis was responsible for the specification, design and implementation of the proposed framework described in this article. He devised and performed the evaluation, and prepared and revised the manuscript.

4.4 Towards the WWW and Conceptual Anchoring

In Paper IV we consider that autonomous robots which interact with humans require open-ended systems where domain knowledge can be dynamically acquired and is not necessarily defined a-priori. This aspect is related to the previous paper (Paper III), where we identified that modelling the semantic and commonsense knowledge behind an intended query required considerable effort and knowledge engineering skills, which end-users often lack. However, the World Wide Web (WWW) is a viable source of knowledge that satisfies the requirement of being available on demand while evolving continuously with contributions from human users. Hence, we explored the possibility of integrating knowledge coming from a conceptual (non-physical) environment, such as the WWW, into the anchoring process. This was achieved through an augmentation of the anchoring framework called Conceptual Anchoring.

4.4.1 Representation

Acquired data from a non-physical information source, specifically the web, were used to form representations (conceptual anchors) of concepts that might not necessarily exist in the agent's real environment.
Via the conceptual anchoring space, the agent was able to a) express concepts in terms of their associated (eventual) percepts and symbolic knowledge; and b) use the information from the web to guide the learning of novel perceptual instances (physical objects). In the context of conceptual anchoring, the percepts produced from on-line resources were used for the construction of perceptual signatures. These were then grounded to their corresponding semantic descriptions and tangible semantic knowledge from the commonsense KB, in a similar way as in perceptual anchoring. The multi-modal data structure (conceptual anchor) holds the perceptual data, semantic descriptions and a unique identifier for one conceptual object. Since the conceptual percepts do not necessarily correspond to physical entities in the agent's environment, implicitly the conceptual anchor does not correspond to a specific physical object (perceptual instance), but rather to the generalised conceptual appearance of, and implicit knowledge about, the general concept. For instance, the conceptual anchor of the concept "Cup" links the multiple visual percepts of cup images and their features with the corresponding grounded visual concepts, such as the colours or shapes of cups, and with the tangible semantic knowledge about cups, e.g.: Cup is a specialisation of DrinkingVessel, inheriting all the implicit relations that it is also an Artifact, a SpatiallyDisjointObject, a Container, and so on. Since both conceptual and perceptual representations are expressed as anchors, the previously developed anchoring functionalities (Papers II & III) are able to handle both representations in the same way, especially regarding matching.
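The multi-modal structure of a conceptual anchor described above could be sketched as follows. The field names and the toy feature vector are illustrative assumptions, not the actual representation used in the thesis.

```python
# Hedged sketch of a conceptual anchor: a unique identifier tied to
# several percepts gathered from the web and to grounded semantic
# descriptions. Field names and values are hypothetical.

from dataclasses import dataclass, field

@dataclass
class ConceptualAnchor:
    identifier: str                               # e.g. the concept name "Cup"
    percepts: list = field(default_factory=list)  # feature vectors from web images
    semantics: set = field(default_factory=set)   # grounded symbolic descriptions

cup = ConceptualAnchor("Cup")
cup.percepts.append([0.12, 0.80, 0.33])           # one (toy) visual signature
cup.semantics.update({"DrinkingVessel", "Artifact", "Container"})

# Unlike a perceptual anchor, no single physical object is referenced:
# the structure summarises the generalised appearance and knowledge of "Cup".
print(cup.identifier, len(cup.percepts), sorted(cup.semantics))
```

Because the structure mirrors that of a perceptual anchor, the existing matching machinery can compare a conceptual anchor's percepts against newly observed physical objects, which is what allows it to act as a perceptual template.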
4.4.2 Preliminary results

Conceptual Anchoring is a system built on top of the knowledge-based perceptual anchoring architecture, and mainly a) provides an interface to acquire perceptual and semantic knowledge from the web, while b) maintaining in the anchoring space the conceptual anchors which store the conceptual models. By acquiring conceptual anchors, which can be thought of as "meta-anchors", we were able to use the multi-modal conceptual representations as perceptual templates which guided the detection and learning of existing physical objects (perceptual instances) in the environment, without having previously trained the robot to recognise each instance individually (a tedious and highly technical process). So far the notion of conceptual anchors has been tested in a proof-of-concept scenario, and the focus of the evaluation has been on the ability to retrieve conceptual anchors from the web using keywords. While conceptual anchoring is the most recent development in the anchoring framework, it still needs to be further enhanced, as we believe that relying more on dynamic information from the WWW will be the dominant trend in the field.

4.4.3 Contributions

M. Daoutis, S. Coradeschi, A. Loutfi. "Towards concept anchoring for cognitive robots". Intelligent Service Robotics 5, pp. 213–228, 2012.

The author of this thesis was responsible for the specification, design and implementation of the proposed framework described in this article. He devised and performed the evaluation, and prepared and revised the manuscript.
"Pure a priori concepts, if such exist, cannot indeed contain anything empirical, yet, none the less, they can serve solely as a priori conditions of a possible experience."

Immanuel Kant (1724–1804)

5 Discussion

5.1 Research Outlook

Successful communication among robots, human users and devices embedded in the environment involves diverse items of information, such as perceptual data from camera sensors, symbolic data, or natural language phrases from a human. In particular, when humans interact with symbiotic robotic systems, their communication mainly involves objects, their properties and functions. If these objects are described in symbolic form (e.g., "blue cup on the table"), then the link between the sensor data corresponding to the physical object and the symbolic description has to be established. In cognitive robotics the process of creating this link is studied in the context of perceptual anchoring [32], which is a subset of the general symbol grounding problem [63]. This thesis investigated the problem of anchoring, and specifically the design and implementation of a cognitive computational model for artificial perceiving systems. The main goal was to establish and maintain in time the correspondences between conceptual knowledge and perceptual information about physical objects, when humans and robots interact. The cognitive agent needs to be able to acquire perceptual information about objects from the physical world, while associating them with the corresponding commonsense knowledge, both about the object per se and the specific knowledge relevant to the task at hand. We considered a key aspect of this challenge, which concerns the flexibility to adapt to changes in the environment and to dynamically acquire the new knowledge needed for the interaction with humans.
In order to enable human participation in the anchoring process, the model has to support natural forms of communication while simultaneously dealing with the challenging problem of deriving high-level descriptions of visual information from perceptual data. In particular, we focused on the natural interaction and common knowledge that pertain to objects used in everyday scenarios. In the following section I discuss some issues concerning the previously described development of anchoring, while summarising the goals of the current investigation.

5.2 Critical view on the original framework

The anchoring framework proposed by Coradeschi and Saffiotti [32], described in Chap. 3, provides a simple yet concrete theoretical point of view for addressing the physical symbol grounding problem. However, when the model is applied to a cognitive agent such as a robot, there are a number of issues which the original anchoring framework does not at first appear capable of addressing. The initial anchoring model was described in a very abstract and simplistic way, considering only a top-down approach. An issue introduced later in the development of anchoring concerned the problem of two-way information acquisition [100], especially in the presence of different modalities. This has been only loosely defined; however, it is an important aspect, since humans typically use both top-down and bottom-up information to solve problems, in a kind of signal-symbol fusion. Previous research has shown that our visual pathways are not unidirectional either; rather, there are also feedback signals [16]. Therefore both top-down and bottom-up interrelations should be clearly defined in the cognitive model and, implicitly, in the anchoring framework.
Since the introduction of other (non-standard) modalities, such as olfaction [100], the model has needed a general and formal representation that accommodates and handles multiple perceptual modalities as they become available to the cognitive agent. The symbolic part of the framework, on the other hand, consisted merely of symbolic labels which identified anchored objects. Almost no additional symbolic information about "the world" was incorporated or exploited in the anchoring process. Furthermore, the problem of grounding symbolic properties has also been out of focus. Since there is no definition of the nature of the symbol and perceptual systems, implicitly the representation of the anchor was vaguely defined. This is problematic when modelling the matching and tracking functionalities, especially in the presence of multiple modalities. It is also an issue when modelling the similarity of anchors, not only in their perceptual part but also the semantic one. Clearly, semantics are not defined anywhere in the anchor, neither in the perceptual system nor in the symbol system. This is another quite important aspect, also related to the anchor's conceptual representation mentioned above. Perceptual semantics allow reasoning about percepts, while conceptual semantics enable reasoning on the conceptual level. As we saw in the related work (§ 2.7.3), anchoring also has a cooperative aspect, where work led by LeBlanc and Saffiotti demonstrates an augmentation of the anchoring framework which addresses the percept-symbol correspondences between different robots. This was done explicitly in the context of collectively acquiring and sharing perceptual information, thus not emphasising the symbolic and semantic aspects [86]. Finally, we observe that pre-conceptual knowledge has not been defined, and hence not used anywhere in the anchoring process.
This is especially relevant since it has been argued that semantic interpretations are not contained solely in an image, a feature or a percept, but depend upon a-priori semantic and contextual knowledge (acquired from experience).

5.3 Summary of the Research

The aim of this investigation was to address the aspects mentioned in the previous section (§ 5.2), from a theoretical as well as a methodological point of view. The main goal of the thesis was to extend the notion and theories of perceptual anchoring so as to accommodate knowledge representation and reasoning techniques, in order to establish a common language (i.e. an ontology) and semantics to be used among heterogeneous robotic agents and humans, regarding physical objects, their properties and relations. We address the challenge of grounding perceptual information to symbolic information drawn from the conceptual information included in a commonsense knowledge-base. The key aspect is to use the resulting grounded semantic knowledge to improve, in part, the linguistic communication with human users, but also the exchange and reuse of semantic knowledge between the different components (intelligent agents, robots, pervasive devices) of the system. The first step of the investigation was to extend the work presented in [114], where a simple ontology inspired by DOLCE was connected to the anchoring module. The integration was done using the LOOM KR&R system. We evaluated the system using topological and spatial relations as well as multiple modalities. We then showed how the proposed integrated system can handle uncertainty while resolving the ambiguities which arise during the interaction, by gathering perceptual information. Finally, the use of symbolic knowledge enabled high-level reasoning about objects and their properties, thus allowing more expressive queries about the anchored objects. The next study explored the integration of a large-scale deterministic commonsense KB with the perceptual anchoring system.
We specified, designed and implemented a framework which grounded perceptual data, coming from a number of sensors distributed in the environment, to commonsense semantic descriptions extracted from the KB (Cyc). The study involved modelling perceptual algorithms, using mainly computer vision techniques (e.g., SIFT), for the construction of percepts. Then, a series of grounding algorithms captured object properties and relations from the visual and spatial modalities (i.e. object category, colour, location, spatial relations, topological localisation and visibility). The anchoring framework managed all this information over time, while in the KB all the grounded objects were instantiated into concepts. The use of Cyc offered advantages, as less time was spent on ontology engineering and on adding new knowledge to the KB. A further advantage was the exploitation of commonsense knowledge in order to infer things that could not be inferred from the symbolic descriptions on their own. The grounded information was evaluated through linguistic interaction, via a simple natural language interface facilitating a dialogue with the human user. We examined scenarios which included queries about perceptual information and properties, qualitative spatial relations and commonsense reasoning. In conclusion, we find that all three aspects: a) a robust perceptual system; b) anchoring; and c) a substantial KB, were equally necessary and important in the process of grounding linguistic queries in observations in a dynamically changing environment. The purpose of the next study was to extend the initially developed system to account for cooperation in sharing and reusing grounded knowledge among different perceptual agents, while still maintaining the commonsense knowledge aspect.
We presented the computational framework behind symbolic cooperative anchoring, which considers both top-down and bottom-up information acquisition in the anchoring management process. The focus was on the visual and spatial perceptual modalities, for which we presented the details behind the corresponding functions. We evaluated the developed system in an intelligent home scenario, where two agents (one mobile robot and one static ambient intelligent agent) cooperatively acquired complementary perceptual information from the environment so as to populate the commonsense KB. We initially reported results on the performance of the system, followed by scenarios regarding information exchange between the perceptual agents, which allowed them to compensate for incomplete knowledge. We finally presented ways in which information exchange, together with commonsense knowledge, was used for different communication tasks. A promising direction emerged towards the final stages of the work: to utilise knowledge from the web so as to account for incomplete perceptual information during anchoring. In this investigation we presented an integrated system that combined the perceptual and conceptual knowledge acquired from the web with the a-priori commonsense knowledge maintained in the perceptual semantic knowledge-base. Although in previous investigations we had considered both top-down and bottom-up functions, in practice the system was not able to ground arbitrary (concrete) concepts to their corresponding percepts. Specifically, in this study we considered how a cognitive agent: a) acquires novel concepts; b) associates these concepts with preconceived information; and c) reasons about this knowledge. In the implemented augmentation, which we term conceptual anchoring, we used a measure of similarity to define the relations between anchors regarding concepts and instances.
This measure also served as a way to account for any eventual uncertainty arising in the perceptual system and hence propagating into the anchoring framework.

5.4 Concluding Remarks & Considerations

As we approach the end of this research, we clearly see that feature lists alone do not appear capable of describing everything we need to know about concepts. Percepts seem to be specialised and concrete, while concepts are abstract and general. Certainly, percept-based thought alone does not provide sufficient abstraction and richness to deal with the increased complexity of concepts' meanings, which appear to have a dynamic component that directly relates to the reasoning processes (that generate understanding). Returning to the hypothesis posed at the beginning of this thesis, we find that knowledge representation and reasoning techniques do indeed have the potential to bring meaning to artificial cognitive agents, such as robots. More specifically, through the studies conducted we conclude that, in addition to facilitating queries from human users, KR&R can also be used to reason about objects, including their properties and functionalities. For example, spatial reasoning can be used to disambiguate between visually similar objects, while symbolic commonsense information supports high-level reasoning about commonly shared assumptions. The thesis successfully demonstrated how a knowledge representation system could be integrated with, and utilised in, an anchoring framework. From the findings that emerged during the different studies we find that the KR&R system, and especially a-priori commonsense knowledge, contributed greatly to the capabilities of the system, for example by enabling high-level reasoning. The integration of this knowledge with physical perceptual information opens new horizons with respect to the communication and, implicitly, the interaction with a human user.
In order to explore human-machine interaction, we adopted an ontology which captured both concrete and abstract concepts and their relationships. Moreover, the ontology was used when expressing queries describing image content, such that only the relevant image aspects of value were evaluated. This was particularly important in the context of decidability, as the evaluation converged as soon as the appropriate top-level symbol sequence had been generated. Even though we used complex ontologies and knowledge representation in the physical perceptual agents, great effort from experts is still needed to train the systems both perceptually and semantically. The construction of ontologies originates from linguistic words. However, the word is more than a symbol or a sign that represents a "thing" or a concept. On a theoretical level, language is not just a tool to communicate, but can be thought of as an information processing system which allows humans to conceptualise, to think abstractly and symbolically. More importantly, it is the medium that permits the transition from percept-based to concept-based thought. Intuitively, it seems that complex cognitive systems (humans as well as cognitive robots) do something more than just rewriting symbols: they use grounded concepts to reason about the world. The ability to form concepts is an important and recognised cognitive ability, thought to play an essential role in other related abilities such as categorisation, language understanding, object identification, recognition and reasoning, all of which can be seen as different aspects of intelligence. We find that many of the related approaches tend to focus mostly on the specific individual aspects mentioned above, and often disregard the symbol grounding problem. Through the studies conducted in this thesis we acknowledge that all the aspects mentioned above are equally important, including the grounding problem.
By specifying the representation of the anchor, which can be thought of as a conceptual structure that establishes the meaning of the intended object, we simultaneously support perceptual and semantic reasoning, but foremost grounding. In the context of symbol grounding and its representation, there is a tendency among various approaches to adopt Gärdenfors' conceptual spaces. However, in our studies we did not adopt this paradigm, as it did not seem to offer any advantage over the standard anchor representation. This does not merely mean that there exist alternatives (in our case, sets) to the geometric metaphor of conceptual representations. It also reveals a deeper question: which representation is more appropriate for grounding? In summary, the thesis had a significant impact on three fronts: a) applicative, by enabling the development of cognitive robots where knowledge and perception are combined; b) theoretical, by extending the theoretical foundations of anchoring; and c) methodological, by providing a methodology for the acquisition of perceptual knowledge and the exploitation of semantic knowledge and high-level reasoning.

5.5 Recommendations for Future Work

During this research several shortcomings emerged which require further investigation. The developed system appears quite limited regarding the explicit perceptual training of the object categories and instances to be recognised. In addition, the development of grounding relations required a great amount of interdisciplinary, highly specialised work during the implementation stages. Furthermore, reasoning was constrained to objects pertaining to a particular domain (anchoring), while lacking complex dynamic knowledge. Nevertheless, the investigation behind this thesis has laid the initial groundwork towards dynamic knowledge acquisition.
In order to realise a grounding framework with such a dynamic nature, one that can fully integrate perception and knowledge on demand, it is necessary from a theoretical point of view to relax an important assumption in current AI systems, namely the closed-world assumption. In addition, the decision to create a new instance of an object in the knowledge base, or to assign the perceptual information to a known object, is currently based mainly on information about appearance and location. A future augmentation of a grounding architecture may rely on gathering resources from the World Wide Web in order to overcome the problems stated above. One key aspect will be to leverage the fact that general knowledge about objects is already available on the internet, which to some extent is also related to physical perceptual information. The focus of such a cognitive architecture will be on evolving conceptual anchoring, biased towards two characteristics: advanced visual perception and knowledge formation, so as to autonomously form and learn new concepts on demand. At the same time it is quite important to evaluate the scalability of such a system, where the number of objects increases arbitrarily and several agents cooperatively anchor and communicate information regarding a large number of objects. The anchoring process already collects information from the sensors and the vision system, grounds the percepts to the corresponding symbolic knowledge, and asserts this knowledge about object instances in the commonsense KB (Cyc). In the future we intend to extend the anchoring process from two sides simultaneously. The first regards on-demand acquisition of general knowledge from the internet, when the establishment of new general concepts / categories is required for interaction with humans or reasoning. Progress towards this goal is described in the final paper (Paper IV).
The second focuses on the continuous acquisition and updating of specific information about objects of interest, via activity recognition and causal reasoning. This implies that we need to integrate tools which examine not only how objects are defined but also how objects are used in general (e.g., cups are used for drinking), in a specific domain, and in the context of specific activities. Knowledge about object use, fused with activities, will add to the reasoning capabilities of the system. In sum, future work will deal with the following:

⊲ To investigate methods for acquiring structured perceptual and semantic knowledge from the internet, providing the knowledge needed to learn new concepts on demand.

⊲ To anchor this knowledge to physical entities in the environment, such as objects and places, building upon the existing anchoring framework.

⊲ To acquire more context-specific knowledge related to the environment and objects via activity recognition.

⊲ To validate the system via meaningful interactions with humans, and to evaluate its performance, usability and acceptability with potential users.