Doctoral Dissertation

Knowledge Based Perceptual Anchoring:
Grounding percepts to concepts in cognitive robots

Marios Daoutis

Computer Science
Örebro Studies in Technology 55
Örebro 2013
© Marios Daoutis, 2013
Title: Knowledge Based Perceptual Anchoring: Grounding percepts to
concepts in cognitive robots
Publisher: Örebro University, 2013
www.publications.oru.se
[email protected]
Printer: Örebro University, Repro 12/2012
Typeset with LaTeX.
ISBN 978-91-7668-912-7
ISSN 1650-8580
Statement of Original Authorship
This thesis is submitted to the School of Science and Technology, University of
Örebro, in fulfilment of the requirements for the degree of Doctor of Philosophy
in Computer Science. I certify that I am the sole author of this thesis and the
work contained herein is entirely my own work, describes my own research and
has not been previously submitted to meet requirements for an award at this
or any other higher education institution. To the best of my knowledge and
belief, the thesis contains no ideas or material of others, previously published
or written by another person except where due reference is made. I acknowledge
that I have read and understood the University’s rules, requirements,
procedures and policy related to my higher degree research award and to the
present thesis. I further certify that I have complied with the Good Research
Practice guide [1] of the Swedish Research Council (Vetenskapsrådet).
Marios Daoutis
Örebro, April 2013
[1] B. Gustafsson, G. Hermerén, and B. Petersson. Good research practice – what is that?
Swedish Research Council, 2005.
© 2013
All rights reserved
Abstract
A successful artificial cognitive agent needs to integrate its perception of the
environment with reasoning and actuation. A key aspect of this integration
is the perceptual-symbolic correspondence, which intends to give meaning to
the concepts the agent refers to – known as Anchoring. However, perceptual
representations alone (e.g., feature lists) cannot provide the abstraction
and richness needed to deal with the complex nature of the concepts’
meanings. On the other hand, plain symbol manipulation does not appear
capable of attributing the desired intrinsic meaning either.
We approach this integration in the context of cognitive robots which operate
in the physical world. Specifically we investigate the challenge of establishing the
connection between percepts and concepts referring to objects, their relations and
properties. We examine how knowledge representation can be used together with
an anchoring framework, so as to complement the meaning of percepts while
supporting linguistic interaction. This implies that robots need to represent
both their perceptual and semantic knowledge, which is often expressed at
different abstraction levels and may originate from different modalities.
The solution proposed in this thesis concerns the specification, design and
implementation of a hybrid cognitive computational model, which extends a
classical anchoring framework, in order to address the creation and maintenance
of the perceptual-symbolic correspondences. The model is based on four main
aspects: (a) robust perception, by relying on state-of-the-art techniques from
computer vision and mobile robot localisation; (b) symbol grounding, using top-down and bottom-up information acquisition processes as well as multi-modal
representations; (c) knowledge representation and reasoning techniques in order
to establish a common language and semantics regarding physical objects, their
properties and relations, that are to be used between heterogeneous robotic
agents and humans; and (d) commonsense information in order to enable
high-level reasoning as well as to enhance the semantic descriptions of objects.
The resulting system and the proposed integration have the potential to
strengthen and expand the knowledge of a cognitive robot. Specifically, by
providing more robust percepts it is possible to cope better with the ambiguity
and uncertainty of the perceptual data. In addition, the framework is able
to exploit mutual interaction between different levels of representation while
integrating different sources of information. By modelling and using semantic
& perceptual knowledge, the robot can: acquire, exchange and reason formally
about concepts, while prior knowledge can become a cognitive bias in the
acquisition of novel concepts.
Keywords: anchoring, knowledge representation, cognitive perception,
symbol grounding, common-sense information.
Acknowledgements
First I would like to express my gratitude to Prof. Alessandro Saffiotti, who
offered me the opportunity to conduct my PhD studies at the Cognitive Robotic
Systems Lab in the Centre for Applied Autonomous Sensor Systems Research
(AASS) at Örebro University.
However, I am even more thankful to my supervisors, Prof. Silvia Coradeschi
and Dr. Amy Loutfi, for teaching me how to conduct scientific research, but
most importantly for letting me become a goal-oriented and independent
researcher.
A special thanks goes to Dr. Mathias Broxvall for his valuable support and fruitful discussions, but most importantly for being there when it was really needed.
His decisive intervention gave a surprising turn to the development of
this thesis.
Of course this thesis would not have been possible without the help of many
other people. I would like to thank my colleagues Sahar Asadi, Abdelbaki
Bouguerra and Kevin LeBlanc, because on various occasions they either
showed me a way out of a problem or sparked eureka moments during
constructive conversations.
There is one person to whom I am deeply indebted, Athanasia Louloudi,
for standing by me both in the tough as well as joyful moments.
Finally I would like to thank my parents and sister for their support during
my studies.
This work was supported by Vetenskapsrådet (VR) for the project titled “Anchoring in Symbiotic Robotic System” and the Centre for Applied Autonomous
Sensor Systems (AASS).
List of Publications
The contents and results of this thesis are partially reported in several refereed
papers, some of which are appended to this thesis. The author of this thesis is
also the main author of the following publications (except for Paper I). The
papers are referred to throughout the thesis by their Roman numerals, as they
are presented below.
Publications included in the thesis
Journal Articles
✷ Paper I A. Loutfi, S. Coradeschi, M. Daoutis, and J. Melchert. “Using
Knowledge Representation for Perceptual Anchoring in a Robotic System”.
Int. Journal on Artificial Intelligence Tools (IJAIT) 17, pp. 925–944, 2008.
✷ Paper II M. Daoutis, S. Coradeschi, A. Loutfi. “Grounding commonsense
knowledge in intelligent systems”. Journal of Ambient Intelligence and
Smart Environments (JAISE) 1, pp. 311–321, 2009.
✷ Paper III M. Daoutis, S. Coradeschi, A. Loutfi. “Cooperative Knowledge
Based Perceptual Anchoring”. Int. Journal on Artificial Intelligence Tools
(IJAIT) 21, p. 1250012, 2012.
✷ Paper IV M. Daoutis, S. Coradeschi, A. Loutfi. “Towards concept anchoring for cognitive robots”. Intelligent Service Robotics 5, pp. 213–228,
2012.
Other publications from the author
Conference Paper
✷ Paper V M. Daoutis, S. Coradeschi, A. Loutfi. “Integrating Common Sense
in Physically Embedded Intelligent Systems.” In: Intelligent Environments
2009. Ed. by V. Callaghan, A. Kameas, A. Reyes, D. Royo, et al. Vol. 2.
Ambient Intelligence and Smart Environments. IE’09 Best Conference
Paper Award. IOS Press, 2009. Pp. 212–219
Book Chapter
✷ Paper VI M. Daoutis, A. Loutfi, S. Coradeschi. “Knowledge Representation
for Anchoring Symbolic Concepts to Perceptual Data”. In: Bridges between
the Methodological and Practical Work of the Robotics and Cognitive
Systems Communities – From Sensors to Concepts. Ed. by T. Amirat,
A. Chibani, and G. P. Zarri. Vol. 21. Springer Publishing (in press), 2012.
Contents

Abstract i

Acknowledgements iii

List of Publications v

Contents vii

1 Introduction 1
   1.1 Cognitive robotics 1
   1.2 Context of the research 2
       1.2.1 The symbol grounding problem 3
       1.2.2 Anchoring Symbols to Sensor Data 3
   1.3 Motivation 4
       1.3.1 Uncertainty and Ambiguity 5
       1.3.2 Binding 6
       1.3.3 Representation 6
       1.3.4 Perceptual & Cognitive Bias 7
   1.4 Contributions 7
   1.5 Outline 9
       1.5.1 Part A: Grounding percepts to concepts in cognitive robots 9
       1.5.2 Part B: Publications 10

A Grounding percepts to concepts in cognitive robots

2 Background & Related Work 15
   2.1 Computational models of cognition 15
       2.1.1 Sub-symbolic modelling 15
       2.1.2 Symbolic modelling 16
       2.1.3 Hybrid Modelling 16
       2.1.4 Intelligent agent & the environment 17
   2.2 Perceptual Systems 18
       2.2.1 Visual Perception 18
       2.2.2 Computer Vision for cognitive perception 21
       2.2.3 Spatial Perception 21
   2.3 Symbol Systems 21
       2.3.1 Models for Knowledge Representation 22
       2.3.2 Challenges related to KR&R 23
   2.4 Commonsense Information 24
   2.5 Perceptual-symbolic integration and the Symbol Grounding Problem 26
       2.5.1 Harnad’s Symbol Grounding Problem and solution 26
       2.5.2 Physical Symbol Grounding 28
       2.5.3 Social Symbol Grounding 30
   2.6 Foundations of Perceptual Anchoring 30
   2.7 Perceptual Anchoring in Cognitive Systems 33
       2.7.1 Knowledge Based approaches to Anchoring 33
       2.7.2 Anchoring with commonsense information 35
       2.7.3 Cooperative Anchoring 36
       2.7.4 Anchoring for Human-Robot Interaction 37
       2.7.5 Other approaches to Anchoring 39
       2.7.6 Summary of Related Approaches 40

3 Methods 41
   3.1 Introduction 41
   3.2 Anchoring Model 42
       3.2.1 Components 42
       3.2.2 Functionalities 43
       3.2.3 Relation to Publications 45
   3.3 Sensors 46
       3.3.1 Laser Scanner 46
       3.3.2 Camera 46
   3.4 Mobile Robot Localisation 47
   3.5 Vision 49
       3.5.1 Image Representation 49
       3.5.2 Vector Quantisation 55
       3.5.3 Classification of Visual Features 56
   3.6 Semantic Knowledge 59
       3.6.1 Description Logics 60
       3.6.2 Inference Support 62
   3.7 Commonsense Knowledge Bases 63
       3.7.1 Cyc 64
       3.7.2 Open Mind Common Sense 65
       3.7.3 ConceptNet 66
       3.7.4 Other approaches 66
       3.7.5 Discussion 68
   3.8 Conclusions 69

4 Summary of Publications 71
   4.1 Anchoring with spatial relations 71
       4.1.1 Semantic modelling of spatial relations 72
       4.1.2 System validation 72
       4.1.3 Contributions 74
   4.2 Common Sense Knowledge and Anchoring 74
       4.2.1 Linguistic interaction 74
       4.2.2 Knowledge synchronisation 75
       4.2.3 Evaluation 75
       4.2.4 Contributions 76
   4.3 Cooperative Anchoring using Semantic & Perceptual Knowledge 76
       4.3.1 Semantic integration 77
       4.3.2 Implementation 77
       4.3.3 Contributions 78
   4.4 Towards the WWW and Conceptual Anchoring 79
       4.4.1 Representation 79
       4.4.2 Preliminary results 80
       4.4.3 Contributions 80

5 Discussion 81
   5.1 Research Outlook 81
   5.2 Critical view on the original framework 82
   5.3 Summary of the Research 83
   5.4 Concluding Remarks & Considerations 85
   5.5 Recommendations for Future Work 86

References 89

B Publications

I Using Knowledge Representation for Perceptual Anchoring in a Robotic System 101
   1 Introduction 102
   2 Perceptual Anchoring 103
       2.1 Creation and Maintenance of Anchors 105
   3 Knowledge Representation 106
   4 Description of the Framework 108
       4.1 Simple Inferences 108
       4.2 Spatial Relations 110
       4.3 Managing Object Properties 112
       4.4 Human-Robot Interaction (HRI) 114
   5 Experiments 115
       5.1 TestBed 115
       5.2 Case-Studies 116
   6 Related Work 121
   7 Future Work and Conclusions 122
   References 123

II Grounding Commonsense Knowledge in Intelligent Systems 125
   1 Introduction 126
   2 System Overview 128
       2.1 Perception 129
       2.2 Physical Symbol Grounding 130
       2.3 Management & Perceptual Memory 133
       2.4 Commonsense Knowledge Representation and Reasoning 134
       2.5 NL Communication 137
   3 Case Studies 137
       3.1 Perceptual Information & Properties 138
       3.2 Qualitative Spatial Relations 139
       3.3 Commonsense Reasoning 139
   4 Conclusion & Future Work 141
   References 141

III Cooperative Knowledge Based Perceptual Anchoring 143
   1 Introduction 144
       1.1 Symbol Grounding in Robotics 144
       1.2 Perceptual Anchoring 146
   2 The Anchoring Framework 147
       2.1 Computational Framework Description 147
       2.2 Global Perceptual Anchoring 150
       2.3 Anchor Management 152
   3 Knowledge Representation in Anchoring 155
       3.1 Cyc Knowledge Base 156
       3.2 Ontology 156
       3.3 Semantic Integration Overview 157
   4 Instantiation of the Perceptual Anchoring Framework 162
       4.1 Perception 163
       4.2 Grounding Relations 167
       4.3 User Interface & Natural Language Interaction 170
   5 Experimental Scenarios 173
       5.1 Perceptual Anchoring Framework Performance 175
       5.2 Knowledge Representation and Reasoning Evaluation 177
   6 Related Work 183
   7 Conclusion & Future Work 185
   References 186

IV Towards concept anchoring for cognitive robots 191
   1 Introduction 192
   2 Conceptual Anchoring Framework 194
       2.1 Model 195
       2.2 Anchoring 199
       2.3 Anchor Perceptual Indiscernibility Relation 200
       2.4 Bottom-up & Top-down information acquisition 201
   3 Implementation 203
   4 Semantic knowledge & commonsense information 203
       4.1 Cyc commonsense knowledge-base 204
       4.2 Ontology 205
       4.3 Perceptual-Semantic knowledge-base 206
   5 Conceptual Information Acquisition 206
       5.1 Linguistic & Semantic Search 207
       5.2 Perceptual learning 208
   6 Physical Perception 211
       6.1 Visual perception 211
       6.2 Grounding Relations 212
       6.3 User Interface & Natural Language Interaction 213
   7 Example Scenario 213
   8 Discussion 216
   References 217
“I believe that at the end of the
century the use of words and
the general educated opinion will
have altered so much that one
will be able to speak of machines
thinking without expecting to be
contradicted.”
Alan Turing (1912–1954)
1
Introduction
1.1 Cognitive robotics
A robot can be generally conceived as a programmable, self-controlled device,
which is able to exhibit intelligent behaviour and manipulate its environment.
Research in classical robotics focuses on the technology for combining
sensors, controllers and actuators via software and algorithms, for programmable
automation. Typical domains emphasise low-level sensing and control tasks,
such as sensor processing, path planning and manipulator control. Thus classical
robotics can be considered restricted to sensor-based reactivity.
Cognition, on the other hand, refers to the process of acquiring and using
knowledge (information) about the world for goal-oriented purposes. Cognitive
Robotics is a field, which combines results from the traditional robotics discipline,
with those from artificial intelligence (AI) and cognitive science, so as to design
artificial agents that deal with cognitive phenomena such as perception, memory,
reasoning or learning.
The general research focus of this thesis concerns endowing
robots with intelligent behaviour, via a uniform theoretical and methodological
framework. The aim is to allow robots to perceive, reason and interact in
changing, partially known and unpredictable environments (i.e. the physical
world).
The topic of this dissertation touches upon several diverse fields such as
perception and vision, knowledge representation and symbol grounding¹. This

¹ Symbol grounding denotes the grounding of symbols to their sensorimotor representations.
Figure 1.1: Research overview depicting the main disciplines and concepts involved in this research: perception (vision, localisation, olfaction and other modalities), perceptual anchoring of percepts to symbols, and knowledge representation (models and formulae supporting communication, planning, learning and inference).
thesis aims to be a step towards the integration of the fields mentioned above.
More in detail, this thesis examines how perceptual representations can be
grounded to their corresponding conceptual representations in the context of
cognitive robots.
This chapter introduces the background (§ 1.2) and motivation (§ 1.3) behind
this integration. The research questions and contributions of the thesis follow
in § 1.4. The thesis outline is summarised in § 1.5.
1.2 Context of the research
This research lies at the heart of hybrid cognitive modelling and concerns the
application of the intelligent agent paradigm to cognitive robots which operate
in the ‘real world’ (e.g., service robots). Autonomous cognitive robots are
often envisioned to perform difficult tasks in real environments, or to interact
effortlessly with humans. This view represents one aspect of the long-term
goal of AI and robotics, to create artificial systems that match and eventually
exceed human level competence in a multitude of tasks. In this setting, I
attempt to investigate the integration of perceptual information with semantic
and formally expressed commonsense knowledge, via a framework for symbol
grounding [63]. Fig. 1.1 depicts a high-level view of the three main disciplines
and the corresponding concepts used in this thesis.
As artificial systems progress steadily from sensor-based reactivity to
cognitive computational models that enable them to perceive, learn and reason,
the underlying models often have to deal with many open problems. In this
research we consider robots (or intelligent agents in general) interacting with
humans. Therefore, this thesis investigates a) how sensory and perceptual
information is acquired and grounded to symbols; b) how the meaning of these
symbols is represented; and c) how to express this meaning within a vast
amount of formally defined commonsense knowledge. Below I introduce the
symbol grounding [63] and perceptual anchoring [35] problems, which are two
of the main aspects of the thesis.
1.2.1 The symbol grounding problem
The Symbol Grounding Problem (SGP) is the problem of intrinsically linking
the symbols used by a cognitive agent to the corresponding meanings, through
grounding language in perception and action. This aspect appears to be central to
computational cognitive modelling. It is important to find ways to combine the
typically sub-symbolic perceptual representations with the symbolic conceptual
representations used by the different cognitive processes of an intelligent agent,
in order to establish the link between perceptual and conceptual knowledge.
A linguistic symbol representing some meaning needs to be grounded in how
the world is perceived and specifically, how the internal iconic and categorical
sensorimotor representations are grounded to symbolic representations that
represent the real-world phenomena to which they refer [63]. The SGP focuses
on how symbols are connected to their meanings, assuming that the cognitivist
symbols reside in the mind of the interpreter and always require the semantic
interpretation of an external observer. The meaning of “meaning” is itself
non-trivial; nevertheless, symbols should be tied to percepts in a way that is meaningful
to the agent itself, be it objects or features in the world that it can sense or
actions it can perform (affordances) [59].
Intuitively, information has to be represented both perceptually and
semantically, while describing concrete as well as abstract conceptual
entities. In an artificial cognitive agent this perceptual-semantic correspondence
essentially attempts to define the meaning of the concepts processed by the
agent. It further implies the presence of symbol and perceptual systems, where
the symbol grounding problem is then solved by grounding the sensorimotor and
perceptual representations to the symbolic concepts representing the meaning
of the perceptions.
In a multi-robot (or, in general, multi-agent) system, the problem becomes
more complex. Not only do symbols have to be grounded, but they also have to be
commonly shared by convention in order to allow the sharing of information.
This is related to the social grounding problem [171], another aspect of the SGP
and a fundamental aspect in the development of cognitive systems. A growing
amount of research in interactive intelligent systems and cognitive robotics
focuses on the close integration of language with other cognitive capabilities [2,
19, 128].
1.2.2 Anchoring Symbols to Sensor Data
A cognitive robot operating in the environment has to face simultaneously many
diverse practical aspects and hard challenges, such as the ones that accompany
perception and sensing, the frame–expressivity–decidability problems concerning logic and knowledge representation, as well as the well-known problem of
representing and reasoning with common sense.
On a technical level, some of the challenges mentioned above are investigated
under the scope of anchoring, where anchoring has the dual meaning: that of a
problem as well as that of a methodology. Coradeschi and Saffiotti formalised
the anchoring problem, which concerns “. . . the connection between abstract
and physical-level representations of objects in artificial autonomous systems
embedded in a physical environment” [38].
Vogt and Divina characterise this problem as a special case of a technical
aspect of symbol grounding, since it deals with grounding symbols to specific
sensory patterns [171]. However, anchoring studies the problem of creating and
maintaining in time the correspondences between symbols and sensor data that
refer specifically to one physical object. Implicitly, intelligent robotic systems
with a symbol system and a perceptual system need to solve the anchoring
problem so as to connect the information present in symbolic form with the
sensor data that the robot obtains from the physical world, both in single- and
cooperative (multi-)robot cases, as well as in multi-modal contexts.
On the other hand, anchoring can be seen as the integrative solution and
methodology [100] which mediates the link between the perceptual and sensing
abilities, and the representations and conceptual knowledge, manipulated by
the various cognitive abilities of an intelligent system. It can be described as
a hybrid cognitive model and representation schema which incorporates both
sub-symbolic and symbolic components together with information processing
mechanisms, for creating and maintaining the links between the high-level
semantic representations and the corresponding perceptual representations,
which are used by the various cognitive functions of an agent. In the following
section we present some motivations and theoretical aspects that relate to our
investigation of anchoring.
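The creation and maintenance of symbol–percept correspondences described above can be illustrated with a minimal sketch. Note that this is an illustrative simplification, not the thesis' implementation: the names `Percept`, `Anchor` and `AnchoringModule` are hypothetical, and a naive feature comparison stands in for a proper perceptual signature match in the functionalities that Coradeschi and Saffiotti's formalisation defines.

```python
from dataclasses import dataclass, field

@dataclass
class Percept:
    """Sensor-level description of one observed object (e.g. a vision region)."""
    features: dict  # e.g. {"colour": "white", "shape": "cylinder"}

@dataclass
class Anchor:
    """Links a symbol to the percepts of one physical object over time."""
    symbol: str
    percepts: list = field(default_factory=list)

    def matches(self, percept: Percept) -> bool:
        # Naive stand-in for a signature match: the last observed
        # feature values must all agree with the new percept.
        if not self.percepts:
            return True
        last = self.percepts[-1].features
        return all(percept.features.get(k) == v for k, v in last.items())

class AnchoringModule:
    def __init__(self):
        self.anchors: dict = {}

    def acquire(self, symbol: str, percept: Percept) -> Anchor:
        """Create a new anchor connecting a symbol to a first percept."""
        anchor = Anchor(symbol, [percept])
        self.anchors[symbol] = anchor
        return anchor

    def track(self, symbol: str, percept: Percept) -> bool:
        """Maintain an existing anchor with a new observation of the same object."""
        anchor = self.anchors.get(symbol)
        if anchor is not None and anchor.matches(percept):
            anchor.percepts.append(percept)
            return True
        return False
```

Here `acquire` corresponds to creating the correspondence and `track` to maintaining it in time; a fuller treatment would also need functionalities for finding and reacquiring objects, and probabilistic rather than exact matching.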
1.3 Motivation
This thesis investigates the problem of anchoring grounded interaction in cognitive
robots. To get a better understanding of how anchoring is involved even in
very simple scenarios, consider the example shown in Fig. 1.2. A cognitive
robot observes a novel object and speculates according to its prior perceptual
knowledge that the object in discussion might eventually be a “mug”. We can
easily identify already, that the robot needs to be able to sense and perceive
a complete and coherent representation of the sensed object, while binding
its features together, despite the existing ambiguity. Then, in an effort to
identify the novel object, it attempts to associate this new object with past
perceptual experiences – a form of cognitive bias. It appears quite trivial when
the human asserts that he owns this “coffee-cup”; however, this linguistic assertion
Figure 1.2: This scenario depicts the main focus of this thesis, which concerns anchoring grounded interaction. A mobile robot attempts to identify an unknown object, while a human linguistically asserts semantic information regarding this object. The end result is the corresponding knowledge (perceptual and semantic) which the robot asserts and manipulates in its “mind”.
presupposes the presence of related knowledge along with the corresponding
semantics. The role of this knowledge is not only to represent the domain of
discourse (common-sense), but also to support formal reasoning and inference.
Below we examine in more detail the underlying principles behind this motivational
example.
1.3.1 Uncertainty and Ambiguity
Information from the environment comes inherently with uncertainty, which
can be caused by sensor noise, limited observability, incomplete knowledge or by
false recognition of patterns. Moreover, in human communication we often meet
ambiguities, incomplete information or information that needs to be inferred.
Ambiguity, sometimes also referred to as “second-order uncertainty”, results in
uncertain definitions of uncertain states or outcomes. It has been argued that
ambiguity is always avoidable while uncertainty (of the “first-order” kind) is
not necessarily avoidable. For instance, consider that uncertainty might arise
from noise in the visual patterns as a result of sudden illumination changes or motion
blur. This uncertainty propagates through the vision system, thus increasing
the error recognising the colour or shape of one object. Hence, ambiguity is a
consequence of the increased error, when the agent might want to reason using
the colour of the specific object, since it’s assumption is based on an uncertain
state. In another form uncertainty may be purely a consequence of lack of
knowledge of obtainable facts, as it is only present in the human definitions
and concepts and is not an objective fact of nature. In fact, uncertainty can be
removed with further analysis and processing. The computational model has
to account for cases where processing can be used to resolve ambiguities in an
attempt to cope with the inherent uncertainty.
1.3.2 Binding
The binding problem [159] arises as soon as we have a system with two pieces
of information in different places that refer to the same entity and need to bind
together the information components into a unified whole. For example, when we
see a “blue sphere” and a “red cube”, binding (in the biological sense) instructs
the neural mechanisms to ensure that the sensation of “blue” is coupled with that
of “sphere” and that of “red” with that of “cube”. Binding combines
the different representations, properties and descriptions of an object, while
enabling the different sub-systems of the cognitive model to process information
about the object in a holistic context. It is suggested that incoming visual
(and other) data get allocated object labels in some way, such that “blue”
and “square” get tagged as “object number 1” [130]. In this form the binding
problem is also an issue of memory. How do we remember the associations
among different elements of an event (i.e. that the colour of the sphere turned
from blue to green)? How are these associations created and maintained? In
the feature integration theory, Treisman and Gelade suggested that binding
between features is mediated by the features’ links to a common location [160].
In a comprehensive discussion of binding’s many aspects, Treisman and Gelade
point out that objects and locations appear to be separately coded in the ventral
and dorsal pathways respectively, raising what may be the most basic binding
problem: linking what to where.
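As an illustration only (not part of the anchoring framework itself), the location-mediated binding suggested by feature integration theory can be sketched as follows: features reported independently by separate maps are grouped into object files when they share approximately the same location. All names, tolerances and data here are hypothetical.

```python
# A minimal sketch of location-mediated feature binding: features
# detected in separate maps are bound into one "object file" when they
# occur at (approximately) the same location.

def bind_features(detections, tol=1.0):
    """Group (feature, value, location) detections into object files
    by spatial proximity. All names here are illustrative."""
    objects = []  # each object file: {"location": (x, y), "features": {...}}
    for feature, value, loc in detections:
        for obj in objects:
            ox, oy = obj["location"]
            if abs(ox - loc[0]) <= tol and abs(oy - loc[1]) <= tol:
                obj["features"][feature] = value  # bind to existing object
                break
        else:
            objects.append({"location": loc, "features": {feature: value}})
    return objects

# Separate "colour" and "shape" maps report detections independently:
detections = [
    ("colour", "blue", (0.0, 0.0)), ("shape", "sphere", (0.2, 0.1)),
    ("colour", "red",  (5.0, 5.0)), ("shape", "cube",   (5.1, 4.9)),
]
objects = bind_features(detections)
# Two object files emerge: {blue, sphere} near (0,0) and {red, cube} near (5,5)
```

The shared location acts here as the “object label” discussed above; a real system would of course use richer association criteria than simple proximity.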
1.3.3 Representation
An intelligent robot operating in the physical world needs to combine information from its different sensors to form a coherent and more complete
representation of the environment. When binding perceptual information as
well as conceptual knowledge, the cognitive model has to support mutual interaction between the different representations while integrating different sources
of information both horizontally (cross-modal and temporal) and vertically
(i.e. top-down & bottom-up). According to Roy and Reiter, cross-modal representations are necessary to support both the
semantic and sensory-motor sub-systems [137]. As mentioned earlier in § 1.3.1
the representations should a) handle ambiguity and uncertainty in
the perceptual data; and b) support sound inference using symbolic information.
Of course, incomplete information may always exist in the knowledge base, and
in many cases new knowledge might need to be inferred from prior knowledge
through reasoning. A non-trivial issue is then how to represent knowledge in a way
that is machine computable, but also expressive enough to better represent
the linguistic structures. The cognitive model should allow the manipulation
of complex information structures while maintaining consistency between the
various kinds of data at different abstraction levels. Finally, the representation
has to deal with grounding the interpretations of the semantic representations
in perceptual information (sensor data) from the environment.
1.3.4 Perceptual & Cognitive Bias
A holistic view suggests that what someone perceives is the effect of the interplay between past experiences and the interpretation of current perceptions.
Therefore, ambiguity in percepts can be rectified by a-priori
knowledge. When someone perceives an object for which they hold a preconceived
concept, they tend to use the preconceived information and associate it with what
they perceive, an inherent cognitive bias of previous knowledge. We consider
that the perceptual model should be able to cope with storing and interrelating
past perceptual experiences so as to influence how it interprets the sensory
patterns into percepts. In this context the semantic model should also be able
to cope with storing and interrelating past conceptions so as to influence not
only how the world is perceived, but also the reasoning processes. It is generally
accepted that human perception relies heavily on the concept of similarity as
the basis for categorisation [134]. The class of an object is determined by its
similarity to (a set of) prototypes which define each category, allowing for varying degrees of membership. Inspired by the Exemplar theory [113, 124], which
argues that a concept can be implicitly formed via all its observed instances,
the aspects of the cognitive bias and similarity are also related to memory in
the context of how associations among different elements of an object or event
are established and processed.
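As a purely illustrative sketch of the prototype view described above (not the thesis model), categorisation can be implemented as matching against the most similar class prototype, with the similarity score itself providing a graded degree of membership. The feature space and prototypes below are hypothetical.

```python
# Similarity-based categorisation: the class of an object is the class
# of its most similar prototype, and the similarity score gives a
# graded degree of membership.

import math

def similarity(a, b):
    """Similarity decays exponentially with Euclidean distance in feature space."""
    dist = math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return math.exp(-dist)

def categorise(features, prototypes):
    """Return (best_class, membership) for a feature vector, given a
    dict mapping class names to prototype feature vectors."""
    scores = {c: similarity(features, p) for c, p in prototypes.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Hypothetical prototypes over (height, width) features:
prototypes = {"mug": (10.0, 8.0), "pot": (20.0, 25.0)}
label, membership = categorise((11.0, 9.0), prototypes)
# label == "mug"; membership = exp(-sqrt(2)), a graded degree of fit
```

An exemplar-theoretic variant would store every observed instance per class and score against all of them, rather than against a single prototype.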
1.4 Contributions
The research behind this thesis focuses on anchoring and specifically its
application to cognitive systems. As discussed in § 1.2.2, anchoring is established
both as a problem and as a methodology for solving certain aspects of
perceptual-symbolic integration in cognitive systems. Although the basic elements of an anchoring system have been formally introduced in various articles [31,
32, 35, 100], several aspects needed further consideration,
including the technical aspects behind the design and implementation of a
cognitive perceptual system based on an anchoring framework for its
computational model.
More specifically this thesis investigates a) how the different perceptual
modalities and grounding mechanisms can be modelled in a cognitive perceptual
system; b) how its perceptual-semantic knowledge is defined and represented;
7
CHAPTER 1. INTRODUCTION
and c) how a-priori knowledge can be used as a cognitive bias. In addition, the
capabilities of such a system are presented while identifying the benefits of using
anchoring as the main computational model. With the general research questions
mentioned above in mind, this thesis presents some technical contributions
related to the design and implementation of the proposed perceptual anchoring
framework:
⊲ Design and implementation of a complete and novel anchoring framework
for cognitive robots acting in a realistic setting. [Paper I, II, III & IV]
⊲ Modelling of perceptual routines which are based on visual, spatial and
topological sensing modalities. [Paper II & III]
⊲ Grounding mechanisms which bind the sensorimotor representations to
symbolic conceptual knowledge. [Paper II & III]
⊲ Perceptual-semantic knowledge base and processing mechanisms for storing the symbolic information acquired from perceptual experiences, via
the anchoring framework. [Paper II & III]
⊲ Integration of the proposed framework with a large-scale deterministic
commonsense knowledge base. [Paper II, III & IV]
⊲ Evaluation of the implemented system in a number of scenarios which
include linguistic interaction, cooperative and concept acquisition aspects
during the anchoring process. [Paper II, III & IV]
Besides the technical contributions mentioned above, this thesis also discusses
some theoretical contributions related to the extension of the
anchoring framework, so as to address the following:
⊲ Extend the anchoring framework to support the processing of multi-modal
perceptual and semantic knowledge in a grounded conceptual structure
which can be accessed both bottom-up and top-down. [Paper III & IV]
⊲ Define the nature of the symbol system using standard knowledge representation and reasoning methods while integrating an ontology to be
used as a shared vocabulary. [Paper II & III]
⊲ Extend the anchoring functionalities to support the integration of the
symbol system proposed above. [Paper II & III]
⊲ Introduce a cooperative semantic anchoring model for grounding collectively acquired perceptual knowledge in a multi-robot context. [Paper
III]
⊲ Introduce a novel model for representing, grounding and reasoning with
spatial and topological relations using fused visual and spatial sensing
modalities. [Paper III]
⊲ Define the similarity measure for matching between anchors, both perceptually and semantically. [Paper IV]
⊲ Define an extension of the anchoring space which supports the embedded
ontology, integrates knowledge from multiple agents, enables multi-modal
integration, defines the similarity based grounding and allows logical
inference. [Paper I, II, III & IV]
In sum, this thesis introduces a knowledge-based perceptual anchoring
framework which is essentially the product of a series of studies that address
the points mentioned above. In the different studies performed in the context
of the thesis, we attempt to address the challenging problem of how high-level
descriptions of visual information may be automatically derived from multiple
sensing modalities, while also considering the applicability of the model in
real-world scenarios. The modelling of perceptual-symbolic correspondences is based
on grounding the interpretations of the semantic representations in perceptual
information coming from the environment as well as other information sources.
1.5 Outline
This thesis is composed of two parts. The first part (Part A) is a comprehensive
introduction, intended to provide insight into the different aspects, contexts
and methods that concern this thesis. The second part (Part B) contains the
attached publications which support the contributions presented in § 1.4. The
remaining parts of this thesis are summarised as follows:
1.5.1 Part A: Grounding percepts to concepts in cognitive
robots
The first part of this thesis is an introduction which includes an overview of
the background and related work (Chap. 2), followed by the main chapter
(Chap. 3) which presents in detail the different methods used in this thesis.
A summary chapter (Chap. 4) presents an overview of the research and the
related publications, while the final chapter (Chap. 5) presents the conclusions
of the thesis, followed by suggestions for future development. A
brief summary of each chapter follows.
⊲ Chapter 2 This chapter introduces the background behind perceptual
& symbol systems, followed by (sub-)symbolic integration and related
extensions, which led to the formulation of the symbol grounding problem. Then, a brief discussion summarises the work related to anchoring,
concluding with some anchoring approaches in the context of cognitive
robots.
⊲ Chapter 3 This chapter introduces the different aspects, methodologies and
tools which were used during the development of the anchoring framework.
In particular, it covers the basic anchoring framework; the algorithms used
for perception (vision and localisation); the framework used for
knowledge representation; and a discussion of commonly available
commonsense knowledge bases.
⊲ Chapter 4 This chapter summarises the included publications of the research project and the author’s contributions.
⊲ Chapter 5 This chapter concludes the development of the proposed anchoring framework by assessing the research goals and findings of the thesis
as a whole. Finally, a few possible research directions for future work are
outlined.
1.5.2 Part B: Publications
The contents and results of this thesis are partially reported in conference
proceedings, journal papers and a book chapter. Reprint versions of four publications (Papers I, II, III & IV) are included in the second part (Part B) of this
thesis. All publications were fully peer-reviewed before being accepted for
publication. Paper II is an extended version of the results presented in Paper V.
Material from Paper VI, which is a summary of the results presented in all the
included papers, has been used for the preparation of Chap. 4 from Part A.
Finally, the versions of the included papers contain typographical and layout
changes in order to match the style of the thesis.
Thesis MindMap
(Figure: a mind map of the thesis structure, linking the Introduction (cognitive robotics, symbol grounding, perceptual and semantic systems, anchoring, motivation, contributions, outline) to Part A (Background & Related Work; Methods) and Part B (Papers I, II, III & IV), and closing with the Discussion: summary of the research, evaluation of the findings, a critical view on anchoring, and suggestions for future work.)
Part A
Grounding percepts to concepts
in cognitive robots
2
Background & Related Work
2.1 Computational models of cognition
Cognition, the process of thought, is abstractly described as the processing of
information. However, most computational models require a (mathematically
and logically) formal representation of a problem. Implicitly, the question
then becomes: how should we represent information in a cognitive system?
And most importantly: what would be the right frame for describing how the
cognitive system (biological or artificial) is processing information? Cognitive
computational models are used both in simulation and experimental verification
of different specific and general properties of intelligence, where three main
perspectives are considered, namely the sub-symbolic, symbolic and hybrid
approaches.
2.1.1 Sub-symbolic modelling
Sub-symbolic approaches mainly concern connectionist (neural network) models,
which typically follow the neural and associative properties of the human brain.
They mainly focus on emergent behaviour, dynamical systems and neural
learning. Connectionism is founded on the idea that the mind is composed
of simple nodes and that the power of the system comes primarily from the
existence and manner of connections between these simple nodes. Neural networks
are textbook implementations of this approach. Some critics of this approach
argue that even though connectionist models approximate biological reality as
a representation of how the system works, they lack explanatory power,
because complicated systems of connections, even with simple rules, are extremely
complex and often less interpretable than the system they model.
2.1.2 Symbolic modelling
Symbolic modelling on the other hand is focused on the abstract, mental
functions of an intelligent mind, based on the principle that it operates using symbols. This approach originates from knowledge-based systems merged
with a philosophical perspective, that of “Good Old-Fashioned Artificial Intelligence” (GOFAI). Initially, such systems were developed by the first cognitive
researchers and later used in information engineering and expert systems [163].
Symbolic Models assume that the mind is built like a digital computer (serial
processing), where hypothetical mental representations are symbols which are
serially processed using sets of rules.
A physical symbol system (also called a formal system) takes physical
patterns (symbols), which it combines into structures (expressions) that can be
manipulated (using processes) to produce new expressions. Newell and Simon’s
Physical Symbol System Hypothesis [122] notably argues that symbolic computation is both necessary and sufficient for general intelligent action [138]. Since
symbolic computation is Turing-complete, this is trivially true, but criticism
of symbolic (rule-based) AI maintains that a purely symbolic system does
not constitute a feasible practical approach, either because discrete symbols
are technically insufficient [12, 13] or because it usually lacks grounding in a
physical environment [143].
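The sense in which symbols combine into expressions that processes manipulate can be illustrated with a toy example (my own sketch, not Newell and Simon's formalism): expressions are tuples of symbols, and a process repeatedly applies if-then rules to produce new expressions. All facts and rules below are hypothetical.

```python
# A toy physical symbol system: symbols combine into expressions
# (tuples), and a process (forward chaining) applies rules to produce
# new expressions from existing ones.

def forward_chain(facts, rules):
    """Apply if-then rules (premise expressions -> conclusion) until no
    new expressions can be produced."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if all(p in facts for p in premises) and conclusion not in facts:
                facts.add(conclusion)  # a new expression is produced
                changed = True
    return facts

# Initial expressions and two illustrative rules:
facts = {("red", "cube1"), ("cube", "cube1")}
rules = [
    ([("cube", "cube1")], ("object", "cube1")),
    ([("red", "cube1"), ("object", "cube1")], ("visible", "cube1")),
]
derived = forward_chain(facts, rules)
# ("object", "cube1") and then ("visible", "cube1") are derived
```

The criticism quoted above targets exactly this kind of system: the manipulation is purely formal, and nothing in it connects "cube1" to any physical object without a grounding mechanism.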
2.1.3 Hybrid Modelling
This paradigm emerged as a solution to the debate over whether the mind is best
viewed as a huge array of small but individually feeble elements (i.e. neurons)
or as a collection of higher-level structures such as symbols, schemata, plans
and rules. Both the symbolic and the associationist approaches have their
advantages and disadvantages. As Gärdenfors aptly argues:
“. . . they are often presented as competing paradigms in artificial
intelligence, but since they attack cognitive problems on different
levels, they should rather be seen as complementary methodologies.”
(P. Gärdenfors [58])
Hybrid models often draw from machine learning and include
techniques which put symbolic and sub-symbolic models into correspondence.
These models tend to be generalised and take the form of integrated computational models of abstract (synthetic) intelligence, in order to benefit from the
advantages of both symbolic and sub-symbolic models.
Figure 2.1: Schematic diagram of a rational intelligent agent merged with the PEAS
(Performance measure, Environment, Actuators, Sensors) paradigm from a cognitive
perspective. The agent perceives its environment through sensors, where the complete
set of its inputs at a given time is called a percept. Then, the current percept, or a
sequence of percepts can influence the actions of the agent through the performance
measure. A system with both symbolic and sub-symbolic components is a hybrid
intelligent system.
2.1.4 Intelligent agent & the environment
A very popular paradigm in hybrid systems, widely accepted since
the 1990s in cognitive robotics, is the notion of the Intelligent Agent (IA) [138]. An
agent can be described simply as an autonomous entity which perceives the
environment so as to acquire new information, and uses this knowledge to direct
its activity towards achieving its goals by acting in the same environment.
It is thus a particularly well-suited model for cognitive robots, which can be
very simple or arbitrarily complex. A reflex machine such as a thermostat
can be represented as an intelligent agent, as can a humanoid robot or even
a swarm of robotic systems working together towards a goal. An agent that
solves a specific problem can use any approach that works: some agents are
symbolic and logical, some use sub-symbolic approaches or neural networks,
while others use mixed approaches.
The agent’s experience, depending on the environment in which the agent
exists, is characterised by certain properties. The real world is, of course, partially
observable, stochastic, sequential, dynamic, continuous and multi-agent [138].
All the elements mentioned so far are typically unified in the PEAS paradigm
(see Fig. 2.1), which can be briefly described as having three major (possibly
overlapping) sub-systems with different functions, defining how sensory
data are processed and distributed through the system and where decisions are
made: a perceptual sub-system which gains information from the environment;
an action sub-system that emits energy in various forms; and, between those
sub-systems, an arbitrarily complex collection of central mechanisms that interact
with the perceptual and action sub-systems but may also be involved in other
cognitive processes.
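The agent abstraction described above, a mapping from percept sequences to actions, can be sketched in a few lines. This is purely illustrative (the thermostat domain, target value and method names are my own, not taken from the thesis):

```python
# A minimal reflex agent: percepts are temperature readings, actions
# are heater commands chosen against a target performance measure.
# The agent maps its percept sequence to an action.

class ThermostatAgent:
    def __init__(self, target=21.0):
        self.target = target
        self.percepts = []  # the percept sequence accumulated so far

    def perceive(self, temperature):
        """Record a new percept from the environment's sensors."""
        self.percepts.append(temperature)

    def act(self):
        """Choose an action based on the (latest) percept."""
        if not self.percepts:
            return "idle"
        return "heat_on" if self.percepts[-1] < self.target else "heat_off"

agent = ThermostatAgent(target=21.0)
for reading in [19.5, 20.2, 21.4]:  # the environment feeds percepts
    agent.perceive(reading)
action = agent.act()
# after the last percept (21.4, at or above target) the agent switches the heater off
```

A more deliberative agent would replace `act` with planning or reasoning over the whole percept sequence; the interface to the environment stays the same.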
2.2 Perceptual Systems
Sensing involves gathering information about the environment using sensors.
Robots can be given the ability to see through a camera image sensor and
perceive light intensity data. Besides vision, robots can rely on other sensors,
for example laser range-finders, with which they can perceive the distance to objects.
From a traditional point of view, perception is the cognitive process of attaining
awareness or understanding of sensory information. The perceptual system
constructs an internal representation of the world which is further processed
and related to other information [123].
The study of perception is mainly concerned with exteroception, and quite
often the word perception refers to exteroception alone. In modelling perception
we are concerned with how sensory stimuli are observed, represented and
interpreted. A stimulus may be the occurrence of an object at some distance
from the robot. The perceptual system transforms stimuli (sensory data) from
the world into percepts (e.g. colours, shapes, textures) which are then relayed
to the cognitive model. This transformation presupposes the existence of a
complex perceptual system which relies mainly on visual and spatial perception
and their fusion.
2.2.1 Visual Perception
Vision is generally a complex task, involving various scientific topics such
as signal processing and pattern recognition. However, vision has the potential
to provide an abundance of information about the surroundings, where the main
task is to derive the structure of a scene, and other information about the
environment, from one or several images. The most difficult task robots are
required to solve is answering the question: “What do I see, and
where?”. In the computer vision community this task can be summarised in
the following two problems:
⊲ Image classification or Image labelling is the task of classifying an image
according to its content.
Figure 2.2: Examples of image variations.
⊲ Object detection is a more complex problem, that of detecting and locating
instances of a certain object class inside an image, with the best possible
accuracy.
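The contrast between the two tasks can be made concrete with a deliberately simplified toy (my own illustration, not a real vision pipeline): the "image" is a one-dimensional grid of characters and "objects" are known patterns, so that classification assigns one label to the whole image while detection also localises each instance.

```python
# Toy contrast between image classification and object detection on a
# 1-D "image". Patterns stand in for learned object models.

PATTERNS = {"cube": "XX", "sphere": "OO"}  # hypothetical object models

def classify(image):
    """Image classification: one label for the whole image, here the
    pattern with the most occurrences."""
    counts = {label: image.count(p) for label, p in PATTERNS.items()}
    return max(counts, key=counts.get)

def detect(image, label):
    """Object detection: return every position where an instance of
    `label` occurs (a sliding-window scan)."""
    pattern, hits = PATTERNS[label], []
    for i in range(len(image) - len(pattern) + 1):
        if image[i:i + len(pattern)] == pattern:
            hits.append(i)
    return hits

image = "..XX...OO..XX.."
print(classify(image))        # cube (two occurrences vs one)
print(detect(image, "cube"))  # [2, 11]
```

Real detectors of course operate on 2-D pixel arrays with learned models and must cope with all the variations listed below, but the structural difference between the two problems is the same.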
As we have seen, the modelling of visual perception is a difficult task, and
achieving human-level results is far beyond the current state of the art. But why
is vision so difficult? Humans typically solve the difficult visual
problems mentioned above effortlessly, in fractions of a second and with very small error.
An artificial vision system, however, must address a wide range of
challenges which are often common across visual processes and concern
representing and categorising visual content in the presence of variations.
Possible variations of object appearance, especially in real-world scenarios, can
be divided into two main classes: image and object variations.
2.2.1.1 Image Variations
⊲ Illumination variation greatly influences the appearance of objects, due
to changes in the lighting conditions or the occurrence of shadows.
⊲ Scale variation may be caused by the imaging device or by perspective transformation.
⊲ Viewpoint variation causes change in the appearance of objects, due to
camera position changes, in relation to the object.
Figure 2.3: Examples of object variations.
⊲ Clutter results in confusion between foreground and background objects.
Object features are likely to occur in the background, thereby producing
false matches and wrong evidence about the presence of objects.
⊲ Occlusion is caused by other objects or truncation by the image border.
Self occlusion describes the situation where parts of the object occlude
other parts of the same object.
⊲ Motion blur occurs due to fast motion of the camera or low-light conditions.
Slow shutter speed, moving objects or camera shake contribute to this
blurring, degrading the obtained image.
2.2.1.2 Object Variations
⊲ Interclass variation refers to the variation between objects of different
classes with relatively little appearance variation between them.
⊲ Intraclass variation refers to the variation among objects that belong to
the same class.
⊲ Articulation variation occurs when there is variation of the appearance
of an object that is caused by different poses or positions of parts of the
object.
⊲ Pose variation exists when objects occur in different poses that may give
them completely different appearances.
⊲ Size variation concerns the cases where the size of objects can significantly
influence the similarity to other object-classes and increase the variance
within one object-class.
2.2.2 Computer Vision for cognitive perception
Many computer vision techniques are involved in robot vision systems. A typical vision system includes methods for visual interest point detection, feature
extraction and matching, detection and segmentation, as well as high-level processing such as object recognition. This dissertation focuses on computer vision
methods, and specifically object recognition, in order to help with processing
the increasing volume of image data inherent to the domain of cognitive robots
(and robots in general).
Besides the challenges mentioned in § 2.2.1, the real difficulty comes in
interpreting the data retrieved from the camera sensor. This is because a robot
must make sense of the raw data, which are represented as an array of numbers
indicating light intensity at each pixel, thus making image understanding
computationally very demanding.
In the early 1960s, computer vision research began as an artificial intelligence problem, the goal of which was to understand images by detecting objects
and defining the relationships between them [132]. In the same context, Marr
presented his famous view that, given visual information, a complete geometric reconstruction of the observed scene is required for visual perception, and that
the vision system should compute the shape, orientation and colour of every
object in the scene while recognising them in one image [108]. It was not until
the late 1980s that a more focused study of the field started, when computers
could manage the processing of large data sets such as images.
2.2.3 Spatial Perception
In mobile robotic applications we have to consider another important aspect,
that of localisation, which usually refers to the estimation of the position and
orientation of a robot in a reference frame. There are two types of localisation
problems: a) global localisation, where the robot must be located without
knowing where it started; and b) position tracking, where the robot
must be located given that we know where it started. Furthermore, from a
cognitive perspective there is also the aspect of spatial perception, where the
robot needs to locate the objects it manipulates and simultaneously represent
their spatial distribution in its environment. This, of course, should come in
relation to its own position (egocentric frame) but also in relation to other objects’
positions (allocentric frame).
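The relation between the two frames is a standard rigid-body transformation; a minimal sketch (illustrative, assuming a planar robot with pose (x, y, heading θ) and an object observed in its egocentric frame) follows:

```python
# Mapping an object position from the robot's egocentric frame to the
# allocentric (world) frame: rotate by the robot's heading, then
# translate by its position.

import math

def ego_to_allo(robot_pose, obj_ego):
    """robot_pose = (x, y, theta); obj_ego = (ox, oy) in the robot frame."""
    x, y, theta = robot_pose
    ox, oy = obj_ego
    wx = x + ox * math.cos(theta) - oy * math.sin(theta)
    wy = y + ox * math.sin(theta) + oy * math.cos(theta)
    return wx, wy

# Robot at (2, 3) facing 90 degrees; an object 1 m straight ahead of it
# therefore lies at roughly (2, 4) in the world frame:
wx, wy = ego_to_allo((2.0, 3.0, math.pi / 2), (1.0, 0.0))
```

Note that the quality of this mapping depends directly on the localisation estimate: uncertainty in the robot's pose propagates into the allocentric positions of every object it anchors.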
2.3 Symbol Systems
Cognitive robots need to represent knowledge about the relevant parts of
the world they inhabit. Semantic knowledge (or meaning), according to
standard theories of cognition, appears to reside in a semantic memory separate
from the multi-modal systems of perception and action. Representations from
the modal systems are transduced into amodal symbols, representing conceptual
knowledge about the robot’s experience.
Although little empirical evidence supports the presence of amodal symbols
in cognition, they have been widely adopted because they provide elegant and
powerful formalisms for knowledge representation (KR). The reason is
that they capture important intuitions about the symbolic character of
cognition. Therefore, many approaches in the tradition of cognitivism adopt the
symbolic paradigm for representing and processing information, as cognition is
considered a form of computation and hence formal symbol manipulation [122].
The fundamental goal is to represent knowledge in a way that facilitates
reasoning. This is done by analysing how to think formally and how to use
a symbol system to represent a domain of discourse, using functions that
allow inference over and manipulation of this knowledge. In turn, this knowledge
can be relayed backwards, toward the perceptual systems, while also being used
by various other cognitive processes. Semantic representations should thus be
adequate for higher-level cognitive processes and consistent across different
levels of abstraction. With the use of semantics, we should be able to express the
different modalities involved in perception as well as conceptual descriptions.
2.3.1 Models for Knowledge Representation
Knowledge representation is a family of techniques for describing (or encoding)
a particular description of the world in a precise and unambiguous way.
Furthermore, the problem concerns not only the representation of information
but also how these representations are to be reasoned upon, so as to achieve
intelligent behaviour. The term “intelligent” refers to the ability of a system to
deduce implicit knowledge from its explicit knowledge.
Models of knowledge representation were initially developed in the field
of AI, originating from theories of human information processing. The most common
way to encode knowledge is through a logical formalism which supplies both
formal semantics, specifying how reasoning functions apply to concepts in the domain of
discourse, and operators (i.e. quantifiers, modal operators, etc.) that, along
with an interpretation theory, give meaning to the formulae expressed in the
logical formalism.
Early models attempting to represent knowledge include the works of Boole
(1815–1864) and Frege (1848–1925), which resulted in the well-known
Propositional and First-Order Logics (FOL) respectively. Drawbacks such
as undecidability led to the introduction of non-logic-based representation
languages, such as Semantic Networks and Frame Systems. A brief summary
follows [56]:
⊲ Object - Attribute - Value triplets are a technique used to represent
facts about objects and their attributes. More precisely, an O-A-V triplet
asserts the value of an attribute of an object.
⊲ Logics aim at emulating the laws of thought by providing a mechanism to
represent statements about the world. The representation language is
defined by its syntax and semantics, which specify the structure and the
meaning of statements, respectively. The most widely used and understood
logic is First-Order Logic (FOL), which assumes 1) the existence of facts,
objects (individual entities), and relationships among objects; and 2) the
beliefs of true, false, and unknown for statements.
⊲ Semantic networks are directed graphs which consist of vertices that represent concepts and edges that encode semantic relations between them.
Concepts can be arranged into taxonomic hierarchies and have associated
properties.
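Two of the formalisms above can be illustrated together in a small sketch (my own, with hypothetical objects and attributes): O-A-V triplets store the facts, and is-a links form a small semantic network along which attributes are inherited down the taxonomy.

```python
# O-A-V triplets as the fact store, with a semantic-network style
# taxonomy: attributes not asserted directly on an object are
# inherited from its parent concepts via "is-a" links.

triplets = [
    ("mug1", "colour", "blue"),        # Object - Attribute - Value
    ("mug1", "is-a", "mug"),
    ("mug",  "is-a", "container"),
    ("container", "can-hold", "liquid"),
]

def value_of(obj, attribute):
    """Look up an attribute, climbing is-a links when it is inherited."""
    for o, a, v in triplets:
        if o == obj and a == attribute:
            return v                    # direct fact
    for o, a, v in triplets:
        if o == obj and a == "is-a":
            return value_of(v, attribute)  # inherit from parent concept
    return None

print(value_of("mug1", "colour"))    # blue (direct fact)
print(value_of("mug1", "can-hold"))  # liquid (inherited via mug -> container)
```

The deduction of `can-hold` for `mug1` is exactly the kind of implicit knowledge that the text above calls "intelligent" behaviour: it is nowhere asserted directly, but follows from the explicit triplets.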
Logics of various kinds, and logical reasoning and representation languages
such as Prolog and KL-ONE, have been popular tools for knowledge modelling,
for example in expert systems. However, the user is required to be a logic expert.
This problem led to the introduction of “ideal” formalisms which integrate
well-defined semantics into frame systems and semantic networks, benefiting
from the positive features of both frame systems and logic-based formalisms.
These are known as Description Logics (DLs) and are particularly useful for
representing domain knowledge through the specification of concepts and their
relationships. DLs are the most commonly used formalism today in several
disciplines, such as natural language processing and web-based information
systems. Most notably, they provide a logical formalism for authoring ontologies
on the Semantic Web.1 Below we summarise the related challenges.
2.3.2 Challenges related to KR&R
The problems of linking and encapsulating real-world concepts by means of logic
and related formalisms became apparent around the 1980s, when formal computer
knowledge representation languages and systems arose. These attempted to
encode wide bodies of general (common-sense) knowledge; for example, the
Cyc project attempted to encode the information necessary for
understanding encyclopaedic content. The difficulty of knowledge
representation came to be better appreciated through the widening gap between
the early claims of AI enthusiasts and the actual performance of implemented
systems.
First, we meet the symbol grounding problem (see § 2.5), which seems
inevitable when relating abstract symbols to aspects of the real world. In the
same context, another key difficulty for cognitive agents is representing
predicate or function symbols whose values can change so as to reflect
1 All the major KR languages and standards, such as the Resource Description Framework
(RDF), the RDF Schema, the Web Ontology Language (OWL) and the Ontology Inference Layer
(OIL), are solely based on DLs.
CHAPTER 2. BACKGROUND & RELATED WORK
the dynamic nature of the world. This is known as the frame problem, which
originally concerned the difficulty of using logic to represent which facts of the
environment change and which remain constant over time. In a broader sense,
the frame problem can be thought of as finding appropriate KR&R mechanisms
to make inferences within the resource constraints, so as to limit the number
of assertions and inferences that an artificial cognitive agent must make to
solve a given task in a particular context.
Another important aspect, which affects the modelling of knowledge representation
systems, concerns expressivity and the trap of decidability. Brachman
and Levesque argue: “there is a trade-off between the expressiveness of a
representation language and the difficulty of reasoning over the representations
built using that language” [10]. The more expressive the language, the easier
and more compact it is to represent a sentence in the knowledge base; however,
deriving inferences in more expressive languages is computationally harder. An
example of a less expressive KR would be propositional logic, while a more
expressive one would be auto-epistemic temporal modal logic. Less expressive
KRs (formally less expressive than set theory) may be both complete and
consistent, in contrast to more expressive KRs, which may be neither complete
nor consistent.
The expressivity of the language is also determined by the application
domain and the level of abstraction of the knowledge intended to be expressed
in the knowledge-base. For instance, in a simulated single-agent scenario we
would probably need only propositional logic or binary decision diagrams to
model the agent’s low-level perceptions, thus avoiding undecidability. On the
other hand, in a multi-agent scenario in the real world, where the cognitive
agents manipulate commonsense information or communicate via natural language
with humans, we might need extensions of second-order predicate calculus,
again with the risk of getting trapped in undecidable situations.
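The expressivity trade-off can be made concrete with a small sketch: a single quantified first-order rule corresponds to a whole family of propositional clauses, one per tuple of domain objects. The predicates and objects below are invented for illustration:

```python
from itertools import product

# A single FOL-style rule, e.g.
#   forall x, y: On(x, y) and Fragile(x) -> Careful(x, y),
# compiles into one propositional clause per pair of objects.
objects = ["cup22", "table1", "door5", "room3"]

propositional_clauses = [
    f"On({x},{y}) & Fragile({x}) -> Careful({x},{y})"
    for x, y in product(objects, repeat=2)
]

# One compact quantified sentence becomes |objects|^2 ground clauses.
print(len(propositional_clauses))  # 16
```

The less expressive encoding stays decidable, but its size grows polynomially (or worse, for rules with more variables) in the number of objects, which is precisely the compactness the quantified language buys.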
2.4 Commonsense Information
Knowledge, perceptual or semantic, is required in abundance by the cognitive
agent to accomplish a specific task or to model a particular domain. Commonly
we assume that every person possesses a large and shared body of knowledge
known as common-sense, which acts as the medium of communication.
But what actually is commonsense knowledge?
Commonsense consists of what people “sense” as their “common” natural
understanding, referring to beliefs or propositions that most people would
consider wise and sound. Thus “common-sense” (in this view) equates to the
knowledge and experience which most people are assumed to already have. The
concept originates from Aristotle2 , who considered “common-sense” to be the
internal sensation that is formed after the external senses are united and
2 Aristotle, De Anima, Book III, Part 2, accessed online at: http://classics.mit.edu/
Aristotle/soul.html
judged. Most notably, Locke argued that it is the sense of things in common
between disparate impressions (integrated sense-data). Most empiricist
philosophers approach the problem of the unification of sense-data by assuming
a faculty of the human understanding that perceives commonality and performs
the combining.
While to the average person the term “common-sense” is regarded as synonymous
with “good judgement”, the AI community uses it in a technical way, to refer
to the millions of basic facts and understandings possessed by most people. It
is considered to span a huge portion of human experience, encompassing
knowledge about the a) spatial; b) physical; c) social; d) temporal; and
e) psychological aspects of typical everyday life.
Similarly, robots or intelligent agents in general, in order to be considered
eligible for comprehensive communication with a human, have to possess a
vast amount of commonsense knowledge. It is remarkable that such knowledge
is typically omitted from social communication, simply because it is assumed
that every person possesses the common-sense needed to comprehend. Humans
effortlessly assume or infer the omitted underlying information. In contrast,
robots interacting with humans have no general knowledge about the world,
nor can they reason through a problem using esoteric and exoteric knowledge,
logic and inference. The problem is considered to be among the hardest in all
of AI research, because the breadth and detail of such knowledge is enormous.
Arguably, McCarthy’s advice-taker proposal [111] is the first scheme to
represent commonsense knowledge in mathematical logic. It uses an automated
theorem prover to derive answers to questions expressed in logical form, and
is the precursor of Logic Programming and modern computational logic
languages such as Prolog and LISP. Winograd’s SHRDLU was also one of the
early natural language understanding programs which conversed with a human
using English terms [172]. Although SHRDLU was overly simplified, it was a
tremendously successful demonstration of AI (at the time), thus encouraging
many other researchers to pursue systems dealing with more realistic
situations and with real-world ambiguity and complexity.
One of the milestones in the development of common-sense oriented KR systems
is the SNePS system, developed and maintained by Stuart C. Shapiro and his
group [145, 146]. SNePS is simultaneously a logic-, frame- and network-based
KR&R system, used both as a stand-alone system and as an implementation of
the mind of intelligent agents (cognitive robots), within the GLAIR agent
architecture (Grounded Layered Architecture with Integrated Reasoning). SNePS
has been used for a variety of KR&R tasks, such as commonsense reasoning and
natural language understanding and generation, in the context of cognitive
robotics. However, the projects mentioned above were not mainly intended for
robotic use, and few works exist that focus on modelling the common-sense
informatic situation in mobile robots, except for two notable examples
[141, 144].
2.5 Perceptual-symbolic integration and the Symbol
Grounding Problem
In the context of hybrid cognitive modelling and subsymbolic integration, we
meet one very central argument, Searle’s well-known “Chinese Room” argument
[143], which challenged the assumptions of classical AI. Searle proposed a
situation where the symbolic task of responding to questions in Chinese,
without knowing the language, can apparently be performed well while at the
same time the meaning of the questions and answers is not understood. He
claims that symbol systems have some fundamental limitations that prevent
them from exhibiting true cognitive behaviour (due to the use of ungrounded
symbols). What he really meant is that a symbol-manipulating system in
isolation will never be able to truly “understand” anything about the physical
world. Searle’s argument still provokes debate to this day; however, one of
the most influential responses is Harnad’s [64]. Harnad argues that a system
with sensors and connectionist (or non-symbolic) systems, classifying raw
sensor data, would not be subject to the “Chinese Room” argument when using
an approach which grounds symbols in low-level perceptions of the environment.

Figure 2.4: Visual metaphor of Searle’s Chinese room argument.
2.5.1 Harnad’s Symbol Grounding Problem and solution
Harnad first proposed the symbol grounding problem as a question concerning
semantics:
“How can the semantic interpretation of a formal symbol system
be made intrinsic to the system, rather than just parasitic on the
meanings in our heads? ”
(S. Harnad [63])
arguing that information fusion at the symbolic level will face the grounding
problem at two levels: a) modality-inherent grounding within each modality;
and b) cross-modal grounding between modalities. As a solution, Harnad
proposes a hybrid sub-symbolic architecture in which there is no longer any
autonomous symbolic level at all. Instead, there is an intrinsically dedicated
symbol system whose elementary symbols (names) are connected to non-symbolic
representations via connectionist networks that extract invariant features of
their analogue sensory projections. The introduction of the SGP
spawned a great debate around the subject, and many strategies have attempted
to solve the problem from different perspectives.
Representationalist strategies approach the SGP by grounding symbols in
the representations arising from the manipulation of perceptual data. Here
lies also the hybrid model of Harnad, which implements a mixture of features
characteristic of both symbolic and connectionist systems. According to
Harnad, the symbols in question can be grounded by connecting them to the
perceptual data they denote through a bottom-up, invariantly categorising
processing of sensorimotor signals, where symbols are grounded in three
stages: a) iconisation – the process of transforming analogue signals into
iconic representations; b) discrimination – the process of judging whether
two inputs are the same or how much they differ; and c) identification – the
process of assigning a unique name to a class of inputs, treating them as
equivalent or invariant.
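The three stages can be caricatured in a few lines of code; the signals, feature extraction and category prototypes below are invented stand-ins, not a faithful model of Harnad’s proposal:

```python
import math

def iconise(signal):
    """Iconisation: reduce an analogue signal to a fixed iconic
    representation (here, a crude 2-bin feature vector)."""
    half = len(signal) // 2
    return (sum(signal[:half]) / half,
            sum(signal[half:]) / (len(signal) - half))

def discriminate(icon_a, icon_b):
    """Discrimination: judge how much two inputs differ."""
    return math.dist(icon_a, icon_b)

def identify(icon, prototypes, threshold=1.0):
    """Identification: assign the name of the closest category
    prototype, treating nearby inputs as equivalent."""
    name, proto = min(prototypes.items(),
                      key=lambda kv: discriminate(icon, kv[1]))
    return name if discriminate(icon, proto) <= threshold else None

# Invented prototypes standing in for learned invariant features.
prototypes = {"horizontal-edge": (1.0, 0.0), "vertical-edge": (0.0, 1.0)}
icon = iconise([0.9, 1.1, 0.0, 0.1])
print(identify(icon, prototypes))  # horizontal-edge
```

In Harnad’s actual proposal the prototypes are not hand-coded but learned by a connectionist network, which is what the following paragraph elaborates.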
He also argues that three kinds of representations are necessary: a) iconic
representations, which are the sensor projections of the perceived entities;
b) categorical representations which are learned and innate feature detectors
that pick out the invariant features of object event categories from their
sensory projections; and c) symbolic representations, which consist of symbol
strings describing category membership. Both the iconic and the categorical
representations are assumed to be non-symbolic. Finally, he concludes that a
connectionist network is the most suitable mechanism for learning the invariant
features underlying categorical representations, and thus for connecting names
to the icons of the entities they stand for. The function of the network, then,
is to pick out the objects to which the symbols refer. Concerning Harnad’s
approach, one can note that although it seems clear that a pure symbolic
system does not suffice (since sensors do not provide symbolic representations),
the approach of adopting connectionist networks alone appears limited as well.
Subsequent work from Cangelosi et al. and Cangelosi and Harnad provide a
detailed description of the mechanisms for the transformation of categorical perception (CP) into grounded low-level labels and subsequently into higher-level
symbols [17, 22]. They call grounding transfer the phenomenon of acquisition
of new symbols from the combination of already grounded symbols. Then,
they show how such processes can be implemented with neural networks [22].
According to Cangelosi et al., the functional role of CP in symbol grounding is
to define the interaction between discrimination and identification. Cangelosi
and Harnad outline two methods for the acquisition of new categories. They
call the first method sensorimotor toil and the second one symbolic theft, in
order to stress the benefit (enjoyed by the system) of not being forced to learn
from a direct sensorimotor experience whenever a new category is in question
[20]. Finally they provide a simulation of the process of CP, for the acquisition
of grounded names and for the learning of new higher-order symbols from
grounded ones [22].
Figure 2.5: The semiotic landscape defines the sign and its relations between a
meaning, a form and a referent. This figure emphasises the temporal evolution of the
semiotic triangle: starting from the original definition by Ogden and Richards [125],
we move to the Peircean definition of symbols [129]. Ullmann’s triangle represents the
relation between things in reality and the constructs of a language [162], followed
by Harnad’s and Vogt’s adoptions [63, 168].
2.5.2 Physical Symbol Grounding
Vogt connects the solution proposed by Harnad [63] with embodied robotics [12,
13] and with the semiotic definition of symbols [129] (See Fig. 2.5). His approach
to the SGP originates from embodied cognitive science and grounds the symbolic
system in the sensorimotor activities of the robot, thus transforming the SGP
into the Physical Symbol Grounding Problem (PhSGP). He then solves the
PhSGP by relying on two conceptual tools: the semiotic symbol systems and the
guess game. Vogt defines symbols as a structural pair of sensorimotor activities
and environmental data [167, 168].
According to the semiotic definition, symbols have a) a form (Peirce’s
“representamen”), which is the physical shape taken by the actual sign; b) a
meaning (Peirce’s “interpretant”), which is the semantic content of the sign;
and c) a referent (Peirce’s “object”), which is the object to which the sign
refers. Following this Peircean definition, a symbol always comprises a form,
a meaning and a referent, with the meaning arising from a functional relation
between the form and the referent, through the process of semiosis, or
interpretation. The whole semiotic symbol system then grounds the meaning of
the symbols in the sensorimotor activities, thus solving the PhSGP
[167, 169, 170].
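A semiotic symbol in this sense can be sketched as a simple triple; the matching rule and field contents below are invented for illustration and are not Vogt’s implementation:

```python
from dataclasses import dataclass
from typing import Any, Callable

# A semiotic symbol bundles a form, a meaning and a referent.
@dataclass
class SemioticSymbol:
    form: str          # Peirce's representamen, e.g. the word "cup"
    referent: Any      # the object to which the sign refers
    meaning: Callable  # the functional relation between form and referent

def interpret(percept, symbol):
    """Semiosis: the meaning functions as the relation that decides
    whether a percept counts as the symbol's referent."""
    return symbol.form if symbol.meaning(percept) else None

# Invented example: the meaning of "cup" is a perceptual test.
cup = SemioticSymbol("cup", referent="cup-22",
                     meaning=lambda p: p.get("shape") == "cylindrical")
print(interpret({"shape": "cylindrical"}, cup))  # cup
```

The key design point mirrored here is that the meaning is a *function* relating form and referent, rather than a stored third entity.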
The solution to the PhSGP is based on the guess game [152], a technique
used to study the development of a common language by situated robots. Here
a game is defined as: “a routinised sequence of interactions between two agents
involving a shared situation in the world” [150]. Steels and Vogt implemented
the previously mentioned language games on simple Lego robots equipped with
light and infra-red sensors. Later, Steels [149] studied the emergence of shared
languages in a group of autonomous cognitive robots that learn the categories
of objects. He argues that language games are a useful way of modelling the
acquisition or sharing of language [149, 150, 153]. As Vogt points out, several
critics argue that “robots cannot use semiotic symbols meaningfully, since they
are not rooted in the robot”. To this he replies:
“. . . it will be assumed that robots, once they can construct semiotic
symbols, do so meaningfully.”
(P. Vogt [168])
Coradeschi and Saffiotti identified and formalised a problem similar to the
PhSGP, which they term anchoring. The problem concerns “the connection
between abstract and physical-level representations of objects in artificial
autonomous systems embedded in a physical environment” [38]. A detailed
description of anchoring is presented in the next section (§ 2.6). Vogt and
Divina characterise this problem as a technical aspect of symbol grounding,
since it deals with grounding symbols to specific sensory images [171], and
mention that the symbol grounding problem in general also has to deal with
anchoring to abstractions, while including philosophical issues related to
meaning.
Here it is important to highlight the differences between perceptual anchoring
and physical symbol grounding. Anchoring can indeed be seen as a special
case of symbol grounding, which concerns primarily physical objects [35, 38],
actions and plans [14, 15, 34, 74, 105], but not abstract concepts. Even though
anchoring elaborates on certain aspects of physical symbol grounding and can
be seen as another solution to it, it deviates from the scope of language
development (according to Vogt [170]). For example, in symbol grounding we
would consider ways for an intelligent agent to ground the concepts of
“stripes” or “redness” in perception in general, while in anchoring we would
instead consider how to model the perceptual experiences of a cognitive agent
into a structure that a) describes an object (e.g. a zebra), its properties and
attributes, both symbolically and perceptually; and b) maintains these
associations over time.
Finally, a popular class of representations for symbols which are naturally
grounded is called affordance-based representations [60]. Such a representation
encapsulates what the robot can afford to do with respect to objects. For
example, Roy has studied the grounding of words by manipulator robots, in
both perceptual and affordance features [135]. He has also surveyed the area
of language formation for models that ground meaning in perception, and has
noted the importance of future work on discourse and conversational models,
based on human studies, for inspiration in aligning these models between
communicating partners [136].
2.5.3 Social Symbol Grounding
The multi-agent perspective of symbol grounding, where symbols have to be
agreed upon when more than one agent is involved, is called “social symbol
grounding” [166]. Methods explicitly dealing with the alignment of separately
grounded representations, learned individually, originate from language
formation; moreover, they have been studied extensively in linguistics [151].
Steels and Kaplan describe a “guessing game”, in which two simulated agents
view a common screen with various shapes and colours, while attempting to
obtain similar symbols to characterise them. It is important to note that they
also use semiotic symbols [167] (see § 2.5.2 and Fig. 2.5). Here, instead of
just having the symbol and meaning, there is a form (or word), a meaning
(which can be a grounded aspect of the referent, e.g. colour) and a referent
(the object itself), which must be aligned between the two agents. Also in the
context of social symbol grounding, Cangelosi and collaborators have studied
the emergence of language in multi-agent systems performing navigation and
foraging tasks [18] and object manipulation tasks [21, 96].
2.6 Foundations of Perceptual Anchoring
The concept of anchoring in robotics is synonymous with the PhSGP, as
it constitutes the bridge between the symbolic (conceptual) and the
sub-symbolic (perceptual) models of the cognitive agent. Perceptual anchoring
is the grounding of symbols that refer to specific object entities, such as a
“cup” (or more specifically “cup-22 ”), and of general properties, such as “blue”.
Anchoring is thus a subset of symbol grounding, limited to physical objects
(concrete instances); it does not concern abstract concepts or concrete
classes (groups). It emphasises addressing the physical symbol grounding
problem [168] by providing a solution to it.
Anchoring was first presented by Saffiotti, who identified the need to
ground, or better anchor, the descriptions and perceptual properties of the
physical objects surrounding an intelligent agent into internal representations
that are used during the execution of actions [139]. With anchoring he refers to
the execution of two perceptual capabilities: i) to find an object by matching
its percepts with its symbolic description; and ii) to acquire new information
about the properties of this object.
Later, Coradeschi and Saffiotti gave a formal account of the problem of
anchoring, where they acknowledge that:
“Every intelligent agent embedded in physical environments needs the
ability to connect, or anchor, the symbols used to perform abstract
reasoning to the physical entities which these symbols refer to.”
(S. Coradeschi and A. Saffiotti [37])
Figure 2.6: Graphical illustration of the anchoring problem [35].
by introducing the concept of the anchor3 and its functionalities, while also
outlining a few difficulties, specifically regarding indexical and objective
references [37], definite and indefinite descriptions [36], as well as the
existence of uncertainty and ambiguities, which are inevitable in the presence
of sensor data [15, 31, 36].
Through its early development, the most prevalent definition assumes that
anchoring is the “. . . process of creating and maintaining the correspondence
between symbols and percepts that refer to the same physical object” [32, 35].
In this fundamental work a domain-independent definition and computational
theory of anchoring is proposed, with practical examples in the robot
navigation and aerial surveillance domains. After the establishment of
anchoring both as a problem and as a methodology, subsequent work by
Coradeschi and Saffiotti investigated how anchoring relates to different
contexts, while also concretely framing the computational theory behind it. A
brief summary of the
3 The term anchor originates from studies in situation semantics, where anchors for
parameters provide a formal mechanism for linking parameters to actual entities. An anchor
for a set A of basic parameters is a function f defined on A, which assigns to each parameter
Tn in A an object of type T [3].
most representative iteration of an anchoring framework includes the following
definition:
⊲ Perceptual System Contains a set of percepts (collections of measurements
assumed to originate from the same object) and a set of attributes (measurable properties of percepts).
⊲ Predicate Grounding Relation Embodies the correspondence between
the unary predicates in the symbol system and the attributes in the
perceptual system.
⊲ Symbol System Contains a) a set of symbols which denote objects (e.g.,
“cup-22 ”); b) a set of unary predicate symbols, which describe symbolic
properties of objects (e.g., “green”); and c) an inference mechanism which
uses these components.
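The predicate grounding relation can be sketched as a table mapping predicate symbols to attribute ranges, loosely following the hue ranges shown in the illustration of Fig. 2.6; the exact representation below is an invented simplification:

```python
# A toy predicate grounding relation g: it links unary predicate
# symbols to measurable attributes and admissible value ranges.
g = {
    "red":    ("hue", (-20, 20)),
    "orange": ("hue", (20, 30)),
    "blue":   ("hue", (220, 260)),
}

def holds(predicate, percept_attributes):
    """Check whether a predicate is grounded in a percept's
    measured attributes."""
    attribute, (lo, hi) = g[predicate]
    value = percept_attributes.get(attribute)
    return value is not None and lo <= value <= hi

# Measured attributes of one percept, as in the Fig. 2.6 example.
percept = {"hue": 12, "area": 310}
print(holds("red", percept))   # True
print(holds("blue", percept))  # False
```

In a real system the relation would typically return a degree of matching rather than a crisp boolean, as the fuzzy approaches discussed below do.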
The perceptual system continuously generates percepts, such as regions in
images, and associates them with measurable attributes of the corresponding
objects, such as colour values. The symbol system assigns unary predicates,
such as “green”, to symbols which denote objects, such as “cup-22 ”. The
associations between symbols and percepts are reified via structures called
anchors. Each anchor contains symbols, percepts and estimates of one object’s
properties. Anchors are time-indexed, since their contents can change over
time, and are manipulated with three basic functionalities: i) find;
ii) (re)-acquire; and iii) track [32].
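A minimal, time-indexed anchor structure with simplified versions of the find and track functionalities might look as follows; the matching rules and attribute names are invented for the sketch, and (re)-acquire is omitted:

```python
from dataclasses import dataclass, field

# A time-indexed anchor: symbol plus a history of percepts.
@dataclass
class Anchor:
    symbol: str
    history: dict = field(default_factory=dict)  # time -> percept attributes

def find(symbol, description, percepts, t):
    """find: create an anchor for the percept matching a symbolic
    description (here, exact attribute equality)."""
    for p in percepts:
        if all(p.get(k) == v for k, v in description.items()):
            return Anchor(symbol, {t: p})
    return None

def track(anchor, percepts, t):
    """track: extend the anchor with the percept closest to its
    last estimate (here, nearest in the x coordinate)."""
    last = anchor.history[max(anchor.history)]
    best = min(percepts, key=lambda p: abs(p["x"] - last["x"]))
    anchor.history[t] = best
    return anchor

percepts_t0 = [{"colour": "green", "x": 1.0}, {"colour": "red", "x": 4.0}]
a = find("cup-22", {"colour": "green"}, percepts_t0, t=0)
a = track(a, [{"colour": "green", "x": 1.2}], t=1)
print(a.history[1]["x"])  # 1.2
```

The time index is what distinguishes anchors from one-shot symbol-percept links: the same structure is updated as new percepts of the object arrive.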
Visual perception is perhaps one of the most important aspects closely
related to the scope of this thesis. An anchoring technique based on fuzzy set
theory is presented in [31], which deals with representing uncertainty and the
degree of matching between perceptual signatures of objects, observed by a
vision system of an unmanned helicopter.
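The idea of a degree of matching can be sketched with a trapezoidal fuzzy membership function; the shape and parameters below are illustrative and are not those used in [31]:

```python
def trapezoid(value, a, b, c, d):
    """Trapezoidal fuzzy membership: 0 outside [a, d], 1 on [b, c],
    linear on the shoulders."""
    if value <= a or value >= d:
        return 0.0
    if b <= value <= c:
        return 1.0
    if value < b:
        return (value - a) / (b - a)
    return (d - value) / (d - c)

# Invented fuzzy signature for "red" over the hue attribute.
def red_membership(hue):
    return trapezoid(hue, -30, -15, 15, 30)

# Degree of matching between an observed hue and the signature.
print(red_membership(12))   # 1.0
print(red_membership(20))   # ~0.67
print(red_membership(40))   # 0.0
```

Matching a whole perceptual signature would combine such degrees over several attributes (e.g. by taking their minimum, a common fuzzy conjunction).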
Another aspect, considered in a position paper, is the role of learning in
anchoring [33], with respect to learning the predicate grounding relation,
learning property dynamics and, finally, learning to generate referring
expressions.
In the context of planning, we see autonomous robots that incorporate a
symbolic planner for their operation, which use anchoring in order to associate
the terms of plans to the relevant sensor data [34]. In addition, extensions of
research in planning and anchoring (including action properties and partial
matching) often deal with the grounding of symbols to actions [38].
Finally an approach which used a novel anchoring modality was presented
by Loutfi et al. [100]. There, olfactory data were used in a shared representation
that linked several sub-systems of the robot, such as the planner with the
motion control and the olfactory sensors, while addressing both top-down and
bottom-up processes [100]. In bottom-up processes, sensor data determine the
initiation of an anchoring process, whereas top-down functions may initiate an
anchoring process upon request.
2.7 Perceptual Anchoring in Cognitive Systems
The framework proposed by Coradeschi and Saffiotti provides a simple yet
concrete theoretical point of view for addressing the PhSGP. It is also a
platform for addressing the creation and maintenance of the symbol-percept and
anchor-object correspondences in cognitive agents. In the literature, several
approaches study aspects of the anchoring problem from very diverse
perspectives. Here I present the most notable clusters of work related to
anchoring and to some aspects of this dissertation.
2.7.1 Knowledge Based approaches to Anchoring
An important challenge for cognitive systems is the establishment of a shared
ontology, where symbols and concepts referring to objects are structured in a
way that allows conceptual symbolic reasoning. Many approaches in the context
of symbol grounding consider the use of KR&R techniques or knowledge-bases in
order to enable logical inference. Work by Bonarini et al. investigates a
model to represent the knowledge of an agent, showing that the anchoring
problem can be successfully dealt with using well-known AI techniques. They
present a model which supports the instantiation of concepts affected by
uncertainty and heterogeneity from the perceptual system, in a multi-agent
context [6, 7].
One of the earliest clusters of work on knowledge-based approaches is led by
Chella et al., who present a knowledge-based anchoring framework for building
high-level conceptual representations from visual input, based on the notion
of Gärdenfors’ conceptual spaces [58]. Their approach makes it possible to
define conceptual semantics for the symbolic representations of the vision
system, where the symbols can be grounded to sensor data. They further develop
their approach to anchor conceptual representations in dynamic scenarios using
the situation calculus [27, 29]. In a collaboration with Coradeschi and
Saffiotti [26], they explicitly investigate how to formalise a computational
model for perceptual anchoring that is unified with Gärdenfors’ conceptual
spaces theory, in the context of bridging the gap between the symbolic and
sub-symbolic components. Finally, Chella et al. present an extension of their
cognitive architecture for learning by imitation, where a rich conceptual
representation of the observed actions is built [28].
Another interesting knowledge-based approach for cognitive robots is the one
from Shapiro and Ismail, the GLAIR architecture (Grounded Layered
Architecture with Integrated Reasoning), which consists of three levels. The
knowledge level (KL) is the one in which conscious reasoning takes place [145].
The KL is implemented by the SNePS and SNeRE (SNePS Relational Engine)
logic and knowledge representation and reasoning systems [83, 146], both based
on Common Lisp. They evaluate their approach using the robot Cassie, which
anchors the abstract symbolic terms that denote the agent’s mental entities
in the domains of: a) perceivable entities and properties; b) actions; c) time;
and d) language in the lower-level architectures used by the embodied agent to
operate in the real world.
Lopes et al. describe a way to utilise the KR&R component for knowledge
acquisition and information disambiguation [99]. Similarly, work by Melchert
et al. presents results using symbolic knowledge representation and reasoning
for perceptual anchoring. They extend the anchoring framework from [32] using
the LOOM KR&R system, to incorporate a subset of the DOLCE ontology
regarding physical entities. In this way they manage the symbolic information
while exploiting the acquired knowledge and inference mechanisms in order
to recover from failures in the anchoring process. In a simulated scenario,
their system communicates with a user to interactively resolve an ambiguous
description using the knowledge-base [114, 115].
Modayil and Kuipers examine unsupervised learning approaches in order to
bootstrap an ontology of objects from a robot’s sensor input. Four learning
stages are combined, in which an object is first individualised, then tracked
and described (using shape models), and finally categorised [119].
Work by Mendoza et al. builds further on Harnad’s ‘solution’ [63] of the
SGP, by designing an architecture for object categorisation, on the basis of
features and context. They use simple iconic and categorical representations
that are causally connected to the robot’s sensory sub-systems, thus providing
an elementary grounding upon which Semantic Web technologies are applied
in the domain of robot soccer [116].
Mozos et al. present an integrated approach for creating conceptual
representations of the spatial and functional properties of typical indoor
environments using mobile robots [121]. Their multi-layered model represents
maps at different levels of abstraction, using laser and vision sensors for
place and object recognition. Their system is endowed with an OWL-based
commonsense ontology of an indoor environment, which describes taxonomies
(is-A relations) of room types and the typical objects found therein, through
has-A relations. Zender et al. present another integrated instance of the
system, with functionalities such as perception of the world, natural language,
learning and reasoning, thus exploiting interdisciplinary state-of-the-art
components in a mobile robot system (CoSy Explorer)4 [174]. Their work is
highly focused on cross-modal integration, ontology-based mediation and
multiple levels of abstraction of perception.
Tenorth and Beetz present a practical approach to robot knowledge
representation and processing [154, 157]. KNOWROB is a first-order knowledge
representation system based on description logics, which focuses on the
automated acquisition of grounded (action-centred) concept representations
through observation and experience. Their framework copes with inference
under uncertainty in autonomous robot control. The robot collects experiences
4 Cognitive Systems for Cognitive Assistants - CoSy, http://www.cognitivesystems.org
while executing actions and uses them to learn models and aspects of
action-related concepts, anchored in the robot’s perception and action system.
They use the fraction of Cyc’s upper ontology which describes the relations
needed for mobile manipulation tasks [156]. They further extended their
approach to KNOWROB-MAP [158], a system for building environment models by
combining spatial information about objects in the environment with semantic
knowledge (perceptual, encyclopaedic & common-sense).
Finally, a logic-based anchoring approach is presented by Fichtner and
Thielscher, albeit from a theoretical standpoint. Their approach to the
anchoring problem is based on the Fluent Calculus, and they present
preliminary results via an example dealing with multiple hypotheses for
correspondences [52, 53].
2.7.2 Anchoring with commonsense information
Commonsense knowledge plays an important role in our everyday lives: we humans apply it so often that we rarely think of it explicitly. Such knowledge encodes our default assumptions and expectations about possible experiences, and robots likewise need such default assumptions and expectations. Projects dealing with modelling commonsense in machines include, among others, the Cyc project [93, 94] and OMCS (Open Mind Common Sense) [147] (see § 3.7).
Early approaches to symbol grounding using commonsense information employed primarily simple robots and considered basic commonsense information from simple sensors [81, 141, 144]. However, approaches in cognitive robotics have recently begun to consider how to truly integrate or exploit large-scale commonsense knowledge-bases, mainly due to the recent explosion of Semantic Web technologies and ontology languages. Specifically, the research behind this thesis can be considered among the first efforts to integrate a large-scale deterministic commonsense knowledge-base with a cognitive robotic system perceiving the physical world.
Among the most notable related approaches is that of Tenorth et al., which concerns household robots that use commonsense knowledge to accomplish everyday tasks. They present an integrated system which focuses on the generation of complex plans. They propose to transform task descriptions from web sites such as ehow.com into executable robot plans, using methods for converting the instructions from natural language into a formal logic-based representation and then resolving the word senses using the WordNet lexical database and a fragment of the Cyc upper ontology [155]. An extension of their framework adds collections of commonsense knowledge to the KB, in order to enable flexible inference over control decisions under changing environmental conditions. Their system converts commonsense knowledge included in the OMCS [147] database from natural language into a Description Logic (DL) representation using the Web Ontology Language
CHAPTER 2. BACKGROUND & RELATED WORK
(OWL) [84]. Finally they show how to ground the abstract task descriptions in
the perception and action systems of the robot. This aspect was also studied
in another instance, where they report a top-down guided 3D-CAD model-based vision algorithm, which is influenced by the commonsense knowledge
obtained from the WWW, in order to ground objects (tableware and cutlery)
in an assistive household environment [126]. Finally their approaches have been
integrated with KNOWROB, a robot knowledge processing system (see § 2.7.1).
Another emerging approach which considers a commonsense ontology is
presented by Lemaignan et al. in a knowledge processing framework for robotics
called the OpenRobots Ontology kernel (ORO), which allows previously acquired symbols to be turned into interlinked concepts, thus enabling reasoning [92]. Knowledge in ORO is represented in a first-order logic formalism, as RDF triples (e.g. <robot isIn kitchen>) in OWL-DL, and the knowledge-base is
implemented using the Jena semantic web framework5 in conjunction with
the Pellet6 reasoner. The OpenRobots Common Sense ontology is closely
aligned with the open-source OpenCyc7 upper ontology defining classes and
predicates focused on concepts useful for interaction with humans.
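The triple-based style of representation used by ORO can be illustrated with a minimal, stdlib-only sketch; the real system stores OWL-DL triples in the Jena framework and reasons with Pellet, so the store class and its query helper below are simplified assumptions for illustration only.

```python
# Minimal stdlib-only illustration of an RDF-style triple store of the kind
# ORO builds on; the real system uses the Jena framework with the Pellet
# reasoner, so this class and its query helper are simplified assumptions.

class TripleStore:
    def __init__(self):
        self.triples = []                # (subject, predicate, object)

    def add(self, s, p, o):
        if (s, p, o) not in self.triples:
            self.triples.append((s, p, o))

    def query(self, s=None, p=None, o=None):
        """All triples matching the pattern; None acts as a wildcard."""
        return [t for t in self.triples
                if (s is None or t[0] == s)
                and (p is None or t[1] == p)
                and (o is None or t[2] == o)]

kb = TripleStore()
kb.add("robot", "isIn", "kitchen")       # the <robot isIn kitchen> example
kb.add("cup-1", "isOn", "table")
print(kb.query(s="robot"))  # [('robot', 'isIn', 'kitchen')]
```

A DL reasoner would additionally infer implicit triples (e.g. from class subsumption), which this pattern-matching store deliberately omits.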
2.7.3 Cooperative Anchoring
Thus far, we have considered anchoring only in the context of single robotic
systems. However in the case where symbolic descriptions or perceptual data
are distributed between multiple robotic agents with heterogeneous sensors,
we speak of cooperative anchoring [86]. Besides extracting and exchanging
object descriptions, cooperative robots may directly exchange representations
based on perceptual data. The idea of cooperative anchoring itself stems from
the philosophical roots of the social symbol grounding problem (see § 2.5.3).
Yet most of the related work in cooperative anchoring disregards the aspects of language development, focusing instead on the technical aspect of distributed perception, where the different agents exchange anchors and perceptual data coming from their embedded sensors and from other robots (multi-robot systems).
The most closely related cluster of work in cooperative anchoring comes from LeBlanc and Saffiotti, who propose an anchoring framework for both single-robot and cooperative anchoring. They primarily consider the problem of fusing pieces of information from a distributed system to create the global notion of an anchor in a shared multi-dimensional domain, called the anchoring space. The framework represents information using a conceptual spaces [58] approach, allowing various types of object descriptions to be associated with uncertain and heterogeneous perceptual information. Their implementation uses fuzzy sets to represent, compare and combine information,
5 Jena Semantic Web Framework, http://jena.apache.org/
6 Pellet: OWL 2 Reasoner for Java, http://clarkparsia.com/pellet/
7 OpenCyc KB, http://www.opencyc.org
and includes a cooperative object localisation method that takes into account uncertainty in both observations and self-localisation. Experiments using simulated and real robots validate the proposed framework and the
cooperative object localisation method [85–89].
Kira presents the problem of how heterogeneous robots with largely different
capabilities can share experiences in order to speed up learning. His work
focuses specifically on differences in sensing and perception, which can be
used both for perceptual categorisation tasks as well as determining actions
based on environmental features. He studied methods and representations
which allow perceptually heterogeneous robots to a) represent concepts via
grounded properties; b) learn grounded property representations such as colour or
texture categories; and c) build models of their similarities and differences, in
order to map their respective representations. He proposes an approach using
Gärdenfors’ conceptual spaces [58] representation, where object properties are
learned and represented as Gaussian Mixture Models in a metric space. Then, he
uses confusion matrices, obtained in a shared context and built using instances
from each robot, in order to learn the mappings between the properties of each
robot. These mappings are then used to transfer a concept from one robot
to another, where the receiving robot has not been previously trained on the
specific instances of objects. He finally shows that the abstraction of raw sensory
data into an intermediate representation can be used not only to aid learning,
but also to facilitate the transfer of knowledge between heterogeneous robots.
Furthermore, he utilises statistical metrics to determine which underlying
properties are to be shared between the robots. Using the methods described
above, two heterogeneous robots having different sensors and representations
are able to successfully transfer support vector machine (SVM) classifiers among
each other, resulting in considerable speed-ups during learning [76–79].
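The confusion-matrix mapping between the property categories of two heterogeneous robots can be sketched roughly as follows; the simple counting scheme, the argmax rule and the example colour labels are simplifying assumptions, not the exact method of [76–79]:

```python
# Sketch of learning a mapping between the colour categories of two
# heterogeneous robots from a confusion matrix built in a shared context,
# in the spirit of Kira's approach; the counting scheme and argmax rule
# are simplifying assumptions.
from collections import defaultdict

def learn_mapping(paired_labels):
    """paired_labels: [(label_of_robot_A, label_of_robot_B), ...] produced
    while both robots observe the same object instances. Returns a mapping
    from each of A's categories to B's most co-occurring category."""
    counts = defaultdict(lambda: defaultdict(int))   # the confusion matrix
    for a, b in paired_labels:
        counts[a][b] += 1
    return {a: max(row, key=row.get) for a, row in counts.items()}

# Both robots label the same objects; their colour vocabularies differ:
shared = [("red", "crimson"), ("red", "crimson"), ("red", "scarlet"),
          ("blue", "navy"), ("blue", "navy")]
print(learn_mapping(shared))  # {'red': 'crimson', 'blue': 'navy'}
```

In the actual approach the properties are Gaussian Mixture Models in a conceptual space rather than discrete labels, but the shared-context counting idea is the same.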
Another instance of work is presented by Bonarini et al. [8], where the authors extend their single-robot symbol grounding solution to a multi-agent setting, combining the information from different agents into a global representation at the conceptual level using a fusion model based on clustering techniques.
Mastrogiovanni et al. [110] also present a distributed knowledge representation and data fusion system for an ambient intelligence environment, which consists of several cognitive agents with different capabilities. The architecture is based on the idea of an ecosystem of interacting artificial entities, where the framework for collaborating agents allows them to perform intelligent multi-sensor data fusion. Despite its simplicity, it is able to manage heterogeneous information at different levels, thus “closing the loop” between sensors and actuators.
2.7.4 Anchoring for Human-Robot Interaction
Human-robot interaction (HRI) is a recent and growing field that aims to study
how humans interact with robots and ways of making this interaction more
effective [54, 131]. An emerging trend in the context of anchoring is to study the
problem from the perspective of the interaction that occurs in order to facilitate
communication with humans. Anchoring is well suited for HRI applications and
especially when humans communicate with robots using grounded symbolic
information. For example, a dialogue system for human-robot collaboration
where the dialogues concern physical objects, can be thought of as an instance of
the anchoring problem.
Chella et al. present a system for advanced verbal interactions between
humans and artificial agents with the aim to learn a simple language in which
words and their meaning are grounded in the sensory-motor experiences of
the agent. The system learns grounded language models from examples with
minimal user intervention and without feedback. Then, it has been used
to understand and subsequently to generate appropriate natural language
descriptions of real objects, engaging in verbal interactions with a human
partner [30].
Lemaignan et al. present their work in grounded verbal interaction [90]. They
propose a knowledge-oriented architecture, where perceptions from different
points of view (from the robot itself or the human) are turned into symbolic
facts, stored in different cognitive models. The framework includes a component
for natural language interpretation that relies on these structured symbolic
models of the world [91]. They also present a set of strategies that allow a
robot to identify the referent when the human partner refers to an object giving
incomplete information (i.e. an ambiguous description) [133].
Another example of an implemented dialogue system is explored in [82,
106], in the context of situated dialogue processing. This approach couples
incremental processing with the notion of bidirectional connectivity, inspired by
how humans process visually situated language in both top-down and bottom-up
fashion. Information about the object state as well as a history of the object state
is used to describe changes in a scene. Furthermore, complementary research
presented in [97], adds a framework for constructing rich belief models of the
robot’s environment, using Markov Logic as a unified framework for inference
over these beliefs. The approach is being integrated in an implementation of a
distributed cognitive architecture for mobile robots interacting with humans
using spoken dialogue. The constructed belief models evolve dynamically over
time and incorporate various contextual information, such as spatio-temporal
framing, multi-agent epistemic status and saliency measures.
Two further aspects important for HRI concern the generation of referring expressions and of spatial relations. The architecture of Kruijff and Brenner mentioned above has also been used by Zender et al., who present an approach for generating and resolving referring expressions (REs) in conversational mobile robots, which identifies and distinguishes spatial entities in
a large-scale space (e.g., in an office environment). Their approach is based
on a spatial knowledge-base encompassing both robot- and human-centric
representations. An important feature of this approach is that it considers
descriptions that contain spatial relations among the objects [175–177].
Spatial relations are very important when describing and sharing information
about objects. Melchert et al. present a framework for computing the spatial
relations between anchors in the context of HRI [114]. In this work they extend
an anchoring framework to include a set of binary spatial relations, which
were used to provide meaningful object descriptions but also to facilitate human participation in the anchoring process, by using human interaction to disambiguate between visually similar objects (i.e. anchors).
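As an illustration, a set of binary spatial relations between two anchored objects can be computed from their grounded 2D positions; the particular relations and the distance threshold below are assumptions made for the sketch, not the definitions used in [114]:

```python
# Illustrative binary spatial relations between two anchored objects,
# computed from their 2D positions; the relation definitions and the
# 0.5 m "near" threshold are assumptions, not those of [114].
import math

def spatial_relations(pos_a, pos_b, near_threshold=0.5):
    """Relations that hold for object A relative to object B."""
    dx = pos_a[0] - pos_b[0]
    dy = pos_a[1] - pos_b[1]
    relations = set()
    if math.hypot(dx, dy) < near_threshold:
        relations.add("near")
    relations.add("rightOf" if dx > 0 else "leftOf")
    relations.add("inFrontOf" if dy > 0 else "behind")
    return relations

print(spatial_relations((1.0, 2.0), (0.8, 2.1)))  # near, rightOf, behind
```

Such relations, once attached to anchors, give a human partner a vocabulary ("the cup near the box") for steering the disambiguation.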
Another aspect of HRI that has been explored in the context of anchoring
is person tracking. Kleinehagenbrock et al. present a hybrid approach for
integrating vision with distance information in order to track a human [80].
They use laser range data to extract the legs of a person, and camera images of the person’s upper body to extract the skin-coloured face.
They combine the different percepts which originate from the same person into their symbolic counterparts, using the standard anchoring processes as they are defined in [38].
2.7.5 Other approaches to Anchoring
Work by Karlsson et al. studied anchoring from the perspective of the robot
when monitoring the execution of its plans while detecting failures [15, 73]. In
their work they present a schema, which responds to failures occurring before or
while an action is executed. The failures they address are ambiguous situations
which arise when the robot attempts to anchor symbolic descriptions (relevant
to a plan action) in perceptual information. They present an extension of the
original anchoring framework that deals robustly with ambiguous situations,
by providing general methods for detection and recovery using planning-based
methods.
Loutfi et al. discuss the problem of maintaining coherent perceptual information in a mobile robotic system, which operates over extended periods of
time. Their system interacts with a human and uses multiple sensing modalities
in order to gather information about the environment or specific objects [100].
Their anchoring extension relies on a challenging modality, olfaction, and is
capable of tracking and acquiring information from observations derived from
sensor data as well as information from a-priori symbolic concepts. In the same
context they present how they integrate the electronic nose in a multi-sensing
mobile robot [101].
One other aspect of anchoring concerns semantic mapping, where Galindo
et al. apply a multi-hierarchical approach that enables a mobile robot to acquire
spatial information from its sensors and link it to semantic information via
anchoring [57]. The spatial component of the map is used to plan and execute
robot navigation tasks, while the semantic component enables the robot to
perform symbolic reasoning.
Anchoring is also addressed in a number of works which focus on the problem
from a different perspective. For instance, Heintz et al. present a stream-based
hierarchical anchoring framework extending their DyKnow knowledge processing
middleware, which is used to process perceptual information. They make this component available to the other cognitive layers, while further extending their approach to support reasoning in the context of successfully performing complex missions with unmanned aerial vehicles [68, 69].
2.7.6 Summary of Related Approaches
Comparing the related clusters of work, we primarily see that even though there
is much heterogeneity in the domains that study or relate to anchoring, only a
few approaches follow an integrative stance when grounding symbols to sensor
data in cognitive systems. Work in the cooperative anchoring domain seems to
lean toward low-level sensing and grounding, thus emphasising uncertainty and the multi-modal aspects. However, they appear to miss the high-level symbolic aspects related to knowledge representation and reasoning [78, 88]
with the exception of Mastrogiovanni et al. and their work on the distributed
data fusion approach [110] which explicitly considers symbolic knowledge
representation and reasoning. We observe that integrative approaches that
consider anchoring, knowledge grounding and symbolic reasoning, may rely on
diverse representations of knowledge which vary according to the application.
Gärdenfors’ conceptual spaces representations are generally favoured in the
context of vision and anchoring [25–28, 86, 88].
On the symbolic side, mainly description logics are routinely used as the
logical formalism behind the representation of knowledge [92, 154, 177]. Even
though some knowledge based models intended for cognitive robots introduce
the concept of commonsense information in specific contexts, such as everyday
tasks [155], important aspects like multi-modal integration with cross-modal representations, memory and grounding are not examined in detail. An exception is the ORO knowledge management platform [92], which explicitly deals with grounding linguistic interaction. In the context of linguistic interaction, anchoring approaches study primarily the generation of referring expressions and spatial relations [114, 175–177]. Finally, approaches tend to focus on the logical representations instead of the practical perceptual aspects behind the grounding and anchoring. However, with the advancement of semantic web technologies and on-line commonsense knowledge repositories, we can expect knowledge-enabled robots that eventually communicate with and make use of grounded representations.
3
Methods
3.1 Introduction
In this chapter we review the methods used in the different studies of the thesis,
where the goal was to specify, design and implement a complete anchoring
framework capable of addressing the percept to concept correspondence in an
artificial cognitive agent operating in the physical world. Typical components of
such a system concern the sensor processing and perceptual algorithms behind
the perceptual system, an appropriate knowledge representation formalism
for the semantic system, and finally a novel augmentation of the anchoring
framework capable of supporting the integration of the two heterogeneous systems.
In this chapter, I first briefly introduce the initial anchoring model which was
used as a template in the augmented anchoring framework presented in the
different studies (Papers I-IV). I then consider the sensors and data processing
parts, while I review some of the relevant state-of-the-art object detection
and classification methods used in this thesis (Papers II-IV). Subsequently, I briefly outline the principles behind the localisation algorithm used by the
mobile robot (Papers II & III). The semantic knowledge is represented via
description logics (in Papers III & IV), to which I dedicate a section. Finally, I describe, compare and contrast some widely known commonsense knowledge-base systems, to complement the understanding of the corresponding scientific
papers (Papers I-IV) collected in the thesis.
3.2 Anchoring Model
The present thesis extends the computational model of anchoring presented
in [32]. In this section we summarise the initial anchoring framework and its basic functionalities from [32]. The following definitions
allow us to define objects in terms of their (symbolic and perceptual) properties.
We consider an environment E in which there is a set O of x physical objects, denoted O = {o1, ..., ox}, where the value of x is unknown. The set of objects O can change over time, since objects can be inserted into or removed from the environment. There are also n > 0 agents, denoted ag = {ag1, ..., agn}. If n = 1, the problem reduces to single-robot anchoring, while for n > 1 we deal with the cooperative anchoring problem. Each agent ag in the environment E has a perceptual anchoring system Aag.
3.2.1 Components
A typical anchoring system (Aag ) as defined in [32] is composed of a perceptual system, a predicate grounding relation and a symbol system.
Anchoring focuses on the problem of creating and maintaining the correspondences between the symbol and perceptual systems using the grounding
relations. We now formally define these elements, which allow us to characterise an object ox in terms of its symbolic and perceptual properties.
A perceptual system Ξ includes a set Π = {π1, π2, ...} of percepts, a set Φ = {φ1, φ2, ...} of attributes, and perceptual routines. A percept π is a structured collection of measurements assumed to originate from the same physical object; an attribute φ is a measurable property of a percept, with values in the domain D(φ). We let:

D(Φ) =def ⋃_{φ∈Φ} D(φ).    (3.1)
The perceptual system continuously generates percepts and associates each
percept with the observed values of a set of measurable attributes.
Definition 1. A perceptual signature γ : Φ → D(Φ) is a partial function from
attributes to attribute values. The set of attributes on which γ is defined is
denoted by feat(γ).
A symbol system Σ includes a set X = {x1, x2, ...} of individual symbols (variables and constants) and a set P = {p1, p2, ...} of predicate symbols, used to denote objects in the world. The symbol system assigns unary predicates to symbols which denote objects.
Definition 2. A symbolic description σ ∈ 2P is a set of unary predicates.
Intuitively, a symbolic description lists the predicates that are considered
relevant to the perceptual identification of an object; and a perceptual signature
gives the values of the measured attributes of a percept. A predicate grounding
relation g ⊆ P × Φ × D(Φ) embodies the correspondence between unary
predicates and values of measurable attributes. The g relation can be used to
match a symbolic description σ and a perceptual signature γ as follows.
match(σ, γ) ⇔ ∀p ∈ σ. ∃φ ∈ feat(γ). g(p, φ, γ(φ))    (3.2)
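As a concrete, hypothetical illustration, the grounding relation g and the match test of Eq. 3.2 can be sketched as follows; the specific predicates, attributes and thresholds are invented for the example and are not part of the original framework:

```python
# Sketch of a predicate grounding relation g ⊆ P × Φ × D(Φ) and the match
# test of Eq. 3.2. The predicates ("red", "large"), attributes ("hue",
# "area") and threshold values are invented for illustration.

def g(predicate, attribute, value):
    """Does the observed attribute value support the unary predicate?"""
    rules = {
        ("red", "hue"): lambda v: v < 30 or v > 330,  # hue in degrees
        ("large", "area"): lambda v: v > 5000,        # area in pixels
    }
    rule = rules.get((predicate, attribute))
    return rule(value) if rule else False

def match(sigma, gamma):
    """Eq. 3.2: every predicate in the symbolic description sigma is
    supported by some attribute in the perceptual signature gamma."""
    return all(
        any(g(p, phi, gamma[phi]) for phi in gamma)
        for p in sigma
    )

# A symbolic description and a perceptual signature (partial map Φ → D(Φ)):
sigma = {"red", "large"}
gamma = {"hue": 10, "area": 7500}
print(match(sigma, gamma))  # True: both predicates are grounded
```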
Here it is important to mention that the grounding relation g concerns properties, whereas anchoring concerns objects. The correspondence between symbols and percepts for a specific object ox in the environment E is reified in an internal data structure called an anchor, α(ox, t). The anchoring process is responsible for creating and maintaining anchors, and since new percepts are generated continuously within the perceptual system, anchors are indexed by time. At every moment t, α(ox, t) contains: a unique symbol meant to denote the object ox; the perceptual signature γ; and the symbolic description σ of the object.

α(t) = ⟨x, π, γ⟩ ∈ X × Π × (Φ → D(Φ))    (3.3)
We denote the components of an anchor α(ox, t) by α^sym_{ox,t}, α^per_{ox,t} and α^val_{ox,t} respectively. If the object is not observed at time t, then α^per_{ox,t} is the ‘null’ percept ∅ (by convention, ∀t : ∅ ∉ Vt), while α(ox, t) still contains the best available estimate. In order for an anchor to satisfy its intended meaning, the symbol and the percept in it should refer to the same physical object. This requirement cannot be formally stated inside the system. What can be stated is the following (recall that Vt is the set of percepts which are perceived at t).
Definition 3. An anchor α(ox, t) is grounded at time t iff both α^per_{ox,t} ∈ Vt and match(∆t(α^sym_{ox,t}), St(α^per_{ox,t})).
We informally say that an anchor α is referentially correct if, whenever α is grounded at t, the physical object ox denoted by α^sym_{ox,t} is the same as the one that generates the percept α^per_{ox,t}. The anchoring problem, then, is the problem of finding referentially correct anchors. The anchors are stored in the anchoring space.
Definition 4. An anchoring space A is a multi-modal space whose dimensions are the qualities of interest in the application domain.
3.2.2 Functionalities
Manipulation of anchors, according to [32], is done mainly through three abstract functionalities, which: a) create a grounded anchor the first time that the object denoted by x is perceived; b) continuously update the anchor while observing the object; and c) update the anchor when we need to reacquire the object after it has not been observed for some time. In the following, t denotes the time at which the functionality is called.
Algorithm 1: Acquire
Input: symbol x
begin
    α ←− anchor for x;
    γ ←− Predict(α^val_{t−k}, x, t);
    π ←− Select π′ ∈ Vt | Verify(St(π′), ∆t(x), γ);
    if π ≠ ∅ then
        γ ←− Update(γ, St(π), x);
    α(t) ←− ⟨x, π, γ⟩;
    return α(t)
(Re)Acquire establishes the symbol-percept association for an object o which has not been observed for some time. It takes as input a symbol x with an anchor α defined for t − k and extends α’s definition to t. It first predicts a new signature γ; then checks whether there is a new percept compatible with both the prediction and the symbolic description; if so, it updates γ. Prediction, verification of compatibility, and updating are domain dependent; verification should typically use match.
Algorithm 2: Track
Input: α(t − 1)
begin
    x ←− α^sym_{t−1};
    γ ←− OneStepPredict(α^val_{t−1}, x);
    π ←− Select{π′ ∈ Vt | Verify(St(π′), ∆t(x), γ)};
    if π ≠ ∅ then
        γ ←− Update(γ, St(π), x);
    α(t) ←− ⟨x, π, γ⟩;
    return α(t)
Track updates an existing anchor by incorporating new perceptual information. The track functionality takes as input an anchor α(ox, t − 1) and extends its definition to t. Tracking ensures that the percept pointed to by the anchor is the most recent and adequate perceptual representation of the object. Signatures can be updated as well as replaced, but by preserving the anchor structure we affirm the persistence of the object, so that it can be used even when the object is out of view. This facilitates the maintenance of information while the agent is moving, as well as maintaining a longer-term, stable representation of the world on a symbolic level without being affected by perceptual glitches.
Algorithm 3: Find
Input: symbol x
begin
    π ←− Select {π′ ∈ Vt | match(∆t(x), St(π′))};
    if π = ∅ then
        fail;
    else
        α(t) ←− ⟨x, π, St(π)⟩;
        return α(t)
Find accepts as input a symbol x and a symbolic description and returns a grounded anchor α(ox, t) defined at t. It checks whether existing anchors that have already been created by Acquire satisfy the symbolic description, and in that case it selects one. Otherwise, it performs a similar check against existing percepts (in case the description does not satisfy the constraints on percepts considered by Acquire).
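A minimal sketch of the Find functionality, under simplified assumptions (a percept is a plain attribute dictionary, S_t is a caller-supplied describe function, and Select simply takes the first matching percept), might look as follows:

```python
# Sketch of the Find functionality (Algorithm 3) under simplified
# assumptions: percepts are attribute dictionaries, `describe` plays the
# role of S_t, and Select simply takes the first matching percept.

def find(x, sigma, current_percepts, match, describe):
    """Return a grounded anchor ⟨x, π, S_t(π)⟩, or None on failure."""
    candidates = [pi for pi in current_percepts
                  if match(sigma, describe(pi))]
    if not candidates:
        return None                      # the "fail" branch of Algorithm 3
    pi = candidates[0]                   # Select: here, the first match
    return (x, pi, describe(pi))

# Example with a trivial signature and a hue-threshold match:
percepts = [{"hue": 10}, {"hue": 200}]
anchor = find("cup-1", {"red"}, percepts,
              match=lambda s, gamma: "red" in s and gamma["hue"] < 30,
              describe=lambda pi: pi)
print(anchor)  # ('cup-1', {'hue': 10}, {'hue': 10})
```

A fuller implementation would first search the existing anchors before falling back to raw percepts, as the prose above describes.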
3.2.3 Relation to Publications
The anchoring framework described in this section was initially implemented for
the studies behind Papers I, II and V. Apart from the implementation, Papers
II and V deal with: a) the formation of percepts from the visual and spatial
modalities; b) the definition of the grounding relations used in the linguistic
interaction scenarios; c) the extension of the framework in order to support the
integration with the large scale knowledge-base. The knowledge management,
synchronisation and memory components complement as well the implemented
framework. In paper III the initial implementation was revised in order to
account for the cooperative aspect, borrowing concepts from the cooperative
anchoring framework of LeBlanc [88], however from a semantic point of view.
This iteration extended the anchoring framework to account for bottom-up and
top-down information acquisition and addressed the integration of a perceptual-semantic knowledge-base between different perceptual agents. Furthermore, the
structure of the anchor was defined, especially in the context of processing
multi-modal perceptual information. In paper IV the framework was extended
in order to define concept anchors, which were used as intermediate structures
between different perceptual systems within a single agent. In this context the
conceptual anchoring framework defined more in detail the multi-modal part
of the anchor and introduced the perceptual and semantic distances used in
modelling the perceptual and semantic descriptions respectively. Finally the
modelling of the anchor indiscernibility, tolerance and nearness relations led to
the anchor nearness measure.
3.3 Sensors
Sensors are an essential part of every robotic system, allowing the robot to
acquire information from the environment, whether the robot is trying to localise
itself, navigate, recognise objects or understand a scene. One central requirement
in modelling perception in cognitive robots is to be able to identify high-level
features such as objects and doors in perceptual data. These high-level features
can be acquired using a variety of sensing devices. Sensors can be classified as
proprioceptive or exteroceptive and as active or passive. Proprioceptive sensors
allow internal measurements while an exteroceptive sensor on the other hand is
able to take measurements of the robot’s environment. These measurements can
further be used to extract features that are representative of the surrounding
environment. In this thesis, the following two main sensors were used: a laser
range-finder (active sensor) and a camera (passive sensor).
3.3.1 Laser Scanner
A popular sensor among robotics researchers, the Laser Range Scanner (LRS) is
well suited for making distance measurements on a two dimensional plane (or on
three dimensions in modern devices). A single laser scanner takes measurements
in an arc rotating around the sensor centre. These measurements are used for
navigation, obstacle avoidance, localisation and detection of free paths to reach a
goal point. The advantage of this sensor is that it makes distance measurements with high accuracy, with errors in the range of centimetres, although the range is limited to tens of metres. The surface reflectance at the point where the laser beam reaches the target not only affects the effective range but also introduces error into the measurements, giving a windowing effect.
The laser beam of the laser scanner sweeps the plane in an arc and makes
distance measurements at discrete angles. The resolution of bearing measurements depends on how many measurements occur in one sweep. The laser
scanner used in this thesis is a SICK LMS-200; interfaced using a high speed
USB-serial connection. This allows a scan rate of up to 72 scans per second
with 181 measurements per scan. The range used is 8 m, and at this range the absolute range accuracy is about 1 cm. For the experiments in this thesis, the
laser scanner is mounted approximately 20cm above the ground and on the
base of the robot.
3.3.2 Camera
Digital cameras are the most common vision-based sensors mounted on mobile
robots. Common camera types include CCD and CMOS sensors, which have seen widespread use since the earliest experiments in robotics. Usually, cameras
mounted on a mobile robot work with the light of the visible spectrum, however
there are cameras that can work with other portions of the electromagnetic
spectrum (infra-red).

Figure 3.1: The pinhole model (left). The robot and sensors used in the thesis (right).

A simple model of what happens when light from a scene reaches a camera is the pinhole camera model [67] (see Fig. 3.1). In this model, light from a scene point passes through a single point (e.g., a small aperture) and projects an inverted image onto a plane called the image plane. In the pinhole camera model, the small aperture is also the origin C of a 3D coordinate system whose Z axis lies along the optical axis (i.e. the viewing direction of the camera). The sensors used for the experiments in the thesis are: the SONY PTZ robotic camera embedded on the ActivMedia PeopleBot platform, and two Logitech QuickCam Vision Pro cameras, one mounted on the mobile robot and one on the ceiling (Fig. 3.1).
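Under the pinhole model, a scene point (X, Y, Z) expressed in the camera frame projects to image coordinates (fX/Z, fY/Z), where f is the focal length. A small sketch (the focal length value is an arbitrary illustrative choice):

```python
# Pinhole projection of a 3D point expressed in the camera coordinate
# system onto the image plane: x = f*X/Z, y = f*Y/Z. The focal length
# below is an arbitrary illustrative value.

def project(point, focal_length):
    """Project (X, Y, Z), with Z > 0, to image-plane coordinates (x, y)."""
    X, Y, Z = point
    if Z <= 0:
        raise ValueError("the point must lie in front of the camera")
    return (focal_length * X / Z, focal_length * Y / Z)

print(project((0.5, 0.25, 2.0), focal_length=500.0))  # (125.0, 62.5)
```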
3.4 Mobile Robot Localisation
The localisation of the robot is important to establish the position of the
anchored objects. We use an implementation of the Adaptive Monte-Carlo
Localisation (AMCL)1 algorithm described by Fox et al. [55]. Essentially it is
an implementation of the particle filter applied to robot localisation and has
become very popular in the robotics literature. The Monte-Carlo method can
be used to determine the position of a robot given a map of its environment.
At a conceptual level, the algorithm maintains a probability distribution over
the set of all possible robot poses and updates this distribution using data from
odometry, sonar and / or laser range-finders. A large number of hypothetical
current configurations are initially randomly scattered in the configuration space.
With each sensor update, the probability that each hypothetical configuration
1 Implemented in Player/Stage (http://playerstage.sourceforge.net)
CHAPTER 3. METHODS
is correct, is updated based on a statistical model of the sensors and Bayes’
theorem. Similarly, every motion the robot undergoes is applied in a statistical
sense to the hypothetical configurations based on a statistical motion model.
When the probability of a hypothetical configuration becomes very low, it is
replaced with a new random configuration.
At the implementation level, the method represents the probability distribution using a particle filter. The filter is “adaptive”, because it dynamically
adjusts the number of particles: a) when the robot’s pose is highly uncertain,
the number of particles is increased; and b) when the robot’s pose is well
determined, the number of particles is decreased. The driver is therefore able
to make a trade-off between processing speed and localisation accuracy.
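The predict/update/resample cycle described above can be sketched as a plain (non-adaptive) particle filter. This is a toy 1D illustration, not the Player/Stage AMCL implementation used in the thesis; the Gaussian sensor and motion models and all parameter values are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def mcl_step(particles, weights, motion, measurement, world_size,
             sigma_motion=0.1, sigma_meas=0.5):
    """One predict/update/resample cycle of Monte-Carlo localisation on a
    1D circular corridor. The 'map' is simply the corridor length; the
    sensor directly measures the robot's position (a stand-in for range
    readings compared against a known map)."""
    n = len(particles)
    # Motion update: apply the odometry reading to every hypothesis,
    # perturbed according to a statistical motion model.
    particles = (particles + motion
                 + rng.normal(0.0, sigma_motion, n)) % world_size
    # Sensor update: weight each hypothesis by the likelihood of the
    # observation under a Gaussian sensor model (Bayes' theorem).
    weights = np.exp(-0.5 * ((particles - measurement) / sigma_meas) ** 2)
    weights /= weights.sum()
    # Resample: low-weight hypotheses are replaced by copies of likely
    # ones (the "replaced with a new configuration" step).
    idx = rng.choice(n, size=n, p=weights)
    return particles[idx], np.full(n, 1.0 / n)
```

After a few cycles the initially scattered hypotheses collapse around the true pose; AMCL additionally adapts the number of particles, which is omitted here.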
As mentioned earlier, the driver also requires a pre-defined map of the
environment, against which to compare observed sensor values, as it is the only
item that indicates to the robot how the surroundings should look if it had
perfect perception abilities. A location-based metric map describes the entire
operating environment by defining all the occupied and non-occupied spaces,
an example of which is an occupancy grid. The map is used to estimate the
localisation state when sensor measurements are available. Range measurements
to obstacles are sensor representations that may work well with a
location-based map. Fig. 3.2 shows the (semantic) map provided for the experiments as
well as the actual output of the AMCL algorithm.
Figure 3.2: The semantic map of the PEIS-Home used in the experiments (left) is
a combination of the metric map of the environment together with segmented spatial
regions (Kitchen, Living room, Bedroom). Example output of the AMCL algorithm
from the mobile robot in the PEIS-Home (right).
3.5 Vision
3.5.1 Image Representation
A typical vision algorithm requires that the input images are represented in a way
which allows the algorithm to find similarities and differences between them.
This problem can be summarised in two constraints. First, we are bound to
reason in terms of proximity (or equivalently: distance), and not equality between
images. In other words we should not try to determine if visual objects are equal,
but if they are close to each other in a certain (yet unknown) space. Second,
there is an imperative need to drastically reduce the dimensionality of images
by dismissing all the non-relevant or redundant pieces of information they
contain. One common method to achieve these two goals is to consider
finite-dimensional representations of sub-images, or image regions. The computer
vision community has taken a strong turn during the past few years in the
direction of image features, also called visual features or points of interest.
Loosely defined, a feature is a point of interest surrounded by a region,
selected by a region detector. The region detector finds an “interesting”
part of the image which can be identified multiple times. The study of visual features focuses
on two main aspects: i) the detection, or sampling, of finite sets of points that
are relevant to the image (interest point detection); and ii) the description of
the visual neighbourhood of these points as the accumulation of certain local
visual characteristics in a finite-dimensional vector (feature description).
For good performance in visual recognition, it is essential that the features
are robust. The most important quality criteria for descriptors are a compact
representation and high precision and recall when matching descriptors from a
database of images (i.e. descriptive). It is also important that the detector has a
high ability to detect the same features in similar images (i.e. high repeatability).
To ensure that features are repeatable for a particular class of images, some
combination of the following properties are required: a) rotation invariance;
b) scale invariance; c) perspective / affine invariance; d) illumination invariance;
and e) robustness to motion blur and sensor noise (refer to the challenges
mentioned in § 2.2.1.1 & 2.2.1.2).
Interest point detection
A good interest point detector has to locate points that can be detected repeatedly even if the original image is modified or the same scene is shown
under varying conditions. Interest point detectors can be divided briefly
into two categories. Edge and corner detectors are the early interest point
algorithms, which only detected points of interest (e.g., the Harris corner
detector [65]). Newer methods tend to determine regions of interest that fulfil
certain invariance properties (e.g., Difference-of-Gaussians (DoG) [103]) and
are hence called affine detectors.
Figure 3.3: Schema of the Difference-of-Gaussians computation (left). The SIFT
algorithm divides the local patch into sub-regions with image gradients and histogram
construction (right). Figures reconstructed from [104].
Feature description
Feature descriptors are suitable to describe the detected regions of interest. The
majority of recent successful object recognition algorithms make use of interest
regions to select relevant and informative features of the objects to learn. The
“support” region for these feature descriptors is defined by the output of the
interest point / region detectors described above. Extracting and describing
these features is a computationally demanding task that can be prohibitive
for applications that require close to realtime performance (e.g., robot vision).
Fortunately, work is being devoted to achieving faster methods and to developing
implementations of algorithms ready to run on special hardware. For example,
Heymann et al. [70] designed a version of the SIFT feature detector and
descriptor [104] that runs at approximately 20 frames per second on a
Graphics Processing Unit (GPU) at 640 × 480 pixel resolution.
3.5.1.1 Scale Invariant Feature Transform
The Scale Invariant Feature Transform, or simply SIFT, introduced by Lowe,
is the de-facto standard method for a) interest point detection; b) feature
description; and c) matching, because of its robustness to small displacements
and lighting changes and orientation [104].
Originally SIFT was based on a model proposed by Edelman et al. [49],
which found that complex neurons in the primary visual cortex respond to a
gradient at a particular orientation and spatial frequency [49]. Following
this biological model, Lowe used the assumption that nearest-neighbour
correlation-based similarity computation, in the space of outputs of
complex-type receptive fields, can support robust recognition of 3D objects.
However, Lowe's implementation of SIFT has a different computational model,
which uses a scale-invariant region detector based on the DoG together with
the SIFT descriptor, in order to provide a similarity-invariant feature detector.
This involves convolving the initial image with a Gaussian at several scales,
to create images separated by a scale factor k, thus creating the so called
scale space pyramid of convolved images (as seen in Fig. 3.3). These images
are gathered in octaves. Each octave represents a doubling of the standard
deviation σ. The number of images in each octave is s + 3, where s is an
arbitrary number of intervals in each octave. The scale factor is computed as
k = 2^(1/s). Let the difference-of-Gaussians function convolved with the image be:
DoG(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) ∗ I(x, y)
             = L(x, y, kσ) − L(x, y, σ)                    (3.4)
where DoG(x, y, σ) is the difference, G(x, y, σ) is Gaussian, I is the image,
and L(x, y, σ) is the image I convolved with the Gaussian G. The DoG is
known to provide a good approximation of the Laplacian of Gaussian (LoG)
operator, LoG(x, y, σ) ≜ ∆Gσ(x, y), which has been shown to detect stable image
features [118]. Interest points are now detected by selecting points in the image,
which are stable across scales. A candidate point of interest in DoG(x, y, σ)
is compared to its eight neighbours in the current image and to the nine
neighbours in each of the scales above and below. It is selected as a feature
only if it has the largest or smallest value among these 27 points. The selected
local extrema are then the interest points.
For the descriptor, a region is defined around each interest point and
divided into orientation histograms computed on 4 × 4 pixel neighbourhoods.
The dominant orientation is determined by creating a radial histogram of
gradients in a circular neighbourhood around the detected point; the maximum
of this histogram determines the orientation of the key-point and thus enables
rotation invariance. The orientation histograms are expressed relative to the
key-point orientation. Each histogram contains 8 bins, and each descriptor
consists of a 4 × 4 array of 16 histograms around the key-point, yielding a
vector of 4 × 4 × 8 = 128 dimensions (see Fig. 3.3). This vector is normalised
to unit length, so as to enhance invariance to changes in illumination.
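The scale-space construction behind the detector can be sketched as follows. This is a simplified single-octave illustration of Eq. 3.4 (Gaussian blurring, DoG differences, and the 27-point extremum test), not Lowe's full implementation; the blur routine is a truncated-kernel approximation chosen for the example.

```python
import numpy as np

def gaussian_blur(img, sigma):
    """Separable Gaussian blur with a truncated kernel and reflect padding."""
    r = max(1, int(3 * sigma))
    x = np.arange(-r, r + 1)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    pad = np.pad(img, r, mode="reflect")
    # Horizontal then vertical 1D convolution passes.
    tmp = np.apply_along_axis(lambda row: np.convolve(row, k, mode="valid"), 1, pad)
    return np.apply_along_axis(lambda col: np.convolve(col, k, mode="valid"), 0, tmp)

def dog_octave(img, sigma=1.6, s=3):
    """One octave of the scale-space pyramid: s + 3 Gaussian images
    L(x, y, k^i σ) and their s + 2 adjacent differences L(kσ) − L(σ)."""
    k = 2.0 ** (1.0 / s)
    L = [gaussian_blur(img, sigma * k ** i) for i in range(s + 3)]
    return [L[i + 1] - L[i] for i in range(s + 2)]

def is_extremum(dogs, i, y, x):
    """The 27-point test: a point is a key-point candidate only if it is
    the largest or smallest value in its 3x3x3 scale-space neighbourhood."""
    cube = np.stack([d[y - 1:y + 2, x - 1:x + 2] for d in dogs[i - 1:i + 2]])
    v = dogs[i][y, x]
    return v == cube.max() or v == cube.min()
```

With s = 3 an octave yields five DoG images, of which only the three interior ones can host extrema, since the test needs a scale above and below.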
Recently Mikolajczyk and Schmid presented a review where they evaluated
the performance of more than ten different descriptors for image classification
against affine transformations, rotation, scale changes, JPEG compression, illumination changes and blur [117]. The authors conclude that SIFT descriptors
Figure 3.4: Comparison of speed (Hz) vs. image size on SIFT computation with
glsl and cuda, for the initial octave set to 0 and −1 [173]. The measured
speeds (glsl / cuda, in Hz) were:

Image size     oct 0 (glsl / cuda)    oct −1 (glsl / cuda)
320×240        33.0 / 45.2            23.2 / 27.5
640×480        23.0 / 27.1            13.4 / 12.6
800×600        19.2 / 21.2            10.2 / 9.1
1024×768       16.3 / 16.8            7.3 / 6.4
1280×1024      13.2 / 12.5            5.2 / x
1600×1200      9.9 / 9.0              x
2048×1536      7.0 / x                x
performed best, followed by their variant, the Gradient Location Orientation
Histogram (GLOH). The main advantages of SIFT descriptors are that they use
simple linear Gaussian derivatives (i.e., more stable) and contain more
components in the feature vector (128), a potentially more discriminative
representation. However, one disadvantage of SIFT descriptors is their high
dimensionality. One way of reducing the
dimensionality is by applying Principal Component Analysis (PCA) on the raw
128-dimensional SIFT vector (PCA-SIFT [75]). However such dimensionality
reduction is not applied in this thesis. The implementation used in this thesis
follows the original model by Lowe (Papers II, III, IV and V).
We also used the open source SiftGPU2 library [173] (Paper III & IV).
SiftGPU is implemented in C and uses either the GL Shading Language or
CUDA to process pixels in parallel on the GPU, so as to build Gaussian pyramids
and detect DoG key-points. Figure 3.4 shows a comparison of the computation
time needed for the SiftGPU implementation on different resolutions.
3.5.1.2 Histogram of Oriented Gradients
Histogram of Oriented Gradients or HOG descriptor was first introduced
by Dalal and Triggs in their CVPR’05 paper [41] for the problem of pedestrian
detection in static images. Since then, it has been widely used in the context
of object detection. The idea behind the HOG descriptors is that local shape
2 SiftGPU: A GPU Implementation of Scale Invariant Feature Transform (SIFT) – http://cs.unc.edu/~ccwu/siftgpu
Figure 3.5: The steps of computing the HOG descriptor on an example image:
the input image; gamma normalisation and gradient computation; 8-orientation
cell histograms; overlapping blocks; and the final feature vector f.
and appearance within an image can be described by counting the occurrences
of intensity gradient orientations (or edge directions) in localised portions of
the image. The computation of HOG feature descriptors can be summarised in
four steps, as depicted in Fig. 3.5.
The first step of calculation is the computation of the gradient values.
The most common method is to simply apply the 1D centre point discrete
derivative masks, in either one or both of the horizontal and vertical directions.
This method filters the colour or intensity data of the image with
the following kernels: i) [−1, 0, 1]; and ii) [−1, 0, 1]T. The second step of
calculation involves creating the cell histograms. Each pixel within the cell
casts a weighted vote for an orientation-based histogram channel, based on the
values found in the gradient computation. The cells themselves can either be
rectangular or radial in shape; and the histogram channels are evenly spread
over 0◦ to 180◦ or 0◦ to 360◦ , depending on whether the gradient is “unsigned”
or “signed”. In the third step, to account for changes in illumination
and contrast, the gradient strengths must be locally normalised, which requires
grouping the cells together into larger, spatially connected blocks.
The authors mention that this local contrast normalisation with overlapping
descriptor blocks is crucial for good results. The HOG descriptor is then the
vector of the components of the normalised cell histograms from all of the block
regions. These blocks typically overlap, meaning that each cell contributes
more than once to the final descriptor. The final step in object recognition
using HOG descriptors is to feed the descriptors into some recognition system
based on supervised learning. The Support Vector Machine classifier (SVM,
see § 3.5.3.1) [164] is the standard binary classifier that can make decisions
regarding the presence of an object. In our implementation we used the freely
available SVMLight3 software package [72] in conjunction with HOG descriptors
to train object models. This descriptor idea can be seen as a dense version of
the SIFT descriptor. HOG and its variants routinely deliver state-of-the-art
performance for object detection and image classification.
3 SVMlight Support Vector Machine – http://svmlight.joachims.org
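The four HOG steps above can be sketched in a few lines. This is a simplified illustration (unsigned gradients, magnitude-weighted votes, 2 × 2 block L2 normalisation) with hypothetical cell and bin sizes, not the exact Dalal–Triggs implementation.

```python
import numpy as np

def hog(img, cell=8, bins=9):
    """Compute unsigned-gradient cell histograms and L2-normalise them
    over 2x2 overlapping blocks. Input: a grayscale float image whose
    sides are multiples of `cell` (cell/bin sizes are example choices)."""
    # Step 1: centred [-1, 0, 1] derivative masks in x and y.
    gx = np.zeros_like(img); gy = np.zeros_like(img)
    gx[:, 1:-1] = img[:, 2:] - img[:, :-2]
    gy[1:-1, :] = img[2:, :] - img[:-2, :]
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0       # "unsigned": 0..180
    # Step 2: per-cell orientation histograms (magnitude-weighted votes).
    ch, cw = img.shape[0] // cell, img.shape[1] // cell
    hist = np.zeros((ch, cw, bins))
    bin_idx = np.minimum((ang / (180.0 / bins)).astype(int), bins - 1)
    for i in range(ch):
        for j in range(cw):
            b = bin_idx[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            m = mag[i*cell:(i+1)*cell, j*cell:(j+1)*cell].ravel()
            hist[i, j] = np.bincount(b, weights=m, minlength=bins)
    # Steps 3-4: group cells into overlapping 2x2 blocks, L2-normalise
    # each block, and concatenate into the final feature vector.
    feats = []
    for i in range(ch - 1):
        for j in range(cw - 1):
            block = hist[i:i+2, j:j+2].ravel()
            feats.append(block / (np.linalg.norm(block) + 1e-6))
    return np.concatenate(feats)
```

Because blocks overlap, each interior cell contributes to up to four normalised blocks, exactly the redundancy the authors report as crucial.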
Figure 3.6: The typical components of a bag of visual words image classification
pipeline: local feature detection and description; dictionary generation
(codewords 1 . . . n forming the codebook dictionary); codebook histogram
generation; and classification.
3.5.1.3 Bag-Of-Words
The bag of visual words has been commonly used in recent years for object
categorisation, classification and image retrieval. It directly relates to the bag-of-words (BoW)
model, originally used in text retrieval [66], which is a popular method for
representing documents by ignoring word order. In the computer vision
community the BoW model was introduced by Sivic and Zisserman [148], where
they use a “term frequency-inverse document frequency” (tf-idf) weighting on
the visual-word counts produced by interest point detection and SIFT features,
to retrieve frames in videos containing a query object. Tf-idf is a standard text
retrieval tool which is used to compute the similarity of text documents.
Representing an image using the BoW model usually includes the following
three steps: feature detection, feature description and codebook generation.
A definition of the BoW model can be the “histogram representation based
on independent features” [51]. Given an image, feature detection extracts
several local patches (or regions), which are considered candidates for basic
elements (e.g., “words”). This can be done either with a regular grid method for
feature detection, where the image is evenly segmented by horizontal and
vertical lines forming a grid, or with traditional interest point detectors
that detect salient patches, such as edges or corners in an image. Some
well-known detectors are the Harris affine region detector [65] and Lowe’s
DoG detector [104]. After
feature detection, each image is abstracted by several local patches which are
represented as numerical vectors (feature descriptors).
A good descriptor that handles intensity, rotation, scale and affine variations
to some extent is the SIFT [104]. After this step, each image is a collection
of vectors of the same dimension (128 for SIFT), where the order of different
vectors is of no importance. Eventually each interest point is represented by an
ID which indexes the vector into a visual-codebook or visual-vocabulary (See
Fig. 3.6). An image is then modelled as a bag of those so called visual-words
and is described by a vector h which stores the distribution of all assigned
codebook IDs.
Note that this discards the spatial distribution of the image features. In
contrast, the image descriptions introduced in the previous sections still carry
spatial information. Especially the dense ones (e.g., HOG) are often used
directly to provide a spatial description of the objects. The final step for the
BoW model is to convert the vector-represented patches to “codewords” (analogous
to words in text documents), which also produces a “codebook” (analogous to a
word dictionary). A codeword can be considered a representative of several
similar patches. To this end, we use a clustering (vector-quantisation)
method to quantise the feature descriptors. One simple method is to perform
K-means clustering over the dataset (or a subset of the dataset) [95] (see § 3.5.2)
where each cluster stands for a visual codeword. The value of k depends on the
application, ranging from a few hundred or thousand entities for object class
recognition applications up to one million for retrieval of specific objects from
large databases.
Codewords are then defined as the centres of the learned clusters. The
number of the clusters is the codebook size (analogy to the size of the word
dictionary). The size of vocabulary is chosen according to how much variability
is desired in the individual visual words. After the construction of the visual
vocabulary, an image is described as a vector (or histogram) that stores the
distribution of all assigned codebook IDs or visual words or simply a bag of
visual-words. Once the bags of words of the images have been computed, the
images can be classified using virtually any machine learning algorithm. Current
state-of-the-art results use discriminative support vector machines (see § 3.5.3.1),
but Naïve Bayes methods have also been commonly used [40].
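The quantisation-and-histogram step of the pipeline can be sketched as follows, assuming a codebook has already been learned (e.g., by K-means); the descriptors here are hypothetical low-dimensional stand-ins for 128-dimensional SIFT vectors.

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Assign each local descriptor to its nearest codeword (Euclidean
    distance) and return the L1-normalised histogram of codeword IDs,
    i.e. the vector h that describes the image as a bag of visual words."""
    # Pairwise squared distances, shape (n_descriptors, n_codewords).
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    ids = d2.argmin(axis=1)
    h = np.bincount(ids, minlength=len(codebook)).astype(float)
    return h / h.sum()
```

The resulting histogram deliberately discards the spatial layout of the features, which is exactly the property noted above.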
3.5.2 Vector Quantisation
The previous sections present a range of widely used, low-level feature descriptors. Many of these can be very high dimensional, such as the SIFT features
with their 128-dimensional feature space. Vector quantisation of these raw feature
descriptors is a common step in object or texture recognition. One reason
behind the quantisation is the large range of values and their sensitivity to
small image perturbations. Thus the quantisation introduces robustness and
opens up the possibility for a variety of object-class models, some of which are
introduced in the next sections.
Vector quantisation involves data clustering, i.e., partitioning a dataset into
groups of related samples. The most widely used method is the K-means clustering algorithm [95]
in the Euclidean vector space. K -means starts with k randomly selected data
points, called cluster centres and finds a partitioning of N points from a vector
space into k < N groups, where k is typically specified by the user. The
objective is to minimise the squared error:
V = arg min_S  Σ_{i=1}^{k}  Σ_{x_j ∈ S_i}  ‖x_j − µ_i‖²        (3.5)

where S_i, i = 1, . . . , k, are the k clusters and µ_i is the mean of all the
points x_j ∈ S_i. The first step assigns each of the remaining data points to the
closest cluster centre. We use Euclidean distance as the distance measure but
it is also possible to use other distances such as the Mahalanobis distance. The
next step recomputes the cluster centres to be the mean of each cluster. These
two steps are alternated until convergence.
While k is the only parameter that needs to be specified for K -means, its
choice is not trivial, since it affects the outcome of the clustering result. A
common way to handle this problem is to try several values for k. However,
for large datasets this approach is too time-consuming because k can vary in a
wide range. The time complexity of the K-means algorithm is O(Nkld) for N
data points of dimension d and l iterations. Here l depends on the distribution
of the data in the feature space and the initial centres. In our work we use
K -means to cluster local visual features into visual vocabularies.
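The two alternating steps can be sketched as follows; a minimal Lloyd's-algorithm illustration with random initial centres, not a production vocabulary builder.

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    """Plain K-means (Lloyd's algorithm): alternate between assigning each
    point to its closest centre (Euclidean distance) and recomputing each
    centre as the mean of its cluster, until the centres stop moving."""
    rng = np.random.default_rng(seed)
    centres = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.zeros(len(points), dtype=int)
    for _ in range(iters):
        # Assignment step: nearest centre for every point.
        d2 = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(axis=1)
        # Update step: centres become cluster means (empty clusters keep
        # their old centre).
        new = np.array([points[labels == i].mean(axis=0)
                        if np.any(labels == i) else centres[i]
                        for i in range(k)])
        if np.allclose(new, centres):
            break
        centres = new
    return centres, labels
```

For visual vocabularies the `points` would be the raw 128-dimensional SIFT descriptors and `k` the desired codebook size.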
3.5.3 Classification of Visual Features
Once we have established an image representation based on visual features
we can use these features to train an appropriate classifier. Here we review
some machine learning techniques regarding classification, which are adapted
to the feature-based image representations which are also used in this thesis.
Classifiers refer to a subset of pattern recognition methods that, given a
training set T = x1 , . . . , xn of n data points with their corresponding
feature descriptions and class labels y1 , . . . , yn , are able to predict the
class label y for a previously unseen test data point x [4, 48].
The feature description x can take any form; however, it is often represented
in a Euclidean vector space. It can also be transformed into a different
vector space using a transformation function ϕ. This transformation into (often)
higher-dimensional vector spaces can improve the discrimination between classes.
Examples for such feature descriptors are the previously described visual-word
histograms in the BoW model, or the 128 dimensional SIFT vectors embedded
into the Euclidean vector space.
Figure 3.7: Codewords in 2-dimensional space (left): input vectors are marked with a
triangle, codewords are marked with red circles, and the Voronoi regions are separated
with boundary lines. An illustration of the margin 2/‖w‖ between a separating
hyperplane (w · x − b = 0, with supporting hyperplanes w · x − b = ±1 passing
through the support vectors) and the training samples (right). Image courtesy of [9].
In the next two subsections we focus on two discriminative classifiers: i) support vector machines (SVM) which are widely used in the machine learning
and computer vision community; and ii) nearest neighbours (NN) which see
wide adoption in applications of computer vision problems. In addition to discriminative classifiers there are also generative classifiers. Generally speaking,
discriminative classifiers model the class probability given the feature description
directly, i.e. P(y = c|x), also called posterior probability. Generative classifiers
however model the distribution of features within each class, i.e. P(x|y), called the likelihood.
The classification is then determined using Bayes’ rule: P(y|x) = P(x|y) P(y) / P(x),
where P(y) denotes the class prior and P(x) the normalisation factor, also
called the evidence.
3.5.3.1 Support Vector Machines
Support vector machines (SVM) were originally developed for pattern recognition and are motivated by results from the statistical learning theory. The
basic idea behind the SVM is to learn a hyperplane in some feature space, in
order to separate the positive and negative training examples with a maximum
margin, thus called maximum margin classifiers.
Consider the problem of separating a set of p training samples (x1 , y1 ), . . . ,
(xp , yp ) into two classes, where xi ∈ ℜn is a feature vector and yi ∈ {−1, +1}
the class label of the feature vector, if sample i belongs to the negative or
positive class, respectively. Suppose that the two classes can be separated by a
hyperplane: (w · x) + b = 0 in a space. Then the optimal hyperplane (H ) is the
one which maximises the width of the margin (H1 , H2 ) between the two classes.
This can be formulated via the following constraints:
xi · w + b ≥ +1,  for yi = +1                              (3.6)
xi · w + b ≤ −1,  for yi = −1                              (3.7)
yi (xi · w + b) − 1 ≥ 0   ∀i                               (3.8)

The distance separating the two classes is 2/‖w‖. Hence, H is the hyperplane
with minimal ‖w‖ such that ∀(xi , yi ) : yi (w · xi + b) − 1 ≥ 0, i.e., subject
to the constraints (Eq. 3.8). Introducing Lagrange multipliers αi (i = 1, . . . , p)
results in a classification function:

f(x) = sign( Σ_{i=1}^{p} αi yi (xi · x) + b )              (3.9)

where the αi and b are found by Sequential Minimal Optimisation (SMO) [39, 165].
Even though most of the αi take the value 0, the points xi for which αi ≠ 0
are the so-called support vectors, because they define the decision boundary4 .
Because the weighting term can vary to the benefit of either discrimination
or generalisation, SVMs are considered a flexible training tool which can be
adapted to a wide range of situations.
Straightforward classification using kernelised SVMs requires evaluating the
kernel for a test vector and each of the support vectors [107]. The complexity
of classification for an SVM using a non-linear kernel is the number of support
vectors times the complexity of evaluating the kernel function. The latter is
generally an increasing function of the dimension of the feature vectors. Since
the testing is expensive with non-linear kernels, linear kernel SVMs have
become popular for real-time applications as they enjoy both faster training
and classification speeds, with significantly less memory requirements than
non-linear kernels, due to the compact representation of the decision function.
However, this comes at the expense of accuracy, and linear SVMs cannot be
used on tough, complex datasets like PASCAL VOC [50].
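The thesis trains SVMs with SVMLight's SMO solver; as a self-contained illustration of the maximum-margin idea with a linear kernel, here is a stochastic subgradient (Pegasos-style) sketch on the regularised hinge loss. The learning-rate schedule and regularisation constant are assumptions made for the example, not SVMLight's.

```python
import numpy as np

def train_linear_svm(X, y, lam=0.01, epochs=200, seed=0):
    """Primal linear SVM trained by stochastic subgradient descent on
    lam/2 ||w||^2 + mean(max(0, 1 - y (w.x + b))); labels y in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d); b = 0.0; t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)                 # decaying step size
            margin = y[i] * (X[i] @ w + b)
            if margin < 1:
                # Inside the margin: shrink w and step towards the sample.
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
                b += eta * y[i]
            else:
                # Correctly classified with margin: only regularise.
                w = (1 - eta * lam) * w
    return w, b

def predict(w, b, X):
    return np.sign(X @ w + b)
```

On linearly separable data the learned (w, b) approaches the maximum-margin hyperplane of Eq. 3.8; the samples that keep triggering margin violations play the role of the support vectors.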
3.5.3.2 Nearest Neighbour Classification
Nearest neighbour matching is a good baseline method that is still often used,
because of its simplicity and relatively good performance. Specifically,
k nearest neighbours (k-NN), of the nearest-neighbour class of methods, is a
popular classifier in the

4 This is for linearly separable datasets. For non-separable datasets, a support vector can
be within the margin or even on the wrong side of the decision boundary.

field of computer vision. Essentially it assigns a data point to the class for
which there are the most exemplars among the k nearest neighbours.
One interesting point to note is that the k -NN classifier with majority vote
is universally consistent [47]. A classifier is called consistent if, with
increasing data, the risk of the classifier approaches the Bayes risk of the
underlying distribution. It is called universally consistent if this property
holds for all distributions of data. In the simple nearest neighbour case (1-NN),
it has been shown that the probability of error cannot exceed twice the Bayesian probability
of error as the number of training samples becomes infinite. Thus a large
amount of label information is contained in the nearest neighbour.
This, along with the simplicity and versatility of k -NN, explains the popularity of this classifier. k -NN can be categorised as a non-parametric classifier
and hence has several advantages over a parametric counterpart, such as that it
does not require a training phase and generally avoids the issue of over-fitting.
However, it often requires all training data points to be stored in memory,
which can be costly.
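The majority-vote rule can be sketched in a few lines; a minimal illustration with Euclidean distance and toy 2D points standing in for feature descriptors.

```python
import numpy as np
from collections import Counter

def knn_classify(x, train_X, train_y, k=3):
    """Classify x by majority vote among its k nearest training points
    (Euclidean distance). Non-parametric: there is no training phase,
    the whole training set is kept in memory."""
    d2 = ((train_X - x) ** 2).sum(axis=1)
    nearest = np.argsort(d2)[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

Setting k = 1 gives the simple nearest-neighbour rule discussed above.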
3.6 Semantic Knowledge
Semantic knowledge refers to the meaning of concepts in terms of their properties
and relations to other concepts. Concepts in robotics and artificial systems
in general, should be represented in a machine-interpretable and reusable
way. Cognitive robots should be able to use this semantic knowledge in order
to represent their perception of the environment, both from a metrical and
topological point of view, but also from a high-level conceptual knowledge point
of view.
As we saw in the previous chapter (see § 2.7.1 & § 2.7.2), there is an increasing
number of approaches which attempt to model and represent high-level semantic
knowledge, in the context of semantic mapping [57]; cognitive vision [27, 29,
121]; scene understanding [82, 106]; HRI [90, 114, 177] and learning [28, 78,
119].
In this thesis we are interested in the autonomous acquisition and exploitation of semantic information, as well as the integration of this perceptual
semantic information with commonsense knowledge (see § 3.7). Typically,
a semantic approach is based on a formalism which includes two standard
elements: i) the ontology (taxonomy); and ii) the knowledge base.
The theoretical foundation for modelling semantic domain information and
perceptual semantic knowledge in the anchoring framework is based on the
logical representation of Description Logics (DLs), which lie in the decidable
fragment of first-order logic (FOL). They are specifically tailored for representing
and reasoning about the terminological knowledge of an application domain
in a structured way, while they provide a good trade-off between expressivity
and tractability. In this dissertation we use the most standard DL formalism,
Table 3.1: DL syntax rules.

C, D →  A                      (atomic concept)
     |  ⊤                      (universal concept)
     |  ⊥                      (bottom concept)
     |  ¬C                     (complement)
     |  C ⊓ D                  (conjunction)
     |  C ⊔ D                  (disjunction)
     |  ∀R.C                   (universal quantification)
     |  ∃R.⊤                   (existential quantification)
     |  (≥ n R) | (≤ n R)      (number restrictions)
the prototypical family of languages ALC (Attributive Concept Language with
Complements) originally introduced by Schmidt-Schauß and Smolka [142].
3.6.1 Description Logics
In DLs , as in any other mathematical logic, the two most important notions
are the language (syntax) and the semantics (meaning). Language uses elements such as predicate symbols, variables, constants and functional symbols to
construct formulae (sentences) intended to represent concepts or notions of the
world. DLs deal with three categories of syntactic primitives: concepts, which
denote classes of objects; roles, which denote binary relations; and individuals,
which stand for objects.
3.6.1.1 Syntax
Definition 5 (Signature). Σ = hC, R, Ii is a DL signature iff C is a set of
concept names, R is a set of role names, and I is a set of individual names.
The three sets are pairwise disjoint.
Using the above mentioned syntactic primitives together with concept constructors, we can formalise the relevant notions of an application domain by
(elementary or complex)5 concept descriptions. The most common logical constructors used in DLs include: ∧ (conjunction), ∨ (disjunction), → (implication),
∃ (exists) and ∀ (for all).
Definition 6 (Concept Description). Let A be an atomic concept, C and D be
two arbitrary concept descriptions and R any atomic role. The set of concept
descriptions is formed by means of the syntax rule shown in Table 3.1.
5 Elementary concept descriptions consist only of atomic concepts, atomic roles and
individuals. Complex descriptions, however, can be built from elementary descriptions by
using concept constructors.
DLs are distinguished by the subset of constructors they provide. The
language AL (= attributive language) has been introduced as a minimal
language that is of practical interest. The other languages of this family are
extensions of AL. By adding more constructors we obtain more expressive
description languages. For example, including the union C ⊔ D of concepts (also
called disjunction), as interpreted in Table 3.2, yields the language called ALU.
Similarly, by allowing full existential quantification, written as ∃R.C, we obtain
the description language ALE, and with full negation (negation of arbitrary
concepts), written as ¬C, we obtain the description language ALC, and so on.
By using any combination of the above constructors we can define more complex
languages, which are denoted by a string of the form AL[U][E][C][N]. ALC and
ALCN are today considered the most basic description logics. Using a small
set of epistemologically adequate concept constructors keeps the language
decidable.
3.6.1.2 Semantics
The notion of interpretation is used to assign meaning to the syntactic primitives
and the constructors of the language.
Definition 7 (Interpretation). A terminological interpretation over a signature
Σ = ⟨C, R, I⟩ is a tuple I = ⟨Δ^I, ·^I⟩, where the set Δ^I is called the domain and ·^I
the interpretation function. The interpretation function maps every individual
a to an element a^I ∈ Δ^I, every concept to a subset of Δ^I, and every role name
to a subset of Δ^I × Δ^I. The extension of ·^I to arbitrary concept descriptions
is inductively defined as shown in the upper part of Table 3.2.
3.6.1.3 Knowledge Base
Although the constructors introduced so far can be used to form quite complex
concepts, we have as yet no means to describe relationships between them. For
describing these relationships we use a terminological component (TBox), which
defines the terminology of an application domain. Providing the
relationships that hold between concepts is the first part. To also describe
knowledge about facts in the world we use an assertional component (ABox).
When the TBox is combined with the ABox, the result is a knowledge base Ω.
Definition 8 (Terminological axioms). A general concept inclusion (GCI) has
the form C ⊑ D, where C and D are concepts. We write C ≐ D when C ⊑ D and
D ⊑ C. A concept equality has the form C ≡ D, where C and D are concepts.
A finite set of GCIs and equalities is called a TBox.
Definition 9 (Assertional axioms). A concept assertion is a statement of the
form a : C, where a ∈ I and C is a concept. A role assertion is a statement of
the form (a, b) : R, where a, b ∈ I and R is a role. An ABox is a finite set of
assertional axioms.
Table 3.2: Syntax and semantics of commonly used mathematical logic constructors,
with the corresponding letter used to denote a particular constructor.

    Constructor          Syntax     Semantics                                  Logic
    Top                  ⊤          Δ^I                                        AL
    Bottom               ⊥          ∅                                          AL
    Intersection         C ⊓ D      C^I ∩ D^I                                  AL
    Union                C ⊔ D      C^I ∪ D^I                                  U
    Negation             ¬C         Δ^I \ C^I                                  C
    Value Restriction    ∀R.C       {x ∈ Δ^I : ∀y, (x, y) ∈ R^I → y ∈ C^I}     AL
    Existential Quant.   ∃R.C       {x ∈ Δ^I : ∃y, (x, y) ∈ R^I ∧ y ∈ C^I}     E
    Number Restriction   ≥ n R      {x ∈ Δ^I : ‖{y : (x, y) ∈ R^I}‖ ≥ n}       N
                         ≤ n R      {x ∈ Δ^I : ‖{y : (x, y) ∈ R^I}‖ ≤ n}       N
    Concept Definition   A ≡ C      A^I = C^I
    Concept Inclusion    C ⊑ D      C^I ⊆ D^I                                  H
    Concept Assertion    C(α)       α^I ∈ C^I
    Role Assertion       R(α, β)    (α^I, β^I) ∈ R^I
Definition 10 (Knowledge Base). A knowledge base Ω is an ordered pair
(T, A), where T is a TBox and A is an ABox.
Definition 11 (Model). We say that an interpretation I is a model of a TBox
T iff it satisfies all concept definitions in T, i.e., A^I = C^I holds for all A ≡ C
in T. It is a model of a general TBox T iff it satisfies all concept inclusions in T,
i.e., C^I ⊆ D^I holds for all C ⊑ D in T. Similarly, we say that an interpretation
I is a model of an ABox A iff it satisfies all concept and role assertions, i.e.,
for every C(a) in A, a^I ∈ C^I holds, and for every r(a, b) in A, (a^I, b^I) ∈ r^I
holds. A TBox T and an ABox A are satisfiable if they have a model. I is said to
be a model of the knowledge base Ω = ⟨T, A⟩ iff the interpretation satisfies all
terminological statements in T and all assertional statements in A.
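The inductive definition of ·^I in Table 3.2 can be made concrete with a short program. The following is an illustrative sketch only (not part of the thesis software): concept descriptions are encoded as nested tuples, and an interpretation is a finite domain together with extensions for atomic concepts and roles.

```python
# Minimal sketch of the inductive semantics in Table 3.2 over a finite
# interpretation.  Concepts are nested tuples, e.g. ("and", C, D).

def extension(c, dom, CE, RE):
    """Compute C^I ⊆ Δ^I.  CE maps atomic concept names to extensions,
    RE maps role names to sets of pairs."""
    op = c[0]
    if op == "top":   return set(dom)
    if op == "bot":   return set()
    if op == "atom":  return set(CE.get(c[1], ()))
    if op == "not":   return set(dom) - extension(c[1], dom, CE, RE)
    if op == "and":   return extension(c[1], dom, CE, RE) & extension(c[2], dom, CE, RE)
    if op == "or":    return extension(c[1], dom, CE, RE) | extension(c[2], dom, CE, RE)
    if op in ("all", "some"):             # ∀R.C and ∃R.C
        R = RE.get(c[1], set())
        C = extension(c[2], dom, CE, RE)
        test = all if op == "all" else any
        return {x for x in dom if test(y in C for (a, y) in R if a == x)}
    if op in ("atleast", "atmost"):       # (≥ n R) and (≤ n R)
        n, R = c[1], RE.get(c[2], set())
        cmp = (lambda k: k >= n) if op == "atleast" else (lambda k: k <= n)
        return {x for x in dom if cmp(sum(1 for (a, y) in R if a == x))}
    raise ValueError("unknown constructor: %r" % op)

# Toy interpretation: two cups, one of which contains coffee.
dom = {"cup1", "cup2", "coffee1"}
CE  = {"Cup": {"cup1", "cup2"}, "Coffee": {"coffee1"}}
RE  = {"contains": {("cup1", "coffee1")}}

coffee_cup = ("and", ("atom", "Cup"), ("some", "contains", ("atom", "Coffee")))
print(extension(coffee_cup, dom, CE, RE))   # -> {'cup1'}
```

Each clause of the function mirrors one row of Table 3.2; for instance, the "all" branch is vacuously true for individuals with no role successors, exactly as the value-restriction semantics prescribes.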
3.6.2 Inference Support
Once we get a description of the application domain using DLs as described
above, we are then able to make inferences, i.e., deduce implicit consequences
from the explicitly represented knowledge. The most common decision problems are basic database-query-like questions, such as instance checking, which
evaluates whether a particular instance (a member of an ABox) is a member of a
given concept. Relation checking evaluates whether a relation/role holds
between two instances; in other words, whether instance a has property b.
Another kind of inference resembles global database questions, such as
subsumption, which evaluates whether a concept is a subset of another concept,
and concept consistency, which evaluates whether there is no contradiction among
the definitions or chains of definitions. The language is capable of automatically
inferring hierarchies between concepts which are explicitly stated, as well as
instance relations between individuals and concepts. Another interesting feature
of DLs is that they maintain the intensional knowledge (TBox), which represents
general knowledge about a domain, separate from the extensional knowledge
(ABox), which represents a specific state of affairs.
The semantics of the ABox is an open-world semantics: the absence
of information about an individual is not interpreted as negative information;
it only indicates lack of knowledge. Note that ALC is a fragment of
first-order predicate logic, since everything that one can express in
ALC can be expressed in first-order predicate logic.
Deciding entailment of terminological axioms in ALC is ExpTime-complete⁶
(for cyclic TBoxes or GCIs) [1]. Modern DL systems, such as FaCT [71]
(and its successor FaCT++ [161]), Racer [62] (and its successor RacerPro),
Pellet [127] and KAON2 [120], provide inference services that solve the inference
problems mentioned above, also known as standard inferences.
The basic inference on concept descriptions is subsumption. Given two
concept descriptions C and D, the subsumption problem C ⊑ D is the problem
of checking whether the concept description D is more general than the concept
description C. In other words, it is the problem of determining whether the
first concept always (i.e., in every interpretation) denotes a subset of the set
denoted by the second one. We say that C is subsumed by D w.r.t. a TBox T
if, in every model of T, D is more general than C, i.e., the interpretation of C is
a subset of the interpretation of D. We denote this as C ⊑_T D.
Another typical inference on concept descriptions is satisfiability. It is the
problem of checking whether there is an interpretation that interprets a given
concept description as a non-empty set. In fact, the satisfiability problem
can be reduced to the subsumption problem. A concept is unsatisfiable iff it
is subsumed by ⊥ (the empty concept). The typical inference problems for
ABoxes are instance checking and consistency. Consistency of an ABox w.r.t.
a TBox is the problem of checking whether the ABox and the TBox have a
common model. Instance checking w.r.t. a TBox and an ABox is the problem
of deciding whether the interpretation of a given individual is an element of
the interpretation of a given concept in every common model of the TBox and
the ABox.
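For role-free concepts, the reduction of satisfiability to subsumption can even be checked by brute force: a role-free concept behaves like a propositional formula, so enumerating all interpretations of its atomic concepts over a one-element domain amounts to a truth table. The following is an illustrative sketch only; real reasoners use tableau algorithms, not enumeration.

```python
# Brute-force subsumption and satisfiability for role-free concepts.
from itertools import product

def ext(c, dom, CE):
    """Extension of a role-free concept under atomic extensions CE."""
    op = c[0]
    if op == "atom": return CE[c[1]]
    if op == "not":  return dom - ext(c[1], dom, CE)
    if op == "and":  return ext(c[1], dom, CE) & ext(c[2], dom, CE)
    if op == "or":   return ext(c[1], dom, CE) | ext(c[2], dom, CE)
    raise ValueError(op)

def atoms(c):
    return {c[1]} if c[0] == "atom" else set().union(*(atoms(s) for s in c[1:]))

def subsumed(c, d):
    """C ⊑ D iff C^I ⊆ D^I in every interpretation; for role-free
    concepts a one-element domain suffices (propositional semantics)."""
    dom = frozenset({0})
    names = sorted(atoms(c) | atoms(d))
    for bits in product([frozenset(), dom], repeat=len(names)):
        CE = dict(zip(names, bits))
        if not ext(c, dom, CE) <= ext(d, dom, CE):
            return False
    return True

def satisfiable(c):
    # C is unsatisfiable iff C ⊑ ⊥; here ⊥ is encoded as A ⊓ ¬A.
    a = next(iter(atoms(c)))
    bottom = ("and", ("atom", a), ("not", ("atom", a)))
    return not subsumed(c, bottom)
```

The `satisfiable` function is exactly the reduction described above: a concept is unsatisfiable iff it is subsumed by the bottom concept.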
3.7 Commonsense Knowledge Bases
Representing and amassing commonsense knowledge has been a topic of interest
since the conception of artificial intelligence, as we have already seen in a
previous section (§ 2.4). Efforts stemming from the SHRDLU spirit focused on
providing programs with considerably more information, from which conclusions
⁶ ExpTime-completeness might seem discouraging at first; however, there exist DL systems
which can cope with very large ontologies [61].
can be drawn. The vast collection of facts and information is typically stored in
a special KB. Such a database consists mainly of a schema that characterises
the domain (an ontology, of which the most general partition is called the upper
ontology), expressed in formal logic.
Typically information in a commonsense KB includes the following: i) an
ontology of classes and individuals; ii) properties, functions, locations, parts and
materials of objects; iii) locations, durations, preconditions and postconditions
(effects) of actions and events; iv) subjects and objects of actions; and v) various
other complementary information, such as stereotypical situations or scripts,
human goals and needs, emotions, plans and strategies in multiple contexts.
According to the literature, the most notable efforts intended to build
large-scale, general-purpose semantic knowledge bases are
OMCS [147], ConceptNet [98] and Cyc [93, 94]. However, in the past decade
the number of commonsense knowledge bases has been steadily increasing, with many
of the approaches embodied in the (semantic) web [24]. Here I present the
commonsense knowledge-base and ontology used in the thesis (Cyc), along
with some other commonsense knowledge-base systems. I then discuss their
suitability for use in robots and cognitive agents.
3.7.1 Cyc
The Cyc project first appeared in 1984, initiated by Lenat as part of the
Microelectronics and Computer Technology Corporation; the project attempted to
formalise commonsense knowledge into a logical framework by codifying, in
machine-readable form, the millions of pieces of knowledge that comprise human
common sense [93, 94]. Since 1994 it has been developed and maintained by Cycorp.
Its most important assets include an emphasis on knowledge engineering, the
representation of hand-crafted facts about the world and the implementation of
efficient inference mechanisms. Moreover, there is a clear focus on the ability
to communicate with end-users in natural language and to assist the
knowledge formation process via machine learning.
The underlying formal representation language used in Cyc is called CycL.
It is essentially an augmentation of first-order predicate calculus and LISP,
which includes extensions that enhance the expressiveness of the language in
order to permit commonsense knowledge expressions. Such extensions include
the handling of equality, default reasoning and skolemisation, as well as
extensions for modal operators and higher-order quantification (over predicates
and sentences). CycL uses a form of circumscription, includes the unique names
assumption, and can thus make use of the closed-world assumption where
appropriate.
CycL is used in representing the knowledge stored in the knowledge base.
The KB is based on the Cyc upper ontology which sets the foundations for
the vocabulary and consists of terms, expressions and contexts. The set of
terms can be divided into constants (concepts), non-atomic terms (NATs)
64
3.7. COMMONSENSE KNOWLEDGE BASES
and variables. Constants can be individuals, collections and (truth) functions.
Constants start with “#$” and are case-sensitive. The terms are combined into
more complex, yet meaningful, CycL expressions (assertions, rules, axioms,
. . . ) that are used to express facts and support inference in the CycKB.
Heavily used predicates include #$isa and #$genls: the first (#$isa) states
that an item is an instance of some collection (specialisation), while the
second (#$genls) states that one collection is a sub-collection of another
(generalisation). Rule sentences may also contain variables (strings
starting with “?”).
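To give a flavour of the notation, a few illustrative CycL sentences follow. Only #$isa, #$genls, the “#$”/“?” conventions and #$BaseKBMt come from the description above; the remaining constant names (#$DrinkingCup, #$Container, #$Cup-01) and the use of #$implies are illustrative assumptions, not guaranteed CycKB constants.

```lisp
;; Asserted in some microtheory, e.g. #$BaseKBMt:
(#$genls #$DrinkingCup #$Container)   ; generalisation: every drinking
                                      ; cup is a kind of container
(#$isa #$Cup-01 #$DrinkingCup)        ; specialisation: Cup-01 is an
                                      ; instance of #$DrinkingCup

;; A rule sentence with a variable.  The #$genls assertion above already
;; licenses this inference in Cyc; it is shown only to illustrate rule syntax:
(#$implies (#$isa ?X #$DrinkingCup)
           (#$isa ?X #$Container))
```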
Finally, the several thousand concepts represented in Cyc are grouped into
several contexts, which in CycL are represented as “microtheories”. These
contexts are employed in order to determine the truth or falsity of a CycL
sentence while avoiding contradictory inferences. Each microtheory (Mt) contains
collections of concepts and facts typically pertaining to one particular realm of
knowledge, and its name (which is a regular constant) contains the string “Mt”
by convention (e.g., #$BaseKBMt).
The inference engine of Cyc performs general logical deduction, including
modus ponens, modus tollens, and universal and existential quantification
[93, 94]. The latest version of Cyc includes the entire ontology, which
contains thousands of terms and (mainly taxonomic) assertions⁷ relating the
terms to each other. On top of this taxonomic information, Cyc builds a
significantly larger body of semantic knowledge (i.e., additional facts) about
the concepts in the taxonomy, while it includes a large lexicon for English parsing
and generation tools, as well as Java-based interfaces for knowledge editing and
querying.
Finally, there already exist efforts to interconnect the Cyc upper ontology
with other standard upper ontologies (e.g., UMBEL, YAGO-NAGA or Freebase)
and also to map the Cyc ontology to other central semantic resources on the
internet (such as DBpedia [5] and Wikipedia [112]).
The source code written in CycL released with the OpenCyc system is
licensed as open source, to increase its usefulness in supporting the semantic
web. To use Cyc in reasoning about text, it is necessary to first map the text
into its proprietary logical representation, described by its own language CycL.
However, this mapping process is quite complex because all of the inherent
ambiguity in natural language must be resolved to produce the unambiguous
logical formulation required by CycL.
3.7.2 Open Mind Common Sense
The Open Mind Common Sense (OMCS) project attempts to construct and
utilise a large commonsense KB from contributions of human users via the
internet. In the period of 12 years since its initiation (1999-2011) the project
⁷ Roughly 47,000 concepts and 306,000 facts.
managed to accumulate more than a million English facts from more than
15,000 contributors. There are many different types of knowledge in OMCS.
Some statements convey relationships between objects or events, expressed as
simple phrases of natural language (e.g., “The sun is very hot”), while others
contain information about peoples’ desires and goals. Much of OMCS’s software
is built on three interconnected representations: i) the natural language corpus
that people interact with directly; ii) a semantic network built from this corpus
called ConceptNet (see § 3.7.3); and iii) a matrix-based representation of ConceptNet called AnalogySpace that can infer new knowledge using dimensionality
reduction [147]. The OMCS project resembles Cyc, except that it depends
on contributions from thousands of individuals across the internet, instead of
knowledge carefully hand-crafted by knowledge engineers.
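The AnalogySpace step, factoring the concept/feature matrix and reading the low-rank reconstruction as graded inference, can be sketched with a truncated SVD on toy data. The matrix and its labels below are invented for illustration; the real OMCS matrix is vastly larger and sparser.

```python
import numpy as np

# Rows: concepts; columns: (relation, concept) features; 1 = asserted.
concepts = ["cat", "dog", "penguin"]
features = ["IsA/pet", "HasA/fur", "IsA/bird"]
M = np.array([[1.0, 1.0, 0.0],    # cat: pet, has fur
              [1.0, 0.0, 0.0],    # dog: pet (fur never asserted)
              [0.0, 0.0, 1.0]])   # penguin: bird

# Truncated SVD: keep only the top principal component.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 1
M_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]

# The reconstruction assigns "dog HasA fur" a clearly positive score
# (about 0.45) although it was never asserted, because dog resembles cat
# along the dominant component -- a graded, similarity-based inference.
print(np.round(M_hat, 2))
```

This is the sense in which dimensionality reduction "infers new knowledge": assertions that are consistent with the dominant patterns of similarity receive high reconstructed scores even when absent from the input.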
3.7.3 ConceptNet
ConceptNet is another commonsense knowledge project, primarily based on
the OMCS database (see § 3.7.2), where natural-language phrases are
collected from ordinary people who contribute to the database. ConceptNet
is essentially a semantic network expressed as a directed graph, whose nodes
are concepts and the edges represent assertions of commonsense about the
concepts. ConceptNet captures a wide range of commonsense concepts and
relations, such as those in Cyc. Its simple semantic network structure lends
it an ease-of-use comparable to WordNet, thus making it suitable for use in
natural language processing and intelligent agents.
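The graph structure just described can be sketched in a few lines. The relation names below (IsA, UsedFor, AtLocation) follow ConceptNet's relation vocabulary; the assertions themselves are invented examples.

```python
# A toy ConceptNet-style semantic network: nodes are concepts, labelled
# directed edges are commonsense assertions.
edges = [
    ("cup",    "IsA",        "container"),
    ("cup",    "UsedFor",    "drinking"),
    ("cup",    "AtLocation", "kitchen"),
    ("coffee", "IsA",        "beverage"),
]

def related(concept):
    """All assertions mentioning a concept, in either direction."""
    return [(s, r, o) for (s, r, o) in edges if concept in (s, o)]

print(related("coffee"))   # -> [('coffee', 'IsA', 'beverage')]
```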
3.7.4 Other approaches
Probase
Probase is a very large web-fed knowledge base from Microsoft. It is a very
recent endeavour which includes concept hierarchies and amasses 2.7 million
concepts, 4.5 million subclass relations and 16 million instances. It is argued to
subsume all major knowledge sources, such as Freebase, WordNet, Cyc and DBpedia.
Unfortunately, none of the data is linked in any way, publicly available, or
compliant with any standard format.
Wolfram Alpha
Wolfram Alpha is an on-line answering engine that answers factual queries
directly by computing the answer from structured data, rather than providing
a list of documents or web pages that might contain the answer, as a search
engine would.
engine would. Wolfram Alpha is built on Mathematica, a complete functional
programming package which encompasses computer algebra, symbolic and
numerical computation, visualisation, and statistics abilities. The database
currently includes hundreds of datasets, accumulated over approximately two
years, which include current and historical weather, drug data, star charts,
currency conversion and others.
True Knowledge
True Knowledge is mainly a linguistic answering engine, that aims to directly
answer questions posed in plain English text (similar in sense to Wolfram
Alpha). The knowledge is organised in a database of discrete facts; the
engine attempts to comprehend posed questions by disambiguating among all
possible meanings of the words in the question, in order to find the most likely
meaning of the question being asked. Information is gathered in two ways: by
importing it from “credible” external databases (e.g., Wikipedia) and from user
submissions that follow a consistent format and a detailed input process. The
system itself assesses the submitted information, rejecting any facts that are
semantically incompatible with other approved knowledge.
NELL: Never-Ending Language Learning
NELL is a research project that attempts to build a never-ending semantic
machine learning system with the ability to extract structured information
from unstructured web pages. It runs continuously, attempting to perform two
tasks each day: i) it “reads”, or extracts, facts (e.g., playsInstrument(George_Harrison, guitar)) from text in hundreds of millions of web
pages, looking for connections between the pieces of information it already knows;
and ii) it attempts to improve its reading competence by making new connections,
in a manner intended to mimic the way humans learn new information.
It comprises a knowledge base of structured information that mirrors the
content of the web, so far accumulating more than 848,802 beliefs
acquired from various web pages.
DBpedia
The DBpedia project aims to extract structured content from the information
created as part of the Wikipedia project. It uses RDF to represent the
extracted information and the SPARQL query language to access it. DBpedia
allows users to query relationships and properties associated
with Wikipedia resources, including links to other related datasets. The main
dataset describes more than 3.64 million things, out of which 1.83 million are
classified in a consistent ontology which includes persons, places, films, games,
organisations, species and diseases. The dataset also features links to images
and to external web pages, as well as to other Open Data RDF datasets such as
Freebase, Cyc and UMBEL. This enables applications to enrich DBpedia data
with data from the datasets mentioned above, thus making it one of the central
interlinking hubs of the emerging Web of Data.
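As an illustration of the access path just described, a SPARQL query of the kind DBpedia answers might look as follows. The prefixes follow the DBpedia ontology and resource namespaces; treat the exact class and property names as indicative rather than guaranteed.

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

# Films directed by Stanley Kubrick, together with their release dates.
SELECT ?film ?date WHERE {
  ?film a dbo:Film ;
        dbo:director dbr:Stanley_Kubrick ;
        dbo:releaseDate ?date .
}
LIMIT 20
```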
Freebase
Freebase is a large on-line collaborative knowledge base, consisting of structured
metadata gathered mainly from individual “wiki” contributors, as
well as other sources. The project aims to create a global resource which allows
people (and machines) to access common information more effectively. The data
structure is defined as a set of nodes and a set of links that establish relationships
between the nodes. Because its data structure is non-hierarchical, Freebase can
model much more complex relationships between individual elements than a
conventional database. Queries to the database are made in Metaweb Query
Language (MQL).
3.7.5 Discussion
While ConceptNet and Cyc purport to capture general-purpose world-semantic
knowledge, the qualitative differences in their knowledge representations make
them suitable for very different purposes. Because Cyc represents commonsense in a formalised logical framework, it excels in careful deductive reasoning
and is appropriate for situations which can be posed precisely, unambiguously
and in a context-sensitive way. While logic is a highly precise and expressive
language, it has difficulty modelling the imprecise way in which humans categorise
and compare things based on prototypes, and emulating human reasoning, which
is largely inductive and highly associative.
Cyc is mainly hand-crafted by knowledge engineers, while ConceptNet is
automatically mined from a knowledge base of English sentences, the Open
Mind Common Sense (OMCS) corpus, contributed by 14,000 people over the
web. Cyc, posed in logical notation, does indeed contain much more practical
knowledge. The ability to quantify the semantic similarity between two symbols
permits inductive reasoning over concepts and the categorisation of objects and
events. However, natural language is considered a weak representation for
commonsense knowledge; ambiguity, in particular, needs to be dealt with when
reasoning in natural language. Whereas human language is full of ambiguity,
a logical symbol is concise, while there may exist many different natural-language
fragments which mean essentially the same thing and seem equally suitable to
include. In this sense, logic can be seen as more economical.
However, it is sometimes difficult to be precise in natural language. For
example, what is the precise colour of a “red apple”? In logic, we might be able
to formally represent the range of the colour spectrum corresponding to the “red”
of the apple, whereas in natural language the word “red” is imprecise and has
various interpretations. As ConceptNet and Cyc employ different knowledge
representations, cross-representational comparisons may provide a measure of
evaluation. However, neither of the two systems (nor any other) is suitable for
direct use in robotics. The reason is that there exists a very
delicate balance between representation, expressive power and explanatory power,
which none of the systems mentioned above acknowledges or complies with. We
may consider the systems mentioned in this section primarily as knowledge
resources which can complement the robotic domain with additional information.
For example, other projects, including NELL, Freebase and DBpedia, which all
explore web-based approaches for collecting and distributing knowledge, may
provide additional concepts and inference services, even though they live entirely
on the web.
3.8 Conclusions
In this chapter I have first briefly outlined the initial anchoring model which
was used as a basis in the papers appended to this thesis. Furthermore, the
implemented architecture involved the integration of several techniques and
processing algorithms, in order to support the perceptual signatures and semantic
descriptions of objects. This is what made anchoring possible in the scenarios
presented in the studies of the thesis. The integration combined the described
image representation and classification algorithms for visual perception;
Adaptive Monte-Carlo Localisation for mobile robot localisation; the DL
framework for modelling the perceptual-semantic knowledge of the agents; and
the Cyc system as the commonsense knowledge base. With the components
mentioned herein it has been possible to achieve meaningful anchoring
experimentation, as shown in the different studies. Further details regarding
the integration are presented in the corresponding papers (I, II, III, IV and V).
4 Summary of Publications
4.1 Anchoring with spatial relations
In Paper I we used symbolic knowledge representation and reasoning capabilities
to enrich perceptual anchoring. The use of KR&R is also advocated to
allow the human to assist the robot in simple anchoring tasks, such as the
disambiguation of objects, thereby exploring a deeper form of mutual HRI.
The knowledge base consists of two parts: i) the terminological component
(TBox), which contains the description of the relevant concepts and their relations;
and ii) an assertional component (ABox), for storing concept instances and
assertions on those instances. For the anchoring domain we require an ontology
that covers all the physical entities and the corresponding perceptual properties
which are recognised by the perceptual system and eventually occur during an
anchoring scenario.
In addition, we used knowledge that was inferred from basic knowledge
contained in the anchors, and also collected from external sources using other
cognitive capabilities, such as other anchoring processes or linguistic perception.
Modelling an ontology is in general a difficult task; therefore, in this
particular instance we adopt an ontology based on a subset of the ontological
framework DOLCE (a Descriptive Ontology for Linguistic and Cognitive
Engineering) [109], an upper-level ontology developed for the Semantic Web.
The knowledge base is implemented using the LOOM knowledge representation
system [11].
4.1.1 Semantic modelling of spatial relations
Spatial relations were used in the symbolic description of objects and allowed
us to distinguish objects by their location with respect to other objects. They also
played an important role in HRI. Two classes of binary spatial
relations between a reference object and a located object were considered: a) the
topological (distance) relations “at” and “near”; and b) the projective (directional)
relations “in-front-of”, “behind”, “right” and “left”. The interpretation of a
projective relation depends on a frame of reference; for simplicity we assume a
deictic frame of reference with an egocentric origin coinciding with the robot
platform. We finally modelled the spatial relations as concepts in the ontology,
where each spatial relation is a sub-concept of “Abstract Relation” and has as
properties a reference object, a located object and a spatial region.
Using the LOOM KR&R system we were able to perform some basic
inferences with the information provided by the anchoring component. Thus,
we could focus on the high-level aspects behind the logical representation and
the dynamic properties of the objects, instead of invoking sensory information
directly. The user interface was a plain text-based application in which the
user can type sentences in simple English. Each sentence is analysed by a
recursive-descent parser and translated into a symbolic description. The
grammar allows commands of the form FIND . . . followed by the description
of the object. The description consists of a main part that can be followed by
sub-clauses describing objects that are spatially related to that object.
The main part and each of the sub-clauses can be either a definite or an
indefinite description, indicated by the article “a” or “the”. It also includes
the object’s class, for example “cup”, and optionally its colour and smell. The
smell of an object is inferred from the clause “with . . . ” following the object’s
class, indicating that it contains a liquid; for example, “the cup with coffee” is
assumed to be a cup containing coffee and, as such, smelling of coffee. The
derived symbolic description is used to construct a query for the KB. The main
functionality is realised by the FIND routine, which collects candidates from
the KB that match the description. If there is more than one candidate, the
anchoring module checks for further properties in the given description, beyond
shape and colour, and selects those additional properties. If the ambiguity
still persists, the system asks the user which of the candidate objects to
select.
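The structure of such commands can be sketched with a small recursive-descent parser. This is an illustrative toy, not the thesis implementation: the token categories, vocabulary and dictionary output are invented for the example, and the actual grammar is richer.

```python
# Toy recursive-descent parser for FIND commands of the kind described above.
ARTICLES  = {"a", "an", "the"}
COLOURS   = {"red", "green", "blue", "orange", "yellow"}
RELATIONS = {"left", "right", "behind", "near"}

def parse_find(text):
    toks = text.lower().split()
    if toks.pop(0) != "find":
        raise ValueError("command must start with FIND")
    return parse_description(toks)

def parse_description(toks):
    article = toks.pop(0)
    if article not in ARTICLES:
        raise ValueError("expected an article, got %r" % article)
    desc = {"definite": article == "the"}
    if toks and toks[0] in COLOURS:              # optional colour
        desc["colour"] = toks.pop(0)
    desc["class"] = toks.pop(0)                  # object class, e.g. "cup"
    if toks and toks[0] == "with":               # "with coffee" -> content/smell
        toks.pop(0)
        desc["contains"] = toks.pop(0)
    # Spatial sub-clauses: "to the left/right of <desc>", "behind <desc>", ...
    while toks and (toks[0] == "to" or toks[0] in RELATIONS):
        if toks[0] == "to":
            toks.pop(0); toks.pop(0)             # drop "to", "the"
            rel = toks.pop(0)
            toks.pop(0)                          # drop "of"
        else:
            rel = toks.pop(0)
        desc.setdefault("relations", []).append((rel, parse_description(toks)))
    return desc

d = parse_find("FIND the green cup with coffee to the left of the red ball")
print(d["class"], d["relations"][0][0], d["relations"][0][1]["class"])
# -> cup left ball
```

The resulting nested dictionary plays the role of the symbolic description from which a KB query is constructed.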
4.1.2 System validation
The validation of the system was done in the context of an intelligent home
environment used for ambient assisted living for the elderly or disabled. In
this smart home, an existing framework called the PEIS-Ecology [140]
is used to coordinate the exchange of information between the robot, other
pervasive technologies in the environment, and the human user. To illustrate
the utility of KR&R in anchoring, three case studies were presented.
The first focussed on the inclusion of spatial relations in the anchoring
framework. The set of binary spatial relations mentioned above was used in
the validation. As spatial prepositions are inherently rather vague, we used
fuzzy sets to define granular spatial relations. The proposed method computes
a network of spatial relations for the anchored symbols and stores it in the KB.
The robot surveys a static scene with three objects (two green garbage cans
and a red ball), and the anchoring module creates anchors for these objects in
a bottom-up way (as soon as new percepts are identified by the vision system).
We examined the possibility of using spatial relations to query for an object,
for example: “find the green garbage to the left of ball”. Similarly, a
human user can be asked to resolve an ambiguity in a find request. For
instance, the query “Find the green garbage” returns more than one anchor.
The anchoring module therefore determines an anchored object that is spatially
related to these anchors as a reference object and presents the user with a choice,
enumerating the returned anchors and their spatial relation(s) to the reference
object. The query is then reformulated using also the selected relation(s).
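A fuzzy projective relation of this kind can be sketched as a graded membership function over the direction from the reference object to the located object. The triangular membership below is an illustrative stand-in; the thesis' actual fuzzy sets are not reproduced here.

```python
import math

def left_of(located, reference):
    """Fuzzy degree to which `located` is to the left of `reference` in
    the robot's egocentric frame (x forward, y to the left).  Degree 1
    when the located object is due left (90 degrees), falling linearly
    to 0 at 0 and 180 degrees."""
    dx = located[0] - reference[0]
    dy = located[1] - reference[1]
    ang = math.degrees(math.atan2(dy, dx))     # 90 degrees = due left
    return max(0.0, 1.0 - abs(ang - 90.0) / 90.0)

print(left_of((0.0, 1.0), (0.0, 0.0)))    # due left     -> 1.0
print(left_of((1.0, 1.0), (0.0, 0.0)))    # front-left   -> 0.5
print(left_of((0.0, -1.0), (0.0, 0.0)))   # to the right -> 0.0
```

Graded degrees of this sort make the relations robust to borderline geometries: an object at 80 degrees is still "left" to a high degree rather than being cut off by a crisp threshold.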
A second case used multi-modal information about objects, including both
spatial information and, in this case, olfactory information given by an electronic
nose. The KR&R system is then used to assist the robot in determining
which perceptual actions can be taken to collect further information about
the properties of an object. As in the previous case, when an ambiguity
is introduced, the robotic system uses its olfactory module to resolve an
ambiguous case among different cups. The robot is given the command “Find a
green cup with coffee”. Four candidates were found in the KB, which were
asserted to be cups, i.e., vessels containing liquid with an associated smell. This
triggers the task planner to generate a plan to visit each candidate and collect
an odour property. The ambiguity is then resolved once a first match is found.
The third case investigated the possibility of reasoning about object
properties in order to determine an optimal candidate. We considered a case
where some fruits were scattered on the floor. The robot navigated around
the floor and correctly perceived and asserted the instances into the KB. We
then asked the robot to pick the orange that was to the left of the apple
(spatial misinformation). Since this was not the case, it responded with a valid
proposition: there is no orange to the left of the specific apple, but instead
there is one to the right of the banana. The user confirms that the alternative
candidate is indeed the requested object. As a next step, we asked the robot
to pick a banana to the right of the apple. Here there are two candidates, both
to the right of the apple; however, one of them is classified as rotten
and therefore not edible. The system informs the user about the situation and,
after removing this option, reports that there is finally one banana to
the right of the apple (suitable for consumption) and returns it.
4.1.3 Contributions
A. Loutfi, S. Coradeschi, M. Daoutis, and J. Melchert. “Using
Knowledge Representation for Perceptual Anchoring in a Robotic
System”. Int. Journal on Artificial Intelligence Tools (IJAIT) 17,
pp. 925–944, 2008.
The author of this thesis was responsible for extending a previous version of
this article ([115]), which was invited for publication in a special issue of the
International Journal on Artificial Intelligence Tools (IJAIT). More specifically,
he designed and performed the extended experiments and helped with the
preparation of the second half of the manuscript.
4.2 Common Sense Knowledge and Anchoring
In Paper II we have investigated the integration of a large KR&R system with a
perceptual system which consisted of networked sensors. The role of the KR&R
system is to maintain a world model which consists of the collection of the
semantic information perceived by the robotic assistant and the smart home.
The commonsense KR&R system used in this work is Cyc (see § 3.7.1).
Operations, such as assertions, retractions, modifications and queries are stated
using Cyc’s formal language, CycL. Such a system, that rests on a large-scale
general purpose knowledge-base, can potentially manage tasks that require
world knowledge (or commonsense).
4.2.1 Linguistic interaction
To combine Cyc’s natural language capabilities and enhance the communication
with the user we allow Cyc to translate the results of queries or inferences, into
natural language. The main challenge of integrating a large KR&R like Cyc is
to be able to synchronise the information with the perceptual data coming from
multiple sensors, which is inherently subject to incompleteness, glitches, and
errors. The anchoring module provides stable symbolic information despite
fluctuating quantitative data. Instances of objects must be created in Cyc
and, as objects can be dynamic (e.g., in the case of changing properties),
proper updating of information needs to be managed.
To enable the synchronisation between the KR&R and anchoring layers
three aspects were considered: a) defining anchoring in the KR&R; b) handling
of assertions; and c) handling of ambiguities. Cyc is not globally consistent
but rather attempts to be locally consistent, using different contexts expressed as MicroTheories. Through the Anchoring
MicroTheory it is possible to connect concepts about objects that are currently
present in the anchoring module, to the structured hierarchical knowledge in
Cyc and concurrently inherit the commonsense reasoning about this knowledge. For instance, if the location of the anchor of the object “cup” is the
kitchen, then the object and the kitchen are instantiated into Cyc, inheriting
all the properties related to the generalised concepts Kitchen and Cup, such as
ManMadeThing.
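The instantiation described above can be sketched as CycL-style sentences generated from an anchor. This is a minimal Python sketch, assuming a hypothetical MicroTheory constant `AnchoringMt` and a simple list-of-strings output; the actual interface to Cyc and the exact constants used in the implementation are not shown in the thesis text.

```python
# Hedged sketch of how an anchored object might be asserted into a Cyc
# MicroTheory. The MicroTheory name `AnchoringMt` and the output format
# are illustrative assumptions.

def instantiate_anchor(anchor_id, concept, location):
    """Build CycL-style assertions placing an anchored object in the KB."""
    mt = "AnchoringMt"  # hypothetical name for the Anchoring MicroTheory
    return [
        f"(isa {anchor_id} {concept})",
        f"(objectFoundInLocation {anchor_id} {location})",
        f"(ist {mt} (isa {anchor_id} PartiallyTangible))",
    ]

sentences = instantiate_anchor("Cup-7", "Cup", "Kitchen-1")
# The anchored cup is declared an instance of Cup, placed in the kitchen,
# and asserted, within the MicroTheory, to be a partially tangible thing.
```

Once `(isa Cup-7 Cup)` holds, the inheritance mentioned above follows from Cyc's taxonomy: `Cup-7` also falls under every generalisation of `Cup`.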
4.2.2 Knowledge synchronisation
To synchronise perceptual data with the knowledge in Cyc, anchored objects
from both working and archive memories were instantiated. Then, the knowledge
manager keeps the Cyc symbolic system coherent with the symbolic descriptions
of each anchor. This is done by translating the symbolic description of each
anchor into a set of logical formulae that represent the agent’s perception of the
object. When a particular predicate of the symbolic description changes (e.g.,
the location of an object) an update of the predicate asserts the new information
into the KB. Ambiguities eventually arise during the synchronisation process
and they are resolved using the denotation tool interactively with the user.
The denotation tool is an interactive step during the grounding process,
where the user is asked to disambiguate the grounded symbol using concepts
from the KB which denote this specific symbol. Another aspect we consider
in the synchronisation process, is the information about the visibility of the
object. This is important as the agent still needs to maintain knowledge about
an object, despite the fact that the corresponding perceptual information may
not currently be available to the agent. For example the robot still needs to
maintain the knowledge that “a cup is located in the kitchen”, even though the
“cup” might be out of its current field of view (epistemic knowledge). This was
achieved using second-order rule assertions.
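The synchronisation step described above can be sketched as a diff between the predicates last asserted into the KB and the predicates currently perceived: only changed values generate update assertions. The predicate names and the diff format are illustrative assumptions, not the knowledge manager's actual interface.

```python
# Hedged sketch of the synchronisation step: compare an anchor's current
# symbolic description against what was last asserted into the KB, and
# emit updates only for predicates whose value changed.

def sync_anchor(asserted, perceived):
    """Return (predicate, new_value) pairs that must be re-asserted."""
    updates = []
    for pred, value in perceived.items():
        if asserted.get(pred) != value:
            updates.append((pred, value))
    return updates

asserted = {"location": "Kitchen-1", "colour": "green", "visible": True}
perceived = {"location": "LivingRoom-1", "colour": "green", "visible": False}

updates = sync_anchor(asserted, perceived)
# Only 'location' and 'visible' changed; the colour needs no update.
```

Note that the `visible` predicate going false does not delete the object's location from the KB, which reflects the epistemic-knowledge requirement above: the cup stays in the kitchen even when out of view.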
4.2.3 Evaluation
As part of the test-bed we used a physical facility, which looks like a typical
bachelor apartment of about 25 m². It consists of a living room, a bedroom
and a small kitchen. We initially presented the robot with 15 objects, while
capturing 2 to 5 training images per object from different viewpoints. It is
important to mention that the robot recognised instances of objects and not
object categories. We placed those objects around the smart home in a random
manner, either close or far, covering a great amount of different combinations of
spatial relations. We then allowed the robot to navigate around the environment
so as to recognise and anchor those objects. We then allowed a human user to
communicate with the system in natural language, by typing questions (related
both to the robot's perception and to commonsense) or commands into the
console-based graphical user interface.
Excerpts of dialogues that reflect perceptual information from the mobile
robot and the environment were presented in Paper II. We were also able to
use the visual sensors in the ceiling to observe things that the robot could not
perceive or were occluded. Another instance of the object recogniser was running
on the large objects from the smart home (e.g., television set, couch, table, sink).
Through the knowledge base assertions, which contained a larger collection of
objects of the environment, the robot was able to compensate for its limited
ability of perception. This aspect spawned an interesting problem: how to
coordinate this cooperative perception with respect to the semantic
knowledge-base, in the context of anchoring. It was the main motivation for
the next development, Paper III.
In sum, the clear advantage of using a KB that contains commonsense
information is that we can exploit it to infer things that were not directly
asserted perceptually. This information can also be
useful when queries about functions and properties of objects are made. Due
to the expressiveness of Cyc we could differentiate perceptual and epistemic
knowledge, at the same time keeping both coherent. Using linguistic interaction
we could study how the anchors were stored, while probing the symbolic content.
As described in the previous section (§ 4.1), the use of spatial relations
contributed greatly to the disambiguation between objects.
4.2.4 Contributions
M. Daoutis, S. Coradeschi, A. Loutfi. “Grounding commonsense
knowledge in intelligent systems”. Journal of Ambient Intelligence
and Smart Environments (JAISE) 1, pp. 311–321, 2009.
This article was an invited submission of a previous version (Paper V [43] –
not included in the thesis), which had received the best conference paper award
of Intelligent Environments 2009. The author of this thesis specified, designed
and implemented the framework described in the papers mentioned above. He
also devised and performed the evaluation as well as prepared and revised both
manuscripts.
4.3 Cooperative Anchoring using Semantic & Perceptual
Knowledge
In Paper III we propose a model for semantic cooperative perceptual anchoring,
with which we intend to capture the underlying processes for systematically
transforming collectively acquired perceptual information into semantically
expressed commonsense knowledge, over a number of different agents. We
enable the different robotic systems to work in a cooperative manner, by
acquiring perceptual information and contributing to the collective semantic
KB. We adopt the notion of the local and global anchoring spaces from [86].
The different perceptual agents maintain their local anchoring spaces, while the
global anchoring space is coordinated by a centralised anchoring component
which also hosts the commonsense KB. The reason for this centralised
module is the computational complexity of the commonsense KB.
4.3.1 Semantic integration
This integration, which relies on commonsense information, is capable of supporting better (closer to linguistic) human-robot communication and generally
improves the knowledge and reasoning abilities of the agent. It provides an
abstract semantic layer that forms the base for establishing a common language
and semantics (i.e. ontology) to be used among robotic agents and humans.
Semantic integration takes place at two levels, in accordance with the anchoring
framework. We have developed a local KB which is part of the anchoring
process of every agent. Our solution involves a local knowledge representation
layer which aims at a higher degree of autonomy and portability. More specifically, it: a) is a lightweight component; b) adheres to the formally defined
semantics of the global KB; c) provides simple and fast inference; d) allows
communication between other components (e.g., the global anchoring agent,
Cyc or the Semantic Web); and e) is based on the same principles as our
core knowledge representation system (Cyc).
Each local KB is a subset of a global KB which is managed by the global
anchoring agent that coordinates the different local anchoring processes. The
global knowledge representation is used as the central conceptual database
and a) defines the context of perception for the perceptual agents; b) is also
based on formally defined semantics; c) provides more complex reasoning; and
d) contains the full spectrum of knowledge of the KB.
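The two-level design described above implies a simple query path: each agent answers from its lightweight local KB when it can, and otherwise defers to the global KB. The class and method names below are illustrative assumptions, not the implemented interface, and KB contents are modelled as plain sets of sentences.

```python
# Hedged sketch of the local/global query path: fast local lookup first,
# falling back to the full (and more expensive) global knowledge-base.

class LocalKB:
    def __init__(self, facts, global_kb):
        self.facts = set(facts)     # small subset held by the agent
        self.global_kb = global_kb  # full commonsense KB (e.g. Cyc)

    def query(self, sentence):
        if sentence in self.facts:
            return True                    # simple and fast local inference
        return sentence in self.global_kb  # deferred to the global KB

global_kb = {"(isa Cup DrinkingVessel)", "(isa Cup Artifact)"}
agent_kb = LocalKB({"(isa Cup DrinkingVessel)"}, global_kb)

assert agent_kb.query("(isa Cup DrinkingVessel)")   # answered locally
assert agent_kb.query("(isa Cup Artifact)")         # answered globally
```

The local set being a subset of the global one preserves the shared semantics; only the reasoning power differs between the two levels.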
The duality of knowledge representation systems stems from the fact that
we want to enable semantic knowledge and reasoning capabilities on each
perceptual agent, yet we cannot (yet) afford to embed a complete commonsense
knowledge-base instance on every agent, given the limited processing power of
each perceptual agent. The anchoring process is
tightly interconnected with commonsense knowledge about the world. As in
the previous paper (Paper II) the commonsense KB we adopt in this work for
knowledge representation, is Cyc.
4.3.2 Implementation
We have implemented our approach using the PEIS distributed framework [140]
with the application domain being a smart home for the elderly. The implementation involved a mobile robot, capable of perceiving the world
through typical heterogeneous sensors, which acted as the mobile
anchoring agent. An ambient anchoring agent comprised sensors and actuators
embedded in the environment, such as observation cameras, localisation systems and RFID tag readers. Finally, a computational unit, a high-end desktop
computer, undertook all the computationally demanding tasks related to the
commonsense knowledge base (Cyc).
The evaluation of the system includes how well it performs when
recognising, anchoring and maintaining in time the anchors referring to the
objects in the environment. We have tested several scenarios, where in particular
we considered the case in which the knowledge from the perceptual anchoring
agents populating the Cyc knowledge-base was used to enable inference upon
the instantiated perceptual concepts and sentences. It was possible to use
the object’s category, colour, spatial or topological relations, along with other
grounded perceptual information, to 1) refer to objects; 2) query about objects;
and 3) infer new knowledge about objects. In addition, we have considered the
use of commonsense knowledge in different contexts, through the multi-context
inference capabilities of the Cyc system.
In order to monitor and allow interaction with the anchoring framework,
we have developed an interface which offers three ways of accessing information.
It provides a primitive natural language interface to interact with the different
agents and the KB as it displays the state of the system and provides runtime
information about the performance of anchoring. The main component and
observation feature of the tool is the visualisation of the 3D augmented virtual
environment. It is simply a scene of the environment rendered in a 3D engine
which encapsulates every non-perceptual object in the environment, such as
the wall surroundings. Then information acquired from the global anchoring
space was used to virtually reconstruct the perception of each agent in the
rendered scene-graph.
Finally we have shown how the different perceptual agents can communicate
using logical sentences, in order to exchange information between each other,
in a cooperative manner. We focused on the locally maintained knowledge
of each perceptual agent in the context of querying other available agents to
complement for the limited perceptual knowledge so as to finally infer the
answers to the queries.
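The cooperative exchange sketched in the paragraph above can be illustrated as follows: an agent that cannot answer a query from its own knowledge asks its peers, and caches a positive reply for reuse. The agent names, the message format and the caching policy are illustrative assumptions.

```python
# Hedged sketch of agents exchanging logical sentences to compensate for
# each other's limited perceptual knowledge.

class Agent:
    def __init__(self, name, knowledge):
        self.name = name
        self.knowledge = set(knowledge)
        self.peers = []

    def query(self, sentence, ask_peers=True):
        """Return the name of the agent that holds `sentence`, or None."""
        if sentence in self.knowledge:
            return self.name
        if ask_peers:
            for peer in self.peers:
                answerer = peer.query(sentence, ask_peers=False)
                if answerer:
                    self.knowledge.add(sentence)  # reuse the shared fact
                    return answerer
        return None

robot = Agent("mobile-robot", ["(isa Cup-7 Cup)"])
ambient = Agent("ambient-agent",
                ["(objectFoundInLocation Couch-1 LivingRoom-1)"])
robot.peers = [ambient]

who = robot.query("(objectFoundInLocation Couch-1 LivingRoom-1)")
# Answered by the ambient agent; the robot caches the sentence locally.
```

Here the mobile robot, which never perceived the couch, can still answer subsequent queries about it from its local knowledge.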
4.3.3 Contributions
M. Daoutis, S. Coradeschi, A. Loutfi. “Cooperative Knowledge
Based Perceptual Anchoring”. Int. Journal on Artificial Intelligence
Tools (IJAIT) 21, p. 1250012, 2012.
The author of this thesis was responsible for the specification, design and
implementation of the proposed framework described in this article. He devised
and performed the evaluation as well as prepared and revised the manuscript.
4.4 Towards the WWW and Conceptual Anchoring
In Paper IV we consider that autonomous robots which interact with humans
require open-ended systems where domain knowledge can be dynamically
acquired and is not necessarily defined a priori. This aspect is related to
the previous paper (Paper III), where we identified that the modelling of
the semantic and commonsense knowledge behind an intended query required
considerable effort and knowledge engineering skills, which are often not present
in end-users. However the World Wide Web (WWW) is a viable source of
knowledge that would satisfy the requirement of being available on demand
while it evolves continuously, with contributions from human users. Hence, we
explored the possibility of integrating the knowledge coming from a conceptual
(non physical) environment such as the WWW, in the anchoring process.
This was achieved using the augmentation of the anchoring framework, called
Conceptual Anchoring.
4.4.1 Representation
Acquired data, from a non-physical information source, specifically the web,
were used so as to form representations (conceptual anchors), about concepts
that might not necessarily exist in the agent’s real environment. Via the
conceptual anchoring space, the agent was able to a) express concepts in
terms of their associated (eventual) percepts and symbolic knowledge; and
b) use the information from the web, in order to guide the learning of novel
perceptual instances (physical objects). In the context of conceptual anchoring,
the percepts produced from on-line resources were used for the construction of
the perceptual signatures. Then, they were grounded to their corresponding
semantic descriptions and tangible semantic knowledge from the commonsense
KB, in a similar way as in perceptual anchoring. The multi-modal data structure
(conceptual anchor) holds the perceptual data, semantic descriptions and unique
identifier about one conceptual object.
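The multi-modal data structure just described can be sketched as a small record tying a unique identifier to a set of percepts and grounded semantic descriptions. The field names and payload formats below are illustrative assumptions, not the implemented representation.

```python
# Hedged sketch of the conceptual-anchor data structure: one identifier
# for one conceptual object, holding web-derived percepts and grounded
# semantic descriptions from the commonsense KB.

from dataclasses import dataclass, field

@dataclass
class ConceptualAnchor:
    anchor_id: str                                 # unique identifier
    concept: str                                   # e.g. "Cup"
    percepts: list = field(default_factory=list)   # features from web images
    semantics: set = field(default_factory=set)    # grounded KB descriptions

cup = ConceptualAnchor("concept-cup", "Cup")
cup.percepts.append({"source": "web-image-1", "features": [0.12, 0.80]})
cup.semantics.update({"(genls Cup DrinkingVessel)", "(genls Cup Artifact)"})
```

Because this record has the same shape as a perceptual anchor, the existing anchoring functionalities can operate on both uniformly.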
Since the conceptual percepts do not correspond necessarily to physical
entities in the agent’s environment, implicitly the conceptual anchor does
not correspond to a specific physical object (perceptual instance), but rather
to the generalised conceptual appearance and implicit knowledge about the
general concept. For instance the conceptual anchor of the concept “Cup”
links the multiple visual percepts of cup images and features of them, with
the corresponding grounded visual concepts such as the colours or shapes
of cups and with the tangible semantic knowledge about cups like: Cup is
a specialisation of DrinkingVessel, inheriting all the implicit relations that
it is also an Artifact, a SpatiallyDisjointObject, a Container, . . . etc. Since
both conceptual and perceptual representations are expressed as anchors, the
previously developed anchoring functionalities (Papers II & III) are able to
handle both representations in the same way, especially regarding matching.
4.4.2 Preliminary results
Conceptual Anchoring is a system built on top of the knowledge-based perceptual anchoring architecture and mainly a) provides an interface to acquire
perceptual and semantic knowledge from the web, while b) maintains the
conceptual anchors in the anchoring space which are used to store the conceptual models. By acquiring conceptual anchors, which can be thought of as
“meta-anchors”, we were able to use the multi-modal conceptual representations,
as perceptual templates which guided the detection and learning of existing
physical objects (perceptual instances) in the environment, without having
previously trained the robot to recognise each instance individually (a tedious
and highly technical process).
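The template-guided detection described above can be sketched as a nearest-neighbour test: a physical detection counts as an instance of a concept when its feature vector lies close enough to one of the concept's web-derived percepts. The distance measure, feature format and threshold are illustrative assumptions, not the thesis implementation.

```python
# Hedged sketch of using a conceptual anchor as a perceptual template:
# accept a detection as an instance of the concept when it is close
# (Euclidean distance) to any of the concept's web-derived feature vectors.

def matches_concept(detection, template_percepts, threshold=0.2):
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return any(dist(detection, p) <= threshold for p in template_percepts)

templates = [[0.10, 0.80], [0.50, 0.40]]   # features mined from web images

assert matches_concept([0.12, 0.78], templates)      # close to a template
assert not matches_concept([0.90, 0.10], templates)  # no template nearby
```

In this way the agent can learn a previously unseen physical instance without per-instance training, which is the point of the "meta-anchor" idea.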
So far the notion of conceptual anchors has been tested in a proof-of-concept
scenario, and the focus of the evaluation has been on the ability to
retrieve the conceptual anchors from the web using keywords. While conceptual
anchoring is the most recent development in the anchoring framework, it still
needs to be further enhanced, as we believe that relying more on dynamic
information from the WWW will be a dominant trend in the field.
4.4.3 Contributions
M. Daoutis, S. Coradeschi, A. Loutfi. “Towards concept anchoring for cognitive robots”. Intelligent Service Robotics 5, pp. 213–228,
2012.
The author of this thesis was responsible for the specification, design and
implementation of the proposed framework described in this article. He devised
and performed the evaluation as well as prepared and revised the manuscript.
“Pure a priori concepts, if such
exist, cannot indeed contain anything empirical, yet, none the less,
they can serve solely as a priori
conditions of a possible experience.”
Immanuel Kant,
(1724-1804)
5
Discussion
5.1 Research Outlook
Successful communication among robots, human users and devices embedded
in the environment, involves diverse items of information such as perceptual
data from camera sensors, symbolic data, or natural language phrases from
a human. In particular, when humans interact with symbiotic robotic systems,
their communication mainly involves objects, their properties and functions. If
these objects are described in symbolic form (e.g., “blue cup on the table”), then
the link between the sensor data corresponding to the physical object and the
symbolic description, has to be established. In cognitive robotics the process of
creating this link is studied in the context of perceptual anchoring [32] which is
a subset of the general symbol grounding problem [63].
This thesis investigated the problem of anchoring and specifically the
design and implementation of a cognitive computational model for artificial
perceiving systems. The main goal was to establish and maintain in time the
correspondences between conceptual knowledge and perceptual information
about physical objects, when humans and robots interact. The cognitive agent
needs to be able to acquire perceptual information about the objects from
the physical world, while associating them to the corresponding commonsense
knowledge, both about the object per se but also the specific knowledge relevant
to the task at hand.
We considered a key aspect of this challenge, which concerns the flexibility
to adapt to changes in the environment and to dynamically acquire the new
knowledge needed for the interaction with humans. In order to enable human
participation in the anchoring process, the model has to support natural forms
of communication while simultaneously dealing with the challenging problem of
how to derive high-level descriptions of visual information from perceptual data.
In particular, we focused on the natural interaction and common knowledge
that pertain to objects which are used in everyday scenarios. In the following
section I discuss some issues concerning the previously described development
of anchoring, while summarising the goals of the current investigation.
5.2 Critical view on the original framework
The anchoring framework proposed by Coradeschi and Saffiotti [32] described
in Chap. 3, provides a simple yet concrete theoretical point of view in addressing
the physical symbol grounding problem. However, when the model is applied
to a cognitive agent, such as a robot, there are a number of issues which the
original anchoring framework does not, at first, appear capable of addressing.
The initial anchoring model was described in a very abstract and simplistic
way while considering only a top-down approach. An issue that has been
introduced later in the development of anchoring, concerned the problem of
two-way information acquisition [100], especially in the presence of different
modalities. This has been loosely defined, however it is an important aspect
since humans typically use both top-down and bottom-up information to solve
problems, in a kind of signal-symbol fusion. Previous research has shown that
our visual pathways are not unidirectional either, rather there are also feedback
signals [16]. Therefore both top-down and bottom-up interrelations should be
clearly defined in the cognitive model and implicitly the anchoring framework.
Since the introduction of other (non-standard) modalities, such as olfaction [100], the model has had to provide a general and formal representation
that accommodates and handles multiple perceptual modalities, as they become available to the cognitive agent. The symbolic part of the framework on
the other hand, consisted merely of symbolic labels which identified anchored
objects. Almost no additional symbolic information about “the world” was
incorporated or exploited in the anchoring process.
Furthermore, the problem of grounding symbolic properties has also been
out of focus. Since there is no definition about the nature of the symbol and
perceptual systems, implicitly the representation of the anchor was vaguely
defined. This is problematic when modelling the matching and tracking functionalities, especially in the presence of multiple modalities. It is also an issue
when modelling the similarity of the anchors, not only for their perceptual part
but also the semantic one.
Clearly, semantics are not defined anywhere in the anchor, neither in the
perceptual system, nor in the symbol system. This is another quite important
aspect, also related to the anchor’s conceptual representation mentioned above.
Perceptual semantics allow reasoning about percepts while conceptual semantics
enable reasoning on the conceptual level.
As we see in the related work (§ 2.7.3), anchoring also has a cooperative
aspect, where work led by LeBlanc and Saffiotti demonstrates an augmentation of
the anchoring framework which addresses the percept-symbol correspondences
between different robots. This was done explicitly in the context of collectively
acquiring and sharing perceptual information, thus not emphasising the
symbolic and semantic aspects [86].
Finally, we observe that pre-conceptual knowledge has not been defined and
hence not used anywhere in the anchoring process. This is especially relevant since it has
been argued that the semantic interpretations are not solely inside an image,
a feature or a percept, but depend upon a-priori semantic and contextual
knowledge (that is acquired from experience).
5.3 Summary of the Research
The aim of this investigation was to address the aspects mentioned in the
previous section (§ 5.2), from a theoretical as well as methodological point
of view. The main goal of the thesis was to extend the notion and theories
of perceptual anchoring so as to accommodate knowledge representation and
reasoning techniques, in order to establish a common language (i.e. ontology)
and semantics to be used among heterogeneous robotic agents and humans,
regarding physical objects, their properties and relations. We address the
challenge of grounding perceptual information to symbolic information drawn
from the conceptual information included in a commonsense knowledge-base.
The key aspect is to use the resulting grounded semantic knowledge to improve,
in part the linguistic communication with human users, but also the exchange
and reuse of semantic knowledge between the different components (intelligent
agents, robots, pervasive devices) of the system.
The first step of the investigation has been to extend the work presented
in [114], where a simple ontology inspired by DOLCE was connected to the
anchoring module. The integration was done using the LOOM KR&R system.
We evaluated the system using topological and spatial relations as well as
multiple modalities. We then showed how the proposed integrated system can
handle uncertainty while resolving the ambiguities which arise during the
interaction, by gathering perceptual information. Finally, the use of symbolic
knowledge enabled high-level reasoning about objects and their properties, thus
allowing more expressive queries about the anchored objects.
The next study explored the integration of a large-scale deterministic
commonsense KB with the perceptual anchoring system. We specified, designed
and implemented a framework which grounded perceptual data coming from a
number of sensors distributed in the environment, to commonsense semantic
descriptions extracted from the KB (Cyc). The study involved modelling of
perceptual algorithms using mainly computer vision techniques (e.g., SIFT) for
the construction of percepts. Then, a series of grounding algorithms captured
object properties and relations from the visual and spatial modalities (i.e.
object category, colour, location, spatial relations, topological localisation and
visibility). The anchoring framework managed all this information over time,
while in the KB all the grounded objects were instantiated into concepts.
The use of Cyc offered advantages, as less time was spent on ontology
engineering and on creating new knowledge in the KB. A further advantage
was the exploitation of commonsense knowledge in order to infer things that
could not be inferred by having the symbolic descriptions on their own.
The grounded information was evaluated through linguistic interaction via
a simple natural language interface to facilitate a dialogue with the human user.
We examined scenarios which included queries about perceptual information and
properties, qualitative spatial relations and commonsense reasoning. Concluding,
we find that all three aspects: a) robust perceptual system; b) anchoring; and
c) a substantial KB, were equally necessary and important in the process
of grounding linguistic queries in observations, in a dynamically changing
environment.
The purpose of the next study was to extend the initially developed system
to account for cooperation in sharing and reusing grounded knowledge among
different perceptual agents, while still maintaining the commonsense knowledge
aspect. We presented the computational framework behind symbolic cooperative
anchoring, which considers both top-down and bottom-up information acquisition
in the anchoring management process. The focus was on the visual and spatial
perceptual modalities where we presented the details behind the corresponding
functions.
We evaluated the developed system in an intelligent home scenario, where
two agents (one mobile robot and one static ambient intelligent agent) cooperatively acquired complementary perceptual information from the environment
so as to populate the commonsense KB. We initially reported results about
the performance of the system, followed by scenarios regarding information
exchange between the perceptual agents, that allowed them to compensate for
incomplete knowledge. We finally presented ways in which information exchange
together with commonsense knowledge were used for different communication
tasks.
A promising direction emerged towards the final stages of the work: to utilise
knowledge from the web so as to account for incomplete perceptual information
during anchoring. In this investigation we presented an integrated system that
combined the perceptual and conceptual knowledge acquired from the web, with
the a-priori commonsense knowledge maintained in the perceptual semantic
knowledge-base. Although in previous investigations we had considered both
top-down and bottom-up functions, in practice the system was not able to ground
arbitrary (concrete) concepts to their corresponding percepts. Specifically in
this study we considered how a cognitive agent: a) acquires novel concepts;
b) associates these concepts with preconceived information; and c) reasons about
this knowledge. In the implemented augmentation, which we term conceptual
anchoring, we used the measure of similarity so as to define the relations between
the anchors regarding concepts and instances. This measure served also as
a way to account for any eventual uncertainty that arose in the perceptual
system, hence propagating into the anchoring framework.
5.4 Concluding Remarks & Considerations
As we approach the end of this research, we clearly see that feature lists alone
do not appear to be capable of describing everything we need to know about
concepts. Percepts seem to be specialised and concrete while concepts are
abstract and general.
Certainly percept-based thought alone does not provide sufficient abstraction
and enough richness to deal with the increased complexity of the concept’s
meanings, which appear to have a dynamic component that directly relates to
the reasoning processes (that generate understanding).
Returning to the hypothesis posed at the beginning of this thesis, we
find that knowledge representation and reasoning techniques, indeed, have
the potential to bring meaning to artificial cognitive agents, such as robots.
More specifically through the studies conducted we conclude that in addition
to facilitating queries from human users, KR&R can also be used to reason
about objects, including their properties and functionalities. For example,
spatial reasoning can be used to disambiguate between visually similar objects
while symbolic commonsense information supports high-level reasoning about
commonly shared assumptions.
The thesis successfully demonstrated how a knowledge representation system
could be integrated with and utilised in an anchoring framework. From the
findings that emerged during the different studies we find that the KR&R
system and especially a-priori commonsense knowledge, contributed greatly to
the capabilities of the system, for example by enabling high-level reasoning.
The integration of this knowledge with physical perceptual information opens
new horizons with respect to the communication and implicitly interaction
with a human user.
In order to explore human-machine interaction, we adopted an ontology
which captured both concrete and abstract concepts and their relationships.
Moreover the ontology was used when expressing queries describing image
content, while only the relevant image aspects that were of value were evaluated. This was particularly important in the context of decidability, as the
evaluation converged as soon as the appropriate top-level symbol sequence
had been generated. Even though we used complex ontologies and knowledge
representation in the physical perceptual agents, a great effort from experts is
still needed to train the systems both perceptually and semantically.
The construction of ontologies originates from linguistic words. However, the
word is more than a symbol or a sign that represents a “thing” or a concept.
On a theoretical level, language is not just a tool to communicate but can
be thought of as an information processing system, which allows humans to
conceptualise, to think abstractly and symbolically. More importantly it is
the medium that permits the transition from percept-based to concept-based
thought. Intuitively, it seems that complex cognitive systems (such as humans as
well as cognitive robots) do something more than just rewriting symbols. They
use grounded concepts to reason about the world. The ability to form concepts
is an important and recognised cognitive ability, thought to play an essential
role in other related abilities such as categorisation, language understanding,
object identification, recognition and reasoning, all of which can be seen as
different aspects of intelligence.
We find that many related approaches tend to focus on the specific individual
aspects mentioned above and often disregard the symbol grounding problem.
Through the studies conducted in this thesis we acknowledge that all of these
aspects are equally important, including the grounding problem. By specifying
the representation of the anchor, which can be thought of as a conceptual
structure that establishes the meaning of the intended object, we
simultaneously support perceptual and semantic reasoning, and foremost
grounding.
In the context of symbol grounding and its representation, various approaches
tend to adopt Gärdenfors’ conceptual spaces. In our studies, however, we did
not adopt this paradigm, as it did not seem to offer any advantage over the
standard anchor representation. This does not merely mean that there exist
alternatives (in our case, sets) to the geometric metaphor of conceptual
representations; it also reveals a deeper question: which representation is
more appropriate for grounding?
In summary, the thesis had a significant impact on three fronts: a)
applicative – by enabling the development of cognitive robots where knowledge
and perception are combined; b) theoretical – by extending the theoretical
foundations of anchoring; and c) methodological – by providing a methodology
for the acquisition of perceptual knowledge and the exploitation of semantic
knowledge and high-level reasoning.
5.5 Recommendations for Future Work
During this research several shortcomings emerged which require further
investigation. The developed system remains quite limited regarding the
explicit perceptual training of the object categories and instances to be
recognised. In addition, the development of grounding relations required a
great amount of interdisciplinary, highly specialised work during the
implementation stages. Furthermore, reasoning was constrained to objects
pertaining to a particular
domain (anchoring), while lacking complex dynamic knowledge. Nevertheless,
the investigation behind this thesis has laid the initial groundwork towards
dynamic knowledge acquisition.
In order to realise a grounding framework of such a dynamic nature, one that
can fully integrate perception and knowledge on demand, it is necessary
from a theoretical point of view to relax an important assumption in current
AI systems, namely the closed world assumption. In addition, the decision
to create a new instance of an object in the knowledge base, or to assign the
perceptual information to an already known object, is currently based mainly
on appearance and location information.
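The create-or-match decision just described can be sketched as follows (the thresholds, field names and similarity measure are assumptions for illustration, not the actual implementation): perceptual information is assigned to a known object only when both appearance similarity and spatial proximity support the match; otherwise a new instance is created.

```python
# Hedged sketch of the create-vs-match decision: appearance and location
# jointly decide whether a percept refers to a known object instance.
# Thresholds and field names are illustrative assumptions.
import math

APPEARANCE_MIN = 0.8   # minimum appearance similarity (assumed)
DISTANCE_MAX = 0.5     # maximum displacement in metres (assumed)

def appearance_similarity(a, b):
    """Toy similarity between normalised feature vectors (dot product)."""
    return sum(x * y for x, y in zip(a, b))

def decide(percept, known_objects):
    """Return a matching known object, or None to create a new instance."""
    for obj in known_objects:
        close = math.dist(percept["position"], obj["position"]) <= DISTANCE_MAX
        similar = appearance_similarity(
            percept["features"], obj["features"]) >= APPEARANCE_MIN
        if close and similar:
            return obj      # assign percept to this known object
    return None             # no plausible match: create a new instance

known = [{"id": "cup-1", "position": (1.0, 2.0), "features": (1.0, 0.0)}]
seen = {"position": (1.1, 2.0), "features": (0.95, 0.05)}
assert decide(seen, known)["id"] == "cup-1"   # close and similar: match
```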
A future augmentation of a grounding architecture may rely on gathering
resources from the World Wide Web in order to overcome the problems stated
above. One key aspect will be to leverage the fact that general knowledge
about objects is already available on the internet, and that this knowledge is
to some extent also related to physical perceptual information. Such a
cognitive architecture will focus on evolving conceptual anchoring, biased
towards two characteristics, advanced visual perception and knowledge
formation, so as to autonomously form and learn new concepts on demand. At
the same time, it is important to evaluate the scalability of such a system
when the number of objects increases arbitrarily, and when several agents
cooperatively anchor and communicate information regarding a large number of
objects.
The anchoring process already collects information from the sensors and the
vision system, grounds the percepts to the corresponding symbolic knowledge,
and asserts this knowledge about object instances in the commonsense KB
(Cyc). In the future we intend to extend the anchoring process on two fronts
simultaneously. The first regards the on-demand acquisition of general
knowledge from the internet, when the establishment of new general concepts or
categories is required for interaction with humans or for reasoning. Progress
towards this goal is described in the final paper (Paper IV). The second
focuses on the continuous acquisition and updating of specific information
regarding objects of interest, via activity recognition and causal reasoning.
This implies that we need to integrate tools which examine not only how
objects are defined but also how objects are used in general (e.g., cups are
used for drinking), in a specific domain, and in the context of specific
activities. Fusing information about object use with activities will add to
the reasoning capabilities of the system. In sum, future work will deal with
the following:
⊲ To investigate methods for acquiring structured perceptual and semantic
knowledge from the internet, providing the knowledge needed to learn new
concepts on demand.
⊲ To anchor this knowledge to physical entities in the environment, such as
objects and places, building upon the existing anchoring framework.
⊲ To acquire more context-specific knowledge related to the environment
and objects via activity recognition.
⊲ To validate the system via meaningful interactions with humans and to
evaluate its performance, usability and acceptability among potential users.
References
[1]
F. Baader, D. Calvanese, D. McGuinness, D. Nardi, and P. Patel-Schneider, eds. The
Description Logic Handbook: Theory, Implementation and Applications. Cambridge
University Press, 2003. (See p. 63).
[2]
L. W. Barsalou. “Perceptual symbol systems”. Behavioral and Brain Sciences 22,
pp. 577–660, 1999. (See p. 3).
[3]
J. Barwise and J. Perry. Situations and Attitudes. Cambridge, MA: MIT Press, 1983.
(See p. 31).
[4]
C. Bishop. Pattern Recognition and Machine Learning. Springer Verlag, 2006. (See
p. 56).
[5]
C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann.
“DBpedia - A crystallization point for the Web of Data”. Web Semantics. 7, pp. 154–
165, 2009. (See p. 65).
[6]
A. Bonarini, M. Matteucci, and M. Restelli. “Anchoring: do we need new solutions
to an old problem or do we have old solutions for a new problem?” In: Proceedings
of the AAAI Fall Symposium on Anchoring Symbols to Sensor Data in Single and
Multiple Robot Systems. 2001. Pp. 79–86. (See p. 33).
[7]
A. Bonarini, M. Matteucci, and M. Restelli. “Concepts for Anchoring in Robotics”.
Lecture Notes in Computer Science 2175, pp. 327–332, 2001. (See p. 33).
[8]
A. Bonarini, M. Matteucci, and M. Restelli. “Problems and solutions for anchoring in
multi-robot applications”. J. Intell. Fuzzy Syst. 18, pp. 245–254, 2007. (See p. 37).
[9]
B. Boser, I. Guyon, and V. Vapnik. “A training algorithm for optimal margin classifiers”. In: Proceedings of the fifth annual workshop on Computational learning theory.
ACM. 1992. Pp. 144–152. (See p. 57).
[10]
R. J. Brachman and H. J. Levesque. “The tractability of subsumption in frame-based
description languages”. In: Proc. of the 4th Nat. Conf. on Artificial Intelligence
(AAAI-84). 1984. Pp. 34–37. (See p. 24).
[11]
D. Brill. Loom Reference Manual, for Loom Version 2.0. Tech. rep. USA: ISI, University of Southern California, 1993 (see p. 71).
[12]
R. A. Brooks. “Elephants don’t play chess”. Robotics and Autonomous Systems 6,
pp. 3–15, 1990. (See pp. 16, 28).
[13]
R. A. Brooks. “Intelligence Without Representation”. Artificial Intelligence 47,
pp. 139–159, 1991. (See pp. 16, 28).
[14]
M. Broxvall, S. Coradeschi, L. Karlsson, and A. Saffiotti. “Have Another Look: On
failures and recovery planning in perceptual anchoring”. In: Proc. of the ECAI-04
Workshop on Cognitive Robotics. Valencia, ES, 2004. (See p. 29).
[15]
M. Broxvall, S. Coradeschi, L. Karlsson, and A. Saffiotti. “Recovery Planning for Ambiguous Cases in Perceptual Anchoring”. In: Proc. of the 20th AAAI Conf. Pittsburgh,
PA, 2005. Pp. 1254–1260. (See pp. 29, 31, 39).
[16]
E. M. Callaway. “Feedforward, feedback and inhibitory connections in primate visual
cortex”. Neural Netw. 17, pp. 625–632, 2004. (See p. 82).
[17]
A. Cangelosi and S. Harnad. “The adaptive advantage of symbolic theft over sensorimotor toil: Grounding language in perceptual categories”. Evolution of Communication
4, pp. 117–142, 2002. (See p. 27).
[18]
A. Cangelosi. Evolution of communication and language using signals, symbols and
words. 2001 (see p. 30).
[19]
A. Cangelosi. “Approaches to Grounding Symbols in Perceptual and Sensorimotor
Categories”. In: Handbook of Categorization in Cognitive Science. Ed. by H. Cohen
and C. Lefebvre. New York: Elsevier, 2005. Pp. 719–737. (See p. 3).
[20]
A. Cangelosi and S. Harnad. The adaptive advantage of symbolic theft over sensorimotor toil: Grounding language in perceptual categories. 2001 (see p. 27).
[21]
A. Cangelosi and T. Riga. “An Embodied Model for Sensorimotor Grounding and
Grounding Transfer: Experiments With Epigenetic Robots”. Cognitive Science 30,
pp. 673–689, 2006. (See p. 30).
[22]
A. Cangelosi, A. Greco, and S. Harnad. From Robotic Toil to Symbolic Theft: Grounding Transfer from Entry-Level to Higher-Level Categories. 2000 (see p. 27).
[23]
A. Cangelosi, A. Greco, and S. Harnad. “Symbol grounding and the symbolic theft
hypothesis”. In: Simulating the evolution of language. Ed. by A. Cangelosi and D.
Parisi. New York, NY, USA: Springer-Verlag New York, Inc., 2002. Pp. 191–210. (See
p. 27).
[24]
A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. H. Jr., and T. M. Mitchell.
“Toward an Architecture for Never-Ending Language Learning”. In: Proceedings of the
Twenty-Fourth Conference on Artificial Intelligence (AAAI 2010). 2010. (See p. 64).
[25]
A. Chella, M. Frixione, and S. Gaglio. “Anchoring symbols to conceptual spaces: the
case of dynamic scenarios”. Robotics and Autonomous Systems 43, pp. 175–188, 2003.
(See p. 40).
[26]
A. Chella, S. Coradeschi, M. Frixione, and A. Saffiotti. “Perceptual Anchoring via
Conceptual Spaces”. In: Proc. of the AAAI-04 Workshop on Anchoring Symbols to
Sensor Data. Menlo Park, CA: AAAI Press, 2004. (See pp. 33, 40).
[27]
A. Chella, M. Frixione, and S. Gaglio. “Conceptual Spaces for Computer Vision
Representations”. Artificial Intelligence Review 16, pp. 137–
152, 2001. (See pp. 33, 40, 59).
[28]
A. Chella, H. Dindo, and I. Infantino. “Anchoring by Imitation Learning in Conceptual
Spaces”. In: AI*IA 2005: Advances in Artificial Intelligence. Ed. by S. Bandini and S.
Manzoni. Vol. 3673. Lecture Notes in Computer Science. Springer Berlin / Heidelberg,
2005. Pp. 495–506. (See pp. 33, 40, 59).
[29]
A. Chella, M. Frixione, and S. Gaglio. “Anchoring symbols to conceptual spaces: the
case of dynamic scenarios.” Robotics and Autonomous Systems 43, pp. 175–188, 2006.
(See pp. 33, 59).
[30]
A. Chella, H. Dindo, and D. Zambuto. Grounded Human-Robot Interaction. 2009
(see p. 38).
[31]
S. Coradeschi and A. Saffiotti. “Anchoring Symbols to Vision Data by Fuzzy Logic”.
In: Symbolic and Quantitative Approaches to Reasoning and Uncertainty. Ed. by
A. Hunter and S. Parsons. LNCS 1638. Springer-Verlag, 1999. Pp. 104–115. (See pp. 7,
31 sq.).
[32]
S. Coradeschi and A. Saffiotti. “Anchoring symbols to sensor data: Preliminary report”.
In: In Proc. of the 17th AAAI Conf. AAAI Press, 2000. Pp. 129–135. (See pp. 7,
31 sq., 34, 42 sq., 81 sq.).
[33]
S. Coradeschi and A. Saffiotti. “On the Role of Learning in Anchoring”. In: In Proc. of
AAAI Spring Symposium on Learning Grounded Representations (Technical Report
SS-01-05). AAAI Press, Menlo Park, CA, 2001. (See p. 32).
[34]
S. Coradeschi and A. Saffiotti. “Perceptual anchoring: a key concept for plan execution
in embedded systems”. In: Advances in Plan-Based Control of Robotic Agents. Ed.
by M. Beetz, J. Hertzberg, M. Ghallab, and M. Pollack. LNAI 2466. Berlin, DE:
Springer-Verlag, 2002. Pp. 89–105. (See pp. 29, 32).
[35]
S. Coradeschi and A. Saffiotti. “An Introduction to the Anchoring Problem”. Robotics
and Autonomous Systems 43, pp. 85–96, 2003. (See pp. 3, 7, 29, 31).
[36]
S. Coradeschi and A. Saffiotti. “Perceptual Anchoring with Indefinite Descriptions”.
In: Proc. of the First Joint SAIS-SSLS Workshop. Örebro, Sweden, 2003. (See p. 31).
[37]
S. Coradeschi and A. Saffiotti. Anchoring symbolic object descriptions to sensor data.
Problem statement. 1999 (see pp. 30 sq., 33).
[38]
S. Coradeschi and A. Saffiotti. “Perceptual anchoring of symbols for action”. In: In
Proc. of the 17th IJCAI Conf. 2001. Pp. 407–412. (See pp. 4, 29, 32, 39).
[39]
N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and
Other Kernel-based Learning Methods. Cambridge University Press, 2000. (See p. 58).
[40]
G. Csurka, C. R. Dance, L. Fan, J. Willamowski, and C. A. Bray. “Visual categorization
with bags of keypoints”. In: In Workshop on Statistical Learning in Computer Vision,
ECCV. 2004. Pp. 1–22. (See p. 55).
[41]
N. Dalal and B. Triggs. “Histograms of Oriented Gradients for Human Detection”. In:
Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision
and Pattern Recognition (CVPR’05) - Volume 1 - Volume 01. 2005. Pp. 886–893.
(See p. 52).
[42]
M. Daoutis, S. Coradeshi, and A. Loutfi. “Grounding commonsense knowledge in intelligent systems”. Journal of Ambient Intelligence and Smart Environments (JAISE)
1, pp. 311–321, 2009. (See p. 76).
[43]
M. Daoutis, S. Coradeschi, and A. Loutfi. “Integrating Common Sense in Physically Embedded Intelligent Systems.” In: Intelligent Environments 2009. Ed. by V.
Callaghan, A. Kameas, A. Reyes, D. Royo, and M. Weber. Vol. 2. Ambient Intelligence and Smart Environments. IE’09 Best Conference Paper Award. IOS Press,
2009. Pp. 212–219. (See p. 76).
[44]
M. Daoutis, A. Loutfi, and S. Coradeschi. In: “Bridges between the Methodological
and Practical Work of the Robotics and Cognitive Systems Communities – From
Sensors to Concepts”. T. Amirat, A. Chibani, and G. P. Zarri, eds. Vol. 21, chap. Knowledge Representation for Anchoring Symbolic Concepts to Perceptual Data. Springer Publishing (in press), 2012.
[45]
M. Daoutis, S. Coradeschi, and A. Loutfi. “Cooperative Knowledge Based Perceptual
Anchoring”. Int. Journal on Artificial Intelligence Tools (IJAIT) 21, p. 1250012, 2012.
(See p. 78).
[46]
M. Daoutis, S. Coradeschi, and A. Loutfi. “Towards concept anchoring for cognitive
robots”. Intelligent Service Robotics 5, pp. 213–228, 2012. (See p. 80).
[47]
L. Devroye, L. Györfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition.
Applications of Mathematics. Springer, 1996. (See p. 59).
[48]
R. Duda, P. Hart, and D. Stork. Pattern classification. Pattern Classification and
Scene Analysis: Pattern Classification. Wiley, 2001. (See p. 56).
[49]
S. Edelman, N. Intrator, and T. Poggio. “Complex cells and Object Recognition”.
1997 (see pp. 50 sq.).
[50]
M. Everingham, L. Gool, C. K. Williams, J. Winn, and A. Zisserman. “The Pascal
Visual Object Classes (VOC) Challenge”. Int. J. Comput. Vision 88, pp. 303–338,
2010. (See p. 58).
[51]
L. Fei-Fei, R. Fergus, and A. Torralba. Course on Recognizing and Learning Object
Categories. CVPR 2007. 2007 (see p. 54).
[52]
M. Fichtner. “Anchoring Symbols to Percepts in the Fluent Calculus”. KI - Künstliche
Intelligenz 25, pp. 77–80, 2011. (See p. 35).
[53]
M. Fichtner and M. Thielscher. Anchoring Symbols to Percepts in the Fluent Calculus.
Ed. by U. Visser, G. Lakemeyer, G. Vachtesevanos, and M. Veloso. 2005 (see p. 35).
[54]
T. Fong, C. Thorpe, and C. Baur. “Collaboration, Dialogue, Human-Robot Interaction”. Robotics Research, pp. 255–266, 2003. (See p. 38).
[55]
D. Fox, W. Burgard, F. Dellaert, and S. Thrun. “Monte Carlo Localization: Efficient
Position Estimation for Mobile Robots”. In: Proceedings of the Sixteenth National
Conference on Artificial Intelligence (AAAI’99). 1999. (See p. 47).
[56]
D. Gaševic, D. Djuric, and V. Devedžic. Model Driven Engineering and Ontology
Development. 2nd ed. Berlin: Springer, 2009. (See p. 22).
[57]
C. Galindo, A. Saffiotti, S. Coradeschi, P. Buschka, J. Fernández-Madrigal, and J.
González. “Multi-Hierarchical Semantic Maps for Mobile Robotics”. In: Edmonton,
CA, 2005. (See pp. 39, 59).
[58]
P. Gärdenfors. Conceptual Spaces: The Geometry of Thought. Cambridge, MA: MIT
Press, 2000. Pp. I–X, 1–307. (See pp. 16, 33, 36 sq., 40).
[59]
J. J. Gibson. “The theory of affordances”. In: Perceiving, Acting, and Knowing:
Toward and Ecological Psychology. Ed. by R. Shaw and J. Brandsford. Hillsdale, NJ:
Lawrence Erlbaum, 1977. Pp. 67–82. (See p. 3).
[60]
J. Gibson. The ecological approach to visual perception. Resources for ecological
psychology. Taylor & Francis, 1986. (See p. 29).
[61]
V. Haarslev and R. Möller. “High Performance Reasoning with Very Large Knowledge
Bases: A Practical Case Study”. In: Proceedings of the Seventeenth International Joint
Conference on Artificial Intelligence, IJCAI-01. Ed. by B. Nebel. 2001. Pp. 161–166.
(See p. 63).
[62]
V. Haarslev and R. Müller. “RACER System Description”. In: Automated Reasoning.
Ed. by R. Goré, A. Leitsch, and T. Nipkow. Vol. 2083. Lecture Notes in Computer
Science. Berlin, Heidelberg: Springer Berlin Heidelberg, June 2001. Chap. 59, pp. 701–
705. (See p. 63).
[63]
S. Harnad. “The symbol grounding problem”. Physica D: Nonlinear Phenomena 42,
pp. 335–346, 1990. (See pp. 2 sq., 26 sqq., 34, 81).
[64]
S. Harnad. “Grounding Symbols in the Analog World With Neural Nets - A Hybrid
Model”. Psycoloquy 12, 2001. (See p. 26).
[65]
C. Harris and M. Stephens. “A Combined Corner and Edge Detector”. In: Proceedings
of the 4th Alvey Vision Conference. 1988. Pp. 147–151. (See pp. 49, 54).
[66]
Z. S. Harris. “Distributional Structure”. Word 10, pp. 146–162, 1954. (See p. 54).
[67]
R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. 2nd ed.
Cambridge University Press, ISBN 0521540518, 2004. (See p. 47).
[68]
F. Heintz, J. Kvarnström, and P. Doherty. “A stream-based hierarchical anchoring
framework”. In: Proceedings of the 2009 IEEE/RSJ international conference on
Intelligent robots and systems. IROS’09. St. Louis, MO, USA: IEEE Press, 2009.
Pp. 5254–5260. (See p. 40).
[69]
F. Heintz, J. Kvarnström, and P. Doherty. “Stream-Based Reasoning Support for
Autonomous Systems”. In: Proceedings of the 2010 conference on ECAI 2010: 19th
European Conference on Artificial Intelligence. Amsterdam, The Netherlands, The
Netherlands: IOS Press, 2010. Pp. 183–188. (See p. 40).
[70]
S. Heymann, K. Müller, A. Smolic, B. Fröhlich, and T. Wiegand. Scientific Commons:
SIFT implementation and optimization for general-purpose gpu. 2007 (see p. 50).
[71]
I. Horrocks. “FaCT and iFaCT”. In: Description Logics. Ed. by P. Lambrix, A.
Borgida, M. Lenzerini, R. Möller, and P. F. Patel-Schneider. Vol. 22. CEUR Workshop
Proceedings. CEUR-WS.org, 1999. (See p. 63).
[72]
T. Joachims. “Advances in kernel methods”. In: ed. by B. Schölkopf, C. J. C. Burges,
and A. J. Smola. Cambridge, MA, USA: MIT Press, 1999. Chap. Making large-scale
support vector machine learning practical, pp. 169–184. (See p. 53).
[73]
L. Karlsson, A. Bouguerra, M. Broxvall, S. Coradeschi, and A. Saffiotti. “To Secure
an Anchor – A recovery planning approach to ambiguity in perceptual anchoring”.
AI Communications 21, pp. 1–14, 2008. (See p. 39).
[74]
L. Karlsson, A. Bouguerra, M. Broxvall, S. Coradeschi, and A. Saffiotti. “To secure
an anchor - a recovery planning approach to ambiguity in perceptual anchoring”. AI
Commun. 21, pp. 1–14, 2008. (See p. 29).
[75]
Y. Ke and R. Sukthankar. “PCA-SIFT: A more distinctive representation for local
image descriptors”. In: 2004. Pp. 506–513. (See p. 52).
[76]
Z. Kira. “Transferring embodied concepts between perceptually heterogeneous robots”.
In: IROS’09: Proceedings of the 2009 IEEE/RSJ international conference on Intelligent robots and systems. St. Louis, MO, USA: IEEE Press, 2009. Pp. 4650–4656. (See
p. 37).
[77]
Z. Kira. “Mapping Grounded Object Properties across Perceptually Heterogeneous
Embodiments.” In: FLAIRS Conference. Ed. by H. C. Lane and H. W. Guesgen.
AAAI Press, Nov. 25, 2009. (See p. 37).
[78]
Z. Kira. “Communication and alignment of grounded symbolic knowledge among
heterogeneous robots”. AAI3414475. PhD thesis. Atlanta, GA, USA, 2010. (See pp. 37,
40, 59).
[79]
Z. Kira. “Inter-robot transfer learning for perceptual classification”. In: Proceedings
of the 9th International Conference on Autonomous Agents and Multiagent Systems:
volume 1 - Volume 1. AAMAS ’10. Toronto, Canada: International Foundation for
Autonomous Agents and Multiagent Systems, 2010. Pp. 13–20. (See p. 37).
[80]
M. Kleinehagenbrock, S. Lang, J. Fritsch, F. Lomker, G. A. Fink, and G. Sagerer. “Person
tracking with a mobile robot based on multi-modal anchoring”. Proceedings 11th
IEEE International Workshop on Robot and Human Interactive Communication 6,
pp. 423–429, 2002. (See p. 39).
[81]
M. J. Kochenderfer and R. Gupta. “Common Sense Data Acquisition for Indoor
Mobile Robots”. In: Nineteenth National Conference on Artificial Intelligence
(AAAI-04). AAAI Press / The MIT Press, 2004. Pp. 605–610. (See p. 35).
[82]
G.-J. Kruijff and M. Brenner. “Modelling Spatio-Temporal Comprehension in Situated
Human-Robot Dialogue as Reasoning about Intentions and Plans”. In: Proceedings of
the Symposium on Intentions in Intelligent Systems. Stanford University, Palo Alto,
CA, USA. AAAI Spring Symposium Series 2007. AAAI, Mar. 2007. (See pp. 38, 59).
[83]
D. Kumar. “The SNePS BDI architecture”. Decis. Support Syst. 16, pp. 3–19, 1996.
(See p. 33).
[84]
L. Kunze, M. Tenorth, and M. Beetz. “Putting People’s Common Sense into Knowledge
Bases of Household Robots”. In: KI 2010: Advances in Artificial Intelligence. Ed. by
R. Dillmann, J. Beyerer, U. Hanebeck, and T. Schultz. Vol. 6359. Lecture Notes in
Computer Science. Springer Berlin / Heidelberg, 2010. Pp. 151–159. (See p. 36).
[85]
K. LeBlanc and A. Saffiotti. “Issues of Perceptual Anchoring in Ubiquitous Robotic
Systems”. In: Proc of the ICRA Workshop on Omniscent Space. Rome, Italy, 2007.
(See p. 37).
[86]
K. LeBlanc and A. Saffiotti. “Cooperative anchoring in heterogeneous multi-robot
systems”. In: Proc. of the IEEE Int. Conf. on Robotics and Automation (ICRA).
2008. Pp. 3308–3314. (See pp. 36 sq., 40, 76, 83).
[87]
K. LeBlanc and A. Saffiotti. “Multirobot Object Localization: A Fuzzy Fusion Approach”. IEEE Trans on System, Man and Cybernetics B 39, pp. 1259–1276, 2009.
(See p. 37).
[88]
K. LeBlanc. “Cooperative anchoring : sharing information about objects in multi-robot
systems”. PhD thesis. Örebro University, School of Science and Technology, 2010.
(See pp. 37, 40, 45).
[89]
K. LeBlanc and A. Saffiotti. “Cooperative information fusion in a network robot
system”. In: Proceedings of the 1st international conference on Robot communication
and coordination. RoboComm ’07. Piscataway, NJ, USA: IEEE Press, 2007. 42:1–42:4.
(See p. 37).
[90]
S. Lemaignan, R. Ros, and R. Alami. “Dialogue in situated environments: A symbolic
approach to perspective-aware grounding, clarification and reasoning for robot”. In:
2011. (See pp. 38, 59).
[91]
S. Lemaignan, R. Ros, R. Alami, and M. Beetz. “What are you talking about?
Grounding dialogue in a perspective-aware robotic architecture”. In: RO-MAN, 2011
IEEE. IEEE, July 2011. Pp. 107–112. (See p. 38).
[92]
S. Lemaignan, R. Ros, L. Mösenlechner, R. Alami, and M. Beetz. “ORO, a knowledge
management module for cognitive architectures in robotics”. In: Proceedings of the
2010 IEEE/RSJ International Conference on Intelligent Robots and Systems. Taipei,
Taiwan, 2010. Pp. 3548–3553. (See pp. 36, 40).
[93]
D. Lenat. “CYC: A Large-Scale Investment in Knowledge Infrastructure”. Communications of the ACM 38, pp. 33–38, 1995. (See pp. 35, 64 sq.).
[94]
D. B. Lenat, R. V. Guha, K. Pittman, D. Pratt, and M. Shepherd. “Cyc: toward
programs with common sense”. Commun. ACM 33, pp. 30–49, 1990. (See pp. 35,
64 sq.).
[95]
T. Leung and J. Malik. “Representing and Recognizing the Visual Appearance of
Materials using Three-dimensional Textons”. Int. J. Comput. Vision 43, pp. 29–44,
2001. (See pp. 55 sq.).
[96]
H. Lipson. “Evolutionary robotics: emergence of communication”. Curr Biol 17, R330–
R332, 2007. (See p. 30).
[97]
P. Lison, C. Ehrler, and G.-J. Kruijff. “Belief Modelling for Situation Awareness in
Human-Robot Interaction”. In: Proceedings of the 19th IEEE International Symposium
in Robot and Human Interactive Communication. International Symposium on Robot
and Human Interactive Communication (RO-MAN-10), September 12-15, Viareggio,
Italy. IEEE, 2010. (See p. 38).
[98]
H. Liu and P. Singh. “ConceptNet — A Practical Commonsense Reasoning Tool-Kit”.
BT Technology Journal 22, pp. 211–226, 2004. (See p. 64).
[99]
L. S. Lopes, A. J. Teixeira, M. Quindere, and M. Rodrigues. “From Robust Spoken
Language Understanding to Knowledge Acquisition and Management”. In: Proceedings
of Interspeech 2005, Lisboa, Portugal, p. 3469-3472. 2005. (See p. 34).
[100]
A. Loutfi, S. Coradeschi, and A. Saffiotti. “Maintaining Coherent Perceptual Information using Anchoring”. In: The Nineteenth International Joint Conference on
Artificial Intelligence (IJCAI ’05). Edinburgh, Scotland, 2005. Pp. 1477–1482. (See
pp. 4, 7, 32, 39, 82).
[101]
A. Loutfi, S. Coradeschi, L. Karlsson, and M. Broxvall. Putting Olfaction into Action:
Anchoring Symbols to Sensor Data Using Olfaction and Planning. 2005 (see p. 39).
[102]
A. Loutfi, S. Coradeschi, M. Daoutis, and J. Melchert. “Using Knowledge Representation for Perceptual Anchoring in a Robotic System”. Int. Journal on Artificial
Intelligence Tools (IJAIT) 17, pp. 925–944, 2008. (See p. 74).
[103]
D. G. Lowe. “Object Recognition from Local Scale-Invariant Features”. In: Proceedings
of the International Conference on Computer Vision-Volume 2 - Volume 2. 1999.
Pp. 1150–. (See p. 49).
[104]
D. Lowe. “Distinctive image features from scale-invariant keypoints”. International
Journal of Computer Vision 60, pp. 91–110, 2004. (See pp. 50 sqq., 54 sq.).
[105]
R. Lundh, L. Karlsson, and A. Saffiotti. “Plan-Based Configuration of a Group of
Robots”. In: Riva del Garda, IT, 2006. Pp. 683–687. (See p. 29).
[106]
G.-J. M. Kruijff, P. Lison, T. Benjamin, H. Jacobsson, and N. Hawes. “Incremental, multi-level processing for comprehending situated dialogue in human-robot
interaction”. In: Language and Robots: Proceedings from the Symposium (LangRo’2007). 2007. Pp. 509–514. (See pp. 38, 59).
[107]
S. Maji, A. Berg, and J. Malik. “Classification using intersection kernel support vector
machines is efficient”. Computer Vision and Pattern Recognition, 2008. CVPR 2008.
IEEE Conference on, pp. 1–8, 2008. (See p. 58).
[108]
D. Marr. Vision: A Computational Investigation into the Human Representation and
Processing of Visual Information. Henry Holt & Company, June 1983. (See p. 21).
[109]
C. Masolo, S. Borgo, A. Gangemi, N. Guarino, A. Oltramari, and L. Schneider. The
WonderWeb Library of Foundational Ontologies. Tech. rep. WonderWeb Deliverable
D17. Padova, Italy: ISTC-CNR, 2003 (see p. 71).
[110]
F. Mastrogiovanni, A. Sgorbissa, and R. Zaccaria. “A distributed architecture for
symbolic data fusion”. In: In Proceedings of IJCAI-07 – International Joint Conference
on Artificial Intelligence. 2007. (See pp. 37, 40).
[111]
J. McCarthy. “Programs with Common Sense”. In: Proceedings of the Teddington
Conference on the Mechanisation of Thought Processes. 1958. Pp. 77–84. (See p. 25).
[112]
O. Medelyan and C. Legg. “Integrating Cyc and Wikipedia: Folksonomy Meets
Rigorously Defined Common-Sense”. In: Wikipedia and Artificial Intelligence: An
Evolving Synergy, Papers from the 2008 AAAI Workshop. 2008. (See p. 65).
[113]
D. L. Medin and M. M. Schaffer. “Context theory of classification learning”. Psychological review 85, pp. 207–238, 1978. (See p. 7).
[114]
J. Melchert, S. Coradeschi, and A. Loutfi. “Spatial relations for perceptual anchoring”.
In: In Proceedings of AISB’07 – AISB Annual Convention, Newcastle upon Tyne.
2007. (See pp. 34, 39 sq., 59, 83).
[115]
J. Melchert, S. Coradeschi, and A. Loutfi. “Knowledge Representation and Reasoning
for Perceptual Anchoring”. Tools with Artificial Intelligence, 19th IEEE International
Conference on 1, pp. 129–136, 2007. (See pp. 34, 74).
[116]
R. Mendoza, B. Johnston, F. Yang, Z. Huang, X. Chen, and M.-A. Williams.
“OBOC: Ontology Based Object Categorisation for Robots”. 2008. (See p. 34).
[117]
K. Mikolajczyk and C. Schmid. “A Performance Evaluation of Local Descriptors”.
IEEE Trans. Pattern Anal. Mach. Intell. 27, pp. 1615–1630, 2005. (See p. 51).
[118]
K. Mikolajczyk. “Detection of local features invariant to affine transformations”.
PhD thesis. Grenoble: INPG, 2002. (See p. 51).
[119]
J. Modayil and B. Kuipers. “Autonomous development of a grounded object ontology
by a learning robot”. In: Proceedings of the 22nd national conference on Artificial
intelligence - Volume 2. AAAI’07. Vancouver, British Columbia, Canada: AAAI Press,
2007. Pp. 1095–1101. (See pp. 34, 59).
[120]
B. Motik. “Reasoning in Description Logics using Resolution and Deductive Databases”.
PhD thesis. Universität Karlsruhe (TH), Karlsruhe, Germany, 2006. (See p. 63).
[121]
O. M. Mozos, P. Jensfelt, H. Zender, G.-J. Kruijff, and W. Burgard. “From Labels to
Semantics: An Integrated System for Conceptual Spatial Representations of Indoor
Environments for Mobile Robots”. In: Proc. of the Workshop ”Semantic information
in robotics” at the IEEE International Conference on Robotics and Automation
(ICRA’07). Rome,Italy, Apr. 2007. (See pp. 34, 59).
[122]
A. Newell and H. A. Simon. “Computer science as empirical inquiry: symbols and
search”. Commun. ACM 19, pp. 113–126, 1976. (See pp. 16, 22).
[123]
A. Noë and E. Thompson. Vision and Mind: Selected Readings in the Philosophy of
Perception. Bradford Books. MIT Press, 2002. (See p. 18).
[124]
R. M. Nosofsky. “Attention, similarity, and the identification-categorization relationship.” J Exp Psychol Gen 115, pp. 39–61, 1986. (See p. 7).
[125]
C. Ogden and I. Richards. The Meaning of Meaning: A Study of the Influence of
Language upon Thought and of the Science of Symbolism. 10th ed. London: Routledge
& Kegan Paul Ltd., 1923. (See p. 28).
[126]
D. Pangercic, R. Tavcar, M. Tenorth, and M. Beetz. “Visual Scene Detection and
Interpretation using Encyclopedic Knowledge and Formal Description Logic”. In:
Proceedings of the International Conference on Advanced Robotics (ICAR). 2009.
(See p. 36).
[127]
B. Parsia and E. Sirin. “Pellet: An OWL DL Reasoner”. In: 3rd International Semantic
Web Conference (ISWC2004). 2004. (See p. 63).
[128]
D. Pecher and R. A. Zwaan, eds. Grounding Cognition: The Role of Perception and Action in Memory, Language, and Thinking. Cambridge University Press, 2005. (See p. 3).
[129]
C. Peirce. Collected Papers of Charles Sanders Peirce. Collected Papers of Charles
Sanders Peirce: Science and Philosophy. Reviews, Correspondence, and Bibliography
v. 1-8. Belknap Press of Harvard University Press, 1958. (See p. 28).
[130]
Z. W. Pylyshyn. “Visual indexes, preconceptual objects, and situated vision”. Cognition
80, pp. 127–158, 2001. (See p. 6).
[131]
G. Randelli. “Improving Human-Robot Awareness through Semantic-driven Tangible Interaction”. 2011. (See p. 38).
[132]
L. Roberts. Machine perception of three-dimensional solids. Tech. rep. DTIC Document, 1963 (see p. 21).
[133]
R. Ros, S. Lemaignan, E. A. Sisbot, R. Alami, J. Steinwender, K. Hamann, and
F. Warneken. “Which One? Grounding the Referent Based on Efficient Human-Robot
Interaction”. In: 19th IEEE International Symposium in Robot and Human Interactive
Communication. 2010. (See p. 38).
[134]
E. Rosch. “Principles of Categorization”. In: Readings in Cognitive Science, a Perspective From Psychology and Artificial Intelligence. Morgan Kaufmann Publishers,
1988. Pp. 312–22. (See p. 7).
[135]
D. Roy. “Semiotic schemas: A framework for grounding language in action and
perception”. Artificial Intelligence 167, pp. 170–205, 2005. (See p. 29).
REFERENCES
[136]
D. Roy. “Grounding words in perception and action: computational insights”. Trends
in Cognitive Sciences 9, pp. 389–396, 2005. (See p. 29).
[137]
D. Roy and E. Reiter. “Connecting language to the world”. Artif. Intell. 167, pp. 1–12,
2005. (See p. 6).
[138]
S. J. Russell and P. Norvig. Artificial Intelligence: A Modern Approach (2nd Edition).
Prentice Hall, 2002. (See pp. 16 sq.).
[139]
A. Saffiotti. “Pick-up What?” In: Current Trends in AI Planning. Ed. by C. Bäckström
and E. Sandewall. 1994. (See p. 30).
[140]
A. Saffiotti and M. Broxvall. “PEIS Ecologies: Ambient Intelligence meets Autonomous
Robotics”. In: Proc of the Int Conf on Smart Objects and Ambient Intelligence (sOc-EUSAI). Grenoble, France, 2005. Pp. 275–280. (See pp. 72, 77).
[141]
P. E. Santos. Formalising the Common Sense of a Mobile Robot. Tech. rep. 1998. (See
pp. 25, 35).
[142]
M. Schmidt-Schauß and G. Smolka. “Attributive Concept Descriptions with Complements”. Artif. Intell. 48, pp. 1–26, 1991. (See p. 60).
[143]
J. R. Searle. “Minds, Brains and Programs”. Behavioral and Brain Sciences 3, pp. 417–
57, 1980. (See pp. 16, 26).
[144]
M. Shanahan. A Logical Account of the Common Sense Informatic Situation for a
Mobile Robot. 1996. (See pp. 25, 35).
[145]
S. C. Shapiro and H. O. Ismail. “Anchoring in a grounded layered architecture with
integrated reasoning”. Robotics and Autonomous Systems 43 Perceptual Anchoring:
Anchoring Symbols to Sensor Data in Single and Multiple Robot Systems, pp. 97
–108, 2003. (See pp. 25, 33).
[146]
S. C. Shapiro and W. J. Rapaport. “SNePS considered as a fully intensional propositional semantic network”. In: The Knowledge Frontier. Springer-Verlag, 1987. Pp. 263–
315. (See pp. 25, 33).
[147]
P. Singh, T. Lin, E. T. Mueller, G. Lim, T. Perkins, and W. L. Zhu. “Open Mind
Common Sense: Knowledge Acquisition from the General Public”. In: On the Move
to Meaningful Internet Systems, 2002 - DOA/CoopIS/ODBASE 2002 Confederated
International Conferences DOA, CoopIS and ODBASE 2002. London, UK: Springer-Verlag, 2002. Pp. 1223–1237. (See pp. 35, 64, 66).
[148]
J. Sivic and A. Zisserman. “Efficient Visual Search of Videos Cast as Text Retrieval”.
IEEE Transactions on Pattern Analysis and Machine Intelligence 31, pp. 591–606,
2009. (See p. 54).
[149]
L. Steels. “Evolving grounded communication for robots”. Trends in Cognitive Science
7, pp. 308–312, 2003. (See p. 29).
[150]
L. Steels and F. Kaplan. “AIBO’s first words: The social learning of language and
meaning”. Evolution of Communication 4, pp. 3–32, 2001. (See p. 29).
[151]
L. Steels and F. Kaplan. “Bootstrapping grounded word semantics”. In: Linguistic
Evolution through Language Acquisition: Formal and Computational Models. Ed. by
T. Briscoe. Cambridge University Press, 2002. Chap. 3. (See p. 30).
[152]
L. Steels and P. Vogt. “Grounding adaptive language games in robotic agents”. In:
ECAL97. Ed. by I. Harvey and P. Husbands. Cambridge, MA: MIT Press, 1997. (See
p. 29).
[153]
L. Steels. “Language Games for Autonomous Robots”. IEEE Intelligent Systems 16,
pp. 16–22, 2001. (See p. 29).
[154]
M. Tenorth and M. Beetz. “KnowRob — Knowledge Processing for Autonomous
Personal Robots”. In: IEEE/RSJ International Conference on Intelligent RObots
and Systems. 2009. (See pp. 34, 40).
[155]
M. Tenorth, D. Nyga, and M. Beetz. “Understanding and Executing Instructions for
Everyday Manipulation Tasks from the World Wide Web.” In: IEEE International
Conference on Robotics and Automation (ICRA). 2010. (See pp. 35, 40).
[156]
M. Tenorth and M. Beetz. “Towards Practical and Grounded Knowledge Representation Systems for Autonomous Household Robots”. In: Proceedings of the 1st
International Workshop on Cognition for Technical Systems, München, Germany,
6-8 October. 2008. (See p. 35).
[157]
M. Tenorth, D. Jain, and M. Beetz. “Knowledge Processing for Cognitive Robots”.
KI 24, pp. 233–240, 2010. (See p. 34).
[158]
M. Tenorth, L. Kunze, D. Jain, and M. Beetz. “KNOWROB-MAP – Knowledge-Linked Semantic Object Maps”. In: 10th IEEE-RAS International Conference on
Humanoid Robots. Nashville, TN, USA, 2010. Pp. 430–435. (See p. 35).
[159]
A. Treisman. “The binding problem.” Current opinion in neurobiology 6, pp. 171–178,
1996. (See p. 6).
[160]
A. M. Treisman and G. Gelade. “A feature-integration theory of attention”. Cognitive
Psychology 12, pp. 97–136, 1980. (See p. 6).
[161]
D. Tsarkov and I. Horrocks. “FaCT++ description logic reasoner: System description”.
Lecture Notes in Computer Science 4130 (LNAI), pp. 292–297, 2006.
(See p. 63).
[162]
S. Ullmann. Semantics: An Introduction to the Science of Meaning. Blackwell Paperbacks. Blackwell, 1972. (See p. 28).
[163]
Thought: Fordham University Quarterly. Fordham University Press. (See p. 16).
[164]
V. N. Vapnik. The nature of statistical learning theory. New York, NY, USA: Springer-Verlag New York, Inc., 1995. (See p. 53).
[165]
V. Vapnik. Statistical learning theory. Wiley, 1998. Pp. I–XXIV, 1–736. (See p. 58).
[166]
P. Vogt and F. Divina. “Social symbol grounding and language evolution”. Interaction
Studies 8, pp. 31–52, 2007. (See p. 30).
[167]
P. Vogt. “Anchoring Symbols to Sensorimotor Control”. In: Proceedings of the Belgian/Netherlands Artificial Intelligence Conference BNAIC’02. 2002. (See
pp. 28, 30).
[168]
P. Vogt. “The physical symbol grounding problem”. Cognitive Systems Research 3,
pp. 429–457, 2002. (See pp. 28 sqq.).
[169]
P. Vogt. “Anchoring of semiotic symbols”. Robotics and Autonomous Systems 43,
pp. 109–120, 2003. (See p. 28).
[170]
P. Vogt. “Language evolution and robotics: Issues in symbol grounding and language
acquisition”. In: Artificial Cognition Systems. Ed. by A. Loula, R. Gudwin, and J.
Queiroz. Idea Group, 2006. (See pp. 28 sq.).
[171]
P. Vogt and F. Divina. “Social symbol grounding and language evolution”. Interaction
Studies 8, pp. 31–52, 2007. (See pp. 3 sq., 29).
[172]
T. Winograd. “Procedures as a Representation for Data in a Computer Program for
Understanding Natural Language” (SHRDLU). PhD thesis. Cambridge, MA: Massachusetts Institute of Technology, Jan. 1971. (See p. 25).
[173]
C. Wu. SiftGPU: A GPU Implementation of Scale Invariant Feature Transform
(SIFT). http://cs.unc.edu/~ccwu/siftgpu. 2007. (See p. 52).
[174]
H. Zender, P. Jensfelt, Ó. M. Mozos, G. M. Kruijff, and W. Burgard. “An integrated robotic system for spatial understanding and situated interaction in indoor
environments”. In: Proc. AAAI ’07. 2007. Pp. 1584–1589. (See p. 34).
[175]
H. Zender and G.-J. M. Kruijff. “Towards Generating Referring Expressions in a
Mobile Robot Scenario”. In: Language and Robots: Proceedings of the Symposium.
Aveiro, Portugal, 2007. Pp. 101–106. (See pp. 39 sq.).
[176]
H. Zender, G.-J. Kruijff, and I. Kruijff-Korbayová. “A Situated Context Model for
Resolution and Generation of Referring Expressions”. In: Proceedings of the 12th
European Workshop on Natural Language Generation. Association for Computational
Linguistics. Mar. 2009. Pp. 126–129. (See pp. 39 sq.).
[177]
H. Zender, G.-J. M. Kruijff, and I. Kruijff-Korbayová. “Situated resolution and
generation of spatial referring expressions for robotic assistants”. In: Proceedings of
the 21st International Joint Conference on Artificial Intelligence. IJCAI’09. Pasadena,
California, USA: Morgan Kaufmann Publishers Inc., 2009. Pp. 1604–1609. (See
pp. 38 sqq., 59).