Halbrügge, M. (2016). Towards the Evaluation of Cognitive Models using Anytime Intelligence Tests. In D. Reitter & F. E. Ritter (eds.),
Proceedings of the 14th international conference on cognitive modeling (pp. 261–263). University Park, PA: Penn State.
Towards the Evaluation of Cognitive Models using Anytime Intelligence Tests
Marc Halbrügge ([email protected])
Quality & Usability Lab, Telekom Innovation Laboratories
Technische Universität Berlin
Ernst-Reuter-Platz 7, 10587 Berlin
Abstract
Cognitive models are usually evaluated based on their fit to
empirical data. Artificial intelligence (AI) systems on the other
hand are mainly evaluated based on their performance. Within
the field of artificial general intelligence (AGI) research, a new
type of performance measure for AGI systems has recently
been proposed that tries to cover both humans and artificial
systems: Anytime Intelligence Tests (AIT; Hernández-Orallo
& Dowe, 2010). This paper explores the viability of the AIT
formalism for the evaluation of cognitive models based on data
from the ICCM 2009 “Dynamic Stocks and Flows” modeling
challenge.
Keywords: Anytime Intelligence Test; Model Evaluation; Decision Making; Stock-Flow Problems
Introduction
Cognitive modeling as a field, although rooted in AI, has diverged from AI research in recent years because the two fields pursue different goals. While modelers try to understand human behavior by creating systems that act as human-like as possible, AI researchers strive for systems that act as perfectly as possible, or, in Legg and Hutter's (2007) words, with universal intelligence. At the same time, parts of the cognitive modeling community suggest directing the field towards more generic models of human behavior ("cognitive supermodels"; Salvucci, 2010) as opposed to task-specific models. Such supermodels have not yet come into existence, but given the methodological advances in the AI field, it may be worthwhile to consider how the abilities (i.e., the intelligence) of generic cognitive models should be evaluated. A promising approach to this question is that of anytime intelligence tests (AIT; Hernández-Orallo & Dowe, 2010).
Anytime Intelligence Tests
These intelligence tests cross the boundary between the modeling and the AI field because they target both biological and artificial systems. Based on the work of Legg and Hutter (2007), they intend to measure the intelligence of an agent (i.e., a 'model' in the terms of the 'other' field) as the accumulated amount of reward¹ r it receives through interaction with a set of (deterministic) environments of varying (computational) complexity. The validity of this accumulated reward
is achieved through several means: a) The reward is bound to the range [−1; 1]; b) All environments must be balanced, i.e., a random agent will on average receive a reward of zero; and c) The aggregated reward is scaled by the computational complexity of the transition function of the environment. An example of such an intelligence test that was applied to both humans and AI agents can be found in Insa-Cabrera, Dowe, España-Cubillo, Hernández-Lloreda, and Hernández-Orallo (2011).

¹ Note: In reinforcement learning, reward functions are a crucial part of the learning agents themselves (Singh, Lewis, & Barto, 2009). In the context of this paper, rewards come from an external 'critic' and were not available to the agents (i.e., human participants and cognitive models) during exploration and learning.
Besides the construction of new environments that follow
the AIT formalism, one can try to analyze published data and
models from the literature. This way, potential insights from
the AIT procedure can be compared to fit-based evaluations
that have been performed before. It is often possible to transform existing tasks into AIT environments by constructing
new reward functions for the tasks.
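The AIT scoring idea, bounded per-environment reward aggregated with a complexity weighting, can be sketched in a few lines. The environment interface, the toy environment, and the 2^(−complexity) weighting below are illustrative assumptions in the spirit of Legg and Hutter's universal distribution; the actual AIT protocol is adaptive and time-bounded and considerably more elaborate.

```python
import random

def evaluate_agent(agent, environments, steps=1000):
    """Accumulate the mean per-environment reward, weighted by environment
    complexity. The 2**(-complexity) weighting follows the universal-
    distribution idea; it is an illustrative stand-in for the AIT schedule."""
    total = 0.0
    for step_fn, complexity in environments:
        mean_reward = sum(step_fn(agent()) for _ in range(steps)) / steps
        mean_reward = max(-1.0, min(1.0, mean_reward))  # rewards are bound to [-1, 1]
        total += 2.0 ** (-complexity) * mean_reward
    return total

# Toy balanced environment (an assumption for demonstration): reward +1 if
# the action matches a fixed target bit, -1 otherwise. A random agent
# therefore averages a reward of zero, satisfying the balance requirement.
def match_bit(action, target=1):
    return 1.0 if action == target else -1.0

random_agent = lambda: random.randint(0, 1)   # guesses uniformly
perfect_agent = lambda: 1                     # always hits the target

envs = [(match_bit, 3.0)]  # list of (environment, complexity) pairs
```

Running the balance check on the toy environment, the perfect agent scores 2^(−3) · 1 = 0.125, while the random agent's score approaches zero as the number of steps grows.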
Dynamic Stock and Flow Task
A promising candidate for such a reward reconstruction is the
Dynamic Stock and Flow task (DSF; Dutt & Gonzalez, 2007)
that has been used for the modeling challenge of the same
name (Lebiere, Gonzalez, & Warwick, 2009). The task for
this challenge was to maintain the level (i.e., stock) in a water tank at a given target value in the presence of dynamically
changing water in- or outflow from an external source. There
were four training conditions with monotonically changing inflow (Lin−, Lin+, NonL−, NonL+) and five transfer conditions. Two of these featured linearly increasing inflow (like Lin+), but the agents' actions were delayed by one (Del2) or two (Del3) additional time steps. The remaining conditions featured repeating inflow sequences of two (Seq2, Seq2Nos) or four (Seq4) time steps; in the case of Seq2Nos, the pattern was masked by additional noise. The evaluation of the models in the competition was based on the goodness-of-fit to human data in all nine conditions. The crucial variable for this fit was the time-dependent water level.
Besides convenience (i.e., availability of data and models),
the DSF task is especially suited as an AIT because it is deterministic and open-ended (in contrast to, e.g., robotic soccer),
and the computational complexity of the environment should
be both easily scalable and easily quantifiable. Whether and how this task can be transformed into an AIT will now be reviewed with regard to possible reward functions for the task. The question of the complexity of the different task conditions will be discussed in a later contribution.
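The task dynamics described above can be sketched as a short simulation loop. The inflow schedule, start and goal levels, and the action-delay mechanism below are illustrative assumptions matching the condition descriptions, not the challenge's exact parameters.

```python
def simulate_dsf(policy, env_inflow, steps=100, start=4.0, goal=6.0, delay=0):
    """Minimal Dynamic Stocks and Flows loop (a sketch, not the challenge code).

    policy(level, goal) returns the agent's net flow (user inflow minus user
    outflow); env_inflow(t) is the external inflow at step t; with delay > 0
    the agent's action takes effect `delay` steps later, as in Del2/Del3.
    """
    pending = [0.0] * delay          # queue of not-yet-applied agent actions
    level, history = start, []
    for t in range(steps):
        pending.append(policy(level, goal))
        level += env_inflow(t) + pending.pop(0)  # oldest action takes effect
        history.append(level)
    return history

# Linearly increasing external inflow, in the spirit of the Lin+ condition
lin_plus = lambda t: 0.2 + 0.1 * t

# A greedy controller: always move the stock straight to the goal level
greedy = lambda level, goal: goal - level
```

With delay = 0 and no external inflow the greedy controller reaches the goal in one step and stays there; with delay > 0 the same controller overshoots and the stock level oscillates, which illustrates why the delay conditions are hard.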
Possible Reward Functions
In the following, r_t denotes the reward at time step t, amount_t the water level at t, env_t the external inflow at t, and goal the target water level.
Scaled Absolute Difference. For every time step, the absolute difference to the target water level is multiplied by some constant c and mapped to the range from −1 to 1.

    r_t = max(−1, 1 − c · |amount_t − goal|)    (absolute, c = 5)

This is the most straightforward solution with the highest face validity. The agent's proximity to the target level is represented very well. On the other hand, c is arbitrary. Most problematic is that the function is not balanced: because all task conditions feature external water inflow, random agents would receive an accumulated reward close to the lower boundary of −1.

Relative Progress. A balanced environment could be created by concentrating on the relative progress towards the target level. The simplest option is a binary decision whether the water level has improved.

    r_t = 1 if |amount_t − goal| ≤ |amount_{t−1} − goal|, −1 otherwise    (relative)

This solution has the downside of underrepresenting the current water level: getting from anywhere to the exact target level would be as good as getting an arbitrarily small amount closer to it. At the same time, environments with external in- or outflow would still not result in random agents receiving a reward of zero.

Relative Progress with Weighting. This can be solved using the following rationale: a perfect agent would always bring the water level to the goal amount in the next step. If this maps to a reward of 1 and no action maps to a reward of 0, then the agent should be awarded a reward that is proportional to the stock change made by the agent compared to these two extremes. The result is clipped at −1 in order to stick to the properties of AIT.

    r_t = max(−1, 1 − |amount_t − goal| / |amount_{t−1} + env_t − goal|)    (weighted)

Boxplots of the human data collected for the DSF challenge, recoded using the proposed reward functions, together with the results achieved by a random agent (flow ∼ N(0, 5)), a null agent (flow = 0), and an ACT-R model² (Halbrügge, 2010), are given in Figure 1. Of the three proposed reward functions, 'weighted relative progress' provides the best fit to the AIT requirement of balanced rewards.

Figure 1: Average reward in the DSF task after recoding using the three proposed reward functions (panels: absolute with c = 5, relative, weighted; x-axis: DSF condition; y-axis: average reward from −1.0 to 1.0). Boxplots: Human Data. Squares: Null Agent (no action). Crosses: Random Agent. Triangles: ACT-R model (Halbrügge, 2010).

The accumulated reward for the human sample provides interesting evidence about the differing difficulty of the nine task conditions. While the four monotonic conditions on the left (Lin− to NonL+) are all comparatively easy, the delay conditions are very hard. In the Del3 condition, the median of the human sample is close to random performance.³ The difficulty of the sequence conditions lies between the monotonic and the delay conditions.

Discussion and Conclusions

The performance of both humans and the model will be compared to the computational complexity of the respective environments. Then, the models that entered the competition can be evaluated with respect to their intelligence as opposed to their fit to the human data. Such an evaluation should consider the computational complexity of the models as well (Halbrügge, 2007). Complexity metrics based on the source code could be accompanied by Model Flexibility Analysis (Veksler, Myers, & Gluck, 2015), which tries to estimate the range of possible model behavior through simulation (see also Gluck, Stanley, Moore, Reitter, & Halbrügge, 2010). Together with the AIT formalism, this could lead to a new evaluation criterion that would be complementary to fit measures like R² and RMSE and could also provide a step towards reuniting the fields of cognitive modeling and artificial (general) intelligence.

² The source code of the cognitive model is available for download at http://dx.doi.org/10.14279/depositonce-5163
³ Especially the delay conditions often lead to oscillating stock levels, which renders averaging across participants questionable.
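The three proposed reward functions translate directly into code; the following sketch is a plain transcription of the formulas, with c = 5 for the absolute variant. The guard against a zero denominator in the weighted variant is an added assumption, not part of the paper's definition.

```python
def reward_absolute(amount_t, goal, c=5.0):
    """Scaled absolute difference: 1 at the goal, clipped at -1."""
    return max(-1.0, 1.0 - c * abs(amount_t - goal))

def reward_relative(amount_t, amount_prev, goal):
    """Binary relative progress: did the level get closer to the goal?"""
    return 1.0 if abs(amount_t - goal) <= abs(amount_prev - goal) else -1.0

def reward_weighted(amount_t, amount_prev, env_t, goal):
    """Weighted relative progress: 1 for reaching the goal, 0 for doing
    nothing (amount_t == amount_prev + env_t), clipped at -1."""
    no_action_error = abs(amount_prev + env_t - goal)
    if no_action_error == 0.0:
        # Division-by-zero guard (an assumption): doing nothing would have
        # hit the goal exactly, so only staying at the goal earns reward 1.
        return 1.0 if amount_t == goal else -1.0
    return max(-1.0, 1.0 - abs(amount_t - goal) / no_action_error)
```

A perfect agent (amount_t = goal) receives reward_weighted = 1, and a null agent (amount_t = amount_prev + env_t) receives 0, matching the rationale behind the weighted variant.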
References
Dutt, V., & Gonzalez, C. (2007). Slope of inflow impacts dynamic decision making. In Proceedings of the 25th international conference of the system dynamics society (p. 79).
Boston, MA.
Gluck, K. A., Stanley, C. T., Moore, L. R., Reitter, D., & Halbrügge, M. (2010). Exploration for understanding in cognitive modeling. Journal of Artificial General Intelligence,
2(2), 88–107. doi: 10.2478/v10229-011-0011-7
Halbrügge, M. (2007). Evaluating cognitive models and architectures. In G. A. Kaminka & C. R. Burghart (Eds.), Evaluating architectures for intelligence: Papers from the 2007 AAAI workshop (pp. 27–31). Menlo Park, CA: AAAI Press.
Halbrügge, M. (2010). Keep it simple – a case study of model development in the context of the dynamic stocks and flows (DSF) task. Journal of Artificial General Intelligence, 2(2), 38–51. doi: 10.2478/v10229-011-0008-2
Hernández-Orallo, J., & Dowe, D. L. (2010). Measuring universal intelligence: Towards an anytime intelligence
test. Artificial Intelligence, 174(18), 1508–1539. doi:
10.1016/j.artint.2010.09.006
Insa-Cabrera, J., Dowe, D. L., España-Cubillo, S., Hernández-Lloreda, M. V., & Hernández-Orallo, J. (2011). Comparing humans and AI agents. In Artificial general intelligence (pp. 122–132). Springer. doi: 10.1007/978-3-642-22887-2_13
Lebiere, C., Gonzalez, C., & Warwick, W. (2009). A comparative approach to understanding general intelligence: Predicting cognitive performance in an open-ended dynamic
task. In Proceedings of the second conference on artificial
general intelligence (pp. 103–107). Amsterdam: Atlantis
Press.
Legg, S., & Hutter, M. (2007). Universal intelligence: A
definition of machine intelligence. Minds and Machines,
17(4), 391–444. doi: 10.1007/s11023-007-9079-x
Salvucci, D. D. (2010). Cognitive supermodels. (Invited talk,
European ACT-R Workshop, Groningen, The Netherlands)
Singh, S., Lewis, R. L., & Barto, A. G. (2009). Where do
rewards come from? In Proceedings of the annual conference of the cognitive science society (pp. 2601–2606).
Veksler, V. D., Myers, C. W., & Gluck, K. A. (2015). Model
flexibility analysis. Psychological Review, 122, 755-769.
doi: 10.1037/a0039657