Halbrügge, M. (2016). Towards the Evaluation of Cognitive Models using Anytime Intelligence Tests. In D. Reitter & F. E. Ritter (Eds.), Proceedings of the 14th International Conference on Cognitive Modeling (pp. 261–263). University Park, PA: Penn State.

Towards the Evaluation of Cognitive Models using Anytime Intelligence Tests

Marc Halbrügge ([email protected])
Quality & Usability Lab, Telekom Innovation Laboratories
Technische Universität Berlin
Ernst-Reuter-Platz 7, 10587 Berlin

Abstract

Cognitive models are usually evaluated based on their fit to empirical data. Artificial intelligence (AI) systems, on the other hand, are mainly evaluated based on their performance. Within the field of artificial general intelligence (AGI) research, a new type of performance measure for AGI systems has recently been proposed that tries to cover both humans and artificial systems: Anytime Intelligence Tests (AIT; Hernández-Orallo & Dowe, 2010). This paper explores the viability of the AIT formalism for the evaluation of cognitive models based on data from the ICCM 2009 "Dynamic Stocks and Flows" modeling challenge.

Keywords: Anytime Intelligence Test; Model Evaluation; Decision Making; Stock-Flow Problems

Introduction

Cognitive modeling as a field, although rooted in AI, has diverged from AI research in recent years because the two fields pursue different goals. While modelers try to understand human behavior by creating systems that act as humanlike as possible, AI researchers strive for systems that act as perfectly as possible, or in Legg and Hutter's (2007) words: universal intelligence. At the same time, parts of the cognitive modeling community suggest directing the field towards more generic models of human behavior ("cognitive supermodels"; Salvucci, 2010) as opposed to task-specific models.
Such supermodels have not come into existence yet, but given the methodological advances in the AI field, it may be worthwhile to think about how the abilities (i.e., intelligence) of generic cognitive models should be evaluated. A promising approach to this question is anytime intelligence tests (AIT; Hernández-Orallo & Dowe, 2010).

Anytime Intelligence Tests

These intelligence tests cross the boundaries between the modeling and the AI fields because they target both biological and artificial systems. Based on the work of Legg and Hutter (2007), they intend to measure the intelligence of an agent (i.e., a 'model' in the terms of the 'other' field) as the accumulated amount of reward¹ r it receives through interaction with a set of (deterministic) environments of varying (computational) complexity. The validity of this accumulated reward is ensured through several means: a) the reward is bounded to the range [-1; 1]; b) all environments must be balanced, i.e., a random agent will on average receive a reward of zero; and c) the aggregated reward is scaled by the computational complexity of the transition function of the environment. An example of such an intelligence test that was applied to both humans and AI agents can be found in Insa-Cabrera, Dowe, España-Cubillo, Hernández-Lloreda, and Hernández-Orallo (2011).

Besides the construction of new environments that follow the AIT formalism, one can try to analyze published data and models from the literature. This way, potential insights from the AIT procedure can be compared to fit-based evaluations that have been performed before.

¹ Note: In reinforcement learning, reward functions are a crucial part of the learning agents themselves (Singh, Lewis, & Barto, 2009). In the context of this paper, rewards come from an external 'critic' and were not available to the agents (i.e., human participants and cognitive models) during exploration and learning.
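Requirement (b) can be checked empirically: in a balanced environment, the mean per-step reward of a uniformly random agent approaches zero. The following Monte-Carlo sketch illustrates such a check; the toy environment, function names, and parameters are illustrative assumptions, not taken from the AIT literature.

```python
import random

def average_reward(env_step, agent, episodes=2000, horizon=20, seed=1):
    """Monte-Carlo estimate of an agent's mean per-step reward.

    env_step(state, action) -> (new_state, reward), reward in [-1, 1];
    agent(state, rng) -> action.  For a balanced environment this
    estimate should be close to zero when the agent acts randomly.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        state = 0.0
        for _ in range(horizon):
            action = agent(state, rng)
            state, reward = env_step(state, action)
            total += reward
    return total / (episodes * horizon)

# Toy environment, balanced by construction: the reward simply equals
# the (+1/-1) action, so a uniform random policy averages zero.
def toy_env_step(state, action):
    return state + action, action

def random_agent(state, rng):
    return rng.choice([-1.0, 1.0])
```

With these definitions, `average_reward(toy_env_step, random_agent)` comes out close to zero; an unbalanced environment would show a systematic offset instead.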
It is often possible to transform existing tasks into AIT environments by constructing new reward functions for them.

Dynamic Stock and Flow Task

A promising candidate for such a reward reconstruction is the Dynamic Stock and Flow task (DSF; Dutt & Gonzalez, 2007) that has been used for the modeling challenge of the same name (Lebiere, Gonzalez, & Warwick, 2009). The task in this challenge was to maintain the level (i.e., stock) of a water tank at a given target value in the presence of dynamically changing water in- or outflow from an external source. There were four training conditions with monotonically changing inflow (Lin-, Lin+, NonL-, NonL+) and five transfer conditions. Two of these featured linearly increasing inflow (like Lin+), but the agents' actions were delayed by one (Del2) or two (Del3) additional time steps. The remaining conditions featured repeating inflow sequences of length two (Seq2, Seq2Nos) or four (Seq4); in the case of Seq2Nos, the pattern was masked by additional noise.

The evaluation of the models in the competition was based on their goodness of fit to the human data in all nine conditions. The crucial variable for this fit was the time-dependent water level.

Besides convenience (i.e., availability of data and models), the DSF task is especially suited as an AIT because it is deterministic and open-ended (in contrast to, e.g., robotic soccer), and the computational complexity of the environment should be both easily scalable and easily quantifiable. Whether and how this task can be transformed into an AIT will now be reviewed with regard to possible reward functions for the task. The question of the complexity of the different task conditions will be discussed in a later contribution.

Possible Reward Functions

In the following, r_t denotes the reward for time step t, amount_t the water level at t, env_t the external inflow at t, and goal the target water level.
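Using this notation, the three reward functions proposed below can be written down as a short Python sketch. The function names and the zero-denominator guard in the weighted variant are my additions, not part of the challenge software; the weighted formula is taken to be max(-1, 1 - |amount_t - goal| / |amount_{t-1} + env_t - goal|), as follows from the rationale given in the text.

```python
def reward_absolute(amount, goal, c=5.0):
    """Scaled absolute difference: r_t = max(-1, 1 - c*|amount_t - goal|)."""
    return max(-1.0, 1.0 - c * abs(amount - goal))

def reward_relative(amount, prev_amount, goal):
    """Binary relative progress: +1 if the distance to the goal
    did not grow, -1 otherwise."""
    return 1.0 if abs(amount - goal) <= abs(prev_amount - goal) else -1.0

def reward_weighted(amount, prev_amount, env_inflow, goal):
    """Weighted relative progress: 1 for reaching the goal, 0 for doing
    nothing (amount = prev_amount + env_inflow), clipped at -1."""
    no_action_dist = abs(prev_amount + env_inflow - goal)
    if no_action_dist == 0.0:  # guard (my assumption): goal reached without acting
        return 1.0 if amount == goal else -1.0
    return max(-1.0, 1.0 - abs(amount - goal) / no_action_dist)
```

For example, with prev_amount = 2, env_inflow = 1, and goal = 4, the weighted variant yields 1 when the agent reaches the goal exactly, 0 when it does nothing (stock drifts to 3), and -1 for any sufficiently large overshoot.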
Scaled Absolute Difference. For every time step, the absolute difference to the target water level is multiplied by some constant c and mapped to the range from -1 to 1:

r_t = max(-1, 1 - c · |amount_t - goal|)

This is the most straightforward solution with the highest face validity. The agent's proximity to the target level is represented very well. On the other hand, c is arbitrary. Most problematic is that the function is not balanced. Because all task conditions feature external water inflow, random agents would receive an accumulated reward close to the lower boundary of -1.

Relative Progress. A balanced environment could be created by concentrating on the relative progress towards the target level. The simplest option is a binary decision whether the water level has improved:

r_t = 1 if |amount_t - goal| ≤ |amount_{t-1} - goal|, and -1 otherwise

This solution has the downside of the current water level being underrepresented: getting from anywhere to the exact target level would be as good as getting an arbitrarily small amount closer to it. At the same time, environments with external in- or outflow would still not result in random agents receiving a reward of zero.

Relative Progress with Weighting. This can be solved using the following rationale: A perfect agent would always bring the water level to the goal amount in the next step. If this maps to a reward of 1 and no action maps to 0, then the agent should be awarded a reward that is proportional to the stock change made by the agent compared to these two extremes. The result is clipped at -1 in order to stick to the properties of AIT:

r_t = max(-1, 1 - |amount_t - goal| / |amount_{t-1} + env_t - goal|)

[Figure 1: Average reward in the DSF task after recoding using the three proposed reward functions (panels: absolute (c = 5), relative, weighted; x-axis: DSF condition). Boxplots: human data. Squares: null agent (no action). Crosses: random agent. Triangles: ACT-R model (Halbrügge, 2010).]

Boxplots of the human data collected for the DSF challenge, recoded using the proposed reward functions, together with the results achieved by a random agent (flow ∼ N(0, 5)), a null agent (flow = 0), and an ACT-R model² (Halbrügge, 2010), are given in Figure 1. Of the three proposed reward functions, 'weighted relative progress' provides the best fit to the AIT requirement of balanced rewards.

Discussion and Conclusions

The accumulated reward for the human sample provides interesting evidence about the differing difficulty of the nine task conditions. While the four monotonic conditions on the left (Lin- to NonL+) are all comparatively easy, the delay conditions are very hard. In the Del3 condition, the median of the human sample is close to random performance. The difficulty of the sequence conditions lies between the monotonic and the delay conditions.

The performance of both humans and the model will be compared to the computational complexity of the respective environments. Then, the models that had entered the competition can be evaluated with respect to their intelligence as opposed to their fit to the human data.³ Such an evaluation should consider the computational complexity of the models as well (Halbrügge, 2007). Complexity metrics based on the source code could be accompanied by Model Flexibility Analysis (Veksler, Myers, & Gluck, 2015), which tries to estimate the range of possible model behavior through simulation (see also Gluck, Stanley, Moore, Reitter, & Halbrügge, 2010). Together with the AIT formalism, this could lead to a new evaluation criterion that would be complementary to fit measures like R² and RMSE and could also provide a step towards reuniting the fields of cognitive modeling and artificial (general) intelligence.

² The source code of the cognitive model is available for download at http://dx.doi.org/10.14279/depositonce-5163
³ Especially the delay conditions often lead to oscillating stock levels, which renders averaging across participants questionable.

References

Dutt, V., & Gonzalez, C. (2007). Slope of inflow impacts dynamic decision making. In Proceedings of the 25th international conference of the system dynamics society (p.
79). Boston, MA.

Gluck, K. A., Stanley, C. T., Moore, L. R., Reitter, D., & Halbrügge, M. (2010). Exploration for understanding in cognitive modeling. Journal of Artificial General Intelligence, 2(2), 88–107. doi: 10.2478/v10229-011-0011-7

Halbrügge, M. (2007). Evaluating cognitive models and architectures. In G. A. Kaminka & C. R. Burghart (Eds.), Evaluating architectures for intelligence: Papers from the 2007 AAAI workshop (pp. 27–31). Menlo Park, CA: AAAI Press.

Halbrügge, M. (2010). Keep it simple – a case study of model development in the context of the Dynamic Stocks and Flows (DSF) task. Journal of Artificial General Intelligence, 2(2), 38–51. doi: 10.2478/v10229-011-0008-2

Hernández-Orallo, J., & Dowe, D. L. (2010). Measuring universal intelligence: Towards an anytime intelligence test. Artificial Intelligence, 174(18), 1508–1539. doi: 10.1016/j.artint.2010.09.006

Insa-Cabrera, J., Dowe, D. L., España-Cubillo, S., Hernández-Lloreda, M. V., & Hernández-Orallo, J. (2011). Comparing humans and AI agents. In Artificial general intelligence (pp. 122–132). Springer. doi: 10.1007/978-3-642-22887-2_13

Lebiere, C., Gonzalez, C., & Warwick, W. (2009). A comparative approach to understanding general intelligence: Predicting cognitive performance in an open-ended dynamic task. In Proceedings of the second conference on artificial general intelligence (pp. 103–107). Amsterdam: Atlantis Press.

Legg, S., & Hutter, M. (2007). Universal intelligence: A definition of machine intelligence. Minds and Machines, 17(4), 391–444. doi: 10.1007/s11023-007-9079-x

Salvucci, D. D. (2010). Cognitive supermodels. (Invited talk, European ACT-R Workshop, Groningen, The Netherlands)

Singh, S., Lewis, R. L., & Barto, A. G. (2009). Where do rewards come from? In Proceedings of the annual conference of the cognitive science society (pp. 2601–2606).

Veksler, V. D., Myers, C. W., & Gluck, K. A. (2015). Model flexibility analysis. Psychological Review, 122, 755–769. doi: 10.1037/a0039657