The Measurement of Task Complexity and Cognitive Ability:
Relational Complexity in Adult Reasoning
Damian Patrick Birney
B.App.Sc (hons)
School of Psychology
University of Queensland
St. Lucia, Queensland
AUSTRALIA
A thesis submitted in fulfilment of the requirements for the degree of
Doctor of Philosophy
7 March, 2002
STATEMENT OF ORIGINALITY
The work contained in this thesis has not been previously submitted for a degree at this
or any other higher education institution. To the best of my knowledge and belief, the
thesis contains no material previously published or written by another person except
where due reference is made.
Damian Patrick Birney
Signed: _______________________________
7 March, 2002
Date: ________________________
ACKNOWLEDGEMENTS
There are a number of people that need special acknowledgement. First I would like to
thank my supervisor Graeme Halford for the guidance and generous support that he has
provided over the years. Graeme, I would particularly like to thank you for allowing me
the freedom to pursue my research and develop my ideas in such a strong intellectual
environment. Thank you also to Gerry Fogarty for getting me started and having faith in
me during the early years. You have been instrumental in teaching me an appreciation
for psychological measurement and encouraging me to explore good science. I would
especially like to thank Julie McCredden who has listened patiently to my ramblings
during the last 12 months. Julie, I would of course like to thank you for helping me to
enjoy the subtleties of cognition, but more importantly, I would like to thank you for
being such a good friend. I would also like to acknowledge the support from all the
people in “the lab” and especially Glenda Andrews and Geoff Goodwin. Thank you also
to Glen Smith and Julie Duck for reading my early work and to Philippe Lacherez for
the many hours of discussions over coffee.
Most importantly I would like to thank my family to whom I dedicate this thesis.
Debbie, thank you so much for allowing me to fulfil my dreams. This simply would not
have been possible without your love and support. Thank you Corrine for providing me
with an endless stream of drawings. Thank you Caleb for encouraging me to take
frequent breaks for morning tea. Corrine and Caleb, you gave me the inspiration to keep
going when I thought I could go no further. Finally, I would like to thank my parents,
Denise and David. Thanks Mum for your hope and showing me what is possible.
Thanks Dad for believing in me and showing me what is decent. Thanks to my sisters
Ange, Rache, and Gen, and to my brother Anthony, for persevering with me.
Damian Patrick Birney
March, 2002
TABLE OF CONTENTS
1 The Measurement of Task Complexity and Capacity...........................................1
1.1 Measurement Issues .............................................................................................1
1.2 Assessment Issues.................................................................................................3
1.3 Assessing Capacity and Complexity.....................................................................5
1.4 Overview of the Thesis .........................................................................................7
2 Cognitive Complexity and Relational Complexity Theory ...................................9
2.1 Resource Theory...................................................................................................9
2.1.1 Resources: A Cautionary Note ....................................................................11
2.2 Relational Complexity Theory............................................................................12
2.2.1 Specification of Relational Complexity ......................................................12
2.2.2 Chunking and Segmentation........................................................................14
2.2.3 Relational Complexity Theorems................................................................15
2.2.4 Representation of Relations: A Comment on Notation...............................16
2.2.5 Evidence for Relational Complexity ...........................................................16
2.2.6 Unresolved Issues........................................................................................22
2.3 Cognitive Complexity: A Psychometric Approach.............................................26
2.3.1 Gf-Gc Theory ..............................................................................................27
2.3.2 Psychometric Complexity ...........................................................................28
2.3.3 Fluid Intelligence and Complexity: The Evidence......................................30
2.3.4 Some Final Methodological Issues..............................................................34
2.4 The Experimental Approach ..............................................................................36
2.4.1 Predictions ...................................................................................................36
2.5 Summary of the Key Predictions ........................................................................40
3 Relational Complexity Analysis of the Knight-Knave Task ...............................42
3.1 Processing in the Knight-Knave Task ................................................................42
3.1.1 Deduction Rules ..........................................................................................43
3.1.2 Mental Models.............................................................................................44
3.2 Relational Complexity Analysis .........................................................................46
3.2.1 Knowledge Required ...................................................................................47
3.3 Method................................................................................................................53
3.3.1 Problems ......................................................................................................53
3.3.2 Practice ........................................................................................................54
3.3.3 Test Problems ..............................................................................................55
3.3.4 Participants ..................................................................................................56
3.4 Procedure ...........................................................................................57
3.5 Results & Discussion..........................................................................................57
3.5.1 Practice ........................................................................................................57
3.5.2 Test Problems ..............................................................................................58
3.5.3 Speed-Accuracy Trade-Off .........................................................................60
3.5.4 Alternative Accounts ...................................................................................62
3.6 General Discussion ............................................................................................64
3.6.1 Task Presentation Format ............................................................................65
3.6.2 Processing Capacity and a Speed-Accuracy Trade-Off ..............................65
3.6.3 Serial Processing: An Alternative Account.................................................66
3.7 Conclusion..........................................................................................................67
4 Development of the Latin Square Task ................................................................69
4.1 Definition of a Latin Square...............................................................................70
4.1.1 Enumeration of Latin Squares .....................................................................71
4.2 Cognitive Load and the Latin Square ................................................................72
4.2.1 The Defining Principle of the Latin Square ................................................72
4.2.2 Binary Processing in LS-4 Problems...........................................................74
4.2.3 Ternary Processing in LS-4 Problems.........................................................76
4.2.4 Quaternary Processing in LS-4 Problems....................................................77
4.2.5 An Empirical Test of the Analysis ..............................................................81
4.3 Experiment 4.1: University Students..................................................................81
4.3.1 Participants ..................................................................................................81
4.3.2 Item Generation ...........................................................................................81
4.3.3 Procedure .....................................................................................................82
4.4 Results and Discussion.......................................................................................83
4.4.1 Item Analyses: A Rasch Approach..............................................................84
4.4.2 Item Difficulty and Relational Complexity.................................................89
4.5 Decomposing Item Difficulty: Relational Complexity and Processing Steps ....92
4.5.1 Additional Regression Analyses..................................................................95
4.5.2 Summary......................................................................................................97
4.6 Item Response Time and Relational Complexity................................................97
4.6.1 Mean Item Response Time ..........................................................................98
4.6.2 Standard Deviation in Item Response Time ..............................................100
4.7 Derivation of Relational Complexity Subscale Scores ....................................101
4.7.1 Graphical Representation of Relational Complexity.................................102
4.7.2 Summary....................................................................................................103
4.8 Experiment 4.2: School Students......................................................................104
4.8.1 Participants ................................................................................................104
4.8.2 General Procedure .....................................................................................104
4.9 Results & Discussion........................................................................................105
4.9.1 Rasch Analysis ..........................................................................................105
4.9.2 Item Based Regression Analyses...............................................................107
4.9.3 Comparison of School and University Samples ........................................110
4.9.4 Summary....................................................................................................113
4.10 General Discussion.......................................................................113
4.10.1 Alternative Accounts..............................................................114
4.11 Conclusion ....................................................................................116
4.12 Modifications to the LST item database .......................................117
5 Processing Capacity and Dual-Task Performance ............................................118
5.1.1 Dual-Task Deficit ......................................................................................118
5.1.2 The Implications of Individual Difference in the Dual-Task Paradigm....119
5.1.3 Cognitive Psychology and Individual Differences....................................121
5.2 Resource Theory...............................................................................................123
5.3 Dual-Task Assumptions....................................................................................125
5.3.1 Practice Effects ..........................................................................................126
5.3.2 Priority of Primary Task ............................................................................126
5.3.3 Task Interference .......................................................................................127
5.3.4 Summary....................................................................................................129
5.4 Easy-to-Hard Paradigm...................................................................................129
5.4.1 Assumptions ..............................................................................................131
5.4.2 Applications of the Easy-to-Hard Paradigm..............................................133
5.5 Overview ..........................................................................................................134
5.5.1 Secondary Tasks ........................................................................................135
5.6 Method..............................................................................................................137
5.6.1 Participants ................................................................................................137
5.6.2 Primary Task .............................................................................................138
5.6.3 Finger Tapping Task – Single Condition ..................................................138
5.6.4 Finger Tapping Task – Dual Condition.....................................................139
5.6.5 Probe RT – Single Condition ....................................................................139
5.6.6 Probe RT – Dual Condition.......................................................................140
5.6.7 General Procedure .....................................................................................140
5.7 Finger Tapping: Results & Discussion ............................................................140
5.7.1 Secondary Task Performance: Variation in Tapping (SD-score)..............142
5.7.2 Secondary Task Performance: Median Elapsed Time Between Taps.......146
5.7.3 Primary Task Performance ........................................................................147
5.7.4 Influence of Practice Effects on LST Response Times .............................150
5.7.5 Summary of traditional analyses ...............................................................152
5.7.6 Easy-to-Hard Predictions: Individual Differences ....................................152
5.7.7 Alternative Easy and Hard Conditions ......................................................156
5.8 Probe RT ..........................................................................................................156
5.8.1 Secondary Task Performance: Median Response Time ............................157
5.8.2 Influence of Relational Complexity on Median Response Time ..............158
5.8.3 Primary Task Performance ........................................................................159
5.8.4 Practice ......................................................................................................163
5.8.5 Summary of Traditional Dual-Task Analyses ...........................................163
5.8.6 Easy-to-Hard Predictions: Individual Differences ....................................164
5.8.7 Alternative Easy and Hard Conditions ......................................................166
5.8.8 Alternative Measures.................................................................................167
5.9 General Discussion ..........................................................................................167
5.9.1 Secondary Task Insensitivity.....................................................................171
5.9.2 Interference and Secondary Task Performance .........................................173
5.9.3 Interference and Primary Task Performance .............................................175
5.9.4 Is relational Processing Resource Dependent?..........................................176
5.10 Conclusion ....................................................................................178
6 Relational Complexity and Broad Cognitive Abilities ......................................180
6.1 Design of the Study & Overview ......................................................................181
6.2 Method..............................................................................................................182
6.2.1 Participants ................................................................................................182
6.2.2 Materials ....................................................................................................182
6.2.3 General Procedure .....................................................................................194
6.3 Overview of Analyses for Chapter 6 ................................................................195
6.4 Markers of Fluid Intelligence ..........................................................................196
6.4.1 Raven’s Progressive Matrices ...................................................................196
6.4.2 Triplet Numbers Test.................................................................................197
6.4.3 Swaps Test.................................................................................................201
6.5 Markers of Crystallized Intelligence................................................................206
6.5.1 Vocabulary (Synonyms) ............................................................................206
6.5.2 Similarities.................................................................................................208
6.5.3 Arithmetic Reasoning ................................................................................210
6.6 Markers of Short-term Apprehension and Retrieval (SAR) .............................211
6.6.1 Digit Span Forward ...................................................................................211
6.6.2 Digit Span Backward.................................................................................215
6.6.3 Paired Associate Recall .............................................................................217
6.7 Relational Complexity Tests.............................................................................219
6.7.1 Sentence Comprehension Task..................................................................219
6.7.2 Knight-Knave Task ...................................................................................226
6.7.3 Latin Square Task......................................................................................241
6.7.4 Summary....................................................................................................255
6.8 Summary of the measurement properties of the tasks......................................255
6.8.1 Psychometric Tasks ...................................................................................255
6.8.2 Relational Complexity Tasks ....................................................................256
6.8.3 Chapter 7 ...................................................................................................257
7
Relational Complexity and Psychometric Complexity......................................258
7.1 Class Equivalence of Relational Complexity ...................................................258
7.1.1 Summary of Class Equivalence of Relational Complexity .......................262
7.2 The Complexity-Gf Relationship ......................................................................263
7.2.1 Model of the Predictions ...........................................................................263
7.2.2 Treatment of Missing Data ........................................................................267
7.2.3 Generating Broad Cognitive Abilities Factors ..........................................268
7.2.4 Analyses Objectives ..................................................................................272
7.3 Sentence Comprehension Task.........................................................................272
7.3.1 Accuracy....................................................................................................272
7.3.2 Decision Time ...........................................................................................276
7.3.3 Summary....................................................................................................279
7.4 Knight-Knave Task...........................................................................................280
7.4.1 Accuracy....................................................................................................281
7.4.2 Response Time ..........................................................................................284
7.4.3 Complexity-Gf at an Item Level................................................................286
7.4.4 Summary....................................................................................................290
7.5 Latin Square Task ............................................................................................290
7.5.1 Accuracy....................................................................................................291
7.5.2 Response Time ..........................................................................................294
7.5.3 Correlation Between Gf and Accuracy as a Function of Item Difficulty..298
7.5.4 Speed-Accuracy Trade-off ........................................................................300
7.5.5 Relational Complexity and Response Times.............................................303
7.6 Summary of the Complexity-Gf Effect in Three Relational Complexity Tasks 307
7.7 Cognitive Complexity and the Triplet Numbers Test .......................................309
7.7.1 Relational Complexity Analysis of the Triplet Numbers Test ..................311
7.7.2 Why is the Triplet Numbers Test Gf Loaded? ..........................................314
8 Discussion ..............................................................................................................316
8.1 Measurement Problems....................................................................................317
8.1.1 Quantitative Structure in the Latin Square Task .......................................317
8.2 Application of the relational complexity theory...............................................319
8.2.1 Valid Process Theories ..............................................................................319
8.3 Assessment of Relational Complexity...............................................................322
8.3.1 Relational Complexity: Resources, Relational Reasoning, and/or Gf ......322
8.3.2 Resources and Fluid Intelligence...............................................................323
8.3.3 Relational Reasoning and Broad Cognitive Abilities................................324
8.4 Conclusion........................................................................................................325
9 References..............................................................................................................327
APPENDIX A...............................................................................................................339
A.1 Relational Complexity Analysis of Knight-Knave Items......................................339
A.2 Process Analysis Based On Exhaustive Strategy.................................................344
APPENDIX B...............................................................................................................349
B.1 Relational Complexity Analysis of Latin Square Items (18-item Test)................349
B.2 Item×Trait Group Explanation and Example ......................................................353
B.3 Regression Analysis of Response Times ..............................................................354
B.4 Analysis of Composite Accuracy and Response Time Scores..............................356
B.5 Revised Latin Square Task Item Pool..................................................................357
B.6 Actual Latin Square Task items used in Chapters 5, 6, and 7 .............................364
APPENDIX C...............................................................................................................369
C.1 Analysis of correct response time to the Latin Square tasks in the finger tapping
experiment ..................................................................................................................369
APPENDIX D...............................................................................................................370
D.1 Descriptive Statistics for Composite Progressive Matrices Test ........................370
D.2 Descriptive Statistics for the Arithmetic Reasoning Test....................................371
D.3 Descriptive Statistics for the Swaps Test ............................................................372
D.4 Descriptive Statistics for the Vocabulary Test ....................................................374
D.5 Descriptive Statistics for the Similarities Test ....................................................375
D.6 Responses to Knight-Knave Test Items ...............................................................376
APPENDIX E...............................................................................................................377
E.1 Triplet Numbers Test – Level 4 Example Items ...................................................377
LIST OF FIGURES
Figure 1.1. Representation of the assessment of components of the relational
complexity theory. ....................................................................................................4
Figure 2.1. Representation of relations based on Halford et al. (1998a). .....................14
Figure 2.2. The complexity-Gf effect: The hypothetical relationship between
performance and Gf as function of cognitive complexity ......................................30
Figure 2.3. Multitrait-multimethod correlation matrix design to assess class equivalence
of relational complexity ..........................................................................................40
Figure 3.1. Solution of knight-knave problems using the exhaustive strategy reported by
Rips (1989). ............................................................................................................44
Figure 3.2. Examples of practice phase items from A) the introduction, B) Section 1,
and C) Section 2, of Knight-Knave task .................................................................55
Figure 4.1. The 12 possible orderings defining a 3×3 Latin square (Square A is the
standard square) ......................................................................................................70
Figure 4.2. Composition of the 3×3 Græco-Latin square. ............................................71
Figure 4.3. A complete (A) and incomplete (B) “standard” 4×4 Latin square .............73
Figure 4.4. Completed and example binary LST problem............................................75
Figure 4.5. Completed and example ternary LST problem...........................................76
Figure 4.6. Completed and example quaternary LST problem (1) ...............................77
Figure 4.7. Completed and example quaternary LST problem (2) ...............................79
Figure 4.8. Practice items used in the Latin Square Task .............................................83
Figure 4.9. Person-outfit and -infit values sorted by estimated person ability in the 18-item Latin Square Task ...........................................................................................88
Figure 4.10. Comparison of item difficulty estimates based on traditional and Rasch
calibration ...............................................................................................................90
Figure 4.11. Rasch based subtest characteristic curves for binary, ternary and
quaternary items (inset = item locations and standard errors)..............................103
Figure 4.12. Distribution of person infit and outfit values for the school sample
response to the Latin Square Task ........................................................................107
Figure 4.13. Item difficulty (proportion correct) for university and school samples. 111
Figure 4.14. Item response time for university and school sample (as a function of the
calibrated item difficulty for the school sample). .................................................112
Figure 4.15. Example ternary LST item .....................................................................115
Figure 5.1. Performance resource functions (PRF) for three tasks.............................123
Figure 5.2. Performance Operating Characteristic curve (Norman & Bobrow, 1975)125
Figure 5.3. Task categorisation as a function of stage-defined and code defined resource
demands as proposed by Wickens (1991).............................................................128
Figure 5.4. Representation of Partial Correlation of interest in the Easy-to-Hard
paradigm ...............................................................................................................130
Figure 5.5. Phases in the finger-tapping task (dual-task condition). .......................141
Figure 5.6. Variation in tapping rate at each trial phase by task condition and
complexity ............................................................................................................144
Figure 5.7. Mean median elapsed time between finger taps as a function of phase and
task condition........................................................................................................147
Figure 5.8. Mean proportion correct on the LST as a function of complexity and task
condition ...............................................................................................................148
Figure 5.9. Mean overall and correct response time as a function of relational
complexity and task condition ..............................................................................150
Figure 5.10. Correct and overall response time as a function of single and dual task
condition and presentation order (early and later items) ......................................151
Figure 5.11. Median response time as a function of phase and task condition ........158
Figure 5.12. Mean proportion correct on LST as a function of relational complexity and
task condition by relational complexity................................................................160
Figure 5.13. Mean overall and correct response time on the LST as a function of
relational complexity and task condition ..............................................................161
Figure 5.14. Mean response time to the LST as a function item presentation order and
task condition........................................................................................................163
Figure 6.1. Display layout for Swaps task ..................................................................186
Figure 6.2. Infit and outfit statistics as a function of calibrated ability in composite
progressive matrices tests .....................................................................................197
Figure 6.3. Person infit and outfit statistics for ability calibrated on all Swaps test items
..............................................................................................................................202
Figure 6.4. Person infit and outfit statistics for ability calibrated on level 3 and level 4
items of the Swaps test..........................................................................................203
Figure 6.5. Distribution of fit statistics as a function of calibrated ability for the 35 item
Vocabulary test .....................................................................................................207
Figure 6.6. Distribution of fit statistics as a function of calibrated ability for the 33 item
Vocabulary test (items 10 and 15 removed) .........................................................208
Figure 6.7. Distribution of fit statistics as a function of calibrated ability for the paper
and pencil Similarities test....................................................................................209
Figure 6.8. Distribution of fit statistics as a function of calibrated ability for the paper
and pencil Arithmetic Reasoning Test..................................................................210
Figure 6.9. Distribution of outfit statistics for the digit span – forward task (arrow
indicates extreme values beyond the plotted range) .............................................213
Figure 6.10. Distribution of person infit values as a function of estimated ability in the
digit span – forward task.......................................................................................214
Figure 6.11. Distribution of person outfit statistics as a function of calibrated ability in
the digit-span backwards test................................................................................216
Figure 6.12. Distribution of person infit statistics as a function of calibrated ability in
the digit-span backwards test................................................................................217
Figure 6.13. Distribution of person fit statistics as a function of calibrated ability on the
Paired-Associate Recall test (Arrows indicate points beyond the plotted range).219
Figure 6.14. Mean proportion correct (Accuracy) as a function of relational complexity,
sentence type, and probe question-type ................................................................223
Figure 6.15. Mean decision time as a function of relational complexity, sentence type,
and probe question-type........................................................................................224
Figure 6.16. Distribution of person infit and outfit statistics as a function of estimated
ability across the 14 items of the knight-knave task.............................................228
Figure 6.17. Distribution of person fit statistics as a function of estimated person ability
..............................................................................................................................244
Figure 6.18. Distribution of Latin Square items as a function of Rasch calibrated item
difficulty (with standard errors indicated) ............................................................245
Figure 6.19. Mean calibrated item difficulty as a function of relational complexity and
number of processing steps (error bars = 1 SD) ...................................................248
Figure 6.20. Mean item response time regardless of accuracy (RT) and for correct
response only (CRT) as a function of relational complexity and number of
processing steps (error bars = 1 SD).....................................................................249
Figure 6.21. Mean proportion correct as a function of relational complexity and number
of processing steps ................................................................................................253
Figure 6.22. Composite response time (RT) and correct response time (CRT) as a
function of relational complexity and number of processing steps ......................254
Figure 7.1. Model A: Structural model of predictions made by the complexity-Gf
relationship and relational complexity..................................................................264
Figure 7.2. Model B: Revised model of predictions with a single latent factor for
relational complexity ............................................................................................266
Figure 7.3. Sentence comprehension accuracy as a function of relational complexity and
Gf ..........................................................................................................................274
Figure 7.4. Sentence comprehension accuracy as a function of relational complexity and
Gc..........................................................................................................................275
Figure 7.5. Sentence comprehension accuracy as a function of relational complexity and
SAR.......................................................................................................................276
Figure 7.6. Decision time on the Sentence Comprehension Task as a function of
relational complexity and Gf ................................................................................277
Figure 7.7. Decision time on the Sentence Comprehension Task as a function of
relational complexity and Gc................................................................................278
Figure 7.8. Decision time on the Sentence Comprehension Task as a function of
relational complexity and SAR.............................................................................279
Figure 7.9. Accuracy on knight-knave composites as a function of relational complexity
and Gf (4D^ = indeterminate quaternary composite) ...........................................281
Figure 7.10. Accuracy on knight-knave composites as a function of relational
complexity and Gc ................................................................................................282
Figure 7.11. Accuracy on knight-knave composites as a function of relational
complexity and SAR .............................................................................................283
Figure 7.12. Response time on knight-knave composites as a function of relational
complexity and Gf.................................................................................................284
Figure 7.13. Response time on knight-knave composites as a function of relational
complexity and Gc ................................................................................................285
Figure 7.14. Response time on knight-knave composites as a function of relational
complexity and SAR .............................................................................................286
Figure 7.15. Knight-knave item correlations between accuracy and broad cognitive
abilities (Gf, Gc, SAR) as a function of Rasch calibrated item difficulty ............287
Figure 7.16. Knight-knave item correlations between response time and broad cognitive
abilities (Gf, Gc, SAR) as a function of Rasch calibrated item difficulty ............289
Figure 7.17. Accuracy on Latin-square test composites as a function of relational
complexity and Gf.................................................................................................292
Figure 7.18. Accuracy on Latin-square test composites as a function of relational
complexity and Gc ................................................................................................293
Figure 7.19. Accuracy on Latin-square test composites as a function of relational
complexity and SAR .............................................................................................294
Figure 7.20. Response time on Latin-square test composites as a function of relational
complexity and Gf.................................................................................................295
Figure 7.21. Response time on Latin-square test composites as a function of relational
complexity and Gc ................................................................................................296
Figure 7.22. Response time on Latin-square test composites as a function of relational
complexity and Gc ................................................................................................297
Figure 7.23. Item correlation between accuracy and Gf as a function of calibrated item
difficulty................................................................................................................299
Figure 7.24. Item correlation between response time and Gf as a function of calibrated
item difficulty .......................................................................................................300
Figure 7.25. Correlation between accuracy and response time as a function of calibrated
item difficulty on the Latin Square task................................................................301
Figure 7.26. Speed-accuracy trade-off on LST items as a function of calibrated item
difficulty and Gf....................................................................................................301
Figure 7.27. Hypothetical components of the instantiation of a relation....................304
Figure 7.28. Number of correct responses per minute in the triplet numbers test as a
function of complexity level and Gf .....................................................................310
Figure 7.29. Decision tree of binary comparisons in level 4 of the Triplet Numbers Test
..............................................................................................................................313
ABSTRACT
The theory of relational complexity (RC) developed by Halford and his associates
(Halford et al., 1998a) proposes that, in addition to the number of unique entities that
can be processed in parallel, it is the structure (complexity) of the relations between
these entities that most appropriately captures the essence of processing capacity
limitations. Halford et al. propose that the relational complexity metric forms an ordinal
scale along which both task complexity and an individual’s processing capacity can be
ranked. However, the underlying quantitative structure of the RC metric is largely
unknown. It is argued that an assessment of the measurement properties of the RC
metric is necessary to first demonstrate that the scale is able to rank order task
complexity and cognitive capacity in adults. If in addition to ordinal ranking, it can be
demonstrated that a continuous monotonic scale underlies the ranking of capacity (the
natural extension of the complexity classification), then the potential to improve our
understanding of adult cognition is further realised. Using a combination of cognitive
psychology and individual differences methodologies, this thesis explores the
psychometric properties of RC in three high level reasoning tasks. The Knight-Knave
Task and the Sentence Comprehension Task come from the psychological literature.
The third task, the Latin Square Task, was developed especially for this project to test
the RC theory.
An extensive RC analysis of the Knight-Knave Task is conducted using the Method for
Analysis of Relational Complexity (MARC). Processing in the Knight-Knave Task has
been previously explored using deduction-rules and mental models. We have taken this
work as the basis for applying MARC and attempted to model the substantial demands
these problems make on limited working memory resources in terms of their relational
structure. The RC of the Sentence Comprehension Task has been reported in the
literature and we further review and extend the empirical evidence for this task. The
primary criterion imposed for developing the Latin Square Task was to minimize
confounds that might weaken the identification and interpretation of a RC effect.
Factors such as storage load and prior experience were minimized by specifying that the
task should be novel, have a small number of general rules that could be mastered
quickly by people of differing ages and abilities, and have no rules that are complexity
level specific.
The strength of MARC lies in using RC to explicitly link the cognitive demand of a task
with the capacity of the individual. The cognitive psychology approach predicts
performance decrements with increased task complexity and primarily deals with
aggregated data across task condition (comparison of means). It is argued however that
to minimise the subtle circularity created by validating a task’s complexity using the
same information that is used to validate the individual’s processing capacity, an
integration of the individual differences approach is necessary. The first major empirical
study of the project evaluates the utility of the traditional dual-task approach to analyse
the influence of the RC manipulation on the dual-task deficit. The Easy-to-Hard
paradigm, a modification of the dual-task methodology, is used to explore the influence
of individual differences in processing capacity as a function of RC. The second major
empirical study explores the psychometric approach to cognitive complexity. The basic
premise is that if RC is a manipulation of cognitive complexity in the traditional
psychometric sense, then it should display similar psychometric properties. That is,
increasing RC should result in an increasing monotonic relationship between task
performance and Fluid Intelligence (Gf) – the complexity-Gf effect. Results from the
comparison of means approach indicate that, as expected, mean accuracy and response
times differed reliably as a function of RC. An interaction between RC and Gf on task
performance was also observed. The pattern of correlations was generally not consistent
across RC tasks and was qualitatively different in important ways from the complexity-Gf
effect. It is concluded that the Latin Square Task has sufficient measurement properties
to allow us to discuss (i) how RC differs from complexity in tasks in which expected
patterns of correlations are observed, (ii) what additional information needs to be
considered to assist with the a priori identification of task characteristics that impose
high cognitive demand, and (iii) the implications for understanding reasoning in
dynamic and unconstrained environments outside the laboratory. We conclude that
relational complexity theory provides a strong foundation from which to explore the
influence of individual differences in performance further.
CHAPTER 1
THE MEASUREMENT OF TASK COMPLEXITY AND CAPACITY
The theory of relational complexity developed by Halford and his associates (Halford,
1993; Halford et al., 1998a) proposes that, in addition to the number of unique entities
that can be processed in parallel, it is the structure (complexity) of the relations between
these entities that most appropriately captures the essence of processing capacity
limitations. Halford and his associates argue that relational complexity is capable of
accounting for age related differences in cognitive abilities in a way that subsumes key
aspects of other theories of cognitive development. At this stage little empirical data is
available to assess the importance of relational complexity in adult cognition. The
current work considers the evidence that does exist and evaluates it in conjunction with
relevant developmental data. It is concluded that while convergent support for the
theory is strong, there is no single piece of evidence that is not potentially weakened by
theoretical and practical measurement problems. That is, unambiguous support for the
relational complexity theory is limited. In attempting to map the function of relational
complexity in adult cognition, this thesis explores two sets of issues. The first are
measurement issues related to the properties of the complexity metric and the
performance measures that are used to assess the metric’s appropriateness. The second
are assessment issues that are concerned with the validity and appropriateness of the
theory in adult cognition.
1.1 Measurement Issues
“Science requires investigating one’s methods as well as using them”
Michell (1999, p. 2)
Halford et al. (1998a) propose that the relational complexity metric forms an ordinal
scale along which both task complexity and an individual’s processing capacity can be ranked.
However, the underlying quantitative structure of the relational complexity metric and
the performance measures used are largely unknown. We argue that an assessment of
the measurement properties of the relational complexity metric is necessary to first
demonstrate that the scale is able to rank order task complexity and cognitive
processing capacity in adults. If in addition to ordinal ranking, it can be demonstrated
that a continuous monotonic scale underlies the ranking of capacity (the natural
extension of the complexity classification), then the potential to improve our
understanding of adult cognition is further realised. It is important to note from the
outset that this general objective is addressed and revisited frequently. The original
motivation for this work was to deepen our understanding of the measurement
properties of the relational complexity metric and processing capacity.
The basis and rationale for addressing these specific measurement issues comes from
the work of Joel Michell (e.g., Michell, 2000, 1997, 1990). Michell outlines many of the
limitations of psychological measurement in general and argues that it is often assumed
without evaluation that the measures used to develop and validate theories have
quantitative structure. The implication is that when the measurement properties of the
scores are unknown, the appropriateness of the conclusions is also unknown.
Unfortunately, the nature of psychological research means that it is very unlikely that a
situation would exist where quantitative measurement can be directly assessed. In its
stead Michell advocates the work of Luce and Tukey (1964) on additive conjoint
measurement as a means of indirectly assessing quantity. Conjoint measurement is
concerned with the way the ordering of a dependent variable varies with the joint effect
of two or more independent variables. For example, assume that R represents the
ordering of differences in relational complexity, and that S represents, for instance, the
ordering of differences in serial processing (equally applicable would be some ordering
of Gf). The dependent variable, P, represents performance on an appropriate measure
along which the effect of R and S is assessed. Therefore, the ordering of R and S is
necessarily dependent upon the order of P. That is, their orders are relative to their
effect on P – the variables R and S are quantified relative to their effects on P (Michell,
1990; Perline, Wright, & Wainer, 1979).
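
To make the logic of conjoint measurement concrete, the following sketch (not an analysis reported in this thesis) checks two of its cancellation conditions on a small matrix of cell means. The matrix values, the choice of three levels for R and S, and the helper functions are all hypothetical and serve only to illustrate the kind of ordering constraints involved.

import numpy as np

# Hypothetical mean performance P[i, j] at level i of R (rows) and level j
# of S (columns). All values are invented purely for illustration.
P = np.array([[0.95, 0.90, 0.82],
              [0.88, 0.80, 0.70],
              [0.75, 0.66, 0.55]])

def single_cancellation(P):
    # Independence: the ordering of the rows must be the same within every
    # column, and the ordering of the columns the same within every row.
    row_orders = {tuple(np.argsort(P[:, j])) for j in range(P.shape[1])}
    col_orders = {tuple(np.argsort(P[i, :])) for i in range(P.shape[0])}
    return len(row_orders) == 1 and len(col_orders) == 1

def double_cancellation(P):
    # For a 3x3 array: if P[1,0] >= P[0,1] and P[2,1] >= P[1,2],
    # then P[2,0] >= P[0,2] must also hold.
    if P[1, 0] >= P[0, 1] and P[2, 1] >= P[1, 2]:
        return P[2, 0] >= P[0, 2]
    return True  # antecedent not satisfied, so the condition is not violated

print(single_cancellation(P), double_cancellation(P))  # True True

Because these conditions are stated purely in terms of order, R and S are quantified only relative to their effects on P, which is the point made above.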
Michell (1990) outlines the sufficient experimental conditions that satisfy conjoint
measurement and therefore the presence of a monotonic quantitative scale. An
important application of this approach to measurement is reflected in the work of
Stankov and Cregan (1993). They used conjoint measurement to assess the quantitative
characteristics of fluid intelligence, motivation, and working memory demand (i.e.,
complexity). From their results, Stankov and Cregan concluded that intelligence has
quantitative structure. Generally, attempts to apply conjoint measurement to theory
testing in psychology have been sparse. One of the possible reasons for the apparent
unwillingness of researchers to take on this measurement issue is the deterministic
nature of conjoint measurement – the conditions are sufficient but not necessary. That
is, if the conditions of conjoint measurement are met, then quantitative structure is
supported; otherwise no valid conclusion can be made about the nature of
the scale. Perline, Wright and Wainer (1979) outline an application of Rasch analysis
that overcomes to some extent the deterministic nature of conjoint measurement. It
provides a stochastic assessment of quantity that can be tested for goodness of fit and so
a probabilistic estimate of the likelihood that conjoint measurement exists can be
obtained even when not all conditions are met. The Rasch approach maps both
individuals’ abilities and item difficulties onto the same underlying metric, and a
satisfactory fit of the data to this model is taken to demonstrate additivity of
measurement (Brogden, 1977). Wright (1999) argues that this effectively implies that an
interval scale of measurement has been achieved. The application of the Rasch approach
is considered in more detail in the empirical chapters.
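
For reference, the dichotomous Rasch model expresses the probability of a correct response as a logistic function of the difference between person ability (theta) and item difficulty (b), both located on the same logit scale. The short sketch below uses arbitrary illustrative values (not thesis data) to show the additivity this implies: a fixed ability-difficulty gap yields the same success probability wherever it occurs on the scale.

import math

def rasch_p_correct(theta, b):
    # Dichotomous Rasch model: probability of a correct response for a
    # person of ability theta on an item of difficulty b (both in logits).
    return 1.0 / (1.0 + math.exp(-(theta - b)))

# The same 1-logit gap gives the same probability at different points on
# the scale (illustrative values only).
print(round(rasch_p_correct(theta=1.0, b=0.0), 2))    # 0.73
print(round(rasch_p_correct(theta=-0.5, b=-1.5), 2))  # 0.73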
So in addition to assessing the function of relational complexity in adult cognition, we
explore the scale properties of both the complexity metric and the performance
measures used to assess processing capacity. This has direct implications for the
assessment of the validity and appropriateness of the relational complexity theory in
adult cognition. The remainder of this chapter reviews how the theory applied to adult
cognition will be tested and what particular assessment issues need to be considered.
1.2 Assessment Issues
The initial assessment concern arises because the relational complexity theory defines
a metric on which both task demand and individual
processing capacity may be represented. The reason it is a potential source of concern
has more to do with the logic by which it is tested and not with the appropriateness of
having one metric represent two distinct functions. Common scaling is an objective of
much psychological research. In fact, a whole collection of statistical methods known as
Item Response Theory (of which the Rasch model is a member) have the key objective
of defining item characteristics on the same scale as the individual’s ability (Embretson
& Reise, 2000; Wright, 1999). The issue here is that the assessment of complexity is
partially dependent on the assessment of capacity. Task performance is used to
demonstrate the complexity of the task and also serves as the criterion for assessing the
processing level (capacity) of the individual. Take a more specific example. Using the
relational complexity metric to be described in more detail in the following chapter, a
ternary task is validated as such because of observed differences in performance when
compared with easier binary items and more difficult quaternary items. That is, support
for the complexity manipulation is obtained if an ordinal structure in mean performance
exists from binary through to quaternary such that performance significantly
deteriorates as complexity increases. To extend the example further, consider the
individual. She is classified as processing at a ternary level (say) if some accepted level
of performance on ternary tasks is reached. The apparent confounding is elusive, but it
would seem logically unsound to demonstrate the utility of the theory to define
complexity with the same evidence used to demonstrate the utility of the theory to
predict capacity.
Figure 1.1. Representation of the assessment of components of the relational complexity theory (diagram elements: task complexity, individual capacity, performance).
Figure 1.1 helps explain this situation further. Here individual performance is
represented as a function of both the demands of the task and the processing capacity of
the individual (single-headed arrows). However, performance is also the measure used
to test the theory, and it is used to independently assess both the complexity
manipulation and processing capacity (bold arrows). The logic of the argument needs to
be qualified. We cannot assess task complexity simply using performance because it is
confounded with capacity to an unknown extent. Similarly, the assessment of capacity
using performance is confounded with complexity, once again to an unknown extent. In
fact Halford (1989, p. 126) has identified the same criticism in relation to capacity:
“…when performance has been attributed to capacity, the term has often been used in a
way that is synonymous with performance. The explanation is then completely circular;
performance is a function of capacity, but performance is the only indicator of
capacity.”
Experimental manipulation can address these concerns up to a point and much of the
empirical work to date has employed the experimental paradigm (Halford, Andrews,
Dalton, Boag, & Zielinski, in press; Andrews & Halford, 1998, etc.). By carefully
manipulating the complexity of the task and the capacity of the individual (typically
using age) while holding as many other factors as possible constant, some assessment of
validity can be achieved by comparing aggregated performance across groups of
individuals and/or sets of items (Andrews & Halford, in press). So while the assessment
of complexity and capacity uses the same measure (i.e., performance), it is obtained
under stringent manipulation and with the assumption that the influence of potentially
confounding factors is distributed equally across individuals/items. As will be argued
below, this approach has the potential to mask reliable variation that may turn out to
(strongly) qualify the effect being tested. This is particularly relevant if some of that
reliable variation is due to some interaction between complexity and capacity (Lohman
& Ippel, 1993). Ideally, what is needed is a more independent way of testing the
presence of a complexity effect that takes into consideration individual differences in
performance (and capacity) as well as aggregated (or group) differences. This issue is
central to the thesis and requires a brief diversion into more traditional measurement
literature that will be expanded on further in later chapters.
1.3 Assessing Capacity and Complexity
Our understanding of human reasoning has developed through two distinct
methodological measurement paradigms (Cronbach, 1957; Hunt, 1980; Lohman, 1989;
Sternberg, 1994). The experimental or comparison of means (COM) approach has
dominated the field of cognitive psychology and is pervasive in much of the traditional
memory and information processing research. Theory development and testing from this
approach is achieved through creative experimental manipulation and control.
Assessment of variations between task conditions on one or more outcome measures
typically forms the basis of theory testing in this paradigm. Individual variations are
typically controlled statistically (as is done in within-subjects ANOVA designs) or
through random assignment (as in the standard between-subjects ANOVA designs). In
either case, the emphasis of the comparison of means approach is on aggregated
differences associated with the experimental manipulation. Therefore variations
between individuals not associated with this manipulation are not of direct interest and
are ultimately treated as noise. As mentioned previously, this has been the predominant
approach used to test the relational complexity theory. In the correlational or individual
differences approach that is considered to have been established by Charles Spearman
(Sternberg, 1990; 1994), far greater emphasis is placed on trying to account for reliable
individual variation rather than averaged group differences. This approach effectively
defines the psychometric tradition that has dominated the field of differential
psychology (also called individual differences psychology).
An alternative way of conceptualizing the modelling of task performance that also takes
into consideration the potential for individual differences comes from some of the
measurement models of Item Response Theory. For instance, Susan Embretson has had
success in modelling the influence of various task components in mental rotation type
tasks using parameters from Item Response Theory (Embretson, 1995a; Embretson,
1993). Multidimensional Latent Trait Models incorporate the influence of weighted task
components directly into the parameters of the IRT model. The benefit of using IRT is
that the interaction between items and individuals is modelled formally. An
individual’s ability is modelled as a function of the items they have solved and the
characteristics of the components that make up those items. Essentially, what this means
is that we can take a process theory of task performance and model the influence of
different task characteristics on the individual.
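A schematic sketch of this idea is given below. The logistic (Rasch-type) form, the two hypothetical task components, and their weights are illustrative assumptions rather than Embretson's published parameterisation.

# Schematic sketch only; the weights and features are hypothetical, and this is
# a simplified Rasch/LLTM-style form in which item difficulty is a weighted sum
# of component features, so items and persons sit on the same (logit) scale.
import math

def p_correct(theta, item_features, weights):
    difficulty = sum(w * f for w, f in zip(weights, item_features))
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

weights = [0.8, 0.5]          # hypothetical component weights
item = [3, 1]                 # e.g., (relational complexity, display size)
for theta in (-1.0, 0.0, 1.0, 2.0, 3.0, 4.0):
    print(theta, round(p_correct(theta, item, weights), 2))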
Both the experimental and differential approaches have been instrumental in cognition
research (Deary, 2001) and very similar (if not identical) tasks and measures have been
employed under both paradigms. Unfortunately, not only does there appear to be a
persistent reluctance to combine the methods and approaches from these two paradigms
in any formative way (with some notable exceptions as listed above), there is also an
apparent resistance to theory integration. Lohman and Ippel (1993) suggest that this
might be a function of incompatibilities in the methodologies used by the two
approaches. In any case, this has resulted in two groups of researchers using essentially
the same measures to assess the same cognitive processes, with little if any
communication between them. A review of both the correlational and comparison of
means approaches reveals each has much to offer in assessing the function of relational
complexity. We identify two methods that may be useful in differentiating the effect of
relational complexity from other factors that might contribute reliably to variation in
task performance. The first approach uses an extension of the dual-task paradigm
developed by Hunt and Lansman (1982; see also Lansman & Hunt, 1982) that includes
aspects of the individual differences paradigm to overcome some of the reported
limitations of dual-task assessment of processing capacity. The second uses an extended
psychometric study to explore relational complexity in adult cognition and the way it
covaries with a selection of core cognitive abilities.
1.4 Overview of the Thesis
To address the issues discussed above an attempt is made in Chapter 2 to map the
development of the theory of relational complexity and its relationship with individual
differences theories of human reasoning. This serves two purposes. Firstly, it provides a
necessary review of the reasoning literature. More importantly, mapping the
development of our understanding of human reasoning has the potential to explicate
measurement and assessment issues that have arisen previously. The way these issues
have been addressed will serve as the starting point in assessing the importance of
relational complexity in adult cognition. Chapter 2 includes the specification of the
theorems of the relational complexity theory that are applied to the analysis of two
cognitive tasks in Chapters 3 and 4. Chapter 3 considers a relational complexity analysis
of the knight-knave task, a high-level reasoning puzzle investigated by a number of
researchers (Rips, 1989; Byrne & Handley, 1997; Schroyens, Schaeken, & d'Ydewalle,
1999). Chapter 4 reviews the development of the Latin Square Task. This task was
created specifically for this research to test the principles of relational complexity. The
next three chapters form the experimental component of the thesis and synthesise the
experimental and correlational approaches in conjunction with applications of the
Rasch measurement model to explore more completely the role of relational
complexity. Chapter 5 applies the dual-task methodology to the Latin Square task. We
use the easy-to-hard paradigm developed by Hunt and Lansman (1982) that relies
almost exclusively on the individual differences approach. Chapter 6 explores the
comparison of means evidence individually for three relational complexity tasks. We
also explore the psychometric evidence for the measures of fluid intelligence (Gf),
crystallized intelligence (Gc), and short-term memory (SAR).
Chapter 7 considers the relationship between the relational complexity manipulations in
the three tasks, and their relationship with the measures of broad cognitive abilities. If
relational complexity is similar to complexity as it is defined in the psychometric
tradition (Stankov, 2000; Stankov & Crawford, 1993) an a priori relationship between
task performance and Gf can be specified as a function of relational complexity.
Specifically, as the relational complexity of a task increases, the correlation between
task performance and Gf should also increase monotonically. In practical terms this
implies that as complexity increases, task performance better differentiates between
individuals of differing general reasoning abilities (as Gf is defined). This pattern of
correlations is not necessarily expected to hold for the Gc and SAR factors. This offers
a means of testing whether processes of equal complexity across different tasks and
domains form a consistent latent factor. Finally, Chapter 8 attempts to integrate the
main issues addressed in each of the chapters as it relates to the role of relational
complexity in adult reasoning.
CHAPTER TWO
COGNITIVE COMPLEXITY AND RELATIONAL COMPLEXITY THEORY
2 Introduction
Cognitive psychology at its purest is grounded in theories of process that attempt to
provide a conceptualization of such things as attentional resources (e.g., Just &
Carpenter, 1992), processing capacity (Halford et al., 1998a), and the functioning of
memory (Baddeley, 1992, 1986; Humphreys, Wiles, & Dennis, 1994). The earliest
attempts to measure cognitive capacity or span of attention revealed the presence of
limits that seemed to be fluid and dependent on many factors (Wundt, 1896, cited in
Chapman, 1989). The seminal work of Miller (1956) demonstrated that a quantitative
dimension existed in the number of items that could be attended to at once. Miller
introduced the concept of chunks as a way to explain how people overcome the apparent
limitation in storage capacity. Current conceptualisations of capacity are still often
drafted in terms of storage capacity (Cowan, 2000). However Halford et al. (1998a) and
others (e.g., Oberauer, Suss, Schulze, Wilhelm, & Wittmann, 2000) make a clear
structural distinction between the storage and processing functions of working memory.
Whether storage capacity can subsume the processing capacity literature as Cowan
(2000) implies, or whether processing capacity might subsume storage capacity
(Halford, Phillips, & Wilson, 2000) is still open to further exploration. What we are
concerned with here is processing capacity and like storage capacity, the construct is
based on a conceptualisation of resources.
2.1 Resource Theory
Various models of capacity and resources have been proposed since Miller’s (1956)
identification of a quantitative limit in processing. Kahneman (1973) depicted an
undifferentiated model of processing capacity that contains a single "pool" of resources
that can be allocated in a continuous fashion as task demands increase. While
differential interference in dual-task studies that Kahneman’s model was unable to
account for provided support for a multiple resource model (Norman & Bobrow, 1975;
Wickens, 1980; 1984; 1991; Fisk, Derrick, & Schneider, 1986), the basic assumption of
resource allocation has remained virtually intact (Halford, 1993).
Identification of dual-task deficits has been instrumental in the development of our
current understanding of the resource concept (e.g., Halford, 1993; Maybery, Bain, &
Halford, 1986; Navon, 1984). A dual-task deficit occurs when the performance on one
or both tasks deteriorates when attempted together. This deterioration is typically
provided as evidence for the tasks' dependence on limited resources from a pool
common to both tasks (Kahneman, 1973; Norman & Bobrow, 1975; Navon & Gopher,
1979; 1980). However, Navon (1984) argues that such deficits in themselves are poor
indicators that tasks compete for resources per se. He suggests that in most cases dual-task deficits can be explained equally well and more parsimoniously without
introducing the concept of resources and the additional baggage that the term entails
(see below). For instance, Navon and Miller (1987) have shown that interference effects
can be accounted for by outcome-conflict: conflict produced between the outcome of the
processing required for one task and the processing required for the other task [1]. Navon
(1984) therefore believes that the legitimate use of resource theory is only as a metaphor
for interference or trade-off that is not due to the scarcity of any mental commodity. In
any case, Navon's work served to highlight fundamental weaknesses in the traditional
dual-task approach that needed to be addressed.
[1] The Stroop effect can be considered an example of outcome-conflict in which the process of generating a colour word interferes with another process that generates the name of an ink colour.
Damos (1991b) and Wickens (1991) discuss a series of experimental controls and
guidelines that can be used in the application of the traditional dual-task methodologies
to minimise interference and outcome conflict. Hunt and Lansman (1982; Lansman &
Hunt, 1982) have a fundamentally different response to Navon’s (1984) concerns. They
propose using the individual differences methodology to test for differences in resource
capacity after partialling out the influence of interference. This procedure is known as
the easy-to-hard paradigm. Although some researchers have questioned the practicality
of meeting the easy-to-hard assumptions (e.g., Stankov, 1987), the results of research
using this approach have been interpreted to support the tenets of resource theory
(Halford, Maybery, & Bain, 1986; Halford, 1989, 1993; Foley & Berch, 1997).
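The correlational logic can be roughly sketched as follows. The variable names, loadings, and sample size are assumed for illustration only, and the snippet is a simplified stand-in for the full easy-to-hard analysis rather than Hunt and Lansman's actual procedure.

# Rough illustration of the correlational logic (variable names are assumed):
# does easy-primary-under-dual-task performance predict hard-primary-alone
# performance once easy-alone and secondary-alone performance are partialled out?
import numpy as np

rng = np.random.default_rng(0)
n = 200
capacity = rng.normal(size=n)                                 # latent resource capacity
easy_alone = 0.3 * capacity + rng.normal(scale=1.0, size=n)
secondary_alone = 0.3 * capacity + rng.normal(scale=1.0, size=n)
easy_dual = 0.7 * capacity + 0.3 * easy_alone + rng.normal(scale=0.5, size=n)
hard_alone = 0.8 * capacity + rng.normal(scale=0.5, size=n)

def residual(y, covariates):
    # Regress y on the covariates and return the residuals.
    X = np.column_stack([np.ones(len(y))] + covariates)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

r = np.corrcoef(residual(easy_dual, [easy_alone, secondary_alone]),
                residual(hard_alone, [easy_alone, secondary_alone]))[0, 1]
print(round(r, 2))   # a sizeable partial correlation is the predicted signature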
Just and Carpenter (1992) use the conceptualisation of resources to define the
functionality of working memory as it corresponds to the central executive component
of Baddeley’s memory model (1986). They perceive working memory as a “…pool of
operational resources that perform the symbolic computations and thereby generate the
intermediate and final products [of reasoning, problem solving, and language
comprehension]” (Just & Carpenter, 1992, p. 122). This can be extended to describe
(and predict) individual differences in performance as a function of the interplay
between two factors: (i) strategy for allocation of the available resources to a task, and
(ii) processing efficiency. The convergence of these and many other intuitive and
empirical resource models (e.g., Polson & Friedman, 1988; Fisk et al., 1986; Boles &
Law, 1998; Pashler, 1994) serves to support and consolidate resource theory as a
conceptualization of the generic capacity construct. Wickens (1991) argues that
although in isolation each piece of evidence for resources might be explained by a
different concept, resource theory best accounts for all the processing capacity evidence
together. We will review these issues in more detail in Chapter 5 where we discuss the
application of the dual-task methodology to assess the differential processing demands
of relational complexity.
2.1.1 Resources: A Cautionary Note
The sheer volume of research using the resource concept might be interpreted as
construct validity, consistent with Wickens' (1991) proposal. However, the broadness of
its current use also brings into question whether the term is sufficiently well specified to
be useful as a distinct construct (Oberauer & Kliegl, 2001). The resource concept
was employed to assist with the development of models describing the limitations in
memory and reasoning that were observed by the predecessors of our discipline. More
recently it has been used to develop sophisticated computational models of cognitive
performance (Just & Carpenter, 1992; Halford et al., 1998a). However talking about
capacity or resources limitations seems inappropriate unless there is a clear indication of
just what the resource entails – its internal structure and limits, and the limits of its
application. Furthermore, if the resource concept is used to define a construct, then like
all constructs its role in cognition should be placed in context with other postulated
resources and constructs and with other accounts of working memory limitations such
as interference and decay as suggested by Oberauer and Kliegl (2001). Unless this is
done, the utility of the resource concept is as Navon (1984; Navon & Miller, 1987)
suggests: a metaphor for describing a poorly defined construct. We will also consider
these issues in our discussion of the empirical dual-task evidence presented in Chapter
5.
2.2 Relational Complexity Theory
The idea of relational reasoning is not new. Spearman’s (1923) conceptualization of
intelligence implicated the eduction of relations and correlates – the ability to deal with
complex relations, to understand relationships among stimuli, to comprehend the
implications of these relations, and to formulate a conclusion. In current parlance, this is
central to our understanding of Fluid Intelligence (Carroll, 1993). With the availability
of sophisticated neural imaging technology, relational reasoning is also being used to
explain the neurological correlates of executive function. Recent research has
implicated the anterior dorsolateral prefrontal cortex in relational processing (Kroger et
al., in press; Waltz et al., 1999). Waltz et al. (1999, p. 123) argue, "…relational
reasoning appears critical for all tasks identified with executive processing and fluid
intelligence”. As we hope to demonstrate, Halford and his colleagues have achieved a
formalism detailing how reasoning is influenced by the relational structure of the task
(Halford & Wilson, 1980; Halford et al., 1994; Halford et al., 1998a; Andrews &
Halford, in press). The original purpose of the relational complexity theory was to
provide a foundation from which to explore the developmental nature of processing
capacity and its limitations (Halford, 1993). In the following sections we explore the
specification of the theory and the implications for assessing the function of relational
reasoning in adult cognition.
2.2.1 Specification of Relational Complexity
The concept of relational reasoning can be derived from the following premises:
1. Deduction entails processing relations between entities specific to the task or
that are recalled from memory
2. Processing internally represented relations generates non-trivial cognitive
demand
3. The complexity of the internally represented relation can be used to quantify the
characteristics of the processes used in performing the task – this is relational
complexity
4. Further, processing capacity is a function of the relational complexity of the
representations that an individual can process and therefore can be quantified
using the same metric
From this reasoning, there are essentially two axioms that form the basis of the
relational complexity theory. We will consider the elaborations and implications of
these axioms.
Axiom 1: Complexity of a cognitive process: is the number of interacting
variables that must be represented in parallel to implement that process (Halford
et al., 1998a, p. 805).
Axiom 2: Processing complexity of a task: is the number of interacting variables
that must be represented in parallel to perform the most complex process
involved in the task, using the least demanding strategy available to humans for
that task (Halford et al., 1998a, p. 805).
The formalisation of the relational complexity metric follows directly from these
axioms. Relational complexity is considered to correspond to the arity or number of
arguments of a relation. Unary relations have a single argument as in class membership,
DOG(Fido). Binary relations have two arguments as in LARGER-THAN(elephant,
mouse). Ternary relations have three arguments as in ADDITION(2,3,5) and quaternary
relations have four arguments. Halford et al. (1998a) suggest that the upper limit of
adult cognition tends to be at the quaternary level, although under optimal circumstances
quinary-level processing may be possible. In general, the complexity of a relation, R(a,
b, … , n), is determined by its arity, n. Each argument (a, b, … , n) is a source of
variation in that it can be instantiated in more than one way under the condition that the
relation is true (or at least perceived to be true). So for example, the binary operation of
addition (e.g., 2 + 3 = 5) is a binding of three variables (2, 3, and 5), and is a ternary
relation. The LARGER-THAN relation requires two arguments to be appropriately
instantiated and is therefore a binary relation (see Figure 2.1).
[Figure 2.1: the relation R(a, b, …, n), where R is the symbol for the relation and the number of arguments n gives the relational complexity (RC = n); e.g., LARGER-THAN(elephant, mouse).]
Figure 2.1. Representation of relations based on Halford et al. (1998a).
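A minimal sketch of the metric: if a relation is written as a predicate with a tuple of arguments, its relational complexity is simply the number of arguments. The particular representation below is an illustrative convention only, not part of the formal theory.

# Minimal sketch: relational complexity as the arity (number of arguments) of
# an explicitly represented relation.
relations = {
    "DOG": ("Fido",),                       # unary: class membership
    "LARGER-THAN": ("elephant", "mouse"),   # binary
    "ADDITION": (2, 3, 5),                  # ternary: 2 + 3 = 5
    "PROPORTION": ("a", "b", "c", "d"),     # quaternary: a/b = c/d
}

for name, args in relations.items():
    print(f"{name}{args}: relational complexity = {len(args)}")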
2.2.2 Chunking and Segmentation
One of the most interesting features of the relational complexity theory is the
application of the complexity metric to characterise both features of the task and the
processing capacity of the individual. That is, the implication of point 4 in Section 2.2.1
above is that the relational complexity of the most complex relation that can be
processed can be used to quantify the limits of processing capacity – a characteristic of
the individual. Much of the evidence for the relational complexity theory exploits this
foundation.
Given that processing capacity limitations do constrain the amount of information that can
be represented in parallel, an individual needs to be able to work within these limits.
Relational complexity theory proposes that cognitive demand can be reduced through
the processes of conceptual chunking and segmentation. Conceptual chunking is the
recoding of a relation into a lower dimensional concept. For instance, velocity can be
considered as a function of distance and time (velocity = distance/time), and in this
form it entails a ternary relation which might be represented either as RATIO(distance,
time, velocity) or, equivalently, RATIO(distance, time) → velocity ("ratio of distance to
time implies velocity”). However, it can also be considered as a unary relation, such as
VELOCITY(60km/hr). Conceptual chunking reduces processing demand, but at the cost
that relations between chunked variables become inaccessible. While we think of velocity
as a unary relation, questions about time and distance cannot be considered.
Segmentation entails reducing problems with many arguments into a series of lower
dimensional processes that are solved in series. Relations are only defined between
variables that are in the same segment (i.e., step) and relations between variables in
different segments are inaccessible (Halford et al., 1998a).
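The trade-off involved in chunking can be sketched as follows; the representation of relations as name–argument pairs is an illustrative convention only.

# Illustrative sketch of conceptual chunking: the ternary relation
# RATIO(distance, time, velocity) is recoded as the unary chunk VELOCITY(60km/hr).
# Complexity drops, but distance and time are no longer accessible inside the chunk.
ratio = ("RATIO", ("distance=120km", "time=2hr", "velocity=60km/hr"))
velocity_chunk = ("VELOCITY", ("60km/hr",))

def complexity(relation):
    _, args = relation
    return len(args)

print("unchunked:", complexity(ratio))           # 3 (ternary)
print("chunked:  ", complexity(velocity_chunk))  # 1 (unary)
# Questions about time or distance cannot be answered from the chunked form alone.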
2.2.3 Relational Complexity Theorems
Through the second axiom and the principles of chunking and segmentation, the theory
is capable of modelling higher-level reasoning tasks that entail the representation and
integration of more than one process for successful performance. Relational complexity
of a task defined in this way captures the peak cognitive requirements of the task as a
whole (Halford et al., 1998a). This conceptualisation has been used to broaden our
understanding of dynamic situations such as air-traffic control (Boag, Neal, Halford, &
Goodwin, 2000) in which task complexity varies across time. The issue of chunking and
segmentation introduces three further theorems of the relational complexity theory:
Theorem 1. Relational complexity is defined as the minimum dimensionality to which a
representation can be reduced without the loss of information necessary for solution.
This is referred to as the effective relational complexity.
An example of this is outlined below, where the four arguments of mathematical
proportion (a/b=c/d) are segmented into a number of ternary and binary relations that
are solved in series.
Theorem 2. Task complexity is defined as the effective relational complexity of the most
complex process entailed in the task.
This is a natural extension of Axiom 2 as applied to the first theorem. The effective
relational complexity of the three processing steps required for mathematical proportion
is ternary since this is the minimum dimensionality to which mathematical proportion
can be reduced.
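One plausible serial decomposition is sketched below; the particular steps shown are illustrative and are not necessarily the strategy assumed in the original analyses.

# One plausible serial decomposition of mathematical proportion (a/b = c/d),
# sketched for illustration; each step involves at most a ternary relation.
def proportion_holds(a, b, c, d):
    r1 = a / b          # step 1: RATIO(a, b, r1)   - ternary
    r2 = c / d          # step 2: RATIO(c, d, r2)   - ternary
    return r1 == r2     # step 3: EQUAL(r1, r2)     - binary

print(proportion_holds(2, 4, 3, 6))   # True: 2/4 = 3/6
print(proportion_holds(2, 4, 3, 5))   # False
# The most complex step is ternary, so the effective relational complexity of
# this segmented procedure is ternary (cf. the discussion in Section 2.2.6.1).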
Theorem 3. Arguments cannot be chunked or segmented if relations between them must
be used in making the current decision. A corollary of Theorem 3 is the common
arguments principle: Where two predicates (i.e., relations) have the same argument,
and where they function as a unit for the current process, the predicates can be
chunked.
Note: Theorem 3 is really an instantiation of Theorem 1 but is included
separately because of its centrality in relational complexity analyses.
An example of Theorem 3 is also provided above in the case of velocity. Further
examples are provided in the analyses of the knight-knave task and the Latin Square
Task in Chapters 3 and 4, respectively. Together the axioms of relational complexity
theory and the chunking and segmentation principles form the foundations of the
Method for Analysis of Relational Complexity (MARC).
2.2.4 Representation of Relations: A Comment on Notation
It is necessary to introduce some additional notation that will be used in the relational
complexity analyses in subsequent chapters. An elaboration of the Halford et al. (1998a)
notation makes it possible to represent higher-order relations that have one or more
relations as arguments. Consider the relation, R(a, S(b, c), d). R is a higher-order
relation defined by a, d, and the embedded binary relation, S(b, c). If the arguments of
S do not need to be considered separately to make the current decision, then the relation
can be chunked as a single argument of R (see Theorem 3). We represent chunking by
underlining the elements that form the chunk. Therefore R would be represented by
three underlined arguments and takes the form of a ternary relation: R(a, S(b, c), d).
The arguments a and d are also underlined as they are considered as chunks consisting
of a single argument. Examples of the analysis of embedded relations in the knight-knave and Latin Square tasks are provided in Chapters 3 and 4,
respectively.
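Read operationally, the notation can be sketched as follows; the nested-tuple representation is only an illustrative convention, but it captures the point that a chunked embedded relation counts as a single argument.

# Sketch of the notation for embedded relations: a chunked embedded relation
# counts as one argument of the higher-order relation.
S = ("S", ("b", "c"))            # embedded binary relation
R = ("R", ("a", S, "d"))         # higher-order relation with S as one chunked argument

def relational_complexity(relation):
    _, args = relation
    return len(args)             # each (possibly chunked) argument counts once

print(relational_complexity(S))  # 2 (binary)
print(relational_complexity(R))  # 3 (ternary, because S functions as a chunk)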
2.2.5 Evidence for Relational Complexity
The relational complexity metric receives empirical support from close to half a century
of research in cognitive development. It is of course important that a theory be
predictive as well as postdictive, and we will see that while the typical measures are
variously sensitive to external factors, the relationship between the actual measure used
to validate the complexity manipulation and other factors that influence task
performance needs special consideration. It is important to recall that the relational
complexity theory was developed to explain processing capacity limitations in relation
to cognitive development and the evidence for the theory has therefore predominantly
been developmental in nature. In the following sections we summarise the empirical and
theoretical support offered by i) the age of attainment literature, ii) secondary task
performance in dual-task studies, iii) the neurological correlates research, and iv)
perceived or subjective workload ratings. It is important to consider these factors even
though adult cognition is of interest in the current research. Issues surrounding the
developmental evidence may provide insights into problems that might be encountered
in testing the theory with adults. Similarly, the application of the theory to an adult
population might assist in clarifying some of the developmental evidence.
2.2.5.1 Age of attainment and knowledge acquisition.
Age of attainment is the primary evidence used by Halford and his colleagues to support
their conceptualisation of processing capacity limitations as a function of relational
complexity (e.g., Halford & Dalton, 1995; Halford et al., 1986; Halford, Bain, &
Maybery, 1984a). They argue that while their approach is not a true stage model in the
Piagetian sense, there is a correspondence between the Piagetian stages and the age of
attainment on tasks that have similar relational structure. The argument runs as follows:
If children of a given age are capable of processing a task entailing ternary relations
(say), then they should also be able to perform at similar levels in tasks from other
domains that require ternary level processing (given sufficient instruction and
knowledge of the new task domain). The converse is of course a little more difficult to
verify logically (i.e., that children unable to process ternary relations in one domain will
not be able to process ternary relations in another domain) [2]. The age of attainment
criterion is an accepted criterion in the developmental literature and has been used to
assess the influence of relational complexity in an increasing number of traditional
developmental tasks such as transitive inference, class inclusion, and hierarchical
classification (Halford et al., 1986; Andrews & Halford, 1998; Halford, Andrews, &
Jensen, in press; Andrews & Halford, in press). This research demonstrates that
different tasks which entail ternary relations have similar ages of attainment.
[2] Fallacy of denying the antecedent: if A then B, not A → not B.
Another task in which age of attainment has been used to test the relational complexity
theory is the balance beam task (Halford, Andrews, Dalton et al., in press) discussed in
Section 2.2.6.1. This task is typically difficult for children below about 11 years
(Halford, 1993; Halford et al., 1998a). Even adults rarely use the appropriate cross-product algorithm to take into consideration the combined influence of weight and
distance from the fulcrum. They often revert to processing less efficient and error prone
relations that are of a lower complexity. Together this suggests that knowledge is also a
significant source of difficulty. In terms of age of attainment, Halford, Andrews, Dalton
et al. (in press) demonstrated that binary components of the balance beam task but not
ternary components were available to children of two years. Five to six year old
children were capable of processing the ternary components. This is consistent with the
predictions of the relational complexity theory.
Relational Complexity, Age of Attainment, and Capacity: Not all researchers recognise
age of attainment as suitable evidence to validate relational complexity as a metric for
cognitive development. Goswami (1998) suggests that a key problem with the Halford
et al. (1998a) analyses is that there is no independent evidence of whether a child is
solving tasks on the basis of relational mappings. This criticism of independence could
be targeted at many cognitive process theories, but certainly, it is important to have
more than one source of evidence to substantiate the theory. As was indicated in
Chapter 1, a primary focus of this thesis is to provide independent evidence for the
influence of relational complexity on reasoning. Gentner and Rattermann (1998) argue
that at least in terms of the development of analogical abilities, it is difficult to
differentiate maturational effects from a knowledge-based account of performance.
Rather than a change in processing capacity, Gentner and Rattermann question whether
it is a change in the knowledge of the relations entailed (through the learning of
relational labels) that accounts for age of attainment effects from infancy to adulthood.
As discussed in Section 2.2.6.1, these issues are also relevant to the more advanced levels
of the balance beam task. Simply demonstrating that tasks in which the effect of
relational complexity has been observed on performance can account for a very large
proportion of the age related variation in other tasks in which the influence of relational
complexity has been reported (Halford, Andrews, & Jensen, in press; Halford, Andrews,
Dalton et al., in press; Andrews & Halford, in press), does not in itself resolve the issue
of alternative explanations. It only tells us unambiguously that chronological
development is associated with improved performance on relationally complex tasks. It
does not differentiate between capacity and other factors (e.g., knowledge) as the
primary cause of this improvement.
In adult cognition the concerns with the age of attainment criterion are less pressing and
the influence of knowledge becomes much more prominent. Related to this less directly
is the work of Salthouse (1985; Salthouse, 1988), who reports findings that performance
differences between young and old subjects become greater as the complexity of the
task increases. In a sophisticated structural equation model, Stankov (1994)
demonstrated that this effect, referred to as the Complexity Effect Phenomenon, is
actually an epiphenomenon in that it is predominantly a function of the well reported age
related decline in Gf during late adulthood. In fact age per se did not contribute any
additional variance beyond what Gf accounted for. This finding has interesting
implications for testing the relational complexity effect in children. It seems to be the
case that relational reasoning is central to fluid intelligence (we will argue this point
more strongly after presenting empirical evidence collected in this project). If we are
permitted to conceptualise fluid intelligence as the capacity to process relationally
complex tasks [3], then the increasing nature of Gf in the first few years of life (Horn,
1988) might allow for an interpretation of the age of attainment evidence that is less
entwined with acquired knowledge (i.e., given the distinction between Gf and Gc – see
Section 2.3 below).
[3] This is against the recommendation of Stankov (1994), who prefers to conceptualise this statement as only a figure of speech.
Disambiguating Capacity, Knowledge, and Relational Complexity: Sweller (1998)
expresses similar concerns about the influence of knowledge on performance and more
importantly how the relational complexity theory is capable of dealing with individual
differences in chunking and segmentation (this issue is raised in reference to
mathematical proportion and the balance beam task in Section 2.2.6.1). That is, to what
extent do different segmentation strategies (either taught or acquired) change the
effective relational complexity of a task and more importantly, how can differential use
of strategies be identified? This is a legitimate concern and one taken very seriously in
the application of MARC in the tasks that are employed here. Halford et al. (1998a) are
correct in stating that difficulty in establishing a reliable process theory of the strategies
(or “schemas” in Sweller’s parlance) on which the relational complexity analysis can be
imposed, does not invalidate the theory. It is also true that cognitive psychology, if
nothing else, has been very successful in developing the tools for process/component
analyses in a large number of tasks ranging from syllogistic/propositional reasoning
(Sternberg, 1977; Maybery, 1987; Johnson-Laird, Byrne, & Schaeken, 1992) to mental
rotation tasks (Shepard & Metzler, 1971; Pellegrino & Kail, 1982; Embretson, 1993),
and even more complex tasks like the Raven’s progressive matrices (Carpenter, Just, &
Shell, 1990). However, the ability to accurately identify strategies does cause some
concern about the practical utility of the application of relational complexity theory to
less well-known domains, in tasks that entail multiple correct solution strategies, or in
tasks that come from more dynamic environments in which the complexity of stimuli
changes over time. This is an important issue and one that we will return to in the final
chapter.
2.2.5.2 Dual-task evidence for relational complexity.
It is not surprising that given the emphasis that Halford et al. (1998a) put on processing
capacity and resource theory, a second type of empirical support for the relational
complexity theory is provided by dual-task studies. Maybery et al. (1986; see also
Halford et al., 1986) investigated binary and ternary processing load in the N-term
series task via a probe reaction time measure. With memory load and number of terms
controlled, significantly longer reaction times were observed at the point where the
three terms had to be integrated in parallel (to make a transitive inference, i.e., a ternary
relation) than when three terms could be integrated in series (a matched verification
task, i.e., a binary relation). The dual-task approach has also been used independently of
Halford’ laboratory to explore hierarchical reasoning consistent with the predictions of
relational complexity theory (Foley, 1997; Foley & Berch, 1997). We will explore this
in more detail in Chapter 5.
2.2.5.3 Neural imaging evidence.
Additional evidence that processing complexity is a function of relational complexity
comes from research exploring neurological correlates of reasoning. Kroger et al. (in
press) and Waltz et al. (1999) have implicated the anterior dorsolateral prefrontal cortex
in relational processing. Kroger et al. manipulated five levels of complexity that are
similar to manipulations of relational complexity [4] and four levels of item difficulty in
tasks resembling the Raven’s progressive matrices. The data was supportive of
increasing neural activity as complexity increases. Waltz et al. (1999, p. 123) argue,
“relational reasoning appears critical for all tasks identified with executive processing
and fluid intelligence”. Complementary to this is the work of Gevins and Smith (2000),
who demonstrated that individual differences in broad cognitive abilities are related to
differential cortical activity as a function of complexity during solution of the N-back
task. There are also some preliminary indications from Halford's laboratory that different
levels of the N-back task correspond to differences in relational complexity (Halford &
Zielinski, in preparation).
[4] The conceptualisation of relational reasoning by Waltz et al. (1999) and Kroger et al. (in press) uses the number of relations to be processed as the metric of complexity rather than the number of arguments in the relation per se, and is therefore somewhat different from that specified by Halford et al. (1998a).
2.2.5.4 Subjective ratings and other evidence.
Another test of the relational complexity theory is provided by Andrews and Halford
(1995; Andrews, 1997) who have considered ratings of perceived workload in the
comprehension of embedded sentences (this task is used in Chapters 6 and 7, and the
details are discussed there). They have demonstrated high correlations (r > .80) between
subjects' perceived complexity of the task and manipulations of relational complexity
within the task. Subjective ratings of cognitive demand have also been used to assess
the relational complexity analysis of static air-traffic control scenarios (Boag et al.,
2000).
As a final point of support it is useful to consider the position of the relational
complexity model in terms of other theories of cognitive process. Halford et al. (1998a)
argue that the relational complexity approach provides a more succinct account of
performance in the Tower of Hanoi task (Loveday, 1995) than the traditional embedded
subgoals (i.e., steps) procedure used by Just, Carpenter and Hemphill (1996; see also
Carpenter et al., 1990). Further empirical testing of relational complexity in the Tower
of Hanoi is currently in progress in Halford's laboratory. Evidence of the ability of the
relational complexity theory to subsume aspects of mental models (Johnson-Laird et al.,
1992) and deduction rule (Rips, 1994; Braine, 1990) theories of cognition are
considered in Chapter 3.
2.2.6 Unresolved Issues
The relational complexity theory is sufficiently well specified to be applied independent
of task domain and as a result is capable of making strong predictions about processing
capacity. As we have seen, many of these predictions have empirical and theoretical
support of varying degrees from (i) the age of attainment literature, (ii) secondary task
performance in dual-task studies, (iii) the neurological correlates research, and (iv)
perceived or subjective workload ratings, and many of these entail some comparison of
simple error or reaction time data. It is clear that the role of alternative explanations in
each of these measures varies and for reasons that we have alluded to already (Chapter
1), external validation of the relational complexity metric requires assessment
techniques that are more independent of confounding factors that might influence
performance. That is, although the specifications of the relational complexity theory are
detailed enough to allow various predictions to be made, the assessment of these
predictions is open to a large number of potentially confounding factors. In addition to
those we have already mentioned, other factors that might influence performance on
cognitive tasks somewhat independent of the available processing resources might
include differential rates of schema acquisition (Kanfer & Ackerman, 1989) and intra-task learning, individual differences in cognitive variables such as spatial and verbal
abilities (e.g., Lohman & Kyllonen, 1983), and individual differences in affective
variables such as impulsivity, persistence, confidence, and motivation (e.g., Stankov,
2000; Kanfer & Ackerman, 1989; Embretson & Prenovost, 2000). Given these
potentially confounding factors, the problem for a theory of relational complexity
becomes how to isolate the effect of a complexity manipulation in measures of
performance from these other factors.
2.2.6.1 Conflicting results using segmentation and chunking.
While the above evidence shows general and converging evidence for the influence of
relational complexity on processing capacity and development, the application of the
theory to specific task analyses has been shown to result in some inconsistencies. In
particular, some apparent inconsistencies in the application of segmentation and
chunking principles would benefit from further clarification. The theory relies heavily
on these principles and it is important to make a distinction between what is possible
within the limits of the theory as it stands and issues that are yet to be resolved.
Segmentation and chunking can influence the relational complexity classification of a
task. Halford et al. (1998a, p. 809) cite mathematical proportion (a/b = c/d) as an
exemplar of a quaternary relation. The thesis presented by Halford et al. is that a/b = c/d
entails four terms (a, b, c, d) that are constrained by the proportion relation and is
therefore a quaternary process. This means that given any three terms plus the
knowledge that proportion is entailed, it is possible to predict the remaining term (case
1: proportion(a,b,c,?)). Alternatively, if all four terms are provided, it is possible to
determine whether proportion follows (case 2: ?(a,b,c,d)). Although the prima facie
complexity of proportion may be quaternary, when the influence of strategy through
segmentation is considered, the effective relational complexity is ternary [5]. Halford et al.
(1998a) argue that both case 1 and 2 require representation of a quaternary process for
solution and therefore classify proportion as quaternary. However, they also state that a
series of ternary processes (and some algebraic knowledge) are all that is needed to
instantiate proportion in practice (Halford, Wilson, & Phillips, 1998b, p. 852). In
applying the axioms and theorems of relational complexity theory, the classification of
proportion should therefore be ternary.
[5] Pascual-Leone (1998) classifies the same problem as requiring six arguments, although his application of the principles of relational complexity theory is not accurate.
Another example of the same inconsistency is apparent in parts of the relational
complexity analysis of the balance-beam task. Algorithmic performance on the balance
beam task entails an application of a variant on mathematical proportion and has
similarly been classified as quaternary (Halford, Andrews, Dalton et al., in press). To
determine whether the beam will balance, the product of weight and distance on one
side of the beam can be compared with the product of weight and distance on the other.
This is referred to as the product rule entailing the concept of torque. That is,
weight(left) × distance(left) = weight(right) × distance(right), the complexity of which Halford, Andrews,
Dalton et al. represent as BALANCE-TORQUE(weight(left), distance(left), weight(right), distance(right)).
Once again, the authors correctly state that the task can be segmented into
two ternary relations (and a binary comparison of the output of each) to determine
proportion – or in the context of this task, balance. However, once again the argument
presented by Halford, Andrews, Dalton et al. for a requirement of quaternary processing
is not really convincing: "…while the most complex relation that has to be computed is
ternary for both the addition and product rules, acquisition of the product rule might
require representation of the quaternary relational torque rule" (emphasis added) [6].
[6] The addition rule is referred to as a "buggy rule" since it entails the addition of the number of weights and distance from the fulcrum in number of pegs.
An equally plausible interpretation of the relative difficulty in conceptualising the
influence of weight and distance, is that the product rule requires an acquisition of
knowledge of torque that is either not available or not easily accessible in that it is
unlikely to have been proceduralised in the context of the balance beam task. Halford et
al. (in press) did train subjects on the concepts of weight and distance but the possibility
of the interaction between the two was not demonstrated. It could be argued that
acquiring the torque schema unprompted is likely to take many more trials than what
was presented. As a caveat it is important to note that the emphasis of the research was
to demonstrate age related differences in the capacity to deal with binary and ternary
relations and was not contingent on the exact classification of the product-rule. As such,
the data provided as support for the study’s hypotheses is not compromised by the
authors’ classification of the product rule.
To address this inconsistency we need to consider more closely the principles of
segmentation and chunking. Remember, the essence of relational complexity is that the
relation constrains the possible values that the arguments can take; just as the particular
instantiation of the arguments constrains possible relations (this idea is referred to in the
principle of omni-directional access by Halford et al., 1998a, p. 817). Hence a relation
cannot be decomposed if the variables interact, because interacting variables must be
interpreted jointly. This is the case for all relations regardless of complexity level. So
the question is, what constrains proportion from being decomposed from a quaternary
relation? It is suggested by Halford et al. (1998b, p. 852) that to plan the implementation
of the separate ternary processes entailed in mathematical proportion, “…or to
understand why it is valid, the structure of proportion must be represented”. A similar
argument is applied to the torque rule of the balance beam task (Halford, Andrews,
Dalton et al., in press). Using this conceptualisation, both tasks remain quaternary even
after segmentation to a series of ternary relations. This is a little awkward as it is
inconsistent with the principle of effective relational complexity (Theorem 1) and has
the potential to introduce substantial subjectivity and vagueness into complexity
analyses. In any case, the relational complexity theory does not seem to be well
equipped at this stage to model this type of pre-processing in a way that is independent
of procedural knowledge (see comments by Sweller, 1998, that we raised earlier). That
is, from the relational complexity framework it is very difficult to conceptualise the
representation of implicit knowledge (e.g., proportion or torque) independent of the
explicit solution strategy. In fact, the empirical data for the balance-beam task also
suggests that knowledge is a significant source of difficulty. Halford, Andrews, Dalton
et al. (in press) state that even adults have difficulty with the product rule. Given that
Halford et al. (1998a) would argue that the majority of adults are capable of processing
quaternary relations it would be reasonable to speculate that the additional difficulty
observed is some function of knowledge.
The proportion example is important because it demonstrates the necessity for a clear
understanding of the available strategies in order to apply the principles of segmentation
and chunking. We suspect this is particularly relevant in assessing the distinction
between ternary and quaternary problems within the realm of adult cognition because
strategies are likely to be much more influential than with children. The proportion
example also highlights a possible looseness in the interpretation of the chunking and
segmentation principles that could benefit from some further formalisation. The
interpretation of the MARC approach that we adopt is therefore strictly in line with the
formalism of Halford et al. (1998a). It is also more in line with process theorists such as
Johnson-Laird who, while acknowledging implicit reasoning as a factor in performance,
effectively model only explicit processes formally (e.g., Johnson-Laird & Byrne, 1991;
Johnson-Laird et al., 1992). Strictly speaking, this is also the view of Halford and his
associates: “The metric applies to explicitly represented relations… It does not apply to
associations or automatic processes” (Halford, Andrews, & Jensen, in press, emphasis
added). In Chapter 4 (Section 4.10.1), we discuss similar issues that arise in the
determination of segmentation and chunking in the Latin Square Task. In the following
section we consider the conceptualisation of cognitive complexity from the individual
differences approach.
2.3 Cognitive Complexity: A Psychometric Approach
We have already noted that Spearman (1923) referred to three qualitative principles of
cognition. R. J. Sternberg (1977; 1984) summarises each of these as follows:
Apprehension of experience: Encoding – the perception of a stimulus and the
relation of it to the contents of long term memory
Eduction of relations: Inference – the interrelation of two stimuli so as to
understand their differences
Eduction of correlates: Application – the applying of an inferred relation to a
new domain
Sternberg (1977; 1984) suggests that this demonstrates that although the discipline of
cognitive psychology is considered by many to have emerged in the 1960’s from the
behaviourist tradition, links can be traced back to Charles Spearman (Skinner, 1983,
prefers to see the emergence as a "retreat"!). That is, Spearman (1923) who is
traditionally considered the father of the psychometric movement, was in fact proposing
a cognitive approach to cognitive abilities. This link between experimental and
correlational approaches has been exploited by only a handful of researchers beginning
with a call by Cronbach (1957) for an integration. Earl Hunt pursued the idea in the
1980’s (Hunt, 1980; Hunt & Lansman, 1982) using the easy-to-hard paradigm, a
variation on the dual-task methodology. Kyllonen and Christal (1990) explored
individual differences in reasoning ability as defined in the psychometric literature (e.g.,
Carroll, 1993) and its relation to working memory capacity based on Baddeley’s (1986)
cognitive psychology definition. More recently, Susan Embretson has attempted to
integrate experimental and differential psychology using IRT (e.g., Embretson,
1993; Embretson, 1995b, 1995a). Our aim is to exploit this tenuous link between
experimental and differential psychology further to explore the basis of the relational
complexity theory in adult cognition. We use the Gf-Gc theory of cognitive abilities as
the foundation from which to explore the influence of individual differences.
2.3.1 Gf-Gc Theory
The ability to reason and process relations seems to be central to human functioning.
According to Carroll (1993), the eduction of correlates and relations can be regarded as
reflecting elementary reasoning abilities and as we have already alluded to, these
abilities are typically considered a component of fluid intelligence. The original
distinction between fluid and crystallized intelligence was proposed by Raymond B.
Cattell in the late 1950’s (Cattell, 1957) and refined further in association with John L.
Horn (see Horn & Cattell, 1966; Horn, 1968; Cattell, 1987, for reviews). Fluid
intelligence (Gf) is clinically interpreted as a non-verbal or performance component of
intelligence that is relatively insensitive to education and to some extent culture (at least
in theory, arguably never in practice [7]). In marker tests such as the Raven's progressive
matrices, it entails the ability to induce relations and deal with novelty. In fact novelty
of situation and strategy variations are considered an important component of many
theories of intelligent behaviour (e.g., see Sternberg, 1984; 1985, for reviews).
Crystallized Intelligence (Gc) on the other hand, is regarded as an indication of the
effect of acculturation and education on cognitive abilities. It is “... a type of broad
mental ability that develops through the ‘investment’ of general intelligence into
learning through education and experience.” (Carroll, 1993, p. 599). The broadness of
both Gf and Gc is reflected in the significant cross-loadings of first order factors on
these higher order abilities. For instance, the first order sequential reasoning (RG) factor
independently contributes to the identification of both Gf and Gc (Carroll, 1993). If we
consider that transitive inference and categorical syllogism tasks typically load on the
RG factor, then it becomes intuitively reasonable that significant higher order loadings
would exist on both Gc and Gf. That is, performance on these types of tasks is likely to
be facilitated by both verbal and non-verbal reasoning skills. The methodological
consequence of this cross-loading effect is that it becomes important to use more than
one marker variable for each broad factor to avoid possible misidentification in factor
scores [8]. In terms of the current project, we need to be especially sure that Gf and Gc are
well identified in the data since, as we will see, they both have a role in validating the
relational complexity theory.
[7] It is important to note that the measurement of psychological constructs will always contain variation due to individual differences in experience. Hence, while in theory Gf might be free of culture and education, it will never be so in practice.
[8] A common although erroneous assumption made in many psychometric studies is that using only the central marker of a factor is sufficient when correlating it with other tasks. The Raven's matrices, for instance, is considered central to defining Gf (e.g., Carpenter et al., 1990; Stankov, Boyle, & Cattell, 1995). However, all abilities that contribute to the definition of Gf are not necessarily assessed by this one test. Using only one marker to define a factor has the potential to change the nature of the factors that emerge. This practice should be avoided as it can lead to unexpected loadings with other factors and tasks.
Short-term Apprehension and Retrieval (SAR) is another psychometric factor that is
relevant in the current work. This factor can be considered as the psychometric
equivalent to working memory, as identified by memory tasks in which individual
differences have been observed. Carroll (1993) reports several studies in which this
factor has been uniquely identified distinct from Gf and this is consistent with cognitive
psychology's notion that processing and storage should be treated as somewhat separate
systems (Halford et al., 2000; Cowan, 2000).
2.3.2 Psychometric Complexity
The issue of complexity has been pervasive throughout the psychometric movement. It
is almost taken for granted that increases in task complexity are associated with
increased demands on the information processing system (Larson, Merritt, & Williams,
1988; Pellegrino & Glaser, 1979; Stankov, 2000; Crawford, 1991). While Spearman's
general factor (g) was proposed to account for positive manifold, Jensen (1987a) argues
that the most undisputed fact about ‘g’ is that loadings of tasks on this factor are an
increasing monotonic function of the tasks’ perceived complexity. Marshalek, Lohman,
and Snow (1983, p. 108) argue that the “…actual correlation between a test and ‘g’
approximates the apparent complexity of its required operations”. Further, they propose
that an understanding of complexity is essential to understanding intelligence. The
association between complexity and "intelligence" can also be seen in Louis Guttman’s
facet theory that describes the structure of human abilities using the radex model
(radical expansion of complexity) (see Snow, Kyllonen, & Marshalek, 1984; Most &
Zeidner, 1995; Marshalek et al., 1983; and Stankov et al., 1995, for reviews). In this
model, tasks of different cognitive complexity are arranged in a series of concentric
circles such that more complex tasks that measure problem solving and higher-level
mental processes like Raven’s progressive matrices are grouped at the centre. The
domain specific and less complex tasks fall in the outer bands with sensory-type tasks at
the periphery. By its very nature, this approach avoids the constraints of the Cartesian
system of vectors in preference for a conceptualisation of the problem space using polar
coordinates. As an independent confirmation, Snow et al. (1984) used multidimensional
scaling techniques to classify the relationship between cognitive tasks along the lines of
Guttman’s theory. While differences in the identification of specific cognitive abilities
existed, the association between complexity and intelligence was consistent with
Guttman’s conceptualisation.
A more pragmatic approach, provided by Stankov (2000), manipulates cognitive complexity rather than just observing it. In this work, Stankov (2000, p. 123) adopts what
he refers to as an eclectic approach to increasing the cognitive complexity of a task. He
states that any manipulation that results in systematic changes in the factor loadings
with Fluid Intelligence (Gf) will suffice as a complexity manipulation since the “…empirical evidence is assumed to be the final arbiter”. This statistical criterion, which we will refer to as the complexity-Gf effect, is based on the theoretical assumption that
the factorial structure of intelligence is such that defining any one specific process to
account for complexity reduces our understanding of what Gf is, a broad multi-faceted
construct. In fact Stankov argues that too much emphasis has been placed on the process
theories of cognitive psychology. Lohman (1994) tends to agree and states that the
ambition of cognitive psychology in the 1970’s to rescue differential psychology from
psychometrics by providing measures of basic information processing capabilities
through decomposing individual differences, has resulted in weak and inconsistent
correlations with traditional estimates of abilities. To some extent we agree with this
observation, however we would like to point out an important distinction that needs to
be made. Relational complexity theory is effectively domain independent – it is a theory
about processing resources and task complexity and might more appropriately be
conceptualised as a theory of process rather than a process theory in the componential
sense typified by the work of R.J. Sternberg in the late 1970’s (Sternberg, 1977;
Pellegrino & Lyon, 1979). Having said this, the application of MARC does require a
process theory of the strategies that are entailed. We will argue in Chapter 3, that the
appropriateness of a relational complexity analysis is dependent on the success of the
process theory on which it is based.
The rationale for the psychometric conceptualisation of cognitive complexity, the
complexity-Gf effect, is appealing. Performances on complex cognitive tasks entail
understanding the relations among task stimuli, comprehension of the implications of
these relations, and the formulation of a conclusion based on this processing. These are
the characteristics that define fluid abilities (Carroll, 1993). It follows that if we
increase the complexity of a task this should result in a concomitant increase in the
demand placed on these fluid abilities. As Figure 2.2 indicates, this in turn should result
in better discrimination between individuals of high and low Gf, and the pattern of
monotonic increasing correlations between task performance and Gf (Stankov, 1994).
We will demonstrate in our analyses that the relational complexity theory may provide a more detailed specification and understanding of the facets of Gf.

[Figure 2.2 is a line graph plotting Performance against Cognitive Complexity (low to high), with separate curves for high-Gf and low-Gf individuals; the curves diverge as complexity increases.]

Figure 2.2. The complexity-Gf effect: The hypothetical relationship between performance and Gf as a function of cognitive complexity
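To make this criterion concrete, the following sketch (our own illustrative Python with simulated data and hypothetical variable names, not an analysis reported in this thesis) shows the form the complexity-Gf effect takes in data: the correlation between task performance and Gf, computed separately at each complexity level, should form a monotonically increasing series.

import numpy as np

rng = np.random.default_rng(1)
n = 200
gf = rng.normal(size=n)  # hypothetical Gf factor scores

# Simulated accuracy at three complexity levels, loading increasingly on Gf.
loadings = {"low": 0.2, "medium": 0.5, "high": 0.8}
accuracy = {level: w * gf + np.sqrt(1 - w ** 2) * rng.normal(size=n)
            for level, w in loadings.items()}

# The complexity-Gf effect predicts an increasing series of correlations.
correlations = [np.corrcoef(accuracy[level], gf)[0, 1]
                for level in ("low", "medium", "high")]
print([round(r, 2) for r in correlations])
print(all(r1 <= r2 for r1, r2 in zip(correlations, correlations[1:])))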
2.3.3 Fluid Intelligence and Complexity: The Evidence
Although attempts have been made and a psychometric criterion identified, there is no
global consensus in the psychometric literature on just what has to be done to satisfy an
operational definition of complexity. Crawford (1991) has reviewed a number of
approaches to defining complexity and has integrated the particularly interesting role of
strategy variation as an influence on complexity. The key idea as expressed by
Sternberg (1985) is that novel tasks call for the greater involvement of strategic or
metacomponential processes and are therefore closely related to measures of general
intelligence. Hunt (1980) takes a consistent but slightly different perspective and argues
that the tests in the periphery of the radex model described above, have low Gf loadings
because they are very constrained and typically have a limited range of possible
strategies. Therefore they tend to be more determined by mechanistic information
processing functions (Crawford, 1991) rather than the executive/control functions that
are typically associated with Gf tasks. Stankov and his associates (e.g., Stankov, 1988;
Spilsbury, Stankov, & Roberts, 1990; Stankov, 2000; Stankov & Crawford, 1993;
Stankov & Raykov, 1995) have manipulated complexity using a variety of methods in
both dual- and single-task designs and we consider examples related to each of these in
turn.
2.3.3.1 Dual-task evidence.
In what is referred to as a competing task paradigm, two independent cognitive tasks are
presented simultaneously with response priority order varied post-presentation (i.e.,
subjects are told which of the two tasks to respond to first after the presentation). The
complexity manipulation is the combination of the two component tasks. The
competing task as a manipulation of complexity is supported by an increase in
correlations with Gf (Fogarty & Stankov, 1982; Stankov, 1987; Roberts, Beh, &
Stankov, 1988). However, the presence of the complexity-Gf effect in the competing
task paradigm is qualified. Firstly, Fogarty and Stankov (1982) report that the
complexity effect is negated when the correlations between the single tests are
themselves very high9. Second, Fogarty and Stankov (1988) found that competing tasks
have higher Gf loadings than their single task counterparts only when the latter have
relatively low ‘g’ loadings (the converse is particularly interesting; when single tasks
have high ‘g’ loadings, the competing task tends not to be highly correlated with ‘g’).
Finally, a difficulty effect (decrement in aggregated performance) is not always a
necessary consequence of a complexity manipulation. Stankov (1988) reports a
9 It may be that the correlation between single component tasks is high when they both draw on the same processes or resources and that this might contribute to the reduction in correlations with ‘g’.
complexity effect in which there is no change in mean performance from the single task
to the competing task, but the expected increase in correlations with Gf was observed
(the tasks were a tonal memory test and an embedded figures test). In the same study, an
experiment is reported in which arithmetic mean performance in a multi-element
counting task actually increased from the single condition (sequential presentation of
two categories) to the competing condition (simultaneous presentation of two
categories). Stankov (1988) suggests that both these results indicate that processing
capacity limitations per se cannot account for individual differences in the tasks used
without some multiple resource type theory. It might also be the case that under more
difficult conditions, individuals will invest more effort to recruit more resources to not
only maintain, but increase overall performance (Kanfer & Ackerman, 1989). The
implication of this on the utility of resource theory and the investigation of dual-task
deficits is considered in Chapter 5. In any case, Stankov’s example is significant to the
extent that with appropriate tasks, such cases are rare (Stankov & Raykov, 1995). The
potential however is real, and this supports our emphasis on the need to use an external
criterion for complexity other than mean performance data.
2.3.3.2 Complexity and single tasks.
Mixed results have also been achieved under the single task approach. Spilsbury,
Stankov, and Roberts (1990) used a manipulation of premise presentation order of the
4-term series task (that entails transitive reasoning) as the complexity variable. There
was no evidence to suggest that the manipulation resulted in changes in correlations
although changes in item difficulty (and efficiency) were observed. This implies that the
manipulation was one of difficulty and not complexity. Similar results to Spilsbury et al.
(1990) have been observed in data from Halford's laboratory with variations of the N-term series task (Birney, 1999). Such results are consistent with the relational complexity theory since manipulations of the number of premises and of the presentation order do not change the inherent dimensionality or complexity of the transitive reasoning required, which is essentially ternary (Halford, 1993).
Using a complexity manipulation that on the surface at least is more in line with the
Halford et al. (1998a) conceptualization, Stankov and Crawford (1993) have
successfully demonstrated a reliable and increasing monotonic relationship between
performance and Gf in two experimental tasks. The Triplet Numbers Test required subjects to validate increasingly complex rules against randomly generated sequences of three digits. For instance, given the number triplet "3 6 4", subjects might be asked to verify whether "…the second digit is the largest and the third digit is the smallest".
The manipulation of complexity involved increasing the number of elements in the rule.
The second experimental task used by Stankov and Crawford (1993) was the Swaps
Test. This task required subjects to rearrange a letter triplet following increasingly
complex instructions. For instance, given the letter triplet, "J K L", subjects are asked to
specify the resulting order after the position of elements had been hypothetically
swapped. The complexity manipulation in this test was the number of swaps to be
made. Both these tasks are to be included in the current series of studies and are
discussed in more detail by Stankov and Crawford (1993). It is interesting to note that
an additional measure of short-term memory employed by Stankov and Crawford (the
forward and backward digit span tests) did not show the same increasing monotonic
correlations with the complexity manipulation as Gf did. From these findings Stankov
and Crawford (1993, p. 106) suggest that the main “ingredient” of complexity is the
"…number of elements and relations that have to be dealt with while a person tries to
solve a problem."
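The structure of these two tasks can be illustrated with a short sketch (our own Python illustration; the function names and the particular rule are assumptions, not the original test materials). The first function checks the example rule against a digit triplet; the second returns the letter order after a sequence of hypothetical swaps, the number of swaps being the complexity manipulation.

def triplet_rule_holds(triplet):
    # Example rule: the second digit is the largest and the third digit is the smallest.
    first, second, third = triplet
    return second == max(triplet) and third == min(triplet)

def apply_swaps(letters, swaps):
    # Each swap is a pair of 0-based positions to exchange.
    letters = list(letters)
    for i, j in swaps:
        letters[i], letters[j] = letters[j], letters[i]
    return " ".join(letters)

print(triplet_rule_holds([3, 6, 4]))                    # False: 3, not 4, is the smallest digit
print(apply_swaps(["J", "K", "L"], [(0, 2), (1, 2)]))   # two swaps -> "L J K"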
2.3.3.3 Relational complexity and Gf-Gc theory.
The complexity-Gf effect does not receive unanimous support. Researchers such as Schweizer (1998), who reports a reversed complexity-Gf effect10, suggest that the strength of the prediction is lessened by the variety of meanings attributed to complexity. It is also tempting to over-emphasise the obvious similarity between the role of relations and elements that Stankov talks about and that specified in the relational complexity theory. There is no theoretical justification at this stage to expect that Stankov and Crawford’s (1993) conceptualisation of “relations” is fundamentally different from that specified in the relational complexity theory; there is also no evidence to suggest otherwise. Yet, it is encouraging that the two different approaches
provided by Halford and Stankov independently should arrive at compatible
10 By reversed, we mean that the correlations between performance and Gf decreased as the number of operations to be processed (the manipulation of complexity) was increased (Schweizer, 1998).
conclusions. Relational complexity theory incorporates the conceptual distinction
between the Gf and Gc factors in what might be a natural way. It specifies that the
cognitive load on resources is generated by processing relations relatively independent
of content and domain, and that education can mediate the efficiency with which resources
are applied (e.g., through appropriate segmentation) but does not influence the
individual's processing capacity. It seems clear that an association between the
theoretical characteristics of processing capacity and fluid intelligence exists (Hunt,
1987; Kyllonen & Christal, 1990).
Relational complexity theory is grounded firmly in a theory of process whose
psychometric properties are not well known. If the relational complexity manipulation
is consistent with the notion of cognitive complexity in the psychometric tradition, then
we should observe a similar pattern of monotonic increasing correlations with Gf
indicative of the complexity-Gf effect. More importantly, the complexity-Gf effect
serves as an external criterion for the relational complexity manipulation that is less
dependent on the task being manipulated.
2.3.4 Some Final Methodological Issues
The unifying goal of this project is to test the relational complexity theory using
measures independent of the task whose relational complexity is being manipulated. We
have argued that this entails bringing together methodologies from experimental and
differential psychology. Although there is recognition that a merger of these areas is
necessary (Hunt, 1980; Kyllonen & Christal, 1990; Deary, 2001), history suggests that
such a venture has failed to be fully realised. Lohman and Ippel (1993) have argued that
the psychometric approach to cognition, despite the very best of intentions from the
outset, has been disappointing in its ability to provide a clear conception of what a
process underlying broad cognitive factors such as Gf might entail. They argue that the
information processing approach central to cognitive psychology provides not so much
a model, but a general framework to advance this understanding. Deary (2001) suggests
that the merging point will be the coming together of working memory and intelligence
differences, and that this relies on a more resolute attempt to develop validated
cognitive architectures that have isolable and testable constructs and processes than has
so far been achieved. As such, the weight of responsibility is placed with experimental
psychology in general and cognitive psychology in particular.
Lohman and Ippel (1993, p. 68) take a slightly different perspective. They suggest that
“… the generally accepted idea of test theory as applied statistics seems to have
precluded the development of a structural theory of measurement needed for the
measurement of processes”. Therefore, according to Lohman and Ippel, at least some of
the reasons for the relative failure of a merger have to do with the fundamental
differences and incompatibilities of the two approaches used. For instance, the
methodologies of factorial theories of intelligence assume that the underlying latent
constructs are relatively stable characteristics of the individual that remain constant
during testing – the focus is on between-individual differences. In research directed at
describing the processes by which a subject arrives at the answer to a problem, experimental psychologists arrange observation conditions to reveal differences in
responses. The focus is on within-individual differences (although typically at a group
level).
The Raven’s progressive matrices is an interesting task that might serve to clarify the
distinction being made by Lohman and Ippel (1993). Consider a differential psychologist's use of the progressive matrices test. Performance will be considered to
reflect individual ability to deal with novelty and induce patterns and relationships – to
reflect individual differences in latent fluid abilities. An experimentalist might use the
task for somewhat different purposes. They might focus on what changes in the
individual as a function of working through the process of solving a series of matrices
items. That is, to what extent are patterns discovered, relations induced, and strategies
developed. The knowledge base that the individual uses to work through the test of 30
or 40 items does not remain static. It changes based on the experiences of each
successive item; processing is dynamic, non-linear, and fluid. Consistent with Lohman
and Ippel, we therefore believe that the merger of the experimental and differential
perspectives on cognition will come through an investigation of individual differences
in strategy that also differ in process. We will return to this idea in the discussion of
Chapter 8. For now we return to the relational complexity theory and consider the
current state of our predictions.
2.4 The Experimental Approach
We can now specify Axiom 3 to complete the foundation that will be used to assess the
characteristics of the relational complexity metric. The full set is as follows:
Axiom 1: Complexity of a cognitive process: is the number of interacting
variables that must be represented in parallel to implement that process (Halford
et al., 1998a, p. 805).
Axiom 2: Processing complexity of a task: is the number of interacting variables
that must be represented in parallel to perform the most complex process
involved in the task, using the least demanding strategy available to humans for
that task (Halford et al., 1998a, p. 805).
Axiom 3: Complexity-Gf effect: Factor loadings on fluid intelligence are an increasing monotonic function of a task's psychometric complexity.
The predictions to be outlined below demonstrate what we believe is a close conceptual
relationship between the cognitive and psychometric approaches. The broad predictions
to be explored by each study are then summarised.
2.4.1 Predictions
2.4.1.1 Aggregated performance: Comparison of means.
From Axiom 1, we can predict that manipulations of relational complexity should incur
resource costs that are reflected in the difficulty of a task. The Comparison of Means
approach can be used to test for a relational complexity effect generated by a
manipulation of a task's relational structure. More complex tasks should have a smaller proportion of correct responses and, all else equal, a tendency for longer response
times11. From a psychological perspective this is a “weak condition” in that it is a
sufficient but not a necessary condition to substantiate a relational complexity
11 This is a contentious issue and one raised by a reviewer of Birney and Halford (in press). We will return to this issue in Chapter 7, Section 7.5.5.
manipulation (Spilsbury et al., 1990). The comparison of means approach assumes that
memory and education are controlled in the design of the items or through careful
selection of subjects (e.g., using age as a covariate); it cannot be used alone to control
statistically for potentially confounding factors. There is also the problem of
aggregating reliable or systematic within-subject variability in considering composite
scores (Lohman & Ippel, 1993) and group/condition differences (Chapter 1).
2.4.1.2 Easy-to-hard correlations.
One of the methods that we use is the easy-to-hard paradigm developed by Hunt and
Lansman (1982) that relies almost exclusively on the individual differences approach.
The easy-to-hard paradigm follows the dual-task approach to assessing processing
capacity limitations. Subjects are given a secondary task alone and concurrently while
solving an easy version of the primary task. Performance on the hard version of the
primary task attempted alone is also assessed. The measure of processing capacity is the
partial correlation between the secondary task (performed in conjunction with the easy
primary task) and the hard primary task (performed alone). Removing common
variation between the hard and easy primary task performed alone, and between the
hard primary task and the secondary task performed alone, results in a relatively clean
measure of variation due to resource limitations (Halford, 1989, 1993). This technique
removes variance that the tasks might share that is not associated with resource
limitations and overcomes the dual-task issue of conflict raised by Navon (1984). Since
the hard primary task is never performed in a dual-task situation, variance that is shared
with it and the secondary task cannot be associated with one task interfering with the
other. Assuming primary task performance is maintained, the interpretation of a
significant partial correlation is that performance on the secondary task is sensitive to
individual differences in available resources. That is, it can serve as a measure of
processing capacity and is considered in Chapter 5.
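The partial-correlation logic can be sketched as follows (a minimal illustration under our own naming assumptions; this is not the analysis code used in Chapter 5). Each subject contributes a score on the hard primary task performed alone, the secondary task performed together with the easy primary task, the easy primary task alone, and the secondary task alone; the easy-to-hard index is the correlation between the first two scores after the last two have been partialled out.

import numpy as np

def residuals(y, covariates):
    # Residuals of y after regressing out the covariates (with an intercept).
    X = np.column_stack([np.ones(len(y))] + covariates)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ beta

def easy_to_hard_partial_r(hard_alone, secondary_dual, easy_alone, secondary_alone):
    # Correlation between hard primary (alone) and secondary task (dual condition),
    # partialling out the easy primary task alone and the secondary task alone.
    covariates = [easy_alone, secondary_alone]
    return np.corrcoef(residuals(hard_alone, covariates),
                       residuals(secondary_dual, covariates))[0, 1]

# Example with simulated scores (a real study would use observed data).
rng = np.random.default_rng(0)
n = 100
capacity = rng.normal(size=n)          # hypothetical shared resource factor
easy_alone = rng.normal(size=n)
secondary_alone = rng.normal(size=n)
hard_alone = capacity + 0.5 * easy_alone + rng.normal(size=n)
secondary_dual = capacity + 0.5 * secondary_alone + rng.normal(size=n)
print(round(easy_to_hard_partial_r(hard_alone, secondary_dual,
                                   easy_alone, secondary_alone), 2))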
2.4.1.3 Controlling for confounding factors.
Using the individual differences approach and the Gf-Gc theory of cognitive abilities, the assessment of relational complexity in reasoning can be refined and its function
further explored. In the study outlined in Chapters 6 and 7, three broad cognitive factors
are assessed – fluid intelligence (Gf), crystallized intelligence (Gc), and short-term
apprehension and retrieval (SAR). Gc and SAR can be thought of as corresponding to
the effect of education (and experience) and short-term memory respectively. Axiom 1
also implies that memory storage requirements do not influence relational complexity of
the task per se. That is, while the limits of storage and processing capacity are similar in
magnitude at about 4 elements in each (Cowan, 2000; Halford et al., 1998a), a
conceptual distinction between the two is frequently made (Moray, 1967; Baddeley,
1996; Halford et al., 1984a; Maybery et al., 1986) even though there is some continuing
debate over whether one of the systems can subsume the other (Cowan, 2000; Halford
et al., 2000). In any case, the conceptual difference is supported by the factorial
separation of SAR as a distinct factor in the psychometric literature (Carroll, 1993;
Stankov & Crawford, 1993). With this in mind, it is probably best that the short-term
memory demands of a task are controlled where possible experimentally so as not to
contaminate the assessment of complexity.
A further concern is related to Axiom 2. The relational complexity of a task is defined
in terms of the most complex process entailed and in theory the amount of serial
processing required does not influence the task’s effective relational complexity. At
some point however additional processes will of course influence performance. Efforts
need to be made to take into consideration the serial processing demand and to some
extent the load generated by keeping in mind the end goal while task segments are
processed. We consider serial processing in Chapters 3 and 4.
2.4.1.4 Individual differences in cognitive ability and relational complexity.
Although we can attempt to control for the influence of serial processing and memory
on accuracy and response times by holding these factors constant where possible, we
cannot be sure that we have eliminated their effect and the effect of other irrelevant and
possibly interacting factors completely. As described in Figure 1.1, measures typically
used to quantify item difficulty are also used to quantify (i) the load generated by
memory limitations, (ii) the amount of information to be processed sequentially, and
(iii) to validate the effect relational complexity has on capacity. An external quantitative
criterion of complexity is needed that is independent of the measure used to assess the
effect of the relational complexity manipulation on available resources. A criterion of
complexity that comes from the psychometric domain has already been introduced as
the complexity-Gf effect of Axiom 3; as the complexity of a task increases, the
correlation between task performance and Gf should also increase monotonically
(Stankov, 2000; Stankov & Crawford, 1993). This pattern of correlations is not
necessarily expected to hold for the Gc and SAR factors. That is, while education (Gc)
and memory (SAR) do not impact on the classification of the relational complexity of a
task in principle, these abilities might influence segmentation and chunking strategies.
This will therefore indirectly influence the effective relational complexity of a task. The
extent that individuals who differ in Gc and SAR abilities are differentially able to deal
with relationally complex tasks will be investigated further. The study outlined in
Chapters 6 and 7 aims to map the relationship between processing capacity measures
derived from the relational complexity metric and three psychometric measures of broad
cognitive abilities, Gf, Gc, and SAR.
2.4.1.5 Class equivalence: Relational complexity across domains.
The implication of the first axiom is that the relational complexity metric is domain and
content independent. This implies that tasks with different content that impose
equivalent relational complexity demands should be significantly correlated with each
other because of a commonality in the number of relations to be processed. As we noted
above, this idea forms the foundations of using the age of attainment data to test the
complexity metric. An implementation of Campbell and Fiske’s (1959) multitraitmultimethod matrix can be explored to consider the expectation of class equivalence
within levels of relational complexity across tasks. A hypothetical outline is provided in
Figure 2.3. If class equivalence were observed we would expect the ‘d’ correlations to
be high. These are the correlations between the same levels of complexity (trait) across
different domains (methods). The ‘a’ correlations should be low since they describe the
relationship between different task domains of different levels of complexity. The ‘b’
correlations should lie somewhere between the ‘a’ and ‘d’ correlations in magnitude to
the extent that correlations are a function of the task domain (traditionally investigated
for evidence of method bias).
Multitrait-Multimethod Matrix (lower triangle shown; upper triangle omitted)

                 2D              3D              4D
             D1   D2   D3    D1   D2   D3    D1   D2   D3
  2D    D1    c
        D2    d    c
        D3    d    d    c
  3D    D1    b    a    a     c
        D2    a    b    a     d    c
        D3    a    a    b     d    d    c
  4D    D1    b    a    a     b    a    a     c
        D2    a    b    a     a    b    a     d    c
        D3    a    a    b     a    a    b     d    d    c

  D1, D2, D3   domains 1, 2, and 3
  2D binary; 3D ternary; 4D quaternary
  a   different domains/different levels of complexity
  b   same domain/different levels of complexity
  c   reliability coefficients
  d   different domains/same levels of complexity

Figure 2.3 Multitrait-multimethod correlation matrix design to assess class equivalence of relational complexity
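The pattern in Figure 2.3 can be generated directly from the design. The following sketch (illustrative Python under our own naming assumptions) labels each cell of the lower triangle according to whether the row and column conditions share a level of relational complexity, a task domain, both, or neither; under class equivalence the 'd' cells are expected to carry the highest correlations and the 'a' cells the lowest.

levels = ["2D", "3D", "4D"]        # binary, ternary, quaternary
domains = ["D1", "D2", "D3"]
conditions = [(lvl, dom) for lvl in levels for dom in domains]

def cell_label(row, col):
    (lvl_r, dom_r), (lvl_c, dom_c) = row, col
    if row == col:
        return "c"   # reliability coefficient
    if lvl_r == lvl_c:
        return "d"   # different domains, same level of complexity
    if dom_r == dom_c:
        return "b"   # same domain, different levels of complexity
    return "a"       # different domains, different levels of complexity

for i, row in enumerate(conditions):   # lower triangle only
    print(" ".join(cell_label(row, col) for col in conditions[:i + 1]))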
Of course tasks can be correlated for reasons independent of their relational complexity
and this issue as it relates to performance measures has already been considered. Once
again, the key point concerns the differentiation of relevant and irrelevant factors.
Correlations have the potential to form somewhat stronger tests of the predictions
particularly when domain experience and methodology limitations are taken into
consideration. However, while correlations indicate some underlying commonality they
do not indicate causality in the true sense. That is, correlations do not prove that
relational complexity is related to processing capacity directly and not through a third
mediating variable such as procedural methodology (or more controversially, education
and knowledge).
2.5 Summary of the Key Predictions
With the definitions of resources and capacity appropriately addressed, and the unique
measurement issues associated with relational complexity considered, we can propose
that (a) correlations with Gf and other psychometric factors, and (b) secondary task
performance, have the potential to serve as alternative and converging tests of the
relational complexity metric that are independent of the tasks in which relational
complexity has been manipulated. Based on these assumptions, the five predictions to
be tested across both studies can be summarised as follows (prediction E is really a
statement of the intent to explore these relationships further):
A. Comparison of Means: Increasing relational complexity increases difficulty as measured by errors and response times.
B. Dual-Task Deficit: Increasing relational complexity increases competition for limited resources.
C. Class Equivalence: Correlations between tasks with the same or different content should decrease as the difference in relational complexity increases.
D. Complexity-Gf effect: Loadings on Gf are an increasing monotonic function of a task's relational complexity.
E. Complexity-Gc/SAR effect: Loadings on short-term acquisition and retrieval (SAR) are not necessarily an increasing monotonic function of a task's relational complexity. Loadings on Gc are not necessarily an increasing monotonic function of a task's relational complexity.
The following two chapters consider two of the experimental tasks that will be used to
explore these predictions. Chapter 3 considers the application of the Method for
Analysis of Relational Complexity (MARC) to the knight-knave task, a high-level
suppositional reasoning task. Chapter 4 maps out the development of the Latin Square
Task, a new task in which relational complexity manipulations are tightly constrained.
The subsequent chapters (5, 6, and 7) consider the predictions in more detail
empirically.
CHAPTER THREE
RELATIONAL COMPLEXITY ANALYSIS OF THE KNIGHT-KNAVE TASK12
3 Introduction
This chapter presents an analysis of the complexity of reasoning in the knight-knave
task. Any complexity analysis requires that the processes and strategies employed in the
task be known. We therefore begin by reviewing some of what is currently known of
the knight-knave task. Once a working understanding of the task has been achieved we
are in a position to implement an analysis based on the principles of relational
complexity as outlined in Chapter 2. The Method of Analysis of Relational Complexity
(MARC) as applied to knight-knave problems is then outlined. The overall aim of this
chapter is (a) to test the utility of MARC as a methodology for analysing complexity of
knight-knave problems, and (b) to further explore the role of relational complexity in
this interesting set of reasoning problems.
3.1 Processing in the Knight-Knave Task
The knight-knave task was originally made popular by the philosopher Smullyan
(1978). It is a novel high-level reasoning task that was introduced to psychological
research by Lance Rips (1989). In a typical knight-knave problem reasoners are
informed of a world in which just two sorts of inhabitants exist – knights who always
tell the truth and knaves who always lie. One or more statements made by the
inhabitants are provided and the reasoner must determine the respective status of the
individuals. Consider the following example:
Example Problem:
There are two inhabitants, Tom and Beth, each of whom is a knight or a knave.
Tom says, “I am a knave and Beth is a knave”.
Beth says, “Tom is a knave”.
What is the status of Tom: Knight, knave, or impossible to tell?
12 The contents of this chapter have been published: See Birney & Halford (2002).
The task is deductive since an explicit conclusion can be derived from information
present in the context of the problem (i.e., the statements) using rules retrieved from
memory (see Evans, Newstead, & Byrne, 1993; Johnson-Laird & Byrne, 1991). Further,
the task is suppositional (Byrne & Handley, 1993; Byrne, Handley, & Johnson-Laird,
1995) in that the veracity of an assertion is unknown and therefore reasoners need the
ability to think conditionally or hypothetically to provide a starting point for their
reasoning. Byrne and Handley (1997) argue that this ability is essential to everyday
reasoning where inferences about the truth of assertors and assertions are frequently
required. As the analysis to be presented later will show, the combined uncertainty of
the status of the assertor and the veracity of their statement contributes to the task’s
complexity.
Two independent groups of researchers have explored the cognitive processes in the
knight-knave task. First we will consider the natural deduction rule theory of Rips
(1989; 1990) and then the mental models approach (e.g., Byrne & Handley, 1993;
Byrne et al., 1995; Johnson-Laird & Byrne, 1990).
3.1.1 Deduction Rules
Rips assumes a general-purpose mental logic that is applied in a serial fashion to
reasoning (Rips, 1983; Rips, 1994; Braine, 1990, 1993). Rips’ (1989) model was
developed from the verbal protocol analysis of four university students who had no
formal training in logic. These subjects typically started by considering specific
suppositions about the status of the speaker. They then followed a series of hypothesis
testing routines to assess the validity of each supposition. His model is based on an
exhaustive strategy (outlined in Figure 3.1) and is implemented as a production system.
The algorithm has been criticised for being too powerful and lacking psychological
plausibility (Johnson-Laird & Byrne, 1991; Schroyens et al., 1999). However the model
accounts for a significant amount of data, and Rips’ studies provide information about
strategies that is valuable for complexity analyses, including the role of serial
processing. However a limitation of his measure of difficulty is that it depends solely on
the number of inference rules or steps needed to solve each problem. The measure
assumes deductive uniformity (Rips & Conrad, 1983) in that individual differences in
the availability and application of the rules are not modelled. By contrast, MARC takes
account of both number of steps and the difficulty of each step. The mental model
approach offers an appealing and intuitive alternative to the deduction-rule
methodology.
1. Hypothesise the first speaker is telling the truth (i.e., a knight)
a. Follow up the consequences of this hypothesis
b. If it implies that the second speaker is a knight or a knave, follow up
the consequences of this, and so on…
c. If a contradiction is reached, reject the hypothesis
2. Hypothesise the first speaker is telling a lie (i.e., a knave)
a. Follow up the consequences of this hypothesis
b. If it implies that the second speaker is a knight or a knave, follow up
the consequences of this.
3. If a consistent assignment is reached for a speaker, the speaker's status is
established, otherwise it is undetermined.
Figure 3.1 Solution of knight-knave problems using the exhaustive strategy reported by
Rips (1989).
3.1.2 Mental Models
The mental model approach introduces solution strategy as a source of variation in
problem solving. According to this approach, not all aspects of a problem need to be
made explicit during solution (Johnson-Laird et al., 1992). Only when implicit models
are made explicit do they contribute to the demands on processing capacity. This
conceptualisation of processing has been used by Byrne and her associates to explore
the development of high-level reasoning (i.e., control) strategies in the knight-knave
task (Byrne & Handley, 1993, 1997; Byrne et al., 1995). Schroyens et al. (1999) suggest
that the mental model approach can also account for individual differences in
suppositional reasoning. They propose an elaboration of the mental model theory to
accommodate three levels of reasoning and have some success in accounting for errors
and bias in the knight-knave task. However, the different strategies introduced are based
on the construction of mental models and do not formally recognise the possibility that
some individuals may process the task at least some of the time using a rule-based
approach (see Roberts, 1993, for a review of this type of criticism of unified theories of
cognition in general).
Given that cognitive demand can be modified by strategies, a complexity analysis
requires an understanding of the conditions in which these strategies are likely to be
employed. Three main findings of Byrne and Handley’s (1997) research relate directly
to this issue and these findings will be applied and exemplified in the complexity
analysis to follow.
Finding 1. People make both forward and backward inferences to short-cut their
way through alternative suppositions.
A forward inference entails making the supposition that an assertor is telling the truth
(lies) and then inferring that the assertion is true (false). A backward inference entails
making the supposition that an assertion is true (false), and then from this inferring that
the assertor is a truth-teller (liar). The availability of backward inference strategies
means that reasoners do not always have to follow through the full exhaustive strategy
as reported by Rips (1989). In practice, a backward inference is possible when the first
supposition about the status of an inhabitant results in a contradiction. When this
occurs, the reasoner may incorporate and test suppositional inferences about the
assertion of the second speaker (the backward strategy) to short-cut their way to a
solution. The mental model theory argues that reasoners would favour this approach
since it avoids making implicit mental models unnecessarily explicit (which would be
required if a complete forward working strategy is used).
Finding 2. Generating a supposition is a source of difficulty for items in which
backward strategies cannot be used.
The protocol analysis of Rips (1989) and Johnson-Laird and Byrne (1990) suggests that
subjects begin solution by assuming the first assertor is a knight. Byrne and Handley
(1997) explored performance when a starting supposition is supplied. They conclude
that when backward strategies could be used (i.e., when the initial forward inference
results in a contradiction), there was no effect of being given the starting supposition.
When backward strategies could not be used, people made more correct inferences
when the given supposition was accurate than when it was inaccurate.
Finding 3. Elimination of the suppositional status of an individual does NOT
reduce the difficulty of the problem.
Byrne and Handley (1997) considered Finding 2 further by exploring whether it was the
availability of backward inferences or the possibility of eliminating a supposition by
contradiction that reduced problem difficulty. That is, if we suppose that Tom is a
knight and through testing this assumption arrive at the conclusion that Tom is a knave
(a contradiction), we can deduce that our original supposition was incorrect. Their
findings suggest that the elimination of the suppositional status of a speaker in this way
does not improve performance. Byrne and Handley conclude that it is the availability of
backward inference strategies that makes items easier rather than elimination of
suppositions through a contradiction. Each of these key findings is considered in the
application of MARC.
3.2 Relational Complexity Analysis
The general notation that we use to communicate the complexity analysis was described
in Chapter 2 (Section 2.2.4). This representational form has obvious surface similarities
to the deduction rule approach and less obvious links with the representation of mental
models. We believe our eclectic approach better facilitates representation of both
explicit and implicit models. Not only does it allow us to represent the processes that
need to be considered explicitly in making a decision, but by facilitating representation
of chunking, much of the content of what Johnson-Laird, Byrne, and Schaeken (1992)
would call implicit models, can also be represented. In addition to the general principles
it is necessary to consider the task specific knowledge and the relevant propositions
(Table 3.1A) and rules of the knight-knave island (Table 3.1B). From here we illustrate
how these elements can be integrated and how the relational complexity metric can be
applied to predict relative difficulty.
Table 3.1
Representation of relational components (A), and Rules of the Knights-Knaves task (B).

A. Relational Component                      Propositional Representation           RCa
C1  P is a knight                            kt(P)                                  Unary
C2  P is a knave                             kv(P)                                  Unary
C3  Conjunction; p and q                     AND(p, q)                              Binary
C4  Disjunction; p or q                      OR(p, q)                               Binary
C5  P says x                                 SAYS(P, x)                             Binary
C6  Implication; s implies t                 → (s, t)                               Binary
C7  Negation; x is not the case              NOT(x)                                 Unary
C8  Contradiction; y contradicts z           CONTRADICT(y, z)                       Binary

B. Rules of the Island                       Propositional Representation           RCa
R1  If P is not a knight, he is a knave      NOT(kt(P)) → kv(P)                     Binary
R2  If P is not a knave, he is a knight      NOT(kv(P)) → kt(P)                     Binary
R3  If P’s statement is true, he is a knight AND(SAYS(P, x), TRUE(x)) → kt(P)b      Quaternaryc
R4  If P’s statement is false, he is a knave AND(SAYS(P, x), FALSE(x)) → kv(P)b     Quaternaryc
R5  If P is a knight, his statement is true  AND(kt(P), SAYS(P, x)) → TRUE(x)b      Quaternaryc
R6  If P is a knave, his statement is false  AND(kv(P), SAYS(P, x)) → FALSE(x)b     Quaternaryc

a prima facie relational complexity without consideration of chunking; b TRUE(x) can be represented as x; FALSE(x) can be represented as NOT(x); c The effective relational complexity of R3-R6 can be directly reduced by chunking principles S1-S5
3.2.1 Knowledge Required
Table 3.1A lists the basic components of the rules that are required for knight-knave
problems13. Table 3.1B takes the components from Table 3.1A and combines them to
form the rules of the island. An understanding of these rules is effectively all that is
required to solve the knight-knave problems we have explored.
The next issue to explore is that of segmentation and chunking. As we have indicated
already, segmentation is dependent on the strategy chosen by the individual to solve the
13 In Table 3.1A (C6), “implication” is represented as a binary relation, →(s, t), which is read, “s implies t”. This is structurally equivalent to the more convenient representation, s → t, used in Table 3.1B and the rest of the analysis.
problem. Dominant strategies identified by Rips (1989) and Byrne and Handley (1993;
1997; Byrne et al., 1995) are incorporated to provide a starting point. There are five
general chunking principles (S1 to S5) that we use to determine the effective relational
complexity of the knight-knave items. The example provided with each principle
demonstrates how the information is combined to reduce the initial number of elements
to be processed. Consistent with the notation outlined in Chapter 2 Section 2.2.4, the
elements that contribute to the relational complexity count are underlined (Note: #A =
Number of arguments; RC = Relational Complexity).
S1: Chunking assertion with its truth value SAYS(x)
Example:
P makes a true statement
Unchunked:
AND(SAYS(P, x), TRUE(x)) → kt(P)
(P says x, and x is true implies P is a knight)
[#A = 4]
Chunked:
SAYS(P, true(x)) → kt(P)
(P makes a true assertion implying he is a knight)
[RC = 3]
Principle S1 is an instantiation of the common arguments principle (Theorem 3). That
is, the segments SAYS(P, x) and TRUE(x) are combined and the representation reduces
to SAYS(P, true(x)) → kt(P). This principle determines the effective complexity of
rules R3 and R4 (Table 3.1B).
S2: Chunking the assumed status of speaker &-SAYS
Example:
P says, “Q is a knight”;
Assume P is a knave.
Unchunked:
AND(kv(P), SAYS(P, kt(Q))) → kv(Q)
[#A = 4]
(P is a knave and P says, “Q is a knight”, implies that Q is a knave)
Chunked:
&-SAYS(kv(P), kt(Q)) → kv(Q)
[RC = 3]
(P is a knave and says, “Q is a knight”, implies that Q is a knave)
The chunk &-SAYS is also an application of the common arguments principle
(Theorem 3). The identity of the inhabitant making the assertion (“P says”) can be
chunked with the reasoner’s supposition about the status of the inhabitant (“P is a
knave”) because the status of the asserter and the fact he has made an assertion are not
needed separately to solve the problem. We read the chunked example as: “P is a knave
and says, ‘Q is a knight’ which implies Q is a knave”. The effective relational
complexity (see Theorem 1, Section 2.2.3) of rules R5 and R6 in Table 3.1B can be
reduced using the S2 principle.
S3: Chunking elements of a conjunction
AND(a/b/… /n)
P says, “P is a knave and Q is a knave”;
Assume P is a knight
Unchunked (incorporating S1):
&-SAYS(kt(P), AND(kv(P), kv(Q))) → AND(kv(P), kv(Q))
[#A = 5]
Chunked:
&-SAYS(kt(P), AND(kv(P), kv(Q))) → AND(kv(P), kv(Q))
[RC = 3]
(P is a knight and says, “P is a knave and Q is a knave”, implies P is a knave and
Q is a knave)
S3 also incorporates Theorem 3 and works on the assumption that representing a
compound statement only generates additional load when the components of the
statement need to be considered separately to make the inference. That is, the full
cognitive demand of a conjunction is generated in constructing the implication. This
will become more obvious when we consider embedded relations in S5 below.
The argument for S3 also applies to S4 where the inclusive disjunction is chunked into a
single element. Once again, the full cognitive load is not generated until the implication
of the disjunction is considered and integrated with other information deduced from the
problem so far.
S4: Chunking elements of an inclusive disjunction OR(a/b/… /n)
P says, “P is a knave or Q is a knight”;
Assume P is a knight
Unchunked (incorporating S1):
&-SAYS(kt(P), OR(kv(P), kt(Q))) → OR(kv(P), kt(Q))
[#A = 5 ]
Chunked:
&-SAYS(kt(P), OR(kv(P), kt(Q))) → OR(kv(P), kt(Q))
[RC = 3 ]
(P is a knight and says, “P is a knave or Q is a knight”, implies P is a knave or Q
is a knight)
Johnson-Laird, Byrne, and Schaeken (1994, p. 736) deal with the type of reasoning
involved with S3 and S4 in a similar way. They propose that given,
If S or X or B or C or K or R or N or L or D or F then not both I and Q
X
∴ not both I and Q
reasoners do some rearranging and construct a deduction of the form,
X
If X or … then not both I and Q
∴ not both I and Q
That is, reasoners do not need to construct a psychologically implausible number of
models as suggested by O’Brien, Braine, and Yang (1994). This idea is consistent with
our analysis, and is accounted for by the principles of chunking and segmentation. That
is, RC theory proposes that the disjunctive components can be chunked in the above
example (as represented by Johnson-Laird et al., 1994, with the ellipsis) because the
components are not needed separately to make the current decision (Theorem 3). If they
were, then this chunking would not be possible.
The analysis of S5 demonstrates one way to deal with embedded relations and provides
the clearest instantiation of Theorem 3 discussed so far. The representation of the
problem and supposition (left of the implication sign, → ) uses S2 and S4. Processing
the implication of the supposition requires the representation of two conjunctions
embedded in a disjunction (1). This can also be represented as a disjunction embedded
in a conjunction (2). Either way, the interpretation is the same. The implication
contributes an additional element of complexity because the status of Q can not be
determined uniquely with the information processed so far. Hence, the main
components of the disjunction in (1) and the conjunction in (2) cannot be chunked –
each is necessary in order to make the inference, and therefore each contributes an
element of complexity.
S5: Chunking elements with embedded relations e.g., AND(a, OR(b, c))
P says, “P is a knight or Q is a knight”;
Assume P is a knight
Unchunked (incorporating S2):
&-SAYS(kt(P), OR(kt(P), kt(Q))) →
OR(AND(kt(P), kt(Q)), AND(kt(P),kv(Q))) [#A = 7 ]
Chunked (incorporating S2, S3, and S4):
(1) &-SAYS(kt(P), OR(kt(P), kt(Q))) →
OR(AND(kt(P), kt(Q)), AND(kt(P), kv(Q)))
(P is a knight and says, “P is a knight or Q is a knight”, implies P is a knight and
Q is a knight, or P is a knight and Q is a knave)
(2) &-SAYS(kt(P), OR(kt(P), kt(Q))) →
AND(kt(P), OR(kt(Q), kv(Q)))
[RC = 4 ]
(P is a knight and says, “P is a knight or Q is a knight”, implies P is a knight
and, Q is a knight or Q is a knave)
Note: (1) & (2) are equivalent representations
It is instructive to consider a complete example of how we have applied MARC to an
actual problem. Let us consider the example problem given earlier, which is repeated
here using the letters A and B rather than Tom and Beth:
A says, “I am a knave and B is a knave”.
B says, “A is a knave”.
Each proposition (P1 and P2 below) can be represented using the information in Table
3.1A as follows:
P1:  SAYS(A, AND(kv(A), kv(B)))   (1)
P2:  SAYS(B, kv(A))   (2)
We can represent a reasoner's supposition that A is a knight as follows:
AND(kt(A), SAYS(A, AND(kv(A), kv(B))))   (3)
The prima facie relational complexity of (3) is determined by counting the underlined
elements. By this method, representing Proposition 1 requires processing a quaternary
relation. The maximum cognitive load is not fully realised until some work is done on
the representation (i.e., when the implication is considered). This would result in a
further increase in the demand on available resources. However, the principles of
segmentation and conceptual chunking can be employed to reduce this demand. Using
S2 (&-SAYS), we can chunk the supposition (“A is a knight”) with the assertor (A).
Further, using S3 (AND(a/b/…/n)) we can chunk the components of the conjunction.
With these principles applied to (3) we get an approximation of the effective cognitive
load required to represent and process the first premise. This is written as follows:
&-SAYS(kt(A), AND(kv(A), kv(B)))   (4)
The next step in the strategy is to follow up on the implications of this assumption and
test for consistency. The complete solution of the problem following the protocol
analysis of Rips (1989) and what we know of the use of backward strategies (Byrne &
Handley, 1997) is as follows:
P1:  SAYS(A, AND(kv(A), kv(B)))
P2:  SAYS(B, kv(A))
Suppose kt(A) in P1.
&-SAYS(kt(A), AND(kv(A), kv(B))) → AND(kv(A), kv(B))   (5)
CONTRADICT(kt(A), kv(A)) → NOT(kt(A))   (6)
Suppose kv(A) in P2 (a backward inference + S1)
SAYS(B, TRUE(kv(A))) → kt(B)   (7)
So we can conclude AND(kv(A), kt(B)); that is, A is a knave and B is a knight.
The effective relational complexity of this item is ternary since the most complex
process involved entails a ternary relation (Theorem 2). The problem can be segmented
into three processes (or relational steps) - two that entail processing a ternary relation,
(5) and (7), and one that entails a binary relation (6).
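As an informal check on this worked example, the following sketch (our own illustrative Python; it is neither MARC nor Rips' production system) simply enumerates the four possible knight/knave assignments and retains those consistent with the two statements. It returns A as a knave and B as a knight, in agreement with the analysis above.

from itertools import product

def consistent(assignment, statements):
    # assignment maps each speaker to True (knight) or False (knave);
    # statements maps each speaker to a function giving the truth value of what they said.
    return all(assignment[p] == said(assignment) for p, said in statements.items())

# A says, "A is a knave and B is a knave"; B says, "A is a knave".
statements = {
    "A": lambda w: (not w["A"]) and (not w["B"]),
    "B": lambda w: not w["A"],
}

solutions = [w for w in ({"A": a, "B": b} for a, b in product([True, False], repeat=2))
             if consistent(w, statements)]
print(solutions)   # [{'A': False, 'B': True}] -> A is a knave, B is a knight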
The following experiment was conducted to test the application of MARC to a selection
of knight-knave problems. The key prediction is that problems of higher complexity
should be associated with more errors and longer response times (i.e., the comparison of
means prediction).
3.3 Method
3.3.1 Problems
Solutions of knight-knave problems are either (i) determined in one or more of the
speakers (e.g., A says, "I am a knave and B is a knight"), (ii) undetermined in one or
more of the speakers (e.g., A says, "I am a knight"), or (iii) paradoxical in that a
statement results in a speaker being neither a knight nor a knave (e.g., A says, "I am a
knave"). For our purposes we have limited ourselves to problems of the first type.
Paradoxical problems, while interesting at a philosophical level, are likely to lead to
unnecessary confusion since technically such statements would not be made by either a
knight or a knave and therefore may be seen by reasoners as violating the rules of the
island. With this in mind, a test of five ternary problems and five quaternary problems
was generated. Two presentation formats were developed to administer the test – a
paper and pencil format and a computer-administered and scored format.
3.3.2 Practice
To be confident that the relational complexity analysis is appropriate we needed to be
sure that participants both understood the special nature of the knights and knaves world
and were aware of the implications of testing the veracity of compound statements. We
developed an extensive introduction and a series of practice exercises that preceded
administration of the test items. The content of the introduction and practice was
essentially the same for each version of the task, with some minor presentation
differences depending on format. Unless otherwise indicated, correct/incorrect feedback
was provided on the computer-administered version only.
Introduction: A written description of the world of knights and knaves was provided
followed by worked examples of testing the veracity of conjunctive and disjunctive
statements entailing the four possible colour combinations of two squares and two
colours. Each combination was accompanied by an explanation of why the statement
was either a true or false description of the squares (see Figure 3.2A).
Section 1 presented participants with two coloured squares and asked them to indicate
whether each of the eight (2 squares × 2 colours × 2 connectives) possible compound
statements was consistent with the statement of a knight or a knave (see Figure 3.2B).
Section 2 presented participants with four statements about two coloured squares made
by an individual whose status is known. The task was to indicate whether each of the
four (2 squares × 2 colours) potential outcomes is consistent or inconsistent with the
statement and status of the individual (see Figure 3.2C). The solution for the first item
was provided and subjects were instructed to study this worked example carefully.
Section 3 presented three knight-knave problems that emphasised understanding of the
rules of the island. The problems were presented in test format and a detailed
explanation of the correct answer was provided for the first problem.
[Figure 3.2 presents example practice items. A. Introduction: the “OR” rule is illustrated with the statement “Square 1 is white or square 2 is black” shown against two pairs of coloured squares, one labelled True (“Because only one component needs to be true for the whole statement to be true”) and one labelled False (“Because both components need to be false for the whole statement to be false”). B. Section 1: two coloured squares are shown and each of Damian’s statements (e.g., “Square 1 is black and square 2 is white”; “Square 1 is white or square 2 is white”) is judged to be the statement of a knight or a knave. C. Section 2: given “Damian is a knight and says, ‘Square 1 is white or square 2 is black’”, each of four square-colour combinations is judged consistent or inconsistent.]

Figure 3.2 Examples of practice phase items from A) the introduction, B) Section 1, and C) Section 2, of the Knight-Knave task
3.3.3 Test Problems
The five ternary and five quaternary problems are provided in Table 3.2. The appendix
to chapter three has the complete complexity analyses. Three response options were
available for each problem. When the probe question was “What is the status of B
(A)?”, the response options were “Knight”, “Knave”, and “Impossible to tell”. When
the probe was, “Is inhabitant B (A) a knave/knight”, the response options were “Yes”,
“No”, and “Impossible to tell”.
The computer-administered version of the task recorded the response, the accuracy of
the response, and the response latency. A crude measure of response latency in the
paper and pencil version of the task was provided by having participants record the start
and end time for each problem on their answer sheet.
Table 3.2
Knight-knave problems by relational complexity, relational stepsa, and number of rulesb

Ternary Items

Item 3.1 (RCsteps = 1; Rules = 8)
B says, "A is a knight and B is a knave."
If we know B is a knave, what is the status of A?

Item 3.2 (RCsteps = 3; Rules = 12)
A says, "A is a knight and B is a knave"
B says, "A is a knight"
Is inhabitant B a knight?

Item 3.3 (RCsteps = 2; Rules = 5)
A says, "A is a knight"
B says, "A is a knight"
If A is a knave, what is the status of B?

Item 3.4 (RCsteps = 2; Rules = 14)
A says, "A is a knight"
B says, "A is a knave and B is a knave"
Is inhabitant B a knight?

Item 3.5 (RCsteps = 2; Rules = 11)
A says, "A is a knight"
B says, "A is a knave and B is a knight"
If A is a knight, what is the status of B?

Quaternary Items

Item 4.1 (RCsteps = 2; Rules = 7)
A says, "A is a knave or B is a knave"
Is inhabitant B a knave?

Item 4.2 (RCsteps = 3; Rules = 12)
A says, "B is a knave"
B says, "A is a knight or B is a knight"
Is inhabitant A a knave?

Item 4.3 (RCsteps = 2; Rules = 11)
A says, "A is a knave or B is a knight"
B says, "B is a knight"
Is inhabitant B a knave?

Item 4.4 (RCsteps = 2; Rules = 9)
A says, "A is a knave and B is a knight"
Is inhabitant B a knave?

Item 4.5 (RCsteps = 3; Rules = 9)
A says, "A is a knave and B is a knave"
Is inhabitant B a knave?

a Number of relational steps in complexity analysis (RCsteps); b Number of processing steps using exhaustive rule-based strategy (Rules)
3.3.4 Participants
Thirty-six female and 17 male first-year psychology students (mean age = 18.51 yrs)
completed the paper and pencil version. An additional 43 female and 7 male students
(mean age = 19.50 yrs) completed the computer-administered version. The students
were recruited from the University of Queensland’s first-year psychology student pool
during the same semester. Participation entitled the student to a 1% credit in a
nominated first-year subject.
3.4 Procedure
Paper and Pencil Format: Participants were tested in groups of approximately eight as
part of a larger testing session14. Each student was seated at an IBM 486 computer
terminal displaying a digital clock on the 14” SVGA colour monitor. The clock was
provided to allow participants to record the time they began and completed each
problem. Each participant was given a test booklet containing one of three random
orders of the 10 problems15. Participants were instructed to read the instructions and to
work through the practice carefully. They were also told to use the space provided on
the test booklet for any working.
Computer-Administered Format: As for the paper and pencil format, the students were
tested in groups of approximately eight as part of a larger testing session (see footnote
14). Each participant was seated at an IBM 486 computer terminal and the task was
displayed on a 14” SVGA colour monitor. Presentation of the ten knight-knave test
problems was randomised by the computer. Students in this format were asked not to
write anything down but to do all the working in their head. In both formats, all
participants were instructed to work as accurately and quickly as possible. All subjects
completed the task within the one hour allotted.
3.5 Results & Discussion
3.5.1 Practice
The primary purpose of analysing the practice data was to give some indication of the
extent to which subjects understood the nature of the task.
14. The average time to complete the task was 25 – 35 minutes. Although other computer-administered tasks were piloted in the same session, the knight-knave task was always administered first.
15. Due to a problem in compiling the booklets, a subset of subjects did not complete two ternary items (3.1 and 3.5) and one quaternary item (4.1). The measure of accuracy employed is therefore proportion correct rather than total correct score.
The overall mean proportion correct for the paper and pencil format was 90.7% (SD = .13); for the computer-administered format it was 83.7% (SD = .15). This difference was statistically reliable,
F(1, 100) = 4.41, p = .038, but qualified by a significant interaction between format and
practice section, F(2, 200) = 9.92, p < .001. Follow up tests of the interaction indicated
that the difference between formats was reliable and practical for Section 3 practice
only, F(1, 100) = 13.03, p < .001 (Ms = .89 & .68; SDs = .23 & .33)16. A high level of
performance on Sections 1 and 2, and no differences between formats, would suggest that
participants understood the basic use of the propositional connectives AND and OR in
the context of suppositional reasoning irrespective of presentation format. The
significant decrement in the computer-administered condition of Section 3 practice
suggests that performance on actual knight-knave problems is susceptible to
presentation format.
3.5.2 Test Problems
The mean proportion correct for each of the five ternary and five quaternary problems
and their respective composite scores17 are summarised in Table 3.3. A Complexity
(ternary vs. quaternary) × Format (paper vs. computer) repeated-measures/between-subjects ANOVA on these composite scores indicated, as predicted, a significant main-effect for complexity, F(1, 101) = 80.41, p < .001. Ternary problems were significantly
easier than quaternary problems (Ms = .76 & .46, respectively). There was also a
significant main-effect for Format, F(1, 101) = 11.91, p = .001. Overall, the paper and
pencil format was easier than the computer-administered format (Ms = .67 & .55,
respectively). The interaction between complexity and test format was not significant,
F(1,101) = 1.93, p = .168.
16. Section 1 scores on the paper and pencil (M = .995, SD = .02) and the computer-administered (M = .97, SD = .07) formats differed significantly, F(1, 100) = 5.24, p = .024. The practicality of this is questionable given the ceiling effect and subsequent attenuation of variance. There was no difference between Section 2 scores on the paper and pencil (M = .80, SD = .31) and computer-administered (M = .84, SD = .18) formats.
17. A Rasch analysis of knight-knave problems is provided in Chapter 6.
Table 3.3
Problem proportion correct and mean correct response times and standard deviations (in parentheses)

                   Proportion Correct a                             Correct Response Time
                   Paper & Pencil     Computer-Administered        Paper & Pencil         Computer-Administered
Item               n      M           n      M                     n      M (SD)           n      M (SD)
Ternary Problems
3.1                17     0.82        50     0.76                  14     38.86 (27.96)    38     18.55 (10.49)
3.2                53     0.77        50     0.50                  40     76.77 (55.37)    25     33.88 (28.23)
3.3                53     0.91        50     0.80                  48     37.71 (20.46)    40     15.17 (07.90)
3.4                53     0.68        50     0.70                  36     65.81 (48.38)    35     30.75 (20.84)
3.5                17     1.00        50     0.86                  17     29.41 (25.50)    43     20.16 (10.45)
Mean               53     0.80 (0.23) 50     0.72 (0.23)           53     53.28 (29.02)    49     22.13 (10.46)
Quaternary Problems
4.1                17     0.41        50     0.22                   7     55.86 (42.82)    11     27.00 (35.01)
4.2                53     0.57        50     0.42                  30     84.23 (56.52)    21     41.96 (31.68)
4.3                52     0.40        50     0.48                  21     63.05 (61.18)    24     29.85 (38.82)
4.4                53     0.58        50     0.32                  31     50.87 (31.74)    16     26.18 (18.65)
4.5                53     0.62        50     0.42                  33     35.97 (26.10)    21     29.01 (23.06)
Mean               53     0.54 (0.30) 50     0.37 (0.22)           49     59.41 (37.45)    46     33.14 (28.17)

a Mean proportion correct derived by determining the proportion of ternary and quaternary items answered correctly respectively for each subject and then averaging across subjects.
Response times for correctly answered items were aggregated within each level of
complexity for each subject (see Table 3.3) and a similar analysis was conducted with
these composite scores as the dependent measure. The time to correctly respond to
ternary problems (M = 36.70s) was significantly shorter than the time to respond to
quaternary problems (M = 46.28s), F(1, 93) = 10.17, p = .002. The main-effect for
format was also significant, F(1, 93) = 31.34, p < .001. The time to respond correctly to
problems presented in the paper and pencil format was significantly longer than in the
computer-administered format (Ms = 52.21s & 27.76s, respectively). The interaction
between format and complexity was not significant, F<1. Analysis of overall response
times (regardless of accuracy) revealed the same pattern of results.
There is no overlap in the difficulty of individual items across levels of complexity; however, inspection of Table 3.3 reveals some degree of variation in accuracy and
response times within levels of complexity. This type of variation suggests that
although our manipulation is convincing, factors other than relational complexity are
also contributing to performance. To explore this further we first consider the
possibility of individual differences in a speed-accuracy trade-off. This is followed by
consideration of how an alternative model of processing based on serial task demands
might account for item difficulty and mean correct response time.
3.5.3 Speed-Accuracy Trade-Off
The work on resource theory by Norman and Bobrow (1975) and others (e.g., Fisk et
al., 1986; Kahneman, 1973; Wickens, 1980, 1991) might suggest that as problem
difficulty increases, the likelihood of a speed-accuracy trade-off also increases. That is, on quaternary problems where processing resources are heavily taxed, a speed-accuracy trade-off is more likely to occur. This effect is likely to be exacerbated in
poorer performing individuals. On the less complex and relationally demanding ternary
items, sufficient resources are more likely to be available such that increasing the time
spent on task would not substantially improve performance. A partial test of this
hypothesis would be a higher correlation between accuracy and response times for
quaternary level items than ternary items. However, the relationship between the
number of items answered correctly at a given level of complexity and mean correct
response time shows considerable heteroscedasticity. This spread of variation is of
theoretical interest and would not be well addressed in the strict linear approach of a
standard correlation analysis. Comparison of response times at each level of accuracy is
also problematic due to uneven and low cell sizes. So how are we to consider the
influence of individual differences in a speed-accuracy trade-off? A partial solution is to
adopt some level of aggregation based on overall performance. To this end, participants
were classified in terms of the overall number of problems from each level of
complexity answered correctly. Individuals who correctly answered three or more
ternary problems were placed in one group (high ternary scorers; n = 43) and those who
answered less than three problems correctly were placed in another group (low ternary
scorers; n = 6). The same criterion was used to classify participants into high and low
groups for the quaternary problems (n = 15 & 31 respectively).
Independent t-tests indicated that for the ternary problems, there was no significant
difference in correct response times between low scoring participants (who answered
less than three problems correctly, Ms = 20.98s, SD = 15.55s) and high scoring
participants (who answered three or more problems, Ms = 22.29s, SD = 9.80s), t(47) = .29, ns). For the quaternary problems, Levene’s test for equality of variances indicated
that the variation in correct response times for low scorers was substantially larger than
the variation in correct response time for high scorers, F(1, 47) = 8.03, p = .007 (SD’s =
32.63 & 10.35s respectively). Adjusting for the unequal variances, high scoring
participants had significantly shorter response times than low scoring participants (Ms =
23.06 & 38.02s respectively; adjusting for unequal variances t(40.1) = 2.32, p = .025).
Further analysis revealed that this effect was not due to differences in understanding the
rules. The mean accuracy over all practice items for low scorers was not significantly
different from high scorers, t(44)= -.62, p = .54 (Ms = .82 & .85, respectively). Also,
there were no differences between the low quaternary and high quaternary scoring
groups in terms of the three measures of performance on ternary problems; mean
proportion correct, t(44)= -.67, p = .52 (Ms = .73 & .77, respectively); overall response
time, t(44) = .34, p = .73 (Ms = 24.06 & 22.91s, respectively); and correct response
time, t(44) = .47, p = .64 (Ms = 22.91 & 21.31s, respectively).
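The same style of comparison can be reproduced in a few lines with standard statistical software. The sketch below, which uses scipy and made-up response-time arrays rather than the thesis data, runs Levene's test for equality of variances and then a t-test that does not assume equal variances (Welch's adjustment), mirroring the analysis reported above.

from scipy import stats

# Placeholder correct response times (s); these are NOT the thesis data.
low_rt = [38.0, 55.2, 21.7, 64.9, 40.3, 72.1, 18.4, 51.0]
high_rt = [22.1, 25.4, 19.8, 23.6, 24.9, 21.2, 26.3, 20.7]

# Levene's test for equality of variances between the two groups ...
lev_stat, lev_p = stats.levene(low_rt, high_rt)

# ... and a t-test that does not assume equal variances (Welch's correction).
t_stat, t_p = stats.ttest_ind(low_rt, high_rt, equal_var=False)
print(lev_stat, lev_p, t_stat, t_p)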
3.5.4 Alternative Accounts
The influence of serial processing in the knight-knave task was formally explored in
Rips’ (1989) conceptualisation of reasoning difficulty. The number of processing steps
or rules required to solve each of our knight-knave problems can be estimated using the
exhaustive strategy summarized in Figure 3.1 (see the Appendix to Chapter 3 for these
analyses). MARC also has the functionality to allow for the quantification of serial
processing load within levels of complexity for a given strategy. Through judicious
segmentation and chunking using the principles outlined earlier, high parallel
processing demand can be reduced through the serial integration of a number of less
complex relations. For instance, the example knight-knave problem outlined in the
introduction involved three relational processing steps. Empirical support for the
usefulness of this functionality is now beginning to accumulate (e.g., Birney & Halford,
2000b, 2000c, 2000a).
The following analysis considers how estimates of difficulty using the rule-based
approach compare with the RC approach in modelling the characteristics of the current
set of problems. The classifications of relational steps (RCsteps) and number of rules
(Rules) are provided for each problem in Table 3.2.
3.5.4.1 Item-based correlation analysis.
Table 3.4A presents the relationship between item difficulty and the classifications
made by RC theory (i.e., RC – complexity classification and, RCsteps – the number of
relational steps) and the exhaustive rule based approach (Rules – the number of steps
required to instantiate the exhaustive strategy). Table 3.4B presents the same analysis
using correct response time as the dependent variable in the regression analysis (the
same pattern of results is obtained if overall response time is used instead of correct
response time). Presentation format was entered as a covariate in all analyses to follow.
It is instructive to first consider the simple correlations between the variables from the
respective models. The partial correlation (controlling for presentation format) between
item difficulty and mean correct response time was significant, pr(17) = -.48, p = .039,
indicating that as item difficulty increased, the mean time to respond correctly also
tended to increase. The RC classification of the items was not related to our estimation
of the number of deduction-type rules, pr(17) = -.04, ns. Neither was RC associated
with the number of relational steps (RCsteps), pr(17) = .33, p = .16 (although the trend
suggests that more complex problems require more processing steps). These findings
suggest that there is some independence between the problem’s RC, which is a measure
of parallel processing demand, and those characteristics more traditionally associated
with serial demand. Both the RCsteps and Rules measures are based on serial
processing and we would expect the partial correlation to reflect this commonality. This
was indeed the case. A marginal association between the RC classification and number
of rules was observed, pr(17) = .42, p = .077.
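For readers unfamiliar with the procedure, a partial correlation of this kind can be obtained by regressing each variable on the covariate (here, presentation format) and correlating the residuals. The sketch below illustrates this with numpy on made-up vectors; the variable names are ours and none of the values are the item data.

import numpy as np

def partial_corr(x, y, covariate):
    # Correlate the residuals of x and y after regressing each on the covariate.
    X = np.column_stack([np.ones_like(covariate), covariate])
    rx = x - X @ np.linalg.lstsq(X, x, rcond=None)[0]
    ry = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    return np.corrcoef(rx, ry)[0, 1]

# Illustrative vectors only (one entry per item-by-format cell).
rng = np.random.default_rng(0)
fmt = np.repeat([0.0, 1.0], 10)            # presentation format
difficulty = rng.normal(size=20)           # stand-in for item difficulty
rt = 0.5 * difficulty + 0.3 * fmt + rng.normal(size=20)
print(partial_corr(difficulty, rt, fmt))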
Table 3.4
Item-based analysis of A) difficulty and B) correct response time based on predictions made by relational complexity theory and deduction-rule principles

A. Difficulty
                               Partial Correlations a            Regression Coefficients
                               Difficulty   RC      RCsteps      β       t       sig    pr      sr
Presentation Format                                              -0.31   -2.41   0.03   -0.53   -0.31
Relational Complexity (RC)     -0.85 **                          -0.82   -5.86   0.00   -0.83   -0.75
Relational Steps (RCsteps)     -0.31        0.33                  0.02    0.13   0.89    0.03    0.02
Rules - Exhaustive             -0.06       -0.04    0.42 b       -0.10   -0.69   0.50   -0.18   -0.09
R2 = .75 (adj R2 = .69), F(1, 15) = 11.40, p < .001

B. Response Times
                               Partial Correlations a            Regression Coefficients
                               Correct RT   RC      RCsteps      β       t       sig    pr      sr
Presentation Format                                              -0.70   -5.25   0.00   -0.80   -0.70
Relational Complexity (RC)      0.28                              0.16    1.13   0.28    0.28    0.15
Relational Steps (RCsteps)      0.51 *      0.33                  0.16    1.00   0.34    0.25    0.13
Rules - Exhaustive              0.59 *     -0.04    0.42 b        0.36    2.38   0.03    0.52    0.32
R2 = .73 (adj R2 = .66), F(1, 15) = 10.23, p < .001

* p < .05; ** p < .001; pr = partial correlation; sr = semi-partial correlation; a partialled variable = presentation format (df = 17); b p = .077
Table 3.4A shows that relational complexity is by far the best first-order predictor of
item difficulty, pr(17) = -.85, p <.001. Further, when relational complexity, number of
relational steps, and number of deduction rules are entered into a standard regression
analysis, relational complexity remains the only significant predictor, β = -.82, t(15) =
-5.86, p <.001. The analysis was repeated on correct response times, again controlling
for differences in presentation format (Table 3.4B). While the number of deduction
rules was unable to account for item difficulty, it was a significant predictor of correct
response time, pr(17) = .59, p = .008. The RC classification per se does not provide an
adequate account of the variation in correct response times across items, pr(17) = .28, p
= .24; however, the number of relational steps was a significant first-order predictor, pr(17) = .51, p = .027. Further, the standard regression analysis shows that controlling for
differences in presentation format and RC, the rule-based classification (rules)
subsumes the variation accounted for by the number of relational steps, and provides a
significant unique contribution to the prediction of correct response time, β = .36, t(15)
= 2.38, p = .03.
3.6 General Discussion
Consistent with the overall prediction, problems entailing a quaternary relation were
associated with significantly more errors and longer response times than problems in
which the most complex process entailed a ternary relation. This was true regardless of
the presentation format and suggests that parallel processing is a significant factor in
determining the cognitive demand of knight-knave problems. The effect of the
complexity manipulation is convincing – no overlap in the difficulty of individual items
across levels of complexity occurred in either version of the task. Within levels of
complexity some variation in problem difficulty (accuracy) and response times was
observed. There is statistical evidence that this variation is 1) consistent with predicted
individual differences in the use of a speed-accuracy trade-off, and 2) partially a
function of variations in the serial processing demand of each problem. The RC theory
accommodates these findings without modification. We consider this evidence further
within the bounds of the RC theory shortly, but now briefly return to consider the
differences between presentation formats.
3.6.1 Task Presentation Format
The paper and pencil version of the test resulted in reliably more accurate performance
and longer response times in both ternary and quaternary processes than the computer-administered format. These effects can be accounted for by the ability to use pencil and
paper as an external memory aid to facilitate serial processing. Rips (1989) reported that
his subjects found keeping track of the series of processing steps in the knight-knave
task was a significant source of difficulty. Being able to keep an external record of the
processing would therefore be expected to reduce the storage load associated with serial
processing and therefore facilitate solution. We would also expect an increase in the
response time to allow for the use of this memory aid. The fact that the interaction
between format and complexity was not significant indicates that if it is a reduction in
serial processing demand that facilitates solution, it is independent of the relational
complexity of the problem. This gives some preliminary indications that serial
processing is a significant determinant of the cognitive demand in the knight-knave
task.
3.6.2 Processing Capacity and a Speed-Accuracy Trade-Off
The post hoc analysis of correct response times indicates that subjects who answered more than half the quaternary problems correctly (high scorers) had
significantly shorter correct response times than those who did not (low scorers). This
effect was not related to differences in understanding the rules of the island, as there
were no differences between the groups in terms of performance on the practice
problems or the ternary problems. The effect did not hold for solution of the ternary
problems.
It would seem that the low scoring students performed the same as the higher scoring
students in all respects except in terms of the number of quaternary items they answered
correctly and how long it took them to respond. We propose that these individuals
differed in terms of the resources available for solution on the most complex items only.
That is, even though low scorers eventually answered correctly, it was taking them
longer on average to represent the problem, integrate the necessary information, and
make a response. Although this provides no definitive assessment of the variation in
item difficulty and response times within levels, it does suggest that individual
differences in the use of a speed-accuracy strategy exist. The attempt to rationalise
limited processing resources has the potential to introduce some degree of variation in
item difficulty within levels of complexity. What the analysis does reveal is that
relational complexity theory, at least as it is applied to the knight-knave task, has the
potential to account for processing capacity differences in high functioning individuals
(i.e., university students).
3.6.3 Serial Processing: An Alternative Account
Finally, although it is clear that the relational complexity analysis provides a good
account of item difficulty when aggregating across items of a given level of complexity,
there is sufficient within level variation to suggest that performance on these tasks is
more complex. Recall that central to RC theory is the idea of parallel processing – that
the information needed to make a decision is only realised when the appropriate
elements are integrated at one point in time. The relationship that exists between the
elements as a result of this integration defines what we mean by a relation. For a given
strategy, this relationship cannot be decomposed without degrading the nature of the
information it provides.
While it might seem that such a conceptualisation of reasoning is philosophically
similar to the mental models account, the notion of parallel integration in the RC theory
is explicit and highly formalized. The RC theory does not ignore or argue against the
notion of serial processing. In fact a measure of serial processing is obtained within the
parameters of the RC theory without modification (Birney & Halford, 2000b). Here we
have called this the number of “relational steps”. When measures of serial processing
demand based on the relational complexity theory (relational steps) and the deduction-rule approach (rules) are considered jointly with the parallel measure of RC, the only
significant predictor of item difficulty is this parallel measure. Both measures of serial
processing are important predictors of correct response time. When all variables are
taken together, the number of rules remains the only significant unique predictor. So
parallel processing demand as measured by RC appears to be a determining factor in
accounting for errors in the knight-knave task, whereas serial processing seems to
predict response times. The interesting outcome here is that the relational complexity
metric seems to provide a means to quantify the influence of both these types of
processing demands.
3.7 Conclusion
The success of our complexity analysis of the knight-knave task is based to a large
extent on the appropriateness of the analysis of the strategies and processes involved.
Processing in the knight-knave task has been explored using deduction-rules (Rips,
1989, 1990) and mental models (Johnson-Laird & Byrne, 1990; Byrne & Handley,
1993, 1997; Byrne et al., 1995; Schroyens et al., 1999). We have taken this work as the
basis for applying MARC and attempted to model the substantial demands these
problems make on the limited processing resources of working memory in terms of their
relational structure.
The strength of MARC lies in using relational complexity to explicitly link the demand
a task makes on cognitive resources with the processing capacity of the individual
(Halford et al., 1998a). While we have argued in the previous chapters that complexity
may be represented psychometrically in terms of the relationship with broad cognitive
abilities (Stankov, 2000), a strong process theory has much to offer in furthering our
understanding of the nature of human reasoning (Lohman & Ippel, 1993). If we are able
to specify what aspects of a task are likely to make reasoning complex we are in a far
better position to predict where and when problem solving is likely to be difficult or
more importantly, fail. The application of the MARC approach here demonstrates a
potential to model the characteristics of reasoning tasks that make processing difficult
(i.e., relational complexity) and time consuming (i.e., number of relational steps)
parsimoniously in one theory.
We now have evidence to suggest that when performance is aggregated across individuals in the traditional comparison-of-means approach that dominates cognitive psychology, relational complexity influences performance. This is consistent with the first general prediction from Chapter
2. Close examination of the task does reveal some variability within levels of
complexity and Chapters 6 and 7 will explore these in more detail using an individual
We now turn our attention to the development of the Latin Square Task.
CHAPTER FOUR
DEVELOPMENT OF THE LATIN SQUARE TASK
4 Introduction
In the previous chapter we demonstrated that it is possible to apply the method of analysis of relational complexity to the knight-knave task (Birney & Halford, 2002).
In theory it is possible to apply the MARC approach to any cognitive task in which a
sufficient understanding of the cognitive processes has been obtained. In fact attempts
are being made to apply the MARC approach to even more dynamic and unconstrained
tasks such as the determination of potential conflict in air-traffic control scenarios
(Boag et al., 2000). The practical difficulty of moving into a dynamic environment is
that the strict controls available in the laboratory evaporate and the use of the
comparison of means approach becomes problematic. Recall that in comparing means,
performance is aggregated across individuals along with any reliable individual
differences. This can serve to modify our test of the relational complexity theory in
unknown ways. As a next step then, rather than trying to further explore the impact of
relational complexity in such dynamic tasks, we have taken the theoretically more
stringent approach and not only developed a more constrained task, but done so such that it specifically addresses the criterion of relational complexity theory. There are two
reasons for this apparent diversion. Firstly, if we cannot demonstrate a relational
complexity effect in adult reasoning for a highly contrived task, then isolating the effect
of complexity in a dynamic task will prove to be very difficult. More importantly
however, taking what might be considered a reductionist approach may enable us to
better understand the characteristics of relational processing and in turn facilitate further
refinement of the method for analysis of relational complexity.
The task that we have developed is based on the Latin square. To set an appropriate
foundation to the relational complexity analysis, the definition of a Latin square is first
considered. We then move onto some experimental data that explores the psychometric
properties of this new task.
4.1 Definition of a Latin Square
The Latin square derives its name from an ancient puzzle that dealt with the number of
ways Latin letters could be arranged in a square table so that each letter appeared only
once in every row or column. The use of the Latin square in science has typically taken
the form of a device for permitting the application of the now very common Analysis of
Variance procedures. By all accounts (e.g., Thomson, 1941), it was first explicated in
detail for experimental use in agriculture by R.A. Fisher to control statistically for soil
variability (e.g., Fisher, 1925, 1935, 1966; Fisher & Yates, 1934). The appropriateness
of the Latin square design in psychological research was “discovered” in the late 1930’s
(Thomson, 1941). Within five to ten years the Latin square was being actively
canvassed as an approach to designing experimental procedures that facilitated
appropriate interpretations of analysis of variance (Grant, 1948; Bugelski, 1949). It is
now frequently used to control for bias in designs where repeated-measures from the
same individual are taken. It is also taught as a matter of course in many undergraduate
methodology courses.
[Figure content not reproduced: the 12 possible 3×3 Latin squares, labelled A to L, each formed from the elements a, b, and c.]
Figure 4.1. The 12 possible orderings defining a 3×3 Latin square (Square A is the standard square)
4.1.1 Enumeration of Latin Squares
To get some idea of the power of Latin squares in randomizing experimental procedures
it is worth briefly considering the mathematics of the layout. Each "standard" Latin square can be rearranged in k!(k−1)! ways, where k is the dimensionality of the square (Fisher & Yates, 1963). The design is typically referred to as an LS-k design.
A “standard square” is defined as having the first row and column ordered naturally
(i.e., alpha-numerically - see Figure 4.1A). For a 3 × 3 square (LS-3), there is one
standard square and (3 × 2 × 1)(2 × 1) = 12 possible orders. For a 4 × 4 square (LS-4)
there are four standard squares and (4 × 3 × 2 × 1)(3 × 2 × 1) = 144 ways each of these
can be arranged, giving a total of (4 × 144) = 576 possible arrangements. To get some
idea of how quickly the number of squares expands, an LS-7 design has 16,942,080
standard squares!
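The arithmetic above can be summarised as: total squares of order k = (number of standard squares) × k! × (k−1)!. A small Python check (ours, purely to illustrate the counts quoted in this section):

from math import factorial

def total_latin_squares(k, n_standard):
    # Total k x k Latin squares = (number of standard squares) * k! * (k-1)!
    return n_standard * factorial(k) * factorial(k - 1)

print(total_latin_squares(3, 1))   # 12  (LS-3: one standard square)
print(total_latin_squares(4, 4))   # 576 (LS-4: four standard squares)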
A              B              Græco-Latin square
a  b  c        α  β  χ        aα  bβ  cχ
b  c  a        χ  α  β        bχ  cα  aβ
c  a  b        β  χ  α        cβ  aχ  bα
Figure 4.2. Composition of the 3×3 Græco-Latin square.
A more complex form of the Latin square is the Græco-Latin square. By strict
definition, a Græco-Latin square consists of two pairs of elements (Greek and Latin
letters) such that each Latin letter appears once in each row and column, and each Greek
letter appears once in each row and column, and once with each Latin letter (Fisher,
1966). In the LS-3 design there is only one true Græco-Latin square - when square A
and square B from Figure 4.1 above are superimposed, as indicated in Figure 4.2.
Although there is only one 3×3 Græco-Latin square, there are 6912 4×4 Græco-Latin
squares, and their use in experimental designs has also been well documented (Fisher,
1966). The reason for introducing the Græco-Latin form here is that initial pilot work in
progress reveals that a relaxed version of the square in which two Latin squares are
superimposed without the constraint that each Greek letter is paired only once with each
Latin letter, has some potential as a relational complexity task of higher order
complexity (i.e., quinary).
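The Græco-Latin property can also be stated as a simple check: each of the two superimposed squares must itself be a Latin square, and every (Latin, Greek) pairing must occur exactly once. The following sketch is our own illustration (assumed function names) and verifies the square shown in Figure 4.2.

def is_graeco_latin(latin, greek):
    n = len(latin)
    def is_latin(sq):
        rows_ok = all(len(set(row)) == n for row in sq)
        cols_ok = all(len({sq[r][c] for r in range(n)}) == n for c in range(n))
        return rows_ok and cols_ok
    # Every (Latin, Greek) pair must occur exactly once across the n*n cells.
    pairs = {(latin[r][c], greek[r][c]) for r in range(n) for c in range(n)}
    return is_latin(latin) and is_latin(greek) and len(pairs) == n * n

latin = [["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"]]
greek = [["α", "β", "χ"], ["χ", "α", "β"], ["β", "χ", "α"]]
print(is_graeco_latin(latin, greek))   # True, as in Figure 4.2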
4.2 Cognitive Load and the Latin Square
The typical Latin square problem presents the reasoner with a 4×4 Latin square with
missing elements in some of the cells (i.e., the square is incomplete). The reasoner is
required to determine which of the four possible elements should fill a target cell so that
it satisfies the defining principle of the Latin square. The Latin Square Task (LST) is an
example of a deductive reasoning task since all the information necessary to solve a
given problem is available within the context of the problem or can be taken from the
rules (see Evans, 1993; Johnson-Laird & Byrne, 1993). In theory a Latin square of any
dimension can be used. We have restricted the domain by selecting LS-4 problems
although as stated above, some early pilot work has been conducted with Græco-Latin
squares of rank 4 and also with LS-5 problems. As will be shown below, the complexity
of the processing required to perform the task depends on how the given elements need
to be integrated to arrive at a solution. In some cases the status of non-target empty cells
must also be determined before the element in the target cell can be determined. This
manipulation introduces additional serial processing and has the potential to provide an
insight into the link between serial processing and parallel processing on which the
relational complexity theory focuses. The relational complexity analysis of the Latin
Square Task (LST) was performed using the principles for analysing single- and multi-process knight-knave items outlined in Chapter 3. These principles were based on the
axioms and theorems of relational complexity theory introduced in Chapter 2.
4.2.1 The Defining Principle of the Latin Square
We have already stated the defining principle of the Latin square several times now.
Essentially, the elements of the square are arranged such that each element appears only
once in every row and column.
A                        B
    A  B  C  D               A  B  C  D
1   1  2  3  4           1   1  -  -  -
2   2  1  4  3           2   -  1  -  -
3   3  4  1  2           3   -  -  1  -
4   4  3  2  1           4   -  -  -  ?
Figure 4.3. A complete (A) and incomplete (B) "standard" 4×4 Latin square
As an aid to a more formal definition of the defining principle, consider Figure 4.3A.
We know that if cell A1 is 1 then the remaining elements in the other cells
of row 1 and column A must not be a 1. Generically this can be represented as follows:
IF (A1 = x) THEN ((B1 ≠ x) & (C1 ≠ x) & (D1 ≠ x) & (A2 ≠ x) & (A3 ≠ x) & (A4 ≠ x))
or more fully as:
IF Xij = x THEN all Xic ≠ x and all Xrj ≠ x; such that (c ∈ {1, 2, 3, 4} and c ≠ j) AND (r ∈ {1, 2, 3, 4} and r ≠ i)
where
Xij = the intersecting cell of row i and column j
Xic = the intersecting cell of row i and column c
Xrj = the intersecting cell of row r and column j
x = an element from the set of four elements {1, 2, 3, 4}
In practical terms this can be taken to mean that given the appropriate k-1 elements in a
row or column of an LS-k problem, the status of element k is determined uniquely. As a
short hand way of representing this we write: {a, b, c} → {d} and refer to this as the
Defining Rule of an LS-4 square.
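Stated computationally, the Defining Rule is simply the requirement that no element occurs twice in any row or column of a (possibly incomplete) square. The sketch below is an informal illustration of this constraint; the representation, with None marking an empty cell, is our own and not part of the task software.

def obeys_defining_rule(square):
    # True if no element is repeated in any row or column; None marks an empty cell.
    n = len(square)
    lines = [list(row) for row in square]
    lines += [[square[r][c] for r in range(n)] for c in range(n)]
    for line in lines:
        filled = [x for x in line if x is not None]
        if len(filled) != len(set(filled)):
            return False
    return True

# The incomplete square of Figure 4.3B (diagonal of 1s, target cell empty).
incomplete = [[1, None, None, None],
              [None, 1, None, None],
              [None, None, 1, None],
              [None, None, None, None]]
print(obeys_defining_rule(incomplete))   # True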
The constrained nature of the Latin square design means that this rule does not only
apply within a single row or column. In fact, the value of a cell can be determined
without necessarily knowing or considering all elements in a given dimension. Some
combination of knowledge of row and column elements can be used in certain cases to
derive the value for the target cell. For example, consider Figure 4.3B, which is Figure
4.3A with the values of the non-diagonal cells hidden. We are interested in determining
the value of the target cell, D4. From just the information in Figure 4.3B, the values of
each of the hidden cells cannot be determined uniquely, yet we know unambiguously
that the target cell must be a 1 (we will return to a similar example in more detail
shortly). What this means is that the Defining Rule, {a, b, c} → {d}, can be instantiated
in more than one way. For instance, as we have already alluded to and will specify more
formally below, the rule can be applied in a single row or column (i.e, a single
dimension). We will demonstrate that the case of integrating information in a single
dimension entails a binary process. The value of a cell can also be determined by
considering appropriate elements in the intersecting row and column. This second type
of integration is constrained by the intersection of two dimensions. A third way of
determining the value of a given cell is to consider information across multiple
dimensions. That is across multiple rows and columns that are not necessarily
constrained by a simple intersection.
These three forms of instantiation turn out to require different levels of relational
integration and as such serve as the basis for the complexity manipulation in the Latin
Square Task. The analysis is provided in more detail in the following section. The
notation that is used to represent the process analysis is similar to that outlined in
Chapters 2 and 3. Elements and rows of the square are referred to using numerical
symbols, and columns are represented using upper-case letters. To avoid the natural
hierarchy that is inherent in alpha-numeric symbols, actual test items use a combination
of geometric shapes and colours to represent the elements.
4.2.2 Binary Processing in LS-4 Problems
Binary items are generated using the {a, b, c} → {d} rule within either a single column
or row but not across both. The example problem in Figure 4.4 requires an instantiation
of the rule within column C. Here we know three of the elements in column C so the
process required to determine the fourth entails comparing the items that are present in
the column with the full set of items that should be present to determine the missing
element in the target cell (C2). There is no need to consider any other cells in the
square. Using simple conjunction and implication relations (i.e., AND, →), this can be
represented as;
AND(C1(4), C3(2), C4(3)) → C2(1)
RC = 2
which is read, “C1 is a 4 AND C3 is a 2 AND C4 is a 3 implies C2 is a 1”. As in
Chapter 3, underlining is used to represent the chunking of entities. Segmentation is
represented by having separate processes on a separate line.
[Figure content not reproduced: a completed 4×4 Latin square alongside the example binary problem discussed in the text, in which column C contains the given elements 4, 2, and 3 and the target cell C2 is marked "?".]
Figure 4.4. Completed and example binary LST problem.
The logic underlying the binary classification might be thought of in a slightly different
way to demonstrate the instantiation of the defining rule more clearly. In Figure 4.4, the
elements that should exist in every row and column are represented by the symbols 1
through 4. This information is given as part of the task’s instructions and in LS-4
problems, as part of the response options (see Section 4.3 below for more detail on the
design of the task display). In this case, we are also provided with the elements {2},
{3}, and {4} in rows 3, 4, and 1 of column C, respectively. This results in two sets of
elements. The complete set {1, 2, 3, 4} and the known or given set {2, 3, 4}. Consistent
with the principles of MARC the comparison of these two chunks entails a binary
relation. The relations between elements within the chunks do not need to be considered
to make the current decision (Chapter 2). This will become clearer in the ternary and
quaternary problems when constraints on how many elements can be chunked are
imposed.
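The binary strategy can be made concrete with a few lines of code: the missing element of a single row or column is the difference between the full element set and the given elements in that line, exactly the single chunked comparison described above. This is our own illustrative sketch, not the experimental software.

def solve_binary(line, elements=(1, 2, 3, 4)):
    # line: one row or column with exactly one empty cell (None).
    # The target is whatever element of the full set is not already given.
    missing = set(elements) - {x for x in line if x is not None}
    assert len(missing) == 1, "the line must determine its empty cell uniquely"
    return missing.pop()

# Column C of the Figure 4.4 example: C1 = 4, C2 = ?, C3 = 2, C4 = 3.
print(solve_binary([4, None, 2, 3]))   # 1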
4.2.3 Ternary Processing in LS-4 Problems
[Figure content not reproduced: a completed 4×4 Latin square alongside the example ternary problem discussed in the text, in which B1 = 1, B4 = 4, and D2 = 2 are among the given cells and the target cell B2 is marked "?".]
Figure 4.5. Completed and example ternary LST problem.
Ternary items are generated using the {a, b, c} → {d} rule as for binary items however
integration of information from both a row and column is required. In Figure 4.5 for
example, the instantiation of the defining rule is {1, 2, 4} → {3}; the solution comes
through the integration of information from Column B and Row 2. The intersection, B2,
must not contain an element that is present (or can be determined to be present) in the
other cells of Column B or Row 2, and therefore the target response for the example is
element {3}.
The representation of this ternary process can be written as:
AND(B1(1), B4(4), D2(2)) → B2(3)
RC = 3
The arguments of the conjunction cannot be combined into a single argument in this
case since they require integration across more than one dimension. That is, in this case
the {2} in D2 cannot be chunked with the other terms of the conjunction. The reason is
that by the Latin square defining principle, elements in row 2 are not independent of the
elements in any of the intersecting columns. In this case the cell that intersects with
column 2 is the target cell, so elements in row 2 need to be considered to make the
current decision. The relation between the known elements in column B does not need
to be considered per se and therefore the elements can be chunked. In summary,
solution of ternary items entails integrating elements within a row and a column.
Remember, as outlined in the analysis of the knight-knave task in Chapter 3, the
representation of a ternary relation that we have used above is broken into two steps
around the “implication” relation (→). Halford et al. (1998a) represent similar relations
either by embedding the lower order relations into a higher one or by stringing the
arguments together under one overarching relation. For example, arithmetic addition is
represented as, ADDITION(2,3,5); transitivity is represented as MONOTONICALLY-INCREASING(a, b, c). Using a similar structure we could represent the above analysis
as:
→ (AND(B1(1), B4(4), D2(2)), B2(3))
The implication symbol (→) is simply used as a means of labeling the relationship
between the arguments.
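In the same informal spirit, the ternary case amounts to collecting the elements known in the target cell's row and in its column and returning the single element excluded by neither. The sketch below (ours; cell indices are zero-based) reproduces the Figure 4.5 example.

def solve_ternary(square, row, col, elements=(1, 2, 3, 4)):
    # Elements already present in the intersecting row and column jointly
    # determine the target cell; None marks an empty cell.
    n = len(square)
    seen = {square[row][c] for c in range(n)} | {square[r][col] for r in range(n)}
    candidates = set(elements) - {x for x in seen if x is not None}
    assert len(candidates) == 1, "row and column must jointly determine the cell"
    return candidates.pop()

# Figure 4.5 example: B1 = 1, B4 = 4, D2 = 2; target cell B2 (row 2, column B).
square = [[None, 1, None, None],
          [None, None, None, 2],
          [None, None, None, None],
          [None, 4, None, None]]
print(solve_ternary(square, row=1, col=1))   # 3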
4.2.4 Quaternary Processing in LS-4 Problems
[Figure content not reproduced: a completed 4×4 Latin square alongside the example quaternary problem discussed in the text, in which A1 = 1, C3 = 1, and D4 = 3 are among the given cells and the target cell B4 is marked "?".]
Figure 4.6. Completed and example quaternary LST problem (1)
Quaternary items rely on a more extensive use of the defining rule. Solution is achieved
by integrating elements across multiple rows and columns that are not necessarily fully
constrained by a simple intersection. We know from the Latin square principle that each
row must contain an instance of all the elements. In the example problem in Figure 4.6,
the target cell cannot be determined by the binary and ternary strategies just described.
Using either of these strategies results in only knowing that the target cell is not a {3}.
Row 4 must contain a {1}, {2}, {3}, and {4} by definition of the LS-4 design. We can
determine that A4 cannot be a {1} since there is already a {1} present in column A (i.e.,
in A1). Similarly, C4 cannot be a {1} since column C already contains this element (i.e.,
in C3). The value of D4 is determined as it contains the element {3}. Since by definition
of the Latin square principle, a {1} must be present in Row 4, we know by elimination
that B4 must be a {1}. In the above example, we are considering all possible elements
in row 4, while taking into consideration the elements in columns A, B and C. We are
also taking into consideration the distribution of a particular element, that is element
{1}. The requirement to consider multiple rows and columns is the general principle
that quaternary items are based on. Here it is very clear that relations between the
elements need to be considered and cannot be chunked as in the binary and ternary
examples.
This analysis can be thought of as the integration of the following pieces of information:
A1(1) → NOT(A4(1))   (4.1)
C3(1) → NOT(C4(1))   (4.2)
D4(3) → NOT(D4(1))   (4.3)
The integration of 4.1, 4.2, and 4.3 implies that B4 must be a {1}. This cannot be
determined by simply considering each process alone, it requires integration. This line
of argument can be represented as follows:
AND(A1(1) → NOT(A4(1)), C3(1) → NOT(C4(1)), D4(3) → NOT(D4(1))) → B4(1)
this can be simplified somewhat to:
AND(A1(1), C3(1), D4(3)) → B4(1)
RC = 4
An alternative analysis based on determining elements in Column B rather than Row 4
results in the same classification through a different but isomorphic line of logic.
A1(1) → NOT(B1(1))   (4.4)
B2(3) → NOT(B2(1))   (4.5)
C3(1) → NOT(B3(1))   (4.6)
The integration of 4.4, 4.5, and 4.6 still implies that B4 must be a {1}. Therefore,
although the strategy used is different to the first, the logic is very similar and
importantly involves the same level of processing.
This can be represented as:
AND(A1(1), B2(3), C3(1)) → B4(1)
RC = 4
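The elimination argument can be sketched in the same informal way: for a chosen row and element, rule out every cell that is already filled or whose column already contains that element; if exactly one cell survives, the element is forced into it. Again this is only our illustration of the logic (zero-based indices), not a model of participants' processing.

def place_by_elimination(square, row, element):
    # Return the only column of `row` that can still take `element`,
    # or None if the placement is not forced.
    n = len(square)
    possible = []
    for c in range(n):
        if square[row][c] is not None:
            continue                                   # cell already filled
        if element in [square[r][c] for r in range(n)]:
            continue                                   # column already has the element
        possible.append(c)
    return possible[0] if len(possible) == 1 else None

# Figure 4.6 example: A1 = 1, C3 = 1, D4 = 3; where must the 1 go in row 4?
square = [[1, None, None, None],
          [None, None, None, None],
          [None, None, 1, None],
          [None, None, None, 3]]
print(place_by_elimination(square, row=3, element=1))   # 1, i.e. column B, so B4 = 1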
An additional example of a quaternary item is useful and describes a slightly different
line of quaternary level processing. Consider the item represented in Figure 4.7.
[Figure content not reproduced: a completed 4×4 Latin square alongside a second example quaternary problem in which the row and column intersecting the target cell (A2, marked "?") contain no given elements.]
Figure 4.7. Completed and example quaternary LST problem (2).
This is a particularly interesting item since the intersecting row and column of the target
cell do not contain any elements. The most efficient solution strategy involves first a
ternary process followed by a quaternary process. Consider the following argument:
“We know that the element in B1 is a 2 since the intersecting column
contains a 4 and a 1 and the intersecting row contains a 3 and a 4.
Given B1 must be a 2, we know that A1 cannot be a 2 (in fact we
know that it must be a 1, but this result is not necessary at this stage
and could be forgotten). Similarly, we know that A3 cannot be a 2
since a 2 is already present in Row 3. The same argument applies
for A4 … Row 4 already contains a 2. Now since A1, A3, and A4
cannot be a 2, then A2 must be a 2.”
This line of argument is the same as for the previous quaternary example, however the
result of an intermediate step to fill B1 needs to be considered and identified.
1. AND(B3(4), B4(1), C1(3)) → B1(2)
RC = 3
2. AND(B1(2), D3(2), C4(2)) → A2(2)
RC = 4
In this particular case, it is also possible to solve the item using a rather lengthy series
of binary and ternary steps, but the short-term memory load required to achieve this
would make such a strategy unlikely in practice even with substantial training (subjects
are not permitted to use an external memory aid such as paper and pencil). A possible
strategy might be as follows:
1. AND(C4(2), C1(3), B3(4)) → C3(1)
RC = 3 (column C + row 3)
2. AND(C4(2), C1(3), C3(1)) → C2(4)
RC = 2 (column C is constant)
3. AND(B4(1), B3(4), C1(3)) → B1(2)
RC = 3 (column B + row 1)
4. AND(B4(1), B3(4), B1(2)) → B2(3)
RC = 2 (column B is constant)
5. AND(D1(4), D3(2), B2(3)) → D2(1)
RC = 3 (column D + row 2)
6. AND(B2(3), C2(4), D2(1)) → A2(2)
RC = 2 (row 2 is constant)
To illustrate the extent of the cognitive load required for this solution, we have
indicated in bold all the elements that are determined through solution (i.e., not
provided). This strategy would be classified as ternary since the most complex process
entailed in the string of six processes is ternary. The outcomes of at least three of the five
intermediate steps need to be maintained to determine the value in the target cell. This is
very likely to hamper performance.
4.2.5 An Empirical Test of the Analysis
The complexity manipulation outlined above is based on the relationship between the
elements that need to be considered. The manipulation itself is effectively "transparent"
to the reasoner in that the only principle required for solution is based on the LS-4
Defining Rule. To empirically test the relational complexity analysis, a series of 18
items were generated and administered to university students in Experiment 4.1. Since
this is the first administration of this task, we spend considerable time exploring the
data from this sample. Experiment 4.2 is a replication of the first experiment using the
same 18-item test administered to primary and high school students. The focus of the
analyses for this sample is primarily to consider the generalisability of the results
obtained with the university sample.
4.3 Experiment 4.1: University Students
4.3.1 Participants
In total 73 students, 13 male and 60 female (mean age = 18.62, SD = 2.87), participated
in the study. The students were enrolled in a first-year undergraduate subject at the
University of Queensland and received 1% course credit for their participation.
4.3.2 Item Generation
A set of 6 binary, 6 ternary, and 6 quaternary items were generated following the basic
principles outlined above. All items followed the LS-4 structure and the four unique
elements were drawn randomly for each item from six possible shapes (circle, square,
triangle, cross, diamond, and “no shape”) and six possible colours (red, green, blue,
cyan, magenta, and “no colour”). The “no shape” option meant that the elements were
cells filled with one of the four selected colours in an appropriate LS-4 layout (e.g., item
8 in Appendix B6). The “no colour” option meant that elements consisted of uncoloured
shapes, or shapes of just one colour (e.g., item 2 in Appendix B6). Following Halford et
al. (1998a), items requiring more than one process were assigned the complexity of the
most complex process required for solution. At this stage of the development of the LST
we made no attempt to experimentally manipulate the number or complexity of
additional processing steps within a problem. We do however consider the influence of
additional processing steps statistically. Table 4.1 lists the overall relational complexity
classification and the number of incomplete cells (IC) for each item. It also provides a
summary of the complexity of any additional processing steps (2D, 3D, 4D). For
instance, consider item 10 in Table 4.1. This item has been classified as ternary and 6 of
the 16 cells are incomplete. There are two binary processes (2D) and one ternary
process (3D) and the item is therefore a 3-step ternary problem. Appendix B.1 lists the
relational complexity analysis of the items.
Table 4.1
Latin square item characteristics: Complexity, number of processing steps, and incomplete cells a

Binary (2D)                     Ternary (3D)                    Quaternary (4D)
             Number of Steps                 Number of Steps                  Number of Steps
Item    IC   2D  3D  4D  ∑      Item    IC   2D  3D  4D  ∑      Item    IC    2D  3D  4D  ∑
1.      8    1   0   0   1      7.      7    0   1   0   1      13.     10    0   0   1   1
2.      7    1   0   0   1      8.      8    0   1   0   1      14.     8     0   1   1   2
3.      8    2   0   0   2      9.      8    1   1   0   2      15.     10    0   0   1   1
4.      7    2   0   0   2      10.     6    2   1   0   3      16.b    9     0   0   1   1
5.      6    3   0   0   3      11.     5    1   2   0   3      17.     9     0   0   1   1
6.      7    3   0   0   3      12.     9    0   1   0   1      18.c    8     0   0   1   1

a IC = number of incomplete cells (including the target); ∑ = total number of processing steps; b re-classification of relational complexity item 16: IC = 9; 2D = 1; 3D = 1; 4D = 0; ∑ = 2 (see section 4.4.2.1); c re-classification of relational complexity item 18: IC = 8; 2D = 1; 3D = 2; 4D = 0; ∑ = 3 (see section 4.4.2.1)
4.3.3 Procedure
The experimenter briefly introduced the Latin Square Task by explaining to subjects
that they would be asked to work as quickly and as accurately as possible through a
reasoning task presented on the computer. Students were instructed to do all their
working in their heads. Individuals were seated in sound attenuating booths and tested
in groups of 1 to 8. The items were presented using “Pentium 100” computers each
fitted with 14-inch VGA displays. A series of 4 practice items oriented subjects to the
nature of the task and how to make a response (Figure 4.8). The incomplete Latin
Square item was always presented towards the left hand side of the display screen and
the list of response options was provided on the right. Column and row labels were
provided during practice to facilitate feedback but were not provided in the test items.
The number of response options always corresponded to the square’s dimensionality
and was equal to the number of unique elements in the Latin square. Subjects indicated
which element should fill the marked cell by clicking on one of the possible responses.
Practice started by requiring subjects to solve a trivial example of a single row of three
cells of which only two were filled (Figure 4.8A). The students then attempted to solve
an incomplete 3×3 Latin Square (Figure 4.8B). Two more complex 4×4 Latin Square
items (one ternary and one quaternary item) were then presented (Figure 4.8C and
4.8D). Detailed feedback outlining the rationale for both incorrect and correct responses
was provided. When the subject was ready to move on to the next item they pressed the
spacebar. At the end of the practice, subjects moved on to the test phase in which each
of the 18 items were presented in a different random order to each student without
feedback.
[Figure content not reproduced: the four practice displays, A) a single row of three cells with one cell to complete, B) an incomplete 3×3 Latin square, and C) and D) incomplete 4×4 Latin squares (one ternary, one quaternary), each shown with its list of response choices.]
Figure 4.8. Practice items used in the Latin Square Task
4.4 Results and Discussion
As the Latin Square Task was developed using the principles of the relational
complexity theory, the critical test of the complexity analysis at this stage is the
performance of each item. The Latin Square Task is being used for the very first time and we are also looking for evidence to support the classifications we have made. An item that we had classified as quaternary (say) but that turns out to be solved correctly by a large number of subjects would suggest that we may have overlooked a much less complex solution strategy in our original classifications. We begin
with an item analysis employing both traditional and Rasch based methodologies. We
then consider the relational complexity classification as a function of item difficulty and
response time. We conclude the analyses of the university data by considering the
Rasch approach to generating subtests based on relational complexity. In all these
analyses we are exploring the comparison of means hypothesis: That is, as the relational
complexity of an item increases, accuracy is predicted to decrease and response times
increase.
4.4.1 Item Analyses: A Rasch Approach
To assist in the item analysis, items were calibrated using both traditional classical test
theory (CTT) statistics and the principles of the Rasch model. The advantage of the
Rasch approach, in addition to identifying misfitting items and persons, is that the
procedure provides an interval scale along which both item difficulty and individual
ability can be identified (Wright, 1999; Andrich, 1988). The traditional CTT measure is
the proportion of subjects who correctly answer the item (i.e., p-values). These values
are sample dependent and are necessarily constrained to take values between 1 and 0.
Not only does this limit ways to deal with floor and ceiling effects, the psychometric
characteristics of the p-values are typically unknown and do not necessarily take on
interval scale properties (Green & Smith, 1987). The subsequent analyses use Rasch
estimates of item difficulty and contrast them with traditional item statistics wherever
appropriate.
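For reference, the dichotomous Rasch model underlying this calibration expresses the probability of a correct response as a logistic function of the difference between person ability and item difficulty, both expressed on the same logit scale. A minimal sketch of that relationship (our illustration; the calibration itself was carried out with dedicated Rasch software):

from math import exp

def rasch_p_correct(ability, difficulty):
    # Dichotomous Rasch model: P(X = 1) = exp(theta - b) / (1 + exp(theta - b))
    return exp(ability - difficulty) / (1 + exp(ability - difficulty))

# A person of average ability (theta = 0) attempting two calibrated items
# from Table 4.2 (locations -2.029 and 1.767).
print(rasch_p_correct(0.0, -2.029))   # about .88
print(rasch_p_correct(0.0, 1.767))    # about .15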
Table 4.2
Descriptive statistics for Rasch and classical analyses

        Rasch Statistics                                                     CTT Statistics (SD in parentheses)
Item    item       standard   outfit   infit    chisq   prob    p-value (proportion   overall            correct
        location   error      MNSQ     MNSQ                     correct)              response time      response time
1       -2.029     0.56       0.410    0.975    2.73    0.236   0.944                 11.97  (26.11)     11.878 (26.67)
2       -3.455     1.06       0.433    1.083    0.58    0.743   0.986                  7.80  (3.61)       7.812 (3.63)
3       -0.127     0.30       0.684    0.825    2.88    0.217   0.775                 17.09  (20.55)     16.009 (14.68)
4 a     -4.515     -          -        -        -       -       1.000                 12.05  (6.75)      12.049 (6.75)
5       -0.110     0.30       1.384    1.040    6.14    0.021   0.789                 24.76  (20.95)     24.336 (18.83)
6        0.216     0.28       0.737    0.866    4.47    0.083   0.704                 18.80  (12.75)     21.028 (13.69)
7       -0.819     0.36       1.249    0.953    1.95    0.361   0.873                 20.20  (21.94)     18.299 (19.74)
8       -0.662     0.34       1.431    0.925    1.07    0.576   0.859                 24.11  (33.84)     24.629 (35.92)
9       -0.037     0.30       0.971    1.043    1.95    0.362   0.775                 19.73  (11.68)     20.305 (12.47)
10       0.404     0.28       1.135    1.149    3.20    0.181   0.676                 33.05  (22.70)     35.824 (24.13)
11       0.570     0.27       1.135    1.146    1.40    0.482   0.648                 49.75  (48.75)     48.807 (44.98)
12      -0.833     0.36       0.913    1.000    4.41    0.087   0.873                 20.08  (21.37)     18.681 (21.13)
13       1.225     0.26       0.872    0.926    1.23    0.527   0.507                 24.27  (17.98)     25.272 (18.46)
14       1.767     0.27       1.166    1.151    5.16    0.051   0.366                 56.92  (53.66)     66.395 (55.51)
15       1.440     0.26       1.115    1.034    2.43    0.279   0.451                 28.20  (27.77)     30.812 (33.68)
16 b     1.165     0.26       0.698    0.765    5.57    0.037   0.535                 21.44  (16.91)     25.548 (19.45)
17       1.243     0.26       0.853    0.863    2.53    0.263   0.507                 29.07  (30.36)     33.658 (38.65)
18 b     0.044     0.29       0.862    0.904    0.27    0.870   0.746                 27.23  (22.54)     30.545 (24.00)

a item 4 had a perfect score and the item location was derived by adding one standard error of measurement to the easiest calibrated item (item 2).
b the complexity classification of item 16 and 18 was revised
To take full advantage of the Rasch calibration a close fit of the data to the Rasch model
needs to be observed (Hambleton & Swaminathan, 1985). The following analysis
considers three main issues in assessing model-fit. First the overall or global fit of the
test as a whole is reported, and then individual item- and person-fit statistics are
considered. Preliminary checks on the data resulted in item 4 being excluded from the
analysis because every subject responded correctly to it and therefore the item provides
no information about individual ability. Unless indicated otherwise, we adopt the
approach advocated by Green and Smith (1987) and assign a value of one standard error
of measurement below the easiest calibrated item for item 4 in further analyses. A
further restriction of the Rasch analysis is that calibration of person ability is not
possible for subjects with extreme scores. As a result two subjects who answered every
item correctly were omitted. Table 4.2 lists the item location or difficulty and a
selection of fit statistics based on the Rasch model, and the corresponding proportion
correct and response time statistics.
4.4.1.1 Model-Fit
The overall fit of the model based on the 17 items and 71 subjects was satisfactory. The
χ2 test of the item × trait interaction suggested an appropriate fit to the Rasch model
when four trait groups were specified, χ2 (51) = 61.682, p = .145 (person separation
index = .561). When 3 trait groups are specified the fit is not as convincing, χ2 (34) =
47.938, p = .057 (the concept of item × trait interaction is explained in Appendix B.2).
4.4.1.2 Item and person fit (outfit and infit)
Masters and Wright (1996, Equation 21) report an indicator of item-fit that can be
calculated by taking the unweighted mean of the squared standardized residual for each
item across all individuals. They refer to this as the outfit mean square. One of the
problems with this measure of item-fit is that it is sensitive to outliers, that is,
unexpected high scores on difficult items or unexpected low scores on easy items. A
solution to this is to weight the mean square value using information about the
individual and the item (i.e., response variation). This is referred to as an infit statistic
and has the effect of reducing the influence of outliers (but is sensitive to unexpected
“inliers” - see Masters & Wright, 1996, Equation 23). Outfit and infit values range from
0.00 to positive infinity and have an expected value of 1.00. Low values suggest that the
data fits the model too well – that there is a deficiency in the stochastic variation that is
necessary for useful measurement (i.e., responses are too consistent). High values
suggest there is more noise in the data than what has been modeled (Linacre & Wright,
1994). Although there is no hard rule for determining when fit is too low or too high,
some proponents of the Rasch methodology (e.g., Wright, Linacre, Gustafson, &
Martin-Lof, 1994) suggest that reasonable mean square fit values lie between 0.70 and
1.30 for multi-choice tests in which the “stakes” for accurate response are normal (i.e.,
not high). The outfit and infit statistics for each item are shown in Table 4.2 and a quick
examination indicates that some items fall outside the recommended outfit 1.00 ± 0.30
range (items 1, 2, and 3 fall below and item 8 above). The information weighted infit
statistic suggests satisfactory item-fit for all items (i.e., values between 0.70 and 1.30).
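For concreteness, the two statistics can be computed directly from a persons × items matrix of dichotomous responses once person and item estimates are available. The following is a minimal sketch of the Masters and Wright formulas (our illustration, not the software used for the analyses reported here):

    import numpy as np

    def rasch_prob(theta, delta):
        """Probability of a correct response under the dichotomous Rasch model."""
        return 1.0 / (1.0 + np.exp(-(theta[:, None] - delta[None, :])))

    def item_fit(X, theta, delta):
        """Outfit and infit mean squares for each item.
        X is an (n persons x k items) matrix of 0/1 responses."""
        P = rasch_prob(theta, delta)
        W = P * (1.0 - P)                                    # modelled response variance
        outfit = ((X - P) ** 2 / W).mean(axis=0)             # unweighted mean square residual
        infit = ((X - P) ** 2).sum(axis=0) / W.sum(axis=0)   # information-weighted mean square
        return outfit, infit

Both statistics have an expected value of 1.00; the analogous person statistics are obtained by averaging over items (axis=1) rather than persons.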
4.4.1.3 Person-Fit.
Similar infit and outfit values can be derived to estimate person-fit (Masters & Wright,
1996, Equations 22 & 23). The 71 individuals were sorted on the basis of their
estimated person ability and the infit and outfit values plotted in Figure 4.9. As might be
expected, the divergence between the outfit and infit values tends to occur
at the more extreme values of estimated person ability (indicated by the peaks and
troughs that exceed the fit region in Figure 4.9).
Low Person-Fit: The students falling below the outfit cutoff range of 0.70 had response
patterns that tended to be too consistent. For instance, the response pattern18 for the
lowest outfit score (0.32) was: 11111111111111110. When the mean square fit statistic
was weighted to adjust for this aberrant response, the person-fit was satisfactory (i.e.,
infit = 0.83). A response pattern where misfit was identified by both infit and outfit
statistics is 11111111111100000 (outfit = 0.43; infit = 0.54). It is clear that this student
has been flagged as misfitting because her response was “too” consistent.
18 Item response patterns are ordered by calibrated item difficulty from easiest to hardest.
Figure 4.9. Person-outfit and -infit values sorted by estimated person ability in the 18-item Latin Square Task.
High Person-Fit: The most extreme misfitting person, identified by an outfit value of
3.13, failed to answer an easy item correctly (11110111111111111). The infit value
(1.18) tended to adjust for this aberrant response and is within the acceptable range. The
only example where both person-fit values were substantially above 1.30 (outfit = 2.00;
infit = 1.77) was the response pattern, 10000000011100110 (subject #473). This student
failed most of the easier items but managed to get two of the most difficult and a few of
the moderately difficult items correct. We would conclude that this student has not been
measured well.
At this stage of the development of the Latin Square Task our emphasis is on the
performance of the items and not on identifying misfitting subjects per se. Since only a
small number of students (3 or 4) tended to have response patterns that were too
consistent with the model (reduced variation), and only one student had an erratic
response pattern (i.e., unexplained variation), these subjects were retained in the
analyses.
4.4.1.4 Summary of fit
On the whole, the three measures of model-fit (global, item, and person) tend to suggest
that individual performance on the Latin square has a satisfactory fit to the Rasch
measurement model. This allows us some confidence in using the calibrated item
information to further explore the characteristics of the task (Wright, 1999) and more
importantly to consider ways in which we can better account for the processing
demands imposed by the task (Hambleton & Swaminathan, 1985; Embretson, 1993;
Embretson, 1998).
4.4.2 Item Difficulty and Relational Complexity
The relative theoretical difficulty of each item is predicted in advance by the relational
complexity analysis and this is the next area of item performance that is to be
considered. Figure 4.10 plots item difficulties generated using the Rasch (logits) and
traditional approaches (mean proportion correct, p-values) and gives some indication of
the distribution of the items across the respective difficulty continuums.
By rescaling item difficulty as a function of the ability of the individual and the
difficulty level of other items, measurement limitations in the proportion correct score
that result in ceiling and floor effects can be somewhat overcome. According to Wright
(1999), when the data fit the Rasch model, the item calibration serves to convert the
nonlinear raw-scores into linear measures that have the desirable properties of conjoint
additivity (see also Perline et al., 1979, and Chapter 1). Given the high accuracy on
binary and ternary items, this transformation results in better item separation. As an
illustration of this, consider the difference in difficulty between items 1 and 12, and
between items 8 and 5 (calculated from Table 4.2 above). When difficulty is defined by
traditional p-values, the differences between items in each set are the same, about 0.068
proportional points. When defined by the linearized logits, the difference in difficulty
between items 1 and 12 is 1.196 logits, whereas the difference between items 5 and 8 is
almost half of this at 0.552 logits. Wright (1999, p. 70) details a similar example of this
scaling effect and reiterates, “raw scores are not measures”.
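For reference, the dichotomous Rasch model underlying this calibration can be written in its standard form (with $\theta_n$ the ability of person $n$ and $\delta_i$ the difficulty of item $i$, both in logits):

$$P(X_{ni} = 1 \mid \theta_n, \delta_i) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)}, \qquad \ln\!\left(\frac{P_{ni}}{1 - P_{ni}}\right) = \theta_n - \delta_i .$$

Because the model is linear in the log-odds metric, a given difference in logits represents the same change in the odds of success anywhere on the scale, which is what allows the compressed differences among the very easy (and very hard) items to be separated.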
Figure 4.10. Comparison of item difficulty estimates based on traditional (mean proportion correct) and Rasch (logit) calibration, with items grouped by binary, ternary, and quaternary classification.
4.4.2.1 Complexity classification
While a general separation between items based on the three levels of complexity is
apparent in Figure 4.10, a closer examination of the plot of item locations reveals some
interesting anomalies that need to be considered before an attempt is made to generate
subtest measures based on relational complexity. The most obvious is the placement of
item 18 (classified as a quaternary item) centrally within the cluster of ternary items. A
second investigation of the complexity analysis of this item indicated that although the
item could be solved using a quaternary process, it could also be solved using a string of
two ternary processes separated by a simple binary process. We follow Halford et al.’s
(1998a, p. 805) definition of task complexity as “…the most complex process involved
in the task using the least demanding strategy available to humans for that task.” (as
summarized in Axiom 2 of Chapter 2). Hence, item 18 is reclassified as a ternary item
for subsequent analyses (see Appendix B.1 for the appropriate relational complexity
analysis).
This misclassification raised some concerns about the remaining items so we had the
complexity analyses tested once more by a graduate student who had not been involved
with the task development but was conversant in the MARC methodology. Having
explained our classification scheme that is outlined in Section 4.2.1 above, she detected
a previously unforeseen solution strategy for item 16 that altered its classification from a
single-step quaternary item to a two-step ternary item (see Appendix B.1). In this case,
the data suggest that item 16 is a relatively difficult item (although easier than the
remaining quaternary items) and as such there was no obvious data driven rationale for
revising the analysis as there was for item 18.
A question that arises is this: how do these misclassifications influence our test of the
relational complexity theory? In his review of the relational complexity theory, Sweller
(1998) states that it is not always clear which strategy will be chosen and how
segmentation might influence performance. The current data might provide evidence for
this view. However, it is important to note that this is not a burden on the RC theory.
Halford et al. (1998b) state that relational complexity can account for task difficulty
when applied to the processes that have been used. They acknowledge, as does Sweller,
that determining these processes can entail considerable work. We simply take these
current misclassifications of solution strategy as a failure neither of the theory nor of our
application of it. We use these findings to assist in developing an understanding of the
processes entailed in this new task and to refine our procedures for constructing items.
A second, more pressing issue is the overlap between binary and ternary items. This was
not predicted as part of the task development; however, it would be somewhat premature
to conclude that accuracy on the Latin Square Task is unable to reliably differentiate
binary and ternary processing in adults. A closer inspection of the item analysis reported
above and Figure 4.10 indicate a distinct separation between items 1 and 2 (and 4) and
items 5, 3 and 6, although all have been classified as binary. As shown in Table 4.1, the
easier items (1 and 2) only require a single binary process (i.e., 1 step) whereas the
other binary items involve two or more binary processes. A similar pattern of difficulty
based on number of processing steps occurs in the ternary and quaternary problems. The
easier ternary items (8, 7, and 12) entail only one ternary process, whereas the items
with 3 processes (10 and 11) tend to be calibrated as more difficult. Item 9, which
entails two processes, lies somewhere between the 1- and 3-step clusters of ternary
items. Item 14 was the only quaternary item that entailed more than one process (i.e., 2
steps) and was the most difficult in the set of 18 items. The remaining single-step
quaternary items tend to be grouped together at a relatively easier level of difficulty.
This overall pattern of findings suggests that performance on the Latin Square Task is
not simply a function of the capacity to represent relational elements in a single
representation (i.e., in parallel). It seems that the ability to string a series of processes
together also influences performance on this task. The following section considers this
in more detail.
4.5 Decomposing Item Difficulty: Relational Complexity and Processing Steps
Although we have a reasonably reliable experimental manipulation of relational
complexity, the conditions for an analysis of processing steps are not ideal. The
criterion used to generate the Latin square items did not explicitly take into
consideration the number of processes involved in the analysis. Potential ways to
decompose difficulty estimates into the effects of particular cognitive components have
been extensively explored in the item-response theory literature (e.g., Embretson, 1996;
Green & Kluever, 1992; Sheehan, 1997). One approach, espoused by Embretson and
Reise (2000) and investigated by Green and Smith (1987), is to regress the Rasch item
difficulties on to a set of weights or frequencies that reflect one or more cognitive
components. This avoids the problems with the ANOVA approach when the factors, or
independent variables, are not independent, as is the case here. Green and Smith have
demonstrated that this regression approach produces estimates of the cognitive
components that differed little from the sample-size intensive procedures that are based
on extensions of the Rasch model (e.g., the Logistic Latent Trait Model, see Green and
Smith, 1987, for details). Table 4.3 details the results of the multiple regression analysis
(Model 1) in which the Latin square item difficulties derived from the Rasch analysis
were regressed on to the item’s relational complexity (2, 3, or 4 arguments) and number
of processing steps (1, 2, or 3 steps)19.
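As a minimal sketch of this regression approach (our illustration; the item values from Tables 4.1 and 4.2 are not reproduced here), Model 1 amounts to an ordinary least squares fit of the 18 calibrated logits on the two component scores:

    import numpy as np

    def ols(y, *predictors):
        """Simultaneous OLS fit; returns the coefficients (intercept first) and R-squared."""
        X = np.column_stack([np.ones(len(y)), *predictors])
        b, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ b
        return b, 1.0 - resid.var() / y.var()

    # coefficients, r_squared = ols(item_logits, relational_complexity, n_steps)

Standardizing the variables before fitting would return the β weights reported in Table 4.3.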
Table 4.3
Multiple regression of item difficulty on relational complexity and number of processing
steps – Model 1

                                              zero-order correlations
                              Mean (SD)        logit       RC          β      t      Sig.     pr      sr
Item Difficulty (logit)      -0.25 (1.67)
Relational Complexity (RC)    2.89 (0.76)     0.71 *                  0.84   5.33   0.000    0.81    0.80
Number of Processes (Steps)   1.83 (0.86)     0.18       -0.30        0.43   2.73   0.016    0.58    0.41

* p < .001; Model: R2 = .67 (adjusted R2 = .62), F(2, 15) = 14.91, p < .001
The analysis indicated that relational complexity is a very good zero-order predictor of
item difficulty, such that an increase in relational complexity tends to be associated with
an increase in item difficulty, r(16) = .71, p < .001. This provides further preliminary
support for the relational complexity manipulation. The zero-order correlation between
item difficulty and number of processing steps however was not significant, r(16) = .18,
p = .49. The relationship between the item’s relational complexity and the number of
processing steps was not significant although the trend was negative, r(16) = -.30, p =
.22. This relationship between the predictors is an artifact of the way the quaternary
problems have been constructed in that they predominantly require only one process
whereas the less complex items tended to have been constructed to entail more
processing steps.
When the two predictors are taken together, relational complexity accounts for a
significant amount of the variation in item difficulty (β = .84, t(15) = 5.33, p < .001).
This might well be expected given the significant zero-order association between these
variables. What is interesting is that after controlling for the effect of relational
complexity, the number of processing steps now also becomes a significant predictor of
item difficulty, β = .43, t(15) = 2.77, p = .016. In fact, the combined unique contribution
of RC and steps derived by summing the squared semi-partial correlations (sr) in Table
4.3 (sr2RC + sr2steps = .80) exceeds the total variation accounted for by the complete
model (R2 = .67; adjusted R2 = .62; F(2, 15) = 14.91, p < .001). First impressions of this
might suggest a disturbance in the data, however Cohen and Cohen (1975) state that this
is a necessary consequence and indication of the presence of “statistical suppression”
when two predictors are involved. Typically, statistical suppression is considered as the
effect of one variable suppressing irrelevant variation in a second variable and as a
result, this second variable becomes a significant predictor of the criterion (Conger,
1974; Cohen & Cohen, 1975; Tabachnick & Fidell, 1989; Pedhazur & Schmelkin,
1991). In the current context, while the relationship between the predictors is a result of
the item generation procedure, their relationship with item difficulty is not an artifact.
Item difficulty is clearly a function of relational complexity and for the current selection
of items the number of processing steps alone is not a good predictor. However, when
the effect of relational complexity is taken into consideration, the number of processing
steps does have a significant influence on the difficulty of the item – such that, an
increase in the number of processing steps is associated with an increase in item
difficulty. In fact, the suppression effect is present in the other direction as well, since
the partial correlation of relational complexity (pr = .81) also exceeds its zero-order
correlation with item difficulty (r = .71). Cohen and Cohen
(1975) refer to this as “cooperative suppression” (see also Conger, 1974).
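The arithmetic behind this pattern can be seen in the standard formula for the semi-partial correlation of one of two predictors (here 1 = RC, 2 = steps, y = item difficulty):

$$sr_1 = \frac{r_{y1} - r_{y2}\, r_{12}}{\sqrt{1 - r_{12}^2}} .$$

When $r_{12}$ is negative and both validities are positive, the numerator is inflated rather than reduced; substituting the values from Table 4.3 gives $(.71 - .18 \times -.30)/\sqrt{1 - .09} \approx .80$, the semi-partial correlation reported for relational complexity.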
19 Item based analyses are prone to criticism because aggregating across individuals to determine an item statistic (e.g., mean response time) does not take into consideration the variability in the responses (i.e., individual differences). Using proportion correct is slightly different since the standard deviation is determined by the proportion of the sample that has answered the item correctly. Rasch estimates incorporate sample and other item characteristics in the calibration of item difficulty.
4.5.1 Additional Regression Analyses
4.5.1.1 Interactive effects
To explore the likelihood that the additive effects of relational complexity and number
of processing steps are qualified by a significant interaction, a second regression analysis
was conducted. The component effects (RC and Steps) were centered around their
respective means and an interactive term formed by their product. The inclusion of this
interactive term in Model 1 (Table 4.3) did not result in a significant increment in R2
(R2change = .04, F(1, 14) = 1.74, p = .209), and therefore does not change the
interpretation of the relationship between relational complexity and number of
processes reported above.
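The increment was evaluated with the usual hierarchical test on the change in R² (standard form; $m$ is the number of terms added, here the single product term):

$$F_{change} = \frac{(R^2_{full} - R^2_{reduced})/m}{(1 - R^2_{full})/(N - k_{full} - 1)},$$

which with $N = 18$ items and $k_{full} = 3$ predictors yields the $F(1, 14)$ values reported in this section.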
4.5.1.2 Number of incomplete cells
One issue that has not been considered is the possibility that the complexity
manipulation has been confounded with the number of incomplete cells in the Latin
square (see Table 4.1). It may be the case that the more elements that are displayed in
the Latin square, the easier the item is, regardless of the complexity. To explore this
possibility, the number of incomplete cells was included as an additional predictor in
the standard regression analyses of Model 1. The zero-order relationship between item
difficulty and number of incomplete cells was not significant, r(16) = .318, p = .20, and
it did not account for any additional unique variation that relational complexity and
number of processing steps had not already accounted for, R2change = .04, F(1, 14) = 1.90,
p = .189.
Implications: This is an interesting result since the complexity manipulation relies on
resolving the ambiguity of one or more cells by combining the appropriate elements to
form a single cognitive representation. The depth of this ambiguity is necessarily
greater as the relational complexity increases since a larger number of elements are
required for resolution (e.g., a quaternary relation is defined by the relationship between
four variables whereas a binary relation entails only two variables). The problem we
face in generating items is that we need to display enough elements to facilitate
solution, and at the same time minimise the likelihood that effort will be diverted to a
time consuming search for the value of irrelevant cells. With this set of
items, the number of incomplete cells does not appear to contribute to item difficulty
above and beyond what relational complexity has already achieved.
4.5.1.3 Complexity of each processing step
Finally, an alternative way of exploring item performance is to consider the influence
that number of processes at each level of complexity per se has on an item’s difficulty.
To this end, the “number of steps” variables in Table 4.1 (2D, 3D, and 4D) were
regressed onto the Rasch based estimates of item difficulty. The results are summarized
in Table 4.4.
Table 4.4
Regression analysis of number of processing steps in predicting item difficulty

                                          zero-order correlations
                          Mean (SD)       logit      RC2       RC3        β      t      Sig.    pr      sr
Item difficulty (logit)  -0.25 (1.67)
RC2 a                     1.00 (1.03)    -0.25                           0.40   1.69   0.11    0.41    0.30
RC3 a                     0.61 (0.70)     0.25     -0.25                 0.62   2.95   0.01    0.62    0.52
RC4 a                     0.22 (0.43)     0.55 *   -0.53 *   -0.28       0.94   3.92   0.00    0.72    0.69

R2 = .57 (adj R2 = .48); F(3, 14) = 6.23, p = .007; * p < .05, df = 16
a RC2, RC3, and RC4 = number of binary, ternary, and quaternary processes for solution in each item respectively
The advantage of this approach is that it considers the relational complexity of each step
entailed in the item and not simply the number of steps regardless of complexity, or the
overall classification of complexity, as the previous analyses have done. The results
indicate that a significant proportion of the variability in item difficulty can be
accounted for in this way, R2 = .57, F(3, 14) = 6.23, p = .007. The nature of this analysis
is such that the meaning of the unique contribution that each variable accounts for is
difficult to ascertain. The reason for this is that by definition, binary items will not
entail complex (ternary or quaternary) steps, but complex items (say quaternary) can
and do contain less complex (binary & ternary) steps. So the number of steps at a
particular level of complexity per se does not give us a good indication of difficulty; we
need to consider the influence of the number of steps at each level of complexity
together. In summary, this provides additional evidence for the influence of relational
complexity and number of processing steps on item difficulty.
4.5.2 Summary
We have made some progress in trying to understand the quantitative relationship
between relational complexity, processing steps, and item difficulty in the Latin Square
Task. The evidence so far does provide strong empirical support for the Halford et al.
(1998a) theory. Consistent with their fundamental thesis, it is not simply the number of
processing steps per se that makes the Latin Square Task difficult, but the amount of
information that has to be integrated in each process is a crucial factor. From the data
we have collected in this study, the effect of the number of processing steps only
becomes an issue in performance when the relational complexity of the item has been
taken into consideration.
4.6 Item Response Time and Relational Complexity
It is instructive to also consider the time to respond to the items as a function of
relational complexity as an additional measure of cognitive demand (see Table 4.2 for
aggregated item response times). A conceptual problem with item-based analyses of
response times is that by using aggregated performance, we are unable to estimate the
influence of individual variability in response times. This is a problem for at least two
reasons. First, if item complexity has a differential effect on the variability of response
time, this will not be detected in aggregated item analyses. Second, there is also the
possibility that reliable individual differences in other factors might interact with the
complexity of the item to influence response time. Once again, these types of effects are
not addressed by traditional item-based regression analyses. We attempt to consider the
first concern by analyzing measures of variability (standard deviations) in response time
in the same way that we analyze mean response times. That is, we are interested in
determining to what extent increases in relational complexity are associated with an
increase in the standard deviation in item response time. The second concern is
considered in more detail in Chapter 7 where measures of additional cognitive abilities
are taken into consideration. The descriptive statistics and the zero-order correlations
amongst the measures are reported in Table 4.5 and a summary of the regression
analyses of mean item response time, and standard deviation in response time are
provided in Table 4.6.
Table 4.5
Descriptive statistics and zero-order correlations between item response time measures and task
characteristics (N = 18 items)a

                                                      Zero-order Correlations
                               Mean     SD      RT       CRT      SDRT     SDCRT    IC        RC
Response Time (RT)            24.81   12.25
Correct RT (CRT)              26.22   14.04   0.99 **
Standard Deviation (SDRT)     23.35   12.73   0.88 **  0.85 **
Standard Deviation (SDCRT)    24.02   13.36   0.86 **  0.86 **  0.97 **
No. of incomplete cells (IC)   7.78    1.35  -0.17    -0.10    -0.09     0.05
Relational Complexity (RC)     2.89    0.76   0.61 **  0.64 **  0.52 *   0.64 **  0.55 **
No. of processes (STEPS)       1.83    0.86   0.35     0.33     0.08     0.00    -0.64 **  -0.30

* p < .05; ** p < .001; a item 4 is included in the analyses of response time
As can be seen from Table 4.5, relational complexity is a significant zero-order
predictor of all the measures of response time, overall (RT), correct (CRT), standard
deviation in overall (SDRT), and standard deviation in correct response time (SDCRT).
The number of processing steps was not a significant zero-order predictor of any of the
measures of response time. These results are consistent with those observed in the
Rasch based measures of item difficulty (Tables 4.3 and 4.4). Table 4.6 effectively
replicates the regression analysis on Rasch difficulty estimates using the various
response time measures as dependent variables. The regression statistics on the left side
of Table 4.6 refer to the mean item response time. Those on the right refer to the
standard deviation in response time. An analysis of the correct response times did not
change the nature of the interpretation of the results and is reported in Appendix B.3.
First we consider the analyses of the mean item response times followed by
consideration of the measures of variability (standard deviation in response time).
4.6.1 Mean Item Response Time
Additive vs. Interactive model: Table 4.6A shows that number of processing steps
becomes a significant predictor of overall response time only once the relational
complexity of the item is taken into consideration, β = .58, t(15) = 3.80, p < .001.
Together, relational complexity and number of processing steps account for 68% of the
variation in mean item response time, R2 = 0.68, F(2, 15) = 15.73, p < .001. This effect
was not moderated by an interaction between RC and STEPS in predicting mean
response time, R2change = 0.05, F(1, 14) = 2.35, p = 0.15 (Table 4.6B).
Table 4.6
Regression analyses of mean and standard deviation in overall response time as a function of
relational complexity, number of processing steps, number of incomplete cells; and the number
of steps at each level of RC.

A                 Mean RT                                        SDRT
              β      t     sig    pr     sr                 β      t     sig    pr     sr
RC           0.78   5.08   0.00   0.61   0.80               0.60   2.72   0.02   0.52   0.57
STEPS        0.58   3.80   0.00   0.35   0.70               0.26   1.18   0.26   0.08   0.29
             R2 = 0.68, F(2, 15) = 15.73, p < .001          R2 = 0.34, F(2, 15) = 3.77, p = 0.05

B                 Mean RT                                        SDRT
              β      t     sig    pr     sr                 β      t     sig    pr     sr
RC           0.86   5.51   0.00   0.83   0.77               0.69   3.02   0.01   0.63   0.63
STEPS        0.63   4.18   0.00   0.75   0.59               0.31   1.41   0.18   0.35   0.29
RC x STEPS   0.23   1.53   0.15   0.38   0.22               0.27   1.24   0.24   0.31   0.26
             R2change = 0.05, F(1, 14) = 2.35, p = 0.15     R2change = 0.07, F(1, 14) = 1.54, p = 0.24

C                 Mean RT                                        SDRT
              β      t     sig    pr     sr                 β      t     sig    pr     sr
RC           0.98   6.68   0.00   0.87   0.82               0.82   3.50   0.00   0.68   0.68
STEPS        0.32   2.00   0.06   0.47   0.24              -0.03  -0.11   0.91  -0.03  -0.02
IC          -0.50  -2.77   0.02  -0.59  -0.34              -0.55  -1.90   0.08  -0.45  -0.37
             R2change = 0.11, F(1, 14) = 7.66, p = 0.02     R2change = 0.07, F(1, 14) = 1.54, p = 0.24

D                 Mean RT                                        SDRT
              β      t     sig    pr     sr                 β      t     sig    pr     sr
RC2          0.49   2.97   0.01   0.62   0.36               0.09   0.37   0.72   0.10   0.07
RC3          0.92   6.30   0.00   0.86   0.77               0.65   2.89   0.01   0.61   0.54
RC4          0.97   5.76   0.00   0.84   0.70               0.63   2.43   0.03   0.54   0.46
             R2 = 0.79, F(3, 14) = 17.57, p < .001          R2 = 0.50, F(3, 14) = 4.71, p = 0.02

RC = relational complexity; STEPS = number of processing steps; IC = number of incomplete
cells; RC2, RC3, RC4 = number of binary, ternary, quaternary processes respectively.
pr = partial correlation; sr = semi-partial correlation; SD = standard deviation
Number of Incomplete Cells: Table 4.6C indicates that the number of incomplete cells
was a significant predictor of item response time once RC and Steps were considered,
R2change = 0.11, F(1, 14) = 7.66, p = 0.02 (the zero-order correlations with the response
time measures were not significant, Table 4.5). That is, while the number of incomplete
cells does not seem to influence the difficulty of the items over and above the effect of
relational complexity and number of processing steps, it does seem to influence the
length of time required to respond to the item (Note: when only the response times for
correct answers are analyzed this effect becomes somewhat marginal, R2change = 0.06,
F(1, 14) = 3.99, p = 0.07; see Appendix B.3).
Number of Processes at each Level of Complexity: When the number of processes at
each level of relational complexity was considered as an alternative measure of the
combined influence of RC and number of processing steps (Table 4.6D), 79% of the
variation in mean item response time was accounted for, F(3, 14) = 17.57, p < .001.
Summary: Relational complexity and number of processing steps were significant
predictors of mean item response time. Although the number of empty cells does not
influence item difficulty, response times tend to be longer when the number of
incomplete cells is greater.
4.6.2 Standard Deviation in Item Response Time
Additive vs. Interactive model: As summarized in Table 4.5, relational complexity is a
significant zero-order predictor of the variability (standard deviation) in response time,
r(16) = 0.52, p = 0.03. The number of processing steps was not a significant zero-order
predictor of variability in response time, r(16) = 0.08, p = 0.75, and this did not change
when both variables were considered together in the standard regression, β = .26, t(16)
= 1.18, p = .26, (Table 4.6A – right side). There was also no interaction between RC
and number of processing steps on the variation in response times, R2change = .07, F(1,
14) = 1.54, p = .24 (Table 4.6B – right side). This implies that as relational complexity
increases, so too does the amount of variability in item response time. The number of
processing steps however did not seem to contribute to differences in the variability of
item response time.
Number of Incomplete Cells: The influence of the number of incomplete cells on the
variability in response time once relational complexity and the number of processing
steps is considered (Table 4.6C – right side) was not significant, R2change = .07, F(1, 14)
= 1.54, p = .24. That is, the number of incomplete cells did not account for any
additional unique variation in the standard deviation in item response time.
Number of Processes at each Level of Complexity: The alternative measure of the
combined effect of relational complexity and number of processing steps, the number of
processing steps at each level of complexity, together accounted for a significant
amount of the variation in item response time standard deviations, R2 = 0.50, F(3, 14) =
4.71, p = 0.02 (Table 4.6D – right side). Given the strong effect of relational complexity
and non-significant effect of number of processing steps, it is probably the RC effect in
these measures contributing to the prediction.
Summary: The variability in response time to the LST items is predominantly a function
of relational complexity. Increases in relational complexity but not number of
processing steps were associated with greater variability in individual response times. A
possible reason for this finding is that processing in the Latin Square Task might be
increasingly susceptible to individual differences in strategy as relational complexity
increases. We consider this in more detail in Chapter 7.
4.7 Derivation of Relational Complexity Subscale Scores
The relationship between item difficulty and number of processing steps in the current
item pool raises some issues concerning the derivation of composite scores based on
level of relational complexity. If the items were generated such that: (1) there were an
equal number of items at each level of complexity, (2) for each level of complexity
there were an equal number of items for each number of processing steps (i.e., the
manipulations were independent), and (3) that the “effect” of number of steps on item
difficulty is constant at each level of relational complexity (the interaction is not
significant), then the composite scores should have reliabilities that are comparable.
With the current items the components are not independent and there are different
numbers of items at each level of complexity (6 binary, 8 ternary, 4 quaternary). As a
result there may be some uncertainty about the comparability of the scores derived from
the complexity subtests that may bring into doubt the appropriateness of the typical
ANOVA approach. The traditional person-based repeated-measures ANOVAs on
composite raw-score measures were conducted; the justification for these
and the results are provided in Appendix B.4. The findings indicated that binary items
were answered more accurately and quickly than ternary items, and that ternary items
were answered more accurately and quickly than quaternary items. This is consistent
with the regression analyses we have just reported. We continue in the following section
with a Rasch analysis of subscale scores. To this end we consider a graphical
representation of performance based on relational complexity subtest scores generated
using the RUMM software.
4.7.1 Graphical Representation of Relational Complexity
Three subtests were generated using the subtest procedure in RUMM (Sheridan,
Andrich, & Luo, 1998) by combining items that were classified as binary (1, 2, 3, 5, 6),
ternary (7, 8, 9, 10, 11, 12, 16, 18), and quaternary (13, 14, 15, 17). The relational
complexity composites are based on different numbers of items and therefore the
expected score that is plotted in Figure 4.11 to represent the item characteristics curves
is the proportion of items in the subtest that a person of a given ability is expected to
answer correctly. An advantage of the Rasch approach is that person ability is scaled on
the same metric as item difficulty, and therefore provides some indication of subtest
difficulty as well. The insert in Figure 4.11 plots the actual calibrated subtest difficulty
and associated standard errors reported in Table 4.7. The RUMM approach to subtest
generation is to effectively generate 3 new items, one for each level of complexity and
then to assess item fit in the standard way. Using 4 trait groups, each subtest provides a
good fit to the Rasch model (i.e., the χ2 in Table 4.7 are associated with large
Expected Score
1
0.9
Binary
0.8
Ternary
0.7
Quaternary
0.6
0.5
0.4
0.3
0.2
-1.5
-1
Quaternary
Ternary
Binary
-0.5
0
0.5
1
1.5
2
Subtest Difficulty
0.1
0
-6
-5
-4
-3
-2
-1
0
1
2
3
4
5
6
ability
Figure 4.11 Rasch based subtest characteristic curves for binary, ternary and quaternary
items (inset = item locations and standard errors)
103
probability values, suggesting no difference between the observed probability of
success and the modeled probability of success for each subtest as a function of trait
group20).
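The curves in Figure 4.11 are the model-implied proportion of subtest items that a person of a given ability is expected to answer correctly. A minimal sketch of this computation (our illustration; the actual curves and subtest calibrations were produced with RUMM) is:

    import numpy as np

    def expected_proportion(theta, item_locations):
        """Expected proportion correct on a subtest for a person of ability theta (logits)."""
        delta = np.asarray(item_locations, dtype=float)
        p = 1.0 / (1.0 + np.exp(-(theta - delta)))
        return p.mean()

    # Evaluating this over a grid of theta values for the binary, ternary, and
    # quaternary item locations traces out the three curves.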
These data support the expected ordering of the composites based on the relational
complexity theory and suggest a significant separation between the less complex
composites and the quaternary composite. The difference between the binary and
ternary composite is markedly less pronounced.
Table 4.7
Relational complexity subtest composite locations, standard errors, and global test of fit
statisticsa

               Location     SE      Residual      χ2       probability value
Binary         -0.827       0.13     0.574        1.731        0.619
Ternary        -0.163       0.10    -0.259        0.946        0.809
Quaternary      0.990       0.13     0.668        0.481        0.921

a χ2 values based on 4 trait groups
4.7.2 Summary
The a priori prediction that binary items should be associated with better performance
than both ternary and quaternary items, and that performance on the ternary items
should be better than on the quaternary items was supported. Consistent with the
regression analyses, this was observed for all the measures. In the following section we
report a study in which the same 18 LST items were administered to school students.
The aim of the analyses of this next study is to consider the extent to which the current
findings generalise to a different population.
20 Recall that trait groups are defined as a function of ability. In this case, four groups were generated.
4.8 Experiment 4.2: School Students
4.8.1 Participants
Participants were 210 students recruited from two schools (1 primary and 1 secondary)
in the same suburb of Brisbane, Australia (see Table 4.8 for age and gender
characteristics). The students classified as Year 9 and 10 in Table 4.8 were actually
Year 8 and 9 students (respectively) tested in the last week of the academic year. All
other students were tested in the first week of the following academic year (i.e., 8 weeks
later). A small number of students did not complete all the items in the Latin Square
task and therefore the analyses are based on 204 subjects. The students were
administered a variety of other cognitive tasks as part of a larger testing program. Only
the results of the Latin Square task are reported here.
Table 4.8
Demographic characteristics of sample

                                               Mean Age              Frequency
Academic Year        Classification           (SD) in years     Male    Female    Total
Years 4 and 5        Lower Primary             9.16 (0.58)       26       29        55
Years 6 and 7        Upper Primary            11.27 (0.67)       22       36        58
Years 8, 9 and 10    Lower Secondary          13.57 (0.80)       27       37        64
Years 11 and 12      Upper Secondary          16.24 (1.02)        8       19        27
Total                                         12.08 (2.46)       83      121       204

4.8.2 General Procedure
On arrival at the university, the students were escorted to one of three computer rooms
for testing in groups of 8 to 30. Testing of the Year 9 and 10 students was conducted in
the larger computer room. All other students were tested in smaller groups of 8 to 20.
Presentation of the tasks was spread over four hours and three sessions separated by
lunch and afternoon tea. There was at least one experimenter for every eight students –
more for the younger class groups. The Latin Square Task was presented in the morning
session. The tasks were administered on 486 DX computers with 14” monitors. An
experimenter introduced the study by explaining that the focus of the work was to
explore the way people reason and solve problems. The first screen of each task
required students to enter their name and an identification number into the computer
with the assistance of an experimenter if necessary. Students were then asked to follow
the instructions on the screen as the experimenter read them out. This procedure was
used for all the tasks and students were encouraged to ask for assistance if for some
reason the instructions were not clear or if they became confused. The specific
procedure for the Latin Square Task was identical to that outlined above (section 4.3.3).
4.9 Results & Discussion
The focus of the analyses from this experiment is to consider the generalisability of the
previous results to a population other than university students. The relationship between
performance on the LST and the other tasks administered as part of the broader battery
and the implications for relational complexity theory are considered in detail elsewhere
(Halford, Andrews, & Birney, in preparation).
4.9.1 Rasch Analysis
The item statistics based on Rasch estimates and classical test statistics for this sample
are provided in Table 4.9. The analysis indicates that the global fit to the Rasch
measurement model based on the χ2 test of the item × trait interaction was somewhat
poor, χ2(54) = 128.65, p < .001 (based on 4 trait groups); however, there was a
satisfactory level of person separation (separation index = .715). Given the reported
problems with global measures of fit (e.g., Masters & Wright, 1996) a closer
examination of item and person fit was warranted. The item infit and outfit statistics
were computed and are reported in Table 4.9. These indicate that only item 8 had an
outfit statistic below 0.70 and all items were within the acceptable range for infit, 1.00 ±
0.30 (Wright et al., 1994). The rank ordering of calibrated item difficulty scores is
highly consistent with that reported in Experiment 4.1 on the university sample
(Spearman’s rho = .973) and this provides initial support for the generalisability of the LST.
Table 4.9
Descriptive statistics for Rasch and classical analyses

        Rasch Statistics                                                    CTT Statistics (SD in parentheses)
        item        standard    outfit     infit                            p-value                 Overall              correct
item    location    error       MNSQ       MNSQ       chisq     prob        (proportion correct)    response time        response time
1       -1.158      0.18        0.779      0.866       3.11     0.356       0.770                    9.89  (5.71)         9.828  (5.46)
2       -1.327      0.19        0.810      0.908       4.84     0.159       0.789                    9.11  (5.94)         9.487  (5.89)
3       -0.229      0.16        0.995      1.046       3.12     0.354       0.613                   11.81 (10.55)        13.179 (11.64)
4       -1.635      0.20        0.718      0.867       7.27     0.035       0.824                    9.93  (8.32)        10.076  (6.39)
5       -0.187      0.16        0.847      0.902       6.75     0.052       0.608                   14.00 (14.40)        17.425 (16.47)
6        0.842      0.16        1.149      0.991      11.76     0.000       0.358                   12.45 (10.50)        14.351 (10.76)
7       -1.044      0.18        0.919      0.962       2.76     0.413       0.755                   13.47 (18.20)        14.488 (19.56)
8       -0.940      0.17        0.636      0.759      20.56     0.000       0.730                   12.84 (12.69)        14.533 (13.85)
9       -0.333      0.16        0.922      0.966       9.85     0.000       0.632                   14.63 (16.10)        15.898 (14.18)
10       0.101      0.16        1.255      1.089      16.46     0.000       0.529                   15.93 (17.16)        18.040 (17.35)
11       0.336      0.16        1.255      1.089       2.35     0.488       0.475                   19.41 (22.08)        21.352 (22.85)
12      -1.172      0.18        0.943      0.878       1.86     0.589       0.775                   12.12 (11.63)        12.945 (12.21)
13       0.934      0.16        1.086      0.948      15.40     0.000       0.333                   13.74 (13.91)        17.462 (14.53)
14       1.658      0.18        1.118      0.981      12.27     0.000       0.206                   23.01 (28.57)        35.946 (39.86)
15       1.320      0.17        0.954      0.954       4.53     0.185       0.270                   16.21 (21.85)        23.818 (28.24)
16a      0.968      0.16        1.275      0.942       2.75     0.414       0.338                   12.50 (11.32)        17.692 (14.97)
17       1.699      0.18        0.773      0.891       2.19     0.521       0.221                   14.25 (15.94)        20.227 (15.03)
18a      0.167      0.16        1.076      1.092       3.82     0.260       0.520                   16.59 (21.99)        20.152 (26.94)

N = 204; Number of trait groups = 4
Figure 4.12 plots the person infit and outfit statistics; 79.2% of the sample was within
the acceptable infit region. Recall that infit scores below 0.70 tend to be indications of a
response pattern that is too consistent and values above 1.30 suggest response patterns
that are more aberrant and do not fit with expected responses based on the calibrated
item difficulties (see Section 4.4.1.2). Only 6.9% of the sample had infit values greater
than 1.30 and 50% of these (i.e., 8 students) came from the lower primary school group
(the youngest age group). Overall, this provides good support that individuals in this
younger sample tend to be well measured by the Latin Square Task.
An analysis of age differences is provided in Halford et al. (in preparation).
Figure 4.12. Distribution of person infit and outfit values for the school sample response to the
Latin Square Task.
4.9.2 Item Based Regression Analyses
4.9.2.1 Item difficulty
To explore the relational complexity classification, the same series of analyses
conducted on the university sample was repeated for this sample. In all cases the
interpretation is identical to that for the university sample and therefore we only summarise the
results briefly. The descriptive statistics and zero-order correlations are summarised in
Table 4.10. Relational complexity was a significant zero-order predictor of item
difficulty, r(16) = .68, p < .001. The number of processing steps was not a significant
zero-order predictor of item difficulty, r(16) = .19, p = .45, but consistent with the
university sample, STEPS did contribute to a significant proportion of the variation
once relational complexity had been controlled for, R2change = .17, F(1, 15) = 6.85, p =
.02. Together, relational complexity and number of processing steps accounted for
62.7% of the variation in item difficulty, F(2, 15) = 12.63, p = .001 (adjusted R2 = .58).
This relationship was not moderated by an interaction between complexity and number of
processing steps21, R2change = .07, F(1, 14) = 3.12, p = .10; nor did the number of
incomplete cells contribute anything unique to the prediction of item difficulty over and
above that accounted for by relational complexity and number of processing steps,
R2change = .07, F(1, 14) = 3.15, p = .10. This is also consistent with the results for the
university sample (Section 4.5).
Table 4.10
Descriptive statistics and zero-order correlations for LST measures in the school aged sample

                                     zero-order correlations
         Mean     SD       δ        RT       SDRT     CRT      SDCRT    RC        STEPS
δ         0.00    1.07
RT       13.99    3.42   0.64 **
SDRT     14.83    6.09   0.59 **  0.96 **
CRT      17.05    6.20   0.78 **  0.94 **  0.90 **
SDCRT    16.46    8.66   0.62 **  0.93 **  0.96 **  0.94 **
RC        2.89    0.76   0.68 **  0.64 **  0.69 **  0.72 **  0.66 **
STEPS     1.83    0.86   0.19     0.38     0.28     0.22     0.23     -0.30
IC        7.78    1.35   0.34    -0.11    -0.02     0.13     0.07      0.55 *   -0.64 **

δ = Item difficulty; RT = Response Time; SDRT = Standard Deviation in RT; CRT = Correct
RT; SDCRT = Standard Deviation in CRT; RC = Relational Complexity; STEPS = Number of
processing Steps; IC = Number of Incomplete Cells
21 To generate the interaction term, relational complexity and number of steps were centred around their respective means and the product calculated.
4.9.2.2 Item response time
Mean Item Response Time: Relational complexity was a significant zero-order predictor
of mean item response time, r(16) = .64, p < .001. The number of processing steps was
not a significant zero-order predictor of mean item response time, r(16) = 0.38, p =
0.12, but did contribute to a significant proportion of the variation once relational
complexity had been controlled for in the standard multiple regression (R2change = .36,
F(1, 15) = 22.15, p < .001). Together, relational complexity and number of processing
steps accounted for 76.0% of the variation in mean item response time, F(2, 15) =
23.70, p < .001 (adjusted R2 = .73). This relationship was not moderated by an interaction
between complexity and number of processing steps, R2change = .04, F(1, 14) = 3.14, p =
.10. The number of incomplete cells did contribute marginally to the prediction of item
response time over and above that accounted for by relational complexity and number
of processing steps, R2change = .05, F(1, 14) = 4.08, p = .06. These results are consistent
with the findings from the university sample (Section 4.6.1).
Standard Deviation in Item Response Time: Similar to the analyses reported in Section
4.6.2, we consider the differences in variability in item response time to explore the
extent that this was a function of relational complexity, number of processing steps, or
the number of incomplete cells in the Latin square. Relational complexity was a
significant zero-order predictor of the standard deviation in item response time, r(16) =
.69, p < .001. The number of processing steps was not a significant zero-order predictor
of the differences in variability (standard deviation) of item response time, r(16) = 0.28,
p = 0.26, but did contribute to a significant proportion of the variation once relational
complexity had been controlled for in the standard multiple regression (R2change = .26,
F(1, 15) = 15.25, p = .001). This is contrary to the university sample in which only
relational complexity was a significant predictor of the differences in item standard
deviation of response time. Together, relational complexity and number of processing
steps accounted for 74.3% of the variation in the standard deviation of item response time,
F(2, 15) = 21.64, p < .001 (adjusted R2 = .71). This relationship was not moderated by an interaction between
complexity and number of processing steps, R2change = .02, F(1, 14) = 1.32, p = .27. The
number of incomplete cells did not contribute to the prediction of the standard deviation
in item response time over and above that accounted for by relational complexity and
number of processing steps, R2change = .05, F(1, 14) = 2.99, p = .11. This result is
consistent with the university sample (Section 4.6.2).
4.9.2.3 Summary
The results for the analyses of item difficulty and response time measures for the school
sample were on the whole consistent with the university sample. Relational complexity
was found to be a significant predictor of item difficulty and item response time. The
number of processing steps was also found to contribute to the variability in the
aggregated item measures. The only difference that was observed in the school sample
was that in addition to relational complexity, the number of processing steps also
predicted differences in the variability (standard deviation) of item response time. This
was not the case for the older university sample. This might suggest that the younger
group is being influenced relatively more by serial processing demand than the older
group and would be consistent with the research suggesting that individual differences
in working memory occur through childhood and early adolescence (e.g., Rabinowitz,
Howe, & Saunders, 2002).
4.9.3 Comparison of School and University Samples
Our main objective in the following analyses is to consider more closely the differences
between the university and school students’ performance on the LST at an item based
level. Two measures of performance were considered, item accuracy and item response
time. The descriptive statistics for each of the 18 items based on the university sample
are provided in Table 4.2. The descriptive statistics for the school sample are in Table
4.9. The analyses reported below follow the rationale of profile analysis using the
univariate repeated-measures approach (see Tabachnick & Fidell, 1989 for a review of
profile analysis).
Item Accuracy: The univariate approach entailed a mixed repeated-measures/between
subjects ANOVA (18-items × 2 groups) to explore performance on individual items as a
function of group membership (university, school)22 23. The objective was to consider the
degree of consistency in the pattern of results across items and groups and therefore the
effects of interest are the main-effect of group membership and the two-way interaction
between groups and items. As expected, a significant overall difference between the
items was observed, F(17, 4675) = 47.44, p < .001. We have effectively explored these
item differences as a function of relational complexity and number of steps in the
regression based analyses reported in the preceding sections (Sections 4.5 and 4.9.2),
and therefore we do not repeat the analyses here. The main effect for group membership
was also significant, F(1, 275) = 58.88, MSe = .58, p < .001, such that university
students tended to make fewer errors than school students. The magnitude of this effect
was consistent across all items as indicated by the non-significant interaction between
items and group, F(17, 4675) = 1.12, MSe = .17, p = .33. The interaction is plotted in
Figure 4.13.

Figure 4.13. Item difficulty (proportion correct) for university and school samples.
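For readers wishing to reproduce this style of mixed repeated-measures analysis, the sketch below shows one way it could be set up with current software. It is an illustration only: the pingouin package was not used for the analyses reported here, and the data frame layout, column names, and randomly generated responses are all assumptions of the example.

    import numpy as np
    import pandas as pd
    import pingouin as pg

    # Illustrative long-format data: one row per person x item (values are random).
    rng = np.random.default_rng(0)
    long = pd.DataFrame({
        "subject": np.repeat(np.arange(40), 18),
        "group":   np.repeat(["uni"] * 20 + ["school"] * 20, 18),
        "item":    np.tile(np.arange(1, 19), 40),
        "correct": rng.integers(0, 2, size=40 * 18),
    })

    # 18 (item, within) x 2 (group, between) mixed ANOVA
    aov = pg.mixed_anova(data=long, dv="correct", within="item",
                         subject="subject", between="group")
    print(aov[["Source", "F", "p-unc"]])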
22 Individual based data is explored to avoid potential problems associated with aggregating individual differences that can arise in item based analyses. This was deemed particularly important since we did not wish to make any a priori assumptions that the item variances of the DV’s (at least for the response time measures) were homogeneous between groups.
23 Adjusting for violations of the sphericity assumption did not change the interpretation of the results.
Response Time: A similar analysis was considered for item response time. As before
there was a significant difference in response times between items, F(17, 4675) = 50.45,
mse = 260.61, p < .001, and this has been explored more fully in the item based
regression analyses for each group in the preceding sections. There was also a
significant main effect for group, F(1, 275) = 665.91, MSe = 2178.52, p < .001, with the
overall tendency being that university students actually took longer to respond than the
school children. These effects are however qualified by a significant interaction, F(17,
4675) = 17.01, MSe = 260.61, p < .001. That is, the difference between the mean
response time for university students and school students is not constant across the 18
items. Figure 4.14 plots this interaction between items and group membership. Items
have been scaled along the abscissa using the Rasch calibrated logits scale of item
difficulty for the school sample in Table 4.9.
Two aspects of this plot are worth commenting on. First, there is a tendency for the item
response times to be much more consistent for the school sample than the university
sample whose response times tend to vary much more (items 10, 11, and 14 have
particularly long average response times and large standard deviations; see Table 4.2).
However, the interaction between items and groups remained significant with items 10,
11, and 14 removed.

Figure 4.14. Item response time for university and school sample (as a function of the
calibrated item difficulty for the school sample).

Second, the difference between the two samples tends to increase
as the calibrated difficulty of the items increases. The regression lines for each sample
are represented in Figure 4.14 and there is significant divergence as calibrated difficulty
increases, t(32) = 2.19, p = 0.036 (School b = 2.05, t(16) = 3.35, p = .004, University b
= 7.15, t(16) = 3.19, p = .006; Note: b = unstandardised regression coefficient).
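One standard way of testing the divergence of two independently estimated slopes, and the form we assume underlies the reported t(32), is:

$$t = \frac{b_{uni} - b_{school}}{\sqrt{SE_{b_{uni}}^2 + SE_{b_{school}}^2}}, \qquad df = (n_1 - 2) + (n_2 - 2) = 32,$$

with $n_1 = n_2 = 18$ items contributing to each sample's regression.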
4.9.4 Summary
These results indicate that when the 18-item Latin Square Task is administered to a
younger population with a larger age range (8 to 16 years; mean age = 12.08), the
influence of relational complexity followed virtually an identical pattern of results to the
more homogenous sample of undergraduate university students (mean age = 18.62).
Although school students tended to perform a little more poorly overall in terms of accuracy,
they tended to respond more quickly to the items. As the complexity and difficulty of the
item increased the additional amount of time invested by the older group on average
tended to be greater than for the younger school group. One possibility is that as the
items tended to exceed the ability of the younger children (as indicated in the generally
poorer performance of the school children), the amount of time the school students were
prepared to invest on the task declined. That is, consistent with the work of Lohman and
Kyllonen (1983) on spatial reasoning, students might have a preferred processing time
limit for a problem after which they will tend to give up and guess their response from
the work done so far on the item. In any case, what is encouraging here is that we have
been able to replicate the general pattern of results with a second sample whose
characteristics are much more varied than our original university based sample.
4.10 General Discussion
The findings from the two studies are comparable and both indicate that increasing
relational complexity in the Latin Square Task is associated with an increased number
of errors and longer response times. Our findings suggest that the number of processing
steps is also a significant predictor of both errors and response times, but only after the
effect of relational complexity has been considered. There was no interaction between
relational complexity and number of processing steps. Almost by definition, increasing
the relational complexity of the reasoning in the Latin Square Task is associated with
greater ambiguity in the incomplete cells. We therefore partially expected that this
might be a contributing factor in performance. In terms of errors this was not the case.
However, a greater number of incomplete cells did tend to increase the time to respond
over and above that contributed by relational complexity and number of processing
steps. These results appear to generalise to a younger population of students. Although
the number of errors increases in the younger age group as might be expected, there was
actually a decrease in the response latencies. This difference tended to become greater
as the difficulty (relational complexity) of the item increased. We take up similar issues
in Chapter 7.
4.10.1 Alternative Accounts
The relational complexity manipulation in the Latin Square Task is transparent in that
no new rules need to be acquired as complexity increases. In fact there is really only
one rule that needs to be mastered – the defining principle of a Latin square. It is
important to recognise that the processes that we have specified and modelled as
features of the Latin Square task are theoretically driven by the relational complexity
theory. The task was developed with the RC principles in mind and the pattern of results
provide convergent support for our analyses. Of course, it may be the case that
alternative process theories might equally well account for the empirical data
independent of relational complexity. It may be that our relational processing account is
not unique or even important. Here we consider an alternative conceptualisation that has
been raised by independent researchers during the presentation of this material at
national and international conferences. It has been argued that this conceptualisation
does not require the levels of relational integration that we have proposed.
Consider the ternary problem reported in Figure 4.15. A suggested solution protocol for
this item is as follows:
“There is a triangle and a circle in column B, so the target cell must be either a
square or a cross. But, there is a square in the second row so it can’t be a square.
Therefore it must be a cross”
An analysis using our terminology would be:
AND(B1(triangle), B4(circle), D2(square)) → B2(cross)
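The constraint structure of the item can be made explicit with a small sketch (our illustration; cell labels follow Figure 4.15). The column-B information alone leaves two candidates for the target cell; only its intersection with the row-2 information yields a determinate value.

    SHAPES = {"triangle", "circle", "square", "cross"}

    def candidates(row, col, filled):
        """Possible values for a cell given the filled cells, a dict of (row, col) -> shape."""
        in_col = {s for (r, c), s in filled.items() if c == col}
        in_row = {s for (r, c), s in filled.items() if r == row}
        return SHAPES - in_col - in_row

    filled = {(1, "B"): "triangle", (4, "B"): "circle", (2, "D"): "square"}

    print(SHAPES - {"triangle", "circle"})   # column B alone: {'square', 'cross'} - still ambiguous
    print(candidates(2, "B", filled))        # column B and row 2 jointly: {'cross'}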
[Figure omitted: a problem square (columns A–D, rows 1–4) with the target cell B2 marked “?”, a panel of response options, and the completed square.]
Figure 4.15. Example ternary LST item
The argument that has been put to us is that this is not parallel processing but simply
two or three (binary) processing steps pieced together in series where the output of one
step is used as input for the next24. The protocol itself is highly plausible and one that
we believe would be used frequently. However, we are yet to be convinced that the solution
does not entail simultaneous processing. Our reasoning is as follows: The possibility of
the target cell being either a triangle or a circle is eliminated by their presence in
column B. However, there is still ambiguity at this stage of the processing as a
determinate solution is not possible – the missing cell is either a square or a cross but
we do not know which. To solve the problem, the reasoner needs to consider sources of
information beyond column B, but not independent of what is already known about the
elements in column B – this is a requirement of the Latin Square design. With some
searching of neighbouring cells, the square in D2 will be discovered. If the rules are
understood, it will be clear that this element must constrain the possible values in the
target cell (as in fact any element in row 2 will). Yet this piece of information alone
cannot provide a determinate solution of the target cell. It needs to be considered in
conjunction with the current representation of the task (i.e., square or cross). This new
information (i.e. not square) needs to supplement the representation of the task to derive
a solution. Essentially, what we are musing on here is the idea that parallel processing
in practice does not mean that the representation of the task has to be built
instantaneously (as the tensor model of Halford et al., 1998a does). Pertinent elements
can be brought together over time to build a single representation – a relation. If it is
only at this time that a solution is possible, then the structure of the relation at this point
determines the complexity of the process.
24 I would like to thank Gary Pearce who was among the first to bring this to my attention.
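To make this candidate-elimination protocol concrete, the sketch below (a purely illustrative reconstruction in Python, not the software used in the studies; the cell coordinates follow Figure 4.15) applies the column-B constraint and then the row-2 constraint to the target cell B2. The point at issue in the text is whether these two constraints can be applied as independent serial steps or must be integrated into a single relational representation before a determinate solution is possible.

```python
# Illustrative sketch of the suggested solution protocol for the ternary item in
# Figure 4.15 (hypothetical coordinates; not the original task software).
SHAPES = {"triangle", "circle", "square", "cross"}

# Known (filled) cells relevant to the target cell B2.
known = {("B", 1): "triangle", ("B", 4): "circle", ("D", 2): "square"}
target_col, target_row = "B", 2

candidates = set(SHAPES)

# Step 1: eliminate shapes already present in column B.
candidates -= {shape for (col, row), shape in known.items() if col == target_col}
print(sorted(candidates))   # ['cross', 'square'] -- still ambiguous

# Step 2: eliminate shapes already present in row 2. Only when this constraint is
# combined with the current representation does the solution become determinate.
candidates -= {shape for (col, row), shape in known.items() if row == target_row}
print(sorted(candidates))   # ['cross'] -- determinate solution
```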
It is important to note that this is conceptually different from the idea of segmentation in
multiple step Latin Square problems (e.g., Figure 4.7). In these problems, each
intermediate step results in a determinate solution of one or more cells whose value is
necessary for solution of the target cell (or other empty cells). If there were ambiguity in
the value of the cell in an intermediate step then it would not qualify as a separate
“segment” – the values of other cells would be needed to determine its status. This is
consistent with the chunking and segmentation principles that were introduced in
Chapter 2 – where variables interact, segmentation or chunking is not possible while still
maintaining the information about the relationship between the variables. Future
experimentation and analysis of response times and eye-movement data might further
our understanding of these issues.
4.11 Conclusion
Our main objective in designing the Latin Square Task was to use the principles of
relational complexity theory to develop a manipulation with a priori complexity effects.
We have aimed to minimise confounds that might weaken the identification and
interpretation of what constitutes cognitive complexity. For instance, the influence of
prior experience has been minimised by the novelty of the task. Future research will
follow up on the influence of learning. It is clear from the current analyses that levels of
relational complexity in the Latin Square Task are differentiable in terms of individual
performance, consistent with the predictions. The empirical results indicate that the task
also fits well with the tenets of the Rasch measurement model. As such, we have
empirical evidence to suggest that the measures of performance that are obtained
quantify what we might speculate is relational reasoning in the Latin Square Task. We
will consider the implications of this further in Chapters 5 and 7. Further, the number of
processing steps is associated with item difficulty only after statistically controlling for
relational complexity. Together, these two factors accounted for an impressive 66.5% of
the variation in item difficulty for the university sample, and 62.7% of the variation in
item difficulty of the school sample. We therefore have preliminary evidence that the
measurement properties of the LST generalise to a wide range of ages. In Chapter 7 we
report an experiment that manipulates serial processing in the LST independently of
relational complexity. This work attempts to map reasoning in the Latin Square Task
with established measures of cognitive processing (i.e., measures of fluid- and
crystallised-intelligence and short-term memory).
4.12 Modifications to the LST item database
The studies reported here were designed to assist in identifying some of the factors that
contribute to the psychometric properties of the Latin Square Task. It seems clear that
number of processing steps is a source of difficulty and that the current set of items does
not manipulate this independently of relational complexity (see Section 4.5). As a
result, a series of modifications to the existing item base were employed for the studies
reported in the following chapters. The new pool of 36 items consisted of 12 items at
each level of complexity – half of which entailed a single processing step and half of
which included an additional processing step of equal or lower complexity. This new
item pool and the relational complexity analysis are provided in Appendix B.5 and B.6.
CHAPTER FIVE
PROCESSING CAPACITY AND DUAL-TASK PERFORMANCE
5.1 Introduction
It is possible to further explore the properties of the relational complexity metric by
using secondary task performance as an indicative measure of processing capacity
limitations. That is, under appropriate conditions, we might expect that secondary task
performance would be sensitive to increases in the relational complexity of a concurrent
primary task (Halford et al., 1986; Halford, 1993). If this were the case, then we would
have convergent and relatively independent evidence that the manipulation of relational
complexity does impact on available processing resources. We have already argued that
simply demonstrating a monotonic decrease in performance as complexity is increased
does not unambiguously support the manipulation of relational complexity. There are
many factors that can influence task difficulty and therefore in theory, the arguments
presented in Chapters 1 and 2 also apply to secondary task performance. Holding
domain and experimental procedures constant helps reduce the ambiguity to a large
extent; however, statistically we are still left with the problem of discriminating the
effect of complexity from other factors contributing to performance.
5.1.1 Dual-Task Deficit
A decrement in secondary task performance in the traditional dual-task approach may
be seen as a necessary (and possibly sufficient) condition for testing the relational
complexity theory. However, problems with the traditional dual-task methodology have
been frequently reported in the literature and differentiating interference in secondary
task performance that is not due to processing capacity is difficult, if not intractable
(e.g., Navon, 1984). In addition to task specific interference factors, we believe part of
the problem lies in the fact that the traditional dual-task methodology is also susceptible
to many of the concerns associated with the comparison of means approach that we
raised in earlier chapters. In specifying a modification to the dual-task approach, Hunt
and Lansman (1982; see also, Lansman & Hunt, 1982) have attempted to address these
types of methodological concerns in addition to addressing the issue of interference
raised by Navon (1984). Hunt and Lansman argue, for instance, that the traditional
dual-task assumption that the primary task is resource limited requires a re-interpretation.
The implication of the assumption as it stands is that the impact of the primary task on
secondary task performance is conceptualised predominantly as a function of primary
task difficulty – the smaller the resource demands of the primary task the more
resources available to improve performance on the secondary task. Far less emphasis is
given to the influence of the actual amount of resources available to the individual on
task performance, at an individual level. The reinterpretation proposed by Lansman and
Hunt (1982, p. 22) is to consider a task as resource limited if “…performance of an
individual, relative to other individuals, is determined by resource limitations rather
than data limitations.” Such an interpretation requires a more resolute consideration of
the individual differences issues than is afforded by simply comparing means as a
function of primary task difficulty. This is particularly important if aggregation of the
data obscures a reliable relationship between primary task processing and the
individual’s available mental resources. We consider the implications of this a little
further in the next section.
5.1.2 The Implications of Individual Difference in the Dual-Task Paradigm
Bourke and his colleagues (Bourke, 1997; Bourke, Duncan, & Nimmo-Smith, 1996)
have explored the influence of a general attentional factor across a wide selection of
tasks (in number and type) across several experiments. The rationale for this comes
from the foundation of resource theory, that if a general resource or executive
processing system exists, it is assumed to be limited in capacity and of a relatively fixed
level for a given individual. The implication for task performance is that when tasks
compete for these resources, the resources are split entirely between the tasks, and
therefore task performance may be increased through the allocation of more resources.
However, interference between task-specific factors may account for all dual-task
deficits (Navon, 1984). Bourke’s research has attempted to isolate at least some of these
known factors; however, across ten tasks he still found evidence to support the
identification of a general factor (Bourke, 1997; Bourke et al., 1996). While Bourke and
his colleagues acknowledge the overwhelming support of specific interference effects,
they suggest that the general attentional factor functions in combination with specific
factors rather than in their stead. Further, Bourke (1997) concludes that the general
factor has the properties of interval scale measurement.
The approach adopted by Bourke and his associates follows the traditional cognitive
psychology paradigm in that analyses are based on aggregated task scores. There is no
attempt to consider the influence of individual differences in any serious way – an
interesting situation given that the emphasis on a general attentional factor has often
entailed an individual difference approach (e.g., Morrin, Law, & Pellegrino, 1994; Yee,
Hunt, & Pellegrino, 1991; Brookings & Damos, 1991). Even more interesting is that, of
the four separate experiments reported in Bourke (1997) and Bourke et al. (1996), one
experiment used 16 subjects, one had 5 subjects, and the remaining two had only 4 subjects
each – so even if there were an interest in exploring the influence of individual differences, a
sample of 4 or 5 would have made this very difficult25.
25 We identify Bourke’s work as an example to demonstrate the dominance of the comparison of means approach. It also serves as an example of the investigation of a general attentional factor as specified by the various forms of resource theory.
It is common in traditional cognitive psychology research to assume that the cognitive
processes that hold for one person also hold for other people (in the hope of
generalising to all). This assumption is tested by demonstrating the reliability of an
effect through replication of mean differences; it is rarely tested any further. This is in
the face of a whole discipline of psychology (i.e., differential psychology) that reports
reliable individual differences in cognition often using identical tasks as cognitive
psychologists. Of course a response to such comments might be that the focus of the
research are differences in performance on different tasks – not differences between
individuals, and we accept this as an appropriate emphasis. Experimental control and
pre-training to a given level of performance (and then subsequent analysis of response
times, say) are useful and common methodological strategies to reduce the influence of
inter-individual differences. Our concern is that we cannot be sure that even with these
controls, individual differences do not influence performance, or more importantly, that
if they do, their effects are invariant across task conditions (as assumed by Hunt &
Lansman, 1982). Replication is useful for testing the reliability and to some extent the
robustness of the observed effects, but a careful distinction needs to be made. We are
not necessarily concerned with the random effects of inconsequential individual
differences that are typically washed out through aggregation and within-subject
designs. We are interested in the possibility that individual differences are implicated in
performance differentially as a function of task condition, within individuals. Such
intra-individual difference effects are likely to be systematic, and will give the
appearance of consistency over repeated experiments unless they are explicitly
considered. That the flaws identified in the traditional dual-task methodologies by
Navon (1984) and others were not detected earlier gives some indication of the possible
consequences of this level of aggregation. Certainly Navon was quick to caution against
the tendency to take for granted the interpretations in favour of resource theory.
An example of where intra-individual differences might influence performance is
actually identified by Bourke et al. (1996). They showed that strategic trade-off of
resources could occur regardless of which task is designated as primary. That is, people
tended to allocate more resources to the task that they perceived to be more demanding.
More important was the anecdotal evidence that people differed in what they perceived
as demanding (most subjects stated that the tone task was very demanding, yet others
stated that it was easy26, p. 538). Perception of difficulty is likely to be
a function of capacity (or confidence in ability), and this might account for the degree of
resources traded off. Rather than collecting data for a greater number of subjects, as
would be the traditional individual differences approach, Bourke et al. follow the
cognitive psychology tradition and collect data over many more trials. We might expect
that this methodological strategy has the potential to change the very nature of the task.
26 The experiment was based on N = 16.
5.1.3 Cognitive Psychology and Individual Differences
One way in which individual differences in capacity has been considered within a
cognitive psychology framework is demonstrated by Halford, Bain, and Maybery
(1984a; see also Halford, Bain, & Maybery, 1984b). This research attempted to resolve
some of the conflict at the time over whether concurrent memory load interfered with
reasoning. Baddeley and Hitch (1974) provided evidence to suggest memory load
interfered with reasoning (and therefore support for a common processing system
underlying both tasks) whereas Evans and Brooks (1981) failed to find evidence for
interference. In exploring the influence of a concurrent memory load on a mathematical
reasoning task entailing two levels of difficulty, Halford et al. (1984a) incorporated
individual differences into their design by predetermining each individual’s memory
span. The two levels of the mathematical reasoning task were performed while
maintaining a memory load that ranged from 3 items below the individual’s
predetermined memory span to 1 item above. Item recall on the memory task differed
consistently as a function of the difficulty of the concurrent primary task – there were
more errors in recall when the concurrent task was the difficult version of the
mathematical reasoning test. This difference was most pronounced when the concurrent
memory load was equivalent to the span of the individual. In terms of primary task
performance, there was no difference in accuracy on the mathematical reasoning task as
a function of the memory load (consistent with the instructions to give the primary task
priority). The difference in response times between easy and difficult versions of the
primary task significantly increased only when the memory load exceeded the
individual’s predetermined memory span. This was taken to imply that memory and
reasoning share a common pool of resources.
The research of Halford et al. (1984a) was successful in addressing the original
controversy and it did so by incorporating basic individual differences within the
traditional experimental approach. Other researchers such as Boles and Law (1998)
have also used individual differences methodologies (i.e., factor analysis) to augment
task selection for the experimental approach. Their research tested the appropriateness
of differentiated and undifferentiated resource theories. The approach that Hunt and
Lansman (1982) have proposed is known as the easy-to-hard paradigm and it
incorporates individual differences more fully than either the Halford et al. or Boles and
Law approach. In this chapter we consider dual-task evidence related to the Latin
Square Task as it pertains to the relational complexity manipulation using both the
traditional and the easy-to-hard methodologies. We begin with further background
on some of the pertinent issues in dual-task assessment in general and the easy-to-hard
paradigm in particular.
5.2 Resource Theory
There are three basic tenets of resource theory as specified by Norman and Bobrow
(1975). First, as we have alluded to already, performance on a task is a function of the
resources allocated to it. This function is typically monotonic in that additional
resources should improve performance and not be a hindrance (e.g., Halford, 1993).
Second, the amount of resources allocated to the task or tasks depends on the
individual’s capacity. Third, as resources are re-allocated from one task to another,
performance in the first task declines. The first two tenets are graphically represented by
what Norman and Bobrow refer to as a task's Performance-Resource Function (PRF).
Figure 5.1 displays the relationship between performance on three tasks, A, B, and C, as
a function of the amount of resources allocated. The curves are therefore the idealised
performance-resource functions. For practical reasons such as measurement limitations,
only a portion of the PRF range can be assessed in any experiment27. This is referred to
as the assessment region. The PRF for Task A over the assessed region is considered to
be data-limited.

[Figure 5.1 plots performance against the resources allocated for three idealised tasks, A, B, and C, with the assessed region marked; over this region curve A is flat (data-limited) and curve C rises (resource-limited).]
Figure 5.1. Performance resource functions (PRF) for three tasks

27 Measurement is limited by both the characteristics of the task and the individual. The presence of ceiling effects, for instance, indicates a very easy task that might be resource limited at some level of resource allocation, but at a point that is not measurable for the current task. A more difficult version of the task would be required to assess these regions.

The term data-limited suggests that performance on the task can only
be improved by somehow changing the nature of the task; by improving the quality of
the information or data that is available. For instance, investing more resources in
reading the words on this page is not likely to assist the average reader of a PhD
dissertation to recognise the words. Improving the visual clarity of the text (i.e.,
available data) might however improve word recognition. The horizontal nature of the
data-limited function represented by curve A in Figure 5.1 indicates that performance is
independent of the resources allocated to the task – regardless of the resources
contributed, performance remains the same.
Task C is resource-limited. That is, an increase in the resources allocated to the task
results in a performance increase as reflected by the positive gradient over the assessed
region. To take the previous analogy a little further, comprehension of the written words
on this page might be improved if reading were conducted in a quiet room rather than in
front of the television. The resources required to screen out the noise from the television
might be allocated to comprehension of the written text. Task B is an interesting case
since it changes from being resource-limited to data-limited within the assessed region.
That is, allocating additional resources to the task increases performance up to a point
after which no increase is recorded. The idea of data- and resource-limited task
performance seems to be well accepted (Wickens, 1991, 1980; Damos, 1991b; Halford,
1993) and methodological extensions of the dual-task procedures such as the easy-to-hard paradigm have been developed based on this understanding (Hunt & Lansman,
1982; Colle, Amell, & Ewry, 1988). Resource theory and the notion of data- and
resource-limited performance are particularly important in the selection of tasks to
include in a dual-task paradigm (Fisk et al., 1986; Damos, 1991b; Brookings & Damos,
1991).
The third tenet of resource theory states that performance in one task declines as
resources are diverted away to a second task. This results in a dual-task trade-off or
deficit and can be conceptualised as the result of combining the PRF of the two tasks.
Norman and Bobrow (1975) refer to this as a Performance-Operating Characteristic
(POC) where performance on one task is plotted as a function of performance on
another. Figure 5.2 describes a POC for two tasks that are perfectly related in that a
reduction in resources on one task results in the same amount of resources being
allocated to the other28. In Task A performance has reached a data-limited condition at
P. If we continually re-allocate resources from the first task (A) to the second (B), an
increase in performance in the latter occurs. Up to point Q, sufficient resources are
available for maximum Task A performance. From Q to R reallocating resources to
Task B results in a decrease in performance in Task A and continued increase in Task B
performance. Both tasks are said to be resource-limited during this range since
additional resources to both tasks would result in performance improvement. At point R,
Task B becomes data limited. Allocating additional resources from Task A does not
increase Task B performance but it continues to reduce Task A performance.
[Figure 5.2 plots Task A performance against Task B performance. Up to point Q Task A remains data-limited (at the level reached at P), between Q and R both tasks are resource-limited, and between R and S Task B is data-limited.]
Figure 5.2. Performance Operating Characteristic curve (Norman & Bobrow, 1975)
28 Of course, using the multiple resource approach, it might be the case that allocation of resources to one task has no impact on the functioning of the other.
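The POC trade-off described above can be sketched numerically. The following illustration (our own idealisation; the piecewise-linear functional form and the particular saturation points are arbitrary assumptions, not part of Norman and Bobrow's account) defines simple PRFs for two tasks and traces out a POC by reallocating a fixed resource budget from Task A to Task B.

```python
# Idealised sketch of performance-resource functions (PRFs) and the resulting
# performance operating characteristic (POC). Functional forms are arbitrary.
def prf(resources, saturation):
    """Performance rises linearly with resources (resource-limited) until it
    reaches a ceiling of 1.0 (data-limited)."""
    return min(1.0, resources / saturation)

TOTAL = 1.0  # fixed processing capacity for the individual

# Trace the POC by shifting resources from Task A to Task B.
for step in range(11):
    r_b = step / 10
    r_a = TOTAL - r_b
    print(f"Task A: {prf(r_a, saturation=0.6):.2f}   Task B: {prf(r_b, saturation=0.8):.2f}")
```

Plotting the two columns against each other reproduces the qualitative shape of Figure 5.2: Task A stays at ceiling while it remains data-limited, the two tasks trade off over the middle allocations, and Task B eventually saturates while Task A continues to decline.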
5.3 Dual-Task Assumptions
If legitimate conclusions are to be made about resource demands based on a dual-task
methodology, certain assumptions need to be met. Different types of task interactions
will produce different POCs and while these can be investigated to test some of the
assumptions that are frequently made, others require explicit experimental control. Fisk
et al. (1986) outline three implicit assumptions often made that can lead to problems
interpreting dual-task data. These are associated with (i) practice effects, (ii) primary
task priority, and (iii) task interference.
5.3.1 Practice Effects
First, it is inappropriate to assume that performance does not change quantitatively or
qualitatively with practice. To use the terminology of Schneider and Shiffrin (1977),
any task using consistent response mapping has the potential to become "automatized".
When learning is influencing secondary task performance the shape of the POC will
change as a function of practice. That is, the relationship between the two tasks will
change with practice. Damos (1991a) therefore suggests that stable single and dual-task
performance should be reached before data is collected.
5.3.2 Priority of Primary Task
A second assumption that can lead to interpretation problems is related to trade-off
strategies. An essential proposition of the dual-task paradigm is that individuals devote
maximum resources to the primary task. Primary task performance in the dual-task
condition can decline when subjects allocate resources to the secondary task in
preference to maintaining single task levels of performance in the primary task. The use
of appropriate feedback has been shown to keep subjects suitably attentive to the
primary task (Damos, 1991a). The type of secondary task used can also discourage a
trade of resources from the primary task. Fisk et al. (1986) argue that a secondary task
that demands processing resources throughout the duration of the primary task helps
prevent a strategy that drains resources from the primary task. In addition to these
proactive strategies to ensure maximum resources are devoted to the appropriate task,
the primary task measure itself should be sensitive enough to detect if this is not the
case. We cannot simply assume that trade-off strategies are not used because task
instructions demand that primary task performance be maintained at single task levels.
5.3.3 Task Interference
The third assumption is the most important of all, as it requires consideration of the
structure of the resource system. Undifferentiated resource theories such as the one
Kahneman (1973) proposed, essentially entail a general pool of resources that can be
used by any task that demands them. Boles and Law (1998) argue that the hemispheric
resource theory of Friedman, Polson, and their associates (e.g., Friedman, Polson, &
Dafoe, 1988; Polson & Friedman, 1988) is also effectively an undifferentiated model.
Here each hemisphere is conceptualised as an independent undifferentiated resource
pool that differs in location and not type (i.e., any task can use resources from either
hemisphere, although one hemisphere is typically more efficient in processing verbal
information and the other spatial information). The research of Boles and Law
demonstrated support against the undifferentiated resource theory in favour of Wicken’s
differentiated multiple resource theory (Wickens, 1991). If we accept this and other
evidence against undifferentiated resource model (e.g., Wickens, 1991, 1980; Fisk et al.,
1986), then pairing any commonly used secondary task with a primary task will not
necessarily produce a reliable and valid measure of processing capacity. The extent that
tasks require overlapping resources will determine the degree of dual-task interference.
According to Wickens (1991) the implications can be conceptualised by considering
three dimensions that influence the suitability of a primary-secondary task combination.
They are a) stage-of-processing, b) codes-of-processing, and c) input-modality effects
and are described in more detail below.
5.3.3.1 Stage of processing.
Wickens (1991) proposes that different resources are entailed as a function of two
processing stages. The first are resources related to perceptual-cognitive activity such as
display reading, voice comprehension, mental rotation, and calculation. The second set
is related to response processes, such as covert or overt speech production, or more
physical tasks such as cursor positioning and switch throwing (e.g., pressing a
spacebar). The typical finding is that when two tasks compete for resources from the
same stage of processing, interference in performance on either or both tasks is more
likely than if processing stages were different.
5.3.3.2 Codes of processing.
The codes-of-processing factor is typically dichotomised into verbal and spatial
processing codes. Evidence for a distinction between verbal and spatial factors is
commonplace in many areas of psychology. For instance, Baddeley (1986; Baddeley &
Hitch, 1974) and many others have presented evidence that supports the
conceptualisation of working memory as entailing verbal and spatial systems (the
articulatory loop and visual-spatial sketch pad, respectively). Resource systems have
also been proposed around verbal and non-verbal (spatial) tasks based on hemispheric
specialisation (e.g., Polson & Friedman, 1988), and the psychometric literature abounds
with distinctions between verbal and spatial abilities (e.g., Horn & Cattell, 1966;
Carroll, 1993). Within cognitive psychology, reasoning theorists debate the utility of
deduction rules (verbal) and mental models (propositional and spatial) as an appropriate
theory of cognition (Johnson-Laird & Byrne, 1991; Rips, 1989; Rips, 1983; Roberts,
1993). With this in mind, interference due to the codes-of-processing factor is typically
considered to be more pronounced when both the primary and secondary tasks require
the same processing codes (i.e., both verbal or both spatial).
                   Perceptual/cognitive                   Response
Verbal             Verbal perceptual/cognitive tasks      Verbal response task
Spatial            Spatial perceptual/cognitive tasks     Spatial response task

Figure 5.3. Task categorisation as a function of stage-defined and code-defined resource
demands as proposed by Wickens (1991)
Wickens (1991) classifies tasks by the dichotomy between stage-defined and code-defined resources in a matrix such as Figure 5.3. Tasks drawn from the same cell are
likely to be subject to the most interference, tasks from adjacent cells will also be
subject to significant interference, but somewhat less than if the tasks had come from
the same cell. Tasks drawn from diagonal cells are subject to the least amount of
interference.
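A toy rendering of this classification (our own gloss on the matrix in Figure 5.3, using an ordinal three-level rating that is illustrative only) would be:

```python
# Illustrative gloss on Wickens' (1991) stage x code matrix: interference is expected
# to be greatest for tasks drawn from the same cell, intermediate for adjacent cells
# (sharing either a processing stage or a processing code), and least for diagonal cells.
def predicted_interference(task_a, task_b):
    """Each task is a (stage, code) pair, e.g. ('perceptual/cognitive', 'verbal')."""
    shared = sum(a == b for a, b in zip(task_a, task_b))
    return {2: "high (same cell)",
            1: "moderate (adjacent cells)",
            0: "lower (diagonal cells)"}[shared]

print(predicted_interference(("perceptual/cognitive", "verbal"),
                             ("perceptual/cognitive", "verbal")))   # high
print(predicted_interference(("perceptual/cognitive", "spatial"),
                             ("response", "spatial")))              # moderate
print(predicted_interference(("perceptual/cognitive", "verbal"),
                             ("response", "spatial")))              # lower
```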
5.3.3.3 Input-modality effects.
The third factor is somewhat more controversial and is related to interference as a
function of task modality (i.e., visual and auditory). In general, the typical finding is
that cross-modality time-sharing will result in superior performance (less
interference/conflict) than intra-modality time-sharing. Wickens (1991) seems to be less
convinced that visual and auditory input modalities entail distinct resources in the same
way as the components of the stage and code effects discussed previously (see sections
5.3.3.1 and 5.3.3.2). He believes that it may be the case that the conflict here is more a
function of structural features rather than something more central to resources. In any
case, Damos (1991a) argues quite pragmatically that interpretations of dual-task
deficits are more reliable when they are not confounded by modality effects.
5.3.4 Summary
It is quite evident that secondary task selection is not a trivial exercise and that several
factors need to be considered to be able to detect the influence of resource limitations if
they exist. As we describe in more detail below, we have tried to consider these issues
in selection of the secondary tasks (the primary task is the Latin Square Task). In those
cases where the nature of the interference is debatable, we try and consider the
empirical evidence from the traditional dual-task methodology wherever possible. The
easy-to-hard modification explicitly attempts to minimise the influence of many of these
factors using a correlational approach, and to some extent such arguments become less
important (given the two tasks indeed compete for the same resources). We consider the
easy-to-hard paradigm in more detail in the following section.
5.4 Easy-to-Hard Paradigm
The easy-to-hard paradigm proposed by Hunt and Lansman (1982; Lansman & Hunt,
1982) uses a dual-task methodology with a hard and easy version of the primary task.
The defining criterion is that performance on the hard version of the primary task,
which is never presented in a dual-task situation, can be predicted by secondary task
performance presented jointly with the easy version of the primary task. That is, to the
extent that the hard task is resource-limited, performance on the secondary task “...
reflects the total capacity available to the participant because it uses whatever capacity
is left spare while performing the easy primary task” (Halford et al., 1986, p. 621).
Figure 5.4 represents a mapping of the relationship between standardised performance
variations in measures of the easy-to-hard paradigm. It shows that the correlation of
interest is the partial correlation29 between the secondary task (performed in conjunction
with the easy primary task) and the hard primary task (performed alone). Removing
common variation between the hard and easy primary task performed alone (β + λ),
and between the hard primary task and the secondary task performed alone (φ + γ),
results in δ, a relatively clean measure of variation due to resource limitations (Halford,
1993). This technique removes variance which the tasks might share that is not
associated with resource limitations and overcomes the dual-task issue of conflict raised
Secondary task performed with
easy primary task
α
ε
λ
Easy primary task
performed alone
δ
Secondary task
performed alone
φ
β
γ
Hard primary task
performed alone
Partial correlation of interest
Circles represent variation in performance. ‘δ’ is the
squared partial correlation which should be significant if the
assumptions of the easy-to-hard paradigm have been meet.
Figure 5.4. Representation of partial correlation of interest in the Easy-to-Hard
paradigm
29
If we assume that all variables are entered into the analysis at once (as in a standard regression) and that the
dependent variable is the hard version of the primary task performed alone, then using the notation of Figure 5.4, the
squared partial correlation (pr2) is equivalent to δ/(δ + H), where H = unexplained variation in the hard primary task.
Hence, pr2 is the appropriate statistic to consider since only the variation common to the measures of direct interest is
considered. The remaining sources of variation accounted for by other factors are excluded from the derivation.
131
by Navon (1984). Since the hard primary task is never performed in a dual-task
situation, variance that is shared with it and the secondary task cannot be associated
with one task interfering with the other (Halford, 1993). Assuming primary task
performance is maintained, the significance of the partial correlation δ (Figure 5.4), is
that performance on the secondary task is sensitive to the available resources. That is, it
can serve as a measure of capacity.
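In practice the statistic of interest is an ordinary partial correlation, which could be obtained either from a hierarchical regression or directly, as in the sketch below (illustrative only; the variable names and simulated data are placeholders, not the measures analysed later in this chapter).

```python
import numpy as np

def partial_corr(y, x, controls):
    """Correlation between y and x after the variance shared with the control
    variables has been removed (residualise both on the controls, then correlate)."""
    Z = np.column_stack([np.ones(len(y))] + [np.asarray(c, float) for c in controls])
    resid = lambda v: np.asarray(v, float) - Z @ np.linalg.lstsq(Z, np.asarray(v, float), rcond=None)[0]
    return float(np.corrcoef(resid(y), resid(x))[0, 1])

# Placeholder data, one value per participant (purely illustrative):
rng = np.random.default_rng(0)
easy_alone = rng.normal(size=38)                      # easy primary task performed alone
sec_alone = rng.normal(size=38)                       # secondary task performed alone
hard_alone = easy_alone + rng.normal(size=38)         # hard primary task performed alone
sec_dual = sec_alone + 0.5 * hard_alone + rng.normal(size=38)   # secondary task with easy primary

pr = partial_corr(hard_alone, sec_dual, [easy_alone, sec_alone])
print(pr, pr ** 2)   # pr**2 corresponds to the region labelled delta in Figure 5.4
```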
5.4.1 Assumptions
In their specification of the easy-to-hard paradigm, Hunt and Lansman (1982, p. 222)
assume that:
a) the easy version of the primary task is data limited when performed alone
b) the secondary task is data limited when performed alone
c) individuals are resource limited in performing the hard primary task alone
d) individuals are resource limited in performing the easy primary task in
conjunction with the secondary task
e) attentional resources that are allocated to the superordinate task of
coordinating the two concurrent tasks are constant across subjects.
The first four assumptions can be tested to some extent empirically to ensure the
secondary tasks used have the appropriate characteristics, and we consider this in more
detail in the results sections 5.7 and 5.8. Assumption (e) tends to imply, as Hunt and
Lansman point out, that there are no individual differences in general and domain
specific abilities required to coordinate the tasks (e.g., time-sharing). They acknowledge
that this might be a somewhat crude approximation, but point to the literature of the
time that tended to find some difficulty in determining a domain general coordinating
ability. On this evidence they suggest that assumption (e) was reasonable.
Subsequent research related to domain-general and domain-specific coordination abilities has
tended to indicate that if a general coordination ability does exist, it is complex in its
functioning. We have already discussed the research of Bourke and his associates
investigating the function of a general attentional ability (Bourke, 1997; Bourke et al.,
1996). Yee et al. (1991), and others (e.g., Law, Morrin, & Pellegrino, 1995; Morrin et
al., 1994; Pellegrino, Hunt, & Yee, 1989; Yee, Laden, & Hunt, 1994) have used
variations of a coordination task30 to investigate the existence of a distinct coordination
ability that is above and beyond the ability to perform the single component tasks. In
this research the approach to the dual-task deficit is correlational in nature and is similar
to that used in the easy-to-hard approach. Essentially, the aim is to identify consistent
variation in the coordinated task that is unexplained by variation in the single or
component tasks.
The findings from this research tend not to support the existence of a general time-sharing or coordination ability, although there is support for task-specific time-sharing.
For instance, Morrin et al. (1994) report evidence that suggests (a) that the contribution
of a coordination ability tended to decrease with practice; (b) the complexity of the task
influences the role of coordination; and (c) the factor that they identified as coordination
was not correlated with other cognitive ability constructs derived from the
administration of the Armed Services Vocational Aptitude Battery – ASVAB (i.e., general
knowledge; reasoning; processing speed; and technical knowledge). Morrin et al. suggest
that the first two findings, (a) and (b), lend themselves well to models of working
memory that include a supervisory component such as Baddeley’s (1986) central
executive. The authors interpret the lack of correlation with other broad cognitive
ability measures as providing evidence for a within-domain coordination ability as a
distinct entity. This seems consistent with the independent research of Fogarty (1987), which
failed to find a domain-general time-sharing factor (for a brief review see Brookings,
1990). To return to the assumptions of the easy-to-hard paradigm, while it might be
expected that individual differences in dual-task coordination would be included in the
variation partialled out of the analysis (sections α, λ, ε, and φ, in Figure 5.4 – these are
the correlations associated with the dual-task condition), some component of δ might
also be due to individual differences in coordination of two concurrent tasks. Lansman
and Hunt (1982, p. 22) state that “to the extent that subjects differ in allocation
strategies, the secondary task is an invalid measure of spare capacity”. This in
conjunction with the concerns raised above suggests that we should be somewhat
cautious of assumption (e).
5.4.2 Applications of the Easy-to-Hard Paradigm
Further criticisms of the easy-to-hard paradigm are also directed at its assumptions, which
are very specific; task selection has been shown to be crucial to its success.
Studies by Stankov (1987) and Bloem and Damos (1985) for instance have reported that
the easy-to-hard paradigm is inappropriate when the secondary task is more complex
than the simple memory tasks used by Hunt and Lansman. Stankov (1987) explored
capacity limitations using “competing tasks”. Here both tasks are of about equal
difficulty and no a priori distinction is made between the primary and secondary task
(see section 2.3.3.1). He argues that the easy-to-hard approach produces inconsistent
results and a more precise test of the covariance model can be achieved using structural
equation modelling rather than a series of regression analyses (Bloem & Damos, 1985;
Stankov, 1987; Foley, 1997; Foley & Berch, 1997).
5.4.2.1 Complexity and the easy-to-hard paradigm.
Halford and his colleagues have conducted a variety of studies using the dual-task
methodology to explore the role of processing capacity limitations as a function of the
primary task’s relational complexity (Halford et al., 1984a; Halford et al., 1986;
Maybery et al., 1986; Halford & Leitch, 1989). Several of these studies have explicitly
implemented the easy-to-hard approach. Halford et al. (1986) explored the reasoning of
4-6 year old children in the N-term series task using the memory load paradigm (study
1) which is based on the traditional dual-task methodology, and the easy-to-hard
paradigm (study 2). In study 2, the easy level of the task entailed processing a binary
relation and the hard version required transitive reasoning, which has been demonstrated to
enlist ternary-level processing (e.g., Andrews & Halford, 1998). Halford et al. (1986)
conclude that (1) transitive inference ability is capacity limited and (2) that the easy-to-hard
paradigm is superior to the more traditional approach as it helps differentiate
effects due to capacity and those due to interference.
Additional research conducted by Elizabeth Foley (1997; Foley & Berch, 1997) has
explored similar capacity effects in children using an extension of the easy-to-hard
paradigm proposed by Halford (1993). This extension, labelled the double easy-to-hard
paradigm, was designed to test the comparability of processing load in two different
primary tasks. Foley (1997) concluded that both transitive inference and class inclusion
tasks were not only resource limited, but also entailed similar levels of processing. This
is consistent with the relational complexity analyses of these types of tasks (Andrews &
Halford, in press; Halford, Andrews, Dalton et al., in press; Halford, Andrews, &
Jensen, in press) and provides support for the Halford et al. (1998a) conceptualisation of
processing capacity limitations.
5.5 Overview
The aim of this study is to test the relational complexity metric using a measure that is
independent of the task being manipulated. This will be achieved by using both the
traditional dual-task experimental design and the easy-to-hard design that allows testing
of statistical associations in which sources of variation unrelated to the complexity
manipulation can be isolated. Following the procedures and assumptions of Hunt and
Lansman's (1982) easy-to-hard paradigm, secondary task performance will be used as
the measure of processing capacity. The strengths of the easy-to-hard paradigm outlined
above provide a way of removing confounding sources of variation associated with the
dual-task methodology to allow for a relatively clean measure of processing capacity
(see Figure 5.4). The relational complexity theory proposes that more resources are
available for secondary task performance when the primary task is binary rather than
ternary, and when the primary task is ternary rather than quaternary. The current study
explored resource limitations in the Latin Square Task as a function of performance on
two secondary tasks: the Probe Response Time (RT) task and a finger-tapping task.
5.5.1 Secondary Tasks
5.5.1.1 Finger tapping.
The potential for using finger tapping as a secondary task has come from several areas.
Brainerd and Reyna (1989) summarise data suggesting that when finger tapping is
performed concurrently as a secondary task with memory tasks, finger tapping
performance is susceptible to interference. In a study of the cerebral laterality
of left- and right-handers, Bathurst and Kee (1994) investigated interference in finger
tapping as a function of verbal (e.g., anagram-solution) and non-verbal (e.g., Raven’s
progressive matrices) tasks. Only the verbal tasks tended to produce significant
decrements in tapping rate performance although the nature of the verbal task (silent or
aloud) and handedness influenced the results. Friedman et al. (1988) have also used the
finger-tapping task to evaluate a hemispheric resource model in which each hemisphere
is postulated to entail differentiated resources (see also, Polson & Friedman, 1988).
Summers and Pressing (1994) demonstrated that when finger tapping is used as the
primary task, a decrement in a secondary cognitive task can also be observed.
The differentiation of resource demands from other factors has also been investigated empirically in the
finger-tapping task by Klauer, Stegmaier, and Meiser (1997). In Experiment
4 of their study, they found that the null effect of a cognitive task on finger tapping
reported by Toms, Morris, and Ward (1993) might in fact be qualified. When subjects
were trained against using heuristic approaches that had been identified by Evans
(1989) to short cut the reasoning load, tapping was sensitive to a concurrent
propositional reasoning task31. The tapping task was not sensitive to propositional load
for subjects in the control group who were not trained.
31 The tapping task required subjects to tap one of 9 keys on a computer keyboard arranged in a 3 × 3 matrix every 2 seconds. The order of the keys pressed was predetermined, from left to right and then up and down repeatedly.
From this and other work, it is not altogether clear whether the dual-task deficit in these
cases is a function of output interference, and therefore whether the finger-tapping task
serves as a particularly interesting case of response competition (e.g., Brainerd &
Reyna, 1989), or as support for a theory of resource competition (e.g., Bjorklund &
Harnishfeger, 1989; Klauer et al., 1997). It may be the case, as suggested by Howe and
Rabinowitz (1989), that the methodology is not sufficient to separate the two models,
and certainly the issue is far from resolved. Since non-resource-related interference
effects are minimised in the easy-to-hard paradigm, this approach might provide some
further insights into the controversy. An appealing characteristic of the finger-tapping
task is that it is a continuous secondary task whose variation can be used as a measure
of interference (from either a resource or response competition perspective).
5.5.1.2 Probe reaction time.
The probe reaction time (probe RT) task requires subjects to respond as quickly as
possible to a stimulus that is presented at some point in time while solving a primary
task. The probe can take virtually any form and is typically auditory or visual. For
instance, Posner and Boies (1971) had subjects respond as quickly as possible by
pressing a single key when an auditory probe (tone) was presented at some point during
the trial. Alternatively, Fisk et al. (1986) had subjects respond to a visual probe that was
presented in one of four spatial locations marked by a white plus (+). The probe
occurred when one of the white pluses changed to a red asterisk. A response was made
by pressing one of four buttons associated with each spatial location. The assumption
accepted by these and other researchers (e.g., Maybery et al., 1986; Pashler, 1994) is
that speed of responding to the probe provides an index of the residual resources
unoccupied by the primary task (assuming subjects give priority to the primary task).
The Probe RT task is very data sparse and assumes that task presentation can be
controlled in some way so as not to cause stage of processing effects (see section
5.3.3.1). Given the sparsity of probes in anyone trial, it has also been criticised for its
susceptibility to switching and trade-off strategies (e.g., Pashler, 1994). The advantages
of using this task in testing relational complexity theory has been outlined by Maybery
et al. (1986; see also, Halford, 1993; Halford & Leitch, 1989). In this research, the
probe RT task was used to explore processing load during transitive reasoning. The
rationale was that peak processing load would occur when premises have to be
integrated and therefore response to the probe would be most susceptible to interference
at this time. The results were consistent with their hypothesis. The probe RT task has
also been used successfully in the easy-to-hard approach by Hunt and Lansman (1982;
Lansman & Hunt, 1982) and Foley (1997; Foley & Berch, 1997).
5.5.1.3 Rejected secondary tasks.
The random number generation task requires subjects to verbalise a random sequence of
numbers at a set rate. Performance measures are based on rate and the frequency of non-random sequences. Gilhooly, Logie, Wetherick, and Wynn (1993) used both the random
number generation and tapping tasks as secondary tasks with syllogistic reasoning as
the primary task and concluded that the random number generation task imposes central
executive load. One of the concerns we had with this task is that it has been reported to
be susceptible to a strategic trade-off. That is, regardless of which task is designated as
primary, people tend to allocate more resources to the task that they perceive as more
demanding. Hence resources might be syphoned off to protect performance on the
random number generation task, contrary to instructions (e.g., Hegarty, Shah, &
Miyake, 2000; Bourke et al., 1996).
5.6 Method
5.6.1 Participants
In total 38 students enrolled in an undergraduate psychology subject participated in the
study for course credit and were randomly allocated to one of the two secondary task
conditions. An equal number of students attempted each task. Of the 19 students who
received the Probe RT secondary task, 12 were female and seven were male. One
student failed to acquire the target tapping rate in all conditions of the Finger Tapping
task and was therefore excluded from the analyses. The data for the remaining 18
students (14 females and 4 males) who attempted the Finger Tapping secondary task
was analysed.
5.6.2 Primary Task
The primary task used was the Latin Square Task described in Chapter 4. The modified
item pool containing 36 items, 12 at each level of complexity, half containing one step,
half containing two steps, was used (see Appendix B.5 and B.6). The computer
randomly selected 18 items from this pool to present jointly with the secondary task.
The remaining items were presented alone without the concurrent secondary task. The
presentation and procedure remained unchanged. The same introduction and practice
items were used, and subjects made their response by clicking the left mouse button on
their choice using their dominant hand. No further instructions or practice were given in
the dual task condition. Although the binary and ternary items can serve as the “easy”
version of the task, with adults we might expect that the joint presentation of a binary
item with the secondary task might violate assumption (d) from section 5.4.1. That is,
performance may not be resource limited because the demands of the task may be too
low even when performed jointly.
5.6.3 Finger Tapping Task – Single Condition
In the Finger Tapping task (FTAP) subjects were required to press the space bar at a
target rate of one “tap” per second. A blue box containing the word “READY” was
presented for 2s to orient the subject to the start of the trial. This was followed with a
flashing blue square and simultaneous tone that served as a metronome to facilitate
acquisition of the correct tapping rate. Once the target rate (1.00 ± 0.25 taps/s) had been
achieved the flashing blue square and tone were replaced with a static green box
containing the word “MAINTAIN”. The participants were required to keep tapping at
the same rate while the maintain box remained on the screen. A red box containing the
word “STOP” indicated the end of a trial. Subjects were asked to use their non-dominant hand to press the space bar and their dominant hand to control the mouse (to
click on the “Next Item” button and to make a response to the Latin Square Task in the
dual task condition). The trial duration was an integer value determined randomly from
within the range of 20 to 35 seconds so as to roughly coincide with the typical solution
time for Latin square items. There were 10 trials in the single task condition.
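The acquisition criterion can be illustrated as follows (our own sketch; the text specifies only the 1.00 ± 0.25 taps/s target, so the three-interval window used here is an assumption about how the original program might have checked the rate).

```python
# Illustrative check of the tapping-rate acquisition criterion (target 1.00 +/- 0.25 taps/s).
# The three-interval window is an assumption; the original program's rule is not described.
def rate_acquired(tap_times, window=3, target=1.00, tolerance=0.25):
    """tap_times: timestamps (in seconds) of successive space-bar presses."""
    if len(tap_times) < window + 1:
        return False
    recent = tap_times[-(window + 1):]
    intervals = [t2 - t1 for t1, t2 in zip(recent, recent[1:])]
    rates = [1.0 / i for i in intervals if i > 0]
    return len(rates) == window and all(abs(r - target) <= tolerance for r in rates)

print(rate_acquired([0.0, 1.1, 2.1, 3.0, 4.0]))   # True: recent rates lie within 0.75-1.25 taps/s
print(rate_acquired([0.0, 0.4, 0.8, 1.2, 1.6]))   # False: tapping too fast (2.5 taps/s)
```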
5.6.4 Finger Tapping Task – Dual Condition
In the dual-task condition subjects were required to perform the finger tapping task
before (pre-LST), during (trial-LST), and after attempting to solve a Latin Square item
(post-LST). The general procedure was the same as for the single task condition.
Subjects were instructed to give the Latin Square Task priority while not ignoring the
secondary task and to keep tapping at the target rate until the red “stop box” was
displayed. Subjects were instructed to press the space bar with their non-dominant hand
and to respond to the Latin Square Task using the mouse with their dominant hand. As
for the single condition, a blue “ready box” was presented for 2s to orient the subject to
the start of the trial before presentation of the metronome (flashing blue square and
tone). Once the target rate had been achieved the flashing square was replaced with the
“maintain box”. A pre-trial period of 5s preceded presentation of the Latin Square item
and a post-trial period of 5s followed the LST response. The duration of the trial-LST
period was determined by the subject’s response to the primary task.
5.6.5 Probe RT – Single Condition
Subjects were presented an introductory screen describing the procedure for the task.
They were told that they would hear a series of tones and that they should press the
computer spacebar as quickly as possible whenever they heard a tone. Participants were
asked to use their non-dominant hand to respond to tones by pressing the space bar and
their dominant hand to control the mouse. The mouse was used to click on the “Next
Item” button at the end of each trial (and used to make a Latin Square Task response in
the dual task condition). Before each trial a blue box containing the word “READY”
served as an orienting cue and was presented for 2s. A green square containing the word
“MAINTAIN” was then presented and used to indicate the start of a trial of probes. A
red square containing the word “STOP” indicated that the subject could stop monitoring
for tones. In the single task condition 10 trials were presented. As for the finger-tapping
task, the trial duration was selected randomly from integer values between 20 and 35
seconds to roughly coincide with the typical solution time for Latin square items. The
interval between probes within each trial ranged from 800 to 2400 msec and was also
determined randomly by the computer.
5.6.6 Probe RT – Dual Condition
In the dual-task condition subjects were required to monitor and respond to tones before
(pre-LST), during (trial-LST), and after attempting to solve a Latin Square item (post-LST). The response requirement was the same as for the single task condition. Subjects
were instructed to give the Latin Square Task priority while not ignoring the secondary
task and to keep monitoring for tones until the red “stop box” was displayed (the green
“maintain box” was presented to remind students when to monitor for tones). Response
to a probe was with the non-dominant hand and response to the Latin Square Task was
made using the mouse with the dominant hand. As for the single condition, a blue box
containing the word “READY” was presented for 2s to orient the subject to the start of
the trial before presentation of the “maintain box”. The pre-LST and post-LST duration
was 5sec in each case and the duration of the trial-LST period was determined by
responding to the Latin Square item. The inter-probe interval was the same as for the
single task condition (800 to 2400 msec). Subjects were presented 18 trials
corresponding to the 18 Latin Square items.
5.6.7 General Procedure
All tasks were administered on 486 IBM computers with 14” VGA displays and tones
were produced using the standard PC speakers. Participants were randomly allocated
first to either the Probe RT task or the Finger Tapping task, and then to either receiving
the single primary task first or the single secondary task first. All subjects received the
dual-task condition last and were tested in groups of three to four in sound attenuating
booths.
5.7 Finger Tapping: Results & Discussion
As shown in Figure 5.5, four phases were identified within each trial in the dual-task
condition; a) the learning phase in which the target base rate was acquired; b) the 5s
pre-LST phase before the LST item was presented; c) the trial-LST phase which lasted
from when the LST item was presented until the subject made a response; and d) the 5s
post-LST phase which followed the LST response. For comparability, similar phases
were identified in the single-task condition. The learning phase continued until the
subject achieved the target tapping rate. The pre-LST and post-LST phases were the
first and last 5s of each trial respectively, and the duration of the “trial-LST” phase was
determined by the computer. Two performance measures were derived from the elapsed
time between taps for each subject. The core measure was the individual’s variation in
tapping rate, and this was derived from the standard deviation of the elapsed time
between taps within each phase for each trial (SD-score). The individual’s median
elapsed time between taps (MD-score) served as an additional performance score.
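For concreteness, the derivation of the two tapping scores can be sketched as follows (a minimal illustration in Python, not the software actually used; the timestamp format, phase boundaries, and variable names are assumptions made for the example):

import numpy as np

def tapping_scores(tap_times, phase_bounds):
    """Derive the SD-score and MD-score for each phase of one trial.

    tap_times    : array of tap timestamps (ms) within the trial
    phase_bounds : dict mapping phase name -> (start_ms, end_ms)
    Returns {phase: (SD-score, MD-score)}.
    """
    scores = {}
    for phase, (start, end) in phase_bounds.items():
        taps = tap_times[(tap_times >= start) & (tap_times < end)]
        gaps = np.diff(taps)                    # elapsed time between successive taps
        scores[phase] = (np.std(gaps, ddof=1),  # SD-score: variation in tapping rate
                         np.median(gaps))       # MD-score: median inter-tap interval
    return scores

# Hypothetical trial: ~40 taps at an inter-tap interval of roughly 950 ms (illustrative values only)
taps = np.cumsum(np.random.normal(950, 60, size=40))
bounds = {"learning": (0, 10_000), "pre-LST": (10_000, 15_000),
          "trial-LST": (15_000, 35_000), "post-LST": (35_000, 40_000)}
print(tapping_scores(taps, bounds))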
A preliminary inspection of the data revealed two extreme scores. The first was
recorded in a very early stage of the experiment - the very first tap of trial 1 in a pre-LST session of the single task condition. We suspect that this subject may have been
momentarily confused about the instructions of the task so the extreme score was
replaced with the subject’s median pre-LST elapsed time from the remaining trials. This
subject also had an extreme score for the post-LST ternary items in the dual task
condition (approx. 11 SD’s above the mean elapsed time score). The subject’s SD-score
for this phase was truncated to 3 standard deviations above the mean SD-score
calculated for the remaining subjects. No other modifications were made to the data.
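The truncation rule for the second extreme score can be illustrated with a small sketch (again Python, a hypothetical reconstruction rather than the original procedure):

import numpy as np

def truncate_extreme(scores, idx, n_sd=3):
    """Cap the score at position idx at mean + n_sd * SD of the remaining subjects."""
    others = np.delete(scores, idx)
    ceiling = others.mean() + n_sd * others.std(ddof=1)
    capped = scores.copy()
    capped[idx] = min(capped[idx], ceiling)
    return capped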
Figure 5.5. Phases in the finger-tapping task (dual-task condition): learning, pre-LST (5s), trial-LST, post-LST (5s). In the single-task condition, when no LST item is presented, the duration of the trial-LST phase is determined by the computer. The same terminology is used in the single-task condition; however, the separate pre-, trial- and post-LST phases were not apparent to the subjects.
Table 5.1 summarises for each condition the number of items at each level of
complexity that the subjects received. In total there are 12 Latin Square items at each of
the three levels of complexity. From this pool of 36 items, 18 were drawn at random for
the single-task condition and the remaining 18 were presented in the dual-task
condition. As can be seen from this table, all subjects received at least 3 items at each
level of complexity in each condition, and therefore all dual-task measures are based on
at least 3 trials32.
Table 5.1
Distribution of item complexity in single- and dual-task Latin Square conditions

                            Single                               Dual
             Binary    Ternary    Quaternary     Binary    Ternary    Quaternary
Minimum         4          3           3            3          4           3
Maximum         9          8           9            8          9           9
Mean          6.33       5.89        5.78         5.67       6.11        6.22
Mode            6        5 & 6         6            6        6 & 7         6
SD            1.24       1.32        1.44         1.24       1.32        1.44

5.7.1 Secondary Task Performance: Variation in Tapping (SD-score)
Table 5.2A shows the mean and standard deviations of the variation score (SD-score)
for each phase of the single-task condition, and for each phase of the dual task condition
separately for binary, ternary and quaternary levels of complexity in the Latin Square
Task. The SD-score data were analysed using a repeated-measures ANOVA in which
the within-subjects factors were Task Condition (single, 2D-dual, 3D-dual, 4D-dual)
and Phase (learning, pre-LST, trial-LST, post-LST). Preliminary tests of the repeated-measures ANOVA assumptions indicated that sphericity was violated for
the main-effects. The Greenhouse-Geisser correction did not alter the interpretation so
the unadjusted results are reported.
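As a sketch of how this design could be specified, the analysis below uses the AnovaRM class from statsmodels; the long-format data file and column names are assumptions, and AnovaRM reports unadjusted univariate tests (the Greenhouse-Geisser adjustment mentioned above would need to be applied separately):

import pandas as pd
from statsmodels.stats.anova import AnovaRM

# Long-format data: one SD-score per subject x condition x phase cell.
# Assumed columns: subject, condition (single / dual-2D / dual-3D / dual-4D),
#                  phase (learning / pre-LST / trial-LST / post-LST), sd_score
data = pd.read_csv("finger_tapping_sd_scores.csv")

model = AnovaRM(data, depvar="sd_score", subject="subject",
                within=["condition", "phase"])
print(model.fit())  # F tests for the two main effects and their interaction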
32 The initial intention was to present an equal number of LST items from each level of complexity
however an error in the programming of the task meant that this was not possible. The high level of
internal consistency of the items within levels of complexity as demonstrated in Chapter 6 suggested that
differences in reliability of measurement for individuals with different numbers of items is likely to have
a minimal effect.
Table 5.2
Task Condition by Phase: Means and standard deviations (in parentheses) for measures of A)
SD-score, and B) MD-score in finger tapping

A. SD-score: Within-subject standard deviation in elapsed time between spacebar taps
                                          Phase
Task Condition   Learning           pre-LST            trial-LST          post-LST
Single           106.16 (22.45)     67.07 (23.17)      67.70 (31.42)      57.77 (14.40)
Dual - 2D         67.78 (24.71)     67.78 (24.71)     106.40 (53.41)     103.36 (60.76)
Dual - 3D         62.91 (17.94)     62.91 (17.94)     100.74 (54.14)     104.31 (62.87)
Dual - 4D         60.35 (19.05)     60.35 (19.05)      97.27 (58.43)     128.51 (104.32)

B. MD-Score: Within-subject median elapsed time between spacebar taps
                                          Phase
Task Condition   Learning           pre-LST            trial-LST          post-LST
Single           933.64 (66.33)     933.64 (82.52)     894.48 (96.94)     879.83 (106.16)
Dual - 2D        956.27 (24.56)     963.23 (70.66)     936.56 (88.31)     922.59 (78.95)
Dual - 3D        952.42 (42.70)     962.34 (74.64)     953.13 (75.04)     945.08 (85.55)
Dual - 4D        965.49 (28.74)     966.89 (63.28)     954.85 (84.36)     941.64 (83.66)
The main-effect for Phase was significant, F(3, 51) = 10.06, MSe = 1805.92, p = .001.
Post-hoc analyses indicate that overall, the variation in tapping rate was much greater in
the trial-LST phase than in either the learning or pre-LST phases, t(17) = 2.84, p = .011, and t(17) = 4.37, p < .001, respectively. There were no significant differences in SD-scores overall between the trial-LST and the post-LST phases, t(17) = 0.97, p = .347.
While the main-effect for task condition was not significant, F(3, 51) = 1.30, MSe =
1710.91, p = .284, this result is qualified by a significant interaction between the
factors, F(9, 153) = 10.62, MSe = 859.32, p < .001. An examination of the plot of the
interaction shown in Figure 5.6 suggests that effects related to differences between
single- and dual-task conditions need consideration33.
33 Because of the potential problem of using pooled error terms to follow-up specific comparisons in a
repeated-measures design, all follow-up tests are based on the appropriate lower order repeated-measures
ANOVA using only the data for the comparison of interest, unless otherwise indicated.
Figure 5.6. Variation in tapping rate (mean SD-score) at each trial phase (learning, pre-LST, trial-LST, post-LST) by task condition and complexity (single, 2D dual, 3D dual, 4D dual)
Learning Phase: Helmert contrasts on just the learning phase data indicate that variation
in tapping rate is more pronounced in the single-task condition than in all three dual-task conditions combined, F(1, 17) = 95.31, MSe = 340.74, p < .001. There were no
differences in the overall variability of tapping rate between the binary dual-task
condition and the ternary and quaternary dual-task conditions combined, F(1, 17) =
1.137, MSe = 598.39, p = .301, nor was there a significant difference between the
ternary and quaternary dual-task conditions, F(1, 17) = .956, MSe = 124.19, p = .342. Since the single-task condition was always presented before the dual-task condition, and the learning phase requirements were identical in both conditions, it is not possible to isolate contributing factors independent of the influence of practice (the likely candidate). The increased variation in tapping during the learning phase of the single-task condition nevertheless suggests that there are some differences in subjects' ability to establish the target tapping rate.
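The Helmert contrasts used here compare each condition with the mean of the conditions that follow it in the ordering (single, 2D-dual, 3D-dual, 4D-dual). A brief sketch of the contrast weights (illustrative Python; the weights follow from this ordering, and the function name is invented for the example):

import numpy as np

# Each row is one Helmert contrast over the ordered conditions
# (single, dual-2D, dual-3D, dual-4D).
helmert = np.array([
    [3, -1, -1, -1],   # single vs all three dual conditions combined
    [0,  2, -1, -1],   # binary dual vs ternary & quaternary dual combined
    [0,  0,  1, -1],   # ternary dual vs quaternary dual
])

def contrast_scores(condition_means, weights):
    """Apply a contrast to each subject's four condition means; the resulting
    per-subject scores are then tested against zero (the contrast F-test)."""
    return condition_means @ weights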
Single Task Performance: It is important to note that evidence that practice effects
continued into other phases of the trial is weak. If only performance in the single-task
condition without the learning phase is considered, variation in tapping rate remained
constant across the three phases, F(2, 34) = 1.28, p = .219. Variation in tapping rates
was also constant in the single- and dual-task conditions in the pre-LST phase (prior to
the presentation of the LST item), F(3, 51) = .837, MSe = 266.25, p = .480. This also suggests that practice effects were not present in the main part of the trial, that subjects generally reached a stable level of variability before the LST item was presented, and that this level was equivalent to the variability in the single-task condition.
5.7.1.1 Dual-task deficit: Relational complexity and finger tapping variability.
To test for the effect of LST complexity on secondary task performance, the simple-effect of condition for the trial-LST phase was analysed. The repeated-measures
ANOVA suggested a significant difference between conditions, F(3, 51) = 4.175, MSe
= 1290.06, p = .010. Helmert contrasts indicated that although the variation in tapping
rate was significantly less in the single-task condition than in the dual-task conditions
combined, F(1, 17) = 8.611, MSe = 2382.61, p = .009, it did not differ as a function of
the relational complexity of the LST item. There were no differences in the SD-score
between binary items, and ternary and quaternary items combined, F(1, 17) = .392, MSe
= 2510.52, p = .540. Nor was there a significant difference between ternary and
quaternary items, F(1, 17) = .264, MSe = 817.58, p = .614. These results suggest that
the finger-tapping task is sensitive to the processing load generated by the concurrent
primary task but is not able to differentiate between levels of complexity. These are
important results for the relational complexity theory and will be discussed in more
detail once we have considered the second measure of performance in this secondary
task (i.e., median time between taps).
The simple-effect of task condition for the post-LST data was also significant, F(3, 51)
= 4.30, MSe = 2493.72, p = .001. Helmert contrasts indicated that the variation in
tapping rate was significantly less in the single-task condition than in the combined
dual-task conditions, F(1, 17) = 16.39, MSe = 3237.60, p = .001. There were no
significant differences in SD-scores between levels of complexity (binary vs ternary &
quaternary, F(1,17) = 1.10, MSe = 2780.06, p = .309; ternary vs quaternary, F(1,17) =
1.65, MSe = 6399.16, p = .217). These results suggest that the variation in finger-tapping rate is maintained beyond solution of the LST item, for at least the 5 seconds of the post-LST phase in which data were recorded.
5.7.2 Secondary Task Performance: Median Elapsed Time Between Taps
Table 5.2B shows the mean and standard deviations of the median time between taps for
each phase in the single and dual-task conditions. Adjusting for violations of the
sphericity assumption (Greenhouse-Geisser correction), the task Condition (4) × Phase
(4) repeated-measures ANOVA on the median score indicated a significant main-effect
for task Condition, F(3, 51) = 6.77, p = .012, and a marginally significant main-effect
for Phase, F(3,51) = 2.84, p = .067. The interaction was not significant, F(9,153) = 1.29,
p = .29, which suggests that the influences of task condition and phase are stable across
the different levels of the alternative factor. Further analyses of the main-effect for
Phase indicated that the median elapsed time between taps for the pre-LST phase (M =
956.53) was significantly longer than the median for both the trial-LST (M = 934.75)
and post-LST (M = 922.29), t(17) = 3.7, p < .001 and
t(17) = 2.75, p = 0.01,
respectively. Since this effect was similar in the single task condition, it suggests a
tendency for an increase in the tapping rate over time (i.e., a decrease in the elapsed
time between taps), rather than something specifically related to the concurrent load of
the primary task. There were no other significant differences for the Phase factor.
Multiple comparisons for the levels of task Condition indicated that overall, the mean
median elapsed time between taps was significantly shorter in the single task condition
(M=910.40) than in all the dual task conditions (single vs 2D, t(15) = -2.36, p = 0.03;
single vs 3D, t(15) = -2.75, p = 0.01; single vs 4D, t(15) = -2.79, p = 0.01). Overall,
the mean median elapsed time between taps was also significantly shorter during the
binary items than the quaternary items, t(15) = -2.28, p = 0.04. It is important for our
predictions to consider the influence of relational complexity on the secondary task
especially during the trial phase, even though the lack of an interaction suggests that the
omnibus effect generalises34. The simple-effect of complexity during the trial phase was
significant, F(3, 51) = 9.13, p = .002. As would be expected from the omnibus tests and
as seen in Figure 5.7, during the trial phase all levels of the LST resulted in an increase
in the mean median elapsed time between taps. This effect tended to be less pronounced
when the primary task was binary. The mean median elapsed time between taps was
significantly shorter when the concurrent primary task was binary compared to
34 That is, the interpretation of the main-effect holds for each level of complexity.
quaternary, t(17) = -2.14, p = 0.05. The difference between binary and ternary
concurrent loads was in the same direction but not reliable, t(17) = -1.68, p = 0.11.
Although the evidence is weak, this is the first indication that we have that the resource
demands imposed by the Latin Square Task differ as a function of relational
complexity. The general quickening in tapping rate observed over the length of a trial
tended to be less pronounced for the more complex items (quaternary).
Figure 5.7. Mean median elapsed time between finger taps (msec) as a function of phase (learning, pre-LST, trial-LST, post-LST) and task condition (single, 2D dual, 3D dual, 4D dual)
5.7.3 Primary Task Performance
One of the assumptions of the dual-task methodology is that performance levels on the
primary task should not be influenced by the concurrent secondary-task. Subjects were
instructed to give their priority to the Latin Square Task. To test this assumption,
performance on the LST is compared in the single and dual task conditions. Mean proportion correct (accuracy), overall response time, and response time for correct responses were the three measures explored. The data are summarised in Table 5.3.
Table 5.3
Latin Square task performance: Mean proportion correct (A), response time (B), and correct
response time (C) by task condition and level of relational complexity

                               Relational Complexity
                      2D               3D               4D
A. Accuracy
  Single Task     0.898 (0.15)     0.825 (0.20)     0.529 (0.22)
  Dual Task       0.887 (0.13)     0.804 (0.19)     0.577 (0.31)
B. Response Time (s)
  Single          9.12 (3.38)      20.57 (14.82)    39.08 (35.53)
  Dual            6.88 (2.86)      10.06 (4.27)     20.20 (12.5)
C. Correct Response Time (s)
  Single          9.07 (3.23)      18.08 (11.81)    38.26 (36.41)
  Dual            6.91 (3.05)      10.54 (4.07)     17.62 (12.30)
5.7.3.1 Accuracy.
A task condition (2) × complexity (3) repeated measures ANOVA indicated a
significant main-effect for complexity, F(2, 34) = 28.74, MSe = .040, p < .001. There
were no overall differences in performance as a function of task condition (single or
dual-task presentation), F < 1, and the interaction between the factors as plotted in
Figure 5.8, was not significant, F < 1. Planned comparisons of the levels of complexity
indicated as expected that the mean proportion correct on the binary items was
significantly higher than for the ternary items, F(1, 17) = 4.982, MSe = .022, p = .039 (M’s = .893 & .815, respectively). The mean proportion correct on ternary items was significantly higher than for quaternary items (M = .553), F(1, 17) = 25.53, MSe = .048, p < .001. This pattern of results is consistent with the assumption that subjects gave priority to the primary task. The results are also consistent with the Latin Square data reported in our other studies (Chapters 4 and 6).

Figure 5.8. Mean proportion correct on the LST as a function of relational complexity (2D, 3D, 4D) and task condition (single, dual)
5.7.3.2 Overall response time.
A similar repeated-measures ANOVA was conducted for the overall response time
measure. The assumption of sphericity was violated for the complexity main-effect and
the interaction, however appropriate adjustment did not alter the interpretation. The
unadjusted values are reported and indicate that once again the main-effect of
complexity was significant, F(2, 34) = 14.53, MSe = 300.29, p = .001. Averaging across
task condition, the response times on binary items (M=8.00s) were shorter than for
ternary items (M=15.31s), F(1, 17) = 14.24, MSe = 67.56, p = .002, which were shorter
than quaternary items (M=29.64s), F(1, 17) = 10.10, MSe = 365.97, p = .006. Unlike the
accuracy measure, the main-effect for task condition was significant, F(1, 17) = 17.35,
MSe = 173.06, p = .001, such that averaging across levels of complexity, overall
response times were actually shorter in the dual-task condition (M=12.38s) than in the
single-task condition (M=22.93s). This may be the result of a practice effect that is
confounded with the presentation order of the single and dual-task conditions, although
additional analyses reported below (Section 5.7.4) tend to suggest that it is to some
extent independent of the presentation order.
The interaction between the task-condition and complexity was significant, F(2, 34) =
5.18, MSe = 120.21, p = .011, and is plotted in Figure 5.9. Analysis of the simple-effects
of complexity indicated that the interpretation of the complexity main-effect generalises
to both the single- and dual-task conditions. Figure 5.9 tends to suggest that the source
of the interaction is that as the complexity of the Latin Square items increases, the
difference in response times between single and dual-task conditions becomes more
pronounced. To test this trend, a difference score was computed for each subject at each
level of complexity by subtracting overall dual-task response time from overall single-task response time. A one-way repeated measures ANOVA on this difference score indicated that the difference between single- and dual-task response times (the primary-task deficit) for binary items was significantly smaller than for ternary (F(1, 17) = 7.21, MSe = 171.04, p = .016) and quaternary items (F(1, 17) = 7.41, MSe = 672.15, p = .014). There was no significant difference in the size of the primary-task deficit between ternary and quaternary items (F(1, 17) = 2.10, MSe = 599.35, p = .166). That is, the difference in overall response time between single- and dual-task conditions was less pronounced in the binary items than in the more complex ternary and quaternary items.
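A compact sketch of the primary-task deficit computation and of a follow-up comparison between complexity levels (Python; the array shapes and names are assumed, and the paired t-test shown is equivalent to the single-df contrast F-test since F = t squared):

import numpy as np
from scipy import stats

def primary_task_deficit(single_rt, dual_rt):
    """Deficit per subject at each complexity level.

    single_rt, dual_rt : (n_subjects, 3) arrays of mean LST response
    times (s) for binary, ternary and quaternary items.
    """
    return single_rt - dual_rt

def compare_levels(deficit, level_a, level_b):
    """Paired comparison of the deficit between two complexity levels."""
    return stats.ttest_rel(deficit[:, level_a], deficit[:, level_b])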
The analysis of the mean response times for correctly answered Latin Square items
(plotted in Figure 5.9) resulted in the same interpretation as for the overall response
time measure (one subject was omitted from the analysis for failing to answer any of the
quaternary items correctly). These results are provided in Appendix C.1.
Figure 5.9. Mean overall and correct response time (sec) as a function of relational complexity (2D, 3D, 4D) and task condition (single, dual)
5.7.4 Influence of Practice Effects on LST Response Times
To explore the finding that dual-task response times were shorter than single-task
response times, the 18 items presented in each condition were split into two groups
based on presentation order (first 9 and last 9). The repeated-measures ANOVA indicated
that the main effect for condition (single vs dual) was as expected given the previous
results. Overall response time on the LST was significantly shorter in the dual task
condition than in the single task condition, F(1, 17) = 16.26, MSe = 251.06, p = .001.
The main-effect of presentation order (earlier vs later) was not significant. The average
response time for the first 9 items was statistically equivalent to the overall response
time for the last 9 items, F < 1. The interaction between presentation order and task
condition was significant, F(1, 17) = 6.18, MSe = 47.61, p = .024. Simple effects
analyses indicated that the difference between single and dual-task conditions was more
pronounced for the earlier items, F(1, 17) = 18.05, p = .001, than the later items F(1, 17)
= 9.36, p = .007. Figure 5.10 plots the interaction for both overall and correct response
time as a function of presentation order.
Figure 5.10. Correct and overall response time as a function of task condition (single, dual) and presentation order (early and later items)
The same analyses were conducted on the composite response times for only those
items answered correctly (correct response time - CRT). Because of the requirements of
the repeated-measures ANOVA, list-wise deletion resulted in the data of only 10
subjects being used. As for the analysis of overall response times, the main-effect for
condition was significant. Correct response times to LST items were significantly shorter in
the dual-task condition, F(1, 9) = 5.22, MSe = 520.39, p = .048. There were no
significant differences in correct response times as a function of presentation order, F <
1, and the interaction between task condition and presentation order was not significant,
F(1, 9) = 2.35, MSe = 70.68, p = .159. Taken together with the overall response time
analyses, these results suggest that in this case at least, the decreased response time in
the dual task condition is not simply a function of practice.
5.7.5 Summary of traditional analyses
The presence of a dual-task deficit is evidence to suggest that the variation in tapping
rate is sensitive to resource limitations in the Latin Square Task in general. However, it
is not capable of clearly differentiating between levels of relational complexity. There is
some partial evidence to suggest that tapping rate itself (as measured by the median
elapsed time between taps) is sensitive to variations in relational complexity.
Performance on the primary task was assessed using two main measures. The
proportion of items answered correctly by each subject at each level of complexity was
used as a measure of accuracy. The average time to respond to items of a given
complexity also served as a measure of processing demand. Two response times were
derived, average time to respond regardless of accuracy and average time to respond to
items answered correctly. Although accuracy on the primary task was not influenced by
the concurrent secondary task, response times were influenced by the increased
processing demand generated by the concurrent finger-tapping task. Contrary to what
might be expected, the response time was actually shorter in the dual-task condition
than the single-task condition. This effect became more pronounced as the relational
complexity of the primary task increased and does not seem to be due to a practice
effect.
5.7.6 Easy-to-Hard Predictions: Individual Differences
There are four key measures required for the easy-to-hard analysis: performance on (a)
the hard (quaternary) version of the primary task, (b) the easy (ternary) version of the
primary task, (c) the secondary task, and (d) the secondary task performed concurrently
with the easy version of the primary task. We are fortunate enough to have several
measures of primary task performance and there is no a priori reason for selecting one
over the other. Response time is an intuitive measure since it is less susceptible than the
accuracy measure to the ceiling effects on the easier versions (binary and ternary) of the
Latin Square Task and to possible floor effects in the hard version of the task
(quaternary). While correct response time would be the more appealing measure, using it would mean losing subjects because of the nature of the repeated-measures design.
Given the relatively small sample size, we have chosen to use overall response time as
the measure of performance. The secondary task alone measure was a composite score
based on performance in all phases of the single-task trials except the learning phase.
The learning phase data was excluded since it differed significantly from the other
phases.
Table 5.4A shows the results of the standard regression analysis based on the traditional
easy-to-hard variables. The zero-order correlation between the hard (quaternary)
version of the Latin Square task and the finger-tapping task performed concurrently
with the easy (ternary) version of the Latin Square task is significant and positive, r(16) = .54, p = .01. That is, increased variation in finger-tapping rate is associated with increases in primary task response-time. When variation due to performing the finger-tapping task alone and performance on the easy primary task alone are partialled out, the second-order correlation remains positive and is still marginally significant, pr(14) = .485, p = .057. This is the partial correlation of interest in the easy-to-hard paradigm and its (marginal) significance is consistent with resource limitations. The positive correlation suggests that within the dual-task condition, higher LST RTs are associated with more variability in finger tapping, as would be expected if the LST were competing with the secondary task for similar resources. This is an important result, for it suggests
that the decrease in mean response time to LST items observed in the dual-task
condition (when compared with the single task condition) does not necessarily imply a
reduced processing load.
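The partial correlation of interest can be computed by residualising both variables on the control measures. The following sketch (Python; the variable names are illustrative and the data are not reproduced here) shows the logic for the response-time analysis reported in Table 5.4A:

import numpy as np

def partial_corr(y, x, controls):
    """Partial correlation between y and x, controlling for the columns of `controls`."""
    Z = np.column_stack([np.ones(len(y)), controls])
    resid = lambda v: v - Z @ np.linalg.lstsq(Z, v, rcond=None)[0]
    return np.corrcoef(resid(y), resid(x))[0, 1]

# Easy-to-hard partial correlation of interest (assumed variable names):
#   hard_alone : quaternary LST response time (single-task condition)
#   easy_alone : ternary LST response time (single-task condition)
#   sec_alone  : secondary task alone (composite SD-score)
#   sec_easy   : secondary task performed with the ternary LST
# pr = partial_corr(hard_alone, sec_easy,
#                   np.column_stack([easy_alone, sec_alone]))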
As we have stated, the analyses reported in Table 5.4A do not take into consideration
the accuracy of the response. One approach to control for this may be to include a
measure of accuracy on the hard version of the task in the regression analysis. Table
5.4B repeats the analysis with accuracy on the quaternary items included as an
additional predictor. With this additional variable in the model, the partial correlation
between the hard primary task and the secondary task performed concurrently with the
easy primary task is highly significant and positive, pr(13) = .62, p = .01. This
correlation is actually higher than the zero-order correlation and suggests that taking
into consideration performance accuracy substantially improves the ability of the
secondary-task to predict variation in the hard primary task. The implication of this
significant association is that performance on the hard primary task is indeed limited by
individual differences in processing capacity35.
Table 5.4
Regression coefficients for easy-to-hard analysis of A. response times, and B. with accuracy
partialled out

A. Overall response times.
                                zero-order correlations            Regression Coefficients
Variables      Mean (SD)        HPART    EPART    SA                β       t      Sig.      pr
HPART        39.08 (35.53)
EPART        20.57 (14.82)      0.54 *                             0.40    1.93    0.07     0.46
SA           64.26 (17.14)      0.13     0.04                     -0.15   -0.66    0.52    -0.17
SEP         100.74 (54.14)      0.54 *   0.29     0.52 *           0.50    2.07    0.06     0.49
Model: R = .68; R2 = .46 (adjusted R2 = .35), F(3,14) = 4.02, p = .03

B. Overall response time with accuracy partialled out.
                                zero-order correlations                     Regression Coefficients
Variables      Mean (SD)        HPART    HPAAcc   EPART    SA                β       t      Sig.      pr
HPART        39.08 (35.53)
HPAAcc        0.53 (0.22)       0.60 *                                      0.59    2.88    0.01     0.62
EPART        20.57 (14.82)      0.54 *   0.57 *                             0.04    0.20    0.85     0.06
SA           64.26 (17.14)      0.13    -0.23     0.04                     -0.04   -0.19    0.85    -0.05
SEP         100.74 (54.14)      0.54 *  -0.04     0.29     0.52 *           0.56    2.86    0.01     0.62
Model: R = 0.82; R2 = 0.67 (adjusted R2 = 0.57), F(4, 13) = 6.65, p < 0.01

Variables:
HPART: Hard Primary Alone ............. Mean overall response time for quaternary items
HPAAcc: Hard Primary Alone ............ Mean proportion correct for quaternary items
EPART: Easy Primary Alone ............. Mean overall response time for ternary items
SA: Secondary Alone ................... Mean total trial SD-score for single task condition
SEP: Secondary with Easy Primary ...... Mean trial-LST SD-score during concurrent ternary LST items
* p < .05
35 The correlation between accuracy and response time on the hard LST is also highly significant, r(17) = .60, p = .01, suggesting the presence of a speed-accuracy trade-off. We consider this in more detail in
Chapter 6.
Alternative Measures: If we use accuracy on the hard and easy primary tasks and partial out the effects of response time, the same pattern of results is obtained (see Table 5.5). However, the partial correlation in this case is negative, pr(13) = -.60, p = .02, and suggests that increases in finger-tapping variation are associated with a decrease in accuracy. This is consistent with the traditional dual-task effect reported earlier, demonstrating that increases in finger-tapping variation are associated with increased processing load. So although under the traditional dual-task methodology the finger-tapping task was only partially successful in detecting the increased processing demand of increasing relational complexity, when individual differences in performance are considered (rather than comparing aggregated performance across task conditions), we see not only that processing quaternary Latin Square items is resource limited, but also, from the significance of the partial correlation, that individuals differ in their ability to process such load. This observation would not have been possible if only the traditional dual-task analyses had been used.
Table 5.5
Easy-to-hard regression analysis on accuracy with overall response time partialled out.

                                zero-order correlations                     Regression Coefficients
Variables      Mean (SD)        HPAAcc   HPART    EPAAcc   SA                β       t      Sig.      pr
HPAAcc        0.53 (0.22)
HPART        39.08 (35.53)      0.60 *                                      1.15    4.78    0.00     0.80
EPAAcc        0.83 (0.20)       0.21     0.32                              -0.49   -2.16    0.05    -0.51
SA           64.26 (17.14)     -0.23     0.13    -0.45 *                   -0.23   -1.16    0.27    -0.31
SEP         100.74 (54.14)     -0.04     0.54 *  -0.32     0.52 *          -0.69   -2.70    0.02    -0.60
Model: R = 0.816; R2 = 0.666 (adjusted R2 = 0.563), F(4, 13) = 6.471, p = 0.004

Variables:
HPAAcc: Hard Primary Alone ............ Mean proportion correct for quaternary items
HPART: Hard Primary Alone ............. Mean overall response time for quaternary items
EPAAcc: Easy Primary Alone ............ Mean proportion correct for ternary items
SA: Secondary Alone ................... Mean total trial SD-score for single task condition
SEP: Secondary with Easy Primary ...... Mean trial-LST SD-score during concurrent ternary LST items
* p < .05
5.7.7 Alternative Easy and Hard Conditions
There are three levels of difficulty in the Latin Square task and the easy-to-hard
methodology can be applied to any of the three combinations of easy and hard versions.
As can be seen in Table 5.6, when this is done, the partial correlation of interest in all
cases was not significant. We consider the implications of this in terms of the
assumptions of the easy-to-hard paradigm in more detail in the general discussion
(Section 5.9).
Table 5.6
Analyses of finger tapping task with different easy and hard task combinations a

Easy task b            Hard task             Variable controlled     Partial correlation
2D accuracy            3D accuracy           None                    pr(14) = .06, p = .84
2D accuracy            3D accuracy           3D response time        pr(13) = .063, p = .83
2D response time       3D response time      None                    pr(14) = .04, p = .90
2D response time       3D response time      3D accuracy             pr(13) = .04, p = .90
2D accuracy            4D accuracy           None                    pr(14) = .06, p = .83
2D accuracy            4D accuracy           4D response time        pr(13) = -.07, p = .79
2D response time       4D response time      None                    pr(14) = .17, p = .54
2D response time       4D response time      4D accuracy             pr(13) = .17, p = .54
a Secondary task measure is based on composite of trial-LST and post-LST median response times;
b 2D, 3D, 4D = binary, ternary, and quaternary respectively.

5.8 Probe RT
Three phases were identified within each trial of the single-task condition and at each
level of relational complexity in the dual-task conditions. In the dual task condition
these phases were: a) the 5s pre-LST phase before the LST item was presented; b) the
trial-LST phase which lasted from when the LST item was presented until the subject
made a response; and c) the 5s post-LST phase which followed the LST response. In the
single task condition, the pre-LST and post-LST phases were the first and last 5s of
each trial, and the computer determined the duration of the trial-LST phase. In each
trial, the median response time for each phase was recorded for each subject. One
subject was omitted from the analyses for failing to respond to any probes in the single
task condition. All subjects received at least one item at each level of complexity in
each condition and therefore all dual-task measures are based on at least one trial. Only
two subjects received fewer than three trials at each level of complexity, and when these subjects were omitted the same pattern of findings was obtained. As a result, the
following analyses are based on all 18 subjects. Table 5.7 summarises for each
condition the number of Latin Square items at each level of complexity that the subjects
received.
Table 5.7
Distribution of item complexity in single- and dual-task Latin Square conditions

                            Single                               Dual
             Binary    Ternary    Quaternary     Binary    Ternary    Quaternary
Minimum         3          3           2            3          1           3
Maximum         9         11           9            9          9          10
Mean          6.39       5.83        5.78         5.61       6.17        6.22
Mode            6          7           4            6          5           5
SD            1.54       1.92        2.07         1.54       1.92        2.07
5.8.1 Secondary Task Performance: Median Response Time
A Repeated-measures ANOVA was conducted on the median scores summarised in
Table 5.8. The within-subject factors were Phase (pre-LST, trial-LST, post-LST) and
task Condition (single, dual-2D, dual-3D, dual-4D). The assumptions of sphericity were
violated for the Phase main-effect and the interaction; however, adjustments did not alter
the interpretation and hence the unadjusted results are reported. These indicated a
significant main-effect for Phase, F(2, 34)=14.83, MSe = 10829.63, p < .001. The main
effect for task Condition, F(3,51)=14.58, MSe = 6202.32, p < .001, and the interaction
between the factors, F(6,102) = 5.29, MSe = 6212.22, p <.001, were also significant.
Table 5.8
Mean (SD) of median probe response time (ms) for each task condition at each trial phase

                                  Trial Phase
Task Condition    pre-LST            trial-LST           post-LST
Single            318.44 (93.44)     265.14 (48.63)      266.81 (73.69)
Dual - 2D         342.68 (86.07)     463.38 (205.24)     309.80 (77.77)
Dual - 3D         337.67 (92.26)     442.90 (152.05)     321.72 (80.10)
Dual - 4D         333.54 (70.30)     406.79 (157.46)     308.43 (56.78)
Following up the main-effect of Phase indicates that subjects took longer to respond to
the probe during the trial-LST phase than in both the pre-LST and post-LST phases (F(1,
17) = 8.53, MSe = 31881.93, p = .01 and F(1, 17) = 32.40, MSe = 8577.43, p < .001,
respectively). The significant interaction suggests that this effect is qualified and as
shown in Figure 5.11, a dominant source of the interaction is due to performance on the
single-task condition. When the analyses are repeated without the single-task condition,
the main-effect for Phase remains significant; however, the main effect for condition and the interaction are no longer significant. Probe response time was significantly longer in the trial-LST phase than in both the pre-LST and post-LST phases (F(1, 17) = 9.84, MSe = 5200.11, p = .006 and F(1, 17) = 8.11, MSe = 1384.42, p = .011, respectively). This is
consistent with the prediction that the probe RT task is sensitive to processing demands
generated by the concurrent Latin Square Task.
5.8.2 Influence of Relational Complexity on Median Response Time
To explore the effect of primary task complexity on probe response, the simple-effects
of task-condition (single-task, dual-2D, dual-3D, dual-4D) for each phase in the trial
were explored. As expected, the time to respond to the probe during the pre-LST phase
did not differ as a function of task condition, F < 1. Helmert contrasts indicated that
during the trial-LST phase, response to the probe in the single-task condition was
significantly quicker than in the dual-task conditions combined, F(1,17) = 42.57, MSe = 12590.19, p < .001. However, probe response time was not sensitive to differences in relational complexity (binary vs ternary & quaternary, F(1, 17) = 1.08, MSe = 24676.57, p = .312; ternary vs quaternary, F < 1). Analysis of the post-LST phase response times suggested effects similar to those for the trial phase. Probe response in the last 5 seconds of the single-task condition was significantly quicker than in the 5 seconds following response to a Latin Square item, F(1,17) = 19.58, MSe = 1988.70, p < .001. Response to the post-LST probe was not sensitive to the relational complexity of the preceding LST item (binary vs ternary & quaternary, F < 1; ternary vs quaternary, F(1, 17) = 1.36, MSe = 2333.49, p = .259).

Figure 5.11. Median response time (msec) to the probe as a function of phase (pre-LST, trial-LST, post-LST) and task condition (single, 2D dual, 3D dual, 4D dual)
Single Task Performance: Finally, in preparation for the individual differences easy-to-hard analyses to follow, an analysis of the simple-effects of Phase for the single-task
condition indicated that subjects took significantly longer to respond to probes during
the pre-LST phase than in either the trial-LST or the post-LST phase (t(17) = 2.55, p =
.018, and t(17) = 3.11, p = .006, respectively). The secondary task alone measure will
therefore be based on a composite of the trial- and post-LST phases only.
5.8.3 Primary Task Performance
As for the finger-tapping condition, performance on the primary task was assessed using
the mean proportion correct, overall response time, and response time for correctly
answered items, as summarised in Table 5.9.
Table 5.9
Latin Square task performance: Mean proportion correct (A), response time (B), and correct
response time (C) by task condition and level of relational complexity a

                               Relational Complexity
                      2D               3D               4D
A. Accuracy
  Single Task     0.87 (0.18)      0.79 (0.23)      0.48 (0.30)
  Dual Task       0.83 (0.17)      0.77 (0.26)      0.40 (0.25)
B. Response Time (s)
  Single          9.87 (2.88)      20.13 (13.01)    28.85 (19.33)
  Dual            6.62 (2.74)      11.57 (8.02)     17.17 (11.86)
C. Correct Response Time (s)
  Single          10.06 (2.73)     20.42 (12.1)     30.64 (21.61)
  Dual            6.38 (1.91)      11.88 (8.39)     14.96 (13.48)
a The correct response times are based on N = 12; all other values are based on N = 18.
Figure 5.12. Mean proportion correct on the LST as a function of relational complexity (2D, 3D, 4D) and task condition (single, dual)
5.8.3.1 Accuracy.
A repeated-measures ANOVA was conducted on the mean proportion correct scores for
the 18 subjects. The within-subjects factors were task condition (single vs dual) and
complexity level (binary, ternary, quaternary). The results indicated a significant main-effect for complexity, F(2, 34) = 43.86, MSe = 0.040, p < .001. There was no overall
difference in accuracy as a function of task condition, F(1, 17) = 2.61, MSe = 0.020, p =
.125, and the interaction
between the factors, as plotted in Figure 5.12, was not
significant, F(2, 34) = .341, MSe = 0.032, p = .713. Planned comparisons of the levels
of complexity indicated that the mean proportion correct on the binary items was only
marginally higher than for the ternary items, F(1, 17) = 3.14, MSe = 0.059, p = .094.
The mean proportion correct on ternary items was significantly higher than for
quaternary items, F(1, 17) = 68.196, MSe = 0.074, p < .001.
5.8.3.2 Overall response time.
A similar repeated-measures ANOVA was conducted for the overall response time
measure. The assumption of sphericity was violated for the complexity main-effect,
however appropriate adjustment did not alter the interpretation. The unadjusted values
are reported and indicate that once again the main effect of complexity was significant,
F(2, 34) = 18.715, MSe = 104.87, p < .001. Averaging across single- and dual-task
conditions, response times on binary items were shorter than for ternary items (F(1, 17)
= 14.205, MSe = 146.77, p = .002), and response time on ternary items were
significantly shorter than quaternary items, F(1, 17) = 21.14, MSe = 204.55, p < .001.
The main-effect for task condition was also significant, F(1, 17) = 14.81, MSe = 111.70,
p = .001, such that averaging across levels of complexity, overall response time was
shorter in the dual-task condition than the single-task condition. These effects are
qualified by a significant interaction between the factors, F(2, 34) = 5.08, MSe = 32.19,
p = .012, which is plotted alongside correct response times in Figure 5.13. Analysis of
the simple-effects of task indicated that as the complexity of the item increases the
difference between the response time on dual-task and single-task conditions becomes
more pronounced. This effect is consistent with that reported in the finger-tapping data
and a similar analysis is used to test the trend in the probe-RT data. A task difference
score (the primary-task deficit) was derived by subtracting each individual’s overall
dual-task response time score from their overall single-task response time score. A one-way ANOVA was conducted on this primary-task deficit score with complexity (binary,
ternary, quaternary) as the within-subjects factor. The difference between single and
dual-task response times for the binary items was significantly smaller than for both the
ternary and quaternary items (F(1, 17) = 4.46, MSe = 113.94, p = .05, and F(1, 17) = 7.141, MSe = 179.10, p = .016, respectively). There was no difference in the task-effect between ternary and quaternary items, F(1, 17) = 1.88, MSe = 93.18, p = .189.

Figure 5.13. Mean overall and correct response time (sec) on the LST as a function of relational complexity (2D, 3D, 4D) and task condition (single, dual)
5.8.3.3 Correct response time.
Six subjects were omitted from the correct response time analysis for failing to answer
at least one item correctly at each level of complexity. As demonstrated in Figure 5.13,
the analysis of the mean response times for only correctly answered Latin Square items
for the remaining 12 subjects resulted in essentially the same interpretation as for the
overall response time measure. The assumption of sphericity was violated for the
complexity main-effect, however appropriate adjustment did not alter the interpretation.
The unadjusted values are reported and indicate that once again the main effect of
complexity was significant, F(2, 22) = 10.41, MSe = 122.83, p = .001. Averaging across
task condition, the correct response times on binary items were shorter than for ternary
items, F(1, 11) = 13.46, MSe = 112.25, p = .004, and correct response times on ternary items were significantly shorter than for quaternary items, F(1, 11) = 9.51, MSe = 284.30, p
= .010. The main-effect for task condition was also significant, F(1, 11) = 13.30, MSe =
117.05, p = .004, such that averaging across levels of complexity, mean response time
for correctly answered items was significantly shorter in the dual-task condition than in
the single-task condition. These effects are qualified by a significant interaction
between the factors, F(2, 22) = 4.53, MSe = 48.31, p = .023. As for the overall response
time, the difference between the single- and dual-task conditions tends to become more
pronounced as the complexity of the item increases (see Figure 5.13). A task condition
difference (primary-task deficit) score was derived as for the overall response time
analysis and a one-way ANOVA conducted with relational complexity as the within-subjects factor. There was no significant difference between binary and ternary items on
the task-effect score, F(1, 11) = 2.887, MSe = 98.04, p = .117, nor was there a
significant difference between the ternary and quaternary scores, F(1, 11) = 2.786, MSe
= 220.00, p = .123. The difference between the lowest (binary) and highest levels of
complexity (quaternary) however was significant, F(1, 11) = 6.61, MSe = 261.67, p =
.026. That is, the difference in response time between single- and dual-task was more
pronounced for the more complex quaternary items than for binary items.
5.8.4 Practice
As for the finger-tapping condition, the Latin Square Task response times in the dual-task condition tended to be quicker than in the single-task condition. We also considered the
influence of practice effects in this data set. The items were ordered for each subject
into two groups, the first 9 items attempted and the last 9 items attempted. A repeated
measures ANOVA was conducted with presentation order (early, late) and condition
(single, dual) as the within-subject factors. The results reveal a significant main-effect
for condition as expected, response time in the dual-task condition was significantly
shorter than in the single task condition, F(1,13) = 6.16, p = .03. Unlike the finger
tapping study, the main effect for presentation order was marginally significant, F(1,13)
= 4.12, p = .06. There is some indication that the responses to the items presented later
in the list tended to be much quicker, consistent with a practice effect. The interaction
between presentation order and condition was not significant (F < 1). Listwise deletion
in the analysis of correct response time resulted in a sample size of 5 and so the
analyses were not conducted. Figure 5.14 shows the plot of the interaction for response
times.
Figure 5.14. Mean response time to the LST as a function of item presentation order (early, late) and task condition (single, dual)
5.8.5 Summary of Traditional Dual-Task Analyses
The traditional dual-task analyses suggest that the probe-RT task is sensitive to resource
limitations in the Latin Square Task; however, it is not capable of differentiating
between levels of relational complexity. Performance on the primary task was assessed
using two main measures. The proportion of items answered correctly by each subject at
each level of complexity was used as a measure of accuracy. The average time to
respond to items of a given complexity also served as a measure of processing demand.
Two response times were derived, average time to respond regardless of accuracy and
average time to respond correctly. Although accuracy on the primary task was not
influenced by the concurrent secondary task, both overall and correct response times were
influenced by the processing demand generated by the concurrent probe-RT task.
Consistent with the finger-tapping task, this influence tended to become larger as the
complexity of the LST increased. Contrary to what might be expected, responses to LST
items were actually quicker in the dual-task condition than in the single-task condition.
There is some evidence to suggest that with practice, response to the Latin Square Task
becomes quicker. This effect was not observed for the group of subjects who received
the finger tapping secondary task. This difference could be the result of small sample
differences or it might be a function of some underlying difference between the
secondary tasks. The tendency for LST responses in the dual-task condition to become quicker (relative to the single-task condition) as the relational complexity of the primary task increased, in both groups, complicates a practice interpretation a little. We argued previously that this effect might be a result of subjects becoming more risk-taking (overconfident) under the increased demands of the dual-task situation36; however, if it was risk-taking it did not seem to influence the accuracy of the responses, which remained stable regardless of task condition (single or dual). An alternative possibility is that as the relational complexity increases, the amount of effort/motivation invested in raising the requisite resources might also increase (Kanfer & Ackerman, 1989). We have not assessed risk-taking or effort independently in the current study and further research is necessary to clarify these effects.
5.8.6 Easy-to-Hard Predictions: Individual Differences
As for the finger-tapping task, we have several measures of primary task performance to
choose from and there is no apparent a priori reason for selecting one over the other.
Overall response time was used in the finger-tapping task both with and without
36 There is evidence to suggest that as tasks become more difficult, people tend to become overconfident of their performance (Stankov, 2000), and this might be consistent with our risk-taking hypothesis.
controlling for response accuracy. Similar results were found when accuracy was used
(statistically controlling for individual differences in response latencies). The same
types of measures of the primary task are used for the probe-RT data; however, there is some reason to be cautious in selecting the secondary task alone measure. In the finger-tapping task, the secondary alone measure was a composite score across the pre-, trial- and post-LST phases. The learning phase was excluded because it differed significantly from the remaining phases. In the probe-RT task, response to the probe in the pre-LST phase of the single task condition differed from the other phases. Rather than use the median response time across all three phases in the secondary task as we did with the finger-tapping task, a more homogeneous measure is obtained if we use a composite based on the median response times in the trial-LST and post-LST phases only.
Table 5.10A shows the results of the standard regression analysis based on the
traditional easy-to-hard variables. The zero-order correlation between response time on
the hard (4D) version of the Latin Square task and the probe-RT task performed
concurrently with the easy (3D) version of the Latin Square task is negative and with
the current sample size not significantly different from zero, r(16) = -.27, p = .142.
When variation due to performing the probe task alone and performance on the easy primary task alone are partialled out, the second-order correlation of interest to the easy-to-hard paradigm is not significant, pr(14) = -.16, p = .54. Including a measure of
accuracy on the quaternary Latin Square items in the analysis of Table 5.10A did not
improve the partial correlation, pr(13) = -.15, p = .60.
If accuracy is used as the measure of primary task performance rather than response
time, no difference in the interpretation of the easy-to-hard paradigm is observed. The
results reported in Table 5.10B show that the partial correlation between the hard
primary task and the secondary task performed concurrently with the easy primary task,
controlling for easy primary task alone and secondary task alone, is not significant,
pr(14) = -.22, p = .41. In fact, the total regression model is not significant, R2 = .25, F(3,
14) = 1.58, p = .24. Partialling out mean response time on quaternary items by including
it in the regression analysis did not improve this prediction, pr(13) = -.14, p = .54.
Table 5.10
Easy-to-hard regression analyses of A. overall response time, and B. accuracy

A. Overall response times.
                                zero-order correlations            Regression Coefficients
Variables      Mean (SD)        HPART    EPART    SA                β       t      Sig.      pr
HPART        28.85 (19.33)
EPART        20.13 (13.01)      0.89 **                            0.89    7.25    0.00     0.89
SA          265.98 (58.72)     -0.16    -0.22                      0.11    0.66    0.52     0.17
SEP          442.9 (152.05)    -0.27    -0.27     0.72 **         -0.11   -0.62    0.54    -0.16
Model: R = 0.90; R2 = 0.80 (adjusted R2 = 0.76), F(3, 14) = 19.26, p < .001

B. Accuracy.
                                zero-order correlations            Regression Coefficients
Variables      Mean (SD)        HPAAcc   EPAAcc   SA                β       t      Sig.      pr
HPAAcc        0.48 (0.3)
EPAAcc        0.79 (0.23)       0.35                                0.30    1.27    0.22     0.32
SA          265.98 (58.72)     -0.31    -0.02                      -0.10   -0.30    0.77    -0.08
SEP          442.9 (152.05)    -0.41 *  -0.15     0.72 **          -0.29   -0.86    0.41    -0.22
Model: R = 0.5; R2 = 0.25 (adjusted R2 = 0.09), F(3, 14) = 1.58, p = 0.24

Variables:
HPAAcc: Hard Primary Alone ............ Mean proportion correct on quaternary LST alone
EPAAcc: Easy Primary Alone ............ Mean proportion correct on ternary LST alone
HPART: Hard Primary Alone ............. Average median response time quaternary LST alone
EPART: Easy Primary Alone ............. Average median response time ternary LST alone
SA: Secondary Alone ................... Composite median Probe RT for trial and post trial phases
SEP: Secondary with Easy Primary ...... Composite median trial Probe RT during ternary LST
* p < .05; ** p < .001
5.8.7 Alternative Easy and Hard Conditions
There are three levels of difficulty in the Latin Square task and the easy-to-hard
methodology can be applied to any of the three combinations of easy and hard versions.
When this is done, the partial correlation of interest in all cases was not significant (see Table 5.11). This
pattern of results was also observed in the finger-tapping condition and we will consider
the implications of this further in the discussion (Section 5.9.1).
Table 5.11
Analyses of probe RT task with different easy and hard task combinations a

Easy task b            Hard task             Variable controlled     Partial correlation
2D accuracy            3D accuracy           None                    pr(14) = -.22, p = .41
2D accuracy            3D accuracy           3D response time        pr(13) = -.23, p = .42
2D response time       3D response time      None                    pr(14) = .22, p = .41
2D response time       3D response time      3D accuracy             pr(13) = .23, p = .42
2D accuracy            4D accuracy           None                    pr(14) = -.08, p = .76
2D accuracy            4D accuracy           4D response time        pr(13) = -.17, p = .56
2D response time       4D response time      None                    pr(14) = .12, p = .67
2D response time       4D response time      4D accuracy             pr(13) = .18, p = .51
a Secondary task measure is based on composite of trial-LST and post-LST median response times;
b 2D, 3D, 4D = binary, ternary, and quaternary respectively.
5.8.8 Alternative Measures
As mentioned previously, the selection of measures for a particular variable to use in
the analyses is somewhat arbitrary. The results reported above suggest that, regardless of whether accuracy or response time on the Latin Square task is used (and regardless of whether response time or accuracy is controlled for in these analyses), once the controls of the easy-to-hard paradigm are implemented there is no reliable shared variation between the Latin Square Task and the probe RT task that would be consistent with a capacity-based dual-task trade-off. Further analyses not
reported here using different measures of secondary task performance also did not
change the nature of these results. These additional analyses considered (i) mean
response times to the probe rather than median response time, (ii) whether a secondary
alone task was based on just the pre-LST phase responses for the single task, or (iii)
whether a secondary alone task was based on the pre-LST phase responses to both
single and dual tasks. In all cases the pattern of the results remained unchanged. The
probe RT task did not seem to be sensitive to individual variations in processing
capacity on the quaternary LST items that the finger-tapping task suggested existed.
5.9 General Discussion
Traditional Analyses: The results of this study can be interpreted to be consistent with
the hypothesis that reasoning in the Latin Square Task (LST) is resource limited; however, there are reasons to be cautious about the generalisability of the findings. The
traditional dual-task analyses suggest that reasoning in the Latin Square Task is
resource limited since a dual-task deficit was recorded for all levels of relational
complexity, regardless of whether the concurrent secondary task was finger-tapping or
whether it required monitoring for and responding to a random probe. Contrary to
expectations, secondary task performance was generally not sensitive to differences in
the relational complexity manipulation of the LST. That is, although there is some weak
evidence for a difference between the effect of binary and more complex levels of the
LST on finger tapping rate, generally the decrement observed from the single- to dual-task condition remained constant at all levels of relational complexity regardless of the
secondary task.
According to Lansman and Hunt (1982), this is not altogether an unexpected finding in
the traditional dual-task paradigm. One explanation proposed by Kantowitz and Knight
(1976) and Navon and Gopher (1979), endorsed by Lansman and Hunt (p. 14), is that
“…coordinating the two tasks demands resources beyond the requirements of each
individual task”. Therefore the dual-task deficit is not only a function of the primary
task demands, but also of the additional demands of coordination. We suspect that this
finding also tends to feed the controversy surrounding the identification of factors that
are considered sources of “difficulty” and those that are thought to be sources of
“complexity”. In this literature (e.g., Bjorklund & Harnishfeger, 1989; Kantowitz &
Knight, 1978), two tasks that differ in difficulty are often interpreted as requiring a
different amount of the same resources, whereas tasks that differ in complexity are
thought to entail different resources (as a function of the complexity of the problem
solving situation)37. If it is the case that two tasks differ in difficulty but entail the same
resources, then the magnitude of the dual-task deficit should be a function of the
relative difficulty of performing each task. On the other hand, two tasks that differ in
complexity are thought to entail different resources and therefore do not necessarily
produce dual-task deficits that reflect the different levels of performance on each task.
The problem of course lies in identifying which of these two classifications is the case
for a given task (or task manipulation). It may in fact be the case that task manipulations differ in both difficulty and complexity, and then the issue becomes one of deciding to what extent each applies.

37 The use of the terms "difficulty" and "complexity" might be interpreted as quantitative and qualitative differences, respectively. In any case, the terminology is somewhat different to that adopted by Halford et al. (1998a).
The recent work of Klauer et al. (1997), which we discussed briefly in Section 5.5.1.1, provides another example of the potential problems in explaining a dual-task deficit.
Klauer et al. demonstrated that the presence of a dual-task deficit depended on whether
problem solving heuristics were used to short-cut the demands of the primary task or
not. When heuristics could not be used the finger-tapping task was sensitive to the
primary task demands. When heuristics were used, processing in the task was not
resource limited. We have already outlined the importance that solution strategies play
in determining the relational complexity of problem solving in the knight-knave task
(Chapter 3) and the Latin Square Task (Chapter 4). The principles of segmentation and
chunking proposed by Halford et al. (1998a) can be used to model complexity
differences in strategies and further experimentation manipulating strategy use might
help to disambiguate the effect of strategy on the dual-task deficit reported here.
It seems to be the case that interpretation of a dual-task deficit using traditional analyses
does not provide clear conclusions. An overall dual-task deficit is arguably a necessary consequence of competition for resources from the same pool. However, it is
not a sufficient condition in that it does not allow us to unambiguously differentiate
between resource demands and the influence of other interference related factors. The
utility of the dual-task approach is dependent on the tightness of the experimental
controls and we will consider this in more detail below. As it stands, whether our
findings with the LST are due to resource limitations or to structural-type interference in responding (Navon, 1984) essentially remains unresolved if the only source of evidence
we have is the dual-task deficit.
Easy-to-Hard Analyses: Concerns over the issues raised above led us to adopt the easy-to-hard approach to try to develop a critical test of resource limitations, but as we will argue, it also has some difficulty with aspects of our hypotheses. When the secondary task was finger-tapping, the easy-to-hard paradigm results corroborated the traditional analyses in support of the presence of resource limitations in solving quaternary LST items. It is important to be clear just what the easy-to-hard analyses are able to demonstrate. The rationale is that when the secondary task is paired with an easy version of the LST, some of the total variability in secondary task performance will be a function of the availability of sufficient resources to perform the two tasks at once. This assumes that combining the two tasks results in resource-limited performance. A decrement in secondary task performance when the easy task is performed concurrently is supporting evidence of this. The argument, then, is that if performance on the hard
version of the LST is also resource limited and if it entails the same resources that the
secondary task is competing for when paired with the easy version of the LST, then
there should be some common variability between the secondary task performance
when paired with the easy version of the LST and the hard version of the LST
performed alone (this is δ in Section 5.4, Figure 5.4). To control for variability in
performance due to structural (e.g., response) interference, the influence of performing
the secondary task alone and the easy primary task alone are partialled out from both
the hard LST and the secondary task performed concurrently with the easy LST. If the
remaining partial correlation is significant, then it suggests that performance on the
quaternary LST is subject to the same resource constraints as the secondary task
combined with the easy version of the LST. The analyses are not able to provide
evidence for or against the relative difference in the resource demands as a function of
relational complexity per se. That is, a reduction in the magnitude of the easy-to-hard
partial correlation as a function of decreasing the relational complexity, as was observed
in the finger-tapping condition, might be accounted for in a number of ways. It might be a function of smaller resource demands, consistent with our thesis. Alternatively, it might be an indication of a range of other factors, for instance, that different resources are being used for items of different levels of complexity. Implementation of the easy-to-hard paradigm assumes that resource demands change quantitatively, not qualitatively (Lansman & Hunt, 1982), but cannot differentiate between the two.
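To make the partialling procedure concrete, the following is a minimal sketch (in Python; not the analysis software actually used in this study) of the easy-to-hard partial correlation: scores on the hard primary task performed alone are correlated with scores on the secondary task performed concurrently with the easy primary task, after the easy-primary-alone and secondary-alone scores have been regressed out of both. Variable and function names are illustrative only.

    # Minimal sketch of the easy-to-hard partial correlation (illustrative only).
    import numpy as np

    def residualise(y, controls):
        # Residuals of y after ordinary least-squares regression on the control variables.
        X = np.column_stack([np.ones(len(y))] + list(controls))
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        return y - X @ beta

    def easy_to_hard_partial_r(hard_alone, secondary_with_easy, easy_alone, secondary_alone):
        # Correlate the two residualised scores; a reliable positive correlation is taken
        # as evidence that the hard task draws on the same limited resources.
        r1 = residualise(hard_alone, [easy_alone, secondary_alone])
        r2 = residualise(secondary_with_easy, [easy_alone, secondary_alone])
        return np.corrcoef(r1, r2)[0, 1]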
The results indicate that when the finger-tapping task was used as a secondary task, a
significant proportion of the unaccounted variation in performance on quaternary Latin
Square items could be accounted for by variation in the secondary task combined with
the easy version of the LST (i.e., the easy-to-hard partial correlation was significant).
This was not the case, however, when the secondary task required a response to a random probe. This suggests that the dual-task deficit observed in the traditional analyses of the probe RT measures may not be a function of competition for the same resources.
Summary: To recapitulate, the traditional analyses suggest that processing in the LST is
resource limited at all levels of complexity, although secondary task performance did
not differ reliably as a function of relational complexity. The easy-to-hard analyses
corroborated the presence of resource limitations in the quaternary level items when the
finger-tapping task was used, but not when the secondary task was the Probe RT. There
was no evidence to suggest that when using the ternary items as the hard task in the
easy-to-hard analyses, the ternary items were resource limited. To clarify our
interpretations and to place these general trends into some context, we consider three
key findings from this study in more detail throughout the following sections. The first
is that both the finger tapping task and the probe-RT task were insensitive to differences
in the relational complexity of the LST. The second is that in both the finger-tapping and probe RT conditions, responses to the LST items tended to become quicker in the dual-task condition, and increasingly so as their complexity increased. Finally, there is some evidence to suggest that the two secondary
tasks are not having the same effect on resources. The finger-tapping task tended to
produce more conclusive evidence for resource competition.
5.9.1 Secondary Task Insensitivity
Contrary to our predictions, increases in relational complexity did not produce larger
secondary task deficits in either the finger-tapping (Sections 5.7.1, and 5.7.2) or the
probe RT conditions (Section 5.8.1). This is in spite of the fact that accuracy and
response times on the LST were a function of relational complexity in the same way in
the current study, as our previous data has suggested (i.e., Chapters 4 and 6). Binary
items were associated with fewer errors and shorter response times than quaternary
items, and ternary items lay somewhere in between on both measures. So what are we to
make of this? A key prediction of the dual-task methodology is that secondary task performance is sensitive to differences in primary task difficulty, yet we have generally failed to show this. As we have already alluded to, this is not necessarily an unusual result;
however it is useful to speculate on possible contributing factors:
1. Task Sensitivity: The tasks or measures were simply not sensitive enough to detect the differences in resource demands generated by different levels of complexity in the LST.
2. Interference: Processing in the Latin Square Task is resource limited, but specific interference effects have obscured this (related to 1).
3. Relational Complexity: The difficulty caused by the Latin Square Task is not a function of processing capacity, but is caused by other factors that are not related to the availability of resources.
It may be the case that resource demands in the LST do not change substantially as a
function of relational complexity. As a consequence, while the secondary task was
sensitive to the concurrent load of performing the LST in general, it was not sensitive to
the changes in complexity. The easy-to-hard analyses were based on using ternary items
as the easy version of the task and quaternary items as the hard version. When other
combinations of easy and hard tasks were used, the partial correlation of interest was
not significant (see Section 5.7.7). One interpretation of this is that individual
differences in capacity were not implicated in the ternary level task. However, when we
take the dual-task deficit into consideration this interpretation becomes a little more
complicated. Ternary LST items were associated with just as large a dual-task deficit as
the quaternary items were, yet only the quaternary items were resource limited in terms
of the easy-to-hard analyses. The source of this contradiction may lie with the
assumptions of the easy-to-hard paradigm. First, the hard version of the primary task
needs to be resource limited (Section 5.4.1, Assumption c). In any case, if ternary items
are used as the hard task, then binary items need to be used as the easy task and this
might lead to violations of one of the other assumptions – that performance on the easy
version of the primary task combined with the secondary task is resource limited
(Section 5.4.1, Assumption d). We can speculate that pairing all levels of the LST task
with the secondary task results in resource limited processing since a dual-task deficit
on the secondary task was observed regardless of complexity. However, as we have
already stated, determining that the deficit is not a result of some other factor is
difficult.
Sample Characteristics: There is additional evidence from Chapter 4 that might suggest
that processing less complex LST items entails resource demands that the secondary
task might not be able to differentiate between. The Rasch analyses of LST performance
from the university sample in Chapter 4 indicated that all the items tended to be well within the capabilities of the students (i.e., a skew in the distribution of ability estimates greater than 0). The sample of school children, on the other hand, showed a much broader distribution of abilities and found the task more difficult (see Chapter 4). This suggests that, on the whole, the Latin square items were relatively easy for the university students but somewhat more difficult for the school-aged children (aged 8 – 17 years). The extent to which this difficulty is resource related needs further investigation; however, recent work by Rabinowitz, Howe, and Saunders (2002), who observed capacity-related differences in school-aged children ranging in age from 10 to 15, suggests that individual differences in resource capacity vary with age in this population.
5.9.2 Interference and Secondary Task Performance
The easy-to-hard results with the finger tapping task were consistent with the notion
that processing in the quaternary LST items is resource limited and not simply a
function of non-resource related interference (since the hard version of the task is never
paired with the secondary task, interference is believed to be less of an issue). They also
corroborate the interpretation that the dual-task deficits were a function of resource
limitations and not simply response competition. However, the easy-to-hard paradigm
does not help in explaining the null effect observed with the probe RT task. In fact, given the significant findings with the finger-tapping task suggesting that processing is resource limited, the easy-to-hard analyses suggest that the probe RT task, as it has been implemented in this study, is not sensitive to resource limitations. The probe RT task has worked in other domains, for example, transitive inference (Maybery et al., 1986), and the possibility that particular characteristics of the LST have contributed to the effects needs further consideration. Consistent with this idea is the implication that
the dual-task deficit observed in the probe RT is more a function of non-resource related
interference. We consider the obvious potential interference factors in more detail next.
Finger Tapping and Interference: The finger-tapping task requires a continuous
response with the non-dominant hand. The LST requires a single response with the
dominant hand. There is certainly evidence to suggest that at the point of dual responses
this is going to cause some interference (e.g., Damos, 1991a; Wickens, 1991).
Processing in the LST is conducted on average over a 10 – 30 s window, and maximum interference due to response competition is likely to be localised to within, say, the last 2 or 3 seconds in which a response by both the left and right hand is required (i.e.,
tapping and selection of the LST response option). The influence of this disruption is
likely to be greatest on the measure of variability in tapping (i.e., standard deviation
score). The variation in tapping rate almost doubled under the concurrent load imposed
by the Latin square task. The influence on the median elapsed time measure of tapping
rate is likely to be smaller since the median is less sensitive to extreme responses. The
data that has been presented is generally consistent with such an interpretation. There was no difference in the variability of tapping between the three levels of relational complexity; however, the median elapsed time between taps was sensitive to differences in binary and quaternary LST demands. Further, when these interference effects are
controlled using the easy-to-hard approach, there is evidence to suggest that processing
on the LST task is resource limited.
Probe RT and Interference: Once again, both the secondary and primary tasks require a
motor response. Fisk et al. (1986) argue that a secondary task that demands processing
resources throughout the duration of the primary task helps prevent a strategy that
drains resources from the primary task. In this case the response to the probe RT is not continuous; the probe is presented at random intervals of 800–2400 ms throughout the trial. The obvious consequence of response competition is a switching of resources to the
task. That is, when the tone is presented, resources devoted to processing the LST might
be temporarily diverted to make a response to the probe. Processing in the LST would
be halted while a response to the probe is made. It is not possible to do this in the
finger-tapping task since tapping needs to be maintained throughout solving the LST
and therefore a switching strategy is not possible. Hence, rather than being a result of competition for similar resources, the increase in probe RT during the dual-task
condition may simply be a function of probe monitoring that is arguably constant
regardless of relational complexity, and the switching of attention to another task when
the probe is heard. This would account for the failure to observe a difference in probe
RT as a function of relational complexity since the task of switching is the same
regardless of complexity. That the performance on the probe RT is not competing for
similar processing resources as the LST is also consistent with the failure to find a
significant correlation between performance on quaternary LST items and the probe RT
task when the easy-to-hard variables are controlled. The two tasks either do not compete
for similar resources, or if they do, rather than being allocated in parallel as the general
dual-task methodology assumes, they are being allocated in series.
5.9.3 Interference and Primary Task Performance
A key assumption of the dual-task methodology that applies to both traditional and
easy-to-hard approaches is that subjects give the primary task priority. We discussed earlier that it is not always clear whether there is a strong case for accepting this assumption (Sections 5.1.2 and 5.3.2). Bourke et al. (1996) have suggested that subjects will tend toward assigning more resources to the task they perceive to be most difficult, regardless of instructions, and that perception of difficulty is not always as expected.
our study, accuracy of response to the Latin Square Task was not compromised by
either the finger-tapping or probe RT tasks. However, there was an interesting effect on
response times. Overall response times in the Latin Square Task tended to be quicker in
the dual task condition than when the LST was performed alone. Furthermore, for
subjects in both the finger-tapping and probe RT conditions, the extent to which the
response times to the LST decreased in the dual task situation was a function of
relational complexity. That is, quaternary items were associated with a greater tendency
to respond more quickly in the dual-task condition.
This finding suggests the influence of at least two factors. First, there may be a strategic
trade-off of resources in that the extra load generated by the concurrent secondary task
encourages the subjects to be a little less conservative in checking their responses once
they have worked through the item. That the difference increases as a function of the item's complexity supports such an interpretation. Second, given that the dual-task
always came after the single task administration of the Latin Square Task, practice may
be influencing performance. In an attempt to try and tease apart what is going on here,
the item pool was split into two halves as a function of presentation order in both the
single and dual task conditions. During the finger-tapping task there was no difference
in response times as a function of presentation order when it was dichotomised in this
way. However, for those subjects who received the probe RT task, there was a marginal
difference in average response times between the items presented early in the list and
the later items consistent with a practice effect.
When we take these two rather contrasting results together, it suggests that if practice
effects are having an influence on response time, it is in conjunction with some other
factor. Note that the effect cannot be accounted for by a traditional speed-accuracy
trade-off since accuracy did not change as a function of whether the LST was performed
alone or in combination with one of the secondary tasks. In any case, the evidence
suggests an interaction between the complexity of the Latin Square Task and secondary
task performance and that at least in the probe RT task, interference due to non-resource
factors cannot be ruled out.
5.9.4 Is Relational Processing Resource Dependent?
The alternative to what has been presented of course is that the relational complexity
manipulations in the Latin Square Task simply do not impose different demands on
similar processing resources. If problems with the traditional dual-task methodology
mean that a dual-task deficit cannot unambiguously be associated with resource
limitations, does a failure to observe a deficit provide compelling evidence for a lack of
resource competition between the two tasks? While the easy-to-hard methodology is
useful in helping to determine if the hard version of the task is resource limited, it does
not provide any relative information about the resource demands of the other levels of
the task. When the binary task is used as the easy version and the ternary as the hard, there is no evidence for resource limitations, although the extent to which this result is due to violations of the easy-to-hard assumptions is unknown. This suggests either that the
ternary version of the task is not resource limited or that it entails different resources.
The current data is unable to differentiate these alternatives. Relational complexity
theory is based on resource theory and while it makes no real distinction between
undifferentiated and differentiated resource models, it does assume that processing
cognitive representations of different levels of complexity is done from a common
resource pool. While Halford and his associates argue that there is evidence that
processing relations of different levels of complexity entail the same resources (e.g.,
Halford et al., 1998a; Andrews & Halford, 1998), there is little direct evidence for this
in adult cognition across more than two levels of primary task difficulty. In fact, the
developmental notion that resources become more differentiated with age (e.g., Halford,
1993) would tend to suggest a real possibility that the structural difference between
levels of complexity might entail a different set of resources. One might reasonably
envision the case where binary processes in particular might be instantiated in a number
of resource centres because of their very common nature. As the structure of the relation
becomes more complex, the nature of the resources required might become more
specialised and differentiated. There is some neurological evidence accumulating to
support the notion that tasks of different complexity enlist different anatomical
structures as a function of the individual’s general ability (e.g., Gevins & Smith, 2000)
and if we are to endorse research that suggests that the hemispheres entail differentiated
resources (e.g., Friedman et al., 1988; Polson & Friedman, 1988), then we might expect
that different structures within the same hemisphere might also entail different
resources. The inability to find a difficulty effect in the LST using traditional analyses
could then be accounted for as follows:
Regardless of relational complexity, a fixed amount of resources is required to
implement the solution strategy and the secondary task is sensitive to these
resources – hence no difficulty effect is observed in the concurrent secondary
task38. Additional and different resources are required as the complexity of the
processing increases to maintain accuracy on the primary task. The cost to
performance for enlisting these different resources is on the Latin Square Task
where a general slowing occurs relative to the availability of resources to the
individual (consistent with the positive easy-to-hard partial correlation between
the quaternary LST and the secondary task performed in conjunction with the
easy LST, section 5.7.6). The general quickening from single to dual-task
conditions has nothing to do with the availability of resources per se, and is
simply a function of the tendency for risk taking strategies to increase as
perceived task demands increase.
38 Ward, Roberts, and Phillips (2001) discuss evidence to suggest that task-switching costs are at least partially independent of executive resources and independent of task difficulty.
5.10 Conclusion
The line of reasoning presented above obviously entails some speculation. In some
respects it is also presented as an example of pushing the implications of resource
theory somewhere near the point of reductio ad absurdum. In any case, primarily
because of methodological problems, the resource concept has proven to be less useful
in testing our understanding of relational complexity than we had first anticipated. The
resource concept may well be a theoretical soup stone as Navon (1984) argued.
Oberauer and Kliegl (2001) suggest that decay and interference are sufficient to account
for limits in working memory without additional constructs like limited resource pools
or “magical numbers”. It may be the case that we can demonstrate a single to dual-task
decrement in secondary task performance. From here we might conclude that
performance on the primary task is resource limited. However, if we have more than
one level of primary task difficulty (determined by relational complexity) and we find
that dual-task performance is not sensitive to this manipulation, it is difficult not to find
an acceptable alternative to the null hypothesis. It is difficult to conclude
unambiguously that changes in relational complexity are not associated with changes in
the processing resources required. We can simply enlist the help of another set of
resources.
The evidence we present is consistent with the hypothesis that processing in the LST is resource limited and, taken in conjunction with the data presented in Chapters 4 and 6, provides converging evidence for the relational complexity manipulation. The two
studies reported here have also provided some further insight into the characteristics of
the Latin Square Task. However, the issues that we set out to clarify are far from being
resolved unambiguously. Part of the reason for this is the practical limitations of the
current study, and part is a function of the methodology in general. Consistent with
previous literature, it would seem that the traditional dual-task methodology is very
much limited in the ability to differentiate between resource competition and other
factors that might interfere with the unambiguous identification of resource limitations
(e.g., response competition). Possibly the biggest limitation of the current series of
studies is the real possibility that the dual-task deficit that we have observed is a
function of response competition that is not a function of the true cognitive demand
imposed by the LST. Both secondary tasks required a response with a non-dominant
hand. Response to the LST was by the dominant hand. The major difference between
the two tasks was that the probe RT task required a much less consistent allocation of
resources. Although the results clearly need replication, it seems that the easy-to-hard
task is sensitive to these differences in secondary tasks.
Although the probe-RT task has been used successfully in identifying the processing
demands of transitive inference in adults (e.g., Maybery et al., 1986) and children (e.g.,
Foley, 1997; Foley & Berch, 1997), the current data suggests that at least in its current
form, it is not well suited to identifying resource limitations when paired with the Latin
Square Task in an adult sample. A possible reason for this is that extensive research into
transitive reasoning has been able to determine the point at which the peak cognitive
demand is generated (e.g., Sternberg & Weil, 1980; Maybery et al., 1986). Reasoning in
the Latin Square Task is yet to be investigated in this detail. Aggregating probe RTs
across a full LST trial might mean that the peak load is being obscured – washed out
amongst less demanding search processes. We are encouraged by the results of the
finger-tapping task, and the continuity of response might be seen to overcome the
limitations of the intermittent probes. Future research should also investigate alternative
response modalities, for example, vocal responses to the LST, so as to minimise the
influence of response competition and provide a more convincing test of the dual-task
deficit hypothesis. Further, a more detailed inspection of the nature of the processing in
the LST is required and eye-tracking or neuro-imaging methodologies would assist with
this.
CHAPTER SIX
RELATIONAL COMPLEXITY AND BROAD COGNITIVE ABILITIES
6 Introduction
In the previous chapters we have demonstrated that manipulations of relational complexity within individual tasks result in differential performance consistent with relational complexity theory. As stated at the outset, the primary aim of this project is to consider further the evidence for the relational complexity theory in such a way as to avoid the subtle circularity in the validation process. In Chapter 5 we demonstrated that manipulations of relational complexity were associated with secondary task deficits consistent with sensitivity to resource demands; however, the methodology was not strong enough to explore the relative demands of the different levels of complexity. The correlational study39 we report in this chapter and Chapter 7 implements a test of the relational complexity model using a more traditional individual differences methodology. What we are attempting to demonstrate is that by using a cognitive correlates approach we are in a better position to reduce the subtle confounding that has been generated by using the same evidence to validate manipulations of task complexity and determine processing capacity limitations.
The thesis to be explored in the following chapters can be stated quite simply. The core
premise is based on the a priori relationship between task performance and Fluid
Intelligence (Gf) as a function of cognitive complexity (Stankov & Crawford, 1993;
Marshalek et al., 1983; Jensen, 1987b). We have argued that if relational complexity
has the characteristics of what has traditionally been referred to as cognitive complexity
then we might expect it to have similar psychometric properties to those of other reported complexity manipulations (most notably, Stankov & Crawford, 1993; Stankov, 2000). Specifically, as the relational complexity of a task increases, the correlation between task performance and Gf should also increase monotonically (we refer to this as the complexity-Gf relationship). In practical terms this implies that as complexity increases, task performance is better able to differentiate between individuals of differing general reasoning abilities, as Gf is typically defined. The theory of relational complexity also makes specific predictions about the impact of education (and acculturation) and short-term memory on processing capacity. If relational complexity is a characteristic of the task, as Halford et al. (1998a) propose, then with sufficient knowledge of the task-specific domain, the determination of task complexity is theoretically independent of an individual's educational experiences (Gc) and memory (SAR). To the extent that performance on these tasks requires short-term memory and educational experiences, we might expect significant correlations between task performance and the Gc and SAR factors. However, it is not clear what the relationship between performance and these factors would be as a function of relational complexity. Individual differences in segmentation and chunking have the potential to introduce some dependency between education and memory and performance differentially as a function of relational complexity. We are therefore interested in exploring the extent to which independence is evident in the relationship between task performance and the Gc and SAR factors.

39 Data from this study has been presented at several international and domestic conferences (Birney, 2001; Birney & Halford, 2000c, 2000a, 2001).
6.1 Design of the Study & Overview
The design we have implemented centres on exploring the influence of the relational complexity manipulations on performance in each of three tasks: the knight-knave task (Chapter 3), the Latin-square task (Chapters 4 and 5), and the sentence comprehension task (introduced here), as a function of broad cognitive abilities (Gf, Gc, and SAR). There are three marker tests for each of the psychometric factors and we will
consider the measurement properties of each of these tasks separately. The reason for
doing this is first to demonstrate the tasks have the appropriate psychometric properties
for the current sample. The second more important reason is that although many of
these tasks have come from standard psychometric tests, the presentation mode that has
been used differs in each case from standard procedures. Details of these modifications
are provided below. There are 12 tasks in total and the organization of this and the next
chapter will be as follows. This chapter will present the details of the item analyses and
any additional analyses associated with the psychometric tasks, and the comparison of
means approach to the analysis of the relational complexity tasks. Chapter 7 will
consider the relationship between the tasks, specifically, the relationship between the
psychometric factors and the manipulations of relational complexity in each of the three
RC tasks (knight-knave, sentence comprehension, and Latin-square).
6.2 Method
6.2.1 Participants
In total 191 students, 64 male and 127 female (mean age40 = 20.76, SD = 6.07),
participated in the study. The students were enrolled in a first-year undergraduate
subject at the University of Queensland and received 4% course credit for their
participation. No student had participated in any other aspect of the project and testing
was conducted in groups of no more than 20 students.

40 Based on the 151 subjects who supplied their age.
6.2.2 Materials
Table 6.1 lists the battery of nine psychometric tasks categorised into the broad abilities
they were employed to assess, and the three relational complexity tasks. The details of
the relational complexity analysis of the knight-knave task and Latin Square task have
been described earlier (in Chapters 3 and 4, respectively). Details of the relational
complexity analysis of the Sentence Comprehension Task have been described in detail elsewhere (Andrews, 1997; Andrews & Halford, in press); therefore only the core issues will be summarised here. The Arithmetic Reasoning and Similarities tests
were presented in a paper and pencil format. All other tasks were either partially or
completely computer administered and scored, and the details of the modifications to
the standardised tests are provided below.
6.2.2.1 Raven’s progressive matrices.
Consistent with Stankov (1998), a composite test of 32 items was developed. Set I of the Advanced Progressive Matrices was used as practice. The computer presented the
instructions and ensured that subjects responded correctly to the first 4 items before
moving on. The 12 items from Set E of the Standard Progressive Matrices and a random
40
Based on the 151 subjects who supplied their age.
183
selection of 20 items from the Advanced Progressive matrices comprised the test items
(see Appendix D.1 for a listing of the items used). The items were collated into a
booklet with only one item per page. To assist scoring, subjects made their responses on the computer. Response times were recorded; however, no time limit was imposed.
Rationale for Inclusion: The progressive matrices have traditionally been considered as
the criterion measure of fluid intelligence (Carroll, 1993). The task requires the ability
to deal with novelty, and to induce rules and principles from easier items and apply
them in more complex ways (e.g., Carpenter et al., 1990; Carroll, 1993). This is not to
say that the task has not attracted its share of controversy. There is an interesting debate
in the reasoning literature over the actual cause of item difficulty. Some of the more
lively debates focus on the influence of context in reasoning and suggest that the abstract nature of the items is a mediating factor (e.g., Richardson, 1991, 1996; Roberts, 1996; Roberts & Stevenson, 1996). In any case, the abstract nature of the task is likely to be the very reason the Raven's tests have served as a reliable predictor of at least
some of the aspects of fluid intelligence.
Table 6.1
List of cognitive tasks used in the study and the original source

Task                               Source
Fluid Intelligence (Gf)
  1  Ravens Progressive Matrices    ACER (Raven, Raven, & Court, 1993)
  2  Triplet Numbers                Stankov and Crawford (1993)
  3  Swaps                          Stankov and Crawford (1993)
Crystallized Intelligence (Gc)
  4  Arithmetic Reasoning           ETS Kit of Factor-Referenced Tests (1976)
  5  Similarities                   WAIS-III (199?)
  6  Vocabulary (synonyms)          ETS Kit of Factor-Referenced Tests (1976)
Short-term Apprehension & Retrieval (SAR)
  7  Digit Span Forward             Standard Psychometric Test
  8  Digit Span Backward            Standard Psychometric Test
  9  Paired-Associate Recall        WMS-III
Relational Complexity Task
  10 Knight-knave task              Rips (1989)
  11 Latin square task              Birney (developed for this thesis)
  12 Sentence comprehension         Andrews (1997)
6.2.2.2 Triplet Numbers Test.
The Triplet Numbers Test was first used by Wittenborn (1943, cited in Stankov &
Crawford, 1993) to study attention. The task involves the presentation of a series of
three randomly selected digits (e.g., 3 6 4) in which subjects are required to verify if this
“triplet” fits a given rule. Subjects are required to attempt as many items as possible
within the undisclosed time limit. There are four rules.
Level 1 - Search Triplets: If a 3 is present within the triplet then press yes,
otherwise press no (time limit = 2 min).
Level 2 - Half-rule Triplets: If the second digit in the triplet is the largest then
press yes, otherwise press no (time limit = 3 min).
Level 3 - One-rule Triplets: If the second digit is the largest AND the third digit
is the smallest, then press yes, otherwise press no (time limit = 6 min).
Level 4 - Two-rules Triplets: If the first digit is the largest AND the second
digit is the smallest, OR, if the third digit is the largest and the first digit is the
smallest, then press yes, otherwise press no (time limit = 6 min).
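To make the four rules concrete, the following minimal sketch (in Python; not the original experimental software) generates a triplet of unique digits and verifies it against each rule. The digit range and function names are illustrative assumptions.

    # Minimal sketch of the Triplet Numbers rules (illustrative only).
    import random

    def generate_triplet():
        # Three unique digits, e.g. (3, 6, 4); the exact digit range is an assumption.
        return tuple(random.sample(range(1, 10), 3))

    def rule_holds(level, triplet):
        a, b, c = triplet
        if level == 1:  # Search: is a 3 present?
            return 3 in triplet
        if level == 2:  # Half-rule: the second digit is the largest
            return b == max(triplet)
        if level == 3:  # One-rule: second largest AND third smallest
            return b == max(triplet) and c == min(triplet)
        if level == 4:  # Two-rules: (first largest AND second smallest) OR
                        #            (third largest AND first smallest)
            return (a == max(triplet) and b == min(triplet)) or \
                   (c == max(triplet) and a == min(triplet))
        raise ValueError("level must be 1-4")

    # Example: for the triplet 3 6 4 the correct response at Level 2 is "yes",
    # since the second digit (6) is the largest.
    print(rule_holds(2, (3, 6, 4)))  # True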
Two hundred triplets were randomly generated at each level such that the three
elements of the triplet were unique (i.e., no repeated digits in the triplet) and that the
rule for the given level was correct approximately 50% of the time (level 1 = 52%, level 2 = 50%, level 3 = 50%, level 4 = 53%). A random ordering of these 200 items was
generated and this order was presented to all subjects. Items were recycled if a subject
exhausted all 200 in a given level within the time-limit (although this only happened
once). At the start of each level, subjects were presented a trial of four triplets as
practice with correct/incorrect and response time feedback. During the practice trials,
the current rule was always displayed at the top left of the display. At the end of each
practice trial subjects were informed that the rule would not be available during the test
and were given the opportunity to attempt another practice trial if they still felt unsure
about the rule before moving onto the test items. Subjects were not informed that there
was a time-limit on each level and simply continued responding until the computer told
them to stop. The levels were presented in order of increasing difficulty (level 1 through
4).
During both practice and test trials, the triplet was presented at the top centre of the
display in 30mm red lettering on a black background. The word “Ready…” was
displayed near this area for one second before each triplet to orient the subject to the
stimulus. The response buttons were always presented in the centre of the display with
the “Yes” button about 15mm directly to the left of the “No” button. A mandatory 5s
break was enforced between responding to the last item and presentation of the
instructions for the next level. This was done using a black screen and the words “Get
ready for the next test…” displayed in the centre. After the 5s pause, subjects clicked
the “Next” button when they were ready to continue on to the next level.
Rationale for Inclusion: Levels 3 and 4 of the Triplet Numbers Test have been
demonstrated to be good measures of Fluid Intelligence by Stankov and his associates
(Stankov, 2000; Stankov & Crawford, 1993; Stankov & Raykov, 1995). The task is
particularly interesting since Stankov and his colleagues have used it to provide
empirical support for the Complexity-Gf relationship. Therefore, an analysis of the
nature of the cognitive complexity in this task and its relationship with relational
complexity has the potential to further our understanding of cognition in complex
situations.
6.2.2.3 Swaps test.
The swaps test was also developed in Stankov’s laboratory at the University of Sydney
(Crawford, 1991; Stankov & Crawford, 1993). The task involves the presentation of the
same three letters (J, K, and L) in a random order. Subjects are presented with one to
four instructions to rearrange (i.e., swap) the ordering of the letters. As an example,
Figure 6.1 presents the display for a typical item from the swaps test. The instructions
(Swap 2,3 and Swap 1,3) indicate to the subject which letters of the stimulus (L J K)
should be rearranged. The correct ordering after the first instruction would be L K J
(i.e., swap the position of letter 2 with letter 3). The correct ordering after the second
instruction (and the correct response for the item) would in this case be J K L.
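As a concrete illustration of how a Swaps item is evaluated, the following minimal sketch (in Python; not the original experimental software) applies a list of swap instructions to a stimulus; the function name is illustrative only.

    # Minimal sketch of scoring a Swaps item (illustrative only). Each instruction
    # "Swap i, j" exchanges the letters at positions i and j (1-based), applied in order.
    def apply_swaps(letters, instructions):
        letters = list(letters)
        for i, j in instructions:
            letters[i - 1], letters[j - 1] = letters[j - 1], letters[i - 1]
        return " ".join(letters)

    # Worked example from the text: stimulus L J K with "Swap 2, 3" then "Swap 1, 3".
    print(apply_swaps("LJK", [(2, 3), (1, 3)]))  # J K L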
Four sets of 12 items were generated entailing one, two, three, and four swaps,
respectively. The 48 items were presented in a different random order for each subject.
Subjects were provided with a worked example to orient them to the nature of the task.
The three letters were presented at the top centre of the display in 20mm red lettering.
The string of “swap” instructions was presented below in yellow text and a smaller font.
Each swap instruction was presented on a new line. The six possible response orderings
of the JKL series were provided as buttons in the bottom half of the screen in a constant
order (see Figure 6.1). Subjects selected their response by clicking on the appropriate
button.
Figure 6.1. Display layout for the Swaps task. The stimulus (L J K) is presented at the top, the swap instructions (Swap 2, 3; Swap 1, 3) appear beneath it, and the six response buttons (JKL, JLK, KJL, KLJ, LJK, LKJ) are arranged below.
Rationale for Inclusion: The more difficult levels of the Swaps Test have been used as
measures of Fluid Intelligence by Stankov’s research team. The Swaps Test also entails
a manipulation of cognitive complexity that produces correlations generally consistent
with the statistical complexity-Gf relationship. We are therefore also interested in how a
relational complexity account of this task might fit with the complexity-Gf relationship.
6.2.2.4 Arithmetic reasoning.
The test of arithmetic reasoning was based on the General Reasoning scale of the ETS
Kit of Factor-Referenced Cognitive Tests (Ekstrom et al., 1976). In total 15 items were
compiled from a random selection of 5 items from the Arithmetic Aptitude Test RG-1
(designed for grades 6-12) and a random selection of 10 items from the Mathematics
Aptitude Test RG-2 (designed for grades 11-16). The actual items used are provided in
Appendix D.2. All items consisted of a word problem presented in a pencil and paper
format with a 5-choice response option. The content was modified to suit an Australian
sample (e.g., miles was replaced with kilometres). Subjects were asked to work as
quickly and accurately as possible and to use the space provided for any working. The
use of calculators was not permitted.
Rationale for Inclusion: Arithmetic reasoning has been considered to reflect the influence of education and acculturation and is therefore regarded as a measure of Crystallized Intelligence (and of Numerical Facility at the first order; Carroll, 1993). When presented
as word problems, the task also has the potential to tap traditional verbal abilities that
are associated with Gc. The factor loadings of WAIS-R arithmetic reasoning test on Gc
reported by Stankov, Roberts, and Spilsbury (1994) tended to be somewhat lower than
the vocabulary type tests and there was some cross loading with the SAR factors. Given
the test in its current form is presented as a paper and pencil test, we suspect cross
loadings on a memory factor will be minimised. We therefore decided to include this
test for two reasons. First, it clearly captures the theoretical criterion for a measure of
Gc. Second, Carroll (1993) suggests that mathematics knowledge tends to have cross
loadings with Gf and Gc. Therefore the inclusion of arithmetic reasoning might assist in
broadening the definition of Gf beyond the Progressive Matrices test (and therefore give
a more generalised test of the complexity-Gf relationship).
6.2.2.5 Similarities.
The similarities test was included as a marker test of Gc consistent with Carroll’s (1993)
classification of the test as reflecting a first order verbal ability factor. It has also been
used as a measure of Gc by Stankov and his colleagues either directly as a subtest of a
complete WAIS administration (e.g., Stankov et al., 1994) or in their own variation that
requires subjects to identify two words in a list that are most similar (e.g., Stankov,
1989, 1988). The standard administration procedure of the WAIS-III similarities test
was modified to be presented as a paper and pencil test which the subjects could work
through at their own pace. The first item was presented as a practice, with appropriate
responses from the WAIS-III manual as examples. Subjects were instructed to list as many examples of similarity as they felt were needed to justify their understanding. Details on the scoring are provided in the results section below.
Rationale for Inclusion: The similarities test was included as a marker test of Gc
consistent with Carroll (1993) and Stankov et al. (1994).
6.2.2.6 Vocabulary (Synonyms).
The vocabulary test was based on the Verbal Comprehension scale of the ETS Kit of
Factor-Referenced Cognitive Tests (Ekstrom et al., 1976). In total 36 items were
compiled from the Vocabulary Test II – V2 that the manual recommends for grades 7-12. The task was modified to be computer administered and scored; however, the presentation order was as specified in Ekstrom et al. (1976). Each item consisted of a
cue word presented in 20mm white lettering towards the centre left of the display on a
blue background. The five possible synonyms from the subject were presented in 15mm
red lettering to the right of the cue and subjects were required to select their choice from
one of these options using the computer mouse. Subjects were permitted to change their
choice for a given item as often as they wished. The computer recorded the final
response (if subjects changed their selection within the trial), accuracy of the final
choice, and the time from when the item was presented to when the “Next” button was
pressed to display the next item.
Rationale for inclusion: Synonym tests have been used to assess breadth of vocabulary
that is considered to be a core marker of Crystallized Intelligence (e.g., Carroll, 1993;
Stankov et al., 1994).
6.2.2.7 Digit Span Forward.
The digit span forward task took the traditional format but was presented on the
computer. Each digit was presented one at a time at 1000ms intervals. At the end of the
presentation the word “Go” was displayed to indicate to the subject that they should
enter the string of digits in the exact order that was just presented. Subjects could use
either the numeric keypad or the keys at the top of the keyboard to respond. Subjects
were not permitted to change a digit once it had been keyed and previously entered
digits remained visible until the next item button was selected. Two items at each digit
length were presented in increasing length from two through to nine digit items. That is,
two items containing two digits were presented; then two items of three digits, and so
on through to two items of nine digits long. The Digit Span Backward task was presented immediately after this task.
Rationale for inclusion: The digit span forward task is a measure of memory span
length and was included in the current battery as a marker of the SAR factor (e.g.,
Stankov et al., 1994; Carroll, 1993; Roberts & Stankov, 1999). It is arguably the most
common marker of this factor.
6.2.2.8 Digit Span Backwards.
The digit span backward task took the same format as the forward version; however, subjects were required to enter the digits in the reverse order to how they were
presented. Items were again presented in order of increasing digit list length. The first
pair of items had a digit list length of two, with the list length of subsequent pairs of
items increasing by 1 digit through to the final pair of items that had a list length of 9.
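As a simple illustration of the scoring rule implied above, the following minimal sketch (in Python; not the original software) marks a span response as correct when it reproduces the presented digits in the required order.

    # Minimal sketch of scoring a digit span response (illustrative only). A forward
    # item is correct if the typed digits match the presented order; a backward item
    # is correct if they match the presented order reversed.
    def span_correct(presented, response, backward=False):
        target = list(reversed(presented)) if backward else list(presented)
        return list(response) == target

    # Example: the list 4-7-2 in the backward task requires the response 2-7-4.
    print(span_correct([4, 7, 2], [2, 7, 4], backward=True))  # True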
Rationale for inclusion: As for the forward version of the digit span test, this task was
included as a marker of the short-term memory factor, SAR. The backward version is
particularly interesting since it has also been reported to entail fluid abilities. This is
considered to be a function of some learning ability that is enlisted to develop
mnemonics to be able to manipulate and repeat digits in the reverse order (Carroll,
1993). Hence, the task is expected to load on the Gf factor and this will also assist in defining Gf in a way that is less focused on performance in the Progressive Matrices test.
6.2.2.9 Paired-Associate Recall.
The paired-associate recall test was based on the Verbal-Paired Associates-I from the
third edition of the Weschler Memory Scale (WMS-III). Eight word pairs are presented
to the subject in the “learning phase”. The first word of each pair (cue) was presented
on the left of the screen and the second (target) was presented immediately to its right
1500 ms later. The pair stayed on screen for a further 1500 ms before the screen was
cleared and the next pair presented. The participants were instructed to remember which
cue was associated with each target. At the completion of the learning phase, the subject
was told that one word from each pair would be displayed and that their task was to
provide the word that had been associated with it in the learning phase. Subjects made
their response by typing the target word in the space where the second word was
presented in the learning phase. This procedure was repeated three additional times with
both the ordering of the pairs in the learning phase and the presentation of the cue in the
recall phase being randomised for each subject.
Rationale for inclusion: The Paired-Associate Recall test was included as the third
marker of the short-term memory factor, SAR. It has also been identified by Carroll
(1993) as a marker of the first-order Associative Memory factor. It is a standard
psychometric test of short-term memory.
6.2.2.10 Knight-Knave Task.
The knight-knave task followed the computer-administered format outlined in Chapter 3
with identical introductory and practice sections. Table 6.2 lists the items that were
presented and the full complexity analysis is available in Appendix A. The items used
in this administration are essentially identical to the items used previously in Chapter 3. There were, however, a couple of small changes. An additional ternary item was added (Item 3.6), to make 6 ternary items in total. Items 4.1 and 4.3 were omitted from the quaternary item pool and two additional items were added (4.6 and 4.7), to make 5 quaternary items in total. Three indeterminate items classified as quaternary were also
added to allow subjects legitimate use of the “Not possible” category, and so that we
could explore the influence of indeterminacy on performance in quaternary items.
Table 6.2
Ternary and quaternary knight-knave items

Ternary Items

Item 3.1
B says, A is a knight and B is a knave.
If we know B is a knave, what is the status of A?

Item 3.2
A says, "A is a knight and B is a knave"
B says, "A is a knight"
Is inhabitant B a knight?

Item 3.3
A says, "A is a knight"
B says, "A is a knight"
If A is a knave, what is the status of B?

Item 3.4
A says, "A is a knight"
B says, "A is a knave and B is a knave"
Is inhabitant B a knight?

Item 3.5
A says, "A is a knight"
B says, "A is a knave and B is a knight"
If A is a knight, what is the status of B?

Item 3.6
A says, "A is a knight and B is a knight"
B says, "A is a knight and B is a knight"
If we know B is a knave what is the status of A?

Quaternary Items

Item 4.2
A says, "B is a knave"
B says, "A is a knight or B is a knight"
Is inhabitant A a knave?

Item 4.4
A says, "A is a knave and B is a knight"
Is inhabitant B a knave?

Item 4.5
A says, "A is a knave and B is a knave"
Is inhabitant B a knave?

Item 4.6
A says, "A is a knight"
B says, "A is a knave and B is a knave"
Is inhabitant B a knave?

Item 4.7
A says, "A is a knave or B is a knave"
B says, "A is a knave or B is a knight"
What is the status of B?

Item 4.8 (indeterminate)
A says, "A is a knight and B is a knave"
B says, "A is a knave or B is a knight"
Is inhabitant A a knave?

Item 4.9 (indeterminate)
A says, "B is a knave"
B says, "A is a knave and B is a knight"
What is the status of A?

Item 4.10 (indeterminate)
A says, "A is a knight or B is a knight"
B says, "A is a knave or B is a knight"
If A is a knight what is the status of B?
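To illustrate how the status of an inhabitant is determined in these items, the following is a minimal sketch (in Python; not part of the experimental materials) that solves an item by exhaustive consistency checking under the standard knight-knave convention, in which knights always tell the truth and knaves always lie. The representation of statements as functions is illustrative only.

    # Minimal sketch of solving a knight-knave item by exhaustive consistency checking
    # (illustrative only). An assignment of statuses is consistent only if every
    # statement's truth value matches its speaker's status (True = knight = truth-teller).
    from itertools import product

    def consistent_worlds(statements):
        # Return all status assignments for A and B consistent with the (speaker, claim) pairs.
        worlds = []
        for a, b in product([True, False], repeat=2):
            world = {"A": a, "B": b}
            if all(world[speaker] == claim(world) for speaker, claim in statements):
                worlds.append(world)
        return worlds

    # Item 4.2: A says "B is a knave"; B says "A is a knight or B is a knight".
    item_4_2 = [("A", lambda w: not w["B"]),
                ("B", lambda w: w["A"] or w["B"])]
    print(consistent_worlds(item_4_2))
    # Only {'A': False, 'B': True} is consistent: A is a knave and B is a knight,
    # so the answer to "Is inhabitant A a knave?" is "yes".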
6.2.2.11 Latin Square Task.
In Chapter 4, we identified the potential for number of processing steps to be a
significant factor in the processing entailed in the Latin Square task. The item base was
therefore modified to include a reliable manipulation of relational complexity that was independent of the number of processing steps. A set of 36 items was generated, 12 at
each level of complexity. Within each complexity level, 6 items entailed only one
process and 6 items entailed two processes (i.e., an additional process of an equal or
lower level of complexity). These items and the associated RC analysis are provided in
Appendix B.5 and B.6. This item pool was also used in the dual-task study reported in
Chapter 5. The presentation format that we use here is identical to that outlined in
Chapter 4.
6.2.2.12 Sentence Comprehension Task.
The format of the sentence comprehension task is a modification of that outlined in
Andrews and Halford (in press; Andrews, 1997). The number of roles (2, 3, 4, or 5) and
sentence form (centre-embedded (CE), right branching (RB)) can be used to categorise
processing demand in terms of sentence structure. The number of roles corresponds to
the relational complexity level of the CE sentences. Two-role sentences are not actually
CE, but are either RB object-focused or RB subject-focused. An object-focused
sentence has two noun phrases that precede the verb (more if there are more than two
roles), whereas subject-focused sentences have only one noun phrase that precedes the
verb. For comparability with 3-, 4-, and 5-role sentences, the object-focused 2-role
sentences are referred to as “2-role CE”.
Two blocks of sentences were constructed although each subject received sentences
from only one block (A or B). Each block consisted of 64 sentences, 8 CE and 8 RB
sentences at each of the four role-levels. From these, 6 CE and 6 RB sentences were
randomly selected from each of the four role levels and randomly presented such that no
more than four items from one level were presented consecutively. Hence, 48 items in
total were presented to each subject. The semantic content of the RB sentences in Block
A matched the content of CE sentences in Block B. Similarly, the semantic content of
the CE sentences in Block A matched the content of the RB sentences in Block B.
Hence, for each participant, a particular semantic content appeared in only one sentence
form. Table 6.3 lists example items at each role level.
Table 6.3
Example 2-, 3-, 4- and 5- role center-embedded (CE) and right-branching (RB) sentences with
probe question types.
Roles
Examples
Probe Questions
Response
2*
CE (Block A): Sally saw the
2-1. Who helped?
Noun
woman that the man helped.
2-2. Who was helped?
Noun
3
4
5
RB (Block B): John saw the man
that helped the woman.
CE (Block A): The duck that the
monkey touched walked.
RB (Block B): The monkey
touched the duck that walked.
CE (Block A): The artist that the
waiter warned the chef about
talked.
RB (Block B): The waiter warned
the chef about the artist that
talked.
CE (Block A): The clown that the
teacher that the actor liked
watched laughed.
3-1. Who touched?
3-2. Who walked?
3-3. Who was touched?
3-4. What did the duck do?
3-5. What did the monkey do?
4-1. Who warned?
4-2. Who talked?
4-3. Who was warned?
4-4. Who was the chef warned about?
4-5. What did the artist do?
4-6. What did the waiter do?
Noun
Noun
Noun
Verb
Verb
Noun
Noun
Noun
Noun
Verb
Verb
5-1. Who liked?
Noun
5-2. Who watched?
Noun
5-3. Who laughed?
Noun
5-4. Who was liked?
Noun
RB (Block B): The actor liked the 5-5. Who was watched?
Noun
teacher that watched the clown
5-6. What did the clown do?
Verb
that laughed.
5-7. What did the teacher do?
Verb
5-8. What did the actor do?
Verb
* According to Andrews (in preparation), the initial noun phrase in two-role sentences (e.g., “Sally saw…”) does not contribute to the relational complexity. See Section 6.7.1.3 for details.
Sentence comprehension was evaluated by the accuracy of the subject’s response to a
probe question that referred to a noun-verb relation in the sentence. The number of
potential probe questions for a given sentence depended on the number of roles. As
shown in Table 6.3, there are two probe question types for 2-role sentences. For 3-role
sentences there are five possible question types, six types for 4-role sentences, and eight
types for 5-role sentences. All possible probe question types for a given role-level were
blocked and probe questions were randomly drawn (without replacement) for
presentation. On exhausting the block, this selection procedure was repeated. For
example, after presenting five 3-role sentences, all question types would have been
randomly sampled. Sampling without replacement from the five question types would
begin again for the next 3-role sentence presented. As another example, consider the 4-role sentences: there are six question types and, since only six items are presented for each sentence form (CE/RB), subjects will receive each of these probe questions once (for each sentence form).
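To make this blocked sampling scheme concrete, the sketch below (illustrative Python only; the function name and question-type labels are ours rather than part of the original testing software) cycles through a set of probe question types, sampling without replacement and reshuffling the pool once it has been exhausted.

import random

def probe_question_sampler(question_types):
    # Yield probe question types sampled without replacement; once every
    # type in the block has been used, the block is reshuffled and reused.
    pool = []
    while True:
        if not pool:
            pool = list(question_types)
            random.shuffle(pool)
        yield pool.pop()

# e.g., the five probe question types available for 3-role sentences
sampler = probe_question_sampler(["3-1", "3-2", "3-3", "3-4", "3-5"])
for _ in range(6):          # six 3-role CE sentences are presented per subject
    print(next(sampler))    # the first five draws exhaust the block; the sixth starts a new cycle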
Procedure: Written instructions informed participants that the computer would present
a series of sentences one at a time. Subjects were asked to read the sentence as many
times as they needed to ensure that they understood it, and then to press the spacebar.
Having pressed the spacebar, subjects were told that the sentence would be replaced
with a question to test their comprehension followed by a selection of possible answers.
Subjects were instructed to make their response by clicking on their choice with the left
mouse button. Emphasis was given to the fact that the sentence and question would not
be displayed at the same time and that subjects should therefore make sure that they
understood the sentence before pressing the spacebar.
The consecutively assigned subject identification numbers provided to students at the
beginning of the experiment determined which block was presented. Students with even
identification numbers were presented Block A. All other students received Block B.
All sentences were presented in the upper half of a computer screen in 15-20mm yellow
lettering on a gray background. When the spacebar was pressed, probe questions
replaced the sentences in the top half of the screen and response options were presented
in the lower half of the screen in 15-20mm red lettering. It was not possible for students
to change their choice once they had made a response. The subject determined the
duration between their response to the sentence probe question and presentation of the
next item by pressing the spacebar.
6.2.3 General Procedure
The administration of the tests was spread over two testing sessions lasting between 1
and 2 hours each and separated by no more than two weeks. At the start of each session,
students were seated at 486 DX computers with 14-inch monitors. An experimenter
introduced the study by explaining that the focus of the work was to explore the way
people reason and solve problems. Each student was given an identification number and
the experimenter individually guided each student to start each task. The first screen of
every task required students to enter their name and identification number into the
computer (with the assistance of an experimenter if necessary). Students were asked to
read the instructions carefully on the screen and to inform the experimenter if they did
not understand what they were expected to do or if there was some apparent problem
with the computer. At least a one minute rest period between tasks was enforced
although students could take as long as they desired. Each student received the battery
of tasks in a different order (although all students received the Digit Span – backward
task immediately after the Digit Span – forward version). The paper and pencil tasks
were also completed at the computer desk but no times were recorded.
6.3 Overview of Analyses for Chapter 6
In the remainder of this chapter we will consider first the psychometric properties of
each of the marker tests to determine an appropriate measure of performance to use in
the correlational analyses. The measurement properties of each task are considered by
using the Rasch methodology that was described briefly in Chapter 4. When the data fits
the Rasch measurement model there is evidence to suggest that the transformed scores
(person estimates) have the properties of an interval measure (Wright, 1999) and
therefore where possible we will use these transformed scores as the measures in the
correlational analyses. It is interesting to note that proponents of the Rasch model also
argue that even if the data does not fit the Rasch model, using the transformed scores is
better than using the raw scores (e.g., Rasch Measurement Transactions), although
investigation of the sources of misfit is still important. We follow this approach and
consider cases of misfit where they arise.
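For readers unfamiliar with the transformation, the following sketch (illustrative Python, not the software used for the analyses in this thesis) shows how a person estimate in logits can be obtained under the dichotomous Rasch model once item difficulties are known, using a simple Newton-Raphson maximum likelihood routine; the difficulties and responses in the example are invented.

import math

def rasch_p(theta, b):
    # Probability of a correct response under the dichotomous Rasch model
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def person_estimate(responses, difficulties, iterations=25):
    # Maximum likelihood ability (in logits) for one person given item
    # difficulties in logits; undefined for all-correct or all-wrong patterns.
    theta = 0.0
    for _ in range(iterations):
        p = [rasch_p(theta, b) for b in difficulties]
        gradient = sum(x - pi for x, pi in zip(responses, p))   # d logL / d theta
        information = sum(pi * (1.0 - pi) for pi in p)          # Fisher information
        theta += gradient / information                         # Newton-Raphson step
    return theta

# invented example: six items of increasing difficulty, four answered correctly
print(person_estimate([1, 1, 1, 0, 1, 0], [-2.0, -1.0, 0.0, 0.5, 1.0, 2.0]))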
The objectives for the analyses of the relational complexity tasks are to consider the
evidence for a relational complexity effect in each of the tasks. That is, to test whether
performance on the Sentence Comprehension Task, the knight-knave task, and the Latin
Square task is influenced by the relational complexity manipulation. We also consider
other issues that might influence the confidence we have in generating composite
measures of performance within levels of complexity.
6.4 Markers of Fluid Intelligence
6.4.1 Raven’s Progressive Matrices
The composite Progressive Matrices test was attempted by 175 students (119 female &
56 male). Given the test is a composite of items from various versions of the
progressive matrices series, some inspection of the characteristics of this set of items is
warranted. The average time for completion of the instruction and practice phase of the
test was 7.15mins (SD = 2.06mins). The average time for completion of the test phase
was 23.28 mins (SD = 8.02 mins). Students were given no time limit in which to finish
the test; however, two participants had to be stopped because they had not completed the
task within the 2hr session time limit. The missing responses for these students were
scored as incorrect. In either case the percentage of missing values is very small and
does not influence the interpretation of the test’s appropriateness41.
The composite Raven’s test items consisted of 12 practice and 32 test items. The 32 test
items were submitted to a Rasch analysis and a traditional item (reliability) analysis and
the results are presented in detail in Appendix D.1. The items have come from a wellestablished test and we would expect its psychometric properties to be strong. This is
indeed what we observed. Internal consistency was good, Cronbach α = .84, and the
Rasch analysis indicated that the data was a good fit to the Rasch measurement model
(test of the item×trait interaction – 4 groups: χ2 (96) = 113.86, p = 0.10). The
assessment of item fit using both infit and outfit statistics indicated that none of the 32
test items had extreme infit statistics and only two items (6 and 9) had outfit
statistics beyond the 1.00 ± 0.30 range suggested in the Rasch literature (e.g., Wright et
al., 1994). These items had outfit statistics less than .70 suggesting responses to the
items tended to be too consistent given the calibrated ability of the subjects42.
Person Fit: In terms of assessing person fit, Figure 6.2 plots the infit and outfit statistics
as a function of calibrated ability. The plot demonstrates that over 22.8% of subjects
41
When these subjects were removed from the item analyses, no change greater than 0.8% was noted in
item difficulties.
42
A consistent response pattern suggests that too many poor subjects got the items wrong and too many
good subjects got the item correct. See Chapter 4 for a more detailed discussion of these issues.
had infit scores outside of the recommended fit region (1.00 ± 0.30). A total of 13.1% of
the sample had infit values greater than 1.30 and 9.7% had infit values less than 0.70.
Although person misfit is a little higher than would be desirable, the item statistics are
strong and we therefore use these calibrated ability scores as the measure of
performance in the subsequent analyses. Table D.1 (in Appendix D) also provides the
proportion correct values from the 20 items from Set II of the Advanced Progressive
Matrices manual (Raven et al., 1993). The rank order correlation between these two sets of item difficulties is very high (Spearman's rho = .998), suggesting the relative ordering of item difficulty remained the same when the items were presented in the non-standardised way.
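The infit and outfit statistics referred to throughout this chapter are mean squares of standardised residuals. The sketch below (illustrative Python; in practice the expected probabilities come from the calibrated abilities and difficulties) shows the two calculations for a single response pattern.

def fit_mean_squares(responses, probs):
    # Outfit: unweighted mean of squared standardised residuals.
    # Infit: squared residuals weighted by the item information, p(1 - p).
    resid_sq = [(x - p) ** 2 for x, p in zip(responses, probs)]
    info = [p * (1.0 - p) for p in probs]
    outfit = sum(r / v for r, v in zip(resid_sq, info)) / len(responses)
    infit = sum(resid_sq) / sum(info)
    return outfit, infit

# invented pattern: values near 1.00 indicate fit; the 1.00 +/- 0.30 region is
# the criterion adopted in this chapter (Wright et al., 1994)
print(fit_mean_squares([1, 1, 0, 1, 0], [0.9, 0.8, 0.6, 0.5, 0.3]))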
Figure 6.2. Infit and outfit statistics (mean square fit values) as a function of calibrated ability in the composite progressive matrices test, with persons ordered by estimated ability and the fit region (1.00 ± 0.30) indicated.
6.4.2 Triplet Numbers Test
The measure of performance on the Triplet Numbers Test typically used by Stankov and Crawford (1993) is an aggregated score across the trials presented at each level of difficulty and as such is not really amenable to a Rasch-based item analysis. There are therefore some issues that need to be resolved in terms of which measure should be
used as the marker of Gf. All levels of difficulty in the test are time limited (see Section
6.2.2.2). Using the proportion of items answered correctly has the potential to miss
individuals who work slowly in order to increase accuracy (and therefore attempt fewer
items). Using number of items answered correctly also has problems since the time
limits differ and therefore the scale will be different for each level. A possible solution
would be to use the number of items answered correctly per minute (or some other
standard time frame). This would present the number correct on a scale that is more
comparable across levels. Stankov and Crawford (1993) use the average time to respond
correctly to an item as a measure of performance and this variable tends to place more
emphasis on processing speed.
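The candidate measures just discussed can be derived from trial-level records in a few lines. The sketch below is illustrative Python only (the variable names and the example time limit are ours); it assumes the subject worked for the full time limit at that level.

def triplet_level_measures(correct_flags, response_times, time_limit_sec):
    # correct_flags: 1/0 per attempted item; response_times: seconds per item.
    n_correct = sum(correct_flags)
    proportion_correct = n_correct / len(correct_flags)
    correct_per_minute = n_correct / (time_limit_sec / 60.0)
    correct_rts = [t for c, t in zip(correct_flags, response_times) if c]
    mean_rt_correct = sum(correct_rts) / len(correct_rts)
    return proportion_correct, correct_per_minute, mean_rt_correct

# invented data for one subject at one level with a hypothetical 90 s time limit
print(triplet_level_measures([1, 1, 0, 1, 1, 0, 1], [1.1, 0.9, 2.0, 1.3, 1.0, 1.8, 1.2], 90))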
Of the 191 subjects who participated in the study, 165 attempted the triplet numbers
test. Two of these subjects did not complete the most difficult level of the test. The
descriptive statistics and correlations between levels and measures are presented in
Table 6.4. Since an item based analysis is not really possible, we consider the
correlation between measures from the levels to try and assess the internal consistency
of the task. As an additional indicator of the internal consistency of the task, Cronbach
α is calculated using the four levels as items. The results suggest that there is a
significant degree of commonality between the levels in the number of correct
responses per minute measure. Interestingly, there is evidence for a simplex relationship
between the levels on this measure. Simplex is a pattern of correlations in which values
close to the main diagonal of the correlation matrix are large and become smaller
further away from the diagonal. That is, as the distance between the levels of the triplet numbers test in the correlation matrix increases, the magnitude of the correlation between the levels decreases. The presence of a simplex structure has been interpreted to
imply that the variables in question are ordered in terms of complexity in individual
differences (Roberts et al., 1988; Roberts, 1997). However, Raykov and Stankov (1993)
demonstrate that it is possible for increases in complexity not to produce a simplex
pattern, and for the simplex pattern to occur without changes in complexity. They
therefore suggest that simplex structure is not sufficient evidence to indicate the presence of a complexity manipulation.
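A simplex pattern can be checked directly by averaging the correlations at each distance from the main diagonal. The sketch below (illustrative Python) applies this to the intra-measure correlations for the number of correct responses per minute reported in Table 6.4.

import numpy as np

def mean_r_by_lag(corr):
    # Average correlation at each distance ("lag") from the main diagonal;
    # a simplex pattern shows these averages decreasing as the lag increases.
    corr = np.asarray(corr)
    k = corr.shape[0]
    return [float(np.mean([corr[i, i + lag] for i in range(k - lag)]))
            for lag in range(1, k)]

cpm = np.array([[1.00, 0.63, 0.43, 0.19],     # correct responses per minute,
                [0.63, 1.00, 0.73, 0.47],     # levels 1-4 (Table 6.4)
                [0.43, 0.73, 1.00, 0.67],
                [0.19, 0.47, 0.67, 1.00]])
print(mean_r_by_lag(cpm))   # approximately [0.68, 0.45, 0.19]: decreasing with lag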
Table 6.4
Means, standard deviations (in parentheses), coefficient α, and intra-measure correlations for performance measures on the triplet numbers test

Number of correct responses per minute (α = .8068):
  c/m1: 30.65 (3.42)
  c/m2: 27.15 (3.45); r with c/m1 = .63
  c/m3: 23.44 (2.97); r with c/m1, c/m2 = .43, .73
  c/m4: 17.06 (3.39); r with c/m1, c/m2, c/m3 = .19, .47, .67
Proportion of items answered correctly (α = .4632):
  %c1: 0.99 (0.01); %c2: 0.97 (0.06); %c3: 0.96 (0.04); %c4: 0.88 (0.14)
Mean response time per item (α = .6165):
  rt1: 0.89 (0.21); rt2: 1.09 (0.23); rt3: 1.41 (0.32); rt4: 2.09 (0.68)
Mean response time per item answered correctly (α = .6085):
  crt1: 0.89 (0.21); crt2: 1.10 (0.24); crt3: 1.41 (0.32); crt4: 2.10 (0.69)

* Coefficient α determined using the 4 levels of difficulty as items. Values greater than r = .15 are significant at p < .05; greater than r = .25, p < .001.
Another interesting characteristic of the triplet data is that the correlation between
proportion correct and mean response time per item (regardless of accuracy) is only
significant for level 4 items, the most difficult level, r(163) = .28, p < .001. That is,
although the effect is weak, there is evidence for a speed-accuracy trade-off in the more
difficult levels in that higher accuracy is generally obtained by spending more time on
the task. For the easier levels, accuracy does not appear to be related to the
time taken to respond. This would be consistent with the role of strategy variation as an
influence on complexity (Crawford, 1991) that we discussed in Chapter 2, Section 2.3.3.
We will consider the implications of these types of strategies further in Chapter 7.
6.4.2.1 ANOVA analyses on levels of difficulty.
Given the triplet numbers test was developed and used by Stankov and his associates to
demonstrate the influence of task complexity and fluid intelligence on performance, it is
necessary to consider the nature of the differences between levels for the current
sample. A series of simple one-way ANOVAs was conducted to explore the effect for
each of the measures reported in Table 6.4. In each case there was a significant main
effect for level and in each case planned Helmert contrasts were significant (comparison
of level 1 with levels 2, 3, and 4; level 2 with 3 and 4; and level 3 with 4). These results
are reported in Table 6.5 and effectively replicate the findings of Stankov and
Crawford (1993; Stankov, 2000). It also suggests that the number of correct responses
per minute is a potentially useful measure given (i) it is the least susceptible to
heteroscedasticity (see standard deviations in Table 6.4), and (ii) the level of difficulty
accounts for 86% of the total variation in the data that is unaccounted for by other
factors (i.e., partial-η2 = .86). Stankov (2000) used the two most complex levels of the
Triplet Numbers Test as measures of Gf and consistent with this we use the number of
correct responses per minute for levels 3 and 4 as measures in the correlational analyses
reported in Chapter 7.
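The planned Helmert contrasts compare each level with the mean of all later levels. A minimal sketch of the corresponding contrast weights is given below (illustrative Python; the level means used are those for the correct-responses-per-minute measure in Table 6.4, and the F tests reported in Table 6.5 would scale such contrasts by their error terms).

import numpy as np

# Helmert contrast weights for the four difficulty levels: each row compares
# one level with the mean of all later levels (level 1 vs. 2-4, 2 vs. 3-4, 3 vs. 4).
helmert = np.array([[3, -1, -1, -1],
                    [0, 2, -1, -1],
                    [0, 0, 1, -1]])

level_means = np.array([30.65, 27.15, 23.44, 17.06])   # c/m means, Table 6.4
print(helmert @ level_means)   # raw contrast estimates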
Table 6.5
ANOVA results for four measures from the triplet numbers test and proportion of experimental effect accounted for by the comparison (η2)

Number of correct responses per minute: omnibus F = 1023.10**, η2 = .86
  Level 1 vs. Later: F = 1026.67**, η2 = .86; Level 2 vs. Later: F = 1042.10**, η2 = .87; Level 3 vs. Level 4: F = 986.62**, η2 = .86
Proportion of items answered correctly: omnibus F = 75.03**, η2 = .32
  Level 1 vs. Later: F = 120.71**, η2 = .43; Level 2 vs. Later: F = 66.92**, η2 = .29; Level 3 vs. Level 4: F = 62.51**, η2 = .28
Mean response time per item: omnibus F = 380.08**, η2 = .70
  Level 1 vs. Later: F = 595.28**, η2 = .79; Level 2 vs. Later: F = 539.91**, η2 = .77; Level 3 vs. Level 4: F = 206.58**, η2 = .56
Mean response time per item answered correctly: omnibus F = 371.90**, η2 = .70
  Level 1 vs. Later: F = 607.99**, η2 = .79; Level 2 vs. Later: F = 520.27**, η2 = .76; Level 3 vs. Level 4: F = 198.60**, η2 = .55

Omnibus df = 3, 486; contrast df = 1, 162; ** p < .001; η2 = partial η2
6.4.3 Swaps Test
Accuracy on the Swaps Test is amenable to a Rasch analysis since all subjects received
the same 48 items (although in different random orders). Two analyses were run on the
data for the 168 subjects who completed the test. The first considered the analysis of all
48 items to determine the structure of the test as a whole. The overall test of fit was
appropriate (χ2 (144) = 171.93, p = .056, person separation index = .817) when four
trait groups were used and no items exceeded the suggested infit and outfit range of
1.00 ± 0.30 (Wright et al., 1994). A traditional reliability analysis indicated high
internal consistency across the 48 items, α = .935. The second analysis was run on the
two most complex levels of the swaps test so as to generate an ability measure that we
initially believed would be more appropriate as a marker of fluid ability for the
correlational analyses (consistent with Stankov, 2000)43. The overall fit for these 24
items was much better, χ2 (72) = 69.90, p = .548 (person separation = .760), and no
items exceeded the bounds for the individual test of item infit. Again, the traditional
43
The principle components analyses do not differ as a function of whether the full test is used or just
levels 3 and 4.
reliability analyses supported the internal consistency of the scale, α = .902. A summary
of these item statistics and the traditional measures of proportion correct, mean response
time regardless of accuracy, and mean response times for correct items only, are
provided in Appendix D.3 (Table D.3).
Person Fit: An analysis of person fit was also considered by deriving both infit and
outfit statistics for each subjects’ response pattern. Figure 6.3 plots the distribution of
infit and outfit scores as a function of calibrated ability calculated using all 48 items.
Recall that spikes outside the fit region indicate misfit. Figure 6.4 plots the same
information using only the 24 Swaps test items from level 3 and level 4. While there is
some considerable deviation in the outfit statistics for each measure, particularly near
the extremes of the ability continuum, virtually no one violated the 1.00 ± 0.30 criterion
for infit scores (All items: 0.0% above 1.3, 1% below .7; level 3/4 items: 0.0% above
1.3, 4% below 0.7). The measure based on only level 3 and level 4 items is much less
influenced by deviation in the outfit statistics in general. Misfit tends to occur for the
more able individuals.
Figure 6.3. Person infit and outfit statistics (mean square fit values) for ability calibrated on all Swaps test items, with persons ordered by estimated ability and the fit region (1.00 ± 0.30) indicated.
Figure 6.4. Person infit and outfit statistics for ability calibrated on level 3 and level 4 items of the Swaps test, with persons ordered by estimated ability and the fit region (1.00 ± 0.30) indicated.
The Swaps Test was used and refined by Stankov and Crawford (1993; see also,
Stankov, 2000) to study manipulations of cognitive complexity. Increasing the number
of swaps to be carried out served as the complexity manipulation. It is important to
note from the outset that we observed strong anecdotal evidence of a common solution
strategy during testing that might bring this manipulation into question. Some subjects
were observed using the mouse to short-cut the processing load during intermediate
swaps. Recall, that all possible orders of the three letters are displayed on the screen as
response options during solution (see Figure 6.1). A number of subjects told the
experimenter that they would do the first swap in their head, and then move the mouse
over the correct response button at that point in the processing. They would then use
this as the cue for making the second swap, and so on. As a result of this strategy, the
number of swaps that have to be maintained simultaneously in working memory can be
reduced to just one swap at a time. We are not certain how such a strategy might
influence the test as a measure of fluid intelligence. It might be the case that the Swaps
test is a reliable measure of Gf for somewhat unexpected reasons. That is, the test might
be more sensitive to those subjects who picked up on this particular solution strategy
and therefore were more capable of dealing with the novelty of the task, rather than the
amount of information that has to be integrated per se. In any case given the test serves
its role as a marker of Gf (as we will demonstrate in Chapter 7), a closer analysis of
strategies will not be pursued in detail at this stage. We will however consider the “comparison of means” evidence for a manipulation of cognitive complexity in the following section.
6.4.3.1 Analysis of difficulty levels in the Swaps test.
As for the triplet numbers test there are a couple of measures that can be used. Since an
equal number of items are presented to each subject and items of different levels of
complexity are presented randomly throughout the test, without further analyses, a
simple composite score is likely to be appropriate where it was not in the triplet
numbers test (a measure of the number of items answered correctly per minute might
actually disadvantage students who did not use the “mouse cue” to step their way
through the more difficult swaps items). We consider the following measures:
proportion correct, response time, and correct response time. The descriptive statistics
and correlations between levels of difficulty based on these measures are summarised in
Table 6.6. To explore the influence of difficulty level (number of swaps) on these
measures a series of one-way ANOVAs was conducted. In each case the main effect
for difficulty level was significant and Helmert contrasts were consistent with
predictions. That is, lower level items resulted in significantly more accurate
performance and quicker response times than higher-level items that entail more swaps.
Table 6.7 summarises these results. An investigation of the inter-correlations between
measures of accuracy and response time summarised in Table 6.6 suggests that there is
increasing evidence for a speed-accuracy trade-off as the number of swaps required
increases. The correlation between proportion correct and mean response time for one
swap is not significant; r(166) = .09, p = .23. When four swaps are required the
correlation between accuracy and response time is r(166) = .42, p < .001. The speed-accuracy trade-off correlations for the other levels lie between these values (2-swaps,
r(166) = .30, p < .001; 3-swaps r(166) = .39, p < .001). As for the triplet numbers test,
there is also evidence for a simplex pattern of correlation between levels of complexity
for each measure.
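The level-by-level speed-accuracy correlations quoted above are ordinary Pearson correlations between per-subject accuracy and mean response time at each level. A minimal illustrative sketch follows (Python; the data are invented and the function name is ours).

import numpy as np

def speed_accuracy_by_level(accuracy, response_time):
    # Pearson r between per-subject accuracy and mean response time at each
    # level; increasingly positive values with more swaps would suggest a
    # growing speed-accuracy trade-off, as observed in Table 6.6.
    return {level: float(np.corrcoef(accuracy[level], response_time[level])[0, 1])
            for level in accuracy}

accuracy = {1: [0.9, 1.0, 0.8, 1.0, 0.9], 4: [0.4, 0.6, 0.7, 0.9, 1.0]}            # invented
response_time = {1: [5.5, 6.0, 6.3, 5.8, 6.1], 4: [20.0, 24.0, 26.5, 29.0, 31.0]}  # invented
print(speed_accuracy_by_level(accuracy, response_time))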
Table 6.6
Descriptive statistics and correlations between levels of difficulty on the Swaps test

Measure                              level  Mean (SD)      %c1    %c2    %c3    %c4    rt1    rt2    rt3    rt4    crt1   crt2   crt3
Proportion of items answered         %c1     0.93 (0.14)
correctly                            %c2     0.87 (0.19)    0.65
                                     %c3     0.79 (0.23)    0.47   0.77
                                     %c4     0.73 (0.26)    0.53   0.77   0.81
Mean response time per item          rt1     5.98 (1.82)    0.09   0.02  -0.07  -0.04
                                     rt2    12.05 (3.45)    0.30   0.30   0.15   0.09   0.61
                                     rt3    18.85 (6.33)    0.30   0.41   0.39   0.29   0.40   0.61
                                     rt4    25.37 (9.65)    0.39   0.50   0.45   0.42   0.36   0.56   0.81
Mean response time for items         crt1    6.06 (1.87)   -0.02  -0.05  -0.12  -0.10   0.96   0.57   0.36   0.30
answered correctly                   crt2   12.83 (5.96)    0.11  -0.18  -0.23  -0.21   0.44   0.59   0.21   0.19   0.46
                                     crt3   19.44 (6.42)    0.19   0.24   0.25   0.12   0.31   0.52   0.91   0.71   0.32   0.49
                                     crt4   26.01 (8.76)    0.26   0.45   0.39   0.29   0.33   0.52   0.70   0.88   0.30   0.11   0.67
Values greater than r = .15 are significant at p < .05; greater than r = .25, p < .001
Table 6.7
ANOVA results for measures from the Swaps test and proportion of experimental effect (η2)

Proportion of items answered correctly: omnibus F = 77.04**, η2 = .32
  Level 1 vs. Later: F = 99.20**, η2 = .38; Level 2 vs. Later: F = 100.27**, η2 = .38; Level 3 vs. Level 4: F = 24.75**, η2 = .13
Mean response time per item: omnibus F = 601.58**, η2 = .79
  Level 1 vs. Later: F = 1107.96**, η2 = .87; Level 2 vs. Later: F = 483.76**, η2 = .75; Level 3 vs. Level 4: F = 207.99**, η2 = .56
Mean response time per item answered correctly: omnibus F = 605.16**, η2 = .79
  Level 1 vs. Later: F = 1289.50**, η2 = .89; Level 2 vs. Later: F = 474.91**, η2 = .75; Level 3 vs. Level 4: F = 179.15**, η2 = .53

Omnibus df = 3, 483; contrast df = 1, 161; ** p < .001; η2 = partial η2
6.5 Markers of Crystallized Intelligence
6.5.1 Vocabulary (Synonyms)
Response times and accuracy on the 36-item vocabulary test were recorded for the 177
students who attempted it. All subjects answered item 22 correctly and therefore its
location could not be estimated using the Rasch approach. The initial analysis is
therefore based on a test of 35 items. Although person separation was good (.701), the
χ2 test of fit based on the item × trait interaction (the global fit statistic based on four
trait groups) generally suggested a poor fit, χ2(105) = 147.87, p = .004. A closer
examination of individual item fit using infit and outfit statistics reported in Appendix
D.4 (Table D.5) indicate that 2 items (10 and 15) have outfit statistics greater than 1.3,
and 5 items (2, 18, 21, 24, and 36) have outfit values less than .70. When the infit
statistics are considered, Item 15 is still extreme but this time its value is less than .70
(Item 15 infit = .68). Removal of items 10 and 15 from the analysis produces a slightly
better global fit to the Rasch model (χ2(99) = 119.89, p = .075, person separation =
.722). With the modified item base, none of the items had infit values outside of the
1.00 ± 0.30 range (see Table D.5 in Appendix D.4).
Figure 6.5. Distribution of fit statistics as a function of calibrated ability for the 35-item Vocabulary test (outfit values for some subjects exceeded the limits of the plot; fit region 1.00 ± 0.30 indicated).
Person Fit: Consistent with previous Rasch analyses, we plot the infit and outfit values
for each person as a function of the estimated ability. Figure 6.5 shows the distribution of
infit and outfit statistics as a function of estimated ability based on the 35-item test (i.e.,
with Item 22 removed). Considering the infit statistics values, 19% of the subjects had
poor fitting response patterns (11% over 1.30; and 8% below .70). Figure 6.6 shows the
same information but this time when ability is calibrated without Items 10 and 15. For
this analysis, 22% had infit values suggestive of a poor fitting response pattern (12% above 1.30 and 10% below 0.70). Given there is no substantial difference in the
characteristics of person fit, the choice for the ability estimate was based on the global
fit and item fit statistics. As such the calibrated ability estimate from the 33-item test
(i.e., with Items 10, 15 and 22 removed) is used as the measure to be included in the
correlational analyses to follow44.
44
A standard reliability analysis of the vocabulary test considering item-total correlations for the full 36-item scale supported the use of the 33-item test. The item-total correlations for items 10 and 15 were the lowest
in the set (r = .002 and –.166, respectively). Cronbach α for the 33-item test was .721.
Figure 6.6. Distribution of fit statistics as a function of calibrated ability for the 33-item Vocabulary test (items 10 and 15 removed; fit region 1.00 ± 0.30 indicated).
6.5.2 Similarities
The Similarities test is normally individually administered; however, it has been modified here for presentation as a group-administered paper and pencil test. As a
result, standard administration features such as the ability to question subjects for
borderline responses are not possible. The scoring of the written responses followed the
WAIS-III guidelines. Where subjects wrote more than one response, the best response
was scored. The benefit of the doubt was given for borderline responses and these were
therefore scored at the higher level. There are two types of items on this test. Responses
to the first four items are scored either 0 or 1. The remaining items can have a score of
0, 1, or 2, depending on the quality of the response. There was no discontinue rule used
and therefore subjects were graded on all 18 items. The appropriate Rasch analysis used
the extended response category procedure rather than the dichotomous multi-choice
option that was used to analyse the other tests (Sheridan et al., 1998). Of the 191
participants in this study, 160 attempted the Similarities test. The χ2 test of the item ×
trait interaction as a measure of global fit suggested an appropriate fit to the Rasch
model, χ2(54) = 32.23, p = .504; although person separation was more moderate (person
separation index = .583). An analysis of the item infit and outfit statistics was also
consistent with an appropriate fit to the Rasch model. There were no items with high
outfit scores above 1.30, and only three items (3, 5, and 8) with low outfit scores, indicating a tendency for response patterns that are too consistent. No
items exceeded the 1.00 ± 0.30 criterion for infit. The item statistics are reported in
Appendix D.5 (Table D.6).
Person Fit: Person infit and outfit statistics are reported in Figure 6.7 as a function of
calibrated ability. The range of person infit and outfit values for this test tends to be
more varied than in the other tests. Only 51% of the sample have infit statistics within
the 1.00 ± 0.30 criterion. In total, 21% of the sample have infit values exceeding 1.30
suggesting a somewhat aberrant response pattern. This is a large proportion of the
sample and the results are not isolated to the ends of the ability continuum (where a
smaller number of respondents might contribute to the weaker measurement). A
traditional reliability analysis gives some weight to the inconsistency in person fit
identified both in the plot of person-fit statistics and the moderate person separation
index (i.e., Cronbach α = .63). In general this indicates a potential weakness in the stability of the similarities measure and we will keep this in mind in deriving factor scores in Chapter 7.
Figure 6.7. Distribution of fit statistics as a function of calibrated ability for the paper and pencil Similarities test (fit region 1.00 ± 0.30 indicated).
6.5.3 Arithmetic Reasoning
The arithmetic reasoning test has a multiple choice response and scoring is dichotomous
– correct or incorrect. The data from the 159 subjects who attempted this test was
submitted to a Rasch analysis. The χ2 test of the item × trait interaction using four trait
groups suggests a satisfactory fit to the Rasch model, χ2(45) = 56.46, p = .118 (the separation index of .506 was moderate to low). An investigation of item fit statistics
indicated that all items meet the 1.00 ± 0.30 criterion for both outfit and infit. The Rasch
statistics and the traditional proportion correct values are provided in Table D.2 in
Appendix D.2. The relatively low person separation index suggests that an analysis of
person fit might indicate some lack of fit and indeed this is what was observed. Figure
6.8 plots the person infit and outfit values as a function of calibrated ability. Once again,
spikes outside the fit region indicate a degree of misfit. Of the sample, 15.7% had infit
statistics suggestive of an aberrant item response pattern (i.e., infit > 1.30). A further
9.8% of the sample had an item response pattern that tended to be too consistent (i.e.,
infit < 0.70). The test performed substantially better in this regard than the Similarities test; however, traditional measures of reliability were moderate to low (Cronbach α = .617). This corroborates an interpretation that the test has some difficulty in establishing a stable unidimensional measure of arithmetic reasoning ability.
Figure 6.8. Distribution of fit statistics as a function of calibrated ability for the paper and pencil Arithmetic Reasoning Test (fit region 1.00 ± 0.30 indicated).
6.6 Markers of Short-term Apprehension and Retrieval (SAR)
6.6.1 Digit Span Forward
As was mentioned earlier, an error in the administration of the digit span computer
program meant that the digit span forward task was only administered to 130 subjects.
The χ2 test of the item × trait interaction as a measure of global fit to the Rasch model
was poor, χ2(48) = 104.44, p < .001, although person separation was moderate
(separation = .690). An analysis of the item infit and outfit statistics indicated
substantial misfit as identified by the outfit statistics: 4 items (2, 4, 6, and 7) had high
values and 6 items (1, 12-16) had low outfit values. As we indicated in Chapter 4, outfit
values tend to be sensitive to extreme outliers (i.e., individuals) and the calculation of
infit statistics tends to overcome this problem. The most extreme outfit value is
associated with Item 2 and warrants a closer inspection since this item is also implicated
in a very extreme person outfit value for subject 454. This student recalled the correct
order for every item except Item 2 in the digit span forward task. Item 2 required
subjects to recall the digits 6, 3 in the correct order, and is an item from the easiest level
(calibrated difficulty = -2.44 logits). Subject 454’s failure to recall this item given she
had answered every other item correctly resulted in a person × item standardised
residual of 31.69. When this subject is removed and the Rasch analysis repeated, the
outfit for Item 2 is substantially reduced. This clearly indicates the sensitivity of the
outfit statistics to aberrant responses from one or two individuals. Person infit statistics
are weighted by the variation in the items and are typically less sensitive to the factors
that tend to influence outfit statistics (see Chapter 4 for a review of this). The infit value
for Item 2 based on the full sample was slightly below the lower bound cut-off of 0.70
(Item 2: Infit = 0.65). This is the only item that has an infit score outside of the 1.00 ±
0.30 criterion.
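The standardised person-by-item residual mentioned above follows directly from the Rasch model probability. The sketch below (illustrative Python; the ability value used is an invented stand-in for a near-perfect respondent, not Subject 454's actual estimate) shows how failing a very easy item produces a residual of roughly the magnitude reported.

import math

def standardised_residual(x, theta, b):
    # (observed - expected) / sqrt(expected variance) under the dichotomous Rasch model
    p = 1.0 / (1.0 + math.exp(-(theta - b)))
    return (x - p) / math.sqrt(p * (1.0 - p))

# an incorrect response (x = 0) to an easy item (difficulty -2.44 logits) from a
# respondent with a very high (illustrative) ability estimate of 4.5 logits
print(standardised_residual(0, 4.5, -2.44))   # a large negative value, around -32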
Table 6.8
Descriptive and Rasch Statistics for Digit Span Forward test items

Item  List Length  p-value  logit  SE    outfit  infit
1     2            0.98    -3.75   0.69  0.19    0.84
2     2            0.97    -2.44   0.40  8.26    0.65
3     3            0.94    -1.89   0.33  0.75    0.80
4     3            0.95    -1.95   0.34  1.73    0.71
5     4            0.95    -2.11   0.36  0.91    0.72
6     4            0.93    -1.88   0.33  1.33    0.85
7     5            0.85    -0.97   0.26  1.79    0.95
8     5            0.85    -1.11   0.26  1.14    0.87
9     6            0.58     0.60   0.21  1.29    0.94
10    6            0.61     0.36   0.21  0.90    0.96
11    7            0.45     1.22   0.21  0.85    0.93
12    7            0.39     1.52   0.21  0.68    0.75
13    8            0.22     2.63   0.24  0.66    0.88
14    8            0.22     2.55   0.23  0.65    0.77
15    9            0.13     3.43   0.28  0.57    0.75
16    9            0.11     3.79   0.31  0.43    0.80
Cronbach α = .6765
Item Difficulty and List Length: The items of the digit span task have a clear theoretical
structure that can be empirically tested. Increasing the number of digits to be recalled is
expected to increase the difficulty of the recall task. This was corroborated in the
calibrated item difficulty values shown in Table 6.8. There is a strong correlation
between the number of digits to be recalled and item difficulty as measured by
proportion of subjects who answered it correctly (i.e., p-value), r(14) = -.96, p < .001,
and calibrated item difficulty from the Rasch analysis (i.e., logit), r(14) = .98, p < .001.
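The correspondence between list length and item difficulty can be checked directly from the values in Table 6.8. The sketch below (illustrative Python) computes the two correlations reported above from those tabled values.

import numpy as np

list_length = [2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9]
p_value = [0.98, 0.97, 0.94, 0.95, 0.95, 0.93, 0.85, 0.85,
           0.58, 0.61, 0.45, 0.39, 0.22, 0.22, 0.13, 0.11]
logit = [-3.75, -2.44, -1.89, -1.95, -2.11, -1.88, -0.97, -1.11,
         0.60, 0.36, 1.22, 1.52, 2.63, 2.55, 3.43, 3.79]

# longer lists are harder: strongly negative with proportion correct,
# strongly positive with the Rasch difficulty estimate
print(np.corrcoef(list_length, p_value)[0, 1])
print(np.corrcoef(list_length, logit)[0, 1])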
Person Fit: The aberrant response of Subject 454 that was identified in the assessment
of item fit also influenced her person fit statistics. Once again, this was moderated in the
derivation of person infit (outfit = 62.86; infit = 1.56). There is a considerable
difference in the nature of the person infit and outfit statistics in this task and so as to
avoid confusion each is plotted separately in a slightly different format from what we have
used for the other tasks. The nature of this difference is interesting and is worthy of
comment to highlight the utility of the Rasch approach. As can be seen from the plot of
outfit statistics in Figure 6.9, aberrant misfit (> 1.30) tends to increase as the ability of
the person increases. This might suggest that the more able individuals, who will by
definition answer more items correctly, are likely to have answered some of the
relatively easier items incorrectly. A tendency for a response pattern that is too consistent (outfit < 0.70) is apparent throughout the calibrated ability continuum.
Figure 6.9. Distribution of outfit statistics as a function of calibrated ability for the digit span – forward task (the arrow indicates an extreme value of 62.86 beyond the plotted range).
The
plot of person infit in Figure 6.10 also suggests that aberrant responses (infit > 1.3) are
more likely for high ability individuals and that a substantial proportion of the sample
had infit values suggestive of a response pattern that is too consistent (infit < .70). An
unqualified interpretation of these results might cause us to be cautious of the task's
utility. In fact this finding is consistent with intuitive expectations given the nature of
the task. Memory span is an interesting construct: performance tends to decline gradually as list length increases, up to a point after which failure is almost certain. For instance,
Subject 327 (outfit = 0.21, infit = 0.30) answered the first 8 items correctly and all
subsequent items incorrectly. That is, the subject was able to answer all items up to a
span of 4 digits and then was unable to recall a list of digits of any greater length
correctly. Hence although his response pattern (i.e., 1111111100000000) seems
reasonable given our theoretical understanding of the construct under investigation, it is deemed too consistent by the Rasch approach and hence the low person fit scores45.
Figure 6.10. Distribution of person infit values as a function of estimated ability in the digit span – forward task.
We
have already considered an example of a subject with a high infit score (Subject 454).
Let us consider Subject 559 as another example of an aberrant response pattern (Subject
559: outfit = 5.19, infit = 2.49). This subject’s response pattern was as follows:
1100111111010011. She answered some easy items incorrectly and some hard items
correctly, and overall the pattern suggests that the subject’s memory span has not been
well measured. The high infit score is therefore quite reasonable.
This brief example reiterates the utility of the Rasch approach when used in conjunction
with a sound understanding of the theoretical construct. It provides a context in which
to assess not only the nature of the items in the test, but also the fit to the Rasch model.
Hence, although 56.1% of the sample have infit statistics outside the 1.00 ± 0.30 range,
40.8% are like subject 327 whose response pattern appears too consistent. This
consistency tends to masquerade as a lack of variability to the Rasch model (Linacre &
Wright, 1994). This is a problem for the mathematics since the model is designed to use
normal variability (error) in responses to calibrate item and person differences. Fisher
(1992) states that “…The size of the stochastic element in the response patterns
provides a shared quantitative frame of reference that permits estimation of differences
45
Items in the response pattern are ordered in theoretical difficulty from 1 to 16
among item difficulties and person abilities. Thus the unavoidable noise in the data
contributes information essential to construct quantitative comparisons.” As such we
are content to use the calibrated ability estimate as a rough linearised measure of the
individual's forward memory span in preference to the raw scores generated from the
test.
6.6.2 Digit Span Backward
The digit-span backward task was attempted by 173 subjects; however, no subjects
recalled the list of digits correctly for item 16 and therefore this item is excluded from
the Rasch analyses. The task has a very similar pattern of item properties as the forward
version although the global fit as measured by the χ2 test of the item × trait interaction
(with four trait groups) was much better, χ2(45) = 55.45, p = .137; person separation =
.704. An assessment of item outfit produced results similar to the forward version. Two
items (3 and 4) had outfit values above 1.30 and four items (1, 2, 13, and 15) had outfit
values less than 0.70. When the infit statistics are considered, no items exceeded the
1.00 ± 0.30 criterion. The results of this analysis and the traditional proportion correct
statistics are reported in Table 6.9.
Table 6.9
Descriptive and Rasch statistics for digit span backward items

Item  List Length  p-value  logit  SE    outfit  infit
1     3            0.98    -3.90   0.52  0.55    0.74
2     3            0.98    -3.81   0.50  0.48    0.76
3     4            0.91    -2.12   0.27  1.79    0.82
4     4            0.96    -3.11   0.38  3.62    0.71
5     5            0.76    -0.72   0.20  1.01    1.02
6     5            0.83    -1.25   0.22  0.99    0.94
7     6            0.58     0.21   0.18  1.28    1.12
8     6            0.55     0.32   0.18  0.80    0.84
9     7            0.38     1.26   0.18  0.71    0.82
10    7            0.49     0.64   0.18  0.84    0.92
11    8            0.27     1.87   0.20  0.73    0.81
12    8            0.34     1.46   0.19  0.77    0.90
13    9            0.17     2.60   0.22  0.69    0.81
14    9            0.17     2.55   0.22  0.77    0.84
15    10           0.06     4.01   0.33  0.49    0.94
16    10           0.00     –      –     –       –
Cronbach α = .7214 (with item 16 removed due to zero variance)
Item Difficulty and List Length: The relationship between the empirical difficulty of the
items and that predicted as a function of list length was as expected (proportion correct:
r(14) = -.99, p < .001; Rasch calibrated difficulty, r(14) = .99, p <.001).
Person Fit: The analysis of person fit was also similar to the digit span forward task and
we use the same method to represent the distribution of scores as for that task. There
was substantial deviation from the 1.00 ± 0.30 criterion for outfit (Figure 6.11) although
this was moderated by the infit calculations (Figure 6.12). Again, as for the forward
version, the misfitting response patterns for the digit span backward version were
identified because of a response pattern that was too consistent. Of the total sample
84.3% of persons had infit statistics below the upper limit of 1.30 indicative of an
aberrant response pattern. A substantial proportion of the sample (38.4%) had infit
values below 0.70. As we argued above, this is an expected consequence of the nature
of the task and as such we are content to use the calibrated ability estimates for the
correlational analyses in Chapter 7 in preference to the raw total correct score.
Figure 6.11. Distribution of person outfit statistics as a function of calibrated ability in the digit-span backwards test (several extreme values fall beyond the plotted range; fit region 1.00 ± 0.30 indicated).
Figure 6.12. Distribution of person infit statistics as a function of calibrated ability in the digit-span backwards test (fit region 1.00 ± 0.30 indicated).
6.6.3 Paired Associate Recall
The paired associates recall test from the WMS-III was administered to 177 subjects
and can be scored in two ways. The first is to look at the number of correct items on the
first trial. The second is to look at performance over all the trials – that is, the number of
correct items across the 4 presentations of the word pairs. Given the test has been
modified for computer administration we consider a variety of analyses in an attempt to
give us some indication of the performance of the test in this situation. Five measures
were derived as follows: (i) number of correct matches on the first trial, (ii) number of
correct matches on the second trial, (iii) number of correct matches on the third trial, (iv)
number of correct matches on the fourth trial, and (v) total number of correct matches
over all four trials. The descriptive statistics and inter-correlations between these
measures are provided in Table 6.10. Trial scores have a possible range of 0-8. The total
score has a range of 0-32.
Table 6.10
Mean number of recalled matches (and standard deviations), inter-correlations between measures, and Rasch statistics on the paired-associate recall test

                         Correlations                           Rasch Statistics
Trial    Mean (SD)       Trial 1  Trial 2  Trial 3  Trial 4     logit   SE    outfit  infit
Trial 1  4.59 (2.64)                                             1.72   0.07  0.67    0.60
Trial 2  6.51 (2.02)     0.75                                    0.15   0.08  0.42    0.44
Trial 3  7.20 (1.47)     0.53     0.73                          -0.63   0.09  0.79    0.57
Trial 4  7.51 (1.15)     0.38     0.63     0.73                 -1.25   0.11  1.76    0.80
Total*   25.81 (6.23)    0.86     0.93     0.83     0.72          –      –     –       –

* Total is the sum of the recalled items across the four presentations/trials; all correlations are significant at p < .001
If we consider each trial as an item, there is indication of consistency in measurement
from trial to trial. A simplex pattern is observed in the correlation matrix such that trials
that are close together in time are more highly correlated. A reliability analysis on the
four items corroborates some consistency in measurement, Cronbach α = .83, although
this value is likely to be accentuated somewhat given each subsequent trial is another
presentation of the same items in a different random order (and therefore may constitute
some level of collinearity).
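Cronbach's α for the four trials treated as items follows the usual formula based on item and total-score variances. The sketch below is illustrative Python only (the data matrix is invented), not the analysis actually run for this chapter.

import numpy as np

def cronbach_alpha(scores):
    # scores: subjects x items matrix; here the four recall trials act as "items"
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1).sum()
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1.0)) * (1.0 - item_variances / total_variance)

# invented 5-subject x 4-trial matrix of recall scores (0-8 per trial)
demo = [[3, 5, 6, 7],
        [5, 7, 8, 8],
        [2, 4, 6, 7],
        [6, 8, 8, 8],
        [4, 6, 7, 8]]
print(cronbach_alpha(demo))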
A Rasch analysis was also conducted on the paired associate recall test using the 4 trials
as items and the extended category option (score 0-8). The χ2 test of the item × trait
interaction indicated an appropriate fit to the Rasch model, χ2(12) = 13.17, p = .357;
person separation = .818. Inspection of item infit and outfit reported in Table 6.10
suggests a poor fit, although the extent to which this is due to the small item base is
uncertain. The analysis of person fit was also conducted and tended to follow the
general pattern observed with the other memory tests. Approximately 5.6% of the
sample had infit statistics greater than 1.30 suggesting some difficulties in measurement
for these individuals (Figure 6.13). Approximately 83.3% of the sample had infit
statistics below the .70 cut-off, indicative of an item response pattern that is too
consistent. Once again, given the small number of effective items (i.e., 4 – one for each
presentation of the 8 stimuli pairs), the stability of the person fit statistics is unknown.
It is not altogether clear just what measure is likely to provide the best estimate of
memory to define SAR in the correlational analyses. As it turns out, the question is
somewhat moot as performance on the paired associate recall task does not seem to be
related to the other memory tests, or for that matter, to any of the other psychometric
tests. We will discuss this in more detail in Chapter 7.
Figure 6.13. Distribution of person fit statistics as a function of calibrated ability on the Paired-Associate Recall test (arrows indicate points beyond the plotted range; fit region 1.00 ± 0.30 indicated).
6.7 Relational Complexity Tests
6.7.1 Sentence Comprehension Task
Centre-embedded sentences impose simultaneous demand in the Halford et al. (1998a)
sense since the assignment of the nouns to the thematic roles of the verbs entails
processing relations between nouns and verbs in the sentence (Andrews & Halford, in
press). The more information that needs to be integrated in order to determine the noun-verb roles, the greater is the relational complexity. That is, in centre-embedded sentences
such as, “The duck that the monkey touched walked”, no assignment of roles is possible
until the verb touched is encountered. To make sense of the sentence, monkey is
assigned to the agent role, and duck must be assigned to the patient role. In this case, the
assignment needs to be done in the same decision. We then know that the monkey
touched the duck, and that the duck was touched by the monkey. The sentence is further
complicated since the second verb, walked, needs to be considered as well. Andrews
and Halford (in press) argue that the close proximity of the two verbs means that a
temporal overlap in processing is likely, further increasing the demands of processing
resources. The relational complexity analysis of this sentence is represented by Halford
et al. (1998a, Section 6.1.4) as a ternary relation using propositional format as follows:
TOUCH(monkey, duck) and WALK(duck), since three roles need to be completed46.
The structure of right-branching sentences means that assignment of noun-verb roles
can be determined in series as the sentence is read. A right-branching example of the
previous sentence would be: “The monkey touched the duck that walked” (Andrews &
Halford, in press). To assess comprehension, subjects in the current study are required
to answer a question related to the sentence they have just read. Other measures that
have been used to assess the influence of relational complexity on sentence
comprehension include subjective ratings of sentence difficulty and the time to respond to
the probe question – that is, decision time (Andrews, 1997).
The structure of item presentation in the Sentence Comprehension Task means that the
task is less amenable to a Rasch analysis (although recent developments in Item
Response Theory mean that these types of analyses are coming within the reach of
available technology, Andrich, 1999). The standard administration that was adopted
from Andrews and Halford (in press) does not present all subjects with the same items
or the same probe question to each item. We are therefore constrained to using a score
that is aggregated across a number of items. We begin the analysis of the sentence
comprehension task by considering the influence of the probe question.
Probe Question Type: As outlined in Table 6.3, each sentence type has a number of
probe questions that can be asked to assess comprehension. It is therefore necessary to
46
We present the complexity analysis of the Sentence Comprehension Task as reported by Halford et al.
(1998a) and Andrews and Halford (in press) without further evaluation or comment.
consider the potential differential influence of these questions. To assist in this,
questions were categorised into those requiring a noun response and those requiring a
verb response consistent with the classification in Table 6.3. Two types of aggregated
scores were then derived across sentence-type (CE, RB), response-type (Noun, Verb),
and relational complexity (2D, 3D, 4D, and 5D). The first is the proportion of items
answered correctly within the grouping, and the second is the mean decision time. Of
the 8 possible questions that can be presented at each level of complexity for each
sentence type (CE, RB), 6 are selected at random by the computer. The noun/verb
response types that are associated with these questions for a given individual is also
determined randomly (without replacement), but the distribution is also partly
influenced by the number of noun-verb roles that are available at each level of
complexity (as outlined in Table 6.3). As a result, subjects receive a different number of
items entailing a particular question type. The mean number of noun and verb response
items attempted by each subject is described in Table 6.11 as a function of relational
complexity and sentence-type.
Table 6.11
Mean number of items per subject requiring a noun and verb response (question type) as a function of relational complexity and sentence type (CE/RB)

                     Noun                         Verb
RC                   Min  Max  Mean (SD)          Min  Max  Mean (SD)
Centre-embedded
  2a                 6    6    6.00 (0.00)        –    –    –
  3                  3    4    3.55 (0.50)        2    3    2.45 (0.50)
  4b                 4    4    4.00 (0.00)        2    2    2.00 (0.00)
  5                  3    5    3.73 (0.63)        1    3    2.27 (0.63)
Right-branching
  2a                 6    6    6.00 (0.00)        –    –    –
  3                  3    4    3.60 (0.49)        2    3    2.40 (0.49)
  4b                 4    4    4.00 (0.00)        2    2    2.00 (0.00)
  5                  3    5    3.74 (0.65)        1    3    2.26 (0.65)

a There are no verb responses for level 2 sentences; b There are 6 question types for 4D sentences and therefore each subject will receive all types; RC = Relational Complexity; SD in parentheses
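For clarity, the aggregation into the cell scores described at the start of this section can be sketched as follows (illustrative Python; the record fields and function name are our own labels rather than the format of the experimental software).

from collections import defaultdict

def aggregate_cells(trials):
    # Group trial records by sentence-type x response-type x relational complexity,
    # then return (proportion correct, mean decision time in seconds) per cell.
    cells = defaultdict(list)
    for t in trials:
        cells[(t["sentence_type"], t["response_type"], t["rc"])].append(t)
    return {key: (sum(t["correct"] for t in group) / len(group),
                  sum(t["decision_time"] for t in group) / len(group))
            for key, group in cells.items()}

example = [{"sentence_type": "CE", "response_type": "Noun", "rc": 3, "correct": 1, "decision_time": 3.2},
           {"sentence_type": "CE", "response_type": "Noun", "rc": 3, "correct": 0, "decision_time": 4.8}]
print(aggregate_cells(example))   # {('CE', 'Noun', 3): (0.5, 4.0)}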
6.7.1.1 Accuracy as a function of question-type.
A sentence-type (CE, RB) × question-type (Noun, Verb) × Relational complexity (3D,
4D, 5D) repeated-measures ANOVA was conducted on the aggregated item accuracy
measure for the 167 subjects who attempted the task47. The summary of the descriptive
statistics on which these analyses are based is presented in Table 6.12. There was a
significant 3-way interaction between the factors, F(2, 332) = 4.07, MSe = .04, p = .02.
This interaction is plotted in Figure 6.14. Note that the 2D responses were not included
in the omnibus analyses, since there are no verb responses for these items, but the
means for the noun-responses are plotted in Figure 6.14. As expected, right-branching
items were easier overall than centre-embedded items, F(1, 166) = 208.27, MSe = .07, p
< .001. Since the relational complexity analysis is only appropriate for centre-embedded
sentences the remaining analyses focus on the simple-effects associated with these
items.
Table 6.12
Descriptive statistics: A. Mean proportion correct and B. Mean decision time to respond to probe question as a function of relational complexity (RC), sentence-type, and question-type (noun or verb)

                     A. Mean Proportion Correct       B. Mean Decision Time (sec)
RC                   Noun           Verb              Noun           Verb
Centre-Embedded
  2D                 0.83 (0.17)    –                 3.84 (1.39)    –
  3D                 0.86 (0.22)    0.84 (0.27)       3.85 (1.87)    3.77 (1.94)
  4D                 0.70 (0.25)    0.83 (0.27)       5.73 (2.56)    4.30 (2.30)
  5D                 0.55 (0.29)    0.57 (0.39)       5.50 (2.90)    5.81 (4.53)
Right-Branching
  2D                 0.89 (0.14)    –                 3.68 (1.43)    –
  3D                 0.94 (0.14)    0.95 (0.15)       2.95 (1.01)    2.91 (1.21)
  4D                 0.86 (0.20)    0.93 (0.17)       4.36 (1.92)    3.66 (2.07)
  5D                 0.80 (0.23)    0.87 (0.23)       4.51 (1.94)    4.00 (2.01)

2D, 3D, 4D, 5D = binary, ternary, quaternary, and quinary respectively
47
-
The sphericity assumption was violated for all relevant omnibus effects except the relational
complexity by question type interaction. Greenhouse-Geisser corrections did not change the
interpretation of significance and therefore unadjusted F statistics are reported.
Figure 6.14. Mean proportion correct (accuracy) as a function of relational complexity (2D-5D), sentence type (CE, RB), and probe question-type (Noun, Verb).
Simple-Effects of RC at Noun: The simple-effects of relational complexity (including
binary items) for questions entailing a noun response was significant, F(3, 498) = 73.21,
MSe = .044, p < .001. Simple comparisons of this effect indicated that there was no
significant difference between accuracy on binary and ternary items, F(1, 166) = 2.66,
MSe = .058, p = .11. All other comparisons were significant at p < .001. Ternary items
were significantly easier than quaternary items, F(1, 166) = 36.56, MSe = .075, p <
.001, and quaternary items were significantly easier than items classified as quinary,
F(1, 166) = 140.03, MSe = .089, p < .001.
Simple-effects of RC at Verb: An analysis of the simple-effects of relational complexity
for the items entailing a verb response was also significant, F(2, 332) = 50.17, MSe =
.079, p < .001. Simple comparisons indicated no significant difference between the
ternary and quaternary items, F < 1, although the mean proportion of items correct
entailing a verb response was significantly higher for quaternary items than for quinary
items, F(1, 166) = 63.61, MSe = .195, p < .001.
Simple-effects of Question-type: It is also useful to consider the series of simple-effects
of question-type at each level of complexity even though these describe the same
interaction and are therefore technically redundant. These analyses indicated that for
ternary and quinary items, there were no differences in accuracy as a function of
question type, F < 1 in both cases. Accuracy on quaternary items was better when a
verb response was required than when a noun response was required, F(1, 166) =
27.53, MSe = .053, p < .001.
6.7.1.2 Decision time as a function of question-type48.
Similar sets of analyses were repeated for the decision time measure49. Recall that this
measure is the time from when the probe question is displayed (after the subject
indicates they have read the sentence) to when the desired response option is selected.
The relational complexity (3D, 4D, 5D) × sentence-type (CE, RB) × question-type
(Noun, Verb) interaction was again significant, F(2, 332) = 7.19, MSe = 3.44, p = .001.
A plot of this interaction is provided in Figure 6.15 and includes the response to the
binary items (which were all noun responses). As can be seen from this figure, decision
time for centre-embedded items was reliably longer than for right-branching items (i.e.,
the main effect for sentence-type was significant, F(1, 166) = 112.06, MSe = 5.34, p <
.001). Again the follow-up analyses focused on the centre-embedded items since this is
where the relational complexity load is generated.
Figure 6.15. Mean decision time as a function of relational complexity, sentence type, and probe
question-type. [Line graph: x-axis = relational complexity classification (2D, 3D, 4D, 5D);
y-axis = decision time (sec, 2.00–6.00); separate lines for Noun (CE), Verb (CE), Noun (RB),
and Verb (RB).]
48 Analyses of decision time with a natural log transformation resulted in an identical pattern of
interpretations. The results for the untransformed data are reported here.
49 When violations of the sphericity assumption were adjusted using the Greenhouse-Geisser procedure,
the interpretation of the results remained unchanged, unless otherwise indicated.
Simple-effects of RC at Noun: The simple effects of complexity for the CE items
entailing a noun response was significant, F(3, 498) = 45.53, MSe = 3.87, p < .001.
Follow-up multiple comparisons indicated that there was no difference in decision time
between the binary and ternary items, or between the quaternary and quinary items,
(F<1 in both cases). There was however a significant difference between the
binary/ternary items and the quaternary/quinary items (p < .001 in each case).
Simple-Effects of RC at Verb: The simple-effects for the CE sentences entailing a verb
response was also significant, F(2, 332) = 24.43, MSe = 7.62, p < .001. Follow-up pairwise comparisons of this effect indicated a significant difference between all levels.
Mean decision time for ternary items was significantly shorter than for quaternary items,
F(1, 166) = 7.02, MSe = 6.65, p = .009, and mean decision time for quaternary items was
significantly shorter than for quinary items, F(1, 166) = 36.48, MSe = 18.94, p < .001.
Simple-Effects of Question-Type: The alternative comparisons of the simple effect of
question-type for each level of complexity indicated that mean decision time to ternary
items and quinary items did not differ as a function of question-type (either noun or
verb), F < 1 in both cases. There was however a significant difference in mean decision
time for quaternary items. The time to select a noun response to CE sentences was
significantly longer than for a verb response, F(1, 166) = 40.05, MSe = 4.26, p < .001.
6.7.1.3 Additional issue: 2-role sentences.
The data suggest that, overall, binary items tend to be more difficult than ternary items
(or at least not significantly different from them), regardless of sentence- or question-type.
This might be a function of a restriction-of-range problem, and the recent
findings of Andrews (in preparation) have shown similar results with another university
sample. An alternative explanation might be that 2-role sentences are just as difficult as
3-role sentences because of the way they are constructed. Following the relational
complexity analysis of sentence comprehension by Halford et al. (1998a), binary
sentences entail two noun phrases followed by a verb, such that the assignment of the
nouns to their roles has to be done in parallel. For example, “the woman that the man
helped”. This does not constitute a grammatical sentence and Andrews (in preparation)
argues that using object focussed sentences resolves the problem; for example by
prefixing the sentence with “Sally saw…”, the sentence becomes, “Sally saw the
woman that the man helped”. Andrews (in preparation) argues that the initial words
“Sally saw” would be processed before the rest of the sentence and therefore do not
contribute to the relational demand imposed by assigning the remaining noun phrases to
the roles of the relative-clause verb. This may be the case in theory; however, we
suggest that “Sally saw” may be treated as legitimate elements that could interfere with
comprehension, and that this needs further investigation.
6.7.1.4 Summary of Sentence Comprehension Task.
In summary, the current data suggests that the type of probe question used to assess
comprehension modifies the influence of relational complexity in this task. Noun
responses are more likely to be incorrect and take longer to determine than verb
responses. This effect is most pronounced for quaternary items and was present in both
accuracy and decision time measures. This finding suggests that in assessing the
influence of relational complexity on sentence comprehension, we need to be a little
cautious about the types of probe questions we use. This is particularly important since
non-significant differences between ternary and quaternary items for verb responses
have the potential to wash out differences between these levels of complexity when item
performance is aggregated over probe question-types. Therefore, to maximise the test of
the relational complexity metric, we consider these measures separately in the
correlational analyses to follow (i.e., we use an aggregated score collapsed over all probe
question types, and a second score derived from just those items entailing a noun
response).
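To make this scoring decision concrete, the sketch below (our own illustration, not part of the original analyses) shows one way the two composite accuracy scores could be derived from item-level data; the data frame and its column names (subject, question_type, rc_level, correct) are hypothetical.

```python
import pandas as pd

# Hypothetical item-level records: one row per subject x item, with the probe
# question type (noun/verb), relational complexity level, and accuracy (1/0).
items = pd.DataFrame({
    "subject":       [1, 1, 1, 1, 2, 2, 2, 2],
    "question_type": ["noun", "verb", "noun", "verb"] * 2,
    "rc_level":      ["3D", "3D", "4D", "4D"] * 2,
    "correct":       [1, 1, 0, 1, 1, 0, 1, 1],
})

# Composite 1: accuracy collapsed over all probe question types,
# one score per subject at each level of relational complexity.
all_probes = items.groupby(["subject", "rc_level"])["correct"].mean().unstack()

# Composite 2: accuracy derived from noun-response items only.
noun_only = (items[items["question_type"] == "noun"]
             .groupby(["subject", "rc_level"])["correct"].mean().unstack())

print(all_probes)
print(noun_only)
```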
6.7.2 Knight-Knave Task
As for the other tasks, we consider the item analyses of the current knight-knave
problems in detail and then move on to consider the evidence available to demonstrate
consistency in measurement as a function of relational complexity. In Chapter 3, we
began the process of considering the influence of additional factors on performance in
the knight-knave task (i.e., serial processing demand). We continue our investigation of
other plausible alternatives here.
6.7.2.1 Item Analysis
The traditional item and Rasch based statistics for the knight-knave task are provided in
Table 6.13. The test of the item × trait interaction from the Rasch analysis was
appropriate, χ2(42) = 46.52, p = .291, and the infit and outfit tests of item-fit were all
within the 1.00 ± 0.30 criterion suggested by Wright et al. (1994). Although this is
indicative of an appropriate fit to the Rasch model, person separation was very low
(person separation = .317). Low person separation tends to indicate some problem with
the measure (e.g., multidimensionality) and this was corroborated by the presence of a
low reliability coefficient, Cronbach α = .26. Together these results suggest substantial
heterogeneity amongst the items and bring into question the utility of the Rasch
analysis for this task since it assumes unidimensionality (or at least a dominant factor).
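For readers less familiar with the infit and outfit mean-square statistics used here and in the tables that follow, the sketch below (a simplified illustration of the standard formulas, not the software used for the analyses) computes them for a single item under the dichotomous Rasch model, given estimated person abilities and an item difficulty.

```python
import numpy as np

def rasch_fit_stats(x, theta, b):
    """Infit and outfit mean-square statistics for one dichotomous Rasch item.

    x     : observed 0/1 responses of N persons to the item
    theta : estimated person abilities (logits)
    b     : estimated item difficulty (logits)
    """
    p = 1.0 / (1.0 + np.exp(-(theta - b)))   # model probability of a correct response
    w = p * (1.0 - p)                        # model variance of each response
    z2 = (x - p) ** 2 / w                    # squared standardised residuals
    outfit = z2.mean()                       # unweighted mean square (outlier sensitive)
    infit = np.sum(w * z2) / np.sum(w)       # information-weighted mean square
    return infit, outfit

# Toy example: five persons answering one item of difficulty -0.5 logits.
x = np.array([1, 1, 0, 1, 0])
theta = np.array([1.2, 0.4, -0.3, 0.0, -1.5])
print(rasch_fit_stats(x, theta, b=-0.5))
```

Values near 1.00 indicate responses about as variable as the model expects; the 1.00 ± 0.30 region used here flags items (or persons) that depart from that expectation.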
Table 6.13
Mean item statistics (standard deviations in parentheses) for the knight-knave items

                                                                          Rasch Statistics
Item    RC   RCsteps   p-value   RT              CRT               logit   SE     outfit   infit
ternary
3.1     3    1         0.70      19.08 (12.33)   20.22 (13.02)     -0.59   0.18   0.89     0.92
3.2     3    3         0.59      38.37 (39.85)   40.74 (42.46)     -0.09   0.17   1.07     1.07
3.3     3    2         0.80      15.64 (8.43)    16.07 (8.07)      -1.11   0.20   0.85     0.91
3.4     3    2         0.75      27.04 (18.17)   27.63 (18.41)     -0.80   0.19   0.92     0.97
3.5     3    2         0.91      19.76 (11.19)   19.95 (11.16)     -2.02   0.26   0.90     0.90
3.6     3    1         0.82      21.45 (13.43)   21.96 (13.05)     -1.32   0.21   0.80     0.87
quaternary
4.2     4    3         0.40      38.11 (28.81)   42.91 (29.88)      0.71   0.17   0.98     1.00
4.4     4    2         0.48      29.10 (22.47)   31.53 (21.06)      0.44   0.17   1.03     1.00
4.5     4    3         0.54      24.95 (26.38)   28.29 (31.56)      0.13   0.17   1.00     0.94
4.6     4    4         0.61      32.36 (26.08)   35.87 (29.76)     -0.23   0.17   0.98     0.99
4.7     4    3         0.23      37.89 (35.70)   46.93 (58.51)      1.45   0.19   1.13     1.01
Indeterminate
4.8     4    3         0.33      38.35 (28.13)   42.62 (36.06)      1.01   0.18   0.96     1.00
4.9     4    5         0.36      30.62 (22.11)   38.96 (27.16)      0.85   0.17   0.92     0.92
4.10    4    4         0.21      35.04 (23.46)   43.56 (29.41)      1.58   0.19   1.02     0.94

RC = relational complexity; RCsteps = number of relational processing steps; p-value = proportion
correct; RT = mean item response time (sec); CRT = mean item correct response time (sec);
SE = standard error of the calibrated item difficulty estimate (logit)
It would seem that if relational reasoning is influential in performance on the knight-knave task, it shares this role with other factors. We already have evidence to support
this. In Chapter 3, the number of relational processing steps entailed in the knight-knave
problem accounted for a substantial amount of the variability in item difficulty and
response times. Relational complexity theory proved useful in providing a measure of
serial load – the number of relational processes. Serial processing demand might
therefore account for at least some of the problems observed here with the Rasch
analysis.
Person Fit: The person fit statistics were surprisingly stable given the concerns about
multidimensionality raised above. Figure 6.16 plots the person infit and outfit
statistics as a function of estimated ability – 76.4% of the sample had infit values within
the fit region (recall that spikes outside of the fit region indicate probable misfit). In
spite of this we are cautious in making any decision until we have had a chance to
explore the task further.
Figure 6.16. Distribution of person infit and outfit statistics as a function of estimated ability
across the 14 items of the knight-knave task. [Plot: x-axis = persons ordered by estimated
ability; y-axis = mean square fit value (0.0–3.5); OUTFIT and INFIT series, with the fit region
marked.]
6.7.2.2 Measures of relational complexity.
The predominant approach that has been used to support a manipulation of relational
complexity is a comparison of performance on items classified at different levels of
complexity (Halford et al., 1998a)50. In practice this entails creating a composite score
across items as a function of relational complexity (e.g., arithmetic sum or mean). This
50 In some cases scores across more than one level of complexity have been used (e.g., Andrews &
Halford, in press).
is what we did in the Sentence Comprehension Task after we had controlled for the
influence of probe question-type (i.e., noun vs. verb response). Composite scores were
also derived to assess the influence of the relational complexity manipulation in the
Latin Square task in Chapters 4 and 5, but in this case the clustering of items within
levels of complexity was much more consistent than what is apparent in the knight-knave task (additional evidence for this within-level consistency in the Latin Square
task will be presented in Section 6.7.3). It was argued earlier (Chapter 1 and 2) that
unconsidered aggregation has the potential to obscure the influence of an unknown
number of factors. This is of particular concern to us if such factors are differentially
influential in performance as a function of relational complexity. The differential
influence of probe question-type (noun vs verb response) in the SCT is an example of
this.
Substantial variability is present in the measures of item difficulty in Table 6.13 for the
current set of knight-knave problems – the proportion correct (p-values) for ternary
items ranged from .59 to .91 (range of correct RTs was 16.07 – 40.74 s), and the proportion
correct (p-values) for quaternary items ranged from .23 – .61 (range of correct RTs was
28.29 – 46.93 s). Item indeterminacy also appears to add an additional source of
difficulty to the solution of quaternary knight-knave items. The proportion correct for these
indeterminate items ranged from .21 – .36 (the range of correct RTs was 38.96 –
43.56 s). Given serial processing demand has been previously identified as a significant
source of item difficulty in the knight-knave task, it is highly probable that it would
continue to be so with essentially the same item pool in another sample of university
students. We consider this in more detail in the following section.
6.7.2.3 Serial processing demand.
Accuracy: An item-based regression analysis, similar to that used for the knight-knave
task in Chapter 3, indicated that both relational complexity (RC) and number of
relational processing steps (RCsteps) are highly significant zero-order predictors of
accuracy and response time, as summarised in Table 6.14. There is also a significant
zero-order relationship between an item’s relational complexity and the number of
processing steps – quaternary items tended to entail a greater number of processing
steps than ternary items51. The combined effect of RC and RCsteps accounted for 70.7%
of the variation in item difficulty (i.e., proportion correct), i.e., R2 = .707 (adjusted R2 =
.65), F(2, 11) = 13.28, p = .001. The number of processing steps was not a significant
predictor of item difficulty once relational complexity was considered, β = -.17, t(11) =
-.75, p = .47. The relational complexity manipulation subsumed the zero-order
variability in item difficulty that was accounted for by the number of processing steps
and provided an additional unique contribution to its prediction, β = -.71, t(11) = -3.14,
p = .01. This effect was not mediated by an interaction between RC and RCsteps. When
an interactive effect was included in the regression analysis, the change in R2 was not
significant, R2change = .01, F < 1.
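A minimal sketch of this kind of item-based regression is given below, using the p-values, complexity classifications, and processing-step counts listed in Table 6.13; the column names and the use of statsmodels are our own, and the output should be read as illustrative only.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Item-level values taken from Table 6.13 (14 knight-knave items).
items = pd.DataFrame({
    "p_correct": [0.70, 0.59, 0.80, 0.75, 0.91, 0.82,
                  0.40, 0.48, 0.54, 0.61, 0.23,
                  0.33, 0.36, 0.21],
    "rc":        [3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4],
    "rc_steps":  [1, 3, 2, 2, 2, 1, 3, 2, 3, 4, 3, 3, 5, 4],
})

# Step 1: number of relational processing steps alone.
m1 = smf.ols("p_correct ~ rc_steps", data=items).fit()

# Step 2: add relational complexity; the change in R-squared and the coefficients
# in the combined model index the unique contribution of each predictor.
m2 = smf.ols("p_correct ~ rc_steps + rc", data=items).fit()

print(m1.rsquared, m2.rsquared, m2.rsquared - m1.rsquared)
print(m2.params)
```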
Table 6.14
Descriptive statistics and inter-correlations for the item-based analyses of item difficulty and
mean response time

                                         Mean (SD)          Correlation Coefficients
                                                            %C        CRT       RC
Proportion Correct (%C)                  0.552 (0.22)
Correct RT (CRT)                         32.662 (10.35)     -0.89**
Relational Complexity (RC)               3.571 (0.51)       -0.83**   0.71**
Number of processing steps (RCsteps)     2.714 (1.14)       -0.67**   0.71**    0.70**

** p < .01, df = 12; SD in parentheses
Response Time: Mean item response time can also be modelled as a function of
relational complexity and number of processing steps. Table 6.14 shows the zero-order
correlations between RC and RCsteps and the mean time to make a correct response.
When both variables are considered together they account for 59.7% of the variability in
mean correct response time, R2 = .597, F(2, 11) = 8.14, p = .007. The standard
regression analyses indicated that neither the relational complexity classification nor the
number of relational processing steps provided a significant unique contribution to the
prediction of response time when considered together, relational complexity β = .43,
t(11) = 1.62, p = .13; number of processing steps β = .41, t(11) = 1.53, p = .15. This
effect was also not mediated by an interaction between RC and RCsteps. When an
51
Given the lack of independence in the manipulations of relational complexity and number of
processing steps, a traditional analysis of variance exploring the influence of relational complexity and
processing steps is problematic – hence the reliance on an item-based regression analysis.
interactive effect was included in the regression analysis, the change in R2 was not
significant, R2change = .06, F(1, 10) = 1.69, p = .22. These results together suggest that it
is not possible to differentiate statistically the influence of relational complexity per se
on response times from the influence of serial processing load (the number of relational
processing steps).
It is important to note that the interpretation of the analyses of response times at an
item-based level is a little more problematic than with the item difficulty measure. This is
because any individual differences in response time variability across items will be
obscured by the use of an aggregated score (i.e., mean correct RT)52. A simple way of
assessing the equality of variance assumption is to calculate Fmax, the ratio of the largest
variance to the smallest variance, within each level of complexity. As a rule of thumb,
Fmax scores above 3 suggest some departure from homoscedasticity. Both the ternary
and quaternary items show substantial deviation from homoscedasticity using this
criterion, Fmax = 27.69 and 7.72, respectively. Hence although the zero-order
correlations between response time and complexity are highly significant, the regression
analyses suggest that these should probably be interpreted a little more cautiously than
might normally be the case. We will consider the role of relational complexity on
item response time from a person-based perspective in more detail in Chapter 7.
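A minimal sketch of the Fmax check described above is given below; it squares the correct-RT standard deviations listed in Table 6.13 and takes the ratio of the largest to the smallest variance within each level of complexity.

```python
import numpy as np

def f_max(sds):
    """Fmax: ratio of the largest to the smallest variance in a set of items.
    As a rule of thumb, values above about 3 suggest heterogeneity of variance."""
    variances = np.asarray(sds, dtype=float) ** 2
    return variances.max() / variances.min()

# Correct-RT standard deviations of the ternary and determinate quaternary
# knight-knave items (from Table 6.13).
ternary_sd = [13.02, 42.46, 8.07, 18.41, 11.16, 13.05]
quaternary_sd = [29.88, 21.06, 31.56, 29.76, 58.51]

print(round(f_max(ternary_sd), 2), round(f_max(quaternary_sd), 2))   # approx. 27.7 and 7.7
```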
The implications of these analyses also need to be considered in the derivation of a
composite relational complexity score. The relational complexity manipulation seems to
be able to subsume all of the variability in item difficulty and mean item correct
response time that the number of relational processing steps accounts for and there was
no significant interaction. Further, relational complexity provides additional predictive
utility for measures of item difficulty but not mean correct response time. We might
therefore expect that the generation of a composite score by simply aggregating
performance across the 6 ternary items and 5 quaternary items separately is not going to
be influenced disproportionately by differences in serial processing demand. In an
52 Remember that the standard deviation of a dichotomous variable (0, 1) is determined by the number of
people who score correct, and not by individual responses per se. The standard deviation of a proportion
is σp = √(π(1 − π)), where π = the population proportion.
attempt to address this issue further we consider a selection of other factors that have
been raised in the reasoning literature that have the potential to mediate the influence of
relational complexity on performance in the knight-knave task. These are problem
indeterminacy, strategy, and negation. These are post hoc analyses and therefore
somewhat qualitative and necessarily require further empirical evaluation.
6.7.2.4 Indeterminacy.
The majority of knight-knave problems require testing for indeterminacy at some level
even though the status of both speakers might be determined unambiguously once all
information has been considered. It is important to remember that
indeterminacy/ambiguity is a central issue in the theory of relational complexity. The
inability to decompose a relation into smaller components while maintaining the same
level of information is a key determinant of the relational complexity of a process
(Halford et al., 1998a). That is, when indeterminacy generated by considering
arguments in isolation is resolved through their integration, the number of arguments
that needed to be instantiated effectively defines the relational complexity metric. To
clarify this empirically in the knight-knave task we will consider two items from the
current pool. The first, Item 4.7, is determinate if all the information is considered, but
indeterminate if only one of the premises is considered. The second, Item 4.8, is
completely indeterminate even after both premises are considered.
Partial Indeterminacy: To analyse the complexity of a task’s processes, we need to
have a sound understanding of the strategy that reasoners use and the processes they
entail. Chapter 3 considered the task analyses of the knight-knave task proposed by Rips
(1989) and Byrne and Handley (1997) that was based on protocol analyses and
experimental research. We believe this work has provided a strong foundation for a
solution heuristic on which to base the relational complexity analyses. Using this
heuristic to determine the solution strategy that the typical reasoner would adopt, we
would propose solution of Item 4.7 would begin by making the supposition that “B is a
knight”. This is because the probe question, “What is the status of B?”, indicates to the
reasoner a possible starting point – inhabitant B. The subjects in both Rips' and Byrne
and Handley's studies tended to prefer working from the supposition that the speaker is a
knight rather than a knave, when no other information is provided (see Chapter 3).
Item 4.7
A says, “A is a knave or B is a knave”
B says, “A is a knave or B is a knight”
What is the status of B?
Working through the implications of B’s statement being true (i.e., spoken by a knight),
we can conclude that it is in fact consistent with that made by a knight. Of the 161
subjects who attempted this problem, 40.37% seemed to have reached this point and
concluded that B was a knight, contented to go no further (see Appendix D.6 for a
summary of the actual response made by the 161 subjects to each of the 14 items). If the
reasoner was to go on and pursue the possibility that “B is a knave”, as indeed they
should, then they would conclude that B’s statement also happens to be consistent with
that spoken by a knave. At this point, there is some indeterminacy – is B a knave or is
he a knight? As it turns out, this can be resolved by considering inhabitant A’s
statement. However, over a third of the sample (36.65%) apparently chose not to do
this and instead selected the “not possible” option, suggesting that they had reached a
level of indeterminacy that they believed required no further assessment. If A’s
statement is considered, the correct status of B can be determined – B is a knave. Only
22.98% of the sample made this response.
An analysis of the average response times for the subjects who made each of these
responses is consistent with the solution heuristic that is modelled from previous
research (i.e., consistent with the strategies identified by Byrne and Handley, 1997).
Subjects who believed inhabitant B was a knight took on average 32.80s (SD = 21.21s)
to respond. Those subjects, who believed B’s status could not be determined from the
information provided, took on average 37.81s (SD = 28.52s) to make their response.
Those subjects, who correctly decided that B is a knave, took on average 46.93s (SD =
58.51s) to make their response. Clearly there is a substantial degree of individual
variability in response time to each of these options. However, there is a tendency for
the “B is a knave” response to be significantly longer than the “B is a knight” response,
F(1, 158) = 3.73, p = .055. This gives some preliminary, albeit weak, empirical support to
the solution heuristic that we have assumed as driving the problem solving strategy
adopted by our subjects.
Complete Indeterminacy: The next item we consider (Item 4.8) is similar in many
respects to Item 4.7, however its indeterminacy is total – the status of the target speaker
cannot be determined unambiguously from the information provided.
Item 4.8
A says, “A is a knight and B is a knave”
B says, “A is a knave or B is a knight”
Is inhabitant A a knave?
To solve Item 4.8, again we apply the solution heuristic and assume that reasoners will
begin by making the supposition that A is a knave (consistent with the probe question),
and that he will test this initially in the first premise (A’s statement). Working through
the implications of this supposition quickly results in the conclusion that the statement
is consistent with that made by a knave, even though the status of B is indeterminate at
this point in the processing. In total, 45.96% of the sample chose this option and took on
average 33.92s (SD = 22.25s) to make their response. If subjects were to reason a little
further and test the supposition that A is actually a knight, then they would realise that
A’s statement is also consistent with this. This time however, it also provides a
determined solution for B – he must be a knave, if A is a knight. This can be verified in
B’s statement – if B is a knave then his statement must be false. We can conclude that
B’s statement is therefore both consistent with inhabitant A being a knight and
inhabitant B a knave. In total, 21.12% of the sample concluded that A was a knight and
took on average 41.33s (SD = 24.73s) to make their response. We might speculate that
these subjects failed to realise that given the previous consistency (i.e., A is a knave)
this actually means that the status of A cannot be determined unambiguously – both
knight and knave assignments result in consistency. The responses of the 32.92% of the
sample who stated that it was not possible to tell whether A was a knave suggest that these
subjects were aware of this subtlety. These subjects took on average 42.62s (SD =
36.06s) to respond. Again, although there is substantial variability in individual
response times, there is marginal empirical support for a significant difference in
response times between those subjects who responded “Yes (A is a knave)” and those
who correctly stated that A’s status could not be determined, F(1, 158) = 2.99, p = .085.
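The determinacy claims made for Items 4.7 and 4.8 can be checked mechanically. The sketch below (our own illustration, not part of the original analyses) enumerates the four possible knight/knave assignments and keeps only those in which every knight's statement is true and every knave's statement is false.

```python
from itertools import product

def consistent_models(statements):
    """Return all (A, B) assignments consistent with the knight-knave rules.

    statements maps a speaker ("A" or "B") to a function of (a_is_knight,
    b_is_knight) giving the truth value of that speaker's statement. A knight's
    statement must be true and a knave's statement must be false.
    """
    models = []
    for a, b in product([True, False], repeat=2):   # True = knight, False = knave
        ok = all(statements[s](a, b) == (a if s == "A" else b) for s in statements)
        if ok:
            models.append(("knight" if a else "knave", "knight" if b else "knave"))
    return models

# Item 4.7: A says "A is a knave or B is a knave"; B says "A is a knave or B is a knight".
item_4_7 = {"A": lambda a, b: (not a) or (not b),
            "B": lambda a, b: (not a) or b}
print(consistent_models(item_4_7))   # one model only: A is a knight, B is a knave

# Item 4.8: A says "A is a knight and B is a knave"; B says "A is a knave or B is a knight".
item_4_8 = {"A": lambda a, b: a and (not b),
            "B": lambda a, b: (not a) or b}
print(consistent_models(item_4_8))   # two models that differ in A's status: indeterminate
```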
Summary: It seems clear that the nature and location of indeterminacy is a source of
difficulty in the knight-knave task and that partial ambiguity is also troublesome. What
is also becoming clear from these analyses is that subjects tend to respond with the first
plausible assignment that they come across. This then implies that at least some subjects
are reluctant to follow up alternative explanations. Subjects were made aware of the
possibility of plausible alternatives during the practice phase (see the Method section of
the study reported in Chapter 3); however, the extent to which this information was
acted upon during the test phase cannot be assessed adequately with these data.
Therefore we cannot rule out completely the possibility that some subjects simply
did not understand that plausible alternatives might exist. In the case of Item 4.8, a
potential way to test this in future research would be to ask subjects to provide the
status of B immediately after they have made their decision about A’s status. If subjects
respond “A is a knave” and then indicate that B’s status cannot be determined, then it is
unlikely that they have followed up plausible alternatives for inhabitant A (recall that
the first step of testing “A is a knave” results in the status of B being indeterminate). If
subjects indicated that “B is a knave” in the proposed post-test, then we might speculate
that the subject has contemplated the possibility that A is a knight (see Schroyens et al.,
1999, for an alternative account of the influence of strategies).
6.7.2.5 Strategy.
The analysis of indeterminacy also serves to highlight the importance of solution
strategy on performance. That is, the point at which a reasoner confronts ambiguity (if
at all) and the point at which a consistent assignment of status is reached (if at all) can
influence the decisions made. It might not be clear just what the status of the relational
complexity analysis would be given that this is the case. A great deal of research on
propositional reasoning has demonstrated that processing demand can be minimised
with the development of an appropriate heuristic (e.g., Evans, 1989). We have already
identified in Chapter 5 the dual-task research of Klauer et al. (1997) that demonstrated
that a dual-task deficit (i.e., competition of resources) can be obviated with a successful
heuristic. When the heuristic was blocked from being used, the dual-task deficit was
present. This is indicative of a significant demand for resources that wasn’t present
when the heuristic could be used. The issue in the knight-knave task then becomes: are
reasoners having difficulty in solving the items because of the demand imposed by the
relational complexity of the item, or because for some items, a plausible solution is
reached before the correct response is realised (i.e., because they fail to follow up
plausible alternatives)? The following analyses of the influence of solution strategy on
relational complexity classification attempt to provide some further insight into this
problem. The two items listed below are identical in all respects except that one (Item
3.4) asks whether inhabitant B is a knight, and the other (Item 4.6) asks whether B is a
knave.
Item 3.4
A says, “A is a knight”
B says, “A is a knave and B is a knave”
Is inhabitant B a knight?

Item 4.6
A says, “A is a knight”
B says, “A is a knave and B is a knave”
Is inhabitant B a knave?
Using the solution heuristic identified in Chapter 3, we assume that reasoners will
attempt solution of Item 3.4 by first making the supposition that “B is a knight”, and
then testing the implications of this in B’s statement. Similarly, we assume that
reasoning in Item 4.6 will start by making a supposition that “B is a knave” and testing
the implication of this in B’s statement. This difference in starting point influences the
processes that the individual is likely to use and more importantly, it influences the
classification of relational complexity. To demonstrate this, consider Item 3.4. If we
assume that B is a knight, then both conjuncts of B’s compound statement must be true.
This immediately leads to a contradiction because one of the components of B’s
assertion states that B is a knave. Therefore reasoners can conclude unambiguously that
if B is not a knight, then he must be a knave. Note that if subjects understand the
implications of the rules of the knight-knave task, they would realise that they do not
need to test “B is a knave” since, as we have just said, B must be either a knight or a
knave; if he is not a knight, then he must be a knave53. This line of reasoning has been
classified as requiring at most, a ternary process:
Item 3.4
&-SAYS(kt(B), AND(kv(A), kv(B))) → AND(kv(A), kv(B))
CONTRADICT(kt(B), kv(B)) → kv(B)
Now consider Item 4.6 in which the starting supposition is to assume “B is a knave”. If
B is a knave, then this implies that B’s statement must be false. That is, either both
conjuncts are false, or just one of the conjuncts is false. Using the principles for
analysing compound statements introduced in Chapter 3, decomposing the compound
statement in this way adds an additional element of complexity that was not necessary
when starting from the supposition that “B is a knight” – the status of A is implicated in
the reasoning about B and therefore these terms cannot be chunked in the implication.
The significant reasoning process during solution of this problem is as follows:
&-SAYS(kv(B), AND(kv(A), kv(B))) → AND(kt(A), kv(B))
That is, B’s statement is consistent with that of a knave, if A is a knight.
Table 6.15
Cross-tabulation of the number of individuals selecting each response option to Item 3.4 and
Item 4.6

                                       Item 4.6: Is B a knave?
Item 3.4: Is B a knight?      knight (no)    knave (yes)*   can't tell     Total
  knight (yes)                6              17             5              28 (17.4%)
  knave (no)*                 22             76             22             120 (74.5%)
  can't tell                  2              6              5              13 (8.1%)
  Total                       30 (18.6%)     99 (61.5%)     32 (19.9%)     161 (100.0%)

* Correct response for both items is “B is a knave”
53 This is also distinct from the indeterminacy issues raised above. B's status as a knight is not
indeterminate; if B were a knight he would not make that statement.
The differences in accuracy (proportion correct: p-value) and response time can be
determined from the information in Table 6.13. Table 6.15 shows a cross-tabulation of
the number of subjects who selected each response to each item. Item 3.4 was answered
correctly by 75% of the sample, while Item 4.6 was answered correctly by only 61% of
the sample. Analyses of this difference using a simple paired t-test suggested that
response to Item 3.4 was significantly more accurate than for Item 4.6, t(160) = 2.61, p
= .01. Analysis of the item response times (regardless of accuracy) reported in Table
6.13, also suggested that processing time in Item 4.6 (MRT = 32.36s, SD = 26.08) was
significantly longer than for Item 3.4 (MRT = 27.04, SD = 18.17), t(160) = 2.28, p =
.024. This effect was replicated for correct response times for the 76 subjects who
answered both items correctly, t(75) = 2.21, p = .03. Interestingly, there was no
significant correlation between accuracy on the two items, r(159) = .07, p = .41. That is,
getting one of the items correct does not provide any indication to us that the subject
would get the other item correct. Although there is weak evidence for a correlation in
response times regardless of accuracy between Item 3.4 and Item 4.6, r(159) = .14, p =
.07, when the accuracy of the response was considered, the correlation between the two
items was not significant, r(74) = .11, p = .35.
Summary: Together these results are consistent with the idea that processing complexity
in the knight-knave task is influenced by the solution strategy that is adopted. They also
provide some corroborating evidence that the assumptions we have made about the
dominant solution heuristic identified by Byrne and Handley (1997) and Rips (1989) are
consistent with the strategy used by at least a large proportion of the subjects in the current
sample. Of course it might be the case that some of our subjects tested both possibilities
in either or both items (3.4 and 4.6). That is, first they tested whether B was a knight,
and then whether B was a knave (or vice versa), regardless of the necessity to do so. In
that case, we would expect there to be no difference between the items in terms of
performance and response times. To the extent that this is the strategy that a substantial
number of subjects used, the significance test will be more conservative. That we find
a significant result regardless of this possibility suggests the effect of the probe question
is a strong one. The interpretation that we postulate is that in this case, the effect is
strong because the solution paths entail the construction of different representations that
differ in relational complexity.
6.7.2.6 The effect of negation.
The last issue that we would like to raise is associated with the effect of negation. The
knight-knave task is interesting for the very fact that it requires reasoning about the
interaction between the veracity of statements (true or false) and the status of
individuals (truth-teller or liar). Solution almost invariably entails processing of
negatives or complementary accounts from these two broad levels. An inspection of the
list of items in Table 6.13 indicates that all of the determinate quaternary items entail
some verification of whether the target speaker is a knave. Only Item 4.5 results in the
correct answer being a knight. A direct probe is only used in two of the ternary items
(Items 3.2 and 3.4) and this asks reasoners to determine whether the target speaker is a
knight. Items 3.1, 3.3, and 3.6, would appear to cue an initial supposition entailing a
knave, although this is more indirect since it is focused on the non-target inhabitant (and
therefore is likely to enlist a backward working inference – see Chapter 3). The
remaining ternary items appear to cue an initial supposition entailing a knight, but these
are also focused on the non-target individual.
The extent to which these differences influence solution separate from the impact of the
relational complexity manipulation is difficult to ascertain in the current experiment.
We have just demonstrated in the preceding section that changing the probe question
from “Is B a knight” to “Is B a knave” influenced the relational complexity
classification of problems 3.4 and 4.6. The relational complexity analysis is capable of
discriminating between these particular items, however we cannot disambiguate the
influence of negation (i.e., knave) from the relational complexity in general without
further experimentation. As such, we leave this question for further research.
6.7.2.7 Final analyses: Composite scores.
It seems clear that performance in the knight-knave task is a function of a complex array
of factors. Here we have considered the influence of relational complexity, the number
of relational processing steps, partial and complete indeterminacy, and solution strategy,
all of which provide valuable insights into the reasoning entailed in this very
interesting collection of items. It is not surprising therefore that the item analyses
indicate a considerable degree of heterogeneity in the constructs being assessed. This
does not bode well for the construction of a composite unidimensional relational
complexity knight-knave score in which the influence of factors other than relational
complexity has been removed. While it seems clear that relational complexity theory is
capable of accounting for a substantial proportion of the variability in item difficulty
and response times in the knight-knave task, the extent to which it is confounded with
the factors we have identified here is difficult to know without further empirical work.
Similarly it is unclear just how many other factors might influence performance. Clearly
further experimental exploration of these issues is warranted. Further analyses of the
cognitive correlates to performance in the knight-knave task may also provide some
insights into these factors.
As a final comparison and regardless of the concerns that have been raised, we
constructed composite scores for the 6 ternary and 5 determinate quaternary items by
averaging proportion correct and response times separately; the descriptive statistics along
with their inter-correlations are summarised in Table 6.16.
Table 6.16
Means and standard deviations (in parentheses) and inter-correlations between composite
ternary and quaternary measures

Composite Measure                        Mean (SD)         %c3d     %c4d     rt3d     rt4d     crt3d
Mean proportion correct 3D (%c3d)        0.76 (0.19)       -
Mean proportion correct 4D (%c4d)        0.45 (0.23)       0.12     -
Mean RT 3D (rt3d)                        23.56 (9.57)      0.12     0.00     -
Mean RT 4D (rt4d)                        32.48 (14.39)     0.25**   0.14     0.44**   -
Mean Correct RT 3D (crt3d)               23.90 (15.27)     -0.08    -0.09    0.80**   0.35**   -
Mean Correct RT 4D (crt4d)               35.44 (25.29)     0.25**   0.01     0.23**   0.75**   0.11

** p < .01; RT = Response Time; 3D = ternary; 4D = quaternary
As might be expected, the individuals’ average proportion correct across the ternary
items was significantly higher than for the quaternary items, t(160) = 13.77, p < .001
and response times were substantially longer for quaternary items than ternary ones
regardless of whether only response times for correct items were used, t(148) = -5.19, p
< .001, or whether the response times for all items were used, t(160) = -8.47, p < .001.
If it were clear that relational complexity was the predominant factor involved in
reasoning, this would support one of the key predictions of the relational complexity
theory – that as complexity increase, performance deteriorates. Unfortunately our
concerns regarding the influence of alternative factors in the derivation of composite
scores seems empirically well founded. The correlation between the ternary composite
scores of accuracy and the quaternary counter-part is not significantly different from
zero, r(159) = .12, p = .14. That is, the number of quaternary items answered correctly
is not associated with the number of ternary items answered correctly. There is however
a significant positive correlation between response time on ternary and quaternary
composite scores r(159) = .44, p < .001, such that longer average processing on ternary
items is associated with longer average processing on quaternary items. This effect is
negated when only response times for correctly answered items are aggregated, r(147) =
.11, p = .19. In any case, the item-based regression analysis of response times suggested
that relational complexity and number of processing steps did not add anything unique
to the prediction of item response time, and that these factors did not interact.
Therefore, it is difficult to know to what extent the significant correlation in response
times is a function of relational complexity per se, or due to the strong relationship
between the relational complexity classification and the number of processing steps (see
Table 6.14; r(12) = .70, p < .001).
As our understanding of the reasoning entailed in the knight-knave task improves, we
will be in a better position to disambiguate the influence of relational complexity from
other factors. At this point in time, relational complexity is a significant predictor of
reasoning difficulty in this multi-faceted complex reasoning task. The Latin Square task
that we consider next was developed to minimise the influence of extraneous factors in
the assessment of the relational complexity metric. In the development of this task we
hoped to overcome some of the problems associated with working with problems like
the knight-knave task in which reasoning is obviously very dynamic and complex.
6.7.3 Latin Square Task
The modified pool of Latin Square Task items was used in the dual-task study of
Chapter 5 with a random selection procedure. Therefore, subjects did not necessarily
receive the same selection of items in the single and dual task conditions. In the
analyses for the dual-task study we argued that there was sufficient evidence for
consistency within levels of relational complexity to allow us to generate a composite
score regardless of the number of items attempted. We wish to complete the
presentation of this evidence now and we begin by considering a Rasch analysis of the
LST items for the current sample as we have for the other tasks.
The Latin Square task was attempted by 180 of the 191 university students who
participated in the study. Table 6.17 lists the traditional and Rasch based item statistics.
The χ2 test of the item × trait interaction indicated a satisfactory fit to the Rasch model
when four trait groups were used and all 36 items were considered, χ2(108) = 129.75, p
= .076. Person separation was also satisfactory (person separation index = .77). The
results of the traditional reliability analyses indicate that a composite score across the 36
items has good reliability, Cronbach α = .82. These preliminary indicators suggest the
LST has stable psychometric properties.
Investigation of the item fit statistics suggests that two items (2 and 30) exceed the
1.30 cut-off, suggesting some degree of aberrant misfit, and a further 7 items (1, 7, 19,
20, 22, 24, and 25) have outfit scores below the 0.70 cut-off. No items fall beyond the
fit region suggested by Wright et al. (1994) when item infit is used. An examination of
person fit statistics was also consistent with the interpretation that the Latin Square
items were measuring a unidimensional construct. The evidence for the LST’s fit to the
Rasch model is very good, particularly when we take into consideration the relatively
poorer fit of the more traditional psychometric tests (granted that they have been
presented in non-traditional ways). Figure 6.17 plots the distribution of person infit and
outfit statistics with the 1.00 ± 0.30 fit region. There are a small number of individuals
with extreme outfit statistics although these tend to be at the extremes of the ability
continuum (as indicated by the grey spikes falling beyond the fit region). However, only
3.80% of the sample have infit values suggestive of misfit (2.47% with infit > 1.30, and
1.33% with infit < .70). Note that this is considerably better than the Progressive
Matrices test in which 22.8% of the sample fell outside the fit region (see Section 6.4.1)
and is comparable with the Swaps Test which also demonstrated a very high level of
person fit. Given the good fit to the Rasch model, and consistent with our treatment of
the LST in Chapter 4, we use the Rasch calibrated difficulty scores as estimates of the
construct level required for adequate performance.
Table 6.17
Traditional and Rasch based item descriptive statistics for the Latin Square Task as a function of
relational complexity and number of processing steps

Complexity    item   %correct   RT              CRT              logit   SE     outfit   infit
Binary        01     0.97        6.84 (3.42)     6.87 (3.32)     -2.30   0.45   0.37     1.01
1 step        02     0.96       11.16 (8.38)    11.14 (8.47)     -1.65   0.35   1.37     0.93
              19     0.96        9.01 (5.98)     8.96 (5.99)     -2.10   0.42   0.37     0.95
              20     0.97        6.46 (3.10)     6.57 (3.06)     -2.54   0.50   0.39     1.03
              21     0.96        8.13 (6.04)     8.24 (6.05)     -1.84   0.37   1.23     0.87
              22     0.99        6.58 (3.08)     6.60 (3.08)     -3.50   0.78   0.33     1.16
Binary        03     0.77       16.66 (24.11)   13.04 (7.05)      0.30   0.19   1.11     1.09
2 step        23     0.79       14.63 (11.13)   14.10 (9.88)      0.14   0.20   0.89     0.89
              24     0.87       16.54 (9.18)    17.19 (9.08)     -0.49   0.23   0.66     0.93
              25     0.87       13.66 (15.48)   12.24 (10.62)    -0.56   0.24   0.58     0.88
              26     0.84       16.13 (11.29)   15.75 (10.93)    -0.20   0.22   1.25     0.97
              27     0.79       17.50 (15.44)   18.25 (16.50)     0.10   0.20   1.03     1.02
Ternary       07     0.92       18.39 (20.82)   17.99 (19.74)    -1.10   0.28   0.61     0.87
1 step        08     0.86       21.93 (19.95)   21.02 (17.88)    -0.36   0.22   1.01     0.96
              12     0.89       18.30 (18.32)   17.99 (17.23)    -0.70   0.25   0.96     1.01
              29     0.91       21.48 (15.69)   21.12 (15.19)    -0.93   0.27   1.09     1.02
              30     0.93       19.53 (17.35)   19.86 (17.49)    -1.12   0.28   1.70     0.89
              31     0.92       28.19 (32.19)   28.07 (32.80)    -1.00   0.27   1.09     0.95
Ternary       09     0.76       20.08 (16.14)   19.82 (16.05)     0.31   0.19   1.06     1.09
2 step        32     0.74       19.04 (14.88)   18.18 (13.33)     0.42   0.19   0.78     0.91
              33     0.77       23.96 (22.93)   23.62 (22.96)     0.23   0.19   0.91     0.97
              34     0.79       27.07 (25.70)   28.64 (27.00)     0.17   0.20   1.06     1.00
              35     0.70       25.03 (22.53)   26.16 (21.71)     0.66   0.18   0.95     0.99
              36     0.81       24.70 (17.26)   23.52 (14.93)     0.03   0.20   1.11     1.03
Quaternary    15     0.58       25.17 (22.10)   28.58 (24.37)     1.28   0.17   0.96     1.02
1 step        17     0.60       19.69 (22.38)   18.40 (22.12)     1.23   0.17   0.82     0.89
              37     0.69       22.38 (21.05)   20.60 (21.04)     0.71   0.18   0.98     0.99
              38     0.63       24.73 (24.12)   25.16 (25.04)     1.05   0.17   0.98     1.04
              39     0.59       28.69 (27.77)   29.68 (30.42)     1.18   0.17   1.05     1.04
              46     0.67       23.74 (27.45)   23.96 (27.38)     0.87   0.18   0.92     0.94
Quaternary    14     0.43       55.37 (50.40)   76.70 (59.70)     2.11   0.17   0.94     0.89
2 step        41     0.48       55.10 (68.47)   69.79 (86.68)     1.71   0.17   1.22     1.14
              42     0.61       29.08 (24.80)   31.42 (24.76)     1.16   0.17   0.90     0.96
              43     0.47       45.18 (46.48)   55.51 (49.09)     1.81   0.17   1.02     1.05
              44     0.30       52.02 (49.78)   72.01 (55.73)     2.74   0.18   0.98     1.04
              45     0.37       52.28 (47.54)   68.59 (56.91)     2.31   0.17   1.04     1.07

RT = mean item response time; CRT = mean item correct response time; SE = standard error of
calibrated item difficulty estimate.
Figure 6.17. Distribution of person fit statistics as a function of estimated person ability.
[Plot: x-axis = persons ordered by estimated ability; y-axis = mean square fit value (0.0–4.5);
OUTFIT and INFIT series, with the 1.00 ± 0.30 fit region marked.]
6.7.3.1 Relational complexity and item difficulty.
As for the knight-knave task and the sentence comprehension task, it is of interest to
explore the relationship between item difficulty and the classification of complexity.
The advantage of the LST is that we have far more control over some of the additional
factors that influence performance. With the current item pool, we have orthogonally
manipulated the number of relational processing steps as defined in Chapter 4, and the
relational complexity analysis. There are 12 items at each level of complexity, half of
which entail a single relational process and half of which have an extra relational
process of equal or lower complexity. To provide a graphical representation of the
relationship between the item’s calibrated difficulty and the complexity classification,
Figure 6.18 plots the items in order of difficulty. As we can see from this plot there is
considerable consistency within levels of relational complexity and within number of
processing steps. Another interesting characteristic that is apparent in this plot is that
two-step binary items tend to be more difficult (have higher logit values) than single-step ternary items. We will have more to say about this shortly.
Figure 6.18. Distribution of Latin Square items as a function of Rasch calibrated item difficulty
(with standard errors indicated). [Plot: items ordered by calibrated difficulty (logits, approximately
-4.00 to 3.00); item markers grouped by condition: Binary - 1 Step, Binary - 2 Step, Ternary - 1
Step, Ternary - 2 Step, Quaternary - 1 Step, Quaternary - 2 Step.]
6.7.3.2 An item-based regression analysis of item difficulty.
The inter-correlations between the calibrated item difficulty and response time as a
function of relational complexity and number of processing steps are summarised in
Table 6.18. A comment is necessary about generating a correlation between relational
complexity and other variables. Essentially, what we assume when we calculate the
Pearson product-moment correlation between RC and other variables is that the
relational complexity metric has appropriate scaling characteristics. To demonstrate that
this is an acceptable assumption in this case, we treated relational complexity as a
simple categorical variable and calculated a series of dummy variables and regressed
these onto the performance measures. The R2 values were significant in all cases –
Rasch calibrated difficulty (R2 = .80, F(2,33) = 28.96, p < .001), item response time (R2
= .76, F(2,33) = 22.14, p < .001), and correct item response time (R2 = .71, F(2,33) =
16.83, p < .001). We are therefore confident that the three levels of relational
complexity have some meaningful quantitative structure in the Latin Square task.
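A sketch of this dummy-coding check is given below, using the calibrated difficulties of a subset of items from Table 6.17 purely for illustration (the analysis reported above used all 36 items); treating the complexity classification as a categorical factor generates the dummy variables automatically.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Calibrated difficulties (logits) for a subset of LST items from Table 6.17,
# labelled by their relational complexity classification.
lst = pd.DataFrame({
    "logit": [-2.30, -1.65, 0.30, 0.14,     # binary items 01, 02, 03, 23
              -1.10, -0.36, 0.31, 0.42,     # ternary items 07, 08, 09, 32
               1.28, 1.23, 2.11, 1.71],     # quaternary items 15, 17, 14, 41
    "rc": ["binary"] * 4 + ["ternary"] * 4 + ["quaternary"] * 4,
})

# Regressing difficulty on the dummy-coded complexity factor tests whether the
# three levels carry meaningful quantitative structure without assuming an
# interval scale for the relational complexity metric.
model = smf.ols("logit ~ C(rc)", data=lst).fit()
print(model.rsquared, model.f_pvalue)
```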
Table 6.18
Item based zero-order correlations between LST measures and relational complexity and
number of processing steps

                                            RC       RCsteps   logit    RT       RTSD     CRT
Relational complexity (RC)
Number of processing steps (RCsteps)        0.00
Rasch calibrated item difficulty (logit)    0.79**   0.50**
Response time (RT)                          0.75**   0.43**    0.83**
Standard deviation in RT (RTSD)             0.74**   0.37*     0.79**   0.96**
Correct RT (CRT)                            0.70**   0.42*     0.78**   0.99**   0.94**
Standard deviation in CRT (CRTSD)           0.73**   0.33      0.74**   0.96**   0.98**   0.95**

N = 36 items; * p < .05; ** p < .01
As can be seen from Table 6.18, both relational complexity and number of steps are
significant zero-order predictors of item performance. More complex items tend to be
more difficult, r(34) = .79, p < .001, and are associated with longer response times
regardless of whether accuracy was taken into consideration (CRT: r(34) = .70, p <
.001) or not (RT: r(34) = .75, p < .001). The number of processing steps was also a
significant zero-order predictor of item performance. Interestingly, both RC and
RCsteps also accounted for a substantial degree of the variability in response time. The
correlations suggest that more complex items tended to be associated with higher
standard deviations in item response time than less complex items, and response times
to two-step problems tended to vary more than response times to one-step problems. To
explore the relationship between RC and
RCsteps, a standard multiple regression analysis could be conducted. However, given
the manipulations of RC and RCsteps are orthogonal, unless an interaction term is
included, the regression analyses would not provide any additional information beyond
what is available in Table 6.1854. A more efficient and simpler way of exploring an
interactive effect in this case is to use a simple analysis of variance. A relational
complexity (2D, 3D, 4D) × number of processing steps (1-step, 2-step) univariate
analysis of variance was therefore conducted for each of the difficulty and response
time measures.
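The same kind of item-level table can be fed to a relational complexity × number-of-steps factorial ANOVA; the sketch below (illustrative only, with a reduced set of items from Table 6.17) shows how such an analysis could be run, including the interaction term of interest.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

# Calibrated difficulties (logits) for a reduced 3 x 2 design (two items per cell),
# taken from Table 6.17 for illustration.
lst = pd.DataFrame({
    "logit": [-2.30, -1.65, 0.30, 0.14,
              -1.10, -0.36, 0.31, 0.42,
               1.28, 1.23, 2.11, 1.71],
    "rc":    ["2D", "2D", "2D", "2D", "3D", "3D", "3D", "3D", "4D", "4D", "4D", "4D"],
    "steps": ["1-step", "1-step", "2-step", "2-step"] * 3,
})

# Univariate ANOVA with main effects for complexity and steps plus their interaction.
model = smf.ols("logit ~ C(rc) * C(steps)", data=lst).fit()
print(anova_lm(model, typ=2))
```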
6.7.3.3 Interactive effects: RC × RCsteps.
Accuracy: Consistent with expectations from the correlational analyses, the results
indicate a significant main effect for complexity, F(2, 30) = 134.06, MSe = .173, p <
.001, and number of steps, F(1, 30) = 106.69, MSe = .173, p < .001. The test of interest
here is the interaction and this was significant, F(2, 30) = 8.04, MSe = .173, p = .002
and therefore qualifies the interpretation of the RCsteps and RC effects. The plot of the
interaction in Figure 6.19 suggests that the source of the effect has to do with the two-step binary items. We noted a similar trend previously in the plot of the individual items
in Figure 6.18 – two-step binary items appeared to be more difficult than the one-step
ternary items. Separate pair-wise analyses indicated that this difference was reliable,
t(10) = 7.22, p < .001. To explore the interaction further, a series of simple-effects
analyses were conducted.
54 When the correlation between predictors is zero, R2 from a standard regression analysis will be equal to
the sum of the r2 for the predictors, and the respective β values will be equal to the zero-order correlation
coefficients.
Simple-effects of RC: The simple-effect of relational complexity was significant for the
one-step problems, F(2, 30) = 99.62, p < .001. Simple comparisons following up this
effect indicated that one-step binary items were easier than one-step ternary items, F(1,
30) = 36.66, p < .001, and that one-step ternary items were easier than one-step
quaternary items, F(1, 30) = 64.24, p < .001. The simple-effects of relational
complexity for 2-step problems was also significant, F(2, 30) = 42.43, p < .001. Simple
comparisons indicated that although the trend was in the expected direction, there was
only a marginal difference in the calibrated item difficulty between the two-step binary
items and the marginally harder two-step ternary items, F(1, 30) = 3.09, p = .089. Two-step quaternary items were significantly more difficult than two-step ternary items, F(1,
30) = 48.40, p < .001. Together the main-effects and interaction accounted for 92.9% of
the variation in item difficulty (SSexplained / SStotal).
Figure 6.19. Mean calibrated item difficulty as a function of relational complexity and number of
processing steps (error bars = 1 SD). [Plot: x-axis = binary, ternary, quaternary; y-axis =
calibrated item difficulty (logits); separate lines for 1-step and 2-step problems.]
Response Time: The correlation between item difficulty and the standard deviation in
item response time reported in Table 6.18 indicates that variability in response time
increases as the difficulty of the item increases. Given that item difficulty is so closely
related to relational complexity and number of processing steps, we are not surprised to
see significant heterogeneity of variance in response times as a function of relational
complexity and number of processing steps, F(5, 30) = 3.84, p = .008. With this in
mind, the ANOVA on response time (regardless of accuracy) indicated a significant
main effect for RC, F(2, 30) = 76.38, p < .001, and RCsteps, F(1, 30) = 49.81, p < .001.
This relationship was also qualified by a significant interaction, F(2, 30) = 17.03, p <
.001. The plot of this interaction is provided in Figure 6.20. Together the two factors
and the interaction accounted for 88.7% of the variability in response times
(SSexplained/SStotal).
Figure 6.20 also plots the times for correct responses (CRT). The omnibus tests for
correct response time together accounted for 86.5% of the variability (SSexplained/SStotal).
The effects followed the same trend as the overall response time and are therefore not
reported here. There are some minor differences in the simple-effects analyses when
CRT is used, so these effects are considered below.
Simple-Effects of RC: The effects for RC were investigated for the one- and two-step
problems separately – each was significant, F(2, 30) = 19.08, p < .001 and F(2, 30) =
74.32, p < .001, respectively. For the one-step problems, time to respond to binary items
was significantly quicker than to ternary items, F(1, 30) = 22.87, p < .001,
however there was generally no difference in the time to respond to ternary and
quaternary items, F < 1. For two-step problems, response to two-step binary items was
90.00
Response Time (sec)
80.00
70.00
60.00
1-step (RT)
50.00
2-step (RT)
40.00
1-step (CRT)
30.00
2-step (CRT)
20.00
10.00
0.00
binary
ternary
quaternary
Figure 6.20. Mean item response time regardless of accuracy (RT) and for correct
response only (CRT) as a function of relational complexity and number of processing
steps (error bars = 1 SD)
250
significantly quicker than to two-step ternary items, F(1, 30) = 7.22, p = .012, and
response to two-step ternary items was significantly quicker than for two-step
quaternary items, F(1, 30) = 80.20, p < .001. When only correct responses were considered in the aggregation of mean item response time (i.e., Correct RT), the same pattern of results was observed as reported above. The only change was that the
difference between two-step binary and two-step ternary problems was now only
marginal, F(1, 30) = 3.62, p = .067.
Simple-Effects of RCsteps: It is also useful to consider the alternative simple-effects
analyses of the interaction even though there is some redundancy in what has already been presented; that is, the difference between one- and two-step problems at each level of relational complexity. For overall response time, one-step binary problems were solved more quickly than two-step binary problems, F(1, 30) = 7.94, p = .008, and one-step quaternary problems were solved more quickly than two-step quaternary problems, F(1, 30) = 75.39, p < .001. There was no difference between the one- and two-step response times for ternary items, F < 1. When only correct responses were considered,
the difference between one- and two-step binary problems was no longer significant,
F(1, 30) = 2.65, p = .114. The other CRT effects were consistent with the overall RT
analysis – there was no difference between one- and two-step ternary problems, F < 1,
but there was a significant difference between the one- and two-step quaternary
problems, F(1, 30) = 76.99, p < .001.
6.7.3.4 Summary of item analyses.
These results together suggest to us that relational complexity is a substantial source of item difficulty in the Latin Square Task and that the number of processing steps tends to moderate this effect slightly. When items entail only one processing step, there is a
significant difference between all levels of complexity. The addition of another
processing step of equal or lower complexity reliably increased the difficulty of the
items. This manipulation tended to raise the difficulty of binary items relatively more
than the ternary and quaternary items (Figure 6.19). When response times are
considered, the pattern of results is a little more complex. There is a clear tendency for
response times to increase as the complexity of the item increases. An additional
relational processing step also tended to increase the response times, as might be expected; however, this increase from one to two processing steps seemed more pronounced for the quaternary-level items.
We started this section by stating that we would provide evidence to establish the
appropriateness of calculating composite relational complexity scores in Chapter 5
(and here). The Rasch analyses provide support for stable measurement, although we
are yet to speculate on just what aspect of reasoning ability we have measured. We will
leave this to the next chapter. The influence of relational complexity and number of
processing steps was highly significant – the two variables together accounted for 87%
of the variability in calibrated item difficulty and when the interaction was considered,
this increased to 92.9%. A similar degree of predictability is obtained from the analysis
of response times. Relational complexity and number of processing steps (and the
interaction between them) accounted for 88.7% of the variability in overall item
response times, and 86.5% of the variability in correct item response times.
We believe this is sufficient evidence to suggest that we have a clean manipulation of
relational complexity that is stable and reliable both in theory and in practice. For
completeness, the last analysis that we wish to present here is a person-based analysis of complexity and processing steps using composite scores. This is particularly
important for the analysis of response times since, as we have mentioned previously
(see Section 6.7.2.3), analysis of aggregated response time measures obscures the
influence of individual differences. The item-based correlations between relational
complexity and the standard deviations in response time reported in Table 6.18
indicate that this needs to be considered further at the person level.
6.7.3.5 Composite measures in the Latin Square Task.
A composite-score was derived for each level of relational complexity and for each
level of processing steps – a total of 6 variables per subject. Accuracy was assessed by
determining the proportion of items (out of 6) in each of the six categories that were
answered correctly (3 levels of RC × 2 levels of RCsteps). Composite response time
measures were derived by averaging item response times across the 6 items within each
of the categories regardless of accuracy; and separately by averaging the response times
for only those items within a category answered correctly. Table 6.19 summarises the
descriptive statistics for these composite scores.
Table 6.19
Descriptive Statistics (standard deviations in parentheses) for composite relational complexity measures

                          %Correct       RT              CRT             N
Binary       1-step       0.97 (0.11)    8.04  (2.55)    8.07  (2.64)    180
             2-step       0.82 (0.21)    15.86 (7.56)    15.02 (6.73)    179
Ternary      1-step       0.90 (0.15)    21.34 (12.83)   21.04 (12.57)   180
             2-step       0.76 (0.22)    23.31 (10.59)   22.81 (11.03)   178
Quaternary   1-step       0.63 (0.32)    24.05 (16.19)   24.67 (19.02)   166
             2-step       0.45 (0.27)    48.23 (31.32)   55.21 (63.97)   167

%Correct = proportion of items answered correctly; RT = response time (sec); CRT = correct response time (sec); N = number of subjects for which there was at least one observation on which to base the composite score. List-wise deletion provided 153 complete cases for the analysis of CRT.
6.7.3.6 Analysis of composite scores.
A relational complexity (2D, 3D, 4D) × processing steps (1-step, 2-step) repeated-measures ANOVA was conducted on each of the measures (see Footnote 55).
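As a rough illustration only (not the original SPSS procedure), the person-level composites described above and the 3 × 2 repeated-measures ANOVA could be computed as follows. The file and column names (subject, rc, steps, correct, rt) are assumptions.

import pandas as pd
from statsmodels.stats.anova import AnovaRM

responses = pd.read_csv("lst_responses.csv")  # hypothetical long-format item data

# Composite scores: proportion correct and mean response time per subject
# within each of the six rc x steps cells.
composites = (responses
              .groupby(["subject", "rc", "steps"])
              .agg(p_correct=("correct", "mean"), mean_rt=("rt", "mean"))
              .reset_index())

# Repeated-measures ANOVA on the accuracy composites; subjects with missing
# cells would need to be dropped first, as AnovaRM requires a balanced design.
aov = AnovaRM(composites, depvar="p_correct", subject="subject",
              within=["rc", "steps"]).fit()
print(aov)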
Accuracy: Analysis of the proportion correct score as a measure of accuracy indicated a
significant main-effect for relational complexity, F(2, 358) = 372.56, MSe = .035, p <
.001. Overall, the binary composite of items were significantly easier than the ternary
composite, F(1, 179) = 35.07, p < .001; and the ternary composite was significantly
easier than the quaternary composite, F(1,179) = 409.59, p < .001. The main effect for
number of processing steps was also significant. Accuracy was significantly better on
one-step problems than two-step problems, F(1, 179) = 188.72, MSe = .035, p < .001.
Footnote 55: Unless otherwise indicated, corrections for violations of the sphericity assumption did not change the nature of the interpretation and therefore unadjusted statistics are reported.
The interaction as plotted in Figure 6.21 was not significant, F(2, 358) = 1.27, MSe =
.034, p = .281. This is interesting since in the analysis of individual items based on the
Rasch estimated item difficulties, there was a significant interaction (as a function of the
two-step binary items). When a composite score is generated based on proportion of
items answered correctly for each individual, this effect tends to be minimised. The
most plausible reason for this discrepancy is that the proportion correct scores are
subject to a ceiling effect that is minimised in the transformation to the Rasch estimated
logit scale. We discussed this feature of the Rasch model in considerable detail in
Chapter 4.
[Figure 6.21 appears here: mean proportion correct against relational complexity (binary, ternary, quaternary), with separate lines for 1-step and 2-step items.]
Figure 6.21. Mean proportion correct as a function of relational complexity
and number of processing steps
Alternative explanations are possible. It might be the case that some outlying binary (or
ternary) items were responsible for the interaction in the item-based analysis, and that
their effect was minimised when composite scores are generated since each item is
given equal weighting. However, inspection of the plot of item difficulty scores in Figure 6.18 indicates that there is substantial homogeneity within levels of relational complexity and number of processing steps. Certainly the highly significant proportion of item variability that was accounted for in the analyses of these item-based statistics corroborates this impression. We contend that this finding again highlights the utility of
the Rasch approach as an appropriate methodology in this area. It seems particularly
well suited to our research problem where we present very easy (binary) items to a
university population in the hope of differentiating performance on slightly more
difficult ternary items that in theory are well within the capabilities of our sample.
Response Time: The composite response time measure (RT) was based on mean time to
respond to items of a given level of complexity and number of processing steps. The
time to respond correctly to each of the items within each level of complexity and
number of processing steps was also averaged to derive a correct response time
composite (CRT). The relational complexity × number of processing steps repeated
measures ANOVA using the overall RT as the dependent variable produced results
consistent with the item-based analyses reported above. There was a significant main-effect for complexity, F(2, 358) = 189.90, MSe = 279.26, p < .001; and a significant main-effect for number of processing steps, F(1, 179) = 218.09, MSe = 158.82, p < .001. The interaction was also significant (Figure 6.22). The interpretation was essentially identical to that reported for the item-based analyses and is therefore not repeated in detail here. The only significant change to the interpretation is that the difference between one-step ternary and two-step ternary, which was not significant in the item-based analyses, is significant when composite scores are used in a person-
based analysis, F(1, 179) = 5.49, p = .02 (the mean response time for two-step problems was longer than for the one-step composite of problems).

[Figure 6.22 appears here: composite response time (sec) against relational complexity (binary, ternary, quaternary), plotted separately for 1-step and 2-step composites, for RT and CRT.]

Figure 6.22. Composite response time (RT) and correct response time (CRT) as a function of relational complexity and number of processing steps
Correct Response Time: A number of subjects were excluded from the analysis of
correct response times for failing to get at least one item correct at a given level of complexity and number of processing steps. The analyses of correct response times are therefore
based on only 153 subjects. Once again, the interpretations of these analyses mirror the
item-based analyses and therefore are not reported in detail. The only change was that
the difference between two-step binary and two-step ternary problems that was marginal
in the item-based analyses (F(1, 30) = 3.62, p = .067) was now highly significant in the person-based analyses, F(1, 179) = 105.82, p < .001. Also, the difference between the
one-step ternary composite and the two-step ternary composite was significant, F(1,
179) = 6.92, p = .009, whereas it was not significant in the item-based analyses. These
differences do not weaken our previous interpretations and if anything strengthen our
argument that relational complexity and number of processing steps are substantial
predictors of performance in the Latin Square task.
6.7.4 Summary
In summary the evidence that we have presented for the Latin Square task suggests that
the measurement properties are sound and therefore provide a strong basis from which
to assess the complexity-Gf predictions.
6.8 Summary of the Measurement Properties of the Tasks
6.8.1 Psychometric Tasks
The selection of psychometric tasks has generally performed well in this new
environment and under different administration procedures. There are a variety of
different minor problems with each of the psychometric tasks but in general our
analyses suggest that we have a battery of tests that will be satisfactory for testing our
hypotheses.
6.8.2 Relational Complexity Tasks
Sentence Comprehension: There are also some mixed results from the relational
complexity tasks. The sentence comprehension task was included essentially as a
reference test from the growing collection of tasks in which the influence of relational
complexity has been investigated. Our statistical analyses suggested that the influence
of probe question type might weaken our identification of a complexity-Gf effect in this
task so we have isolated this as a factor in our measures (i.e., use of items entailing a
noun response only). We have also included two new tasks that have not been submitted to a relational complexity analysis prior to this work.
Knight-Knave Task: The knight-knave task is a high-level reasoning task that has been
investigated in some detail independently by two groups of researchers from somewhat
different philosophical perspectives. Our findings related to this task suggest that
relational complexity is a significant predictor of task performance but that other factors
also influence accuracy and response times. We considered some of these factors briefly
and concluded that a composite score based on relational complexity in the knight-knave task might not provide a strong test of the relational complexity theory. The
evidence that we do have for the knight-knave task suggests that the relational complexity theory is capable of quantifying the relational demands of the reasoning entailed, at least in the set of problems we have investigated so far. The extent to which performance is pure and not confounded by some other factor is yet to be determined. We
believe the relational complexity analysis of the knight-knave is internally consistent
with our current state of understanding of the task. The external validity of the RC
analysis is yet to be established. The knight-knave task clearly involves a number of
known and idiosyncratic strategies. Sweller’s (1998) concerns that the utility of the
relational complexity theory is tied to a sufficient understanding of the processes
involved are significant. We have argued that the utility of the RC analysis is based on
the appropriateness of the process model of the strategies that subjects use to segment
and chunk their way through to solution. And this is clearly the case with the knight-knave task. Our knowledge in this task is advancing and a more detailed understanding
of the processes can only strengthen the base on which the complexity analyses are
conducted. The reply of Halford et al. (1998b) to Sweller’s concerns therefore is equally
significant. Having a sufficient process analysis of a task might constrain the
application of the theory but it does not threaten its validity. We will consider these
issues in more detail in the next chapter.
Latin Square Task: Finally, the Latin square task was developed especially for this
project to test the RC theory. The task has demonstrated an empirically robust
manipulation of complexity and number of processing steps. The test of the influence of
these manipulations on performance has confirmed that the task is an internally
consistent and tight psychometric measure. An exploration of the cognitive correlates of
this task in relation to the Gf-Gc theory will help establish just what type of reasoning
ability is being tapped.
6.8.3 Chapter 7
The following chapter is a continuation of the analyses of the data from this study. The
focus will be on exploring the complexity-Gf effect in the three relational complexity
tasks. We are also interested in relational reasoning in general and the extent that the
three tasks might tap such a construct will also be investigated (class equivalence). The
methodological approach that we take in Chapter 7 is clearly based on the individual
differences paradigm. We conclude by attempting to integrate the understanding of
relational reasoning provided by both the experimental and individual differences
approaches.
CHAPTER SEVEN
RELATIONAL COMPLEXITY AND PSYCHOMETRIC COMPLEXITY
7 Introduction
This chapter will be concerned with the way the relational complexity tasks are related
to each other and related to measures of broad cognitive abilities. The overall objective
is to explore the instantiation of relational complexity theory in the current data set and
to test some of its major tenets. There are three key predictions and objectives that we
are interested in:
Class equivalence: The extent to which tasks of the same relational
classification are related to each other
Complexity-Gf effect: The extent to which increasing task complexity is associated with a monotonic increase in the tasks' loadings on Gf
Gc/SAR effects: The extent to which manipulations of relational complexity are independent of educational experiences and acculturation (Gc) and of short-term memory ability (SAR)
7.1 Class Equivalence of Relational Complexity
The theory of relational complexity makes particular predictions about the relationship between process levels of the same complexity from tasks of different domains. This has been partially tested with children using tasks such as transitive inference, class inclusion, and the balance beam, and the results have tended to confirm the presence of what Andrews and Halford (in press) refer to as class equivalence. As we indicated in
Chapter 2, a similar prediction might be made with an adult population. One approach
to explore this assumption would be to consider the Multitrait-multimethod approach
that was developed by Campbell and Fiske (1959) to assess construct validity. The
rationale of this approach is essentially that different measures of the same construct
should be more highly correlated with each other than different measures of different
constructs. The correlations between measures of different constructs using the same
method of assessment are expected to fall somewhere between these ideals to the extent
that the assessment method itself can account for a significant proportion of the
common variance.
Table 7.1
Inter-correlations between relational complexity measures of accuracy (lower triangle) and
sample size (upper triangle)
[Table 7.1 matrix appears here: inter-correlations between the accuracy composites L2D, L3D, L4D, K3D, K4D, S2D, S3D, S4D, S5D, S2Da, S3Da, S4Da, and S5Da (lower triangle) and the corresponding pairwise sample sizes, ranging from 149 to 180 (upper triangle).]
L = latin square task; K = knight-knave task; S = sentence comprehension task; 2D, 3D, 4D, 5D
= binary, ternary, quaternary, and quinary composite, respectively; a sentence comprehension
composite based on noun responses only (see Chapter 6); Values in bold indicate correlations
between measures of the same level of RC
Tables 7.1 and 7.2 list the inter-correlations between the composite scores from the
accuracy and response time measures of the relational complexity tasks, respectively.
Although we have only a partially completed multitrait-multimethod matrix (since we
do not have measures of all levels of relational complexity for all tasks), we can
consider the correlations we do have as a preliminary indicator of the breadth of class
equivalence in this set of tasks.
Accuracy: Investigation of the correlations in Table 7.1 between levels of complexity within each task indicates that the Latin Square task and the Sentence Comprehension task corroborate the impression of internal consistency from the item analyses in
Chapter 6. As reported previously, the correlation between the ternary and quaternary
composite measures of accuracy in the knight-knave task was not significant. There are
16 correlations of interest in assessing class equivalence of relational complexity and
these are indicated in bold in Table 7.1. The only correlations that are not significant in
this group of 16 are those associated with the quaternary level of the knight-knave task.
Accuracy measures of equivalent levels of relational complexity in the Latin Square
Task and the Sentence Comprehension Task were correlated. These correlations are not
large, and certainly no larger than the correlations between the tasks at different levels
of complexity. Hence, while there is weak evidence to suggest that the claim of class
equivalence is worth investigating further, it is likely that the correlations reflect
individual differences in general cognitive abilities (i.e., positive manifold).
This relationship was similar regardless of whether the composite sentence
comprehension measure included all probe question-types (S2D, S3D, S4D, S5D), or only
the noun responses (S2Da, S3Da, S4Da, S5Da). Finally, it is interesting to note that for
accuracy at least, sentence composites based on all question-types for a given level of
complexity are highly correlated with sentence composites based on only noun
question-types, of the same level of complexity. This would be expected given the two
measures include responses to a similar set of items. For instance, the items that make
up the S3D composite, include all the items that make up the S3Da composite in addition
to the items entailing a verb response (note the S2D composite is identical to S2Da since
all 2D items entail a noun response).
Response Time: The trend for class equivalence is even less convincing when the
composite measures of response time are considered (Table 7.2). There is evidence for
internal consistency in the Latin Square Task and the Sentence Comprehension Task
between levels of complexity, although the correlations are more modest than those for
the accuracy measure. As we have reported previously, the correlation between the
composite response time measures for the Knight-Knave Task is significant. That is,
while there is no relationship between accuracy on ternary and quaternary knight-knave
items, response time as a measure of overall processing demand suggests that
individuals that tend to take longer to make a response on the ternary knight-knave
items also tend to take longer to respond to the quaternary items.
Table 7.2
Inter-correlations between relational complexity measures of response time (lower triangle) and
sample size (upper triangle)
[Table 7.2 matrix appears here: inter-correlations between the response time composites L2D, L3D, L4D, K3D, K4D, S2D, S3D, S4D, S5D, S2Da, S3Da, S4Da, and S5Da (lower triangle) and the corresponding pairwise sample sizes, ranging from 149 to 180 (upper triangle).]
L = latin square task; K = knight-knave task; S = sentence comprehension task; 2D, 3D, 4D, 5D
= binary, ternary, quaternary, and quinary composite, respectively; a sentence comprehension
composite based on noun responses only (see chapter 6); Values in bold indicate correlations
between measures of the same level of RC
The only inter-task correlation between equivalent levels of complexity that is
significant is that between the binary levels of the Latin Square Task and the Sentence
Comprehension Task (both with and without the verb response items). Until now we
have followed a traditional interpretation of the influence of relational complexity on
task performance – increasing complexity results in a performance decrement – more
errors and longer response times. This has been observed in the empirical data within
each task that we presented in Chapter 6. The relationship presented in Table 7.2
between response times for different tasks as a function of relational complexity is
generally very low. This suggests that the assessment of class equivalence in response
time as a function of complexity is not straightforward. When composite response time
scores are based on only correctly answered items and the analyses repeated, the same
pattern of results is observed (although, as we reported in Chapter 6, the relationship
between ternary and quaternary knight-knave items is no longer significant). We will
have more to say about the role of response time and relational complexity in Section
7.5.5. For now we will flag this as an issue for further investigation.
7.1.1 Summary of Class Equivalence of Relational Complexity
In summary, the evidence for class equivalence of relational complexity across the Latin
Square Task and Sentence Comprehension Task as demonstrated by Andrews and
Halford (in press) in children is weak in the tasks administered to adults. The evidence
for consistent and stable measurement in the knight-knave task was again weak. To
some extent this might be expected given our concerns about deriving composite scores
for this task. We have argued in Chapters 3 and 6 that while the influence of relational
complexity in the knight-knave task is apparent in our data, performance on the task is
also likely to be influenced by at least one known factor (serial processing) and possibly
a number of others (strategy, indeterminacy, negation). We do not have a sufficient breadth of items in the current study to be able to separate methodologically the influence of these factors, although we have preliminary evidence for one or two items
(see Chapter 6).
The theory of relational complexity would predict that university students should be
capable of processing quaternary relations, and performance on quaternary levels of the
Latin Square task and the sentence comprehension tasks suggests that psychometric
difficulty per se is not the major cause of the measurement problems observed in the
knight-knave task. Byrne and Handley’s (1997; Byrne & Handley, 1993) research on
reasoning in this task suggest that subjects are able to short-cut processing demand as
solution strategies develop and mature (within the scope of attempting to solve several
items within a single sitting). Individuals differ in the acquisition and use of these
general control strategies (in rate and type) as a function of processing capacity and
experience, and these strategies appear to differ in difficulty as a function of relational
complexity (i.e., comparison of Items 3.4 and 4.6, Chapter 6).
By definition, ternary items entail the resolution of a certain degree of ambiguity
through building a cognitive representation that integrates three arguments. As the task
becomes more complex ambiguity and indeterminacy between elements of a problem
play a larger role in performance (as a function of demanding more processing
resources). Again this effectively defines what it means to increase relational
complexity. The differentiation that we need to make then is not so much the distinction
between the characteristics that define the levels of complexity per se; this can be done
almost algorithmically through the careful application of MARC. The empirical
problem is to determine how other factors influence our measurement of the effect of
complexity. A subject with lower processing capacity might appear to perform beyond
their capacity for all the wrong reasons. Similarly, a person with ample processing
capacity might fail for trivial reasons. An often understated and therefore frequently
untested assumption of cognitive assessment is that subjects who have the ability to
solve a problem do so. Misclassification of individuals is a measurement problem in the
first instance, and then a threat to substantive theory if these problems cannot be
resolved. In the context of the knight-knave task, reasoning might be discontinued when
a plausible solution has been reached even though alternatives might exist. As such,
performance on the knight-knave task may not necessarily be predominantly a function
of processing capacity. We have presented preliminary response time evidence to
support this sort of speculation and the work of Schroyens, et al. (1999) on differences
in solution strategy in the knight-knave task is also consistent with this. We will return
to this topic in more detail in the general discussion.
7.2 The Complexity-Gf Relationship
7.2.1 Model of the Predictions
There are a variety of possible methods for testing the predictions associated with the
relational complexity theory. A convenient way of representing these predictions is by
using structural equation modelling (SEM) notation. A potential conceptualisation (see Footnote 56) of
the hypothesised relationships between the measured and latent variables in the current
selection of tasks is provided in Model A of Figure 7.1. This provides a foundation from
which to explore the tests of the predictions but there are a number of reasons why such
a model would be difficult to implement in practice using the SEM methodology. We
will explain the rationale behind this model first and then discuss the practical
limitations that constrain the assessment of it.
Footnote 56: The absence of symbols to represent the measurement error is to avoid clutter, not to imply that the variables have been measured without error.
Using the traditional SEM notation, the squares represent the measured variables for the
psychometric and relational complexity tasks. The ellipses represent the hypothesised
underlying latent variables. The best way to interpret this model is to start at the centre
and work our way out. The dark ellipses in the centre represent the four levels of
complexity as proposed by the relational complexity theory (relationships with the
quinary level are indicated in dotted lines since there is only one observed measure).
The arrows from these latent relational complexity variables heading towards the top of
the page represent the influence that these factors have on the variability in each of the
observed measures of complexity from the knight-knave, Latin square, and sentence
comprehension tasks. These observed measures are also influenced by task specific
abilities represented by the ellipses at the top of the figure. The latent psychometric
factors, Gf, Gc, and SAR are represented by ellipses toward the bottom of the page.
These factors have arrows that go to the measured psychometric variables (squares)
since they are hypothesised to influence performance in these tasks. For example, the Gf
factor has arrows going to the square representing the Ravens' progressive matrices composite (Rav) since fluid intelligence is expected to be the predominant factor determining performance in that task.

[Figure 7.1 appears here: path diagram with the measured variables for the knight-knave (K3, K4), Latin square (L2, L3, L4), and sentence comprehension (S2, S3, S4, S5) tasks and the psychometric markers (DSf, DSb, PA, Arith, Voc, Sim, Rav, Sw, Tr); latent task-specific factors (Knight-Knave, Latin Square, Sentence); latent relational complexity factors (Binary, Ternary, Quaternary, Quinary) connected by paths β1, β2, and β3; and psychometric factors SAR, Gc, and Gf, with paths a, b, c, and d from Gf to the complexity factors.]

Figure 7.1. Model A: Structural model of predictions made by the complexity-Gf relationship and relational complexity.
The component of the model enclosed in the dashed rectangle relates to the basis of the
complexity-Gf prediction that states that increasing relational complexity results in
higher task loadings on Gf. Accordingly, the loadings associated with arrows leading
from the Gf factor to the latent relational complexity factors labelled a, b, c, and d in
Figure 7.1 have a predicted order of magnitude; a < b < c < d. That is, the extent that Gf
contributes to variability in the relational complexity factors increases with level of
complexity (no prediction about the magnitude of these loadings is made).
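The thesis does not fit this full model (the practical constraints are discussed below), but as a rough illustration of how the core of Model A might be expressed in lavaan-style syntax, a sketch using the semopy package is given here. The package choice, file name, and variable names (following Figure 7.1) are assumptions, and the quinary level, task-specific factors, and Gc/SAR paths are omitted to keep the sketch small.

import pandas as pd
from semopy import Model

# Measurement part: each latent factor is defined (=~) by its observed markers.
# Structural part: the a, b, c paths from Gf to the complexity factors and the
# beta paths from each complexity level to the next.
model_a = (
    "Binary =~ L2 + S2\n"
    "Ternary =~ K3 + L3 + S3\n"
    "Quaternary =~ K4 + L4 + S4\n"
    "Gf =~ Rav + Sw + Tr\n"
    "Binary ~ Gf\n"
    "Ternary ~ Gf + Binary\n"
    "Quaternary ~ Gf + Ternary\n"
)

data = pd.read_csv("rc_and_psychometric_scores.csv")  # hypothetical wide data file
sem = Model(model_a)
sem.fit(data)
print(sem.inspect())  # compare the Gf -> complexity paths against the predicted a < b < c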
7.2.1.1 Conflicts and constraints.
At first glance Model A may seem like a reasonable, uncontroversial representation of
the state of the predictions. However, such a representation makes some implicit
assumptions that will influence our ability to test it. First of all, notice that the different
levels of relational complexity are modelled as separate latent factors (a separate ellipse
for binary, ternary, quaternary, and quinary processes). This is essentially to model the
concept of class equivalence that we have just discussed – that different tasks entailing
similar levels of relational complexity have some commonality as a function of this
structure regardless of domain. Representing levels of relational complexity separately
however also implies that levels of relational complexity are qualitatively distinct.
Strictly speaking, this does not fit with the specification of the relational complexity
theory. As we discussed in Chapter 5, the Halford et al. (1998a) formalism can be
interpreted to mean that relational complexity is related to a single construct (processing
resources) and therefore differences in relational complexity are seen as differences in
the quantitative requirements that the task makes on resources (or of the resources that
an individual possesses). To represent this specification would entail a single relational
complexity factor distinct from the psychometric factors that all the measures of the
relational complexity tasks would define (this is effectively the approach taken by
Andrews and Halford (in press) in the isolation of a single relational complexity factor
across a collection of developmental tasks). Representation of the complexity-Gf effect
would then be achieved by associating the observed variables directly with the Gf
factor. This would be consistent with the way cognitive complexity was modelled as a
function of fluid intelligence by Stankov and Raykov (1995; see also, Raykov &
Stankov, 1993). Model B in Figure 7.2 is a partial model of this type of
conceptualisation (with Gc and SAR removed for the convenience of clarity). The
complexity-Gf effect would then be tested by considering the cluster of loadings for a
given level of complexity with the same expectation as before in Model A; that is, set(a)
< set(b) < set(c) < set(d).
[Figure 7.2 appears here: path diagram in which the task measures (K3, K4; L2, L3, L4; S2, S3, S4, S5) load on the task-specific factors and on a single latent relational complexity factor (RC), with Gf (marked by Rav, Sw, and Tr) linked to the observed measures via sets of loadings a, b, c, and d.]
Figure 7.2. Model B: Revised model of predictions with a single latent factor for
relational complexity
There are however some problems with the Model B conceptualisation also. First, since
only one RC factor is postulated it is not clear how the class equivalence feature should
be represented. In Model A, it is possible to test the relationship between levels of
complexity. The direction of the arrows in Model A (Figure 7.1) going from the latent
binary factor to the latent ternary factor, from the ternary to the quaternary factor, and
from the quaternary to the quinary factor (labelled β1, β2, and β3) indicate that
variability at a given level of complexity is determined by variability at lower levels but
not higher levels. That is, capacity for ternary processing presupposes a capacity for
binary level processing, but not necessarily for quaternary processing, and so on. Like
the class equivalence feature, it is not clear how this might be represented in Model B.
Of course another conceptualisation might be proposed to deal with all these issues
together in a single model. It is likely that a more detailed specification of the
relationship between components of the theory would be necessary to achieve this. If it
were possible, the additional complexity of the model would require the estimation of
even more parameters and this would place a higher demand on the stability of the
measures and more pragmatically, on the required sample size. Even the models that we
have already proposed assume that we have sufficient control over the measurement process to have stable estimates of performance at all levels of complexity. This is
probably not the case in the knight-knave task. The alternative then is to test the
hypotheses we have in a more piecemeal fashion and deal with measurement issues as
they arise. Although this is not as appealing as a single SEM analysis, given the current
state of the theory and the measurements we have, this seems to be the most practical
way to advance.
We have already considered the evidence for class equivalence. We propose to explore
the complexity-Gf prediction separately for each relational complexity task in much the
same way that Model B proposes. We will estimate the latent psychometric factors
using a common factor analysis to generate factor scores. Before we consider our
methodology in more detail, we need to first consider the issue of missing data.
7.2.2 Treatment of Missing Data
Of the 191 subjects who participated in the study, 104 had complete data for all 12 tasks
and a further 53 subjects have data for at least two of the three marker tests for each of
the Gf, Gc, and SAR factors, and the relational complexity tests. Missing data is either a
result of the subject failing to turn up for the second testing session, and/or insufficient
time to complete all tasks (the data for an additional subject who voluntarily withdrew
from the study was not considered in any of the analyses). We are reluctant to simply
exclude all subjects who did not complete the 12 tasks, as this has the potential to
introduce a source of bias that may influence the findings of the correlational analysis
(Bollen, 1989). This is particularly the case if those subjects with missing data have a
different pattern of responses across the tasks that they did attempt (e.g., a tendency to respond more slowly) than subjects with complete data. Bollen (1989) suggests that multiple regression procedures can be used to estimate missing data by using test scores for subjects with complete data. Gorsuch (1983, pp. 302-303) also recommends this type of approach. He states that the regression procedure is superior to (a) replacement with the mean, (b) the elimination of the case with missing data, and (c) the use of a principal components procedure to estimate missing values. The principle that we use is consistent
with the regression approach and is implemented in SPSS-V9 as follows:
1. The regression analysis is only computed on those individuals who have scores
on all variables to be used in the analysis – in this case the psychometric
variables.
2. Regression weights are then applied to estimate missing scores.
3. An error term is added to each missing value that is estimated using the normal
deviate approach. That is, error terms are randomly drawn from a distribution
with the expected value 0 and the standard deviation equal to the square root of
the mean squared error term (sometimes called the root mean squared error, or
RMSE) of the regression (SPSS, 1999).
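A minimal sketch of this three-step procedure (not the SPSS-V9 implementation itself) might look as follows; the file name, variable names, and choice of predictors are assumptions.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
scores = pd.read_csv("psychometric_scores.csv")   # hypothetical wide data file
predictors = ["Rav", "Sw", "Tr", "Arith", "Voc", "Sim", "DSf", "DSb"]  # assumed names

def impute_with_error(df, target, predictors):
    # Step 1: fit the regression on cases complete for the target and predictors.
    complete = df.dropna(subset=[target] + predictors)
    model = LinearRegression().fit(complete[predictors], complete[target])
    rmse = np.sqrt(np.mean((complete[target] - model.predict(complete[predictors])) ** 2))
    # Step 2: apply the regression weights to estimate the missing scores.
    missing = df[target].isna() & df[predictors].notna().all(axis=1)
    predicted = model.predict(df.loc[missing, predictors])
    # Step 3: add a normal deviate with SD equal to the RMSE of the regression.
    df.loc[missing, target] = predicted + rng.normal(0, rmse, size=missing.sum())
    return df

scores = impute_with_error(scores, "PA", predictors)  # e.g. Paired Associate Recall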
We have only estimated missing data for the psychometric variables. The rationale for
this is that each psychometric factor (Gf, Gc, and SAR) is based on three marker
variables that have been selected based on previous empirical evidence. We are not as
confident to estimate missing values on the relational complexity tasks. Although we
have presented empirical evidence in the current data to suggest reasoning in the Latin
Square task and the Sentence Comprehension task are related, their relationship with the
knight-knave task is weak. As such we do not wish to contaminate our findings by
estimating values for a relational complexity test from one or more of the other RC
tasks.
7.2.3 Generating Broad Cognitive Abilities Factors
The decision on which measures to use to estimate the factor scores associated with Gf,
Gc, and SAR was based on the preliminary analyses in Chapter 6. Where possible, the
Rasch transformed scores were used. Rasch ability estimates based on all items from the Swaps Test were used given the apparent dominance of the solution heuristic that we discussed in Chapter 6 (i.e., using the mouse to locate the outcomes of intermediate processing in the response options). The Triplet Numbers task measure was a composite of the number of correct items answered per minute for levels 3 and 4. The Paired-Associate Recall task measure was the number of correctly recalled cue-target pairs
after the first presentation. This measure was used since it is typically considered more
consistent with the span type factors associated with the digit-span tasks (WMS-III
manual). The total score across the four presentations of the Paired Associate Recall
task has been associated with the slightly different “Associative Memory” first-order
factor (Carroll, 1993) and therefore was not used.
The correlation matrix between the variables reported in Table 7.3 was submitted to a common factor analysis using principal axis factoring extraction. An oblique rotation was used to allow for the possibility that the factors were correlated. A factor analysis was determined to be more appropriate than a principal components analysis since we are interested in considering the variance shared between the variables rather than the total variance (an analysis of the same data using the principal components procedure revealed an identical pattern of results, although slight changes in the magnitude of individual correlations were observed; Birney & Halford, 2001). Three factors were
extracted with eigenvalues greater than 1 and these accounted for 40.82% of the shared
variance. The extraction of three factors is consistent with the number of dominant
psychometric factors expected (Gf, Gc, and SAR).
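As an illustration only, a comparable extraction could be run outside SPSS along the following lines, assuming the availability of the factor_analyzer package; the file and variable names are assumptions.

import pandas as pd
from factor_analyzer import FactorAnalyzer

psy = pd.read_csv("psychometric_scores.csv")[
    ["Rav", "Tr", "Sw", "Arith", "Voc", "Sim", "DSf", "DSb", "PA"]
]

# Principal axis factoring with three factors and an oblique (oblimin) rotation.
fa = FactorAnalyzer(n_factors=3, method="principal", rotation="oblimin")
fa.fit(psy)

loadings = pd.DataFrame(fa.loadings_, index=psy.columns, columns=["F1", "F2", "F3"])
print(loadings)                      # pattern matrix, cf. Table 7.3
print(fa.get_communalities())        # estimated communalities (h2)
factor_scores = fa.transform(psy)    # regression-based scores used as Gf, Gc, SAR estimates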
Table 7.3
Inter-correlation between psychometric variables and pattern matrix of factor loadings
[Table 7.3 body appears here: means and standard deviations, inter-correlations (lower triangle), estimated communalities (h2), and pattern-matrix loadings on the Gf, Gc, and SAR factors for the nine psychometric variables (Progressive Matrices, Triplet Numbers Test, Swaps Test, Arithmetic Reasoning, Vocabulary, Similarities, Digit Span Forward, Digit Span Backward, Paired Associate Recall).]
N = 191 (with missing values estimated using regression); * p < .05; ** p < .01; h2 = estimated communality
Inspection of the estimated communalities (h2) in Table 7.3 indicates that the communality for the Similarities test is somewhat lower than the communalities for the other tests, and this suggests that it may be an outlier. The factor loadings of each variable are also presented in Table 7.3. The three markers of Gf all had significant loadings on the Gf factor and
the Swaps test also had a moderate loading on the Gc factor. Contrary to expectations,
the Arithmetic Reasoning task loaded more heavily on the Gf than the Gc factor. The
highest loadings for the Vocabulary and Similarities test were on the same factor but
these loadings were only moderate (approx. .30). Interestingly, the Paired Associate
Recall task also had a very high loading on what might be considered the “Gc factor”.
This indicates that it has contributed significantly to the definition of this factor.
Together, these results suggest that the Gc factor was generally poorly defined in the
current set of tasks. Given that the individual analyses of the Gc tasks in Chapter 6
suggested that the measurement properties were only moderate for these tasks, this is
not altogether an unexpected finding.
A memory factor was defined by the loadings of the digit span tasks. The Digit Span
Backward task also had a moderate loading on the Gf factor. This might be expected
given the necessity to develop a strategy to deal with the required digit manipulations
(Carroll, 1993). As we have just mentioned, the Paired Associate Recall task seemed
not to share any of the characteristics of the other memory tests and tended to define a
separate factor with two of the Gc variables. It might be the case that the modified administration procedure we used changed the characteristics of this task. It
is possible that the requirement to use the keyboard to type in responses produced a
long enough delay for the association between the cue and target pairs to start to decay.
We might therefore speculate that the ability to do well on the task might be a function
of some other general experiential factor such as keyboard or computing skills rather
than memory per se. In any case, the interpretation of the Gc factor as being dominated
by the traditional verbal abilities may not be appropriate here.
7.2.3.1 Summary.
Taken together, the factor analysis suggests that we have provided a satisfactorily broad
definition of Gf and SAR, and a weaker more general experiential Gc factor. Based on
this analysis, factor scores were generated to use as measures of these broad abilities.
The inter-correlations between the factors were all significant, rGf,SAR = .50; rGf,Gc = .38,
and rGc,SAR = .37.
7.2.4 Analyses Objectives
There are two primary objectives in the following analyses. First, we will consider the
complexity-Gf prediction in detail for each task. This will entail exploring whether the
influence of Gf on task performance changes as a function of the classified level of
relational complexity. Second, we will consider the influence of educational
experiences and memory on performance to explore to what extent the influence of these abilities is a function of the task's relational complexity. We begin with the Sentence Comprehension Task. The analysis of the knight-knave task is complicated by
the measurement problems we raised earlier in Chapter 6. We will consider the
composite scores we generated for this task, but we will also consider individual items
in an attempt to broaden our understanding of the correlates of performance in this very
interesting task. Finally, the Latin Square Task has demonstrated a high level of internal
and theoretical consistency. We therefore hope that this will facilitate a closer
exploration of the characteristics of relational reasoning in conjunction with broad
cognitive ability. We conclude by considering the complexity-Gf evidence associated with the levels of cognitive complexity in the Triplet Numbers Test and how the relational complexity manipulations compare.
7.3 Sentence Comprehension Task
7.3.1 Accuracy
The correlations between the composite measure of accuracy at each level of
complexity and the Gf-Gc variables are summarised in Table 7.4A. The composite
measures are based on all items (i.e., noun and verb responses) since in these analyses
there was no substantive difference in the relationship when verb responses were
omitted. The significance values (p-values) in Table 7.4A summarise the pair-wise
comparison of the differences between the magnitudes of the correlations using the
procedure outlined by Howell (2001) to test the difference between dependent
correlation coefficients (using a one-tailed test). That is, the difference in the magnitude
of the correlation between Gf and the 2D composite score (r = .26) and the correlation
between Gf and the 3D composite score (r = .34) is not significant. The p-value of this test reported in Table 7.4A is p = .20.
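One standard implementation of a test for the difference between two dependent correlations that share a variable is Williams' t, which Howell's text describes; whether this is the exact variant used here is an assumption. A small sketch:

import math

def williams_t(r_xy, r_xz, r_yz, n):
    # r_xy and r_xz share the variable x; r_yz is the correlation between the
    # two other variables; returns a t statistic with n - 3 degrees of freedom.
    det = 1 - r_xy**2 - r_xz**2 - r_yz**2 + 2 * r_xy * r_xz * r_yz
    r_bar = (r_xy + r_xz) / 2
    return (r_xy - r_xz) * math.sqrt(
        ((n - 1) * (1 + r_yz)) /
        (2 * ((n - 1) / (n - 3)) * det + r_bar**2 * (1 - r_yz)**3)
    )

# e.g. Gf with the 2D (r = .26) versus 3D (r = .34) accuracy composites; the
# correlation between the two composites (0.45 here) is an assumed value.
print(williams_t(0.26, 0.34, 0.45, 167))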
Table 7.4
Correlations between A) accuracy (Acc) and B) decision time (DT) on the sentence
comprehension task and broad cognitive abilities (Gf, Gc, SAR) with test of significance (p-values) between correlations

[Table 7.4 body appears here: correlations between the accuracy (A) and decision time (B) composites at each level of relational complexity (2D, 3D, 4D, 5D) and the Gf, Gc, and SAR factor scores, together with pairwise tests of the differences between the dependent correlations.]
* p < .05, ** p < .01; RC = relational complexity; 2D, 3D, 4D, 5D = binary, ternary, quaternary,
and quinary respectively, N = 167
A global test of the relationship between the broad cognitive abilities and accuracy as a
function of relational complexity can be investigated using a repeated-measures analysis
of covariance (ANCOVA). This was the procedure adopted by Stankov and Crawford
(1993) in their analysis of the influence of Gf and SAR on performance in the triplet
numbers test as a function of complexity level. The effect of interest is the interaction
between the within-subjects relational complexity variable and the psychometric
covariate. We have already explored the experimental effect of relational complexity in
each of the tasks separately in Chapter 6. Therefore, in the analyses of the main-effect
for complexity to follow we report the significance levels but do not re-interpret these,
as they are identical to those already reported.
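As a rough sketch of how this kind of interaction could be examined outside SPSS, a linear mixed model with a random intercept per subject is one approximation to the repeated-measures ANCOVA described above; the file and column names (subject, rc, accuracy, gf) are assumptions, not the original variable names.

import pandas as pd
import statsmodels.formula.api as smf

# One row per subject x relational complexity level, with the subject's Gf
# factor score repeated across rows.
long = pd.read_csv("sentence_comprehension_long.csv")  # hypothetical file

model = smf.mixedlm("accuracy ~ C(rc) * gf", data=long, groups=long["subject"]).fit()
print(model.summary())  # the C(rc):gf terms carry the complexity-Gf interaction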
Fluid Intelligence: The relational complexity (2D, 3D, 4D, 5D) × Gf (covariate)
repeated-measures ANCOVA indicated a significant main-effect for complexity, F(3,
495) = 95.60, p < .001. The main effect for the Gf covariate was significant, F(1, 165) =
68.00, p < .001 indicating that fluid ability influenced performance on the sentence
comprehension task overall. The effect of interest, the interaction between Gf and
relational complexity, was significant, F(3, 495) = 4.89, p = .002. Figure 7.3 plots this
interaction with low and high Gf dichotomised at ± one standard deviation around the
mean of Gf (M = 0; SD = .88).

[Figure 7.3 appears here: mean proportion correct against relational complexity classification (2D, 3D, 4D, 5D) for the LowGf and HighGf groups.]
Figure 7.3. Sentence comprehension accuracy as a function of relational
complexity and Gf
Recall that the complexity-Gf effect in practice proposes that as the complexity of a task
increases, we are better able to differentiate between individuals of high and low Gf. As
can be seen in Figure 7.3 this is what the data suggests. As the complexity increases, the
difference between high and low Gf individuals tends to increase. The monotonic nature
of this prediction equates to a test of linearity in the interaction. A test of this contrast
was significant, F(1, 165) = 11.81, p = .001. The quadratic and cubic contrasts were not significant, both Fs < 1.
Crystallized Intelligence: A similar repeated-measures ANCOVA was conducted on the
relationship between accuracy and Gc. The main-effect for complexity was significant,
F(3, 495) = 92.39, p < .001. The main-effect for the Gc covariate was also significant
indicating that a generalised Gc ability influenced performance in the Sentence
comprehension task, F(1, 165) = 33.88, p < .001. The test of the interaction between RC
and Gc was not significant, indicating that the influence of Gc on the accuracy of
sentence comprehension was constant regardless of level of complexity, F < 1. Figure
7.4 plots the relationship between Gc and relational complexity on accuracy in the
sentence comprehension task. Again Gc has been dichotomised into high and low
values at one standard deviation above and below the mean, respectively (Mean = 0; SD
= .78).

[Figure 7.4 appears here: mean proportion correct against relational complexity classification (2D, 3D, 4D, 5D) for the LowGc and HighGc groups.]
Figure 7.4. Sentence comprehension accuracy as a function of relational
complexity and Gc
Short-term Apprehension and Retrieval: The repeated-measures ANCOVA was
repeated with SAR as the covariate. As before, the main-effect for relational complexity
was significant, F(3, 495) = 93.34, p < .001. The main-effect for SAR was also
significant indicating individual differences in memory influenced performance on the
sentence comprehension task overall, F(1, 165) = 19.92, p < .001. The interaction term
was not significant indicating that the memory abilities are not implicated differentially
as a function of sentence complexity, F(3, 495) = 1.74, p = .157. Figure 7.5 plots the
relationship between SAR and relational complexity on accuracy in the sentence
comprehension task. SAR has been dichotomised into high and low values at one standard deviation above and below the mean, respectively (M = 0; SD = .84).

[Figure 7.5 appears here: mean proportion correct against relational complexity classification (2D, 3D, 4D, 5D) for the LowSAR and HighSAR groups.]
Figure 7.5. Sentence comprehension accuracy as a function of relational
complexity and SAR
7.3.2 Decision Time
The correlations between the composite measure of decision time at each level of
complexity and the Gf-Gc variables are summarised in Table 7.4B. The composite
measures are based on all items. As this table indicates, all three cognitive abilities (Gf,
Gc, and SAR) are significant predictors of decision time for items of lower relational
complexity, however the relationship tends to be weaker for the higher levels of
complexity. As for Table 7.4A, the p-values in Table 7.4B summarise the significance
of the difference between the correlations using the procedure outlined by Howell
(2001). As an example, the correlation between Gf and 2D decision time (r = -.25) is
not significantly different from the correlation between Gf and the average 3D decision
time (r = -.29). The p-value for this significance test as reported in Table 7.4B, is p =
.31. To test the overall trend across levels of complexity, a repeated-measures analysis
of covariance was conducted separately with each broad cognitive ability as the
covariate. The main-effect for relational complexity in the analyses is not interpreted
since this was covered in detail in Chapter 6. We simply report the significance test for
this factor.
Fluid Intelligence: The relational complexity (2D, 3D, 4D, 5D) × Gf (covariate)
ANCOVA on the composite decision time measure indicated a significant main-effect
for relational complexity, F(3, 495) = 55.00, p < .001. The main-effect for the Gf
covariate was also significant indicating that decision time was a function of fluid
abilities, higher Gf in general was associated with quicker decision times, F(1, 165) =
9.68, p = .002. The test of the complexity-Gf effect via the interaction between
relational complexity and Gf was not significant, F < 1. That is, although the individual
tests of the correlations reported in Table 7.4B tend to suggest that differences exist as a
function of complexity, the test of the general trend indicated that the influence of Gf on
decision time was constant regardless of relational complexity. Figure 7.6 plots the
relationship between Gf and relational complexity on decision time in the sentence
comprehension task. As for the assessment of accuracy, Gf has been dichotomised at ±
one standard deviation from the mean Gf score.

[Figure 7.6 appears here: decision time (sec) against relational complexity classification (2D, 3D, 4D, 5D) for the LowGf and HighGf groups.]
Figure 7.6. Decision time on the Sentence Comprehension Task as a function of
relational complexity and Gf
Crystallized Intelligence: A similar repeated-measures analysis of covariance was
conducted on the relationship between decision time and Gc as a function of relational
complexity. The main effect for relational complexity was significant, F(3, 495) =
54.95, p < .001. The main-effect for the Gc covariate was also significant indicating that
a generalised Gc ability influenced decision time in the Sentence Comprehension Task,
F(1, 165) = 10.53, p = .001. The test of the interaction between relational complexity
and Gc was not significant, indicating that the influence of Gc on decision time was constant regardless of level of complexity, F < 1. Figure 7.7 plots the relationship between Gc
and relational complexity on decision time in the sentence comprehension task. Again
Gc has been dichotomised into high and low values at one standard deviation above and
below the mean.

[Figure 7.7 appears here: decision time (sec) against relational complexity classification (2D, 3D, 4D, 5D) for the LowGc and HighGc groups.]
Figure 7.7. Decision time on the Sentence Comprehension Task as a function of
relational complexity and Gc
Short-term Apprehension and Retrieval: The repeated-measures ANCOVA with SAR
as the covariate indicated a significant main-effect for relational complexity, F(3, 495)
= 55.10, p < .001. The main-effect for SAR however was not significant indicating that
on average across levels of complexity, individual differences in memory did not
influence decision time on the sentence comprehension task, F(1, 165) = 2.43, p = .121.
The interaction between SAR and relational complexity was also not significant
suggesting that the influence of memory does not change as a function of complexity,
F(3, 495) = 1.04, p = .373. Figure 7.8 plots the relationship between SAR and relational complexity on decision time in the sentence comprehension task. SAR has been dichotomised at one standard deviation above and below the mean.
Figure 7.8. Decision time on the Sentence Comprehension Task as a function of relational complexity and SAR
7.3.3 Summary
Accuracy: The association between relational complexity and broad cognitive abilities
on the accuracy of sentence comprehension is consistent with our predictions. As complexity increased, there was evidence of an increase in the relationship between
accuracy and Gf. Although Gc and SAR abilities influenced accuracy on the sentence
comprehension task, this influence was not a function of relational complexity. This
general pattern of results suggests that relational reasoning in sentence comprehension
is not differentially a function of educational experiences or memory per se. The
complexity-Gf effect is present in the relational complexity manipulation in the sentence
comprehension task and therefore fits with our proposition that the relational complexity manipulation in this task is consistent with the traditional psychometric definition
(Stankov & Crawford, 1993; Stankov, 2000).
Decision Time: The relationships between decision time and broad cognitive abilities as
a function of relational complexity are somewhat more complex. Although individual
correlations suggest that each of the broad abilities has a role to play in sentences of
lower complexity, these effects are weak and not obvious in the tests of the overall
trends. Additional tests of the likelihood that a speed-accuracy trade-off might be
obscuring the complexity effect in response times are complicated by the fact that the subjects' composite scores are based on different items. However, as an indication, the correlations between the composite accuracy and decision-time measures were derived
for each level of complexity. None of these were significantly different from zero
(Speed-Accuracy correlations: r2D(165) = -.06, p = .48; r3D(165) = -.10, p = .22; r4D(165)
= -.09, p = .23; r5D(165) = -.04, p = .59). Taken together, we are cautious not to over
interpret the decision time results as support for the complexity-Gf trend.
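A minimal sketch of the per-level speed-accuracy check just described; the file and column names are hypothetical, and pairwise-complete cases are used because the accuracy and decision-time composites at each level are based on different items.

import pandas as pd
from scipy import stats

df = pd.read_csv("sentence_comprehension_composites.csv")  # hypothetical wide-format file

for level in ["2D", "3D", "4D", "5D"]:
    acc = df[f"acc_{level}"]
    dt = df[f"dt_{level}"]
    mask = acc.notna() & dt.notna()              # keep cases with both composites
    r, p = stats.pearsonr(acc[mask], dt[mask])   # speed-accuracy correlation at this level
    print(f"{level}: r = {r:.2f}, p = {p:.2f}")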
Table 7.5
Correlations between A) accuracy (Acc) and B) response time (RT) on the knight-knave task and broad cognitive abilities (Gf, Gc, SAR) with tests of significance (p-values) between correlations

A.
RC     rGf,Acc    p-values       rGc,Acc    p-values       rSAR,Acc    p-values
                  3D     4D                 3D     4D                  3D     4D
3D     0.34 **     -      -      0.29 **     -      -      0.22 **      -      -
4D     0.04       .00     -      0.06       .01     -      -0.01       .02     -
4D^    0.10       .01    .29     0.00       .00    .31     0.11        .17    .14

B.
RC     rGf,RT     p-values       rGc,RT     p-values       rSAR,RT     p-values
                  3D     4D                 3D     4D                  3D     4D
3D     0.23 **     -      -      0.14        -      -      0.15         -      -
4D     0.30 **    .19     -      0.23 **    .12     -      0.22 **     .18     -
4D^    0.18 *     .27    .05     0.13       .48    .10     0.17 *      .41    .23

* p < .05, ** p < .01; RC = relational complexity; 3D, 4D, 4D^ = ternary, quaternary, and indeterminate quaternary, respectively. The p-values test the difference between the correlation in that row and the corresponding 3D and 4D correlations (Howell, 2001).
7.4 Knight-Knave Task
Table 7.5 reports the correlations between the composite accuracy and response time
measures from the knight-knave task and the measures of broad cognitive ability (Gf,
Gc, SAR) as a function of relational complexity.
7.4.1 Accuracy
The analyses of the accuracy measures are as might be expected given the general
interpretation of the inter-task correlations presented in Section 7.1 and the within-task
analyses of Chapter 6. The correlations reported in Table 7.5A indicate that while the
three broad cognitive abilities predict performance on the ternary composite, they do
not predict accuracy on the quaternary items, whether the composite is based on
determinate items (4D) or indeterminate items (4D^).
Fluid Intelligence: A relational complexity (3D, 4D, 4D^) × Gf (covariate) repeated-
measures ANCOVA was conducted on the composite accuracy scores. This revealed as
expected from earlier analyses a main-effect for relational complexity, F(2, 318) =
143.65, p < .001. We will interpret this effect here since the relative performance on
indeterminate items has not been considered previously. Planned contrasts indicated
that ternary items had a higher composite accuracy score than determinate quaternary items, F(1, 159) = 198.04, p < .001, and determinate quaternary items had a higher composite accuracy score than indeterminate quaternary items, F(1, 159) = 118.95, p < .001. There was also a significant main-effect for the Gf covariate. Overall, higher levels of Gf resulted in higher accuracy, F(1, 159) = 9.77, p = .002. The test of the interaction was not significant, F(2, 318) = 2.09, p = .126, however when only determinate items were included in the analysis (i.e., omitting 4D^), the interaction was significant, F(1, 159) = 6.60, p = .011. This suggests that the indeterminate items are qualifying the nature of the interaction between levels of complexity and we are therefore cautious about the interpretations that we offer for this effect. The interpretation of the interaction with only determinate items is consistent with the analyses of the correlations in Table 7.5A. Gf is a significant predictor of accuracy on the ternary composite but not the quaternary composite. This is contrary to the complexity-Gf expectations. Figure 7.9 plots the influence of relational complexity and Gf on the composite measures of accuracy. Gf is dichotomised at 1 standard deviation above and below the mean.
Figure 7.9. Accuracy on knight-knave composites as a function of relational complexity and Gf (4D^ = indeterminate quaternary composite)
Figure 7.10. Accuracy on knight-knave composites as a function of relational
complexity and Gc
Crystallized Intelligence: A repeated-measures ANCOVA was conducted with Gc as
the covariate on the accuracy composites. The main effect for complexity was
significant as expected from the previous analyses, F(2, 318) = 143.64, p < .001 and the
interpretation remains unchanged. The main effect for the Gc covariate was also
significant indicating that general crystallized ability influenced overall performance on
the knight-knave task, F(1, 159) = 4.24, p = .041. The interaction, as plotted in Figure
7.10 was not significant, F(2, 318) = 2.12, p = .122. When the indeterminate items
(4D^) were excluded from the analyses, the interaction was only marginally significant,
F(1, 159) = 3.55, p = .061. Together, this indicates that the influence of Gc on accuracy
in the knight-knave task does not change as a function of complexity and indeterminacy
in the quaternary items.
Short-term Apprehension and Retrieval: The same repeated-measures ANCOVA was
conducted using SAR as the covariate. The main effect for complexity was significant,
F(2, 318) = 142.50, p = .001, and is interpreted as before; 3D > 4D > 4D^. The main
effect for the SAR covariate was also significant indicating that memory influenced
overall performance on the knight-knave task, F(1, 159) = 4.71, p = .032. The
interaction as plotted in Figure 7.11 was not significant F(2, 318) = 1.41, p = .246.
When the indeterminate items (4D^) were excluded from the analyses, the interaction
was marginally significant, F(1, 159) = 3.76, p = .054. This suggests that accuracy on the ternary
composite of items was a function of memory ability. There was no evidence to suggest
that memory ability mediated performance on the quaternary composite of items. This
taken in conjunction with the influence of the other broad cognitive abilities in this task
corroborates our intuitions that the requirements of the knight-knave task are complex – particularly for the quaternary items.
Figure 7.11. Accuracy on knight-knave composites as a function of relational complexity and SAR
7.4.2 Response Time
Table 7.5B reports the correlations between the composite response time for knight-knave items and the measures of broad cognitive ability as a function of relational
complexity.
Fluid Intelligence: The relational complexity (3D, 4D, 4D^) × Gf (covariate) repeated-
measures analysis of covariance was conducted on the response time composites. The
results indicate a significant main effect for relational complexity, F(2, 318) = 47.31, p
< .001. Planned contrasts following up this effect indicated that overall, ternary items
were answered more quickly than determinate quaternary items, F(1, 159) = 74.02, p <
.001; and determinate quaternary items were answered more quickly than indeterminate
quaternary items, F(1, 159) = 31.69, p < .001. The individual correlations in Table 7.5B
indicate that fluid ability is a significant predictor of response time on all composites,
and the main effect for Gf was consistent with this, F(1, 159) = 14.06, p < .001. The
interaction plotted in Figure 7.12 was not significant, F(2, 318) = 1.53, p = .219,
indicating that overall, the influence of Gf on decision time did not change as a function
of relational complexity. Reanalysis with the indeterminate items omitted indicated that
there was a significant interactive effect between relational complexity and Gf on
response time, F(1, 159) = 4.13, p = .044. Therefore, although inspection of the
correlations in Table 7.5B suggests that a trend consistent with the complexity-Gf effect is weak (not significant), analysis of the interaction between Gf and levels of complexity on response time indicated that this trend was significant. This provides some weak evidence for the complexity-Gf effect when overall response times are considered.
Figure 7.12. Response time on knight-knave composites as a function of relational complexity and Gf
Crystallized Intelligence: A similar repeated-measures ANCOVA was conducted with
Gc as the covariate. The main effect for complexity was significant as expected from
the previous analyses, F(2, 318) = 47.17, p < .001. The main effect for the Gc covariate
was also significant indicating that general crystallized ability did influence overall
response time on the knight-knave task, F(1, 159) = 7.15, p = .008. The interaction
across the three complexity classifications, plotted in Figure 7.13 was not significant,
F(2, 318) = 1.40, p = .249. When the indeterminate items were omitted from the
analyses, the interaction was marginally significant, F(1, 159) = 3.77, p = .054. That is, Gc
abilities had a greater influence on response time to the composite of quaternary items
than the composite of ternary items.
Figure 7.13. Response time on knight-knave composites as a function of relational
complexity and Gc
Short-term Apprehension and Retrieval: The analyses of the response time for the
relational complexity composites from the knight-knave task and SAR produced similar
results to the Gc analyses above. The main effect for complexity was significant, F(2,
318) = 46.95, p < .001. The main effect for the SAR covariate was also significant
indicating that memory ability did influence overall response time on the knight-knave
task, F(1, 159) = 8.38, p = .004. The interaction, as plotted in Figure 7.14 was not
significant, F(2, 318) = 1.31, p = .271. When indeterminate items (4D^) were omitted,
the interaction was still not significant, F(1, 159) = 2.96, p = .09 (although this might be
considered marginal).
Figure 7.14. Response time on knight-knave composites as a function of relational
complexity and SAR
7.4.3 Complexity-Gf at an Item Level
In summary, the evidence for a complexity-Gf effect in the knight-knave task composite
is weak. The classifications of relational complexity resulted in an effect opposite to the
hypothesised complexity-Gf relationship when the measure of performance is accuracy.
The ternary composite was better able to differentiate individuals of high and low Gf
than the more complex quaternary items. The evidence for the complexity-Gf
relationship is a little stronger and marginally significant for overall response time. The
analyses in Chapter 6 and the assessment of correlations in Table 7.1 and 7.5 suggest
that the knight-knave composites are likely to be contaminated measures of the
influence of relational complexity. This seems particularly true for the quaternary
collection of items. It might therefore be instructive to consider any apparent trend in
the performance of individual items, to explore the extent to which individual items
might be contributing to a failure to find compelling evidence for the complexity-Gf
prediction.
Figure 7.15 plots for each of the 14 knight-knave items, the correlation coefficient
between accuracy and each of the broad cognitive ability measures as a function of the
item’s Rasch calibrated difficulty (as reported in Chapter 6). We have plotted the
correlations in this way to explore any trends in the size of the correlation as a function
of the actual difficulty subjects had with each item regardless of our classification of
relational complexity. We could have used traditional p-values (proportion correct),
however these are widely known to be subject to ceiling and floor effects (Howell,
2001; Wright, 1999). The Rasch model is considered to provide a linearised
transformation of p-values when the data fits the model (Wright, 1999). It also provides
a useful reference scale by anchoring the zero point at the average item difficulty (for
the calibrated sample). In the case of the knight-knave task, item fit was appropriate
although person-separation was low (see Chapter 6). We suggested that this was an
indication of a rather heterogeneous collection of items, and our theoretical analyses of
individual items in Chapter 6 were consistent with this.
Figure 7.15. Knight-knave item correlations between accuracy and broad cognitive abilities (Gf, Gc, SAR) as a function of Rasch calibrated item difficulty (individual items labelled 3-1 to 3-6 and 4-1 to 4-10; the fitted linear trend for Gf and the non-significant region are marked)
Although the task clearly does not fit the unidimensional requirements of the Rasch model, proponents of the Rasch
methodology argue that the transformed scores will be a poorer approximation to linear
scaling, however no poorer than raw p-values (Wright, 1999). In this case we cannot
justify the assumption that items have been calibrated on an interval scale. We therefore
caution against over interpreting the measurement scale.
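The Rasch calibration itself was carried out in Chapter 6 with dedicated software. Purely to illustrate why logits are treated as a rough linearisation of proportion-correct values, the simplest non-iterative stand-in for an item difficulty is the negative log-odds of the item's p-value; note that this naive version anchors zero at p = .50, whereas the calibration used here anchors zero at the average item difficulty.

import numpy as np

def naive_item_difficulty(p_correct):
    """Negative log-odds of proportion correct: a rough stand-in for a Rasch
    item difficulty. A real Rasch calibration (e.g., joint or conditional
    maximum likelihood) also adjusts for the ability distribution."""
    p = np.clip(np.asarray(p_correct, dtype=float), 0.01, 0.99)  # avoid infinite logits
    return -np.log(p / (1 - p))

# Illustrative proportions only, not the thesis data: easier items map to
# negative logits, harder items to positive logits.
print(naive_item_difficulty([0.95, 0.75, 0.50, 0.25]))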
While the correlation between accuracy and broad cognitive abilities is significant for
one or two items (Items 3.6 and 3.1), the correlations for the remaining items failed to
reach significance. The regression line represented in Figure 7.15 is for the prediction of
the magnitude of the correlation coefficient as a function of calibrated item difficulty.
This relationship is negative and significant, r(12) = -0.58, p = 0.03: the more difficult the item, the less influence fluid abilities have on performance57. Similar results were observed for Gc, r(12) = -0.63, p = 0.02, and SAR, r(12) = -0.53, p = 0.05 (the regression lines for these variables are not represented in Figure 7.15). There are no apparent outliers, which indicates that the corresponding trend in the composite scores is reliable.
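A sketch of the item-level analysis just described, assuming two arrays of length 14 (the calibrated difficulties and the per-item accuracy-Gf correlations); the values below are placeholders rather than the thesis data.

import numpy as np
from scipy import stats

difficulty = np.array([-2.1, -1.6, -1.2, -0.8, -0.5, -0.2, 0.0,
                       0.3, 0.6, 0.9, 1.2, 1.5, 1.9, 2.3])          # logits (placeholder)
r_gf_acc = np.array([0.35, 0.30, 0.28, 0.22, 0.18, 0.15, 0.12,
                     0.10, 0.05, 0.02, -0.02, -0.05, -0.08, -0.10])  # placeholder correlations

# Pearson correlation between item difficulty and the size of the Gf correlation
# (equivalent to the slope test of the plotted regression line), df = 14 - 2 = 12.
r, p = stats.pearsonr(difficulty, r_gf_acc)
print(f"Pearson r(12) = {r:.2f}, p = {p:.2f}")

# Scale-free Spearman check, as in footnote 57.
rho, p_rho = stats.spearmanr(difficulty, r_gf_acc)
print(f"Spearman rho = {rho:.2f}, p = {p_rho:.2f}")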
There are some important points to consider here. First, the failure to support the
complexity-Gf prediction in accuracy scores suggests that the classifications of
relational complexity do not fit with the traditional psychometric definition of cognitive
complexity. However, items clearly differ in difficulty, and while these differences
coincide with our complexity classification, in and of itself the relationship between
item difficulty and fluid intelligence does not indicate differences in cognitive complexity
(see Spilsbury et al., 1990, for a discussion of the issues related to difficulty and
cognitive complexity).
57. A Spearman rank-order correlation (i.e., regardless of units and scaling) indicated similar results, rho = -.57.
Figure 7.16 plots similar information, however this time the correlation coefficient
represents the relationship between item response time and the three measures of broad
cognitive abilities. The regression line in Figure 7.16 represents the regression of the 14
item correlations between Gf and response time, as a function of calibrated item
difficulty. This association was not significant, r(12) = 0.22, p = 0.45, suggesting that
the influence of fluid intelligence on response time was not a function of the item’s
absolute difficulty. This was also the case for Gc and SAR, r(12) = -0.17, p = 0.57, and
r(12) = 0.00, p = 1, although the regression lines for these factors are not plotted. When
the indeterminate items (4.8, 4.9, and 4.10) are removed, the relationship does not
change. Recall that in our analyses of composite response time above, there was weak
evidence for the complexity-Gf effect. However, when individual items are considered
the pattern of results is less convincing.
Figure 7.16. Knight-knave item correlations between response time and broad cognitive
abilities (Gf, Gc, SAR) as a function of Rasch calibrated item difficulty
Although this type of item analysis puts a lot of weight on the reliability of individual
items, it does tend to demonstrate that performance on the majority of items does not
follow a pattern consistent with a manipulation of cognitive complexity either as a function of relational complexity, or simply in terms of calibrated item difficulty (as estimated using the Rasch approach). This does not seem to change whether accuracy or
response times are considered.
7.4.4 Summary
These findings are not altogether unexpected given the troubles we have had with
establishing evidence for stability of measurement in the knight-knave task. They have
however allowed us to clarify to some extent the nature of the relational complexity
composites. The instability appears to be a function of the complex and multifaceted
nature of the reasoning required, especially for the quaternary level items. Individual
differences in broad cognitive abilities did account for a significant proportion of the
individual variability in the composite measures of accuracy and response times.
However, these abilities did not seem to mediate performance reliably as a function of
relational complexity consistent with a complexity-Gf effect. The findings were also not
consistent with Halford et al.'s (1998a) propositions that educational experiences and
memory are independent of relational complexity, although it would not be appropriate
to place too much emphasis on this given our measurement concerns. It seems that
knight-knave reasoning is likely to be a function of a range of abilities and strategies
that entail a much broader conceptualisation of complexity than what we have so far
assessed. An alternative account of the failure to find significant correlations between
Gf and the quaternary level is that there is a threshold on the complexity-Gf
relationship. Once task characteristics reach a certain level of complexity the
relationship between performance and broad cognitive abilities starts to decline
(Lohman & Kyllonen, 1983).
7.5 Latin Square Task
Table 7.6 reports the correlations between the composite measures of accuracy and
response time, and the measures of broad cognitive ability as a function of relational
complexity. The relational complexity manipulation is collapsed over the number of
processing steps, however we will consider individual items in more detail in section
7.5.3 below. We also use a similar ANCOVA approach to explore the interaction
between each broad cognitive ability and the manipulation of relational complexity. As
before, we will not reinterpret the main-effect for the relational complexity factor since
this was done in detail in Chapter 6.
Table 7.6
Correlations between A) accuracy (Acc) and B) response time (RT) on the Latin Square task and broad cognitive abilities (Gf, Gc, SAR) with tests of significance (p-values) between correlations

A.
RC     rGf,Acc    p-values       rGc,Acc    p-values       rSAR,Acc    p-values
                  2D     3D                 2D     3D                  2D     3D
2D     0.46 **     -      -      0.31 **     -      -      0.35 **      -      -
3D     0.44 **    .39     -      0.16 *     .02     -      0.32 **     .32     -
4D     0.41 **    .25    .31     0.23 **    .15    .16     0.33 **     .41    .41

B.
RC     rGf,RT     p-values       rGc,RT     p-values       rSAR,RT     p-values
                  2D     3D                 2D     3D                  2D     3D
2D     -0.15 *     -      -      -0.04       -      -      -0.14        -      -
3D     0.06       .01     -      -0.01      .37     -      0.04        .02     -
4D     0.28 **    .00    .00     0.12       .03    .03     0.20 **     .00    .01

* p < .05, ** p < .01; RC = relational complexity; 2D, 3D, 4D = binary, ternary, and quaternary, respectively. The p-values test the difference between the correlation in that row and the corresponding 2D and 3D correlations (Howell, 2001).
7.5.1 Accuracy
Fluid Intelligence: A relational complexity (2D, 3D, 4D) × Gf (covariate) repeated-
measures ANCOVA was conducted on the composite accuracy scores. This revealed as
expected from earlier analyses a main-effect for relational complexity, F(2, 356) =
376.79, p < .001. There was also a significant main-effect for the Gf covariate. Overall,
higher levels of Gf resulted in higher accuracy on the Latin Square Task, F(1, 178) =
65.69, p < .001. Although the tests of the individual correlations in Table 7.6A suggest
that there is no significant difference between the correlations at different levels of
complexity, the test of the interaction between the variables suggested the trend was
statistically significant, F(2, 356) = 3.59, p = .029. The trend was in the opposite
direction to the complexity-Gf prediction and was linear in nature (test of linear contrast
= F(1, 178) = 4.42, p = .037; test of the quadratic contrast, F(1, 178) = 1.87, p = .174).
That is, as relational complexity increased, a statistically weaker association with Gf
was observed. The practical significance of this is questionable for two reasons. First,
the test of the individual correlations in Table 7.6A suggest that there is no difference
between the correlations (rGf,2D = .46; rGf,3D = .44; rGf,4D = .41). Second, the observed
statistical trend would suggest that the difference between low and high Gf becomes
smaller as the complexity increased. If we consider the plot of the interaction in Figure
7.17, in which the influence of relational complexity and Gf on the composite measures
of accuracy is represented, it is clear that this effect must be very weak (Gf is
dichotomised at one standard deviation above and below the mean). In terms of proportion correct, the practical differences in the effect of Gf are clearly questionable.
Figure 7.17. Accuracy on Latin-square test composites as a function of relational complexity and Gf
Crystallized Intelligence: A similar repeated-measures ANCOVA was conducted with
Gc as the covariate. The main effect for the complexity manipulation in the Latin-square task was significant, F(2, 356) = 268.17, p < .001. The main effect for the Gc
covariate was also significant indicating that general crystallized ability influenced
overall performance on the Latin-square task, F(1, 178) = 11.61, p = .001. The
interaction as plotted in Figure 7.18 was not significant and together, this indicates that
the influence of Gc on accuracy in the Latin-square task does not change as a function
of relational complexity, F(2, 356) = 2.03, p = .132.
Figure 7.18. Accuracy on Latin-square test composites as a function of relational
complexity and Gc
Short-term Apprehension and Retrieval: The repeated-measures ANCOVA was
repeated with SAR as the covariate. As before we will not re-interpret the significant
main-effect for relational complexity as this has been done previously, F(2, 356) =
268.62, p < .001. The main-effect for SAR was significant indicating individual
differences in memory influenced performance on the Latin Square Task overall, F(1,
178) = 27.52, p < .001. The interaction term was not significant indicating that the
influence of SAR does not change as a function of complexity, F(2, 356) = 2.20, p =
.112. Figure 7.19 plots the relationship between SAR and relational complexity on
accuracy in the Latin-square task. SAR has been dichotomised into high and low values
at one standard deviation above and below the mean, respectively (SD of SAR = .84).
Figure 7.19. Accuracy on Latin-square test composites as a function of
relational complexity and SAR
Complexity-Gf prediction on accuracy: In summary, Gf did influence accuracy in the
Latin Square Task, however the complexity-Gf prediction was not supported. In fact,
there was a statistically reliable trend in the opposite direction although the practical
significance of this was questioned. The Gc and SAR factors were also associated with
accuracy on the Latin Square Task – individuals with higher Gc scores tended to have
higher levels of accuracy, similarly for individuals with higher SAR scores. There was
no interaction between relational complexity and either of these variables on performance.
7.5.2 Response Time
Table 7.6B reports the correlations between the Latin Square Task composite response
times and the three broad cognitive ability measures. Also summarised in this table are
the individual tests of the difference between the correlation coefficients using the
procedure specified by Howell (2001).
Fluid Intelligence: The correlations in Table 7.6B suggest that the nature of the
relationship between fluid intelligence and response time changes qualitatively as a
function of relational complexity. For binary items, response times are negatively
correlated with Gf, r(178) = -.15, p = .05; higher Gf students tend to respond more
quickly than lower Gf students. For quaternary items, higher Gf students tend to take
longer to respond, r(178) = .28, p < .001. The relational complexity (2D, 3D, 4D) × Gf
(covariate) repeated-measures analysis of covariance was conducted on the response
time composites for the Latin Square Task to explore this trend further. The results
indicate a significant main effect for relational complexity, F(2, 356) = 205.19, p <
.001. The main effect for Gf was also significant, indicating overall, that Gf ability
influences response time, F(1, 178) = 7.13, p = .008. The interaction plotted in Figure
7.20 was significant, F(2, 356) = 17.32, p < .001, indicating that overall, the influence
of Gf on response time was a function of relational complexity. Note the cross-over of
the high and low Gf lines at the binary level items in this plot. Accordingly, the test of
the quadratic contrast was significant, F(1, 178) = 7.86, p = .006, indicating that
although the linear interactive effect was significant, F(1, 178) = 20.43, p < .001, it was
not sufficient to account for the interaction alone. This supports the earlier interpretation
of the individual correlations in Table 7.6B that a reversal in the relationship between
Gf and response time is apparent as a function of relational complexity. We cannot put
this reversal down to inadequate measurement. The Latin Square task is internally consistent and psychometrically strong, and this is an interesting result that we will explore in more detail shortly.
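One way to see where the linear and quadratic components of such an interaction come from is to compute orthogonal polynomial contrast scores on each subject's 2D, 3D, and 4D response-time composites and regress them on Gf; in the multivariate approach to repeated measures this corresponds to the contrast-wise tests of the complexity × Gf interaction. The sketch below uses hypothetical file and column names and is not the analysis actually run.

import pandas as pd
from scipy import stats

df = pd.read_csv("lst_rt_composites.csv")  # hypothetical: columns 'Gf', 'rt_2D', 'rt_3D', 'rt_4D'

# Orthogonal polynomial contrasts across the three complexity levels.
df["linear"] = df["rt_4D"] - df["rt_2D"]                       # coefficients (-1, 0, 1)
df["quadratic"] = df["rt_2D"] - 2 * df["rt_3D"] + df["rt_4D"]  # coefficients (1, -2, 1)

for contrast in ["linear", "quadratic"]:
    slope, intercept, r, p, se = stats.linregress(df["Gf"], df[contrast])
    print(f"{contrast} contrast on Gf: slope = {slope:.2f}, p = {p:.3f}")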
Figure 7.20. Response time on Latin-square test composites as a function of
relational complexity and Gf
Crystallized Intelligence: An inspection of the individual correlations in Table 7.6B
indicates that Gc has an influence on the response times to the quaternary level items, but
not for the lower level items. That is, consistent with the effect observed for Gf, higher
Gc individuals tend to take longer to respond to quaternary items than lower Gc
students. To explore this effect in more detail, the relational complexity (2D, 3D, 4D) ×
Gc (covariate) repeated-measures ANCOVA was conducted on the overall response
time composite. Again, there was a significant main effect for relational complexity
which will not be further interpreted here, F(2, 356) = 190.53, p < .001. Overall, Gc was
not a significant predictor of response time when averaging across levels of relational
complexity, F(1, 178) = 1.09, p = .297, however this was qualified by a significant
interaction between Gc and relational complexity, F(2, 356) = 3.38, p = .035. This
interaction is represented in Figure 7.21 where as before, Gc is dichotomised at one
standard deviation above and below the mean Gc score. The practical significance of
the effect should be interpreted cautiously as both the linear and quadratic contrasts of
the interaction are marginal, F(1, 178) = 3.38, p = .068, and F(1, 178) = 3.35, p = .069,
respectively. Figure 7.21 and the relatively larger (although non-significant) correlation
between Gc and response time to quaternary items suggest that the source of the
interaction is the relatively larger influence Gc has on performance in the quaternary
items.
Figure 7.21. Response time on Latin-square test composites as a function of
relational complexity and Gc
Short-term Apprehension and Retrieval: An inspection of the individual correlations in Table 7.6B indicates that SAR has an influence on the response times in a similar way that Gf does – that is, an apparent reversal in the relationship between response time
and SAR as complexity increases. To explore this relationship in more detail, the
relational complexity (2D, 3D, 4D) × SAR (covariate) repeated-measures ANCOVA
was conducted on the response time composite. Again, there was a significant main
effect for relational complexity, F(2, 356) = 196.51, p < .001. Overall, SAR was a
marginally significant predictor of response time when averaging across levels of
relational complexity, F(1, 178) = 3.56, p = .061. These effects were qualified by a
significant interaction between SAR and relational complexity, F(2, 356) = 9.24, p <
.001. This interaction is represented in Figure 7.22 with SAR dichotomised at one
standard deviation above and below the SAR mean. Note, that as for the Gf analyses,
the high and low SAR lines cross over at the binary level. The test of the quadratic
effect was slightly weaker than for the Gf analyses, although still significant, F(1, 178)
= 3.81, p = .053. This supports the earlier interpretation of the individual correlations in
Table 7.6B that a reversal in the relationship between SAR and response time is
apparent as a function of relational complexity. Again, this is an interesting effect and
counter to the Halford et al. (1998) prediction that the influence of SAR on performance
should not change as a function of relational complexity. We will consider the influence
of individual items as we did for the knight-knave task in the following section.
Figure 7.22. Response time on Latin-square test composites as a function of
relational complexity and SAR
7.5.2.1 Summary.
These analyses indicate that overall, response times in the Latin Square Task are
associated with the three broad cognitive abilities that we have measured. Although we
did not find evidence for the complexity-Gf prediction, response times did differ reliably
as an interactive function of relational complexity and fluid intelligence. We will argue
shortly that this pattern suggests a within-individual change of strategy that differs as a
function of ability and complexity. Relational complexity also tended to mediate the
influence of general crystallized abilities and memory. These abilities tended to predict
response times for the most complex quaternary items but not the less complex items.
The psychometric stability of the Latin Square Task is a compelling reason not to write
these effects off as chance fluctuations and we consider the analyses of individual items
in the following section to explore these relationships further.
7.5.3 Correlation Between Gf and Accuracy as a Function of Item Difficulty
The correlation between Gf and item accuracy is plotted for the 36 Latin Square Task
(LST) items as a function of the calibrated item difficulty in Figure 7.23. The evidence
that we presented in Chapter 4 and 6 indicates that the LST fits the Rasch model well
and the calibration of item difficulty on the logit scale will approximate a linear
transformation (Wright, 1999). Recall, that the logit scale orients zero at the point of
average item difficulty. We therefore have a linear scale and a useful reference point to
explore our analyses from. As we can see in Figure 7.23, Gf abilities also tended to
predict performance on the individual items – for the majority of items, this relationship
was significant. The stable distribution of the correlations of individual items as a
function of estimated item difficulty suggests that the failure to observe the complexity-Gf effect
in the composite accuracy data was not a function of one or two extreme items that were
washing out the expected trend.
Figure 7.23. Item correlation between accuracy and Gf as a function of calibrated item difficulty (items grouped by relational complexity, 2D–4D, and number of processing steps)
The distribution of item correlations between Gf and response time is equally
interesting. Recall that in Section 7.5.2 we reported a reversal in the correlations
between the composite measures of response time and Gf as a function of relational
complexity. Figure 7.24 plots this distribution at an item based level. Note that the
regression line almost passes through the origin and indicates that for items below the
average calibrated difficulty (i.e., less than 0 logits), high Gf individuals tend to respond
more quickly than low Gf individuals. The strength of this relationship increases as the
item becomes easier. For items above the average difficulty level (i.e., greater than 0
logits) the opposite occurs. Students with high Gf scores tended to take longer to
respond than students with low Gf scores. This relationship is highly significant, r(34) =
.82, p < .001. As we reported in Chapter 6, the relational complexity classification is
very closely related to calibrated item difficulty (r(34) = .78, p < .001). When the
number of relational processing steps and the interactive term (RC × RCsteps) are
considered, 92.9% of variation in calibrated item difficulty was accounted for.
Relational complexity is therefore almost synonymous with item difficulty in the Latin
Square task.
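A sketch of the item-difficulty regression summarised above (relational complexity, number of processing steps, and their interaction predicting calibrated difficulty for the 36 LST items), assuming a hypothetical item-level file with illustrative variable names.

import pandas as pd
import statsmodels.formula.api as smf

items = pd.read_csv("lst_item_calibrations.csv")  # hypothetical: columns 'difficulty', 'rc', 'steps'

# Difficulty regressed on RC, steps, and their interaction; a model of this form
# is reported above to account for 92.9% of the variance in calibrated difficulty.
fit = smf.ols("difficulty ~ rc * steps", data=items).fit()
print(fit.rsquared)
print(fit.params)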
Figure 7.24. Item correlation between response time and Gf as a function of calibrated item difficulty (items grouped by relational complexity and number of processing steps)
7.5.4 Speed-Accuracy Trade-off
So far we have only considered the relationship between Gf and overall response time
regardless of whether the student gets the item correct or not. It may be the case that the
results that we see in Figure 7.24 are a function of a speed-accuracy trade-off (SAT).
The nature of speed-accuracy trade-off has received considerable debate in the literature
and its stability has been questioned (Dennis & Evans, 1996; Lohman, 1989). What we
present here is an attempt to further our understanding of the Latin Square Task and we
acknowledge that our speculations will need further empirical validation. Figure 7.25
plots the distribution of the correlations between item accuracy and response time again
as a function of calibrated item difficulty. We continue to use calibrated difficulty for it
gives us a standard frame of reference that has a linear structure for comparing items. It
also provides an absolute measure of item difficulty that we have demonstrated to be
very closely related to the relational complexity classification – that is, we effectively
have a rough interval scale for relational complexity.
Figure 7.25. Correlation between accuracy and response time as a function of calibrated item difficulty on the Latin Square task (items grouped by relational complexity and number of processing steps)
Figure 7.26. Speed-accuracy trade-off on LST items as a function of calibrated item difficulty and Gf (separate trends for the low- and high-Gf groups)
The distribution of correlations as a function of item difficulty in Figure 7.25 is
curvilinear and a quadratic model provided a significant fit to the data, R2 = .48, F(2,
33) = 15.06, p < .001. Both the linear and quadratic item difficulty terms contributed
significantly to the prediction of the relationship between accuracy and response time, β
= .47, t(33) = 3.62, p = .001, and β = .64, t(33) = 4.90, p < .001, respectively. For items
of average difficulty generally there is no relationship between accuracy and response
time – as items become easier, there is a tendency for the SAT to become significant –
longer response times equate to better performance. For ternary items, the pattern of
SAT is quite varied. For some ternary items, accuracy is facilitated by quicker response,
for others, a longer response is better. The marked effect in this plot is for the complex
quaternary items. There is a distinct speed-accuracy trade-off that becomes more
pronounced as the difficulty increases. Again, spending longer on the task is more likely
to result in accurate performance.
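A sketch of the quadratic model described above (item-level accuracy-RT correlations regressed on calibrated difficulty and its square). The data here are simulated purely for illustration and will not reproduce the reported R2 = .48 or the standardised betas.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
difficulty = np.linspace(-3.5, 2.5, 36)                  # placeholder logits for 36 items
r_acc_rt = -0.05 + 0.05 * difficulty + 0.04 * difficulty**2 + rng.normal(0, 0.08, 36)

# Design matrix: intercept, linear term, quadratic term.
X = sm.add_constant(np.column_stack([difficulty, difficulty**2]))
fit = sm.OLS(r_acc_rt, X).fit()
print(fit.rsquared)              # overall fit of the quadratic model
print(fit.params, fit.pvalues)   # [intercept, linear, quadratic] coefficients and p-values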
The distribution of Gf-Accuracy correlations in Figure 7.24 suggests that Gf also
influences performance as a function of item difficulty and therefore we might expect
Gf to mediate the speed-accuracy trade-off. To explore this possibility, subjects were
classified into two groups as a function of their Gf score. The high-Gf group (N = 103)
scored above the mean Gf score (which is scaled to be 0), and the low-Gf individuals (N
= 88) scored below the mean58. Figure 7.26 plots the speed-accuracy trade-off as a
function of these two groups with the line of best fit for each59. The distribution of
accuracy-RT correlations for the low-Gf group followed the same pattern as for the
total group, R2 = .25, F(2, 33) = 5.40, p = .009. The test of the quadratic term was
significant, β = .51, t(33) = 3.25, p = .003; however the test of the linear term was not
significant, β = .20, t(33) = 1.31, p = .200. The distribution of accuracy-RT correlations
was qualitatively different for the high-Gf group. The linear relationship was
significant, β = .52, t(30) = 3.64, p = .001; however, the quadratic trend was only marginal, β = .28, t(30) = 1.93, p = .061. The linear trend alone is highly significant, R2 = .39, F(1, 31) = 19.82, p < .001, and is plotted in Figure 7.26.
58. Note that this distinction is between high-functioning individuals – they are all university students.
59. Three of the easiest items (items 2, 20, and 22) were answered correctly by all of the high-Gf individuals. As a result, the SAT correlation could not be calculated for these items and they are not represented in Figure 7.26. The non-significant region of Figure 7.26 is based on the sample size of the smaller low-Gf group (N = 88).
7.5.4.1 Summary.
The post-hoc analyses of the individual items indicate that the relationship between Gf and relational complexity in the Latin Square Task is potentially a function of a reliable within-subject change in strategy, which is itself a function of fluid abilities. For complex
items, Figure 7.26 suggests that high- and low-Gf individuals tend to profit from
spending longer on the task – with high-Gf individuals spending longer on the task than
low Gf individuals (Figure 7.24). As the difficulty or complexity of the item decreases
the strategies adopted by the two groups begin to differ. Low-Gf individuals tend to
continue to profit from spending longer on the task although the relationship is more
diverse across items of average difficulty (i.e., ternary items). On the other hand, high-Gf individuals actually profit from spending less time on easier items. That is, it appears
that the longer high-Gf individuals spend on the less complex items, the more likely
they are to make an error. This is consistent with the trend observed in Figure 7.24,
which suggests that high-Gf individuals tend to spend less time on the easier items than
low-Gf individuals. Until now we have effectively treated response time based on a
loose interpretation of the Halford et al. (1998a) comparison of means prediction – that
increases in relational complexity are associated with decrements in performance. We
have taken this to mean a decrease in accuracy and an increase in response times. It
seems that the relationship between relational complexity and response time, at least in
the Latin Square Task, is more complex than this simple statement might suggest. It
would seem that the time has come to consider this relationship in a little more detail.
7.5.5 Relational Complexity and Response Times
The relationship between accuracy and relational complexity is reasonably clear. As we
have discussed in Chapters 2 and 5, the relational complexity theory is based on the
concept of processing resources and differences in processing capacity are likely to
influence the performance on relationally complex tasks. If the individual does not have
a sufficiently high processing capacity, then they will not be capable of integrating all
the arguments to instantiate the relation. The relationship between response time and
relational complexity is much more involved. In the tasks that we have considered,
much of the processing is conducted over a relatively long period of time and it is
difficult to differentiate the time required to instantiate a process from the time spent
considering strategies and searching through the range of elements that might need to be
considered. However, if we assume that we can in some way eliminate these “other
factors”, then what might we expect the relationship between relational complexity and
response time to be?
Figure 7.27. Hypothetical components of the instantiation of a relation (component processes: search, assembly, and instantiation)
Going back to the formalisation of relational complexity (Halford et al., 1998a), it is
clear that the theory proposes that elements are processed in parallel to instantiate a
relation. This does not necessarily mean that response time is constant across levels of
complexity. In the Halford et al. (1998a) computational model (which we assume is a
model of the actual processes), arguments of a relation are represented by a vector of
activation weights associated with features of the element that satisfies the argument. In
the Halford et al. (1998a) model, the equal length vectors are then combined to
instantiate the relation in such a way that the number of nodes is equal to N^k, where N is the number of features in the vector, and k is the rank of the relation (a vector for each
argument plus an additional vector to represent the relation). The exact details are much
more complex, but the key point that we wish to make is that the number of nodes
increases exponentially as relational complexity increases. The relationship between
response time and relational complexity is therefore a function of the settling time
needed for the network of activations to stabilise. It is not altogether clear what the
scale of these times or the magnitudes of the differences due to relational complexity
might be. However it seems reasonable to suggest that they are unlikely to be of the
magnitude that we have observed, in the tens of seconds. We might then speculate what
other components of instantiating a relational process might be, and how these might
differ as a function of relational complexity. Figure 7.27 is one such hypothetical
representation. In this model we isolate a difference between the assembly of the
arguments and the actual consolidation or instantiation of the relation to form a single
cognitive representation. We suggested in Chapter 4 that arguments are not necessarily
brought together and integrated in parallel, all in one step. That is, we mused that the
necessary elements of a relation are brought together in a number of steps to build a
single cognitive representation. Only when this occurs is the relation instantiated and a
solution determinable. From this basic conceptualisation then, we might argue that the
assembly and/or processing time increases with the number of arguments. We would
also suspect that a search of the possible elements to act as arguments might influence
response times. Although it might be the case that complex relations are embedded in
more task features than less complex relations and therefore the search time might also
change as a function of relational complexity, this is not necessarily the case
theoretically. Increasing the number of relational steps will however increase response
times. This very simple heuristic makes no distinction about the actual ordering of the
phases – although the observed relationship with Gf in the Latin Square Task might
give some clues to start thinking about this issue.
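The exact details of the Halford et al. (1998a) model are more involved than this, but a back-of-the-envelope illustration of the exponential node count described above (N^k, with one vector per argument plus one for the relation symbol) makes the point; the value of N is an arbitrary assumption since none is fixed here.

N = 10  # assumed number of features per vector (illustrative only)
for arity, label in [(2, "binary"), (3, "ternary"), (4, "quaternary"), (5, "quinary")]:
    k = arity + 1  # one vector per argument plus one for the relation symbol
    print(f"{label} relation: {N}^{k} = {N**k} binding nodes")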
Bottom-Up Procedure: In what we might consider as a bottom-up procedure, emphasis
would be given to the search and assembly phases in an attempt to piece together the
necessary arguments to solve the problem. That is, the problem would be approached
from the outset with no clear conceptualisation of the relation to be instantiated. A
subject using a bottom-up strategy in the Latin Square task might start by working
through a number of pair-wise comparisons, searching for elements in a single row or
column of the square. Using this strategy we might predict that the probability of
discovering the right collection of elements to instantiate the relation increases the
longer the reasoner persists with the search and assembly process.
Top-Down Procedure: A general top-down procedure would entail giving the
instantiation/consolidation procedure of Figure 7.27 priority. A subject using a top-down procedure might begin with a clear conceptualization of the necessary relation
from the outset. For instance, subjects who have consolidated the relationship between
elements in intersecting rows and columns of the Latin Square might use this
understanding to constrain the number of elements in the search process. This would
lead to a reduction in overall response time.
We suspect in practice people are probably able to iterate through a series of bottom-up
and top-down procedures during solution. This is indicated in Figure 7.27 by the
double-headed arrow going from instantiation to search and assembly. Further, the
capacity for relational reasoning might mediate the preference and therefore the relative
usage of each procedure. Such a preference is likely to influence response times, but not
necessarily accuracy.
Empirical Evidence: There are two relevant pieces of empirical evidence that we have
presented so far in trying to understand the relationship between response time,
accuracy, fluid intelligence, and relational complexity. First, the trend in Figure 7.24
indicates that higher Gf abilities are associated with quicker response times for less
complex items, but longer response times for more complex items. Figure 7.26 suggests
that the speed-accuracy relationship also differs as a function of relational complexity
(item difficulty) and fluid intelligence. Both high- and low-Gf individuals profit from
spending more time on complex quaternary items. For items of average complexity
(ternary) there is no consistent relationship between speed and accuracy for the low-Gf
individuals. For the high-Gf individuals, there are indications at the ternary level of an
increasing trend that these individuals actually do well by spending relatively less time
on the item. This relationship continues to become stronger for the high-Gf groups as
the items become easier. The subjects in the low-Gf group tend to profit from spending
longer on the binary items. The second piece of evidence that we need to consider is
that the relationship between Gf and accuracy did not change as a function of relational
complexity (as plotted in Figure 7.23). The question then becomes: does this data fit with our conceptualization of the components of the relation instantiation process represented in Figure 7.27?
First of all, by the definition of fluid abilities, individuals with higher Gf are more able
to develop a strategy to incorporate more complex relations. This might lead to a
preference for a top-down procedure. For less complex items this is a particularly useful
strategy for the high-Gf individuals, as it will lead to a more constrained search process.
On the other hand, lower Gf individuals are more likely to use a bottom-up procedure
and spend more time in the search and assembly phase for the less complex items than
the high-Gf individuals. The longer the low-Gf individuals persevere with the task, the
more likely they are to get the item correct (although accuracy is less than for the higher
Gf individuals – as indicated by significant positive correlations between accuracy and
Gf). By definition, complex items entail a more complex instantiation of elements and
we might suspect that a greater number of the high Gf individuals are likely to invest
relatively more time in the assembly process than they did for the less complex items.
The speed-accuracy strategy for high-Gf students will then be more consistent with the
low Gf individuals – the longer the time invested on the task, the more likely they are to
assemble the correct elements to instantiate the relation and determine the value of the
target cell. The positive correlation between Gf and response time for the complex items
(high Gf spend longer on the task: Figure 7.24) would then be accounted for by
differences in the length of time individuals will persevere with the item60.
7.6 Summary of the Complexity-Gf Effect in Three Relational Complexity Tasks
The rationale for the complexity-Gf effect is based on a persistent finding from the study
of human reasoning that there is an increasing monotonic relationship between task
performance and Fluid Intelligence (Gf) as cognitive complexity of the task increases
(Snow et al., 1984; Jensen, 1987b; Stankov & Crawford, 1993; Stankov, 2000). In
practical terms this means that task performance is more able to differentiate individuals
as a function of fluid abilities as the cognitive task becomes more complex.
60. Pellegrino and Kail (1982) suggest that there are individual differences in a self-imposed time limit to problem solving that, once exceeded, triggers a response based on the processing conducted so far.
This
criterion has served as a somewhat independent indicator of a complexity manipulation
even though a global consensus on the definition of cognitive complexity is yet to be
reached. It is the criterion against which we have assessed the relational complexity
manipulation in the current set of tasks, and our results are mixed.
Although we cannot consider the influence of individual question types on performance
in the Sentence Comprehension Task at a sufficiently close level, the reliability of the
task is satisfactory and the task performs well in terms of the predicted complexity-Gf
effect. The influence of general acculturational experiences (Gc) and memory abilities
(SAR) in sentence comprehension did not change as a function of relational complexity.
The data from the knight-knave task was less convincing and in most respects
contradicted the predictions. The knight-knave task has a myriad of possible factors that
we believe can influence performance on individual items. It may be the case that the
measures were not sensitive to the manipulations of complexity – that other factors
obscured the influence of relational complexity. It may be the case that the complexity
of the reasoning was too great and that this caused a breakdown in the performance-Gf
relationship (a similar effect has been reported by Lohman & Kyllonen, 1983 in spatial
reasoning tasks). There is evidence for each of these arguments and at this stage further
research is necessary to clarify these findings. We do not doubt our analyses of the
relational complexity of the actual processing strategy we have modelled. Similarly, we
do not doubt the task and process analyses on which the relational complexity account
is based. We suggest that further work on explicating the influence of these factors on
the assessment of relational complexity is needed before commenting further on the
utility of the knight-knave task to provide a clear measure of relational reasoning. As it
stands, it is difficult to conclude either way whether the data provides evidence for or
against the complexity-Gf effect in the relational complexity manipulation, or whether
the measurement properties have simply made the test of it unreliable.
Finally, the Latin square task produces a reliable relationship between the broad
cognitive abilities and performance. For accuracy, the relationship between Gf and
performance tended to weaken slightly as complexity increased – contrary to the
predicted complexity-Gf effect. Although the practical significance of this result is
questioned, clearly there is no evidence for an increase in correlations as complexity
increases. For response time, there was however a change in the relationship with Gf
but this was qualitatively different to the expected complexity-Gf effect. An analysis of
individual items as a function of their calibrated difficulty and relational complexity,
and the relationship between accuracy and response times as a function of fluid ability,
provided a useful insight into this new task and relational complexity in general.
The findings from the manipulations of relational complexity are quite broad. In fact,
each of the three tasks had a different relationship with broad cognitive abilities that we
assessed (Gf, Gc, SAR). Based on this information we might conclude that if relational
complexity is consistent with traditional manipulations of cognitive complexity, its
influence is not constant across domains. As a final comparison then, we consider the
manipulation of complexity in the Triplet Numbers Test and how the structure of this
task differs from the relational complexity tasks.
7.7 Cognitive Complexity and the Triplet Numbers Test
The psychometric properties of the Triplet Numbers Test for the current sample were
summarised in Table 6.4 of Chapter 6. We repeat parts of this in Table 7.7 for
convenience and include the correlations between fluid intelligence and the measures of
accuracy and response times61.
Table 7.7
Accuracy and response time as a function of complexity level in the Triplet Numbers Test
and correlations with Gf

level   %correct        Gf      correct/min     Gf      RT (sec)      Gf
1       0.99 (0.01)     .04     30.65 (3.42)    .11     0.89 (0.21)   -.12
2       0.97 (0.06)     .17*    27.15 (3.45)    .35**   1.09 (0.23)   -.33**
3       0.96 (0.04)     .27**   23.44 (2.97)    .45**   1.41 (0.32)   -.38**
4       0.88 (0.14)     .47**   17.06 (3.39)    .57**   2.09 (0.68)   -.21**

%correct = Mean proportion correct; RT = mean response time; standard deviations in
parentheses; triplets number test omitted from factor analysis.
61 The triplets number test was omitted from a factor analysis similar to that reported in
Section 7.2.3; the three oblique factors extracted accounted for 42.51% of the shared
variance. The remaining variables loaded in the same way as in Section 7.2.3. Factor
scores were derived from these analyses for each of the three broad cognitive abilities.
We explored the increasing trend in correlations between Gf and performance in the
Triplet Numbers Test as a function of complexity level in the same way as we have
done for the relational complexity tasks. A complexity level (1, 2, 3, 4) × Gf (covariate)
repeated measures ANCOVA on the number of correct responses per minute
(correct/min) was run. This indicated a highly significant main effect for level of
complexity, F(3, 483) = 1106.94, p < .001, which was interpreted in Chapter 6, Section
6.4.2.1. There was also a significant main effect for the Gf covariate, F(1, 161) = 40.71,
p < .001, although this was qualified by the expected interaction, F(3, 483) = 13.93, p <
.001. This interaction is plotted in Figure 7.28 and the linear contrast of this effect was
significant, F(1, 161) = 21.69, p < .001 (the quadratic effect was not significant, F < 1;
although the cubic trend was, F(1, 161) = 5.97, p = .016). This is consistent with the
complexity-Gf effect and a similar pattern of results was reported when response times
were submitted to a similar analyses.
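To make the logic of the linear contrast concrete, the following is a minimal sketch (ours, not
the analysis software actually used) of how the linear component of the complexity level × Gf
interaction can be checked: each participant's correct/min scores across the four levels are
reduced to a linear trend score, and Gf is then used to predict those trend scores. The sample
size, variable names, and simulated data below are hypothetical.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_subjects = 163                                   # hypothetical sample size
gf = rng.normal(size=n_subjects)                   # one Gf factor score per person
# correct/min for complexity levels 1-4 (values loosely echo the Table 7.7 means)
correct_per_min = rng.normal(loc=[30.6, 27.2, 23.4, 17.1],
                             scale=3.0, size=(n_subjects, 4))

linear_contrast = np.array([-3, -1, 1, 3])         # orthogonal linear weights
trend = correct_per_min @ linear_contrast          # per-subject linear trend score

# Gf predicting the trend scores corresponds to the linear component of the
# level x Gf interaction in the repeated measures ANCOVA.
result = stats.linregress(gf, trend)
print(f"slope = {result.slope:.2f}, p = {result.pvalue:.4f}")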
Several important comparisons can be made from these analyses and the results in
Table 7.7. The first is that accuracy on the Triplet Numbers Test is
very high and the average response time for the most complex level is around two
seconds. In the relational complexity tasks, the range of accuracy was more varied – as
low as 30% correct in the quaternary level of the knight-knave task and about 50%
correct for most complex levels of the Sentence Comprehension and Latin Square tasks.
Figure 7.28. Number of correct responses per minute in the Triplet Numbers Test as a
function of complexity level (Levels 1 to 4) and Gf (low- vs. high-Gf groups).
Response times in the relational complexity tasks were also more varied than in the
Triplet Numbers Test: 40-50 seconds on average in the knight-knave and Latin Square
tasks. The decision time in the Sentence Comprehension Task was much closer to that of
the Triplet Numbers Test (on average about 5 seconds in its most complex level). At
least at this level, the nature of the reasoning in the Triplet Numbers Test is quite
different from what we have explored in the relational
complexity tasks. Although this might be the case, we have replicated in the Triplet
Numbers Test62 the complexity-Gf effect that Stankov and his associates have reported
(Stankov, 2000; Stankov & Crawford, 1993). It is also interesting to note one other
difference in the way Gf is related to the different levels of complexity. The simplest
level of complexity in the Triplet Numbers Test is not correlated with Gf, whereas
accuracy measures at the simplest levels of complexity in all the relational complexity
tasks were significantly correlated with Gf. In summary, there is evidence to suggest that
psychometrically, the triplet numbers test is quite different from the relational
complexity tasks. As one last comparison we consider the relational complexity of the
reasoning in the triplet numbers test.
7.7.1 Relational Complexity Analysis of the Triplet Numbers Test
An application of MARC requires a theory about the processes that are entailed in the
task. We begin the representation of the relational complexity analysis by considering
what we believe is a plausible strategy to deal with the most complex level. The rule for
level 4 of the triplet numbers test is as follows:
IF the first digit is the largest AND the second digit is the smallest
OR
IF the third digit is the largest AND the first digit is the smallest
THEN press “yes”
ELSE press “no”
62 We have also replicated the complexity-Gf effect in the swaps test, but this data is not
reported here.
First we consider an example and then attempt to model this generically to determine a
classification of relational complexity. Consider the triplet 7 3 8. We might begin
solution by considering the first component of the first conjunction; “is the first digit the
largest?” A comparison of the three figures would result in an interim “no” response.
Since it is not necessary in this case to consider the second component, we would then
move on to the first component of the second conjunction; “is the third digit the
largest?” An investigation of the triplet would lead to an interim “yes” response. We
would then continue on to test the second component of the second conjunction; “is the
first digit the smallest?” The answer is “no” and we have then exhausted the
possibilities and should respond “no” overall. The data suggests that on average, the
reasoning in this level of the triplet numbers test takes about 2 seconds (Table 7.7).
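Purely as an illustration (the function names and the coding are ours, not part of the test
materials), the level 4 rule and the sequential strategy just described can be written as follows;
the triplet 7 3 8 yields an overall “no” by the same path as the verbal protocol above.

def level4_rule(a, b, c):
    """Direct statement of the level 4 rule: (first digit largest AND second digit
    smallest) OR (third digit largest AND first digit smallest)."""
    return ((a == max(a, b, c) and b == min(a, b, c)) or
            (c == max(a, b, c) and a == min(a, b, c)))

def level4_sequential(a, b, c):
    """The same rule evaluated as a series of binary comparisons with early exits,
    mirroring the strategy in the text: C11 -> C12, otherwise C21 -> C22."""
    if a == max(a, b, c):              # C11: is the first digit the largest?
        if b == min(a, b, c):          # C12: is the second digit the smallest?
            return "yes"
    if c == max(a, b, c):              # C21: is the third digit the largest?
        if a == min(a, b, c):          # C22: is the first digit the smallest?
            return "yes"
    return "no"

print(level4_sequential(7, 3, 8))      # "no", as in the worked example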
Now consider a generic approach to the complexity analyses. The most complex level
of the Triplet Numbers Test essentially entails a representation of two conjunctions
embedded in a disjunction. We represent the triplet as (a, b, c) and using our
propositional notation outlined in Chapters 3 and 4, we might represent the instantiation
of the level 4 rule as follows:
Triplet: a b c
OR(AND(>(a, bc), <(b, ac)), AND(>(c, ab), <(a, bc)))
A first-run prima facie relational complexity analysis suggests that this might require
quaternary level processing (i.e., one argument for each conjunct). However as we have
just demonstrated, it is quite straightforward to segment and chunk the task into a series
of binary relations that are pieced together in series. For instance, consider the first
component of the first conjunctive statement “the first digit is the largest” which would
be represented as >(a, bc). The b and c terms are chunked because there is no need to
consider the relationship between these terms explicitly to make the current decision
(Theorem 3, Chapter 2). It is simply a comparison of the first digit with “the other two”.
If this relation is validated then we move on to the next component of the first
conjunction (“the 2nd digit is the smallest”) which would be represented as <(b, ac). If
both of these relations are validated (entailing another binary relation) then the level 4
rule is validated and the subject should respond “yes”. Otherwise the subject would
move the focus of reasoning onto the next conjunction. Effectively what this means is
that solution in the most complex level of the Triplet Numbers Test requires a series of
binary processes, and the digits in the triplet determine the actual number of binary
processes. A summary of the possible binary decisions in the level 4 triplets numbers
test is provided in Figure 7.29.
Figure 7.29. Decision tree of binary comparisons in level 4 of the Triplet Numbers Test.
For a triplet a b c, the tree first tests C11 (component 1 of the 1st conjunction; the first
digit is the largest). If C11 holds, C12 is tested (component 2 of the 1st conjunction; the
second digit is the smallest) and a “yes” response is made if it also holds. If C11 or C12
fails, reasoning moves to C21 (component 1 of the 2nd conjunction; the third digit is the
largest); if C21 fails the response is “no”, otherwise C22 is tested (component 2 of the
2nd conjunction; the first digit is the smallest) and the response is “yes” if it holds and
“no” otherwise.
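The claim that the digits themselves determine how many binary processes are required can be
illustrated with a small sketch (our own, hypothetical) that counts the comparisons the decision
tree in Figure 7.29 makes for each ordering of three distinct digits.

from itertools import permutations

def comparisons_needed(a, b, c):
    """Count the binary comparisons (C11, C12, C21, C22) evaluated by the
    decision tree in Figure 7.29 for a given triplet."""
    steps = 1                              # C11 is always evaluated
    if a == max(a, b, c):
        steps += 1                         # C12
        if b == min(a, b, c):
            return steps                   # respond "yes"
    steps += 1                             # C21
    if c == max(a, b, c):
        steps += 1                         # C22 (response is then determined)
    return steps

counts = [comparisons_needed(*t) for t in permutations((1, 5, 9))]
print(sorted(counts))                      # [2, 2, 2, 3, 3, 3]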
Solution of levels 2 and 3 of the Triplet Numbers Test will entail a reduced number of
the same type of binary processes. Level 1 of the triplet test essentially requires
recognition of a particular feature in the triplet – recognition of the numeral 3. This
would satisfy the Halford et al. (1998a) definition of a unary relation since no
comparison with any of the other digits or information is required. The accuracy and
response time data reported here and by Stankov and his associates (e.g., Stankov &
Crawford, 1993) corroborate our conclusion that the task entails the implementation of
a series of binary relations – accuracy is very high and response times very quick
(relative to the relational complexity tasks). However, this does not explain why such a
clear instantiation of the complexity-Gf effect is observed. Although we have no
evidence yet, we might speculate on just what it is that makes reasoning in the higher
levels of the triplet numbers test Gf loaded.
7.7.2 Why is the Triplet Numbers Test Gf Loaded?
In Chapter 2 (Section 2.3.3) we reported that cognitive complexity has been considered
a function of the range of solution strategies that are available. The involvement of
strategic or metacomponential processes was considered a characteristic of novel tasks
that influenced their complexity and demand on fluid abilities (Sternberg, 1985;
Crawford, 1991). It might, however, be difficult to see the extent to which differences in
solution strategy are entailed in the Triplet Numbers Test. We suggest that a
simple exercise will demonstrate that in fact, solution strategies do develop over the
span of 6 minutes of solving level 4 items. If the reader wishes, s/he might try validating
the level 4 rule with the list of randomly derived triplets presented in Appendix E
(subjects are not provided the rule during the test items so be sure you cover the rule in
Appendix E once you have memorised it).
What this exercise shows is that there are a number of heuristics that can be used to
facilitate solution. Two suggested by colleagues attempting the task were: a) if the
middle digit is the largest, then respond “no”; b) if the ordering is small, medium, and
large, respond “yes”. There are also a number of mnemonic-type strategies that combine
spatial and verbal heuristics to assist in remembering the rule. For example, c) left large,
middle small, right large, left small; or d) left middle right left – large small large small
(i.e., a cycling from left to right and then back to left, alternating largest and smallest).
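As a simple illustrative check (our own sketch, not part of the experimental materials),
heuristics a) and b) can be verified against the full level 4 rule: whenever they apply to a
triplet of distinct digits, they agree with it, and they simply leave the remaining cases to the
full rule.

from itertools import permutations

def heuristic(a, b, c):
    """Heuristics a) and b) from the text; None means the heuristic is silent
    and the full rule would have to be applied."""
    if b == max(a, b, c):          # a) middle digit largest -> respond "no"
        return False
    if a < b < c:                  # b) small, medium, large ordering -> respond "yes"
        return True
    return None

def level4_rule(a, b, c):
    return ((a == max(a, b, c) and b == min(a, b, c)) or
            (c == max(a, b, c) and a == min(a, b, c)))

# For every ordering of three distinct digits, the heuristics (when they apply)
# agree with the full rule.
assert all(heuristic(*t) in (None, level4_rule(*t))
           for t in permutations(range(1, 10), 3))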
What we are suggesting here is nothing new. We are simply restating the definition of
fluid intelligence (Carroll, 1993). That is, that high Gf individuals are more likely to be
sensitive to these types of strategies – to see order and patterns in the rules and triplets
quickly and to deal with the novelty of the task effectively. That these strategies will
also tend to shorten response times would account for why Gf was also a significant
predictor of response times in the task (Table 7.7). We commented on a similar heuristic
in the swaps test, in which individuals were moving the mouse over the response keys
to keep track of intermediate processing (Chapter 6, Section 6.4.3). It may be the ability
of the high-Gf individual to quickly detect these types of patterns that facilitates
solution and makes these tasks good measures of Gf.
If this is the case, then it is clear that the manipulation of relational complexity,
particularly in the Latin Square task, is fundamentally different from the manipulation
of complexity in the Triplet Numbers Test and the Swaps Test. We would argue that
relational complexity is a manipulation of complexity in a purer processing sense than
has been achieved in either the Swaps Test or the Triplet Numbers Test. Although it is
clear that strategy is important, the Latin Square Task is predominantly a manipulation
of processing demand and this is why, even with binary items, it is a good predictor of
fluid intelligence.
In the final Chapter we aim to bring together the empirical results that we have
presented throughout this project and to take a step back and consider what has been
achieved in our understanding of cognition through this work.
CHAPTER EIGHT
RELATIONAL COMPLEXITY IN ADULT REASONING:
SOME CONCLUDING COMMENTS
8 Discussion
Our objective for this project was to consider the application of the relational
complexity theory to adult cognition. We argued that a series of measurement and
assessment problems had to be addressed in order to do this. The way that we
approached these issues was to consider both correlational and experimental
procedures.
The relational complexity theory had originally been proposed to account for
developmental differences in cognition. Therefore the availability of high-level
reasoning tasks in which relational complexity had been investigated in an adult
population was somewhat limited. Three different “relational complexity” tasks were
used. The first was the sentence comprehension task. This was a task in which relational
complexity had been previously investigated (Andrews, 1997, in preparation; Andrews
& Halford, in press). The second task that we used was the knight-knave task. This task
was relatively new to psychological investigation and by all accounts (e.g., Johnson-Laird & Byrne, 1990; Rips, 1989) entailed a dynamic set of reasoning processes that
would provide an interesting investigation of relational complexity. The third task that
we used was based on the Latin Square. We designed this new task specifically from the
principles of relational complexity to provide a more stringent test of the theory. Our
findings supported the comparison of means hypothesis – that increases in relational
complexity resulted in a decrement in performance. However support for the
complexity-Gf prediction was inconsistent.
Rather than simply re-summarising our results we wish to spend the remainder of this
chapter considering the implications of our findings. The issues that we will cover can
be grouped into three broad categories: the initial measurement problems and how
these were addressed, the application of relational complexity theory to higher level
reasoning tasks, and the assessment of the influence of relational complexity on task
performance and the implications.
8.1 Measurement Problems
Measurement and assessment problems are two closely related but distinct issues.
Measurement is related to assessing the structure of the constructs that we are interested
in, and to what extent we can infer additivity of measurement in the traditional sense
(Michell, 1990, 2000). Assessment, on the other hand, has to do with the quality of the
inferences that we make. The quality of the measurement therefore feeds back into the
confidence that we have in our inferences. It is for this reason that we have spent
considerable effort exploring the measurement properties of our tasks. One of the issues
that we wanted to address in this project concerned the relational complexity metric
itself. As defined by Halford et al. (1998a) the metric is ordinal in nature and as a
tangible characteristic of a task this is reasonable. However, given the explicit link that
Halford et al. make between a task’s relational complexity and an individual’s
processing capacity, we were also interested in exploring the extent that our measures of
performance reflected some underlying quantitative structure in processing capacity.
The Latin Square Task provides a suitable foundation from which to consider these
issues further.
8.1.1 Quantitative Structure in the Latin Square Task
Michell (1990) summarises a number of principles that can be applied to assess additive
conjoint measurement indirectly – to assess the extent that our constructs have been
quantified. Although the approach that he advocates has been applied to investigate the
structure of fluid intelligence by Stankov and Cregan (1993), the scarcity of other
similar studies suggests a reluctance from the majority of researchers to address these
hard measurement issues head on (Michell, 2000). In a very brief review of Michell’s
work in Chapter 1 we indicated that conjoint measurement is concerned with the way
the ordering of a dependent variable varies with the joint effect of two or more
independent variables. Using the methods endorsed by Michell (1990) provides a
deterministic test of conjoint measurement that is a sufficient but not a necessary
condition for quantitative structure. Further, we presented the argument of Perline,
Wright and Wainer (1979) that suggested that a suitable fit to the Rasch model provided
a stochastic indication of quantitative structure (see also, Brogden, 1977; Luce &
Tukey, 1964, for more detailed accounts). We used the measurement approach offered
by the Rasch model on several levels. First, the Rasch analyses were very successful in
providing diagnostic cues to assist with the development of the Latin Square Task. The
Rasch model was also used to assist with determining the appropriateness of our
psychometric measures in preparation for the correlational analyses. Finally the Rasch
approach was used because it provides a transformation of proportion correct data to an
approximate interval (additive) scale (Wright, 1999). As it turns out, the pattern of
results in the Latin Square Task, when considered in conjunction with the properties of
the Rasch model, enables us to explore more closely the quantitative structure of
performance in this task. Furthermore, it allows us to make some preliminary
speculations about the underlying structure of processing capacity63.
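For reference, the dichotomous Rasch model being appealed to here expresses the log-odds of
a correct response as an additive combination of a person parameter and an item parameter; it
is this additive form that underpins the Perline et al. (1979) argument that fit to the model
provides stochastic evidence of quantitative structure. In the notation below (ours, for
exposition only), \theta_p is the ability of person p and \delta_i the difficulty of item i:

P(X_{pi} = 1) = \frac{\exp(\theta_p - \delta_i)}{1 + \exp(\theta_p - \delta_i)},
\qquad
\ln\!\left[\frac{P(X_{pi} = 1)}{1 - P(X_{pi} = 1)}\right] = \theta_p - \delta_i.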
8.1.1.1 Additive Conjoint Measurement
The accuracy measure on the Latin Square Task has quantitative structure by virtue of
its fit with the Rasch model (Perline et al., 1979). While this might tell us that we have
measured something quite well, it does not really tell us what. We need to place this
finding in some context. The relationship between Gf and item accuracy as a function of
the Rasch calibrated item difficulty was presented in Figure 7.23. This plot indicated
that fluid abilities were generally implicated in the reasoning entailed in the Latin
Square Task to the same degree regardless of item difficulty. Given the very close
relationship between item difficulty and relational complexity in this task, we might
infer that this relationship extends to the manipulation of relational complexity. That is,
fluid abilities are implicated in Latin Square Task reasoning to the same degree
regardless of relational complexity. We noted that this was in fact contrary to the
predicted complexity-Gf effect, suggesting that our manipulation was not a manipulation
of complexity in the psychometric sense (Stankov, 2000). The relationship
demonstrated in Figure 7.24 between Gf and item response time as a function of Rasch
calibrated item difficulty tells us something quite different. It suggests that at least in
some aspect of problem solving, Gf is implicated differentially as a function of item
difficulty – as a function of relational complexity.
63 The psychometric properties of the knight-knave task were not appropriate for this type of
analysis. The administration of the sentence comprehension task did not facilitate an analysis
using the Rasch method as we have applied it, although more advanced Rasch models might
be used.
We need to consider the implications of Figure 7.24 further. The correlation coefficients
plotted in this figure represent the linear relationship between Gf and response time
for each of the 36 Latin Square items. This relationship changes reliably as a function of
the item’s absolute difficulty, which as we have just stated, closely reflects the
relational complexity manipulation. That is, there is a reliable interactive relationship
between the time to respond to the item, Gf, and item difficulty (which is a function of
overall accuracy on the item and has quantitative structure by virtue of its fit to the
Rasch model). Specifically, the ordering of relational complexity (item difficulty) and
Gf is necessarily related to the ordering of Latin Square response times. That is, their
orders are defined relative to their effect on response time and, consistent with the
generic argument presented by Michell (1990) and Perline et al. (1979) that we described
in Chapter 1, relational complexity and Gf are quantified relative to their effects on
performance in the Latin Square Task.
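The item-level analysis described in this paragraph can be sketched as follows (a hypothetical
illustration with simulated data, not the thesis analysis itself): for each of the 36 items,
compute the correlation between Gf and response time across participants, then ask whether
those coefficients vary systematically with the item's Rasch-calibrated difficulty.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_subjects, n_items = 163, 36                        # hypothetical shapes
gf = rng.normal(size=n_subjects)                     # Gf factor scores
rt = rng.lognormal(mean=3.0, sigma=0.4, size=(n_subjects, n_items))
item_difficulty = np.sort(rng.normal(size=n_items))  # Rasch difficulties (logits)

# Gf-RT correlation computed item by item
gf_rt_r = np.array([stats.pearsonr(gf, rt[:, i])[0] for i in range(n_items)])

# Does the Gf-RT relationship change systematically with item difficulty?
r, p = stats.pearsonr(item_difficulty, gf_rt_r)
print(f"r(difficulty, Gf-RT correlation) = {r:.2f}, p = {p:.3f}")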
The broader implications of this may not be immediately apparent. This is the first
compelling evidence that we have for the quantitative structure of performance
underlying the influence of relational complexity in adult cognition. If performance is a
function of processing capacity as Halford et al. (1998a) propose, then this provides a
strong foundation from which to consider the possible structure of processing capacity
in an adult population further.
8.2 Application of the Relational Complexity Theory
8.2.1 Valid Process Theories
The application of the relational complexity theory is only as successful as the process
theory on which it is based. This can be an obvious limitation (Sweller, 1998)
particularly in dynamic situations when it is not clear which strategy might be used and
a precise process theory cannot be assured. The interesting problem of applying the
relational complexity theory to the analysis of adult cognition is that adults bring with
them to the laboratory a whole constellation of abilities and experiences that children
simply do not have. This causes a wide range of methodological problems in
determining exactly which strategy has been used. This in turn influences the
confidence that we have in our inferences.
8.2.1.1 Relational complexity and strategy.
Chapter 3 provides the first explicit formalism of an application of the MARC
procedure to the analysis of a high-level, multi-process deductive reasoning task. The
relational complexity analysis of the knight-knave task was based on the process
theories of Byrne and Handley (1997) and Rips (1989). The knight-knave task provides
convincing evidence that re-affirms that strategy effects are very important in assessing
adult cognition.
Using the general segmentation and chunking principles we have been successful in
modelling differences in strategy across a range of knight-knave problems. However,
even for a single item, differences in strategy are important. We have demonstrated this
by presenting the same knight-knave problem using two different probe questions (“Is B
a knight?” and “Is B a knave?”). This very small change influenced the relational
complexity since each question cues a different solution strategy that entails processes
that differ in complexity. All roads might lead to Rome, but not all are equally efficient.
We cannot avoid the difficulties that this presents to the complexity classification of a
problem. That is, we must entertain the possibility that the relational complexity of a
problem may be qualified by the strategy chosen. The analyses of the Latin Square Task
suggest that the strategy chosen may also be qualified by the abilities and experiences of
the individual.
These issues need further exploration in the application of the relational complexity
theory. We cannot be tempted into saying that strategy makes no difference because we
believe that the number of arguments that needs to be instantiated to derive a solution
will always be the same for a given problem. This approach at best will place a lower
bound on the relational complexity classification. When reasoning is dynamic and
entails multiple processes, we cannot know what the upper limit of complexity is
without explicitly testing for it. This idea is not new. It has been repeatedly
demonstrated that a pure algorithmic/logical approach to deductive reasoning does not
capture the full range of possible approaches that individuals can use. Heuristics
develop, biases are reinforced, and reasoning can often appear irrational (Evans et al.,
1993). It would seem that Johnson-Laird and Byrne’s (1991) mental models approach to
reasoning is successful because it is pragmatic. Its success and similarly the success of a
relational complexity analysis require a very detailed understanding of the processes
that are entailed. Further, if cognitive abilities influence strategy, and strategy
influences problem complexity as our findings suggest, then cognitive abilities are
another area that needs to be further investigated in applying relational complexity
theory. We suspect that this is likely to be particularly important in modelling
complexity in dynamic environments.
8.2.1.2 Application to dynamic environments.
If the relational complexity theory is to be applied successfully to the complexities of
dynamic reasoning required of pilots, air traffic controllers, or surgeons (for instance), a
detailed understanding of the processes that are entailed is crucial. Halford et al.
(1998b) state that the weight imposed by the requirement of a strong process theory
does not invalidate the theory per se. We agree in principle, but are keen to see evidence
that the theory can be applied to reliably account for reasoning in ecologically valid
situations beyond the laboratory. Although it is clear that our analyses of the knight-knave task were capable of accounting for differences in aggregated performance, we
had difficulty in demonstrating consistency of measurement in the task as a whole. That
is, using a resource based conception of reasoning we would expect that an individual
who has the capacity to process quaternary items would succeed on less complex items.
Our findings indicate that performance on quaternary knight-knave problems did not
predict ternary performance as might be expected. We argued from our evidence that to
account for performance on the quaternary items a more detailed model of strategies
and abilities would be necessary. In fact it was these concerns that led to the
development of the Latin Square Task. We took the reductionist approach and argued
that if we could not describe the influence of relational complexity in adult reasoning in
a very constrained task, then it was unlikely this could be achieved in real-world
settings. The considerable success that we have had with the Latin Square Task
provides a strong argument for a merger of methodologies and the development of new
ones to explore ecological applications of relational reasoning further.
The next question that we would like to address is the extent to which relational
complexity assesses a consistent underlying construct.
8.3 Assessment of Relational Complexity
The relational complexity theory was proposed to help integrate and explain the
cognitive development literature and is based on a theory of processing resources. The
traditional comparison of means approach suggested that the relational complexity
manipulations in all our tasks were consistent with the relational complexity
predictions. Increasing complexity resulted in more errors and longer response times.
Incorporating an individual differences approach allowed us to consider the influence of
strategy (e.g., speed-accuracy trade-off) and broad cognitive abilities (Gf, Gc, and
SAR). All of these factors tended to mediate the influence of a task’s relational
complexity on performance, in some cases quite dramatically. This suggests to us that
the influence of strategies and abilities needs careful consideration in the application of
MARC.
8.3.1 Relational Complexity: Resources, Relational Reasoning, and/or Gf
Incorporating experimental and differential psychology is complicated by a fundamental
difference in the constructs and metaphors that each discipline tends to use as a way of
conceptualising cognition. Cognitive psychologists talk about resources while
differential psychologists tend to talk in terms of abilities and aptitudes. The obvious
question that this raises is just what is the relationship between processing resources,
relational reasoning, and fluid intelligence? Although we suspect there is no clear
resolution of these terms (at least in the minds of their proponents), we believe it is an
issue that can be addressed to some extent with the current data.
Cognitive psychologists are quick to suggest that a predominant source of reasoning
difficulty is the availability of sufficient processing resources. Hence, two Latin Square
items that differ in relational complexity would be expected to differ in the relative
amount of resources that they require. In the Halford et al. (1998a) model it is assumed
that these resources come from the same pool. The issue of common processing
resources was raised in Chapter 5, and we argued that although our implementation of
the dual-task approach could have been stronger, the concept of resources did little to
clarify the nature of the processing entailed in the Latin Square Task. A dual-task deficit
was observed, which can be interpreted to be consistent with either the notion that the
two tasks competed for a common resource, or consistent with the fact that response
competition existed. The latter is arguably more convincing, since there was only weak
evidence to suggest that increasing relational complexity was associated with a larger
decrement in secondary task performance. We argued that for a variety of practical
reasons, both the traditional dual-task methodology and the easy-to-hard paradigm were
not sufficient to address this issue.
8.3.2 Resources and Fluid Intelligence
Unlike the factorial theories of individual differences, the relational complexity theory
has a clear statement of what it is about a task that makes reasoning complex. The
traditional psychometric definition of cognitive complexity has been based on the task's
relationship with measures of fluid intelligence. More complex tasks are considered to
be better able to differentiate between individuals of low and high fluid abilities. Yet,
the relational complexity manipulations in two of the three tasks that we have used are
not related to Gf in this way. Accuracy on the Latin Square Task is a good predictor of
fluid intelligence that is relatively stable across items and levels of complexity (see
Figure 7.23). From an individual differences perspective, this would suggest that the
different levels of complexity are tapping the same abilities to the same degree. If we
consider performance as reflecting available resources, consistent with the traditional
cognitive psychology approach, then the internal consistency of the Latin Square Task
suggests that the items are tapping the same resources. However, as we have just said,
the relationship between accuracy and fluid intelligence does not change as a function
of relational complexity. If resources for relational processing are closely related to
fluid intelligence, then the data presented in Figure 7.23 suggests that resources are not
implicated differentially as a function of relational complexity to determine accuracy.
Hence at this level, our interpretations are consistent with those we made in relation to
the dual-task data in Chapter 5.
What complicates this interpretation is that the relationship between Gf and the Latin
Square Task response time represented in Figure 7.24 suggests that processing
resources (if we continue with the cognitive psychology “metaphor”) are implicated
differently as a function of item complexity. That is, the relationship between relational
complexity and performance depended on what measure of performance was used. It
also depended on the individual’s general cognitive ability levels (Gc and SAR as well
as Gf). As a result of this evidence, we are not clear to what extent the resource concept
improves our understanding of cognition over what can be achieved by conceptualising
performance as a function of a range of related abilities and aptitudes. That is, although
we might use resources as a metaphor to describe some characteristic of relational
reasoning or fluid cognitive abilities, it does not seem to add anything in and of itself.
Compared to resources, the concept of fluid intelligence is well defined. Its relationship
with other “resources” is also well defined (e.g., crystallized intelligence, short-term
apprehension and retrieval, spatial abilities, etc.). We know generally what fluid
intelligence is and what it is not. The same cannot be said of the resource concept. We
have already discussed the concerns that Navon (1984) has with resource theory. More
recently, Oberauer and Kliegl (2001) have suggested that limits in working memory can
be accounted for without additional constructs like “limited resource pools” or “magical
numbers” (e.g., using decay and interference theories).
We are certainly not advocating that cognitive psychologists should discard their tools in
preference for the individual differences methodologies. As a metaphor, the resource
concept has served to provide detailed models of cognitive processes (Halford et al.,
1998a; Humphreys et al., 1994) that simply would not have been possible by the
correlational procedures of differential psychologists (Lohman & Ippel, 1993). In fact
computational models such as Just and Carpenter’s (1992) are based on a
conceptualisation of resources that feeds directly into theories of individual differences.
We suspect that relational complexity theory has the same potential to broaden our
understanding of individual differences in reasoning across a wide range of tasks and
environments.
8.3.3 Relational Reasoning and Broad Cognitive Abilities
If performance is not a function of a common pool of resources, we need to contemplate
alternative accounts of the influence that relational complexity has on performance. Our
findings suggested that the relationship between fluid intelligence and performance on
our relational complexity tasks is complex and multifaceted. As we intend to explain
next, it is for this very reason that we find the concept of resources as a construct less
than satisfactory.
The manipulations of relational complexity in the three tasks studied in this project
seemed to be quite different from the complexity manipulation in the Triplet Numbers
Test, yet almost invariably the manipulations of relational complexity were correlated
with fluid intelligence. The relational complexity tasks were also related to Gc and SAR
in different ways. We would like to speculate that these differences are significant when
it comes to conceptualising the foundations of performance differences as a function of
relational complexity. What the findings of this project suggest to us is that the
reasoning entailed in our relational complexity tasks is related to some component of
Gf, but not necessarily the same component as that tapped by the triplet numbers test,
the swaps test, or even the progressive matrices test. Consistent with Halford et al. (1998a),
we believe that the manipulations of relational complexity are manipulations of the
amount of relational reasoning entailed in the task. However, the findings of this project
also indicate that relational reasoning abilities are not the only factor influencing
performance. Serial processing was identified as important, as were memory and
educational/experiential abilities. The specific characteristics of each task are likely to
mediate the extent that these factors influence strategies and overall performance. They
also influence the extent that performance measures will be sensitive to manipulations
of relational complexity – of relational reasoning abilities. Therefore, the reason that we
prefer to think in terms of abilities rather than resources is that not only does the former
conceptualization of performance facilitate an understanding of multifaceted tasks, it
provides a clearer foundation from which to explore other contributing factors.
8.4 Conclusion
The unifying goal of this project was to test the relational complexity theory using
measures independent of the task whose relational complexity is being manipulated. We
have argued that this entails bringing together methodologies from experimental and
differential psychology. Although there is recognition that a merger of these areas is
necessary, history suggests that such a venture has failed to be fully realised. We
believe we have demonstrated that a merger is not only possible, but can be done very
successfully. The theory of relational complexity is well specified and demonstrates
immense potential to further our understanding of cognition. It is a theory that integrates
task demands with individual abilities and, we believe, when applied carefully has the
potential to be one of the most important contributions to broadening our understanding
of individual differences in human performance.
9 References
Andrews, G. (1997). Relational complexity as a capacity construct in cognitive
development. Unpublished PhD Dissertation, University of Queensland,
Brisbane.
Andrews, G. (in preparation). The role of processing capacity in the comprehension of
relative clause sentences.
Andrews, G., & Halford, G. S. (1995). Working memory capacity and the
comprehension of relative clause sentences. Paper presented at the 3rd
Conference of the Australasian Cognitive Science Society, Brisbane.
Andrews, G., & Halford, G. S. (1998). Children's ability to make transitive inferences:
The importance of premise integration and structural complexity. Cognitive
Development, 13, 470-513.
Andrews, G., & Halford, G. S. (in press). A cognitive complexity metric applied to
cognitive development. Cognitive Psychology.
Andrich, D. (1988). Rasch Models for Measurement. Newbury Park, CA: Sage.
Andrich, D. (1999). Essays: Equating of marks. In G. N. Masters & J. P. Keeves (Eds.),
Advances in Measurement in Educational Research and Assessment (pp. 176-185). Oxford, UK: Elsevier Science.
Baddeley, A., & Hitch, G. J. (1974). Working memory. In G. H. Bower (Ed.), The
Psychology of Learning and Motivation: Advances in Research and Theory
(Vol. 8, pp. 47-89). New York: Academic Press.
Baddeley, A. D. (1986). Working Memory. Oxford: Clarendon Press.
Baddeley, A. D. (1992). Working memory: The interface between memory and
cognition. Journal of Cognitive Neuroscience, 4(3), 281-288.
Baddeley, A. D. (1996). Exploring the central executive. Quarterly Journal of
Experimental Psychology, 49A(1), 5-28.
Bathurst, K., & Kee, D. W. (1994). Finger-tapping interference as produced by
concurrent verbal and nonverbal tasks: An analysis of individual differences in
left-handers. Brain and Cognition, 24, 123-136.
Birney, D. P. (1999). Emulating complexity in the n-tem series task using premise
presentation order. Unpublished manuscript.
Birney, D. P. (2001). Solution of Incomplete Latin Squares: An Application of the Rasch
Measurement Model. Paper presented at the Psychonomic Seminar Series,
School of Psychology, University of Queensland.
Birney, D. P., & Halford, G. S. (2000a). Methods for analysing complexity in reasoning
tasks: Links to fluid intelligence. Paper presented at the Fourth International
Conference on Thinking, University of Durham, UK.
Birney, D. P., & Halford, G. S. (2000b). Principles of relational complexity in cognitive
task development. Paper presented at the Fifth Australasian Cognitive Science
Conference, University of Melbourne.
Birney, D. P., & Halford, G. S. (2000c). The psychometric properties of relational
complexity in adult cognition. Paper presented at the poster session of the XVI
British Psychological Society Cognitive Section Conference, The University of
Essex, UK.
Birney, D. P., & Halford, G. S. (2001, Nov 28-30). Understanding cognitive
complexity: Evidence from cognitive psychology and individual differences.
Paper presented at the Third International Spearman Seminar, University of
Sydney, Australia.
Birney, D. P., & Halford, G. S. (2002). Cognitive complexity of suppositional
reasoning: An application of the relational complexity metric to the knight-knave task. Thinking and Reasoning, 8(2), 109-134.
Bjorklund, D. F., & Harnishfeger, K. K. (1989). In defense of resources. Journal of
Experimental Child Psychology, 47(1), 19-25.
Bloem, K. A., & Damos, D. L. (1985). Individual differences in secondary task
performance and subjective estimation of workload. Psychological Reports, 56,
311-322.
Boag, C. C., Neal, A., Halford, G. S., & Goodwin, G. (2000). Comparing measures of
cognitive complexity: Cognitive psychology applied to air traffic control. Paper
presented at the XVI British Psychological Society Cognitive Section
Conference, The University of Essex, England.
Boles, D. B., & Law, M. B. (1998). A simultaneous task comparison of differentiated
and undifferentiated hemispheric resource theories. Journal of Experimental
Psychology: Human Perception and Performance, 24(1), 204-215.
Bollen, K. A. (1989). Structural Equations with Latent Variables. New York: Wiley.
Bourke, P. A. (1997). Measuring attentional demand in continuous dual-task
performance. Quarterly Journal of Experimental Psychology, 50A(4), 821-840.
Bourke, P. A., Duncan, J., & Nimmo-Smith, I. (1996). A general factor involved in
dual-task performance decrement. Quarterly Journal of Experimental
Psychology, 49A(3), 525-545.
Braine, M. D. S. (1990). The "Natural Logic" approach to reasoning. In W. F. Overton
(Ed.), Reasoning, Necessity, and Logic: Developmental Perspectives (pp. 133-157). Hillsdale, NJ: Lawrence Erlbaum.
Braine, M. D. S. (1993). Mental models cannot exclude mental logic and make little
sense without it. Behavioral and Brain Sciences, 16, 338-339.
Brainerd, C. J., & Reyna, V. F. (1989). Output-interference theory of dual-task deficits
in memory development. Journal of Experimental Child Psychology, 47(1), 1-18.
Brogden, H. E. (1977). The Rasch model, the law of comparative judgement and
additive conjoint measurement. Psychometrika, 42, 631-634.
Brookings, J. B. (1990). A confirmatory factor analytic study of time-sharing
performance and cognitive abilities. Intelligence, 14, 43-59.
Brookings, J. B., & Damos, D. L. (1991). Individual differences in multiple-task
performance. In D. L. Damos (Ed.), Multiple-Task Performance (pp. 363-386).
London: Taylor & Francis.
Bugelski, B. R. (1949). A note on Grant's discussion of the Latin square principle in the
design of experiments. Psychological Bulletin, 46, 49-50.
Byrne, R. M. J., & Handley, S. J. (1993). The nature and development of reasoning
strategies. In K. Ryan & F. F. E. Sutcliffe (Eds.), AI and Cognitive Science '92
(pp. 59-70). London: Springer-Verlag.
Byrne, R. M. J., & Handley, S. J. (1997). Reasoning strategies for suppositional
deductions. Cognition, 62, 1-49.
Byrne, R. M. J., Handley, S. J., & Johnson-Laird, P. N. (1995). Reasoning from
suppositions. Quarterly Journal of Experimental Psychology, 48A(4), 915-944.
Campbell, D. T., & Fiske, D. A. (1959). Convergent and discriminant validation by the
multitrait-multimethod matrix. Psychological Bulletin, 56, 81-105.
Carpenter, P. A., Just, M. A., & Shell, P. (1990). What one intelligence test measures: A
theoretical account of the processing in the Raven Progressive Matrices Test.
Psychological Review, 97, 404-431.
Carroll, J. B. (1993). Human cognitive abilities: A survey of factor-analytic studies. New
York, NY: Cambridge University Press.
Cattell, R. B. (1957). Personality and motivation: Structure and measurement. NY:
World Book Co.
Cattell, R. B. (1987). Intelligence: Its structure, growth and action. Amsterdam:
Elsevier Science Publishers.
Chapman, M. (1989). Resources versus response competition: A false disjunction?
Journal of Experimental Child Psychology, 47(1), 39-41.
Cohen, J., & Cohen, P. (1975). Applied Multiple Regression/Correlation Analysis for
the Behavioral Sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
Colle, H. A., Amell, J. R., & Ewry, M. E. (1988). Capacity equivalence curves: A
double trade-off curve method for equating task performance. Human Factors,
30(5), 645-656.
Conger, A. J. (1974). A revised definition for suppressor variables: A guide to their
identification and interpretation. Educational and Psychological Measurement,
35, 35-46.
Cowan, N. (2000). The magical number 4 plus or minus 1: A reconsideration of mental
storage capacity. Behavioral and Brain Sciences, 24(1), 87-185.
Crawford, J. D. (1991). Intelligence, task complexity and the distinction between
automatic and effortful mental processing. In H. Rowe (Ed.), Intelligence:
Reconceptualization and measurement (pp. 119-144). Hillsdale, NJ: Lawrence
Erlbaum.
Cronbach, L. J. (1957). The two disciplines of scientific psychology. American
Psychologist, 12, 671-684.
Damos, D. L. (1991a). Dual-task methodology: Some common problems. In D. L.
Damos (Ed.), Multiple-Task Performance (pp. 101-119). London: Taylor &
Francis.
Damos, D. L. (1991b). Multiple-Task Performance. London: Taylor & Francis.
Deary, I. J. (2001). Human intelligence differences: Towards a combined experimental-differential approach. Trends in Cognitive Sciences, 5(4), 164-170.
Dennis, I., & Evans, J. S. T. (1996). The speed-error trade-off problem in psychometric
testing. British Journal of Psychology, 87, 105-129.
Ekstrom, R. B., French, J. W., Harman, H. H., & Dermen, D. (1976). Manual for Kit of
Factor-Referenced Cognitive Tests. Princeton, NJ: Educational Testing Service.
Embretson, S. (1993). Psychometric models for learning and cognitive processes. In N.
Frederiksen & R. J. Mislevy & I. I. Bejar (Eds.), Test Theory for a New
Generation of Tests (pp. 125-150). Hillsdale, NJ: Lawrence Erlbaum Associates.
Embretson, S. E. (1995a). A measurement model for linking individual learning to
processes and knowledge: Application to mathematical reasoning. Journal of
Educational Measurement, 32(3), 277-294.
Embretson, S. E. (1995b). The role of working memory capacity and general control
processes in intelligence. Intelligence, 20, 169-189.
Embretson, S. E. (1996). Multicomponent response models. In W. J. van der Linden &
R. K. Hambleton (Eds.), Handbook of modern item response theory (pp. 305-321). New York: Springer.
Embretson, S. E. (1998). A cognitive design system approach to generating valid tests:
Application to abstract reasoning. Psychological Methods, 3(3), 380-396.
Embretson, S. E., & Prenovost, L. K. (2000). Dynamic cognitive testing: What kind of
information is gained by measuring response time and modifiability?
Educational and Psychological Measurement, 60(6), 837-863.
Embretson, S. E., & Reise, S. P. (2000). Item Response Theory for Psychologists.
Mahwah, NJ: Lawrence Erlbaum Associates.
Evans, J. (1993). The cognitive psychology of reasoning: An introduction. Quarterly
Journal of Experimental Psychology, 46A(4), 561-567.
Evans, J. S. B. T., Newstead, S. E., & Byrne, R. M. J. (1993). Human Reasoning: The
Psychology of Deduction. Hove: Lawrence Erlbaum Associates.
Evans, J. S. T. (1989). Bias in human reasoning: Causes and consequences. Hillsdale,
NJ: Lawrence Erlbaum Associates.
Evans, J. S. T., & Brooks, P. G. (1981). Competing with reasoning: A test of the
working memory hypothesis. Current Psychological Research, 1, 139-147.
Fisher, & Yates. (1963). Statistical Tables for Biological, Agricultural and Medical
Research. Edinburgh: Oliver & Boyd Ltd.
Fisher, R. A. (1925). Statistical Methods for Research.
Fisher, R. A. (1935). The Design of Experiments (1 ed.). Edinburgh: Oliver and Boyd.
Fisher, R. A. (1966). The Design of Experiments (8 ed.). Edinburgh: Oliver and Boyd.
Fisher, R. A., & Yates, F. (1934). The 6 x 6 Latin squares. Proceedings of the
Cambridge Philosophical Society, 30, 492-507.
Fisher, W. (1992). Stochastic resonance and Rasch measurement. Rasch Measurement
Transactions, 5(4), www.rasch.org/rmt/rmt54k.htm.
Fisk, A. D., Derrick, W. L., & Schneider, W. (1986). A methodological assessment and
evaluation of dual-task paradigms. Current Psychological Research & Reviews,
5(4), 315-327.
Fogarty, G. (1987). Timesharing in relation to broad ability domains. Intelligence, 11,
207-231.
Fogarty, G., & Stankov, L. (1982). Competing tasks as an index of intelligence.
Personality and Individual Differences, 3, 407-422.
Fogarty, G., & Stankov, L. (1988). Abilities involved in performance on competing
tasks. Personality and Individual Differences, 9, 35-49.
Foley, E. J. (1997). Assessing conceptual complexity in hierarchical reasoning: A dual-task approach. Unpublished Dissertation, University of Cincinnati.
Foley, E. J., & Berch, D. B. (1997). Capacity limitations of a classical M-Power
measure: A modified dual-task approach. Journal of Experimental Child
Psychology, 66, 129-143.
Friedman, A., Polson, M. C., & Dafoe, C. G. (1988). Dividing attention between the
hands and the head: Performance trade-offs between rapid finger tapping and
verbal memory. Journal of Experimental Psychology: Human Perception and
Performance, 14(1), 60-68.
Gentner, D., & Rattermann, M. J. (1998). Deep thinking in children: The case for
knowledge change in analogical development. Behavioral and Brain Science,
24(6), 837-838.
Gevins, A., & Smith, M. E. (2000). Neurophysiological measures of working memory
and individual differences in cognitive ability and cognitive style. Cerebral
Cortex, 10, 829-839.
Gilhooly, K. J., Logie, R. H., Wetherick, N. E., & Wynn, V. (1993). Working memory
and strategies in syllogistic-reasoning tasks. Memory and Cognition, 21(1), 115-124.
Gorsuch, R. L. (1983). Factor Analysis (2nd ed.). Hillsdale, NJ: Lawrence Erlbaum
Associates.
Goswami, U. (1998). Is relational complexity a useful metric for cognitive
development. Behavioral and Brain Science, 24(6), 838-839.
Grant, D. A. (1948). The Latin square principle in the design and analysis of
psychological experiments. Psychological Bulletin, 45, 427-442.
Green, K. E., & Kluever, R. C. (1992). Components of item difficulty of Raven's
Matrices. The Journal of General Psychology, 119(2), 189-199.
Green, K. E., & Smith, R. M. (1987). A comparison of two methods of decomposing
item difficulties. Journal of Educational Statistics, 12(4), 369-381.
Halford, G., & Leitch, E. (1989). Processing load constraints: A structure-mapping
approach. In M. Luszcz & T. Nettelbeck (Eds.), Psychological Development:
Perspectives Across the Life-Span (pp. 151-159). North-Holland: Elsevier
Science Publishers.
Halford, G. S. (1989). Cognitive processing capacity and learning ability: An
integration of two areas. Learning and Individual Differences, 1(1), 125-153.
Halford, G. S. (1993). Children's understanding: The development of mental models.
Hillsdale, NJ: Lawrence Erlbaum Associates.
Halford, G. S., Andrews, G., & Birney, D. P. (in preparation). Assessment of relational
reasoning in school aged children.
Halford, G. S., Andrews, G., Dalton, C., Boag, C. C., & Zielinski, T. (in press). Young
children's performance on the balance scale: The influence of relational
complexity. Journal of Experimental Child Psychology.
Halford, G. S., Andrews, G., & Jensen, I. (in press). Integration of category induction
and hierarchical classification: One paradigm at two levels of complexity.
Journal of Cognition and Development.
Halford, G. S., Bain, J. D., & Maybery, M. T. (1984a). Does a concurrent memory load
interfere with reasoning. Current Psychological Research & Reviews, 3(2), 14-23.
Halford, G. S., Bain, J. D., & Maybery, M. T. (1984b). Working memory and
representational processes: Implications for cognitive development. In H.
Bouma & D. G. Bouwhuis (Eds.), Attention and Performance X (pp. 459-470).
Hillsdale, NJ: Lawrence Erlbaum Associates.
Halford, G. S., & Dalton, C. (1995). Performance on the balance scale by two-year old
children. ERIC document, ED 385 355.
Halford, G. S., Maybery, M. T., & Bain, J. D. (1986). Capacity limitations in children's
reasoning: A dual-task approach. Child Development, 57, 616-627.
Halford, G. S., Phillips, S., & Wilson, W. (2000). Processing capacity limits are not
explained by storage limits. Behavioral and Brain Science, 24(1), 123-124.
Halford, G. S., & Wilson, W. H. (1980). A category theory approach to cognitive
development. Cognitive Psychology, 12, 356-411.
Halford, G. S., Wilson, W. H., Guo, J., Gayler, R. W., Wiles, J., & Stewart, J. E. M.
(1994). Connectionist implications for processing capacity limitations in
analogies. In K. J. Holyoak & J. Barden (Eds.), Advances in Connectionist and
Neural Computation Theory (Vol. 2: Analogical Connections, pp. 363-415).
Norwood, NJ: Ablex.
Halford, G. S., Wilson, W. H., & Phillips, S. (1998a). Processing capacity defined by
relational complexity: Implications for comparative, developmental, and
cognitive psychology. Behavioral and Brain Sciences, 21, 803-831.
Halford, G. S., Wilson, W. H., & Phillips, S. (1998b). Relational complexity metric is
effective when assessments are based on actual processes. Behavioral and Brain
Science, 21(6), 848-864.
Halford, G. S., & Zielinski, T. (in preparation). Relational complexity in the N-back
task.
Hambleton, R. K., & Swaminathan, H. (1985). Item Response Theory: Principles and
Applications. Norwell, MA: Kluwer-Nijhoff.
Hegarty, M., Shah, P., & Miyake, A. (2000). Constraints on using the dual-task
methodology to specify the degree of central executive involvement in cognitive
tasks. Memory and Cognition, 28(3), 376-385.
Horn, J. L. (1968). Organization of abilities and the development of intelligence.
Psychological Review, 75(3), 242-259.
Horn, J. L. (1988). Thinking about human abilities. In J. R. Nesselroade (Ed.),
Handbook of Multivariate psychology (2nd ed., pp. 645-685). New York:
Plenum Press.
Horn, J. L., & Cattell, R. B. (1966). Refinement and test of the theory of fluid and
crystallized general intelligences. Journal of Educational Psychology, 57(5),
253-270.
Howe, M. L., & Rabinowitz, M. (1989). On the uninterpretability of dual-task
performance. Journal of Experimental Child Psychology, 47, 32-38.
Howell, D. C. (2001). Statistical Methods for Psychology. Pacific Grove, CA:
Brooks/Cole-Thomson Learning.
Humphreys, M. S., Wiles, J., & Dennis, S. (1994). Toward a theory of human memory:
Data structures and access processes. Behavioral and Brain Sciences, 17(4),
655-692.
Hunt, E. (1987). Science, technology, and intelligence. In R. R. Ronning & J. A. Glover
& J. C. Conoley (Eds.), The Influence of Cognitive Psychology on Testing (pp.
11-40). Hillsdale, NJ: Lawrence Erlbaum.
Hunt, E., & Lansman, M. (1982). Individual differences in attention. In R. J. Sternberg
(Ed.), Advances in the Psychology of Human Intelligence (Vol. 1, pp. 207-254).
Hillsdale, NJ.: Lawrence Erlbaum Associates.
Hunt, E. B. (1980). Intelligence as an information-processing concept. British Journal
of Psychology, 71, 449-474.
Jensen, A. (1987a). The g beyond factor analysis. In R. R. Ronning & J. A. Glover & J.
C. Conoley & J. C. Witt (Eds.), The Influence of Cognitive Psychology on
Testing (Vol. 3, pp. 87-142). Hillsdale, NJ: Lawrence Erlbaum Associates.
Jensen, A. (1987b). Process differences and individual differences in some cognitive
tasks. Intelligence, 11, 107-136.
Johnson-Laird, P., & Byrne, R. (1990). Meta-logical problems: Knights, knaves, and
Rips. Cognition, 36, 69-84.
Johnson-Laird, P., & Byrne, R. M. (1993). Précis of deduction. Behavioral and Brain
Sciences, 16, 323-380.
Johnson-Laird, P. N., Byrne, R. M., & Schaeken, W. (1992). Propositional reasoning by
model. Psychological Review, 99(3), 418-439.
Johnson-Laird, P. N., Byrne, R. M., & Schaeken, W. (1994). Why models rather than
rules give a better account of propositional reasoning: A reply to Bonatti and to
O'Brien, Braine, and Yang. Psychological Review, 101(4), 734-739.
Johnson-Laird, P. N., & Byrne, R. M. J. (1991). Deduction. Hove: Lawrence Erlbaum
Associates.
Just, M. A., & Carpenter, P. A. (1992). A capacity theory of comprehension: Individual
differences in working memory. Psychological Review, 99(1), 122-149.
Just, M. A., Carpenter, P. A., & Hemphill, D. D. (1996). Constraints on processing
capacity: Architectural or implementational? In D. Steier & T. Mitchell (Eds.),
Mind Matters: A Tribute to Allen Newell (pp. 141-178). Mahwah, NJ: Lawrence
Erlbaum Associates.
Kahneman, D. (1973). Attention and Effort. Englewood Cliffs, NJ: Prentice-Hall.
Kanfer, R., & Ackerman, P. (1989). Motivation and cognitive abilities: An
integrative/aptitude-treatment interaction approach to skill acquisition. Journal
of Applied Psychology, 74(2), 657-690.
Kantowitz, B. H., & Knight, J. L. (1978). When is an easy task difficult and vice versa?
A reply to Lane. Acta Psychologica, 42, 163-170.
Kantowitz, B. H., & Knight, J. L., Jr. (1976). Test tapping timesharing, II: Auditory
secondary task. Acta Psychologica, 40, 343-362.
Klauer, K. C., Stegmaier, R., & Meiser, T. (1997). Working memory involvement in
propositional and spatial reasoning. Thinking and Reasoning, 3.
Kroger, J., Sabb, F. W., Fales, C., Bookheimer, S. Y., Cohen, M. S., & Holyoak, K. (in
press). Recruitment of anterior dorsolateral prefrontal cortex in human
reasoning: A parametric study of relational complexity. Cerebral Cortex.
Kyllonen, P. C., & Christal, R. E. (1990). Reasoning ability is (little more than)
working-memory capacity?! Intelligence, 14, 389-433.
Lansman, M., & Hunt, E. (1982). Individual differences in secondary task performance.
Memory and Cognition, 10(1), 10-24.
Larson, G. E., Merritt, C. R., & Williams, S. E. (1988). Information processing and
intelligence: Some implications of task complexity. Intelligence, 12, 131-147.
Law, D. J., Morrin, K. A., & Pellegrino, J. W. (1995). Training effects and working
memory contributions to skill acquisition in a complex coordination task.
Learning and Individual Differences, 7(3), 207-234.
Linacre, J., & Wright, B. (1994). Chi-square fit statistics. Rasch Measurement
Transactions, 8:2, 361.
Lohman, D. F. (1989). Individual differences in errors and latencies on cognitive tasks.
Learning and Individual Differences, 1(2), 179-202.
Lohman, D. F. (1994). Component Scores as Residual Variation (or Why the Intercept
Correlates Best). Intelligence, 19(1), 1-11.
Lohman, D. F., & Ippel, M. J. (1993). Cognitive diagnosis: From statistically based
assessment toward theory-based assessment. In N. Frederiksen & R. J. Mislevy
& I. I. Bejar (Eds.), Test Theory for a New Generation of Tests (pp. 41-70).
Hillsdale, NJ: Lawrence Erlbaum Associates.
Lohman, D. F., & Kyllonen, P. C. (1983). Individual differences in solution strategy on
spatial tasks. In R. F. Dillion & R. R. Schmeck (Eds.), Individual Differences in
Cognition (Vol. 1). NY: Academic Press.
Loveday, W. (1995). The effect of complexity on planning in the Tower of Hanoi
problem. Unpublished Honours Thesis, University of Queensland, Brisbane,
Australia.
Luce, R. D., & Tukey, J. W. (1964). Simultaneous conjoint measurement. Journal of
Mathematical Psychology, 1, 1-27.
Marshalek, B., Lohman, D. F., & Snow, R. E. (1983). The complexity continuum in the
radex and hierarchical models of intelligence. Intelligence, 7, 107-127.
Masters, G. N., & Wright, B. D. (1996). The partial credit model. In W. J. van der
Linden & R. K. Hambleton (Eds.), Handbook of Modern Item Response Theory.
New York, NY: Springer-Verlag.
Maybery, M. T. (1987). Information Processing Models of Transitive Inference.
Unpublished PhD Dissertation, University of Queensland, Brisbane.
Maybery, M. T., Bain, J. D., & Halford, G. S. (1986). Information-processing demands
of transitive inference. Journal of Experimental Psychology: Learning, Memory,
and Cognition, 12(4), 600-613.
Michell, J. (1990). An Introduction to the Logic of Psychological Measurement.
Hillsdale, NJ: Erlbaum.
Michell, J. (1997). Quantitative science and the definition of measurement in
psychology. British Journal of Psychology, 88(3), 355-384.
Michell, J. (1999). Measurement in Psychology: Critical History of a Methodological
Concept. New York: Cambridge University Press.
Michell, J. (2000). Normal science, pathological science and psychometrics. Theory and
Psychology, 10(5), 639-667.
Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our
capacity for processing information. Psychological Review, 63(2), 81-97.
Moray, N. (1967). Where is capacity limited? A survey and a model. Acta
Psychologica, 27, 84-92.
Morrin, K. A., Law, D. J., & Pellegrino, J. W. (1994). Structural modelling of
information coordination abilities: An evaluation and extension of the Yee,
Hunt, and Pellegrino Model. Intelligence, 19, 117-144.
Most, R. B., & Zeidner, M. (1995). Constructing personality and intelligence
instruments. In D. H. Saklofske & M. Zeidner (Eds.), International Handbook of
Personality and Intelligence (pp. 475-503). NY: Plenum Publishing
Corporation.
Navon, D. (1984). Resources - A theoretical soup stone? Psychological Review, 91(2),
216-234.
Navon, D., & Gopher, D. (1979). On the economy of the human-processing system.
Psychological Review, 86(1), 214-255.
Navon, D., & Gopher, D. (1980). Task difficulty, resources, and dual-task performance.
In R. S. Nickerson (Ed.), Attention and Performance VIII (pp. 297-315).
Hillsdale, New Jersey: Lawrence Erlbaum Associates.
Navon, D., & Miller, J. (1987). Role of outcome conflict in dual-task interference.
Journal of Experimental Psychology: Human Perception and Performance,
13(3), 435-448.
Norman, D. A., & Bobrow, D. G. (1975). On data-limited and resource-limited
processes. Cognitive Psychology, 7, 44-64.
Oberauer, K., & Kliegl, R. (2001). Beyond resources: Formal models of complexity
effects and age differences in working memory. European Journal of Cognitive
Psychology, 13, 187-215.
Oberauer, K., Suss, H., Schulze, R., Wilhelm, O., & Wittmann, W. (2000). Working
memory capacity - facets of a cognitive ability construct. Personality and
Individual Differences, 29, 1017-1045.
O'Brien, D. P., Braine, M. D. S., & Yang, Y. (1994). Propositional reasoning by mental
models? Simple to refute in principle and in practice. Psychological Review,
101(4), 711-724.
Pascual-Leone, J. (1998). To appraise developmental difficulty of mental demand,
relational complexity is not enough. Behavioral and Brain Sciences, 21(6), 843-844.
Pashler, H. (1994). Dual-task interference in simple tasks: Data and theory.
Psychological Bulletin, 116(2), 220-244.
Pedhazur, E. J., & Schmelkin, L. P. (1991). Measurement, Design, and Analysis: An
Integrated Approach. NJ: Lawrence Erlbaum Associates.
Pellegrino, J. W., & Glaser, R. (1979). Cognitive correlates and components in the
analysis of individual differences. Intelligence, 3, 187-214.
Pellegrino, J. W., Hunt, E. B., & Yee, P. (1989). Assessment and modelling of
information coordination abilities. In R. Kanfer & P. Ackerman & R. Cudek
(Eds.), Learning and Individual Differences: Abilities, Motivation, &
Methodology. Hillsdale, NJ: Erlbaum.
Pellegrino, J. W., & Kail, R. (1982). Process analyses of spatial aptitude. In R. J.
Sternberg (Ed.), Advances in the Psychology of Human Intelligence (Vol. 1).
Hillsdale, NJ: Lawrence Erlbaum Associates.
Pellegrino, J. W., & Lyon, D. R. (1979). The components of a componential analysis.
Intelligence, 3, 169-186.
Perline, R., Wright, B., & Wainer, H. (1979). The Rasch model as additive conjoint
measurement. Applied Psychological Measurement, 3, 237-255.
Polson, M. C., & Friedman, A. (1988). Task-sharing within and between hemispheres:
A multiple-resources approach. Human Factors, 30(5), 633-643.
Posner, M. I., & Boies, S. J. (1971). Components of attention. Psychological Review,
78, 391-408.
Rabinowitz, M., Howe, M. L., & Saunders, K. (2002). Age, memory load, and
individual differences in working memory as determinants of class-inclusion
reasoning. Journal of Experimental Child Psychology, 81, 157-193.
Raven, J. C., Raven, J., & Court, J. H. (1993). Manual for Raven's Progressive Matrices
and Vocabulary Scales.
Raykov, T., & Stankov, L. (1993). On task complexity and "simplex" correlation
matrices. Australian Journal of Psychology, 45(2), 81-85.
Richardson, K. (1991). Reasoning with Raven in and out of context. British Journal of
Psychology, 61, 129-138.
Richardson, K. (1996). Putting Raven in context: A response to Roberts & Stevenson.
British Journal of Educational Psychology, 66, 533-538.
Rips, L. (1989). The psychology of knights and knaves. Cognition, 31, 85-116.
Rips, L. (1990). Paralogical reasoning: Evans, Johnson-Laird, and Byrne on liar and
truth-teller puzzles. Cognition, 36, 291-314.
Rips, L. (1994). The psychology of proof: Deductive reasoning in human thinking.
Cambridge, Mass: MIT Press.
Rips, L. J. (1983). Cognitive processes in propositional reasoning. Psychological
Review, 90(1), 38-71.
Rips, L. J., & Conrad, F. G. (1983). Individual differences in deduction. Cognition and
Brain Theory, 6(3), 259-285.
Roberts, M. J. (1993). Human reasoning: Deduction rules or mental models, or both?
Quarterly Journal of Experimental Psychology, 46A(4), 569-589.
Roberts, M. J. (1996). Putting context in context: A rejoinder to Richardson. British
Journal of Educational Psychology, 66, 539-542.
Roberts, M. J., & Stevenson, N. J. (1996). Reasoning with Raven - with and without
help. British Journal of Educational Psychology, 66, 519-532.
Roberts, R. D. (1997). Fitts' law, movement time and intelligence. Personality and
Individual Differences, 23(2), 227-246.
Roberts, R. D., Beh, H. C., & Stankov, L. (1988). Hick's law, competing-task
performance, and intelligence. Intelligence, 12, 111-130.
Roberts, R. D., & Stankov, L. (1999). Individual differences in speed of mental
processing and human cognitive abilities: Toward a taxonomic model. Learning
and Individual Differences, 11(1), 1-120.
Salthouse, T. A. (1985). A Theory of Cognitive Aging. Amsterdam: North Holland.
Salthouse, T. A. (1988). The complexity of age x complexity functions: Comments on
Charness and Campbell (1988). Journal of Experimental Psychology: General,
117, 425-428.
Schneider, W., & Shiffrin, R. M. (1977). Controlled and automatic human information
processing: I. Detection, search, and attention. Psychological Review, 84, 1-66.
Schroyens, W., Schaeken, W., & d'Ydewalle, G. (1999). Error and bias in meta-propositional reasoning: A case of the mental model theory. Thinking and
Reasoning, 5(1), 29-65.
Schweizer, K. (1998). Complexity of information processing and the speed-ability
relationship. Journal of General Psychology, 125(1), 89-102.
Sheehan, K. M. (1997). A tree-based approach to proficiency scaling and diagnostic
assessment. Journal of Educational Measurement, 34(4), 333-352.
Shepard, R. N., & Metzler, J. (1971). Mental rotation of three dimensional objects.
Science, 171, 701-703.
Sheridan, B., Andrich, D., & Luo, G. (1998). Welcome to RUMM: A Windows-Based
item analysis program employing Rasch unidimensional measurement models,
User's Guide.
Skinner, B. F. (1983). The shame of American education. American Psychologist,
39(9), 947-954.
Smullyan, R. (1978). What is the name of this book? The riddle of Dracula and other
logical puzzles. Englewood Cliffs, NJ: Prentice Hall.
Snow, R. E., Kyllonen, P. C., & Marshalek, B. (1984). The topography of ability and
learning correlations. In R. J. Sternberg (Ed.), Advances in the Psychology of
Human Intelligence (Vol. 2). Hillsdale: Erlbaum.
Spearman, C. (1923). The nature of 'intelligence' and the principles of cognition.
London: Macmillan.
Spilsbury, G., Stankov, L., & Roberts, R. (1990). The effect of a test's difficulty on its
correlation with intelligence. Personality and Individual Differences, 11(10),
1069-1077.
SPSS. (1999). SPSS 9.0 Syntax Reference Guide. IL: SPSS Inc.
Stankov, L. (1987). Competing task and attentional resources: Exploring the limits of
primary-secondary paradigm. Australian Journal of Psychology, 39(2), 123-137.
Stankov, L. (1988). Single tests, competing tasks and their relationship to the broad
factors of intelligence. Personality and Individual Differences, 9(1), 25-33.
Stankov, L. (1989). Attentional resources and intelligence: A disappearing link.
Personality and Individual Differences, 10(9), 957-968.
Stankov, L. (1994). The complexity effect phenomenon is an epiphenomenon of age-related fluid intelligence decline. Personality and Individual Differences, 16(2),
265-288.
Stankov, L. (1998). Calibration curves, scatterplots and the distinction between general
knowledge and perceptual tasks. Learning and Individual Differences, 10(1), 29-50.
Stankov, L. (2000). Complexity, metacognition and fluid intelligence. Intelligence,
28(2), 121-143.
Stankov, L., Boyle, G. J., & Cattell, R. B. (1995). Models and paradigms in personality
and intelligence research. In D. H. Saklofske & M. Zeidner (Eds.), International
Handbook of Personality and Intelligence (pp. 15-43). New York: Plenum Press.
Stankov, L., & Crawford, J. D. (1993). Ingredients of complexity in fluid intelligence.
Learning and Individual Differences, 5(2), 73-111.
Stankov, L., & Cregan, A. (1993). Quantitative and qualitative properties of an
intelligence test: Series completion. Learning and Individual Differences, 5(2),
137-169.
Stankov, L., & Raykov, T. (1995). Modeling complexity and difficulty in measures of
fluid intelligence. Structural Equation Modeling, 2(4), 335-366.
Stankov, L., Roberts, R., & Spilsbury, G. (1994). Attention and speed of test-taking in
intelligence and aging. Personality and Individual Differences, 17(2), 273-284.
Sternberg, R. J. (1977). Intelligence, information processing, and analogical reasoning:
The componential analysis of human abilities. Hillsdale, NJ: Lawrence Erlbaum
Associates.
Sternberg, R. J. (1984). Facets of human intelligence. In J. R. Anderson & S. M.
Kosslyn (Eds.), Tutorials in Learning and Memory: Essays in honour of Gordon
Bower. NY: W.H. Freeman & Company.
Sternberg, R. J. (1985). Beyond IQ: A triarchic theory of human intelligence.
Cambridge: Cambridge University Press.
Sternberg, R. J. (1990). Metaphors of mind: Conceptions of the nature of intelligence.
NY: Cambridge University Press.
Sternberg, R. J. (1994). Intelligence. In R. J. Sternberg (Ed.), Thinking and Problem
Solving. San Diego: Academic Press.
Sternberg, R. J., & Weil, E. M. (1980). An aptitude x strategy interaction in linear
syllogistic reasoning. Journal of Educational Psychology, 72(2), 226-239.
Summers, J. J., & Pressing, J. (1994). Coordinating the two hands in polyrhythmic
tapping, Interlimb Coordination: Neural, Dynamical, and Cognitive Constraints
(pp. 571-593).
Sweller, J. (1998). Can we measure working memory without contamination from
knowledge held in long-term memory? Behavioral and Brain Sciences, 24(6),
845-846.
Tabachnick, B. G., & Fidell, L. S. (1989). Using Multivariate Statistics (2nd ed.). NY:
Harper Collins.
Thomson, G. H. (1941). The use of the Latin square in designing educational
experiments. British Journal of Educational Psychology, 11, 135-137.
Toms, M., Morris, N., & Ward, D. (1993). Working memory and conditional reasoning.
Quarterly Journal of Experimental Psychology, 46A(4), 679-699.
Waltz, J. A., Knowlton, B. J., Holyoak, K. J., Boone, K. B., Mishkin, F. S., de Menezes
Santos, M., Thomas, C. R., & Miller, B. L. (1999). A system for relational
reasoning in human prefrontal cortex. Psychological Science, 10(2), 119-125.
Ward, G., Roberts, M. J., & Phillips, L. H. (2001). Task-switching costs, Stroop-costs,
and executive control: A correlational study. Quarterly Journal of Experimental
Psychology, 54A(2), 491-511.
Wickens, C. D. (1980). The structure of attentional resources. In R. Nickerson & R.
Pews (Eds.), Attention and Performance (Vol. VIII, pp. 239-257). Hillsdale, NJ:
Lawrence Erlbaum.
Wickens, C. D. (1984). Processing resources in attention. In R. Parasuraman & D. R.
Davies (Eds.), Varieties of Attention. NY: Academic Press.
Wickens, C. D. (1991). Processing resources and attention. In D. L. Damos (Ed.),
Multiple-Task Performance (pp. 3-34). London: Taylor & Francis.
Wright, B. D. (1999). Fundamental measurement for psychology. In S. E. Embretson &
S. L. Hershberger (Eds.), The New Rules of Measurement: What Every
Psychologist and Educator Should Know. Mahwah, NJ: Lawrence Erlbaum
Associates.
Wright, B. D., Linacre, J., Gustafson, J.-E., & Martin-Lof, P. (1994). Reasonable mean-square fit values. Rasch Measurement Transactions, 8:3, 370.
Yee, P. L., Hunt, E., & Pellegrino, J. W. (1991). Coordinating cognitive information:
task effects and individual differences in integrating information from several
sources. Cognitive Psychology, 23, 615-680.
Yee, P. L., Laden, B., & Hunt, E. (1994). The coordination of compensatory tracking
and anticipatory timing tasks. Intelligence, 18, 259-287.
APPENDIX A
A.1 Relational Complexity Analysis of Knight-Knave Items
The knight-knave problems used in Chapter 3 and Chapter 6 were selected from the following items:
Ternary Items:
Item 3.1
Premise 1:
SAYS(B, AND(kt(A), kv(B))
Categorical:
kv(B),
?(A)
[Answer: A is knave]
&-SAYS(kv(B), AND(kt(A), kv(B)) → kv(A)
Item 3.2
Premise 1:
SAYS(A, AND(kt(A), kv(B))
Premise 2:
SAYS(B, kt(A))
kt(B)?
[Answer: No, B is a knave]
&-SAYS(kt(B), kt(A)) → kt(A)
&-SAYS(kt(A), AND(kt(A), kv(B)) → AND(kt(A), kv(B)
CONTRADICT(kt(B), kv(B)) → NOT(kt(B))
Item 3.3
Premise 1:
SAYS(A, kt(A))
Premise 2:
SAYS(B, kt(A))
Categorical:
kv(A)
?(B)
[Answer, B is a knave]
&-SAYS(kv(A), kt(A)) → kv(A)
AND(kv(A), SAYS(B, kt(A)) → kv(B)
Item 3.4
Premise 1:
SAYS(A, kt(A))
Premise 2:
SAYS(B, AND(kv(A), kv(B))
kt(B)?
[Answer: No, B is a knave]
&-SAYS(kt(B), AND(kv(A), kv(B))) → AND(kv(A), kv(B)
CONTRADICT(kt(B), kv(B)) → NOT(kt(B))
Item 3.5
Premise 1:
SAYS(A, kt(A))
Premise 2:
SAYS(B, AND(kv(A), kt(B))
Categorical:
kt(A)
?(B)
[Answer: B is a knave]
&-SAYS(kt(B), AND(kv(A), kt(B)) → AND(kv(A), kt(B)
CONTRADICT(kt(A), kv(A)) → NOT(kt(B))
Item 3.6
Premise 1: SAYS(A, AND(kt(A), kt(B)))
Premise 2: SAYS(B, AND(kt(A), kt(B)))
Categorical: kv(B)
?(A) [Answer: A is a knave]
&-SAYS(kv(B), AND(kt(A), kt(B))) → kv(A)
Quaternary Items:
Item 4.1
Premise 1:
SAYS(A, OR(kv(A), kv(B)
kv(B)? [Answer: Yes, B is a knave]
AND(kv(B), SAYS(A, OR(kv(A), kv(B))) → kt(A)
&-SAYS(kt(A), OR(kv(A), kv(B))) → kv(B)
Item 4.2
Premise 1:
SAYS(A, kv(B))
Premise 2:
SAYS(B, OR(kt(A), kt(B))
kv(A)? [Answer: Yes, A is a knave]
&-SAYS(kv(A), kv(B)) → kt(B)
&-SAYS(kt(B), OR(kt(A), kt(B))) → OR(kt(A), kv(A))
CONTRADICT(kt(A), kv(A)) → kv(A)
Item 4.3
Premise 1:
SAYS(A, OR(kv(A), kt(B)))
Premise 2:
SAYS(B, kt(B))
kv(B)?
[Answer, No, B is a knight]
AND(kv(B), SAYS(A, OR(kv(A), kt(B)) → OR(kt(A), kv(A))
AND(kt(B), SAYS(A, OR(kv(A), kt(B)) → AND(kt(A), kt(B))
Item 4.4
Premise 1:
SAYS(A, AND(kv(A), kt(B)))
kv(B)? [Answer: Yes, B is a knave]
AND(kv(B), SAYS(A, AND(kv(A), kt(B)) → AND(kv(A), kv(B))
&-SAYS(kv(A), AND(kv(A), kt(B)) → AND(kv(A), kv(B)
Item 4.5
Premise 1:
SAYS(A, AND(kv(A), kv(B)))
kv(B)?
[Answer: No, B is a knight]
AND(kv(B), SAYS(A, AND(kv(A), kv(B))) → OR(kv(A), kt(A))
AND(kt(B), SAYS(A, AND(kv(A), kv(B))) → kv(A)
&-SAYS(kv(A), AND(kv(A), kv(B))) → kt(B)
Item 4.6
Premise 1: SAYS(A, kt(A))
Premise 2: SAYS(B, AND(kv(A), kv(B)))
kv(B)? [Answer, Yes, B is a knave]
&-SAYS(kv(B), AND(kv(A), kv(B))) → AND(kt(A), kv(B))
&-SAYS(kt(A), kt(A)) → kt(A)
&-SAYS(kt(B), AND(kv(A), kv(B))) → AND(kv(A), kv(B))
CONTRADICT(kt(B), kv(B)) → kv(B)
Item 4.7
Premise 1: SAYS(A, OR(kv(A), kv(B)))
Premise 2: SAYS(B, OR(kv(A), kt(B)))
?(B) [Answer, B is a knave]
&-SAYS(kt(B), OR(kv(A), kt(B)) → AND(kt(B), OR(kt(A), kv(A)))
&-SAYS(kv(B), OR(kv(A), kt(B)) → AND(kt(A), kv(B))
&-SAYS(kt(A), OR(kv(A), kv(B)) → AND(kt(A), kv(B))
Item 4.8
Premise 1: SAYS(A, AND(kt(A), kv(B)))
Premise 2: SAYS(B, OR(kv(A), kt(B)))
kv(A)? [Answer, Impossible to tell]
&-SAYS(kv(A), AND(kt(A), kv(B))) → kv(A)
&-SAYS(kt(A), AND(kt(A), kv(B)) → AND(kt(A), kv(B))
AND(kt(A), kv(A)) → indeterminate
Item 4.9
Premise 1: SAYS(A, kv(B))
Premise 2: SAYS(B, AND(kv(A), kt(B)))
?(A) [Answer, Impossible to tell]
&-SAYS(kt(A), kv(B)) → kv(B)
&-SAYS(kv(B), AND(kv(A), kt(B))) → OR(kt(A), kv(A))
&-SAYS(kv(A), kv(B)) → kt(B)
&-SAYS(kt(B), AND(kv(A), kt(B))) → AND(kv(A), kt(B))
AND(kv(A), kt(A)) → indeterminate
Item 4.10
Premise 1: SAYS(A, OR(kt(A), kt(B)))
Premise 2: SAYS(B, OR(kv(A), kt(B)))
Categorical: kt(A)
?(B) [Answer, Impossible to tell]
&-SAYS(kt(A), OR(kt(A), kt(B)) → OR(kt(B), kv(B)
&-SAYS(kt(B), OR(kv(A), kt(B)) → kt(B))
&-SAYS(kv(B), OR(kv(A), kt(B)) → AND(kt(A), kv(B))
AND(kv(B), kt(B)) → indeterminate
The rationale for the second-last step is as follows. If we assume B is a knave, then what he says must be false. Since B has made a disjunctive statement, both components need to be false. His statement that he is a knight is consistent with this (if he is actually a knave and says he is a knight, the claim is false). We know A is a knight, since this was given, so B's claim that A is a knave is also false. This satisfies the supposition that B is a knave. The difficulty people may have is that if they work through the problem, confirm that B's statement is consistent with one that might be made by a knight, but fail to check further, they will miss the indeterminacy. The same applies if they work through the problem assuming B is a knave but fail to go further. This is consistent with the response time data: the average time required to provide the indeterminate response is nearly 10 seconds longer than for either the kt(B) or kv(B) response.
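The indeterminacy of Item 4.10 can also be verified by exhaustively checking every possible assignment of knight or knave to A and B against the premises. The following Python sketch is an illustration added here, not part of the original task analysis; it simply encodes the two premises and the categorical premise of Item 4.10.

```python
from itertools import product

def says(speaker_is_knight, statement_is_true):
    # A knight's statement must be true; a knave's statement must be false.
    return statement_is_true if speaker_is_knight else not statement_is_true

# Item 4.10
#   Premise 1:   SAYS(A, OR(kt(A), kt(B)))
#   Premise 2:   SAYS(B, OR(kv(A), kt(B)))
#   Categorical: kt(A)
consistent_status_of_B = set()
for a_is_knight, b_is_knight in product([True, False], repeat=2):
    premise1 = says(a_is_knight, a_is_knight or b_is_knight)
    premise2 = says(b_is_knight, (not a_is_knight) or b_is_knight)
    categorical = a_is_knight
    if premise1 and premise2 and categorical:
        consistent_status_of_B.add("knight" if b_is_knight else "knave")

# Both statuses of B survive the premises, so the correct answer is
# "impossible to tell".
print(consistent_status_of_B)
```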
A.2 Process Analysis Based On Exhaustive Strategy
The following analyses follow the heuristic outlined in Figure 3.1 which was adapted
from Rips (1989) by Byrne, Handley, and Johnson-Laird (1995). The premises for each
question are listed in Table 3.1.
Item 3.1
kv(B) in P1
→ NOT(kt(A), NOT(kv(B))
[i.e., kv(A), kt(B)]
→ inconsistent: known kv(B) ≠ derived kt(B)
→ NOT(kt(A)), kv(B)))
(1)
(2)
(3)
→ consistent: nothing to follow up
(4)
→ AND(kt(A), NOT(kv(B))
(5)
→ inconsistent known kv(B) ≠ derived kt(B)
NOT(kt(A)) → kv(A)
(6)
(7)
Item 3.2
kt(A) in P1
kv(A) in P1
→ kt(A), kv(B)
(1)
→ kv(B) in P2 → kv(A)
(2)
→ inconsistent assumed kt(A) ≠ derived kv(A)
(3)
→ NOT(kt(A), NOT(kv(B))
→ kt(B) in P2 → kt(A)
[i.e., kv(A), kt(B)]
[i.e., kt(A)]
→ inconsistent: assumed kv(A) ≠ derived kt(A)
→ NOT(kt(A), kv(B))
[i.e., kv(A), kv(B)]
(5)
(6)
(7)
→ kv(B) in P2 → NOT(kt(A)) [i.e., kv(A)]
(8)
→ consistent: nothing to follow up
(9)
→ kt(A), NOT(kv(B))
[i.e., kt(A), kt(B)]
→ inconsistent: assumed kv(A) ≠ derived kt(A)
kv(B)
(4)
(10)
(11)
(12)
Item 3.3
kv(A) in P1
kv(A) in P2
→ NOT(kt(A))
[i.e., kv(A)]
(1)
→ consistent: nothing to follow up
(2)
→ kv(B)
(3)
→ consistent: nothing to follow up
(4)
kv(B)
(5)
Item 3.4
kt(A) in P1
kv(A) in P1
kt(B) in P2
kv(B) in P2
→ kt(A)
(1)
→ consistent – nothing to follow up
(2)
→ kv(A)
(3)
→ consistent – nothing to follow up
(4)
→ kv(A), kv(B)
(5)
→ inconsistent: assumed kt(B) ≠ derived kv(B)
(6)
→ NOT(kv(A)), NOT(kv(B)
[i.e., kt(A), kt(B)]
→ inconsistent: assumed kv(B) ≠ derived kt(B)
→ NOT(kv(A)), kv(B)
→ kt(A) in P1 → kt(A)
→ consistent – nothing to follow up
(7)
(8)
[i.e., kt(A), kv(B)]
(9)
(10)
(11)
→ kv(A), NOT(kv(B))
(12)
→ inconsistent: assumed kv(B) ≠ derived kt(B)
(13)
kv(B) → NOT(kt(B)) → NO
(14)
Item 3.5
kt(A) in P1
kt(B) in P2
kv(B) in P2
→ kt(A)
(1)
→ consistent: nothing to follow up
(2)
→ kv(A), kt(B)
(3)
→ inconsistent: known kt(A) ≠ derived kv(A)
(4)
→ NOT(kv(A)), NOT(kt(B))
[i.e., kt(A), kv(B)]
(5)
→ consistent: nothing to follow up
(6)
→ NOT(kv(A)), kt(B)
(7)
→ inconsistent: assumed kv(B) ≠ derived kt(B)
(8)
→ kv(A), NOT(kt(B))
(9)
→ inconsistent: known kt(A) ≠ derived kv(B)
(10)
kv(B)
(11)
Item 4.1
kt(A) in P1
→ kv(A), kv(B)
(1)
→ inconsistent: assumed kt(A) ≠ derived kv(A)
(2)
→ kv(A), NOT(kv(B)
[i.e., kv(A), kt(B)]
→ inconsistent: assumed kt(A) ≠ derived kv(A)
→ NOT(kv(A)), kv(B)
→ consistent: nothing to follow up
kv(B)
[i.e., kt(A), kv(B)]
(3)
(4)
(5)
(6)
(7)
Item 4.2
kt(A) in P1
→ kv(B)
(1)
→ kv(B) in P2 → NOT(kt(A)), NOT(kt(B))
[i.e., kv(A), kv(B)]
→ inconsistent: assumed kt(A) ≠ derived kv(A)
kv(A) in P1
(2)
(3)
→ kt(B)
(4)
→ kt(B) in P2 → kt(A), kt(B)
(5)
→ inconsistent: assumed kv(A) ≠ derived kt(A) (6)
→ NOT(kt(A)), kt(B)
[i.e., kv(A), kt(B)]
→ consistent: nothing to follow up
→ kt(A), NOT(kt(B))
[i.e., kt(A), kv(B)]
→ inconsistent: assumed kt(B) ≠ derived kv(B)
→ inconsistent: assumed kv(A) ≠ derived kt(A)
(7)
(8)
(9)
(10)
(11)
kv(A), yes
(12)
Item 4.3
kt(A) in P1
→ kv(A), kt(B)
(1)
→ inconsistent: assumed kt(A) ≠ derived kt(B)
(2)
→ NOT(kv(A)), kt(B)
(3)
[i.e., kt(A), kt(B)]
→ consistent
(4)
→ kt(B) in P2 → kt(B)
(5)
→ consistent: nothing to follow up
(6)
[i.e., kv(A), kv(B)]
(7)
→ kv(A), NOT(kt(B))
→ inconsistent: assumed kt(A) ≠ derived kt(B)
kv(A) in P1
→ NOT(kv(A)), NOT(kt(B))
(8)
[i.e., kt(A), kv(B)]
→ inconsistent: assumed kv(A) ≠ derived kt(A)
kt(B)
(9)
(10)
(11)
Item 4.4
kt(A) in P1
kv(A) in P1
→ kv(A), kt(B)
(1)
→ inconsistent: assumed kt(A) ≠ derived kv(A)
(2)
→ NOT(kv(A)), NOT(kt(B))
[i.e., kt(A), kv(B)]
→ inconsistent: assumed kv(A) ≠ derived kt(A)
→ NOT(kv(A)), kt(B)
(3)
(4)
[i.e., kt(A), kt(B)]
(5)
→ inconsistent: assumed kv(A) ≠ derived kt(A)
(6)
→ kv(A), NOT(kt(B))
(7)
→ consistent: nothing to follow up
(8)
NOT(kt(B)) → kv(B) → yes
(9)
Item 4.5
kt(A) in P1
kv(A) in P1
→ kv(A), kv(B)
(1)
→ inconsistent: assumed kt(A) ≠ derived kv(A)
(2)
→ NOT(kv(A)), NOT(kv(B))
[i.e., kt(A), kt(B)]
→ inconsistent: assumed kv(A) ≠ derived kt(A)
→ NOT(kv(A), kv(B)
(4)
[i.e., kt(A), kv(B)]
→ inconsistent: assumed kt(A) ≠ derived kv(A)
→ kv(A), NOT(kv(B)
→ consistent: nothing to follow up
NOT(kv(B)) → kt(B) → no
(3)
(5)
(6)
[i.e., kv(A), kt(B)]
(7)
(8)
(9)
APPENDIX B
B.1 Relational Complexity Analysis of Latin Square Items (18-item Test)
Each item presents a 4 × 4 Latin square (rows 1-4, columns A-D) in which a number of cells are left blank and the cell to be solved is marked with a question mark. The relational complexity analysis of the solution path for each of the 18 items is as follows.

Item 1 (Binary: 1 step): AND(C1(4), C3(2), C4(3)) > C2(1)
Item 2 (Binary: 1 step): AND(B3(4), C3(1), D3(2)) > A3(3)
Item 3 (Binary: 2 steps): AND(C1(4), C3(2), C4(3)) > C2(1); AND(C2(1), A2(4), D2(2)) > B2(3)
Item 4 (Binary: 2 steps): AND(B2(2), B3(1), B4(3)) > B1(4); AND(B1(4), C1(3), D1(2)) > A1(1). Alternative (3D, 1 step): AND(A4(2), A2(4), C1(3)) > A1(1)
Item 5 (Binary: 3 steps): AND(B1(1), B2(3), B4(4)) > B3(2); AND(A3(1), B3(2), C3(4)) > D3(3); AND(D1(4), D2(2), D3(3)) > D4(1). Alternative (1 ternary, 1 binary): AND(D1(4), D2(2), A3(1)) > D3(3); AND(D1(4), D2(2), D3(3)) > D4(1)
Item 6 (Binary: 3 steps): AND(A2(2), C2(4), D2(3)) > B2(1); AND(B2(1), B3(4), B4(3)) > B1(2); AND(B1(2), A1(3), C1(1)) > D1(4). Alternative (ternary, 2 steps): AND(A1(3), C1(3), B3(4)) > B1(2); AND(A1(3), C1(3), B1(2)) > D1(4)
Item 7 (Ternary: 1 step): AND(D1(1), D3(2), C4(4)) > D4(3)
Item 8 (Ternary: 1 step): AND(B1(1), B4(4), D2(2)) > B2(3)
Item 9 (Ternary: 2 steps): AND(D1(4), D2(2), A1(1)) > D3(3); AND(D3(3), D1(4), D2(2)) > D4(1)
Item 10 (Ternary: 3 steps): AND(A3(1), B3(4), D3(3)) > C3(2); AND(D4(4), B4(1), C3(2)) > C4(3); AND(D4(4), C4(2), B4(1)) > A4(2)
Item 11 (Ternary: 3 steps): AND(B2(2), B4(3), D3(4)) > B3(1); AND(B2(2), B3(1), B4(3)) > B1(4); AND(B1(4), C1(3), A4(2)) > A1(1)
Item 12 (Ternary: 1 step): AND(C1(4), C4(1), A2(3)) > C2(2)
Item 13 (Quaternary: 1 step): AND(A1(3), C2(1), D3(3)) > C4(3)
Item 14 (Quaternary: 2 steps): AND(B3(4), B4(1), C1(3)) > B1(2); AND(B1(2), C4(2), D3(2)) > A2(2)
Item 15 (Quaternary: 1 step): AND(C2(4), A3(4), D4(1)) > D1(4)
Item 16 (Quaternary: 1 step): AND(D1(4), C3(1), D4(3)) > D2(1). Alternative (ternary, 2 steps): AND(D1(4), D4(3), C3(1)) > D3(2); AND(D1(4), D3(2), D4(3)) > D2(1)
Item 17 (Quaternary: 1 step): AND(A1(3), B3(4), D4(4)) > C1(4)
Item 18 (Quaternary: 1 step): AND(D2(2), C1(2), B4(1)) > A4(2). Alternative: AND(C1(2), C3(3), B4(1)) > C4(4); AND(B4(1), C4(4), D2(2)) > D4(3); AND(B3(1), C4(4), D4(3)) > A4(2)
In cases where there is more than one possible solution strategy, we use the relationally less complex solution regardless of the number of steps, consistent with the definition of Halford et al. This raises doubts about the suitability of the analyses for all individuals: more able students might process an item using the higher-level relation. To avoid this problem, the item base was subsequently revised, in light of what was learned from these experiments, to include only 1- and 2-step problems constructed in a much more constrained way.
B.2 Item×Trait Group Explanation and Example
The item × trait interaction is assessed with a χ2 test of fit. The general procedure is to divide the sample into a number of subgroups that differ in calibrated ability and then to compare the actual performance of the individuals in these groups with what would be expected for subjects of their ability on the basis of the item response function. The example used here is the quaternary subtest that was discussed in Section 4.7.1.
Table B.1
Item-trait interaction statistics for quaternary subtest (items 13, 14, 15, and 17)

                Ability                   Proportion of group at each score
Group   n     max     mean                 0     1     2     3     4      χ2
  1    15   -0.168   -0.324   Observed   0.47  0.27  0.27  0.00  0.00   0.022
                              Estimated  0.48  0.25  0.22  0.04  0.00
  2    21    0.575    0.305   Observed   0.24  0.29  0.33  0.14  0.00   0.135
                              Estimated  0.24  0.23  0.37  0.14  0.02
  3    23    1.338    1.172   Observed   0.04  0.09  0.48  0.35  0.04   0.164
                              Estimated  0.04  0.10  0.40  0.36  0.09
  4    12    2.788    2.348   Observed   0.00  0.00  0.17  0.58  0.25   0.159
                              Estimated  0.00  0.01  0.16  0.45  0.38

Item location = 0.990, SE = 0.13; total χ2 (3) = .481, p = .921
The data for the quaternary composite subtest item are shown in Table B.1. Four groups were constructed, and the probability of obtaining each of the possible scores was determined as a function of each group's ability level (since there are four quaternary items, there are five possible scores, 0-4). So for Group 1, which has a mean ability of -0.324, 48% of the group are expected to obtain a score of 0 (out of 4), 25% are expected to obtain a score of 1, and so on. A chi-squared goodness-of-fit test is then calculated for each group, comparing observed and expected proportions at each of the possible scores. The rationale is that if the data fit the Rasch model, there should be little variation between what is expected from the model and what is observed in practice; large χ2 values suggest some deviation from this expectation. As an overall test of fit for the item (or subtest), the χ2 values determined for each group are summed and tested for significance (df = number of groups - 1). Low χ2 values are associated with larger p-values, and hence we generally look for non-significant p-values (i.e., greater than .05).
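As an illustration of the group-level comparison just described, the following sketch recomputes a Pearson goodness-of-fit statistic from the observed and estimated proportions in Table B.1. It is only an illustration of the logic of comparing observed with model-expected score distributions across ability groups; the item-trait statistic computed by RUMM is derived somewhat differently, so these values need not reproduce the χ2 column of the table exactly.

```python
import numpy as np
from scipy.stats import chi2

# Group sizes and the observed / model-estimated proportions at each
# score (0-4), transcribed from Table B.1.
n = np.array([15, 21, 23, 12])
observed = np.array([[0.47, 0.27, 0.27, 0.00, 0.00],
                     [0.24, 0.29, 0.33, 0.14, 0.00],
                     [0.04, 0.09, 0.48, 0.35, 0.04],
                     [0.00, 0.00, 0.17, 0.58, 0.25]])
estimated = np.array([[0.48, 0.25, 0.22, 0.04, 0.00],
                      [0.24, 0.23, 0.37, 0.14, 0.02],
                      [0.04, 0.10, 0.40, 0.36, 0.09],
                      [0.00, 0.01, 0.16, 0.45, 0.38]])

obs_counts = observed * n[:, None]
exp_counts = estimated * n[:, None]

# Pearson goodness-of-fit component for each ability group, skipping score
# categories with an expected frequency of zero.
contrib = np.zeros_like(obs_counts)
mask = exp_counts > 0
contrib[mask] = (obs_counts[mask] - exp_counts[mask]) ** 2 / exp_counts[mask]
group_chi2 = contrib.sum(axis=1)

# Overall test of fit: sum the group values, df = number of groups - 1.
total_chi2 = group_chi2.sum()
df = len(n) - 1
p_value = chi2.sf(total_chi2, df)
print(group_chi2, total_chi2, p_value)
```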
B.3 Regression Analysis of Response Times
The complete regression analyses for the correct RT and overall RT of the Latin Square
Task from section 4.6 are provided in the tables below. The statistics on the left hand
side are for the analysis of overall response time as the dependent variable. Those on the
right are for correct response time as the dependent variable. Models A through to D
analyze mean item response time and models E through to H consider differences in the
variability (standard deviation) in response time.
Table B.2
Regression analyses of correct and overall response time in the Latin Square Task
For each predictor the columns are β, t, sig, pr, and sr.

Model A - Overall RT
  RC           0.78   5.08  0.00  0.61  0.80
  STEPS        0.58   3.80  0.00  0.35  0.70
  R2 = 0.68, F(2, 15) = 15.73, p < .001

Model A - Correct RT
  RC           0.81   5.57  0.00  0.64  0.82
  STEPS        0.58   3.97  0.00  0.33  0.72
  R2 = 0.711, F(2, 15) = 18.41, p < .001

Model B - Overall RT
  RC           0.86   5.51  0.00  0.83  0.77
  STEPS        0.63   4.18  0.00  0.75  0.59
  RC x STEPS   0.23   1.53  0.15  0.38  0.22
  R2 change = 0.05, F(1, 14) = 2.35, p = 0.15

Model B - Correct RT
  RC           0.89   6.11  0.00  0.85  0.80
  STEPS        0.62   4.43  0.00  0.76  0.58
  RC x STEPS   0.23   1.66  0.12  0.41  0.22
  R2 change = 0.05, F(1, 14) = 2.76, p = 0.12

Model C - Overall RT
  RC           0.98   6.68  0.00  0.87  0.82
  STEPS        0.32   2.00  0.06  0.47  0.24
  IC          -0.50  -2.77  0.02 -0.59 -0.34
  R2 change = 0.11, F(1, 14) = 7.66, p = 0.02

Model C - Correct RT
  RC           0.96   6.31  0.00  0.86  0.80
  STEPS        0.38   2.29  0.04  0.52  0.29
  IC          -0.38  -2.00  0.07 -0.47 -0.25
  R2 change = 0.06, F(1, 14) = 3.99, p = 0.07

Model D - Overall RT
  RC2          0.49   2.97  0.01  0.62  0.36
  RC3          0.92   6.30  0.00  0.86  0.77
  RC4          0.97   5.76  0.00  0.84  0.70
  R2 = 0.79, F(3, 14) = 17.57, p < .001

Model D - Correct RT
  RC2          0.52   3.38  0.00  0.67  0.38
  RC3          0.90   6.67  0.00  0.87  0.75
  RC4          1.04   6.68  0.00  0.87  0.75
  R2 = 0.82, F(3, 14) = 21.42, p < .001

Model E - SD Overall RT
  RC           0.60   2.72  0.02  0.52  0.57
  STEPS        0.26   1.18  0.26  0.08  0.29
  R2 = 0.34, F(2, 15) = 3.77, p = 0.05

Model E - SD Correct RT
  RC           0.70   3.50  0.00  0.64  0.67
  STEPS        0.21   1.05  0.31  0.00  0.26
  R2 = 0.45, F(2, 15) = 6.11, p = 0.01

Model F - SD Overall RT
  RC           0.69   3.02  0.01  0.63  0.63
  STEPS        0.31   1.41  0.18  0.35  0.29
  RC x STEPS   0.27   1.24  0.24  0.31  0.26
  R2 change = 0.07, F(1, 14) = 1.54, p = 0.24

Model F - SD Correct RT
  RC           0.78   3.69  0.00  0.70  0.70
  STEPS        0.25   1.24  0.23  0.32  0.24
  RC x STEPS   0.22   1.10  0.29  0.28  0.21
  R2 change = 0.04, F(1, 14) = 1.22, p = 0.29

Model G - SD Overall RT
  RC           0.82   3.50  0.00  0.68  0.68
  STEPS       -0.03  -0.11  0.91 -0.03 -0.02
  IC          -0.55  -1.90  0.08 -0.45 -0.37
  R2 change = 0.14, F(1, 14) = 3.61, p = 0.08

Model G - SD Correct RT
  RC           0.88   4.03  0.00  0.73  0.73
  STEPS       -0.02  -0.10  0.92 -0.03 -0.02
  IC          -0.45  -1.65  0.12 -0.40 -0.30
  R2 change = 0.09, F(1, 14) = 2.73, p = 0.12

Model H - SD Overall RT
  RC2          0.09   0.37  0.72  0.10  0.07
  RC3          0.65   2.89  0.01  0.61  0.54
  RC4          0.63   2.43  0.03  0.54  0.46
  R2 = 0.50, F(3, 14) = 4.71, p = 0.02

Model H - SD Correct RT
  RC2          0.05   0.22  0.83  0.06  0.04
  RC3          0.61   2.95  0.01  0.62  0.51
  RC4          0.72   3.04  0.01  0.63  0.52
  R2 = 0.59, F(3, 14) = 6.57, p = 0.01

Key to Tables: RC = relational complexity; STEPS = number of processing steps; IC = number of incomplete cells; RC2, RC3, RC4 = number of binary, ternary, quaternary processes respectively. pr = partial correlation; sr = semi-partial correlation; SD = standard deviation; RT = response time.
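The hierarchical structure of Models A to C (Model A: RC and STEPS; Model B: adding the RC × STEPS interaction; Model C: adding the number of incomplete cells) can be sketched as follows. The data frame and its column names (RT, RC, STEPS, IC) are placeholders standing in for the item-level values; this is an illustrative sketch, not the software actually used for the thesis analyses.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def latin_square_rt_models(items: pd.DataFrame):
    # `items` is assumed to contain one row per Latin Square item with columns
    # RT (mean response time), RC (relational complexity), STEPS (number of
    # processing steps) and IC (number of incomplete cells).
    model_a = smf.ols("RT ~ RC + STEPS", data=items).fit()
    model_b = smf.ols("RT ~ RC * STEPS", data=items).fit()       # adds RC x STEPS
    model_c = smf.ols("RT ~ RC + STEPS + IC", data=items).fit()  # adds incomplete cells

    # R-squared-change tests: does each added predictor improve on Model A?
    b_versus_a = anova_lm(model_a, model_b)
    c_versus_a = anova_lm(model_a, model_c)
    return model_a, model_b, model_c, b_versus_a, c_versus_a
```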
B.4 Analysis of Composite Accuracy and Response Time Scores
To consider the possibility that reliable variation between subjects was overlooked in the item-based analyses of difficulty and response time in the Latin Square Task reported in Section 4.7, a series of one-way repeated measures ANOVAs was conducted on composite scores generated by aggregating accuracy (percent correct), response time, and correct response time at each level of complexity. The descriptive statistics are summarised in Table B.3. The composite correct-response-time score was derived for each subject from only those items they answered correctly, so a student had to answer at least one item correctly at a given level of complexity to obtain a score on this measure. All 71 subjects answered at least one binary and one ternary item correctly; however, 13 students failed to correctly answer any items classified as quaternary. As a result, the quaternary correct-response-time score is based on 58 students.
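A minimal sketch of this aggregation, assuming a long-format trial table with hypothetical column names (subject, complexity, correct, rt), is given below; it simply illustrates the aggregation rule, including the treatment of subjects with no correct responses at a given level.

```python
import numpy as np
import pandas as pd

def complexity_composites(trials: pd.DataFrame) -> pd.DataFrame:
    # `trials` is assumed to hold one row per subject x item with columns
    # subject, complexity ("binary", "ternary", "quaternary"), correct (0/1)
    # and rt (response time in seconds).
    def summarise(group):
        correct_rt = group.loc[group["correct"] == 1, "rt"]
        return pd.Series({
            "accuracy": group["correct"].mean(),
            "response_time": group["rt"].mean(),
            # Subjects with no correct item at this level receive a missing
            # value, as with the 13 subjects who answered no quaternary item
            # correctly.
            "correct_response_time": correct_rt.mean() if len(correct_rt) else np.nan,
        })
    return trials.groupby(["subject", "complexity"]).apply(summarise)
```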
Table B.3
Means and standard deviations (in parentheses) for the relational complexity composite measures

Measure                      Binary            Ternary           Quaternary
Accuracy (% correct)         0.839 (0.202)     0.748 (0.177)     0.458 (0.293)
Response Time                15.411 (9.618)    26.949 (17.223)   34.615 (24.659)
Correct Response Time a      14.630 (8.221)    25.460 (16.187)   35.369 (27.677)
a Quaternary mean based on N = 58
Accuracy: There was a significant effect of relational complexity on the accuracy measure (the proportion of items at each level of complexity answered correctly), F(2, 140) = 73.24, p < .001, η2 = .51. (The sphericity assumption was violated for all analyses; adjustments did not change the level of significance, so the unadjusted statistics are reported.) Helmert contrasts indicated that binary items were significantly easier than ternary and quaternary items, F(1, 70) = 82.12, p < .001, η2 = .54, and ternary items were significantly easier than quaternary items, F(1, 70) = 66.85, p < .001, η2 = .49.

Overall Response Time: There was also a significant effect of relational complexity on the overall response time measure, F(2, 140) = 41.59, p < .001, η2 = .37. Helmert contrasts indicated that binary items were significantly easier than ternary and quaternary items, F(1, 70) = 59.82, p < .001, η2 = .46, and ternary items were significantly easier than quaternary items, F(1, 70) = 15.80, p < .001, η2 = .18.

Correct Response Time: Correct response time also differed as a function of relational complexity, F(2, 140) = 27.48, p < .001, η2 = .33. Helmert contrasts indicated once again that binary items were significantly easier than ternary and quaternary items, F(1, 70) = 47.34, p < .001, η2 = .45, and that ternary items were significantly easier than quaternary items, F(1, 70) = 8.38, p = .005, η2 = .13.
Summary: These findings indicated that binary items were answered more accurately
and quickly than ternary items, and that ternary items were answered more accurately
and quickly than quaternary items. This is consistent with the regression analyses
reported in sections 4.5 and 4.6.
B.5 Revised Latin Square Task Item Pool
Following is the revised list of Latin Square items used in Chapters 5, 6, and 7
represented in generic form. The actual items used are provided in Section B.6.
The relational complexity analysis for each item is as follows.

Item 1 (Binary: 1 step): AND(C1(4), C3(2), C4(3)) > C2(1)
Item 2 (Binary: 1 step): AND(B3(4), C3(1), D3(2)) > A3(3)
Item 3 (Binary: 2 steps): AND(C1(4), C3(2), C4(3)) > C2(1); AND(C2(1), A2(4), D2(2)) > B2(3)
Item 7 (Ternary: 1 step): AND(D1(1), D3(2), C4(4)) > D4(3)
Item 8 (Ternary: 1 step): AND(B1(1), B4(4), D2(2)) > B2(3)
Item 9 (Ternary: 2 steps): AND(D1(4), D2(2), A1(1)) > D3(3); AND(D3(3), D1(4), D2(2)) > D4(1)
Item 12 (Ternary: 1 step): AND(C1(4), C4(1), A2(3)) > C2(2)
Item 14 (Quaternary: 2 steps): AND(B3(4), B4(1), C1(3)) > B1(2); AND(B1(2), C4(2), D3(2)) > A2(2)
Item 15 (Quaternary: 1 step): AND(C2(4), A3(4), D4(1)) > D1(4)
Item 17 (Quaternary: 1 step): AND(A2(4), B3(4), D4(4)) > C1(4)
Item 19 (Binary: 1 step): AND(B1(2), B3(4), B4(3)) > B2(1)
Item 20 (Binary: 1 step): AND(A1(1), B1(4), C1(3)) > D1(2)
Item 21 (Binary: 1 step): AND(B1(3), B2(4), B4(1)) > B3(2)
Item 22 (Binary: 1 step): AND(B3(1), C3(3), D3(2)) > A3(4)
Item 23 (Binary: 2 steps): AND(B2(2), B3(1), B4(3)) > B1(4); AND(B1(4), C1(3), D1(2)) > A1(1)
Item 24 (Binary: 2 steps): AND(A1(2), B1(3), D1(4)) > C1(1); AND(C1(1), C2(2), C4(4)) > C3(3)
Item 25 (Binary: 2 steps): AND(A2(1), B2(3), C2(2)) > D2(4); AND(D1(3), D2(4), D3(1)) > D4(2)
Item 26 (Binary: 2 steps): AND(A3(1), B3(2), C3(4)) > D3(3); AND(D1(4), D2(2), D3(3)) > D4(1)
Item 27 (Binary: 2 steps): AND(B2(1), B3(4), B4(3)) > B1(2); AND(B1(2), A1(3), C1(1)) > D1(4)
Item 29 (Ternary: 1 step): AND(B2(4), B4(1), D3(3)) > B3(2)
Item 30 (Ternary: 1 step): AND(A4(1), A3(2), C1(4)) > A1(3)
Item 31 (Ternary: 1 step): AND(C1(4), C4(3), D3(1)) > C3(2)
Item 32 (Ternary: 2 steps): AND(B1(4), B3(2), A4(1)) > B4(3); AND(B1(4), B3(2), B4(3)) > B2(1)
Item 33 (Ternary: 2 steps): AND(D4(4), B4(1), C3(2)) > C4(3); AND(D4(4), C4(2), B4(1)) > A4(2)
Item 34 (Ternary: 2 steps): AND(B2(2), B3(1), B4(3)) > B1(4); AND(A4(2), B1(4), C1(3)) > A1(1)
Item 35 (Ternary: 2 steps): AND(C2(4), C4(1), D3(3)) > C3(2); AND(A3(1), C3(2), D3(3)) > B3(4)
Item 36 (Ternary: 2 steps): AND(A2(3), B2(4), D3(2)) > D2(1); AND(D1(4), D2(1), D3(2)) > D4(3)
Item 37 (Quaternary: 1 step): AND(B4(4), A2(1), D1(4)) > A3(4)
Item 38 (Quaternary: 1 step): AND(A1(1), B2(3), C3(1)) > B4(1)
Item 39 (Quaternary: 1 step): AND(A1(3), C2(3), D4(4)) > D3(3)
Item 41 (Quaternary: 2 steps): AND(C2(4), D2(1), A3(3)) > A2(2); AND(C4(2), D3(2), A2(2)) > B1(2)
Item 42 (Quaternary: 2 steps): AND(C1(4), C3(3), D2(1)) > C2(2); AND(A3(2), B4(2), C2(2)) > D1(2)
Item 43 (Quaternary: 2 steps): AND(C1(2), B2(3), B4(4)) > B1(1); AND(B1(1), D2(2), A3(1)) > D4(1)
Item 44 (Quaternary: 2 steps): AND(D1(2), C2(1), C4(4)) > C1(3); AND(C1(3), D2(3), B4(3)) > A3(3)
Item 45 (Quaternary: 2 steps): AND(A4(2), B3(1), D3(4)) > A3(3); AND(A3(3), B4(3), D2(3)) > C1(3)
Item 46 (Quaternary: 1 step): AND(C2(4), B3(4), D4(1)) > D1(4)
B.6 Actual Latin Square Task items used in Chapters 5, 6, and 7
Each of the following items was presented as a 4 × 4 grid with the cell to be solved marked by a question mark: Items 1, 2, 3, 7, 8, 9, 12, 14, 15, 17, 19, 20, 21, 22, 23, 24, 25, 26, 27, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 41, 42, 43, 44, 45, and 46.
APPENDIX C
C.1 Analysis of correct response time to the Latin Square tasks in the finger tapping
experiment
The analysis of the mean response times for correctly answered Latin Square items (plotted in
Figure 5.9) resulted in the same interpretation as for the overall response time measure (one
subject was omitted from the analysis for failing to answer any of the quaternary items
correctly). The assumption of sphericity was violated for the complexity main-effect and the
interaction, however appropriate adjustment did not alter the interpretation. The unadjusted
values are reported and indicate that the main effect of complexity was significant, F(2, 32) =
12.70, MSe = 278.223, p < .001. The correct response times on binary items (M = 7.99s) were
shorter than for ternary items (M = 14.31s), which were shorter than quaternary items
(M = 27.94s), p < .01 in all cases. The main-effect for task was also significant, F(1, 16) =
11.49, MSe = 227.20, p = .004, such that averaging across levels of complexity, correct
response times were shorter in the dual-task condition (M = 11.69s) than the single-task
condition (M = 21.80s). These effects are qualified by a significant interaction between the
factors, F(2, 32) = 5.49, MSe = 139.86, p = .009. As for the overall response times, the
difference between the single- and dual-task conditions becomes more pronounced as the
complexity of the item increases. The quite large variation in response time for the quaternary
items accounts for the task condition effect being less reliable here than at the binary or
ternary levels. To test this trend, a difference score was computed for each subject at each
level of complexity by subtracting correct dual-task response time from correct single-task
response time. A one-way repeated measures ANOVA on this score indicated that the task
effect (difference between single- and dual-task correct response time) for binary items was
marginally weaker than for ternary items (F(1, 16) = 4.22, MSe = 116.49, p = .057) and
significantly weaker than the effect for quaternary items, F(1, 16) = 6.20, MSe = 936.11, p =
.024. There was also a significant difference between the task effect for ternary items and the
task effect for quaternary items, F(1, 16) = 4.66, MSe = 625.66, p = .046. As the complexity
of the Latin Square item increased, the difference between single- and dual-task conditions
became more pronounced.
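The trend test described above amounts to comparing the single-minus-dual difference scores across complexity levels. A minimal sketch is given below; the arrays and their layout are assumptions for illustration, and each paired comparison corresponds to an F(1, n - 1) contrast in the repeated measures ANOVA, with F equal to the squared paired t value.

```python
import numpy as np
from scipy.stats import ttest_rel

def task_effect_contrasts(single_rt: np.ndarray, dual_rt: np.ndarray):
    # Both arrays are assumed to be subjects x 3 matrices of correct response
    # times, with columns ordered binary, ternary, quaternary.
    diff = single_rt - dual_rt  # task effect at each level of complexity

    # Pairwise contrasts on the difference scores.
    binary_vs_ternary = ttest_rel(diff[:, 0], diff[:, 1])
    binary_vs_quaternary = ttest_rel(diff[:, 0], diff[:, 2])
    ternary_vs_quaternary = ttest_rel(diff[:, 1], diff[:, 2])
    return binary_vs_ternary, binary_vs_quaternary, ternary_vs_quaternary
```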
APPENDIX D
D.1 Descriptive Statistics for Composite Progressive Matrices Test
Table D.1
Descriptive statistics for traditional and Rasch estimates of ability (standard deviations in
parenthesis) of performance on the Progressive Matrices composite
Classical Statistics
Test Item
Standard (item E1)
Standard (item E2)
Standard (item E3)
Standard (item E4)
Standard (item E5)
Standard (item E6)
Standard (item E7)
Standard (item E8)
Standard (item E9)
Standard (item E10)
Standard (item E11)
Standard (item E12)
Adv. Set II (item 1)
Adv. Set II (item 5)
Adv. Set II (item 6)
Adv. Set II (item 7)
Adv. Set II (item 9)
Adv. Set II (item 10)
Adv. Set II (item 11)
Adv. Set II (item 13)
Adv. Set II (item 16)
Adv. Set II (item 18)
Adv. Set II (item 19)
Adv. Set II (item 21)
Adv. Set II (item 24)
Adv. Set II (item 27)
Adv. Set II (item 28)
Adv. Set II (item 29)
Adv. Set II (item 32)
Adv. Set II (item 33)
Adv. Set II (item 34)
Adv. Set II (item 35)
a
Manual pvaluea p-value
0.85
0.78
0.84
0.78
0.83
0.76
0.76
0.61
0.67
0.56
0.60
0.45
0.30
0.26
0.26
0.20
0.17
0.27
0.17
0.18
0.93
0.94
0.92
0.89
0.95
0.90
0.76
0.68
0.57
0.46
0.31
0.32
0.94
0.90
0.95
0.90
0.94
0.82
0.93
0.60
0.80
0.53
0.63
0.57
0.35
0.35
0.30
0.21
0.21
0.40
0.30
0.39
Mean RT
13.70
22.00
19.96
24.37
17.44
34.92
51.79
42.27
46.28
63.96
76.88
77.26
36.32
33.58
17.32
25.27
15.81
31.26
21.61
35.60
26.36
42.63
45.90
45.39
61.93
65.90
88.57
59.11
77.46
80.05
56.80
44.96
(9.64)
(10.67)
(11.04)
(14.05)
(11.28)
(20.89)
(34.57)
(28.57)
(31.42)
(39.79)
(57.20)
(57.88)
(16.84)
(17.70)
(9.21)
(11.59)
(13.69)
(18.86)
(12.53)
(21.35)
(21.81)
(21.80)
(39.20)
(24.30)
(48.72)
(50.58)
(78.34)
(31.70)
(44.68)
(60.12)
(41.55)
(29.17)
Mean CRT
13.46
21.45
19.69
22.01
16.56
34.53
50.43
39.03
43.07
67.03
85.23
76.40
36.27
34.07
17.16
24.66
14.41
29.19
21.12
38.09
22.80
45.25
46.70
46.26
66.31
71.31
111.50
69.99
79.46
92.03
77.80
47.02
(9.16)
(10.25)
(10.64)
(10.81)
(9.37)
(20.98)
(32.61)
(24.94)
(30.08)
(39.63)
(73.90)
(44.82)
(15.66)
(17.54)
(9.29)
(11.11)
(10.56)
(17.71)
(12.02)
(22.40)
(13.01)
(23.48)
(35.77)
(25.68)
(51.62)
(45.17)
(78.29)
(30.07)
(43.37)
(69.28)
(47.63)
(27.90)
Rasch Statistics
Item
Std
location Error Infit outfit
-1.97
-2.20
-1.78
-1.37
-2.41
-1.40
-0.33
0.18
0.76
1.35
2.15
2.03
-2.19
-1.58
-2.51
-1.54
-2.19
-0.77
-1.90
0.56
-0.60
0.89
0.39
0.70
1.77
1.87
2.00
2.46
2.61
1.56
2.02
1.62
0.30
0.33
0.28
0.25
0.35
0.25
0.20
0.18
0.17
0.17
0.19
0.18
0.33
0.27
0.37
0.26
0.33
0.21
0.30
0.18
0.21
0.17
0.18
0.17
0.18
0.18
0.18
0.20
0.20
0.18
0.18
0.18
0.93
0.96
0.89
0.81
0.84
0.95
0.94
1.04
0.93
0.83
0.83
0.90
0.82
1.00
0.88
0.98
0.85
0.88
0.88
0.95
0.88
0.99
1.14
0.90
1.14
0.98
1.11
1.16
0.95
1.00
0.96
0.96
Proportion correct from German Sample in Manual; RT = Mean item response time; CRT =
Mean item response time for subjects who answered correctly; SE = Standard Error of the item
difficulty estimate, Infit and outfit are item fit statistics (see chapter 4)
0.71
0.70
1.08
0.71
0.70
0.99
0.91
1.04
0.89
0.81
0.80
0.89
0.98
0.89
0.48
0.72
0.67
0.77
0.76
0.96
0.84
1.18
1.23
0.81
1.12
0.90
1.27
1.71
1.62
1.00
0.97
0.94
D.2 Descriptive Statistics for the Arithmetic Reasoning Test
Table D.2
Arithmetic Reasoning Test: Traditional descriptive statistics and Rasch-based item statistics
                  Classical Statistics     Rasch Statistics
Item   ETS          p-value            logit     SE    outfit   infit
 1     RG-1:1         0.87             -1.04    0.24    1.08    1.01
 2     RG-1:3         0.81             -0.55    0.21    1.02    1.06
 3     RG-1:12        0.68              0.26    0.19    0.91    0.95
 4     RG-1:14        0.71              0.08    0.19    0.74    0.84
 5     RG-1:20        0.86             -0.95    0.24    0.86    0.94
 6     RG-2:1         0.92             -1.64    0.29    0.96    0.92
 7     RG-2:12        0.64              0.44    0.18    0.99    1.03
 8     RG-2:13        0.60              0.63    0.18    1.07    1.05
 9     RG-2:14        0.82             -0.66    0.22    0.74    0.83
10     RG-2:15        0.29              2.04    0.19    0.99    1.01
11     RG-2:16        0.78             -0.35    0.21    0.71    0.81
12     RG-2:17        0.82             -0.72    0.22    0.81    0.81
13     RG-2:19        0.81             -0.57    0.22    0.87    0.94
14     RG-2:29        0.30              1.97    0.19    1.04    0.94
15     RG-2:30        0.49              1.05    0.18    1.21    1.16
ETS = Item number from the manual for the ETS Kit of Factor-Referenced Cognitive Tests
D.3 Descriptive Statistics for the Swaps Test
Table D.3
Traditional and Rasch-based item statistics for (i) all levels and (ii) levels 3 and 4 only of the
Swaps Test
Classical Statisticsa
Level
Level 1
Level 2
Level 3
Item p-value
Mean RT
Mean CRT
Rasch Statistics (All)
logit
SE outfit
Rasch Statistics (level 3/4)
infit
logit
SE outfit
infit
1
0.90 6.33 (3.56) 6.52 (3.54) -0.60
0.30
0.80
1.02
-
-
-
-
5
0.93 6.18 (2.79) 6.28 (2.78) -1.02
0.33
1.30
1.08
-
-
-
-
9
0.93 6.81 (4.92) 6.81 (4.42) -1.00
0.33
1.78
1.00
-
-
-
-
13
0.92 6.07 (3.13) 6.07 (3.13) -0.70
0.31
1.27
1.32
-
-
-
-
17
0.92 5.36 (2.64) 5.48 (2.53) -0.90
0.32
1.39
0.98
-
-
-
-
21
0.89 5.93 (2.93) 6.09 (2.70) -0.40
0.29
1.69
0.98
-
-
-
-
25
0.96 5.12 (2.41) 5.19 (2.41) -1.70
0.41
0.78
0.98
-
-
-
-
29
0.93 5.66 (2.41) 5.71 (2.37) -1.00
0.33
1.34
0.96
-
-
-
-
33
0.89 5.74 (2.69) 5.88 (2.65) -0.40
0.28
1.44
1.12
-
-
-
-
37
0.96 5.60 (3.59) 5.65 (3.61) -1.58
0.39
1.18
1.00
-
-
-
-
41
0.91 6.51 (3.19) 6.51 (2.75) -0.70
0.30
1.05
1.14
-
-
-
-
45
0.96 5.81 (2.30) 5.84 (2.19) -2.30
0.50
0.32
0.96
-
-
-
-
2
0.84 11.52 (5.53) 12.06 (5.26)
0.01
0.25
0.90
0.93
-
-
-
-
6
0.82 13.51 (7.48) 14.31 (7.01)
0.26
0.24
0.89
0.94
-
-
-
-
10
0.89 12.02 (6.68) 12.76 (6.52) -0.60
0.30
0.50
0.93
-
-
-
-
14
0.87 9.99 (5.15) 9.95 (4.17) -0.20
0.27
1.14
0.89
-
-
-
-
18
0.87 11.41 (6.82) 12.06 (6.77) -0.30
0.27
0.80
0.86
-
-
-
-
22
0.89 12.61 (6.31) 13.02 (6.35) -0.40
0.28
1.49
0.94
-
-
-
-
26
0.85 11.94 (7.38) 12.74 (7.18)
0.00
0.25
0.83
0.96
-
-
-
-
30
0.85 13.71 (7.00) 14.65 (6.84)
0.00
0.25
1.23
1.07
-
-
-
-
34
0.89 12.66 (6.91) 13.03 (7.00) -0.50
0.29
1.11
1.03
-
-
-
-
38
0.87 12.17 (7.73) 12.73 (7.75) -0.20
0.27
1.04
1.10
-
-
-
-
42
0.90 10.93 (5.29) 11.42 (5.17) -0.70
0.31
0.94
0.93
-
-
-
-
46
0.91 12.19 (6.71) 12.67 (6.58) -0.70
0.31
1.28
1.05
-
-
-
-
3
0.86 17.52 (9.73) 18.11 (9.18) -0.20
0.27
0.95
0.89 -0.90
0.28
1.28
0.87
7
0.78 19.83 (10.20) 20.78 (9.93)
0.57
0.22
1.07
1.07
0.00
0.22
0.94
1.06
11
0.77 18.27 (10.04) 19.07 (8.64)
0.63
0.22
1.12
1.07 -0.02
0.22
1.06
1.07
15
0.78 19.44 (11.39) 21.07 (10.66)
0.54
0.22
1.14
1.09 -0.10
0.23
1.03
1.09
19
0.74 19.44 (12.54) 19.86 (11.36)
0.87
0.21
1.10
1.00
0.18
0.21
1.20
1.04
23
0.83 19.42 (12.41) 20.49 (12.16)
0.20
0.24
0.99
0.90 -0.51
0.25
1.19
0.92
27
0.77 18.65 (12.92) 20.22 (12.04)
0.65
0.22
0.75
0.90
0.00
0.22
0.74
0.89
31
0.77 20.59 (13.83) 21.79 (13.64)
0.63
0.22
1.19
1.10 -0.01
0.22
1.04
1.08
Classical Statisticsa
Level
Item p-value
3
Level 4
a
Mean RT
Rasch Statistics (All)
Mean CRT
logit
SE outfit
infit
Rasch Statistics (level 3/4)
logit
SE outfit
infit
35
0.82 19.87 (12.10) 20.53 (10.93)
0.24
0.24
1.09
1.10 -0.40
0.24
1.00
1.10
39
0.80 16.06 (8.34) 17.12 (7.05)
0.46
0.22
1.34
0.96 -0.24
0.23
1.45
1.07
43
0.85 18.24 (10.67) 18.96 (9.72)
0.01
0.25
0.92
0.99 -0.69
0.26
1.09
0.99
47
0.74 18.04 (8.60) 19.56 (8.03)
0.86
0.21
1.36
1.07
0.19
0.21
1.36
1.11
4
0.75 24.17 (15.39) 26.09 (14.07)
0.83
0.21
1.02
1.04
0.16
0.21
1.10
1.08
8
0.74 24.59 (14.48) 26.07 (12.44)
0.92
0.21
0.88
0.97
0.23
0.21
0.88
0.99
12
0.69 26.43 (17.28) 28.29 (14.26)
1.26
0.19
0.80
0.91
0.58
0.20
0.89
0.95
16
0.68 27.25 (18.02) 29.51 (13.68)
1.31
0.19
0.90
0.93
0.62
0.20
0.96
0.95
20
0.64 26.06 (14.44) 25.90 (9.94)
1.47
0.19
0.99
1.04
0.81
0.19
1.04
1.07
24
0.68 27.40 (16.52) 30.16 (13.45)
1.36
0.19
0.70
0.84
0.68
0.20
0.70
0.87
28
0.81 25.08 (14.62) 25.67 (12.20)
0.34
0.23
1.05
1.17 -0.31
0.24
1.42
1.20
32
0.80 24.05 (16.92) 25.72 (16.77)
0.47
0.23
0.66
0.82 -0.20
0.23
0.63
0.85
36
0.79 23.97 (15.39) 25.22 (12.43)
0.52
0.22
0.80
0.86 -0.10
0.23
0.76
0.90
40
0.72 24.62 (11.45) 26.49 (10.45)
1.01
0.20
0.89
0.98
0.34
0.21
0.98
1.00
44
0.72 25.54 (16.19) 26.60 (12.03)
1.08
0.20
0.86
0.86
0.36
0.21
0.92
0.92
48
0.80 23.74 (13.87) 25.74 (11.99)
0.41
0.23
0.66
0.80 -0.30
0.24
0.74
0.82
Standard deviations in parentheses
Table D.4
Swaps test stimuli and rearrangement (swap) rules
Item Swap 1
Item Swap 2
Item Swap 3
Item Swap 4
1 JKL
13-
2 LKJ
23- 12-
3 KJL
13- 12- 23-
4 LJK
13- 23- 12- 23-
5 LKJ
13-
6 KLJ
23- 13-
7 JLK
12- 23- 13-
8 KJL
12- 23- 13- 12-
9 KLJ
12-
10 KJL
13- 12-
11 JKL
13- 23- 12-
12 KLJ
23- 13- 23- 12-
13 KJL
13-
14 LKJ
12- 23-
15 LJK
23- 12- 13-
16 KJL
12- 13- 12- 13-
17 LKJ
12-
18 JKL
12- 23-
19 KLJ
23- 13- 23-
20 KJL
12- 13- 23- 13-
21 JLK
23-
22 KLJ
23- 12-
23 LKJ
13- 23- 13-
24 JLK
23- 13- 12- 13-
25 JKL
12-
26 JLK
13- 23-
27 KJL
12- 13- 23-
28 KLJ
12- 13- 23- 12-
29 KJL
12-
30 LKJ
23- 13-
31 LJK
13- 23- 13-
32 JKL
13- 23- 12- 23-
33 KLJ
13-
34 LJK
23- 12-
35 LKJ
13- 23- 13-
36 JKL
13- 23- 12- 23-
37 KJL
12-
38 JKL
13- 12-
39 LJK
12- 23- 12-
40 KLJ
13- 12- 13- 23-
41 JKL
13-
42 KLJ
13- 23-
43 KJL
23- 12- 23-
44 JKL
13- 12- 13- 12-
45 LKJ
23-
46 JLK
13- 23-
47 LKJ
12- 13- 23-
48 KLJ
13- 23- 12- 23-
D.4 Descriptive Statistics for the Vocabulary Test
Table D.5
Vocabulary Test: Traditional descriptive statistics and Rasch-based item statistics for 35-item
and 33-item test (standard deviations in parentheses)
               Classical Statistics                   Rasch Statistics (35 items)a        Rasch Statistics (33 items)b
Item  p-value  Mean RT (SD)     Mean CRT (SD)      logit    SE    outfit  infit       logit    SE    outfit  infit
1     0.96      9.10 (5.69)      8.72 (5.26)       -2.31   0.36    1.28    0.90       -2.16   0.37    1.17    0.91
2     0.98      6.24 (3.78)      6.15 (3.71)       -3.33   0.58    0.66    0.97       -3.14   0.57    0.69    0.94
3     0.94      8.76 (6.76)      7.93 (5.02)       -2.02   0.32    1.20    0.92       -1.86   0.32    1.22    0.92
4     0.31     13.52 (8.60)     12.05 (8.26)        1.77   0.17    1.18    1.05        1.95   0.17    1.25    1.07
5     0.71     10.72 (6.19)     10.31 (6.20)        0.02   0.17    1.03    1.08        0.19   0.17    1.06    1.09
6     0.93      7.27 (4.47)      7.20 (4.39)       -1.86   0.30    0.85    0.95       -1.69   0.30    0.86    0.95
7     0.66     11.59 (7.78)     10.80 (5.81)        0.28   0.17    0.92    0.94        0.46   0.17    0.98    0.96
8     0.37     13.32 (8.06)      9.56 (7.28)        1.60   0.17    0.93    0.96        1.78   0.17    0.94    0.96
9     0.68      7.96 (6.44)      5.85 (3.92)        0.15   0.17    0.85    0.91        0.32   0.17    0.88    0.92
10    0.43     13.36 (8.04)     12.72 (6.93)        1.25   0.16    1.31    1.21         –      –       –       –
11    0.51     11.34 (7.25)      9.09 (5.42)        0.94   0.16    0.89    0.93        1.12   0.16    0.92    0.95
12    0.25     11.34 (6.52)     13.52 (7.83)        2.18   0.18    0.93    0.97        2.38   0.19    0.97    0.99
13    0.23     14.68 (11.73)    15.20 (11.86)       2.39   0.19    0.96    0.90        2.59   0.19    0.98    0.90
14    0.81     10.48 (6.38)     10.13 (6.03)       -0.62   0.20    0.84    1.00       -0.45   0.20    0.86    1.01
15    0.02      9.41 (5.54)     11.95 (15.23)       4.41   0.37    2.05    0.68         –      –       –       –
16    0.59     12.62 (9.81)     10.93 (9.68)        0.57   0.16    0.96    0.99        0.75   0.16    0.99    1.01
17    0.86      9.06 (5.86)      8.08 (4.58)       -0.94   0.22    1.19    0.94       -0.77   0.22    1.34    0.95
18    0.96      5.06 (2.35)      5.01 (2.33)       -2.59   0.41    0.50    1.02       -2.45   0.41    0.48    1.01
19    0.67      8.54 (6.06)      7.82 (5.66)        0.23   0.17    0.85    0.91        0.41   0.17    0.87    0.93
20    0.78      9.71 (7.91)      7.88 (5.93)       -0.40   0.19    0.79    0.90       -0.24   0.19    0.79    0.90
21    0.99      4.40 (2.55)      4.41 (2.55)       -4.58   1.06    0.46    1.09       -4.46   1.07    0.42    1.17
22    1.00      4.41 (1.64)      4.41 (1.64)         –      –       –       –           –      –       –       –
23    0.73      6.69 (4.84)      5.69 (4.39)       -0.08   0.18    0.91    0.92        0.08   0.18    0.91    0.93
24    0.92      7.66 (4.92)      7.14 (4.45)       -1.74   0.29    0.62    0.93       -1.57   0.29    0.63    0.93
25    0.84      7.24 (6.41)      6.02 (3.96)       -0.77   0.21    0.76    0.91       -0.61   0.21    0.77    0.91
26    0.67     12.53 (9.44)     10.47 (8.29)        0.26   0.17    0.91    0.94        0.43   0.17    0.91    0.94
27    0.53      8.84 (5.34)      7.98 (5.02)        0.86   0.16    0.86    0.90        1.03   0.16    0.86    0.90
28    0.64     11.70 (7.50)     11.28 (7.39)        0.38   0.17    1.12    1.06        0.55   0.17    1.17    1.08
29    0.31     10.04 (5.69)     10.66 (5.85)        1.76   0.17    1.22    1.15        1.94   0.17    1.28    1.18
30    0.20      8.88 (6.58)      8.82 (8.39)        2.54   0.20    0.86    0.91        2.73   0.20    0.89    0.92
31    0.46      9.50 (5.85)      8.94 (4.83)        1.21   0.16    0.95    0.97        1.40   0.16    0.96    0.98
32    0.68     10.76 (6.91)      9.97 (6.90)        0.17   0.17    1.02    1.05        0.34   0.17    1.04    1.07
33    0.59     12.34 (7.18)     10.78 (6.85)        0.66   0.16    0.97    0.99        0.84   0.16    1.01    1.01
34    0.94      7.37 (4.84)      7.05 (4.20)       -1.94   0.31    1.02    0.89       -1.78   0.31    1.04    0.90
35    0.31      9.90 (6.32)      7.85 (5.24)        1.83   0.17    0.98    0.96        2.02   0.17    1.02    0.97
36    0.95      5.06 (2.30)      5.04 (2.29)       -2.27   0.36    0.62    0.95       -2.14   0.36    0.61    0.96
a Item 22 is excluded from the Rasch analyses because all subjects answered it correctly.
b Items 10 and 15 were omitted because of their extreme outfit statistics in the 35-item analysis.
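Footnote a reflects a general property of Rasch scaling: an item that every subject answers correctly carries no information about its difficulty relative to the subjects, so no finite logit can be estimated for it. The sketch below is an illustration only and is not the calibration procedure used in the thesis; it applies a crude logit transform to the classical p-values (the helper name crude_difficulty is introduced purely for this example) to show why item 22 had to be dropped.

    # Illustration only: a crude logit transform of classical p-values, not the
    # Rasch calibration reported in Table D.5. It shows why an item answered
    # correctly by every subject (p = 1.00, item 22) has no finite estimate.
    import math

    def crude_difficulty(p):
        """Return ln((1 - p) / p); diverges as p approaches 1."""
        if p <= 0.0 or p >= 1.0:
            return float("-inf") if p >= 1.0 else float("inf")
        return math.log((1.0 - p) / p)

    for item, p in [(21, 0.99), (22, 1.00), (15, 0.02)]:
        print(item, crude_difficulty(p))
    # item 21 -> about -4.6 (close to its reported logit of -4.58)
    # item 22 -> -inf: no finite estimate, so the item is excluded
    # item 15 -> about +3.9 (very hard; reported logit 4.41)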
D.5 Descriptive Statistics for the Similarities Test
Table D.6
Similarities Test: Traditional descriptive statistics and Rasch-based item statistics (standard
deviations in parentheses)
               Classical Statistics                    Rasch Statistics (All)
Item  Max  Mean (SD)      Adj Mean* (SD)      logit     SE    outfit   infit
1     1    0.98 (0.14)    0.98 (0.14)         -2.43    0.59    0.79     1.02
2     1    0.94 (0.24)    0.94 (0.24)         -1.08    0.33    0.84     1.00
3     1    0.98 (0.14)    0.98 (0.14)         -2.52    0.62    0.43     1.04
4     1    0.91 (0.29)    0.91 (0.29)         -0.55    0.27    1.18     1.04
5     2    1.96 (0.12)    0.98 (0.12)         -1.29    0.35    0.54     0.86
6     2    1.95 (0.11)    0.98 (0.11)         -3.06    0.37    1.01     0.93
7     2    1.69 (0.25)    0.84 (0.25)         -0.20    0.17    0.95     0.97
8     2    1.92 (0.16)    0.96 (0.16)         -0.77    0.25    0.54     0.87
9     2    1.43 (0.34)    0.71 (0.34)          0.90    0.13    1.01     0.96
10    2    1.28 (0.31)    0.64 (0.31)          0.98    0.14    0.93     0.92
11    2    1.38 (0.39)    0.69 (0.39)          1.11    0.12    0.89     0.94
12    2    1.54 (0.34)    0.77 (0.34)          0.73    0.13    0.97     0.95
13    2    1.59 (0.34)    0.80 (0.34)          0.71    0.13    0.88     0.93
14    2    1.50 (0.35)    0.75 (0.35)          0.82    0.13    1.03     0.99
15    2    1.10 (0.28)    0.55 (0.28)          1.43    0.15    0.92     0.92
16    2    1.24 (0.42)    0.62 (0.42)          1.40    0.11    1.01     1.01
17    2    0.87 (0.33)    0.43 (0.33)          2.12    0.13    0.90     0.91
18    2    1.05 (0.36)    0.53 (0.36)          1.69    0.12    1.01     1.00
* Adj Mean = mean item score rescaled as a proportion of the item maximum, so that it ranges between 0 and 1.
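For example (taken from the table above), item 5 has a mean of 1.96 on a two-point item, giving an adjusted mean of 1.96 / 2 = 0.98, whereas the one-point items (1 to 4) are unchanged by the rescaling.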
D.6 Responses to Knight-Knave Test Items
Table D.7
Frequency of selected response options and mean option response time and standard deviation
(in parentheses) for knight-knave problems
Item 3.1          N     %        RT
knight            42    26.09%   15.83 (10.21)
knave*            113   70.19%   20.22 (13.02)
can't tell        6     3.73%    20.36 (9.52)

Item 3.2          N     %        RT
yes               37    22.98%   29.20 (29.97)
no*               95    59.01%   40.74 (42.46)
can't tell        29    18.01%   42.32 (41.48)

Item 3.3          N     %        RT
knight            24    14.91%   47.91 (67.11)
knave*            128   79.50%   37.15 (33.64)
can't tell        9     5.59%    30.35 (21.98)

Item 3.4          N     %        RT
yes               28    17.39%   26.28 (17.36)
no*               120   74.53%   27.63 (18.41)
can't tell        13    8.07%    23.26 (18.49)

Item 3.5          N     %        RT
knight            12    7.45%    15.77 (9.36)
knave*            146   90.68%   19.95 (11.16)
can't tell        3     1.86%    26.42 (18.46)

Item 3.6          N     %        RT
knight            24    14.91%   16.05 (11.94)
knave*            132   81.99%   21.96 (13.05)
can't tell        5     3.11%    33.81 (21.10)

Item 4.2          N     %        RT
yes*              65    40.37%   42.91 (29.88)
no                51    31.68%   32.45 (30.75)
can't tell        45    27.95%   37.58 (23.93)

Item 4.4          N     %        RT
yes*              77    47.83%   31.53 (21.06)
no                27    16.77%   28.08 (18.19)
can't tell        57    35.40%   26.29 (25.88)

Item 4.5          N     %        RT
yes               31    19.25%   18.71 (19.72)
no*               87    54.04%   28.29 (31.56)
can't tell        43    26.71%   22.68 (16.62)

Item 4.6          N     %        RT
yes*              99    61.49%   35.87 (29.76)
no                30    18.63%   25.24 (19.83)
can't tell        32    19.88%   28.19 (15.34)

Item 4.7          N     %        RT
knight            65    40.37%   32.81 (21.21)
knave*            37    22.98%   46.93 (58.51)
can't tell        59    36.65%   37.81 (28.52)

Item 4.8          N     %        RT
yes               74    45.96%   33.92 (22.25)
no                34    21.12%   41.33 (24.73)
can't tell*       53    32.92%   42.62 (36.06)

Item 4.9          N     %        RT
knight            37    22.98%   22.54 (16.73)
knave             66    40.99%   27.81 (17.14)
can't tell*       58    36.02%   38.96 (27.16)

Item 4.10         N     %        RT
knight            50    31.06%   33.54 (21.47)
knave             77    47.83%   32.25 (21.08)
can't tell*       34    21.12%   43.56 (29.41)
N = number of subjects selecting the response; % = proportion of the total sample selecting the response; RT = mean response time for subjects making that response; * = correct response
APPENDIX E
E.1 Triplet Numbers Test – Level 4 Example Items
Instructions: Validate the following rule in the triplets provided below
IF the first digit is the largest AND the second digit is the smallest
OR
IF the third digit is the largest AND the first digit is the smallest
548  384  089  713  501  152  021
438  354  740  426  102  094  537
372  920  935  269  645  470  352
730  958  705  384  702  794  503
905  913  348  973  058  937  017
350  524  291  457  197  624  903
754  024  803  813  358  420  049
207  863  793  824  523  201  726
035  823  513  370  459  596  204
620  862  361  973  863  476  682
238  049  738  923  239  915  756
716  821  857  759  470  418  751
610  041  689  591  934  908  948
198  491  814  725  207  683  473
706  312  168  359  628  935  901
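The Level 4 rule above is a disjunction of two conjunctions over the three digits of each triplet. The sketch below is an illustration of the rule's logic only and is not part of the test materials; the helper name satisfies_rule is introduced purely for this example.

    # Illustration of the Level 4 rule: a triplet satisfies the rule if
    # (first digit largest AND second digit smallest) OR
    # (third digit largest AND first digit smallest).
    def satisfies_rule(triplet):
        a, b, c = (int(d) for d in triplet)
        cond1 = a == max(a, b, c) and b == min(a, b, c)   # first largest AND second smallest
        cond2 = c == max(a, b, c) and a == min(a, b, c)   # third largest AND first smallest
        return cond1 or cond2

    # Two of the example triplets listed above:
    print(satisfies_rule("548"))   # False: neither condition holds
    print(satisfies_rule("089"))   # True: 9 (third digit) is the largest and 0 (first digit) is the smallest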