preprint - Open Science Framework

Assessing children’s understanding of complex syntax: a comparison of two methods
Pauline Frizelle 1,2, Paul Thompson 1, Mihaela Duta 1, & Dorothy V. M. Bishop 1
1 Department of Experimental Psychology, University of Oxford, Oxford, Oxon, UK.
2 Department of Speech and Hearing Sciences, University College Cork, Republic of Ireland.
RUNNING HEAD: Assessing children’s understanding of grammar
Abstract
We examined the effect of two methods of assessment, multiple-choice sentence-picture
matching and an animated truth-value judgement task, on typically developing children’s
understanding of relative clauses. Children between the ages of 3;06 and 4;11 took part in the
study (n = 103). Results indicated that (i) children performed better on the animation than on
the multiple-choice task, independently of age; (ii) each testing method revealed a different
hierarchy of constructions; and (iii) the testing method had a greater impact on children's
performance with some constructions than with others. Our results suggest that young
children have a greater understanding of complex sentences than previously reported, when
assessed in a manner more reflective of how we process language in natural discourse.
KEYWORDS: complex syntax, relative clause, language, development
During the process of child language development it is often the case that children
will demonstrate language knowledge that is sufficient to support comprehension but is
insufficient for production (Fraser, Bellugi & Brown, 1963). This is not surprising given the
extra demands of utterance planning and sentence formulation involved in language
production. However, in the case of complex syntax the opposite appears to hold, with
children's production superior to their comprehension (Håkansson & Hansson, 2000). This
leads us to question how best to assess receptive syntactic knowledge.
There are a number of paradigms typically used to assess children’s comprehension of
syntax, most of which share commonalities in relation to assessment principles. These include
the set-up, which is usually purposely structured to remove context or situational cues in
order to evaluate understanding of the specific aspect of language of interest such as the
grammatical structure, morphological marker or a combination thereof. In a typical test
(administered clinically), the items (e.g., sentences) are sequenced so that they make
increasing demands on processing skills, in order to identify the upper threshold of a child’s
comprehension ability. Despite these commonalities, there are also many differences in the
assessment methods and procedures used to investigate children’s comprehension of syntax,
such as preferential looking, act-out tasks, story comprehension, grammaticality judgements,
on-line methods, picture selection tasks and truth-value judgment tasks. Of interest in the
current paper is the effect that different methodologies can have on the language knowledge
being measured. Given the variation in assessment procedures used, it is likely that they will
have a varied effect on the knowledge being tested and consequently on the outcome measure.
While each assessment measure comes with its own biases, one of the main characteristics of
a good test is that the methodology should not compromise the knowledge being measured. It
would be misleading for both the research and clinical communities to use assessments that
yield results that are more a reflection of the testing method than of knowledge of the item
being measured. In a recent paper, Frizelle, O’Neill and Bishop (2017) compared children’s
performance on a sentence recall task and a multiple-choice comprehension task of relative
clauses. The authors found that although the two assessments revealed a similar order of
difficulty of constructions, there was little agreement between them when evaluating
individual differences. In addition, children showed the ability to repeat sentences that they
did not understand. Given that sentence repetition has been reported to be a reliable measure
of children’s syntactic knowledge (Gallimore & Tharpe, 1981; Kidd, Brandt, Lieven &
Tomasello, 2007; Polišenská, Chiat, & Roy, 2015), the authors attributed the discrepancy
between the two measures to the additional processing load resulting from the design of
multiple-choice picture selection comprehension tasks.
Here we consider two ‘pure’ methods of receptive language assessment, the multiple-choice
picture selection task and a newly devised truth-value judgement animation task. The former
is one of the most commonly used assessment paradigms to evaluate children’s understanding
of syntax within a formal, standardized assessment framework; it is used, for example, in the
Clinical Evaluation of Language Fundamentals – 4th Edition (CELF-4; Semel, Wiig &
Secord, 2006), the Test for Reception of Grammar – 2nd Edition (TROG-2; Bishop,
2003) and the Assessment of Comprehension and Expression 6–11 (ACE; Adams, Coke,
Crutchley, Hesketh & Reeves, 2001). In a picture selection paradigm the child is presented
with a sentence and is asked to select the picture (usually from a choice of four) that best
corresponds to the stimulus presented. Within a four-picture layout, one picture represents the
target structure and the other three are distractors. Provided that each distractor is equally
plausible, the probability of choosing the correct item by chance is reduced to 25%.
However, it may not always be possible to design three equally plausible
distractors (Gerken and Shady, 1994), in which case deciding what constitutes chance
performance becomes difficult. Not only is the semantic plausibility of the distractors
influential but also their syntactic framework and the relationship between the two. In parsing
a sentence for syntactic comprehension, a child will typically form a semantic representation
of that sentence, relying on the syntactic structure to assign the thematic roles to the
appropriate words in the sentence. However, there are additional linguistic factors that
influence a child’s performance and these are not always controlled for. For example, greater
comprehension difficulty has been noted when the thematic roles assigned by the verb could
be applied to either noun (Gennari & MacDonald, 2009). It is also the case that on hearing the
target stimulus, children will use typical thematic role-to-verb argument mapping in trying to
process the given structure. The presence of three distractors requires the child to rule out
three competing alternative mappings, increasing the processing load considerably and
creating a level of ambiguity in how the roles should be assigned. If we consider the test
sentence She pushed the man that was reading the book (taken from the multiple-choice task
used in the current experiment) shown in Figure 1, the distractor images represent the
sentences The woman that was reading the book pushed the man, The man pushed the woman
that was reading the book and The man that was reading the book pushed the woman. Using
the same lexical set, the four pictures represent four alternative ways to assign the thematic
roles. With regard to relative clauses, this type of distractor design is used in commercially
available standardized assessments such as the TROG-2 (Bishop, 2003) and the ACE (Adams
et al., 2001) (see Table 1.) Interestingly, in the example taken from the CELF the object (the
banana) is inanimate and therefore the roles of the subject and object cannot be reversed,
necessitating a different set of distractors. However, we note that the distractors in the
example below serve to test children’s understanding of prepositions and, on condition that
the child understands the concept of under, they could choose the correct item. In addition, in
the illustrations depicting the distractors there is no alternative to the head noun to which the
relative clause refers (no other boy), making the use of a relative clause pragmatically
inappropriate.
Table 1. Examples of distractors used in commercially available standardized assessments.

Test: Assessment of Comprehension and Expression (ACE)
Target sentence: The cat that scratched the fox is fat.
Distractors:
  The cat that the fox scratched is fat.
  The fox that scratched the cat is fat.
  The fox that the cat scratched is fat.

Test: Test for Reception of Grammar (TROG-2)
Target sentence: The girl chases the dog that is jumping.
Distractors:
  The dog chases the girl that is jumping.
  The girl that is jumping chases the dog.
  The dog that is jumping chases the girl.

Test: The Clinical Evaluation of Language Fundamentals (CELF-4)
Target sentence: The boy who is sitting under the big tree is eating a banana.
Distractors:
  The boy who is standing beside the small tree is eating a banana.
  The boy who is sitting beside a small tree is eating a banana.
  The boy who is sitting in a big tree is eating a banana.
Figure 1. Example of an illustration for the test sentence: She pushed the man that was
reading the book
The advantage of this role-reversal distractor approach is that we can devise test items
that can only be correctly identified by children with a deep understanding of the syntactic
structure being tested. It can also be argued that careful manipulation of the distractor
images allows a systematic error analysis, which may shed light on
children's developing representations. At the same time, however, it is important to question
the ecological validity of this method, as it assesses an understanding that
is far removed from that used in natural discourse. Would the error patterns revealed emerge
at all if the distractors were less salient? In addition, while it is important to ascertain 'true' or
deep levels of syntactic knowledge, we need to be aware that this format is likely to lead to
children failing for reasons other than a lack of linguistic knowledge (Frizelle et al., 2017).
For these reasons, in this experiment we are considering the multiple-choice picture selection
task to be an example of an assessment procedure in which the method is likely to have a
strong effect on the knowledge being measured.
An alternative assessment measure is the truth-value judgement task, described by
Gordon (1996, p.211) as ‘one of the most illuminating methods of assessing children’s
linguistic competence’. This task requires the child to make a binary judgement about
whether a statement accurately describes a situation presented in a given context. Some
studies have used a yes/no version within the context of a short story, i.e. the child is
presented with a story followed by several yes/no questions (see Gordon & Chafetz, 1986).
Other researchers, such as Crain and McKee (1985), have used what is termed a
reward/punishment version. In this design, children watched a situation 'acted out' while it
was simultaneously described. Following a statement made by Kermit the Frog about an
event, children were asked to reward him if the statement was true and to punish him (by
feeding him a rag) if it was false.
In the current study we used a truth-value judgement task that could be considered
somewhat less demanding than those previously described. In this task, the child listens to a
single sentence while simultaneously being shown an animation and must simply decide if
the sentence was truly represented in the animation shown. Half of the sentences are
accurately represented by the animation and half are not. By listening to the stimulus sentence
at the same time as it is acted out ‘in real time’ the child can take advantage of a reduction in
memory demands. However, one disadvantage of using this method is that a larger number of
stimuli are required to determine if performance is reliably above chance, which may result in
fatigue, particularly for young children. In addition, Crain, Thornton, Boster, and Conway
(1996) raised some methodological design concerns in relation to single-sentence
truth-value judgement tasks and the pragmatic context in which they are presented. These
concerns were prompted by studies by Philip (1991; 1992) and Roeper and de Villiers (1991)
showing that when children were interpreting sentences containing universal quantifiers
(such as every), they were influenced by the extra elements in the picture. Children showed a
pragmatic bias, surmising that the extra element was there for a reason, and therefore
responded negatively and incorrectly to the question asked. Crain et al. followed up with
further research in which
they created a story context where the child could plausibly reject the extra elements depicted.
They argued that with the appropriate contextual support children would not override their
grammatical knowledge in favour of any pragmatic bias. However, they also acknowledge
that children’s response to the number of items in each illustration could be as much to do
with the fore-grounding/back-grounding of the items as the pragmatic context. Although
previously discussed in relation to universal quantifiers, this issue is relevant to any
truth-value judgement task where the child is presented with a single image or animation.
Even when the
image is an accurate representation of the sentence, the extra elements within that image are
in effect a form of distractor. On the other hand, if the image does not represent the sentence
accurately, the entire image is considered a distractor and how that is structured affects the
processing load for the child. If we consider a distractor animation for the relative clause She
pushed the boy that was jumping, the child would be shown a sequence with two potential
referents, a boy that is jumping and a boy that is not, and the animation would show a girl
pushing the boy that is not jumping. If, however, the girl were also jumping, this would
increase the salience of an alternative interpretation of the sentence, The girl that was
jumping pushed the boy, which in turn increases the processing load for the child and the
difficulty level of the item. As assessors, our aim is to gain the most accurate picture of
children's receptive syntactic knowledge while at the same time gaining some insight into the
difficulties they experience. However, we do not want to distract or provoke children into
making an invalid response. Therefore, how each structure is represented is central to the
assessment results we are likely to obtain.
The advantage of using a truth-value judgement task where the child is shown a single
animation at a time is that they can evaluate the truth of each sentence (as it is happening)
directly against the real-world situation without having to store in memory the arguments
associated with the verbs. In this regard, the task has no greater memory load than that
required for processing everyday discourse. The child is required to make a semantic
evaluation of a sentence and to map the thematic roles to the appropriate verb argument
structures without having to actively rule out three competing alternative mappings. In
addition, using animation reduces the dependence on children’s ability to interpret adult
created pictures and reduces the amount of inference-making required by the child. We
might therefore expect this assessment framework to yield more accurate test results than
those obtained from the multiple-choice picture selection task, i.e. we expect this
methodology to have less bearing on the knowledge being measured.
In the current study, to examine to what extent the assessment method affects the
assessment outcome, we used the two methods of assessment previously described to
measure the same language knowledge (i.e. children's understanding of relative clauses). We
then investigated whether the use of these two different methodologies significantly affects
children's understanding of relative clauses. The following research questions were
addressed:
• Do children show a greater understanding of relative clauses when assessed using a
truth-value judgement animation task or a multiple-choice comprehension task? Based on
findings reported in Frizelle et al. (2017) showing that the multiple-choice
comprehension task assesses factors beyond the linguistic, we predict that
children will achieve higher scores on the animation task than on the multiple-choice task.
• Is the hierarchy of structures similar across the tests? Given that the exact same sentences
were used in both assessments (with additional sentences added to the animation task), we
might predict that a similar hierarchy would be shown by both assessment measures.
• What is the relationship between children's performance on either method of syntactic
assessment and (i) their language skills, as measured by a sentence recall task, and (ii) their
cognitive skills, as measured by the Goodenough-Harris draw-a-person test (1963)? Given
the additional cognitive load of the multiple-choice task (the fact that it assesses skills
beyond the linguistic), we might expect it to correlate more highly with the
draw-a-person test. On the other hand, if our animation task is a less compromised
measure of syntactic comprehension, and sentence recall is a reliable measure of syntactic
knowledge, then we would expect the animation task to be more closely associated with
the sentence recall measure.
Method
Participants
One hundred and ten typically developing children, between the ages of 3;06 and 5;0 years,
were recruited into the study. Seven children were subsequently excluded as a result of
speech and language delay (n = 3), attention difficulties (n = 2), failure on the hearing
screening tests (n = 1) and an unwillingness to participate (n = 1). This resulted in 103
children participating in the study.
participating in the study. To take account of the rapid changes in language development in
this age range, the children were divided into six age bands, with between 16 and 18 children
in each band (3;06 – 3;08, 3;09 – 3;11, 4;0 – 4;02, 4;03 – 4;05, 4;06 – 4;08 and 4;09 – 4;11).
The children were recruited through nurseries and primary schools in the Oxford area. To
minimise volunteer bias, children were recruited using an ‘opt out’ protocol which was
approved by the Central University Research Ethics Committee, University of Oxford. A
decile of deprivation was obtained from post code data for each participant; 32% had a
deprivation index of 7 or below. The study was explained to the children recruited; younger
children made a verbal choice regarding their participation and older children were asked to
sign an assent form. Children were included in the sample on the basis that they had typical
language abilities, had never been referred for speech and language therapy, spoke English as
their first language and the language of the home; and had no known intellectual,
neurological or hearing difficulties. Children completed the Goodenough-Harris draw a
person test (DAP; 1963) to ensure cognitive ability within the normal range. We wanted a
simple nonverbal measure of developmental level that could be used with children as young
as 3 years; because the maturity of children's drawings shows a clear developmental trend
between 3 and 5 years, we used the DAP as a quick play-based assessment, applying its
scoring methods to quantify the maturity of the drawings. Using a DSP Pure Tone
Audiometer (Micro Audiometric Corporation), hearing ability was screened on the first day
of assessment at three frequencies (1000 Hz, 2000 Hz and 4000 Hz) at a 25 dB level. Owing
to difficulty in locating an adequately quiet space, the results of this initial screen were
inconclusive for 7 children. These children were further assessed using the Ling six-sound
test, which was deemed an appropriate measure of the ability to hear speech at
conversational loudness level.
Comprehension Tasks
Animation task
The newly devised animation task was presented on a Microsoft Surface Pro. Children were
shown fifty animations, each representing one of five types of relative clause: subject (both
intransitive and transitive), object, indirect object and oblique. All fifty relative clauses were
attached to the direct object of a transitive clause and are therefore categorized as full
bi-clausal relatives. Example test sentences are given in Table 2.
Table 2. Example relative clause test sentences.
Subject intransitive: He found the girl that was hiding
Subject transitive: He pushed the girl that scored the goal
Object: The boy picked up the cup that she broke
Oblique: The man opened the gate she jumped over
Indirect object: She kissed the boy she poured the juice for
There were ten animations for each relative clause structure: five matched the structure and
five did not. The test sentences were chosen on the basis of pilot work
carried out by the first author, work completed by Diessel and Tomasello (2000; 2005) and
research by Frizelle and Fletcher (2014). Object relatives were adapted to include only those
considered to be more discourse relevant (Kidd et al., 2007) i.e. all had an inanimate head
noun and a pronominal subject. We also included pronominal subjects in the oblique, and
indirect object clause structures as they are considered to be more reflective of natural
discourse. In order to increase ecological validity, within each animation there was an
alternative to the head noun to which the relative clause was referring, i.e. a referent from
which another can be distinguished. For example, the representation of the sentence He
laughed at the girl he threw the ball to included a second girl in the scene who was holding
a ball. Examples have been deposited on the Open Science Framework and can be accessed at
the following links (https://youtu.be/OM27lMM4zPs; https://youtu.be/Cd-EBpCtzZw;
https://youtu.be/d3dz_m8zTvc). Animations were on average six seconds in length. As each
animation began children were presented with a pre-recorded sentence orally and asked if the
animation matched the sentence they heard. Children were given the opportunity to hear the
sentence and see the animation more than once if needed. They were asked to respond by
touching either the smiley or sad face on the Surface Pro touch screen.
Multiple-Choice task
The multiple-choice task was a sentence-picture matching task designed to assess the same
relative clause structures as those in the animation task described. The sentences were
identical to a subset of those used in the animation task. The multiple-choice items were also
part of a larger set reported in Frizelle et al. (2017). Sentences were again pre-recorded.
Children were given each sentence orally and were asked to choose the picture (from a choice
of four) that corresponded to that sentence. The other three images were distractors, which
included role reversal of the main clause (the relative clause is understood), role reversal of
the relative clause (the main clause is understood) and role reversal of both main and relative
clause (see Figure 1).
In comparison to the animation task, where the child chooses between two options (correct
vs. incorrect), choosing from four pictures reduces the probability of the child answering
correctly by chance. Therefore, fewer exemplars of each construction were required in the
multiple-choice task. In contrast to the ten examples of each construction in the
animation task, the multiple-choice task included four of each construction, i.e. twenty test
items in total.
in total. The pictures were presented in two formats governed by the type of verbs within the
relative clause. For example, in the test sentence She pushed the man that was reading the
book, the verbs reading and pushing can occur simultaneously and can therefore be
represented by four single images (the correct response and three distractors) (see Figure 1).
In contrast, for the sentence The girl washed the teddy that he played with, both verbs must be
shown consecutively (the boy needs to play with the teddy before the girl washes it) and
therefore require two images within each set of four (see Figure 2).
Figure 2. Example of a two-image illustration for the test sentence: The girl washed the
teddy that he played with
Procedure
Children were assessed individually in a quiet room in their respective schools. The
assessments were administered in two sessions between three and twenty days apart. The
hearing screen was carried out on the first occasion each child was seen. The cognitive
assessment, sentence recall task, and the multiple-choice comprehension task were
administered in one session and the animation task was administered in the other. The
sentence recall task was audio recorded and sentences from 5% of the participants were
re-transcribed to check transcription reliability. Inter-rater agreement was 98%. The sequence
of test sentences was randomised for both experimental tasks so that there were two orders of
presentation for each task. The order in which the assessments were administered was also
randomised. For the animation task children simultaneously watched an animation and
listened to a sentence. They were required to indicate if the animation represented the
sentence they heard by touching the smiley face / sad face icons on the tablet screen. For the
multiple-choice task children listened to a sentence and were asked to point to the picture that
corresponded to that sentence. Verbal positive feedback was given every four to five trials.
As the animation task was significantly longer than the multiple-choice task, children
received visual positive feedback (star animations) every 10 trials, regardless of their
performance on the task.
Scoring for the animation task was automated and the results were collated and stored within
the application. The researcher scored the multiple-choice task in real time as the children
completed it. The initial scoring system for both tasks was binary, 1 for a correct response
and 0 if the response was incorrect.
Results
To allow both assessments to be compared based on similar scales, a coding system was
developed that took into account the probability of obtaining a correct response by chance.
The binomial theorem was used to compute the probability of obtaining a given number of
items correct by chance guessing, where chance was .25 for the multiple-choice test, and .5
for the animation task. A two-tiered coding system was applied: the child was given 2 if they
answered 4 out of 4 correctly on the multiple-choice task (p = .004) or 9 or 10 out of 10
correctly on the animation task (p = .01). Under the more lenient coding, the child was given
1 if they answered 3 out of 4 correctly on the multiple-choice task or 8 out of 10 on the
animation task. With this coding, the binomial probability of a score of 1 or 2 on the
multiple-choice test was .051, and the probability of a score of 1 or 2 on the animation task
was .055. Thus in both tasks, a score of 2 could be regarded as evidence that the child
completely understood the construction, and a score of 1 that they were close to mastery. A
score of 2 or fewer out of 4 on the multiple-choice test, or of 7 or fewer out of 10 on the
animation task, was coded as 0. This coding allowed us to compare children's performance
across different relative clause types in both tests on a similar metric. Figure 3 shows stacked
bar charts of the relevant data.
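The chance probabilities underlying this coding can be reproduced with a short binomial-tail calculation. The sketch below is illustrative only (it is not the authors' analysis code); it checks the reported values from the number of items per construction and the per-item chance level.

```python
from math import comb

def p_at_least(k: int, n: int, p: float) -> float:
    """Binomial tail probability P(X >= k) for n trials with success probability p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Multiple-choice task: chance = .25, four items per construction
print(round(p_at_least(4, 4, 0.25), 3))  # score 2 threshold (4/4): 0.004
print(round(p_at_least(3, 4, 0.25), 3))  # score 1 or 2 (3+/4): 0.051

# Animation task: chance = .5, ten items per construction
print(round(p_at_least(9, 10, 0.5), 3))  # score 2 threshold (9+/10): 0.011 (reported as .01)
print(round(p_at_least(8, 10, 0.5), 3))  # score 1 or 2 (8+/10): 0.055
```

The larger number of animation items is what keeps the chance probability comparable across tasks despite the binary (50%) response format.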
Figure 3. Children’s performance, categorised as 2 (mastery), 1 (above chance) or 0 (not
mastered), on each construction on the animation and multiple-choice tasks
SI = subject intransitive, ST = subject transitive, Obj = object, Obl = oblique, IO = indirect
object.
An examination of the stacked bar charts shows considerable growth in children's
comprehension of these constructions from 3;06 to 4;11 years. This is particularly evident on
the animation task, where (with the exception of the subject transitive relatives) we see more
evidence of full mastery on all constructions. If we look at the three younger age bands
(children aged between 3;06 and 4;02 years), the differences in full mastery between the two
tasks are particularly evident. For example, younger children show very limited mastery of
object and indirect object relatives on the multiple-choice task, yet a considerable percentage
of children show full mastery on the animation task.
To subject the data to statistical analysis, we grouped together categories 1 and 2 for contrast
with category 0 before fitting a mixed model logistic regression to the data. Odds ratios were
calculated as measures of effect size. These represent the odds of scoring 1 vs. 0 given a
particular condition, compared to the odds of scoring 1 vs. 0 in the absence of that condition.
An odds ratio of 1 indicates no difference between conditions. As shown in Table 3, our
results showed an effect of test, such that children performed better on the animation task
than on the multiple-choice task: the odds of achieving higher scores on the animation task
were 3.6 times those of the multiple-choice task. To evaluate the effect of the specific
constructions, the intransitive relative clause score was the reference against which the other
relative clause types were measured. Relative to the intransitive relatives, the odds of
achieving a higher score on the oblique relatives were about 1 in 4 (0.27), on the object
relatives 1 in 10 (0.1) and on the indirect object relatives 1 in 20 (0.05).
The model also revealed an effect of age, showing that older children were more likely to
achieve higher scores on both tasks.
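For readers less familiar with odds ratios, the quantity can be sketched directly from success probabilities. The pass rates below are hypothetical, chosen only to illustrate the scale of the reported test-type effect; they are not the study data.

```python
def odds(p: float) -> float:
    """Odds corresponding to a probability p of success."""
    return p / (1.0 - p)

def odds_ratio(p1: float, p0: float) -> float:
    """Odds ratio comparing condition 1 against reference condition 0."""
    return odds(p1) / odds(p0)

# Hypothetical illustration: an odds ratio of 3.6 arises when, for example,
# about 78% of responses succeed in one condition vs. 50% in the other.
print(round(odds_ratio(3.6 / 4.6, 0.5), 2))  # -> 3.6
```

Conversely, an odds ratio below 1 (e.g. 0.27 for oblique relatives) means the odds of success are a fraction of those in the reference category, which is why the text glosses it as roughly "1 in 4".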
Table 3. Estimates of fixed effects for mixed effects logistic regression.

                               Odds ratio   95% CI          p
(Intercept)                    4.44         2.39 to 8.25    <.0001
Test type (Animation)          3.6          1.49 to 8.67    .0043
Age (centred)                  2.72         1.84 to 4.01    <.0001
Transitive                     1.93         0.87 to 4.31    .1074
Indirect object                0.05         0.02 to 0.11    <.0001
Object                         0.12         0.05 to 0.24    <.0001
Oblique                        0.27         0.13 to 0.56    .0004
Test type x transitive         0.10         0.03 to 0.34    .0002
Test type x indirect object    4.14         1.31 to 13.2    .016
Test type x object             1.02         0.33 to 3.13    .9773
Test type x oblique            0.53         0.17 to 1.63    .269
Test type x Age                1.26         0.87 to 1.8     .2235

Note: Intransitive clause is treated as the reference category.
Although we see considerable differences between the two assessment methods at the level of
full mastery, particularly with the younger children, when we analysed the data according to
above-chance performance on the tests (2 or 1 vs. 0) there was no overall interaction between
the method of testing and age. We also considered the pattern of results for full mastery
(2 vs. 1 or 0), and the pattern did not change. However, the significant odds ratios for the
interaction terms with transitive subject and indirect object relatives indicate that children's
performance on these constructions differed depending on which method of assessment was
used, whereas this was not seen for the other constructions (see Table 3 for the fixed effects).
This suggests that some constructions are more vulnerable to testing method than others.
Our second question was whether the order of difficulty of constructions was similar on the
two types of test. Based on the lenient coding system (where children showed an
understanding of constructions at a level above chance) a rank ordering was assigned to each
of the constructions for total items correct in the animation and multiple-choice tasks. The
rank orderings were different suggesting that the tests are not sensitive to the same aspects of
clause complexity. For the multiple-choice test, the rank ordering (and mean items correct)
was Subject transitive (3.28) > Subject intransitive (3.04) > Oblique (2.46) > Object ( 2.27) >
Indirect object (1.93). For the animation task, the order was Subject intransitive (9.03) >
Indirect object (8.24) > Subject transitive (8.14) > Object (7.91) > Oblique (7.87).
We next considered whether the rank orderings were meaningful, i.e. whether the differences
in difficulty were reliable enough to merit interpretation. To do this, we compared scores for
adjacently ranked constructions, e.g., for the animation task, we compared Subject
intransitive vs Indirect object, then Indirect object and Subject transitive, and so on. Because
the data were not normally distributed, paired Wilcoxon tests were used. The alpha level was
corrected for each set of comparisons within a test (alpha = .05/4 = .0125, as there were four
comparisons in each test). For the animation task, there was a significant difference between
intransitive subject and indirect object relatives (Z = 369.5, p = 4.62 × 10⁻⁷). However, there
were no significant differences between any of the other sequential pairs, showing a
two-tiered hierarchy in children’s understanding of different relative clause types on
this assessment. In contrast, the multiple-choice task showed three levels of difficulty:
children performed similarly on both subject relative types (Z = 724, p = .025), there was a
significant difference between subject intransitive and oblique relatives (Z = 480,
p = 1.24 × 10⁻⁵), no difference between oblique and object relatives (Z = 755, p = .109), and a
significant difference between object and indirect object relatives (Z = 963.5, p = .012), which
were the most difficult construction on the multiple-choice measure.
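The comparison procedure described above can be sketched in a few lines of Python. The construction names and per-child score arrays below are hypothetical stand-ins, not the study data; only the logic (paired Wilcoxon tests on adjacently ranked constructions, with a Bonferroni-corrected alpha) follows the analysis reported here.

```python
from scipy.stats import wilcoxon

def compare_adjacent_ranks(scores, order, alpha=0.05):
    """Paired Wilcoxon tests between adjacently ranked constructions.

    `scores` maps each construction to per-child item totals (paired across
    constructions); `order` lists constructions from easiest to hardest.
    The alpha level is Bonferroni-corrected for the number of comparisons.
    """
    corrected = alpha / (len(order) - 1)  # e.g. .05 / 4 = .0125 for five ranks
    results = []
    for easier, harder in zip(order, order[1:]):
        stat, p = wilcoxon(scores[easier], scores[harder])
        results.append((easier, harder, stat, p, p < corrected))
    return corrected, results
```

With five ranked constructions this yields four tests at a corrected alpha of .0125, as in the analysis above; each tuple flags whether the adjacent pair differs reliably at the corrected level.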
Our third question asked whether children’s performance on either measure of syntactic
comprehension would correlate with their performance on the sentence recall task from the
CELF-P2 (Semel et al., 2006) or with their mental age as measured by the Goodenough-Harris Draw-a-Person test (1963). Table 4 provides a summary of the correlations between
measures. Note that the lack of correlation between Sentence Recall and age arises because
these were age-scaled scores. Draw a person scores were missing for three children, who
were therefore excluded from the analysis. Correlations between both assessments of syntax
and sentence recall scores were substantial, using Davis' (1971) criteria for interpreting the
magnitude of correlation coefficients. Although not as strong, associations between both
measures of syntax and DAP scores were also highly significant and are classified as
moderate associations.
Table 4. Pearson correlations between variables

Measure            Animation   Multiple Choice   Age      Sent Recall
Multiple choice    .56***      –
Age                .50***      .48***            –
Sent Recall¹       .50***      .40***            .03      –
Draw a person      .45***      .46***            .51***   .16

* p < .05, ** p < .01, *** p < .001
¹ Age-scaled scores
To investigate the independent contributions of age, sentence recall and cognitive
ability to the scores on both assessments of syntactic comprehension, likelihood ratio tests
were used to perform hierarchical linear regression analyses. Results are shown in Table 5. In
the first analysis, performance on the animation task was the dependent variable. Age was
entered first in order to control for its effect on the children’s performance on the task. This
was followed by sentence recall and finally the DAP scores. The final model accounted for a
large proportion of the variance in animation score (51%) and age explained 25% of that
variance (p < .001). An additional 23% of the variance was explained by the inclusion of
sentence recall, demonstrating the significant contribution of this language measure. In
contrast, when age was accounted for, the addition of DAP scores only explained a further
2% of the variance. In the second analysis, performance on the multiple-choice task was the
dependent variable. Age was again entered first and accounted for 23% of the variance in
multiple-choice scores. In contrast to the animation task, when age was accounted for,
sentence-recall explained just under 15% of additional variance in the multiple-choice scores
and DAP accounted for 3.5%. We can see from these analyses that, as predicted, sentence
recall accounted for a larger proportion of the variance in animation scores than in those from the
multiple-choice task. However, contrary to what we expected, the contribution of DAP scores
was fairly similar across the two tasks.
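The hierarchical procedure, entering predictors stepwise and testing each nested pair of models with a likelihood ratio test, can be sketched as follows. The data frame here is simulated and the column names (`age`, `sent_recall`, `dap`, `animation`) are assumptions for illustration, not the study's variables.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in data: one row per child (hypothetical variable names).
rng = np.random.default_rng(1)
n = 100
df = pd.DataFrame({
    "age": rng.uniform(42, 59, n),          # age in months
    "sent_recall": rng.normal(10, 3, n),    # sentence recall score
    "dap": rng.normal(100, 15, n),          # Draw-a-Person score
})
df["animation"] = 0.3 * df["age"] + 0.8 * df["sent_recall"] + rng.normal(0, 3, n)

# Nested models, entered hierarchically: age first, then sentence recall, then DAP.
m1 = smf.ols("animation ~ age", df).fit()
m2 = smf.ols("animation ~ age + sent_recall", df).fit()
m3 = smf.ols("animation ~ age + sent_recall + dap", df).fit()

# Each step is evaluated by a likelihood ratio test against the previous model,
# alongside the increase in R-squared it buys.
for full, reduced in [(m2, m1), (m3, m2)]:
    lr_stat, p_value, df_diff = full.compare_lr_test(reduced)
    print(f"R2 = {full.rsquared:.3f}, "
          f"increase = {full.rsquared - reduced.rsquared:.3f}, "
          f"LRT p = {p_value:.4g}")
```

The R² increase at each step corresponds to the "R² increase between subsequent models" column in Table 5, and the likelihood ratio p-value to its final column.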
Table 5. Hierarchical linear regression: predicting total score on Animation and Multiple-Choice tasks from age, sentence recall and Draw-A-Person test

Outcome variable      Model   Predictor variables              R²       R² increase   Likelihood ratio test
(test total score)                                                                    between subsequent
                                                                                      models (p-value)
Animation             Null    –                                –        –             –
                      1       Age                              0.2519   –             –
                              Age and Sentence recall          0.4859   0.2340        <.001***
                      2       Age, Sentence recall, and DAP    0.5076   0.0217        .038*
Multiple Choice       Null    –                                –        –             –
                      1       Age                              0.2309   –             –
                              Age and Sentence recall          0.3787   0.1478        <.001***
                      2       Age, Sentence recall, and DAP    0.4136   0.0349        .016*

* p < .05, ** p < .01, *** p < .001
Discussion
The current study aimed to examine to what extent the assessment method affects the
assessment outcome in relation to children’s understanding of complex sentences. To do this
we compared two methods of assessment – a multiple-choice picture-matching sentence
comprehension task and a newly devised truth-value judgement animation task. As we
predicted, the results showed that children performed better on the truth-value animation task
than on the multiple-choice task, and this held across the full age range. In addition, we aimed
to explore whether the two assessment methods would reveal a similar order of difficulty in
relation to understanding specific syntactic constructions. We predicted the same order
between assessment methods, but our results showed a different hierarchy depending on the
method used. Finally, we were interested to know if results from either test would be
predicted by children’s performance on an independent measure of language (sentence recall
ability) or on a measure of cognitive ability. As expected, our results showed that sentence
recall ability predicted children’s performance on the animation assessment more than on the
multiple-choice test, validating the animation task as a measure of syntactic knowledge.
However, cognitive ability, as measured by the Draw a Person test, made a small but
significant contribution to children’s performance fairly equally across both tasks.
As previously discussed, a recent paper by Frizelle et al. (2017) highlighted the fact that the
multiple-choice picture-matching assessment method may be problematic when assessing
comprehension, in that it is testing skills beyond those of linguistic competence. Frizelle and
colleagues compared this method of assessing comprehension with a sentence recall task,
which (once beyond span) is considered to tap into both comprehension and production of
language. They found children could repeat sentences they didn’t understand and there was
little agreement between the two measures in relation to individual differences. Although the
lack of agreement between the two measures is unsettling, sentence recall and multiple-choice
methods do not purport to measure identical linguistic skills. The two methods used in
the current study, however, were designed specifically to assess sentence comprehension.
If these assessment methods are effective, they should not compromise access to the
knowledge being tested or, in doing so, depress the scores achieved by the
children who complete them. Our results show that this is not the case. Children across
the full age range performed better on the animation task than on the multiple-choice task,
causing us to question what is influencing children’s performance on the two measures. We
suggest that differences may be driven by task design. As previously outlined, in attempting
to process a sentence for syntactic comprehension, children typically form a semantic
representation of the sentence while using the syntactic structure to help them assign the
thematic roles to the appropriate words. In the case of the animation task, children hear the
target sentence and see the animation at the same time. We expect that children are using
typical thematic role to verb argument mapping in trying to process the given sentence in real
time, as they would in natural discourse. The added requirement is that they must decide
whether the linguistic structure describes accurately what happened in the animation. In
addition, in the animation task, children are not required to infer the same amount of
information as they are in a still image. In contrast, the multiple-choice task is presented
using still images, which requires children to infer temporal relationships and movement of
characters, which is often depicted by more subtle means. Along with the correct answer, the
multiple-choice task also involves the presentation of three additional distractors, which
demands that the child rules out three competing alternative mappings. This creates a level of
ambiguity in how the thematic roles should be assigned and increases the processing load of
the task considerably. The presence of three distractors also requires the child to store in
memory the arguments associated with each verb in order to rule them out. This memory load
is heightened even further by the fact that although the structure and thematic roles are
assigned differently, the nouns in each of the distractors are the same as those in the target
sentence.
Even when comparing children’s test performance based on above-chance responding (rather than full
mastery), the differences between the two methods are evident. Our results showed that some
constructions were more vulnerable to testing method than others. Specifically, the testing
method had a large impact on children’s understanding of indirect object relatives and subject
transitive relatives. Indirect object relatives are those in which the relativized element ‘who’
is the indirect object of the relative clause and they have what is called a stranded preposition.
They are considered to be one of the more complex relative clause constructions in relation to
children’s production (Diessel & Tomasello, 2005; Frizelle & Fletcher, 2014). Unlike object
or oblique relatives the head noun of the indirect object relative is always animate which
means that the thematic role assigned by the verb could be applied to either noun. Previous
literature has suggested that when this is the case, comprehension is more difficult (Gennari
& MacDonald, 2009). In order to make the task pragmatically appropriate, indirect object relatives in the multiple-choice task were depicted with at least four referents in each image. If
we take the example sentence He followed the girl he gave the present to, the correct image
shows a boy giving a girl a present, and is succeeded by the boy following the same girl.
There are two aspects of the multiple-choice task that make the illustration particularly
complex: first, the temporal aspect (the boy must be seen to give the girl the present before
following her; the two actions cannot happen simultaneously) and second, the need to
provide distractors showing alternative semantic-syntactic mappings of ‘who did what
to whom’. In contrast, a still image of the corresponding animation shows the three referents
included in the animation - a boy, a girl and an alternative girl who is also holding a present.
Because the animation happens in real time the verbs can be shown consecutively as they are
enacted and the design is not based on the need for three alternate distractors. See figure 4
below.
Figure 4. Multiple-choice stimulus picture and still image from the animation task
The reason why testing method had a particular impact in relation to subject transitive relative
clauses is potentially due to the type of verbs (in both the main and relative clause) that are
required in these constructions. The transitive nature of the relativized verb means that the
subject of the relative clause is again usually animate (the girl that scored the goal) and as a
result the main clause verbs are limited to those that we enact with a person or animal such as
push, pull, chase, hit, etc. (He pushed the girl that scored the goal). We suggest that these
types of verbs are less ambiguously depicted in an animation task than when shown in a still
image where the child is required to engage in a higher level of inference making. In contrast,
(in line with discourse regularities) the head noun in our object relatives and most of our
oblique relative test items was inanimate (The dog ate the banana she dropped or The girl
painted the wall he pointed to), facilitating the use of a broader range of verbs.
Both test methods showed growth in children’s understanding of complex syntax
across the age range. However, our findings in relation to a hierarchy of constructions did not
show consistency between the two assessments, i.e., the assessments were not sensitive to the
same aspects of clause complexity. The multiple-choice task showed three levels of
complexity. Children showed the highest level of understanding of subject relatives (both
types), followed by object and oblique relatives and lastly indirect object relatives. In contrast,
the animation task showed that children performed best on the intransitive subject relatives
but there were no significant differences between any of the other relative clause types. The
finding that intransitive subject relatives are the least difficult bi-clausal relative clause type
to process is in keeping with previous literature (Diessel, 2004, Diessel & Tomasello, 2005,
Frizelle & Fletcher, 2014). However, the lack of any difference between the other relative
clause types has not been previously reported and may suggest that we have been
underestimating children’s understanding of complex sentences. We suggest that when
children’s understanding of structures is assessed in a manner that is linguistically focussed,
pragmatically scaffolded, and more closely related to natural discourse we see a different
picture of what children are capable of understanding.
Our final research question addressed the associations between the two methods of
assessing syntactic comprehension and a measure of language and cognitive ability. We
chose sentence recall ability as the measure of language as there is a strong link between
sentence repetition and syntactic competence (Frizelle & Fletcher, 2014; Gallimore & Tharpe,
1981; Geers & Moog, 1978; Kidd et al., 2007; McDade, Simpson & Lamb, 1982; Polisenska,
et al., 2015). Current thinking is that sentence repetition is not merely a task of repeating a
heard series of words supported by phonological short-term memory, but that it is also
supported by conceptual, lexical and syntactic representations in long-term memory (Brown
& Hulme, 1995; Hulme, Maughan & Brown 1991; Klem, Melby-Lervag, Hagtvet, Lyster,
Gustafsson & Hulme, 2015; Potter & Lombardi, 1990, 1998; Schweickert, 1993). Our
finding that sentence recall was more predictive of children’s performance on the animation
than on the multiple-choice assessment serves to validate the animation task as a measure of syntactic
comprehension. Regarding cognitive ability, we found only marginal differences in how
DAP scores predicted children’s performances on both syntactic comprehension measures (it
accounted for slightly more variance on the multiple-choice than the animation task). Given
the additional cognitive load of the multiple-choice task we expected that there would be a
closer association between DAP scores and performance on this than the animation task.
However, we note that DAP scores do not correlate well with some other measures of
non-verbal reasoning (Imuta, Scarf, Pharo & Hayne, 2013). Despite this, they have been shown in
4-year-old children to be a predictor of later IQ (Arden, Trzaskowski, Garfield & Plomin, 2014).
In any case, the DAP measures only some aspects of children's nonverbal ability and it may
be that the cognitive demands of the multiple-choice assessment are more reflected in other
cognitive measures such as attentional control, working memory or inhibition, rather than a
general measure of intelligence.
In conclusion, we have compared two comprehension assessment methodologies and
have found that by varying the assessment context we alter the picture we obtain of children’s
understanding of complex sentences. Our results suggest that young children have a greater understanding of
complex sentences than previously reported, when assessed in a manner that is more
reflective of how we process language in natural discourse. Multiple choice tests may be
useful when we want to see how robust comprehension is when the child is forced to focus
only on grammatical structure and has to ignore other information that may be competing or
distracting, but interpretation needs to take into account the role of other factors, such as
attention, memory and inhibition. Thus in addition to testing complex sentences, we would
need other items that pose similar task demands but use simpler sentences. In addition,
in a well-designed multiple-choice test, analysis of error patterns can help distinguish genuine
grammatical difficulties from more general cognitive impairments (Bishop, 2003). We
suggest that those designing and administering language assessments need to critically
appraise the tools they are using and reflect on whether the test is actually measuring what it
claims to. The current technological era opens the way to developing assessment tools which
allow testing of children’s understanding ‘in real time’. Our results have implications for
those who work clinically as well as researchers who are investigating children’s language
comprehension ability.
Acknowledgments
This research was supported by funding to the first author from the charity RESPECT and the
People Programme (Marie Curie Actions) of the European Union's Seventh Framework
Programme (FP7/2007-2013), under REA grant agreement no. PCOFUND-GA-2013-608728.
We are grateful to the parents, children, schools and nurseries that participated in this study.
References
Adams, C., Coke, R., Crutchley, A., Hesketh, A. & Reeves, D. (2001). Assessment of
Comprehension and Expression 6–11. London: NFER-Nelson. ISBN 9780708705612
Arden, R., Trzaskowski, M., Garfield, V. & Plomin, R. (2014). Genes influence young
children's human figure drawings and their association with intelligence a decade later.
Psychological Science, 25, 1843–50. DOI: 10.1177/0956797614540686
Bishop, D.V.M. (2003). The Test for Reception of Grammar. TROG 2. London:
Psychological Corporation.
Brown, G. D. A. & Hulme, C. (1995). Modeling item length effects in memory span: No
rehearsal needed? Journal of Memory and Language, 34, 594 – 621.
Crain, S., & McKee, C. (1985) The acquisition of structural restrictions on anaphora. In S.
Berman, J. W. Choe & J. McDonagh (eds) Proceedings of NELS 16. GLSA
University of Massachusetts, Amherst.
Crain, S., Thornton, R., Boster, C., & Conway, L. (1996). Quantification without
qualification. Language, 5(2), 83–153. http://doi.org/10.1207/s15327817la0502_2
Davis, J. A. (1971). Elementary survey analysis. Englewood Cliffs, NJ: Prentice-Hall.
Diessel, H. (2004). The acquisition of complex sentences. Cambridge, United Kingdom:
Cambridge University Press.
Diessel, H. & Tomasello, M. (2000). The development of relative clauses in spontaneous
child speech. Cognitive Linguistics 11(1/2) 131-51.
Diessel, H. & Tomasello, M. (2005). A new look at the acquisition of relative clauses.
Language 81(4), 1–25.
Fraser, C., Bellugi, U., & Brown, R. (1963). Control of grammar in imitation, comprehension,
and production. Journal of Verbal Learning and Verbal Behavior, 2(2), 121–135.
Frizelle, P. & Fletcher, P. (2014). Relative clause constructions in children with specific
language impairment. International Journal of Language & Communication
Disorders, 49, 255- 64.
Frizelle, P., O'Neill, C., & Bishop, D. V. M. (2017). Assessing understanding of relative
clauses: a comparison of multiple-choice comprehension versus sentence repetition.
Journal of Child Language, 1–23. http://doi.org/10.1017/S0305000916000635
Gallimore, R. & Tharp, R. G. (1981). The interpretation of elicited sentence imitation in a
standardized context. Language Learning, 31(2), 369 - 92.
Gerken L., & Shady M. E. (1996) The picture selection task. In D. McDaniel, C. McKee, and
H. Smith Cairns (eds), Methods for Assessing Children’s Syntax Cambridge, MA:
MIT Press, pp. 55–76.
Geers, A. E. & Moog, J. S. (1978). Syntactic maturity of spontaneous speech and elicited
imitations of hearing impaired children. Journal of Speech and Hearing Disorders, 43,
380 –91.
Gennari, S. P., & MacDonald, M. C. (2009). Linking production and comprehension
processes: The case of relative clauses. Cognition, 111, 1–23.
http://dx.doi.org/10.1016/j.cognition.2008.12.006.
Goodenough, F. L. (1926). Measurement of intelligence by drawings. New York: World
Books.
Gordon, P. (1996). The Truth-Value Judgment Task. In D. McDaniel, C. McKee and H.
Smith-Cairns (Eds.) Methods for Assessing Children’s Syntax. (pp. 211- 231) MIT
Press.
Gordon, P. & Chafetz, J. (1986). Lexical learning and generalization in the passive acquisition.
Paper presented at the 11th Annual Boston University Conference on Language
Development, Boston, October.
Håkansson, G. & Hansson, K. (2000). Comprehension and production of relative clauses: a
comparison between Swedish impaired and unimpaired children. Journal of Child
Language 27, 313–33.
Harris, D. B. (1963). Children’s drawings as measures of intellectual maturity. New York:
Harcourt, Brace, Jovanovich.
Hulme, C., Maughan, S. & Brown, G. D. A. (1991). Memory for familiar and unfamiliar
words: evidence for a long-term memory contribution to short-term memory span.
Journal of Memory and Language, 30(6), 685–701.
Imuta, K., Scarf, D., Pharo, H., & Hayne, H. (2013). Drawing a Close to
the Use of Human Figure Drawings as a Projective Measure of Intelligence.
PLoS ONE, 8(3), e58991. http://doi.org/10.1371/journal.pone.0058991
Kidd, E., Brandt, S., Lieven, E. & Tomasello, M. (2007). Object relatives made easy: a
cross-linguistic comparison of the constraints influencing young children’s processing of
relative clauses. Language and Cognitive Processes, 22(6), 860–97.
Klem, M., Melby-Lervåg, M., Hagtvet, B., Lyster, S.-A. H., Gustafsson, J. E. & Hulme, C.
(2015). Sentence repetition is a measure of children’s language skills rather than
working memory limitations. Developmental Science, 18(1), 146 – 54.
McDade, H. L., Simpson, M. A. & Lamb, D. E. (1982). The use of elicited imitation as a
measure of expressive grammar: a question of validity. Journal of Speech and
Hearing Disorders, 47(1), 19 – 24.
McDaniel, D., & McKee, C. (1998) Methods for assessing children’s syntax. MIT Press.
Philip, W. (1991). Spreading in the acquisition of universal quantifiers. In West Coast
Conference on Formal Linguistics 10, 359–373.
Philip, W. (1992). Distributivity and logical form in the emergence of universal quantification.
In Semantics and Linguistic Theory, 2, 327-346.
Polišenská, K., Chiat, S. & Roy, P. (2015). Sentence repetition: What does the task measure?
International Journal of Language & Communication Disorders, 50 (1), 106–18.
Potter, M. C. & Lombardi, L. (1990). Regeneration in the short-term recall of sentences.
Journal of Memory and Language 29, 633–54.
Potter, M. C. & Lombardi, L. (1998). Syntactic priming in immediate recall of sentences.
Journal of Memory and Language 38, 265–82.
Roeper, T., & De Villiers, J. (1993). The emergence of bound variable structures.
In Knowledge and language (pp. 105-139). Springer Netherlands.
Schweickert, R. (1993). A multinomial processing tree model for degradation and
redintegration in immediate recall. Memory and Cognition, 21, 168–75.
Semel, E., Wiig, E. M. & Secord, W. (2006). Clinical Evaluation of Language
Fundamentals—Fourth Edition, UK Standardisation (CELF–4 UK). London: Pearson
Assessment.