Measuring Deep, Reflective Comprehension and Learning Strategies: Challenges and Successes
Danielle S. McNamara
This collection of articles focuses on the important question of how to assess
metacognition and strategy use associated with comprehension and learning. Two of the papers
focus on the assessment of general learning strategies (Schellings, this issue; Winne & Muis, this
issue), and three focus on the assessment of reading strategies (Bråten & Strømsø, this issue;
Cromley & Azevedo, this issue; Magliano, Millis, The RSAT Development Team, Levinstein, &
Boonthum, this issue). These papers address questions crucial to education in light of the growing understanding that metacognition and strategy use are essential to deep, long-lasting comprehension and learning. This importance dictates the need to develop better means of measuring these constructs.
Deep understanding of content is assumed to emerge principally from strategic activities
that prompt the learner to generate inferences connecting what is being learned to other content
and what the learner already knows (McNamara & Magliano, 2009a, 2009b). These activities
include asking questions, seeking answers to questions, evaluating the quality of answers to
questions, generating explanations, solving problems, and reflecting on the success of such
activities (Bransford, Brown, & Cocking, 2000; Graesser, McNamara, & VanLehn, 2005;
McNamara, 2010).
While it is fairly well accepted and established that these generative, reflective activities
are desirable and lead to deeper, more stable learning, they are rare. They are not characteristic
activities of typical classrooms and moreover, they are often not observed in one-on-one tutoring
sessions (Baker, 1996; Graesser, Person, & Magliano, 1995). These activities are also difficult to
observe for an individual because they need not be expressed verbally. One student can ask a question, answer a question, evaluate an answer, generate an explanation, solve a problem, and reflect without blinking an eye (so to speak), while another takes an hour, all the while daydreaming about the weather.
Nonetheless, if strategy use is an important factor in comprehension and learning, then it follows that it is important to develop good measures of these constructs. This
has been challenging. It is far easier to measure whether a student recognizes a detail, fact, or figure than to measure how the student tends to learn those details, facts, or figures, let alone whether the student tends to learn more or less deeply, or with or without reflection. The complications abound. First, students’ judgments of what they do and
measurements of their performance often do not match. This mismatch may be due to many
factors. For example, measurements that rely on the student’s judgments in self-report may be skewed in one direction or another because the student lacks a clear understanding of what constitutes good versus poor performance. Second, retrospective self-report is fraught with issues
regarding the reliability of memory. Third, students tend to learn and comprehend differently
depending on the subject matter, contexts, goals, and tasks. Hence, while a student may appear to
use deep, reflective strategies in one situation, the results may be quite different in another. This problem may potentially be reduced by assessing actual performance across a variety of subject
matters, contexts, goals, and tasks. This is of course much easier prescribed than done given the
multiple constraints associated with test administration, such as group administration and time
limitations. Any viable measure would need to meet the constraints of being deliverable to many
students in a fairly brief amount of time (e.g., less than an hour). Meeting these real-world
constraints, while at the same time providing a valid assessment of strategy use, is challenging.
Jennifer Cromley and Roger Azevedo (this issue) approached problems of measuring
students’ strategy use when comprehending text by assessing students’ ability to recognize good
strategies. They used a multiple-choice questionnaire in which students chose the example of the
best strategy given a particular passage and task (following Kozminsky & Kozminsky, 2001).
For example, the student was asked to choose what might follow a passage, in order to assess the
ability to make predictive inferences, or to choose the best summary, in order to assess the ability
to summarize. Other examples included choosing which questions could not be answered from
the passage and which sentences could be omitted without changing the passage meaning. Clear
advantages of this approach are that it is contextualized, task specific, and avoids problems
associated with self-report. Students are not asked to gauge whether or not they have engaged in some process in the past or would do so in some context; rather, they are asked to recognize its manifestation. This approach keeps students from guessing which answers might be the most socially desirable, or best, answers, and from misunderstanding normative comparisons.
Nonetheless, despite the advantages, a glaring problem with this approach is that it is
conflated with the very processes it is attempting to predict. That is, in the case of assessing the
ability to generate predictive inferences, in order to choose what might follow a passage, the student must understand the passage as well as the text describing what might follow (and engage in intertextual processing of the two pieces of information). In order to evaluate a summary,
the student must be able to comprehend both the passage and all of the summaries and be
capable of comparing the summaries. In sum, the reading and comprehension processes
involved in answering the questions are far too intertwined with the processes involved in the
comprehension of the passage to extricate a student’s metacognitive abilities.
This conclusion was supported by the results of their study. To assess the reliability and
validity of the strategy measure, Cromley and Azevedo (this issue) conducted three separate
studies including 1097 high school and college students. The specific measures differed
somewhat across the studies, but the constructs assessed were more or less constant. Across the
three studies, they found that the strategy use measure accounted for little unique variance.
While it correlated with the other measures and the target comprehension measure, it added little
in explaining performance. Indeed, Cromley and Azevedo found that across all three studies, the
effects of each separate variable overlapped with the effects of the other variables, each offering
little unique variance. These results support the conclusion that the strategy measure primarily
assessed comprehension rather than a separable construct associated with strategy use. Notably,
the lack of unique variance among all of the measures may be at least partially due to task-specific performance, given that multiple-choice was the response format common to all of the
measures. On the plus side, it may provide a viable assessment of a student’s ability to understand text deeply rather than superficially, while at the same time using a multiple-choice format (which facilitates scoring).
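To make concrete what accounting for unique variance means here, the following sketch illustrates how the incremental contribution of a strategy measure is typically gauged: the change in R-squared when the measure is added to a regression that already contains the other predictors. The data and variable names are hypothetical and offered only for illustration; this is not Cromley and Azevedo’s analysis.

import numpy as np

rng = np.random.default_rng(0)
n = 200
vocab = rng.normal(size=n)   # hypothetical word-knowledge scores
prior = rng.normal(size=n)   # hypothetical prior-knowledge scores
# A strategy measure that overlaps heavily with the other predictors
strategy = 0.8 * vocab + 0.6 * prior + rng.normal(scale=0.5, size=n)
# A comprehension outcome driven largely by that shared variance
comprehension = 0.5 * vocab + 0.5 * prior + 0.1 * strategy + rng.normal(size=n)

def r_squared(y, *predictors):
    """R-squared from an ordinary least squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1 - residuals.var() / y.var()

base = r_squared(comprehension, vocab, prior)
full = r_squared(comprehension, vocab, prior, strategy)
print(f"R2 without the strategy measure: {base:.3f}")
print(f"R2 with the strategy measure:    {full:.3f}  (unique variance: {full - base:.3f})")

In data patterned this way, the second model improves on the first only marginally, which is the signature of a measure that correlates with the outcome yet adds little unique variance.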
In sum, while Cromley and Azevedo (this issue) addressed both of the principal issues
facing the measurement of metacomprehension and deep strategies, a third concern was
introduced, that of successfully separating the use of the strategies from other processes involved
in reading and comprehension. While some may argue that comprehension and strategy use (or
metacognition) are one and the same, most theories assume that they are separable constructs.
However, using their measure, Cromley and Azevedo found that it was impossible to statistically separate the one from the other.
Ivar Bråten and Helge Strømsø (this issue) tackled the problem of how to measure
strategy use when reading multiple texts. Although they did so using self-report (along with its
inherent problems), the questions were contextualized within the task of reading particular texts.
In their study, college students read seven science texts and responded to a survey on their use of
strategies and how they had processed the texts. Their investigation of strategies associated with
multiple text processing is reflective of the growing appreciation of the importance of these
processes to learning. It is evident that real-world tasks often rely on considering multiple perspectives offered by multiple sources, and students’ ability to process information across multiple sources is a crucial skill. At debate is whether the crux of those abilities lies in the processing of the individual sources of information (and even the processing of the implicit
information and the implications from the single sources) or if a separable and important ability
lies in the processing of information across those sources. Further elucidation of the importance
of cross-text processing relies by necessity on the ability to measure it.
The self-report survey developed by Bråten and Strømsø (this issue) contains 15
questions using a 10-point Likert scale, with 5 questions assessing efforts to remember information from the texts and 10 assessing students’ efforts to integrate across the texts. The questions (e.g., “I tried to note disagreements among the texts”) referred to students’ assessments of how they had processed the seven texts on climate change. The important advantage of this measure
is that it is contextualized within the task that the student had just recently completed, and thus
the answers are more likely to be reflective of performance on a specific task at a particular time,
rather than based on memory or a hypothetical task. Some disadvantages were that strategy use was assessed for only one task and one topic, a self-report measure was used, the questions were all positively worded and regarded strategies with a positive valence (hence suffering greater risk of social desirability effects), and the survey was limited to only 15 questions. Given the number and range of strategies that might be used, and the variety of
contexts in which they may or may not be used (see e.g., McNamara, Ozuru, Best, & O’Reilly,
2007; Schellings, this issue; Van Hout-Wolters, Simons, & Volet, 2000), 15 questions may not
suffice.
Comprehension was assessed using verification statements containing inference
statements referring to either information in single texts (intratextual) or across texts
(intertextual). Bråten and Strømsø (this issue) confirmed the presence of two factors accounting
for variance in their survey, the accumulation of information (e.g., single text processing) and
cross-text elaboration processes. Unfortunately, students’ performance on the intratextual
verification statements was solely predicted by prior knowledge. Indeed, a noteworthy positive
aspect of their study is that they also measured prior knowledge on the topic (using 15 multiple-choice questions). A disadvantage of the study is that no other comprehension measures were collected. While students’ reported efforts to remember information from the texts negatively predicted performance on the intertextual (cross-text) verification statements, the students’ reports of cross-text elaboration were only marginally predictive. Moreover, the relationships were
disappointingly weak. One potential conclusion might be that strategies engaged while
processing the text did not influence comprehension. Theoretically, however, that is unlikely. As
such, future research on the topic might address the particular weaknesses of the study by
including more items, assessing performance on a variety of different topics, using both
positively and negatively valenced items, and assessing multiple levels of comprehension using
various types of measures. However, none of these modifications addresses what is potentially the most serious roadblock: the use of retrospective self-report.
Indeed, Gonny Schellings (this issue) conducted three studies to compare students’ responses on self-report surveys to think-aloud protocols. In the first study, 16 high school students thought aloud while reading a history text, after which they answered a 58-item
questionnaire about the strategies they had used while reading the text. Each item on the
questionnaire corresponded to a code used to score the verbal protocols. This small study
established that the reliability of the questionnaire was fair, with moderate to poor subscale
reliability depending on the subscale. Likewise, the correlations between the questionnaire and
think-aloud protocols were moderate to weak depending on the subscale. The elaboration and
evaluation subscale was found to be the most robust. These results were replicated in a second
study with 190 students, with somewhat lower reliability estimates, raising some doubts about
the internal consistency of the subscales of the questionnaire. For that reason, a third study with four students was conducted to collect retrospective think-aloud protocols about the questionnaire. These protocols revealed that some students misunderstood the questions, some answered from a general and some from a task-specific perspective, and some students’ answers were driven at least partially by social desirability. Overall, the study indicated that the
correlation between think-aloud and a survey can be moderate when they refer to the same task
and include a correspondence between the codes and the items. However, the low reliability of
the self-report survey indicates that the subscales may not measure the intended constructs.
Regardless, without measures of comprehension, the validity of the scale is impossible to judge.
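For readers unfamiliar with the internal-consistency index underlying these reliability concerns, the sketch below computes Cronbach’s alpha for a single subscale. The item-response matrix is hypothetical and serves only to illustrate how the subscale reliability of such a questionnaire is typically estimated.

import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a respondents-by-items matrix of scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1).sum()
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances / total_variance)

# Hypothetical 5-item subscale answered by six respondents on a 1-5 scale
responses = [
    [4, 5, 4, 4, 5],
    [2, 2, 3, 2, 2],
    [5, 4, 5, 5, 4],
    [3, 3, 2, 3, 3],
    [1, 2, 1, 2, 1],
    [4, 4, 5, 4, 4],
]
print(f"alpha = {cronbach_alpha(responses):.2f}")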
These studies highlight the complications and concerns encountered in developing
questionnaires to assess strategy use and metacomprehension. Indeed, the most important breakthrough in measuring strategy use is the use of technology to measure on-line performance.
Importantly, many researchers assume that the use of verbal protocols and explanation is not
possible in real-time assessment because of the time required to code, score, and interpret the
students’ responses. For example, among the papers in this issue, Bråten and Strømsø (this
issue) state ‘think-alouds are a very time and labour-intensive methodology…making it less
suitable for larger samples.’ Schellings (this issue) similarly noted that on-line methods such as
verbal protocols were excessively challenging. Notably, however, the past decade has shown
enormous progress in this regard.
Joe Magliano, Keith Millis, and their colleagues (this issue) describe RSAT, which is an
automated tool for assessing students’ comprehension ability and use of comprehension
strategies by using computational methods to analyze verbal protocols produced on-line while
reading texts. Students read texts on a computer and answer open-ended indirect and direct
questions embedded within the texts. Indirect questions are of the genre “What are you thinking now?”, asking the readers to report thoughts regarding their understanding of the sentence in the
context of the passage. Direct questions assess comprehension with specific “wh-” questions
about the text at target sentences. The responses are analyzed for the presence of paraphrases,
bridges, and elaborations, which are reading strategies associated with successful
comprehension. Their studies have confirmed that the automated scores generated by RSAT
correlate highly with humans’ judgments of the presence of the strategies in the protocols. They
also show that RSAT measures correlate with standardized measures of comprehension
including the ACT and the Gates-MacGinitie. Further, they demonstrate that the RSAT measures of comprehension and comprehension strategies account for additional variance in predicting performance on short-answer questions about two texts (a narrative and a science text).
Some advantages of the RSAT method are that it measures both comprehension ability
and strategy use on-line, it does not rely on retrospective self-report, and it can be used across
multiple topics and texts. Some disadvantages are that it only assesses the use of strategies in the
context of a single, somewhat constrained task (reading a text for a somewhat undefined
purpose); it only measures a limited number and type of strategies (i.e., paraphrasing, bridging,
and elaboration); and the use of strategies is intrinsically intertwined with the process of
understanding the text. The latter problem is less of a concern than in the study conducted by
Cromley and Azevedo, however, because RSAT is measuring traces of the strategies within the
verbal protocols, and Magliano et al. statistically demonstrate that these traces are separable
constructs from the comprehension outcomes. What is less evident is how to separate, among the traces of strategy use, those that result from automatic processes from those that result from conscious, strategic processes. When a reader thinks aloud about a text or answers a question, those responses may include a paraphrase or bridging inference that resulted from
automatic, skilled comprehension processes, or from concerted efforts to understand a
challenging text. For this reason, Magliano et al. refer to these traces as comprehension processes
rather than strategies. The outcome of RSAT is consequently an assessment of what processes
contribute most to the comprehension outcome, rather than whether or the degree to which the
reader is strategic.
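As a rough illustration of how such traces can be detected computationally, the sketch below classifies the content words of a think-aloud response by their overlap with the current sentence (a paraphrase trace), with the prior text (a bridging trace), or with neither (an elaboration trace). This is an assumed, simplified heuristic offered for illustration only; it is not RSAT’s actual scoring algorithm, and the example texts and word lists are invented.

import re

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "that", "so", "for"}

def content_words(text):
    """Lower-cased words with a small, assumed stopword list removed."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def score_response(response, current_sentence, prior_text):
    """Count response words overlapping the current sentence (paraphrase),
    the prior text only (bridge), or neither source (elaboration)."""
    resp = content_words(response)
    current = content_words(current_sentence)
    prior = content_words(prior_text)
    return {
        "paraphrase": len(resp & current),
        "bridge": len(resp & (prior - current)),
        "elaboration": len(resp - current - prior),
    }

prior_text = "Plants capture light energy with chlorophyll in their leaves."
current = "That energy drives the synthesis of glucose from carbon dioxide and water."
response = "So the light the chlorophyll captured is used to build glucose, a fuel for the plant."
print(score_response(response, current, prior_text))

Counts of this kind can then be aggregated over target sentences and compared against human codes, which is the logic behind validating automated scores against judges’ ratings.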
Philip Winne and Krista Renee Muis (this issue) focused on judgments of accuracy (i.e., calibration), addressing the general issue of the appropriateness of using the signal detection statistic d’, as opposed to gamma (G), when measuring calibration, and how those judgments varied across domains (i.e., general, word, and mathematics knowledge). Although high calibration does not
necessarily indicate that a student will be strategic, calibration is pertinent to strategy use
because students must be able to judge the potential accuracy of their performance to know
whether to use strategies when comprehending and learning. They investigate the
appropriateness of d’ in estimating calibration because it expands researchers’ ability to examine
calibration across domains and contexts.
After completing three 40-item short-answer (1-2 words) knowledge tests, 266
undergraduate students assessed the correctness of each answer using a 5-point Likert scale.
Students were least accurate on the word knowledge assessment; according to both the G and d’ statistics, calibration was high overall but lowest for the mathematics test. Winne and Muis (this issue) further found large correlations between G and d’ (r = .63-.71); however, they did not correct these correlations for differences in variance, which may have yielded higher estimates.
While Winne and Muis (this issue) conclude that d’ is a valid and viable statistic to be
used when assessing calibration, their evidence is scant. They certainly show that d’ can be used
and that there is some overlap between d’ and G. But there is no evidence provided to indicate
that d’ is more or less valid than G because there were no other criterion measurements.
Another consideration is potential differences between the tests. While it was found that
calibration varied as a function of test, this variance may well have emerged from similarities
and differences between the items comprising the tests (rather than the domain itself).
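For concreteness, the sketch below computes the two calibration indices at issue from per-item confidence judgments and correctness scores: the gamma correlation (G) and the signal detection statistic d’. The data, the threshold used to binarize the 5-point judgments for d’, and the log-linear correction for extreme rates are assumptions made for illustration; this is not Winne and Muis’s procedure.

from statistics import NormalDist
from itertools import combinations

def gamma(confidence, correct):
    """Goodman-Kruskal gamma over all item pairs (tied pairs excluded)."""
    concordant = discordant = 0
    for (c1, a1), (c2, a2) in combinations(zip(confidence, correct), 2):
        dc, da = c1 - c2, a1 - a2
        if dc == 0 or da == 0:
            continue  # tied on confidence or accuracy: excluded from G
        if (dc > 0) == (da > 0):
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)

def d_prime(confidence, correct, threshold=4):
    """d' treating confidence >= threshold as a 'yes, my answer was correct' judgment."""
    hits = sum(1 for c, a in zip(confidence, correct) if c >= threshold and a)
    misses = sum(1 for c, a in zip(confidence, correct) if c < threshold and a)
    fas = sum(1 for c, a in zip(confidence, correct) if c >= threshold and not a)
    crs = sum(1 for c, a in zip(confidence, correct) if c < threshold and not a)
    # log-linear correction keeps the z-scores finite when a rate is 0 or 1
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (fas + 0.5) / (fas + crs + 1)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

# Hypothetical 5-point confidence judgments and item correctness (1 = correct)
conf = [5, 4, 4, 3, 2, 5, 1, 3, 4, 2]
acc = [1, 1, 0, 1, 0, 1, 0, 0, 1, 0]
print(f"G = {gamma(conf, acc):.2f}, d' = {d_prime(conf, acc):.2f}")

In this sketch, G discards tied pairs while d’ depends on the dichotomization threshold and the correction used; such analytic choices are part of what makes comparing the two indices across domains non-trivial.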
Conclusions
In conclusion, this collection of articles reflects growing concerns and increased efforts to
develop reliable and valid measures of strategy use and metacomprehension. The challenges in doing so are enormous. In the case of Cromley and Azevedo (this issue), their measure could not account for substantial unique variance in comprehension, likely because it
involved so many of the processes tapped by the other measures (including word knowledge,
prior knowledge, and comprehension processes). Bråten and Strømsø (this issue) similarly found
disappointingly weak relationships between their survey and comprehension. Schellings (this
issue) did not investigate relationships with comprehension but found stronger relationships
between a think aloud and a survey than have been found previously, though perhaps at the cost
of reliability. The most successful approach, it seems, was that adopted by Magliano, Millis, and
colleagues (this issue), who leveraged technology to assess on-line verbal protocols, though that approach, too, is not without its concerns.
Two pervading issues often surface regarding attempts to assess strategy use: the degree
to which the processes are automatic versus conscious, and the degree to which they are induced
versus naturally occurring. Some seek to distinguish between comprehension and strategic
processes in a way that parallels the distinction between automatic and consciously controlled processes.
Drawing the line between the two in this fashion, however, has not been particularly useful or
productive. For example, if a reader generates a bridging inference while reading a text, but is
unaware of having done so, is that strategic or metacognitive or automatic? The inference may
well have emerged simply through the activation of prior knowledge (which is hardly avoidable).
A comprehension assessment will indicate that the inference has been made, and an on-line
protocol will indicate that the reader is making reference to both the current sentence and a prior
sentence. Thus, one may deduce that the reader is a skilled (strategic) reader, because the
bridging inference has been successfully drawn. But neither measure would necessarily indicate
whether the reader would attempt to use prior text in order to understand a current sentence in
the face of a comprehension challenge. Thus, neither would provide a good measure of
whether the reader is strategic or metacognitive across a wide range of circumstances.
Another important question that emerges when approaching the task of assessing strategy
use is how to distinguish underlying skills germane to the task from metacognition or strategies.
Indeed, the difficulty of obtaining a pure measure of strategy use is readily apparent in the articles presented in this issue. As depicted in Figure 1, strategies are assumed to aid in comprehension
and learning, and they also have some shared variance. An underlying assumption is that the
strategies are separable from the skills underlying performance on the target comprehension and
learning tasks. It is also assumed that there will be some shared variance between the two.
Notably, this shared variance may stem from similarities between the tasks, or similarities
between measurements. For example, multiple-choice items by their very nature require similar
processes such as comprehending the question and choices, and reasoning to eliminate foils.
Open-ended or short-answer questions would involve different underlying processes than multiple-choice questions, such as the production of text (Ozuru, Best, Bell, Witherspoon, & McNamara, 2007). To the degree that this shared variance is maximized, the likelihood of measuring processes intrinsic to the strategies decreases. However, to the degree that it is minimized, the correlation between the two is also likely to be lower. Hence, finding a reliable, contextualized,
yet separate measure of metacognition is challenging.
Figure 1. Many theories assume that strategies influence comprehension and learning (i.e., the arrow).
However, measurements of the constructs have overlapping variance (i.e., the overlap of the two circles).
To the degree that they do not overlap, one obtains a pure measure of strategy use, but that also likely
lowers the strength of the correlation between the constructs.
Alongside the movement to develop better measures of metacognition and
metacomprehension, researchers and educators are becoming more mindful of the problems
associated with learning by consumption (McNamara, 2010). Learning by consumption
expresses the fallacy that learners can consume and be filled with knowledge – that delivering
information is the end goal and acquiring information defines the essence of education. Such an
emphasis on the learning of specific content is both expressed by and heightened by educators’
anticipation of state or federally mandated assessments for their students, which typically focus
on surface recognition of details, facts, and figures (almost exclusively using multiple-choice
questions).
But that is changing. There is an increased acknowledgment that a focus on content
delivery and acquisition in the classroom tends to result in a distressing reduction in emphasis on
the methods and strategies that effect deep and long-lasting learning. That change is happening in some classrooms and is also being expressed in some movements in the realm of assessment. Perhaps large-scale assessments such as NAEP and particularly PISA are most reflective of those changes. These assessments are moving toward the assessment of multiple levels of understanding and abilities (e.g., comprehension, problem solving, critical thinking), the use of different response formats (recognition, short answer, recall, writing), as well as the use of
multiple types of media to present information. These trends in assessment reflect the growing
understanding that a student’s ability to deeply process and understand information plays an
important role in characterizing a student’s competencies. These are encouraging trends in
assessment.
However, educators will not bring students to the point of performing well on such
assessments by focusing on content delivery, and by remaining victims of the fallacies associated
with assumptions of learning by consumption. Deep, meaningful learning and comprehension
across a variety of contexts, topics, and tasks depends on the use of deliberate, effective
strategies. One source of evidence for this claim comes from intervention studies, where students
are provided with training to use learning or comprehension strategies and consequently improve
in performance. Examples of such studies are becoming increasingly abundant (e.g., see
McNamara, 2007). These interventions tend to be highly successful in improving task-specific and general academic performance, particularly when they include deliberate practice applying the strategies across multiple contexts (or texts).
A necessary complement to these strategy interventions is a means of measuring the need for them. The articles in this issue indicate that we have yet to attain that goal. There is much left to do. Thus,
hopefully, researchers will follow in the footsteps of these and similar studies and continue
endeavors to develop reliable, valid assessments of students’ use of deep, reflective
comprehension and learning strategies.
References
Baker, L. (1996). Social influences on metacognitive development in reading. In C. Cornoldi & J.
Oakhill (Eds.), Reading comprehension difficulties (pp. 331–351). Hillsdale, NJ: Erlbaum.
Bransford, J. D., Brown, A. L., & Cocking, R. R. (Eds.). (2000). How people learn: Brain, mind,
experience, and school. Washington, DC: National Academies Press.
Graesser, A.C., McNamara, D.S., & VanLehn, K. (2005). Scaffolding deep comprehension
strategies through Point&Query, AutoTutor, and iSTART. Educational Psychologist, 40,
225-234.
Graesser, A. C., Person, N. K., & Magliano, J. P. (1995). Collaborative dialogue patterns in
naturalistic one-to-one tutoring. Applied Cognitive Psychology, 9, 495-522.
Kozminsky, E., & Kozminsky, L. (2001). How do general knowledge and reading strategies ability relate to reading comprehension of high school students at different educational levels? Journal of Research in Reading, 24, 187-204.
Magliano, J. P., Millis, K. K., The RSAT Development Team, Levinstein, I., & Boonthum, C.
(this issue). Assessing comprehension during reading with the Reading Strategy Assessment Tool (RSAT).
McNamara, D. S. (Ed.). (2007). Reading comprehension strategies: Theories, interventions, and
technologies. Mahwah, NJ: Erlbaum.
McNamara, D.S. (2010). Strategies to read and learn: Overcoming learning by consumption.
Medical Education, 44, 340-346.
McNamara, D. S. & Magliano, J. P. (2009a). Self-explanation and metacognition: The dynamics
of reading. In D. J. Hacker, J. Dunlosky, & A. C. Graesser (Eds.), Handbook of
metacognition in education (pp. 60-81). Mahwah, NJ: Erlbaum.
McNamara, D.S. & Magliano, J.P. (2009b). Toward a comprehensive model of comprehension.
In B. Ross (Ed.), The psychology of learning and motivation (pp. 297-384). New York, NY:
Elsevier.
McNamara, D. S., Ozuru, Y., Best, R., & O'Reilly, T. (2007). The 4-pronged comprehension
strategy framework. In D.S. McNamara (Ed.), Reading comprehension strategies: Theories,
interventions, and technologies (pp. 465-496). Mahwah, NJ: Erlbaum.
Ozuru, Y., Best, R., Bell, C., Witherspoon, A., & McNamara, D.S. (2007). Influence of question
format and text availability on assessment of expository text comprehension. Cognition &
Instruction, 25, 399–438.
Van Hout-Wolters, B. H. A. M., Simons, P. R. J., & Volet, S. (2000). Active learning: Self-directed learning and independent work. In P. R. J. Simons, J. L. van der Linden, & T. M. Duffy (Eds.), New learning (pp. 21-37). Dordrecht, The Netherlands: Kluwer.