Measuring Deep, Reflective Comprehension and Learning Strategies: Challenges and Successes

Danielle S. McNamara

This collection of articles focuses on the important question of how to assess metacognition and strategy use associated with comprehension and learning. Two of the papers focus on the assessment of general learning strategies (Schellings, this issue; Winne & Muis, this issue), and three focus on the assessment of reading strategies (Bråten & Strømsø, this issue; Cromley & Azevedo, this issue; Magliano, Millis, The RSAT Development Team, Levinstein, & Boonthum, this issue). These papers address questions crucial to education in light of the growing understanding that metacognition and strategy use are essential to deep, long-lasting comprehension and learning. This importance dictates the need to develop better means of measuring these constructs.

Deep understanding of content is assumed to emerge principally from strategic activities that prompt the learner to generate inferences connecting what is being learned to other content and to what the learner already knows (McNamara & Magliano, 2009a, 2009b). These activities include asking questions, seeking and evaluating answers to those questions, generating explanations, solving problems, and reflecting on the success of such activities (Bransford, Brown, & Cocking, 2000; Graesser, McNamara, & VanLehn, 2005; McNamara, 2010). While it is fairly well accepted and established that these generative, reflective activities are desirable and lead to deeper, more stable learning, they are rare. They are not characteristic of typical classrooms and, moreover, are often not observed even in one-on-one tutoring sessions (Baker, 1996; Graesser, Person, & Magliano, 1995). These activities are also difficult to observe in an individual because they need not be expressed verbally. One student can ask a question, answer it, evaluate the answer, generate an explanation, solve a problem, and reflect without blinking an eye (so to speak), while another takes an hour, all the while daydreaming about the weather.

Nonetheless, if strategy use is an important factor in comprehension and learning, then it follows that good measures of these constructs are needed. This has been challenging. It is far easier to measure whether a student recognizes a detail, fact, or figure than to measure how the student tends to learn those details, facts, or figures, let alone whether the student tends to learn more or less deeply, or with or without reflection. The complications abound. First, students' judgments of what they do and measurements of their performance often do not match. This mismatch may be due to many factors. For example, measurements that rely on the student's judgments in self-report may be skewed in one direction or another because the student lacks a clear understanding of what constitutes good versus poor performance. Second, retrospective self-report is fraught with issues regarding the reliability of memory. Third, students tend to learn and comprehend differently depending on the subject matter, context, goals, and task. Hence, while a student may appear to use deep, reflective strategies in one situation, the results may be quite different in another. The latter problem may be reduced by assessing actual performance across a variety of subject matters, contexts, goals, and tasks.
This is of course much easier prescribed than done, given the multiple constraints associated with test administration, such as group administration and time limitations. Any viable measure would need to be deliverable to many students in a fairly brief amount of time (e.g., less than an hour). Meeting these real-world constraints while at the same time providing a valid assessment of strategy use is challenging.

Jennifer Cromley and Roger Azevedo (this issue) approached the problem of measuring students' strategy use when comprehending text by assessing students' ability to recognize good strategies. They used a multiple-choice questionnaire in which students chose the example of the best strategy given a particular passage and task (following Kozminsky & Kozminsky, 2001). For example, the student was asked to choose what might follow a passage, in order to assess the ability to make predictive inferences, or to choose the best summary, in order to assess the ability to summarize. Other examples included choosing which questions could not be answered from the passage and which sentences could be omitted without changing the passage meaning. Clear advantages of this approach are that it is contextualized and task specific, and that it avoids problems associated with self-report. Students are not asked to gauge whether they have engaged in some process in the past or would do so in some context; rather, they are asked to recognize its manifestation. The approach also avoids students guessing which answers might be the most socially desirable (or simply the "best" answers) and avoids students misunderstanding normative comparisons. Nonetheless, despite these advantages, a glaring problem with this approach is that it is conflated with the very processes it is attempting to predict. That is, to assess the ability to generate predictive inferences by asking what might follow a passage, the student must understand the passage as well as the text describing what might follow (and engage in intertextual processing of the two pieces of information). Likewise, to evaluate a summary, the student must comprehend both the passage and all of the candidate summaries and be capable of comparing the summaries. In sum, the reading and comprehension processes involved in answering the questions are far too intertwined with the processes involved in the comprehension of the passage to extricate a student's metacognitive abilities.

This conclusion was supported by the results of their study. To assess the reliability and validity of the strategy measure, Cromley and Azevedo (this issue) conducted three separate studies including 1,097 high school and college students. The specific measures differed somewhat across the studies, but the constructs assessed were more or less constant. Across the three studies, they found that the strategy use measure accounted for little unique variance. While it correlated with the other measures and the target comprehension measure, it added little in explaining performance. Indeed, Cromley and Azevedo found that across all three studies, the effects of each separate variable overlapped with the effects of the other variables, each offering little unique variance. These results support the conclusion that the strategy measure primarily assessed comprehension rather than a separable construct associated with strategy use.
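To make the notion of unique variance concrete, the following minimal sketch (in Python, with simulated data and invented variable names; an illustration only, not the authors' analysis) shows how hierarchical regression quantifies the increment in explained variance (Delta R^2) when a strategy measure is added after other predictors. When the strategy measure is largely redundant with those predictors, the increment is small.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300  # hypothetical sample size

# Simulated z-scored measures (all names and weights are invented for illustration).
vocabulary = rng.normal(size=n)
prior_knowledge = rng.normal(size=n)
# A strategy score that is itself largely driven by comprehension-related skill,
# mimicking the conflation discussed above.
strategy = 0.7 * vocabulary + 0.5 * prior_knowledge + 0.3 * rng.normal(size=n)
comprehension = (0.6 * vocabulary + 0.5 * prior_knowledge
                 + 0.1 * strategy + 0.4 * rng.normal(size=n))

def r_squared(y, predictors):
    """R^2 from an ordinary least-squares fit with an intercept."""
    X = np.column_stack([np.ones(len(y))] + predictors)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    ss_res = np.sum((y - X @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

r2_base = r_squared(comprehension, [vocabulary, prior_knowledge])
r2_full = r_squared(comprehension, [vocabulary, prior_knowledge, strategy])

# Delta R^2 is the strategy measure's unique contribution.
print(f"R^2 without the strategy measure: {r2_base:.3f}")
print(f"R^2 with the strategy measure:    {r2_full:.3f}")
print(f"Unique variance (Delta R^2):      {r2_full - r2_base:.3f}")
```

Because the simulated strategy score is driven mostly by vocabulary and prior knowledge, the printed Delta R^2 is close to zero; this is the statistical signature of a strategy measure that largely re-measures comprehension-related skill.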
Notably, the lack of unique variance among all of the measures may be at least partially due to task-specific performance, given that multiple-choice was the response format common to all of the measures. On the plus side, the measure may provide a viable means of assessing a student's ability to understand text deeply rather than superficially, while at the same time using a multiple-choice format (which facilitates scoring). In sum, while Cromley and Azevedo (this issue) addressed both of the principal issues facing the measurement of metacomprehension and deep strategies, a third concern was introduced: that of successfully separating the use of the strategies from other processes involved in reading and comprehension. While some may argue that comprehension and strategy use (or metacognition) are one and the same, most theories assume that they are separable constructs. However, using their measure, Cromley and Azevedo found that it was impossible to statistically separate the one from the other.

Ivar Bråten and Helge Strømsø (this issue) tackled the problem of how to measure strategy use when reading multiple texts. Although they did so using self-report (along with its inherent problems), the questions were contextualized within the task of reading particular texts. In their study, college students read seven science texts and responded to a survey on their use of strategies and how they had processed the texts. Their investigation of strategies associated with multiple-text processing is reflective of the growing appreciation of the importance of these processes to learning. It is evident that real-world tasks often rely on considering multiple perspectives offered by multiple sources, and students' ability to process information across multiple sources is a crucial skill. At issue is whether the crux of that ability lies in the processing of the individual sources of information (including the processing of the implicit information and implications of the single sources) or whether a separable and important ability lies in the processing of information across those sources. Further elucidation of the importance of cross-text processing relies by necessity on the ability to measure it.

The self-report survey developed by Bråten and Strømsø (this issue) contains 15 questions using a 10-point Likert scale, with 5 questions assessing efforts to remember information from the texts and 10 assessing students' efforts to integrate across the texts. The questions (e.g., "I tried to note disagreements among the texts") referred to the students' assessments of how they had processed the seven texts on climate change. The important advantage of this measure is that it is contextualized within the task that the student had just completed, and thus the answers are more likely to reflect performance on a specific task at a particular time, rather than being based on memory or a hypothetical task. Some disadvantages were that strategy use was assessed for only one task and one topic, the measure relied on self-report, the questions were all positively worded and referred to strategies with a positive valence (hence running a greater risk of social desirability effects), and the survey was limited to only 15 questions.
Given the number and range of strategies that might be used, and the variety of contexts in which they may or may not be used (see, e.g., McNamara, Ozuru, Best, & O'Reilly, 2007; Schellings, this issue; Van Hout-Wolters, Simons, & Volet, 2000), 15 questions may not suffice. Comprehension was assessed using verification statements containing inferences referring either to information within single texts (intratextual) or across texts (intertextual). Bråten and Strømsø (this issue) confirmed the presence of two factors accounting for variance in their survey: the accumulation of information (i.e., single-text processing) and cross-text elaboration processes. Unfortunately, students' performance on the intratextual verification statements was solely predicted by prior knowledge. Indeed, a noteworthy positive aspect of their study is that they also measured prior knowledge on the topic (using 15 multiple-choice questions). A disadvantage of the study is that no other comprehension measures were collected. While students' reported efforts to remember information from the texts negatively predicted performance on the intertextual (cross-text) verification statements, their reports of cross-text elaboration were only marginally predictive. Moreover, the relationships were disappointingly weak. One potential conclusion might be that strategies engaged while processing the text did not influence comprehension. Theoretically, however, that is unlikely. As such, future research on the topic might address the particular weaknesses of the study by including more items, assessing performance on a variety of different topics, using both positively and negatively valenced items, and assessing multiple levels of comprehension using various types of measures. However, none of these modifications address what is potentially the most serious roadblock: the use of retrospective self-report.

Indeed, Gonny Schellings (this issue) conducted three studies to compare students' performance on self-report surveys to the use of think-aloud protocols. In the first study, 16 high school students thought aloud while reading a history text, after which they answered a 58-item questionnaire about the strategies they had used while reading the text. Each item on the questionnaire corresponded to a code used to score the verbal protocols. This small study established that the overall reliability of the questionnaire was fair, with moderate to poor reliability depending on the subscale. Likewise, the correlations between the questionnaire and the think-aloud protocols were moderate to weak depending on the subscale. The elaboration and evaluation subscale was found to be the most robust. These results were replicated in a second study with 190 students, with somewhat lower reliability estimates, raising some doubts about the internal consistency of the subscales of the questionnaire. For that reason, a third study with four students was conducted to collect retrospective think-aloud protocols about the questionnaire. These protocols revealed that some students misunderstood the questions, some answered from a general and some from a task-specific perspective, and some students' answers were driven at least partially by social desirability. Overall, the studies indicated that the correlation between think-aloud protocols and a survey can be moderate when the two refer to the same task and there is a correspondence between the protocol codes and the survey items.
However, the low reliability of the self-report survey indicates that the subscales may not measure the intended constructs. Regardless, without measures of comprehension, the validity of the scale is impossible to judge. These studies highlight the complications and concerns encountered in developing questionnaires to assess strategy use and metacomprehension.

Indeed, the most important breakthrough in measuring strategy use is the use of technology to measure on-line performance. Importantly, many researchers assume that the use of verbal protocols and explanation is not possible in real-time assessment because of the time required to code, score, and interpret the students' responses. For example, among the papers in this issue, Bråten and Strømsø (this issue) state that "think-alouds are a very time and labour-intensive methodology…making it less suitable for larger samples." Schellings (this issue) similarly noted that on-line methods such as verbal protocols were excessively challenging. Notably, however, the past decade has seen enormous progress in this regard.

Joe Magliano, Keith Millis, and their colleagues (this issue) describe RSAT, an automated tool for assessing students' comprehension ability and use of comprehension strategies that applies computational methods to verbal protocols produced on-line while reading texts. Students read texts on a computer and answer open-ended indirect and direct questions embedded within the texts. Indirect questions are of the genre "What are you thinking now?", asking readers to report thoughts regarding their understanding of the sentence in the context of the passage. Direct questions assess comprehension with specific "wh-" questions about the text at target sentences. The responses are analyzed for the presence of paraphrases, bridges, and elaborations, which are reading strategies associated with successful comprehension. Their studies have confirmed that the automated scores generated by RSAT correlate highly with human judgments of the presence of the strategies in the protocols. They also show that RSAT measures correlate with standardized measures of comprehension, including the ACT and the Gates-MacGinitie. Further, they demonstrate that the RSAT measures of comprehension and comprehension strategies account for additional variance in predicting performance on short-answer questions about two texts (a narrative and a science text). Some advantages of the RSAT method are that it measures both comprehension ability and strategy use on-line, it does not rely on retrospective self-report, and it can be used across multiple topics and texts. Some disadvantages are that it only assesses the use of strategies in the context of a single, somewhat constrained task (reading a text for a somewhat undefined purpose); it only measures a limited number and type of strategies (i.e., paraphrasing, bridging, and elaboration); and the use of strategies is intrinsically intertwined with the process of understanding the text. The latter problem is less of a concern than in the study conducted by Cromley and Azevedo, however, because RSAT measures traces of the strategies within the verbal protocols, and Magliano et al. statistically demonstrate that these traces are constructs separable from the comprehension outcomes. What is less evident is how to separate, among the traces of strategy use, those that result from automatic processes from those that result from conscious strategic processes.
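RSAT's actual algorithms are not reproduced here; the following minimal sketch (in Python, with hypothetical sentences and a deliberately small stopword list) only illustrates the general word-overlap logic underlying this family of systems: overlap between a response and the current sentence is taken as evidence of paraphrasing, overlap with earlier sentences as evidence of bridging, and content words found in neither as evidence of knowledge-based elaboration.

```python
import re

# A deliberately small stopword list; a real system would use a larger one.
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "that",
             "was", "by", "from", "these", "this", "one", "most", "are", "so"}

def content_words(text):
    """Lower-cased content words with stopwords removed."""
    return {w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS}

def classify_response(response, current_sentence, prior_sentences):
    """Crude heuristic: overlap with the current sentence suggests paraphrasing,
    overlap with earlier sentences suggests bridging, and words found in neither
    suggest knowledge-based elaboration."""
    resp = content_words(response)
    from_current = resp & content_words(current_sentence)
    from_prior = (resp & content_words(" ".join(prior_sentences))) - from_current
    from_neither = resp - from_current - from_prior
    scores = {"paraphrase": len(from_current),
              "bridge": len(from_prior),
              "elaboration": len(from_neither)}
    return max(scores, key=scores.get), scores

# Hypothetical passage and think-aloud response.
prior = ["Heat from the sun is trapped by gases in the atmosphere."]
current = "Carbon dioxide is one of the most important of these gases."
response = "The carbon dioxide in the atmosphere is trapping heat from the sun."
label, scores = classify_response(response, current, prior)
print(label, scores)  # -> bridge {'paraphrase': 2, 'bridge': 3, 'elaboration': 1}
```

Even this crude heuristic makes the next point concrete: overlap counts can detect a trace of bridging, but nothing in those counts reveals whether the bridge was drawn automatically or deliberately.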
When a reader thinks aloud about a text or answers a question, the responses may include a paraphrase or bridging inference that resulted either from automatic, skilled comprehension processes or from concerted efforts to understand a challenging text. For this reason, Magliano et al. refer to these traces as comprehension processes rather than strategies. The outcome of RSAT is consequently an assessment of which processes contribute most to the comprehension outcome, rather than of whether, or the degree to which, the reader is strategic.

Philip Winne and Krista Renee Muis (this issue) focused on judgments of accuracy (i.e., calibration), addressing the general issue of the appropriateness of using the signal detection statistic d' rather than gamma (G) when measuring calibration, and how those judgments varied across domains (i.e., general, word, and mathematics knowledge). Although high calibration does not necessarily indicate that a student will be strategic, calibration is pertinent to strategy use because students must be able to judge the potential accuracy of their performance to know whether to use strategies when comprehending and learning. They investigate the appropriateness of d' in estimating calibration because it expands researchers' ability to examine calibration across domains and contexts. After completing three 40-item short-answer (1-2 word) knowledge tests, 266 undergraduate students assessed the correctness of each answer using a 5-point Likert scale. Students were least accurate on the word knowledge assessment; according to both the G and d' statistics, calibration was high overall but lowest for the mathematics test. Winne and Muis (this issue) further found large correlations between G and d' (r = .63-.71); however, they did not calculate correlations correcting for variance, which may have shown higher correlations. While Winne and Muis (this issue) conclude that d' is a valid and viable statistic to be used when assessing calibration, their evidence is scant. They certainly show that d' can be used and that there is some overlap between d' and G. But there is no evidence provided to indicate that d' is more or less valid than G, because there were no other criterion measurements. Another consideration is potential differences between the tests. While it was found that calibration varied as a function of test, this variance may well have emerged from similarities and differences between the items comprising the tests (rather than from the domain itself).
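To make the two statistics concrete, the sketch below (in Python, with invented data for a single student) dichotomizes confidence at a cut-off and applies a common small-count correction so that hit and false-alarm rates are never exactly 0 or 1; it illustrates the logic of G and d', not Winne and Muis's exact procedure. Gamma reflects only the ordinal agreement between confidence and correctness across item pairs, whereas d' depends on the distance between the normal-transformed hit and false-alarm rates, which is one reason the two statistics can diverge.

```python
from itertools import combinations
from statistics import NormalDist

# Invented data for one student: 1 = answered correctly; confidence on a 1-5 scale.
correct    = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1]
confidence = [5, 4, 2, 5, 3, 1, 3, 4, 4, 5]

def gamma(correct, confidence):
    """Goodman-Kruskal gamma: (concordant - discordant) / (concordant + discordant)."""
    conc = disc = 0
    for (c1, r1), (c2, r2) in combinations(zip(correct, confidence), 2):
        s = (c1 - c2) * (r1 - r2)
        if s > 0:
            conc += 1
        elif s < 0:
            disc += 1
    return (conc - disc) / (conc + disc)

def d_prime(correct, confidence, criterion=3, adjust=0.5):
    """Signal-detection d': z(hit rate) - z(false-alarm rate), treating high
    confidence (> criterion) as a 'yes' response and a correct answer as the
    signal. Adding 0.5 to the counts keeps the rates away from 0 and 1."""
    hits = sum(1 for c, r in zip(correct, confidence) if c == 1 and r > criterion)
    fas  = sum(1 for c, r in zip(correct, confidence) if c == 0 and r > criterion)
    n_signal = sum(correct)
    n_noise = len(correct) - n_signal
    hit_rate = (hits + adjust) / (n_signal + 2 * adjust)
    fa_rate  = (fas + adjust) / (n_noise + 2 * adjust)
    z = NormalDist().inv_cdf
    return z(hit_rate) - z(fa_rate)

print(f"gamma = {gamma(correct, confidence):.2f}")
print(f"d'    = {d_prime(correct, confidence):.2f}")
```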
Conclusions

In conclusion, this collection of articles reflects growing concerns and increased efforts to develop reliable and valid measures of strategy use and metacomprehension. The challenges in doing so are enormous. In the case of Cromley and Azevedo (this issue), their measure could not account for substantial unique variance in comprehension, likely because it involved so many of the processes tapped by the other measures (including word knowledge, prior knowledge, and comprehension processes). Bråten and Strømsø (this issue) similarly found disappointingly weak relationships between their survey and comprehension. Schellings (this issue) did not investigate relationships with comprehension but found stronger relationships between a think-aloud and a survey than have been found previously, though perhaps at the cost of reliability. The most successful approach, it seems, was that adopted by Magliano, Millis, and colleagues (this issue), who leveraged technology to assess on-line verbal protocols, though not without some concerns.

Two pervading issues often surface regarding attempts to assess strategy use: the degree to which the processes are automatic versus conscious, and the degree to which they are induced versus naturally occurring. Some seek to distinguish between comprehension and strategic processes parallel to the distinction between automatic and consciously controlled processes. Drawing the line between the two in this fashion, however, has not been particularly useful or productive. For example, if a reader generates a bridging inference while reading a text but is unaware of having done so, is that strategic, metacognitive, or automatic? The inference may well have emerged simply through the activation of prior knowledge (which is hardly avoidable). A comprehension assessment will indicate that the inference has been made, and an on-line protocol will indicate that the reader is making reference to both the current sentence and a prior sentence. Thus, one may deduce that the reader is a skilled (strategic) reader, because the bridging inference has been successfully drawn. But neither measure would necessarily indicate whether the reader would attempt to use prior text to understand a current sentence in the face of a comprehension challenge. Thus, neither provides a good measure of whether the reader is strategic or metacognitive across a wide range of circumstances.

Another important question that emerges when approaching the task of assessing strategy use is how to distinguish underlying skills germane to the task from metacognition or strategies. Indeed, the problem of obtaining a pure measure of strategy use is readily apparent in the articles presented in this issue. As depicted in Figure 1, strategies are assumed to aid comprehension and learning, and the two also have some shared variance. An underlying assumption is that the strategies are separable from the skills underlying performance on the target comprehension and learning tasks. It is also assumed that there will be some shared variance between the two. Notably, this shared variance may stem from similarities between the tasks or similarities between the measurements. For example, multiple-choice items by their very nature require similar processes, such as comprehending the question and choices and reasoning to eliminate foils. Open-ended or short-answer questions involve different underlying processes than multiple-choice questions, such as the production of text (Ozuru, Best, Bell, Witherspoon, & McNamara, 2007). To the degree that this shared variance is maximized, the likelihood of measuring processes intrinsic to the strategies decreases; however, to the degree that it is minimized, the correlation between the two measures is also likely to be lower. Hence, finding a reliable, contextualized, yet separate measure of metacognition is challenging.

Figure 1. Many theories assume that strategies influence comprehension and learning (i.e., the arrow). However, measurements of the constructs have overlapping variance (i.e., the overlap of the two circles). To the degree that they do not overlap, one obtains a pure measure of strategy use, but that also likely lowers the strength of the correlation between the constructs.
Associated with the movement to develop better measures of metacognition and metacomprehension, researchers and educators are becoming more mindful of the problems associated with learning by consumption (McNamara, 2010). Learning by consumption expresses the fallacy that learners can consume and be filled with knowledge: that delivering information is the end goal and acquiring information defines the essence of education. Such an emphasis on the learning of specific content is both expressed by and heightened by educators' anticipation of state or federally mandated assessments for their students, which typically focus on surface recognition of details, facts, and figures (almost exclusively using multiple-choice questions). But that is changing. There is an increased acknowledgment that a focus on content delivery and acquisition in the classroom tends to result in a distressing reduction in emphasis on the methods and strategies that effect deep and long-lasting learning. That change is happening in some classrooms and is also being expressed in movements within the realm of assessment. Perhaps large-scale assessments such as NAEP and, in particular, PISA are most reflective of those changes. These assessments are moving toward the assessment of multiple levels of understanding and abilities (e.g., comprehension, problem solving, critical thinking), the use of different response formats (recognition, short answer, recall, writing), and the use of multiple types of media to present information. These trends reflect the growing understanding that a student's ability to deeply process and understand information plays an important role in characterizing that student's competencies.

These are encouraging trends in assessment. However, educators will not bring students to the point of performing well on such assessments by focusing on content delivery and remaining victims of the fallacies associated with learning by consumption. Deep, meaningful learning and comprehension across a variety of contexts, topics, and tasks depend on the use of deliberate, effective strategies. One source of evidence for this claim comes from intervention studies, in which students are trained to use learning or comprehension strategies and consequently improve in performance. Examples of such studies are becoming increasingly abundant (e.g., see McNamara, 2007). These interventions tend to be highly successful in improving task-specific and general academic performance, particularly when they include deliberate practice applying the strategies across multiple contexts (or texts). A necessary complement to these strategy interventions is a means of measuring the need for them. The articles in this issue indicate that we have yet to attain that goal. There is much left to do. Thus, it is hoped that researchers will follow in the footsteps of these and similar studies and continue efforts to develop reliable, valid assessments of students' use of deep, reflective comprehension and learning strategies.

References

Baker, L. (1996). Social influences on metacognitive development in reading. In C. Cornoldi & J. Oakhill (Eds.), Reading comprehension difficulties (pp. 331-351). Hillsdale, NJ: Erlbaum.

Bransford, J. D., Brown, A. L., & Cocking, R. R. (Eds.). (2000). How people learn: Brain, mind, experience, and school. Washington, DC: National Academies Press.

Graesser, A. C., McNamara, D. S., & VanLehn, K. (2005). Scaffolding deep comprehension strategies through Point&Query, AutoTutor, and iSTART. Educational Psychologist, 40, 225-234.
Graesser, A. C., Person, N. K., & Magliano, J. P. (1995). Collaborative dialogue patterns in naturalistic one-to-one tutoring. Applied Cognitive Psychology, 9, 495-522.

Kozminsky, E., & Kozminsky, L. (2001). How do general knowledge and reading strategies ability relate to reading comprehension of high school students at different educational levels? Journal of Research in Reading, 24, 187-204.

Magliano, J. P., Millis, K. K., The RSAT Development Team, Levinstein, I., & Boonthum, C. (this issue). Assessing comprehension during reading with the Reading Strategy Assessment Tool (RSAT).

McNamara, D. S. (Ed.). (2007). Reading comprehension strategies: Theories, interventions, and technologies. Mahwah, NJ: Erlbaum.

McNamara, D. S. (2010). Strategies to read and learn: Overcoming learning by consumption. Medical Education, 44, 340-346.

McNamara, D. S., & Magliano, J. P. (2009a). Self-explanation and metacognition: The dynamics of reading. In D. J. Hacker, J. Dunlosky, & A. C. Graesser (Eds.), Handbook of metacognition in education (pp. 60-81). Mahwah, NJ: Erlbaum.

McNamara, D. S., & Magliano, J. P. (2009b). Toward a comprehensive model of comprehension. In B. Ross (Ed.), The psychology of learning and motivation (pp. 297-384). New York, NY: Elsevier.

McNamara, D. S., Ozuru, Y., Best, R., & O'Reilly, T. (2007). The 4-pronged comprehension strategy framework. In D. S. McNamara (Ed.), Reading comprehension strategies: Theories, interventions, and technologies (pp. 465-496). Mahwah, NJ: Erlbaum.

Ozuru, Y., Best, R., Bell, C., Witherspoon, A., & McNamara, D. S. (2007). Influence of question format and text availability on assessment of expository text comprehension. Cognition and Instruction, 25, 399-438.

Van Hout-Wolters, B. H. A. M., Simons, P. R. J., & Volet, S. (2000). Active learning: Self-directed learning and independent work. In P. R. J. Simons, J. L. van der Linden, & T. M. Duffy (Eds.), New learning (pp. 21-37). Dordrecht, The Netherlands: Kluwer.