Journal of Applied Testing Technology, Vol 17(1), 20-32, 2016

A Framework for Examining the Utility of Technology-Enhanced Items

Michael Russell*
Boston College; [email protected]
*Author for correspondence

Abstract
Interest in and use of technology-enhanced items has increased over the past decade. Given the additional time required to administer many technology-enhanced items and the increased expense required to develop them, it is important for testing programs to consider the utility of technology-enhanced items. The Technology-Enhanced Item Utility Framework presented in this paper identifies three characteristics of a technology-enhanced item that may affect its measurement utility. These include: a) the fidelity with which the response interaction space employed by a technology-enhanced item reflects the targeted construct; b) the usability of the interaction space; and c) the accessibility of the interaction space for students who are blind, have low vision, and/or have fine or gross motor skill needs. The framework was developed to assist testing programs and test developers in thinking critically about when the use of a technology-enhanced item is warranted from a measurement utility perspective.

Keywords: Accessibility, Digital Assessments, Item Development, Technology-Enhanced Items, Test Validity, Usability

1. Introduction

For the past century, large-scale standardized tests have relied heavily on selected-response and text-based open-response interactions to collect evidence about cognitive constructs that are the target of measurement (Haladyna & Rodriguez, 2013). Over the past two decades, educational testing programs have begun administering tests on computers and other digital devices. In response, there is growing interest in expanding the ways in which test takers produce responses to test items to demonstrate their knowledge, skills, and/or understanding of the targets of measurement (Drasgow & Olson-Buchanan, 1999; Russell, 2006; Scalise & Gifford, 2006; Washington State, 2010).

With the expanding use of digital assessments, there is also growing interest in measuring test taker knowledge, skills, and understandings in more authentic ways. As evidence, both the Smarter Balanced and the Partnership for Assessment of Readiness for College and Careers (PARCC) assessment consortia aimed to develop items that measure constructs in contexts that are more authentic than traditional selected-response or text-based open-response items (Florida Department of Education, 2010; Washington State, 2010). To meet these needs, test developers have introduced the concept of technology-enhanced items (Russell, 2006; Scalise & Gifford, 2006).

Technology-enhanced items fall into two broad categories. The first category includes items that contain media that cannot be presented on paper. These items utilize video, sound, 3D graphics, and animations as part of the stimulus and/or the response options. The second category includes items that require test takers to demonstrate knowledge, skills, and abilities using response interactions that provide methods for producing responses other than selecting from a set of options or entering alphanumeric content. To distinguish the two categories, the term technology-enabled refers to the first category and technology-enhanced labels the second category (Measured Progress/ETS Collaborative, 2012).
While most test items can be categorized uniquely as traditional, technology-enabled, or technology-enhanced, some items contain both non-text-based media and new response interactions and can therefore be classified as both technology-enabled and technology-enhanced. Because the primary purpose of a test item is to collect evidence of the test taker's development of the targeted knowledge, skill, or ability, and the response interaction is the vehicle for collecting that evidence, the framework presented in this paper focuses narrowly on technology-enhanced items. The utility of technology-enabled items requires a separate framework that, to the best of my knowledge, has not yet been developed.

2. Item Response Interaction Space

As Haladyna and Rodriguez (2013) describe, all test items are composed of at least two parts: a) a stimulus that establishes the problem a test-taker is to focus on; and b) a response space in which a test taker records an answer to the problem. For items delivered in a digital environment, the response space has been termed an interaction space (IMS Global Learning Consortium, 2002) since this is the area in which the test taker interacts with the test delivery system to produce a response. For a multiple-choice item, the interaction space is that part of an item that presents the test-taker with answer options and allows one or more options to be selected. For an open-response item, the interaction space typically takes the form of a text box into which alphanumeric content is entered. For a technology-enhanced item, the interaction space is that part of an item that presents response information to a test-taker and allows the test-taker to either produce or manipulate content to provide a response.

As Sireci and Zenisky (2006) detail, there is a wide and growing variety of interaction spaces employed by technology-enhanced items. For example, one class of interaction space presents words or objects that are classified into two or more categories by dragging and dropping them into their respective containers. Often termed "drag-and-drop," these items are used to measure a variety of knowledge and skills, including classifying geometric shapes (e.g., see Figure 1a, in which shapes are classified based on whether or not they contain parallel lines) or classifying organisms based on specific traits (e.g., mammals versus birds versus fish, single-cell versus multiple-cell organisms, chemical versus physical changes, etc.). A version of this interaction space can also be used to present content that the test taker manipulates to arrange in a given order. As an example, a series of events that occur in a story may be rearranged to indicate the order in which they occurred. Similarly, a list of animals may be rearranged to indicate their hierarchy in a food chain.

Another class of interaction space requires test takers to create one or more lines to produce a response to a given prompt (see Figure 1b). For example, test takers may be presented with a coordinate plane and asked to produce a line that represents a given linear function, a geometric shape upon which the test taker is asked to produce a line of symmetry, or a list of historical events where the test taker is to draw lines connecting each event to the date when it occurred.

A third class of interaction space requires test takers to highlight content to produce a response (see Figure 1c).
For example, the test taker may be asked to highlight words that are misspelled in a given sentence, select sentences in a passage that support a given argument, or highlight elements in an image of a painting that demonstrate the use of a given technique or imagery.

Figure 1. Examples of Interaction Spaces: (a) Drag and Drop Item. (b) Draw Line Item. (c) Select Text Item.

There is a wide variety of interaction spaces currently in use, and they have been classified in a variety of ways. Perhaps the earliest attempt to classify response interaction spaces was made by Bennett (1993), who identified six categories of item response types: multiple-choice, selection/identification, reordering/rearranging, completion, construction, and presentation. Scalise and Gifford (2006) expanded Bennett's classification scheme to consider both the response format and the complexity of the response actions. As an example, under the multiple-choice response category, "True/False" items were classified as least complex and "multiple-choice with new media distractors" as most complex. Similarly, under the completion response category, "single numerical constructed response" was classified as least complex and "matrix completion" as most complex (Scalise & Gifford, 2006).

In an attempt to establish a standardized method of encoding items in a digital format, the IMS Global Learning Consortium (IMS) developed the Question and Test Interoperability (QTI) Specification (2002; 2012). The current version of QTI specifies 32 classes of response interaction spaces that can be used to create a wide variety of item types, including different types of selected response, drag-and-drop, line and object production, text selection, reordering, "free-hand" drawing, and even upload of sound, image, and video files.

Although there is not yet consensus on how to classify or label specific interaction types, the variety of interactions being employed by today's educational testing programs has expanded greatly. However, use of many of these new interaction types comes with increased costs for item development. While it is difficult to obtain accurate figures, the PARCC Assessment Consortium estimates that the cost of developing technology-enhanced items ranges from two to five times that of developing traditional multiple-choice items. Given these economic considerations, it is important to consider the utility that technology-enhanced item interaction spaces provide for an assessment program.

3. Utility of Technology-Enhanced Items

Utility has different meanings in different contexts. In the general colloquial sense, utility focuses on the extent to which something is "useful" or "functional" for a given purpose (Gove, 1986). From this perspective, the utility of a technology-enhanced item interaction space might be conceived as its usefulness for measuring specific knowledge, skills, or abilities. In economics, utility focuses on "desire" or "want" and is often measured by a person's willingness to pay for a given object or service (Marshall, 1920). From an economic perspective, utility of technology-enhanced items might be conceived as an assessment program's desire to employ, and willingness to pay for, the development and delivery of a given item interaction space.
In educational measurement, utility can be viewed in at least two ways (Davey & Pitoniak, 2006). Measurement utility focuses on the information that a given item contributes to the estimate of test taker ability. Content utility considers the extent to which the use of a given item contributes to adequate representation of the content domain measured by the test. When viewed through the lens of evidence centered design (Mislevy, Steinberg, & Almond, 2003), these two forms of utility collectively consider the strength of evidence about a specific construct provided via a given item interaction space.

From the perspective of evidence contribution, there are at least two factors that influence the utility of a given item interaction space. The first factor focuses on the accuracy of the evidence about a given construct provided by an interaction space. The more accurately the information recorded through the response interaction reflects the test taker's mastery of the targeted construct, the greater measurement utility the interaction space provides. The second factor focuses on the fidelity or directness with which the interaction space requires the test taker to employ a given construct (Haladyna & Rodriguez, 2013; Lane & Stone, 2006). Fidelity and directness focus on how closely the context created by an interaction space resembles the context in which a person applies the construct in an authentic or "real-world" situation. From this perspective, the more similar the context created by the interaction space is to a real-world context, the greater the fidelity, and hence the greater the utility.

Focusing on the accuracy of response information provided by an interaction space, there are at least two factors to consider. The first factor relates to construct representation (Messick, 1989) and considers the accuracy of the evidence provided through the interaction space about the test taker's mastery of the construct. The extent to which the test taker must apply the targeted construct as s/he produces a response in the interaction space influences the accuracy of the response evidence. When an interaction space allows a test-taker to produce a correct answer through guessing, trial-and-error, or by applying other constructs such as test-taking skills, its utility for measuring the targeted construct is diminished. Second, the extent to which constructs outside those that are targeted for measurement interfere with a test taker's ability to respond correctly also affects accuracy and focuses on sources of construct irrelevant variance introduced by the item content or the interaction space (Messick, 1998).

Sources of construct irrelevant variance can take many forms, many of which fall under a broader concept of accessibility (Russell, 2011a; 2011b). Traditionally, accessibility has been conceived of as the ease and accuracy with which test takers are able to access content presented by a test item. Factors such as the size of the text and images, the complexity of the language and vocabulary that are used, and the length of item prompts and response options are seen as factors that influence a test taker's ability to access an item (Thompson, Johnstone, & Thurlow, 2002). This view of accessibility emphasizes how test takers must be able to access item content in order to understand what is being asked of them.
A broader conception of accessibility in the context of testing flips the emphasis from the test taker's access to item content to the test item's access to the construct as it operates within the test taker (Russell, 2011b). From the perspective of Accessible Test Design, the test taker's ability to access content presented by an item is still important. In addition, though, the degree to which the context in which an item is administered allows the test taker to apply the targeted construct without distraction, and the test taker's ability to produce a response that reflects the outcome of a cognitive process, are also viewed as components of accessibility. From this perspective, additional factors such as the conditions under which an item is performed, the representational form in which test takers are required to produce a response, and the usability of the interaction space are also seen as factors that influence accessibility.

In cases where the required representational form in which a response must be communicated conflicts with a test taker's ability to communicate in that form, accuracy will be negatively impacted. As an example, for a test taker who is still developing proficiency in English or who has challenges producing text, a math item that requires students to explain their reasoning in written English may interfere with the test taker's ability to accurately record his/her reasoning in the required representational form. Similarly, the more difficult the interaction space is to use to produce responses, the less accurately the test takers' responses may reflect their thinking, and thus the less accurate the information provided by the item is likely to be. As an example, an interaction space that requires students to use the arrow keys to drag-and-drop objects may create frustration due to the need to press a given arrow key several times to position an object, and may result in incomplete response production.

Aspects of each of these three perspectives (colloquial, economic, and educational measurement) can be used to create a working definition for the utility of technology-enhanced items. Drawing on all three, the framework presented below defines utility as the value provided by a given interaction for collecting evidence about the targeted construct in an accurate, efficient, and high-fidelity manner.

4. Technology-Enhanced Item Utility Framework

Given the increasing variety of response interaction spaces that can be created in a digital environment, the following Technology-Enhanced Item Utility Framework is designed to help testing programs weigh the costs and benefits of employing a given response interaction methodology to measure the knowledge, skill, or ability of interest. When considering the utility of a technology-enhanced interaction space, there are three characteristics that should be considered: a) fidelity to the targeted construct; b) usability of the interaction space for producing responses; and c) accessibility of the response interaction for test takers with specific disabilities and special needs.
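To make the structure of the framework concrete, the sketch below shows one way a testing program might record a review of a single item against these three characteristics. It is an illustration only: the class, field, and level names are assumptions introduced here rather than part of the framework, and the overall judgment is made by reviewers (guided by the interpretation guidelines presented later in Table 1), not computed from the component ratings.

```python
from dataclasses import dataclass


@dataclass
class UtilityReview:
    """Hypothetical record of one review of a technology-enhanced item's
    response interaction space. Ratings are qualitative judgments
    (e.g., "High", "Moderate", "Low") assigned by reviewers."""
    item_id: str
    construct_fidelity: str   # authenticity of the context and of the response methods
    usability: str            # intuitiveness, layout, and functionality, judged holistically
    accessibility: str        # examined separately for blind, low-vision, and motor-skill needs
    overall_utility: str      # reviewer judgment informed by the three ratings above
    notes: str = ""


# Example: a drag-and-drop classification item judged authentic in context
# but difficult to complete using keyboard-only input.
review = UtilityReview(
    item_id="MATH-0421",
    construct_fidelity="High",
    usability="Moderate",
    accessibility="Low",
    overall_utility="Moderate",
    notes="Small drop targets slow response production without a mouse.",
)
```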
The first characteristic, construct fidelity, focuses on two aspects: a) the extent to which the response interaction space creates a context that represents how the construct might be applied in an authentic situation; and b) the extent to which the methods employed to produce a response reflect the methods used to produce artifacts that are the outcome of the targeted construct in a real-world environment. Together, these two aspects address the fidelity produced by the interaction space. Construct fidelity is a product of the context created through the interaction, the interaction itself, and the targeted construct.

The second and third characteristics focus on the way in which a testing program delivers the response interaction. Specifically, the second characteristic, termed usability, considers the extent to which the delivery system's implementation of the interaction space allows test-takers to efficiently and accurately produce responses. The third characteristic, termed accessibility, considers whether methods are provided for test takers with special needs to produce responses in an accurate and efficient manner. The usability and accessibility characteristics recognize that a response interaction (e.g., drawing a line) can be implemented in a variety of ways. The way in which a response interaction is implemented affects its usability for collecting responses that reflect the outcome of test taker cognition in an efficient and accurate manner. These three characteristics of utility are explored in greater detail below.

4.1 Construct Fidelity

The primary purpose of technology-enhanced items is to collect evidence that is aligned with the construct measured by an item. In effect, interest in technology-enhanced items derives from growing concerns that traditional items do not adequately measure some constructs (Russell, 2006; Sireci & Zenisky, 2006). It has been argued that new methods of collecting evidence from test takers will increase the variety of constructs that can be measured and that measures of these constructs will improve (Florida Department of Education, 2010; Washington State, 2010).

Given this purpose for technology-enhanced items, the first component of the Technology-Enhanced Item Utility Framework focuses on the fidelity of the interaction space. As described above, there are two factors that affect fidelity. The first focuses on the degree to which the context created through the response interaction space represents an authentic application of the construct. Here, the key question is whether the interaction space creates a context that is similar to a situation in the real world (i.e., outside of a testing situation) where the construct is typically applied. The second factor is concerned with the methods required by an item's response interaction to produce a response and considers the extent to which the methods required to produce a response are similar to those used in a real-world situation. Here, the focus is on the tools and actions the test taker must use to produce a response rather than on the situation created by the interaction space. As an example, consider an item designed to measure writing ability.
An interaction space that requires test takers to produce text in response to a prompt using an external keyboard connected to a computer/tablet device, or an on-screen keyboard on a tablet device, creates an authentic context for applying writing skills and provides methods that are similar (if not identical) to the methods employed in an authentic context. In contrast, an item that presents the same prompt but employs an interaction space that requires test takers to drag-and-drop letters or words to create sentences would still create an authentic context for measuring writing, but would do so in a way that is inconsistent with the methods employed to produce text in an authentic context.

As a second example, an interaction space that presents test takers with a linear function and asks test takers to produce a line on a coordinate plane that represents the given function would have high fidelity with a construct concerned with the ability to produce graphical representations of linear functions. In contrast, the same interaction space would have less fidelity if it were used for an item measuring test takers' knowledge of historical events that required drawing lines connecting a given event with its date of occurrence. In this case, neither the act of producing lines nor the context in which test takers connect events with dates represents how this construct is applied in a real-world context.

In this way, Construct Fidelity focuses both on: a) the extent to which the interaction space creates an authentic context in which the construct is applied outside of a testing situation; and b) the extent to which the methods used by the interaction space reflect the methods used to produce products in an authentic context. It should be noted that low Construct Fidelity does not necessarily mean that the item itself is poor, but rather that the context in which responses are produced and/or the method used to produce a response do not authentically reflect how the construct is typically applied outside of a testing situation. In these situations, it may be beneficial to consider alternate methods of collecting evidence that better reflect how the construct is typically applied in the real world.

4.2 Usability

In many cases, the interaction spaces employed by technology-enhanced items require test takers to invest more time producing a response compared to a traditional response selection item. As an example, when measuring a test taker's ability to create graphical representations of mathematical functions, it may take longer to produce a plot of a linear relationship than it would to select an image that depicts that relationship. While the increased time required to produce a response using a technology-enhanced item may result in more direct evidence of the measured construct in a more authentic context, it competes with a desire for efficient use of time during the testing process. As a result, test developers need to balance the desire to improve the quality of measures of test taker knowledge and skills against the need to collect evidence in a time-efficient manner.

A key factor that influences efficiency is the usability of the interaction space. In this framework, usability is defined as the intuitive functionality of an interaction space and the ease with which a novice user can produce and modify responses with minimal mouse or finger actions and/or response control selections.
While a given interaction space (e.g., text production or line drawing) is intended to allow a test taker to produce a specific type of response, the method used to implement that interaction can vary widely among test delivery systems. For instance, a text production item may allow test takers to use a standard keyboard or limit them to the use of a mouse to select letters from an on-screen "keyboard." Further, there are a number of ways an on-screen keyboard might be arranged, including QWERTY format (i.e., like a traditional keyboard), arranged alphabetically from "a" to "z", or ordered by frequency of use. While each implementation provides functionality that allows a test-taker to produce a text-based response, the ease and efficiency with which a test-taker can do so varies greatly. The usability component focuses on the specific implementation of the response interaction and examines the usability of that implementation. Factors that are considered when examining usability include intuitiveness, layout, and functionality.

4.2.1 Intuitiveness

This factor examines the design of the response interaction space and considers the ease with which a test-taker can determine how to produce a response using the provided tools/functions. When considering intuitiveness, it should be assumed that the test-taker has had some training and prior exposure to the response interaction. As a result, intuitiveness does not focus on the ease with which a naive test-taker can determine how to use the response interaction upon first encounter. Rather, it is concerned with the ease with which test-takers can use the various features of the response interaction with minimal cognitive effort.

4.2.2 Layout

This factor considers whether the interaction space is designed in a way that minimizes the distance between on-screen elements required to produce a response. As an example, if tool buttons are required, are they located in close proximity to the response space and to each other, yet not so close as to allow the test taker to accidentally select the wrong button? Similarly, for an item that requires test takers to drag and drop objects, is the distance that test-takers must drag content minimized, yet not so small that test-takers are confused about which content is to be dragged and which content serves as a receptacle for dragged content?

4.2.3 Functionality

This characteristic considers whether the response interaction is designed in a way that minimizes the number of mouse/finger selections required to produce a response. As an example, if test takers make mistakes, can they correct those mistakes without having to clear the entire response space and begin again?

Although each of these factors influences usability, their influence is considered holistically. For this reason, limited functionality can be compensated for by intuitive design and careful layout. Similarly, poor layout can be compensated for by intuitive design and strong functionality. Usability is therefore a holistic judgment about the interaction space as a whole rather than a tally of the individual quality of each of these factors.
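The holistic nature of this judgment can be illustrated with a brief sketch. In the hypothetical record below (the class and field names are assumptions introduced for illustration), observations about each factor are documented, but the usability rating itself is assigned by the reviewer rather than computed from the factor-level notes:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class UsabilityReview:
    """Hypothetical notes from a review of one implementation of a
    response interaction (e.g., a drag-and-drop interaction space)."""
    intuitiveness_notes: List[str] = field(default_factory=list)
    layout_notes: List[str] = field(default_factory=list)
    functionality_notes: List[str] = field(default_factory=list)
    # Assigned holistically by the reviewer; deliberately NOT derived from
    # the notes above, since strength in one factor can compensate for
    # weakness in another.
    holistic_rating: str = "Not yet rated"


review = UsabilityReview(
    intuitiveness_notes=["Drag handles are obvious after brief training."],
    layout_notes=["Drop containers sit close to the draggable content without crowding."],
    functionality_notes=["Objects cannot be repositioned without clearing the entire response."],
    holistic_rating="Moderate",  # limited functionality partly offset by intuitive design
)
```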
While the usability judgment is not necessarily based upon a direct comparison of the speed with which the test taker can produce a response using other potential response interactions (e.g., drag-and-drop versus multiple-choice selection interactions), it does consider the extent to which the specific implementation allows for efficient response production. This requires the evaluator to view the implementation of the interaction within the test delivery environment and to be familiar with potential alternate approaches to implementing that response interaction.

4.3 Accessibility

The final component of the framework focuses on the accessibility of the interaction space. Like usability, this component focuses on the specific implementation within the test delivery system employed by a given testing program. Also like usability, it considers the extent to which the interaction space allows test takers who are blind, have low vision, or have motor skill-related disabilities to produce a response in an efficient manner. Given that the needs of these three sub-populations of test takers (i.e., blind, low vision, and those with motor skill needs) differ, the accessibility component comprises three separate sub-components, each of which focuses on how well the implementation supports efficient response production by test takers with the focal need.

4.3.1 Motor Skill Accessibility

This accessibility sub-component focuses on the extent to which the implementation allows test takers with fine and gross motor skill needs, and those who use assistive input devices, to efficiently produce responses. Assistive input devices fall into two broad categories: a) those that allow test takers to perform traditional mouse functions (e.g., select/click/highlight, drag, drop) using a device other than a mouse (e.g., a track ball or eye gaze); and b) those that mimic Tab-Enter navigation. Tab-Enter navigation allows test takers to use the TAB key to perform the equivalent of hovering the mouse over an object (e.g., a menu option, button, or text) and to use the ENTER key to select the object that currently has focus (e.g., clicking on a button or menu option). Tab-Enter navigation can be performed using the Tab and Enter keys on a traditional keyboard or by using a variety of assistive input devices such as a dual-switch device, single-switch device, or an alternate keyboard.

Factors that influence accessibility for test takers who use these input devices include:

• The size of objects that must be selected or the size of containers into which objects are to be placed (the smaller the object, the more challenging the selection process)
• The ordering of tab selection (the more logical the ordering, the more efficient the navigation process)
• The hierarchical structure of tab-enter selection (the more logical the structure, the more efficient the navigation process)
• The extent to which all functions within the interaction space are supported by alternate methods (the more functionality supported, the more efficient the response process).

4.3.2 Low Vision Accessibility

Test takers with low vision are typically able to view content that is displayed on the screen, but require it to be enlarged or magnified in order to view it clearly. Thus, the first factor that affects the accessibility of an interaction space for test takers with low vision is whether it allows content to be magnified.
In cases when magnification is allowed, an additional factor that affects accessibility is the extent to which magnification of content obscures relationships among the content displayed in the interaction space. The importance of this factor will vary based upon the interaction space test-takers are required to use and the content with which they are required to perform the interaction. As an example, an interaction space that requires test takers to create a line that bisects an angle requires test takers to engage with a small amount of content, all of which is in close proximity (in this case, the two lines that form the angle). As a result, the relationship between the two lines at the point where the test-taker is expected to respond (i.e., the point of intersection) remains visible even when the content in the interaction space is magnified greatly. In contrast, an item that presents a coordinate plane that ranges from +25 to -25 on the x and y axes and requires the test-taker to create a line that passes through the points (23, 14) and (-15, -18) could present challenges depending on how magnification functions (see Figure 2a). Specifically, if magnification enlarges content within a confined response space, portions of the coordinate grid may be pushed out of view and therefore obscured (see Figure 2b). As a result, it would be difficult to produce a line that passes through points that are no longer visible on the screen. In contrast, if the interaction space expands as magnification increases (effectively allowing the interaction space to cover more of the screen), the visible relationship among key content will be preserved, which allows easier production of a correct response (see Figure 2c). In this way, accessibility for low vision focuses on the manner in which magnification is supported by the response interaction space and whether the magnification functionality impedes the test taker's interaction with response content.

Figure 2. Different implementations of magnification. (a) Item in original, unmagnified state. (b) Magnification confined within response space. (c) Entire response space magnified.

4.3.3 Accessibility for the Blind

This accessibility sub-component focuses on the extent to which the implementation of the interaction space provides supports that allow test takers who are blind to produce a response. Because test takers who are blind cannot view content displayed on a screen, there are three design factors that influence their ability to produce responses:

• The extent to which the implementation supports navigation among content (this factor is similar to the Tab-Enter navigation factor for test takers with motor skill needs).
• The clarity with which navigation and response actions performed by the test-taker are described auditorily, so that the test-taker understands and can confirm that the desired action occurred (e.g., the implementation states what content the test-taker has "tabbed" to or states the container into which an object was placed).
• The extent to which the methods employed to support navigation and to provide confirmation of actions do not interfere with the measured construct (e.g., if the test-taker is required to produce a line with a negative slope and the confirmation provided by a line drawing tool states the intercept and slope of the line produced, the confirmation method interferes with the measured construct by stating the slope of the line produced by the test-taker).

These three factors are considered together when examining the accessibility of the implementation of a response interaction for test takers who are blind.

It is important to note that these three sub-components of accessibility (i.e., Motor Skill, Low Vision, and Blind accessibility) should be examined independently. For this reason, it is possible for an interaction to be viewed as high on motor skill accessibility but low on low vision and blind accessibility. Alternately, an item may be strong for blind accessibility, moderate for motor skill accessibility, and low for low vision accessibility.

Table 1. Guidelines for interpretation of component ratings

Fidelity | Usability | Accessibility | Overall Utility | Rationale
High | High | High | High | The skills required by the interaction are aligned with the skills associated with the measured construct and the interaction is implemented in an efficient and accessible manner.
High | High or Moderate | High or Moderate | Moderately High | The skills required by the interaction are aligned with the skills associated with the measured construct, but the implementation may present moderate challenges for a subset of test takers.
High | Moderate or Low | Moderate or Low | Moderate | The skills required by the interaction are aligned with the skills associated with the measured construct, but the implementation may present significant challenges for a subset of test takers.
High | Low | Low | Low | The skills required by the interaction are aligned with the skills associated with the measured construct, but the implementation may present significant challenges for many test takers.
Moderate | High | High | Moderately High | The skills required by the interaction are moderately aligned with the skills associated with the measured construct and the interaction supports efficient and accessible response by all test takers. While fidelity is only moderate, all test takers can produce responses efficiently.
Low | High | High | Moderate to Low | The skills required by the interaction are not aligned with the skills associated with the measured construct, but the interaction supports efficient and accessible response by all test takers. In cases where the information yielded by the interaction provides solid evidence for the measured construct, the utility would be moderate. As the strength of the evidence decreases, the utility decreases to low.
Moderate to Low | Moderate to Low | Moderate to Low | Low | The skills required by the interaction are moderately to poorly aligned with the skills associated with the measured construct and the implementation is inefficient and/or inaccessible, which presents challenges for some test takers.

5. Using the Technology-Enhanced Item Utility Framework

The Technology-Enhanced Item Utility Framework is designed to help test developers and testing programs consider the extent to which the use of a given response interaction space is both appropriate for the construct measured by an item and implemented in a manner that allows test-takers to accurately and efficiently produce responses that reflect the product of their cognitive processes. While each of these factors is examined individually, they are considered collectively to evaluate the utility of the interaction.
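To make the pattern in the interpretation guidelines easier to see, the sketch below expresses Table 1 as a rough mapping from component ratings to an overall utility level. It is an illustration rather than a scoring rule: the usability and accessibility columns are collapsed into a single implementation rating, and the evidence-strength qualification for lower-fidelity interactions is reduced to a single flag.

```python
from enum import Enum


class Level(Enum):
    HIGH = 3
    MODERATE = 2
    LOW = 1


def overall_utility(fidelity: Level, implementation: Level,
                    evidence_is_solid: bool = True) -> str:
    """Rough, illustrative reading of Table 1 (not part of the framework).

    `implementation` collapses the separate usability and accessibility
    ratings into one level; `evidence_is_solid` stands in for the judgment
    about whether a lower-fidelity interaction still yields solid evidence
    about the measured construct.
    """
    if fidelity is Level.HIGH:
        # Aligned skills: utility tracks how well the implementation supports
        # efficient, accessible response production.
        return {Level.HIGH: "High",
                Level.MODERATE: "Moderate",
                Level.LOW: "Low"}[implementation]
    if implementation is Level.HIGH:
        if fidelity is Level.MODERATE:
            # Moderately aligned, but efficient and accessible for all test takers.
            return "Moderately High"
        # Low fidelity: utility depends on the strength of the evidence the
        # interaction yields about the construct.
        return "Moderate" if evidence_is_solid else "Low"
    # Moderate-to-low fidelity combined with an inefficient or inaccessible
    # implementation.
    return "Low"


print(overall_utility(Level.HIGH, Level.MODERATE))    # Moderate
print(overall_utility(Level.LOW, Level.HIGH, False))  # Low
```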
Table 1 is designed to help interpret the component ratings when making decisions about the utility of an interaction. When evaluating the utility of an interaction, it is important to recognize that the purpose of an interaction space is to create a context in which the targeted construct is applied as the student produces a response using the response methods provided by the interaction space. For this reason, the most important factor affecting the utility of an interaction space is the alignment between the context created by the interaction space and the way(s) in which the targeted construct is authentically applied in a non-testing situation. When the interaction space creates a context that is authentic in terms of how the measured construct is typically applied, the interaction may be said to have utility. However, the level of utility is influenced by how the interaction is implemented within a test delivery system. Specifically, if the interaction is implemented in a way that allows test-takers to efficiently produce responses and provides adequate accessibility for test takers with special needs, then the utility of the interaction is maximized. In this way, strong alignment must be coupled with high levels of usability and accessibility to maximize utility. In contrast, if the implementation of a strongly aligned interaction is inefficient or difficult to access for some or many test takers, its utility is diminished.

As shown in Table 1, when fidelity is moderate or low, two additional factors must be considered prior to making definitive decisions about the utility of the interaction. First, its usability and accessibility must be examined. Second, the extent to which the interaction allows test-takers to produce evidence that can be used to inform the measure of the construct should be considered. In cases where fidelity is moderate or low and usability and accessibility are low, the interaction space should be interpreted as also having poor utility. However, when usability and accessibility are strong, an interaction space of low fidelity may still have moderate utility if the evidence provided by the interaction can serve as a measure of the construct. In effect, these cases result in information that can be produced accurately and efficiently by a broad range of test-takers and used to make inferences about the measured construct, even though the directness of the inference is diminished. Although the skills required to produce a response and the context created via the interaction are unrelated to the measured construct, the resulting information provides appropriate evidence about the measured construct.

As an example, consider an interaction that requires test-takers to drag-and-drop sentences into an order that reflects the plot of a given story. The skills required by the interaction (the ability to select, drag, and position content) are unrelated to reading comprehension. In addition, reordering sentences is not an authentic context in which test takers typically apply understanding of the order of events outside of a testing situation. Yet, the evidence provided by the interaction (i.e., the order of sentences describing events from the story) supports an inference about the test-taker's comprehension of the events of the story.
In cases where the interaction is implemented in an efficient and accessible manner, the utility of that interaction for measuring reading comprehension might be deemed moderate or adequate.

When using the framework to evaluate the utility of a technology-enhanced item, there are two additional considerations to keep in mind: a) the non-comparative aspects of the framework and b) examining utility in the context of measurement. Each of these considerations is discussed separately below.

5.1 Non-Comparative Aspects of the Framework

It is important to note that the intent of the Construct Fidelity component is not to compare the fidelity of the employed interaction with other potential interactions. Rather, the focus is limited to a judgment about the extent to which the context produced through the interaction space is authentic and the skills required for the interaction align with the skills associated with how the measured construct is typically applied in a real-world context. For this reason, it is possible that more than one interaction space can create an authentic context in which the construct is typically applied and/or can employ methods that overlap with how the measured construct is typically applied in an authentic context, and so receive high ratings for construct alignment.

Similarly, when evaluating efficiency and accessibility, the intent is not to compare a specific implementation with other implementations. Rather, the focus is on the specific implementation and whether it provides an efficient and accessible approach to response production. For this reason, it is possible that several different implementations may be rated highly. That being said, it is also important to note that in order to evaluate a given implementation, it is necessary to have a solid understanding of usability design principles, accessibility design principles, and the functionality of common assistive technology devices. While the intent of bringing this knowledge to bear when examining efficiency and accessibility is not to compare the specific implementation with other possible implementations, it is necessary to understand what is possible and what represents best practice when evaluating usability and accessibility. As one example, in order to evaluate the accessibility provided by a Tab-Enter hierarchical design, one needs to be familiar with how Tab-Enter navigation functions and the challenges that can arise from a poor hierarchical design.

5.2 The Context of Measurement

The Technology-Enhanced Item Utility Framework is designed to focus on the use and implementation of a given interaction to measure a targeted construct for four sub-groups of test takers: a) test takers with motor skill needs; b) test takers with visual needs; c) test takers who are blind; and d) test takers who do not have special needs associated with motor skills, visual impairments, or blindness and are expected to use typical keyboard, mouse, and/or finger movements to produce responses. As a result, construct fidelity is evaluated in the context of the measured construct, while efficiency is considered in the context of the actions likely to be performed by the sub-group of test takers who will use keyboard, mouse, or finger actions to produce a response. Additionally, accessibility must be evaluated in the context of the specific needs of the sub-groups of test takers and the tools they typically use to interact with a computer or digital device.
Finally, when component ratings are combined to make an overall determination of utility, interactions with moderate or low construct fidelity ratings must be considered in the context of the types of information yielded by the interaction and the adequacy of using that information as evidence for the measured construct. As a result, evaluating the utility of an interaction space requires one to understand the construct measured by the item and the sub-groups of test takers who are being assessed.

6. Discussion

From an economic perspective, it is clear that the new interaction spaces employed by technology-enhanced items have utility; an increasing number of assessment programs are demanding the development and administration of items that utilize them. However, it is important to remember that the primary criterion for including any item in an educational test is its ability to contribute accurate evidence with utility for informing the measure of a targeted construct. While the idea of employing technology-enhanced items to modernize a testing program or to demonstrate that it is capitalizing on the powers of technology is attractive, the use of new item interaction spaces should not come at the cost of measurement value. That is, if measurement value is degraded by the use of a technology-enhanced interaction, that interaction should not be employed. In contrast, when measurement value is improved by the use of a technology-enhanced interaction, it seems logical to employ that interaction. However, this decision must be tempered by considering the additional cost incurred by developing such items.

The key question is whether the increase in measurement utility outweighs the additional financial cost that it brings. In cases where utility increases substantially while financial costs are affected minimally, it is reasonable to conclude that the interaction should be employed. But what should be done when the item development cost for a given interaction is more than triple the cost of a traditional interaction type and the measurement value is only marginally affected? Clearly, this is a subjective decision. As a guideline, it seems reasonable that a doubling of costs might be acceptable for each one-step increase in utility provided by a given interaction type. That is, when an interaction increases utility from low to moderate compared to a traditional item interaction, a doubling of costs is reasonable. And when utility increases from low to high, a tripling of costs seems acceptable. Developing guidance on making such cost-benefit decisions is an area in need of further research. When conducting this research, it will be important to recognize that the costs associated with developing items that employ a given interaction will likely decrease over time as test developers create more efficient mechanisms for encoding item content and item writers become more accustomed to developing items that employ a given interaction model.

The Technology-Enhanced Item Utility Framework presented here aims to help testing programs and test developers maintain a focus on measurement value by directing attention to the use of a given interaction space to produce an authentic, usable, and accessible context in which the targeted construct is applied by the test taker. Through emphasis on these three factors, it is hoped that technology-enhanced items will be employed increasingly to enhance measurement utility.
7. References

Davey, T. & Pitoniak, M. (2006). Designing computerized adaptive tests. In S.M. Downing & T.M. Haladyna (Eds.), Handbook of Test Development (pp. 543-574). New York, NY: Routledge.

Drasgow, F. & Olson-Buchanan, J.B. (1999). Innovations in Computerized Assessment. Mahwah, NJ: Lawrence Erlbaum Associates, Inc.

Florida Department of Education. (2010). Race to the Top Assessment Program Application for New Grants. Retrieved September 5, 2013 from http://www2.ed.gov/programs/racetothetop-assessment/rtta2010parcc.pdf

Gove, P.B. (1986). Webster's Third New International Dictionary of the English Language Unabridged. Springfield, MA: Merriam-Webster, Inc.

Haladyna, T.A. & Rodriguez, M.C. (2013). Developing and Validating Test Items. New York, NY: Routledge.

IMS Global Learning Consortium. (2002). IMS Question and Test Interoperability: An Overview Final Specification Version 1.2. Retrieved June 25, 2015 from http://www.imsglobal.org/question/qtiv1p2/imsqti_oviewv1p2.html.

IMS Global Learning Consortium. (2012). IMS Question and Test Interoperability: An Overview Version 2.1 Final. Retrieved June 25, 2015 from http://www.imsglobal.org/question/qtiv2p1/imsqti_oviewv2p1.html.

Lane, S. & Stone, C.A. (2006). Performance assessment. In R.L. Brennan (Ed.), Educational Measurement (4th ed., pp. 387-431). Westport, CT: American Psychological Association.

Marshall, A. (1920). Principles of Economics: An Introductory Volume (8th Edition). London: Macmillan.

Measured Progress/ETS Collaborative. (2012). Smarter Balanced Assessment Consortium: Technology Enhanced Items. Retrieved June 23, 2015 from https://www.measuredprogress.org/wp-content/uploads/2015/08/SBAC-Technology-Enhanced-Items-Guidelines.pdf.

Messick, S. (1989). Validity. In R.L. Linn (Ed.), Educational Measurement (3rd ed., pp. 13-103). New York: American Council on Education/Macmillan.

Mislevy, R.J., Steinberg, L.S., & Almond, R.G. (2003). On the Structure of Educational Assessments (CSE Technical Report 597). Los Angeles, CA: Center for the Study of Evaluation.

Russell, M. (2006). Technology and Assessment: The Tale of Two Perspectives. Greenwich, CT: Information Age Publishing.

Russell, M. (2011a). Accessible Test Design. In M. Russell & M. Kavanaugh (Eds.), Assessing Students in the Margin: Challenges, Strategies, and Techniques. Charlotte, NC: Information Age Publishing.

Russell, M. (2011b). Digital Test Delivery: Empowering Accessible Test Design to Increase Test Validity for All Students. A monograph commissioned by the Arbella Advisors.

Scalise, K. & Gifford, B. (2006). Computer-Based Assessment in E-Learning: A Framework for Constructing "Intermediate Constraint" Questions and Tasks for Technology Platforms. Journal of Technology, Learning, and Assessment, 4(6). Retrieved June 23, 2015 from http://ejournals.bc.edu/ojs/index.php/jtla/article/view/1653/1495.

Sireci, S.G. & Zenisky, A.L. (2006). Innovative item formats in computer-based testing: In pursuit of improved construct representation. In S.M. Downing & T.M. Haladyna (Eds.), Handbook of Test Development (pp. 329-348). New York, NY: Routledge.

Thompson, S.J., Johnstone, C.J., & Thurlow, M.L. (2002). Universal design applied to large scale assessments (Synthesis Report 44). Minneapolis, MN: University of Minnesota, National Center on Educational Outcomes. Retrieved June 24, 2015 from http://education.umn.edu/NCEO/OnlinePubs/Synthesis44.html.

Washington State. (2010). Race to the Top Assessment Program Application for New Grants. Retrieved September 5, 2013 from http://www2.ed.gov/programs/racetothetop-assessment/rtta2010smarterbalanced.pdf