
Journal of Applied Testing Technology, Vol 17(1), 20-32, 2016
A Framework for Examining the Utility of
Technology-Enhanced Items
Michael Russell*
Boston College; [email protected]
*Author for correspondence
Abstract
Interest in and use of technology-enhanced items has increased over the past decade. Given the additional time required
to administer many technology-enhanced items and the increased expense required to develop them, it is important for
testing programs to consider the utility of technology-enhanced items. The Technology-Enhanced Item Utility Framework
presented in this paper identifies three characteristics of a technology-enhanced item that may affect its measurement
utility. These include: a) the fidelity with which the response interaction space employed by a technology-enhanced item
reflects the targeted construct; b) the usability of the interaction space; and c) the accessibility of the interaction space
for students who are blind, have low vision, and/or have fine or gross motor skill needs. The framework was developed
to assist testing programs and test developers in thinking critically about when the use of a technology-enhanced item is
warranted from a measurement utility perspective.
Keywords: Accessibility, Digital Assessments, Item Development, Technology-Enhanced Items, Test Validity, Usability
1. Introduction
For the past century, large-scale standardized tests have
relied heavily on selected-response and text-based open-response interactions to collect evidence about cognitive
constructs that are the target of measurement (Haladyna
& Rodriguez, 2013). Over the past two decades, educational testing programs have begun administering tests
on computers and other digital devices. In response, there
is growing interest in expanding the ways in which test
takers produce responses to test items to demonstrate
their knowledge, skills, and/or understanding of the targets of measurement (Drasgow & Olson-Buchanan, 1999;
Russell, 2006; Scalise & Gifford, 2006; Washington State,
2010).
With the expanding use of digital assessments, there
is also growing interest in measuring test taker knowledge, skills, and understandings in more authentic
ways. As evidence, both the Smarter Balanced and the
Partnership for Assessment of Readiness for College
and Careers (PARCC) assessment consortiums aimed to
develop items that measure constructs in contexts that
are more authentic than traditional selected or text-based
open-response items (Florida Department of Education,
2010; Washington State, 2010). To meet these needs, test
developers have introduced the concept of technology-enhanced items (Russell, 2006; Scalise & Gifford, 2006).
Technology-enhanced items fall into two broad categories. The first category includes items that contain
media that cannot be presented on paper. These items
utilize video, sound, 3D graphics, and animations as
part of the stimulus and/or the response options. The
second category includes items that require test takers to demonstrate knowledge, skills, and abilities using
response interactions that provide methods for producing responses other than selecting from a set of options
or entering alphanumeric content. To distinguish the
two categories, the term technology-enabled refers to the
first category and technology-enhanced labels the second
category (Measured Progress/ETS Collaborative, 2012).
While most test items can be categorized uniquely as traditional, technology-enabled, or technology-enhanced,
there are some items that contain both non-text-based
media and new response interactions and can be classified
as both technology-enabled and technology-enhanced.
Because the primary purpose of a test item is to collect
evidence of the test taker’s development of the targeted
knowledge, skill, or ability and the response interaction
is the vehicle for collecting that evidence, the framework
presented in this paper focuses narrowly on technology-enhanced items. Examining the utility of technology-enabled items
requires a separate framework that, to the best of my
knowledge, has not yet been developed.
2. Item Response Interaction
Space
As Haladyna and Rodriguez (2013) describe, all test items
are composed of at least two parts: a) a stimulus that
establishes the problem a test-taker is to focus on; and b) a
response space in which a test taker records an answer to
the problem. For items delivered in a digital environment,
the response space has been termed an interaction space
(IMS Global Learning Consortium, 2002) since this is the
area in which the test taker interacts with the test delivery
system to produce a response.
For a multiple-choice item, an interaction space is that
part of an item that presents the test-taker with answer
options and allows one or more options to be selected.
For an open-response item, an interaction space typically
takes the form of a text box into which alphanumeric
content is entered. For a technology-enhanced item, the
interaction space is that part of an item that presents
response information to a test-taker and allows the test-taker to either produce or manipulate content to provide
a response.
As Sireci and Zenisky (2006) detail, there are a wide
and growing variety of interaction spaces employed by
technology-enhanced items. For example, one class of
interaction space presents words or objects that are classified into two or more categories by dragging and dropping
them into their respective containers. Often termed "drag-and-drop," these items are used to measure a variety of
knowledge and skills including classifying geometric
shapes (e.g., see Figure 1a in which shapes are classified
based on whether or not they contain parallel lines) or
classifying organisms based on specific traits (e.g., mammals versus birds versus fish, single cell versus multiple
cell organisms, chemical versus physical changes, etc.). A
version of this interaction space can also be used to present content that the test taker manipulates to arrange in a
given order. As an example, a series of events that occur in
a story may be rearranged to indicate the order in which
they occurred. Similarly, a list of animals may be rearranged to indicate their hierarchy in a food chain.
Another class of interaction space requires test takers to create one or more lines to produce a response to
a given prompt (see Figure 1b). For example, test takers
may be presented with a coordinate plane and asked to
produce a line that represents a given linear function, a
geometric shape upon which the test taker is asked to
produce a line of symmetry, or a list of historical events
where the test taker is to draw lines connecting each event
to the date when it occurred.
A third class of interaction space requires test takers
to highlight content to produce a response (see Figure
1c). For example, the test taker may be asked to highlight words that are misspelled in a given sentence, select
sentences in a passage that support a given argument, or
highlight elements in an image of a painting that demonstrate the use of a given technique or imagery.
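To make the notion of an interaction space concrete, the sketch below shows one way a delivery system might capture and score a response from the drag-and-drop classification interaction described above. The data structures, identifiers, and scoring rule are hypothetical and are not drawn from any particular delivery system.

    # A minimal sketch (hypothetical data structures, not any particular system)
    # of how a drag-and-drop classification response might be captured and scored.

    # Answer key: each draggable shape is keyed to the container it belongs in.
    KEY = {
        "square": "has_parallel_lines",
        "trapezoid": "has_parallel_lines",
        "triangle": "no_parallel_lines",
    }

    def score_classification(response: dict) -> int:
        """The response maps each draggable object to the container the test
        taker dropped it into; score 1 only if every object is placed correctly."""
        return int(all(response.get(obj) == cat for obj, cat in KEY.items()))

    # Example: the test taker drops the triangle into the wrong container.
    response = {
        "square": "has_parallel_lines",
        "trapezoid": "has_parallel_lines",
        "triangle": "has_parallel_lines",
    }
    print(score_classification(response))  # 0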
There are a wide variety of interaction spaces currently in use and they have been classified in a variety
of ways. Perhaps the earliest attempt to classify response
interaction spaces was made by Bennett (1993) who
identified six categories of item response types including multiple-choice, selection/identification, reordering/
rearranging, completion, construction, and presentation.
Scalise and Gifford (2006) expanded Bennett’s classification scheme to consider both the response format and the
complexity of the response actions. As an example, under
the multiple-choice response category, “True/False” items
were classified as least complex and “multiple-choice
with new media distractors” as most complex. Similarly,
under the completion response category “single numerical constructed response” was classified as least complex
and “matrix completion” was the most complex (Scalise
& Gifford, 2006).
In an attempt to establish a standardized method
of encoding items in a digital format, the IMS Global
Learning Consortium (IMS) developed the Question and
Test Interoperability (QTI) Specification (2002; 2012).
The current version of QTI specifies 32 classes of response
interaction spaces that can be used to create a wide variety of item types including different types of selected response, drag-and-drop, line and object production, text selection, reordering, "free-hand" drawing, and even upload of sound, image, and video files.

Figure 1. Examples of Interaction Spaces: (a) Drag and Drop Item. (b) Draw Line Item. (c) Select Text Item.
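To make the QTI approach more concrete, the sketch below uses Python's standard library to assemble a skeletal reordering item. The element and attribute names follow common QTI 2.x conventions (assessmentItem, responseDeclaration, itemBody, orderInteraction), but the snippet is only an illustrative sketch with made-up identifiers and prompt text; it is not a validated QTI instance.

    # Illustrative sketch only: a skeletal, QTI-flavored encoding of a reordering item.
    import xml.etree.ElementTree as ET

    item = ET.Element("assessmentItem", identifier="orderEvents01",
                      title="Order the story events", adaptive="false",
                      timeDependent="false")

    # Declare the expected response: an ordered list of event identifiers.
    decl = ET.SubElement(item, "responseDeclaration", identifier="RESPONSE",
                         cardinality="ordered", baseType="identifier")
    correct = ET.SubElement(decl, "correctResponse")
    for event_id in ("E1", "E2", "E3"):
        ET.SubElement(correct, "value").text = event_id

    # The item body holds the reordering interaction the test taker manipulates.
    body = ET.SubElement(item, "itemBody")
    interaction = ET.SubElement(body, "orderInteraction",
                                responseIdentifier="RESPONSE", shuffle="true")
    ET.SubElement(interaction, "prompt").text = \
        "Arrange the events in the order they occur in the story."
    for event_id, text in (("E1", "The storm begins"),
                           ("E2", "The family loses power"),
                           ("E3", "Neighbors help with repairs")):
        ET.SubElement(interaction, "simpleChoice", identifier=event_id).text = text

    print(ET.tostring(item, encoding="unicode"))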
Although there is not yet consensus on how to classify
or term specific interaction types, the variety of interactions being employed by today’s educational testing
programs has expanded greatly. However, use of many of
these new interaction types comes with increased costs
for item development. While accurate figures are difficult
to obtain, the PARCC Assessment Consortium estimates
that the cost of developing technology-enhanced items
ranges from two to five times that of developing traditional multiple-choice items. Given these economic
considerations, it is important to consider the utility that
technology-enhanced item interaction spaces provide for
an assessment program.
3. Utility of Technology-Enhanced Items
Utility has different meanings in different contexts. In the
general colloquial sense, utility focuses on the extent to
which something is “useful” or “functional” for a given
purpose (Gove, 1986). From this perspective, the utility of
a technology-enhanced item interaction space might be
conceived as its usefulness for measuring specific knowledge, skills, or abilities.
In economics, utility focuses on “desire” or “want”
and is often measured by a person’s willingness to pay for
a given object or service (Marshall, 1920). From an economic perspective, utility of technology-enhanced items
might be conceived as an assessment program’s desire to
employ and willingness to pay for the development and
delivery of a given item interaction space.
In educational measurement, utility can be viewed in
at least two ways (Davey & Pitoniak, 2006). Measurement
utility focuses on the information that a given item contributes to the estimate of test taker ability. Content utility
considers the extent to which the use of a given item contributes to adequate representation of the content domain
measured by the test. When viewed through the lens of
evidence centered design (Mislevy, Steinberg, & Almond,
2003), these two forms of utility collectively consider the
strength of evidence about a specific construct provided
via a given item interaction space.
From the perspective of evidence contribution, there
are at least two factors that influence the utility of a given
item interaction space. The first factor focuses on the accuracy of the evidence about a given construct provided by
an interaction space. The more accurately the information
recorded through the response interaction reflects the
test taker’s mastery of the targeted construct, the greater
measurement utility the interaction space provides. The
second factor focuses on the fidelity or directness with
which the interaction space requires the test taker to
employ a given construct (Haladyna & Rodriguez, 2013;
Lane & Stone, 2006). Fidelity and directness focus on
how closely the context created by an interaction space
resembles the context in which a person applies the construct in an authentic or “real-world” situation. From this
perspective, the more similar the context created by the
interaction space is to a real-world context, the greater the
fidelity, and hence the greater the utility.
Focusing on accuracy of response information provided by an interaction space, there are at least two factors
to consider. The first factor relates to construct-representation (Messick, 1989) and considers the accuracy of the
evidence provided through the interaction space about
the test taker’s mastery of the construct. The extent to
which the test taker must apply the targeted construct as
s/he produces a response in the interaction space influences the accuracy of the response evidence. When an
interaction space allows a test-taker to produce a correct
answer through guessing, trial-and-error, or by applying
other constructs such as test-taking skills, its utility for
measuring the targeted construct is diminished. Second,
the extent to which constructs outside those that are targeted for measurement interfere with a test taker’s ability
to respond correctly also affects accuracy and focuses on
sources of construct irrelevant variance introduced by the
item content or the interaction space (Messick, 1998).
Sources of construct irrelevant variance can take
many forms, many of which fall under a broader concept
of accessibility (Russell, 2011a; 2011b). Traditionally,
accessibility has been conceived of as the ease and accuracy with which test takers are able to access content
presented by a test item. Factors such as the size of the
text and images, complexity of the language and vocabulary that are used, and the length of item prompts and
response options are seen as factors that influence a test
taker’s ability to access an item (Thompson, Johnstone, &
Thurlow, 2002). This view of accessibility emphasizes how
test takers must be able to access item content in order to
understand what is being asked of them.
A broader conception of accessibility in the context
of testing flips emphasis from the test taker’s access to
item content to the test item’s accessing the construct as
it operates within the test taker (Russell, 2011b). From
the perspective of Accessible Test Design, the test taker’s
ability to access content presented by an item is still important. In addition, though, the degree to which the context
in which an item is administered allows the test taker to
apply the targeted construct without distraction and the
test taker’s ability to produce a response that reflects the
outcome of a cognitive process are also viewed as components of accessibility. From this perspective, additional
factors such as the conditions under which an item is performed, the representational form in which test takers are
required to produce a response and the usability of the
interaction space are also seen as factors that influence
accessibility. In cases where the required representational
form in which a response must be communicated conflicts with a test taker’s ability to communicate in that
form, accuracy will be negatively impacted. As an example, for a test taker who is still developing proficiency in
English or who has challenges producing text, a math
item that requires students to explain their reasoning
in written English may interfere with the ability of the
test taker to accurately record his/her reasoning in the
required representational form. Similarly, the more difficult the interaction space is to use to produce responses,
the less accurately the test takers’ responses may reflect
their thinking, and thus the less accurate the information
provided by the item is likely to be. As an example, an
interaction space that requires students to use the arrow
keys to drag-and-drop objects may create frustration due
to the need to press a given arrow several times to position an object and may result in incomplete response
production.
Aspects from each of these three perspectives – colloquial, economic, and educational measurement – can
be used to create a working definition for the utility of
technology-enhanced items. Drawing on these perspectives, the
framework presented below defines utility as the value
provided by a given interaction for collecting evidence
about the targeted construct in an accurate, efficient and
high-fidelity manner.
4. Technology-Enhanced Item
Utility Framework
Given the increasing variety of response interaction spaces
that can be created in a digital environment, the following
technology-enhanced item utility framework is designed
to help testing programs weigh the costs and benefits of
employing a given response interaction methodology to
measure the knowledge, skill, or ability of interest. When
considering the utility of a technology-enhanced interaction space, there are three characteristics that should be
considered: a) fidelity to the targeted construct; b) usability of the interaction space for producing responses; and
c) accessibility of the response interaction for test takers
with specific disabilities and special needs.
The first characteristic, construct fidelity, focuses on
two aspects: a) the extent to which the response interaction space creates a context that represents how the
construct might be applied in an authentic situation;
and b) the extent to which the methods employed to
produce a response reflect the methods used to produce
artifacts that are the outcome of the targeted construct
in a real-world environment. Together, these two aspects
address the fidelity produced by the interaction space.
Construct fidelity is a product of the context created
through the interaction, the interaction itself, and the
targeted construct. The second and third characteristics
focus on the way in which a testing program delivers the
response interaction. Specifically, the second characteristic, termed usability, considers the extent to which the
delivery system’s implementation of the interaction space
allows test-takers to efficiently and accurately produce
responses. The third characteristic, termed accessibility,
considers whether methods are provided for test takers
with special needs to produce responses in an accurate
and efficient manner.
The usability and accessibility characteristics recognize that a response interaction (e.g., drawing a line) can
be implemented in a variety of ways. The way in which a
response interaction is implemented affects its usability
for collecting responses that reflect the outcome of test
taker cognition in an efficient and accurate manner. These
three characteristics of utility are explored in greater
detail below.
4.1 Construct Fidelity
The primary purpose of technology-enhanced items
is to collect evidence that is aligned with the construct
measured by an item. In effect, interest in technology-enhanced items derives from growing concerns that
traditional items do not adequately measure some constructs (Russell, 2006; Sireci & Zenisky, 2006). It has been
argued that new methods of collecting evidence from
test takers will increase the variety of constructs that
can be measured and that measures of these constructs
will improve (Florida Department of Education, 2010;
Washington State, 2010).
Given this purpose for technology-enhanced items,
the first component of the Technology-Enhanced Item
Utility Framework focuses on the fidelity of the interaction space. As described above, there are two factors that
affect fidelity. The first focuses on the degree to which the
context created through the response interaction space
represents an authentic application of the construct. Here,
the key question is whether the interaction space creates
a context that is similar to a situation in the real world
(i.e., outside of a testing situation) where the construct
is typically applied. The second factor is concerned with
the methods required by an item’s response interaction
to produce a response and considers the extent to which
the methods required to produce a response are similar to
those used in a real world situation. Here, the focus is on
the tools and actions the test taker must use to produce a
response rather than on the situation created by the interaction space.
As an example, consider an item designed to measure
writing ability. An interaction space that requires test
takers to produce text in response to a prompt using an
external keyboard connected to a computer/tablet device
or an on-screen keyboard on a tablet device creates an
authentic context for applying writing skills and provides
methods that are similar (if not identical) to the methods
employed in an authentic context. In contrast, an item
that presents the same prompt but employs an interaction
space that requires test takers to drag-and-drop letters or
words to create sentences would still create an authentic
context for measuring writing but in a way inconsistent
with the methods employed to produce text in an authentic context.
As a second example, an interaction space that presents test takers with a linear function and asks test takers
to produce a line on a coordinate plane that represents
the given function would have high fidelity with a construct concerned with the ability to produce graphical
representations of linear functions. In contrast, the same
interaction space would have less fidelity if it were used
for an item measuring test takers’ knowledge of historical events that required drawing lines connecting a given
event with its date of occurrence. In this case, neither the
act of producing lines nor the context in which test takers
connect events with dates represents how this construct is
applied in a real-world context.
In this way, Construct Fidelity focuses both on: a) the
extent to which the interaction space creates an authentic
context in which the construct is applied outside of a testing situation; and b) the extent to which the methods used
by the interaction space reflect the methods used to produce products in an authentic context. It might be noted
that low Construct Fidelity does not necessarily mean
that the item itself is poor but rather that the context in
which responses are produced and/or the method used to
produce a response do not authentically reflect how the
construct is typically applied outside of a testing situation.
In these situations, it may be beneficial to consider alternate methods of collecting evidence that better reflect
how the construct is typically applied in the real world.
4.2 Usability
In many cases, the interaction spaces employed by technology-enhanced items require test takers to invest more
time producing a response compared to a traditional
response selection item. As an example, when measuring
a test taker’s ability to create graphical representations of
mathematical functions, it may take longer to produce
a plot of a linear relationship than it would to select an
image that depicts that relationship. While the increased
time required to produce a response using a technology-enhanced item may result in more direct evidence of
the measured construct in a more authentic context, it
competes with a desire for efficient use of time during
the testing process. As a result, test developers need to
balance the desire to improve the quality of measures of
test taker knowledge and skills against the desire to maximize
the time efficiency of evidence collection.
A key factor that influences efficiency is the usability of the interaction space. In this framework, usability
is defined as the intuitive functionality of an interaction
space and the ease with which a novice user can produce
and modify responses with minimal mouse or finger
actions and/or response control selections. While a given
interaction space (e.g., text production or line drawing) is
intended to allow a test taker to produce a specific type of
response, the method used to implement that interaction
can vary widely among test delivery systems. For instance,
a text production item may allow test takers to use a
standard keyboard or limit them to the use of a mouse
to select letters from an on-screen “keyboard.” Further,
there are a number of ways an on-screen keyboard might
be arranged including QWERTY format (i.e., like a traditional keyboard), arranged alphabetically from “a” to
“z”, or ordered by frequency of use. While each implementation provides functionality that allows a test-taker
to produce a text-based response, the ease and efficiency
with which a test-taker could do so varies greatly.
The usability component focuses on the specific
implementation of the response interaction and examines
the usability of that implementation. Factors that are considered when examining usability include intuitiveness,
layout, and functionality.
4.2.1 Intuitiveness
This factor examines the design of the response interaction space and considers the ease with which a test-taker
can determine how to produce a response using the provided tools/functions. When considering intuitiveness, it
should be assumed that the test-taker has had some training and prior exposure to the response interaction. As a
result, intuitiveness does not focus on the ease with which
a naive test-taker can determine how to use the response
interaction upon first encounter. Rather, it is concerned
with the ease with which test-takers can use the various
features of the response interaction with minimal cognitive effort.
4.2.2 Layout
This factor considers whether the interaction space is
designed in a way that minimizes the distance between
on-screen elements required to produce a response. As an
example, if tool buttons are required, are they located in
close proximity to the response space and to each other,
yet not so close as to allow the test taker to accidentally select the wrong button? Similarly, for an item that
requires test takers to drag and drop objects, is the distance that test-takers must drag content minimized, yet
not so close as to confuse test-takers about what content
is to be dragged and what content represents a receptacle
for dragged content?
4.2.3 Functionality
This characteristic considers whether the response interaction is designed in a way that minimizes the number of
mouse/finger selections required to produce a response.
As an example, if test takers make mistakes, can they
correct those mistakes without having to clear the entire
response space and begin again?
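As a hedged illustration of this point, the sketch below contrasts a response model that supports undoing a single action with one that only offers "clear all." The class and method names are hypothetical and do not reflect any particular delivery system.

    # Hypothetical sketch: a drag-and-drop response state that lets a test taker
    # undo the most recent placement instead of clearing the whole response.
    class DragDropResponse:
        def __init__(self):
            self.placements = {}   # object id -> container id
            self.history = []      # ordered log of placements, used for undo

        def place(self, obj_id, container_id):
            # Remember the previous placement (if any) so it can be restored.
            self.history.append((obj_id, self.placements.get(obj_id)))
            self.placements[obj_id] = container_id

        def undo_last(self):
            """Reverse only the most recent action (one extra selection)."""
            if self.history:
                obj_id, previous = self.history.pop()
                if previous is None:
                    self.placements.pop(obj_id, None)
                else:
                    self.placements[obj_id] = previous

        def clear_all(self):
            """The blunt alternative: every prior action must be redone."""
            self.placements.clear()
            self.history.clear()

    r = DragDropResponse()
    r.place("triangle", "no_parallel_lines")
    r.place("square", "no_parallel_lines")   # a mistake
    r.undo_last()                            # fixes only the mistake
    print(r.placements)                      # {'triangle': 'no_parallel_lines'}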
Although each of these factors influences usability,
their influence is considered holistically. For this reason,
limited functionality can be compensated for by intuitive
design and careful layout. Similarly, poor layout can be
compensated for by intuitive design and strong functionality. Usability of an interaction space is a holistic concept
that focuses on the overall usability of the interaction
space rather than the individual quality of each of these
factors.
While this factor is not necessarily based upon a
direct comparison of the speed with which the test taker
can produce a response with other potential response
interactions (e.g., drag-and-drop versus multiple-choice
selection interactions), it does consider the extent to
which the specific implementation allows for efficient
response production. This requires the evaluator to view
the implementation of the interaction within the test
delivery environment and to be familiar with potential alternate approaches to implementing that response
interaction.
4.3 Accessibility
The final component of the framework focuses on the
accessibility of the interaction space. Like usability, this
focuses on the specific implementation within the test
delivery system employed by a given testing program.
Also similar to usability, it considers the extent to which
the interaction space allows test takers who are blind,
have low vision, or have motor skills-related disabilities to
produce a response in an efficient manner. Given that the
needs of these three sub-populations of test takers (i.e.,
blind, low vision, and those with motor skill needs) differ, the accessibility component comprises three separate
sub-components, each of which focuses on how well it
supports efficient response production by test takers with
the focal need.
4.3.1 Motor Skill Accessibility
This accessibility sub-component focuses on the extent to
which the implementation allows test takers with fine and
gross motor skill needs and those who use assistive input
devices to efficiently produce responses. Assistive input
devices fall into two broad categories: a) those that allow
test takers to perform traditional mouse functions (e.g.,
select/click/highlight, drag, drop) using a device other
than a mouse (e.g., track ball or eye gaze); and b) those
that mimic Tab-Enter navigation. Tab-Enter navigation allows test takers to use the TAB key to perform the
equivalent mouse action of hovering over an object (e.g.,
a menu option, button, or text) and using the ENTER key
to select the object over which the mouse is hovering (e.g.,
clicking on a button or menu option). Tab-Enter navigation can be performed using the Tab and Enter keys on
a traditional keyboard or by using a variety of assistive
input devices such as a dual-switch device, single-switch
device, or an alternate keyboard.
The factors that influence the accessibility of motor
skill input devices include:
• The size of objects that must be selected or the size of containers into which objects are to be placed (the smaller the object, the more challenging the selection process).
• The ordering of tab selection (the more logical the ordering, the more efficient the navigation process; a simple ordering check is sketched after this list).
• The hierarchical structure of tab-enter selection (the more logical the structure, the more efficient the navigation process).
• The extent to which all functions within the interaction space are supported by alternate methods (the more functionality supported, the more efficient the response process).
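The sketch below shows one minimal way to check whether the tab order of an interaction's focusable elements follows their visual reading order. The element records and the top-to-bottom, left-to-right reading-order rule are hypothetical simplifications, offered only to illustrate the ordering factor above.

    # Hypothetical sketch: verify that tab order follows visual reading order
    # (top-to-bottom, then left-to-right), one ingredient of efficient
    # Tab-Enter navigation.
    def tab_order_matches_reading_order(elements):
        """elements: list of dicts with 'tab_index', 'x', and 'y' screen positions."""
        by_tab = sorted(elements, key=lambda e: e["tab_index"])
        by_position = sorted(elements, key=lambda e: (e["y"], e["x"]))
        return [e["tab_index"] for e in by_tab] == [e["tab_index"] for e in by_position]

    elements = [
        {"tab_index": 1, "x": 10, "y": 10},    # prompt region (top left)
        {"tab_index": 2, "x": 10, "y": 120},   # drop container (lower on screen)
        {"tab_index": 3, "x": 200, "y": 10},   # draggable object (top right)
    ]
    # False: the draggable object appears before the container visually
    # but is reached after it when tabbing.
    print(tab_order_matches_reading_order(elements))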
4.3.2 Low Vision Accessibility
Test takers with low vision are typically able to view content that is displayed on the screen, but require it to be
enlarged or magnified in order to view it clearly. Thus, the
first factor that affects the accessibility of an interaction
space for test takers with low vision is whether it allows
content to be magnified.
In cases when magnification is allowed, an additional
factor that affects accessibility is the extent to which magnification of content obscures relationships among all
content displayed in the interaction space. The importance of this factor will vary based upon the interaction
space test-takers are required to use and the content with
which they are required to perform the interaction. As an
example, an interaction space that requires test takers to
create a line that bisects an angle requires test takers to
engage with a small amount of content all of which is in
close proximity to each other (in this case two lines that
form an angle). As a result, the relationship between the
two lines at the point where the test-taker is expected to
respond (i.e., the point of intersection) is visible when
the content in the interaction space is magnified greatly.
In contrast, an item that presents a coordinate plane that
ranges from +25 to -25 on the x and y axes and requires
the test-taker to create a line that passes through the
points (23, 14) and (-15, -18) could present challenges
depending on how magnification functions (see Figure
2a). Specifically, if magnification enlarges content within a confined response space, portions of the coordinate grid may be pushed out of view and therefore obscured (see Figure 2b). As a result, it would be difficult to produce a line that passes through points that are no longer visible on the screen. In contrast, if the interaction space expands as magnification increases (effectively allowing the interaction space to cover more of the screen), the visible relationship among key content will be preserved, which allows easier production of a correct response (see Figure 2c). In this way, accessibility for low vision focuses on the manner in which magnification is supported by the response interaction space and whether the magnification functionality impedes the test taker's interaction with response content.
Figure 2. Different implementations of magnification. (a) Item in original, unmagnified state. (b) Magnification confined within response space. (c) Entire response space magnified.
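A simplified numeric sketch of the two magnification behaviors follows, using the coordinate ranges from the example above and assuming the zoom is centered on the origin of the grid; the function names and the model itself are illustrative only.

    # Simplified model of the two cases in Figure 2 (illustrative only).
    def visible_when_zoomed_in_place(point, half_extent, zoom, center=(0.0, 0.0)):
        """Content is enlarged inside a fixed response space, so the visible
        world range shrinks by the zoom factor (the Figure 2b case)."""
        visible_half = half_extent / zoom
        return (abs(point[0] - center[0]) <= visible_half
                and abs(point[1] - center[1]) <= visible_half)

    def visible_when_space_expands(point, half_extent, zoom):
        """The response space grows with the content (the Figure 2c case), so
        the full coordinate range stays on screen; zoom does not hide content."""
        return abs(point[0]) <= half_extent and abs(point[1]) <= half_extent

    # The point (23, 14) on a grid running from -25 to +25, magnified 3x.
    print(visible_when_zoomed_in_place((23, 14), 25, zoom=3))  # False: pushed out of view
    print(visible_when_space_expands((23, 14), 25, zoom=3))    # True: relationships preserved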
Table 1. Guidelines for interpretation of component ratings

Fidelity | Usability | Accessibility | Overall Utility | Rationale
High | High | High | High | The skills required by the interaction are aligned with skills associated with the measured construct and the interaction is implemented in an efficient and accessible manner.
High | High or Moderate | High or Moderate | Moderate to High | The skills required by the interaction are aligned with the skills associated with the measured construct but the implementation may present moderate challenges for a sub-set of test takers.
High | Moderate or Low | Moderate or Low | Moderate | The skills required by the interaction are aligned with the skills associated with the measured construct but the implementation may present significant challenges for a sub-set of test takers.
High | Low | Low | Low | The skills required by the interaction are aligned with the skills associated with the measured construct but the implementation may present significant challenges for many test takers.
Moderate | High | High | High | The skills required by the interaction are moderately aligned with the skills associated with the measured construct and the interaction supports efficient and accessible response by all test takers. While fidelity is only moderate, all test takers can produce responses efficiently.
Low | High | High | Moderate to Low | The skills required by the interaction are not aligned with the skills associated with the measured construct but the interaction supports efficient and accessible response by all test takers. In cases where the information yielded by the interaction provides solid evidence for the measured construct, the utility would be moderate. As the strength of the evidence decreases, the utility decreases to low.
Moderate to Low | Moderate to Low | Moderate to Low | Low | The skills required by the interaction are moderately to poorly aligned with the skills associated with the measured construct and the implementation is inefficient and/or inaccessible, which presents challenges for some test takers.
4.3.3 Accessibility for the Blind
This accessibility sub-component focuses on the extent
to which the implementation of the interaction space
provides supports that allow test takers who are blind
to produce a response. Because test takers who are blind
cannot view content displayed on a screen, there are
three design factors that influence their ability to produce
responses:
• The extent to which the implementation supports navigation among content (this factor is similar to the TAB-ENTER navigation factor for test takers with motor skill needs).
• The clarity with which navigation and response actions performed by the test-taker are described auditorily so that the test-taker understands and can confirm that the desired action occurred (e.g., the implementation states what content the test-taker has "tabbed" to or states the container into which an object was placed).
• The extent to which the methods employed to support navigation and to provide confirmation of actions do not interfere with the measured construct (e.g., if the test-taker is required to produce a line with a negative slope and the confirmation provided by a line drawing tool states the intercept and slope of the line produced, the confirmation method interferes with the measured construct by stating the slope of the line produced by the test-taker).
These three factors are considered together when
examining the accessibility of the implementation of a
response interaction for test takers who are blind.
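As a hedged illustration of the third factor above, the sketch below contrasts a confirmation message that reveals construct-relevant information with one that does not; the message wording and function names are hypothetical.

    # Hypothetical sketch: two ways a line-drawing tool might confirm an action
    # auditorily for a test taker who is blind.
    def confirm_revealing(p1, p2):
        """Interferes with a slope-related construct: it announces the slope."""
        slope = (p2[1] - p1[1]) / (p2[0] - p1[0])
        return f"Line drawn with slope {slope:.2f}."

    def confirm_neutral(p1, p2):
        """Confirms the action without disclosing construct-relevant details."""
        return f"Line drawn between your two selected points, {p1} and {p2}."

    print(confirm_revealing((0, 3), (2, -1)))  # "Line drawn with slope -2.00."
    print(confirm_neutral((0, 3), (2, -1)))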
It is important to note that these three sub-components of accessibility (i.e., Motor Skill, Low Vision, and
Blind accessibility) should be examined independently.
For this reason, it is possible for an interaction to be
viewed as high on motor skill accessibility but low on low
vision and blind accessibility. Alternately, an item may be
strong for blind accessibility, moderate for motor skill,
and low for low vision accessibility.
5. Using the Technology-Enhanced Item Utility Framework
The Technology-Enhanced Item Utility Framework is
designed to help test developers and testing programs
consider the extent to which the use of a given response
interaction space is both appropriate for the construct
measured by an item and implemented in a manner
that allows test-takers to accurately and efficiently produce responses that reflect the product of their cognitive
processes. While each of these factors is examined individually, they are considered collectively to evaluate the
utility of the interaction. Table 1 is designed to help interpret the component ratings to make decisions about the utility of an
interaction.
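One way to make the guidelines in Table 1 operational is to encode them as a simple lookup, as sketched below. The rating scale and combination rules loosely follow the table but are deliberately simplified; as discussed throughout this section, the judgment remains holistic in practice.

    # Illustrative sketch: combining component ratings into an overall utility
    # judgment, loosely following the guidelines in Table 1. Ratings are coarse
    # ("high", "moderate", "low") and the rules are a simplification of the table.
    def overall_utility(fidelity, usability, accessibility):
        # The weaker of usability and accessibility drives the implementation rating.
        implementation = min(usability, accessibility,
                             key=["low", "moderate", "high"].index)
        if fidelity == "high":
            if implementation == "high":
                return "high"
            if implementation == "moderate":
                return "moderate"
            return "low"
        if fidelity == "moderate":
            return "high" if implementation == "high" else "low"
        # Low fidelity: utility depends on whether the evidence still informs the
        # construct; at best moderate, and low when the implementation is weak.
        return "moderate to low" if implementation == "high" else "low"

    print(overall_utility("high", "moderate", "high"))  # moderate
    print(overall_utility("low", "high", "high"))       # moderate to low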
When evaluating the utility of an interaction, it is
important to recognize that the purpose of an interaction
space is to create a context in which the targeted construct
is applied as the student produces a response using the
response methods provided by the interaction space. For
this reason, the most important factor affecting the utility of an interaction space is the alignment between the
context created by the interaction space and the way(s) in
which the targeted construct is authentically applied in a
non-testing situation. When the interaction space creates
a context that is authentic in terms of how the measured
construct is typically applied, the interaction may be said
to have utility.
However, the level of utility is influenced by how the
interaction is implemented within a test delivery system.
Specifically, if the interaction is implemented in a way that
allows test-takers to efficiently produce responses and
provides adequate accessibility for test takers with special
needs, then the utility of the interaction is maximized.
In this way, strong alignment must be coupled with high
levels of usability and accessibility to maximize utility.
In contrast, if the implementation of a strongly aligned
interaction is inefficient or difficult to access for some or
many test takers, its utility is diminished.
As shown in Table 1, when fidelity is moderate or low,
two additional factors must be considered prior to making definitive decisions about the utility of the interaction.
First, its usability and accessibility must be examined.
Second, the extent to which the interaction allows test-takers to produce evidence that can be used to inform the
measure of the construct should be considered.
In cases where fidelity is moderate or low, and usability and accessibility are low, the interaction space should
be interpreted as also having poor utility. However, when
usability and accessibility are strong, an interaction space
of low fidelity may still have moderate utility if the evidence provided by the interaction can serve as a measure
of the construct. In effect, these cases result in information that can be provided accurately and efficiently by a
broad range of test-takers and used to make inferences
about the measured construct, even though the directness of the inference is diminished. Although the skills
required to produce a response and the context created
via the interaction are unrelated to the measured construct, the resulting information provides appropriate
evidence about the measured construct.
As an example, consider an interaction that requires
test-takers to drag-and-drop sentences into an order that
reflects the plot of a given story. The skills required by the
interaction (ability to select, drag, and position content)
are unrelated to reading comprehension. In addition,
reordering sentences is not an authentic context in
which test takers typically apply understanding of order
of events outside of a testing situation. Yet, the evidence
provided by the interaction (i.e., the order of sentences
describing events from the story) supports an inference
about the test-taker's comprehension of the events of the
story. In cases where the interaction is implemented in an
efficient and accessible manner, the utility of that interaction for measuring reading comprehension might be
deemed moderate or adequate.
When using the framework to evaluate the utility of
a technology-enhanced item, there are two additional
considerations to keep in mind: a) the non-comparative
aspects of the framework and b) examining utility in the
context of measurement. Each of these considerations is
discussed separately below.
5.1 Non-Comparative Aspects of the
Framework
It is important to note that the intent of the Construct
Fidelity component is not to compare the fidelity of the
employed interaction with other potential interactions.
Rather, the focus is limited to a judgment about the
extent to which the context produced through the interaction space is authentic and the skills required for the
interaction align with the skills associated with how the
measured construct is typically applied in a real-world
context. For this reason, it is possible that more than one
interaction space can create an authentic context in which
the construct is typically applied and/or could employ
methods that overlap with how the measured construct is
typically applied in an authentic context and receive high
ratings for construct alignment.
Similarly, when evaluating efficiency and accessibility, the intent is not to compare a specific implementation
with other implementations. Rather, the focus is on the
specific implementation and whether it provides an efficient and accessible approach to response production. For
this reason, it is possible that several different implementations may be rated highly.
That being said, it is also important to note that in
order to evaluate a given implementation, it is necessary to have a solid understanding of usability design
principles, accessibility design principles, and the functionality of common assistive technology devices. While
the intent of bringing this knowledge to bear when examining efficiency and accessibility is not to compare the
specific implementation with other possible implementations, it is necessary to understand what is possible and
what represents best practices when evaluating usability
and accessibility. As one example, in order to evaluate
the accessibility provided by a TAB-ENTER hierarchical
design, one needs to be familiar with how TAB-ENTER
navigation functions and the challenges that can arise from
a poor hierarchical design.
5.2 The Context of Measurement
The Technology-Enhanced Item Utility Framework is
designed to focus on the use and implementation of a
given interaction to measure a targeted construct for four
sub-groups of test takers: a) test takers with motor skill
needs; b) test takers with visual needs; c) test takers who
are blind; and d) test takers who do not have special needs
associated with motor skills, visual impairments or blindness and are expected to use typical keyboard, mouse
and/or finger movements to produce responses. As a
result, construct fidelity is evaluated in the context of the
measured construct while efficiency is considered in the
context of the actions likely to be performed by the subgroup of test takers who will use a keyboard, mouse, or
finger actions to produce a response. Additionally, accessibility must be evaluated in the context of the specific
needs of the sub-groups of test takers and the tools they
typically use to interact with a computer or digital device.
Finally, when component ratings are combined to make
an overall determination of utility, interactions with moderate or low construct fidelity ratings must be considered
in the context of the types of information yielded by the
interaction and the adequacy of using that information as
evidence for the measured construct. As a result, evaluating the utility of an interaction space requires one to
understand the construct measured by the item and the
sub-groups of test takers who are being assessed.
6. Discussion
From an economic perspective, it is clear that the new
interaction spaces employed by technology-enhanced
items have utility and an increasing number of assessment
programs are demanding development and administration of items that utilize them. However, it is important
to remember that the primary criterion for including any
item in an educational test is its ability to contribute accurate evidence with utility for informing the measure of
a targeted construct. While the idea of employing technology-enhanced items to modernize a testing program
or demonstrate that it is capitalizing on the powers of
technology is attractive, the use of new item interaction
spaces should not come at the cost of measurement value.
That is, if measurement value is degraded by the use of a
technology-enhanced interaction, that interaction should
not be employed.
In contrast, when measurement value is improved by
the use of a technology-enhanced interaction, it seems
logical to employ that interaction. However, this decision must be tempered by considering the additional cost
incurred by developing such items. The key question is
whether the increase in measurement utility outweighs
the additional financial cost that it brings. In cases where
utility value increases substantially while financial costs
are affected minimally, it is reasonable to conclude that
the interaction should be employed. But what should
be done when the item development costs for a given
interaction are more than triple the cost of a traditional
interaction type and the measurement value is only marginally affected? Clearly, this is a subjective decision. As
a guideline, it seems reasonable that a doubling of costs
might be acceptable for each one-step increase in utility
provided by a given interaction type. That is, when an
interaction increases utility from low to moderate compared to a traditional item interaction, a doubling of
costs is reasonable. And when utility increases from low
to high, tripling of costs seems acceptable. Developing
guidance on making cost-benefit decisions is an area in
need of further research. When conducting this research,
it will be important to recognize that the costs associated
with developing items that employ a given interaction will
likely decrease over time as test developers create more
efficient mechanisms for encoding item content and item
writers become more accustomed to developing items
that employ a given interaction model.
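The rule of thumb above can be stated compactly, as in the sketch below; the step scale and the acceptance rule are simply a literal reading of the guideline, not a validated decision procedure.

    # A literal encoding of the guideline: tolerate roughly one additional multiple
    # of item-development cost for each one-step gain in utility
    # (low -> moderate tolerates ~2x cost, low -> high tolerates ~3x cost).
    UTILITY_STEPS = {"low": 0, "moderate": 1, "high": 2}

    def cost_seems_justified(utility_without, utility_with, cost_ratio):
        gain = UTILITY_STEPS[utility_with] - UTILITY_STEPS[utility_without]
        return cost_ratio <= 1 + max(gain, 0)

    print(cost_seems_justified("low", "high", cost_ratio=3.0))      # True
    print(cost_seems_justified("low", "moderate", cost_ratio=3.5))  # False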
The Technology-Enhanced Item Utility Framework
presented here aims to help testing programs and test
developers maintain a focus on measurement value by
directing attention to the use of a given interaction space
to produce an authentic, usable, and accessible context in
which the targeted construct is applied by the test taker.
Through emphasis on these three factors, it is hoped that
technology-enhanced items will be employed increasingly to enhance measurement utility.
7. References
Davey, T. & Pitoniak, M. (2006). Designing computerized
adaptive tests. In S.M. Downing & T.M. Haladyna (Eds.)
Handbook of Test Development (pp. 543–574). New York,
NY: Routledge.
Drasgow, F. & Olson-Buchanan, J.B. (1999). Innovations
in Computerized Assessment. Mahwah, NJ: Lawrence
Erlbaum Associates, Inc.
Florida Department of Education (2010). Race to the Top
Assessment Program Application for New Grants.
Retrieved September 5, 2013 from http://www2.ed.gov/
programs/racetothetop-assessment/rtta2010parcc.pdf
Gove, P.B. (1986). Webster’s Third new International Dictionary
of the English Language Unabridged. Springfield, MA:
Merriam-Webster, Inc.
Haladyna, T.M. & Rodriguez, M.C. (2013). Developing and
Validating Test Items. New York, NY: Routledge.
IMS Global Learning Consortium. (2002). IMS Question and
Test Interoperability: An Overview Final Specification
Version 1.2. Retrieved June 25, 2015 from http://www.
imsglobal.org/question/qtiv1p2/imsqti_oviewv1p2.html.
IMS Global Learning Consortium. (2012). IMS Question and
Test Interoperability: An Overview Version 2.1 Final.
Retrieved June 25, 2015 from http://www.imsglobal.org/
question/qtiv2p1/imsqti_oviewv2p1.html.
Lane, S. & Stone, C.A. (2006). Performance assessment. In R.L.
Brennan (Ed.), Educational Measurement (4th ed., pp. 387-431). Westport, CT: American Psychological Association.
Marshall, A. (1920). Principles of Economics: An Introductory
Volume (8th Edition). London: Macmillan.
Measured Progress/ETS Collaborative. (2012). Smarter
Balanced Assessment Consortium: Technology Enhanced
Items. Retrieved June 23, 2015 from https://www.
measuredprogress.org/wp-content/uploads/2015/08/
SBAC-Technology-Enhanced-Items-Guidelines.pdf.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational
Measurement (3rd ed., pp. 13-103). New York: American
Council on Education/Macmillan.
Mislevy, R.J., Steinberg, L.S., & Almond, R.G. (2003). On
the Structure of Educational Assessments, CSE Technical
Report 597. Los Angeles, CA: Center for the Study of
Evaluation.
Russell, M. (2006). Technology and Assessment: The Tale
of Two Perspectives. Greenwich, CT: Information Age
Publishing.
Russell, M. (2011a). Accessible Test Design. In M. Russell
& M. Kavanaugh, Assessing Students in the Margin:
Challenges, Strategies, and Techniques. Charlotte, NC:
Information Age Publishing.
Russell, M. (2011b). Digital Test Delivery: Empowering
Accessible Test Design to Increase Test Validity for All
Students. A Monograph Commissioned by the Arbella
Advisors.
Scalise, K. & Gifford, B. (2006). Computer-Based Assessment in
E-Learning: A Framework for Constructing “Intermediate
Constraint” Questions and Tasks for Technology Platforms.
Journal of Technology, Learning, and Assessment, 4(6).
Retrieved June 23, 2015 from http://ejournals.bc.edu/ojs/
index.php/jtla/article/view/1653/1495.
Sireci, S.G. & Zenisky, A.L. (2006). Innovative item formats in
computer-based testing: in pursuit of improved construct
representation. In S.M. Downing & T.M. Haladyna (Eds.)
Handbook of Test Development (pp. 329-348). New York,
NY: Routledge.
Thompson, S. J., Johnstone, C. J., & Thurlow, M. L. (2002).
Universal design applied to large scale assessments
(Synthesis Report 44). Minneapolis, MN: University of
Minnesota, National Center on Educational Outcomes.
Retrieved June 24, 2015 from the World Wide Web: http://
education.umn.edu/NCEO/OnlinePubs/Synthesis44.html.
Washington State. (2010). Race to the Top Assessment Program
Application for New Grants. Retrieved September 5, 2013
from http://www2.ed.gov/programs/racetothetop-assessment/rtta2010smarterbalanced.pdf