
ISSN (Print) : XXXX-XXXX
ISSN (Online) : 2375-5636
Journal of Applied Testing Technology, Vol 15(1), 1–17, 2014
Comparison of Web-based and Face-to-face
Standard Setting using the Angoff Method
Irvin R. Katz* and Richard J. Tannenbaum
Educational Testing Service, MS 16-R, 660 Rosedale Road, Princeton,
NJ 08541; [email protected], [email protected]
Abstract
Web-based standard setting holds promise for reducing the travel and logistical inconveniences of traditional, face-to-face
standard setting meetings. However, because there are few published reports of setting standards via remote meeting
technology, little is known about the practical potential of the approach, including technical feasibility of implementing
common standard setting methods and whether such an approach presents threats to validity and reliability. In previous
work, we demonstrated the feasibility of implementing a modified Angoff methodology in a virtual environment (Katz,
Tannenbaum, & Kannan, 2009). This paper presents results from two studies in which we compare cutscores set through
face-to-face meetings and through the web-based approach on two operational tests, one of digital literacy and one of
French.
Keywords: Angoff Methodology, Distance-based Meetings, Standard Setting, Virtual Meetings
1. Introduction
In a world of 24x7 access to information, geographically dispersed teams and even whole companies,
tablet- or phone-based collaboration, crowdsourcing of
decision-making, and desktop screen sharing, a traditional standard-setting study seems almost quaint. Could
not the face-to-face standard-setting study be replaced by
a virtual meeting using current remote-meeting technologies? What are the issues involved and would such
meetings yield results comparable to face-to-face studies? This paper reports two experiments that compare
outcomes from face-to-face and virtual standard-setting
studies.
Standard setting involves comparatively small numbers of domain experts (often fewer than 15, especially
for licensure testing) meeting at a particular location to
define one or more performance standards, reviewing
a test in light of those standards, and recommending a
cutscore corresponding to each performance standard
(Tannenbaum & Katz, 2013). Standard-setting studies can
be both time- and cost-intensive, with domain experts
(e.g., practitioners) taking time away from their work and
with the attendant expenses of travel, accommodations,
food, meeting sites, etc. Holding a standard-setting study
virtually, without the need for travel, could significantly
reduce the costs of the meeting (Katz, Tannenbaum,
& Kannan, 2009; Zieky, 2001).
Beyond cost, assembling a panel of domain experts is a
non-trivial task: the need to travel has direct implications
for the representativeness of the standard-setting panel. The size and
representativeness of a standard-setting panel are important drivers of the meaningfulness and credibility of a
recommended cutscore (American Educational Research
Association, American Psychological Association,
& National Council on Measurement in Education, 1999;
Tannenbaum & Katz, 2013). Thus, traditional, face-to-face
studies are constrained to the domain experts available
to travel on the study days. By eliminating travel, a virtual standard-setting study may be more likely to include
panelists with critical perspectives and experiences who
might otherwise not have been able to participate.
Previous research has presented the potential value of
virtual standard setting (Lorié, 2011), and others had envisioned a rise in the use of virtual standard setting (Zieky,
2001); but, in fact, there are few published accounts of
virtual standard setting. Harvey and colleagues (Harvey,
2000; Harvey & Way, 1999) compared face-to-face and
distance-based standard setting, using a modified Angoff
and a variation of a Yes/No method; overall, the results
supported the reasonableness of using a distance-based
process. However, the technology at the time did not support the level of discourse and interactivity that may be
expected today. More recently, Katz et al. (2009) affirmed
the feasibility of applying a more dynamic web-based,
modified Angoff standard-setting approach.
There are also theoretical and empirical results from
the research on virtual teams in the business world suggesting that virtual standard-setting is a feasible option to
consider. A critical predictor of a long-term team’s success revolves around issues of trust among team members
(e.g., trust in the other members’ expertise; Espinosa,
Slaughter, Kraut, & Herbsleb, 2007). However, standard-setting panels are short-term teams for which trust might
not be as critical a factor (Katz et al., 2009). Another predictor of virtual team success is the amount of “coupling”,
or inter-dependence, among team members’ activities: the
more interdependent the activities, the more challenging
for a virtual environment (Olson & Olson, 2000). Several
activities in a standard-setting study (becoming familiar
with the standard-setting process, the performance standards, and the test; making standard-setting judgments
on each test item) are loosely coupled activities, occurring independently of the work of other panelists, and are
therefore more amenable to the virtual environment.
The potential benefits of a virtual standard setting
approach notwithstanding, our previous research left
some key validity questions unanswered. For example,
would our virtual application yield results comparable
to those obtained from a face-to-face application? Does
the virtual standard setting approach introduce irrelevant
variance into cutscore judgments? Does the virtual meeting context negatively impact panel dynamics? In addition
to content expertise, will panelists also need technology
skills beyond what may reasonably be expected of most
practitioners and educators? The current research directly
investigates whether the outcomes of a standard setting
(e.g., cutscores, variability in panelist judgments, and
panelist confidence in their judgments and the overall
panel recommendation) differ between the face-to-face
and web-based (virtual) approaches. Our research focuses
on whether the outcomes from independent face-to-face panels for the same test differ to a lesser extent than
do outcomes from face-to-face and web-based panels
for these same tests. This approach is consistent with
other research that has compared alternative standard-setting methodologies—mostly face-to-face—to investigate validity threats (e.g., Buckendahl, Smith, Impara, &
Plake, 2002; Davis, Buckendahl, Chin, & Gerrow, 2008;
Olsen & Smith, 2008; Reckase, 2006).
2. Research Question and Study Approach
The overarching question to be addressed by this study is
whether a standard setting meeting conducted via the web
results in outcomes (cutscores, variability in judgments,
and validity evidence) similar to a face-to-face meeting.
Standard setting typically involves a relatively small number of panelists, so differences between face-to-face and
web-based outcomes might be due to characteristics of
the particular panelists. The constraints of consequential,
operational standard setting often preclude a counterbalanced, within-subject design. However, we identified
two operational tests (digital literacy and French) that
each included two independent face-to-face meetings,
providing a naturally occurring replication. An additional
web-based panel for these assessments offers a unique
opportunity to consider if a virtual environment introduces significant irrelevant variance. If the web-based
environment introduces irrelevant variance, then we
would expect to observe larger differences in the outcomes
between the web-based panel and either of the face-to-face
panels. However, we hypothesize that the web-based and
face-to-face meeting outcomes will differ no more than do
the outcomes between the two face-to-face meetings.
For each assessment, the face-to-face panels met
before the virtual panel. The digital literacy panels met
face-to-face in early October, 2009 and the virtual panel
met in mid-November, 2009. The French panels met
face-to-face in mid-July and the virtual panel met in
mid-September, 2009. Each panel—virtual and both face-to-face—comprised distinct panelists. However, all three
panels were selected from the same group of nominated
panelists, each expert in the domain of the assessment
(digital literacy or French).
3. Data Analyses
The approach to the data analyses follows that of studies that compare standard setting methodologies, such as
comparisons of the Bookmark and Angoff methods (e.g.,
Buckendahl et al., 2002; Davis et al., 2008). Analyses are
descriptive, rather than inferential, owing to the small
number of panelists (10 to 15) often associated with
operational standard setting studies. This sample size
restriction is not unusual even in the standard setting
research literature; for example, Buckendahl et al.'s (2002)
comparison of Angoff and Bookmark approaches used
10 panelists per condition. Comparative studies with
larger Ns typically involve students acting as subject matter experts, which is unfeasible for an operational testing
program.
Our measures address not only the traditional
outcomes of standard setting, such as panelists’ item judgments, but measures of validity as well. Threats to validity
include lack of panelist engagement or participation,
panelists’ misunderstanding the judgment task, and panelists not basing judgments on the definition of the Just
Qualified Candidate (JQC). These threats exist whether
the meeting is conducted via the web or face-to-face.
Other threats that more likely affect the virtual environment include not being able to access web sites, not being
able to hear or see other panelists during group discussions, and other technological failures.
Kane (1994) outlines three types of evidence for the
validity of cutscores: procedural, internal, and external.
Procedural evidence refers to information showing that
the standard setting study was implemented in a way that
leads to reasonable outcomes. Internal evidence refers
to information demonstrating that the judgments made
by panelists appear consistent with each other and that
any disagreements are due to professional judgment (i.e.,
unlikely due to irrelevant factors). External evidence
refers to the consistency of the cutscore with measures
outside of the specific panel. In the current study, specific
measures to compare between the face-to-face meetings
and between each face-to-face meeting and the web-based meeting include:
• Agreement with statements regarding implementation
quality (procedural validity evidence). In a typical standard setting meeting, panelists self-report on aspects of
the meeting that provide validity evidence on how well
the meeting was conducted and the overall satisfaction with or acceptance of the recommended cutscore
(Tannenbaum & Katz, 2013; Zieky, Perie, & Livingston,
2008). For example, panelists report on whether they
understood the instructions, on the factors that influenced their judgments, whether they felt confident in
their ability to make standard setting judgments, and
whether the recommended cutscore is “about right,” “too
low,” or “too high.”
• Variability in judgments (internal validity evidence). Perhaps as important as the specific cutscore
is the amount of variability across panelists’ judgments. Standard-setting studies are often conducted
through several rounds of judgments, with discussion
and feedback between the rounds. Variability typically
is greatest in Round 1, where judgments are independent. Variance tends to decrease over rounds owing
to the intervening group discussions (Tannenbaum &
Katz, 2013). If the level and quality of discussion in the
web-based approach parallels that which occurs in a
face-to-face meeting, we would expect similar reductions in judgment variability from round to round.
In addition, the variability of judgments in Round 1
(before feedback and discussion occur) provides a comparison as to whether there are any inherent differences
in judgments being made between web-based and face-to-face environments.
• Cutscores (external validity evidence). The absolute
cutscore reached at the end of a standard setting meeting
is often compared between standard setting methodologies. While the cutscore reached is a matter of informed
judgment (Nichols, Twing, Mueller, & O’Malley, 2010;
Zieky, 2001) and so has no “correct” answer, we expect
similar methodologies, with similar panels, to yield
similar cutscores (Tannenbaum & Kannan, in press;
Tannenbaum & Katz, 2008). If the web-based and face-to-face methods are comparable, the panelists should reach
similar cutscore decisions (see the illustrative sketch below).
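To make these measures concrete, the following sketch shows how the internal and external evidence described above might be computed from a matrix of panelists' Angoff item judgments. It is an illustrative Python sketch under simple assumptions (the panel cutscore is taken as the mean of panelists' summed item ratings); the simulated ratings and variable names are hypothetical, not the studies' operational analysis code.

    import numpy as np

    def round_summary(ratings):
        # ratings: shape (n_panelists, n_items); each cell is the judged mean
        # score a Just Qualified Candidate would earn on that item (0 to 1).
        per_panelist = ratings.sum(axis=1)          # each panelist's implied cutscore
        return per_panelist.mean(), per_panelist.std(ddof=1)

    rng = np.random.default_rng(0)
    round1 = rng.uniform(0.4, 0.9, size=(10, 50))   # hypothetical 10 panelists x 50 items
    # Simulate convergence after discussion: panelists move halfway toward the panel's item means.
    round3 = round1 + 0.5 * (round1.mean(axis=0) - round1)

    for label, judgments in (("Round 1", round1), ("Round 3", round3)):
        cutscore, sd = round_summary(judgments)
        print(f"{label}: cutscore = {cutscore:.1f} of 50, SD across panelists = {sd:.1f}")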
4. Virtual Standard Setting
Approach
The virtual standard setting approach used a combination of Microsoft Live Meeting 2007, conference calling,
webcasts (self-running presentations delivered on-line),
web surveys, and web pages to create an environment for
panelists to participate in a standard setting study from
their home or office computers. Our choice of technology
was, in part, driven by our explicit goal of using readily
available, off-the-shelf technology and software. One of
our objectives was to compare face-to-face methods with
virtual methods without the need to build new platforms
or delivery modes. If virtual standard setting is to be a
viable alternative to traditional face-to-face approaches,
then it should be adaptable to existing, relatively common
technologies.
In creating the virtual environment, panelists view
introductory materials via web casts (i.e., visiting a web
site to watch a presentation video), receive background
information via presentations given with Live Meeting
(remote meeting technology), participate in group
discussions via conference calls, fill out readiness-to-proceed and final evaluation surveys on-line, and enter
their Angoff-based standard setting judgments via a web
page that shows the test items, the definition of the JQC,
and a data entry area for entering judgments (Figure 1).
The full approach is summarized in Table 1. Figure 1
and Table 1 refer to the previous Katz et al. (2009) virtual standard setting, but serve as useful illustrations.
Both the digital literacy and French studies followed this
approach generally, although some adaptations for the
specifics of each test were necessary, as outlined later in
this paper.
5. Study 1: Digital Literacy
5.1 About the Test
The first study recommended a cutscore for a digital literacy certification test, which is performance-based,
computer-delivered, and automatically scored. The test
consists of 14 performance tasks, each of which produces
2-5 scores, totaling 50 scores for the test. For convenience,
these scores will be referred to as “items.” Each response
to an item may earn full-credit (1), partial credit (.5), or
no credit (0). However, the definition of full-credit, partial credit, and no credit is different for each of the 50
items on the test. Thus, to evaluate the difficulty of an
item, panelists had to see the actual task (screen snapshots
and sample full-credit responses) along with the scoring
rubric for the item.
5.2 Method
5.2.1 Panelists
The face-to-face panelists were 23 digital literacy
experts recruited from across the U.S. and from three
countries. Panelists were nominated by an international advisory council. A split-panel replication was implemented, as described in the procedures.
Figure 1. Website for modified Angoff judgments, showing an example test (top left), JQC definition (top right), and survey for entering item ratings (bottom) (Katz et al., 2009).
Table 1. Virtual standard setting approach (adapted from Katz et al., 2009)
Common elements of standard setting and the corresponding virtual standard setting approach:

Pre-meeting work: Panelists learn the basic objectives of standard setting, including their roles and responsibilities in the meeting. Pre-meeting work also includes panelists accessing two websites to assure they can view web casts and access the web conferencing software. The performance level description(s) to be used in the meeting are sent to panelists for early review.

Overview of meeting: Introductions, meeting overview, and more detailed technology tryouts are covered during an introductory 30-min session held approximately one week prior to the first standard setting session. The extra week provides time to address any technical issues not solvable during the introductory meeting.

Overview of standard setting: Panelists view an eleven minute webcast between the "Overview of Meeting" and the first standard setting session. This material is also reviewed during the first standard setting session.

Training and practice in standard setting methods: Panelists view a second eleven minute webcast between the "Overview of Meeting" session and the first standard setting session. The training is reinforced during the first standard setting session.

Panelists take the test: Secured PDF (no printing, no copying, expiration date) of the test sent as two files (without answers and with answers) to panelists during the introductory session. They take the test and self-score their responses sometime in the week between the "Overview of Meeting" and the first standard setting session. In a face-to-face meeting, panelists typically take the test and self-score during the first day of the meeting.

Discussion of the performance standard, the Just-Qualified Candidate (JQC): As in face-to-face, a description of the just qualified candidate (JQC) is discussed via teleconference during the first standard setting session. Facilitators work with panelists to create this definition based on the test content specifications. During discussion, the JQC definition is written and edited (more operationally defined) via a shared application, which is similar to face-to-face in which the edited JQC definition is often projected. After the discussion, the final JQC definition is electronically distributed to each panelist for printing.

Initial survey (training evaluation) and verification of readiness to proceed: Web survey delivered through the web conferencing software, using web survey software. Results are immediately available via a web link (for facilitators only, although could be shared).

Round 1 judgments: Panelists enter their ratings via a custom web site accessed through the web conferencing software. The web site was designed to keep visible the information panelists need to make their judgments: the test (with answers), JQC definition, and web survey form for entering Angoff ratings (created with web survey software).

Round 1 discussion: Facilitators enter the ratings data into a spreadsheet, which facilitators share via the web conferencing software. The spreadsheet, identical to that used in face-to-face meetings, summarizes judgments, highlights discrepancies among panelists, and shows item difficulty data. Panelists then share their rationales for their item-level judgments via the teleconference.

Round 2 judgments: Panelists enter their ratings via the same web site as Round 1. Web survey software pre-enters each panelist's data, so panelists see their earlier responses and re-judge only the items they want to modify.

Round 2 discussion: Round 2 follows an approach similar to the Round 1 discussion.

Final survey: Panelists complete a web survey delivered through the web conferencing software. Results are immediately available via a web link (for facilitators only, although could be shared).
The virtual panelists were 10 digital literacy experts
who were recruited from the same pool of nominated
panelists as in the face-to-face study. The virtual panel
completed its work independently of the face-to-face
panel.
Table 2 shows some of the characteristics of the three
standard setting panels. The face-to-face groups were
designed to be reasonably equivalent in terms of panelist
demographics and background. Similar to the face-to-face panels, the virtual panelists were predominantly
white (60% vs. 65%) and came from many regions of the
U.S. However, the virtual group contained a larger number of female panelists (8 vs. 5), no international panelists,
and no corporate trainers.
Table 2. Characteristics of panelists

                                        Face-to-face    Face-to-face    Virtual
                                        Group A         Group B         Group
                                        (n = 12)        (n = 11)        (n = 10)
Gender
  Female                                     5               5              8
  Male                                       7               6              2
Race/ethnicity
  Asian American                             2               1              0
  African American                           2               2              2
  Hispanic                                   0               1              2
  White (non-Hispanic)                       8               7              6
Expertise type
  Corporate Training                         2               2              0
  Workforce Development Training             2               3              1
  College or university instruction          6               4              7
  K-12 teacher training                      2               2              2
Institution/Organization location
  Northeast                                  4               3              2
  Midwest                                    1               2              2
  South                                      2               1              3
  Southwest                                  1               1              0
  West                                       2               2              3
  International                              2               2              0
5.2.2 Procedure (Face-to-Face)
Before the standard setting meeting, panelists were asked
to complete two assignments intended to familiarize them
with the materials to be used during the meeting. First,
panelists were given access to the computer-delivered
test, which could be reached via a website, so that they
could become familiar with the knowledge and skills elicited
by the test. Second, panelists were sent a draft definition
of the JQC constructed by a similarly composed panel of
experts from a different standard-setting study evaluating digital literacy (Tannenbaum & Katz, 2008). Panelists
were asked to review the JQC definition and to write a
few performance indicators (behavioral examples) to
help them internalize the definition. The purpose of these
pre-meeting assignments was to begin the process of
the panelists coming to a shared understanding of what
should be expected of a Just Qualified Candidate.
The panel was then split into two independent subpanels that completed the remaining work (assignment to
each subpanel was random, although attempts were made
to have similar background characteristics represented
on both panels). Each subpanel was guided by a different
facilitator. Although the groups worked largely independently on the ratings, note that they were calibrated
together on the nature of the test, the JQC definition, and
the practice ratings.
5.2.3 Procedure (Virtual)
While the general activities involved in the standard setting meeting did not change between face-to-face and
virtual environments, the virtual environment necessitated some adaptations, and also provided some
opportunities for time-savings. As with the face-to-face
panels, the web-based panelists were asked to complete
two pre-meeting assignments: one was to take the test
(using the same system as was used for the face-to-face
panelists) and the other to review the JQC definition constructed
by the face-to-face group. In addition, the virtual panel
viewed an online presentation that provided background
to the standard setting process, which lessened the time
needed to cover this material during the meeting. All panelist interactions were via the audio conference call; no
web video was used. Pilot testing of preliminary versions
of the virtual standard setting included audio-only and
audio with video capabilities. The results did not differ,
and pilot panelists thought the video was not necessary
for the standard-setting application.
Because the JQC definition is the narrative equivalent of the cutscore, if the two groups (face-to-face and
virtual) worked from different JQC definitions, it is likely
that their standard setting judgments and cutscore recommendations would differ. Therefore, to determine if
the “venue” of the standard setting impacts judgments
and outcomes, we needed to maintain comparable performance expectations, as defined by the JQC definition.
To better understand the JQC definition constructed by the face-to-face panel, the virtual panelists developed
several performance indicators for each section of the
JQC. The indicators reflected their understanding of the
ways that test takers might exhibit the knowledge and
skills expected of a Just Qualified Candidate. Research
by Tannenbaum and Kannan (in press) suggests that two
panels applying the same standard-setting method to the
same test will likely recommend comparable cutscores,
regardless of whether one panel constructed the JQC definition and the other simply fleshed out that same definition without changing its fundamental meaning.
The virtual standard setting study was spread over five
2.5-hour sessions, as described by Katz et al. (2009), to
minimize fatigue. The activities of the virtual standard
setting otherwise paralleled those of the face-to-face study
(general introduction, discussion of test, discussion of the
JQC definition, training and practice, and three rounds of
standard-setting judgments) as outlined in Table 1.
5.2.4 Standard Setting Judgments
For both face-to-face and virtual panels, during Round
1, the facilitator guided the subpanel through a description of each item rubric, discussing the characteristics
of responses (and processes in some cases) that lead to
full credit, partial credit, or no credit on each item. The
face-to-face panelists worked with two binders: one containing the items and rubrics associated with each task,
and the other containing benchmarks (exemplars) of full-credit responses to each task. The virtual panelists viewed
the benchmarks and item rubrics on their computer (the
benchmark and corresponding item rubric were projected simultaneously on a split screen).
For each task, the facilitator reminded panelists about
the goals of the task and the portions of the JQC definition
that were likely most relevant (the JQC definition comprised sections that corresponded to distinct task types
in the assessment). The facilitator then walked the panelists through the benchmark for the task and the scoring
rubrics for the first item of that task. After describing the
item, the facilitator asked panelists to enter their standard
setting rating for that item onto the scan form (face-to-face) or recording sheet (virtual). Panelists were asked to
estimate the mean (average) score that 100 Just Qualified
Candidates would earn on this item. This judgment task
allowed the use of a rating scale similar to that of a typical modified Angoff approach, as used in the next study. The
rating scale ranged from 0 to 1, in .05 increments. The
next item within the task was then described. This continued for all of the items in the task. This process was
followed for all 14 tasks.
After finishing their ratings for all tasks, the scan forms
of the face-to-face panels were collected and scanned in.
At the corresponding time in the virtual meeting, the panelists entered their recorded ratings into an online survey.
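As an illustration of the arithmetic behind a single judgment on this scale, the hedged sketch below maps a judged distribution of 100 Just Qualified Candidates onto the 0-1, .05-increment rating; the function name and the example proportions are hypothetical.

    def jqc_item_rating(n_full, n_partial, n_none):
        # Judged numbers (out of 100 Just Qualified Candidates) earning
        # full credit (1), partial credit (0.5), and no credit (0) on the item.
        assert n_full + n_partial + n_none == 100
        mean_score = (1.0 * n_full + 0.5 * n_partial) / 100
        return round(mean_score / 0.05) * 0.05      # snap to the .05-increment rating scale

    # For example, 60 full-credit, 30 partial-credit, and 10 no-credit candidates
    # imply a rating of 0.75 for the item.
    print(jqc_item_rating(60, 30, 10))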
5.3 Results
As discussed earlier, Kane (1994) outlines three types of
validity evidence for evaluating cutscores: procedural,
internal, and external. As will be shown below, the three
panels provided similar evidence of procedural and internal validity. However, the overall cutscore of the virtual
group was somewhat lower than that of the two face-to-face groups.
5.3.1 Procedural Validity Evidence
Evidence that the standard setting methodology was
carried out correctly typically comes from panelist assertions about the process and its outcomes (Cizek, Bunch,
& Koons, 2004; Tannenbaum & Katz, 2013). All three
groups gave similar ratings of their respective meetings
and outcomes. All three panels reported understanding
the standard setting process (Table 3) and everyone in
all panels agreed that the consensus cutscore was “about
right.” Panelists in all groups reported that their standard setting judgments were influenced by similar study
materials, such as the JQC definition, the between-round
discussions, and the skills needed to solve each test item
(Table 4). In addition, all three panels reported similarly
positive evaluations of the meeting process, rating it as
“understandable,” “satisfying,” and “fair,” among other
measures (Table 5). Thus, it appears that the panels had
similar confidence in how the meeting was conducted;
nothing in their opinions suggests that the implementation of the standard setting method during the
face-to-face or the virtual meetings was done incorrectly
and so introduced error into the process.
Table 3. Overall evaluations
Please indicate below the degree to which you agree with each of the following statements.

I understood the purpose of this study.
  Face-to-Face: SA 20 (87%), A 3 (13%), D 0, SD 0 | Virtual: SA 8 (80%), A 2 (20%), D 0, SD 0
The instructions and explanations provided by the facilitators were clear.
  Face-to-Face: SA 21 (91%), A 2 (9%), D 0, SD 0 | Virtual: SA 9 (90%), A 1 (10%), D 0, SD 0
The training in the standard setting methods was adequate to give me the information I needed to complete my assignment.
  Face-to-Face: SA 19 (83%), A 4 (17%), D 0, SD 0 | Virtual: SA 6 (60%), A 4 (40%), D 0, SD 0
The explanation of how the recommended cutscore is computed was clear.
  Face-to-Face: SA 20 (87%), A 3 (13%), D 0, SD 0 | Virtual: SA 8 (80%), A 2 (20%), D 0, SD 0
The opportunity for feedback and discussion between rounds was helpful.
  Face-to-Face: SA 22 (96%), A 1 (4%), D 0, SD 0 | Virtual: SA 8 (80%), A 2 (20%), D 0, SD 0
The inclusion of the item data was helpful.
  Face-to-Face: SA 20 (87%), A 2 (9%), D 1 (4%), SD 0 | Virtual: SA 9 (90%), A 1 (10%), D 0, SD 0
The inclusion of the classification percentages was helpful.
  Face-to-Face: SA 15 (65%), A 7 (30%), D 1 (4%), SD 0 | Virtual: SA 9 (90%), A 1 (10%), D 0, SD 0
The process of making the standard setting judgments was easy to follow.
  Face-to-Face: SA 16 (70%), A 7 (30%), D 0, SD 0 | Virtual: SA 6 (60%), A 4 (40%), D 0, SD 0
Note. SA = Strongly Agree, A = Agree, D = Disagree, SD = Strongly Disagree
Table 4. Influence of study materials on judgments
How influential was each of the following factors in guiding your standard setting judgments?

The definition of the Just Qualified Candidate
  Face-to-Face: VI 21 (91%), I 2 (9%), NI 0, NR 0 | Virtual: VI 10 (100%), I 0, NI 0, NR 0
The between-round discussions
  Face-to-Face: VI 18 (78%), I 4 (17%), NI 1 (4%), NR 0 | Virtual: VI 8 (80%), I 2 (20%), NI 0, NR 0
The knowledge/skills required to answer each test question
  Face-to-Face: VI 17 (74%), I 6 (26%), NI 0, NR 0 | Virtual: VI 9 (90%), I 1 (10%), NI 0, NR 0
The cutscores of other panel members
  Face-to-Face: VI 2 (9%), I 16 (70%), NI 5 (22%), NR 0 | Virtual: VI 7 (70%), I 3 (30%), NI 0, NR 0
The item-level data
  Face-to-Face: VI 11 (48%), I 11 (48%), NI 1 (4%), NR 0 | Virtual: VI 4 (40%), I 6 (60%), NI 0, NR 0
The classification percentages
  Face-to-Face: VI 5 (22%), I 15 (65%), NI 2 (9%), NR 1 (4%) | Virtual: VI 2 (20%), I 8 (80%), NI 0, NR 0
My own professional experience
  Face-to-Face: VI 11 (48%), I 12 (52%), NI 0, NR 0 | Virtual: VI 8 (80%), I 2 (20%), NI 0, NR 0
Note. VI = Very Influential, I = Influential, NI = Not Influential, NR = No Response
5.3.2 Internal Validity Evidence
Were panelists consistent in their ratings, suggesting a
similar understanding of the difficulty of the test tasks and
items, JQC, and standard setting methodology? While
some amount of professional disagreement is expected,
we generally expect that the standard deviation of the
ratings should decrease from Round 1 to Round 3 owing to
the calibrating effect of group discussions.
Table 6 shows the Rounds 1-3 ratings of each panel, including the panel means and standard deviations. As expected, all panels showed greater consistency (lower SDs) as the studies proceeded, although the virtual panel tended to agree more throughout the rounds compared to the face-to-face panels, having lower SDs throughout the process.
Table 5. Evaluation of meeting process
How would you describe the meeting process? Please rate the meeting on each of the five scales shown below.

Inefficient (1) – Efficient (5)
  Face-to-Face: 1: 0, 2: 0, 3: 1 (4%), 4: 2 (9%), 5: 20 (87%) | Virtual: 1: 0, 2: 0, 3: 0, 4: 3 (30%), 5: 7 (70%)
Uncoordinated (1) – Coordinated (5)
  Face-to-Face: 1: 0, 2: 0, 3: 1 (4%), 4: 1 (4%), 5: 21 (91%) | Virtual: 1: 0, 2: 0, 3: 0, 4: 1 (10%), 5: 9 (90%)
Unfair (1) – Fair (5)
  Face-to-Face: 1: 0, 2: 0, 3: 0, 4: 2 (9%), 5: 21 (91%) | Virtual: 1: 0, 2: 0, 3: 0, 4: 0, 5: 10 (100%)
Confusing (1) – Understandable (5)
  Face-to-Face: 1: 0, 2: 0, 3: 0, 4: 4 (17%), 5: 19 (83%) | Virtual: 1: 0, 2: 0, 3: 1 (10%), 4: 2 (20%), 5: 7 (70%)
Dissatisfying (1) – Satisfying (5)
  Face-to-Face: 1: 0, 2: 0, 3: 0, 4: 2 (9%), 5: 21 (91%) | Virtual: 1: 0, 2: 0, 3: 1 (10%), 4: 3 (30%), 5: 6 (60%)
Table 6. Mean cutscores (SDs) across rounds (maximum of 50 points)

           Round 1      Round 2      Round 3
Group A    33.8 (3.8)   33.4 (3.4)   33.4 (3.4)
Group B    32.6 (3.7)   32.3 (3.1)   32.3 (3.0)
Virtual    29.8 (3.2)   29.2 (2.1)   29.1 (2.0)

Table 7. Intercorrelations of item-level ratings

             Item data   Group A   Group B   Virtual
Item data        -         .82       .80       .80
Group A                     -        .80       .79
Group B                               -        .90
Virtual                                         -
In addition, the three panels were similarly consistent
in the relative rankings of item difficulties for the JQC.
Some researchers (e.g., Clauser, Mee, Baldwin, Margolis,
& Dillon, 2009; Kane, 1994) have used correlation with
actual item difficulties to demonstrate the quality of standard-setting ratings. While the absolute difficulty level of
an item for the JQC would differ from that of the overall
population (as the population may be more heterogeneous
in its digital literacy skills than the JQC), the relative difficulty of the items should be fairly consistent. Table 7 shows
the inter-correlations of the item ratings (Round 3) among
the three panels and empirical item difficulty data. Again,
all three panels showed similar levels of consistency both
with each other and with the observed item difficulties.
These internal validity results again suggest that the
three panels implemented the standard setting methodology appropriately and consistently throughout the
standard setting meetings.
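The intercorrelations in Table 7 can be computed in the same descriptive spirit; the sketch below correlates item-level Round 3 ratings across panels and against empirical item difficulties, using simulated vectors in place of the study data.

    import numpy as np

    rng = np.random.default_rng(1)
    item_difficulty = rng.uniform(0.3, 0.9, size=50)   # observed mean item scores (simulated)
    ratings = {
        "Group A": np.clip(item_difficulty + rng.normal(0, 0.08, 50), 0, 1),
        "Group B": np.clip(item_difficulty + rng.normal(0, 0.08, 50), 0, 1),
        "Virtual": np.clip(item_difficulty + rng.normal(0, 0.08, 50), 0, 1),
    }

    series = {"Item data": item_difficulty, **ratings}
    names = list(series)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = np.corrcoef(series[a], series[b])[0, 1]   # Pearson r between item-level vectors
            print(f"{a} vs. {b}: r = {r:.2f}")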
5.3.3 External Validity Evidence
The next question is the extent to which the face-to-face
and virtual standard setting meetings resulted in similar
outcomes. While we have evidence that the standard setting methods were implemented by the facilitators and
the panelists consistently, there is still the open question
of whether the three panels reached similar results. Did
they end up with similar cutscores? Does the cutscore
methodology replicate to a new panel and does the
“venue” matter?
Table 6 (reported earlier) shows the recommended
cutscores across the three rounds and the three panels.
The between-round discussions appeared to have the
same effect on all of the groups, which was to reduce
the cutscore slightly (i.e., no round x group interaction).
However, right from the start, the virtual panel recommended a lower cutscore (although significantly lower
only than Group A in Round 1), while the two face-to-face
groups came to very similar cutscore recommendations
(Round 3; within 1 point on a 50 point scale). The virtual recommendation reflects a 12% decrease from that
of face-to-face Group A and a 9% decrease from Group B.
5.4 Study 1: Discussion
Several aspects of the face-to-face vs. virtual environments might have led to the differences between Group A
and the virtual panel. For example, although the face-to-face panels did all of the judgments separately from each
other, the group worked as a whole during a large portion
of the first day to reach consensus on the JQC definition.
The virtual panel worked from this definition, but independently of the other panels. Although the intent was to
minimize a construct-shift in the JQC definition between
panels, it may be that the two face-to-face panels, which worked together on augmenting the JQC definition, internalized it somewhat differently from the virtual panel. The difference in judgment
variability between the face-to-face and virtual panels
supports this hypothesis. Other possible sources of difference between the face-to-face and virtual panels include:
• Differences in the background of panelists. As described
earlier, the experience and demographics of the virtual
panel differed somewhat from that of the two face-to-face panels. Because panelists bring to the judgment
process their own experiences, the differences in
panel make-up cannot be ruled out as a factor for the
observed cutscore differences between the face-to-face and virtual panels.
• Facilitator and rubric interaction effect. During Round
1 of the face-to-face panels, panelists had physical
access to each rubric that was then explained by the
facilitator. For the virtual standard setting, because the
panelists did not have a physical copy of the rubrics—
they were presented online—the facilitator spent
somewhat more time explaining the rubrics to the
panelists. This may have inadvertently reinforced the
stringency of the rubrics applied to the items, leading
to a lower performance expectation for the JQC.
• Transcription effects. In the virtual panel, panelists recorded their ratings on paper, along with their
judgment rationale, then later transcribed the ratings
onto a web-based survey. This was done to facilitate
between-round item-level discussions. Because the
face-to-face panelists had physical copies of the tasks
and rubrics, they were not explicitly asked to record
their rating rationales. This is typical in face-to-face
standard setting, as it is believed that having the tasks and rubrics in front of them as they discuss their ratings is sufficient to jog panelists' memories.
Asking people to provide rationales for their decisions
has been shown to alter decision-making processes
(Wilson & Schooler, 1991). Thus, explicitly providing
the virtual panel with a physical recording form (the
form was emailed to them) and asking them to write
down their rationales may have introduced a source
of variance to their judgments.
• Environment effects. A variety of factors due to the virtual environment might have led to the lower cutscore.
In the face-to-face situation, panelists had printouts
of full-credit responses (screen snapshots) and of the
rubrics, both of which they could refer to freely during
the item-level discussions of Round 1 and during the
ratings of subsequent rounds. However, the ability to
view material in the virtual environment was somewhat
limited, with the facilitator controlling the presentation
of the rubrics and of the full-credit responses, the latter of which often spanned multiple screen snapshots.
Because the virtual panelists could not freely look
among the materials, they might have considered some
of the tasks more difficult because they did not remember all of the information presented to candidates.
Face-to-face panelists often remarked that candidates should be able to figure something out because
of a key sentence in the task description. With the task description less readily available, the virtual panelists might have thought the tasks were more difficult
because they were not considering everything that was
visible to the candidates; virtual panelists had to rely
more on their memory of the task descriptions.
6. Study 2: French
6.1 About the Test
The French test is one component of the certification
process for beginning K-12 French teachers. The test consists of four sections: Listening, Reading, Writing, and
Speaking. The Listening and Speaking sections contain
audio stimuli in addition to written text. The Listening
and Reading sections contain only multiple-choice questions while the Writing and Speaking sections contain
constructed-response questions, each question scored via
a rubric. The maximum number of raw points available
on the form of the French test included in the face-to-face
and virtual studies is 97.
The structural differences between the digital literacy and French tests led us to apply somewhat different
standard-setting judgments to the questions, although
still Angoff-based. For example, the scoring of the constructed-response questions on the French test is different
from that applied to the digital literacy test. The questions
are scored by two independent raters using 3-point rubrics;
the sum of the ratings is the question score. The functional score scale for a question is therefore 0 through 6.
The Angoff-based standard-setting judgment applied to
these questions both for the face-to-face and virtual panelists was to decide on the score (0 through 6) that a JQC
would likely earn.
For the multiple-choice questions on the French
test, the standard-setting judgment was to decide on the
probability that a JQC would answer the question correctly. The rating scale ranged from 0 through 1. In both
instances of standard-setting judgments (Constructed
Response [CR] or Multiple Choice [MC]), the emphasis
was on a single JQC, rather than on a group of 100 JQCs,
as was done for the digital literacy test. As noted earlier,
the focus on 100 JQCs allowed the use of the same 0-1
rating scale for the all-CR digital literacy test, which eases
comparisons between the studies.
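To show how the two kinds of judgments combine into a single recommended raw cutscore on the 97-point French scale, here is a small hedged sketch; the question counts and example judgments are placeholders rather than the study's materials.

    def french_cutscore(mc_probabilities, cr_scores):
        # mc_probabilities: judged probability (0-1) that a JQC answers each
        # multiple-choice question (worth 1 raw point) correctly.
        # cr_scores: judged score (0-6) a JQC would earn on each
        # constructed-response question (two raters x 3-point rubric).
        return sum(mc_probabilities) + sum(cr_scores)

    mc = [0.7] * 61            # e.g., 61 multiple-choice points (Listening + Reading)
    cr = [4, 4, 5, 3, 4, 4]    # e.g., six constructed-response questions, each on a 0-6 scale
    print(french_cutscore(mc, cr))   # one panelist's implied raw cutscore (out of 97)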
Another difference between the two standard-setting
applications is that there were no empirical item difficulty
data available for the French test, as it had not yet been
administered. A further distinction between the digital
literacy standard setting and this one is that the first face-to-face French panel constructed the JQC definition. This
definition was then shared with the second face-to-face
panel for augmentation; the two panels did not collaborate
on the JQC definition. The JQC definition from the first
face-to-face panel was similarly shared with the virtual
panel for it to augment independently. In this regard, the
second face-to-face panel and the virtual panel experienced
this implementation feature in common. Two rounds of
standard-setting judgments occurred for the French test,
given the absence of empirical item difficulty data.
For these panels, test familiarization included taking
the test (including all auditory stimuli) during the first
part of the meeting. For the virtual panel, the panelists
took the sections of the test that did not include any auditory stimuli (reading and writing sections) on their own.
During the first part of the virtual meeting, the panelists
took the listening and speaking portions. The audio for
these sections was played over the speakerphone being
used by the facilitators. All panelists reported good quality of the sound over the speakerphone (at first, the audio
was played through Skype, which did not work well).
6.2 Method
6.2.1 Panelists
All panelists were recruited from the same master list of recommendations from the Boards of Education within each participating state. There were two face-to-face panels, each meeting independently of one another. The virtual panel also met independently of the other two. The full demographic and background data for the two face-to-face panels are shown in Table 8. Panelists were assigned to panels randomly, although attempts were made to create panels with similar background characteristics.
The general characteristics of all three panels were similar. Panel 1 (face-to-face) included 23 teachers, administrators, and college faculty who prepare K-12 French teachers, representing 18 states. Panel 2 (face-to-face) included 24 teachers, administrators, and college faculty, representing the same 18 states. The virtual panel included 7 educators: 4 teachers and 3 faculty members. While we attempted to recruit a similar number of panelists for the virtual panel, a smaller number resulted because of scheduling conflicts—so while a virtual meeting may increase access to participation, it does not solve issues of competing demands.
6.2.2 Procedure
The two face-to-face panels and the virtual panel
engaged in similar activities. Following a brief introduction to standard setting and an overview of the test,
the panelists took the entire test and self-scored their
responses. Next, they were introduced to the standard-setting method for the multiple-choice questions and
practiced making standard setting judgments on the first
set of 6 items from the Listening section. After discussion
of the ratings, the panelists completed their ratings for
the remainder of the Listening section items (with each
Listening audio passage played before panelists rated the
associated 6 items) and then completed their judgments
for the Reading section (Round 1, discussion, Round 2).
Panelists then trained and practiced on the standard-setting method for constructed-response questions (Writing
and Speaking sections), and completed the judgments for
all questions (Round 1, discussion, Round 2).
6.3 Study 2: Results and Discussion
6.3.1 Procedural Validity Evidence
Tables 9-11 show the panelists' evaluations of the meeting process and its outcomes. Panelists generally "agreed" or "strongly agreed" that the study was conducted appropriately and clearly (Table 9). The factors influencing judgments were consistent across all three panels, with the cutscores of other panelists being the least influential factor (Table 10). Finally, panelists had generally good opinions of the meeting process overall, although the reactions were somewhat more positive in the virtual panel than in the first face-to-face panel (these data were unavailable for Panel 2; Table 11). Overall, there is nothing in the evaluation data suggesting that the virtual and face-to-face panels were implemented differently.
Table 8. Characteristics of Panelists

                                                    Panel 1        Panel 2        Virtual
                                                    N (Percent)    N (Percent)    N (Percent)
Group you are representing
  Teachers                                          15 (65%)       19 (79%)       3 (43%)
  Administrator/Department Head                      2 (9%)         2 (8%)        0 (0%)
  College Faculty                                    5 (22%)        2 (8%)        3 (43%)
  Other                                              1 (4%)         1 (4%)        1 (14%)
Race
  African American or Black                          3 (13%)        3 (13%)       1 (14%)
  Alaskan Native or American Indian                  1 (4%)         0 (0%)        0 (0%)
  Asian or Asian American                            0 (0%)         1 (4%)        1 (14%)
  Native Hawaiian or Other Pacific Islander          0 (0%)         0 (0%)        0 (0%)
  White                                             19 (83%)       19 (79%)       5 (71%)
  Hispanic                                           0 (0%)         0 (0%)        0 (0%)
Gender
  Female                                            17 (74%)       18 (75%)       7 (100%)
  Male                                               6 (26%)        6 (25%)       0 (0%)
In which language are you most fluent?
  English                                           14 (61%)       19 (79%)       4 (57%)
  French                                             1 (4%)         3 (13%)       1 (14%)
  English and French about the same                  7 (30%)        2 (8%)        2 (29%)
  Other                                              1 (4%)         0 (0%)        0 (0%)
Are you certified as a French teacher in your state?
  No                                                 4 (17%)        4 (17%)       3 (43%)
  Yes                                               19 (83%)       20 (83%)       4 (57%)
Are you currently teaching French in your state?
  No                                                 2 (9%)         2 (8%)        0 (0%)
  Yes                                               21 (91%)       22 (92%)       7 (100%)
Are you currently mentoring another French teacher?
  No                                                16 (70%)       17 (71%)       4 (57%)
  Yes                                                7 (30%)        7 (29%)       3 (43%)
How many years of experience do you have as a French teacher in your state?
  3 years or less                                    1 (4%)         1 (4%)        0 (0%)
  4 - 7 years                                        4 (17%)        5 (21%)       2 (29%)
  8 - 11 years                                       7 (30%)        4 (17%)       1 (14%)
  12 - 15 years                                      3 (13%)        2 (8%)        0 (0%)
  16 years or more                                   8 (35%)       11 (46%)       4 (57%)
For which education level are you currently teaching French?
  Elementary (K - 5 or K - 6)                        2 (9%)         0 (0%)        0 (0%)
  Middle School (6 - 8 or 7 - 9)                     1 (4%)         1 (4%)        0 (0%)
  High School (9 - 12 or 10 - 12)                   11 (48%)       18 (75%)       3 (43%)
  Middle/High School                                 2 (9%)         0 (0%)        0 (0%)
  All Grades (K - 12)                                0 (0%)         1 (4%)        1 (14%)
  Higher Education                                   6 (26%)        4 (17%)       3 (43%)
  Other                                              1 (4%)         0 (0%)        0 (0%)
School Setting
  Urban                                             10 (43%)        9 (38%)       2 (29%)
  Suburban                                           6 (26%)        9 (38%)       3 (43%)
  Rural                                              7 (30%)        6 (25%)       2 (29%)
Table 9. Overall evaluations
Please indicate below the degree to which you agree with each of the following statements.

I understood the purpose of this study.
  Panel 1: SA 21 (91%), A 2 (9%), D 0, SD 0 | Panel 2: SA 23 (96%), A 1 (4%), D 0, SD 0 | Virtual: SA 6 (86%), A 1 (14%), D 0, SD 0
The instructions and explanations provided by the facilitators were clear.
  Panel 1: SA 18 (78%), A 5 (22%), D 0, SD 0 | Panel 2: SA 23 (96%), A 1 (4%), D 0, SD 0 | Virtual: SA 6 (86%), A 1 (14%), D 0, SD 0
The training in the standard setting methods was adequate to give me the information I needed to complete my assignment.
  Panel 1: SA 18 (78%), A 5 (22%), D 0, SD 0 | Panel 2: SA 21 (88%), A 3 (13%), D 0, SD 0 | Virtual: SA 6 (86%), A 1 (14%), D 0, SD 0
The explanation of how the recommended cutscore is computed was clear.
  Panel 1: SA 21 (91%), A 2 (9%), D 0, SD 0 | Panel 2: SA 19 (79%), A 5 (21%), D 0, SD 0 | Virtual: SA 5 (71%), A 2 (29%), D 0, SD 0
The opportunity for feedback and discussion between rounds was helpful.
  Panel 1: SA 15 (65%), A 6 (26%), D 2 (9%), SD 0 | Panel 2: SA 22 (92%), A 2 (8%), D 0, SD 0 | Virtual: SA 7 (100%), A 0, D 0, SD 0
The process of making the standard setting judgments was easy to follow.
  Panel 1: SA 15 (65%), A 8 (35%), D 0, SD 0 | Panel 2: SA 21 (88%), A 3 (13%), D 0, SD 0 | Virtual: SA 4 (57%), A 3 (43%), D 0, SD 0
Note. SA = Strongly Agree, A = Agree, D = Disagree, SD = Strongly Disagree
Table 10. Influence of Study Materials on Judgments
How influential was each of the following factors in guiding your standard setting judgments?

The definition of the Just Qualified Candidate
  Panel 1: VI 20 (87%), I 2 (9%), NI 1 (4%) | Panel 2: VI 19 (79%), I 5 (21%), NI 0 | Virtual: VI 6 (86%), I 1 (14%), NI 0
The between-round discussions
  Panel 1: VI 10 (43%), I 12 (52%), NI 1 (4%) | Panel 2: VI 15 (63%), I 9 (38%), NI 0 | Virtual: VI 5 (71%), I 2 (29%), NI 0
The knowledge/skills required to answer each test question
  Panel 1: VI 19 (83%), I 4 (17%), NI 0 | Panel 2: VI 13 (54%), I 11 (46%), NI 0 | Virtual: VI 6 (86%), I 1 (14%), NI 0
The cutscores of other panel members
  Panel 1: VI 2 (9%), I 18 (78%), NI 3 (13%) | Panel 2: VI 2 (8%), I 16 (67%), NI 6 (25%) | Virtual: VI 2 (29%), I 4 (57%), NI 1 (14%)
My own professional experience
  Panel 1: VI 18 (78%), I 5 (22%), NI 0 | Panel 2: * | Virtual: VI 7 (100%), I 0, NI 0
Note. VI = Very Influential, I = Influential, NI = Not Influential.
* Because of an error in study materials, this question was not asked of Panel 2.
Table 11. Evaluation of meeting process
How would you describe the meeting process? Please rate the meeting on each of the five scales shown below.*

Inefficient (1) – Efficient (5)
  Panel 1: 1: 0, 2: 1 (4%), 3: 1 (4%), 4: 7 (30%), 5: 14 (61%) | Virtual: 1: 0, 2: 0, 3: 1 (14%), 4: 2 (29%), 5: 4 (57%)
Uncoordinated (1) – Coordinated (5)
  Panel 1: 1: 0, 2: 0, 3: 1 (4%), 4: 6 (26%), 5: 16 (70%) | Virtual: 1: 0, 2: 0, 3: 0, 4: 2 (29%), 5: 5 (71%)
Unfair (1) – Fair (5)
  Panel 1: 1: 0, 2: 0, 3: 0, 4: 3 (13%), 5: 20 (87%) | Virtual: 1: 0, 2: 0, 3: 0, 4: 1 (14%), 5: 6 (86%)
Confusing (1) – Understandable (5)
  Panel 1: 1: 0, 2: 0, 3: 1 (4%), 4: 5 (22%), 5: 17 (74%) | Virtual: 1: 0, 2: 0, 3: 0, 4: 3 (43%), 5: 4 (57%)
Dissatisfying (1) – Satisfying (5)
  Panel 1: 1: 0, 2: 1 (4%), 3: 1 (4%), 4: 2 (9%), 5: 19 (83%) | Virtual: 1: 0, 2: 0, 3: 0, 4: 3 (43%), 5: 4 (57%)
* Because of an error in materials, Panel 2 did not receive this question.
6.3.2 Internal Validity Evidence
Table 12 shows the section item ratings and the total
cutscore across rounds for the three panels. These results
suggest some differences in the virtual panelists’ judgments
in the multiple-choice vs. constructed-response portions of the test. For the face-to-face panels, we observed
the expected decrease in standard deviations of ratings
between Rounds 1 and 2 for all test sections. However,
the virtual panelists’ ratings showed this expected effect
only for the Reading and Listening sections (multiple-choice items); for the Writing and Speaking sections
(constructed-response items), the standard deviations
of the virtual panel’s ratings remained the same, which
contributed to an increase in the variability of the overall cutscore. Furthermore, the overall variability in the
Writing and Speaking sections is approximately twice
that of the face-to-face panels. These results may suggest some differential performance of the virtual panel
depending on test section, although that difference
might be due to language skill tested (e.g., receptive vs.
productive), item type (multiple-choice or constructed-response), judgment question posed (“likelihood of JQC answering correctly” vs. “likely score earned by JQC”),
or other factors. Additional studies would need to be conducted to ascertain if this is a systematic effect.
6.3.3 External Validity Evidence
Table 12 shows the two rounds of ratings for the three
French panels; discussion will focus on the last line of the
table, the overall cutscores recommended by each panel.
The virtual panel full-test cutscores were slightly higher
than those of the second face-to-face panel (approximately 3 raw points), and considerably higher than those
of the first face-to-face panel (approximately 10 raw
points). Nonetheless, the difference between face-to-face
Panel 1 and the virtual panel paralleled the difference in
the cutscores between the two face-to-face panels, which
was approximately 7 raw points. It is unclear as to why the
cutscore from Panel 1 diverged from the other cutscores.
But it is noteworthy that the cutscores from Panel 2 and
the virtual panel were consistent with cutscores recommended for two other language tests built for the same
testing program, German and Spanish. The specifications
for these two tests are the same as for the French test and
the test format and structure are the same for all three
tests. For each of these two tests, two independent face-to-face panels were assembled, and the same Angoff-based
process applied to the French test was applied to these two
tests. For the two panels, the cutscores for German were 65.7 and 62.1, and for Spanish, 65.5 and 68.0. This
additional information suggests that the results of Panel 1
are discrepant. Importantly, these results demonstrate that the web-based approach differs from a face-to-face result no more than two face-to-face results might differ from one another.
7. General Discussion
In previous research (Katz et al., 2009), we demonstrated
that it is feasible to conduct standard setting with a virtual
panel and to obtain similar positive evidence of procedural
validity as obtained in traditional face-to-face studies.
In this study, we set out to determine if the outcomes of
standard setting studies on the same tests conducted virtually and face-to-face would be comparable. Overall, the
results suggest that the absolute recommended cutscores
are not very different, the most frequent difference being
3-4 points between any face-to-face and virtual panel.
Table 12. Mean (SD) cutscores for each Round by test section for the three panels

Test Section          Panel 1                    Panel 2                    Virtual
(Max Raw Score)       Round 1      Round 2       Round 1      Round 2       Round 1      Round 2
Listening (30)        17.6 (2.2)   17.2 (1.9)    18.3 (2.5)   18.1 (2.0)    19.7 (1.8)   20.1 (1.5)
Reading (31)          21.5 (2.9)   21.5 (2.4)    22.8 (2.6)   23.1 (2.3)    25.5 (2.2)   25.6 (2.1)
Writing (18)           9.8 (1.3)   10.3 (1.1)    12.0 (1.4)   12.7 (1.1)    11.3 (3.0)   11.3 (3.0)
Speaking (18)          9.5 (2.4)    9.6 (2.0)    11.5 (1.6)   12.0 (1.1)    11.6 (3.2)   11.6 (3.2)
Total (97)            58.4 (5.3)   58.5 (4.6)    64.7 (6.0)   65.8 (4.7)    68.0 (7.2)   68.6 (7.8)
The one exception was the first face-to-face French panel
(Study 2), which was 7 and 10 raw points lower than the
second face-to-face panel and the virtual panel, respectively. However, those larger differences do not seem to
be attributable to the standard setting venue (face-to-face
or virtual).
Although the virtual and face-to-face cutscores from
the French panels (Panel 1 notwithstanding) and from
the digital literacy panels seem reasonably close, the
direction of the relationship was different. For the French
test, the virtual cutscore was greater than the face-to-face cutscores, but for the digital literacy test, the virtual
cutscore was less than the face-to-face cutscores. We cannot conclude, as a consequence, whether one should expect a virtually determined cutscore to under-predict or over-predict what would occur in a face-to-face venue.
Further, the impact of a 3-4 point discrepancy needs
to be considered in light of the range of available points
and the distribution of test scores. On the 50-point scale
for the digital literacy test, such a discrepancy is a meaningful difference in terms of the percentage of points that
needs to be earned to be considered just qualified. The
cutscores for the two face-to-face panels translate into
67% and 65%, respectively, whereas the cutscore for the
virtual panel translates to 58%. On the 97-point scale for
the French test, the percentage differences are less pronounced, 67% (Panel 2) and 71% for the virtual panel.
Performance data were only available for the digital literacy test. On that test, the face-to-face and virtual cutscores
were near the center of the score distribution, and so the
3-4 point difference would lead to large differences in
the percentage of test takers classified as just qualified,
approximately 50% for the face-to-face cutscores and 70%
for the virtual cutscore.
8. Conclusions
Overall, a virtual standard setting approach appears to
be both viable and appropriate. The differences between
the absolute cutscores derived from the face-to-face and
virtual panels were not so large as to nullify our opening
remarks. However, given that much of the validity evidence pertaining to standard setting remains procedural,
more focused attention on understanding the sources of
variance in a virtual approach would seem prudent.
The use of virtual meetings to support standard setting, the focus of this study, clearly has application
to other aspects of the assessment process (see, e.g.,
Schnipke & Becker, 2007). Item writing and reviewing
workshops may be conducted virtually, as can fairness
reviews. The nature of the training and data collection
needed to support these other distance-based assessment
practices may be somewhat different from a more traditional face-to-face process, but there is nothing inherent
in the distance-based process that would preclude these
and other applications. Tannenbaum (2011), for example,
applied a distance-based approach to evaluate the judged
alignment between a test of English language skills and a
language framework of English proficiency.
9. Recommendations
The use of and reliance on remote, or distance-based,
technologies will continue to impact assessment-related
practices. Reflecting on the experiences reported here and on our previous research in this area (Katz et al., 2009) enables us to offer recommendations for practitioners. These recommendations, although inspired by the virtual standard-setting approach outlined in this paper, should be applicable to other distance-based, assessment-related practices.
First, hold a pre-meeting, “check-in” session. The session should focus specifically on verifying that all panelists
are able to access the web-based sources of information
relevant to the study. We constructed sample web pages
of basic information and rating scales for the panelists to
navigate and use to uncover potential issues; and we went
over some of the basic features of Microsoft Live Meeting,
such as the location of the icon for the microphone, how
to use “flags” to signal when they completed a task, and
how to send messages to the facilitators. The check-in
session proved to be invaluable, both to address minor
glitches before the actual standard-setting process and to
alleviate the concerns of panelists who might not consider themselves “technology savvy.”
Second, use technology to support test security.
Test security is always an issue when conducting standard setting, whether face-to-face or virtual, but there
is a greater threat of inadvertent security breaches when
sharing electronic files of test materials. We addressed
this issue by using an online document-securing service; at the time, the service was free, but several commercial solutions that provide similar functionality are now available, including the security features of Microsoft Office. The service we used allowed the creation
of encrypted files, accessible only through unique user
IDs and passwords for each panelist; the service also allowed us to track file access and to set an expiry date for each file (by disabling user IDs). The files also
did not allow printing or copying. While these security
measures would not stop an intentional security breach,
they are similar to measures used when paper copies of
secure materials are distributed. As would occur with a
face-to-face study, all panelists had signed non-disclosure
agreements before having access to any secure materials.
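The file-protection workflow described above can be approximated with widely available tools. As a hedged illustration only, the sketch below password-protects a PDF of secure materials with a per-panelist password using the open-source pypdf library; this is not the service we used, and features such as access tracking, expiry dates, and print/copy restrictions would still require a dedicated document-security platform.

```python
# Illustration only: per-panelist password protection for a PDF of secure
# materials using the open-source pypdf library. This is not the document-
# securing service described above; access logging, expiry dates, and
# print/copy restrictions would require a dedicated document-security platform.
from pypdf import PdfReader, PdfWriter

def protect_for_panelist(source_pdf: str, output_pdf: str,
                         panelist_password: str, owner_password: str) -> None:
    """Write a copy of source_pdf that opens only with the panelist's password."""
    reader = PdfReader(source_pdf)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    writer.encrypt(user_password=panelist_password, owner_password=owner_password)
    with open(output_pdf, "wb") as out_file:
        writer.write(out_file)

# Hypothetical usage: one uniquely protected copy per panelist.
protect_for_panelist("secure_test_form.pdf", "secure_test_form_panelist01.pdf",
                     panelist_password="unique-panelist-01-passphrase",
                     owner_password="facilitators-only-passphrase")
```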
Third, pay attention to panelist engagement. Panelist
engagement and participation present more of an issue in virtual standard setting than in face-to-face studies because of the greater variety of non-meeting-related distractions. During group discussions, we
maintained an informal log of which panelists were
speaking more frequently than others and invited those
less participative into the discussion. For example, we
would pose questions to these panelists about their
reactions to what was being discussed or ask the less
participative members to be the first ones to offer their
rationales for their standard-setting judgments—i.e., to
start a new chain of discussion. We also requested that
each panelist, before speaking, state his or her name so
that the other panelists would start to associate voices with names and, thus, begin to take the initiative in drawing
others into the discussion.
Fourth, balance the work panelists do online vs. at
their desks. We learned that the panelists had some difficulty referring to the JQC definition online and then
entering their standard-setting judgments online, due to
screen real-estate issues. Therefore, we encouraged panelists to print the JQC definition so that a physical copy was available, which, according to the feedback we received, helped
greatly. Regarding the online rating scale, although panelists had no technical difficulty in entering ratings, during
the between-round item discussions, panelists stated that
they did not readily recall the reasons for their item judgments, making discussion more challenging. We addressed
this issue by emailing the panelists a rating form to print
out. The form contained the item numbers (not the actual
items), a space to enter their rating, and a space to make
brief notes about the reasons for each item rating. Once
they made their ratings on the form, we gave them time
to enter their ratings online. The panelists referred to their
written notes about the items during the between-round
discussion. The informal feedback we received supported
the effectiveness of this simple solution.
An overall theme of our recommendations is that technique trumps technology: the success of the approach owed less to the specific technology used to implement it than to careful planning to ensure the ease and comfort of panelists, together with meeting-facilitation techniques adapted to remote meetings. Note that we intentionally relied on off-the-shelf technology; we did not set out to build any software
or platforms specifically to support the standard setting.
We are confident that other readily available technologies that facilitate distance-based meetings would work
as effectively. Naturally, tailored technologies would likely
make the standard-setting process easier by offering, for
example, more intuitive interfaces, but this benefit needs to be balanced against the added cost and expertise required to build and maintain new systems.
10. References
1. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC.
2. Buckendahl, C. W., Smith, R. W., Impara, J. C., & Plake, B. S. (2002). A comparison of Angoff and Bookmark standard setting methods. Journal of Educational Measurement, 39, 253–263.
3. Cizek, G. J., Bunch, M. B., & Koons, H. (2004). Setting performance standards: Contemporary methods. Educational Measurement: Issues and Practice, 23, 31–50.
4. Clauser, B. E., Mee, J., Baldwin, S. G., Margolis, M. J., & Dillon, G. F. (2009). Judges’ use of examinee performance data in an Angoff standard-setting exercise for a medical licensing examination: An experimental study. Journal of Educational Measurement, 46, 390–407.
5. Davis, S. L., Buckendahl, C. W., Chin, T. Y., & Gerrow, J. (2008, March). Comparing the Angoff and Bookmark methods for an international licensure examination. Paper presented at the National Council on Measurement in Education, New York.
6. Espinosa, J. A., Slaughter, S. A., Kraut, R. E., & Herbsleb, J. D. (2007). Familiarity, complexity, and team performance in geographically distributed software development. Organization Science, 18, 613–630.
7. Harvey, A. L., & Way, W. D. (1999, April). A comparison of web-based standard setting and monitored standard setting. Paper presented at the annual conference of the National Council on Measurement in Education, Montreal, Canada.
8. Harvey, A. L. (2000, April). Comparing onsite and online standard setting methods for multiple levels of standards. Paper presented at the annual conference of the National Council on Measurement in Education, New Orleans, LA.
9. Kane, M. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64, 425–461.
10. Katz, I. R., Tannenbaum, R. J., & Kannan, P. (2009). Virtual standard setting. CLEAR Exam Review, 20(2), 19–27.
11. Lorié, W. (2011, June). Setting standards remotely: Conditions for success. Paper presented at the CCSSO National Conference on Student Assessment, Orlando, FL.
12. Nichols, P., Twing, J., Mueller, C. D., & O’Malley, K. (2010). Standard-setting methods as measurement processes. Educational Measurement: Issues and Practice, 29, 14–24.
13. Olsen, J. B., & Smith, R. (2008, March). Cross validating modified Angoff and Bookmark standard setting for a home inspection certification. Paper presented at the annual meeting of the National Council on Measurement in Education, New York.
14. Olson, G. M., & Olson, J. S. (2000). Distance matters. Human-Computer Interaction, 15, 139–178.
15. Reckase, M. D. (2006). A conceptual framework for a psychometric theory for standard setting with examples of its use for evaluating the functioning of two standard setting methods. Educational Measurement: Issues and Practice, 25, 4–18.
16. Schnipke, D. L., & Becker, K. A. (2007). Making the test development process more efficient using web-based virtual meetings. CLEAR Exam Review, 18, 13–17.
17. Tannenbaum, R. J., & Katz, I. R. (2008). Setting standards on the core and advanced iSkills™ assessments (ETS Research Memorandum No. RM-08-04). Princeton, NJ: Educational Testing Service.
18. Tannenbaum, R. J. (2011). Alignment between the TOEIC® test and the Canadian Language Benchmarks: Final report. Princeton, NJ: ETS.
19. Tannenbaum, R. J., & Kannan, P. (in press). Consistency of Angoff-based standard-setting judgments: Are item judgments and passing scores replicable across different panels of experts? Educational Assessment.
20. Tannenbaum, R. J., & Katz, I. R. (2013). Standard setting. In K. F. Geisinger (Ed.), APA handbook of testing and assessment in psychology: Vol. 3. Testing and assessment in school psychology and education (pp. 455–477). Washington, DC: American Psychological Association.
21. Wilson, T. D., & Schooler, J. W. (1991). Thinking too much: Introspection can reduce the quality of preferences and decisions. Journal of Personality and Social Psychology, 60, 181–192.
22. Zieky, M. J. (2001). So much has changed: How the setting of cutscores has evolved since the 1980s. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 19–51). Mahwah, NJ: Lawrence Erlbaum.
23. Zieky, M. J., Perie, M., & Livingston, S. A. (2008). Cutscores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.
Address for Correspondence:
Irvin R. Katz
Educational Testing Service
How to cite this article: Katz, I. R., & Tannenbaum, R. J. (2014).
Comparison of web-based and face-to-face standard setting
using the Angoff Method. Journal of Applied Testing Technology,
15(1), 1–17.
Source of Support: Yes, Conflict of Interest: Declared