ISSN (Print): XXXX-XXXX
ISSN (Online): 2375-5636
Journal of Applied Testing Technology, Vol 15(1), 1–17, 2014

Comparison of Web-based and Face-to-face Standard Setting using the Angoff Method

Irvin R. Katz* and Richard J. Tannenbaum
Educational Testing Service, MS 16-R, 660 Rosedale Road, Princeton, NJ 08541; [email protected], [email protected]
*Author for correspondence

Abstract
Web-based standard setting holds promise for reducing the travel and logistical inconveniences of traditional, face-to-face standard setting meetings. However, because there are few published reports of setting standards via remote meeting technology, little is known about the practical potential of the approach, including the technical feasibility of implementing common standard setting methods and whether such an approach presents threats to validity and reliability. In previous work, we demonstrated the feasibility of implementing a modified Angoff methodology in a virtual environment (Katz, Tannenbaum, & Kannan, 2009). This paper presents results from two studies in which we compare cutscores set through face-to-face meetings and through the web-based approach on two operational tests, one of digital literacy and one of French.

Keywords: Angoff Methodology, Distance-based Meetings, Standard Setting, Virtual Meetings

1. Introduction
In a world of 24x7 access to information, geographically dispersed teams and even whole companies, tablet- or phone-based collaboration, crowdsourcing of decision-making, and desktop screen sharing, a traditional standard-setting study seems almost quaint. Could the face-to-face standard-setting study be replaced by a virtual meeting using current remote-meeting technologies? What are the issues involved, and would such meetings yield results comparable to face-to-face studies? This paper reports two experiments that compare outcomes from face-to-face and virtual standard-setting studies.

Standard setting involves comparatively small numbers of domain experts (often fewer than 15, especially for licensure testing) meeting at a particular location to define one or more performance standards, reviewing a test in light of those standards, and recommending a cutscore corresponding to each performance standard (Tannenbaum & Katz, 2013). Standard-setting studies can be both time- and cost-intensive, with domain experts (e.g., practitioners) taking time away from their work and with the attendant expenses of travel, accommodations, food, meeting sites, etc. Holding a standard-setting study virtually, without the need for travel, could significantly reduce the costs of the meeting (Katz, Tannenbaum, & Kannan, 2009; Zieky, 2001).

Beyond cost, assembling a panel of domain experts is a non-trivial task: the need to travel has direct implications for the representativeness of the standard-setting panel. The size and representativeness of a standard-setting panel are important drivers of the meaningfulness and credibility of a recommended cutscore (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education, 1999; Tannenbaum & Katz, 2013). Thus, traditional, face-to-face studies are constrained to the domain experts available to travel on the study days. By eliminating travel, a virtual standard-setting study may be more likely to include panelists with critical perspectives and experiences who might otherwise not have been able to participate.
Previous research has presented the potential value of virtual standard setting (Lorié, 2011), and others had envisioned a rise in the use of virtual standard setting (Zieky, 2001); but, in fact, there are few published accounts of virtual standard setting. Harvey and colleagues (Harvey, 2000; Harvey & Way, 1999) compared face-to-face and distance-based standard setting, using a modified Angoff and a variation of a Yes/No method; overall, the results supported the reasonableness of using a distance-based process. However, the technology at the time did not support the level of discourse and interactivity that may be expected today. More recently, Katz et al. (2009) affirmed the feasibility of applying a more dynamic web-based, modified Angoff standard-setting approach.

There are also theoretical and empirical results from the research on virtual teams in the business world suggesting that virtual standard setting is a feasible option to consider. A critical predictor of a long-term team's success revolves around issues of trust among team members (e.g., trust in the other members' expertise; Espinosa, Slaughter, Kraut, & Herbsleb, 2007). However, standard-setting panels are short-term teams for which trust might not be as critical a factor (Katz et al., 2009). Another predictor of virtual team success is the amount of "coupling," or inter-dependence, among team members' activities: the more interdependent the activities, the more challenging for a virtual environment (Olson & Olson, 2000). Several activities in a standard-setting study (becoming familiar with the standard-setting process, the performance standards, and the test; making standard-setting judgments on each test item) are loosely coupled activities, occurring independently of the work of other panelists, and are therefore more amenable to the virtual environment.

The potential benefits of a virtual standard setting approach notwithstanding, our previous research left some key validity questions unanswered. For example, would our virtual application yield results comparable to those obtained from a face-to-face application? Does the virtual standard setting approach introduce irrelevant variance into cutscore judgments? Does the virtual meeting context negatively impact panel dynamics? In addition to content expertise, will panelists also need technology skills beyond what may reasonably be expected of most practitioners and educators? The current research directly investigates whether the outcomes of a standard setting (e.g., cutscores, variability in panelist judgments, and panelist confidence in their judgments and the overall panel recommendation) differ between the face-to-face and web-based (virtual) approaches. Our research focuses on whether the outcomes from independent face-to-face panels for the same test differ to a lesser extent than do outcomes from face-to-face and web-based panels for these same tests. This approach is consistent with other research that has compared alternative standard-setting methodologies (mostly face-to-face) to investigate validity threats (e.g., Buckendahl, Smith, Impara, & Plake, 2002; Davis, Buckendahl, Chin, & Gerrow, 2008; Olsen & Smith, 2008; Reckase, 2006).

2. Research Question and Study Approach
The overarching question to be addressed by this study is whether a standard setting meeting conducted via the web results in outcomes (cutscores, variability in judgments, and validity evidence) similar to a face-to-face meeting.
Standard setting typically involves a relatively small number of panelists, so differences between face-to-face and web-based outcomes might be due to characteristics of the particular panelists. The constraints of consequential, operational standard setting often preclude a counterbalanced, within-subjects design. However, we identified two operational tests (digital literacy and French) that each included two independent face-to-face meetings, providing a naturally occurring replication. An additional web-based panel for these assessments offers a unique opportunity to consider whether a virtual environment introduces significant irrelevant variance. If the web-based environment introduces irrelevant variance, then we would expect to observe larger differences in the outcomes between the web-based panel and either of the face-to-face panels. However, we hypothesize that the web-based and face-to-face meeting outcomes will differ no more than do the outcomes between the two face-to-face meetings.

For each assessment, the face-to-face panels met before the virtual panel. The digital literacy panels met face-to-face in early October, 2009, and the virtual panel met in mid-November, 2009. The French panels met face-to-face in mid-July, and the virtual panel met in mid-September, 2009. Each panel (virtual and both face-to-face) comprised distinct panelists. However, all three panels were selected from the same group of nominated panelists, each expert in the domain of the assessment (digital literacy or French).

3. Data Analyses
The approach to the data analyses follows that of studies that compare standard setting methodologies, such as comparisons of the Bookmark and Angoff methods (e.g., Buckendahl et al., 2002; Davis et al., 2008). Analyses are descriptive, rather than inferential, owing to the small number of panelists (10 to 15) often associated with operational standard setting studies. This sample size restriction is not unusual even in the standard setting research literature; for example, Buckendahl et al.'s (2002) comparison of Angoff and Bookmark approaches used 10 panelists per condition. Comparative studies with larger Ns typically involve students acting as subject matter experts, which is unfeasible for an operational testing program.

Our measures address not only the traditional outcomes of standard setting, such as panelists' item judgments, but measures of validity as well. Threats to validity include lack of panelist engagement or participation, panelists' misunderstanding the judgment task, and panelists not basing judgments on the definition of the Just Qualified Candidate (JQC). These threats exist whether the meeting is conducted via the web or face-to-face. Other threats that more likely affect the virtual environment include not being able to access web sites, not being able to hear or see other panelists during group discussions, and other technological failures.

Kane (1994) outlines three types of evidence for the validity of cutscores: procedural, internal, and external. Procedural evidence refers to information showing that the standard setting study was implemented in a way that leads to reasonable outcomes. Internal evidence refers to information demonstrating that the judgments made by panelists appear consistent with each other and that any disagreements are due to professional judgment (i.e., unlikely due to irrelevant factors).
External evidence refers to the consistency of the cutscore with measures outside of the specific panel. In the current study, specific measures to compare between the face-to-face meetings, and between each face-to-face meeting and the web-based meeting, include:
• Agreement with statements regarding implementation quality (procedural validity evidence). In a typical standard setting meeting, panelists self-report on aspects of the meeting that provide validity evidence on how well the meeting was conducted and on the overall satisfaction with or acceptance of the recommended cutscore (Tannenbaum & Katz, 2013; Zieky, Perie, & Livingston, 2008). For example, panelists report on whether they understood the instructions, on the factors that influenced their judgments, whether they felt confident in their ability to make standard setting judgments, and whether the recommended cutscore is "about right," "too low," or "too high."
• Variability in judgments (internal validity evidence). Perhaps as important as the specific cutscore is the amount of variability across panelists' judgments. Standard-setting studies are often conducted through several rounds of judgments, with discussion and feedback between the rounds. Variability typically is greatest in Round 1, where judgments are independent. Variance tends to decrease over rounds owing to the intervening group discussions (Tannenbaum & Katz, 2013). If the level and quality of discussion in the web-based approach parallels that which occurs in a face-to-face meeting, we would expect similar reductions in judgment variability from round to round. In addition, the variability of judgments in Round 1 (before feedback and discussion occur) provides a comparison as to whether there are any inherent differences in the judgments being made between web-based and face-to-face environments.
• Cutscores (external validity evidence). The absolute cutscore reached at the end of a standard setting meeting is often compared between standard setting methodologies. While the cutscore reached is a matter of informed judgment (Nichols, Twing, Mueller, & O'Malley, 2010; Zieky, 2001) and so has no "correct" answer, we expect similar methodologies, with similar panels, to yield similar cutscores (Tannenbaum & Kannan, in press; Tannenbaum & Katz, 2008). If the web-based and face-to-face methods are comparable, the panelists should reach similar cutscore decisions.

4. Virtual Standard Setting Approach
The virtual standard setting approach used a combination of Microsoft Live Meeting 2007, conference calling, webcasts (self-running presentations delivered on-line), web surveys, and web pages to create an environment for panelists to participate in a standard setting study from their home or office computers. Our choice of technology was, in part, driven by our explicit goal of using readily available, off-the-shelf technology and software. One of our objectives was to compare face-to-face methods with virtual methods without the need to build new platforms or delivery modes. If virtual standard setting is to be a viable alternative to traditional face-to-face approaches, then it should be adaptable to existing, relatively common technologies.
In creating the virtual environment, panelists view introductory materials via webcasts (i.e., visiting a web site to watch a presentation video), receive background information via presentations given with Live Meeting (remote meeting technology), participate in group discussions via conference calls, fill out readiness-to-proceed and final evaluation surveys on-line, and enter their Angoff-based standard setting judgments via a web page that shows the test items, the definition of the JQC, and a data entry area for entering judgments (Figure 1). The full approach is summarized in Table 1. Figure 1 and Table 1 refer to the previous Katz et al. (2009) virtual standard setting, but serve as useful illustrations. Both the digital literacy and French studies followed this approach generally, although some adaptations for the specifics of each test were necessary, as outlined later in this paper.

Figure 1. Website for modified Angoff judgments, showing an example test (top left), JQC definition (top right), and survey for entering item ratings (bottom) (Katz et al., 2009).

Table 1. Virtual standard setting approach (adapted from Katz et al., 2009)
Pre-meeting work: Panelists learn the basic objectives of standard setting, including their roles and responsibilities in the meeting. Pre-meeting work also includes panelists accessing two websites to assure they can view webcasts and access the web conferencing software. The performance level description(s) to be used in the meeting are sent to panelists for early review.
Overview of meeting: Introductions, meeting overview, and more detailed technology tryouts are covered during an introductory 30-minute session held approximately one week prior to the first standard setting session. The extra week provides time to address any technical issues not solvable during the introductory meeting.
Overview of standard setting: Panelists view an eleven-minute webcast between the "Overview of Meeting" session and the first standard setting session. This material is also reviewed during the first standard setting session.
Training and practice in standard setting methods: Panelists view a second eleven-minute webcast between the "Overview of Meeting" session and the first standard setting session. The training is reinforced during the first standard setting session.
Panelists take the test: A secured PDF of the test (no printing, no copying, expiration date) is sent as two files (without answers and with answers) to panelists during the introductory session. They take the test and self-score their responses sometime in the week between the "Overview of Meeting" session and the first standard setting session. In a face-to-face meeting, panelists typically take the test and self-score during the first day of the meeting.
Discussion of the performance standard (the Just-Qualified Candidate, JQC): As in a face-to-face meeting, a description of the just qualified candidate (JQC) is discussed via teleconference during the first standard setting session. Facilitators work with panelists to create this definition based on the test content specifications. During discussion, the JQC definition is written and edited (more operationally defined) via a shared application, which is similar to face-to-face meetings in which the edited JQC definition is often projected. After the discussion, the final JQC definition is electronically distributed to each panelist for printing.
Initial survey (training evaluation) and verification of readiness to proceed: A web survey is delivered through the web conferencing software, using web survey software. Results are immediately available via a web link (for facilitators only, although they could be shared).
Round 1 judgments: Panelists enter their ratings via a custom web site accessed through the web conferencing software. The web site was designed to keep visible the information panelists need to make their judgments: the test (with answers), the JQC definition, and the web survey form for entering Angoff ratings (created with web survey software).
Round 1 discussion: Facilitators enter the ratings data into a spreadsheet, which they share via the web conferencing software. The spreadsheet, identical to that used in face-to-face meetings, summarizes judgments, highlights discrepancies among panelists, and shows item difficulty data. Panelists then share their rationales for their item-level judgments via the teleconference.
Round 2 judgments: Panelists enter their ratings via the same web site as in Round 1. The web survey software pre-enters each panelist's data, so panelists see their earlier responses and re-judge only the items they want to modify.
Round 2 discussion: Round 2 follows an approach similar to the Round 1 discussion.
Final survey: Panelists complete a web survey delivered through the web conferencing software. Results are immediately available via a web link (for facilitators only, although they could be shared).

5. Study 1: Digital Literacy

5.1 About the Test
The first study recommended a cutscore for a digital literacy certification test, which is performance-based, computer-delivered, and automatically scored. The test consists of 14 performance tasks, each of which produces 2-5 scores, totaling 50 scores for the test. For convenience, these scores will be referred to as "items." Each response to an item may earn full credit (1), partial credit (.5), or no credit (0). However, the definition of full credit, partial credit, and no credit is different for each of the 50 items on the test. Thus, to evaluate the difficulty of an item, panelists had to see the actual task (screen snapshots and sample full-credit responses) along with the scoring rubric for the item.

5.2 Method

5.2.1 Panelists
The face-to-face panelists were 23 digital literacy experts recruited from across the U.S. and from three countries. Panelists were nominated by an international advisory council. A split-panel replication was implemented, as described in the procedures. The virtual panelists were 10 digital literacy experts who were recruited from the same pool of nominated panelists as in the face-to-face study. The virtual panel completed its work independently of the face-to-face panel. Table 2 shows some of the characteristics of the three standard setting panels. The face-to-face groups were designed to be reasonably equivalent in terms of panelist demographics and background. Similar to the face-to-face panels, the virtual panelists were predominantly white (60% vs. 65%) and came from many regions of the U.S. However, the virtual group contained a larger number of female panelists (8 vs. 5), no international panelists, and no corporate trainers.
Table 2. Characteristics of panelists
| Characteristic | Face-to-face Group A (n = 12) | Face-to-face Group B (n = 11) | Virtual Group (n = 10) |
| Gender | | | |
| – Female | 5 | 5 | 8 |
| – Male | 7 | 6 | 2 |
| Race/ethnicity | | | |
| – Asian American | 2 | 1 | 0 |
| – African American | 2 | 2 | 2 |
| – Hispanic | 0 | 1 | 2 |
| – White (non-Hispanic) | 8 | 7 | 6 |
| Expertise type | | | |
| – Corporate training | 2 | 2 | 0 |
| – Workforce development training | 2 | 3 | 1 |
| – College or university instruction | 6 | 4 | 7 |
| – K-12 teacher training | 2 | 2 | 2 |
| Institution/organization location | | | |
| – Northeast | 4 | 3 | 2 |
| – Midwest | 1 | 2 | 2 |
| – South | 2 | 1 | 3 |
| – Southwest | 1 | 1 | 0 |
| – West | 2 | 2 | 3 |
| – International | 2 | 2 | 0 |

5.2.2 Procedure (Face-to-Face)
Before the standard setting meeting, panelists were asked to complete two assignments intended to familiarize them with the materials to be used during the meeting. First, panelists were given access to the computer-delivered test, which could be reached via a website, so that they could become familiar with the knowledge and skills elicited by the test. Second, panelists were sent a draft definition of the JQC constructed by a similarly composed panel of experts from a different standard-setting study evaluating digital literacy (Tannenbaum & Katz, 2008). Panelists were asked to review the JQC definition and to write a few performance indicators (behavioral examples) to help them internalize the definition. The purpose of these pre-meeting assignments was to begin the process of the panelists coming to a shared understanding of what should be expected of a Just Qualified Candidate.

The panel was then split into two independent subpanels that completed the remaining work (assignment to each subpanel was random, although attempts were made to have similar background characteristics represented on both panels). Each subpanel was guided by a different facilitator. Although the groups worked largely independently on the ratings, note that they were calibrated together on the nature of the test, the JQC definition, and the practice ratings.

5.2.3 Procedure (Virtual)
While the general activities involved in the standard setting meeting did not change between the face-to-face and virtual environments, the virtual environment necessitated some adaptations and also provided some opportunities for time savings. As with the face-to-face panels, the web-based panelists were asked to complete two pre-meeting assignments: to take the test (using the same system as was used for the face-to-face panelists) and to review the JQC definition constructed by the face-to-face group. In addition, the virtual panel viewed an online presentation that provided background to the standard setting process, which lessened the time needed to cover this material during the meeting. All panelist interactions were via the audio conference call; no web video was used. Pilot testing of preliminary versions of the virtual standard setting included audio-only and audio-with-video capabilities. The results did not differ, and pilot panelists thought the video was not necessary for the standard-setting application.

Because the JQC definition is the narrative equivalent of the cutscore, if the two groups (face-to-face and virtual) worked from different JQC definitions, it is likely that their standard setting judgments and cutscore recommendations would differ. Therefore, to determine if the "venue" of the standard setting impacts judgments and outcomes, we needed to maintain comparable performance expectations, as defined by the JQC definition.
To better understand the JQC definition constructed by the face-to-face panel, the virtual panelists developed several performance indicators for each section of the JQC. The indicators reflected their understanding of the ways that test takers might exhibit the knowledge and skills expected of a Just Qualified Candidate. Research by Tannenbaum and Kannan (in press) suggests that two panels applying the same standard-setting method to the same test will likely recommend comparable cutscores, regardless of whether one panel constructed the JQC definition and the other simply fleshed out that same definition without changing its fundamental meaning.

The virtual standard setting study was spread over five 2.5-hour sessions, as described by Katz et al. (2009), to minimize fatigue. The activities of the virtual standard setting otherwise paralleled those of the face-to-face study (general introduction, discussion of the test, discussion of the JQC definition, training and practice, and three rounds of standard-setting judgments), as outlined in Table 1.

5.2.4 Standard Setting Judgments
For both the face-to-face and virtual panels, during Round 1 the facilitator guided the subpanel through a description of each item rubric, discussing the characteristics of responses (and processes, in some cases) that lead to full credit, partial credit, or no credit on each item. The face-to-face panelists worked with two binders: one containing the items and rubrics associated with each task, and the other containing benchmarks (exemplars) of full-credit responses to each task. The virtual panelists viewed the benchmarks and item rubrics on their computers (the benchmark and corresponding item rubric were shown simultaneously on a split screen).

For each task, the facilitator reminded panelists about the goals of the task and the portions of the JQC definition that were likely most relevant (the JQC definition comprised sections that corresponded to distinct task types in the assessment). The facilitator then walked the panelists through the benchmark for the task and the scoring rubrics for the first item of that task. After describing the item, the facilitator asked panelists to enter their standard setting rating for that item onto the scan form (face-to-face) or recording sheet (virtual). Panelists were asked to estimate the mean (average) score that 100 Just Qualified Candidates would earn on the item. This judgment task allowed the use of a rating scale similar to that of a typical modified Angoff approach, as used in the next study. The rating scale ranged from 0 to 1, in .05 increments. The next item within the task was then described. This continued for all of the items in the task. This process was followed for all 14 tasks. After finishing their ratings for all tasks, the scan forms of the face-to-face panels were collected and scanned. At the corresponding time in the virtual meeting, the panelists entered their recorded ratings into an online survey.
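The arithmetic that turns these item-level judgments into a panel recommendation is simple. The sketch below is our own minimal illustration, using hypothetical ratings rather than study data: each panelist's item judgments (the judged mean score, 0 to 1, that 100 Just Qualified Candidates would earn on an item) are summed across items, and the panel's recommended cutscore is typically the mean of those panelist-level sums, with the standard deviation across panelists describing judgment variability (the kind of statistics reported later in Table 6).

```python
# Illustration only: hypothetical ratings for 3 panelists on 5 of the 50 items.
# Each rating is the judged mean score (0 to 1, in .05 steps) that 100 Just
# Qualified Candidates would earn on the item.
ratings = {
    "Panelist 1": [0.80, 0.55, 0.70, 0.45, 0.90],
    "Panelist 2": [0.75, 0.60, 0.65, 0.50, 0.85],
    "Panelist 3": [0.70, 0.50, 0.75, 0.40, 0.95],
}

def panel_cutscore(ratings):
    """Sum each panelist's item judgments, then average across panelists."""
    sums = [sum(r) for r in ratings.values()]
    mean = sum(sums) / len(sums)
    # Sample standard deviation across panelists (judgment variability).
    sd = (sum((s - mean) ** 2 for s in sums) / (len(sums) - 1)) ** 0.5
    return mean, sd

cutscore, sd = panel_cutscore(ratings)
print(f"Recommended cutscore: {cutscore:.2f} (SD = {sd:.2f})")
```

With all 50 items included, the panelist sums (and hence the recommended cutscore) fall on the test's 0-50 raw-score scale.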
5.3 Results
As discussed earlier, Kane (1994) outlines three types of validity evidence for evaluating cutscores: procedural, internal, and external. As will be shown below, the three panels provided similar evidence of procedural and internal validity. However, the overall cutscore of the virtual group was somewhat lower than that of the two face-to-face groups.

5.3.1 Procedural Validity Evidence
Evidence that the standard setting methodology was carried out correctly typically comes from panelist assertions about the process and its outcomes (Cizek, Bunch, & Koons, 2004; Tannenbaum & Katz, 2013). All three groups gave similar ratings of their respective meetings and outcomes. All three panels reported understanding the standard setting process (Table 3), and everyone in all panels agreed that the consensus cutscore was "about right." Panelists in all groups reported that their standard setting judgments were influenced by similar study materials, such as the JQC definition, the between-round discussions, and the skills needed to solve each test item (Table 4). In addition, all three panels reported similarly positive evaluations of the meeting process, rating it as "understandable," "satisfying," and "fair," among other measures (Table 5). Thus, it appears that the panels had similar confidence in how the meeting was conducted; nothing in their opinions suggests that the implementation of the standard setting method during the face-to-face or virtual meetings was done incorrectly and so introduced error into the process.

Table 3. Overall evaluations ("Please indicate below the degree to which you agree with each of the following statements.")
| Statement | Face-to-Face (SA / A / D / SD) | Virtual (SA / A / D / SD) |
| I understood the purpose of this study. | 20 (87%) / 3 (13%) / 0 / 0 | 8 (80%) / 2 (20%) / 0 / 0 |
| The instructions and explanations provided by the facilitators were clear. | 21 (91%) / 2 (9%) / 0 / 0 | 9 (90%) / 1 (10%) / 0 / 0 |
| The training in the standard setting methods was adequate to give me the information I needed to complete my assignment. | 19 (83%) / 4 (17%) / 0 / 0 | 6 (60%) / 4 (40%) / 0 / 0 |
| The explanation of how the recommended cutscore is computed was clear. | 20 (87%) / 3 (13%) / 0 / 0 | 8 (80%) / 2 (20%) / 0 / 0 |
| The opportunity for feedback and discussion between rounds was helpful. | 22 (96%) / 1 (4%) / 0 / 0 | 8 (80%) / 2 (20%) / 0 / 0 |
| The inclusion of the item data was helpful. | 20 (87%) / 2 (9%) / 1 (4%) / 0 | 9 (90%) / 1 (10%) / 0 / 0 |
| The inclusion of the classification percentages was helpful. | 15 (65%) / 7 (30%) / 1 (4%) / 0 | 9 (90%) / 1 (10%) / 0 / 0 |
| The process of making the standard setting judgments was easy to follow. | 16 (70%) / 7 (30%) / 0 / 0 | 6 (60%) / 4 (40%) / 0 / 0 |
Note. SA = Strongly Agree, A = Agree, D = Disagree, SD = Strongly Disagree.

Table 4. Influence of study materials on judgments ("How influential was each of the following factors in guiding your standard setting judgments?")
| Factor | Face-to-Face (VI / I / NI / NR) | Virtual (VI / I / NI / NR) |
| The definition of the Just Qualified Candidate | 21 (91%) / 2 (9%) / 0 / 0 | 10 (100%) / 0 / 0 / 0 |
| The between-round discussions | 18 (78%) / 4 (17%) / 1 (4%) / 0 | 8 (80%) / 2 (20%) / 0 / 0 |
| The knowledge/skills required to answer each test question | 17 (74%) / 6 (26%) / 0 / 0 | 9 (90%) / 1 (10%) / 0 / 0 |
| The cutscores of other panel members | 2 (9%) / 16 / 5 (22%) / 0 | 0 / 7 (70%) / 3 (30%) / 0 |
| The item-level data | 11 (48%) / 11 (48%) / 1 (4%) / 0 | 4 (40%) / 6 (60%) / 0 / 0 |
| The classification percentages | 5 (22%) / 15 (65%) / 2 (9%) / 1 (4%) | 2 (20%) / 8 (80%) / 0 / 0 |
| My own professional experience | 11 (48%) / 12 (52%) / 0 / 0 | 8 (80%) / 2 (20%) / 0 / 0 |
Note. VI = Very Influential, I = Influential, NI = Not Influential, NR = No Response.

Table 5. Evaluation of meeting process ("How would you describe the meeting process? Please rate the meeting on each of the five scales shown below.")
| Scale | Face-to-Face (1 / 2 / 3 / 4 / 5) | Virtual (1 / 2 / 3 / 4 / 5) |
| Inefficient (1) – Efficient (5) | 0 / 0 / 1 (4%) / 2 (9%) / 20 (87%) | 0 / 0 / 0 / 3 (30%) / 7 (70%) |
| Uncoordinated (1) – Coordinated (5) | 0 / 0 / 1 (4%) / 1 (4%) / 21 (91%) | 0 / 0 / 0 / 1 (10%) / 9 (90%) |
| Unfair (1) – Fair (5) | 0 / 0 / 0 / 2 (9%) / 21 (91%) | 0 / 0 / 0 / 0 / 10 (100%) |
| Confusing (1) – Understandable (5) | 0 / 0 / 0 / 4 (17%) / 19 (83%) | 0 / 0 / 1 (10%) / 2 (20%) / 7 (70%) |
| Dissatisfying (1) – Satisfying (5) | 0 / 0 / 0 / 2 (9%) / 21 (91%) | 0 / 0 / 1 (10%) / 3 (30%) / 6 (60%) |

5.3.2 Internal Validity Evidence
Were panelists consistent in their ratings, suggesting a similar understanding of the difficulty of the test tasks and items, the JQC, and the standard setting methodology?
While some amount of professional disagreement is expected, we generally expect the standard deviation of the ratings to decrease from Round 1 to Round 3 owing to the calibrating effect of group discussions. Table 6 shows the Round 1-3 ratings of each panel, including the panel means and standard deviations. As expected, all panels showed greater consistency (lower SDs) as the studies proceeded, although the virtual panel tended to agree more throughout the rounds compared to the face-to-face panels, having lower SDs throughout the process.

Table 6. Mean cutscores (SDs) across rounds (maximum of 50 points)
| Panel | Round 1 | Round 2 | Round 3 |
| Group A | 33.8 (3.8) | 33.4 (3.4) | 33.4 (3.4) |
| Group B | 32.6 (3.7) | 32.3 (3.1) | 32.3 (3.0) |
| Virtual | 29.8 (3.2) | 29.2 (2.1) | 29.1 (2.0) |

In addition, the three panels were similarly consistent in the relative rankings of item difficulties for the JQC. Some researchers (e.g., Clauser, Mee, Baldwin, Margolis, & Dillon, 2009; Kane, 1994) have used correlation with actual item difficulties to demonstrate the quality of standard-setting ratings. While the absolute difficulty level of an item for the JQC would differ from that of the overall population (as the population may be more heterogeneous in its digital literacy skills than the JQC), the relative difficulty of the items should be fairly consistent. Table 7 shows the intercorrelations of the item ratings (Round 3) among the three panels, along with the empirical item difficulty data. Again, all three panels showed similar levels of consistency both with each other and with the observed item difficulties. These internal validity results again suggest that the three panels implemented the standard setting methodology appropriately and consistently throughout the standard setting meetings.

Table 7. Intercorrelations of item-level ratings
| | Item data | Group A | Group B | Virtual |
| Item data | – | .82 | .80 | .80 |
| Group A | | – | .80 | .79 |
| Group B | | | – | .90 |
| Virtual | | | | – |
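A check like the one summarized in Table 7 is straightforward to reproduce: correlate one vector of item-level values (for example, a panel's Round 3 item ratings) with another (the empirical item difficulties, or another panel's ratings). The sketch below shows a generic Pearson correlation on hypothetical numbers; it is our illustration of the kind of computation involved, not the authors' analysis code, and the article does not state which correlation coefficient was used.

```python
# Illustration only: hypothetical Round 3 item ratings for one panel and
# empirical item difficulties (proportion correct) for the same items.
panel_ratings = [0.80, 0.55, 0.70, 0.45, 0.90, 0.60]
item_difficulty = [0.85, 0.50, 0.75, 0.40, 0.88, 0.65]

def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    sd_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (sd_x * sd_y)

print(f"r(panel ratings, item difficulty) = {pearson(panel_ratings, item_difficulty):.2f}")
```

The same function applied to two panels' rating vectors yields the panel-to-panel entries of Table 7.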
5.3.3 External Validity Evidence
The next question is the extent to which the face-to-face and virtual standard setting meetings resulted in similar outcomes. While we have evidence that the standard setting methods were implemented by the facilitators and the panelists consistently, there is still the open question of whether the three panels reached similar results. Did they end up with similar cutscores? Does the cutscore methodology replicate to a new panel, and does the "venue" matter? Table 6 (reported earlier) shows the recommended cutscores across the three rounds and the three panels. The between-round discussions appeared to have the same effect on all of the groups, which was to reduce the cutscore slightly (i.e., no round x group interaction). However, right from the start, the virtual panel recommended a lower cutscore (although significantly lower only than Group A in Round 1), while the two face-to-face groups came to very similar cutscore recommendations (Round 3; within 1 point on a 50-point scale). The virtual recommendation reflects a 12% decrease from that of face-to-face Group A and a 9% decrease from Group B.

5.4 Study 1: Discussion
Several aspects of the face-to-face vs. virtual environments might have led to the differences between Group A and the virtual panel. For example, although the face-to-face panels did all of the judgments separately from each other, the full panel worked as a whole during a large portion of the first day to reach consensus on the JQC definition. The virtual panel worked from this definition, but independently of the other panels. Although the intent was to minimize a construct shift in the JQC definition between panels, the two face-to-face panels, which worked together on augmenting the JQC definition, may well have internalized it somewhat differently than the virtual panel did. The difference in judgment variability between the face-to-face and virtual panels supports this hypothesis. Other possible sources of difference between the face-to-face and virtual panels include:
• Differences in the background of panelists. As described earlier, the experience and demographics of the virtual panel differed somewhat from those of the two face-to-face panels. Because panelists bring to the judgment process their own experiences, the differences in panel make-up cannot be ruled out as a factor in the observed cutscore differences between the face-to-face and virtual panels.
• Facilitator and rubric interaction effect. During Round 1 of the face-to-face panels, panelists had physical access to each rubric that was then explained by the facilitator. For the virtual standard setting, because the panelists did not have a physical copy of the rubrics (they were presented online), the facilitator spent somewhat more time explaining the rubrics to the panelists. This may have inadvertently reinforced the stringency of the rubrics applied to the items, leading to a lower performance expectation for the JQC.
• Transcription effects. In the virtual panel, panelists recorded their ratings on paper, along with their judgment rationales, and later transcribed the ratings onto a web-based survey. This was done to facilitate between-round item-level discussions. Because the face-to-face panelists had physical copies of the tasks and rubrics, they were not explicitly asked to record their rating rationales. This is typical in face-to-face standard setting, as it is believed that having the tasks and rubrics in front of panelists as they discuss their ratings is sufficient to jog their memories. Asking people to provide rationales for their decisions has been shown to alter decision-making processes (Wilson & Schooler, 1991). Thus, explicitly providing the virtual panel with a physical recording form (the form was emailed to them) and asking them to write down their rationales may have introduced a source of variance to their judgments.
• Environment effects. A variety of factors due to the virtual environment might have led to the lower cutscore.
In the face-to-face situation, panelists had printouts of full-credit responses (screen snapshots) and of the rubrics, both of which they could refer to freely during the item-level discussions of Round 1 and during the ratings of subsequent rounds. However, the ability to view material in the virtual environment was somewhat limited, with the facilitator controlling the presentation of the rubrics and of the full-credit responses, the latter of which often spanned multiple screen snapshots. Because the virtual panelists could not freely look among the materials, they might have considered some of the tasks more difficult because they did not remember all of the information presented to candidates. Face-to-face panelists often remarked that candidates should be able to figure something out because of a key sentence in the task description. Without the task description readily available, the virtual panelists might have thought the tasks were more difficult because they were not considering everything that was visible to the candidates; they had to rely more on their memory of the task descriptions.

6. Study 2: French

6.1 About the Test
The French test is one component of the certification process for beginning K-12 French teachers. The test consists of four sections: Listening, Reading, Writing, and Speaking. The Listening and Speaking sections contain audio stimuli in addition to written text. The Listening and Reading sections contain only multiple-choice questions, while the Writing and Speaking sections contain constructed-response questions, each question scored via a rubric. The maximum number of raw points available on the form of the French test included in the face-to-face and virtual studies is 97.

The structural differences between the digital literacy and French tests led us to apply somewhat different standard-setting judgments to the questions, although still Angoff-based. For example, the scoring of the constructed-response questions on the French test is different from that applied to the digital literacy test. The questions are scored by independent raters using 3-point rubrics; the sum of the ratings is the question score. The functional score scale for a question is therefore 0 through 6. The Angoff-based standard-setting judgment applied to these questions, both for the face-to-face and virtual panelists, was to decide on the score (0 through 6) that a JQC would likely earn. For the multiple-choice questions on the French test, the standard-setting judgment was to decide on the probability that a JQC would answer the question correctly. The rating scale ranged from 0 through 1. In both instances of standard-setting judgments (Constructed Response [CR] or Multiple Choice [MC]), the emphasis was on a single JQC, rather than on a group of 100 JQCs, as was done for the digital literacy test. As noted earlier, the focus on 100 JQCs allowed the use of the same 0-1 rating scale for the all-CR digital literacy test, which eases comparisons between the studies.
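To make the difference in judgment scales concrete, the short sketch below shows how the two French-test question types would combine into a single expected raw score for one panelist. The judgments shown are hypothetical, and the assumption that each multiple-choice question contributes one raw point is ours; summing expected points over every question on the form places the judgment on the test's 0-97 raw-score scale.

```python
# Illustration only: hypothetical judgments from a single panelist.
# Multiple-choice judgment: probability (0-1) that a JQC answers correctly,
# assuming each multiple-choice question is worth one raw point.
mc_judgments = [0.80, 0.65, 0.55, 0.90]
# Constructed-response judgment: the summed rater score (0-6) a JQC would
# likely earn on the question.
cr_judgments = [4, 5, 3]

# The panelist's implied cutscore is the total expected raw score; applied to
# every question on the form, this lands on the 0-97 raw-score scale.
panelist_cutscore = sum(mc_judgments) + sum(cr_judgments)
print(f"Expected JQC raw score (these questions only): {panelist_cutscore:.2f}")
```

Averaging such sums across panelists, as in Study 1, yields the panel-level cutscores reported later in Table 12.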
Another difference between the two standard-setting applications is that there were no empirical item difficulty data available for the French test, as it had not yet been administered. A further distinction between the digital literacy standard setting and this one is that the first face-to-face French panel constructed the JQC definition. This definition was then shared with the second face-to-face panel for augmentation; the two panels did not collaborate on the JQC definition. The JQC definition from the first face-to-face panel was similarly shared with the virtual panel for it to augment independently. In this regard, the second face-to-face panel and the virtual panel experienced this implementation feature in common. Two rounds of standard-setting judgments occurred for the French test, given the absence of empirical item difficulty data.

For these panels, test familiarization included taking the test (including all auditory stimuli) during the first part of the meeting. For the virtual panel, the panelists took the sections of the test that did not include any auditory stimuli (the Reading and Writing sections) on their own. During the first part of the virtual meeting, the panelists took the Listening and Speaking portions. The audio for these sections was played over the speakerphone being used by the facilitators. All panelists reported good quality of the sound over the speakerphone (at first, the audio was played through Skype, which did not work well).

6.2 Method

6.2.1 Panelists
All panelists were recruited from the same master list of recommendations from the Boards of Education within each participating state. There were two face-to-face panels, each meeting independently of one another. The virtual panel also met independently of the other two. The full demographic and background data for the panels are shown in Table 8. Panelists were assigned to panels randomly, although attempts were made to create panels with similar background characteristics. The general characteristics of all three panels were similar. Panel 1 (face-to-face) included 23 teachers, administrators, and college faculty who prepare K-12 French teachers, representing 18 states. Panel 2 (face-to-face) included 24 teachers, administrators, and college faculty, representing the same 18 states. The virtual panel included 7 educators: 4 teachers and 3 faculty members. While we attempted to recruit a similar number of panelists for the virtual panel, a smaller number resulted because of scheduling conflicts, so while a virtual meeting may increase access to participation, it does not solve issues of competing demands.

Table 8. Characteristics of panelists
| Characteristic | Panel 1 (n = 23) | Panel 2 (n = 24) | Virtual (n = 7) |
| Group you are representing | | | |
| – Teachers | 15 (65%) | 19 (79%) | 3 (43%) |
| – Administrator/Department Head | 2 (9%) | 2 (8%) | 0 (0%) |
| – College Faculty | 5 (22%) | 2 (8%) | 3 (43%) |
| – Other | 1 (4%) | 1 (4%) | 1 (14%) |
| Race | | | |
| – African American or Black | 3 (13%) | 3 (13%) | 1 (14%) |
| – Alaskan Native or American Indian | 1 (4%) | 0 (0%) | 0 (0%) |
| – Asian or Asian American | 0 (0%) | 1 (4%) | 1 (14%) |
| – Native Hawaiian or Other Pacific Islander | 0 (0%) | 0 (0%) | 0 (0%) |
| – White | 19 (83%) | 19 (79%) | 5 (71%) |
| – Hispanic | 0 (0%) | 0 (0%) | 0 (0%) |
| Gender | | | |
| – Female | 17 (74%) | 18 (75%) | 7 (100%) |
| – Male | 6 (26%) | 6 (25%) | 0 (0%) |
| In which language are you most fluent? | | | |
| – English | 14 (61%) | 19 (79%) | 4 (57%) |
| – French | 1 (4%) | 3 (13%) | 1 (14%) |
| – English and French about the same | 7 (30%) | 2 (8%) | 2 (29%) |
| – Other | 1 (4%) | 0 (0%) | 0 (0%) |
| Are you certified as a French teacher in your state? | | | |
| – No | 4 (17%) | 4 (17%) | 3 (43%) |
| – Yes | 19 (83%) | 20 (83%) | 4 (57%) |
| Are you currently teaching French in your state? | | | |
| – No | 2 (9%) | 2 (8%) | 0 (0%) |
| – Yes | 21 (91%) | 22 (92%) | 7 (100%) |
| Are you currently mentoring another French teacher? | | | |
| – No | 16 (70%) | 17 (71%) | 4 (57%) |
| – Yes | 7 (30%) | 7 (29%) | 3 (43%) |
| How many years of experience do you have as a French teacher in your state? | | | |
| – 3 years or less | 1 (4%) | 1 (4%) | 0 (0%) |
| – 4-7 years | 4 (17%) | 5 (21%) | 2 (29%) |
| – 8-11 years | 7 (30%) | 4 (17%) | 1 (14%) |
| – 12-15 years | 3 (13%) | 2 (8%) | 0 (0%) |
| – 16 years or more | 8 (35%) | 11 (46%) | 4 (57%) |
| For which education level are you currently teaching French? | | | |
| – Elementary (K-5 or K-6) | 2 (9%) | 0 (0%) | 0 (0%) |
| – Middle School (6-8 or 7-9) | 1 (4%) | 1 (4%) | 0 (0%) |
| – High School (9-12 or 10-12) | 11 (48%) | 18 (75%) | 3 (43%) |
| – Middle/High School | 2 (9%) | 0 (0%) | 0 (0%) |
| – All Grades (K-12) | 0 (0%) | 1 (4%) | 1 (14%) |
| – Higher Education | 6 (26%) | 4 (17%) | 3 (43%) |
| – Other | 1 (4%) | 0 (0%) | 0 (0%) |
| School Setting | | | |
| – Urban | 10 (43%) | 9 (38%) | 2 (29%) |
| – Suburban | 6 (26%) | 9 (38%) | 3 (43%) |
| – Rural | 7 (30%) | 6 (25%) | 2 (29%) |

6.2.2 Procedure
The two face-to-face panels and the virtual panel engaged in similar activities. Following a brief introduction to standard setting and an overview of the test, the panelists took the entire test and self-scored their responses. Next, they were introduced to the standard-setting method for the multiple-choice questions and practiced making standard setting judgments on the first set of 6 items from the Listening section. After discussion of the ratings, the panelists completed their ratings for the remainder of the Listening section items (with each Listening audio passage played before panelists rated the associated 6 items) and then completed their judgments for the Reading section (Round 1, discussion, Round 2).
Panelists then trained and practiced on the standard-setting method for constructed-response questions (Writing and Speaking sections), and completed the judgments for all questions (Round 1, discussion, Round 2).

6.3 Study 2: Results and Discussion

6.3.1 Procedural Validity Evidence
Tables 9-11 show the panelists' evaluations of the meeting process and its outcomes. Panelists generally "agreed" or "strongly agreed" that the study was conducted appropriately and clearly (Table 9). The factors influencing judgments were consistent across all three panels, with the cutscores of other panelists being the least influential
factor (Table 10). Finally, panelists had generally good opinions of the meeting process overall, although the reactions were somewhat more positive in the virtual panel than in the first face-to-face panel (these data were unavailable for Panel 2; Table 11). Overall, there is nothing in the evaluation data suggesting that the virtual and face-to-face panels were implemented differently.

Table 9. Overall evaluations ("Please indicate below the degree to which you agree with each of the following statements.")
| Statement | Panel 1 (SA / A / D / SD) | Panel 2 (SA / A / D / SD) | Virtual (SA / A / D / SD) |
| I understood the purpose of this study. | 21 (91%) / 2 (9%) / 0 / 0 | 23 (96%) / 1 (4%) / 0 / 0 | 6 (86%) / 1 (14%) / 0 / 0 |
| The instructions and explanations provided by the facilitators were clear. | 18 (78%) / 5 (22%) / 0 / 0 | 23 (96%) / 1 (4%) / 0 / 0 | 6 (86%) / 1 (14%) / 0 / 0 |
| The training in the standard setting methods was adequate to give me the information I needed to complete my assignment. | 18 (78%) / 5 (22%) / 0 / 0 | 21 (88%) / 3 (13%) / 0 / 0 | 6 (86%) / 1 (14%) / 0 / 0 |
| The explanation of how the recommended cutscore is computed was clear. | 21 (91%) / 2 (9%) / 0 / 0 | 19 (79%) / 5 (21%) / 0 / 0 | 5 (71%) / 2 (29%) / 0 / 0 |
| The opportunity for feedback and discussion between rounds was helpful. | 15 (65%) / 6 (26%) / 2 (9%) / 0 | 22 (92%) / 2 (8%) / 0 / 0 | 7 (100%) / 0 / 0 / 0 |
| The process of making the standard setting judgments was easy to follow. | 15 (65%) / 8 (35%) / 0 / 0 | 21 (88%) / 3 (13%) / 0 / 0 | 4 (57%) / 3 (43%) / 0 / 0 |
Note. SA = Strongly Agree, A = Agree, D = Disagree, SD = Strongly Disagree.

Table 10. Influence of study materials on judgments ("How influential was each of the following factors in guiding your standard setting judgments?")
| Factor | Panel 1 (VI / I / NI) | Panel 2 (VI / I / NI) | Virtual (VI / I / NI) |
| The definition of the Just Qualified Candidate | 20 (87%) / 2 (9%) / 1 (4%) | 19 (79%) / 5 (21%) / 0 | 6 (86%) / 1 (14%) / 0 |
| The between-round discussions | 10 (43%) / 12 (52%) / 1 (4%) | 15 (63%) / 9 (38%) / 0 | 5 (71%) / 2 (29%) / 0 |
| The knowledge/skills required to answer each test question | 19 (83%) / 4 (17%) / 0 | * | 6 (86%) / 1 (14%) / 0 |
| The cutscores of other panel members | 2 (9%) / 18 (78%) / 3 (13%) | 2 (8%) / 16 (67%) / 6 (25%) | 2 (29%) / 4 (57%) / 1 (14%) |
| My own professional experience | 18 (78%) / 5 (22%) / 0 | 13 (54%) / 11 (46%) / 0 | 7 (100%) / 0 / 0 |
Note. VI = Very Influential, I = Influential, NI = Not Influential. * Because of an error in study materials, this question was not asked of Panel 2.

Table 11. Evaluation of meeting process ("How would you describe the meeting process? Please rate the meeting on each of the five scales shown below.")*
| Scale | Panel 1 (1 / 2 / 3 / 4 / 5) | Virtual (1 / 2 / 3 / 4 / 5) |
| Inefficient (1) – Efficient (5) | 0 / 1 (4%) / 1 (4%) / 7 (30%) / 14 (61%) | 0 / 0 / 1 (14%) / 2 (29%) / 4 (57%) |
| Uncoordinated (1) – Coordinated (5) | 0 / 0 / 1 (4%) / 6 (26%) / 16 (70%) | 0 / 0 / 0 / 2 (29%) / 5 (71%) |
| Unfair (1) – Fair (5) | 0 / 0 / 0 / 3 (13%) / 20 (87%) | 0 / 0 / 0 / 1 (14%) / 6 (86%) |
| Confusing (1) – Understandable (5) | 0 / 0 / 1 (4%) / 5 (22%) / 17 (74%) | 0 / 0 / 0 / 3 (43%) / 4 (57%) |
| Dissatisfying (1) – Satisfying (5) | 0 / 1 (4%) / 1 (4%) / 2 (9%) / 19 (83%) | 0 / 0 / 0 / 3 (43%) / 4 (57%) |
* Because of an error in materials, Panel 2 did not receive this question.

6.3.2 Internal Validity Evidence
Table 12 shows the section item ratings and the total cutscore across rounds for the three panels. These results suggest some differences in the virtual panelists' judgments in the multiple-choice vs. constructed-response portions of the test. For the face-to-face panels, we observed the expected decrease in standard deviations of ratings between Rounds 1 and 2 for all test sections. However, the virtual panelists' ratings showed this expected effect only for the Reading and Listening sections (multiple-choice items); for the Writing and Speaking sections (constructed-response items), the standard deviations of the virtual panel's ratings remained the same, which contributed to an increase in the variability of the overall cutscore. Furthermore, the variability in the Writing and Speaking sections is approximately twice that of the face-to-face panels. These results may suggest some differential performance of the virtual panel depending on test section, although that difference might be due to the language skill tested (e.g., receptive vs. productive), the item type (multiple-choice or constructed-response), the judgment question posed ("likelihood of the JQC answering correctly" vs. "likely score earned by the JQC"), or other factors. Additional studies would need to be conducted to ascertain whether this is a systematic effect.

6.3.3 External Validity Evidence
Table 12 shows the two rounds of ratings for the three French panels; discussion will focus on the last line of the table, the overall cutscores recommended by each panel.
Table 12. Mean (SD) cutscores for each round, by test section, for the three panels
| Test section (max raw score) | Panel 1: Round 1 / Round 2 | Panel 2: Round 1 / Round 2 | Virtual: Round 1 / Round 2 |
| Listening (30) | 17.6 (2.2) / 17.2 (1.9) | 18.3 (2.5) / 18.1 (2.0) | 19.7 (1.8) / 20.1 (1.5) |
| Reading (31) | 21.5 (2.9) / 21.5 (2.4) | 22.8 (2.6) / 23.1 (2.3) | 25.5 (2.2) / 25.6 (2.1) |
| Writing (18) | 9.8 (1.3) / 10.3 (1.1) | 12.0 (1.4) / 12.7 (1.1) | 11.3 (3.0) / 11.3 (3.0) |
| Speaking (18) | 9.5 (2.4) / 9.6 (2.0) | 11.5 (1.6) / 12.0 (1.1) | 11.6 (3.2) / 11.6 (3.2) |
| Total (97) | 58.4 (5.3) / 58.5 (4.6) | 64.7 (6.0) / 65.8 (4.7) | 68.0 (7.2) / 68.6 (7.8) |

The virtual panel's full-test cutscores were slightly higher than those of the second face-to-face panel (approximately 3 raw points) and considerably higher than those of the first face-to-face panel (approximately 10 raw points). Nonetheless, the difference between face-to-face Panel 1 and the virtual panel paralleled the difference in the cutscores between the two face-to-face panels, which was approximately 7 raw points. It is unclear why the cutscore from Panel 1 diverged from the other cutscores. But it is noteworthy that the cutscores from Panel 2 and the virtual panel were consistent with cutscores recommended for two other language tests built for the same testing program, German and Spanish. The specifications for these two tests are the same as for the French test, and the test format and structure are the same for all three tests. For each of these two tests, two independent face-to-face panels were assembled, and the same Angoff-based process applied to the French test was applied. For the two panels, the cutscores were 65.7 and 62.1 for German, and 65.5 and 68.0 for Spanish. This additional information suggests that the results of Panel 1 are discrepant. Importantly, these results demonstrate that a web-based result differs from a face-to-face result no more than two face-to-face results might differ from each other.

7. General Discussion
In previous research (Katz et al., 2009), we demonstrated that it is feasible to conduct standard setting with a virtual panel and to obtain similar positive evidence of procedural validity as obtained in traditional face-to-face studies. In this study, we set out to determine whether the outcomes of standard setting studies on the same tests conducted virtually and face-to-face would be comparable. Overall, the results suggest that the absolute recommended cutscores are not very different, the most frequent difference being 3-4 points between any face-to-face and virtual panel. The one exception was the first face-to-face French panel (Study 2), which was 7 and 10 raw points lower than the second face-to-face panel and the virtual panel, respectively. However, those larger differences do not seem to be attributable to the standard setting venue (face-to-face or virtual).

Although the virtual and face-to-face cutscores from the French panels (Panel 1 notwithstanding) and from the digital literacy panels seem reasonably close, the direction of the relationship was different. For the French test, the virtual cutscore was greater than the face-to-face cutscores, but for the digital literacy test, the virtual cutscore was less than the face-to-face cutscores. We cannot conclude, as a consequence, whether one should expect a virtually determined cutscore to under-predict or over-predict what would occur in a face-to-face venue.
Further, the impact of a 3-4 point discrepancy needs to be considered in light of the range of available points and the distribution of test scores. On the 50-point scale for the digital literacy test, such a discrepancy is a meaningful difference in terms of the percentage of points that needs to be earned to be considered just qualified. The cutscores for the two face-to-face panels translate into 67% and 65%, respectively, whereas the cutscore for the virtual panel translates to 58%. On the 97-point scale for the French test, the percentage differences are less pronounced: 67% for Panel 2 and 71% for the virtual panel. Performance data were available only for the digital literacy test. On that test, the face-to-face and virtual cutscores were near the center of the score distribution, and so the 3-4 point difference would lead to large differences in the percentage of test takers classified as just qualified: approximately 50% for the face-to-face cutscores and 70% for the virtual cutscore.

8. Conclusions
Overall, a virtual standard setting approach appears to be both viable and appropriate. The differences between the absolute cutscores derived from the face-to-face and virtual panels were not so large as to nullify our opening remarks. However, given that much of the validity evidence pertaining to standard setting remains procedural, more focused attention on understanding the sources of variance in a virtual approach would seem prudent.

The use of virtual meetings to support standard setting, the focus of this study, clearly has application to other aspects of the assessment process (see, e.g., Schnipke & Becker, 2007). Item writing and reviewing workshops may be conducted virtually, as can fairness reviews. The nature of the training and data collection needed to support these other distance-based assessment practices may be somewhat different from a more traditional face-to-face process, but there is nothing inherent in the distance-based process that would preclude these and other applications. Tannenbaum (2011), for example, applied a distance-based approach to evaluate the judged alignment between a test of English language skills and a language framework of English proficiency.

9. Recommendations
The use of and reliance on remote, or distance-based, technologies will continue to impact assessment-related practices. Reflecting on our experiences reported here and on our previous research in this area (Katz et al., 2009) enables us to offer recommendations for practitioners. These recommendations, although inspired by the virtual standard-setting approach outlined in this paper, should be applicable to other distance-based, assessment-related practices.

First, hold a pre-meeting "check-in" session. The session should focus specifically on verifying that all panelists are able to access the web-based sources of information relevant to the study. We constructed sample web pages of basic information and rating scales for the panelists to navigate and use to uncover potential issues, and we went over some of the basic features of Microsoft Live Meeting, such as the location of the icon for the microphone, how to use "flags" to signal when they completed a task, and how to send messages to the facilitators.
The check-in session proved invaluable, both for addressing minor glitches before the actual standard-setting process and for alleviating the concerns of panelists who were not especially "technology savvy."

Second, use technology to support test security. Test security is always an issue when conducting standard setting, whether face-to-face or virtual, but there is a greater threat of inadvertent security breaches when sharing electronic files of test materials. We addressed this issue by using an online document-securing service; at the time this service was free, but several commercial solutions are currently available, including security features of Microsoft Office that provide similar functionality. The service we used allowed the creation of encrypted files, accessible only through unique user IDs and passwords for each panelist; it also allowed tracking of file access and the ability to set an expiry date for a file (by disabling user IDs). The files also did not allow printing or copying. While these security measures would not stop an intentional security breach, they are similar to measures used when paper copies of secure materials are distributed. As would occur with a face-to-face study, all panelists signed non-disclosure agreements before having access to any secure materials.

Third, pay attention to panelist engagement. Panelist engagement and participation present more of an issue in virtual standard setting than in face-to-face studies because of the greater variety of non-meeting-related distractions. During group discussions, we maintained an informal log of which panelists were speaking more frequently than others and invited the less participative panelists into the discussion. For example, we would pose questions to these panelists about their reactions to what was being discussed, or ask them to be the first to offer the rationales for their standard-setting judgments, that is, to start a new chain of discussion. We also requested that each panelist state his or her name before speaking, so that the other panelists would begin to associate voices with names and, thus, take the initiative in drawing others into the discussion.

Fourth, balance the work panelists do online versus at their desks. We learned that the panelists had some difficulty referring to the JQC definition online while also entering their standard-setting judgments online, due to screen real-estate issues. We therefore encouraged panelists to print the JQC definition so that a physical copy was available, which, according to the feedback we received, helped greatly. Regarding the online rating scale, although panelists had no technical difficulty entering ratings, they stated during the between-round item discussions that they did not readily recall the reasons for their item judgments, which made discussion more challenging. We addressed this issue by emailing the panelists a rating form to print out. The form contained the item numbers (not the actual items), a space to enter each rating, and a space to make brief notes about the reasons for each item rating. Once they made their ratings on the form, we gave them time to enter their ratings online, and they referred to their written notes about the items during the between-round discussion. The informal feedback we received supported the effectiveness of this simple solution.
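As a concrete illustration of this fourth recommendation, the sketch below generates a simple printable rating form of the kind described above. The file name, the column layout, and the assumption that judgments are recorded as the percentage of just-qualified candidates expected to answer each item correctly are our own illustrative choices, not the actual form or tooling used with the panels.

```python
import csv

def write_rating_form(item_numbers, path):
    """Create a printable form: one row per item number, with blank
    columns for the Angoff judgment and brief notes on the rationale."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Item",
                         "Judgment (% of JQCs expected to answer correctly)",
                         "Notes on rationale"])
        for item in item_numbers:
            writer.writerow([item, "", ""])

# Example: a form for a hypothetical 30-item section.
write_rating_form(range(1, 31), "round1_rating_form.csv")
```

The design intent mirrors the recommendation itself: the printed form sidesteps the screen real-estate problem and preserves each panelist's rationale notes for the between-round discussion, while the official ratings are still entered online.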
An overall theme of our recommendations is that technique trumps technology: the success of the approach owed less to the specific technology used to implement it than to careful planning to ensure the ease and comfort of panelists, together with meeting-facilitation techniques adapted to remote meetings. Note that we intentionally relied on off-the-shelf technology; we did not set out to build any software or platforms specifically to support the standard setting. We are confident that other readily available technologies that facilitate distance-based meetings would work as effectively. Naturally, tailored technologies would likely make the standard-setting process easier by offering, for example, more intuitive interfaces, but this benefit needs to be balanced against the added cost and expertise required to build and maintain new systems.

10. References

1. American Educational Research Association, American Psychological Association, & National Council on Measurement in Education. (1999). Standards for educational and psychological testing. Washington, DC.
2. Buckendahl, C. W., Smith, R. W., Impara, J. C., & Plake, B. S. (2002). A comparison of Angoff and Bookmark standard setting methods. Journal of Educational Measurement, 39, 253–263.
3. Cizek, G. J., Bunch, M. B., & Koons, H. (2004). Setting performance standards: Contemporary methods. Educational Measurement: Issues and Practice, 23, 31–50.
4. Clauser, B. E., Mee, J., Baldwin, S. G., Margolis, M. J., & Dillon, G. F. (2009). Judges' use of examinee performance data in an Angoff standard-setting exercise for a medical licensing examination: An experimental study. Journal of Educational Measurement, 46, 390–407.
5. Davis, S. L., Buckendahl, C. W., Chin, T. Y., & Gerrow, J. (2008, March). Comparing the Angoff and Bookmark methods for an international licensure examination. Paper presented at the annual meeting of the National Council on Measurement in Education, New York.
6. Espinosa, J. A., Slaughter, S. A., Kraut, R. E., & Herbsleb, J. D. (2007). Familiarity, complexity, and team performance in geographically distributed software development. Organization Science, 18, 613–630.
7. Harvey, A. L., & Way, W. D. (1999, April). A comparison of web-based standard setting and monitored standard setting. Paper presented at the annual conference of the National Council on Measurement in Education, Montreal, Canada.
8. Harvey, A. L. (2000, April). Comparing onsite and online standard setting methods for multiple levels of standards. Paper presented at the annual conference of the National Council on Measurement in Education, New Orleans, LA.
9. Kane, M. (1994). Validating the performance standards associated with passing scores. Review of Educational Research, 64, 425–461.
10. Katz, I. R., Tannenbaum, R. J., & Kannan, P. (2009). Virtual standard setting. CLEAR Exam Review, 20(2), 19–27.
11. Lorié, W. (2011, June). Setting standards remotely: Conditions for success. Paper presented at the CCSSO National Conference on Student Assessment, Orlando, FL.
12. Nichols, P., Twing, J., Mueller, C. D., & O'Malley, K. (2010). Standard-setting methods as measurement processes. Educational Measurement: Issues and Practice, 29, 14–24.
13. Olsen, J. B., & Smith, R. (2008, March). Cross validating modified Angoff and Bookmark standard setting for a home inspection certification. Paper presented at the annual meeting of the National Council on Measurement in Education, New York.
14. Olson, G. M., & Olson, J. S. (2000). Distance matters. Human-Computer Interaction, 15, 139–178.
15. Reckase, M. D. (2006). A conceptual framework for a psychometric theory for standard setting with examples of its use for evaluating the functioning of two standard setting methods. Educational Measurement: Issues and Practice, 25, 4–18.
16. Schnipke, D. L., & Becker, K. A. (2007). Making the test development process more efficient using web-based virtual meetings. CLEAR Exam Review, 18, 13–17.
17. Tannenbaum, R. J., & Katz, I. R. (2008). Setting standards on the core and advanced iSkills™ assessments (ETS Research Memorandum No. RM-08-04). Princeton, NJ: Educational Testing Service.
18. Tannenbaum, R. J. (2011). Alignment between the TOEIC® test and the Canadian Language Benchmarks. Final report. Princeton, NJ: ETS.
19. Tannenbaum, R. J., & Kannan, P. (in press). Consistency of Angoff-based standard-setting judgments: Are item judgments and passing scores replicable across different panels of experts? Educational Assessment.
20. Tannenbaum, R. J., & Katz, I. R. (2013). Standard setting. In K. F. Geisinger (Ed.), APA handbook of testing and assessment in psychology: Vol. 3. Testing and assessment in school psychology and education (pp. 455–477). Washington, DC: American Psychological Association.
21. Wilson, T. D., & Schooler, J. W. (1991). Thinking too much: Introspection can reduce the quality of preferences and decisions. Journal of Personality and Social Psychology, 60, 181–192.
22. Zieky, M. J. (2001). So much has changed: How the setting of cutscores has evolved since the 1980s. In G. J. Cizek (Ed.), Setting performance standards: Concepts, methods, and perspectives (pp. 19–51). Mahwah, NJ: Lawrence Erlbaum.
23. Zieky, M. J., Perie, M., & Livingston, S. A. (2008). Cutscores: A manual for setting standards of performance on educational and occupational tests. Princeton, NJ: Educational Testing Service.

Address for Correspondence: Irvin R. Katz, Educational Testing Service

How to cite this article: Katz, I. R., & Tannenbaum, R. J. (2014). Comparison of web-based and face-to-face standard setting using the Angoff Method. Journal of Applied Testing Technology, 15(1), 1–17.

Source of Support: Yes, Conflict of Interest: Declared