Journal of Occupational and Organizational Psychology (2011), 84, 817–824
© 2010 The British Psychological Society
www.wileyonlinelibrary.com

Short research note

Dealing with the threats inherent in unproctored Internet testing of cognitive ability: Results from a large-scale operational test program

Filip Lievens¹* and Eugene Burke²
¹Ghent University, Belgium
²SHL, Thames Ditton, UK

*Correspondence should be addressed to Dr Filip Lievens, Department of Personnel Management and Work and Organizational Psychology, Ghent University, Henri Dunantlaan 2, 9000 Ghent, Belgium (e-mail: [email protected]). DOI: 10.1348/096317910X522672

There is little information available about operational systems of unproctored Internet testing (UIT) of cognitive ability and how they deal with the threats inherent in UIT. This descriptive study provides a much-needed empirical examination of a large-scale operational UIT system of cognitive ability that implemented test design and verification testing to increase test security and honest responding. Test security evaluations showed that item exposure and test overlap rates were acceptable. Aberrant score evaluations revealed that negative score change (higher unproctored scores than proctored ones) was negligible. Implications for UIT research are discussed.

Recently, few practices have generated as much controversy as unproctored Internet testing (UIT) of cognitive ability (Sackett & Lievens, 2008; Tippins et al., 2006). Unproctored testing refers to testing that is not monitored by human test administrators. At one end of the continuum, many organizations believe that UIT provides a competitive advantage due to its efficiency, reduced cost, and hi-tech image. UIT's efficiency stems from the fact that applicants complete tests at any time and in any location via the Internet. Administrative efficiency might be bolstered because centralized item banks enable consistent administration and constant updating of items/norms. Cost arguments are based on the possibility of reducing expenses (administrators, equipment, and travel) and time-to-hire. As a third reason for implementing UIT, organizations believe it leads to positive applicant perceptions.

At the other end of the continuum, reliability and validity questions have been raised about UIT of cognitive ability. Reliability is posited to be threatened by the unstandardized test environment. Further, validity threats might result from UIT being vulnerable to test fraud. According to Impara and Foster (2006), test fraud refers to piracy (stealing test content so that test security breaches occur) and cheating (obtaining a score through prohibited materials, others' help, or others impersonating applicants, so that applicants' scores do not reflect their standing on the construct).

In the middle of these two positions, others propose investing in UIT while at the same time undertaking substantial efforts to minimize those threats. As Tippins (2009) argued: 'For many I/O psychologists and employers, the UIT train has left the station. They have embraced UIT as an efficient, cost-effective solution to the problem of testing large numbers of widely dispersed candidates. The question is not "Should we use UIT?" Rather, the question is "What is the best way to use UIT?"' This last perspective suggests that researchers and practitioners should join forces to develop and examine strategies for minimizing the threats inherent in UIT.

In this regard, a distinction can be made between mechanical and principled strategies (Cizek, 1999). Mechanical strategies seek to enhance test security by reducing the ease with which pirates can capture items and the likelihood of predicting which items will be received. For instance, organizations have implemented technological precautions (e.g., disabling some keyboard functions) and web patrols (searching the Internet for test security breaches). According to Tippins et al. (2006), test design might also be useful, as it might produce 'alternate, equivalent fixed length test forms while carefully managing item usage rates'. Several approaches for automatically assembling test forms can be adopted (Jodoin, Zenisky, & Hambleton, 2006). Examples are linear on-the-fly approaches, wherein alternate test forms are drawn from item banks at the time of testing, and adaptive approaches, wherein items (i.e., computer adaptive testing) or testlets/item blocks (i.e., multi-stage testing) are selected one at a time given candidates' estimated ability. To evaluate whether test design with item exposure control generates tests that reduce the ease for pirates to capture the test content, test security evaluations might be conducted via pre-test (simulation) and/or post-test administration analyses.
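To make the linear on-the-fly idea concrete, the sketch below assembles a fixed-length form from a calibrated item bank while lightly balancing exposure counts. It is a minimal, hypothetical illustration: the bank, the least-exposed-first rule, and the omission of content and difficulty constraints are all simplifying assumptions, and the sketch does not describe the commercial generator used in the study reported below.

```python
import random
from collections import defaultdict

def assemble_loft_form(item_bank, test_length, exposure_counts, rng):
    """Draw one fixed-length form from an item bank (linear on-the-fly).

    item_bank       : list of item identifiers in the calibrated bank
    test_length     : number of items on each alternate form
    exposure_counts : dict mapping item id -> times already administered
    rng             : random.Random instance used for tie-breaking

    Preferring the least-exposed items keeps observed exposure rates close
    to the ideal rate of test_length / len(item_bank).  Content and
    difficulty constraints, which a real generator would enforce, are
    omitted here for brevity.
    """
    ranked = sorted(item_bank, key=lambda item: (exposure_counts[item], rng.random()))
    form = ranked[:test_length]
    for item in form:
        exposure_counts[item] += 1
    return form

# Hypothetical usage: a 300-item bank and 30-item alternate verbal forms.
bank = [f"V{i:03d}" for i in range(300)]
counts = defaultdict(int)
rng = random.Random(42)
forms = [assemble_loft_form(bank, 30, counts, rng) for _ in range(1000)]
print(max(counts.values()) / 1000)  # worst observed exposure rate; ideal is 0.10
```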
Conversely, principled strategies aim to reduce test-takers' intentions to carry out test fraud by encouraging honest responding. Along these lines, research has demonstrated that incorporating accountability in assessments might serve to deter people's intention to distort responses (Lerner & Tetlock, 1999). In UIT, the accountability dimension of verifiability (the expectation that test performance will be assessed by another method with possible consequences; Farh & Werbel, 1986) might be used. As recommended by the International Test Commission (2006), test-takers might be told that their UIT scores will be followed by a proctored verification test (if they pass the unproctored stage). To evaluate discrepancies between unproctored and proctored scores, aberrant score evaluations might be conducted.

So far, little information is available about operational UIT programs that have implemented such mechanical and principled strategies. Granted, some studies examined whether unproctored and proctored test scores differed significantly, with limited differences discovered (Arthur, Glaze, Villado, & Taylor, 2010; Nye, Do, Drasgow, & Fine, 2008). Yet none of these studies clarified whether participants were informed that verification tests were included. Prior studies also included only a small number of tests and applicant groups. Therefore, this study describes the results of a large-scale operational UIT system (various applicant groups and ability tests) that implemented several strategies for increasing test security and honest responding. This study is descriptive in nature; it does not compare the effectiveness of those approaches.

Method

Data were gathered via a consultancy firm's selection process from September 2006 to February 2008. The total sample consisted of 71,985 applicants (64% males; 36% females). Most of them were between 21 and 29 years old. Twenty-four per cent had a high-school degree and 68% a bachelor's/master's degree. Eighty-seven per cent were born in the UK. Table 1 (first row) presents the job levels included. Note that proctored test scores were based on substantially smaller sample sizes (see Tables 3 and 4), as about 5% of the candidates passed the unproctored stage.

Table 1. Item exposure statistics broken down by test type and applicant group

Although the selection system differed across clients, it shared the following characteristics. In the unproctored stage, candidates were informed that a verification test would be administered if they proceeded to the next stage. They signed an honesty contract and completed the tests. In the proctored stage (2 weeks later), candidates passing an organization's cut-off completed alternate versions of the tests in supervised rooms. In this system, unproctored scores served as the basis of the selection decision for all applicants, with the exception of those with aberrant scores (see below). Those applicants were administered a proctored form of the first test, and that became their score. This approach was adopted because it results in a single-stage selection (the proctored test serves as verification instead of as a selection hurdle), which might be important in light of adverse impact (only one ability testing round).
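To make this single-stage logic concrete, the sketch below expresses one way such a decision rule could look: the unproctored score stands as the score of record unless the applicant is flagged as aberrant, in which case the proctored verification score replaces it. The data structure and function names are hypothetical and are not taken from the operational system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Applicant:
    uit_score: float                          # unproctored Internet test score
    flagged_aberrant: bool = False            # set by the aberrant score evaluation
    proctored_score: Optional[float] = None   # verification score, if administered

def score_of_record(applicant: Applicant) -> float:
    """Return the score used for the selection decision.

    Unproctored scores drive the decision for all applicants except those
    whose verification result was flagged as aberrant; for them, the
    proctored score replaces the unproctored one, so the proctored test
    acts as verification rather than as a second selection hurdle.
    """
    if applicant.flagged_aberrant and applicant.proctored_score is not None:
        return applicant.proctored_score
    return applicant.uit_score

# Hypothetical example: a flagged applicant falls back to the proctored score.
print(score_of_record(Applicant(uit_score=28, flagged_aberrant=True, proctored_score=14)))
```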
The cognitive ability tests were business-related verbal and numerical reasoning tests. Linear on-the-fly testing was used as the test design. When applicants logged in, tests were generated from the respective IRT-calibrated item banks and assigned to them. Each unproctored verbal and numerical ability test consisted of 30 items (time limit = 19 min) and 18 items (25 min), respectively. These tests are not speeded, as the modal completion rate is 95% of test items. Proctored verbal and numerical reasoning tests consisted of 18 items (11 min) and 10 items (15 min), respectively. We refer to the manual (Burke, Van Someren, & Tatham, 2006) for details about the item development, the test generator, and reliability and validity evidence.

Analyses and results

Test security evaluation
Two primary indices (item exposure and test overlap) have been used in test security evaluations (Chang & Ying, 1999). The item exposure rate is defined as the ratio between the number of times an item is administered and the total number of test-takers. Test overlap is defined as the average of the percentage of items shared by any pair of test-takers. Contrary to simulation studies (where test fraud is modelled a priori), in operational testing these indices are only suggestive of an increased risk of test fraud, as they prove neither that a security breach has occurred nor that test-takers are using the captured content.

Table 1 presents item exposure rate averages for the item banks and candidate categories. To capture the discrepancy between the observed and ideal item exposure rates (i.e., test length divided by item bank size), a scaled χ² was computed (Yi & Chang, 2003), which quantifies the efficiency of item bank usage (a low scaled χ² means the observed exposure rate is on average close to the ideal exposure rate). Mean exposure rates were always below 20%, which is considered acceptable for alternate test forms (Impara & Foster, 2006). In some instances, verbal items showed exposure rates >50%. Table 2 shows the item usage frequency. Generally, item usage was balanced. However, the verbal bank showed higher proportions of items with exposure rates >20% as compared to the numerical one. Note that we also examined whether item exposure was correlated with item difficulty/discrimination. Results revealed only small (<.30) correlations.

Table 2. Item pool usage statistics broken down by test type and applicant group

Table 3 presents test overlap statistics. To assess the amount of test overlap, a theoretical lower bound for the expected overlap rates is provided (Chang & Ying, 1999). All tests had overlap rates approximating their theoretical lower bound.

Table 3. Overall and conditional test overlap statistics broken down by test type and applicant group
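The item exposure and overlap indices above can be computed directly from an administration log. The sketch below is a generic illustration, not the analysis code used for the tables: the log format is assumed, and the scaled chi-square is written in a common observed-versus-ideal form that should be checked against Yi and Chang (2003).

```python
from itertools import combinations

def exposure_rates(forms):
    """Item exposure rate: administrations of an item / number of test-takers."""
    n_takers = len(forms)
    counts = {}
    for form in forms:
        for item in form:
            counts[item] = counts.get(item, 0) + 1
    return {item: count / n_takers for item, count in counts.items()}

def scaled_chi_square(rates, bank_size, test_length):
    """Discrepancy between observed and ideal exposure rates.

    The ideal rate is test_length / bank_size; items never administered enter
    with an observed rate of zero.  Lower values indicate more even use of the
    bank.  (This formulation is assumed; see Yi & Chang, 2003, for the exact
    statistic reported in the article.)
    """
    ideal = test_length / bank_size
    observed = list(rates.values()) + [0.0] * (bank_size - len(rates))
    return sum((rate - ideal) ** 2 / ideal for rate in observed)

def mean_test_overlap(forms):
    """Average proportion of items shared by any pair of test-takers."""
    pairs = list(combinations(forms, 2))
    overlaps = [len(set(a) & set(b)) / len(a) for a, b in pairs]
    return sum(overlaps) / len(overlaps) if overlaps else 0.0

# Hypothetical administration log: the items each applicant received.
log = [
    ["V001", "V007", "V042"],
    ["V003", "V007", "V199"],
    ["V001", "V003", "V042"],
]
rates = exposure_rates(log)
print(rates)
print(scaled_chi_square(rates, bank_size=300, test_length=3))
print(mean_test_overlap(log))
```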
Aberrant score change evaluation
In prior studies (Nye et al., 2008), aberrant score change was determined when applicants' proctored scores were significantly lower than their UIT scores. Although this approach is often advanced as a 'cheating detection' method, in operational testing it does not determine cheating per se because it is prone to Type I error (falsely detecting cheating) and Type II error (failing to detect cheating). Therefore, we refer to it as aberrant score evaluation. Given that only the high-scoring candidates in UIT made it to the proctored stage (cf. the SDs of the total sample in Table 4), it was important to correct the unproctored scores for regression towards the mean (Campbell & Stanley, 1963). When applying this correction (see Nye et al., 2008, for the formula), we used the alternate-form reliabilities presented in the manual (.72 and .70, respectively).

Table 4 shows the means and standard deviations of unproctored and proctored scores per test and applicant group. Given differences in sample sizes, we used both significance tests and effect sizes. Only for two groups (numerical test, graduate applicants; verbal test, operational applicants) were unproctored scores higher than proctored ones, but the ds were only −0.06 and −0.10, which indicate negligible effect sizes. In six groups, unproctored scores were lower than proctored ones (highest d = 0.34). The latter is in line with Arthur et al. (2010), who attributed this to practice effects (e.g., increased test-wiseness). In fact, our effect sizes parallel the meta-analytic effect sizes associated with practice effects on ability tests (Hausknecht, Halpert, Di Paolo, & Moriarty, 2007).

Table 4. Means and standard deviations of test scores broken down by setting, test type, and applicant group

We also conducted analyses at the individual level by computing the amount of change between each applicant's scores (Nye et al., 2008). Here, applicants' predicted unproctored scores (taking regression to the mean into account) were subtracted from their standardized proctored scores. A significant amount of change was determined via a cut-off of ±1.96 SD (representing the .05 significance level), with negative values indicating that unproctored scores were higher than proctored ones. Table 5 shows that the percentage of applicants with negative score change varied from 0.3% to 2.2%, which is lower than the 2.5% that might be expected in the tails by chance.

Table 5. Summary of aberrant score evaluation at the individual level
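To illustrate the individual-level procedure, the sketch below flags applicants whose proctored verification score is improbably low (or high) relative to what would be expected from their unproctored score once regression towards the mean is taken into account. It is an approximation offered for illustration only; the exact correction applied in the study is the formula given by Nye et al. (2008).

```python
import math

def aberrant_score_change(z_uit, z_proctored, reliability, cutoff=1.96):
    """Flag an improbably large discrepancy between unproctored and proctored scores.

    z_uit       : unproctored score, standardized on the unproctored norm
    z_proctored : proctored score, standardized on the proctored norm
    reliability : alternate-form reliability linking the two tests (e.g., .72)

    The unproctored score is shrunk towards the mean by the reliability before
    it is compared with the proctored score; scaling the difference by the
    standard error of prediction is an assumption of this sketch, not the
    published formula.  A negative change means the unproctored score was
    higher than expected from the proctored result.
    """
    predicted_uit = reliability * z_uit                 # regression towards the mean
    se_prediction = math.sqrt(1.0 - reliability ** 2)
    change = (z_proctored - predicted_uit) / se_prediction
    return change, abs(change) > cutoff

# Hypothetical applicant: very high unproctored score, average proctored score.
change, flagged = aberrant_score_change(z_uit=2.4, z_proctored=0.1, reliability=0.72)
print(round(change, 2), flagged)   # about -2.35, flagged as a negative score change
```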
Discussion

In recent discussions, the emphasis has shifted from 'Is UIT of cognitive ability feasible in selection settings?' towards 'How should we deal with UIT's reliability and validity threats in those settings?' (Tippins, 2009). Such a shift provides a key role for industrial and organizational psychologists in shaping technological solutions that meet both business needs and test standards. This study took a step in that direction by describing the results of a large-scale UIT system that implemented best UIT practices (mechanical and principled strategies for increasing test security and honest responding).

Acknowledging that the generalizability of our results to other systems, samples, and item banks should be examined, test security evaluations showed that item exposure and test overlap statistics of this UIT system with linear on-the-fly testing did not exceed 20%. One exception was the higher exposure rates for verbal items, probably due to their greater length compared to the numerical ones. Higher exposure rates for verbal items might pose security risks if those items were consistently administered together with other items, leading to high test overlaps. However, this was not the case.

Aberrant score evaluations showed that the effect of unproctored scores being higher than proctored ones was negligible in this UIT system with verification testing, which corroborates prior results (Arthur et al., 2010; Nye et al., 2008). In most applicant categories, proctored test scores were even higher than unproctored ones. At the individual level, the percentage of applicants flagged with aberrant score records varied from 0.3% to 2.2%. Aberrant score change was highest for numerical tests administered to recent graduates (who face much competition in their job search) and verbal tests completed by applicants for operational jobs. This result contrasts with prior assertions that managers/supervisors would be more likely to cheat because the position is more attractive due to higher pay and/or prestige (see Tippins et al., 2006, pp. 194–195).

Some interpretational problems of aberrant score evaluation in operational testing should be noted. First, apart from being prone to Type I and II errors, these evaluations ignore practice effects. Hence, they might underestimate negative score changes. However, this is only true if applicants themselves took the unproctored test. Assuming some people had others take the unproctored test, correcting for practice effects might overestimate negative score changes. Second, aberrant score evaluations address only the top of the score distribution, so changes at the bottom remain unknown. Only in simulation and experimental studies can aberrant score analyses be conducted for the entire score range.

In future studies, mechanical and principled approaches might be manipulated to compare their effects on test fraud. Specifically, the effects of multiple strategies for increasing test security (e.g., test design, item rotation) and honest responding (e.g., verification, data forensic audits, webcam proctoring, keystroke analysis; Bartram, 2008; Foster, 2009) should be scrutinized. Research should also investigate the validity of unproctored and proctored scores.

Acknowledgements

We would like to acknowledge Mario Pugliese and Hinrik Johanesson for their help in collecting the data and Frederik Anseel, Iris Egberink, and Mike Harris for their insightful comments on a previous version of this paper.
References

Arthur, W., Glaze, R. M., Villado, A. J., & Taylor, J. E. (2010). The magnitude and extent of cheating and response distortion effects on unproctored Internet-based tests of cognitive ability and personality. International Journal of Selection and Assessment, 18, 1–16. doi:10.1111/j.1468-2389.2010.00476.x

Bartram, D. (2008). The advantages and disadvantages of on-line testing. In S. Cartwright & C. L. Cooper (Eds.), The Oxford handbook of personnel psychology (pp. 234–260). Oxford: Oxford University Press.

Burke, E., Van Someren, G., & Tatham, N. (2006). Verify range of ability tests: Technical manual. Thames Ditton: SHL Group plc. Retrieved from http://www.shl.com/OurScience/TechnicalInformation/Pages/TechnicalManualsandGuides.aspx

Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Boston, MA: Houghton Mifflin.

Chang, H. H., & Ying, Z. (1999). Alpha-stratified multistage computerized adaptive testing. Applied Psychological Measurement, 23, 211–222. doi:10.1177/01466219922031338

Cizek, G. J. (1999). Cheating on tests: How to do it, detect it, and prevent it. Mahwah, NJ: Erlbaum.

Farh, J. L., & Werbel, J. D. (1986). Effects of purpose of the appraisal and expectation of validation on self-appraisal leniency. Journal of Applied Psychology, 71, 527–529. doi:10.1037/0021-9010.71.3.527

Foster, D. (2009). Secure, online, high-stakes testing: Science fiction or business reality? Industrial and Organizational Psychology: Perspectives on Science and Practice, 2, 31–34. doi:10.1111/j.1754-9434.2008.01103.x

Hausknecht, J. P., Halpert, J. A., Di Paolo, N. T., & Moriarty, M. O. (2007). Retesting in selection: A meta-analysis of practice effects for tests of cognitive ability. Journal of Applied Psychology, 92, 373–385. doi:10.1037/0021-9010.92.2.373

Impara, J. C., & Foster, D. (2006). Item and test development strategies to minimize test fraud. In S. M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 91–114). Mahwah, NJ: Erlbaum.

International Test Commission (2006). International guidelines on computer-based and Internet delivered testing. International Journal of Testing, 6, 143–172. doi:10.1207/s15327574ijt0602_4

Jodoin, M. G., Zenisky, A., & Hambleton, R. K. (2006). Comparison of the psychometric properties of several computer-based test designs for credentialing exams with multiple purposes. Applied Measurement in Education, 19, 203–220. doi:10.1207/s15324818ame1903_3

Lerner, J. S., & Tetlock, P. E. (1999). Accounting for the effects of accountability. Psychological Bulletin, 125, 255–275. doi:10.1037/0033-2909.125.2.255

Nye, C. D., Do, B., Drasgow, F., & Fine, S. (2008). Two-step testing in employee selection: Is score inflation a problem? International Journal of Selection and Assessment, 16, 112–120. doi:10.1111/j.1468-2389.2008.00416.x

Sackett, P. R., & Lievens, F. (2008). Personnel selection. Annual Review of Psychology, 59, 419–445. doi:10.1146/annurev.psych.59.103006.093716

Tippins, N. T. (2009). Internet alternatives to traditional proctored testing: Where are we now? Industrial and Organizational Psychology: Perspectives on Science and Practice, 2, 2–10. doi:10.1111/j.1754-9434.2008.01097.x

Tippins, N. T., Beatty, J., Drasgow, F., Gibson, W. M., Pearlman, K., Segall, D. O., . . . Shepherd, W. (2006). Unproctored Internet testing in employment settings. Personnel Psychology, 59, 189–225. doi:10.1111/j.1744-6570.2006.00909.x

Yi, Q., & Chang, H. H. (2003). Alpha-stratified CAT design with content blocking. British Journal of Mathematical and Statistical Psychology, 56, 359–378. doi:10.1348/000711003770480084

Received 19 March 2010; revised version received 2 July 2010