Journal of Occupational and Organizational Psychology (2011), 84, 817–824
© 2010 The British Psychological Society
Short research note
Dealing with the threats inherent in unproctored Internet testing of cognitive ability: Results from a large-scale operational test program

Filip Lievens1* and Eugene Burke2
1 Ghent University, Belgium
2 SHL, Thames Ditton, UK

*Correspondence should be addressed to Dr Filip Lievens, Department of Personnel Management and Work and Organizational Psychology, Ghent University, Henri Dunantlaan 2, 9000 Ghent, Belgium (e-mail: [email protected]).

DOI:10.1348/096317910X522672
There is little information available about operational systems of unproctored Internet
testing (UIT) of cognitive ability and how they deal with the threats inherent in
UIT. This descriptive study provides a much-needed empirical examination of a large-scale operational UIT system of cognitive ability that implemented test design and
verification testing for increasing test security and honest responding. Test security
evaluations showed item exposure and test overlap rates were acceptable. Aberrant
score evaluations revealed that negative score change (higher unproctored scores than
proctored ones) was negligible. Implications for UIT research are discussed.
Recently, few practices have generated as much controversy as unproctored Internet
testing (UIT) of cognitive ability (Sackett & Lievens, 2008; Tippins et al., 2006).
Unproctored testing refers to testing that is not monitored by human test administrators.
At one end of the continuum, many organizations believe that UIT provides a competitive
advantage due to its efficiency, reduced cost, and hi-tech image. UIT’s efficiency stems
from the fact that applicants complete tests at any time and in any location via the
Internet. Administrative efficiency might be bolstered because centralized item banks
enable consistent administration and constant updating of items/norms. Cost arguments
are based on the possibility of reducing expenses (administrators, equipment, and travel)
and time-to-hire. As a third reason for implementing UIT, organizations believe it leads
to positive applicant perceptions.
At the other end of the continuum, reliability and validity questions have been
raised about UIT of cognitive ability. Reliability is posited to be threatened by
the unstandardized test environment. Further, validity threats might result from UIT
being vulnerable to test fraud. According to Impara and Foster (2006), test fraud
refers to piracy (stealing test content so that test security breaches occur) and
cheating (obtaining a score through prohibited materials, others’ help or others
impersonating applicants so that applicants’ scores do not reflect their standing on the
construct).
In the middle of those two positions, others propose investing in UIT while at the same time undertaking substantial efforts to minimize these threats. As Tippins (2009) argued: 'For many I/O psychologists and employers, the UIT train has left the station. They have embraced UIT as an efficient, cost-effective solution to the problem of testing large numbers of widely dispersed candidates. The question is not "Should we use UIT?" Rather, the question is "What is the best way to use UIT?" '
This last perspective suggests that researchers and practitioners should join forces to develop and examine strategies for minimizing the threats inherent in UIT. In this regard, a distinction can be made between mechanical and principled strategies (Cizek, 1999). Mechanical strategies seek to enhance test security by reducing the ease with which pirates can capture items and the likelihood that test-takers can predict which items they will receive. For instance, organizations have implemented technological precautions (e.g., disabling certain keyboard functions) and web patrols (in search of test security breaches on the
Internet). According to Tippins et al. (2006), test design might also be useful as it might
produce ‘alternate, equivalent fixed length test forms while carefully managing item
usage rates’. Several approaches for automatically assembling test forms can be adopted
(Jodoin, Zenisky, & Hambleton, 2006). Examples are linear on-the-fly approaches, wherein alternate test forms are drawn from item banks at the time of testing, and adaptive approaches, wherein items (i.e., computer adaptive testing) or testlets/item blocks (i.e., multi-stage testing) are selected one at a time given candidates' estimated ability. To
evaluate whether test design with item exposure control generates tests that reduce the
ease for pirates to capture the test content, test security evaluations might be conducted
via pre-test (simulation) and/or post-test administration analyses.
Conversely, principled strategies aim to reduce test-takers’ intentions to carry out test
fraud by encouraging honest responding. Along these lines, research has demonstrated
that incorporating accountability in assessments might serve to deter people’s intention
to distort responses (Lerner & Tetlock, 1999). In UIT, the accountability dimension
of verifiability (expectation that test performance will be assessed by another method
with possible consequences, Farh & Werbel, 1986) might be used. As recommended
by the International Test Commission (2006), test-takers might be told their UIT scores
will be followed by a proctored verification test (if they pass the unproctored stage).
To evaluate discrepancies between unproctored and proctored scores, aberrant score evaluations might be conducted.
So far, little information is available about operational UIT programs that implemented such mechanical and principled strategies. Granted, some studies examined
whether unproctored and proctored test scores differed significantly, with limited
differences discovered (Arthur, Glaze, Villado, & Taylor, 2010; Nye, Do, Drasgow, &
Fine, 2008). Yet, none of these studies clarified whether participants were informed
that verification tests were included. Prior studies also included only a small number of tests and applicant groups. Therefore, this study describes the results of a
large-scale operational UIT system (various applicant groups and ability tests) that
implemented several strategies for increasing test security and honest responding.
This study is descriptive in nature; it does not compare the effectiveness of those
approaches.
Method
Data were gathered via a consultancy firm’s selection process from September 2006 to
February 2008. The total sample consisted of 71,985 applicants (64% males; 36% females). Most of them were between 21 and 29 years old. Twenty-four per cent had a high-school degree and 68% a bachelor's or master's degree. Eighty-seven per cent were born in
the UK. Table 1 (first row) presents the job levels included. Note that proctored test
scores were based on substantially smaller sample sizes (see Tables 3 and 4) as about 5%
of the candidates passed the unproctored stage.
Table 1. Item exposure statistics broken down by test type and applicant group
Although the selection system differed across clients, it shared the following characteristics. In the unproctored stage, candidates were informed that a verification test
would be administered if they proceeded to the next stage. They signed an honesty
contract and completed the tests. In the proctored stage (2 weeks later), candidates
passing an organization’s cut-off completed alternate versions of the tests in supervised
rooms.
In this system, unproctored scores served as the basis of the selection decision for
all applicants, with the exception of those with aberrant scores (see below). Those
applicants were administered a proctored form of the first test, and that became their
score. This approach was adopted because it results in a single-stage selection (proctored
test serves as verification instead of as selection hurdle), which might be important in
light of adverse impact (only one ability testing round).
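To make this decision logic concrete, the following minimal sketch expresses the rule in Python. The function and variable names are purely illustrative and are not taken from the operational system.

```python
def operative_score(unproctored_score, proctored_score=None, is_aberrant=False):
    """Illustrative single-stage selection rule (all names are hypothetical).

    The unproctored (UIT) score drives the selection decision for every
    applicant; only when the record is flagged as aberrant does the
    proctored verification score replace it.
    """
    if is_aberrant and proctored_score is not None:
        return proctored_score
    return unproctored_score
```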
The cognitive ability tests were business-related verbal and numerical reasoning tests.
Linear on-the-fly testing was used as the test design. When applicants logged in, tests were generated from the respective IRT-calibrated item banks and assigned to them. The unproctored verbal and numerical ability tests consisted of 30 items (time limit = 19 min) and 18 items (25 min), respectively. These are not speeded tests, as the modal completion rate is 95% of test items. The proctored verbal and numerical reasoning tests consisted of 18 items (11 min) and 10 items (15 min), respectively. We refer to the manual (Burke, Van Someren, & Tatham, 2006) for details about the item development, the test generator, and the reliability and validity evidence.
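As an illustration of how a linear on-the-fly form might be assembled at login, consider the sketch below. The item bank structure, the plain random draw, and all names are assumptions made for illustration; the operational Verify generator applies additional psychometric and content constraints that are documented in the technical manual, not here.

```python
import random

def assemble_loft_form(item_bank, test_length, rng=None):
    """Draw a fixed-length test form from an IRT-calibrated bank at login.

    item_bank: list of dicts such as {"id": "V001", "a": 1.1, "b": -0.4},
    where a and b are 2PL discrimination and difficulty parameters.
    A simple random draw is used purely for illustration; operational
    generators typically also balance content and item exposure.
    """
    rng = rng or random.Random()
    return rng.sample(item_bank, test_length)

# Example: a 30-item unproctored verbal form from a hypothetical 200-item bank
verbal_bank = [{"id": f"V{i:03d}", "a": 1.0, "b": (i % 9 - 4) / 2} for i in range(200)]
verbal_form = assemble_loft_form(verbal_bank, test_length=30)
```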
Analyses and results
Test security evaluation
Two primary indices (item exposure and test overlap) have been used in test security
evaluations (Chang & Ying, 1999). The item exposure rate is defined as the ratio between
the number of times an item is administered and the total number of test-takers. Test
overlap is defined as the average of the percentage of items shared by any pair of
test-takers. Contrary to simulation studies (where test fraud is modelled a priori), in operational testing these indices are only suggestive of an increased risk of test fraud, as they prove neither that a security breach has occurred nor that test-takers are using the captured content.
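Both indices can be computed directly from the administered forms. The sketch below assumes each form is recorded as a list of item identifiers (one list per test-taker); the pairwise overlap computation is written for clarity rather than efficiency, and the function names are illustrative.

```python
from itertools import combinations

def item_exposure_rates(forms, bank_ids):
    """Exposure rate of each item: administrations divided by number of test-takers."""
    n_takers = len(forms)
    counts = {item_id: 0 for item_id in bank_ids}
    for form in forms:
        for item_id in form:
            counts[item_id] += 1
    return {item_id: count / n_takers for item_id, count in counts.items()}

def average_test_overlap(forms, test_length):
    """Average proportion of items shared by each pair of test-takers."""
    shared = [len(set(a) & set(b)) / test_length for a, b in combinations(forms, 2)]
    return sum(shared) / len(shared)
```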
Table 1 presents item exposure rate averages for the item banks and candidate
categories. To capture the discrepancy between the observed and ideal item exposure
rates (i.e., test length divided by item bank size), a scaled χ² was computed (Yi & Chang, 2003), which quantifies the efficiency of item bank usage (a low scaled χ² means the observed exposure rate is on average close to the ideal exposure rate). Mean
exposure rates were always below 20%, which is considered acceptable for alternate test
forms (Impara & Foster, 2006). In some instances, verbal items showed exposure rates
>50%. Table 2 shows the item usage frequency. Generally, item usage was balanced.
However, the verbal bank showed higher proportions of items with exposure rates
>20% as compared to the numerical one. Note that we also examined whether item exposure was correlated with item difficulty/discrimination. Results revealed only small (<.30) correlations. Table 3 presents test overlap statistics. To assess the amount of test overlap, a theoretical lower bound for the expected overlap rates is provided (Chang & Ying, 1999). All tests had overlap rates approximating their theoretical lower bound.

Table 2. Item pool usage statistics broken down by test type and applicant group

Table 3. Overall and conditional test overlap statistics broken down by test type and applicant group

Table 4. Means and standard deviations of test scores broken down by setting, test type, and applicant group
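For completeness, the sketch below shows one common form of the exposure χ² statistic and of the theoretical overlap lower bound. The exact scaling used by Yi and Chang (2003) and Chang and Ying (1999) is not reproduced in this note, so these formulas should be read as assumed textbook versions rather than the ones used operationally.

```python
def exposure_chi_square(exposure_rates, test_length):
    """Discrepancy between observed and ideal item exposure rates.

    The ideal rate is test length divided by bank size. One common form
    of the statistic (the scaling in Yi & Chang, 2003, may differ) is
    the sum over items of (observed rate - ideal rate)^2 / ideal rate.
    """
    ideal = test_length / len(exposure_rates)
    return sum((rate - ideal) ** 2 / ideal for rate in exposure_rates.values())

def overlap_lower_bound(test_length, bank_size):
    """Theoretical minimum of the average test overlap rate, attained when
    every item is exposed at the ideal rate (approximately L / N)."""
    return test_length / bank_size
```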
Aberrant score change evaluation
In prior studies (Nye et al., 2008), aberrant score change was determined when
applicants’ proctored scores were significantly lower than their UIT scores. Although
this approach is often advanced as a ‘cheating detection’ method, in operational testing
it does not determine cheating per se because it is prone to Type I (falsely detecting
cheating) and Type II error (failing to detect cheating). Therefore, we refer to it as
aberrant score evaluation.
Given that only the high-scoring candidates in UIT made it to the proctored stage
(cf. the SDs of the total sample in Table 4), it was important to correct the unproctored
scores for regression towards the mean (Campbell & Stanley, 1963). When applying this
correction (see Nye et al., 2008, for formula), we used the alternate form reliabilities
presented in the manual (.72 and .70, respectively). Table 4 shows means and standard
deviations of unproctored and proctored scores per test and applicant group. Given
differences in sample sizes, we used significance tests and effect sizes. Only in two groups (the numerical test for graduate applicants and the verbal test for operational applicants) were unproctored scores higher than proctored ones, but the ds were only −0.06 and −0.10, which indicate negligible effect sizes. In six groups, unproctored scores were lower than proctored ones (highest d = 0.34). The latter is in line with Arthur et al. (2010), who attributed this to practice effects (e.g., increased test wiseness). In fact, our effect
sizes parallel the meta-analytic effect sizes associated with practice effects of ability tests
(Hausknecht, Halpert, Di Paolo, & Moriarty, 2007).
We also conducted analyses at the individual level by computing the amount of
change between each applicant's scores (Nye et al., 2008). To this end, applicants' predicted unproctored scores (taking regression to the mean into account) were subtracted from
their standardized proctored scores. A significant amount of change was determined via
the cut-off of ± 1.96 SD (representing the .05 significance level), with negative values
indicating unproctored scores were higher than proctored ones. Table 5 shows that the
percentage of applicants with negative score change varied from 0.3 to 2.2%, which is lower than the 2.5% that would be expected by chance in each tail of the distribution.
Table 5. Summary of aberrant score evaluation at the individual level
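To illustrate the individual-level evaluation, the following sketch flags negative score change using a standard regression-to-the-mean prediction and the ±1.96 cut-off. The exact formula of Nye et al. (2008) is not reproduced in this note, so the predicted score and its standard error below are assumed textbook forms; only the reliability values come from the manual.

```python
import math

def flag_negative_change(z_unproctored, z_proctored, reliability, cutoff=1.96):
    """Flag applicants whose proctored score falls far below the score
    predicted from their unproctored score.

    Assumed standard forms (the exact Nye et al., 2008, formulas may differ):
      predicted proctored z = reliability * unproctored z   (regression to the mean)
      standardized change   = (proctored z - predicted z) / sqrt(1 - reliability**2)
    Changes below -cutoff indicate an aberrantly high unproctored score.
    """
    se = math.sqrt(1.0 - reliability ** 2)
    flags = []
    for z_u, z_p in zip(z_unproctored, z_proctored):
        change = (z_p - reliability * z_u) / se
        flags.append(change < -cutoff)
    return flags

# Example call with the verbal test's alternate-form reliability (.72):
# flags = flag_negative_change(z_uit, z_verification, reliability=0.72)
```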
Discussion
In recent discussions, the emphasis has shifted from ‘Is UIT of cognitive ability feasible in
selection settings?’ towards ‘How to deal with UIT’s reliability and validity threats in those
settings’ (Tippins, 2009). Such a shift provides a key role for industrial and organizational
psychologists in shaping the technological solutions to meet both business needs and
test standards. This study took a step in that direction by describing the results of a
large-scale UIT system that implemented best UIT practices (mechanical and principled
strategies for increasing test security and honest responding). Acknowledging that the
generalizability of our results to other systems, samples, and item banks should be
examined, test security evaluations showed that item exposure and test overlap statistics
of this UIT system with linear on-the-fly testing did not exceed 20%. One exception was
the higher exposure rates for verbal items, probably due to their greater length compared to the numerical ones. Higher exposure rates for verbal items would pose security risks if those items were consistently administered together with the same other items, leading to high test overlap. However, this was not the case.
Aberrant score evaluations showed that the effect of unproctored scores being higher
than proctored ones was negligible in this UIT system with verification testing, which
corroborates prior results (Arthur et al., 2010; Nye et al., 2008). In most applicant
categories, proctored test scores were even higher than unproctored ones. At the
individual level, the percentage of applicants flagged with aberrant score records varied from 0.3 to 2.2%.
Aberrant score change was highest for numerical tests administered to recent graduates
(facing much competition in their job search) and verbal tests completed by applicants
for operational jobs. This result contrasts with prior assertions that managers/supervisors
would be more likely to cheat because the position is more attractive due to the higher
pay and/or prestige (see Tippins et al., 2006, pp. 194–195).
Some interpretational problems in aberrant score evaluation in operational testing
should be noted. First, apart from being prone to Type I and II errors, these evaluations ignore
practice effects. Hence, they might underestimate negative score changes. However, this
is only true if applicants themselves took the unproctored test. Assuming some people
had others take the unproctored test, correcting for practice effects might overestimate
negative score changes. Second, aberrant score evaluations address only the top of the
score distribution, so changes at the bottom remain unknown. Only in simulation and experimental studies can aberrant score analyses be conducted for the entire score range.
In future studies, mechanical and principled approaches might be manipulated to
compare their effects on test fraud. Specifically, the effects of multiple strategies for
increasing test security (e.g., test design, item rotation) and honest responding (e.g.,
verification, data forensic audits, webcam proctoring, keystroke analysis; Bartram, 2008;
Foster, 2009) should be scrutinized. Research should also investigate the validity of
unproctored and proctored scores.
Acknowledgements
We would like to acknowledge Mario Pugliese and Hinrik Johanesson for their help in
collecting the data and Frederik Anseel, Iris Egberink, and Mike Harris for their insightful
comments on a previous version of this paper.
References
Arthur, W., Glaze, R. M., Villado, A. J., & Taylor, J. E. (2010). The magnitude and extent of cheating and response distortion effects on unproctored Internet-based tests of cognitive ability and personality. International Journal of Selection and Assessment, 18, 1–16. doi:10.1111/j.1468-2389.2010.00476.x
Bartram, D. (2008). The advantages and disadvantages of on-line testing. In S. Cartwright & C. L.
Cooper (Eds.), The Oxford handbook of personnel psychology (pp. 234–260). Oxford: Oxford
University Press.
Burke, E., Van Someren, G., & Tatham, N. (2006). Verify™ range of ability tests: Technical
manual. Thames Ditton: SHL Group plc. Retrieved from http://www.shl.com/OurScience/
TechnicalInformation/Pages/TechnicalManualsandGuides.aspx
Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for
research. Boston, MA: Houghton Mifflin Company.
Chang, H. H., & Ying, Z. (1999). Alpha-stratified multistage computerized adaptive testing. Applied
Psychological Measurement, 23, 211–222. doi:10.1177/01466219922031338
Cizek, G. J. (1999). Cheating on tests: How to do it, detect it, and prevent it. Mahwah, NJ: Erlbaum.
Farh, J. L., & Werbel, J. D. (1986). Effects of purpose of the appraisal and expectation of
validation on self-appraisal leniency. Journal of Applied Psychology, 71, 527–529. doi:10.1037/0021-9010.71.3.527
Foster, D. (2009). Secure, online, high-stakes testing: Science fiction or business reality? Industrial
and Organizational Psychology: Perspectives on Science and Practice, 2, 31–34. doi:10.1111/
j.1754-9434.2008.01103.x
Hausknecht, J. P., Halpert, J. A., Di Paolo, N. T., & Moriarty, M. O. (2007). Retesting in selection:
A meta-analysis of practice effects for tests of cognitive ability. Journal of Applied Psychology,
92, 373–385. doi:10.1037/0021-9010.92.2.373
Impara, J. C., & Foster, D. (2006). Item and test development strategies to minimize test fraud. In S.
M. Downing & T. M. Haladyna (Eds.), Handbook of test development (pp. 91–114). Mahwah,
NJ: Erlbaum.
International Test Commission (2006). International guidelines on computer-based and Internet-delivered testing. International Journal of Testing, 6, 143–172. doi:10.1207/s15327574ijt0602_4
Jodoin, M. G., Zenisky, A., & Hambleton, R. K. (2006). Comparison of the psychometric properties
of several computer-based test designs for credentialing exams with multiple purposes. Applied
Measurement in Education, 19, 203–220. doi:10.1207/s15324818ame1903_3
Lerner, J. S., & Tetlock, P. E. (1999). Accounting for the effects of accountability. Psychological
Bulletin, 125, 255–275. doi:10.1037/0033-2909.125.2.255
Nye, C. D., Do, B., Drasgow, F., & Fine, S. (2008). Two-step testing in employee selection: Is
score inflation a problem? International Journal of Selection and Assessment, 16, 112–120.
doi:10.1111/j.1468-2389.2008.00416.x
Sackett, P. R., & Lievens, F. (2008). Personnel selection. Annual Review of Psychology, 59, 419–
445. doi:10.1146/annurev.psych.59.103006.093716
Tippins, N. T. (2009). Internet alternatives to traditional proctored testing: Where are we now?
Industrial and Organizational Psychology: Perspectives on Science and Practice, 2, 2–10.
doi:10.1111/j.1754-9434.2008.01097.x
Tippins, N. T., Beatty, J., Drasgow, F., Gibson, W. M., Pearlman, K., Segall, D. O., . . . Shepherd,
W. (2006). Unproctored Internet testing in employment settings. Personnel Psychology, 59,
189–225. doi:10.1111/j.1744-6570.2006.00909.x
Yi, Q., & Chang, H. H. (2003). Alpha-stratified CAT design with content blocking. British Journal
of Mathematical and Statistical Psychology, 56, 359–378. doi:10.1348/000711003770480084
Received 19 March 2010; revised version received 2 July 2010