
ABSTRACT
FRANCAVILLA, NICOLE. Examining the Influence of Virtual and In-Person Proctors on
Careless Response in Survey Data. (Under the direction of Dr. Adam Meade.)
Online data collection from undergraduate students has become a common practice for
psychological researchers because it provides a convenient, cost-effective way to reach a
large sample. However, the lack of environmental control and lack of human interaction in
Internet-based data collection bring up concerns over data quality, particularly regarding
careless response. Careless responding occurs when participants respond to items without
regard for item content, likely due to inattention. Undergraduate students may be particularly
prone to careless response due to the low-stakes nature of their participation and their
propensity to multitask. The present study used a virtual proctor that provided progress
feedback in an attempt to create a social facilitation effect that would lead participants to put more
effort into an online survey. The study also included an in-person condition in which
participants took a survey in a classroom with a proctor, and a control condition in which
participants took the survey remotely online. Analyses indicated that virtual proctor presence
did not reduce careless responding. The in-person proctor increased self-reported diligence
and reduced misses on bogus items, depending on the proctor’s identity, but in general the in-person proctor did not reduce careless responding either. Careless response indicators were
examined as continuous variables, dichotomous variables, and multivariate composites.
Implications and directions for future research are discussed.
© Copyright 2016 by Nicole Francavilla
All Rights Reserved
Examining the Influence of Virtual and In-Person Proctors on Careless Response in Survey
Data
by
Nicole Francavilla
A thesis submitted to the Graduate Faculty of
North Carolina State University
in partial fulfillment of the
requirements for the Degree of
Master of Science
Psychology
Raleigh, North Carolina
2016
APPROVED BY:
_______________________________
Dr. Stephen Craig
_______________________________
Dr. Jennifer Burnette
___________________________
Dr. Adam Meade
Chair of Advisory Committee
DEDICATION
To Kerri and Drew.
BIOGRAPHY
Nicole Francavilla was born in Flemington, New Jersey to parents Thomas and Eileen
Francavilla and has an older brother, Michael Francavilla. She attended Villanova University
and received a Bachelor of Arts Degree in Psychology with a Writing and Rhetoric Minor in
December 2013. She moved to Raleigh, North Carolina in 2014 to pursue her doctoral
degree in Industrial/Organizational Psychology at North Carolina State University.
ACKNOWLEDGEMENTS
Thank you to my advisor and committee chair, Dr. Adam Meade, for providing
continuous timely feedback and guiding me through this challenging process. Thank you to
my committee members, Dr. Bart Craig and Dr. Jennifer Burnette for your helpful insight
and feedback.
Thank you to M. K. Ward for your advice at every step of my research process.
Thank you to Amanda Young, Sam Wilgus, and Drew Weedfall for your thorough revisions.
Finally, thank you to my parents for being the ultimate support system, for always
providing a listening ear, for helping me keep everything in perspective, and for encouraging
me to believe in myself in times of challenge.
TABLE OF CONTENTS
LIST OF TABLES ............................................................................................................. vi
LIST OF FIGURES ......................................................................................................... viii
Examining the Influence of Virtual and In-Person Proctors on Careless Response in Survey Data .....................................................................1
Issues with Undergraduate Samples ................................................................................2
Data Quality Concerns .....................................................................................................3
Variation in Reports of Careless Response Prevalence ...................................................4
Psychometric Concerns ....................................................................................................5
Detecting Careless Response ...........................................................................................5
Finding Ways to Prevent Careless Response ...................................................................7
Present Study .................................................................................................................12
Method ...............................................................................................................................13
Participants.....................................................................................................................13
Study Design ..................................................................................................................13
Procedure .......................................................................................................................14
Materials ........................................................................................................................16
Results ................................................................................................................................24
Group Equivalence Checks ............................................................................................25
Analyses to Test Hypotheses .........................................................................................27
Supplemental Analyses ..................................................................................................36
Discussion ..........................................................................................................................36
Future Directions ...........................................................................................................44
REFERENCES ..................................................................................................................46
APPENDIX ........................................................................................................................74
LIST OF TABLES
Table 1. Number of Participants, Mean Age (Standard Deviation), and Sex by Condition .......... 52
Table 2. Cut Scores and Flags by Condition for Mahalanobis Distance, Even-Odd Consistency, Maximum LongString, and Psychometric Synonym .......... 53
Table 3. Percentage of Sample Flagged for CR .......... 54
Table 4. Correlations Among Continuous CR Indicators .......... 55
Table 5. Phi Correlations Among Dichotomous CR Indicators .......... 56
Table 6. T-Tests for Time-of-Semester Effects on Continuous CR Indicators .......... 57
Table 7. T-Tests for Proctor Identity Effects on Continuous CR Indicators in the In-Person Condition .......... 58
Table 8. ANOVA Tests for Group Size Effect on Continuous CR Indicators in the In-Person Condition .......... 59
Table 9. Kruskal-Wallis Test for Group Size Effect on Maximum LongString Within In-Person Condition .......... 59
Table 10. Chi-Square Tests for Time-of-Semester Effects in the Full Sample and Proctor Identity Effects Within the In-Person Condition on Dichotomous Bogus and Instructed-Response Flags .......... 60
Table 11. Logistic Regressions for Group Size Effects on Dichotomous CR Indicators in the In-Person Condition .......... 61
Table 12. Univariate ANOVA Tests for Condition Effect on Continuous CR Indicators .......... 62
Table 13. Post-Hoc Pairwise Tukey Tests for Diligence and Bogus Sum .......... 62
Table 14. Kruskal-Wallis Rank Sum Test for Condition Effect on Continuous Maximum LongString .......... 63
Table 15. ANOVA Test for Condition Effect on Bogus Sum, Controlling for In-Person Proctor Identity .......... 64
Table 16. Pairwise Tukey Tests for Bogus Sum, Controlling for In-Person Proctor Identity .......... 64
Table 17. Chi-Square Tests for Condition Effects on Bogus Flag (In-Person Group Collapsed) .......... 65
Table 18. Chi-Square Tests for Condition Effects on Bogus Flag (Controlling for In-Person Proctor) .......... 66
Table 19. Chi-Square Tests for Condition Effects on Instructed-Response Flag .......... 67
Table 20. One-Way MANOVA for Condition Effect on Composite of CR Indicators (Diligence, Interest, Bogus Sum, Instructed-Response Sum, Mahalanobis Distance, Even-Odd Consistency) .......... 68
Table 21. Kruskal-Wallis Tests for Condition Effect on Sum of CR Flag Scores .......... 69
Table 22. Chi-Square Test for Condition Effect on Dichotomous Overall Flag Score .......... 70
Table 23. Environmental Distraction Items .......... 71
Table 24. Attention Scale .......... 72
LIST OF FIGURES
Figure 1. Depiction of the virtual human and the progress feedback in the virtual proctor
condition…………………………………………………………………………………73
Examining the Influence of Virtual and In-Person Proctors on Careless Response in
Survey Data
Online surveys have become the standard of data collection in psychological research
over the past several years. As technology has developed and more people than ever have
access to the Internet, researchers often rely on this convenient method to collect data. There
are many benefits to Internet-based surveys. They provide a cost-effective way to quickly
collect data from a large sample through easy distribution. Additionally, researchers can
easily transfer online data into statistical software programs, making data analysis less
tedious than that in paper-and-pencil data collection. Despite their advantages, a primary
concern about Internet-based surveys regards data quality. Meade and Craig (2012)
suggested several possible threats to the quality of data collected from online surveys such as
a lack of environmental control, limited human interaction between the researcher and
participant, low participant motivation, and fatigue during long surveys.
Paramount to the data quality concern is the lack of human interaction involved in
online data collection (Johnson, 2005). Unlike in laboratory settings, online survey
participation often occurs without any interaction between the participant and the researcher.
The absence of a proctor may reduce a participant’s perceived accountability for his/her
responses. Additionally, the removal of a social interaction takes away the benefits of the
social facilitation effect, in which an individual performs better on simple tasks in the
presence of another person (Aiello & Douthitt, 2001). Some researchers argue that a social
interaction component to surveys could improve the quality of data (e.g., Ward & Pond,
2015).
Another point of concern in online data collection is the lack of environmental control
(Johnson, 2005). The researcher has no control over, and often no information about, the
environment in which the individual completes the survey. The recent increase in technology
use has also led to an increase in multitasking, and one study found that people frequently
combine technology-based tasks, such as sending text messages while browsing the Internet
(Carrier, Cheever, Rosen, Benitez, & Chang, 2009). With access to a number of devices such
as laptops, tablets, and smartphones, survey participants may fall victim to many
environmental distractions throughout online surveys, over which the researcher has no control.
Environmental distractions likely take participant attention away from the survey, and the
lack of an interpersonal component likely reduces participant accountability for his/her
responses. The present study attempted to address the issues of environmental control and
human interaction in online data collection, particularly with undergraduate samples.
Issues with Undergraduate Samples
Undergraduate students make up an accessible data pool that has become one of the
most utilized sources of research participants (Gordon, Slade, & Schmitt, 1986). Often
researchers incentivize students to participate in research in exchange for course credit or to
fulfill a course obligation. Even if instructors offer an alternative assignment such as a paper,
students often choose to participate in research simply as a means to receive credit. They
may be reluctant and resentful toward the process (Schultz, 1969), and thus undergraduates likely
lack intrinsic motivation to take part in online surveys (Meade & Craig, 2012). A lack of
motivation brings into question the degree to which researchers can rely on the data students
provide.
In addition to motivational concerns, undergraduates may also be prone to
distractions and other problems associated with multitasking. Not surprisingly, members of
the “Net Generation,” or those born from 1980 to the present, tend to engage in multitasking
more frequently in comparison to those in older generations (Carrier et al., 2009). The
majority of undergraduate students today fall into this category, which warrants concern for
participant attention when collecting Internet-based data. Activities such as browsing the
web, sending text messages, and listening to music may divert a student’s attention from a
survey. One study found that over half of online survey participants aged 18-24 reported
having multitasked on another electronic-based activity during a survey (Zwarun & Hall,
2014). In addition to a propensity to multitask on such activities throughout the survey,
students are at risk for environmental distractors based on the location in which they take the
survey, such as noise in a dormitory. Dual-task research shows that divided attention reduces
performance on both cognitive and physical tasks (Spelke, Hirst, & Neisser, 1976), so
multitasking and distractions bring up data quality issues for data collection in an
uncontrolled environment.
Data Quality Concerns
A concern over the quality of survey data is not new. Nichols, Green, and Schmolck
(1989) defined two types of issues with data quality. They defined content responsive faking
as the phenomenon in which participants respond to survey items with information that is not
completely accurate, but influenced by the item content. Paulhus (1984) suggests social
desirability as one reason behind such faking. In other words, sometimes participants
respond to survey items in an attempt to “look good,” rather than responding in terms of their
true thoughts or feelings about a topic. The second data quality problem is that of content
nonresponsivity. Content nonresponsivity occurs when the participant provides responses
with no regard for the item content, referred to as random response in earlier literature (e.g.,
Beach, 1989). Meade and Craig (2012) argue that content nonresponsivity is not truly
random; for instance, participants may choose the same response several times in a row or
make a pattern with their answers. Regardless, participants respond with no consideration of
the actual content of the item. Such behavior has been labeled insufficient effort response
(Huang, Curran, Keeney, Poposki, & DeShon, 2012) and careless response (CR; Meade &
Craig, 2012). In this paper, I use the term CR to refer to content nonresponsive behavior.
Variation in Reports of Careless Response Prevalence
There are inconsistent reports on the prevalence of CR. Meade and Craig (2012) used
a combination of several CR indicators and reported that 10-12% of undergraduate research
participants displayed CR. Johnson (2005) reported a much lower number, 3.5%, but used a
sample of individuals who sought out the International Personality Item Pool online (IPIP;
Goldberg, 1999). Johnson’s participants were likely more motivated than undergraduate
students because they freely participated and wanted feedback on the personality measure.
Similarly, Ehlers, Green-Shortridge, Weekley, and Zajack (2009) reported CR in 5% of their
sample of job applicants. Again, such participants likely had more invested in their job
application than do undergraduates that take online surveys to fulfill a course requirement.
With limited intrinsic motivation, student samples that take Internet-based surveys may be
particularly prone to CR (Meade & Craig, 2012). Thus, the type of sample utilized plays a
role in the discrepancies in reports of CR prevalence.
It is important to note that participants that display CR may not provide poor data for
the entirety of the survey, but CR instead may only occur on a small number of items. Berry
et al. (1992) found that roughly 50-60% of students self-reported responding to one or more
items randomly in an online survey. Baer, Ballenger, Berry and Wetter (1997) found an even
higher number, 73%, for self-reported CR. Thus, while many of the commonly used CR
indicators may detect low percentages of CR (somewhere between 3.5% and 12%), CR seems to
occur often throughout surveys on a small number of items. Additionally, the data clearly
show that undergraduate samples are particularly prone to at least some degree of CR (Baer
et al., 1997; Berry et al., 1992; Meade & Craig, 2012).
Psychometric Concerns
The pervasiveness of CR brings up several psychometric concerns and researchers
(e.g., Huang et al., 2012; Maniaci & Rogge, 2014; Meade & Craig, 2012; Ward & Pond,
2015) agree that the field must address CR in order to improve research methods. CR may
affect within-group variability (Clark, Gironda, & Young, 2003), create Type II errors in
hypothesis testing, affect correlations and error variance, and reduce internal consistency
reliability estimates (Huang et al., 2012; Meade & Craig, 2012). CR also may affect
conclusions made from factor analyses and correlations among items (Meade & Craig, 2012).
These psychometric issues bring up concerns for the collection of data for scale development
and general data-based decisions we draw from research (Ward & Pond, 2015; Woods,
2006). Researchers clean data as a common part of data analysis (Tabachnick & Fidell,
2007), but only recently have methods for CR detection become regularly utilized.
Detecting Careless Response
There are two general approaches to CR detection: methods that require the inclusion
of material before data collection (i.e., a priori) and methods that can be used post-hoc. A
priori methods involve the inclusion of particular survey items. Researchers may include
items that ask participants in a straightforward manner to indicate their level of engagement
throughout the survey or if they believe their effort was sufficient and thus worthy of
inclusion in the data set (Meade & Craig, 2012). In addition to self-report items, researchers
also include instructed-response and bogus items that provide a clear indication of whether
the participant is paying attention to item content. Instructed-response items ask the
participant to select a particular answer (e.g., “Select option D for this item”), while bogus
items, often nonsensical, have one clear correct answer (e.g., “All my friends are aliens”)
(Meade & Craig, 2012). If the participant provides an “incorrect” response for instructedresponse or bogus items, it is clear that he/she is not paying attention to survey content, at
least at that particular point in time.
The other approach to detect CR is via indices involving computation after data is
collected. Meade and Craig (2012) conducted an extensive analysis of the different CR
indicators and recommend that researchers use a multi-dimensional approach. They found
that self-report items were insufficient in detecting CR, as they were not highly correlated
with the objective indicators. In other words, some participants flagged as careless by
statistical analyses did not self-report insufficient effort in their responses. Meade and Craig
(2012) suggest the use of a combination of built-in items and statistical analyses, such as
computing participants’ Mahalanobis distance, Even-Odd Consistency, Maximum
LongString, or Psychometric Synonyms scores. Meade and Craig’s approach has become a
standard for the detection of CR.
Simply detecting CR, however, is not enough. The next logical step is to develop a
plan for how to address identified careless respondents in the data set. One approach is to
eliminate careless responders from analyses (Tabachnick & Fidell, 2013). However, as noted
by Ward and Pond (2015), the elimination of careless respondents has the potential to reduce
sample sizes and affect the distribution of dependent variables in a biased way (i.e., by
eliminating a certain “type” of respondent). In fact, Bowling et al. (2016) recently found that
participants’ CR rates related to their college grade point average, class absences, and
acquaintance-reported personality traits. A potentially biased reduction of the sample size
could limit the external validity of the research. Instead of eliminating CR after it happens, it
is important to examine ways to prevent CR from occurring in the first place. In an attempt
to improve the methodology of survey data collection, researchers should investigate
procedures that reduce inattention and ensure that participants carefully consider their
responses. If not, researchers and practitioners might use poor data to justify decisions and
actions (Ward & Pond, 2015).
Finding Ways to Prevent Careless Response
Meade and Craig (2012) suggest four main reasons for the occurrence of CR: low
respondent interest, length of surveys, lack of social contact, and lack of environmental
control. The present study addressed two of these concerns, social contact and environmental
control, in an attempt to increase respondent attention and in turn reduce CR.
Social interaction. As noted by Meade and Craig (2012), online survey
participation is passive. Internet-based methods present a change to the original paper-and-pencil administration of surveys in the presence of a proctor. There is minimal, if any,
interaction between the researcher and participant. Johnson (2005) argues that the physical
distance between the researcher and participant and the lack of personalization in online
surveys combine to reduce the participant’s accountability. Additionally, Dillman, Smyth,
and Christian (2009) emphasize the importance of social contact in research, particularly in
online surveys. They argue that survey design should be highly tailored in a way that
improves the social interaction between the researcher and the participant. One such way
suggested to improve this interaction is with the use of a virtual human embedded in online
surveys (Ward & Pond, 2015). The current study expands upon the work of Ward and
Pond (2015) and further examines the use of a virtual human to reduce CR.
In the presence of an observer (such as a proctor for an in-person survey),
individuals tend to put forth more effort on tasks (Zajonc, 1980). However, the relationship
between observer presence and performance is complex. The social facilitation effect is the
phenomenon in which individuals perform better on simple tasks (and worse on complex
tasks) in the presence of another individual as opposed to when alone (Zajonc, 1980).
Many theories attempt to address why a change in performance occurs in the presence of
others. Researchers argue that the mere presence of another person could enhance or hinder
the drive within a person to complete a task (Zajonc, 1980), or even lead a person to shift
his/her cognitive processing capacity, as moderated by variables like task complexity and
evaluation context (Baron, 1986). Most notable of the social facilitation theories for the
present study are those regarding social factors. Cottrell (1972) believes that the presence
of another person makes individuals concerned with how they will be evaluated. An
evaluation apprehension drives the individual to perform, and previous experiences with
evaluation contribute to individual drive reactions. On simple tasks, this evaluation
apprehension and drive leads to better performance, but on complex tasks, the individual
may place too much pressure on himself/herself and fail. Similarly, Duval and Wicklund
(1972) argue that self-awareness, or a focus on the self in a way that considers the view of
others, leads to improved performance possibly because the individual tends to focus
attention inward (Carver & Scheier, 1981). Overall, the social facilitation effect occurs
when an individual puts more effort into a task in the presence of an observer.
The completion of a survey falls into the category of a simple rather than a complex
task. Most online surveys contain personality measures and self-report items that require
limited cognitive effort. Therefore, one argument in favor of a proctor for surveys is that
the presence of another person induces a social facilitation effect in which the participant is
motivated to put forth effort in the task. Research has shown that individuals may have the
same kinds of responses to virtual humans as they do to actual humans (Gratch et al., 2007;
Park, 2009). Extensive work by Bailenson and colleagues has shown benefits to virtual
reality, specifically in work and training environments. Virtual reality has been used in a
wide range of applications, from perception research on facial recognition to motion
training for high-stakes individuals like medical students. Bailenson et al. (2008) argue that
virtual reality may provide benefits over and above face-to-face learning due to
customizability. While creating an actual virtual reality environment for survey
administration is impractical and unnecessary, researchers have attempted to capitalize on
the customizability benefits of virtual worlds through simpler applications such as
including a virtual agent on a webpage (e.g., Ward & Pond, 2015). Behrend and Thompson
(2011, 2012) showed that virtual humans can increase attention and accountability, and
Park and Catrambone (2007) argue that these increases are likely due to an enhanced sense
of interaction during the virtual experience. In essence, an introduction of a virtual proctor
means the individual is no longer simply providing answers to a machine. That individual
is taking part in a social interaction with the virtual human.
Ward and Pond (2015) recently combined a virtual human with varying survey
instructions in an attempt to reduce CR. They found that when paired with warning
instructions (explicitly stating that the individual would be removed from the data set if CR
was detected), CR significantly decreased. However, this was the only condition in which
CR decreased—when they paired the virtual human with normal survey instructions, CR did
not decrease. In Ward and Pond’s study, the virtual human remained present throughout the
survey, but did not have any true interaction with the individual. In the present study, I
attempted to enhance the experience of the social interaction by having the virtual human
provide progress feedback throughout the survey. Researchers have used progress feedback
strategies such as progress bars or textual messages in an effort to reduce survey attrition
(i.e., to motivate participants to finish a lengthy survey rather than stop in the middle) (Yan,
Conrad, Tourangeau, & Couper, 2010). Results of progress feedback effectiveness in
reducing survey attrition are inconclusive. The use of progress feedback in the present study
was not in an effort to reduce attrition, but rather in an effort to enhance the social interaction
between the virtual proctor and the participant by providing a reminder of the observer’s
presence. In turn, the salience of the proctor’s presence was intended to increase attention
and decrease CR.
Emphasis on attention in instructions. Also in an attempt to enhance participant
attention, I tested the effect of an emphasis on the proctor in the survey instructions. Other
researchers have modified instructions to enhance participant attention and reduce CR,
particularly through warning instructions (e.g., Huang et al., 2012; Ward & Pond, 2015). As
mentioned, Ward and Pond (2015) found that when paired with a virtual human, warning
instructions significantly reduced CR. However, warning instructions have a negative effect
on attitudes about surveys (Meade & Craig, 2012). Thus, it is worth investigating ways to
maintain normal instructions and reduce CR without also influencing negative attitudes in
participants. Normal survey instructions, as defined by Huang et al. (2012), emphasize
honesty, accuracy, and anonymity. In light of research on CR, I argue that an emphasis on
attention be added to normal survey instructions.
When taking a lengthy survey, a participant may struggle to sustain the allocation of
cognitive resources to survey content. One study found that 30% of participants during an
online survey reported experiencing one or more distractions (Zwarun & Hall, 2014). As
noted previously, 52% of participants in the 18-24 age range reported multitasking on
another electronic-based activity during a survey. Those that reported multitasking also
reported feeling more distracted throughout the survey. To address the issues of multitasking
and distraction, I modified survey instructions to emphasize the importance of paying
attention.
Research on attention regulation, particularly on cognitive resource theory, states that
individual differences in attentional resources interact with difficulty of task demands to
predict performance (Randall, Oswald, & Beier, 2014). Thus, designing a survey that can
maintain the attention of individuals with differing levels of attention in uncontrolled
environments is a challenging task. Executive control theory suggests that in pursuit of a
goal, individuals attempt to control their cognitive resources by focusing their attention on
task-relevant information and blocking out or ignoring other information (Kane, Poole,
Tuholski, & Engle, 2006). If the instructions of the survey emphasize attention as an
important piece of the completion of the task, participants should devote more attention to
items. In line with executive control theory, individuals should maintain attentional control
to the survey in an effort to complete their goal (i.e., completion of the survey).
Present Study
In the present study, I predicted that the presence of a virtual proctor during a survey
would produce a social facilitation effect in which participants regulated their attention to put
effort into the survey, likely by limiting multitasking activities. In turn, I predicted that
participants’ guided attention would reduce rates of CR compared to online participants that
did not have a virtual proctor.
Hypothesis 1: Individuals who take an Internet-based survey with a virtual proctor
will display lower rates of careless response than individuals who take an Internet-based
survey without a proctor.
Because the virtual proctor plays the same role as an in-person observer, I expected
individuals to perform similarly in the presence of virtual and real humans. I expected an
in-person proctor to produce a similar social facilitation effect to that of the virtual proctor.
Thus, I predicted that in-person survey participants proctored by a real human would self-regulate their attention and produce lower rates of CR than those that were not proctored
online.
Hypothesis 2: Individuals who take an in-person survey in the presence of a proctor
will display lower rates of careless response than individuals who take an Internet-based
survey without a proctor.
Method
Participants
Participants were 254 undergraduate students at a large southeastern United States
university who received credit in an introductory psychology course for participation. One
participant was removed from the data set because he/she omitted more than a page of
items, leaving a final set of 253 participants for analyses. The mean age of participants was
18.84 years (SD = 1.42), although one participant did not indicate his/her age, and
approximately 59% of the sample was female. Approximately 78% of respondents identified
as Caucasian and/or European American, 8% as Asian and/or Asian American, 8% as
African and/or African American, 4% as Hispanic, and 2% as other races.
Study Design
This study used an experimental between-subjects design with survey condition as the
independent variable and several CR indicators as dependent variables (diligence, interest,
bogus item sum, instructed-response sum, Mahalanobis distance, Even-Odd Consistency,
Maximum LongString, and Psychometric Synonym score). The study had three conditions.
One condition consisted of a survey taken in-person with a proctor in a classroom. The other
two conditions were Internet-administered. One Internet-administered condition included a
virtual proctor and one, the control, did not. Because remote online data collection has
become a standard due to convenience, idiosyncrasies among on-site survey methodology
were not of interest, and thus I did not include other on-site conditions. I wanted to
investigate whether or not a virtual human proctor on a remote survey produced similar CR
rates to that of an in-person proctor. Additionally, I used the in-person condition as a means
to create a setting in which environmental distractors were minimal. Therefore, it was only
necessary to include one on-site condition to compare against the remote surveys.
In sum, the study contained three conditions (in-person, virtual proctor, control) that
represented the three levels of the independent variable. The dependent variables were a set
of CR indicators.
Procedure
Participants were recruited via introductory psychology course requirements and
signed up through an online portal. The survey description indicated that participants would
be randomly assigned to take a survey on campus or online. Students randomly assigned to
the in-person condition received an email with the time slot options and signed up for a time
of their choice. I set limits such that each survey sitting contained between two and five
participants in an attempt to ensure a controlled environment. Participants placed in the two
online-administered conditions received a hyperlink to the survey and were told that it would
take approximately one hour to complete.
The survey consisted of a lengthy personality measure, for which I evaluated the
prevalence of CR in responses. Additionally, the survey included an attention scale and
items assessing environmental distractors during the survey. All three conditions began with
a set of normal survey instructions that stated: “There are no correct or incorrect answers on
this survey. Please respond to each statement or question as honestly and accurately as you
can. Your answers cannot be linked to your identity” (adapted from Huang et al., 2012).
The last sentence is a modification of Huang et al.’s indication that answers would be kept
confidential. Confidentiality might imply that answers are linked to the participant but
remain private. I wanted to make it clear not only that answers would be kept private, but
also that there would be no way of linking answers to an individual person, in order to encourage
accuracy and honesty.
In-person condition. In-person survey sessions were either proctored by the
researcher or by an undergraduate research assistant. Both proctors followed a script in order
to maximize the similarity of in-person sessions. At the beginning of each session, the
proctor provided participants a university laptop and emailed them the survey link. Once all
participants had arrived, the proctor read the instructions aloud as the students read along on
their screens. Following the normal instructions, the instructions emphasized both the
proctor’s presence and its purpose—to maintain their attention to the survey. The
instructions read: “To ensure the quality of survey data, this survey is being proctored by the
instructor in the room. The proctor will walk around the room at various times in an effort to
encourage you to maintain full attention to the survey.” During the session, the proctor sat in
the front of the room. At every 10-minute interval, the proctor stood up and walked around
the room, paused behind each student for a 5-second period, and then returned to the front.
Virtual proctor condition. In the virtual proctor condition, participants took the
survey remotely online. The survey contained a “virtual proctor,” which was a moving
image of a person’s head and shoulders placed in the upper left-hand corner of the screen,
adapted from Ward and Pond (2015) (see Figure 1). The virtual human was gender and race
neutral and exhibited lifelike movement including breathing and blinking. The virtual human
remained visible throughout the survey as the participant scrolled up and down each page.
Following the normal instructions, the survey instructions stated: “To ensure the quality of
survey data, this survey is being proctored by a virtual human. The proctor will provide you
updates on your progress in an effort to encourage you to maintain full attention to the
survey.” Throughout the survey, after the participant completed two pages, text appeared in
a speech box below the proctor, stating: “You have completed 102 questions of the survey.
Please continue.” This statement continued after completion of page 4 with the proper
number of questions stated. After page 6, the text said: “You have 64 questions remaining in
the survey. Please continue.” This statement continued on page 8 with the proper number of
questions.
Control condition. For the control condition, participants took the survey remotely
online without a proctor. These participants received the initial set of normal instructions
with no additional information. This condition represented the typical online survey
experience of undergraduate research participants.
Materials
Personality. The survey included the 300-item International Personality Item Pool
(IPIP; Goldberg, 1999). This personality scale is typical of those used in long surveys
(Meade & Craig, 2012). Response options for all 300 items consisted of 5-point Likert-type
scales with options ranging from 1 (very inaccurate) to 5 (very accurate).
Distraction. The survey also included 16 items that asked participants to indicate the
number of times they experienced environmental distractions throughout the survey (Ward &
Osgood, 2015; see Table 23). Example items are “I looked at another webpage other than the
survey page,” “I read a text message,” and “I was listening to music.” Response options
were “0 times,” “1-2 times,” “3-4 times,” and “5 or more times” coded from 0-3. I summed
the responses to these 16 items to produce an environmental distractor frequency score for
each participant.
Attention. The survey included an 8-item attention scale that measured selective and
divided attention throughout the survey (Yentes, 2015; α = .91; see Table 24). An example
item is: “I was completely focused on the survey.” I averaged scores on these eight items to
produce attention scores.
Proctor perceptions. The survey included two items that assessed participants’
perceptions of the evaluative capabilities of the proctor (e.g., “A proctor monitored my
activity during this survey”) with response options ranging from 1 (strongly disagree) to 5
(strongly agree).
Measures to test hypotheses. I used eight indicators to assess CR and test my
hypotheses. I based my selection of these indices on suggestions made by Meade and Craig
(2012). At the proposal stage, I intended to treat all CR indicators as continuous variables.
Upon further consideration, I believe that most of the CR indicators are not necessarily
continuous and would be better treated as dichotomous categorical variables. Theoretically,
there is no continuum of responding more or less carefully; there are only those people who
respond carelessly on a given item and those who do not. Relatedly, in practice, researchers
use CR indicators in order to flag respondents as careless or not careless. The only clearly
continuous CR indicators I utilized were engagement scale scores (diligence and interest), as
theoretically, individuals fall on a continuum of the abstract constructs of diligence and
interest. As a result of these new considerations, I conducted additional analyses on
dichotomous versions of the other CR indicators (bogus items, instructed-response items,
Mahalanobis distance, Even-Odd Consistency, Maximum LongString, and Psychometric
Synonym scores). For each variable, I present the steps taken to compute the continuous
indicator followed by the steps taken to compute the dichotomous indicator.
Engagement scale. I used the 15-item engagement scale developed by Meade and
Craig (2012) that measures participant self-reported diligence and interest throughout the
survey. Response options were on a 5-point Likert-type scale from 1 (strongly disagree) to 5
(strongly agree). Examples of items are “I carefully read every survey item” (diligence) and
“This study was a good use of my time” (interest). I computed one diligence and one interest
score for each participant by averaging responses to the items of each subscale. Diligence
and interest scores could range from 1 (low diligence/interest) to 5 (high diligence/interest).
Meade and Craig (2012) validated this scale, and it demonstrated reliability in my sample for
both diligence (α = .90) and interest (α = .84). Engagement scale scores are the only CR
indicator that I did not dichotomize.
Bogus items. Bogus items have a clear “correct” answer. The survey included three
bogus items placed among items from the IPIP scale, appearing on alternate pages of the
survey. The items were “I have been to every country in the world,” “I sleep less than one
hour per night” (Meade & Craig, 2012), and “I work fourteen months in a year” (Huang et al.,
2012). Response options ranged from 1 (very inaccurate) to 5 (very accurate). For the item
“I have been to every country in the world,” 212 respondents chose option 1 (very
inaccurate) and 14 respondents chose option 2 (moderately inaccurate). A similar pattern
existed in responses to the item “I sleep less than one hour per night,” for which 216
individuals chose option 1 and 24 chose option 2. I coded response options 1 and 2 as
correct for these items, since a response of “moderately inaccurate” does not necessarily
indicate that a respondent was being careless. It is possible that well-traveled respondents
chose “moderately inaccurate” because they have been to many countries, but not every
country in the world. Additionally, some individuals may have slept less than one hour in a
night at some point in time, and chose “moderately inaccurate” instead of “very inaccurate.”
Thus, for these two bogus items, I coded responses a zero for options 1 and 2 (correct), and a
one for options 3 through 5 (incorrect). The third bogus item was “I work 14 months in a
year.” Upon examination of this item’s response distribution, I noticed that in addition to a
larger than expected frequency for option choice 2 (35 participants), an even larger number
(42 participants) chose option 3 (neither accurate nor inaccurate). Participants may have
been confused by this item, as it is not possible to work more months than exist in a year, and
thus chose option 3. Following this logic, a choice of option 3 does not necessarily indicate
carelessness. Thus, I coded responses for this item such that options 1-3 received a zero
(correct) and 4-5 received a one (incorrect).
Continuous bogus sum. After recoding the three bogus items, I computed a bogus
sum for participants by summing their scores from the three items. A participant’s bogus
sum could range from zero (no bogus items missed) to three (all bogus items missed).
Higher bogus sums indicate higher CR.
Dichotomous bogus flag. I also recoded the bogus scores dichotomously such that
participants with a bogus sum of one or more received a one, and participants with a bogus
sum of zero received a zero. Thus, a bogus flag score of one represents “flagged as careless
by bogus sum” and a score of zero represents “not flagged as careless by bogus sum.” Fortynine individuals, or 19.37% of the sample, were flagged as careless by bogus sum.
Instructed-response items. Instructed-response items provide a clear metric for scoring
correct or incorrect responses (Meade & Craig, 2012). An example of an instructed-response item is: “Select ‘very inaccurate’ for this item.” The survey included three
instructed-response items, with response options ranging from 1 (very inaccurate) to 5 (very
accurate), placed among items from the IPIP scale, on alternate pages of the survey. Thus,
each page of 50 IPIP items contained either an instructed-response or bogus item. These
items were also scored dichotomously as careless or not careless for each participant. For
each item, participants received a zero if they chose the correct option and a one if they chose
any other option.
Continuous instructed-response sum. I calculated instructed-response sums by
adding participants’ scores on the three items. Instructed-response sums could range from
zero (no instructed-response items missed) to three (all instructed-response items missed).
Higher instructed-response sums indicate higher CR.
Dichotomous instructed-response flag. I dichotomized instructed-response scores
using the same strategy as bogus items. Participants with an instructed-response sum of one
or more received a one, representing “flagged as careless by instructed-response” and those
with an instructed response sum of zero received a zero, representing “not flagged as careless
by instructed-response.” Due to missing data, I was able to compute instructed-response
sums for 250 participants. Forty participants, or 16% of the sample, were flagged as careless
by instructed-response.
Mahalanobis distance. Mahalanobis distance is an outlier statistic that quantifies how far a
participant’s response pattern deviates from the sample’s average response pattern. I computed Mahalanobis distance scores based on
strategies used by both Meade and Craig (2012) and Ward and Pond (2015).
Continuous Mahalanobis distance. For the continuous Mahalanobis distance, I
calculated each respondent’s distance from the average response pattern on a set of items
(Meade & Craig, 2012). I computed Mahalanobis distance scores on each Big 5 personality
trait by calculating each participant’s distance from the average response pattern on each
scale. Then, I averaged these five scores to give participants one Mahalanobis distance
score. Higher Mahalanobis distance scores indicate a higher propensity to be an outlier, and
thus higher CR.
Dichotomous Mahalanobis distance flag. I considered it useful to create a cut score
for Mahalanobis distance such that values above a certain number indicated carelessness. I
calculated the mean (M = 59.54) and standard deviation (SD = 17.36) of the full sample’s
Mahalanobis distance scores, and decided on a cut score two standard deviations above the
mean (94.27). Then, I dichotomized the Mahalanobis distance scores such that participants
with a score greater than 94.27 received a one (flagged as careless by Mahalanobis distance),
and those equal to or below 94.27 received a zero (not flagged as careless by Mahalanobis
distance). Due to missing data, I computed the Mahalanobis distance for 246 participants.
Only nine participants, or 3.36% of those that received a Mahalanobis distance score, were
flagged as careless by Mahalanobis distance.
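A hedged R sketch of this computation follows; it illustrates the approach rather than reproducing the original script, and scale_items is a hypothetical list mapping each Big Five trait to its item columns in the data frame dat.

# Illustrative sketch only: Mahalanobis distance within each Big Five scale,
# averaged into one score per participant, then dichotomized at M + 2 SD.
maha_by_scale <- sapply(scale_items, function(cols) {
  x <- dat[, cols]
  mahalanobis(x, center = colMeans(x, na.rm = TRUE), cov = cov(x, use = "complete.obs"))
})

# Average the five per-scale distances into one score per participant.
dat$maha <- rowMeans(maha_by_scale)

# Cut score two standard deviations above the mean; higher values are flagged.
maha_cut <- mean(dat$maha, na.rm = TRUE) + 2 * sd(dat$maha, na.rm = TRUE)
dat$maha_flag <- ifelse(dat$maha > maha_cut, 1, 0)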
Even-Odd Consistency. The Even-Odd Consistency assesses the extent to which
participants respond consistently to items that measure the same construct. In theory,
individuals should respond similarly to items that measure the same thing, and thus,
inconsistency is an indication of carelessness.
Continuous Even-Odd score. I broke down each of the 30 10-item trait subscales
within the IPIP into two 5-item subscales. Based on the order participants viewed the items,
I labeled the items from one to ten and split them into subscales by even and odd numbers.
Next, I averaged participants’ responses to the subscales to compute subscale scores. Then, I
computed a correlation for each participant’s set of subscale scores to provide each
participant one Even-Odd Consistency score. Even-Odd Consistency scores could range from
-1.00 to 1.00, with lower values representing less consistency and higher CR.
Dichotomous Even-Odd flag. Again, I found it practically appropriate to dichotomize
the Even-Odd measure in order to set a cut score below which values represented
carelessness. I calculated the mean (M = 0.78) and standard deviation (SD = 0.17) of the
Even-Odd Consistency scores and set the cut score at two standard deviations below the
mean (0.45). Participants that received an Even-Odd Consistency score below 0.45 were
coded a one (flagged as careless by Even-Odd) and participants with Even-Odd Consistency
scores equal to or above 0.45 were coded a zero (not flagged as careless by Even-Odd). I
was able to compute Even-Odd flag scores for all but one participant, and 12 participants, or
4.76% of the sample, were flagged as careless by Even-Odd Consistency.
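A rough R sketch of the Even-Odd computation is shown below under the same caveats; subscale_items is a hypothetical list of 30 elements, each holding a trait subscale’s ten item columns in the order participants viewed them, and the original script may have handled missing responses differently.

# Illustrative sketch only. Half-scale scores from the odd- and even-numbered
# items within each 10-item subscale.
odd_scores  <- sapply(subscale_items, function(cols) rowMeans(dat[, cols[c(1, 3, 5, 7, 9)]],  na.rm = TRUE))
even_scores <- sapply(subscale_items, function(cols) rowMeans(dat[, cols[c(2, 4, 6, 8, 10)]], na.rm = TRUE))

# One Even-Odd Consistency score per person: the correlation between his/her
# 30 odd half-scale scores and 30 even half-scale scores.
dat$even_odd <- sapply(seq_len(nrow(dat)), function(i) {
  cor(odd_scores[i, ], even_scores[i, ], use = "complete.obs")
})

# Cut score two standard deviations below the mean; lower values are flagged.
eo_cut <- mean(dat$even_odd, na.rm = TRUE) - 2 * sd(dat$even_odd, na.rm = TRUE)
dat$even_odd_flag <- ifelse(dat$even_odd < eo_cut, 1, 0)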
Maximum LongString. The LongString CR indicator detects patterns in responses in
which a participant repeatedly chooses the same response option. I calculated Maximum
LongString values by identifying the longest string of identical responses in each
participant’s data on the 300 IPIP items.
Continuous Maximum LongString. I calculated the Maximum LongString as the length of
the longest string for which each participant chose the same response option. For
example, a Maximum LongString value of 7 indicates that the maximum string of
consecutive identical responses within a participant’s data set was seven items long. Higher
Maximum LongString values represent longer strings of responses and thus, higher CR.
Dichotomous Maximum LongString flag. With only five response options, a
participant could plausibly choose the same option several times in a row without being careless.
Researchers should not necessarily care about the difference between strings of five or six
responses in a row, but rather care about a point at which a Maximum LongString value
indicates carelessness. In order to create a cut score, I temporarily withheld extreme outliers
from the sample (i.e., three respondents that had Maximum LongString values of 105, 254,
and 306). Then, I computed the mean (M = 6.32) and standard deviation (SD = 2.75) for
Maximum LongString and set the cut score at two standard deviations above the mean
(11.82). I rounded up for a cut score of 12 responses in a row. Finally, I returned the outliers
to the data set and dichotomized Maximum LongString scores such that participants with a
score of 12 or greater received a one (flagged as careless by Maximum LongString) and
those with scores below 12 received a zero (not flagged as careless by Maximum
LongString). I computed Maximum LongString scores for the entire sample, and 12
participants, or 4.42%, were flagged as careless by Maximum LongString.
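A brief R sketch using base R’s rle() to find each participant’s longest run of identical responses is given below; ipip_cols is a hypothetical vector of the 300 IPIP item columns in presentation order, and the cut score of 12 is the one derived above.

# Illustrative sketch only, not the original script.
dat$max_longstring <- apply(dat[, ipip_cols], 1, function(resp) {
  max(rle(resp)$lengths)   # length of the longest run of identical responses
})

# Dichotomous flag at the cut score of 12 consecutive identical responses.
dat$longstring_flag <- ifelse(dat$max_longstring >= 12, 1, 0)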
Psychometric Synonyms. Psychometric Synonyms provide another way to assess
participants’ consistency in responses to related items. For Psychometric Synonyms, I
determined item pairs with correlations above .60 within the sample, and examined within-person variability on these “synonymic” items. For highly correlated items, high within-person variability indicates inconsistent responding.
Unfortunately, I was unable to calculate Psychometric Synonym scores for the full
sample. In an effort to measure true carelessness, I did not force responses to items in the
survey. Thus, participants left some items blank. Psychometric Synonym scores could not
be computed for 72 participants because they skipped at least one survey item, and the R
script could not accommodate missing values. I tested whether or not there was a biased
reduction in sample size by comparing the condition sample sizes for the full sample versus
the sample that could receive Psychometric Synonym scores. The chi-square test for
independence indicated no significant differences between the groups, χ2(2) = 0.29, p =.87.
The in-person group dropped from 83 to 55 participants, a 34% reduction in sample size.
The virtually proctored group dropped from 87 to 64 participants, a 26% reduction, and the
control group dropped from 83 to 62 participants, a 25% drop in sample size. I computed
Psychometric Synonym scores for the remaining 181 participants.
Continuous psychometric synonyms. There were 61 item pairs in the data set with
correlations above .60. I calculated individuals’ Psychometric Synonym scores by
correlating responses on the selected item pairs. Values could range from -1.00 to 1.00, and
lower Psychometric Synonym scores indicate less consistency and higher CR.
Dichotomous psychometric synonym flag. I calculated the mean (M = 0.67) and
standard deviation (SD = 0.15) and set a cut score at two standard deviations below the mean
(0.36). Then I dichotomized the Psychometric Synonym scores such that respondents with
scores below 0.36 received a one (flagged as careless by Psychometric Synonyms) and those
at or above 0.36 received a zero (not flagged as careless by Psychometric Synonyms). Of the
181 individuals for whom I computed Psychometric Synonyms, 8 participants, or 4.42%,
were flagged as careless by Psychometric Synonyms.
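A hedged R sketch of the Psychometric Synonym computation follows; ipip is a hypothetical numeric matrix of the 300 IPIP responses, and the original script may have selected pairs or treated missing data somewhat differently.

# Illustrative sketch only: find item pairs correlated above .60 in the sample,
# then correlate each participant's responses across the two items of every pair.
r_mat <- cor(ipip, use = "pairwise.complete.obs")
r_mat[lower.tri(r_mat, diag = TRUE)] <- NA            # keep each pair only once
pairs <- which(r_mat > .60, arr.ind = TRUE)           # "synonymic" item pairs

# Within-person correlation across the item pairs (returns NA when any item is
# missing, mirroring the listwise limitation described above).
dat$psy_syn <- apply(ipip, 1, function(resp) {
  cor(resp[pairs[, 1]], resp[pairs[, 2]])
})

# Cut score two standard deviations below the mean; lower values are flagged.
ps_cut <- mean(dat$psy_syn, na.rm = TRUE) - 2 * sd(dat$psy_syn, na.rm = TRUE)
dat$psy_syn_flag <- ifelse(dat$psy_syn < ps_cut, 1, 0)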
Results
My hypotheses stated that both in-person and virtually proctored participants would
score lower on CR indicators when compared to control participants. For several of the CR
indicators, only small portions of the sample were flagged as careless and I was unable to
conduct chi-square tests for independence. I still utilized the dichotomous version of these
variables for a composite summation score to test for group differences in carelessness.
I conducted the proposed continuous analyses on each variable and then conducted
additional analyses that treated CR indicators as dichotomous variables. After presenting
checks for extraneous effects, I present the continuous and dichotomous analyses results by
indicator, followed by a set of multivariate analyses.
Group Equivalence Checks
Before running the analyses to test my hypotheses, I checked for time-of-semester
effects for the general sample, and for proctor identity and group size effects within the in-person condition. To do so, I tested whether or not the CR indicators related to these
variables in both their continuous and dichotomous form.
Time of semester. During data collection, an error interrupted the random
assignment of participants to groups. Some participants originally assigned to the in-person
condition had the ability to refresh their browser and fall into one of the online conditions.
Upon detection of this error, I fixed the problem; however, I acknowledge the fact that
participants had a slightly higher chance of falling into an online condition rather than in-person during the beginning of the semester versus the end. In order to check whether time-of-semester influenced my dependent variable of interest, CR, I compared participants’
careless response rates at the beginning versus the end of the semester. I divided the data set
into two groups. The “Early” group included all individuals who participated on or before
October 5th, 2015 and the “Late” group included all those that participated on or after
October 6th, 2015, the day the issue was resolved. First, I conducted t-tests to compare the
means of the Early and Late groups on the continuous version of all CR indicators (diligence,
interest, bogus sum, instructed-response sum, Mahalanobis distance, Even-Odd Consistency,
Maximum LongString, and Psychometric Synonym score). None of the t-tests indicated
significant differences in CR rates between the two groups (.06 < p < .82). See Table 6 for t-test results. Additionally, I intended to conduct chi-square tests to compare the distributions
of the Early and Late groups on the dichotomous CR indicators for all variables besides
diligence and interest. Due to the small number of flagged cases, I was unable to conduct
chi-square tests for condition effects on the Mahalanobis distance flag, Even-Odd flag,
Maximum LongString flag, and Psychometric Synonym flag. Thus, I only tested for time-of-semester effects on the bogus and instructed-response flags. For the bogus flag, the chi-square test indicated no differences in bogus flags by time-of-semester, χ2(1) = 0.00, p =
1.00. However, the chi-square test for instructed-response was significant, χ2(1) = 3.95, p =
.047, indicating a difference in flagged versus not-flagged patterns in the Early and Late
groups, such that participants were more likely to be flagged for instructed-response during
the first half of the semester. To address this effect, I controlled for time of semester when
evaluating the condition effects on the instructed-response dichotomous variable.
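A sketch of these time-of-semester checks in Python follows. The data file, the column names, and the use of unequal-variance (Welch) t-tests, suggested by the fractional degrees of freedom reported in Table 6, are assumptions made for illustration.

```python
# Sketch of the time-of-semester equivalence checks (assumed file and columns).
import pandas as pd
from scipy import stats

df = pd.read_csv("survey_data.csv", parse_dates=["date"])  # hypothetical file
early = df[df["date"] <= "2015-10-05"]
late = df[df["date"] >= "2015-10-06"]

# Welch t-tests on each continuous CR indicator.
for col in ["diligence", "interest", "bogus_sum", "ir_sum",
            "mahalanobis", "even_odd", "max_longstring", "psychsyn"]:
    t, p = stats.ttest_ind(early[col].dropna(), late[col].dropna(),
                           equal_var=False)
    print(f"{col}: t = {t:.2f}, p = {p:.2f}")

# Chi-square tests on the two flags with enough flagged cases.
for flag in ["bogus_flag", "ir_flag"]:
    table = pd.crosstab(df["date"] <= "2015-10-05", df[flag])
    chi2, p, dof, _ = stats.chi2_contingency(table, correction=False)
    print(f"{flag}: chi2({dof}) = {chi2:.2f}, p = {p:.3f}")
```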
In-person group size. In the in-person condition, participants took the survey in
groups ranging from two to five people (in one unique case, only one individual showed up
for the survey). In order to rule out an effect of the number of participants present during a
survey sitting, I conducted a one-way ANOVA for seven of the eight continuous CR
indicators (excluding Maximum LongString) with group size as the independent variable.
None of the ANOVA results indicated a significant group size effect on CR (.37 < p < .83;
see Table 8). Since the distribution of the Maximum LongString scores was positively
skewed, I conducted the non-parametric alternative to the ANOVA test, a Kruskal-Wallis
test, to check for group size effects on Maximum LongString, and the test did not indicate a
significant effect, H(4) = 5.45, p = .24. I also conducted logistic regressions for the
dichotomous versions of bogus and instructed-response flags, and did not find significant
group size effects on the dichotomized CR indicators (p = .20, p = .94).
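These group-size checks can be sketched as follows; the data file, the column names (condition, group_size, and the indicator columns), and the use of scipy and statsmodels are assumptions for illustration rather than the thesis's actual code.

```python
# Sketch of the in-person group-size checks (assumed file and column names).
import pandas as pd
from scipy import stats
import statsmodels.formula.api as smf

df = pd.read_csv("survey_data.csv")  # hypothetical file
inperson = df[df["condition"] == "in_person"]

# One-way ANOVA with group size as the factor (shown here for diligence).
f, p = stats.f_oneway(*[g["diligence"].dropna()
                        for _, g in inperson.groupby("group_size")])

# Kruskal-Wallis for the positively skewed Maximum LongString scores.
h, p_kw = stats.kruskal(*[g["max_longstring"].dropna()
                          for _, g in inperson.groupby("group_size")])

# Logistic regression of a dichotomous flag on group size.
logit = smf.logit("bogus_flag ~ group_size", data=inperson).fit(disp=0)
print(f, p, h, p_kw)
print(logit.summary())
```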
In-person proctor identity. Relatedly, two different proctors ran the sessions for the
in-person condition, the researcher and an undergraduate research assistant. I grouped in-person participants based on their proctor, and ran t-tests to compare group means on the
continuous CR indicators. The researcher proctored 50 participants and the assistant
proctored 33. Seven of the eight t-tests indicated no significant differences between proctor
groups (.10 < p < .79; see Table 7). However, the t-test for bogus sum indicated significant
differences between mean bogus sums of those proctored by the researcher (M = .06) and
those proctored by the assistant (M = .27), t(44) = 2.48, p = .02, 95% CI [.04, .39]. Thus, I
controlled for in-person proctor identity when running analyses on the bogus sum variable.
I also conducted chi-square tests to check for proctor identity effects on the
dichotomized instructed-response and bogus flags and found similar results such that no
significant differences existed for the instructed-response flags, χ2(1) = 1.68, p = .20, but the
bogus flag test was significant, χ2(1) = 5.66, p = .02, such that in-person participants were
more likely to receive a flag for bogus items when the assistant was their proctor rather than
the researcher. Thus, I controlled for in-person proctor identity when conducting analyses on
the bogus flag variable as well.
Analyses to Test Hypotheses
I present the results by indicator for both the continuous and dichotomous treatment
of each variable (for all variables besides engagement scale scores). Then I present
multivariate results for treating the composite of CR indicators as both a continuous and
categorical variable.
Individual CR Indicator Analyses.
Engagement scale. I conducted one-way ANOVAs to test for significant differences
in diligence and interest scores across conditions. For interest, the ANOVA did not indicate
significant differences across groups, F(2, 248) = 1.00, p = .37. The in-person (M = 3.28, SD
= 0.72), virtual proctor (M = 3.12, SD = 0.79), and control group (M = 3.19, SD = 0.69) all
self-reported similar levels of interest. For diligence, the one-way ANOVA had a significant
global F value, F(2, 249) = 3.06, p = .049. I performed pairwise comparisons with a Tukey’s
Honest Significant Difference (Tukey HSD) test and found that the in-person condition
significantly differed from the control condition, such that in-person participants had higher
self-reported diligence scores (M = 4.19, SD = 0.49) than control participants (M = 3.96, SD
= 0.69, p < .05). The virtual proctor condition did not significantly differ from either group
on diligence score (M = 4.01, SD = 0.73), but in terms of effect size, fell between the in-person and control group diligence means. These results provide partial support for
Hypothesis 2, as in-person participants had higher diligence scores than control participants,
which represents less carelessness in the in-person proctored group. The results do not
provide support for Hypothesis 1, as virtually proctored participants did not significantly
differ from control participants on diligence or interest.
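The condition comparison just described, a one-way ANOVA followed by Tukey HSD pairwise comparisons, can be sketched as follows, again under assumed file and column names.

```python
# Sketch of the one-way ANOVA and Tukey HSD follow-up for diligence.
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("survey_data.csv").dropna(subset=["diligence", "condition"])

f, p = stats.f_oneway(*[g["diligence"] for _, g in df.groupby("condition")])
print(f"global F = {f:.2f}, p = {p:.3f}")

tukey = pairwise_tukeyhsd(endog=df["diligence"], groups=df["condition"])
print(tukey.summary())
```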
Bogus sum. For both the continuous and dichotomous version of the bogus sum, I
found proctor effects within the in-person condition such that bogus values were lower when
the researcher proctored the session compared to when the assistant proctored the session.
Thus, I first conducted analyses with the in-person sample collapsed across proctors, and
then ran analyses controlling for in-person proctor identity. For the one-way ANOVA with
condition (in-person collapsed, virtual proctor, control) as the independent variable and
bogus sum as the dependent variable, the global F value indicated significant differences in
bogus sums across groups, F(2, 250) = 5.20, p < .01. I performed pairwise comparisons with
the Tukey HSD test, which indicated that the virtually proctored participants had
significantly lower bogus sums (M = 0.18, SD = 0.49) than control participants (M = 0.39, SD = 0.66, p = .03). The in-person participants also had significantly lower bogus sums (M = 0.14, SD = 0.35) than control participants (p < .01). Next, I ran a one-way ANOVA with
four levels by splitting the in-person condition into groups based on proctor identity. The
one-way ANOVA was significant, F(3, 249) = 4.63, p < .01. Tukey HSD pairwise
comparisons revealed that the only significant difference was that between in-person-researcher-proctored participants (M = 0.06, SD = 0.24) and control participants (M = 0.39,
SD = 0.66, p < .01). The in-person-assistant-proctored participants did not significantly
differ from any groups on bogus sum (M = 0.27, SD = 0.45) nor did the virtually proctored
participants (M = 0.18, SD = 0.49). Interestingly, the comparison between the virtual proctor
participants and control participants became non-significant (p = .06).
For the dichotomous version of bogus sum, I first conducted a chi-square test with the
in-person condition collapsed to see if the pattern of flagged versus not-flagged bogus
respondents differed amongst the three conditions. The chi-square test was significant, χ2(2)
= 9.16, p = .01. Next, I conducted pairwise chi-square tests with a Bonferroni correction to
control for family-wise error, such that a p-value lower than .008 indicated significant
differences between groups. The chi-square test comparing the in-person group against the control, χ2(1) = 5.01, p = .03, and the virtually proctored group against the control, χ2(1) = 5.73, p = .02, both failed to reach significance with the Bonferroni correction, indicating no
significant differences between groups. Interestingly, the only post-hoc test that reached
significance was that comparing the control group against the in-person and virtually proctored group collapsed together, χ2(1) = 8.15, p = .004. In order to control for proctor identity, I conducted a chi-square test with four levels by splitting the in-person participants by proctor. The chi-square test was significant, χ2(3) = 14.92, p < .01. Again, I performed
post-hoc chi-square tests. In this case, the Bonferroni correction p-value was .006 due to the
additional level of the independent variable. The in-person-researcher-proctored participants
significantly differed from the control participants, χ2(1) = 9.52, p = .002, such that participants were more likely to be flagged by bogus sum in the control condition than the in-person-researcher-proctored group. However, the in-person-assistant-proctored participants did not significantly differ from the control participants, χ2(1) = 0.01, p = .94.
Results from both the continuous and dichotomous analyses for bogus sum indicate
that when the assistant proctored the in-person session, bogus flags were indistinguishable
from those in the two online conditions. However, when the researcher proctored the in-person session, bogus scores were reduced. In other words, the effect of a proctor on bogus
scores only held when the researcher was the proctor, and not when the assistant was the
proctor. This conditional proctor effect provides partial support for Hypothesis 2, such that
in-person respondents had lower careless response rates than control respondents, but the
effect depended on the identity of the proctor. These results indicate that the identity of the
proctor plays a role in the social facilitation effects on CR, and that the mere presence of a
proctor does not necessarily reduce CR. In regard to Hypothesis 1, the results are somewhat
challenging to interpret. For the continuous version of the bogus variable, the virtually
proctored participants had significantly lower bogus sums than control participants; however, when I controlled for proctor identity and thus added a degree of freedom to the analyses, the relationship became non-significant. Thus, it is unclear whether or not the virtual proctor
significantly reduced bogus sums. The dichotomous analysis of bogus sums clearly does not
support Hypothesis 1, as virtually proctored participants did not significantly differ from
control participants on bogus flag.
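A sketch of the pairwise chi-square follow-ups with a Bonferroni-adjusted alpha is shown below; the adjusted thresholds (.008 and .006) are those reported in the text, while the condition labels, flag column, and data file are assumptions for illustration.

```python
# Sketch of Bonferroni-corrected pairwise chi-square tests on the bogus flag.
import pandas as pd
from scipy.stats import chi2_contingency

df = pd.read_csv("survey_data.csv")  # hypothetical file and column names

def pairwise_chi2(data, flag_col, cond_col, pairs, alpha_adj):
    """2x2 chi-square test for each pair of groups, judged against an
    adjusted alpha supplied by the caller."""
    for a, b in pairs:
        sub = data[data[cond_col].isin([a, b])]
        table = pd.crosstab(sub[cond_col], sub[flag_col])
        chi2, p, dof, _ = chi2_contingency(table, correction=False)
        print(f"{a} vs {b}: chi2({dof}) = {chi2:.2f}, p = {p:.3f}, "
              f"significant: {p < alpha_adj}")

pairs = [("in_person", "control"), ("virtual", "control")]
pairwise_chi2(df, "bogus_flag", "condition", pairs, alpha_adj=.008)
```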
Instructed-response sum. For instructed-response sum, a one-way ANOVA with
instructed-response sum as the dependent variable did not indicate significant differences
across conditions, F(2, 247) = 0.12, p = .89. In-person (M = 0.22, SD = 0.55), virtual proctor
(M = 0.21, SD = 0.58), and control (M = 0.25, SD = 0.64) means all fell below one item
missed per respondent.
When I conducted effect checks, I found that the pattern of the instructed-response flag variable differed based on the time of semester at which the survey was
taken, such that participants were more likely to receive a flag for instructed-response during
the first half of the semester versus the second half. I first conducted a chi-square test on the
full sample and then controlled for time of semester by conducting chi-squares on just
“early” students and just “late” students. The chi-square test for instructed-response flag
differences across conditions for the full sample was not significant, χ2(2) = 0.41, p = .81.
Additionally, the chi-square test for just the early sample was not significant, χ2(2) = 0.32, p
= .85, nor was the chi-square test for just the late sample, χ2(2) = 1.18, p = .56. Results for
both the continuous and dichotomous treatment of the instructed-response sum do not
provide support for the hypotheses that CR rates would be lower for the in-person and virtual
proctor conditions compared to the control.
Mahalanobis distance. I conducted a one-way ANOVA with Mahalanobis distance
as the dependent variable and condition as the independent variable. The ANOVA did not
indicate significant differences across groups, F(2, 243) = 0.29, p = .75, as in-person (M =
59.44, SD = 14.33), virtual proctor (M = 58.61, SD = 19.44), and control (M = 60.66, SD =
17.95) participants had similar Mahalanobis distance scores. These results fail to support
both hypotheses, as Mahalanobis distance scores did not differ across conditions.
Next, I intended to run a chi-square test on the dichotomous Mahalanobis distance
flag variable to see if the pattern of participants flagged for carelessness by Mahalanobis
distance versus participants not flagged for carelessness by Mahalanobis distance differed
across conditions. However, only nine participants in the entire sample received a Mahalanobis distance flag, and thus, it was not possible to conduct a chi-square test.
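For reference, a per-respondent Mahalanobis distance of the kind analyzed here can be computed from the item-response matrix as sketched below. The item data frame and the two-SD flagging rule (which mirrors the cut-score logic used for the other indicators) are assumptions for illustration; the exact item set and software used in the thesis are not shown in this section.

```python
# Sketch of a per-respondent (squared) Mahalanobis distance computation.
import numpy as np
import pandas as pd

def mahalanobis_distances(items: pd.DataFrame) -> pd.Series:
    """Distance of each respondent's item-response vector from the sample
    centroid, using the inverse of the sample covariance matrix."""
    X = items.to_numpy(dtype=float)
    centered = X - X.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", centered, cov_inv, centered)
    return pd.Series(d2, index=items.index, name="mahalanobis")

# Hypothetical flagging rule: more than two SDs above the mean distance.
# d2 = mahalanobis_distances(item_df)
# mahalanobis_flag = (d2 > d2.mean() + 2 * d2.std()).astype(int)
```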
Even-Odd Consistency. I conducted a one-way ANOVA and did not find
significantly different Even-Odd Consistency scores across conditions, F(2, 249) = 1.21, p =
.30. The in-person (M = 0.81, SD = 0.13), virtual proctor (M = 0.78, SD = 0.17), and control
participants (M = 0.77, SD = 0.19) did not differ on Even-Odd Consistency scores. These
results fail to provide support for both hypotheses, as Even-Odd Consistency scores did not
significantly differ across conditions.
I also intended to conduct a chi-square test on the dichotomous Even-Odd flag
variable to investigate whether the pattern of participants flagged as careless by Even-Odd
versus participants not flagged as careless by Even-Odd differed across conditions.
However, only 12 total participants were flagged for Even-Odd, and thus, I could not conduct
the chi-square test.
Maximum LongString. The Maximum LongString distribution was positively
skewed due to outliers, which violates the normality assumption of an ANOVA test. I
conducted the non-parametric alternative to the ANOVA test, a Kruskal-Wallis test. The
Kruskal-Wallis test results did not indicate significant differences in Maximum LongString
values across conditions, H(2) = 1.13, p = .57. This does not provide support for either hypothesis, as Maximum LongString values did not differ across conditions.
I also intended to conduct a chi-square test on the LongString flag variable to see if
the pattern of flagged as careless by LongString versus not flagged as careless by LongString
differed across conditions, but only 12 total participants were flagged for LongString, so the
chi-square test could not run.
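The Maximum LongString indicator itself is simple to compute: it is the length of the longest run of identical consecutive responses in a respondent's answer vector. A minimal sketch with a made-up response vector follows; scoring details in the thesis may differ.

```python
# Sketch of a Maximum LongString computation.
def max_longstring(responses) -> int:
    """Length of the longest run of identical consecutive responses."""
    longest, run, previous = 0, 0, object()  # sentinel never equals a response
    for value in responses:
        run = run + 1 if value == previous else 1
        previous = value
        longest = max(longest, run)
    return longest

print(max_longstring([1, 3, 3, 3, 3, 2, 2, 5]))  # -> 4 (the run of 3s)
```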
Psychometric Synonyms. With the reduced sample of 181 participants, I conducted a
one-way ANOVA and did not find significant differences amongst groups on Psychometric
Synonym scores, F(2, 178) = 0.49, p = .62. Means of the in-person (M = 0.68, SD = 0.12),
virtual proctor (M = 0.66, SD = 0.18) and control conditions (M = 0.67, SD = 0.15) did not
significantly differ. These results fail to support both hypotheses as Psychometric Synonym
scores did not differ across conditions.
Again, I intended to conduct a chi-square test on the dichotomous Psychometric
Synonym flag variable to see if the pattern of respondents flagged as careless by
Psychometric Synonyms versus respondents not flagged by Psychometric Synonyms differed
across conditions, but only eight participants were flagged, so the chi-square test would not
run.
Multivariate Composite Analyses. In addition to examining each CR indicator
separately, I also proposed to conduct a one-way MANOVA with a multivariate composite of
CR indicators as the dependent variable. I conducted additional multivariate analyses using
the dichotomously scored data as well.
CR as continuous. I conducted the proposed MANOVA with condition as the
independent variable and a multivariate composite of six CR indicators (diligence, interest,
bogus sum, instructed-response sum, Mahalanobis distance, and Even-Odd Consistency) as
the dependent variable. I did not include the Maximum LongString in the composite because
Meade and Craig (2012) reported that it represents a different manifestation of CR.
Additionally, I was unable to include Psychometric Synonym scores due to the missing data.
Before running the MANOVA, I checked for the assumptions outlined by Tabachnick and
Fidell (2013). Linearity and multicollinearity checks were satisfactory. Additionally, the
distributions of the dependent variables met the normality assumption. I used Pillai’s
criterion MANOVA test because it takes into account high correlations amongst the
dependent variables in the composite (Tabachnick & Fidell, 2013). Results of the
MANOVA did not indicate significant differences amongst conditions on the composite of
CR indicators, F(12, 466) = 1.42, p = .15. I also conducted a MANOVA test controlling for
proctor identity and did not find significant differences across groups (see Table 20).
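The multivariate test can be sketched with statsmodels, which reports Pillai's trace among its MANOVA statistics; the data file and column names are assumptions for illustration.

```python
# Sketch of the one-way MANOVA on the composite of six CR indicators.
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

cols = ["diligence", "interest", "bogus_sum", "ir_sum",
        "mahalanobis", "even_odd", "condition"]
df = pd.read_csv("survey_data.csv").dropna(subset=cols)

formula = ("diligence + interest + bogus_sum + ir_sum + mahalanobis + "
           "even_odd ~ condition")
print(MANOVA.from_formula(formula, data=df).mv_test())  # includes Pillai's trace
```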
CR as categorical. Next, I conducted some additional analyses with the
dichotomized CR indicators. Again, I could not include Psychometric Synonyms.
Additionally, I did not include engagement scale scores in these analyses due to their
continuous nature. I summed participants’ flag scores on four of the remaining CR indicators
(bogus flag, instructed-response flag, Mahalanobis distance flag, and Even-Odd flag).
Participants could receive a score of zero through four on this composite, with a zero
representing “not flagged for any CR indicator” and four indicating “flagged for all four CR
indicators.” The distribution was positively skewed, so I conducted a Kruskal-Wallis test to
check for differences among conditions on the CR flag sum. The Kruskal-Wallis test was not
significant, H(2) = 3.39, p = .18. I also conducted this analysis with the Maximum
LongString flag added to participants’ total flag score. This Kruskal-Wallis test was not
significant either, H(2) = 2.74, p = .25. While no significant differences were found for
either Kruskal-Wallis test, the control group mean was higher for both flag sums than the in-person and virtual proctor group means. See Table 21 for details.
Finally, I dichotomized the flag scores in order to compare participants not flagged at
all for CR versus those flagged by one or more indicator. To do so, I recoded the flag scores
(that included Maximum LongString flag) such that participants with a one or more on flag
sum received a one and those with a zero on flag sum received a zero. Seventy-nine
participants, or 31.22% of the sample, were flagged as careless by at least one of the CR
indicators. I conducted a chi-square test to see if the pattern of participants flagged for one or
more CR indicator versus participants not flagged for any CR indicators differed across
conditions. The chi-square test was not significant, χ2(2) = 2.92, p = .23. The flags were
spread relatively evenly across conditions, but in terms of effect size, the in-person condition had the lowest flag rate, followed by the virtual proctor and then the control condition. Unfortunately, due to a lack
of statistical significance, I cannot conclude that carelessness differed across conditions. I
also conducted this series of Kruskal-Wallis and chi-square tests controlling for both proctor
identity and time of semester, and none of the tests indicated significant differences in CR
indicators across condition. See Table 22 for detailed results.
In sum, none of the multivariate composite analyses provide support for the
hypotheses that CR rates would be lower in the in-person and virtually proctored groups
compared to the control group.
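The flag-sum composite and the dichotomized overall flag described above can be sketched as follows, again under assumed file and column names for the individual flags.

```python
# Sketch of the CR flag-sum composite and the overall "flagged at all" test.
import pandas as pd
from scipy import stats

df = pd.read_csv("survey_data.csv")  # hypothetical file

flag_cols = ["bogus_flag", "ir_flag", "mahalanobis_flag", "even_odd_flag"]
df["flag_sum"] = df[flag_cols].sum(axis=1)                   # 0-4 composite
df["flag_sum_ls"] = df["flag_sum"] + df["longstring_flag"]   # adds LongString

# Kruskal-Wallis on the positively skewed flag sums across conditions.
h, p = stats.kruskal(*[g["flag_sum"] for _, g in df.groupby("condition")])

# Flagged by at least one indicator versus not flagged at all.
df["any_flag"] = (df["flag_sum_ls"] >= 1).astype(int)
table = pd.crosstab(df["condition"], df["any_flag"])
chi2, p_chi, dof, _ = stats.chi2_contingency(table, correction=False)
print(f"H = {h:.2f} (p = {p:.2f}); chi2({dof}) = {chi2:.2f} (p = {p_chi:.2f})")
```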
Supplemental Analyses
I conducted a one-way ANOVA to test for group differences in self-reported
distraction on the sum of the 16 environmental distraction items. The ANOVA revealed
significant differences in self-reported distraction rates across groups, F(2, 239) = 29.12, p <
.01, such that in-person respondents reported fewer distractions on average (M = 1.57, SD =
1.35) than did virtually proctored respondents (M = 6.20, SD = 6.20, p < .01) and control
respondents (M = 7.03, SD = 6.21, p < .01). I also assessed the Pearson’s correlation
between environmental distraction frequency and CR flag sum (that included Maximum
LongString flag) across all conditions, and the variables were significantly related, r = .13, p
= .04. Additionally, the CR flag sum was significantly and negatively related to scores on
the 8-item attention scale across conditions, r = -.60, p < .01.
Discussion
Hypothesis 1 stated that virtually proctored online participants would display lower
careless response rates on a survey than would the control group of un-proctored online
participants. Results of univariate and multivariate analyses failed to provide support for this
hypothesis, as CR rates did not significantly differ across the virtually proctored and control
conditions. In the analyses for bogus sum, the virtually proctored group significantly differed
from control, but when I controlled for in-person proctor identity, the relationship became
non-significant. Hypothesis 2 stated that participants proctored in-person would display
lower careless response rates than control participants who took the survey online, un-proctored. Results provide limited support for this hypothesis, as in-person participants self-reported higher diligence levels throughout the survey than control participants.
Additionally, when the researcher proctored participants, in-person bogus sums and
dichotomous bogus flags were lower in the in-person group compared to the control group.
However, on the remaining CR indicators, the in-person group did not have significantly
lower CR rates than the control group, whether the indicators were treated continuously or
dichotomously. When I treated CR indicators as multivariate composites of both continuous
and categorical variables, I did not find significant group differences on CR across
conditions. In sum, I found partial support for Hypothesis 2, but the majority of my analyses
did not support either hypothesis. Across all conditions, participants received CR flags more often for bogus and instructed-response items than for other CR indicators. A small number of
participants received flags on the Mahalanobis distance, Even-Odd Consistency, Maximum
LongString and Psychometric Synonym variables, making it difficult to investigate whether
or not condition had an effect on their prevalence.
This study makes an important contribution to the literature in that for some CR
indicators, CR is generally not reduced when participants take surveys in-person, in the
presence of a proctor. Many researchers (e.g., Dillman et al., 2009; Johnson, 2005) have
cautioned psychologists about data quality in online surveys due to the lack of social interaction between the researcher and participant. Ward and Pond (2015) suggested that a
social component to survey data collection might improve data quality. However, the
presence of a proctor, either in-person or virtual, did not reduce CR on instructed-response
flags for participants when compared to un-proctored, online survey participants. Given the
small number of participants that received flags for the other CR indicators, I am hesitant to
make claims about the effects of a proctor on these indicators individually. However, when individuals' scores on five of the CR indicators were summed, in-person respondents did no better than online respondents.
The results hold several implications for researchers. On one hand, in-person
proctoring may not significantly reduce CR compared to online survey administration. On
the other hand, careless response is clearly prevalent, as almost one third (31.22%) of the
participants in this study were flagged for at least one of the CR indicators. However, the
majority of flags came from bogus sums and instructed-response sums (19% and 16%,
respectively). While problematic, these indicators can only inform us about carelessness at
the point at which the person answered that particular item. Indicators like Mahalanobis
distance and Even-Odd Consistency measure more general carelessness throughout the
survey, and a much smaller number of participants were flagged for these indicators (3% and
5%, respectively). While the majority of participants are not generally careless throughout
an entire survey, participants may go in and out of carelessness, as demonstrated by the high
rates of misses on bogus and instructed-response items. These results are consistent with
the finding of Berry et al. (1992) that 50-60% of their sample self-reported responding randomly one or more times during a survey.
Consistent with past research (e.g., Carrier et al., 2009), remote online participants
engaged in several other technology-based tasks throughout the survey, such as sending text
messages and opening other webpages. I hypothesized that by reducing the inclination to
multitask on such activities, participants would produce lower CR rates. I examined the
frequency of endorsement of the environmental distraction items, and found that the presence
of the in-person proctor significantly reduced participants’ self-reported engagement in these
activities, as in-person respondents reported fewer distractions on average than did virtually
proctored and control respondents. Collapsed across condition, environmental distraction
was significantly related to the CR flag sum, meaning that participants who reported higher environmental distraction rates were more likely to receive CR flags. This correlation may
provide an explanation for the lower rates of bogus sums in the in-person-researcher-proctor
group compared to the control group. However, as the groups did not differ on the majority
of CR indices, the small magnitude of this correlation may reflect the fact that younger
generations have multitasking skills (e.g., Carrier et al., 2009). When treated dichotomously,
the in-person condition received the lowest number of CR flags, and perhaps with more data,
significant differences would be found across groups on CR rates. As expected,
environmental distraction prominence had a strong negative relationship with self-reported
attention, which underlines researchers’ concern over the prominence of distraction during
online surveys. While millennials may possess multitasking “skills,” they still self-report
much lower attention rates when they engage in other activities during a survey.
Another interesting implication from the present study surrounds the in-person
proctor effects. The higher self-reported diligence in the in-person condition provides some
evidence for the social facilitation effect. Participants likely reported higher diligence because of the close proximity of the proctor and the inclination to perform better in the presence of another person, perhaps due to evaluation apprehension
(e.g., Cottrell, 1972). However, these findings provide mixed support for social facilitation
theory because while self-reported performance was higher in the in-person group, actual
performance as measured by several CR indicators was not (on 6 of the 8 CR indicators). In
social facilitation theory, individuals perform better on simple tasks and worse on complex tasks in the presence of another person. Even if one were to consider the survey a complex
rather than simple task, results still do not demonstrate a social facilitation effect as in-person
participant CR rates were indistinguishable from those of online participants (i.e., presence of
a person neither improved nor reduced performance as measured by CR indicators).
However, one possible explanation for these results is that CR is not actually a manifestation of an individual's performance, in which case the presence of a proctor would not be expected to influence it in the same way.
Additionally, the proctor identity effect on bogus sums has interesting implications
for social facilitation theory. It seems that the identity of the proctor, and not just the mere
presence of one, is what had an effect on bogus item carelessness. This may have been due to
higher authority displayed by the researcher, which in turn increased evaluation
apprehension. The results also indicate that it takes more than just the mere presence of
another individual to reduce CR. The proctor identity effect relates to findings in the
learning via virtual agents literature. Researchers have found that effects of agents on
learning depend on characteristics of agents, including empathy expression (Craig et al.,
2002), appearance (Baylor & Kim, 2004), and personality (Isbister & Nass, 2000). However,
the fact that the proctor effect only held for bogus items and none of the other CR indicators
makes the effect of proctor identity on CR unclear.
The difference in findings for bogus sums and instructed-response sums is important
to note. Instructed-response sums may be a better way to assess CR as they provide an
objective indication of whether or not a person is paying attention. Meade and Craig (2012)
suggest the use of instructed-response items over bogus items for flagging CR due to the
subjective and sometimes confusing nature of bogus items. For these reasons, I note the influence of a proctor on bogus scores; however, I do not draw strong conclusions about CR
reduction due to the lack of significant differences for instructed-response and other CR
indicators.
The general lack of support for the hypotheses is likely due to the fact that the study
did not address the proper underlying cause of careless response. I attempted to address the
lack of both social interaction and environmental control in online surveys, two suggested
reasons for the prevalence of CR (Meade & Craig, 2012). However, other variables, such as
a lack of intrinsic motivation, may have a stronger influence on CR. Motivation is an oft-mentioned possible underlying cause of carelessness, as low-stakes participants may not feel
a need to engage fully in surveys (Meade & Craig, 2012). Future research should attempt to
manipulate participant motivation in order to reduce CR. Additional underlying causes
should be tested, such as time-of-day and energy levels, survey content’s alignment with
personal interests, and compensation.
In addition to the findings, this study makes a unique methodological contribution to
the literature by dichotomizing scores on CR indicators. Cut scores for Mahalanobis
distance, Maximum LongString, Even-Odd Consistency, and Psychometric Synonyms make
practical sense for researchers looking to flag careless respondents. The use of cut scores
two standard deviations above or below the mean (depending on the indicator) provides a
way for researchers to flag problem participants. Another important methodological
contribution of this study was the treatment of the bogus items. By evaluating the
distribution of the bogus item responses, I was able to more accurately code for carelessness;
however, scoring bogus items is subjective and I may have been too lenient.
Some limitations surround the virtual proctor I used, as participants may not have
perceived the image to actually portray a human-like proctor. I included two items that
assessed perceptions of the proctor, which stated: “A proctor monitored my activity during
this survey” and “My responses were monitored by a proctor throughout this survey” with
response options from 1 (strongly disagree) to 5 (strongly agree). For the former, a t-test did
not reveal a significant difference between the two proctor groups within the in-person condition, t(75) = 0.14, p = .89. A one-way ANOVA indicated significant differences across
groups for perceived proctor monitoring, F(2, 249) = 183.90, p < .01. Tukey HSD
comparisons revealed that in-person participants indicated significantly higher agreement (M
= 4.35, SD = 0.76) than both the virtual proctor group (M = 2.45, SD = 1.37; p < .01) and
control group (M = 1.45, SD = 0.65; p < .01). Additionally, the virtually proctored
participants reported significantly higher agreement than the control participants (p < .01).
Unfortunately, the mean of the virtual proctor group was 2.45, meaning the average response
was to “disagree” with this item. While these results raise concern over the presence of the
virtual proctor, 29 participants in the virtual proctor condition chose “strongly agree” or
“agree” for this item, while zero participants chose “strongly agree” or “agree” in the control
group. It seems that individuals did not perceive that the virtual human was truly monitoring
their responses. Relatedly, the latter item had the same significant pattern of results. A t-test
did not indicate significant differences between the two proctored in-person groups, t(62) =
0.67, p = .51, and a one-way ANOVA indicated significant differences in agreement across
conditions, F(2, 247) = 23.55, p < .01. Tukey HSD comparisons indicated that in-person
participants had significantly higher agreement (M = 2.92, SD = 1.23) than virtual proctor
participants (M = 2.31, SD = 1.21, p < .01), and control participants (M = 1.70, SD = 0.91, p
< .01). Additionally, the virtual proctor participants agreed at significantly higher rates than
the control (p < .01). These results are consistent with Ward and Pond’s (2015) finding that
participants agreed with an item describing the virtual human as an animated picture of a
person at higher rates than they agreed with items describing the human as having evaluative,
monitoring capabilities.
I reexamined the data by only including participants that chose “agree” or “strongly
agree” in both the virtually proctored and in-person conditions for the item “A proctor
monitored my activity during this survey.” I also only included control participants that
chose “disagree” or “strongly disagree.” All results for the in-person condition remained the
same; however, when treated continuously, the bogus sum for virtually proctored participants who agreed they were being proctored (M = 0.07, SD = 0.26) was significantly lower than that of control participants (M = 0.39, SD = 0.66), F(3, 184) = 6.06, p < .01, Tukey HSD p = .02. In
other words, virtually proctored participants that actually believed they were being monitored
performed better than un-proctored online participants on the bogus items. As none of the
other CR indicator analyses became significant when controlling for proctor perceptions, the
effect of the virtual proctor on CR remains unclear.
I attempted to enhance the interactive nature of Ward and Pond's (2015) virtual human
by adding progress updates; however, the updates did not seem to make participants perceive
the virtual image as more life-like than that used in Ward and Pond’s study. Perhaps if
participants believed their responses were truly being monitored, either by an in-person
proctor reading over their shoulder or by a person virtually logged into an active online
survey, CR would drop. However, this type of monitoring raises ethical concerns over
anonymity and would likely increase socially desirable responding. Such
monitoring may be better suited for evaluative ability tests in which the main concern is
cheating rather than CR. Recently, Hylton, Levy and Dringus (2016) demonstrated that a
webcam-based proctor might deter participants taking online exams from engaging in
misconduct, such as accessing non-authorized information. Researchers looking to further
investigate the use of virtual proctors in online surveys may also turn to the agent-based
learning literature. For instance, one study found that virtual agents are more effective for
training outcomes when participants have the ability to select their agent’s appearance
(Behrend & Thompson, 2012).
In addition to the limitation surrounding the virtual proctor, my study also had
limitations due to the in-person proctor. The use of only one proctor would have
standardized the in-person condition and allowed group differences to be assessed more confidently. An additional limitation to my analyses is that the use of cut scores to flag CR has
not been validated. Some researchers (e.g., Maniaci & Rogge, 2014) have made
recommendations, but there is no hard and fast rule on how to compute such scores.
Additionally, scoring of the bogus items is subjective and other researchers may have made
different judgments on their dichotomization.
Future Directions
As noted, more research is needed to pinpoint the true underlying cause of CR.
Additionally, cut score strategies should be established to standardize the flagging process.
Another area for future research on CR is the further assessment of sporadic carelessness. As
demonstrated in my study, participants are more likely to receive flags for bogus and
instructed-response items that indicate carelessness on a specific section of a survey, as
opposed to indicators that evaluate overall survey carelessness. Researchers should
investigate ways to identify areas of surveys in which a participant carelessly responded. If
we are able to establish rigorous ways to detect sporadic CR, we will have the capability to
retain more respondents in samples by only removing areas of response sets in which
carelessness is detected.
The present study examined the effects of in-person and virtual proctors on CR in
online surveys. In the future, a larger sample size could help parse out the relationships of
proctor presence and CR prevalence. Results indicated that in general, both in-person and
virtual proctors do not reduce CR. However, an in-person proctor caused participants to self-report higher diligence during the survey. Additionally, the in-person proctor reduced misses
on bogus items, but the effect depended on the identity of the proctor, underlining the
complexity of the social facilitation effect. Almost one third of the sample was flagged for at
least one type of carelessness, but carelessness was generally evenly spread across the three
conditions. Researchers should continue collecting data via online surveys, flag participants
with select indicators, and work to find ways to prevent CR from happening in the first place.
REFERENCES
Aiello, J. R., & Douthitt, E. A. (2001). Social facilitation from Triplett to electronic
performance monitoring. Group Dynamics: Theory, Research, and Practice, 5, 163–
180. doi:10.1037/1089-2699.5.3.163
Baer, R. A., Ballenger, J., Berry, D. T. R., & Wetter, M. W. (1997). Detection of random
responding on the MMPI-A. Journal of Personality Assessment, 68, 139–151.
doi:10.1207/s15327752jpa6801_11
Bailenson, J., Patel, K., Nielsen, A., Bajscy, R., Jung, S., & Kurillo, G. (2008). The effect of
interactivity on learning physical actions in virtual reality. Media Psychology, 11,
354–376. doi:10.1080/15213260802285214
Baron, R. S. (1986). Distraction-conflict theory: Progress and problems. Advances in
Experimental Social Psychology, 19, 1–40. doi:10.1016/S0065-2601(08)60211-7
Baylor, A. L., & Kim, Y. (2004). Pedagogical agent design: The impact of agent realism,
gender, ethnicity, and instructional role. In J. C. Lester, R. M. Vicari, & F. Paraguaçu
(Eds.), Intelligent tutoring systems: 7th International Conference, ITS 2004 (pp. 592–
603). New York: Springer.
Beach, D. A. (1989). Identifying the random responder. The Journal of Psychology:
Interdisciplinary and Applied, 123, 101–103. doi:10.1080/00223980.1989.10542966
Behrend, T. S., & Thompson, L. (2011). Similarity effects in online training: Effects with
computerized trainer agents. Computers in Human Behavior, 27, 1201–1206.
doi:10.1016/j.chb.2010.12.016
Behrend, T. S., & Thompson, L. (2012). Using animated agents in learner-controlled
training: The effects of design control. International Journal of Training and
Development, 16, 263–283. doi:10.1111/j.1468-2419.2012.00413.x
Berry, D. T. R., Wetter, M. W., Baer, R. A., Larsen, L., Clark, C., & Monroe, K. (1992).
MMPI-2 random responding indices: Validation using a self-report methodology.
Psychological Assessment, 4, 340 –345. doi:10.1037/1040-3590.4.3.340
Bowling, N. A., Huang, J. L., Bragg, C. B., Khazon, S., Liu, M., & Blackmore, C. E.
(2016). Who cares and who is careless? Insufficient effort responding as a reflection
of respondent personality. Journal of Personality and Social Psychology. Advance
online publication. doi:10.1037/pspp0000085
Carrier, L. M., Cheever, N. A., Rosen, L. D., Benitez, S., & Chang, J. (2009). Multitasking
across generations: Multitasking choices and difficulty ratings in three generations of
Americans. Computers in Human Behavior, 25, 483–489.
doi:10.1016/j.chb.2008.10.012
Carver, C. S., & Scheier, M. F. (1981). The self-attention-induced feedback loop and social
facilitation. Journal of Experimental Social Psychology, 17, 545–568.
doi:10.1016/0022-1031(81)90039-1
Clark, M. E., Gironda, R. J., & Young, R. W. (2003). Detection of back random responding:
Effectiveness of MMPI-2 and personality assessment inventory validity indices.
Psychological Assessment, 15, 223–234. doi:10.1037/1040-3590.15.2.223
Cottrell, N. B. (1972). Social facilitation. In C. G. McClintock (Ed.), Experimental social
psychology (pp. 185–236). New York: Holt.
Craig, S. D., Gholson, B., & Driscoll, D. M. (2002). Animated pedagogical agents in
multimedia educational environments: Effects of agent properties, picture features
and redundancy. Journal of Educational Psychology, 94, 428–434.
doi:10.1037/0022-0663.94.2.428
Dillman, D. A., Smyth, J. D., & Christian, L. M. (2009). Internet, mail, and mixed-mode
surveys: The tailored design method (3rd ed.). Hoboken, NJ: Wiley.
Duval, S., & Wicklund, R. A. (1972). A theory of objective self-awareness. New York:
Academic Press.
Ehlers, C., Greene-Shortridge, T. M., Weekley, J. A., & Zajack, M. D. (2009). The
exploration of statistical methods in detecting random responding. Poster session
presented at the annual meeting for the Society for Industrial/Organizational
Psychology, Atlanta, GA.
Goldberg, L. R. (1999). A broad-bandwidth, public domain, personality inventory
measuring the lower-level facets of several five-factor models. In I. Mervielde, I.
Deary, F. D. Fruyt, & F. Ostendorf (Eds.), Personality psychology in Europe (pp. 7–
28). Tilburg, the Netherlands: Tilburg University Press.
Gordon, M. E., Slade, L. A., & Schmitt, N. (1986). The “science of the sophomore”
revisited: From conjecture to empiricism. Academy of Management Review, 11,
191–207. doi:10.2307/258340
Gratch, J., Wang, N., Okhmatovskaia, A., Lamothe, F., Morales, M., van der Werf, R. J.,
& Morency, L. P. (2007). Can virtual humans be more engaging than real ones?
Paper presented at the 12th international conference on human-computer
interaction: Intelligent multimodal interaction environments, Beijing, China.
Huang, J. L., Curran, P. G., Keeney, J., Poposki, E. M., & DeShon, R. P. (2012). Detecting
and deterring insufficient effort responding to surveys. Journal of Business and
Psychology, 27, 99–114. doi:10.1007/s10869-011-9231-8
Hylton, K., Levy, Y., & Dringus, L. P. (2016). Utilizing webcam-based proctoring to deter
misconduct in online exams. Computers & Education, 92, 53–63.
doi:10.1016/j.compedu.2015.10.002
Isbister, K., & Nass, C. (2000). Consistency of personality in interactive characters: Verbal
cues, non-verbal cues, and user characteristics. International Journal of Human-Computer Studies, 53, 2251–2267. doi:10.1006/ijhc.2000.0368
Jackson, D. N. (1977). Jackson Vocational Interest Survey manual. Port Huron, MI:
Research Psychologists Press.
Johnson, J. A. (2005). Ascertaining the validity of individual protocols from web-based
personality inventories. Journal of Research in Personality, 39, 103–129.
doi:10.1016/j.jrp.2004.09.009
Kane, M. J., Poole, B. J., Tuholski, S. W., & Engle, R. W. (2006). Working memory
capacity and the top-down control of visual search: Exploring the boundaries of
“executive attention.” Journal of Experimental Psychology: Learning, Memory, and
Cognition, 32, 749–777. doi:10.1037/0278-7393.32.4.749
Maniaci, M. R., & Rogge, R. D. (2014). Caring about carelessness: Participant inattention
and its effects on research. Journal of Research in Personality, 48, 61–83.
doi:10.1016/j.jrp.2013.09.008
Meade, A. W., & Craig, S. B. (2012). Identifying careless responses in survey data.
Psychological Methods, 17, 437–455. doi:10.1037/a0028085
Nichols, D. S., Greene, R. L., & Schmolck, P. (1989). Criteria for assessing inconsistent
patterns of item endorsement on the MMPI: Rationale, development, and empirical
trials. Journal of Clinical Psychology, 45, 239–250. doi:10.1002/1097-4679(198903)45:2<239::AID-JCLP2270450210>3.0.CO;2-1
Park, S. (2009). Social responses to virtual humans: The effect of human-like
characteristics. (Unpublished doctoral dissertation). Georgia Institute of
Technology, Atlanta, Georgia.
Park, S., & Catrambone, R. (2007). Social facilitation effects of virtual humans. Human
Factors: The Journal of the Human Factors and Ergonomics Society, 49, 1054–
1060. doi:10.1518/001872007X249910
Paulhus, D. L. (1984). Two-component models of socially desirable responding. Journal of
Personality and Social Psychology, 46, 598–609. doi:10.1037/0022-3514.46.3.598
Randall, J. G., Oswald, F. L., & Beier, M. E. (2014). Mind-wandering, cognition, and
performance: A theory-driven meta-analysis of attention regulation. Psychological
Bulletin, 140, 1411–1431. doi:10.1037/a0037428
Schultz, D. P. (1969). The human subject in psychological research. Psychological Bulletin,
72, 214–228. doi:10.1037/h0027880
Spelke, E., Hirst, W., & Neisser, U. (1976). Skills of divided attention. Cognition, 4, 215–
230. doi:10.1016/0010-0277(76)90018-4
Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Boston, MA:
Pearson/Allyn & Bacon.
Tabachnick, B. G., & Fidell, L. S. (2013). Using multivariate statistics (6th ed.). Boston, MA:
Pearson.
Ward, M. K., & Osgood, J. (2015). Environmental distraction items. Unpublished instrument.
Ward, M. K., & Pond, S. B. (2015). Using virtual presence and survey instructions to
minimize careless responding on Internet-based surveys. Computers in Human
Behavior, 48, 554–568. doi:10.1016/j.chb.2015.01.070
Woods, C. (2006). Careless responding to reverse-worded items: Implications for confirmatory
factor analysis. Journal of Psychopathology and Behavioral Assessment, 28, 186–191.
doi:10.1007/s10862-005-9004-7
Yan, T., Conrad, F. G., Tourangeau, R., & Couper, M. P. (2010). Should I stay or should I
go: The effects of progress feedback, promised task duration, and length of
questionnaire on completing web surveys. International Journal of Public Opinion
Research, 23, 131–147. doi:10.1093/ijpor/edq046
Yentes, R. (2015). Attention scale. Unpublished instrument.
Zajonc, R. B. (1980). Compresence. In P. B. Paulus (Ed.), Psychology of group influence
(pp. 35–60). Hillsdale, NJ: Erlbaum.
Zwarun, L., & Hall, A. (2014). What’s going on? Age, distraction, and multitasking during
online survey taking. Computers in Human Behavior, 41, 236–244.
doi:10.1016/j.chb.2014.09.041
Table 1.
Number of Participants, Mean Age (Standard Deviation), and Sex by Condition

Condition        N    Age           Sex
In-Person        83   18.82 (1.13)  36 M, 47 F
Virtual Proctor  87   18.94 (1.76)  35 M, 52 F
Control          83   18.76 (1.27)  32 M, 51 F
Total            253  18.84 (1.42)  103 M, 150 F
Table 2.
Cut Scores and Flags by Condition for Mahalanobis Distance, Even-Odd Consistency,
Maximum LongString, and Psychometric Synonyms

Dependent Variable       Cut Score  Condition        Flagged  Not Flagged
Mahalanobis Distance     94.27      In-Person        2        79
                                    Virtual Proctor  5        81
                                    Control          2        77
Even-Odd Consistency     0.45       In-Person        1        82
                                    Virtual Proctor  5        82
                                    Control          6        76
Maximum LongString       12         In-Person        1        82
                                    Virtual Proctor  8        79
                                    Control          3        80
Psychometric Synonyms    0.36       In-Person        1        54
                                    Virtual Proctor  4        60
                                    Control          3        59
Table 3.
Percentage of Sample Flagged for CR

Indicator                          Flagged  n    Percent
Bogus                              49       253  19.37%
Instructed-Response                40       250  16.00%
Mahalanobis Distance               9        246  3.36%
Even-Odd Consistency               12       252  4.76%
Maximum LongString                 12       253  4.74%
Psychometric Synonyms              8        181  4.42%
Flagged for 1 or more Indicators   79       253  31.22%
Table 4.
Correlations Among Continuous CR Indicators

Variable                     1        2       3        4        5       6       7
1. Diligence                 1.00
2. Interest                  .52***   1.00
3. Bogus Sum                 -.26***  -.13*   1.00
4. Instructed-Response Sum   -.24***  -.18**  .37***   1.00
5. Mahalanobis Distance      .10      -.02    .14*     .02      1.00
6. Even-Odd Consistency      .36***   .20**   -.40***  -.35***  -.16**  1.00
7. Maximum LongString        -.24***  -.09    .36***   .46***   -.20**  -.23**  1.00
Note. *p < .05. **p < .01. ***p < .001.
Table 5.
Phi Correlations Among Dichotomous CR Indicators

Variable                       1       2       3     4       5
1. Bogus Flag                  1.00
2. Instructed-Response Flag    .23***  1.00
3. Mahalanobis Distance Flag   .08     -.07    1.00
4. Even-Odd Consistency Flag   .32***  .25***  .08   1.00
5. Maximum LongString Flag     .17**   .23***  .06   .23***  1.00
Note. **p < .01. ***p < .001.
Table 6.
T-Tests for Time-of-Semester Effects on Continuous CR Indicators

Dependent Variable        Time   M      SD     df      t      p
Engagement (Diligence)    Early  4.07   0.69   249.32  0.46   .65
                          Late   4.03   0.61
Engagement (Interest)     Early  3.24   0.74   240.49  1.16   .25
                          Late   3.13   0.73
Bogus Sum                 Early  0.25   0.57   251.00  0.55   .58
                          Late   0.22   0.47
Instructed-Response Sum   Early  0.25   0.53   215.91  0.63   .53
                          Late   0.20   0.65
Mahalanobis Distance      Early  61.13  19.11  242.82  1.61   .11
                          Late   57.64  14.87
Even-Odd Consistency      Early  0.77   0.19   234.14  -1.86  .06
                          Late   0.80   0.12
Maximum LongString        Early  7.17   8.92   125.57  -1.09  .28
                          Late   10.93  36.19
Psychometric Synonyms     Early  0.67   0.15   173.70  0.23   .82
Note. For all variables besides Psychometric Synonyms, NEarly = 138, NLate = 150. For
Psychometric Synonyms, NEarly = 96, NLate = 85.
Table 7.
T-Tests for Proctor Identity Effects on Continuous CR Indicators in the In-Person Condition

Dependent Variable        Proctor  M      SD     df     t      p
Engagement (Diligence)    R        4.12   0.47   66.13  1.65   .10
                          A        4.30   0.50
Engagement (Interest)     R        3.26   0.75   72.59  0.27   .79
                          A        3.30   0.69
Bogus Sum                 R        0.06   0.24   44.00  2.48   .02*
                          A        0.27   0.45
Instructed-Response Sum   R        0.14   0.40   42.27  1.54   .13
                          A        0.35   0.71
Mahalanobis Distance      R        58.75  14.47  67.06  0.53   .60
                          A        60.49  14.28
Even-Odd Consistency      R        0.83   0.08   40.98  -1.69  .10
                          A        0.77   0.18
Maximum LongString        R        6.10   1.97   81.00  -1.38  .17
                          A        5.61   1.30
Psychometric Synonyms     R        0.70   0.12   40.83  0.95   .35
                          A        0.67   0.13
Note. R = Researcher as Proctor, A = Assistant as Proctor. NR = 50, NA = 33. *p < .05.
Table 8.
ANOVA Tests for Group Size Effect on Continuous CR Indicators in the In-Person Condition

Dependent Variable        df  Residuals  F     p
Engagement (Diligence)    4   78         0.72  .58
Engagement (Interest)     4   78         0.81  .52
Bogus Sum                 4   78         0.87  .48
Instructed-Response Sum   4   76         0.37  .83
Mahalanobis Distance      4   76         1.01  .41
Even-Odd Consistency      4   78         1.08  .37
Psychometric Synonyms     4   50         0.81  .56
Table 9.
Kruskal-Wallis Test for Group Size Effect on Maximum LongString Within In-Person Condition

Dependent Variable   df  χ2    p
Maximum LongString   4   5.45  .24
Table 10.
Chi-Square Tests for Time-of-Semester Effects in the Full Sample and Proctor Identity
Effects Within the In-Person Condition on Dichotomous Bogus and Instructed-Response Flags

Sample / Dependent Variable               Grouping    Flagged  Not Flagged  df  χ2    p
Full Sample (Time-of-Semester Effects)
  Bogus                                   Early       27       111          1   0.00  1.00
                                          Late        22       93
  Instructed-Response                     Early       28       108          1   3.95  .047*
                                          Late        12       102
In-Person Condition (Proctor Identity Effects)
  Bogus                                   Researcher  3        47           1   5.66  .02*
                                          Assistant   9        24
  Instructed-Response                     Researcher  6        44           1   1.68  .20
                                          Assistant   8        23
Note. The bogus flag chi-square value may be inaccurate due to the small number of
flagged participants. *p < .05.
Table 11.
Logistic Regressions for Group Size Effects on Dichotomous CR Indicators in the In-Person
Condition

Dependent Variable         n   Estimate  SE    z     p
Bogus Flag                 83  0.39      0.31  1.27  .20
Instructed-Response Flag   81  0.02      0.26  0.07  .94
Table 12.
Univariate ANOVA Tests for Condition Effect on Continuous CR Indicators

Dependent Variable        Condition        M      SD     df  Residual  F     p
Engagement (Diligence)                                   2   249       3.06  .049*
                          In-Person        4.19   0.49
                          Virtual Proctor  4.01   0.73
                          Control          3.96   0.69
Engagement (Interest)                                    2   248       1.00  .37
                          In-Person        3.28   0.72
                          Virtual Proctor  3.12   0.79
                          Control          3.19   0.69
Bogus Sum                                                2   250       5.20  .006**
                          In-Person        0.14   0.35
                          Virtual Proctor  0.18   0.49
                          Control          0.39   0.66
Instructed-Response Sum                                  2   247       0.12  .89
                          In-Person        0.22   0.55
                          Virtual Proctor  0.21   0.58
                          Control          0.25   0.64
Mahalanobis Distance                                     2   243       0.29  .75
                          In-Person        59.44  14.33
                          Virtual Proctor  58.61  19.44
                          Control          60.66  17.95
Even-Odd Consistency                                     2   249       1.21  .30
                          In-Person        0.81   0.13
                          Virtual Proctor  0.78   0.17
                          Control          0.77   0.19
Psychometric Synonyms                                    2   178       0.49  .62
                          In-Person        0.68   0.12
                          Virtual Proctor  0.66   0.18
                          Control          0.67   0.15
Note. *p < .05. **p < .01.
Table 13.
Post-Hoc Pairwise Tukey Tests for Diligence and Bogus Sum

Dependent Variable       Comparison                    d     p
Engagement (Diligence)   In-Person vs. Control         0.24  .048*
                         Virtual Proctor vs. Control   0.06  .83
Bogus Sum                In-Person vs. Control         0.24  .008**
                         Virtual Proctor vs. Control   0.20  .03*
Note. *p < .05. **p < .01.
Table 14.
Kruskal-Wallis Rank Sum Test for Condition Effect on Continuous Maximum LongString

Dependent Variable   Condition        M      SD     df  χ2    p
Maximum LongString                                  2   1.13  .57
                     In-Person        5.90   1.74
                     Virtual Proctor  10.68  28.65
                     Control          9.96   32.97
Table 15.
ANOVA Test for Condition Effect on Bogus Sum, Controlling for In-Person Proctor Identity

Condition                      M     SD    df  Residuals  F     p
                                           3   249        4.63  .004**
In-Person-Researcher-Proctor   0.06  0.24
In-Person-Assistant-Proctor    0.27  0.45
Virtual Proctor                0.18  0.49
Control                        0.39  0.66
Note. **p < .01.
Table 16.
Pairwise Tukey Tests for Bogus Sum, Controlling for In-Person Proctor Identity

Comparison                                 d     p
In-Person-Researcher-Proctor vs. Control   0.33  .003**
In-Person-Assistant-Proctor vs. Control    0.11  .71
Virtual Proctor vs. Control                0.20  .055
Note. **p < .01.
Table 17.
Chi-Square Tests for Condition Effects on Bogus Flag (In-Person Group Collapsed)

Test                  Condition                              Flagged  Not Flagged  df  χ2    p
Chi-Square Bogus                                                                   2   9.16  .01*
                      In-Person                              12       71
                      Virtual Proctor                        12       75
                      Control                                25       58
Post-Hoc Chi-Square   In-Person v. Control                                         1   5.01  .03
                      Virtual v. Control                                           1   5.73  .02
                      Control v. Both (In-Person & Virtual)                        1   8.15  .004**
Note. *p < .05. **p < .01.
Table 18.
Chi-Square Tests for Condition Effects on Bogus Flag (Controlling for In-Person Proctor)

Test                  Condition                                 Flagged  Not Flagged  df  χ2     p
Chi-Square Bogus                                                                      3   14.92  .002**
                      In-Person-Researcher-Proctor              3        47
                      In-Person-Assistant-Proctor               9        24
                      Virtual Proctor                           12       75
                      Control                                   25       58
Post-Hoc Chi-Square   In-Person-Researcher-Proctor vs. Control                       1   9.52   .002**
                      In-Person-Assistant-Proctor vs. Control                        1   0.01   .94
Note. **p < .01.
Table 19.
Chi-Square Tests for Condition Effects on Instructed-Response Flag

Test                               Condition        Flagged  Not Flagged  df  χ2    p
Chi-Square IR Flag                 In-Person        14       67           2   0.41  .81
                                   Virtual Proctor  12       74
                                   Control          14       69
Controlling for Time of Semester
  Early                            In-Person        7        22           2   0.32  .85
                                   Virtual Proctor  10       43
                                   Control          11       43
  Late                             In-Person        7        45           2   1.18  .56
                                   Virtual Proctor  2        31
                                   Control          3        26
Table 20.
One-Way MANOVA for Condition Effect on Composite of CR Indicators (Diligence, Interest,
Bogus Sum, Instructed-Response Sum, Mahalanobis Distance, Even-Odd Consistency)

Analysis                            df  Residuals  Numerator df  Denominator df  Pillai  F     p
Full Sample (In-Person Collapsed)   2   237        12            466             0.07    1.42  .15
Controlling for In-Person Proctor   3   236        18            699             0.11    1.53  .07
Table 21.
Kruskal-Wallis Tests for Condition Effect on Sum of CR Flag Scores

Group / Indicator   Condition                     M     SD    df  χ2    p
Full Sample
  Flag Sum 1                                                  2   3.39  .18
                    In-Person                     0.34  0.62
                    Virtual Proctor               0.36  0.69
                    Control                       0.51  0.75
  Flag Sum 2                                                  2   2.74  .25
                    In-Person                     0.35  0.62
                    Virtual Proctor               0.46  0.85
                    Control                       0.53  0.75
Full Sample (Controlling for In-Person Proctor)
  Flag Sum 1                                                  3   6.17  .10
                    In-Person-Researcher-Proctor  0.22  0.47
                    In-Person-Assistant-Proctor   0.52  0.77
  Flag Sum 2                                                  3   4.90  .18
                    In-Person-Researcher-Proctor  0.24  0.48
                    In-Person-Assistant-Proctor   0.52  0.77
Time Groups (Controlling for Time of Semester)
  Early, Flag Sum 1                                           2   2.26  .32
                    In-Person                     0.38  0.62
                    Virtual Proctor               0.38  0.66
                    Control                       0.60  0.82
  Early, Flag Sum 2                                           2   2.22  .33
                    In-Person                     0.38  0.62
                    Virtual Proctor               0.46  0.80
                    Control                       0.62  0.82
  Late, Flag Sum 1                                            2   0.46  .80
                    In-Person                     0.31  0.62
                    Virtual Proctor               0.33  0.74
                    Control                       0.35  0.56
  Late, Flag Sum 2                                            2   0.14  .93
                    In-Person                     0.33  0.62
                    Virtual Proctor               0.45  0.94
                    Control                       0.35  0.56
Note. Flag Sum 1 Variables = Bogus Flag, Instructed-Response Flag, Mahalanobis Distance
Flag, Even-Odd Consistency Flag. Flag Sum 2 Variables = Bogus Flag, Instructed-Response
Flag, Mahalanobis Distance Flag, Even-Odd Consistency Flag, Maximum LongString Flag.
Table 22.
Chi-Square Test for Condition Effect on Dichotomous Overall Flag Score

Sample                                            Condition                     Flagged  Not Flagged  df  χ2    p
Full Sample                                                                                           2   2.92  .23
                                                  In-Person                     22       58
                                                  Virtual Proctor               26       59
                                                  Control                       31       47
Full Sample (Controlling for In-Person Proctor)                                                       3   4.39  .22
                                                  In-Person-Researcher-Proctor  11       38
                                                  In-Person-Assistant-Proctor   11       20
                                                  Virtual Proctor               26       59
                                                  Control                       31       47
Time Groups (Controlling for Time-of-Semester)
  Early                                                                                               2   2.03  .36
                                                  In-Person                     9        20
                                                  Virtual Proctor               17       35
                                                  Control                       23       29
  Late                                                                                                2   0.24  .89
                                                  In-Person                     13       38
                                                  Virtual Proctor               9        24
                                                  Control                       8        18
Table 23.
Environmental Distraction Items
Items
1. I received an alert indicating an incoming email, text message, instant message,
online chat request, or phone call
2. I was listening to music
3. I read an instant message
4. I looked at another webpage other than the survey page
5. I sent an instant message
6. I made a post on a webpage (such as Facebook or internet forum)
7. My cell/home phone rang or vibrated to indicate an incoming call
8. I sent an e-mail
9. I spoke with someone on my cell/home phone
10. Someone spoke to me in person
11. There was a television active in the area
12. I sent a text message
13. I attended to music (e.g. skipping a song, changing the volume, or
pausing/resuming play)
14. I could see people or things moving in the background
15. I read an e-mail
16. I read a text message
Note. Unpublished items produced by M. K. Ward and J. Osgood (2015).
Table 24.
Attention Scale
Item
1. I devoted all of my attention to the survey.
2. I was entirely focused on the task at hand.
3. I was completely focused on the survey.
4. I switched back and forth between the task at hand and other tasks.
5. The task at hand was the only thing I was paying attention to.
6. My mind wandered while taking the survey.
7. I daydreamed while taking this survey.
8. I was multi-tasking while I took the survey.
Note. Unpublished scale produced by R. Yentes (2015).
Figure 1. Depiction of the virtual human and the progress feedback in the virtual proctor
condition.
APPENDIX
Thesis Proposal
Combining a Virtual Proctor and Attention-Focused Instructions to Reduce
Careless Responding in Online Surveys
Online surveys have become the standard of data collection in psychological research
over the past several years. As technology has developed and the Internet has become
available almost everywhere, researchers often rely on this convenient method to collect data.
There are many benefits to Internet-based surveys. They provide a convenient, cost effective
way to quickly collect data from a large sample through easy distribution. Additionally, no
copying costs and easy transfer into statistical software programs make data analysis simple
and less tedious than that in paper-and-pencil data collection. Although there are many
benefits to Internet-based surveys, a primary concern with the method is that of data quality.
One issue in the data quality discussion is the lack of human interaction involved in
online data collection (Johnson, 2005). Unlike laboratory settings, online survey
participation often occurs without any interaction between the participant and the researcher.
The absence of a proctor may reduce a participant’s perceived accountability for his/her
responses. Additionally, the removal of a social interaction takes away the benefits of the
social facilitation effect, in which an individual performs better on simple tasks in the
presence of another person (Aiello & Douthitt, 2001). Some researchers argue that a social
interaction component to surveys could improve the quality of data (Ward & Pond, 2015).
Another point of concern in online data collection is the lack of environmental control
(Johnson, 2005). The researcher has no control over, and often no information about, the
environment in which the individual completes the survey. The rise in technology use has been
accompanied by a rise in multitasking, and one study found that people frequently combine
technology-based tasks, such as sending text messages while browsing the Internet (Carrier,
Cheever, Rosen, Benitez, & Chang, 2009). With access to laptops, tablets, and smartphones,
survey participants may fall victim to many environmental distractions during online
surveys, over which the researcher has no control.
In addition to environmental variables, contextual factors like Internet speed and
equipment quality also lack standardization (Tippins et al., 2006). Questionnaire display
may differ across devices, or may not be compatible with the device a participant owns,
such as a tablet. In sum, researchers that utilize online surveys are at risk of collecting poor
data due to the lack of both human interaction and environmental control in online survey
administration. The proposed study addresses the issues of human interaction and
environmental control in online data collection, particularly with undergraduate samples.
Issues with Undergraduate Samples
Undergraduate students make up an accessible data pool that has become one of the
most utilized sources of research participants (Gordon, Slade, & Schmitt, 1986). Often
researchers incentivize students to participate in research in exchange for course credit or to
fulfill a course obligation. Even if instructors offer an alternative assignment such as a paper,
students may view research hours as the “lesser of two evils” and choose to participate as a
means to the end of receiving credit. They may be reluctant about and resentful of the process
(Schultz, 1969) and thus, undergraduates likely lack intrinsic motivation to take part in online
surveys (Meade & Craig, 2012). A lack of motivation brings into question the degree to
which we can rely on the data students provide.
In addition to motivational concerns, undergraduates may also be prone to
distractions and other problems associated with multitasking. Not surprisingly, members of
the “Net Generation,” or those born from 1980 to the present, engage in multitasking most
frequently in comparison to those in older generations (Carrier et al., 2009). The majority of
undergraduate students today fall into this category, which warrants concern for participant
attention when collecting Internet-based data. Activities like browsing the web, sending text
messages, and listening to music may divert a student’s attention from the survey. One
study found that over half of online survey participants aged 18-24 reported having
multitasked on another electronic-based activity during the survey (Zwarun & Hall, 2014).
In addition to a propensity to multitask on such activities throughout the survey, students are
at risk for environmental distractors based on the location in which they take the survey, such
as noise in a dormitory. Dual-task research shows that divided attention reduces performance
on both cognitive and physical tasks (Spelke, Hirst, & Neisser, 1976), so multitasking and
distractions bring up data quality issues for data collection in an uncontrolled environment.
Data Quality Concerns
A concern over the quality of survey data is not new. Nichols, Greene, and Schmolck
(1989) defined two types of issues with data quality. They believed content responsive
faking occurred when participants responded to survey items with information that was not
completely accurate, but influenced by the item content. Paulhus (1984) suggested social
desirability as one reason behind such faking. In other words, sometimes participants
respond to survey items in an attempt to “look good,” rather than responding in terms of their
true thoughts or feelings about a topic. The second data quality problem is that of content
nonresponsivity. Content nonresponsivity occurs when the participant provides responses
with no regard for the item content, referred to as random response in earlier literature (e.g.,
Beach, 1989). Meade and Craig (2012) argue that content nonresponsivity is not truly
random; for instance, participants may choose the same response several times in a row or
make a pattern with their answers. Regardless, participants respond with no consideration of
the actual content of the item, and some researchers label such behavior as insufficient effort
response (e.g., Huang, Curran, Keeney, Poposki, & DeShon, 2012). The umbrella term to
describe responses not reflective of an individual’s true score, whether due to content
responsive faking or content nonresponsivity, is careless response (CR; Meade & Craig,
2012).
Variation in Reports of Careless Response Prevalence
There are inconsistent reports on the prevalence of CR. Meade and Craig (2012) used
a combination of several CR indicators and reported that 10-12% of undergraduate research
participants displayed CR. Johnson (2005) reported a much lower number, 3.5%, but used a
sample of individuals who sought out the International Personality Item Pool online (IPIP;
Goldberg, 1999). These participants were likely more motivated than undergraduates were
because they wanted the feedback on the personality measure and freely participated.
Similarly, Ehlers, Greene-Shortridge, Weekley, and Zajack (2009) reported CR in 5% of their
sample of job applicants. Again, such participants likely had more invested in their job
application than do undergraduates that take online surveys to fulfill a course requirement.
With little buy-in, student samples that take Internet-based surveys may be particularly prone
to CR (Meade & Craig, 2012). Thus, the type of sample utilized plays a role in the
discrepancies in reports of CR prevalence.
It is important to note that participants that display CR may not provide poor data for
the entirety of the survey, but it may instead only occur on a small number of items. Berry,
Wetter, Baer, Larsen, Clark, and Monroe (1992) found that roughly 50-60% of students self-
reported responding to one or more items randomly in an online survey. Baer, Ballenger,
Berry and Wetter (1997) found an even higher number, 73%, for self-reported CR. Thus,
while many of the commonly used CR indicators may detect low percentages of CR
(somewhere between 3.5%-12%), careless response seems to occur often throughout surveys
on a small number of items. Additionally, the data clearly show that undergraduate samples
are particularly prone to at least some degree of CR (Baer et al., 1997; Berry et al., 1992;
Meade & Craig, 2012).
Psychometric Concerns
CR brings up several psychometric concerns and researchers (e.g., Huang et al., 2012;
Meade & Craig, 2012; Ward & Pond, 2015) agree that the field must address CR if we want
to improve research methods. CR may affect within-group variability (Clark, Gironda, &
Young, 2003), create Type II errors in hypothesis testing, affect correlations and error
variance, and reduce internal consistency reliability estimates (Huang et al., 2012; Meade &
Craig, 2012). It also may affect conclusions made from factor analyses and correlation
among items (Meade & Craig, 2012). These types of psychometric issues bring up concerns
for the collection of data for scale development and general data-based decisions we draw
from research (Ward & Pond, 2015; Woods, 2006). Researchers clean data as a common
part of data analysis (Tabachnick & Fidell, 2007), but only recently have methods for CR
detection come about.
Detecting Careless Response
There are two general approaches to CR detection: methods that require the inclusion
of material before data collection (i.e., a priori) and methods that can be used post-hoc. A
priori methods involve the inclusion of particular survey items. Researchers may include an
item that asks the participant in a straightforward manner to indicate his/her level of
engagement throughout the survey. They may also ask whether the participant believes that
his/her effort throughout was sufficient and thus worthy of inclusion in the data set (Meade &
Craig, 2012). In addition to self-report items, researchers also include instructed-response
and bogus items that provide a clear indication of whether the participant is paying attention
to item content. Instructed-response items ask the participant to select a particular answer
(e.g., “Select option D for this item”), while bogus items, often nonsensical, have one clear
correct answer (e.g., “All my friends are aliens”; Meade & Craig, 2012). If the participant
provides an “incorrect” response for instructed-response or bogus items, it is clear that he/she
is not paying attention to survey content, at least at that particular point in time.
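As a minimal illustration of this scoring logic, the sketch below assumes responses are stored in a pandas DataFrame with hypothetical column names (ir_1 through ir_3 for instructed-response items, bogus_1 and bogus_2 for bogus items) on a 1-7 scale, and counts each respondent's misses:

```python
# A minimal sketch of a priori CR scoring; the column names and answer keys are
# hypothetical examples, not the study's actual item identifiers.
import pandas as pd

ir_keys = {"ir_1": 7, "ir_2": 2, "ir_3": 4}   # e.g., "Select 'strongly agree' for this item."
bogus_keys = {"bogus_1": 1, "bogus_2": 1}     # e.g., "All my friends are aliens."

def count_misses(responses: pd.DataFrame, keys: dict) -> pd.Series:
    """Return, per respondent, how many keyed items were answered incorrectly."""
    misses = pd.Series(0, index=responses.index)
    for item, correct in keys.items():
        misses += (responses[item] != correct).astype(int)
    return misses

# Toy data for three respondents; the third misses one instructed-response and one bogus item.
df = pd.DataFrame({"ir_1": [7, 7, 3], "ir_2": [2, 2, 2], "ir_3": [4, 4, 4],
                   "bogus_1": [1, 1, 6], "bogus_2": [1, 1, 1]})
df["ir_sum"] = count_misses(df, ir_keys)
df["bogus_sum"] = count_misses(df, bogus_keys)
print(df[["ir_sum", "bogus_sum"]])
```

The resulting sums can be used directly as counts or dichotomized into flags (any miss versus none).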
The other approach to detect CR is via indices involving computation after data is
collected. Several indices exist, including Mahalanobis distance, Even-Odd Consistency,
Psychometric Synonyms, and LongString (Meade & Craig, 2012). Mahalanobis distance is a
multivariate outlier analysis in which the researcher compares a vector of an individual’s
responses to the sample means for all variables in that vector (Ehlers et al., 2009). In other
words, it detects if the participant displays a general outlier pattern, and thus indicates CR
(Meade & Craig, 2012). The Even-Odd Consistency measure allows researchers to compare
a participant’s responses to items that measure the same construct (Jackson, 1977).
Unidimensional scales for particular constructs are split into even and odd subscales based on
the order items appear. The researcher computes an average score across subscales and a low
within-person correlation on the subscales indicates CR (Meade & Craig, 2012).
Psychometric Synonyms also measure within-person consistency. For pairs of items that are
highly correlated in the sample, inconsistent responses across those pairs indicate CR (Meade & Craig, 2012).
One other calculation commonly used is called LongString (Johnson, 2005), often measured
by calculating the average number of consecutive responses per web page (Meade & Craig,
2012). LongString looks at the extent to which a participant chooses the same answer
repeatedly. Logically, a participant continuously choosing one answer is likely not
answering based on the content in the items.
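As a rough sketch of the Even-Odd Consistency computation, assuming a pandas DataFrame and a hypothetical mapping from each unidimensional scale to its ordered item columns, the within-person correlation between odd- and even-position subscale scores can be obtained as follows (low or negative values would suggest CR):

```python
# A sketch of Even-Odd Consistency; the scale-to-items mapping is an assumption used
# only for illustration. At least two scales are needed for a meaningful correlation.
import pandas as pd

def even_odd_consistency(df: pd.DataFrame, scales: dict) -> pd.Series:
    odd_means, even_means = [], []
    for items in scales.values():
        odd_means.append(df[items[0::2]].mean(axis=1))   # 1st, 3rd, 5th item of the scale...
        even_means.append(df[items[1::2]].mean(axis=1))  # 2nd, 4th, 6th item of the scale...
    odd = pd.concat(odd_means, axis=1)
    even = pd.concat(even_means, axis=1)
    # Row-wise Pearson correlation between each person's odd and even subscale scores.
    return odd.apply(lambda row: row.corr(even.loc[row.name]), axis=1)

# Example usage with hypothetical item columns:
# scales = {"extraversion": ["e1", "e2", "e3", "e4"], "agreeableness": ["a1", "a2", "a3", "a4"]}
# df["even_odd"] = even_odd_consistency(df, scales)
```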
Meade and Craig (2012) conducted an extensive analysis of the different CR
indicators and recommend that researchers use a multi-dimensional approach. They found
that self-report items were insufficient in detecting CR, as they were not highly correlated
with the calculation indicators. In other words, some participants that displayed CR from
statistical analyses did not self-report any insufficient effort in responses. Meade and Craig
(2012) suggest the use of a combination of built-in items and statistical analyses. Their
approach has become a standard for the detection of CR.
Simply detecting CR, however, is not enough. The next logical step is to think about
how to address identified careless respondents in the data set. One approach is to simply
eliminate careless responders from analyses (Tabachnick & Fidell, 2013). However, as noted
by Ward and Pond (2015), the elimination of careless respondents has the potential to reduce
sample sizes and affect the distribution. A potentially biased reduction of the sample size
could limit the external validity of the research at hand. Instead of eliminating CR after it
happens, it is important to examine ways to prevent CR from occurring in the first place. In
an attempt to improve the methodology of survey data collection, researchers should
investigate ways to keep participant attention and ensure that participants provide responses
based on their true reaction to items. If not, we might use bad data to justify our decisions
and actions (Ward & Pond, 2015).
Finding Ways to Prevent Careless Response
Meade and Craig (2012) suggest four main reasons for the occurrence of CR: low
respondent interest, length of surveys, lack of social contact, and lack of environmental
control. The present study addresses two of these issues, social contact and environmental
control, in an attempt to increase respondent effort and attention and in turn reduce CR.
Social interaction. As noted by Meade and Craig (2012), online survey
participation is passive. Internet-based methods represent a change from the original
paper-and-pencil administration of surveys, in which a proctor is present. There is minimal, if any,
interaction between the researcher and participant. Johnson (2005) argues that the physical
distance between the researcher and participant and the lack of personalization in online
surveys combine to reduce the participant’s accountability. Dillman, Smyth, and Christian
(2009) emphasize the importance of social contact in research, and particularly in online
surveys. They argue that survey design should be highly tailored in a way that improves
the social interaction between the researcher and the participant. One such way suggested
to improve this interaction is with the use of a virtual human embedded in online surveys
(Ward & Pond, 2015). The current study expands upon the work of Ward and Pond (2015)
and further examines the use of a virtual human to reduce CR.
In the presence of an observer (such as a proctor in an in-person survey), individuals
tend to put forth more effort on tasks. However, the relationship between observer presence
and performance is very complex. The social facilitation effect is the phenomenon in which
individuals perform better on simple tasks (and worse on complex tasks) in the presence of
another individual as opposed to when alone (Zajonc, 1980). Many theories attempt to
address why a change in performance occurs in the presence of others. Researchers argue
that the mere presence of another person could enhance or hinder the drive within a person
to complete a task (Zajonc, 1980), or even lead a person to shift his/her cognitive
processing capacity (Baron, 1986), as moderated by variables like task complexity and
evaluation context. Most notable of the social facilitation theories for the present study are
those regarding social factors. Cottrell (1972) believed that the presence of another person
made individuals concerned with how they would be evaluated. An evaluation
apprehension drove the individual to perform, and previous experiences with evaluation
contributed to individual drive reactions. On simple tasks, this evaluation apprehension and
drive leads to better performance, but on complex tasks, the individual may place too much
pressure on himself/herself and fail. Similarly, Duval and Wicklund (1972) argued that
self-awareness, or a focus on the self in a way that considers the view of others, leads to
improved performance possibly because the individual tends to focus attention inward
(Carver & Scheier, 1981). Overall, the social facilitation effect occurs when an individual
puts more effort into a task in the presence of an observer.
The completion of a survey falls into the category of a simple rather than a complex
task. Most online surveys contain personality measures and self-report items that require
limited cognitive effort. Thus, one argument in favor of a proctor for surveys is that the
presence of another person induces a social facilitation effect in which the participant is
motivated to put forth effort in the task. Research shows that individuals may have the
same kinds of responses to virtual humans as they do to actual humans (Gratch, Wang,
Okhmatovskaia, Lamothe, Morales, van der Werf, & Morency, 2007; Park, 2009).
Behrend and Thompson (2011, 2012) showed that virtual humans can increase an
individual’s attention and accountability, and Park and Catrambone (2007) argue that these
increases are likely due to an enhanced sense of interaction during the virtual experience.
In essence, an introduction of a virtual proctor means the individual is no longer simply
providing answers to a machine. That individual is taking part in a social interaction with
the virtual human.
Ward and Pond (2015) recently combined a virtual human with varying survey
instructions in an attempt to reduce CR. They found that when paired with warning
instructions (indicating strictly that the individual would be removed from the data set if CR
was detected), CR significantly decreased. However, this was the only condition in which
CR decreased—when they paired the virtual human with normal survey instructions, CR did
not decrease. The present study expands upon Ward and Pond’s (2015) methodology in an
attempt to reduce CR through a virtual proctor with a few modifications.
In line with the social facilitation literature, the first reason behind the use of a virtual
proctor is to induce a social interaction experience for the participants in which they feel they
are being observed and thus evaluated. In Ward and Pond (2015), the virtual human
remained present throughout the survey, but did not have any true interaction with the
individual. I would like to enhance the experience of the social interaction by having the
virtual human provide progress feedback throughout the survey. Researchers have used
progress feedback strategies such as progress bars or textual messages in the past in an effort
to reduce survey attrition (i.e., to motivate participants to finish a lengthy survey rather than
stop in the middle) (Yan, Conrad, Tourangeau, & Couper, 2010). Results of progress
feedback effectiveness in reducing survey attrition are inconclusive. The use of progress
feedback in the present study was not in an effort to reduce attrition, but rather in an effort to
enhance the social interaction between the virtual proctor and the participant. It was a way to
provide a reminder to the participant throughout the survey of the observer’s presence in an
effort to direct his/her attention to survey items. In turn, the participant’s guided attention
and effort should reduce CR.
Emphasis on attention in instructions. Also in an attempt to enhance participant
attention, I suggest an emphasis on the proctor in the survey instructions. Other researchers
have modified instructions to enhance participant attention and reduce CR, particularly
through warning instructions (Huang et al., 2012). As mentioned, Ward and Pond (2015)
found that when paired with a virtual human, warning instructions significantly reduced CR.
However, warning instructions have a negative effect on attitudes about surveys (Meade &
Craig, 2012). Perhaps there is a better way to reduce CR with instructions other than
including a strict warning. Normal survey instructions, as defined by Huang et al. (2012),
emphasize honesty, accuracy, and anonymity. I believe the researchers omitted a key piece
worth emphasizing: attention.
When taking a lengthy survey, allocation of cognitive resources to survey content can
be challenging. Access to tablets, smartphones, and the Internet during online surveys may
make online participants particularly prone to environmental distractions. One study found
that 30% of participants during an online survey reported experiencing one or more
distractions during the survey (Zwarun & Hall, 2014). As noted previously, 52% of
participants in the 18-24 age range reported multitasking on another electronic-based activity
during the survey. Those that reported multitasking also reported feeling more distracted
throughout the survey. To address the issues of multitasking and distraction, I propose an
addition to survey instructions that emphasizes the importance of paying attention.
Research on attention regulation, particularly on cognitive resource theory, states that
individual differences in attentional resources interact with difficulty of task demands to
predict performance (Randall, Oswald, & Beier, 2014). Thus, designing a survey that can
maintain the attention of individuals with differing levels of attention in uncontrolled
environments is a challenging task. Executive control theory suggests that in pursuit of a
goal, individuals attempt to control their cognitive resources by maintaining attention to task-relevant
information and blocking out or ignoring other information (Kane, Poole, Tuholski, & Engle,
2006). If the instructions of the survey emphasize attention as an important piece of the
completion of the task, I argue that participants will devote more attention to items. In line
with executive control theory, individuals will maintain attentional focus on the survey in an
effort to complete their goal (i.e., the survey). Thus, I suggest that following normal
instructions, the instructions emphasize both the proctor’s presence and its purpose—to
maintain their attention to the survey. This initial emphasis on attention will ignite their
self-regulation process, and the progress feedback throughout will remind them of the purpose of
the proctor. Additionally, the progress feedback will remind them of the mere observation
from the proctor, and thus lead to more effort on the survey through a social facilitation
effect. In total, the virtual proctor should improve participant attention and thus reduce CR.
Hypothesis 1: Individuals in the virtually proctored condition will score lower on a
multivariate composite of CR indicators than those in the control condition.
Because the virtual proctor plays the same role as an in-person observer, individuals
should perform similarly in the presence of virtual and real humans. The proposed study
will include an in-person condition in which a proctor walks around the room at 10-minute
intervals also in an attempt to remind students to pay attention. I expect the in-person
proctor to serve as a reminder to self-regulate and produce a similar social facilitation effect
as the virtual proctor. Thus, I predict that students in the in-person group will produce
significantly lower CR than those in a control online survey group with no proctor.
Hypothesis 2: Individuals in the in-person proctored condition will score lower on a
multivariate composite of CR indicators than those in the control condition.
I will use a multi-dimensional approach to detect CR during a lengthy personality
survey to see whether the presence of a proctor, virtual or in-person, reduces CR.
Method
Participants
I will recruit approximately 400 participants from undergraduate introductory
psychology courses at a large Southeastern United States university. Random assignment
will place the students into one of three conditions with the option to opt out upon placement.
Study Design
This study will use an experimental between-subjects design with survey condition as
the independent variable and a multivariate composite of CR indicators as the dependent
variable. The study will have three conditions. One condition will consist of a survey
proctored in person in a classroom on the university’s campus. The other two conditions will
be Internet-administered. One Internet-administered condition will include a virtual proctor
and one, the control, will not. Careless response indicators (Mahalanobis distance, Even-Odd
Consistency, Psychometric Synonyms, Maximum LongString, instructed-response sum,
bogus item sum, and participant engagement average) will be the dependent variables.
Procedure
Students in introductory psychology courses will be recruited to participate to fulfill a
course requirement. The survey description will indicate that participation in this study
presents the possibility of taking a survey in person, on campus. The researcher will reserve a
classroom on campus at various times during a 3-week period. Students randomly assigned
to the in-person condition will receive an email with the time slot options and sign up for a
convenient time of their choice. I will set limits that each survey sitting must contain
between four and 10 participants in an attempt to ensure some external validity. Participants
will be asked to reschedule if three or fewer sign up for a particular time slot.
Participants placed in the online-administered conditions will receive an email with a
hyperlink to the survey. The survey description will indicate that it must be taken in one
sitting and that it will take approximately one hour to complete.
Survey Conditions
In the in-person condition, students will arrive in a classroom on the university’s
campus and the researcher will provide laptops to participants (rented from the university).
Upon entrance to the classroom, the proctor will email the survey link to each participant and
read the instructions aloud as the students read along on their screens. All three conditions
will begin with a set of normal survey instructions, adapted from Huang et al. (2012): “There
are no correct or incorrect answers on this survey. Please respond to each statement or
question as honestly and accurately as you can. Your answers cannot be linked to your
identity.” The last sentence is a modification from Huang et al.’s indication that answers
would be kept confidential. Confidentiality might imply that answers are linked to the
participant, but remain private. I wanted to make it clear that not only would answers be kept
private, there would also be no way of linking answers to an individual person. I hope that
this modification will lead participants to answer more accurately and honestly.
Proctor instructions. In both the in-person and virtually proctored conditions, the
survey instructions will emphasize the reason for the proctor’s presence. Following the
normal instructions, the in-person condition will state: “To ensure the quality of survey data,
this survey is being proctored by the instructor in the room. The proctor will walk around the
room at various times in an effort to encourage you to maintain full attention to the survey.”
For the virtual proctor condition, the instructions will state: “To ensure the quality of survey
data, this survey is being proctored by a virtual human. The proctor will provide you updates
on your progress in an effort to encourage you to maintain full attention to the survey.”
In-person condition. The proctor will sit at a desk at the front of the room. At every
10-minute interval, the proctor will stand up and walk around the room. She will pause
behind each student for a 5-second period, and then return to the desk.
Virtual proctor condition. The virtual proctor will be a moving image of a person’s
head and shoulders placed in the upper left-hand corner of the screen, adapted from Ward
and Pond (2015). The virtual human will be gender and race neutral and exhibit lifelike
movement including breathing and blinking. The virtual human will remain visible
throughout the survey as the participant scrolls up and down each page. After the participant
completes two pages of the survey, text will appear in a speech box below the proctor,
stating: “You have completed 100 questions of the survey. Please continue.” This statement
will continue after completion of page 4 and page 6, with the proper number of questions
stated. After page 8, the text will say: “You have 200 questions remaining in the survey.
Please continue.” This statement will continue on alternating pages with the proper number
of questions until the student completes the survey.
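To make the feedback schedule concrete, the sketch below encodes the rule just described; the page size and total question count are assumptions chosen so that the quoted messages are reproduced, not figures taken from the actual survey layout.

```python
# A sketch of the virtual proctor's progress-feedback rule; QUESTIONS_PER_PAGE and
# TOTAL_QUESTIONS are illustrative assumptions, not the survey's actual paging.
from typing import Optional

QUESTIONS_PER_PAGE = 50
TOTAL_QUESTIONS = 600

def proctor_message(pages_completed: int) -> Optional[str]:
    """Return the speech-box text shown after a page, or None when no message is due."""
    if pages_completed % 2 != 0:
        return None  # feedback appears only after every second page
    answered = pages_completed * QUESTIONS_PER_PAGE
    if pages_completed <= 6:
        return f"You have completed {answered} questions of the survey. Please continue."
    remaining = TOTAL_QUESTIONS - answered
    if remaining > 0:
        return f"You have {remaining} questions remaining in the survey. Please continue."
    return None

for page in range(1, 13):
    message = proctor_message(page)
    if message:
        print(f"After page {page}: {message}")
```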
Control condition. In the control condition, participants will receive the initial set of
normal instructions with no additional information. This condition will represent the typical
online-survey experience of undergraduate research participants without a proctor.
Materials
The survey will be based on that used by Meade and Craig (2012). Following the
logic of both these authors and Ward and Pond (2015) pertaining to the popularity of
Agree/Disagree Likert scale measures, 7-point Likert scales ranging from “strongly disagree”
to “strongly agree” will be used for the majority of questions in the survey.
The last page of the survey will include several self-report questions. These items will
ask about participants’ awareness of the proctor’s presence, perceptions of the proctor, and
engagement throughout the survey.
Personality. The survey will include the 300-item International Personality Item
Pool (IPIP; Goldberg, 1999). This scale is typical of those used in long surveys (Meade &
Craig, 2012).
Manipulation check. The survey will include a series of items about perceptions of
the proctor. These items will range from “I was being monitored throughout the survey” to
“The presence of the proctor made me pay more attention to the survey than I would have
otherwise.”
Measures to test hypotheses. I will use seven measures to assess careless
responding to test my hypotheses. The selection of these seven indices is based on
suggestions from Meade and Craig (2012).
Instructed-response items. As indicated by Meade and Craig (2012), instructed-response
items provide a clear metric for scoring responses as correct or incorrect. An example of
an instructed-response item is: “Select ‘strongly agree’ for this item.” My survey will include
one instructed-response item on every other survey page.
Bogus items. Also adapted from Meade and Craig (2012), bogus items have a clear
“correct” answer, such as “I am using a computer currently.” The survey will include one
bogus item on every other survey page, alternating pages with instructed-response items.
Participant engagement scale. I will use the 15-item scale developed by Meade and
Craig (2012) that measures participant self-reported diligence and interest throughout the
survey. Examples of items are “I carefully read every survey item” and “I put forth my best
effort in responding to this survey.” A composite of responses of these items will produce
each participant’s engagement score.
Outlier analysis. To measure the Mahalanobis distance, I will calculate each
respondent’s distance from the average response pattern (Meade & Craig, 2012). Based on
the methods of both Meade and Craig (2012) and Ward and Pond (2015), I will calculate
Mahalanobis distance for the five personality factors. Each participant will receive five
Mahalanobis distance measures that are then averaged to provide a single measure for each
participant. Higher values for this distance indicate CR (Meade & Craig, 2012).
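A minimal sketch of this per-factor computation, assuming a hypothetical mapping from each of the five factors to its item columns in a pandas DataFrame, might look as follows:

```python
# A sketch of the averaged per-factor Mahalanobis distance; the factor-to-items mapping
# is a hypothetical placeholder for the IPIP item columns.
import numpy as np
import pandas as pd

def mean_mahalanobis(df: pd.DataFrame, factor_items: dict) -> pd.Series:
    distances = []
    for factor, items in factor_items.items():
        x = df[items].to_numpy(dtype=float)
        centered = x - x.mean(axis=0)
        # Pseudo-inverse guards against a singular covariance matrix in small samples.
        inv_cov = np.linalg.pinv(np.cov(x, rowvar=False))
        d_squared = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
        distances.append(pd.Series(np.sqrt(d_squared), index=df.index, name=factor))
    # One distance per factor, averaged into a single value per respondent.
    return pd.concat(distances, axis=1).mean(axis=1)

# Example usage with hypothetical columns per factor:
# factor_items = {"extraversion": ["e1", "e2", "e3"], "agreeableness": ["a1", "a2", "a3"]}
# df["mahalanobis"] = mean_mahalanobis(df, factor_items)
```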
Consistency indicators. Both the Even-Odd Consistency and Psychometric Synonym
measures look at the extent to which participants respond consistently to items regarding the
same construct. I will base my analyses of these indices on those of Meade and Craig (2012).
Response pattern. The LongString measure identifies response patterns in which the
participant chooses the same response option for an extended run of consecutive items,
indicating CR. I will base my calculation of Maximum LongString on that of Meade and Craig (2012).
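As a brief sketch, assuming the Likert responses sit in hypothetical item columns ordered as they appear in the survey, Maximum LongString is simply the longest run of identical consecutive responses:

```python
# A sketch of the Maximum LongString index; `item_cols` is a hypothetical list of the
# survey's item columns in presentation order.
import pandas as pd

def max_longstring(row: pd.Series) -> int:
    """Length of the longest run of identical consecutive responses in one row."""
    values = row.tolist()
    if not values:
        return 0
    longest = run = 1
    for previous, current in zip(values, values[1:]):
        run = run + 1 if current == previous else 1
        longest = max(longest, run)
    return longest

# Example usage: df["max_longstring"] = df[item_cols].apply(max_longstring, axis=1)
```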
Proposed Analyses
Analysis of Fidelity of Manipulations
I will run manipulation checks to ensure that participants in the proctor conditions
perceived the proctor’s presence and understood why he/she was present. Additionally, I
will check whether or not participants properly perceived and remembered the survey
instructions.
Manipulation checks for instructions. To ensure that participants perceived the
content of the survey instructions, I will run one-way fixed effects ANOVAs on the
manipulation check items that I build into the survey. One such item will state: “I remember
what the instructions said for this survey.” For this item, I do not expect differences between
groups. The other two instruction manipulation check items will be “The instructions for this
survey addressed the presence of a proctor” and “The instructions for this survey addressed
the importance of my attention to the survey.” I expect a significantly higher score for these
items in the in-person and virtual proctor conditions as compared to the control condition.
Manipulation check for progress feedback. I will run a one-way fixed effects
ANOVA to ensure that participants in the virtual human condition perceived the progress
feedback. One item will state: “A proctor provided me feedback on my progress throughout
this survey.” I expect to see a significantly higher score for the virtually proctored group as
compared to the two other groups.
Manipulation check for proctor presence. I will use several items to verify that
participants perceived the presence of the proctor. I will run one-way fixed effects ANOVAs
for each of these items; for some, I expect significant differences between the in-person and
virtual proctor conditions, and for others I do not expect a significant
difference. I do not expect a significant difference between the in-person and virtual proctor
group for more general proctor items (“I was being monitored during this survey”), but I do
expect significant differences on items like: “A virtual proctor monitored my survey activity
through an animated picture of a person.” The virtual human condition should have
significantly higher scores here in comparison to both the in-person and control conditions.
Analyses of Hypotheses
I will perform a one-way MANOVA using SAS 9.3. The between-subjects variable
will be survey condition (in-person, virtual proctor, or control). The dependent variable will
be the multivariate composite of CR indicators (instructed-response sum, bogus item sum,
participant engagement scale average, Mahalanobis distance, Even-Odd Consistency,
Psychometric Synonyms, and Maximum LongString). If the MANOVA indicates a
significant effect for survey condition on CR prevalence, I will perform pairwise
comparisons to see which conditions significantly differ from one another. I predict
significantly lower CR in the virtually proctored condition compared to the control condition.
I also predict significantly lower CR in the in-person condition versus the control.
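The analysis itself is planned in SAS 9.3; purely as an illustration of the model, a rough equivalent in Python with statsmodels is sketched below, using hypothetical variable names for the condition factor and the seven CR indicators.

```python
# An illustrative one-way MANOVA sketch in Python (the proposal specifies SAS 9.3);
# all column names are hypothetical placeholders for the study's variables.
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

def condition_manova(df: pd.DataFrame):
    formula = ("ir_sum + bogus_sum + engagement + mahalanobis + even_odd + "
               "psychometric_synonym + max_longstring ~ condition")
    fit = MANOVA.from_formula(formula, data=df)
    return fit.mv_test()  # reports Pillai's trace, Wilks' lambda, etc., for condition

# Example usage, where df holds one row per participant and a string `condition` column:
# print(condition_manova(df))
```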
I will also run univariate ANOVAs for each of the CR indicators to see whether there are
differences in particular CR types among groups, followed by pairwise comparisons if a
significant effect is found.
I will also perform chi-square tests to examine relationships between dichotomous
CR indicators. In particular, I will run a chi-square test for the instructed-response items and
bogus items. I will also run a chi-square test for the Even-Odd Consistency and the
Psychometric Synonyms. These tests will determine if the compared indices provide
different information about CR, or if one is sufficient to use instead of both.
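A short sketch of one such comparison, assuming hypothetical 0/1 flag columns (e.g., ir_flag and bogus_flag) in a pandas DataFrame, is shown below.

```python
# A sketch of a chi-square test between two dichotomous CR flags; the flag column
# names are hypothetical.
import pandas as pd
from scipy.stats import chi2_contingency

def flag_chi_square(df: pd.DataFrame, flag_a: str, flag_b: str) -> None:
    table = pd.crosstab(df[flag_a], df[flag_b])       # 2 x 2 contingency table
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{flag_a} x {flag_b}: chi2({dof}) = {chi2:.2f}, p = {p:.3f}")

# Example usage: flag_chi_square(df, "ir_flag", "bogus_flag")
```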
References
Aiello, J. R., & Douthitt, E. A. (2001). Social facilitation from Triplett to electronic
performance monitoring. Group Dynamics: Theory, Research, and Practice, 5, 163-180.
doi:10.1037/1089-2699.5.3.163
Baer, R. A., Ballenger, J., Berry, D. T. R., & Wetter, M. W. (1997). Detection of random
responding on the MMPI-A. Journal of Personality Assessment, 68, 139–151.
doi:10.1207/s15327752jpa6801_11
Baron, R. S. (1986). Distraction-conflict theory: Progress and problems. Advances in
Experimental Social Psychology, 19, 1-36.
Beach, D. A. (1989). Identifying the random responder. The Journal of Psychology:
Interdisciplinary and Applied, 123, 101-103.
Behrend, T. S., & Thompson, L. F. (2012). Using animated agents in learner-controlled
training: The effects of design control. International Journal of Training and
Development, 16, 263-283. doi:10.1111/j.1468-2419.2012.00413.x
Behrend, T., & Foster Thompson, L. (2011). Similarity effects in online training: Effects
with computerized trainer agents. Computers in Human Behavior, 27, 1201-1206.
Berry, D. T. R., Wetter, M. W., Baer, R. A., Larsen, L., Clark, C., & Monroe, K. (1992).
MMPI-2 random responding indices: Validation using a self-report methodology.
Psychological Assessment, 4, 340 –345. doi:10.1037/1040-3590.4.3.340
Carrier, L. M., Cheever, N. A., Rosen, L. D., Benitez, S., & Chang, J. (2009). Multitasking
across generations: Multitasking choices and difficulty ratings in three generations of
Americans. Computers in Human Behavior, 25, 483-489.
Carver, C. S., & Scheier, M. F. (1981). The self-attention-induced feedback loop and social
facilitation. Journal of Experimental Social Psychology, 17, 545-568.
Clark, M. E., Gironda, R. J., & Young, R. W. (2003). Detection of back random responding:
Effectiveness of MMPI-2 and personality assessment inventory validity indices.
Psychological Assessment, 15, 223–234. doi:10.1037/1040-3590.15.2.223
Cottrell, N. B. (1972). Social facilitation. In C. G. McClintock (Ed.), Experimental social
psychology (pp. 185-236). New York: Holt.
Dillman, D. A., Smyth, J. D., & Christian, L. M. (2009). Internet, mail, and mixed-mode
surveys: The tailored design method (3rd ed.). Hoboken, NJ: Wiley.
Duval, S., & Wicklund, R. A. (1972). A theory of objective self-awareness. New York:
Academic Press.
Ehlers, C., Greene-Shortridge, T. M., Weekley, J. A., & Zajack, M. D. (2009). The
exploration of statistical methods in detecting random responding. Poster session
presented at the annual meeting for the Society for Industrial/Organizational
Psychology, Atlanta, GA.
Goldberg, L. R. (1999). A broad-bandwidth, public domain, personality inventory
measuring the lower-level facets of several five-factor models. In I. Mervielde, I.
Deary, F. De Fruyt, & F. Ostendorf (Eds.), Personality psychology in Europe (pp. 7–28).
Tilburg, The Netherlands: Tilburg University Press.
Gordon, M. E., Slade, L. A., & Schmitt, N. (1986). The “science of the sophomore”
revisited: From conjecture to empiricism. Academy of Management Review, 11,
191-207.
Gratch, J., Wang, N., Okhmatovskaia, A., Lamothe, F., Morales, M., van der Werf, R. J.,
& Morency, L. P. (2007). Can virtual humans be more engaging than real ones?
Paper presented at the 12th international conference on human-computer
interaction: intelligent multimodal interaction environments, Beijing, China.
Huang, J. L., Curran, P. G., Keeney, J., Poposki, E. M., & DeShon, R. P. (2012). Detecting
and deterring insufficient effort responding to surveys. Journal of Business and
Psychology, 27, 99-114. doi:10.1007/s10869-011-9231-8
Jackson, D. N. (1977). Jackson Vocational Interest Survey manual. Port Huron, MI:
Research Psychologists Press.
Johnson, J. A. (2005). Ascertaining the validity of individual protocols from web-based
personality inventories. Journal of Research in Personality, 39, 103-129.
doi:10.1016/j.jrp.2004.09.009
Kane, M. J., Poole, B. J., Tuholski, S. W., & Engle, R. W. (2006). Working memory
capacity and the top-down control of visual search: Exploring the boundaries of
“executive attention.” Journal of Experimental Psychology: Learning, Memory, and
Cognition, 32, 749–777. doi:10.1037/0278-7393.32.4.749
Meade, A. W., & Craig, S. B. (2012). Identifying careless responses in survey data.
Psychological Methods, 17, 437-455. doi:10.1037/a0028085
Nichols, D. S., Greene, R. L., & Schmolck, P. (1989). Criteria for assessing inconsistent
patterns of item endorsement on the MMPI: Rationale, development, and empirical
trials. Journal of Clinical Psychology, 45, 239-250.
doi:10.1002/1097-4679(198903)45:2<239::AID-JCLP2270450210>3.0.CO;2-1
Park, S. (2009). Social responses to virtual humans: The effect of human-like
characteristics (Unpublished doctoral dissertation). Georgia Institute of
Technology, Atlanta, Georgia.
Park, S., & Catrambone, R. (2007). Social facilitation effects of virtual humans. Human
Factors: The Journal of the Human Factors and Ergonomics Society, 49, 1054-1060.
doi:10.1518/001872007X249910
Paulhus, D. L. (1984). Two-component models of socially desirable responding. Journal of
Personality and Social Psychology, 46, 598–609. doi:10.1037/0022-3514.46.3.598
Randall, J. G., Oswald, F. L., & Beier, M. E. (2014). Mind-wandering, cognition, and
performance: A theory-driven meta-analysis of attention regulation. Psychological
Bulletin, 140, 1411-1431. doi:10.1037/a0037428
Schultz, D. P. (1969). The human subject in psychological research. Psychological Bulletin,
72, 214–228. doi:10.1037/h0027880
Spelke, E., Hirst, W., & Neisser, U. (1976). Skills of divided attention. Cognition, 4, 215–
230. doi:10.1016/0010-0277(76)90018-4
Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (5th ed.). Boston, MA:
Pearson/Allyn & Bacon.
Tippins, N. T., Beaty, J., Drasgow, F., Gibson, W. M., Pearlman, K., Segall, D. O., &
Shepherd, W. (2006). Unproctored Internet testing in employment settings. Personnel
Psychology, 59, 189-225. doi:10.1111/j.1744-6570.2006.00909.x
Ward, M. K., & Pond, S. B. (2015). Using virtual presence and survey instructions to
minimize careless responding on Internet-based surveys. Computers in Human
Behavior.
Woods, C. (2006). Careless responding to reverse-worded items: Implications for confirmatory
factor analysis. Journal of Psychopathology and Behavioral Assessment, 28, 186–191.
doi:10.1007/s10862-005-9004-7
Yan, T., Conrad, F. G., Tourangeau, R., & Couper, M. P. (2010). Should I stay or should I
go: The effects of progress feedback, promised task duration, and length of
questionnaire on completing web surveys. International Journal of Public Opinion
Research, 23, 131-147. doi:10.1093/ijpor/edq046
Zajonc, R. B. (1980). Compresence. In P. B. Paulus (Ed.), Psychology of group influence
(pp. 35-60). Hillsdale, NJ: Erlbaum.
Zwarun, L., & Hall, A. (2014). What’s going on? Age, distraction, and multitasking during
online survey taking. Computers in Human Behavior, 41, 236-244.
doi:10.1016/j.chb.2014.09.041