Survey Experiments with Google Consumer Surveys: Promise and Pitfalls for
Academic Research in Social Science
1. Introduction
In this paper, we examine the usefulness of Google Consumer Surveys (GCS) as a
platform for conducting survey experiments in social science. GCS is a relative newcomer to the
rapidly expanding market for inexpensive (and quick) online surveying, and its methodology
differs rather dramatically from that of most of its competitors, which rely on large opt-in
Internet panels. Below, we first review this methodology and point out that its particular
strengths and weaknesses lend themselves well to a focus on survey experiments that must meet
a high bar of internal and external validity.
Our paper proceeds in the following manner. In section two, we introduce and discuss
how GCS operates in the web environment, its cost structure, strengths, and limitations. In
section three, we lay out a framework to evaluate the threats to external and internal validity
that result from the particular characteristics of GCS’ respondents. Next, we discuss the issues
of response rate, experimental balance, and sample representativeness in section four and
show the results of a series of experiments that we replicated on the GCS platform in section
five. We conclude with a discussion of our findings in section six, identifying the most
appropriate uses of GCS for social science research.
2. An introduction to Google Consumer Surveys
GCS is a new tool developed by Google that surveys a sample of Internet users as they
attempt to view content on the websites of online publishers who are part of the program.
These sites require users to complete a survey created by researchers in order to access
premium content provided by the publishers. This “survey wall” is similar to the pay wall that is
often used to gate access to premium content, but instead of having to pay or subscribe,
2
consumers can now access the content by answering a short survey. Currently, GCS is available
in four countries [1], and it takes about 48 hours to field a survey of any size.
GCS involves three different groups of users: publishers, consumers, and researchers.
Researchers create the surveys and pay GCS to have consumers answer questions, consumers
encounter the surveys when attempting to access publisher content, and publishers are paid by
GCS to have surveys administered on their sites. According to a study done by Pew in 2012 [2],
there were approximately 80 sites in GCS’ network and an additional 33 sites that were in
testing during this period of time. These sites include small and large publishers such as New
York Daily News, Christian Science Monitor, Reader’s Digest, the Texas Tribune, as well as
YouTube and Pandora.
Given the “survey wall” environment, it is essential that the surveys be short. Likewise,
as we show below, the current cost-structure of GCS makes very short surveys (even as short as
one question) the most economically attractive. This has important implications for how GCS
can best be used by social scientists. Specifically, social scientists interested in making solid
causal inferences are unlikely to find GCS a good vehicle for traditional observational designs.
Such designs usually require researchers to identify and control for a large set of potential
confounding variables, but GCS makes measuring such confounders prohibitively expensive. As we
discuss below, however, other designs for causal inference, namely those relying on survey
experiments, are better suited to take advantage of the features and economies of GCS.
Another implication of the short length of the surveys is that unlike other Internet-based
surveys that rely on large opt-in panels, researchers using GCS cannot usually collect extensive
demographic and background information on respondents in the survey itself. Since the
respondents are not part of an online panel, GCS has not collected this information at an earlier
date (as is the case in many panels). Instead, GCS provides researchers with a set of
respondents’ “inferred demographics” (location, income, age, gender) that are derived from
their browsing histories [3] and IP addresses [4]. This information is then used in the sampling
process, as depicted in Figure 1 below.

[1] The four countries are the USA, Canada, the UK, and Australia.
[2] http://www.people-press.org/2012/11/07/a-comparison-of-results-from-surveys-by-the-pew-research-center-and-google-consumer-surveys/
[3] Gender and age group are inferred from the types of pages users visit in the Google Display Network using the DoubleClick cookie, a unique identifier that Google uses to serve more relevant ads by recording users' most recent searches, previous interactions with advertisers' ads, and visits to advertisers' websites.
[4] An IP address is a number assigned to a device that connects to the Internet. From this number, the user's nearest city can be determined and, from that, income and urban density inferred.
Figure 1: Google Consumer Surveys "Survey Wall" Methodology
[Figure: an individual browsing the Internet encounters a "survey wall" in front of target content on the site of an Internet content provider that has contracted with Google. Demographics implied by the individual's browsing history and IP address (age, income, gender, location) feed a real-time stratified sampling step that matches the individual to a survey from the pool of all active surveys, targeting the age-group, gender, and location shares of the Current Population Survey.]
2.1 Process of GCS
Figure 1 shows that when individuals browse the Internet and wish to obtain access to
the premium content, they have to answer a survey taken from a pool of currently active
surveys (i.e., the “survey wall” page). The specific survey the individual is presented with,
however, is not selected randomly. Instead, GCS uses a sampling scheme designed to ensure
that each active survey is on track to have proportions of respondents by age, gender and
location (region, state) that match the Current Population Survey (CPS). To achieve this, GCS
infers the demographic characteristics of the respondents from their browsing history and IP
address and uses this information to create a sample of Internet users that resembles the
Internet-using population based on the most recent CPS Internet Users Supplement.
Importantly, this adjustment is done in real time so that if a given survey appears to have, for
example, too few female respondents, more females (according to the implied demographics)
will be assigned to that survey. Finally, since this process does not always produce proportions
of respondents that match the CPS, post-stratification weights from the CPS Internet Users
Supplement are calculated and provided to GCS users.
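To make this weighting step concrete, here is a minimal sketch of how post-stratification weights of this general kind can be computed: each respondent in a demographic cell receives the ratio of that cell's population share to its share in the realized sample. The cells and population shares below are illustrative placeholders, not the actual CPS benchmarks or GCS' implementation.

```python
from collections import Counter

# Hypothetical population shares (placeholders, not the real CPS benchmarks).
population_share = {("male", "18-34"): 0.17, ("male", "35-54"): 0.18, ("male", "55+"): 0.14,
                    ("female", "18-34"): 0.17, ("female", "35-54"): 0.19, ("female", "55+"): 0.15}

# Inferred demographics of the realized sample, one (gender, age) tuple per respondent.
sample = [("male", "18-34"), ("male", "18-34"), ("female", "55+"), ("female", "35-54"),
          ("male", "55+"), ("female", "18-34"), ("male", "35-54"), ("female", "35-54")]

counts = Counter(sample)
n = len(sample)

# Post-stratification weight for a cell = population share / sample share.
weights = {cell: population_share[cell] / (counts[cell] / n) for cell in counts}

# Attach a weight to each respondent; weighted analyses then use these values.
respondent_weights = [weights[cell] for cell in sample]
print(weights)
```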
2.2 Cost Structure
One of the main advantages of GCS for social scientists is its low cost. Given the current cost
structure, however, these advantages are felt more keenly for some types of surveys (notably the
survey experiment) than for others. GCS offers two survey formats, each with a different cost
structure. The first is a micro-survey in which respondents can
answer either one or two questions. The costs are 10¢ per response for a general (untargeted)
audience and 50¢ per response for a targeted audience filtered by a screening question. A
second option is a full survey where respondents can answer up to 10 questions and the costs
are based on survey length (see Figure 2 below) for a general untargeted audience. Importantly,
the marginal cost per response of an additional question is constant at $0.30 for two to 10
questions but is $1.00 for moving from one to two questions.
Figure 2: GCS Cost Structure
Based on these cost structures, it is clear that conducting multi-question surveys using GCS is
less cost-effective than conducting single-question surveys. As such, relying on survey
experiments rather than on a long list of control variables may be the better strategy for
researchers who want to use GCS to make causal inferences.
2.3 Limitations of GCS
Google is keenly aware of the imposition the survey wall places on potential GCS
respondents. Specifically, it interrupts their browsing activities with survey questions and
coerces them to answer these questions before proceeding to the target content. This is
significantly different from the traditional opt-in panels where respondents have agreed in
advance to participate. With this in mind, Google has concluded that it is essential for the
survey questions not to be too long, too complicated, or controversial in nature. As a result,
researchers need to construct their surveys in accordance with these guidelines. Below we briefly
describe some of the restrictions that GCS imposes and suggest some of the strategies we
have adopted to work within them. The limitations are:
a) Number of response categories: GCS keeps surveys simple by restricting answers to a maximum of five response categories plus an additional open-ended option (an "Other (please specify)" text field).
b) Number of characters/words: To keep questions short, GCS recommends a question length of 125 characters and sets a maximum of 175 characters.
c) Censored questions and populations: GCS places restrictions on sensitive demographic
information by prohibiting researchers from asking respondents for their age, gender,
ethnicity, religion, and immigration status. Researchers can only ask these questions if
the response choices include "I prefer not to say" as the opt-out answer. Also, questions
may not directly or indirectly target respondents under 18 years old or include any age
category under 18 in the answer choices.
d) Don't know and "opt out" option: Besides regulating the content and target respondents of a
question, GCS also requires an option for respondents not to answer the question, either by
clicking an "I don't know, show me another question" link or by sharing the premium content
they intend to read on their social media pages (see Figure 3 for an example). Essentially,
this ensures that all respondents have a path that allows them to skip a given question yet
still (ultimately) access their desired content. We will return to this feature in the
discussion of internal validity associated with using GCS below.
Figure 3: Example of a Survey Wall
2.4 Working with the Word Constraint
One of the constraints of using GCS for our proposed survey experiments is the word limit on
questions and their response sets. For many kinds of survey experiments of interest to social
scientists, these word limits will be too constraining. Indeed, in
many of the canonical experiments we conduct below, we could not faithfully implement the
experiments under these word limits (even after a great deal of work streamlining wording).
Thus, we devised one way to work around this limitation by using the “picture
technique.” Specifically, we capture the necessary text in a picture format that is then inserted
into the question as a screen image. Below is an example of doing this for the response set in a
list experiment. Similar pictures can also be utilized for longer questions or textual vignettes, as
is necessary in many survey experiments.
Now I am going to read you three/four things that sometimes people oppose or are
against. After I read all three/four, just tell me HOW MANY of them you oppose. I don’t
want to know which ones, just HOW MANY.
The preamble and question fit within the word limit, but the listed items do not and so we
insert them as a graphic in the question itself.
2.5 Randomizing the Treatments
GCS does not provide a specific facility for randomly assigning different versions of
questions to different respondents. Such assignment, however, can be achieved effectively by
simply submitting multiple versions of the same question to the pool of active questions. So,
for example, if a researcher wants to have 1000 respondents randomly assigned to one of two
questions, he or she can submit two one-question surveys and request 500 respondents for
each one. Since respondents are matched with questions from the pool of active questions
randomly (at least within the limits of the dynamic stratification described in Figure 1), there is
no reason to think that there could be anything systematic about the matching process that
would break random assignment of questions (at least prior to the respondent's decision
whether to opt out).
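As a concrete illustration of this workflow, here is a minimal sketch of how the exported responses from two parallel single-question surveys (one per experimental arm) might be combined and compared after fielding. The file names and the "answer" column are hypothetical placeholders, not part of GCS itself.

```python
import csv
from collections import Counter

def load_responses(path):
    """Read one exported single-question survey; returns a list of answer strings."""
    with open(path, newline="", encoding="utf-8") as f:
        return [row["answer"] for row in csv.DictReader(f)]  # assumes an 'answer' column

# Each arm was fielded as its own one-question survey with the same requested N.
control = load_responses("survey_control.csv")      # hypothetical export file names
treatment = load_responses("survey_treatment.csv")

def share(responses, target):
    """Proportion of respondents giving the target answer."""
    counts = Counter(responses)
    return counts[target] / len(responses)

# Difference in the share choosing the outcome of interest across the two arms.
effect = share(treatment, "Too little") - share(control, "Too little")
print(f"Estimated treatment effect: {effect:.3f}")
```

Because assignment to arms happens through the survey-matching process itself, the analysis reduces to a simple comparison of the two response distributions.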
3. Assessing the Validity of Conducting Survey Experiments Using GCS
In spite of the restrictions discussed above, GCS provides researchers with access to
inexpensive samples of respondents and, as we have argued above, can be used most
efficiently to conduct survey experiments (often with only one question). But are these samples
good enough to play a role in rigorous social scientific research?
There are generally two threats to validity in experimental research: threats to external
validity and threats to internal validity. Threats to external validity pertain to whether the
samples in the experiments are representative of some broader population (and so inferences can
be usefully generalized to that population), while threats to internal validity concern whether
a causal inference is valid even for the sample itself (i.e., does a given estimate truly
reflect the effect of the experimental manipulation?).
With regard to conducting experimental research using GCS, two questions about the external
validity of the sample are pertinent. First, are the samples of individuals who actually get the
opportunity to answer a given question representative of the general Internet user population?
Second, do patterns of non-response (opting out or answering "Don't Know") create significant
non-response bias? To address these questions, we first review previous research that has
compared the characteristics of GCS samples (relying on the inferred demographics of those
samples) to other survey samples used in social science research. Next, we evaluate the accuracy
of Google's inferred demographics, and finally, we examine whether respondents who opt in differ
significantly from those who opt out of GCS, including those who give a "don't know" response.
Turning to the issue of internal validity, our primary concern is whether we can make valid
causal inferences from the experiments conducted in GCS. To answer this question, we conducted
two tests. The first is a balance test. This test asks whether the treatment and control groups
are balanced on all measured variables and serves as a check on whether randomization of
treatments was achieved (in which case one would have, with sufficient numbers of respondents,
balance on all measured and unmeasured variables that might be related to the treatment and
outcome of interest). The second is a replication test of well-established social science
experiments. This allows us to determine whether we can replicate the experimental (treatment)
effects from prior studies in GCS and also to examine whether GCS' subjects behave differently
from subjects in other experimental settings.
4. External Validity
We first assessed the external validity of using GCS by reviewing some of the prior work
that has compared the characteristics of GCS’ samples to the characteristics of samples that use
other survey methods to study comparable research questions. McDonald et al. (2012) compare the
responses of a probability-based Internet panel, a non-probability-based Internet panel, and a
GCS sample against several media consumption benchmarks [5]. After matching the
results against the benchmarks, they found that the results from GCS were more accurate than
both the probability and non-probability panels in terms of average absolute error, largest
absolute error, and percent of responses within 3.5% of the benchmark. For instance, the
average absolute error of GCS was 3.25%, compared to 4.7% for the probability sample and
5.87% for the non-probability sample.
Researchers at the Pew Research Center (2012) also conducted a (pseudo) mode study
in 2012 in which they compared the results for more than 40 questions asked of a sample of
telephone respondents with those asked (over the same time period) of GCS respondents [6]. The
types of questions asked include demographic characteristics, use of technology, political
attitudes and behavior, domestic and foreign policy, and civic engagement. The reported
distributions of responses for most of these questions did not differ dramatically across modes.
Pew reports that "[a]cross several political measures, the results from the Pew Research Center
and using Google Consumer Surveys were broadly similar, though some larger differences were
observed" (2012).
4.1 Evaluating GCS’ Inferred Demographics
Next, we evaluated the accuracy of GCS' inferred demographics by replicating the study conducted
by Pew in 2012 on two demographic variables (gender and age) and adding a third variable,
income, which Pew did not report. We asked respondents to report their demographics (gender,
age, income) and then compared their answers to the inferred data provided by GCS. Tables 1-3
below report the comparison between the inferred information and the responses given by the
respondents.
[5] The benchmarks are video on demand, digital video recorder, and satellite dish usage in American households as measured by a semi-annual RDD telephone survey of 200,000 respondents.
[6] We say "pseudo mode study" because the respondents were not randomly assigned to a mode but were sampled separately in what was actually a dual-frame design. Thus, some of the differences reported may well come from the different sampling frames rather than from mode differences.
Table 1: Inferred vs Reported Gender

                                      Inferred Sex (%)
Sex given in response              Male    Female    Unknown
Pew Study 2012 (N=1056)
  Male                              79        28        58
  Female                            21        72        42
Our Study 2013 (N=1000)
  Male                              60        16        37
  Female                            13        59        26
  Prefer not to answer              27        25        37

Figures based on unweighted data; columns report percentages within each inferred-sex category.
For gender (see Table 1), the Pew studies found that about 75% of respondents had an inferred
gender that matched their self-report. However, Pew did not include "I prefer not to say" as the
opt-out answer in its response categories, as Google suggests when asking respondents for
sensitive demographic information (see Section 2.3). In our study, which includes this option,
the match rate falls to approximately 60%. Among those whose gender GCS did not infer, Pew
reported that 58% said they were male and 42% female, while we found 37% male, 26% female, and
37% preferring not to answer.
In terms of age (see Table 2), Pew found that, across the two samples it surveyed in 2012, an
average of 44% to 46% of respondents reported an age in the same category as their inferred age.
Again, the opt-out choice, "I prefer not to say," was not included in Pew's response categories.
Unlike gender, however, the share of respondents choosing the same age group as their inferred
age did not change when the opt-out option was included [7]: in our study, 47% chose the same
age group as their inferred age. Furthermore, both the Pew studies and ours showed that the
largest incongruity between self-reported and inferred age was observed for middle-aged
respondents (i.e., 35-44), of whom only about 23%-43% had their age correctly inferred.

[7] GCS limits the number of response categories to five but infers age in six categories. Pew built two separate samples with two different sets of age categories and collapsed the inferred age categories to match each. We provided four age categories, with "Not applicable" as the fifth option, and collapsed the inferred ages to match our response categories.
Table 2: Inferred vs Reported Age

Pew Study 2012 (N=1009)                        Inferred Age (%)
Age given in response        18-24   25-34   35-44   45-64   65+   Unknown
  18-24                        65      23      12       5      5      22
  25-34                        15      39      25      11      5      18
  35-44                        10      20      23      16     12      15
  45-64                         6      15      34      52     26      33
  65+                           4       4       7      16     52      12

Pew Study 2012 (N=1004)                        Inferred Age (%)
Age given in response        18-24   25-34   35-44   45-54   55+   Unknown
  18-24                        63      21      14      10     11      21
  25-34                        11      43      24      10      8      17
  35-44                        10      17      30      22      9      16
  45-54                         4       7      15      33     18      21
  55+                          11      12      16      25     54      26

Our Study 2013 (N=1003)                        Inferred Age (%)
Age given in response                    18-24   25-34   35-54   55+   Unknown
  18-24                                    48      17       5      3      10
  25-34                                    11      31      12      3      11
  35-54                                    18      27      54     19      28
  55+                                       5       4      15     56      27
  Not applicable / Prefer not to answer    18      21      14     19      25

Figures based on unweighted data; columns report percentages within each inferred-age category.
Besides replicating the studies done by Pew, we also assessed the accuracy of GCS' inferred
income by asking respondents to report their income category (see Table 3). As with our
assessment of age, we collapsed the inferred income into five categories in order to accommodate
the number of response categories that GCS allows [8]. The results, however, were less accurate
than those observed for gender and age. First, we found that, on average, only 23.5% of
respondents chose an income category that matched the one GCS inferred. Second, disagreements
between what respondents reported and what GCS inferred were prevalent across all self-reported
income categories, not just in specific categories.

[8] We collapse the income groups $100,000-$149,999 and $150,000+ into one category.
Table 3: Inferred vs Reported Income

Our Study 2013 (N=1006)                              Inferred Income (%)
Income given in response   $0-24,999  $25,000-49,999  $50,000-74,999  $75,000-99,999  $100,000+  Unknown
  $0-$24,999                   34           35              32              32            50        33
  $25,000-$49,999              24           21              13              11             8        22
  $50,000-$74,999              17           19              17              13             0        22
  $75,000-$99,999              13            8              15              13            17         0
  $100,000+                    12           17              24              31            25        23

Figures based on unweighted data; columns report percentages within each inferred-income category.
Based on findings that 74% of respondents' gender, 40% of respondents' age, and 20% of
respondents' income were correctly inferred, we conclude that GCS' inferred demographics contain
substantial classification errors, especially for age and income. This level of accuracy may be
good enough for basic or "do-it-yourself" survey research, but perhaps not for professional or
academic researchers who are particularly interested in segmentation analysis of the population.
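As an illustration of the kind of accuracy check reported above, here is a minimal sketch, using made-up records, of computing the share of respondents whose self-reported category matches the category GCS inferred for them (the diagonal cells of tables like Tables 1-3).

```python
# Hypothetical paired records: (inferred category, self-reported category).
records = [
    ("male", "male"), ("female", "female"), ("male", "female"),
    ("unknown", "male"), ("female", "female"), ("male", "male"),
]

# Share of respondents whose self-report matches the inferred value,
# excluding those GCS could not classify at all.
classified = [(inf, rep) for inf, rep in records if inf != "unknown"]
match_rate = sum(inf == rep for inf, rep in classified) / len(classified)
print(f"Agreement between inferred and reported gender: {match_rate:.0%}")
```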
4.2 Evaluating the Non-Response Rate Issue
Besides the accuracy of the inferred demographics, we are also concerned that the non-response
rate and/or "Don't Know" (DK) responses may pose a threat to the external validity of GCS.
Specifically, because the survey wall confronts respondents who are motivated to reach their
preferred content, it may increase the incentive to "bail out" [9]; the important question is
whether respondents who choose to bail out are systematically different from those who do not.
The 2012 Pew study did not speak to this issue; moreover, many of its questions did not include
a DK response or report the non-response rate. Where Pew did include DK options (i.e., on
questions about the Supreme Court and presidential debates), it found higher rates of DKs than
in other survey modes. To examine this issue in greater detail, we replicated a question on
Obama approval (where the DK option was not included in the Pew studies) with a similar sample
size and included the DK option in our response categories. By doing so, we could then compare
the demographics (i.e., gender and age) of respondents and non-respondents (those who chose to
bail out) in both versions of the question.

[9] Respondents can bail out by choosing "Don't Know" from the response categories (if provided), clicking the "I don't know, show me another question" link, or sharing the premium content on their social media pages.
Table 4: Demographics of Respondents and Non-Respondents (Obama Approval)

                    GCS Sample (Pew, 2012)                GCS Sample (2013)
                    % Response
                    (DK option       % Non-      % Answer   % DK    % Total     % Non-
Group               not provided)    Response                       Response    Response
Male                    46.9           52.1        47.5      41.7     46.2        54.6
Female                  53.1           47.9        52.5      58.3     53.8        45.4
18-24                   11.7           15.9         2.7       8.6      4.1         9.0
25-34                   24.3           23.6        13.5      17.7     14.7        18.0
35-44                   16.7           20.4        13.7      20.9     16.1        14.5
45-54                   23.1           12.7        23.8      19.9     23.0        19.5
55-64                   14.0           15.3        31.9      23.7     28.5        19.9
65+                     10.1           12.1        14.5       9.1     13.5        19

% Non-Response refers to respondents who chose the link to get another question or the link to their social media pages.
Based on the comparison between our study and Pew's (see Table 4), we found that male
respondents were more likely than women to be non-responsive to the surveys. This pattern, which
has been observed in other studies (Ferber 1966; Riphahn and Serfling 2003), may require some
post-stratification weighting to correct for possible overrepresentation of certain demographic
groups in the sample (Gelman and Carlin, 2001; Pike, 2007). On the other hand, we found no
evidence that age is a factor in determining one's propensity to respond, as respondents across
all age categories made up similar proportions of the non-responsive sample.
With regard to the issue of DK responses, we found no significant differences in gender between
those who answered the question and those who chose DK. Although women had a somewhat higher
propensity to answer DK than men, the gender balance was rather similar between those who
answered (47.5% male, 52.5% female) and those who chose the don't know response (41.7% male,
58.3% female). On the other hand, we observed some differences in age: the share of younger
respondents (i.e., 18-34 years old) was higher among those who chose don't know (26.3%) than
among those who chose an offered response (16.2%). Conversely, the share of older respondents
(i.e., 55+ years old) was higher in the group that answered (46.4%) than in the group that chose
DK (32.8%).
Based on the analyses above, we conclude that although there are ways to correct some of these
sampling problems, it is best to be cautious in treating a GCS sample as representative of the
general Internet population. First, the inferred demographics do not always provide accurate
demographic characteristics of the respondents. Second, although there are no significant
differences between respondents who choose to answer the survey wall and respondents who bail
out, we still did not obtain perfect balance in the gender and age compositions of the two sets
of respondents. Thus, a GCS sample may yield results that are inaccurate for the larger
population.
5. Internal Validity
Turning to the issue of internal validity, our primary concern is to assess whether we are able
to make causal inferences from GCS survey data. There are two ways to do this. First, we can
conduct observational studies, in which we draw inferences by observing subjects and measuring
variables of interest without assigning treatments: the treatment each subject receives occurs
without any manipulation by the researchers. As such, researchers have to control for potential
confounders in order to isolate the effect of the treatment on the subjects.
Second, we can administer survey experiments where assignments of treatment to
subjects are done using a chance mechanism. This randomization of treatments is to ensure
that treatment and control groups are balanced on all measured and unmeasured variables
that might otherwise be related to the treatment and outcome of interest. As a result, one can
make a strong causal inference from differences in responses to different treatments without
the need to control for a large number of confounding variables. Given the features and
economies of GCS, where the number of questions each respondent answers is limited, it is more
advantageous for social scientists to rely on survey experiments to establish solid causal
inferences when using GCS.
So, how does one evaluate the usefulness of GCS for social science survey experimentation? There
are two ways to do this. First, we test whether balance on demographic characteristics is
achieved across the treatment and control groups, and whether respondents who opt in and those
who opt out of GCS are similarly balanced. Second, we replicate the results of well-established
social science experiments. Mutz (2011) reviewed many
different ways population-based survey experiments can be designed to make causal
inferences. For example, she listed several kinds of survey experiments such as list experiments
and framing experiments, which are often used in social science and marketing. To this, we also
add experiments that originated from the tradition of behavioral economics that put the
respondent in a “game” such as the ultimatum and dictator games. There are challenges in
implementing each of these types of experiments in the GCS platform, but we suggest and test
several strategies for doing so below. In general, the way we implement a survey experiment in
GCS is to build a separate survey for each treatment and for the control, with the surveys
differing in question wording and/or response categories in a way that captures the intended
manipulation.
5.1 Balance Test
GCS selects and balances its sample of respondents in real time. That is, as subjects opt
in and out of the survey wall, GCS adjusts the sample of respondents on inferred demographics
such as age, gender, and income (see Figure 1). Based on this procedure, one might expect to
have a balanced sample across treatment groups on at least these inferred demographics.
However, there are at least two issues that could potentially undermine a balanced design.
First, we are concerned that the inferred demographics are noisy (and thus potentially not
externally valid) and are often unavailable for many respondents. As such, the control and
treatment groups may not be balanced on these variables for the purpose of external
generalization.
Second, we are worried that the respondents who choose to opt out (by choosing
another question or taking an alternative to answering the question) are not balanced on the
inferred demographics across the control and treatment groups. Random assignment of large
samples might assure balance on these inferred demographics variables, but as we have
already pointed out, the randomization on inferred demographics in GCS is not perfect.
Respondents do have some choice as to which questions they answer. Once they see a question,
they can choose not to answer it and instead see another question, or they can take some other
action (such as sharing the page on Facebook) that lets them move on to their desired content
without answering. This opens the possibility, for example, that female respondents who are less
interested in politics than males systematically opt out of answering political questions. One
could also imagine this pattern of opting out being more pervasive in the treatment group than
in the control group.
We ran four experiments, each with control and treatment groups, to test whether balance on GCS'
inferred demographics (i.e., gender, age, and income) is achieved and whether respondents who
opt out are balanced on these characteristics. Two of the experiments are versions of framing
experiments (Mutz 2011), while the other two are list experiments. As shown in Table 5, running
these experiments on GCS was relatively quick and inexpensive: it took two days to obtain 2,000
responses at $200 per experiment, with an average response rate of 26.3%.
Table 5: Descriptive Statistics

Experiment                           Invitations   Responses   Response Rate (%)   Cost   Time to Results (Days)
Question Wording       Control          3414          1002           29.3          $100            2
                       Treatment        3380          1001           29.6          $100            2
Asian Disease          Control          3919          1004           25.6          $100            2
                       Treatment        3949          1001           25.3          $100            2
List Experiment        Control          3925          1001           25.6          $100            2
(Female President)     Treatment        3886          1005           25.9          $100            2
List Experiment        Control          4217          1004           23.8          $100            2
(Immigration)          Treatment        3986          1002           25.1          $100            2
In spite of the fact that the inferred demographics are noisy and are often unavailable for many
respondents, we found that GCS managed to produce balance on the inferred demographics across
the control and treatment groups (see Table 6). Starting with gender, both the control and
treatment groups have more males than females (roughly 55% male vs. 35% female) and more
respondents in the $25-49K income category than in any other category. As Table 6 shows, more
than half of the respondents in the sample fall in the $25-49K column, while less than 10% of
respondents were inferred to have incomes above $75K. On age, the proportions in each category
were almost equal in the control and treatment groups. Each age category made up about 12-18% of
the sample on average, and no single age category contributed more than 20% of respondents.
Table 6: Experiments' Sample

                          Gender (%)            Age (%)                                     Income (%)                                Median Response
Experiment                Female  Male   18-24 25-34 35-44 45-54 55-64 65+   $0-24K $25-49K $50-74K $75-99K $100-149K   Time (s)
Question Wording Experiment
  Control                   36     56     14    16    14    15    20   11       8     61      25      5       2          11.5
  Treatment                 36     55     16    15    14    15    19   12       7     59      27      6       1          10.2
Framing Experiment
  Control                   35     56     17    15    14    14    18   11       6     58      28      5       2          24.8
  Treatment                 35     56     17    14    14    15    18   11       6     61      26      6       1          25.1
List Experiment (Female President)
  Control                   38     54     15    18    13    17    15   11       6     55      32      5       0          20.5
  Treatment                 37     55     17    15    13    17    16   11       7     58      28      6       0          21.2
List Experiment (Immigration)
  Control                   32     52     12    13    12    15    16   12       7     57      29      6       1          19.7
  Treatment                 31     53     14    14    13    16    16   10       7     58      27      6       1          21.8

Note: Variables with unknown values are dropped.
We next compare the rate at which individuals opt out with the rate at which they respond, and
we determine whether balance was also achieved between the control and treatment groups within
the samples that chose to opt out. Based on the results shown in Table 7, we found that the
distributions of inferred demographics in the control and treatment groups were quite similar
for both the survey participants (see Table 6) and the non-participants. Specifically, we found
that there were more males than females (56% male, 44% female) and relatively equal proportions
of each age category (around 12-18%) among those who chose to opt out. These distributions were
similar to those of respondents who chose to answer the survey, which indicates that there were
no significant differences in inferred demographics between respondents and non-respondents.
Beyond the similarity across these two sets of samples, GCS also produced balanced control and
treatment groups on the inferred demographics within the sample that chose to opt out of the
survey wall, as shown by the similar distributions of inferred gender, age, and region across
the two groups. Even though randomization in GCS is not perfect, because respondents have the
option to opt out, we still obtained balance on the inferred demographics across groups.
Table 7: Characteristics of Respondents who Opt Out

                          Gender (%)           Age (%)                                    Region (%)
Experiment                Female  Male   18-24 25-34 35-44 45-54 55-64 65+   Midwest Northeast South West
Question Wording Experiment
  Control                   43     57     18    17    15    18    17   15      22      21      34    24
  Treatment                 42     58     18    17    15    18    17   15      22      20      34    24
Framing Experiment
  Control                   43     57     17    17    16    17    19   14      21      19      36    25
  Treatment                 42     58     17    18    16    16    19   15      20      20      35    25
List Experiment (Female President)
  Control                   43     57     17    17    16    15    21   15      21      21      33    27
  Treatment                 45     55     16    18    15    17    19   15      21      21      32    27
List Experiment (Immigration)
  Control                   45     55     16    18    16    16    20   15      20      21      34    26
  Treatment                 47     53     15    18    15    19    18   17      21      21      33    27

Figures may not add to 100% because of rounding.
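As a concrete version of the balance checks summarized in Tables 6 and 7, here is a minimal sketch of a chi-square test of independence between experimental arm and an inferred demographic characteristic. The counts are hypothetical placeholders; analogous tables can be built for age, income, or region.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of inferred gender by experimental arm (rows: control, treatment).
observed = [
    [360, 560],   # control: female, male
    [355, 565],   # treatment: female, male
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.3f}")
# A large p-value is consistent with balance on this inferred characteristic.
```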
5.2 Replication of Prior Social Science Experiments
To further assess the usefulness of GCS as a platform to conduct experimental research,
we replicated the results reported in three types of well-established social science experiments.
The first is a replication of the classic framing experiments that had been done with MTurk by
Berinsky et al (2012); the second is a series of list experiments; and the third is behavioral
economics games without payment [10]. In the case of framing experiments, we found that the
manipulation effects were less pronounced in GCS than in face-to-face, student, or other
Internet-based samples. However, for the list experiments and economic games, the experimental
results using GCS are quite similar to those reported using other research methodologies (i.e.,
MTurk and the GSS).

[10] This differs slightly from the traditional way these games are run, since they have typically been conducted with payment.
5.2.1 Framing Experiments
Question Wording Experiment (N=2003): This experiment is a replication of Rasinski (1989),
Green and Kern (2010), and Berinsky et al. (2012), which asked respondents whether too much,
too little, or about the right amount was being spent on either “welfare” or “assistance to the
poor”. All of these studies have shown that support for welfare spending depends greatly on
whether it is called “assistance to the poor” or “welfare”. Rasinski (1989) reported that from
1984 to 1986, 20%-25% of the respondents from the General Social Surveys (GSS) in each year
said that too little was being spent on welfare, while 63%-65% said that too little was being
spent on assistance to the poor. The GSS continued to conduct the spending experiment and
the gap in support for increasing spending ranged from 28% to 50% from 1986-2010 (Green and
Kern, 2010). Beyond face-to-face samples, a similar gap of 38 percentage points between the two
conditions was found in an MTurk sample (Berinsky et al., 2012).
We ran a between-subject experiment on GCS in which respondents were given either the welfare or
the assistance-to-the-poor version of the question. Although we found that the direction of the
framing effect is similar to prior studies (i.e., more respondents chose "too little" in the
assistance-to-the-poor version than in the welfare version), the gap between the two conditions
is only 23 percentage points (p < .001), which is significantly smaller than the 35 to 40
percentage point differences in the GSS and MTurk samples (see Table 8).
Table 8: Question Wording Experiments

"Do you think that we are spending too much, too little, or about the right amount of money on welfare/assistance to the poor?"

% of respondents choosing "too little"
                            Rasinski (1989)     Green and Kern (2010)    Berinsky et al. (2012)    GCS Internet
                            GSS face-to-face    GSS face-to-face         MTurk sample              sample
                            1984-1986           1986-2010*
"Assistance to the poor"       63-65%               78-93%                    55%                      49%
"Welfare"                      20-25%               32-61%                    17%                      26%
Difference                     35-40%               32-46%                    38%                      23%

DK responses were dropped from the analysis. *Green and Kern's results include "About right" responses.
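For readers who want to reproduce significance statements of this kind, here is a minimal sketch of the two-proportion z-test underlying a gap such as the one reported above. The cell counts are hypothetical placeholders, not our actual data.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_ztest(x1, n1, x2, n2):
    """Z-test for the difference between two independent proportions."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided
    return p1 - p2, z, p_value

# Hypothetical counts: respondents choosing "too little" in each condition.
gap, z, p = two_proportion_ztest(x1=490, n1=1000, x2=260, n2=1000)
print(f"gap = {gap:.2f}, z = {z:.2f}, p = {p:.4f}")
```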
Asian Disease Experiment (N=2005): We also replicated another frequently employed
experiment on framing – the “Asian Disease Problem”, which was first reported in Tversky and
Kahneman (1981). This experiment was first conducted using a student sample but has been
replicated many times and in different subject pools (Takemura 1994; Kuhberger 1995; Jerwen
et al 1996; Druckman 2001; Berinsky et al 2012). All respondents were given the following
scenario:
Imagine that your country is preparing for the outbreak of an unusual disease, which is expected to kill 600 people.
Two alternative programs to combat the disease have been proposed. Assume that the exact scientific estimates
of the consequences of the programs are as follows:
They were then given one of the following two conditions:
Condition 1, Lives Saved: If Program A is adopted, 200 people will be saved. If Program B is adopted, there is one-third probability that 600 people will be saved, and two-thirds probability that no people will be saved.
Condition 2, Lives Lost: If Program A is adopted, 400 people will die. If Program B is adopted, there is one-third probability that nobody will die, and two-thirds probability that 600 people will die.
In each of these two conditions, respondents were asked to choose one of the two programs: the
first a program with certain consequences, the second a program with probabilistic outcomes. In
the first condition, both the certain and risky programs were framed in terms of positive
outcomes (lives saved), while in the second condition the two programs were described in terms
of negative outcomes (lives lost).
In their experiments with a student sample, Tversky and Kahneman reported that when the problem
was framed in positive terms, 72% of respondents picked the certain option, but when it was
framed in negative terms, only 22% did so. A similar pattern has been observed in subsequent
replications, where the gap in choosing the certain option across the two conditions ranges
between 32 and 60 percentage points (see Table 9). In our replication using GCS, we found a
similar preference reversal: 58% of respondents picked the certain option in the lives-saved
condition, while only 33% did so in the lives-lost condition. However, the gap of only 25
percentage points (p < .001) indicates that the treatment effect of the framing is significantly
attenuated in GCS relative to other research designs.
Table 9: Asian Disease Experiments

% of respondents choosing the certain option (i.e., Program A)
                          Tversky &     Takemura    Kuhberger    Jerwen et al.   Druckman    Berinsky et al.   GCS
                          Kahneman      (1994)      (1995)       (1996)          (2001)      (2012)            Internet
                          (1981)        student     student      student         student     MTurk             sample
                          student       sample      sample       sample          sample      sample
                          sample
"Lives Saved" condition      72%           54%         80%           65%            68%          74%             58%
"Lives Lost" condition       22%           22%         20%           20%            23%          38%             33%
Difference                   50%           32%         60%           45%            45%          36%             25%

DK responses were dropped from the analysis.
5.2.2 List Experiments
Next, we replicated two different "list experiments" (the "item count technique"), which are
designed to elicit answers to questions on which respondents may have an incentive to over- or
under-report or otherwise shade the truth. The use of list experiments has increased, and we
expect GCS to be particularly useful for implementing them.
The first replication is the list experiment conducted by Janus (2010), who showed that there is
much more resistance to immigration than individual respondents admit to in standard survey
questions. Because of the word limit that GCS imposes on its surveys, we were unable to follow
the exact wording that Janus used and instead made the instructions as brief as possible. Below
is a brief description of the experiment:
Two questions will be given with different response categories (and different respondents).
First, both sets of respondents get the question:
How many of these items are you opposed to?
Both groups of respondents are then given the same set of non-sensitive items to choose from [11]:
1. The way gasoline prices keep going up.
2. Professional athletes getting million dollar-plus salaries.
3. Requiring seat belts to be used when driving.
4. Large corporations polluting the environment.
One of the groups serves as the baseline (or control), while respondents in the treatment group
receive an additional sensitive item, which is:
5. Cutting off immigration to the United States.
Respondents were only asked to say how many statements they oppose, not which ones. This
alleviates respondents' concern about giving the socially desirable answer and allows them to
reply more honestly. Because the two groups are randomly assigned, the mean number of "oppose"
responses to the non-sensitive statements should be the same for both the baseline and treatment
groups; therefore, any increase in the mean number of opposed items in the
treatment group must be attributed to the "cutting off immigration to the United States"
statement. We also replicated a similar list experiment in GCS from Streb et al. (2008) that
examines attitudes about a female president. In that experiment, both groups received the same
non-sensitive items as in the immigration experiment, while the treatment group received the
sensitive item "A woman serving as president."

[11] The statements are rotated.
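To make the estimator concrete, here is a minimal sketch, with made-up response counts, of the difference-in-means calculation behind list experiments of this kind: the estimated share opposing the sensitive item is simply the treatment-group mean minus the baseline-group mean.

```python
from statistics import mean, variance
from math import sqrt

# Hypothetical counts of "how many items do you oppose?" answers, one per respondent.
baseline = [2, 3, 1, 2, 2, 3, 2, 1, 2, 3]    # saw only the non-sensitive items
treatment = [3, 3, 2, 2, 4, 3, 2, 3, 2, 3]   # saw the non-sensitive items plus the sensitive one

# The list-experiment estimate: extra items opposed attributable to the sensitive item.
estimate = mean(treatment) - mean(baseline)

# Standard error of a difference in means (independent samples).
se = sqrt(variance(treatment) / len(treatment) + variance(baseline) / len(baseline))

print(f"Estimated share opposing the sensitive item: {estimate:.2f} (SE = {se:.2f})")
```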
Table 10 presents the mean number of opposed items in the baseline and treatment groups, as well
as the implied percentage opposed to immigration to the United States and to a woman serving as
president. Overall, we found that 34-39% of respondents oppose immigration and 23-26% oppose a
female president. Unlike the framing experiments, the findings from the GCS Internet sample
comport closely with the established benchmarks, even though the question wording used here was
significantly shorter. Therefore, GCS has the potential to be a reliable platform for social
scientists who are interested in manipulating answer categories to measure sensitive topics in a
way that circumvents social desirability concerns.
Table 10: List Experiments

                                   "How many of these three/four statements      "How many of the following four/five statements
                                   do you oppose?"                               make you 'angry or upset'?"
                                   Sensitive item: "Cutting off immigration      Sensitive item: "A woman serving as president"
                                   to the United States"

                                   Janus (2010)         GCS Internet             Streb et al. (2008)      GCS Internet
                                   TESS Internet        sample                   Telephone sample         sample
                                   sample               (N=2006)                 (N=2056)                 (N=2006)
Mean number of baseline items          1.77                2.46                      2.16                     2.3
Mean number of treatment items         2.16                2.8                       2.42                     2.53
Difference                             39%                 34%                       26%                      23%
                                   ("opposed to immigration")                    ("woman president makes them upset")
5.2.3 Behavioral Economics Games
Finally, we examine if GCS can be used to conduct the kind of experiments that are the
backbone of behavioral economics. Specifically, we replicated the ultimatum and dictator games
using two different methods and evaluate which of the two is more appropriate for use in GCS.
Ultimatum game: In the ultimatum game, one player proposes some division of a dollar (or
some other amount) between himself (the proposer) and a receiver. The receiver can either
accept the proposal or reject it but if it is rejected, both get nothing. Although rational players
should have the proposer offering to give the receiver one cent and the receiver accepting (he
prefers something to nothing), that is not how real players play. Instead, proposers offer
between 40% and 50% of the money to the receiver and the receivers usually accept that. If a
receiver is offered less than about 40%, they often reject the offer (taking nothing instead of
something) and if the offer is less than 20%, they almost always reject (Slonim and Roth, 1998;
Camerer and Thaler, 1995; Roth et al. 1991).
There are two ways to implement the ultimatum game in GCS. First, we could focus on
the receiver's decision by describing the game, telling respondents the offer, and asking
whether they accept or reject it.
the respondents indicate the amount they want to accept. We tested both methods with our
GCS sample and found that the latter (i.e., open-ended numeric question) is a more appropriate
way of conducting this experiment in GCS.
In the first method, we wanted to avoid asking respondents to answer too many questions, but we
still wanted to know how respondents react to a variety of different proposals. We therefore
developed multiple versions of the question (with the proposal varying from generous to selfish)
and asked each version of a different set of respondents, with enough subjects per version to
test the effect of the different treatments. That is, we asked each of 10 questions of 100
different people, rather than asking all 10 questions of the same 1,000 people. This procedure
let us map out the pattern of responses to different "offers," which is the usual object of
interest in the ultimatum game.
We developed a brief set of instructions that is short enough for a respondent to understand and
used the "picture technique" to get around GCS' word constraint. Figure 4 below shows an example
of the instructions.
Figure 4: Instructions for the Ultimatum Game
In total, we developed 10 different proposals, ranging from $1 to $10.
Table 11: Pattern of Ultimatum Offers (GCS Internet Sample)

Offer ($)    % Accept
    1           38
    2           43
    3           33
    4           43
    5           39
    6           27
    7           30
    8           23
    9           35
   10           32
Based on the results shown in Table 11, we found that the pattern of responses to ultimatum
offers is not consistent with prior findings. First, contrary to the conventional wisdom that
receivers almost always accept offers above 50% of the money, only around 31% of the GCS sample
accepted such offers. Second, although established benchmarks indicate that respondents often
reject offers below 40% of the money and almost always reject offers below 20%, we found that a
substantial share of GCS respondents accepted such offers (around 33% to 43%). Therefore, this
method did not give us results consistent with prior research.
In the second method (N=500), instead of developing ten different versions of the question, we
asked one version of the question of all respondents and allowed them to indicate the smallest
amount they would be willing to accept. We slightly modified the question because the open-ended
numeric format does not allow us to insert a picture into the question. Figure 5 below shows the
results.
Figure 5: Results of the Ultimatum Game
Although the pattern of responses is still not consistent with prior work, the average amount
proved to be similar to that found in Slonim and Roth's (1998) game (both the GCS sample and
their sample had means of 4.45 to 4.5). Based on this, it seems that conducting economic games
in GCS may not give researchers the ability to map out the pattern of responses to different
"offers," which is the usual object of interest in such games. However, it yields a reliable
mean value, which is still valuable information for researchers.
Dictator game: Besides the ultimatum game, we also replicated the dictator game to evaluate the
usefulness of GCS for behavioral economics experiments. In the dictator game, the proposer
determines an allocation (or split) of some endowment, and the receiver has no choice but to
accept the allocation. Based on pure self-interest, the proposer should keep all the money,
giving nothing to the receiver. However, prior work suggests that proposers tend to transfer a
non-trivial sum of money to the receiver (see Roth 1995; Camerer 2003). As with the ultimatum
game, we tested both methods of conducting the dictator game in GCS and found the second method
to be more appropriate.
In the first method, we again developed 10 versions of the question using the "picture
technique," with offers ranging from $1 to $10 (see Figure 6 below for an example).
Figure 6: Instructions for the Dictator Game
We asked each version of the question to 100 respondents, for a total of 1,000. Respondents were
then given the options "yes," "no," and "don't know."
Table 12: Pattern of Dictator Giving (GCS Internet Sample)

Offer ($)    % of respondents who choose to give this amount
    1            44
    2            42
    3            32
    4            40
    5            41
    6            32
    7            32
    8            40
    9            30
   10            45

Reported results for GCS drop DK responses.
The results shown in Table 12 are not consistent with previously published findings. In
particular, the pattern found in the GCS sample provides no evidence that proposers tend to
transfer a non-trivial sum of money to the receiver, as the percentage of respondents willing to
transfer a given amount of the endowment does not vary dramatically across offers. As such, this
method of constructing the experiment in GCS does not yield accurate results.
Instead, we turn to the second method (similar to the second method used for the ultimatum
game), in which we used the open-ended numeric question format that allows respondents to choose
the amount of money they want to give away. Again, we slightly modified the question to fit GCS'
word limit. Figure 7 below shows the results.
Figure 7: Results of the Dictator Game
Similar to the results from the ultimatum game, we found that the pattern of giving in the GCS
sample does not replicate published findings about how much respondents tend to give away. In
fact, 22.9% of the GCS respondents chose to give away the entire endowment. However, the average
amount given away, $3.20, is comparable to the amount found in Ben-Ner et al.'s (2008) student
samples, where the average amount given was $3.08. This finding further underscores the
conclusion that GCS provides useful estimates of mean values (the amount accepted in the
ultimatum game, the amount given away in the dictator game) but not a reliable picture of the
full pattern of responses.
5.3 Discussion
Throughout this paper, we have discussed some of the key features that distinguish GCS from
other Internet-based surveys that rely on large opt-in panels. Specifically, we have pointed out
that the "survey wall" environment interrupts respondents' browsing activities with survey
questions and "coerces" them to provide some response before allowing them to proceed to their
target content. Given this feature, there are two concerns that could affect the quality of the
samples. First, when respondents are given questions that they perceive to be too difficult,
they can choose to bail out, either by requesting another question or by sharing the target
content on their social media pages. However, as we showed in Section 4, we have no reason to
believe that this problem significantly affects GCS samples, since respondents who choose to
bail out are not systematically different, in terms of inferred demographic characteristics,
from those who do not.
Second, even if respondents choose not to bail out and do participate in the survey, their
motivation to access the target content may keep them from reading and answering the questions
carefully. Since attention is an important element of receiving the treatment in many survey
experiments, inattentive (or unmotivated) respondents can weaken the correlations across related
questions and diminish experimental treatment effects (Berinsky et al., 2014). Consequently, the
treatment effects in our replications of prior experiments may be attenuated.
We found this particular problem in the two framing experiments that we replicated. In the
question-wording experiment, we found a 23 percentage point gap between the two conditions,
while the comparable gap was 38 points in the 2012 MTurk sample and 44 points in the 2010 GSS.
Similar results appear in the Asian Disease experiment, where our gap between the two conditions
in GCS is only 25 points, compared with 36 points in the 2012 MTurk sample and 45 points in
Druckman's 2001 student sample, among others. Such attenuation of treatment effects, we believe,
is caused by respondents' lack of motivation to read and answer the survey questions. Although
respondents are not necessarily highly motivated in other survey environments (i.e., laboratory
settings or web-based opt-in panels [12]), in those settings they have at least agreed in
advance to participate. Given that respondents in GCS are interrupted and "coerced" into
participating, greater inattentiveness is more likely in this particular environment.

[12] Prior scholars have found that respondents in opt-in surveys and experiments often display satisficing behavior (Krosnick 1991), offer careless and haphazard survey responses (Huang et al. 2012; Meade and Craig 2012), and do not bother to closely read questions or instructions (Oppenheimer et al. 2009).
From the three types of studies replicated above, we have some evidence that GCS can be a useful
platform for conducting survey experiments. The comparable results in the list experiments and
the behavioral economics games suggest that GCS samples behave rather similarly to those found
in published research. However, we also encountered some potential drawbacks. In the framing
experiments, the treatment effects in the GCS sample diminished considerably, even though the
directions of the effects remained as expected. On balance, we still believe that the utility of
GCS for conducting survey experiments outweighs its drawbacks.
6. Conclusion
GCS is a low-cost survey platform that may hold promise for social scientists exploring causal
hypotheses. Specifically, in this paper we investigate its value to researchers interested in
conducting survey experiments. Our focus on its usefulness for survey experimentation, rather
than as a more general survey platform, stems from the fact that the real cost advantage of GCS
over other alternatives is limited to surveys of one or a very small number of questions.
Observational studies that require statistical models including a long (or even small) list of
control variables can likely be accomplished as efficiently using other survey tools. Further,
while previous work suggests GCS obtains reasonably representative samples, there is little
compelling evidence that its samples are sufficiently better than its competitors' to attract
social scientists' attention.
In contrast, survey experimental work has three features that make GCS a potentially attractive
option for social scientists pursuing this kind of research. First, if these experiments can be
implemented in one question, then the considerable cost advantages of GCS over its competitors
can be realized. Second, if true randomization of treatments can be achieved (and we argue that
GCS' methodology does so, and we demonstrate the resulting balance in a set of measured
covariates), then no additional control variables are necessary to make strong inferences.
Finally, given that most experimenters are much more concerned with the internal validity of
their inferences than with their external validity, unresolved questions about the
representativeness of the GCS sample may be of secondary concern; certainly the sample is better
than that of most lab experiments.