Survey Experiments with Google Consumer Surveys: Promise and Pitfalls for Academic Research in Social Science

1. Introduction

In this paper, we examine the usefulness of Google Consumer Surveys (GCS) as a platform for conducting survey experiments in social science. GCS is a relative newcomer to the rapidly expanding market for inexpensive (and quick) online surveying, and its methodology differs rather dramatically from most of its competitors, who rely on large opt-in Internet panels. Below, we first review this methodology and point out that its particular strengths and weaknesses lend themselves well to a focus on survey experiments that face a high bar of internal and external validity.

Our paper proceeds in the following manner. In section two, we introduce and discuss how GCS operates in the web environment, its cost structure, strengths, and limitations. In section three, we lay out a framework to evaluate the threats to external and internal validity that result from the particular characteristics of GCS respondents. Next, we discuss the issues of response rate, experimental balance, and sample representativeness in section four, and we show the results of a series of experiments that we replicated on the GCS platform in section five. We conclude with a discussion of our findings in section six, identifying the most appropriate uses of GCS for social science research.

2. An Introduction to Google Consumer Surveys

GCS is a new tool developed by Google that surveys a sample of Internet users as they attempt to view content on the websites of online publishers who are part of the program. These sites require users to complete a survey created by researchers in order to access premium content provided by the publishers. This "survey wall" is similar to the pay wall that is often used to gate access to premium content, but instead of having to pay or subscribe, consumers can now access the content by answering a short survey. Currently, GCS is available in four countries (the United States, Canada, the United Kingdom, and Australia), and it takes about 48 hours to field a survey of any size.

GCS involves three different groups of users: publishers, consumers, and researchers. Researchers create the surveys and pay GCS to have consumers answer questions; consumers encounter the surveys when attempting to access publisher content; and publishers are paid by GCS to have surveys administered on their sites. According to a 2012 study by the Pew Research Center (http://www.people-press.org/2012/11/07/a-comparison-of-results-from-surveys-by-the-pew-research-center-and-google-consumer-surveys/), there were approximately 80 sites in GCS' network and an additional 33 sites in testing during this period. These sites include small and large publishers such as the New York Daily News, the Christian Science Monitor, Reader's Digest, and the Texas Tribune, as well as YouTube and Pandora.

Given the "survey wall" environment, it is essential that the surveys be short. Likewise, as we show below, the current cost structure of GCS makes very short surveys (even as short as one question) the most economically attractive. This has important implications for how GCS can best be used by social scientists. Specifically, it is unlikely that social scientists interested in making a solid causal inference will find GCS a better vehicle than traditional methods of doing so in observational studies. Such designs usually require researchers to identify and control for a large set of potential confounding variables, but GCS makes measuring such confounders prohibitively costly.
As we discuss below, however, other designs for causal inference – relying on survey experiments – are better suited to take advantage of the features and economies of GCS.

Another implication of the short length of the surveys is that, unlike other Internet-based surveys that rely on large opt-in panels, researchers using GCS cannot usually collect extensive demographic and background information on respondents in the survey itself. Since the respondents are not part of an online panel, GCS has not collected this information at an earlier date (as is the case in many panels). Instead, GCS provides researchers with a set of respondents' "inferred demographics" (location, income, age, gender) derived from their browsing histories and IP addresses. (Gender and age group are inferred from the types of pages users have visited in the Google Display Network, using the DoubleClick cookie, a unique identifier that Google uses to serve more relevant ads by recording users' most recent searches, previous interactions with advertisers' ads, and visits to advertisers' websites. The IP address is a number assigned to any device that connects to the Internet; from it, the user's nearest city, and in turn income and urban density, can be determined.) This information is then used in the sampling process as depicted in Figure 1 below.

Figure 1: Google Consumer Surveys "Survey Wall" Methodology. [Diagram: an individual browsing the Internet reaches a survey wall on the site of an Internet content provider that has contracted with Google and must answer a survey to obtain the target content. Using the demographics implied by browsing history and IP address (age, income, gender, location), GCS performs real-time stratified sampling, matching individuals to surveys from the pool of all active surveys so as to target the age group, gender, and location shares of the Current Population Survey.]

2.1 Process of GCS

Figure 1 shows that when individuals browse the Internet and wish to obtain access to premium content, they have to answer a survey taken from a pool of currently active surveys (i.e., the "survey wall" page). The specific survey the individual is presented with, however, is not selected randomly. Instead, GCS uses a sampling scheme designed to ensure that each active survey is on track to have proportions of respondents by age, gender, and location (region, state) that match the Current Population Survey (CPS). To achieve this, GCS infers the demographic characteristics of respondents from their browsing history and IP address and uses this information to create a sample of Internet users that resembles the Internet-using population in the most recent CPS Internet Users Supplement. Importantly, this adjustment is done in real time, so that if a given survey appears to have, for example, too few female respondents, more females (according to the implied demographics) will be assigned to that survey. Finally, since this process does not always produce proportions of respondents that match the CPS, post-stratification weights based on the CPS Internet Users Supplement are calculated and provided to GCS users.
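To give a concrete sense of what such post-stratification weighting does, the sketch below reweights demographic cells to known population shares. This is only an illustration of the general technique: the cell definitions, shares, and outcome values are invented, and GCS' actual weighting cells and CPS targets are not published in this form.

# Illustration of post-stratification weighting. All numbers are
# hypothetical; GCS derives its actual weights from the CPS Internet
# Users Supplement, not from these made-up cells.
# Each cell: (population share, realized sample share, sample mean of an outcome)
cells = {
    "male_18_34":    (0.15, 0.25, 0.40),
    "male_35plus":   (0.33, 0.30, 0.55),
    "female_18_34":  (0.16, 0.20, 0.45),
    "female_35plus": (0.36, 0.25, 0.60),
}

# Shares in each column sum to 1, so weighted sums are weighted means.
unweighted = sum(s * m for _, s, m in cells.values())
weighted = sum(p * m for p, _, m in cells.values())

for name, (p, s, _) in cells.items():
    print(f"{name}: post-stratification weight = {p / s:.2f}")
print(f"unweighted estimate: {unweighted:.3f}; weighted estimate: {weighted:.3f}")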
2.2 Cost Structure

One of the main advantages of GCS for social scientists is its low cost. Given the current cost structure, however, these advantages are felt more keenly for some types of surveys than for others (i.e., the survey experiment). Specifically, GCS offers two survey formats, each with a different cost structure. The first is a micro-survey in which respondents answer one or two questions. The cost is 10¢ per response for a general (untargeted) audience and 50¢ per response for a targeted audience filtered by a screening question. The second option is a full survey in which respondents answer up to 10 questions; for a general untargeted audience, the cost is based on survey length (see Figure 2 below). Importantly, the marginal cost per response of an additional question is constant at $0.30 from two to 10 questions, but moving from one question to two costs an additional $1.00 per response.

Figure 2: GCS Cost Structure

Based on this cost structure, it is clear that conducting multi-question surveys using GCS is less cost-effective than single-question surveys. As such, relying on survey experiments rather than a long list of control variables may be a better strategy for researchers who wish to use GCS to make causal inferences.

2.3 Limitations of GCS

Google is keenly aware of the imposition the survey wall places on potential GCS respondents. Specifically, it interrupts their browsing activities with survey questions and coerces them to answer these questions before proceeding to the target content. This is significantly different from traditional opt-in panels, where respondents have agreed in advance to participate. With this in mind, Google has concluded that it is essential for survey questions not to be too long, too complicated, or controversial in nature. As a result, researchers need to construct their surveys in accordance with these guidelines. Below we briefly describe the restrictions that GCS imposes and suggest some of the strategies we have adopted to work within them (a simple pre-flight check against these limits is sketched after the list). The limitations are:

a) Number of response categories: GCS tries to ensure the survey is simple by restricting answers to a maximum of five response categories, plus an additional open-ended option, i.e., an "Other (please specify)" text field.

b) Number of characters/words: To keep questions short, GCS recommends a question length of 125 characters and sets a maximum limit of 175 characters.

c) Censored questions and populations: GCS places restrictions on sensitive demographic information by prohibiting researchers from asking respondents for their age, gender, ethnicity, religion, or immigration status. Researchers can only ask these questions if the response choices include "I prefer not to say" as an opt-out answer. Also, questions may not directly or indirectly target respondents under 18 years old or include any age category under 18 in the answer choices.

d) Don't know and "opt out" option: Besides regulating the content and the target respondents of a question, GCS also requires an option for respondents not to answer the question, either by clicking an "I don't know, show me another question" link or by sharing the premium content they intend to read on their social media pages (see Figure 3 for an example). Essentially, this ensures that all respondents have some path that allows them to skip a given question yet still (ultimately) access their desired content. We will return to this feature in the discussion of internal validity associated with using GCS below.
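Because several of these limits are hard caps, it can save fielding time to validate a draft question before submission. Below is a minimal sketch of such a pre-flight check; the helper function is our own illustration (GCS provides no such API), with the limits taken from the guidelines above.

# Pre-flight check of a draft question against the GCS limits described
# above. The function is our own convenience sketch, not a Google API.
RECOMMENDED_CHARS = 125  # recommended maximum question length
MAX_CHARS = 175          # hard cap on question length
MAX_OPTIONS = 5          # closed response categories (an open-ended "Other" may be added)

def check_question(text, options):
    problems = []
    if len(text) > MAX_CHARS:
        problems.append(f"question is {len(text)} chars (hard cap {MAX_CHARS})")
    elif len(text) > RECOMMENDED_CHARS:
        problems.append(f"question is {len(text)} chars (recommended {RECOMMENDED_CHARS})")
    if len(options) > MAX_OPTIONS:
        problems.append(f"{len(options)} response options (maximum {MAX_OPTIONS})")
    return problems

issues = check_question(
    "Do you think we are spending too much, too little, or about the right amount on welfare?",
    ["Too much", "Too little", "About the right amount", "I don't know"],
)
print(issues if issues else "question fits within the GCS limits")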
Figure 3: Example of a Survey Wall

2.4 Working with the Word Constraint

One of the constraints on using GCS for our proposed survey experiments is the word limit on questions and their response sets. For many kinds of survey experiments of interest to social scientists, these word limits will be too constraining. Indeed, for many of the canonical experiments we conduct below, we could not faithfully implement the experiments within these word limits (even after a great deal of work streamlining the wording). Thus, we devised a way to work around this limitation using the "picture technique." Specifically, we capture the necessary text in a picture that is then inserted into the question as a screen image. Below is an example of doing this for the response set in a list experiment. Similar pictures can also be used for longer questions or textual vignettes, as is necessary in many survey experiments.

Now I am going to read you three/four things that sometimes people oppose or are against. After I read all three/four, just tell me HOW MANY of them you oppose. I don't want to know which ones, just HOW MANY.

The preamble and question fit within the word limit, but the listed items do not, so we insert them as a graphic in the question itself.

2.5 Randomizing the Treatments

GCS does not provide a specific facility for randomly assigning different versions of a question to different respondents. Such assignment, however, can be achieved effectively by simply submitting multiple versions of the same question to the pool of active questions. For example, if a researcher wants 1,000 respondents randomly assigned to one of two questions, he or she can submit two one-question surveys and request 500 respondents for each. Since respondents are matched with questions from the pool of active questions randomly (at least within the limits of the dynamic stratification described in Figure 1), there is no reason to think that there is anything systematic about the matching process that would break random assignment of questions (at least prior to the respondent's decision whether to opt out).
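Because each arm of such an experiment arrives as its own one-question survey, the analysis reduces to comparing the two downloaded samples directly. The sketch below computes a difference in proportions with a conventional standard error; the counts are hypothetical placeholders for the per-survey results a researcher would download from GCS.

# Analyzing a two-survey GCS experiment as a difference in proportions.
# Counts are hypothetical placeholders for the two downloaded surveys.
from math import sqrt

def diff_in_proportions(hits_c, n_c, hits_t, n_t):
    p_c, p_t = hits_c / n_c, hits_t / n_t
    # Standard error of the difference between two independent proportions.
    se = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    return p_t - p_c, se

effect, se = diff_in_proportions(hits_c=260, n_c=500, hits_t=380, n_t=500)
print(f"estimated treatment effect: {effect:.3f} (95% CI: ±{1.96 * se:.3f})")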
3. Assessing the Validity of Conducting Survey Experiments Using GCS

In spite of the restrictions discussed above, GCS provides researchers with access to inexpensive samples of respondents and, as we have argued, can be used most efficiently to conduct survey experiments (often with only one question). But are these samples good enough to play a role in rigorous social scientific research?

There are generally two threats to validity in experimental research: external and internal. Threats to external validity pertain to whether the samples in the experiments are representative of some broader population (so that inferences can be usefully generalized to that population), while threats to internal validity concern whether a causal inference is valid even for the sample itself (i.e., does a given causal inference truly reflect the effect of the experimental manipulation?).

With regard to conducting experimental research using GCS, two questions about the external validity of the sample are pertinent. First, are the samples of individuals who actually get the opportunity to answer a given question representative of the general Internet user population? Second, do patterns of non-response (opting out or answering "Don't Know") create significant non-response bias?

To address these questions, we first review previous research that has compared the characteristics of GCS samples (relying on the inferred demographics of those samples) to other survey samples used in social science research. Next, we evaluate the accuracy of Google's inferred demographics, and finally, we examine whether there are significant differences between respondents who opt in and out of GCS, including those who give a "don't know" response.

Turning to the issue of internal validity, our primary concern is whether we can make valid causal inferences from experiments conducted in GCS. To answer this question, we conducted two tests. First is a balance test, which asks whether the treatment and control groups are balanced on all measured variables and serves as a test of whether randomization of treatments was achieved (in which case one would have, with sufficient numbers of respondents, balance on all measured and unmeasured variables that might be related to the treatment and outcome of interest). Second is a replication test of well-established social science experiments. This allows us to determine whether we can replicate the experimental (treatment) effects from prior studies in GCS and also to examine whether GCS subjects behave differently from subjects in other experimental settings.

4. External Validity

We first assessed the external validity of using GCS by reviewing prior work that has compared the characteristics of GCS samples to the characteristics of samples drawn using other survey methods to study comparable research questions. McDonald et al. (2012) compare the responses of a probability-based Internet panel, a non-probability-based Internet panel, and a GCS sample against a set of media consumption benchmarks (video on demand, digital video recorder, and satellite dish usage in American households, as measured by a semi-annual RDD telephone survey of 200,000 respondents). After matching the results against the benchmarks, they found that the results from GCS were more accurate than both the probability and non-probability panels in terms of average absolute error, largest absolute error, and percent of responses within 3.5% of the benchmark. For instance, the average absolute error of GCS was 3.25%, compared to 4.7% for the probability sample and 5.87% for the non-probability sample.

Researchers at the Pew Research Center (2012) also conducted a (pseudo) mode study in 2012 in which they compared the results for more than 40 questions asked of a sample of telephone respondents with those asked (over the same time period) of GCS respondents. (We say "pseudo mode study" because respondents were not randomly assigned to a mode but were sampled separately in what was actually a dual-frame design; thus, some of the differences reported may well come from the different sampling frames rather than from mode differences.) The types of questions asked include demographic characteristics, use of technology, political attitudes and behavior, domestic and foreign policy, and civic engagement. The reported distributions of responses for most of these questions did not differ dramatically across modes. Pew reports: "[A]cross several political measures, the results from the Pew Research Center and using Google Consumer Surveys were broadly similar, though some larger differences were observed" (2012).

4.1 Evaluating GCS' Inferred Demographics

Next, we evaluated the accuracy of GCS' inferred demographics by replicating the study conducted by Pew in 2012 on two demographic variables (gender and age) and adding a third variable, income, which Pew did not report. We asked respondents about their demographics (gender, age, income) and then compared their answers to the inferred data provided by GCS. Tables 1-3 below report the comparison between the inferred information and the responses given by the respondents.
Table 1: Inferred vs. Reported Gender (cell entries are row percentages of unweighted data; columns are the sex given in response)

Pew Study 2012 (N=1,056):

Inferred sex | Male | Female
Male | 79 | 21
Female | 28 | 72
Unknown | 58 | 42

Our Study 2013 (N=1,000):

Inferred sex | Male | Female | Prefer not to answer
Male | 60 | 13 | 27
Female | 16 | 59 | 25
Unknown | 37 | 26 | 37

For gender (see Table 1), Pew reported that the inferred gender matched the survey response for about 75% of respondents. However, Pew did not include "I prefer not to say" as an opt-out answer in their response categories, as Google suggests when asking respondents for sensitive demographic information (see Section 2.3). When we include this option, the match rate decreases to only approximately 60%. Among those whose gender GCS did not infer, Pew reported that 58% said they were male and 42% female, while we found that 37% were male, 26% female, and 37% preferred not to answer.

In terms of age (see Table 2), Pew found that, across the two samples they surveyed in 2012, an average of 44% to 46% of respondents reported an age in the same category as their inferred age. Again, the opt-out choice, "I prefer not to say," was not included in their response categories. Unlike gender, however, the share of GCS respondents choosing the same age group as their inferred age did not change when the opt-out option was included in the response categories: 47% of respondents in our study chose the same age group as their inferred age. (GCS limits the number of response categories to five but infers age in six categories. Pew built two separate samples with different age categories and collapsed the inferred age categories to match each; we provided four age categories, with "Not applicable / Prefer not to answer" as the fifth option, and collapsed the inferred ages to match our response options.) Furthermore, both the Pew studies and ours showed that the largest incongruity between self-reported and inferred age occurred for middle-aged respondents (i.e., 35-44), of whom only about 23%-43% had their age correctly inferred.

Table 2: Inferred vs. Reported Age (cell entries are row percentages of unweighted data; columns are the age given in response)

Pew Study 2012 (N=1,009):

Inferred age | 18-24 | 25-34 | 35-44 | 45-64 | 65+
18-24 | 65 | 15 | 10 | 6 | 4
25-34 | 23 | 39 | 20 | 15 | 4
35-44 | 12 | 25 | 23 | 34 | 7
45-64 | 5 | 11 | 16 | 52 | 16
65+ | 5 | 5 | 12 | 26 | 52
Unknown | 22 | 18 | 15 | 33 | 12

Pew Study 2012 (N=1,004):

Inferred age | 18-24 | 25-34 | 35-44 | 45-54 | 55+
18-24 | 63 | 11 | 10 | 4 | 11
25-34 | 21 | 43 | 17 | 7 | 12
35-44 | 14 | 24 | 30 | 15 | 16
45-54 | 10 | 10 | 22 | 33 | 25
55+ | 11 | 8 | 9 | 18 | 54
Unknown | 21 | 17 | 16 | 21 | 26

Our Study 2013 (N=1,003):

Inferred age | 18-24 | 25-34 | 35-54 | 55+ | Not applicable / Prefer not to answer
18-24 | 48 | 11 | 18 | 5 | 18
25-34 | 17 | 31 | 27 | 4 | 21
35-54 | 5 | 12 | 54 | 15 | 14
55+ | 3 | 3 | 19 | 56 | 19
Unknown | 10 | 11 | 28 | 27 | 25
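The match rates quoted above are, in effect, the diagonals of these cross-tabulations. The sketch below reproduces the calculation for the first Pew panel in Table 2 by averaging the diagonal row percentages; note that this simple average implicitly assumes roughly equal row sizes, which the published table does not report.

# Average inferred-vs-reported match rate from the first Pew panel of
# Table 2 (rows: inferred age; values: % giving each reported age).
rows = {
    "18-24": [65, 15, 10, 6, 4],
    "25-34": [23, 39, 20, 15, 4],
    "35-44": [12, 25, 23, 34, 7],
    "45-64": [5, 11, 16, 52, 16],
    "65+":   [5, 5, 12, 26, 52],
}

# The i-th value of the i-th row is the share whose report matched.
diagonal = [vals[i] for i, vals in enumerate(rows.values())]
print(f"per-category match rates: {diagonal}")                             # [65, 39, 23, 52, 52]
print(f"simple average match rate: {sum(diagonal) / len(diagonal):.1f}%")  # 46.2%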
Besides replicating the studies done by Pew, we also assessed the accuracy of GCS' inferred income by asking respondents to report their income category (see Table 3). As in our assessment of age, we collapsed the inferred income into five categories to accommodate the number of response categories that GCS allows (we collapsed the income groups $100,000-$149,999 and $150,000+ into one category). The results, however, were not as accurate as those observed for gender and age. First, we found that, on average, only 23.5% of respondents chose an income category matching the one GCS inferred. Second, contradictions between what respondents reported and what GCS inferred were prevalent across all self-reported income categories, not just in specific ones.

Table 3: Inferred vs. Reported Income (cell entries are row percentages of unweighted data; columns are the income given in response; Our Study 2013, N=1,006)

Inferred income | $0-$24,999 | $25,000-$49,999 | $50,000-$74,999 | $75,000-$99,999 | $100,000+
$0-$24,999 | 34 | 24 | 17 | 13 | 12
$25,000-$49,999 | 35 | 21 | 19 | 8 | 17
$50,000-$74,999 | 32 | 13 | 17 | 15 | 24
$75,000-$99,999 | 32 | 11 | 13 | 13 | 31
$100,000+ | 50 | 8 | 0 | 17 | 25
Unknown | 33 | 22 | 22 | 0 | 23

Based on the findings that 74% of respondents' gender, 40% of respondents' age, and 20% of respondents' income were correctly inferred, we conclude that GCS' inferred demographics contain substantial errors in classifying the characteristics of individual respondents, especially for age and income. This level of accuracy may be good enough for basic or "do-it-yourself" survey research, but perhaps not for professional or academic researchers who are particularly interested in segmentation analysis of the population.

4.2 Evaluating the Non-Response Rate Issue

Besides the accuracy of inferred demographics, we are also concerned that the non-response rate and/or "Don't Know" (DK) responses may pose threats to the external validity of GCS. Specifically, the survey wall confronts respondents who are motivated to reach their preferred content, which may increase the incentive to "bail out" (by choosing "Don't Know" from the response categories if provided, clicking the "I don't know, show me another question" link, or sharing the premium content on their social media pages). The important question, then, is whether respondents who choose to bail out are systematically different from those who do not.

The 2012 Pew study did not speak to this issue; moreover, many of their questions did not include a DK response or report the non-response rate. Where Pew did include DK options (i.e., on questions about the Supreme Court and the presidential debates), they found higher rates of DKs than in other survey modes. To examine this issue in greater detail, we replicated a question on Obama approval (for which a DK option was not included in the Pew studies) with a similar sample size and included the DK option in our response categories. By doing so, we could compare the demographics (i.e., gender and age) of respondents and non-respondents (those who chose to bail out) in both versions of the question.
Table 4: Demographics of Respondents and Non-Respondents, Obama Approval (entries are the percentage of each response-status group falling in each demographic category)

GCS sample (Pew, 2012; DK option not provided):

Group | % Response | % Non-Response
Male | 46.9 | 52.1
Female | 53.1 | 47.9
18-24 | 11.7 | 15.9
25-34 | 24.3 | 23.6
35-44 | 16.7 | 20.4
45-54 | 23.1 | 12.7
55-64 | 14.0 | 15.3
65+ | 10.1 | 12.1

GCS sample (2013):

Group | % Answer | % DK | % Total Response | % Non-Response
Male | 47.5 | 41.7 | 46.2 | 54.6
Female | 52.5 | 58.3 | 53.8 | 45.4
18-24 | 2.7 | 8.6 | 4.1 | 9.0
25-34 | 13.5 | 17.7 | 14.7 | 18.0
35-44 | 13.7 | 20.9 | 16.1 | 14.5
45-54 | 23.8 | 19.9 | 23.0 | 19.5
55-64 | 31.9 | 23.7 | 28.5 | 19.9
65+ | 14.5 | 9.1 | 13.5 | 19.0

Note: Non-response refers to respondents who chose the link to get another question or the link to their social media pages.

Based on the comparison between our study and Pew's (see Table 4), we found that male respondents were more likely than women to be non-responsive to the surveys. This pattern, which has been observed in other studies (Ferber 1966; Riphahn and Serfling 2003), may require post-stratification weighting to correct for possible over-representation of certain demographic groups in the sample (Gelman and Carlin 2001; Pike 2007). On the other hand, we found no evidence that age is a factor in determining one's propensity to respond, as respondents across all age categories made up similar proportions of the non-responsive sample.

With regard to DK responses, we found no significant gender differences between those who answered the question and those who chose DK. Although women had a higher propensity than men to answer DK, the gender balance was rather similar between those who answered (47.5% male, 52.5% female) and those who chose don't know (41.7% male, 58.3% female). On the other hand, we observed some differences in age: the share of younger respondents (i.e., 18-34 years old) was higher among those who chose don't know (26.3%) than among those who chose an offered response (16.2%). Conversely, the share of older respondents (i.e., 55+ years old) was higher in the group that answered (46.4%) than in the group that chose DK (32.8%).

Based on the analyses above, we conclude that although there are ways to correct some sampling error, it is best to be cautious in treating the GCS sample as representative of the general Internet population. First, the inferred demographics are not consistently accurate about the demographic characteristics of respondents. Second, although there are no significant differences between respondents who chose to answer the survey wall and those who bailed out, we still did not obtain perfect balance in the gender and age compositions of the two groups. Thus, a GCS sample may yield results that are inaccurate for the larger population.

5. Internal Validity

Turning to the issue of internal validity, our primary concern is to assess whether we are able to make causal inferences from GCS survey data. There are two ways to do this. First, we can conduct observational studies, in which we draw inferences by observing subjects and measuring variables of interest without assigning treatments: the treatment each subject receives occurs without any manipulation by the researchers, so researchers have to control for potential confounders in order to isolate the effect of treatments on the subjects. Second, we can administer survey experiments, in which treatments are assigned to subjects using a chance mechanism.
This randomization of treatments ensures that treatment and control groups are balanced on all measured and unmeasured variables that might otherwise be related to the treatment and outcome of interest. As a result, one can make a strong causal inference from differences in responses across treatments without needing to control for a large number of confounding variables. Given the features and economies of GCS, where the number of questions each respondent answers is limited, it is more advantageous for social scientists to rely on survey experiments to establish solid causal inferences when using GCS.

So how does one evaluate the usefulness of GCS for social science survey experimentation? There are two ways to do this. First, we test whether balance on demographic characteristics is achieved across the treatment and control groups, and whether there is balance among those respondents who choose to opt in and out of GCS. Second, we replicate results of well-established social science experiments.

Mutz (2011) reviewed many different ways population-based survey experiments can be designed to make causal inferences. For example, she listed several kinds of survey experiments, such as list experiments and framing experiments, which are often used in social science and marketing. To these we add experiments originating in the tradition of behavioral economics that put the respondent in a "game," such as the ultimatum and dictator games. There are challenges in implementing each of these types of experiments on the GCS platform, but we suggest and test several strategies for doing so below. In general, the way we implement a survey experiment in GCS is to build a separate survey for each treatment and for the control, with the surveys differing in question wording and/or response categories in a way that captures the manipulation one wishes to make.

5.1 Balance Test

GCS selects and balances its sample of respondents in real time. That is, as subjects opt in and out of the survey wall, GCS adjusts the sample of respondents on inferred demographics such as age, gender, and income (see Figure 1). Based on this procedure, one might expect a balanced sample across treatment groups on at least these inferred demographics. However, there are at least two issues that could potentially undermine a balanced design.

First, we are concerned that the inferred demographics are noisy measures of respondents' true characteristics (i.e., not externally valid) and are often unavailable for many respondents. As such, the control and treatment groups may not be balanced on these variables for the purpose of external generalization.

Second, we are worried that the respondents who choose to opt out (by choosing another question or taking an alternative route to avoid answering) may not be balanced on the inferred demographics across the control and treatment groups. Random assignment of large samples might assure balance on these inferred demographic variables, but as we have already pointed out, randomization in GCS is not perfect: respondents have some choice about which questions they answer. Once they see a question, they can choose not to answer it and instead see another question, or take some other action (such as sharing the page on Facebook) that lets them move on to their desired content without answering the question.
This opens the possibility, for example, that female respondents who are less interested in politics than males systematically opt out of answering political questions. One could also imagine this pattern of opting out being more pervasive in the treatment group than in the control group.

We ran four experiments, each with control and treatment groups, to test whether balance on GCS' inferred demographics (i.e., gender, age, and income) is achieved and whether respondents who opt out are balanced on these characteristics. Two of the experiments are versions of framing experiments (Mutz 2011), while the other two are examples of list experiments. As shown in Table 5, running these experiments using GCS was relatively quick and inexpensive: each experiment took two days to collect roughly 2,000 responses at a cost of $200, with an average response rate of 26.3%.

Table 5: Descriptive Statistics

Experiment | Invitations | Responses | Response Rate (%) | Cost | Time to Results (Days)
Question Wording, Control | 3,414 | 1,002 | 29.3 | $100 | 2
Question Wording, Treatment | 3,380 | 1,001 | 29.6 | $100 | 2
Asian Disease, Control | 3,919 | 1,004 | 25.6 | $100 | 2
Asian Disease, Treatment | 3,949 | 1,001 | 25.3 | $100 | 2
List Experiment (Female President), Control | 3,925 | 1,001 | 25.6 | $100 | 2
List Experiment (Female President), Treatment | 3,886 | 1,005 | 25.9 | $100 | 2
List Experiment (Immigration), Control | 4,217 | 1,004 | 23.8 | $100 | 2
List Experiment (Immigration), Treatment | 3,986 | 1,002 | 25.1 | $100 | 2

In spite of the fact that the inferred demographics are noisy and often unavailable for many respondents, we found that GCS managed to produce balance on the inferred demographics across the control and treatment groups (see Table 6). Starting with gender, both the control and treatment groups have more males than females (roughly 55% male vs. 35% female) and more respondents in the $25-49K income category than in any other. As Table 6 shows, more than half of the respondents in the sample fall in the $25-49K column, while fewer than 10% were inferred to have incomes above $75K. On age, the proportions in each category were almost equal across the control and treatment groups: each age category made up about 12-18% of the sample on average, and no single age category contributed more than 20% of the respondents.

Table 6: Experiments' Sample (percentages of inferred demographics; variables with unknown values are dropped)

Group | Female | Male | 18-24 | 25-34 | 35-44 | 45-54 | 55-64 | 65+ | $0-24K | $25-49K | $50-74K | $75-99K | $100-149K | Median Response Time (s)

Question Wording Experiment
Control | 36 | 56 | 14 | 16 | 14 | 15 | 20 | 11 | 8 | 61 | 25 | 5 | 2 | 11.5
Treatment | 36 | 55 | 16 | 15 | 14 | 15 | 19 | 12 | 7 | 59 | 27 | 6 | 1 | 10.2

Framing (Asian Disease) Experiment
Control | 35 | 56 | 17 | 15 | 14 | 14 | 18 | 11 | 6 | 58 | 28 | 5 | 2 | 24.8
Treatment | 35 | 56 | 17 | 14 | 14 | 15 | 18 | 11 | 6 | 61 | 26 | 6 | 1 | 25.1

List Experiment (Female President)
Control | 38 | 54 | 15 | 18 | 13 | 17 | 15 | 11 | 6 | 55 | 32 | 5 | 0 | 20.5
Treatment | 37 | 55 | 17 | 15 | 13 | 17 | 16 | 11 | 7 | 58 | 28 | 6 | 0 | 21.2

List Experiment (Immigration)
Control | 32 | 52 | 12 | 13 | 12 | 15 | 16 | 12 | 7 | 57 | 29 | 6 | 1 | 19.7
Treatment | 31 | 53 | 14 | 14 | 13 | 16 | 16 | 10 | 7 | 58 | 27 | 6 | 1 | 21.8
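A balance check of this kind is easy to automate. The sketch below runs a chi-square test on the inferred-gender split of the question-wording experiment; the counts are approximated by applying Table 6's percentages to roughly 1,000 respondents per arm, so the exact figures are illustrative rather than the raw data.

# Chi-square balance check on inferred gender, control vs. treatment.
# Counts approximate Table 6's question-wording experiment (percentages
# applied to ~1,000 respondents per arm).
from scipy.stats import chi2_contingency

#            female, male
control   = [360, 560]
treatment = [360, 550]

chi2, p, dof, expected = chi2_contingency([control, treatment])
print(f"chi-square = {chi2:.3f}, df = {dof}, p = {p:.3f}")  # a large p is consistent with balance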
We next compared the rate at which individuals opt out with the rate at which they respond, and determined whether balance was also achieved between the control and treatment groups within the samples that chose to opt out. Based on the results shown in Table 7, we found that the distributions of inferred demographics in the control and treatment groups were quite similar for both the survey participants (see Table 6) and the non-participants. Specifically, among those who chose to opt out there were more males than females (56% male, 44% female) and relatively equal proportions of each age category (around 12-18%). These distributions resemble those of the respondents who chose to answer the survey, indicating no significant differences in inferred demographics between survey respondents and non-respondents.

Besides this similarity across the two sets of samples, GCS also produced a balanced sample on the inferred demographics for the control and treatment groups within the sample that chose to opt out of the survey wall, as shown by the similarity of the distributions of inferred gender, age, and region across the control and treatment groups. Even though randomization in GCS is not perfect, because respondents are given the choice to opt out, we still obtained balance on the inferred demographics across groups.

Table 7: Characteristics of Respondents Who Opt Out (percentages; figures may not add to 100% because of rounding)

Group | Female | Male | 18-24 | 25-34 | 35-44 | 45-54 | 55-64 | 65+ | Midwest | Northeast | South | West

Question Wording Experiment
Control | 43 | 57 | 18 | 17 | 15 | 18 | 17 | 15 | 22 | 21 | 34 | 24
Treatment | 42 | 58 | 18 | 17 | 15 | 18 | 17 | 15 | 22 | 20 | 34 | 24

Framing (Asian Disease) Experiment
Control | 43 | 57 | 17 | 17 | 16 | 17 | 19 | 14 | 21 | 19 | 36 | 25
Treatment | 42 | 58 | 17 | 18 | 16 | 16 | 19 | 15 | 20 | 20 | 35 | 25

List Experiment (Female President)
Control | 43 | 57 | 17 | 17 | 16 | 15 | 21 | 15 | 21 | 21 | 33 | 27
Treatment | 45 | 55 | 16 | 18 | 15 | 17 | 19 | 15 | 21 | 21 | 32 | 27

List Experiment (Immigration)
Control | 45 | 55 | 16 | 18 | 16 | 16 | 20 | 15 | 20 | 21 | 34 | 26
Treatment | 47 | 53 | 15 | 18 | 15 | 19 | 18 | 17 | 21 | 21 | 33 | 27

5.2 Replication of Prior Social Science Experiments

To further assess the usefulness of GCS as a platform for experimental research, we replicated the results of three types of well-established social science experiments: first, the classic framing experiments that had been run on MTurk by Berinsky et al. (2012); second, a series of list experiments; and third, behavioral economics games without payment (a slight departure from tradition, since such games have usually been run with payment). For the framing experiments, we found that the manipulation effects were less pronounced in GCS than in face-to-face, student, or other Internet-based samples. For the list experiments and economic games, however, the experimental results using GCS are quite similar to those reported using other research methodologies (i.e., MTurk and the GSS).

5.2.1 Framing Experiments

Question Wording Experiment (N=2,003): This experiment is a replication of Rasinski (1989), Green and Kern (2010), and Berinsky et al. (2012), which asked respondents whether too much, too little, or about the right amount was being spent on either "welfare" or "assistance to the poor." All of these studies have shown that support for welfare spending depends greatly on whether it is called "assistance to the poor" or "welfare." Rasinski (1989) reported that from 1984 to 1986, 20%-25% of respondents to the General Social Survey (GSS) in each year said that too little was being spent on welfare, while 63%-65% said that too little was being spent on assistance to the poor. The GSS continued to conduct the spending experiment, and the gap in support for increasing spending ranged from 28% to 50% over 1986-2010 (Green and Kern 2010). Beyond face-to-face samples, a similar gap was also found in MTurk samples, where the difference between the two conditions was 38 percentage points (Berinsky et al. 2012).
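The size and significance of a question-wording gap of this kind can be assessed with a standard two-proportion z-test. The sketch below uses counts approximating our GCS results reported next (about 49% vs. 26% answering "too little" in arms of roughly 1,000 respondents each); the exact counts are illustrative.

# Two-proportion z-test for a framing gap. Counts approximate the GCS
# question-wording results reported below (49% vs. 26%, ~1,000 per arm).
from math import sqrt, erf

def two_prop_z(x1, n1, x2, n2):
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    z = (p1 - p2) / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    # Two-sided normal p-value; underflows to 0 for very large z.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p1 - p2, z, p_value

gap, z, p = two_prop_z(490, 1000, 260, 1000)
print(f"gap = {gap:.0%}, z = {z:.2f}, two-sided p = {p:.2g}")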
We ran a between-subjects experiment on GCS in which respondents were given either the "welfare" or the "assistance to the poor" version of the question. Although the direction of the framing effect is similar to prior studies (i.e., more respondents chose "too little" in the assistance-to-the-poor version than in the welfare version), the gap between the two conditions is only 23 percentage points (p < .001), significantly smaller than the 35 to 40 percentage-point differences in the GSS and MTurk samples (see Table 8).

Table 8: Question Wording Experiments

Do you think that we are spending too much, too little, or about the right amount of money on welfare/assistance to the poor? (% of respondents choosing "too little"; DK responses were dropped from the analysis)

Sample | "Assistance to the poor" | "Welfare" | Difference
Rasinski (1989), GSS face to face, 1984-1986 | 63-65% | 20-25% | 35-40%
Green and Kern (2010), GSS face to face, 1986-2010* | 78-93% | 32-61% | 32-46%
Berinsky et al. (2012), MTurk sample | 55% | 17% | 38%
GCS Internet sample | 49% | 26% | 23%

*Green and Kern's results include "About right" responses.

Asian Disease Experiment (N=2,005): We also replicated another frequently employed framing experiment, the "Asian Disease Problem," first reported in Tversky and Kahneman (1981). This experiment was originally conducted on a student sample but has been replicated many times in different subject pools (Takemura 1994; Kuhberger 1995; Jerwen et al. 1996; Druckman 2001; Berinsky et al. 2012). All respondents were given the following scenario:

Imagine that your country is preparing for the outbreak of an unusual disease, which is expected to kill 600 people. Two alternative programs to combat the disease have been proposed. Assume that the exact scientific estimates of the consequences of the programs are as follows:

They were then given one of the two following conditions:

Condition 1, Lives Saved: If Program A is adopted, 200 people will be saved. If Program B is adopted, there is a one-third probability that 600 people will be saved, and a two-thirds probability that no people will be saved.

Condition 2, Lives Lost: If Program A is adopted, 400 people will die. If Program B is adopted, there is a one-third probability that nobody will die, and a two-thirds probability that 600 people will die.

In each condition, respondents were asked to choose one of the two programs: a program with certain consequences or a program with probabilistic outcomes. In the first condition, both the certain and risky programs were framed in terms of positive outcomes (lives saved), while in the second condition the two programs were described in terms of negative outcomes (lives lost). In their experiment with a student sample, Tversky and Kahneman reported that when the programs were framed in positive terms, 72% of respondents picked the certain choice, but when framed in negative terms, only 22% did. A similar pattern has been observed in subsequent replications, where the gap in choosing the certain option across the two conditions ranges between 32% and 60% (see Table 9).
In our replication using GCS, we found a similar preference reversal among our respondents: 58% picked the certain choice in the lives-saved condition, while only 33% chose the certain choice in the lives-lost condition. However, the gap of only 25 percentage points (p < .001) indicates that the treatment effects of the framing are significantly attenuated in GCS relative to other research designs.

Table 9: Asian Disease Experiments (% of respondents choosing the certain choice, i.e., Program A; DK responses were dropped from the analysis)

Sample | "Lives Saved" condition | "Lives Lost" condition | Difference
Tversky & Kahneman (1981), student sample | 72% | 22% | 50%
Takemura (1994), student sample | 80% | 20% | 60%
Kuhberger (1995), student sample | 54% | 22% | 32%
Jerwen et al. (1996), student sample | 65% | 20% | 45%
Druckman (2001), student sample | 68% | 23% | 45%
Berinsky et al. (2012), MTurk sample | 74% | 38% | 36%
GCS Internet sample | 58% | 33% | 25%

5.2.2 List Experiments

Next, we replicated two different "list experiments" (often called the "item count technique"), which are designed to let researchers elicit responses to questions on which respondents may have an incentive to over- or under-report or otherwise shade the truth. The use of list experiments has increased, and we expect GCS to be particularly useful for implementing them.

The first replication is the list experiment conducted by Janus (2010), who showed that there is much more resistance to immigration than individual respondents admit to in standard survey questions. Because of the word limit that GCS imposes on its surveys, we were unable to follow the exact wording that Janus used in his experiment; we made the instructions as brief as possible to fit within the limit. Below is a brief description of the experiment.

Two groups of respondents are given the same question with different response categories. Both sets of respondents get the question: How many of these items are you opposed to? Both groups are then given the same four non-sensitive items (the order of the statements is rotated):

1. The way gasoline prices keep going up.
2. Professional athletes getting million dollar-plus salaries.
3. Requiring seat belts to be used when driving.
4. Large corporations polluting the environment.

One of the groups serves as the baseline (or control), while respondents in the treatment group receive an additional, sensitive item:

5. Cutting off immigration to the United States.

Respondents are only asked to say how many statements they oppose, not which ones. This eliminates the respondent's concern about giving the socially desirable answer and allows the person to reply more honestly. Because the two groups are randomly assigned, the mean number of "oppose" responses to the first four statements should be the same in the baseline and treatment groups; therefore, any increase in the mean number of opposed items in the treatment group must be attributed to the "cutting off immigration to the United States" statement.

We also replicated a similar list experiment in GCS by Streb et al. (2008) that examines attitudes toward a female president. In this experiment, both groups received the same non-sensitive items used in the immigration experiment, while the treatment group received the sensitive item "A woman serving as president."
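The estimator behind this design is simply the difference in mean counts between the two groups, which identifies the share of respondents opposing the sensitive item. A minimal sketch, using the GCS means reported in Table 10 below (the paper reports only means, so standard errors are omitted):

# List-experiment (item count) estimator: the treatment-minus-baseline
# difference in mean counts estimates the share opposing the sensitive
# item. Means are the GCS values from Table 10 below.
def list_experiment_estimate(mean_treatment, mean_baseline):
    return mean_treatment - mean_baseline

print(f"cutting off immigration (GCS): {list_experiment_estimate(2.80, 2.46):.0%}")       # 34%
print(f"a woman serving as president (GCS): {list_experiment_estimate(2.53, 2.30):.0%}")  # 23%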
Table 10 presents the mean number of opposed items in the baseline and treatment groups, as well as the percentages opposed to cutting off immigration to the United States and to a woman serving as president. Overall, 34-39% of respondents were opposed to immigration and 23-26% opposed a female president. Unlike the framing experiments, the findings from the GCS Internet sample comport closely with the established benchmarks, even though the question wording used here was significantly shorter. There is therefore real potential for GCS to be a reliable platform for social scientists interested in manipulating answer categories to measure sensitive topics in ways that circumvent social desirability concerns.

Table 10: List Experiments

Sample | Mean number of baseline items | Mean number of treatment items | Difference
Janus (2010), TESS Internet sample: "How many of these three/four statements do you oppose?" Sensitive item: "Cutting off immigration to the United States" | 1.77 | 2.16 | 39% "opposed to immigration"
GCS Internet sample (N=2,006), same sensitive item | 2.46 | 2.80 | 34%
Streb et al. (2008), telephone sample (N=2,056): "How many of the following four/five statements make you 'angry or upset'?" Sensitive item: "A woman serving as president" | 2.16 | 2.42 | 26% "woman president makes them upset"
GCS Internet sample (N=2,006), same sensitive item | 2.30 | 2.53 | 23%

5.2.3 Behavioral Economics Games

Finally, we examine whether GCS can be used to conduct the kinds of experiments that are the backbone of behavioral economics. Specifically, we replicated the ultimatum and dictator games using two different methods and evaluated which of the two is more appropriate for GCS.

Ultimatum game: In the ultimatum game, one player (the proposer) proposes some division of a dollar (or some other amount) between himself and a receiver. The receiver can either accept or reject the proposal, but if it is rejected, both get nothing. Although purely rational players would have the proposer offer the receiver one cent and the receiver accept (something is better than nothing), that is not how real players play. Instead, proposers offer between 40% and 50% of the money, and receivers usually accept. If a receiver is offered less than about 40%, they often reject the offer (taking nothing instead of something), and if the offer is less than 20%, they almost always reject it (Slonim and Roth 1998; Camerer and Thaler 1995; Roth et al. 1991).

There are two ways to implement the ultimatum game in GCS. First, we can focus on the receiver's decision by describing the game, telling respondents the offer, and asking whether they accept or reject it. Second, we can use an open-ended numeric question in GCS and let respondents indicate the amount they are willing to accept. We tested both methods with our GCS sample and found the latter (the open-ended numeric question) to be the more appropriate way of conducting this experiment in GCS.

In the first method, we wanted to avoid asking respondents many different questions while still learning how they react to a variety of proposals. We therefore developed many versions of the question (with proposals varying from generous to selfish) and asked each version of a different set of respondents, with enough subjects per version to test the effect of the different treatments. (That is, we asked each of 10 questions of 100 different people, instead of asking all 10 questions of the same 1,000 people.)
This procedure lets us map out the pattern of responses to different "offers," which is the usual object of interest in the ultimatum game. We developed a brief set of instructions, short enough for a respondent to understand, and used the "picture technique" to get around the GCS word constraint. Figure 4 below shows an example of the instructions.

Figure 4: Instructions for the Ultimatum Game

In total, we developed 10 different proposals, ranging from $1 to $10.

Table 11: Pattern of Ultimatum Offers (GCS Internet Sample)

Offer ($) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
% Accept | 38 | 43 | 33 | 43 | 39 | 27 | 30 | 23 | 35 | 32

Based on the results shown in Table 11, we found that the pattern of responses to ultimatum offers is not consistent with prior findings. First, contrary to the conventional wisdom that receivers almost always accept offers of 50% of the money or more, only around 31% of the GCS sample accepted such offers. Second, although the established benchmarks indicate that respondents often reject offers below 40% of the money and almost always reject offers of 20%, a substantial share of GCS respondents (around 33% to 43%) accepted such offers. This method therefore did not give us results consistent with prior research.

In the second method (N=500), instead of developing ten different versions of the question, we asked one version of all respondents and allowed them to indicate the smallest amount they would accept. We slightly modified the question because the open-ended numeric format prohibits us from inserting a picture into the question. Figure 5 below shows the results.

Figure 5: Results of the Ultimatum Game

Although the pattern of responses is still not consistent with prior work, the average amount demanded proved similar to the amount found in Slonim and Roth's (1998) games (both the GCS sample and theirs had means of about 4.45 to 4.5). Based on this, it seems that conducting economic games in GCS may not give researchers the ability to map out the pattern of responses to different "offers" that is the usual object of interest in such games. However, it does yield a reliable mean value of the amount offered, which is still valuable information for researchers.
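Both implementations reduce to simple summaries of the downloaded responses. The sketch below computes the acceptance share for low offers from Table 11 and the mean of a set of open-ended minimum acceptable amounts; the open-ended values are invented stand-ins, since only the resulting mean (about $4.45) is reported above.

# Summaries for the two ultimatum-game implementations.
# Method 1: acceptance shares by fixed offer, taken from Table 11.
accept_by_offer = {1: 38, 2: 43, 3: 33, 4: 43, 5: 39,
                   6: 27, 7: 30, 8: 23, 9: 35, 10: 32}
low_offers = [accept_by_offer[o] for o in (1, 2, 3, 4)]
print(f"mean acceptance of offers of $4 or less: {sum(low_offers) / len(low_offers):.0f}%")

# Method 2: open-ended minimum acceptable amounts (hypothetical values;
# only the resulting mean of roughly $4.45 is reported in the paper).
min_acceptable = [5, 4, 3, 5, 6, 4, 5, 4]
print(f"mean minimum acceptable amount: ${sum(min_acceptable) / len(min_acceptable):.2f}")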
Dictator game: Besides the ultimatum game, we also replicated the dictator game to evaluate the usefulness of GCS for behavioral economics experiments. In the dictator game, the proposer determines an allocation (or split) of some endowment, and the receiver has no choice but to accept it. Out of pure self-interest, the proposer should keep all of the money, giving nothing to the receiver. Prior work, however, suggests that proposers tend to transfer a nontrivial sum of money to the receiver (see Roth 1995; Camerer 2003).

As with the ultimatum game, we tested both methods of conducting the dictator game in GCS and found the second method to be more appropriate. In the first method, we again developed 10 versions of the question using the "picture technique," with amounts ranging from $1 to $10 (see Figure 7 below for an example).

Figure 7: Instructions for the Dictator Game

We asked each version of the question of 100 respondents, for a total of 1,000, and gave respondents the options "yes," "no," and "don't know."

Table 12: Pattern of Dictator's Giving (GCS Internet Sample; DK responses dropped)

Offer ($) | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
% choosing to give this amount | 44 | 42 | 32 | 40 | 41 | 32 | 32 | 40 | 30 | 45

The results shown in Table 12 are not consistent with previously published findings. In particular, the pattern in the GCS sample showed no evidence that proposers tend to transfer a nontrivial sum to the receiver, as the percentage of respondents willing to transfer a given amount of their endowment did not vary dramatically across offers. This method of constructing the experiment in GCS therefore does not yield accurate results.

Instead, we turned to the second method (similar to the second method for the ultimatum game), using the open-ended numeric question format to let respondents choose the amount of money they want to give away. Again, we slightly modified the question to fit GCS' word limit. Figure 8 below shows the result.

Figure 8: Results of the Dictator Game

As with the ultimatum game, the pattern of giving in the GCS sample does not replicate the conventional wisdom that respondents tend to give away a nontrivial (but partial) amount; indeed, 22.9% of GCS respondents chose to give away the entire endowment. However, the average amount given away, $3.20, is comparable to the amount found in Ben-Ner et al.'s (2008) student samples, where the average amount given was $3.08. This further underscores the conclusion that GCS provides useful estimates of mean amounts in these games but not reliable estimates of the full pattern of offers and giving.

5.3 Discussion

Throughout this paper, we have discussed some of the key features that distinguish GCS from other Internet-based surveys that rely on large opt-in panels. Specifically, we have pointed out that the "survey wall" environment interrupts respondents' browsing activities with survey questions and "coerces" them into providing responses before allowing them to proceed to their target content. Given this feature, there are two concerns that could affect the quality of the samples.

First, when respondents are given questions they perceive to be too difficult, they can choose to bail out, either by requesting another question or by sharing the target content on their social media pages. However, as we showed in Section 4, we have no reason to believe this problem significantly affects GCS samples, since respondents who chose to bail out were not systematically different in inferred demographic characteristics from those who did not.

Second, even if respondents do not bail out and participate in the survey, their motivation to access the target content may cause them not to read and answer the questions carefully. Since attention is an important element of receiving treatment in many survey experiments, inattentive (or unmotivated) respondents can weaken the correlations across related questions and diminish experimental treatment effects (Berinsky et al. 2014); consequently, attempts to replicate the treatment effects of prior experiments will be attenuated. We found this particular problem in the two framing experiments that we replicated.
In the question-wording experiment, we found a 23 percentage-point gap between the two conditions, while the comparable gap was 38 points in the 2012 MTurk sample and 44 points in the 2010 GSS. Similar results appear in the Asian Disease experiment, where our gap between the two conditions in GCS is only 25 points, compared to 36 points in the 2012 MTurk sample and 45 points in Druckman's 2001 student sample, among others. Such attenuation of treatment effects, we believe, is caused by respondents' lack of motivation in reading and answering the survey questions. Respondents are not necessarily more motivated in other survey environments; prior scholars have found that respondents in opt-in surveys and experiments often display satisficing behavior (Krosnick 1991), offer careless and haphazard survey responses (Huang et al. 2012; Meade and Craig 2012), and do not bother to closely read questions or instructions (Oppenheimer et al. 2009). But in those other settings (e.g., laboratory studies or web-based opt-in panels), respondents have at least agreed in advance to participate. Given that respondents in GCS are interrupted and "coerced" into participating, greater inattentiveness is more likely in this environment.

From the three sets of studies we replicated, we have some evidence that GCS can be a useful platform for conducting survey experiments. The comparable results in the list experiments and behavioral economics games suggest that GCS samples behave rather similarly to those found in published research. But we also encountered potential drawbacks: in the framing experiments, the treatment effects in the GCS sample diminished considerably, even though the expected directions of the treatments remained consistent. On balance, we still believe that the utility of GCS for conducting survey experiments outweighs its drawbacks.

6. Conclusion

GCS is a low-cost survey platform that may have promise for social scientists exploring causal hypotheses. Specifically, in this paper we investigate its value to researchers interested in conducting survey experiments. Our focus on its usefulness for survey experimentation, rather than as a more general survey platform, stems from the fact that the real cost advantage of GCS over other alternatives is limited to surveys of one or a very small number of questions. Observational studies that require statistical models including a long (or even small) list of control variables can likely be accomplished as efficiently using other survey tools. Further, while previous work suggests GCS obtains reasonably representative samples, there is little compelling evidence that its samples are sufficiently better than its competitors' to attract social scientists' attention.

In contrast, survey-experimental work has three features that make GCS a potentially attractive option for social scientists pursuing this kind of research. First, if these experiments can be implemented in one question, the considerable cost advantages of GCS over its competitors can be realized. Second, if true randomization of treatments can be achieved (and we argue that GCS' methodology achieves it, and we demonstrate the resulting balance on a set of measured covariates), then no additional control variables are necessary to make strong inferences.
Finally, given that most experimenters are much more concerned with the internal validity of their inferences than with their external validity, unresolved questions about the representativeness of the GCS sample may be of secondary concern; the sample is certainly better than that of most lab experiments.