On the replicability of intra-team communication classification∗

Theresa Eich†   Stefan P. Penczynski‡

August 31, 2016

The use of intra-team communication as introduced by Burchardi and Penczynski (2014) provides data in the form of incentivized written accounts of reasoning that speak directly to procedural models of economic behavior. At the same time, the coding of such data is often criticized as subjective and hardly replicable. The present short study addresses these critiques by drawing on a large number of coders recruited on Amazon Mechanical Turk and thus investigating the quality of the coding in terms of replicability. The results indicate substantial to very good measures of inter-rater consensus and consistency. Overall, the paper shows that the classification of intra-team communication data is replicable by non-incentivized, non-expert coders.

Keywords: Communication, classification, Amazon Mechanical Turk.
JEL Classification: C91, C80, D83

∗ We thank Ryan E. Atkins, Ori Heffetz and Christian Koch as well as seminar participants at the European ESA Conference 2015 in Heidelberg. This study is based on Theresa Eich's Master thesis at the University of Mannheim. Katharina Momsen provided very good research assistance.
† Frontier Economics Ltd., Im Zollhafen 24, 50678 Cologne, Germany, [email protected].
‡ Department of Economics, University of Mannheim, L7, 3-5, 68131 Mannheim, Germany, [email protected], Tel. +49 621 181 3656, Fax. +49 621 181 1893.

1. Introduction

In recent years, economic behavior has been described in models of strategic thinking, time preferences, social preferences, etc. that very explicitly describe the reasoning process of decision makers.1 This development in the theoretical literature has been accompanied by an increasing reliance on experimental data that goes beyond choice data. More and more studies include measures of beliefs (Nyarko and Schotter, 2002; Costa-Gomes and Weizsäcker, 2008), information search data (Johnson, Camerer, Sen, and Rymon, 2002; Costa-Gomes and Crawford, 2006; Reutskaja, Nagel, Camerer, and Rangel, 2011), response times (Rubinstein, 2007), physiological and hormonal information (Kosfeld, Heinrichs, Zak, Fischbacher, and Fehr, 2005) and, of course, neurological data in the wide field of neuroeconomics (Coricelli and Nagel, 2009; Glimcher and Fehr, 2013).

This paper is concerned with an alternative way of observing reasoning processes, namely via incentivized written accounts of reasoning that result from intra-team communication as introduced by Burchardi and Penczynski (2014, henceforth BP). Compared to all above-mentioned methods, this procedure generates data that is probably most immediately informative about the reasoning process of subjects. For the analysis of this data, this strength comes with the natural disadvantage that the coding into suitable categories requires human judgment. The common procedure in similar studies in economics (Cooper and Kagel, 2005; Rydval, Ortmann, and Ostatnicky, 2009) and in protocol analysis (Ericsson and Simon, 1984; Krippendorff, 2013) is to rely on independent ratings of at least two classifiers. This approach is often criticized by economists as relying on subjective and thus non-replicable classifications of messages.
In this study, we test this hypothesis and implement a non-incentivized classification procedure with non-expert raters in the online labor market of Amazon Mechanical Turk (AMT).2 This platform enables us to efficiently obtain more than 40 classifications of a single set of 78 messages. These data provide a rich basis to rigorously test whether the content of the messages is repeatedly classified in a similar fashion.

The results indicate substantial to very good measures of inter-rater agreement and consistency. The agreement measure for the whole sample is associated with more than 33 out of 43 coders agreeing on the exact category. We observe perfect agreement of 43 classifiers in some instances, which suggests that the quality of the classification process is good. Imperfect classification might rather be due to inherent ambiguity of messages with respect to the model categories. It can be shown that BP's conservative procedure of requiring lower and upper bounds of levels of reasoning is useful to capture individual instances of ambiguous messages.

1 Be it models of strategic reasoning (Nagel, 1995; Stahl and Wilson, 1995; Camerer, Ho, and Chong, 2004; Crawford and Iriberri, 2007), equilibrium models of games with incomplete information (Eyster and Rabin, 2005; Jehiel, 2005; Esponda, 2008), models of time preferences and self-control (Fudenberg and Levine, 2006) or models of fairness (Rabin, 1993; Falk and Fischbacher, 2006).
2 Traditionally, AMT is used for rather simple tasks such as tagging photos, identifying abusive written comments, or giving short descriptions of images. More recently, AMT is used as a platform to recruit subjects and run interactive online experiments (Amir and Rand, 2012; Horton, Rand, and Zeckhauser, 2011; Jordan, McAuliffe, and Rand, 2015, and citations therein; LIONESS project). Here, we use it in its traditional role for individual message classification.

2. Data

2.1. The intra-team communication experiment

The communication transcripts in this study are taken from the experiment of BP, in which a standard beauty contest game with multiplier p = 2/3 is implemented. Teams of two subjects play as one entity and exchange arguments via a particular communication protocol. Specifically, both subjects individually make a suggested decision and write up a justifying message. Upon completion, this information is exchanged simultaneously and both subjects individually enter a final decision. The computer randomly draws one of the final decisions to be the team's action in the beauty contest game. The protocol has the advantage of recording the arguments of the individual player at the time of the decision making. Furthermore, each subject has an incentive to convince the team partner of his reasoning, as she determines the team action with 50% probability.

The analysis had two PhD students classify the messages on the basis of a level-k model of strategic reasoning (Nagel, 1995; Stahl and Wilson, 1995), indicating the lower and upper bound of the level of reasoning and the mean of the level-0 belief. The level categories are thus directly given by the model as the number of iterated best responses described in the message. We add a category "not applicable" as well as a category for equilibrium reasoning. The fact that the categories are derived from the model leaves the researcher only the possibility of merging categories in order to deal with ambiguities. For this reason, BP elicit lower and upper bounds that determine the interval within which the level of reasoning is likely to lie.
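To make the level categories concrete, the following minimal sketch (ours, not part of BP or of this study's materials) shows how a level-k choice in the p = 2/3 beauty contest follows from a level-0 belief mean by iterated best response; the function name and the example belief mean of 50 are illustrative assumptions only.

```python
# Illustrative sketch: level-k choices in a p = 2/3 beauty contest.
# A level-k player best responds to level-(k-1); level 0 is summarized
# here by the mean of the level-0 belief (50 is just an example value).

def level_k_choice(level0_mean: float, k: int, p: float = 2 / 3) -> float:
    """Choice of a level-k player after k iterated best responses."""
    choice = level0_mean
    for _ in range(k):
        choice *= p
    return choice

if __name__ == "__main__":
    for k in range(4):
        print(f"level {k}: {level_k_choice(50, k):.1f}")
    # level 0: 50.0, level 1: 33.3, level 2: 22.2, level 3: 14.8
```

A message describing k such iterations of reasoning is accordingly coded as level k.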
Here, instead of asking for both upper and lower bounds, workers are asked for the level most likely to be behind a message. Previous test runs with university students had shown this to be more easily understood. The level-0 belief mean lies in the interval [0, 100], which we break down for simplicity into 8 categories with boundaries 15, 25, 35, 45, 55, 65, and 75. We add the category "none" if no classification can be made.

2.2. Amazon Mechanical Turk

Amazon Mechanical Turk (AMT) is a crowd-sourcing marketplace that brings together task providers and workers on a platform that allows for task implementation, compensation and data collection. "Requesters" create and offer a task; "workers" are willing to earn money by performing this task. The so-called Human Intelligence Tasks (HITs) are characterized on the one hand by a high degree of complexity that cannot be handled by computers3 and on the other hand by a simple structure that humans can work on via a computer (e.g. surveys and experiments). Workers have access to the HITs and can accept them after previewing their content and payment conditions.

We are working with the standard population of Amazon workers and do not restrict them on the basis of geographical location or performance record.4 Buhrmester, Kwang, and Gosling (2011) find that although the participation rate is sensitive to compensation, there are many workers willing to work for low amounts of money. In fact, lowering the money offered did not have an effect on the quality delivered. This suggests that workers are not primarily monetarily but rather intrinsically motivated (Paolacci, Chandler, and Ipeirotis, 2010; Chandler and Kapelner, 2010; Horton, Rand, and Zeckhauser, 2011).

2.3. Instructions, Qualification Test, and Task

As members of the general public, workers are not expected to have the economic background necessary to evaluate the statements from the experiment right away. To establish comprehension, instructions on strategic thinking in the level-k model (see appendix A.2) are provided, written for people without further economic background. Examples from everyday life were used to illustrate the concept. Still, it is essential that only those workers pass to the task who have gained a thorough understanding of the concept. To achieve this objective, a qualification test is installed (see appendix A.3). It comprises questions on the level of reasoning and the corresponding level-0 belief. Workers who passed the qualification test were allowed to work on the classification of the statements.

Figure 1 shows a screenshot of the task. The message to be classified is displayed on top and the two classifications, level of reasoning and level-0 belief mean, are to be entered below. Confirming these two entries concludes the HIT. We set a compensation of $0.15 per assignment, independent of the classification.5 Given prior assumptions of a duration of 1.5 minutes per assignment and 30 minutes for the qualification test, the hourly wage amounts to $5.25. In the end, workers took less time than expected, raising the real hourly wage to $8.94.

3 In a related project (Penczynski, 2016), I investigate the possibilities of machine learning for these kinds of classification tasks. The possibility of training a computer algorithm with existing classifications in order to reduce the cost of classifying large datasets thus already exists.
4 While requesters have to be an entity located in the US, workers in any country can sign in to the marketplace independently of their residence.
5 We told subjects to code conscientiously according to the instructions and that there is no right or wrong. We would only refuse payment for classifications that were obviously random or uninformative, a scenario which would be easily detectable on the basis of a couple of classifications but which never occurred. Houser and Xiao (2011) incentivize interrater agreement with a classification coordination game, paying coders when they coordinate on the same category, with the message serving as coordination device.

Figure 1: Screenshot of a HIT in AMT.

2.4. Data collection

In total, 42 to 43 classifications per message – or assignments per HIT – were collected in two sequential rounds. The first one was a trial round and collected 2 to 3 assignments per HIT in five days between July 1 and 5, 2014, while the second one collected 40 assignments per HIT in less than five days between July 9 and 14, 2014. No worker did the same HIT repeatedly. In total, 59 workers participated and 23 of them completed the whole set of 78 HITs. Except for four messages, each message was classified 43 times, making a total of 78 · 43 − 4 = 3350 classifications.

The time that workers spent on their task can be approximated by looking at the time elapsed between the submission of one task and the submission of the next.6 This measure indicates a median time spent on a task of 27 seconds, a reasonable time for a repeated activity. 10% of classifications are done within 10 seconds, 25% within 16 seconds. 25% take longer than 54 seconds and 10% take longer than 121 seconds.7

6 Two relevant times are stamped: the acceptance of a task and its submission. Because a task is visible before its acceptance, the time between acceptance and submission can be very small. At the same time, workers that accept multiple tasks have relatively long times between acceptance and submission. In 90% of the cases, only up to 8 seconds elapse between submission of one and acceptance of the next task. Overall, the time between one submission and the next seems a good measure of the time spent on the task.
7 The reason behind short work times is often the brevity of the message. The overall median message length is 140 characters. For messages classified within 10 (16) seconds, the median length is 37 (75) characters. Longer work times are not associated with longer messages, presumably because medium-sized messages can also be difficult to classify. For messages classified in more than 54 (121) seconds, the median length is 161 (142) characters.

3. Results

The classification results for each message are illustrated in histograms in figures 5 to 82 in appendix A.1. While other studies were interested in the exact classifications, here we are only interested in the extent to which classifiers agree in the judgement of the level of reasoning and the level-0 belief mean exhibited in the message. This interrater reliability is commonly evaluated on the basis of consensus and consistency estimates (Stemler, 2004). Consensus estimates are most relevant when classifiers can be expected to agree exactly on which rubric the observed message falls into. In contrast, consistency estimates reflect the extent to which different behavior is indeed classified differently in a consistent way, even though not in the same absolute rubric. Consistency measures require rubrics with an underlying unidimensional construct that allows for differences in the direction of classification to be meaningful.
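To see the distinction in the smallest possible example (hypothetical numbers, not taken from our data), consider two raters who always assign adjacent level categories: their exact agreement, a consensus measure, is zero, while their ratings move together perfectly, which any consistency measure rewards.

```python
# Hypothetical illustration: zero consensus can coexist with perfect consistency.

def percent_agreement(a, b):
    """Share of messages on which two raters pick exactly the same category."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def pearson(a, b):
    """Plain Pearson correlation as a simple consistency measure."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    sa = (sum((x - ma) ** 2 for x in a) / n) ** 0.5
    sb = (sum((y - mb) ** 2 for y in b) / n) ** 0.5
    return cov / (sa * sb)

rater_1 = [1, 2, 0, 3, 1, 2]
rater_2 = [2, 3, 1, 4, 2, 3]                       # always one level higher
print(percent_agreement(rater_1, rater_2))         # 0.0 -> no consensus
print(round(pearson(rater_1, rater_2), 3))         # 1.0 -> perfect consistency
```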
In our case, the classification into levels and level-0 beliefs should ideally have high levels of consensus; consistency will be reported for completeness.

3.1. Consensus

The first and probably most intuitive measure of the classification quality is the degree of consensus among classifiers in terms of percentage agreement. As a first benchmark, we note that the agreement between the two RAs in BP is 76.92% for the level lower bounds and 75.64% for the upper bounds (based on independent individual classification before reconciliation). The median pairwise percentage agreement between all pairs of the 43 sets of classifications is 64.74%. This is slightly less than for the RAs, but still a substantial level of agreement for our classifiers.8

Since it is generally difficult to provide minimally required levels for this simple measure, a popular alternative measure of pairwise consensus is Cohen's kappa statistic, which corrects for the amount of agreement that could be due to chance based on the marginal distributions. A value of zero then indicates that the ratings did not coincide any more than they would be expected to by chance. This measure is particularly useful for data that mostly fall into one category, which inflates the percentage agreement measure. This is not the case in our data since the categories NA, 1, 2 and 3 all occur commonly. The median Cohen kappa in the data is 0.48, indicating moderate agreement according to influential benchmarks by Landis and Koch (1977).9 As a comparison, the two RAs exhibit 0.66 (lower bounds) and 0.65 (upper bounds), which would be considered substantial agreement. This comparison gives a quantitative idea of the difference non-expert coders make for the agreement.

8 Among all pairs of workers, the average percentage agreement is 61.25%.
9 The average Cohen kappa is 0.45.

The literature acknowledges that measures correcting for chance agreement yield conservative measures of agreement (Lombard, Snyder-Duch, and Bracken, 2010). In particular, Gwet (2014) shows that higher numbers of messages and categories allow a given level of kappa to exclude chance agreement more strongly. With 2 coders, 5 categories and 60 messages, for example, a kappa of 0.2 allows rejecting chance agreement at the 0.05 significance level, while a kappa of 0.5 with 2 categories and 10 messages does not. With 78 messages and 7 categories, a kappa of 0.45 is strong evidence against chance agreement.

An elaborate consensus measure for any number of coders is Krippendorff's alpha (Krippendorff, 2013), which also corrects for agreement due to chance. For the present whole sample on levels, this measure is 0.45. For the level-0 belief classification, it turns out to be 0.54. Krippendorff (2013) generally requires an alpha of 0.66 for his purposes. To illustrate that the present level of alpha in a sample with 78 messages and 43 coders constitutes substantial agreement in our context, figure 2 shows two distributions that would yield this alpha value if all classifications exhibited the same distribution shape and led to the overall marginal distribution as in our data. A representative agreement of 33 out of 43 coders suggests that coders do coincide substantially.

Figure 2: Representative distributions for observed levels of α. (a) Level classification leading to α = 0.42. (b) Level-0 belief classification leading to α = 0.53.

We saw before that part of the disagreement might be due to the non-expertise of the AMT workers.
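To make the chance correction concrete, the following sketch (our illustration; the two rating vectors are made up, not drawn from the data) computes Cohen's kappa for a pair of raters from their category assignments. Krippendorff's alpha generalizes the same idea to many coders and missing ratings and is typically computed with dedicated software rather than by hand.

```python
from collections import Counter

def cohens_kappa(r1, r2):
    """Chance-corrected pairwise agreement between two raters (illustrative sketch)."""
    n = len(r1)
    p_obs = sum(a == b for a, b in zip(r1, r2)) / n              # raw agreement
    m1, m2 = Counter(r1), Counter(r2)                            # marginal counts
    p_exp = sum((m1[c] / n) * (m2[c] / n) for c in set(r1) | set(r2))
    return (p_obs - p_exp) / (1 - p_exp)

# made-up level ratings from the categories {NA, 1, 2, 3}
a = ["1", "2", "1", "NA", "2", "3", "1", "2"]
b = ["1", "2", "2", "NA", "2", "3", "1", "1"]
print(round(cohens_kappa(a, b), 2))   # raw agreement 0.75, kappa approx. 0.64
```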
In the following, we investigate another feature that could contribute to the disagreement, namely the ambiguity of messages with respect to the model category. We start with a look at the just-mentioned frequency of the mode classification, which indicates how many raters agree with the most common classification. Approximating the agreement for each message in this way, the empirical cumulative distribution function of the mode frequency in figure 3 illustrates how agreement is distributed across messages. For both level and level-0 belief classifications, it can be seen that the mode frequency is nearly uniformly distributed, with more density towards the right end. More than 10% of level classifications and more than 20% of the level-0 belief classifications have more than 40 raters agreeing on one category. This shows that there is no minimum level of noise in the classification that inhibits extreme agreement. Similarly, we can calculate the Krippendorff alpha for the third of the sample with the strongest agreement. Here, the alpha is 0.74 for levels and 0.89 for level-0 belief means, easily satisfying the suggested minimum level. Both results point to the ambiguity of the messages with respect to the model as a candidate limiting factor that prevents better agreement.

Figure 3: Empirical cdf of the maximal number of workers choosing the same classification for each of the 78 messages.

In order to take into account the possible ambiguity of messages with respect to the level of reasoning, the classifications in BP were intervals of levels of reasoning determined by level lower and upper bounds. For some instances with a small mode frequency and thus large disagreement, figure 4 shows that this measure is useful to carefully capture the content of a message. Out of 3350 level classifications, 2514 (75.0%) lie within the level intervals from BP. If in each interval only the most frequent classification were considered, this value would shrink to 2254 (67.3%). By capturing nearly 8 percentage points more of the non-expert classifications, the intervals contribute to an appropriate interpretation of the messages.

Figure 4: Capturing ambiguity with lower and upper bounds (red bars reflect out-of-bounds classifications). (a) 37 classifications within the bounds. (b) 36 classifications within the bounds.

3.2. Consistency

An often reported measure of interrater consistency is Cronbach's alpha, which measures the extent to which the ratings of a group of classifiers measure a common underlying dimension. Low values correspond to the variance in the overall ratings being due to error rather than due to the sum of intrarater variance. For the two RAs, the alpha for the lower bounds is 0.90 and for the upper bounds it is 0.89, indicating a very high consistency. The overall measure for the 43 workers is 0.99, implying that there is a strong notion of an underlying dimension that is measured in these classifications. The median pairwise alpha between all pairs of the 43 sets of classifications is 0.83. This is slightly less than for the RAs, but still an excellent level of consistency.10

10 Among all pairs, the average alpha is 0.82, the minimum alpha 0.52 and the maximum alpha 0.94.
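For completeness, the following sketch (again with made-up ratings rather than the study's data) computes Cronbach's alpha with raters treated as items and messages as cases, the usual convention for interrater consistency; the three rating vectors are chosen so that a systematic level offset between raters leaves consistency essentially untouched.

```python
import statistics as st

def cronbach_alpha(ratings_by_rater):
    """Cronbach's alpha with raters as items; rows = raters, columns = messages."""
    k = len(ratings_by_rater)                               # number of raters
    n = len(ratings_by_rater[0])                            # number of messages
    item_vars = [st.pvariance(r) for r in ratings_by_rater]
    totals = [sum(r[j] for r in ratings_by_rater) for j in range(n)]
    return (k / (k - 1)) * (1 - sum(item_vars) / st.pvariance(totals))

# made-up level ratings for six messages by three raters; rater_2 is always
# one level higher than rater_1, rater_3 deviates on a single message
rater_1 = [0, 1, 2, 3, 2, 1]
rater_2 = [1, 2, 3, 4, 3, 2]
rater_3 = [0, 1, 2, 3, 3, 1]
print(round(cronbach_alpha([rater_1, rater_2, rater_3]), 2))   # approx. 0.98
```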
3.3. Discussion

Overall, these results from non-incentivized, non-expert coders show that the message content from intra-team communication can indeed be reliably and replicably classified, even in categories that are not determined by the researcher but determined by the model of reasoning. Interestingly, a number of messages lead to extremely similar and consistent classifications, which strengthens the view that the classification part of the process is not particularly noisy or subjective. The fact that perfect agreement of 43 classifiers is observed in some instances suggests that one limiting factor of agreement might be the ambiguity of the messages with respect to the classified model. One useful way to deal with this can be the bundling of categories as in the level intervals.

Ambiguities could result from individuals' reasoning not fitting the model of reasoning. Alternatively, even if the model describes the reasoning well, the verbal expression of that reasoning might introduce noise. The influence of the experimenter on these elements is limited. Subject comprehension is always sought with very clear instructions and understanding tests. Other than that, the software implementation of the protocol should be as intuitive as possible, a point worth keeping in mind given the rapid developments in user interfaces and the habituation particularly of students to highly sophisticated interfaces on smartphones and tablets.

Progress in this respect requires a measure of ambiguity. If the classification of the message is assumed to be of constant good quality across messages, the dispersion of classifications in the present data immediately provides such a measure. For illustration, we propose the measure

h = 1 + (Σ_i P(x_i) log2 P(x_i)) / log2 N,

which is 1 minus the normalized Shannon entropy for a random variable X with N values x_i, where P(x_i) is the probability or empirical fraction of value x_i. In this dimension-free measure, a value of 0 results from a uniform distribution and reflects an uninformative message. A degenerate distribution with full agreement yields a value of 1, as the message perfectly informs about the category within the model. The classification results in Appendix A.1 provide the h-value along with each histogram, which – in our view – quantifies the message's informativeness with respect to the classification very well. Table 1 gives descriptive statistics of this measure, reiterating the higher information in the belief classification and the fact that the value 1 can almost be reached in both classifications.

Table 1: Message informativeness h.
            Level-k   Level-0 belief
Mean        0.655     0.723
St. Dev.    0.158     0.189
Median      0.662     0.753
Min         0.294     0.346
Max         0.943     1
N           78        78
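The measure is straightforward to compute from the classification counts; the sketch below (our illustration, with a made-up distribution of 43 ratings over N = 7 level categories) implements h exactly as defined above.

```python
import math
from collections import Counter

def informativeness(classifications, n_categories):
    """h = 1 minus the normalized Shannon entropy of the classification distribution."""
    total = len(classifications)
    entropy = -sum((c / total) * math.log2(c / total)
                   for c in Counter(classifications).values())
    return 1 - entropy / math.log2(n_categories)

# made-up example: 40 of 43 raters choose level 1, three choose level 2,
# with N = 7 possible level categories
ratings = ["1"] * 40 + ["2"] * 3
print(round(informativeness(ratings, 7), 2))   # approx. 0.87
```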
4. Conclusion

This study set out to investigate the replicability of the classification of intra-team communication in the context of procedural strategic reasoning models. With the help of a large number of replications for a given set of 78 messages, it can be shown that the classification is overall replicable, exhibiting substantial interrater consensus and very good consistency. The data was generated in a non-incentivized way by non-expert workers on Amazon Mechanical Turk, thereby providing a lower bound on the reliability and consistency one should expect to find in classifications of such non-trivial content.

References

Amir, O., and D. G. Rand (2012): "Economic games on the internet: The effect of $1 stakes," PloS one, 7(2), e31461.
Buhrmester, M., T. Kwang, and S. D. Gosling (2011): "Amazon's Mechanical Turk: A new source of inexpensive, yet high-quality, data?," Perspectives on Psychological Science, 6(1), 3–5.
Burchardi, K. B., and S. P. Penczynski (2014): "Out of your mind: Eliciting individual reasoning in one shot games," Games and Economic Behavior, 84(1), 39–57.
Camerer, C. F., T.-H. Ho, and J.-K. Chong (2004): "A Cognitive Hierarchy Model of Games," The Quarterly Journal of Economics, 119(3), 861–898.
Chandler, D., and A. Kapelner (2010): "Breaking monotony with meaning: Motivation in crowdsourcing markets," Discussion paper, University of Chicago.
Cooper, D. J., and J. H. Kagel (2005): "Are Two Heads Better than One? Team versus Individual Play in Signaling Games," American Economic Review, 95(3), 477–509.
Coricelli, G., and R. Nagel (2009): "Neural correlates of depth of strategic reasoning in medial prefrontal cortex," Proceedings of the National Academy of Sciences, 106(23), 9163–9168.
Costa-Gomes, M. A., and V. P. Crawford (2006): "Cognition and Behavior in Two-Person Guessing Games: An Experimental Study," American Economic Review, 96(5), 1737–1768.
Costa-Gomes, M. A., and G. Weizsäcker (2008): "Stated Beliefs and Play in Normal-Form Games," Review of Economic Studies, 75(3), 729–762.
Crawford, V. P., and N. Iriberri (2007): "Level-k Auctions: Can a Nonequilibrium Model of Strategic Thinking Explain the Winner's Curse and Overbidding in Private-Value Auctions?," Econometrica, 75(6), 1721–1770.
Ericsson, K., and H. Simon (1984): Protocol Analysis: Verbal Reports as Data. Cambridge, Mass.: MIT Press.
Esponda, I. (2008): "Behavioral equilibrium in economies with adverse selection," The American Economic Review, 98(4), 1269–1291.
Eyster, E., and M. Rabin (2005): "Cursed Equilibrium," Econometrica, 73(5), 1623–1672.
Falk, A., and U. Fischbacher (2006): "A theory of reciprocity," Games and Economic Behavior, 54(2), 293–315.
Fudenberg, D., and D. K. Levine (2006): "A Dual-Self Model of Impulse Control," American Economic Review, 96(5), 1449–1476.
Glimcher, P. W., and E. Fehr (2013): Neuroeconomics: Decision making and the brain. Academic Press.
Gwet, K. L. (2014): Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters. Advanced Analytics, LLC.
Horton, J. J., D. G. Rand, and R. J. Zeckhauser (2011): "The online laboratory: Conducting experiments in a real labor market," Experimental Economics, 14(3), 399–425.
Houser, D., and E. Xiao (2011): "Classification of natural language messages using a coordination game," Experimental Economics, 14(1), 1–14.
Jehiel, P. (2005): "Analogy-based expectation equilibrium," Journal of Economic Theory, 123(2), 81–104.
Johnson, E. J., C. Camerer, S. Sen, and T. Rymon (2002): "Detecting Failures of Backward Induction: Monitoring Information Search in Sequential Bargaining," Journal of Economic Theory, 104(1), 16–47.
Jordan, J., K. McAuliffe, and D. Rand (2015): "The effects of endowment size and strategy method on third party punishment," Experimental Economics, pp. 1–23.
Kosfeld, M., M. Heinrichs, P. J. Zak, U. Fischbacher, and E. Fehr (2005): "Oxytocin increases trust in humans," Nature, 435(7042), 673–676.
Krippendorff, K. (2013): Content analysis: An introduction to its methodology. Sage.
Landis, J. R., and G. G. Koch (1977): "The measurement of observer agreement for categorical data," Biometrics, pp. 159–174.
Lombard, M., J. Snyder-Duch, and C. C. Bracken (2010): "Intercoder reliability," Discussion paper, Temple University.
Nagel, R. (1995): "Unraveling in Guessing Games: An Experimental Study," American Economic Review, 85(5), 1313–1326.
Nyarko, Y., and A. Schotter (2002): "An Experimental Study of Belief Learning Using Elicited Beliefs," Econometrica, 70(3), 971–1005.
Paolacci, G., J. Chandler, and P. Ipeirotis (2010): "Running Experiments on Amazon Mechanical Turk," Judgment and Decision Making, 5, 411–419.
Penczynski, S. P. (2016): "Using machine learning for intra-team communication classification," Discussion paper, University of Mannheim.
Rabin, M. (1993): "Incorporating Fairness into Game Theory and Economics," The American Economic Review, 83(5), 1281–1302.
Reutskaja, E., R. Nagel, C. F. Camerer, and A. Rangel (2011): "Search dynamics in consumer choice under time pressure: An eye-tracking study," The American Economic Review, pp. 900–926.
Rubinstein, A. (2007): "Instinctive and Cognitive Reasoning: A Study of Response Times," Economic Journal, 117, 1243–1259.
Rydval, O., A. Ortmann, and M. Ostatnicky (2009): "Three very simple games and what it takes to solve them," Journal of Economic Behavior & Organization, 72(1), 589–601.
Stahl, D. O., and P. W. Wilson (1995): "On Players' Models of Other Players: Theory and Experimental Evidence," Games and Economic Behavior, 10(1), 218–254.
Stemler, S. E. (2004): "A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability."

FOR ONLINE PUBLICATION ONLY

A. Appendix

A.1. Data by message

The following graphs show the frequency of each category for the message given in the caption. Black bars reflect classifications in accordance with the classification in BP, red bars reflect differing classifications. The informativeness h is reported for each classification histogram.

Figure 5: "i think most people will go for 50? So the average will be about 35?" (a) Level classification, h = 0.943. (b) Level-0 belief classification, h = 0.9.

Figure 6: "If 50 were average, 2/3rds would be around 33. I would imagine the average to be above 50, and therefore guess around 39.Realisitically the answer will be between 33-66.So maybe higher than 39? Perhaps 43" (a) Level classification, h = 0.51. (b) Level-0 belief classification, h = 0.689.

Figure 7: "Hey partner:) I am doing econs 2nd year, we’ve studied this experiment on our lectures let me just explain to you how it works: the average of all scores will ve centred arond 50, then if we multiply it by 2/3 it will give 33.3. HOWEVER, note that 33 is the number that most people will put, because they can calculate up to that point. YET whan majority fails to foresee is that if everyone puts 33, then the real target average is 22!and that’s the number we should put in." (a) Level classification, h = 0.705. (b) Level-0 belief classification, h = 0.85.

Figure 8: "60 seems close to the average between the number of 0 to 100, and i guess more people will choose the average no." (a) Level classification, h = 0.629. (b) Level-0 belief classification, h = 0.767.

Figure 9: "60" (a) Level classification, h = 0.76. (b) Level-0 belief classification, h = 0.767.

Figure 10: "I am not an economist, but I’m looking at at whole I think the avearge maybe around 60. Therefore, 40 would be the two thirds of this it also seems reasonable." (a) Level classification, h = 0.847. (b) Level-0 belief classification, h = 0.95.
Figure 11: "I would guess that the majority of people would pick a number in the lower stratum. Say 25 or 30 the two thirds average of this is around 15" (a) Level classification, h = 0.683. (b) Level-0 belief classification, h = 0.865.

Figure 12: "As the range of numbers is between 0 - 100, I think most of guys might be choosing a number close to 50, and since the two thirds of the average is choosen, the target number may be nearer to 40. it’s just my thinking. Not sure really in this way this round works.. Waiting to see your suggested decision and then I will take a decision. Cheers." (a) Level classification, h = 0.717. (b) Level-0 belief classification, h = 0.835.

Figure 13: "guess" (a) Level classification, h = 0.736. (b) Level-0 belief classification, h = 1.

Figure 14: "Considering it will calculate two thirds of the average.. wouldn’t 60 be highly possible? It looks ideal for me. Any suggestions?" (a) Level classification, h = 0.604. (b) Level-0 belief classification, h = 0.489.

Figure 15: "I guess a safe way would be to have one choose a low number lets say between 20 and 50 and the other between 50 and 80 .. your suggestions?" (a) Level classification, h = 0.529. (b) Level-0 belief classification, h = 0.705.

Figure 16: "I think it will be a rather low number, definitially below 50." (a) Level classification, h = 0.544. (b) Level-0 belief classification, h = 0.511.

Figure 17: "lets say everyone thinks the average will be 50 based on random probabilityand 2/3rds of that will be about 34hence they will mostly be choosing around 34 onlymaking the average 34and its 2/3rd to be around 21" (a) Level classification, h = 0.714. (b) Level-0 belief classification, h = 0.85.

Figure 18: "say most of the people will say around 40 seomthing because they think that will be the 2/3 of the average. People thing the average will be around 50 or 60 so they get 2/3 of that that will be around 40 but if everyone does then the 2/3 of 40 will be less si i say it will be around 32 or so." (a) Level classification, h = 0.705. (b) Level-0 belief classification, h = 0.546.

Figure 19: "because it is relatively close to the middle?" (a) Level classification, h = 0.589. (b) Level-0 belief classification, h = 0.914.

Figure 20: "it’s quite random, so just guess" (a) Level classification, h = 0.841. (b) Level-0 belief classification, h = 1.

Figure 21: "Well. i suppose mopst of the eld will choose big numbers so that average will go up to 60-70. 2/3 of this number would be in range of 40-45" (a) Level classification, h = 0.841. (b) Level-0 belief classification, h = 0.615.

Figure 22: "i just used 50 which is the average of 0 and 100." (a) Level classification, h = 0.815. (b) Level-0 belief classification, h = 0.599.

Figure 23: "Theory says that we should play 0, as everyone will always try to beat everyone else by goingtwo thirdsless until everyone reaches 0. Whether everyone will follow this reasoning I don’t know; it maybe better to bida bit higher since not everyone may choose 0." (a) Level classification, h = 0.887. (b) Level-0 belief classification, h = 0.716.

Figure 24: "the target will be 50, so i chose 75, of whose (2/3) would be 50" (a) Level classification, h = 0.528. (b) Level-0 belief classification, h = 0.596.
Figure 25: "Assuming that the overall total of the 6 teams is around 300, giving an average of about 50 iwould suggest 32 as a decision since some may tend above 50 while others below iti think the average will be around about 50 in which case we should submit an answer of about32 or 33" (a) Level classification, h = 0.629. (b) Level-0 belief classification, h = 0.95.

Figure 26: "I’m no mathematician so I don’t know what number to go for exactly which would give thegreatest chace*chanceso I’m just going with 50, unless you have a di erent tacticIf you’re not sure either I will just go with 50 still" (a) Level classification, h = 0.943. (b) Level-0 belief classification, h = 0.7.

Figure 27: "I would give a small number as I think most people will go with number over 50+. I would gowith 22" (a) Level classification, h = 0.752. (b) Level-0 belief classification, h = 0.514.

Figure 28: "I thought i’d go for 26as we’re aiming for 2/3rds of an average, which should be about 21if everyone aims for ’50’ or something around ’50’The remianing 5 (21+5) would account for any deviations in the groupI think it’s unlikely that anyone will choose numbers close to 0 or 100hence my decision" (a) Level classification, h = 0.548. (b) Level-0 belief classification, h = 0.8.

Figure 29: "As our target is two-thirds of the average, we should call for the number which is lower than 66.I think 40 is a quite safe line, but possibly be reduced to 30s?" (a) Level classification, h = 0.525. (b) Level-0 belief classification, h = 0.457.

Figure 30: "to be honest i have little directed thought about thisbut.say the average is 50, then two thirds of 50 is around mid-30si think?either way it’s my lucky number!wow sorry thats not more helpful.i’ll see what you’ve said though.. will probably make more sense!" (a) Level classification, h = 0.815. (b) Level-0 belief classification, h = 0.815.

Figure 31: "Between 35 and 40 because 2/3 of the average case number 50 would be 38.But it might be less if everyone is guessing lower" (a) Level classification, h = 0.661. (b) Level-0 belief classification, h = 0.816.

Figure 32: "test. whats Yours?" (a) Level classification, h = 0.615. (b) Level-0 belief classification, h = 1.

Figure 33: "I suggest 66, because there are 6 teams each of them choosing a random number from 0 to 100then the average is found by adding those and dividing the sum by 6.Since we don’t know the average in advnace let’s assume 2/3 from 66 is 44" (a) Level classification, h = 0.717. (b) Level-0 belief classification, h = 0.761.

Figure 34: "normally people will choose the number 10-90 I think but I not surejust use your rst reection to think about other people" (a) Level classification, h = 0.522. (b) Level-0 belief classification, h = 0.647.
Figure 35: "Hi there, I’ve just won last 2 test period’s questions, what I did was i wrote a few reasons andgive a numberand this guess answer is bit burther than the nal answerfurther i meanand i chose the one in the middlewell, not exactly in the middle, something like a little to the left and a little to the righthowever I won the last 2 rounds that i didn’t get anything, I guess I am not going to win this, lolI am just being pestmistici think 50 is ne, and i will choose one is between yours and mineit’s best if you chose something in the middle as wellas this is how I found for the last 2 roundsi mean won not found :PThis is how i won the last 2 roundshope this one will work. but this one is fully randomI think people like big numbers, and a few will chose 50%because rationally this is where it is highly possible if they are rationalbut i like big numbersbecause wining 10 pounds makes me feel putting big numbersthat’s how I feel, and decision is totally up to you:) take care" (a) Level classification, h = 0.481. (b) Level-0 belief classification, h = 0.641.

Figure 36: "i do not have a suggested decision as it is not based on any logic .. i cannot predict what theother teams will play and cannot gure out how to work out what would be the best decision tomake.if you have any better ideas, i am willing to take your adviceif you dont have any better ideas, then i guess we will just be lucky..or not.." (a) Level classification, h = 0.676. (b) Level-0 belief classification, h = 1.

Figure 37: "the problem is that we have to be BELOW the assumed average - which is tricky cos how shouldwe know? lets think about what’s going on in ppl’s minds: they’re all facing the same question,so i reckon they’ll go for ratherso lets try being below that as well and do something like 25 or maybe even 20 which would betwo thirds of their 30-40" (a) Level classification, h = 0.576. (b) Level-0 belief classification, h = 0.346.

Figure 38: "I reckon most people will assume an average of 50 so go for 36 or 37 (2/3 of 50) so I suggestwe go for 2/3 of their guesses, say 25" (a) Level classification, h = 0.775. (b) Level-0 belief classification, h = 0.85.

Figure 39: "Random." (a) Level classification, h = 0.792. (b) Level-0 belief classification, h = 1.

Figure 40: "I think the result depends on others decision" (a) Level classification, h = 0.44. (b) Level-0 belief classification, h = 0.95.

Figure 41: "i dont really get it. any suggestions" (a) Level classification, h = 0.696. (b) Level-0 belief classification, h = 1.

Figure 42: "the average number goes toward the 50 if the participants are so many.even if the number of pariticipants is just 16but i suggest this number." (a) Level classification, h = 0.647. (b) Level-0 belief classification, h = 0.719.

Figure 43: "Two thirds of 75 is 50, but I think other people will choose this, as this is an easy number.I do not think other people will think of 90 straight away." (a) Level classification, h = 0.516. (b) Level-0 belief classification, h = 0.568.

Figure 44: "66 is two thirds, many may go for this number, for the connection." (a) Level classification, h = 0.521. (b) Level-0 belief classification, h = 0.644.

Figure 45: "I think that the average would be 50 and 2/3 of 50 is about 32" (a) Level classification, h = 0.943. (b) Level-0 belief classification, h = 0.9.

Figure 46: "Hopefully the average should be about 50 because all the teams will choose a random numberand 35 is about 2/3 of 50" (a) Level classification, h = 0.943. (b) Level-0 belief classification, h = 0.83.

Figure 47: "i would go for something high as it is unlikely that lots of people will go for something very lowand so that means we more likely to get closer to the averagebut im not sure really! that would be my suggestion" (a) Level classification, h = 0.445. (b) Level-0 belief classification, h = 0.9.

Figure 48: "I don’t really know if there’s a structure to this, so am picking a number that is close to twothirds of the half of 100." (a) Level classification, h = 0.548. (b) Level-0 belief classification, h = 0.429.

Figure 49: "88 is the year I was born and my lucky number. So this is my suggested decision" (a) Level classification, h = 0.943. (b) Level-0 belief classification, h = 0.767.
Figure 50: "I don’t understand! :-S" (a) Level classification, h = 0.772. (b) Level-0 belief classification, h = 1.

Figure 51: "people may just go for half of one hundred because its like half way.numbers like 67 or 34 areless likely to be chosenmaybe?" (a) Level classification, h = 0.687. (b) Level-0 belief classification, h = 0.776.

Figure 52: "57" (a) Level classification, h = 0.713. (b) Level-0 belief classification, h = 0.746.

Figure 53: "what do you think?" (a) Level classification, h = 0.679. (b) Level-0 belief classification, h = 1.

Figure 54: "well i guess the average will be about 50-60, so this is roughly down to two thirds, what do yousay? :)" (a) Level classification, h = 0.753. (b) Level-0 belief classification, h = 0.62.

Figure 55: "I think it is more possilbe around 30, and it should be less than 50 I think." (a) Level classification, h = 0.596. (b) Level-0 belief classification, h = 0.405.

Figure 56: "I think there’s got to be at least 8 or 9 teams, so I reckon thats got to come to at least 500 intotal, the average of which would be about 55-60, which gives 40 as two thirds of it.So I say we go for between 40 and 50" (a) Level classification, h = 0.785. (b) Level-0 belief classification, h = 0.761.

Figure 57: "it is better to take a high number in order to have a reduced choice for the nal decision" (a) Level classification, h = 0.453. (b) Level-0 belief classification, h = 0.885.

Figure 58: "I think everyone will aim for a two-thirds number of less than 50, so the target number of whateveryone actually picks will be much lower! I think it’ll be less than 30. As everyone will bepicking low numbers to allow for this. so i think 16" (a) Level classification, h = 0.578. (b) Level-0 belief classification, h = 0.495.

Figure 59: "Average should be around 50 but I can’t tell if it actually will be, 2/3s of 50 is about 34. So Ithink 34 is a good startI’m going with 34" (a) Level classification, h = 0.847. (b) Level-0 belief classification, h = 1.

Figure 60: "If the answer is going to be two thirds of the average, it cannot be above 66 as this is two thirdsof 99I think most people will work that out and go for something around 50 ish, so I think 30-40 willget us in the right area" (a) Level classification, h = 0.486. (b) Level-0 belief classification, h = 0.498.

Figure 61: "provided that a number of people will go for answer of 2/3s of a 100 (66) then we should go forananswer that is 2/3’s of that (44), however, i suggest we scale the number down slightly if otherteams dothe same as ourselves, therefore I suggest 38. I am open to your theory" (a) Level classification, h = 0.633. (b) Level-0 belief classification, h = 0.636.

Figure 62: "i think a number around this number will possible give us the right ans as most people willchoose above 50" (a) Level classification, h = 0.662. (b) Level-0 belief classification, h = 0.441.

Figure 63: "i guess if most people assume the teams average will be 50 then they will expect (2/3) to beaprox 33 and thiss will be their guesswhich would mean our guess should be 22however if other people assume the same then (2/3) average will be more like 7want to go with somewhere in the middle.# 12?" (a) Level classification, h = 0.547. (b) Level-0 belief classification, h = 0.85.

Figure 64: "I worked on an average of 50 and then looked at two thirds and added on a bit for luck!" (a) Level classification, h = 0.87. (b) Level-0 belief classification, h = 0.835.
Figure 65: "most people willchoose a number under 50, but not too low, so I reckon it might be somewherein the region of 26:0 not entirely certain though!" (a) Level classification, h = 0.668. (b) Level-0 belief classification, h = 0.399.

Figure 66: "The larger the number, the more likely we can predict the answer later." (a) Level classification, h = 0.477. (b) Level-0 belief classification, h = 0.95.

Figure 67: "i think that the average number will be from 50 to 60, therefore 2/3 of this will be around 32 to40. so i chose 35" (a) Level classification, h = 0.887. (b) Level-0 belief classification, h = 0.706.

Figure 68: "I’m assuming everyone will choose a large number.If we choose very large numbers, i.e 90 and above, it will severely increase the overall averagemaking it less likely that the overall average can be below 40, at least.Hopefully, this works in practice as well as in theory." (a) Level classification, h = 0.558. (b) Level-0 belief classification, h = 0.599.

Figure 69: "i think most people will pick a number around 2/3rds of 100, but if this is so, 2/3rds of 60 willbe around 40.but then again, if everyone is thinkin glike this then the average will be rougly 30, so im notsure! but im going to go with 40." (a) Level classification, h = 0.394. (b) Level-0 belief classification, h = 0.5.

Figure 70: "lets all pic 0!! if we pick 0 the averge is 0, our targer number is 0, and 2/3rds of 0 is 0!! wecannot lose!!!!!think about it keep it simeple. go 0!!!!" (a) Level classification, h = 0.337. (b) Level-0 belief classification, h = 0.646.

Figure 71: "if people are varied from 0 to 100, the target number would be around 30,That answer would be the best way to be logical because we cannot guratee all the other’s numberIf you have any prefency that they would prefer big number, that will be higher than 33, on else,will be below 30" (a) Level classification, h = 0.43. (b) Level-0 belief classification, h = 0.439.

Figure 72: "I reckon the average will be players saying 66/100 approx due to it being two thirds.I believe we should say two thirds of that number which would range between 40 and 50The lower range will probably be a wiser move as a few others will estimate lower.Personally i would be happy with anything between 38-42." (a) Level classification, h = 0.499. (b) Level-0 belief classification, h = 0.733.

Figure 73: "Every tries to guess the two-thirds of the average.If every one choose randomly from 0 -100 then the average would be 50then the two-thirds would be about 33so I would guess 33I am a Mathematican" (a) Level classification, h = 0.73. (b) Level-0 belief classification, h = 0.766.

Figure 74: "Perhaps around 30 would be the safest bet?Because it’s going to be 2/3 of the average. just an idea. i dont really have any mathimaticalcalculationsaround the suggestion.If you do.. I will follow your lead." (a) Level classification, h = 0.679. (b) Level-0 belief classification, h = 0.41.

Figure 75: "I think that most team players will say a number under 50 because every team contributes onenumber from the average of all teams 2/3 are taken. So everyone wil think that it is most likelyto be something under 50 and even less because whatever number it is 1/3 will fall away.I think it will be 18 because the average will be around 25 to 27 because everyone tries to saysomething already thinking that it will be subtracted by 1/3" (a) Level classification, h = 0.424. (b) Level-0 belief classification, h = 0.578.
Figure 76: "I assume that most teams will choose low initial numbers, as they will be aiming to undercutthe other teams, being closer to the 2/3 target pointBy choosing 20, i think that it will be high to be above the initial low numbers, whilst not beingso astronomically high as to rule us out entirely." (a) Level classification, h = 0.294. (b) Level-0 belief classification, h = 0.597.

Figure 77: "I assume hat the average of all teams will be around 40, because most people tend to choosesmaller numbers(don’t know why)So, I estimate that 23 is verly likely to be the closest to 2/3, but this is just a suugestion, ofcourse" (a) Level classification, h = 0.814. (b) Level-0 belief classification, h = 0.865.

Figure 78: "I’m not entirely sure what sort of objective we are looking for, but as 66 is two thirds from 0 to100i suggest we choose this numberand again it’s mere luck or probability which i’m not sure how to calculate." (a) Level classification, h = 0.661. (b) Level-0 belief classification, h = 0.474.

Figure 79: "its the highest number, and i would expect its the most common one to pick." (a) Level classification, h = 0.443. (b) Level-0 belief classification, h = 0.781.

Figure 80: "i just picked it because if about 50, the mid way point, is the average then 30 is about 2/3 ofthe average" (a) Level classification, h = 0.903. (b) Level-0 belief classification, h = 0.95.

Figure 81: "Statistically, 50 would be the average and therefore two thirs of this would be roughly 33, how-ever i think, especially in the rst round, people will be inlined to enter high numbers. So i thinkthat two thirs of the averageIf we both put in 100 and then estiamte somwehere near 60 i think we may have a better chance.Perhaps not as high as 60 but you get my reasoning." (a) Level classification, h = 0.508. (b) Level-0 belief classification, h = 0.369.

Figure 82: "If we put in 100, we’ll know that the average is going to be slightly higher than what it would beotherwise (as the average would probably settle at around 50) I think a few people will havethis idea (as there’s a lot2/3rds of the average will be around 50, and the actual average being 60something." (a) Level classification, h = 0.5. (b) Level-0 belief classification, h = 0.559.

A.2. Instructions for Workers on AMT

A.3. Qualification Test