Gamifying Natural Language Acquisition

DEGREE PROJECT IN TECHNOLOGY, FIRST CYCLE, 15 CREDITS
STOCKHOLM, SWEDEN 2016
A quantitative study on Swedish antonyms while
examining the effects of consensus driven
rewards
LISA LUND
PATRICK O’REGAN
KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION
Informationsinsamling om naturligt språk genom gamifiering: En kvantativ studie om svenska antonymer och undersökning av konsensusdrivna belöningar

Lisa Lund and Patrick O’Regan
Degree Project in Computer Science, DD143X
Supervisor: Michael Minock
Examiner: Örjan Ekeberg
May 11, 2016

Abstract

Little research has been done on antonymic relations, and a great deal of what exists has been done by the linguists Paradis et al. Gamification was used in natural language acquisition by Bos and Nissim in their 2015 study about noun-noun compound relations, but gamification of information retrieval remains a relatively new field of study. This thesis reproduced work done by Paradis et al. in an attempt to answer the following questions for Swedish antonyms: will reversing word order in antonymic relations affect the strength of said word pair? Will the perceived strength of canonical antonyms have a lower variance than that of non-canonical antonyms? It also examines whether giving points depending on the agreement with other users reduces the occurrence of extreme points on an ordinal scale. Two parallel studies were conducted, one using a web app which implemented consensus driven rewards, and another utilising a questionnaire. Reversing the order of the words did not alter the perceived strength of the antonymic pair, which is consistent with the results acquired by Paradis et al. in 2009. Results regarding variance of canonical and non-canonical antonym pairs were inconclusive. An implementation with consensus driven rewards yielded more extreme values than the questionnaire. More research is suggested to improve the strength of the results.

Referat

Little research has been done on antonyms within natural language acquisition, and much of the research that exists on antonyms has been done by Paradis et al. within linguistics. Gamification has been used in natural language acquisition, for instance by Bos and Nissim in their 2015 study on relations in noun compounds.

This report attempts to answer the following questions about Swedish antonyms: does the order of the words in an antonym pair affect how the pair's antonymic strength is perceived? Will the perceived strength of canonical antonym pairs have a lower variance than that of their non-canonical counterparts? The report also examines whether consensus-driven scoring affects the occurrence of extreme values on an ordinal scale. Two parallel sub-studies were carried out: a web app that implemented consensus-driven scoring, and a questionnaire without scoring. The order of the words in an antonym pair had no significant effect on its perceived strength, in agreement with the 2009 study by Paradis et al. The results regarding the variance of canonical antonyms were not conclusive. The implementation with consensus-based scoring yielded more extreme values than the questionnaire. Since this was a small study, further investigation is needed to strengthen the results.

Table of contents

1 Introduction
1.1 Problem Statement
1.2 Scope
1.3 Outlines
2 Background
2.1 Antonyms
2.2 Gamification
2.3 Similar work
3 Method
3.1 Dataset
3.2 Web app
3.2.1 Interface
3.2.2 Gamification
3.3 Questionnaire
3.4 Collection method
3.5 Statistical methods
3.5.1 Wilcoxon signed-rank test
3.5.2 Levene’s test
3.5.3 χ² test
4 Results
4.1 Primary results
4.1.1 Wilcoxon signed-rank test
4.1.2 Levene’s test
4.1.3 χ² test
5 Discussion and Conclusion
5.1 Discussion
5.2 Conclusion
References
Appendices
Appendix A: Key words and expressions
Antonym
Natural language
Appendix B: Screenshots
Appendix C: Code
Views.py
Models.py
Forms.py
Populate_db.py
Get_data.py
Appendix D: Complete list of word pairs

1 Introduction

In linguistic literature there is a unanimous idea that contrast is fundamental to our cognitive reasoning
[1]​
. In 1969 Fillenbaum, professor in psycholinguistics, conducted an experiment supporting the view that synonyms and antonyms are both stored in memory as the same complex entities ​
[2]​
. In spite of this, there is a surprising gap in the volume of research done on antonyms as opposed to synonyms: searching Google Scholar for “synonyms” returns over 390,000 hits, whereas a search for “antonyms” returns fewer than 35,000. A large portion of the research on antonyms has been conducted in Lund by Paradis et al., a group of professors in linguistics. In 2009 they wrote a paper on English antonyms,
Good and bad opposites ­ Using textual and experimental techniques to measure antonym canonicity, where they combined corpus methodology with experimental methods to gain insights into antonymy as a lexico­semantic relation ​
[3]​
. We decided to reproduce one of these experiments on Swedish antonyms and on a smaller set of word pairs. Figure 1.1 below is a screenshot from the experiment we decided to reproduce.

Figure 1.1

In this thesis we have conducted a small-scale study where we ask people to describe the strength of antonym pairs on a scale from one to five. We have created a web application (henceforth referred to as web app) where people answer questions and are subsequently awarded points. The idea to implement a reward system was mainly inspired by a study on noun-noun compound relations conducted in 2015 by Bos and Nissim
[4]​
. Gamification was utilised and points were awarded to players according to what other players had answered before. To see if a reward system would have an effect on the results, we made a survey powered by Google Forms that we ran in parallel with our web app.

1.1 Problem Statement

We reproduced a subset of the work done by Paradis and Willners. Data was gathered from two sources simultaneously: a web app where we implemented a consensus driven score system, and a survey powered by Google Forms. Based on this data the problem statements are:

● Will our results correlate with the results of the original work done by Paradis and Willners in their paper Good and Bad Opposites? This problem statement can be broken down into two parts:
  ○ Will reversing the word order in an antonym pair have a significant effect on the perceived strength of the pairing? In Good and Bad Opposites no significant effect was found.
  ○ Will the perceived strength of canonical antonyms have a significantly smaller variance than other antonyms? In Good and Bad Opposites Paradis et al. found that canonical antonyms had a significantly smaller variance.
  The hypothesis is that our results will agree with those of Paradis and Willners.
● Will implementing a reward system based on the relative distance from the mean value significantly reduce the occurrence of extreme values on an ordinal scale? Our hypothesis is that the low probability of an extreme mean value may significantly deter players from selecting these options.

1.2 Scope

When choosing the number of word pairs to include in our study we had to take into consideration how many users our web app would have. We estimated that in order to get sufficient data we should have no more than 50 word pairs in our web app; this was based on an expectation of 500 answers distributed over all word pairs. That would in turn limit the conclusions we could draw, since we worked on a small cluster of word pairs and some results on our data set would not be transferable to all antonym sets.

Time was a limiting factor affecting both the quality of our web app and the amount of data we were able to gather from it. These variables were also in conflict with each other, since the more time we spent on creating our web app the less time we would have to gather data from its users. This mainly affected the number of gamifying elements of our web app, which we limited to a point system, a counter counting down towards the end, and a high score list.

1.3 Outlines

This thesis is divided into five sections, with the introduction above as the first. The second section introduces antonymy and gamification, and presents some previous works on these, as well as related works. The third section introduces the dataset used in our study, the data collection methods and the statistical methods used on the gathered data. The fourth section presents the results acquired through statistical analysis, and section five analyses and discusses these results as well as limitations of the study. The fifth section also presents a conclusion.

2 Background

The first part of this section introduces the properties of antonyms, based on previous works on the subject. A definition of antonyms can be found in appendix A. The second part of this section introduces gamification as a concept and some previous works using it to acquire natural language data. The third and final part of this section discusses a similar test to the one conducted in this thesis.

2.1 Antonyms

In 1986 Herrmann et al. argued that antonyms can be qualitatively defined through four elements
[5]​
. The first element is how well-defined the dimension of the pair of antonyms is. For example, narrow-wide is more explicit in its dimension than narrow-large: the dimension in the first example relates to the width of an object, while the dimension of the second relies on how one interprets the dimension of large. The hypothesis is that the clearer the dimension, the stronger the antonymic relationship.

The second element builds in many ways on the first, and concerns whether the antonym’s dimension is denotative or connotative. For example, the dimension of narrow-wide is denotatively about the width of an object, whereas the dimension of obnoxious-lovable is connotatively about emotions regarding someone or something. The hypothesis is that the dimension of a good antonym should be denotative rather than connotative.

The third and arguably most important element states that the words’ positions along their dimension axis should be on opposite sides of the mid-point. If we compare good-bad with good-great we see the relevance of this element. The hypothesis here is that word pairs on opposite sides of their dimensional mid-point are stronger than word pairs on the same side.

The last element considers the words’ distance from the mid-point. For example, small-colossal has a more asymmetric distance from the mid-point than small-big. The hypothesis is that the distances from the mid-point should be of equal magnitude.

An important distinction in antonyms is that while they are straightforward in adjectives and adverbs, where there exist canonical antonyms, they are very abstract in other parts of speech. An example of such a word class is nouns. Nouns like
happiness ​
and ​
comfort ​
can arguably have antonyms, either because ​
happiness ​
is derived from the adjective ​
happy ​
or because an antonym of sorts can be created by adding a prefix to create words like ​
discomfort.​
That said, far from all nouns have any sort of antonym. For instance, the noun ​
chair ​
cannot be considered to have an antonymic relation to any noun, because what would the antonym to chair ​
be? An interesting property of antonym relations is that it is usually not a symmetric relation. The best antonym of ​
mediocre​
for example is ​
good ​
whereas the best antonym for ​
good ​
is ​
bad.​
This property is described in ​
figure 2.1​
below.

Figure 2.1 [6]

There are three ways of finding antonyms of a word. In 1991 Justeson and Katz showed that lexical associations between antonyms are formed via co-occurrence in sentences such as “let’s go up
early​
rather than ​
late”​
and “A fight between ​
good ​
and ​
evil”​
​
[7]​
. Therefore one can use this pattern to search a corpus for antonyms. Through this method, one will still have to make a quantitative or qualitative judgement whether the words are antonyms or not, since several non­antonyms co­occur as well. Another way to find antonyms is through the underlying patterns between synonyms and antonyms suggested by the lexical concept of WordNet ​
[8]​
. The idea is that because synonyms generally mean more or less the same thing as the original word, synonyms of antonym pairs tend to be antonyms, but that is not always the case, and often the synonym pairings are “worse” antonyms than the original antonyms, even if these were canonical ​
[3]​
. This can be explained in the difference in dimension and their position on the dimensional axis (mentioned above) among synonym pairs, which DiMarco et al. discussed in their paper about differentiation between synonyms and near­synonyms ​
[9]​
. These differences would mean a “dimensional mismatch” when synonyms of canonical antonyms are paired as antonyms. For example, the word pair hot-cold is considered to be antonymic. If we pick a synonym of each, such as warm for hot and cool for cold, we see that the word pair warm-cool also forms an antonymic relationship. However, this method still requires a quantitative or qualitative judgement, since
spicy​
could very well have been chosen as the synonym to ​
hot​
and ​
frozen​
to ​
cold​
, and the word pair ​
spicy­frozen ​
does not create an antonymic relationship.

2.2 Gamification

The term gamification was first mentioned in 2003 and became widely used in the literature in 2010. In the paper ‘From game design elements to gamefulness: defining "gamification"’, Deterding et al. define gamification as the use of game design elements in non-game contexts [10]. This relates to the idea that game mechanics such as high scores, levels and achievements can motivate us to perform better in non-gaming related areas. Deterding et al. clarify that there exist several definitions of the term within the game industry, and that some consider the above definition an oversimplification [10].

According to Dale [11], some of the most common elements of gamification in 2014 were:

- Achievements like experience points and levels
- Prompted exercises like challenges
- Synchronisation through high scores and collaborations
- Transparent representation of results, e.g. through use of progress bars
- Time limits and speed bonuses
- Lotteries, random achievements and other luck-based elements

However, despite these trends, Fitz-Walter made a compelling argument in his thesis about effective gamification that the trends are not conclusive due to how new gamification is as a formal method [12]. Fitz-Walter also mentions how gamification elements may conflict with the original purpose of the task in several ways. One example of this is users feeling forced to play a game they may not want to participate in and therefore being deterred from completing the task. Another example is having too many playful elements, which may increase task completion time rather than decrease it [12]. In short, gamification is very new as a method of improving efficiency, and making it work well is a balancing act. The same study also showed that player enjoyment may wane once the player learns the game mechanics, so that the player finds the task just as repetitive, boring or otherwise unmotivating as before gamification was implemented.

A use of gamification in natural language acquisition similar to the one in this thesis was applied by Bos and Nissim in a study regarding noun-noun compound relations in English [4]. This study gave players points depending on how close their answer was to the general consensus, giving more points the more the players “agreed” with their peers. The main positive factor in giving players scores is that it is likely to deter trolls (people on the internet who deliberately cause disruption online for the purpose of amusement) by giving low scores to disruptive answers which diverge from the general consensus.

2.3 Similar work

A similar project to Paradis et al.’s study of antonymy and the study conducted in this thesis is Folkets synonymlexikon
which was initiated in 2005 by Viggo Kann ​
[13]​
. It uses user consensus to determine what a good synonym is, and thus which word pairs should be included in the dictionary. That raises the question of whether anyone being able to contribute to a dictionary lowers said dictionary’s quality. Kann addresses this question in an article in the Swedish language magazine ​
Språktidningen [14]. In the article he argues that anyone being allowed to contribute to Folkets synonymlexikon does not lower the quality, because a certain level of consensus is required for an entry to be accepted. Kann argues that this by extension implies that it is the “common people’s” opinion that is represented rather than a linguist’s or lexicographer’s, and that this is a good thing as the dictionary is made to be used by the people. Figure 2.2 below is a screenshot from the study.

Figure 2.2

3 Method

The first part of this section will cover the set of word pairs used in our study. The second part will cover the web app that was created, both the interface of the app and the gamification elements used. The third part will describe the questionnaire used in parallel with the web app, and the fourth part will cover the collection method itself. The fifth and final part will cover the statistical methods used to process the raw data, namely the Wilcoxon signed-rank test, Levene’s test and the χ² test.

The first thing that had to be decided on was how to implement the web app. Starting out we had no prior knowledge of web development and had to decide both what language to write the web app in and whether to use a web framework for the development process. Limited time argued for using a web framework to streamline as much of the process as possible. Prior experience with Python made Python with Django seem like a sensible choice.

3.1 Dataset

As mentioned in section 1.2, we expected a limited number of participants in our study and decided to limit the number of pairs to examine. We decided that 50 word pairs was a sufficiently small sample to get a significant number of test results on. We based our sample on the lexical concept of WordNet, picking the canonical word pair stor-liten
and then selecting 4 random synonyms for each of these words. We decided on the words in table 3.1.

stor        liten
enorm       ynklig
bred        futtig
tung        smal
kolossal    kort

Table 3.1

We chose stor and liten because they have several synonyms that would let us look at some of the elements used to describe the strength of an antonymic relation. For example we would be able to observe the effect of varying dimensions, such as whether bred-smal scores higher than bred-kort. The more subjective elements would be harder to examine. For example, the relative distance to the mid-point is often subjective: is liten closer to normalstor than futtig?

Since we needed to look at whether the order of the words in a pair had any effect on the score of the pair, we generated every combination of a word from one list with a word from the other, in both orders, resulting in two lists of 25 word pairs with one ordering of each pair in each list. All of the word pairs can be found in Appendix D, and the Python code for how this was done can be found in populate_db.py in Appendix C. One list of 25 pairs was then presented to half the participants and the other list to the other half through our web app, where they were asked to rate the strength of each pair on a scale from 1 to 5.

3.2 Web app

3.2.1 Interface

Much of how we chose to implement the user interface of the web app stems from what was deemed reasonable for how long the implementation would take, and what our skills would allow. We chose to keep the interface as simple as possible, and to demand as little as possible from our users. For instance, users were not required to enter an email address when creating an account, as its only real use would have been to reset forgotten passwords. As most users were not expected to return to the site, it was omitted to make signing up easier.

We chose not to edit the theme of the page very heavily, relying on a “classical” theme with a bar at the top which contains the user’s most important actions and information. These consisted of log on and log off options, username and score. We also chose to have a high score list present in a box at all times, in an attempt to motivate the user to get a high score and get on the list. A third feature we decided to have was a counter that incremented as the user answered questions, showing how far they had left until the “finish line”. This type of layout was chosen because it was easy to achieve and didn’t make the app appear cluttered. Keeping the user interface simple is key according to several sources on user interface design, such as Usability.gov
[15]​
. We decided not to include explanatory texts to the answers beyond the 5­point scale ranging from ​
Awful​
to ​
Very good ​
(fig. 3.1); there were several reasons for this. The first, and perhaps most important is that the convention for user interface design states that an overabundance of information may be counterproductive and make the user experience worse ​
[16]​
, in combination with it being quite difficult to describe the scale in a few concise sentences. The second reason for not including descriptions is that previous studies of this kind did not add explanatory texts, as can be seen in figures 1.1 and 2.2. Though neither Paradis et al. nor Viggo Kann discussed this decision, we decided that following their lead was a good idea. Figure 3.1 below is a screenshot of how a question is presented in the web app; the high score list can be seen to the left in the figure.

Fig. 3.1: A close-up of the game part of the app. The scale consists of five points which translate to
Awful, Bad, Ok, Pretty good ​
and​
Very good. ​
A more thorough description can be found in Appendix B (Fig. B.4). Screenshots of the pages in the web app are available in Appendix B.

3.2.2 Gamification

There were many gamification elements the web app could have benefited from: achievements, progress bars and speed bonuses, to name a few. Many of these would have been interesting to implement and study, but the scope of our work limited the number of gamification elements we could implement. Since we had already implemented Johan Bos’s score system to study its effect on ordinal data, a high score list was a natural gamification element to add. The high score list can be seen in figure 3.1.

All we knew of the point system in Bos and Nissim’s implementation was that points were awarded based on the relative agreement with other players. Therefore we had to implement our own score system to achieve this. The first question was whether to use the mean or the median to define the relative agreement on a word pair. The median is a good substitute for the mean on large ordinal scales with an asymmetric distribution, where outliers have a strong impact on the mean and can make it misleading. However, since we had a small ordinal scale, integers in the range one to five, the mean value would not be heavily affected by outliers and would not be misleading. This is why we chose to use the mean value rather than the median to define relative agreement.

We then had to decide on the number of points awarded depending on the player’s distance from the mean value of a word pair. First we implemented a linear system according to the equation below

$$s = 4 - |m - p|$$

where
$m$ is the mean value of the word pair,
$p$ is the value selected by the player,
$s$ is the score awarded to the player.

Here $p$ ranges from one, if selecting usel, to five, if selecting jättebra. Therefore the score varies from zero (i.e. the player selects one and the mean value is five) to four (when the selected value equals the mean value). After trying this out we realized that players could still secure a minimum score of two by selecting the middle option, and that the reward for selecting the extreme values might not be enough to deter this ‘tactic’. Therefore we decided to scale the awarded points non-linearly by cubing the linear score, according to the equation below.

$$s = (4 - |m - p|)^3$$

Now the points would scale through 0, 1, 8, 27, 64, and selecting the middle options would not be as rewarding in relation to accurately selecting the mean value. This might not be enough, however, which is why we implemented the alternate study through Google Forms to see if rewarding players based on their relation to the mean value causes them to avoid the extreme values. In the end the score was multiplied by a hundred, resulting in the final equation below.

$$s = 100 \cdot (4 - |m - p|)^3$$

3.3 Questionnaire

We decided to use Google Forms to cross-check the data from the web app through a questionnaire. Google Forms is easy to use and enabled us to export collected data to a spreadsheet where we could make numerical interpretations of our data. The words used for the questionnaire were a randomly chosen subset of 15 word pairs from the initial set of 50 word pairs. The pairs selected for the questionnaire are available in table 3.2.

bred - futtig      kolossal - ynklig   kort - kolossal
enorm - liten      stor - futtig       liten - enorm
bred - liten       enorm - smal        kort - stor
bred - kort        tung - smal         futtig - bred
smal - bred        bred - smal         liten - tung

Table 3.2

3.4 Collection method

After both the web app and the Google Form were up and running we needed participants in order to gather data. We therefore spread their web addresses both through social media channels and by asking people we know to use and spread the web app and form.
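The pair generation described in section 3.1 and the scoring rule described in section 3.2.2 can be sketched as follows. This is a minimal illustration rather than the code in Appendix C: the function names `build_pair_lists` and `score`, the constant names, and the assignment of orderings to the two lists are assumptions made for the example. The scoring function takes the final rule to be the linear agreement `4 - |m - p|` cubed and multiplied by a hundred, which matches the 0, 1, 8, 27, 64 progression given in the text.

```python
from itertools import product

# Word lists from Table 3.1 (synonyms of "stor" and of "liten").
LARGE_WORDS = ["stor", "enorm", "bred", "tung", "kolossal"]
SMALL_WORDS = ["liten", "ynklig", "futtig", "smal", "kort"]


def build_pair_lists(first, second):
    """Return every cross-list pair twice: once in each word order.

    Assumption: one list holds all first-second orderings and the other
    holds the reversed orderings. The thesis only states that each list
    contains one ordering of every pair.
    """
    forward = list(product(first, second))
    reverse = [(b, a) for a, b in forward]
    return forward, reverse


def score(mean_value, selected, scale_max=5, multiplier=100):
    """Consensus score for one answer on a 1..scale_max ordinal scale.

    mean_value: mean of previous answers for the word pair (m)
    selected:   value chosen by the player (p)
    """
    if not 1 <= selected <= scale_max:
        raise ValueError("selected value must lie on the ordinal scale")
    # Linear agreement: 4 when the answer equals the mean, 0 at distance 4.
    agreement = (scale_max - 1) - abs(mean_value - selected)
    # Cubing rewards close agreement far more than middling agreement.
    return multiplier * agreement ** 3


forward, reverse = build_pair_lists(LARGE_WORDS, SMALL_WORDS)
print(len(forward), len(reverse))
print(score(4.9, 5), score(4.9, 3))
```

With a mean close to five, always choosing the middle option earns only a small fraction of the points available for matching the consensus, which is the effect the cubic scaling was meant to produce.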
3.5 Statistical methods

3.5.1 Wilcoxon signed-rank test

To study the difference in antonymic strength depending on word order, a Wilcoxon signed-rank test was used, since the underlying population cannot be assumed to be normally distributed. The null hypothesis is that the median of the differences between the two samples (word1-word2 and word2-word1) is zero. The Wilcoxon signed-rank test is calculated according to the algorithm below.

Let
$N$ be the number of pairs of word pairs,
$x_{1,i}$ the mean value of the $i$:th word pair in the first order,
$x_{2,i}$ the mean value of the $i$:th word pair in the second order.

Sort $x_{1,i}$ and $x_{2,i}$ with respect to $|x_{2,i} - x_{1,i}|$ and index the sorted lists by rank $R_i$. Then

$$W^{+} = \sum_{i} R_i \quad \text{for every } i \text{ where } x_{2,i} - x_{1,i} > 0$$

$$W^{-} = \sum_{i} R_i \quad \text{for every } i \text{ where } x_{2,i} - x_{1,i} < 0$$

$$W = \min(W^{+}, W^{-})$$

If $W \le W_{\text{crit}}(\alpha, N)$, where $\alpha$ is the required significance level, we can reject the null hypothesis.

We used the wilcoxon function in the Python package scipy.stats to perform the test.

3.5.2 Levene’s test

To study the difference in variance of results between canonical and non-canonical antonyms, Levene’s test for equality of variances was used, since the underlying population cannot be assumed to be normally distributed. The null hypothesis is that the two sample groups (canonical and non-canonical antonym pairs) have the same variance. Levene’s test is calculated according to the formula below:

$$W = \frac{N - k}{k - 1} \cdot \frac{\sum_{i=1}^{k} N_i \left(\bar{Z}_{i\cdot} - \bar{Z}_{\cdot\cdot}\right)^2}{\sum_{i=1}^{k} \sum_{j=1}^{N_i} \left(Z_{ij} - \bar{Z}_{i\cdot}\right)^2}$$

Where:
$W$ is the result of the test,
$k$ is the number of sample groups being compared (in our case 2),
$N$ is the total number of samples in all groups,
$N_i$ is the number of samples in group $i$,
$Y_{ij}$ is the $j$:th value in the $i$:th group,
$Z_{ij} = |Y_{ij} - \bar{Y}_{i\cdot}|$, where $\bar{Y}_{i\cdot}$ is the mean of the $i$:th group,
$\bar{Z}_{i\cdot}$ is the mean of the $Z_{ij}$ in group $i$ and $\bar{Z}_{\cdot\cdot}$ is the mean of all $Z_{ij}$.

If $W > F(\alpha, k - 1, N - k)$, where $\alpha$ is the required significance level, we can reject the null hypothesis.

We used the levene function in the Python package scipy.stats to perform the test.

3.5.3 χ² test

To study the likelihood that any observed differences between the web app and the Google Form arose by chance, the χ² test was used. The null hypothesis is that the occurrence of extreme values (usel and jättebra) has an equal distribution in both samples (web app and Google Form).

We counted the total number of extreme values in both samples (only comparing pairs from the web app that had been cross-studied in the Google Form) and the total number of non-extreme values. Then a χ² test was performed to see if the distribution of extreme to non-extreme values was dependent on whether the person was using the web app or Google Forms.

We used the chi2_contingency function in the Python package scipy.stats to perform the test.

4 Results

4.1 Primary results

4.1.1 Wilcoxon signed-rank test

The test statistic did not fall within the rejection region, so we cannot reject the null hypothesis that there is no difference depending on the order of the words at a significance level of 10%. The p-value of the test was 0.326, meaning that if word order had no effect, a difference at least as large as the one observed would occur 32.6% of the time.

4.1.2
Levene’s test

Since we could not reject the null hypothesis that there is no significant difference depending on the order of the words in a word pair, we chose to only examine one of the word orders (25 pairs) with Levene’s test. We compared the variances of every combination of word pairs, giving a total of 25 test results for every word pair. We then calculated the average p-value for every word pair. Two word pairs, stor-liten and ynklig-kolossal, had a variance that deviated significantly from the rest on average. In the case of stor-liten this was (in accordance with our hypothesis) because it had a significantly lower variance (0.13). In the case of ynklig-kolossal it was because the variance was significantly higher (2.12).

4.1.3 χ² test

The word pairs in table 3.2 were compared in both the Google Form and the web app. Their respective distributions of extreme values and non-extreme values can be seen in table 4.1.

                            Google Forms    Web App
Extreme values (1, 5)            497           208
Non-extreme values (2-4)         864           324
Total                           1361           532

Table 4.1

Surprisingly (in regard to our hypothesis), the web app had a greater ratio of extreme values to non-extreme values. Still, the test returned a p-value of 0.3216: if the two samples shared the same distribution, a difference in ratios at least this large would occur about 32% of the time. Therefore we cannot reject the null hypothesis that there is no difference between the two samples.

5 Discussion and Conclusion

5.1 Discussion

Levene’s test showed that the pair
stor­liten ​
had a significantly smaller variance than the other pairs on average. This is consistent with the findings of Paradis et al. since
stor­liten ​
scored exceptionally well (the mean value was 4.9) and can be considered a canonical antonym pair. On the other hand, we noticed a trend that antonym pairs that were either strong or weak tended to have a lower variance, while the average antonym pairs tended to have a high variance. This pattern is illustrated in figure 5.1 below, where the standard deviation of the pairs is plotted against their respective mean value. A polynomial trend line was also plotted.

Figure 5.1

We believe this problem is inherent to using a finite ordinal scale. Antonym relations that are neither particularly strong nor particularly weak can have a score fluctuating in both directions depending on the opinion of the participant, creating a large variance. On the other hand, if the antonym relation is particularly strong or weak, the endpoints of the scale will limit the fluctuation in one direction, lowering the standard deviation. This confounding variable makes it impossible to tell if people agree more on bad or good antonyms or if it is an effect caused by the finite ordinal scale. To answer this part of our problem statement we would suggest an alternate experimental method, e.g. one where users insert the best antonym to a word themselves.

Another interesting pattern regarding the dimensions of an antonym relation was that pairs with clearly defined dimensions tended to score either well (
liten­stor, smal­bred)​
or bad (​
kort­tung, kort­bred​
) whereas pairs where at least one word was less clear in its dimension tended to end up with a more central mean value (​
ynklig­stor, futtig­enorm).​
We believe a possible reason for this is that the dimension of words like futtig and ynklig is subjective and open to interpretation, so their antonymic relations will be either stronger or weaker depending on the participant’s interpretation, giving them a high variance and a mean value near the middle of the scale.

The final observation made from the results is that not only do the two canonical antonyms (
stor­liten, bred­smal) ​
acquire a higher score (which could be expected), they are also the only symmetrical antonyms. That is, they were each other’s best antonyms. This is illustrated in figure 5.2 as a bipartite graph, where the directed edges represent the best antonym for each word (going from word1 to word2 in our word1-word2 dataset). In this figure the two pairs which can be considered canonical (
smal ­ bred ​
and ​
liten ­ stor,​
respectively) are the only ones whose vertices are connected by edges in both directions. For instance, the best antonym to 
kort ​
is ​
kolossal ​
according to our results, but the best antonym to ​
kolossal ​
is ​
liten.​
Further research would be required to see if this is a repeating pattern in antonymic relations or a coincidence in our dataset.

Figure 5.2

We chose to implement a simple counter which showed the user how far they had to go until they were finished. Early testing showed that when users had no idea how many questions they had answered or how many remained, they were reluctant to continue playing the game. Once we added a counter that kept track not only of how many questions the user had answered but also of how many were left, users reported that the game felt much easier to complete.

A feature some users requested was the ability to keep playing as the same user. That would possibly conflict with the progress counter, as it would remove the “finished” state. One solution to that issue could have been an “endless mode” of sorts, which activates after the user has played through the entire game once. This would enable users to gain a higher score than the one achieved at the end of the game.

Another form of feedback we believe would help is clear visual feedback upon answering a question and receiving points. We had included such feedback in our very first prototype of the game (fig. 5.1), but due to lack of time and knowledge on our part we did not implement any such mechanics in the final version.

Fig. 5.1: The very first concept prototype, points were represented by sweets falling into a jar.

Since we did not specifically collect data on the user base of the web app or the questionnaire, there are only a few things that can be said about it. One is that we spread both the questionnaire and the web app to a range of different age groups and, based on feedback received, it is very likely that a large part of the respondents were fellow students of computer science at KTH, although a not insignificant part came from other demographics.

Something that we believe would have improved this thesis in retrospect is using a larger data set.
The reason for this is that we got almost four times as many responses as expected (about 1900 answers in total on the web app, compared to the 500 we expected). While having more respondents on a smaller data set makes the results stronger, they are still only indications, as both the data set and the user base are much too small to say anything definitive on this matter.

5.2 Conclusion

The experiment did not show that reversing the word order had any significant effect on the perceived strength of an antonymic relation. This is consistent with the results of Paradis et al. [3]. The results regarding the variance of canonical antonyms in relation to other antonyms were inconclusive. The limited ordinal scale was a possible confounding variable and we suggest alternative methods to sufficiently answer this question.

Implementing a reward system based on the relative distance from the mean value did not cause a significant reduction in extreme values. In fact, the web app had a higher ratio of extreme values than the questionnaire. We refrain from speculating on the reason for this but suggest that more research be done on the matter.

The significance of our results is limited by the small scale of the study. To improve the strength of the results on antonyms we suggest studying a greater range of words with multiple experimental methods. To improve the results on gamification we suggest a larger dataset and implementing multiple gamification elements, possibly over several iterations.

References

[1] B. MacWhinney, 
Mechanisms of Language Acquisition.​
Lawrence Erlbaum Associates, 1987. [2] S. Fillenbaum, “Words as feature complexes: False recognition of antonyms and synonyms,” 
J. Exp. Psychol.​
, vol. 82, no. 2, pp. 400–402, 1969. [3] C. Paradis, C. Willners, and S. Jones, “Good and bad opposites: Using textual and experimental techniques to measure antonym canonicity,” 
Ment. Lex.​
, vol. 4, no. 3, pp. 380–429, 2009. [4] J. Bos and M. Nissim, “Uncovering Noun­Noun Compound Relations by Gamification,” in W15­1832​
, Vilnius, Lithuania, 2015, pp. 251–255. [5] D. J. Herrmann, R. Chaffin, M. P. Daniel, and R. S. Wool, “The role of elements of relation definition in antonym and synonym comprehension,” ​
Z. Psychol. Z. Angew. Psychol.​
, vol. 194, no. 2, pp. 133–153, 1986. [6] C. Paradis, C. Willners, and S. Jones, “Good and bad opposites: Using textual and experimental techniques to measure antonym canonicity,” 
Ment. Lex.​
, vol. 4, no. 3, pp. 380–429, 2009. [7] J. S. Justeson and S. M. Katz, “Co­occurrences of antonymous adjectives and their contexts,” ​
Comput. Linguist.,​
vol. 17, no. 1, pp. 1–19, Mar. 1991. [8] G. A. Miller, “WordNet: a lexical database for English,” ​
Commun. ACM,​
vol. 38, no. 11, pp. 39–41, Nov. 1995. [9] C. DiMarco, G. Hirst, and M. Stede, “The semantic and stylistic differentiation of synonyms and near­synonyms,” ​
Spring Symposium on Building Lexicons for …,​
1993. [10] S. Deterding, D. Dixon, R. Khaled, and L. Nacke, “From game design elements to gamefulness,” in 
Proceedings of the 15th International Academic MindTrek Conference on Envisioning Future Media Environments ­ MindTrek ’11​
, 2011. [11] S. Dale, “Gamification: Making work fun, or making fun of work?,” ​
Business Information Review​
, vol. 31, no. 2, pp. 82–90, 2014. [12] Z. J. Fitz­Walter, “Achievement unlocked: Investigating the design of effective gamification experiences for mobile applications and devices,” Queensland University of Technology, 2015. [13] “Folkets synonymlexikon Synlex.” [Online]. Available: http://folkets­lexikon.csc.kth.se/synlex.html​
. [Accessed: 16­Apr­2016]. [14] “Så lika är orden,” ​
Språktidningen.​
[Online]. Available: http://spraktidningen.se/artiklar/2008/08/sa­lika­ar­orden​
. [Accessed: 16­Apr­2016]. [15] “User Interface Design Basics,” May 2014. [16] Y. Rogers, H. Sharp, and J. Preece, ​
Interaction Design: Beyond Human ­ Computer Interaction​
. John Wiley & Sons, 2011. [17] M. Lynne Murphy, “Antonyms as lexical constructions: or, why paradigmatic construction is not an oxymoron.”

Appendices

Appendix A: Key words and expressions

Antonym

The Oxford Dictionary defines an antonym as “a word opposite in meaning to another”. Such opposites can express oppositeness on a scale such as 
good ­ bad ​
or a definitive either­or relationship such as ​
alive-dead. There is also a type of strong antonym pair known as canonical antonyms, 
which are pairs of antonyms that are associated by convention as well as by semantic relatedness ​
[17]​
. ​
Some examples of canonical antonyms are ​
slow­fast​
, weak­strong​
and ​
big­small ​
[3]​
.

Natural language

According to The Oxford Dictionary, a natural language is a language which has developed naturally in use, as opposed to computer code or an artificial language such as Esperanto or Klingon. Some examples of natural languages are English, Swedish and Arabic.

Appendix B: Screenshots

Fig. B.1: The first page the user is directed to once they come to our site if they are not already logged on from a previous session. The user is prompted to 
log on (Fig. B.3) or make an account (Fig. B.2) to play. A high score is visible to the right, and will always be visible in this position regardless of which page the user is looking at. The user can also click a link to learn more about us (Fig. B.5).

Fig. B.2: If it is the first time the user visits the site they will create an account.

Fig. B.3: If the user has visited the site before they may log on here. This page can be reached either via the link on the first page (Fig. B.1), or by clicking the symbol in the top-right corner. The symbol is always visible in that position if the user is not logged on.

Fig. B.4: This is the page the user gets redirected to once they have logged on or successfully created an account, or the first page the user sees if they are already logged on from a previous session when they first enter the site. The user is prompted to rate how good an antonym pair the two bold words form. The available options translate to Awful, Bad, Ok, Pretty good 
and Very good.​
The user can see their score in the top right corner underneath their username, and can click the symbol in the top right corner to log off. At the bottom there is a very short explanation of how the points are calculated, with a link to Google’s reCaptcha project for the users who are interested. The user may also click a link to learn more about us (Fig. B.5).

Fig. B.5: This page describes us, the creators of the page and authors of this thesis. It explains that the purpose of the site is to gather data for this thesis. There are mailto-links should the user wish to get in touch with us. At the bottom of the page there are links to log on (Fig. B.3) or create a user (Fig. B.2) if the user is not already logged on; if they are, they are redirected back to the play-page (Fig. B.4).

Fig. B.6: Once the user completes all 25 questions they are redirected to this page. It congratulates the player on completing the game and thanks them for their help, whilst playing a video. The page also prompts the user to create a new account should they wish to help us out even more.

Appendix C: Code

This appendix consists of the relevant code used for operations in the web app as well as for the extraction of data.
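As a reading aid for the listings that follow, the consensus-driven reward computed in the play view of Views.py can be isolated as a pure function. The sketch below is illustrative only and was not part of the deployed app: the function name consensus_score is an assumption introduced here, while the formula itself is copied from the view.

```python
def consensus_score(picked, consensus):
    """Points for one answer, using the formula from the play view.

    picked    -- the user's rating, 1..5
    consensus -- rounded mean of earlier ratings (0 if there are none)
    """
    return ((4 - abs(picked - consensus)) ** 3) * 100


if __name__ == "__main__":
    # Score as a function of the distance from a consensus of 5.
    for picked in range(5, 0, -1):
        print(picked, consensus_score(picked, 5))
    # Edge case: no earlier answers, so calc_mean() returned 0.
    print(5, consensus_score(5, 0))
```

An answer that matches the consensus earns 6400 points, and the reward falls off cubically to 2700, 800, 100 and 0 as the distance grows from one to four steps. Since calc_mean in Models.py returns 0 for a pair that has not yet received any answers, the first user to rate such a pair with a 5 receives -100 points.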
Views.py

from django.shortcuts import render, redirect
from django.utils import timezone
from django.http import HttpResponse
from django.contrib.auth.decorators import login_required
from django.views.decorators.csrf import csrf_protect
from django.contrib.auth.forms import UserCreationForm
from django.contrib.auth import authenticate
from django.contrib.auth import login as auth_login
from django.contrib.auth.models import User
from .models import UserProfile, WordPair1, WordPair2
from .forms import AnswerForm
from django.template import RequestContext

# Fixed order in which the 25 word pairs are presented to every user
rand_list = [20, 9, 3, 17, 10, 12, 6, 1, 15, 14, 19, 5, 7, 2, 11, 25, 18, 4, 22, 24, 21, 8, 23, 13, 16]

def index(request):
    return render(request, 'antonymapp/index.html', {})

def login(request):
    return render(request, 'antonymapp/login.html', {})

def create_user(request):
    if request.method == "POST":
        form = UserCreationForm(request.POST)
        if form.is_valid():
            username = form.cleaned_data["username"]
            password = form.cleaned_data["password1"]
            new_user = User.objects.create_user(username=username, password=password)
            new_user = authenticate(username=username, password=password)
            if new_user:
                auth_login(request, new_user)
                return redirect('play')
    else:
        form = UserCreationForm()
    return render(request, 'antonymapp/create_user.html', {'form': form})

# Helper function to play: increments the counter for the chosen rating
def update_word_data(argument, word_pair):
    if argument==1:
        word_pair.ones += 1
    if argument==2:
        word_pair.two += 1
    if argument==3:
        word_pair.three += 1
    if argument==4:
        word_pair.four += 1
    if argument==5:
        word_pair.five += 1
    word_pair.save()

@login_required
@csrf_protect
def play(request):
    user_index = request.user.userprofile.word_index
    if user_index >= 25:
        return render(request, 'antonymapp/victory.html', {})
    # user_index starts at 0, so the first lookup is rand_list[-1] (the last
    # entry); the indices -1..23 still cover all 25 pairs exactly once.
    word_index = rand_list[user_index - 1]
    # Users with an even id rate word1-word2, the others the reversed order
    if request.user.id%2==0:
        word_pair = WordPair1.objects.get(word_number=word_index)
    else:
        word_pair = WordPair2.objects.get(word_number=word_index)
    if request.method == 'POST':
        form = AnswerForm(request.POST)
        if form.is_valid():
            picked = form.cleaned_data.get('svar')
            # This is where I do things with my result
            picked = int(picked[0])
            player = request.user.userprofile
            consensus = word_pair.calc_mean()
            score = ((4 - abs(picked - consensus)) ** 3) * 100
            player.score = player.score + score
            player.word_index = user_index + 1
            player.save()
            update_word_data(picked, word_pair)
            # Result is done, present a new question to the user
            return redirect('play')
    else:
        form = AnswerForm()
    send_index = user_index + 1
    return render(request, 'antonymapp/play.html', {'form':form, 'word_pair':word_pair, 'index':send_index})

def about(request):
    return render(request, 'antonymapp/about.html', {})

Models.py

#!/usr/bin/python
# -*- coding: utf-8 -*-
from django.db import models
from django.contrib.auth.models import User
from django.db.models.signals import post_save

# Creates a UserProfile whenever a new User is saved for the first time
def save_profile(sender, instance, created, **kwargs):
    user = instance
    if created:
        profile = UserProfile(user=user)
        profile.save()

class WordPair1(models.Model):
    word1 = models.TextField()
    word2 = models.TextField()
    # Number of times each rating (1-5) has been given to this pair
    ones = models.IntegerField(default=0)
    two = models.IntegerField(default=0)
    three = models.IntegerField(default=0)
    four = models.IntegerField(default=0)
    five = models.IntegerField(default=0)
    word_number = models.IntegerField(default=0)

    def calc_mean(self):
        # Total number of answers (the denominator of the mean).
        # Note: under Python 2, "/" on two ints floors before round() is applied.
        total_answers = self.ones+self.two+self.three+self.four+self.five
        if total_answers != 0:
            mean = int(round((self.ones*1+self.two*2+self.three*3+self.four*4+self.five*5)/total_answers))
            return mean
        return 0

    def __str__(self):
        return self.word1+" "+self.word2

class WordPair2(models.Model):
    word1 = models.TextField()
    word2 = models.TextField()
    ones = models.IntegerField(default=0)
    two = models.IntegerField(default=0)
    three = models.IntegerField(default=0)
    four = models.IntegerField(default=0)
    five = models.IntegerField(default=0)
    word_number = models.IntegerField(default=0)

    def calc_mean(self):
        total_answers = self.ones+self.two+self.three+self.four+self.five
        if total_answers != 0:
            mean = int(round((self.ones*1+self.two*2+self.three*3+self.four*4+self.five*5)/total_answers))
            return mean
        return 0

    def __str__(self):
        return self.word1+" "+self.word2

class UserProfile(models.Model):
    user = models.OneToOneField(User, on_delete=models.CASCADE)
    score = models.IntegerField(default=0)
    word_index = models.IntegerField(default=0)

post_save.connect(save_profile, sender=User)

Forms.py

#!/usr/bin/python
# -*- coding: utf-8 -*-
from django import forms

class AnswerForm(forms.Form):
    CHOICES = ((1,'Usel'), (2,'Dålig'), (3,'Okej'), (4,'Rätt bra'), (5,'Jättebra'),)
    svar = forms.ChoiceField(choices=CHOICES, widget=forms.RadioSelect())

Populate_db.py

from django.core.management.base import BaseCommand
from antonymapp.models import WordPair1, WordPair2

words1 = ["liten", "smal", "kort", "futtig", "ynklig"]
words2 = ["stor", "bred", "kolossal", "tung", "enorm"]

class Command(BaseCommand):
    def _create_pairs(self):
        # Each of the 25 combinations is stored in both word orders
        number = 1
        for word1 in words1:
            for word2 in words2:
                pair = WordPair1(word1=word1, word2=word2, word_number=number)
                pair.save()
                pair = WordPair2(word1=word2, word2=word1, word_number=number)
                pair.save()
                number += 1

    def handle(self, *args, **options):
        self._create_pairs()

Get_data.py

from __future__ import print_function
from django.core.management.base import BaseCommand, CommandError
from antonymapp.models import WordPair1, WordPair2
import math
from scipy.stats import levene, wilcoxon, chi2_contingency
from numpy import mean, std
from scipy import stats

word_pairs1 = WordPair1.objects.all()
word_pairs2 = WordPair2.objects.all()
word_pairs = [word_pairs1] + [word_pairs2]

class Command(BaseCommand):
    # Expands the stored rating counters into a flat list of ratings
    def freq_to_list(self, word_pair):
        Y = []
        for i in range(0,word_pair.ones):
            Y += [1]
        for i in range(0,word_pair.two):
            Y += [2]
        for i in range(0,word_pair.three):
            Y += [3]
        for i in range(0,word_pair.four):
            Y += [4]
        for i in range(0,word_pair.five):
            Y += [5]
        return Y

    def handle(self, *args, **options):
        print("\n\n")
        print("word1 \t word2 \t n \t mean \t std_dev")
        print("\n")
        mean1 = []
        mean2 = []
        for j in range(0,2):
            for i in range(1,26):
                word = word_pairs[j].get(word_number=i)
                word1 = word.word1[:5]
                word2 = word.word2[:5]
                Y_1 = self.freq_to_list(word)
                n = len(Y_1)
                mean_value = mean(Y_1)
                std_value = std(Y_1)
                if j == 0:
                    mean1 += [mean_value]
                else:
                    mean2 += [mean_value]
                print("%s \t %s \t %d \t %.5f \t %.5f" % (word1, word2, n, mean_value, std_value))

        # Paired test of the mean values for the two word orders
        W, p = stats.wilcoxon(mean1,mean2)
        print()
        print("Wilcoxon returns W = %.5f, and p = %.5f" % (W,p))
        print("Can not reject null hypothesis since p>0.1")
        print()

        print("Matrix for levenes test W-value")
        for i in range(1,26):
            pair1 = word_pairs[0].get(word_number=i)
            word1 = pair1.word1[:4]
            word2 = pair1.word2[:4]
            print(word1 + " " + word2 + " ", end="")
            for j in range(1,26):
                pair2 = word = word_pairs[0].get(word_number=j)
                Y_1 = self.freq_to_list(pair1)
                Y_2 = self.freq_to_list(pair2)
                W, p = levene(Y_1, Y_2)
                #print(' {0:.2f}'.format(W), end="")
            print()
        print()

        print("Matrix for levenes test p-value")
        for i in range(1,26):
            pair1 = word_pairs[0].get(word_number=i)
            word1 = pair1.word1[:4]
            word2 = pair1.word2[:4]
            print(word1 + " " + word2 + " ", end="")
            for j in range(1,26):
                pair2 = word = word_pairs[0].get(word_number=j)
                Y_1 = self.freq_to_list(pair1)
                Y_2 = self.freq_to_list(pair2)
                W, p = levene(Y_1, Y_2)
                print(' {0:.2f}'.format(p), end="")
            print()

        ## Time for chi squared
        pairs = []
        pairs += [word_pairs[0].get(word1="kort",word2="kolossal")]
        pairs += [word_pairs[0].get(word1="smal",word2="bred")]
        pairs += [word_pairs[0].get(word1="liten",word2="enorm")]
        pairs += [word_pairs[0].get(word1="kort",word2="stor")]
        pairs += [word_pairs[0].get(word1="futtig",word2="bred")]
        pairs += [word_pairs[0].get(word1="liten",word2="tung")]
        pairs += [word_pairs[1].get(word1="bred",word2="futtig")]
        pairs += [word_pairs[1].get(word1="enorm",word2="liten")]
        pairs += [word_pairs[1].get(word1="bred",word2="liten")]
        pairs += [word_pairs[1].get(word1="bred",word2="kort")]
        pairs += [word_pairs[1].get(word1="kolossal",word2="ynklig")]
        pairs += [word_pairs[1].get(word1="stor",word2="futtig")]
        pairs += [word_pairs[1].get(word1="enorm",word2="smal")]
        pairs += [word_pairs[1].get(word1="tung",word2="smal")]
        pairs += [word_pairs[1].get(word1="bred",word2="smal")]

        # Count extreme (1 or 5) and non-extreme ratings for the selected pairs
        extreme = 0
        non_extreme = 0
        total = 0
        for pair in pairs:
            pair_values = self.freq_to_list(pair)
            for num in pair_values:
                total += 1
                if (num==1 or num==5):
                    extreme += 1
                else:
                    non_extreme += 1
        print()
        print("data for chi square test")
        print("extreme values: "+ str(extreme))
        print("non extreme values: "+ str(non_extreme))
        print("total"+ str(total))
        C, p,ddof, expected = chi2_contingency([[extreme, 497],[non_extreme, 864]])
        print(C)
        print(p)
        print(ddof)
        print(expected)

Appendix D: Complete list of word pairs

The following table consists of all word pairs used in the web app, their mean value and standard deviations.

Word pair            Mean value    Standard deviation
liten stor           4.91892       0.35855
stor liten           4.94118       0.33792
liten bred           2.24324       0.74997
bred liten           2.24242       0.81762
liten kolossal       3.94595       0.80357
kolossal liten       3.86842       0.89358
liten tung           1.80556       0.84391
tung liten           1.78788       0.80745
liten enorm          4.18919       0.76539
enorm liten          4.06061       0.85065
smal stor            2.7027        1.03623
stor smal            2.58824       1.00345
smal bred            4.56757       0.94556
bred smal            4.42424       1.04534
smal kolossal        2.5           0.92796
kolossal smal        2.53125       0.93489
smal tung            1.57895       0.81536
tung smal            1.71053       0.75769
smal enorm           2.27027       0.94864
enorm smal           2.57143       0.90351
kort stor            2.27027       1.15428
stor kort            2.72727       0.93006
kort bred            1.75676       1.05024
bred kort            1.57143       0.76665
kort kolossal        2.02778       0.92754
kolossal kort        2.375         0.85696
kort tung            1.37838       0.84953
tung kort            1.42424       0.69763
kort enorm           2.2973        0.98268
enorm kort           2.38235       0.9398
futtig stor          3.275         1.07209
stor futtig          3.1           0.86023
futtig bred          1.89189       0.92368
bred futtig          1.94444       0.88017
futtig kolossal      3.88889       1.04822
kolossal futtig      3.87879       1.06622
futtig tung          1.78378       0.74017
tung futtig          1.9697        0.90403
futtig enorm         3.66667       1.07019
enorm futtig         3.39474       0.98781
ynklig stor          2.97222       1.18991
stor ynklig          3.3125        0.91643
ynklig bred          1.80556       1.07547
bred ynklig          2.28125       0.94321
ynklig kolossal      3.22222       1.45509
kolossal ynklig      3.5625        1.22315
ynklig tung          1.91667       0.92421
tung ynklig          1.9375        0.86377
ynklig enorm         3.55556       1.36309
enorm ynklig         3.45455       1.01775
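
The scale effect raised in the discussion can be made concrete with the figures in the table above. For ratings confined to the range 1 to 5, the population standard deviation of a sample with mean m can be at most sqrt((m - 1)(5 - m)) (the Bhatia-Davis inequality), so pairs whose mean lies close to either endpoint necessarily have a small standard deviation. The following is a minimal sketch and was not part of the thesis code: the function name max_std and the selection of rows are assumptions made for illustration, while the three rows themselves are copied from the table.

```python
import math

SCALE_MIN, SCALE_MAX = 1, 5  # endpoints of the rating scale


def max_std(mean):
    """Upper bound on the population std of ratings with this mean."""
    return math.sqrt((mean - SCALE_MIN) * (SCALE_MAX - mean))


# (word1, word2, mean, observed std) copied from the table above.
ROWS = [
    ("liten", "stor", 4.91892, 0.35855),
    ("futtig", "stor", 3.275, 1.07209),
    ("kort", "tung", 1.37838, 0.84953),
]

for word1, word2, mean, observed in ROWS:
    bound = max_std(mean)
    print("%s %s: observed %.3f, bound %.3f" % (word1, word2, observed, bound))
```

For liten stor the bound is roughly 0.56, whereas a pair with a mean of exactly 3 could in principle reach 2.0. A low standard deviation near the ends of the scale is therefore partly forced by the scale itself, independently of how much the participants agree. The bound applies to the population standard deviation, which is what numpy's std computes by default.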