Automated detection of offensive language behavior
on social networking sites
Baptist Vandersmissen
Supervisors: prof. dr. ir. Filip De Turck, dr. ir. Tim Wauters
Counsellors: Philip Leroux, ir. Johannes Deleu, Joost Roelandts (Massive Media)
Master's dissertation submitted in order to obtain the academic degree of
Master of Engineering: Computer Science
Department of Information Technology
Chairman: prof. dr. ir. Daniël De Zutter
Faculty of Engineering and Architecture
Academic year 2011-2012
Acknowledgments
First of all, I would like to express my great gratitude to my supervisors Philip Leroux and
ir. Johannes Deleu for their many insights, continuous support and patience. I would also
like to acknowledge dr. Thomas Demeester for his contributions to this study.
I thank Massive Media for providing me with the data of Netlog.com.
I also want to thank my dear friends at university, Ruben Verhack and Sacha Vanhecke,
for five marvelous and intense years, as well as my friends and family in general for their
support and contagious enthusiasm.
Permission for Use of Content
“The author gives permission to make this master dissertation available for consultation
and to copy parts of this master dissertation for personal use. In the case of any other
use, the limitations of the copyright have to be respected, in particular with regard
to the obligation to state expressly the source when quoting results from this master
dissertation.”
Baptist Vandersmissen, June 2012
Resume
This study applies machine learning techniques to perform automated offensive language
detection. A corpus, originating from the Dutch distribution of the social network Netlog,
is used. An Information Retrieval system, enhanced with Rocchio Query Expansion,
is developed to effectively find offensive messages. Subsequently, a Naive Bayes and
Support Vector Machine classifier are implemented and trained on the gathered offensive
(and irrelevant) messages. To increase reliability, a third classifier is created that is based
on word lists. Since each classifier has its own strengths and weaknesses, a combination
of different methods is proposed. The combination of separate classifiers outperforms all
other methods.
Keywords: offensive language detection, query expansion, text classification
Abstract
Social networking sites are booming as never before. Apart from the numerous new
opportunities they provide, hazards such as messages containing sexual harassment or
racist attacks also have to be taken into account. Since manually monitoring and
analysing all messages separately is unattainable, solutions using automated methods
are sought.
This study applies machine learning techniques to perform automated offensive language
detection. Offensive language can be defined as "expressing extreme subjectivity" and
this study mainly focuses on two categories, 'sexual' and 'racist'. A corpus, originating
from the Dutch distribution of the social network Netlog, is used and contains over seven
million blog messages. We note that only a very small fraction (approximately 0.85%)
of these blog messages can be defined as messages that contain abusive language.
Initially, the intention is to implement two supervised learning methods, Naive Bayes
and Support Vector Machine. These methods base the classification of a message on
previous experience, derived from a labeled training set. To build such a training set,
offensive messages should be efficiently extracted from the corpus. In order to achieve
this, an information retrieval system, expanded with a query expansion technique, is
applied. A query containing offensive terms delivers offensive messages, but a more
efficient approach is obtained by enhancing the query using Rocchio query expansion.
This study shows that using query expansion can effectively increase the number of
relevant messages retrieved.
The supervised classifiers are trained on the labeled set and their performance is afterwards
tested on an independent validation set. The Naive Bayes classifier does not perform
well on the validation set and is therefore disregarded in the further analysis. Our
Support Vector Machine implementation achieves approximately 69% precision and 62%
recall. However, these results are obtained by ignoring very small messages, since SVM
has difficulties classifying messages that do not contain much information.
To tackle the issues SVM suffers from, a more reliable but less dynamic method is
designed, based on word lists. This method, named a semantic classifier, obtains
reasonable results on the validation set with a recall of 93%, but more importantly is
highly complementary with the SVM classifier. The classification of a message is eventually
performed by choosing the appropriate classifier depending on the situation and context.
This method outperforms all others and achieves a precision of 100% combined with a
recall of 79%.
Notwithstanding the solid results of our classifier on the validation set, we note that
offensive language detection is a challenging domain that, like many text classification
problems, suffers from specific linguistic characteristics. Automated systems can be an
excellent aid, but the human capacity to properly estimate the real meaning of a message
remains irreplaceable.
Summary
Social networking sites are highly topical and omnipresent. Despite the countless new
possibilities, there are also risks involved. These manifest themselves in the form of users
who try to abuse the medium. Spreading inappropriate, offensive or hurtful messages is
a frequently occurring problem on every digital communication medium. Since it is not
realistic to check every message manually, automated solutions are sought.
This study attempts to detect offensive or abusive language automatically by means of
machine learning techniques. Offensive language can be defined as "expressing extreme
subjectivity". This study mainly focuses on messages that contain either sexually tinted,
inappropriate content or racist messages. For this purpose a message collection is used,
originating from the Dutch-language distribution of the social networking site Netlog,
which counts more than seven million blog messages. It is noted, however, that only
0.85% of these messages actually contain inappropriate language.
Initially, the intention was to develop two supervised learning methods, based on Naive
Bayes on the one hand and a Support Vector Machine on the other. These methods base
the classification of a message on previous experience, derived from a known training
set. A training set is built from positive and negative examples. In other words,
descriptive example messages have to be collected from the complete collection. To
achieve this, a retrieval system was implemented in combination with a technique that
tries to improve search queries. Relevant messages are found in the message collection
by entering queries into the system. These queries are in turn expanded by Rocchio's
query expansion technique. By using a query expansion technique, this study shows
that more relevant messages can be obtained in a more efficient way.
Subsequently, both supervised classifiers are trained on the labeled message collection,
after which their general performance is tested on a validation set. Since the Naive Bayes
classifier performs rather poorly, it is left out of consideration. The Support Vector
Machine achieves higher results, with a precision of approximately 69% and a recall of
62%. These results are, however, obtained by ignoring messages that are too small, i.e.
that do not contain at least five distinct words. This strongly improves the result, since
SVM has difficulty dealing with small messages.
To tackle the various problems of the SVM method, a third method has also been
developed that classifies messages by means of predefined word lists. This semantic
method achieves reasonable results on the validation set with a very high recall of 93%,
and is moreover highly complementary with the SVM classifier. The final classification
of a message is then determined by a combination of both methods: depending on the
situation, one or the other classifier is chosen. With this system we achieve a precision
of 100% and a recall of 79% on the validation set.
Notwithstanding the excellent results on the validation set, we note that this set is not
a perfect representation of the entire message collection. Detecting inappropriate
language is a challenging domain that will still have to evolve considerably in order to
master the difficult and typical characteristics of language.
Contents
Acknowledgments
Preface
1 Introduction: What is Offensive Language Detection?
  1.1 Introduction
  1.2 Goals of This Study
  1.3 Approach
  1.4 Applications
    1.4.1 Query Expansion
    1.4.2 Text Categorisation
  1.5 Challenges
  1.6 Related Work
2 The Dutch Netlog Corpus
  2.1 Netlog
  2.2 Corpus Overview
  2.3 Offensive Language
  2.4 Creating Training and Validation Set
    2.4.1 Labels
    2.4.2 Validation Set
    2.4.3 Training Set
  2.5 Informal Language and Multiple Languages
3 Methodology
  3.1 Information Retrieval
    3.1.1 Features
    3.1.2 Evaluation
    3.1.3 Spelling Correction
    3.1.4 Stemming
    3.1.5 Part-of-Speech Tagger
    3.1.6 Query Expansion
    3.1.7 Open Source Software
  3.2 Machine Learning Techniques
    3.2.1 Data Representation
    3.2.2 Feature Selection
    3.2.3 Classification Techniques
  3.3 Lexicon Based Text Classification
  3.4 Evaluation
4 Design and Implementation
  4.1 Information Retrieval
    4.1.1 Query Expansion
  4.2 Offensive Language Detection
    4.2.1 Naive Bayes
    4.2.2 Support Vector Machine
    4.2.3 Semantic Methods
    4.2.4 Combination of Methods
5 Results
  5.1 Query Expansion
    5.1.1 Rocchio Relevance Feedback
    5.1.2 Conclusion
  5.2 Offensive Language Detection
    5.2.1 Naive Bayes
    5.2.2 Support Vector Machine
    5.2.3 Semantic Classifier
    5.2.4 Comparison of Single Methods
    5.2.5 Combined Algorithm
    5.2.6 Corpus Classification
    5.2.7 Conclusion
6 Discussion
  6.1 Main Findings
    6.1.1 Corpus
    6.1.2 Development
    6.1.3 Experimental Study
  6.2 Discussion
    6.2.1 Domain-dependency
    6.2.2 Implicit Language
    6.2.3 Conclusion
  6.3 Future Improvements
7 Conclusion
  7.1 Main Conclusions
Preface
“On average 37% of the people talk to people more online than they do in
real life.”
A survey by Badoo in the U.K., U.S. and Germany
Since the beginning of the internet, it has always been a goal to create forms of
computer-mediated social interaction. The first social networking sites (SNS) came in
the form of online communities such as Theglobe.com and Tripod.com. They mostly
focused on creating interaction between people by bringing them together in chat rooms.
It is only in the late 1990s that many sites began to incorporate more advanced features
and user profiles became the central point of view [56]. This somewhat newer form of
social networking site began to flourish with the rise of SixDegrees.com in 1997 [4],
followed by many others such as Friendster and, for example, the European Netlog,
formerly known as Facebox. Since then, the popularity of social networking sites has
been increasing rapidly up until today. Many of these pioneering sites, such as
SixDegrees.com and Friendster, however, died a silent death or lost a great number of
users to more modern SNSs.
Social networking sites have become widespread among the global population. Nowadays
a staggering 900 million users are active monthly on the networking site Facebook 1.
This means around one in eight people in the world has a Facebook profile. Besides
Facebook there are of course many more social networking sites. Qzone, for example, is
a Chinese SNS with a user base of around 531 million [51]. Apart from the most popular
mainstream SNSs, numerous smaller niche social networking sites exist. The majority
of these SNSs can, in their turn, also present increasing visitor numbers. Facebook is by
far the most popular SNS, but for others the popularity strongly depends on the specific
features and geographical location. Other cultures will, for example, expect more privacy.
Social networking sites have long been labeled as unsustainable and have been predicted
to become defunct sooner rather than later. SNSs were often described as an internet
hype or bubble because their user base consists mostly of young and "unpredictable"
people. On top of that, SNSs did not seem to produce any essential value apart from
staying in contact with old friends or sharing pictures. These predictions, however, have
slowly faded away as the importance and value of these sites continued to grow. Nowadays
almost all major companies are present on one or more social networking sites. Even
marketing strategies and other important decisions are often influenced by social media.
1 Facebook Statistics, http://newsroom.fb.com/content/default.aspx?NewsAreaId=22, [2012-05-04]
The fact that a whole world is accessible from behind a computer is one of the greatest
and key aspects of the internet. But at the same time it creates a safe haven for criminals
and those with less noble goals. It is obvious that these digital environments pose many
threats, especially for younger users. A study done by ScanSafe shows that up to 80%
of blogs contain offensive language [71]. Offensive language has spread into almost every
corner of online communities. Well-known issues like bullying, harassment, exposure
to harmful content, sexual grooming and racist attacks are important problems that
directly and indirectly affect our mental health.
To be able to control such digital environments, tremendous human efforts have to be
made. For example, every twenty minutes 1,587,000 blog posts and 10,208,000 comments
are posted on Facebook 2. It is here that this study presents an automatic method to
detect offensive or abusive language.
We begin with an introduction to what offensive language detection exactly entails in
chapter one. Moreover, we describe the challenges associated with this domain and our
approach to this study. Chapter two gives an overview of the Netlog data corpus we will
work with. In chapter three the available techniques that are of interest to this study
are discussed. Which techniques we used and how they are implemented can be found
in chapter four, while the fifth chapter reports on the results of the experiments and the
evolution of the whole study. Chapter six discusses the main findings of this study and
makes some proposals for future work. Finally, chapter seven summarizes the conclusions.
2 Obsessed with Facebook, http://www.onlineschools.org/blog/facebook-obsession/, [2012-05-01]
Chapter 1
Introduction: What is Offensive Language Detection?
1.1 Introduction
What is Automated Detection?
Automated means "operating with minimal human intervention; independent of external
control" 1. Automated detection is the operation of detecting matters with no or minimal
human intervention in a controlled environment. Terms directly associated with
automated learning are machine learning and supervised learning. Machine learning,
a branch of artificial intelligence, is a scientific discipline dealing with the design and
development of algorithms that allow machines to evolve behavior based on experience.
Supervised learning is the machine learning task of deducing a function from labeled
(supervised) training data [24].
What is Offensive Language Behavior?
Offensive 2 can be described as:
− causing anger or annoyance; ’offensive remarks’
− causing or capable of causing harm
− exhibiting lack of respect; rude and discourteous
In this study offensive language is defined as the propagation of offensive messages or
remarks that in some circumstances are inappropriate, exhibit a lack of respect towards
certain groups of people or are just rude in general. In the literature offensive language
is often referred to as 'flame'. [2] defines flames as 'exhibiting extreme subjectivity'.
However, offensive language (flames) is a very vague and ambiguous concept. In general,
people describe or experience certain events or messages in different ways according to
their own education, culture and personal experience. Therefore it is very important to
accurately define what we describe as offensive language.
1 Definition on www.thefreedictionary.com, [2012-05-02]
2 idem
Our final goal is not only to detect offensive language but to discover offensive language
behavior. Offensive language behavior can be described as a person (a user of a social
networking site) repeatedly propagating offensive language in a certain time interval.
Offensive language is not only a vague concept but is also exceptionally wide. The
following subjects can all be interpreted as offensive, or at least cause a certain nuisance:
− Messages containing unwanted advertisements or plain spam.
− Scammers trying to steal personal information.
− Sexually inappropriate messages or indecent proposals.
− Racist messages: offending certain people or entire groups (aggression against
some culture, subgroup of society, race or ideology in a tirade).
− ...
Because different categories often require different strategies, we decided to focus on just
two categories.
As stated above, harassment, sexual grooming and racist attacks are important as
they are widespread and can do more harm to younger children than spam. Therefore, we
focus on the detection of sexually inappropriate and racist messages. However, we
intend to build a system that can, with some effort, be extended in a flexible way.
What is a Social Networking Site?
In [4] a social networking site (SNS) is defined as a web-based service that allows
individuals to
i Construct a public or semi-public profile within a bounded system.
ii Articulate a list of other users with whom they share a connection.
iii View and traverse their list of connections and those made by others within the
system.
The exact meaning of the above terms of course differs from site to site and is
relatively flexible. With this definition in mind, several hundred social networking
sites exist nowadays and are widespread among the global population.
For this study we made use of the Dutch Netlog corpus (cf. chapter 2).
1.2 Goals of This Study
In this study we attempt to create a machine-driven solution to automatically detect
offensive language behavior on social networking sites. The final goal is thus to develop
a system that can automatically detect profiles/persons that extensively and deliberately
post messages containing offensive language (cf. section 2.3). We can briefly describe
our study by stating that we are dealing with a text classification problem. In the
next chapter (cf. chapter 2) we give an overview of what we mean by offensive language.
As mentioned before, we focus on two main categories, 'sexual' and 'racist' messages.
The main objective is, however, divided into multiple smaller parts. The first objective
focuses on finding and annotating a relevant subset of messages. As we deal with
a rather large message set, it is key to efficiently extract relevant messages. In our case
relevant messages are messages that contain offensive language. While searching for
relevant messages we also intend to build a thesaurus 3 that can help us to build a
classifier which is not based on training samples, but on prefabricated word lists.
With this annotated subset of messages we then try to build several supervised learning
methods. We mainly focus on two well-known machine learning techniques: Naive
Bayes and Support Vector Machine. Given the more complex and advanced nature
of a Support Vector Machine, we hypothesize that the Support Vector Machine will
outperform our Naive Bayes method. Therefore, we use our Naive Bayes classifier as
a baseline against which to compare eventual improvements. Once our classifiers are
built and implemented, we need to be able to validate their results. In order to do this
we expand our existing training set 4 with a set of irrelevant messages. We also randomly
select two thousand messages out of the whole message set to create a realistic validation
set. This validation set can be used to test the performance of our classifiers. We then
further try to improve the general performance by tweaking the available parameters
and isolating determining factors.
A third goal is to not only take into account the message itself but to analyze the
reactions to a message. Sexually inappropriate or racist messages could cause a fuss,
possibly manifested in the reactions to that message. We do this by creating a third
classifier 5 that is able to not only detect sexual or racist offensive language, but also
has the ability to detect outrage 6.
In our final goal we compare the results of the different separate classification methods.
This research also checks whether performance can be increased by combining several
separate methods into one whole. We then choose our best performing classification
method and classify the whole data set to create a general overview. Due to time
constraints, and the fact that the decision whether a user should be blocked depends on
much more than only posting offensive messages, the propagation from message level to
profile level has not been developed extensively. Our final results intend to sort users
based on the number of detected offensive messages. Netlog will have an overview of
offensive and non-offensive messages per user, ranked according to a relevance score 7.
3 In this context a thesaurus can be defined as a list of concepts that describe a certain offensive category.
4 One set of sexually inappropriate messages and one set of racist messages.
5 A classifier that is built on prefabricated word lists.
6 In this study 'outrage' is defined as 'profound indignation, anger, or resentment'.
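The supervised-learning side of these goals can be illustrated with a small sketch. The snippet below trains a minimal multinomial Naive Bayes classifier with add-one smoothing on invented placeholder messages (they are not drawn from the Netlog corpus); in this study the same role is played by full Naive Bayes and Support Vector Machine implementations trained on the annotated message set:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Train a multinomial Naive Bayes model with add-one smoothing."""
    classes = sorted(set(labels))
    priors = {c: labels.count(c) / len(labels) for c in classes}
    word_counts = {c: Counter() for c in classes}
    vocab = set()
    for doc, c in zip(docs, labels):
        words = doc.lower().split()
        word_counts[c].update(words)
        vocab.update(words)
    return priors, word_counts, vocab

def classify_nb(model, doc):
    """Return the class with the highest log-posterior for the document."""
    priors, word_counts, vocab = model
    scores = {}
    for c, prior in priors.items():
        total = sum(word_counts[c].values())
        score = math.log(prior)
        for w in doc.lower().split():
            # add-one smoothed word likelihood
            score += math.log((word_counts[c][w] + 1) / (total + len(vocab)))
        scores[c] = score
    return max(scores, key=scores.get)

# Invented placeholder training messages: three "offensive", three "harmless".
train_docs = [
    "offensive insult slur attack",
    "racist slur hateful attack",
    "sexual harassment explicit insult",
    "nice holiday photos with friends",
    "great recipe for apple pie",
    "my favorite football match report",
]
train_labels = ["offensive"] * 3 + ["harmless"] * 3

model = train_nb(train_docs, train_labels)
print(classify_nb(model, "hateful racist insult"))         # -> offensive
print(classify_nb(model, "photos of the football match"))  # -> harmless
```

A validation set is then simply a held-out list of such (message, label) pairs on which precision and recall of the trained model are measured.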
1.3 Approach
Our study can be separated into two main parts. The first part is mainly oriented
towards finding relevant messages in the whole message corpus (cf. chapter 2 for a
detailed overview of the corpus). Because we intend to work with supervised learning
methods (cf. section 1.1), it is crucial to find as many relevant messages as possible to
build a complete training set. Query expansion is a well-known technique that tries to
enhance the search for relevant documents by expanding the query with informative
terms. From our point of view every message containing racist or sexually inappropriate
content is a relevant message. In the second part we pursue our final goal, classification
of an unknown message, by applying text classification techniques. Text classification
is the domain where documents are assigned one or more predefined classes (cf. chapter 3
for a more elaborate overview of these techniques).
1.4 Applications
The exponential growth of the World Wide Web makes it essential that people can use
well-developed tools to better find, filter, and manage this electronic information.
Document retrieval, categorization, routing and filtering can all be formulated as
classification problems [29].
1.4.1 Query Expansion
The most well-known application using query expansion is by far the Google search
engine. Google does numerous things to enhance your search results and increase the
number of relevant pages. From the Google help pages we find:
− Words that share the same word stem as the word given by the user. For example,
if the user's search query includes 'engineer', the search appliance could add
'engineers' to the query.
− Terms of one or more space-separated words that are synonymous with or closely
related to the words given by the user. For example, if a user searches for 'FAQ',
the appliance could add 'frequently asked questions' to the query.
7 This score is based on the confidence the classifier attached to the classification of a message into a certain category.
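The Google examples above expand queries with stems and synonyms; the Rocchio technique used in this study instead re-weights query terms using documents judged relevant and non-relevant. A minimal sketch of the idea on bag-of-words vectors (the toy query, the toy documents and the weights alpha, beta and gamma are illustrative defaults, not the data or values used in this study):

```python
from collections import Counter

def rocchio_expand(query, relevant, nonrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15, top_k=3):
    """Rocchio relevance feedback on bag-of-words vectors:
    q' = alpha*q + beta*centroid(relevant) - gamma*centroid(nonrelevant).
    Returns the original query terms plus the top_k strongest new terms."""
    weights = Counter()
    for term in query.split():
        weights[term] += alpha
    for docs, w in ((relevant, beta), (nonrelevant, -gamma)):
        if not docs:
            continue
        # centroid = average term-count vector of the judged documents
        centroid = Counter()
        for doc in docs:
            centroid.update(doc.split())
        for term, count in centroid.items():
            weights[term] += w * count / len(docs)
    # keep only new terms with positive weight, strongest first
    new_terms = [t for t, w in weights.most_common()
                 if t not in query.split() and w > 0][:top_k]
    return query.split() + new_terms

relevant = ["you idiot stupid fool", "stupid fool go away"]
nonrelevant = ["lovely weather today"]
print(rocchio_expand("idiot", relevant, nonrelevant))
# -> ['idiot', 'stupid', 'fool', 'you']
```

Terms frequent in relevant feedback documents are pulled into the query, while terms from non-relevant documents are pushed down, so a follow-up search retrieves more messages like the ones already judged relevant.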
1.4.2 Text Categorisation
Text categorization is of great importance for many information organization and management tasks.
Spam Filtering
The Internet Security Threat Report recently stated that spam email traffic dropped
by around 13%. Despite this positive evolution, still 75% of all email traffic is identified
as spam 8. Spam messages are annoying to most users, as they waste their time and
clutter their mailboxes. Text classification tries to differentiate regular emails from spam
messages.
Topic Spotting
Topic spotting tries to automatically determine the topic of a text. It is often applied to
structure large quantities of documents into a set of categories. For example, a collection
of news articles can be divided into different categories such as politics, regional, economy,
etc.
Language Identification
Text classification techniques can also be used to try to automatically determine the
language of a text.
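As an illustration of this use case, a crude form of language identification can be sketched in a few lines by comparing a text's words against small stop-word lists (the tiny word lists below are invented for illustration; a real system would use larger resources or character n-gram statistics):

```python
# Minimal sketch of language identification via stop-word overlap.
# The stop-word lists are illustrative, not a complete resource.
STOPWORDS = {
    "english": {"the", "and", "is", "of", "to", "a", "in"},
    "dutch": {"de", "het", "een", "en", "is", "van", "te"},
}

def identify_language(text):
    """Pick the language whose stop words overlap the text the most."""
    words = set(text.lower().split())
    return max(STOPWORDS, key=lambda lang: len(words & STOPWORDS[lang]))

print(identify_language("de kat zit op het dak en is blij"))     # -> dutch
print(identify_language("the cat sat on the mat and is happy"))  # -> english
```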
1.5 Challenges
Since we are concerned with a text classification problem, a number of challenges arise.
The main difficulty in the field of Computational Linguistics is ambiguity. Ambiguity
can occur in a semantic, lexical or syntactic way [58] and is a challenging issue in
natural language processing (NLP) [50, 59]. Lexical ambiguity of a word (or phrase)
implies that this word has more than one meaning in the language to which it belongs.
For instance, the word "bow" has several distinct lexical definitions, including "the front
of the ship" and "the weapon to shoot arrows with". Lexical ambiguity can be resolved
by inferring the meaning of a word from the context. In contrast with lexical ambiguity
there is semantic ambiguity, which imposes a choice between any number of possible
interpretations. This form of ambiguity is closely related to vagueness [58]. When a
phrase can be parsed in two or more ways, we speak of syntactic ambiguity. The
same sequence of words can then be matched to different grammatical structures. The
well-known sentence 'Flying planes can be dangerous', for example, can be interpreted
as 'Is flying planes dangerous?' or 'Are flying planes dangerous?'. Again, context can be
of crucial importance here.
8 Internet Security Threat Report instigated by Symantec. The report can be found on http://www.symantec.com/threatreport, [2012-05-15]
Not only ambiguity, but also the fact that we are working in a social network environment
with a great number of very young users, poses a challenge. This means the vast majority
of the text will not be in standard Dutch, but will most likely contain informal language
and different dialects. Numerous different spellings of one word make it harder to estimate
the real value of that word (cf. chapter 2). Informal language's lack of grammatical
rules makes it more difficult to identify separate sentences and analyze the context to
resolve ambiguity.
A most challenging problem is the detection of implicit messages. Implicit messages could for example contain irony, sarcasm, etc. Again, humans can use the context
to reveal whether the writer of a message is using irony or not. But even among human beings
misinterpreted irony is a frequent source of conflicts. The same applies to humour, which
adds an extra difficulty: evaluating the appropriateness of humour implies
determining whether a joke crosses a certain non-tolerable line. As exploring the boundaries
of the right to freedom of speech is not one of our goals, we will simply ignore
whether or not one is using irony, humour, etc.
To end the list of challenges, domain independence is one of the biggest problems
in machine learning and classification. A classifier can achieve very high accuracy on one
particular corpus (e.g. a blog corpus), but perform very poorly when applied to another kind of
corpus (e.g. a biomedical corpus). Finding effective approaches to overcome this problem
is a valued research field [58].
1.6 Related Work
Relatively few articles specifically discuss the detection of offensive language. However,
many researchers are working on different kinds of opinion mining or sentiment detection.
Examples are Pang et al. [45], Gordon et al. [18], Yu and Hatzivassiloglou [73], Riloff
and Wiebe [52], Yi et al. [72], Dave et al. [11] and Riloff et al. [53]. We mention these
articles because the detection of sentiment is in many respects similar to the detection
of offensive language. Hence, flame detection could be considered an offspring of
subjective language detection.
Fk yea I swear: Cursing and gender in a corpus of MySpace pages
Thelwall [65] studies offensive language on the online network site MySpace with a
specific focus on swearing. The article investigates whether (strong) swearing is
more dominant for one gender. The methods used in this article do not rely on any
text classification method. The author does show that swearing declines as a user gets older,
but could not find an overall gender difference in the use of stronger swear words.
Smokey: Automatic recognition of hostile messages.
Ellen Spertus proposes in [60] a flame recognition system that not only looks for insulting
words, but also for syntactic constructs that have a negative or condescending tendency.
In the first step the Smokey software converts each parsed sentence into Lisp s-expressions
by using sed and awk scripts. Next, semantic rules are used to further process these s-expressions, resulting in a 47-element feature vector based on the syntax and semantics
of each sentence. Finally, one feature vector per message is created by summing up the
vectors of each sentence. A message is then classified as flame or not by evaluating
its feature vector with rules generated by Quinlan’s C4.5 decision-tree generator. The
feature-based rules were generated using a training set of 720 messages, and were able
to correctly categorize 64% of the flames and 98% of the non-flames in a separate test
set of 460 messages.
Detecting flames and insults in text
[34] is a sentence-level classification system that tries to distinguish flames from information by interpreting the basic meaning of a sentence. This is achieved by using a set of
rules and analysing the general semantic structure of a sentence. However, the system
is limited by the fact that insulting sentences can only be detected when they contain
related words or phrases originating from a lexicon.
Filtering Offensive Language in Online Communities using Grammatical Relations
In [71] the focus is more on filtering techniques than on the actual detection of offensive
language. This article makes use of a simple word-matching algorithm in combination
with an offensive-word lexicon to detect the actual offensive words.
Offensive Language Detection Using Multi-level Classification
[48] tries to develop automatic and intelligent software for flame detection. Flames is
another term for offensive language, including taunts, squalid phrases, etc. The article
states that we are increasingly confronted with abusive language in, for example, emails or
other texts. The authors intend to perform flame detection by extracting features at different
conceptual levels and applying multi-level classification. A message can be assigned to
either the class Okay or the class Flame. Results point out that a 3-level classification
gives a general accuracy around 96% with a flame precision of 96.6%. They used a total
of 1525 messages, of which 68% Okay and 32% Flame.
Chapter 2
The Dutch Netlog Corpus
As we are trying to detect offensive language behavior on social networking sites, a representative corpus is needed. Different corpora are available, such as Ken Lang’s Newsgroups
data set [27], the Multi-Domain Sentiment corpus of Blitzer [3], movie reviews [45],
the Wall Street Journal corpus, etc. However, none of these corpora
(a) are real representations of the data on a social networking site, or
(b) are well suited for flame detection, as they have been constructed for opinion mining
purposes.
We were able to make use of the Dutch Netlog Corpus (DNC), especially constructed
for this study, containing a collection of blog messages and reactions on Netlog.
2.1 Netlog
Netlog is a European social networking site that was founded in the early 2000s by
Lorenz Bogaerts and Toon Coppens. Nowadays, Netlog is being developed by Massive
Media NV, located in Ghent. Over 96 million members1 spread over more than forty
languages frequently visit Netlog, with Dutch accounting for the ninth-highest number
of members (over four million). At the beginning Netlog focused on reaching a younger
target audience, but this has changed over the years. Apart from Netlog the company
also develops a fast-growing online dating site, named Twoo.
We explicitly thank Massive Media for supplying this corpus.
2.2 Corpus Overview
In this section we give a general overview of the Dutch Netlog Corpus. We received
approximately seven million blog messages from 800,201 different users. We have not
1 http://nl.netlog.com/go/about/statistics, [2012-05-09]
received any information concerning a user’s age or other personal data. The corpus also
includes more than eleven million reactions to the different blog messages.
However, 71.9% of the blog messages do not contain any reaction. This means that
the eleven million reactions are spread over about two million blog messages. In other
words, a blog message belonging to the group containing reactions has on average six
reactions. This is an important statistic because at a later stage reactions to a message
are taken into account in the classification process (cf. chapter 4). Only a small subset
of the messages will thus eventually be able to benefit from the extra information that
reactions contain. Moreover, it is important to note that a message containing offensive
language should be detected as soon as possible, ideally before anybody has been
able to react to it.
2.3 Offensive language
Because offensive language or flames are very subjective concepts, we hereby attempt to
clearly define these concepts.
We created two general categories of offensive language. The first category is named
racist, and contains the following types of abusive language:
− Extremism: phrases that target a religion or ideology.
− Homophobia: phrases that disparage homosexuality or homosexual sentiments.
− Provocative language: expressions that may cause anger or violence.
− Racism: phrases that intimidate individuals based on their race or ethnicity.
− References to handicaps: phrases that attack the reader using his/her shortcomings.
− Slurs: phrases that attack a culture or ethnicity in some way.
The second category is named sexual:
− Crude language: expressions that embarrass people, mostly because they refer to
sexual matters or excrement.
− Implicit/ambiguous language: expressions that refer to sexual matters in an
indirect way.
− Indecent proposals: expressions that contain indecent proposals, often related to
sex.
− Unrefined language: expressions that lack polite manners, where the speaker is
harsh and rude.
Each time this study mentions the concepts flame detection or offensive language,
we implicitly refer to the above definitions. However, even with an elaborate definition,
labeling a message as offensive or not is a nontrivial task. Determining when
a certain expression is rude, unrefined, indecent or provocative is a subjective decision.
2.4 Creating Training and Validation Set
Since we intend to work with a supervised learning technique (cf. section 1.1), it is
key to annotate and label a vast subset of blog messages. Supervised learning methods
require a labeled example set to build experience, on which they can then base
decisions about new unlabeled messages. Therefore, a training set should be as complete
and descriptive as possible.
2.4.1 Labels
A message can be labeled in five different ways: ’sexual’, ’racist’, ’outrage’, ’irrelevant’
or ’unknown’.
Sexual
A message is labeled as ’sexual’ when it complies with the definition of offensive language.
Moreover, it should contain either crude or unrefined language, implicit/ambiguous language or an indecent proposal. Because we work on a social networking
site, which has a vast number of young users, we take a rather conservative approach to
what can be described as sexual and what not.
For example: ’goed geil wil camsex ’ (’really horny, want camsex’)
Racist
A message is labeled as ’racist’ when it complies with the definition of offensive language.
The ’racist’ category contains more general abusive language than the name suggests.
We do not consider the definition of a racist message to cover only discrimination based on a race or ethnicity; we consider it to also include matters such as homophobia,
extremism, slurs, etc.
For example: ’Bloed moet vloeien, weg met die vrijheid van die Joden republiek! ’ (’Blood must flow, away with the freedom of that Jewish republic!’)
Outrage
Outrage is a category specifically designed for reactions, to indicate whether the
writer of the message intends to express contempt. A regular blog message should not
be tagged as an ’outrage’ message. Although it is perfectly possible to annotate a set
of messages expressing outrage, it is not our primary goal to create a training set for
outrage.
For example: ’Pfff, vuile seksist! Ik weiger hier aan mee te doen. [thumbs down] ’ (’Pfff, dirty sexist! I refuse to take part in this. [thumbs down]’)
Irrelevant
All blog messages that do not contain any form of offensive language are considered to
be ’irrelevant’.
For example: ’ik kan niet slapen :( ’ (’I can’t sleep :(’)
Unknown
Our last label is named ’unknown’ and is applicable when it is unclear if a message
contains offensive language or not. Messages labeled as ’unknown’ are ignored.
2.4.2 Validation set
To be able to properly evaluate the performance of our classification system, a validation
set is required: a validation set (also known as a test set) is a part of a data set used to
assess the performance of classification models that have been trained on a separate part
of the same data set2.
We created a validation set containing a total of two thousand messages. To create
a realistic image of the whole data set, we randomly selected these messages. Based on
how the validation set is labeled, we can determine what the positive/negative ratio
is in the whole set. The positive/negative ratio gives an indication of the percentage of
messages containing offensive language in comparison with irrelevant messages.
Thus, we labeled two thousand messages, distributed as follows:
Label        Amount    Percentage (%)
Racist            2              0.10
Sexual           15              0.75
Irrelevant     1983             99.15

Table 2.1: Distribution of validation set.
We clearly see that the amount of relevant messages is very low compared to the
amount of irrelevant messages. According to the annotation of the test set, only a tiny
0.85% of the whole data set contains offensive language. This means that our whole data
set should contain around 59,500 messages with flames, of which 7,000 racist messages
and 52,500 sexual messages.
2 http://www2.statistics.com/resources/glossary/v/validset.php, [2012-05-10]
Figure 2.1: Distribution of messages containing offensive language.
2.4.3 Training set
A training set can be defined as a specific part of the data set that is characteristic of
the problem to be solved, and is used as input for learning algorithms3. As mentioned
before, a training set should contain as many labeled samples as possible. Finding relevant messages is not a trivial task in a set that contains just over seven million messages.
Therefore, we decided to use an information retrieval (IR) system, combined with query
expansion (cf. chapter 3), to be able to easily search for relevant documents.
With the aid of such an IR system we were able to build a training set with the following message distribution:
Label        Amount    Percentage (%)
Racist          173              3.27
Sexual          362              6.84
Irrelevant     4755             89.89
Total          5290            100.00

Table 2.2: Distribution of different classes in training set.
3 http://www.answers.com/topic/training-set, [2012-05-10]
Figure 2.2: Distribution of different classes in training set.
2.5 Informal Language and Multiple Languages
To conclude this chapter we briefly discuss the nature of the data we received from Netlog. As we stated multiple times, the presence of informal language makes the detection
of offensive language in particular, and text classification in general, more difficult. To begin
with, every person has his or her own writing style and practices. Moreover, the internet is
known for its fast-changing pace and new trends, including the way people communicate
with each other.
Typical examples are the lack of spaces in words (e.g. “ikhaatu”) and the excessive use
of abbreviations (e.g. “BFF”, “LOL”). Other examples are adding or removing characters, writing in dialect (e.g. “kzien a geire”), using emoticons (e.g. “:-)”, “:D”), using
words from a foreign language, etc. The endless variations and the complete lack of any
structural rule, other than one’s personal mindset, make it very hard to process chat
language. Since there is no single way to write something, many different variations
in essence all refer to one word, and frequency distributions will not be as accurate
as with standard language. The fact that a classifier is unable to estimate the real value
of a word hinders the final classification.
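The dilution of frequency counts by spelling variants can be illustrated with a minimal normalization sketch. The variant map and the character-collapsing rule below are purely illustrative, not the normalization actually used in this study:

```python
import re
from collections import Counter

# Purely illustrative variant map; a real mapping for Dutch chat language
# would have to be curated or learned from data.
CANONICAL = {"ikhaatu": "ik haat u"}

def normalize(token: str) -> str:
    token = token.lower()
    # Collapse runs of three or more identical characters to a single one.
    token = re.sub(r"(.)\1{2,}", r"\1", token)
    return CANONICAL.get(token, token)

tokens = ["zo", "Zooooo", "zoooooooo", "slapen"]
raw = Counter(t.lower() for t in tokens)        # counts scattered over variants
merged = Counter(normalize(t) for t in tokens)  # counts concentrated on one form
```

Without normalization the three spellings of “zo” each receive a count of one; after normalization the classifier sees a single word with count three, which is closer to the word’s real value.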
Chapter 3
Methodology
Text categorization (TC) (also known as text classification, or topic spotting) is the task
of automatically sorting a set of documents into categories (or classes, or topics) from
a predefined set. This task falls at the crossroads of information retrieval (IR) and machine learning (ML).
This study is divided into two main parts based on two different techniques. In the
first part we focus on collecting relevant messages from our data set. To do this we use,
adjust and extend an information retrieval system. We extend this IR system with a
technique called query expansion to increase its overall efficiency.
The second part then focuses on building a classification system to perform text classification. We start by implementing a baseline classifier to be able to measure
the general improvement and to compare our results with a more advanced classifier. Our
baseline classifier will be an implementation of the relatively simple but widely used Naive
Bayes (NB) method. A second, more advanced classifier will be an implementation of
a support vector machine (SVM). These first two classification methods are supervised
learning methods, which means they rely on a training set. In a last step we also
implement a third classifier that does not have the ability to learn. This classifier will
be based on prefab word lists and will, next to detecting offensive language, also have
the ability to detect outrage in a message. The detection of ’outrage’ is, however, only
used in the reactions to support a possible prior classification of a blog message.
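The third, non-learning classifier can be sketched as a simple word-list matcher. The word lists below are toy placeholders; the actual prefab lists used in this study are not reproduced here:

```python
# Toy placeholder word lists -- illustrative only, not the prefab lists
# actually used for the Dutch Netlog data.
OFFENSIVE = {"slur1", "slur2"}
OUTRAGE = {"disgusting", "shame"}

def classify(message: str) -> dict:
    # A message is flagged when any of its tokens appears in a word list;
    # no learning is involved, only set intersection.
    tokens = set(message.lower().split())
    return {
        "offensive": bool(tokens & OFFENSIVE),
        "outrage": bool(tokens & OUTRAGE),
    }
```

The outrage flag would only be evaluated on reactions, where it supports a possible prior classification of the blog message itself.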
3.1 Information Retrieval
To be able to search through a vast set of messages, the use of an information retrieval
system is essential. However, the term information retrieval is widely used and very
broad. We use the definition proposed by [36]: “Information Retrieval is finding material
(usually documents) of an unstructured nature that satisfies an information need from
within large collections.”
3.1.1 Features
The eventual goal of an IR system is to satisfy a specific information need from
within a large collection of data. Although numerous different types of IR
systems exist, depending on the specific context, they all conform to a general structure.
In this section we very briefly discuss the basic functionality of such a system.
Since an IR system relies on a collection of “unstructured data”, grouped per document,
its first task is to virtually organize the different documents in the collection. This is
done by creating an index. How a specific index is created heavily depends on
the type of data we are working with. An index can be defined as “a list of words
and corresponding pointers”1. An index can thus be compared to a thesaurus where
every word refers to a list of pointers. These pointers represent the set of documents
containing that word.
The construction of an index makes it possible to identify a document based on its
content. After creating an index, the next objective is to be able to efficiently describe
an information need. An information need is mostly described by a summarizing
sentence that lists the most important keywords. This is called a query. A query is
handled by an information retrieval system by processing and selecting the (key)words
in the query. These words are then used to find the matching documents in the index.
3.1.2 Evaluation
The evaluation of an information retrieval system is based on the number of relevant
documents it retrieves. In this study we mainly focus on three important metrics:
Precision
Precision is the fraction of retrieved documents that are relevant:
$$\text{Precision} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{retrieved documents}\}|}$$
Recall
Recall is the fraction of relevant documents that are retrieved:
$$\text{Recall} = \frac{|\{\text{relevant documents}\} \cap \{\text{retrieved documents}\}|}{|\{\text{relevant documents}\}|}$$
Average Precision (AP)
Average precision takes into account precision and recall simultaneously. This metric
is especially useful when retrieved documents are ranked. Better ranking systems will
1 http://www.answers.com/topic/index, [2012-05-11]
yield a higher average precision as opposed to systems where relevant documents are
more scattered [74].
$$\text{Average Precision} = \int_0^1 p(t)\,dr(t) = \sum_{i=1}^{n} p(i)\,\Delta r(i)$$
with ∆r(i) being the change in recall from i − 1 to i and p(i) the precision from 0 . . . i.
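The three metrics can be computed directly from a ranked result list; a minimal sketch (illustrative, not the evaluation code used in this study):

```python
def precision_recall(retrieved, relevant):
    # Precision and recall over an unordered result set.
    rel_retrieved = [d for d in retrieved if d in relevant]
    return len(rel_retrieved) / len(retrieved), len(rel_retrieved) / len(relevant)

def average_precision(ranked, relevant):
    # Sum precision@i at every rank i that holds a relevant document;
    # each such rank contributes delta_r = 1/|relevant| to the recall,
    # matching the discrete sum in the formula above.
    hits, total = 0, 0.0
    for i, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)

# Relevant documents d1 and d3 ranked at positions 1 and 3:
ap = average_precision(["d1", "d2", "d3", "d4"], {"d1", "d3"})  # (1/1 + 2/3) / 2
```

A ranking that places both relevant documents at the top would yield an average precision of 1.0, which illustrates why the metric rewards good ranking rather than mere retrieval.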
3.1.3 Spelling Correction
The general purpose of spelling correction focuses on resolving typographical errors such
as insertions, deletions, substitutions, and transpositions of letters that result in unknown words [8]. Unknown words are words that cannot be found in a trusted lexicon.
However, 25 to over 50% of observed spelling mistakes are misspellings that result in
a valid, though unintended, word [25]. For example, in “I want you to be quite!”
the word “quite” should be spelled as “quiet”. The correction of a misspelled word strongly depends
on the information available from the context. Multiple supervised and unsupervised
learning methods exist, some based on a lexicon, others based on machine learning techniques [16, 17].
A very basic approach to spelling correction combines a trusted lexicon with
the Levenshtein distance. When we find an unknown word, we search for the best match in
the trusted lexicon according to the Levenshtein distance. This Levenshtein (or edit) distance is the number of insertions, deletions or substitutions required to
transform one word into the other.
Since the description of the information need (query) in an IR system is of utmost
importance, spelling correction can be of significant value. Misspelled words can
change the entire focus of the query and consequently result in very few or no relevant
documents at all.
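The lexicon-plus-Levenshtein approach described above can be sketched as follows. The toy lexicon is illustrative; a real system would use a full trusted lexicon:

```python
def levenshtein(a: str, b: str) -> int:
    # Dynamic-programming edit distance over insertions, deletions
    # and substitutions, keeping only the previous row in memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

TRUSTED_LEXICON = {"quiet", "quite", "arrow"}  # toy trusted lexicon

def correct(word: str) -> str:
    # Known words are left untouched; unknown words are replaced by
    # their nearest neighbour in the trusted lexicon.
    if word in TRUSTED_LEXICON:
        return word
    return min(TRUSTED_LEXICON, key=lambda w: levenshtein(word, w))
```

Note that this sketch corrects only unknown words; real-word errors such as “quite” for “quiet” would pass through unchanged, which is exactly the context-dependent case discussed above.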
3.1.4 Stemming
Stemming is the process of reducing all words with the same root or stem to a common
form [33]. A stemmer will for example match the words “fishing”, “fished”, “fish”, and
“fisher” all to the same root word, “fish”. The most basic stemming algorithms only take
into account morphological information to find a word’s reduced form. More advanced
stemmers are language specific and try to use statistical and contextual information to
find a word’s root form [69, 46].
Although we can generally increase the recall by using a stemmer, it can also significantly reduce precision by retrieving too many documents that have been incorrectly
matched. When analyzing the results of applying stemming to a large number of queries,
we notice that for every query that is helped by the technique, one is hurt [46].
The biggest problem that arises when using a stemmer is of course the fact that words
with totally different meanings are sometimes matched to the same root form. For
example, stemming the Dutch words ’negeer ’ (ignore), ’negeren’ (to ignore) and ’neger ’ (a racial slur) all results in the same
root ’neger ’. This is a typical example that will greatly reduce precision, as ’neger ’
can be defined as offensive language, while ’negeer ’ definitely cannot.
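The precision problem described above is easy to reproduce with a naive suffix stripper. The sketch below uses toy English rules (real stemmers such as Porter's apply ordered rewrite rules with extra conditions) and exhibits an accidental collision analogous to ’negeer ’/’neger ’:

```python
def naive_stem(word: str) -> str:
    # Toy morphological stemmer: strip the first matching suffix,
    # provided at least three characters of stem remain.
    for suffix in ("ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Intended behaviour: inflected forms share one root.
# naive_stem maps "fishing", "fished" and "fisher" all to "fish".
#
# Unintended collision: "wander" loses its "er" and collides with
# the unrelated word "wand" -- the same failure mode as
# 'negeer'/'neger' in Dutch.
```
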
3.1.5 Part-of-Speech Tagger
Part-of-speech tagging (or POS tagging) is a technique that assigns an appropriate and
context-related grammatical descriptor to words in a text. The early POS taggers distinguished words among eight different grammatical tags: noun, verb, particle, article,
pronoun, preposition, adverb and conjunction [41]. Nowadays, the Penn Treebank tag
set is often used as a basis to build an English POS tagger. This tag set, which is extensively described in [38], contains thirty-six different tags to describe a word. Since very
few grammatical rules apply to more than one language, a general part-of-speech tagger does not exist. However, different part-of-speech taggers all more or less
operate according to the same procedure [41]:
− Tokenization: divides a text into separate processing units and removes unwanted
information.
− Ambiguity look-up: implies using a lexicon to tag known words and a guesser
for tokens not represented in the lexicon.
− Ambiguity resolution or disambiguation: based on two information sources,
one out of possibly multiple POS tags should be chosen.
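The three steps above can be sketched as a toy tagger. The miniature lexicon, the tag names (loosely following the Penn Treebank conventions mentioned above) and the guesser heuristic are illustrative only:

```python
import re

# Toy lexicon mapping known word forms to their candidate tags.
LEXICON = {"the": ["DT"], "dog": ["NN"], "barks": ["VBZ", "NNS"]}

def tokenize(text: str):
    # Step 1: tokenization -- split into units, drop unwanted information.
    return re.findall(r"[a-z]+", text.lower())

def guess(token: str):
    # Step 2b: guesser for tokens not represented in the lexicon,
    # here a crude morphological heuristic.
    return ["VBZ"] if token.endswith("s") else ["NN"]

def tag(text: str):
    tagged = []
    for tok in tokenize(text):
        candidates = LEXICON.get(tok) or guess(tok)  # step 2a: lexicon look-up
        # Step 3: disambiguation -- here simply the first (most frequent)
        # candidate; real taggers combine lexical and contextual evidence.
        tagged.append((tok, candidates[0]))
    return tagged
```
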
Part-of-speech tagging can greatly affect the performance of an IR system [32]. Assigning a part-of-speech tag to a term in a query improves the overall description of the
information need, which is an essential part of information retrieval. In other words, a
user is able to discriminate between the different senses in which a term is used. On top of that,
the extra information provided by a POS tag can be used by a stemmer. This would,
for example, solve the issue with ’negeer ’ and ’neger ’, where it is clear that ’negeer ’ (verb)
and ’neger ’ (noun) do not carry the same POS tag and thus should not be mapped to
the same root form.
3.1.6 Query Expansion
Users tend to have difficulties formulating an appropriate query when searching for information on the web. According to a study, the average search query is only 2.6
words long [1], which is in most cases too vague and too short to retrieve a large set of relevant documents. Good queries often presuppose knowledge of relevant documents, and
on top of that users may not know how a query is used by a retrieval model. Query expansion
tries to enhance the search by expanding the query with informative terms in order to
find more relevant documents. Additional search terms define a more specific query that
is less ambiguous and will result in documents that reflect the underlying information
needed.
Existing methods to perform query expansion can be divided into two classes:
− Global methods are techniques for expanding or reformulating query terms independently of the query and the results returned from it, so that changes in the query
wording will cause the new query to match other semantically similar terms.
− Local methods adjust a query relative to the documents that initially appear to
match the query [36].
A popular technique categorized as a local method is relevance feedback.
This technique is based on the possible relevance of highly ranked documents. This
information is then used to further specify the query and effectively improve the results.
Relevance feedback can be exploited in a manual or an automatic way.
3.1.6.1 Rocchio Classification
Rocchio classification is a method that stems from the SMART Information Retrieval
System, around the year 1970. It makes use of the vector space model. The algorithm
is based on the assumption that users have a general idea of which documents should be
denoted as relevant or irrelevant [36].
An ideal query has maximal similarity to the relevant documents and minimum similarity
to irrelevant documents. Suppose we have |Dr | relevant documents and |Dn | irrelevant
documents. Then
$$Q = \frac{1}{|D_r|} \sum_{d_i \in D_r} d_i - \frac{1}{|D_n|} \sum_{d_j \in D_n} d_j$$
is the ideal query. Clearly, the relevant documents are not known. Assuming the user defines which retrieved documents are relevant, a set of relevant documents Du (feedback )
and irrelevant documents Dv (remaining documents) are available. Then
$$Q_{i+1} = \alpha\, Q_i + \frac{\beta}{|D_u|} \sum_{d_i \in D_u} d_i - \frac{\gamma}{|D_v|} \sum_{d_j \in D_v} d_j$$
is a modified query that shifts towards Q, where α, β and γ are tuning parameters and
Qi is the original query [55].
Local methods such as the algorithm described above depend heavily on the relevance
and similarity of the documents retrieved first. They also depend on the documents that
are selected as relevant, whether this happens manually or automatically. When the
initial query retrieves only few relevant documents, query drift can occur, which leads
the derived query much further away from the ideal query. Query drift occurs
when the expanded form of the query changes the underlying “intent” [75]. In general,
Rocchio query expansion leads to rather large queries and will often increase recall at
the cost of precision [36].
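The modified-query formula can be implemented in a few lines. The sketch below represents documents and the query as plain term-weight vectors, and the default weights are the commonly cited values, not the parameters tuned in this study:

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # Q_{i+1} = alpha*Q_i + (beta/|Du|)*sum(Du) - (gamma/|Dv|)*sum(Dv),
    # with the query and each document given as equal-length weight vectors.
    new_q = [alpha * q for q in query]
    for docs, weight, sign in ((relevant, beta, 1.0), (irrelevant, gamma, -1.0)):
        if not docs:
            continue
        w = sign * weight / len(docs)
        for doc in docs:
            for k, value in enumerate(doc):
                new_q[k] += w * value
    return new_q
```

With β > γ the expanded query is pulled mainly towards the centroid of the documents marked relevant, which is the intuition behind the formula.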
Figure 3.1: Example of Rocchio classification. Obtained from [36].
3.1.6.2 Local Feedback
Local feedback is a local method that considers the top m most highly ranked documents. The algorithm consists of the following steps:
− Add the n-most frequent (non-stop word) terms in the m most highly ranked
documents to the query.
− Disregard information from the other documents.
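These two steps can be sketched as follows; the stop-word list and parameter values are illustrative:

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "in"}  # illustrative stop-word list

def expand_query(query, ranked_docs, m=2, n=2):
    # Step 1: pool the (non-stop-word) terms of the m most highly ranked
    # documents -- blind feedback: these are simply assumed to be relevant.
    counts = Counter(
        term
        for doc in ranked_docs[:m]
        for term in doc
        if term not in STOP_WORDS and term not in query
    )
    # Step 2: add the n most frequent pooled terms to the query;
    # information from all other documents is disregarded.
    return list(query) + [term for term, _ in counts.most_common(n)]
```

If the top m documents happen to be irrelevant, the added terms drag the query off-topic, which is exactly the drawback of blind feedback discussed below.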
Many other variants exist (based upon clustering terms) and in recent years many improvements have been obtained on the basis of local feedback. These include re-ranking
the retrieved documents using automatically constructed fuzzy Boolean filters, clustering
the top-ranked documents and removing the singleton clusters, and clustering the retrieved
documents and using the terms that best match the original query for expansion [9].
TREC (Text REtrieval Conference) showed in 1996 that local feedback approaches are effective [68]. This technique does however have an obvious drawback, as it relies on blind feedback, assuming the top-ranked documents are relevant. If a large
fraction of the top-ranked documents is actually irrelevant, then the words added to
the query are likely to be unrelated to the topic. Thus the effect of pseudo-feedback
strongly depends on the quality of the initial retrieval.
Remark 1. Implicit relevance feedback
Here a document’s relevance is not based on direct user feedback; instead, indirect sources
of evidence are used rather than completely relying on the top-ranked documents.
An example of an indirect source of evidence is an often-clicked document. Implicit
feedback is less reliable than explicit feedback, but is more useful than blind relevance
feedback, because local feedback or blind relevance feedback contains no evidence of user
judgements [36].
Remark 2. Relevance feedback does not always guarantee a successful query expansion.
− Misspellings: if the user spells a term differently from the way it is spelled in
any document in the collection, then relevance feedback is unlikely to be effective.
This is very likely when working with chat language, and is one of the reasons
that shows the importance of stemming and the use of a decent spelling corrector.
− Cross-language information retrieval: documents (partly) in another language are not nearby in a vector space based on term distribution. This is again
the case with chat language, as it is often influenced by foreign languages. Documents from the same language, on the contrary, tend to cluster.
3.1.6.3 Phrasefinder
Global methods consider all documents in the set for query expansion. The basic idea
is that the global context of a concept can determine similarities between concepts. In
other words, related terms will co-occur with each other. This is based on the association
hypothesis stating that words related in a corpus tend to co-occur in the documents of
that corpus [70].
Concept and context can be defined in numerous ways. The simplest definition states
that all words are concepts (except for stop words) and that the context
for a word is the set of all words that co-occur with it in the set of documents [47]. More
advanced definitions equate concepts with noun groups (containing multiple words)
and define context as a fixed set of words surrounding the concept. This set of surrounding
words is called a window and typically has a size of one to three sentences. These
definitions are a result of the TREC-3 conference in 1994 and are used as a basis for
the phrasefinder technique.
Every concept (noun group) is associated with a pseudo-document. The content
of this pseudo-document consists of the words occurring in every window for that concept
in the documents. First, these pseudo-documents are filtered by removing stop
words and words that occur too often or too rarely. Then a database or thesaurus is automatically built from these pseudo-documents, creating a concept database. To expand
a query, a ranked list of concepts is generated from this database. A number of
concepts from this ranked list are added to the query, weighted appropriately.
In general, global methods are very robust techniques that improve information retrieval.
Some queries will however be degraded by adding irrelevant concepts. Another drawback is that most general thesauri are expensive in terms of the disk space
and computing time needed to analyse the data and build the database [68].
3.1.6.4 Latent Semantic Analysis
Latent semantic analysis (LSA) is a mathematical method for computer modeling and
simulation of the meaning of words and passages by analysis of representative corpora of
natural text [26]. Latent Semantic Indexing (LSI) decomposes a term into a vector in a low-dimensional space. This is achieved using a technique called singular value decomposition. The hope is that related terms which are orthogonal in the high-dimensional space will have similar representations in the low-dimensional space, and that, as a result, retrieval based on the reduced representations will be more effective.
The first step in the algorithm is to calculate the term-document frequency matrix.
The entries in the term-document matrix are then transformed using an “ltc” weighting. This weighting takes the log of the individual cell entries, multiplies each entry for a term by the inverse document frequency weight of that term, and then normalizes for document length. The transformed term-document matrix is taken as input for the singular value decomposition algorithm. A best “reduced-dimension” approximation to this matrix is then calculated, which results in a reduced-dimension vector for each term and
each document. The cosine between term-term, document-document, or term-document
vectors is then used as the measure of similarity between them and thus can be used to
build a global thesaurus [12].
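The pipeline above can be sketched with numpy (a toy illustration; the exact "ltc" variant and the treatment of zero counts are assumptions):

```python
import numpy as np

def lsi_term_vectors(td, k):
    """td: term-document count matrix (terms x documents); returns
    k-dimensional term vectors from a truncated SVD."""
    td = np.asarray(td, dtype=float)
    n_docs = td.shape[1]
    tf = np.zeros_like(td)
    mask = td > 0
    tf[mask] = 1.0 + np.log(td[mask])            # log of cell entries ("l")
    df = mask.sum(axis=1)
    w = tf * np.log(n_docs / df)[:, None]        # inverse document frequency ("t")
    norms = np.linalg.norm(w, axis=0)
    w = w / np.where(norms == 0, 1.0, norms)     # length (cosine) normalization ("c")
    u, s, _ = np.linalg.svd(w, full_matrices=False)
    return u[:, :k] * s[:k]                      # reduced-dimension term vectors

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```

Terms that co-occur in the same documents end up with similar reduced vectors, so the cosine between them can serve as a similarity measure for a global thesaurus.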
Despite its potential, retrieval results using latent semantic indexing so far have not
shown to be conclusively better than those of standard vector space retrieval systems
[70].
A serious problem with term clustering is that it cannot handle ambiguous terms. If
a query term has several meanings, term clustering will add terms related to different
meanings of the term and make the query even more ambiguous.
3.1.6.5 Local Context Analysis
Local context analysis attempts to combine local and global methods. It is based on two hypotheses: top-ranked documents tend to form several clusters, and the number of relevant documents that contain a query term is non-zero for (almost) every query term [70]. As in Phrasefinder, noun groups are used as concepts, and concepts are selected based on co-occurrence with the query terms. Instead of choosing the concepts from the whole set of documents, only the top n ranked documents are considered, and within those only the best passages are taken into account. A passage is a text window of fixed size. Passages are preferable to whole documents, which can be very long, so that a concept at the beginning co-occurring with a query term at the end may be meaningless. Using smaller parts of a document is also more efficient because it reduces the processing cost.
The algorithm is divided into the following steps:
− Use a standard IR system to retrieve the top m passages.
− Concepts in these passages are ranked according to the formula

$$\mathrm{bel}(Q, c) = \prod_{t_i \in Q} \left( \delta + \frac{\log(\mathrm{af}(c, t_i)) \cdot \mathrm{idf}_c}{\log(n)} \right)^{\mathrm{idf}_i}$$

where

$$\mathrm{af}(c, t_i) = \sum_{j=1}^{n} f_{t_{ij}} \cdot f_{c_j}, \qquad \mathrm{idf}_i = \max\!\left(1, \frac{\log_{10}(N/N_i)}{5}\right), \qquad \mathrm{idf}_c = \max\!\left(1, \frac{\log_{10}(N/N_c)}{5}\right)$$

and

c is a concept,
$f_{t_{ij}}$ is the number of occurrences of $t_i$ in $p_j$,
$f_{c_j}$ is the number of occurrences of c in $p_j$,
N is the number of passages in the collection,
$N_i$ is the number of passages containing $t_i$,
$N_c$ is the number of passages containing c,
δ is a low non-zero value to avoid zero bel values.
This formula is a variation on the tf-idf formula, which is used in most information retrieval systems. The af part rewards concepts frequently co-occurring with query terms, idf_c penalizes concepts frequently occurring in the collection, and idf_i emphasizes infrequent query terms. Finally, multiplication is used to emphasize co-occurrence with all query terms [68].
− In the third step the top m concepts are chosen from the ranked list and added to the query. Concepts are weighted in proportion to their ranks, so that a higher-ranked concept is weighted more heavily than a lower-ranked one. Concepts are added to the query according to the following formulas:

$$Q_{new} = \#\mathrm{WSUM}(1.0\ 1.0\ Q_{old}\ wt\ Q')$$
$$Q' = \#\mathrm{WSUM}(1.0\ 1.0\ wt_1 c_1\ wt_2 c_2 \ldots wt_m c_m)$$
$$wt_i = 1.0 - 0.9 \cdot \frac{i}{m}$$

with $c_i$ the i-th ranked concept. The default value for wt is 2.0. #WSUM is an INQUERY operator that combines evidence from different parts of a query; specifically, it computes a weighted average of its operands.
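The concept-ranking step can be sketched directly (a toy implementation over tokenized passages; single words stand in for noun-group concepts, and the handling of af = 0 via δ is an assumption):

```python
import math

def lca_scores(query_terms, passages, concepts, delta=0.1):
    """Rank candidate expansion concepts for a query over the top-n
    retrieved passages (each passage is a list of tokens)."""
    n = len(passages)
    def idf(t):
        nt = sum(1 for p in passages if t in p)   # passages containing t
        return 1.0 if nt == 0 else max(1.0, math.log10(n / nt) / 5.0)
    scores = {}
    for c in concepts:
        idf_c, bel = idf(c), 1.0
        for t in query_terms:
            af = sum(p.count(t) * p.count(c) for p in passages)
            co = math.log(af) * idf_c / math.log(n) if af > 0 else 0.0
            bel *= (delta + co) ** idf(t)         # delta keeps bel non-zero
        scores[c] = bel
    return sorted(scores.items(), key=lambda kv: -kv[1])
```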
Local context analysis has several advantages. Once the top-ranked passages are retrieved, query expansion is fast. The technique also behaves well with regard to proximity constraints: Phrasefinder, for example, can add concepts to the query that co-occur with all query terms but still do not satisfy proximity constraints.
3.1.6.6 Lexical Semantic Algorithms
Lexical semantic algorithms make use of the semantics in a document to expand and
enhance a query. The basic idea of these algorithms is to expand the query focusing
on the original query terms and their synonyms, acronyms, etc. These methods do
not make use of any statistical data, but rely on an external lexicon or thesaurus. As a consequence, we are dealing with a language-dependent method. The algorithm described
in [67] makes use of the online database WordNet to act as a thesaurus. WordNet is
a large lexical database of English. Nouns, verbs, adjectives and adverbs are grouped
into sets of cognitive synonyms (synsets), each expressing a distinct concept. Synsets
are interlinked by means of conceptual-semantic and lexical relations [40]. One of the biggest challenges in this algorithm is to properly choose, based on the thesaurus, which query terms to expand.
3.1.6.7 Wikipedia as Knowledge Base
Different articles state that using Wikipedia as an extra knowledge base to perform query expansion can be useful [6, 30]. Wikipedia can be used to enhance our knowledge about certain concepts. In [30], Li et al. run every query both on the target corpus and on Wikipedia. Experiments pointed out that retrieval performance was better with Wikipedia-based query expansion than without. In [20] an algorithm is developed to build an automatic concept graph using Wikipedia. This is a statistical and therefore language-independent approach.
3.1.6.8 Manually Constructed Thesaurus
Expanding a query with a manually constructed thesaurus is straightforward: the thesaurus is built by hand beforehand, every query term is matched against it, and the corresponding terms are added to the query.
3.1.7 Open Source Software
The best-known and most popular information retrieval system is of course Google's search engine, but besides Google many open source IR systems exist. We list some freely available, Java-based IR software packages. Different systems naturally offer different advantages and drawbacks.
3.1.7.1 Lucene
Apache Lucene(TM) is a Java-based, full-featured text search engine library, originally developed by Doug Cutting. Lucene is a very popular search framework and is used by multiple large companies such as IBM, Twitter and Apple.2 Key features of Apache
Lucene are:
2 http://wiki.apache.org/lucene-java/PoweredBy, [2012-05-11]
− Lucene's core is based on the idea of a document containing fields of text. This allows Lucene's API to be independent of the file format and enables text from PDF, HTML, Microsoft Word and OpenDocument files to be indexed.
− Lucene is highly supported by a large community, which facilitates finding a solution for a certain problem.
− It enables scalability and high-performance indexing.
− Powerful, accurate and efficient search algorithms.
Lucene provides multiple ways to enhance a query:
− Wildcard Searches: It is possible to use single or multiple character wildcards to
broaden the search.
− Fuzzy Searches: A fuzzy search is based on the Levenshtein distance algorithm.
This makes it possible to search on slight variations of the original word.
− Boosting a Term: A term or phrase can be boosted by adding a caret followed by
a boost factor to the term. By boosting a term you can control the relevance of a
document.
In Lucene the relevance of a document is calculated taking into account the following factors:
tf_q : term frequency of t in query q
tf_d : term frequency of t in document d
idf_t : inverse document frequency of t
numDocs : number of documents in the index
docFreq_t : number of documents containing t
norm_q : norm of query q
norm_dt : square root of the number of tokens in document d in the same field as term t
boost_t : boost factor for term t
coord_qd : number of terms occurring in both query q and document d, divided by the number of terms in query q
(3.1)

This gives the following equation to calculate the relevance of a certain document:

$$\mathrm{score}_d = \left( \sum_{t} \frac{tf_q \times idf_t \times tf_d \times boost_t}{norm_q \times norm_{dt}} \right) \times coord_{qd}$$
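The scoring equation can be sketched as follows (a toy re-implementation, not Lucene's actual code; the idf formula and the definition of norm_q used here are assumptions modeled on Lucene's classic similarity):

```python
import math

def score(query_terms, doc_tokens, num_docs, doc_freq, boost=None):
    """Toy document score in the spirit of equation 3.1. doc_freq maps
    term -> number of documents containing it; boost maps term -> boost."""
    boost = boost or {}
    qset = set(query_terms)
    # coord: fraction of query terms that occur in the document
    coord = sum(1 for t in qset if t in doc_tokens) / len(qset)
    norm_q = math.sqrt(len(query_terms))   # assumed query norm
    norm_d = math.sqrt(len(doc_tokens))    # sqrt of field length
    s = 0.0
    for t in qset:
        tf_q = query_terms.count(t)
        tf_d = doc_tokens.count(t)
        idf = 1.0 + math.log(num_docs / (doc_freq.get(t, 0) + 1))  # assumed idf
        s += tf_q * idf * tf_d * boost.get(t, 1.0) / (norm_q * norm_d)
    return s * coord
```

A document matching all query terms scores higher than one matching none, which gets a score of zero through the coord factor.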
3.1.7.2 Terrier
Terrier, the Terabyte Retriever, is a project that was initiated at the University of Glasgow in 2000. The goal of the project was to provide a flexible platform for the rapid development of large-scale information retrieval applications. On top of that, it provides a state-of-the-art test bed for research and experimentation in the wider field of IR. The Terrier
project explored novel, efficient and effective search methods for large scale document
collections, combining new and cutting-edge ideas from probabilistic theory, statistical
analysis, and data compression techniques [43].
Automatic pseudo-relevance feedback (cf. section 3.1.6) is included in the open source engine. The method works by taking the most informative terms from the top-ranked documents for the query and adding these related terms to the query. The new query is reweighted and rerun, providing a richer set of retrieved documents. Terrier provides several term weighting models from the DFR framework, which are useful for identifying informative terms in top-ranked documents. In addition, Terrier also includes well-established pseudo-relevance feedback techniques, such as Rocchio's method (cf. section 3.1.6.1). Automatic query expansion is highly effective for many IR tasks.
3.1.7.3 Xapian
Xapian is a highly adaptable toolkit which allows developers to easily add advanced
indexing and search facilities to their own applications. It supports the probabilistic information retrieval model and also a rich set of boolean query operators. Xapian supports the following important features:
− Relevance feedback - given one or more documents, Xapian can suggest the
most relevant index terms to expand a query, suggest related documents, categorise
documents, etc.
− Supports stemming of search terms (e.g. a search for "football" would match documents which mention "footballs" or "footballer"). This helps to find relevant documents which might otherwise be missed. Stemmers are currently included for Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish, and Turkish.
− Synonyms are supported, both explicitly and as an automatic form of query expansion.
3.2 Machine Learning Techniques
Text classification can be performed in various ways. Nowadays, the most popular and effective techniques are machine learning techniques. Machine learning is a subdomain of artificial intelligence that deals with algorithms which can learn from previous experience. In other words, these algorithms are trained on previously labeled data, from which they infer certain properties and patterns. These properties can then be applied to classify new data samples. Machine learning techniques can be supervised, unsupervised, or based on reinforcement learning.
The major weakness of these techniques is their domain dependency, which means that new properties and unseen patterns in the data will very likely lead to falsely classified samples. As language can vary in infinite ways, rarely occurring patterns are the rule rather than the exception. Machine learning techniques are limited to generalizing from already known information.
3.2.1 Data Representation
Since we are dealing with textual data, it is vital that this information is extracted and represented in an efficient and complete manner. The simplest approach is using a bag-of-words model. This is an unordered set of words (features), disregarding grammar and even the exact position of the words [58]. Each distinct word corresponds to a feature, with the frequency of the word in the document as its value. Only words that do not occur in a stop list are considered. Although a bag-of-words representation seems to throw away significant information, it is a very simple, yet effective representation of a document [22].
Another way of representing a textual document is by using N-grams. N-grams provide the ability to identify n-word expressions. For example, 'I love you' can be defined as a 3-gram and provides far more information than the separate words 'I', 'love' and 'you'. However, it is not recommended to use n-grams of more than three words: co-occurrences of specific n-grams become less likely to be found as the size of the n-grams increases.
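Extracting n-grams from a tokenized message is straightforward; a minimal sketch (the helper names are illustrative):

```python
def ngrams(tokens, n):
    """All contiguous n-word expressions in a token list."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bag_of_ngrams(text, max_n=3):
    """Combine unigrams up to trigrams into one feature list."""
    tokens = text.lower().split()
    return [g for n in range(1, max_n + 1) for g in ngrams(tokens, n)]
```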
The extraction of words or concepts from a document is also influenced by whether stemming (cf. section 3.1.4) is applied. In addition, extra contextual information can be added by applying a part-of-speech tagger (cf. section 3.1.5), detecting emoticons, detecting excessive use of capital letters, etc.
3.2.2 Feature Selection
Feature selection (also known as subset selection or feature reduction) is the technique of selecting a subset of relevant features. In a textual environment, effective feature selection is essential to improve the efficiency of the learning task and increase overall accuracy [14]. Text classification in general, and offensive language detection more specifically, involves positive (relevant) classes and a negative (irrelevant) class. In most cases a relevant document is characterized by only a few specific features. Non-informative features are often described as noise: features that do not provide any extra information to further enhance the decision-making process. Eliminating this noise and selecting only the most informative features is the key task of feature selection.
The overall feature selection procedure works by giving each potential feature a score
according to a particular metric. Then the best k features are selected and only those
features are used in the further text classification process [14]. Multiple feature selection
metrics exist and make use of different aspects of the data. In essence, they can be
divided into wrappers, filters, and embedded methods [19].
− Wrappers rank features according to their predictive power; this score is obtained by using a learning machine as a black box to determine the most predictive subset of features. A wrapper method often uses sequential stepwise selection to gradually remove or add features at each step [62].
− Filters select subsets of features as a pre-processing step, independent of the chosen predictor. Filters perform feature selection faster than wrappers and create a more general subset of features. A well-known filter method is Information Gain-based feature selection [14].
− Embedded methods include feature selection in the training process of the learning machine. Embedded methods are in general more efficient than wrappers since they do not require the continuous retraining of the learning machine and splitting of the data (into validation and training sets) [19]. An example of a learning algorithm that embeds feature selection is CART [31].
A well-known and widely used feature selection (filter) method is Mutual Information (MI). This method computes the mutual information of a term t and a class C: a measure of how much information the presence or absence of the term contributes to making a correct decision on whether or not a message belongs to C [37]. In short, the algorithm selects the k terms per class with the highest mutual information.
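The contingency-table form of this computation can be sketched as follows (documents are represented as sets of terms; the four-cell formulation follows the standard definition in [37], and the helper names are illustrative):

```python
import math

def mutual_information(docs, labels, term, cls):
    """MI between presence of `term` and membership of class `cls`."""
    n = len(docs)
    n11 = sum(1 for d, l in zip(docs, labels) if term in d and l == cls)
    n10 = sum(1 for d, l in zip(docs, labels) if term in d and l != cls)
    n01 = sum(1 for d, l in zip(docs, labels) if term not in d and l == cls)
    n00 = n - n11 - n10 - n01
    mi = 0.0
    # cells paired with their marginal counts (term present/absent, in/out of class)
    for nij, nt, nc in ((n11, n11 + n10, n11 + n01),
                        (n10, n11 + n10, n10 + n00),
                        (n01, n01 + n00, n11 + n01),
                        (n00, n01 + n00, n10 + n00)):
        if nij > 0:
            mi += nij / n * math.log2(n * nij / (nt * nc))
    return mi

def select_features(docs, labels, cls, k):
    """Keep the k terms with the highest MI for the given class."""
    vocab = set().union(*docs)
    return sorted(vocab,
                  key=lambda t: -mutual_information(docs, labels, t, cls))[:k]
```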
3.2.3 Classification Techniques
As described above, machine learning techniques can be divided into three categories:
− Supervised learning (also known as inductive learning) is the process of learning
a set of rules from labeled training samples [35].
− In unsupervised learning the algorithm is given a set of unlabeled samples and
is supposed to group these into classes, without any prior knowledge of labeled
output or the specific classes [35].
− Reinforcement learning is a technique where an algorithm depends on external feedback to base its decisions on. The learning algorithm is not told which actions should be executed, but must determine the effectiveness of certain actions based on the given reward. Without this feedback, the learning algorithm has no grounds to decide which actions it should attempt [35].
We present two classification techniques, both of them supervised.
3.2.3.1 Naive Bayes
Naive Bayes (NB) is a simple, yet effective classifier that is based on the so-called Naive Bayes assumption. This assumption states that all attributes of the examples are independent of each other given the context of the class [54]. Mathematically this is described as $P(X|C) = \prod_{i=1}^{n} P(X_i|C)$, where $X = (X_1, \ldots, X_n)$ is a feature vector and C represents a class. Calculating the document probability comes down to multiplying the probabilities of all individual words in that document [54]. The class with the highest probability is then the final classification of the document. In [49], multiple improvements are proposed to eliminate several well-known issues that plague the regular Naive Bayes method; for example, the different features are given a score per class based on the normalized tf-idf metric (cf. chapter 4). Despite the often solid performance of Naive Bayes, it is frequently used as a baseline against other more advanced classification methods [54].
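A minimal multinomial Naive Bayes sketch (the textbook variant with add-one smoothing in log space, not the improved tf-idf scoring of [49]; class and method names are illustrative):

```python
import math
from collections import Counter

class NaiveBayes:
    """P(C|X) is proportional to P(C) * product of P(X_i|C), computed in
    log space with add-one smoothing over the vocabulary."""
    def fit(self, docs, labels):
        self.classes = set(labels)
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for d, l in zip(docs, labels):
            self.counts[l].update(d)
        self.vocab = set().union(*(self.counts[c] for c in self.classes))
        return self

    def predict(self, doc):
        def logp(c):
            total = sum(self.counts[c].values())
            return math.log(self.prior[c]) + sum(
                math.log((self.counts[c][w] + 1) / (total + len(self.vocab)))
                for w in doc if w in self.vocab)
        return max(self.classes, key=logp)
```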
3.2.3.2 Support Vector Machine
A Support Vector Machine (SVM) is often regarded as the most successful classifier, yielding the highest accuracy results for many classification problems [66]. Greatly simplified, an SVM represents samples as points mapped to a high-dimensional space and constructs a hyperplane with a maximal Euclidean distance to the differently labeled samples. When classifying a new sample, it is mapped to the same space and its predicted class depends on which side of the hyperplane it is located [63]. It is important to notice that this SVM hyperplane is determined by only a small subset of the training instances, which are called support vectors (cf. Figure 3.2) [63].

A classification problem mapped to a vector space can be linearly separable or non-linearly separable. Two sets of points in an n-dimensional space are linearly separable if they can be completely separated by a hyperplane. This hyperplane is a decision surface of the form w · x + b = 0, where w is an adjustable weight vector, x is an input vector, and b is a bias term [15]. SVMs learn a linear threshold function, designed to solve linearly separable data sets.
Figure 3.2: SVM hyperplane construction in the idealized case of linearly separable data. Obtained
from [63].
However, by using an appropriate kernel function (such as a polynomial or radial basis function), an SVM can be generalized, allowing it to learn nonlinear functions [22]. Nonlinear separation surfaces can be approximated by linear surfaces by mapping the data to a higher-dimensional space [64]. How this happens exactly lies outside the scope of this thesis; figure 3.3 shows a visualisation of this process, where ϕ represents the kernel function.
Figure 3.3: Feature vectors are mapped from a two-dimensional space to a three-dimensional
embedding space. Obtained from [64].
Often data sets are not perfectly separable, which leads to either wrongly classified training samples or overfitting. A required and important parameter when dealing with Support Vector Machines is the cost parameter C. This parameter controls the trade-off between allowing training errors and forcing rigid margins. Increasing the value of C increases the cost of misclassified samples and forces the SVM to create a more accurate model. However, this model may not generalize well and thus will not correctly classify new samples (overfitting). It is therefore essential to find a well-balanced value for the cost parameter C.
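The role of C can be illustrated with a toy linear SVM trained by subgradient descent on the soft-margin objective ½||w||² + C·Σ max(0, 1 − y(w·x + b)) (a didactic sketch only; real systems use dedicated solvers such as LibSVM or LibLinear):

```python
def train_svm(xs, ys, C=1.0, epochs=300, lr=0.01):
    """xs: list of feature tuples; ys: labels in {-1, +1}."""
    dim = len(xs[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = [wi / len(xs) for wi in w]          # regularization pull
            if margin < 1:                             # hinge loss is active
                grad = [g - C * y * xi for g, xi in zip(grad, x)]
                b += lr * C * y
            w = [wi - lr * g for wi, g in zip(w, grad)]
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
```

Raising C makes the hinge term dominate the regularizer, so the learned boundary bends harder toward classifying every training sample correctly.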
3.3 Lexicon Based Text Classification
Lexicon based text classification algorithms rely on manually constructed word lists, where every word in the list belongs to a certain category. In the early 1990s no standardized labeled data sets were available [44]. As a result, most researchers only considered proposals for systems or designed prototypes. Most systems also had no learning capability and focused on simpler classification tasks [44]. Despite the fact that these methods are rather static and outdated, they can still be useful in some cases. In [61], fuzzy logic is combined with natural-language processing to analyze the affect content of free text.
3.4 Evaluation
The evaluation of classification methods is based on the number of correctly classified messages as opposed to falsely classified messages. Four different situations can occur when a new message is classified:
− True Positive (TP) : The classifier correctly indicates the message contains
offensive language. In other words, the message is rightly classified as sexual or
racist.
− True Negative (TN): The classifier correctly indicates the message does not contain offensive language. In other words, the message is rightly classified as irrelevant.
− False Positive (FP): The classifier wrongly labels the message as a message
containing abusive language.
− False Negative (FN): The classifier wrongly indicates the message is irrelevant.
The simplest way to evaluate the performance of a classification system is to analyze its accuracy. Accuracy shows the general correctness of a classifier and is calculated as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
However, a classification system that automatically labels all samples as irrelevant (negative) would yield very high accuracy on corpora that contain only a very small amount of positive samples. Accuracy is thus not a useful metric when working in an environment where one category dominates the others. Therefore, it is important to again make use of the metrics precision and recall, which are interpreted the same way as in information retrieval (cf. section 3.1.2).
These metrics are defined as in [42]:

$$\mathrm{Precision} = \frac{TP}{TP + FP} \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
By analyzing precision and recall we can form a better understanding of the performance of the detection of offensive messages. This study also uses the F1 measure, which is an evenly weighted combination of both precision and recall:

$$F_1 = 2 \times \frac{\mathrm{precision} \times \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$
Apart from testing our data on a separate test set, it is also possible to extract a small part of our training set to use as a validation set. This is done by dividing the training set into k parts; the classifier is then repeatedly trained on k − 1 parts and tested on the remaining part. This technique is called cross validation and provides us with an accurate view of the performance of a classifier.
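The metrics and the k-fold split can be sketched as follows (a minimal illustration with assumed helper names):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute the three metrics from raw confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

def k_fold(samples, k):
    """Yield (train, validation) splits for k-fold cross validation."""
    for i in range(k):
        train = [s for j, s in enumerate(samples) if j % k != i]
        val = [s for j, s in enumerate(samples) if j % k == i]
        yield train, val
```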
Chapter 4
Design and Implementation
This chapter discusses the system’s most important design decisions and solutions. It
starts by describing an information retrieval system, extended with a query expansion
technique. Subsequently, the implementation of our offensive language detection system
is explained.
4.1 Information Retrieval
As already mentioned, we intend to use an IR system to efficiently retrieve relevant documents to build a training set. Section 3.1.7 discusses several open source software packages that provide IR solutions. We decided to base our information retrieval system on Apache Lucene. Lucene provides an API to index and search large amounts of data in an efficient and performant manner. We chose Lucene over the other solutions because it is an advanced platform that can easily be extended and integrated into another system. Lastly, Lucene is well known and elaborate support is provided by a large community.
4.1.1 Query Expansion
The aim of query expansion, together with several techniques describing how to apply it, has already been extensively discussed (cf. section 3.1.6). However, we need to determine which query expansion method best suits our purposes. Avoiding query drift is the biggest challenge when applying query expansion. Another issue faced when implementing query expansion, and one this study in particular suffers from, is chat language. Chat language implies that many relevant features (e.g. ’eigenlijk’ ) occur in many different forms (e.g. ’eigelijk’, ’eiglijk’, ’eigelyk’, etc.) in a text. Both global and automatic local methods will have difficulties effectively expanding a query, since it is much harder than usual to decide which features or documents are relevant. Using lexical and semantic relations to perform more efficient query expansion also seems unsuited to this study: the Dutch language does not have a WordNet alternative, and again we expect that faulty and incomplete chat messages will not contain much structure or grammar.
Therefore, the manual and local query expansion technique Rocchio is chosen (cf. section 3.1.6.1). Rocchio classification performs query expansion by indicating which documents are relevant (and which are not). In this way, we have more control and can more easily avoid query drift. Since the main goal of query expansion is to aid the construction of a training set, it is not an issue that user input is required.
We implement Rocchio classification by making use of an existing extension of Lucene, i.e. LucQE. LucQE (Lucene Query Expansion) provides several modules to perform query expansion [57]. One of these modules is Rocchio query expansion. This method is slightly adapted to fit into our general system.
Figure 4.1 gives a graphical representation of the approach towards the first stage of this study. The Netlog corpus is indexed using Lucene, which is also used to search through the data. The workflow consists of the following sequence:
− We insert a query into our Lucene system.
− Lucene provides all messages, ranked according to relevancy, matching the query.
− We select all messages that contain abusive language and specify whether they
belong to the ’sexual’ or ’racist’ category.
− In the next step, Rocchio classification is applied by expanding the base query with new words originating from the relevant messages. Every message that was labeled is also saved, so it can be reused to build our training set.
− In the following step the expanded query is inserted into the IR system and matching messages are displayed.
− These new terms can also be saved to construct a thesaurus. This thesaurus then contains informative terms grouped per category, which can be used to build a descriptive word list.
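The Rocchio update at the heart of this workflow can be sketched on tf-idf style term-weight dictionaries (the default α, β, γ values shown are common choices, not the ones used by LucQE):

```python
def rocchio_expand(query_vec, relevant, nonrelevant,
                   alpha=1.0, beta=0.75, gamma=0.15, top=5):
    """Rocchio update: q' = alpha*q + beta*mean(relevant)
    - gamma*mean(non-relevant); vectors are dicts term -> weight.
    Returns the query plus the `top` highest-weighted expansion terms."""
    new = {t: alpha * w for t, w in query_vec.items()}
    for docs, sign, coef in ((relevant, 1, beta), (nonrelevant, -1, gamma)):
        if not docs:
            continue
        for d in docs:
            for t, w in d.items():
                new[t] = new.get(t, 0.0) + sign * coef * w / len(docs)
    expansion = sorted((t for t in new if t not in query_vec),
                       key=lambda t: -new[t])[:top]
    return {t: new[t] for t in list(query_vec) + expansion if new[t] > 0}
```

Terms weighted down by the non-relevant documents fall out of the expanded query, which is how the manual relevance judgments keep query drift in check.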
Figure 4.1: Graphical representation of this study’s approach.
4.2 Offensive Language Detection
To satisfy our final goal, two supervised classifiers are created. First, a Naive Bayes classifier is implemented; afterwards, this study proceeds by designing a Support Vector Machine. Naive Bayes classification is a rather simple method, but has proven to perform solidly in many cases. Additionally, the performance of our NB classifier can serve as a baseline to assess the overall performance of our Support Vector Machine.
4.2.1 Naive Bayes
We describe the algorithm proposed by [49] and used in our system in more detail.
Assume we have a total of N features and we have a fixed set of m classes. The
following is calculated:
− Normalized Frequency for a term (Tf) in a document is calculated by dividing the term frequency by the root mean square of the term frequencies in that document.
− Weight Normalized Tf for a given feature in a given class equals the sum of
normalized frequencies of the feature across all the documents in the class.
− Weight Normalized Tf-Idf for a given feature in a class is the Tf-idf calculated
using standard idf multiplied by the ’Weight Normalized Tf’.
− The sum of the weight normalized Tf-Idf (W-N-Tf-Idf) for all the features in a
label is named Sigmak .
For every term in a new message a weight can be calculated depending on its weight normalized Tf-Idf in that class:

$$\mathrm{Weight} = \log\left(\frac{\text{W-N-Tf-Idf} + 1}{\mathrm{Sigma}_k + N}\right)$$

The probability of a message belonging to a certain class is then calculated by running over every term in the message and summing their weights.
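The steps above can be sketched as follows (a simplified reading of the scheme: class priors and tie-breaking are omitted, the root-mean-square denominator is taken over the distinct terms of a document, and the weight for terms unseen in a class follows from setting W-N-Tf-Idf to zero, which are assumptions):

```python
import math
from collections import Counter, defaultdict

def train(docs, labels):
    """docs: token lists. Returns per class the term weights and the
    default weight for unseen terms."""
    n_docs = len(docs)
    df = Counter(t for d in docs for t in set(d))
    wn_tf = defaultdict(Counter)
    for d, l in zip(docs, labels):
        tf = Counter(d)
        rms = math.sqrt(sum(f * f for f in tf.values()) / len(tf))
        for t, f in tf.items():
            wn_tf[l][t] += f / rms               # normalized frequency
    n = len(df)                                  # total number of features N
    model = {}
    for l, counts in wn_tf.items():
        wntfidf = {t: v * math.log(n_docs / df[t]) for t, v in counts.items()}
        sigma = sum(wntfidf.values())            # Sigma_k for this class
        model[l] = ({t: math.log((w + 1) / (sigma + n))
                     for t, w in wntfidf.items()},
                    math.log(1.0 / (sigma + n)))  # weight for unseen terms
    return model

def classify(model, doc):
    return max(model, key=lambda l: sum(
        model[l][0].get(t, model[l][1]) for t in doc))
```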
4.2.2 Support Vector Machine
In order to implement our SVM algorithm, an available open source software package is used. Well-known packages are LibSVM, LibLinear and SVM-Light. We chose to work with LibLinear, which inherits many features of the popular SVM library LibSVM [13]. LibLinear is built to perform optimally on linear, high-dimensional classification problems, and thus typically for text classification.
To be able to work with LibLinear, we need to transform our data into the data format used by the LibLinear package. LibLinear, just like LibSVM and SVM-Light, expects to receive an array of feature nodes per sample. This array of nodes represents a vector in the n-dimensional vector space. In order to convert a textual message to an array of feature nodes, we extract all features from the message and calculate for every feature its tf-idf score.
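Converting a message into this sparse representation can be sketched as follows (the textual LibSVM/LibLinear input format is shown, with its 1-based, sorted-index convention; the in-memory Java API uses arrays of feature nodes with the same layout):

```python
import math
from collections import Counter

def to_svm_line(tokens, label, vocab, idf):
    """Build a '<label> index:value ...' line with tf-idf values.
    vocab: term -> feature index (1-based); idf: term -> idf weight."""
    tf = Counter(tokens)
    nodes = {vocab[t]: f * idf.get(t, 0.0) for t, f in tf.items() if t in vocab}
    feats = " ".join(f"{i}:{v:.4f}" for i, v in sorted(nodes.items()) if v > 0)
    return f"{label} {feats}".strip()
```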
This study also provides a feature selection method based on mutual information (cf. section 3.2.2). Mutual information is a filter method that is applied by removing or ignoring all non-informative terms. Since feature selection is a technique to reduce the amount of noise, we analyze whether our system benefits from using it.
4.2.3 Semantic Methods
Apart from our two supervised classification methods, we also implement a classifier based on prefabricated word lists. The main reason to also design a semantic classification method is to be able to detect ’outrage’ in a reaction, to support or improve the performance of our supervised classifiers. In this way we create two separate methods that do not, by default, share the same flaws.
Sexual          Racist        Outrage
kuste           homo          racistisch
hoeren          doden         vuil
kont            neger         vies
geile meiden    eigen land    niet kunnen

Table 4.1: Random entries from our word lists.
Word lists
These word lists are constructed by using the query expansion module on the training set and are refined manually. In other words, the top terms are extracted from our labeled ’racist’ messages to construct our ’racist’ word list. The same procedure is followed to create a ’sexual’ word list. Finally, a list to detect ’outrage’ is constructed by searching for, and applying query expansion on, reactions that express contempt. These lists are then manually adjusted and extended further.
An entry in our word lists contains the following properties:
<word> <score> <vital> <intensity> <centrality> <pos>
The score of a word is a measure of the importance of that word with regard to a specific
class. The score property is derived from the Rocchio classification term ranking. The
property vital is a boolean that indicates whether the word is important or not. These
first two properties are only used in our initial algorithms and are not taken into account
in the final implementation.
We eventually use the third attempt as our final algorithm (cf. chapter 5), which ignores
the properties score and vital. The main properties in the word list are intensity,
centrality and pos. Intensity expresses the strength of a word, while centrality indicates
the degree of relatedness to a certain class. For example, the word ’neger’ has a greater
intensity level than the word ’zwart’. Centrality is a measure of how strongly a term is
linked to a class: many words relate to a class in some context, but are not offensive
in themselves (e.g. ’vuil’ ). Both centrality and intensity are decimal values between
zero and one. Finally, every word also has a part-of-speech tag, which is not used
extensively in this study, but allows future enhancements towards a more advanced
algorithm. The values of the intensity, centrality and pos tag are manually crafted and
adjusted, based on the author’s experience and assessment skills.
Lastly, support for bigrams in the word lists is added. Bigrams can greatly improve
performance since they often combine two regular words into one offensive concept
(e.g. ’white power’ ).
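A possible parser for such word-list entries might look as follows. This is a sketch under the assumption that fields are tab-separated (so that bigram entries may contain a space); the `WordEntry` type and field handling are illustrative, not the thesis's actual code:

```python
from typing import NamedTuple

class WordEntry(NamedTuple):
    word: str         # unigram or bigram, e.g. 'neger' or 'white power'
    score: float      # Rocchio-derived importance (unused in the final algorithm)
    vital: bool       # vital flag (unused in the final algorithm)
    intensity: float  # strength of the word, in [0, 1]
    centrality: float # degree of relatedness to the class, in [0, 1]
    pos: str          # part-of-speech tag

def parse_word_list(lines):
    """Parse lines of the form
    <word>\t<score>\t<vital>\t<intensity>\t<centrality>\t<pos>
    into a dict mapping each word to its WordEntry."""
    entries = {}
    for line in lines:
        word, score, vital, intensity, centrality, pos = \
            line.rstrip("\n").split("\t")
        entries[word] = WordEntry(word, float(score), vital == "1",
                                  float(intensity), float(centrality), pos)
    return entries
```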
Algorithms
We implemented three different algorithms, starting from a straightforward, simple algorithm and working towards a more advanced one, which uses the intensity and centrality concepts.
Simple Algorithm runs over the different features of a message and sums the scores
of all features that have a match in the word list. The class with the highest score is the
eventual result of the classification. If no hits are found, the message is logically classified
as irrelevant.
Vital Algorithm runs over every word of the message and checks whether the word can
be found in a certain word list. If the word is indeed found, its score is added to the
total (per class) and we check whether it is a vital word or not. Messages with too few
or no vital words in them are automatically classified as irrelevant. We make use of
the vital property to lower the number of false positives: as we have already explained,
many words in our lists contribute to abusive language, but are not offensive in themselves.
Since spelling correction is incorporated, words that at first are not found in a word list
can still be matched to a closely related one (based on the Levenshtein distance). However,
this is a rather naive approach, since words are very often not properly corrected (cf.
chapter 5).
Fuzzy Algorithm runs, per class, over every word of the message and checks for possible
hits. If a hit is found, the intensity and centrality of the word are analyzed. The
score of a message per class is solely based on the intensity of the matched words. This
score is however reduced to zero whenever the unique hits in a message do not reach a
total centrality of at least 1.0, with at least one hit having a centrality score of at least
0.7. The fuzzy algorithm, in other words, fine-tunes the vitality concept used in the
algorithm above: a word can belong more or less strongly to a certain class based on its
centrality.
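The fuzzy scoring rules above can be sketched as follows. This is a minimal reconstruction from the description, not the thesis's implementation; the word lists are assumed to map each word to an (intensity, centrality) pair:

```python
def fuzzy_score(features, word_list):
    """Score a message for one class: the score is the summed intensity of
    the matched words, but it is reduced to zero unless the unique hits
    reach a total centrality of at least 1.0 and at least one hit has a
    centrality of 0.7 or more."""
    hits = {w: word_list[w] for w in set(features) if w in word_list}
    if not hits:
        return 0.0
    total_centrality = sum(c for _, c in hits.values())
    max_centrality = max(c for _, c in hits.values())
    if total_centrality < 1.0 or max_centrality < 0.7:
        return 0.0
    return sum(i for i, _ in hits.values())

def fuzzy_classify(features, word_lists):
    """Pick the class with the highest fuzzy score, or 'irrelevant' when
    no class scores above zero."""
    scores = {cls: fuzzy_score(features, wl) for cls, wl in word_lists.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "irrelevant"
```

A message hitting only low-centrality words (e.g. ’vuil’ alone) is thus kept irrelevant, while adding one highly central word pushes it over both thresholds.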
4.2.4  Combination of Methods
Our final classification method makes use of both our Support Vector Machine classifier
and our fuzzy classifier, which is based on word lists. It is worth mentioning that at
first it was not our intention to combine several classifiers into one whole. However, our
initial Support Vector Machine did not yield the expected results (cf. chapter 5). Thus,
we tried to improve our classification system by incorporating the information that the
reactions to a message provide. We did this, as described above, by implementing a
semantic classifier that can detect outrage in a reaction. In addition, this semantic
classifier can also detect regular abusive language.
This gives us the following sources of knowledge to design a final algorithm with:
− Primary classifier: SVM trained on our default training set (possible outcome
’sexual’, ’racist’, ’irrelevant’)
− Secondary classifier: semantic classifier based on a weighted word list (possible
outcome ’sexual’, ’racist’, ’irrelevant’)
− Reaction classifier: semantic classifier, also based on word lists, but classifying
only the reactions to a certain message (possible outcome ’outrage’, ’sexual’,
’racist’, ’irrelevant’ )
Whether a message is relevant or not is predicted correctly by at least one of our
classifiers in more than 90% of the cases. Thus, it is vital to use the correct classifier in
every situation. The combination algorithm takes into account the characteristics of our
classifiers, derived from multiple tests. We distinguish among four different cases:
− When both the primary and secondary classifier give a message the same label,
we logically assume this label is the correct one.
− When a message contains fewer features than a certain threshold, the secondary
classifier takes precedence over the primary.
− When primary and secondary disagree, we analyze whether any outrage or abusive
language can be found in the reactions to support the decision of one of our classifiers.
− If, on the contrary, the reactions do not contain any information, we base our
decision on the normalized scores both classifiers assign to their decisions.
On the basis of the above cases we choose which classification result will eventually
be selected.
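The four cases can be sketched as a single decision function. This is an illustrative reconstruction: the function name, the feature-count threshold of 5 and the assumption that the two scores are already normalized to a comparable range are my own, not confirmed by the thesis:

```python
def combine(primary, secondary, reaction, n_features,
            primary_score, secondary_score, threshold=5):
    """Combine the three classifier outcomes following the four cases
    described above. `primary`/`secondary`/`reaction` are predicted
    labels; the scores are assumed normalized to a comparable range."""
    # Case 1: agreement -- trust the shared label.
    if primary == secondary:
        return primary
    # Case 2: short message -- the word-list classifier takes precedence.
    if n_features < threshold:
        return secondary
    # Case 3: reactions show outrage or abuse -- support a flame label.
    if reaction != "irrelevant":
        return primary if primary != "irrelevant" else secondary
    # Case 4: fall back on the normalized classifier confidences.
    return primary if primary_score >= secondary_score else secondary
```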
Chapter 5
Results
This chapter presents the results obtained from the experiments applied to our
different methods. These results are based on the proposed metrics (cf. sections 3.1.2
and 3.4) and are gradually improved through a continuous development process. We start
with stage one of our study, which is devoted to query expansion. We discuss the
functionality of our chosen query expansion method and compare this to its overall
performance. Further on, we extensively describe and discuss the performance of our
text classification system. We start with an elaborate view of the functioning of our
Naive Bayes classifier and end with the algorithm that combines the functionality of
different classifiers.
5.1  Query Expansion
To point out the utility of query expansion we compare the (average) precision and
recall (cf. section 3.1.2) of different queries, in both their simple and expanded forms. It
is important to note that the queries in this experiment are focused on the retrieval of
racist messages; queries that are built to retrieve sexual messages show similar results.
We focus on racist messages because they occur only rarely in the corpus, as opposed to
sexual messages, which can be found more easily (cf. section 2.4.2).
To be able to measure the performance of our information retrieval system, we construct
different (expanded) queries. By comparing the recall and precision of a simple
query with its expanded form, we should notice significant differences. We presume that
both precision and especially recall will yield better results when the expanded query is
applied.
5.1.1  Rocchio Relevance Feedback
We briefly recap the main formula, which should shift a new (expanded) query Q_{i+1} closer
to the ideal query Q (cf. section 3.1.6.1):
Q_{i+1} = \alpha \times Q_i + \frac{\beta}{|D_u|} \times \sum_{d_i \in D_u} d_i - \frac{\gamma}{|D_v|} \times \sum_{d_j \in D_v} d_j

5.1.1.1  Initial Setup
The Rocchio algorithm depends on three main parameters (cf. section 3.1.6.1): α, β and
γ. The importance of the original query is denoted by α, while β and γ respectively
denote the importance of the selected relevant documents and the irrelevant documents.
Nowadays, however, the influence of irrelevant messages is questioned and γ is therefore
often reduced to zero.
In our implementation, we assign the following values to these parameters: α = 0.75,
β = 0.25, γ = 0.
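A minimal sketch of this update, assuming queries and documents are represented as sparse term-to-weight dictionaries (the function name and representation are illustrative):

```python
from collections import defaultdict

def rocchio_expand(query, relevant_docs, alpha=0.75, beta=0.25, gamma=0.0,
                   irrelevant_docs=()):
    """Rocchio relevance feedback with the parameter values used above.
    With gamma = 0 the irrelevant documents are effectively ignored.
    Returns the expanded query as a term->weight dict."""
    expanded = defaultdict(float)
    for term, w in query.items():
        expanded[term] += alpha * w
    for doc in relevant_docs:
        for term, w in doc.items():
            expanded[term] += beta * w / len(relevant_docs)
    for doc in irrelevant_docs:
        for term, w in doc.items():
            expanded[term] -= gamma * w / len(irrelevant_docs)
    return dict(expanded)
```

Terms frequent in the selected relevant documents thus enter the query with positive weight, pulling the query vector towards the relevant region of the term space.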
5.1.1.2  Query 1
We start with a query containing the words ’neger’ and ’zwart’. This query is
expanded, which results in the query: “zwart neger marokkan wit werk virus turk clan
turk the ras jood ”. We now compare both recall and precision of these two queries.
Figure 5.1: Recall Comparison
Figures 5.1 and 5.2 clearly show that both precision and recall increase when the
expanded query is used. The recall of the expanded query outperforms the base query by
far: the simple query retrieves around 33 relevant documents, while the expanded query
retrieves 51. The same applies to precision, albeit less pronounced (cf. Figure 5.2). We
can also compare precision in relation to the number of relevant documents retrieved.
Again, we see a much improved result when using the expanded query (cf. Figure 5.3).
Lastly, Figure 5.4(a) displays the difference in average precision between both queries.
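For reference, the average precision metric used in these comparisons can be computed from a ranked result list as sketched below (a standard formulation; the exact variant used in the thesis is not specified):

```python
def average_precision(ranking, relevant):
    """Average precision over a ranked result list: the mean of the
    precision values measured at each rank where a relevant document
    appears, normalized by the total number of relevant documents.
    Rewards rankings that place relevant documents near the top."""
    hits, precisions = 0, []
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(relevant) if relevant else 0.0
```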
Figure 5.2: Precision Comparison
Figure 5.3: Precision-Recall Comparison
(a) Average Precision Comparison of Query 1
(b) Average Precision Comparison of Query 2
Figure 5.4: Average Precision Comparison.
5.1.1.3  Query 2
Our second base query starts from “sieg heil ” and is eventually expanded to “white race
sieg parasiet heil vijand national fight pride trots 88 14 hitler nazi suprematie”. We only
compare the average precision results, as the other results are similar to the numbers
above (cf. section 5.1.1.2). Query 2 also shows the improved average precision of its
expanded form (cf. Figure 5.4(b)).
5.1.1.4  Combined Query
We combine different queries into one large query to see the eventual results. All the
relevant terms of the query can be found in Table 5.1. We emphasize relevant, since
an expanded query usually contains multiple irrelevant terms, which we filtered out to
avoid query drift.
jood, neger, zwart, parasiet, fight, 14, neger, virus, vuil, moslim, white, heil, pride,
hitler, marokkaan, turk, vraatzucht, dik, race, vijand, trots, nazi, wit, clan, ras, kk,
sieg, national, 88, supremati, werk, turk

Table 5.1: Expanded query
Figure 5.5 clearly shows that our combined query outperforms all other queries in
terms of recall.
Figure 5.5: Number of relevant documents among the 1000 retrieved documents
Figure 5.6: Average Precision Comparison
We notice that, in terms of average precision, the combined query does not perform
better than the separately expanded queries (cf. Figure 5.6). This is because the average
precision metric depends on the ranking of the retrieved relevant documents; in other
words, we depend on the ranking capability of Lucene. Moreover, our combined query
contains 36 different terms, which automatically increases the number of retrieved
documents. This, in turn, makes it more difficult for Lucene to rank these documents
properly and results in more scattered relevant documents.
5.1.2  Conclusion
We notice a great improvement in both recall and precision when expanding a query.
The number of retrieved relevant documents more than doubles when applying our
combined query as opposed to its base form. However, precision does not improve
significantly, since the retrieval of documents also depends on the way results are ranked.
Thus, very large queries are not always better: although they generally increase recall,
precision does not follow. The main goal of this stage of the study was to use query
expansion as a means to efficiently build a descriptive training set. Although query
expansion greatly improved the finding of relevant messages, human intervention remains
essential to guide the process.
5.2  Offensive Language Detection
This section presents the results of our flame detection system. We begin with an
introduction to our Naive Bayes classifier, proceed to our support vector machine and
semantic classifiers, and finally end with a combination of methods.
5.2.1  Naive Bayes
Our Naive Bayes is not a standard implementation, but takes into account a few improvements (cf. section 4.2.1). However, we do not use any form of feature selection. Our
Naive Bayes classifier is evaluated by applying cross validation (cf. section 3.4) as well
as by testing it on our standard validation set (cf. section 2.4.2).
5.2.1.1  Evaluation
Table 5.2 displays the performance of our Naive Bayes classifier. For both tests the
accuracy achieves solid results of 95% and 87% respectively. In the remainder of this
chapter accuracy results are not analysed in depth, because in all cases accuracy reaches
values from 90% to 99%. This is due to the large impact of irrelevant messages, which
account for more than 90% of the correctly classified messages.
We notice a big difference in precision when applying cross validation as opposed to
testing on our validation set. We attach more value to the results on our validation set,
which better reflect a realistic situation. Naive Bayes has a tendency to classify documents
as flames too readily, which in the second case results in a low precision. The
reason why the NB classifier performs exceptionally well under cross validation
is twofold. First, our training set is larger and contains far more relevant messages
than our validation set. Secondly, our training set is, much more than our validation set,
a coherent set of messages, which makes it much easier to classify messages from the
same set.
The simplicity of the Naive Bayes method does not allow it to properly unravel the
structure of a message, which leads to the poor results on the validation set.
Evaluation method   Precision (%)   Recall (%)   F1 Measure (%)
Cross Validation    71.28           92.86        80.65
Validation set      5.24            70.00        9.75

Table 5.2: Performance of Naive Bayes.
5.2.1.2  Conclusion
On a realistic validation set our Naive Bayes performs very poorly. One of the reasons
is of course the simplicity of the NB method, which does not fully take into account
the structure of a message. Another flaw is the fact that our validation set contains
too few relevant documents, which makes it hard to produce good results, as falsely
classifying one positive document (or the other way round) has a huge impact on
precision and recall numbers.
5.2.2  Support Vector Machine
The complexity of a Support Vector Machine is very high compared to a Naive Bayes
classifier. This section gives a complete overview, starting from our initial setup and
ending with our final version. We extensively examined multiple factors that possibly
determine or influence the performance of our SVM classifier. This study starts by
showing results that are not based on our final training set, but that give a good indication
of where and how we improved. By showing the complete process, this study tries to give
a comprehensive picture of how we achieved our final results.
5.2.2.1  Initial Setup
As we have already discussed, our support vector machine is implemented based on the
LibLinear library. An L2-regularized logistic regression function is used, our cost
parameter is set to 1.2, epsilon has a value of 0.1 and no bias is used.
5.2.2.2  Evaluation
We evaluate our classifier on the standard validation set. We notice that our Support
Vector Machine falsely classifies 45 messages as positive. This results in a precision of
only 6%. Both precision and recall are extremely low, resulting in an F1 measure of
only 8.7%.
TP    FP    TN      FN
3     45    1932    18

Precision (%)   Recall (%)   F1 Measure (%)
6.25            14.29        8.70

Table 5.3: Results of the classifier on the validation set. Precision drops to 6% due to the
ratio of true positives to false positives.
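The metrics follow directly from the confusion counts; a small helper (illustrative, not from the thesis code) reproduces the Table 5.3 figures:

```python
def precision_recall_f1(tp, fp, fn):
    """Derive precision, recall and F1 (in percent) from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return 100 * precision, 100 * recall, 100 * f1
```

With TP = 3, FP = 45 and FN = 18 this yields approximately 6.25%, 14.29% and 8.70%, matching the table.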
In the next sections we try to determine the crucial factors responsible for this bad
performance.
5.2.2.3  Data Analysis
The first and most obvious action one should attempt to enhance the overall performance
of a supervised classifier is improving the training data. To limit the number of
false positives, we extend the irrelevant messages in our training set. However, since
SVMs base their decision on only a small subset of samples (i.e. the support vectors),
we do not add irrelevant messages randomly. Instead we classify 50,000 random messages
originating from our data set and manually analyse all messages that are detected as
offensive. Each false positive is then added to our training set as an irrelevant message.
In this way, we change the decision making process and make sure the hyperplane that
separates the different categories is modified.
This can be viewed as a form of reinforcement learning (cf. section 3.2.3). The messages
that are correctly classified as offensive are also added to our training set. Table 5.4
displays the results of this classification. Approximately 900 irrelevant messages are
added to the training set, together with roughly 150 offensive, predominantly ’sexual’
messages. Not every message is added to our training set, since several messages are
similar in terms of content and structure.
Total    Sexual   Racist   Irrelevant   Precision   True positive   False positive
50000    774      438      48788        17.07%      205             996

Table 5.4: Results of classifier after running on randomly selected set.
To estimate the effect of our improved training set, we again test our classifier on
our validation set. Table 5.5 shows the results. We clearly see an improvement, though
the overall performance is still very poor.
TP    FP    TN      FN
6     14    1963    15

Precision (%)   Recall (%)   F1 Measure (%)
30.00           28.57        29.27

Table 5.5: Results of classifier after improving the training set.
By repeating this process a number of times, we further enhanced our training set
and eventually attained the set described in section 2.4.3. Another way of improving
our results is to increase the number of relevant messages in our validation set. Since
our validation set contains only a very small number of relevant messages, one wrongly
classified message has a great impact on the eventual performance measures. Although
our validation set is then no longer a realistic representation of our data set, we do get
a slightly better image of the functionality of our Support Vector Machine. We therefore
added a number of relevant, though unseen, messages to our validation set, which
increased the total number of relevant messages to 29. Table 5.6 shows the final results
of the enhancement of both the training and the validation set.
TP    FP    TN      FN
15    22    1947    14

Precision (%)   Recall (%)   F1 Measure (%)
40.54           51.72        45.45

Table 5.6: Results of classifier after modifying validation set.
Figure 5.7 gives an overview of the evolution of the performance of our Support
Vector Machine.
Figure 5.7: Evolution of SVM performance on validation set.
Although we managed, albeit in a somewhat artificial way, to increase the performance
of our SVM, the results are still far from satisfying. Therefore, we further analyse
different aspects in the following sections.
5.2.2.4  Message Length Analysis
A thorough analysis of the messages that have been falsely classified as positive indicates
that 71.3% contain fewer than ten features. To get a clear view of the performance
on specifically small messages, our test set is split into two parts based on the size of
the messages: one part with only messages smaller than a certain threshold and one
part containing only larger messages. The SVM clearly performs on a different level
depending on which set it is executed on. Figure 5.8 shows the difference in precision
between the two sets.
Figure 5.8: Difference in performance when splitting the validation set into a set with small
messages and one with the larger messages.
In Figure 5.9 we see the precision results when splitting the validation set into two parts
based on the number of features in a message. We can clearly see that a message
containing more features has a greater chance of being correctly classified.
Figure 5.9: Snapshot of precision results showing the difference between the small validation set
(messages that contain fewer than 20 features) and the large validation set (solely messages that
contain at least 20 features).
Figure 5.10: Comparison of SVM performance when removing messages under a certain threshold
in both training and validation set.
Based on the above data we decide to eliminate all messages under a certain feature-count
threshold. Instead of only eliminating these messages from our validation set, we also
remove them from our training set. In this way, we prevent these small messages from
negatively affecting the classifier when it attempts to build an accurate model.
Figure 5.10 shows the results of our classifier when completely removing messages that
do not contain a certain minimum number of features. We notice a great performance
improvement in both precision and recall. However, as soon as the threshold exceeds
ten features, the performance drops again. This is because we are then throwing away
too much information, which results in a less accurate training model.
We conclude that messages that do not contain at least five features are not suitable to
be handled and are thus ignored completely by our Support Vector Machine.
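This conclusion amounts to a simple filter applied to both data sets, plus a guard at prediction time. The helper names and the shape of the data set are illustrative assumptions:

```python
MIN_FEATURES = 5  # threshold concluded above

def drop_short_messages(dataset, min_features=MIN_FEATURES):
    """Remove messages with fewer than `min_features` features from a
    list of (features, label) pairs, as done for both the training and
    the validation set."""
    return [(f, lab) for f, lab in dataset if len(f) >= min_features]

def guarded_predict(svm_predict, features, min_features=MIN_FEATURES):
    """At prediction time, short messages are not handled by the SVM
    and fall back to 'irrelevant'."""
    if len(features) < min_features:
        return "irrelevant"
    return svm_predict(features)
```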
5.2.2.5  Feature Selection Analysis
One last attempt to improve the performance of our SVM is to apply feature selection.
As described in section 3.2.2, feature selection tries to eliminate noise by filtering
out all useless features and preserving only those that contain the most information. We
applied the method named mutual information (cf. section 4.2.2). However, the results
were not satisfying, as performance did not improve. We decided not to perform an
in-depth analysis of why feature selection did not produce any improvements: as feature
selection is an art in itself, this would cost too much time and lead us too far away from
the main subject.
5.2.2.6  Conclusion
We conclude that our Support Vector Machine, when intensively analyzed and tuned, performs slightly above average. Figure 5.11 shows the overall progress of our SVM.
Figure 5.11: Precision and recall comparison of different stages of our Support Vector Machine
5.2.3  Semantic Classifier
In this section we present and compare the results of our semantic classifiers. The
algorithms we designed (cf. section 4.2.3) were originally meant to detect outrage in
reactions; in this way it was possible to support or alter certain decisions taken by our
primary classifier. However, by also constructing word lists for the other classes, the
semantic classifiers can be regarded as equivalent to our SVM or NB classifier.
5.2.3.1  Initial Setup
We created three classifiers, beginning with a rather simple implementation and ending
with a more advanced one. Each implementation has its own characteristics (cf. section
4.2.3). The simplest classifier detects abusive language too quickly, as it attaches too
much weight to abusive words that have multiple meanings depending on the context.
We hypothesize that this classifier will yield high recall numbers, but a rather low precision.
The second classifier, named the vital classifier, focuses on the contrary much more
on yielding high precision, at the cost of recall. The last classifier, named the fuzzy
classifier, tries to combine the best parts of the two previous ones.
5.2.3.2  Evaluation
Figure 5.12 shows the results of our different semantic classifiers. As expected, our simple
classifier performs especially well in terms of recall, as opposed to our vital classifier,
which reaches a much higher precision than the first one. The fuzzy classifier reaches the
same precision level as the vital classifier, but combines this with a recall of approximately 93%.
Figure 5.12: Comparison of performance of different semantic classifiers on validation set.
5.2.3.3  Reactions Classification
This section is devoted to determining the efficiency of classifying messages based solely
on their reactions. In chapter 2, an extensive analysis of the corpus is provided; we
concluded there that 72% of all messages do not contain any reaction, but that a
message that does contain reactions has six on average (cf. section 2.2). Logically, we
can only take messages that do contain reactions into account for this experiment, which
reduces our validation set to 584 messages. However, since semantic methods are not
supervised methods, our training set can also be used as a ’validation set’. Since our
training set includes 1416 messages that contain reactions, we decide to temporarily
execute our experiments using this set.
We apply three different methods to classify a message based on its reactions. The first
method focuses on detecting outrage in the reactions, while the second method searches
for reactions that contain abusive language; here we assume that messages containing
abusive language will elicit reactions that also contain abusive language. Our third
method combines both and attempts to find abusive reactions as well as reactions that
express outrage. Figure 5.13 shows a comparison of these different methods, executed
using our fuzzy classifier.
Figure 5.13: Classification of messages solely based on the reactions to the message.
Both our first and our second method achieve a rather high precision, but perform
poorly concerning recall. The combination of both methods achieves reasonable results,
but is not comparable to methods directly classifying the message itself. This seems
normal, as reactions to a message do not necessarily behave the same way as the
message itself and are much less predictable.
5.2.3.4  Conclusion
We decide to select our fuzzy classifier to function as our final semantic classifier.
Furthermore, we conclude that using reactions to classify a message can be useful, but
rather in an assisting function than as a standalone solution.
5.2.4  Comparison of Single Methods
In the sections above we thoroughly analyzed the results of three single methods. For
the sake of completeness, Table 5.7 displays the results of Naive Bayes on the modified
validation set.
TP    FP     TN      FN
25    248    1721    4

Precision (%)   Recall (%)   F1 Measure (%)
9.16            86.21        16.56

Table 5.7: Results of Naive Bayes classifier on the modified validation set.
Figure 5.14 combines the best results for each classifier separately. Naive Bayes
achieves a very high recall, but combines it with the lowest precision. We focus on
our Support Vector Machine classifier and our semantic classifier, which produce solid
results.
Figure 5.14: Comparison between different single classifiers.
5.2.4.1  Length of Message Analysis
Section 5.2.2.4 showed that the SVM has difficulties classifying small messages.
In this section we compare the performance of both the SVM and the fuzzy classifier when applied to
small messages. We filter out of our validation set all messages that contain more than
eleven features; this eventually leaves 479 messages, of which two are relevant.
Table 5.8 displays the results of the experiment. Figure 5.15 shows that the fuzzy classifier
outperforms the SVM by far in terms of precision. These findings are used in the final
algorithm, which favors the fuzzy classifier when messages contain only few features.
Classifier   Accuracy (%)   Precision (%)   Recall (%)   F1 Measure (%)
SVM          97.3           14.3            100.0        25.0
Fuzzy        100.0          100.0           100.0        100.0

Table 5.8: Results of classifiers running on ’small’ validation set.
Figure 5.15: Comparison of precision of both classifiers running on ’small’ validation set.
5.2.4.2  Conclusion
Since in section 5.2.5 a combination of the SVM and fuzzy classifier is proposed, it is
important to effectively determine the weaknesses and strengths of both classifiers. We
conclude that smaller messages should not be classified by the SVM, and we also notice
that the fuzzy classifier yields a very high recall. Thus, when the fuzzy classifier indicates
that a message does not contain abusive language, there is a high chance the message is
indeed irrelevant.
5.2.5  Combined Algorithm
We can state that, although the SVM and fuzzy classifier achieve ’reasonable’ results, the
overall performance is not satisfactory. Therefore, a combination of separate classifiers
is designed to improve the functionality of the system. As described in section 4.2.4, we
use both our SVM and our fuzzy classifier to create a more reliable system.
5.2.5.1  Initial Setup
At first, we focused solely on the belief each classifier attached to its classification: we
chose the classifier that assigned the highest score to a certain class. However, since
it is hard to compare the scores of two completely different classification systems, and
since a lot of extra information can be derived from the context, we designed a more
complex system. This algorithm is described in section 4.2.4 and bases its decision on
four general cases.
5.2.5.2  Evaluation
Our advanced method achieves the following results on our validation set.
TP    FP    TN      FN
23    0     1969    6

Precision (%)   Recall (%)   F1 Measure (%)
100.0           79.3         88.46

Table 5.9: Results of combination of different classifiers on validation set.
Table 5.9 displays the results achieved when classifying our validation set. Both precision and recall have improved significantly.
5.2.5.3  Comparison of Methods
The combination of different separate classifiers clearly outperforms our single methods
(cf. Figure 5.16). Especially precision benefits from combining our classifiers, even
reaching 100%. Of course, this is just a result obtained on our validation set; in a
realistic environment precision will not be perfect, but will still be very high. Recall
yields a less spectacular result, although it outperforms almost all other attempts.
Figure 5.16: Comparison of all implemented classification methods.
5.2.6  Corpus Classification
This section presents the results of classifying our whole data set. Classifying the
whole data set allows us to estimate the basic functionality of our classification system.
5.2.6.1  Detected Offensive Messages
Table 5.10 displays the distribution of messages among the different categories after the
classification.
Class        Amount      Percentage (%)
Irrelevant   6,862,870   99.39
Sexual       37,152      0.54
Racist       5,072       0.07

Table 5.10: Distribution of messages among the three classes: irrelevant, sexual and racist.
Table 5.10: Distribution of messages among the three classes: irrelevant, sexual and racist.
We notice that our system detects a total of 42,224 flame messages. This is equivalent
to 0.60% of the whole corpus, whereas in chapter 2 we stated, based on the number of
offensive messages in our validation set, that 0.85% of the messages contain abusive
language. In total, 30,271 different users have posted at least one flame message.
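The class percentages in Table 5.10 and the flame total can be recomputed from the raw counts (illustrative Python, not the thesis code):

```python
# Counts taken from Table 5.10.
counts = {"irrelevant": 6_862_870, "sexual": 37_152, "racist": 5_072}
total = sum(counts.values())
# Percentage share of each class in the classified corpus.
shares = {c: 100 * n / total for c, n in counts.items()}
# Flames are the sexual plus the racist messages.
flames = counts["sexual"] + counts["racist"]
```

This reproduces the 42,224 detected flames and the percentages of the table.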
5.2.6.2 Influence of Different Classifiers
Table 5.11 denotes the impact of the separate methods on our final classification. Both
the overall share and the impact on the positively classified messages alone are shown.
The SVM and the fuzzy classifier return the same result for the most part, which makes
the decision straightforward in most cases. Reactions only alter or modify the decision
about 0.1% of the time, or 16% when focusing solely on positive messages. Apart from
the situation where both classifiers give equal results, the SVM is preferred in 53% of the
cases, while 47% of the decisions are based on the fuzzy classifier.
Situation                          Global (%)   Positive (%)
Same classification result         96.72        51.45
Reactions modify decision          0.10         16.27
Small message modifies decision    0.96         4.01
Other                              1.32         28.27

Table 5.11: Impact of different situations.
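The four situations in Table 5.11 mirror the cases of the combination algorithm of section 4.2.4. An illustrative reconstruction of that decision logic is sketched below; the function name, the boolean flame labels and the exact fallback rule are assumptions, not the thesis implementation.

```python
def combine_decision(svm_flame, semantic_flame, reactions_flame,
                     n_features, min_features=5):
    """Combine the SVM and the word-list ('semantic') classifier."""
    # Case 1: both base classifiers agree -> adopt the shared verdict.
    if svm_flame == semantic_flame:
        return svm_flame
    # Case 2: the reactions to the message express outrage or are
    # abusive themselves -> treat the message as a flame.
    if reactions_flame:
        return True
    # Case 3: the message is too short for the SVM to be reliable,
    # so the word-list classifier decides.
    if n_features < min_features:
        return semantic_flame
    # Case 4: otherwise trust the SVM, which has the higher precision.
    return svm_flame
```

In this sketch, reactions and message length only come into play when the two base classifiers disagree, matching the small global impact of those cases in the table.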
5.2.7 Conclusion
The performance of our Naive Bayes method is, as expected, greatly outperformed by
both our SVM and our semantic method. This study notes that the SVM, when optimally
tuned, achieves a precision of 69% and a recall of 62%. The semantic method, which is
based on word lists, achieves a very high recall of 93% and combines this with a precision
of 46%. By combining the information we obtain from the semantic method and the SVM
classifier, a considerable performance boost is achieved. The main reason for building a
method based on word lists was to be able to extract information from reactions; however,
given its solid performance on our validation set, the semantic method is considered an
adequate classifier in its own right. Our final algorithm achieves a precision of 100% and
a recall of approximately 79%. All these tests are performed on our modified validation
set, which is our initial validation set with an increased number of relevant documents.
Chapter 6
Discussion
This penultimate chapter briefly summarizes the results obtained in our experiments,
further describes and explains the reasons behind these results, and finally suggests
possible enhancements and future improvements.
6.1 Main Findings
This section gives an overview of the main accomplishments and findings of this study.
6.1.1 Corpus
The corpus, extracted from the social networking site Netlog, contained approximately
seven million messages. By extracting 2000 random messages from the corpus, we
intend to
(a) create a standard validation set to perform our experiments on, and
(b) get a global idea of the number of relevant messages in the corpus.
The validation set contains two 'racist' messages and fifteen 'sexual' messages. A set
containing only a very low number of relevant messages makes it hard to properly
estimate the performance and functionality of a classifier. Therefore, the number of
relevant messages is artificially increased in the validation set. Again, this is done by
randomly selecting approximately 2000 messages, although now solely extracting the
relevant ones. Randomly chosen irrelevant messages in our standard validation set are
then swapped with the retrieved relevant messages.
6.1.2 Development
In the first stage, an information retrieval platform is set up in order to effectively retrieve
relevant documents. The system is expanded by implementing Rocchio query expansion,
which enhances the IR process. Using query expansion, the most informative words are
extracted from the relevant documents and added to the query. This method should not
only return more relevant documents, but also lists the most important words that
characterise the relevant documents. A possibility to save these words and construct a
thesaurus is also implemented.
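Rocchio expansion in this spirit can be sketched in a simplified, positive-feedback-only form. The weights (`alpha`, `beta`), the raw-frequency term scores and the cut-off `k` are illustrative assumptions; the real system additionally filters irrelevant terms from the expanded query by hand.

```python
from collections import Counter

def rocchio_expand(query_terms, relevant_docs, alpha=1.0, beta=0.75, k=5):
    # Weight the original query terms.
    weights = Counter({t: alpha for t in query_terms})
    # Add the centroid of the (manually annotated) relevant documents,
    # using raw term frequencies as a stand-in for real term weights.
    centroid = Counter()
    for doc in relevant_docs:
        centroid.update(doc)
    for term, freq in centroid.items():
        weights[term] += beta * freq / len(relevant_docs)
    # Append the k highest-weighted terms not already in the query.
    expansion = [t for t, _ in weights.most_common() if t not in query_terms]
    return list(query_terms) + expansion[:k]
```

The terms appended by the expansion step are precisely the "most informative words" that could also be stored in a thesaurus.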
The second stage consisted of implementing a supervised classifier to detect offensive
language. Two supervised classifiers, Naive Bayes and Support Vector Machine, were
created. Subsequently, a 'semantic' classifier based on word lists was constructed,
increasing reliability at the cost of flexibility.
Finally, a combination of our SVM and semantic classifier was designed. Analysis of the
outcome of these different classifiers on the validation set shows that over 90% of the
samples are correctly classified by at least one of the two classifiers. Based on the context
and the strengths and weaknesses of both classifiers, a final classification method is
developed.
6.1.3 Experimental Study
This section gives an overview of the main experimental results.
Query Expansion
We can state that Rocchio query expansion, or query expansion in general, is very prone
to query drift. Even with manual annotation of relevant documents, filtering irrelevant
terms out of the expanded query is required. However, we did show that using an
expanded query, as opposed to its base query, increases recall as well as precision. On
average, recall increased by around 20% when expanding both query one and query two.
Especially when creating one large query composed of different single queries, we noticed
an improvement in terms of recall. The average precision also increased when expanding
queries, but this depends on how the new relevant messages are ranked: precision
increases when the expanded query ranks relevant messages highly, and is lower when
the relevant messages are more scattered.
Offensive Language Detection
Our flame detection mechanism starts with developing different independent classifiers.
We start with Naive Bayes, a simple but widely used classifier. This classifier, however,
performs rather poorly on our realistic validation set. Although an impressive recall is
achieved, only 9% of the positively classified messages are in fact offensive, which of
course completely nullifies the high recall numbers.
The second classifier that is implemented is a Support Vector Machine, based on the
LibLinear library. At first, it does not produce significantly better results than our
Naive Bayes classifier. However, after extending and enhancing our training set in
combination with our modified validation set (cf. section 6.1.1), we manage to obtain
reasonable results. The SVM classifier highly depends on the number of features a
message contains. This can be clearly inferred from an analysis of the falsely detected
flame messages, of which 70% contain fewer than ten features. SVM precision is
negatively influenced by messages that do not contain a minimum number of features.
Another way to enhance our SVM is by filtering features such that only meaningful
features are kept. However, the implemented filter, based on mutual information, does
not produce the expected results. Therefore, we do not apply any feature selection
method. Eventually our SVM reaches, in a best-case scenario, a precision of almost
70% and a recall of 62%. This result is obtained when ignoring messages that do not
contain more than five features. Since very small messages can contain abusive language
just as much as large messages, this is considered a flaw.
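A standard way to score a feature for such a filter is its mutual information with the flame class, computed from a 2x2 contingency table of term presence versus class membership. The sketch below follows that textbook formulation and is not the exact filter implemented in the study.

```python
import math

def mutual_information(docs, labels, term):
    """MI between presence of `term` and the positive (flame) class.

    `docs` is a list of token sets, `labels` a parallel list of 0/1 labels.
    """
    n = len(docs)
    # 2x2 contingency counts: term present/absent x class 1/0.
    n11 = sum(1 for d, l in zip(docs, labels) if term in d and l == 1)
    n10 = sum(1 for d, l in zip(docs, labels) if term in d and l == 0)
    n01 = sum(1 for d, l in zip(docs, labels) if term not in d and l == 1)
    n00 = n - n11 - n10 - n01
    mi = 0.0
    # Each cell contributes p(t,c) * log2(p(t,c) / (p(t) * p(c))).
    for n_tc, n_t, n_c in ((n11, n11 + n10, n11 + n01),
                           (n10, n11 + n10, n10 + n00),
                           (n01, n01 + n00, n11 + n01),
                           (n00, n01 + n00, n10 + n00)):
        if n_tc:
            mi += n_tc / n * math.log2(n * n_tc / (n_t * n_c))
    return mi
```

Features would then be ranked by this score and only the top-scoring ones kept; a term perfectly correlated with the class scores 1 bit on balanced data, while an uninformative term scores 0.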
Because of the limited performance of the SVM, we created another classifier, named the
semantic classifier, which is based on word lists. Such methods are less dynamic and have
far weaker learning capabilities than supervised methods, but are more trustworthy. We
mainly designed this method to incorporate the information provided by reactions into
our classification system. However, we notice that our semantic method performs solidly
on our validation set, with an especially high recall of 93%.
Finally, the different classifiers are combined to create one efficient classification system.
This combined system achieves a precision of 100% on our validation set. Although we
may conclude that this is an excellent result, it should at the same time be noted that
this validation set is not a perfect representation of the whole dataset.
Offensive Language Behavior
The final goal of this study is not only to detect offensive language, but to detect
offensive language behavior in general. This means we intend to identify users that
deliberately post offensive messages. Due to time constraints we were unable to fully
develop a detailed method to decide whether or not a user should be labeled as showing
offensive language behavior. However, a classification of the complete data set was
performed in order to get an idea of the number of offensive messages and their
distribution among the different users. We note that approximately 30% fewer messages
have been classified as flames than initially expected from the number of relevant
messages in the validation set.
6.2 Discussion
This section handles multiple concerns related to the obtained results, supported by
some examples.
6.2.1 Domain-dependency
As described in section 1.5, domain-dependency is one of the biggest issues in the text
classification domain. This issue has become even more relevant since our combined
method partly relies on static word lists. Although we attempted to add flexibility to
these word lists by automatically proposing new words, this has not been extensively
tested. It is also necessary to estimate a proper intensity and centrality value for each
word. These elements combined make it hard to estimate the performance of our
classification system on a completely different corpus.
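To illustrate why these per-word values matter, a word-list score could combine intensity and centrality as in the sketch below. The entries, the scoring rule and the threshold are all hypothetical and do not reproduce the actual lists or formula of chapter 4; they merely show how misestimated values would directly shift classifications.

```python
# Hypothetical entries: word -> (intensity, centrality).
SEXUAL_WORDS = {"hoer": (0.9, 0.8), "sexy": (0.4, 0.5)}

def semantic_score(tokens, word_list):
    # Sum intensity * centrality over matched list words, normalised
    # by message length so long messages are not automatically favoured.
    hits = [word_list[t] for t in tokens if t in word_list]
    return sum(i * c for i, c in hits) / len(tokens) if tokens else 0.0

def semantic_classify(tokens, word_list, threshold=0.1):
    # A message is flagged when its score passes the (assumed) threshold.
    return "sexual" if semantic_score(tokens, word_list) >= threshold else "irrelevant"
```

Under such a scheme, transferring the system to a new corpus would require re-estimating every (intensity, centrality) pair, which is exactly the domain-dependency concern raised above.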
6.2.2 Implicit Language
When messages contain implicit language, it is very difficult to properly decide whether
a message should be labeled as offensive. One example of a message that is falsely
classified as sexual is displayed underneath.

Megan Fox: "Ik voel me net een hoer".
"Als actrice voel ik me een hoer." De uitspraak komt van de verrukkelijke Megan Fox, onlangs nog verkozen tot meest sexy vrouw. Fox
vindt eigenlijk alle acteurs en actrices een soort van prositituees. "Het
is vies." Als je erover na denkt zijn we ...

(A Dutch news item in which Megan Fox is quoted as saying "As an actress I feel like a whore".)
To detect such messages, more advanced techniques should be used (e.g., taking
quotations into consideration). Another example, shown below, raises a further issue
concerning our system. Since the main focus is on the detection of offensive chat
language, informative and formal messages are often classified wrongly when they appear
to proclaim offensive language but in fact merely report on the latest news.

Poolse nachtclub gebruikt foto van Hitler om volk te lokken.
Een Poolse nachtclub is op de vingers getikt omdat ze een foto van
Adolf Hitler gebruikte om volk te lokken. Op de prent poseert de
nazileider met zonnebril, terwijl onder hem een adelaar van het Derde
Rijk afgebeeld staat.

(A Dutch news item reporting that a Polish nightclub was reprimanded for using a photo of Hitler to attract visitors.)
6.2.3 Conclusion
In general, it can be stated that detecting offensive language is a very subjective and
challenging domain. It is clear that such a system is a good supporting tool, but is not
suited as a standalone application.
6.3 Future Improvements
We intended to build a dynamic and flexible system that can easily incorporate new
categories. In order to add a new class, two key additions need to be constructed. First,
a representative document set that is descriptive of the specific category should be
gathered and added to the general training set. Second, a word list needs to be created.
It is possible to partly extract the most informative words from the labeled set by
applying query expansion.
However, our system lacks the ability to improve and evolve together with the changing
nature of offensive language, in particular chat language. A potential improvement
would be to automatically suggest new words for our word lists, together with
automatically adjusting the values of already existing words. Although we implemented
the ability to extend our word lists with new words that often arise in flame messages,
this module has not been extensively tested.
To gradually adjust the behavior of the classifier, reinforcement learning could be a
helpful improvement. With feedback on certain messages, the training set or word lists
could be altered accordingly, and the system could thus be adjusted to comply with new
trends and practices.
Chapter 7
Conclusion
7.1 Main Conclusions
At the outset of this thesis we stated that we intend to detect offensive language
behavior in an automated manner. This study outlines two main techniques, query
expansion and text categorisation. The first technique is used to effectively retrieve
relevant documents from the corpus, while the second is utilized to separate irrelevant
messages from messages containing racist or sexually abusive language.
The corpus contains just over seven million messages, but relevant messages occur only
rarely. Analysis reveals that only 0.85% of the messages contain offensive language. Our
results show that by using query expansion we are able to increase recall and, when not
using overly large queries, also precision.
Based on the constructed training set we design two supervised learning methods. The
first, Naive Bayes, performs rather poorly, achieving a precision of 9% and a high recall
of 86%; the high recall is negated by the very low precision. The second supervised
classifier is a Support Vector Machine and is studied more extensively than Naive Bayes.
The SVM achieves a reasonable 69% precision and 62% recall. However, messages that
do not contain more than five features are ignored, since the SVM does not produce
good results on small messages.
To enhance the overall results we incorporate the reactions to a message into our
classification process. In this way, we can take into account the information reactions
may contain to support a decision on the message itself. Our study attempts to find
reactions that express outrage or contain abusive language of their own. To achieve this,
a classifier based on prefabricated word lists was created. This 'semantic' classifier
achieves a very high recall of 93% and a reasonable precision of 46% on our standard
validation set. Further analysis reveals that reactions only impact 0.1% of the total
decisions.
Lastly, to improve the overall results, a combination of methods is designed. This
combined method greatly improves the performance of the classification system, since
it is capable of using the strengths of both the SVM and the semantic classifier while
avoiding their weaknesses. Using this method we managed to combine a perfect precision
of 100% with a solid recall of 79%. Although these results are decent, they are not
representative of the overall corpus and certainly not of other corpora.
Offensive language detection is a young field that can be of great aid in automating
certain aspects of language monitoring. The fact remains that, up until today, machines
are not able to truly understand the real meaning and emotions language expresses.
Bibliography
[1] B. Jansen, A. Spink, and T. Saracevic. Real life, real users, and real needs: a study
and analysis of user queries on the web. Information Processing and Management,
36(2):207–227, 2000.
[2] Janyce Wiebe, Theresa Wilson, Rebecca Bruce, Matthew Bell, and Melanie Martin.
Learning subjective language. Computational Linguistics, 30(3):277–308, 2004.
[3] John Blitzer, Mark Dredze, and Fernando Pereira. Biographies, bollywood, boomboxes and blenders: Domain adaptation for sentiment classification. 2007.
[4] Danah M. Boyd and Nicole B. Ellison. Social network sites: Definition, history, and
scholarship. Journal of Computer-Mediated Communication, 13(1):210–230, 2007.
[5] John Broglio, James P. Callan, and W. Bruce Croft. Inquery system overview. In
Proceedings of a workshop on held at Fredericksburg, Virginia: September 19-23,
1993, TIPSTER ’93, pages 47–67. Association for Computational Linguistics, 1993.
[6] Kevyn Collins-Thompson and Jamie Callan. Query expansion using random walk
models. In Proceedings of the 14th ACM international conference on Information
and knowledge management, CIKM ’05, pages 704–711. ACM, 2005.
[7] Steve Cronen-Townsend, Yun Zhou, and W. Bruce Croft. A framework for selective
query expansion. In Proceedings of the Thirteenth International Conference on
Information and Knowledge Management, pages 236–237. ACM, 2004.
[8] Silviu Cucerzan and Eric Brill. Spelling correction as an iterative process that
exploits the collective knowledge of web users. In Proceedings of EMNLP, pages
293–300, 2004.
[9] Hang Cui, Ji-Rong Wen, Jian-Yun Nie, and Wei-Ying Ma. Probabilistic query
expansion using query logs. In Proceedings of the 11th international conference on
World Wide Web, WWW ’02, pages 325–332. ACM, 2002.
[10] D. Metzler, T. Strohman, H. Turtle, and W. B. Croft. Indri at TREC 2004: terabyte
track. In Text REtrieval Conference, 2004.
[11] Kushal Dave, Steve Lawrence, and David M. Pennock. Mining the peanut gallery:
opinion extraction and semantic classification of product reviews. In Proceedings of
the 12th international conference on World Wide Web, WWW ’03, pages 519–528,
New York, NY, USA, 2003. ACM.
[12] S. T. Dumais. Latent semantic indexing: Trec-3. In Proceedings of the Text REtrieval Conference (TREC-3), page 219. ACM, 1995.
[13] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. Liblinear: A
library for large linear classification. Journal of Machine Learning Research, 9:1871–
1874, 2008.
[14] George Forman. An extensive empirical study of feature selection metrics for text
classification. J. Mach. Learn. Res., 3:1289–1305, March 2003.
[15] D Garrett, DA Peterson, CW Anderson, and MH Thaut. Comparison of linear,
nonlinear, and feature selection methods for eeg signal classification. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 11:141 – 144, 2003.
[16] Andrew R. Golding and Dan Roth. A winnow-based approach to context-sensitive
spelling correction. Machine Learning, 34:107–130, 1999.
[17] Andrew R. Golding and Yves Schabes. Combining trigram-based and feature-based
methods for context-sensitive spelling correction. In Proceedings of the 34th annual meeting on Association for Computational Linguistics, ACL ’96, pages 71–78,
Stroudsburg, PA, USA, 1996. Association for Computational Linguistics.
[18] Andrew Gordon, Abe Kazemzadeh, Anish Nair, and Milena Petrova. Recognizing
expressions of commonsense psychology in english text. In Proceedings of the 41st
Annual Meeting on Association for Computational Linguistics - Volume 1, ACL
’03, pages 208–215, Stroudsburg, PA, USA, 2003. Association for Computational
Linguistics.
[19] Isabelle Guyon and André Elisseeff. An introduction to variable and feature selection. J. Mach. Learn. Res., 3:1157–1182, March 2003.
[20] Hadi Amiri, Abolfazl AleAhmad, Masoud Rahgozar, and Farhad Oroumchian.
Query expansion using Wikipedia concept graph. 2008.
[21] Marti A. Hearst and Jan O. Pedersen. Reexamining the cluster hypothesis: Scatter/gather on retrieval results. pages 76–84, 1996.
[22] Thorsten Joachims. Text categorization with support vector machines: learning
with many relevant features. In Claire Nédellec and Céline Rouveirol, editors, Machine
Learning: ECML-98, volume 1398 of Lecture Notes in Computer Science, pages
137–142. Springer Berlin / Heidelberg, 1998. 10.1007/BFb0026683.
[23] Keith N. Hampton, Lauren Sessions Goulet, Lee Rainie, and Kristen Purcell. Social
networking sites and our lives. 2010.
[24] S. Kotsiantis, I. Zaharakis, and P. Pintelas. Machine learning: a review of classification and combining techniques. Artificial Intelligence Review, 26:159–190, 2006.
10.1007/s10462-007-9052-3.
[25] Karen Kukich. Techniques for automatically correcting words in text. ACM Comput. Surv., 24(4):377–439, December 1992.
[26] T. K. Landauer and S. Dumais. Latent semantic analysis. Scholarpedia, 3(8):4356,
2008.
[27] Ken Lang. Newsweeder: Learning to filter netnews. In Proceedings of the Twelfth
International Conference on Machine Learning, pages 331–339, 1995.
[28] Victor Lavrenko and W. Bruce Croft. Relevance based language models. In Proceedings of the 24th annual international ACM SIGIR conference on Research and
development in information retrieval, SIGIR ’01, pages 120–127. ACM, 2001.
[29] Y. H. Li and A. K. Jain. Classification of text documents. The Computer Journal,
41(8):537–546, 1998.
[30] Yinghao Li, Wing Pong Robert Luk, Kei Shiu Edward Ho, and Fu Lai Korris Chung.
Improving weak ad-hoc queries using Wikipedia as external corpus. In Proceedings of
the 30th annual international ACM SIGIR conference on Research and development
in information retrieval, SIGIR '07, pages 797–798. ACM, 2007.
[31] Wei-Yin Loh. Classification and regression trees. Wiley Interdisciplinary Reviews:
Data Mining and Knowledge Discovery, 1(1):14–23, 2011.
[32] Robert M. Losee. How part-of-speech tags affect text retrieval and filtering performance. CoRR, cmp-lg/9602001, 1996.
[33] Julie B. Lovins. Development of a stemming algorithm. Technical report,
Massachusetts Institute of Technology, Electronic Systems Laboratory, Cambridge,
MA, 1968.
[34] M. K. Altaf Mahmud and Kazi Zubair Ahmed. Detecting flames and insults in text.
In Proceedings of the 6th International Conference on Natural Language Processing,
2008.
[35] I. Maglogiannis, K. Karpouzis, B.A. Wallace, and J. Soldatos, editors. Emerging Artificial Intelligence Applications in Computer Engineering - Real Word AI
Systems with Applications in eHealth, HCI, Information Retrieval and Pervasive
Technologies. IOS Press, 2007.
[36] Christopher D. Manning and Hinrich Schütze. An Introduction to Information Retrieval, pages 177–185. Cambridge University Press, 2009.
[37] Christopher D. Manning and Hinrich Schütze. An Introduction to Information Retrieval, pages 272–273. Cambridge University Press, 2009.
[38] Mitchell P. Marcus, Mary Ann Marcinkiewicz, and Beatrice Santorini. Building a
large annotated corpus of english: the penn treebank. Comput. Linguist., 19(2):313–
330, June 1993.
[39] Andrew McCallum and Kamal Nigam. A comparison of event models for naive
Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization,
1998.
[40] George A. Miller. WordNet: a lexical database for English. Communications of the
ACM, 38(11):39–41, 1995.
[41] Ruslan Mitkov, editor. The Oxford Handbook Of Computational Linguistics. Oxford
University Press, 2005.
[42] David L. Olson and Dursun Delen. Advanced Data Mining Techniques, page 138.
Springer Publishing Company, Incorporated, 1st edition, 2008.
[43] Iadh Ounis, Gianni Amati, Vassilis Plachouras, Ben He, Craig Macdonald, and
Douglas Johnson. Terrier information retrieval platform. In David Losada and
Juan Fernndez-Luna, editors, Advances in Information Retrieval, volume 3408 of
Lecture Notes in Computer Science, pages 517–519. Springer Berlin / Heidelberg,
2005.
[44] Bo Pang and Lillian Lee. Opinion mining and sentiment analysis. Found. Trends
Inf. Retr., 2(1-2):1–135, January 2008.
[45] Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up?: sentiment
classification using machine learning techniques. In Proceedings of the ACL-02 conference on Empirical methods in natural language processing - Volume 10, EMNLP
’02, pages 79–86, Stroudsburg, PA, USA, 2002. Association for Computational Linguistics.
[46] Fuchun Peng, Nawaaz Ahmed, Xin Li, and Yumao Lu. Context sensitive stemming for web search. In Proceedings of the 30th annual international ACM SIGIR
conference on Research and development in information retrieval, SIGIR ’07, pages
639–646, New York, NY, USA, 2007. ACM.
[47] Yonggang Qiu and Hans-Peter Frei. Concept based query expansion. In Proceedings
of the 16th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’93, pages 160–169. ACM, 1993.
[48] Amir Razavi, Diana Inkpen, Sasha Uritsky, and Stan Matwin. Offensive language
detection using multi-level classification. In Advances in Artificial Intelligence, volume 6085 of Lecture Notes in Computer Science, pages 16–27. Springer Berlin /
Heidelberg, 2010.
[49] Jason D. M. Rennie, Lawrence Shih, Jaime Teevan, and David R. Karger. Tackling
the poor assumptions of naive Bayes text classifiers. In Proceedings of the Twentieth
International Conference on Machine Learning, pages 616–623, 2003.
[50] Philip Resnik. Semantic similarity in a taxonomy: An information-based measure and its application to problems of ambiguity in natural language. CoRR,
abs/1105.5444, 2011.
[51] Luke Richards. Social media in asia: Understanding the numbers., May 2012.
[52] Ellen Riloff and Janyce Wiebe. Learning extraction patterns for subjective expressions. In Proceedings of the 2003 conference on Empirical methods in natural
language processing, EMNLP ’03, pages 105–112, Stroudsburg, PA, USA, 2003.
Association for Computational Linguistics.
[53] Ellen Riloff, Janyce Wiebe, and Theresa Wilson. Learning subjective nouns using
extraction pattern bootstrapping. In Proceedings of the seventh conference on Natural language learning at HLT-NAACL 2003 - Volume 4, CONLL ’03, pages 25–32,
Stroudsburg, PA, USA, 2003. Association for Computational Linguistics.
[54] I. Rish. An empirical study of the naive Bayes classifier. In IJCAI 2001 Workshop
on Empirical Methods in Artificial Intelligence, pages 41–46, 2001.
[55] J. J. Rocchio. The SMART Retrieval System: Experiments in Automatic Document
Processing, pages 313–323. Prentice Hall, 1971.
[56] C. Romm-Livermore and K. Setzekorn. Social Networking Communities and EDating Services: Concepts and Implications. IGI Global, 2008.
[57] Neil Rubens. The application of fuzzy logic to the construction of the ranking function of information retrieval systems. Computer Modelling and New Technologies,
10(1):20–27, 2006.
[58] Sarah Schrauwen. Machine learning approaches to sentiment analysis using the
dutch netlog corpus. Master’s thesis, University of Antwerp, 2010.
[59] Frédérique Segond, Anne Schiller, Gregory Grefenstette, and Jean-Pierre Chanod.
An experiment in semantic tagging using hidden Markov model tagging. In ACL/EACL
Workshop on Automatic Information Extraction and Building of Lexical Semantic
Resources for NLP Applications, pages 78–81, 1997.
[60] Ellen Spertus. Smokey: automatic recognition of hostile messages. In Proceedings
of IAAI, pages 1058–1065, 1997.
[61] Pero Subasic and Alison Huettner. Affect analysis of text using fuzzy semantic
typing. In The Ninth IEEE International Conference on Fuzzy Systems, FUZZ IEEE
2000, volume 2, pages 647–652, 2000.
[62] Luis Talavera. An evaluation of filter and wrapper methods for feature selection
in categorical clustering. In A. Famili, Joost Kok, José Peña, Arno Siebes, and
Ad Feelders, editors, Advances in Intelligent Data Analysis VI, volume 3646 of
Lecture Notes in Computer Science, pages 742–742. Springer Berlin / Heidelberg,
2005.
[63] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery.
Numerical Recipes 3rd Edition: The Art of Scientific Computing, chapter 16, pages
883–885. Cambridge University Press, 2007.
[64] William H. Press, Saul A. Teukolsky, William T. Vetterling, and Brian P. Flannery.
Numerical Recipes 3rd Edition: The Art of Scientific Computing, chapter 16, page
890. Cambridge University Press, 2007.
[65] Mike Thelwall. Fk yea i swear: cursing and gender in myspace. Corpora, 3(1):83–
107, 2008.
[66] Simon Tong and Daphne Koller. Support vector machine active learning with applications to text classification. J. Mach. Learn. Res., 2:45–66, March 2002.
[67] Ellen M. Voorhees. Query expansion using lexical-semantic relations. In Proceedings
of the 17th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR ’94, pages 61–69. Springer-Verlag New York,
Inc., 1994.
[68] Jinxi Xu and W. Bruce Croft. Query expansion using local and global document
analysis. In Proceedings of the 19th annual international ACM SIGIR conference on
Research and development in information retrieval, SIGIR ’96, pages 4–11. ACM,
1996.
[69] Jinxi Xu and W. Bruce Croft. Corpus-based stemming using cooccurrence of word
variants. ACM Trans. Inf. Syst., 16(1):61–81, January 1998.
[70] Jinxi Xu and W. Bruce Croft. Improving the effectiveness of information retrieval
with local context analysis. ACM Trans. Inf. Syst., 18:79–112, January 2000.
[71] Zhi Xu and Sencun Zhu. Filtering offensive language in online communities using
grammatical relations. In Seventh annual Collaboration, Electronic messaging, AntiAbuse and Spam Conference, 2010.
[72] Jeonghee Yi, Tetsuya Nasukawa, Razvan Bunescu, and Wayne Niblack. Sentiment
analyzer: extracting sentiments about a given topic using natural language processing
techniques. In IEEE International Conference on Data Mining (ICDM), pages
427–434, 2003.
[73] Hong Yu and Vasileios Hatzivassiloglou. Towards answering opinion questions: separating facts from opinions and identifying the polarity of opinion sentences. In
Proceedings of the 2003 conference on Empirical methods in natural language processing, EMNLP ’03, pages 129–136, Stroudsburg, PA, USA, 2003. Association for
Computational Linguistics.
[74] M. Zhu. Recall, precision and average precision. Technical report, University of
Waterloo, 2004.
[75] Liron Zighelnic and Oren Kurland. Query-drift prevention for robust query expansion. In Proceedings of the 31st annual international ACM SIGIR conference on
Research and development in information retrieval, SIGIR ’08, pages 825–826, New
York, NY, USA, 2008. ACM.
List of Figures

2.1  Distribution of messages containing offensive language. . . . 15
2.2  Distribution of different classes in training set. . . . 16
3.1  Example of Rocchio classification. Obtained from [36]. . . . 22
3.2  SVM hyperplane construction in the idealized case of linearly separable data. Obtained from [63]. . . . 32
3.3  Feature vectors are mapped from a two-dimensional space to a three-dimensional embedding space. Obtained from [64]. . . . 32
4.1  Graphical representation of this study's approach. . . . 37
5.1  Recall comparison. . . . 45
5.2  Precision comparison. . . . 46
5.3  Precision-recall comparison. . . . 46
5.4  Average precision comparison. . . . 47
5.5  Number of relevant documents when 1000 documents are retrieved. . . . 48
5.6  Average precision comparison. . . . 48
5.7  Evolution of SVM performance on validation set. . . . 52
5.8  Difference in performance when splitting the validation set into a set with small messages and one with the larger messages. . . . 53
5.9  Snapshot of precision results that shows the difference between the small validation set (messages that contain fewer than 20 features) and the large validation set (solely messages that contain at least 20 features). . . . 54
5.10 Comparison of SVM performance when removing messages under a certain threshold in both training and validation set. . . . 54
5.11 Precision and recall comparison of different stages of our Support Vector Machine. . . . 55
5.12 Comparison of performance of different semantic classifiers on validation set. . . . 56
5.13 Classification of messages solely based on the reactions to the message. . . . 57
5.14 Comparison between different single classifiers. . . . 58
5.15 Comparison of precision of both classifiers running on 'small' validation set. . . . 59
5.16 Comparison of all implemented classification methods. . . . 61
List of Tables
2.1  Distribution of validation set. . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2  Distribution of different classes in training set. . . . . . . . . . . . . . . . . 15
4.1  Random entries from our word lists. . . . . . . . . . . . . . . . . . . . . . 39
5.1  Expanded query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2  Performance of Naive Bayes. . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3  Results of classifier after running on validation set. Precision drops to 6%
     due to the ratio of true positives to false positives. . . . . . . . . . . . . . 51
5.4  Results of classifier after running on randomly selected set. . . . . . . . . 51
5.5  Results of classifier after improving the training set. . . . . . . . . . . . . 52
5.6  Results of classifier after modifying validation set. . . . . . . . . . . . . . 52
5.7  Results of Naive Bayes classifier on the modified validation set. . . . . . . 58
5.8  Results of classifiers running on 'small' validation set. . . . . . . . . . . . 59
5.9  Results of combination of different classifiers on validation set. . . . . . . 60
5.10 Distribution of messages among the three classes: irrelevant, sexual and
     racist. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
5.11 Impact of different situations. . . . . . . . . . . . . . . . . . . . . . . . . . 62