Using improved generalizable features in domain adaptation for cross-domain sentiment classification
Mark Mooij 0146196
Master Thesis in Artificial Intelligence
Web Information Retrieval
08/22/12
Using improved generalizable features in
domain adaptation for cross-domain sentiment
classification
Supervisor: Wouter Weerkamp, ISLA, University of Amsterdam, [email protected]
Mark Mooij
[email protected]
Intelligent Systems Lab Amsterdam
University of Amsterdam
Science Park 904, 1098 XM Amsterdam
P.O. Box 94323, 1090 GH Amsterdam
The Netherlands
Abstract
Sentiment classification does not aim to identify the topic or factual content of a document, but aims to
classify sentiments expressed in a text. The general sentiment expressed by the author towards a subject
is identified by analyzing text-features within the document. Traditional machine learning techniques
have been applied with reasonable success, but a severe decline in classification accuracy of such
techniques has been shown when there is no proper match between the training and test-data with
respect to the topic of the document (domain). Traditional approaches have favored domain-specific
supervised machine learning techniques such as Naive Bayes, Maximum Entropy and Support Vector
Machines. When building models to classify the sentiment of texts using machine learning techniques,
extensive sets of labeled training-data from the same domain as the intended target text are often
required. Gathering and labeling new datasets whenever a model is needed for a new domain is often
time-consuming and difficult, especially if a dataset with labels is not available. This thesis will explore
the difficulties experienced with current favored techniques (with regard to cross-domain adaptation)
and propose a solution to this problem by offering a model with which high sentiment classification
accuracy can be achieved without the aid of labeled datasets for the target domain. The problem of
transferring domain-specific knowledge will be addressed by making use of an adapted form of the
Naive Bayes classifier introduced by Tan et al. (2009). The adapted form of the Naive Bayes classifier
uses generalizable features from both source and target domain. Generalizable features are features
which are consistently used between domains and can thus serve as a bridge between the domains. The
generalizable features are explored in three variations: using frequently co-occurring entropy measures
(FCE), using a subjectivity lexicon as generalizable features and using an importance measure in
combination with FCE to select improved generalizable features. Based on the selected generalizable
features a domain transfer can be made between source and target domain using the EM-algorithm. EM
is used to estimate the Naive Bayes probabilities in order to maximize a model with which
unsupervised, cross-domain sentiment classification can be achieved.
Acknowledgements
First of all I would like to thank my main thesis supervisor Wouter Weerkamp for his support and
comments; after each round of comments I really felt this thesis was moving further in the right
direction. Next, I would like to thank all the other people from the University who helped me along the way:
Maarten de Rijke for helping me state my research questions, and Manos Tsagkias and Valentin Jijkoun for
pointing me in the right direction to get started. During my thesis I had the constant support of my fellow
student and long-time friend Bruno Jakic, whose creative mind always sheds light on newly arising problems.
Next, I would like to thank my girlfriend for helping me with structure and grammar and for being of constant
support. Last but not least I would like to thank my father, who supported me not only during my thesis but
during my entire academic career. Finally, this thesis has been a long process and a journey in the sense that
I have been working on it in numerous locations around the world, which this thesis will always remind me
of. Special thanks go out to my diving friends with whom I visited Egypt and who were of constant
dubious support when I tried to work in between dives on board the MY Rosetta.
Table of Contents
Chapter 1 Introduction...............................................................................................................................6
Research Questions...............................................................................................................................8
Chapter 2 Background.............................................................................................................................10
2.1 Identifying sentiment.....................................................................................................................10
2.2 Sentiment analysis.........................................................................................................................10
2.3 Sentiment classification.................................................................................................................11
2.3.1 Polarity.......................................................................................................................................11
2.3.2 Features......................................................................................................................................12
2.3.3 Domain dependence...................................................................................................................13
2.4 Approaches....................................................................................................................................13
2.4.1 Subjectivity lexicon based methods...........................................................................................13
2.4.2 Machine learning techniques......................................................................................................14
2.5 Cross-Domain Sentiment Classification.......................................................................................14
2.6 This research..................................................................................................................................15
Chapter 3 Models.....................................................................................................................................17
3.1 Generalizable features selection....................................................................................................17
3.1.1 Frequently Co-occurring Entropy..............................................................................................17
3.1.2 Using a subjectivity lexicon as generalizable features...............................................................19
3.1.3 Using an importance measure in combination with FCE to select generalizable features.........21
3.2.1 Naive Bayes classifier................................................................................................................22
3.2.2 Adapted Naive Bayes Classifier.................................................................................................24
3.2.3 Features......................................................................................................................................26
3.3 Lexicon-based classifier................................................................................................................26
Chapter 4 Experimental Setup.................................................................................................................28
4.1.1 Datasets......................................................................................................................................28
4.1.2 Movie review dataset..................................................................................................................29
4.1.3 Movie-scale-review dataset........................................................................................................30
4.1.4 Review-plus-rating dataset ........................................................................................................30
4.1.5 Multi-domain sentiment dataset.................................................................................................31
4.1.6 Twitter-sentiment-dataset...........................................................................................................33
4.1.7 Restaurant reviews dataset.........................................................................................................33
4.2 Data Processing.............................................................................................................................34
4.3 Experiments...................................................................................................................................36
4.3.1 Baseline in-domain experiment..................................................................................................37
4.3.2 Baseline cross-domain experiment.............................................................................................38
4.3.3 Cross-domain experiment using different generalizable features..............................................38
Chapter 5 Results.....................................................................................................................................40
5.1 In-domain baseline experiments....................................................................................................40
5.2 Cross-domain baseline experiments..............................................................................................42
5.3.1 Adapted Naive Bayes classifier using FCE................................................................................44
5.3.2 Adapted Naive Bayes classifier using subjectivity lexicon........................................................46
5.3.3 Adapted Naive Bayes classifier using improved generalizable features....................................48
Chapter 6 Results analysis.......................................................................................................................51
6.1 In-domain baseline experiments....................................................................................................51
6.1.2 Naive Bayes Classifier...............................................................................................................51
6.1.3 Subjectivity lexicon classifier....................................................................................................57
6.2 Cross-domain baseline experiments..............................................................................................58
6.2.1 Naive Bayes classifier................................................................................................................58
6.2.2 Subjectivity lexicon classifier....................................................................................................61
6.3.1 Adapted Naive Bayes using FCE...............................................................................................62
6.3.2 Adapted Naive Bayes using subjectivity lexicon.......................................................................69
6.3.3 Adapted Naive Bayes using improved generalizable features...................................................74
Chapter 7 Conclusions and future work...................................................................................................83
References................................................................................................................................................85
Chapter 1
Introduction
In recent years, increasingly large amounts of information have become available in the form of online
documents. Organizing this information from a user-perspective is an ongoing challenge. A lot of
research has focused on the problem of automatic categorization of texts by topic or subject matter (e.g.
sports or politics). However, recent years have shown a rapid growth in the general interest in on-line
discussion groups and review sites. A popular example of a review site is Amazon.com in which
categorized products are presented, combined with user-generated articles reviewing these products. A
crucial characteristic of this kind of user-generated information is the article's sentiment, or overall
opinion towards the subject matter - for example, whether a product review is generally positive or
negative. User generated sentiment expressions can also be found in settings such as personal blogs,
political debate, twitter messages, and newsgroup message boards. This is part of an ongoing trend in
which people increasingly participate in online sentiment expression. This trend has led to many
interesting and challenging research topics such as subjectivity detection, semantic orientation
classification and review classification. Sentiment analysis aims to address complex issues such as the
identification of the object of the sentiment, the detection of mixed and overlapping sentiments in a text
and the identification and interpretation of sarcasm. In practice however, most research has been
concerned with classifying the overall polarity of a document. In other words: is the writer expressing a
positive or a negative opinion? This task is called sentiment classification and has proved to be
complicated and challenging enough. Therefore, the primary focus of this thesis lies within the field of
general sentiment classification: identifying the overall polarity of sentiment in different kinds of
documents.
Sentiment classification can be a useful application for message filtering; for example, it might be
useful to use sentiment information to recognize and discard “flames” (Spertus, 1997). Sentiment
classification also offers valuable applications for business intelligence and recommender systems
(Terveen et al., 1997; Tatemura, 2000). In these applications texts like customer reviews, customer
feedback, survey responses, newsgroups postings, weblog pages and message boards are automatically
processed in order to extract summarized (sentiment) information. Marketeers can use this information
to analyze customer satisfaction or public opinions of a company and/or products, making it possible to
perform in-depth analysis of market trends and consumer opinions (Dave et al., 2003). Based on these
insights products can be adapted, user feedback can be studied, and marketing strategies can be
optimized. Organizations can also use this information to gather critical feedback about problems in
newly released products or to benchmark their provided services. For this purpose, accurate systems for
sentiment classification are needed that are able to interpret free-form texts given in natural language
format. Human processing of such text volumes is expensive. The only realistically feasible manner of
processing such amounts of data is automated clustering, summarization and classification of this data
based on the expressed sentiment.
Sentiment classification does not only offer useful applications for manufacturers and marketeers;
consumers can use sentiment classification to research product quality or services before making an
informed purchase. Individuals might want to consider other people's opinions when purchasing a
product or deciding to use a particular service. Algorithms which can automatically classify other
people's opinions as positive or negative would make it possible to develop new search tools (Hearst,
1992) which would facilitate summarized comparison. These search tools would enable customers to
browse through clustered search results according to the expressed sentiment regarding a specific topic.
Apart from the aforementioned product reviews, such a search tool might provide useful information
with regard to sentiment classification on broader subjects such as for example providing detailed
travel destination statistics.
The rapid growth of the internet as a medium allows people to express opinions on any given subject.
Classifying the sentiment of these documents has proved to be a challenge, as sentiment can be
expressed in very different ways for different subjects (domains) (Engström, 2004). Turney (2002) noted
that the expression “unpredictable” might have a positive sentiment in a movie review (e.g. “The movie
has an unpredictable plot.”), but could be negative in the review of an automobile (e.g. “The car's
performance suffers from unpredictable steering.”). For this example “unpredictable” is a very domain-specific expression denoting different sentiment for different domains. In this example “great” could
have alternatively been used to express positive sentiment across the different domains, making “great”
a domain-independent expression. To solve domain dependency, different approaches have been
attempted. Rule-based approaches cannot adapt automatically to domain characteristics; specific rules
would have to be provided for each domain. Lexicon-based approaches depend on lexical resources for
which the polarity of words can differ depending on the context and domain they are used in.
Traditional approaches favor domain-specific supervised machine learning techniques. However,
previous work has revealed that classifiers trained using machine learning techniques lose accuracy
when the test-data distribution is significantly different from the training-data distribution (Aue and
Gamon, 2005; Whitehead and Yaeger, 2009). A difference in distribution occurs when there is no
proper match between the training and test-data with respect to the topic of the document. For example,
a classifier trained on movie reviews is unlikely to perform with similar accuracy on reviews of
automobiles. Therefore, building models to classify the sentiment of texts using machine learning
techniques requires extensive sets of labeled training-data from the same domain as the intended target
text. Labeled training-data would be an example document (e.g. a review about a movie) combined
with the overall expressed sentiment in the form of a label. Gathering and labeling new datasets
whenever a model is needed for a new domain is often time-consuming and difficult, especially if a
dataset with labels is not available. The obvious solution to perform sentiment analysis within a new
domain is to collect data, annotate the data and train a domain-specific classifier. Acquiring sufficient
amounts of data in itself can be a difficult task. The data for the domain at hand might be limited (e.g.
consider conducting sentiment research in the domain of tennis-rackets). When enough data is available
the extraction of this data might be another challenging task. The data might be available from different
data sources, and in different formats and layouts (e.g. a number of internet-forums discussing pros and
cons of tennis-rackets). Even extracting all comments specific to the reviewed product-feature within a
consistent format might pose a challenging task in itself. Annotating the data poses another challenge:
in order to create high-quality annotations at least two annotators have to analyze considerable amounts of
data to develop a training set suitable for use with a classifier. For certain more specific domains it
might be hard to find annotators who have enough expertise to create proper annotations. The
context-dependent nature of sentiment makes the classification task rather difficult even for humans.
The level of annotator agreement for labeled training-data can be quite low (Ghosh and Koch; Rothfels
and Tibshirani, 2010). This would imply that, with widely-varying domains, researchers and engineers
who build sentiment classification systems need to collect and label training-data for each new domain
of interest. Even in the specific sub-problem of market analysis, if automated sentiment classification
were to be used across a wide range of domains, the effort to label training-data for each domain may
become prohibitive, especially since product features can change over time.
The ability to analyze sentiments across different domains and provide solutions for domain-independent sentiment classification has become crucial. Ideally one would like to be able to quickly
and cheaply customize a model to provide accurate sentiment classification for any domain. Due to the
complicating factors, the amount of manually labeled training-data available for machine learning is
limited to specifically studied domains. It would be quite useful if the knowledge gained from one set
of training-data could be transferred for use in another domain by performing a domain-adaptation.
Therefore, in this thesis the problem of transferring domain-specific knowledge is addressed. This
problem is addressed by using an adapted form of the Naive Bayes classifier to maximize a model
which allows for unsupervised sentiment classification without the need of labeled data for the target
domain.
Research Questions
This thesis repeats a strategy proven in previous work (Tan et al., 2009) and suggests two new
strategies, all of which maximize a Naive Bayes model that provides accurate cross-domain
sentiment classifications. This approach removes the need for labeled training-data for a target domain,
using only the labeled training-data from the source domain and the data-distribution of the target
domain.
The strategy repeated in this thesis uses frequently co-occurring entropy measures (FCE) to calculate
generalizable features over source and target domain in order to perform a domain-adaptation. The
experiments in this thesis will show that repeating the strategy from the work by Tan et al. (2009),
which was performed on Chinese datasets, on English datasets does not result in the generalizable
features that are to be expected. The method of selecting generalizable features for English
datasets is therefore open for improvement, and two improvements on selecting generalizable features are
suggested in this work. The first improvement is the use of a subjectivity lexicon instead of
frequently co-occurring entropy measures to select generalizable features. The second improvement
utilizes an importance measure in combination with frequently co-occurring entropy in order
to improve the selection of generalizable features.
Based on the generalizable features an adapted Naive Bayes model is maximized using the EM
algorithm. The result is a maximized Naive Bayes model capable of unsupervised, cross-domain sentiment classification.
The research in this thesis aims to answer the following questions:
1. Does maximizing a Naive Bayes model with EM using FCE to select generalizable features
improve cross-domain sentiment classification in comparison to baseline cross-domain
methods?
2. Does maximizing a Naive Bayes model with EM using a subjectivity lexicon as generalizable
features improve cross-domain sentiment classification in comparison to baseline methods and
to using FCE to select generalizable features?
3. Does maximizing a Naive Bayes model with EM using a combination of FCE and an
importance measure to select generalizable features improve cross-domain sentiment
classification in comparison to baseline methods, to using FCE to select generalizable features, and
to using a subjectivity lexicon as generalizable features?
To evaluate the performance of the different approaches suggested in this thesis a wide variety of
datasets is collected. The datasets originate from different domains and vary in size and balance (the
number of positive examples versus the number of negative examples available). Analyzing the
different conducted experiments will provide insight into the effects of size and balance on the Naive
Bayes classifier. In order to answer the research questions three main experiments will be set up. The
first experiment will evaluate the performance of the standard Naive Bayes classifier for in-domain
sentiment classification using a wide variety of datasets. This experiment will establish the performance
of in-domain training and classification as a baseline to compare cross-domain performance against.
Establishing in-domain performance is important to research whether it is possible to use the different
gathered datasets consistently in terms of preprocessing, training and evaluating, while performance
stays comparable to previous research. The second experiment will analyze naive cross-domain
classification of an evaluation-set from one domain with classifiers trained in other domains, for a wide
variety of datasets. This experiment will establish how much the performance of the Naive Bayes
classifier decreases from in-domain to cross-domain. The third experiment will evaluate the three
different approaches to selecting generalizable features for maximizing a Naive Bayes model with EM,
thereby answering the research questions.
In the next chapter we will take a look at related work, providing an overview of the work done in
sentiment classification. In the third chapter the Naive Bayes classifier is introduced and the adapted
form of the Naive Bayes classifier suggested for this research is discussed. The fourth chapter will
explain the data and experimental setup. In the fifth chapter the results for the three different
experiments will be presented. The sixth chapter will evaluate the experiments and discuss the results.
Finally, in the seventh chapter conclusions will be drawn from the findings and future work will be
discussed.
Chapter 2
Background
2.1 Identifying sentiment
In everyday life people use different kinds of communication-methods to convey different kinds of
messages. Communication can be verbal or non-verbal, and verbal communication can be subdivided
into oral communication (e.g. talking directly to each other or talking to each other by telephone) or
written communication (e.g. sending a letter or an e-mail). Communication has different functions,
which can be divided into objective (facts, information) and subjective categories (opinions, feelings).
The specific form of subjective written communication with which this thesis is concerned is the
polarity expressed in subjective texts (like product-reviews). An opinion, or sentiment about an object,
is a subjective belief and is the result of an emotion and/or the interpretation of facts by an individual.
An opinion may be supported by arguments (objective information or subjective information), but often
people draw opposed opinions from the same facts. An example of a sentence stating a sentiment can
be:
“Peter loves using his I-Pad.”
This sentence expresses Peter’s sentiment towards using an I-Pad. The statement itself is neither true nor
untrue; no truth value is implied in the expressed sentiment.
An overview of the definitions linked to the notion of sentiment is provided by Pang and Lee (2008).
To capture the notion of sentiment in sentiment analysis research, sentiment is defined in terms of
polarity combined with the target of a particular sentiment. The field of sentiment analysis or opinion
mining aims to determine the attitude of an individual (a speaker or a writer) with respect to a topic
or the overall contextual polarity of a source material. An attitude can be the
writer's judgment or evaluation, affective state (the emotional state of the author when writing), or the
intended emotional communication (the emotional effect the author wishes to communicate, as is
communicated in persuasive texts for example).
2.2 Sentiment analysis
The broad research field of sentiment analysis can also be referred to as subjectivity analysis, opinion
mining and appraisal extraction. It is one of the applications of natural language processing (NLP),
computational linguistics and text analysis with the aim to identify and extract subjective information
in source materials. In sentiment analysis subjective elements in source materials are detected and
studied. Subjective elements are defined as “linguistic expressions of private states in context” (Wiebe
et al., 2004). Private states can be represented as single words, word-phrases or sentences. The task of
polarity detection aims to detect polarity in documents: does a single word, word-phrase or
sentence express a positive, a negative or a neutral sentiment (i.e. no sentiment at all)? Another sub-task of
sentiment analysis is subjectivity detection; the task of identifying subjective words, expressions,
and/or sentences as opposed to objective words, expressions, and/or sentences (Wiebe et al., 1999;
Hatzivassiloglou and Wiebe, 2000; Riloff et al., 2003). In order to properly identify a complete text as
subjective or objective, syntactic analysis on each individual sentence needs to be performed. This
analysis often focuses on adjectives and adverbs in sentences to determine the polarity of the sentences
(Hatzivassiloglou and Wiebe, 2000; Wiebe et al., 2004). Detecting explicit subjective sentences such as:
“I think this is a beautiful MP3-player” is the primary focus of sentiment analysis. However, subjective
statements can also be voiced in an implicit manner: “The earphone broke in two days” (Liu, 2006).
Even though the last sentence does not explicitly verbalize sentiment, the implicit sentiment expressed
in this sentence is a negative one. Detecting implicit sentiment proves to be a more challenging task.
Apart from the aforementioned tasks, even more complex problems have been studied in other
sentiment research. Following the initial research by Pang et al. (2002), in which a later widely studied
dataset was created, the study of movie reviews has become a prominent domain in sentiment analysis.
Consequently, sentiment analysis has extended to a wider variety of domains, ranging from stock
message boards to congressional floor debates (Das and Chen, 2001; Thomas et al., 2006). Research
results have been deployed industrially in systems that gauge market reaction and summarize opinions
from web pages, discussion boards, and blogs. Bethard et al. (2004), Choi et al. (2005) and Kim and
Hovy (2006) performed research in identifying the source of the opinions expressed in sentences using
different techniques. Wilson et al. (2004) focused on the strength of expressed opinions, classifying
strong and weak opinions. Chklovski (2006) presented a system that aggregates and quantifies degree
assessment of opinions on web pages. Going one step further than document level Hu and Liu (2004)
and Popescu and Etzioni (2005) concentrated on mining and summarizing reviews by extracting
opinionated sentences specified to certain product features.
2.3 Sentiment classification
Sentiment classification is a sub-task of sentiment analysis that aims to determine the positive or
negative sentiment of expressions (Hatzivassiloglou and McKeown, 1997; Turney, 2002; Esuli and
Sebastiani, 2005). The classification of these features can be used in order to determine the overall
contextual polarity of documents. Features can be defined as single words, word-phrases or sentences.
The classification of the overall contextual sentiment is mostly applied to reviews, where automated
systems assign a positive or negative sentiment for a whole review document (Pang et al., 2002;
Turney, 2002). It is common in sentiment classification to use the output of one feature-level as input to
higher feature levels to allow for more complex context-related analysis. For example, first sentiment
classification can be applied to words, after which the outcome of this analysis can be utilized to
evaluate word-phrases, followed by the evaluation of sentences and so on (Turney and Littman, 2003;
Dave et al., 2003). Sentiment of phrases and sentences specifically has been studied by Kim and Hovy
(2004) and Wilson et al. (2005). Different techniques show optimal accuracy on different levels. On
word or phrase level, n-gram classifiers and lexicon based approaches are commonly used. On sentence
level, Part-Of-Speech tagging (POS-Tag) is a frequently used method.
2.3.1 Polarity
The classification of sentiment is often expressed in terms of polarity. In the field of sentiment analysis
different approaches can make use of different notions of polarity. A number of researches split polarity
classification into a binary problem: positive versus negative polarity (Pang et al., 2002; Pang and Lee,
2004). A stated opinion can be classified in either of these two categories. As such opinionated
documents are labeled as either expressing an overall positive or an overall negative tonality. A lot of
research in this area has focused on the sub-domain of product reviews (Turney, 2002; Pang et al.,
2002; Dave et al., 2003; Hu and Liu, 2004; Popescu and Etzioni, 2005) and movie reviews (Turney,
2002; Pang et al., 2002). Difficulties in binary sentiment classification tasks lie in the fact that the
overall sentiment sometimes is expressed in subtle details:
“The acting was good, the cinematography was well done, the story was well built, but overall the
movie was a major disappointment.”
As a whole, this sentence seems to convey a positive sentiment when viewing the number of positive
adverbs used. Yet, the last part of the sentence gives the complete sentence an overall negative tonality.
Contrary to approaching sentiment polarity challenges as a binary problem, sentiment polarity can also
be addressed by splitting it into three categories: positive, negative and neutral. In more complex (i.e.
longer) documents classifying features as “neutral” can improve classification accuracy. In binary
classifications of such documents, each feature necessarily contributes to the classification. Neutral and
informative statements will also be included and labeled as either positive or negative. Using a neutral
category therefore allows for the omission of such expressions (Wilson et al., 2005).
Polarity can also be represented as a range, for example assigning “stars” to a product. This is a popular
method used by different companies to gather user feedback on products or provided services.
Assigning “stars” is often applied for rating a product or service on a scale from 1-5 “stars”. The
amount of “stars” assigned corresponds to the degree of appreciation for the product or service.
Viewing the problem as a range, the classification task becomes a multi-class text categorization
problem (Pang and Lee, 2005). Assigning ranges allows for making distinctions between the strength of
a positive or negative sentiment towards a certain object.
2.3.2 Features
In sentiment classification different results in terms of accuracy can be achieved using various
techniques based on assigning semantic scores to different features types. Features can be n-grams (1
gram, 2 gram), word-phrases like POS-Tags or sentences. The features are assigned a value which
denotes the semantic orientation of the feature. At different levels, different features can be beneficial
in optimizing classification accuracy. In the related scientific field of document clustering the term
frequency (TF) is used to calculate the Term Frequency - Inverse Document Frequency (TF-IDF) score
to distinguish the features which are most informative in describing the document (Jones, 1972). While
sentiment classification proves to be a more challenging task in comparison to document clustering, a
principle similar to TF-IDF can also be applied. Following the TF-IDF principle higher informative
value is assigned to unique features opposed to frequent features. Alternatively, Pang and Lee (2002)
improve the performance of their sentiment classification system using feature presence instead of
feature frequency and perform experiments with combinations of features: 1 gram features combined
with 2 gram features and 1 gram features combined with POS-Tags.
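To make the distinction between feature frequency and feature presence concrete, the short sketch below builds both representations for a tokenized review. It is an illustrative sketch only (simple whitespace tokenization, no stop-word removal), not the exact setup used in the cited work.

```python
from collections import Counter

def frequency_features(tokens):
    """Feature frequency: map each term to the number of times it occurs in the document."""
    return dict(Counter(tokens))

def presence_features(tokens):
    """Feature presence: map each term to 1, regardless of how often it occurs."""
    return {term: 1 for term in set(tokens)}

review = "great acting , great cast , but a weak ending".split()
print(frequency_features(review))  # e.g. {'great': 2, 'acting': 1, ',': 2, ...}
print(presence_features(review))   # e.g. {'great': 1, 'acting': 1, ',': 1, ...}
```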
Next to feature presence and feature frequency, the relation between the position of features in a
document and the expressed sentiment has been studied. Pang and Lee (2002) and Kim and Hovy
(2006) extend their methods with information about the feature-position into a feature vector to
increase sentiment classification accuracies. Wiebe et al. (2004) select n-grams based on training on
annotated documents in order to find the n-grams that maximize classification accuracy.
Another feature type which has proved beneficial in improving classification accuracy is POS-Tags. As
mentioned before, adjectives and adverbs can be good indicators of sentiment in text. Adjective and
adverb indicators have been studied by Hatzivassiloglou and Wiebe (2000) and Benamara et al. (2007)
among others. Turney (2002) has shown that part-of-speech patterns can help sentiment classification
especially on a document level.
Dealing with negated features continues to be a challenge in sentiment classification. Although n-grams
are able to capture negation to some extent, difficulties in sentiment analysis tasks often arise when
negations are present in the text. Following the most standard bag-of-words approach (considering a
document just as a list of all the words the document consists of), sentences like: “I like the movie.”
and “I don’t like the movie.” can be processed (and classified) in a very similar way as the difference
between the sentences consists of only one single word. An approach using n-grams (with n>1) would
be able to identify the second sentence as revealing negative sentiment, but it increases the amount of
training data necessary to cover such patterns. However, even the use of n-grams would not pose a solution when such a
sentence would be rephrased as “I thought I would like the movie, but in the end I didn’t”, in which the
negation has become so far removed from the initial statement that n-grams would not be able to
realistically capture it. The approach by Pang and Lee (2008), amongst others, includes negation in
document representation by appending this to the relevant term (e.g. “like-NOT”), but this approach
results in incorrect behavior in sentences like: “No wonder everyone loves it”. The most successful
approach towards solving these issues proved to be the utilization of Part-of-Speech techniques,
because POS-Tags are able to identify the syntactic structure of a sentence.
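A minimal sketch of the negation-marking idea described above, in the spirit of the “like-NOT” representation: every token between a negation word and the next punctuation mark receives a NOT marker. The negation word list and the “-NOT” suffix are simplifications chosen for illustration and do not reproduce any particular published implementation.

```python
import re

NEGATION_WORDS = {"not", "no", "never", "don't", "didn't", "doesn't", "isn't", "wasn't"}

def mark_negation(text):
    """Append '-NOT' to every token between a negation word and the next punctuation mark."""
    tokens = re.findall(r"[\w']+|[.,;:!?]", text.lower())  # words and punctuation as separate tokens
    marked, negating = [], False
    for token in tokens:
        if token in ".,;:!?":
            negating = False                     # punctuation ends the negation scope
            marked.append(token)
        elif token in NEGATION_WORDS:
            marked.append(token)
            negating = True                      # start marking after the negation word
        else:
            marked.append(token + "-NOT" if negating else token)
    return marked

print(mark_negation("I don't like the movie, but the soundtrack is great."))
# ['i', "don't", 'like-NOT', 'the-NOT', 'movie-NOT', ',', 'but', 'the', 'soundtrack', 'is', 'great', '.']
```

As the example output shows, this simple scope rule captures the local negation of “like” but, as discussed above, it would still fail on long-distance negations such as “I thought I would like the movie, but in the end I didn’t”.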
2.3.3 Domain dependence
A second important aspect in sentiment analysis is the target the sentiment is expressed towards or the
overall context the sentiment is expressed in: the domain. The intended target can be an object, a
concept, a person, an opinion, etc. When the intended target is clear, as in movie-reviews, it is easy to
identify the topic. The intended target can be very important in determining whether the expressed
sentiment is positive or negative, as the connotation can differ between domains. Turney (2002)
described that the expression “unpredictable” has a positive sentiment in a movie review (e.g. “The
movie has an unpredictable plot.”), but has a negative sentiment in the review of an automobile (e.g.
“The car's performance suffers from unpredictable steering.”). For this example “unpredictable” is a
very domain-specific feature denoting different sentiment for different domains. In this example
“great” could have alternatively been used to express positive sentiment across the different domains,
making “great” a domain-independent feature.
2.4 Approaches
A wide variety of methods and techniques has been researched to tackle the challenges in sentiment
classification. There are two main techniques to solve the problems: approaches using subjectivity
lexicons and machine learning techniques. In this section we will discuss both approaches.
2.4.1 Subjectivity lexicon based methods
Lexicon based methods make use of a predefined lexicon to decide the semantic orientation of features.
The lexicon can be defined manually or automatically. The research of Prinz (2004) makes use of a
manually constructed lexicon based on psychological research such as Plutchik’s emotion model.
Lexicon based methods are an intuitive way to perform sentiment classification and were one of the
first methods to be tried in the field. Although lexicon based methods offer decent performance in
various contexts, they often have problems capturing context-specific sentiment, as illustrated by the
aforementioned example of “unpredictable”.
Automated lexicons have been created in various studies. A large amount of research has been done on
lexicons created with WordNet. WordNet is a large lexical database of the English language. In this
database nouns, verbs, adjectives and adverbs are grouped into sets of cognitive synonyms (synsets),
each expressing a distinct concept. WordNet has been used in research by Kim and Hovy (2004) and
Hu and Liu (2005). In these studies, lexicons are automatically generated by starting with a small list of
seed terms (e.g. love, like, nice). Next the annotated antonymy and synonymy properties of terms are
used to group the terms into polarity based groups. Esuli and Sebastiani (2005) use a method in which
they expand on the WordNet database by adding labels for polarity and objectivity to each term.
Research by Denecke (2009) uses SentiWordNet to determine polarity. SentiWordNet is a lexical
resource for opinion mining which assigns three sentiment scores to each synset of WordNet: positivity,
negativity and objectivity. Other approaches use lexicons from OpinionFinder (Riloff et al., 2005), an
English subjectivity analysis system which, among other things, classifies sentences as subjective or
objective.
2.4.2 Machine learning techniques
Alternatively, statistical analysis of large corpora can be done to automatically generate lexicons.
Utilizing large corpora starts by selecting a suitable corpus with annotated examples. Often product
review sites (like Ebay.com, RottenTomatoes.com and IMDB.com) are used for this purpose. In other
approaches emoticons are used to serve as “noisy labels” to acquire large amounts of annotated
examples. Read (2005) uses this approach on data gathered from Usenet, Go et al. (2009) gather data
based on this principle from twitter. Once a suitable dataset has been obtained, different machine
learning techniques can be used. Some of the most popular approaches are Support Vector Machines
(Pang and Lee, 2002; Dave et al., 2003; Gamon, 2004; Pang and Lee, 2004; Airoldi et al., 2006;
Wilson et al., 2009), Naive Bayes (Wiebe et al., 1999; Pang and Lee, 2002; Yu and Hatzivassiloglou,
2003; Melville et al., 2009), and Maximum Entropy based classifiers (Nigam et al., 1999; Pang and
Lee, 2002; Go et al., 2009). Turney and Littman (2003) use word co-occurrence to infer the semantic
orientation of words. In their research they attempt two different techniques: Pointwise Mutual
Information (PMI) (Church and Hanks, 1989) and Latent Semantic Analysis (LSA) (Landauer and
Dumais, 1997). PMI calculates word co-occurrences by using the largest corpora available: the Internet
(via search engines hit counts). LSA uses a matrix factorization technique using Singular Value
Decomposition to analyze the statistical relationship between words. Matsumoto et al. (2005) use text
mining techniques to extract frequent word sub-sequences and dependency sub-trees from sentences in
a document dataset and use them as features of support vector machines in order to perform document
sentiment classification.
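As an illustration of the PMI idea, the sketch below estimates the semantic orientation of a word from document-level co-occurrence with two seed words, following the spirit of Turney and Littman (2003). They originally obtained counts from search-engine hits over the whole web; here the counts come from a local corpus, and the seed words “excellent” and “poor” as well as the smoothing constant are illustrative assumptions.

```python
import math
from collections import Counter

def so_pmi(word, documents, pos_seed="excellent", neg_seed="poor", smooth=0.01):
    """Estimate semantic orientation as PMI(word, pos_seed) - PMI(word, neg_seed),
    with probabilities taken from document-level co-occurrence counts."""
    n = len(documents)
    counts = Counter()
    for doc in documents:
        terms = set(doc.lower().split())
        for term in (word, pos_seed, neg_seed):
            if term in terms:
                counts[term] += 1
        if word in terms and pos_seed in terms:
            counts[(word, pos_seed)] += 1
        if word in terms and neg_seed in terms:
            counts[(word, neg_seed)] += 1

    def pmi(a, b):
        # Smoothed probabilities avoid log(0) for combinations that never co-occur.
        p_ab = (counts[(a, b)] + smooth) / n
        p_a = (counts[a] + smooth) / n
        p_b = (counts[b] + smooth) / n
        return math.log2(p_ab / (p_a * p_b))

    return pmi(word, pos_seed) - pmi(word, neg_seed)

docs = ["an excellent and gripping movie", "a poor and predictable plot",
        "gripping from start to finish", "predictable and poor acting"]
print(so_pmi("gripping", docs) > so_pmi("predictable", docs))  # True: 'gripping' leans positive
```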
2.5 Cross-Domain Sentiment Classification
Different solutions have been explored to tackle the challenges in cross-domain sentiment analysis. In
research by Aue and Gamon (2005) sentiment classifiers have been customized to new domains with
different approaches: training classifiers based on mixtures of labeled data from different domains,
training an ensemble of classifiers from different domains and using small amounts of labeled data
from the target domain. Blitzer et al. (2007) use the structural correspondence algorithm to perform a
domain-transfer for various product reviews. In research by Ghosh and Koch a co-occurrence algorithm
is used to provide a knowledge-transfer in such a way that an unsupervised classification can be made
on a cross-domain basis on sentence level. Whitehead and Yaeger (2009) use an adjusted form of
cosine similarity between domain lexicons that can be used to predict which models will be effective in
a new target domain. Furthermore, they use ensembles of classifiers which can be applied to address
cross-domain issues. In research by He et al. (2011) the Joint Sentiment-topic (JST) model was used to
detect sentiment and topic simultaneously from the text. The modification to the JST model by He et al.
(2011) incorporates word polarity priors augmenting the original feature space with polarity-bearing
topics. Impressive performance of 95% on the movie-review dataset and 90% on a multi-domain
sentiment dataset were reported. Bollegala et al. (2011) describe a sentiment classification method
which automatically creates a sentiment sensitive thesaurus using both labeled and unlabeled data from
multiple source domains in order to find the association between words that express similar sentiments
in different domains. The created thesaurus is then used to expand feature vectors to train a binary
classifier. In studies by Wu et al. (Wu et al., 2009a; Wu et al., 2009b) the graph ranking algorithm
shows promising results. This algorithm creates accurate labels from old-domain documents and
“pseudo” labels of the new domain. This cross-domain study has been applied to electronic reviews,
stock reviews and hotel reviews on Chinese datasets. In work by Tan et al. (2007) old-domain labeled
examples are used in combination with new-domain unlabeled data as well. The top labeled
informative examples from the source domain are used to determine informative examples in the target
domain to perform a domain-transfer. Selecting informative examples to achieve effective domain
transfers has also been attempted in research by Liu and Zhao (2009), Wu et al. (2010) and Tan et al.
(2009). In Tan et al. (2009) frequently co-occurring entropy measures are used to select generalizable
features, which frequently occur in both domains and have a similar occurring probability. These co-occurring features are used to construct seed lists, which are used in models to accommodate a domain-adaptation.
In this section a number of examples have shown that recent research in cross-domain sentiment
analysis tries to use similarity between source and target domain in order to train sentiment classifiers
for a target domain. Similarity can be based on some measure (e.g. cosine similarity (Whitehead and Yaeger,
2009)), on co-occurrence (Ghosh and Koch), or on topics detected in the text (He et al.,
2011). All these similarity-based approaches attempt to detect features which can be used to perform a
domain-transfer (in various research referred to as co-occurring features, sentiment sensitive thesaurus,
informative examples, frequently co-occurring features).
2.6 This research
This thesis aims to perform accurate cross-domain sentiment classification by excluding domain
dependent features and find and utilize domain independent features. The ultimate aim here is to find a
set of domain independent features that will accommodate a suitable domain-adaptation between a
wide range of domains. Based on the promising reported accuracies for Chinese domain-adaptations,
this thesis builds on the research by Tan et al. (2009) where frequently co-occurring entropy measures
are used to detect domain independent features called generalizable features. The work in this thesis
repeats the experiments suggested by Tan et al. (2009) for a large number of different English datasets.
The data sources with which this thesis is concerned stem from a wide range of domains. In order to
reach a consistent approach towards these data sources, the choice for a binary overall approach was
essential. Adaptation of the available data sources by adding more classes (e.g. neutral or different
ranges) would render the existing annotation useless. Converting the data sources that use
multiple categories to a binary segmentation proved possible by omitting all neutral documents. To
overcome the problem of detecting whether documents hold any sentiment at all (sentiment detection)
only texts known to express a sentiment towards a specific topic are considered: different reviews from
different domains and tweets.
In this research the standard Naive Bayes classifier, used among others in (Pang and Lee, 2002), is first
set up as a baseline for in-domain sentiment classification on the different datasets used. Next, the
Naive Bayes classifier is used to evaluate baseline naive cross-domain sentiment classification. The
adapted form of the Naive Bayes classifier from the research by Tan et al. (2009), which maximizes a
model using the EM algorithm to perform domain-adaptations, will be used in this research as well. As
features for the Naive Bayes classifier, 1 gram features, 2 gram features, combinations of
1 gram + POS-pattern features, and combinations of 1 gram + 2 gram features are used.
The choice to explore these feature-sets originates from the research by Pang and Lee (2002). The
conducted experiments will show that the approach suggested by Tan et al. (2009) does not yield the set of
generalizable features that could be expected when applied to English datasets. As the importance of
these features is stressed, improving them is the main focus of this thesis. Two
improvements are suggested: using a subjectivity lexicon to serve as generalizable features and the
generation of an improved set of generalizable features using a combination of FCE and an importance
measure.
Chapter 3
Models
This chapter introduces the approach suggested by Tan et al. (2009) and the two improvements
suggested in this research for domain-adaptation in order to perform cross-domain sentiment
classification. The main focus of this thesis revolves around generalizable features. Generalizable
features are features that occur frequently in both source and target domain and have similar occurring
probability in both domains. In this chapter the generalizable features are introduced properly by first
discussing the method suggested by Tan et al. (2009). This approach has been shown to be useful for performing
domain-adaptations in sentiment classification across different domains. Chapter 6 will show
that simply using the methods described by Tan et al. (2009) does not yield comparably impressive
accuracy results for English datasets. A possible explanation for the under-performance could be that the
selected generalizable features are not comparable to the ones reported by Tan et al. (2009). In
this study detailed insight is provided in the performance and behavior of the selected generalizable
features on different English datasets. The main contribution of this research lies in the introduction of
two alternative methods for selecting generalizable features. The newly proposed methods build on the
idea of generalizable features. As a first improvement the use of a subjectivity lexicon as generalizable
features is suggested. This method is explained in detail after the introduction of the generalizable
features. Next, the second improvement is explained, which suggests the additional use of an importance
measure to improve the selected generalizable features.
After the introduction of the different alternatives for selecting generalizable features, the basic Naive
Bayes classifier, which is used in the first two experiments, is explained in detail. This is followed by
the explanation of the adapted form of the Naive Bayes classifier, which is able to perform domain-adaptations.
The features used in the different evaluated Naive Bayes classifiers are 1 gram features, 2 gram
features, combined 1 gram + POS-pattern features and combined 1 gram + 2 gram features.
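To make these feature sets concrete, the sketch below extracts 1 gram, 2 gram and combined 1 gram + 2 gram features from a tokenized document. The tokenization is a simple whitespace split chosen for illustration; the 1 gram + POS-pattern variant is omitted because it additionally requires a part-of-speech tagger.

```python
def ngrams(tokens, n):
    """Return all n-grams (as tuples) over a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def feature_sets(tokens):
    """Build the 1 gram, 2 gram and combined 1 gram + 2 gram feature sets for one document."""
    unigrams = ngrams(tokens, 1)
    bigrams = ngrams(tokens, 2)
    return {
        "1 gram": unigrams,
        "2 gram": bigrams,
        "1 gram + 2 gram": unigrams + bigrams,
    }

doc = "the movie has an unpredictable plot".split()
for name, features in feature_sets(doc).items():
    print(name, features)
```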
3.1 Generalizable features selection
In this section the different methods of selecting generalizable features will be introduced. First, the
approach suggested by Tan et al. (2009) is discussed. Next, the alternative approach using a subjectivity
lexicon is introduced. Finally the approach using an importance measure to acquire better generalizable
features is explained.
3.1.1 Frequently Co-occurring Entropy
As discussed in Chapter 2, expressions can have a different sentimental meaning in different domains.
Expressions which have the same sentimental meaning across domains are defined as nondomain-specific features. For example, the feature “excellent” is considered a nondomain-specific feature,
having a positive meaning in every domain. In contrast to nondomain-specific features there are
domain-specific features which have a different meaning in different domains. For example the feature
“unpredictable” can be positive in context of a movie-review and negative in context of a review about
(the steering of) a car (Turney, 2002). As such, the features in a document can be divided into two
categories: domain-specific and nondomain-specific features. Furthermore, a subdivision can be made
between features expressing sentiment and features expressing no sentiment (non-sentimental). In this
way, considering a document as a collection of features, any given document can be expressed as a
vector:
$$ F = \begin{cases} \text{domain-specific} & \begin{cases} \text{sentiment } (ds) \\ \text{nonsentiment } (dns) \end{cases} \\[6pt] \text{nondomain-specific} & \begin{cases} \text{sentiment } (nds) \\ \text{nonsentiment } (ndns) \end{cases} \end{cases} $$
Given this representation of a document as a feature vector, identifying the different feature types within a document
makes it possible to use their different characteristics in the context of domain-adaptation. In the research by Tan et al. (2009) the different features are detected and used in order
to facilitate a domain-adaptation with an adapted form of the Naive Bayes classifier (further discussed
in section 3.2.2), using EM to optimize a Naive Bayes model. In order to do so, both the domain-specific
sentiment features and the nondomain-specific features are used. The approach uses
labeled training data from a source domain (calculating the Naive Bayes probabilities for positive and
negative features) and the data-distribution of the target domain (only the feature counts). This way the
target domain is considered to be unlabeled, making it possible to perform a domain-adaptation for any unlabeled set of documents from
any domain. The main benefit of this approach would obviously be that
no set of labeled data from a target domain would be necessary. This would make it possible to quickly
and cheaply customize domain-specific classifiers.
The nondomain-specific features are called generalizable features. Generalizable features can be used
as a bridge between source and target domain in order to adapt a classifier (trained in a source domain) to a specific target domain. In order to do so an adapted form of the Naive Bayes classifier is
used (further discussed in section 3.2.2).
In order to make a distinction between domain-specific and generalizable features (domain-independent
features), Tan et al. (2009) introduce the Frequently Co-occurring Entropy (FCE) measure. FCE
expresses a measure of feature similarity between two domains based on two criteria: a feature should occur
frequently in both domains, and it should have a similar probability of occurring in both domains. This is
formalized in Formula 1:
$$ f_w = \log\left(\frac{P_S(w) \times P_T(w)}{\lvert P_S(w) - P_T(w) \rvert + \beta}\right) $$
Formula 1: Frequently co-occurring entropy.
In Formula 1, P_S(w) and P_T(w) indicate the probability of a feature w in the source and in the
target domain, further defined in Formula 2. The β factor is introduced to prevent a division by zero
when P_S(w) = P_T(w); it is set to a small number.
$$ P_S(w) = \frac{N^S_w + \alpha}{\lvert D^S \rvert + 2\alpha} \qquad\qquad P_T(w) = \frac{N^T_w + \alpha}{\lvert D^T \rvert + 2\alpha} $$
Formula 2: The probability of feature w in source and target domain.
In Formula 2, N^S_w and N^T_w are the number of examples containing word w in the source and target
domain, and |D^S| and |D^T| are the number of examples in the source and the target domain. To
avoid zero probabilities, the smoothing constant α is set to a small number. Using this calculation for all the features in source
and target domain results in a ranked list of generalizable features. The research by Tan et al. (2009),
performed on Chinese datasets, results in lists of generalizable features which intuitively make sense. A
selection of the features from a cross-domain adaptation from a stock review dataset to a computer review
dataset from the research by Tan et al. (2009) can be found in Table 1.
not bad, not good, inferior to, adverse, unclear, cannot, deficient, bad
independent, wonderful, cold, good, benefit, pleased, characteristic, difficulty
fast, quick, monopolize, behindhand, slow, beautiful, difficult, challenge
can, deceive, injure, lose, outstanding, incapable, attractive, void
stodgy, excellent, superority, excellence, effective, natural, positive, obstruct
Table 1: The top 40 generalizable features between stock and computer reviews as reported by Tan et al. (2009)
What can be noticed from Table 1 is the different kinds of features that are reported. Most of the
reported features are 1 gram features, but some of them are 2 gram features (e.g. not bad, not good,
inferior to). As the experiments in the next chapter will show, using FCE on English datasets results in
features that are far less representative than the examples in Table 1. The difference between the results
for English and Chinese may be explained by a number of reasons. The first two examples are complex
terms in the sense that they consist of two expressed sentiments: "not" and "good", "not" and "bad".
Tan et al. (2009) do not report any additional filtering performed when indexing the data, nor do they
mention which feature types (1 gram, 2 gram, POS-tag) are used. As the reported features are
translations from Chinese, the syntactic properties of the language may differ in the sense that such
features are easier to acquire for Chinese than for English. Therefore, in this research an analysis is
performed of the performance of FCE on English datasets. In the next section this thesis proposes two
alternatives based on the idea of generalizable features in order to acquire a set of features that is more
representative for English and can be used to perform cross-domain sentiment classification.
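To make the FCE selection concrete, the following Python sketch ranks candidate features according to
Formulas 1 and 2. It is a minimal illustration rather than the implementation used in the experiments;
the document-level counting of N_w and the cut-off for the final feature set are assumptions made here.

import math
from collections import Counter

def fce_scores(source_docs, target_docs, alpha=1e-4, beta=1e-4):
    """Rank features by frequently co-occurring entropy (Formulas 1 and 2).
    Each document is a list of tokens (1 grams, 2 grams, ...)."""
    # N_w^S and N_w^T: number of documents containing feature w in each domain
    n_s = Counter(w for doc in source_docs for w in set(doc))
    n_t = Counter(w for doc in target_docs for w in set(doc))
    d_s, d_t = len(source_docs), len(target_docs)

    scores = {}
    for w in set(n_s) | set(n_t):
        p_s = (n_s[w] + alpha) / (d_s + 2 * alpha)                  # P_S(w), Formula 2
        p_t = (n_t[w] + alpha) / (d_t + 2 * alpha)                  # P_T(w), Formula 2
        scores[w] = math.log(p_s * p_t / (abs(p_s - p_t) + beta))   # Formula 1
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# The top of the ranked list would then serve as the set of generalizable features
# V_FCE, e.g. the 500 highest-scoring features (the cut-off is an assumed value):
# v_fce = {w for w, _ in fce_scores(source_docs, target_docs)[:500]}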
3.1.2 Using a subjectivity lexicon as generalizable features
In order to acquire improved generalizable features, the first improvement suggested in this thesis is the
use of a subjectivity lexicon as an alternative to selecting generalizable features with FCE.
The idea behind this suggestion is quite simple. As generalizable features are defined as features used
consistently between source and target domain, these features need to be domain-independent. Instead
of automatically selecting generalizable features with FCE, a qualitative list of domain-independent
sentimental features can also be created manually. Such manually created lists of generalizable features
already exist in lexicon-based sentiment research in the form of sentiment lexicons.
As described in Chapter 2 lexicon based approaches are the most basic approach towards sentiment
analysis. The basic ingredients for a lexicon based sentiment classifier are two lists: one list of terms
denoting positive sentiment and one list of terms denoting negative sentiment. Often these lists are
formed by linguists (in general applications) or by human annotators for more task-specific settings. In
this way these lists can also be viewed as manually established generalizable features. The benefit of a
lexicon-based approach is that there is little discussion about the terms these lists consist of, as they are
well founded in linguistics. The disadvantage of lexicon-based approaches is that
domain-specific knowledge can only be captured by manual annotation by an expert for the specific
domain.
The most often used lexicon in sentiment classification tasks is the MPQA subjectivity lexicon. In the
MPQA corpus, where the MPQA subjectivity lexicon originated from, subjective expressions of
varying lengths are marked, from single words to long phrases. In research by Wilson et al. (2005) the
sentiment expressions – positive and negative expressions of emotions, evaluations, and stances – were
extracted. The terms in the MPQA subjectivity lexicon were collected from a number of sources. Some
were culled from manually developed resources. Others were identified automatically using both
annotated and unannotated data. A majority of the clues were collected as part of the work reported in
(Riloff et al., 2003). Processing of the MPQA subjectivity lexicon resulted in a list of 4063 negative 1
gram features and 79 negative 2 gram features. The processed positive lexicon contains 2242 1 gram
and 61 2 gram features.
In related lexicon-based research (Esuli and Sebastiani, 2005) the Micro WNOp corpus is used. The
Micro WNOp corpus uses WordNet, a lexical database for the English language. WordNet groups
English words into sets of synonyms called synsets, provides general definitions, and records the
various semantic relations between the synonym sets. The aim of WordNet is to produce a combination
of a dictionary and thesaurus that is more usable in practice. The lexicon contains a total of 11788
terms, 1915 of them are labeled as positive and 2291 are labeled as negative (the remaining 7582 terms,
not belonging either to the positive or negative category, can be considered to be (implicitly) labeled as
objective).
In this research the Micro WNOp corpus is used as an additional source in combination to the
previously discussed MPQA subjectivity lexicon. After preprocessing the available annotations from the
different annotator groups, the Micro WNOp corpus led to a set of 1651 terms, of which 845 are marked
positive and 806 are marked negative. Combining the Micro WNOp corpus and the MPQA subjectivity
lexicon results in a total of 4365 negative 1 gram features, 138 negative 2 gram features, 2556 positive
1 gram features and 119 positive 2 gram features. The combination of these two lexicons leads to quite
a large number of features considered to be domain-independent. It could be argued that the large
number of features cannot all be completely domain-independent. Looking at the resulting lists of
features, some are certainly open for discussion as generalizable features in the context of this research.
Although this may be the case, these lexicons have been used in a wide variety of studies for different
domains. This, and their consistent use between domains, justifies the use of these lists of features as a
manually compiled list of generalizable features.
The combination of the positive and negative terms is used as a set of generalizable features with which
the EM algorithm maximizes the probabilities in a Naive Bayes model combining source and target
domain, as explained in the sections below.
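A minimal sketch of how the two lexicons could be merged into one set of manually established
generalizable features is given below. The file names and the one-term-per-line format are assumptions
made for illustration; the actual MPQA and Micro WNOp distributions have their own formats and
require their own parsing.

def load_terms(path):
    # Assumed format: one lower-cased term (1 gram or 2 gram) per line.
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

# Merge MPQA and Micro WNOp into positive/negative term sets (hypothetical file names).
positive_terms = load_terms("mpqa_positive.txt") | load_terms("micro_wnop_positive.txt")
negative_terms = load_terms("mpqa_negative.txt") | load_terms("micro_wnop_negative.txt")

# The union of both polarities is the manually compiled set of generalizable features.
lexicon_generalizable_features = positive_terms | negative_terms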
3.1.3 Using an importance measure in combination with FCE to select generalizable features
The second suggestion to improve the generalizable features used is to combine FCE with an
importance measure. This thesis introduces a method combining the generalizable features acquired by
FCE with a formula expressing the importance of terms within a sentimental context.
As experimentation will show, using solely FCE on the English datasets used in this research results in
generalizable features containing many common words (such as articles). To prevent the generalizable
features from containing such terms, this research introduces an importance measure that distinguishes
terms which are important in the sentimental context. To produce such a measure, the term frequency
(TF) is used to calculate the Term Frequency - Inverse Document Frequency (TF-IDF) score, which
identifies the features most informative in describing the sentimental context. Under the TF-IDF
principle, a higher informative (or sentimental) value is assigned to rare features as opposed to frequent
features. The TF-IDF is calculated separately over the positive and the negative document collection
from the source domain using the standard TF-IDF calculation:
\mathit{tfidf}(t, d, D) = \mathit{tf}(t, d) \times \mathit{idf}(t, D)

Formula 3: standard TF-IDF
In Formula 3, t is the term, d is the document, and D is the document collection, with |D| the total
number of documents in the corpus. idf(t, D) is defined as:

\mathit{idf}(t, D) = \log\left(\frac{|D|}{\left|\{d \in D : t \in d\}\right|}\right)

Formula 4: standard IDF
In Formula 4, |{d ∈ D : t ∈ d}| is the number of documents in which the term t appears. If the term
does not occur in the corpus, this leads to a division by zero; it is therefore common to adjust the
denominator to 1 + |{d ∈ D : t ∈ d}|. After calculating the TF-IDF scores for both the positive and
negative examples from the source domain, the importance of a term is expressed by Formula 5:
\mathrm{importance}(t, c) = \frac{\mathit{tfidf}(t, d, c)}{\mathit{tfidf}(t, d, P) + \mathit{tfidf}(t, d, N) + \alpha}

Formula 5: Importance expressed in terms of TF-IDF.
In Formula 5 the importance of term t for class c is expressed as the TF-IDF score of term t in class c
divided by the sum of the TF-IDF scores for both classes; α is set to a small number in case there is no
TF-IDF score for one of the classes P or N. To improve the importance measure further, a default
stop-word list is used to filter out common stop-words. When calculating 1 gram features all stop-words
are filtered out. In the case of 2 gram features only one stop-word is allowed to be in the term (e.g.
"very comfortable" is a feature which should be kept, "I am" should be filtered out).
To combine this importance measure with the FCE score used earlier, the importance score is added to
the standard FCE formula:
f_w = \log\left(\mathrm{importance}(w) + \frac{P_o(w) \times P_n(w)}{\left|P_o(w) - P_n(w)\right| + \beta}\right)

Formula 6: the importance measure combined with the FCE score.
In the context of Formula 6, w is used instead of t to denote a term, and P_o(w) and P_n(w) are the
old-domain (source) and new-domain (target) probabilities of Formula 2. The intuition behind this
formula is that the importance measure, expressed in TF-IDF scores over both the positive and negative
data distribution, adds importance in terms of sentimental context to the formula. Combining this with
the FCE measure leads to a ranked list of qualitatively improved generalizable features containing
fewer irrelevant terms such as articles, because such common terms receive low importance scores.
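The sketch below illustrates how the importance measure of Formula 5 can be combined with FCE as
in Formula 6. How importance(w) is aggregated from the two class-wise scores is not fully specified
above, so taking the larger of the two class-wise values is an assumption of this sketch, as is the
document-level counting and the simple stop-word filter (which only applies to 1 gram features).

import math
from collections import Counter

def class_tfidf(docs):
    """Aggregate TF-IDF per term over one class's documents (Formulas 3 and 4)."""
    df = Counter(t for d in docs for t in set(d))
    scores = Counter()
    for d in docs:
        for t, f in Counter(d).items():
            scores[t] += f * math.log(len(docs) / (1 + df[t]))
    return scores

def importance_fce(pos_docs, neg_docs, target_docs, stopwords,
                   alpha=1e-4, beta=1e-4):
    tfidf_p, tfidf_n = class_tfidf(pos_docs), class_tfidf(neg_docs)
    source_docs = pos_docs + neg_docs
    n_o = Counter(w for d in source_docs for w in set(d))   # old (source) domain counts
    n_n = Counter(w for d in target_docs for w in set(d))   # new (target) domain counts
    d_o, d_n = len(source_docs), len(target_docs)

    ranked = {}
    for w in set(n_o) & set(n_n):
        if w in stopwords:
            continue
        # Formula 5: class-wise importance, aggregated here by taking the maximum (assumption)
        imp = max(tfidf_p[w], tfidf_n[w]) / (tfidf_p[w] + tfidf_n[w] + alpha)
        p_o = (n_o[w] + alpha) / (d_o + 2 * alpha)
        p_n = (n_n[w] + alpha) / (d_n + 2 * alpha)
        ranked[w] = math.log(imp + p_o * p_n / (abs(p_o - p_n) + beta))   # Formula 6
    return sorted(ranked.items(), key=lambda kv: kv[1], reverse=True)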
3.2.1 Naive Bayes classifier
The Naive Bayes (NB) classifier is one of the most widely used algorithms for document
classification (McCallum and Nigam, 1998). For this research the multinomial Naive Bayes
(multinomial NB) model, a probabilistic learning method, is used in the first two experiments. In the
multinomial model a document is considered to be an ordered sequence of terms, drawn from the same
vocabulary V . The assumption is made that the lengths of the documents are independent of their
class. Another assumption is made that the probability of each term in a document is independent of the
term's context and position in the document. The probability of a document d being in class c
is therefore computed as:
P(c \mid d) \propto P(c) \prod_{1 \leq k \leq n_d} P(t_k \mid c)

Formula 7: Multinomial Naive Bayes model
In Formula 7, P(t_k | c) is the conditional probability of term t_k occurring in a document of class c.
The conditional probability P(t_k | c) is interpreted as a measure of how much evidence t_k contributes
towards c being the correct class. P(c) is the prior probability of a document occurring in class c. The
prior probabilities are used in case a document does not provide any evidence for one class over the
other; in that case the class with the highest prior probability is chosen. ⟨t_1, t_2, ..., t_{n_d}⟩ are the
tokens in d that are part of the vocabulary and are used for classification, and n_d is the number of such
tokens in d. Apart from a single word, a token can be any n-gram or other feature.
In text classification tasks the goal is to find the best class for the document. In the case of sentiment
classification NB can be used to find the best sentiment class (positive/negative) for the document. The
best class in NB classification is the most likely or maximum a posteriori (MAP) class c_MAP, defined
as:
\hat{c}_{MAP} = \mathrm{argmax}_{c \in C}\, \hat{P}(c \mid d) = \mathrm{argmax}_{c \in C}\, \hat{P}(c) \prod_{1 \leq k \leq n_d} \hat{P}(t_k \mid c)

Formula 8: Maximum a posteriori class
In Formula 8, P̂ is used because the true values of the parameters P(c) and P(t_k | c) cannot be
measured exactly; instead these parameters are estimated from the training set.
To estimate the parameters P̂(t_k | c) and P̂(c), the maximum likelihood estimate (MLE) is used. This
is simply the relative frequency and corresponds to the most likely value of each parameter given the
training data. For the priors this estimate is:
\hat{P}(c) = \frac{N_c}{N}

Formula 9: Maximum likelihood estimate.
In Formula 9, N_c is the number of documents in class c and N is the total number of documents in the
training set. The conditional probability P̂(t_k | c) is estimated as the relative frequency of term t
(where t can be any n-gram) in documents belonging to class c:
\hat{P}(t_k \mid c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}

Formula 10: Conditional probability
In Formula 10, T_ct is the number of occurrences of t in the training documents from class c, including
multiple occurrences of a term in a document. The positional independence assumption is made here:
T_ct is a count of occurrences over all positions k in the documents of the training set. In other words,
no different estimates are computed for different positions; for example, if a word occurs twice in a
document, at positions k_1 and k_2, then no distinction is made: P̂(t_{k_1} | c) = P̂(t_{k_2} | c).
The problem with the MLE estimate is that it can result in a zero probability if a term-class
combination does not occur in the training data. Even if all but one of the term-class combinations in a
document yield high evidence for a certain class, a single unknown term leads to a zero probability for
that term-class combination, and therefore to a zero probability for the complete document. This is a
sparseness problem: the training data is never large enough to cover the frequency of previously unseen
term-class combinations.
To deal with this sparseness problem a popular method is to use add-one or Laplace smoothing, which
simply adds one to each count.
\hat{P}(t \mid c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} = \frac{T_{ct} + 1}{\sum_{t' \in V} T_{ct'} + B}

Formula 11: Laplace smoothing
In Formula 11, B = |V| is the number of terms in the vocabulary. Add-one smoothing can be interpreted
as a uniform prior (each term occurs once for each class) that is then updated as evidence from the
training data comes in.
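To illustrate Formulas 7-11 together, the following Python sketch implements a minimal multinomial
Naive Bayes classifier with add-one smoothing. It is an illustration of the model described above, not
the exact implementation used in the experiments; the toy data in the usage comment is hypothetical.

import math
from collections import Counter

class MultinomialNB:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing.
    Documents are lists of features (tokens, n-grams, ...)."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.prior = {c: labels.count(c) / len(labels) for c in self.classes}   # Formula 9
        self.counts = {c: Counter() for c in self.classes}                      # T_ct
        for doc, c in zip(docs, labels):
            self.counts[c].update(doc)
        self.vocab = {t for c in self.classes for t in self.counts[c]}
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict(self, doc):
        B = len(self.vocab)                                                     # B = |V|
        best_class, best_score = None, float("-inf")
        for c in self.classes:
            # log P(c) + sum_k log P(t_k | c), smoothed as in Formula 11 (Formulas 7 and 8)
            score = math.log(self.prior[c])
            for t in doc:
                score += math.log((self.counts[c][t] + 1) / (self.totals[c] + B))
            if score > best_score:
                best_class, best_score = c, score
        return best_class

# Example usage (hypothetical toy data):
# clf = MultinomialNB().fit([["great", "movie"], ["bad", "plot"]], ["pos", "neg"])
# clf.predict(["great", "plot"])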
In Go et al. (2009) a slight variation on the multinomial Naive Bayes model is described, which yields
improvements in classifying the sentiment of tweets:
P_{NB}(c \mid d) = \frac{P(c) \prod_{i=1}^{m} P(t_i \mid c)^{n_i(d)}}{P(d)}

Formula 12: Multinomial Naive Bayes
In Formula 12, t_i represents a term (which, again, can be any n-gram) and n_i(d) represents the count
of feature t_i in d, where there are a total of m features. Experimentation showed that applying this
variation only has a beneficial effect on the results for the dataset consisting of tweets. Therefore, in
favor of the cross-domain adaptation, this suggested addition is ignored. After calculating the class
probabilities P̂(c | d) for each sentence individually, the classification of the document d is given by
the class c with the maximum number of sentences classified as that class. If the number of sentences
for each class is equal, the prior P̂(c) determines the document's class c; if the priors are also equal,
the document is returned as positive.
3.2.2 Adapted Naive Bayes Classifier
In this section the version of the Naive Bayes classifier adapted with the EM algorithm for domain
adaptation, introduced by Tan et al. (2009), is explained. The adapted Naive Bayes classifier (ANB)
takes labeled data from the source domain and considers the target domain to be unlabeled. In order to
perform a domain-adaptation the assumption is made that both domains obey the same distribution over
the different observed features. This implies that the combined data can be produced by a mixture
model which has a one-to-one correspondence between generative mixture components and classes. In
a cross-domain setting this requirement does not hold. This problem is solved by Tan et al. (2009) with
an adapted form of the Naive Bayes classifier using generalizable features to initialize a Naive Bayes
model in which the probabilities are maximized using the EM algorithm. The expectation maximization
(EM) algorithm is a widely used method in text-classification tasks (Go et al., 2009). The EM
algorithm is an iterative method for finding the maximum likelihood or maximum a posteriori (MAP)
estimates of parameters in statistical models. The EM iterations alternate between two steps. The first
(E) step calculates the expectation of the log-likelihood evaluated using the current estimate of the
parameters. Next, the (M) step computes the parameters maximizing the expected log-likelihood found
in the E step. These parameter-estimates are then used to determine the distribution of the latent
variables in the next iteration. To estimate the probabilities of a mixed Naive Bayes model combining
data from source and target domain the issue is that the chosen generalizable features do not accurately
cover the examples in the target domain. This problem is solved by Tan et al. (2009) by using a
weighted Naive Bayes classifier. The weighting factor decreases the weight of the source-domain data
in favor of the target-domain data after each EM iteration. In this way a local maximum likelihood
parameterization for the target-domain data is estimated. The log likelihood in this regard is provided
by:
l(\theta \mid D, z) = \sum_{i} \sum_{k} z_{ik} \log\left(P(c_k \mid \theta)\, P(d_i \mid c_k; \theta_k)\right)

Formula 13: Log likelihood
In Formula 13, θ_k indicates the parameters of the mixture model for class c_k, and D includes the
source-domain and target-domain data. c provides the labels "0" for positive and "1" for negative, and
z_ik is set to "0" or "1" depending on whether example d_i belongs to class c_k. The EM algorithm
iterates to find a local maximum parameterization of l(θ | D, z). In the E-step the conditional probability
P(c_k | d_i) is calculated for every example in the target domain using the current estimate θ. In the
M-step a new maximum likelihood estimate for θ is computed using the current estimates P(c_k | d_i)
for the examples from the target domain. The formulas below give the E-step and M-step for the
adapted Naive Bayes (ANB) classifier:
E-step:

P(c_k \mid d_i) \propto P(c_k) \prod_{t \in |V|} P(w_t \mid c_k)^{N_{t,i}}

M-step:

P(c_k) = \frac{(1-\lambda) \sum_{i \in D^S} P(c_k \mid d_i) + \lambda \sum_{i \in D^T} P(c_k \mid d_i)}{(1-\lambda)\,|D^S| + \lambda\,|D^T|}

P(w_t \mid c_k) = \frac{(1-\lambda)\,\eta_t^S N_{t,k}^S + \lambda\,N_{t,k}^T + 1}{(1-\lambda) \sum_{t=1}^{|V|} \eta_t^S N_{t,k}^S + \lambda \sum_{t=1}^{|V|} N_{t,k}^T + |V|}

N_{t,k}^S = \sum_{i \in D^S} N_{t,i}^S\, P(c_k \mid d_i) \qquad N_{t,k}^T = \sum_{i \in D^T} N_{t,i}^T\, P(c_k \mid d_i)

Formula 14: formulas for the E-step and M-step of EM
In Formula 14, N_{t,k}^S and N_{t,k}^T are the number of appearances of word w_t in class c_k in the
source and target domain, and λ is a parameter that controls the balance between the source-domain
and the target-domain data. It changes with each iteration as follows:
\lambda = \min\{\delta \times \tau, 1\}

Formula 15: definition of λ
In Formula 15, τ indicates the iteration step and δ ∈ (0,1) is a constant which controls the strength of
the update of the parameter λ. In the experiments different values for δ will be evaluated. η_t^S is a
constant defined as:
\eta_t^S = \begin{cases} 1 & \text{if } w_t \in V_{FCE} \\ 0 & \text{if } w_t \notin V_{FCE} \end{cases}

Formula 16: definition of η
In Formula 16, V_FCE is the set of generalizable features. The formulas above indicate that for the
old-domain (source) data only the generalizable features are used throughout all iterations, while for the
new-domain (target) data all features, including domain-specific features, are used.
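The following Python sketch outlines the EM procedure of the adapted Naive Bayes classifier
(Formulas 13-16). It is a simplified illustration under several stated assumptions: documents are
represented as bags of feature counts (Counter objects), the target posteriors are initialized uniformly
rather than with a source-trained classifier restricted to V_FCE, and a small numerical floor on the
priors is added only to keep the sketch runnable.

import math
from collections import Counter

def adapted_nb_em(source_docs, source_labels, target_docs, v_fce,
                  delta=0.2, iterations=10):
    """Sketch of the EM procedure for the adapted Naive Bayes classifier."""
    classes = sorted(set(source_labels))
    vocab = sorted({t for d in source_docs + target_docs for t in d})
    V = len(vocab)
    eta = {t: 1.0 if t in v_fce else 0.0 for t in vocab}              # Formula 16

    # Fixed "posteriors" for the labeled source data.
    p_src = [{c: 1.0 if c == y else 0.0 for c in classes} for y in source_labels]
    # Assumption of this sketch: uniform initialization of the target posteriors.
    p_tgt = [{c: 1.0 / len(classes) for c in classes} for _ in target_docs]

    prior, cond = {}, {}
    for step in range(1, iterations + 1):
        lam = min(delta * step, 1.0)                                  # Formula 15
        # M-step: re-estimate P(c_k) and P(w_t | c_k) (Formula 14)
        for c in classes:
            s_resp = sum(p[c] for p in p_src)
            t_resp = sum(p[c] for p in p_tgt)
            prior[c] = max(((1 - lam) * s_resp + lam * t_resp) /
                           ((1 - lam) * len(source_docs) + lam * len(target_docs)), 1e-12)
            n_s, n_t = Counter(), Counter()
            for d, p in zip(source_docs, p_src):
                for t, f in d.items():
                    n_s[t] += f * p[c]                                # N^S_{t,k}
            for d, p in zip(target_docs, p_tgt):
                for t, f in d.items():
                    n_t[t] += f * p[c]                                # N^T_{t,k}
            denom = ((1 - lam) * sum(eta[t] * n_s[t] for t in vocab)
                     + lam * sum(n_t[t] for t in vocab) + V)
            cond[c] = {t: ((1 - lam) * eta[t] * n_s[t] + lam * n_t[t] + 1) / denom
                       for t in vocab}
        # E-step: recompute P(c_k | d_i) for every target document
        for i, d in enumerate(target_docs):
            log_p = {c: math.log(prior[c]) +
                        sum(f * math.log(cond[c][t]) for t, f in d.items())
                     for c in classes}
            m = max(log_p.values())
            norm = sum(math.exp(v - m) for v in log_p.values())
            p_tgt[i] = {c: math.exp(log_p[c] - m) / norm for c in classes}
    return prior, cond, p_tgt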
3.2.3 Features
In order to train a Naive Bayes model the classifier needs to be provided with a distribution of features
over the training data. The features which we will take into account in this research are 1 gram features,
2 gram features, combined 1 gram + POS-pattern features and combined 1 gram + 2 gram features.
From previous work (Hatzivassiloglou and Wiebe, 2000; Wiebe et al., 2001) it has been shown that, in
contrast to using n-grams, an additional step of linguistic analysis can be beneficial for classification
accuracies. In particular, adjectives have proven to be good indicators of subjective and/or evaluative
sentences. A popular method for sentiment classification tasks is therefore to construct a training corpus
based on phrases containing adjectives or adverbs. In this research 1 gram features will be combined
with phrases containing adjectives or adverbs. This combination will be referred to as the 1 gram +
POS-pattern feature-set. A possible shortcoming of this approach is that isolated adjectives may
indicate subjectivity but may not encompass the context of the semantic orientation. A popular solution
to this shortcoming is to extract two consecutive words, where one member of the pair is an adjective
or an adverb and the second one is a word extracted to capture the context of the particular adjective or
adverb. To provide these syntactical annotations, each word type in a text has to be identified.
To annotate these POS-Tags automatically a popular method is the Brill Tagger (Brill, 1994). The tags
which are extracted as being indicative towards sentiment and are used as 1 gram + POS-pattern
features in this research are listed in Table 2.
First Word           Second Word              Third Word (Not Extracted)
JJ                   NN or NNS                anything
RB, RBR or RBS       JJ                       not NN nor NNS
JJ                   JJ                       not NN nor NNS
NN or NNS            JJ                       not NN nor NNS
RB, RBR or RBS       VB, VBD, VBN or VBG      anything

Table 2: syntactical patterns extracted as being indicative towards sentiment
The JJ tags indicate adjectives, the NN tags are nouns, the RB tags are adverbs, and the VB tags are
verbs. The second pattern means that two consecutive words are extracted if the first word is an adverb
and the second word is an adjective, but the third word (which is not extracted) cannot be a noun. NNP
and NNPS (singular and plural proper nouns) are avoided, so that the names of the items in the data
cannot influence the classification.
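A sketch of the pattern extraction of Table 2 is given below. The thesis uses the Brill tagger; here
NLTK's default POS tagger is substituted purely for illustration (it requires the
averaged_perceptron_tagger model to be downloaded), so this is not the implementation used in the
experiments.

import nltk  # assumes nltk.download("averaged_perceptron_tagger") has been run

PATTERNS = [  # (first-word tags, second-word tags, disallowed third-word tags) from Table 2
    ({"JJ"}, {"NN", "NNS"}, set()),
    ({"RB", "RBR", "RBS"}, {"JJ"}, {"NN", "NNS"}),
    ({"JJ"}, {"JJ"}, {"NN", "NNS"}),
    ({"NN", "NNS"}, {"JJ"}, {"NN", "NNS"}),
    ({"RB", "RBR", "RBS"}, {"VB", "VBD", "VBN", "VBG"}, set()),
]

def pos_pattern_features(tokens):
    """Extract two-word phrases matching the syntactic patterns of Table 2."""
    tagged = nltk.pos_tag(tokens)
    phrases = []
    for i in range(len(tagged) - 1):
        w1, t1 = tagged[i]
        w2, t2 = tagged[i + 1]
        t3 = tagged[i + 2][1] if i + 2 < len(tagged) else ""
        for first, second, not_third in PATTERNS:
            # an empty "not_third" set means the third word may be anything
            if t1 in first and t2 in second and t3 not in not_third:
                phrases.append(w1 + " " + w2)
                break
    return phrases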
3.3 Lexicon-based classifier
Apart from the use of a lexicon as an alternative for automatically gathered generalizable features this
research uses the MPQA subjectivity lexicon for a lexicon-based classifier. This classifier is used as a
baseline sentiment classifier and can also be used as a reference against the variations of automatically
generated generalizable features. The reason to include a lexicon-based approach in this research in
spite of it being outperformed by many other methods in other research (multinomial Naive Bayes,
POS-tag based approaches, SVM), is that in a cross-domain setting it might be beneficial to just stick to
the most basic subjective terms instead of using statistical methods to find a great number of relevant
subjective terms
by analyzing (domain-specific) training data. This way it serves as a nice benchmark, especially when
considering domains where few training examples are available. In case of few training examples the
retrieved vocabularies of features will be small and the probability that the essence of the domain is
captured is small. In this case it is likely that using a subjectivity lexicon will yield better accuracy
results and will serve as a better baseline to compare against. On the other hand, when analyzing larger
amounts of training data, over-fitting might become a problem. Comparing results from different
approaches in different domains to a lexicon-based approach can therefore provide a useful minimal
baseline.
The lexicon-based classifier is a very simple classifier: by counting the number of positive terms versus
the number of negative terms in a given input text, the subjectivity lexicon classifier decides whether
the input text is positive or negative. This is expressed by Formula 17:
c_{lexicon} = \mathrm{argmax}_{c \in C}\left(\mathrm{count}(\mathit{positive\ terms} \subset \mathit{input\ text}),\ \mathrm{count}(\mathit{negative\ terms} \subset \mathit{input\ text})\right)

Formula 17: Lexicon-based classification
In Formula 17, c_lexicon denotes the class of the input text, taking the maximum of the number of
positive terms in the input text and the number of negative terms in the input text.
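A minimal Python sketch of this classifier is given below. The thesis does not specify how ties between
the two counts are handled, so breaking ties towards the positive class is an assumption made here.

def lexicon_classify(tokens, positive_terms, negative_terms):
    """Classify a tokenized text by counting lexicon hits (Formula 17)."""
    pos_hits = sum(1 for t in tokens if t in positive_terms)
    neg_hits = sum(1 for t in tokens if t in negative_terms)
    # Tie handling is an assumption of this sketch: ties go to "positive".
    return "positive" if pos_hits >= neg_hits else "negative"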
Chapter 4
Experimental Setup
In this chapter the setup for the experiments to evaluate the cross-domain behavior of the different
proposed classifiers is discussed. In order to properly evaluate the cross-domain performance a wide
variety of datasets is used originating from a wide variety of sources and domains. In the first part of
this chapter the used datasets will be described in more detail, as well as the chosen preprocessing steps
taken in order to enable a consistent use of all the datasets for the classifiers. The second part of this
chapter will describe the setup of the experiments conducted and the evaluation metrics chosen.
4.1.1 Datasets
In order to utilize statistical methods and machine learning techniques to train sentiment analysis
classifiers, substantial datasets are necessary. To create these datasets, vast amounts of labeled
documents need to be provided as training data. The availability of such amounts of training data is often a
problem, especially if context specific training corpora are necessary. The labeled data in these training
corpora can be a list of terms with their polarity assigned to them, or annotated text - for example
product reviews with the sentiment towards the product expressed by means of a star or a point system.
Available datasets from previous research originate from a variety of domains. One of the datasets used
is the movie-review dataset introduced by Pang et al. (2002). This dataset, prominent in sentiment
analysis research, has been used in a number of follow-up studies by Pang and Lee (2008) and has
been the subject of a wide variety of studies: Aue et al. (2005), Turney (2002), Whitehead and Yaeger
(2009), Yessenalina et al. (2010) to name a few.
Some of the other datasets used in this research consist of automatically gathered tweets or Usenet
posts. These datasets were introduced by Read (2005) and Go et al. (2009). More datasets of examples
expressing sentiment towards different products in different categories have been described by
Whitehead and Yaeger (2009). In this research datasets containing reviews about cameras, camps,
doctors, drugs, laptops, lawyers, music, radio and restaurants are used.
The datasets which are selected to be used in this research are listed below:

• Movie-review data, Pang et al. (2002)
• Movie-scale review data, Pang and Lee (2005)
• Review-plus-rating dataset(s), Whitehead and Yaeger (2009), consisting of:
  ◦ Digital camera reviews
  ◦ Summer camp reviews
  ◦ Reviews of physicians
  ◦ Reviews of pharmaceutical drugs
  ◦ Laptop reviews
  ◦ Reviews of lawyers
  ◦ Musical CD reviews
  ◦ Reviews of radio shows
  ◦ Television show reviews
• Multi-domain sentiment dataset used (among others) in Blitzer et al. (2009), consisting of:
  ◦ Apparel reviews
  ◦ Reviews of automobiles
  ◦ Reviews of baby products
  ◦ Reviews of beauty products
  ◦ Reviews of camera and photo products
  ◦ Reviews of cell phones and cell phone services
  ◦ Reviews of computer and video games
  ◦ DVD reviews
  ◦ Reviews of electronic products
  ◦ Gourmet food reviews
  ◦ Reviews of groceries
  ◦ Reviews of health and personal care products
  ◦ Jewelry and watches reviews
  ◦ Kitchen and housewares appliances reviews
  ◦ Magazine reviews
  ◦ Musical Cd's reviews
  ◦ Reviews of musical instruments
  ◦ Reviews of office products
  ◦ Reviews of outdoor living
  ◦ Software reviews
  ◦ Reviews of sports and outdoors products
  ◦ Tools and hardware reviews
  ◦ Toys and games reviews
  ◦ Reviews of video products
• Twitter-sentiment-dataset (1.8M tweets) with annotated sentiment, Go et al. (2009)
• Restaurant reviews dataset, Snyder and Barzilay (2007)
The listed datasets are discussed in more detail in the following sections.
4.1.2 Movie review dataset
The movie-review dataset has been made available by Pang et al. (2002) and has been used in a variety
of follow-up research. The movie-review domain has been chosen for two main reasons. First, the
domain is experimentally convenient because a lot of reviews are available in the form of on-line
collections. The reviewers often summarize their overall sentiment by means of a rating indicator: a
mark on a scale (from one to ten, or one to five for example) which gives a nice machine extractable
annotation for the review as a whole. This way annotations are provided and this solves one of the
foremost problems in gathering data: no human annotation is necessary. Research by Turney (2002)
found that movie reviews are a difficult domain for sentiment classification, reporting accuracies of at
most 65%. This accuracy is based on the amount of annotator agreement when performing human
annotation with three different human annotators.
For the movie-review dataset the Internet Movie Database (IMDB) archive of the
rec.arts.movies.reviews newsgroup was used as a source. Only reviews were selected for which an
explicit rating was expressed (marks on a scale). The ratings were converted into a scale of positive,
negative or neutral. To prevent biasing a limit was imposed for dominant reviewers, a maximum of 20
reviews per category, per reviewer. The original movie-review dataset consisted of 752 negative and
1301 positive reviews, with a total of 144 reviewers. Different additions led to a dataset consisting of a
total of 2000 annotated reviews with 1000 positive and 1000 negative annotated reviews. The reviews
in this domain are all complete reviews consisting of multiple lines of text. The average number of
words for a review is 651, the average number of sentences is 33.
4.1.3 Movie-scale-review dataset
In follow up research by Pang and Lee (2005) the rating-inference problem is addressed. Pang and Lee
take sentiment analysis one step further in terms of expressing sentiment on a scale. This approach
differs from the traditional approach which classifies text into either one of two categories: positive or
negative. Instead the attempt is made to classify the actual scale of the review. Pang and Lee argue that
it is often beneficial to have a more detailed annotation available than just positive or negative. This
can especially be beneficial when ranking items by recommendation or comparing several reviewers'
opinions. For this reason data was collected from the Internet Movie Database archive, following the
same methodology as described for the movie-review dataset, resulting in four corpora of movie
reviews consisting of 1770, 902, 1307 and 1027 documents, each from a single reviewer. Attempting to
classify the actual scale of the review is beyond the scope of the general cross-domain classifier
attempted in this research, but the movie-scale data seemed like a useful resource. The movie-scale-review
data was preprocessed based on a 3-class annotation in order to only keep the annotated
examples where “0” is annotated as negative and “2” as positive. The text annotated as “1” was ignored
as these examples are expected not to state a clear positive or negative sentiment but a mixed
sentiment. In contrast to the movie review dataset the sentiment scale review dataset was annotated on
a line-by-line basis, making it possible for a single review to capture multiple sentiments. For the
purpose of this research each line is processed as an independent document. Processing the sentiment
scale dataset led to 3091 annotated lines from reviews, of which 1197 are annotated negative and 1894
positive. The average number of words for a review is 387, the average number of sentences is 21.
4.1.4 Review-plus-rating dataset
In previous attempts to build a general purpose cross-domain sentiment mining model, Whitehead and
Yaeger (2009) use the review-plus-rating dataset. The datasets represent a wide variety of domains that
could all benefit from the creation of accurate sentiment mining models. The datasets have been made
available by Whitehead and Yaeger (2009) and contain the following subsets:
• Camera: Digital camera reviews acquired from Amazon.com. The reviews were taken from cameras
  that had a large number of ratings. This dataset and the laptop review set both fall under the broader
  domain of consumer electronics. The camera dataset consists of 498 annotated reviews of which 250
  are annotated as positive and 248 as negative. The average number of words for a review is 127, the
  average number of sentences is 8.
• Camp: Summer camp reviews acquired from CampRatingz.com. A significant number of these
  reviews were written by young people who attended the reviewed summer camps. The camp dataset
  consists of 804 reviews of which 402 are annotated as positive and 402 as negative. The average
  number of words for a review is 58, the average number of sentences is 4.
• Doctor: Reviews of physicians acquired from RateMDs.com. This dataset and the lawyer review set
  could both be considered part of the larger "rating of people" domain. The doctor dataset consists of
  1478 reviews of which 739 are annotated as positive and 739 as negative. The average number of
  words for a review is 46, the average number of sentences is 4.
• Drug: Reviews of pharmaceutical drugs acquired from DrugsRatingz.com. The drug dataset consists
  of 802 reviews of which 401 are annotated as positive and 401 as negative. The average number of
  words for a review is 74, the average number of sentences is 5.
• Laptop: Laptop reviews acquired from Amazon.com. Various reviews about laptops were collected
  from different manufacturers. The laptop dataset consists of 176 reviews of which 88 are annotated as
  positive and 88 as negative. The average number of words for a review is 192, the average number of
  sentences is 13.
• Lawyer: Reviews of lawyers acquired from LawyerRatingz.com. The lawyer dataset consists of 220
  reviews of which 110 are annotated as positive and 110 as negative. The average number of words for
  a review is 46, the average number of sentences is 4.
• Music: Musical CD reviews acquired from Amazon.com. The albums being reviewed were recently
  (at the time of publishing of the paper) released popular music from a wide variety of musical genres.
  The music dataset consists of 582 reviews of which 291 are annotated as positive and 291 as negative.
  The average number of words for a review is 144, the average number of sentences is 10.
• Radio: Reviews of radio shows acquired from Radioratingz.com. This dataset and the TV dataset had
  the shortest reviews on average. The radio dataset consists of 1004 reviews of which 502 are annotated
  as positive and 502 as negative. The average number of words for a review is 34, the average number
  of sentences is 3.
• TV: Television show reviews acquired from TVRatingz.com. These reviews were typically very short
  and not very detailed. The TV dataset consists of 470 reviews of which 235 are annotated as positive
  and 235 as negative. The average number of words for a review is 27, the average number of sentences
  is 3.
The reviews in the review-plus-rating dataset are all complete reviews, consisting of one or more lines
of text.
4.1.5 Multi-domain sentiment dataset
The multi-domain sentiment dataset has been used in several papers (Blitzer et al., 2007; Blitzer et al.,
2008; Dredze et al., 2008; Mansour et al., 2009), as it proved to be a useful dataset for studying
multi-domain sentiment focused specifically on product reviews. The multi-domain sentiment dataset
contains product reviews taken from Amazon.com from many product types (domains). Some domains
(books and dvds) have hundreds of thousands of reviews. Others (musical instruments) have only a few
hundred. Reviews contain star ratings (1 to 5 stars) that are converted into binary labels if needed. For
this research reviews with a rating > 3 were labeled positive, the reviews with a rating < 3 were labeled
negative, the rest of the reviews have been discarded because their polarity seemed to be too ambiguous.
The different sub-domains after preprocessing, have the following properties:
• Apparel reviews: 2000 annotations of which 1000 positive and 1000 negative. The average number of
  words for a review is 58, the average number of sentences is 4.
• Reviews of automobiles: 736 annotations of which 584 positive and 152 negative. The average
  number of words for a review is 76, the average number of sentences is 5.
• Reviews of baby products: 1900 annotations of which 1000 positive and 900 negative. The average
  number of words for a review is 104, the average number of sentences is 6.
• Reviews of beauty products: 1493 annotations of which 1000 positive and 493 negative. The average
  number of words for a review is 87, the average number of sentences is 6.
• Reviews of books: 2000 annotations of which 1000 positive and 1000 negative. The average number
  of words for a review is 158, the average number of sentences is 9.
• Reviews of camera and photo products: 1999 annotations of which 1000 positive and 999 negative.
  The average number of words for a review is 120, the average number of sentences is 8.
• Reviews of cell phones and cell phone services: 1023 annotations of which 639 positive and 384
  negative. The average number of words for a review is 101, the average number of sentences is 7.
• Reviews of computer and video games: 1458 annotations of which 1000 positive and 458 negative.
  The average number of words for a review is 182, the average number of sentences is 11.
• DVD reviews: 2000 annotations of which 1000 positive and 1000 negative. The average number of
  words for a review is 171, the average number of sentences is 10.
• Electronics: 2000 annotations of which 1000 positive and 1000 negative. The average number of
  words for a review is 102, the average number of sentences is 7.
• Gourmet food reviews: 1208 annotations of which 1000 positive and 208 negative. The average
  number of words for a review is 75, the average number of sentences is 5.
• Reviews of groceries: 1352 annotations of which 1000 positive and 352 negative. The average number
  of words for a review is 66, the average number of sentences is 5.
• Reviews of health and personal care products: 2000 annotations of which 1000 positive and 1000
  negative. The average number of words for a review is 82, the average number of sentences is 6.
• Jewelry and watches reviews: 1292 annotations of which 1000 positive and 292 negative. The average
  number of words for a review is 60, the average number of sentences is 4.
• Kitchen and housewares appliances reviews: 2000 annotations of which 1000 positive and 1000
  negative. The average number of words for a review is 86, the average number of sentences is 6.
• Magazine reviews: 1970 annotations of which 1000 positive and 970 negative. The average number of
  words for a review is 111, the average number of sentences is 7.
• Musical Cd's reviews: 2000 annotations of which 1000 positive and 1000 negative. The average
  number of words for a review is 126, the average number of sentences is 8.
• Reviews of musical instruments: 332 annotations of which 284 positive and 48 negative. The average
  number of words for a review is 83, the average number of sentences is 6.
• Reviews of office products: 431 annotations of which 367 positive and 64 negative. The average
  number of words for a review is 91, the average number of sentences is 6.
• Reviews of outdoor living: 1327 annotations of which 1000 positive and 327 negative. The average
  number of words for a review is 82, the average number of sentences is 6.
• Software reviews: 1915 annotations of which 1000 positive and 915 negative. The average number of
  words for a review is 127, the average number of sentences is 8.
• Reviews of sports and outdoors products: 2000 annotations of which 1000 positive and 1000 negative.
  The average number of words for a review is 94, the average number of sentences is 6.
• Tools and hardware reviews: 112 annotations of which 98 positive and 14 negative. The average
  number of words for a review is 76, the average number of sentences is 5.
• Toys and games reviews: 2000 annotations of which 1000 positive and 1000 negative. The average
  number of words for a review is 89, the average number of sentences is 6.
• Reviews of video products: 2000 annotations of which 1000 positive and 1000 negative. The average
  number of words for a review is 142, the average number of sentences is 9.
The reviews in the multi-domain sentiment dataset are complete reviews, consisting of one or more
lines of text.
4.1.6 Twitter-sentiment-dataset
The twitter-sentiment-dataset has been made available by Go et al. (2009). Go et al. chose a Twitter
corpus because of the magnitude of data available through the Twitter API. With the Twitter API it
becomes possible to collect millions of tweets in a relatively short time. Using Twitter is an interesting
option as previous research might consider thousands of training data instances but never millions. The
annotation of these millions of tweets is done automatically by using noisy labels in the form of
emoticons: :-) or :-(
The preprocessing for the twitter-sentiment-dataset was done by removing usernames, the removal of
links, the replacement of repeated letters (hhhhhhuuuuuuunnnnngggggrrrryyyyy to hhuunnggrryy), and
the removal of the labels (emoticons) from the tweet. Furthermore, tweets containing :-P are removed,
as the Twitter API returns tweets containing :-P for the query :-( while these are not considered to
express the same sentiment. Finally, tweets containing both a positive and a negative label (both :-( and
:-)) are removed, as such a label can be ambiguous.
The twitter-sentiment-dataset made available by the authors consists of 800,000 tweets with positive
emoticons and 800,000 tweets with negative emoticons, resulting in a total of 1,600,000 training
tweets. Each tweet is considered a separate document. Although limited to 140 characters a tweet can
consist of multiple sentences. The average number of words for a tweet is 13, the average number of
sentences is 2.
4.1.7 Restaurant reviews dataset
The restaurant reviews dataset consists of restaurant reviews available from the website
http://www.we8there.com. The dataset was constructed by Higashinaka et al. (2006) for sentiment
analysis research. Each review in the restaurant review dataset is accompanied by
a set of ranks, each on a scale of 1-5, covering food, ambiance, service, value, and overall experience.
These ranks are provided by consumers who wrote the original reviews. The corpus does not contain
incomplete data points since all the reviews available on this website contain both a review text and the
set of ranks. To create suitable annotations for this research the reviews were preprocessed to annotate
an average grade from the 5 ranks available. An average rating < 2 was annotated as a negative review,
a review with an average score > 3 was rated as a positive review. This resulted in a dataset of 1417
reviews of which 1206 were positive and 211 were negative. The average number of words for a review
is 84, the average number of sentences is 6.
For reference purposes the datasets used in this research are listed below in Table 3.
Name                       Dataset                            Total Size (negative/positive)
senpol                     Movie-review data                  2000 (1000/1000)
senscale                   Movie-scale review data            3091 (1197/1894)
camera                     Review-plus-rating dataset         498 (248/250)
camp                       Review-plus-rating dataset         804 (402/402)
doctor                     Review-plus-rating dataset         1478 (739/739)
drug                       Review-plus-rating dataset         802 (401/401)
laptop                     Review-plus-rating dataset         176 (88/88)
lawyer                     Review-plus-rating dataset         220 (110/110)
music                      Review-plus-rating dataset         582 (291/291)
radio                      Review-plus-rating dataset         1004 (502/502)
tv                         Review-plus-rating dataset         470 (235/235)
apparel                    Multi-domain sentiment dataset     2000 (1000/1000)
automotive                 Multi-domain sentiment dataset     736 (152/584)
baby                       Multi-domain sentiment dataset     1900 (900/1000)
beauty                     Multi-domain sentiment dataset     1493 (493/1000)
camera_&_photo             Multi-domain sentiment dataset     1999 (999/1000)
cell_phones_&_services     Multi-domain sentiment dataset     1023 (384/639)
computer_&_video_games     Multi-domain sentiment dataset     1458 (458/1000)
dvd                        Multi-domain sentiment dataset     2000 (1000/1000)
electronics                Multi-domain sentiment dataset     2000 (1000/1000)
gourmet_food               Multi-domain sentiment dataset     1208 (208/1000)
grocery                    Multi-domain sentiment dataset     1352 (352/1000)
health_&_personal_care     Multi-domain sentiment dataset     2000 (1000/1000)
jewelry_&_watches          Multi-domain sentiment dataset     1292 (292/1000)
kitchen_&_housewares       Multi-domain sentiment dataset     2000 (1000/1000)
magazines                  Multi-domain sentiment dataset     1970 (970/1000)
music2                     Multi-domain sentiment dataset     2000 (1000/1000)
musical_instruments        Multi-domain sentiment dataset     332 (48/284)
office_products            Multi-domain sentiment dataset     431 (64/367)
outdoor_living             Multi-domain sentiment dataset     1327 (327/1000)
software                   Multi-domain sentiment dataset     1915 (915/1000)
sports_&_outdoors          Multi-domain sentiment dataset     2000 (1000/1000)
tools_&_hardware           Multi-domain sentiment dataset     112 (14/98)
toys_&_games               Multi-domain sentiment dataset     2000 (1000/1000)
video                      Multi-domain sentiment dataset     2000 (1000/1000)
books                      Multi-domain sentiment dataset     2000 (1000/1000)
twitter_stanford           Twitter-sentiment dataset          1600000 (800000/800000)
restaurant                 Restaurant reviews dataset         1417 (211/1206)
Table 3: Overview of the datasets used in this research, listed are the name of the subset, the dataset the subset originated from and the
total size of the dataset.
4.2 Data Processing
The collected corpora for this research originate from a number of different datasets. In order to be able
to use all available data for both training and evaluation of the proposed classifiers, the data needed to
be preprocessed. With all data available in a preprocessed form it became possible to easily and quickly
set up experiments evaluating different approaches. An example of a typical data entry, to which all
used data sources were converted, is given below:
"1","My daughter loves this activity center. I wish they would change two things about it. I wish we
could change out the toys and it folded up"
In this example, the first comma-separated value denotes the derived annotation, where “0” stands for a
negative label and "1" represents a positively labeled example. In this research binary classification is
used, so all data examples are converted to either be a negative or a positive example. Datasets holding
more classification details (e.g. a scale between 1-5) were converted appropriately (if applicable) or
ignored (if for example a mixed sentiment was expressed). The second comma-separated value consists
of a string of text. The string of text can be of variable length, holding any number of sentences. In case
of the stanford (twitter) dataset the string of text is rather short (as the maximum allowed length of a
tweet is 140 characters) or can even be a single word. Other examples can consist of a paragraph of
text, holding for example a complete elaborate paragraphed review of a restaurant. This research keeps
as close as possible to using the datasets available from previous research in the originally intended
form. It might be argued that splitting each string of text up into individual sentences and annotating these
could be beneficial. This is for example done for the movie-scale review dataset (senscale). In this case
each sentence is handled as a separate document (a separately labeled example). If this had to be done
for the rest of the datasets as well a lot of the datasets would have to be manually re-annotated. The fact
that some of the longer documents in the dataset only hold a single annotation for the whole document
can be problematic. For example in the movie-review dataset there are examples of paragraphs of text
discussing negative points about a movie ending in the statement that overall the author had to admit
liking the movie. Splitting these sentences up and annotating them individually would lead to a large
number of negatively labeled sentences and only a few positively labeled sentences. Instead, in the
movie-review dataset a single annotation is provided: a positive one. In order to be able to compare
performance in this research to previous research the decision is made to keep the annotations intact
and use them as faithfully as possible to the original research.
To normalize all examples used in this research the following normalization steps are performed:
1. All text is lowercased.
2. The characters ~`!@#$%^&*()_+-={}[]:;\"\\|,.<>/?0123456789 are removed.
3. Return characters, new line characters and tab characters are removed.
4. Multiple occurrences of spaces are normalized to a single space.
5. Spaces before or after a sentence are removed.
6. Triples of letters are normalized to doubles. In the case of the twitter dataset, words like
"huuuunggggryyyyy" occur; these are normalized to "huunggryy".
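The normalization steps above can be expressed compactly as the following Python sketch (a minimal
illustration of the listed steps, not the exact implementation used in this research).

import re

REMOVE = "~`!@#$%^&*()_+-={}[]:;\"\\|,.<>/?0123456789"

def normalize(text):
    """Apply the six normalization steps listed above."""
    text = text.lower()                                                    # 1. lowercase
    text = text.translate({ord(c): None for c in REMOVE})                  # 2. strip punctuation and digits
    text = text.replace("\r", " ").replace("\n", " ").replace("\t", " ")   # 3. control characters
    text = re.sub(r" +", " ", text)                                        # 4. collapse multiple spaces
    text = text.strip()                                                    # 5. trim leading/trailing spaces
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)                             # 6. triples (or longer) to doubles
    return text

# normalize("hhhhhhuuuuuuunnnnngggggrrrryyyyy!!!") -> "hhuunnggrryy"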
While the data is processed during the training of the classifier each example is split up into individual
words. The words are combined into the appropriate n-grams and are stored in a sentiment-vocabulary,
one vocabulary for each sentiment class. While training, the classifier keeps track of the size of the
vocabulary, the relative frequencies and the total corpus size. In the case of the POS-pattern feature-set
the POS-Tag patterns are extracted using the Brill Tagger, the patterns containing adverbs and
adjectives are extracted and the same procedure is followed as for n-grams. After processing all the
appointed data the necessary calculations are done in order to calculate the multinomial Naive Bayes
probabilities initializing the Naive Bayes model.
4.3 Experiments
For this research three different experiments are set up in order to properly evaluate and relate the
cross-domain approach suggested in this thesis. In this section the different experiments are explained.
The experiments conducted are:
1. Baseline in-domain experiment: training and classifying within the same domain.
2. Baseline cross-domain experiment: training and classification cross-domain.
3. Cross-domain experiment using (different variations of) generalizable features.
For all the experiments the average accuracy will be reported. The average accuracy is reported in order
to compare results to previous work where only accuracies have been reported. Next to the reported
average accuracy the F-measure metric is reported, which is equal to the harmonic mean of recall ρ
and precision π . ρ and π are defined as follows (Yang and Liu, 1996):
\pi_i = \frac{TP_i}{TP_i + FP_i} \qquad \rho_i = \frac{TP_i}{TP_i + FN_i}

Formula 18: definitions of recall and precision.
In Formula 18, TP_i (True Positives) is the number of documents correctly classified as class i, FP_i
(False Positives) is the number of documents that do not belong to class i but are incorrectly classified
as belonging to class i, and FN_i (False Negatives) is the number of documents that are not classified as
class i by the classifier but actually belong to class i. The F-measure values are in the
range between 0 and 1. Larger F-measure values correspond to higher classification quality. The overall
F-measure score for the entire classification task is computed by two different types of averages:
micro-average and macro-average (Yang and Liu, 1996).
Micro-averaged F-Measure
In micro-averaging the F-measure is computed globally over all category classifications. ρ and π are
obtained by summing over all individual classifications:
\pi = \frac{TP}{TP + FP} = \frac{\sum_{i=1}^{M} TP_i}{\sum_{i=1}^{M} (TP_i + FP_i)} \qquad \rho = \frac{TP}{TP + FN} = \frac{\sum_{i=1}^{M} TP_i}{\sum_{i=1}^{M} (TP_i + FN_i)}

Formula 19: Definitions of precision and recall in terms of micro-averaged F-measure
In Formula 19 M is the number of evaluation-sets. Micro-averaged F-measure is computed as:
F_{micro\text{-}averaged} = \frac{2 \pi \rho}{\pi + \rho}

Formula 20: micro-averaged F-measure
The micro-averaged F-measure gives equal weight to each document and is therefore considered as an
average over all the document/category pairs. It tends to be dominated by the classifier’s performance
on common categories.
Macro-averaged F-Measure
In macro-averaging, the F-measure is computed locally over each evaluation-set first and then the
average over all evaluation-sets is taken. ρ and π are computed for each evaluation-set as in
Formula 18. The F-measure for each evaluation-set i is computed and the macro-averaged F-measure
is obtained by taking the average of the F-measure values for each evaluation-set:
F_i = \frac{2 \pi_i \rho_i}{\pi_i + \rho_i} \qquad F_{macro\text{-}averaged} = \frac{\sum_{i=1}^{M} F_i}{M}

Formula 21: Macro-averaged F-measure
In Formula 21, M is the total number of evaluation-sets. The macro-average is influenced more by the
classifier's performance on rare categories. For the experiments both measurements are provided to gain
insight.
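The two averaging schemes can be illustrated with a short Python sketch; representing each
evaluation-set by a (TP, FP, FN) tuple is an assumption made here for illustration only.

def f_measures(confusions):
    """Compute micro- and macro-averaged F-measure (Formulas 18-21).
    `confusions` is a list of (TP, FP, FN) tuples, one per evaluation-set."""
    # Macro: F per evaluation-set, then averaged (Formula 21)
    f_scores = []
    for tp, fp, fn in confusions:
        p = tp / (tp + fp) if tp + fp else 0.0   # precision, Formula 18
        r = tp / (tp + fn) if tp + fn else 0.0   # recall, Formula 18
        f_scores.append(2 * p * r / (p + r) if p + r else 0.0)
    macro = sum(f_scores) / len(f_scores)

    # Micro: pool the counts first, then compute a single F score (Formulas 19 and 20)
    tp = sum(c[0] for c in confusions)
    fp = sum(c[1] for c in confusions)
    fn = sum(c[2] for c in confusions)
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    micro = 2 * p * r / (p + r) if p + r else 0.0
    return micro, macro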
In the following sections the different experiments will be discussed in more detail.
4.3.1 Baseline in-domain experiment
The baseline in-domain experiment is set up in order to evaluate the general approach taken in this
research. As this research uses a wide variety of datasets which are all preprocessed in a general way it
makes sense to first evaluate whether approaching all data in a general way can be done without losing
performance. For this purpose both the Naive Bayes and subjectivity lexicon classifier are used to
perform in-domain classification. For the in-domain experiment training (in case of the Naive Bayes
classifier) and evaluation will be done within the same domain. In order to perform these experiments it
makes sense to repeat approaches from previous research (Pang and Lee, 2002; Blitzer et al., 2007).
For a number of datasets used in this research extensive in-domain research has been done. This
enables comparisons of the in-domain performance of those datasets against reported performance in
previous work. The differences between the two will be evaluated; no dramatic performance differences
should be observed. As is usual in previous research (Pang and Lee, 2002; Blitzer et al., 2007), the
experiments will consist of 10-fold cross-validations. For the in-domain experiment the
training and test-sets are chosen within the same domain. Experimentation showed some performance
problems in the calculation of the priors used in the Naive Bayes classifier. This was caused by the
ordering of some of the datasets. Therefore, the choice is made to perform an additional
randomization of the datasets before the 10-fold sets were created.
For each domain two classifiers are experimented with: the Naive Bayes classifier and the subjectivity
lexicon classifier. The Naive Bayes classifier is tested with four different feature-sets: 1 gram features,
2 gram features, combined 1 gram + POS-pattern features and combined 1 gram + 2 gram features.
This results in a total of five sets of performance measures for each domain. For each classifier and
feature-set the average accuracy is reported, together with the standard deviation. The standard
deviation is reported to gain insight into the stability of the classifiers over the different datasets. Next to
the reported average accuracy, the micro F1 score and the macro F1 score are reported in order to
evaluate the influence of the balance in the training-set (the amount of positive vs. the amount of negative
examples available). In the next chapter confusion matrices for a couple of representative domains are
provided to gain more insights in the performance. In addition more detailed results showing all 10
results of the 10-fold cross validations are reported to give a more detailed overview of the
performance. In Chapter 5, section 1 a table with all results will be presented; in Chapter 6, section 1
the results will be discussed for both classifiers and all feature-sets together with a more detailed
analysis of the used features.
4.3.2 Baseline cross-domain experiment
In the second set of experiments a baseline is established for cross-domain sentiment classification. To
evaluate the cross-domain performance of the Naive Bayes classifier, the most straightforward and naive
approach towards cross-domain classification is chosen: The Leave One Out Approach (LOO) (Blitzer
et al. 2007). For each (target) domain classifications are made with all other individual in-domain
trained classifiers (using all available data for that domain). This results in a set of accuracies, micro F1
and macro F1 scores: one for each used in-domain trained classifier. The scores are averaged over all
used classifiers. Reported are the average accuracy over all the available in-domain trained classifiers.
The performance of the subjectivity lexicon classifier is not evaluated in this experiment as the results
for cross-domain analysis of the subjectivity lexicon classifier are the same as the results for in-domain.
As for the in-domain experiment four different feature-sets are used: 1 gram features, 2 gram features,
combined 1 gram + POS-pattern features and combined 1 gram + 2 gram features. For each feature-set
the average accuracy together with the standard deviation is reported in order to compare and gain
insight into the stability of the classifier over the different domains. As in the first experiment, micro and
macro F1 scores are reported. To provide more detail for a number of representative domains the results
of the cross-domain performance of different in-domain trained classifiers are provided in terms of
accuracy and micro and macro F1 scores. As in the previous experiment, the general setup using a wide variety of datasets can be evaluated by comparing subsets of the results to related cross-domain research (Blitzer et al., 2007). The comparison to related research is somewhat limited, as no previous research has combined all these different available datasets; where possible, subsets of the results will be compared to previous work. The main insight from these results will be the comparison to the performance observed in the in-domain experiment. From this comparison the difference in performance from in-domain to cross-domain classification can be established. This difference in performance can be related to the two other experiments described in the next two sections. In the second section of Chapter 5 a table containing all results for the naive cross-domain approach will be presented. In Chapter 6 the results for the different feature-sets will be discussed.
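The leave-one-out evaluation described above can be summarized schematically as follows (illustrative Python; train_naive_bayes and evaluate are placeholder functions, not the actual implementation):

    def leave_one_out(domains, train_naive_bayes, evaluate):
        # domains: {name: (documents, labels)}; for every target domain, average the
        # scores of the classifiers trained in-domain on each of the other domains.
        results = {}
        for target, target_data in domains.items():
            scores = []
            for source, source_data in domains.items():
                if source == target:
                    continue
                classifier = train_naive_bayes(source_data)  # all data of the source domain
                scores.append(evaluate(classifier, target_data))
            results[target] = sum(scores) / len(scores)
        return results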
4.3.3 Cross-domain experiment using different generalizable features
As a third set of experiments the different approaches concerning generalizable features are explored.
This experiment will contain three sub-experiments. First, the proposed approach by Tan et al. (2009) is
followed, using FCE to select generalizable features. For each available domain a domain-adaptation is performed by calculating generalizable features between source and target domain and fitting the domains using the EM algorithm as described in the previous chapter. For every target domain considered, all other available domains are individually adapted using all data from the source domain and the data-distribution of the target domain. The results for all the different domain-adaptations are averaged, resulting in a single set of performance measures for each domain. The performance measures include the average accuracy, micro F1 and macro F1 scores. The experiments are performed using the adapted Naive Bayes classifier, with the EM algorithm fitting source and target domain. As before, different feature-sets are considered: 1 gram features, 2 gram features, combined 1 gram + POS-pattern features and combined 1 gram + 2 gram features. The experiments will be evaluated against the performance of the previously discussed naive approach towards cross-domain classification. Further, the results are compared to the results reported by Tan et al. (2009) on Chinese datasets. As will be shown, this comparison turns out differently than expected. To gain a better understanding of the differences, examples will be provided showing the acquired generalizable features and the performance results for a few representative domain-adaptations.
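As a rough illustration of the selection step, the sketch below ranks candidate features by how frequent they are in both domains and how similar their relative frequencies are. It is a simplified stand-in for the FCE measure defined in the previous chapter, not necessarily the exact formula, and the function name and parameters are hypothetical.

    import math
    from collections import Counter

    def select_generalizable(source_docs, target_docs, top_k=500, alpha=1e-6):
        # Relative frequency of each feature (e.g. each 1 gram) in each domain.
        src_counts, tgt_counts = Counter(), Counter()
        for doc in source_docs:
            src_counts.update(doc)
        for doc in target_docs:
            tgt_counts.update(doc)
        src_total = sum(src_counts.values()) or 1
        tgt_total = sum(tgt_counts.values()) or 1

        def score(feature):
            p_src = src_counts[feature] / src_total
            p_tgt = tgt_counts[feature] / tgt_total
            # High when the feature is frequent in both domains and used with a
            # similar relative frequency: a crude proxy for "consistently used".
            return math.log((p_src * p_tgt) / (abs(p_src - p_tgt) + alpha) + alpha)

        shared = set(src_counts) & set(tgt_counts)
        return sorted(shared, key=score, reverse=True)[:top_k]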
As a second approach in this set of experiments the selection of the generalizable features is explored. The first considered alternative towards acquiring better generalizable features is to use the subjectivity lexicon from the previous experiments instead of generalizable features selected by FCE. So, instead of selecting generalizable features based on FCE, the subjectivity lexicon is directly used to attempt a domain-adaptation. With these generalizable features, again the adapted Naive Bayes classifier using EM is used to fit source and target domain. Subsequently the same approach is followed as when using FCE to select generalizable features, performing the same domain-adaptations and reporting the same performance measures. The results can be compared to the approach using FCE to select generalizable features and to naive cross-domain classification.
Third, as a second improvement an importance measure is introduced to be used in combination with
FCE. The importance measure is used to optimize the generalizable features acquired through FCE. As
before, with this set of improved generalizable features adapted Naive Bayes using EM is used to fit
source and target domain. The rest of the approach is followed as for the previous two sub-experiments.
The results are directly compared to the previous two experiments. As an addition, for a number of
representative domains the improvements in the used generalizable features are presented. In the third
section of Chapter 5 the results for the three aforementioned experiments will be presented in the form
of three tables with results; one for each sub-experiment with different used generalizable features. In
the third section of Chapter 6 the results of the different sub-experiments will be discussed providing
analysis of the different used generalizable features and performance results for a few representative
domain-adaptations.
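All three sub-experiments share the same fitting step: starting from Naive Bayes parameters estimated on the labeled source domain, EM re-estimates the parameters on the unlabeled target data. The outline below gives a much-simplified version of such an EM loop for a multinomial Naive Bayes model (illustrative Python); it omits the weighting of source against target data and the special role of the generalizable features described in the previous chapter.

    import math
    from collections import defaultdict

    def em_adapt(priors, cond_prob, target_docs, vocabulary, iterations=5):
        # priors: {class: P(c)}, cond_prob: {class: {word: P(w|c)}}, both initially
        # estimated on the labeled source domain; target_docs are lists of tokens.
        def posteriors(doc):
            # Naive Bayes posterior P(c | doc) in log space, restricted to known words.
            log_p = {c: math.log(priors[c] + 1e-12) +
                        sum(math.log(cond_prob[c].get(w, 1e-9)) for w in doc if w in vocabulary)
                     for c in priors}
            m = max(log_p.values())
            norm = sum(math.exp(v - m) for v in log_p.values())
            return {c: math.exp(v - m) / norm for c, v in log_p.items()}

        for _ in range(iterations):
            # E-step: soft-label every unlabeled target document.
            soft = [posteriors(doc) for doc in target_docs]
            # M-step: re-estimate priors and word probabilities from the soft counts
            # (Laplace smoothing keeps unseen words from getting zero probability).
            class_w = defaultdict(float)
            word_w = {c: defaultdict(float) for c in priors}
            for doc, probs in zip(target_docs, soft):
                for c, p in probs.items():
                    class_w[c] += p
                    for w in doc:
                        if w in vocabulary:
                            word_w[c][w] += p
            priors = {c: class_w[c] / len(target_docs) for c in priors}
            new_cond = {}
            for c in priors:
                total = sum(word_w[c].values()) + len(vocabulary)
                new_cond[c] = {w: (word_w[c][w] + 1.0) / total for w in vocabulary}
            cond_prob = new_cond
        return priors, cond_prob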
Chapter 5
Results
In this chapter the results for the three conducted experiments are presented. First the results for the in-domain baseline experiments are presented, followed by the results for the naive cross-domain baseline experiments. The third section presents the results for the three variations of using generalizable features for domain-adaptation.
5.1 In-domain baseline experiments
The first experiment performed in this research is the evaluation of the in-domain performance of the
standard Naive Bayes classifier. This evaluation is done to set a baseline for in-domain performance
with the general setup that is used throughout the experiments conducted in this thesis. The general setup
consistently uses the data from 38 different datasets. In order to evaluate this general setup the Naive
Bayes classifier is used in order to relate the results observed in this experiment to results reported in
previous in-domain sentiment analysis work. The results for the in-domain experiments are presented
in Table 4.
Dataset | 1 gram: ac / micro F1 / macro F1 | 2 gram: ac / micro F1 / macro F1 | 1 gram + POS: ac / micro F1 / macro F1 | 1 gram + 2 gram: ac / micro F1 / macro F1 | Subjectivity Lexicon: ac / micro F1 / macro F1
(the 15 values for each dataset follow its name, in this column order)
kitchen&housewares
0.7730 0.7913 0.7908 0.7712 0.7993 0.8005 0.7630 0.7801 0.7803 0.7392 0.7844 0.7853 0.5746 0.6508 0.6501
toys_&_games
0.7695 0.8014 0.8009 0.7535 0.7948 0.7952 0.7543 0.7855 0.7855 0.7274 0.7794 0.7797 0.5750 0.6709 0.6698
sports_&_outdoors
0.7873 0.7916 0.7911 0.7852 0.7933 0.7930 0.7826 0.7821 0.7815 0.7774 0.7776 0.7770 0.5821 0.6566 0.6556
electronics
0.7692 0.7841 0.7837 0.7788 0.7875 0.7873 0.7727 0.7745 0.7748 0.7561 0.7793 0.7793 0.5704 0.6540 0.6535
beauty
0.7821 0.8609 0.8605 0.8105 0.8635 0.8625 0.7777 0.8570 0.8565 0.8027 0.8533 0.8524 0.6128 0.7381 0.7365
office_products
0.8780 0.9209 0.9204 0.7021 0.8235 0.8236 0.8720 0.9155 0.9150 0.6441 0.7680 0.7669 0.6850 0.8059 0.8048
automotive
0.7983 0.8909 0.8904 0.7732 0.8513 0.8485 0.7930 0.8846 0.8838 0.6971 0.7891 0.7869 0.6877 0.8088 0.8080
laptop
0.7438 0.7753 0.7760 0.6774 0.6905 0.6990 0.7062 0.7209 0.7187 0.6289 0.6093 0.6225 0.5643 0.6455 0.6347
jewelry&watches
0.8049 0.8829 0.8823 0.8265 0.8926 0.8915 0.7965 0.8814 0.8808 0.7992 0.8738 0.8730 0.7233 0.8300 0.8289
stanford
0.6774 0.7324 0.7324 0.6790 0.7113 0.7084 0.6738 0.7082 0.7055 0.6787 0.7113 0.7084 0.5614 0.6620 0.6620
senpol
0.7880 0.7633 0.7621 0.8123 0.8053 0.8052 0.7779 0.7717 0.7716 0.7819 0.7584 0.7565 0.4963 0.6628 0.6612
lawyer
0.8093 0.8367 0.8379 0.7127 0.7895 0.7886 0.6024 0.7138 0.7127 0.6650 0.7581 0.7554 0.6245 0.6992 0.6957
baby
0.7622 0.7959 0.7961 0.7647 0.8077 0.8071 0.7497 0.7923 0.7915 0.7615 0.8031 0.8027 0.5827 0.6839 0.6824
tools&hardware
0.9649 0.9744 0.9741 0.9901 0.9787 0.9778 0.4693 0.5816 0.5590 0.8705 0.9333 0.9309 0.7197 0.8449 0.8416
music2
0.7599 0.7339 0.7336 0.7439 0.7560 0.7557 0.7400 0.7365 0.7358 0.7182 0.7026 0.7012 0.5542 0.6550 0.6543
computer&videogames 0.8489 0.8948 0.8943 0.9179 0.9325 0.9322 0.8426 0.8927 0.8919 0.8805 0.9046 0.9043 0.6569 0.7576 0.7565
senscale
0.8644 0.8870 0.8869 0.8233 0.8633 0.8632 0.8272 0.8644 0.8643 0.8162 0.8589 0.8588 0.6053 0.7497 0.7496
cell phones&service
0.7767 0.8312 0.8292 0.7991 0.8437 0.8419 0.7768 0.8259 0.8229 0.7904 0.8261 0.8250 0.6244 0.7292 0.7279
musical_instruments
0.8559 0.9238 0.9228 0.7388 0.8440 0.8425 0.8319 0.9122 0.9113 0.6253 0.7741 0.7705 0.7803 0.8540 0.8510
grocery
0.7911 0.8793 0.8788 0.7931 0.8689 0.8686 0.7876 0.8797 0.8791 0.7644 0.8491 0.8488 0.7062 0.8125 0.8117
camera_&_photo
0.8153 0.8274 0.8245 0.8441 0.8552 0.8552 0.7957 0.8175 0.8173 0.8391 0.8484 0.8477 0.5547 0.6509 0.6503
doctor
0.8092 0.8345 0.8337 0.7556 0.7911 0.7912 0.7560 0.7917 0.7917 0.7295 0.7705 0.7702 0.5823 0.6831 0.6825
tv
0.7437 0.7689 0.7673 0.6719 0.6969 0.6946 0.6517 0.6986 0.6977 0.6512 0.6869 0.6851 0.6154 0.6817 0.6795
dvd
0.7229 0.7001 0.6971 0.7168 0.7309 0.7299 0.7103 0.7103 0.7082 0.7008 0.7035 0.6998 0.5469 0.6539 0.6538
camp
0.7967 0.7948 0.7930 0.7953 0.7851 0.7820 0.8053 0.7920 0.7922 0.7844 0.7451 0.7405 0.5841 0.6926 0.6925
drug
0.7135 0.7357 0.7335 0.6530 0.6852 0.6844 0.6947 0.7099 0.7086 0.6416 0.6793 0.6792 0.5725 0.6018 0.5996
magazines
0.8612 0.8635 0.8631 0.8999 0.9055 0.9051 0.8543 0.8631 0.8636 0.8901 0.8924 0.8921 0.5493 0.6439 0.6430
video
0.7760 0.7416 0.7419 0.7658 0.7611 0.7600 0.7550 0.7274 0.7253 0.7294 0.6874 0.6845 0.5244 0.6300 0.6297
health&personal_care
0.7875 0.7918 0.7908 0.7862 0.7917 0.7914 0.7715 0.7787 0.7789 0.7773 0.7787 0.7779 0.5754 0.6566 0.6561
camera
0.7481 0.7774 0.7673 0.7433 0.7594 0.7561 0.7407 0.7601 0.7524 0.7258 0.7666 0.7608 0.5877 0.6528 0.6459
music
0.6959 0.7237 0.7227 0.6613 0.6964 0.6889 0.6587 0.6809 0.6730 0.6547 0.6993 0.6960 0.5643 0.6487 0.6444
apparel
0.7676 0.8026 0.8003 0.7591 0.7966 0.7948 0.7721 0.7917 0.7904 0.7379 0.7838 0.7820 0.5814 0.6711 0.6699
radio
0.6872 0.7349 0.7336 0.6335 0.7100 0.7098 0.5987 0.6856 0.6863 0.6115 0.7023 0.7017 0.5585 0.6330 0.6306
outdoor_living
0.7914 0.8771 0.8767 0.8081 0.8827 0.8823 0.7917 0.8763 0.8760 0.7934 0.8706 0.8702 0.6773 0.7896 0.7889
software
0.8019 0.8107 0.8107 0.8190 0.8411 0.8407 0.7912 0.8053 0.8053 0.8067 0.8336 0.8333 0.5555 0.6602 0.6599
books
0.7641 0.7579 0.7572 0.7499 0.7630 0.7626 0.7416 0.7422 0.7418 0.7294 0.7612 0.7601 0.5636 0.6563 0.6539
gourmet_food
0.8473 0.9130 0.9127 0.8066 0.8771 0.8763 0.8434 0.9116 0.9112 0.7716 0.8527 0.8514 0.7175 0.8238 0.8233
restaurant
0.8912 0.9348 0.9347 0.8873 0.9339 0.9338 0.8881 0.9335 0.9334 0.8711 0.9198 0.9194 0.7518 0.8454 0.8451
Table 4: Accuracies, micro F1 and macro F1 scores for the in-domain experiment.
The best performing feature-set for the in-domain experiment with the Naive Bayes classifier is the 1
gram feature-set. The reported accuracy for this feature-set is 0.7901. The combined feature-sets (1
gram + POS-pattern and 1 gram + 2 gram) both have lower accuracy scores in comparison to the single
feature-sets (1 gram and 2 gram). The standard deviation over all the different feature-sets is relatively low; the highest standard deviation over the accuracies is reported for the combined 1 gram + POS-pattern feature-set with 0.0817. This low standard deviation implies that the performances are within a close range, making the Naive Bayes classifier a stable classifier for this experiment. The reported micro F1 scores range from 0.8195 for the 1 gram feature-set to 0.7862 for the combined 1 gram + 2 gram feature-set. The reported macro F1 scores range from 0.8186 for the 1 gram feature-set to 0.7852 for the combined 1 gram + 2 gram feature-set. The differences between the reported micro and macro F1 scores are very small, showing that performances for imbalanced and balanced evaluation-sets are very similar.
As an alternative classifier the subjectivity lexicon classifier is analyzed. The average accuracy for this classifier is 0.6092, which is considerably lower than the results reported for the different feature-sets used for the Naive Bayes classifier. The average reported micro F1 and macro F1 scores are 0.7065 and 0.7049, which are also lower than the earlier results. The difference between the micro and macro F1 scores is smaller than for the Naive Bayes classifier, showing even more stability with respect to balanced versus imbalanced evaluation-sets. The standard deviation over the accuracies is 0.0684, which is slightly higher than for the 1 gram Naive Bayes classifier but lower than the standard deviation reported for the rest of the Naive Bayes feature-sets. This makes the subjectivity lexicon classifier also a stable classifier for this task.
5.2 Cross-domain baseline experiments
In this section the results for the naive cross-domain approach are presented. This experiment can be compared to previous work to evaluate the performance of the general setup and can also be used to evaluate the performance difference of the Naive Bayes classifier from in-domain to naive cross-domain application for the different datasets. The results are presented in Table 5.
Dataset | 1 gram: ac / micro F1 / macro F1 | 2 gram: ac / micro F1 / macro F1 | 1 gram + POS: ac / micro F1 / macro F1 | 1 gram + 2 gram: ac / micro F1 / macro F1
kitchen&housewares 0.6480 0.6747 0.6732 0.6459 0.6703 0.6691 0.6320 0.6473 0.6457 0.6346 0.6415 0.6401
toys_&_games 0.6390 0.6817 0.6808 0.6397 0.6713 0.6705 0.6264 0.6574 0.6565 0.6258 0.6357 0.6348
sports_&_outdoors 0.6430 0.6638 0.6626 0.6482 0.6680 0.6668 0.6303 0.6392 0.6380 0.6306 0.6333 0.6322
electronics 0.6465 0.6579 0.6571 0.6390 0.6543 0.6535 0.6252 0.6217 0.6209 0.6263 0.6214 0.6203
beauty 0.6702 0.7378 0.7368 0.6739 0.7465 0.7453 0.6490 0.7076 0.7066 0.6447 0.7071 0.7058
office_products 0.6952 0.7900 0.7887 0.6912 0.7857 0.7836 0.6653 0.7592 0.7575 0.6525 0.7456 0.7433
automotive 0.6624 0.7510 0.7490 0.6769 0.7666 0.7652 0.6352 0.7196 0.7177 0.6368 0.7243 0.7225
laptop 0.6447 0.6302 0.6150 0.6451 0.6449 0.6251 0.6291 0.5965 0.5805 0.6246 0.5960 0.5800
jewelry&watches 0.7283 0.8092 0.8082 0.7289 0.8106 0.8096 0.7018 0.7813 0.7800 0.6963 0.7782 0.7769
stanford 0.5449 0.5792 0.5791 0.5342 0.6160 0.6160 0.5326 0.5520 0.5520 0.5346 0.5770 0.5770
senpol 0.5849 0.5465 0.5447 0.5526 0.4491 0.4481 0.5768 0.5093 0.5077 0.5325 0.3725 0.3715
lawyer 0.6545 0.6854 0.6779 0.6460 0.6671 0.6627 0.6383 0.6469 0.6407 0.6274 0.6345 0.6303
baby 0.6354 0.6702 0.6690 0.6364 0.6695 0.6684 0.6220 0.6471 0.6459 0.6211 0.6324 0.6314
tools&hardware 0.6566 0.7609 0.7549 0.7170 0.8117 0.8057 0.6380 0.7437 0.7369 0.6385 0.7427 0.7333
music2 0.6121 0.6558 0.6553 0.5825 0.6144 0.6137 0.5961 0.6257 0.6251 0.5719 0.5719 0.5712
computer&videogames 0.6648 0.7352 0.7344 0.6564 0.7223 0.7214 0.6462 0.7100 0.7093 0.6104 0.6578 0.6569
senscale 0.5964 0.6065 0.6061 0.5523 0.5052 0.5048 0.5736 0.5631 0.5627 0.5165 0.4218 0.4215
cell phones&service 0.6461 0.6862 0.6836 0.6624 0.7160 0.7142 0.6227 0.6525 0.6497 0.6335 0.6740 0.6721
musical_instruments 0.7196 0.8106 0.8075 0.7108 0.8084 0.8045 0.6918 0.7829 0.7789 0.6675 0.7672 0.7632
grocery 0.6981 0.7786 0.7778 0.6934 0.7778 0.7769 0.6669 0.7432 0.7423 0.6651 0.7457 0.7447
camera_&_photo 0.6540 0.6810 0.6797 0.6501 0.6690 0.6680 0.6368 0.6501 0.6489 0.6296 0.6290 0.6278
doctor 0.6694 0.7048 0.7039 0.6532 0.6889 0.6878 0.6386 0.6531 0.6519 0.6434 0.6667 0.6655
tv 0.6351 0.6790 0.6756 0.6111 0.6615 0.6575 0.6046 0.6363 0.6329 0.6087 0.6415 0.6374
dvd 0.6255 0.6562 0.6558 0.5990 0.6132 0.6128 0.6128 0.6310 0.6305 0.5834 0.5683 0.5678
camp 0.6451 0.6898 0.6888 0.6375 0.6868 0.6863 0.6277 0.6610 0.6597 0.6139 0.6447 0.6434
drug 0.5735 0.5656 0.5644 0.5687 0.5837 0.5822 0.5604 0.5368 0.5354 0.5656 0.5595 0.5578
magazines 0.6427 0.6825 0.6817 0.6205 0.6451 0.6441 0.6271 0.6543 0.6534 0.6002 0.6009 0.5999
video 0.6155 0.6471 0.6462 0.5911 0.6059 0.6048 0.6000 0.6158 0.6150 0.5807 0.5647 0.5636
health&personal_care 0.6331 0.6548 0.6531 0.6421 0.6669 0.6658 0.6197 0.6281 0.6265 0.6296 0.6385 0.6372
camera 0.6516 0.6977 0.6923 0.6539 0.6877 0.6821 0.6399 0.6727 0.6674 0.6408 0.6546 0.6470
music 0.6018 0.6543 0.6512 0.6152 0.6576 0.6545 0.5879 0.6250 0.6215 0.5972 0.6109 0.6074
apparel 0.6473 0.6813 0.6792 0.6514 0.6821 0.6798 0.6299 0.6512 0.6490 0.6386 0.6562 0.6540
radio 0.5978 0.6369 0.6345 0.5868 0.6302 0.6283 0.5743 0.5926 0.5904 0.5814 0.6049 0.6029
outdoor_living 0.6923 0.7715 0.7706 0.6838 0.7659 0.7651 0.6617 0.7393 0.7384 0.6487 0.7261 0.7253
software 0.6680 0.6902 0.6898 0.6577 0.6813 0.6808 0.6521 0.6614 0.6609 0.6426 0.6477 0.6471
books 0.6068 0.6451 0.6428 0.5909 0.5933 0.5910 0.5990 0.6213 0.6193 0.5713 0.5426 0.5400
gourmet_food 0.6987 0.7906 0.7898 0.6973 0.7918 0.7910 0.6603 0.7513 0.7505 0.6517 0.7476 0.7465
restaurant 0.7624 0.8416 0.8412 0.7146 0.8028 0.8025 0.7058 0.7915 0.7910 0.6598 0.7517 0.7513
average 0.6477 0.6916 0.6895 0.6410 0.6813 0.6792 0.6273 0.6600 0.6578 0.6178 0.6404 0.6382
standard deviation 0.0431 0.0683 0.0686 0.0481 0.0819 0.0819 0.0367 0.0683 0.0685 0.0393 0.0862 0.0860
Table 5: results of the naive cross-domain experiments.
The best performing feature-set in terms of accuracy for the naive cross-domain approach is the 1 gram
feature-set with an accuracy of 0.6477. The reported accuracy for the 2 gram feature-set is slightly
lower with 0.6410. The combined feature-sets perform worse, with accuracies of 0.6273 and 0.6178. The standard deviation over the accuracies is slightly lower than in the in-domain experiment, with a highest reported standard deviation of 0.0481 for the 2 gram feature-set. The highest micro F1 and macro F1 scores are also observed for the 1 gram feature-set, with considerably lower scores for the combined feature-sets. The differences between the reported micro and macro F1 scores for the different feature-sets are a little higher than for the in-domain performance. As
discussed for the in-domain experiment the performance for the subjectivity lexicon classifier is the
same in a cross-domain setting. Comparing the subjectivity lexicon classifier to the naive cross-domain
approach the performance of the subjectivity lexicon classifier is still lower.
In comparison to the in-domain performance all feature-sets decrease in performance, with decreases of over 0.128; the 1 gram feature-set decreases the most, with a difference of 0.142. For the micro F1 and macro F1 scores the same pattern is observed: all scores decrease from in-domain to cross-domain by more than 0.128.
5.3.1 Adapted Naive Bayes classifier using FCE
The first experiment performing domain-adaptation between domains uses the adapted Naive Bayes
classifier using FCE: ANB (FCE). As explained in the previous chapter FCE is used to select
generalizable features. The results for these experiments are presented in this section, starting with Table 6.
Dataset | 1 gram: ac / micro F1 / macro F1 | 2 gram: ac / micro F1 / macro F1 | 1 gram + POS: ac / micro F1 / macro F1 | 1 gram + 2 gram: ac / micro F1 / macro F1
(the 12 values for each dataset follow its name, in this column order)
kitchen&housewares
0.4475 0.2836 0.2816 0.6581 0.6958 0.6939 0.5049 0.4783 0.4771 0.5523 0.3059 0.3038
toys_&_games
0.5207 0.4771 0.4754 0.6486 0.6795 0.6777 0.4964 0.5769 0.5763 0.5845 0.4733 0.4724
sports_&_outdoors
0.4684 0.3490 0.3464 0.6566 0.6947 0.6921 0.5144 0.6196 0.6188 0.5997 0.4707 0.4695
electronics
0.4473 0.2809 0.2786 0.6491 0.6930 0.6911 0.5065 0.5461 0.5451 0.5777 0.3496 0.3467
beauty
0.5028 0.4424 0.4397 0.6184 0.6659 0.6633 0.6210 0.7317 0.7310 0.5913 0.6279 0.6265
office_products
0.4634 0.3381 0.3352 0.5708 0.6720 0.6698 0.7480 0.8302 0.8291 0.3365 0.3753 0.3721
automotive
0.4523 0.2771 0.2751 0.5736 0.6255 0.6232 0.7028 0.7982 0.7972 0.3722 0.3791 0.3776
laptop
0.5537 0.6348 0.6329 0.5673 0.6338 0.6322 0.5054 0.6251 0.6125 0.4332 0.5354 0.5108
jewelry&watches
0.4757 0.3428 0.3398 0.6058 0.6713 0.6689 0.6964 0.7978 0.7968 0.5205 0.5802 0.5781
stanford
0.5232 0.4928 0.4908 0.6754 0.7455 0.7438 0.5241 0.6471 0.6471 0.5513 0.6102 0.6102
senpol
0.5668 0.6582 0.6562 0.6263 0.6860 0.6835 0.6074 0.6232 0.6221 0.5271 0.6540 0.6527
lawyer
0.4831 0.3925 0.3898 0.4757 0.2988 0.2970 0.5079 0.3281 0.3259 0.4600 0.2454 0.2418
baby
0.4919 0.4103 0.4086 0.6806 0.7390 0.7370 0.5441 0.6499 0.6490 0.5429 0.3906 0.3883
tools&hardware
0.5773 0.6991 0.6977 0.5819 0.7065 0.7050 0.8236 0.8923 0.8895 0.8536 0.9069 0.9050
music2
0.5794 0.7039 0.7021 0.6292 0.6663 0.6634 0.5148 0.5992 0.5988 0.5388 0.6800 0.6796
computer&videogames
0.5521 0.4926 0.4901 0.5969 0.6372 0.6350 0.5955 0.6820 0.6811 0.6511 0.7167 0.7155
senscale
0.5744 0.6685 0.6668 0.6095 0.6400 0.6376 0.6516 0.7467 0.7464 0.6229 0.7448 0.7445
cell phones&service
0.5230 0.5179 0.5163 0.6178 0.6740 0.6716 0.5921 0.7043 0.7018 0.5866 0.6471 0.6447
musical_instruments
0.4917 0.3637 0.3610 0.5748 0.6818 0.6799 0.7804 0.8613 0.8599 0.3420 0.3922 0.3865
grocery
0.4987 0.4557 0.4532 0.6143 0.6747 0.6724 0.6593 0.7655 0.7647 0.5577 0.6221 0.6214
camera_&_photo
0.5439 0.6055 0.6033 0.6707 0.7006 0.6986 0.5222 0.6427 0.6419 0.5809 0.6139 0.6134
doctor
0.5107 0.4505 0.4486 0.5726 0.5787 0.5768 0.5281 0.5651 0.5640 0.5793 0.4513 0.4497
tv
0.4606 0.2782 0.2756 0.4764 0.3025 0.2997 0.5063 0.3958 0.3895 0.4140 0.3072 0.3048
dvd
0.5901 0.7027 0.7009 0.6399 0.6707 0.6686 0.5003 0.6331 0.6327 0.5474 0.6809 0.6806
camp
0.5798 0.7074 0.7056 0.5779 0.6551 0.6522 0.4848 0.5980 0.5972 0.4821 0.6163 0.6156
drug
0.4237 0.1757 0.1741 0.5360 0.4858 0.4836 0.5025 0.3941 0.3927 0.5107 0.1915 0.1920
magazines
0.4989 0.4405 0.4379 0.6496 0.6801 0.6779 0.5271 0.6310 0.6303 0.6204 0.5226 0.5224
video
0.5725 0.6453 0.6431 0.6188 0.6398 0.6378 0.5260 0.5951 0.5944 0.5888 0.6771 0.6767
health&personal_care
0.5225 0.5446 0.5427 0.6638 0.7029 0.7007 0.5097 0.6310 0.6297 0.5748 0.5477 0.5466
camera
0.5097 0.4849 0.4831 0.6012 0.5773 0.5753 0.5104 0.6387 0.6351 0.5116 0.3546 0.3469
music
0.5106 0.4881 0.4854 0.5696 0.5855 0.5838 0.5128 0.5657 0.5633 0.5994 0.4965 0.4935
apparel
0.4633 0.3039 0.3017 0.6286 0.6388 0.6373 0.5152 0.6209 0.6191 0.5457 0.3231 0.3212
radio
0.4561 0.3070 0.3049 0.5145 0.4339 0.4319 0.5328 0.5070 0.5060 0.5166 0.3228 0.3217
outdoor_living
0.4572 0.3203 0.3175 0.6068 0.6599 0.6571 0.6596 0.7586 0.7580 0.4579 0.4598 0.4586
software
0.4442 0.2196 0.2167 0.6405 0.6929 0.6911 0.5173 0.5325 0.5320 0.5363 0.2508 0.2499
books
0.5529 0.6263 0.6245 0.6397 0.6851 0.6833 0.5007 0.6422 0.6403 0.6768 0.6931 0.6914
gourmet_food
0.4760 0.3329 0.3309 0.5907 0.6476 0.6452 0.6856 0.7858 0.7852 0.4497 0.5131 0.5115
restaurant
0.4606 0.2752 0.2726 0.6112 0.6793 0.6773 0.6792 0.7871 0.7866 0.4494 0.5338 0.5328
average
0.5060 0.4524 0.4502 0.6063 0.6368 0.6346 0.5741 0.6428 0.6413 0.5380 0.5069 0.5047
standard deviation
0.0471 0.1550 0.1552 0.0494 0.1000 0.1000 0.0912 0.1274 0.1278 0.0954 0.1639 0.1644
Table 6: results for the adapted Naive Bayes classifier using FCE to select generalizable features.
From the results presented in Table 6 it is noticeable that the performance for all the different feature-sets is quite low. The best performing feature-set is the 2 gram feature-set with an accuracy of 0.6063. The best performance in terms of micro and macro F1 score is achieved by the combined 1 gram + POS-pattern feature-set. The 1 gram feature-set has the lowest accuracy of 0.5060. In comparison to the naive cross-domain approach the ANB using FCE performs worse. The difference in terms of accuracy is the biggest for the 1 gram feature-set with a decrease in accuracy of 0.1417. The smallest difference
is achieved by the 2 gram feature-set, where the accuracy decreases by 0.0347. The decrease in performance in terms of micro and macro F1 scores is even higher for the 1 gram feature-set, with decreases both over 0.23. The smallest difference is observed for the 1 gram + POS-pattern feature-set, with a decrease just over 0.01. The standard deviations over the accuracies for the 1 gram and 2 gram feature-sets are just over 0.04, which is quite low, showing stable performance over the different datasets used. The standard deviation over the accuracies of both combined feature-sets is a little higher, with scores over 0.09, indicating that performances differ more from dataset to dataset. The differences between the reported micro and macro F1 scores are smaller compared to the naive cross-domain classifier; this shows that when using FCE, performances for imbalanced and balanced evaluation-sets are very similar. Looking at the differences between the datasets, for the 1 gram features it seems that mainly the unbalanced datasets (tools & hardware, automotive, office products) show below-average performance. This indicates that the priors might cause problems while performing domain-adaptations with this feature-set. For the remaining feature-sets this seems to be less of a problem. For the 2 gram feature-set the smaller datasets (office products, lawyer, tools & hardware) seem to perform below average, following the trend which can be observed from the previous experiments. For the combined feature-sets it is noticeable that the tools & hardware dataset performs the highest. This is probably caused by the specific characteristics of this dataset, as for the rest of the datasets no real trend in problems with size or balance can be detected.
Looking at the differences in results between the naive cross-domain approach and the ANB using FCE on a dataset level, it is noticeable that the imbalanced datasets (restaurant, gourmet food, outdoor living) seem to decrease most in performance in comparison to balanced datasets. In contrast, for the 2 gram feature-set the bigger datasets seem to benefit from the ANB approach using FCE, with increased results (stanford, senpol, senscale). Even for the on average poorly performing combined 1 gram + POS-pattern feature-set, some individual datasets actually increase in performance using FCE compared to naive cross-domain. Some of the bigger datasets (senpol, senscale) and a number of imbalanced datasets (tools & hardware, automotive, office products) actually perform better using FCE in comparison to the naive cross-domain approach. For the combined 1 gram + 2 gram features only the
stanford dataset shows an increase in performance compared to the naive cross-domain approach.
5.3.2 Adapted Naive Bayes classifier using subjectivity lexicon
The first suggested improvement for the adapted Naive Bayes classifier is the replacement of the FCE-selected generalizable features with a subjectivity lexicon: ANB (SL). In this section the results for this suggested improvement are discussed; the results are presented in Table 7.
Dataset | 1 gram: ac / micro F1 / macro F1 | 2 gram: ac / micro F1 / macro F1 | 1 gram + POS: ac / micro F1 / macro F1 | 1 gram + 2 gram: ac / micro F1 / macro F1
(the 12 values for each dataset follow its name, in this column order)
kitchen&housewares
0.4518 0.3335 0.3306 0.6830 0.7185 0.7170 0.4885 0.5159 0.5151 0.5771 0.3821 0.3801
toys_&_games
0.5120 0.4857 0.4832 0.6677 0.6976 0.6961 0.4797 0.5624 0.5619 0.5914 0.4991 0.4982
sports_&_outdoors
0.4711 0.3943 0.3918 0.6793 0.7158 0.7136 0.4716 0.4884 0.4872 0.6153 0.5346 0.5332
electronics
0.4557 0.3157 0.3129 0.6817 0.7230 0.7212 0.4928 0.5678 0.5670 0.5855 0.3908 0.3879
beauty
0.5020 0.4998 0.4976 0.6224 0.6659 0.6632 0.4506 0.4654 0.4644 0.6319 0.6950 0.6940
office_products
0.4391 0.2221 0.2197 0.5224 0.5493 0.5464 0.4757 0.5482 0.5458 0.2251 0.2182 0.2159
automotive
0.4550 0.2995 0.2963 0.5530 0.5588 0.5563 0.4968 0.5723 0.5704 0.3464 0.3560 0.3544
laptop
0.5398 0.6110 0.6091 0.5448 0.5828 0.5810 0.4549 0.5478 0.5313 0.3337 0.4123 0.3906
jewelry&watches
0.4961 0.4014 0.3990 0.6080 0.6643 0.6619 0.4894 0.5617 0.5603 0.5330 0.6015 0.5995
stanford
0.5602 0.5573 0.5555 0.6766 0.7464 0.7449 0.5197 0.6570 0.6570 0.5591 0.6264 0.6264
senpol
0.5576 0.6576 0.6555 0.6359 0.6938 0.6913 0.6065 0.6133 0.6122 0.5267 0.6632 0.6619
lawyer
0.4914 0.4919 0.4893 0.4712 0.2771 0.2753 0.4850 0.3302 0.3270 0.3791 0.2049 0.2014
baby
0.4901 0.4503 0.4482 0.6959 0.7515 0.7498 0.5063 0.5037 0.5025 0.5808 0.5087 0.5072
tools&hardware
0.4569 0.2750 0.2731 0.5494 0.6446 0.6430 0.5472 0.6618 0.6531 0.2015 0.2474 0.2378
music2
0.5523 0.6505 0.6480 0.6419 0.6782 0.6755 0.4845 0.4430 0.4426 0.5776 0.6820 0.6815
computer&videogames
0.5097 0.4678 0.4653 0.6015 0.6368 0.6343 0.3815 0.2489 0.2480 0.6760 0.7503 0.7492
senscale
0.5558 0.6560 0.6539 0.6133 0.6420 0.6395 0.5949 0.6175 0.6171 0.6256 0.7538 0.7536
cell phones&service
0.5339 0.5163 0.5141 0.6198 0.6723 0.6696 0.4930 0.5430 0.5405 0.5992 0.6705 0.6681
musical_instruments
0.4552 0.2242 0.2222 0.4969 0.4824 0.4799 0.5245 0.6114 0.6082 0.2093 0.2037 0.1990
grocery
0.4998 0.4748 0.4724 0.6153 0.6704 0.6680 0.4412 0.4923 0.4914 0.5648 0.6338 0.6331
camera_&_photo
0.5308 0.5575 0.5556 0.6896 0.7180 0.7163 0.4969 0.5283 0.5274 0.6209 0.6229 0.6223
doctor
0.5371 0.4951 0.4934 0.6016 0.6123 0.6108 0.4867 0.5022 0.5010 0.6032 0.4914 0.4902
tv
0.4722 0.3532 0.3505 0.4769 0.2992 0.2970 0.4749 0.3657 0.3598 0.3296 0.2586 0.2555
dvd
0.5595 0.6447 0.6426 0.6536 0.6834 0.6816 0.5046 0.6497 0.6493 0.5622 0.6669 0.6666
camp
0.5635 0.6707 0.6681 0.5743 0.6536 0.6502 0.4399 0.5464 0.5452 0.4540 0.5726 0.5714
drug
0.4500 0.2828 0.2809 0.5677 0.5195 0.5181 0.4872 0.3658 0.3643 0.5190 0.2280 0.2279
magazines
0.5066 0.4898 0.4869 0.6611 0.6900 0.6882 0.4816 0.4706 0.4700 0.6461 0.5775 0.5772
video
0.5506 0.6009 0.5986 0.6237 0.6441 0.6421 0.4919 0.4298 0.4291 0.5940 0.6663 0.6658
health&personal_care
0.5214 0.5250 0.5216 0.6802 0.7185 0.7165 0.4698 0.5344 0.5330 0.5962 0.5639 0.5628
camera
0.5235 0.5596 0.5581 0.6063 0.5772 0.5754 0.4911 0.5985 0.5948 0.4994 0.3852 0.3785
music
0.5148 0.5057 0.5031 0.5863 0.6032 0.6017 0.4901 0.5562 0.5536 0.6175 0.5263 0.5239
apparel
0.4613 0.3319 0.3296 0.6509 0.6601 0.6589 0.4808 0.5417 0.5398 0.5620 0.3642 0.3620
radio
0.4760 0.4209 0.4190 0.5528 0.4808 0.4792 0.4992 0.4480 0.4469 0.5266 0.3640 0.3628
outdoor_living
0.4669 0.3689 0.3660 0.6083 0.6496 0.6469 0.4161 0.4330 0.4319 0.4865 0.5080 0.5066
software
0.4531 0.2647 0.2615 0.6760 0.7252 0.7240 0.5135 0.5938 0.5934 0.5581 0.3188 0.3177
books
0.5430 0.5815 0.5795 0.6537 0.6958 0.6942 0.5028 0.6409 0.6389 0.6711 0.6696 0.6679
gourmet_food
0.4673 0.3359 0.3339 0.5718 0.6096 0.6072 0.3394 0.3772 0.3762 0.4366 0.5002 0.4986
restaurant
0.4723 0.3342 0.3312 0.6084 0.6732 0.6711 0.3533 0.4139 0.4131 0.5043 0.5961 0.5952
average
0.5015 0.4555 0.4531 0.6111 0.6312 0.6291 0.4817 0.5144 0.5124 0.5191 0.4978 0.4954
standard deviation
0.0391 0.1324 0.1325 0.0600 0.1062 0.1062 0.0511 0.0964 0.0962 0.1260 0.1653 0.1666
Table 7: results of the adapted Naive Bayes classifier using a subjectivity lexicon.
Apart from the 2 gram feature-set, where a very small increase in accuracy can be observed compared to using FCE, the first suggested alternative for selecting generalizable features doesn't seem to aid the classification performances. The 1 gram feature-set performs very similarly to FCE, with a decrease in accuracy of 0.0045 and micro F1 and macro F1 differences which are even smaller. The 2 gram feature-set shows a small increase in performance of 0.0048, but both F1 scores are a little over 0.005 lower compared to FCE. The combined 1 gram + POS-pattern feature-set performs significantly worse in
comparison to FCE with a decrease in accuracy of 0.0925. The decrease in micro and macro F1 score is
even bigger with decreases over 0.13. For the combined 1 gram + 2 gram feature-set a slight decrease
in accuracy is observed with 0.0189 and even smaller decreases in terms of micro and macro F1 score.
Although the accuracy for the 2 gram feature-set slightly increases in comparison to using FCE, the standard deviation over the accuracies is a little higher compared to FCE: 0.06. The higher standard deviation shows that performance is less stable over the different datasets used. The standard deviation over the accuracies of both combined feature-sets is a little higher as well, especially for the combined 1 gram + 2 gram feature-set, which has a standard deviation over the accuracies of 0.1260. The differences between the reported micro and macro F1 scores are comparable to the differences reported for FCE.
For all feature-sets it seems that the biggest datasets (stanford, senpol, senscale) show above-average performance, indicating that more data aids domain-adaptations. As for FCE, for the 2 gram feature-set the smaller datasets (office products, lawyer, tools & hardware) seem to perform below average.
When comparing results on a dataset level to ANB using FCE, the stanford dataset shows the most significant increase in performance for the 1 gram feature-set, with an increase in accuracy of 0.0370. For the 2 gram feature-set most of the medium sized datasets seem to benefit from using ANB with a subjectivity lexicon in comparison to ANB using FCE. For both combined feature-sets the datasets performing lower in comparison are mostly small sized datasets. For the combined 1 gram + POS-pattern feature-set almost all datasets show a decrease in performance compared to ANB using FCE, with a few insignificant exceptional increases in performance for the books and dvd datasets.
5.3.3 Adapted Naive Bayes classifier using improved generalizable features
The final experiment conducted for this research is the utilization of the adapted Naive Bayes classifier
using improved generalizable features: ANB (IMP). The results for the different conducted experiments are presented in this section, starting with Table 8.
Dataset | 1 gram: ac / micro F1 / macro F1 | 2 gram: ac / micro F1 / macro F1 | 1 gram + POS: ac / micro F1 / macro F1 | 1 gram + 2 gram: ac / micro F1 / macro F1
(the 12 values for each dataset follow its name, in this column order)
kitchen&housewares
0.4469 0.2912 0.2890 0.6571 0.6965 0.6947 0.4978 0.4946 0.4939 0.5565 0.3259 0.3241
toys_&_games
0.5191 0.4727 0.4710 0.6475 0.6787 0.6771 0.4949 0.5464 0.5459 0.5822 0.4708 0.4699
sports_&_outdoors
0.4816 0.3869 0.3841 0.6551 0.6936 0.6910 0.4812 0.5006 0.4994 0.6071 0.5041 0.5029
electronics
0.4451 0.2713 0.2688 0.6530 0.6981 0.6962 0.4975 0.5365 0.5355 0.5704 0.3370 0.3340
beauty
0.5102 0.4770 0.4742 0.6203 0.6700 0.6671 0.4680 0.4967 0.4953 0.6102 0.6586 0.6574
office_products
0.4599 0.3278 0.3249 0.5786 0.6778 0.6752 0.5203 0.5990 0.5967 0.3468 0.3881 0.3852
automotive
0.4498 0.2809 0.2787 0.5805 0.6426 0.6403 0.5178 0.5974 0.5957 0.3858 0.3988 0.3970
laptop
0.5527 0.6445 0.6425 0.5728 0.6393 0.6376 0.4635 0.5585 0.5419 0.4446 0.5462 0.5226
jewelry&watches
0.4917 0.3943 0.3921 0.6122 0.6824 0.6799 0.5266 0.6135 0.6120 0.5595 0.6314 0.6295
stanford
0.5231 0.4924 0.4903 0.6754 0.7454 0.7437 0.5242 0.6469 0.6469 0.5514 0.6097 0.6096
senpol
0.5624 0.6552 0.6533 0.6223 0.6826 0.6799 0.6272 0.5907 0.5896 0.5294 0.6550 0.6537
lawyer
0.4852 0.3926 0.3899 0.4793 0.2997 0.2980 0.4853 0.3303 0.3271 0.4713 0.2597 0.2555
baby
0.4828 0.4098 0.4081 0.6815 0.7400 0.7383 0.5069 0.5013 0.5003 0.5479 0.4121 0.4100
tools&hardware
0.5584 0.6303 0.6286 0.5781 0.6919 0.6902 0.5914 0.7015 0.6929 0.8120 0.8795 0.8762
music2
0.5763 0.6985 0.6966 0.6325 0.6705 0.6674 0.4916 0.4916 0.4913 0.5420 0.6809 0.6804
computer&videogames
0.5650 0.5257 0.5235 0.6025 0.6441 0.6418 0.4110 0.3453 0.3441 0.6565 0.7314 0.7302
senscale
0.5730 0.6706 0.6687 0.6062 0.6380 0.6355 0.5953 0.6117 0.6110 0.6225 0.7458 0.7455
cell phones&service
0.5319 0.5593 0.5569 0.6237 0.6820 0.6795 0.5150 0.5795 0.5771 0.6070 0.6876 0.6852
musical_instruments
0.4881 0.3476 0.3455 0.5806 0.6810 0.6792 0.5681 0.6594 0.6563 0.3435 0.3973 0.3919
grocery
0.5118 0.4953 0.4930 0.6147 0.6800 0.6774 0.4657 0.5307 0.5295 0.5846 0.6544 0.6537
camera_&_photo
0.5451 0.6065 0.6044 0.6758 0.7058 0.7040 0.5074 0.5687 0.5678 0.5768 0.6180 0.6175
doctor
0.4969 0.4161 0.4141 0.5855 0.5901 0.5886 0.4986 0.4984 0.4971 0.5875 0.4515 0.4497
tv
0.4581 0.2764 0.2738 0.4809 0.3126 0.3106 0.4760 0.3654 0.3596 0.4262 0.3181 0.3158
dvd
0.5928 0.7039 0.7023 0.6393 0.6714 0.6694 0.4998 0.6488 0.6484 0.5425 0.6783 0.6779
camp
0.5808 0.7079 0.7062 0.5782 0.6557 0.6524 0.4435 0.5520 0.5508 0.4899 0.6220 0.6213
drug
0.4255 0.1908 0.1892 0.5514 0.5067 0.5049 0.4878 0.3360 0.3349 0.5110 0.1987 0.1992
magazines
0.5002 0.4513 0.4490 0.6503 0.6817 0.6796 0.4860 0.4787 0.4780 0.6145 0.5267 0.5265
video
0.5722 0.6401 0.6384 0.6180 0.6397 0.6374 0.4941 0.4607 0.4599 0.5886 0.6757 0.6753
health&personal_care
0.5222 0.5477 0.5461 0.6624 0.7016 0.6990 0.4803 0.5519 0.5505 0.5755 0.5575 0.5563
camera
0.5119 0.4744 0.4723 0.6038 0.5818 0.5799 0.4922 0.5938 0.5897 0.5242 0.3698 0.3620
music
0.5039 0.4660 0.4632 0.5749 0.5894 0.5878 0.4982 0.5493 0.5466 0.6168 0.5090 0.5062
apparel
0.4580 0.2938 0.2917 0.6300 0.6404 0.6389 0.4957 0.5511 0.5490 0.5481 0.3239 0.3217
radio
0.4582 0.3085 0.3065 0.5382 0.4599 0.4583 0.4996 0.4450 0.4439 0.5226 0.3351 0.3341
outdoor_living
0.4657 0.3449 0.3421 0.6168 0.6751 0.6724 0.4510 0.4871 0.4862 0.4824 0.4971 0.4957
software
0.4442 0.2325 0.2302 0.6464 0.6995 0.6979 0.5132 0.5590 0.5584 0.5415 0.2723 0.2713
books
0.5374 0.6068 0.6049 0.6394 0.6849 0.6828 0.5080 0.6399 0.6379 0.6597 0.6688 0.6671
gourmet_food
0.4820 0.3580 0.3562 0.5990 0.6618 0.6597 0.3809 0.4390 0.4378 0.4786 0.5482 0.5466
restaurant
0.4673 0.3002 0.2977 0.6130 0.6836 0.6817 0.3837 0.4577 0.4568 0.4684 0.5551 0.5540
average
0.5062 0.4566 0.4544 0.6099 0.6415 0.6393 0.4959 0.5293 0.5273 0.5444 0.5184 0.5161
standard deviation
0.0459 0.1624 0.1624 0.1066 0.1386 0.1383 0.0915 0.1212 0.1208 0.1217 0.1768 0.1769
Table 8: results of the adapted Naive Bayes classifier using improved generalizable features.
As observed for ANB using FCE, the method using improved generalizable features shows the best accuracy for the 2 gram feature-set, with 0.6099. The corresponding micro F1 and macro F1 scores are 0.6415 and 0.6393. For the improved generalizable features the combined 1 gram + POS-pattern feature-set is the worst performing feature-set in terms of accuracy, with an average accuracy of 0.4959; the micro and macro F1 scores are the lowest for the 1 gram features, with 0.4566 and 0.4544. The differences between micro F1 and macro F1 scores are very small, showing consistent results between balanced and imbalanced evaluation-sets.
Compared to the ANB classifier using FCE the performance is very similar. The 1 gram and 2 gram feature-sets show minimal improvements in terms of accuracy of 0.0003. The increase in performance for the combined 1 gram + 2 gram feature-set is a little more significant with 0.0064. The performance for the combined 1 gram + POS-pattern feature-set is a little lower in comparison to ANB using FCE; the difference in accuracy is 0.0782 and the differences in micro and macro F1 are around 0.11.
In comparison to ANB using a subjectivity lexicon the performance for ANB using improved generalizable features increases a little for all the feature-sets except the 2 gram feature-set. The difference in accuracy for the 1 gram feature-set is very small with 0.0048; the differences for the combined feature-sets are a little more significant with 0.0142 and 0.0252. For the micro and macro F1 scores all feature-sets show a slight improvement over ANB using the subjectivity lexicon.
Comparing the results of the ANB to the naive cross-domain approach, the same observations as for using FCE can be made, although the improved method performs very slightly better. The performance is lower in comparison to the naive cross-domain baseline. The difference in terms of accuracy is the biggest for the 1 gram feature-set with 0.1415. The smallest difference is achieved by the 2 gram feature-set, where the difference is 0.0311. The decrease in performance in terms of micro and macro F1 scores is the highest for the combined 1 gram + POS-pattern feature-set with differences over 0.23. The smallest difference is observed for the combined 1 gram + 2 gram feature-set with a difference just over 0.03.
The standard deviation over the accuracies for the 1 gram feature-set is quite low with 0.0459; for the rest of the feature-sets the standard deviation is very high, with values close to or above 0.10. In comparison to the ANB using FCE this is not surprising, as the standard deviations there were quite high as well, except for the 2 gram feature-set which was low for FCE but high for the improved method. The high standard deviation shows that although the performance has increased a little over FCE, the stability has changed and the quality of the domain-adaptation is more dependent on the supplied data.
As observed for FCE, the smaller and unbalanced datasets show below-average performance for the domain-adaptations using 1 gram and 2 gram feature-sets. For the combined 1 gram + POS-pattern feature-set the smaller sized datasets (drug, camp, musical instruments) seem to perform below average. The balance of the training-data is of less importance for the combined feature-sets, as the best performing dataset is again the tools & hardware dataset. As observed for ANB using a subjectivity lexicon, the bigger datasets (stanford, senscale, senpol) seem to have above-average performance, indicating that the availability of data seems to aid classification accuracies.
When comparing ANB using improved generalizable features to ANB using FCE on a dataset level, the results for the 1 gram feature-set are very similar. The main difference lies in a slight increase in performance, mainly for the imbalanced datasets (beauty, jewelry & watches, computer & video games, grocery). For the 2 gram feature-set most of the datasets show increases in performance using ANB with improved generalizable features compared to ANB using FCE; the few datasets showing decreases in performance are mainly medium sized datasets. For the combined feature-sets most datasets increase in performance compared to ANB using FCE, and the increases are all caused by performance increases of medium sized datasets. Compared to naive cross-domain classification the performance for the 1 gram feature-set decreases considerably. While on average the performance of all the feature-sets decreases, for the 2 gram feature-set, combined 1 gram + POS-pattern feature-set and combined 1 gram + 2 gram feature-set the performance for the bigger datasets (stanford, senscale, senpol) actually shows increases.
Chapter 6
Results analysis
In this chapter the results for the different conducted experiments in this thesis will be analyzed. First
the in-domain experiments will be discussed. Next the naive cross-domain baseline method will be
considered. In the third section of this chapter the cross-domain adaptation using generalizable features
will be taken into account together with the different suggested improvements for the selection of
generalizable features.
6.1 In-domain baseline experiments
As a baseline for in-domain classification two different classifiers were used: the multinomial Naive
Bayes classifier and the subjectivity lexicon classifier. The baseline in-domain experiment was
conducted in order to evaluate whether the general setup results in performance comparable to previous
research. The multinomial Naive Bayes classifier was trained based on different feature-sets (1 gram
features, 2 gram features, combined 1 gram + POS-pattern features and combined 1 gram + 2 gram
features). In the following sections the results for the different classifiers will be discussed.
6.1.2 Naive Bayes Classifier
The first experiment discussed in this chapter considers the in-domain performance of the Naive Bayes
classifier. The performance of four different feature-sets was considered over the 38 different datasets used in this research. After the discussion about the feature-sets, confusion matrices are presented for a couple of representative domains. Next, to gain insight into the behavior of the Naive Bayes classifier and the features used for the different in-domain classifications, details are provided of example features and their probabilities.
The best performing features for the in-domain experiment were the 1 gram features. Looking at the smaller datasets, the accuracies seem to be above average compared to the rest of the datasets: the best performing dataset is even the smallest one, the tools & hardware dataset, both in terms of accuracy and in terms of reported micro F1 and macro F1 scores. Looking at the average sized datasets (like kitchen & housewares, sports & outdoor, music2 and video) the performance is around the reported average, with accuracies of 0.7730, 0.7873, 0.7599 and 0.7760. The small differences are probably caused by small differences between the datasets and the examples used, such as differences in annotations for different examples within a dataset or inconsistencies in the terminology used by the different reviewers of the products. Looking at the biggest (stanford) dataset it is noticeable that the
performance is considerably lower than the results reported by Go et al. (2009). In the paper by Go et
al. (2009) accuracies of 0.80 and more are reported where performance in the experiments here is
0.6774 for the stanford dataset. The stanford dataset is even the worst performing dataset in this
experiment in terms of accuracy, although the micro F1 and macro F1 scores for the dvd dataset are
lower. The under-performance of the stanford dataset is probably caused by the slightly adapted form
of the Naive Bayes classifier used by the authors to handle recurring terms in tweets (in tweets people
often use constructs like HAPPY HAPPY HAPPY). In order to be able to consistently use the same approach between domains, the decision was made not to use this adapted form, which, as experimentation showed, only makes a positive difference for the stanford dataset.
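The adaptation referred to amounts, roughly, to normalizing repeated tokens before counting features. A minimal illustration of such a normalization (an assumption for illustration, not the exact procedure of Go et al. (2009)) is:

    def collapse_repeats(tokens, max_repeat=1):
        # Collapse runs of the same token ("happy happy happy" -> "happy") so that
        # repetition in tweets does not inflate the multinomial feature counts.
        collapsed, run = [], 0
        for i, tok in enumerate(tokens):
            run = run + 1 if i > 0 and tok == tokens[i - 1] else 1
            if run <= max_repeat:
                collapsed.append(tok)
        return collapsed

    # Example: collapse_repeats(["happy", "happy", "happy", "day"]) -> ["happy", "day"]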
The low standard deviation shows that the multinomial Naive Bayes classifier is a stable classifier in
the context of in-domain classifiers. The classifier has shown similar accuracy results in different
domains following the same general setup. These are good properties in terms of training a new
sentiment classification system for a domain where training-data is available or can be easily
(automatically) gathered. Following the methodology described in this thesis accurate sentiment
classifiers can easily be set up when annotated examples are available. As described in previous work (Pang and Lee, 2002) the Naive Bayes classifier proves to be a stable classifier. Comparing results to previous research is somewhat limited, as no previous research used the collection of different datasets gathered for this thesis. However, looking at the results reported by Pang and Lee (2002) for the senpol dataset, the results for the 1 gram experiments are similar: Pang and Lee (2002) report an accuracy of 0.787, whereas the accuracy reported here for the senpol dataset is 0.7880. In research by Turney (2002) accuracies for in-domain classification are reported in the range of 0.66 to 0.84; although Turney uses a different classifier than the multinomial Naive Bayes classifier used in this research, the accuracies correspond to the performances reported in Turney's research. From this it can be concluded that the general setup for this research can use different datasets from different sources in a general way and achieve similar results in comparison to previous work.
As an illustrative example the 10-fold cross-validation accuracy results for the musical instruments
dataset and the apparel dataset are shown in Table 9. The musical instruments dataset consists of only
332 examples, of which 284 are positive and 48 are negative. The classifier reaches an average accuracy for the musical instruments dataset of 0.8559, with a micro F1 score of 0.9238 and a macro F1 score of 0.9228. As a comparison the apparel dataset is added. The apparel dataset is bigger, with 2000 available examples, and is balanced with 1000 positive and 1000 negative examples. From the comparison it can be seen that the above-average performance of 0.8559 for the musical instruments dataset might be caused by the imbalance in the dataset. The performance of 0.7676 for the apparel dataset is a better representation of the general performance of the Naive Bayes classifier. The results in Table 9 also show that the extra randomization step taken before splitting the data into 10 folds makes sense: in earlier experimentation the results varied widely between folds, as sometimes a lot of positive or negative examples were present in one of the folds, leading to big biases.
Overview of the 10-fold cross validation

Run number | musical instruments accuracy | apparel accuracy
1 | 0.7879 | 0.7450
2 | 0.8636 | 0.7550
3 | 0.8687 | 0.7600
4 | 0.8788 | 0.7700
5 | 0.8606 | 0.7750
6 | 0.8586 | 0.7817
7 | 0.8615 | 0.7721
8 | 0.8636 | 0.7725
9 | 0.8552 | 0.7694
10 | 0.8606 | 0.7755
total | 0.8559 | 0.7676
Table 9: overview of 10-fold cross validation for the musical instrument and apparel datasets.
In Table 10 the confusion matrices for the same datasets are presented. From the confusion matrix for the musical instruments dataset it can be seen that the imbalance in the dataset leads to a considerable bias towards positive classification: the number of false positives is 42, against only 4 falsely classified negative examples. Compared to this, for a classifier trained on a balanced dataset like the apparel dataset the difference between the number of falsely classified positive and falsely classified negative examples is much smaller. It can be expected that such big biases will have a negative effect on the classification performances when a biased classifier is used in a cross-domain setting.
Overview of the confusion matrices

musical instruments (rows: actual, columns: predicted)
actual negative: TN 5, FP 42
actual positive: FN 4, TP 279

apparel (rows: actual, columns: predicted)
actual negative: TN 638, FP 362
actual positive: FN 87, TP 913
Table 10: overview of the confusion matrices for the musical instruments and apparel datasets.
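For completeness, the confusion-matrix cells discussed here can be derived from the predictions as in this short sketch (illustrative Python):

    def confusion_matrix(y_true, y_pred, positive="positive"):
        # Count the four confusion-matrix cells for a binary sentiment task.
        tn = fp = fn = tp = 0
        for truth, pred in zip(y_true, y_pred):
            if truth == positive and pred == positive:
                tp += 1
            elif truth == positive:
                fn += 1
            elif pred == positive:
                fp += 1
            else:
                tn += 1
        return tn, fp, fn, tp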
Looking at the results of the in-domain 10-fold cross-validation for the 2 gram feature-set, the results are slightly lower compared to the 1 gram feature-set. Comparing the results to previous research (Pang and Lee, 2002) on the senpol dataset, the reported accuracy for 2 gram classification in their research is 0.773, while in this research an accuracy of 0.8123 is reported for the senpol dataset. Taking into account the different datasets used in this research, it can be concluded that the average accuracy of 0.7739 over all the datasets is a result comparable to previous research.
Intuitively one might think that with 2 gram features higher accuracies could be achieved in comparison to 1 gram features. This would be logical because 2 gram features offer more expressiveness (e.g. a 2 gram structure like “not good” would yield a negative implication, in contrast to the two separate 1 grams “not” and “good”). In most of the datasets reported in this research, however, 2 gram features have lower accuracy scores in comparison to the 1 gram features. For example, for the office products dataset the performance in the 1 gram features experiments is 0.8780, while the performance of the 2 gram features classifier is only 0.7021, a difference of 0.1759. In the cases where 2 gram features (although slightly) outperform the 1 gram features, these are datasets of considerable size; for example, a small increase in performance is found in the electronics dataset and the beauty dataset. Where datasets are relatively small fewer examples are used for training and, hence, less expressiveness is captured. For example, for the laptop dataset the performance decreases from 0.7438 for the 1 gram feature-set to 0.6774 for the 2 gram feature-set. The correlation with the size of the dataset doesn't always hold, for example for the tools & hardware dataset. For this dataset there are very few examples available (112 data examples) and the dataset is quite unbalanced (14 negative versus 98 positive examples). This results in quite an unbalanced classifier with a high prior probability for positive examples. This high prior probability will in many cases not be exceeded by the “proof” found in the analyzed features. Because the accuracy values are based on a 10-fold cross validation of the data and the dataset is quite small, the training and evaluation sets are also small subsets, which in this case probably doesn't give a representative picture of the performance for this domain.
The standard deviation of the results for the 2 gram feature-set is a little higher in comparison to the 1 gram feature-set: 0.0750. This adds to the argument that the classifier trained with 2 gram features is more reliant on the amount of supplied data than the classifier trained with 1 gram features. The performance is also more dependent on the segmentation of the folds of the data. Certainly in the case of smaller training sets it can happen that certain features are not assigned probabilities as would be intuitive (e.g. because “good” is present in a lot of different 2 gram variations: “good meal”, “good tool”, “good movie”). In that case the sentiment carried by “good” is spread over different 2 gram features. This can result in the misclassification of certain examples, especially when mixed sentiment is expressed in a sentence.
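The dilution of “good” over many 2 gram variants follows directly from how the features are extracted; the small sketch below (illustrative Python) contrasts the two feature-sets on a toy sentence.

    def ngrams(tokens, n):
        # 1 gram features are the tokens themselves; 2 gram features join adjacent tokens.
        return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    tokens = "a good meal and a good movie".split()
    print(ngrams(tokens, 1))  # 'good' occurs twice as the same 1 gram feature
    print(ngrams(tokens, 2))  # the evidence is split over 'good meal' and 'good movie'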
The results for the combined 1 gram + POS-pattern feature-set are slightly lower compared to the accuracies reported for the 1 gram and 2 gram feature-sets. The tools & hardware dataset is a surprise here, as this was the best performing dataset for the 1 gram and 2 gram feature-sets. The sudden change in performance can be explained by the aforementioned argument that the more complex the features get, the more data is necessary in order to capture the essence of the domain. In cases where the datasets are small (e.g. the tools & hardware dataset) performance can decrease considerably, certainly in the case of the POS-pattern feature-set. This argument is further supported by the performance of other smaller datasets (e.g. the lawyer dataset), where the performance decreases when POS-patterns are added. Looking at the performance for the medium sized datasets (e.g. kitchen & housewares, sports & outdoor) the results are comparable to the 1 gram or 2 gram results. Again, to compare to previous research by Pang and Lee (2002) the senpol dataset is used. The performance reported by Pang and Lee (2002) is 0.815, where performance here is a little lower: 0.7779. The difference in performance can be explained by the POS-tagger applied: in the research by Pang and Lee (2002) Oliver Mason's Qtag program is used, whereas in this research the Brill tagger is used.
The combined 1 gram + 2 gram feature-set is the worst performing feature-set considered in the in-domain experiment. The performance for the smaller datasets (tools & hardware, lawyer, laptop) is slightly lower in comparison to the 1 gram feature-set, which is logical as the performance for the 2 gram feature-set on those datasets was slightly lower as well. For the medium and large sized datasets the performance is stable in comparison to the 1 gram feature-set, 2 gram feature-set and combined 1 gram + POS-pattern feature-set.
To relate the results the senpol dataset is used again (Pang and Lee, 2002). The performance reported by Pang and Lee (2002) is 0.8060, where performance here is a little lower: 0.7819. The difference between the two reported accuracies is less than 3%, which is an acceptable difference for the general approach used in this research. Looking at the results for the micro and macro F1 scores for the different feature-sets, the differences between the two F1 scores proved to be quite small. This implies that whether a balanced or an unbalanced evaluation-set was used, little difference in the final result is observed. From this it can be concluded that the classifiers perform quite stably over differently balanced evaluation-sets.
Feature analysis
To gain more insight into the behavior of the multinomial Naive Bayes classifier and the features on which the classification is performed, Table 11 lists the top 10 features with the highest probability calculated by Naive Bayes for the kitchen & housewares dataset and the apparel dataset. For the kitchen & housewares dataset the 1 gram features are listed; for the apparel dataset the 2 gram features are listed. Both the top 10 positive and the top 10 negative features are shown.
Top 10 1 gram features, kitchen & housewares dataset | Top 10 2 gram features, apparel dataset

Negative feature / probability | Positive feature / probability | Negative feature / probability | Positive feature / probability
'the' 0.0547 | 'the' 0.0474 | 'of the' 0.0026 | 'they are' 0.0019
'i' 0.029 | 'i' 0.0271 | 'in the' 0.002 | 'i have' 0.0017
'it' 0.0264 | 'and' 0.0264 | 'i was' 0.0017 | 'in the' 0.0014
'and' 0.0237 | 'a' 0.024 | 'i have' 0.0015 | 'of the' 0.0014
'to' 0.0227 | 'to' 0.0225 | 'and the' 0.0014 | 'and the' 0.0013
'a' 0.0215 | 'it' 0.0217 | 'it is' 0.0014 | 'i am' 0.0013
'of' 0.0149 | 'is' 0.015 | 'it was' 0.0013 | 'and i' 0.0013
'this' 0.0132 | 'this' 0.0133 | 'on the' 0.0013 | 'this is' 0.0012
'is' 0.0129 | 'for' 0.0128 | 'i am' 0.0012 | 'i was' 0.0012
'in' 0.0101 | 'of' 0.0128 | 'they are' 0.0011 | 'it is' 0.0011
Table 11: Top 10 positive and negative features for the kitchen & housewares dataset and the apparel dataset.
As can be seen in Table 11, the most frequent features among both positive and negative 1 gram features are features which we would expect to be prevalent in the English language in general. Common words like “the” and “I” often occur in both positive and negative annotated examples in the dataset. For the 2 gram features the same observation can be made: 2 grams like “of the” and “they are” are prevalent. The top 10 POS-pattern features are more informative compared to the top 10 features seen for the 1 gram and 2 gram features. These features are listed in Table 12 below.
Top 10 features from the restaurant dataset
  Negative feature    probability      Positive feature    probability
  'n't know'          0.0023           'great place'       0.0037
  'not worth'         0.0018           'great food'        0.0032
  'never go'          0.0018           'n't wait'          0.0026
  'n't have'          0.0016           'ever had'          0.0026
  'never return'      0.0015           'very good'         0.0025
  'not get'           0.0015           'n't be'            0.0023
  'first time'        0.0013           'several times'     0.0019
  'n't waste'         0.0011           'good food'         0.0019
  'only thing'        0.0011           'n't know'          0.0018
  'n't eat'           0.0011           'great service'     0.0018

Table 12: Top 10 positive and negative features for the restaurant dataset.
In contrast to the features assigned high probability by the 1 gram and 2 gram multinomial Naive Bayes classifiers, the features in Table 12 are intuitively more accurate in denoting positive and negative sentiment. The features “not worth” and “never go” are very likely to denote a negative signal in a restaurant review. The features “great place” and “great food” are clear signals for positive sentiment. Although some of the features could be argued to be generally applicable (non-domain-specific features, e.g. “not get”, “very good”), other features seem to be representative only of the specific domain (domain-specific features), in this case the restaurant dataset. For example, “good food” is a feature probably used in sentences like “Overall a good time and good food”, which would clearly be a sentence with positive sentiment, but only within the context of restaurant reviews (or domains concerning food). These features are highly domain-specific and therefore misclassification can be expected in a cross-domain setting when using them as proof for a positive sentence.
From the observations made in the previous two tables a possible problem arises when calculating generalizable features using co-occurrence (FCE). As can be seen from the tables, common terms (non-sentimental and non-domain-specific) are prevalent. When calculating generalizable features based on FCE it can be expected that these common terms will achieve high rankings instead of the sentimental, non-domain-specific terms that are intended.
Because the multinomial Naive Bayes classifier assigns a probability to each feature based on the number of times the feature has been detected in the examples, it could be expected that the top 10 features would look like the examples observed. The number of times a feature has been detected can be seen as how much “proof” there is for that feature being used in a positive or negative context. Ideally this would lead to considerable differences in “proof” for features with sentimental connotations (e.g. good or bad). Therefore, the behavior of the classifier is better explained by the probability assigned to features for which different values would be expected. In Table 13 an example is provided with the probability assigned to the features “good” versus “bad” and “very good” versus “very bad” for different datasets.
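Before turning to Table 13, the following is a minimal sketch (assumed Python, not the thesis implementation) of how such per-class feature probabilities could be estimated with add-one smoothing in a multinomial Naive Bayes model; the function name and data layout are illustrative only.

    from collections import Counter

    def feature_probabilities(documents):
        """Estimate P(feature | class) with add-one (Laplace) smoothing.
        `documents` is a list of (label, [feature, ...]) pairs, where the label
        is either 'positive' or 'negative'."""
        counts = {"positive": Counter(), "negative": Counter()}
        vocabulary = set()
        for label, features in documents:
            counts[label].update(features)
            vocabulary.update(features)

        probabilities = {}
        for label, counter in counts.items():
            total = sum(counter.values())
            # Every feature in the vocabulary gets a smoothed probability,
            # so unseen features do not receive a zero probability.
            probabilities[label] = {f: (counter[f] + 1) / (total + len(vocabulary))
                                    for f in vocabulary}
        return probabilities

The ratio between P(feature | positive) and P(feature | negative) then expresses how much “proof” a feature contributes to either class, which is exactly the contrast Table 13 illustrates.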
Dataset             feature        negative   positive
restaurant          'good'         0.002      0.0039
stanford            'bad'          0.0018     0.0005
camera & photo      'very good'    0.00011    0.00035
baby                'very bad'     0.00002    0.00001

Table 13: Probabilities for the features “good” versus “bad” and “very good” versus “very bad” in a positive and negative context for different datasets.
In Table 13 it can be seen how the multinomial Naive Bayes classifier separates examples based on the difference in probabilities assigned to features with a sentimental connotation. In the first row the probabilities for the feature “good” are shown for the restaurant dataset. As can be seen, the feature “good” has a considerably higher probability of occurring in a positive context than in a negative context. This way the probabilities contribute to the “proof” of a sequence of features being positive or negative. Another example is the feature “bad” from the stanford dataset, which is clearly less likely to be used in a positive context than in a negative context. For the 2 gram features in Table 13 the same observation can be made: the feature “very good” has a probability roughly three times as high of occurring in a positive context compared to a negative context, whereas “very bad” from the baby dataset is less likely to be used in a positive context than in a negative context. From this behavior we can conclude that the multinomial Naive Bayes classifier relies on the probability of occurrence within a sentimental context in order to assign probabilities in the Naive Bayes model. The degree to which a term is more present in one context than in the opposite context can thus be used to detect terms which are sentimental (within a context). For the proposed improvement of selecting generalizable features this principle (using TF-IDF) is used in order to select improved generalizable features.
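The importance measure referred to here is defined in Chapter 3; as a purely illustrative sketch (assumed Python, not the exact formula used in this research), a TF-IDF style weight can be combined with the contrast between the positive and negative context of a term roughly as follows.

    import math

    def importance(term, pos_counts, neg_counts, doc_freq, n_docs):
        """Illustrative importance score: a TF-IDF style weight scaled by how
        skewed the term is towards one sentiment context. `pos_counts` and
        `neg_counts` are Counters of term frequencies per class, `doc_freq`
        maps terms to their document frequency and `n_docs` is the corpus size."""
        tf = pos_counts[term] + neg_counts[term]
        idf = math.log((n_docs + 1) / (doc_freq[term] + 1)) + 1
        # Smoothed relative frequency of the term in each sentiment context.
        p_pos = (pos_counts[term] + 1) / (sum(pos_counts.values()) + len(doc_freq))
        p_neg = (neg_counts[term] + 1) / (sum(neg_counts.values()) + len(doc_freq))
        contrast = max(p_pos, p_neg) / min(p_pos, p_neg)
        return tf * idf * contrast

A term like “good”, which is frequent and strongly skewed towards the positive context, would score high under such a measure, whereas an equally frequent but evenly distributed term like “the” would not.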
6.1.3 Subjectivity lexicon classifier
For the subjectivity lexicon classifier accuracies range from a top performance of 0.7803 for the musical instruments dataset to 0.4963 for the senpol dataset. The observed standard deviation is 0.0676, which is almost as low as the observed standard deviation for the 1 gram feature-set. Although this classifier is the worst performing one, it is the only classifier which can be expected to perform consistently in a cross-domain setting, as no training-data is used. The features used in this classifier are a (semi-)manually composed list of terms which is not trained on any form of source data used during the experiments. The classifier performs comparably to the top performing classifiers in only one case (musical instruments) but has the lowest performance on many other datasets (music2, camera, video); this might be explained by the more domain-specific nature of the datasets where low performance is observed. Especially in the case of the camera and video datasets it can be expected that domain-specific features contribute more to the overall expressed sentiment in the text than the general sentiment-bearing features used in a subjectivity lexicon.
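The decision rule of such a lexicon-based classifier can be sketched in a few lines (assumed Python; the tie-breaking rule and the exact lexicon handling in this research may differ).

    def lexicon_classify(tokens, positive_terms, negative_terms):
        """Classify a tokenized document by counting occurrences of positive
        and negative lexicon terms; ties fall back to 'positive' here, which is
        an arbitrary choice made for this sketch."""
        pos = sum(1 for token in tokens if token in positive_terms)
        neg = sum(1 for token in tokens if token in negative_terms)
        return "positive" if pos >= neg else "negative"

Because no parameters are estimated from any dataset, the classifier behaves identically whether it is applied in-domain or cross-domain.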
In Table 14 the confusion matrices for the musical instruments and the apparel datasets are presented. From the confusion matrix for the musical instruments dataset it can be seen that although there is an imbalance in the dataset, the classifier does not give a strongly unbalanced outcome in the amount of false positives and false negatives. This is of course caused by the fact that the subjectivity lexicon classifier does not use data to train, but uses a predefined list of terms. The amount of false positives is 49 against 31 falsely classified negative examples. If we compare this to the performance of the classifier on a balanced dataset like the apparel dataset, surprisingly enough there does seem to be a big positive bias, as 691 examples are wrongly classified as being positive. This is caused by the language used in the apparel review dataset: the language used to describe positive examples is probably not general enough to be captured by the subjectivity lexicon classifier.
Overview confusion matrices

musical instruments              Prediction
                                 negative     positive
  Actual negative                TN  16       FP  49
  Actual positive                FN  31       TP 234

apparel                          Prediction
                                 negative     positive
  Actual negative                TN 309       FP 691
  Actual positive                FN 146       TP 854

Table 14: overview of the confusion matrices for the musical instruments and apparel datasets.
Comparison to previous work for the subjectivity lexicon classifier is somewhat limited, as no research is available covering the number of datasets used for this research. The results can however be compared to Das and Chen (2001), who report a performance of 0.620 on financial news datasets with a subjectivity lexicon. Considering the average accuracy of 0.6094 over the 38 datasets, these results seem comparable to the accuracies reported for this research.
Although the subjectivity lexicon classifier is outperformed by all feature-sets considered for the Naive Bayes classifier, the observed difference between the micro and macro F1 scores is smaller in comparison to the Naive Bayes classifier. From this it can be concluded that the subjectivity lexicon classifier achieves more stability in case of differences between balanced and unbalanced evaluation-sets. This can be explained by the lack of need for training-data because of the use of a predefined list of sentimental terms. The intuition for using a subjectivity lexicon in combination with the adapted Naive Bayes classifier originates from this performance: the consistent and data-independent performance of a predefined list of sentimental terms could very well be of use to the adapted Naive Bayes classifier, which is dependent on qualitative sentimental terms.
6.2 Cross-domain baseline experiments
6.2.1 Naive Bayes classifier
For the second experiment conducted for this research the cross-domain performance of the Naive Bayes classifier is explored using a naive cross-domain approach. For each (target) domain, classifications are made with all other individually in-domain trained classifiers (using all available data for each of those domains). In this section the results for the different feature-sets are presented. After the results for the different feature-sets a more detailed analysis is provided for a sub-section of the domains, for reference against the methods considered in the next set of experiments. Finally, tables are presented giving insight into domain-specific features based on the different datasets considered in this research.
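The evaluation loop behind this naive approach can be sketched as follows (assumed Python; the way the per-target numbers are aggregated over source classifiers is an assumption here, since the text only states that all other in-domain classifiers are applied).

    def naive_cross_domain(classifiers, datasets):
        """For every target domain, apply each classifier trained on a
        different (source) domain and average the resulting accuracies.
        `classifiers` maps a domain name to a trained classifier exposing a
        `predict(document)` method; `datasets` maps a domain name to a list
        of (label, document) pairs."""
        results = {}
        for target, test_set in datasets.items():
            accuracies = []
            for source, classifier in classifiers.items():
                if source == target:
                    continue  # only evaluate genuine cross-domain pairs
                correct = sum(1 for label, doc in test_set
                              if classifier.predict(doc) == label)
                accuracies.append(correct / len(test_set))
            results[target] = sum(accuracies) / len(accuracies)
        return results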
As could be expected, in a cross-domain setting the performance of the classifier decreases considerably for all considered feature-sets. For the 1 gram feature-set the performance decreases from an average of 0.7901 in-domain to an accuracy of 0.6477 cross-domain, a decrease in accuracy of 0.1424; the same goes for the 2 gram feature-set. This result adds to the case that it is only possible to a limited extent to use classifiers trained on one domain directly for another domain. The best performing domain in this sense is the restaurant dataset, with an accuracy score of 0.7624 and micro and macro F1 scores of 0.8416 and 0.8412. These results probably say something about the generality of the dataset. In a more general dataset (where more generalizable features are used to denote sentiment), the averaged classification from all available classifiers is likely to have a higher accuracy compared to a less general dataset. An illustrative example of this generality is given below:
"1","Ambiance is great - service wonderful, and food really excellent. We were all very happy with the
entire experience"
Examining the 1 gram features, the majority of the features found in this example seem to be domain-independent: “great”, “wonderful”, “excellent” and “happy” are all very likely to be domain-independent features signaling positive sentiment.
Dataset                   1 gram                      2 gram                      1 gram + POS                1 gram + 2 gram
                          ac      micro   macro       ac      micro   macro       ac      micro   macro       ac      micro   macro
small sized datasets      0.6478  0.6963  0.6916      0.6497  0.6995  0.6944      0.6283  0.6661  0.6611      0.6237  0.6597  0.6543
medium sized datasets     0.6563  0.7034  0.7022      0.6489  0.6930  0.6919      0.6349  0.6718  0.6706      0.6263  0.6547  0.6535
big sized datasets        0.5754  0.5774  0.5767      0.5463  0.5234  0.5229      0.5610  0.5415  0.5408      0.5279  0.4571  0.4567

Table 15: relation between the performance and the size of the dataset for the naive cross-domain approach.
As observed earlier, for both the 1 gram feature-set and the 2 gram feature-set the stanford dataset is problematic: the lowest overall accuracy is reported for this dataset. In general, considering Table 15 and looking at the smaller datasets (tools & hardware, laptop, lawyer), performance can be observed which is around average for this experiment. An exception here is the performance of the tools & hardware set for the 2 gram feature-set. For the 1 gram feature-set this was an average performing dataset; for the 2 gram feature-set the performance is restored to an accuracy of 0.7170, which is above average. This can be explained by over-fitting: probably a lot of data-specific examples containing the right 2 gram features are captured. The same goes for the medium sized datasets: kitchen & housewares and sports & outdoors. The average accuracy is specifically lower for the bigger (stanford) dataset. This can be explained by the fact that, apart from differing in subject, this domain also differs in the sense that the data does not consist of reviews but of tweets. As discussed in Chapter 4, tweets are significantly different from reviews in the average number of words used and the average number of sentences used per document. Therefore, cross-domain performance on datasets of tweets can be expected to be sub-optimal. Another dataset where cross-domain performance is considerably below average for both the 1 gram and 2 gram feature-sets is the senpol dataset, with the lowest reported micro F1 and macro F1 scores: 0.5465 and 0.5447. For both the 1 gram feature-set and the 2 gram feature-set a considerable decrease in accuracy is observed for this dataset. This can also be explained by the differences in the provided data: the senpol dataset has a relatively large average number of sentences per document, which implies more complexity in the (domain-specific) features, which are harder to capture in a cross-domain setting.
The averaged accuracy of the combined 1 gram + POS-pattern feature-set is slightly worse in comparison to the 1 gram and 2 gram feature-sets: 0.6273. The reported standard deviation is 0.0367, which is slightly better in comparison to the 1 gram and 2 gram feature-sets. This makes this combination the most stable performing feature-set. For both combinations of features, performance for the small sized datasets is around average, just as for the medium sized datasets; the bigger (twitter) dataset reports lower performance. The issues raised for the in-domain tasks with regard to the POS-pattern feature-set are even more relevant for the cross-domain task. Combining all different classifiers can lead to behavior where a lot of domain-specific features (feature-combinations) skew the classification. For some individual classifiers this can lead to a better performance (when, for example, by accident there is a good match between source and target domain). In the majority of cases, however, adding POS-pattern features to the classifier does not aid the classification accuracy scores. In comparison to in-domain performance the decrease in accuracy is considerable: from 0.7557 to 0.6273.
As concluded for the combined 1 gram + POS-pattern feature-set, using a combination of 1 gram and 2 gram features does not help improve cross-domain classification accuracies. On average an accuracy of 0.6178 is reported, with average micro F1 and macro F1 scores of 0.6404 and 0.6382, which are the lowest reported numbers over all the different considered feature-sets. In comparison to in-domain performance, accuracy decreases from 0.7466 to 0.6178.
In general, the observed differences between the reported micro and macro F1 scores are a little higher in comparison to in-domain performance. From this it can be concluded that in a cross-domain setting the difference in performance between balanced and unbalanced evaluation-sets is more of an issue. This is probably caused by the different priors, which for the different domains all add to the overall classification; it can be expected that this issue will be resolved when performing a domain-adaptation.
A comparison to other research can be made for this classifier, as Whitehead and Yaeger (2009) provide a cross-domain analysis on the multi-domain dataset covering a subset of the datasets used in this research. Whitehead and Yaeger (2009) perform classifications with SVM and report an accuracy of 0.61 using the same approach chosen in this thesis, the LOO approach. For this research an accuracy of 0.6383 is found over the same subset of datasets. For this research, however, more datasets are taken into account in comparison to Whitehead and Yaeger (2009). Another important difference is the classifier used: here multinomial Naive Bayes is used, whereas Whitehead and Yaeger use SVM. Despite these differences, the reported accuracies for the same datasets are comparable.
Analysis
In order to compare results consistently between different methods, Table 16 presents the classification results for a classifier trained on a single domain and evaluated on (a subset of) multiple domains. The results shown are those of the classifier trained on the kitchen & housewares dataset. The considered evaluation-sets are: dvd, sports & outdoor, health & personal care, apparel and electronics. With this subset of datasets it is possible to compare results consistently between different feature-sets and between the different applied methods (for the remainder of this chapter). The six datasets are all the same in size (2000 documents) and have the same balance (1000 positive vs 1000 negative examples). The results shown in Table 16 are the accuracies and the micro F1 and macro F1 scores for the different cross-domain classifications.
Kitchen & housewares Cross-Domain Baseline
Feature-set                             dvd            sports &       health &       apparel        electronics    average
                                                       outdoor        personal care
1 gram           accuracy               0.6485         0.7680         0.7705         0.7415         0.7315         0.7320
                 f1 (micro/macro)       0.5531/0.5489  0.7844/0.7844  0.7570/0.7559  0.7733/0.7722  0.7692/0.7679  0.7274/0.7259
2 gram           accuracy               0.6515         0.7560         0.7405         0.7270         0.7310         0.7212
                 f1 (micro/macro)       0.6548/0.6525  0.7843/0.7836  0.7574/0.7569  0.7665/0.7662  0.7707/0.7700  0.7467/0.7458
1 gram + POS     accuracy               0.6455         0.7585         0.7445         0.7310         0.7125         0.7184
                 f1 (micro/macro)       0.5381/0.5342  0.7652/0.7647  0.7499/0.7481  0.7561/0.7547  0.7591/0.7579  0.7137/0.7119
1 gram + 2 gram  accuracy               0.6660         0.7500         0.7480         0.6750         0.7445         0.7167
                 f1 (micro/macro)       0.6447/0.6423  0.7674/0.7671  0.7200/0.7188  0.7402/0.7398  0.7655/0.7646  0.7276/0.7265

Table 16: a subsection of the individual classification results on the kitchen & housewares dataset.
As can be seen in Table 16, the classification results from the classifier trained on the kitchen & housewares dataset show little difference in performance between the different feature-sets. The best performing feature-set in terms of accuracy is the 1 gram feature-set with an accuracy of 0.7320; the highest reported micro and macro F1 scores are for the 2 gram feature-set, however. The lowest performance is observed for the combined 1 gram + POS-pattern feature-set. Looking at the different domains, the results vary from evaluation-set to evaluation-set. The dvd evaluation-set performs below average for all feature-sets. The sports & outdoor set and the health & personal care set perform consistently above average for all feature-sets, which says something about the generality of these domains (and the possibility of using them in a cross-domain setting). The micro and macro F1 scores show very little difference between them, which can be expected as the evaluated datasets are all perfectly balanced. As a result, it can be expected that the evaluation-sets are near perfectly balanced as well, leading to little difference between micro and macro F1. It can be expected that the sports & outdoor and health & personal care domains will achieve the best performance when used for a domain-adaptation. For the following experiments conducting domain-adaptations this table will be used as a reference to compare against.
To take a detailed look at the domain-specific features this research is concerned with, Table 17 shows the probabilities assigned by the different domain-specifically trained classifiers to the 1 gram features “surprise” and “easy”. Upon inspection of the data these two features seemed to be used in different sentimental contexts in different domains, making them domain-specific features.
Dataset                     'surprise'             'easy'
                            negative   positive    negative   positive
apparel                     0.00013    0.00009     0.00015    0.00081
baby                        0.00004    0.00002     0.00074    0.00284
beauty                      0.00006    0.00004     0.00010    0.00066
books                       0.00004    0.00004     0.00025    0.00052
camera_&_photo              0.00004    0.00005     0.00053    0.00189
camera                      0.00005    0.00006     0.00090    0.00364
cell_phones_&_service       0.00005    0.00004     0.00053    0.00185
computer_&_video_games      0.00004    0.00006     0.00058    0.00075
dvd                         0.00007    0.00010     0.00009    0.00019
electronics                 0.00005    0.00005     0.00047    0.00158
grocery                     0.00013    0.00011     0.00020    0.00081
health_&_personal_care      0.00005    0.00005     0.00047    0.00178
kitchen_&_housewares        0.00009    0.00009     0.00031    0.00209
magazines                   0.00006    0.00004     0.00012    0.00046
music2                      0.00005    0.00005     0.00011    0.00017
music                       0.00009    0.00022     0.00022    0.00009
outdoor_living              0.00006    0.00005     0.00030    0.00289
senpol                      0.00014    0.00016     0.00014    0.00018
senscale                    0.00016    0.00017     0.00026    0.00047
sports_&_outdoors           0.00004    0.00004     0.00042    0.00209
toys_&_games                0.00007    0.00007     0.00050    0.00135
video                       0.00005    0.00008     0.00009    0.00022

Table 17: the probabilities assigned by different classifiers for the 1 gram features “surprise” and “easy”.
As can be seen in Table 17, the 1 gram feature “surprise” can have different connotations in different domains. For some domains this feature has a positive connotation (e.g. camera & photo, computer & video games, dvd). In other domains the feature has a negative connotation (e.g. apparel, baby, beauty). A less obvious example is the 1 gram feature “easy”. In the majority of the domains the 1 gram feature “easy” is observed in a positive connotation (apparel, baby, etc.), but in the case of the music dataset the feature “easy” has a more negative connotation. Ideally, when performing a domain-adaptation this domain-dependency would be captured by the generalizable features. In the analysis of the following experiments it will be evaluated whether this domain-specificity is captured by the generalizable features used.
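A simple way to quantify this kind of domain-dependency is to compare, per domain, how much more probable a feature is in the positive context than in the negative context of that domain's trained classifier. The sketch below (assumed Python, with a hypothetical data layout) illustrates the idea.

    def polarity_ratio(feature, probs_per_domain):
        """For each domain, return P(feature | positive) / P(feature | negative)
        as computed by that domain's trained classifier. A ratio above 1 points
        to a positive connotation, below 1 to a negative one.
        `probs_per_domain[domain][label][feature]` is assumed to hold the
        class-conditional feature probabilities."""
        ratios = {}
        for domain, probs in probs_per_domain.items():
            pos = probs["positive"].get(feature, 0.0)
            neg = probs["negative"].get(feature, 0.0)
            ratios[domain] = pos / neg if neg else float("inf")
        return ratios

    # For example, polarity_ratio("easy", probs_per_domain) would be above 1
    # for most domains but below 1 for the music dataset, mirroring Table 17.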
6.2.2 Subjectivity lexicon classifier
When considering the subjectivity lexicon classifier, the same numbers hold as observed for the in-domain subjectivity lexicon experiments. The subjectivity lexicon classifier does not use any data from any of the domains, so its performance is the same whether the setting is in-domain or cross-domain. The average acquired accuracy therefore remains, as in the previous experiment, 0.6092, which is the lowest among the cross-domain experiments. Although the accuracy is quite low, the micro F1 and macro F1 scores are the highest observed for this experiment: 0.7065 and 0.7049. As observed for the in-domain experiment, the subjectivity lexicon classifier is outperformed by all feature-sets considered. In spite of this, the difference between its micro and macro F1 scores was already smaller than for the Naive Bayes classifiers in the in-domain experiments, and for the Naive Bayes classifiers this difference becomes even bigger in the cross-domain experiments. From this it can be concluded that the subjectivity lexicon classifier achieves more stability in case of differences between balanced and imbalanced evaluation-sets.
6.3.1 Adapted Naive Bayes using FCE
As a first attempt to perform a cross-domain adaptation, an experiment is conducted with the adapted Naive Bayes classifier using FCE to select generalizable features. In this section the results for this classifier are presented. First, the results for the different considered feature-sets are presented. Next, a detailed analysis is provided giving insight into the generalizable features used, the results for a sub-section of the data, and the influence of the lambda parameter and of different numbers of generalizable features used.
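As a reminder of the selection step, the sketch below (assumed Python) illustrates the intuition behind FCE: features that are frequent in both the source and the target domain and have similar relative frequencies in both are ranked as most generalizable. The scoring expression is an illustration of this intuition only; the exact FCE formula follows Tan et al. (2009) as given in Chapter 3.

    import math
    from collections import Counter

    def fce_ranking(source_tokens, target_tokens, alpha=1e-6):
        """Rank candidate generalizable features: a feature scores high when it
        is frequent in BOTH domains and its relative frequencies in the two
        domains are close to each other."""
        src, tgt = Counter(source_tokens), Counter(target_tokens)
        n_src, n_tgt = sum(src.values()), sum(tgt.values())
        scores = {}
        for w in set(src) & set(tgt):
            p_s, p_t = src[w] / n_src, tgt[w] / n_tgt
            scores[w] = math.log((p_s * p_t) / (abs(p_s - p_t) + alpha))
        # Most generalizable features first; the top-N form the domain bridge.
        return sorted(scores, key=scores.get, reverse=True)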
The results for the adapted Naive Bayes classifier using FCE to select generalizable features are quite disappointing in comparison to the results reported by Tan et al. (2009) and in comparison to the naive cross-domain baseline. In Tan's research, micro and macro F1 scores of 0.8262 and 0.7962 are reported on average (over 6 Chinese datasets). The best reported micro and macro F1 scores in this research are 0.6428 and 0.6413, for the combined 1 gram + POS-pattern feature-set. This difference in performance can be explained by a number of reasons. The used domains can be an issue: as can be seen in the experiments, performance measures can vary considerably from domain to domain, implying that certain domains are quite suitable for performing a domain-adaptation, while other domains show results far below average, making them very unsuitable. As expected, the difference between micro and macro F1 scores is smaller in comparison to the naive cross-domain approach, showing that the priors are fitted to the target-domain when performing a domain-adaptation.
Dataset                   1 gram                      2 gram                      1 gram + POS                1 gram + 2 gram
                          ac      micro   macro       ac      micro   macro       ac      micro   macro       ac      micro   macro
small sized datasets      0.5054  0.4562  0.4540      0.5532  0.5599  0.5579      0.5882  0.6129  0.6095      0.4943  0.4421  0.4369
medium sized datasets     0.5004  0.4323  0.4301      0.6239  0.6611  0.6589      0.5661  0.6513  0.6503      0.5520  0.5133  0.5120
big sized datasets        0.5548  0.6065  0.6046      0.6371  0.6905  0.6883      0.5944  0.6723  0.6719      0.5671  0.6696  0.6691

Table 18: relation between the performance and the size of the dataset for ANB using FCE.
Sometimes the supplied data appears to be too small to perform a suitable domain-adaptation. For example, as can be seen in Table 18, in the results for the 1 gram feature-set the smaller sized and unbalanced datasets show considerably worse performance. For the medium and bigger sized datasets these performance issues are less prevalent. In general it seems that mainly the small sized datasets have a very negative influence on the overall performance of the ANB classifier using FCE. In contrast, the bigger sized datasets all seem to increase in performance in comparison to naive cross-domain classification. It can be concluded that unbalanced datasets in general benefit from ANB using FCE in comparison to the naive cross-domain approach. The majority of the 38 datasets used in this research is of medium size (around 2000 examples), and for these medium sized datasets the performance varies from dataset to dataset. This can be explained by the overall generality of the dataset used: some domains are more suitable for domain-adaptation than others. Focusing on the performance of the bigger sized datasets, it can be concluded that supplying ANB using FCE with more data has a beneficial effect on performance. Next to the data, the generalizable features (which in this case originate from the data) can play an important role, which will be explored in the next section.
Analysis
The adapted Naive Bayes classifier using FCE to select generalizable features performs considerably worse than expected. In Tan et al. (2009), micro F1 and macro F1 scores of 0.8262 and 0.7926 are reported for the same method on Chinese datasets. Although relating these results is difficult, the micro and macro F1 scores observed for this experiment are considerably lower. To gain more insight into the behavior of the classifier, a subsection of the domain-adaptation results for the kitchen & housewares classifier is presented as an example. As a representative subsection, five different domain-adaptations of the kitchen & housewares classifier are chosen to be displayed in Table 19. The results shown are the accuracies and the micro F1 and macro F1 scores for the different classifications of the domain-adapted classifiers. For this experiment 500 generalizable features are used, lambda is set at 0.6 and 1 EM iteration is performed.
Kitchen & housewares ANB (FCE)
Feature-set                             dvd            sports &       health &       apparel        electronics    average
                                                       outdoor        personal care
1 gram           accuracy               0.4975         0.5450         0.5385         0.5005         0.5175         0.5198
                 f1 (micro/macro)       0.1083/0.1075  0.3793/0.3769  0.3514/0.3490  0.3791/0.3791  0.3889/0.3879  0.3214/0.3201
2 gram           accuracy               0.5445         0.7070         0.7115         0.7070         0.6870         0.6714
                 f1 (micro/macro)       0.4908/0.4903  0.7446/0.7436  0.7529/0.7517  0.7570/0.7540  0.7295/0.7292  0.6949/0.6938
1 gram + POS     accuracy               0.4520         0.4980         0.4910         0.4925         0.4885         0.4844
                 f1 (micro/macro)       0.2956/0.2950  0.4975/0.4969  0.5120/0.5115  0.5889/0.5880  0.4888/0.4866  0.4766/0.4756
1 gram + 2 gram  accuracy               0.4820         0.5925         0.5865         0.6355         0.5620         0.5717
                 f1 (micro/macro)       0.0700/0.0690  0.4437/0.4399  0.4229/0.4224  0.5530/0.5524  0.4089/0.4073  0.3797/0.3782

Table 19: a subsection of the domain-adapted classifiers evaluated on the kitchen & housewares dataset
The chosen subset in Table 19 is a decent representation of the general trend observed in the results presented in Chapter 5. The result for the combined 1 gram + POS-pattern feature-set is the lowest among the different feature-sets: 0.4844. The highest performing feature-set is the 2 gram feature-set with an accuracy of 0.6714. Compared to the naive cross-domain approach, the reported performances are quite disappointing. All feature-sets lose considerably in accuracy compared to the naive cross-domain approach; the smallest difference is for the 2 gram feature-set with a difference in accuracy of 0.0498, the biggest difference is observed for the combined 1 gram + POS-pattern feature-set with a difference in accuracy of 0.234. The results for the adapted Naive Bayes classifier using FCE seem to vary from domain to domain. As observed for the naive cross-domain baseline, the sports & outdoor and health & personal care datasets show on average to be the most suitable for domain-adaptations. This is especially the case for the 2 gram feature-set, where performance comes near naive cross-domain performance. For none of the individual domain-adaptations does the adapted Naive Bayes classifier using FCE seem to perform better than the naive cross-domain approach. This could be explained by the generalizable features used, which will be examined in the next section.
Generalizable features
As described, the adapted Naive Bayes method uses generalizable features selected by FCE between domains in order to perform a domain-adaptation. To illustrate these generalizable features, Table 20 presents the top 50 most common generalizable 1 gram features for the different domain-adaptations for the kitchen & housewares dataset.
you, not, several, zip, wider, with, is, out, your, widening, was, in, nice, you're, widened, very, i, my, years, when, to, have, it, wrung, well, this, for, else, would't, weighty, the, and, control, work, weighs, that, a, but, wires, wasted, on, would, are, windstar, waste, of, so, all, will, warping
Table 20: Top 50 most common generalizable 1 gram features used for the kitchen & housewares dataset
In Table 20 the top 50 generalizable features from the domain-adaptations for the kitchen & housewares dataset are shown for the 1 gram feature-set. It is noticeable that a lot of very common terms are present in the results. Many non-sentimental, domain-independent terms like “you”, “on” and “of” are among the results. Intuitively, one would expect sentimental, domain-independent terms like “good”, “excellent”, “poor” and “bad” to be among the results. When comparing these results to the terms reported by Tan et al. (2009) (see Table 1), the conclusion can be drawn that these results leave room for improvement. There are 7 terms which could be considered to be sentimental: “very”, “not”, “so”, “nice”, “wrung”, “waste” and “wasted”.
In comparison to the 1 gram features, the 2 gram features from the domain-adaptation for the kitchen & housewares dataset show the same issues. To gain more insight into the relation between this performance and the generalizable features used for the 2 gram experiments, the top 50 generalizable features used for the kitchen & housewares domain-adaptations are presented in Table 21.
to the, in my, you toss, without a, weeks yes, would not, i would, you pay, with the, we had, that will, i have, yogurt after, with i, wash so, so this, i bought, yes it, with a, wash first, on the, for a, years it, will never, was looking, of these, because i, wouldnt turn, will keep, very well, my wife, and the, worth the, which point, very small, looking for, and it, worked for, which i, very pleased, it is, all over, without hot, what i, very carefully, is a, your juice, without any, well as, used the
Table 21: Top 50 most common generalizable 2 gram features used for the kitchen & housewares dataset
Looking at the results for the 2 gram generalizable features, these features intuitively make more sense. Several of the terms in Table 21 could be considered sentimental features, for example “without any”, “very well”, “without a”, “very small” and “very pleased”. Just as for Table 20, the observation can be made that a lot of non-sentimental general terms are present, like “to the”, “in my” and “so this”. In the work by Tan et al. (2009) the importance of the generalizable features is stressed, as they have to allow a domain transfer. A possible explanation for the disappointing results is therefore the quality of the generalizable features used. For this reason, the focus of this thesis lies in the attempt to explore methods of selecting better generalizable features, comparable to the features reported in Table 1.
Influence of Lambda
To gain insight into the behavior of the lambda parameter, which controls the balance between the source-domain and the target-domain, an additional experiment is performed measuring the effect of different values for this parameter. Lower values for lambda put the emphasis on the source-domain data, higher values for lambda put the emphasis on the target-domain. It makes sense to evaluate different values for this parameter, as there is likely to be an optimal balance between the use of source-domain data and target-domain data with the adapted Naive Bayes classifier. An argument to put more weight on the source-domain data could be that, while performing a domain-adaptation, the source-domain data is the only data from which (human annotated) labels are used to train the classifier. The target-domain data is considered to be unlabeled, making the source-domain data the only reliable source of information; the target-domain data is expected to have a similar data-distribution, but this is an assumption rather than a certainty. However, the target-domain data is the data over which evaluation is performed, so an argument to put more weight towards the target-domain data could be that, assuming both source and target domain have similar distributions, the target-domain observations should be more important in fitting the Naive Bayes model. To properly evaluate these hypotheses, the experiment performed by Tan et al. (2009) is repeated, evaluating different values for the lambda parameter. In this experiment the parameter is evaluated for values starting at a lambda of 0.05 up to a lambda value of 1.0. In order to evaluate the performance consistently, the number of FCE-selected generalizable features is set to 500 and only 1 EM iteration is performed. Because fitting cross-domain classifiers is quite time-consuming, especially for all 38 datasets used in the full experiments for this research, the choice is made to use the same representative subset of datasets as before to evaluate the behavior of the lambda parameter. For each feature-set (1 gram, 2 gram, combined 1 gram + POS-pattern, combined 1 gram + 2 gram) a graph is plotted containing the different values for the lambda parameter against the performance of the classifier. The performance of the classifier is expressed in three different measures: accuracy, micro F1 and macro F1 scores. In Figure 1 the different graphs are shown for the adapted Naive Bayes classifier using FCE, and a sketch of how lambda enters the model is given directly below.
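The sketch below (assumed Python, deliberately simplified) shows how lambda can weigh source-domain counts against target-domain counts when estimating a word probability in the adapted Naive Bayes model; the target-domain counts are expected counts obtained from the EM soft labels. The exact update follows Tan et al. (2009) as described in Chapter 3, so the expression here is illustrative only.

    def interpolated_word_prob(word, label, source_counts, target_expected_counts,
                               vocabulary, lam):
        """Lambda-weighted estimate of P(word | label). `source_counts[label]`
        holds word counts from the labeled source-domain documents and
        `target_expected_counts[label]` holds expected counts computed from the
        current EM soft labels of the unlabeled target-domain documents.
        lam = 0 relies on the source domain only, lam = 1 on the target domain only."""
        src = source_counts[label]
        tgt = target_expected_counts[label]
        numerator = (1 - lam) * src[word] + lam * tgt[word] + 1  # add-one smoothing
        denominator = ((1 - lam) * sum(src.values())
                       + lam * sum(tgt.values())
                       + len(vocabulary))
        return numerator / denominator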
[Figure: four panels, “Performance of 1 gram ANB (FCE) vs Lambda”, “Performance of 2 gram ANB (FCE) vs Lambda”, “Performance of 1 gram + POS ANB (FCE) vs Lambda” and “Performance of 1 gram + 2 gram ANB (FCE) vs Lambda”, each plotting accuracy, micro F1 and macro F1 against lambda values from 0.05 to 1.0.]
Figure 1: Performance of the ANB classifier using FCE with different feature-sets versus different values for the lambda parameter.
Looking at the different graphs plotted in Figure 1, it is noticeable that for all feature-sets the highest performances are observed at the start of the graph, at lambda values around 0.05-0.2. For the 2 gram feature-set the performance stays stable up until a lambda of 0.2. For all feature-sets, lambda values greater than 0.2 show slowly declining performances. The decline in terms of micro and macro F1 scores is sharper in comparison to the decline in accuracy.
In contrast to the findings reported by Tan et al. (2009), where peaks are reported at higher values of lambda, in this research peak performances are found at lambda values around 0.05. This indicates that mainly the old-domain data is important for cross-domain adaptation on these English datasets. The source-domain data is labeled data, which is important because we can be certain of its semantic orientation. The target-domain data holds the same amount of important information, but (in the cross-domain experiment) is considered to be unlabeled. Intuitively it makes sense that putting the emphasis on the target-domain data (with higher values for lambda, around 0.8) would result in an optimal fitting of the classifier. However, using FCE as a bridge between the two domains does not seem to aid the performance measures.
The difference in observations can be explained by differences in the data or by the sub-optimal performance of FCE in selecting (enough) useful generalizable features. For the POS-pattern features, for instance, it is observed that sometimes, especially for smaller datasets, very few POS-patterns containing an adverb or adjective meeting the pattern requirements are found in the limited available training-data. Therefore, it can be the case that the performance curve is controlled more by the availability and the quality of the features in the data than by the influence of lambda. As has been shown in Table 16 and Table 19, the differences between the domains in a cross-domain adaptation setting are quite big, so the selection of qualitative generalizable features is necessary. For the suggested improvements this experiment will be repeated, to gain insight into whether improving the generalizable features confirms the case made in the research by Tan et al. (2009) that assigning higher weight to unlabeled new-domain data aids classification performance.
Influence of number of used generalizable features
In the research by Tan et al. (2009) the claim is made that the generalizable features play an important role in the performance of the domain-adaptation. The generalizable features are used to provide a domain-adaptation from source-domain to target-domain and are the main suggestion for improvement in this thesis. To explore the influence of the generalizable features and of the number of generalizable features used, an additional experiment is set up. This experiment is based on the experiment performed by Tan et al. (2009) and is set up to gain insight into the behavior of the classifier with different numbers of generalizable features. In this experiment the number of generalizable features is varied between 50 and 1200. To consistently measure the influence of the number of generalizable features, the lambda parameter is kept at 0.6 and EM is performed for 1 iteration. Just as in the previous experiment, the same subset of datasets is used. It is expected that the lower the number of generalizable features, the worse the classifier performs. It can also be expected that there is some maximum number of useful features, as adding too many generalizable features would imply using more and more features which are not general (but rather domain-specific). As in the previous experiment, performance is measured in terms of accuracy, micro F1 score and macro F1 score. In Figure 2 the performance of the adapted Naive Bayes classifier is plotted against an increasing number of generalizable features; the sweep itself is sketched directly below. Different feature-sets are evaluated: 1 gram, 2 gram, combined 1 gram + POS-pattern and combined 1 gram + 2 gram.
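A compact sketch of this sweep (assumed Python; `train_anb` and `evaluate` stand in for the adapted Naive Bayes training and evaluation routines used in this chapter and are hypothetical names):

    def sweep_generalizable_features(ranked_features, train_anb, evaluate,
                                     sizes=(50, 100, 200, 400, 600, 800, 1000, 1200),
                                     lam=0.6):
        """Repeat the domain-adaptation with an increasing number of
        generalizable features taken from the FCE ranking, keeping lambda fixed
        at 0.6 and running a single EM iteration, and record the performance
        (e.g. accuracy, micro F1 and macro F1) for every setting."""
        results = {}
        for n in sizes:
            model = train_anb(generalizable=ranked_features[:n],
                              lam=lam, em_iterations=1)
            results[n] = evaluate(model)
        return results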
[Figure: four panels, “Performance of 1 gram ANB (FCE) vs No. Features”, “Performance of 2 gram ANB (FCE) vs No. Features”, “Performance of 1 gram + POS ANB (FCE) vs No. Features” and “Performance of 1 gram + 2 gram ANB (FCE) vs No. Features”, each plotting accuracy, micro F1 and macro F1 against the number of generalizable features used (50 to 1200).]
Figure 2: Performance of the ANB (FCE) classifier with different feature-sets versus different number of generalizable features used.
Looking at Figure 2 and considering the results for the 1 gram and 2 gram feature-sets, the performance in terms of accuracy is stable up until the number of features used exceeds 200. With more than 200 features the performance decreases slightly, except for the micro and macro F1 scores for the 1 gram feature-set, which increase in the 200-800 interval. The performance for the combined 1 gram + POS-pattern and combined 1 gram + 2 gram feature-sets shows a slow increase as the number of generalizable features used increases. The increase in terms of micro F1 and macro F1 scores for the combined 1 gram + 2 gram feature-set is steeper.
In Tan et al. (2009), where this experiment is performed on Chinese datasets, the performance for different numbers of features varies from 0.6 to 0.85 in terms of micro F1 and macro F1 scores, with performance measures peaking at around 600 generalizable features. In this research the performance curves are similar in the sense that performance measures increase as the number of generalizable features used increases, except for the 2 gram feature-set. The results differ in the sense that performance in these experiments increases up to a point after which it remains stable for all evaluated feature-sets. In the results reported by Tan et al. (2009) there is a clearly observed peak in the graph after which performance decreases again. In the experiments conducted in this research such a decrease is not observed; rather, the results seem to converge to a maximum. As for the lambda experiment, this can be explained by the method used to select generalizable features. The generalizable features used in this experiment contain a considerable number of non-sentimental features, which might skew the domain-adaptation. Using more features does seem to increase performance for most of the feature-sets, but the convergence to a maximum performance that all feature-sets show implies that using more features also means using more non-sentimental features.
6.3.2 Adapted Naive Bayes using subjectivity lexicon
As a first improvement over FCE to select generalizable features, the use of a subjectivity lexicon is proposed. Instead of selecting generalizable features with FCE, in this experiment a subjectivity lexicon is used to serve as the generalizable features. In this section the results for the experiments with this approach are analyzed. First, the results for the different considered feature-sets are presented. Next, a detailed analysis is provided for a sub-section of the considered datasets. As a final experiment the influence of the lambda parameter for this approach is analyzed. In this experiment the number of generalizable features used is not considered because, in contrast to the generalizable features which are calculated as a ranked list, the subjectivity lexicon is a static list. This static list makes no distinction in how generalizable the features are. Because of this limitation it makes no sense to explore the behavior of selecting different numbers of features for the adapted Naive Bayes classifier using a subjectivity lexicon, as any selection made from the subjectivity lexicon would be random.
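Plugging the lexicon into the adapted classifier can be as simple as the sketch below (assumed Python); whether lexicon terms that never occur in the data are filtered out beforehand is an implementation detail that is not fixed here.

    def lexicon_as_generalizable(subjectivity_lexicon, source_vocab, target_vocab):
        """Use the static subjectivity lexicon as the generalizable feature set:
        instead of the FCE ranking, keep the lexicon terms that actually occur
        in both the source-domain and the target-domain data, since only those
        can act as a bridge between the two domains."""
        return [term for term in subjectivity_lexicon
                if term in source_vocab and term in target_vocab]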
When comparing the results of ANB using a subjectivity lexicon it seems that performance for the 1
gram feature-set doesn't increase in comparison to the previous method. Accuracies for the 1 gram
feature-set are 0.5015 in comparison to 0.5060 using FCE. Micro and macro F1 are almost identical. As
the differences between micro and macro F1 are very small the same observations hold as observed in
the previous experiment.
Looking at the results for the 2 gram feature-set, the subjectivity lexicon based generalizable features show an insignificant increase in performance. When analyzing the results for the combined 1 gram + POS-pattern feature-set, the performance of the subjectivity lexicon based generalizable features is considerably lower in comparison to the generalizable features selected by FCE. The big overall differences for this feature-set can be explained by the use of the POS-patterns. POS-patterns are often very data-specific extracted patterns (e.g. “great camera” or “bad restaurant”) which have little probability of being present in the very general subjectivity lexicon. When there are very few features which occur in both the subjectivity lexicon and the data, no good match can be made to facilitate a domain-adaptation. Consequently, the fit of the model is likely to be quite poor. This argument is supported by the small differences in performance for the bigger stanford dataset. This dataset is quite big, containing a lot of different terms, so the probability of this data having a match with the subjectivity lexicon is the highest.
The combined 1 gram + 2 gram feature-set shows a small difference in performance compared to ANB using FCE (0.0189). In general it can be concluded that for this experiment there is no improvement from using a subjectivity lexicon as generalizable features.
Dataset                   1 gram                      2 gram                      1 gram + POS                1 gram + 2 gram
                          ac      micro   macro       ac      micro   macro       ac      micro   macro       ac      micro   macro
small sized datasets      0.4907  0.4196  0.4174      0.5396  0.5189  0.5168      0.4870  0.5132  0.5083      0.3768  0.3257  0.3202
medium sized datasets     0.4990  0.4496  0.4472      0.6360  0.6685  0.6665      0.4685  0.5011  0.5001      0.5699  0.5446  0.5433
big sized datasets        0.5579  0.6236  0.6216      0.6419  0.6941  0.6919      0.5737  0.6293  0.6288      0.5705  0.6811  0.6806

Table 22: relation between the performance and the size of the dataset for ANB using a subjectivity lexicon.
However, looking at the individual datasets that do increase in performance in comparison to ANB using FCE, it is noticeable, as shown in Table 22, that the bigger datasets (stanford, senpol, senscale) do show increases in performance. The medium sized datasets seem to be more dependent on the generality of the domain, as observed for the experiment conducted with ANB using FCE. Consistent with the previous experiment, the smaller sized datasets seem to show the most decreases in performance. As observed for the previous experiment, the use of a subjectivity lexicon can aid performance compared to ANB using FCE in case enough (general) data can be supplied.
Analysis
As an illustrative example, a representative subsection of the domain-adaptations for the kitchen & housewares classifier is presented. As a representative subsection, again, five domain-adaptations are chosen to be displayed in Table 23. The results shown are the accuracies and the micro F1 and macro F1 scores for the different classifications by the domain-adapted classifier. The experiment is set up using the same parameters as before.
Kitchen & housewares ANB (SL)
Feature-set                             dvd            sports &       health &       apparel        electronics    average
                                                       outdoor        personal care
1 gram           accuracy               0.4910         0.5335         0.5180         0.5425         0.5415         0.5253
                 f1 (micro/macro)       0.2448/0.2442  0.4356/0.4338  0.4428/0.4420  0.5454/0.5430  0.5285/0.5272  0.4394/0.4380
2 gram           accuracy               0.5495         0.7115         0.7120         0.7065         0.6880         0.6735
                 f1 (micro/macro)       0.5019/0.5018  0.7495/0.7488  0.7538/0.7526  0.7575/0.7544  0.7313/0.7310  0.6988/0.6977
1 gram + POS     accuracy               0.4560         0.4735         0.4870         0.4710         0.4680         0.4711
                 f1 (micro/macro)       0.2823/0.2819  0.4408/0.4400  0.5034/0.5025  0.5657/0.5651  0.4400/0.4386  0.4464/0.4456
1 gram + 2 gram  accuracy               0.4895         0.6130         0.6345         0.6775         0.6090         0.6047
                 f1 (micro/macro)       0.1442/0.1429  0.5343/0.5320  0.5743/0.5720  0.6481/0.6463  0.5660/0.5656  0.4933/0.4917

Table 23: a subsection of the domain-adapted classifiers evaluated on the kitchen & housewares dataset
The chosen subset in Table 23 is a decent representation of the general trend observed in the results presented in Chapter 5. When comparing these results to ANB using FCE, there is a small improvement for all feature-sets except the combined 1 gram + POS-pattern feature-set. The 1 gram feature-set shows an increase in accuracy of 0.0055, the 2 gram feature-set of 0.0021, and the combined 1 gram + 2 gram feature-set improves the most with 0.0330. The 1 gram + POS-pattern feature-set decreases in accuracy from 0.4844 to 0.4711. In comparison to the baseline naive cross-domain approach, all performance measures are still considerably lower. As for FCE, the 2 gram feature-set has the smallest difference in performance (0.0477). The 1 gram + POS-pattern feature-set differs the most, with a difference of 0.2474. In terms of micro and macro F1 scores, the 1 gram and the combined 1 gram + 2 gram feature-sets show more considerable increases compared to FCE. The 2 gram and combined 1 gram + POS-pattern feature-sets show differences similar to those observed for the accuracies. The increase in performance for the 1 gram, 2 gram and combined 1 gram + 2 gram feature-sets shows that a small performance increase can be acquired by using a subjectivity lexicon instead of selecting generalizable features by FCE. The decrease in performance for the combined 1 gram + POS-pattern feature-set can be explained by the fact that POS-pattern features are often very domain-specific features, which are not present in the subjectivity lexicon.
The results for the adapted Naive Bayes classifier using a subjectivity lexicon vary from domain to domain. As observed in the previous experiments, the sports & outdoor and health & personal care datasets show on average to be the most suitable for domain-adaptations, especially for the 2 gram feature-set, where performance comes near naive cross-domain performance. For none of the individual domain-adaptations does ANB using a subjectivity lexicon seem to perform better than the naive cross-domain approach. From this experiment it can be concluded that using a subjectivity lexicon instead of FCE does seem to increase classification accuracies a little, but not as much as expected. This can be explained by the design of the adapted Naive Bayes classifier, which is specifically built to use both source and target-domain data; using generalizable features that do not originate from the data can therefore limit the potential of this method. As the generalizable features originate from a subjectivity lexicon, there is no guarantee that there will be a proper match between the used features and the domain-data. Another limitation can be the used subjectivity lexicon itself: as discussed in section 3.1.2 (Chapter 3), the used subjectivity lexicon is quite large and can therefore contain features which are not necessarily domain-independent enough in this context to serve as generalizable features.
Influence of Lambda
As for the previous experiment, it makes sense to explore the behavior of the lambda parameter with regard to ANB using a subjectivity lexicon. As the features differ from the previous experiment, it can be expected that the performance curves for this classifier are a little different as well, which will have influence on the optimal balance between source and target-domain data. To explore this optimal balance, the same datasets, parameter settings and feature-sets are used as in the previous experiment. In Figure 3 the different graphs are shown for the adapted Naive Bayes classifier using a subjectivity lexicon.
[Figure: four panels, “Performance of 1 gram ANB (SL) vs Lambda”, “Performance of 2 gram ANB (SL) vs Lambda”, “Performance of 1 gram + POS ANB (SL) vs Lambda” and “Performance of 1 gram + 2 gram ANB (SL) vs Lambda”, each plotting accuracy, micro F1 and macro F1 against lambda values from 0.05 to 1.0.]
Figure 3: Performance of the ANB classifier using a subjectivity lexicon with different feature-sets versus different values of the lambda
parameter.
Considering the graphs presented in Figure 3, it is noticeable that the same curves are observed for ANB using a subjectivity lexicon as for ANB using FCE. For the 1 gram feature-set the main difference is that the decrease in micro and macro F1 scores is less steep compared to ANB using FCE. The curve for the 2 gram feature-set is almost identical to the previous experiment, consistent with the results observed for the 2 gram feature-set in Table 12. The curve for the combined 1 gram + POS-pattern feature-set shows on average lower performance measures in comparison to ANB using FCE, but shows the same shape. The combined 1 gram + 2 gram feature-set shows on average higher performance and shows little difference between the accuracy line and the micro and macro F1 lines, indicating more stability.
When using a subjectivity lexicon instead of automatically generated generalizable features, all graphs show a decrease in accuracy as lambda increases towards 1.0, where the emphasis lies entirely on the (unlabeled) target-domain data. It seems that using a mixture of data does aid performance, but contrary to Tan et al. (2009) the emphasis lies on the old-domain data. The analysis of the POS-pattern feature-set shows that the varying results can be explained by the fact that there is no proper match between the POS-patterns observed in the data (which are sometimes quite few, especially when the training-data is small) and the generalizable features used. The generalizable features in this experiment come from the subjectivity lexicon, which holds very general terms. These terms have little relation with the often very domain-specific POS-pattern features. Using EM to perform a domain-adaptation between those two can therefore be expected to show little improvement. Comparing the graphs in Figure 3 to the graphs in Figure 1 (influence of lambda on ANB using FCE), the curves look very similar apart from the described differences in performance. This would imply that the setting used for lambda does not make much of a difference between these two methods. As observed for FCE, a possible explanation can be the quality of the generalizable features. Where in the experiment considering ANB with FCE the quality of the generalizable features prevented a successful domain-adaptation, the same conclusion can be drawn for the subjectivity lexicon: the subjectivity lexicon can be concluded to be an insufficient representation of generalizable features for an arbitrary target-domain.
6.3.3 Adapted Naive Bayes using improved generalizable features
The final experiment conducted in this research uses the adapted Naive Bayes classifier with improved generalizable features. The selection of the generalizable features for this experiment includes an importance measure which has been introduced in Chapter 3. The results for this classifier are analyzed in this section. First, the results are presented for all the considered feature-sets, after which a more detailed analysis is provided for a subsection of the used datasets, together with details about the influence of the lambda parameter and the number of generalizable features used.
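As an illustration of how such a combined selection could look, the sketch below ranks candidate features by an FCE-style score (frequent in both domains, similar relative frequency) weighted by a per-word importance value. This is only a sketch under assumptions: the exact FCE and importance formulas are the ones defined in Chapter 3, and the importance argument and the constant alpha here are hypothetical stand-ins.

import math
from collections import Counter

def select_generalizable_features(src_docs, tgt_docs, importance, k=1000, alpha=1e-4):
    """Rank candidate generalizable features by an FCE-style score combined with an
    importance weight. `importance` maps word -> weight and stands in for the
    measure defined in Chapter 3; `alpha` avoids division by zero.
    """
    src_counts, tgt_counts = Counter(), Counter()
    for doc in src_docs:
        src_counts.update(doc)
    for doc in tgt_docs:
        tgt_counts.update(doc)
    n_src = sum(src_counts.values())
    n_tgt = sum(tgt_counts.values())

    scores = {}
    for w in set(src_counts) & set(tgt_counts):      # must occur in both domains
        p_src = src_counts[w] / n_src
        p_tgt = tgt_counts[w] / n_tgt
        # high when the word is frequent in both domains and used consistently
        fce = math.log(p_src * p_tgt / (abs(p_src - p_tgt) + alpha))
        scores[w] = fce * importance.get(w, 1.0)     # illustrative combination with the importance weight
    return sorted(scores, key=scores.get, reverse=True)[:k]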
Comparing the results observed for the 1 gram feature-set, ANB using improved generalizable features shows a slight, insignificant increase in performance compared to ANB using FCE. Micro and macro F1 are almost identical. As the differences between micro and macro F1 are very small, the same observations hold as in the previous experiment.
The increase in performance for the 2 gram feature-set is larger, at 0.0036, showing a slight improvement when using improved generalizable features. As observed for ANB using a subjectivity lexicon, the combined 1 gram + POS-pattern feature-set does not increase the performance measures with improved generalizable features. The large overall differences for this feature-set can be explained by the lack of representative POS-patterns in the data. As discussed before, sometimes very few POS-patterns are present in the data, making them unsuitable in those cases to perform domain-adaptations.
Performance of the combined 1 gram and 2 gram feature-set shows a small difference compared to ANB using FCE (0.0064). In general it can be concluded that for this experiment there is no real improvement from using improved generalizable features. Furthermore, the high standard deviation indicates that the stability has decreased and that the quality of the domain-adaptation is more dependent on the supplied data.
Dataset                  1 gram                   2 gram                   1 gram + POS             1 gram + 2 gram
                         ac     micro  macro      ac     micro  macro      ac     micro  macro      ac     micro  macro
small sized datasets     0.5024 0.4458 0.4436     0.5579 0.5636 0.5616     0.5026 0.5245 0.5196     0.4986 0.4488 0.4436
medium sized datasets    0.5022 0.4429 0.4408     0.6278 0.6670 0.6648     0.4829 0.5208 0.5197     0.5599 0.5281 0.5267
big sized datasets       0.5528 0.6061 0.6041     0.6346 0.6886 0.6864     0.5822 0.6164 0.6158     0.5678 0.6701 0.6696

Table 24: Relation between the performance and the size of the dataset for ANB using improved generalizable features.
However, in comparison to ANB using a subjectivity lexicon all feature-sets show slight increases in performance. As observed for ANB using a subjectivity lexicon, the bigger datasets (stanford, senpol, senscale) do show increases in performance, with the best results observed over all the methods, as can be seen in Table 24. The medium sized datasets seem to be more dependent on the generality of the domain, as observed in the earlier experiments. Consistent with the previous experiment, the smaller sized datasets cause most of the decreases in performance. As in the previous experiment, the use of improved generalizable features can aid performance when enough (general) data is supplied.
Analysis
As an illustrative example, a representative subsection of the domain-adaptations for the kitchen & housewares classifier is presented. Again, five domain-adaptations are chosen as a representative subsection and displayed in Table 25. The results shown are the accuracies and the micro F1 and macro F1 scores for the different classifications by the domain-adapted classifier. The experiment is set up using the same parameters as before.
Kitchen & housewares ANB (IMP)

Feature-set                      dvd             electronics     sports &        health &        apparel         average
                                                                 outdoor         personal care
1 gram            accuracy       0.4975          0.5420          0.5335          0.5000          0.5240          0.5194
                  f1             0.1083/0.1075   0.3567/0.3550   0.3321/0.3290   0.3582/0.3585   0.4138/0.4129   0.3138/0.3126
2 gram            accuracy       0.5480          0.7070          0.7110          0.7060          0.6835          0.6711
                  f1             0.4967/0.4963   0.7437/0.7427   0.7513/0.7503   0.7566/0.7536   0.7254/0.7250   0.6947/0.6936
1 gram + POS      accuracy       0.4615          0.4795          0.4825          0.4650          0.4820          0.4741
                  f1             0.2628/0.2624   0.4290/0.4284   0.4717/0.4708   0.5307/0.5298   0.4264/0.4257   0.4241/0.4234
1 gram + 2 gram   accuracy       0.4820          0.5820          0.5835          0.6360          0.5665          0.5700
                  f1             0.0700/0.0691   0.4121/0.4087   0.4063/0.4054   0.5387/0.5377   0.4185/0.4163   0.3691/0.3674

Table 25: A subsection of the domain-adapted classifiers evaluated on the kitchen & housewares dataset (f1 given as micro/macro).
The chosen subset in Table 25 is a decent representation of the general trend observed in the results presented in Chapter 5. When comparing these results to ANB using FCE, the results for all the feature-sets are almost identical. The performances for the 1 gram feature-set, 2 gram feature-set and combined 1 gram and 2 gram feature-set are less than 0.002 lower, and the same differences are observed for the micro and macro F1 scores. The performance of the 1 gram + POS-pattern feature-set is a little over 0.01 lower in comparison to FCE. In comparison to ANB using a subjectivity lexicon, which performed a little better than FCE, the results are consequently also lower; the decrease in performance for the combined 1 gram + POS-pattern feature-set is, however, a little smaller at 0.0103.
The results for ANB using improved generalizable features vary from domain to domain. As observed in the FCE experiments, the sports & outdoor and health & personal care datasets show on average to be the most suitable for domain-adaptations, especially for the 2 gram feature-set where performance comes near the naive cross-domain performance. For the individual domain-adaptations ANB using improved generalizable features does not seem to perform better than the naive cross-domain approach or than ANB using FCE. Just as for the previous two experiments it has to be concluded that the attempt to perform a successful domain-adaptation has failed, and just as before the cause can be found in the used generalizable features, which will be further explored in the next two sections.
Generalizable features
To illustrate the generalizable features generated by the improved method, Table 26 presents the top 50 most common generalizable 1 gram features for the different domain-adaptations for the kitchen & housewares dataset.
very          several       wrung         wicks         warping
out           recommend     would't       white         want
good          nice          worked        whereby       walmar
even          make          work          welded        waiste
years         it's          won't         weighty       w
well          great         wires         weighs        voided
use           control       windstar      week          versatility
up            back          wider         wasted        usually
time          zippers       widening      waste         using
think         zip           widened       washings      used

Table 26: Top 50 most common generalizable 1 gram features used for the kitchen & housewares dataset.
In Table 26 the calculated improved generalizable features are shown for the domain-adaptations for the kitchen & housewares dataset. Comparing these features to the features observed in Table 20 (FCE), the number of sentimental terms has increased for the improved method. Among the 1 gram features, sentimental domain-independent terms like “good”, “well”, “recommend”, “nice”, “great”, “wrung”, “wasted”, “waste”, “voided” and “versatility” are present, which is more than among the features observed in Table 20. Although the terms reported in Table 26 seem to be an improvement over the results reported in Table 20, plenty of common non-sentimental terms are still present. Examples of non-sentimental domain-independent terms in Table 26 are “years”, “use”, “up” and “time”.
To gain more insight into the relation between this performance and the generalizable features used for the 2 gram feature-set, the top 50 generalizable features used for the kitchen & housewares domain-adaptations are presented in Table 27.
want to           don't waste       yes it            without any       very small
use to            bought this       years it          without a         very pleased
use it            all over          wrong with        will never        very carefully
to use            a very            wouldnt turn      which point       used the
my wife           a lot             would recommend   well as           used it
looking for       a huge            worth the         way to            used for
it went           a good            worked great      was told          use while
i think           a bit             worked for        was looking       use this
i bought          you won't         won't be          very well         use in
easy to           you pay           without hot       very very         use a

Table 27: Top 50 most common generalizable 2 gram features used for the kitchen & housewares dataset.
Looking at the reported 2 gram features in Table 27 the same observation can be made. Sentimental domain-independent examples are: “easy to”, “don't waste”, “a lot”, “a huge”, “a good”, “wrong with”, “would recommend”, “worked great”, “will never”, “very well”, “very small” and “very pleased”. Observed non-sentimental domain-independent terms are (among others): “use to”, “use it” and “i think”. Compared to Table 21, more sentimental terms seem to be present in this set as well. Although the quality of the selected generalizable features does seem to be an improvement over the features selected by FCE, the results are still not comparable to the results presented in Table 1. The previous experiments have shown that the quality of the used generalizable features is essential for a successful domain-adaptation.
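The kind of manual inspection performed above for Tables 26 and 27 can be approximated automatically by checking which selected generalizable features contain a term from the subjectivity lexicon. The helper below is a hypothetical addition, not part of the experimental setup, intended only to make the distinction between sentimental and non-sentimental generalizable features concrete.

def lexicon_coverage(generalizable_features, subjectivity_lexicon):
    """Fraction of generalizable features that contain at least one term from the
    subjectivity lexicon (hypothetical helper for inspecting feature quality).
    Works for 1 gram features as well as n-gram features given as space-separated strings.
    """
    if not generalizable_features:
        return 0.0, []
    lexicon = set(subjectivity_lexicon)
    sentimental = [f for f in generalizable_features
                   if any(token in lexicon for token in f.split())]
    return len(sentimental) / len(generalizable_features), sentimental

# e.g. lexicon_coverage(["worked great", "use to"], ["great"]) -> (0.5, ["worked great"])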
Influence of Lambda
As for the previous two experiments, the influence of the lambda parameter on ANB using improved generalizable features is explored. Since the features differ from the previous two experiments, the performance curves for this classifier can be expected to differ as well. The same datasets, parameter settings and feature-sets are used as in the previous experiment. Figure 4 shows the different graphs for the adapted Naive Bayes classifier using improved generalizable features.
[Figure 4 consists of four line graphs, one per feature-set (1 gram, 2 gram, 1 gram + POS-pattern, 1 gram + 2 gram), plotting accuracy, micro F1 and macro F1 of the ANB (IMP) classifier against lambda values ranging from 0.05 to 1.]
Figure 4: Performance of the ANB classifier using improved generalizable features with different feature-sets versus different values of the lambda parameter.
When comparing the results of the lambda experiments for the improved generalizable features to ANB using FCE, there seems to be very little difference in the curves of the feature-sets. The main difference lies in the small differences in the performance measures, most noticeable for the 1 gram + POS-pattern feature-set. As observed in the previous two experiments with ANB, a possible explanation for the unsuccessful domain-adaptation is the quality of the generalizable features. Where the quality of the generalizable features prohibited a successful domain-adaptation in the experiment with ANB using FCE, the same conclusion has to be drawn for the improved generalizable features. Even with more sentimental features among the used generalizable features, it has to be concluded that the generalizable features are still of insufficient quality to provide a successful domain-adaptation to the different target-domains.
Influence of number of features
As concluded for ANB using FCE, the generalizable features play an important role in the performance of the different classifiers. To explore the influence of the generalizable features and the number of generalizable features used, the experiment varying the number of features is repeated for ANB using improved generalizable features. The number of generalizable features is varied between 50 and 1000. To measure the influence of the number of generalizable features consistently, the lambda parameter is kept at 0.6 and EM is performed for 1 iteration. Because different features are used, it is expected that the curves for this experiment differ from the results for ANB using FCE and that there will be a different optimal number of generalizable features. Performance is measured in terms of accuracy, micro F1 score and macro F1 score. In Figure 5 the performance of ANB using improved generalizable features is plotted over an increasing number of generalizable features; different feature-sets are evaluated: 1 gram, 2 gram, combined 1 gram + POS-pattern and combined 1 gram + 2 gram.
[Figure 5 consists of four line graphs, one per feature-set (1 gram, 2 gram, 1 gram + POS-pattern, 1 gram + 2 gram), plotting accuracy, micro F1 and macro F1 of the ANB (IMP) classifier against the number of generalizable features used (50 to 1200).]
Figure 5: Performance of the ANB classifier using improved generalizable features with different feature-sets versus the number of generalizable features used.
Comparing the results for the 1 gram feature-set to ANB using FCE, the accuracy curve is similarly stable, but the main difference between the curves lies in the micro and macro F1 scores, which for the improved features show a less steep increase in performance from 200 to 800 used features. Both graphs converge around 1200 features to the same final result. The graph for the 2 gram feature-set is practically identical to the one observed for FCE, except for the expected slightly higher performance measures for the improved features. The main difference for the 1 gram + POS-pattern feature-set is the FCE graph, which increases steadily as the number of used features increases; for the improved features the performance measures slightly increase from 600 to 800 features but remain stable for the rest of the plotted graph. For the combined 1 gram + 2 gram feature-set the same difference is observed as for the 1 gram feature-set: the increase in micro F1 and macro F1 in the interval between 200 and 800 features is steeper for the FCE features.
In Tan et al. (2009), where this experiment is performed on Chinese datasets, the performance for different numbers of used generalizable features varies from 0.6 to 0.85 in terms of micro and macro F1 scores. In this experiment the observed micro and macro F1 scores do not exceed 0.70. The curve reported in Tan et al. (2009) increases as the number of generalizable features increases, up to a maximum of 0.8, after which a decrease is shown. In this experiment the curves for the different feature-sets look different. For the 1 gram and combined 1 gram + POS-pattern feature-sets the number of generalizable features used does not seem to make much difference for the performance (in terms of accuracy). In terms of micro F1 and macro F1 the performance seems to increase for the 1 gram and combined 1 gram and 2 gram feature-sets up until 1000 features, after which performance stabilizes (in contrast to the results reported by Tan et al. (2009), where adding more features after reaching the top decreases performance).
As concluded for ANB using FCE, the generalizable features used in this experiment still contain a considerable number of non-sentimental features which skew the domain-adaptation. Using more features does seem to increase performance for most of the feature-sets, but the convergence to a maximum performance shown by all feature-sets implies that with more features, more non-sentimental features are used as well.
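As a rough outline of how this experiment is run, the sketch below varies the number of generalizable features while keeping lambda at 0.6 and a single EM iteration, and records accuracy, micro F1 and macro F1. The train_anb callable and the data arguments are placeholders for the adapted Naive Bayes training routine and the prepared datasets used in this thesis, not an existing API.

from sklearn.metrics import accuracy_score, f1_score

def vary_feature_count(ranked_features, train_anb, src_data, tgt_data,
                       tgt_test_docs, tgt_test_labels,
                       counts=(50, 100, 200, 400, 600, 800, 1000)):
    """Evaluate ANB (IMP) for an increasing number of generalizable features.

    `ranked_features` is the importance-ranked list of candidate generalizable
    features; `train_anb` stands in for the adapted Naive Bayes training routine
    (lambda fixed at 0.6, one EM iteration).
    """
    results = {}
    for k in counts:
        generalizable = ranked_features[:k]          # top-k generalizable features
        model = train_anb(src_data, tgt_data, generalizable, lam=0.6, em_iterations=1)
        predictions = model.predict(tgt_test_docs)
        results[k] = {
            "accuracy": accuracy_score(tgt_test_labels, predictions),
            "micro F1": f1_score(tgt_test_labels, predictions, average="micro"),
            "macro F1": f1_score(tgt_test_labels, predictions, average="macro"),
        }
    return results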
Differences with FCE
As shown in the previous sections, the differences between using FCE to select generalizable features and the suggested improvements are quite small. In this section the differences between the two methods are explored in more detail. In the next table the differences in probabilities between the two methods are presented for a couple of domain-adaptations. The differences shown are for the features which differ most in probability between ANB using FCE and ANB using improved generalizable features, for domain-transfers of the kitchen & housewares dataset. As an example, the probability differences are presented between the kitchen & housewares dataset and both a domain suitable for domain-adaptation and a domain less suitable for domain-adaptation. As the experiments in the previous sections have shown, the dvd dataset is a domain less suitable for domain-adaptation, as the results for all the different methods have shown performances below average. The sports & outdoors dataset has shown performances above average, so this domain has been chosen as a representative example of a domain suitable for domain-adaptation. In Table 28 the differences in probabilities for both 1 gram and 2 gram features are shown.
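Such a per-feature comparison can be sketched as follows. The class-conditional probabilities of the two adapted models are represented here as plain dictionaries (a simplifying assumption), and the features with the largest absolute difference are reported, which is the kind of output shown in Table 28 below.

def largest_probability_differences(probs_fce, probs_imp, top_n=10):
    """Report the features whose probability differs most between the model
    adapted with FCE features and the model adapted with improved features.

    probs_fce / probs_imp: dicts mapping feature -> probability assigned by the
    respective adapted model.
    """
    shared = set(probs_fce) & set(probs_imp)
    diffs = {w: abs(probs_fce[w] - probs_imp[w]) for w in shared}
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)[:top_n]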
Differences between FCE and IMP

dvd                          sports & outdoors
feature        difference    feature        difference
selle          0.0011        nut            0.0038
figuring       0.0011        delicate       0.0030
whateve        0.0005        containing     0.0023
wafer          0.0005        endless        0.0015
soundly        0.0005        drastically    0.0015
quality it     0.0013        pre drilled    0.0016
mine to        0.0013        could buy      0.0016
would put      0.0005        company is     0.0016
was fine       0.0005        working on     0.0008
now just       0.0005        then to        0.0008

Table 28: Differences in probabilities between FCE and IMP for domain-adaptations from the kitchen & housewares dataset.
As can be seen in Table 28 there are differences between the resulting probabilities after performing a domain-adaptation with FCE or with improved generalizable features. However, the differences are not very big. These small differences in probability may explain the very small differences in results between the two methods discussed in the previous section. As classification is performed on a sentence level, it is likely that a small difference for a single term does not make much of a difference in the classification of the complete sentence, resulting in little difference in the classification of an entire text. From Table 28 it can also be inferred that the detected differences between the resulting features make more sense for the domain more suitable for domain-adaptation. For the sports & outdoors dataset more sentimental generalizable features are reported than for the dvd dataset: the sports & outdoors set contains features like “nut”, “delicate”, “drastically” and “could buy”, while in the column for the dvd dataset the number of sentimental generalizable features is lower, with “soundly” and “quality it”.
In Table 29 the probabilities assigned to the earlier observed domain-specific features “surprise” and “easy” are reported. From these results the differences between the methods can be explored and an explanation can be found for the unsuccessful application of the different domain-adaptations.
domain-adaptation               probability “surprise”    domain-adaptation      probability “easy”
apparel2camera_&_photo   neg    0.0001343                 apparel2music   neg    0.0000850
                         pos    0.0000647                                 pos    0.0000462
camera_&_photo2apparel   neg    0.0000423                 music2apparel   neg    0.0000537
                         pos    0.0000418                                 pos    0.0000485

Table 29: Probabilities assigned to two features by domain-adaptation using ANB with improved generalizable features.
Table 29 shows how the domain-adaptation works in terms of assigning probabilities to domain-dependent features. When adapting from the apparel domain to the camera & photo domain, the probability of the feature “surprise” in a negative context is larger than the probability in a positive context. When adapting the other way around, from the camera & photo domain to the apparel domain, the negative context again receives a higher probability than the positive context, although the difference is much smaller. For the less obvious example “easy”, domain-adaptations between the apparel and the music dataset are shown. In this case it can be seen that although the assigned probabilities differ, the highest probability in both directions is assigned to the negative context.
As discussed for the apparel and camera & photo datasets, the domain-adaptation assigns different probabilities to the feature, but the differences in probability are very small. This can explain the relatively low classification results, as the small differences between the probabilities probably do not dictate the final decision when classification is performed on sentence level.
Chapter 7
Conclusions and future work
The three experiments in this thesis have given insight into the cross-domain behavior of the Naive Bayes classifier. The first experiment has shown that it is possible to set up a general approach in which the 38 different datasets are used consistently. The 38 datasets are first preprocessed, then used to train and evaluate the Naive Bayes classifier, resulting in performance consistent with previous research. For the different evaluated feature-sets, accuracy scores above 0.74 and micro F1 and macro F1 scores above 0.78 are reported. As an additional classifier the subjectivity lexicon classifier was evaluated, showing, as expected, lower performance than the Naive Bayes classifier and results consistent with findings in previous research. From these experiments it can be concluded that the general setup used in this research is valid and can be used to set up cross-domain classification experiments as well as to serve as a performance baseline.
To evaluate the performance of the same general setup in a cross-domain setting, a naive cross-domain experiment was set up evaluating the cross-domain performance following the leave-one-out approach (LOO). The accuracy results for this experiment ranged from 0.6178 to 0.6477, which is a considerable decrease in performance compared to the in-domain performance of the same classifier with the same feature-sets. The micro and macro F1 scores ranged from 0.6382 to 0.6916. As discussed, as far as comparison to previous research was possible, the results were consistent. From this experiment it can be concluded that the general setup used in this research is valid in a naive cross-domain setting and that the setup can be used to explore more elaborate cross-domain approaches. Further it can be concluded that, in order to perform cross-domain sentiment analysis, a more profound method is necessary to prevent this considerable decrease in performance. The suggested improvements in this research attempt to perform domain-adaptations between source and target domain in order to properly fit a probabilistic model capable of highly accurate cross-domain sentiment classification.
In the third set of experiments the research questions stated in this thesis are answered. As a first experiment the approach suggested by Tan et al. (2009) was attempted on the 38 English datasets used in this research. The approach uses an adapted form of the Naive Bayes classifier, employing EM to optimize a probabilistic model with generalizable features as a bridge between source and target domain. The performance observed in this experiment ranged widely in terms of accuracy, from 0.5060 to 0.6063 for the different evaluated feature-sets. The micro and macro F1 scores ranged from 0.4502 to 0.6413. From these experiments it can be concluded that the adapted Naive Bayes classifier using FCE does not help to improve classification results overall. ANB using FCE shows lower performance measures, especially compared to the accuracies reported by Tan et al. (2009) on Chinese datasets. As the different experiments have shown, the generalizable features leave room for improvement, as the generalizable features calculated by FCE contain a lot of common non-sentimental terms. In contrast to the full experiments, individual datasets can show an increase in performance compared to naive cross-domain classification. The datasets that do show improvement are the bigger datasets, having more data available, and the imbalanced datasets. The majority of the datasets used in these experiments are of a medium size (2000 examples), which show to be very dependent on the domain they originate from; the bigger datasets show to be less dependent on the domain. Future research can focus on using bigger datasets in order to explore whether the trend observed in this research holds in an experiment with larger sized datasets.
The first suggested improvement for selecting generalizable features is the use of a subjectivity lexicon instead of generalizable features acquired from the data. The subjectivity lexicon has been used in the first experiment as a classifier, showing stable performance from in-domain to cross-domain and stable performance in terms of micro F1 and macro F1 scores. As a first improvement of the generalizable features, the lexicon is used directly as generalizable features. The results reported for this experiment range from 0.4817 to 0.6111 in terms of accuracy; the micro and macro F1 scores range from 0.4531 to 0.6312. From the experiments conducted with the subjectivity lexicon as generalizable features it can be concluded that the performance increases very slightly in comparison to ANB using FCE for all the feature-sets except the combined 1 gram + POS-pattern feature-set. The explanation found lies in the used POS-patterns, which in general are very domain-specific; this domain specificity is not present in the subjectivity lexicon. As for ANB using FCE, it can be concluded that using a subjectivity lexicon as generalizable features does not help increase classification performance in comparison to the naive cross-domain baseline. Also consistent with ANB using FCE, the datasets that do show improvement are the bigger datasets, having more data available; the bigger datasets show to be less dependent on the domain. Future research can focus on using bigger datasets in order to explore whether the trend observed in this research holds in experiments with larger datasets. Another possible improvement is the use of a smaller subjectivity lexicon. As described, the subjectivity lexicon is quite big, which could have as a consequence that domain-dependent, non-sentimental features are present in the lexicon as well. Narrowing down the subjectivity lexicon could lead to better results.
In the final set of experiments an improvement of the selection of generalizable features is evaluated, using an importance measure in addition to FCE to select generalizable features. The features selected in this experiment do seem to show improvements in comparison to solely using FCE and to using a subjectivity lexicon as generalizable features. In comparison to ANB using a subjectivity lexicon all feature-sets increase in performance; in comparison to ANB using FCE all feature-sets increase with the exception of the POS-pattern feature-set. Looking at the generated generalizable features, fewer non-sentimental terms are observed and more features containing sentiment are reported. The results reported for this experiment range from 0.4959 to 0.6099 in terms of accuracy; the micro and macro F1 scores range from 0.4544 to 0.6415. As the increases with respect to the two other ANB methods are very small, the conclusion has to be drawn that using improved generalizable features does not help increase classification performance in comparison to the naive cross-domain baseline. However, the improved generalizable features do show a slight increase over the ones selected by FCE. This leads to the conclusion that performance increases can be achieved by optimizing the generalizable features used to initiate the Naive Bayes model. Future research can focus on acquiring better generalizable features to facilitate the domain-transfer. Consistent with ANB using FCE, the datasets that do show improvement are the bigger datasets, having more data available; the bigger datasets show to be less dependent on the domain, leading to a better representation of generalizable features. Future research can also focus on using bigger datasets in order to explore whether the trend observed in this research holds in an experiment with larger datasets.
References
Airoldi, E.M., Bai, X., and Padman, R. (2006). Markov blankets and meta-heuristic search: Sentiment
extraction from unstructured text. Lecture Notes in Computer Science, 3932:167–187.
Aue, A. and Gamon, M. (2005). Customizing Sentiment Classifiers to New Domains: a Case Study.
Submitted to RANLP-05, the International Conference on Recent Advances in Natural Language
Processing, Borovets, BG, 2005
Balasubramanian, R., Cohen, W., Pierce, D. and Redlawsk, D.P. (2011). What pushes their buttons?
predicting polarity from the content of political blog posts. In Proceedings of the Workshop on
Language in Social Media (LSM 2011), pages 12–19, Portland, Oregon, 23 June 2011.
Benamara, F., Cesarano, C., Picariello, A., Reforgiato, D., and Subrahmanian, V. (2007). Sentiment
analysis: Adjectives and adverbs are better than adjectives alone. In Proceedings of the International
Conference in Weblogs and Social Media (ICWSM).
Bethard, S., Yu, H., Thornton, A., Hatzivassiloglou, V., and Jurafsky, D. (2004). Automatic Extraction
of Opinion Propositions and their Holders. In AAAI Spring Symposium on Exploring Attitude and
Affect in Text: Theories and Applications.
Blitzer, J., Dredze, M., Pereira, F. (2007). Biographies, Bollywood, Boom-boxes and Blenders: Domain
Adaptation for Sentiment Classification. In Association of Computational Linguistics (ACL), 2007
Bollegala D., Weir, D. and Carroll, J. (2011). Using Multiple Sources to Construct a Sentiment
Sensitive Thesaurus for Cross-Domain Sentiment Classification. In Proceedings of the 49th Annual
Meeting of the Association for Computational Linguistics, pages 132–141, Portland, Oregon, June 19-24, 2011.
Brill, E. (1994). Some advances in transformation-based part of speech tagging. In Proceedings of the
Twelfth National Conference on Artificial Intelligence, pp. 722-727, Menlo Park, CA: AAAI Press.
Choi, Y., Cardie, C., Riloff, E., and Patwardhan, S. (2005). Identifying Sources of Opinions with
Conditional Random Fields and Extraction Patterns. In Proceedings of HLT/EMNLP-05.
Church, K.W. and Hanks, P. (1989). Word association norms, mutual information and lexicography.
Proceedings of the 27th Annual Conference of the Association of Computational Linguists, pages 76–
83.
Chklovski, T. (2006). Deriving Quantitative Overviews of Free Text Assessments on the Web. In
Proceedings of 2006 International Conference on Intelligent User Interfaces (IUI06), Sydney, Australia.
Das, S. and Chen, M. (2001). Yahoo! For Amazon: Extracting market sentiment from stock
message boards. In Proceedings of the 8th Asia Pacific Finance Association Annual Conference (APFA
2001).
Dave, K., Lawrence, S., and Pennock, D.M. (2003). Mining the Peanut Gallery: Opinion Extraction and
Semantic Classification of Product Reviews. In Proceedings of the International World Wide Web
Conference, Budapest, Hungary.
Denecke, K. (2009). Are SentiWordNet scores suited for multi-domain sentiment classification? By
Fourth International Conference on Digital Information Management (2009). Publisher: IEEE, Pages:
1-6.
Engström, C. (2004). Topic dependence in sentiment classification. Unpublished M.Sc. thesis,
University of Cambridge.
Esuli, A. and Sebastiani, F. (2005). Determining the semantic orientation of terms through gloss
classification. In Proceedings of CIKM-05, 14th ACM International Conference on Information and
Knowledge Management, Bremen, DE, pp. 617-624.
Gamon, M. (2004). Sentiment classification on customer feedback data: Noisy data, large feature
vectors, and the role of linguistic analysis. In Proceedings of the International Conference on
Computational Linguistics (COLING).
Ghosh, S. and Koch, M. Unsupervised Sentiment Classification Across Domains using adjective co-occurrence for polarity analysis. Unpublished.
Go, A., Bhayani, R. and Huang, L. (2009). Twitter Sentiment Classification using Distant Supervision.
In Entropy (2009), Issue: June, Publisher: Association for Computational Linguistics, Pages: 30-38
He, Y., Lin, C., Alani, H. (2011). Automatically Extracting Polarity-Bearing Topics for Cross-Domain
Sentiment Classification. In Proceedings of ACL, 2011, 123-131.
Hatzivassiloglou, V. and McKeown, K. (1997). Predicting the Semantic Orientation of Adjectives. In
Proceedings of 35th Annual Meeting of the Assoc. for Computational Linguistics (ACL-97): 174-181
Hatzivassiloglou, V. and Wiebe, J. (2000). Effects of adjective orientation and gradability on sentence
subjectivity. In Proceedings of the International Conference on Computational Linguistics (COLING).
Hearst, M.A. (1992). Direction-based text interpretation as an information access refinement. In Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval,
Mahwah, NJ: Lawrence Erlbaum Associates.
Hu, M. and Liu, B. (2004). Mining and summarizing customer reviews. In Proceedings of the ACM
SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD-2004), Seattle,
Washington, USA.
Hu, M. and Liu, B. (2005). Mining and summarizing customer reviews. In Proceedings of the
conference on Human Language Technology and Empirical Methods in Natural Language Processing.
Jones, K. S. (1972). A statistical interpretation of term specificity and its application in retrieval. In Journal of Documentation, 28:11–21.
Kim, S. and Hovy, E. (2004). Determining the sentiment of opinions. In Proceedings of the 20th
International Conference on Computational Linguistics.
Kim, S. and Hovy, E. (2006). Identifying and Analyzing Judgment Opinions. In Proceedings of
HLT/NAACL-2006, New York City, NY.
Landauer, T. K. and Dumais, S. T. (1997). A solution to plato’s problem: The latent semantic analysis
theory of the acquisition, induction, and representation of knowledge. In Psychology Review.
Liu, B. (2006). Web Data Mining. Chapter Opinion Mining, Springer.
Liu, K. and Zhao, J. (2009). Cross-domain sentiment classification using a two-stage method. In
Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM), Hong
Kong, China, pages 1717–1720, 2009
Matsumoto, S., Takamura, H., and Okumara, M. (2005). Sentiment classification using word sub-sequences and dependency sub-trees. In Proceedings of PAKDD'05, the 9th Pacific-Asia Conference
on Advances in Knowledge Discovery and Data Mining.
McCallum, A. and Nigam, K. A Comparison of Event Models for Naive Bayes Text Classification. In
AAAI/ICML Workshop on learning for Text Categorization (1998).
Melville, P., Gryc, W., and Lawrence, R. D. (2009). Sentiment analysis of blogs by combining lexical
knowledge with text classification. In Proceedings of the Conference on Knowledge Discovery and
Data Mining 2009.
Nigam, K., Lafferty, J., and McCallum, A. (1999). Using maximum entropy for text classification. In
Proceedings of the IJCAI-99 Workshop on Machine Learning for Information Filtering, pages 61–67.
Pang, B., Lee, L., and Vaithyanathan, S. (2002). Thumbs up? Sentiment Classification using Machine
Learning Techniques. In Proceedings of EMNLP 2002.
Pang, B. and Lee, L. (2004), A Sentimental Education: Sentiment Analysis Using Subjectivity
Summarization Based on Minimum Cuts. In Proceedings of ACL 2004.
Pang, B. and Lee, L. (2005). Seeing stars: Exploiting class relationships for sentiment categorization
with respect to rating scales. In Proceedings of the ACL pages 115-124, 2005
Pang, B. and Lee, L. (2008). Opinion mining and sentiment analysis. In Foundation and Trends in
Information Retrieval, 2(1-2):1–135.
Popescu, A., and Etzioni, O. (2005). Extracting Product Features and Opinions from Reviews. In
Proceedings of HLT-EMNLP 2005.
Prinz, J. (2004). Gut Reactions: A Perceptual Theory of Emotion. Oxford University Press.
Read, J. (2005). Using emoticons to reduce dependency in machine learning techniques for sentiment
classification. In Proceeding ACL student '05 Proceedings of the ACL Student Research Workshop
Riloff, E., Wiebe, J. and Wilson, T. (2003). Learning Subjective Nouns Using Extraction Pattern
Bootstrapping. In Proceedings of Seventh Conference on Natural Language Learning (CoNLL-03).
ACL SIGNLL. Pages 25-32.
Riloff, E., Wiebe, J., and Phillips, W. (2005). Exploiting subjectivity classification to improve
information extraction. In 20th National Conference on Artificial Intelligence (AAAI-2005).
Rothfels, J and Tibshirani, J. (2010) Unsupervised sentiment classification of English movie reviews
using automatic selection of positive and negative sentiment items.
Snyder, B. and Barzilay, R. (2007). Multiple Aspect Ranking using the Good Grief Algorithm. In
NAACL 2007
Spertus, E. (1997). Smokey: Automatic recognition of hostile messages. In Proceedings of the
Conference on Innovative Applications of Artificial Intelligence, pp. 1058-1065,. Menlo Park, CA:
AAAI Press.
Tan, S., Wu, G., Tang, H., and Cheng, X. (2007). A novel scheme for domain-transfer problem in the
context of sentiment analysis. In Proceedings of the sixteenth ACM conference on Conference on
information and knowledge management, CIKM ’07, New York, NY, USA.
Tan, S., Cheng, X., Wang, Y. and Xu, H. (2009). Adapting Naive Bayes to Domain Adaptation for
Sentiment Analysis. In ECIR 2009: 337-349
Tatemura, J. (2000). Virtual reviewers for collaborative exploration of movie reviews. In Proc. of the
5th International Conference on Intelligent User Interfaces, pages 272–275.
Terveen, L., Hill, W., Amento, B., McDonald, D, and Creter, J. (1997). PHOAKS: A system for sharing
recommendations. In Communications of the ACM, 40(3): 59–62.
Thomas, M., Pang, B., and Lee, L. (2006). Get out the vote: Determining support or opposition from
congressional floor-debate transcripts. In Empirical Methods in Natural Language Processing
(EMNLP).
Turney, P.D. (2002). Thumbs up or thumbs down? semantic orientation applied to unsupervised
classification of reviews. In Proceedings of ACL-02, Philadelphia, Pennsylvania, 417-424, 2002.
Turney, P. D. and Littman, M. L. (2003). Measuring praise and criticism: Inference of semantic
orientation from association. In ACM Transactions on Information Systems (TOIS), 21(4):315–346.
Whitehead, M. and Yaeger, L. (2009). Building a General Purpose Cross-Domain Sentiment Mining
Model. In Proceedings of CSIE (4)'2009. pp.472-476.
Wiebe, J., Bruce, R., and O’Hara, T. (1999). Development and use of a gold standard data set for
subjectivity classifications. In Proceedings of the 37th Annual Meeting of the Association for
Computational Linguists (ACL-99), pages 246–253.
Wiebe, J. M., Wilson, T., Bruce, R., Bell, M., and Martin, M. (2004). Learning subjective language. In
Computational Linguistics, 30:277–308.
Wilson, T, Wiebe, J., and Hwa, R. (2004). Just how mad are you? Finding strong and weak opinion
clauses. In Proceedings of 19th National Conference on Artificial Intelligence (AAAI-2004).
Wilson, T., Wiebe, J. and Hoffmann, P. (2005). Recognizing Contextual Polarity in Phrase-Level
Sentiment Analysis. In Proceedings of HLT/EMNLP 2005, Vancouver, Canada
Wilson, T., Wiebe, J., and Hoffmann, P. (2009). Recognizing Contextual Polarity: an exploration of
features for phrase-level sentiment analysis. In Computational Linguistics 35(3).
Wu, Q. , Tan, S. and Cheng, X. (2009a). Graph ranking for sentiment transfer. In ACL-IJCNLP, pages
317–32
Wu, Q., Tan, S., Zhai, H., Zhang, G., Duan, M., Cheng, X. (2009b). SentiRank: Cross-Domain Graph
Ranking for Sentiment Classification, In Proceedings of the 2009 IEEE/WIC/ACM International Joint
Conference on Web Intelligence and Intelligent Agent Technology, p.309-314, September 15-18, 2009
Wu, Q., Tan, S., Duan, M., and Cheng, X. (2010). A Two-Stage Algorithm for Domain Adaptation with
Application to Sentiment Transfer Problems. In AAIRS 2010: 443-453
Yang, Y., Liu, X. (1996). A Re-examination of Text Categorization Methods. In Proceedings
of SIGIR-99, 22nd ACM International Conference on Research and Development in Information
Retrieval, Berkeley, US (1996)
Yessenalina, A., Choi, Y. and Cardie, C. (2010). Automatically generating annotator rationales to
improve sentiment classification.
Yu, H. and Hatzivassiloglou, V. (2003). Towards answering opinion questions: Separating facts from
opinions and identifying the polarity of opinion sentences. In Proceedings of the Conference on
Empirical Methods in Natural Language Processing (EMNLP).