Kaggle Movie Review Sentiment Prediction

Sentiment Analysis on Movie Reviews
Kaggle Competition
The dataset comes from the Rotten Tomatoes site. Each short review is parsed and broken into many
phrases using the Stanford parser. Each phrase is given a label from 0 to 4
(0: very negative, 1: negative, 2: neutral, 3: positive, 4: very positive).
PhraseId  SentenceId  Phrase  Sentiment
64  2  This quiet , introspective and entertaining independent is worth seeking .  4
65  2  This quiet , introspective and entertaining independent  3
66  2  This  2
67  2  quiet , introspective and entertaining independent  4
68  2  quiet , introspective and entertaining  3
69  2  quiet  2
70  2  , introspective and entertaining  3
71  2  introspective and entertaining  3
72  2  introspective and  3
73  2  introspective  2
74  2  and  2
75  2  entertaining  4
76  2  independent  2
77  2  is worth seeking .  3
78  2  is worth seeking  4
79  2  is worth  2
80  2  worth  2
81  2  seeking  2
The test data set looks similar, but with no sentiment value given. The goal is to assign the correct label to each review phrase.
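A minimal sketch of loading the competition data with pandas; the train.tsv / test.tsv file names are the usual Kaggle download names and are assumed here:

    import pandas as pd

    # Files are tab-separated; names assumed from the standard Kaggle download.
    train = pd.read_csv("train.tsv", sep="\t")   # columns: PhraseId, SentenceId, Phrase, Sentiment
    test = pd.read_csv("test.tsv", sep="\t")     # columns: PhraseId, SentenceId, Phrase

    X_train, y_train = train["Phrase"], train["Sentiment"]
    X_test = test["Phrase"]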
Challenges
• Semantic word spaces
• Sentence compositionality
• Positive and negative words
• Order of words
• Effect of contrastive conjunctions
• Negation words
• Not enough training data
Movie Review Sentiment Analysis – Our Way
Pipeline: Preprocessor → Vectorizer Bank → Classifier Bank → Majority Voting
Preprocessor
• Kept only alphabets and the sentiment markers +/-
• Tokenized, stemmed and lemmatized using NLTK modules
The Python module NLTK (Natural Language Toolkit) comes with several packages that aid text processing (a preprocessing sketch follows this list):
– Stop words – eliminate low-information features from long phrases
– Movie reviews corpus used as an additional training set – it comes with two levels ('neg', 'pos'), mapped to labels 1,3 or 0,4 for this project
– Synonym sets from the NLTK WordNet package
– NLTK part-of-speech (POS) tagging, used to ignore nouns or keep only adjectives
• The same or worse performance (accuracy) was seen when the training data set was split into training and development/test sets.
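A minimal preprocessing sketch along these lines, using NLTK's tokenizer, stop-word list, Porter stemmer and WordNet lemmatizer; the function and regular expression are illustrative, not the original code:

    import re
    import nltk
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    # Requires: nltk.download("punkt"), nltk.download("stopwords"), nltk.download("wordnet")
    stop_words = set(stopwords.words("english"))
    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    def preprocess(phrase):
        # Keep only letters and the +/- sentiment markers
        cleaned = re.sub(r"[^a-zA-Z+\- ]", " ", phrase.lower())
        tokens = nltk.word_tokenize(cleaned)
        # Drop low-information stop words
        tokens = [t for t in tokens if t not in stop_words]
        # Stem, then lemmatize each remaining token
        return " ".join(lemmatizer.lemmatize(stemmer.stem(t)) for t in tokens)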
Vectorizers
Vectorizer | Associated Classifier | Comment
Tfidf | Logistic Regression, Linear SVM | With ngram(1,2)
Count | MNB | With ngram(1,1)
Word2vec (Google-trained or MovieReview-trained) | Advanced – SVM, Random Forest | Issue with sentences
New – Sentiment Vectorizer | Logistic Regression, Linear SVM, MNB | Used with FeatureUnion with Tfidf and Count
The Sentiment Vectorizer searched each phrase's words for presence in positive/negative/contrast word dictionaries.
Features were binary positive, negative and contrast indicators plus a magnitude.
Contrast words: but, however, although, …
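A sketch of how such a sentiment vectorizer can be combined with Tfidf and Count features through scikit-learn's FeatureUnion; the SentimentVectorizer class, its word dictionaries and the classifier pairing below are illustrative placeholders, not the original implementation:

    import numpy as np
    from scipy.sparse import csr_matrix
    from sklearn.base import BaseEstimator, TransformerMixin
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import FeatureUnion, Pipeline

    # Placeholder dictionaries; the real word lists were larger.
    POSITIVE = {"good", "great", "entertaining", "worth"}
    NEGATIVE = {"bad", "boring", "awful", "dull"}
    CONTRAST = {"but", "however", "although"}

    class SentimentVectorizer(BaseEstimator, TransformerMixin):
        """Binary positive/negative/contrast indicators plus a magnitude count."""

        def fit(self, X, y=None):
            return self

        def transform(self, X):
            rows = []
            for phrase in X:
                words = set(phrase.lower().split())
                pos = int(bool(words & POSITIVE))
                neg = int(bool(words & NEGATIVE))
                con = int(bool(words & CONTRAST))
                magnitude = len(words & (POSITIVE | NEGATIVE))
                rows.append([pos, neg, con, magnitude])
            return csr_matrix(np.array(rows, dtype=float))

    features = FeatureUnion([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("count", CountVectorizer(ngram_range=(1, 1))),
        ("sentiment", SentimentVectorizer()),
    ])

    model = Pipeline([("features", features), ("clf", LogisticRegression(max_iter=1000))])
    # model.fit(X_train, y_train); model.predict(X_test)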
Classifiers
• With over 77,000 features and 140,000 examples, advanced classifiers that need a dense matrix run out of memory.
• Logistic Regression, Linear SVM and MNB were the only choices.
• Confusion matrices show that labels 1 and 3 are confused with label 2; class weights were used to counter this (see the sketch below).
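A sketch of the three sparse-friendly classifiers with class weighting to counter the dominance of the neutral label; the weight values here are illustrative assumptions, not the tuned ones:

    from sklearn.linear_model import LogisticRegression
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.svm import LinearSVC

    # Illustrative weights: up-weight the rarer labels so 1 and 3 are not pulled into 2.
    weights = {0: 3.0, 1: 1.5, 2: 1.0, 3: 1.5, 4: 3.0}

    classifiers = {
        "logreg": LogisticRegression(class_weight=weights, max_iter=1000),
        "linsvm": LinearSVC(class_weight=weights),
        "mnb": MultinomialNB(),   # no class_weight parameter; class_prior can be adjusted instead
    }

    # for clf in classifiers.values():
    #     clf.fit(X_train_features, y_train)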
Voting Method
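Assuming hard majority voting over the classifier bank, as in the pipeline above, a minimal voting sketch could look like this (the classifiers dictionary continues from the sketch above):

    import numpy as np

    def majority_vote(predictions):
        """predictions: list of 1-D label arrays, one per fitted classifier."""
        stacked = np.vstack(predictions)            # shape (n_classifiers, n_samples)
        # Most frequent label per column; ties resolve to the lower label.
        return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, stacked)

    # final = majority_vote([clf.predict(X_test_features) for clf in classifiers.values()])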
Methods Attempted
• Latent Semantic Analysis with TruncatedSVD, n = 3000 (see the sketch after this list)
• One-vs-One classification with SVC
• Word2Vec
– The aggregation method for a sentence is an issue
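A sketch of the LSA and One-vs-One attempts, pairing TruncatedSVD (n_components = 3000, as on the slide) with an SVC wrapped in OneVsOneClassifier; the Tfidf front end and the kernel choice are assumptions:

    from sklearn.decomposition import TruncatedSVD
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.multiclass import OneVsOneClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    lsa_model = Pipeline([
        ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
        ("svd", TruncatedSVD(n_components=3000)),      # LSA: project onto 3000 latent dimensions
        ("ovo_svc", OneVsOneClassifier(SVC(kernel="linear"))),
    ])

    # lsa_model.fit(X_train, y_train); lsa_model.predict(X_test)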
Movie Review Sentiment Analysis – Our Way
• Kaggle Standing: 146 of 634
• The traditional methods have a big drawback with respect to sentiment analysis.
• Here, learning and sentiment prediction work by looking at words in isolation.
• The order of words is ignored or lost, and thus important information is lost.
• 3-gram and higher n-gram models add too much noise.
Next Step: Sentiment Treebank and Recursive Neural Tensor Network