Sentiment Analysis on Movie Reviews
Kaggle Competition

The dataset is from the Rotten Tomatoes site. Each short review is parsed and broken into many phrases using the Stanford parser. Each phrase is given a label from 0 to 4 (0: very negative, 1: negative, 2: neutral, 3: positive, 4: very positive).

PhraseId | SentenceId | Phrase | Sentiment
64 | 2 | This quiet , introspective and entertaining independent is worth seeking . | 4
65 | 2 | This quiet , introspective and entertaining independent | 3
66 | 2 | This | 2
67 | 2 | quiet , introspective and entertaining independent | 4
68 | 2 | quiet , introspective and entertaining | 3
69 | 2 | quiet | 2
70 | 2 | , introspective and entertaining | 3
71 | 2 | introspective and entertaining | 3
72 | 2 | introspective and | 3
73 | 2 | introspective | 2
74 | 2 | and | 2
75 | 2 | entertaining | 4
76 | 2 | independent | 2
77 | 2 | is worth seeking . | 3
78 | 2 | is worth seeking | 4
79 | 2 | is worth | 2
80 | 2 | worth | 2
81 | 2 | seeking | 2

The test data set looks similar, with no sentiment value given. The goal is to assign the correct label to each of the review phrases.

Challenges
• Semantic word spaces
• Sentence compositionality
• Positive and negative words
• Order of words
• Effect of contrastive conjunctions
• Negation words
• Not enough training data

Movie Review Sentiment Analysis – Our Way
Preprocessor → Vectorizer Bank → Classifier Bank → Majority Voting

Preprocessor
• Kept only alphabetic characters and the sentiment markers +/-
• Tokenized, Stemmed, Lemmatized

NLTK modules
The Python module NLTK (Natural Language Toolkit) comes with several packages that aid text processing.
– Stop Words: eliminate low-information features from long phrases
– Used the movie reviews corpus as an additional training set: comes with 2 levels ('neg', 'pos'; mapped to levels 1,3 or 0,4 for this project)
– NLTK synonym set from the NLTK WordNet package
– NLTK parts of speech tagging (POS) + ignore nouns or choose only adjectives
• Same or worse performance (accuracy) was seen when the training data set was split into training and development-test sets.
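The preprocessing steps above (keeping only letters and the +/- markers, tokenizing, optionally dropping stop words) might look roughly like the sketch below. It is not the code used in the project: the function name `preprocess` and the tiny `STOP_WORDS` set are hypothetical, the +/- markers are assumed to be literal characters in the text, and stemming and lemmatization (done with NLTK in the project) are left out so the sketch runs with the standard library only.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration;
# the project used the stop words that ship with NLTK.
STOP_WORDS = {"a", "an", "the", "is", "of", "and", "this", "to"}


def preprocess(phrase, remove_stop_words=False):
    """Lowercase a phrase, keep only letters and +/- markers, and tokenize."""
    # Replace every character that is not a letter, '+', '-' or whitespace
    # with a space, so punctuation and digits disappear.
    cleaned = re.sub(r"[^a-z+\-\s]", " ", phrase.lower())
    tokens = cleaned.split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens


sentence = "This quiet , introspective and entertaining independent is worth seeking ."
print(preprocess(sentence))
# ['this', 'quiet', 'introspective', 'and', 'entertaining',
#  'independent', 'is', 'worth', 'seeking']
print(preprocess(sentence, remove_stop_words=True))
# ['quiet', 'introspective', 'entertaining', 'independent', 'worth', 'seeking']
```

Stop-word removal is optional here because many phrases in the data set are a single word such as "This" or "and", and removing stop words from them would leave nothing to classify.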
Vectorizers

Vectorizer | Associated Classifier | Comment
Tfidf | Logistic Regression, Linear SVM | With ngram(1,2)
Count | MNB | With ngram(1,1)
Word2vec (google-trained or MovieReview-Trained) | Advanced – SVM, Random Forest | Issue with Sentences
New – Sentiment Vectorizer | Logistic Regression, Linear SVM, MNB | Used with FeatureUnion with Tfidf and Count

The Sentiment Vectorizer searched each sentence's words for presence in positive/negative/contrast word dictionaries.
Features were: binary (positive, negative, contrast) and magnitude.
Contrast words: But, However, Although…

Classifiers
• With over 77000 features and 140000 examples, advanced classifiers that need a dense matrix will run out of memory
• Logistic Regression, Linear SVM and MNB were the only choices
• Confusion matrices show that labels 1 and 3 are confused for label 2. Class weights were used

Voting Method

Methods Attempted
• Latent Semantic Analysis with TruncatedSVD, n=3000
• One-vs-One Classification with SVC
• Word2Vec – Aggregation method for a sentence is an issue

Movie Review Sentiment Analysis – Our Way
• Kaggle Standing: 146 of 634
• The traditional methods have a big drawback with respect to sentiment analysis.
• Here learning and sentiment prediction work by looking at words in isolation.
• The order of words is ignored or lost, and thus important information is lost.
• 3-gram and higher n-gram models add too much noise.

Next Step: Sentiment Treebank and Recursive Neural Tensor Network
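To make the Sentiment Vectorizer features and the majority-voting step described above more concrete, here is a minimal standard-library sketch. It is illustrative only: the mini-lexicons, the function names `sentiment_features` and `majority_vote`, the reading of "magnitude" as a count of sentiment-bearing words, and the tie-breaking rule are all assumptions, since the slides do not specify them. In the project itself these features were combined with Tfidf and Count features through FeatureUnion.

```python
from collections import Counter

# Hypothetical mini-lexicons; the slides do not list the actual dictionaries.
POSITIVE = {"entertaining", "worth", "good", "great"}
NEGATIVE = {"bad", "boring", "dull", "awful"}
CONTRAST = {"but", "however", "although"}  # contrast words named on the slide


def sentiment_features(tokens):
    """Return [has_positive, has_negative, has_contrast, magnitude]."""
    n_pos = sum(t in POSITIVE for t in tokens)
    n_neg = sum(t in NEGATIVE for t in tokens)
    has_contrast = any(t in CONTRAST for t in tokens)
    # Assumption: "magnitude" is the number of sentiment-bearing words.
    return [int(n_pos > 0), int(n_neg > 0), int(has_contrast), n_pos + n_neg]


def majority_vote(predictions):
    """Return the label predicted by the most classifiers.

    Assumption: on a tie, the label from the earliest-listed classifier
    among the tied labels wins.
    """
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label


print(sentiment_features(["is", "worth", "seeking"]))      # [1, 0, 0, 1]
print(sentiment_features(["entertaining", "but", "dull"]))  # [1, 1, 1, 2]

# Labels (0 to 4) predicted for one phrase by three classifiers,
# for example Logistic Regression, Linear SVM and MNB.
print(majority_vote([3, 3, 2]))  # 3
```

With only three voters and five labels, three-way ties are possible, so whichever tie-breaking rule is chosen effectively decides which classifier is trusted most.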