Signals from Text: Sentiment, Intent, Emotion, Deception
Stephen Pulman
TheySay Ltd, www.theysay.io, and Dept. of Computer Science, Oxford University
[email protected]
March 9, 2017

Text Analytics Overview
- Sentiment analysis proper: classify text as positive, negative, or neutral.
- Future intent, risk, speculation...
- Demographic profile: age, gender, politics, religion...
- Deception: can we tell if someone is lying?
- Emotion: joy, sadness, fear, anger...
- Some example application areas.

What tools do we use?
- Linguistically based pattern matching (information extraction), e.g. <PERSON> resign/be-sacked/fired/moved-from <COMPANY-ROLE>
- Machine learning methods: train a classifier using annotated data: Naive Bayes, Support Vector Machines, Averaged Perceptron, Neural Networks, etc.
- Currently fashionable: Deep Learning methods such as Convolutional Neural Networks and Long Short-Term Memory models.
- The choice usually depends on the availability of annotated data, which is expensive and time-consuming to acquire.

What is sentiment analysis?
Sentiment proper: positive, negative, or neutral attitudes expressed in text:
- "Suffice to say, Skyfall is one of the best Bonds in the 50-year history of moviedom's most successful franchise."
- "Skyfall abounds with bum notes and unfortunate compromises."
- "There is a breach of MI6. 007 has to catch the rogue agent."
Sometimes, factual statements imply sentiment:
- "Online giant Amazon's shares have closed 9.8% higher."

Building a sentiment analysis system: cheap and cheerful
- Version 1: collect lists of positive and negative words; classify a text by its proportion of positive to negative words.
- Version 2 (what most commercial systems do): get a training corpus of texts human-annotated for sentiment (e.g. pos/neg/neut); represent each text as a vector of counts of words or successive pairs of words; train your favourite classifier on these vectors.
Problems:
- What if the number of positive words equals the number of negative words?
- Bag-of-words means structure is ignored: "Airbus: orders slump but profits rise" wrongly comes out the same as "Airbus: orders rise but profits slump".
- Compositional effects will be missed: clever, too clever, not too clever; bacteria, kill, kill bacteria, fail to kill bacteria, never fail to kill bacteria.

Version 3: use linguistic analysis
- Do as full a parse as possible on input texts.
- Use the syntax to do 'compositional' sentiment analysis:

  [Sentence [Noun Our product]
            [VerbPhrase [Adverb never]
                        [VerbPhrase [Verb fails]
                                    [VerbPhrase [Verb to kill] [Noun bacteria]]]]]

Sentiment logic rules
- kill + negative → positive (kill bacteria)
- kill + positive → negative (kill kittens)
- kill + neutral → neutral (kill time)
- too + anything → negative (too clever, too red, too cheap)
- etc.
In our system (www.theysay.io) we have 75,000+ rules...
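The flavour of such rules can be seen in the minimal sketch below. This is an illustrative toy, not TheySay's actual rule engine: the lexicon entries, the (head, argument) tuple encoding of the parse, and the handful of rules are all assumptions made for the example.

```python
# Toy compositional sentiment: polarities ("pos"/"neg"/"neut") propagate
# bottom-up through a parse, with rules keyed on the governing word.
# Lexicon and rules are illustrative, not TheySay's real 75,000+ rule set.

LEXICON = {"bacteria": "neg", "kittens": "pos", "time": "neut"}

FLIP = {"pos": "neg", "neg": "pos", "neut": "neut"}

def polarity(word):
    """Prior polarity of a word; default neutral."""
    return LEXICON.get(word, "neut")

def analyse(tree):
    """tree is a word or a (head, argument) pair,
    e.g. ("never", ("fail", ("kill", "bacteria")))."""
    if isinstance(tree, str):
        return polarity(tree)
    head, arg = tree
    arg_pol = analyse(arg)
    if head == "kill":                    # kill + neg -> pos, kill + pos -> neg
        return FLIP[arg_pol]
    if head in ("not", "never", "fail"):  # polarity reversers
        return FLIP[arg_pol]
    if head == "too":                     # too + anything -> negative
        return "neg"
    return arg_pol                        # default: pass polarity upwards

# "never fails to kill bacteria": neg -> pos -> neg -> pos
print(analyse(("never", ("fail", ("kill", "bacteria")))))  # pos
print(analyse(("too", "clever")))                          # neg
print(analyse(("kill", "kittens")))                        # neg
```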
Problems:
- We still need extra work for context-dependence ('cold', 'wicked', 'sick'...).
- Rules can't deal with reader perspective: "Oil prices are down" is good for me, but not for investors.
- They can't deal with sarcasm or irony: "Oh, great, they want it to run on Windows."

Linguistic characteristics of deceptive vs. truthful text
Several studies have found that we can tell liars from truthtellers:
- Liars tend to use more emotion words, fewer first-person pronouns ("I", "we"), more negation words, and more motion verbs ("lead", "go"). (Why?) It is possible that liars exaggerate certainty and positive/negative aspects, and do not associate themselves closely with the content.
- Truthtellers use more self-references, exclusive words ("except", "but"), tentative words ("maybe", "perhaps"), and time-related words. Truthtellers are more cautious, accept alternatives, and do associate themselves with the content.
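To make the feature side concrete, here is a rough sketch of the kind of surface counts such studies use. Everything in it is assumed for illustration: the word lists are tiny stand-ins for the much larger dictionaries (e.g. LIWC categories) the published studies rely on, and the feature names are invented for the example.

```python
# Sketch of deception-cue features as per-1000-token rates.
# Word lists are tiny illustrative stand-ins for real dictionaries.

FEATURE_LEXICONS = {
    "emotion_words": {"great", "terrible", "fantastic", "awful"},
    "first_person":  {"i", "we", "me", "us", "my", "our"},
    "negations":     {"no", "not", "never", "none"},
    "motion_verbs":  {"go", "lead", "move", "run"},
    "exclusives":    {"but", "except", "without"},
    "tentatives":    {"maybe", "perhaps", "possibly"},
}

def deception_features(text):
    """Per-1000-token rate of each cue class in the text."""
    tokens = text.lower().split()
    n = max(len(tokens), 1)
    return {name: 1000 * sum(t in lex for t in tokens) / n
            for name, lex in FEATURE_LEXICONS.items()}

# On the findings above, high emotion/negation/motion rates combined with
# low first-person/tentative rates would push a classifier towards
# 'deceptive'; the reverse pattern towards 'truthful'.
print(deception_features(
    "It was a fantastic quarter and we never saw any problems"))
```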
Financial applications: lying CEOs
Larcker and Zakolyukina (2012) looked at the language used by CEOs and CFOs talking to analysts in conference calls about earnings announcements. Looking at subsequent events:
- discovery of 'accounting irregularities'
- restatements of earnings
- changes of accountants
- exit of the CEO and/or CFO
you can identify retrospectively who was telling the truth and who was not. This gives us a corpus of transcripts which can be labelled as 'true' or 'deceptive': training data.

Detecting deception: avoid losing money
- Training a classifier on features like those just described, Larcker and Zakolyukina were able to get up to 66% prediction accuracy.
- Building a portfolio of deceptive companies will lose you 4 to 11% per annum...

Verisimilitude
- We (TheySay) have built a general 'verisimilitude' classifier which looks out for linguistic indicators of deception...
- ...and it also measures clarity and readability (vs. obfuscation, hedging, etc.).
- We've tried this on speeches by many politicians, and guess what?!

Emotion detection
Many theories of emotion...
Ekman: anger, disgust, fear, happiness, sadness, surprise. There seems to be a correspondence with facial expressions, and with emoticons.
Universality of emotion classes?
- Whereas Japanese and US subjects agree on the classification of expressions of happiness, surprise, and sadness...
- ...agreement is significantly lower for anger, disgust, and fear.
Is this emotion taxonomy universal? Possibly not. In Ifaluk, spoken on a small group of islands in the Pacific, "...fago is felt when someone dies, is needy, is ill, or goes on a voyage..." So fago looks like sadness so far. But "...fago is also felt when in the presence of someone admirable or when given a gift."
Nevertheless, we believe the basic dimensions of emotion are expressed in all languages.

Emotion classification is difficult
Data collection: human annotation usually gives an upper limit on what is possible, and initial results on cross-annotator agreement were not encouraging. But we can bootstrap data collection using self-labelled blog posts (LiveJournal) or social media:
- "Best day in ages! #Happy :)"
- "Gets so #angry when tutors don't email back.. Do you job idiots!"
In Oxford (TheySay) we have used distant supervision and human-annotated data to build a multi-dimensional emotion classifier with acceptable accuracy...
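A minimal sketch of the distant-supervision step is below, assuming that authors' own hashtags and emoticons can stand in as noisy labels. The marker sets are illustrative assumptions, and the markers are stripped from the text so a classifier cannot simply memorise them.

```python
# Distant supervision sketch: treat authors' own hashtags/emoticons as
# noisy emotion labels. Marker sets here are illustrative assumptions.

EMOTION_MARKERS = {
    "happy": {"#happy", "#joy", ":)", ":-)"},
    "angry": {"#angry", "#furious"},
    "sad":   {"#sad", ":("},
}

def distant_label(post):
    """Return (cleaned_text, label) if exactly one emotion is marked,
    else None (unmarked or ambiguous posts are discarded)."""
    tokens = post.lower().split()
    found = {emo for emo, marks in EMOTION_MARKERS.items()
             if any(t in marks for t in tokens)}
    if len(found) != 1:
        return None
    label = found.pop()
    # Strip the label markers so a classifier can't just memorise them.
    cleaned = " ".join(t for t in tokens if t not in EMOTION_MARKERS[label])
    return cleaned, label

print(distant_label("Best day in ages! #Happy :)"))
# -> ('best day in ages!', 'happy')
```

The resulting (text, label) pairs can then be fed to any of the classifiers mentioned earlier, with the human-annotated data reserved for evaluation.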
Text analytics and politics
Predicting election results: can we eliminate opinion polls? A very mixed picture, to put it mildly:
- Tumasjan et al. 2010 claimed that share of volume on Twitter corresponded to the distribution of votes between the six main parties in the 2009 German federal election. (Volume rather than sentiment is also a better predictor of movie success.)
- Jungherr et al. 2011 pointed out that not all parties running had been included, and that different results are obtained with different time windows. Tumasjan et al. replied, toning down their original claims.
- Tjong et al. found that Tweet volume was NOT a good predictor for the Dutch 2011 Senate elections, and that sentiment analysis was better.
- Skoric et al. 2012 found some correlation between Twitter volume and the votes in the 2011 Singapore general election, but not enough to make accurate predictions.
- Bermingham and Smeaton 2011 found that share of Tweet volume and proportion of positive Tweets correlated best with the outcome of the Irish general election of 2011... but that the mean average error was greater than that of traditional opinion polls!
So this doesn't look very promising, although of course sources other than Twitter might give better results; but they are difficult to get at quickly.

What do the voters think?
We can detect which topics arouse strong emotions among the voters:

Candidate Jones:
              fear   anger   happiness
economy         8      2        1
immigration     6      9        1
education       2      1        5
health          1      1        6

Candidate Smith:
              fear   anger   happiness
economy         2      1        7
immigration     7      9        1
education       2      1        5
health          3      1        3

[Charts: Scottish Independence Referendum: positive sentiment; levels of fear.]
[Chart: Economy: Daily Mirror debate, 17th May (John McDonnell, Nigel Farage, and Peter Mandelson).]
[Chart: Immigration: murder of Remain MP Jo Cox, 16th June.]
[Chart: Sentiment: all topics.]
[Charts: Brexit today.]

Summary: Text Analytics today
- We can detect several different types of signal in text.
- Performance is nowhere near 100% accurate, but it is improving all the time, and being able to process large volumes of text increases the signal-to-noise ratio.
- We know that sentiment has predictive value.
- Emotion classification gives a finer-grained analysis of opinion, and more insight and explanation than traditional sentiment analysis.
- Current research is also finding signals in text that predict intent, risk, deception, etc.
- Text is an underused resource: it is (mostly) free, and though it is unstructured, text analytics can derive structured information from it very quickly.
- Combining rich text signals (sentiment, deception, risk, etc.) with other time-series data will provide analysts with important insights into underlying trends.