Multilingual Sentiment Analysis: From Formal to Informal and Scarce Resource Languages
Siaw Ling Lo¹, Erik Cambria²*, Raymond Chiong¹, David Cornforth¹
¹ School of Design, Communication and Information Technology, The University of Newcastle, Callaghan, NSW 2308, Australia
² School of Computer Engineering, Nanyang Technological University, 639798 Singapore
* Corresponding author
E-mail: [email protected]
Phone: (+65) 67904328
Abstract
The ability to analyse online user-generated content related to sentiments (e.g., thoughts and opinions) on products or policies has become a de-facto skillset for many companies and organisations. Besides the challenge of understanding formal textual content, it is also necessary to take into consideration the informal and mixed linguistic nature of online social media languages, which are often coupled with localised slang as a way to express ‘true’ feelings. Due to the multilingual nature of social media data, analysis based on a single official language may carry the risk of not capturing the overall sentiment of online content. While efforts have been made to understand multilingual sentiment analysis based on a range of informal languages, no significant electronic resource has been built for these localised languages. This paper reviews the various current approaches and tools used for multilingual sentiment analysis, identifies challenges along this line of research, and provides several recommendations including a framework that is particularly applicable to dealing with scarce resource languages.
Keywords: multilingual analysis, sentiment analysis, scarce resource languages, social media
1. Introduction
Sentiment analysis has been a popular research area over the past few years. It is gaining even more attention with the prevalence of social media usage, where netizens freely and openly express their views and opinions about anything; be it a product, a policy or even a picture. Although these opinions are valuable for understanding the concerns and issues on the ground, it remains a challenge to fully decipher the message and context of online user-generated content. This is mainly due to a few key issues, such as sentence parsing, named entity recognition, anaphora resolution and concept disambiguation. It is essential to comprehend the subject and topic of any content before discerning the sentiment expressed (e.g., positive or negative). To make the matter more complicated, online sharing or social media content is known to be noisy and often mixed with linguistic variations. It is thus not surprising that sentiment analysis continues to be one of the main analytics research domains given its many challenges but also promises.
Sentiment analysis for a language is usually dependent on manually or semi-automatically constructed lexicons [1], [2], found in dictionaries or corpora [3]. The availability of these resources enables the creation of rule-based sentiment analysis or the construction of training data for classification tasks. Despite the fact that English remains the main language used in various research studies in this area (e.g., see [4], [5]), there are also efforts in creating subjectivity resources for other formal languages such as Japanese [6], Chinese [7] and German [2]. However, since creating lexical or corpus resources for a new language can be very time-consuming and resource intensive, most of the multilingual sentiment analyses on other languages [3], [8] have been relying on some available English knowledge base, such as SentiWordNet [9].
While increasing effort has been made in creating resources for other formal languages, there are not many resources available when it comes to languages that are not commonly used in official communication or formal news reporting due to their informal and evolving nature. These languages often evolve from a main national language, such as English, and are broadly used by a local community in daily conversation both in the physical and online world. With the popularity of social media and the freedom of expression it affords, languages with localised expressions or variants of formal languages are becoming widespread in the online environment. In addition, it is not uncommon to see a few languages being mixed to form a unique language in a multicultural society. One such example is Singlish, the colloquial Singaporean English that has incorporated elements of some Chinese dialects and the Malay language [10]. To fully understand the sentiments in languages of this sort, it is essential to analyse them alongside other formal languages. The aim of this paper is to review sentiment analysis research in a multilingual setting, by considering not just formal but also informal and scarce resource languages used on social media, especially variants of the English language. It is of interest to examine current approaches and tools used in multilingual sentiment analysis, so that challenges can be identified and recommendations can be provided.
By scarce resource languages, we refer to those with just a basic dictionary available and/or lacking developed text processing resources (such as a translation engine). The various English variants widely used on social media belong to this category. In this paper, we first assess a range of current multilingual sentiment analysis studies based on the resources used, in terms of whether a lexicon, a corpus, a translation machine or a translator is applied, before sentiment analysis research carried out on social media is reviewed. It is important to examine current approaches used in analysing social media data, given that the world is at present dominated by this kind of data. Most social media messages are written in an informal manner, with linguistic variations that require different considerations compared to analysing formal reviews or news corpora that would typically consist of a single official language. Social media data analysis can be treated as understanding another ‘new’ language with limited resources. Here, we handle sentiment analysis of social media data separately with respect to other scarce resource languages, as the majority of research studies on scarce resource languages have been focused on a single language.
In the next section, we describe current approaches used in multilingual sentiment analysis studies. We then cover other types of studies on multilingual sentiment analysis, and list resources available for different languages. This is followed by reviewing sentiment analysis research carried out on social media, before touching on research done on other scarce resource languages. After that, we put forth the challenges identified and recommendations to overcome these challenges. The recommendations include some proposed solutions and a hybrid framework for dealing with scarce resource languages. Finally, we conclude the paper.
2. Current approaches used for multilingual sentiment analysis
There are mainly two approaches in sentiment analysis – subjectivity and polarity detection. Subjectivity detection is about understanding if the content contains personal views and opinions as opposed to factual information. Often, these subjective expressions are due to culture or experience of a person or community and hence, can be very ‘localised’ and specific to a society. As a result, subjectivity is usually studied before detailed sentiment analysis is done, since it is essential to filter out factual content to have a better understanding of issues that are shared among netizens. Polarity detection, on the other hand, is about studying subjectivity with different polarities, intensities or rankings. Some polarity analysis studies regarded an opinion as either highly positive, positive, negative or highly negative [4], while others [11] worked on human emotion such as joy or anger.
Most subjectivity and polarity analysis studies have limited themselves to English, but with the increasing popularity of online social media worldwide, it is no longer sufficient to deal with only English language content. In fact, only 28.6% of the Internet users speak English¹. It is thus essential to explore or build resources and tools in languages other than English. Moreover, Asia now has the most Internet users (48.2%); followed by Europe (18%)². As a result, there is a growing need to work on languages such as Chinese and Japanese. Multilingual subjectivity and polarity analysis research has become more widespread, and languages that have been studied include Chinese [7], [12], [13], Japanese [14], German, Spanish, French, Italian [15], Swedish [16], Arabic [17] and Romanian [3].
This review paper will look at the various multilingual approaches taken in the areas of subjectivity and polarity analysis, and assess how these approaches can be applied to a scarce resource language. The general approaches for both subjectivity and polarity analyses on multilingual studies are lexicon, corpus or translator-based, although there are also approaches that move towards research based on concepts and sentics. Sentic computing [18] incorporates common-sense reasoning to specify affective information associated with real-world objects, actions, events and people. The various multilingual sentiment analysis approaches used can be found in Table 1, and their corresponding lexicon, corpus or dataset is listed in Table 2.
2.1 Subjectivity analysis
Mihalcea et al. [3] investigated both lexicon and corpus-based approaches for multilingual subjectivity analysis (subjectivity vs. objectivity). Their lexicon-based approach uses a lemmatised form of English terms from OpinionFinder [5], an English subjectivity analysis system, and translates them into Romanian terms using two bilingual dictionaries. They then built a rule-based subjectivity classifier using the lexicon. The subjectivity precision of the classifier was shown to be good, although its recall was low. Within the same study, corpus-based sentence level subjectivity analysis was conducted based on a parallel corpus consisting of 107 documents from the SemCor corpus [19]. A Naïve Bayes (NB) [20] classifier was used on the Romanian training dataset, where the annotations were projected from two OpinionFinder [5] classifiers. While the highest precision for subjective classification was obtained with the rule-based classifier using the generated lexicon, the overall best F-measure result of 67.85 was produced by the NB-based statistical machine learning approach.
¹ http://www.internetworldstats.com/stats7.htm
² http://www.internetworldstats.com/stats.htm
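The trade-off between precision and recall reported for the lexicon-based classifier of Mihalcea et al. [3] can be illustrated with a minimal sketch. The lexicon entries, the rule (a sentence is subjective if it contains at least two lexicon terms) and the example sentences are all assumptions made for illustration; they are not the rules or data of the original study.

```python
# Hypothetical translated subjectivity lexicon (English glosses for readability).
SUBJECTIVE_TERMS = {"love", "hate", "terrible", "wonderful", "fear"}


def is_subjective(sentence, lexicon=SUBJECTIVE_TERMS, min_hits=2):
    """Assumed rule: subjective if at least `min_hits` lexicon terms occur."""
    tokens = sentence.lower().split()
    hits = sum(1 for tok in tokens if tok.strip(".,!?") in lexicon)
    return hits >= min_hits


def precision_recall_f(gold, predicted):
    """Precision, recall and balanced F-measure for the 'subjective' class."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Toy evaluation data (hypothetical).
sentences = [
    "I love this wonderful film",
    "I hate it",
    "The train leaves at noon",
    "What a terrible, terrible day",
]
gold = [True, True, False, True]
predicted = [is_subjective(s) for s in sentences]
precision, recall, f_measure = precision_recall_f(gold, predicted)
```

On this toy data the strict rule never labels an objective sentence as subjective but misses the short sentence with a single lexicon hit, which mirrors the high-precision, low-recall pattern described for rule-based classification.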
Ahmad et al. [21] used a local grammar approach to extract sentiment-bearing phrases within a multilingual framework (English, Arabic, and Chinese). As their focus was on sentiment analysis of financial news streams, domain-specific keywords were selected by comparing the distribution of words in domain-specific documents to the distribution of words in a general language corpus. Words from domain-specific documents found to be asymmetric with the general corpus were assigned as keywords. These keywords, together with their local grammar patterns, were used to extract sentiment-bearing phrases. Their experimental results showed that the local grammar patterns in all three languages considered, i.e., English, Arabic and Chinese, can be used to extract sentiment-bearing phrases. This observation is important, as it demonstrates that domain-specific keywords can transcend different language typologies (Indo-European -> Sino-Asiatic -> Semitic). Their manual evaluation found that the accuracy of their approach is within the 60-75% range.
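The keyword selection step, in which word distributions in domain-specific documents are compared against a general language corpus, might be sketched as follows. The relative-frequency ratio, the smoothing constant and the threshold are assumptions chosen for illustration and are not necessarily the statistic used by Ahmad et al. [21]; the two token lists are toy data.

```python
from collections import Counter


def relative_freq(tokens):
    """Relative frequency of each word in a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def domain_keywords(domain_tokens, general_tokens, min_ratio=2.0, smoothing=1e-6):
    """Return words that are over-represented in the domain corpus.

    The score is the ratio of domain to general relative frequency
    (an assumed measure); `smoothing` avoids division by zero for
    words that are absent from the general corpus.
    """
    dom = relative_freq(domain_tokens)
    gen = relative_freq(general_tokens)
    scored = {w: f / (gen.get(w, 0.0) + smoothing) for w, f in dom.items()}
    selected = [w for w, ratio in scored.items() if ratio >= min_ratio]
    return sorted(selected, key=lambda w: -scored[w])


# Toy corpora (hypothetical).
domain = "shares rose as the market rallied and the shares closed higher".split()
general = "the cat sat on the mat and the dog ran as the sun rose".split()
keywords = domain_keywords(domain, general)
```

Function words such as "the" occur at similar rates in both corpora and are filtered out, while words that are frequent only in the financial text are retained as candidate keywords.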
It is worth noting that the approaches listed above do not determine the polarity of content but focus on construction and detection of words or phrases containing subjectivity notions. Although subjectivity analysis does not apply directly to sentiment analysis or opinion mining, it is often the first step towards improving sentiment classification results [4]. It has been shown that distinguishing subjective versus objective instances is often more challenging than the subsequent polarity classification [22], [23].
2.2 Polarity analysis
There are different granularities of polarity analysis. Basic analysis involves classifying the expressed opinion of given text (e.g., at the aspect, sentence or document level) as being positive, negative or neutral. More advanced analysis deals with classification at the emotion or affective level, where different emotion states such as “joy”, “anger” and so on are recognised. This review paper concentrates on the methods/approaches taken with regard to whether a lexicon, corpus or translation engine is used, and hence both the analyses of positive/negative expressions and various emotion states are considered.
In contrast to subjectivity analysis, polarity analysis is not limited to lexicon- or corpus-based approaches. While lexical resources are still used to detect the polarity in text, machine-learning approaches are more common in this type of analysis. In addition, machine translation engines or translators are often used in conjunction with various English knowledge bases. Concept-based resources such as SenticNet [11] are also used for multilingual sentiment analysis.
2.2.1 Lexicon and machine learning-based polarity analysis
One of the first studies on multilingual polarity analysis can be found in the work of Yao et al. [24], in which they proposed a method to determine sentiment orientation of Chinese words by using a bilingual lexicon. Their method uses the occurrence of English sentiment words from an interpreted Chinese word to predict the sentiment orientation of the Chinese word. This is achieved through the calculation of the sentiment vector from the English word sequence followed by classification based on the Support Vector Machine (SVM) [25] and C4.5 [26]. The best accuracy obtained in predicting the sentiment orientation of a Chinese word is above 90%, when support vectors that do not contain any polarity words are eliminated from the classification.
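The idea of deriving a sentiment vector from the English interpretation of a Chinese word, and of discarding vectors that contain no polarity words, can be sketched as below. The sentiment word lists, the romanised lexicon entries and the count-based vector are hypothetical and only illustrate the general idea; they are not the feature definition of Yao et al. [24], and the subsequent SVM or C4.5 classification step is omitted.

```python
# Hypothetical English sentiment word lists; each word is one vector dimension.
POSITIVE = ["good", "happy", "excellent"]
NEGATIVE = ["bad", "sad", "awful"]
SENTIMENT_WORDS = POSITIVE + NEGATIVE


def sentiment_vector(english_gloss):
    """Count occurrences of each sentiment word in an English gloss."""
    tokens = english_gloss.lower().replace(";", " ").replace(",", " ").split()
    return [tokens.count(word) for word in SENTIMENT_WORDS]


def build_vectors(bilingual_lexicon):
    """Map each source word to its vector, dropping all-zero vectors."""
    vectors = {}
    for source_word, gloss in bilingual_lexicon.items():
        vec = sentiment_vector(gloss)
        if any(vec):  # keep only vectors that match at least one polarity word
            vectors[source_word] = vec
    return vectors


# Hypothetical bilingual lexicon entries (romanised Chinese -> English gloss).
lexicon = {
    "gaoxing": "happy; glad; in a good mood",
    "zaogao": "awful, bad",
    "zhuozi": "table; desk",
}
vectors = build_vectors(lexicon)
```

The entry whose gloss contains no sentiment word yields an all-zero vector and is removed, which corresponds to the filtering step reported to improve classification accuracy.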
Kim and Hovy [2] utilised a lexical database, i.e., WordNet [27], and three sets of manually annotated positive, negative and neutral words to build a word sentiment classifier for detecting opinions in emails. Since their opinion-bearing words are in English and the target system is in German, a statistical word alignment technique, GIZA++ [28], is used on a parallel European Parliament corpus to acquire word pairs in German-English and English-German. These word pairs are then used to build a German opinion analysis system using the English opinion-bearing words without a translation system. The precision obtained is 72% for positive emails and the recall is 80% for negative emails, but the recall and precision values for positive and negative emails, respectively, are low.
In a different study, Rosell and Kann [16] constructed a Swedish general purpose polarity lexicon with a graph-based random walk approach. Using the People’s Dictionary of Synonyms [29], they extracted a large amount of polarity terms from a small set of seed words through mapping from a bilingual dictionary of English and Swedish languages. Their random walk approach takes into consideration the synonymity and path length in calculating the mean polarity value of words. Some examples of words with their polarity values have been presented.
Another lexical resource for sentiment analysis in English is SentiWordNet [30], used by Denecke [9] to detect the polarity of a document within a multilingual framework. The classification here is based on three classifiers: LingPipe Classifier³, SentiWordNet Classifier with classification rules, and SentiWordNet Classifier with machine learning. These classifiers were trained using the annotated movie reviews dataset from LingPipe but evaluated on two different testing datasets. The first dataset was generated from the multi-perspective question answering (MPQA) [31] corpus, with 250 positive and 250 negative sentences selected at random. The second dataset was based on German movie reviews selected from Amazon.de, with 100 positive and 100 negative reviews translated to English. Results from the study show that the machine-learning based SentiWordNet Classifier has achieved the best accuracy of 66% for German movie reviews, while the other two classifiers have similar accuracies of around 52% for English and 58% for German documents. In addition, the results suggest that the accuracy of the different methods does not depend on the processed language.
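A rule-based classifier built on SentiWordNet-style scores can be sketched as follows. The score triples are invented for illustration (they only respect the property that positivity, negativity and objectivity sum to 1), and the decision rule of comparing summed positive and negative scores is an assumption rather than the exact rule set used by Denecke [9]. The sketch also assumes that the text has already been translated to English.

```python
# Hypothetical (positivity, negativity, objectivity) triples; each sums to 1.
TOY_SCORES = {
    "great": (0.75, 0.0, 0.25),
    "boring": (0.0, 0.625, 0.375),
    "weak": (0.125, 0.5, 0.375),
    "film": (0.0, 0.0, 1.0),
}


def classify(text, scores=TOY_SCORES):
    """Label text by comparing its summed positive and negative scores."""
    pos = neg = 0.0
    for token in text.lower().split():
        # Unknown words are treated as fully objective (an assumption).
        p, n, _ = scores.get(token.strip(".,!?"), (0.0, 0.0, 1.0))
        pos += p
        neg += n
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return "neutral"


labels = [classify(t) for t in ("A great film", "Boring and weak film.", "The film")]
```

Because no negation handling is included, a phrase such as "not great" would still be scored as positive, which is consistent with the limitation on negated structures noted for this approach in Table 1.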
Wan [32] used the English sentiment lexicon from OpinionFinder [5] for Chinese sentiment analysis by employing machine translation and ensemble techniques. Experimental results show that using an ensemble of Chinese lexicons with English reviews translated by both Google Translate and Yahoo Babel Fish can achieve an accuracy of 0.854. Wan further extended the lexicon-based approach to a corpus-based one via a co-training method using two-way translation [8], so that the English and Chinese features can be considered as two independent views of the classification problem. Labelled English reviews are used to create labelled Chinese reviews through translation. The unlabelled Chinese reviews are paired with the labelled Chinese reviews (translated from English reviews) for the first training dataset. The second training dataset is from the translated unlabelled English reviews (derived from Chinese reviews) paired with initially labelled English reviews. The classifiers from the two training datasets are then combined into a single sentiment classifier through a co-training process. The co-training approach achieves the best accuracy of 0.775 and 0.79 for English and Chinese classifiers, respectively. This co-training approach is useful in the absence of a parallel corpus, which is covered in the next section.
³ http://alias-i.com/lingpipe/index.html
2.2.2 Parallel corpus-based polarity analysis
Another type of polarity analysis is to use parallel corpora to learn language characteristics without the need of using a translation machine or translator. Meng et al. [33] built a generative cross-lingual mixture model (CLMM) to leverage unlabelled bilingual parallel data. The CLMM utilises words from a parallel corpus to learn about word polarity. It expands the vocabulary through maximising the likelihood of and estimating word-generation probabilities for words not seen in the labelled data but present in the parallel corpus. It is shown that the accuracy of classification results using only English labelled data is 71% but the accuracy improves to 83% when both English and Chinese labelled data are used. The initial lower accuracy is probably due to the limited vocabulary coverage in machine translated data and hence the usage of the parallel bilingual corpus improves the classification results by learning previously unseen sentiment words from the large unlabelled data.
Lu et al. [34] adopted a maximum entropy-based approach to jointly learn two monolingual sentiment classifiers. Their focus is to simultaneously improve the performance of sentiment classification in a pair of languages – English and Chinese – by relying on sentiment-labelled data in each language as well as unlabelled parallel text for the language pair. It is reported that the proposed approach is able to outperform the monolingual baselines and improve the accuracy for both languages by 3.44%-8.12%, with the best accuracy scored at 83.71% by the English classifier using the NTCIR parallel corpora [35], [36].
2.2.3 Corpus and machine learning-based polarity analysis
In contrast to the parallel corpora approach, Prettenhofer and Stein [37] used English as the source language, and German, French and Japanese as target languages, for cross-language topic and sentiment classification. Structural Correspondence Learning (SCL) [38], proposed for domain adaption, was adopted in their study. Unlabelled documents from both languages, together with pivot words or pairs of words that have predictive value, were used to create a map of cross-lingual feature space. It is shown that their approach can reduce the relative error to 59% in sentiment classification as compared to a machine translation baseline.
Boiy and Moens [39] also did not use language translation in their work. Instead, they used three manually annotated languages – English, Dutch and French – to train various machine learning algorithms for classifying if a statement is positive, negative or neutral with regards to a certain entity. They proposed a cascading framework for the three languages, but different negation rules, discourse processing and parsing tools were used for each of the languages. This is mainly due to the different behaviours of the languages and the fact that different machine learning algorithms also work differently. Their results show that an English corpus using unigram features augmented with linguistic features achieves an accuracy of 83%, while Dutch and French texts have lower accuracies of 70% and 68% because of the larger variety of linguistic expressions in the two languages. The best classification results for English, Dutch and French came from Multinomial Naïve Bayes (MNB), SVM and Maximum Entropy classifiers, respectively.
2.2.4 Corpus-based topic modelling polarity analysis
While most of the corpus-based approaches are coupled with either machine translation or parallel corpora to classify the subjectivity or polarity of given text, Boyd-Graber and Resnik [40] developed a generative topic model known as multilingual supervised Latent Dirichlet Allocation (MS-LDA). Their approach jointly models topics that are consistent across languages, and connects them to predict sentiment ratings. MS-LDA is capable of clustering thematically coherent topics together with their sentiments without requiring parallel corpora and machine translation. It is shown that the model is able to make better prediction when a mix of English and German data is used, compared to when German data alone is used. This is interesting, as the approach shows the potential of leveraging another language to improve sentiment analysis classification results.
2.2.5 Cross-lingual and machine translation polarity analysis
Another polarity analysis approach is to use cross-lingual corpora for multilingual sentiment analysis. Cross-language classification uses a source language (often annotated) as the training dataset and another language or the target language as the testing dataset. It is not uncommon to have documents from the training and testing datasets mapped onto non-overlapping regions of the feature space when the domains of both sources are different. Pan et al. [41] utilised an annotated sentiment corpus in English to predict sentiment polarity in Chinese. The approach uses machine translation so that two datasets in the two languages can be created as two independent views. The two views are combined in a matrix factorisation process so that training can be done simultaneously (instead of conducting training using a series of classifiers from a co-training approach). In addition, lexical knowledge is incorporated into the model to improve its accuracy. Three different datasets (i.e., movie, book and music reviews) were tested in the study and the best accuracy of 84% came from the movie reviews dataset.
Similar to Pan et al. [41], Bautin et al. [42] also used lexicons, translators and two types of corpora (i.e., multilingual news streams and parallel corpora) for sentiment analysis and cross-cultural comparison. Their focus was on comparing the diversity of different languages based on a selected entity, e.g. a politician, over a time period, and they emphasised that it is essential to apply normalisation coefficients to minimise the effect of variance in different languages. The Lydia sentiment system [43] was used and certain entities were selected for cross-language sentiment analysis using 10 days of news streams. Entity sentiment (subjectivity and polarity) was calculated for each day based on co-occurrence of the entity with sentiment words. Even though machine translation was used in the study, the accuracy was found to be largely translator independent. In addition, the results from a news entity frequency correlation study show that English has a significant correlation with the other eight languages investigated, and hence confirm its pivotal role in the multi-language analysis approach.
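The role of normalisation when comparing sentiment scores across languages can be illustrated with a minimal sketch. Standardising each language's daily scores to zero mean and unit variance is an assumed stand-in for the normalisation coefficients of Bautin et al. [42], whose exact form is not given here; the scores themselves are hypothetical.

```python
from statistics import mean, pstdev


def normalise_by_language(scores_by_language):
    """Standardise each language's scores to zero mean and unit variance."""
    normalised = {}
    for language, scores in scores_by_language.items():
        mu = mean(scores)
        sigma = pstdev(scores)
        normalised[language] = [
            (score - mu) / sigma if sigma else 0.0 for score in scores
        ]
    return normalised


# Hypothetical daily entity polarity scores on very different scales.
daily_scores = {"en": [0.1, 0.2, 0.3], "de": [1.0, 2.0, 3.0]}
normalised = normalise_by_language(daily_scores)
```

After normalisation the two series become directly comparable even though their raw magnitudes differ by an order of magnitude, which is the effect that per-language normalisation is meant to achieve.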
2.2.6 Translation-based polarity analysis
One of the reasons for using a parallel corpus is due to the language gap and difference in the underlying distribution between the original language and the translated language [8], [33]. While poor performance of multilingual sentiment analysis may be due to the limitation of a machine translation system, Balahur and Turchi [44] conducted extensive evaluation scenarios to show that machine translation systems are mature enough to obtain multilingual data for supervised sentiment analysis. They quantified the effect of translation quality using three different machine translation systems. Various features, algorithms and meta-classifiers were adopted for polarity detection, and they showed that feature representation using Term Frequency–Inverse Document Frequency of unigram and bigram in an SVM with sequential minimal optimisation produces the best result.
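The unigram and bigram Term Frequency–Inverse Document Frequency representation can be sketched in a few lines. The weighting variant used below (relative term frequency multiplied by the logarithm of the number of documents over the document frequency) is one common definition and is an assumption; the exact variant used by Balahur and Turchi [44] may differ, and the SVM training step is omitted.

```python
import math
from collections import Counter


def ngrams(tokens):
    """Return the unigrams followed by the bigrams of a token list."""
    return tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def tfidf(documents):
    """Compute a sparse TF-IDF vector (dict) for each document."""
    term_lists = [ngrams(doc.lower().split()) for doc in documents]
    n_docs = len(term_lists)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for terms in term_lists for term in set(terms))
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms)
        vectors.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Two hypothetical one-line reviews.
vectors = tfidf(["not good", "very good"])
```

The unigram "good" appears in both documents and receives zero weight, whereas the bigram "not good" is distinctive for the first document, which shows why bigram features can help to separate opposite polarities.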
Hiroshi et al. [45] also explored a translation-based approach, which includes parsing and pattern discovery for multilingual sentiment analysis. Specifically, they used transfer-based machine translation technology to develop a high-precision sentiment analysis system for the Japanese language by leveraging English sentiment resources to identify relevant sentiment units. The sentiment unit polarity extraction precision was reported to be as high as 89%.
2.2.7 Concept-based polarity analysis
While lexicon, corpus and translator-based approaches or a combination of these approaches have been used extensively for subjectivity and polarity analysis, concept-based techniques are gaining popularity due to their ability to detect subtly expressed sentiments [46]-[48]. SenticNet [11] is a widely used concept-based resource. Xia et al. [49] created a localisation toolkit for SenticNet by implementing a set of concept disambiguation algorithms to discover context. In this toolkit, Google Translate is used to map between the English and Chinese languages. Various Chinese resources are also used to discover language-dependent sentiment concepts through translation. They evaluated the toolkit based on the correctly predicted polarity of the root concept, and an agreement rate of 0.901 was achieved based on annotations from two postgraduate students.
2.2.8 Summary
In short, it is observed that multilingual sentiment analysis using a parallel corpus instead of machine translation can improve classification accuracy [33], [34]. On top of that, Lu et al. [34] showed that a natural parallel corpus produces performance gain compared to using pseudo-parallel data from machine translation engines. Having said that, there are other researchers who firmly believe that machine translation technology has matured [44], and that the techniques used in translation [45] can be applied to multilingual sentiment analysis. Neither of these approaches, however, works well for scarce resource languages, as parallel corpora and translation machines are literally non-existent for languages of this sort, and manual efforts are needed for creating such resources before the approaches reviewed above can be adopted.
Table 1. Multilingual approaches used in subjectivity and polarity studies
* L, C and T at the table header indicate if the approach uses lexicon, corpus or translator-based resources, respectively. The corresponding resources can be found in Table 2 and Table 3.
Approach | Challenges | Language | L* C* T* | Reference
Subjectivity
Bilingual dictionary translation and rule-based classifier | • Due to inflected English words, lemmatised subjective English terms are used to map entries to the bilingual dictionary but this may lose subjectivity • Ambiguity of word sense and part of speech due to identical entries • Multi-word expressions that cannot match the dictionary entries so word-by-word matching is adopted | English, Romanian | √ | [3]
Parallel annotation projection and statistical classifier | • Interpretation of different languages on subjectivity of a sentence due to different opinions of annotators and loss of information in translation • Difficulty in capturing subtle expressions such as irony | English, Romanian | √ | [3]
Local grammar pattern discovery and domain specific keywords | • Word sense ambiguity of sentiment words extracted • Grid-based analysis is proposed to cope with multiple news sources and the huge volume | English, Arabic, Chinese | √ | [21]
Polarity
Lexicon-based to build Support Vector (SV) of sentiment words for the SVM and C4.5 | • Manual annotation is used for creating sentiment-tagged Chinese words • It is observed that SV with zero elements (no match in all the positive or negative words) should be eliminated to improve the classifier’s result | English, Chinese | √ | [24]
Lexicon and parallel corpus with a statistical word alignment approach | • Huge manual effort needed for generating a list of sentiment-bearing words based on WordNet • Results show that the approach recognises negative emails better than positive emails | English, German | √ √ | [2]
Lexicon with rule-based classifier and machine learning (Simple Logistic Classifier) | • Translation errors or missing translation • Ambiguities and different meanings of a synset in SentiWordNet are not resolved • Limited ability to recognise negative text; may be due to negated structures not considered in the classifier | English, German | √ √ | [9]
Lexicon-based random walk approach on synonyms with seed words and a bilingual dictionary | • Heavily dependent on the dictionary of synonyms and the weight derived from the links between the words | English, Swedish | √ | [16]
Lexicon-based with translation | • Cross-lingual lexicon translation does not work well for Chinese sentiment analysis | English, Chinese | √ √ | [32]
Parallel corpus-based via learning sentiment words from the corpus | • Volume and quality of bilingual parallel data is critical to the performance of the method | English, Chinese | √ | [33]
Parallel corpus-based using joint training on two monolingual classifiers on the unlabelled corpus | • It is assumed that the perspectives of parallel sentences in the corpus are the same and should have the same sentiment polarity | English, Chinese | √ | [34]
Corpus-based with domain adaptation of SCL | • It is essential to have a task or domain specific corpus for the approach • The pragmatic correlation of pivot words or word pairs can only work on a domain specific cross-lingual corpus | English, German, French, Japanese | √ | [37]
Corpus-based but with aspect focus | • Manually annotate training data in regard to a certain entity • Major cause of errors is the scarcity of training examples with informal languages used on blogs | English, Dutch, French | √ | [39]
Corpus-based with MS-LDA | • Various resources are needed as a bridge to link the different corpora • Quality and the amount of corpora are essential for better performance • Mapping that captures the local syntax and meaningful collocations can improve the model | English, German, Chinese | √ | [40]
Corpus-based and 2-way translation with co-training | • Inaccuracy of machine translation service causes the difference in feature distribution • Learning curve of classifiers in the co-training approach | English, Chinese | √ √ | [8]
Corpus-based and translation with LingPipe classifier | • Limited ability to recognise negative text; may be due to negated structures not considered in the classifier • Frequency of polarity features and subjectivity detection methods are proposed to improve accuracy | English, German | √ √ | [9]
Corpus-based with translation and machine learning | • Translation engines or translators need to be available for the target language • Multiple translated data from various translators proposed in order to minimise the translation error | English, Spanish, French, German | √ √ | [44]
Lexicon and corpus-based with translation to create a bi-view non-negative matrix tri-factorisation model | • Domain specific datasets are used in the study; it is not known how the approach performs on general cross-lingual classification • Parameters used have influence on different languages and they need to be set manually; suggested to estimate the parameters via a validation set | English, Chinese | √ √ √ | [41]
Lexicon and corpus-based with translation to English to understand the diversity of the different languages | • The availability of translators for the target language • Due to the score variance of each language, it is proposed that including normalisation coefficients for cross-language polarity comparison will help improve the approach | English, Arabic, Chinese, French, German, Italian, Japanese, Korean, Spanish | √ √ √ | [42]
Lexicon and corpus-based with pattern transfer translation to identify a set of sentiment units | • Coverage of patterns is important for the accuracy of the approach • It is essential to understand the knowledge and techniques of text translation to derive parsing rules and patterns | English, Japanese | √ √ √ | [45]
Concept-based with translation | • Disambiguation algorithms for identifying the context within text • Manual effort is needed as the polarity of some concepts may be ‘opposite’ in nature • Translation errors, untranslated terms and out-of-vocabulary (OOV) concepts | English, Chinese | √ | [49]
Table2.Lexiconsandcorporausedinmultilingualsentimentanalysis
Language
English
Name
OpinionFinder[5]
Type
Subjectivity
lexicon
English
Negationterms
Valenceshifters.tff
[5],[22]
244intensifier
Intensifiers2.tff[5],
[22]
Negation
lexicon
Remarks
Reference
6,856uniqueentries,outof
[3]
which990aremulti-word
expressionsandwithattributes
–strong,weakandwordsenses
(verb,adj,adv)
88negationterms
[32]
Intensifier
lexicon
244intensifiers
English
[32]
Language
English
Name
SentiWordNet[30]
Type
Polaritylexicon
English
Subjectivityclues
Subjclueslen1HLTEMNLP05.tff
[5],[22]
LingPipemovies
reviews4
MPQA[31]
Polaritylexicon
Polaritycorpus
English
Multi-domain
sentimentcorpus
[50]
NTCIROpinion
AnalysisPilotTask
[35],[36]
NTCIR8
Multilingual
OpinionAnalysis
Task(MOAT)5
GeneralInquirer
Categories6
WordNet[51]
English
ReutersRCV17
English
BritishNational
Corpus(BNC)8
HITIR-LabTongyici
Cilin[52]
HowNetChinese
sentimentlexicon9
Productreviews10
English
English
English
English
English
English
Chinese
Chinese
Chinese
Chinese
NTCIROpinion
AnalysisPilotTask
[36]
Remarks
Trioofpolarityscoresassigned
(positivity,negativityand
objectivityscores);thesumof
thesescoresisalways1
2718Englishpositiveand4910
negativeterms
Reference
[9],[44]
1000positiveand1000negative
reviews
535newsarticlesfrom187
differentforeignandU.S.news
sources;4,958sentences(1,471
positiveand3,487negative)
8,000Amazonproductreviews
(4000positive+4000negative)
[9],[39],
[41]
[9],[33],
[34]
Polaritycorpus
1,737sentences(528positive
and1,209negative)
[33],[34]
Polaritycorpus
6,223opinionunits
[44]
Polarity
Dictionary
Vocabulary
lexiconswith
synonymsets
Financial
trainingcorpus
General
language
Lexicaldatabase
[24]
[40]
800,000texts,eachcontaining
200-400words
[21]
77,3443Chinesewordswithin
17,817synsets
60,000Chinesewordsand
11,000sentences
886ITproductreviews(451
positiveand435negative)
4,294sentences(2,378positive
and1,916negative)
[53]
Polaritycorpus
Polaritycorpus
Polaritylexicon
Polaritycorpus
Polaritycorpus
4 http://alias-i.com/lingpipe/index.html
5 http://research.nii.ac.jp/ntcir/ntcir-ws8/permission/ntcir8xinhua-nyt-moat.html
6 http://www.wjh.harvard.edu/~inquirer/homecat.htm
7 http://trec.nist.gov/data/reuters/reuters.html
8 http://www.natcorp.ox.ac.uk/corpus/
9 http://www.keenage.com/
10 http://www.it168.com/
[32]
[8],[41]
[21],[40]
[32],[41],
[53]
[8]
[33],[34]
Language
Chinese
Name
Doubanreviews11
Type
Polaritycorpus
Chinese
OPINMINEChinese
opinionannotation
corpus[54]
BingonlineEnglishChinese
dictionary12
English-to-Chinese
dictionary
LDC_CE_DIC2.0
[32]
Localisationfor
TaiwanandBig5
Encoding(TaBE)13
Moviereviews14
ThePeople’s
Dictionary15
SemCorcorpus[19]
Polaritycorpus
English
Romanian
RomanianNLP16
Various
resourcesfor
RomanianNLP
ChineseEnglish
English,
German,
French,
Japanese
English,
Chinese,
German
English-
German
StarDict17
Dictionary
Cross-Lingual
Sentiment(CLS)
dataset18
Polaritycorpus
Chinese
Chinese
Chinese
German
English
Swedish
English
Romanian
Dictionary
[53]
Dictionary
128,366Chinesetermsand
theircorrespondingEnglish
terms
[32]
General
language
[21]
Polaritycorpus
English-Swedish
dictionary
Annotated
subjective
parallelcorpus
[40]
[16]
AmherstSentiment Polaritycorpus
Corpus[55]
Ding19
Remarks
Reference
Movie/music/bookreviewswith [41]
1000positiveand1000negative
foreachdomain
AnnotationcorpusfromNTCIR- [53]
6OpinionAnalysisTask
Dictionary
Parallelcorpusof107
[3]
documentscoveringtopicsin
sports,politics,fashion,
educationandothers
Corpusofnewspaperarticles
[3]
(50millionwords),sensetagged
data(39ambiguouswords),
Romanian-Englishparalleltext
(1millionwords),RomanianEnglishdictionary(38,000
entries)
10bilinguallexicons
[24]
800,000Amazonproduct
reviewsinfourlanguages(the
productcategoriesarebooks,
dvdsandmusic)
[37]
[40]
11 http://www.douban.com/
12 http://cn.bing.com/dict/
13 http://sourceforge.net/projects/libtabe/
14 http://www.cs.colorado.edu/~jbg/static/data.html
15 http://folkets-lexikon.csc.kth.se/folkets/folkets.en.html
16 http://web.eecs.umich.edu/~mihalcea/downloads.html#romanian
17 http://goldendict.org/dictionaries.php
18 http://www.uni-weimar.de/en/media/chairs/webis/corpora/webis-cls-10/
19 https://www-user.tu-chemnitz.de/~fri/ding/
[40]
Language | Name | Type | Remarks | Reference
Chinese-English | MDBG20 | Dictionary | | [40]
Chinese-German | HanDe21 | Dictionary | | [40]
Multiple | Universal Dictionary download site22 | Dictionary | Web volunteer contributors; 4,500 entries in Romanian | [3]
3. Other work on multilingual analysis
The scope of multilingual analysis is not restricted to subjectivity and polarity analysis; it also includes cross-language document summarisation [56] and information retrieval in web search [57], among others. A list of relevant tools for multilingual sentiment analysis is shown in Table 3, as a reference for other cross-language studies. Briefly, two types of resources are shared, i.e., translators and unlabelled parallel corpora. Besides commonly used translators such as Google Translate, commercial23 and open source [58] tools are also covered. A number of studies have shown that Yahoo Babel Fish24 produces the least accurate translations under manual inspection (e.g., see [32], [44]), and hence it is often used as a baseline, either for manual correction or to limit the translation bias of a human translator [44]. Parallel corpora can be a valuable asset for learning and overcoming cultural and linguistic diversity, so that information can be shared accurately and transparently across different societies with different languages.
Table 3. Tools for multilingual analysis
Type | Name | Language | Reference
Translator | PROMT eXcellent Translation (XT) Technology25 | German, English, Spanish, French, Portuguese, Italian and Russian | [9]
Translator and Mapping | Google Translate26 | Multiple; on the Chinese-to-English tasks of MT2005, the BLEU-4 score~ is 0.3531 [32] | [8], [32], [53]
Translator | Yahoo Babel Fish27 | Multiple; on the Chinese-to-English tasks of MT2005, the BLEU-4 score~ is 0.1471 [32] | [32]
Translator | Bing Translator28 | Multiple | [44]
Translator | Moses [58] | Multiple | [44]
Translator | IBM WebSphere Translation Server (WTS)29 | Multiple | [42]
Unlabelled Parallel Corpus | ISI Chinese-English Parallel Corpus [59] | English, Chinese | [33], [34]
Unlabelled Parallel Corpus | Parallel Corpora of 23 Official EU Languages30 | Multiple | [42], [60]
Unlabelled Parallel Corpus | Parallel Corpora of 21 European Languages (extracted from the proceedings of the European Parliament) [61] | Multiple | [2], [40]
20 http://www.mdbg.net/chindict/chindict.php
21 http://www.handedict.de/
22 http://www.dicts.info/uddl.php
23 http://www.promt.com/
24 http://www.babelfish.com/
25 http://www.promt.com/
26 https://translate.google.com.au/
27 http://www.babelfish.com/
28 http://www.bing.com/translator/
~ BLEU (Bilingual Evaluation Understudy) is an algorithm for evaluating the quality of a piece of translated text. The score is calculated based on a modified form of precision for comparing the candidate translation against multiple references.
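The modified precision mentioned in the note on BLEU can be sketched in a few lines of Python. This is a minimal illustration, not the evaluation script behind the scores reported in [32]: the function names `modified_precision` and `bleu` are hypothetical, the input is assumed to be tokenised already, and the brevity penalty follows the standard BLEU definition rather than anything stated in this paper.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of a candidate against several references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # For every n-gram, remember the largest count seen in any single reference.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    # Each candidate n-gram is credited at most max_ref[gram] times.
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Sentence-level BLEU without smoothing (assumption: uniform n-gram weights)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Effective reference length: the reference closest in length to the candidate.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)
```

BLEU-4 corresponds to `max_n=4`. Reported scores are normally aggregated over a whole test set, often with smoothing, which this sketch omits.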
4. Sentiment analysis on social media
In the earlier days, companies and governments did not realise the power of social media until they saw the influence of word-of-mouth and how quickly it could resonate with the community and inspire the launch of a protest or campaign for a cause [62]. Since then, sentiment analysis has expanded from being a research area on formal languages such as English to include informal languages used on social media. In particular, the content of tweets (posts shared on Twitter) is among the most studied, due to their ability to propagate hot topics in a very short duration and to a large number of users over wide geographical regions.
However, as seen from Section 2, most of the sentiment analysis studies to date have utilised resources such as lexicons and manually labelled corpora in English. The corpora used are mainly from news [3], [31], [33], [34] and reviews [8], [9], [37], [39], [41], [63], with content written in proper English. Given that social media is becoming the mainstream mode for communication and expressing one’s thoughts on a variety of issues, it is essential to analyse the structure of social media corpora and current approaches used for sentiment analysis and opinion mining on social media.
4.1 English sentiment analysis on social media
Even though it is common for tweets to include many linguistic variations or mixed languages (especially in multicultural societies), most sentiment analysis studies still focus on English content because of the availability of resources. Pak and Paroubek [64] collected a corpus of 300,000 text posts from Twitter for objectivity and positive/negative emotion analysis. They concluded that Twitter users use syntactic structures to describe emotion or state facts, and that Part-of-Speech (POS) tags may be strong indicators of emotional text. In addition, the use of POS tags differs across types of emotion: positive text uses mostly superlative adverbs, such as “most” and “best”, and possessive endings, while negative text contains more verbs in the past tense. An MNB classifier with n-grams and POS tags as features was tested, and they found that the best performance is achieved using bigrams.
29 http://www-03.ibm.com/software/products/en/translation-server
30 https://ec.europa.eu/jrc/en/language-technologies/jrc-acquis
Like Pak and Paroubek [64], Barbosa and Feng [65], Kouloumpis et al. [66] and Davidov et al. [67] followed the machine-learning based approach for Twitter sentiment analysis. Barbosa and Feng [65] proposed a two-step approach to classify the sentiment of tweets using SVM classifiers with abstract features. Kouloumpis et al. [66] evaluated training data extracted from hashtags and emoticons and examined whether Twitter features play an important role in Twitter sentiment analysis. Davidov et al. [67] used a supervised k-nearest neighbours-like classifier to classify tweets into multiple sentiment types using hashtags and smileys as labels.
In contrast, Jiang et al. [68] classified the sentiment of a tweet according to its positive, negative or neutral sentiment about a target or entity. They argued that the context of a tweet is important for understanding the underlying sentiment, and hence related tweets should be taken into consideration rather than relying on a single tweet, which is usually too short and ambiguous for sentiment analysis. They used Pointwise Mutual Information (PMI) [69] to identify the extended target and implemented a three-level approach for detecting subjectivity, polarity and graph-based relationships. Their results show that the proposed approach is able to improve the performance of target-dependent sentiment classification.
4.2 Multilingual sentiment analysis on social media
While the aforementioned studies concentrate on sentiment analysis with English content, there are also research studies that use tweets as a corpus for multilingual sentiment analysis. Volkova et al. [70] proposed an approach for bootstrapping subjectivity clues from Twitter data and evaluated their approach on English, Spanish and Russian Twitter streams. The proposed approach uses the MPQA lexicon [22] to bootstrap sentiment lexicons from a large pool of unlabelled data, using a small amount of labelled data to guide the process. Terms that are strongly subjective in translation are used as seed terms in the new language, with term polarity projected from the English lexicon. However, it is challenging to classify subjective tweets with philosophical thoughts. This is mainly due to some terms being weakly subjective and hence usable in both neutral and subjective tweets. Besides that, terms with ambiguous word sense and contradicting polarity (depending on the context) are found to be particularly error-prone.
Balahur and Turchi [15] built a simple sentiment analysis system for tweets in English, and used tweets from SemEval 2013 Task 2: Sentiment Analysis in Twitter [71] as their training and testing datasets. They also translated the datasets from English to four other languages: Italian, Spanish, French and German. They found that joint training datasets from languages with similar structures help to achieve improvement over the results obtained on an individual language. While this method is attractive, as it helps to disambiguate the contextual use of specific words, it cannot eliminate the error introduced by translation. From the findings, it is clear that it is essential to consider the different ways in which negation terms are constructed in the different languages.
Cui et al. [72] did not use machine translation but instead focused on building emotion tokens, or a SentiLexicon, using emoticons, repeating punctuation and repeating letters. These emotion tokens are first extracted to build a co-occurrence graph, and positive and negative lexicons are then labelled through a graph propagation algorithm. The type of language is identified through the Unicode range of the characters. If the characters of a tweet are from the Basic Latin or symbols section, it is assigned as a Basic Latin tweet. Most of the tweets considered are English in nature. Tweets with characters in the Latin Extended section are often in Portuguese, Spanish, German, and so on. Their comparative evaluation with SentiWordNet [30] indicated that emotion tokens are helpful for both English and non-English Twitter sentiment analysis.
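The Unicode-based grouping described for Cui et al. [72] could look roughly like the sketch below. The code point ranges are the standard Unicode blocks, but the function names, the extra Cyrillic and CJK groups, and the precedence order between groups are assumptions made for illustration; they are not taken from [72].

```python
def unicode_section(ch):
    """Map a single character to a coarse Unicode section label."""
    cp = ord(ch)
    if cp <= 0x007F:
        return "basic_latin"
    if 0x0080 <= cp <= 0x024F:
        # Latin-1 Supplement, Latin Extended-A and Latin Extended-B
        return "latin_extended"
    if 0x0400 <= cp <= 0x04FF:
        return "cyrillic"
    if 0x4E00 <= cp <= 0x9FFF:
        return "cjk"
    return "symbol_or_other"


def tweet_script(text):
    """Assign a tweet to one script group (assumed precedence: CJK, Cyrillic, Latin Extended)."""
    sections = {unicode_section(ch) for ch in text if not ch.isspace()}
    for label in ("cjk", "cyrillic", "latin_extended"):
        if label in sections:
            return label
    # Only Basic Latin characters and symbols remain.
    return "basic_latin"


if __name__ == "__main__":
    for example in ("great day :)", "não gosto disso", "очень хорошо"):
        print(example, "->", tweet_script(example))
```

A script group is only a proxy for language: a Basic Latin tweet may still be written in Malay or romanised Chinese, which is one reason mixed-language content remains hard to route correctly.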
4.3 Discussion
It is worth highlighting that approaches to sentiment analysis on social media are mainly based on pattern discovery such as syntactic structures [64], Twitter features [66], [67], emotion tokens [72], machine learning through annotated datasets [65], [67], [68] and translation [15], [70]. Considering the limited words available in a tweet and its evolving vocabulary, it is not surprising that the parallel corpus-based approach is not adopted, unlike in the other multilingual studies discussed in Section 2. In addition, sentence structures or grammatical rules are hardly considered, even though Pak and Paroubek [64] showed that POS tags can be a useful indicator of emotional text. POS tags are only applicable if the subject of study is a single language with proper grammatical rules, as the identification of tags is not straightforward when a tweet contains a mixture of languages. In fact, Kouloumpis et al. [66] showed that POS features may not be useful for sentiment analysis, whereas other features such as emoticons and intensifiers are more useful in comparison. It is observed that none of the multilingual sentiment analysis studies takes into consideration the multiple languages found in a tweet. Instead, their focus is typically on studying the effects of different languages on a Twitter platform [70], [72] or leveraging available resources of one language for sentiment analysis of another language [15].
5. Work on scarce resource languages
In addition to the informal languages used on social media, as discussed in the previous section, this section explores studies that analyse languages with limited electronic resources, i.e., languages for which no or very few Natural Language Processing (NLP) tools can be found. In a review paper such as this, it is important to consider and try to understand research that has been done on scarce resource languages. On top of developing NLP tools for some of those languages [73], efforts have also been made in the following three areas: sentiment analysis itself [74]-[77], speech recognition [78], [79] and machine translation [80], [81]. While studies on sentiment analysis along this line of research often concentrate on developing resources and approaches for a single scarce resource language, the other areas, speech recognition and machine translation, also look into constructing resources for other languages in order to support their research (such as crowdsourcing [80]).
5.1 Sentiment analysis
As in multilingual sentiment analysis, subjectivity analysis and polarity analysis have been done on scarce resource languages, although not extensively due to the limited resources available. Banea et al. [74] created a subjectivity lexicon for the Romanian language using a small set of seed words, a basic dictionary, and a small raw corpus. They used a bootstrapping approach to add new related words to a candidate list. They also used both PMI [82], [83] and Latent Semantic Analysis (LSA) [84] to filter noise from the lexicon. The caveat of their approach is that the LSA module needs to be trained using a sufficiently large corpus, and it is suggested that semi-automatic methods should be used for corpus construction as proposed by Ghani et al. [85]. Banea et al. showed that unsupervised learning using a rule-based sentence level subjectivity classifier is able to achieve a subjectivity F-measure score of 66.2, which is an improvement compared to previously proposed semi-supervised methods.
Bakliwal et al. [75] constructed a Hindi subjective lexicon for polarity classification of Hindi product reviews. Using WordNet [27] and a graph-based traversal method, they built a full (adjective and adverb) subjective lexicon. Their approach uses a small seed list with polarity to leverage the synonym and antonym relations of WordNet in order to expand on the initial lexicon. The subjectivity lexicon is then used in review classification. They achieved 79% accuracy using unigram and polarity scores as features. Another approach by Chowdhury and Chowdhury [76] uses both Bengali and English words to perform sentiment analysis on tweets. They applied a semi-supervised bootstrapping method to create the training corpus for machine learning classification, and achieved 93% accuracy through an SVM using unigrams with emoticons as features.
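The seed-list expansion over synonym and antonym relations described for Bakliwal et al. [75] can be illustrated with a breadth-first traversal. The sketch below is an assumption about how such a traversal might be written, not the authors' implementation: `expand_lexicon` is a hypothetical name, and the toy English relations stand in for a real WordNet.

```python
from collections import deque


def expand_lexicon(seeds, synonyms, antonyms):
    """Propagate polarity from seed words through synonym and antonym links.

    seeds:    dict mapping word -> +1 (positive) or -1 (negative)
    synonyms: dict mapping word -> iterable of synonyms (same polarity)
    antonyms: dict mapping word -> iterable of antonyms (opposite polarity)
    """
    polarity = dict(seeds)
    queue = deque(seeds)
    while queue:
        word = queue.popleft()
        label = polarity[word]
        for neighbour in synonyms.get(word, ()):
            if neighbour not in polarity:      # first label reached wins
                polarity[neighbour] = label
                queue.append(neighbour)
        for neighbour in antonyms.get(word, ()):
            if neighbour not in polarity:
                polarity[neighbour] = -label
                queue.append(neighbour)
    return polarity


if __name__ == "__main__":
    # Toy relations for illustration only; not taken from any real lexicon.
    toy_synonyms = {"good": ["fine", "great"], "great": ["superb"], "bad": ["poor"]}
    toy_antonyms = {"good": ["bad"]}
    print(expand_lexicon({"good": 1}, toy_synonyms, toy_antonyms))
```

Because the first label to reach a word wins, errors can drift along long synonym chains, which is why the expanded lexicon would normally be spot-checked by annotators.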
The study by Souza and Vieira [77] concentrated on sentiment analysis of Portuguese tweets using Portuguese polarity lexicons and negation models. They found that different lexicons such as Oplexicon and SentiLex yield different accuracies. Specifically, Oplexicon [86] has better performance compared to SentiLex [87], due to the former’s more comprehensive coverage of types of words and domains. A separate study by Elming et al. [88] used a robust offline-learning approach for cross-domain sentiment analysis on Danish based on a polarity lexicon. They observed significantly poorer performance when the analysis is done from one domain to another (i.e., reviews from the film domain to the company domain).
As shown above, the efforts in analysing sentiment on scarce resource languages are predominantly devoted to constructing polarity lexicons [74], [75], [76] or making use of an available lexicon for sentiment classification [77], [88]. This is understandable, as lexicon-based approaches are also widely adopted in multilingual sentiment analysis (see Section 2). Part of the reason is that a polarity lexicon provides a straightforward method for assigning polarity to some content depending on the existence of a term or terms. This offers a viable option given the constraint of other resources, such as the availability of synonym dictionaries and translation machines.
5.2 Speech recognition
Thomas et al. [78] proposed to train deep neural networks (DNNs) [89] for low resource speech recognition. To overcome the limitation of having insufficient training data, they used transcribed data from other languages to build multilingual acoustic models. They observed a 16% improvement with just one hour of in-domain training, and three-fourths of the gain comes from DNN-based features.
Qian et al. [79] used a data borrowing strategy and the Subspace Gaussian Mixture Model [90] for the same problem. Even though their approach achieves an improvement of only about 1.7%, the results indicate that it is important to select languages that are linguistically similar and tie parameters at a context-dependent state.
5.3 Machine translation
Machine translation approaches often rely on parallel corpora to improve their accuracy and coverage. However, limited resources available for some of the languages imply that developing a machine translation engine can be an expensive task in terms of money and effort spent. Human annotation efforts and the availability of experts are required for the success of such tasks. Ambati et al. [80] proposed an approach to leverage active learning of ‘sentence selection’ through crowdsourcing to enable automatic translation of low-resource language pairs. While the use of Mechanical Turk for annotation tasks has always been questioned, Ambati et al. showed that it is possible to create parallel corpora using non-experts with sufficient quality assurance.
In contrast, Irvine and Callison-Burch [81] used comparable corpora to improve the accuracy of translation from a small parallel corpus. They utilised a bilingual lexicon induction technique to learn new translations from the comparable corpora using a phrase-based statistical machine translation model for six low resource languages. Their results indicate that adding induced translations of low frequency words can improve the performance beyond inducing translations for out-of-vocabulary (OOV) words alone.
6. Challenges and recommendations
As shown in Table 1, common challenges encountered in multilingual sentiment analysis research include the word sense ambiguity problem [3], [9], [21], [49], language specific structure (negation [15] or parsing rules [45]) and translation errors [8], [9]. Most of the challenges are relevant to scarce resource languages, except for the errors introduced by translation machines, as most of these languages do not have such machines available to them.
6.1 Word sense disambiguation
There are various suggestions for addressing the word sense ambiguity problem. Xia et al. [49] used Latent Dirichlet Allocation (LDA) [91] to extract top words that are related to a topic, and adopted PMI [82], [83] to calculate the polarity tendency of an opinion. Banea et al. [74] suggested that LSA [84] is sufficient to calculate the similarity between an original seed and each of the candidates extracted through a bootstrapping process. Active learning [80], which is used to improve machine translation by selecting sentences that are most informative for the task at hand, may help in targeting phrases or improving sample selection. The phrases and samples collected can be useful for a manual disambiguation annotation process and also as input for feedback learning of a machine learning approach.
6.2 Language structure
It is well-known that different languages have their own unique ways of expression; for example, it is found that in the Russian language, philosophical thoughts and opinions are often misclassified and hence lexicon-based approaches may not be sufficient [70]. Instead, a deeper linguistic analysis is required. In addition, negation rules may be different for different languages and hence may cause unnecessary errors [15]. For scarce resource languages, some of the variants or dialects can be quite different in nature [92]. In view of the fact that there are a total of 48 variants of English available around the world31, with some being a mixture of languages, and others being non-native pronunciation of English as well as a host of other permutations, it is essential to understand the structure of languages such as these in order to assess the best approach for leveraging the available English sentiment analysis resources.
31 http://en.wikipedia.org/wiki/List_of_dialects_of_the_English_language
6.3 Machine learning
Most of the scarce resource languages are used on social media, where slang or informal languages and emoticons are commonly found. A number of research studies have been able to achieve reasonably good results by including emotion tokens as features in their machine learning approaches [67], [72], [76]. Read [93] studied emoticons using text from the Usenet newsgroups. He classified the text into positive and negative types with both the SVM and NB, and achieved an accuracy of around 70% on the test set used. Go et al. [94] used a similar idea but constructed their corpus from tweets. The best result of 81% accuracy was obtained using the NB classifier. These methods, however, do not perform well in identifying neutral text. A multi-level/cascading [39] or meta-classifier [44] approach has therefore been recommended for multilingual sentiment analysis, where subjectivity analysis should be done before polarity analysis is conducted.
6.4 Essential resources
Subjectivity analysis cannot be accomplished without a lexicon or annotated corpus. Even though most of the scarce resource languages have limited resources available, an initial annotated dictionary or lexicon is still needed before a classifier with reasonable accuracy can be achieved. The following are two proposed approaches for creating lexicons for scarce resource languages, depending on the availability of resources:
1. A small bilingual dictionary as the available resource
   In this case, the only way to construct a subjective lexicon is by translating an existing lexicon from another language through the use of a bilingual dictionary. Although this mapping process can be automated, the accuracy would unfortunately be rather low due to the coverage limitation of the initial dictionary and the context-free translation process, which can introduce many word ambiguity problems. It is essential for the created lexicon to be verified by human annotators to ensure its quality, so that it can be used as a basis for generating more resources for a given scarce resource language.
2. A small subjective lexicon as the available resource
   A set of seed words can be selected from the lexicon to extract a corpus containing the seeds via a keyword search on the content of interest. From this set of candidates, a bootstrapping method can be applied, with their relatedness being measured using similarity metrics such as LSA or PMI to increase the volume of the lexicon. We recommend using the bootstrapping algorithm specified in the work of Banea et al. [74], if a reasonably sized dictionary is available, or adopting the approach by Volkova et al. [70] to extract subjectivity lexicons from social media content, which is typically short and relatively non-structured.
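The second approach above (seed words, keyword search, then bootstrapping with a relatedness filter) can be sketched as follows. This is a simplified illustration under stated assumptions, not the algorithm of Banea et al. [74] or Volkova et al. [70]: documents are assumed to be tokenised and stopword-filtered, relatedness is measured with document-level PMI against the current lexicon as a whole, and the names `pmi`, `bootstrap`, `threshold` and `rounds` are hypothetical.

```python
import math


def pmi(word, lexicon, doc_sets):
    """Document-level PMI between a word and the event 'any lexicon term occurs'."""
    n = len(doc_sets)
    with_word = sum(1 for d in doc_sets if word in d)
    with_seed = sum(1 for d in doc_sets if lexicon & d)
    both = sum(1 for d in doc_sets if word in d and lexicon & d)
    if both == 0:
        return float("-inf")
    return math.log2((both * n) / (with_word * with_seed))


def bootstrap(seeds, docs, threshold=0.5, rounds=3):
    """Grow a lexicon from seed words using co-occurring, PMI-filtered candidates."""
    lexicon = set(seeds)
    doc_sets = [set(d) for d in docs]
    for _ in range(rounds):
        # Keyword search: candidates are words from documents containing a lexicon term.
        candidates = {w for d in doc_sets if lexicon & d for w in d} - lexicon
        accepted = {w for w in candidates if pmi(w, lexicon, doc_sets) >= threshold}
        if not accepted:
            break
        lexicon |= accepted
    return lexicon


if __name__ == "__main__":
    # Toy corpus for illustration only.
    corpus = [
        ["great", "wonderful", "film"],
        ["great", "superb", "film"],
        ["film", "long"],
        ["film", "late"],
        ["bus", "late"],
    ]
    print(sorted(bootstrap({"great"}, corpus)))
```

In the toy corpus, "film" is rejected because it occurs almost everywhere, while rarer co-occurring words are accepted. PMI is known to favour rare words, so on real data a minimum-frequency cut-off and human verification would still be needed.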
The review from Section 4 indicates that none of the multilingual sentiment analysis studies on social media takes into account the possibility of having mixed languages in messages shared, even though it is common for social media data to have such languages (e.g., Singlish with words from English, Malay and Chinese dialects in a single tweet [92], [95]). It is therefore necessary to consider a more comprehensive polarity lexicon that contains polarity lexicons for each of the languages. As mentioned in Section 6.2, negation rules may be different for different languages. However, due to the extensive effort required for parsing a sentence in a scarce resource language, initiatives in identifying the different negation terms can be rewarding as a start. These negation terms can be coupled with the combined lexicon built for more accurate classification. Future work should investigate the behaviour and structure of sentences of different languages in order to construct a list of knowledge-based negation rules.
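Coupling negation terms with a combined lexicon, as suggested above, might be sketched as below. The lexicon entries, the negation terms and the rule that a negator flips the polarity of the next few tokens are illustrative assumptions only; real negation scope differs between languages, which is precisely the point made in Section 6.2. The names `merge_lexicons`, `score_tokens` and `window` are hypothetical.

```python
def merge_lexicons(*lexicons):
    """Combine per-language polarity lexicons into one lookup table.

    Later lexicons override earlier ones if the same surface form appears twice.
    """
    merged = {}
    for lexicon in lexicons:
        merged.update(lexicon)
    return merged


def score_tokens(tokens, lexicon, negations, window=2):
    """Sum word polarities, flipping those within `window` tokens after a negator."""
    score = 0
    flip_left = 0  # how many upcoming tokens are still inside a negation scope
    for token in tokens:
        word = token.lower()
        if word in negations:
            flip_left = window
            continue
        polarity = lexicon.get(word, 0)
        if polarity and flip_left > 0:
            polarity = -polarity
        score += polarity
        if flip_left > 0:
            flip_left -= 1
    return score


if __name__ == "__main__":
    # Illustrative entries only; not drawn from any published lexicon.
    english = {"good": 1, "bad": -1}
    malay = {"bagus": 1}
    negators = {"not", "tak"}
    combined = merge_lexicons(english, malay)
    print(score_tokens("food tak bagus".split(), combined, negators))
```

A fixed window placed after the negator is the simplest possible rule; languages that mark negation after the verb or with two-part constructions would need their own patterns.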
6.5 A hybrid framework
In view of the limitation of resources and challenges discussed, it is worth exploring a framework that incorporates both knowledge-based techniques (e.g., polarity lexicons) and statistical methods (e.g., machine learning) [96]. The recommended hybrid framework is shown in Figure 1. This proposed framework is especially applicable to scarce resource languages, when resources such as polarity lexicons and dictionaries may not be available or comprehensive enough. As can be seen from the figure, machine learning can be used for assigning polarity if that is the case. Even though it is a requirement to have an annotated training dataset before a machine learning model can be generated, semi-supervised methods with the use of emoticons (see Section 6.3 and references therein) or hashtags [66], [67] to extract a preliminary dataset with polarity can be adopted before manual annotation is done. The established hybrid framework is able to assign polarity to unseen content (considering the situation when none of the words matches any term in a polarity lexicon) by learning hidden rules of the annotated data. In addition, unseen data that has been classified can be reviewed for knowledge-based rule extraction or as a feedback system to improve machine learning classification.
Due to scarce resource limitations and multilingual settings, this framework can be adapted depending on resources available and the target language(s) to be analysed. The polarity pattern mentioned in Figure 1 can be a polarity word found in a lexicon or a type of negation pattern specific to a language. The knowledge-based polarity assignment is mainly based on resources or algorithms developed through detailed analysis of the language or languages. It can be a mixed language lexicon to address the mixture of languages found in social media data and/or knowledge-based negation rules mentioned in Section 6.4. In addition, the knowledge learnt from word sense disambiguation explained in Section 6.1 can be incorporated into the knowledge-based algorithm to improve the accuracy of polarity assignment. The machine learning polarity assignment can adopt a simple model trained using a training dataset with emoticons or ensemble/cascading learning pointed out in Section 6.3. The accuracy of the proposed framework is heavily dependent on the final approaches implemented in the various components and quality of resources available.
Figure 1. The recommended hybrid framework
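The routing idea behind the hybrid framework (use the knowledge base when a polarity pattern is found, otherwise fall back to a learnt model) can be sketched as follows. This is one possible reading of Figure 1, not a reference implementation: the emoticon sets, the small Naive Bayes model with add-one smoothing, the toy tweets and all function names are assumptions introduced for illustration.

```python
import math
from collections import Counter

POSITIVE_EMOTICONS = {":)", ":-)", "=)"}
NEGATIVE_EMOTICONS = {":(", ":-("}


def emoticon_label(tokens):
    """Distant label from emoticons; None if absent or contradictory."""
    has_pos = any(t in POSITIVE_EMOTICONS for t in tokens)
    has_neg = any(t in NEGATIVE_EMOTICONS for t in tokens)
    if has_pos == has_neg:
        return None
    return "positive" if has_pos else "negative"


class NaiveBayes:
    """Tiny multinomial Naive Bayes with add-one smoothing (two classes)."""

    def __init__(self):
        self.word_counts = {"positive": Counter(), "negative": Counter()}
        self.doc_counts = Counter()

    def train(self, tokens, label):
        self.doc_counts[label] += 1
        self.word_counts[label].update(tokens)

    def predict(self, tokens):
        vocab = set(self.word_counts["positive"]) | set(self.word_counts["negative"])
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, float("-inf")
        for label, counts in self.word_counts.items():
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            denom = sum(counts.values()) + len(vocab)
            for t in tokens:
                if t in vocab:  # words never seen in training are ignored
                    score += math.log((counts[t] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def train_from_emoticons(tweets):
    """Build a preliminary training set from emoticon-bearing tweets."""
    model = NaiveBayes()
    emoticons = POSITIVE_EMOTICONS | NEGATIVE_EMOTICONS
    for tweet in tweets:
        tokens = tweet.lower().split()
        label = emoticon_label(tokens)
        if label is not None:
            model.train([t for t in tokens if t not in emoticons], label)
    return model


def hybrid_polarity(tokens, lexicon, model):
    """Knowledge-based assignment first; machine learning when no pattern decides."""
    total = sum(lexicon[t] for t in tokens if t in lexicon)
    if total != 0:
        return ("positive" if total > 0 else "negative"), "knowledge-based"
    return model.predict(tokens), "machine-learning"


if __name__ == "__main__":
    # Toy tweets for illustration only.
    model = train_from_emoticons([
        "shiok dinner :)", "queue damn long :(",
        "dinner shiok lah :)", "long wait :(",
    ])
    print(hybrid_polarity(["shiok", "supper"], {"good": 1, "bad": -1}, model))
```

The sketch has no neutral class and no negation handling; following Section 6.3, a subjectivity step would normally run before this polarity step, and content classified by the fallback model could be reviewed to extend the lexicon.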
6.6 Other considerations
While a manually annotated lexicon or corpus is still vital for sentiment analysis, it requires financial funding support and a considerable amount of human effort to create a reasonably sized resource. If funding is not a limiting factor, a crowdsourcing approach [80], [97] can be considered, as the quality of annotation can be improved through cross validation and verification by several annotators. However, if crowdsourcing is not a viable option, an initial polarity corpus can be created by using emotion tokens [72]. This corpus can then be put together with the lexicon built, to discover more candidates through a bootstrapping [74] or SCL [37] approach. It is common to use a subjectivity lexicon for a rule-based classifier; however, a number of studies [2], [42] have shown that a combination of corpus-based machine learning and lexicon rule-based methods with cascading learning [39] can improve the accuracy of sentiment analysis.
Even though the linguistic structure of a scarce resource language is important for determining if English resources can be adapted successfully, it requires detailed analysis to be carried out by linguistic experts in order to identify the structural differences. As a result, it is suggested that machine learning should be used as an alternative or a litmus test, to assess if there is a need for a structural study to further improve the accuracy. As shown in Table 1, one of the downfalls is the limited ability of a classifier to recognise negative text and omitted negation structures. While it may not be possible to conduct a study on the linguistic structure of a scarce resource language, it is certainly possible to manually identify some negation samples from the available corpus and incorporate the specific pattern or structure when constructing a training dataset.
To sum up, although the lexicon-based approach is still essential for sentiment analysis, it should be expanded to include contextual awareness features, as most of the sentiments are related to an entity or a topic. Apart from that, the concept-based approach, which incorporates common-sense reasoning [98], is fast developing. Concept-level sentiment analysis is necessary for managing more subtle sentiments that are often not captured or handled in current multilingual sentiment analysis research.
7. Conclusion
Sentiment analysis is an active research area, thanks to its many challenges and its promise. While many sentiment analysis studies have been conducted on formal languages using mainstream platforms like news or official documents, increasing attention is now placed on analysis of social media content to facilitate understanding of the wellbeing of a community or the perceived image of a company/product. Social media content often contains informal or mixed languages. It is thus no longer sufficient to consider only a formal language (e.g., English) in sentiment analysis research. In this review paper, we have looked at a range of current approaches and tools used for multilingual sentiment analysis. We took into account not just formal languages but informal and scarce resource languages too. Major challenges have been identified, and we recommended possible remedies as well as a hybrid framework for developing sentiment analysis resources, particularly for languages with limited electronic resources.
References
[1] E. Riloff and J. Wiebe, ‘Learning extraction patterns for subjective expressions’, in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2003, pp. 105–112.
[2] S.-M. Kim and E. Hovy, ‘Identifying and analyzing judgment opinions’, in Proceedings of the conference of North American Chapter of the Association of Computational Linguistics, 2006, pp. 200–207.
[3] R. Mihalcea, C. Banea, and J. Wiebe, ‘Learning multilingual subjective language via cross-lingual projections’, in Proceedings of Annual Meeting of Association for Computational Linguistics, 2007, vol. 45, p. 976.
[4] B. Pang and L. Lee, ‘Opinion mining and sentiment analysis’, Found. Trends Inf. Retr., vol. 2, no. 1–2, pp. 1–135, 2008.
[5] T. Wilson, P. Hoffmann, S. Somasundaran, J. Kessler, J. Wiebe, Y. Choi, C. Cardie, E. Riloff, and S. Patwardhan, ‘OpinionFinder: A system for subjectivity analysis’, in Proceedings of Conference on Empirical Methods in Natural Language Processing, 2005, pp. 34–35.
[6] H. Kanayama and T. Nasukawa, ‘Fully automatic lexicon expansion for domain-oriented sentiment analysis’, in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2006, pp. 355–363.
[7] Y. Hu, J. Duan, X. Chen, B. Pei, and R. Lu, ‘A new method for sentiment classification in text retrieval’, in Proceedings of International Joint Conference on Natural Language Processing, 2005, pp. 1–9.
[8] X. Wan, ‘Co-training for cross-lingual sentiment classification’, in Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing, 2009, pp. 235–243.
[9] K. Denecke, ‘Using sentiwordnet for multilingual sentiment analysis’, in Proceedings of International Conference on Data Engineering Workshops, 2008, pp. 507–512.
[10] J. R. Leimgruber, ‘Singapore English’, Lang. Linguist. Compass, vol. 5, no. 1, pp. 47–62, 2011.
[11] E. Cambria, D. Olsher, and D. Rajagopal, ‘SenticNet 3: a common and common-sense knowledge base for cognition-driven sentiment analysis’, in Proceedings of AAAI Conference on Artificial Intelligence, 2014, pp. 1515–1521.
[12] S. Tan and J. Zhang, ‘An empirical study of sentiment analysis for chinese documents’, Expert Syst. Appl., vol. 34, no. 4, pp. 2622–2629, 2008.
[13] J. Zhao, L. Dong, J. Wu, and K. Xu, ‘Moodlens: an emoticon-based sentiment analysis system for chinese tweets’, in Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2012, pp. 1528–1531.
[14] N. Kobayashi, K. Inui, Y. Matsumoto, K. Tateishi, and T. Fukushima, ‘Collecting evaluative expressions for opinion extraction’, in Proceedings of International Conference on Natural Language Processing, 2005, pp. 596–605.
[15] A. Balahur and M. Turchi, ‘Improving sentiment analysis in Twitter using multilingual machine translated data’, in Proceedings of Recent Advances in Natural Language Processing, 2013, pp. 49–55.
[16] M. Rosell and V. Kann, ‘Constructing a swedish general purpose polarity lexicon random walks in the people’s dictionary of synonyms’, in Proceedings of Swedish Language Technology Conference, 2010, pp. 19–20.
[17] M. Abdul-Mageed, M. T. Diab, and M. Korayem, ‘Subjectivity and sentiment analysis of modern standard arabic’, in Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies: short papers, 2011, vol. 2, pp. 587–591.
[18] E. Cambria and A. Hussain, Sentic computing: a common-sense-based framework for concept-level sentiment analysis, vol. 1. Springer, 2015.
[19] G. A. Miller, C. Leacock, R. Tengi, and R. T. Bunker, ‘A semantic concordance’, in Proceedings of the Workshop on Human Language Technology, 1993, pp. 303–308.
[20] D. D. Lewis, ‘Naive (Bayes) at forty: The independence assumption in information retrieval’, in Proceedings of European Conference on Machine Learning, 1998, pp. 4–15.
[21] K. Ahmad, D. Cheng, and Y. Almas, ‘Multi-lingual sentiment analysis of financial news streams’, in Proceedings of the International Conference on Grid in Finance, 2006.
[22] T. Wilson, J. Wiebe, and P. Hoffmann, ‘Recognizing contextual polarity in phrase-level sentiment analysis’, in Proceedings of Conference on Empirical Methods in Natural Language Processing, 2005, pp. 347–354.
[23] A. Esuli and F. Sebastiani, ‘Determining Term Subjectivity and Term Orientation for Opinion Mining’, in Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics, 2006, vol. 6, p. 2006.
[24] J. Yao, G. Wu, J. Liu, and Y. Zheng, ‘Using bilingual lexicon to judge sentiment orientation of Chinese words’, in Proceedings of IEEE International Conference on Computer and Information Technology, 2006, pp. 38–38.
[25] V. Vapnik, The nature of statistical learning theory. Springer Science & Business Media, 2000.
[26] J. R. Quinlan, C4.5: programs for machine learning. Elsevier, 2014.
[27] G. A. Miller, ‘WordNet: a lexical database for English’, Commun. ACM, vol. 38, no. 11, pp. 39–41, 1995.
[28] F. J. Och and H. Ney, ‘Improved statistical alignment models’, in Proceedings of the Annual Meeting on Association for Computational Linguistics, 2000, pp. 440–447.
[29] V. Kann and M. Rosell, ‘Free construction of a free Swedish dictionary of synonyms’, in Proceedings of the Nordic Conference on Computational Linguistics, 2005, pp. 105–110.
[30] S. Baccianella, A. Esuli, and F. Sebastiani, ‘SentiWordNet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining’, in Proceedings of Language Resources and Evaluation Conference, 2010, vol. 10, pp. 2200–2204.
[31] J. Wiebe, T. Wilson, and C. Cardie, ‘Annotating expressions of opinions and emotions in language’, Lang. Resour. Eval., vol. 39, no. 2–3, pp. 165–210, 2005.
[32] X. Wan, ‘Using bilingual knowledge and ensemble techniques for unsupervised Chinese sentiment analysis’, in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2008, pp. 553–561.
[33] X. Meng, F. Wei, X. Liu, M. Zhou, G. Xu, and H. Wang, ‘Cross-lingual mixture model for sentiment classification’, in Proceedings of the Annual Meeting of the Association for Computational Linguistics: Long Papers, 2012, vol. 1, pp. 572–581.
[34] B. Lu, C. Tan, C. Cardie, and B. K. Tsou, ‘Joint bilingual sentiment classification with unlabeled parallel corpora’, in Proceedings of the Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011, vol. 1, pp. 320–330.
[35] Y. Seki, D. K. Evans, L.-W. Ku, H.-H. Chen, N. Kando, and C.-Y. Lin, ‘Overview of opinion analysis pilot task at NTCIR-6’, in Proceedings of NTCIR-6 Workshop Meeting, 2007, pp. 265–278.
[36] Y. Seki, D. K. Evans, L.-W. Ku, L. Sun, H.-H. Chen, N. Kando, and C.-Y. Lin, ‘Overview of multilingual opinion analysis task at NTCIR-7’, in Proceedings of NTCIR-7 Workshop Meeting, 2008.
[37] P. Prettenhofer and B. Stein, ‘Cross-lingual adaptation using structural correspondence learning’, ACM Trans. Intell. Syst. Technol., vol. 3, no. 1, p. 13, 2011.
[38] J. Blitzer, R. McDonald, and F. Pereira, ‘Domain adaptation with structural correspondence learning’, in Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2006, pp. 120–128.
[39] E. Boiy and M.-F. Moens, ‘A machine learning approach to sentiment analysis in multilingual Web texts’, Inf. Retr., vol. 12, no. 5, pp. 526–558, 2009.
[40]
[41]
[42]
[43]
[44]
[45]
[46]
[47]
[48]
[49]
[50]
[51]
[52]
[53]
[54]
[55]
[56]
[57]
[58]
[59]
J.Boyd-GraberandP.Resnik,‘Holisticsentimentanalysisacrosslanguages:Multilingual
supervisedlatentDirichletallocation’,inProceedingsoftheConferenceonEmpiricalMethods
inNaturalLanguageProcessing,2010,pp.45–55.
J.Pan,G.-R.Xue,Y.Yu,andY.Wang,‘Cross-lingualsentimentclassificationviabi-viewnonnegativematrixtri-factorization’,inAdvancesinknowledgediscoveryanddatamining,
Springer,2011,pp.289–300.
M.Bautin,L.Vijayarenu,andS.Skiena,‘Internationalsentimentanalysisfornewsandblogs.’,
inProceedingsofInternationalConferenceonWebandSocialMedia,2008.
N.Godbole,M.Srinivasaiah,andS.Skiena,‘Large-scalesentimentanalysisfornewsand
blogs.’,inProceedingsofInternationalConferenceonWebandSocialMedia,2007,vol.7,p.
21.
A.BalahurandM.Turchi,‘Comparativeexperimentsusingsupervisedlearningandmachine
translationformultilingualsentimentanalysis’,Comput.SpeechLang.,vol.28,no.1,pp.56–
75,2014.
K.Hiroshi,N.Tetsuya,andW.Hideo,‘Deepersentimentanalysisusingmachinetranslation
technology’,inProceedingsoftheInternationalConferenceonComputationalLinguistics,
2004,p.494.
S.Poria,E.Cambria,G.Winterstein,andG.-B.Huang,‘Senticpatterns:Dependency-based
rulesforconcept-levelsentimentanalysis’,Knowl.-BasedSyst.,vol.69,pp.45–63,2014.
E.Cambria,P.Gastaldo,F.Bisio,andR.Zunino,‘AnELM-basedmodelforaffectiveanalogical
reasoning’,Neurocomputing,vol.149,pp.443–455,2015.
S.Poria,E.Cambria,A.Gelbukh,F.Bisio,andA.Hussain,‘Sentimentdataflowanalysisby
meansofdynamiclinguisticpatterns’,Comput.Intell.Mag.IEEE,vol.10,no.4,pp.26–36,
2015.
Y.Xia,X.Li,E.Cambria,andA.Hussain,‘AlocalizationtoolkitforSenticNet’,inProceedingsof
IEEEInternationalConferenceonDataMiningWorkshops,2014,pp.403–408.
J.Blitzer,M.Dredze,andF.Pereira,‘Biographies,bollywood,boom-boxesandblenders:
Domainadaptationforsentimentclassification’,inProceedingsofAnnualMeetingof
AssociationforComputationalLinguistics,2007,vol.7,pp.440–447.
G.A.Miller,‘NounsinWordNet:alexicalinheritancesystem’,Int.J.Lexicogr.,vol.3,no.4,pp.
245–264,1990.
W.Che,Z.Li,andT.Liu,‘Ltp:Achineselanguagetechnologyplatform’,inProceedingsofthe
InternationalConferenceonComputationalLinguistics:Demonstrations,2010,pp.13–16.
‘NTCIR8MOATXinhuaandNYTNewscorpus’.[Online].Available:
http://research.nii.ac.jp/ntcir/ntcir-ws8/permission/ntcir8xinhua-nyt-moat.html.[Accessed:
27-Mar-2015].
R.Xu,K.-F.Wong,andY.Xia,‘Opinmine–opinionanalysissystembyCUHKforNTCIR-6pilot
task’,inProceedingsoftheNTCIR-6Workshop,2007.
N.Constant,C.Davis,C.Potts,andF.Schwarz,‘Thepragmaticsofexpressivecontent:
Evidencefromlargecorpora’,SpracheDatenverarb.,vol.33,no.1–2,pp.5–21,2009.
F.Boudin,S.Huet,J.-M.Torres-Moreno,andJ.Torres-Moreno,‘Agraph-basedapproachto
cross-languagemulti-documentsummarization’,Res.J.Comput.Sci.Comput.Eng.Appl.
Polibits,vol.43,pp.113–118,2010.
J.SavoyandL.Dolamic,‘HoweffectiveisGoogle’stranslationserviceinsearch?’,Commun.
ACM,vol.52,no.10,pp.139–143,2009.
P.Koehn,H.Hoang,A.Birch,C.Callison-Burch,M.Federico,N.Bertoldi,B.Cowan,W.Shen,
C.Moran,andR.Zens,‘Moses:Opensourcetoolkitforstatisticalmachinetranslation’,in
ProceedingsoftheAnnualMeetingonAssociationforComputationalLinguistics :
Demonstrations,2007,pp.177–180.
D.S.MunteanuandD.Marcu,‘Improvingmachinetranslationperformancebyexploiting
non-parallelcorpora’,Comput.Linguist.,vol.31,no.4,pp.477–504,2005.
[60]
[61]
[62]
[63]
[64]
[65]
[66]
[67]
[68]
[69]
[70]
[71]
[72]
[73]
[74]
[75]
[76]
[77]
[78]
‘IBM-WebSphereTranslationServerforMultiplatforms’.[Online].Available:http://www03.ibm.com/software/products/en/translation-server.[Accessed:28-Mar-2015].
P.Koehn,‘Europarl:Aparallelcorpusforstatisticalmachinetranslation’,inProceedingsof
MachineTranslationSummit,2005,vol.5,pp.79–86.
W.Zhang,T.J.Johnson,T.Seltzer,andS.L.Bichard,‘Therevolutionwillbenetworked:The
influenceofsocialnetworkingsitesonpoliticalattitudesandbehavior’,Soc.Sci.Comput.
Rev.,2009.
‘LingPipeHome’.[Online].Available:http://alias-i.com/lingpipe/index.html.[Accessed:25Mar-2015].
A.PakandP.Paroubek,‘Twitterasacorpusforsentimentanalysisandopinionmining.’,in
ProceedingsofLanguageResourcesandEvaluationConference,2010,vol.10,pp.1320–1326.
L.BarbosaandJ.Feng,‘Robustsentimentdetectionontwitterfrombiasedandnoisydata’,in
Proceedingsofthe23rdInternationalConferenceonComputationalLinguistics:Posters,2010,
pp.36–44.
E.Kouloumpis,T.Wilson,andJ.D.Moore,‘Twittersentimentanalysis:Thegoodthebadand
theomg!’,inProceedingsofInternationalConferenceonWebandSocialMedia,2011,vol.11,
pp.538–541.
D.Davidov,O.Tsur,andA.Rappoport,‘Enhancedsentimentlearningusingtwitterhashtags
andsmileys’,inProceedingsofthe23rdInternationalConferenceonComputational
Linguistics:Posters,2010,pp.241–249.
L.Jiang,M.Yu,M.Zhou,X.Liu,andT.Zhao,‘Target-dependenttwittersentiment
classification’,inProceedingsoftheAnnualMeetingoftheAssociationforComputational
Linguistics:HumanLanguageTechnologies,2011,vol.1,pp.151–160.
Q.Su,K.Xiang,H.Wang,B.Sun,andS.Yu,‘Usingpointwisemutualinformationtoidentify
implicitfeaturesincustomerreviews’,inComputerProcessingofOrientalLanguages.Beyond
theOrient:TheResearchChallengesAhead,Springer,2006,pp.22–30.
S.Volkova,T.Wilson,andD.Yarowsky,‘Exploringsentimentinsocialmedia:Bootstrapping
subjectivitycluesfrommultilingualtwitterstreams.’,inProceedingsofAnnualMeetingofthe
AssociationofComputationalLinguistics,2013,pp.505–510.
P.Nakov,Z.Kozareva,A.Ritter,S.Rosenthal,V.Stoyanov,andT.Wilson,‘Semeval-2013task
2:Sentimentanalysisintwitter’,inProceedingsoftheInternationalWorkshoponSemantic
Evaluation,2013.
A.Cui,M.Zhang,Y.Liu,andS.Ma,‘Emotiontokens:Bridgingthegapamongmultilingual
twittersentimentanalysis’,inInformationretrievaltechnology,Springer,2011,pp.238–249.
C.Monson,A.F.Llitjós,R.Aranovich,L.Levin,R.Brown,E.Peterson,J.Carbonell,andA.
Lavie,‘BuildingNLPsystemsfortworesource-scarceindigenouslanguages:Mapudungunand
Quechua’,Strateg.Dev.Mach.Transl.Minor.Lang.,p.15,2006.
C.Banea,R.Mihalcea,andJ.Wiebe,‘Abootstrappingmethodforbuildingsubjectivity
lexiconsforlanguageswithscarceresources.’,inProceedingsofLanguageResourcesand
EvaluationConference,2008,vol.8,pp.2–764.
A.Bakliwal,P.Arora,andV.Varma,‘Hindisubjectivelexicon:AlexicalresourceforHindi
polarityclassification’,inProceedingsofLanguageResourcesandEvaluationConference,
2012,pp.1189–1196.
S.ChowdhuryandW.Chowdhury,‘PerformingsentimentanalysisinBanglamicroblogposts’,
inProceedingsofInternationalConferenceonInformatics,Electronics&Vision,2014,pp.1–6.
M.SouzaandR.Vieira,‘Sentimentanalysisontwitterdataforportugueselanguage’,in
ComputationalProcessingofthePortugueseLanguage,Springer,2012,pp.241–247.
S.Thomas,M.L.Seltzer,K.Church,andH.Hermansky,‘Deepneuralnetworkfeaturesand
semi-supervisedtrainingforlowresourcespeechrecognition’,inProceedingsofIEEE
InternationalConferenceonAcoustics,SpeechandSignalProcessing,2013,pp.6704–6708.
[79]
[80]
[81]
[82]
[83]
[84]
[85]
[86]
[87]
[88]
[89]
[90]
[91]
[92]
[93]
[94]
[95]
[96]
[97]
Y.Qian,D.Povey,andJ.Liu,‘State-leveldataborrowingforlow-resourcespeechrecognition
basedonsubspaceGMMs.’,inProceedingsofAnnualConferenceoftheInternationalSpeech
CommunicationAssociation,2011,pp.553–560.
V.Ambati,S.Vogel,andJ.G.Carbonell,‘Activelearningandcrowd-sourcingformachine
translation.’,inProceedingsofLanguageResourcesandEvaluationConference,2010,vol.1,
p.2.
A.IrvineandC.Callison-Burch,‘Combiningbilingualandcomparablecorporaforlowresource
machinetranslation’,inProceedingsoftheEighthWorkshoponStatisticalMachine
Translation,2013,pp.262–270.
P.D.Turney,‘MiningtheWebforsynonyms:PMI-IRversusLSAonTOEFL’,Lect.Notes
Comput.Sci.,pp.491–502,2001.
P.D.Turney,‘Thumbsuporthumbsdown?:semanticorientationappliedtounsupervised
classificationofreviews’,inProceedingsofAnnualMeetingoftheAssociationof
ComputationalLinguistics,2002,pp.417–424.
S.T.Dumais,G.W.Furnas,T.K.Landauer,S.Deerwester,andR.Harshman,‘Usinglatent
semanticanalysistoimproveaccesstotextualinformation’,inProceedingsoftheSpecial
InterestGrouponComputer-HumanInteractionconference,1988,pp.281–285.
R.Ghani,R.Jones,andD.Mladenić,‘Miningthewebtocreateminoritylanguagecorpora’,in
ProceedingsoftheInternationalConferenceonInformationandKnowledgeManagement,
2001,pp.279–286.
M.Souza,R.Vieira,D.Busetti,R.Chishman,andI.M.Alves,‘Constructionofaportuguese
opinionlexiconfrommultipleresources’,inProceedingsoftheBrazilianSymposiumin
InformationandHumanLanguageTechnology,2011,pp.59–66.
M.J.Silva,P.Carvalho,C.Costa,andL.Sarmento,‘Automaticexpansionofasocialjudgment
lexiconforsentimentanalysis’,2010.
J.Elming,D.Hovy,andB.Plank,‘Robustcross-domainsentimentanalysisforlow-resource
languages’,inProceedingsofAnnualMeetingofAssociationforComputationalLinguistics,
2014,pp.2–7.
L.Deng,G.Hinton,andB.Kingsbury,‘Newtypesofdeepneuralnetworklearningforspeech
recognitionandrelatedapplications:Anoverview’,inProceedingsofIEEEInternational
ConferenceonAcoustics,SpeechandSignalProcessing,2013,pp.8599–8603.
D.Povey,L.Burget,M.Agarwal,P.Akyazi,F.Kai,A.Ghoshal,O.Glembek,N.Goel,M.
Karafiát,andA.Rastrow,‘ThesubspaceGaussianmixturemodel—Astructuredmodelfor
speechrecognition’,Comput.SpeechLang.,vol.25,no.2,pp.404–439,2011.
D.M.Blei,A.Y.Ng,andM.I.Jordan,‘Latentdirichletallocation’,J.Mach.Learn.Res.,vol.3,
pp.993–1022,2003.
S.L.Lo,E.Cambria,R.Chiong,andD.Cornforth,‘Amultilingualsemi-supervisedapproachin
derivingSinglishsenticpatternsforpolaritydetection’,Knowl.-BasedSyst.,2016.
J.Read,‘Usingemoticonstoreducedependencyinmachinelearningtechniquesfor
sentimentclassification’,inProceedingsoftheAssociationforComputationalLinguistics
StudentResearchWorkshop,2005,pp.43–48.
A.Go,R.Bhayani,andL.Huang,‘Twittersentimentclassificationusingdistantsupervision’,
CS224NProj.Rep.Stanf.,pp.1–12,2009.
S.L.Lo,R.Chiong,D.Cornforth,andY.Bao,‘Anunsupervisedmultilingualapproachfor
identifyinghigh-valuetopicsonTwitter’,WorkingPaper,2016.
E.Cambria,‘Affectivecomputingandsentimentanalysis’,IEEEIntell.Syst.,vol.31,no.2,pp.
102–107,2016.
E.Cambria,D.Rajagopal,K.Kwok,andJ.Sepulveda,‘GECKA:gameengineforcommonsense
knowledgeacquisition’,inProceedingsofAAAIFLAIRSConference,2015,pp.282–287.
[98]
E.Cambria,J.Fu,F.Bisio,andS.Poria,‘AffectiveSpace2:Enablingaffectiveintuitionfor
concept-levelsentimentanalysis’,inProceedingsofAAAIConferenceonArtificialIntelligence,
2015,pp.508–514.