Words, more words … and statistics

To segment words, the brain could be using statistical methods
May 19, 2016
Picking out single words in a flow of speech is no easy task and, according to linguists, the brain might use statistical methods to succeed at it. A group of SISSA scientists has applied a statistics-based method for word segmentation and measured its efficacy on natural language in 9 different languages, discovering that linguistic rhythm plays an important role. The study has just been published in the journal Developmental Science.
Have you ever racked your brains trying to make out even a single word of an uninterrupted flow of speech in a language you hardly know at all? It is naïve to think that in speech there is even the smallest of pauses between one word and the next (like the space we conventionally insert between words in writing): in actual fact, speech is almost always a continuous stream of sound. However, when we listen to our native language, word “segmentation” is an effortless process. What are, linguists wonder, the automatic cognitive mechanisms underlying this skill? Clearly, knowledge of the vocabulary helps: memory of the sound of the single words helps us to pick them out. However, many linguists argue, there are also automatic, subconscious “low-level” mechanisms that help us even when we do not recognise the words or when, as in the case of very young children, our knowledge of the language is still only rudimentary. These mechanisms, they think, rely on the statistical analysis of the frequency (estimated based on past experience) of the syllables in each language.
One indicator that could contribute to segmentation processes is “transitional probability” (TP), which provides an estimate of the likelihood of two syllables co-occurring in the same word, based on the frequency with which they are found associated in a given language. In practice, if every time I hear the syllable “TA” it is invariably followed by the syllable “DA”, then the transitional probability for “DA”, given “TA”, is 1 (the highest). If, on the other hand, whenever I hear the syllable “BU” it is followed half of the time by the syllable “DI” and half of the time by “FI”, then the transitional probability of “DI” (and “FI”), given “BU”, is 0.5, and so forth. The cognitive system could be implicitly computing this value by relying on linguistic memory, from which it would derive the frequencies.
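
The “TA”/“DA” and “BU”/“DI”/“FI” example can be reproduced with a few lines of Python. This is only a minimal sketch, not the code used in the study: the function name `transitional_probabilities` and the toy syllable stream are assumptions made for illustration, and TP is estimated simply as the number of times a pair of syllables occurs divided by the number of times the first syllable is followed by anything.

```python
from collections import Counter

def transitional_probabilities(syllables):
    """Estimate TP(next | current) for every adjacent syllable pair.

    Hypothetical helper for illustration; not taken from the study.
    """
    # How often each ordered pair of adjacent syllables occurs.
    pair_counts = Counter(zip(syllables, syllables[1:]))
    # How often each syllable occurs with a successor (the last one has none).
    first_counts = Counter(syllables[:-1])
    return {(a, b): n / first_counts[a] for (a, b), n in pair_counts.items()}

# Toy stream (invented): "TA" is always followed by "DA",
# "BU" is followed by "DI" half of the time and by "FI" half of the time.
stream = ["TA", "DA", "BU", "DI", "TA", "DA", "BU", "FI"]
tp = transitional_probabilities(stream)

print(tp[("TA", "DA")])  # 1.0
print(tp[("BU", "DI")])  # 0.5
print(tp[("BU", "FI")])  # 0.5
```

In a real corpus the counts would come from a large body of transcribed speech rather than from eight syllables, but the estimate is computed in the same way.
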
The study conducted by Amanda Saksida, research scientist at the International School for Advanced Studies (SISSA) in Trieste, with the collaboration of Alan Langus, SISSA research fellow, under the supervision of SISSA professor Marina Nespor, used TP to segment natural language, by using two different approaches.
Based on rhythm
Saksida’s study is based on work with corpora, that is, bodies of texts specifically collected for linguistic analysis. In the case at hand, the corpora consisted of transcriptions of the “linguistic sound environment” that infants are exposed to. “We wanted to have an example of the type of linguistic environment in which a child’s language develops”, explained Saksida. “We wondered whether a low-level mechanism such as transitional probability worked with real-life language cues, which are very different from the artificial cues normally used in the laboratory, which are more schematic and free of sources of ‘noise’. Furthermore, the question was whether the same low-level cue is equally efficient in different languages”. Saksida and colleagues used corpora of no fewer than 9 different languages, and to each they applied two different TP-based models.
First they calculated the TP values for each point of the language flow for all of the corpora, and then they “segmented” the flow using two different methods. The first was based on absolute thresholding: a fixed reference TP value was established, and a boundary was identified wherever the TP fell below it. The second method was based on relative thresholding: the boundaries corresponded to the local minima of the TP function.
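
The difference between the two methods can be illustrated with a short Python sketch. It is a simplified reading of the description above, not the models used in the study: the function names, the TP values and the threshold of 0.5 are all invented for the example, and a “local minimum” is assumed to mean a TP strictly lower than both neighbouring TPs (with missing neighbours at the edges of the stream treated as infinitely high).

```python
def split_at(syllables, is_boundary):
    """Cut the stream after syllable i wherever is_boundary(i) is true."""
    words, current = [], [syllables[0]]
    for i, syllable in enumerate(syllables[1:]):
        if is_boundary(i):
            words.append(current)
            current = []
        current.append(syllable)
    words.append(current)
    return words

def segment_absolute(syllables, tps, threshold):
    """Boundary wherever the TP falls below a fixed threshold.

    tps[i] is the TP between syllables[i] and syllables[i + 1].
    """
    return split_at(syllables, lambda i: tps[i] < threshold)

def segment_relative(syllables, tps):
    """Boundary wherever the TP is lower than both neighbouring TPs."""
    def is_local_minimum(i):
        # Assumption: edges of the stream have no neighbour, so use infinity.
        left = tps[i - 1] if i > 0 else float("inf")
        right = tps[i + 1] if i + 1 < len(tps) else float("inf")
        return tps[i] < left and tps[i] < right
    return split_at(syllables, is_local_minimum)

# Invented stream of five syllables and the four TPs between them.
syllables = ["A", "B", "C", "D", "E"]
tps = [0.9, 0.4, 0.3, 0.8]

print(segment_absolute(syllables, tps, threshold=0.5))
# [['A', 'B'], ['C'], ['D', 'E']]
print(segment_relative(syllables, tps))
# [['A', 'B', 'C'], ['D', 'E']]
```

With these made-up values the absolute threshold cuts at both low TPs (0.4 and 0.3), while the relative method cuts only at the lowest point of the dip (0.3), so the two methods can segment the same stream differently.
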
In all cases, Saksida and colleagues found that transitional probability was an effective tool for segmentation (49% to 86% of words identified correctly) irrespective of the segmentation algorithm used, which confirms TP efficacy. Of note, while both models proved to be quite efficient, when one model was particularly successful with one language, the alternative model always performed significantly worse.
“This cross-linguistic difference suggests that each model is better suited than the other for certain languages and vice versa. We therefore conducted further analyses to understand what linguistic features correlated with the better performance of one model over the other”, explains Saksida. The crucial dimension proved to be linguistic rhythm. “We can divide European languages into two large groups based on rhythm: stress-timed and syllable-timed”. Stress-timed languages have fewer vowels and shorter words, and include English, Slovenian and German. Syllable-timed languages contain more vowels and longer words on average, and include Italian, Spanish and Finnish. A third rhythmic group, which does not exist in Europe, is based on “morae” (a mora is a part of the syllable) and includes languages such as Japanese. This group is known as “mora-timed” and contains even more vowels than syllable-timed languages.
The absolute threshold model proved to work best on stress-timed languages, whereas relative thresholding was better for the mora-timed ones. “It’s therefore possible that the cognitive system learns to use the segmentation algorithm that is best suited to one’s native language, and that this leads to difficulties segmenting languages belonging to another rhythmic category. Experimental studies will clearly be necessary to test this hypothesis. We know from the scientific literature that immediately after birth infants already use rhythmic information, and we think that the strategies used to choose the most appropriate segmentation could be one of the areas in which information about rhythm is most useful”.
The study is in fact unable to say whether the cognitive system (of both adults and children) really uses this type of strategy. “Our study clearly confirms that this strategy works across a wide range of languages”, concludes Saksida. “It will now serve as a guide for laboratory experiments.”
USEFUL LINKS:
• Original paper: http://goo.gl/cOk5VD
IMAGES:
• Credits: Jev55 (Flickr: https://goo.gl/yVVdJ3)
Contact:
Press office:
[email protected]
Tel: (+39) 040 3787644 | (+39) 366-3677586
via Bonomea, 265
34136 Trieste
More information about SISSA: www.sissa.it