Intralinguis`c varia`on in the audio descrip`on of a short film

Intralinguis,cvaria,onintheaudio
descrip,onofashortfilm
AnnaMatamala
UniversitatAutònomadeBarcelona
TransMediaCataloniaresearchgroup
[email protected]
AIETI.AlcaládeHenares,8-10.03.17
FFI2015-64038-P(MINECO/FEDER,UE),2014SGR0027
Overview
• Audiodescrip,onresearch
• Corpusstudiesandaudiodescrip,on
• TheVIWproject
• Acomparisonofstudentsandprofessionals’audio
descrip,ons
• Conclusions
2
AD research
• Theore,calapproaches(narratology,cogni,vestudies)
• Experimentalresearch
• Researchontechnology,training,etc.
• Descrip,veapproachesmostlybasedoncasestudies
3
Audio description and corpora
• TIWO(Salway2007)
• TRACCE(JiménezHurtadoetal.2010)
• MPIIMovieDescrip,ondataset(Rohrbachetal.2015)
• PearTreeProject(MazurandKruger2012),inspiredby
Chafe(1980)
• Reviersetal.(2015)
• VIW(Matamala&Villegas2016)
4
The VIW project
• Shortfilmcommissionedtoafilmdirector(guidelines
basedonliteraturereview):“Whathappenswhile---”,by
NúriaNia,inEnglish.
• DubbedintoCatalanandSpanishinprofessionalstudio
• Professionalaudiodescrip,ons(10x3)andstudents’
audiodescrip,ons(10+7).Total:+30,000words.
5
The corpus
hZp://pagines.uab.cat/viw/
LinkedtoUAB’sopenaccessrepository
6
7
8
9
Processing the materials
mp4
txt
eaffile
web
app
Ling.
Ling.
Annotated
Ling.
Annotated
textAnnotated
text
text
conll2eaf
10
Segmenting and processing
• Linguis,c,ers:AD-unit,er(sentences,chunks,tokens)
andCredits,er
• Token:partsofspeech,lemma,andseman,cvalues
• Filmic,ers:scene,shot,sound,character,text.
11
The web app: source data
• Rawmaterialperproviderandpersubcorpustoimport
intoELANandintoCQPweb.Alsofilmicannota,onsas
eaffile.
• Visualiza,onsforpre-establishedanalyses.
• Accessfrompreviouspagebutalsodirectly:
hhp://transmediacatalonia.uab.cat/web/
12
13
14
Analysing the data
• Mul,pleanalyses(on-going)
• FocusforAIETI:comparisonoftheSpanishprofessional
subcorpusandtheSpanishstudents’subcorpus(and
addi,onallytoSpanishgenerallanguagecorpora)
• Quan,ta,veapproachonautoma,callytaggeddata
15
AD units, sentences, and words
• ProfessionalsusemoreADunitsandshowlessvariability
• Professionalsusemoresentencesandwords
• MeannumberofwordsperADunit:professionals12.57,
students11.42
• Meannumberofwordspersentences:similarmeans
(around8.4),differentfrom20.89wordsfoundin
Cumbrecorpus
• Nosta,s,callysignificantdifferences
16
Word classes
• Similarpercentagesforwordclasseswithslight
differences
• Professionals:moreadjec,vesandverbs
• Students:morenounsandadposi,ons
• Mostfrequent:nouns,followedbyverbs.
• Higherpresencethaningenerallanguagecorpus.
• Moreopen-classwordsthanclosed-classwords,in
contrastwithCREAcorpus
17
20 most frequent lemmas
•  70%ofmostfrequentlemmasshared.
•  Professionals(allwordclasses):el,uno,de,y,se,a,en,con,mirar,su,
por,qué,James,Rick,caminar,móvil,lo,hacia,estar,playa.
•  Professionals(N,V,A,Adj):mirar,James,Rick,caminar,móvil,estar,
playa,alrededor),dejarhablar,no,vaso,Jess,tener,hombre,llevar,
banco,haber,mano,sonido.
•  Students(allwordclasses):el,uno,de,y,se,a,en,mirar,con,su,por,
James,qué,Rick,caminar,Jess,no,hacia,estar,playa.
•  Students(N,V,A,Adj):mirar,James,Rick,caminar,Jess,no,estar,
playa,lado,café,dejar,hombre,buscar,año,arena,hablar,móvil,
tener,alrededor,blanco.
18
Text similarity
• TedPedersen’sTextSimilarityModule
(transmediacatalonia.uab.cat/web/similarity/WHW-ES-Pr)
• Mostpairs(81.58%)between0.30and0.39,with10%or
lessatalowerorhigherend
• Lesssimilar(2professionalproviders)andmoresimilar
(student-professional).
19
Semantic classes
• Seman,ctagsfromSuggestedUpperMergedOntology
(SUMO)
• Selec,veannota,onfocusingon
• Verbslinkedtospa,o-temporalsenngs,movements,
communica,on,andcharacterdescrip,on
• Adjec,veslinkedtocharacters’mood,hearingandsight
• Nounslinkedtospa,o-temporalsenngsandcharacters
• Adverbslinkedtospa,o-temporalsenngs
20
Semantic classes
• Mostfrequentseman,ctags:
• Loca,on(professional=12.09,students=11.80)
• Human,ObjectandMo,onintheprofessional
• Object,AuxandSeeinginthestudents’subcorpus
• Similarpercentages.
21
Possibilities of automatic tagging
• Firstapproachto:
• Loca,on(70%mostfrequentloca,onnounsshared,similar
percentages)
• Describingcolours(students=2.27%,professionals2.14%,
fourmostfrequentshared,interes,ngdescrip,onofa
characteraspelirrojo/rubio)
• Describingcharacters(90%ofthemostfrequentwords
shared)
22
Conclusions
• Needforcorpus-basedresearch
• Needforopenaccessdata
• Possibili,esofautoma,ctagging
• Interestinanalysingvaria,onbutalsocontrastwith
generallanguagecorpora
• Quan,ta,vefollowedbyqualita,vestudies
23
Intralinguis,cvaria,onintheaudio
descrip,onofashortfilm
AnnaMatamala
UniversitatAutònomadeBarcelona
TransMediaCataloniaresearchgroup
[email protected]
AIETI.AlcaládeHenares,8-10.03.17
FFI2015-64038-P(MINECO/FEDER,UE),2014SGR0027