Memory-based models of language learning and processing

Antal van den Bosch & Walter Daelemans
IASCL Tutorial, Amsterdam, July 14, 2014
https://sites.google.com/site/iascl2014tutorial/

Tutorial structure
1. 13.00-13.50 (Walter): Introduction, MBLP and TiMBL, algorithms and metrics, German plural
2. 14.00-14.50: Case studies:
   1. acquisition of word stress (Walter)
   2. acquisition of the dative alternation (Antal)
3. 15.00-15.50 (Antal): Comparison, psychological validity, discussion

Memory-Based Language Processing. Daelemans, Walter, and Van den Bosch, Antal (2005). Cambridge, UK: Cambridge University Press.

Generalization without abstraction
• Rules, probabilities: + abstraction, + generalization
• Analogical reasoning from memory: + generalization, - abstraction
• Rote learning, table lookup: - generalization, - abstraction
• Forgetting?

Storage and Analogy
"This 'rule of nearest neighbor' has considerable elementary intuitive appeal and probably corresponds to practice in many situations. For example, it is possible that much medical diagnosis is influenced by the doctor's recollection of the subsequent history of an earlier patient whose symptoms resemble in some way those of the current patient." (Fix and Hodges, 1952, p. 43)

The model is the data
[Diagram: core data vs. exceptions; rules abstract over the core, analogy operates over all exemplars]

[Figure slides: Example: Dutch diminutive; Classification/prediction in MBL; k-NN classification]

Implementation: k-NN in TiMBL
• Open source: http://ilk.uvt.nl/timbl
• Given k, the number of nearest neighbors to which the hyperball must be extended:
  – produce a distribution of the classes occurring in the hyperball
  – take the winning class, or convert the counts to probabilities (a local probability distribution)
(A code sketch follows the Overlap-distance slide below.)

The role of exceptions
• Empirical study on 7 datasets (phonology, morphology, syntax, semantics)

Daelemans, W., Van den Bosch, A., and Zavrel, J. (1999). Forgetting exceptions is harmful in language learning. Machine Learning, 34(1-3):11-41.
Exceptions: Good or Bad?
• Observations from the literature: ignoring low-frequency events harms performance (Bod; Collins & Brooks; Ng & Lee; Dagan et al.)
• Classical work on editing exceptions out of k-NN memories for noise suppression
• What effect does the forgetting of (exceptional) instances have on generalization performance?

Operationalizing exceptionality
Two metrics estimating the exceptionality of instances:
• Typicality (Zhang, 1992):
  – "family resemblance"
  – average likeness to instances of the same class vs. instances of different classes
• Class-prediction strength (Salzberg, 1990):
  – C/P
  – C = the number of times a memory instance was a nearest neighbor with the correct class
  – P = the number of times a memory instance was a nearest neighbor
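Both metrics can be sketched in a few lines of Python. This is a minimal illustration under the simplifying assumption of an unweighted overlap similarity over a memory of (features, class) pairs; the function names are ours, not TiMBL's API.

```python
def overlap_similarity(x, y):
    """Fraction of matching symbolic feature values (1.0 = identical)."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def typicality(instance, label, memory):
    """Zhang (1992): average similarity to same-class instances divided by
    average similarity to different-class instances ('family resemblance')."""
    same = [overlap_similarity(instance, x) for x, c in memory
            if c == label and x != instance]
    diff = [overlap_similarity(instance, x) for x, c in memory if c != label]
    return (sum(same) / len(same)) / (sum(diff) / len(diff))

def class_prediction_strength(memory):
    """Salzberg (1990): C/P per stored instance, where P counts how often the
    instance is another instance's nearest neighbor, and C how often it then
    also carries the correct class."""
    C, P = [0] * len(memory), [0] * len(memory)
    for i, (x, c) in enumerate(memory):
        # nearest neighbor of x among all other memory instances
        nn = max((j for j in range(len(memory)) if j != i),
                 key=lambda j: overlap_similarity(x, memory[j][0]))
        P[nn] += 1
        if memory[nn][1] == c:
            C[nn] += 1
    return [c / p if p else None for c, p in zip(C, P)]
```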
Similar item missing Fnöhk-­‐s Surname, product name Mann-­‐s Borrowings Kiosk-­‐s Acronyms BMW-­‐s Lexicalized phrases Vergissmeinnicht-­‐s Onomatopoeia, truncated roots, derived nouns, ... Daelemans, W. (2002). A comparison of analogical modeling to memory-­‐based language processing. Analogical modeling. Amsterdam, Benjamins, 157-­‐179. Data & RepresentaOon •  Symbolic features –  segmental informaOon (syllable structure 2 last syllables, right alignment) –  gender •  ~25,000 nouns from CELEX Class Frequency Umlaut Frequency
Example
(e)n
11920
Abart
e
6656
no
4646
Abbau
yes
2010
Abdampf
4651
no
4402
Aasgeier
yes
249
Abwasser
er
974
no
287
Abbild
yes
687
Abgang
s
967
Abonnement
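As an illustration of the representation, here is a hypothetical sketch of how a noun could be mapped onto such an instance. The syllable triples and the padding symbol "=" are assumptions for illustration, not the original CELEX-derived encoding.

```python
# Hypothetical encoding of a noun for the German plural task:
# onset/nucleus/coda of the last two syllables, right-aligned, plus gender.

def encode(syllables, gender):
    """syllables: (onset, nucleus, coda) triples; short words padded on the left."""
    padded = [("=", "=", "=")] * max(0, 2 - len(syllables)) + syllables[-2:]
    return [slot for syl in padded for slot in syl] + [gender]

# e.g. Abbau = Ab + bau, masculine; target class per the table: e (no umlaut)
instance = encode([("", "a", "b"), ("b", "au", "")], "M")
# -> ['', 'a', 'b', 'b', 'au', '', 'M']
```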
The Overlap distance function
Count the number of mismatching features:

$$\Delta(X,Y) = \sum_{i=1}^{n} \delta(x_i, y_i)$$

$$\delta(x_i, y_i) = \begin{cases} \dfrac{|x_i - y_i|}{\max_i - \min_i} & \text{if feature } i \text{ is numeric; otherwise} \\[4pt] 0 & \text{if } x_i = y_i \\ 1 & \text{if } x_i \neq y_i \end{cases}$$
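The metric and the k-NN classification step described earlier can be sketched as follows. This is plain illustrative Python, not TiMBL's actual implementation; numeric features are assumed to come with a known range.

```python
from collections import Counter

def delta(x, y, lo=0.0, hi=1.0):
    """Per-feature distance: scaled absolute difference for numeric values
    (assuming a known feature range lo..hi), 0/1 match/mismatch otherwise."""
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        return abs(x - y) / (hi - lo)
    return 0 if x == y else 1

def distance(X, Y):
    """Overlap distance: sum of the per-feature deltas."""
    return sum(delta(x, y) for x, y in zip(X, Y))

def knn_classify(query, memory, k=1):
    """Return the winning class plus the local class distribution among the
    k nearest memory instances. Caveat: TiMBL's k counts nearest *distances*
    (the hyperball may hold more than k instances); this sketch uses the
    simpler nearest-instances version."""
    ranked = sorted(memory, key=lambda item: distance(query, item[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0], votes
```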
Feature weighting
• Some features are more important than others
• Possible metrics: Information Gain, Gain Ratio, Chi Square, Shared Variance, ...
• Example, IG:
  – compute the entropy of the full database
  – for each feature: partition the database on all values of that feature, and for each value compute the entropy of the sub-database
  – take the weighted average entropy over all partitioned sub-databases
  – the difference between this "partitioned" entropy and the overall entropy is the feature's Information Gain

Entropy & IG: formulas

$$H(D) = -\sum_{c} P(c)\,\log_2 P(c) \qquad\qquad IG_i = H(D) - \sum_{v \in V_i} \frac{|D_v|}{|D|}\,H(D_v)$$

where $V_i$ is the set of values of feature $i$ and $D_v$ the sub-database of instances with value $v$ on feature $i$. The feature-weighted Overlap distance then becomes:

$$\Delta(X,Y) = \sum_{i=1}^{n} w_i\,\delta(x_i, y_i)$$
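The IG recipe above transcribes directly into code; this sketch assumes symbolic instances stored as tuples.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a class distribution."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(instances, labels, i):
    """IG of feature i: database entropy minus the weighted average entropy
    of the partitions induced by feature i's values."""
    H = entropy(labels)
    partitions = {}
    for x, c in zip(instances, labels):
        partitions.setdefault(x[i], []).append(c)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return H - remainder
```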
The MVDM distance function
• Estimate a numeric "distance" between pairs of values:
  – "e" is more like "i" than like "p" in a phonetic task
  – "book" is more like "document" than like "the" in a parsing task
  – "NNP" is more like "NN" than like "VBD" in a tagging task
• Builds, for each feature, a matrix of pairwise distances between its values (Stanfill & Waltz; Salzberg)

$$\Delta(X,Y) = \sum_{i=1}^{n} \delta(x_i, y_i), \qquad \delta(x_i, y_i) = \sum_{j=1}^{n} \left|\, P(C_j \mid x_i) - P(C_j \mid y_i) \,\right|$$

where $C_1, \dots, C_n$ are the classes.
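A sketch of the value-difference computation, estimating the conditional class probabilities from training counts (illustrative code, not TiMBL's implementation):

```python
from collections import Counter, defaultdict

def mvdm_table(values, labels):
    """Estimate P(class | value) for one feature from the training data."""
    counts = defaultdict(Counter)
    for v, c in zip(values, labels):
        counts[v][c] += 1
    classes = set(labels)
    return {v: {c: counts[v][c] / sum(counts[v].values()) for c in classes}
            for v in counts}

def mvdm_delta(table, v1, v2):
    """MVDM distance between two values of the same feature:
    the sum over classes of |P(C|v1) - P(C|v2)|."""
    return sum(abs(table[v1][c] - table[v2][c]) for c in table[v1])
```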
Exemplar weighting
• Scale the distance of a memory instance by some externally computed factor:
  – class-prediction strength
  – frequency
  – typicality
• Smaller distance for "good" instances, bigger distance for "bad" instances

$$\Delta(X,Y) = E_Y \sum_{i=1}^{n} \delta(x_i, y_i)$$

where $E_Y$ is the exemplar weight of memory instance $Y$.
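In code, exemplar weighting is a one-line modification of the distance computation sketched earlier; how the weight is derived from class-prediction strength, frequency, or typicality is a separate design choice.

```python
def exemplar_weighted_distance(query, exemplar, E_Y):
    """Overlap distance to a stored exemplar, scaled by its weight E_Y.
    E_Y < 1 pulls 'good' exemplars closer; E_Y > 1 pushes 'bad' ones away.
    distance() is the function from the earlier k-NN sketch."""
    return E_Y * distance(query, exemplar)
```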
Distance-weighted class voting
• Increasing the value of k is similar to smoothing
• Subtle extension: make more distant neighbors count less in the class vote:
  – linear inverse of distance (relative to the maximum)
  – inverse of distance
  – exponential decay
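A sketch of the exponential-decay variant (the linear-inverse and inverse-distance schemes only differ in the weighting line):

```python
import math
from collections import defaultdict

def weighted_vote(neighbors, decay=1.0):
    """Distance-weighted class voting: each nearest neighbor votes for its
    class with a weight that shrinks exponentially with its distance to the
    query. neighbors: list of (distance, class) pairs."""
    votes = defaultdict(float)
    for d, c in neighbors:
        votes[c] += math.exp(-decay * d)   # exponential-decay weighting
    return max(votes, key=votes.get)
```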
German Plural: acquisition data
Summary of previous studies:
• The developmental sequence of the acquisition of plural markers is still unclear (Kauschke et al., 2011)
• Existing nouns:
  – children mainly overapply -e or -(e)n (until age 3)
  – -s plurals are learned late
  – there is no single suffix that is applied by default in spontaneous speech and/or elicitation tasks
• Novel words:
  – children inflect novel words with -e or -(e)n
  – more "irregular" plural forms are produced than "defaults"

Bartke, Marcus, Clahsen (1995)
• 37 children, age 3.6 to 6.6
• Pictures of imaginary things, presented as neologisms:
  – rhymes of existing words or not
  – choice between -en and -s
• Results:
  – children are aware that unusual-sounding words require the default
[Bar chart: % overgeneralization to -en vs. -s, for rhymes and non-rhymes]

MBL simulation
• Split the CELEX data according to rhyme / non-rhyme
• Compare overgeneralization:
  – to -en versus to -s
  – as a percentage of the total number of errors
• Results:
  – overgeneralization to -en drops below the level of overgeneralization to -s with non-rhyming words
[Bar charts, panels "Words" and "Roots": overgeneralization (%) to -en vs. -s, for non-rhyming words and all words]

MBL simulation: results
• The model overapplies mainly -en and -e
• -e and -en are acquired fastest
• -s is learned late and imperfectly
• Mainly, but not completely, parallel to input frequency (more -s overgeneralization than -er overgeneralization)
[Figure: per-suffix learning curves (-en, -e, -s, -er)]
Discussion
• Three "classes" of plurals:
  – ((-en, ∅)(-e, -er))(-s)
• The former four suffixes seem "regular" and can be accurately learned using information from phonology and gender
• -s is learned reasonably well, but information is lacking
  – hypothesis: more "features" are needed (syntactic, semantic, meta-linguistic, ...) to enrich the "lexical similarity space"

Discussion
• No difference in accuracy and speed of learning with and without Umlaut
  – unlike what is attested in children
• Overall generalization accuracy is very high: 95%
• Implicitly implements schema-based learning (Köpcke), e.g. the instance schema *,*,*,*,i,r,M → e

Part II: Case Studies

Case Study 1: Word Stress Acquisition
Word stress: surface or deep features?
• Classic approach:
  – Principles and Parameters, UG
  – arguments from acquisition and typology
• Formalism: metrical trees, metrical grids, ordered constraints
• Stress = prominence relations between constituents in a hierarchical structure

YOUPIE (Dresher & Kaye, 1990)
• Assumptions:
  – 11 parameters (216 "languages")
  – task-specific system for learning stress (domain knowledge)
  – core grammar only
• Learning:
  – cue-based parameter setting results in a grammar of stress
• Performance:
  – generate a tree with the grammar and algorithmically determine the stress location

Parameters (with settings for Dutch)

| Parameter | Description |
|-----------|-------------|
| P1  | Word tree is right/left dominant |
| P2  | Binary/unbounded feet |
| P3  | Feet assigned from the left/right edge |
| P4  | Feet are right/left-dominant |
| P5  | Feet are / are not quantity-sensitive |
| P6  | Feet are quantity-sensitive w.r.t. rime / nucleus |
| P7  | Strong node in a foot must / mustn't branch |
| P8A | There isn't / is an extra-metrical syllable |
| P8  | Left- / right-most syllable is extra-metrical |
| P9  | Weak foot loses / doesn't lose foot status in a clash |
| P10 | Feet are / aren't assigned iteratively |
YOUPIE tested
• Experimental design:
  – 216 languages
  – 117 items per language, generated by the YOUPIE performance component (no exceptions, core only)
  – for each language, a grammar is learned with the YOUPIE cue-based learning component
• Results:
  – for 60% of the languages, YOUPIE reconstructs the original parameter setting with which the words were generated
  – for 21%, convergence is to a compatible setting
  – for 19% of the languages, errors in one or more stress patterns
• Upper boundary!
  – perfect input, no exceptions to be learned

Memory-Based Learning
• Assumptions:
  – lexical storage and generalization
  – generic learning method, no task-specific linguistic knowledge
  – core and periphery
• Learning:
  – based on storage of exemplars
• Performance:
  – similarity-based reasoning with feature weighting on stored exemplars

MBLP vs. YOUPIE

| System and level | Score | Sd    | Accuracy |
|------------------|-------|-------|----------|
| MBLP-words       | 104   | 15.01 | 89%      |
| YOUPIE-words     | 105   | 28.24 | 90%      |
| MBLP-syllables   |       | 3.7   | 97%      |
| YOUPIE-syllables |       | 11.88 | 95%      |
| MBLP-languages   | 89    |       | 41%      |
| YOUPIE-languages | 176   |       | 81%      |
Discussion
• No significant quantitative difference in performance
• Clear qualitative difference:
  – YOUPIE: more languages perfectly learned
  – MBLP: fewer errors per language
• Issues:
  – not real language data
  – only core, no periphery

Dutch stress
• Stress on one of the last three syllables
• Predictable, but not completely:
  – e.g., py-'ja-ma, 'ca-na-da, pa-ra-'plu
• Irregular words are not covered by the parameter configuration for Dutch:
  – they need lexical marking with exception features (one, two, or completely idiosyncratic)

MBLP on Dutch data
• CELEX, 4868 monomorphemes
• Exemplar encoding schemes, for each of the three final syllables:
  – S1: syllable weight (SL, L, H, SH)
  – S2: nucleus and coda (complete rhymes, VC notation)
  – S3: nucleus and coda (separate features, phonemes)
  – S4: onset, nucleus, and coda (phonemes)
• Class: final, penultimate, antepenultimate

Example
• Example word: kapitein /ka-pi-'tEIn/
  – stress on the final syllable
• Four encoding schemes (a code sketch follows the references below):
  – S1: encoded as three numeric features (syllable weights): L,L,SH → FIN
  – S2: encoded as one rhyme feature per syllable: VV,VV,VXC → FIN
  – S3: semi-linguistic encoding of the rhymes of the last three syllables: a,-,i,-,ei,n → FIN
  – S4: knowledge-free encoding of all phonemes in the last three syllables: k,a,-,p,i,-,t,ei,n → FIN

Results / error analysis
• Linguistically informed features (S1 and S2) are best for regular words:
  – 99.3% and 99.4% versus 93.5% and 92.6%
• But the other two (S3 and S4) are better for irregular words:
  – 2.5% and 7.6% versus 68.7% and 71.2%
• Overall, the knowledge-free encoding scores best, and better for "marked" cases
[Figure: error percentages]

Language acquisition
• Learning rules or learning lexical items?
• Rules (Hochberg '88, Spanish; Nouveau '93, Dutch):
  – lexical learning lacks generalization capacity
  – lexical learning is incompatible with acquisition data
• Imitation task:
  – errors increase with irregularity
  – tendency to regularization (but irregularization occurs):
    • by stress shift
    • by changing the structure of the repeated word

Discussion
• MBLP error correlates with markedness, like children's errors
• MBLP has a tendency for regularization, like children:
  – direction of stress shifts
  – structural changes (from inspection of nearest neighbors)
• Irregularization, and the differences between 3- and 4-year-olds on marked patterns, are hard to explain in a rule-based context

Rule learning is not the only possible explanation for the language acquisition data.

MBL word stress
Durieux, G. (2003). "Computermodellen en klemtoon." Fonologische Kruispunten, BICN.
Daelemans, W., Gillis, S., and Durieux, G. (1994). The acquisition of stress: A data-oriented approach. Computational Linguistics, 20:421-451.
Daelemans, W., Gillis, S., Durieux, G., and Van den Bosch, A. (1993). Learnability and markedness: Dutch stress assignment. In T.M. Ellison and J.M. Scobbie (Eds.), Computational Phonology. Edinburgh Working Papers in Cognitive Science, 8, pp. 157-178.
Gillis, S., Daelemans, W., and Durieux, G. (2000). Lazy learning: A comparison of natural and machine learning of stress. In P. Broeder and J. Murre (Eds.), Models of language acquisition: Inductive and deductive approaches, pp. 76-99. Oxford, UK: Oxford University Press.
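As a sketch of how the S3 and S4 encodings of kapitein shown above could be derived from a syllabified transcription (the syllable triples below are illustrative assumptions; "-" marks an empty slot):

```python
syllables = [("k", "a", "-"), ("p", "i", "-"), ("t", "ei", "n")]  # last three

# S4: knowledge-free -- all phonemes (onset, nucleus, coda) of the last
# three syllables
s4 = [slot for syl in syllables for slot in syl]
# -> ['k', 'a', '-', 'p', 'i', '-', 't', 'ei', 'n'], class 'FIN'

# S3: rhyme only -- nucleus and coda of the last three syllables
s3 = [slot for (_, nuc, coda) in syllables for slot in (nuc, coda)]
# -> ['a', '-', 'i', '-', 'ei', 'n'], class 'FIN'
```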
Summary
• Goal: put MBLP to the test on a concrete linguistic problem of sufficient complexity by comparing it to:
  – linguistic theory
  – child language acquisition data
  – adult processing data
• Results:
  – MBLP and YOUPIE (P&P/UG) are comparable
  – MBLP can learn the core as well as the periphery using surface representations
  – MBLP shows the same errors and tendencies as children learning stress placement
  – MBLP is the better predictor of human adult behaviour with non-words

A word from Gupta and Touretzky (1993:27)
"It could be argued that a theoretical account is a descriptive formalism, which serves to organize the phenomena by abstracting away from the exceptions in order to reveal an underlying regularity, and that it is therefore a virtue rather than a failing of the theoretical analysis that it ignores 'performance' considerations. ... However, it becomes difficult to maintain this with respect to a processing model that uses the descriptive formalism as its basis: the processing or learning account still has to deal with actual data and actual performance phenomena."

Case Study 2: A Memory-Based Learning account of the dative alternation in individual children
Antal van den Bosch and Joan Bresnan (submitted). Modeling Dative Alternations of Individual Children.

Dative alternation data
• The evil queen gives the poisonous apple to Snow White (PD)
• The evil queen gives Snow White the poisonous apple (DO)
• As used in De Marneffe, Grimm, Arnon, Kirby, & Bresnan (2012), Language and Cognitive Processes
• Extracted PD and DO constructions with V = give, show from children's transcribed spontaneous conversations with caretakers in CHILDES (MacWhinney, 2000)
• 530 instances from 7 children, ages 2-5
• 788 instances from 5 adult caretakers of 3 of the children
[Table: per-child dative statistics (7 children)]

Features
Lexical features:
• Verb ('give' in 'give me a hug')
• Theme ('a hug' in 'give me a hug')
• Recipient ('me' in 'give me a hug')
Linguistic features:
• Theme: givenness, animacy (with/without toys), length, pronoun status
• Recipient: givenness, animacy (with/without toys), length, pronoun status
• Prime (dative in the last 10 utterances)

Learning curves of individual children
• Can we predict a child's next dative choice?
• For some children, a decent number of dative choices is available:
  – sufficient for per-child experiments
  – child-directed speech and other children's speech may be added
• Time should determine the split points between training and test data:
  – it is unrealistic to train on future data and test on past data
• Learning curve experiment (sketched below):
  – grow a training set of per-day attestations
  – test on each next day's attestations
  – aggregate how many attestations are predicted correctly
  – this might reveal smooth or non-smooth learning success over time
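A sketch of this experimental loop, reusing the knn_classify function from the earlier sketch; the real experiments optionally mix child-directed and other children's speech into the growing memory.

```python
def learning_curve(days):
    """Per-day incremental experiment: train on all dative attestations up
    to day d, test on day d+1's attestations.
    days: ordered list of lists of (features, choice) pairs per recording day."""
    memory, results = [], []
    for today, tomorrow in zip(days, days[1:]):
        memory.extend(today)                      # grow the training set
        correct = sum(knn_classify(x, memory, k=1)[0] == y
                      for x, y in tomorrow)
        results.append(correct / len(tomorrow))   # per-day accuracy
    return results
```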
Learning curves of individual children
• Adam's data:
  – 50 days
  – 221 datives produced by Adam
  – 207 datives directed at Adam
• Incrementally grow Adam's data:
  – train on Adam's own data up to day d, test on all of Adam's next datives on day d+1
  – possibly with other children's data added to the training set
  – possibly with speech directed at Adam
  – possibly with other children's directed speech

[Diagram: individual learning curves, incremental train/test splits: train on the datives of days 1..d (e.g. Dative 1a-1c, 2a-2b, 3a-3c), test on the datives of day d+1]

Comparing augmented training data (AUC)

| Child  | # datives (CDS) | Child only | + other children | + CDS | + all |
|--------|-----------------|------------|------------------|-------|-------|
| abe    | 74              | 0.50       | 0.84             | 0.87  | 0.86  |
| adam   | 221 (207)       | 0.80       | 0.77             | 0.80  | 0.80  |
| naomi  | 21              | 0.50       | 0.52             | 0.81  | 0.58  |
| nina   | 146 (443)       | 0.71       | 0.74             | 0.76  | 0.79  |
| sarah  | 19              | 0.50       | 0.83             | 0.88  | 0.83  |
| shem   | 15 (138)        | 0.50       | 0.74             | 0.88  | 0.74  |
| trevor | 33              | 0.50       | 0.72             | 0.86  | 0.73  |

[Figure: How well do Adam and Nina approximate CDS?]

Theijssen (2012)
• Compares regression-based and memory-based learning (MBL) accounts of the dative alternation
• Dataset: 11,784 PD and DO constructions extracted from the BNC (7,757 spoken, 4,027 written)
• Regression on (automatically determined) higher-level features:
  – animacy, definiteness, givenness, pronominality, and person of the recipient; definiteness, givenness, and pronominality of the theme
• MBL on the basis of TiMBL (Daelemans & Van den Bosch, 2005) uses lexical features only; leave-one-out testing

Theijssen, D. (2012). Making choices: Modelling the English dative alternation. Ph.D. thesis, Radboud University Nijmegen.

Theijssen's conclusions
• MBL is as accurate as the regression analysis:
  – 93.1% accuracy vs. 93.5% fit
  – MBL does not use higher-level features
• The preference of specific verbs for either DO or PD, and the significant effect of the length difference between recipient and theme, are both implicitly present in the lexical input
• "there is no reason to abstract away from the original input by defining higher-level features" (p. 106); the success of MBL "is a reason to call into question the importance of higher-level features in language processing" (p. 119)

Nature of the human language processing architecture
• How is linguistic knowledge represented?
  – words and rules; mental lexicon and grammar
  – constructions, schemas
  – similarity-based reasoning over examples
• How is linguistic knowledge acquired?
  – innate rules / constraints
  – bottom-up discovery of constructions, (partially) schematic units
  – storage of (patterns of) observable linguistic items: words, syllables, segments, ...

Example-based modeling
• Cognitive plausibility, psychological reality?
  – naturally models incremental learning and forgetting
  – activation, inhibition
  – models episodic memory, global matching models (Clark & Gronlund, 1996)
  – models fast matching of cues in sequence rather than slow retrieval of order (McElree, 2006; Lewis, Vasishth, and Van Dyke, 2006)

Part III: Comparison and psychological validity

Eager vs. Lazy Learning
• Eager: learning is compression
  – Minimal Description Length principle
  – Ockham's razor
  – minimize the size of the abstracted model (core) plus the length of the list of exceptions not covered by the model (periphery)
• Lazy: learning is storage of exemplars + analogy
  – In language, what is core and what is periphery?
    • small disjuncts, pockets of exceptions, polymorphism, ...
    • Zipfian distributions
  – "Forgetting exceptions is harmful in language learning"

A Memory-Based Learning account of constructional differences between Netherlandic and Belgian Dutch
with Stef Grondelaers and Dirk Speelman

New question
• Is MBL equally successful with one of the most difficult syntactic variables in Dutch (Van der Wouden 2009: 300), which has been modelled almost exclusively on the basis of higher-order processing features (Grondelaers et al. 2007, 2008, 2009)?

Existential er in the locative-initial construction
(1) Er lag een sigarenpeuk in de asbak. ("There was a cigar butt in the ashtray")
(2) In de asbak lag een sigarenpeuk. ("In the ashtray was a cigar butt")
(3) In de asbak lag er een hagelkorrel. ("In the ashtray there was a hailstone")
• "For the distribution of er, no strict rules can be given. It can be optional, there may be semantic or stylistic differences involved, and there is a lot of individual and sometimes also regional variation in its use" (ANS 1997)

Alternative analysis
• In (Belgian) Dutch, er is an expectancy monitor inserted to facilitate the processing of an unpredictable subject:
  – Comprehension: in self-paced reading and eye-tracking experiments, er reduces the reading time of unpredictable subjects but not of predictable subjects (Grondelaers et al. 2002; Grondelaers et al. 2009)
  – Production: er is systematically produced in low-predictability contexts: the distribution of er in locative-initial constructions can be predicted in 85% of all cases in a regression analysis with 7 higher-order low-predictability factors (Grondelaers, Speelman & Geeraerts 2002; Grondelaers & Speelman 2007)

New research questions
• Can MBL classify -er and +er in adjunct-initial sentences as accurately as earlier regression-based accounts?
• Does lexical input suffice, as it did for the dative alternation in Theijssen (2012)?
• Does MBL confirm the regression-based finding that there are outspoken preferential differences between Belgian and Netherlandic Dutch?

Additional problem: er in Netherlandic Dutch?
• er is much less frequent in adjunct-initial constructions in Netherlandic Dutch, and sentences with a concrete locative adjunct followed by er are typically rejected:
  *In het linkerportier zat er een deuk ("In the left car door was er a dent")

Regression: dependent & independent variables
• Dependent variable (class): presence of er
• Independent variables (features): higher-order features which enhance subject predictability:
  – taxonomically vague vs. specific main verb: "On the tree fluttered a ____" vs. "On the tree was a ____"
  – vague vs. concrete locative adjuncts: concrete locative adjuncts ("in her lunchbox was a ____") raise the expectancy of specific subjects; vague adjuncts ("in Brussels was a ____") do not
  – non-topical vs. topical adjuncts: topical adjuncts "entrench" the adjunct sentence in the preceding context, which increases the number of available inferences to anticipate the subject

Test sets
• Test sets based on the Belgian and Netherlandic datasets used in the previous regression analyses, but with Usenet materials removed:
  – Belgian set: 649 instances (468 -er; 181 +er)
  – Netherlandic set: 291 instances (244 -er; 47 +er)
• Each instance entered in TiMBL as a tokenised string of 10 features: a left context of 5 words and a right context of 5 words around the classifier er or null

Memory-based Learning: training sets
• Test sets in a leave-one-out procedure (iterative training on all-but-one instances and testing on the remaining instance)
• Additional training sets:
  – Belgian database of 115,451 adjunct-initial clauses (16.8% +er), automatically extracted from the Leuven News Corpus on the basis of Alpino-parsed instances from the test set
  – Netherlandic database of 135,248 adjunct-initial clauses (9.1% +er), automatically extracted from the Twente News Corpus on the basis of Alpino-parsed instances from the test set

Evaluation metrics (sketched below)
• Accuracy: overall proportion of correct classifications
• Precision: proportion of predicted classifications that are correct (high precision means the algorithm returned more relevant than irrelevant results for a class)
• Recall: proportion of target classifications that were correctly predicted (high recall means the algorithm detected many of the actual cases of a class)
• (Paramsearch was applied to find the best parameter settings)
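These metrics, sketched for the binary er / null decision (illustrative code; the actual experiments used TiMBL and Paramsearch):

```python
def precision_recall_accuracy(gold, predicted, target="+er"):
    """Per-class precision and recall plus overall accuracy for the
    binary er / null-variant classification."""
    tp = sum(g == p == target for g, p in zip(gold, predicted))
    precision = tp / sum(p == target for p in predicted)   # correct among predicted
    recall = tp / sum(g == target for g in gold)           # found among actual
    accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
    return precision, recall, accuracy
```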
Results
[Table/figure: classification results]

Conclusions
• The majority of locative-initial clauses does not contain er → much higher precision and recall scores for the null variant (without er)
• The MBL data mirror the regression findings:
  – the Netherlandic data can be modelled just as accurately on the basis of lexical input as on the basis of a regression with abstract higher-order features
  – the Belgian distribution of er is more difficult to model than the Netherlandic one
• Since MBL accuracy is below regression accuracy in Belgian Dutch, it seems that lexical input does not suffice to train a learning algorithm to insert er

MBL as a research tool
• Memory-based learning is a versatile modeling approach:
  – it is a supervised ML algorithm like any other
  – worked with mixed-effects logistic regression? Consider MBL
• Allows for testing lexical vs. higher-level features:
  – often shows that lexical features implicitly carry the same predictive power as abstract features
Part IIIb: Comparison and psychological validity
Case studies: impaired adjective ordering; English past tense

Adjective ordering
• Preferences:
  – a large brown desk vs. a brown large desk
  – value ≺ dimension ≺ physical property ≺ speed ≺ human propensity ≺ age ≺ color (Dixon, 1982)
  – degree of absoluteness (Martin, 1969): fewer meaning shifts with nouns → closer to the noun
  – degree of objectiveness (Whorf, 1945): more inherent property → closer to the noun

Impaired adjective ordering
Vandekerckhove, B., Sandra, D., and Daelemans, W. (2013). Selective impairment of adjective order constraints as overeager abstraction: An elaboration on Kemmerer et al. (2009). Journal of Neurolinguistics, 26:46-72.
• Kemmerer, Tranel, Zdanczyk (2009):
  – report patients (similar to semantic aphasia) who fail to discriminate between preferred and dispreferred orders of prenominal adjectives, yet were sensitive to the order of adjectives in relation to other parts of speech
  – conclude that knowledge of the semantic constraints on prenominal adjective order can be impaired without an impairment of purely syntactic adjective order knowledge, or of knowledge of semantic adjective classes
• Hypothesis:
  – the impairment can be characterized as overeager abstraction, or oversmoothing
  – MBL can be tuned to overeager abstraction by increasing k
  – patients' impairments might affect not so much abstract linguistic knowledge of explicit semantic constraints as the level of abstraction they employ during linguistic processing

An MBL bigram language model (sketched in code below)
• N-gram models are the workhorse of NLP:
  – e.g., in statistical MT and automatic speech recognition
  – correlate with eye fixation duration, pupil dilation, EEG, fMRI (McDonald & Shillcock, 2003; work of Stefan Frank)
• Given a word, predict the next word:
  – single feature, MVDM metric
  – k as a smoothing parameter

Simulations
• Task 1: judgement task on adjective order
  – a huge gray elephant vs. a gray huge elephant
• Task 2: judgement task on ordering with other parts of speech
  – a hilly bumpy road vs. road bumpy hilly a
• Task 3: judgement task on semantic similarity
  – wide vs. narrow or blue

[Figure: participants vs. model]

Analysis
• The big bad wolf:
  – idiosyncratic: bad usually precedes big (bad little, bad old)
  – yet, with k < 5, MBL predicts big bad
  – only at k > 100 is the preference for the big bad wolf vs. wolf big bad the lost
• Task 1 and task 2 are tasks on a single continuum:
  – in contrast to Kemmerer et al.'s analysis

English Past Tense: Words + Rules
• Mental rule(s) + memory
• The past tense of to plit is plitted:
  – VERB + -ed (default rule)
  – explains generalization and default behavior
• Is the past tense of to spling: splinged, splang, or splung? (Bybee & Moder, 1983; Prasada & Pinker, 1993)
  – race between Rule and Words (associative memory)

Memory-Based Alternative
Keuleers, E. (2008). Memory-based learning of inflectional morphology. PhD dissertation, University of Antwerp.
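Returning to the MBL bigram language model above, a minimal sketch: memory is the bigram table, and the MVDM neighbors of the previous word (computable with the mvdm_table/mvdm_delta sketch from Part I) are allowed to vote on the next word. Growing k lets more, less similar words vote, i.e. it oversmooths; this is how the impairment is modeled as overeager abstraction.

```python
from collections import Counter

class MBLBigramModel:
    """Sketch of a memory-based bigram model: one feature (the previous
    word); the prediction is the next-word distribution among the k most
    similar stored contexts."""

    def __init__(self, tokens):
        # memory: all attested (previous word, next word) bigrams with counts
        self.bigrams = Counter(zip(tokens, tokens[1:]))

    def predict(self, word, neighbors=()):
        """neighbors: the word's k-1 nearest feature values under MVDM;
        with k=1, only the word's own continuations vote."""
        votes = Counter()
        for (prev, nxt), n in self.bigrams.items():
            if prev == word or prev in neighbors:
                votes[nxt] += n
        return votes.most_common(1)[0][0] if votes else None

# model = MBLBigramModel("a big bad wolf and a big bad dog".split())
# model.predict("big")   # -> 'bad' (its most frequent continuation)
```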
Prasada & Pinker Data
• Rate on a scale from 1 to 7:
  – Today, I spling; yesterday I splinged
  – Today, I spling; yesterday I splung
  – Today, I plip; yesterday I plipped
  – Today, I plip; yesterday I plup

Features: peak-valley representation
• Features:
  – phonological segmental information of the last two syllables (onset, nucleus, coda)
  – per syllable: a sonority "peak-valley" representation (Keuleers and Daelemans, 1997)
• Classes:
  – transformation label derived from the difference between the root and the past tense form (+Id, +tId, I→A)
• CELEX: present/past tense forms occurring at least once in COBUILD
• Rating ~ class distribution of the nearest neighbors, with maximal k and exponential decay
• Simulation of the results of Prasada and Pinker (1993)

[Figure slides: Classes: transformations; Increasing k; "Spling"; Bigger picture, with exponential decay; Prasada and Pinker simulation]

Replication study in Dutch
• Neologisms: spling, fring, flape, plip, ploag, trilb
• MBL: exponential decay, maximal k, entropy of the class distribution in the nearest neighbors

Discussion
• Psychological-cognitive validity?
  – relation to psycho/neuro views on memory
  – the crucial role of k
• Explicit vs. implicit linguistics:
  – abstract vs. surface features
  – continuum between lexical, syntactic, and semantic domains
  – no core/periphery distinction
  – "implicit linguistics"
• Child language acquisition

Additional work with TiMBL
• CLiPS Antwerp (Daelemans, Durieux, Gillis, Vandekerckhove, Keuleers, Sandra, et al.):
  – Dutch gender, word stress, plural
  – English past tense
  – German plural
• David Eddington:
  – Spanish stress and gender
  – Italian conjugation
• Andrea Krott, Baayen & Schreuder:
  – Dutch and German compound linking element
• Ingo Plag et al.:
  – English compound stress
• What are your ideas?

https://sites.google.com/site/iascl2014tutorial/
[email protected]
[email protected]