MeadeHPCForum

HPC in linguistic research
Andrew Meade
University of Reading
HPC use in linguistic research
• Linguistic and biological models
• Phylogenies
• Linguistic data
• Models of evolution
• Parallelism
• Scaling
• Results
• Ongoing work
• Key challenges
Linguistic and biological systems
Attribute | Genetics | Linguistics
Discrete units | nucleotides, codons, genes, individuals | words, grammar, syntax
Replication | transcription | teaching, learning, imitation
Dominant mode(s) of inheritance | parent-offspring, clonal | parent-offspring, peer groups, teaching
Horizontal transmission | many mechanisms | borrowing
Mutation | many mechanisms: SNPs, mobile DNA | mistakes, vowel shifts, innovation
Selection | fitness differences among alleles | ?
Inferring evolutionary histories from linguistic data
• Evolutionary histories, phylogenies
• Tools for understanding evolution
  • Depict relationships between languages
  • Identify groups which share a common ancestor
  • Calculate the timing of events
  • Account for lack of independence in the data
  • Inferred from data taken from different languages
  • Using an explicit statistical model of evolution
  • Problem is NP-hard; the number of possible trees grows as a double factorial
  • Search methods: Markov chain Monte Carlo, heuristic search, hill climbing
• Product of Data + Model
[Figure: phylogeny with labelled groups: Greek, Indo-Iranian, Slavic, Celtic, Germanic, Romance]
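The double-factorial growth mentioned on this slide can be made concrete. A minimal sketch, assuming rooted, bifurcating trees with labelled tips, for which the standard count is (2n - 3)!!; the function names are illustrative and not taken from the talk.

```python
def double_factorial(k: int) -> int:
    """Return k!! = k * (k - 2) * (k - 4) * ... ; defined as 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def count_rooted_trees(n_taxa: int) -> int:
    """Number of rooted, bifurcating, labelled trees for n_taxa tips: (2n - 3)!!."""
    if n_taxa < 2:
        return 1
    return double_factorial(2 * n_taxa - 3)


# Taxa counts of the kind quoted in this talk; the counts grow far too fast
# for exhaustive search, which is why stochastic search is used instead.
for n in (4, 16, 87, 400):
    print(n, "taxa:", len(str(count_rooted_trees(n))), "digits in the tree count")
```

Even a handful of languages gives more trees than can be enumerated, so the search has to sample tree space rather than cover it.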
The Data
• Swadesh list, Morris Swadesh, 1940 onwards
• 200 meanings, present in (almost) all languages
• Chosen to be stable, slowly evolving and resistant to borrowing
• Somewhat of a language “gene”
Cognate classes
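• Words with a common evolutionary ancestry and meaning
Language | Word | Cognate class
English | Fish | Fish
Danish | Fisk | Fish
Dutch | Visch | Fish
Czech | Ryba | Ryba
Russian | Ryba | Ryba
Bulgarian | Riba | Ryba
Fish class: 34 other languages
Ryba class: 23 other languages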
Data coding, Cognates
• Cognates: words and meanings that are derived from a common ancestor
• Languages evolve by a process of descent with modification
Language | “When” (1 cognate) | Code | “Water” (3 cognates) | Code
English | when | 1 | water | 100
German | wann | 1 | wasser | 100
French | quand | 1 | eau | 010
Italian | quando | 1 | acqua | 010
Greek | qote | 1 | nero | 001
Hittite | kuwapi | 1 | watar | 100
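The coding on this slide is a presence/absence string with one position per cognate class. A minimal sketch, assuming each language has already been assigned a cognate class index for the meaning; the function name and the class numbering are illustrative.

```python
def code_meaning(class_of_language: dict) -> dict:
    """Turn cognate class indices into binary strings, one position per class."""
    n_classes = max(class_of_language.values()) + 1
    return {
        language: "".join("1" if k == cls else "0" for k in range(n_classes))
        for language, cls in class_of_language.items()
    }


# Class indices follow the slide: "water" has 3 cognate classes, "when" has 1.
water = {"English": 0, "German": 0, "French": 1, "Italian": 1, "Greek": 2, "Hittite": 0}
when = {language: 0 for language in water}

for language in water:
    print(language, code_meaning(when)[language], code_meaning(water)[language])
```

Each position in the string becomes one binary site in the data matrix, so a 200-meaning list expands into many more sites than meanings.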
Continuous-time Markov Model
[Figure: two-state model, state 0 (non-cognate) and state 1 (cognate), with transitions Q01 (0 to 1) and Q10 (1 to 0)]
Q01: rate at which cognates are gained
Q10: rate at which cognates are lost
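For a two-state continuous-time Markov model the transition probabilities over a branch of length t have a simple closed form. A minimal sketch, assuming Q01 is the gain rate and Q10 the loss rate as labelled above; the rate values and the function name are illustrative only.

```python
import math


def transition_probabilities(q01: float, q10: float, t: float):
    """Return the 2x2 matrix P(t), where P[a][b] = P(state b at time t | state a at 0)."""
    total = q01 + q10
    decay = math.exp(-total * t)
    p_gain = (q01 / total) * (1.0 - decay)  # 0 -> 1 over time t
    p_loss = (q10 / total) * (1.0 - decay)  # 1 -> 0 over time t
    return [[1.0 - p_gain, p_gain],
            [p_loss, 1.0 - p_loss]]


# Illustrative rates: cognates are lost four times faster than they are gained.
for t in (0.0, 0.5, 5.0):
    print(t, transition_probabilities(q01=0.2, q10=0.8, t=t))
```

At t = 0 the matrix is the identity, and for long branches both rows approach the stationary frequencies q10/(q01+q10) and q01/(q01+q10).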
The Likelihood Model
• Calculates the probability of a tree (T), given the data (D) and
model of evolution (M). Fitness / evaluation
• Accounts for > 99% of the run time
$$P(T \mid D, M) = \prod_{i} \prod_{j} P(D_j, M_i \mid T)$$
Product over $i$: the model, 1-12 categories
Product over $j$: the data, 200-100,000 sites
Trivially parallel
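A product over up to 100,000 sites cannot be evaluated directly in floating point, and the usual remedy also explains why the calculation is trivially parallel. A minimal sketch, assuming hypothetical per-site likelihood values of 0.5: the product is replaced by a sum of logs, and a sum can be split across workers in any order.

```python
import math


def naive_likelihood(site_likelihoods):
    """Direct product over sites; underflows to 0.0 for long alignments."""
    return math.prod(site_likelihoods)


def log_likelihood(site_likelihoods):
    """Sum of per-site log-likelihoods; numerically safe."""
    return sum(math.log(p) for p in site_likelihoods)


sites = [0.5] * 100_000  # hypothetical per-site likelihoods

print(naive_likelihood(sites))  # underflow: no usable information left
print(log_likelihood(sites))

# Each of 4 workers takes every 4th site; the partial sums add up to the same total.
partials = [log_likelihood(sites[k::4]) for k in range(4)]
print(sum(partials))
```

Because each site contributes one independent term, the only communication needed between workers is the final addition of partial sums.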
Level of parallelism
Data – Analysis of multiple datasets (3-5)
Model – Test a range of models (10-20)
Run – Stochastic process multiple runs (5-10)
Code – individual run can still take years
The problem
• 2003 – 16 taxa, 125 sites, 1 x model
• 2005 – 87 taxa, 2450 sites, 4 x model
• 2007 – 400 taxa, 34,440 sites, 100 x model
• Complexity up 700,000x, 5-6 orders of magnitude
• 4.8 years per run, typically 5 publication quality runs + 10 model tests
• 4.8 years > attention span of academics
• Results are required in days
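The 700,000x figure can be checked from the numbers on this slide. A minimal sketch, assuming the cost of one likelihood evaluation scales linearly with taxa x sites x model cost; that scaling rule is an assumption made here for the arithmetic, not something the slide states.

```python
import math

# (taxa, sites, relative model cost) as quoted on the slide
problem_2003 = (16, 125, 1)
problem_2007 = (400, 34_440, 100)


def relative_cost(problem):
    """Assumed cost model: taxa * sites * model cost."""
    taxa, sites, model = problem
    return taxa * sites * model


factor = relative_cost(problem_2007) / relative_cost(problem_2003)
print(f"growth factor: {factor:,.0f}")
print(f"orders of magnitude: {math.log10(factor):.2f}")
```

Under this assumption the factor is 688,800, which is consistent with the rounded 700,000x and with the 5-6 orders of magnitude quoted above.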
Parallel method 1
Distribute the data (MPI)
[Figure: binary data matrix (Languages x Cognates) split into blocks, with one block of data assigned to each of Core 1, Core 2 and Core 3]
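The data decomposition can be sketched without an MPI installation. A minimal sketch in plain Python, assuming the sites (cognate columns) are cut into contiguous blocks, one per rank, and that the partial results are summed as an MPI reduce would do; `site_log_likelihood` is a hypothetical stand-in for the real per-site calculation.

```python
import math


def block_bounds(n_sites: int, n_ranks: int, rank: int):
    """Half-open [start, stop) range of sites owned by `rank`; sizes differ by at most 1."""
    base, extra = divmod(n_sites, n_ranks)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def site_log_likelihood(column):
    """Hypothetical placeholder: a smoothed frequency of 1s, not a tree likelihood."""
    return math.log((sum(column) + 1) / (len(column) + 2))


def partial_log_likelihood(columns, n_ranks, rank):
    """Work done by one rank: only its own block of sites."""
    start, stop = block_bounds(len(columns), n_ranks, rank)
    return sum(site_log_likelihood(col) for col in columns[start:stop])


# Toy data: 7 sites, each column holds the 0/1 states of 3 languages.
columns = [[0, 1, 1], [1, 0, 1], [0, 0, 0], [1, 1, 1], [1, 0, 0], [0, 1, 0], [1, 1, 0]]
partials = [partial_log_likelihood(columns, 3, rank) for rank in range(3)]
total = sum(partials)  # the step a reduce(sum) across ranks would perform
print(partials, total)
```

In a real MPI code each rank would hold only its own block in memory, and the loop over ranks here would be replaced by the processes themselves.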
Parallel method 2
Distribute the model (OpenMP)
[Figure: four passes over the same data (Pass 1 to Pass 4), with one pass assigned to each of Core 1 to Core 4]
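The model decomposition gives every worker the full data and a different pass of the model. A minimal sketch, with Python threads standing in for OpenMP threads; it shows the decomposition only and makes no claim about speed-up. `category_pass` and its toy probability are hypothetical placeholders, not the model used in the talk.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def category_pass(rate, data):
    """One pass over all sites for a single model category (toy calculation)."""
    p_one = 1.0 - math.exp(-rate)  # toy probability of observing state 1
    return sum(math.log(p_one if state == 1 else 1.0 - p_one) for state in data)


def all_passes(rates, data, n_threads):
    """Run one pass per category, each on its own thread; results keep category order."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(lambda rate: category_pass(rate, data), rates))


rates = [0.25, 0.5, 1.0, 2.0]    # four illustrative categories, one per pass
data = [0, 1, 1, 0, 1, 0, 0, 1]  # shared, read-only site data
print(all_passes(rates, data, n_threads=4))
```

Because all passes read the same data, shared memory suits this level, whereas splitting the data itself maps more naturally onto separate processes.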
Distribute the data and the model (MPI + OpenMP)
[Figure: four passes (Pass 1 to Pass 4), each over data that is itself split across cores, using Core 1 to Core 8 in total]
[Figure: two scaling plots, run time in seconds (log 10) against cores, and efficiency against cores]
Results
• Runtime reduced from 4.8 years to:
Cores | Days
60 | 31.5
150 | 14.5
300 | 8.5
600 | 6
• Good scaling, but not sustainable
• HPC has allowed for the accurate analysis of large complex
data sets with statistically justifiable models.
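The "good scaling, but not sustainable" point can be read off the table of cores and days. A minimal sketch that measures speed-up and efficiency relative to the 60-core run, so that no assumption is needed about how the 4.8-year figure was obtained.

```python
# (cores, days) pairs as quoted in the results table
runs = [(60, 31.5), (150, 14.5), (300, 8.5), (600, 6.0)]


def relative_scaling(runs):
    """Speed-up and parallel efficiency of each run relative to the first run."""
    base_cores, base_days = runs[0]
    rows = []
    for cores, days in runs:
        speedup = base_days / days
        efficiency = (base_cores * base_days) / (cores * days)
        rows.append((cores, days, speedup, efficiency))
    return rows


for cores, days, speedup, efficiency in relative_scaling(runs):
    print(f"{cores:4d} cores  {days:5.1f} days  "
          f"speed-up {speedup:4.2f}  efficiency {efficiency:4.2f}")
```

On these figures the 600-core run consumes about 1.9 times the core-days of the 60-core run, which is the cost of the shorter wall-clock time.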
Current work
• Phoneme data
• Modelling sound utterances
Language | Word | Cognacy | Phoneme
English | Fish | 1 | Fish
Danish | Fisk | 1 | Fisk
  • Better resolution than cognacy data
  • Relevant linguistic patterns are emerging
  • 120 phonemes, 2 cognacy judgments
  • Another 3 orders of magnitude in complexity
• Accelerator implementation CUDA / OpenCL
Scalable computing
• Last 10 years, 5-6 orders of magnitude increase in complexity
• Reasonably scalable code; redesign needed
• Need to change the how, not the what
• What – statistical framework, realistic models
• How – algorithm, language, parallelisation method, hardware
• Scalable algorithms
[Figure: MCMC run timeline, burn-in up to convergence (serial), then sampling (parallel)]
  • Parallel sampling using multiple chains
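The idea of sampling with multiple chains can be illustrated on a toy target. A minimal sketch, assuming a random-walk Metropolis sampler on a standard normal distribution rather than on tree space; every chain pays its own serial burn-in, and only the post-burn-in sampling is divided between chains. The chains are run in a loop here, but they are independent and could run on separate cores.

```python
import math
import random


def log_target(x: float) -> float:
    """Log density (up to a constant) of the toy target, a standard normal."""
    return -0.5 * x * x


def run_chain(seed: int, burn_in: int, n_samples: int, step: float = 1.0):
    """One Metropolis chain: discard `burn_in` iterations, then keep `n_samples`."""
    rng = random.Random(seed)
    x = 10.0  # deliberately poor starting point, far from the target
    samples = []
    for iteration in range(burn_in + n_samples):
        proposal = x + rng.gauss(0.0, step)
        log_ratio = log_target(proposal) - log_target(x)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        if iteration >= burn_in:
            samples.append(x)
    return samples


def pooled_samples(n_chains: int, burn_in: int, total_samples: int):
    """Split the sampling between independent chains and pool the results."""
    per_chain = total_samples // n_chains
    pooled = []
    for seed in range(n_chains):
        pooled.extend(run_chain(seed, burn_in, per_chain))
    return pooled


samples = pooled_samples(n_chains=4, burn_in=500, total_samples=4000)
print(len(samples), sum(samples) / len(samples))
```

With n chains the iterations per chain fall from burn_in + total to burn_in + total / n, so the serial burn-in sets a floor on how much extra chains can shorten a run.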
Key challenges
• Computing is a rate limiting step
  • Treading water / drowning
  • Widening gap between computing power and data and model complexity
  • Data set size and model complexity restricted
  • 20-30 year old methods, which are less accurate and non-statistical, are returning
• Connecting researchers with results, not HPC
• HPC is a nuisance in science
  • Steep learning curve
  • High cost: hardware, running costs and personnel
  • Access and flexibility
  • Not a one-off activity: thousands of data sets are produced each year, 3000+ published in 2011
Acknowledgments
Mark Pagel