Low Latency Computations on Massive Data

Ion Stoica
CS Division, UC Berkeley
Fujitsu Symposium
Mountain View, June 5, 2013
Challenges
Data grows faster than Moore’s law*
Data is dirty
» uncurated, no schema, no consistent syntax or semantics
Complex questions, e.g.,
» Is there a virus outbreak?
» Is the building structurally safe?
*[IDC report, Kathy Yelick, LBNL]
Low Latency & Massive Data
May not be able to achieve both of them!
Even if all data is in memory, computation may
take tens of seconds
Key Insight
Answers don’t always need to be exact
• Input often noisy: exact computations do not
guarantee exact answers
• Error often acceptable if small and bounded
• Best scale: ± 0.5 lb error
• Speedometers: ± 2.5% error (edmunds.com)
• OmniPod Insulin Pump: ± 0.96% error (www.ncbi.nlm.nih.gov/pubmed/22226273)
Error-bounded Computations
Error depends on sample size (S), not on original
data size:
» error ~ 1/√S
» E.g., error of a poll on 1,000 people is “same” for a
population of 1M or 100M people
New generation of scale-independent algorithms
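
The poll example can be checked with a short calculation. The sketch below is illustrative only: it assumes a simple random sample drawn without replacement, a worst-case proportion p = 0.5, and a 95% confidence level (z = 1.96). The function name poll_margin_of_error and its parameters are hypothetical, not from the talk.

```python
import math


def poll_margin_of_error(sample_size, population_size, p=0.5, z=1.96):
    """Approximate margin of error for a proportion estimated from a
    simple random sample without replacement (assumed model)."""
    # Standard error of a sample proportion: shrinks as 1/sqrt(sample_size).
    standard_error = math.sqrt(p * (1.0 - p) / sample_size)
    # Finite population correction: the only place the population size
    # enters, and it is close to 1 when the sample is a small fraction.
    fpc = math.sqrt((population_size - sample_size) / (population_size - 1))
    return z * standard_error * fpc


if __name__ == "__main__":
    for population in (1_000_000, 100_000_000):
        moe = poll_margin_of_error(1_000, population)
        print(f"population={population:>11,}  margin of error={moe:.4%}")
```

Under these assumptions both populations give a margin of error of roughly 3.1%, which is why the error is “same” in quotes: the population size only enters through a correction factor that is nearly 1.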
What Does It Mean?
Can trade off an answer’s latency against its accuracy
Rapid data growth is no longer a problem…
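
One way to picture the latency/accuracy trade is to ask how many rows must be sampled to reach a target error. The sketch below is a rough illustration, assuming the answer is a sample mean with a known standard deviation and a normal-approximation confidence interval (z = 1.96). The helper names and the rows_per_second throughput figure are hypothetical placeholders, not measurements from the talk.

```python
import math


def required_sample_size(std_dev, target_error, z=1.96):
    """Smallest S with z * std_dev / sqrt(S) <= target_error (assumed model)."""
    return math.ceil((z * std_dev / target_error) ** 2)


def estimated_latency_seconds(sample_size, rows_per_second):
    """Hypothetical cost model: latency grows linearly with rows scanned."""
    return sample_size / rows_per_second


if __name__ == "__main__":
    rows_per_second = 1_000_000  # made-up throughput, for illustration only
    for target_error in (0.10, 0.05, 0.025):
        s = required_sample_size(std_dev=1.0, target_error=target_error)
        latency = estimated_latency_seconds(s, rows_per_second)
        print(f"target error={target_error:<6} sample size={s:>6} "
              f"latency={latency:.6f}s")
```

In this model the required sample size, and so the latency, depends on the target error but not on the total data size, and each halving of the error costs about four times as many rows.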
[Figure: Moore's Law, Data, and Error plotted over 2012–2020]
Moore’s Law → error halves every two years