Recovery Oriented Computing: Update

Welcome to the Winter 2004 ROC Retreat
Armando Fox and David Patterson
About ROC Retreats


Purpose of semi-annual retreats

Progress reports/talks from academia and industry

Exposure/feedback on new ideas or work in progress

Brainstorming in immersive atmosphere

Industry/visitor feedback, opportunities for collaboration

Skiing
Logistics

Web server with retreat talks/papers - thanks to Mike Howard and Bob Miller

Skiing
© 2002 Armando Fox
ROC Events

Aaron Brown, UC Berkeley
=> Dr. Aaron Brown, IBM Research

Pete Broadwell, UC Berkeley
=> Pete Broadwell, M.S., ???

Soon: Mike Chen, UC Berkeley
=> Dr. Mike Chen, ???

ROC work recognized in the 2003 Scientific American 50
Recent Publications (since June 2003)
Published or to appear:

Ben Ling, Emre Kiciman, Armando Fox. Session State: Beyond Soft State, in NSDI 2004

Mike Chen, Anthony Accardi, Emre Kiciman, Jim Lloyd, Eric Brewer, Armando Fox. Path-Based Failure and Evolution Management, in NSDI 2004

George Candea, Steve Zhang, Emre Kiciman, Armando Fox. Application-Generic Recovery for Internet Middleware, Cluster Computing Journal (special issue on Autonomic Computing), summer 2004

George Candea, James Cutler, Armando Fox. Improving Availability with Recursive Microreboots: A Soft-State System Case Study, Performance Evaluation Journal, 56(1-3), March 2004
In submission:

George Candea and Armando Fox. Microreboots: An Application-Generic Recovery Technique for Internet Services, submitted to USENIX 2004

Andy Huang and Armando Fox. Free Recovery: A Step Towards Self-Managing State, submitted to USENIX 2004

Emre Kiciman and Armando Fox. Detecting and Localizing Anomalous Behavior to Discover Failures in Component-Based Internet Services, submitted to USENIX 2004

Yee-Jiun Song, Jeff Raymakers, Wendy Tobagus, Armando Fox. Is MTTR More Important Than MTTF For User-Perceived Availability?, submitted to DSN-IPDS 2004
Preview of some upcoming talks
1. Benchmarking

Evaluating undo: human-aware recovery benchmarks

Benchmarking distributed services

Including latency & data quality in performability evaluation of a web-based service

2. Making recovery nearly free

Evaluating the effect of micro-reboots on end users

How cheap recovery simplifies persistent state management

3. Embracing statistical analysis

Using statistical learning to detect and localize faults in componentized Internet services

A statistical learning approach to failure diagnosis for eBay

Toward generalized APIs for statistical monitoring
ROC => RADS

Generalize ROC approaches that focus on statistical anomaly detection as a way of detecting conditions that require response

Generalize “recovery” to “adaptation”

System is “always recovering”/“always adapting”

Some early examples of this will be featured in talks

Insight: statistical pattern recognition provides a degree of application-generic failure detection; nearly-free recovery means we can tolerate some false positives
Kickoff panel this evening
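
The insight above (cheap recovery makes false positives tolerable) can be illustrated with a small expected-cost calculation. This is only a sketch under an assumed cost model that does not come from the slides: acting on an alarm always costs one recovery, while ignoring an alarm costs a longer outage only when the alarm was a real failure. The function name and all numbers below are hypothetical.

```python
def min_precision_to_act(recovery_cost_s: float, missed_failure_cost_s: float) -> float:
    """Smallest P(real failure | alarm) at which recovering on every alarm
    is no worse, in expected downtime, than ignoring alarms.

    Hypothetical cost model (an assumption, not from the slides):
      - acting on an alarm always costs recovery_cost_s seconds;
      - ignoring an alarm costs missed_failure_cost_s seconds, but only
        when the alarm was a real failure.
    Acting pays off when recovery_cost_s <= p * missed_failure_cost_s.
    """
    if recovery_cost_s < 0 or missed_failure_cost_s <= 0:
        raise ValueError("recovery cost must be >= 0 and missed-failure cost > 0")
    # Cap at 1.0: if recovery costs more than a missed failure, even a
    # perfectly precise detector does not justify recovering.
    return min(1.0, recovery_cost_s / missed_failure_cost_s)


if __name__ == "__main__":
    # Illustrative numbers only: a missed failure is assumed to cost 300 s.
    MISSED_FAILURE_COST_S = 300.0
    for label, cost_s in (("full restart", 60.0), ("microreboot", 1.0)):
        p = min_precision_to_act(cost_s, MISSED_FAILURE_COST_S)
        print(f"{label}: act on alarms if at least {p:.1%} of them are real")
```

Under these made-up numbers, the share of alarms that must be real before recovery is worth triggering falls from 20% for a 60-second restart to about 0.3% for a 1-second microreboot, which is the sense in which nearly-free recovery lets a statistical detector be wrong fairly often.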
Other Highlights

Poster advertisements before poster session

Three talks from industrial visitors

Moises Goldszmidt: statistical pattern recognition applied to systems management

Chris Overton: modeling large-scale IT systems

Paul Brett: Real-world failures, a systemic view