The State-Space Approach to
Self-Management of Enterprise Systems
Vibhore Kumar, Karsten Schwan
Subu Iyer*, Yuan Chen*, Akhil Sahai*
Georgia Institute of Technology
Hewlett-Packard Labs*
Outline
Motivation: Enterprise Complexity
Issues
Solution Overview
Policy-Driven Self-Management
Dynamic SLA Decomposition
Results
Future Work
Enterprise Complexity: Some Facts
From a survey conducted by Forrester
Research
Enterprises now devote 80% of their overall IT
budget to maintenance and ongoing operations
More than half of the 347 participating companies
used at least 3 database vendors
A major banking-industry client had 18 different
travel and expense systems in the organization
“VP of IT Governance” - says tons about the
state of enterprise IT infrastructure
The Complexity Wall
“If we don’t get a handle on complexity, it will stop the
expansion”
- Paul Horn, Senior Vice President, IBM Research
“Our enterprise customers are working with enormous
complexity”
- Dick Lampman, Former Director, HP Labs
The Complexity Wall @ Worldspan
Worldspan, one of our industry collaborators, provides
services to the travel industry
One of their airline ticket pricing/availability services is
hosted on a farm of 1400 servers
In 2006 alone, they processed around 9.6 billion messages
Highly varying request rates and request type mix
Several behaviors of their system are not well understood
Effects of Ticket Geography
Effects of Cache Refresh Time
Effects of Time of Day …
To Handle The Complexity…
One must enable self-management of complex
enterprise infrastructures driven by high-level goals
Enterprise Self-Management: The Hurdles
Enterprise systems are too big
The problem of Scale
It is tough to relate high-level goals to low-level actions
The problem of Complex System Modeling
The operating environment is very dynamic
The problem of Dynamism
Administrators find it hard to trust black-box
solutions
The problem of Trust & Tractability
Solution Overview: System State-Space
[Figure: an enterprise system with monitored system variables and monitored component variables]
System State Space V = (v1, v2, v3, ..., vn)
• Variables of Interest Vø ⊂ V, e.g. Response-Time, QoI
• Controllable Variables Vα ⊂ V, e.g. Allocated-Servers, Memory
The aim is to establish a relation between Vø
and Vα under current operating conditions
Simple Automated Operation
SLO: “Response Time < 10msec”
Event: SLO Violation
Condition: Bandwidth=90Mbps, Request Rate=30
Action: set Allocated Servers to 3
The function: Vα → Vø, given V – (Vα ∪ Vø)
[Figure: system state showing Allocated Servers (Vα), Bandwidth, Request Rate and Response Time (Vø)]
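The event-condition-action rule on this slide can be written down as plain data. A minimal sketch, assuming a hypothetical `Policy` record and `fire` helper (neither name comes from the slides); the variable names and the starting value of one allocated server are illustrative only:

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Policy:
    """Hypothetical event-condition-action rule; field names are illustrative."""
    event: str                    # e.g. "SLO_VIOLATION"
    condition: Dict[str, float]   # context variables, i.e. V - (V_alpha U V_o)
    action: Dict[str, float]      # values to assign to controllable variables


def fire(policy: Policy, event: str, state: Dict[str, float]) -> Optional[Dict[str, float]]:
    """Return the action if the event and every condition value match, else None."""
    if event != policy.event:
        return None
    if any(state.get(name) != value for name, value in policy.condition.items()):
        return None
    return dict(policy.action)


# The rule from the slide, triggered when "Response Time < 10msec" is violated.
policy = Policy(
    event="SLO_VIOLATION",
    condition={"bandwidth_mbps": 90, "request_rate": 30},
    action={"allocated_servers": 3},
)
state = {
    "allocated_servers": 1,   # assumed starting value
    "bandwidth_mbps": 90,
    "request_rate": 30,
    "response_time_ms": 12,
}
print(fire(policy, "SLO_VIOLATION", state))
```

Exact matching on the condition is a simplification; a real rule would more likely match ranges of the context variables.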
Solution Overview: The Function
Learn from observed system states
But there are problems
Different behavior in different sub-spaces
Large state space, |V| ≈ 10^2 to 10^3
[Figure: observed system states (v1, v2, ..., vn) are fed to machine learning; different sub-spaces, e.g. CPU bottleneck and network bottleneck, behave differently]
Solution Overview: The Function
We decided to model the system using multiple µ-models
The function is a set of µ-models, indexed { 1, 2, ..., n }
We intelligently partition the set of observed system states (v1, v2, ..., vn)
partitions exhibit homogeneous behavior
partitions have a reduced number of relevant variables
[Figure: observed states split into partitions, each with a reduced number of relevant variables in a µ-model]
Partitioning & µ-Modeling solve two problems!
The problem of Scale
The problem of Complex System Modeling
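The effect of partitioning on the number of relevant variables can be illustrated with a small sketch. The slides do not say how partitions are chosen, so the partition key below (which resource is busier) and the correlation cut-off are assumptions, and the observed states are made-up illustrative values:

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sqrt(sxx * syy)


def partition_states(states, key):
    """Group observed states by a partition key (hypothetical rule)."""
    parts = defaultdict(list)
    for s in states:
        parts[key(s)].append(s)
    return dict(parts)


def relevant_variables(states, target, min_abs_corr=0.5):
    """Variables whose correlation with the target exceeds the cut-off."""
    ys = [s[target] for s in states]
    names = [n for n in states[0] if n != target]
    return [n for n in names
            if abs(pearson([s[n] for s in states], ys)) >= min_abs_corr]


# Made-up observed states: two regimes with different behavior.
states = [
    {"cpu_util": 0.9, "net_util": 0.2, "servers": 1, "bandwidth": 90, "response_time": 12},
    {"cpu_util": 0.9, "net_util": 0.2, "servers": 2, "bandwidth": 90, "response_time": 10},
    {"cpu_util": 0.9, "net_util": 0.2, "servers": 3, "bandwidth": 90, "response_time": 9},
    {"cpu_util": 0.3, "net_util": 0.95, "servers": 3, "bandwidth": 30, "response_time": 14},
    {"cpu_util": 0.3, "net_util": 0.95, "servers": 3, "bandwidth": 60, "response_time": 11},
    {"cpu_util": 0.3, "net_util": 0.95, "servers": 3, "bandwidth": 90, "response_time": 9},
]

parts = partition_states(
    states, key=lambda s: "cpu" if s["cpu_util"] > s["net_util"] else "network")
for name, members in parts.items():
    print(name, relevant_variables(members, "response_time"))
```

In this toy data each partition ends up with a single relevant variable, which is the reduction the slide refers to; one µ-model per partition would then be trained on those variables only.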
Solution Overview: µ-Models
We use Tree Augmented Naïve Bayes (TAN)
Classifier to build µ-models
The model returns the following probability
γ = Pr(Vα | Vdesired)
Find assignment of values to variables in Vα that
maximizes the probability of moving the system
to the desired state
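The search for the most probable assignment can be sketched without a full TAN classifier. The example below swaps in a plain empirical frequency estimate of Pr(Vα | Vdesired), which is an assumption made for brevity and not the TAN model used in the slides; `best_assignment` and the observations are hypothetical:

```python
from collections import Counter


def best_assignment(observations, is_desired):
    """Pick the assignment seen most often among desired states.

    observations: iterable of (assignment, state) pairs.
    Returns (assignment, gamma), where gamma is the empirical estimate of
    Pr(assignment | state is desired), or (None, 0.0) without evidence.
    """
    counts = Counter(a for a, state in observations if is_desired(state))
    total = sum(counts.values())
    if total == 0:
        return None, 0.0
    assignment, n = counts.most_common(1)[0]
    return assignment, n / total


# Made-up history: (allocated servers, observed response time in msec).
history = [(1, 12), (1, 12), (2, 11), (2, 9), (3, 9), (3, 8), (3, 9)]

servers, gamma = best_assignment(history, is_desired=lambda rt: rt < 10)
print(servers, gamma)
```

A TAN classifier would additionally condition on the other monitored variables and model dependencies among them; the argmax step stays the same.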
Solution Approach: Dynamism
As the system keeps running, more system states
are generated, which can be incorporated into
the µ-models
µ-models are easier to update than
monolithic system models
As a result of µ-model update
Policy Invalidation
Policy Adaptation
New Policies can Result
This addresses the problem of Dynamism
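One way to picture how a µ-model update leads to policy invalidation, adaptation, or a new policy is the sketch below. The counting model is a stand-in for a real µ-model, and the class name, `review_policy`, its return labels and the threshold value are all assumptions introduced for illustration:

```python
from collections import Counter


class CountingMicroModel:
    """Stand-in for one µ-model: counts assignments seen in desired states."""

    def __init__(self):
        self.counts = Counter()

    def update(self, assignment, reached_goal):
        """Fold one newly observed system state into the model."""
        if reached_goal:
            self.counts[assignment] += 1

    def best(self):
        """Return (assignment, gamma) or (None, 0.0) without evidence."""
        total = sum(self.counts.values())
        if total == 0:
            return None, 0.0
        assignment, n = self.counts.most_common(1)[0]
        return assignment, n / total


def review_policy(model, current_action, gamma_threshold=0.6):
    """Classify the effect of a model update on an existing policy action."""
    best, gamma = model.best()
    confident = best is not None and gamma > gamma_threshold
    if current_action is None:
        return ("new", best) if confident else ("none", None)
    if not confident:
        return "invalidated", None
    if best != current_action:
        return "adapted", best
    return "unchanged", best


model = CountingMicroModel()
for servers in (3, 3, 3, 2):
    model.update(servers, reached_goal=True)
print(review_policy(model, current_action=None))   # first policy for this partition

for _ in range(8):                                  # operating conditions drift
    model.update(4, reached_goal=True)
print(review_policy(model, current_action=3))
```

Because each µ-model covers one partition, only the policies attached to the updated partition need to be reviewed.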
Solution Approach: Tractability & Trust
Each self-management action that assigns values
to variables in Vα is associated with a probability
γ = Pr(Vα | V – Vø)
An action is taken only when γ > γthreshold
This can be used to fine-tune self-management
TANs can be easily understood by administrators
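The threshold rule can be sketched as a small gate in front of every action. The function name, the returned fields and the default threshold are assumptions; the slides only state that an action is taken when γ > γthreshold:

```python
def decide(assignment, gamma, gamma_threshold=0.7):
    """Apply an action only if its probability exceeds the threshold."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must be a probability in [0, 1]")
    if gamma > gamma_threshold:
        return {"decision": "apply", "assignment": assignment, "gamma": gamma}
    # Below the threshold nothing is changed automatically; the candidate and
    # its probability are still reported so an administrator can review them.
    return {"decision": "skip", "assignment": assignment, "gamma": gamma}


print(decide({"allocated_servers": 3}, gamma=0.75))
print(decide({"allocated_servers": 3}, gamma=0.55))
```

Raising the threshold makes the system more conservative, which is one way an administrator could fine-tune self-management.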
Policy-Driven Self-Management
SLO: “Response Time < 10msec”
Event: SLO Violation
Condition: Bandwidth=90Mbps, Request Rate=30
Current State: (90,30,12)
Goal State: (90,30,9)
Given the goal state (90,30,9), find the µ-model to use
evaluate c : Pr(c | 90,30,9) = max(Pr(ci | 90,30,9)), ci ∈ Vα
Action: set Allocated Servers to 3
[Figure: system state showing Allocated Servers, Bandwidth, Request Rate and Response Time]
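The two steps on this slide, picking the µ-model for the goal state and then maximizing over candidate assignments, can be sketched as follows. The slides do not say how the µ-model is located, so nearest-centroid lookup is an assumption, and the model names, centroids and probabilities are made-up illustrative values:

```python
from math import dist  # available in Python 3.8+

# Hypothetical µ-models: each covers one partition, summarised by a centroid
# over (bandwidth, request rate), and stores made-up values of
# Pr(c | goal) keyed by the response-time goal, then by allocated servers c.
MICRO_MODELS = [
    {"name": "low-load", "centroid": (30.0, 10.0),
     "pr": {9: {1: 0.80, 2: 0.15, 3: 0.05}}},
    {"name": "high-load", "centroid": (90.0, 30.0),
     "pr": {9: {1: 0.05, 2: 0.25, 3: 0.70}}},
]


def select_model(goal, models):
    """Pick the µ-model whose centroid is closest to the goal's context."""
    context = goal[:2]  # (bandwidth, request rate)
    return min(models, key=lambda m: dist(m["centroid"], context))


def choose_action(goal, models):
    """Return (model name, allocated servers, probability) for the goal state."""
    model = select_model(goal, models)
    candidates = model["pr"][goal[2]]  # goal[2] is the response-time goal
    servers = max(candidates, key=candidates.get)
    return model["name"], servers, candidates[servers]


goal_state = (90, 30, 9)  # (bandwidth, request rate, response time)
print(choose_action(goal_state, MICRO_MODELS))
```

With these illustrative numbers the high-load model is selected and three allocated servers is the most probable assignment, matching the action shown on the slide.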
Dynamic SLA Decomposition
Problem: To determine sub-SLAs for components that
lead to SLA conformance
Sub-SLAs can be thought of as per-component range of
values for controllable variables
If each component adheres to the sub-SLAs then the SLA
is not violated
conformance(SLA1, SLA2, …, SLAn) ⇒ conformance(System SLA)
Our techniques can handle SLA decomposition
[Figure: a system-level SLA decomposed into component sub-SLAs SLA1 to SLA5]
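Treating a sub-SLA as a per-component range of values for controllable variables, the left-hand side of the conformance condition can be checked mechanically. A minimal sketch with hypothetical component names, variables and ranges:

```python
def component_conforms(settings, sub_sla):
    """True if every variable named in the sub-SLA lies within its range."""
    return all(
        name in settings and low <= settings[name] <= high
        for name, (low, high) in sub_sla.items()
    )


def all_conform(system, sub_slas):
    """True if every component adheres to its own sub-SLA."""
    return all(
        component_conforms(system.get(component, {}), sub_sla)
        for component, sub_sla in sub_slas.items()
    )


# Hypothetical decomposition: one range per controllable variable.
sub_slas = {
    "web_tier": {"allocated_servers": (2, 4)},
    "db_tier": {"memory_gb": (4, 16)},
}
system = {
    "web_tier": {"allocated_servers": 3},
    "db_tier": {"memory_gb": 8},
}
print(all_conform(system, sub_slas))
```

The check only establishes that the components meet their sub-SLAs; whether that implies system-level conformance depends on the decomposition itself being correct.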
Experimental Results: SOA Simulator
[Figure: two panels, Without Self-Management and With Self-Management]
Experimental Results: RUBiS over VMs
[Figure: two panels, Without Self-Management and With Self-Management; annotations mark a Database Perturbation and a Partition Change]
Conclusions & Future Work
Our techniques are applicable for a variety of enterprise
systems
In our experiments the techniques have proven to be very
scalable and accurate
Monitoring overheads can be reduced by taking inputs
about relevant variables from the state-space partitions
Design & Implement techniques that can proactively avoid
SLA violations
Thank You!
References
[1] V. Kumar, K. Schwan, S. Iyer, Y. Chen, A. Sahai. The State-Space
Approach to SLA-Based Management. In submission to NOMS 2008.
[2] V. Kumar, B. F. Cooper, G. Eisenhauer, K. Schwan. iManage:
Policy-Driven Self-Management for Enterprise-Scale Systems.
Middleware 2007.
[3] V. Kumar, B. F. Cooper, G. Eisenhauer, K. Schwan. Enabling
Policy-Driven Self-Management for Enterprise Systems. PBAC
2007, in conjunction with ICAC 2007.
[4] V. Kumar, et al. Implementing Diverse Messaging Models with
Self-Managing Properties using IFLOW. ICAC 2006.