A Resource Manager for Optimal Resource Selection and Fault Tolerance Service in Grids
Hwa Min Lee, Sung Ho Chin, Jong Hyuk Lee, Dae Won Lee,
Kwang Sik Chung, Soon Young Jung, and Heon Chang Yu
Cluster Computing and the Grid, 2004. CCGrid 2004.
Outline
Introduction
Architecture of Resource Manager, Fault Detector and Fault Manager
Resource Management Service
Fault Tolerance Service
Implementation and Simulation

Two Subjects
Resource Manager
Fault Tolerance Service
  Fault Detector
  Fault Manager
Resource Manager
The state of the resources selected for job execution is a primary factor in determining computing performance.
We propose a resource manager for optimal resource selection.
The resource manager automatically selects the optimal resources from the candidate resources using a genetic algorithm.
Fault Tolerance Service
A resource failure can be fatal to job execution.
  Examples: process failures, machine crashes, and network failures.
A fault tolerance service detects resource failures, deviations from required QoS levels, and excessive resource usage, and resolves the failures it detects.
Goals
The resource manager finds the optimal set of resources that guarantees optimal performance.
The fault detector detects the occurrence of resource failures.
The fault manager guarantees that submitted jobs complete and, through job migration, improves job execution performance even when some failures occur.
Architecture of Resource Manager, Fault Detector and Fault Manager 1
Architecture of Resource Manager, Fault Detector and Fault Manager 2
The resource manager, fault detector, and fault manager are connected with the Globus Toolkit.
To execute a job with the resource manager, a user describes the required information in RSL (Resource Specification Language).
An RSL parser extracts this information and sends it to the resource manager.
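As a rough illustration of the parsing step, the sketch below extracts attribute/value pairs from a simplified RSL string. It handles only the flat `&(attribute=value)...` conjunction form, which is a small subset of the real RSL grammar. The helper name `parse_rsl` and the sample request are assumptions for illustration, not the authors' parser.

```python
import re

# Matches one "(attribute = value)" relation; the value may be quoted or bare.
_RELATION = re.compile(r'\(\s*(\w+)\s*=\s*("[^"]*"|[^()\s]+)\s*\)')

def parse_rsl(rsl: str) -> dict:
    """Extract attribute/value pairs from a flat RSL conjunction (hypothetical helper)."""
    text = rsl.strip()
    if not text.startswith("&"):
        raise ValueError("only flat '&' conjunctions are handled in this sketch")
    # Values stay as strings; quotes around quoted values are removed.
    return {name: value.strip('"') for name, value in _RELATION.findall(text)}

# Sample request (illustrative values only).
request = parse_rsl('&(executable="/bin/hostname")(count=10)(maxMemory=512)')
print(request)
```

The resulting dictionary is the kind of structure a resource manager could use as its resource type and resource condition input.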
Architecture of Resource Manager, Fault Detector and Fault Manager 3
The resource manager runs a genetic algorithm to select the optimal set of resources for efficient job execution.
An execution time predictor predicts the execution time of a job.
A fault detector detects resource failures.
Architecture of Resource Manager, Fault Detector and Fault Manager 4
If a resource failure or system performance degradation occurs, the fault detector sends an alert message to the fault manager.
When the fault manager receives an alert message, it decides whether job migration should be performed.
If it decides to migrate the job, it requests allocation of the newly selected resources and restarts execution from a checkpoint.
Components of Resource Manager 1
Components of Resource Manager 2
RSL Parser
  Extracts the resource type and resource condition.
Resource Search Agent
  Discovers resources that meet the user's requirements.
Resource Selection Agent
  Runs a genetic algorithm to select the optimal set of resources for efficient job execution.
Components of Resource Manager 3
Execution Time Predictor
  Predicts the execution time of a job.
Resource Allocation Request Agent
  Sends the resource allocation request to a resource co-allocator.
Optimal Resource Selection Using Genetic Algorithm 1
Optimal Resource Selection Using Genetic Algorithm 2
Optimal Resource Selection Using Genetic Algorithm 3
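The slides for this section contained only figures, so the chromosome encoding and fitness function the authors used are not recoverable here. The following is a minimal sketch of genetic-algorithm resource selection under stated assumptions: each sub-job is assigned one distinct resource, each resource has a hypothetical relative `speed`, and the cost to minimize is the longest predicted sub-job time. All names (`select_resources`, `predicted_makespan`, `repair`) and parameter values are illustrative, not taken from the paper.

```python
import random

def predicted_makespan(chromosome, work, speed):
    """Longest predicted sub-job time; chromosome[j] is the resource for sub-job j."""
    return max(w / speed[r] for w, r in zip(work, chromosome))

def repair(chromosome, n_resources, rng):
    """Replace duplicate resource indices so each sub-job gets its own resource."""
    present = set(chromosome)
    free = [r for r in range(n_resources) if r not in present]
    rng.shuffle(free)
    used, fixed = set(), []
    for gene in chromosome:
        if gene in used:
            gene = free.pop()          # swap in a resource not used by this chromosome
        used.add(gene)
        fixed.append(gene)
    return fixed

def select_resources(work, speed, pop_size=40, generations=60,
                     mutation_rate=0.2, seed=0):
    """Assumed GA: tournament selection, one-point crossover, elitism.

    Requires at least 2 sub-jobs and at least as many resources as sub-jobs.
    """
    rng = random.Random(seed)
    n_jobs, n_res = len(work), len(speed)

    def cost(chromosome):
        return predicted_makespan(chromosome, work, speed)

    population = [rng.sample(range(n_res), n_jobs) for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=cost)
        next_gen = population[:2]                      # elitism: keep the two best
        while len(next_gen) < pop_size:
            p1 = min(rng.sample(population, 3), key=cost)   # tournament of three
            p2 = min(rng.sample(population, 3), key=cost)
            cut = rng.randrange(1, n_jobs)             # one-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < mutation_rate:           # mutation: reassign one sub-job
                child[rng.randrange(n_jobs)] = rng.randrange(n_res)
            next_gen.append(repair(child, n_res, rng))
        population = next_gen
    return min(population, key=cost)

# Illustrative data: 50 candidate resources, 10 equal-sized sub-jobs.
data_rng = random.Random(1)
speed = [data_rng.uniform(0.5, 2.0) for _ in range(50)]
work = [100.0] * 10
best = select_resources(work, speed)
print(best, predicted_makespan(best, work, speed))
```

Because the two best individuals are carried over unchanged each generation, the best cost never gets worse from one generation to the next. A real deployment would replace `speed` with the output of the execution time predictor.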
Expansion of Failure Definition
Definition of failure
  A failure occurs if and only if at least one of the following two conditions is satisfied:
  1. The resource service stops due to a resource crash.
  2. The availability of the resource does not meet the minimum QoS level.
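The two-condition definition above can be expressed as a single predicate. This is a minimal sketch: the `ResourceState` fields and the idea of availability as one number compared against a threshold are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class ResourceState:
    crashed: bool        # condition 1: the resource service has stopped due to a crash
    availability: float  # measured availability (scale and units are assumed)

def is_failure(state: ResourceState, min_qos: float) -> bool:
    """True if the resource crashed or its availability is below the minimum QoS level."""
    return state.crashed or state.availability < min_qos

# A running but overloaded resource counts as failed under the expanded definition.
print(is_failure(ResourceState(crashed=False, availability=0.4), min_qos=0.6))
```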
Types of expanded failures
Components of Fault Detector and Fault Manager 1
Components of Fault Detector
The fault detector provides three services:
  a monitoring service that monitors the states of processes, processors, and networks;
  a fault detection service that decides whether a failure has occurred for each resource;
  a communication service that handles communication with the other components.
Components of Fault Manager
Rescheduling Agent
  Evaluates the performance benefit that job migration would bring and decides whether to migrate the job. It also decides the new resource allocation for jobs.
Job Migration Agent
  On receiving an alert message, asks the rescheduling agent to decide whether to migrate the job. If migration is chosen, it requests allocation of the newly selected resources and restarts execution from a checkpoint.
Job Migration 1
Fault tolerance is achieved by periodically saving the processes’ states to stable storage during failure-free execution.
Upon a failure, a failed job restarts from one of its saved states, thereby reducing the amount of lost computation.
Each saved state is called a checkpoint.
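The checkpoint-and-restart idea can be sketched as follows. This is an illustration only: a local file written with `pickle` stands in for stable storage, the job is a toy summation, and the function names and the `fail_at` parameter are assumptions, not the authors' implementation.

```python
import os
import pickle
import tempfile

def save_checkpoint(path, state):
    """Write the state to a temp file, then rename, so a crash cannot corrupt the last checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    """Return the last saved state, or None if no checkpoint exists."""
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

def run_job(path, total_steps, interval, fail_at=None):
    """Sum 0..total_steps-1, saving a checkpoint every `interval` steps."""
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    while state["step"] < total_steps:
        if fail_at is not None and state["step"] == fail_at:
            raise RuntimeError("simulated resource failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % interval == 0:
            save_checkpoint(path, state)
    return state["acc"]

ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
try:
    run_job(ckpt, total_steps=100, interval=10, fail_at=57)   # fails partway through
except RuntimeError:
    pass
resumed_from = load_checkpoint(ckpt)["step"]                  # last checkpoint before the failure
result = run_job(ckpt, total_steps=100, interval=10)          # restart from that checkpoint
print(resumed_from, result)
```

Only the work done after the most recent checkpoint is repeated on restart, which is the reduction in lost computation the slide describes.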
Job Migration 2
When a resource QoS failure occurs, the rescheduling agent must decide whether to migrate the job.
Because migration incurs an overhead (job_migration_overhead), migrating the job may increase the execution time.
Therefore, when a resource QoS failure occurs, the rescheduling agent evaluates the performance benefit that job migration would bring.
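The exact decision rule is not preserved in these slides. One plausible reading, sketched below, is to migrate only when the predicted remaining time on the current resource exceeds the predicted remaining time on the new resource plus `job_migration_overhead`. The function name and the two remaining-time parameters are assumptions for illustration.

```python
def should_migrate(remaining_time_current: float,
                   remaining_time_new: float,
                   job_migration_overhead: float) -> bool:
    """Migrate only if doing so is predicted to shorten the remaining execution time."""
    benefit = remaining_time_current - (remaining_time_new + job_migration_overhead)
    return benefit > 0

# A degraded resource needs 120 s more; a new one needs 60 s plus 30 s of overhead.
print(should_migrate(120.0, 60.0, 30.0))
```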
Job Migration 3
Implementation and Simulation 1
The simulation assumes that a task can be divided into sub-jobs that have no dependencies or communication among them and are executed in parallel.
Under this assumption, the total execution time of a task is determined by the longest execution time among its sub-jobs.
The simulation generates 1000 virtual nodes and divides the task into 10 sub-jobs.
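A minimal sketch of this setup is shown below. The node model (a single relative speed drawn uniformly at random), the sub-job sizes, and the random choice of nodes are all assumptions made for illustration. Only the counts (1000 nodes, 10 sub-jobs) and the longest-sub-job rule come from the slide.

```python
import random

def total_execution_time(sub_job_times):
    """With independent parallel sub-jobs, the task finishes when the slowest one does."""
    return max(sub_job_times)

rng = random.Random(42)
nodes = [rng.uniform(0.5, 2.0) for _ in range(1000)]   # hypothetical relative speeds
sub_job_work = [100.0] * 10                            # hypothetical equal work units

chosen = rng.sample(range(len(nodes)), 10)             # one node per sub-job, picked at random
times = [w / nodes[n] for w, n in zip(sub_job_work, chosen)]
total = total_execution_time(times)
print(total)
```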
Implementation and Simulation 2
Implementation and Simulation 3