A Resource Manager for Optimal Resource Selection and Fault Tolerance Service in Grids
Hwa Min Lee, Sung Ho Chin, Jong Hyuk Lee, Dae Won Lee, Kwang Sik Chung, Soon Young Jung, and Heon Chang Yu
Cluster Computing and the Grid, 2004. CCGrid 2004.

Outline
Introduction
Architecture of Resource Manager, Fault Detector and Fault Manager
Resource Management Service
Fault Tolerance Service
Implementation and Simulation

Two Subjects
Resource Manager
Fault Tolerance Service: Fault Detector, Fault Manager

Resource Manager
The state of the resources selected for job execution is a primary factor in computing performance.
We propose a resource manager for optimal resource selection.
The resource manager automatically selects the optimal resources from the candidate resources using a genetic algorithm.

Fault Tolerance Service
A resource failure can be fatal to job execution. Examples include process failures, machine crashes, and network failures.
A fault tolerance service detects resource failures, deviations from the required QoS levels, and excessive resource usage, and resolves the failures it detects.

Goals
The resource manager finds the optimal set of resources, the set that gives the best performance.
The fault detector detects the occurrence of resource failures.
The fault manager guarantees that submitted jobs complete, and improves job execution performance through job migration even when failures occur.

Architecture of Resource Manager, Fault Detector and Fault Manager
The resource manager, fault detector, and fault manager are connected through the Globus Toolkit.
To execute a job through the resource manager, a user describes the job's requirements in RSL (Resource Specification Language).
An RSL parser extracts this information and sends it to the resource manager.
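As an illustration of the RSL parsing step, the sketch below turns a simplified, conjunction-only RSL string of the form `&(attribute=value)...` into a dictionary that could be handed to a resource manager. It is not the Globus parser: the function name `parse_rsl`, the sample request, and the restriction to flat `(name=value)` relations are assumptions made for this example.

```python
import re

# Matches one flat relation such as (count=4) or (executable="/bin/date").
_RELATION = re.compile(r'\(\s*([A-Za-z_]\w*)\s*=\s*("[^"]*"|[^()\s]+)\s*\)')


def parse_rsl(rsl: str) -> dict:
    """Parse a flat, conjunction-only RSL string into {attribute: value}.

    Hypothetical helper for illustration. Nested requests, multi-requests,
    and operators other than '=' are outside the scope of this sketch.
    """
    text = rsl.strip()
    if not text.startswith("&"):
        raise ValueError("expected a conjunction request starting with '&'")
    body = text[1:]
    request = {}
    pos = 0
    while pos < len(body):
        if body[pos].isspace():
            pos += 1
            continue
        match = _RELATION.match(body, pos)
        if match is None:
            raise ValueError(f"unsupported RSL near position {pos}")
        name, value = match.group(1), match.group(2)
        # Attribute names are lowercased here only for convenient lookup.
        request[name.lower()] = value.strip('"')
        pos = match.end()
    return request


# Sample request (values are made up for the example).
example = '&(executable="/bin/date")(count=4)(minMemory=256)'
job_request = parse_rsl(example)
```

Values are kept as strings; a real parser would also need to handle the parts of the language this sketch leaves out.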
Architecture of Resource Manager, Fault Detector and Fault Manager (continued)
The resource manager runs the genetic algorithm to select the optimal set of resources for efficient job execution.
An execution time predictor predicts the execution time of a job.
A fault detector detects resource failures.
If a resource failure or a degradation in system performance occurs, the fault detector sends an alert message to the fault manager.
When the fault manager receives an alert message, it decides whether or not to migrate the job.
If it decides to migrate, it requests allocation of the newly selected resources and restarts execution from a checkpoint.

Components of Resource Manager
RSL Parser: extracts the resource type and resource conditions.
Resource Search Agent: discovers resources that meet the user's requirements.
Resource Selection Agent: runs the genetic algorithm to select the optimal set of resources for efficient job execution.
Execution Time Predictor: predicts the execution time of a job.
Resource Allocation Request Agent: requests resource allocation from a resource co-allocator.

Optimal Resource Selection Using Genetic Algorithm
(three slides; content not preserved)

Expansion of Failure Definition
Definition of failure: a failure occurs if and only if one of the following two conditions is satisfied.
1. The resource service stops because of a resource crash.
2. The availability of the resource does not meet the minimum QoS level.
Types of expanded failures (content not preserved)

Components of Fault Detector and Fault Manager
Components of Fault Detector
The fault detector provides a monitoring service that monitors the states of processes, processors, and networks; a fault detection service that decides whether a failure has occurred for each resource; and a communication service that handles communication with each component.
Components of Fault Manager
Rescheduling Agent: evaluates the performance benefit that job migration would bring, decides whether or not the job migrates, and decides the new resource allocation for the job.
Job Migration Agent: on receiving an alert message, asks the rescheduling agent to decide whether or not the job migrates. If migration is chosen, it requests allocation of the newly selected resources and restarts execution from a checkpoint.

Job Migration
Fault tolerance is achieved by periodically saving the processes' states to stable storage during failure-free execution.
Upon a failure, a failed job restarts from one of its saved states, which reduces the amount of lost computation.
Each saved state is called a checkpoint.
When a resource QoS failure occurs, the rescheduling agent must decide whether or not the job migrates.
Migration incurs an overhead (job_migration_overhead), so migrating a job can increase its execution time.
The rescheduling agent therefore evaluates the performance benefit of migration before deciding.

Implementation and Simulation
The simulation assumes that a task can be divided into sub-jobs that have no dependencies or communication between them and that execute in parallel.
Under this assumption, the total execution time of a task is the longest execution time among its sub-jobs.
The simulation generates 1000 virtual nodes and divides the task into 10 sub-jobs.
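The rescheduling agent's decision described under Job Migration can be sketched as a comparison of two predicted completion times. The rule used here (migrate only when the predicted remaining time on the new resource plus job_migration_overhead is smaller than the predicted remaining time on the current, degraded resource) is an assumed formulation, because the slides do not preserve the exact criterion. All names other than `job_migration_overhead` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MigrationEstimate:
    """Predicted times, in seconds, for one job after a QoS failure."""
    remaining_time_current: float  # time to finish on the degraded resource
    remaining_time_new: float      # time to finish from the last checkpoint on the new resource
    job_migration_overhead: float  # cost of moving the checkpoint and restarting


def migration_benefit(est: MigrationEstimate) -> float:
    """Seconds saved by migrating; a negative value means migration is slower."""
    cost_if_migrating = est.remaining_time_new + est.job_migration_overhead
    return est.remaining_time_current - cost_if_migrating


def should_migrate(est: MigrationEstimate) -> bool:
    """Assumed rule: migrate only when it is predicted to finish sooner."""
    return migration_benefit(est) > 0


# Example with made-up numbers: staying takes 100 s, migrating takes 60 s + 20 s.
worthwhile = should_migrate(MigrationEstimate(100.0, 60.0, 20.0))
not_worthwhile = should_migrate(MigrationEstimate(100.0, 90.0, 20.0))
```

In the first example migration saves 20 seconds and is chosen; in the second the overhead outweighs the gain and the job stays where it is.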
Implementation and Simulation (continued)
(two slides; content not preserved)
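To make the simulation setup concrete, the sketch below selects 10 of 1000 virtual nodes with a small genetic algorithm and scores a selection by the longest sub-job time. Only the node count, the sub-job count, and the longest-sub-job rule come from the slides. The chromosome encoding, the operators, the population parameters, and the uniformly random per-node execution times are assumptions, since the slides describing the actual algorithm are not preserved.

```python
import random

NUM_NODES = 1000   # virtual nodes, as in the simulation setup
NUM_SUBJOBS = 10   # independent sub-jobs per task


def makespan(selection, exec_time):
    """Total task time: the longest sub-job time among the selected nodes."""
    return max(exec_time[node] for node in selection)


def crossover(a, b, rng):
    """Uniform crossover, then repair so no node receives two sub-jobs."""
    child = [x if rng.random() < 0.5 else y for x, y in zip(a, b)]
    used = set()
    for i, node in enumerate(child):
        while node in used:
            node = rng.randrange(NUM_NODES)
        used.add(node)
        child[i] = node
    return child


def mutate(selection, rng, rate=0.1):
    """Replace each gene with a random unused node with probability `rate`."""
    used = set(selection)
    for i in range(len(selection)):
        if rng.random() < rate:
            node = rng.randrange(NUM_NODES)
            if node not in used:
                used.discard(selection[i])
                used.add(node)
                selection[i] = node
    return selection


def tournament(population, exec_time, rng, k=3):
    """Pick the best of k randomly sampled individuals."""
    return min(rng.sample(population, k), key=lambda s: makespan(s, exec_time))


def select_resources(exec_time, rng, pop_size=50, generations=100, elite=2):
    """Return (best selection, best makespan recorded per generation)."""
    population = [rng.sample(range(NUM_NODES), NUM_SUBJOBS) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda s: makespan(s, exec_time))
        history.append(makespan(population[0], exec_time))
        next_gen = [list(s) for s in population[:elite]]  # elitism keeps the best
        while len(next_gen) < pop_size:
            parent_a = tournament(population, exec_time, rng)
            parent_b = tournament(population, exec_time, rng)
            next_gen.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = next_gen
    best = min(population, key=lambda s: makespan(s, exec_time))
    history.append(makespan(best, exec_time))
    return best, history


rng = random.Random(0)
# Assumed predicted execution time of one sub-job on each node, in seconds.
exec_time = [rng.uniform(10.0, 100.0) for _ in range(NUM_NODES)]
best, history = select_resources(exec_time, rng)
```

Because the two best individuals are copied unchanged into each new generation, the best makespan in `history` never increases. An execution time predictor of the kind the slides describe would take the place of the random `exec_time` values.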