Bridging Clouds with CernVM: ATLAS/PanDA example

Wenjing Wu
2010-8-27
Outline
ATLAS computing model (PanDA)
Extending ATLAS computing model to use Cloud computing resources
Challenges
Solution
Work Done
PanDA - the Production and Distributed Analysis system for the ATLAS Experiment
Workflow:
  1. Submit jobs to PanDA server
  2. Pilots are submitted to worker nodes
  3. Pilot checks environment, fetches jobs from PanDA server
  4. Pilot uploads and registers output files after job is done
  5. Pilot updates job status to PanDA server
  6. PanDA server manages the final data transfer
Diagram components:
  Storage Element
  Logical File Catalog
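The six steps on this slide can be sketched in a few lines of Python. This is a minimal illustration and not PanDA code: `PandaServer`, `run_pilot`, and the status strings are hypothetical names invented here, and the server is an in-memory stand-in with no networking, Storage Element, or Logical File Catalog.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class PandaServer:
    """Hypothetical in-memory stand-in for the PanDA server."""
    queue: deque = field(default_factory=deque)
    status: dict = field(default_factory=dict)

    def submit(self, job_id):
        # Step 1: a job is submitted to the server.
        self.queue.append(job_id)
        self.status[job_id] = "queued"

    def fetch(self):
        # Step 3: a pilot asks for a job; None means nothing is waiting.
        return self.queue.popleft() if self.queue else None

    def update(self, job_id, state):
        # Step 5: a pilot reports the job status.
        self.status[job_id] = state


def run_pilot(server, environment_ok=True):
    """One pilot cycle on a worker node (step 2 is the act of starting it)."""
    if not environment_ok:
        # Step 3: the pilot gives up if the environment check fails.
        return None
    job_id = server.fetch()
    if job_id is None:
        return None
    server.update(job_id, "running")
    # The payload would run here; step 4 would upload and register outputs.
    server.update(job_id, "finished")
    return job_id
```

The point of the sketch is the pull model: the pilot starts first, validates the node, and only then asks the server for work.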
Extending ATLAS computing model to use Cloud Computing resources
What are Clouds (in today's common usage)?
  Virtualized computing resources provided by academic and commercial institutions (e.g. CERN lxcloud, Amazon EC2)
  Resources provided by users participating in volunteer computing projects (e.g. BOINC)
The goal:
  Run ATLAS production jobs on Cloud Computing resources.
Challenges!
Transparency: users and production operators should not notice the difference
The whole set of Cloud resources should appear to the PanDA server as just another Grid site
Credentials (which are essential for the functioning of the PanDA pilot) cannot be brought into the ‘untrusted’ environment (e.g. to the machines of the volunteers)
Solve the challenge using CernVM
CernVM
  Provides a lightweight virtual machine image containing the applications of LHC experiments
  The application software is distributed through an HTTP-based content delivery network and is cached locally
  Provides Co-Pilot: a framework for the delivery and execution of the workload on remote virtual machines
Integration!
Workflow:
  1. Submit PanDA job
  2. Submit Co-Pilot job
  3. Agent gets a Co-Pilot job, which launches the PanDA pilot
  4. Pilot fetches PanDA job and runs it
  5. Upload output to temporary storage after job finishes
  6. Upload and register output files
  7. Update final job status on PanDA server
Diagram components:
  Co-Pilot Client
  CernVM Co-Pilot: Co-Pilot Job Manager, Co-Pilot Storage Manager
  Cloud resources provided through VMs running Co-Pilot Agent
  Storage Element
  Logical File Catalog
Work Done (1)
Set up the CERNVM site (part of the ATLAS Grid infrastructure)
  Is a dynamic virtual cluster formed by virtual machines running CernVM Co-Pilot Agents
  Is configured according to ATLAS computing conventions
  Appears to ATLAS Grid central services as a Tier 2 site
Work Done (2)
Adaptation of PanDA Pilot:
  Adding support for the heterogeneous structure of the software repository
  Adding support for saving job output metadata and job status files
Development of Co-Pilot Storage Manager
  A component running in the trusted environment and acting as a proxy between Co-Pilot agents and PanDA Grid services
Thanks!
Solve the challenge using CernVM
CernVM Co-Pilot helps run ATLAS PanDA jobs in a non-credentialed computing environment.
CernVM Co-Pilot Components:
  Co-Pilot Client: submits jobs to the Co-Pilot Job Manager
  Co-Pilot Server:
    Co-Pilot Job Manager: dispatches jobs to Co-Pilot Agents
    Co-Pilot Storage Manager: uploads and registers output files, changes job status using the credential
  Co-Pilot Agent: runs the jobs on non-credentialed compute nodes
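The division of roles above can be illustrated with a small Python sketch. It is an assumption-laden toy and not the Co-Pilot implementation: the classes `JobManager`, `StorageManager`, and `Agent`, their methods, and the job dictionary layout are all hypothetical, and plain function calls stand in for whatever messaging the real components use. What it shows is that the agent only fetches work and reports completion, while the credential stays with the Storage Manager.

```python
from collections import deque


class JobManager:
    """Hypothetical queue that hands Co-Pilot jobs to agents."""

    def __init__(self):
        self._jobs = deque()

    def submit(self, job):
        # Called by the Co-Pilot client.
        self._jobs.append(job)

    def dispatch(self):
        # Called by an agent asking for work; None means the queue is empty.
        return self._jobs.popleft() if self._jobs else None


class StorageManager:
    """Runs in the trusted environment; the only holder of the credential."""

    def __init__(self, credential):
        self._credential = credential
        self.finished = []

    def job_done(self, job_id):
        # A real service would use the credential here to upload and
        # register outputs and to update the job status.
        self.finished.append(job_id)


class Agent:
    """Runs on a non-credentialed node and is never given a credential."""

    def __init__(self, fetch_job, report_done):
        self._fetch_job = fetch_job
        self._report_done = report_done

    def work_once(self):
        job = self._fetch_job()
        if job is None:
            return False
        job["payload"]()               # run the job
        self._report_done(job["id"])   # tell the trusted side it is done
        return True
```

A usage example under the same assumptions: create `JobManager()` and `StorageManager("secret")`, submit `{"id": "cp-1", "payload": some_callable}`, build `Agent(jm.dispatch, sm.job_done)`, and call `work_once()` until it returns `False`.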
Ingredients
CernVM
  Provides an ultralight image for different hypervisors
  ATLAS software is distributed by CVMFS and cached locally
Co-Pilot
  Co-Pilot Agent is distributed with the CernVM image
  Schedules jobs to CernVM virtual clusters
Co-Pilot Storage Manager
How does the Co-Pilot SM (Storage Manager) work?
  SM receives a “JobDone” message from a Co-Pilot agent (the JobID is included)
  SM calls Co-Pilot_Data_Mover, which extracts the metadata of the job output from the pilot log, uploads the files to the designated SE, and registers them in the designated LFC
  SM verifies the status of the file upload and registration
  SM calls Co-Pilot_Job_Status_Updater, which updates the job status on the PanDA server (finished or failed)
Both Co-Pilot_Data_Mover and Co-Pilot_Job_Status_Updater are Python scripts using libraries from the pilot source code
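The message handling described on this slide could look roughly like the following Python sketch. It is not the actual Storage Manager code: `handle_job_done` and its three callable parameters are hypothetical stand-ins for the SM, Co-Pilot_Data_Mover, the verification step, and Co-Pilot_Job_Status_Updater, and treating any exception as a failed job is an assumption made here for simplicity.

```python
def handle_job_done(job_id, data_mover, verify, status_updater):
    """Process one "JobDone" message and return the reported state.

    data_mover(job_id)            -> list of outputs it uploaded and registered
    verify(outputs)               -> True if upload and registration succeeded
    status_updater(job_id, state) -> reports "finished" or "failed"
    All three callables are hypothetical placeholders.
    """
    try:
        outputs = data_mover(job_id)   # extract metadata, upload, register
        ok = bool(verify(outputs))     # check upload and registration status
    except Exception:
        # Assumption: any error while moving or verifying data fails the job.
        ok = False
    state = "finished" if ok else "failed"
    status_updater(job_id, state)      # report the final state
    return state
```

Passing the three steps in as callables keeps the sketch testable without a Storage Element, a file catalog, or a PanDA server.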