Production status

Services for Titan workflow integration
Introduction
• The following slides give a factorized view of the services needed to integrate the Titan system into the ALICE Grid
• The content is not final
• Some details are omitted on purpose
• The general goal is to identify the basic building blocks and dependencies
JobAgent submission
• Submits JA launch script to Titan
PanDa pilot factory (Titan login node)
• Receives signal to submit (if suitable jobs exist) from AliEn
• Queries batch for available job slots
• Returns max wall time of job + number of cores (can be auto-discovered?)
• Status of the factory, submission status and various parameters from batch should be made available to VO-box services
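The submission decision above can be sketched as a small planning function. This is only an illustration under stated assumptions: `BatchSlot`, `Submission`, `plan_submissions` and the `min_walltime_s` threshold are hypothetical names, not part of PanDa or AliEn, and the sketch assumes that each core of a launched JA serves one waiting job.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BatchSlot:
    """One free slot as reported by the batch system (hypothetical shape)."""
    cores: int
    max_walltime_s: int


@dataclass(frozen=True)
class Submission:
    """Parameters handed to one JA launch script."""
    cores: int
    walltime_s: int


def plan_submissions(waiting_jobs: int, free_slots: List[BatchSlot],
                     min_walltime_s: int = 3600) -> List[Submission]:
    """Plan one JA launch per usable free slot while suitable jobs remain."""
    plans: List[Submission] = []
    remaining = waiting_jobs
    for slot in free_slots:
        if remaining <= 0:
            break  # no suitable jobs left, so nothing more to submit
        if slot.max_walltime_s < min_walltime_s:
            continue  # slot too short to be useful (assumed threshold)
        plans.append(Submission(slot.cores, slot.max_walltime_s))
        remaining -= slot.cores  # assumption: one job per core
    return plans
```

The returned wall time and core count are the two values the slide says the factory passes on; whether they can be auto-discovered is left open, as in the original.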
Software distribution + sandboxing
Titan Filesystems
• CVMFS over NFS – ALICE SW in $HOME/prod user
• Otherwise pre-packaging in the VO-box
• Spider (/lustre/) for job sandboxes
• NFS ($HOME/prod user) for software distribution
VO-box service S1 (SW distribution)
JobAgent start
Titan WN
• JA spawns X threads (X = number of cores)
• JA TTL = time communicated from PanDa
• JA (Titan version) is launched by submission script from $HOME/prod user
• JA writes ‘I am alive’ and waits for payload
• ...or... parses Spider for a suitable pre-built payload
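A minimal sketch of the JA start-up described above: write an ‘I am alive’ marker, spawn one thread per core, and let each thread wait until the TTL expires. The marker file name `ja.alive`, the function name `run_job_agent` and the polling interval are assumptions made for illustration; the real JA is not shown in these slides.

```python
import threading
import time
from pathlib import Path
from typing import List


def run_job_agent(workdir: Path, n_cores: int, ttl_s: float,
                  poll_s: float = 0.02) -> List[str]:
    """Start n_cores worker threads that wait for payloads until the TTL expires."""
    # Hypothetical marker file announcing that the JA has started.
    (workdir / "ja.alive").write_text("I am alive\n")
    deadline = time.monotonic() + ttl_s
    stopped: List[str] = []
    stopped_lock = threading.Lock()

    def worker(idx: int) -> None:
        while time.monotonic() < deadline:
            # Placeholder: a real worker would look for a payload here.
            time.sleep(poll_s)
        with stopped_lock:
            stopped.append(f"thread-{idx}")

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_cores)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(stopped)
```

Using a monotonic clock for the deadline keeps the TTL independent of wall-clock adjustments on the worker node.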
Payload preparation
VO-box service S2 (Payload preparation + execution monitoring)
• Takes list of eligible MC jobs from common TQ
• Creates /lustre/AliEn-jobxxxxxx (macros/scripts)
• SW in $HOME/prod user
• Job sandboxes in /lustre
• JA thread picks up a job, creates ‘job taken’ lock
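The ‘job taken’ step needs to be safe when several JA threads scan the same sandboxes. One common way to do this, sketched below, is exclusive file creation with `O_CREAT | O_EXCL`, so that only one thread can create the lock. The lock file name `job_taken.lock`, its contents and the function names are assumptions; the slides do not specify them.

```python
import os
from pathlib import Path
from typing import Optional

LOCK_NAME = "job_taken.lock"  # hypothetical file name


def try_take_job(sandbox: Path, owner: str) -> bool:
    """Create the lock exclusively; return False if another thread already has it."""
    try:
        fd = os.open(sandbox / LOCK_NAME,
                     os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as fh:
        fh.write(f"owner={owner}\nstatus=STARTED\n")
    return True


def pick_job(sandbox_root: Path, owner: str) -> Optional[Path]:
    """Return the first prepared sandbox this owner manages to lock, else None."""
    for sandbox in sorted(sandbox_root.glob("AliEn-job*")):
        if sandbox.is_dir() and try_take_job(sandbox, owner):
            return sandbox
    return None
```

A thread that gets `None` back would simply keep waiting, which matches the ‘waits for payload’ behaviour of the JA.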
Payload execution
Titan WN
• Job output written to sandbox /lustre/AliEn-jobxxxxxx
• JA keeps ‘job taken’ lock and updates job status
• ‘job taken’ lock file is parsed by S2
• Payload running on X cores
‘Job taken’ content
• Job status: STARTED, RUNNING, SAVING
• Job error code: ERROR_E, ERROR_V, ERROR_VN, ERROR_IB
• Heartbeat info + job parameters info
• The remaining codes are set by central/VO-box services
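The lock file contents listed above could be exchanged between the JA and S2 roughly as follows. The JSON layout, the field names and the staleness check are assumptions for illustration only; the slides name the statuses and error codes but not the file format. The write goes through a temporary file and `os.replace` so that a reader never sees a half-written record.

```python
import json
import os
import time
from pathlib import Path
from typing import Optional

JA_STATUSES = {"STARTED", "RUNNING", "SAVING"}
JA_ERRORS = {"ERROR_E", "ERROR_V", "ERROR_VN", "ERROR_IB"}


def update_lock(lock: Path, status: str, params: Optional[dict] = None,
                now: Optional[float] = None) -> None:
    """JA side: rewrite the lock file with status, heartbeat and job parameters."""
    if status not in JA_STATUSES | JA_ERRORS:
        # Remaining codes are set by central/VO-box services, not by the JA.
        raise ValueError(f"status {status!r} is not set by the JA")
    record = {
        "status": status,
        "heartbeat": time.time() if now is None else now,
        "params": params or {},
    }
    tmp = lock.with_name(lock.name + ".tmp")
    tmp.write_text(json.dumps(record))
    os.replace(tmp, lock)  # swap in the complete file in one step


def read_lock(lock: Path, stale_after_s: float,
              now: Optional[float] = None) -> dict:
    """S2 side: parse the lock file and flag a heartbeat that is too old."""
    record = json.loads(lock.read_text())
    current = time.time() if now is None else now
    record["stale"] = (current - record["heartbeat"]) > stale_after_s
    return record
```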
Job epilogue
VO-box service S3 (Job output handling)
• Writes job output files to Grid storage
• Success: ‘VALIDATED’
• Error: all other cases
• Updates task queue to next status, rest is done by central services
• Parses sandboxes for ‘job finished’ file
• Reads files from /lustre/AliEn-jobxxxxxx
• Cleans up the sandbox
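The S3 loop above can be sketched as a single scan over the sandboxes. The marker name `job_finished`, the `upload` callback and the generic `"ERROR"` label are placeholders chosen for this example; the actual Grid storage client and the specific error statuses are not described in the slides.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict

FINISHED_MARKER = "job_finished"  # hypothetical marker file name


def handle_finished_jobs(sandbox_root: Path,
                         upload: Callable[[Path], bool]) -> Dict[str, str]:
    """Upload outputs of finished jobs, record the next status, clean up."""
    results: Dict[str, str] = {}
    for sandbox in sorted(sandbox_root.glob("AliEn-job*")):
        if not (sandbox / FINISHED_MARKER).is_file():
            continue  # job still running or not yet picked up
        outputs = [p for p in sorted(sandbox.iterdir())
                   if p.is_file() and p.name != FINISHED_MARKER]
        ok = bool(outputs) and all(upload(p) for p in outputs)
        # "ERROR" stands in for whichever error status applies.
        results[sandbox.name] = "VALIDATED" if ok else "ERROR"
        shutil.rmtree(sandbox)  # clean up the sandbox
    return results
```

The returned mapping is what would be used to update the task queue; everything after that is left to central services, as stated above.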
Other services
VO-box service S4 (Job parameters monitoring)
• Parses sandboxes for job service files
• Reads files from /lustre/AliEn-jobxxxxxx
• Receives info from PanDa pilot factory
• Aggregates info and sends to central monitoring
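As a rough illustration of the aggregation step, the sketch below counts jobs per status from per-sandbox service files and attaches the factory information. The file name `job_service.json`, its JSON layout and the shape of the report are assumptions; sending the report to central monitoring is deliberately left out.

```python
import json
from collections import Counter
from pathlib import Path

SERVICE_FILE = "job_service.json"  # hypothetical job service file name


def aggregate(sandbox_root: Path, factory_info: dict) -> dict:
    """Combine per-job service files and pilot factory info into one report."""
    counts: Counter = Counter()
    for service_file in sorted(sandbox_root.glob(f"AliEn-job*/{SERVICE_FILE}")):
        try:
            record = json.loads(service_file.read_text())
        except (OSError, json.JSONDecodeError):
            counts["UNREADABLE"] += 1  # file vanished or is mid-write
            continue
        if isinstance(record, dict):
            counts[record.get("status", "UNKNOWN")] += 1
        else:
            counts["UNKNOWN"] += 1
    return {"jobs_by_status": dict(counts), "factory": dict(factory_info)}
```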
AliEn job status chart