Services for Titan workflow integration

Introduction
• The following slides contain a factorized view of the services needed to integrate the Titan system in the ALICE Grid
• The content is not final
• Some details are omitted on purpose
• The general goal is to identify the basic building blocks and dependencies

JobAgent submission
PanDa pilot factory (Titan login node)
• Submits JobAgent (JA) launch script to Titan
• Receives signal to submit (if suitable jobs exist) from AliEn
• Queries batch for available job slots
• Returns max wall time of job + number of cores (can be auto-discovered?)
• Status of the factory, submission status and various parameters from batch should be made available to VO-box services

Software distribution + sandboxing
Titan Filesystems
• Spider (/lustre/) for job sandboxes
• NFS ($HOME/prod user) for software distribution
VO-box service S1 (SW distribution)
• CVMFS over NFS: ALICE SW in $HOME/prod user
• Otherwise pre-packaging in the VO-box

JobAgent start
Titan WN
• JA spawns X threads (X = number of cores)
• JA TTL = time communicated from PanDa
• JA (Titan version) is launched by submission script from $HOME/prod user
• JA writes ‘I am alive’ and waits for payload
• ...or parses Spider for a suitable pre-built payload

Payload preparation
VO-box service S2 (Payload preparation + execution monitoring)
• Takes list of eligible MC jobs from common task queue (TQ)
• Creates /lustre/AliEn-jobxxxxxx (macros/scripts)
• SW in $HOME/prod user
• Job sandboxes in /lustre
• JA thread picks up a job, creates ‘job taken’ lock

Payload execution
Titan WN
• Job output written to sandbox /lustre/AliEn-jobxxxxxx
• JA keeps ‘job taken’ lock and updates job status
• ‘job taken’ lock file is parsed by S2
• Payload running on X cores
‘Job taken’ content
• Job status: STARTED, RUNNING, SAVING
• Job error code: ERROR_E, ERROR_V, ERROR_VN, ERROR_IB
• Heartbeat info + job parameters info
• The remaining codes are set by central/VO-box services

Job epilogue
VO-box service S3 (Job output handling)
• Writes job output files to Grid storage
• Success: ‘VALIDATED’
• Error: all other cases
• Updates task queue to next status, rest is done by central services
• Parses sandboxes for ‘job finished’ file
• Reads files from /lustre/AliEn-jobxxxxxx
• Cleans up the sandbox

Other services
VO-box service S4 (Job parameters monitoring)
• Parses sandboxes for job service files
• Reads files from /lustre/AliEn-jobxxxxxx
• Receives info from PanDa pilot factory
• Aggregates info and sends to central monitoring

AliEn job status chart