TIMActionList

Action list from the Chicago TIM meeting
Priority Levels
Critical: by the end of November 2014 (before Rucio and Prodsys-2 migration)
High: by the end of January 2015
Medium: by the start of data taking
Low: long term
Workload Management & Resource Provisioning

High: improve dataset lookups and investigate how they can be sped up in
Rucio through metadata searches rather than pattern matching
High: improve error codes and their propagation (Athena / TRF / pilot /
PanDA / Monitoring).
High: improve (JEDI) merging: avoid 1-1 merging, handle zero-event files
properly, handle LB boundaries correctly, tune merging parameters
High: discuss (one day) and define a work plan for dynamic queues, with
the aim of having dynamic queues by Q2 2015.
High: demonstrate MCORE slot provisioning in a cloud environment other
than HLT@CERN.
High: integrate HPC submission with ADC components (Prodsys/Rucio:
tasks run as Prodsys tasks and datasets stored in Rucio) and demonstrate
at scale for various architectures (process and validate sample A for
x86-compatible machines)

Medium/High: increase the number of available MCORE resources
(currently 30% less than SCORE for production)
Medium: enhance monitoring for tracking requests rather than tasks.
Provide overview of request completion status.
Medium: investigate how to integrate Good Run Lists in Prodsys-2
(concerns with RunQuery scalability).
Medium: merge the reprocessing monitor functionality into the BigPanDA
monitor
Medium: explore the ACT model (“push”) for sites other than NDGF (sites
with ARC CE installed).
Medium: set up a few MCORE queues for analysis and gain experience
Medium: standardize ATLAS SW installation at HPC facilities.
Low: understand overlaps between provisioning solutions for cloud
resources (VAC, CloudScheduler, APF2) and consolidate where possible.
Data Management

Critical: lumiblock and number-of-events metadata should be supported in
Rucio
Critical: expose Rucio-aware dq2-clients to power users and implement
feedback before migration.
Critical: validate DaTRI against Rucio before the migration
Critical: achieve horizontal scalability of all Rucio services
High: instrument/improve probes to check the sanity of Rucio services
(internal and external probes)
High: provide dumps for sites to be used for consistency checks
Medium/High: extend Rucio storage dumps to include replica popularity
and number-of-replicas information
Medium: harmonize the consistency-check procedures and automate
more of them
Medium: implement the needed functionality in Rucio clients so that
they can replace dq2-clients
Medium: provide a single client for get operations (rucio-get?) using the
“best” protocol available and the best fallback strategy
Medium: implement DaTRI functionalities in the Rucio Web UI, including
delegation and notifications.
Medium: define the quota system for users in LOCALGROUPDISK (and
other quotas in the system e.g. group quota in DATADISK)
Medium: investigate further the usage of xrootd caches and understand
how they fit the Rucio cache model.
Medium: implement popularity for xAOD access without dq2 clients or
PanDA. Integrate xrootd popularity in Rucio popularity.
Medium/Low: port the popularity data into the analytics platform and
provide tools to mine the data.
Medium: implement automatic LOCALGROUPDISK cleaning based on the
lifetime model plus a flag set by the space manager
Medium: implement an agent for automatic replica reduction (within the
lifetime and retention policy).
Tier0


High: complete the evolution of the T0 toward hot/cold storage.
High: discuss the possibility of using MCORE at T0, given that Grid
submission at CERN will be all MCORE

Medium: test the T0 spillover to the Grid

Low: get involved in testing the Condor batch system at CERN (as an LSF
replacement)
Operations


High: understand job efficiency in MCORE vs SCORE, reduce the serial part
of MCORE jobs and size MCORE jobs properly
High: re-discuss production vs analysis shares at sites

High: define how Subscriptions and Rules will be used for Run-2,
consistent with the data lifetime model. Make sure that both data
reduction and data replication can be achieved in a simple way (avoid
complex rules)

Medium/High: set up a cloud for opportunistic resources to
minimize inefficiencies on pledged resources
Medium/High: set up a validation system based on nightly releases for
the Derivation Framework
Medium: improve the handling of memory usage inside the job
Medium: decrease latency (from request placement to output availability)
in production and analysis, leveraging new features of Prodsys-2 and Rucio.
Medium: tune FTS parameters to better absorb spikes of
activity
Medium: make monitoring more reactive: because the system is very fault
tolerant, many issues are currently discovered rather late (transferring
jobs was one example).
Medium: move toward a tier-less model, where sites are “specialized”:
sites should run the workflows they are capable of.
Medium: move toward a cloudless model. Start defining a World Cloud
and assign tasks to it. The only relevant parameter is the custodial site of
outputs to be retained.
Medium: define storage organization for Run-2 (GROUPDISK, DATADISK,
SCRATCHDISK, LOCALGROUPDISK and TAPES) and define a “migration”
scenario if needed.
Medium/Low: investigate the remaining issues that create dark data
Computing Model


Critical: define lifetimes for all data types and have a dry-run of the data
lifecycle model
Critical: expose the new model to sites (Jamboree) and get immediate
feedback. Start a longer-term discussion on resources.


High: apply data deletion based on the dry run above
High: define data retention policies and suggest an implementation per
data type

Medium/High: define/review the placement/replication model within the
data lifetime (disk vs tape, T1 vs T2, number of copies)

Low: start considering the implications of Data Preservation in the
computing model.
Software

Critical: effort should be put into the TRF area (item for SW&C coordination)

High: tighter testing of SW releases is needed (memory consumption, CPU
and WallTime). Merging should work when reconstruction is released.