Action list from the Chicago TIM meeting

Priority Levels
Critical: by the end of November 2014 (before the Rucio and Prodsys-2 migration)
High: by the end of January 2015
Medium: by the start of data taking
Low: long term

Workload Management & Resource Provisioning
High: improve dataset lookups and see how they can be sped up in Rucio through metadata searches rather than pattern matching (see the first sketch after this list).
High: error codes and their propagation (Athena / TRF / pilot / PanDA / Monitoring) should be improved.
High: improve (JEDI) merging: avoid 1-to-1 merging, handle zero-event files properly, handle lumiblock (LB) boundaries correctly, and tune the merging parameters (see the grouping sketch after this list).
High: discuss (one day) and define a work plan for dynamic queues, with the aim of having dynamic queues by Q2 2015.
High: demonstrate MCORE slot provisioning in a cloud environment other than HLT@CERN.
High: integrate HPC submission with the ADC components (Prodsys/Rucio: tasks run as Prodsys tasks and datasets are stored in Rucio) and demonstrate it at scale for various architectures (process and validate sample A for x86-compatible machines).
Medium/High: increase the number of available MCORE resources (currently 30% less than SCORE for production).
Medium: enhance monitoring to track requests rather than tasks; provide an overview of request completion status.
Medium: investigate how to integrate Good Run Lists in Prodsys-2 (concerns about RunQuery scalability).
Medium: merge the reprocessing monitor functionality into the BigPanDA monitor.
Medium: explore the ACT model (“push”) for sites other than NDGF (sites with an ARC CE installed).
Medium: set up a few MCORE queues for analysis and gain experience.
Medium: standardize the ATLAS SW installation in HPC facilities.
Low: understand the overlaps between provisioning solutions for cloud resources (VAC, CloudScheduler, APF2) and consolidate where possible.
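To make the first High item concrete: the sketch below only illustrates the algorithmic difference between the two lookup styles. The tiny in-memory catalogue, its field names and the dataset names are all invented for illustration; this is not Rucio's schema or client API. A wildcard search must scan every name, while a metadata search can answer by intersecting precomputed index sets.

    import fnmatch
    from collections import defaultdict

    # Hypothetical in-memory catalogue; in Rucio this lives in a database.
    datasets = [
        {"name": "data15_13TeV.00266904.physics_Main.AOD",  "project": "data15_13TeV", "datatype": "AOD"},
        {"name": "data15_13TeV.00266904.physics_Main.DAOD", "project": "data15_13TeV", "datatype": "DAOD"},
        {"name": "mc15_13TeV.410000.ttbar.AOD",             "project": "mc15_13TeV",   "datatype": "AOD"},
    ]

    def lookup_by_pattern(pattern):
        # Wildcard search: an O(N) scan over every dataset name.
        return [d["name"] for d in datasets if fnmatch.fnmatch(d["name"], pattern)]

    # Metadata search: build an index once, (field, value) -> names.
    index = defaultdict(set)
    for d in datasets:
        for field in ("project", "datatype"):
            index[(field, d[field])].add(d["name"])

    def lookup_by_metadata(**filters):
        # Intersect the per-field sets; the cost scales with the result,
        # not with the total number of datasets in the catalogue.
        sets = [index[(field, value)] for field, value in filters.items()]
        return set.intersection(*sets) if sets else set()

    print(lookup_by_pattern("data15_13TeV.*.AOD"))
    print(lookup_by_metadata(project="data15_13TeV", datatype="AOD"))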
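The merging item can likewise be prototyped. The greedy grouping sketch below assumes each unmerged file is known by name, lumiblock and event count (all values invented): it drops zero-event files, never groups across an LB boundary, and passes a lone leftover file through instead of scheduling a 1-to-1 merge. The real JEDI logic has to honour many more constraints (output size, job limits); this only shows the three behaviours requested above.

    from collections import defaultdict

    # Invented unmerged outputs: (file name, lumiblock, number of events).
    files = [
        ("out.0001", 1, 120), ("out.0002", 1, 0), ("out.0003", 1, 95),
        ("out.0004", 2, 110), ("out.0005", 2, 130), ("out.0006", 3, 80),
    ]

    MAX_EVENTS_PER_MERGE = 250   # a merging parameter one would tune

    def plan_merges(files):
        """Group files into merge jobs without crossing lumiblock boundaries."""
        by_lb = defaultdict(list)
        for name, lb, nev in files:
            if nev == 0:                 # zero-event file: nothing to merge
                continue
            by_lb[lb].append((name, nev))

        merge_jobs, passthrough = [], []
        for lb in sorted(by_lb):         # never mix files from different LBs
            group, total = [], 0
            for name, nev in by_lb[lb]:
                if group and total + nev > MAX_EVENTS_PER_MERGE:
                    merge_jobs.append(group)
                    group, total = [], 0
                group.append(name)
                total += nev
            if len(group) > 1:
                merge_jobs.append(group)
            elif group:
                passthrough.append(group[0])   # avoid a 1-to-1 merge job
        return merge_jobs, passthrough

    jobs, single = plan_merges(files)
    print("merge jobs: ", jobs)      # [['out.0001', 'out.0003'], ['out.0004', 'out.0005']]
    print("passthrough:", single)    # ['out.0006']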
Data Management
Critical: lumiblock and #events metadata should be supported in Rucio.
Critical: expose Rucio-aware dq2-clients to power users and implement their feedback before the migration.
Critical: validate DaTRI against Rucio before the migration.
Critical: achieve horizontal scalability of all Rucio services.
High: instrument/improve the probes that check the sanity of the Rucio services (internal and external probes).
High: provide dumps for sites to be used for consistency checks (see the comparison sketch after the Software list).
Medium/High: extend the Rucio storage dumps to include replica popularity and # of replicas information.
Medium: harmonize the consistency-check procedures and automate them further.
Medium: implement the functionalities needed in the Rucio clients so that they can replace the dq2-clients.
Medium: provide a single client for get operations (rucio-get?) using the “best” available protocol and the best fallback strategy.
Medium: implement the DaTRI functionalities in the Rucio Web UI, including delegation and notifications.
Medium: define the quota system for users in LOCALGROUPDISK (and other quotas in the system, e.g. group quotas in DATADISK).
Medium: investigate further the usage of xrootd caches and understand how it fits the Rucio cache model.
Medium: implement popularity for xAOD access without dq2-clients or PanDA; integrate the xrootd popularity into the Rucio popularity.
Medium/Low: port the popularity data to the analytics platform and provide tools to mine the data.
Medium: implement automatic LOCALGROUPDISK cleaning based on the lifetime model plus a flag set by the space manager.
Medium: implement an agent for automatic replica reduction (within the lifetime and retention policy).

Tier0
High: complete the evolution of the T0 toward hot/cold storage.
High: discuss the possibility of using MCORE in the T0, in light of the fact that Grid submission at CERN will be all MCORE.
Medium: test the T0 spillover to the Grid.
Low: get involved in the testing of the Condor batch system at CERN (as an LSF replacement).

Operations
High: understand job efficiency in MCORE vs SCORE, reduce the serial part of MCORE jobs and size MCORE jobs properly.
High: re-discuss production vs analysis shares at sites.
High: define how Subscriptions and Rules will be used for Run-2, compatibly with the data lifetime model. Make sure data reduction, as well as data replication, can be achieved in a simple way (avoid complex rules).
Medium/High: set up a cloud for opportunistic resources to minimize inefficiencies on pledged resources.
Medium/High: set up a validation system based on nightly releases for the Derivation Framework.
Medium: get a better handle on memory usage inside the job.
Medium: decrease the latency from placed request to available output in production and analysis, leveraging the new features of Prodsys-2 and Rucio.
Medium: tune the FTS parameters to better digest spikes of activity.
Medium: get more reactive monitoring; today many issues are discovered rather late (transferring jobs was one example) because the system is very fault tolerant.
Medium: move toward a tier-less model where sites are “specialized”: sites should run the workflows they are capable of.
Medium: move toward a cloud-less model. Start defining a World Cloud and assigning tasks to it; the only relevant parameter is the custodial site of the outputs to be retained.
Medium: define the storage organization for Run-2 (GROUPDISK, DATADISK, SCRATCHDISK, LOCALGROUPDISK and TAPES) and define a “migration” scenario if needed.
Medium/Low: investigate the remaining issues creating dark data (the consistency sketch after the Software list also flags dark data).

Computing Model
Critical: define lifetimes for all data types and have a dry-run of the data lifecycle model (see the dry-run sketch at the end of this document).
Critical: expose the new model to sites (Jamboree) and get immediate feedback. Start a longer-term discussion on resources.
High: apply data deletion based on the dry-run above.
High: define data retention policies and suggest an implementation per data type.
Medium/High: define/review the placement/replication model within the data lifetime (disk vs tape, T1 vs T2, # of copies).
Low: start considering the implications of Data Preservation in the computing model.

Software
Critical: effort should be put into the TRF area. Item for SW&C coordination.
High: tighter testing of the SW releases is needed (memory consumption, CPU and wall time). Merging should work when reconstruction is released.
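Two items above (the site dumps for consistency checks in Data Management, and the dark-data item in Operations) reduce to the same comparison between what the catalogue records and what the storage element actually holds. A minimal sketch, assuming both dumps can be flattened to one file path per line; the dump file names are placeholders, and a production check would also need timestamps or a grace period so that in-flight transfers are not flagged.

    def load_dump(path):
        """One file path per line; blank lines ignored."""
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    # Placeholder file names for the two dumps being compared.
    rucio_replicas = load_dump("rucio_dump.txt")    # what the catalogue believes
    storage_files  = load_dump("storage_dump.txt")  # what the SE actually holds

    dark_data  = storage_files - rucio_replicas  # on disk, unknown to the catalogue
    lost_files = rucio_replicas - storage_files  # catalogued, but missing on disk

    print("dark data:  %d files" % len(dark_data))
    print("lost files: %d files" % len(lost_files))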
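Finally, the lifecycle dry-run asked for in the Computing Model section (and the related LOCALGROUPDISK cleaning item in Data Management) can be prototyped in a few lines: assign each data type a lifetime, then report what would become eligible for deletion while honouring a pin flag set by the space manager. Every lifetime value and dataset below is invented for illustration; the sketch deletes nothing.

    from datetime import datetime, timedelta

    # Illustrative lifetimes per data type; the real values are to be defined.
    LIFETIME = {
        "RAW":  None,                    # None = retain indefinitely
        "AOD":  timedelta(days=365),
        "DAOD": timedelta(days=180),
        "user": timedelta(days=90),
    }

    # Invented datasets: (name, data type, creation date, pinned by space manager).
    datasets = [
        ("data14.raw.r1",   "RAW",  datetime(2013, 6, 1), False),
        ("data14.aod.r1",   "AOD",  datetime(2013, 6, 1), False),
        ("user.jdoe.ntup",  "user", datetime(2014, 9, 1), False),
        ("group.phys.daod", "DAOD", datetime(2014, 2, 1), True),
    ]

    def deletion_candidates(datasets, now):
        """Dry run: yield what the lifetime model would delete; delete nothing."""
        for name, dtype, created, pinned in datasets:
            lifetime = LIFETIME.get(dtype)
            if lifetime is None or pinned:
                continue                 # no lifetime defined, or explicitly pinned
            if now - created > lifetime:
                yield name, (now - created) - lifetime

    now = datetime(2014, 11, 1)
    for name, overdue in deletion_candidates(datasets, now):
        print("%s: expired %d days ago" % (name, overdue.days))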