CVMFS Post Mortem
Doug Benjamin
Duke University
What happened?
PoolFileCatalog.xml became corrupt
The relevant section of the file is:
<File ID="6651E9BA-061E-DD11-8F27-00304879FC6E">
<physical>
<pfn filetype="ROOT_All" name="/cvmfs/a<File ID="F80FEF94-CAF8-E011-8FBD-003048F0E7A2">
<physical>
<pfn filetype="ROOT_All" name="/cvmfs/atlas-condb.cern.ch/repo/conditions/cond11/cond11_data.000012.gen.COND/cond11_data.000012.gen.COND._0002.pool.root"/>
</physical>
<logical>
<lfn name="cond11_data.000012.gen.COND._0002.pool.root"/>
</logical>
</File>
The first <pfn filetype="ROOT_All" name="/cvmfs/a ... entry is bogus: it is cut off mid-path and never closed.
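A truncated entry like the one above can be caught mechanically. The following is a minimal sketch, not the verification script referred to later in these slides: the function name check_pfc is hypothetical, and the check is deliberately crude (it counts opening and closing File tags and looks for a '<' inside a name="..." value). A full parser, for example xmllint --noout, would be stricter where it is installed.

```shell
#!/bin/bash
# check_pfc <catalog>   (hypothetical helper, for illustration only)
# Returns 0 if the catalog passes two crude structural checks, 1 otherwise.
check_pfc() {
    local pfc="$1"
    local opens closes

    # An empty or missing catalog is never acceptable.
    [ -s "$pfc" ] || return 1

    # Every <File ID=...> element must have a matching </File>.
    opens=$(grep -o '<File ID=' "$pfc" | wc -l || true)
    closes=$(grep -o '</File>' "$pfc" | wc -l || true)
    [ "$opens" -eq "$closes" ] || return 1

    # A '<' inside a quoted name="..." value means the entry was cut off
    # and the next element was written on top of it.
    if grep -q 'name="[^"]*<' "$pfc"; then
        return 1
    fi
    return 0
}
```

Usage: check_pfc PoolFileCatalog.xml || echo "catalog looks corrupt, do not publish"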
What happened (2)
The lead cvmfs developer was cleaning the repository and
triggered the publishing of the bogus file.
He did not know it was bogus, and had no way of
knowing.
Within 1 hour the Stratum 1 servers picked up the bogus file
and published it.
Cron jobs on the Stratum 1 servers fetch files from the
Stratum 0 server hourly.
Cvmfs clients fetch files from the Stratum 1 servers
whenever either the time-to-live information expires or
an automount of the cvmfs areas is triggered.
How was the PFC created?
The PoolFileCatalog.xml is created by a cron script that runs dq2-ls in a loop (shown below),
where $dir_list is
dir_list="oflcond cmccond comcond cond08 cond09 cond10 cond11 cond12 cond13 cond14 cond15 cond16 cond17 cond18 cond19 cond20"
and ATLAS_POOLCOND_PATH is
export ATLAS_POOLCOND_PATH=/cvmfs/atlas-condb.cern.ch/repo/conditions
# loop over the directories
for dir in $dir_list
do
    # determine if there are any data sets
    ls -1 ${ATLAS_POOLCOND_PATH}/${dir}/* > /dev/null 2>&1
    if [ "$?" = "0" ]
    then
        echo "running command - dq2-ls -T ${ATLAS_POOLCOND_PATH}/${dir}" >> $LogFile 2>&1
        dq2-ls -T ${ATLAS_POOLCOND_PATH}/${dir} >> $PoolFileCatalogLog 2>&1
        retcode=RC$?
        if [ $retcode != "RC0" ] ; then
            echo "Error - failed to update PoolFileCatalog - exiting " >> $LogFile 2>&1
            echo "Error - failed to update PoolFileCatalog - exiting "
            exit 1
        fi
    else
        echo "${ATLAS_POOLCOND_PATH}/${dir} does not have datasets" >> $LogFile 2>&1
    fi
done
What was the immediate fix?
The bogus lines were removed from the
PoolFileCatalog.xml
The cron job that does the file checkout and ultimate
publishing was stopped and has not been restarted
Why did it happen?
It is not clear why the PoolFileCatalog creation failed.
The logs did not give any indication of the failure.
There was no backup PFC file.
Remediation steps
Ultimately use Alessandro DeSalvo’s sw-mgr code to
get the datasets and create the PFC (it saves the older version).
Requires ATLAS software releases to be available on the
conditions db machine.
Steve Traylen is working on the cvmfs mounts; it is a bit tricky
and troublesome.
Run the xml and file verification step from Misha
Borodin in the cron job.
Short term plans
Resume fetching of datasets to the machine
Will be done manually (with the same script w/o the
publishing step)
Will run PFC file creation separately.
Add xml format verification
PFC file backup (keep a few copies)
Once everything looks good, publish manually
Will update every day or so
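The "keep a few copies" backup step could look roughly like the sketch below. It is an assumption, not the script actually deployed: the function name backup_pfc, the backup directory argument and the default of 5 copies are all hypothetical. Copies are named with a timestamp suffix so that sorting by name puts the oldest first.

```shell
#!/bin/bash
# backup_pfc <catalog> <backup_dir> [keep]   (hypothetical helper)
# Copies the catalog to <backup_dir>/PoolFileCatalog.xml.<timestamp> and
# removes the oldest copies so that at most <keep> remain (default 5).
backup_pfc() {
    local pfc="$1" dir="$2" keep="${3:-5}"
    local stamp count excess

    [ -f "$pfc" ] || return 1
    mkdir -p "$dir" || return 1

    # Timestamp in the name, so lexical order equals chronological order.
    stamp=$(date +%Y%m%d-%H%M%S)
    cp -p "$pfc" "$dir/PoolFileCatalog.xml.$stamp" || return 1

    # Prune: delete the oldest copies beyond the retention limit.
    count=$(ls -1 "$dir"/PoolFileCatalog.xml.* | wc -l)
    excess=$((count - keep))
    if [ "$excess" -gt 0 ]; then
        ls -1 "$dir"/PoolFileCatalog.xml.* | sort | head -n "$excess" |
        while read -r old; do
            rm -f -- "$old"
        done
    fi
    return 0
}
```

Taking the backup before the new PFC is written means a corrupt catalog can be rolled back by copying the most recent good copy into place.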
Intermediate plans
Once ATLAS code is available
Implement sw-mgr creation of PFC and fetch of the
datasets.
Initially will be done by hand
Ultimately moved to cron job
Will add e-mail notification in case of failures
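One possible shape for the failure notification is sketched below. Everything named here is an assumption: the wrapper run_step, the helper send_alert and the address pfc-admin@example.org are hypothetical, and send_alert relies on a working local mail(1) setup on the conditions db machine.

```shell
#!/bin/bash
# Hypothetical recipient; override by exporting ALERT_TO before running.
ALERT_TO="${ALERT_TO:-pfc-admin@example.org}"

# send_alert <subject> <body>: mails the body to $ALERT_TO.
send_alert() {
    printf '%s\n' "$2" | mail -s "$1" "$ALERT_TO"
}

# run_step <name> <command> [args...]
# Runs the command, captures stdout and stderr, and sends an alert
# containing that output if the command exits non-zero.
# Returns the command's exit status.
run_step() {
    local name="$1"
    shift
    local out rc
    out=$("$@" 2>&1) && rc=0 || rc=$?
    if [ "$rc" -ne 0 ]; then
        send_alert "PFC update failed: $name (exit $rc)" "$out"
    fi
    return "$rc"
}
```

Usage: run_step "create PFC" ./make_pfc.sh || exit 1, where make_pfc.sh stands in for whichever script performs the step. Capturing the command output in the mail body addresses the earlier problem that the logs gave no indication of the failure.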