CVMFS Post Mortem

Doug Benjamin
Duke University
What happened?

PoolFileCatalog.xml became corrupt

The relevant section of the file is -
<File ID="6651E9BA-061E-DD11-8F27-00304879FC6E">
<physical>
<pfn filetype="ROOT_All" name="/cvmfs/a<File ID="F80FEF94-CAF8-E011-8FBD-003048F0E7A2">
<physical>
<pfn filetype="ROOT_All" name="/cvmfs/atlas-condb.cern.ch/repo/conditions/cond11/cond11_data.000012.gen.COND/cond11_data.000012.gen.COND._0002.pool.root"/>
</physical>
<logical>
<lfn name="cond11_data.000012.gen.COND._0002.pool.root"/>
</logical>
</File>
The first entry, <pfn filetype="ROOT_All" name="/cvmfs/a ..., is bogus: it is cut off mid-path and the next <File> entry starts inside it.
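A cheap structural check would have flagged this kind of damage. The sketch below only counts opening <File ...> and closing </File> tags and reports a mismatch. It is not a full XML parse, and check_file_tags is a hypothetical helper name, not part of any ATLAS or cvmfs tool.

```shell
# Sketch only: compare the number of <File ...> opens and </File> closes.
# check_file_tags is a hypothetical name introduced for this illustration.
check_file_tags() {
    catalog="$1"
    # grep -o prints one match per line, so wc -l counts occurrences even
    # when two tags end up on the same line, as in the corrupt excerpt.
    opens=$(( $(grep -o '<File ' "$catalog" | wc -l) ))
    closes=$(( $(grep -o '</File>' "$catalog" | wc -l) ))
    if [ "$opens" -ne "$closes" ]; then
        echo "$catalog: $opens <File> opens but $closes </File> closes" >&2
        return 1
    fi
    return 0
}
```

For the excerpt above the check finds two opens and one close, so it fails. A real XML parser would be a stronger test; this only catches truncated or overwritten entries.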
What happened (2)
• The lead cvmfs developer was cleaning the repository and triggered the publishing of the bogus file
• He did not know it was bogus (there was no way he could have known)
• Stratum 1 servers picked up the bogus file within 1 hour and published it
• Cron jobs on the Stratum 1 servers fetch files from the Stratum 0 server hourly
• Cvmfs clients fetch files from the Stratum 1 servers whenever the time-to-live information expires or an automount of the cvmfs areas is triggered
How was the PFC created

The PoolFileCatalog.xml is created by a cron script that runs dq2-ls -T in a loop over the conditions directories, where $dir_list is
dir_list="oflcond cmccond comcond cond08 cond09 cond10 cond11 cond12 cond13 cond14 cond15 cond16 cond17 cond18 cond19 cond20"
and ATLAS_POOLCOND_PATH is
export ATLAS_POOLCOND_PATH=/cvmfs/atlas-condb.cern.ch/repo/conditions
The loop itself:
# loop over the directories
for dir in $dir_list
do
    # determine if there are any data sets (the glob must match at least one entry)
    if ls -1 "${ATLAS_POOLCOND_PATH}/${dir}"/* > /dev/null 2>&1
    then
        echo "running command - dq2-ls -T ${ATLAS_POOLCOND_PATH}/${dir}" >> "$LogFile" 2>&1
        dq2-ls -T "${ATLAS_POOLCOND_PATH}/${dir}" >> "$PoolFileCatalogLog" 2>&1
        retcode=$?
        # stop at the first dq2-ls failure
        if [ "$retcode" -ne 0 ] ; then
            echo "Error - failed to update PoolFileCatalog - exiting " >> "$LogFile" 2>&1
            echo "Error - failed to update PoolFileCatalog - exiting "
            exit 1
        fi
    else
        echo "${ATLAS_POOLCOND_PATH}/${dir} does not have datasets" >> "$LogFile" 2>&1
    fi
done
What was the immediate fix?
• The bogus lines were removed from the PoolFileCatalog.xml
• The cron job that does the file checkout and ultimate publishing was stopped and has not been restarted
Why did it happen?
• Not sure why the PoolFileCatalog creation failed
• Logs did not give any indication of the failure
• Did not have a backup PFC file
Remediation steps
• Ultimately use Alessandro DeSalvo’s sw-mgr code to get the datasets and create the PFC (it saves the older version)
• Requires ATLAS software releases to be available on the conditions db machine
• Steve Traylen is working on the cvmfs mounts; it is a bit tricky and troublesome
• Run the xml and file verification step from Misha Borodin in the cron job
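What the file verification step from Misha Borodin actually checks is not shown here. As an illustration only, one plausible check is that every physical file name listed in the catalog exists on disk. In the sketch below, list_pfns and check_pfns_exist are hypothetical names, and the extraction assumes each <pfn .../> element sits on a single line.

```shell
# Sketch only, not the verification script referred to above.
# list_pfns and check_pfns_exist are hypothetical helper names.

# Print the name attribute of every <pfn ...> element, one per line.
list_pfns() {
    grep -o '<pfn [^>]*name="[^"]*"' "$1" | sed -e 's/.*name="//' -e 's/"$//'
}

# Succeed only if every listed physical file exists.
# A catalog with no <pfn> entries passes trivially.
check_pfns_exist() {
    list_pfns "$1" | {
        missing=0
        while IFS= read -r pfn; do
            if [ ! -f "$pfn" ]; then
                echo "missing physical file: $pfn" >&2
                missing=$((missing + 1))
            fi
        done
        [ "$missing" -eq 0 ]
    }
}
```

With the corrupt entry from this incident, the truncated name is extracted as a path that does not exist, so the check fails.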
Short term plans
• Resume fetching of datasets to the machine
• Will be done manually (with the same script, without the publishing step)
• Will run PFC file creation separately
• Add xml format verification
• PFC file backup (keep a few copies)
• Once everything looks good, publish manually
• Will update every day or so
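The "keep a few copies" backup could be as simple as a numbered rotation run before each PFC regeneration. The sketch below is one way to do it. backup_catalog, the .bak.N suffix scheme and the default of 5 copies are all assumptions made for illustration.

```shell
# Sketch only: keep the last N copies of the catalog as <file>.bak.1 .. .bak.N,
# with .bak.1 the most recent. Name, suffix scheme and default are assumptions.
backup_catalog() {
    catalog="$1"
    keep="${2:-5}"
    [ -f "$catalog" ] || return 0      # nothing to back up yet
    # Shift existing copies up by one; the oldest copy is overwritten.
    i=$((keep - 1))
    while [ "$i" -ge 1 ]; do
        if [ -f "$catalog.bak.$i" ]; then
            mv -f "$catalog.bak.$i" "$catalog.bak.$((i + 1))"
        fi
        i=$((i - 1))
    done
    cp -p "$catalog" "$catalog.bak.1"
}
```

Calling backup_catalog PoolFileCatalog.xml 3 before each regeneration leaves at most three earlier versions to fall back on.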
Intermediate plans
• Once ATLAS code is available:
• Implement sw-mgr creation of the PFC and fetch of the datasets
• Initially will be done by hand
• Ultimately moved to cron job
• Will add e-mail notification in case of failures
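A minimal sketch of the e-mail notification, assuming a working mail(1) setup on the conditions db machine: wrap each step of the cron job so that a non-zero exit status sends a message. run_or_notify, notify_failure and the ADMIN_EMAIL variable are hypothetical names, not part of the existing script.

```shell
# Sketch only: run a step and send a mail if it fails.
# notify_failure, run_or_notify and ADMIN_EMAIL are hypothetical names.

# Assumes mail(1) is installed and configured on the host.
notify_failure() {
    printf '%s\n' "$2" | mail -s "$1" "${ADMIN_EMAIL:-root}"
}

# Usage: run_or_notify "step description" command [args...]
run_or_notify() {
    step="$1"
    shift
    "$@"
    rc=$?
    if [ "$rc" -ne 0 ]; then
        notify_failure "PFC update failed: $step" \
            "Command '$*' exited with status $rc on $(uname -n)"
    fi
    return "$rc"
}
```

The cron script could then call, for example, run_or_notify "dq2-ls cond11" dq2-ls -T "${ATLAS_POOLCOND_PATH}/cond11" and still act on the returned exit status as before.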