ATLAS DDM Lost Files - Used procedures

The general approach was described in a talk at the ATLAS DDM workshop. A short summary:

  1. Only production files are treated
  2. Get list of lost files (provided by a sysadmin)
  3. Remove information about lost files from the SE db (must be done by a sysadmin)
  4. Delete lost entries from the T1 LFC catalogue
  5. Locate replicas of lost files. If they exist, consider replication to the affected SE. If they do not exist, remove the lost files from the datasets (DQ2 db) and pass the list of really lost files to the prodsys group.

Relevant scripts and instructions can be found here.

Past cases

SARA 200612

39 files lost: a disk crashed on ant1 before the files were migrated to tape (list of lost files).
  • Locate files in LFC:
The LFN was constructed from the SURL. Because two different conventions were in use, there were 2 separate cases:
/pnfs/grid.sara.nl/data/atlas -> lfn:/grid/atlas/dq2/
/pnfs/grid.sara.nl/data/atlas/misal1_csc11/ -> lfn:/grid/atlas/dq2/misal1_csc11/HITS/
  • Remove lost replicas from the LFC:
lcg-lg --vo atlas $LFN # to get GUID
lcg-uf $GUID $SURL
  • Replicate lost files (in this case all files had replicas in the same cloud). The script used was not general, but it worked for this case.
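The two SURL-to-LFN conventions listed for this case can be sketched as a small shell function. This is only an illustration, not the script that was actually used: the function name surl_path_to_lfn is hypothetical, and it assumes the input is the storage path without the srm://host part. The more specific misal1_csc11 rule is applied first so that the general rule does not catch those paths.

```shell
#!/bin/sh
# Hypothetical helper: map a SARA storage path to an LFN.
# Assumes $1 is a path starting with /pnfs/grid.sara.nl/data/atlas/.
surl_path_to_lfn() {
    # The misal1_csc11 rule comes first; once it has fired, the result
    # starts with "lfn:" and the anchored general rule no longer matches.
    printf '%s\n' "$1" | sed \
        -e 's#^/pnfs/grid\.sara\.nl/data/atlas/misal1_csc11/#lfn:/grid/atlas/dq2/misal1_csc11/HITS/#' \
        -e 's#^/pnfs/grid\.sara\.nl/data/atlas/#lfn:/grid/atlas/dq2/#'
}
```

The resulting LFN can then be passed to lcg-lg as in the commands above.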

NIKHEF 200701

3604 files lost
Format of the input file with the list of lost files:
144080 151 /dpm/nikhef.nl/home/atlas/dq2/calib0/calib0.005011.J2_pythia_jetjet.simul.HITS.v12000301_tid003287/calib0.005011.J2_pythia_jetjet.simul.HITS.v12000301_tid003287._00417.pool.root.14
144083 151 /dpm/nikhef.nl/home/atlas/dq2/calib0/calib0.005011.J2_pythia_jetjet.simul.HITS.v12000301_tid003287/calib0.005011.J2_pythia_jetjet.simul.HITS.v12000301_tid003287._00689.pool.root.18
Each line holds two numbers (probably a unique file id and an owner id) and a file name.
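A minimal sketch of pulling the file name out of such a list, assuming the three columns are separated by whitespace and the path itself contains no spaces. The function name extract_paths is hypothetical; lines with fewer than three fields are skipped.

```shell
#!/bin/sh
# Hypothetical helper: print only the path column (third field) of the
# lost-file list. Reads the named files, or stdin if none are given.
extract_paths() {
    awk 'NF >= 3 { print $3 }' "$@"
}
```

For example, `extract_paths lost_files.txt > paths.list`, where both file names are placeholders.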
If the path does not start with /grid/atlas/dq2, but with /grid/atlas/something_else, add dq2:
sed 's#/grid/atlas/\([^dq2][^/]*\)/#/grid/atlas/dq2/\1/#' lost_atlas_files.lfc.list > list_for_lfc_with_dq2.list
Then find them again in the LFC:
/usr/bin/time cat list_for_lfc_with_dq2.list |while read FN ; do lfc-ls $FN >> in_lfc.list 2>> not_in_lfc.list ; done
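One caveat about the sed expression above: `[^dq2]` is a bracket expression that matches any single character other than d, q or 2, so a directory whose name merely begins with one of those letters (a hypothetical /grid/atlas/data/..., say) would be left without the dq2 prefix. A sketch of an alternative, under the assumption that every path under /grid/atlas/ that is not already under /grid/atlas/dq2/ should get the prefix (the function name add_dq2 is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: insert "dq2/" after /grid/atlas/ unless the path
# is already under /grid/atlas/dq2/. Reads named files, or stdin.
add_dq2() {
    # \#...# is a sed address with a custom delimiter; "!" negates it,
    # so the substitution runs only on lines NOT already under dq2/.
    sed '\#^/grid/atlas/dq2/#!s#^/grid/atlas/#/grid/atlas/dq2/#' "$@"
}
```

Usage would mirror the original command: `add_dq2 lost_atlas_files.lfc.list > list_for_lfc_with_dq2.list`.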

2. step: find GUID:
export LCG_GFAL_VO=atlas
export LFC_HOST=mu11.matrix.sara.nl
cat  list_for_lfc_with_dq2.list.10 | while read FN ; do echo -n "lfn:$FN " >> files_guid.list.10 ; lcg-lg lfn:$FN >> files_guid.list.10 ; done 

3. step: unregister files
/usr/bin/time cut -f2 -d' ' surl_guid.list |while read GUID ; do echo $GUID ;  SURL=`lcg-lr $GUID |grep tbn18.nikhef.nl/dpm/nikhef.nl` ;  echo $SURL ;  COMM="lcg-uf $GUID $SURL" ; $COMM ; done > lfc.update.log 2>&1 

It took 1 hour 40 minutes.
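In the unregister loop, the SURL comes from filtering the replica list with grep. If that filter returns nothing, or more than one line, lcg-uf is called with the wrong number of arguments. A hedged sketch of a guard, assuming the replica list arrives one SURL per line on stdin (the function name pick_site_replica is hypothetical):

```shell
#!/bin/sh
# Hypothetical helper: read a replica list on stdin and print the replica
# matching the site string in $1, but only if exactly one line matches.
# Returns non-zero when there are zero or several matches.
pick_site_replica() {
    site_pattern=$1
    # -F treats the site string literally (dots are not wildcards).
    matches=$(grep -F -- "$site_pattern" || true)
    # Count non-empty lines among the matches.
    count=$(printf '%s' "$matches" | grep -c . || true)
    [ "$count" -eq 1 ] || return 1
    printf '%s\n' "$matches"
}

# Possible use inside the loop (not run here, needs the grid tools):
#   SURL=$(lcg-lr "$GUID" | pick_site_replica tbn18.nikhef.nl/dpm/nikhef.nl) \
#       && lcg-uf "$GUID" "$SURL"
```

With this guard, entries that have no replica at the site are skipped instead of producing a failing lcg-uf call.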
4. step: Find replicas on other sites
A script from Jiahang, lfccheck.py:
# Read in the dataset list "dq2check.files"
# Query every dataset replica location to find the files
# Add replica location information for every file, and save the dataset list as "lfccheck.files"
# Pick out files that are in the dq2 catalog but have no replica anywhere (to be cleaned up), and write them into "woreplica.files"

5. step: Remove the really lost files (those with no replica, listed in woreplica.files) from the dataset definitions and pass the list to the production team - done by Alexei.
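The selection of files without any replica, as described for lfccheck.py, can be illustrated in shell. This is not the lfccheck.py logic itself, and the input format is an assumption: one file per line, the LFN followed by zero or more replica locations separated by whitespace. The function name files_without_replica is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the LFNs that have no replica location.
# ASSUMED input format (not taken from lfccheck.py):
#   <LFN> [<replica location> ...]
# A line with a single field therefore means "no replica anywhere".
files_without_replica() {
    awk 'NF == 1 { print $1 }' "$@"
}
```

Under that assumption, `files_without_replica replica_report.txt > woreplica.files` would produce the list to hand over, where replica_report.txt is a placeholder name.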

ITEP 200701

The provided list of files (20 156 files) present at ITEP on 28.01.2007 did not help: some files were listed twice. 4323 files were not known to the LFC and 16193 had no replica at ITEP, so these could not be unregistered. I did a complete "integrity" check, identified 38590 missing files at ITEP, and ran lcg-uf to delete them from the catalogue. It was very slow, but succeeded.
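The kind of comparison behind such an integrity check can be sketched with standard tools. This is an illustration under assumptions, not the check that was actually run: both inputs are taken to be plain-text lists with one path per line, written in the same form, and the function names are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print catalogue entries with no file on storage.
#   $1 = list of files the catalogue believes are at the site
#   $2 = list of files actually present on the storage element
missing_on_storage() {
    tmp_cat=$(mktemp) && tmp_se=$(mktemp) || return 1
    # comm needs sorted input; -u also removes duplicate entries.
    sort -u "$1" > "$tmp_cat"
    sort -u "$2" > "$tmp_se"
    # -23 keeps only lines unique to the first (catalogue) list.
    comm -23 "$tmp_cat" "$tmp_se"
    rm -f "$tmp_cat" "$tmp_se"
}

# Hypothetical helper: print entries that occur more than once in a list.
duplicates() {
    sort "$1" | uniq -d
}
```

The output of missing_on_storage is the candidate list for unregistering with lcg-uf; duplicates helps to spot files that appear twice in a provided list.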

SARA 200703

1275 lost files

NIKHEF 200704

176931 possibly corrupted files. A summary from Jan Kubalec is on the page ATLASLostNIKHEF200704. Still not cleaned from the LFC (LFC mostly unavailable in the week 20070514 - 18).

FZU (Prague) 200705

xxx files lost from a pool node se2 attached to golias100.farm.particle.cz. The list of lost files is not yet available.

NDGF T1 200705

A faulty RAID controller resulted in the loss of around 250,000 files (11TB of data) from a disk pool in the NDGF T1. There were several attempts at recovery lasting around 3 weeks, and during this time the files were still registered in the LRC and RLS catalogs. When all these attempts failed, the files were deleted from these catalogs. A homemade script (available from here) was used to check LFCs and LRCs in other clouds to determine whether any replicas of the lost files were available. This took a few hours to run. It was run from CERN lxplus, and there was a significant performance difference between the European replica catalogs and those outside Europe. Around 100,000 files had replicas elsewhere, and the list of 150,000 permanently lost files was sent to the DDM ops team for deletion from the DQ2 datasets.

-- JiriChudoba - 18 May 2007

Topic revision: r7 - 2007-09-24 - JiriChudoba