ATLAS DDM Operations at SARA, 2007
December 10
Many errors "Reason [DESTINATION error during PREPARATION phase: [REQUEST_TIMEOUT] failed
to prepare Destination file in 180 seconds] Source Host [srm.cern.ch]" on SARADISK, no good
transfers.
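Failure reasons of this kind follow a recognisable pattern (side, phase, bracketed error class), which makes them easy to tally when scanning logs. A minimal Python sketch, assuming only the pattern seen in the messages quoted on this page; the function name and the regular expression are illustrative and not part of any FTS tool:

```python
import re

# Pattern inferred from the messages quoted in this log, e.g.
# "DESTINATION error during PREPARATION phase: [REQUEST_TIMEOUT] ..."
# This is an assumption about the format, not an official grammar.
REASON_RE = re.compile(r"(\w+) error during (\w+) phase: \[(\w+)\]\s*(.*)")


def parse_fts_reason(reason):
    """Split a failure reason into side, phase, error class and detail.

    Returns None if the text does not match the assumed pattern.
    """
    match = REASON_RE.search(reason)
    if match is None:
        return None
    side, phase, error_class, detail = match.groups()
    return {"side": side, "phase": phase, "class": error_class, "detail": detail}


example = ("DESTINATION error during PREPARATION phase: [REQUEST_TIMEOUT] "
           "failed to prepare Destination file in 180 seconds")
parsed = parse_fts_reason(example)
```

Grouping parsed reasons by side and class gives a quick view of whether failures cluster on the destination or on the source.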
December 7
Scheduled downtime of SARA-MATRIX due to problems with pnfs. Start 2007-12-06, 18:43:00 [UTC],
End: 2007-12-07, 18:43:00 [UTC].
November 29
dq2_cleanup is no longer in .../GRID/ddm/pro03/, so I will have to use a different tool
to clean up DS from aborted tasks.
November 28
Scheduled (?) downtime for SARA-MATRIX from 9:10 to 13:00 UTC.
Needed for dcache upgrade
November 21
Scheduled downtime for SARA-MATRIX, in effect since 19.11., extended until 22.11.
November 2
Obsolete data deletion succeeded on all sites without a single error!!!
Number of files deleted on each site
IHEP_aborted_ds_20071030.list_20071102_1608.log: 69
ITEP_aborted_ds_20071030.list_20071102_1608.log: 65
JINR_aborted_ds_20071030.list_20071102_1608.log: 60
NIKHEF_aborted_ds_20071030.list_20071102_1608.log: 1195
SARADISK_aborted_ds_20071030.list_20071102_1608.log: 1294
SARATAPE_aborted_ds_20071030.list_20071102_1608.log: 0
SINP_aborted_ds_20071030.list_20071102_1608.log: 7
Those were DS belonging to obsolete tasks mailed 30.10.2007.
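The per-site counts above can be totalled mechanically. A small Python sketch that parses the summary lines exactly as listed; the function name is illustrative and the site name is assumed to be the part of the log file name before the first underscore:

```python
# Per-site summary lines copied from the entry above.
SUMMARY = """\
IHEP_aborted_ds_20071030.list_20071102_1608.log: 69
ITEP_aborted_ds_20071030.list_20071102_1608.log: 65
JINR_aborted_ds_20071030.list_20071102_1608.log: 60
NIKHEF_aborted_ds_20071030.list_20071102_1608.log: 1195
SARADISK_aborted_ds_20071030.list_20071102_1608.log: 1294
SARATAPE_aborted_ds_20071030.list_20071102_1608.log: 0
SINP_aborted_ds_20071030.list_20071102_1608.log: 7
"""


def deleted_per_site(summary):
    """Map site name -> number of deleted files from 'logname: count' lines."""
    counts = {}
    for line in summary.splitlines():
        if not line.strip():
            continue
        log_name, count = line.rsplit(":", 1)
        # Assumption: the site is the prefix before the first underscore.
        site = log_name.split("_", 1)[0]
        counts[site] = int(count)
    return counts


counts = deleted_per_site(SUMMARY)
total = sum(counts.values())  # 2690 files over the seven sites
```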
November 1
Lots of errors for transfers to SARA tape. Reported in GGUS ticket 28553 and also
by Pedro in Savannah:
http://savannah.cern.ch/support/?102905.
October 26
DS belonging to obsolete tasks were deleted:
IHEP_aborted_ds_20071025.list_20071026_0518.log: 1
ITEP_aborted_ds_20071025.list_20071026_0518.log: 1
SARADISK_aborted_ds_20071025.list_20071026_0518.log: 3
Only 5 files in total. No errors.
October 25
SINP downtime finished 3 days ago: 2007-10-22, 16:00:00 [UTC]. There is a new downtime since
today, but only the CE should be down. The SE responds to srm-get-metadata. Open the channels:
glite-transfer-channel-set -S Active -s https://fts.grid.sara.nl:8443/glite-data-transfer-fts/services/ChannelManagement STAR-SINP
glite-transfer-channel-set -S Active -s https://fts.grid.sara.nl:8443/glite-data-transfer-fts/services/ChannelManagement SINP-STAR
And check they are active:
glite-transfer-channel-list -s https://fts.grid.sara.nl:8443/glite-data-transfer-fts/services/ChannelManagement STAR-SINP
Channel: STAR-SINP
Between: * and RU-MOSCOW-SINP-LCG2
State: Active
Contact: (null)
Bandwidth: 0
Nominal throughput: 0
Number of files: 5, streams: 5
Number of VO shares: 1
VO 'atlas' share is: 100
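When several channels have to be checked, the listing can be parsed rather than read by eye. A hedged Python sketch that assumes the "Key: value" layout shown in the output above; the helper names are made up for illustration and do not belong to the gLite tools:

```python
# Output of glite-transfer-channel-list as quoted in this entry.
LISTING = """\
Channel: STAR-SINP
Between: * and RU-MOSCOW-SINP-LCG2
State: Active
Contact: (null)
Bandwidth: 0
Nominal throughput: 0
Number of files: 5, streams: 5
Number of VO shares: 1
VO 'atlas' share is: 100
"""


def parse_channel_listing(text):
    """Turn 'Key: value' lines into a dict (split on the first colon only)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def channel_is_active(text):
    """True if the listing reports 'State: Active'."""
    return parse_channel_listing(text).get("State") == "Active"


fields = parse_channel_listing(LISTING)
active = channel_is_active(LISTING)
```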
October 8
There are errors for NIKHEF at dashboard pages:
State from FTS: Failed; Retries: 3; Reason: TRANSFER error during TRANSFER phase: [GRIDFTP] the server sent
an error response: 550 550 rfio write failure: No space left on device.
But there should be enough space for production role:
POOL ATLASPRD DEFSIZE 0 GC_START_THRESH 0 GC_STOP_THRESH 0 DEF_LIFETIME 7.0d DEFPINTIME 2.0h MAX_LIFETIME 1.0m MAXPINTIME 12.0h FSS_POLICY maxfreespace GC_POLICY lru RS_POLICY fifo GIDS 127 S_TYPE - MIG_POLICY none RET_POLICY R
CAPACITY 24.00T FREE 7.75T ( 32.3%)
hooibroei.nikhef.nl /export/data/vg0/lv0 CAPACITY 6.00T FREE 61.83M ( 0.0%)
hooibroei.nikhef.nl /export/data/vg0/lv1 CAPACITY 6.00T FREE 3.81T ( 63.6%)
hooikist.nikhef.nl /export/cache5 CAPACITY 1.88T FREE 61.69G ( 3.2%) RDONLY
hooikist.nikhef.nl /export/cache6 CAPACITY 1.88T FREE 131.00G ( 6.8%) RDONLY
hooikist.nikhef.nl /export/cache7 CAPACITY 1.88T FREE 259.58G ( 13.5%) RDONLY
hooizolder.nikhef.nl /export/data/vg0/lv0 CAPACITY 6.00T FREE 193.21M ( 0.0%)
hooizolder.nikhef.nl /export/data/vg0/lv1 CAPACITY 6.00T FREE 3.94T ( 65.6%)
I must do more debugging.
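As a starting point for that debugging, the per-filesystem lines can be parsed to see how the free space is spread over the pool. This is only a sketch in Python: the regular expression is fitted to the lines quoted above, binary unit multiples (1 T = 1024 G) are an assumption, and the 1% threshold is arbitrary. It does not establish the cause of the error.

```python
import re

# Filesystem lines copied from the pool listing above.
LISTING = """\
hooibroei.nikhef.nl /export/data/vg0/lv0 CAPACITY 6.00T FREE 61.83M ( 0.0%)
hooibroei.nikhef.nl /export/data/vg0/lv1 CAPACITY 6.00T FREE 3.81T ( 63.6%)
hooikist.nikhef.nl /export/cache5 CAPACITY 1.88T FREE 61.69G ( 3.2%) RDONLY
hooikist.nikhef.nl /export/cache6 CAPACITY 1.88T FREE 131.00G ( 6.8%) RDONLY
hooikist.nikhef.nl /export/cache7 CAPACITY 1.88T FREE 259.58G ( 13.5%) RDONLY
hooizolder.nikhef.nl /export/data/vg0/lv0 CAPACITY 6.00T FREE 193.21M ( 0.0%)
hooizolder.nikhef.nl /export/data/vg0/lv1 CAPACITY 6.00T FREE 3.94T ( 65.6%)
"""

FS_RE = re.compile(
    r"(\S+)\s+(\S+)\s+CAPACITY\s+([\d.]+)([KMGT])\s+FREE\s+([\d.]+)([KMGT])"
    r"\s+\(\s*[\d.]+%\)\s*(RDONLY)?"
)
# Assumption: units are binary multiples.
UNIT_IN_TB = {"K": 1 / 1024**3, "M": 1 / 1024**2, "G": 1 / 1024, "T": 1.0}


def parse_filesystems(listing):
    """Return one dict per filesystem line; other lines are skipped."""
    result = []
    for line in listing.splitlines():
        match = FS_RE.match(line.strip())
        if match is None:
            continue
        host, path, cap, cap_unit, free, free_unit, rdonly = match.groups()
        result.append({
            "host": host,
            "path": path,
            "capacity_tb": float(cap) * UNIT_IN_TB[cap_unit],
            "free_tb": float(free) * UNIT_IN_TB[free_unit],
            "read_only": rdonly is not None,
        })
    return result


filesystems = parse_filesystems(LISTING)
writable = [fs for fs in filesystems if not fs["read_only"]]
writable_free_tb = sum(fs["free_tb"] for fs in writable)
# Writable filesystems with less than 1% free (threshold chosen arbitrarily).
nearly_full = [fs for fs in writable if fs["free_tb"] / fs["capacity_tb"] < 0.01]
```

On the figures above, the writable free space adds up to roughly the 7.75T reported for the pool, but two of the four writable filesystems are essentially full. Whether that explains the "No space left on device" errors is not confirmed here.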
October 3
Site services on the SARA vobox at CERN kept crashing and finally stopped completely. Pedro
managed to solve it; here is his explanation and instructions on what to do:
"MySQL was simply not working properly. It didn't accept any more connections.
when the agents were running you could see the error.
to fix you should stop and start the agents.
when you start the agents they try to make a database connection which [fails] because
MySQL is still blocking new connections (this behaviour I haven't seen before).
in this case, you should login to the 'site' MySQL machine, stop and start
MySQL, go back to VOBox and start the agents."
A new unscheduled downtime for SINP, from 3.10. to 10.10.
October 2
Savannah bug
https://savannah.cern.ch/bugs/?29502
A file is registered in the LFC, but its size on disk is 0:
lcg-lr guid:0433D26E-891B-DC11-BDFE-00112FCCC3FB
srm://srm.grid.sara.nl/pnfs/grid.sara.nl/disk/atlas/dq2/trig1_misal1_mc12/AOD/trig1_misal1_mc12.007268.singlepart_mu_p1000.recon.AOD.v12000605_tid010498/trig1_misal1_mc12.007268.singlepart_mu_p1000.recon.AOD.v12000605_tid010498._00479.pool.root.2
lfc-ls -l /grid/atlas/dq2/trig1_misal1_mc12/AOD/trig1_misal1_mc12.007268.singlepart_mu_p1000.recon.AOD.v12000605_tid010498/trig1_misal1_mc12.007268.singlepart_mu_p1000.recon.AOD.v12000605_tid010498._00479.pool.root.2
-rw-rw-r-- 1 18992 1475 5302218 Jun 16 00:18 /grid/atlas/dq2/trig1_misal1_mc12/AOD/trig1_misal1_mc12.007268.singlepart_mu_p1000.recon.AOD.v12000605_tid010498/trig1_misal1_mc12.007268.singlepart_mu_p1000.recon.AOD.v12000605_tid010498._00479.pool.root.2
srm-get-metadata srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/disk/atlas/dq2/trig1_misal1_mc12/AOD/trig1_misal1_mc12.007268.singlepart_mu_p1000.recon.AOD.v12000605_tid010498/trig1_misal1_mc12.007268.singlepart_mu_p1000.recon.AOD.v12000605_tid010498._00479.pool.root.2
WARNING: SRM_PATH is defined, which might cause a wrong version of srm client to be executed
WARNING: SRM_PATH=/afs/cern.ch/project/gd/LCG-share/3.0.24-1/d-cache/srm
FileMetaData(srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/disk/atlas/dq2/trig1_misal1_mc12/AOD/trig1_misal1_mc12.007268.singlepart_mu_p1000.recon.AOD.v12000605_tid010498/trig1_misal1_mc12.007268.singlepart_mu_p1000.recon.AOD.v12000605_tid010498._00479.pool.root.2)=
RequestFileStatus SURL :srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/disk/atlas/dq2/trig1_misal1_mc12/AOD/trig1_misal1_mc12.007268.singlepart_mu_p1000.recon.AOD.v12000605_tid010498/trig1_misal1_mc12.007268.singlepart_mu_p1000.recon.AOD.v12000605_tid010498._00479.pool.root.2
size :0
owner :1900
group :1213
permMode :420
checksumType :adler32
checksumValue :00000001
isPinned :false
isPermanent :true
isCached :true
state :
fileId :0
TURL :
estSecondsToStart :0
sourceFilename :
destFilename :
queueOrder :0
Delete it from disk and from the catalogue:
srm-advisory-delete srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/disk/atlas/dq2/trig1_misal1_mc12/AOD/trig1_misal1_mc12.007268.singlepart_mu_p1000.recon.AOD.v12000605_tid010498/trig1_misal1_mc12.007268.singlepart_mu_p1000.recon.AOD.v12000605_tid010498._00479.pool.root.2
lcg-uf --vo atlas guid:0433D26E-891B-DC11-BDFE-00112FCCC3FB srm://srm.grid.sara.nl/pnfs/grid.sara.nl/disk/atlas/dq2/trig1_misal1_mc12/AOD/trig1_misal1_mc12.007268.singlepart_mu_p1000.recon.AOD.v12000605_tid010498/trig1_misal1_mc12.007268.singlepart_mu_p1000.recon.AOD.v12000605_tid010498._00479.pool.root.2
I have no permissions to delete it from the DS.
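The check made by hand above (size in the LFC versus size reported by the SRM) could be scripted along these lines. This Python sketch only parses text in the two output formats quoted in this entry; the function names are illustrative, and the path in the example is shortened to a placeholder:

```python
import re


def lfc_size(lfc_ls_line):
    """Return the size (5th column) from one line of `lfc-ls -l` output."""
    return int(lfc_ls_line.split()[4])


def srm_size(metadata_text):
    """Return the 'size :N' value from srm-get-metadata output."""
    match = re.search(r"^\s*size\s*:\s*(\d+)\s*$", metadata_text, re.MULTILINE)
    if match is None:
        raise ValueError("no size field found in metadata output")
    return int(match.group(1))


def sizes_agree(lfc_ls_line, metadata_text):
    """True if the catalogue and the storage element report the same size."""
    return lfc_size(lfc_ls_line) == srm_size(metadata_text)


# Values taken from the entry above; the path is a shortened placeholder.
lfc_line = "-rw-rw-r-- 1 18992 1475 5302218 Jun 16 00:18 /grid/atlas/dq2/PLACEHOLDER"
metadata = "size :0\nowner :1900\ngroup :1213\n"
consistent = sizes_agree(lfc_line, metadata)  # False for this file
```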
September 28
Broadcast message from Ron:
The network problems we were having were solved at about 10 pm last night. However, some storage nodes seemed
to be in a peculiar state after the network problems. Today, we have tied up those loose ends and we are passing SAM
tests again.
September 20
SINP SE lcg60.sinp.msu.ru downtime finished (17.9.), but the site downtime continues and the SE does not work.
A short problem with srm.grid.sara.nl announced by EGEE broadcast solved.
Files from aborted DS deleted. All aborted DS were only on SARADISK. 42 files deleted, no errors.
September 19
It was decided that SARA should get all M4 RAW data.
September 13
SINP is in unscheduled downtime since yesterday until 26.9. Jurriaan added me as a channel manager for NL T2's
and I set channels to and from SINP as inactive:
glite-transfer-channel-set -S Inactive -s https://fts.grid.sara.nl:8443/glite-data-transfer-fts/services/ChannelManagement STAR-SINP
glite-transfer-channel-set -S Inactive -s https://fts.grid.sara.nl:8443/glite-data-transfer-fts/services/ChannelManagement SINP-STAR
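Since each site has one channel per direction, the two commands can be generated from the site name. A Python sketch that only builds the command lines and uses the options seen in this log (-S and -s); the helper name is hypothetical, and the sketch accepts only the two states that appear on this page:

```python
FTS_CHANNEL_ENDPOINT = (
    "https://fts.grid.sara.nl:8443/glite-data-transfer-fts/services/ChannelManagement"
)


def channel_set_commands(site, state, endpoint=FTS_CHANNEL_ENDPOINT):
    """Build argument lists for both directions (STAR-<site>, <site>-STAR)."""
    if state not in ("Active", "Inactive"):
        # Only the states used in this log are handled by the sketch.
        raise ValueError("unsupported state: %r" % state)
    channels = ["STAR-" + site, site + "-STAR"]
    return [
        ["glite-transfer-channel-set", "-S", state, "-s", endpoint, channel]
        for channel in channels
    ]


commands = channel_set_commands("SINP", "Inactive")
command_lines = [" ".join(argv) for argv in commands]
```

The argument lists can be printed for review or passed to subprocess.run once the output looks right.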
September 12
I deleted files from aborted DS using dq2_delete_aborted.sh (parallel delete from all NL sites).
SINP SE does not respond, GGUS-Ticket 26743 has been created.
Some files from PNPI cannot be copied, FTS reports the error "Destination and source file sizes don't match!!".
Problem reported by Stephane via savannah. GGUS-Ticket 26749 has been created.
September 5
The distribution of M4 ESD DS to IHEP, ITEP, JINR and PNPI is quite good:
M4 DS panda monitor
September 4
All transfers to ITEP stayed in status "ready". Restart of services on FTS helped.
August 28
List of aborted DS contained 16 DS. In total 81 files were deleted using the procedure with dq2_cleanup, no errors.
Log file on afs: /afs/cern.ch/user/c/chudobaj/ddm/aborted/20070828/aborted_to_delete_20070828_20070828_1346.log
List of aborted DS (from August 21) contained 3 DS. 3 files were deleted from SARADISK, no errors.
August 23
The channel STAR-ITEP is now working for transfers to se3.itep.ru:
gt-stat-sara -l a7a7440d-50b2-11dc-97f4-93d4a533c78d
Request ID: a7a7440d-50b2-11dc-97f4-93d4a533c78d
Status: Finished
Channel: STAR-ITEP
Client DN: /DC=cz/DC=cesnet-ca/O=Institute of Physics of the Academy of Sciences of the CR/CN=Jiri Chudoba
Reason:
Submit time: 2007-08-22 13:21:56.146
Files: 1
Priority: 3
VOName: atlas
Done: 0
Active: 0
Pending: 0
Ready: 0
Canceled: 0
Failed: 0
Finishing: 0
Finished: 1
Submitted: 0
Hold: 0
Waiting: 0
Source: srm://tbn18.nikhef.nl:8443/dpm/nikhef.nl/home/atlas/dq2/jiri.20070818.1
Destination: srm://se3.itep.ru:8443/pnfs/itep.ru/data/atlas/jiri.20070822.1
State: Finished
Retries: 0
Reason: (null)
Duration: 18
August 20
Further tests of FTS channel STAR - ITEP:
glite-transfer-submit -v -p ftspwd -s https://fts.grid.sara.nl:8443/glite-data-transfer-fts/services/FileTransfer srm://tbn18.nikhef.nl:8443/dpm/nikhef.nl/home/atlas/dq2/jiri.20070818.1 srm://se3.itep.ru:8443/pnfs/itep.ru/data/atlas/jiri.1
Server supports delegation, however a MyProxy passphrase was given: will use MyProxy legacy mode.
ff5d2fcd-4ee0-11dc-97f4-93d4a533c78d
gt-stat-sara ff5d2fcd-4ee0-11dc-97f4-93d4a533c78d
- after several minutes still waiting
The same transfer to se2.itep.ru proceeds very fast:
glite-transfer-submit -v -p ftspwd -s https://fts.grid.sara.nl:8443/glite-data-transfer-fts/services/FileTransfer srm://tbn18.nikhef.nl:8443/dpm/nikhef.nl/home/atlas/dq2/jiri.20070818.1 srm://se2.itep.ru:8443/dpm/itep.ru/home/atlas/jiri.20070820.fts.2
gt-stat-sara 75e0ff15-4ee1-11dc-97f4-93d4a533c78d
- finished within 1 minute
August 18
Check SARA-ITEP channel. Transfer from a UI to ITEP:
srmcp file:////mnt/raid4_atlas/chudoba/transfer/file.1KB srm://se3.itep.ru:8443/pnfs/itep.ru/data/atlas/jiri.1
- OK
A transfer to SARADISK:
srmcp file:////mnt/raid4_atlas/chudoba/transfer/file.1KB srm://srm.grid.sara.nl:8443/pnfs/grid.sara.nl/disk/atlas/jiri.1
- now hangs
A transfer to NIKHEF:
srmcp -debug=true file:////mnt/raid4_atlas/chudoba/transfer/file.1KB srm://tbn18.nikhef.nl:8443/dpm/nikhef.nl/home/atlas/dq2/jiri.20070818.1
- OK
Delete the file from ITEP and submit a transfer request from NIKHEF:
srm-advisory-delete srm://se3.itep.ru:8443/pnfs/itep.ru/data/atlas/jiri.1
myproxy-init -d -s myproxy-fts.cern.ch
glite-transfer-submit -v -p ... -s https://fts.grid.sara.nl:8443/glite-data-transfer-fts/services/FileTransfer srm://tbn18.nikhef.nl:8443/dpm/nikhef.nl/home/atlas/dq2/jiri.20070818.1 srm://se3.itep.ru:8443/pnfs/itep.ru/data/atlas/jiri.1
- not finished, I lost the connection again. I will check when I am back in Prague.
August 2
Migration of srm.grid.sara.nl to new hardware.
July 31
FTS server was migrated to fts.grid.sara.nl
July 26
LFC server change, mu11.matrix.sara.nl -> lfc-atlas.grid.sara.nl
The Oracle database server and the LFC were moved from two old dual Xeon nodes to two new dual-core, dual-CPU Xeon machines with 4 GB of memory, two power supplies each and hardware RAID1 system disks. This makes everything more reliable than it was before.
July 17
Unscheduled intervention due to problems with /pnfs. FTS channel CERN-SARA set inactive. Announced
at 9:10 via broadcast, back online at 13:44 (broadcast announcement).
June 27
FTS: VO manager role granted to me:
/DC=cz/DC=cesnet-ca/O=Institute of Physics of the Academy of Sciences of the CR/CN=Jiri Chudoba
(disappeared during an upgrade?)
June 22
Corrupted files unregistered from the LFC (in total 24660 files, 28 were already unregistered earlier - by whom?).
June 20
dCache at SARA upgraded to 1.7.0-36 (13:00 - 15:00)
June 19
A number of SARA's dcache pool nodes suffered from running out of disk space on the root file system.
(13:10 - 13:45)
June 11
SARA router maintenance (18:00 - 19:30).
June 6
Maintenance of Oracle db - LFC and FTS down, 14:00 - 18:00, broadcast just a few minutes earlier.
Announced back online at 16:40.
May 30
Scheduled maintenance for oracle server at SARA, announced 8:55, started 9:00, finished ??.
The FTS on mu8.matrix.sara.nl and LFC at mu11.matrix.sara.nl were affected.
May 24
Still LFC problems: The LFC seems to crash every so many minutes (10:36).
LFC has been running stably for the last 3.5 hours. (15:29).
May 23
LFC back online (15:26).
May 19
LFC mu11 still down, no way to do a cleanup ...
Answer to Yevgenij, he was complaining that se2.itep.ru (DPM) is still used. We must migrate
to se3.itep.ru (dCache).
May 18
LFC again down: "Due to a disk problem of the oracle server at SARA the Oracle database is down
for the moment. This problem affects the LFC on mu11.matrix.sara.nl and the FTS server."
Yesterday we managed to get a dump of files stored at tbn18 and registered at mu11. Out of 45028 files,
which were not found by the first version of the script (uses the same path at LFC as for SURL with a
change /dpm/nikhef.nl/home/atlas -> /grid/atlas), only 47 were not registered.
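The path mapping used by that first version of the script can be written down as follows. This is a Python sketch of the prefix substitution described above only; the function name is illustrative, and as noted, not all files were registered under this convention:

```python
from urllib.parse import urlparse

SE_PREFIX = "/dpm/nikhef.nl/home/atlas"
LFC_PREFIX = "/grid/atlas"


def surl_to_lfn(surl, se_prefix=SE_PREFIX, lfc_prefix=LFC_PREFIX):
    """Map a SURL to an LFC path by swapping the storage prefix.

    Raises ValueError if the SURL path does not start with se_prefix.
    """
    path = urlparse(surl).path
    if not path.startswith(se_prefix + "/"):
        raise ValueError("unexpected path: %s" % path)
    return lfc_prefix + path[len(se_prefix):]


lfn = surl_to_lfn(
    "srm://tbn18.nikhef.nl:8443/dpm/nikhef.nl/home/atlas/dq2/jiri.20070818.1"
)
```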
Page with a description of the procedures used after file losses:
AtlasDDMLostFiles .
May 17
Since May 14 there have been many errors in config/TIER2S/subscriptions.log concerning proxy retrieval
from myproxy-fts.cern.ch. Still not understood. Ron issued an EGEE broadcast about
SARA's internal network problems, later prolonged until Friday 18 May.
Transfers using the FTS server at CERN (config/SARA/subscriptions.log) issued their last error
due to a missing proxy on myproxy-fts.cern.ch on 2007-05-14 22:52:49.
May 15
LFC mu11 down. GGUS ticket 21985.
Numbers about corrupted files at NIKHEF:
- 78308 md5sum_correct.rec
- 21169 md5sum_corrupted.rec
- 32426 md5sum_missing.rec
- 45028 missing_files.rec
- 176931 total
Missing files were not found in the LFC because there were different conventions for creating the LFN from the SURL.
After a correction only 47 files were not found in the LFC.
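A quick arithmetic check of the figures above, written as a Python sketch. The category names are taken from the .rec file names; treating "correct" plus "corrupted" as the set of files whose md5sum could actually be compared is an assumption about what the files contain:

```python
# Counts from the list above, keyed by the .rec file name without extension.
counts = {
    "md5sum_correct": 78308,
    "md5sum_corrupted": 21169,
    "md5sum_missing": 32426,
    "missing_files": 45028,
}
reported_total = 176931

# The four categories should add up to the reported total.
categories_sum = sum(counts.values())

# Assumption: only 'correct' and 'corrupted' files had an md5sum to compare.
compared = counts["md5sum_correct"] + counts["md5sum_corrupted"]
corrupted_fraction = counts["md5sum_corrupted"] / compared
```

The categories do sum to the reported total, and under the stated assumption about one in five of the compared files is corrupted.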
May 5
Decommissioning of SE teras.sara.nl.
The data stored on the SE teras.sara.nl can now be accessed through gridftp. The TURLs now start with:
gsiftp://a1-extern.teras.sara.nl/home/
/...
April 24
PNPI site is in ToA. I added it into crontab.
April 23
I got a list of possibly corrupted files at tbn18 together with their md5sums.
Jan Kubalec is going to compare md5sums values with values stored in the LFC.
Emails from the vobox are still not reaching their recipients, although GGUS ticket
14014 was closed again.
April 20
FTS channels STAR-PNPI and PNPI-STAR tested. OK for small files, a 1 GB file failed.
Transfers from PNPI to NIKHEF failed too. After an increase of the timeout value
on the FTS server I was able to copy 1 GB from SARADISK.
April 19
A long list of possibly corrupted files at NIKHEF. They may be corrupted due to a bad
enclosure. The list contains 176931 files. Three randomly chosen files make for poor statistics,
but I got 1 corrupted and 2 not corrupted. Sizes were 79 KB and 80 MB (the two uncorrupted files)
and 100 MB (corrupted). I downloaded them, computed the md5sum of each and compared it with the value
stored in the LFC.
April 5
Some sites were missing in FTS configuration file services.xml. They were inserted yesterday.
Many other sites have been missing since then.
March 25
Error from March 19 still there. It is due to full disks at SARA.
Errors for transfer to NIKHEF:
Transfer failed. ERROR the server sent an error response: 550 550 rfio write failure: No space left on device.
dpm-qryconf shows some space on all ATLAS pools (minimum 600 GB).
ITEP:
Cleaning of ITEP continues. A provided list of files (20 156 files) present at ITEP
on 28.01.2007 did not help: some files were listed twice. 4323 files were not known to the LFC,
16193 had no replica at ITEP, so these cannot be unregistered.
I did a complete "integrity" check. I identified 38590 missing files in ITEP
and I am now running lcg-uf to delete them from the catalogue. It is very
slow; I hope that by tomorrow it will be cleaned.
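The kind of comparison behind such an integrity check can be expressed with set operations. This is a Python sketch on toy data, not the script actually used; the function name and the file names in the example are made up:

```python
def integrity_report(site_listing, catalogue_replicas):
    """Compare a storage listing with the replicas known to the catalogue.

    site_listing: file names found on the storage element (may repeat).
    catalogue_replicas: file names the catalogue says are at the site.
    """
    site_listing = list(site_listing)
    on_site = set(site_listing)          # also collapses duplicates
    in_catalogue = set(catalogue_replicas)
    return {
        "duplicates_in_listing": len(site_listing) - len(on_site),
        "not_in_catalogue": sorted(on_site - in_catalogue),
        "missing_on_site": sorted(in_catalogue - on_site),
    }


# Toy example with hypothetical file names.
report = integrity_report(["a", "b", "b", "c"], ["b", "c", "d"])
```

Entries in "missing_on_site" correspond to catalogue replicas that would need to be unregistered; entries in "not_in_catalogue" are files on disk that the catalogue does not know.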
March 21
srm.ndgf.org added to services.xml. Errors "No site found for host srm.ndgf.org"
do not appear anymore.
March 19
This error started to appear at 15:33 in SARA/subscription.log
Pool manager error: Best pool too high : 2.0E8
No transfers to SARADISK since then.
March 15
An LFC error caused by an upgrade was corrected. The last LFC upgrade
included a schema change, which was not applied at first because configure_node
broke down.
March 14
List of lost ATLAS files provided. There are 1275 lost files.
March 12
Another loss of ATLAS files at SARA was reported.
March 7
FTS channels for PNPI were established, endpoint is:
srm://cluster.pnpi.nw.ru/grid/atlas
Channels are:
PNPI-STAR
STAR-PNPI
February 28
A broadcast from Ron
1.3. the tape backend of srm.grid.sara.nl will not be available from 10:30-11:30 due to maintenance.
New endpoint for NDGF T1 was announced:
srm://srm.ndgf.org:8443/pnfs/ndgf.org/data/atlas/disk/
Check if channel to SARA exists.
February 2
A standard non-VOMS certificate used until now was replaced by a new one
(from Mario Lassnig).
Here are new entries in the crontab:
X509_USER_PROXY=/home/atlassgm/x509up_u23311.voms
35 0,12 * * * [ -e "$HOME/.profile" ] && . $HOME/.profile; . /etc/profile; voms-proxy-init -confile /opt/glite/etc/vomses -valid 96:00 -cert $HOME/x509up_u23311 -key $HOME/x509up_u23311 -out $HOME/x509up_u23311.voms -voms atlas:/atlas/Role=production
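The schedule fields "35 0,12 * * *" mean the renewal runs at 00:35 and 12:35 every day, while each proxy is requested with a validity of 96 hours. A small Python sketch that expands the minute and hour fields; it handles only comma-separated values (no ranges or steps), and the helper name is illustrative:

```python
def daily_run_times(minute_field, hour_field):
    """Expand comma-separated cron minute/hour fields into HH:MM strings."""
    minutes = sorted(int(m) for m in minute_field.split(","))
    hours = sorted(int(h) for h in hour_field.split(","))
    return ["%02d:%02d" % (h, m) for h in hours for m in minutes]


runs = daily_run_times("35", "0,12")

# Values from the crontab entry above.
renewal_interval_hours = 12
proxy_valid_hours = 96
# How long a proxy stays valid after the next scheduled renewal is due.
margin_hours = proxy_valid_hours - renewal_interval_hours
```

With these values each proxy outlives the next scheduled renewal by 84 hours, so a few failed renewals in a row should not interrupt transfers.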
January 31
DQ2 monitoring moves from the "classical" to a dashboard system developed by ARDA:
http://dashb-atlas-data.cern.ch/dashboard/index.html
January 26
New fs was added to tbn18, which was previously full:
POOL Atlas-D DEFSIZE 200.00M GC_START_THRESH 0 GC_STOP_THRESH 0 DEFPINTIME 0 PUT_RETENP 86400 FSS_POLICY maxfreespace GC_POLICY lru RS_POLICY fifo GID 104 S_TYPE -
CAPACITY 12.77T FREE 1.24T ( 9.7%)
January 24
All lost files were unregistered from the SARA LFC.
January 21
The NDGF-SARA channel is configured and the agent is running and there is also a NIKHEF-STAR and STAR-NIKHEF channel with associated agents.
January 12
Lost files on DPM tbn18.nikhef.nl (human error). We got a list of lost files - 3604.
All files were registered in the SARA LFC. Later, entries about these files were removed
from DPM db.
January 8
File srm://srm.grid.sara.nl/pnfs/grid.sara.nl/data/atlas//dq2/csc11/csc11.005200.T1_McAtNlo_Jimmy.evgen.EVNT.v11004205/csc11.005200.T1_McAtNlo_Jimmy.evgen.EVNT.v11004205._00037.pool.root.1
cannot be copied. Problem reported.
-- JiriChudoba - 08 Jan 2007