Useful links for LHCb
Shifter's 101
- If many jobs are completed but cannot upload to the LogSE (volhcb15), check the number of connections (socket count) and restart the StorageElement service.
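A quick way to get a socket count (a sketch; port 9148 is only a guess for the StorageElement service, check the CS for the real port):
import subprocess
# count established TCP connections on the (assumed) StorageElement port
out = subprocess.Popen( ['netstat', '-tan'], stdout = subprocess.PIPE ).communicate()[0]
print len( [ l for l in out.splitlines() if ':9148' in l and 'ESTABLISHED' in l ] )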
- Mario's script for the shifter report:
$ ssh lxplus
$ SetupProject LHCbDirac
$ dirac-proxy-init
$ cd ~ubeda/public
# I just took the Hot Productions from https://lhcb-shifters.web.cern.ch/dashboard
$ python dirac-production-shifter.py -g -i 8657,8656,8622
# If you want to take a closer look at files in funny states, try this one:
$ python dirac-production-shifter-files.py -i 8657
- Allowing a site after the downtime is over. First, let's list all of its SEs:
$ dirac-admin-site-info LCG.IN2P3.fr
{'CE': 'cccreamceli05.in2p3.fr, cccreamceli06.in2p3.fr',
'Coordinates': '4.8655:45.7825',
'Mail': 'grid.admin@cc.in2p3.fr',
'MoUTierLevel': '1',
'Name': 'IN2P3-CC',
'SE': 'IN2P3-RAW, IN2P3-DST, IN2P3_M-DST, IN2P3-USER, IN2P3-FAILOVER, IN2P3-RDST, IN2P3_MC_M-DST, IN2P3_MC-DST, IN2P3-ARCHIVE, IN2P3-BUFFER'}
Now let's unban:
$ dirac-admin-allow-site LCG.IN2P3.fr "Downtime finished"
$ dirac-admin-allow-se IN2P3-RAW IN2P3-DST IN2P3_M-DST IN2P3-USER IN2P3-FAILOVER IN2P3-RDST IN2P3_MC_M-DST IN2P3_MC-DST IN2P3-ARCHIVE IN2P3-BUFFER "Downtime finished"
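The whole unbanning step can also be scripted (a sketch that just shells out to the two commands above, with the SE list copied from the site-info output):
import subprocess
site = 'LCG.IN2P3.fr'
ses = [ 'IN2P3-RAW', 'IN2P3-DST', 'IN2P3_M-DST', 'IN2P3-USER', 'IN2P3-FAILOVER',
        'IN2P3-RDST', 'IN2P3_MC_M-DST', 'IN2P3_MC-DST', 'IN2P3-ARCHIVE', 'IN2P3-BUFFER' ]
subprocess.call( ['dirac-admin-allow-site', site, 'Downtime finished'] )
subprocess.call( ['dirac-admin-allow-se'] + ses + ['Downtime finished'] )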
- Get the file access protocols for a site (the sample output below happens to be for CERN SEs):
dirac-admin-get-site-protocols --Site=LCG.SARA.nl
CERN-BUFFER file, xroot, root, dcap, gsidcap, rfio
CERN-CASTORBUFFER file, xroot, root, dcap, gsidcap, rfio
- Get BDII site info on MaxWallClockTime:
$ dirac-admin-site-info LCG.RAL.uk
{'CE': 'lcgce05.gridpp.rl.ac.uk, lcgce04.gridpp.rl.ac.uk',
'Coordinates': '-1.32:51.57',
'Mail': 'lcg-support@gridpp.rl.ac.uk',
'Name': 'RAL-LCG2', ...}
$ dirac-admin-bdii-ce-state lcgce04.gridpp.rl.ac.uk | grep MaxWallClockTime
GlueCEPolicyMaxWallClockTime: 120
GlueCEPolicyMaxWallClockTime: 4320
GlueCEPolicyMaxWallClockTime: 4320
GlueCEPolicyMaxWallClockTime: 4320
GlueCEPolicyMaxWallClockTime: 4320
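The Glue values are in minutes (4320 min = 72 h). A little helper to convert them (a sketch wrapping the command above):
import subprocess
out = subprocess.Popen( ['dirac-admin-bdii-ce-state', 'lcgce04.gridpp.rl.ac.uk'],
                        stdout = subprocess.PIPE ).communicate()[0]
for line in out.splitlines():
    if 'GlueCEPolicyMaxWallClockTime' in line:
        # Glue wallclock limits are expressed in minutes
        print line.strip(), '=', int( line.split( ':' )[1] ) / 60.0, 'hours'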
- Get file descendants, to check whether the file was indeed processed:
dirac-bookkeeping-get-file-descendants /lhcb/data/2010/RAW/FULL/LHCb/COLLISION10/81609/081609_0000000035.raw 9
- Get RAW ancestors of the lost FULL.DST files:
dirac-bookkeeping-get-file-ancestors /lhcb/LHCb/Collision12/FULL.DST/00020526/0003/00020526_00031385_1.full.dst
- Debugging productions with some Unused files, which have run number = zero in the production DB (a fix):
dirac-transformation-debug 16771 --Status Unused
- Prestaging files (debugging problems):
$ srm-bring-online -debug srm://storm-fe-lhcb.cr.cnaf.infn.it/t1d0/lhcb/data/2012/RAW/FULL/LHCb/COLLISION12/125977/125977_0000000019.raw
- Prestage files using gfal in python:
import gfal

print 'GFAL version', gfal.gfal_version()
# one SURL, SRMv2, 24h pin on the LHCb-Tape space token
gfalDict = { 'srmv2_spacetokendesc' : 'LHCb-Tape',
             'no_bdii_check'        : 1,
             'srmv2_desiredpintime' : 86400,
             'defaultsetype'        : 'srmv2',
             'timeout'              : 30,
             'nbfiles'              : 1,
             'surls'                : ['srm://storm-fe-lhcb.cr.cnaf.infn.it:8444/srm/managerv2?SFN=/t1d0/lhcb/archive/lhcb/MC/MC10/ALLSTREAMS.DST/00009779/0000/00009779_00001506_1.allstreams.dst'],
             'protocols'            : ['file', 'dcap', 'gsidcap', 'xroot', 'root', 'rfio'] }
errCode, gfalObject, errMessage = gfal.gfal_init( gfalDict )
print 'gfal.gfal_init:', errCode, errMessage
errCode, gfalObject, errMessage = gfal.gfal_prestage( gfalObject )
print 'gfal.gfal_prestage:', errCode, errMessage
numberOfResults, gfalObject, listOfResults = gfal.gfal_get_results( gfalObject )
for result in listOfResults:
    print 'result per surl', result
How-to's.
How to access volhcb12.
From any machine other than lxplus, log in first to lxvoadm.cern.ch and from there to volhcb12:
ssh dremensk@lxvoadm.cern.ch
ssh volhcb12
sudo su dirac
mysql -p -uDirac
mysql> show databases;
Useful to count how many instances of a process are running:
ps -ef | grep RequestFinalizationAgent | wc -l
path: /opt/dirac/pro/DIRAC/
Submit a job
dirac-wms-job-submit Simple.jdl
Simple.jdl:
JobName = "Simple_Job";
Executable = "/bin/ls";
Arguments = "-ltr";
StdOutput = "StdOut";
StdError = "StdErr";
OutputSandbox = {"StdOut","StdErr"};
InputData =
{
"LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/77969/077969_0000000611.raw"
};
BannedSites =
{
"LCG.CERN.ch"
};
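The same job can be built through the python API (a sketch; the JDL above is the authoritative version, and I assume Job.setBannedSites and Dirac.submit are available in this release):
from DIRAC.Core.Base.Script import parseCommandLine
parseCommandLine()
from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

j = Job()
j.setName( 'Simple_Job' )
j.setExecutable( '/bin/ls', arguments = '-ltr' )
j.setInputData( [ 'LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/77969/077969_0000000611.raw' ] )
j.setBannedSites( [ 'LCG.CERN.ch' ] )
print Dirac().submit( j )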
Script to restart all agents/services
runsvctrl d /opt/dirac/startup/StorageManagement_StorageManagerHandler
runsvctrl d /opt/dirac/startup/StorageManagement_RequestPreparationAgent
runsvctrl d /opt/dirac/startup/StorageManagement_StageRequestAgent
runsvctrl d /opt/dirac/startup/StorageManagement_StageMonitorAgent
runsvctrl d /opt/dirac/startup/StorageManagement_RequestFinalizationAgent
runsvctrl u /opt/dirac/startup/StorageManagement_StorageManagerHandler
runsvctrl u /opt/dirac/startup/StorageManagement_RequestPreparationAgent
runsvctrl u /opt/dirac/startup/StorageManagement_StageRequestAgent
runsvctrl u /opt/dirac/startup/StorageManagement_StageMonitorAgent
runsvctrl u /opt/dirac/startup/StorageManagement_RequestFinalizationAgent
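The same cycle as a loop (a sketch), so the component list is kept in one place:
import subprocess
components = [ 'StorageManagement_' + c for c in
               ( 'StorageManagerHandler', 'RequestPreparationAgent', 'StageRequestAgent',
                 'StageMonitorAgent', 'RequestFinalizationAgent' ) ]
for action in ( 'd', 'u' ):  # d = down (stop), u = up (start)
    for c in components:
        subprocess.call( [ 'runsvctrl', action, '/opt/dirac/startup/' + c ] )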
Script to check all logs immediately on volhcb12
tail -150 /opt/dirac/runit/StorageManagement/RequestPreparationAgent/log/current
tail -150 /opt/dirac/runit/StorageManagement/StageRequestAgent/log/current
tail -150 /opt/dirac/runit/StorageManagement/StageMonitorAgent/log/current
tail -150 /opt/dirac/runit/StorageManagement/RequestFinalizationAgent/log/current
tail -150 /opt/dirac/runit/StorageManagement/StorageManagerHandler/log/current
To clear the content of a log:
echo -n > current
A list of LFNs for testing the staging procedure
(under /project/bfys/dremensk/cmtdev/InputFiles.txt)
LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/69924/069924_0000000001.raw
LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/71493/071493_0000000057.raw
LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/71479/071479_0000000001.raw
LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/71476/071476_0000000241.raw
LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/70171/070171_0000000001.raw
How to check the SPACE TOKEN(s) for a file: list its replicas; the SE name in the output (e.g. CERN-RAW) maps to a space token in the CS.
$ dirac-dms-lfn-replicas LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/71476/071476_0000000241.raw
2011-03-15 16:55:13 UTC dirac-dms-lfn-replicas/DiracAPI INFO: Replica Lookup Time: 0.23 seconds
{'Failed': {},
'Successful': {'/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/71476/071476_0000000241.raw': {'CERN-RAW': 'srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/71476/071476_0000000241.raw'}}}
Setting up a manual request from python directly
from DIRAC.Core.Base.Script import parseCommandLine
parseCommandLine()
from DIRAC.Core.DISET.RPCClient import RPCClient
s = RPCClient("StorageManagement/StorageManagerHandler")
s.getWaitingReplicas()
s.getTasksWithStatus('Done')
---------------
s = RPCClient("WorkloadManagement/JobMonitoring")
s.getJobTypes()
{'OK': True, 'rpcStub': (('WorkloadManagement/JobMonitoring', {'skipCACheck': False, 'delegatedGroup': 'lhcb_user', 'delegatedDN': '/O=dutchgrid/O=users/O=nikhef/CN=Daniela Remenska', 'timeout': 600}), 'getJobTypes', ()), 'Value': ['DataReconstruction', 'DataStripping', 'MCSimulation', 'Merge', 'SAM', 'User']}
---------------
>>> s.getStates()
{'OK': True, 'rpcStub': (('WorkloadManagement/JobMonitoring', {'skipCACheck': False, 'delegatedGroup': 'lhcb_user', 'delegatedDN': '/O=dutchgrid/O=users/O=nikhef/CN=Daniela Remenska', 'timeout': 600}), 'getStates', ()), 'Value': ['Checking', 'Completed', 'Done', 'Failed', 'Killed', 'Matched', 'Received', 'Rescheduled', 'Running', 'Stalled', 'Waiting']}
--------------
s.getStageRequests({'StageStatus':'Staged'})
>>> s.getStageRequests({'StageStatus':'Staged'})['Value'][1874854]
{'PinExpiryTime': datetime.datetime(2011, 3, 4, 11, 7, 26), 'StageRequestCompletedTime': datetime.datetime(2011, 3, 3, 11, 47, 26), 'StageStatus': 'Staged', 'RequestID': '140117334', 'StageRequestSubmitTime': datetime.datetime(2011, 3, 3, 11, 46, 53), 'PinLength': 86400L}
--------------
s = RPCClient("StorageManagement/StorageManagerHandler")
s.setRequest({'CERN-RDST':'/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/77969/077969_0000000611.raw'},'DanielaTest','method@DanielaTest/TestHandler',999999)
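Putting the pieces together (a sketch using only the handler methods shown above): submit the request by hand and then check whether it appears among the 'Done' tasks:
from DIRAC.Core.Base.Script import parseCommandLine
parseCommandLine()
from DIRAC.Core.DISET.RPCClient import RPCClient

s = RPCClient( 'StorageManagement/StorageManagerHandler' )
# same arguments as the setRequest example above
print s.setRequest( { 'CERN-RDST': '/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/77969/077969_0000000611.raw' },
                    'DanielaTest', 'method@DanielaTest/TestHandler', 999999 )
print s.getTasksWithStatus( 'Done' )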
-----------------
Procedure to deploy new code
- cd /project/bfys/dremensk/cmtdev/LHCbDirac_v5r11p3
- svn update
- Stop the agents and services in the SMS:
runsvctrl d /opt/dirac/startup/StorageManagement_RequestPreparationAgent
- Check if it is in fact disabled:
ps -ef | grep RequestPreparationAgent
- Set the logging level to debug (edit the run file and set LogLevel=DEBUG):
emacs /opt/dirac/runit/StorageManagement/RequestPreparationAgent/run
- The modified code needs to be copied to the appropriate path on volhcb12:
cd /opt/dirac/pro/DIRAC/StorageManagementSystem/Agent
scp danielar@login.nikhef.nl:/project/bfys/dremensk/cmtdev/LHCbDirac_v5r8/DIRAC/StorageManagementSystem/Agent/RequestPreparationAgent.py .
- 1. For an Agent:
(ONLY if new, not an update) dirac-install-agent StorageManagement RequestPreparationAgent
Start the agent:
runsvctrl u /opt/dirac/startup/StorageManagement_RequestPreparationAgent
- 2. For a Service:
cd /opt/dirac/pro
(if new) ./scripts/install_service.sh DataManagement testDMS
cd /opt/dirac/startup
ln -s /opt/dirac/pro/runit/DataManagement/testDMS DataManagement_testDMS
Once this link has been created, the service will start automatically.
- Check the log to see if your modifications are visible:
cat /opt/dirac/runit/StorageManagement/RequestPreparationAgent/log/current
- If all is OK, SET THE LOG LEVEL BACK TO INFO:
emacs /opt/dirac/runit/StorageManagement/RequestPreparationAgent/run   (set LogLevel=INFO)
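For repeated deployments the steps above can be glued together (a sketch; same paths and commands as in the procedure, log-level editing still done by hand):
import subprocess
agent = 'RequestPreparationAgent'
startup = '/opt/dirac/startup/StorageManagement_' + agent
subprocess.call( [ 'runsvctrl', 'd', startup ] )
subprocess.call( [ 'scp',
                   'danielar@login.nikhef.nl:/project/bfys/dremensk/cmtdev/LHCbDirac_v5r8/DIRAC/StorageManagementSystem/Agent/' + agent + '.py',
                   '/opt/dirac/pro/DIRAC/StorageManagementSystem/Agent/' ] )
subprocess.call( [ 'runsvctrl', 'u', startup ] )
subprocess.call( [ 'tail', '-50', '/opt/dirac/runit/StorageManagement/' + agent + '/log/current' ] )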
To browse SRM
srmls srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/
To get TURLs for files
lcg-getturls srm://srm.grid.sara.nl/pnfs/grid.sara.nl/data/lhcb/LHCb/Collision11/SWIMSTRIPPINGD02KSKK.MDST/00019038/0000/00019038_00000080_1.swimstrippingd02kskk.mdst -p file,xroot,root,dcap,gsidcap
To see your jobs
https://lhcbweb.pic.es/DIRAC/LHCb-Development/lhcb_user/jobs/JobMonitor/display
Which user is doing what on a machine
ps efu -U <user>
Some useful git commands:
Fixing committed mistakes:
git revert HEAD
create new branch:
git branch experimental
to switch to the new branch:
git checkout experimental
commit all changes:
git commit -a
copying changes from the local branch to the remote one:
git push origin fixes-sms:refs/heads/fixes-sms
throw away the last commit (only if it has NOT been published):
git reset --hard HEAD~1
Graphical overview with all branches/comments:
gitk &
Checking out existing git repo:
git clone git@github.com:remenska/DIRAC.git
cd DIRAC/
git checkout -b fixes-sms origin/fixes-sms
Quick fixes in DIRAC:
git clone git@github.com:remenska/DIRAC.git
cd DIRAC/
git remote add upstream git://github.com/DIRACGrid/DIRAC.git
git fetch upstream
git checkout -b rel-v6r9-fixes remotes/upstream/rel-v6r9
make the changes...
git commit -a
git remote add remenska http://github.com/remenska/DIRAC.git
git fetch remenska
git push remenska rel-v6r9-fixes
Life saver for testing any DIRAC code on-the-fly (on volhcb22):
from DIRAC.Core.Base import Script
Script.addDefaultOptionValue( '/DIRAC/Security/UseServerCertificate', 'yes' )
Script.parseCommandLine( ignoreErrors = False )
from DIRAC.StorageManagementSystem.DB.StorageManagementDB import StorageManagementDB
storageDB = StorageManagementDB()
res = storageDB.getCacheReplicas( {'Status':'StageSubmitted'} )
print res
Check if there's something fishy with the stager (replicas stuck in StageSubmitted with no matching entry in StageRequests):
SELECT ReplicaID FROM CacheReplicas WHERE Status='StageSubmitted' AND ReplicaID NOT IN ( SELECT DISTINCT( ReplicaID ) FROM StageRequests );
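The same check from python (a sketch; _query is an internal helper of the DIRAC DB base class, so treat this as debugging only):
from DIRAC.Core.Base import Script
Script.addDefaultOptionValue( '/DIRAC/Security/UseServerCertificate', 'yes' )
Script.parseCommandLine( ignoreErrors = False )
from DIRAC.StorageManagementSystem.DB.StorageManagementDB import StorageManagementDB

sql = ( "SELECT ReplicaID FROM CacheReplicas WHERE Status='StageSubmitted' "
        "AND ReplicaID NOT IN ( SELECT DISTINCT( ReplicaID ) FROM StageRequests )" )
print StorageManagementDB()._query( sql )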
To see which DIRAC version is in production: on volhcb20, just grep the agents' logs for "DIRAC version:".
--
DanielaRemenska - 04-Apr-2011