Useful links for LHCb

Shifter's 101

  • If many jobs are completed but cannot upload to the LogSE (volhcb15), check the number of connections (socket count) and restart the StorageElement service; see the sketch below.
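    A minimal sketch (run on volhcb15; the service port and the runit component name are assumptions, adjust to the actual setup):
       # count established connections on the StorageElement service port
       netstat -tan | grep ESTABLISHED | grep -c ':9148'
       # bounce the StorageElement component via runit
       runsvctrl t /opt/dirac/startup/DataManagement_StorageElement
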
  • Mario's script for shifter report:
      $ ssh lxplus
      $ SetupProject LHCbDirac
      $ dirac-proxy-init
      $ cd ~ubeda/public
      # I just took the Hot Productions from https://lhcb-shifters.web.cern.ch/dashboard
      $ python dirac-production-shifter.py -g -i 8657,8656,8622
      # If you want to take a closer look at funny states of the files, try this one
      $ python dirac-production-shifter-files.py -i 8657
       

  • For files in MaxReset, see which jobs attempted to run on them (NB: there is a dirac-admin script for this also!)
    [volhcb22] /home/dirac > ./jobs4file.sh 19995 /lhcb/data/2012/RAW/FULL/LHCb/COLLISION12/127979/127979_0000000005.raw
       Update: the same can now be done with
       dirac-transformation-debug 24183 --Status MaxReset --Info jobs
      

  • Script for displaying stats on reprocessing (Stefan's)
    python ~roiser/public/dirac-reprocessing-display-stats.py
       
  • Fixing files in the production DB with RunNumber=0
    dirac-transformation-debug 19782 --Status=Unused --FixIt
       

  • Debugging CVMFS problems on WNs.
    . /cvmfs/lhcb.cern.ch/lib/LbLogin.sh
    export CMTCONFIG=x86_64-slc5-gcc46-opt
    SetupProject.sh --debug --use="AppConfig v3r151"  --use="SQLDDDB v7r9"  --use="ProdConf"  Brunel v43r2p2 gfal CASTOR lfc oracle dcache_client --use-grid
       ... 
    time . /cvmfs/lhcb.cern.ch/lib/LbLogin.sh
    time SetupProject --debug --use="AppConfig v3r158"  --use="SQLDDDB v7r9"  --use="ProdConf"  Brunel v43r2p3 gfal lfc dpm --use-grid
    
    python -c "from hashlib import md5"
    
       

  • Check if the CVMFS cache is fresh enough on worker nodes. From an lxplus node:
       [dremensk@lxplus0158 ~]$ /usr/bin/attr -q -g revision /cvmfs/lhcb.cern.ch/
       12659
       

  • Get the production environment
        (on lxplus)
        SetupProject LHCbDIRAC
        lhcb-proxy-init -g lhcb_prod
       

  • Banning an SE if a site is in downtime or full
          dirac-admin-ban-se -c LCG.RAL.uk
        Example to ban one SE at RAL:
          dirac-admin-ban-se RAL-DST
        Example to ban one SE for writing at CNAF:
          dirac-admin-ban-se -w CNAF-USER
       For ARCHIVE:
          lhcb-admin-ban-se -w CNAF-ARCHIVE
       

  • Allowing a site after its downtime is over. First, list all its SEs:
     $ dirac-admin-site-info LCG.IN2P3.fr
    {'CE': 'cccreamceli05.in2p3.fr, cccreamceli06.in2p3.fr',
     'Coordinates': '4.8655:45.7825',
     'Mail': 'grid.admin@cc.in2p3.fr',
     'MoUTierLevel': '1',
     'Name': 'IN2P3-CC',
     'SE': 'IN2P3-RAW, IN2P3-DST, IN2P3_M-DST, IN2P3-USER, IN2P3-FAILOVER, IN2P3-RDST, IN2P3_MC_M-DST, IN2P3_MC-DST, IN2P3-ARCHIVE, IN2P3-BUFFER'}
       
    Now let's unban:
       $ dirac-admin-allow-site LCG.IN2P3.fr "Downtime finished"
       $ dirac-admin-allow-se IN2P3-RAW, IN2P3-DST, IN2P3_M-DST, IN2P3-USER, IN2P3-FAILOVER, IN2P3-RDST, IN2P3_MC_M-DST, IN2P3_MC-DST, IN2P3-ARCHIVE, IN2P3-BUFFER "Downtime finished"
       

  • To investigate on which nodes jobs failed at LCG.Dortmund.de:
       dirac-wms-jobs-select-output-search --Site=LCG.Dortmund.de --Status='Failed' --Date=2008-09-19 'running on '
       

  • Web portal stuck: how to restart it
         runsvctrl t runit/Web/Paster
        
    If this is not enough, list all Paster processes ('ps faux | grep -i web_paster') and 'kill -9' them, for example:
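       A loop sketch (same match string as above):
       for pid in $(ps faux | grep -i 'web_paster' | grep -v grep | awk '{print $2}'); do
         kill -9 "$pid"
       done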

  • An example for checking whether the WMS Job Manager service is up:
        dirac-framework-ping-service WorkloadManagement JobManager
        

  • List of currently banned sites:
         dirac-admin-get-banned-sites
        

  • Banning a site:
     dirac-admin-ban-site LCG.CERN.ch --comment="All jobs failing with Application not Found error" 

  • Check which SEs are Banned/Active
         dirac-dms-show-se-status | grep SARA
        

  • Launch a replication of RAW files (in case they are lost); this registers them in the LFC as well.
        dirac-dms-add-replication --Term --Plugin ReplicateDataset --Destination SARA-RAW --Start
       

  • Get the file access protocols for a site:
        dirac-admin-get-site-protocols --Site=LCG.SARA.nl
        CERN-BUFFER                   file, xroot, root, dcap, gsidcap, rfio
        CERN-CASTORBUFFER             file, xroot, root, dcap, gsidcap, rfio
        

  • Get BDII site info on MaxWallclockTimes
       $ dirac-admin-site-info LCG.RAL.uk
        {'CE': 'lcgce05.gridpp.rl.ac.uk, lcgce04.gridpp.rl.ac.uk',
          'Coordinates': '-1.32:51.57',
          'Mail': 'lcg-support@gridpp.rl.ac.uk',
          'Name': 'RAL-LCG2',
    
        $ dirac-admin-bdii-ce-state lcgce04.gridpp.rl.ac.uk | grep MaxWallClockTime
        GlueCEPolicyMaxWallClockTime: 120
        GlueCEPolicyMaxWallClockTime: 4320
        GlueCEPolicyMaxWallClockTime: 4320
        GlueCEPolicyMaxWallClockTime: 4320
       GlueCEPolicyMaxWallClockTime: 4320
        

  • Get file descendants to see if the file was indeed processed:
         dirac-bookkeeping-get-file-descendants /lhcb/data/2010/RAW/FULL/LHCb/COLLISION10/81609/081609_0000000035.raw 9
       
  • Get RAW ancestors of the lost FULL.DST files:
        dirac-bookkeeping-get-file-ancestors /lhcb/LHCb/Collision12/FULL.DST/00020526/0003/00020526_00031385_1.full.dst
       
  • Productions with some Unused files that have run number = 0 in the production DB (a check before applying the --FixIt fix above)
        dirac-transformation-debug 16771 --Status Unused
       
  • Prestaging files (debug problems)
       danielar@herault dremensk $ srm-bring-online -debug srm://storm-fe-lhcb.cr.cnaf.infn.it/t1d0/lhcb/data/2012/RAW/FULL/LHCb/COLLISION12/125977/125977_0000000019.raw
       
  • Prestage files using gfal in python:
    import gfal
    print 'GFAL version',gfal.gfal_version()
    gfalDict = {'srmv2_spacetokendesc': 'LHCb-Tape', 'no_bdii_check': 1, 'srmv2_desiredpintime': 86400, 'defaultsetype': 'srmv2', 'timeout': 30, 'nbfiles': 1, 'surls': ['srm://storm-fe-lhcb.cr.cnaf.infn.it:8444/srm/managerv2?SFN=/t1d0/lhcb/archive/lhcb/MC/MC10/ALLSTREAMS.DST/00009779/0000/00009779_00001506_1.allstreams.dst'],'protocols': ['file', 'dcap', 'gsidcap', 'xroot', 'root', 'rfio']}
    errCode, gfalObject, errMessage = gfal.gfal_init( gfalDict )
    print 'gfal.gfal_init:', errCode, errMessage
    
    errCode,gfalObject,errMessage = gfal.gfal_prestage( gfalObject )
    print 'gfal.gfal_prestage:', errCode, errMessage
    
    numberOfResults, gfalObject, listOfResults = gfal.gfal_get_results( gfalObject )
    for result in listOfResults:
      print 'result per surl', result
       

  • A script that looks at the JDL, extracts the input data files, and checks whether they are accessible at the site where the job ran:
    ~ $ dirac-dms-check-inputdata 47384426,47384216
        

  • VERY USEFUL: Get info on how a file is produced (run number, descendants, processing pass...)
     $ dirac-transformation-debug 20392 --LFN /lhcb/LHCb/Collision12/FULL.DST/00020391/0009/00020391_00093978_1.full.dst --Info alltasks
      

  • Check for IDLE/REGISTERED/ jobs at a particular CE:
    $ glite-ce-job-status -L0 -a -e lcgce02.gridpp.rl.ac.uk --to '2013-04-11 00:00:00' -s IDLE:WAITING:REGISTERED |grep -c IDLE
       

  • Browse the LFC:
     $ lfc-ls /grid/lhcb/LHCb/Collision12/LOG/ 
       

Dirac Storage Management System overview (work in progress, under construction): here.

TODO for work on the Storage Management System: to-do.

LHCb Popularity Service

My notes

Elisa's twiki

How-to's

How to access volhcb12

From any machine except lxplus, you should first log in to lxvoadm.cern.ch and then to volhcb12.

ssh dremensk@lxvoadm.cern.ch

sudo su dirac

mysql -p -uDirac

mysql> show databases;

Useful to see all processes of an agent (and, with lsof, their open files; see the sketch below)

ps -ef | grep RequestFinalizationAgent | wc -l
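
A minimal lsof sketch (assumes lsof is installed on the machine):

for pid in $(pgrep -f RequestFinalizationAgent); do
  lsof -p "$pid"
done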

path: /opt/dirac/pro/DIRAC/

Submit a job

dirac-wms-job-submit Simple.jdl

Simple.jdl:

JobName = "Simple_Job";
Executable = "/bin/ls";
Arguments = "-ltr";
StdOutput = "StdOut";
StdError = "StdErr";
OutputSandbox = {"StdOut","StdErr"};
InputData =
        {
             "LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/77969/077969_0000000611.raw"
        };

BannedSites =
        {
            "LCG.CERN.ch"
        };

Script to restart all agents/services

runsvctrl d /opt/dirac/startup/StorageManagement_StorageManagerHandler
runsvctrl d /opt/dirac/startup/StorageManagement_RequestPreparationAgent
runsvctrl d /opt/dirac/startup/StorageManagement_StageRequestAgent
runsvctrl d /opt/dirac/startup/StorageManagement_StageMonitorAgent
runsvctrl d /opt/dirac/startup/StorageManagement_RequestFinalizationAgent
runsvctrl u /opt/dirac/startup/StorageManagement_StorageManagerHandler
runsvctrl u /opt/dirac/startup/StorageManagement_RequestPreparationAgent
runsvctrl u /opt/dirac/startup/StorageManagement_StageRequestAgent
runsvctrl u /opt/dirac/startup/StorageManagement_StageMonitorAgent
runsvctrl u /opt/dirac/startup/StorageManagement_RequestFinalizationAgent

Script to check immediately all logs on volhcb12

tail -150 /opt/dirac/runit/StorageManagement/RequestPreparationAgent/log/current
tail -150 /opt/dirac/runit/StorageManagement/StageRequestAgent/log/current
tail -150 /opt/dirac/runit/StorageManagement/StageMonitorAgent/log/current
tail -150 /opt/dirac/runit/StorageManagement/RequestFinalizationAgent/log/current
tail -150 /opt/dirac/runit/StorageManagement/StorageManagerHandler/log/current

To clear content of logs

 echo -n > current 

A list of LFNs for testing the staging procedure

(under /project/bfys/dremensk/cmtdev/InputFiles.txt)

LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/69924/069924_0000000001.raw
LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/71493/071493_0000000057.raw
LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/71479/071479_0000000001.raw
LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/71476/071476_0000000241.raw
LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/70171/070171_0000000001.raw

How to check the SPACE TOKEN(s) for a file

 $ dirac-dms-lfn-replicas LFN:/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/71476/071476_0000000241.raw
2011-03-15 16:55:13 UTC dirac-dms-lfn-replicas/DiracAPI  INFO: Replica Lookup Time: 0.23 seconds
{'Failed': {},
'Successful': {'/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/71476/071476_0000000241.raw': {'CERN-RAW': 'srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/71476/071476_0000000241.raw'}}}

Setting up a manual request from python directly

from DIRAC.Core.Base.Script import parseCommandLine
parseCommandLine()
from DIRAC.Core.DISET.RPCClient import RPCClient
s = RPCClient("StorageManagement/StorageManagerHandler")
s.getWaitingReplicas()
s.getTasksWithStatus('Done')
---------------
s = RPCClient("WorkloadManagement/JobMonitoring")
s.getJobTypes()
{'OK': True, 'rpcStub': (('WorkloadManagement/JobMonitoring', {'skipCACheck': False, 'delegatedGroup': 'lhcb_user', 'delegatedDN': '/O=dutchgrid/O=users/O=nikhef/CN=Daniela Remenska', 'timeout': 600}), 'getJobTypes', ()), 'Value': ['DataReconstruction', 'DataStripping', 'MCSimulation', 'Merge', 'SAM', 'User']}
---------------
>>> s.getStates()
{'OK': True, 'rpcStub': (('WorkloadManagement/JobMonitoring', {'skipCACheck': False, 'delegatedGroup': 'lhcb_user', 'delegatedDN': '/O=dutchgrid/O=users/O=nikhef/CN=Daniela Remenska', 'timeout': 600}), 'getStates', ()), 'Value': ['Checking', 'Completed', 'Done', 'Failed', 'Killed', 'Matched', 'Received', 'Rescheduled', 'Running', 'Stalled', 'Waiting']}
--------------
s.getStageRequests({'StageStatus':'Staged'})
>>> s.getStageRequests({'StageStatus':'Staged'})['Value'][1874854]
{'PinExpiryTime': datetime.datetime(2011, 3, 4, 11, 7, 26), 'StageRequestCompletedTime': datetime.datetime(2011, 3, 3, 11, 47, 26), 'StageStatus': 'Staged', 'RequestID': '140117334', 'StageRequestSubmitTime': datetime.datetime(2011, 3, 3, 11, 46, 53), 'PinLength': 86400L}
--------------

s = RPCClient("StorageManagement/StorageManagerHandler")
s.setRequest({'CERN-RDST':'/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/77969/077969_0000000611.raw'},'DanielaTest','method@DanielaTest/TestHandler',999999)
-----------------

Procedure to deploy new code

  1. cd /project/bfys/dremensk/cmtdev/LHCbDirac_v5r11p3
  2. svn update
  3. Stop the agents and services in the SMS:

runsvctrl d /opt/dirac/startup/StorageManagement_RequestPreparationAgent

  4. Check if it is in fact disabled:

ps -ef | grep RequestPreparationAgent

  5. Set the logging level to debug:

emacs /opt/dirac/runit/StorageManagement/RequestPreparationAgent/run    # set LogLevel=DEBUG in this file

  6. Copy the modified code to the appropriate path on volhcb12:

cd /opt/dirac/pro/DIRAC/StorageManagementSystem/Agent

scp danielar@login.nikhef.nl:/project/bfys/dremensk/cmtdev/LHCbDirac_v5r8/DIRAC/StorageManagementSystem/Agent/RequestPreparationAgent.py .

  7. For an Agent (ONLY if new, not an update): dirac-install-agent StorageManagement RequestPreparationAgent. Then start the agent:

runsvctrl u /opt/dirac/startup/StorageManagement_RequestPreparationAgent

  8. For a Service:

cd /opt/dirac/pro
(if new) ./scripts/install_service.sh DataManagement testDMS

cd /opt/dirac/startup

ln -s /opt/dirac/pro/runit/DataManagement/testDMS DataManagement_testDMS

Once this link has been created, then the service will automatically start.

  9. Check the log to see if your modifications are visible:

cat /opt/dirac/runit/StorageManagement/RequestPreparationAgent/log/current

  10. If all is ok, SET BACK THE LOG LEVELS TO INFO:

emacs /opt/dirac/runit/StorageManagement/RequestPreparationAgent/run

To browse SRM

srmls srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/data/2010/RAW/EXPRESS/LHCb/COLLISION10/

To get TURLs for files

lcg-getturls srm://srm.grid.sara.nl/pnfs/grid.sara.nl/data/lhcb/LHCb/Collision11/SWIMSTRIPPINGD02KSKK.MDST/00019038/0000/00019038_00000080_1.swimstrippingd02kskk.mdst -p file,xroot,root,dcap,gsidcap

To see your jobs

https://lhcbweb.pic.es/DIRAC/LHCb-Development/lhcb_user/jobs/JobMonitor/display

Which user is doing what on a machine

    ps efu -U <user>    

Some useful Git commands:

Fixing committed mistakes: git revert HEAD

create new branch: git branch experimental

to switch to the new branch: git checkout experimental

commit all changes: git commit -a

copying changes from the local branch to the remote one: git push origin fixes-sms:refs/heads/fixes-sms

throw away the last local (unpublished) commit: git reset --hard HEAD~1 (for an already published change, prefer git revert)

Graphical overview with all branches/commits: gitk &

Checking out existing git repo: git clone git@github.com:remenska/DIRAC.git

cd DIRAC/

git checkout -b fixes-sms origin/fixes-sms

Quick fixes in DIRAC:

git clone git@github.com:remenska/DIRAC.git

cd DIRAC/

git remote add upstream git://github.com/DIRACGrid/DIRAC.git

git fetch upstream

git checkout -b rel-v6r9-fixes remotes/upstream/rel-v6r9

make the changes...

git commit -a

git remote add remenska http://github.com/remenska/DIRAC.git

git fetch remenska

git push remenska rel-v6r9-fixes

life saver for testing any Dirac code on-the-fly:

on volhcb22
from DIRAC.Core.Base import Script
Script.addDefaultOptionValue( '/DIRAC/Security/UseServerCertificate', 'yes' )
Script.parseCommandLine( ignoreErrors = False )
from DIRAC.StorageManagementSystem.DB.StorageManagementDB  import StorageManagementDB
storageDB = StorageManagementDB()
res = storageDB.getCacheReplicas( {'Status':'StageSubmitted'} )
print res

Check if there's something fishy with the stager:

SELECT ReplicaID FROM CacheReplicas WHERE Status='StageSubmitted' AND ReplicaID NOT IN ( SELECT DISTINCT( ReplicaID ) FROM StageRequests );

To see which DIRAC version is in production, on volhcb20 just grep the agent logs for "DIRAC version:" (see the sketch below).
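
A minimal sketch (assumes volhcb20 uses the same runit log layout as volhcb12 above):

grep -h "DIRAC version:" /opt/dirac/runit/*/*/log/current | sort -u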

-- DanielaRemenska - 04-Apr-2011
