Workflow team operations

This twiki page documents some of the procedures I follow when dealing with Workflow team issues. Please copy and paste them to the appropriate location in the Workflow team twikis.

Missing files not injected in DBS

For some reason, the WMAgent has been acting up lately and decided to ignore some of the locations when injecting files into dbsbuffer from the JobAccountant. Fortunately, the remedy is easy and works 99.75% of the time.

1. How to identify the problem?

The symptoms are that some files appear in the workflow summary but not in DAS, and events are missing from the output of a workflow. To confirm that this problem is the cause, go to the WMAgent that processed the request you are concerned about and run:

$manage mysql-prompt wmagent  # Open the SQL interface
# Inside the SQL prompt
SELECT * FROM dbsbuffer_file WHERE status = 'READY' AND block_id IS NULL;

If the previous query returns any results, then there is a problem and you should execute the remedy script.
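The same check can be scripted, for example to run it periodically on each agent. The following is only a minimal sketch and not part of the WMAgent code: `find_limbo_files` is a hypothetical helper that accepts any Python DB-API connection, it selects only the lfn column instead of everything, and the in-memory SQLite table is a stand-in for the real dbsbuffer_file table (only the three columns used by the query are assumed).

```python
import sqlite3

# Same condition as the query above: files marked READY that never got a block.
LIMBO_QUERY = (
    "SELECT lfn FROM dbsbuffer_file "
    "WHERE status = 'READY' AND block_id IS NULL"
)


def find_limbo_files(conn):
    """Return the LFNs of files stuck in limbo (hypothetical helper)."""
    cursor = conn.cursor()
    cursor.execute(LIMBO_QUERY)
    return [row[0] for row in cursor.fetchall()]


def make_demo_db():
    """Build an in-memory stand-in for dbsbuffer_file (assumed columns only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dbsbuffer_file (lfn TEXT, status TEXT, block_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO dbsbuffer_file VALUES (?, ?, ?)",
        [
            ("/store/demo/ok.root", "READY", 7),
            ("/store/demo/limbo.root", "READY", None),
            ("/store/demo/new.root", "NOTUPLOADED", None),
        ],
    )
    return conn


if __name__ == "__main__":
    limbo = find_limbo_files(make_demo_db())
    print("%d file(s) in limbo" % len(limbo))
    for lfn in limbo:
        print(lfn)
```

Against a real agent you would pass the connection of whatever database driver the agent uses instead of the demo database; a non-empty result means the remedy script is needed.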

2. The remedy script

There are two versions of the script, one for WMAgents backed by MySQL and one for those backed by Oracle; both are in this gist. The script usage is:

cmst1
source /data/admin/wmagent/env.sh
source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh
python <locationToScript>/fixLimboFiles_mysql.py # or
python <locationToScript>/fixLimboFiles_oracle.py

3. What about the last 0.25% of the time?

The script should fix most of the files, but some of them won't have location information in CouchDB due to the known CouchDB problems. The script prints these files with lines like:

Could not find location for /store/mc/Summer11/ZH_HToTauTau_M-100_lepdecay_7TeV-pythia6-tauola/GEN-SIM/START311_V2-v1/00000/CA6863B9-28E9-E211-AEBA-003048D41148.root
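If the script reports many such files, it may help to collect their LFNs automatically. This is a small sketch that assumes you captured the script's output as text and that the message format is exactly the one shown above; `files_without_location` is a hypothetical helper, not something the remedy scripts provide.

```python
import re

# Matches the warning line printed by the remedy script, as quoted above.
MISSING_LOCATION = re.compile(r"^Could not find location for (\S+)$")


def files_without_location(script_output):
    """Return the LFNs named in 'Could not find location for ...' lines."""
    lfns = []
    for line in script_output.splitlines():
        match = MISSING_LOCATION.match(line.strip())
        if match:
            lfns.append(match.group(1))
    return lfns


if __name__ == "__main__":
    sample = (
        "some unrelated log line\n"
        "Could not find location for /store/mc/Summer11/"
        "ZH_HToTauTau_M-100_lepdecay_7TeV-pythia6-tauola/GEN-SIM/"
        "START311_V2-v1/00000/CA6863B9-28E9-E211-AEBA-003048D41148.root\n"
    )
    for lfn in files_without_location(sample):
        print(lfn)
```

Each LFN in the resulting list then goes through the manual steps that follow.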

In that case, the following should work on LXPLUS5:

source /afs/cern.ch/cms/LCG/LCG-2/UI/cms_ui_env.sh
voms-proxy-init -voms cms
xrd xrootd.unl.edu
root://xrootd.unl.edu:1094//> locateall /store/mc/Summer11/ZH_HToTauTau_M-100_lepdecay_7TeV-pythia6-tauola/GEN-SIM/START311_V2-v1/00000/CA6863B9-28E9-E211-AEBA-003048D41148.root

If the file exists in an xrootd-published location, it will print something like:

------------- Location #1
InfoType: kXrdcLocDataServer
CanWrite: true
Location: '148.6.8.143:11000'
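When checking several files, the address can be pulled out of that output with a few lines of Python. This is a sketch that assumes the output format is exactly as in the example above (an IPv4 address and a port inside single quotes); `parse_locations` is a hypothetical helper name.

```python
import re

# Matches lines such as:  Location: '148.6.8.143:11000'
LOCATION_LINE = re.compile(r"^\s*Location:\s*'([^':]+):(\d+)'\s*$")


def parse_locations(locate_output):
    """Return (host, port) pairs from the 'Location:' lines of the output."""
    locations = []
    for line in locate_output.splitlines():
        match = LOCATION_LINE.match(line)
        if match:
            locations.append((match.group(1), int(match.group(2))))
    return locations


SAMPLE = """\
------------- Location #1
InfoType: kXrdcLocDataServer
CanWrite: true
Location: '148.6.8.143:11000'
"""

if __name__ == "__main__":
    print(parse_locations(SAMPLE))
```

For the sample output this prints `[('148.6.8.143', 11000)]`. An empty list means no published location was reported for the file.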

The Location field gives the IP address of the SE, so we can do a reverse DNS lookup:

host 148.6.8.143 # or
nslookup 148.6.8.143
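The same reverse lookup can be done from Python with the standard `socket` module. This is a minimal sketch: `reverse_lookup` is a hypothetical helper, and the `resolver` argument exists only so that the function can be exercised without network access.

```python
import socket


def reverse_lookup(ip_address, resolver=socket.gethostbyaddr):
    """Return the primary hostname for an IP address, or None if it
    cannot be resolved.

    `resolver` must behave like socket.gethostbyaddr, which returns a
    (hostname, aliases, addresses) tuple.
    """
    try:
        hostname, _aliases, _addresses = resolver(ip_address)
    except (socket.herror, socket.gaierror):
        return None
    return hostname
```

Calling `reverse_lookup("148.6.8.143")` on a machine with working DNS should give the same name that `host` or `nslookup` reports; a `None` result means the address has no reverse record and the site has to be identified some other way.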

The lookup returns the hostname, which in the example is grid143.kfki.hu. So we know it is a site in Hungary, and SiteDB shows that there is only one CMS site there, T2_HU_Budapest. We can therefore safely link the file to this site in the database. This is done with the following queries:

# First find the SE registered for the site
SELECT se_name FROM wmbs_location_senames WHERE location = (SELECT id FROM wmbs_location WHERE site_name = 'T2_HU_Budapest');
# If there are many, just take your favorite one. Then register the file to that SE in DBSBuffer
INSERT INTO dbsbuffer_file_location (filename, location)
    SELECT df.id, dl.id
    FROM dbsbuffer_file df, dbsbuffer_location dl
    WHERE df.lfn = '/store/mc/Summer11/ZH_HToTauTau_M-100_lepdecay_7TeV-pythia6-tauola/GEN-SIM/START311_V2-v1/00000/CA6863B9-28E9-E211-AEBA-003048D41148.root'
    AND dl.se_name = 'grid143.kfki.hu';

Finally, set the file back to NOTUPLOADED so the DBSUpload component picks it up:

UPDATE dbsbuffer_file SET status = 'NOTUPLOADED' WHERE lfn = '/store/mc/Summer11/ZH_HToTauTau_M-100_lepdecay_7TeV-pythia6-tauola/GEN-SIM/START311_V2-v1/00000/CA6863B9-28E9-E211-AEBA-003048D41148.root';
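If several files need this treatment, the link and the status reset can be wrapped in one transaction with bound parameters, which avoids pasting long LFNs into SQL by hand. The following is a sketch under stated assumptions, not an existing WMAgent tool: `relink_file` is a hypothetical helper, the check that exactly one location row was inserted is my own safety net, and the in-memory SQLite schema contains only the columns the statements above use.

```python
import sqlite3

# SQLite uses '?' placeholders; MySQLdb uses '%s' and cx_Oracle ':name',
# so the statements need adapting to the driver of the real agent.
LINK_SQL = (
    "INSERT INTO dbsbuffer_file_location (filename, location) "
    "SELECT df.id, dl.id FROM dbsbuffer_file df, dbsbuffer_location dl "
    "WHERE df.lfn = ? AND dl.se_name = ?"
)
RESET_SQL = "UPDATE dbsbuffer_file SET status = 'NOTUPLOADED' WHERE lfn = ?"


def relink_file(conn, lfn, se_name):
    """Link lfn to se_name and requeue it for DBSUpload.

    Rolls back and raises ValueError unless exactly one location row
    was inserted (unknown LFN, unknown SE, or duplicate entries).
    """
    cursor = conn.cursor()
    try:
        cursor.execute(LINK_SQL, (lfn, se_name))
        if cursor.rowcount != 1:
            raise ValueError("expected 1 location row, got %d" % cursor.rowcount)
        cursor.execute(RESET_SQL, (lfn,))
        conn.commit()
    except Exception:
        conn.rollback()
        raise


def make_demo_db():
    """Build an in-memory stand-in for the DBSBuffer tables (assumed columns)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dbsbuffer_file (id INTEGER PRIMARY KEY, lfn TEXT, status TEXT);
        CREATE TABLE dbsbuffer_location (id INTEGER PRIMARY KEY, se_name TEXT);
        CREATE TABLE dbsbuffer_file_location (filename INTEGER, location INTEGER);
        INSERT INTO dbsbuffer_file VALUES (1, '/store/demo/limbo.root', 'READY');
        INSERT INTO dbsbuffer_location VALUES (5, 'grid143.kfki.hu');
        """
    )
    return conn


if __name__ == "__main__":
    demo = make_demo_db()
    relink_file(demo, "/store/demo/limbo.root", "grid143.kfki.hu")
    print(demo.execute("SELECT lfn, status FROM dbsbuffer_file").fetchall())
```

Doing both statements in one transaction means a file is never set back to NOTUPLOADED without a location, which is the state that caused the problem in the first place.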

4. What if the file is not in xrootd?

Then we are doomed, feel free to despair and run in circles. Or better, since that file can't be recovered, mark it as lost in the DBSBuffer and the system will ignore it. You can do the following:

UPDATE dbsbuffer_file SET status = 'LOST' WHERE lfn = '/store/mc/Summer11/ZH_HToTauTau_M-100_lepdecay_7TeV-pythia6-tauola/GEN-SIM/START311_V2-v1/00000/CA6863B9-28E9-E211-AEBA-003048D41148.root';

Remember to note that a file was lost when analyzing the output of the workflow. Since the file never appears in DBS or PhEDEx, no further action is needed on those fronts.

Topic revision: r1 - 2013-07-22 - DiegoBallesterosVillamizar
 