Max CPU time

LHCb requirement

VO requirements are expressed in the kSI2K benchmark in the VO card. For LHCb it was Job CPU limit = 2500 min/1000SI2K (i.e. 2500 kSI2K min), but it has recently (19/05/2010) been updated to 4500 kSI2K min (corresponding to 18000 HS06 min).

In DIRAC a limit is set for all jobs at half of this value, in order to have a safety margin. The number is expressed in HS06 seconds, which is another benchmark; the conversion factor of 4 between kSI2K and HS06 is a good approximation:

2500 kSI2K min = 60 * 2500 kSI2K s = 150000 kSI2K s = 4 * 150000 HS06 s = 600000 HS06 s.

Inside DIRAC the limit (hard coded?) is set to 350000 Dirac seconds. The Dirac second is yet another unit: one Dirac unit is half a kSI2K.

| Unit      | in kSI2K    | in HS06  | in Dirac    |
| 1 kSI2K s | 1           | 4 HS06 s | 2 Dirac s   |
| 1 HS06 s  | 1/4 kSI2K s | 1        | 0.5 Dirac s |
| 1 Dirac s | 0.5 kSI2K s | 2 HS06 s | 1           |
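
A minimal conversion sketch in Python (a hypothetical helper, not DIRAC code), using the factors from the table above (1 kSI2K s = 4 HS06 s = 2 Dirac s):

# Hypothetical helper; conversion factors taken from the table above.
TO_HS06 = {"ksi2k": 4.0, "hs06": 1.0, "dirac": 2.0}  # value of 1 s of each unit, in HS06 s

def convert(value, from_unit, to_unit):
    """Convert a CPU-time value between kSI2K, HS06 and Dirac seconds."""
    return value * TO_HS06[from_unit] / TO_HS06[to_unit]

# Old LHCb requirement: 2500 kSI2K min = 150000 kSI2K s
print(convert(150000, "ksi2k", "hs06"))   # 600000.0 HS06 s
print(convert(150000, "ksi2k", "dirac"))  # 300000.0 Dirac s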

Value published by sites

The value published by sites can be checked with an LDAP query. To see the max CPU limit:
[lxplus201] /afs/cern.ch/user/l/lanciott > ldapsearch -h lcg-bdii.cern.ch:2170 -b "mds-vo-name=local,o=grid" -x -LLL "(& (GlueForeignKey=GlueClusterUniqueID=ce08.pic.es) (GlueCEAccessControlBaseRule=VO:lhcb))" GlueCEUniqueID GlueCEPolicyMaxCPUTime 
dn: GlueCEUniqueID=ce08.pic.es:8443/cream-pbs-glong_sl5,Mds-Vo-name=pic,Mds-Vo
 -name=local,o=grid
GlueCEUniqueID: ce08.pic.es:8443/cream-pbs-glong_sl5
GlueCEPolicyMaxCPUTime: 4800

dn: GlueCEUniqueID=ce08.pic.es:8443/cream-pbs-gshort_sl5,Mds-Vo-name=pic,Mds-V
 o-name=local,o=grid
GlueCEUniqueID: ce08.pic.es:8443/cream-pbs-gshort_sl5
GlueCEPolicyMaxCPUTime: 60

dn: GlueCEUniqueID=ce08.pic.es:8443/cream-pbs-gmedium_sl5,Mds-Vo-name=pic,Mds-
 Vo-name=local,o=grid
GlueCEUniqueID: ce08.pic.es:8443/cream-pbs-gmedium_sl5
GlueCEPolicyMaxCPUTime: 720

The longest queue for PIC says 4800 minutes. NB: the GLUE schema publishes the value in minutes and, in principle, it refers to the SI2K benchmark. The benchmark alone is not enough: sites also have to specify the scaling factor, which is published as the GlueHostBenchmarkSI00 attribute:
[lxplus201] /afs/cern.ch/user/l/lanciott > ldapsearch -h lcg-bdii.cern.ch:2170 -b "mds-vo-name=local,o=grid" -x -LLL "(GlueSubClusterUniqueID=ce08.pic.es)" GlueHostBenchmarkSI00 | grep GlueHostBenchmarkSI00
GlueHostBenchmarkSI00: 1200

For PIC: 4800 min SI2K = 4800/60 = 80 hours SI2K. Confirmed by running a simple qstat -q on the UI. To compare this value with the LHCb requirement, apply the normalization factor and convert to minutes: 80 hours * 1200 SI2K = 60 * 80 * 1200 SI2K min = 5760000 SI2K min = 5760 kSI2K min. This is more than twice the requirement. Fine! PIC provides the needed CPU time (as of Aug 2010).
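
The same check can be scripted. A sketch in Python using the python-ldap module (an assumption: the module is available on the UI; the CE name is the PIC example above and the attribute names are those used in the ldapsearch commands):

import ldap

BDII = "ldap://lcg-bdii.cern.ch:2170"
BASE = "mds-vo-name=local,o=grid"
CE = "ce08.pic.es"

conn = ldap.initialize(BDII)
conn.simple_bind_s()  # anonymous bind

# Benchmark factor published for the subcluster (SI00)
subcluster = conn.search_s(BASE, ldap.SCOPE_SUBTREE,
                           "(GlueSubClusterUniqueID=%s)" % CE,
                           ["GlueHostBenchmarkSI00"])
si00 = float(subcluster[0][1]["GlueHostBenchmarkSI00"][0])

# Max CPU time (minutes) published by each queue open to LHCb
queues = conn.search_s(BASE, ldap.SCOPE_SUBTREE,
                       "(&(GlueForeignKey=GlueClusterUniqueID=%s)"
                       "(GlueCEAccessControlBaseRule=VO:lhcb))" % CE,
                       ["GlueCEUniqueID", "GlueCEPolicyMaxCPUTime"])

for dn, attrs in queues:
    minutes = float(attrs["GlueCEPolicyMaxCPUTime"][0])
    # Normalise to kSI2K minutes: published minutes * SI00 / 1000
    print("%s: %.0f kSI2K min" % (attrs["GlueCEUniqueID"][0], minutes * si00 / 1000.0))
# For the PIC long queue: 4800 * 1200 / 1000 = 5760 kSI2K min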

Nice summary from Roberto here.

Raw data file size tuning

The raw data file size has to be tuned so as not to exceed the CPU time offered by the longest queue. Bigger files optimize storage and are also better for Online management, but files that are too big exceed the CPU time limit and are killed by the batch system. A first attempt in April 2010 with 3 GB files failed: the files were too big and many reconstruction jobs were killed by the batch system. The size was then reduced to 1 GB. Update on May 17th: the HLT has reached the nominal rate (2 kHz). With this high rate 1 GB seems too small: a very large number of files is taken in a short time. With the current version of Brunel the CPU needed to process a 1 GB file is 290k HS06 seconds, so a 2 GB file would require about 590000 HS06 seconds. This is still below the limit required by LHCb, which is 150000 kSI2K s = 600000 HS06 s.
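
A back-of-the-envelope check of these numbers (a sketch; the 290k HS06 s per GB figure is the Brunel measurement quoted above, and the limit is the 600000 HS06 s requirement):

# Hypothetical sanity check, not production code.
CPU_PER_GB_HS06 = 290e3   # HS06 s needed by Brunel to reconstruct 1 GB of raw data
CPU_LIMIT_HS06 = 600e3    # LHCb requirement: 2500 kSI2K min = 150000 kSI2K s = 600000 HS06 s

for size_gb in (1.0, 2.0, 3.0):
    needed = size_gb * CPU_PER_GB_HS06
    verdict = "fits" if needed < CPU_LIMIT_HS06 else "exceeds the limit"
    print("%.0f GB file: %.0f HS06 s -> %s" % (size_gb, needed, verdict))
# 1 and 2 GB stay below the 600000 HS06 s budget; 3 GB does not.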

Follow up of the discussion

Following discussions (19/05/2010) the CPU time limit has been increased to 4500 kSI2K min and a Savannah task was opened to modify the SAM test to check the queue length. PIC provides 5760 kSI2K min, so we are OK.

19.05.2010 Study of the correlation between site and WN CPU power reveals that this power is not reliable

See plot in attachment. The CPU power is a single value for each site, rather than being distributed, and for some sites it is 1 or 4, which is not a realistic value. This is what in Dirac is called NormalisedCPU (in the job it is CPUNormalisationFactor): does it make sense? A further consequence is that when plotting the NormalisedCPU vs the file size, there is no correlation!

Philippe proposes to use a private CPU normalization system: take a reference based on a popular CPU type, and then normalize the others based on a big MC production. This was presented by Ricardo at CHEP09 (look for reference!).

Ricardo: the ScaledCPU should take into account a normalization made for each WN, which the site should do, whereas NormalisedCPU is an average value for the site. If both numbers agree within a few hundred seconds, it means the site does not do any WN-to-WN normalization at the batch system level; this might be OK if all WNs are similar. If they disagree, it means there is a WN-to-WN normalization and ScaledCPU is probably doing it better. Proposed to use ScaledCPU instead of NormalisedCPU.

03.06.2010: Meeting on CPU normalization

Minutes of the meeting to discuss how Dirac can get the normalization factor.

June 2010 GDB where S. Burke explained the changes applied to the publication schema.

11.06.2010 Peek standard output reports CPU time used by pilot

From a thread of June 11 2010: the CPU consumption reported in the "Peek stdoutput" is that of the pilot and not of the job; this should be fixed.

11.06.2010 The CPU time left utility at CERN is not working properly

Observed in a user job.

-- ElisaLanciotti - 17-May-2010

Topic attachments

| Attachment | Size | Date | Comment |
| 6394prodwncpuhs06.ps (PostScript) | 28.6 K | 2010-07-04 | CPUNormalisationFactor reported in the BK vs site, for a given production. |