Notes and Tests

Test from Ian Fisk

  1. Initialize an environment for CMSSW 3_3_6:
    cmsrel CMSSW_3_3_6; cd CMSSW_3_3_6; cmsenv
  2. Copy in the config file:
    cp /afs/cern.ch/user/f/fisk/public/CERNResults/test_cfg.py $PWD/src/
    Alternately, you can download the file attached to the twiki:
     curl https://twiki.cern.ch/twiki/pub/Main/CmsIOWork/test_cfg.py.txt > src/test_cfg.py 
  3. Check out the analysis code:
    cvs co -r V03-00-00 -d src/CMS2 UserCode/JRibnik/CMS2
  4. Apply the patches above to TTreeCache and TStorageFactoryFile.cxx
  5. Build the code (I find building this code takes about 100MB of space; make sure you have enough quota):
    scram b
  6. Edit test_cfg.py to turn on the AdaptorConfig, turn on stats, and turn the cacheHint to "application-only" (see https://twiki.cern.ch/twiki/bin/viewauth/CMS/SWGuideTFileAdaptor).
  7. Run the script:
    cmsRun -j CERNImproved.fjr src/test_cfg.py 2>&1 > CERNImproved.log &
  8. Performance may be monitored using the script in /afs/cern.ch/user/f/fisk/public/CERNResults/measuremon.sh

If you don't see messages like this:

Using my custom readv impl for 1346 buffers
Number of reads performed 506
in your stdout, then you have not successfully turned on both TTreeCache and read coalescing!

Results from running on lxplus

Here's the runtime information with TTreeCache and read coalescing
real    11m53.148s
user    10m5.383s
sys     0m17.628s
A subset of the TFileAdaptor statistics relevant to reads:
<storage-timing-summary>
 <counter-value subsystem='rfio' counter-name='open' num-operations='2' num-successful-operations='2' total-megabytes='0' total-msecs='2799.22' min-msecs='1275.6' max-msecs='1523.62'/>
 <counter-value subsystem='rfio' counter-name='read' num-operations='22416' num-successful-operations='22416' total-megabytes='805.87' total-msecs='71747.1' min-msecs='0.00512' max-msecs='677.673'/>
 <counter-value subsystem='rfio' counter-name='readv' num-operations='0' num-successful-operations='0' total-megabytes='0' total-msecs='0' min-msecs='0' max-msecs='0'/>
 <counter-value subsystem='rfio' counter-name='stagein' num-operations='2' num-successful-operations='2' total-megabytes='0' total-msecs='278.703' min-msecs='116.939' max-msecs='161.764'/>
 <counter-value subsystem='tstoragefile' counter-name='construct' num-operations='3' num-successful-operations='3' total-megabytes='0' total-msecs='9026.48' min-msecs='95.465' max-msecs='7306.93'/>
 <counter-value subsystem='tstoragefile' counter-name='read' num-operations='78246' num-successful-operations='78246' total-megabytes='1254.42' total-msecs='115435' min-msecs='0.00896' max-msecs='3249.42'/>
 <counter-value subsystem='tstoragefile' counter-name='read-actual' num-operations='22416' num-successful-operations='22416' total-megabytes='805.87' total-msecs='71803' min-msecs='0.006912' max-msecs='677.676'/>
 <counter-value subsystem='tstoragefile' counter-name='read-via-cache' num-operations='59856' num-successful-operations='55830' total-megabytes='448.547' total-msecs='42581.5' min-msecs='0.011008' max-msecs='3249.42'/>
 <counter-value subsystem='tstoragefile' counter-name='readv-actual' num-operations='29' num-successful-operations='29' total-megabytes='0' total-msecs='42225.5' min-msecs='132.351' max-msecs='3247.55'/>
</storage-timing-summary>

Here's the stats and timing information for the original file:

real    13m58.683s
user    10m12.768s
sys     0m25.069s
<storage-timing-summary>
 <counter-value subsystem='rfio' counter-name='open' num-operations='2' num-successful-operations='2' total-megabytes='0' total-msecs='6581.33' min-msecs='2614.11' max-msecs='3967.23'/>
 <counter-value subsystem='rfio' counter-name='read' num-operations='63143' num-successful-operations='63143' total-megabytes='540.754' total-msecs='200315' min-msecs='0.002816' max-msecs='422.642'/>
 <counter-value subsystem='rfio' counter-name='readv' num-operations='0' num-successful-operations='0' total-megabytes='0' total-msecs='0' min-msecs='0' max-msecs='0'/>
 <counter-value subsystem='rfio' counter-name='stagein' num-operations='2' num-successful-operations='2' total-megabytes='0' total-msecs='276.838' min-msecs='122.541' max-msecs='154.297'/>
 <counter-value subsystem='tstoragefile' counter-name='construct' num-operations='3' num-successful-operations='3' total-megabytes='0' total-msecs='11352.9' min-msecs='15.905' max-msecs='7327.45'/>
 <counter-value subsystem='tstoragefile' counter-name='read' num-operations='63143' num-successful-operations='63143' total-megabytes='540.754' total-msecs='200492' min-msecs='0.004864' max-msecs='422.646'/>
 <counter-value subsystem='tstoragefile' counter-name='read-actual' num-operations='63143' num-successful-operations='63143' total-megabytes='540.754' total-msecs='200406' min-msecs='0.00384' max-msecs='422.644'/>
</storage-timing-summary>

Quick-run script

cmsrel CMSSW_3_3_6; cd CMSSW_3_3_6; cmsenv
PackageManagement.pl --pack "IOPool/Input" --release CMSSW_3_3_6
PackageManagement.pl --pack "IOPool/TFileAdaptor" --release CMSSW_3_3_6
/usr/bin/curl https://twiki.cern.ch/twiki/pub/Sandbox/CmsIOWork/ttreecache.patch | patch -p0
/usr/bin/curl https://twiki.cern.ch/twiki/pub/Sandbox/CmsIOWork/read_coalesce.patch | patch -p0
/usr/bin/curl https://twiki.cern.ch/twiki/pub/Sandbox/CmsIOWork/test_cfg.py.txt > src/test_cfg.py
cvs co -r V03-00-00 -d src/CMS2 UserCode/JRibnik/CMS2
scram b
cmsRun -j CERNImproved.fjr src/test_cfg.py

Test from Leonardo Sala

The following test is an example of a CPU-bound PAT tuple creator. The interesting aspects of this include:

  1. Demonstrate changes have no significant penalty for CPU-bound processes.
  2. Unlike the above test, the number of I/O reads go down only by a factor of 2.
  3. A case where read coalescing causes a significant increase in the number of MB read.

To run this on your own:

  1. Initialize an environment for CMSSW 3_3_6:
    cmsrel CMSSW_3_3_6; cd CMSSW_3_3_6; cmsenv
  2. Copy in the config file:
    curl https://twiki.cern.ch/twiki/pub/Main/CmsIOWork/PATSimple.py.txt > src/PATSimple.py
  3. Apply the patches above to RootTree.cxx for TTreeCache and TStorageFactoryFile.cxx for read coalescing.
  4. Build the code:
    scram b
  5. Edit PATSimple.py to turn on the AdaptorConfig, turn on stats, and turn the cacheHint to "application-only" (see https://twiki.cern.ch/twiki/bin/viewauth/CMS/SWGuideTFileAdaptor).
  6. Run the script:
    cmsRun -j FNALImproved.fjr src/test_cfg.py 2>&1 > FNALImproved.log &

Results

This test was run at FNAL. Overall, the number of seconds spent reading in files decreased from 32s to 29s when turning on optimizations. The wallclock time decreased by approximately 20s (original job's walltime was 10min, 20s).

The FJR statistics with the read coalescing turned on is:

real    10m4.795s
user    9m25.875s
sys     0m6.565s
<storage-timing-summary>
 <counter-value subsystem='dcap' counter-name='open' num-operations='1' num-successful-operations='1' total-megabytes='0' total-msecs='178.99' min-msecs='178.99' max-msecs='178.99'/>
 <counter-value subsystem='dcap' counter-name='read' num-operations='13174' num-successful-operations='13174' total-megabytes='317.267' total-msecs='16633' min-msecs='0.000768' max-msecs='104.784'/>
 <counter-value subsystem='dcap' counter-name='readv' num-operations='0' num-successful-operations='0' total-megabytes='0' total-msecs='0' min-msecs='0' max-msecs='0'/>
 <counter-value subsystem='dcap' counter-name='stagein' num-operations='1' num-successful-operations='1' total-megabytes='0' total-msecs='202.402' min-msecs='202.402' max-msecs='202.402'/>
 <counter-value subsystem='tstoragefile' counter-name='construct' num-operations='2' num-successful-operations='2' total-megabytes='0' total-msecs='4486.92' min-msecs='29.503' max-msecs='4457.42'/>
 <counter-value subsystem='tstoragefile' counter-name='read' num-operations='38571' num-successful-operations='38571' total-megabytes='488.867' total-msecs='29343.8' min-msecs='0.00384' max-msecs='1685.15'/>
 <counter-value subsystem='tstoragefile' counter-name='read-actual' num-operations='13174' num-successful-operations='13174' total-megabytes='317.267' total-msecs='16655.9' min-msecs='0.002816' max-msecs='104.787'/>
 <counter-value subsystem='tstoragefile' counter-name='read-via-cache' num-operations='26680' num-successful-operations='25397' total-megabytes='171.6' total-msecs='12616.8' min-msecs='0.007936' max-msecs='1685.15'/>
 <counter-value subsystem='tstoragefile' counter-name='readv-actual' num-operations='10' num-successful-operations='10' total-megabytes='0' total-msecs='12232.9' min-msecs='276.526' max-msecs='1681'/>
</storage-timing-summary>

The original file has the following statistics:

real    10m21.438s
user    9m30.674s
sys     0m14.617s
 <counter-value subsystem='dcap' counter-name='read' num-operations='13174' num-successful-operations='13174' total-megabytes='317.267' total-msecs='16633' min-msecs='0.000768' max-msecs='104.784'/>
 <counter-value subsystem='dcap' counter-name='readv' num-operations='0' num-successful-operations='0' total-megabytes='0' total-msecs='0' min-msecs='0' max-msecs='0'/>
 <counter-value subsystem='dcap' counter-name='stagein' num-operations='1' num-successful-operations='1' total-megabytes='0' total-msecs='202.402' min-msecs='202.402' max-msecs='202.402'/>
 <counter-value subsystem='tstoragefile' counter-name='construct' num-operations='2' num-successful-operations='2' total-megabytes='0' total-msecs='4486.92' min-msecs='29.503' max-msecs='4457.42'/>
 <counter-value subsystem='tstoragefile' counter-name='read' num-operations='38571' num-successful-operations='38571' total-megabytes='488.867' total-msecs='29343.8' min-msecs='0.00384' max-msecs='1685.15'/>
 <counter-value subsystem='tstoragefile' counter-name='read-actual' num-operations='13174' num-successful-operations='13174' total-megabytes='317.267' total-msecs='16655.9' min-msecs='0.002816' max-msecs='104.787'/>
 <counter-value subsystem='tstoragefile' counter-name='read-via-cache' num-operations='26680' num-successful-operations='25397' total-megabytes='171.6' total-msecs='12616.8' min-msecs='0.007936' max-msecs='1685.15'/>
 <counter-value subsystem='tstoragefile' counter-name='readv-actual' num-operations='10' num-successful-operations='10' total-megabytes='0' total-msecs='12232.9' min-msecs='276.526' max-msecs='1681'/>

Quick-run script

cmsrel CMSSW_3_3_6; cd CMSSW_3_3_6; cmsenv
PackageManagement.pl --pack "IOPool/Input" --release CMSSW_3_3_6
PackageManagement.pl --pack "IOPool/TFileAdaptor" --release CMSSW_3_3_6
/usr/bin/curl https://twiki.cern.ch/twiki/pub/Sandbox/CmsIOWork/ttreecache.patch | patch -p0
/usr/bin/curl https://twiki.cern.ch/twiki/pub/Sandbox/CmsIOWork/read_coalesce.patch | patch -p0
/usr/bin/curl https://twiki.cern.ch/twiki/pub/Sandbox/CmsIOWork/PATSimple.py.txt > src/PATSimple.py
scram b
cmsRun -j FNALImproved.fjr src/PATSimple.py

Tests run by Andrew Lahiff

Running at RAL the same test as described above by Ian Fisk.

Results

The FJR statistics with the read coalescing turned on is:

real    674.95s
user    625.93s
sys     8.60s
<storage-timing-summary>
 <counter-value subsystem='rfio' counter-name='read' num-operations='22416' num-successful-operations='22416' total-megabytes='805.87' total-msecs='28009.8' min-msecs='0.00384' max-msecs='215.024'/>
 <counter-value subsystem='rfio' counter-name='readv' num-operations='0' num-successful-operations='0' total-megabytes='0' total-msecs='0' min-msecs='0' max-msecs='0'/>
 <counter-value subsystem='tstoragefile' counter-name='read' num-operations='78246' num-successful-operations='78246' total-megabytes='1254.42' total-msecs='43910.6' min-msecs='0.005888' max-msecs='637.961'/>
 <counter-value subsystem='tstoragefile' counter-name='read-actual' num-operations='22416' num-successful-operations='22416' total-megabytes='805.87' total-msecs='28039' min-msecs='0.004864' max-msecs='215.024'/>
 <counter-value subsystem='tstoragefile' counter-name='read-async' num-operations='2' num-successful-operations='0' total-megabytes='0' total-msecs='0' min-msecs='0' max-msecs='0'/>
 <counter-value subsystem='tstoragefile' counter-name='read-via-cache' num-operations='59856' num-successful-operations='55830' total-megabytes='448.547' total-msecs='15273.5' min-msecs='0.007936' max-msecs='637.959'/>
 <counter-value subsystem='tstoragefile' counter-name='readv-actual' num-operations='29' num-successful-operations='29' total-megabytes='0' total-msecs='14957' min-msecs='50.2172' max-msecs='636.861'/>
</storage-timing-summary>

The original file has the following statistics:

real    725.73s
user    623.57s
sys     12.55s
<storage-timing-summary>
 <counter-value subsystem='rfio' counter-name='read' num-operations='63143' num-successful-operations='63143' total-megabytes='540.754' total-msecs='79710.9' min-msecs='0.00384' max-msecs='221.464'/>
 <counter-value subsystem='rfio' counter-name='readv' num-operations='0' num-successful-operations='0' total-megabytes='0' total-msecs='0' min-msecs='0' max-msecs='0'/>
 <counter-value subsystem='tstoragefile' counter-name='read' num-operations='63143' num-successful-operations='63143' total-megabytes='540.754' total-msecs='79864.2' min-msecs='0.005888' max-msecs='221.466'/>
 <counter-value subsystem='tstoragefile' counter-name='read-actual' num-operations='63143' num-successful-operations='63143' total-megabytes='540.754' total-msecs='79791.5' min-msecs='0.004864' max-msecs='221.465'/>
</storage-timing-summary>

Using lazy-download:

real    664.70s
user    622.25s
sys     8.90s
<storage-timing-summary>
 <counter-value subsystem='rfio' counter-name='read' num-operations='10' num-successful-operations='10' total-megabytes='1169.41' total-msecs='18010' min-msecs='800.178' max-msecs='2464.78'/>
 <counter-value subsystem='rfio' counter-name='readv' num-operations='0' num-successful-operations='0' total-megabytes='0' total-msecs='0' min-msecs='0' max-msecs='0'/>
 <counter-value subsystem='tstoragefile' counter-name='read' num-operations='63143' num-successful-operations='63143' total-megabytes='540.754' total-msecs='19509.6' min-msecs='0.00384' max-msecs='2474.41'/>
 <counter-value subsystem='tstoragefile' counter-name='read-actual' num-operations='63143' num-successful-operations='63143' total-megabytes='540.754' total-msecs='18756.1' min-msecs='0.002816' max-msecs='2470.94'/>
 <counter-value subsystem='tstoragefile' counter-name='read-async' num-operations='13806' num-successful-operations='13806' total-megabytes='1506.9' total-msecs='39.6974' min-msecs='0.001792' max-msecs='0.185856'/>
 <counter-value subsystem='tstoragefile' counter-name='read-prefetch-to-cache' num-operations='59856' num-successful-operations='59829' total-megabytes='480.968' total-msecs='427.438' min-msecs='0.002816' max-msecs='3.9849'/>
</storage-timing-summary>

Comments on RAL test from Brian Bockelman

  • Lazy-download delivered the file at 67MB/s average rate. This is not plausible per-slot overall at a site.
  • Assume each read takes exactly 5 msec of latency and we have 100/8 = 12MB/sec per slot. Then, the timings would be:
    • Original: 360s
    • Optimizations: 179s
    • Lazy-download: 97s
    • Optimizations in CERN case: 40s.
    • This makes it obvious we need to understand the difference between running at CERN and running at RAL. Is it possible Brian B originally ran on the wrong number of events?
Edit | Attach | Watch | Print version | History: r2 < r1 | Backlinks | Raw View | WYSIWYG | More topic actions
Topic revision: r2 - 2010-05-13 - BrianBockelman
 
    • Cern Search Icon Cern Search
    • TWiki Search Icon TWiki Search
    • Google Search Icon Google Search

    Main All webs login

This site is powered by the TWiki collaboration platform Powered by PerlCopyright &© 2008-2024 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.
or Ideas, requests, problems regarding TWiki? use Discourse or Send feedback