Notes and Tests
Test from Ian Fisk
- Initialize an environment for CMSSW 3_3_6:
cmsrel CMSSW_3_3_6; cd CMSSW_3_3_6; cmsenv
- Copy in the config file:
cp /afs/cern.ch/user/f/fisk/public/CERNResults/test_cfg.py $PWD/src/
Alternatively, you can download the file attached to the twiki: curl https://twiki.cern.ch/twiki/pub/Main/CmsIOWork/test_cfg.py.txt > src/test_cfg.py
- Check out the analysis code:
cvs co -r V03-00-00 -d src/CMS2 UserCode/JRibnik/CMS2
- Apply the patches above to TTreeCache and TStorageFactoryFile.cxx
- Build the code (I find building this code takes about 100MB of space; make sure you have enough quota):
scram b
- Edit test_cfg.py to turn on the AdaptorConfig, turn on stats, and set the cacheHint to "application-only" (see https://twiki.cern.ch/twiki/bin/viewauth/CMS/SWGuideTFileAdaptor).
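The relevant stanza inside test_cfg.py looks roughly like the following (a sketch based on the SWGuideTFileAdaptor twiki; verify the exact parameter names against your release):

```python
# Sketch of the AdaptorConfig service settings described above; parameter
# names are taken from the SWGuideTFileAdaptor twiki and may vary by release.
# 'process' is the cms.Process object already defined in test_cfg.py.
process.AdaptorConfig = cms.Service("AdaptorConfig",
    stats = cms.untracked.bool(True),                     # write TFileAdaptor statistics into the FJR
    cacheHint = cms.untracked.string("application-only"), # let the application (TTreeCache) manage caching
)
```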
- Run the script:
cmsRun -j CERNImproved.fjr src/test_cfg.py > CERNImproved.log 2>&1 &
- Performance may be monitored using the script in
/afs/cern.ch/user/f/fisk/public/CERNResults/measuremon.sh
If you don't see messages like this:
Using my custom readv impl for 1346 buffers
Number of reads performed 506
in your stdout, then you have not successfully turned on both TTreeCache and read coalescing!
Results from running on lxplus
Here's the runtime information with TTreeCache and read coalescing enabled:
real 11m53.148s
user 10m5.383s
sys 0m17.628s
A subset of the TFileAdaptor statistics relevant to reads:
<storage-timing-summary>
<counter-value subsystem='rfio' counter-name='open' num-operations='2' num-successful-operations='2' total-megabytes='0' total-msecs='2799.22' min-msecs='1275.6' max-msecs='1523.62'/>
<counter-value subsystem='rfio' counter-name='read' num-operations='22416' num-successful-operations='22416' total-megabytes='805.87' total-msecs='71747.1' min-msecs='0.00512' max-msecs='677.673'/>
<counter-value subsystem='rfio' counter-name='readv' num-operations='0' num-successful-operations='0' total-megabytes='0' total-msecs='0' min-msecs='0' max-msecs='0'/>
<counter-value subsystem='rfio' counter-name='stagein' num-operations='2' num-successful-operations='2' total-megabytes='0' total-msecs='278.703' min-msecs='116.939' max-msecs='161.764'/>
<counter-value subsystem='tstoragefile' counter-name='construct' num-operations='3' num-successful-operations='3' total-megabytes='0' total-msecs='9026.48' min-msecs='95.465' max-msecs='7306.93'/>
<counter-value subsystem='tstoragefile' counter-name='read' num-operations='78246' num-successful-operations='78246' total-megabytes='1254.42' total-msecs='115435' min-msecs='0.00896' max-msecs='3249.42'/>
<counter-value subsystem='tstoragefile' counter-name='read-actual' num-operations='22416' num-successful-operations='22416' total-megabytes='805.87' total-msecs='71803' min-msecs='0.006912' max-msecs='677.676'/>
<counter-value subsystem='tstoragefile' counter-name='read-via-cache' num-operations='59856' num-successful-operations='55830' total-megabytes='448.547' total-msecs='42581.5' min-msecs='0.011008' max-msecs='3249.42'/>
<counter-value subsystem='tstoragefile' counter-name='readv-actual' num-operations='29' num-successful-operations='29' total-megabytes='0' total-msecs='42225.5' min-msecs='132.351' max-msecs='3247.55'/>
</storage-timing-summary>
Here are the stats and timing information for the original (unpatched) code:
real 13m58.683s
user 10m12.768s
sys 0m25.069s
<storage-timing-summary>
<counter-value subsystem='rfio' counter-name='open' num-operations='2' num-successful-operations='2' total-megabytes='0' total-msecs='6581.33' min-msecs='2614.11' max-msecs='3967.23'/>
<counter-value subsystem='rfio' counter-name='read' num-operations='63143' num-successful-operations='63143' total-megabytes='540.754' total-msecs='200315' min-msecs='0.002816' max-msecs='422.642'/>
<counter-value subsystem='rfio' counter-name='readv' num-operations='0' num-successful-operations='0' total-megabytes='0' total-msecs='0' min-msecs='0' max-msecs='0'/>
<counter-value subsystem='rfio' counter-name='stagein' num-operations='2' num-successful-operations='2' total-megabytes='0' total-msecs='276.838' min-msecs='122.541' max-msecs='154.297'/>
<counter-value subsystem='tstoragefile' counter-name='construct' num-operations='3' num-successful-operations='3' total-megabytes='0' total-msecs='11352.9' min-msecs='15.905' max-msecs='7327.45'/>
<counter-value subsystem='tstoragefile' counter-name='read' num-operations='63143' num-successful-operations='63143' total-megabytes='540.754' total-msecs='200492' min-msecs='0.004864' max-msecs='422.646'/>
<counter-value subsystem='tstoragefile' counter-name='read-actual' num-operations='63143' num-successful-operations='63143' total-megabytes='540.754' total-msecs='200406' min-msecs='0.00384' max-msecs='422.644'/>
</storage-timing-summary>
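The counter-value elements are plain XML attributes, so comparing two runs is mechanical. A minimal sketch, using the two 'read-actual' counters copied from the summaries above (abridged to the attributes used):

```python
import xml.etree.ElementTree as ET

# 'read-actual' counters from the patched run (TTreeCache + read coalescing)
# and the unpatched baseline, copied from the FJR summaries above.
improved = ("<counter-value subsystem='tstoragefile' counter-name='read-actual' "
            "num-operations='22416' total-megabytes='805.87'/>")
original = ("<counter-value subsystem='tstoragefile' counter-name='read-actual' "
            "num-operations='63143' total-megabytes='540.754'/>")

def parse_counter(fragment):
    """Return (num-operations, total-megabytes) from a counter-value element."""
    node = ET.fromstring(fragment)
    return int(node.get('num-operations')), float(node.get('total-megabytes'))

ops_new, mb_new = parse_counter(improved)
ops_old, mb_old = parse_counter(original)
print(f"physical reads: {ops_old} -> {ops_new} ({ops_old / ops_new:.1f}x fewer)")
print(f"MB read: {mb_old} -> {mb_new} (coalescing reads extra bytes)")
```

Note the trade-off visible in the data: the patched run performs roughly 2.8x fewer physical reads but transfers more megabytes, because coalescing pulls in bytes between the requested ranges.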
Quick-run script
cmsrel CMSSW_3_3_6; cd CMSSW_3_3_6; cmsenv
PackageManagement.pl --pack "IOPool/Input" --release CMSSW_3_3_6
PackageManagement.pl --pack "IOPool/TFileAdaptor" --release CMSSW_3_3_6
/usr/bin/curl https://twiki.cern.ch/twiki/pub/Sandbox/CmsIOWork/ttreecache.patch | patch -p0
/usr/bin/curl https://twiki.cern.ch/twiki/pub/Sandbox/CmsIOWork/read_coalesce.patch | patch -p0
/usr/bin/curl https://twiki.cern.ch/twiki/pub/Sandbox/CmsIOWork/test_cfg.py.txt > src/test_cfg.py
cvs co -r V03-00-00 -d src/CMS2 UserCode/JRibnik/CMS2
scram b
cmsRun -j CERNImproved.fjr src/test_cfg.py
Test from Leonardo Sala
The following test is an example of a CPU-bound PAT tuple creator. Its interesting aspects include:
- It demonstrates that the changes impose no significant penalty on CPU-bound processes.
- Unlike the test above, the number of I/O reads goes down by only a factor of 2.
- It is a case where read coalescing causes a significant increase in the number of MB read.
To run this on your own:
- Initialize an environment for CMSSW 3_3_6:
cmsrel CMSSW_3_3_6; cd CMSSW_3_3_6; cmsenv
- Copy in the config file:
curl https://twiki.cern.ch/twiki/pub/Main/CmsIOWork/PATSimple.py.txt > src/PATSimple.py
- Apply the patches above to RootTree.cxx for TTreeCache and TStorageFactoryFile.cxx for read coalescing.
- Build the code:
scram b
- Edit PATSimple.py to turn on the AdaptorConfig, turn on stats, and set the cacheHint to "application-only" (see https://twiki.cern.ch/twiki/bin/viewauth/CMS/SWGuideTFileAdaptor).
- Run the script:
cmsRun -j FNALImproved.fjr src/PATSimple.py > FNALImproved.log 2>&1 &
Results
This test was run at FNAL. Overall, the number of seconds spent reading files decreased from 32s to 29s when turning on the optimizations. Wallclock time decreased by approximately 17s (the original job's walltime was 10m21s versus 10m5s for the optimized job).
The FJR statistics with read coalescing turned on are:
real 10m4.795s
user 9m25.875s
sys 0m6.565s
<storage-timing-summary>
<counter-value subsystem='dcap' counter-name='open' num-operations='1' num-successful-operations='1' total-megabytes='0' total-msecs='178.99' min-msecs='178.99' max-msecs='178.99'/>
<counter-value subsystem='dcap' counter-name='read' num-operations='13174' num-successful-operations='13174' total-megabytes='317.267' total-msecs='16633' min-msecs='0.000768' max-msecs='104.784'/>
<counter-value subsystem='dcap' counter-name='readv' num-operations='0' num-successful-operations='0' total-megabytes='0' total-msecs='0' min-msecs='0' max-msecs='0'/>
<counter-value subsystem='dcap' counter-name='stagein' num-operations='1' num-successful-operations='1' total-megabytes='0' total-msecs='202.402' min-msecs='202.402' max-msecs='202.402'/>
<counter-value subsystem='tstoragefile' counter-name='construct' num-operations='2' num-successful-operations='2' total-megabytes='0' total-msecs='4486.92' min-msecs='29.503' max-msecs='4457.42'/>
<counter-value subsystem='tstoragefile' counter-name='read' num-operations='38571' num-successful-operations='38571' total-megabytes='488.867' total-msecs='29343.8' min-msecs='0.00384' max-msecs='1685.15'/>
<counter-value subsystem='tstoragefile' counter-name='read-actual' num-operations='13174' num-successful-operations='13174' total-megabytes='317.267' total-msecs='16655.9' min-msecs='0.002816' max-msecs='104.787'/>
<counter-value subsystem='tstoragefile' counter-name='read-via-cache' num-operations='26680' num-successful-operations='25397' total-megabytes='171.6' total-msecs='12616.8' min-msecs='0.007936' max-msecs='1685.15'/>
<counter-value subsystem='tstoragefile' counter-name='readv-actual' num-operations='10' num-successful-operations='10' total-megabytes='0' total-msecs='12232.9' min-msecs='276.526' max-msecs='1681'/>
</storage-timing-summary>
The original (unpatched) code has the following statistics:
real 10m21.438s
user 9m30.674s
sys 0m14.617s
<storage-timing-summary>
<counter-value subsystem='dcap' counter-name='read' num-operations='13174' num-successful-operations='13174' total-megabytes='317.267' total-msecs='16633' min-msecs='0.000768' max-msecs='104.784'/>
<counter-value subsystem='dcap' counter-name='readv' num-operations='0' num-successful-operations='0' total-megabytes='0' total-msecs='0' min-msecs='0' max-msecs='0'/>
<counter-value subsystem='dcap' counter-name='stagein' num-operations='1' num-successful-operations='1' total-megabytes='0' total-msecs='202.402' min-msecs='202.402' max-msecs='202.402'/>
<counter-value subsystem='tstoragefile' counter-name='construct' num-operations='2' num-successful-operations='2' total-megabytes='0' total-msecs='4486.92' min-msecs='29.503' max-msecs='4457.42'/>
<counter-value subsystem='tstoragefile' counter-name='read' num-operations='38571' num-successful-operations='38571' total-megabytes='488.867' total-msecs='29343.8' min-msecs='0.00384' max-msecs='1685.15'/>
<counter-value subsystem='tstoragefile' counter-name='read-actual' num-operations='13174' num-successful-operations='13174' total-megabytes='317.267' total-msecs='16655.9' min-msecs='0.002816' max-msecs='104.787'/>
<counter-value subsystem='tstoragefile' counter-name='read-via-cache' num-operations='26680' num-successful-operations='25397' total-megabytes='171.6' total-msecs='12616.8' min-msecs='0.007936' max-msecs='1685.15'/>
<counter-value subsystem='tstoragefile' counter-name='readv-actual' num-operations='10' num-successful-operations='10' total-megabytes='0' total-msecs='12232.9' min-msecs='276.526' max-msecs='1681'/>
</storage-timing-summary>
Quick-run script
cmsrel CMSSW_3_3_6; cd CMSSW_3_3_6; cmsenv
PackageManagement.pl --pack "IOPool/Input" --release CMSSW_3_3_6
PackageManagement.pl --pack "IOPool/TFileAdaptor" --release CMSSW_3_3_6
/usr/bin/curl https://twiki.cern.ch/twiki/pub/Sandbox/CmsIOWork/ttreecache.patch | patch -p0
/usr/bin/curl https://twiki.cern.ch/twiki/pub/Sandbox/CmsIOWork/read_coalesce.patch | patch -p0
/usr/bin/curl https://twiki.cern.ch/twiki/pub/Sandbox/CmsIOWork/PATSimple.py.txt > src/PATSimple.py
scram b
cmsRun -j FNALImproved.fjr src/PATSimple.py
Tests run by Andrew Lahiff
The same test described above by Ian Fisk, run at RAL.
Results
The FJR statistics with read coalescing turned on are:
real 674.95s
user 625.93s
sys 8.60s
<storage-timing-summary>
<counter-value subsystem='rfio' counter-name='read' num-operations='22416' num-successful-operations='22416' total-megabytes='805.87' total-msecs='28009.8' min-msecs='0.00384' max-msecs='215.024'/>
<counter-value subsystem='rfio' counter-name='readv' num-operations='0' num-successful-operations='0' total-megabytes='0' total-msecs='0' min-msecs='0' max-msecs='0'/>
<counter-value subsystem='tstoragefile' counter-name='read' num-operations='78246' num-successful-operations='78246' total-megabytes='1254.42' total-msecs='43910.6' min-msecs='0.005888' max-msecs='637.961'/>
<counter-value subsystem='tstoragefile' counter-name='read-actual' num-operations='22416' num-successful-operations='22416' total-megabytes='805.87' total-msecs='28039' min-msecs='0.004864' max-msecs='215.024'/>
<counter-value subsystem='tstoragefile' counter-name='read-async' num-operations='2' num-successful-operations='0' total-megabytes='0' total-msecs='0' min-msecs='0' max-msecs='0'/>
<counter-value subsystem='tstoragefile' counter-name='read-via-cache' num-operations='59856' num-successful-operations='55830' total-megabytes='448.547' total-msecs='15273.5' min-msecs='0.007936' max-msecs='637.959'/>
<counter-value subsystem='tstoragefile' counter-name='readv-actual' num-operations='29' num-successful-operations='29' total-megabytes='0' total-msecs='14957' min-msecs='50.2172' max-msecs='636.861'/>
</storage-timing-summary>
The original (unpatched) code has the following statistics:
real 725.73s
user 623.57s
sys 12.55s
<storage-timing-summary>
<counter-value subsystem='rfio' counter-name='read' num-operations='63143' num-successful-operations='63143' total-megabytes='540.754' total-msecs='79710.9' min-msecs='0.00384' max-msecs='221.464'/>
<counter-value subsystem='rfio' counter-name='readv' num-operations='0' num-successful-operations='0' total-megabytes='0' total-msecs='0' min-msecs='0' max-msecs='0'/>
<counter-value subsystem='tstoragefile' counter-name='read' num-operations='63143' num-successful-operations='63143' total-megabytes='540.754' total-msecs='79864.2' min-msecs='0.005888' max-msecs='221.466'/>
<counter-value subsystem='tstoragefile' counter-name='read-actual' num-operations='63143' num-successful-operations='63143' total-megabytes='540.754' total-msecs='79791.5' min-msecs='0.004864' max-msecs='221.465'/>
</storage-timing-summary>
Using lazy-download:
real 664.70s
user 622.25s
sys 8.90s
<storage-timing-summary>
<counter-value subsystem='rfio' counter-name='read' num-operations='10' num-successful-operations='10' total-megabytes='1169.41' total-msecs='18010' min-msecs='800.178' max-msecs='2464.78'/>
<counter-value subsystem='rfio' counter-name='readv' num-operations='0' num-successful-operations='0' total-megabytes='0' total-msecs='0' min-msecs='0' max-msecs='0'/>
<counter-value subsystem='tstoragefile' counter-name='read' num-operations='63143' num-successful-operations='63143' total-megabytes='540.754' total-msecs='19509.6' min-msecs='0.00384' max-msecs='2474.41'/>
<counter-value subsystem='tstoragefile' counter-name='read-actual' num-operations='63143' num-successful-operations='63143' total-megabytes='540.754' total-msecs='18756.1' min-msecs='0.002816' max-msecs='2470.94'/>
<counter-value subsystem='tstoragefile' counter-name='read-async' num-operations='13806' num-successful-operations='13806' total-megabytes='1506.9' total-msecs='39.6974' min-msecs='0.001792' max-msecs='0.185856'/>
<counter-value subsystem='tstoragefile' counter-name='read-prefetch-to-cache' num-operations='59856' num-successful-operations='59829' total-megabytes='480.968' total-msecs='427.438' min-msecs='0.002816' max-msecs='3.9849'/>
</storage-timing-summary>
Comments on RAL test from Brian Bockelman
- Lazy-download delivered the file at an average rate of 67MB/s. Sustaining that rate per slot across all slots at a site is not plausible.
- Assume each read costs exactly 5 msec of latency and bandwidth is 100 Mbps / 8 ≈ 12 MB/s per slot. Then the estimated timings would be:
- Original: 360s
- Optimizations: 179s
- Lazy-download: 97s
- Optimizations in CERN case: 40s.
- This makes it obvious we need to understand the difference between running at CERN and running at RAL. Is it possible Brian B originally ran on the wrong number of events?
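The per-scenario estimates above can be reproduced with a short sketch (assuming the 5 ms/read latency and ~12 MB/s bandwidth stated above; results are truncated to whole seconds, and the CERN case, which involves vectored reads, is not modeled):

```python
# Back-of-envelope latency model from the comment above: each read costs a
# fixed 5 ms, plus transfer time at ~12 MB/s per slot (100 Mbps / 8).
LATENCY_S = 0.005
BANDWIDTH_MB_PER_S = 12.0

def estimated_time(num_reads, total_mb):
    """Estimated seconds spent on I/O for a given read pattern."""
    return num_reads * LATENCY_S + total_mb / BANDWIDTH_MB_PER_S

# (num-operations, total-megabytes) taken from the rfio 'read' counters
# in the RAL results above.
scenarios = [
    ("original",      63143, 540.754),
    ("optimizations", 22416, 805.87),
    ("lazy-download", 10,    1169.41),
]
for name, reads, mb in scenarios:
    print(f"{name}: {int(estimated_time(reads, mb))}s")
```

Running this reproduces the 360s, 179s, and 97s figures: the original run is dominated by per-read latency (63143 reads), while lazy-download is almost pure transfer time.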