RStDStrategyGrid28Nov05

Introduction

Given the ability to generate and store files on the Grid, it is necessary to decide on a strategy for producing the whole chain of events. Examples are:
  • Run the full chain for blocks of 100 events, starting with generation, ending with ntuples and storing all intermediate files to an SE
  • Generate all the events desired, then run the Simulation on blocks of events in parallel jobs. Follow this by one of:
    • The digitization to ntupling chain in one job with blocks of events
    • Each stage in the chain in separate jobs with one block of events
    • All events digitized together, and then only one file for the rest of the chain.

The decision on how to do this depends on

  • How to fetch collections of input files
  • Sizes of input and choice of tape or disk for storage
  • Sizes of output and choice of tape or disk for storage
  • CPU time required at each stage

Below we examine the information at hand and come to a conclusion on how to proceed. It has been decided beforehand that it is convenient to generate all the events in one job and then, when that job completes, to send off parallel jobs of order 24 hours each to do the simulation, since it is known that this stage takes the most time and its output is worth storing because of the CPU resources needed to repeat it. The main discussion is therefore what to do about the rest of the chain.
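The block-size arithmetic behind the 24-hour parallel simulation jobs can be made explicit. Below is a minimal Python sketch, assuming a 10% safety margin (an assumption, not a site policy) and using the roughly 590 s/event simulation time implied by the tables later in this note; the helper names are hypothetical.

    # Minimal sketch: choose a block size for the parallel simulation jobs so
    # that each job stays under the 24 h CPU limit.
    CPU_LIMIT_H = 24.0          # maximum CPU time allowed per Grid job (hours)
    SAFETY_MARGIN = 0.9         # assumed headroom so a job does not hit the limit
    SEC_PER_EVENT_SIM = 590.0   # ~59004.2 s / 100 events, from the tables below

    def events_per_job(sec_per_event, cpu_limit_h=CPU_LIMIT_H, margin=SAFETY_MARGIN):
        """Largest number of events whose CPU time fits inside the job limit."""
        budget_s = cpu_limit_h * 3600.0 * margin
        return int(budget_s // sec_per_event)

    def number_of_jobs(total_events, sec_per_event):
        """Block size and number of parallel jobs needed for total_events."""
        block = events_per_job(sec_per_event)
        return block, -(-total_events // block)   # ceiling division

    if __name__ == "__main__":
        block, njobs = number_of_jobs(1000, SEC_PER_EVENT_SIM)
        print(f"{block} events/job -> {njobs} simulation jobs for 1000 events")

With these numbers a block of roughly 130 events would still fit under the limit, so the 100-event blocks used here are comfortably safe.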

Collection of information

The 5, 10 and 1000 event tests are used as the basis for the sizes and CPU for each stage of the analysis. Simulation was done in all cases; however, the full chain was run only for the 5 events, by running a separate job for each stage and storing the output to disk.

The table below indicates the sizes of the files in bytes, GB and MB for the Generation (G), Simulation (S), Digitisation (D), Reconstruction (R), AOD production (A) and Ntupling (N). The second number is the job number: only the simulation step had multiple jobs. The number of events is also indicated.

Stage/job        Output (B)   Output (GB)   Output (MB)   Events
G/0                20392389       0.02          20.39        1000
S/0                45970611       0.05          45.97         100
S/1                50984475       0.05          50.98         100
S/2                48995048       0.05          49.00         100
S/3                46701472       0.05          46.70         100
S/4                 9597974       0.01           9.60          19
S/5                48920212       0.05          48.92         100
S/6                49830866       0.05          49.83         100
S/7                46356582       0.05          46.36         100
S/8                52653942       0.05          52.65         100
S/9                49479588       0.05          49.48         100
S/10               50433684       0.05          50.43         100
Sum (S/0-S/10)    499924454       0.50         499.92        1019

5-event full-chain test:

Stage/job        Output (B)   Output (GB)   Output (MB)   Events
S/0                 2155152       0.00           2.16           5
D/0                 8019983       0.01           8.02           5
R/0                 2417653       0.00           2.42           5
A/0                  224711       0.00           0.22           5
N/0                    7134       0.00           0.01           5

Jobs S/4 and S/5 onwards are as listed; S/4 and S/10 will not be used in what follows because S/4 crashed and the S/10 output was lost in the storage step.

Discussion

It is interesting to use the numbers above to derive size/event and time/event estimates so that the amount of CPU and the size of the files can be examined. This has been done for all 900 events at each stage and for 100 events in the tables below (a short sketch reproducing these totals follows the tables). The idea is that one can either take all the events that were simulated and run them through the rest of the chain together, or continue with the 100-event chunks. In the first case the numbers and sizes for 900 events are of interest; in the second case, those for 100 events.

The tables also contain the total size and total CPU for all the simulated events, for two reasons. First, it is useful to compare these to the other stages. Second, if one wishes to process all the events together, one needs room for the input or some caching mechanism to hold individual files on demand. These simulation numbers are not included in any of the totals, since the goal is to determine the strategy for the rest of the chain.

Times and sizes for 900 events

Stage        TotSize (B)   Time (s)   Time (min)   Time (h)   TotSize (GB)   TotSize (MB)
S              387927360    531037.8     8850.63     147.51       0.39           387.93
D             1443596940     17730.0      295.50       4.93       1.44          1443.60
R              435177540     38520.0      642.00      10.70       0.44           435.18
A               40447980      8568.0      142.80       2.38       0.04            40.45
N                1284120       991.8       16.53       0.28       0.00             1.28
Tot (D->N)    1920506580     65809.8     1096.83      18.28       1.92          1920.51

Times and sizes for 100 events

Stage        TotSize (B)   Time (s)   Time (min)   Time (h)   TotSize (GB)   TotSize (MB)
S               43103040     59004.2      983.40      16.39       0.04            43.10
D              160399660      1970.0       32.83       0.55       0.16           160.40
R               48353060      4280.0       71.33       1.19       0.05            48.35
A                4494220       952.0       15.87       0.26       0.00             4.49
N                 142680       110.2        1.84       0.03       0.00             0.14
Tot (D->N)     213389620      7312.2      121.87       2.03       0.21           213.39
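
The totals in these two tables follow from simple per-event scaling of the 5-event test sizes. The sketch below is a rough illustration rather than the actual bookkeeping used; the per-event CPU times are those implied by the tables above (they are not quoted independently in this note), and the function names are illustrative.

    # Reproduce the table totals by per-event scaling.  Per-event sizes come
    # from the 5-event full-chain test; per-event times are derived from the
    # 900-event table.
    SIZE_5EVT_B = {        # output size in bytes for the 5-event test
        "S": 2155152, "D": 8019983, "R": 2417653, "A": 224711, "N": 7134,
    }
    SEC_PER_EVENT = {      # CPU seconds/event implied by the 900-event table
        "S": 590.042, "D": 19.7, "R": 42.8, "A": 9.52, "N": 1.102,
    }

    def scale(n_events):
        """Return {stage: (total bytes, total CPU seconds)} for n_events."""
        return {stage: (size / 5.0 * n_events, SEC_PER_EVENT[stage] * n_events)
                for stage, size in SIZE_5EVT_B.items()}

    if __name__ == "__main__":
        for n in (900, 100):
            totals = scale(n)
            d_to_n_size = sum(s for stage, (s, _) in totals.items() if stage != "S")
            d_to_n_time = sum(t for stage, (_, t) in totals.items() if stage != "S")
            print(f"{n} events: D->N size {d_to_n_size / 1e9:.2f} GB, "
                  f"CPU {d_to_n_time / 3600.0:.2f} h")

Running this gives 1.92 GB and 18.28 h for 900 events, and 0.21 GB and 2.03 h for 100 events, matching the Tot (D->N) rows above.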

As already mentioned, a job should not exceed 24 hours of CPU. This limit is respected whether one runs all 900 events at once or runs parallel 100-event jobs, and the parallel jobs are not too short either.

The input and output sizes are a consideration. The software to handle collections of files is not yet ready, so processing all events in a single job would require 1.92 GB to be available for input and output and would take about 18 hours of CPU time.

Advantages:

  • Only one job is handled and submitted
  • There is only one output file at each stage (although this file could exceed 2 GB)

Disadvantages:

  • There must be room for all input and output files: it is not clear how much space is available
  • If any event causes a problem, then none of the output from that point onwards is produced

The total size needed to process is probably less than 2 GB, and each output file is also less than 2 GB. Files greater than 250 MB could be stored to tape, although tape usage is discouraged because of the poor performance arising from the earlier storage of many small files.
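
A minimal sketch of this placement rule, using the 250 MB and 2 GB figures quoted above; the helper and its prefer_disk switch are illustrative assumptions rather than an existing tool.

    # Suggest disk or tape for an output file, following the thresholds
    # discussed in the text: files above 250 MB could go to tape, but tape is
    # avoided while small-file performance is poor, and anything at the 2 GB
    # limit means the job should be split.
    MB = 1024 ** 2
    GB = 1024 ** 3
    TAPE_THRESHOLD = 250 * MB   # above this a file could be stored to tape
    FILE_SIZE_LIMIT = 2 * GB    # limit discussed for a single output file

    def placement(size_bytes, prefer_disk=True):
        """Return 'disk' or 'tape' for a file of the given size."""
        if size_bytes >= FILE_SIZE_LIMIT:
            raise ValueError("output file would reach the 2 GB limit; split the job")
        if size_bytes > TAPE_THRESHOLD and not prefer_disk:
            return "tape"
        return "disk"           # tape discouraged while performance is poor

    if __name__ == "__main__":
        # a 100-event simulation file and the 900-event digitisation file
        for size in (45970611, 1443596940):
            print(size, "->", placement(size))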

Conclusions

At this point, the disk stores at ScotGrid will be used, since there is local expertise to monitor the disk space and any issues that arise, the disk at RAL is exhausted, and CASTOR has performance issues.

Splitting the work into many jobs would create a number of small files, increasing the data-handling and job-handling difficulties. For a generation of this size it is therefore convenient to try to run the single 18-hour job.


Major updates:
-- RichardStDenis - 28 Nov 2005

