Milano, July 1998

MONARC note n. 1/98 - version 1.1

The analysis model and the optimization of the geographical distribution of computing resources:

a strong connection

M. Campanella, L. Perini

INFN and University of Milano

 

ABSTRACT

This document contains a preliminary investigation of possible computing analysis models for the ATLAS (and LHC) experiments. An evaluation of the CPU, storage and I/O requirements is sketched in a top-down approach, starting from the CTP model. As a result, some models prove to be unrealistic, and the items where more data and investigation are needed in order to proceed are underlined.

1. Introduction.

The work documented here has been done as a first preliminary step for the common project with other LHC experiments about distributed computing and Regional Centers. Starting from the analysis model sketched in the CTP, the purpose was to better define the input parameters and their range of variability for a detailed simulation.

In the process of quantifying the range of possible analysis models based on the CTP ideas, we realized that different, yet allowed and reasonable, models could lead to very different conclusions with respect to the convenience and efficiency of decentralized models (Regional Centers and local "ntuple" analysis in the institutes) and of the analysis system as a whole. We also realized the need for a detailed simulation and the central role of the scheduling requirements.

While these facts are not really new or unexpected, they are relevant for the definition of a Computing Model project, as they underline the need for a joint tuning of the analysis model and of the level of geographical distribution.

In the first section we list our assumptions together with a brief description of the terminology. The second one is devoted to the different analysis models. The third draws some conclusions and outlines the main directions to follow.

The results reported are meant only to be a starting point for broader discussion and further work.

2. Overall assumptions

Most of the input derives from the CTP model; we assumed a simple and naive scheduling model, based on the requirement that the relevant analysis phases be completed within some fixed amount of time. This is a natural consequence of our "back-of-the-envelope" computational approach and, while it creates unrealistic amounts of idle time and clearly signals the need for a real simulation, it illustrates well the strong link between the analysis model and the optimization of the geographical distribution.

The same sketchy approach to storage and I/O requirements is enough to pinpoint various interesting areas and weaknesses in some models.

Note that no assumption is made in the following about the (distributed) object oriented database or its performance; in some way, it is assumed to behave with no overhead. This is a non-trivial assumption and, as we will see, the behavior of the database is of utmost importance in evaluating the whole system.

Last, but not least, this investigation examines only the steady state of the analysis process. The period of time elapsing between the first event and the start of the steady state is not (yet) considered here. Nevertheless, the transient period is of utmost importance and deserves a careful investigation.

2.1 Terminology

A few comments on the terminology used in the report:

3. Analysis Models.

Three main phases, with a simple but tight fixed scheduling, are assumed:

  1. All the 10^9 events are analyzed once per month. The events are selected using 0.25 SpecInt95*sec/ev. The sample is reduced by a factor of 100, to 10^7 events. The requested response time is 3 days. This phase is done once for each analysis.
  2. The specific computation needed for each different analysis is done and the results are saved. The CPU time is evaluated at 25 SpecInt95*sec/ev. Response time: 1 day. Each analysis group accomplishes this phase immediately after the previous one is completed. Thus, at any given time, only 1 analysis group is active in this phase.
  3. Each physicist analyses his sample "interactively" with a response time of 4 hours; the CPU needed is 2.5 SpecInt95*sec/ev.

Note that the scheduling model assumes that the response times for phases 1 and 2 are 3 days and 1 day respectively only when the phases are started according to a predetermined sequential order: there are analyses that start first and end first and others that start and end later; the "waiting time" DT between the first and last analysis depends on the level of distribution of a model and will be given for the various models considered.

Such a strong scheduling policy is probably not what the collaborations want, and in any case it would be difficult to implement; it is however a good comparison point as far as required resources are concerned. In the initial stage it would be utterly impossible.
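As a cross-check of the figures used in Table 1 and in Appendix A, the sketch below (Python, with illustrative names; the 100 SpecInt95 machine size is the one assumed in Appendix A) shows how the installed capacity of each phase follows from the per-event cost, the number of events and the requested response time; the appendix then multiplies these single-stream figures by the number of concurrent selections or physicists.

    # Back-of-the-envelope CPU sizing: capacity = cost/event * N_events / response time.
    # Figures from Table 1; variable and phase names are illustrative, not from the note.

    PHASES = {
        # phase: (SpecInt95*sec per event, events to process, response time in seconds)
        "phase-1": (0.25, 1e9, 3e5),    # 3 days  ~ 3*10^5 s
        "phase-2": (25.0, 1e7, 1e5),    # 1 day   ~ 10^5 s
        "phase-3": (2.5,  1e7, 1.5e4),  # 4 hours ~ 1.5*10^4 s
    }

    MACHINE_SI95 = 100.0  # assumed power of one machine, as in Appendix A

    for name, (si95_per_ev, n_events, t_resp) in PHASES.items():
        capacity = si95_per_ev * n_events / t_resp  # SpecInt95 of installed power
        machines = capacity / MACHINE_SI95          # number of 100-SpecInt95 machines
        print(f"{name}: {capacity:8.0f} SpecInt95 (~{machines:6.0f} machines)")
    # phase-1 gives ~833 SpecInt95, rounded to 800 in the text; phase-2 ~2500; phase-3 ~16700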

Table 1 summarizes the relevant assumptions.

 

Phase | Ev. size [KB] | Event number reduction | Event size reduction | Events/year | Requested response time | SpecInt95*sec/event | Total ev. size [GB]
RAW   | 1000 |      |      | 10^9        |            |       | 10^6
ESD   | 100  |      | 10%  | 10^9        | ON LINE    | 250   | 10^5
1     | 100  | 1%   |      | 10^7        | 3 DAYS (1) | 0.25  | 10^3
2     | 10   |      | 10%  | 10^7        | 1 DAY (1)  | 25    | 10^2
O     | 10   | 10%  |      | 10^6        | < 1 DAY    | nihil | 10
3     | 10   |      |      | 10^7 - 10^6 | 4 HOURS    | 2.5   | 100 - 10

(1) Only one analysis active at the same time

Table 1 - Main model parameters

 

3.1 Different versions of the analysis phases

The analysis model is not completely specified by the phases sketched above, and some additional assumptions may be made:

For the CPU aspects, only the variations considered in the last 3 points will be taken into account, giving rise to models A, A-R and O.
As far as the volume of data is concerned, we consider AOD as the input and output format of phase-2 and the "ntuple" as the input format for phase-3.
The "ntuple" is simply a subsample of the analysis-specific AOD produced in phase-2, so no extra CPU is required to extract it from the DB.

3.2 Levels of distribution

We compare 4 scenarios: CENTRAL, RC, LOCAL-3 and RC-LOCAL-3.

See fig. 1 for a flow diagram of the various possibilities.

3.3 Storage and network considerations

In all the scenarios considered a copy of the ESD (100TB/year) is kept in each of the available centers. Each analysis is done on 100GB of AOD (=10^7 ev x 10^4 Bytes) in models A, A-R; in model O only 10GB are used.

Table 2 reports the ratio between the time spent in reading an event and the total time per event (sum of reading, processing and the usually negligible writing time). It is obvious that, to optimize the resources, the ratio has to be small. The rather crude approximation nevertheless shows that a problem exists in phase 1.

Maximum care has to be taken that reading and processing proceed in parallel and that the I/O time to read and write an event is not greater than the time to process it.

Note also that, although the initial event pool corresponds to one year, the storage to keep the output from the various phases is computed for one month only. It is unrealistically assumed that after one month the data is scratched and the same storage may be reused. Multiplying the result by a factor of three could be more realistic.

Throughout this exercise, data is supposed to be distributed evenly on the servers, so that access timing and bandwidth can be computed easily. This is also an optimistic hypothesis that has to be modified in a more detailed simulation.
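The entries of Table 2 follow directly from the event size, the assumed 20 MB/s read rate and the per-event CPU cost; a minimal sketch under these assumptions (the 100 SpecInt95 CPU is the one of the table header; the row labels are illustrative):

    # Fraction of time spent reading, R/(R+P), as in Table 2.
    # R = event size / I/O rate, P = per-event CPU cost / CPU power.

    IO_RATE = 20e6    # bytes/s, the 20 MB/s assumed in Table 2
    CPU_SI95 = 100.0  # SpecInt95 of one CPU, as in the table header

    ROWS = [
        # (label, event size in bytes, SpecInt95*sec per event)
        ("ESD reconstruction", 100e3, 250.0),
        ("phase-1",            100e3, 0.25),
        ("phase-2",             10e3, 25.0),
        ("phase-3",             10e3, 2.5),
    ]

    for label, size, si95 in ROWS:
        read_time = size / IO_RATE   # R, seconds
        proc_time = si95 / CPU_SI95  # P, seconds
        frac = read_time / (read_time + proc_time)
        print(f"{label:18s} R={read_time:.4f}s P={proc_time:.4f}s R/(R+P)={100*frac:5.1f}%")
    # phase-1 stands out, with ~66% of the time spent in reading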

 

 

Phase | Ev. size [KB] | Read time R [s] at 20 MB/s | SpecInt95*sec/event | Time P [s] to process one event on a 100 SpecInt95 CPU | % of time spent in reading R/(R+P)
ESD   | 100 | 0.005  | 250   | 2.5    | 0.2
1     | 100 | 0.005  | 0.25  | 0.0025 | 66
2     | 10  | 0.0005 | 25    | 0.25   | 0.2
O     | 10  |        | nihil |        |
3     | 10  | 0.0005 | 2.5   | 0.025  | 2

Table 2: comparison between the time needed to read one event and the time to process it

As far as the network is concerned, the only movement of data over the WAN foreseen during the analysis is the transfer of "ntuples" before the local phase-3. Assuming 30 Mbit/sec lines used at a 1/3 occupation rate, the transfer of 10 GB would take (10x10^9 x 8)/(30x10^6 / 3) = 8x10^3 seconds, i.e. a couple of hours; quite an acceptable time. However, the number of concurrent transfers is much bigger than one. Also, a large number of concurrent remote X sessions may require significant bandwidth.

In the LOCAL-3 scenario, each month the 25 physicists of an analysis group transfer 10 GB each within 1 day (the maximum time allowed to avoid superposition with other analysis groups and to keep in step with phase-2). Allowing for a maximum transfer time of 1 day at 1/3 line occupancy, an out-capacity of (25x10x10^9x8x3)/10^5 = 60 Mbit/sec is needed. Transferring the AOD instead would require 600 Mbit/sec, an amount already somewhat critical.

In the RC-LOCAL-3 scenario, for each center the 12.5 physicists of an analysis group transfer 10 GB each: an out-capacity of 30 Mbit/sec is required (300 for AOD transfer). The line to a single institute should allow for the concurrent transfers of, e.g., 4 physicists, requiring a capacity of 10 Mbit/sec for "ntuple" transfers and 100 Mbit/sec for AOD.

Thus the local scenarios, requiring the transfer of 10^7 events in "ntuple" format, are well within the expected network performance; the performance required for AOD transfer is less obvious, albeit still reasonable.
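The out-capacity estimates above reduce to a single expression: data volume in bits, divided by the allowed transfer time and by the usable fraction of the line. A short sketch under the assumptions of this section (function and variable names are illustrative):

    # WAN out-capacity estimates of section 3.3 (1/3 line occupancy assumed in the note).

    def out_capacity_mbps(n_transfers, gbytes_each, allowed_time_s, usable_fraction=1/3):
        bits = n_transfers * gbytes_each * 1e9 * 8
        return bits / allowed_time_s / usable_fraction / 1e6  # Mbit/s of nominal capacity

    # LOCAL-3: 25 physicists, 10 GB of "ntuples" each, within 1 day (~10^5 s)
    print(out_capacity_mbps(25, 10, 1e5))    # -> 60 Mbit/s (600 Mbit/s for 100 GB of AOD)

    # RC-LOCAL-3: 12.5 physicists per center, 10 GB each, within 1 day
    print(out_capacity_mbps(12.5, 10, 1e5))  # -> 30 Mbit/s (300 Mbit/s for AOD)

    # A single 10 GB transfer on a 30 Mbit/s line at 1/3 occupancy:
    print(10e9 * 8 / (30e6 / 3))             # -> 8000 s, i.e. a couple of hours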

3.4 Results

The detailed computation in each case is reported in appendix A. The results of the models are summarized in the following and shown in the figures 2, 3 and 4.

At this stage of the study only preliminary results may be reached about the I/O and networking aspects of phases 1 and 2. Note that RC models have greater flexibility: the variation (DT) in waiting time between the first and last scheduled analysis is 30 days both in the central and in the RC models, but the addition of a small amount of extra CPU in phase 1 easily brings the central DT to 20 days and the RC DT to only 5 days. This kind of RC advantage has a cost; however, the results summarized below show that models exist for which the extra cost can be kept at the level of a low fraction of the total one.

3.6 Model summary

MODEL A. For the analysis model A, the CENTRAL scenario is convenient, and so is RC, provided the use of the same machines for phases 2 and 3 can be scheduled. LOCAL-3 is disfavored; RC-LOCAL-3 can be excluded as too costly.

Model A-R. For the analysis model A-R, the CENTRAL and RC scenarios are both convenient.
Note that the R version of phase-2 may be somewhat unrealistic in CENTRAL, as it requires agreement on a general analysis format between far-apart institutes. It could however be considered more realistic in RC, where it naturally accommodates 2 different analysis formats for each group. Intermediate models with 3-5 analysis formats (institute formats?) could also be considered.

The only disadvantage of both LOCAL scenarios is the imbalance of the CPU distribution toward the institutes; to such an extent it does not seem realistic and it lacks flexibility. This kind of objection would not be valid if virtual RCs were considered possible and convenient: more information on distributed DB performance is needed to address this point.

Model O. For the analysis model O, the CENTRAL scenario is convenient. RC is strongly disfavored because of the excess CPU of phase-2, which in this model is by far dominant and so cannot be absorbed by the machines of another phase (as was done by the phase-3 machines in model A).

 

 

4. Conclusions and future work.

The computation shown in this report has to be taken as a first step to probe the complexity of the engineering of the final computing model.

The model detailed here is nevertheless enough to show the unavoidability of a scheduling policy and the need for a detailed discussion on the architecture. A discussion on tagging and analysis definitions would shed light on important starting options.

As an example, the unbalanced ratio between I/O and CPU power in phase 1 could point to avoiding this kind of phase 1 altogether, changing its scope or merging it with the previous (or following) phase.

The model A-Revised seems viable for both the Central and RC scenarios, with or without LOCAL option.

It should be noted that the assumption of a strong sequential scheduling is too constraining and hardly feasible already in an RC with only 5 analyses; in a unique center with 20 analyses it seems utterly unrealistic. On the other hand, it is obvious that the time-uniform query distribution assumed here cannot be obtained without a kind of "militaristic" scheduling policy. The worsening in response times due to a random time distribution of the queries will be the subject of some of the forthcoming simulation work. It may however be expected that the impact will be more severe in the centralized scenario, where the probability of clashes between the different analyses is higher.

It should also be noted that even a random query distribution is not realistic: a serious scheduling policy has to be designed and simulated to cope with the foreseeable peaks in query requests.

As far as the storage is concerned, it is clear that the disk space has to be connected to the computers of either phase 1 or phase 2. Each approach has the consequence of moving the I/O bandwidth from the local bus to the network during the other phase.

This study has to be pursued simulating specific architectures and taking into account the database performances.

The behavior of the database affects many aspects of the model. Information needs to be acquired on:

The work of detailing the analysis models has to continue: in a next phase the aspects of MC simulation of the physics channels of interest and of the relevant backgrounds have to be added, and the effects of a "continuum" of random queries on a limited amount of data (for analysis set-up and debugging) have to be taken into account. A model simulation program is needed for this purpose; one might however start with a fairly simple one (this is true also for studying some of the randomness and scheduling effects mentioned above).

A detailed model has to be elaborated also for the initial stage of the LHC analysis, when the main concurrent activities will probably be the understanding of the data, with the inherent detector behavior, and the setting up of the analysis chains.

 

APPENDIX A

Detailed computations for the different analysis models

 

In this naive model with fixed response times, the results on the basis of which a scenario can be evaluated are two:

It is evident that for a more realistic approach we would need a simulation program. The inputs to the simulation would be:

In this more realistic simulation the result would be the distribution of the response times and the different scenarios would be evaluated on the basis of:

Note that the mean response time here is not the only aspect to be taken into account: tails of long response times should be especially avoided.

 

Model A

CENTRAL

PHASE 1

CPU

0.25 SpecInt95*sec * 10^9 ev / (3x10^5 sec) = 800 SpecInt95*sec
800 x 3 (N.concurrent phases 1) = 2400 SpecInt95*sec
- 24 machines
(If reading and processing are performed in parallel, then only 12 CPUs are needed)
- idle time 33%

STORAGE

INPUT: storage is 10^5 GB ( - 1000 disks)

OUTPUT: as the process just tags the events, adding few bytes, no additional storage is needed

I/O

READ: 10^5 GB / (3x10^5 sec) = 333 MB/s -> 333/24 = 14 MB/s per machine; if a safety factor of 3 is assumed the request is 42 MB/s per machine, which is reasonable if the disks are local; in case they are remote it translates to 333 Mbit/sec, which is worrying
(Similar conclusions if one takes into account that the time devoted to reading is only half of the analysis time)

WRITE: just tagging, negligible if the database is smart enough

PHASE 2

CPU

25 physicists x 25 SpecInt95*sec x 10^7 ev / (10^5 sec) = 6.25x10^4 SpecInt95*sec
- 625 machines
- idle time 33%

STORAGE

INPUT: 10^5 GB ( - 1000 disks) (kept in phase 1 computers?)

OUTPUT: 10^7 ev. * 10^4 B * 500 physicists = 50,000 GB (500 disks)
(If the data from the last three months has to be kept: 1500 disks)

I/O

READ: reads all event tag headers and all tagged events = 25 physicists * 10^4 GB / (10^5 sec * 20 groups, as only one group is allowed to run at a time) = 250 MB/s = 400 KB/s per machine (but 10 MB/s on each server)

WRITE: 25 physicists * 10^2 GB / 10^5 sec = 25MB/s

PHASE 3

CPU

150 physicists x 2.5 SpecInt95*sec x 10^7 ev / (1.5x10^4 sec) = 2.5x10^5 SpecInt95*sec
- 2500 machines
- this phase is accomplished in 1/6 of the day by 30% of all the physicists. The full 500 physicists ( 10/3 of the active ones ) will occupy the machines for (1/6 x 10/3) = 55%: idle time = 45%
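The 55% occupancy quoted above follows from sizing the farm for the concurrently active physicists while all 500 use it during the day; a one-line check with the figures of the text:

    # Phase-3 occupancy: machines sized for 150 concurrent physicists, but all 500
    # physicists use them, each for 1/6 of a day (4 hours).
    total_physicists, concurrent, session_fraction = 500, 150, 1 / 6
    busy = (total_physicists / concurrent) * session_fraction  # (10/3) * (1/6)
    print(f"busy {busy:.1%}, idle {1 - busy:.1%}")  # -> busy 55.6%, idle 44.4% (55%/45% in the text)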

STORAGE

INPUT: storage is 5*10^4 GB ( - 500 disks) kept in PHASE 2 machines

OUTPUT: 25 MB / month * 500 physicist = 13 GB

I/O

READ: 150 simultaneous transfers * 10 (or 100) GB / 2x10^4 sec = 75 (750) MB/s = 600 (6000) Mbit/s. Even if only one tenth of them transfer at the same time it is a very high bandwidth; the time constraint has to be relaxed by at least a factor of 5 (1 day)

WRITE: negligible

SumCPU (1+2+3) = 315 kSpecInt95*sec (SumCPU is CPU summed over the 3 phases)
TOTAL CPU required could be only 250 kSpecInt95*sec if phase-2 can be done during phase-3 idle time. However it looks difficult in this scenario.

SumSTORAGE(1+2+3) = 1600 disks (2200 if 3 months of data retention are requested).
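The numbers above can be reproduced with the same naive recipe throughout: installed capacity, machine count and idle fraction per phase. A sketch for Model A, CENTRAL, under the assumptions stated in the text (names are illustrative):

    # Naive fixed-response-time evaluation of Model A, CENTRAL.
    # capacity = concurrency * SpecInt95*sec/ev * events / response time   [SpecInt95]
    # machines = capacity / 100 SpecInt95 ; idle = 1 - busy days / 30

    def phase(concurrency, si95_per_ev, n_events, t_resp_s, busy_days_per_month):
        capacity = concurrency * si95_per_ev * n_events / t_resp_s
        return capacity, capacity / 100.0, 1.0 - busy_days_per_month / 30.0

    phases = [
        ("phase-1", phase(3,   0.25, 1e9, 3e5,   20)),         # 3 concurrent selections
        ("phase-2", phase(25,  25.0, 1e7, 1e5,   20)),         # 25 physicists of the active group
        ("phase-3", phase(150, 2.5,  1e7, 1.5e4, 30 * 0.55)),  # 55% occupancy (see above)
    ]
    for name, (cap, mach, idle) in phases:
        print(f"{name}: {cap / 1000:6.1f} kSpecInt95, ~{mach:5.0f} machines, idle ~{idle:.0%}")

    total = sum(cap for _, (cap, _, _) in phases)
    print(f"SumCPU(1+2+3) ~ {total / 1000:.0f} kSpecInt95")  # ~315 kSpecInt95, as quoted above
    # (phase-1 comes out as 2500 SpecInt95 / 25 machines because the text rounds 833 down to 800)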

RC

PHASE 1

CPU

0.25 SpecInt95*sec x 10^9 ev / (3x10^5) = 800 SpecInt95*sec

- 8 machines
- idle time = 50% (3 busy days x 5 groups = 50% of month)
- CPU8 = 6.4 kSpecInt95*sec (CPU8 is the sum over the 8 RC's )

 

STORAGE

INPUT: storage is local = 10^5 GB ( - 1000 disks)

OUTPUT: as the process just tags the events, adding few bytes, no additional storage is needed

I/O

READ: 10^5 GB / (3x10^5 sec) = 333 MB/s -> 333/24 = 14 MB/s per machine; if a safety factor of 3 is assumed the request is 42 MB/s per machine, which is reasonable if the disks are local; in case they are remote it translates to 333 Mbit/sec, which is worrying

WRITE: just tagging, negligible if the database is smart enough

PHASE 2

CPU

12.5 physicists x 25 SpecInt95*sec x 10^7 ev / (10^5 sec) = 3.1 x 10^4 SpecInt95*sec

- 310 machines
- idle time = 83% (busy for 1 day x 5 groups = 5/30)
- CPU8 = 248 kSpecInt95*sec

STORAGE

INPUT: 10^5 GB ( - 1000 disks) kept in ???????????

OUTPUT: 10^7 ev.* 10^4 B * 125 physicist = 12,500 GB (125 disks)

I/O

READ: reads all event tag headers and all tagged events = 25 physicists * 10^4 GB / (10^5 sec * 20 groups, as only one group is allowed to run at a time) = 250 MB/s = 400 KB/s per machine (but 10 MB/s on each server)

WRITE: 25 physicists * 10^2 GB / 10^5 sec = 25MB/s

PHASE 3

CPU

5 physicists x 5 groups x 2.5 SpecInt95*sec x 10^7 ev / (1.5x10^4 sec) = 4x10^4 SpecInt95*sec

- 400 machines
- idle time = 58% . This phase is accomplished in 1/6 of the day by 40% of all the physicists using the RC: the busy time is ( 1/6 x 10/4 ) = 10/24 = 42%
- CPU8 = 320 kSpecInt95*sec
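The CPU8 figures quoted for the RC scenario are simply the per-center installed capacity multiplied by the 8 Regional Centers; a short check with the per-center figures given above:

    # CPU8 = per-RC capacity * 8 Regional Centers (Model A, RC scenario).
    RC_COUNT = 8
    per_center_si95 = {
        "phase-1": 800,     # 0.25 * 10^9 / (3*10^5), rounded as in the text
        "phase-2": 31_000,  # 12.5 physicists * 25 * 10^7 / 10^5
        "phase-3": 40_000,  # 25 physicists * 2.5 * 10^7 / (1.5*10^4)
    }
    for name, cap in per_center_si95.items():
        print(f"{name}: CPU8 = {cap * RC_COUNT / 1000:5.1f} kSpecInt95")
    # -> 6.4, 248, 320 kSpecInt95, as quoted above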

STORAGE

INPUT: storage is 12500 GB ( - 125 disks) kept in PHASE 2 machines

OUTPUT: 25 MB / month * 125 physicist = 2 GB

I/O

READ: 25 physicist * 10 or 100 GB / 2 10^4 sec = 125 MB/s

WRITE: negligible

 

LOCAL-3

CPU

The CPU of phase-3 is the same as in the RC case ( the activity ratio is the same ). The 3200 machines of phase-3 go to the institutes.
- 6.4 local machines/physicist
- idle time for phase-3 = 50% (about)
The big idle time of the local machines (58%) can be used for MC. If the estimate of 25 kSpecInt95*sec for MC is confirmed, it would however use less than 10% of this phase-3 CPU power.

STORAGE

INPUT: storage is 12500 GB ( - 125 disks) kept in PHASE 2 machines

OUTPUT: is local (5000 GB, ntuples only = 50 disks)

I/O

READ: needed bandwidth to read the ntuples (about 400 Mb/s) at the source

WRITE: negligible

RC-LOCAL-3

The local situation is as already sketched for LOCAL-3.
However no trade-off is possible here between the CPU of phases 2 and 3.
- 6.4 local machines/physicist
- idle time for phase-3 = 50% (about)

Model A summary

Thus, for this analysis model A, the CENTRAL scenario is convenient, and so is RC, provided the use of the same machines for phases 2 and 3 can be scheduled. LOCAL-3 is disfavored; RC-LOCAL-3 can be excluded as too costly.

Model A-R

CENTRAL

PHASE 1 (as central)

CPU

0.25 SpecInt95*sec * 10^9 ev / (3x10^5 sec) = 800 SpecInt95*sec
800 x 3 (N.concurrent phases 1) = 2400 SpecInt95*sec
- 24 machines
- idle time 33%

STORAGE

INPUT: storage is 10^5 GB ( - 1000 disks)

OUTPUT: as the process just tags the events, adding few bytes, no additional storage is needed

I/O

READ: 10^5 GB / (3x10^5 sec) = 333 MB/s -> 333/24 = 14 MB/s per machine; if a safety factor of 3 is assumed the request is 42 MB/s per machine, which is reasonable if the disks are local; in case they are remote it translates to 333 Mbit/sec, which is worrying

WRITE: just tagging, negligible if the database is smart enough

PHASE 2: Here comes the big difference with respect to model A: instead of all the 25 physicists of the group, we have just 1 acting on behalf of the full group:

CPU

1 group x 25 SpecInt95*sec x 10^7 ev / (10^5 sec) = 2.5x10^3 SpecInt95*sec
- 25 machines
- idle time 33%

STORAGE

INPUT: 10^5 GB ( - 1000 disks) kept in phase 1 computers;

OUTPUT: 10^7 ev.* 10^4 B * 20 groups = 2,000 GB (20 disks)

I/O

READ: reads all event tag headers and all tagged events = 10^5 GB / (10^5 sec * 10) (10 is a reduction factor due to reading just tags and not whole events) = 100 MB/s = 4 MB/s per machine (but >10 MB/s on each server)

WRITE: 1 group * 10^2 GB / 10^5 sec = 1MB/s

PHASE 3

CPU

150 physicists x 2.5 SpecInt95*sec x 10^7 ev / (1.5x10^4 sec) = 2.5x10^5 SpecInt95*sec
- 2500 machines
- this phase is accomplished in 1/6 of the day by 30% of all the physicists. The full 500 physicists ( 10/3 of the active ones ) will occupy the machines for (1/6 x 10/3) = 55% : idle time = 45%

STORAGE

INPUT: storage is 5*10^4 GB ( - 500 disks) kept in PHASE 2 machines

OUTPUT: 25 MB / month * 500 physicist = 13 GB

I/O

READ: 125 physicist * 10 or 100 GB / 2 10^4 sec = 625 MB/s

WRITE: negligible

SumCPU(1+2+3) = 255 kSpecInt95*sec (SumCPU is CPU summed over the 3 phases)
TOTAL CPU required 255 kSpecInt95*sec fully dominated by phase-3.

RC

PHASE 1

CPU

0.25 SpecInt95*sec x 10^9 ev / (3x10^5) = 800 SpecInt95*sec

- 8 machines
- idle time = 50% (3 busy days x 5 groups = 50% of month)
- CPU8 = 6.4 kSpecInt95*sec (CPU8 is the sum over the 8 RC's )

 

STORAGE

INPUT: storage is local = 10^5 GB ( - 1000 disks)

OUTPUT: as the process just tags the events, adding few bytes, no additional storage is needed

I/O

READ: 10^5 GB / (3x10^5 sec) = 333 MB/s -> 333/24 = 14 MB/s per machine; if a safety factor of 3 is assumed the request is 42 MB/s per machine, which is reasonable if the disks are local; in case they are remote it translates to 333 Mbit/sec, which is worrying

WRITE: just tagging, negligible if the database is smart enough

PHASE 2: Here comes the big difference with respect to model A: instead of all the 12.5 physicists of the group, we have just 1 acting on behalf of the full group:

CPU

1 group x 25 SpecInt95*sec x 10^7 ev / (10^5 sec) = 2.5 x 10^3 SpecInt95*sec

- 25 machines
- idle time = 83% ( busy for 1 day x 5 groups = 5/30 )
- CPU8 = 20 kSpecInt95*sec

STORAGE

INPUT: 10^5 GB ( - 1000 disks) kept in ???????????

OUTPUT: 10^7 ev.* 10^4 B * 125 physicist = 12,500 GB (125 disks)

I/O

READ: reads all event tag headers and all tagged events = 25 physicists * 10^4 GB / (10^5 sec * 20 groups, as only one group is allowed to run at a time) = 250 MB/s = 400 KB/s per machine (but 10 MB/s on each server)

WRITE: 25 physicists * 10^2 GB / 10^5 sec = 25MB/s

PHASE 3

CPU

5 physicists x 5 groups x 2.5 SpecInt95*sec x 10^7 ev / (1.5x10^4 sec) = 4x10^4 SpecInt95*sec

- 400 machines
- idle time = 58% . This phase is accomplished in 1/6 of the day by 40% of all the physicists using the RC: the busy time is ( 1/6 x 10/4 ) = 10/24 = 42%
- CPU8 = 320 kSpecInt95*sec

STORAGE

INPUT: storage is 12500 GB ( - 125 disks) kept in PHASE 2 machines

OUTPUT: 25 MB / month * 125 physicist = 2 GB

I/O

READ: 25 physicist * 10 or 100 GB / 2 10^4 sec = 125 MB/s

WRITE: negligible

- SumCPU8 = 348 kSpecInt95*sec
- TOTAL CPU required could be only 326 kSpecInt95*sec if the machines of phase-3 are used for the other phases in their idle time; the difference however is not so big as the CPU is dominated by phase-3.

LOCAL-3

No change in phase-3 with respect to model A.

The CPU of phase-3 is the same as in the RC case ( the activity ratio is the same ). The 3200 machines of phase-3 go to the institutes.
- 6.4 local machines/physicist
- idle time for phase-3 = 50% (about)
The big idle time of the local machines (58%) can be used for MC. If the estimate of 25 kSpecInt95*sec for MC is confirmed, it would however use less than 10% of this phase-3 CPU power.

 - TOTAL CPU required = 325 kSpecInt95*sec

RC-LOCAL-3

No change in phase-3 with respect to model A.

The local situation is as already sketched for LOCAL-3.
However no trade-off is possible here between the CPU of phases 2 and 3.
- 6.4 local machines/physicist
- idle time for phase-3 = 50% (about)

- TOTAL CPU required = 348 kSpecInt95*sec