CERN--European Laboratory for Particle Physics

Rough Sizing Estimates for a Computing Facility for a Large LHC Experiment

Les Robertson CERN/IT - May 1999

 

Background

These estimates have been prepared as part of the MONARC project in order to make a first guess at the sizing of the computing configuration that will be required at CERN for ATLAS or CMS. The estimates may also be used to gauge the capacity required for a "regional computing centre" for one of these experiments. The sizing of a specific regional centre will depend on the range of services it intends to provide and the number of active physics groups it intends to support.

The base for these estimates is a set of figures for CMS offline computing at CERN that was made available by CMS in the first half of 1998, in the context of providing cost estimates for the CERN management. The estimates include capacity for raw data recording and for reconstruction and basic analysis, but do not include the "level 3" filter farm (funded as part of the online system) or simulation (assumed to be provided outside CERN). Although ATLAS did not provide such detailed figures, it was estimated at that time that their requirements were similar.

The estimates assume a model for the build-up in capacity over the years 2004 through 2007. In 2004 a substantial proof-of-concept facility is configured; it is extended in 2005 to handle the first data from LHC, and expanded again in 2006 to reach the capacity required for the first year of full data taking. In 2007 (and presumably later years) the configuration is further extended to provide analysis capacity for the growing aggregate data collection. I have freely extended the estimates to include guesses about I/O throughput, Objectivity/DB overhead and other items, using my own intuition; in the main these extensions are explained in the text.

As a rough check on the numbers at CERN, a comparison is included with an analysis model produced by the MONARC Analysis Working Group. This was based on a paper by Mauro Campanella and Laura Perini describing possible analysis models for ATLAS (MONARC note 1, July 1998). Note, however, that these are early estimates. The numbers will be refined by the MONARC project during the remainder of this year, but they will continue to evolve as the collaborations improve their understanding of the algorithms used in the various phases of the analysis.

Note that funding for computing facilities at CERN on this scale for the four LHC experiments is not yet in any formal plan. The scale of funding required was first discussed with the CERN management in mid-1998, when it was decided to leave the problem as an exercise for CERN's new management team.


The Estimates

The following table gives the basic estimates of capacity required at CERN for CMS for the following functions: data recording; first-pass reconstruction; some re-processing; basic analysis of the ESD (pass-1 + pass-2); support for a few analysis groups (say ~4 groups, ~100 physicists). It is assumed that there will also be "good" networking connections to the outside institutes and regional centres. The base estimates have been summarised in terms of computational capacity (SPECint95 units, where 1 SI95 = 10 CERN-Units = 40 MIPS), disk capacity (TeraBytes), and automated tape storage (PetaBytes), and are given for the four year period around LHC start-up. The numbers are for the total capacity available in each year.
 

Base estimates (capacity installed)

    year                     2004       2005       2006       2007
    processors (SI95)      60'000    310'000    460'000    610'000
    disks (TB)                 40        340        540        740
    tapes (PB)                0.2          1          2          3

For comparison, by the end of 1999 the physics data processing capacity of the CERN Computer Centre - for all CERN experiments - is expected to be roughly 4'800 SI95 of computation (in about 500 very assorted boxes), 23 TB of disk, and robotic capacity for 900 TB of tape (though only a fraction of these slots will be populated).

I/O overhead

Experience with current experiments, and particularly with NA48 and NA45 at CERN, has shown that the raw computational requirements must be increased to allow for the overhead costs of file and database processing and network access. This overhead is of course a function of the aggregate I/O data rate which must be supported, which is in turn related to the computational capacity available. Two measurements which may be used to estimate the size of the overhead are:

    1. In the NA48 model, data is stored in straightforward Unix files on dedicated disk servers. These servers require a computational capacity of 0.5 SI95 for each MB/sec of I/O throughput.
    2. The COMPASS experiment has performed a series of measurements using Objectivity/DB with a client-AMS model. An average figure for the cpu overhead per MB/sec of I/O throughput on writing is 1.3 SI95, including network, file system and database processing.

So we need to add something of the order of 1 SI95 for each MB/sec of I/O throughput. But first we must estimate the required throughput. Some work in the MONARC Analysis Working Group, presented in January by Paolo Capiluppi, suggests that the average I/O to computation ratio is 10 KB/sec per SI95, with a peak requirement of 400 KB/sec per SI95 for pass-1 and pass-2 analysis of the ESD. Since the latter represents only a few percent of the total computation requirement, I have used a figure of 100 KB/sec per SI95 (between the average and the peak) as a reasonable allowance. Taking this together with the overhead measured by COMPASS gives a very modest 13% cpu overhead. This leads to the following requirements for cpu, disk and tape capacity, and aggregate I/O throughput.

If the model used is completely distributed (as is the case for the NA45, NA48 and COMPASS farms at CERN), then the total I/O throughput must be supported by the LAN. Most probably the disk will be "network storage" and will use a dedicated network (i.e. the successor to Fibre Channel), but for the moment we must assume that the client-AMS model for database access will require all data to traverse the storage network (to get to the AMS) and then the LAN (to get to the application).
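The adjustment is simple enough to write down explicitly. The following Python sketch reproduces it using only the figures quoted above (the 1.3 SI95 per MB/sec COMPASS measurement, the assumed 100 KB/sec per SI95, and the base capacities); the names are mine and purely illustrative.

    # Sketch of the I/O-overhead adjustment described above.
    IO_PER_SI95_MB = 0.1          # assumed average I/O: 100 KB/sec per SI95
    OVERHEAD_SI95_PER_MBS = 1.3   # COMPASS: cpu cost per MB/sec of I/O (writing)

    overhead = IO_PER_SI95_MB * OVERHEAD_SI95_PER_MBS   # = 0.13, i.e. ~13% cpu overhead

    base_cpu_si95 = {2004: 60_000, 2005: 310_000, 2006: 460_000, 2007: 610_000}

    for year, cpu in base_cpu_si95.items():
        adjusted_cpu = cpu * (1 + overhead)               # cpu including I/O overhead
        aggregate_io_gb_s = cpu * IO_PER_SI95_MB / 1000   # aggregate throughput, GB/sec
        print(year, round(adjusted_cpu), round(aggregate_io_gb_s))
    # 2004:  67'800 SI95,  6 GB/sec    2005: 350'300 SI95, 31 GB/sec
    # 2006: 519'800 SI95, 46 GB/sec    2007: 689'300 SI95, 61 GB/sec

These are the processor and I/O rows of the table below.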
 

Estimates adjusted for I/O overhead (capacity installed)

    year                     2004       2005       2006       2007
    processors (SI95)      67'800    350'300    519'800    689'300
    disks (TB)                 40        340        540        740
    I/O (GB/sec)                6         31         46         61
    tapes (PB)                0.2          1          2          3

Tape - data throughput and total capacity

A figure of 100 MB/sec is used for the peak raw data rate from the detector (is this realistic?) and a total of 1 PByte is assumed to be collected during 200 days of running. I have estimated the amount of tape to be purchased each year at twice the volume of raw data collected. Total tape I/O throughput is trivially estimated as 5 times the raw data rate: record the raw data (1), re-read it (2) and write the export copy (3), re-read it for re-processing (4), and write and read the full ESD (10% of the raw data) a few times during the year (5). Will it be necessary or useful to make a full copy of the raw data, or will only some fraction be copied for export? In any case, the factor of 5 assumes that a version of the ESD suitable for most analysis tasks is kept disk-resident (I believe this is a basic requirement, no matter how much or how little disk is available).
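As a sketch of this accounting (the 100 MB/sec rate and the 1 PB per 200-day run are the only inputs; the factor-of-5 breakdown follows the enumeration above):

    # Sketch of the tape arithmetic described above.
    RAW_RATE_MB_S = 100        # peak raw data rate from the detector
    RAW_VOLUME_PB = 1.0        # raw data collected in a 200-day run

    # record + re-read + write export copy + re-process
    # + ~1 for writing/reading the ESD (10% of raw) a few times a year
    tape_io_mb_s = 5 * RAW_RATE_MB_S              # = 500 MB/sec = 0.5 GB/sec

    # tape purchased each year: twice the raw volume collected that year
    tape_bought_pb_per_year = 2 * RAW_VOLUME_PB   # capacity grows by ~2 PB/year
                                                  # once full data taking starts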
 

Tape capacity & I/O

    year                     2004       2005       2006       2007
    tape capacity (PB)        0.2          1          3          5
    tape I/O (GB/sec)         0.2        0.3        0.5        0.5

Boxes to be managed

The box counts are just to give an idea of the order of magnitude of the number of pieces of equipment to be managed. I assume a basic motherboard, box or node with 4 cpus, which evolves in capacity from 280 SI95 in 2003 to 2'000 SI95 in 2008. This gives totals of one to two thousand boxes in the period 2005-07. Assuming that equipment is replaced after three or four years, this number will be a peak, as the older equipment is replaced with more powerful systems.

Summary

The data is summarised in the following table.
 

Summary of required installed capacity

    year                     2004       2005       2006       2007
    total cpu (SI95)       70'000    350'000    520'000    700'000
    disks (TB)                 40        340        540        740
    LAN thr-put (GB/sec)        6         31         46         61
    tapes (PB)                0.2          1          3          5
    tape I/O (GB/sec)         0.2        0.3        0.5        0.5
    approx box count          250        900      1'400      1'900


Reality Check:

A quick comparison with Paolo Capiluppi's presentation at the 15 January meeting of the Analysis working group gives the following.

Reconstruction:

The initial reconstruction at 100 events/sec (100 MB/sec) and 350 SI95-sec per event requires 35K SI95 and a 100 MB/sec tape data rate (if a single copy of the raw data is maintained), or 200 MB/sec (if two copies - one for CERN, one for export - are made). An alternative approach of making the export copy offline would require a re-read of the original data, but could take advantage of the average recording rate of 60 MB/sec (1 PB in 200 days), leading to a total tape data rate requirement of 220 MB/sec.

Re-processing of a full year's data in one month: 1 PB = 10^9 events, 350 SI95-sec per event. This requires a computation capacity of 150K SI95 (30% of the 2006 cpu capacity), and a sustained tape data rate of 400 MB/sec (80% of the 2006 tape I/O capacity).
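A quick sketch of the reconstruction and re-processing arithmetic above (only the event rates, volumes and SI95-sec costs quoted in the text are used; the rounding follows the text):

    DAY = 86_400                             # seconds

    # Reconstruction: 100 events/sec at 350 SI95-sec per event
    recon_cpu = 100 * 350                    # 35'000 SI95
    recon_tape_two_copies = 2 * 100          # record + write export copy: 200 MB/sec
    avg_rate = 1e9 / (200 * DAY)             # 1 PB over 200 days: ~60 MB/sec
    recon_tape_offline_copy = 100 + 2 * avg_rate   # record + re-read + write: ~220 MB/sec

    # Re-processing 10**9 events (1 PB) in one month at 350 SI95-sec per event
    reproc_cpu = 1e9 * 350 / (30 * DAY)      # ~135'000 SI95, quoted above as ~150K
    reproc_tape = 1e9 / (30 * DAY)           # ~390 MB/sec, quoted above as ~400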

So the above estimates are a bit short on tape I/O capacity to support re-processing during data-taking.

Pass-1 Analysis:

Each of 20 Working Groups reads through 100 TB of ESD data (= 10^9 events) in 3 days, at 0.25 SI95-sec per event. Assume all of the ESD is on disk. This requires a sustained capacity of only 1K SI95 per working group and a 400 MB/sec disk input rate, but note that the I/O to computation ratio is 0.4 MB/sec per SI95 - four times the average used in the above estimates.

Pass-2 Analysis:

Each of the 20 Working Groups reads through 1 TB of ESD data (= 10^7 events) in 1 day, at 2.5 SI95-sec per event. Sustained capacity per working group: 300 SI95, with 12 MB/sec of input disk I/O.

User Analysis:

A 4-hour (desktop?) job which passes over 100 GB of AOD (= 10^7 events?), generating another 100 GB, at 3 SI95-sec per event. This requires a 2K SI95 system (in the above estimates a 4-cpu system in 2006 is rated at only half of that), sustaining a modest 15 MB/sec of disk I/O.
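The per-group and per-job figures for the three analysis workloads can be reproduced in the same way (unit conversions such as 1 TB = 10^6 MB are mine):

    DAY = 86_400   # seconds

    # Pass-1: 10**9 events, 100 TB of ESD, 3 days, 0.25 SI95-sec per event
    pass1_cpu = 1e9 * 0.25 / (3 * DAY)      # ~1'000 SI95 per working group
    pass1_io  = 1e8 / (3 * DAY)             # 100 TB in MB: ~400 MB/sec per group

    # Pass-2: 10**7 events, 1 TB of ESD, 1 day, 2.5 SI95-sec per event
    pass2_cpu = 1e7 * 2.5 / DAY             # ~300 SI95 per working group
    pass2_io  = 1e6 / DAY                   # ~12 MB/sec per group

    # User analysis: 10**7 events, 100 GB read + 100 GB written,
    # 4 hours, 3 SI95-sec per event
    user_cpu = 1e7 * 3 / (4 * 3600)         # ~2'000 SI95 per job
    user_io  = 2 * 1e5 / (4 * 3600)         # ~15 MB/sec per job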

Conclusion:

The raw numbers required at CERN to support data recording, first-pass reconstruction, occasional re-processing (sometimes in parallel with data acquisition), 4 analysis groups (25% of the collaboration) performing pass-1 + pass-2 analysis (sometimes in parallel), and 100 users each running one 4-hour private analysis per 8-hour day (i.e. 50 simultaneous analyses) are given in the following table.
 

    function                               cpu (SI95)   disk I/O    tape I/O    disk I/O to cpu ratio
                                                        (MB/sec)    (MB/sec)    (MB/sec per SI95)
    reconstruction + data recording            35'000        500         100         0.01
    copy raw data                                   -        120         120            -
    re-processing                             150'000        400         400         0.003
    Pass1+2 analysis - 4 groups                 4'000      1'600           -         0.40
    User analysis - 50 simultaneous jobs      100'000        750           -         0.01
    Totals                                    289'000      3'410         620         0.01
    % of 2006 base capacity                       63%         7%        124%           -

 

The CPU is just about right. The tape I/O capacity is over-stretched. The disk I/O capacity is grossly over-configured as an average. Only Pass-1 analysis has a high I/O to cpu ratio.