MONARC Report on Existing Computing Architectures

Last Update: May 29, 1999


Introduction

This report has been prepared by the Site and Network Architecture Working Group of the MONARC project and provides a survey of the computing architectures used by a selection of current experiments. The purposes of this survey are:

1. To develop a formal framework for describing analysis architectures and to apply it to existing experiments. The framework can then be used to discuss architectural models for the LHC.

2. To provide quantitative information on the resources which have been employed in the most challenging data analysis efforts undertaken in High Energy Physics (HEP) so far. This information can provide a baseline against which the magnitude of the LHC data analysis problem can be compared.

3. To identify successful approaches that can be expected to carry over to the future and to identify key problem areas which are already known to exist at this scale of effort.

In the second section of the report, we explain the scope of the survey and our reasons for choosing a particular subset of experiments. The third section presents a brief explanation of the methodology used to collect the data and provides the information required to interpret the data tables presented in section four. In section five, we provide the conclusions drawn from the data by the members of the working group.

Scope of the Survey

For reasons described below, we believe that we can learn only a limited amount from existing experiments. The survey is therefore not exhaustive, and concentrates on a representative selection of LEP, HERA, FNAL Collider, and large CERN and FNAL fixed-target experiments. We considered experiments that had, by past standards, very large volumes of data, high rates, high degrees of complexity, and large, geographically dispersed collaborations. We included in the list experiments that are running now at CERN, such as NA48 and NA45, which have large amounts of data, at least by current standards, and are being used to try out new models of computing designed to be relevant to future experiments with even larger datasets.

We believe that this survey will produce only a limited amount of information that can be used to guide the development of computing architectures for LHC experiments. Many of these experiments are quite mature, and their initial computing architecture was established in the late 1980's or early 1990's. Computing technology, software, and networking were very different then from what they are today. While all of these experiments have evolved to take partial advantage of new developments in computing, they have all been heavily influenced and constrained by their initial conditions. In particular, their detailed hardware solutions are unlikely to be applicable to the timeframe of the LHC startup. We can, however, learn much about the accuracy with which these groups were able to forecast their computing needs, their ability to reconfigure while operating, and the speed with which they could introduce new technologies and methods. One area on which we focussed was the degree to which analysis was centralized or distributed, and the reasons why particular patterns emerged.

After some discussion, we decided to contact the eleven experiments listed in the next section.

Methodology of the Survey

The survey was carried out by first formulating a list of questions and then appointing a 'summarizer' for each experiment, who contacted recognized leaders of computing in that experiment to obtain the answers. The answers can, of course, depend to a degree on the view of the respondent for each experiment, and often represent a snapshot of the computing model at a particular point in time. We would like to thank the contact people for their effort in answering our questions, which were non-trivial to understand and research. In addition, terminology can vary enough between different computing models that it does not always translate well.

The list of experiments, summarizers and contact people is:

Experiment | Summarizer | Contact Person
LEP experiments:
ALEPH | Tim Smith | Marco Cattaneo
DELPHI | Tim Smith | Richard Gokieli
L3 | Tim Smith | Bob Clare
OPAL | Tim Smith | Ann Williamson, Gordon Long, Steve O'Neal
CERN fixed-target experiments:
NA48 | Tim Smith | Bernd Panzer-Steindel, Monica Pepe
NA45 | Tim Smith | Bernd Panzer-Steindel, Andreas Pfeiffer
FNAL experiments:
CDF | Mike Diesburg | Eric Wicklund
D0 | Mike Diesburg | Mike Diesburg
KTeV | Vivian O'Dell | Vivian O'Dell
FOCUS | Vivian O'Dell | Harry Cheung
DESY experiments:
ZEUS | Ian McArthur | Dave Baily

The questions that were posed were the following:

Survey of Experiments - Form

Experimental Parameters:
  CPU
  Data Sizes
  Monte Carlo Parameters
  Aggregate I/O
Networking:

The results of these questionnaires were then distilled into the tables presented in the next section. Question marks in the tables indicate that the requested information could not be obtained -- usually because the experiment did not, or could not, track it. The areas where the question marks occur are therefore, in themselves, significant. The information in the tables and the conclusions drawn in the final section of the report constitute the results of this study.

Quantitative Results of the Survey

Experiment | Event Recon. Time | Event Analysis Time | Event Size | Collaboration Size
D0 | 45 Si95-secs | ? | 600 kB | 32 institutes, 400 members
CDF | 35 Si95-secs | ? | 130 kB | 47 institutes, 460 members
ZEUS | 33 Si95-secs | 1.0 Si95-secs | 100 kB | 52 institutes, 450 members
ALEPH | 2.6 Si95-secs | ? | 30 kB | 32 institutes, 493 members
DELPHI | 5 Si95-secs | ? | 50 kB | 55 institutes, 534 members
L3 | 7.7 Si95-secs | 2.0 Si95-secs (average) | 40 kB | 54 institutes, 619 members
OPAL | 2.0 Si95-secs | 1.0 Si95-secs | 20 kB | 35 institutes, 418 members
NA45 | 6.4 Si95-secs | 0.01 Si95-secs | 200 kB | 8 institutes, 45 members
NA48 | 0.15 Si95-secs | ? | 15 kB | 16 institutes, 207 members
KTeV | 0.0465 Si95-secs | 0.07 Si95-secs | 7 kB | 12 institutes, 80 members
FOCUS | 4 Si95-secs | 0.25-1.0 Si95-secs | 4 kB | 20 institutes, 120 members

Note that there are many gaps in the "Event Analysis Time" column of the table. This is because the analysis time varies widely with the type of analysis being done, so this number (even as a range) is not well defined.
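
As an illustration of how the per-event figures above combine with the dataset sizes reported later in this section, the following back-of-the-envelope sketch estimates the CPU time for one full reconstruction pass of the D0 raw data (45 Si95-secs per event, 600 kB per event, 30 TB of raw data, 275 SpecInt95 of onsite capacity). It is an illustrative calculation only; the assumption of 100% farm efficiency is ours.

    # Back-of-the-envelope reconstruction budget using the D0 figures quoted
    # in the tables of this section (illustrative only; assumes 100% efficiency).

    RECON_TIME_SI95_S = 45        # Si95-seconds per event
    EVENT_SIZE_BYTES = 600e3      # 600 kB per raw event
    RAW_DATA_BYTES = 30e12        # 30 TB of raw data
    FARM_CAPACITY_SI95 = 275      # onsite CPU from the resources table

    n_events = RAW_DATA_BYTES / EVENT_SIZE_BYTES
    total_si95_seconds = n_events * RECON_TIME_SI95_S
    wall_clock_days = total_si95_seconds / FARM_CAPACITY_SI95 / 86400

    print(f"events to reconstruct : {n_events:.1e}")
    print(f"total CPU work        : {total_si95_seconds:.1e} Si95-seconds")
    print(f"one pass on the farm  : {wall_clock_days:.0f} days")

With these numbers, a single reconstruction pass works out to roughly three months of dedicated farm time, which helps explain why reconstruction was typically run quasi-online rather than repeated casually.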

Experiment | Central CPU for I/O (SpecInt95) | Onsite CPU (SpecInt95) | Offsite CPU (SpecInt95) | Onsite/Offsite Reconstruction | Onsite/Offsite Data Analysis
D0 | 20 | 275 | 25 | all onsite | 95% onsite/5% offsite
CDF | 180 | 100 | 25 | all onsite | all onsite
ZEUS | 150 | 170 | ? | all onsite | all onsite
ALEPH | 130 | 170 | ? | mainly onsite | 50% onsite/50% offsite
DELPHI | 250 | 265 | ? | all onsite | 50% onsite/50% offsite
L3 | 125 | 500 | ? | all onsite | 90% onsite/10% offsite
OPAL | 190 | 645 | ? | all onsite | ?
NA45 | 587 | None | ? | all onsite | 0% onsite/100% offsite
NA48 | 600 | 50 | ? | all onsite | ?
KTeV | 280 | 195 | 460 during analysis | 80% onsite/20% offsite
FOCUS | 9 | 35 | 340 (?) | all onsite | 10% onsite/90% offsite

Many experiments provided most of their own computing resources during preparation and initial data taking. This took many forms, including private IBM mainframes and VAX clusters for the exclusive use of individual institutes, and farms of emulators, VAXes, and RISC machines for dedicated services such as online filtering, quasi-online event reconstruction, and analysis farms for DST access. For some experiments, spare cycles on university-funded desktop machines were used for experiment-wide simulation via batch jobs or other job distribution systems, and in groups for specific analysis topics where particular machines hosted the data disks.

We chose to list machines as "onsite" if they met the following criteria:

Thus, machines listed as "onsite" CPU in the table correspond to their use from the point of view of distributed computing architectures, even though they may be considered "offsite" by funding agencies and by COCOTIME (CERN's computer resource allocation/oversight committee).

Experiment | Total Raw Data (TB) | Total DST Data (TB) | DST Event Size | Onsite/Offsite Data Storage
D0 | 30 | 2 | 50 kB | 100% onsite/1% offsite
CDF | 8.3 | 2 | 32 kB | 100% onsite/<5% offsite
ZEUS | 3.5/year | 5.3/year | 150 kB | 100% onsite
ALEPH | 3.5 | 0.94 | 1.5 kB | 100% onsite/100% offsite
DELPHI | 3.5 | 0.55 | 31 kB | ?
L3 | 3.4 | 0.6 | 7.1 kB | 100% onsite
OPAL | 2.7 | 1.0 | 7.3 kB | 100% onsite
NA45 | 2.0 | 0.4 | 100 kB | 100% onsite
NA48 | 100 | 1.5 | 1.7 kB | 100% onsite
KTeV | 45 | N/A | 7 kB | 100% onsite/25% offsite
FOCUS | 25 | 18 (L1/L2 DSTs) | 1-3 kB | 100% onsite/?% offsite

Experiment | MC Event Generation Time | MC Event Size | Onsite/Offsite Generation | Total MC Data Set Size
D0 | 125 Si95-secs | 1 MB | 50% onsite/50% offsite | ?
CDF | 25 Si95-secs | 100 kB | 95%/5% | ?
ZEUS | 36 Si95-secs | 45 kB | 40%/60% | 9 TB
ALEPH | 7.8 Si95-secs | 100 kB | 50%/50% | 2 TB
DELPHI | 100-500 Si95-secs | 500 kB | 20%/80% | 5 TB
L3 | 300 Si95-secs | 350 kB | 45%/55% | ?
OPAL | 200 Si95-secs | 100-300 kB | 85%/15% | 12 TB
NA45 | 900 Si95-secs | 50 kB | 40%/60% | 2 TB
NA48 | 0.4-0.8 Si95-secs | 30 kB | 0%/100% (50%-70% Lyon) | 0.15 TB
KTeV | 0.2 Si95-secs | 2 kB | ? | ?
FOCUS | ? | 0.5-3 kB | ? | ?

Experiment | Onsite Central CPU I/O | Onsite Distributed CPU I/O | Offsite I/O
D0 | 10 MB/s | 30 MB/s | 3.2 MB/s
CDF | 5 MB/s | all data access done centrally | 4.7 MB/s

Experiment | LAN network aggregate rate | WAN network aggregate rate
D0 | 3 x 100 Mb/s FDDI backbones, 300 Mb/s | 2 T3 lines, 90 Mb/s
CDF | 100 Mb/s FDDI backbone | 2 T3 lines, 90 Mb/s
ZEUS | 800 Mb/s HIPPI (analysis), 100 Mb/s Ethernet (reconstruction), 10 Mb/s Ethernet (desktop) | 34 Mb/s (DESY - European network)
ALEPH | 800 Mb/s HIPPI; switched 100 Mb/s Ethernet in Computer Center; 10 Mb/s Ethernet on desktop | ?
DELPHI | 800 Mb/s HIPPI + Gb Ethernet + switched 100 Mb/s Ethernet in Computer Center; 10 Mb/s Ethernet on desktop | ?
L3 | 800 Mb/s HIPPI; switched 100 Mb/s Ethernet in Computer Center; FDDI and 10 Mb/s Ethernet on desktop | ?
OPAL | 800 Mb/s HIPPI; switched 100 Mb/s Ethernet in Computer Center; 10 Mb/s Ethernet on desktop | ?
NA45 | 800 Mb/s HIPPI + Gb Ethernet + switched 100 Mb/s Ethernet in Computer Center; 10 Mb/s Ethernet on desktop | ?
NA48 | 800 Mb/s HIPPI + Gb Ethernet + switched 100 Mb/s Ethernet in Computer Center; 10 Mb/s Ethernet on desktop | ?

Analysis and Conclusions

  1. Comparisons with LHC Experiments

    While it is understood that the magnitude of the computing required by the LHC experiments far exceeds anything that has been achieved so far in HEP, it is useful to quantify this by comparing the projected needs for ATLAS and CMS to the surveyed experiments. Below is a table comparing the above experiments with an estimate of the size of one LHC experiment. For illustration, current numbers for BaBar and estimates for CDF/D0 Run 2 are also shown. Note that the BaBar numbers listed are at turn-on. (We are indebted to Charlie Young for information about BaBar.) The LHC numbers come from estimates for CMS offline computing during the first year of operation, currently projected for 2006 (see "Rough Sizing Estimates for a Computing Facility for a Large LHC Experiment" by Les Robertson).

    Experiment | Onsite CPU (Si95) | Onsite Disk (TB) | Onsite Tape (TB) | LAN Capacity | Data Import/Export | Box Count
    LHC | 520,000 | 540 | 3000 | 46 GB/s | 10 TB/day (sustained) | ~1400
    CDF Run 2 | 12,000 | 20 | 800 | 1 Gb/s | 18 MB/s | ~250
    D0 Run 2 | 7,000 | 20 | 600 | 300 Mb/s | 10 MB/s | ~250
    BaBar | ~6000 | 8 | ~300 | mix of Gb/s and 100 Mb/s Ethernet to servers | ~400 GB/day | ~400
    D0 | 295 | 1.5 | 65 | 300 Mb/s | ? | 180
    CDF | 280 | 2 | 100 | 100 Mb/s | ~100 GB/day | ?
    ALEPH | 300 | 1.8 | 30 | 1 Gb/s | ? | 70
    DELPHI | 515 | 1.2 | 60 | 1 Gb/s | ? | 80
    L3 | 625 | 2.0 | 40 | 1 Gb/s | ? | 160
    OPAL | 835 | 1.6 | 22 | 1 Gb/s | ? | 220
    NA45 | 587 | 1.3 | 2 | 1 Gb/s | 5 GB/day | 30
    NA48 | 650 | 4.5 | 140 | 1 Gb/s | 5 GB/day | 50
    KTeV | 475 | 1 | 50 | 100 Mb/s FDDI | 150 GB/day | 6

    It is clear from this table that the scale of the computing problem in the surveyed current experiments is so much smaller than for LHC experiments that they cannot possibly define a framework for the LHC. However, the point behind this survey is to make the best possible use of the experiences of the past in formulating what is essentially a new model of computing for the LHC era.
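
    To make the gap concrete, the short sketch below converts two entries of the LHC row into more familiar terms and compares the CPU figure with the largest onsite capacity among the surveyed experiments (OPAL's 835 Si95); the choice of comparison rows is ours and the arithmetic is purely illustrative.

        # Rough scale comparison between the LHC row and the largest surveyed
        # experiment in the table above (illustrative arithmetic only).

        LHC_CPU_SI95 = 520_000
        LARGEST_SURVEYED_CPU_SI95 = 835          # OPAL onsite CPU
        LHC_IMPORT_EXPORT_TB_PER_DAY = 10        # sustained

        cpu_ratio = LHC_CPU_SI95 / LARGEST_SURVEYED_CPU_SI95
        sustained_mb_per_s = LHC_IMPORT_EXPORT_TB_PER_DAY * 1e6 / 86400  # TB/day -> MB/s

        print(f"LHC CPU vs largest surveyed experiment: ~{cpu_ratio:.0f}x")
        print(f"10 TB/day sustained import/export     : ~{sustained_mb_per_s:.0f} MB/s, around the clock")

    A factor of several hundred in CPU alone, sustained around the clock, is what separates the LHC problem from anything in the survey.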

  2. Issues Related to Distributed Computing

    In discussions with experiments about how to improve remote data processing, the following issues were those most commonly raised.

    1. System and Code Management Issues

      Probably the issue raised most often during the survey is that there are not enough people at any individual remote institution to ensure that reconstruction and analysis code is kept up to date. This is especially true when the software is under development and in the first year of running when software is most likely to change often. Although there are tools that have been, and continue to be, developed for both code management and code distribution, remote institutions still find it difficult to keep software current.

      Professional system management and 7 x 24 coverage are essential for a remote institution to successfully run production. In addition, at least one full-time software engineer familiar with the software, and physicists to test each local release, are needed to keep the site up to date.
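
      The experiments surveyed did not report a common mechanism for keeping remote software current, but the kind of tool the respondents describe is simple in outline; the sketch below is a hypothetical example of mirroring one tagged central release to a remote site with rsync, where the host name, paths, and tag convention are invented for illustration.

          # Hypothetical release-mirroring script of the kind a remote site might
          # run nightly to stay current with the central code repository.
          # Host name, paths, and the release tag are invented for illustration.

          import subprocess
          import sys

          CENTRAL_HOST = "central.example.org"              # hypothetical central server
          CENTRAL_RELEASES = "/releases"                    # tagged releases live here
          LOCAL_RELEASES = "/usr/local/experiment/releases"

          def mirror_release(tag: str) -> None:
              """Copy one tagged release tree from the central site, removing stale files."""
              src = f"{CENTRAL_HOST}:{CENTRAL_RELEASES}/{tag}/"
              dst = f"{LOCAL_RELEASES}/{tag}/"
              subprocess.run(["rsync", "-az", "--delete", src, dst], check=True)

          if __name__ == "__main__":
              release_tag = sys.argv[1] if len(sys.argv) > 1 else "current"
              mirror_release(release_tag)
              print(f"release '{release_tag}' mirrored to {LOCAL_RELEASES}")

      Even with such automation in place, someone at the remote site still has to validate each release against local data, which is where the dedicated software engineer and test physicists come in.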

    2. Raw Data Access

      In our survey we found that many remote institutions are running large Monte Carlo productions, while not being able to efficiently run data production jobs. This is mainly due to the inability of the remote institutions to access and store large quantities of raw data.

      For remote sites, raw data access is typically through exchange of physical media. To date, wide area network speeds have prevented most experiments from collecting data in real time or semi-real time over a network. All of the experiments report archiving the full set of raw data onsite for safekeeping, and raw data is copied before being shipped offsite.

      Although remote centers often have considerable CPU capacity, most of the remote sites we polled did not have sufficient resources to handle large datasets. Clearly, for a remote site to fully participate in raw data reconstruction and analysis, it must be able to provide users easy access to the data.
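
      The sketch below makes the network constraint explicit, using figures from the tables above: at the 90 Mb/s quoted for the FNAL T3 links, even a perfectly utilized link would need on the order of a hundred days to move a raw dataset the size of NA48's, which is why physical media remained the norm. The assumption of 100% link utilization is ours.

          # How long a raw dataset would take to move over a WAN link of the era,
          # using figures from the tables above (assumes 100% link utilization).

          WAN_LINK_MBITS_PER_S = 90      # two T3 lines, as quoted for D0/CDF
          RAW_DATASET_TB = 100           # NA48 total raw data

          dataset_bits = RAW_DATASET_TB * 1e12 * 8
          transfer_days = dataset_bits / (WAN_LINK_MBITS_PER_S * 1e6) / 86400

          print(f"{RAW_DATASET_TB} TB over a {WAN_LINK_MBITS_PER_S} Mb/s link: ~{transfer_days:.0f} days")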

    3. Distributed Databases

      For a remote site to take part in event reconstruction, it must have access to the geometry, calibration, hardware, and other constants typically stored in databases. Keeping the remote site up to date means that automatic tools for updating remote databases must exist. For LHC experiments, these databases could be quite large, which will put heavy demands on networking; how heavy depends, of course, on how often the databases need to be updated.
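
      As a concrete illustration of what "automatic tools for updating remote databases" might look like, the hypothetical sketch below pulls only the calibration records newer than the last successful sync; the schema, host, and endpoint are invented, and a real experiment would need versioning and validity-interval handling on top of this.

          # Hypothetical incremental sync of calibration constants to a remote site.
          # Schema, file names, host, and endpoint are invented for illustration only.

          import json
          import sqlite3
          import urllib.request

          LOCAL_DB = "calibration_replica.db"
          CENTRAL_URL = "http://central.example.org/calib/updates"  # hypothetical endpoint

          def last_sync_time(conn: sqlite3.Connection) -> str:
              row = conn.execute("SELECT COALESCE(MAX(updated_at), '1970-01-01') FROM calib").fetchone()
              return row[0]

          def pull_updates(conn: sqlite3.Connection) -> int:
              """Fetch records newer than the local replica and apply them."""
              since = last_sync_time(conn)
              with urllib.request.urlopen(f"{CENTRAL_URL}?since={since}") as resp:
                  records = json.load(resp)  # [{"channel":..., "value":..., "updated_at":...}, ...]
              conn.executemany(
                  "INSERT OR REPLACE INTO calib (channel, value, updated_at) "
                  "VALUES (:channel, :value, :updated_at)",
                  records,
              )
              conn.commit()
              return len(records)

          if __name__ == "__main__":
              conn = sqlite3.connect(LOCAL_DB)
              conn.execute("CREATE TABLE IF NOT EXISTS calib (channel TEXT PRIMARY KEY, value REAL, updated_at TEXT)")
              print(f"applied {pull_updates(conn)} updated calibration records")

      Pulling only the changed records keeps the network load proportional to the update rate rather than to the total database size, which is the property the paragraph above is asking for.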

    4. Number of Different Hardware Platforms

      A stumbling block for many of the remote institutions we talked to is that the officially distributed software for an experiment did not run on their hardware platform, either because of operating system incompatibility or because of a fundamental difference in hardware architecture. For an experiment to successfully include remote sites in its reconstruction/analysis production, a commitment must be made to do so, and effort must be supplied by the central institution, the remote site, or both, to homogenize the computing and analysis hardware and to support different platforms and configurations.

    Why do these problems exist? None of them is insurmountable or insoluble. The main issue is building a mindset of distributed computing into the computing model from the outset. The Fermilab fixed-target experiment FOCUS did much of its computing at remote sites; although it had access to fast networks, the raw data tapes were in fact shipped conventionally and the network was not used for raw data reconstruction. As the table comparing onsite to offsite CPU for FOCUS shows, distributed reconstruction and analysis were clearly necessary from the beginning. While developing a model of distributed computing as integral to the experiment may not guarantee the effective use of offsite resources for data processing, not developing such a model early on will guarantee an ineffective, or even non-existent, use of offsite computing.

    There are some clear requirements that can come out of these observations for both the central site and the remote sites. The remote sites must have:

    1. a large enough 24 X 7 support crew to ensure that critical resources for data handling have a high uptime percentage. The size of this team depends on the number and flavor of the hardware pieces of the computing architecture.

    2. a substantial software development team to ensure that all data handling, reconstruction, and analysis code is portable to the remote site, and to develop (possibly site-specific) tools to facilitate large production jobs, code and database distribution, and analysis code support. This team should also contribute to the central experiment software repository.

    3. a diverse body of users willing to beat on the system and help fix things when they break.

    4. good, clear documentation of all software and software tools.

    The central site must:

    1. make the central code repository easy to use and easily accessible for remote sites.

    2. be sensitive to the needs of remote sites in data base handling, raw data handling and machine flavors.

    3. provide good, clear documentation of all software and software tools.


  3. Computing Predictions Versus Reality

    In researching this report, it became clear that for many experiments there existed computing models and predictions written years before the experiments began taking data. (See Lessons for MONARC from previous CERN computing planning exercises.) In reviewing these documents, several lessons emerged:

    1. When working on timescales of a decade and more, the real outcome is very strongly determined by the improvement of price-performance over time. That improvement is the combination of "smooth" changes, which can be estimated reasonably, and of discontinuous changes, normally resulting from technological breakthroughs, which are exceptionally hard to predict. If the price-performance improvement due to the move from mainframes to servers in the early 1990s had not occurred, the LEP computing environment would have been far more constraining for the physics program than was actually the case.

    2. Even if colliders do not start operating at maximum luminosity on day one, it is best to assume a fairly fast build up in the estimates of computing requirements. The excitement of the initial data taking and analysis means that the production procedures cannot be completely smooth, and extra resources will be needed to cover the sometimes considerable re-processing which inevitably occurs.

    3. There seems to be no better approach to computing planning than to try to do it as well as possible. It is best to make the estimates as realistic as possible, and consciously to avoid overly restraining the estimates in order to match concepts of what might form a sensible cost.

    4. We should hope that the improvement in computing price-performance continues to be as impressive over the next 10-20 years as in the recent past. Unless that is the case then we may have some hard choices ahead of us as we try to extract the maximum amount of physics from the LHC data.


  4. Issues in Tracking Experimental Computing Models

    As we attempted to collect data for this report, we encountered several problems. Large-scale production jobs, such as event reconstruction, are generally carefully monitored, and important information such as the reconstruction time and the total amount of data in and out is well documented and easy to obtain. However, there are whole areas of the data analysis where accurate information was very hard to obtain, was completely lacking, or was not consistent among sources. This is especially true of the final physics analysis and of distributed computing. While this is understandable, it does make it difficult to profit from the experience of the past. Over the next year, many new experiments will be starting up whose computing is much closer to the LHC scale than that of the experiments reviewed in this document. It would be worthwhile to develop a plan to learn as much as possible from their experiences. Having a liaison from the LHC experiments, perhaps through MONARC, to the offline group of each major new experiment would be a start. This would provide a continuous flow of authoritative information to the LHC experiments while it can still be useful to the planning process.


Final Comments

In this document we have attempted to survey existing solutions to technical problems in computing for High Energy Physics. By definition, all of these solutions are now, or certainly will be by LHC turn-on, obsolete. New approaches to defining and structuring computing architectures for the LHC era will use new technologies and hence generate new problems. Much research and development is, and will be, needed over the next few years in order to define a working architecture for distributed computing for LHC-scale high energy physics experiments.
