MONARC Report on Existing Computing Architectures

Last Update: May 29, 1999


Introduction

This report has been prepared by the Site and Network Architecture Working Group of the MONARC project and provides a survey of the computing architectures used by a selection of current experiments. The purposes of this survey are:

1. To develop a formal framework for describing analysis architectures and to apply it to existing experiments. The framework can then be used to discuss architectural models for the LHC.

2. To provide quantitative information on the resources which have been employed in the most challenging data analysis efforts undertaken in High Energy Physics (HEP) so far. This information can provide a baseline against which the magnitude of the LHC data analysis problem can be compared.

3. To identify successful approaches that can be expected to carry over to the future and to identify key problem areas which are already known to exist at this scale of effort.

In the second section of the report, we explain the scope of the survey and our reasons for choosing a particular subset of experiments. The third section presents a brief explanation of the methodology used to collect the data and provides the information required to interpret the data tables presented in section four. In section five, we provide the conclusions drawn from the data by the members of the working group.

Scope of the Survey

For reasons described below, we believe that we can learn only a limited amount from existing experiments. The survey is therefore not exhaustive, and concentrates on a representative selection of LEP, HERA, FNAL Collider, and large CERN and FNAL fixed-target experiments. We considered experiments that had, by past standards, very large volumes of data, high rates, high degrees of complexity, and large, geographically dispersed collaborations. We included in the list experiments that are running now at CERN, such as NA48 and NA45, which have large amounts of data, at least by current standards, and are being used to try out new models of computing designed to be relevant to future experiments with even larger datasets.

We believe that this survey will produce only a limited amount of information that can be used to guide the development of computing architectures for LHC experiments. Many of these experiments are quite mature, and their initial computing architecture was established in the late 1980's or early 1990's. Computing technology, software, and networking were very different then from what they are today. While all of these experiments have evolved to take partial advantage of new developments in computing, they have all been heavily influenced and constrained by their initial conditions. In particular, their detailed hardware solutions are unlikely to be applicable to the timeframe of the LHC startup. We can, however, learn much about the accuracy with which these groups were able to forecast their computing needs, their ability to reconfigure while operating, and the speed with which they could introduce new technologies and methods. One area on which we focussed was the degree to which analysis was centralized or distributed, and the reasons why particular patterns emerged.

After some discussion, we decided to contact the eleven experiments listed in the next section.

Methodology of the Survey

The survey was carried out by first formulating a list of questions and then appointing a 'summarizer' for each experiment, who contacted recognized leaders of computing in that experiment to obtain the answers. The answers can, of course, depend to a degree on the view of the respondent for each experiment, and often represent a snapshot of the computing model at a particular point in time. We would like to thank the contact people for their effort in answering our questions, which were non-trivial to understand and research. In addition, terminology can vary enough between different computing models that it does not always translate well.

The list of experiments, summarizers and contact people is:

Experiment | Summarizer | Contact Person
LEP experiments:
ALEPH | Tim Smith | Marco Cattaneo
DELPHI | Tim Smith | Richard Gokieli
L3 | Tim Smith | Bob Clare
OPAL | Tim Smith | Ann Williamson, Gordon Long, Steve O'Neal
CERN fixed-target experiments:
NA48 | Tim Smith | Bernd Panzer-Steindel, Monica Pepe
NA45 | Tim Smith | Bernd Panzer-Steindel, Andreas Pfeiffer
FNAL experiments:
CDF | Mike Diesburg | Eric Wicklund
D0 | Mike Diesburg | Mike Diesburg
KTeV | Vivian O'Dell | Vivian O'Dell
FOCUS | Vivian O'Dell | Harry Cheung
DESY experiments:
ZEUS | Ian McArthur | Dave Baily

The questions that were posed were the following:

Survey of Experiments - Form

Experimental Parameters:
  CPU
  Data Sizes
  Monte Carlo Parameters
  Aggregate I/O
Networking:

The results of these questionnaires were then distilled into the tables presented in the next section. Question marks in the tables indicate that the requested information could not be obtained -- usually because the experiment did not, or could not, track it. The areas where the question marks occur are therefore, in themselves, significant. The information in the tables and the conclusions drawn in the final section of the report constitute the results of this study.

Quantitative Results of the Survey

Experiment | Event Recon. Time | Event Analysis Time | Event Size | Collaboration Size
D0 | 45 Si95-secs | ? | 600 kB | 32 institutes, 400 members
CDF | 35 Si95-secs | ? | 130 kB | 47 institutes, 460 members
ZEUS | 33 Si95-secs | 1.0 Si95-secs | 100 kB | 52 institutes, 450 members
ALEPH | 2.6 Si95-secs | ? | 30 kB | 32 institutes, 493 members
DELPHI | 5 Si95-secs | ? | 50 kB | 55 institutes, 534 members
L3 | 7.7 Si95-secs | 2.0 Si95-secs (average) | 40 kB | 54 institutes, 619 members
OPAL | 2.0 Si95-secs | 1.0 Si95-secs | 20 kB | 35 institutes, 418 members
NA45 | 6.4 Si95-secs | 0.01 Si95-secs | 200 kB | 8 institutes, 45 members
NA48 | 0.15 Si95-secs | ? | 15 kB | 16 institutes, 207 members
KTeV | 0.0465 Si95-secs | 0.07 Si95-secs | 7 kB | 12 institutes, 80 members
FOCUS | 4 Si95-secs | 0.25-1.0 Si95-secs | 4 kB | 20 institutes, 120 members

Note that there are many gaps in the "Event Analysis Time" column of the table. This is because the analysis time varies widely with the type of analysis being done, so this number (even as a range) is not well defined.
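
As an illustration of how the per-event figures above combine with the dataset sizes reported later in this section, the following back-of-the-envelope sketch estimates the CPU time for one full reconstruction pass of the D0 raw data (45 Si95-secs per event, 600 kB per event, 30 TB of raw data, 275 SpecInt95 of onsite capacity). It is an illustrative calculation only; the assumption of 100% farm efficiency is ours.

    # Back-of-the-envelope reconstruction budget using the D0 figures quoted
    # in the tables of this section (illustrative only; assumes 100% efficiency).

    RECON_TIME_SI95_S = 45        # Si95-seconds per event
    EVENT_SIZE_BYTES = 600e3      # 600 kB per raw event
    RAW_DATA_BYTES = 30e12        # 30 TB of raw data
    FARM_CAPACITY_SI95 = 275      # onsite CPU from the resources table

    n_events = RAW_DATA_BYTES / EVENT_SIZE_BYTES
    total_si95_seconds = n_events * RECON_TIME_SI95_S
    wall_clock_days = total_si95_seconds / FARM_CAPACITY_SI95 / 86400

    print(f"events to reconstruct : {n_events:.1e}")
    print(f"total CPU work        : {total_si95_seconds:.1e} Si95-seconds")
    print(f"one pass on the farm  : {wall_clock_days:.0f} days")

With these numbers, a single reconstruction pass works out to roughly three months of dedicated farm time, which helps explain why reconstruction was typically run quasi-online rather than repeated casually.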

Experiment | Central CPU for I/O (SpecInt95) | Onsite CPU (SpecInt95) | Offsite CPU (SpecInt95) | Onsite/Offsite Reconstruction | Onsite/Offsite Data Analysis
D0 | 20 | 275 | 25 | all onsite | 95% onsite/5% offsite
CDF | 180 | 100 | 25 | all onsite | all onsite
ZEUS | 150 | 170 | ? | all onsite | all onsite
ALEPH | 130 | 170 | ? | mainly onsite | 50% onsite/50% offsite
DELPHI | 250 | 265 | ? | all onsite | 50% onsite/50% offsite
L3 | 125 | 500 | ? | all onsite | 90% onsite/10% offsite
OPAL | 190 | 645 | ? | all onsite | ?
NA45 | 587 | None | ? | all onsite | 0% onsite/100% offsite
NA48 | 600 | 50 | ? | all onsite | ?
KTeV | 280 | 195 | 460 during analysis | 80% onsite/20% offsite
FOCUS | 9 | 35 | 340 (?) | all onsite | 10% onsite/90% offsite

Many experiments provided most of their own computing resources during preparation and initial data taking. This took many forms, including private IBM mainframes and VAX clusters for the exclusive use of individual institutes, and farms of emulators, VAXes, and RISC machines for dedicated services such as online filtering, quasi-online event reconstruction, and analysis farms for DST access. For some experiments, spare cycles on university-funded desktop machines were used for experiment-wide simulation via batch jobs or other job distribution systems, and in groups for specific analysis topics where particular machines hosted the data disks.

We chose to list machines as "onsite" if they met the following criteria:

Thus, machines listed as "onsite" CPU in the table correspond to their use from the point of view of distributed computing architectures, even though they may be considered "offsite" by funding agencies and by COCOTIME (CERN's computer resource allocation/oversight committee).

Experiment | Total Raw Data (TB) | Total DST Data (TB) | DST Event Size | Onsite/Offsite Data Storage
D0 | 30 | 2 | 50 kB | 100% onsite/1% offsite
CDF | 8.3 | 2 | 32 kB | 100% onsite/<5% offsite
ZEUS | 3.5/year | 5.3/year | 150 kB | 100% onsite
ALEPH | 3.5 | 0.94 | 1.5 kB | 100% onsite/100% offsite
DELPHI | 3.5 | 0.55 | 31 kB | ?
L3 | 3.4 | 0.6 | 7.1 kB | 100% onsite
OPAL | 2.7 | 1.0 | 7.3 kB | 100% onsite
NA45 | 2.0 | 0.4 | 100 kB | 100% onsite
NA48 | 100 | 1.5 | 1.7 kB | 100% onsite
KTeV | 45 | N/A | 7 kB | 100% onsite/25% offsite
FOCUS | 25 | 18 (L1/L2 DSTs) | 1-3 kB | 100% onsite/?% offsite

Experiment | MC Event Generation Time | MC Event Size | Onsite/Offsite Generation | Total MC Data Set Size
D0 | 125 Si95-secs | 1 MB | 50% onsite/50% offsite | ?
CDF | 25 Si95-secs | 100 kB | 95%/5% | ?
ZEUS | 36 Si95-secs | 45 kB | 40%/60% | 9 TB
ALEPH | 7.8 Si95-secs | 100 kB | 50%/50% | 2 TB
DELPHI | 100-500 Si95-secs | 500 kB | 20%/80% | 5 TB
L3 | 300 Si95-secs | 350 kB | 45%/55% | ?
OPAL | 200 Si95-secs | 100-300 kB | 85%/15% | 12 TB
NA45 | 900 Si95-secs | 50 kB | 40%/60% | 2 TB
NA48 | 0.4-0.8 Si95-secs | 30 kB | 0%/100% (50%-70% Lyon) | 0.15 TB
KTeV | 0.2 Si95-secs | 2 kB | ? | ?
FOCUS | ? | 0.5-3 kB | ? | ?

Experiment | Onsite Central CPU I/O | Onsite Distributed CPU I/O | Offsite I/O
D0 | 10 MB/s | 30 MB/s | 3.2 MB/s
CDF | 5 MB/s | all data access done centrally | 4.7 MB/s

Experiment | LAN network aggregate rate | WAN network aggregate rate
D0 | 3 x 100 Mb/s FDDI backbones, 300 Mb/s | 2 T3 lines, 90 Mb/s
CDF | 100 Mb/s FDDI backbone | 2 T3 lines, 90 Mb/s
ZEUS | 800 Mb/s HIPPI (analysis), 100 Mb/s Ethernet (reconstruction), 10 Mb/s Ethernet (desktop) | 34 Mb/s (DESY - European network)
ALEPH | 800 Mb/s HIPPI; switched 100 Mb/s Ethernet in Computer Center; 10 Mb/s Ethernet on desktop | ?
DELPHI | 800 Mb/s HIPPI + Gb Ethernet + switched 100 Mb/s Ethernet in Computer Center; 10 Mb/s Ethernet on desktop | ?
L3 | 800 Mb/s HIPPI; switched 100 Mb/s Ethernet in Computer Center; FDDI and 10 Mb/s Ethernet on desktop | ?
OPAL | 800 Mb/s HIPPI; switched 100 Mb/s Ethernet in Computer Center; 10 Mb/s Ethernet on desktop | ?
NA45 | 800 Mb/s HIPPI + Gb Ethernet + switched 100 Mb/s Ethernet in Computer Center; 10 Mb/s Ethernet on desktop | ?
NA48 | 800 Mb/s HIPPI + Gb Ethernet + switched 100 Mb/s Ethernet in Computer Center; 10 Mb/s Ethernet on desktop | ?

Analysis and Conclusions

  1. Comparisons with LHC Experiments

    While it is understood that the magnitude of the computing required by the LHC experiments far exceeds anything that has been achieved so far in HEP, it is useful to quantify this by comparing the projected needs for ATLAS and CMS to the surveyed experiments. Below is a table comparing the above experiments with an estimate of the size of one LHC experiment. For illustration, current numbers for BaBar and estimates for CDF/D0 Run 2 are also shown. Note that the BaBar numbers listed are at turn-on. (We are indebted to Charlie Young for information about BaBar.) The LHC numbers come from estimates for CMS offline computing during the first year of operation, currently projected for 2006 (see "Rough Sizing Estimates for a Computing Facility for a Large LHC Experiment" by Les Robertson).

    Experiment | Onsite CPU (Si95) | Onsite Disk (TB) | Onsite Tape (TB) | LAN Capacity | Data Import/Export | Box Count
    LHC | 520,000 | 540 | 3000 | 46 GB/s | 10 TB/day (sustained) | ~1400
    CDF Run 2 | 12,000 | 20 | 800 | 1 Gb/s | 18 MB/s | ~250
    D0 Run 2 | 7,000 | 20 | 600 | 300 Mb/s | 10 MB/s | ~250
    BaBar | ~6000 | 8 | ~300 | mix of Gb/s and 100 Mb/s Ethernet to servers | ~400 GB/day | ~400
    D0 | 295 | 1.5 | 65 | 300 Mb/s | ? | 180
    CDF | 280 | 2 | 100 | 100 Mb/s | ~100 GB/day | ?
    ALEPH | 300 | 1.8 | 30 | 1 Gb/s | ? | 70
    DELPHI | 515 | 1.2 | 60 | 1 Gb/s | ? | 80
    L3 | 625 | 2.0 | 40 | 1 Gb/s | ? | 160
    OPAL | 835 | 1.6 | 22 | 1 Gb/s | ? | 220
    NA45 | 587 | 1.3 | 2 | 1 Gb/s | 5 GB/day | 30
    NA48 | 650 | 4.5 | 140 | 1 Gb/s | 5 GB/day | 50
    KTeV | 475 | 1 | 50 | 100 Mb/s FDDI | 150 GB/day | 6

    It is clear from this table that the scale of the computing problem in the surveyed current experiments is so much smaller than for LHC experiments that they cannot possibly define a framework for the LHC. However, the point behind this survey is to make the best possible use of the experiences of the past in formulating what is essentially a new model of computing for the LHC era.
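
    To make the gap concrete, the short sketch below converts two entries of the LHC row into more familiar terms and compares the CPU figure with the largest onsite capacity among the surveyed experiments (OPAL's 835 Si95); the choice of comparison rows is ours and the arithmetic is purely illustrative.

        # Rough scale comparison between the LHC row and the largest surveyed
        # experiment in the table above (illustrative arithmetic only).

        LHC_CPU_SI95 = 520_000
        LARGEST_SURVEYED_CPU_SI95 = 835          # OPAL onsite CPU
        LHC_IMPORT_EXPORT_TB_PER_DAY = 10        # sustained

        cpu_ratio = LHC_CPU_SI95 / LARGEST_SURVEYED_CPU_SI95
        sustained_mb_per_s = LHC_IMPORT_EXPORT_TB_PER_DAY * 1e6 / 86400  # TB/day -> MB/s

        print(f"LHC CPU vs largest surveyed experiment: ~{cpu_ratio:.0f}x")
        print(f"10 TB/day sustained import/export     : ~{sustained_mb_per_s:.0f} MB/s, around the clock")

    A factor of several hundred in CPU alone, sustained around the clock, is what separates the LHC problem from anything in the survey.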

  2. Issues Related to Distributed Computing

    In discussions with experiments about how to improve remote data processing, the following issues were those most commonly raised.

    1. System and Code Management Issues

      Probably the issue raised most often during the survey is that there are not enough people at any individual remote institution to ensure that reconstruction and analysis code is kept up to date. This is especially true when the software is under development and in the first year of running when software is most likely to change often. Although there are tools that have been, and continue to be, developed for both code management and code distribution, remote institutions still find it difficult to keep software current.

      Professional system management and 7 x 24 coverage are essential for a remote institution to successfully run production. In addition, at least one full-time software engineer familiar with the software, and physicists to test each local release, are needed to keep the site up to date.
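
      The experiments surveyed did not report a common mechanism for keeping remote software current, but the kind of tool the respondents describe is simple in outline; the sketch below is a hypothetical example of mirroring one tagged central release to a remote site with rsync, where the host name, paths, and tag convention are invented for illustration.

          # Hypothetical release-mirroring script of the kind a remote site might
          # run nightly to stay current with the central code repository.
          # Host name, paths, and the release tag are invented for illustration.

          import subprocess
          import sys

          CENTRAL_HOST = "central.example.org"              # hypothetical central server
          CENTRAL_RELEASES = "/releases"                    # tagged releases live here
          LOCAL_RELEASES = "/usr/local/experiment/releases"

          def mirror_release(tag: str) -> None:
              """Copy one tagged release tree from the central site, removing stale files."""
              src = f"{CENTRAL_HOST}:{CENTRAL_RELEASES}/{tag}/"
              dst = f"{LOCAL_RELEASES}/{tag}/"
              subprocess.run(["rsync", "-az", "--delete", src, dst], check=True)

          if __name__ == "__main__":
              release_tag = sys.argv[1] if len(sys.argv) > 1 else "current"
              mirror_release(release_tag)
              print(f"release '{release_tag}' mirrored to {LOCAL_RELEASES}")

      Even with such automation in place, someone at the remote site still has to validate each release against local data, which is where the dedicated software engineer and test physicists come in.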

    2. Raw Data Access

      In our survey we found that many remote institutions are running large Monte Carlo productions, while not being able to efficiently run data production jobs. This is mainly due to the inability of the remote institutions to access and store large quantities of raw data.

      For remote sites, raw data access is typically through exchange of physical media. To date, wide area network speeds have prevented most experiments from collecting data in real time or semi-real time over a network. All of the experiments report archiving the full set of raw data onsite for safekeeping, and raw data is copied before being shipped offsite.

      Although remote centers often have considerable CPU capacity, most of the remote sites we polled did not have sufficient resources to handle large datasets. Clearly, for a remote site to fully participate in raw data reconstruction and analysis, it must be able to provide users easy access to the data.
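
      The sketch below makes the network constraint explicit, using figures from the tables above: at the 90 Mb/s quoted for the FNAL T3 links, even a perfectly utilized link would need on the order of a hundred days to move a raw dataset the size of NA48's, which is why physical media remained the norm. The assumption of 100% link utilization is ours.

          # How long a raw dataset would take to move over a WAN link of the era,
          # using figures from the tables above (assumes 100% link utilization).

          WAN_LINK_MBITS_PER_S = 90      # two T3 lines, as quoted for D0/CDF
          RAW_DATASET_TB = 100           # NA48 total raw data

          dataset_bits = RAW_DATASET_TB * 1e12 * 8
          transfer_days = dataset_bits / (WAN_LINK_MBITS_PER_S * 1e6) / 86400

          print(f"{RAW_DATASET_TB} TB over a {WAN_LINK_MBITS_PER_S} Mb/s link: ~{transfer_days:.0f} days")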

    3. Distributed Databases

      For a remote site to take part in event reconstruction, it must have access to the geometry, calibration, hardware, and other constants typically stored in databases. Keeping the remote site up to date means that automatic tools for updating remote databases must exist. For LHC experiments, these databases could be quite large, which will put heavy demands on networking; how heavy depends, of course, on how often the databases need to be updated.
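
      As a concrete illustration of what "automatic tools for updating remote databases" might look like, the hypothetical sketch below pulls only the calibration records newer than the last successful sync; the schema, host, and endpoint are invented, and a real experiment would need versioning and validity-interval handling on top of this.

          # Hypothetical incremental sync of calibration constants to a remote site.
          # Schema, file names, host, and endpoint are invented for illustration only.

          import json
          import sqlite3
          import urllib.request

          LOCAL_DB = "calibration_replica.db"
          CENTRAL_URL = "http://central.example.org/calib/updates"  # hypothetical endpoint

          def last_sync_time(conn: sqlite3.Connection) -> str:
              row = conn.execute("SELECT COALESCE(MAX(updated_at), '1970-01-01') FROM calib").fetchone()
              return row[0]

          def pull_updates(conn: sqlite3.Connection) -> int:
              """Fetch records newer than the local replica and apply them."""
              since = last_sync_time(conn)
              with urllib.request.urlopen(f"{CENTRAL_URL}?since={since}") as resp:
                  records = json.load(resp)  # [{"channel":..., "value":..., "updated_at":...}, ...]
              conn.executemany(
                  "INSERT OR REPLACE INTO calib (channel, value, updated_at) "
                  "VALUES (:channel, :value, :updated_at)",
                  records,
              )
              conn.commit()
              return len(records)

          if __name__ == "__main__":
              conn = sqlite3.connect(LOCAL_DB)
              conn.execute("CREATE TABLE IF NOT EXISTS calib (channel TEXT PRIMARY KEY, value REAL, updated_at TEXT)")
              print(f"applied {pull_updates(conn)} updated calibration records")

      Pulling only the changed records keeps the network load proportional to the update rate rather than to the total database size, which is the property the paragraph above is asking for.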

    4. Number of Different Hardware Platforms

      A stumbling block for many of the remote institutions we talked to is that the officially distributed software for an experiment did not run on their hardware platform, either because of operating system incompatibility or because of a fundamental difference in hardware architecture. For an experiment to successfully include remote sites in its reconstruction/analysis production, a commitment must be made to do so, and effort must be supplied by the central institution, the remote site, or both, to homogenize the computing and analysis hardware and to support different platforms and configurations.

    Why do these problems exist? None of them is insurmountable or insoluble. The main issue is building a mindset of distributed computing into the computing model from the outset. The Fermilab fixed-target experiment FOCUS did much of its computing at remote sites; although it had access to fast networks, the raw data tapes were in fact shipped conventionally and the network was not used for raw data reconstruction. As the table comparing onsite to offsite CPU for FOCUS shows, distributed reconstruction and analysis were clearly necessary from the beginning. While developing a model of distributed computing as integral to the experiment may not guarantee the effective use of offsite resources for data processing, not developing such a model early on will guarantee an ineffective, or even non-existent, use of offsite computing.

    There are some clear requirements that can come out of these observations for both the central site and the remote sites. The remote sites must have:

    1. a large enough 24 X 7 support crew to ensure that critical resources for data handling have a high uptime percentage. The size of this team depends on the number and flavor of the hardware pieces of the computing architecture.

    2. a substantial software development team to ensure that all data handling, reconstruction, and analysis code is portable to the remote site, and to develop (possibly site-specific) tools to facilitate large production jobs, code and database distribution, and analysis code support. This team should also contribute to the central experiment software repository.

    3. a diverse body of users willing to beat on the system and help fix things when they break.

    4. good, clear documentation of all software and software tools.

    The central site must:

    1. make the central code repository easy to use and easily accessible for remote sites.

    2. be sensitive to the needs of remote sites in data base handling, raw data handling and machine flavors.

    3. provide good, clear documentation of all software and software tools.


  3. Computing Predictions Versus Reality

    In researching this report, it became clear that for many experiments there existed computing models and predictions written years before the experiments began taking data. (See Lessons for MONARC from previous CERN computing planning exercises.) In reviewing these documents, several lessons emerged:

    1. When working on timescales of a decade and more, the real outcome is very strongly determined by the improvement of price-performance over time. That improvement is the combination of "smooth" changes, which can be estimated reasonably, and of discontinuous changes, normally resulting from technological breakthroughs, which are exceptionally hard to predict. If the price-performance improvement due to the move from mainframes to servers in the early 1990s had not occurred, the LEP computing environment would have been far more constraining for the physics program than was actually the case.

    2. Even if colliders do not start operating at maximum luminosity on day one, it is best to assume a fairly fast build up in the estimates of computing requirements. The excitement of the initial data taking and analysis means that the production procedures cannot be completely smooth, and extra resources will be needed to cover the sometimes considerable re-processing which inevitably occurs.

    3. There seems to be no better approach to computing planning than to try to do it as well as possible. It is best to make the estimates as realistic as possible, and consciously to avoid overly restraining the estimates in order to match concepts of what might form a sensible cost.

    4. We should hope that the improvement in computing price-performance continues to be as impressive over the next 10-20 years as in the recent past. Unless that is the case then we may have some hard choices ahead of us as we try to extract the maximum amount of physics from the LHC data.


  4. Issues in Tracking Experimental Computing Models

    As we attempted to collect data for this report, we encountered several problems. Large-scale production jobs, such as event reconstruction, are generally carefully monitored, and important information such as the reconstruction time and the total amount of data in and out is well documented and easy to obtain. However, there are whole areas of the data analysis where accurate information was very hard to obtain, was completely lacking, or was not consistent among sources. This is especially true of the final physics analysis and of distributed computing. While this is understandable, it does make it difficult to profit from the experience of the past. Over the next year, many new experiments will be starting up whose computing is much closer to the LHC scale than that of the experiments reviewed in this document. It would be worthwhile to develop a plan to learn as much as possible from their experiences. Having a liaison from the LHC experiments, perhaps through MONARC, to the offline group of each major new experiment would be a start. This would provide a continuous flow of authoritative information to the LHC experiments while it can still be useful to the planning process.


Final Comments

In this document we have attempted to survey existing solutions to technical problems in computing for High Energy Physics. By definition, all of these solutions are now, or certainly will be by LHC turn-on, obsolete. New approaches to defining and structuring computing architectures for the LHC era will use new technologies and hence generate new problems. Much research and development is, and will be, needed over the next few years in order to define a working architecture for distributed computing for LHC-scale high energy physics experiments.
