Regional Centers for LHC computing

The MONARC Architecture Group

DRAFT Version 5.1
Last Update May 20th, 16:10

"Madamina il catalogo e' questo..."
Don Giovanni (Da Ponte - Mozart)

This document tries to establish a framework, within the MONARC project, for the development and eventual testing and validation of models for distributed computing using Regional Centers working together with the main computing facility for LHC experiments at CERN.

Given the existence of significant computing facilities outside CERN for LHC experiments, we envision a hierarchy of computing centers ranging from very large, expensive multi-service facilities to special purpose service facilities such as large PC farms.

We classify a computing center in terms of the kinds of services, described in detail below, and the scale of services it provides. We define a "Regional Center" (RC) in the following to be a multi-service center which provides significant resources and support. We expect there to be smaller "special service centers", which, for example, provide facilities for Monte Carlo event generation or Documentation support.

This document focuses on the Regional Centers.

Glossary

In the following sections a number of acronyms and terms are used which need a clear definition. Some of these terms are adopted for sake of clarity in the discussion, and may have a different definition for the same term used by one or more experiments. This glossary defines the terms used in the context of this document.

Access (to data) vs. Retrieval (of data): "Data access" refers to the specific situation where data are stored in a database. This implies the necessity of defining access patterns, evaluating access efficiency and enforcing access policies. Data retrieval refers to making available to a job, or a community, data previously archived in whichever form.

AOD (Analysis Object Data): refers to objects which facilitate analysis, but which by construction are not larger than 10 Kbytes. So, if an analysis requires information which do not fit in this limited size, this analysis should access other objects (possibly larger in size). An old time term would be "micro-dst" or "ntuple".

Bookkeeping: refers not only to production logging (as tape numbers or filenames) but also to the complete set of information allowing the user to be aware of the nature and quality of the data analyzed. For instance, with which version of the reconstruction program and with which calibration constants the data have been reconstructed or on which files/tapes a dataset is stored.

Calibration Data: refers to diverse kind of information generally acquired online (which maybe strictly calibration constants, or monitoring data) as well as special runs taken for calibration purposes. A cosmic run is included in this category. Collections of event data used for alignment and calibration studies are not classified as calibration data.

Catalog: this term refers to the collection of bookkeeping information and NOT to the Objectivity "catalog" (which is the list of full pathnames for all the database files in a federation). In this context a catalog might, or might not, be implemented in Objectivity itself. The catalog may be important as an added layer of functionality on top of the database (whichever database), for instance mapping human readable easy-to-decipher information to filenames.

Data Caching: this term refers to the capability of holding a copy of frequently accessed data on rapidly accessible storage medium, under algorithmic control to minimize data turnaround and maximize user response.

Data Mirroring: this term refers to automatic procedures which maintain identical (synchronised) copies of (parts of) a database in two or more locations. One copy is normally regarded as the "master copy" and the others as "mirror copies".

ESD (Event Summary Data): refers to physics objects by construction not larger than 100 Kbytes. An old time term would be "mini-dst".

Retrieval (of data): see Access

Tag: the term tag refers to very small objects (100 to 500 bytes) which identify (tag) an event by its physics signature. A tag could be a set of 96 bits each tagging a given physics channel or it could be a set of 10 words with some packed overall information for an event. In this document the tag is NOT an object identifier (i.e. a set of bits identifying uniquely an object in a database). An old time term would be "nano-dst" or "ntuple" The border between AOD and Tags is well defined in size but less well defined in functionality. (In the same way at LEP one could have ntuples of 10 words per event or of 100 words per event).

Finally the document in all its parts is dealing with official data sets. So, for instance, "creation of AOD" means that AOD for the collaboration or its analysis groups are created from larger data sets by the production team on the basis of criteria agreed upon in common analysis meetings. Personal data of any kind are not considered in this document.

Tasks

The offline software of each experiment is supposed to perform the following tasks:

data reconstruction (may include more than a single step, such as preprocessing, reduction and streaming; such steps might be done online)
MC production (includes generation, simulation and reconstruction)
offline (re)calibration
successive data reconstruction
analysis

Services

To execute completely and successfully the above tasks, services are required. A starting list includes:

Data Services

(re)processing of data through the official reconstruction program [requires CPU, storage, bookkeeping, SW support]
generation of events [requires little CPU and storage, bookkeeping, SW support]
simulation of events [requires a lot of CPU, storage, bookkeeping, SW support]
reconstruction of MC events [see point 1]
insertion of data into the database
creation of the official ESD/AOD/tags
updating of the official ESD/AOD/tags under new conditions
ESD/AOD/tag access (possibly with added layers of functionalities)
data archival/retrieval for all formats (including media replication, tape copying)
bookkeeping (includes format/content definition, relation with DB)

Technical Services

database maintenance (including backup, recovery, installation of new versions, monitoring and policing)
basic and experiment specific sw maintenance (backup, updating, installation)
production of tools for data services
production and maintenance of documentation (including Web pages)
storage management (disks, tapes, distributed file systems if applicable)
CPU usage monitoring and policing
database access monitoring and policing
I/O usage monitoring and policing
network maintenance (as appropriate)
support of large bandwidth

Basic Assumptions

There exists one "central site" (CERN): the central site is able to provide all the services. The following steps happen at the central site only:

Online data acquisition and storage
Possible data pre
processing before first reconstruction
First data reconstruction

Other production steps (calibration data storage, creation of ESD/AOD/tags) are shared between CERN and the RCs.

The central site holds:

a complete archive of all raw data
a master copy of the calibration data (including geometry, gains etc...)
a complete copy of all ESD, AOD, tags possibly online

Data taking estimate is:

1 PB raw data per year per experiment
10**9 events (1 MB each) per year per experiment
100 days of data taking (i.e. 10**7 events per day per experiment)

Current estimates for a single LHC experiment capacity to be installed by 2006 at CERN are (see Robertson):

520,000 SI95 for CPU, covering data recording, first-pass reconstruction, some reprocessing, basic analysis of the ESD, support for 4 analysis groups
about 1400 boxes to be managed
540 TBs of disk capacity
3 PBs of automated tape capacity
46 GB/s LAN throughput

One can assume that 10 to 100 TB of disk space is allocated to AOD/ESD/tags at the central site.

In the following, resources for the RC will be expressed in terms of percentage of the CERN ones.

Motivations for a RC

The primary motivation for a hierarchical collection of computing resources, called Regional Centers, is to maximize the intellectual contribution of physicists all over the world, without requiring their physical presence at the CERN. An architecture based on RCs allows an organization of computing tasks which may take advantage of physicists no matter where they are located. Next, the computing architecture based on RCs is an acknowledgement of the facts of life about network bandwidths and costs. Short distance networks will always be cheaper and higher bandwidth than long distance (especially intercontinental) networks. A hierarchy of centers with associated data storage ensures that network realities will not interfere with physics analysis. Finally, RCs provide a way to utilize the expertise and resources residing in computing centers throughout the world. For a variety of reasons it is difficult to concentrate resources (not only hardware but, more importantly, personnel and support resources) in a single location. A RC architecture will provide greater total computing resources for the experiments by allowing flexibility in how these resources are configured and located. A corollary of these motivations is that the RC model allows one to optimize the efficiency of data delivery/access by making appropriate decisions on processing the data

where it resides
where the largest CPU resources are available, or
nearest to the user(s) doing the analysis.

Under different conditions of network bandwidth, required turnaround time, and the future use of the data, different choices among the alternatives (1) - (3) may be optimal in terms of resource utilization or responsiveness to the users.

Capabilities

An RC should provide all the technical services, all the data services needed for the analysis and preferably another class of data services (MC production or data reprocessing, not necessarily from raw data).

An RC could serve more than one experiment; this possibility adds some implications which will not be addressed in this document.

The aggregated resources of all RCs should be comparable to CERN resources; we expect that there will be between 5 and 10 RCs supporting each experiment. As a consequence a RC should provide resources to the experiment in the range 10 to 20% of CERN (although the functionality provided by automatic mass storage systems might be provided differently).

Furthermore, a RC should be capable of coping with the experiments' needs as they increase with time, evolving in resource availability and use of technology.

Constituency

The RC constituency will be here discussed in terms of local and remote users accessing the RC services and resources. The present discussion applies in equal terms to RCs serving one or more experiments. However the situation in which the RC resources are shared by more than one experiment adds some complications to our analysis; such complications will not be discussed in this document.

Two general considerations may alter the proposed scheme:

whether issues of network connectivity place one set of physicists, say those in the nation or geographic region where the center is located, "closer" in terms of data access capability to the center than to other collaboration resources.
how the center was funded and whether the funding agency has particular requirements or expectations with respect to the priority of various constituents.

Following these considerations it would be very appropriate to provide higher priority to serving people close to the center who would otherwise have trouble getting resources and actively participating to the analysis.

Given this complexity, there are several possible "constituencies" which might be served by a RC with various priorities depending on circumstances:

Physicists in the region where the center resides: the analysis resources supplied to these physicists represent the region's fair share of support for the experiment's data analysis.
The whole collaboration: this involves carrying out production activities and providing support services that represent the region's contribution to the common production and support effort.
Groups of physicists at other RCs: where appropriate providing services to other RCs to avoid unnecessary duplication of services or effort. (In return, this center would use the services of other RCs so that it did not have to duplicate their resources.)
Members of other regions: providing service on a "best effort" basis or on the basis of an agreement with the collaboration to members outside the region to maximize the effectiveness of the collaboration in carrying out the data analysis.

The centers will probably each need to reach agreement with their local physicists and with the collaborations they serve on how resources will be allocated and priorities assigned among various constituencies and activities.

Data Profile

An RC should maintain a large fraction of the ESD/AOD/tags, possibly all of them for given physics channels in case of prioritized data access to defined analysis groups. As the LHC program unfolds, the RC will have to accomodate the increasing volume of data by expanding its storage and access capabilities with incremental (annual) funding and by managing the data so that parts of it that are not frequently accessed are either migrated to cheaper, less rapidly accessible (possibly archival) storage or dropped altogether from the repository based on agreements with the experiment.

An RC should maintain a fixed statistical fraction of fully reconstructed data. Calibration constants are not an issue; they are a tiny fraction of the data (GB vs. TB). Moreover past experience shows that they are not used in physics analysis, once one trustes the reconstruction. Calibration data (special runs) should only go where calibration studies are performed. Bookkeeping data (the catalog) should be everywhere and synchronized. The RC should implement a mechanism of data caching and/or mirroring. Data caching, and mirroring of portions of the data, will be of great value in improving the speed of completing an analysis in cases where network bandwidth is limited. A distributed model of computing with several RCs makes caching possible, and will increase the efficiency experienced at any individual Center.

Communication Profile

The mechanism of data transfer from the CERN to the RC depends also on the underlying network capability. The RC should have high performance network with maximum available bandwidth towards CERN; at the same time excellent connectivity and throughput should be provided to the RC users.

Data sets should preferrably be transported via network. If network bandwidth, cost or other factors made this unfeasible for all data needed at the RC, then priority should be given to network tansfer of smaller size data (ESD/AOD/tags as long as they are produced); larger size data, could then be shipped on removable data volumes.

Data access at the RC should be fast, lest one looses all the advantage of computing at the RC. So, for example, if analysis service is concentrated on physics channels, all data needed for that channel should be online.

Collaboration, Dependency

Software Collaboration: the RC is supposed to share common procedures for maintenance, validation and operation of production software. Specific procedures as possible localization to given platforms or site specific software tools are sole responsibility of the RC.

The dependence of the RC on CERN can be further classified in:

data dependence
software dependence
synchronization mechanisms.

Data Dependence: the obvious part concerns data which are copies of the CERN ones (ESD/AOD/tags, reconstructed, calibrations). These data are transferred to the RC regularly to fulfil RC services. Less obvious is "creation" of new data at the RC; example: reprocessing of data done at the RC, creation of official ESD/AOD/tags for given analysis channel etc. In the latter case the RC holds a unique copy of these data until transfer to the CERN. So, assuming the capability of RC to reprocess data with the official offline software, the management issues involved depend on decisions taken by the collaborations.

Software Dependence: the RC relies on delivery from CERN of the reconstruction software, through the appropriate mechanisms set up in the experiments.

Synchronization Mechanisms: this depends on the underlying OODB; if it is Objectivity and if one has a single federation with CERN, then data will be synchronized by the Objy mechanism. If one has different federations then synchronization becomes a management issue, and automatic procedures must be envisaged. In both cases the catalog must be kept in sync continuously; this may be demanding in terms of network bandwidth, reliability and of backup procedures in case of network failure.

Conclusions

LHC era experiments, with unprecedented levels of size, complexity, and world-wide participation, demand new computing strategies. In particular, the collaborations must make effective use of resources, both human and machine, located throughout the world. This leads naturally to consider models where regional centers perform major portions of the computing activities. The considerations discussed in this report suggest a need for regional centers with unprecedented amounts of computing resources. Nevertheless, the proposed scope for these centers is within the reach of multiple collaborating countries or regions and existing computing centers. This working document is the first step in a process to define and model these centers. Enough flexibility is allowed to accomodate differing needs of different regions and experiments. We hope this encourages serious discussion and planning on the part of the experiments and prospective regional computing centers, as well as further consideration of the role of desktop systems and other portions of the total LHC computing environment.