June 29, 1998

 

            Data Management and Computing Using
                      Distributed Architectures

                           PROJECT ASSIGNMENT PLAN

Prepared by:

1.0 INTRODUCTION
 
The LHC experiments have envisaged Computing Models (CM) involving hundreds of physicists doing analysis at institutions around the world. CMS and ATLAS are also considering the use of Regional Centres, each of which could complement the functionality of the CERN Centre. The Regional Centres are intended to facilitate access to the data, providing more efficient and cost-effective data delivery to the groups in each world region by using national networks of greater capacity than may be available on intercontinental links.

The LHC Models encompass a complex set of wide-area, regional and local-area networks, a heterogeneous set of compute- and data-servers, and a yet-to-be-determined set of priorities for group-oriented and individual demands for remote data. Distributed systems of this scope and complexity do not yet exist, although systems of a similar size to those foreseen for the LHC experiments are predicted to come into operation at large corporations by around 2005.
 
In order to proceed with the planning and design of the LHC Computing Models, and to correctly dimension the capacity of the networks and the size and characteristics of the Regional Centres, it is essential to conduct a systematic study of these distributed systems. This project therefore intends to simulate and study the network-distributed computing architectures, data access and data management systems that are major components of the CM, and the ways in which these components interact across networks. The project will bring together the efforts and relevant expertise from the LHC experiments and LHC R&D projects, as well as from current or near-future experiments that are already engaged in building distributed systems for computing, data access, simulation and analysis.
 The primary goals of this project are:

As a result of this study, we expect to deliver a set of tools for simulating candidate CM of the experiments, and a set of common guidelines to allow the experiments to formulate their final Models.

Distributed databases are an important part of the CM to be studied. The RD45 project has developed considerable expertise in the field of Object Oriented Database Management Systems (ODBMS), and this project intends to benefit from the RD45 experience and to cooperate with RD45 as appropriate in the specific areas where the work of the two projects necessarily overlaps. The proposed project intends to investigate questions that are largely complementary to RD45, such as network performance and the prioritization of traffic for the variety of applications that must coexist and share the network resources.
 

2.0 OBJECTIVES
 

This project aims to develop a set of common modelling and simulation tools, and an environment that will enable the LHC experiments to realistically evaluate and optimize their analysis models and CMs, based on distributed data and computing architectures. Tools to realistically estimate the network bandwidth required in a given CM will be developed. The parameters that are necessary and sufficient to characterize a CM and its performance will be identified. The methods and tools to measure a Model's performance and detect bottlenecks will be designed, developed, and tested in prototypes. This work will be done in as close co-operation as possible with the present LHC R&D projects and with current or near-future experiments. The goal is to determine a set of feasible models, and to provide a set of guidelines which the experiments can use to build their respective Computing Models.
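As an illustration of the kind of estimate such tools must automate, the following minimal sketch computes the mean wide-area bandwidth implied by a handful of CM parameters. Every number below is a hypothetical placeholder chosen for the example, not an agreed Computing Model figure:

    # Back-of-envelope estimate of the mean WAN bandwidth implied by a
    # Computing Model. All parameters are hypothetical placeholders for
    # illustration, not agreed CM figures.

    EVENT_SIZE_MB   = 0.1    # assumed size of one analysis object (MB)
    EVENTS_PER_PASS = 1.0e6  # assumed events touched by one analysis pass
    PASSES_PER_DAY  = 10     # assumed analysis passes per day, per region
    REMOTE_FRACTION = 0.2    # assumed fraction of data read across the WAN
    SECONDS_PER_DAY = 86400.0

    def mean_wan_bandwidth_mbps():
        """Mean WAN bandwidth in Mbit/s implied by the parameters above."""
        mb_per_day = (EVENT_SIZE_MB * EVENTS_PER_PASS *
                      PASSES_PER_DAY * REMOTE_FRACTION)
        return mb_per_day * 8.0 / SECONDS_PER_DAY  # MB -> Mbit, day -> s

    print("mean WAN bandwidth: %.1f Mbit/s" % mean_wan_bandwidth_mbps())

Even this crude calculation (about 18.5 Mbit/s for the placeholder values) shows how strongly the requirement depends on the fraction of data read remotely, one of the parameters the simulations must pin down.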

The main objectives are:
 

 

3.0 INTERACTIONS WITH EXPERIMENTS AND OTHER PROJECTS

The aim of this project is to establish a set of viable computing models and a set of common guidelines to allow the experiments to develop their CMs in a realistic way. We believe that the best way to achieve this objective is to bring together the relevant expertise and to enhance the direct involvement of the LHC experiments in this R&D. The project will set up a framework for its collaboration with the experiments, with RD45, with the Technology Tracking Team (TTT), and with other groups having relevant expertise (for example in HPSS and other mass storage systems).

This document has been prepared in consultation with RD45. We have agreed to hold common meetings and workshops to discuss the overlapping areas of interest, and to define the most efficient way for both projects to proceed and produce the required results. In all cases, there will be a clear understanding with RD45 regarding the work-sharing, especially in testing the performance of a distributed ODBMS.

One of the important tasks of this project is to identify the questions about, and tests of, an ODBMS operating as part of a distributed system that must be addressed in order to define the Computing Models. This task will be done in close collaboration with RD45. However, another important role of this project is to begin investigating questions related to the construction, operation and management of a distributed computing and network system optimized for large-scale data access, which are largely complementary to RD45. A good example of an area not covered by RD45 is the question of network performance and the prioritization of traffic for the variety of applications that must coexist and share the network resources. These applications include interactive logins, high-priority access to system and detector parameters in the database, and real-time "collaborative" applications, in addition to the transfers of substantial amounts of event data requested by the ODBMS.
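To make the coexistence problem concrete, the sketch below simulates a single shared link serving requests under a non-preemptive strict-priority discipline. The link capacity, traffic classes and request sizes are illustrative assumptions only, not measured values:

    import heapq

    # Sketch of a single shared link serving requests under a
    # non-preemptive strict-priority discipline. Priorities, sizes and
    # the link capacity are illustrative assumptions only.
    # Priority 0 = interactive login, 1 = detector-parameter access,
    # 2 = bulk ODBMS event-data transfer.

    LINK_MBPS = 34.0  # assumed shared WAN link capacity (Mbit/s)

    def simulate(requests):
        """requests: list of (arrival_s, priority, size_mbit).
        Serves the lowest-priority-number waiting request first; a
        request already in service is never interrupted. Returns a
        list of (priority, completion_time_s)."""
        requests = sorted(requests)  # order by arrival time
        pending, done = [], []
        clock, i = 0.0, 0
        while i < len(requests) or pending:
            while i < len(requests) and requests[i][0] <= clock:
                arrival, prio, size = requests[i]
                heapq.heappush(pending, (prio, arrival, size))
                i += 1
            if not pending:            # link idle: jump to next arrival
                clock = requests[i][0]
                continue
            prio, arrival, size = heapq.heappop(pending)
            clock += size / LINK_MBPS  # serve one request to completion
            done.append((prio, clock))
        return done

    # A 1 GB bulk transfer arrives first; interactive traffic follows.
    reqs = [(0.0, 2, 8000.0), (1.0, 0, 1.0), (2.0, 1, 10.0)]
    for prio, t in simulate(reqs):
        print("priority %d finished at t = %6.1f s" % (prio, t))

In this example the interactive request, although it has the highest priority, completes only after the bulk transfer already in service finishes (roughly 235 s later), illustrating the head-of-line blocking that the prioritization studies must quantify and avoid.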

The "Computing Model groups" of the experiments will be responsible for providing the parameters for the models for reconstruction, analysis, Monte Carlo simulation, etc. The collaborations are  already involved in discussions with the proponents of this project, and it is recognized that the collaborations will make the final choices leading to their CM.While the details of the LHC experiments' Models will differ, it is necessary to first study a range of baseline models, so that all of the Models which are finally chosen fall into the feasible range.
 

4.0 WORK PLAN

A primary aim of this project is to demonstrate a set of feasible models, and to provide a set of guidelines with which the experiments can build their respective Computing Models.

In order to achieve this goal, the first stage of the project will:
 

Preliminary studies of modelling network-oriented data analysis tasks, using in-house simulation tools developed at CERN and Caltech, are already underway in preparation for this project.
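As a flavour of what these studies look like, the following minimal sketch (independent of the actual in-house tools, with purely illustrative rates) simulates a single data server handling analysis requests, and shows how the mean response time degrades as the server approaches saturation:

    import random

    # Discrete-event sketch of one data server handling analysis
    # requests (Poisson arrivals, exponential service times). The rates
    # are purely illustrative; the in-house tools model far richer
    # configurations of servers, networks and priorities.

    random.seed(42)

    def mean_response(n_requests, arrival_rate, service_rate):
        """Single FIFO server; returns the mean response time (s)."""
        clock = 0.0    # arrival clock
        free_at = 0.0  # time at which the server next becomes free
        total = 0.0
        for _ in range(n_requests):
            clock += random.expovariate(arrival_rate)  # next arrival
            start = max(clock, free_at)                # queue if busy
            free_at = start + random.expovariate(service_rate)
            total += free_at - clock                   # wait + service
        return total / n_requests

    # Response time grows sharply as the server approaches saturation.
    for load in (0.5, 0.8, 0.95):
        r = mean_response(100000, arrival_rate=load, service_rate=1.0)
        print("utilisation %.2f -> mean response %5.2f s" % (load, r))

Locating such saturation points, for networks of many servers and links rather than a single queue, is precisely the bottleneck-detection task described in Section 2.0.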

In its second stage, the project will address the question of how a distributed computing system can be built, controlled and run efficiently. Architectures for the entire computing system, making use of distributed computing technologies, will have to be developed and critically analysed. The impact of the choice of a particular viable model on the infrastructure required at CERN and at the remote institutions, as well as on the technical requirements for users' workgroup servers and desktops, will be evaluated.

The work will be performed using commercial modelling tools capable of simulating large distributed systems, modelling tools developed as part of this project (based on smaller tool sets already developed for other purposes), or a combination of both.

Scaling tests under controlled and uncontrolled shared network conditions, using pieces of the data analyses of currently operating experiments as well as working prototype analyses for the LHC, will be performed in order to extract information on how best to use the database and its management system as part of the overall distributed computing system. This information will be complementary to that learned in RD45.

A number of important specific issues will be addressed in the course
of the project. Examples are:
 

 

In the third stage, the project will provide tools and prototype designs for the test implementations of the elements of the LHC experiments' Computing Models, in time for the second round of the Computing Technical Proposals in 2001. This stage is most likely the subject of a future R&D project.
 

5.0 MAJOR TASK DEFINITIONS

The major tasks foreseen in order to achieve the objectives of the project are:
 

 

6.0 DELIVERABLES

The major deliverables are:
 

 
 
7.0 RESOURCES

Part of the resources, both people and material, is already available from the experiments and from the general support services in the collaborating laboratories and institutions. We estimate that 9-11 FTE will come from the experiments. Additional resources needed for the central support of this project are requested from CERN; we anticipate this support to be at the level of 2.5 FTE. Development of a modelling toolset will require 1 FTE, setup and operation of the test-bed will require 0.5 FTE, and the studies of the distributed data and computing architectures and of network behaviour and network management will require at least 0.5 FTE each.

The 2.5 FTE at CERN will provide a core of professional experts who will work with the physicists and technical staff at remote laboratories and universities to evaluate, evolve and classify promising classes of Computing Models, and to extract the essential features of the feasible models.
 
 
 


Table 1. Manpower from non-CERN institutions committed to the project

Institution                        FTE
INFN/Bari /CMS                     0.5
INFN/Roma-1 /ATLAS                 0.6
INFN/Roma-1 /CMS                   0.5
INFN/Bologna /CMS                  0.6
INFN/Perugia /CMS                  0.6
INFN/Milano /ATLAS                 1.0
Caltech /CMS                       1.4
Helsinki Institute of Physics      0.4
INFN/Genova /ATLAS                 0.5
Padova /CMS                        0.6
Tufts /ATLAS                       0.6
US-ATLAS                           1.0
FNAL /CMS                          1.0
TOTAL:                             9.3
 

The groups outside CERN engaged in this project have committed the use of several workgroup servers and desktops, along with shares of large computing systems (e.g. at INFN and Caltech). Specific configurations of local and wide-area networks managed by some of the collaborating groups are planned, to prototype elements of the distributed systems and to provide test data for validating the simulations. The total value of these systems, or of the shares of systems dedicated to this project, cannot be specified precisely, but is estimated to be in the range of several hundred kCHF.

A workgroup server that will serve as a central element in the ensemble of servers for this project is requested. The server is currently foreseen to be a one- or two-CPU UNIX system, with sufficient disk space, a high-speed tape drive for local file storage and backup, and ATM or Gigabit Ethernet as well as Fast Ethernet local area network interfaces. The CERN-based specialists working on this project will use this system as a development and test platform. The project will certainly benefit from the experience of, and collaboration with, the IT/PDP group in the development of a test-bed prototype.
 
 
 


Table 2. Equipment, software and other resources needed

Category                                                      Amount (kCHF)
Dedicated workgroup server for the development work
  and tests of the project                                         60
Software licences                                                  40
Network interfaces                                                 20
Travel                                                             20
TOTAL:                                                            140
 
 

8.0 SCHEDULE
 

Phase 1: Provide, within one year, a first-round set of tools for evaluating the baseline models, allowing the experiments to start defining their CMs.
Phase 2: Provide a refined set of tools, and the guidelines for the construction of a feasible CM, in time for the preparation of the next round of the Computing Technical Proposals (Fall 1999 for ATLAS and CMS, and later for LHCb and ALICE).
Phase 3: Provide prototype designs for the test implementations of the elements of the LHC experiments' Computing Models, in time for the second round of the Computing Technical Proposals in 2001. This stage is most likely the subject of a future R&D project.