June 29, 1998

 

            Data Management and Computing Using
                      Distributed Architectures

                           PROJECT ASSIGNMENT PLAN

Prepared by:

1.0 INTRODUCTION
 
The LHC experiments have envisaged Computing Models (CM) involving hundreds of physicists doing analysis at institutions around the world. CMS and ATLAS are also considering the use of Regional Centres, each of which could complement the functionality of the CERN Centre. The Regional Centres are intended to facilitate access to the data, providing more efficient and cost-effective data delivery to the groups in each world region by using national networks of greater capacity than may be available on intercontinental links.

The LHC Models encompass a complex set of wide-area, regional and local-area networks, a heterogeneous set of compute- and data-servers, and a yet-to-be-determined set of priorities for group-oriented and individual demands for remote data. Distributed systems of this scope and complexity do not yet exist, although systems of a similar size to those foreseen for the LHC experiments are predicted to come into operation at large corporations by around 2005.
 
In order to proceed with the planning and design of the LHC Computing Models, and to correctly dimension the capacity of the networks and the size and characteristics of the Regional Centres, it is essential to conduct a systematic study of these distributed systems. This project therefore intends to simulate and study the network-distributed computing architectures, data access and data management systems that are major components of the CM, and the ways in which these components interact across networks. The project will bring together the efforts and relevant expertise from the LHC experiments and LHC R&D projects, as well as from current or near-future experiments that are already engaged in building distributed systems for computing, data access, simulation and analysis.
 The primary goals of this project are:

As a result of this study, we expect to deliver a set of tools for simulating candidate CM of the experiments, and a set of common guidelines to allow the experiments to formulate their final Models.

Distributed databases are an important part of the CM to be studied. The RD45 project has developed considerable expertise in the field of Object Oriented Database Management Systems (ODBMS), and this project intends to benefit from the RD45 experience and to cooperate with RD45 as appropriate in the specific areas where the work of the two projects necessarily overlaps. The proposed project intends to investigate questions that are largely complementary to RD45, such as network performance and the prioritization of traffic for the variety of applications that must coexist and share the network resources.
 

2.0 OBJECTIVES
 

This project aims to develop a set of common modelling and simulation tools, and an environment that will enable the LHC experiments to realistically evaluate and optimize their analysis models and CMs, based on distributed data and computing architectures. Tools to realistically estimate the network bandwidth required in a given CM will be developed. The parameters that are necessary and sufficient to characterize a CM and its performance will be identified. The methods and tools to measure a Model's performance and detect bottlenecks will be designed, developed, and tested in prototypes. This work will be done in as close co-operation as possible with the present LHC R&D projects and with current or near-future experiments. The goal is to determine a set of feasible models, and to provide a set of guidelines which the experiments can use to build their respective Computing Models.
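As an illustration of the kind of estimate such tools must automate, the following minimal sketch computes the mean wide-area bandwidth implied by a handful of CM parameters. Every number below is a hypothetical placeholder chosen for the example, not an agreed Computing Model figure:

    # Back-of-envelope estimate of the mean WAN bandwidth implied by a
    # Computing Model. All parameters are hypothetical placeholders for
    # illustration, not agreed CM figures.

    EVENT_SIZE_MB   = 0.1    # assumed size of one analysis object (MB)
    EVENTS_PER_PASS = 1.0e6  # assumed events touched by one analysis pass
    PASSES_PER_DAY  = 10     # assumed analysis passes per day, per region
    REMOTE_FRACTION = 0.2    # assumed fraction of data read across the WAN
    SECONDS_PER_DAY = 86400.0

    def mean_wan_bandwidth_mbps():
        """Mean WAN bandwidth in Mbit/s implied by the parameters above."""
        mb_per_day = (EVENT_SIZE_MB * EVENTS_PER_PASS *
                      PASSES_PER_DAY * REMOTE_FRACTION)
        return mb_per_day * 8.0 / SECONDS_PER_DAY  # MB -> Mbit, day -> s

    print("mean WAN bandwidth: %.1f Mbit/s" % mean_wan_bandwidth_mbps())

Even this crude calculation (about 18.5 Mbit/s for the placeholder values) shows how strongly the requirement depends on the fraction of data read remotely, one of the parameters the simulations must pin down.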

The main objectives are:
 

 

3.0 INTERACTIONS WITH EXPERIMENTS AND OTHER PROJECTS

The aim of this project is to establish a set of viable computing models and a set of common guidelines to allow the experiments to develop their CMs in a realistic way. We believe that the best way to achieve this objective is to bring together the relevant expertise and to enhance the direct involvement of the LHC experiments in this R&D. The project will set up a framework for its collaboration with the experiments, with RD45, with the Technology Tracking Team (TTT), and with other groups having relevant expertise (for example in HPSS and other mass storage systems).

This document has been prepared in consultation with RD45. We have agreed to hold common meetings and workshops to discuss the overlapping areas of interest, and to define the most efficient way for both projects to proceed and produce the required results. In all cases, there will be a clear understanding with RD45 regarding the work-sharing, especially in testing the performance of a distributed ODBMS.

One of the important tasks of this project is to identify the questions about, and tests of, an ODBMS operating as part of a distributed system that must be addressed in order to define the Computing Models. This task will be done in close collaboration with RD45. However, another important role of this project is to begin investigating questions related to the construction, operation and management of a distributed computing and network system optimized for large-scale data access, which are largely complementary to RD45. A good example of an area not covered by RD45 is the question of network performance and the prioritization of traffic for the variety of applications that must coexist and share the network resources. These applications include interactive logins, high-priority access to system and detector parameters in the database, and real-time "collaborative" applications, in addition to the transfers of substantial amounts of event data requested by the ODBMS.
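To make the coexistence problem concrete, the sketch below simulates a single shared link serving requests under a non-preemptive strict-priority discipline. The link capacity, traffic classes and request sizes are illustrative assumptions only, not measured values:

    import heapq

    # Sketch of a single shared link serving requests under a
    # non-preemptive strict-priority discipline. Priorities, sizes and
    # the link capacity are illustrative assumptions only.
    # Priority 0 = interactive login, 1 = detector-parameter access,
    # 2 = bulk ODBMS event-data transfer.

    LINK_MBPS = 34.0  # assumed shared WAN link capacity (Mbit/s)

    def simulate(requests):
        """requests: list of (arrival_s, priority, size_mbit).
        Serves the lowest-priority-number waiting request first; a
        request already in service is never interrupted. Returns a
        list of (priority, completion_time_s)."""
        requests = sorted(requests)  # order by arrival time
        pending, done = [], []
        clock, i = 0.0, 0
        while i < len(requests) or pending:
            while i < len(requests) and requests[i][0] <= clock:
                arrival, prio, size = requests[i]
                heapq.heappush(pending, (prio, arrival, size))
                i += 1
            if not pending:            # link idle: jump to next arrival
                clock = requests[i][0]
                continue
            prio, arrival, size = heapq.heappop(pending)
            clock += size / LINK_MBPS  # serve one request to completion
            done.append((prio, clock))
        return done

    # A 1 GB bulk transfer arrives first; interactive traffic follows.
    reqs = [(0.0, 2, 8000.0), (1.0, 0, 1.0), (2.0, 1, 10.0)]
    for prio, t in simulate(reqs):
        print("priority %d finished at t = %6.1f s" % (prio, t))

In this example the interactive request, although it has the highest priority, completes only after the bulk transfer already in service finishes (roughly 235 s later), illustrating the head-of-line blocking that the prioritization studies must quantify and avoid.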

The "Computing Model groups" of the experiments will be responsible for providing the parameters for the models for reconstruction, analysis, Monte Carlo simulation, etc. The collaborations are  already involved in discussions with the proponents of this project, and it is recognized that the collaborations will make the final choices leading to their CM.While the details of the LHC experiments' Models will differ, it is necessary to first study a range of baseline models, so that all of the Models which are finally chosen fall into the feasible range.
 

4.0 WORK PLAN

A primary aim of this project is to demonstrate a set of feasible models, and to provide a set of guidelines with which the experiments can build their respective Computing Models.

In order to achieve this goal, the first stage of the project will:
 

Preliminary studies of modelling network-oriented data analysis tasks, using in-house simulation tools developed at CERN and Caltech, are already underway in preparation for this project.
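As a flavour of what these studies look like, the following minimal sketch (independent of the actual in-house tools, with purely illustrative rates) simulates a single data server handling analysis requests, and shows how the mean response time degrades as the server approaches saturation:

    import random

    # Discrete-event sketch of one data server handling analysis
    # requests (Poisson arrivals, exponential service times). The rates
    # are purely illustrative; the in-house tools model far richer
    # configurations of servers, networks and priorities.

    random.seed(42)

    def mean_response(n_requests, arrival_rate, service_rate):
        """Single FIFO server; returns the mean response time (s)."""
        clock = 0.0    # arrival clock
        free_at = 0.0  # time at which the server next becomes free
        total = 0.0
        for _ in range(n_requests):
            clock += random.expovariate(arrival_rate)  # next arrival
            start = max(clock, free_at)                # queue if busy
            free_at = start + random.expovariate(service_rate)
            total += free_at - clock                   # wait + service
        return total / n_requests

    # Response time grows sharply as the server approaches saturation.
    for load in (0.5, 0.8, 0.95):
        r = mean_response(100000, arrival_rate=load, service_rate=1.0)
        print("utilisation %.2f -> mean response %5.2f s" % (load, r))

Locating such saturation points, for networks of many servers and links rather than a single queue, is precisely the bottleneck-detection task described in Section 2.0.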

In its second stage, the project will address the question of how a distributed computing system can be built, controlled and run efficiently. Architectures for the entire computing system, making use of distributed computing technologies, will have to be developed and critically analysed. The impact of the choice of a particular viable model on the infrastructure required at CERN and at the remote institutions, as well as on the technical requirements for users' workgroup servers and desktops, will be evaluated.

The work will be performed using commercial modelling tools capable of simulating large distributed systems, modelling tools developed as part of this project (based on smaller tool sets already developed for other purposes), or a combination of both.

Scaling tests under controlled and uncontrolled shared network conditions, using pieces of the data analyses of currently operating experiments as well as working prototype analyses for the LHC, will be performed in order to extract information on how best to use the database and its management system as part of the overall distributed computing system. This information will be complementary to that learned in RD45.

A number of important specific issues will be addressed in the course
of the project. Examples are:
 

 

In the third stage, the project will provide tools and prototype designs for the test implementations of the elements of the LHC experiments' Computing Models, in time for the second round of the Computing Technical Proposals in 2001. This stage is most likely the subject of a future R&D project.
 

5.0 MAJOR TASK DEFINITIONS

The major tasks foreseen in order to achieve the objectives of the project are:
 

 

6.0 DELIVERABLES

The major deliverables are:
 

 
 
7.0 RESOURCES

Part of the resources, both people and material, is already available from the experiments and from the general support services in the collaborating laboratories and institutions. We estimate that 9-11 FTE will come from the experiments. Additional resources needed for the central support of this project are requested from CERN; we anticipate this support to be at the level of 2.5 FTE. Development of a modelling toolset will require 1 FTE, setup and operation of the test-bed will require 0.5 FTE, and the studies of the distributed data and computing architectures and of network behaviour and network management will require at least 0.5 FTE each.

The 2.5 FTE at CERN will provide a core of professional experts who will work with the physicists and technical staff at remote laboratories and universities to evaluate, evolve and classify promising classes of Computing Models, and to extract the essential features of the feasible models.
 
 
 


Table 1. Manpower from non-CERN institutions committed to the project

Institution                        FTE
INFN/Bari /CMS                     0.5
INFN/Roma-1 /ATLAS                 0.6
INFN/Roma-1 /CMS                   0.5
INFN/Bologna /CMS                  0.6
INFN/Perugia /CMS                  0.6
INFN/Milano /ATLAS                 1.0
Caltech /CMS                       1.4
Helsinki Institute of Physics      0.4
INFN/Genova /ATLAS                 0.5
Padova /CMS                        0.6
Tufts /ATLAS                       0.6
US-ATLAS                           1.0
FNAL /CMS                          1.0
TOTAL:                             9.3
 

The groups outside CERN engaged in this project have committed the use of several workgroup servers and desktops, along with shares of large computing systems (e.g. at INFN and Caltech). Specific configurations of local and wide-area networks managed by some of the collaborating groups are planned, to prototype elements of the distributed systems and to provide test data for validating the simulations. The total value of these systems, or of the shares of systems dedicated to this project, cannot be specified precisely, but is estimated to be in the range of several hundred kCHF.

A workgroup server that will serve as a central element in the ensemble of servers for this project is requested. The server is currently foreseen to be a one- or two-CPU UNIX system, with sufficient disk space, a high-speed tape drive for local file storage and backup, and ATM or Gigabit Ethernet as well as Fast Ethernet local area network interfaces. The CERN-based specialists working on this project will use this system as a development and test platform. The project will certainly benefit from the experience of, and collaboration with, the IT/PDP group in the development of a test-bed prototype.
 
 
 


Table 2. Equipment, software and other resources needed

Category                                                      Amount (kCHF)
Dedicated workgroup server for the development work
  and tests of the project                                         60
Software licences                                                  40
Network interfaces                                                 20
Travel                                                             20
TOTAL:                                                            140
 
 

8.0 SCHEDULE
 

Phase 1: Provide, within one year, a first-round set of tools for evaluating the baseline models, allowing the experiments to start defining their CMs.
Phase 2: Provide a refined set of tools, and the guidelines for the construction of a feasible CM, in time for the preparation of the next round of the Computing Technical Proposals (Fall 1999 for ATLAS and CMS, and later for LHCb and ALICE).
Phase 3: Provide prototype designs for the test implementations of the elements of the LHC experiments' Computing Models, in time for the second round of the Computing Technical Proposals in 2001. This stage is most likely the subject of a future R&D project.