Version 4.0

May 13, 1998

 

        Data Management and Computing Using
                  Distributed Architectures

                 Preliminary Draft Project Assignment Plan

This draft version is distributed for discussion of its final form, before it is circulated to the LHC experiments (and other interested experiments) with a request for their participation in the project, and before the experiments' EB approval of the PAP and its presentation to the LHC.
 
Preliminary list of names:

1.0 INTRODUCTION
 
The LHC experiments have envisaged Computing Models (CM) involving hundreds of physicists doing analysis at institutions around the world. CMS and ATLAS are also considering the use of Regional Centres, each of which could complement the functionality of the CERN Centre. The Regional Centres are intended to facilitate access to the data, with more efficient and cost-effective data delivery to the groups in each world region, using national networks of greater capacity than may be available on intercontinental links.

The LHC Models encompass a complex set of wide area, regional and local area networks, a heterogeneous set of compute- and data-servers, and a yet-to-be-determined set of priorities for group-oriented and individual demands for remote data. Distributed systems of this scope and complexity do not yet exist, although systems of a size similar to those foreseen for the LHC experiments are predicted to come into operation at large corporations by around 2005.
 
In order to proceed with the planning and design of the LHC Computing Models, and to correctly dimension the capacity of the networks and the size and characteristics of Regional Centres, it is essential to conduct a systematic study of these distributed systems. This project therefore intends to simulate and study the network-distributed computing architectures, data access and data management systems that are major components of the CM, and the ways in which the components interact across networks. The project will bring together the efforts and relevant expertise from the LHC experiments and LHC R&D projects, as well as from the current or near-future experiments that are already engaged in building distributed systems for computing, data access, simulation and analysis.

The primary goals of this project are

As a result of this study, we expect to deliver a set of tools for simulating candidate CM of the experiments, and a set of common guidelines to allow the experiments to formulate their final Models.

Distributed databases are an important part of the CM to be studied. The RD45 project has developed considerable expertise in the field of Object Oriented Database Management Systems (ODBMS), and this project intends to benefit from the RD45 experience and to cooperate with RD45 as appropriate, in the specific areas where the work of the two projects (necessarily) overlaps. However, this project will begin investigating questions which are largely complementary to RD45. Examples are network performance, and the prioritization of traffic for a variety of applications that must coexist and share the network resources.
 

2.0 OBJECTIVES
 

This proposal aims to develop a set of common modeling and simulation tools, and the environment that would enable the LHC experiments to realistically evaluate and optimize their analysis models and CM, based on distributed data and computing architectures. Tools to realistically estimate the network bandwidth required in a given CM will be developed. The parameters that are necessary and sufficient to characterize the CM and its performance will be identified. The methods and tools to measure the Model's performance and detect bottlenecks will be designed, developed and tested in prototypes. This work will be done in as close co-ordination as possible with the present LHC R&D projects and with current or near-future experiments. The goal is to determine a set of feasible models, and to provide a set of guidelines which the experiments could use to build their respective Computing Models.
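As a simple illustration of the kind of estimate such bandwidth tools would automate, the following Python sketch converts a set of purely hypothetical access parameters into a sustained wide-area bandwidth figure; the event size, access rates and overhead factor are placeholders chosen for illustration, not experiment requirements.

    # Back-of-the-envelope WAN bandwidth estimate for one Regional Centre.
    # All numbers are hypothetical placeholders, not experiment requirements.
    event_size_mb = 1.0              # size of one reconstructed event (MB)
    events_per_user_per_day = 1e5    # events pulled across the WAN per analyst per day
    concurrent_users = 50            # analysts active at the centre
    overhead = 1.5                   # protocol and retransmission overhead factor

    daily_volume_mb = event_size_mb * events_per_user_per_day * concurrent_users
    average_mbps = daily_volume_mb * 8 * overhead / 86400   # averaged over 24 hours

    print("average sustained WAN bandwidth: %.0f Mbit/s" % average_mbps)

With these placeholder values the estimate comes out at several hundred Mbit/s; the real tools would of course also have to fold in the data access patterns, caching at the Regional Centres, and the time structure of the analysis load.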

The main objectives are to:

i) Develop simulation and modeling tools to enable the experiments to evaluate their Computing Models

ii) Identify the crucial parameters of Computing Models, collect information about those parameters, and design, plan and execute the necessary measurements when the information is not already available;

iii) Determine the necessary infrastructure (network capacity, CPU, storage, manpower) needed to implement the baseline models;

iv) Assess the major components and their behavioral characteristics relevant to the performance of the distributed computing system.  The parts related to an ODBMS will be done in close co-operation with RD45.

v) Investigate the impact of varying degrees of network saturation on the overall system performance, using the baseline models as examples.

vi) Extract the common architectural features of the viable distributed computing systems, including their components, linkages and functional characteristics.
 

3.0 INTERACTIONS WITH EXPERIMENTS AND OTHER PROJECTS

The aim of this project is to establish a set of viable computing models and a set of common guidelines to allow experiments to develop their CM in a realistic way. We believe that the best way to achieve this objective is to bring together, and enhance, the direct involvement of the LHC experiments in this R&D. The project will set up a framework for its collaboration with the experiments, with RD45, with the Technology Tracking Team (TTT), and with other groups having relevant expertise (for example in HPSS and other mass storage systems).

This document has been prepared in consultation with RD45. We have all agreed to hold common meetings and workshops to discuss the overlapping areas of interest, and to define the most efficient way for both projects to proceed and produce the required results. In all cases, there will be a clear understanding with RD45 regarding the work-sharing, especially in testing the performance of a distributed ODBMS.

One of the important tasks of this project is to identify the questions and tests involving an ODBMS, as part of a distributed system, which must be addressed in order to define the Computing Models. This task will be done in close collaboration with RD45. However, another important role of this project is to begin investigating questions related to the construction, operation and management of a distributed computing and network system optimized for large-scale data access, questions which are largely complementary to RD45. A good example of an area not covered by RD45 is network performance and the prioritization of traffic for a variety of applications that must coexist and share the network resources. These applications include interactive logins, high-priority access to system and detector parameters in the database, and real-time "collaborative" applications, in addition to transfers of substantial amounts of event data as requested by the ODBMS.
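To make the prioritization issue concrete, the following minimal Python sketch orders pending transfers by a strict-priority rule over hypothetical traffic classes, so that bulk event-data transfers never delay interactive or parameter-database traffic; the class names, priorities and sizes are illustrative assumptions, not measured values or agreed policy.

    import heapq

    # Hypothetical traffic classes; a lower number means served first.
    PRIORITY = {"interactive": 0, "detector_params": 1,
                "collaborative": 2, "bulk_event_data": 3}

    def schedule(requests):
        """Yield pending transfer requests in strict priority order."""
        queue = []
        for seq, (traffic_class, size_mb) in enumerate(requests):
            heapq.heappush(queue, (PRIORITY[traffic_class], seq, traffic_class, size_mb))
        while queue:
            _, _, traffic_class, size_mb = heapq.heappop(queue)
            yield traffic_class, size_mb

    pending = [("bulk_event_data", 500.0), ("interactive", 0.01),
               ("collaborative", 2.0), ("detector_params", 0.1)]
    for traffic_class, size_mb in schedule(pending):
        print("serve %-16s %8.2f MB" % (traffic_class, size_mb))

In practice such prioritization would be implemented in the network itself (for example through QoS mechanisms) rather than in application code; the sketch only illustrates the kind of ordering policy whose effect on the overall system is to be studied.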

The "Computing Model groups" of the experiments will be responsible for providing the parameters for the models for reconstruction, analysis, Monte Carlo simulation, etc. The collaborations are  already involved in discussions with the proponents of this project, and it is recognized that they will make the final choices leading to their CM.While the details of the LHC experiments' Models will differ, it is necessary to first study a range of baseline models, so that all of the Models which are finally chosen fall into the feasible range.
 

4.0 WORKPLAN

A primary aim of this project is to demonstrate a set of feasible models, and to provide a set of guidelines with which the experiments could build their respective Computing Models.

In order to achieve this goal, the first stage of the project will:

i) Identify and/or build the modeling and simulation tools, and the parameters with which to construct the baseline computing models.

ii) Assess the status of technologies which are the ingredients of the distributed computing systems - databases, networking, network traffic control mechanisms, and network management tools.

iii) Systematically analyse a set of baseline models and configurations.

In its second stage, the project will address the question of how a distributed computing system could be built, controlled and run efficiently. The architectures of the entire computing system, making use of distributed computing technologies, will have to be developed and critically analyzed. The impact of the choice of a particular viable model on the infrastructure required at CERN and at the remote institutions, and on the technical requirements for users' workgroup servers and desktops, will be evaluated.

The work will be performed using either commercial modeling tools capable of simulating large distributed systems, or modeling tools developed as part of this project (based on smaller toolsets already developed for other purposes), or a combination of both.
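As an indication of what even a small, purpose-built toolset can provide, the following self-contained Python sketch runs a discrete-event simulation of analysis jobs queueing at a single data server and reports the mean queueing delay; the arrival rate and service time are hypothetical stand-ins for parameters the experiments would supply.

    import random

    # Discrete-event sketch: analysis jobs arriving at one data server.
    # Arrival rate and service time are hypothetical placeholders.
    random.seed(1)
    arrival_rate = 0.8     # jobs per second submitted by the collaboration
    service_time = 1.0     # seconds of server time per job
    n_jobs = 10000

    clock = 0.0            # time of the current arrival
    server_free_at = 0.0   # time at which the server becomes idle
    total_wait = 0.0
    for _ in range(n_jobs):
        clock += random.expovariate(arrival_rate)   # next job arrival
        start = max(clock, server_free_at)          # wait if the server is busy
        total_wait += start - clock
        server_free_at = start + service_time

    print("mean queueing delay: %.2f s at %.0f%% server utilization"
          % (total_wait / n_jobs, 100 * arrival_rate * service_time))

A commercial toolset would add graphical model construction, many interacting components and validated network and device models on top of this kind of simulation kernel.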

Scaling tests under controlled and uncontrolled shared-network conditions, using pieces of currently-operating experiments' data analyses as well as working prototype analyses for the LHC, will be performed in order to extract information on how best to use the database and its management system as part of the overall distributed computing system. This information will be complementary to that learned in RD45.

A number of important specific issues will be addressed in the course
of the project. Examples are:

i)  The flexibility of the different approaches to distributed computing, and the resources that already exist, or will probably exist, in the different countries.
 
ii) The impact on the performance of the overall system, or on the resource requirements, that arises from different approaches to the analysis.

iii) The minimal level of flexibility that is required to carry out the analysis effectively, in the face of the constraints imposed by the limited available resources.
 

In the third stage, the project will provide tools and prototype designs for the test implementations of the elements of the LHC experiments' Computing Models, in time for the second round of the Computing Technical Proposals in 2001. This stage is most likely aimed at a future R&D project.

5.0 MAJOR TASK DEFINITIONS

The major tasks foreseen, in order to achieve the objectives of the project, are:

i)  To identify and/or develop the modeling and simulation tools and a range of criteria needed to evaluate the baseline computing models.

ii) To define a set of input parameters to candidate models. This set, to be developed in close co-operation with the experiments, will include (a) parameters expressing the experiment's requirements, such as its data access patterns, analysis patterns, data volumes, data access times and number of concurrent users, and (b) parameters describing the hardware and software components of the computing and networking systems (an illustrative grouping is sketched after this list).

iii) To collect the existing information on the status and future projections for network technology, through such channels as the ICFA Network Taskforce. To determine a viable range of network parameters, and their inevitable variation from country to country.

iv) To define a set of alternative Computing Models which appear to be viable for the LHC experiments. In this way, to develop a small set of baseline computing models.

v) To define a common set of parameters important to the performance of the distributed system, along with a set of measuring tools. To design, plan and perform measurements of the missing or poorly known parameters.

vi) To establish a set of feasible models, and to produce guidelines for the experiments' detailed design work on their Computing Models.

vii) To develop a test-bed for rapid prototyping, and for verifying the simulations of the baseline models. The test-bed is also expected to detect some problems that are not covered by the simulations. One focus of test-bed activities will be aspects of the CM which may be too complex to be simulated reliably (for example, network QoS mechanisms in a multi-user environment).

viii) To determine the range of network bandwidth, response time, modes of prioritization of traffic, and response to saturation that are required for a distributed computing system of any of the candidate architectures to function reliably.

ix) To extract the common architectural features of the viable baseline models, including a description of the components, their linkages and functional characteristics. To determine a set of recommendations on how to set up, operate and manage a distributed system composed of these components (most likely aimed at a future R&D project).
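The illustrative grouping referred to in item ii) above might look like the following Python sketch; the field names and example values are placeholders chosen for illustration, not agreed definitions of the parameter set.

    from dataclasses import dataclass

    @dataclass
    class ExperimentRequirements:          # category (a): the experiment's requirements
        data_volume_tb_per_year: float
        event_size_mb: float
        concurrent_users: int
        events_per_analysis_pass: int
        target_access_time_s: float

    @dataclass
    class SystemDescription:               # category (b): hardware and software components
        wan_bandwidth_mbps: float
        regional_centre_cpu_si95: float
        disk_io_mb_per_s: float
        tape_io_mb_per_s: float
        odbms_page_size_kb: int

    # Hypothetical example values for one baseline model.
    requirements = ExperimentRequirements(1000.0, 1.0, 100, 10**7, 10.0)
    system = SystemDescription(622.0, 10**5, 100.0, 10.0, 8)

Keeping the two categories separate makes it straightforward to hold the experiment's requirements fixed while varying the system description, which is the pattern of study foreseen for the baseline models.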
 

6.0 DELIVERABLES

The major deliverables are:

i)  the specifications for a set of feasible models, identified by a well-defined region in the multidimensional space of model parameters that leads to viable configurations;

ii)  a set of guidelines for the collaborations to use in building the Computing Models for their experiments;

iii) a set of modeling tools which would enable the experiments to simulate and refine their Computing Models.
 
 
7.0 RESOURCES

Some of the resources, both people and material, are already available from the experiments and from the general support services in collaborating laboratories and institutions. We estimate that 5-7 FTEs will come from the experiments. Additional resources needed for the central support of this project are requested from CERN. We anticipate this support to be at the level of 2.5 FTE: development of a modeling toolset will require 1 FTE, setup and operation of the test-bed will require 1/2 FTE, and studies of the distributed data and computing architectures, and of the network behaviour and network management, will require at least 1/2 FTE each.

The 2.5 FTE at CERN will provide a core of professional experts who will work with the physicists and technical staff at remote laboratories and universities to evaluate, evolve and classify promising classes of Computing Models, and to extract the essential features of the feasible models.

_________________________________________________________________________________

COMMENTS (HBN):

[ I have somewhat altered the above. We need 2.5 - 3 people at CERN.
We are collecting the commitments from CMS now, but so far have 2.6 FTEs
from INFN/Rome, INFN/Bologna, and Caltech. We hope to reach 3.5 FTEs and
hope ATLAS can do the same (parity in this project is very important, as what will
result will have "policy" implications in the further development of the Computing
Models, and in the sharing of common resources).

For ALICE and LHCB -- we want them to participate, with additional manpower.
In a May 8 meeting in CMS it was suggested that 1 FTE would be a reasonable
minimum for each of these two experiments. That would bring the FTEs from
the experiments to 9.

It was also suggested at the May 8 meeting that for an institute to be a (named) member
of the project, the minimum level of participation should be approximately 0.5 FTE.

Among the resources we should also cite the necessity of a "workgroup server" specifically
for the developments and tests of the project. A figure of 60 kCHF was suggested for the
(rough) cost. Similarly, roughly 40 kCHF should be foreseen for software licenses.

I also thought a bit more, and suggest 20 kCHF for "network interfaces".

Overall, we thought the equipment and software license cost should not exceed 150 kCHF --
or should it?]

__________________________________________________________________________________


 
 

8.0 SCHEDULE

Phase 1: Provide a first-round set of tools for evaluating the baseline models, and
               to allow the experiments to start defining their CM, within one year.

Phase 2: Provide a refined set of tools, and the guidelines for the construction of
              a feasible CM, in time for the preparation of the next round of the Computing TPRs
              (Fall 1999 for ATLAS and CMS; and ??? for LHCb and ALICE).

Phase 3: Provide the tools, and some prototype designs, for test implementations
              of the LHC CMs, in time for preparation of the next-to-next round of
              the Computing TPRs, in 2001.