Data Management
and Computing Using
Distributed Architectures
This draft is circulated for discussion of its final form before it is distributed
among the LHC experiments (and other interested experiments) with a request for
their participation in the project, and for the experiments' EB approval of
the PAP before it is presented to the LHC.
Preliminary list of names:
Mike Aderholz
Luciano Barone
Julian Bunn
Paolo Capiluppi
Claudio Grandi
Mikhail Leltchouk
Harvey Newman
Laura Perini
Alois Putzer
Krzysztof Sliwa
Rick Wilkinson
David Williams
1.0 INTRODUCTION
The LHC experiments have envisaged Computing Models (CM) involving
hundreds of physicists doing analysis at institutions around the world.
CMS and ATLAS are also considering the use of Regional Centres, each of
which could complement the functionality of the CERN Centre. These centres are
intended to facilitate access to the data, providing more efficient and
cost-effective data delivery to the groups in each world region by using
national networks of greater capacity than may be available on
intercontinental links.
The LHC Models encompass a complex set of wide area, regional and local
area networks, a heterogeneous set of compute- and data-servers, and a
yet-to-be-determined set of priorities for group-oriented and individuals'
demands for remote data. Distributed systems of this scope and complexity
do not yet exist, although systems of a similar size to those foreseen
for the LHC experiments are predicted to come into operation by around
2005 at large corporations.
In order to proceed with the planning and design of the LHC Computing Models,
and to correctly dimension the capacity of the networks and the size and
characteristics of Regional Centres, it is essential to conduct a systematic
study of these distributed systems. This project therefore intends to simulate
and study the network-distributed computing architectures, data access
and data management systems that are major components of the CM,
and the ways in which the components interact across networks. The project
will bring together the efforts and relevant expertise from the LHC experiments
and LHC R&D projects, as well as from the current or near-future experiments
that are already engaged in building distributed systems for computing,
data access, simulation and analysis.
The primary goals of this project are set out in Section 2.0 below.
As a result of this study, we expect to deliver a set of tools for simulating candidate CM of the experiments, and a set of common guidelines to allow the experiments to formulate their final Models.
Distributed databases are an important part of the CM to be studied.
The RD45 project has developed considerable expertise in the field of Object
Oriented Database Management Systems (ODBMS), and this project intends
to benefit from the RD45 experience and to cooperate with RD45 as appropriate,
in the specific areas where the work of the two projects (necessarily)
overlaps. However, this project will begin investigating questions which
are largely complementary to RD45. Examples are the network performance
and the prioritization of traffic for a variety of applications that must
coexist and share the network resources.
2.0 OBJECTIVES
This proposal aims to develop a set of common modeling and simulation tools, and an environment that would enable the LHC experiments to realistically evaluate and optimize their analysis models and CM, based on distributed data and computing architectures. Tools to realistically estimate the network bandwidth required in a given CM will be developed. The parameters that are necessary and sufficient to characterize a CM and its performance will be identified. The methods and tools to measure a Model's performance and detect bottlenecks will be designed, developed, and tested in prototypes. This work will be done in as close co-ordination as possible with the present LHC R&D projects, and with current or near-future experiments. The goal is to determine a set of feasible models, and to provide a set of guidelines which the experiments could use to build their respective Computing Models.
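As an illustration of the kind of estimate such bandwidth-estimation tools would automate, the following sketch computes a sustained-bandwidth figure for a hypothetical Regional Centre. All numbers (user counts, event sizes, the overhead factor) are invented placeholders, not parameters of any actual CM.

```python
# Illustrative back-of-envelope estimate of the sustained network bandwidth
# a Regional Centre might need; all figures are hypothetical placeholders.

def required_bandwidth_mbps(users, events_per_user_per_day, event_size_mb,
                            active_hours=12.0, overhead=1.5):
    """Sustained bandwidth in Mbit/s, with a safety factor for protocol
    overhead and uneven demand."""
    data_mb = users * events_per_user_per_day * event_size_mb
    seconds = active_hours * 3600.0
    return data_mb * 8.0 * overhead / seconds

# e.g. 100 physicists each pulling 10^5 small (0.01 MB) event records per day
print(round(required_bandwidth_mbps(100, 1e5, 0.01), 1), "Mbit/s")
```

A real tool would of course fold in access patterns, caching and time-of-day load; this only shows the order-of-magnitude arithmetic such a tool encapsulates.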
The main objectives are to:
i) Develop simulation and modeling tools to enable the experiments to evaluate their Computing Models;
ii) Identify the crucial parameters of the Computing Models, collect information about those parameters, and design, plan and execute the necessary measurements when the information is not already available;
iii) Determine the infrastructure (network capacity, CPU, storage, manpower) needed to implement the baseline models;
iv) Assess the major components and their behavioral characteristics relevant to the performance of the distributed computing system. The parts related to an ODBMS will be done in close co-operation with RD45.
v) Investigate the impact of varying degrees of network saturation on the overall system performance, using the baseline models as examples.
vi) Extract the common architectural features of the viable distributed
computing systems, including their components, linkages and functional
characteristics.
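Objective v can be illustrated with a textbook queueing result (an M/M/1 approximation, chosen here purely for illustration; the project's simulations would use far more detailed models): the mean response time on a shared link grows without bound as its utilisation approaches saturation.

```python
# Minimal sketch of why network saturation matters (objective v): in an
# M/M/1 queueing approximation of a shared link, the mean response time
# T = S / (1 - rho) grows without bound as utilisation rho approaches 1.

def mean_response_time(service_time, utilisation):
    """Mean time to complete a transfer on a link with the given
    utilisation, under the M/M/1 approximation."""
    if not 0 <= utilisation < 1:
        raise ValueError("utilisation must be in [0, 1)")
    return service_time / (1.0 - utilisation)

for rho in (0.5, 0.8, 0.9, 0.95, 0.99):
    print(f"utilisation {rho:.2f} -> response time x{mean_response_time(1.0, rho):.0f}")
```

Doubling a link's load from 50% to 99% utilisation thus multiplies response times by fifty, which is why the saturation studies matter for dimensioning.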
3.0 INTERACTIONS WITH EXPERIMENTS AND OTHER PROJECTS
The aim of this project is to establish a set of viable computing models and a set of common guidelines that allow the experiments to develop their CM in a realistic way. We believe that the best way to achieve this objective is to bring together the relevant expertise and to enhance the direct involvement of the LHC experiments in this R&D. The project will set up a framework for its collaboration with the experiments, with RD45, with the Technology Tracking Team (TTT), and with other groups having relevant expertise (for example in HPSS and other MSS).
This document has been prepared in consultation with RD45.
We have all agreed to hold common meetings and workshops to discuss the
overlapping areas of interest, and to define the most efficient way for
both projects to proceed and produce the required results. In all cases,
there will be a clear understanding with RD45 regarding the work-sharing,
especially for testing the performance of a distributed ODBMS.
One of the important tasks of this project is to identify the questions
and tests with an ODBMS which must be done, as part of a distributed system,
in order to define the Computing Models. This task will be done in close
collaboration with RD45. However, another important role of this project
is to begin investigating questions related to the construction, operation
and management of a distributed computing and network system optimized
for large scale data access, which are largely complementary to RD45. A
good example of an area not covered by RD45 is the question of network
performance and the prioritization of traffic for a variety of applications
that must coexist and share the network resources. These applications include
interactive logins, high priority access to system and detector parameters
in the database, and realtime "collaborative" applications, in
addition to transfers of substantial amounts of event-data as requested
by the ODBMS.
The "Computing Model groups" of the experiments will be responsible
for providing the parameters for the models for reconstruction, analysis,
Monte Carlo simulation, etc. The collaborations are already involved
in discussions with the proponents of this project, and it is recognized
that they will make the final choices leading to their CM. While the details
of the LHC experiments' Models will differ, it is necessary to first study
a range of baseline models, so that all of the Models which are finally
chosen fall into the feasible range.
4.0 WORKPLAN
A primary aim of this project is to demonstrate a set of feasible models, and to provide a set of guidelines with which the experiments could build their respective Computing Models.
In order to achieve this goal, the first stage of the project will:
i) Identify and/or build the modeling and simulation tools, and the parameters, with which to construct the baseline computing models.
ii) Assess the status of technologies which are the ingredients of the distributed computing systems - databases, networking, network traffic control mechanisms, and network management tools.
iii) Systematically analyse a set of baseline models and configurations.
In its second stage, the project will address the question of how a distributed computing system could be built, controlled and run efficiently. Candidate architectures for the entire computing system, making use of distributed computing technologies, will have to be developed and critically analyzed. The impact of the choice of a particular viable model on the infrastructure required at CERN and at the remote institutions, and on the technical requirements for users' workgroup servers and desktops, will be evaluated.
The work will be performed using commercial modeling tools capable of simulating large distributed systems, modeling tools developed as part of this project (based on smaller toolsets already developed for other purposes), or a combination of both.
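As a sketch of how small a purpose-built modeling toolset can start out, the following toy event-driven simulation models transfer requests queueing FIFO on a single shared link. All parameters are invented for illustration; a real toolset would add network topology, caching, contention and priority mechanisms.

```python
# Toy discrete-event simulation of transfer requests queueing FIFO on a
# single shared link; all parameters are invented for illustration.

def simulate(arrivals, link_mbps):
    """arrivals: list of (arrival_time_s, size_mb) requests.
    Returns the completion time (s) of each request, in arrival order."""
    completions = []
    link_free_at = 0.0
    for arrival_s, size_mb in sorted(arrivals):
        start = max(arrival_s, link_free_at)          # wait for the link
        link_free_at = start + size_mb * 8.0 / link_mbps
        completions.append(link_free_at)
    return completions

# three 100 MB transfers arriving together on a 100 Mbit/s link
print(simulate([(0.0, 100.0), (0.0, 100.0), (0.0, 100.0)], 100.0))
```

Even a toy like this already exposes queueing delay: the third transfer finishes three times later than the first, which is the kind of effect the baseline-model simulations must quantify at realistic scale.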
Scaling tests under controlled and uncontrolled shared network conditions using pieces of currently-operating experiments' data analysis, as well as working prototype analyses for the LHC, will be performed in order to extract information on how to best use the database and its management system, as part of the overall distributed computing system. This information will be complementary to that learned in RD45.
A number of important specific issues will be addressed in the course
of the project. Examples are:
i) The flexibility of the different approaches to distributed
computing and the resources that already exist, or will probably exist
in the different countries.
ii) The impact on the performance of the overall system, or on the resource
requirements, arising from different approaches to the analysis.
iii) The minimal level of flexibility that is required to carry out
the analysis effectively, in the face of the constraints imposed by the
limited available resources.
In the third stage, the project will provide tools and prototype designs for test implementations of elements of the LHC experiments' Computing Models, in time for the second round of the Computing Technical Proposals, in 2001. This stage is most likely aimed at a future R&D project.
5.0 MAJOR TASK DEFINITIONS
The major tasks foreseen, in order to achieve the objectives of the project, are:
i) To identify and/or develop the modeling and simulation tools and a range of criteria needed to evaluate the baseline computing models.
ii) To define a set of input parameters to the candidate models. This set, to be developed in close co-operation with the experiments, will include (a) parameters expressing each experiment's requirements, such as its data access patterns, analysis patterns, data volumes, data access times and number of concurrent users, and (b) parameters describing the hardware and software components of the computing and networking systems.
iii) To collect the existing information on the status and future projections for network technology, through such channels as the ICFA Network Taskforce. To determine a viable range of network parameters, and their inevitable variation from country to country.
iv) To define a set of alternative Computing Models which appear to be viable for the LHC experiments. In this way, to develop a small set of baseline computing models.
v) To define a common set of parameters important to the performance of the distributed system, along with a set of measuring tools. To design, plan and perform measurements of the missing or poorly known parameters.
vi) To establish a set of feasible models, and to produce guidelines for the experiments' detailed design work on their Computing Models.
vii) To develop a test-bed for rapid prototyping, and for verifying the simulations of the baseline models. The test-bed is also expected to reveal some problems that are not covered by the simulations. One focus of test-bed activities will be aspects of the CM which may be too complex to be simulated reliably (for example, network QoS mechanisms in a multi-user environment).
viii) To determine the ranges of network bandwidth, response time, modes of traffic prioritization, and response to saturation that are required for distributed computing systems of the various candidate architectures to function reliably.
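One of the prioritization modes mentioned above could be as simple as strict-priority queueing. The toy sketch below (packet names are invented) shows interactive traffic always being served ahead of bulk ODBMS transfers; this is only one of several mechanisms the project would compare.

```python
from collections import deque

# Hypothetical sketch of strict-priority link scheduling: interactive
# packets are always served before bulk ODBMS-transfer packets.

def drain(interactive, bulk):
    """Serve queued packets in strict priority order; return service order."""
    high, low = deque(interactive), deque(bulk)
    order = []
    while high or low:
        order.append(high.popleft() if high else low.popleft())
    return order

print(drain(["login-1", "login-2"], ["bulk-1", "bulk-2", "bulk-3"]))
```

Strict priority protects interactive latency but can starve bulk transfers under sustained interactive load, which is exactly the kind of trade-off the traffic-prioritization studies would quantify.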
ix) To extract the common architectural features of the viable baseline
models, including a description of the components, their linkages and functional
characteristics. To determine a set of recommendations on how to
set up, operate and manage a distributed system composed of these components
(most likely aimed at a future R&D project).
6.0 DELIVERABLES
The major deliverables are:
i) the specifications for a set of feasible models, identified by a well-defined region in the multidimensional space of model parameters which leads to viable configurations;
ii) a set of guidelines for the collaborations to use in building the Computing Models for their experiments;
iii) a set of modeling tools which would enable the experiments to
simulate and refine their Computing Models.
7.0 RESOURCES
Part of the resources, both people and material, are already available
from the experiments and from the general support services in collaborating
laboratories and institutions. We estimate that 5-7 FTE will come
from the experiments. Additional resources needed for the central support
of this project are requested from CERN. We anticipate this support to
be at the level of 2.5 FTE. Development of a modeling toolset will require
1 FTE, setup and operation of the test-bed will require 1/2 FTE, and studies
of the distributed data and computing architectures, and of the
network behaviour and network management, will require at least 1/2 FTE
each.
The 2.5 FTE at CERN will provide a core of professional experts who will work with the physicists and technical staff at remote laboratories and universities to evaluate, evolve and classify promising classes of Computing Models, and to extract the essential features of the feasible models.
_________________________________________________________________________________
COMMENTS (HBN):
[I have somewhat altered the above. We need 2.5-3 people at CERN.
We are collecting the commitments from CMS now, but so far have 2.6 FTEs
from INFN/Rome, INFN/Bologna, and Caltech. We hope to reach 3.5 FTEs, and
hope ATLAS can do the same (parity in this project is very important, as
what will result will have "policy" implications in the further development
of the Computing Models, and in the sharing of common resources).
For ALICE and LHCb: we want them to participate, with additional manpower.
At a May 8 meeting in CMS it was suggested that 1 FTE would be a reasonable
minimum for each of these two experiments. That would bring the FTEs from
the experiments to 9.
It was also suggested at the May 8 meeting that, for an institute to be a
(named) member of the project, the minimum level of participation should be
approximately 0.5 FTE.
Among the resources we should also cite the necessity of a "workgroup
server" specifically for the developments and tests of the project. A figure
of 60 kCHF was suggested for the (rough) cost. Similarly, roughly 40 kCHF
should be foreseen for software licenses. I also thought a bit more, and
suggest 20 kCHF for "network interfaces". Overall, we thought the equipment
and software license cost should not exceed 150 kCHF. Or should it?]
__________________________________________________________________________________
8.0 SCHEDULE
Phase 1: Provide a first-round set of tools for evaluating the baseline
models, and allow the experiments to start defining their CM, within one year.
Phase 2: Provide a refined set of tools, and the guidelines for the
construction of a feasible CM, in time for the preparation of the next round
of Computing TPRs (Fall 1999 for ATLAS and CMS; and ??? for LHCb and ALICE).
Phase 3: Provide the tools, and some prototype designs, for test implementations
of the LHC CMs, in time for the preparation of the next-to-next round of
Computing TPRs, in 2001.