Data Management and Computing Using
Distributed Architectures
The LHC Computing Models encompass a complex set of wide-area, regional and local-area
networks, a heterogeneous set of compute and data servers, and a yet-to-be-determined
set of priorities for group-oriented and individual demands
for remote data. Distributed systems of this scope and complexity
do not yet exist, although systems of a similar size to those foreseen
for the LHC experiments are predicted to come into operation at large
corporations by around 2005.
In order to proceed with the planning and design of the LHC Computing
Models, and to correctly dimension the capacity of the networks and the
size and characteristics of Regional Centres, it is essential to conduct
a systematic study of these distributed systems. This project therefore
intends to simulate and study network-distributed computing architectures,
data access and data management systems that are major components of
the CM, and the ways in which the components interact across networks.
The project will bring together the efforts and relevant expertise from
the LHC experiments and LHC R&D projects, as well as from the current
or near-future experiments that are already engaged in building distributed
systems for computing, data access, simulation and analysis.
The primary goals of this project are:
Distributed databases are an important part of the CM to be studied.
The RD45 project has developed considerable expertise in the field of Object
Oriented Database Management Systems (ODBMS), and this project intends
to benefit from the RD45 experience and to cooperate with RD45 as appropriate,
in the specific areas where the work of the two projects necessarily
overlaps. The proposed project intends to investigate questions which are
largely complementary to RD45, such as network performance and prioritization
of traffic for a variety of applications that must coexist and share the
network resources.
2.0 OBJECTIVES
This project aims to develop a set of common modelling and simulation tools, and the environment that would enable the LHC experiments to realistically evaluate and optimize their analysis models and CMs, based on distributed data and computing architectures. Tools to realistically estimate the network bandwidth required by a given CM will be developed. The parameters that are necessary and sufficient to characterize a CM and its performance will be identified. Methods and tools to measure a Model's performance and to detect bottlenecks will be designed, developed, and tested in prototypes. This work will be done in as close cooperation as possible with the present LHC R&D projects and with current or near-future experiments. The goal is to determine a set of feasible models, and to provide a set of guidelines which the experiments can use to build their respective Computing Models.
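As an illustration of the kind of first-order estimate such bandwidth tools would automate, the average load a Computing Model places on a CERN-to-Regional-Centre link can be approximated from event sizes and access rates. The sketch below is a minimal, hypothetical calculation; the event size, user count and request rate are invented for illustration and are not parameters of any actual LHC Model:

```python
# Hypothetical first-order estimate of the average bandwidth a Regional
# Centre draws from CERN.  All numbers are illustrative assumptions,
# not parameters of any actual LHC Computing Model.

EVENT_SIZE_MB = 1.0                 # assumed size of one reconstructed event
USERS = 50                          # assumed physicists analysing remotely
EVENTS_PER_USER_PER_DAY = 10_000    # assumed events each user pulls daily
SECONDS_PER_DAY = 86_400

def mean_bandwidth_mbps(event_size_mb, users, events_per_user_per_day):
    """Average link load in Mbit/s, ignoring burstiness and caching."""
    mb_per_day = event_size_mb * users * events_per_user_per_day
    return mb_per_day * 8 / SECONDS_PER_DAY   # MB/day -> Mbit/s

if __name__ == "__main__":
    avg = mean_bandwidth_mbps(EVENT_SIZE_MB, USERS, EVENTS_PER_USER_PER_DAY)
    # A peak-to-mean factor must be layered on top; the 5x here is a guess.
    print(f"average load: {avg:.1f} Mbit/s, assumed peak: {avg * 5:.1f} Mbit/s")
```

A real tool would replace these constants with measured access profiles and add caching and replication effects, which is precisely what the simulation work described here is meant to capture.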
The main objectives are:
3.0 INTERACTIONS WITH EXPERIMENTS AND OTHER PROJECTS
The aim of this project is to establish a set of viable computing models and a set of common guidelines to allow experiments to develop their CMs in a realistic way. We believe that the best way to achieve this objective is to bring together the relevant expertise and to enhance the direct involvement of the LHC experiments in this R&D. The project will set up a framework for its collaboration with the experiments, with RD45, with the Technology Tracking Team (TTT), and with other groups having relevant expertise (for example HPSS and other MSS).
This document has been prepared in consultation with RD45. We have agreed to hold common meetings and workshops to discuss the overlapping areas of interest, and to define the most efficient way for both projects to proceed and produce the required results. In all cases, there will be a clear understanding with RD45 regarding the work-sharing, especially in testing the performance of a distributed ODBMS.
One of the important tasks of this project is to identify the questions and tests concerning an ODBMS, operating as part of a distributed system, that must be addressed in order to define the Computing Models. This task will be done in close collaboration with RD45. However, another important role of this project is to begin investigating questions related to the construction, operation and management of a distributed computing and network system optimized for large-scale data access; these questions are largely complementary to RD45. A good example of an area not covered by RD45 is network performance and the prioritization of traffic for a variety of applications that must coexist and share the network resources. These applications include interactive logins, high-priority access to system and detector parameters in the database, and real-time "collaborative" applications, in addition to transfers of substantial amounts of event data as requested by the ODBMS.
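To make the coexistence problem concrete, the sketch below simulates a shared link under strict priority queueing, with interactive and real-time traffic scheduled ahead of bulk ODBMS transfers. The traffic classes, sizes and link capacity are invented for illustration only; a production study would use measured traffic profiles:

```python
import heapq

# Strict-priority link scheduler sketch.  Classes, sizes and capacity are
# illustrative assumptions, not measured LHC traffic profiles.
LINK_MBPS = 10.0   # assumed shared link capacity (Mbit/s)

# (priority, name, size_mbit): lower priority number is served first.
PENDING = [
    (2, "bulk event-data transfer", 400.0),
    (0, "interactive login keystrokes", 0.01),
    (1, "detector-parameter DB query", 1.0),
    (2, "bulk event-data transfer", 400.0),
    (0, "collaborative video frame", 2.0),
]

def drain(pending, link_mbps):
    """Serve queued transfers strictly by priority; return finish times (s)."""
    heap = list(pending)
    heapq.heapify(heap)
    clock, finished = 0.0, []
    while heap:
        prio, name, size = heapq.heappop(heap)
        clock += size / link_mbps           # transmission time on the link
        finished.append((name, round(clock, 3)))
    return finished

if __name__ == "__main__":
    for name, t in drain(PENDING, LINK_MBPS):
        print(f"{t:9.3f} s  {name}")
```

Even this toy model shows the point at issue: without prioritization the keystrokes and database query would sit behind 80 seconds of bulk transfer; with it, they complete in well under a second.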
The "Computing Model groups" of the experiments will be responsible
for providing the parameters for the models for reconstruction, analysis,
Monte Carlo simulation, etc. The collaborations are already involved
in discussions with the proponents of this project, and it is recognized
that the collaborations will make the final choices leading to their CMs. While
the details of the LHC experiments' Models will differ, it is necessary
to first study a range of baseline models, so that all of the Models which
are finally chosen fall into the feasible range.
4.0 WORK PLAN
A primary aim of this project is to demonstrate a set of feasible models, and to provide a set of guidelines with which the experiments could build their respective Computing Models.
In order to achieve this goal, the first stage of the project
will:
In its second stage, the project will address the question of how a distributed computing system could be built, controlled and run efficiently. The architectures of the entire computing system, making use of distributed computing technologies, will have to be developed and critically analysed. The impact of the choice of a particular viable model on the infrastructure required at CERN, at the remote institutions, and the technical requirements for users' work group servers and desktops will be evaluated.
The work will be performed using commercial modelling tools capable of simulating large distributed systems, modelling tools developed as part of this project (based on smaller tool sets already developed for other purposes), or a combination of both.
Scaling tests under controlled and uncontrolled shared network conditions using pieces of currently-operating experiments' data analysis, as well as working prototype analyses for the LHC, will be performed in order to extract information on how to best use the database and its management system, as part of the overall distributed computing system. This information will be complementary to that learned in RD45.
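A minimal example of the kind of model such a tool set would support is a single-server queue standing in for one Regional Centre data server; watching how job turnaround grows with offered load is the simplest form of bottleneck detection. The arrival and service rates below are arbitrary illustrations, not LHC parameters:

```python
import random

# Toy discrete-event model of one data server at a Regional Centre.
# Arrival and service rates are arbitrary illustrations, not LHC figures.

def mean_turnaround(arrival_rate, service_rate, n_jobs=20_000, seed=1):
    """Simulate an M/M/1 queue; return the mean job turnaround time."""
    rng = random.Random(seed)
    clock = server_free = total = 0.0
    for _ in range(n_jobs):
        clock += rng.expovariate(arrival_rate)    # next job arrives
        start = max(clock, server_free)           # wait if server is busy
        server_free = start + rng.expovariate(service_rate)
        total += server_free - clock              # queueing + service time
    return total / n_jobs

if __name__ == "__main__":
    for load in (0.3, 0.6, 0.9):                  # utilisation of the server
        t = mean_turnaround(arrival_rate=load, service_rate=1.0)
        print(f"utilisation {load:.1f}: mean turnaround {t:.2f}"
              f" (M/M/1 theory {1 / (1 - load):.2f})")
```

The sharp growth of turnaround as utilisation approaches 1 is exactly the behaviour that must be mapped out, per component, for each candidate Computing Model.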
A number of important specific issues will be addressed in the course
of the project. Examples are:
In the third stage, the project will provide tools and prototype designs
for the test implementations of the elements of the LHC experiments' Computing
Models, in time for the second round of the Computing Technical Proposals
in 2001. This stage will most likely be carried out as a follow-on R&D project.
5.0 MAJOR TASK DEFINITIONS
The major tasks foreseen, in order to achieve the objectives of the
project are:
6.0 DELIVERABLES
The major deliverables are:
7.0 RESOURCES REQUIRED
Part of the resources, both people and material, are already available from the experiments and from the general support services in collaborating laboratories and institutions. We estimate that 9-11 FTE will come from the experiments. Additional resources needed for the central support of this project are requested from CERN. We anticipate this support to be at the level of 2.5 FTE: development of a modelling toolset will require 1 FTE, setup and operation of the test-bed will require 1/2 FTE, and studies of the distributed data and computing architectures, and of network behaviour and network management, will require at least 1/2 FTE each.
The 2.5 FTE at CERN will provide a core of professional experts who
will work with the physicists and technical staff at remote laboratories
and universities to evaluate, evolve and classify promising classes of Computing
Models, and to extract the essential features of the feasible models.
Institution | FTE
---|---
INFN/Bari /CMS |
INFN/Roma-1 /ATLAS |
INFN/Roma-1 /CMS | 0.5 FTE
INFN/Bologna /CMS |
INFN/Perugia /CMS | 0.6 FTE
INFN/Milano /ATLAS | 1.0 FTE
Caltech/CMS |
Helsinki Institute of Physics | 0.4 FTE
INFN/Genova /ATLAS | 0.5 FTE
Padova /CMS | 0.6 FTE
Tufts /ATLAS | 0.6 FTE
US-ATLAS | 1.0 FTE
FNAL /CMS | 1.0 FTE
TOTAL: | 9.3 FTE
The groups outside of CERN engaged in this project have committed the use of several workgroup servers and desktops, along with shares of large computing systems (e.g. at INFN and Caltech). Specific configurations of local and wide area networks managed by some of the collaborating groups are planned, to prototype elements of the distributed systems and to provide test data validating the simulations. The total value of these systems, or shares of systems, dedicated to this project cannot be specified precisely, but is estimated to be in the range of several hundred kCHF.
A workgroup server that will serve as a central element in the ensemble
of servers for this project is requested. The server is currently foreseen
to be a one- or two-CPU UNIX system, with sufficient disk space,
a high-speed tape drive for local file storage and backup, and ATM or Gigabit
Ethernet, as well as Fast Ethernet local area network interfaces. The CERN-based
specialists working on this project will use this system as a development
and test platform. The project will certainly benefit from the experience
and from the collaboration with the IT/PDP group in the development of
a test-bed prototype.
Category | Amount
---|---
Dedicated workgroup server for the development work and tests of the project |
Software licences |
Network interfaces | 20 kCHF
Travel | 20 kCHF
TOTAL: | 140 kCHF
8.0 SCHEDULE
Phase 1: Provide a first-round set of tools for evaluating the baseline models, and allow the experiments to start defining their CMs, within one year.
Phase 2: Provide a refined set of tools, and the guidelines for the construction of a feasible CM, in time for the preparation of the next round of the Computing TPRs (Fall 1999 for ATLAS and CMS; later for LHCb and ALICE).
Phase 3: Provide prototype designs for the test implementations of the elements of the LHC experiments' Computing Models, in time for the second round of the Computing Technical Proposals in 2001. This stage will most likely be carried out as a follow-on R&D project.