Adding Quality of Service to the Grid with DIANE.
Jakub Moscicki, CERN/IT
Currently the mainstream usage of the Grids resembles a very large
batch system: the goal is to maximize the computational throughput
over long periods of time. This fits many applications, in
particular the large data productions of the LHC experiments: a
production manager submits thousands of jobs to the system and expects
the results to come out after several days. However, this model does
not support other usage scenarios very well. For example, in
interactive analysis the response of the system should be much faster
and aligned with the interactive activity of the user. Life Science
applications often involve short-deadline jobs: a very large number of
very short jobs which must finish within a certain time limit. In
general, Quality of Service (QoS) characteristics are not present in
the current Grid systems.
On the other hand, there is also an effect of the scale and
complexity. EGEE is the world's largest Grid system to date,
comprising over 20000 worker nodes, 200 computing sites and petabytes
of storage. Such an impressive enterprise, connecting heterogeneous
computing environments and organizations, comes at a cost: from the
end-user perspective, tracking down problems may be very
time-consuming and, at times, the system may exhibit lower efficiency.
User-level scheduling is a very lightweight software technique which
makes it possible to add new capabilities, and to improve QoS
characteristics and reliability, on top of existing Grid middleware
and infrastructure.
DIANE (DIstributed ANalysis Environment, http://cern.ch/diane) is an
R&D project started at CERN/IT in 2001. Initially the target was to
investigate distributed ntuple analysis for particle physics. Over
time, however, DIANE has become an application-independent user-level
scheduling tool for the Grid, and it has been interfaced to a number
of applications in High Energy Physics, Medical Physics, Life Sciences
and other fields.
DIANE is a Python framework based on the Master/Worker processing model
which is used on top of regular Grid middleware in a transparent
way. Worker agents are sent to the Grid as regular Grid jobs and they
register to the Master agent by opening a TCP/IP connection. The
Master agent runs on the user's desktop computer and is the
coordination point for the virtual Worker pool. Workers may
dynamically join and leave the pool, without disrupting the processing
as a whole. The processing is composed of a large number of short
tasks which are the units of computation. The Master allocates the
tasks to Workers directly, bypassing the middleware scheduling layer.
This reduces the total job turnaround time and makes it possible to
react much faster to errors in task execution by reallocating the
failed tasks to other workers. Splitting the processing into a large
number of fine-grained tasks improves the load balancing, ensuring
efficient utilization of the workers. As a result, the computing
resources may be returned to the Grid sooner: the Worker agents are
automatically terminated when the processing reaches the end.
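The Master/Worker scheme described above can be illustrated with a
minimal Python sketch. It is not DIANE's actual API: the function
run_master, its parameters and the execute callback are hypothetical
names chosen for this illustration, and threads in a single process
stand in for Worker agents running as Grid jobs. Workers pull tasks
from a shared queue, a failed task is put back so that whichever worker
is free next (possibly a different one) can retry it, and a worker
exits when there is nothing left to pull.

```python
import queue
import threading


def run_master(tasks, n_workers, execute, max_attempts=3):
    """Distribute tasks to a pool of workers and re-queue failed tasks.

    All names here are hypothetical; this is not the DIANE API.
    execute(worker_id, task) performs one task and may raise on error.
    Returns (results, failed), two dicts keyed by task.
    """
    pending = queue.Queue()
    for task in tasks:
        pending.put((task, 1))  # (task, attempt number)

    results, failed = {}, {}
    lock = threading.Lock()

    def worker(worker_id):
        while True:
            try:
                task, attempt = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to pull: the worker terminates
            try:
                value = execute(worker_id, task)
            except Exception as exc:
                if attempt < max_attempts:
                    # Put the task back; the next free worker takes it.
                    pending.put((task, attempt + 1))
                else:
                    with lock:
                        failed[task] = str(exc)
            else:
                with lock:
                    results[task] = value

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, failed


if __name__ == "__main__":
    def square(worker_id, task):
        return task * task

    results, failed = run_master(range(10), n_workers=3, execute=square)
    print(sorted(results.items()), failed)
```

In this sketch a task is abandoned after max_attempts failures and
reported in the second dictionary, so a single bad task cannot block
the rest of the processing.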
DIANE's Python framework makes it easy to integrate existing
applications quickly, even ones as complex as Athena, the analysis
framework of the ATLAS experiment. Studies performed by members of the
ATLAS collaboration showed that it is possible to use DIANE to
integrate local and Grid resources, and even resources which come from
different Grid infrastructures at the same time. A DIANE-based
parallel Athena prototype has been demonstrated at a number of EGEE
conferences and it has been included in the ATLAS Technical Design
Report (TDR 2005). Additionally, DIANE has been interfaced to Ganga, a
user-friendly Grid interface created in the context of the ATLAS and
LHCb experiments at CERN. In the future, physicists using Ganga will
be able to choose the DIANE optimizer, which will be attached
transparently to their jobs.
The statistical regression testing which is part of the Geant4
release validation procedure is operated on the EGEE Grid using the
DIANE scheduler. This cuts the turnaround time severalfold and
provides a more stable and predictable job output rate, because the
Worker agents which have been acquired at the beginning of processing
are held inside the pool and are shielded from instabilities in the
Grid brokering. A stable job output rate is an important QoS feature
because it allows the testing operations on the Grid to be planned
more reliably.
DIANE has recently been used to perform a sizeable fraction of the
in silico drug discovery on the EGEE infrastructure. The challenge was
to analyse possible drug components against the avian flu virus
H5N1. This activity, addressing a current and socially important
problem, has been covered in a number of press releases worldwide,
including by the BBC and Liberation. It has been demonstrated that a
User Level Scheduler such as DIANE may improve the distribution
efficiency on the Grid from below 40% to above 80% by optimizing the
allocation of the fine-grained computing tasks. The automatic error
recovery mechanisms proved effective over an extended period of
continuous work: the part of the in silico drug search activity
performed with DIANE lasted around 30 days.
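The load-balancing effect behind this kind of efficiency gain can be
illustrated with a small simulation. The sketch below is not taken
from the avian flu activity: the worker speeds and the workload are
made-up numbers, and the function names are hypothetical. It compares
a static split, in which every worker receives an equal share of the
work up front, with dynamic allocation of many small tasks to
whichever worker becomes free first.

```python
import heapq


def static_makespan(total_work, speeds):
    """Finish time when every worker gets an equal share up front."""
    share = total_work / len(speeds)
    return max(share / s for s in speeds)


def dynamic_makespan(total_work, n_tasks, speeds):
    """Finish time when equal-sized tasks go to the earliest free worker."""
    task_size = total_work / n_tasks
    free_at = [(0.0, i) for i in range(len(speeds))]  # (time free, worker)
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(n_tasks):
        t, i = heapq.heappop(free_at)
        done = t + task_size / speeds[i]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish


def efficiency(makespan, total_work, speeds):
    """Ratio of the ideal finish time (perfect balance) to the actual one."""
    ideal = total_work / sum(speeds)
    return ideal / makespan


if __name__ == "__main__":
    speeds = [1.0, 1.0, 0.5, 0.25]  # assumed relative worker speeds
    work = 1200.0                   # assumed total work, arbitrary units
    coarse = efficiency(static_makespan(work, speeds), work, speeds)
    fine = efficiency(dynamic_makespan(work, 1200, speeds), work, speeds)
    print(f"static split: {coarse:.2f}, fine-grained dynamic: {fine:.2f}")
```

With the static split the slowest worker determines the finish time,
whereas with fine-grained tasks a slow worker simply completes fewer
of them. The model ignores scheduling and communication overhead,
which in practice limits how small the tasks can usefully be.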
Over the months of May and June 2006, CERN successfully supported
a series of large-scale data processing activities carried out
by the International Telecommunication Union (ITU) as part of the
ITU's Regional Radiocommunication Conference (RRC-06). Several sites
of the EGEE infrastructure provided a computing grid of more than 400
PCs to work on each analysis in parallel. The processing on the EGEE
infrastructure was conducted using the DIANE scheduling layer. The
system completed more than 200 thousand very short frequency analysis
jobs (clustered in around 40 thousand processing tasks) in around one
hour, proving that on-demand computing with short deadlines is possible
on the Grid. The frequency allocation plan optimized with the help of
the Grid allowed over 1000 delegates from 104 countries to adopt the
treaty agreement that will replace the analog broadcasting plans
existing since 1961 for Europe and since 1989 for Africa.
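A rough back-of-envelope calculation, using only the rounded figures
quoted above (about 200 thousand jobs, about 40 thousand tasks, about
400 workers, about one hour), gives a feel for the task granularity
involved. The variable names are illustrative, and the results are
order-of-magnitude estimates rather than measurements.

```python
# Rounded figures from the text; the real values were "more than" or
# "around" these numbers, so the results are only rough estimates.
jobs = 200_000
tasks = 40_000
workers = 400
wall_clock_s = 3600  # about one hour

jobs_per_task = jobs / tasks
tasks_per_worker = tasks / workers
seconds_per_task = wall_clock_s / tasks_per_worker
seconds_per_job = seconds_per_task / jobs_per_task

print(f"jobs per task:     {jobs_per_task:.0f}")
print(f"tasks per worker:  {tasks_per_worker:.0f}")
print(f"seconds per task:  {seconds_per_task:.0f}")
print(f"seconds per job:   {seconds_per_job:.1f}")
```

With these rounded inputs each task bundles about 5 jobs, each worker
handles about 100 tasks, and a worker has on average about 36 seconds
per task, including any scheduling and communication overhead.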
In the future, closer integration with Ganga will enable access to
all DIANE capabilities. Ongoing activities in the context of PhD
research aim at supporting hard QoS requirements with novel techniques
such as a floating worker pool, extending scalability above 500 worker
agents, and supporting inter-dependent tasks for workflow applications.
--
JakubMoscicki - 26 Jul 2006