Scheduling for Responsive Grids
User-level scheduling
User-level (or application-level) scheduling is a virtualization layer
at the application side: instead of being executed directly, the
application is executed via an overlay scheduling layer (user-level
scheduler). The overlay scheduling layer runs as a set of regular
user jobs and therefore operates entirely in user space (Fig. 1).
Fig.1
User-level scheduling requires neither modifications to the Grid
middleware and infrastructure nor the deployment of special services
at the Grid sites. It is therefore much easier to set up and operate a
user-level scheduling system that exploits the full range of Grid
sites available to a given user.
Constraints
The user-level scheduling approach has the following constraints:
- user jobs must be instrumented with the scheduling functionality;
- jobs with user-level scheduling compete on the same basis as all other jobs on the Grid, so it is not possible to obtain stronger guarantees of resource availability than those provided by the underlying middleware.
A user-level scheduler may be embedded into the application or
external to it. A scheduler embedded into the application is
developed and optimized specifically for that application,
typically by refactoring and instrumenting the original application
code. This allows the scheduling to be fine-tuned and customized
according to the specific execution patterns of the application. Such
a scheduler is intrusive at the application source code level, which
means that code reuse of the scheduler is reduced and the development
effort is high for each application. A scheduler external to the
application relies on general properties of the application, such as a
particular parallel decomposition pattern (e.g. iterative
decomposition, geometric decomposition or divide-and-conquer). An
application adapter connects the external scheduler to the application
at runtime. Depending on the decomposition pattern, application
refactoring at the source code level may or may not be required. The
disadvantage of external schedulers is that it may be very hard to
generalize execution patterns for irregular or speculative
parallelism. In such cases the development of a specialized embedded
scheduler may be necessary.
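The adapter idea above can be sketched as a small interface: the external scheduler talks only to the adapter, which hides the application behind a decomposition pattern. This is an illustrative sketch under assumed names (ApplicationAdapter, split/execute/merge), not DIANE's actual API:

```python
# Hypothetical sketch of an application adapter for an external
# user-level scheduler; all names are illustrative assumptions.
from abc import ABC, abstractmethod

class ApplicationAdapter(ABC):
    """Connects a generic external scheduler to one application."""

    @abstractmethod
    def split(self, job_spec):
        """Decompose the job into independent task descriptions."""

    @abstractmethod
    def execute(self, task):
        """Run one task on a worker and return its result."""

    @abstractmethod
    def merge(self, results):
        """Combine task results into the final job output."""

class IterativeAdapter(ApplicationAdapter):
    """Iterative decomposition: one task per loop index."""
    def split(self, job_spec):
        return [{"index": i} for i in range(job_spec["iterations"])]
    def execute(self, task):
        return task["index"] ** 2          # stand-in for real work
    def merge(self, results):
        return sum(results)

adapter = IterativeAdapter()
tasks = adapter.split({"iterations": 4})
print(adapter.merge(adapter.execute(t) for t in tasks))  # 0+1+4+9 = 14
```

A new decomposition pattern is supported by writing a new adapter subclass; the scheduler itself is untouched, which is the code-reuse advantage of the external approach.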
It is not possible in general to guarantee the availability of Grid
resources with user-level scheduling techniques. Jobs instrumented
with user-level scheduling obey the same resource allocation rules as
regular jobs. Unless the middleware provides mechanisms for resource
reservation, pre-emption or general processor scheduling with fast
queues, hard QoS requirements may only be implemented by dedicating
resources. This problem may be partially addressed by a user-level
scheduler which delays the execution of jobs until the requested
number of resources is available. However, this strategy violates
fair-share in a multiuser environment.
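The delay-until-available strategy amounts to a quorum barrier at the master: dispatch only begins once enough workers have registered. A minimal sketch, assuming a thread-based master (the class and method names are invented for illustration):

```python
# Illustrative sketch (not DIANE code): a master that delays task
# dispatch until the requested number of workers has registered.
import threading

class WorkerBarrier:
    def __init__(self, required):
        self.required = required
        self.workers = []
        self.cond = threading.Condition()

    def register(self, worker_id):
        """Called when a worker agent connects to the master."""
        with self.cond:
            self.workers.append(worker_id)
            self.cond.notify_all()

    def wait_for_quorum(self, timeout=None):
        """Block dispatch until the quorum is reached (or timeout)."""
        with self.cond:
            self.cond.wait_for(
                lambda: len(self.workers) >= self.required,
                timeout=timeout)
            return len(self.workers) >= self.required

barrier = WorkerBarrier(required=3)
for w in ("w1", "w2", "w3"):       # workers arriving one by one
    barrier.register(w)
assert barrier.wait_for_quorum(timeout=1.0)
```

Note that while the master waits, the already-acquired workers sit idle, which is exactly the fair-share cost mentioned above.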
On the other hand, user-level scheduling may improve the Quality of
Service on the Grid in the following ways:
- reduce the job turnaround time (makespan);
- provide a sustained job output rate;
- optimize the failure recovery.
In the next sections we examine two user-level schedulers: an external
scheduler for generic master-worker applications (DIANE) and an
embedded scheduler for medical image processing (gPTM3D).
DIANE: a generic, external scheduler
Overview
DIANE (DIstributed ANalysis Environment) is an R&D project developed
in the Information Technology Department at CERN, Geneva. It is a
generic user-level scheduler based on extended task farming
(master/worker) processing [ref!]. The runtime behaviour of the
framework, such as failure recovery or task dispatching, may be
customized with a set of hot-pluggable policy functions. This enables
fine-tuning of the scheduler according to the needs of a particular
application and also provides support for other parallel decomposition
patterns (e.g. divide-and-conquer).
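A hot-pluggable policy function can be pictured as a callable slot on the scheduler that may be swapped at runtime. The sketch below illustrates the idea only; the function and class names are assumptions, not DIANE's interface:

```python
# Illustration of hot-pluggable policy functions; names are
# hypothetical, not taken from DIANE.
def dispatch_fifo(pending, worker):
    """Default policy: hand out tasks in arrival order."""
    return pending[0] if pending else None

def dispatch_shortest_first(pending, worker):
    """Alternative policy: prefer the shortest estimated task."""
    return min(pending, key=lambda t: t["estimate"]) if pending else None

class Scheduler:
    def __init__(self, dispatch_policy):
        self.dispatch_policy = dispatch_policy   # swappable at runtime

    def next_task(self, pending, worker):
        return self.dispatch_policy(pending, worker)

tasks = [{"id": 1, "estimate": 30}, {"id": 2, "estimate": 5}]
sched = Scheduler(dispatch_fifo)
assert sched.next_task(tasks, "w1")["id"] == 1
sched.dispatch_policy = dispatch_shortest_first  # hot-swap the policy
assert sched.next_task(tasks, "w1")["id"] == 2
```

Because the policy is just a function, an application can supply its own dispatch, retry, or recovery behaviour without touching the scheduler core.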
Applications
DIANE provides a Python-based framework that enables rapid
integration with existing applications. Both transparent and
intrusive application integration have been demonstrated. Data
analysis in the Athena framework for the ATLAS experiment [ref], the
Autodock-based Avian Flu drug search [ref] and frequency compatibility
analysis for the International Telecommunication Union RRC06 [ref] are
all examples of transparent application integration, i.e. the
application adapters, in the form of Python packages, were developed
without modifying the original application code. Examples of intrusive
integration include particle simulation in medical physics using the
Geant4 toolkit [ref] and BLAST... The parallelization of these
applications is based on iterative decomposition and a master/worker
processing model with fully independent tasks.
Execution model
In the DIANE execution model, a temporary virtual Master/Worker
overlay network is created for each user job and destroyed when the
job terminates. This is compatible with multi-user fair-share
scheduling on the Grid and guarantees that resources are not
monopolized by a single user.
The job is split into a number of tasks which are executed by a number
of Worker agents in the Grid. The Worker agents run as regular Grid
jobs. Each task is defined by a set of application-specific
parameters. Dispatching is the process of allocating tasks to workers
by sending the appropriate parameters to the Worker agents. The
communication overhead is typically much smaller than in systems based
on checkpointing and task migration, which allows a high message rate
(incoming and outgoing tasks) to be achieved. For the ITU frequency
analysis application, the combined dispatching/receiving rate reached
peaks of 115 Hz.
The Master agent is started on a computer with external, incoming
connectivity enabled. The user job is activated and the Master agent
splits the job into tasks. Independently, a set of Worker agents is
started via a Grid User Interface. The address of the Master agent is
shipped to each Worker agent via the Grid sandbox. When a Worker agent
starts on a Grid Worker Node, it creates a permanent TCP/IP connection
to the Master. The Master allocates tasks to free Workers by sending
task parameters (such as a loop index or configuration parameters), so
the exchanged messages are typically small.
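The dispatch cycle described above can be sketched with local threads and queues standing in for the Grid Worker agents and their TCP/IP links. This is a minimal illustration of the pattern, not DIANE code:

```python
# Sketch of the master/worker dispatch cycle: the master sends small
# task parameters (here, a loop index); workers return results.
import queue
import threading

def worker(task_q, result_q):
    """Stand-in for a Worker agent pulling task parameters."""
    while True:
        task = task_q.get()
        if task is None:            # shutdown sentinel from the master
            break
        result_q.put((task, task * task))   # stand-in for real work

task_q, result_q = queue.Queue(), queue.Queue()
threads = [threading.Thread(target=worker, args=(task_q, result_q))
           for _ in range(3)]       # three "worker agents"
for t in threads:
    t.start()

for i in range(10):                 # master splits the job into 10 tasks
    task_q.put(i)
for _ in threads:                   # one sentinel per worker
    task_q.put(None)
for t in threads:
    t.join()

results = dict(result_q.get() for _ in range(10))
print(sum(results.values()))        # 0+1+4+...+81 = 285
```

Because only small parameters cross the wire, the per-task overhead stays low, which is what makes the high dispatch rates quoted above plausible.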
High-granularity splitting
DIANE supports high-granularity job splitting, i.e. partitioning a job
into a large number of short or very short tasks. For example, the ITU
frequency compatibility analysis jobs, have been split into
approximately 50 thousand tasks performed simultaneously by around 300
worker agents in 6 EGEE Grid sites across Europe (Fig.2). Task
duration was highly variable: from few seconds (majority of the tasks)
to 30 minutes (few individual tasks). The exact distribution of the
task duration was not known until the job was fully executed, i.e. it
was not possible to
a priori agglomerate short tasks and isolate
long tasks. Without the dynamic load-balancing the total job
turnaround time would be orders of magnitude higher.
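The benefit of dynamic load-balancing over a static, a-priori split can be seen in a toy makespan comparison (the durations below are illustrative, not the real ITU workload):

```python
# Toy comparison: static round-robin assignment vs. dynamic
# pull-based dispatch, with highly variable task durations.
import heapq

def makespan_static(durations, n_workers):
    """Tasks assigned round-robin before execution starts."""
    loads = [0.0] * n_workers
    for i, d in enumerate(durations):
        loads[i % n_workers] += d
    return max(loads)

def makespan_dynamic(durations, n_workers):
    """Each free worker pulls the next task (greedy list scheduling)."""
    heap = [0.0] * n_workers
    heapq.heapify(heap)
    for d in durations:
        heapq.heappush(heap, heapq.heappop(heap) + d)
    return max(heap)

# a few long tasks hidden among many short ones, duration unknown
# in advance, as in the ITU job
durations = [30.0] * 5 + [1.0] * 95
print(makespan_static(durations, 10))   # 39.0: long tasks pile up
print(makespan_dynamic(durations, 10))  # 30.0: idle workers absorb
                                        # the short tasks
```

With a static split, whichever workers happen to receive the long tasks also keep their share of short ones; with dynamic dispatch, the short tasks flow to whichever workers are free, so the makespan approaches the length of the longest task.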
Fig.2
Output rate
User-level scheduling provides a more sustained job output rate.
Fig.3 shows the number of completed tasks in the function of time with
(red) and without (green) user-level scheduling for a Geant 4 release
validation application . The job has been split in 207 tasks
and average task duration was around 400 seconds. In the Grid, the
load on the Computing Elements (queuing time) and the load on the
Resource Broker (efficiency of matchmaking) may change dynamically in
short periods of time. The user-level scheduler assures that even if
the number of effectively available resources is low and varying, the
job output throughput is stable (provided the high-granularity
splitting).
Fig.3
Error recovery
Efficient and accurate failure recovery is an important factor of the
Quality of Service. Large distributed systems such as Grid are prone
to diverse configuration and system errors. A generic strategy of
handling errors does not exist and depends on the application as well
as the environment. An application-oriented scheduler such as DIANE
is capable of distinguishing application and system errors and
reacting appropriately via customizable error recovery
methods. Crashing worker agents are automatically taken off the worker
pool. Transient connectivity problems in the WAN are detected. The
failed tasks are automatically re-dispatched to another worker agents.
The mechanism uses a direct, highly efficient communication links in
the virtual Master/Worker network and is much more efficient than a
standard metascheduling techniques implemented in the middleware (JDL
RetryCount parameter) which involve the full submission cycle.
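The recovery behaviour described above (remove crashing workers, re-dispatch their tasks) reduces to a simple loop at the master. A minimal sketch, with invented names standing in for the real agents:

```python
# Sketch of the re-dispatch strategy: failed tasks go back to the
# pool and crashing workers are removed (illustrative, not DIANE code).
def run_with_recovery(tasks, workers, execute):
    pending = list(tasks)
    results = {}
    alive = set(workers)
    while pending and alive:
        task = pending.pop(0)
        worker = next(iter(alive))          # any free worker
        try:
            results[task] = execute(worker, task)
        except Exception:
            alive.discard(worker)   # take crashing worker off the pool
            pending.append(task)    # re-dispatch the failed task
    return results

def flaky_execute(worker, task):
    """Hypothetical task execution: one worker always crashes."""
    if worker == "bad":
        raise RuntimeError("worker lost")
    return task * 2

out = run_with_recovery([1, 2, 3], ["bad", "ok"], flaky_execute)
assert out == {1: 2, 2: 4, 3: 6}
```

No job resubmission through the middleware is needed: the task simply moves to another already-connected worker, which is why this path is so much cheaper than a RetryCount-style full submission cycle.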
Part of the recent Avian Flu drug search was performed using the DIANE
scheduler. A Master agent running for several weeks took care of
efficient error recovery, and the system was operated by a single
person. Because of the long duration of the job, worker agents were
periodically aborted when they exceeded the time limits of the queues
at the Computing Elements. The operator added new worker agents to the
system so that at least 200 were available at any time. DIANE was able
to dynamically reconfigure the virtual Master/Worker network to
accommodate the new worker agents.