Some Ideas about scheduling



Introduction

PROOF is meant to be an interactive system. Recently some batch-like features have been added, but this should not divert the project from its main goal, which is to facilitate analysis in the design and development phase by providing easier interactive access to larger portions of data (larger than those that could be held on a local machine).

PROOF has been shown to perform well in the following scenario:

  1. single user (or a very small number of users)
  2. file locality, i.e. data files copied to the pools available on each cluster machine

This scenario is not the one met in LHC analysis, where:

  1. the number of users is at least O(100)
  2. data files are typically served by storage systems via file server daemons
  3. local pools may be available for temporary replication of data files

This more realistic scenario is completed by the size and properties of the cluster:

  1. O(100) CPUs
  2. O(150 GB) pool hard-disk
  3. Gigabit Ethernet connection to the storage system

So the main question becomes:
Is it possible to provide a good quality of service to O(100) users analysing data located on a storage system attached via Gigabit Ethernet to O(100) CPUs?

There are at least two issues here: the scheduling of the sessions assigned to users and data access. The two are correlated, as scheduling may depend on the performance of the processes, which in PROOF depends on data-access latencies.

Reducing latencies in accessing data is the subject of recent studies and will be the main topic of research of a new PROOF developer (Leandro Franco).

Scheduling is a less well-defined and more complicated issue. In PROOF, the processes to be scheduled are ROOT applications started on the worker nodes. Each user gets one instance of such an application on each assigned worker node. Within each worker node, scheduling reduces to operating-system scheduling, and we can assume that this performs well; the only parameter on which we could act is the nice level, i.e. the overall priority. Typically (see A. Tanenbaum), in multi-computer systems the scheduling problem reduces to a load-balancing problem, i.e. the problem of efficiently distributing the processes among the available CPUs.

As examples of load-balancing algorithms in multi-processor systems, Tanenbaum describes a sender-initiated distributed heuristic algorithm, where overloaded nodes send out messages looking for someone ready to take some work, and a receiver-initiated distributed heuristic algorithm, in which less loaded nodes ask around for work once they are idle. This second algorithm, which de facto implements a pull architecture, has the advantage of loading the system with extra messages only when it is underloaded and therefore better able to tolerate the additional probes.

PROOF internally implements a pull architecture to dynamically load-balance the work among the worker nodes.
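
To make the pull mechanism concrete, here is a minimal toy sketch (hypothetical types, not the actual PROOF packetizer interface): an idle worker asks the master for the next packet of work, so faster workers automatically process more packets.

  // Toy sketch of the pull model (hypothetical types, not the actual PROOF
  // packetizer interface): an idle worker asks the master for the next packet,
  // so the work distribution adapts to the speed of each worker.
  #include <queue>

  struct Packet { long firstEntry; long nEntries; };

  class ToyMaster {
     std::queue<Packet> fPackets;            // work still to be distributed
  public:
     void Add(const Packet &p) { fPackets.push(p); }
     // Called by a worker whenever it has finished its previous packet.
     bool NextPacket(Packet &p) {
        if (fPackets.empty()) return false;  // nothing left: the worker stops
        p = fPackets.front();
        fPackets.pop();
        return true;
     }
  };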

A PROOF session is defined by the workers assigned to it. To improve scheduling we are considering the following options:

  1. define sessions with only a sub-set of the available worker nodes; several techniques could be envisaged to choose the worker sub-set: random, round-robin, based on the current load and available memory, etc. (a sketch of such a selection follows this list)
  2. reduce the number of workers, either temporarily (by suspending some of them) or definitively (by terminating them)
  3. reconfigure the sub-set of workers with some frequency, based on the results of the last few seconds or minutes; this implies the possibility of adding workers to the active list
  4. change the priority (nice level) of some or all of the processes participating in the session.
  5. ...
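
As an illustration of option 1, a possible worker-selection helper could rank the available nodes by their current load; random or round-robin choices would be equally simple variations. The names below (WorkerInfo, ChooseWorkers) are purely illustrative, not PROOF classes.

  // Hypothetical sketch of option 1: choose a sub-set of the available worker
  // nodes for a new session, here the n least loaded ones.
  #include <algorithm>
  #include <cstddef>
  #include <string>
  #include <vector>

  struct WorkerInfo {
     std::string host;
     double      load;       // e.g. 1-minute load average
     double      freeMemMB;  // available memory in MB
  };

  std::vector<WorkerInfo> ChooseWorkers(std::vector<WorkerInfo> pool, std::size_t n)
  {
     // Sort by current load (lowest first); available memory, a random pick or
     // a round-robin scheme could be used instead, as listed above.
     std::sort(pool.begin(), pool.end(),
               [](const WorkerInfo &a, const WorkerInfo &b) { return a.load < b.load; });
     if (pool.size() > n) pool.resize(n);
     return pool;
  }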

However, a more relevant point is to determine when to act. Some information is available to the master during processing (a possible representation is sketched below):

  1. Worker node performance
    1. MB processed per sec
    2. CPU time used
    3. Performance evaluation wrt expectation in the last N seconds
  2. Total CPU time used by the cluster in the last N seconds
  3. Updated estimate of the CPU time still needed to complete the query
  4. ...

The session with its workers thus looks rather like a process on a single-CPU machine, and the parameters just mentioned are similar to those used for single-process scheduling on Unix systems. This suggests that the parallel should perhaps be studied in some more detail: something could be learned from it.
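
As a concrete illustration, the per-session bookkeeping suggested by the list above could be collected in a structure like the following (hypothetical names, not an existing PROOF class):

  // Hypothetical per-session bookkeeping the master could maintain, mirroring
  // the quantities listed above (not an existing PROOF structure).
  #include <vector>

  struct SessionStatus {
     struct WorkerPerf {
        double mbPerSec;         // MB processed per second
        double cpuTime;          // CPU time used so far (seconds)
        double perfVsExpected;   // measured/expected performance in the last N seconds
     };
     std::vector<WorkerPerf> workers;   // one entry per worker node
     double clusterCpuLastN;            // total CPU time used by the cluster in the last N seconds
     double cpuTimeRemaining;           // updated estimate of the CPU time still needed
  };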

Scheduling PROOF sessions as processes

In this section we investigate a solution based on considering a PROOF session as a single process running on a processor identified with the whole cluster.

The process table in this case should be managed by an entity having full control of the system. If only one node in the system is allowed to be a master, a first obvious candidate would be the coordinator on the master node: this process knows about all the PROOF sessions it has started and may request all the information it needs about the queries being run.

However, a more realistic case is the one in which there are a few master nodes in the system. In this case, while the single masters would continue to be the collection points for the information about the queries running on them, a higher-level entity must have control of the whole picture. The main tasks of this separate entity would be to:

  1. manage the process table, collecting information from the single masters at fixed intervals
  2. decide the priorities and communicate back the decisions

This new entity, hereafter referred to as the scheduler, would be a new running mode of the XrdProofdProtocol started in a separate daemon. As a by-product, the scheduler, having an instantaneous picture of the whole system, could also act as the initial redirector to evenly load the available masters (see below).

Workflow

The following cycle could be envisaged:

  1. At a given frequency, representing de facto the quantum of the processor, the scheduler asks, via the masters, for the status of the running queries; the status should contain this sort of information ("CPU-time" here means some number expressing the amount of resources used: it may be something different from the plain CPU time):
    1. the user ID or tag (for accounting)
    2. the total CPU-time used
    3. the CPU-time used in the last N seconds
    4. the updated estimate of the remaining CPU-time needed to complete the query
    5. possibly some information about the performance of the single worker node
    6. ...
  2. The request arrives at the PROOF sessions on the masters in the form of an interrupt, which must be served with high priority; after sending off the reply, the master goes back to its normal distribution work, so that, if it is about to complete a query, it is not delayed by the time the scheduler takes to reach and communicate its decision.
  3. The decision is announced to the PROOF sessions on the masters in the form of an interrupt, which again must be processed with high priority; the result of the decision is sent by the scheduler to the coordinators on the masters and picked up by the PROOF master via:

    EQueryAction TProofServ::GetWorkers(TList &ToBeAdded, TList &ToBeStopped, Int_t &PriorityChange)

    where
    1. enum EQueryAction { kOK, kModify, kStop } with
      1. kOK : the query can be continued with the existing configuration;
      2. kModify : the configuration has to be modified according to the contents of the lists;
      3. kStop : the query must be stopped at this point returning the available results.
    2. TList &ToBeAdded / TList &ToBeStopped
      are the lists of worker nodes to be added / to be stopped
    3. Int_t &PriorityChange
      retrieves the change in priority of the processes associated with the session.
  4. The PROOF session will use this information to (see also the sketch below):
    1. start the new workers, if any, initializing them to the current query context
    2. stop the workers to be stopped, retrieving their current output list and statistical information.

Once the new setup is ready, processing restarts from where it stopped.
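
The following self-contained sketch illustrates how a master session could act on the scheduler decision through the proposed interface; the helper functions are placeholders for the actual (re)configuration code, not existing PROOF methods, and plain vectors stand in for TList.

  // Sketch of the decision step seen from the PROOF master. The enum and the
  // GetWorkers() signature follow the proposal above; everything else is a
  // placeholder to keep the sketch self-contained.
  #include <string>
  #include <vector>

  enum EQueryAction { kOK, kModify, kStop };

  // Placeholders for the actual (re)configuration code in the PROOF master.
  void StartWorkers(const std::vector<std::string> &) { /* init to the current query context */ }
  void StopWorkers(const std::vector<std::string> &)  { /* collect output lists and statistics */ }
  void ChangePriority(int)                            { /* act on the nice level of the session */ }
  void StopQuery()                                    { /* return the results obtained so far */ }

  // Stands in for the proposed TProofServ::GetWorkers(TList &, TList &, Int_t &);
  // here it simply reports "no change".
  EQueryAction GetWorkers(std::vector<std::string> &, std::vector<std::string> &, int &)
  {
     return kOK;
  }

  void HandleSchedulerDecision()
  {
     std::vector<std::string> toBeAdded, toBeStopped;
     int priorityChange = 0;
     switch (GetWorkers(toBeAdded, toBeStopped, priorityChange)) {
        case kOK:                          // continue with the existing configuration
           break;
        case kModify:
           StartWorkers(toBeAdded);        // initialize new workers to the current query context
           StopWorkers(toBeStopped);       // retrieve their output lists and statistics
           ChangePriority(priorityChange); // adjust the nice level of the session processes
           break;
        case kStop:
           StopQuery();                    // stop here, returning the available results
           break;
     }
     // once the new setup is ready, processing restarts from where it stopped
  }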

Some remarks

  1. Queries enter the process table only if they last at least one quantum, so the scheduling will only be effective on reasonably long queries. A large number of short queries will nevertheless affect the load, and hence the resource assignment.
  2. The size of the quantum needs to be determined; it does not need to be as small as in real processors; O(minutes) should be OK.
  3. When the session is started up, the coordinator launching the PROOF session will ask the scheduler for the best cluster configuration at that moment; the session will retrieve this information, again using TProofServ::GetWorkers(...).

Determining the cluster configuration parameters

To initiate the discussion one should fix the average fraction of a CPU that a single user should get from the cluster. This is highly debatable and may depend on the context. As a first approximation, one can say that a user should get at least the equivalent of O(1) CPUs to have a good response. This means that a cluster with N CPUs could serve well up to O(N) users.

The next step is to determine the maximum number of workers to be assigned to a session. This depends on how many PROOF server applications can be started on a worker node, which in turn depends on the memory of the machine. Assuming that the total memory taken by PROOF server sessions is at most half of the available memory (on average), and that a PROOF server session takes O(100 MB) of memory, a cluster like the one being set up for ALICE (110 CPUs, 4 GB RAM) should be able to serve reasonably well up to 110 users with sessions of 20 worker nodes each (on average).

The formula to determine the number of workers to be assigned to a user should be something like

N_workers = N_CPU * N_max_sessions_per_CPU / N_users

saturating at N_CPU for N_users < N_max_sessions_per_CPU.
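
A toy check of this formula with the ALICE-like figures quoted above (110 CPUs and roughly 20 server sessions per CPU allowed by the memory argument; function and variable names are purely illustrative):

  // Toy implementation of the formula above, including the saturation at N_CPU.
  #include <cstdio>

  int WorkersPerSession(int nCpu, int nMaxSessionsPerCpu, int nUsers)
  {
     if (nUsers < nMaxSessionsPerCpu)
        return nCpu;                                // saturate at N_CPU
     return nCpu * nMaxSessionsPerCpu / nUsers;
  }

  int main()
  {
     const int nCpu = 110, nMaxSessionsPerCpu = 20; // ALICE-like figures quoted above
     std::printf("110 users -> %d workers each\n", WorkersPerSession(nCpu, nMaxSessionsPerCpu, 110)); // 20
     std::printf(" 10 users -> %d workers each\n", WorkersPerSession(nCpu, nMaxSessionsPerCpu, 10));  // 110
     return 0;
  }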

These numbers should be used as a guideline to adjust the configuration of existing sessions in real-time.

Scheduling policy

This needs to be defined around some general rules, like the following (a toy comparator capturing them is sketched after this list):

  1. short queries (in terms of remaining CPU-time) should get higher priority, in an attempt to avoid long-tail effects; long queries will start slowly and accelerate towards the end.
  2. users not yet scheduled should get higher priority than additional queries from users who already have running queries
  3. ...
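
A toy comparator capturing the two rules above (purely illustrative; the actual policy still has to be defined):

  // Toy priority comparison implementing the two rules listed above
  // (illustrative only; the real policy remains to be defined and simulated).
  struct QueryInfo {
     bool   userAlreadyScheduled;   // does this user already have a running query?
     double remainingCpuTime;       // updated estimate of the CPU time left (seconds)
  };

  // Returns true if query a should be scheduled before query b.
  bool HigherPriority(const QueryInfo &a, const QueryInfo &b)
  {
     // rule 2: users without a running query come first
     if (a.userAlreadyScheduled != b.userAlreadyScheduled)
        return !a.userAlreadyScheduled;
     // rule 1: shorter remaining CPU time first, to limit long-tail effects
     return a.remainingCpuTime < b.remainingCpuTime;
  }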

All this needs some thorough discussion and, perhaps, simulation.

Using the scheduler as initial redirector and security token deliverer

Redirector

As mentioned above, the scheduler could, as a by-product, be used as the initial redirector. The resulting architecture has a new tier (a sketch of the decision logic follows this list):

  1. The client contacts the redirector to know where to start the session
  2. The scheduler has all the information needed to decide which master is best for the new session and what the best sub-set of worker nodes is:
    1. redirects the client to the chosen master
    2. finds out the workers and sends the list to the appropriate master coordinator
  3. If the client has already sessions on the cluster:
    1. if there is only one, the client will be automatically redirected and attached to it;
    2. if there are more than one, the client will be prompted to choose which one to attach to;
    3. in any case there will be the possibility to create a brand new session.
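
A toy sketch of the redirection decision described above (hypothetical types and names; the real scheduler would use the XrdProofdProtocol machinery):

  // Illustrative sketch of the redirection decision (hypothetical types, not
  // the actual XrdProofdProtocol code).
  #include <cstddef>
  #include <string>
  #include <vector>

  struct MasterInfo { std::string host; int nSessions; };

  // Pick the least loaded master for a brand new session (assumes a non-empty list).
  const MasterInfo &ChooseMaster(const std::vector<MasterInfo> &masters)
  {
     const MasterInfo *best = &masters.front();
     for (const MasterInfo &m : masters)
        if (m.nSessions < best->nSessions) best = &m;
     return *best;
  }

  // Decide what to do with an incoming client: re-attach automatically if it has
  // exactly one existing session, ask it to choose if it has several, or start a
  // brand new session on the master returned by ChooseMaster().
  enum ERedirAction { kAttach, kPrompt, kNewSession };

  ERedirAction Redirect(std::size_t nExistingSessions, bool wantNewSession)
  {
     if (!wantNewSession && nExistingSessions == 1) return kAttach;      // automatic re-attach
     if (!wantNewSession && nExistingSessions > 1)  return kPrompt;      // client chooses the session
     return kNewSession;   // the scheduler then sends the worker list to the chosen master's coordinator
  }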

Security token deliverer

The scheduler would also be the natural place to run a new service being developed for the XROOTD security infrastructure, meant to move the authentication load away from the redirector. This service runs strong authentication and delivers tokens that can be used to access the redirector and the other components of the cluster.

-- GerardoGanis - 10 Mar 2006
