0.1 The GBB Grid Big Brother

The tool consists of two Python scripts:

  • gbb (server)
  • gbbc (client)
The server script, gbb, is run by the user on the worker nodes (WNs), while the client script, gbbc, is used to retrieve the partial output of the running jobs. GBB has the following features:

  • it runs a user command as specified on the command line
  • it captures stdout/stderr from the user command, echoes them and optionally saves both streams to a file
  • it may perform a full timing of the application
  • a separate thread may send incremental portions of the stdout/stderr to a remote server (the mechanism is an evolution of the one used by g-peek). At the moment only DPM SEs are supported.
  • a monitoring thread runs alongside the user command in order to check its activity. In particular it is able to:
    • kill the program if the time used exceeds a given threshold
    • kill the program if the CPU % used is below a given threshold in the last n seconds (i.e. between two calls to the monitoring thread)
    • kill the program if it is using more than the specified amount of memory
    • report the cumulative memory, CPU time and number of children of the user process
    • report the real time used by the process
  • a user module can also be called by gbb, for example to do some specific parsing or other activities (partially implemented)
  • the client part (gbbc) is used to retrieve the partial output of a running job or to clean up the output fragments in the SEs.
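The capture-and-monitor behaviour listed above can be illustrated with a minimal Python sketch. This is not the gbb source: the names `run_with_watchdog` and `cpu_too_low`, their parameters, and the choice of wall-clock time for the time threshold are all assumptions made for illustration. It runs a command, echoes and records its merged stdout/stderr, and lets a separate thread kill the command once a time threshold is passed.

```python
import subprocess
import sys
import threading
import time


def cpu_too_low(cpu_seconds_used, interval_seconds, min_cpu_percent):
    """Return True if the CPU % used over the interval is below the threshold."""
    if interval_seconds <= 0:
        return False
    percent = 100.0 * cpu_seconds_used / interval_seconds
    return percent < min_cpu_percent


def run_with_watchdog(argv, max_seconds, poll_seconds=0.1, log_path=None):
    """Run argv, echo its merged stdout/stderr, kill it after max_seconds.

    Returns (return_code, captured_lines, killed). Hypothetical helper,
    not part of gbb.
    """
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    killed = threading.Event()

    def watchdog():
        # Monitoring thread: checks the elapsed wall-clock time periodically.
        start = time.monotonic()
        while proc.poll() is None:
            if time.monotonic() - start > max_seconds:
                killed.set()
                proc.kill()
                return
            time.sleep(poll_seconds)

    monitor = threading.Thread(target=watchdog, daemon=True)
    monitor.start()

    captured = []
    log = open(log_path, "w") if log_path else None
    try:
        # Main thread: echo each output line and optionally save it to a file.
        for line in proc.stdout:
            sys.stdout.write(line)
            captured.append(line)
            if log:
                log.write(line)
    finally:
        proc.stdout.close()
        if log:
            log.close()
    proc.wait()
    monitor.join()
    return proc.returncode, captured, killed.is_set()


if __name__ == "__main__":
    code, lines, was_killed = run_with_watchdog(
        [sys.executable, "-c", "print('hello from the user command')"],
        max_seconds=10,
    )
    print("return code:", code, "killed:", was_killed)
```

The CPU check is kept as a pure function so that it can be fed with whatever CPU-time measurement is available between two monitoring calls; how gbb itself measures CPU and memory is not described here.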
The full list of options for both gbb and gbbc can be displayed by calling the script with the --help switch at the command line prompt. The following gbb example runs the command "foo" with arguments "one two three", stores the partial output on grid-cert-03.roma1.infn.it every 120 seconds, monitors the task every 600 seconds, kills the command if the CPU % used is < 10% during the last 600 seconds, and produces verbose output:

$> gbb -D 120 -m 600 -c 10 -v -p -s grid-cert-03.roma1.infn.it foo one two three

To retrieve the partial output of the job with jobid ID, with verbose output, the corresponding command is:

$> gbbc -v -s grid-cert-03.roma1.infn.it ID

Where ID is the LCG jobID, of the form "https://...".
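When the two commands above are launched from a script, it can help to assemble their argument lists programmatically. The sketch below is only an illustration: the helper names `build_gbb_argv` and `build_gbbc_argv` are hypothetical, the mapping of -D, -m, -c, -v and -s to their meanings is inferred from the example above, and the -p switch is copied verbatim from that example without its meaning being documented here. Check `gbb --help` for the authoritative list.

```python
def build_gbb_argv(server, command, dump_seconds, monitor_seconds,
                   min_cpu_percent, verbose=True):
    """Assemble a gbb command line shaped like the example in the text.

    Hypothetical helper; the option meanings are inferred, not confirmed.
    """
    argv = ["gbb",
            "-D", str(dump_seconds),      # assumed: partial output interval
            "-m", str(monitor_seconds),   # assumed: monitoring interval
            "-c", str(min_cpu_percent)]   # assumed: minimum CPU % threshold
    if verbose:
        argv.append("-v")
    # -p is taken as-is from the example; its meaning is not described here.
    argv += ["-p", "-s", server]
    argv += list(command)
    return argv


def build_gbbc_argv(server, job_id, verbose=True):
    """Assemble a gbbc command line shaped like the example in the text."""
    argv = ["gbbc"]
    if verbose:
        argv.append("-v")
    argv += ["-s", server, job_id]
    return argv


if __name__ == "__main__":
    print(" ".join(build_gbb_argv("grid-cert-03.roma1.infn.it",
                                  ["foo", "one", "two", "three"],
                                  120, 600, 10)))
    print(" ".join(build_gbbc_argv("grid-cert-03.roma1.infn.it", "ID")))
```

The resulting lists can be passed directly to `subprocess.run`, which avoids shell quoting problems with the "https://..." job identifier.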

  • GriBB (Grid Big Brother) job manager:
    • GriBB is an intelligent job executor which takes care of several aspects of a running job, including resource usage and partial output dumping at runtime. Using GriBB, LJSFi is able to get the partial output of a job while it is still running, kill a job which is consuming too much memory, kill a job which is consuming too few CPU cycles, etc.
  • partial output retrieval:
    • a new command 'get-partial-output ' is available to retrieve the partial output of a job while it is still in the running phase. The partial dumps are stored in the DPM SE indicated in the etc/install.conf configuration file. The files stored in the SE are removed when the job finishes. Please set the DPM SE name to a machine that is appropriate for you, and avoid using the default if possible.
  • autoget facility:
    • an automatic output retrieval daemon is available. To start the daemon, type "autoget start"; to stop it, type "autoget stop". The daemon stores its logs in the /var/log directory and is aware of clusters with shared home directories. To get information on its running status, use "autoget status". In case of problems, use "autoget killproc" to kill the daemon completely. The autoget agent runs until your proxy expires, then exits by itself. Once you have launched the autoget agent you may safely quit your shell, since it will keep retrieving the output files after you log off, until your proxy expires.
  • resource listing:
    • the command 'show-resources' lists the currently available resources, sorted in descending order of the total number of CPUs available.
  • installation CPU statistics:
    • the command 'atlascpustats ' will reveal the number of CPUs available in LCG with the specified release installed.
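The ordering used by the resource listing can be sketched in a few lines of Python. The record layout, the field names `name` and `total_cpus`, the host names and the CPU counts below are all made up for illustration; the real output format of 'show-resources' is not described here.

```python
def order_resources(resources):
    """Return the resources sorted by total CPUs, largest first."""
    return sorted(resources, key=lambda r: r["total_cpus"], reverse=True)


if __name__ == "__main__":
    # Hypothetical sample data, not real grid resources.
    sample = [
        {"name": "ce-a.example.org", "total_cpus": 64},
        {"name": "ce-b.example.org", "total_cpus": 512},
        {"name": "ce-c.example.org", "total_cpus": 128},
    ]
    for resource in order_resources(sample):
        print(resource["total_cpus"], resource["name"])
```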
-- KondoGnanvo - 30 Jan 2006
Topic revision: r1 - 2006-01-30 - KondoGnanvo
 