PanDA server notes

This twiki is currently being edited and needs to be reviewed.

Legend:
Help Information regarding Jarka's changes for the LSST version
Warning, important: Questions, warnings, parts that need clarification.
Tip, idea: Suggestions for improvement

Panda_high_level_view2.png

Job state machine Job_state_diagram1.png

server

PanDA server is an Apache + mod_python application. The default PanDA server configuration for Apache defines two virtual hosts (one for HTTP on port 25080, one for HTTPS on port 25443) that expose the server/panda.py module as a mod_python application. The server/panda.py module is therefore the entry point to the PanDA server. It does not do much:

  • it initializes the other components (brokerage, taskbuffer, etc.)
  • it imports, from the other components, all functions that are to be exposed as a web service. As a result, the URLs point to the "server" module although the functions are defined in other modules.

Help *Jarka added some imports for LSST specific functions.*
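The "import to expose" idea can be illustrated with a minimal sketch. This is not the actual server/panda.py code: the function bodies, the EXPOSED table and the dispatch helper are hypothetical, and mod_python does the real URL-to-function resolution.

```python
# Hypothetical stand-ins for functions that live in other components.
def getJob(siteName):
    return "job for %s" % siteName


def getJobStatus(ids):
    return "status of %s" % ids


# The entry module collects the functions it wants to expose in its own namespace.
EXPOSED = {func.__name__: func for func in (getJob, getJobStatus)}


def dispatch(url_path, **params):
    """Resolve the last URL segment to an exposed function and call it."""
    name = url_path.rstrip("/").split("/")[-1]
    if name not in EXPOSED:
        raise LookupError("no such method: %s" % name)
    return EXPOSED[name](**params)
```

With this layout a request for .../getJob reaches a function that was defined elsewhere, which is why the URLs only ever mention the entry module.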

config

Uses LiveConfigParser from the panda-common package to initialise required values from config files or, where possible, to set default values. Config files:
  • /etc/panda/panda_server.cfg
  • ~/etc/panda/panda_server.cfg
  • ${PANDA_HOME}/etc/panda/panda_server.cfg
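A rough sketch of the "several config files plus defaults" lookup, using the standard library configparser as a stand-in for LiveConfigParser. The section name "server", the option names and the precedence (later files override earlier ones) are assumptions, not taken from panda-common.

```python
import configparser
import os


def load_config(candidates, defaults):
    """Read every existing candidate file in order; later files override earlier ones."""
    parser = configparser.ConfigParser()
    # Start from the defaults so that missing options still have a value.
    parser.read_dict({"server": defaults})
    # ConfigParser.read() silently skips files that do not exist.
    parser.read([os.path.expanduser(os.path.expandvars(p)) for p in candidates])
    return dict(parser["server"])
```

A call such as load_config(["/etc/panda/panda_server.cfg", "~/etc/panda/panda_server.cfg", "${PANDA_HOME}/etc/panda/panda_server.cfg"], {"port": "25080"}) would then return the merged options as a dictionary of strings.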

brokerage

The job brokerage component tracks the load and availability of all the sites/queues and assigns each job to the site calculated to have the best resource match for it.
Help The brokerage component is completely ATLAS specific. For LSST they had their own brokerage module on top of PanDA and were directly telling PanDA where to run each job.

broker_util

  • class _Curl: Tip, idea Proposition to move out to a separate util library (maybe even in panda-common). There is another Curl class in userinterface/Client.py it could be merged with.
  • getDefaultStorage
  • getPoolFileCatalog: Warning, important I have the impression it's not used. Candidate for deletion.
  • get*From*: The LRC ones no longer have anything to do with LRC; this must be legacy naming. The functions depend on each other as shown in the diagram below.
    • getPFNFromMySQL
    • getPFNFromLFC
    • getPFNFromLRC
    • getFilesFromLRC
    • getNFilesFromLRC
    • getMissLFNsFromLRC
    • getSEfromSched
get_From__flow.png

ErrorCode

  • Definition of error codes.
Tip, idea There is an ErrorCode module in each component. Can they be consolidated in one file?

LFCClient

  • Defines error codes. Tip, idea Propose to consolidate error codes in one file.
  • getFilesLFC: calls bulkFindReplicas through the DQ2 LFC API
  • main: resolves the files specified in a text file

PandaSiteIDs

  • Dictionary with sites, nickname and status.
Warning, important This file seems obsolete, but there are still references to it (which may be dead code as well)

Sitemapper

  • Reads cloud and site information from the DB and keeps it in memory. Has some aux functions to get/check sites/clouds.
Tip, idea It would be more logical to move Sitemapper out of brokerage and define it in some Information System module (or in panda-common)
Help *Jarka changed this module to define BNL_LSST as default site* Made configurable

broker

  • checkRelease
  • getOKFiles: get files already present at site
  • isReproJob: checks if job type is reprocessing or transformation is in a list
  • setReadyToFiles: updates the metadata of files depending on their availability
  • sendAnalyBrokerageInfo: translates result (with decisions of brokerage) into text
  • sendMsgToLoggerHTTP: Warning, important ask Tadashi for an overview of the HTTP logger (why was it implemented)
  • getT2CandList
  • getHospitalQueues: Warning, important ask Tadashi for the meaning of a hospital queue/how they are defined
  • getPrestageSites: Warning, important ask Tadashi what the DDM comparison is
  • makeCompactDiagMessage
  • schedule: a completely ATLAS specific brokerage module of 1200 lines. Here is the intelligence of the brokerage modules. Everything else is auxiliary.

Help *Jarka set BNL_LSST as default site* Made configurable

jobdispatcher

The job dispatcher receives requests for jobs from pilots and dispatches the job payloads. Jobs are assigned according to the capabilities of the site and the worker node (data availability, disk space, memory etc.).
Help The component is quite generic. Jarka only did minor modifications on it.

ErrorCode

  • Definitions of error codes
Help Could be consolidated in one file only

Protocol

  • SC: Status Code
  • encode: converts a dictionary to a string with the format
key1=value1&key2=value+with+spaces+2&...
  • appendNode: adds a key-value to the dictionary
  • appendJob: creates a dictionary with the job information
  • setUserProxy: massages the user DN and generates the proxy file, adding it to the dictionary
Help Could be moved to utils component
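The encoding described above is standard URL form encoding. A minimal sketch, assuming the Python 3 urllib.parse.urlencode as a stand-in for whatever call the module really uses; the append_node helper is a simplified, hypothetical version of appendNode.

```python
from urllib.parse import urlencode


def encode(data):
    """Serialize a flat dictionary as key=value pairs joined by '&'."""
    return urlencode(data)


def append_node(data, key, value):
    """Add one key-value pair to the dictionary that will later be encoded."""
    data[key] = value
    return data


response = append_node({"key1": "value1"}, "key2", "value with spaces 2")
print(encode(response))  # key1=value1&key2=value+with+spaces+2
```

Spaces become "+" and reserved characters are percent-escaped, so the pilot can decode the string with any standard query-string parser.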

Watcher

Extends thread. Checks the status of a job in a loop.
  • retries analysis jobs with certain error codes
  • declares jobs as failed if they did not come back in time
Help A few ATLAS specific things around
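A minimal sketch of the "thread that polls a job and gives up after a timeout" pattern. It only covers the timeout part, not the retry of analysis jobs. The class name JobWatcher, the get_status callable and the status strings are hypothetical and not taken from the Watcher module.

```python
import threading
import time


class JobWatcher(threading.Thread):
    """Poll a job status until it is final or the timeout expires."""

    def __init__(self, get_status, timeout, interval=0.01):
        super().__init__()
        self.get_status = get_status  # hypothetical callable returning the job status
        self.timeout = timeout        # seconds to wait before declaring the job lost
        self.interval = interval      # seconds between two checks
        self.result = None

    def run(self):
        deadline = time.monotonic() + self.timeout
        while time.monotonic() < deadline:
            status = self.get_status()
            if status in ("finished", "failed"):
                self.result = status
                return
            time.sleep(self.interval)
        # The job did not reach a final state in time.
        self.result = "failed"
```

The watcher is started with start() and, in this sketch, leaves its verdict in the result attribute once the thread ends.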

JobDispatcher

Depends on dataservice and taskbuffer.
  • class TimedMethod: Help Can be moved to utils component
  • class CachedObject: Help Can be moved to utils component
  • class JobDipatcher: Implements Singleton pattern. The following methods are taskBuffer wrappers with some added functionality.
    • getJob: gets a job and if needed (for GLEXEC) sets the user proxy
    • updateJob: updates job status. If needed retries failed analysis jobs. Adds metadata and/or StdOut. Calls taskBuffer to update the DB.
    • getStatus: returns status and #attempts
    • getEventRanges
    • updateEventRanges
    • getDNTokenMap: gets list of Sched users
    • getPilotToken
    • getKeyPair: gets DN
    • getFQAN
    • checkRole: Help Prod roles are hardcoded. They should be configurable.
    • getDN
    • checkToken
  • All the functions in JobDipatcher class are now wrapped to expose them as a web service. You will see the functions with the same names.
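For readers unfamiliar with the pattern, a singleton whose methods are thin wrappers around another object can be sketched as below. The class name Dispatcher, the init method and get_status are illustrative assumptions; the real class may implement the pattern differently.

```python
class Dispatcher:
    """Only one instance exists; every construction returns the same object."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.task_buffer = None
        return cls._instance

    def init(self, task_buffer):
        # Hypothetical: hand over the object that talks to the DB.
        self.task_buffer = task_buffer

    def get_status(self, job_id):
        # Thin wrapper: delegate to the task buffer, add extra logic here if needed.
        return self.task_buffer.get_status(job_id)
```

Module-level functions that are exposed as a web service can then all call the same instance without passing it around.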

taskbuffer

ArchiveDBProxyPool, DBProxyPool, LogDBProxyPool

DB connection pools (stack with put and get connection). Warning, important Ask Tadashi to explain to us the usage of the different databases and document it here briefly
Tip, idea Could we merge all the pool classes in one generic? Some parameter could indicate which part of the configuration to use
Tip, idea THIS NEEDS SOME CLEANUP. ONLY THE DBPROXYPOOL AND THE ORADBPROXY ARE IN USE
This was all cleaned up by Tadashi. Only OraDBProxy remaining
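The "stack with put and get" idea, together with the suggestion of a single generic pool, could look roughly like this. It is a sketch only: the class name ProxyPool and the factory parameter (which would decide what kind of proxy and which part of the configuration to use) are assumptions.

```python
import queue


class ProxyPool:
    """Generic pool: a thread-safe stack of ready-to-use connection objects."""

    def __init__(self, factory, size):
        self._stack = queue.LifoQueue()
        for _ in range(size):
            # The factory decides what kind of proxy is created.
            self._stack.put(factory())

    def get(self):
        # Blocks until a proxy is available.
        return self._stack.get()

    def put(self, proxy):
        self._stack.put(proxy)
```

One pool per database would then be ProxyPool(make_archive_proxy, n), ProxyPool(make_log_proxy, n) and so on, where the make_* factories are hypothetical.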

Spec files

CloudSpec, CloudTaskSpec, DatasetSpec, FileSpec, JobSpec, SiteSpec. These modules are the object representation of a cloud, task, dataset, etc. Tip, idea Most of them repeat the same functions. Could we define a common parent and then inherit and override functions where needed?
Help Jarka added the JobSpecHTCCondor
Help *Jarka added in some functions the optional backend='Oracle' parameter*
Help Spec files could be moved to some sub-component of the task buffer
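The common-parent suggestion might look like the sketch below. The BaseSpec class, its method names and the attribute lists are hypothetical and only meant to show where the repeated code would go.

```python
class BaseSpec:
    """Shared behaviour: each subclass only lists its attributes."""

    _attributes = ()

    def __init__(self):
        for name in self._attributes:
            setattr(self, name, None)

    def values(self):
        """Return the attribute values as a tuple, e.g. for an INSERT."""
        return tuple(getattr(self, name) for name in self._attributes)

    def pack(self, row):
        """Fill the attributes from a DB row given in the same column order."""
        for name, value in zip(self._attributes, row):
            setattr(self, name, value)

    @classmethod
    def columns(cls):
        """Return a comma-separated column list for SQL statements."""
        return ",".join(cls._attributes)


class SiteSpec(BaseSpec):
    _attributes = ("sitename", "nickname", "status")  # illustrative columns


class CloudSpec(BaseSpec):
    _attributes = ("name", "tier1", "status")  # illustrative columns
```

A spec that needs special handling would override just the method in question.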

ConBridge

Wrapper on DB connections

DBProxy, LogDBProxy, OraDBProxy, OraLogDBProxy

DB connection methods plus a variety of functions to execute different SQL code
Tip, idea THIS NEEDS SOME CLEANUP. ONLY THE DBPROXYPOOL AND THE ORADBPROXY ARE IN USE
This was all cleaned up by Tadashi. Only OraDBProxy remaining
Help Here is where the meat is. Need to still have a better look, understand the DB schema (asked Gancho for one) and see what modifications Jarka did for the LSST version.
Warning, important *Is the DBProxy module obsolete? Can it be removed?*
Help Ruslan said the EC2 version is simplified and does not admit tasks
Help There is a lot of the MySQL compatibility here, e.g. connectMySQL
Help archiveJob has significant changes

ErrorCode

Tip, idea Same story. Could it be consolidated?

EventServiceUtils

Functions for event service.
Help Pure ATLAS.

Initializer

Initializes a connection.
Warning, important Not sure if widely used

MemProxy

Memcached module

PrioUtil

Small util functions (unicodeConvert, decodeJSON, calculatePriority).
priority = 1000 + offset - serNum/5 - 100*weight
Warning, important Ask Tadashi how the priority calculation works
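Written out as a function, the formula quoted above reads as follows. The parameter meanings are not explained in these notes, and whether serNum/5 is an integer division in the original code is an open question, so this is only a direct transcription.

```python
def calculate_priority(offset, ser_num, weight):
    """Direct transcription of: 1000 + offset - serNum/5 - 100*weight."""
    return 1000 + offset - ser_num / 5 - 100 * weight
```

For example, offset 0, serNum 10 and weight 1 give 1000 + 0 - 2 - 100 = 898. According to the formula, a larger weight or a larger serNum lowers the priority.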

ProcessGroups

Help ATLAS specific

SQLDumper

Dumps the executed SQL into the log file so it's easy for debugging.
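The idea can be sketched as a thin cursor wrapper that logs each statement before executing it. The class name DumpingCursor and the logger name are hypothetical. The wrapper works with any DB-API cursor; sqlite3 appears in the usage lines only as a stand-in for the Oracle cursor.

```python
import logging
import sqlite3

logger = logging.getLogger("sql_dumper")


class DumpingCursor:
    """Wrap a DB-API cursor and log every statement before running it."""

    def __init__(self, cursor):
        self._cursor = cursor

    def execute(self, sql, params=()):
        logger.debug("SQL=%s params=%s", sql, params)
        return self._cursor.execute(sql, params)

    def __getattr__(self, name):
        # Everything else (fetchone, fetchall, ...) goes to the real cursor.
        return getattr(self._cursor, name)


# Usage with an in-memory SQLite database standing in for the real one.
connection = sqlite3.connect(":memory:")
cursor = DumpingCursor(connection.cursor())
cursor.execute("SELECT 1 + ?", (1,))
```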

Utils

More aux functions, seem related to Cassandra and event picking.
Help ATLAS specific

WrappedCursor

Compatibility with MySQL and Oracle. Regexp replacements of SQL code (e.g. schemas, ROWNUM...)
Tip, idea *Suggestion for a better solution, e.g. DAO + factory or SQLAlchemy. Will require time.* After discussing with Ruslan and Tadashi we agreed to leave a major refactoring for LS2. Ideally, SQLAlchemy will be introduced then.
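To give a feeling for the regexp approach, here is a minimal sketch that rewrites two Oracle-isms for MySQL. The function name, the default schema name ATLAS_PANDA and the rules themselves are assumptions; the real module certainly covers more cases, and this sketch only handles a ROWNUM condition at the very end of a statement.

```python
import re

# Matches a trailing "AND ROWNUM <= N" or "WHERE ROWNUM <= N".
_ROWNUM = re.compile(r"\s+(?:AND|WHERE)\s+ROWNUM\s*<=\s*(\d+)\s*$", re.IGNORECASE)


def oracle_to_mysql(sql, oracle_schema="ATLAS_PANDA", target_schema=""):
    """Rewrite the schema prefix and a trailing ROWNUM limit."""
    prefix = target_schema + "." if target_schema else ""
    # Replace "SCHEMA." table prefixes with the target schema (or nothing).
    sql = re.sub(r"\b%s\." % re.escape(oracle_schema), lambda match: prefix, sql)
    # Oracle limits rows with ROWNUM, MySQL with LIMIT.
    return _ROWNUM.sub(r" LIMIT \1", sql)
```

Each rule is easy to write, but the list tends to grow with every new statement, which is the motivation for the DAO or SQLAlchemy suggestion.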

WrappedPickle

Checks if class is allowed to be pickled
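One common way to enforce such a check on the loading side is the "restricted unpickler" recipe from the Python documentation. The sketch below follows that recipe; the allow-list content is made up and the real module may work differently.

```python
import io
import pickle

# Hypothetical allow-list of (module, class name) pairs.
ALLOWED = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    """Refuse to load any class that is not explicitly allowed."""

    def find_class(self, module, name):
        if (module, name) not in ALLOWED:
            raise pickle.UnpicklingError("%s.%s is not allowed" % (module, name))
        return super().find_class(module, name)


def loads(data):
    """Drop-in replacement for pickle.loads with the class check applied."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain containers such as dictionaries and lists load as usual, since they do not go through find_class.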

TaskBuffer

All the functions from DBProxy, each wrapped in code that gets a connection from DBProxyPool and puts it back afterwards.
Tip, idea Evaluate the possibility to use decorators in order to remove glue code
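A possible shape for the decorator idea is sketched below. The names with_proxy, proxy_pool and peek_job are hypothetical, and the pool is assumed to offer get and put methods.

```python
import functools


def with_proxy(method):
    """Borrow a proxy from the pool, pass it to the method, always give it back."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        proxy = self.proxy_pool.get()
        try:
            return method(self, proxy, *args, **kwargs)
        finally:
            # Return the proxy even if the call raised an exception.
            self.proxy_pool.put(proxy)

    return wrapper


class TaskBuffer:
    def __init__(self, proxy_pool):
        self.proxy_pool = proxy_pool

    @with_proxy
    def peek_job(self, proxy, job_id):
        # Only the actual work remains; the get/put glue lives in the decorator.
        return proxy.peek_job(job_id)
```

Callers use tb.peek_job(job_id) as before; the proxy argument is filled in by the decorator.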
Help *Jarka has added configurable VO in some methods* I added it as well.
Help EventService seems to be new and not there for Jarka's method
Help JEDI code is also new
Help Jarka added checkSandboxFileEC2, storeHTCondorJobs, updateHTCondorJobs, removeHTCondorJobs methods. Not clear to me what experiment this is for.
Help Jarka added checkSandboxFileEC2 method

dataservice

The data service implements the data management required by the PanDA server. The usage of the external experiment data management system (e.g. the ATLAS Distributed Data Management system DQ2) is abstracted and the modules can be exchanged for the different experiments.

Activator

  • Called when a dataset transfer completed. Sets the files as ready in the DB.
  • Activates the job (move from jobsDefined4 to jobsActive4 table)

AdderPluginBase

(Empty) base class for ATLAS, CMS and other plugins.

Adder

Tip, idea I believe this was the initial implementation, which was later moved to AdderAtlasPlugin. It is still being used from jobdispatcher/JobDispatcher.py. Check the history of this file with Tadashi

Adder2

Spin-off from Adder. Need to know historical reason. I think this file is deprecated.
Tip, idea Check with Tadashi and delete it

AdderAtlasPlugin

ATLAS specific implementation of the DDM interactions. Takes care of:
  • registering subscription requests
  • registering locations
  • registering new versions of a dataset
  • deleting files from a dataset (removing unmerged files)
  • ...
Info Tadashi's branch has evolved (Jedi and Rucio)

AdderCMSPlugin

CMS specific calls to DDM. Almost empty. Must have been written during the Common Analysis Framework exercise.

AdderDummyPlugin

Dummy version. Returns True for anything.

AdderGen

This is the Adder called from add.py (the adder cronjob!).
  • It is instantiated for a specific job and specifying the XML file.
  • It checks that none of the input files are in cancelled state
  • It checks the job is not in a final state
  • It instantiates and executes the correct Adder plugin (ATLAS by default)
  • It does a series of checks and corrections on the job status
  • It has a parseXML function to read the XML with the file information that it received from the pilot.
Info Made configurable in Tadashi's branch. Before it worked only for CMS and ATLAS VOs
Info Includes some EventService related things
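The plugin selection step can be pictured as below. This is a sketch under assumptions: the PLUGINS table, the get_plugin helper and the execute method are hypothetical, and the class bodies are placeholders that merely reuse the names of the modules listed in this section.

```python
class AdderPluginBase:
    """Base class; experiment-specific plugins override execute()."""

    def __init__(self, job):
        self.job = job

    def execute(self):
        raise NotImplementedError


class AdderAtlasPlugin(AdderPluginBase):
    def execute(self):
        return "ATLAS DDM registration"  # placeholder for the real work


class AdderDummyPlugin(AdderPluginBase):
    def execute(self):
        return True  # the dummy plugin accepts everything


# Hypothetical mapping from VO name to plugin class.
PLUGINS = {"atlas": AdderAtlasPlugin, "dummy": AdderDummyPlugin}


def get_plugin(vo, job, default="atlas"):
    """Pick the plugin for the VO, falling back to the default one."""
    return PLUGINS.get(vo, PLUGINS[default])(job)
```

Making the mapping configurable, as mentioned for Tadashi's branch, would amount to building PLUGINS from the config file instead of hardcoding it.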

AddressFinder

Searches for the user's email address in the phonebook (client must be in ~atlpan/phonebook) or xwho (web call) based on the user name.
Warning, important ATLAS specific
Files were obsolete and deleted

Closer

Update and close dataset.

countGuidsClient, eventLookupClient

Clients for Athenaeum/eventLookup TAG service Warning, important ATLAS specific

DataService

Singleton. Web service for DDM callbacks about file transfer status

DataServiceUtils

Aux functions
Warning, important ATLAS specific

datriHandler

datri client built around curl

DDM

Rucio client Warning, important ATLAS specific

Setupper, SetupperAtlasPlugin

TBC

TaskAssigner

TBC

Some other files are skipped.

proxycache

Module used for GLEXEC. Builds a cache under /tmp/proxies with user proxies. The file names for the proxies are generated using a hash function.

Help ATLAS VO is hardcoded. This module is optional and pre-dates Jarka's branch
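How a hashed file name under /tmp/proxies could be derived is sketched below. The choice of SHA-1, the inputs to the hash (user DN plus role) and the function name are assumptions; the notes only say that a hash function is used.

```python
import hashlib
import os


def proxy_path(user_dn, role="", cache_dir="/tmp/proxies"):
    """Map a user DN (and optional role) to a stable file name in the cache."""
    digest = hashlib.sha1((user_dn + "." + role).encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest)
```

Hashing keeps the file name free of the slashes and spaces that appear in DNs, and the same user always maps to the same file.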

userinterface

Client

Several functions that build a URL and call the PanDA server through curl. The output is then printed.
  • Curl class with get, post, put methods. Help Could be moved to utils and merged with the Curl class in broker_util
  • submitJobs
  • runTaskAssignment
  • getJobStatus
  • ...
Help Only the runBrokerage function seems ATLAS specific

RbLauncher, runRebroker, ReBroker

RbLauncher calls runRebroker, which calls ReBroker. Help At the moment ATLAS specific

UserIF

Exposes all the functionality as a web service. Receives the requests, handles user authentication, and serializes and deserializes the information.
Help *Jarka modified the Submit jobs with some LSST lines, otherwise it's the same.* Added the same functionality
Help Some functions are only relevant to ATLAS (e.g. JEDI)
Tip, idea Propose to add an authentication/authorization module to make other components easier

Test

test

  • Test scripts to create jobs, get the status, finish, kill them...
  • The sh scripts are usually crons running on one of the PanDA servers. Tip, idea The PanDA server is based on crons - they could be converted to daemons

Packaging/Installation

Carl from OSG committed the changes listed in https://github.com/PanDAWMS/panda-server/commit/9cf6517f768b81b0b0d62611e67a9306d86bd1f5. Main changes are:
  • usage of /data/pansrv instead of /data/atlpan as installation root
  • init.d scripts, configuration files and so on are moved to the traditional Linux directories some levels up (e.g. /data/pansrv/etc/init.d will simply be /etc/init.d)
  • template will be stripped from the file names
Tip, idea *Review them with Tadashi in order not to break current installation process*
All committed and in production now for ATLAS
Topic revision: r14 - 2015-05-04 - FernandoHaraldBarreiroMegino
 