PanDA server notes
This twiki is currently being edited and needs to be reviewed.
Legend:
- Information regarding Jarka's changes for the LSST version
- Questions, warnings, parts that need clarification
- Suggestions for improvement
Job state machine
server
PanDA server is an Apache + mod_python application. The default
PanDA server configuration for Apache defines two virtual hosts (one for HTTP on port 25080, one for HTTPS on port 25443) that expose the
server/panda.py module as a mod_python application.
The server/panda.py module is therefore the entry point to the
PanDA server. It does not do much:
- it initializes the other components (brokerage, taskbuffer, etc.)
- it imports, from the other components, all functions that are to be exposed as a web service. As a result, the URLs point to the "server" module although the functions live in other modules.
*Jarka added some imports for LSST specific functions.*
config
Uses LiveConfigParser from the panda-common package to initialise required values from config files or, where possible, to set default values. Config files:
- /etc/panda/panda_server.cfg
- ~/etc/panda/panda_server.cfg
- ${PANDA_HOME}/etc/panda/panda_server.cfg
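A minimal sketch of the layered lookup described above. It uses the standard library configparser as a stand-in, because the LiveConfigParser API is not described in these notes. The section name "server", the function name and the rule that later files override earlier ones are assumptions made for illustration.

```python
import configparser
import os

# Candidate locations, in the order listed above
CANDIDATE_PATHS = [
    "/etc/panda/panda_server.cfg",
    "~/etc/panda/panda_server.cfg",
    "${PANDA_HOME}/etc/panda/panda_server.cfg",
]


def load_server_config(paths=CANDIDATE_PATHS, section="server", defaults=None):
    """Return (values, loaded_files); later files override earlier ones (assumed rule)."""
    parser = configparser.ConfigParser(interpolation=None)
    if defaults:
        # Defaults are loaded first so that any file can override them
        parser.read_dict({section: defaults})
    expanded = [os.path.expandvars(os.path.expanduser(p)) for p in paths]
    # ConfigParser.read() skips files that do not exist and returns the ones it parsed
    loaded = parser.read(expanded)
    values = dict(parser[section]) if parser.has_section(section) else {}
    return values, loaded
```

Missing files are skipped, so the same call works on a machine where only one of the three locations exists.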
brokerage
The job brokerage component tracks the load and availability of all the sites/queues and assigns each job to the site calculated to have the best resource match for it.
The brokerage component is completely ATLAS specific. For LSST they had their own brokerage module on top of PanDA and were directly telling PanDA where to run each job.
broker_util
- class _Curl: Proposition to move out to a separate util library (maybe even in panda-common). There is another Curl class in userinterface/Client.py it could be merged with.
- getDefaultStorage
- getPoolFileCatalog: I have the impression it's not used. Candidate for deletion.
- get*From*: The LRC ones don't have anything to do with LRC anymore - this must be some legacy naming. The functions are dependent as in diagram below.
- getPFNFromMySQL
- getPFNFromLFC
- getPFNFromLRC
- getFilesFromLRC
- getNFilesFromLRC
- getMissLFNsFromLRC
- getSEfromSched
- Definition of error codes.
There is an ErrorCode module in each component. Can they be consolidated in one file?
LFCClient
- Defines error codes. Propose to consolidate error codes in one file.
- getFilesLFC: does a call to DQ2 LFC API to call bulkFindReplicas
- main: resolves the files specified in a text file
- Dictionary with sites, nickname and status.
This file seems obsolete, but there are references to it (maybe also dead)
Sitemapper
- Reads cloud and site information from the DB and keeps it in memory. Has some aux functions to get/check sites/clouds.
It would be more logical to move Sitemapper out of brokerage and define it in some Information System module (or in panda-common).
*Jarka changed this module to define BNL_LSST as default site* Made configurable
broker
- checkRelease
- getOKFiles: get files already present at site
- isReproJob: checks if job type is reprocessing or transformation is in a list
- setReadyToFiles: updates the metadata of files depending on their availability
- sendAnalyBrokerageInfo: translates result (with decisions of brokerage) into text
- sendMsgToLoggerHTTP: ask Tadashi for an overview of the HTTP logger (why was it implemented)
- getT2CandList
- getHospitalQueues: ask Tadashi for the meaning of a hospital queue/how they are defined
- getPrestageSites: ask Tadashi what the DDM comparison is
- makeCompactDiagMessage
- schedule: a completely ATLAS specific brokerage module of 1200 lines. Here is the intelligence of the brokerage modules. Everything else is auxiliary.
*Jarka set BNL_LSST as default site* Made configurable
jobdispatcher
The job dispatcher receives requests for jobs from pilots and dispatches the job payloads. Jobs are assigned according to the capabilities of the site and the worker node (data availability, disk space, memory etc.).
The component is quite generic. Jarka only did minor modifications on it.
Could be consolidated in one file only
Protocol
- SC: Status Code
- encode: converts a dictionary to a string with format
key1=value1&key2=value+with+spaces+2&...
- appendNode: adds a key-value to the dictionary
- appendJob: creates a dictionary with the job information
- setUserProxy: massages the user DN and generates the proxy file, adding it to the dictionary
Could be moved to utils component
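The string produced by encode has the shape of standard URL form encoding. A minimal sketch, assuming the method is equivalent to the standard library urlencode (the notes do not confirm this); the decode helper is added only for illustration.

```python
from urllib.parse import parse_qs, urlencode


def encode(data):
    """Serialise a flat dictionary as key=value pairs joined by '&'."""
    return urlencode(data)


def decode(text):
    """Inverse of encode for single-valued keys (illustrative helper)."""
    return {key: values[0] for key, values in parse_qs(text).items()}


if __name__ == "__main__":
    encoded = encode({"key1": "value1", "key2": "value with spaces 2"})
    print(encoded)
    print(decode(encoded))
```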
Watcher
Extends thread. Checks the status of a job in a loop.
- retries analysis jobs with certain error codes
- declares jobs as failed if they did not come back in time
A few ATLAS specific things around
Depends on dataservice and taskbuffer.
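A minimal sketch of the Watcher idea: a thread that polls in a loop and reports a job as lost when no heartbeat arrived in time. The callbacks, parameter names and timing values are hypothetical; the real class talks to taskbuffer and dataservice and also handles retries.

```python
import threading
import time


class Watcher(threading.Thread):
    """Polls a heartbeat timestamp and fires a callback once it is too old."""

    def __init__(self, get_last_heartbeat, on_timeout, timeout, interval):
        super().__init__(daemon=True)
        self._get_last_heartbeat = get_last_heartbeat  # returns a time.monotonic() value
        self._on_timeout = on_timeout                  # e.g. mark the job as failed
        self._timeout = timeout
        self._interval = interval
        self._stop_event = threading.Event()

    def run(self):
        while not self._stop_event.is_set():
            if time.monotonic() - self._get_last_heartbeat() > self._timeout:
                self._on_timeout()
                return
            # Sleep until the next check, waking up early if stop() is called
            self._stop_event.wait(self._interval)

    def stop(self):
        self._stop_event.set()
```

Waiting on the event instead of calling time.sleep lets stop() end the thread without waiting for the next polling cycle.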
- class TimedMethod: Can be moved to utils component
- class CachedObject: Can be moved to utils component
- class JobDipatcher: Implements Singleton pattern. The following methods are taskBuffer wrappers with some added functionality.
- getJob: gets a job and if needed (for GLEXEC) sets the user proxy
- updateJob: updates job status. If needed retries failed analysis jobs. Adds metadata and/or StdOut. Calls taskBuffer to update the DB.
- getStatus: returns status and #attempts
- getEventRanges
- updateEventRanges
- getDNTokenMap: gets list of Sched users
- getPilotToken
- getKeyPair: gets DN
- getFQAN
- checkRole: Prod roles are hardcoded. They should be configurable.
- getDN
- checkToken
- All the functions in JobDipatcher class are now wrapped to expose them as a web service. You will see the functions with the same names.
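A minimal sketch of the two layers described above: a singleton class that wraps taskBuffer calls, and a module-level function with the same name that a web layer could expose. The class name Dispatcher, the peek_job method and the return format are hypothetical and do not reproduce the real code.

```python
class Dispatcher:
    """Singleton: every call to Dispatcher() returns the same instance."""

    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.task_buffer = None
        return cls._instance

    def init(self, task_buffer):
        # The task buffer is injected once at server start-up
        self.task_buffer = task_buffer

    def get_status(self, job_id):
        # Wrapper around a (hypothetical) task buffer call
        status, attempts = self.task_buffer.peek_job(job_id)
        return "status=%s attempts=%s" % (status, attempts)


def getStatus(job_id):
    """Module-level wrapper, the kind of function that gets exposed as a web service."""
    return Dispatcher().get_status(job_id)
```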
taskbuffer
DB connection pools (stack with put and get connection).
Ask Tadashi to explain to us the usage of the different databases and document it here briefly
Could we merge all the pool classes into one generic class? Some parameter could indicate which part of the configuration to use
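A minimal sketch of what one generic pool could look like, assuming the only backend-specific part is how a connection is created. The factory argument plays the role of the "parameter" mentioned above; all names are hypothetical.

```python
import queue


class ConnectionPool:
    """Fixed-size stack of connections with get/put, independent of the backend."""

    def __init__(self, factory, size):
        # LifoQueue gives stack behaviour and is safe to use from several threads
        self._free = queue.LifoQueue()
        for _ in range(size):
            self._free.put(factory())

    def get(self, timeout=None):
        # Blocks until a connection is free; raises queue.Empty if the timeout expires
        return self._free.get(timeout=timeout)

    def put(self, connection):
        self._free.put(connection)
```

A backend would then be selected by passing a different factory, for example one built from a named section of the configuration file.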
THIS NEEDS SOME CLEANUP. ONLY THE DBPROXYPOOL AND THE ORADBPROXY ARE IN USE
This was all cleaned up by Tadashi. Only OraDBProxy remains.
Spec files
CloudSpec,
CloudTaskSpec,
DatasetSpec,
FileSpec,
JobSpec,
SiteSpec. These modules are the object representation of a cloud, task, dataset, etc.
Most of them repeat the same functions. Could we define a common parent and then inherit and modify functions if needed?
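A minimal sketch of the common-parent idea, assuming the repeated functions mostly depend on the list of attribute names. The method names and the SiteSpec attributes shown here are hypothetical placeholders, not the real column lists.

```python
class BaseSpec:
    """Common parent: subclasses only declare their attribute names."""

    _attributes = ()

    def __init__(self, **values):
        unknown = set(values) - set(self._attributes)
        if unknown:
            raise AttributeError("unknown attributes: %s" % sorted(unknown))
        for name in self._attributes:
            setattr(self, name, values.get(name))

    def values_map(self):
        """Attribute values as a dictionary, e.g. for bind variables."""
        return {name: getattr(self, name) for name in self._attributes}

    @classmethod
    def column_names(cls):
        return ",".join(cls._attributes)

    @classmethod
    def bind_values_expression(cls):
        return "VALUES(" + ",".join(":" + name for name in cls._attributes) + ")"


class SiteSpec(BaseSpec):
    # Placeholder attribute list for illustration only
    _attributes = ("sitename", "cloud", "status")
```

A spec that needs special behaviour would override only the method concerned and inherit the rest.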
Jarka added the JobSpecHTCCondor
*Jarka added in some functions the optional backend='Oracle' parameter*
Spec files could be moved to some sub-component of the task buffer
Wrapper on DB connections
DB connection methods plus a variety of functions to execute different SQL code
THIS NEEDS SOME CLEANUP. ONLY THE DBPROXYPOOL AND THE ORADBPROXY ARE IN USE
This was all cleaned up by Tadashi. Only OraDBProxy remains.
Here is where the meat is. We still need to take a closer look, understand the DB schema (asked Gancho for one) and see what modifications Jarka made for the LSST version.
*Is the DBProxy module obsolete? Can it be removed?*
Ruslan said the EC2 version is simplified and does not admit tasks
There is a lot of MySQL compatibility code here, e.g. connectMySQL
archiveJob has significant changes
Same story. Could it be consolidated?
Functions for event service.
Pure ATLAS.
Initializer
Initializes a connection.
Not sure if widely used
Memcached module
Small util functions (unicodeConvert, decodeJSON, calculatePriority)
priority = 1000 + offset - serNum/5 - 100*weight
Ask Tadashi how the priority calculation works
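The priority formula can be written as a small function to try out values. The meaning of the three inputs is not explained in these notes, and the use of integer division for serNum/5 is an assumption.

```python
def calculate_priority(offset, ser_num, weight):
    """priority = 1000 + offset - serNum/5 - 100*weight (integer division assumed)."""
    return 1000 + offset - ser_num // 5 - 100 * weight


if __name__ == "__main__":
    # A larger serial number or a larger weight lowers the priority
    for ser_num in (0, 10, 100):
        print(ser_num, calculate_priority(offset=0, ser_num=ser_num, weight=1))
```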
ATLAS specific
SQLDumper
Dumps the executed SQL into the log file to make debugging easier.
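A minimal sketch of the SQL dumping idea: a wrapper around a DB-API cursor that logs each statement before executing it. The class name, logger name and log format are hypothetical, and sqlite3 is used only so that the example runs without a database server.

```python
import logging
import sqlite3

logger = logging.getLogger("sql_dumper")


class LoggingCursor:
    """Wraps a DB-API cursor and logs every statement passed to execute()."""

    def __init__(self, cursor, log=logger):
        self._cursor = cursor
        self._log = log

    def execute(self, sql, params=()):
        self._log.debug("SQL=%s params=%s", sql, params)
        return self._cursor.execute(sql, params)

    def __getattr__(self, name):
        # Everything else (fetchall, close, ...) is delegated to the real cursor
        return getattr(self._cursor, name)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    cursor = LoggingCursor(sqlite3.connect(":memory:").cursor())
    cursor.execute("SELECT 1 + ?", (1,))
    print(cursor.fetchall())
```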
Utils
More aux functions, seem related to Cassandra and event picking.
ATLAS specific
Compatibility with MySQL and Oracle. Regexp replacements of SQL code (e.g. schemas, ROWNUM...)
*Suggestion for a better solution, e.g. DAO + factory or SQLAlchemy. Will require time.* After discussing with Ruslan and Tadashi we agreed to leave a major refactoring for LS2. Ideally, SQLAlchemy will be implemented then.
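To illustrate the kind of regexp rewriting involved, here is a minimal sketch that turns an Oracle-flavoured statement into a MySQL-flavoured one. The three rules, the schema name used in the pattern and the function name are assumptions for illustration; the real module covers many more cases.

```python
import re

# Schema prefix to strip (example value)
_SCHEMA = re.compile(r"\bATLAS_PANDA\.", re.IGNORECASE)
# Trailing "AND ROWNUM<=N" becomes "LIMIT N"
_ROWNUM = re.compile(r"\s+AND\s+ROWNUM\s*<=\s*(\d+)\s*$", re.IGNORECASE)
# Oracle bind variables (:name) become pyformat placeholders (%(name)s)
_BIND = re.compile(r":([A-Za-z_]\w*)")


def oracle_to_mysql(sql):
    sql = _SCHEMA.sub("", sql)
    sql = _ROWNUM.sub(r" LIMIT \1", sql)
    sql = _BIND.sub(r"%(\1)s", sql)
    return sql
```

The bind-variable rule would also rewrite a colon followed by a letter inside a string literal. Pattern-based rewriting is fragile in this way, which is one argument for the refactoring proposed above.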
Checks if class is allowed to be pickled
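A minimal sketch of such a check, following the restricted unpickler pattern from the Python pickle documentation. The allow-list shown here is a placeholder; the notes do not say which classes the real module accepts.

```python
import io
import pickle

# (module, class name) pairs that may be unpickled; placeholder content
ALLOWED = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called for every global the pickle stream wants to load
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError("%s.%s is not allowed" % (module, name))


def restricted_loads(data):
    """Like pickle.loads, but refuses classes outside ALLOWED."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```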
All the functions from DBProxy with a wrapper getting and putting a connection from DBProxyPool.
Suggestion for improvement: usage of decorators.
Evaluate the possibility to use decorators in order to remove glue code
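A minimal sketch of how a decorator could replace the repeated get/put glue code. The names with_proxy, TaskBuffer.pool and get_job_status are hypothetical; the pool is only assumed to offer get() and put().

```python
import functools


def with_proxy(method):
    """Borrow a proxy from self.pool for the duration of the call."""

    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        proxy = self.pool.get()
        try:
            return method(self, proxy, *args, **kwargs)
        finally:
            # The proxy goes back to the pool even if the call raised
            self.pool.put(proxy)

    return wrapper


class TaskBuffer:
    def __init__(self, pool):
        self.pool = pool

    @with_proxy
    def get_job_status(self, proxy, job_id):
        return proxy.get_job_status(job_id)
```

Callers still write task_buffer.get_job_status(job_id); the proxy argument is supplied by the decorator.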
*Jarka has added configurable VO in some methods* I added it as well.
EventService seems to be new and is not present in Jarka's version
JEDI code is also new
Jarka added checkSandboxFileEC2, storeHTCondorJobs, updateHTCondorJobs, removeHTCondorJobs methods. Not clear to me what experiment this is for.
Jarka added checkSandboxFileEC2 method
dataservice
The data service implements the data management required by the PanDA server. The usage of the external experiment data management system (e.g. the ATLAS Distributed Data Management system DQ2) is abstracted and the modules can be exchanged for the different experiments.
Activator
- Called when a dataset transfer completed. Sets the files as ready in the DB.
- Activates the job (move from jobsDefined4 to jobsActive4 table)
(Empty) base class for ATLAS, CMS and other plugins.
Adder
I believe it was an initial implementation and then moved to AdderAtlasPlugin. It is still being used from jobdispatcher/JobDispatcher.py. Check with Tadashi the history of this file
Adder2
Spin-off from Adder. Need to know historical reason. I think this file is deprecated.
Check with Tadashi and delete it
ATLAS specific implementation of the DDM interactions. Takes care of
- registering subscription requests
- registering locations
- registering new versions of a dataset
- deleting files from a dataset (removing unmerged files)
...
Tadashi's branch has evolved (Jedi and Rucio)
CMS specific calls to DDM. Almost empty. Must have been written during the Common Analysis Framework exercise.
Dummy version. Returns True for anything.
This is the Adder called from add.py (the adder cronjob!).
- It is instantiated for a specific job and specifying the XML file.
- It checks that none of the input files are in cancelled state
- It checks the job is not in a final state
- It instantiates and executes the correct Adder plugin (ATLAS by default)
- It does a series of checks and corrections on the job status
- It has a parseXML function to read the XML with the file information that it received from the pilot.
Made configurable in Tadashi's branch. Before it worked only for CMS and ATLAS VOs
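The control flow of the steps above can be sketched as follows. The job is represented as a plain dictionary, and the registry, the list of final states and the return values are hypothetical; only the order of the checks and the ATLAS default come from the notes.

```python
class AdderPluginBase:
    """Base class for experiment-specific plugins."""

    def __init__(self, job):
        self.job = job

    def execute(self):
        raise NotImplementedError


class AdderAtlasPlugin(AdderPluginBase):
    def execute(self):
        # Placeholder for the DDM registration steps
        return "atlas: registered %d files" % len(self.job["output_files"])


class AdderDummyPlugin(AdderPluginBase):
    def execute(self):
        return True


PLUGINS = {"atlas": AdderAtlasPlugin, "dummy": AdderDummyPlugin}
DEFAULT_VO = "atlas"
FINAL_STATES = {"finished", "failed", "cancelled"}  # assumed list


def run_adder(job, plugins=PLUGINS):
    # 1. No input file may be in cancelled state
    if any(f["status"] == "cancelled" for f in job["input_files"]):
        return None
    # 2. The job must not already be in a final state
    if job["status"] in FINAL_STATES:
        return None
    # 3. Pick the plugin for the job's VO, falling back to ATLAS
    plugin_class = plugins.get(job.get("vo"), plugins[DEFAULT_VO])
    return plugin_class(job).execute()
```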
Includes some EventService related things
Looks up the user's email address, based on the user name, in the phonebook (client must be in ~atlpan/phonebook) or in xwho (web call).
ATLAS specific
Files were obsolete and deleted
Closer
Update and close dataset.
countGuidsClient, eventLookupClient
Clients for Athenaeum/eventLookup TAG service
ATLAS specific
Singleton. Web service for DDM callbacks about file transfer status
aux functions
ATLAS specific
datriHandler
datri client built around curl
DDM
Rucio client
ATLAS specific
TBC
TBC
Some other files are skipped.
proxycache
Module used for GLEXEC. Builds a cache under /tmp/proxies with user proxies. The file names for the proxies are generated using a hash function.
ATLAS VO is hardcoded. This module is optional and pre-dates Jarka's branch
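A minimal sketch of how a cache file name can be derived from a user DN with a hash. The notes do not say which hash function the module uses, so SHA-1 is an arbitrary choice here, and the function name is hypothetical.

```python
import hashlib
import os


def proxy_path(user_dn, cache_dir="/tmp/proxies"):
    """Map a user DN to a stable file name inside the cache directory."""
    digest = hashlib.sha1(user_dn.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest)


if __name__ == "__main__":
    print(proxy_path("/DC=org/DC=example/CN=Some User"))
```

Hashing avoids problems with the slashes and spaces that appear in DNs and gives every user a fixed-length file name.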
userinterface
Client
Several functions that build a URL and call the PanDA server with curl. The output is then printed.
- Curl class with get, post, put methods. Could be moved to utils and merged with the Curl class in broker_util
- submitJobs
- runTaskAssignment
- getJobStatus
- ...
Only the runBrokerage function seems ATLAS specific
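A minimal sketch of the "build a URL, then call the server" pattern, using urllib from the standard library instead of curl. The host name, the parameter name ids and the helper name are hypothetical; the request is only constructed here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical server address; 25443 is the HTTPS port mentioned in the server section
BASE_URL = "https://pandaserver.example.org:25443/server/panda"


def build_request(method_name, params, base_url=BASE_URL):
    """Return a POST request for one exposed server function."""
    body = urlencode(params).encode("ascii")
    # Passing data makes urllib issue a POST
    return Request("%s/%s" % (base_url, method_name), data=body)


if __name__ == "__main__":
    request = build_request("getJobStatus", {"ids": "1234"})
    print(request.get_method(), request.full_url, request.data)
```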
RbLauncher calls runRebroker, which calls ReBroker.
At the moment ATLAS specific
Exposes all the functionality as a web service. Receives the requests, authenticates the user, and serializes and de-serializes the information.
*Jarka modified the Submit jobs with some LSST lines, otherwise it's the same.* Added the same functionality
Some functions are only relevant to ATLAS (e.g. JEDI)
Propose to add an authentication/authorization module to make other components easier
Test
test
- Test scripts to create jobs, get the status, finish, kill them...
- The sh scripts are usually crons running on one of the PanDA servers. The PanDA server is based on crons - they could be converted to daemons
Packaging/Installation
Carl from OSG committed the changes listed in https://github.com/PanDAWMS/panda-server/commit/9cf6517f768b81b0b0d62611e67a9306d86bd1f5. Main changes are:
- usage of /data/pansrv instead of /data/atlpan as installation root
- init.d scripts, configuration files and so on are moved to the traditional linux directories some levels up (e.g. /data/pansrv/etc/init.d will be simply /etc/init.d)
- template will be stripped from the file names
*Review them with Tadashi in order not to break the current installation process* All committed and in production now for ATLAS