Main Web>TWikiUsers>FernandoHaraldBarreiroMegino>FernandoSPanDANotes>PandaServerNotes (2015-05-04, FernandoHaraldBarreiroMegino)


PanDA server notes

- server
- config
- brokerage
- jobdispatcher
- taskbuffer
- dataservice
- proxycache
- userinterface
- Test
- Packaging/Installation

This twiki is currently being edited and needs to be reviewed.

Legend:
- Information regarding Jarka's changes for the LSST version
- Questions, warnings, parts that need clarification
- Suggestions for improvement

Job state machine

server

The PanDA server is an Apache + mod_python application. The default Apache configuration for the PanDA server defines two virtual hosts (one for HTTP on port 25080, one for HTTPS on port 25443) that expose the server/panda.py module as a mod_python application. The server/panda.py module is therefore the entry point to the PanDA server. It does not do much:

It initializes the other components (brokerage, taskbuffer, etc.).
It imports from the other components all functions that are to be exposed as a web service. As a result, the URLs point to the "server" module, although the functions are defined in other modules.
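
The re-export idea can be sketched as follows. This is a minimal illustration and not the actual server/panda.py: the two component functions and the `dispatch` helper are hypothetical, and the URL-to-function mapping that mod_python provides is replaced by a plain dictionary lookup.

```python
# Hypothetical stand-ins for functions defined in other components.
def get_job(site_name):            # would live in jobdispatcher
    return {"site": site_name, "status": "dispatched"}


def submit_jobs(jobs):             # would live in userinterface
    return len(jobs)


# The entry-point module only collects the functions it wants to expose.
EXPOSED = {
    "getJob": get_job,
    "submitJobs": submit_jobs,
}


def dispatch(method_name, **params):
    """Resolve the last element of a URL path to an exposed function and call it."""
    try:
        func = EXPOSED[method_name]
    except KeyError:
        raise LookupError("method %r is not exposed" % method_name)
    return func(**params)
```

Anything not listed in `EXPOSED` stays unreachable from the outside, which is the point of having a single entry module.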

~~*Jarka added some import for LSST specific functions.*~~

config

Uses LiveConfigParser from the panda-common package to initialise the required values from config files or, where possible, to set default values. Config files:

/etc/panda/panda_server.cfg
~/etc/panda/panda_server.cfg
${PANDA_HOME}/etc/panda/panda_server.cfg
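
A minimal sketch of reading these three locations with defaults, using the standard library `configparser` as a stand-in for LiveConfigParser (whose API is not described here). The precedence (later files override earlier ones) and the section and option names in the usage note are assumptions for illustration only.

```python
import configparser
import os


def candidate_paths():
    """Return the documented locations with ~ and PANDA_HOME expanded."""
    paths = [
        "/etc/panda/panda_server.cfg",
        os.path.expanduser("~/etc/panda/panda_server.cfg"),
    ]
    panda_home = os.environ.get("PANDA_HOME")
    if panda_home:
        paths.append(os.path.join(panda_home, "etc/panda/panda_server.cfg"))
    return paths


def load_config(paths, defaults=None):
    """Read every existing file in order; later files override earlier ones."""
    parser = configparser.ConfigParser()
    if defaults:
        parser.read_dict(defaults)   # values used when no file sets them
    parser.read(paths)               # missing files are silently skipped
    return parser
```

Usage would look like `load_config(candidate_paths(), {"server": {"port": "25080"}})`, where the `server` section and `port` option are hypothetical names.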

brokerage

The job brokerage component tracks the load and availability of all the sites/queues and assigns each job to the site calculated to have the best resource match for it.

The brokerage component is completely ATLAS specific. For LSST they had their own brokerage module on top of PanDA and were directly telling PanDA where to run each job.

broker_util

class _Curl: Proposal to move it out to a separate util library (maybe even in panda-common). It could be merged with the other Curl class in userinterface/Client.py.
getDefaultStorage
getPoolFileCatalog: I have the impression it's not used. Candidate for deletion.
get*From*: The LRC ones don't have anything to do with LRC anymore - this must be some legacy naming. The functions are dependent as in diagram below.
- getPFNFromMySQL
- getPFNFromLFC
- getPFNFromLRC
- getFilesFromLRC
- getNFilesFromLRC
- getMissLFNsFromLRC
- getSEfromSched

ErrorCode

Definition of error codes.

There is an ErrorCode module in each component. Can they be consolidated into one file?

LFCClient

Defines error codes. Propose to consolidate error codes in one file.
getFilesLFC: calls bulkFindReplicas through the DQ2 LFC API
main: resolves the files specified in a text file

PandaSiteIDs

Dictionary with sites, nickname and status.

This file seems obsolete, but there are references to it (maybe also dead)

Sitemapper

Reads cloud and site information from the DB and keeps it in memory. Has some aux functions to get/check sites/clouds.

It would be more logical to move Sitemapper out of brokerage and define it in an Information System module (or in panda-common).

~~*Jarka changed this module to define BNL_LSST as default site*~~ Made configurable

broker

checkRelease
getOKFiles: get files already present at site
isReproJob: checks if job type is reprocessing or transformation is in a list
setReadyToFiles: updates the metadata of files depending on their availability
sendAnalyBrokerageInfo: translates result (with decisions of brokerage) into text
sendMsgToLoggerHTTP: ask Tadashi for an overview of the HTTP logger (why was it implemented)
getT2CandList
getHospitalQueues: ask Tadashi for the meaning of a hospital queue/how they are defined
getPrestageSites: ask Tadashi what the DDM comparison is
makeCompactDiagMessage
schedule: a completely ATLAS specific brokerage module of 1200 lines. Here is the intelligence of the brokerage modules. Everything else is auxiliary.

~~*Jarka set BNL_LSST as default site*~~ Made configurable

jobdispatcher

The job dispatcher receives requests for jobs from pilots and dispatches the job payloads. Jobs are assigned according to the capabilities of the site and the worker node (data availability, disk space, memory etc.).

The component is quite generic. Jarka only did minor modifications on it.

ErrorCode

Definitions of error codes

Could be consolidated in one file only

Protocol

SC: Status Code
encode: converts a dictionary to a string with format

key1=value1&key2=value+with+spaces+2&...

appendNode: adds a key-value to the dictionary
appendJob: creates a dictionary with the job information
setUserProxy: massages the user DN and generates the proxy file, adding it to the dictionary
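
The string format produced by encode matches standard URL form encoding. A minimal sketch, written for Python 3 and assuming the real module does something equivalent; the key name `StatusCode` and the sample values are hypothetical.

```python
from urllib.parse import urlencode, parse_qs


def encode(data):
    """Serialise a flat dictionary as key=value pairs joined by '&'."""
    return urlencode(data)


response = {"StatusCode": 0}
response["key2"] = "value with spaces 2"   # what an appendNode-style helper would do
encoded = encode(response)
print(encoded)  # StatusCode=0&key2=value+with+spaces+2
```

The receiving side can recover the values with `parse_qs(encoded)`, which returns each value wrapped in a list.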

Could be moved to utils component

Watcher

Extends thread. Checks the status of a job in a loop.

retries analysis jobs with certain error codes
declares jobs as failed if they did not come back in time
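
A minimal sketch of such a watcher thread, assuming the job is represented by a dictionary with `status` and `last_heartbeat` keys. Both key names, the status strings and the timeout handling are illustrative assumptions; the retry logic for analysis jobs is left out.

```python
import threading
import time


class Watcher(threading.Thread):
    """Poll a job's last heartbeat and mark the job failed after a timeout."""

    def __init__(self, job, timeout, sleep_time=0.01):
        super().__init__()
        self.job = job                # dict with 'status' and 'last_heartbeat'
        self.timeout = timeout        # seconds without heartbeat before failing
        self.sleep_time = sleep_time  # pause between two checks

    def run(self):
        while self.job["status"] == "running":
            silent_for = time.time() - self.job["last_heartbeat"]
            if silent_for > self.timeout:
                self.job["status"] = "failed"   # the pilot did not come back in time
                return
            time.sleep(self.sleep_time)
```

The loop ends on its own as soon as something else moves the job out of the running state.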

A few ATLAS specific things around

JobDispatcher

Depends on dataservice and taskbuffer.

class TimedMethod: Can be moved to utils component
class CachedObject: Can be moved to utils component
class JobDispatcher: Implements the Singleton pattern. The following methods are taskBuffer wrappers with some added functionality.
- getJob: gets a job and if needed (for GLEXEC) sets the user proxy
- updateJob: updates job status. If needed retries failed analysis jobs. Adds metadata and/or StdOut. Calls taskBuffer to update the DB.
- getStatus: returns status and #attempts
- getEventRanges
- updateEventRanges
- getDNTokenMap: gets list of Sched users
- getPilotToken
- getKeyPair: gets DN
- getFQAN
- checkRole: Prod roles are hardcoded. They should be configurable.
- getDN
- checkToken
All the functions in the JobDispatcher class are now wrapped to expose them as a web service. You will see module-level functions with the same names.
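
The combination of a singleton class and same-named wrapper functions can be sketched like this. A hedged illustration only: the class name, the `init` method and the dictionary used as task buffer are hypothetical, and the sketch is not thread-safe.

```python
class JobDispatcherSketch:
    """Process-wide singleton holding a reference to the task buffer."""
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance.task_buffer = None
        return cls._instance

    def init(self, task_buffer):
        self.task_buffer = task_buffer

    def getStatus(self, panda_id):
        # Delegate to the task buffer; dispatcher-specific handling would go here.
        return self.task_buffer.get(panda_id, "unknown")


# Module-level wrapper with the same name, the form a web layer can expose.
def getStatus(panda_id):
    return JobDispatcherSketch().getStatus(panda_id)
```

Every call to `JobDispatcherSketch()` returns the same object, so the wrapper never needs a reference to be passed in.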

taskbuffer

ArchiveDBProxyPool, DBProxyPool, LogDBProxyPool

DB connection pools (a stack with put and get operations for connections). Ask Tadashi to explain the usage of the different databases and document it here briefly.
Could we merge all the pool classes into one generic class? A parameter could indicate which part of the configuration to use.
~~THIS NEEDS SOME CLEANUP. ONLY THE DBPROXYPOOL AND THE ORADBPROXY ARE IN USE~~ This was all cleaned up by Tadashi. Only OraDBProxy remains.
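
A sketch of what a single generic pool could look like, assuming the only difference between the pools is how a connection is created. The `factory` argument is a hypothetical hook that would encapsulate the choice of configuration section.

```python
import queue


class ProxyPool:
    """Fixed-size pool of connection objects handed out in LIFO (stack) order."""

    def __init__(self, factory, size):
        self._stack = queue.LifoQueue(maxsize=size)
        for _ in range(size):
            self._stack.put(factory())

    def get(self, timeout=None):
        """Take a connection; blocks while all connections are in use."""
        return self._stack.get(timeout=timeout)

    def put(self, proxy):
        """Return a connection to the pool."""
        self._stack.put(proxy)
```

With a timeout, `get` raises `queue.Empty` instead of blocking forever when the pool is exhausted.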

Spec files

CloudSpec, CloudTaskSpec, DatasetSpec, FileSpec, JobSpec, SiteSpec. These modules are the object representation of a cloud, task, dataset, etc.

Most of them repeat the same functions. Could we define a common parent and then inherit and override functions where needed?
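
One possible shape for such a common parent, as a sketch. The method names and the attribute list of the example subclass are hypothetical and not taken from the existing Spec modules.

```python
class BaseSpec:
    """Shared behaviour for Spec classes; subclasses only list their attributes."""
    _attributes = ()

    def __init__(self):
        for name in self._attributes:
            setattr(self, name, None)

    def pack(self, values):
        """Fill the attributes from a DB row given in _attributes order."""
        for name, value in zip(self._attributes, values):
            setattr(self, name, value)

    def values_list(self):
        """Return the attribute values in _attributes order."""
        return [getattr(self, name) for name in self._attributes]

    @classmethod
    def column_names(cls):
        """Comma-separated column list for building SQL statements."""
        return ",".join(cls._attributes)


class SiteSpecSketch(BaseSpec):
    _attributes = ("sitename", "cloud", "status")
```

A subclass that needs different behaviour overrides only the method in question.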

Jarka added the JobSpecHTCCondor

*Jarka added the optional backend='Oracle' parameter to some functions*

Spec files could be moved to some sub-component of the task buffer

ConBridge

Wrapper on DB connections

DBProxy, LogDBProxy, OraDBProxy, OraLogDBProxy

DB connection methods plus a variety of functions to execute different SQL code
~~THIS NEEDS SOME CLEANUP. ONLY THE DBPROXYPOOL AND THE ORADBPROXY ARE IN USE~~
This was all cleaned up by Tadashi. Only OraDBProxy remaining

Here is where the meat is. I still need to take a closer look, understand the DB schema (asked Gancho for one) and see what modifications Jarka made for the LSST version.

*Is the DBProxy module obsolete? Can it be removed?*

Ruslan said the EC2 version is simplified and does not admit tasks

There is a lot of MySQL compatibility code here, e.g. connectMySQL

archiveJob has significant changes

ErrorCode

Same story. Could it be consolidated?

EventServiceUtils

Functions for event service.

Pure ATLAS.

Initializer

Initializes a connection.

Not sure if widely used

MemProxy

Memcached module

PrioUtil

Small util functions (unicodeConvert, decodeJSON, calculatePriority).

priority = 1000 + offset - serNum/5 - 100*weight
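
The priority formula written out as a function, as a sketch. The meaning of the three inputs is not documented here, and whether the original uses integer or true division for serNum/5 is an assumption (true division below).

```python
def calculate_priority(offset, ser_num, weight):
    """priority = 1000 + offset - serNum/5 - 100*weight"""
    return 1000 + offset - ser_num / 5 - 100 * weight


# Worked example: 1000 + 0 - 10/5 - 100*1 = 898
print(calculate_priority(offset=0, ser_num=10, weight=1))  # 898.0
```

A larger serNum or weight lowers the priority, a larger offset raises it.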

Ask Tadashi how the priority calculation works

ProcessGroups

ATLAS specific

SQLDumper

Dumps the executed SQL into the log file so it's easy for debugging.

Utils

More auxiliary functions; they seem related to Cassandra and event picking.

ATLAS specific

WrappedCursor

Compatibility with MySQL and Oracle. Regexp replacements of SQL code (e.g. schemas, ROWNUM...)
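
To illustrate the kind of regexp rewriting meant here, a small sketch with two replacements. It is not the code in WrappedCursor: the schema name `ATLAS_PANDA`, the column names in the usage note and the restriction to a trailing `AND ROWNUM <= n` condition are assumptions chosen to keep the example short.

```python
import re


def oracle_to_mysql(sql, schema="ATLAS_PANDA"):
    """Apply two illustrative rewrites to an Oracle-flavoured statement."""
    # 1. Drop the schema prefix in front of table names.
    sql = re.sub(r"\b%s\." % re.escape(schema), "", sql, flags=re.IGNORECASE)
    # 2. Turn a trailing "AND ROWNUM <= n" condition into "LIMIT n".
    sql = re.sub(r"\s+AND\s+ROWNUM\s*<=\s*(\d+)\s*$", r" LIMIT \1", sql,
                 flags=re.IGNORECASE)
    return sql
```

For example, `SELECT PandaID FROM ATLAS_PANDA.jobsActive4 WHERE jobStatus='activated' AND ROWNUM <= 10` becomes `SELECT PandaID FROM jobsActive4 WHERE jobStatus='activated' LIMIT 10`. Statements that do not fit these two patterns pass through unchanged, which shows how fragile the approach is.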

*Suggestion for a better solution, e.g. DAO + factory or SQLAlchemy. Will require time.* After discussing with Ruslan and Tadashi we agreed to leave a major refactoring for LS2. Ideally, SQLAlchemy will be introduced then.

WrappedPickle

Checks if class is allowed to be pickled
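
A class check of this kind is commonly done on the unpickling side by overriding `find_class`, as described in the Python `pickle` documentation. The sketch below assumes that approach; the allow-list content is hypothetical and the actual WrappedPickle module may work differently.

```python
import io
import pickle

# Hypothetical allow-list of (module, class name) pairs.
ALLOWED = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called for every global the pickle stream wants to load.
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError("%s.%s is not allowed" % (module, name))


def loads(data):
    """Drop-in replacement for pickle.loads that enforces the allow-list."""
    return RestrictedUnpickler(io.BytesIO(data)).load()
```

Plain containers such as dicts and lists never go through `find_class`, so they load without being listed.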

TaskBuffer

All the functions from DBProxy, each with a wrapper that gets a connection from DBProxyPool and puts it back afterwards. Suggestion for improvement: usage of decorators.

Evaluate the possibility to use decorators in order to remove glue code
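
A sketch of how a decorator could replace the repeated get/put glue code. The names `with_proxy` and `TaskBufferSketch` are hypothetical; the pool is any object offering `get()` and `put(proxy)`.

```python
import functools


def with_proxy(method):
    """Borrow a proxy from self.pool, pass it to the method, always return it."""
    @functools.wraps(method)
    def wrapper(self, *args, **kwargs):
        proxy = self.pool.get()
        try:
            return method(self, proxy, *args, **kwargs)
        finally:
            self.pool.put(proxy)   # returned even if the call raises
    return wrapper


class TaskBufferSketch:
    def __init__(self, pool):
        self.pool = pool   # any object with get() and put(proxy)

    @with_proxy
    def getJobStatus(self, proxy, panda_id):
        return proxy.getJobStatus(panda_id)
```

Callers invoke `getJobStatus(panda_id)` without the proxy argument; the decorator supplies it.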

*Jarka has added configurable VO in some methods* I added it as well.

EventService seems to be new and is not present in Jarka's version

JEDI code is also new

Jarka added the checkSandboxFileEC2, storeHTCondorJobs, updateHTCondorJobs and removeHTCondorJobs methods. It is not clear to me which experiment these are for.

dataservice

The data service implements the data management required by the PanDA server. The usage of the external experiment data management system (e.g. the ATLAS Distributed Data Management system DQ2) is abstracted and the modules can be exchanged for the different experiments.

Activator

Called when a dataset transfer has completed. Sets the files as ready in the DB.
Activates the job (moves it from the jobsDefined4 table to the jobsActive4 table)

AdderPluginBase

(Empty) base class for ATLAS, CMS and other plugins.

Adder

I believe this was the initial implementation, which was later moved to AdderAtlasPlugin. It is still being used from jobdispatcher/JobDispatcher.py. Check the history of this file with Tadashi

Adder2

Spin-off from Adder. Need to know historical reason. I think this file is deprecated.

Check with Tadashi and delete it

AdderAtlasPlugin

ATLAS specific implementation of the DDM interactions. Takes care of:
- registering subscription requests
- registering locations
- registering new versions of a dataset
- deleting files from a dataset (removing unmerged files)
- ...

Tadashi's branch has evolved (Jedi and Rucio)

AdderCMSPlugin

CMS specific calls to DDM. Almost empty. Must have been written during the Common Analysis Framework exercise.

AdderDummyPlugin

Dummy version. Returns True for anything.

AdderGen

This is the Adder called from add.py (the adder cronjob!).

It is instantiated for a specific job, together with the XML file to use.
It checks that none of the input files are in cancelled state
It checks the job is not in a final state
It instantiates and executes the correct Adder plugin (ATLAS by default)
It does a series of checks and corrections on the job status
It has a parseXML function to read the XML with the file information that it received from the pilot.
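
The plugin selection step can be sketched as below. This is an assumption-laden illustration: the `execute` method, the registry dictionary, the `VO` key of the job and the placeholder logic in the ATLAS plugin are all hypothetical.

```python
class AdderPluginBase:
    """Base class; concrete plugins override execute()."""

    def __init__(self, job):
        self.job = job

    def execute(self):
        raise NotImplementedError


class AdderDummyPlugin(AdderPluginBase):
    def execute(self):
        return True   # accepts anything, no data management calls


class AdderAtlasPlugin(AdderPluginBase):
    def execute(self):
        # Placeholder for the ATLAS data management interactions.
        return bool(self.job.get("files"))


PLUGINS = {"atlas": AdderAtlasPlugin, "dummy": AdderDummyPlugin}


def get_plugin(job, default="atlas"):
    """Pick the plugin class from the job's VO, falling back to the default."""
    cls = PLUGINS.get(job.get("VO"), PLUGINS[default])
    return cls(job)
```

Supporting another experiment then means adding one class and one registry entry.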

Made configurable in Tadashi's branch. Before it worked only for CMS and ATLAS VOs

Includes some EventService related things

AddressFinder

Searches for the user's email address in the phonebook (client must be in ~atlpan/phonebook) or in xwho (web call), based on the user name.

ATLAS specific. The files were obsolete and have been deleted.

Closer

Updates and closes a dataset.

countGuidsClient, eventLookupClient

Clients for Athenaeum/eventLookup TAG service

ATLAS specific

DataService

Singleton. Web service for DDM callbacks about file transfer status

DataServiceUtils

aux functions

ATLAS specific

datriHandler

datri client built around curl

DDM

Rucio client

ATLAS specific

Setupper, SetupperAtlasPlugin

TBC

TaskAssigner

TBC

Some other files are skipped.

proxycache

Module used for GLEXEC. Builds a cache under /tmp/proxies with user proxies. The file names for the proxies are generated using a hash function.
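
A sketch of deriving the cache file name from the user DN. Which hash function the module actually uses is not stated here; SHA-1 and the example DN in the usage note are assumptions.

```python
import hashlib
import os

CACHE_DIR = "/tmp/proxies"


def proxy_path(user_dn, cache_dir=CACHE_DIR):
    """Map a user DN to a stable file name inside the cache directory."""
    digest = hashlib.sha1(user_dn.encode("utf-8")).hexdigest()
    return os.path.join(cache_dir, digest)
```

Calling `proxy_path("/DC=ch/CN=Some User")` always yields the same path for the same DN, so a cached proxy can be found again without keeping an index.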

ATLAS VO is hardcoded. This module is optional and pre-dates Jarka's branch

userinterface

Client

Several functions that build a URL and call the PanDA server with curl. The output is then printed.

Curl class with get, post, put methods. Could be moved to utils and merged with the Curl class in broker_util
submitJobs
runTaskAssignment
getJobStatus
...

Only the runBrokerage function seems ATLAS specific

RbLauncher, runRebroker, ReBroker

RbLauncher calls runRebroker, which calls ReBroker.

At the moment ATLAS specific

UserIF

Exposes all the functionality as a web service. Receives the requests, authenticates the user, and serializes and de-serializes the information.
~~*Jarka modified the Submit jobs with some LSST lines, otherwise it's the same.*~~ Added the same functionality

Some functions are only relevant to ATLAS (e.g. JEDI)

Propose to add an authentication/authorization module to simplify the other components

Test

test

Test scripts to create jobs, get the status, finish, kill them...
The sh scripts are usually crons running on one of the PanDA servers. The PanDA server is based on crons - they could be converted to daemons

Packaging/Installation

Carl from OSG committed the changes listed in https://github.com/PanDAWMS/panda-server/commit/9cf6517f768b81b0b0d62611e67a9306d86bd1f5. Main changes are:

usage of /data/pansrv instead of /data/atlpan as installation root

init.d scripts, configuration files and so on are moved to the traditional Linux directories some levels up (e.g. /data/pansrv/etc/init.d becomes simply /etc/init.d)

template will be stripped from the file names
*Review them with Tadashi in order not to break the current installation process* All committed and in production now for ATLAS

Attachments

Topic attachments
I	Attachment	History	Action	Size	Date	Who
png	Job_state_diagram1.png	r1	manage	36.4 K	2015-03-04 - 13:08	FernandoHaraldBarreiroMegino
png	Panda_high_level_view2.png	r1	manage	60.9 K	2015-03-06 - 11:10	FernandoHaraldBarreiroMegino
png	get_From__flow.png	r1	manage	11.0 K	2015-03-04 - 19:57	FernandoHaraldBarreiroMegino

Topic revision: r14 - 2015-05-04 - FernandoHaraldBarreiroMegino
