Tier0 Operations, Monitoring and Status Displays Guide
Introduction
For installation, configuration, and starting and stopping the system, consult the
FAQ!
For the CSA06 exercise, the Tier0 operation is based on a number of components:
- Logger: central recording and steering of all activities
- FileFeeder: finds available files in Castor and feeds them into the Prompt Reconstruction
- PR Manager: Prompt Reconstruction Manager, managing the PR Workers
- PR Worker: Prompt Reconstruction Worker, running the actual reconstruction code and configuration according to role
- Export Manager: handles the injections into PhEDEx
- PhEDEx itself, which loads the export buffer with appropriate traffic according to what the Export Emulator Manager tells it to do
All components are independent of each other and can be started and stopped as the need arises (as of today, restarting the Logger will likely lose information; to be fixed by Tony?!).
All CSA06 Tier0 operations are run with the loginID
cmsprod (for the password, contact Tony, Nick or Werner). All required software is installed in
~cmsprod/public/T0, with the exception of the required Perl and ApMon modules installed in
/afs/cern.ch/user/w/wildish/public/perl ($T0ROOT/env.sh will set up your
PERL5LIB accordingly).
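As a quick sanity check after sourcing env.sh, you can confirm the shared Perl module directory is actually on PERL5LIB. The snippet below simulates the prepend itself so it is self-contained; the path is the one quoted above, everything else is plain shell:

```shell
# Simulate what env.sh is expected to do: prepend the shared Perl
# module directory (path as quoted above) to PERL5LIB.
export PERL5LIB="/afs/cern.ch/user/w/wildish/public/perl${PERL5LIB:+:$PERL5LIB}"

# List the search path one entry per line and confirm the entry is present.
echo "$PERL5LIB" | tr ':' '\n' | grep -x '/afs/cern.ch/user/w/wildish/public/perl' \
  && echo "PERL5LIB OK"
```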
All steering and managing components run on
lxgate39.
All worker components run on dedicated worker nodes (dual CPU 2.8GHz with SLC4 32-bit mode) as batch jobs, using the LSF-queue
cmscsa06.
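To see how busy the queue is before or after submitting workers, the standard LSF commands can be used. This is a sketch that degrades gracefully on hosts without LSF installed; the queue name is the one given above, everything else is plain LSF:

```shell
QUEUE=cmscsa06

if command -v bqueues >/dev/null 2>&1; then
    # Queue summary: state and pending/running job counts.
    bqueues "$QUEUE"
    # Jobs of the production account in that queue.
    bjobs -u cmsprod -q "$QUEUE"
else
    echo "LSF tools not found on this host; run this on lxplus or lxgate39"
fi
```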
Installation of the system
All CSA06 computing tasks should use CMSSW_1_0_x. Currently, CMSSW_1_0_3 is installed, and this is the version the following instructions refer to!
Other versions may be installed in parallel as needed.
The code has to be installed on shared disk space, visible from all worker nodes, with
~cmsprod/public/T0 as the BASE_DIR.
The following tasks have to be performed to get the code to a runnable state:
- Checkout the code from the CVS repository
- Install the requisite Perl modules
- Configure the system
Castor usage
Castor2 is used with several disk pools, configured both in size and functionality as required.
- t0input 65TB on 13 servers, no tape, garbage collection disabled
- t0export 80TB on 16 servers, with tape
- cmsprod 22TB on 4 servers
The "RAW-data" input files are read from /castor/cern.ch/cms/T0Prototype/Input located in the t0input disk pool.
The RECO output files are written to /castor/cern.ch/cms/store/CSA06/?????
All options in the configuration file have to be set correctly in order to select the correct disk pool and path!
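The disk pool is selected through the STAGE_SVCCLASS environment variable before any Castor/RFIO access. The pool and path names below are the ones given above, and the listing is guarded so the snippet is harmless on machines without the Castor client tools:

```shell
# Select the input disk pool before touching Castor (pool name from above).
export STAGE_SVCCLASS=t0input

INPUT_PATH=/castor/cern.ch/cms/T0Prototype/Input

if command -v rfdir >/dev/null 2>&1; then
    # List the "RAW-data" input files sitting in the t0input pool.
    rfdir "$INPUT_PATH"
else
    echo "Castor client tools (rfdir) not available on this host"
fi
```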
Checking out the code
Running in a /bin/bash shell, you can check out the T0 code as follows:
#CMSSW Version and installation dir
export CMSSW_VERS=CMSSW_1_0_3
export CMSSW_BASE_DIR=~cmsprod/public/CSA06
export CMSSW_DIR=$CMSSW_BASE_DIR/$CMSSW_VERS
cd $CMSSW_BASE_DIR
#Check out from cmscvs
cmscvsroot CMSSW
scramv1 project CMSSW $CMSSW_VERS
cd $CMSSW_VERS/src
mkdir T0
cd T0
cvs co COMP/T0
# Set environment
cd $CMSSW_DIR/src/T0/COMP/T0
. env.sh
scramv1 runtime -csh | tee $T0ROOT/runtime.csh
scramv1 runtime -sh | tee $T0ROOT/runtime.sh
scramv1 runtime -sh | tee $T0ROOT/runtime_pr.sh
#Prompt Reconstruction application
cd $CMSSW_BASE_DIR
scramv1 project CMSSW $CMSSW_VERS
cd $CMSSW_DIR/src
eval `scramv1 runtime -sh`
cmscvsroot CMSSW
cvs co Configuration/Examples/data/RECO081.cfg
# ..following two patches needed for CMSSW_1_0_0 (fix should be in 1_0_1..)!
cvs co -r HEAD Configuration/CompatibilityFragments/data/RecoLocalEcal.cff
cvs co -r 1.15 Configuration/Examples/data/RECO.cff
cp Configuration/Examples/data/RECO081.cfg $T0ROOT/src/PromptReconstruction/Reco.cfg.tmpl
You need to edit
$T0ROOT/src/PromptReconstruction/Reco.cfg.tmpl and set the input
fileNames, the
maxEvents, and the output
fileName as follows, wherever they appear in the configuration file:
- untracked vstring fileNames = {'T0_INPUT_FILE'}
- untracked int32 maxEvents = T0_MAX_EVENTS
- untracked string fileName = "file:T0_OUTPUT_FILE"
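The three T0_* tokens above are placeholders that get substituted at run time. The following self-contained sketch shows the kind of substitution involved, using sed on a minimal mock template; the sample input/output values and the output file name Reco.cfg are illustrative only:

```shell
# Build a minimal mock template containing the three placeholders.
cat > Reco.cfg.tmpl <<'EOF'
untracked vstring fileNames = {'T0_INPUT_FILE'}
untracked int32 maxEvents = T0_MAX_EVENTS
untracked string fileName = "file:T0_OUTPUT_FILE"
EOF

# Substitute sample values (illustrative only) into a concrete config.
sed -e "s|T0_INPUT_FILE|rfio:/castor/cern.ch/some/input.root|" \
    -e "s|T0_MAX_EVENTS|-1|" \
    -e "s|T0_OUTPUT_FILE|reco_output.root|" \
    Reco.cfg.tmpl > Reco.cfg

grep maxEvents Reco.cfg   # -> untracked int32 maxEvents = -1
```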
Configuring the components
A single configuration file
$CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf is passed to each component, with the exception of the
FileFeeder ($CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06Feeder.conf), which we prefer to keep separate for more frequent changes.
After making a change, the syntax of the configuration file should be checked with
perl -c $modified_config_file.
Changing the configuration file while the components are running is possible, since they will re-execute it and update themselves, but for obvious reasons this should be done with care! Note, however, that the Worker components only read their configuration at startup, and then only to find out where their Loggers and Managers are. Once a Worker has connected to its Manager, that Manager sends it the configuration to use, along with any later updates. So only the configuration file(s) used by the Managers really matter: if you stop/start workers with a different configuration file, it makes no difference to what they do (provided they connect to the same Managers!)
A detailed description of how to change the configuration file is in
July prototype configuration, here only an overview of the most common changes is given.
Change of configuration parameters
It is assumed that the configuration files are set up such that they can be used directly. Only a few possible changes are shown here, needed for special cases such as changes of input/output path names, feeding rates, export rates to Tier1s, etc.
The Logger
The Logger writes a logfile if you have one set in the
Logfile parameter in the
Logger::Receiver section. This should not be on AFS, or the logger will fail as soon as the token expires. Any local filesystem will do. The file will be appended if it already exists. If your Logger is still running at midnight, it will rotate the logfile, adding a ".YYYYMMDD" suffix to the old one.
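The rotation simply renames the old file with the current date. If you ever need to rotate by hand, the equivalent operation looks like this, sketched in a scratch directory so nothing real is touched:

```shell
# Work in a scratch directory so no real logfile is affected.
WORK=$(mktemp -d)
cd "$WORK"

# Pretend this is the active logfile.
echo "some log lines" > Logger.log

# Rotate by hand, using the same naming the Logger applies at midnight:
# the old file gets a ".YYYYMMDD" suffix and a fresh file is started.
mv Logger.log "Logger.log.$(date +%Y%m%d)"
: > Logger.log

ls Logger.log*
```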
The configuration for the
FileFeeder is maintained in a separate file.
The Prompt Reconstruction components
Prompt Reconstruction has three components, the
PromptReco::Manager, the
PromptReco::Worker, and the
PromptReco::Receiver. The Receiver is internal to the manager, to handle subscriptions to the Logger.
For PromptReco::Worker, all
Modes are supported, but
LocalPull is the only one that has been tested.
Classic should also work, but
LocalPush won't until/unless someone writes the bits that push the files in the first place!
Leave
TargetDirs as it is (a single entry consisting of
'.') to write the RECO output to the job's local working directory. As elsewhere, the only
TargetMode supported at the moment is RoundRobin.
Set
MaxEvents to the maximum number of events to process; to process all events, set it to '-1' instead.
The
DataDirs and
LogDirs paths should be set to an RFIO-accessible directory
that already exists (maybe Tony will fix this one day?), preferably in Castor2 so the files end up in a persistent store, and
SvcClass has to be set as appropriate.
Operating the Tier0
Starting the components
Some of the components are fussy about being started in the correct order, though that will change as the code improves. In all cases, the only argument you need to specify is the configuration file, with '--config $file'. Also, in all cases, if the
Host is specified for that component, you must run it there or it will abort with an appropriate error message. For everything except the workers, just starting the task in a terminal window is good enough. I use
screen to create persistent sessions that I can connect to from home or from the office, see
http://wildish.home.cern.ch/wildish/UseScreen.html for a 1-minute tutorial on screen if you're interested.
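For the impatient, the handful of screen commands you actually need are sketched below. The session name t0 is arbitrary, and the snippet is guarded so it only lists existing sessions when run non-interactively:

```shell
if command -v screen >/dev/null 2>&1; then
    # Create a named, detached session (the name "t0" is arbitrary):
    #   screen -dmS t0
    # Reattach to it later, e.g. from home:
    #   screen -r t0
    # Detach from inside a running session with: Ctrl-a d

    # Here we just list existing sessions (harmless, non-interactive);
    # screen -ls exits non-zero when there are none, hence the || true.
    screen -ls || true
else
    echo "screen is not installed on this host"
fi
```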
To start the full system, the components should be started in this order:
#logon to machine where servers run..
ssh cmsprod@lxgate39
...give the password
# Set environment
export CMSSW_VERS=CMSSW_1_0_3
export CMSSW_BASE_DIR=~cmsprod/public/T0
export CMSSW_DIR=$CMSSW_BASE_DIR/$CMSSW_VERS
cd $CMSSW_DIR/src/T0/COMP/T0
. env.sh
#Working dir
export T0_WORK_DIR=/data/csa06
cd $T0_WORK_DIR
#Create dirs for log-files
mkdir -p ${T0_WORK_DIR}/Logs/Logger
mkdir -p ${T0_WORK_DIR}/Logs/FileFeeder
mkdir -p ${T0_WORK_DIR}/Logs/PromptRecoManager
mkdir -p ${T0_WORK_DIR}/Logs/ExportManager
# Start the components
#The Logger
$T0ROOT/src/Logger/LoggerReceiver.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf > ${T0_WORK_DIR}/Logs/Logger/LoggerReceiver.log 2>&1 &
#The FileFeeder
$T0ROOT/src/Utilities/FileFeeder.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06Feeder.conf --name CSA06Mixed::Feeder > ${T0_WORK_DIR}/Logs/FileFeeder/FileFeeder.log 2>&1 &
#The PR Manager
$T0ROOT/src/PromptReconstruction/PromptReconstructionManager.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf > ${T0_WORK_DIR}/Logs/PromptRecoManager/PromptRecoManager.log 2>&1 &
#The PR Workers (e.g. start 20 jobs...)
for i in `seq 1 20`; do
bsub -q cmscsa06 -R 'type=SLC4' $T0ROOT/src/PromptReconstruction/run_PromptReconstructionWorker.sh
sleep 5
done
#The Export Manager
$T0ROOT/src/ExportManager/ExportManager.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf > ${T0_WORK_DIR}/Logs/ExportManager/ExportManager.log 2>&1 &
The script
$T0ROOT/src/PromptReconstruction/run_PromptReconstructionWorker.sh will need editing to pick up the environment from the correct place (to be improved one day).
Watching the running system
Global plots
Export plots (PhEDEx)
Logfile(s)
All components produce output with variable levels of verbosity, see the configuration file syntax for details.
The most important information comes from the
Logger, which collects and prints the log-info from all components.
The logfiles of all components are written on
lxgate39.cern.ch:/data/CSA06/logs/component/version/channel, where
component is e.g. PR or Alca,
version is e.g. 102 or 103, and
channel is e.g. EWKSoup, ExoticSoup, HLTSoup, Jets, minbias, SoftMuon, TTbar, Wenu or ZMuMu.
The log-files can either be retrieved with
rfio (an lxplus account is required!) or listed directly by logging into
lxgate39.cern.ch (a special
registration is required!):
tail -f /data/CSA06/logs/Logger.log
Useful Twiki pages
Troubleshooting the system
For checking progress or debugging of the applications, there are several more or less intrusive possibilities.
- bpeek <lsf_jobID> allows inspecting the STDOUT of the running application.
- lsrun -P -R cmscsa06 /bin/bash will give you a terminal shell on a worker node. Be advised that if you use Ctrl-Z to stop the job, it will stop the lsrun command, not the process on the worker node!
- A direct login to a worker node is possible with the cmsprod account. Machine and running status of the applications can be checked with the usual Linux commands (e.g. ps -efl, top,...), directory listings and filesizes can be monitored, file contents can be examined, attaching to the running processes is possible, etc.
- To check which batch nodes are in production and which are in maintenance, on lxplus the command
CDBHosts -cl lxbatch -q "clustersubname='cmscsa06'" -data "hostname,get_value(hostname,'/system/network/interfaces/eth0/switchmedium'),state"
can be used.
Reporting bugs/problems
Simple operational problems can be reported to the t0 operations list or the t0 hypernews forum, depending on whether they are CERN-local (specific to our dedicated hardware) or conceptual (in the architecture of the system). Bugs or feature-requests should be reported using savannah, at
https://savannah.cern.ch/projects/cmstier0/. Make sure you assign your problem report to someone appropriate, or at least include them in the CC:, or nobody will follow it up! By default, assign problems to Tony.
Contacts
| Name | office | mobile |
| Tony Wildish | 77103 | ? |
| Nick Sinanis | 79881 | 160516 |
| Zhechka Toteva | 71604 | |
| Jens Rehn | 71606 | |
| Dirk Hufnagel | 71704 | |
| Lassi Tuura | 71542 | |
| Werner Jank | 71580 | 160512 |
-- Main.jank - 06 Oct 2006