Tier0 Operations, Monitoring and Status Displays Guide
Introduction
For installation, configuration, and starting and stopping the system, consult the
FAQ!
For the CSA06 exercise, the Tier0 operation is based on a number of components:
- Logger: central recording and steering of all activities
- FileFeeder: finds available files in Castor and feeds them into the Prompt Reconstruction
- PR Manager: Prompt Reconstruction Manager, managing the PR Workers
- PR Worker: Prompt Reconstruction Worker, running the actual reconstruction code and configuration according to role
- Export Manager: handles the injections into PhEDEx
- PhEDEx itself, which loads the export buffer with appropriate traffic according to what the Export Emulator Manager tells it to do
All components are independent of each other and can be started and stopped as the need arises (as of today, restarting the Logger will likely lose information; to be fixed by Tony?!).
All CSA06 Tier0 operations are run with the loginID
cmsprod (for the password, contact Tony, Nick or Werner). All required software is installed in
~cmsprod/public/T0, with the exception of the required Perl and ApMon modules installed in
/afs/cern.ch/user/w/wildish/public/perl ($T0ROOT/env.sh will set up your
PERL5LIB accordingly).
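As a quick sanity check after sourcing env.sh, you can confirm the shared Perl module directory is actually on PERL5LIB. The snippet below simulates the prepend itself so it is self-contained; the path is the one quoted above, everything else is plain shell:

```shell
# Simulate what env.sh is expected to do: prepend the shared Perl
# module directory (path as quoted above) to PERL5LIB.
export PERL5LIB="/afs/cern.ch/user/w/wildish/public/perl${PERL5LIB:+:$PERL5LIB}"

# List the search path one entry per line and confirm the entry is present.
echo "$PERL5LIB" | tr ':' '\n' | grep -x '/afs/cern.ch/user/w/wildish/public/perl' \
  && echo "PERL5LIB OK"
```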
All steering and managing components run on
lxgate39.
All worker components run on dedicated worker nodes (dual CPU 2.8GHz with SLC4 32-bit mode) as batch jobs, using the LSF-queue
cmscsa06.
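To see how busy the queue is before or after submitting workers, the standard LSF commands can be used. This is a sketch that degrades gracefully on hosts without LSF installed; the queue name is the one given above, everything else is plain LSF:

```shell
QUEUE=cmscsa06

if command -v bqueues >/dev/null 2>&1; then
    # Queue summary: state and pending/running job counts.
    bqueues "$QUEUE"
    # Jobs of the production account in that queue.
    bjobs -u cmsprod -q "$QUEUE"
else
    echo "LSF tools not found on this host; run this on lxplus or lxgate39"
fi
```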
Installation of the system
All CSA06 computing tasks should use CMSSW_1_0_x. Currently, CMSSW_1_0_3 is installed, and this is the version the following instructions refer to!
Other versions may be installed in parallel as needed.
The code has to be installed on shared disk space, visible from all worker nodes, with
~cmsprod/public/T0 as the BASE_DIR.
The following tasks have to be performed to get the code to a runnable state:
- Checkout the code from the CVS repository
- Install the requisite Perl modules
- Configure the system
Castor usage
Castor2 is used with several disk pools, configured both in size and functionality as required.
- t0input 65TB on 13 servers, no tape, garbage collection disabled
- t0export 80TB on 16 servers, with tape
- cmsprod 22TB on 4 servers
The "RAW-data" input files are read from /castor/cern.ch/cms/T0Prototype/Input located in the t0input disk pool.
The RECO output files are written to /castor/cern.ch/cms/store/CSA06/?????
All options in the configuration file have to be set correctly in order to select the correct disk pool and path!
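The disk pool is selected through the STAGE_SVCCLASS environment variable before any Castor/RFIO access. The pool and path names below are the ones given above, and the listing is guarded so the snippet is harmless on machines without the Castor client tools:

```shell
# Select the input disk pool before touching Castor (pool name from above).
export STAGE_SVCCLASS=t0input

INPUT_PATH=/castor/cern.ch/cms/T0Prototype/Input

if command -v rfdir >/dev/null 2>&1; then
    # List the "RAW-data" input files sitting in the t0input pool.
    rfdir "$INPUT_PATH"
else
    echo "Castor client tools (rfdir) not available on this host"
fi
```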
Checking out the code
Running in a /bin/bash shell, you can check out the T0 code as follows:
#CMSSW Version and installation dir
export CMSSW_VERS=CMSSW_1_0_3
export CMSSW_BASE_DIR=~cmsprod/public/CSA06
export CMSSW_DIR=$CMSSW_BASE_DIR/$CMSSW_VERS
cd $CMSSW_BASE_DIR
#Check out from cmscvs
cmscvsroot CMSSW
scramv1 project CMSSW $CMSSW_VERS
cd $CMSSW_VERS/src
mkdir T0
cd T0
cvs co COMP/T0
# Set environment
cd $CMSSW_DIR/src/T0/COMP/T0
. env.sh
scramv1 runtime -csh | tee $T0ROOT/runtime.csh
scramv1 runtime -sh | tee $T0ROOT/runtime.sh
scramv1 runtime -sh | tee $T0ROOT/runtime_pr.sh
#Prompt Reconstruction application
cd $CMSSW_BASE_DIR
scramv1 project CMSSW $CMSSW_VERS
cd $CMSSW_DIR/src
eval `scramv1 runtime -sh`
cmscvsroot CMSSW
cvs co Configuration/Examples/data/RECO081.cfg
# ..following two patches needed for CMSSW_1_0_0 (fix should be in 1_0_1..)!
cvs co -r HEAD Configuration/CompatibilityFragments/data/RecoLocalEcal.cff
cvs co -r 1.15 Configuration/Examples/data/RECO.cff
cp Configuration/Examples/data/RECO081.cfg $T0ROOT/src/PromptReconstruction/Reco.cfg.tmpl
You need to edit
$T0ROOT/src/PromptReconstruction/Reco.cfg.tmpl and set the input
fileNames, the
maxEvents, and the output
fileName as follows, wherever they appear in the configuration file:
- untracked vstring fileNames = {'T0_INPUT_FILE'}
- untracked int32 maxEvents = T0_MAX_EVENTS
- untracked string fileName = "file:T0_OUTPUT_FILE"
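The three T0_* tokens above are placeholders that get substituted at run time. The following self-contained sketch shows the kind of substitution involved, using sed on a minimal mock template; the sample input/output values and the output file name Reco.cfg are illustrative only:

```shell
# Build a minimal mock template containing the three placeholders.
cat > Reco.cfg.tmpl <<'EOF'
untracked vstring fileNames = {'T0_INPUT_FILE'}
untracked int32 maxEvents = T0_MAX_EVENTS
untracked string fileName = "file:T0_OUTPUT_FILE"
EOF

# Substitute sample values (illustrative only) into a concrete config.
sed -e "s|T0_INPUT_FILE|rfio:/castor/cern.ch/some/input.root|" \
    -e "s|T0_MAX_EVENTS|-1|" \
    -e "s|T0_OUTPUT_FILE|reco_output.root|" \
    Reco.cfg.tmpl > Reco.cfg

grep maxEvents Reco.cfg   # -> untracked int32 maxEvents = -1
```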
Configuring the components
A single configuration file
$CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf is passed to each component, with the exception of the
FileFeeder ($CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06Feeder.conf), which we prefer to keep separate for more frequent changes.
After making a change, the syntax of the configuration file should be checked with
perl -c $modified_config_file.
Changing the configuration file while the components are running is possible, since they will re-execute it and update themselves, but for obvious reasons this should be done with care! Note, however, that the Worker components only read their configuration at startup, and then only to find out where their Loggers and Managers are. Once a Worker has connected to its Manager, that Manager sends it the configuration to use, along with any later updates. So only the configuration file(s) used by the Managers really matter: if you stop/start workers with a different configuration file, it makes no difference to what they do (provided they connect to the same Managers!)
A detailed description of how to change the configuration file is in
July prototype configuration, here only an overview of the most common changes is given.
Change of configuration parameters
It is assumed that the configuration files are set up such that they can be used directly. Only a few possible changes are shown here, needed for special cases such as changes of input/output path names, feeding rates, export rates to Tier1s, etc.
The Logger
The Logger writes a logfile if you have one set in the
Logfile parameter in the
Logger::Receiver section. This should not be on AFS, or the logger will fail as soon as the token expires. Any local filesystem will do. The file will be appended if it already exists. If your Logger is still running at midnight, it will rotate the logfile, adding a ".YYYYMMDD" suffix to the old one.
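The rotation simply renames the old file with the current date. If you ever need to rotate by hand, the equivalent operation looks like this, sketched in a scratch directory so nothing real is touched:

```shell
# Work in a scratch directory so no real logfile is affected.
WORK=$(mktemp -d)
cd "$WORK"

# Pretend this is the active logfile.
echo "some log lines" > Logger.log

# Rotate by hand, using the same naming the Logger applies at midnight:
# the old file gets a ".YYYYMMDD" suffix and a fresh file is started.
mv Logger.log "Logger.log.$(date +%Y%m%d)"
: > Logger.log

ls Logger.log*
```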
The configuration for the
FileFeeder is maintained in a separate file.
The Prompt Reconstruction components
Prompt Reconstruction has three components, the
PromptReco::Manager, the
PromptReco::Worker, and the
PromptReco::Receiver. The Receiver is internal to the manager, to handle subscriptions to the Logger.
For PromptReco::Worker, all
Modes are supported, but
LocalPull is the only one that has been tested.
Classic should also work, but
LocalPush won't until/unless someone writes the bits that push the files in the first place!
Leave
TargetDirs as it is (a single entry consisting of
'.') to write the RECO output to the job's local working directory. As elsewhere, the only
TargetMode supported at the moment is RoundRobin.
Set
MaxEvents to the maximum number of events to process; to process all events, set it to '-1' instead.
The
DataDirs and
LogDirs paths should be set to an RFIO-accessible directory
that already exists (maybe Tony will fix this one day?), preferably in Castor2 so the files end up in a persistent store, and
SvcClass has to be set as appropriate.
Operating the Tier0
Starting the components
Some of the components are fussy about being started in the correct order, though that will change as the code improves. In all cases, the only argument you need to specify is the configuration file, with '--config $file'. Also, in all cases, if the
Host is specified for that component, you must run it there or it will abort with an appropriate error message. For everything except the workers, just starting the task in a terminal window is good enough. I use
screen to create persistent sessions that I can connect to from home or from the office, see
http://wildish.home.cern.ch/wildish/UseScreen.html for a 1-minute tutorial on screen if you're interested.
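For the impatient, the handful of screen commands you actually need are sketched below. The session name t0 is arbitrary, and the snippet is guarded so it only lists existing sessions when run non-interactively:

```shell
if command -v screen >/dev/null 2>&1; then
    # Create a named, detached session (the name "t0" is arbitrary):
    #   screen -dmS t0
    # Reattach to it later, e.g. from home:
    #   screen -r t0
    # Detach from inside a running session with: Ctrl-a d

    # Here we just list existing sessions (harmless, non-interactive);
    # screen -ls exits non-zero when there are none, hence the || true.
    screen -ls || true
else
    echo "screen is not installed on this host"
fi
```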
To start the full system, the components should be started in this order:
#logon to machine where servers run..
ssh cmsprod@lxgate39
...give the password
# Set environment
export CMSSW_VERS=CMSSW_1_0_3
export CMSSW_BASE_DIR=~cmsprod/public/T0
export CMSSW_DIR=$CMSSW_BASE_DIR/$CMSSW_VERS
cd $CMSSW_DIR/src/T0/COMP/T0
. env.sh
#Working dir
export T0_WORK_DIR=/data/csa06
cd $T0_WORK_DIR
#Create dirs for log-files
mkdir -p ${T0_WORK_DIR}/Logs/Logger
mkdir -p ${T0_WORK_DIR}/Logs/FileFeeder
mkdir -p ${T0_WORK_DIR}/Logs/PromptRecoManager
mkdir -p ${T0_WORK_DIR}/Logs/ExportManager
# Start the components
#The Logger
$T0ROOT/src/Logger/LoggerReceiver.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf > ${T0_WORK_DIR}/Logs/Logger/LoggerReceiver.log 2>&1 &
#The FileFeeder
$T0ROOT/src/Utilities/FileFeeder.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06Feeder.conf --name CSA06Mixed::Feeder > ${T0_WORK_DIR}/Logs/FileFeeder/FileFeeder.log 2>&1 &
#The PR Manager
$T0ROOT/src/PromptReconstruction/PromptReconstructionManager.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf > ${T0_WORK_DIR}/Logs/PromptRecoManager/PromptRecoManager.log 2>&1 &
#The PR Workers (e.g. start 20 jobs...)
for i in `seq 1 20`; do
bsub -q cmscsa06 -R 'type=SLC4' $T0ROOT/src/PromptReconstruction/run_PromptReconstructionWorker.sh
sleep 5
done
#The Export Manager
$T0ROOT/src/ExportManager/ExportManager.pl --config $CMSSW_DIR/src/T0/COMP/T0/src/Config/CSA06.conf > ${T0_WORK_DIR}/Logs/ExportManager/ExportManager.log 2>&1 &
The script
$T0ROOT/src/PromptReconstruction/run_PromptReconstructionWorker.sh will need editing to pick up the environment from the correct place (to be improved one day).
Watching the running system
Global plots
Export plots (PhEDEx)
Logfile(s)
All components produce output with variable levels of verbosity, see the configuration file syntax for details.
The most important information comes from the
Logger, which collects and prints the log-info from all components.
The logfiles of all components are written on
lxgate39.cern.ch:/data/CSA06/logs/component/version/channel, where
component is e.g. PR or Alca,
version is e.g. 102 or 103, and
channel is e.g. EWKSoup, ExoticSoup, HLTSoup, Jets, minbias, SoftMuon, TTbar, Wenu or ZMuMu.
The log-files can either be retrieved with
rfio (an lxplus account is required!) or listed directly by logging into
lxgate39.cern.ch (a special
registration is required!):
tail -f /data/CSA06/logs/Logger.log
Useful Twiki pages
Troubleshooting the system
For checking progress or debugging of the applications, there are several more or less intrusive possibilities.
- bpeek <lsf_jobID> allows inspecting the STDOUT of the running application.
- lsrun -P -R cmscsa06 /bin/bash will give you a terminal shell on a worker node. Be advised that if you use Ctrl-Z to stop the job, it will stop the lsrun command, not the process on the worker node!
- A direct login to a worker node is possible with the cmsprod account. Machine and running status of the applications can be checked with the usual Linux commands (e.g. ps -efl, top,...), directory listings and filesizes can be monitored, file contents can be examined, attaching to the running processes is possible, etc.
- To check which batch nodes are in production and which are in maintenance, on lxplus the command
CDBHosts -cl lxbatch -q "clustersubname='cmscsa06'" -data "hostname,get_value(hostname,'/system/network/interfaces/eth0/switchmedium'),state"
can be used.
Reporting bugs/problems
Simple operational problems can be reported to the t0 operations list or the t0 hypernews forum, depending on whether they are CERN-local (specific to our dedicated hardware) or conceptual (in the architecture of the system). Bugs or feature-requests should be reported using savannah, at
https://savannah.cern.ch/projects/cmstier0/. Make sure you assign your problem report to someone appropriate, or at least include them in the CC:, or nobody will follow it up! By default, assign problems to Tony.
Contacts
| Name | office | mobile |
| Tony Wildish | 77103 | ? |
| Nick Sinanis | 79881 | 160516 |
| Zhechka Toteva | 71604 | |
| Jens Rehn | 71606 | |
| Dirk Hufnagel | 71704 | |
| Lassi Tuura | 71542 | |
| Werner Jank | 71580 | 160512 |
-- Main.jank - 06 Oct 2006