NikhefCPUResources
This is the LHCb Nikhef group page describing locally available CPU resources
Pre-requisites and outcomes
You are assumed to know what SLC6 is, how to use a unix platform, and to have done the LHCb software
tutorials.
This Twiki will discuss the computing resources available to you so that you can continue on your analysis.
Introduction
- You have a lot of CPU at your fingertips... do you know about it all?
- Do you want to know what system is best for a given activity?
- LHCb defines certain jargon... what does it all mean?
Read On...
LHCb computing model
LHCb has a well-defined computing model, which accounts for user activity in a few ways:
- to provide adequate support, LHCb restricts usage patterns within the computing model
- to facilitate book-keeping, projections, procurement, and resource management, LHCb restricts users to patterns within the computing model
CPU resources are divided into "tiers" where certain activities are expected:
- tier-0: the CERN analysis centre/farm.
- Used for central reconstruction, storage of data, data processing, and a small amount of user analysis.
- tier-1: a short-list of key sites with a lot of dedicated CPU resources to be shared by the grid community, including LHCb.
- Used for central reconstruction, storage of data, data processing, and a larger amount of user analysis.
- tier-2: Smaller ancillary sites with significant dedicated CPU and resources,
- with some guaranteed access to LHCb data,
- associated with a given tier-1 or having their own storage systems.
- Used for MC production, and available for user jobs.
- tier-3: Small clusters which may form part of a larger tier-2,
- are usually not directly grid accessible and may be shared with other activities.
- Not normally used for central production directly,
- used heavily by user jobs requiring no input data such as toy studies,
- available for non-grid analysis.
- tier-4: A personal resource, laptop/desktop, maintained by the user, not accounted in the computing model.
Apart from tier-3/4 resources, the entire LHCb community is assumed to have guaranteed access to all of the above.
LHCb primarily supports Grid submission through DIRAC using the Ganga front end. Ganga supports a multitude of different possible backends. All other approaches are either not allowed, or supported only on a best-effort basis.
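For orientation, a minimal Grid submission via the Ganga front end looks roughly like the following sketch. This is meant to be typed inside a Ganga session (not run as standalone Python), and the executable and its arguments are placeholders, not a real analysis application:

```python
# Minimal sketch of a Grid submission, from inside a Ganga session.
# The 'echo' executable is just a placeholder for a real application.
j = Job(application=Executable(exe='echo', args=['hello grid']))
j.backend = Dirac()   # submit through DIRAC to the Grid
j.submit()
```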
A few activities which are specifically not supported are:
- Priority access to grid resources for given users or groups based on location. This is very hard to book-keep; instead we define "roles".
- Using LCG/grid tools directly, without Ganga/DIRAC, to submit to grid resources. This is very hard to maintain and impossible to manage.
When to use what and why!
| What | When and Why? |
| Desktop | Code development, small-scale analysis, and whatever else you want; remember that you lose the power of the Grid and StoomBoot if you stick to your local desktop. |
| Local interactive nodes | Code development, small-scale analysis requiring a lot of I/O and/or a lot of CPU, intensive fit procedures, and whatever else you want. |
| lxplus | Don't have SLC6 on your desktop, or no quick access to a different institute? Want to share your code and results with collaborators through AFS? Need to use CERN Castor for data storage? |
| lxbatch | Only if you really, really have to. It's slow, annoying, and you have a lot of competition. |
| Local batch system (StoomBoot) | Mid-scale analysis benefiting from parallelization, but not necessarily needing LHCb software and not needing data files only available on the grid. Really only use it for software which will not run on the Grid; usually you can still use the Grid directly if you can work within the LHCb software environment. |
| The Grid | Large-scale analysis; any jobs requiring Grid data; any jobs whose output needs to be stored centrally. The Grid is good for practically everything which can be done inside the LHCb software environment. |
Test, test, test
With Ganga you are encouraged to:
- test your scripts on a small test sample on a local machine,
- then test for other scaling problems on a local batch system,
- ONLY SUBMIT TO THE GRID JOBS WHICH YOU KNOW WILL WORK
Before using the Grid, consider the available hints and tips.
Computing resources for you
As a CERN user you have access to:
- lxplus: lxplus.cern.ch, a set of interactive nodes for CERN users
- lxbatch: also known as LSF, the CERN batch system
- Exploiting the CERN resources is a topic for the LHCb tutorials
- Log in to lxplus and use the LSF backend to submit with Ganga
You probably already know about lxplus and lxbatch, both of which count as tier-3 in LHCb jargon.
As a member of LHCb you have access to the Grid (see below). As a member of Nikhef you also have lots of resources on site at your disposal:
- Your desktop
- Gateway machines
- StoomBoot
- StoomBoot interactive nodes
- Your Desktop (tier-4)
- Self-explanatory. For problems and questions with your local desktop contact helpdesk AT nikhef.nl.
- If the problem is LHCb-software-specific, though, you should contact GerhardRaven or RobLambert, who will sort it out for you.
- Local Gateway Machines (tier-4)
- login.nikhef.nl, running an SLC5 platform with AFS support. For problems and questions with the login system contact helpdesk AT nikhef.nl.
- You are discouraged from performing any intensive tasks on the gateway machines themselves, since that can disrupt everybody trying to reach the network.
- parret.nikhef.nl, a shared SLC5 machine for the Nikhef LHCb group. Consider this your go-to machine once you're on the network.
- Access and submission (Ganga)
- ssh into the machine
- j.backend=Local() or j.backend=Interactive()
- j.submit()
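Putting the steps above together, a local run might look like the following sketch. This is Ganga session code, not standalone Python; the application is left as the Ganga default:

```python
# Run a job on the local machine, from inside a Ganga session
j = Job()                    # default Executable application
j.backend = Local()          # run as a background process on this machine
# j.backend = Interactive()  # alternatively, run in the foreground of your session
j.submit()
```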
- What local running is good for
Running locally is useful for:
- Testing, you must test before submitting large numbers of grid jobs
- Small and quick analyses of small files with fast turn-arounds
- Visualization and graphical processes; these really need to run as close to you as possible, ideally on the machine you are sitting in front of
StoomBoot (tier-3)
The StoomBoot cluster:
- is a PBS (Portable Batch System) cluster
- covers around 200 cores
- has the same NFS-mounted directories as you get on login.nikhef.nl and/or your local desktop
- has a CVMFS mount with the LHCb software, which can lead to much faster configuration of your jobs. See NikhefLocalSoftware
- does not have AFS support (see below).
StoomBoot is good for running:
- Jobs with no data at all (MC production, toy studies)
- Jobs where the data cannot be/is not on the grid (Fitting procedures)
- CPU-intensive tasks benefiting from parallelization
StoomBoot is not a replacement for the grid, and should not be used as such.
Please subscribe to the stbc-users mailing list for announcements and support.
-> Interactive nodes
There are five dedicated interactive nodes on StoomBoot.
- These can be used to configure your jobs in exactly the same environment that they will see on StoomBoot
- These can be used in place of your desktop to run a session locally
- stbc-i1 (SLC5)
- stbc-i2 (SLC6)
- stbc-i3 (SLC6)
- stbc-i4 (SLC6)
- stbc-32
- from a machine already on the Nikhef network: ssh stbc-... to get to your favourite StoomBoot node.
-> Interactive job submission
As well as using the interactive nodes, you can obtain an interactive session on a regular node:
- qsub -I
- run this from a machine already on the Nikhef network, or from your favourite interactive StoomBoot node.
-> command-line submission
From a directory somewhere under your home directory:
- qsub <myscript> to submit to StoomBoot, if the environment does not need to be passed to the worker nodes
- qsub -V <myscript> to submit to StoomBoot, passing the local environment. Useful if you're running LHCb jobs.
- qstat to watch the status of your jobs
- stdout and stderr are returned to the local directory as <scriptname>.o<jobID> and <scriptname>.e<jobID> respectively
- you can submit from any of the Nikhef computing nodes, e.g. parret, or a StoomBoot interactive node.
- there are different queues (select with qsub -q <queuename>); you can check queue properties with qstat -Qf. Currently there are:
  - express (jobs < 10 minutes)
  - generic (jobs < 24 hours; this is what you get by default)
  - short (jobs < 4 hours)
  - long (jobs < 2 days)
  - legacy (jobs < 8 hours, on SLC5), plus stbcq and iolimited (jobs < 8 hours, on SLC6), all with access to the gluster file system
- jobs requiring multiple cores should only be submitted to the special multicore queue (jobs < 3 days). This requires you to be added to the list of users allowed to submit to it (contact Jeff Templon). The job script should be a PBS script stating explicitly how many nodes and cores the job will use (add e.g. #PBS -l nodes=1:ppn=8 at the top for 8 threads).
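As an illustration, a multicore job script might look like the following sketch. The queue name and resource line follow the text above; the analysis command is a placeholder, and the OMP_NUM_THREADS variable is an assumption for a generic multi-threaded program:

```shell
#!/bin/bash
# Hypothetical PBS job script for the multicore queue
#PBS -q multicore
#PBS -l nodes=1:ppn=8

# Match the thread count to the 8 requested cores
export OMP_NUM_THREADS=8
echo "Running with ${OMP_NUM_THREADS} threads on $(hostname)"
# ./my_analysis   # placeholder for the real multi-threaded program
```

Submit it with qsub myscript.sh (add -V if the job needs your local environment).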
-> Access and submission (Ganga)
As a user you don't need to configure anything.
- The local Ganga configuration is managed by a central Ganga ini file.
- j.backend=PBS(queue="<queuename>")
- j.submit()
Simple as that.
If you come across any problems with the environment, note the following:
-> StoomBoot and AFS
StoomBoot does not have AFS installed or mounted. This can cause some problems when you are using the interactive nodes, for example:
- if your gangadir and/or cmtuser are softlinks to AFS, the softlinks will be overwritten with blank directories
- if your ganga.py is a softlink it cannot be loaded at run-time
- if you have AFS directories on your PYTHONPATH (for example for ganga utils) they will slow down configuration
So, if you use AFS for anything, it is better not to use the interactive nodes directly. Instead use a desktop machine where AFS is installed; you can still submit to the StoomBoot cluster from any of the desktop nodes at nikhef.nl.
-> Monitoring and links
NIKHEF/SARA (tier-1&2)
The Netherlands tier-1 grid site is located here.
- The tier-1 site is subdivided into two parts (each effectively a tier-2), NIKHEF and SARA, which together make up the Netherlands tier-1.
- The chosen mass storage technology is DPM.
- There are thousands of cores and tens of petabytes of storage shared between the grid community.
-> Access and submission (Ganga)
Apart from being physically closer to you, there is no difference between Nikhef and the other grid sites, so in general the policy is to submit to "the grid" as a whole so that you have access to even more machines.
It is possible to request a given grid site using the Dirac() backend in Ganga, and to replicate files to the Nikhef DPM:
- dataset.replicate("NIKHEF-USER")
- dataset.replicate("SARA-USER")
- j.backend.settings['Destination'] = 'LCG.SARA.nl'
- j.backend.settings['Destination'] = 'LCG.NIKHEF.nl'
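Combining the two, a sketch from inside a Ganga session might look like this. It assumes j is an existing job using the Dirac() backend with an attached input dataset; this is session code, not standalone Python:

```python
# Inside a Ganga session, with j a job using the Dirac() backend:
dataset = j.inputdata                                  # the job's input dataset
dataset.replicate("NIKHEF-USER")                       # copy the files to Nikhef storage
j.backend.settings['Destination'] = 'LCG.NIKHEF.nl'    # run the job at Nikhef
j.submit()
```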
Storing things?
--
RobLambert - 24-Oct-2011