About this document

This document is an attempt to help you understand what the Grid is and how to use it. It will not tell you how to do anything specifically, but I aim to give you some sense of direction so that you can start to find your own solutions.

I will not:

  • Provide "recipes". There are hundreds of these already. Most of these are out of date, and none of them will help you understand what you are doing.
  • Attempt to be comprehensive. If a useful tutorial for a piece of software exists I will link to it, not replicate it.
  • Claim to know all the answers. These are the few useful pieces of information I have gleaned in 18 months of frustration.

I assume:

  • You have an LXPLUS account, accessed via ssh. Unless I state otherwise, any code I include is to be used in this manner, and may not work otherwise.
  • You have a Grid certificate installed in your LXPLUS account.
  • Your Grid certificate is registered with the ATLAS virtual organisation.

This is a work in progress.

What is the Grid for?

The Grid is required to store the vast amount of data produced by the LHC experiments - this much is obvious. The important point is how you interact with this data: the Grid is not there to move data to your analysis, the Grid is there to move your analysis to the data. Consequently, any use of the Grid can be reduced to two fundamental steps:

  • Find the data.
  • Send your analysis to the data.

The software that sends your analysis to the data may incorporate finding the data first. Once your analysis has completed, its output will also be stored on the Grid.

Exploring the Grid

The computers that make up the Grid are spread over many countries. Ideally this would only be a curiosity - unfortunately it means that different Grid computers tend to behave in different ways. The only advice I can give is to try to design your analysis to be flexible about the environment that it is run in, or failing that to know precisely what it is that your analysis depends upon, and check that it is there.
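One way to "check that it is there" is to test for each required command before the analysis starts. The following is only a minimal sketch: the function name check_commands is my own invention, and the tools named in the usage comment are placeholders for whatever your analysis really depends upon.

```shell
# Report which of the named commands cannot be found on PATH.
# Prints one missing command per line and returns non-zero if any are missing.
check_commands() {
    missing=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            missing=1
        fi
    done
    return "$missing"
}

# Hypothetical usage at the top of a job script (tool names are examples only):
#   check_commands g++ root-config || { echo "missing dependencies" >&2; exit 1; }
```

Failing early with a clear message is usually easier to debug than a job that dies halfway through on a remote site.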

Data is stored on the Grid in data sets (i.e. folders) containing data files with some common origin or purpose. These files are primarily in ROOT format. The only tool you really need for finding and examining data sets is DQ2. On LXPLUS it is set up as follows:

This sets up your basic Grid environment

source /afs/cern.ch/project/gd/LCG-share/current/etc/profile.d/grid_env.sh

This sets up DQ2

source /afs/cern.ch/atlas/offline/external/GRID/ddm/DQ2Clients/setup.sh

This will create a temporary encrypted token (called a Grid proxy) allowing you to access the Grid. It will require you to type in the password you used when setting up your Grid certificate.

voms-proxy-init -voms atlas
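The three steps above can be collected into a single shell function. This is a sketch rather than a tested recipe: the paths and the voms-proxy-init command are copied from above, while the names source_if_present and setup_grid are hypothetical helpers of my own.

```shell
# Paths copied from the setup commands above.
GRID_ENV=/afs/cern.ch/project/gd/LCG-share/current/etc/profile.d/grid_env.sh
DQ2_SETUP=/afs/cern.ch/atlas/offline/external/GRID/ddm/DQ2Clients/setup.sh

# Source a setup script only if it is readable; otherwise complain and fail.
source_if_present() {
    if [ -r "$1" ]; then
        . "$1"
    else
        echo "setup script not found: $1" >&2
        return 1
    fi
}

# Run the three steps in order, stopping at the first one that fails.
setup_grid() {
    source_if_present "$GRID_ENV" &&
    source_if_present "$DQ2_SETUP" &&
    voms-proxy-init -voms atlas
}
```

These definitions need to be sourced into your login shell (not run as a separate script) before calling setup_grid, because environment changes made in a child process do not carry back to the shell that started it.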

Now you have access to useful DQ2 commands such as:

  • Search for data sets with
    dq2-ls "Pattern to search for"
  • Find where a data set is stored with
    dq2-ls -r "Data set name"
  • Download a data set with
    dq2-get "Data set name"
  • Click here for more information

Note that dq2-get should be used with some thought. As stated above, the purpose of the Grid is to move your analysis to the data, not the data to your analysis. Consequently you should only really download a data set if it contains the output of your analysis, or if (for whatever reason) you don't want to run your analysis on the Grid. This is understandable given the difficulty of using the Grid!

Using the Grid

Physics analysis software

Before tackling the topic of using the Grid, it is worth mentioning two pieces of software that will be relevant to your analysis.

ROOT

Most data on the Grid is stored in the ROOT file format, and within these files as tables of data (TNtuples) and histograms (TH1 and derived classes). ROOT provides software libraries to create and access these files and the objects within them, as well as a vast number of statistical functions and methods of numerical analysis. When used as a simple library in this way, ROOT is extremely useful. However, ROOT has many quirks, most obviously in the user interface called CINT. This interprets syntax very similar to C++, and so can run programs using the ROOT library. It may seem tempting to use CINT to run analysis code, but it is almost certainly a bad idea, as the restrictions and strange behaviour of this interpreter will make all but the simplest tasks very frustrating.

Do:

  • Use ROOT libraries linked to your analysis code.
  • Refer to the reasonably good documentation - simply Google search for the name of a ROOT class to bring up the appropriate page.
  • Use the TBrowser object to explore .root files.

Don't:

  • Write code to run in CINT.

ATHENA

ATHENA comprises the bulk of ATLAS analysis software. It contains methods for generating simulated data and for reconstructing the properties of particles from data, as well as providing tools for performing your analysis. However, if ROOT is quirky and occasionally frustrating, ATHENA is a nightmare. Development is (at the moment) very fast, and code that worked with one version of the framework may well not work with newer versions, or may work differently. Consequently at Grid sites there are many versions of ATHENA available, and you should specify the version you require. To make things worse, different Grid sites have different versions available, and if you need a very new or very old version you are probably out of luck.

While ATHENA should be the tool for use in all analyses, I have had no success with it on the Grid, and so can offer no help at this time.

Running your analysis on the Grid

There are many tools to help you do this: some are very complex, some are unreliable, and some are simply awkward. However, there is a tool called PANDA that I (and others) have found both intuitive and reliable. To set up PANDA on LXPLUS, run the following (you will probably need to have set up the Grid environment and created a Grid proxy first, as described above):

source /afs/cern.ch/atlas/offline/external/GRID/DA/panda-client/latest/etc/panda/panda_setup.sh

There are two aspects to PANDA:

PANDA without ATHENA

If your analysis does not require ATHENA (and the associated headaches) then the command "prun" is perfect for you. It will find the input data set(s) you specify, move your analysis code to the Grid site where the data is stored, compile your code, run it, and then store the output on the Grid. All your code needs to do is accept a comma-separated list of the names of files in the data set. It took me less than a day to have this command running to my satisfaction, and I need do little more than direct you to the excellent tutorial page.
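To illustrate the one requirement placed on your code, here is a minimal sketch of a wrapper that accepts a comma-separated list of file names and runs a command once per file. The function names split_file_list and for_each_input are hypothetical, and this says nothing about how prun itself is invoked; see the tutorial for that.

```shell
# Split a comma-separated list into one file name per line.
split_file_list() {
    printf '%s\n' "$1" | tr ',' '\n'
}

# Run a command once per input file.
# $1 is the comma-separated list; the remaining arguments are the command.
for_each_input() {
    list=$1
    shift
    split_file_list "$list" | while IFS= read -r file; do
        if [ -n "$file" ]; then
            # Redirect stdin so the command cannot swallow the rest of the list.
            "$@" "$file" </dev/null
        fi
    done
}

# Hypothetical usage, where ./my_analysis is a placeholder for your own program:
#   for_each_input "$1" ./my_analysis
```

If your analysis program can take several files at once, it may be simpler to do the splitting inside the program itself; the wrapper is just the least intrusive option.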

PANDA with ATHENA

The command "pathena" can in theory be used instead of "athena" in most situations, running the job on the Grid instead of locally. I have not tested this, although I have heard many encouraging things about it.

-- BenjaminWynne - 16-Mar-2010

Topic revision: r1 - 2010-03-17 - BenjaminWynne
 