LFC and LHCb

LFC instance at PIC

The LFC at this Tier1 is a mirror instance (for accessing replica information). It has been assigned service availability level 3, which means "reduced effectiveness" (in PIC terminology this is class C - maximum downtime: next working day).

General information on how LHCb uses the LFC

LHCb uses all LFC instances for queries, picking an instance at random. For writing, however, we use only the central master instance at CERN, so that there is no synchronisation problem. The PIC instance is therefore useful. Since it mainly provides scalability and redundancy, a one-day downtime is acceptable. However, if a downtime lasts more than a few hours we should be warned, so that the instance can be removed from the list in the CS and we avoid querying it while we know it is down; we should also be warned when it is back up and running.
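As an illustration, here is a minimal sketch of this read/write split in Python (the helper name and the master host are placeholders, not the actual DIRAC implementation; the mirror host names are those listed further down this page):

import random

# Placeholder for the central read-write LFC instance at CERN (actual host name may differ)
LFC_MASTER = 'lfc-lhcb.cern.ch'

# Read-only mirrors (the list returned by the DIRAC utility described below)
LFC_MIRRORS = ['lfc-lhcb-ro.cern.ch', 'lfc-lhcb-ro.cr.cnaf.infn.it',
               'lfc-lhcb-ro.in2p3.fr', 'lfclhcb.pic.es',
               'lhcb-lfc-fzk.gridka.de', 'lhcb-lfc.gridpp.rl.ac.uk',
               'lfc-lhcb.grid.sara.nl']

def selectCatalog(forWriting=False):
    # All registrations go to the master, so there is no synchronisation problem;
    # queries are spread at random over the read-only mirrors.
    return LFC_MASTER if forWriting else random.choice(LFC_MIRRORS)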

This procedure is specified in the service manual on the PIC twiki, under 'calendario del servicio' (service calendar).

Monitoring of the service

SLS sensor to monitor the active threads.

LFC distributed service for LHCb: the master (CERN) and the slave instances are monitored with SLS.

3D Streams monitor at CERN. The user 'streams' and a password are required.

Question raised at PASTE on 24 Nov 2009:

An Oracle bug affected the LFC streaming. It was pointed out that the client only contacts the master LFC at CERN, and not the Tier1 instances, for registering files. If the master LFC is down, requests go to a failover (does the failover rely on the Tier1 instances?). In principle it should, but at some point DIRAC was configured to talk only to the master catalog, as the slaves were not reliable.

Problem due to too many connections to LFC - 12th Jan 2010

An error was observed: "send2nsd: NS002 - send error : Operation timed out". Indeed, the SAM test for the read-only LFC instance at CERN was red: problems with reading.

Sophie says: the problem, which we also noticed before, was due - I believe - to the Persistency/CONDDB calls to the LFC made by a user. Last year we discussed modifications to the Persistency/CONDDB code; I am not sure whether Andrea had the chance to implement them. In my opinion this is the way to go: implement them and get the new release deployed everywhere.

This particular user was submitting ~4000 jobs yesterday, many of which went to CERN and many of which started at the same time. The LFC probably cannot handle that many simultaneous connections: it has a maximum number of threads in its configuration and cannot accept more. The saturation is visible here: https://sls.cern.ch/sls/history.php?id=LFC_LHCB_RO&more=nv%3AActive+connections&period=week

Remark: whatever the activity, the service should not fall apart but should gracefully stop replying if overloaded. This is why we have distributed replicas. By the way, do we currently use our replicas outside CERN? The fact that the jobs started at CERN is unrelated to where the LFC service is running anyway, right?

On 13 Jan 2010 Andrei comments: we have stopped for the moment using the LFC replicas outside CERN because it turned out that their consistency is sometimes not guaranteed, without us actually knowing about it. These problems with the Oracle replication are now being addressed, but on our side we will also have to develop more tests and workarounds for the case where an LFC mirror inconsistency is detected. However, to my knowledge the CORAL layer is not capable of working with several mirrors and talks only to the CERN LFC instance, so it does not profit from the LFC redundancy. Something to be discussed in the Core Software.

How to tackle this problem? The proposed actions are:

  • work on the Persistency server and get rid of this very old LFC interface
  • LFC-RO at CERN is still overloaded (all 70 threads exhausted). We could ask for more hardware and also increase the number of threads per box to something like 200. This would allow the service to support roughly a 5 times larger load; this measure, combined with the concurrent usage of the 6 other available LFC instances at the Tier1s, could allow us to survive until our applications are moved away from this LFC Persistency interface, unless we want to set top priority for the Persistency people to rewrite this interface.

Andrea Valassi (CORAL) said that there is a new LFC interface ready for testing. Marco C. was asked to test it and, if appropriate, to ask the developer to commit it to SVN and make it available soon.

As far as LFC hardware and threads are concerned, the LFC people can be asked to add boxes and increase the number of threads per box, but on our side we should also enable the Tier1 LFCs. Andrew/Andrei: how/when can this be done?

Very interesting remark from Andrew: the problem with Persistency was already pointed out in June 2009 for the WLCG STEP09 exercise. People said a fix was under way and would be available soon, but it then took 6 months to be released (and it is a minor development).

DIRAC tool to retrieve a list of catalogs on the basis of the location of the client

New utilities in DIRAC/Core obtain an ordered list of catalogs based on the location of the client, resolved from the site name (e.g. LCG.Bari.it -> 'it', which will use the mirror at CNAF). To use this, do the following:
from DIRAC.Core.Utilities.ResolveCatalog import getLocationOrderedCatalogs
getLocationOrderedCatalogs(siteName='LCG.CNAF.it')

This will return a list of the catalogs, with the first element in the list being the 'local' catalog. With all catalogs active it will look something like:

['lfc-lhcb-ro.cr.cnaf.infn.it', 'lfc-lhcb.grid.sara.nl', 'lfc-lhcb-ro.cern.ch', 'lfc-lhcb-ro.in2p3.fr', 'lfclhcb.pic.es', 'lhcb-lfc-fzk.gridka.de', 'lhcb-lfc.gridpp.rl.ac.uk']

This utility has been incorporated into the DIRAC LFC client but can also be used before the application execution starts to correctly set the LFC_HOST variable. This ensures that, where possible, the local LFC instance is used and otherwise the load is randomised between the Active mirrors. Compared with blindly setting LFC_HOST to the Tier1 instance, this has the advantage of taking into account the availability of the catalog as stored in the CS and updated by Roberto/Federico.
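For example, a job wrapper could set LFC_HOST along these lines before starting the application (a sketch, assuming the utility returns the plain list of host names shown above):

import os
from DIRAC.Core.Utilities.ResolveCatalog import getLocationOrderedCatalogs

# Ordered list of read-only LFC instances, local mirror first
catalogs = getLocationOrderedCatalogs(siteName='LCG.CNAF.it')
if catalogs:
    # Point the LFC client at the 'local' (or otherwise first Active) catalog
    os.environ['LFC_HOST'] = catalogs[0]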

To do this I have updated the CS representation of the LFC mirrors such that they are stored along with their current status. For example the CERN mirror information is stored in the following section:

/Resources/FileCatalogs/LcgFileCatalogCombined/LCG.CERN.ch

Within this section there are two options: Status and ReadOnly. If the Status option value is Active, the catalog is deemed usable. The ReadOnly option contains the LFC service URL. I will clean up the existing representation of this information once the new release is made and the old clients are updated.
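As an illustration, a client could read these options via the DIRAC configuration client along these lines (a sketch; the default values here are assumptions):

from DIRAC import gConfig

section = '/Resources/FileCatalogs/LcgFileCatalogCombined/LCG.CERN.ch'
# Status tells whether the mirror is usable, ReadOnly holds the LFC service URL
status = gConfig.getValue(section + '/Status', 'InActive')
lfcUrl = gConfig.getValue(section + '/ReadOnly', '')
if status == 'Active' and lfcUrl:
    print('Usable mirror: %s' % lfcUrl)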

To manipulate the status of the mirrors, two new CLI commands have been added, similar to those for banning/allowing the SEs: dirac-admin-allow-catalog and dirac-admin-ban-catalog.

-- ElisaLanciotti - 24-Nov-2009
