CREAM and LHCb

Job submission to CREAM CE

After the 9 Sept 09 GDB, Philippe reports that it is now possible to submit jobs to CREAM CEs through WMSs running the latest releases, so we should start testing submission to those CEs. There are currently 34 CREAM CEs deployed (16 at Tier1s, 18 at Tier2s). Philippe proposes to add them under a specific site name for tests. Pilots should of course be submitted only to the latest WMS, but most of our WMS instances may already have been upgraded (to be checked). Could someone check how many CREAM CEs support LHCb? They should be advertised in the BDII. CNAF, RAL and IN2P3 (in progress for mid-November) could certainly be tried out. Direct submission to CREAM (without passing through the WMS) remains the goal and should be pursued with high priority.

Roberto adds: both CERN and CNAF now support gLite WMS 3.2, which fixes the ICE interface used to submit to CREAM and so finally supports submission to CREAM through the WMS. Direct submission, on the other hand, has been possible for a long while.

The point is: submission to CREAM through WMS is almost automatic, whereas direct submission to CREAM requires some development in DIRAC.

Direct submission to CREAM

ALICE already does it.

DIRAC should develop an agent director for direct submission to CREAM. If all goes well, we could even simplify the logic considerably and just keep a constant flow of pilots to the CREAM CEs, without any particularly clever brokering: as long as there are jobs to be run at a site, some pilots should be waiting there. For simulation jobs, pilots could be sent using the same criterion, submitting randomly to sites that are lacking pilots.
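The "constant flow" idea can be sketched in a few lines of Python. This is only an illustration, not DIRAC code: the `SiteState` record, the `target_waiting` threshold and both function names are hypothetical, and the real director would obtain the job and pilot counts from its own databases.

```python
import random
from dataclasses import dataclass


@dataclass
class SiteState:
    """Hypothetical per-site snapshot; not an actual DIRAC structure."""
    name: str
    waiting_jobs: int    # jobs in the task queue that can run at this site
    waiting_pilots: int  # pilots submitted to the site's CREAM CEs, not yet running


def pilots_to_submit(site, target_waiting=5):
    """Top up the waiting pilots to target_waiting while the site has jobs."""
    if site.waiting_jobs <= 0:
        return 0
    return max(0, target_waiting - site.waiting_pilots)


def pick_simulation_sites(sites, n_pilots, target_waiting=5, rng=random):
    """Spread n_pilots randomly over the sites that are lacking pilots."""
    lacking = [s for s in sites if s.waiting_pilots < target_waiting]
    if not lacking:
        return []
    return [rng.choice(lacking).name for _ in range(n_pilots)]
```

The threshold value of 5 is an arbitrary assumption; the point is only that no brokering decision is needed beyond "jobs are waiting and too few pilots are queued".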

Sept 15th: GGUS ticket opened to demonstrate pilot submission through the CREAM CE.

Advantages of direct submission wrt submission through WMS:

  • It removes one component (the gLite WMS) from the submission chain, a component that suffers from many limitations, mainly in the VOMS awareness of its internal components.
  • Removing an intermediate step would also improve the overall efficiency (eff = # pilots scheduled on the CE / # pilots submitted) and the speed of dispatching pilots (although the latter is not critical).
  • It removes a bottleneck in pilot job submission: WMS servers have hit scalability issues in the past.
  • The most interesting aspect is that CREAM was originally conceived by its developers for the pull approach. By design it comes with a lot of functionality that could be resurrected or requested once we become clients of it.
  • The LCG-CE is no longer maintained and is going to be phased out (although the GDB is still pushing sites to upgrade to the latest version, which fixes some issues). The CREAM CE will be the CE for LCG.
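The efficiency quoted in the list above is a simple ratio, and comparing it for the two submission routes is how the claimed improvement could be checked. A minimal sketch, where the function name is hypothetical and the counts would have to come from the pilot monitoring:

```python
from typing import Optional


def pilot_efficiency(scheduled_on_ce: int, submitted: int) -> Optional[float]:
    """eff = pilots scheduled on the CE / pilots submitted.

    Returns None when nothing was submitted, so that an empty sample
    is not mistaken for 0% efficiency.
    """
    if submitted <= 0:
        return None
    return scheduled_on_ce / submitted
```

Computing this separately for pilots sent through the WMS and pilots sent directly would give the two numbers to compare.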

Situation at Spanish sites

Results of a poll by Gonzalo, 9 Sept 09:
  • CIEMAT: no plan, maybe in the future a test one
  • LIP: no, no human resources
  • UAM: not now, maybe in the future
  • UB: not now, maybe in the future when they are sure it works fine
  • IFCA: not now. Maybe in the future a test one.
  • CESGA:

At PIC

At PIC the CREAM CE is available and the WMS has been updated to the latest version, so it allows job submission through the WMS. The CREAM CE is published under the test CE:
[elisa@vobox07 ~]$ lcg-infosites --vo lhcb -f pic ce | grep ce-test
4299	   0	   2	          2	   0	ce-test.pic.es:8443/cream-pbs-glong_sl5
4290	   0	 114	        114	   0	ce-test.pic.es:8443/cream-pbs-gmedium_sl5
 316	   6	  33	          0	  33	ce-test.pic.es:8443/cream-pbs-gshort_sl5
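To pick the CREAM queues out of this listing automatically, the output can be parsed with a short script. The sketch below assumes the usual `lcg-infosites ... ce` column order (#CPU, Free, Total Jobs, Running, Waiting, ComputingElement), which the header removed by the `grep` would normally show; the `CEQueue` type and the function names are hypothetical.

```python
from typing import List, NamedTuple


class CEQueue(NamedTuple):
    cpus: int
    free: int
    total_jobs: int
    running: int
    waiting: int
    ce_id: str


def parse_infosites_ce(text: str) -> List[CEQueue]:
    """Parse data rows of 'lcg-infosites ... ce' output, skipping other lines."""
    queues = []
    for line in text.splitlines():
        fields = line.split()
        # A data row has five integer columns followed by the CE identifier.
        if len(fields) != 6 or not all(f.isdigit() for f in fields[:5]):
            continue
        queues.append(CEQueue(*map(int, fields[:5]), fields[5]))
    return queues


def is_cream(queue: CEQueue) -> bool:
    """CREAM CE identifiers contain '/cream-' after the host:port part."""
    return "/cream-" in queue.ce_id


SAMPLE = """\
4299     0    2     2    0  ce-test.pic.es:8443/cream-pbs-glong_sl5
4290     0  114   114    0  ce-test.pic.es:8443/cream-pbs-gmedium_sl5
 316     6   33     0   33  ce-test.pic.es:8443/cream-pbs-gshort_sl5
"""

for q in parse_infosites_ce(SAMPLE):
    print(q.ce_id, "waiting:", q.waiting, "cream:", is_cream(q))
```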

Added in DIRAC on 19 Oct: LCG.PIC-CREAM.es. It is still "Special" in the BDII, and all CREAM sites are banned for production and SAM jobs.

Situation in Oct 09

CREAM CEs are published as special in the BDII because they are not yet in production but in a pre-production system called pilot. They are published as special to prevent user jobs from matching them, and in November they should finally be put in production.

So, from the site point of view we should just wait for the developers to put CREAM in production.

Message passed from LHCb to CREAM developers at 9 Oct GDB:

LHCb stated that as soon as all gLite WMSes at CERN and the T1s are upgraded to 3.2 (which supports the working ICE interface to CREAM), there should be no major problems in moving to use the CREAM CE transparently via pilot job submission through the gLite WMS.

Instances that are not upgraded would simply be dropped from the list of WMSes in use by LHCb. Today (to be checked) all WMSes that LHCb relies on at the T1s and CERN should have this version installed and therefore support submission to CREAM.

It has been noted that all CREAM CEs are advertised in the BDII with status "Special" instead of "Production", and this had to be sorted out internally in DIRAC. The question becomes: when will sites start to publish these CREAM CEs as Production and let LHCb use them for its real daily activities?

LHCb also started to evaluate the CREAM interface for direct submission and identified a couple of missing functionalities:

  • Querying CEMON for the status of the queues is currently not possible. By default the plug-ins for this CEMON sensor are disabled at the sites, and only jobs owned by the user can be queried through that interface (which makes it a bit pointless). If LHCb wants to implement the DIRAC logic for interacting directly with the CE, it has to collect the information via ldapsearch queries directly to the top BDII, which is felt to be impractical for exploiting the pull paradigm that the CREAM CE claims to be designed for. The alternative is to keep a private bookkeeping of the jobs sent to each CE, as ALICE does.
  • For the OutputSandbox, a gridftp server URI to upload it to has to be provided. However, LCG no longer supports the Classic SE (gridftp server + disk servers), so CREAM needs a component that the rest of the production infrastructure does not provide. One possible solution is to ask the CREAM developers to support an SRM endpoint for uploading the sandbox (so that a SURL is written in the JDL and the sandbox is copied there). Otherwise, like ALICE, we can ask for gridftp servers to be installed on the VOBOXes (remember that a gridftp server is not a standard component of the VOBOX!).
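The "private bookkeeping" alternative mentioned in the first point could look roughly like the sketch below. It is purely illustrative: the `PilotBookkeeping` class is hypothetical, the status labels are simplified placeholders rather than the actual CREAM job states, and how ALICE implements its own bookkeeping is not described here.

```python
from collections import Counter


class PilotBookkeeping:
    """In-memory record of the pilots we submitted to each CE (hypothetical)."""

    def __init__(self):
        self._pilots = {}  # pilot_id -> (ce, status)

    def record(self, pilot_id, ce, status):
        """Register a new pilot or update the status of a known one."""
        self._pilots[pilot_id] = (ce, status)

    def count(self, ce, status):
        """Number of our pilots on the given CE in the given status."""
        return sum(1 for c, s in self._pilots.values() if c == ce and s == status)

    def summary(self, ce):
        """Status -> count for the given CE."""
        return Counter(s for c, s in self._pilots.values() if c == ce)
```

With such a record the director only needs the status of its own pilots, which is exactly what the CE interface does allow a user to query, so neither CEMON queue information nor BDII lookups are required for the pilot-flow decision.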

Neither point is really a show stopper for setting up a DIRAC layer to CREAM (not yet in place) for direct submission, but again it is hard to give a precise timeline estimate, mainly due to manpower issues.

Follow up

22 Oct 2009: Ricardo modified the Director to handle 'special' CEs and submitted some jobs to CREAM CEs via gLite WMS. At PIC fine!

11 Nov 2009: CREAM CE out in production in DIRAC. SAM tests submitted. They were using ce-test; I asked for it to be replaced with ce08, now in production at PIC.

25 Nov 2009: ce08 in production. SAM tests for LHCb enabled.

06 Dec 2010: the site CREAM.NIKHEF.nl was put in production (still in parallel with LCG.NIKHEF.nl), with user jobs as well as simulation jobs going to this new site. The policy to use for the CREAM sites is to be discussed. As they seem to work fine, we could progressively move the Tier1s to pure CREAM sites, which will allow more jobs to go outside CERN, in particular for analysis.

To be discussed: how to integrate them into the system. Create a new site, i.e. CREAM.Nikhef.nl? Or add the CREAM CE to the already existing LCG.NIKHEF.nl and set a flag on it so that DIRAC understands that jobs are submitted to it directly, without using the gLite WMS?

Very important: we should have a way to keep the accounting separate! For that we have the Broker and GridType parameters in the pilot agents DB:

  • For the WMS pilots, the Brokers are the WMS end-points and the GridType is set to gLite.
  • For CREAM direct submission, the Broker is the host where the Site Director is running (volhcb20.cern.ch for CREAM.NIKHEF.nl) and the GridType is currently set to DIRAC.

Is this enough to make a comparative analysis?

Jobs do not know which pilot submission mechanism was used for their pilots; this is not their business. However, pilots in the accounting have a GridMiddleware parameter (not GridType), which is either gLite (for the gLite WMS) or DIRAC (since the scheduling is done by DIRAC). They also have a GridResourceBroker parameter, which should be volhcb20.cern.ch. But these values don't appear in the accounting!!
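If the two parameters were available in the accounting records, separating the two submission routes would be straightforward. The sketch below assumes each pilot record is a plain dictionary carrying the `GridMiddleware` and `GridResourceBroker` keys named above; the dictionary layout, the channel labels and the function names are hypothetical and do not reflect the actual DIRAC accounting schema.

```python
from collections import Counter


def submission_channel(pilot):
    """Map a pilot accounting record to the route used to submit it."""
    middleware = pilot.get("GridMiddleware")
    if middleware == "gLite":
        return "WMS"           # submitted through a gLite WMS
    if middleware == "DIRAC":
        return "CREAM-direct"  # submitted directly by a DIRAC Site Director
    return "unknown"           # parameter missing from the record


def count_by_channel(pilots):
    """Number of pilots per (channel, broker) pair and per channel."""
    per_broker = Counter(
        (submission_channel(p), p.get("GridResourceBroker", "unknown"))
        for p in pilots
    )
    per_channel = Counter(submission_channel(p) for p in pilots)
    return per_channel, per_broker
```

Records lacking the parameters end up in the "unknown" bin, which also makes it easy to see how many pilots are affected when the values fail to reach the accounting.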

-- ElisaLanciotti - 2009-09-09

Topic revision: r13 - 2010-12-13 - unknown
 