Logbook of data production activities since April 2012 up to November 2012

Warning: this page has been updated only until November 2012.

Pages for monitoring the progress: here

Apr 2012 restart of data taking

Pit: Thursday 5: high-priority set of data at mu ~1.6 with the latest HLT physics trigger, which has a looser cut on the p_T and no global event cut, for the initial production measurements. Fill with 36 colliding bunches; it was needed by the HLT group to study the new trigger. Off-line processing time was 1.5 times what had been measured offline on old data. However, the data taking went rather smoothly.

Friday 6 morning: some runs were unduly sent to off-line. Run numbers to discard to avoid any confusion: 111307 – 111313

Pit: Friday 6 night: another HLT validation test with 264 bunches; we encountered very serious problems with deadtime, a steady memory increase, and dying tasks. Problems were also encountered off-line processing these events, with reconstruction taking twice the time and stripping ten times longer.

Pit: Sunday: For the moment we are able to run at a mu of 1.2 with 264 bunches with the loose TCK, but the processing times are long, the memory leak is present and the tasks are dying. For this reason we will run with the global event cut and tight p_T settings in the 624-bunch fill and at a lower mu while the investigations continue.

Brunel very slow for normal runs: 1.7 s/evt. It looks like these events are very big: looking into the BKK, the average event size is 73.5 kB/evt, the average SDST size is 76.5 kB/evt, and there are 37000 events per RAW file (79 kB/evt on average).
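As a sanity check, the per-event figures above imply a RAW file of roughly 3 GB (simple arithmetic on the quoted numbers; nothing here beyond the entry itself):

```python
# Cross-check of the per-event sizes quoted above, using the numbers from the entry.
events_per_raw_file = 37000
avg_event_kb = 79.0  # kB/event on average, from the BKK

raw_file_kb = events_per_raw_file * avg_event_kb
raw_file_gb = raw_file_kb / 1024.0 / 1024.0
print(round(raw_file_gb, 2))  # a RAW file is roughly 2.79 GB
```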

Also Stripping is taking much longer than in the validation with 2011 data (that was 6 h with 2 pairs of input files): 8.4 seconds per event!

The noSPD triggers requested by the production group really slow down off-line processing dramatically: Brunel takes >3 times the normal CPU, stripping >10 times.

Run 111473 was taken with normal conditions, but it seems there is nevertheless a problem: the 400 ms/event is the CPU averaged over normal and problematic events. Since this run was taken with GEC cuts there should be fewer problematic events, but that is not the case, which is a problem.

For the odd runs: they will run at CERN only; no need to derive a new production. The Transformation DB has been manipulated, setting the selected site to CERN. Then the jobs were killed, the MaxWallClockTime in the JobWrapper was set to 10 days, a new batch of reco jobs was started, and the MaxWallClockTime was then removed (back to the default value, i.e. 3 days).

Failing jobs due to conditions not available: the ONLINE snapshot is not yet up to date. For example job ID 31345409. The update frequency is not sufficient: in this and previous examples the difference between the current time and the last update is 3-4 hours; one hour less would be sufficient! This is probably only a problem in the current situation of very short runs and empty processing queues (and no deferred triggering). Jobs failing with "Brunel Exited with Status 4" are due to this problem, both at GridKa and at CVMFS sites. We should discuss if we want to improve this and how:

  • jobs should check the last update of file /cvmfs/lhcb.cern.ch/lib/lhcb/SQLite/SQLDDDB/db/ONLINE-2012.db and allow 3-4 hours before starting?
  • TransformationPlugin adapted?
  • Bkk should return runs older than a given date?
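The first suggestion above can be sketched as a modification-time check (the snapshot path and the 3-4 hour lag are from the entry; the helper name and the margin default are assumptions for illustration):

```python
# The ONLINE snapshot path and the 3-4 hour lag are taken from the entry above;
# the helper name and the margin default are assumptions for illustration.
SNAPSHOT = "/cvmfs/lhcb.cern.ch/lib/lhcb/SQLite/SQLDDDB/db/ONLINE-2012.db"

def safe_to_start(run_end_time: float, snapshot_mtime: float,
                  margin_hours: float = 3.5) -> bool:
    """Only start a job once the snapshot was last updated well after the
    end of the run it will process, so the run's conditions are in it."""
    return snapshot_mtime - run_end_time >= margin_hours * 3600.0

# On a worker node one would feed in the real modification time:
#   import os; safe_to_start(run_end_time, os.path.getmtime(SNAPSHOT))
```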

Pit / RemovalAgent and TransferAgent got stuck, preventing files from going into the BK and further T0 export (temporarily fixed by an agent restart). The problem is due to Python when trying to exit while many threads are started: the process can exit correctly only if they are daemonized, since Python assumes that non-daemonized threads have to be waited for before exiting.
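The thread behaviour described above can be reproduced with the standard library: CPython waits at exit for every started non-daemon thread, so long-lived worker threads must be daemonized before being started (a minimal illustration, not the agents' actual code):

```python
import threading
import time

def worker():
    time.sleep(0.1)  # stand-in for a long-lived agent thread

# A started non-daemon thread is implicitly joined at interpreter exit,
# so a process full of them cannot simply exit.
blocking = threading.Thread(target=worker)
print(blocking.daemon)  # False: exit would wait for this thread once started

# Daemonizing the thread before start() lets the process exit
# immediately, abandoning the thread (the behaviour the agents needed).
detached = threading.Thread(target=worker, daemon=True)
detached.start()
print(detached.daemon)  # True
```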

Problems downloading files at GridKa https://ggus.eu/ws/ticket_info.php?ticket=81028 . Set a limit on the matching rate for stripping jobs at GridKa, using the new feature that has entered the current release.

Merging jobs: no AMS.DataQuality flag set (creation problem or Dirac?) - fixed by setting the DQ flag to UNCHECKED directly in the Dirac database (shall be UNCHECKED + OK in the long run)

Reco jobs finished within minutes (because there are no physics events in them)

Mon 9 Apr killed all jobs for run 111183 to 111195 and MaxWallClockTime set to 864000 seconds (=10 days!)

Tue 10 Apr still ongoing issue with TransferAgent getting stuck on online system

Pit: Take data with the bare minimum trigger configuration (0x00860040) at the highest possible mu (<1.6) in the fills with 604 colliding bunches until tomorrow. Considering moving to 840 bunches.

Wed 11 Apr stopped the reco production 17394 PreReco13: it was looping on the 'bad' runs. Bad runs have been flagged as such. New processing pass PreReco13a: created a prompt reco and stripping

Many failures due to the TCK not being propagated to CernVM-FS. Fixed. More Brunel failures due to conditions not available (GridKa) and Brunel exiting with status=1. Under investigation...

Thu 12 Apr Take data with 840 bunches, with deferred triggering.

Off-line: many failures for prompt reco: Site (Done:Failed:Running)

  • CERN (82:178:1366) : Still getting a few TCK (Brunel 3) errors and the unknown Brunel 1 errors
  • CNAF (0:5:316) : Fails with TCK errors and Brunel 2 errors (they look like the unknown Brunel 1 errors)
  • GridKa (19:535:669) : The vast majority of failures are now in SetupProject. I'm assuming CVMFS is still getting 'propagated' and cached, but I would have thought this would have happened by now....
  • IN2P3 (8:1:329) : One Brunel 1 crash so far
  • PIC (17:1:283) : One TCK error
  • SARA (0:3:144) : 2 SetupProject errors and one Brunel 1
  • RAL (0:1:213) : One Brunel 2 error
  • NIKHEF (1:6:224) : Errors are bad_allocs:

Sat 14 Apr New Brunel version in prod: https://lblogbook.cern.ch/Operations/9447 new prompt reco prod: 17448

18 Apr

Stripping18a CPU time on PreReco13b prod. 17513, averaged over all runs (62 jobs): 13.945 ± 0.458, RMS: 3.603, max: 8143.856 (00017513_00000045) elog. A factor of 3 reduction w.r.t. Stripping18, as expected. Stripping19 should reach the "expectation" on the Reco13 data, which hopefully will be close to Reco12.
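The figures above (mean ± error on the mean, RMS, max over the per-job CPU times) can be reproduced with a short helper; the values below are illustrative, not the production 17513 data:

```python
import math

def cpu_stats(times):
    """Mean, error on the mean, RMS spread and maximum of per-job CPU times."""
    n = len(times)
    mean = sum(times) / n
    rms = math.sqrt(sum((t - mean) ** 2 for t in times) / n)
    err = rms / math.sqrt(n)  # standard error on the mean
    return mean, err, rms, max(times)

# Illustrative values only, not the production 17513 data:
mean, err, rms, worst = cpu_stats([12.1, 13.8, 14.4, 15.5])
print(f"{mean:.3f} +- {err:.3f}, RMS: {rms:.3f}, Max: {worst:.3f}")
```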

19 Apr

Distributions of the processing time for PreReco13a and PreReco13b to compare performance: a slight improvement of about 3 hours...

Sun 22 Apr

More studies of processing time for PreReco13b, plots per site produced also by Philippe elog

gap during which I was GEOC

Mon 23 Apr

set end run to PreReco13b

Decided to ask for queues of at least 72 hours wall-clock time from In2p3, Sara and Nikhef (the only sites providing shorter queues)

Tue 24 Apr

Some concerns about stripping jobs using too much memory; several were killed at Nikhef elog. This triggered some discussion about memory usage and when to kill jobs, if necessary

Launched reprocessing, request 7789, production 17751, elog. The reprocessing uses the binaries compiled with gcc4.6. Some doubts about the processing time: jobs sometimes seemed slower than with gcc4.3, but the comparison was not conclusive, so we keep going with gcc46.

Femto validation still ongoing

Wed 25 Apr

Launched the stripping18 - ICHEP stream elog, request 7790, production 17801

Keep on extending the reprocessing elog. 6k running jobs...

One Tier2 (Manchester) attached for Swimming elog. Some settings were necessary in order not to overload the storage there

Thu 26 Apr

Reduced the share for reprocessing for Cnaf, Nikhef, Sara, Ral elog

Fri 27 Apr

Prompt reco after April TS 17862 elog

Mon 30 Apr

Some study about the stalled jobs (cpu time per evt, evt processed before getting stalled, etc...) for PIC jobs

New Stripping19 steps. According to Jibo's tests they should be a factor of 2 faster than Stripping18a. It was discussed that we will put Stripping19 into production instead of Stripping18a, to process the data taken so far and the newly incoming data until the June TS:

  • 17350 for Stripping
  • 17351 for Merging

They should be ready after the Grid deployments of AppConfig v3r132 and DV v30r3 are finished

Stripping19 production launched, on Reco13: elog request 7817, prod. 17918

Calo FEMTO stripping launched on Reco13 elog request 7818, prod. 17932

Stop all!! New idea: gain back the ~200 ms/event that we are spending doing the LZMA:6 compression in the stripping. Since the output files of the stripping are temporary anyway (they get merged), it makes more sense to run with the fastest algorithm (GZIP:1). All prods cleaned and waiting for a new AppConfig...

Marco: created new steps for Stripping19 (step 17360) and CaloFemtoDST (step 17361) whose option file $APPCONFIGOPTS/Persistency/Compression-ZLIB-1.py switches to the less aggressive ROOT compression (used last year) and gains in execution time at the expense of output file size: suited for steps which produce intermediate temporary files (e.g. merging). Set the corresponding steps 17350 and 17280 to obsolete.
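The trade-off motivating the switch can be illustrated with the standard-library codecs: zlib at level 1 compresses much faster than LZMA at preset 6, at the cost of larger output. This is a plain-Python illustration only; the production change was done through the ROOT persistency options in AppConfig:

```python
import lzma
import zlib

payload = b"event-data " * 50_000  # stand-in for a stripping output record

fast = zlib.compress(payload, 1)          # ZLIB/GZIP level 1: fast, larger output
small = lzma.compress(payload, preset=6)  # LZMA level 6: slow, smaller output

# Both round-trip losslessly; for temporary files that only live until
# the merging step, the speed of the level-1 codec is the better trade-off.
assert zlib.decompress(fast) == payload
assert lzma.decompress(small) == payload
print(len(fast), len(small))
```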

CaloFemto on Reco13 data launched elog; uses the new options for Zlib compression

Stripping19 with Zlib algo launched elog prod 17934

May 2012

Wed 2 May

Prompt stripping19 MagUp launched elog, start run: 114205, end run: 114287

  • Prompt Stripping19 MagUp: prod 17957, request: 7830
  • Prompt Femto Stripping19 magUp: prod 17971, request 7831

Prompt Stripping 19 + Femto for MagDown elog, start run 114297

  • prompt Stripping19 MagDown: prod 17973, request 7832
  • prompt Femto Stripping19 MagDown: prod 17987, request 7833

First 2012 MagUp Prompt Reco launched elog StartRun 114205, prod 17956

New step for Reco13a created. On Monday we'll set up the request for prompt reco. elog

Provided also new step for DaVinci DQ: DataQuality-FULL step, 17404 elog

Reco13a/Stripping19a

Mon 7 May

New prompt reconstruction for Reco13a launched, starting from run 114686. ProdID 18028, request 7854 elog. The end run of the Reco13 prompt reco production (id 17862) has been set to 114685 (last run taken this morning).

New stripping step ready for Stripping19a elog. The request will be created as soon as we have the first Reco13a data available (possibly tomorrow)

Problem: files without the replica flag in the Bkk! elog It has to be set by hand afterwards.

Tue 8 May

Stripping19a started elog - prod 18032, request 7856

Problems with Reco13a: Brunel exits with status 6 elog

New merging step for FemtoDST merging: elog with the print frequency 1/10000. And new Femto production launched elog prod 18057 , request 7857

Wed 9 May

New Brunel v42r3p2 with a bug fix to be used for Reco13a elog. Derived a new Reco13a production elog request 7858, prod 18062. Some problems with the derivation process... but finally the production is running.

A new Stripping step is also available: elog. Derived also a new Stripping19a prod... many problems because the derivation could not work for stripping elog. The previous production 18032 was stopped. New Stripping19a: 18110

Thu 10 May

Discussed improvements for deriving productions, documented here. New script to transfer the unused files from parent to child production elog and to reset MaxReset files to Unused.

Mon 14 May

Reco13a timing is Ok! Less than 40 h elog

Thu 17 May

Filling mode disabled elog due to wrong CPU requirement and queue settings. To be fixed

Sat 19 May

Stripping 19a MagUp started elog prod. 18400 - 18413, and Femto stripping elog

Thu 24 May

A huge backlog is accumulating at CERN for merging jobs (2000!). Reduced the matching delay from 2 min to 60 seconds.

Thu 31 May

Magnet polarity flip: from up to down! Decided to keep the same production open from the previous magnet down period.

Decided to clean all PreReco13a and PreReco13b; still to be done. Waiting for the fix for TransformationCleaningAgent...

Fri 1 June

Stripping19a has a high failure rate at In2p3 due to the memory limit; some files copied to CERN-FREEZER for debugging and improvement elog

Sun 3 June

Prompt Reco13a: jobs failed with "Brunel exit with status 1" elog. Under investigation...

Tue 5 June

Merging productions 18111-18123 flushed elog

Thu 7 June

Memory usage problems also for prod 18110 Stripping19a MagDown elog: failures for stalled jobs killed by the batch system at IN2P3. Savannah task ongoing.

Another problem with staging input files, solved elog

June 10-17

Nothing remarkable. The usual peaks of failures for Input Data Resolution.

June technical stop

June 19

End of beam. Started MD

20 June

Many job failures in the long queue at In2p3 elog. The queue should be disabled (only the very long queue can cope with our requirements)

Thu 21 June

Announcement of the new workflow using the FULL.DST files here; more details

25 June

Started to close the Reco13/Stripping19 productions elog

26 June

Finishing prompt Reco13a/Stripping19a elog. New processing pass will be used for the prompt reco after technical stop. End runs set to:

  • MagUp Prods run No 118880
  • MagDown Prods run No 118286

29 June

New Brunel step for prompt reco Reco13b elog

Started Reco13b validation to test:

  • new Brunel step
  • new CPU requirement 1.4 HepSpec06 seconds elog
  • new FULL.DST format

The first attempt failed because the output was not properly specified elog. Production stopped.

July 2012

Mon 2 July

Set the internal queue length in Dirac for reconstruction to 1.4M hepSpec06 elog

Second round of validation of Reco13b elog

Reco13b

Tue 3 July

Started the Reco13b prompt reco Start Run 119956 elog ProdId 18889

Wed 4 July

Stripping19b steps ready elog

Started the replication of the Reco13b FULL.DST files to the tape backend elog

Stripping19b

5 July

Problems with BookeepingWatch agent elog

Prompt Stripping19b launched elog

Fri 6 July

Validation of the prompt FEMTO stripping: production 18953

Calo Femto Stripping on Reco13b launched elog prod. 18957,18958

Replication of Stripping19b launched elog

Reco13c

Tue 10 July

Problem: found a bug in the CondDB for HCAL -> stopped all productions! elog

Provided a new CondDB tag, and new steps for Reco13c and a new Stripping19b. Cleaned all the productions of requests 8219 (Reco13b), 8237 (Femto stripping) and 8234 (Stripping19b)

Launched Reco13c prompt reco Mag up elog prod 18979

Launched the new Stripping (same name as the previous one, but on Reco13c!): elog and FemtoStripping elog

Wed 11 July

Another problem! A bug in Dirac in the atomic run plugin: events from multiple runs are processed into the same file. Stopped all stripping, patched Dirac. For Reco13c: removed the only affected file, re-set its RAW ancestors as Unused, and re-started the stripping. Now OK. elog This problem also affects Reco13a. The final decision is to exclude the affected runs from analysis and not to process the affected data again, see elog

Wed 25 July

Magnet polarity flip from UP to DOWN, Prompt Reco13c MagDown started with run 123804, prod 19146 elog

Thu 26 July

Reco13d validation elog, new Brunel version. Launched production 19195 (req. 8340) (but it was cleaned some days later!)

Prompt Stripping19b on Reco13c MagDown launched elog, prods 19167-19180 , and the prompt CaloFemtoStripping19b elog

Fri 27 July

Some crashes of Brunel being investigated elog

Aug 2012

Wed 1 Aug

New steps for Reco13d and validations elog Brunel v43r1p1

  • for prompt processing step 17910
  • for 2011 validation step 17909
They use the latest Brunel, being released right now, which fixes the issues found in validation at the weekend, and the latest CondDB tag, which includes the new HPD mappings from Monday.

Thu 2 Aug

New round of Reco13d validation for prompt reco elog prod. 19400 (req. 8408) Brunel v43r1p1

And Reco13d validation for 2011 data elog prod. 19401 (req. 8409)

Stripping19c Validation steps provided by Anton:

  • steps 17927 and 17928 for prompt processing of 2012 data
  • steps 17929, 17930 for 2011 data
for 2011 data a new AppConfig is also needed: v3r143.

For Stripping19c on 2011 data NEW STEPS are needed even though it's the same application, because the data type 2012 was hard-coded. Suggestion for the future: this setting (the data type) should be removed and placed in a separate 'DataType' option file, as is done for Brunel (see Brunel/DataType-2012.py for instance), and the data type then specified as an additional option file in the list supplied in the step. That way it would be much easier to reuse the same stripping options across different years.
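The suggestion could look like the following hypothetical option file, mirroring the Brunel convention; the file name and the use of the `DaVinci()` configurable here are an assumption for illustration, not existing code:

```python
# Stripping/DataType-2011.py  (hypothetical, mirroring Brunel/DataType-2012.py)
# Factor the data type out of the stripping options so the same
# stripping steps can be reused across years.
from Configurables import DaVinci

DaVinci().DataType = "2011"  # the only year-specific setting
```

The stripping step would then list this file alongside the year-independent stripping options, instead of hard-coding the year inside them.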

Fri 3 Aug

Launched both validation production for Stripping19c on Reco13d:

  • for 2012 prompt reco: prod 19406 elog
  • for Collision11: prod. 19146 elog

Wed 8 Aug

The validation of Stripping19c for 2012 data was bad: we should have kept the un-merged files. Also a little bug in DaVinci. So, cleaned all stripping/merging productions 19407-19419 and created a new request and production elog prod. 19626-19639 (req. 8491)

Thu 9 Aug

For the 2011 Reco13d we got the green light from the pre-validation. So, just started the real validation on 30 pb-1 from two 2011 fills elog prod. 19652

Reco13d/Stripping19c

Mon 13 Aug

Validation of Stripping19c on Collision11 data: started the stripping elog, prod. 19767. Only 2 merging productions (two stripping lines only)

Prompt reconstruction: first data processed for Reco13d. Started the Stripping19c elog prod. 19770

Tue 14 Aug

Little problem with production 19730 (prompt reco): the CPU time requirement was too low (1 MHS06)!! Set end run 125584 and generated a new production from the same request (8544) with the correct CPU time requirement elog. The only consequence is some job failures at IN2P3: as we are now in filling mode and the queue there is 46 h, it happened that the same pilot took 2 reco jobs, and of course the second was killed for exceeding the CPU limit elog. But actually it seems the problem is due to Dirac retrieving the time left from GE, because with a queue of 46 h two reco jobs (even with a requirement of only 1 MHS06) would never fit! Under investigation. It was a problem with the Time Left utility; patched.
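The claim that two jobs could never fit the 46 h queue can be checked with the quoted numbers; 10 HS06 per core is an assumed typical figure for the era, not from the log:

```python
# HS06-seconds to wall-clock arithmetic with the entry's numbers.
# 10 HS06 per core is an assumed typical figure, not from the log.
HS06_PER_CORE = 10.0
REQUIREMENT_HS06_S = 1.0e6  # the (too low) 1 MHS06 job requirement

wall_hours = REQUIREMENT_HS06_S / HS06_PER_CORE / 3600.0
print(round(wall_hours, 1))      # ~27.8 h for one reco job
print(round(2 * wall_hours, 1))  # ~55.6 h: two jobs exceed the 46 h queue
```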

Fri 17 Aug

Started the validation of Reco13e elog

Started the Femto Stripping on Reco13d elog. Unfortunately, it was impossible to know which steps we should have used. We just guessed that the same steps as the previous femto stripping, with an updated Conditions DB, would be fine. We will see when someone looks at the output.

Reco13e/Stripping19c

Fri 24 Aug

Green light for Reco13e to be used for prompt reco

Started Reco13e on MagUp elog prod. 19859

And Stripping19c on Reco13e elog prod 19901

Started the Femto Stripping on Reco13e, prod. 19875-19876 elog

Sat 25 Aug

Started the replication of Reco13e/Stripping19c and the productions for copying the FULL.DST to the RDST and to remove them from buffer once they have been processed by the stripping elog

Mon 26 Aug

Current issues: both stripping and reco are affected by the conditions DB not being up to date elog; still under investigation

Fri 31 Aug

End of Reco13 Mag Up: end run for reco and stripping set to 126680; the last run was taken on 28/08.

Started the new reconstruction Reco13e Mag Down, with a new Brunel release and new Conditions, which were released after a change in the Ecal high voltage elog

Started also the Stripping19c on Reco13e - Mag Down elog. We added SQLDDDB v7 for the stripping too (the first time it's added for stripping); this should alleviate the problem, observed so often during the last month, of the database not being up to date. Though it's not the solution, as it means the software is not being read from the software area when CernVM-FS should serve it correctly.

Start Femto Stripping on Reco13e Mag Down elog

Second round of validation for Stripping19c on Reco13d on 2011 data elog, with the same reconstructed data as the first round, so only the stripping is running. It's still 19c, with some modifications applied. Only 2 streams. Big input: about 5900 input files.

Sept 2012

Mon 3 Sept

Some problems:

  • Huge number of jobs in the ProductionDB: 1.7M! It should not be higher than 1M. Problem with the TransformationCleaningAgent elog TO BE FIXED
  • The conditions DB is too big! This caused a problem at some sites where the CernVM-FS cache was not big enough elog

New Brunel step for Reco14 elog for Reprocessing.

Tue 4 Sept

Started the validation for reprocessing with Reco14 and new SQLDDDB version to use the selected Conditions DB elog

Testing reduced prompt processing at CERN elog

Investigations ongoing on the problem with tcmalloc at In2p3 elog

Mon 10 Sept

Validation of Stripping20 elog

Thu 13 Sept

Using EOS in production: elog, CERN-DST now points to EOS

Fri 14 Sept

pA reconstruction: new step for the prompt reconstruction Reco13f elog

Production launched on 17th Sept elog

2012 reprocessing

Mon 17 Sept

Page to monitor the progress in Bookkeeping and graphically

Created clouds for the reprocessing elog

During the Sept. TS the reprocessing will also run at CERN, so the CPU shares for reprocessing have been set equally at all sites (this will be changed later) elog

Reconstruction for the 2012 reprocessing launched for MagDown data up to the June TS elog

Tue 18 Sept

Stripping for reprocessing started, processing pass is Stripping20 - MagDown elog

Started the reconstruction production for the MagUp polarity until June TS elog

Reco13f

Fri 21 Sept

Some reshuffling of the shares before re-starting the prompt reco (it will start on Sunday 23) elog. Prompt reco will run at CERN, whereas reprocessing will continue at the T1s (and some selected T2s)

Prompt Reco13f - MagUp ready and waiting for data elog, priority is 3, to have some advantage wrt reprocessing. Only 40% will be processed.

Mon 24 Sept

Serious problems with the CNAF storage. Rearranged the sites associated to the Italian cloud, assigning them to the French one elog. Downtime from 22 to 26 (at least).

Problem in the Transformation system: now that there is only 1 merging production, this production has a huge number of Unused files, and the query to get them fails. Temporary fix for the time being elog.

Tue 25 Sept

Launched the stripping for reprocessing MagUp elog

Sat 29 Sept

New steps for reprocessing: elog with new cond-20120929 valid until 22 Aug. Processing pass stays unchanged, Reco14/Stripping20

Reprocessing: Reco+Stripping/Merging - MagDown, June TS up to 22 Aug launched elog

Sun 30 Sept

CNAF is recovering after the long down time. Additional CPU has been provided for LHCb

Data taking re-started. Started the stripping and femto-stripping on Reco13f. Started also the removal production to remove the FULL.DST as soon as they are processed elog

October 2012

Thu 4 Oct

Problem! The productions 20280, 20281, 20282 for the reprocessing from the June TS to Aug 22 had to be cleaned and new productions launched elog: a problem with the stager, the last batch of files were all assigned to CERN.

New productions: reconstruction: 20330 elog.

Fri 5 Oct

Still problems for input data resolution at GRIDKA ggus

Stripping20 - Reprocessing - MagDown - June TS to 22 Aug - launched elog

Mon 8 Oct

Replication of Reco13f/Stripping20 launched elog

Tue 9 Oct

Reconstruction for the reprocessing launched: MagUp, from June TS to Aug 22 elog, and relative stripping elog

Mon 15 Oct

LHCb magnet polarity flipped. New prompt reconstruction started for Reco13f MagDown elog. After the magnet flip to Down we had one fill with only two runs sent off-line (fill 3169, runs 130316, 130317). Note that these runs have a mu in the range 4 - 6, since the machine used this fill to go head-on and validate our collision orbit, so we can expect some time-limit problems reconstructing them. They should be marked NOT OK for the reprocessing.

Fraction to process for the prompt reco has been decreased from 0.4 to 0.35 elog

Prompt Stripping20 MagDown on Reco13f launched elog, and femto stripping http://lblogbook.cern.ch/Operations/12153

Tue 16 Oct

New steps for reprocessing up to Sept TS using LHCbCond tag cond-20121016 elog. The steps for stripping should be used also for the next prompt stripping.

Wed 17 Oct

Problem with lost files at CNAF due to a bug in replication plugin elog the replication of Reco14/Stripping20 (reprocessing) is affected.

Thu 18 Oct

Some removal of the second archive copy for Reco13/Stripping19 elog, Reco13a/Stripping19a elog, Reco13b/Stripping19c elog, Reco13d/Stripping19c elog: now the default is to have 1 archive replica. Situation at the end of the removal: elog

Data reprocessing 2012, reconstruction, MagDown, 22 Aug - Sept TS, launched elog, and stripping elog

Fri 26 Oct

New step for Reco13f. And a new production launched after the magnet polarity flip to Up elog

For the reprocessing, the last run range, after the Sept TS, has been launched elog

Thu 1 Nov

Reprocessing Reco14 Validation on 2011 data launched elog

-- ElisaLanciotti - 03-Aug-2012

Topic revision: r41 - 2020-06-18 - MaximKonyushikhin
 