Notes on SW On-call



This is a place to collect some notes on the fly...

Troubleshooting and notes for various recent errors:

11 June 2012:

  • Some ROS::ModulesException Warnings at the beginning of the run...
  • Called Hector to confirm: there was a data flow problem at the beginning of the run (see the L2PU warning messages), and it seemed to affect others, not just LAr
  • FEB errors and missing-header errors show up in the plots but are not increasing
  • will affect the 1st lumi block
  • can restart LAr monitoring to clear the plots
Host: pc-lar-ros-embc-05.cern.ch Application Name: ROS-LAR-EMBC-05 Issued: 11 Jun 2012 13:58:58 CEST Severity: Warning Message ID: ROS::ModulesException Message: Error in the status word in the ROB header: RobinDataChannel: Lost fragment detected. The L1ID is 0x30000c3. The ROB Source ID is 0x420038 Context: PACKAGE_NAME: ROSModules. FILE_NAME: ../src/RobinDataChannel.cpp. FUNCTION_NAME: virtual ROS::EventFragment* ROS::RobinDataChannel::getFragment(int). LINE_NUMBER: 244. DATE_TIME: 1339415938. Parameters: Qualifiers: ROSModules LAR
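
If these warnings need to be counted offline, here is a minimal sketch in Python for pulling the L1ID and ROB Source ID out of such a message; the pattern is inferred from the single warning above, so treat the exact format as an assumption:

    import re

    # Pattern inferred from the one ROS::ModulesException warning above;
    # real messages may differ in wording (assumption).
    PATTERN = re.compile(
        r"Lost fragment detected\. The L1ID is (0x[0-9a-fA-F]+)\. "
        r"The ROB Source ID is (0x[0-9a-fA-F]+)"
    )

    def parse_lost_fragment(message):
        """Return (l1id, rob_source_id) as ints, or None if no match."""
        m = PATTERN.search(message)
        if m is None:
            return None
        return int(m.group(1), 16), int(m.group(2), 16)

    msg = ("RobinDataChannel: Lost fragment detected. "
           "The L1ID is 0x30000c3. The ROB Source ID is 0x420038")
    print(parse_lost_fragment(msg))  # (0x30000c3, 0x420038) as ints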

1 June 2012:

  • the new "Tile" dqmd panel - data loss errors. Not sure what exactly this is at the moment, tile just added this thing without consulting us, even though it monitors LAr too. Ignore the panel for now? - check with Andrew

31 May/1 June 2012:

  • 3am: dqmd red, Tile and LAr problems; the problem was with the monitoring. Should have just restarted the monitoring partition, but the SL seems to have restarted the entire run instead. Will follow up with the SL shifters
  • 4pm yesterday (31 May): the crazy slew of errors... it looked like all 3 TTC crates had simultaneous problems - weird. Turned out Fatih had started to run a test before the end of the run
(time stamps when larg-all was booted up: sbc-lar-tcc-ltpi-01 /logs/tdaq-04-00-01/LArgAll/LTPIC_sbc-lar-tcc-ltpi-01.cern.ch_1338472292.out) Andrew called him at 3:30 to tell him the beam was supposed to be dumped around 4pm.

There should be protections in place so this can't happen - the process manager should not let you start the process... NEED TO FOLLOW UP ON THIS

27 May 2012:

  • dqm EMBC yellow due to DSP timing - not in the run... Andrew is checking...
  • Issues from last night:
  • 02:42 -- "Larg Conditions Server died. Checked TDAQ to ensure that they had restarted." - This always happens when there is a database update (Pavol had just done one: https://pc-atlas-www.cern.ch/elog/ATLAS/ATLAS/206306). The important thing is that it restarts.
  • 06:20 -- "Had a host of LARG buffer full errors in FSM. Called DCS Expert, who investigated. They stopped after about 20 minutes. See https://pc-atlas-www.cern.ch/elog/ATLAS/ATLAS/206311" - There were a lot of errors from everyone (not just LAr); the SL called Stefan (DCS expert), who seems to have fixed the problem.

25 May 2012: LAr Weekly Ops Meeting - notes

  • possible way of resetting the ROD crates without power cycling... ?
  • dataflow issues from daq, missing-header errors - this also implies dqmd red (number of FEBs incorrect)
  • missing headers from one PU created lost fragments, in 2 bursts 20 min apart - 23:20 was the first. Bertrand, Stefan, and the daq people are investigating; it's probably the ROD itself or the SLink. Still not clear whether it has resolved by itself or not - if it comes back, dump the rodttc stats and ROS info to check offline. Also consider the possibility of exchanging the ROD board. If it starts to be a real problem, we can disable the corresponding PU (Emmanuel)? Hector: the ROS will put an XOFF - or (Pavol) the RC can do the removal by hand.
  • Andrew wants to implement the new online DB this afternoon (with the bit-flip thing)... we may get lots of calls this weekend

18 May 2012: VME Bus errors

Hector: during a high-rate test, VME bus errors appeared in MRS out of nowhere (weird) and one PU (processing unit) became busy. The usual fix of asking for a TTC restart didn't solve the problem, even after a couple of tries, which indicates a hw problem. Bertrand was unavailable, so Guy Perrot was called and he power cycled the ROD crate (not to be done lightly!) - see the elog. (Note: we occasionally get VME errors at boot; doing a TTC restart on, e.g., the LArg-EMECA segment fixes it.)

  • Random comment: the VME bus has nothing to do with what the rate is
  • for the future, in case this happens in stable beams:
we can't power cycle the crate while running because of the DIM server - this has not been tested yet

18 May 2012: Missing trigger type in red dqmd

restarting the TTC partition should fix the problem


Things to be able to do while on call:

hot tower - be able to disable it

  • if it is provoking high empty trigger rates

LAr busy:

  • if it's a hw busy on our side, the PU will be automatically disabled and the busy will go away.
  • lots of PUs busy - either a power supply problem (hw) or a daq issue (rough triage sketch below)
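
A rough sketch of that triage logic in Python; the function name, messages, and PU identifiers are illustrative, not from any real tool:

    def triage_lar_busy(busy_pus, total_pus):
        """First-pass LAr busy triage following the notes above (illustrative)."""
        if not busy_pus:
            return "no busy PUs - nothing to do"
        if len(busy_pus) == 1:
            # hw busy on our side: the PU should be auto-disabled and
            # the busy should then go away on its own
            return "single busy PU (%s): expect auto-disable" % busy_pus[0]
        # many PUs busy at once points away from a single-PU fault
        return ("%d/%d PUs busy: suspect a power supply problem (hw) "
                "or a daq issue" % (len(busy_pus), total_pus))

    print(triage_lar_busy(["PU2"], 4))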


Notes on restarting the run

(ATLAS as a whole) - on paper it's 5 min, but in reality that's only the ideal case... more like 15 min due to problems with other sub-detectors.

Module replacement times:

  • Barrel: 3 hours
  • EC: 1.5 hours

1 channel - wait; for a group of channels, the best option: need to talk to L1Calo and the ATLAS RC, and also upload the latest constants while restarting the run.

HW/SW system notes

  • FEBs -> RODs -> ROS... our domain is up to (not including) the ROS - there daq takes over
  • One RODC (ROD crate) has 4 PUs, each PU has 2 DSPs, each DSP reads out 1 FEB (see the sketch after this list)
  • A TTC restart restarts all the RODCs (and ROSs) and resyncs them all with the TTC.
  • The TTC is what synchronises the RODs to the trigger computer... NOTE: if restarting, never restart the RODC applications alone, as they will restart but not resync with the TTC overall (this should not be an issue while in the BOOT phase, but still don't do it)
  • The TTC should be started in either INIT or CONFIG... (not BOOT)
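
A small sketch of that fan-out in Python; the counts come from these notes, and the crate name is hypothetical:

    # Counts from the notes above: 1 RODC -> 4 PUs -> 2 DSPs each -> 1 FEB each.
    PUS_PER_RODC = 4
    DSPS_PER_PU = 2
    FEBS_PER_DSP = 1

    def febs_in_crate(name):
        """List the FEBs read out by one ROD crate, per the fan-out above."""
        febs = []
        for pu in range(PUS_PER_RODC):
            for dsp in range(DSPS_PER_PU):
                for feb in range(FEBS_PER_DSP):
                    febs.append("%s/PU%d/DSP%d" % (name, pu, dsp))
        return febs

    print(len(febs_in_crate("RODC-EMBC-01")))  # 4 x 2 x 1 = 8 FEBs per crate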

-- ClaireLee - 18-May-2012
