Notes on SW On-call



This is a place to collect some notes on the fly...

Troubleshooting and notes for various recent errors:

11 June 2012:

  • Some ROS::ModulesException Warnings at the beginning of the run...
  • Called Hector to confirm: there was a data flow problem at the beginning of the run (see the L2PU warning messages), and it seemed to affect others, not just LAr
  • FEB errors and missing-header errors show up in the plots but are not increasing
  • will affect the 1st lumi block
  • can restart LAr monitoring to clear the plots
Host: pc-lar-ros-embc-05.cern.ch Application Name: ROS-LAR-EMBC-05 Issued: 11 Jun 2012 13:58:58 CEST Severity: Warning Message ID: ROS::ModulesException Message: Error in the status word in the ROB header: RobinDataChannel: Lost fragment detected. The L1ID is 0x30000c3. The ROB Source ID is 0x420038 Context: PACKAGE_NAME: ROSModules. FILE_NAME: ../src/RobinDataChannel.cpp. FUNCTION_NAME: virtual ROS::EventFragment* ROS::RobinDataChannel::getFragment(int). LINE_NUMBER: 244. DATE_TIME: 1339415938. Parameters: Qualifiers: ROSModules LAR
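
If these warnings need to be counted offline, here is a minimal sketch in Python for pulling the L1ID and ROB Source ID out of such a message; the pattern is inferred from the single warning above, so treat the exact format as an assumption:

    import re

    # Pattern inferred from the one ROS::ModulesException warning above;
    # real messages may differ in wording (assumption).
    PATTERN = re.compile(
        r"Lost fragment detected\. The L1ID is (0x[0-9a-fA-F]+)\. "
        r"The ROB Source ID is (0x[0-9a-fA-F]+)"
    )

    def parse_lost_fragment(message):
        """Return (l1id, rob_source_id) as ints, or None if no match."""
        m = PATTERN.search(message)
        if m is None:
            return None
        return int(m.group(1), 16), int(m.group(2), 16)

    msg = ("RobinDataChannel: Lost fragment detected. "
           "The L1ID is 0x30000c3. The ROB Source ID is 0x420038")
    print(parse_lost_fragment(msg))  # (0x30000c3, 0x420038) as ints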

1 June 2012:

  • the new "Tile" dqmd panel - data loss errors. Not sure what exactly this is at the moment, tile just added this thing without consulting us, even though it monitors LAr too. Ignore the panel for now? - check with Andrew

31 May/1 June 2012:

  • 3am: dqmd red, Tile and LAr problems; the problem was with the monitoring. Should have just restarted the monitoring partition, but the SL seems to have restarted the entire run instead. Will follow up with the SL shifters
  • 4pm yesterday (31 May): the crazy slew of errors... it looked like all 3 TTC crates had simultaneous problems - weird. Turned out Fatih had started to run a test before the end of the run
(time stamps when larg-all was booted up: sbc-lar-tcc-ltpi-01 /logs/tdaq-04-00-01/LArgAll/LTPIC_sbc-lar-tcc-ltpi-01.cern.ch_1338472292.out) Andrew called him at 3:30 to tell him the beam was supposed to be dumped around 4pm.

There should be protections in place so this can't happen - the process manager should not let you start the process... NEED TO FOLLOW UP ON THIS

27 May 2012:

  • dqm EMBC yellow due to DSP timing - not in the run... Andrew is checking...
  • Issues from last night:
  • 02:42 -- "Larg Conditions Server died. Checked TDAQ to ensure that they had restarted." - This always happens when there is a database update (Pavol had just done one: https://pc-atlas-www.cern.ch/elog/ATLAS/ATLAS/206306). The important thing is that it restarts.
  • 06:20 -- "Had a host of LARG buffer full errors in FSM. Called DCS Expert, who investigated. They stopped after about 20 minutes. See https://pc-atlas-www.cern.ch/elog/ATLAS/ATLAS/206311" - There were a lot of errors from everyone (not just LAr); the SL called Stefan (DCS expert), who seems to have fixed the problem.

25 May 2012: LAr Weekly Ops Meeting - notes

  • possible way of resetting the ROD crates without power cycling... ?
  • dataflow issues from daq, missing-header errors - this also implies dqmd red (number of FEBs incorrect)
  • missing headers from one PU created lost fragments, in 2 bursts 20 min apart - 23:20 was the first. Bertrand, Stefan, and the daq people are investigating; it's probably the ROD itself or the SLink. Still not clear whether it has resolved by itself or not - if it comes back, dump the rodttc stats and ROS info to check offline. Also consider the possibility of exchanging the ROD board. If it starts to be a real problem, we can disable the corresponding PU (Emmanuel)? Hector: the ROS will put an XOFF - or (Pavol) the RC can do the removal by hand.
  • Andrew wants to implement the new online DB this afternoon (with the bit-flip thing)... we may get lots of calls this weekend

18 May 2012: VME Bus errors

Hector: during a high-rate test, VME bus errors appeared in MRS out of nowhere (weird) and one PU (processing unit) became busy. The usual fix of asking for a TTC restart didn't solve the problem, even after a couple of tries, which indicates a hw problem. Bertrand was unavailable, so Guy Perrot was called and he power cycled the ROD crate (not to be done lightly!) - see the elog. (Note: we occasionally get VME errors at boot; doing a TTC restart on, e.g., the LArg-EMECA segment fixes it.)

  • Random comment: the VME bus has nothing to do with what the rate is
  • for the future, in case this happens in stable beams:
we can't power cycle the crate while running because of the DIM server - this has not been tested yet

18 May 2012: Missing trigger type in red dqmd

restarting the TTC partition should fix the problem


Things to be able to do while on call:

hot tower - be able to disable it

  • if it is provoking high empty trigger rates

LAr busy:

  • if it's a hw busy on our side, the PU will be automatically disabled and the busy will go away.
  • lots of PUs busy - either a power supply problem (hw) or a daq issue (rough triage sketch below)
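
A rough sketch of that triage logic in Python; the function name, messages, and PU identifiers are illustrative, not from any real tool:

    def triage_lar_busy(busy_pus, total_pus):
        """First-pass LAr busy triage following the notes above (illustrative)."""
        if not busy_pus:
            return "no busy PUs - nothing to do"
        if len(busy_pus) == 1:
            # hw busy on our side: the PU should be auto-disabled and
            # the busy should then go away on its own
            return "single busy PU (%s): expect auto-disable" % busy_pus[0]
        # many PUs busy at once points away from a single-PU fault
        return ("%d/%d PUs busy: suspect a power supply problem (hw) "
                "or a daq issue" % (len(busy_pus), total_pus))

    print(triage_lar_busy(["PU2"], 4))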


Notes on restarting the run

(ATLAS as a whole) - on paper it's 5 min, but in reality that's only the ideal case... more like 15 min due to problems with other sub-detectors.

Module replacement times:

  • Barrel: 3 hours
  • EC: 1.5 hours

1 channel - wait; for a group of channels, the best option: need to talk to L1Calo and the ATLAS RC, and also upload the latest constants while restarting the run.

HW/SW system notes

  • FEBs -> RODs -> ROS... our domain is up to (not including) the ROS - there daq takes over
  • One RODC (ROD crate) has 4 PUs, each PU has 2 DSPs, each DSP reads out 1 FEB (see the sketch after this list)
  • A TTC restart restarts all the RODCs (and ROSs) and resyncs them all with the TTC.
  • The TTC is what synchronises the RODs to the trigger computer... NOTE: if restarting, never restart the RODC applications alone, as they will restart but not resync with the TTC overall (this should not be an issue while in the BOOT phase, but still don't do it)
  • The TTC should be started in either INIT or CONFIG... (not BOOT)
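
A small sketch of that fan-out in Python; the counts come from these notes, and the crate name is hypothetical:

    # Counts from the notes above: 1 RODC -> 4 PUs -> 2 DSPs each -> 1 FEB each.
    PUS_PER_RODC = 4
    DSPS_PER_PU = 2
    FEBS_PER_DSP = 1

    def febs_in_crate(name):
        """List the FEBs read out by one ROD crate, per the fan-out above."""
        febs = []
        for pu in range(PUS_PER_RODC):
            for dsp in range(DSPS_PER_PU):
                for feb in range(FEBS_PER_DSP):
                    febs.append("%s/PU%d/DSP%d" % (name, pu, dsp))
        return febs

    print(len(febs_in_crate("RODC-EMBC-01")))  # 4 x 2 x 1 = 8 FEBs per crate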

-- ClaireLee - 18-May-2012
