CASTOR affected alarm for LHCb

Description

The "CASTOR affected" alarm was triggered while the LHCBMDST pool was full and all DEFAULT slots were in use, which caused a piquet call.

Impact

  • No user impact. Unnecessary piquet call.

Time line of the incident

When               What
15-Sep-2010 04:32  Piquet call
15-Sep-2010 05:05  After analysing the case, the piquet decides not to notify LHCb and instructs the operator to ignore this alarm until the next morning

Analysis

  • For more than 24 hours the default pool had been running above its capacity, with all of its scheduling slots full and availability close to 0% according to SLS.
  • The lhcbmdst pool, whose space is user-managed, got full at 01:00.
  • There was a transient overload of lhcbfailover, which triggered the "castor affected" alarm only because two pools were already unavailable. The alarm cleared by itself a few minutes later.
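
The way the three pool states combined into one alarm can be illustrated with a minimal sketch. It assumes the service-level alarm fires once more than a fixed number of pools are unavailable. The names PoolStatus, unavailable_pools and service_affected, the 10% availability cut-off and the limit of two pools are all hypothetical, chosen only to reproduce the sequence described above; they are not the actual CASTOR or SLS configuration.

```python
from dataclasses import dataclass


@dataclass
class PoolStatus:
    name: str
    availability: float  # percent (0-100); hypothetical monitoring value


def unavailable_pools(pools, min_availability=10.0):
    """Return the names of pools whose availability is below the cut-off."""
    return [p.name for p in pools if p.availability < min_availability]


def service_affected(pools, max_unavailable=2, min_availability=10.0):
    """Assumed rule: alarm when more than max_unavailable pools are down."""
    return len(unavailable_pools(pools, min_availability)) > max_unavailable


# Two pools already unavailable: no service-level alarm yet.
pools = [
    PoolStatus("default", 0.0),       # all scheduling slots in use
    PoolStatus("lhcbmdst", 0.0),      # user-managed space is full
    PoolStatus("lhcbfailover", 95.0),
]
before = service_affected(pools)

# A transient overload of a third pool tips the count over the limit.
pools[2].availability = 5.0
during = service_affected(pools)
```

Under these assumptions, before is False and during is True: the short overload of lhcbfailover would not have raised the alarm on its own.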

Follow ups

  • A similar incident happened last week (Incidentsmonitoring8Sep2010). The monitoring thresholds and/or the criteria for piquet calls should be changed. Done.
  • Get in contact with LHCb about current usage of pools. Done.
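
As an illustration of what a stricter criterion for piquet calls could look like, the sketch below only escalates when the alarm has stayed active for a minimum duration, so that an alarm clearing by itself after a few minutes does not cause a call. This is one possible approach and not necessarily the change that was deployed; the function should_call_piquet, the 15-minute window and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta


def should_call_piquet(samples, min_duration=timedelta(minutes=15)):
    """Decide whether to escalate, given (timestamp, alarm_active) samples.

    samples must be sorted by time. Escalate only if the alarm is active in
    the latest sample and has been continuously active for min_duration.
    """
    if not samples or not samples[-1][1]:
        return False
    latest = samples[-1][0]
    start = latest
    # Walk backwards to find where the current active streak began.
    for timestamp, active in reversed(samples):
        if not active:
            break
        start = timestamp
    return latest - start >= min_duration


t0 = datetime(2010, 9, 15, 4, 30)  # hypothetical sample times

# Alarm active for only a few minutes: no call.
transient = [
    (t0, False),
    (t0 + timedelta(minutes=2), True),
    (t0 + timedelta(minutes=5), True),
]

# Alarm active for 20 minutes without interruption: call.
persistent = [
    (t0, True),
    (t0 + timedelta(minutes=10), True),
    (t0 + timedelta(minutes=20), True),
]

call_for_transient = should_call_piquet(transient)
call_for_persistent = should_call_piquet(persistent)
```

The trade-off of such a rule is a delayed call for genuine outages, so the window would need to be agreed with the experiment.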

-- Created by GiuseppeLoPresti at 15-Sep-2010 - 16:17

  • SLS availability of the castorlhcb / default disk pool:
    availability_default.png

  • Free space in the castorlhcb / lhcbmdst diskpool (Disk1Tape1):
    freespace_lhcbmdst.png
Topic revision: r1 - 2010-09-30 - GiuseppeLoPresti
 