CASTOR affected alarm for LHCb
Description
The "CASTOR affected" alarm was triggered because LHCBMDST was full and all DEFAULT slots were in use, which caused a piquet call.
Impact
- No user impact. Unnecessary piquet call.
Time line of the incident
| When | What |
| 15-Sep-2010 04:32 | Piquet call |
| 15-Sep-2010 05:05 | After analysing the case, the piquet decides not to notify LHCb and tells the operator to ignore such alarms until the next morning |
Analysis
- For more than 24 hours the default pool had been running above its capacity, with all of its scheduling slots full and availability close to 0% according to SLS.
- The lhcbmdst pool, whose space is user-managed, got full at 01:00.
- A transient overload of lhcbfailover triggered the "CASTOR affected" alarm, and only because two other pools (default and lhcbmdst) were already unavailable. The alarm cleared by itself a few minutes later.
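The interaction described above can be illustrated with a minimal sketch of an aggregate alarm rule. This is not the actual monitoring code: the `PoolStatus` type, the function names, the 10% availability cut and the "more than two unavailable pools" limit are all assumptions chosen only to reproduce the behaviour reported in the analysis, and the availability values are illustrative.

```python
from dataclasses import dataclass


@dataclass
class PoolStatus:
    name: str
    availability: float  # percent (0-100), e.g. as reported by a monitor such as SLS


def unavailable_pools(pools, min_availability=10.0):
    """Return the names of pools below the (assumed) availability cut."""
    return [p.name for p in pools if p.availability < min_availability]


def service_affected(pools, max_unavailable=2, min_availability=10.0):
    """Hypothetical aggregate rule: raise the service-level alarm only when
    more than `max_unavailable` pools are unavailable at the same time."""
    return len(unavailable_pools(pools, min_availability)) > max_unavailable


# Illustrative state before the transient: two pools already unavailable.
before = [
    PoolStatus("default", 0.0),        # all scheduling slots in use
    PoolStatus("lhcbmdst", 0.0),       # user-managed space full
    PoolStatus("lhcbfailover", 100.0),
]

# Illustrative state during the transient overload of lhcbfailover.
during = [
    PoolStatus("default", 0.0),
    PoolStatus("lhcbmdst", 0.0),
    PoolStatus("lhcbfailover", 0.0),
]

print(service_affected(before))  # False: two pools down, below the assumed limit
print(service_affected(during))  # True: the third pool tips the aggregate alarm
```

Under these assumptions, a short-lived problem on a single pool is enough to raise the service-level alarm whenever two pools are already in a long-standing degraded state, which is the situation that led to the piquet call.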
Follow ups
- A similar incident happened last week (Incidentsmonitoring8Sep2010). We should change the monitoring thresholds and/or the criteria for piquet calls. Done.
- Get in contact with LHCb about current usage of pools. Done.
-- Created by GiuseppeLoPresti at 15-Sep-2010 - 16:17
- SLS availability of the castorlhcb / default disk pool:
- Free space in the castorlhcb / lhcbmdst diskpool (Disk1Tape1):