Disaster Recovery and Business Continuity for IT-DSS-FDO services

Terms and definitions

  • "Disaster" (as in ".. recovery"): any event affecting the service making uneffective the standard counter-measures (see Business Continuity). For example a data loss making permanently unusuable a significant fraction of the backup store.
  • "Business" (as in ".. continuity"): the core missions of CERN, i.e. "storing and retrieving physics data from the experiments", or "providing tools for physics analysis". This is implemented on various services, but has much wider scope. After a disaster hits a service, business might continue by replacing that service or to move it somewhere else, not repairing it...
Although the two concepts are closely linked, they do not coincide. For example, a second copy of the LEP data on tapes shelved in a remote place is a priori a valid disaster-recovery mitigation, but it has little impact on business continuity. Guaranteeing business continuity is largely part of the SLD, while disaster recovery is largely treated outside it.

  • "Short period": any event lasting less than 10-100 hours.
  • "Long period": the same but > 100 hours.
Short/Long are defined by the application and actually depend on the type of application. For everything connected with the operation of the accelerator complex, even a few hours is already "long", since it affects the overall efficiency of operations (1 day of stop is a drop of 1% or more of the yearly efficiency of the physics programme and a considerable financial loss). Data-acquisition related activities are typically insensitive to stops of up to a few days (as DSS we constantly remind the experiments to provide buffer space for ~100 h of local buffering). Interruptions of the production/analysis chain are in general severe (if they lead to the loss of many jobs and hence of time), but they do not lead to data loss; the same holds for data export. Similar considerations hold for other human activities where a service outage could prevent colleagues from working for the entire duration of the event (e.g. accessing home-directory files, accessing project data like CAD, software repositories etc.).
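
As a rough illustration of the orders of magnitude above, the following sketch runs the back-of-the-envelope arithmetic behind the "1 day ≈ 1%" and "~100 h local buffer" statements. The number of physics days per year and the experiment data rate are assumptions for illustration only, not official operational parameters.

```python
# Back-of-the-envelope arithmetic; the input numbers are assumptions for
# illustration, not official operational parameters.

PHYSICS_DAYS_PER_YEAR = 100   # assumed days of physics running per year
DAQ_RATE_GB_PER_S = 1.0       # assumed average data rate of one experiment (GB/s)
BUFFER_HOURS = 100            # local buffer target mentioned above

# One day of stop as a fraction of the yearly physics programme
stop_fraction = 1 / PHYSICS_DAYS_PER_YEAR
print(f"1 day of stop ~ {stop_fraction:.1%} of the yearly physics programme")

# Local buffer needed to ride out a ~100 h outage of the central services
buffer_tb = DAQ_RATE_GB_PER_S * 3600 * BUFFER_HOURS / 1000
print(f"~{BUFFER_HOURS} h local buffer at {DAQ_RATE_GB_PER_S} GB/s ~ {buffer_tb:.0f} TB")
```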

Application-Service mapping

To simplify the analysis for both DR and BC we select some exemplary applications and correlate them to DSS service responsibilities:

  • Physics-data handling: CASTOR (disk) and/or EOS are the receiving ends for the experiments and are critical if the adverse event lasts more than 100 h. One should note that the network from an experiment to the Meyrin computer centre has the same criticality as the end points (see the sketch after this list).
  • Tape writing: CASTOR (disk and tape) should be operational to guarantee a safe tape copy of all data. It becomes critical if t > 100 h, especially if both CASTOR and EOS are down (otherwise data can be received and exported to the Tier1s, reaching a reasonable level of data safety).
  • IT services (1): CEPH
  • IT services (2):
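
The sketch referred to above: a minimal way of recording the application-to-service mapping together with its criticality threshold. The data structure and helper function are purely illustrative, and only the entries spelled out in the bullets above are filled in.

```python
# Minimal sketch of an application-to-service criticality mapping.
# Service names and thresholds come from the bullets above; the structure
# and the helper function are illustrative only.

CRITICALITY = {
    # application: (services involved, outage hours after which it becomes critical)
    "physics data handling": (["CASTOR disk", "EOS", "experiment-to-Meyrin network"], 100),
    "tape writing":          (["CASTOR disk", "CASTOR tape"], 100),
}

def is_critical(application: str, outage_hours: float) -> bool:
    """True if an outage of the given length makes the application critical."""
    _services, threshold_hours = CRITICALITY[application]
    return outage_hours > threshold_hours

print(is_critical("physics data handling", 120))  # True: longer than 100 h
```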

Basic scenarios

For the disaster recovery we consider the following scenarios:

  • DR1: Barn untouched, both machine rooms in bd 513 lost (all equipment unrecoverably lost, as in the case of a fire)
  • DR2: Same as DR1, but the barn is also lost.
  • DR3: Machine room in bd 613 lost. The rest of the CERN computer centre is untouched.
  • DR4: Wigner centre is lost. The Meyrin computer centre is untouched.
The cases of both bd 513 and bd 613 being lost, or of any combined loss of the Meyrin and Wigner centres, are not considered.

For the business continuity we consider the following scenarios:

  • BC1: Meyrin centre is unavailable for a long time (for example a long power cut)
  • BC2: As BC1, but the Barn is not usable either
  • BC3: Wigner centre is unavailable for a long time

General outline

For every service, we need to list:

  • the intended purpose (link to SLD)
    • in the scope of IT-DSS services this is typically to accept, store and make available in a timely fashion some kind of data
    • the contract with the user is often implicit or not well understood - the service may have refused to store the file, but the user is not aware of the failure (cf. "error on close()" for EOS or "client interruption before acknowledgement" for AFS).
  • the dependencies on other services (such as power, network, configuration management, backup)
    • split between current usage (which would facilitate a recovery scenario) and hard dependencies (without which the service will not be recoverable)
  • particularities of the service that establish the available options
  • the reaction to various levels of "disasters": some are just technical malfunctions, others human errors, malicious actions or disasters in the common sense (fire, industrial accidents, natural catastrophes)
    • how the scenario is detected (major ones might be very obvious)
    • whether the scenario for the user is one of the following (see the sketch after this list)
      • fully transparent,
      • noticeable (only) via reduced performance,
      • leads to temporary data unavailability,
      • leads to loss of recent changes,
      • leads to data/content loss (metadata is preserved - loss is detectable and quantifiable within the service)
      • leads to data and metadata loss (service cannot determine what/how much has been lost, except at coarse granularity)
    • for each non-transparent scenario, the preventive measures that could be taken should be outlined: prevent it, reduce its likelihood, or reduce its impact (i.e. the amount of data affected, the time to recover); the costs should be assessed (with a recommendation on whether these measures should be taken).
    • some of these scenarios will still lead to irrecoverable data loss. These should be agreed on with the users, i.e. the cost to address them should be clearly prohibitive compared to the cost of the data loss.
    • depending on the service's purpose, clear knock-on effects of data unavailability or data loss need to be considered. This should be limited to technical and immediate effects (not somebody abandoning their PhD after part of their work got lost, and hence a future Nobel prize or world-saving invention not being realised). Only disaster effects on SLD-compliant data should be considered.
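
The sketch referred to in the impact list above: the user-visible impact levels form an ordered scale and could be encoded along these lines. The class and member names are illustrative, not an agreed taxonomy.

```python
from enum import IntEnum

class UserImpact(IntEnum):
    """User-visible impact levels from the outline above, ordered by severity.
    Names are illustrative, not an agreed taxonomy."""
    TRANSPARENT = 0               # fully transparent
    REDUCED_PERFORMANCE = 1       # noticeable (only) via reduced performance
    TEMPORARY_UNAVAILABILITY = 2  # temporary data unavailability
    LOSS_OF_RECENT_CHANGES = 3    # recent changes lost
    DATA_LOSS = 4                 # content lost, metadata preserved (quantifiable)
    DATA_AND_METADATA_LOSS = 5    # service cannot determine what has been lost

# Example policy check: anything beyond temporary unavailability needs
# preventive measures to be discussed and agreed with the users.
impact = UserImpact.DATA_LOSS
print(impact > UserImpact.TEMPORARY_UNAVAILABILITY)  # True
```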

"Disasters" small and big that need to be looked at

  • some of these may be grouped together, if none of the IT-DSS services actually handles them differently.
  • The focus is on 'disk' as the primary storage medium for IT-DSS-FDO, but SSDs, RAID BBUs or tapes can be included as needed.
  • some errors might be in a grey area - data is no longer accessible via standard OS means, but might be partly recoverable by specialist companies. This is currently not used for disk-based storage, but tapes do get sent offsite for data recovery.
  • some error scenarios have a time component - a short outage for a motherboard replacement would not be considered data loss, a month-long data unavailability might be
  • backup solutions mean that while the most recent file content might be lost, an older version can be recovered. For such cases, the backup frequency needs to be specified (see the sketch after this list)
    • backups themselves can be a secondary source of data loss (a failure scenario relies on backups to not lose data, but the backup itself has failed) - this is not considered here
  • data might be recoverable from outside of the service - for most experiment data, one or more copies will exist at other sites; users might have backup copies of personal files. Such recovery sources are not considered here: for the purposes of an SLA, the service has lost the data in these cases.
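
The sketch referred to above: the relation between backup frequency and the worst-case window of lost changes, under the simplifying (and here assumed) model that the failure happens just before the next backup run. Timestamps are hypothetical.

```python
# Rough sketch: the window of recent changes at risk when data has to come
# back from a periodic backup. Timestamps below are hypothetical.
import datetime as dt

def changes_at_risk_hours(last_successful_backup: dt.datetime,
                          failure_time: dt.datetime) -> float:
    """Window of changes that are lost if the data must be restored from backup."""
    return (failure_time - last_successful_backup).total_seconds() / 3600

# AFS example from the section below (one backup per day): a failure just
# before the next nightly run loses up to ~24 h of changes.
last_backup = dt.datetime(2014, 8, 25, 2, 0)   # hypothetical nightly run
failure = dt.datetime(2014, 8, 26, 1, 30)      # hypothetical failure time
print(f"{changes_at_risk_hours(last_backup, failure):.1f} h of changes at risk")
```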

single disk pre-failure (SMART, recoverable errors)

A disk announces that it will fail in the near future (via SMART pre-failure attributes such as the number of reallocated sectors, internal tests detecting non-writeable areas, etc.), but so far no user-visible data has been affected. This is the most benign scenario, and should be transparent for the user on all services.
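
A minimal sketch of how such a pre-failure condition might be picked up by a monitoring script, assuming smartmontools is available on the disk server; the device path, the chosen attribute and the zero threshold are illustrative.

```python
# Sketch of a periodic SMART pre-failure check, as it might be run from a
# monitoring script. Assumes smartmontools is installed; the device path and
# the alert threshold are illustrative.
import subprocess

def reallocated_sectors(device: str = "/dev/sda") -> int:
    """Return the raw Reallocated_Sector_Ct value reported by smartctl -A."""
    out = subprocess.run(["smartctl", "-A", device],
                         capture_output=True, text=True, check=False).stdout
    for line in out.splitlines():
        if "Reallocated_Sector_Ct" in line:
            return int(line.split()[-1])   # raw value is the last column
    return 0

if reallocated_sectors() > 0:
    # No user data affected yet: schedule a RAID rebuild and a disk swap.
    print("disk reports reallocated sectors - plan proactive replacement")
```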

single file on single disk error

The file is not readable at the OS level (internal disk checksum errors, "bad sector", local filesystem corruption), the service-level checksum disagrees with the on-disk content, or the file is missing (service metadata inconsistency).
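
A minimal sketch of a service-level checksum verification for a single file replica, using adler32 as an example algorithm; the catalogue value would come from the service's namespace, and the paths and values shown are hypothetical.

```python
# Sketch of a service-level checksum verification for one file replica:
# recompute the on-disk checksum and compare it with the value stored in the
# service's namespace/catalogue. adler32 is used as an example algorithm.
import zlib

def adler32_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    checksum = 1  # adler32 initial value
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            checksum = zlib.adler32(chunk, checksum)
    return f"{checksum & 0xffffffff:08x}"

def verify_replica(path: str, catalogue_checksum: str) -> bool:
    """False means the 'single file on single disk error' scenario applies."""
    return adler32_of_file(path) == catalogue_checksum

# Example call with hypothetical values:
# verify_replica("/srv/data/somefile", "1a2b3c4d")
```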

single disk failure (hard error)

Either a mechanical/electrical failure of the whole device, or an administrator wiping data on a given disk/partition/filesystem.

multiple disk failures, single machine

Often repairable without data loss (the common point of failure would be the disk/RAID controller); if not, treat as a "single machine issue". The only reason for including this here is the case of multiple independent failures inside a single RAID array.

(single machine failure - repairable)

(assume that this mostly leads to temporary unavailability of some data. If disks are directly affected, see either the single or multiple disk failure scenarios). One particular point is that most services do not use synchronous writes to disk (so the last few seconds of freshly-written data are often lost in case of a machine crash or power outage). Similarly, both RAID controller caches and on-disk caches may lie about data persistency immediately after a write.
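
A small sketch of what an explicitly synchronised write looks like, and why its absence loses the last few seconds of data on a crash; most services deliberately skip the fsync for performance. Function and path names are illustrative.

```python
# Sketch contrasting a buffered write (contents may sit in the page cache and
# in controller/disk caches and vanish on a crash) with an explicitly
# synchronised write.
import os

def write_durably(path: str, data: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)
        os.fsync(fd)   # ask the OS to flush to the device; without this, the
                       # last few seconds of writes can be lost on a crash
    finally:
        os.close(fd)

# Note: even fsync() only helps if the RAID controller / disk honours cache
# flushes (battery-backed or write-through caches).
```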

single machine issue - all disk content destroyed

(e.g. fire/water/mechanical damage, electrical damage to onboard disk controllers, machine reinstallation with all disks being formatted)

multiple disk failures, spanning several machines

(the underlying assumption is independent failure modes, not one set of hardware/firmware failing in the same millisecond)

simultaneous failure of multiple machines in one physical location of varying extent (rack/room/floor/building)

(i.e. fire/water/mechanical damage, long-lasting power issues)

logical data deletion

All current copies of the data (should) have been removed after an action from an authorised account. This action can be accidental (human error) or malicious, and might not have been requested by the user owning the account (computer security - account or service compromise). Both actions from the service's users and from its administrators should be considered. Data redundancy does not help here; only lazy deletion, backup and data versioning can recover the data.

From a service perspective, this is not a disaster scenario in the strict sense (service performs as designed), but from a user perspective it is.

A sub-case is account deletion, i.e. all data owned by that account is removed (after some grace period), either accidentally (the account was not, or should not have been, considered for deletion) or intentionally (i.e. in line with policy, but different from the user's expectations).
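
A minimal sketch of the "lazy deletion" idea mentioned above: a user-visible delete only moves the data into a trash area, and a separate purge pass removes it for good once a grace period has expired. The paths and the grace period are illustrative.

```python
# Minimal sketch of lazy deletion with a grace period: deletes only move data
# into a trash area; a separate purge pass removes it permanently later.
import os, shutil, time

TRASH_DIR = "/srv/service/.trash"   # hypothetical trash area
GRACE_PERIOD_S = 30 * 24 * 3600     # e.g. 30 days

def lazy_delete(path: str) -> None:
    """Move the file aside instead of unlinking it, recording the deletion time."""
    os.makedirs(TRASH_DIR, exist_ok=True)
    target = os.path.join(TRASH_DIR, f"{int(time.time())}_{os.path.basename(path)}")
    shutil.move(path, target)

def purge_expired() -> None:
    """Permanently remove trash entries older than the grace period."""
    now = time.time()
    for name in os.listdir(TRASH_DIR):
        deleted_at = int(name.split("_", 1)[0])
        if now - deleted_at > GRACE_PERIOD_S:
            os.remove(os.path.join(TRASH_DIR, name))
```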

data leak/disclosure

(not considered here, although from a user perspective inadvertent data disclosure might have significant negative effects)


AFS Service

  • provides a shared filesystem, with POSIX-like interface (some exceptions/extensions to POSIX: ACLs, shared locking, content visibility)
  • Service information and SLD
  • key concept: "AFS volume": a directory sub-tree that is used as a single unit. It can be moved between servers, backed up, made read-only and then replicated; read-only replicas can be "promoted" to read-write copies (as an alternative to restoring from backup) - see the sketch after this list.
  • AFS has metadata servers (the VLDB, redundant) and file servers. The file servers are non-redundant - only one file server is responsible for each (read-write) AFS volume.
  • AFS is backed up once/day (old: TSM, new: CASTOR). The agreed data retention period is 6 months.
    • the most recent backup snapshot is kept on disk for user home directories for fast self-service
    • backup restore operations are directly accessible to users
  • Dependencies: hardware, power, network (within service, between AFS and CASTOR/TSM and with the clients).
    • documentation: off-site/generic, CERN-specific: inside AFS (replicated AFS volume), in configuration management (which is itself hosted on AFS)
    • re-creating individual machines strongly benefits from configuration management
    • re-creating the whole service benefits from configuration management
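
The sketch referred to in the "AFS volume" bullet above: how the volume operations (moving a volume, releasing content to read-only replicas, promoting a replica) might be driven from a recovery script via the OpenAFS vos tool. Server, partition and volume names are placeholders, and the exact option syntax should be checked against the local OpenAFS version.

```python
# Sketch of the AFS volume operations mentioned above, driven via the OpenAFS
# "vos" administration tool. Volume, server and partition names are
# placeholders; error handling and authentication are omitted.
import subprocess

def vos(*args: str) -> None:
    subprocess.run(["vos", *args], check=True)

# Move a read-write volume away from a failing file server
vos("move", "-id", "user.someuser",
    "-fromserver", "afs101", "-frompartition", "vicepa",
    "-toserver", "afs102", "-topartition", "vicepa")

# Push the current read-write content out to its read-only replicas
vos("release", "-id", "project.somevolume")

# After losing the read-write copy, promote a read-only replica instead of
# restoring from backup (the alternative mentioned in the bullet above)
vos("convertROtoRW", "-server", "afs103", "-partition", "vicepa",
    "-id", "project.somevolume")
```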

| Scenario | detection | user visible effect on AFS service | data loss and causes | possible mitigations | mitigations agreed |
| single disk pre-failure | alarmed | performance impact; manual RAID rebuild to keep full redundancy, then disk change | very unlikely (human error) | trigger rebuild from tools/scripts | no |
| single file on single disk error | manual or incidental | data and/or metadata loss, old version; implies a filesystem error, file needs to be recovered from backup | very unlikely (backup failure) | FS-level scans/consistency checks | no |
| single disk failure | alarmed | performance impact; automatic RAID rebuild to restore full redundancy, then disk change | very unlikely (human error, or a second disk failure in a narrow time window = next point) | defined procedures/scripts (need for procedures agreed with repair service) | |
| multiple disk failures, single machine | alarmed (same as above) | data unavailable; data+metadata loss, old version; need to recover from backup, manual action, will take O(hours) depending on the volume | inherent data loss due to local storage | RAID6 or non-local storage (CEPH) | agree to investigate |
| single machine failure - repairable | alarmed | data+metadata unavailable until machine restored or disk tray connected to a different server (manual), O(hours) | data loss unlikely (human error during repair) | shorten outage via defined procedures and standby hardware; automatic takeover possible in case of non-local storage (CEPH), but risk of data corruption | no |
| single machine issue | alarmed | data+metadata unavailable; data+metadata loss, old version; recover from backup, manual action, will take O(day) | inherent data loss due to local storage | reduce outage via efficient multi-volume restore, high-priority tape access | |
| multiple disk failures, spanning several machines | alarmed | performance impact (treat as individual failures, above) | | | |
| simultaneous multiple machine failures | "obvious" | data+metadata unavailable, old version from backup; scaled-up recovery time from tape, time to set up replacement hardware: O(weeks); no difference whether failures are geographically close or not | inherent data loss due to local storage | | none |
| data deletion | incidental / by user | data+metadata loss, old version | n.a. - service works as expected | recover from snapshot or backup, manual action (possibly by user); data versioning? | none |
| account deletion | n.a. | n.a.; personal data is kept (retention period is undefined, no access by former user), non-personal data is kept (access by other users untouched) | data loss is very unlikely; "accidental" account deletions can be (manually) recovered from for several months | longer retention period (but data privacy issues) | none |

-- JanIven - 26 Aug 2014
