Disaster Recovery and Resilience

Opportunity and risk come in pairs

IT resilience is defined as an organization's ability to maintain acceptable service levels through, and beyond, severe disruptions to its critical processes and the IT systems which support them.

By focusing on awareness, protection, discovery, preparedness, recovery, review and improvement, an organization can minimise the potential impact of disruptions to its IT services. In today's highly competitive business environment, such disruptions could be extremely costly, possibly to the point of complete business failure. These areas are key to effective IT resilience. None can be taken in isolation; they all overlap at some point in the overall process.

Awareness means knowing the normal business requirements for operational functionality; any dependencies that exist; the criticality of IT system components and elements; and the minimum acceptable operational levels. There must also be an awareness of the recovery requirements in terms of time, system capacity and performance in the event of severe disruption to, or failure of, the IT systems supporting the business processes. These should be identified by an effective business impact assessment / analysis (BIA).

Protection is more than having physical and system access security controls. It can also mean reducing the risk of system failure, e.g. removing single points of failure (SPoF) by having load-balancing servers or redundant systems or components. Potential exposures to systems deemed critical to business processes should be identified and addressed as a priority.
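As a rough illustration of how single points of failure might be spotted in a component inventory, the following Python sketch counts the instances behind each role of each service and flags any role served by only one instance. The inventory format, service names and instance names are hypothetical; a real inventory would normally come from a configuration management database or similar source.

```python
from collections import Counter

# Hypothetical inventory: (service, component role, instance name).
INVENTORY = [
    ("payments", "web server", "web-01"),
    ("payments", "web server", "web-02"),
    ("payments", "database", "db-01"),
    ("email", "mail relay", "relay-01"),
    ("email", "mail relay", "relay-02"),
]


def single_points_of_failure(inventory):
    """Return (service, role) pairs that are served by exactly one instance."""
    counts = Counter((service, role) for service, role, _ in inventory)
    return sorted(pair for pair, n in counts.items() if n == 1)


print(single_points_of_failure(INVENTORY))  # [('payments', 'database')]
```

A count of instances is only a first pass: two servers on the same power feed or network link may still share a single point of failure.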

Discovery means detecting disruption quickly: the sooner the IT team knows that a system has been disrupted, the sooner they can resolve the problem. Effective alerting enables the IT group to understand and address problems before they result in severe disruption.
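One common way to make alerting effective rather than noisy is to raise an alert only after several consecutive failed health checks. The Python sketch below shows that idea under stated assumptions: the function name, the list-of-booleans input and the threshold of three are illustrative choices, not part of any particular monitoring product.

```python
def should_alert(check_results, threshold=3):
    """Return True if the most recent `threshold` health checks all failed.

    check_results is a list of booleans, oldest first, where True means
    the check passed. The threshold value is an assumption for illustration.
    """
    if len(check_results) < threshold:
        return False
    return not any(check_results[-threshold:])


print(should_alert([True, False, False, False]))  # True
print(should_alert([False, True, False, False]))  # False
```

A lower threshold gives earlier discovery at the cost of more false alarms; the right balance depends on the criticality of the system being monitored.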

Preparedness means having detailed plans for addressing the effects of a disruption, such as having seamless failover of systems and components, enabling essential business processes to continue to function with no, or an acceptable minimum, break of service.

Recovery focuses on returning services and operations to business-as-usual levels within defined timescales and with minimal acceptable data loss following an event causing disruption or failure. This will only be achieved by having in place an effective, tested recovery plan which meets the business requirements.

Review is essential to every IT resilience programme, and includes post-incident reviews to identify the root causes of disruptions. It is a continual process which aims to enable the IT team and the business to understand potential issues and to assess and implement preventative actions to remove, or at least mitigate, the risk of severe disruption.

Improvement is the process of applying the knowledge gained from all the above to improve systems, increase resilience, and continuously refine disaster recovery and business continuity plans.

It should be noted that most, if not all, of the information required to achieve the above will come from effective business impact assessment / analysis and risk assessment.

IT resilience and DR considerations
The ISO/IEC 27031:2011 standard recommends six main categories to be considered when formulating an IT DR strategy:
1. Key competencies and knowledge: What information is necessary to run critical IT services? Is it in-house or does it sit solely with a service supplier, or is it a mix of both? How can this information be incorporated into the organization's "knowledge bank(s)" and be made available in the event of a severely disruptive incident or event requiring IT disaster recovery processes to be activated?

2. Facilities: What criteria should installations and infrastructure meet to minimize the risk of failure or severe disruption and to support eventual recovery? Where should such facilities be located?

3. Technology (systems): Which systems are most important to the organization's business? Have recovery requirements been identified, e.g., RTO (recovery time objective), RPO (recovery point objective), or dependencies on other systems?

4. Data: Has the data required to restore / resume business activities, and the timescales within which it must be available, been identified? Note that IT services and data may have different RTOs and RPOs. The recovery, resumption or implementation of security controls to secure the data must also be considered.

5. Processes: Which processes are in place to deal with an incident or disaster, and how do they combine the topics outlined above to deliver the required, and defined, business services?

6. Suppliers: Which service suppliers are critical to IT continuity, and how do they ensure that they can support the organization's recovery and business continuity requirements? Are these service suppliers, in turn, dependent upon the effective responses from other third parties, internal or external to their organization?
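The RPO mentioned under Technology and Data can be made concrete with a simple check: if the time since the last good backup exceeds the RPO, a failure now would lose more data than the business has agreed to tolerate. The Python sketch below illustrates this; the function name and the timestamps are hypothetical values chosen for the example.

```python
from datetime import datetime, timedelta


def rpo_breached(last_backup, now, rpo):
    """Return True if the time since the last backup exceeds the RPO."""
    return (now - last_backup) > rpo


# Hypothetical figures: last backup at 07:30, checked at 12:00.
now = datetime(2024, 1, 1, 12, 0)
last_backup = datetime(2024, 1, 1, 7, 30)

print(rpo_breached(last_backup, now, timedelta(hours=4)))  # True
print(rpo_breached(last_backup, now, timedelta(hours=6)))  # False
```

With a 4-hour RPO, a backup that is 4.5 hours old is already a breach, which suggests either more frequent backups or a renegotiated RPO.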

IT recovery strategies
It makes good business sense to develop and maintain recovery strategies for IT systems, applications and data. Recovery strategies must address the above IT resilience and disaster recovery considerations and include all the elements that make up each system, e.g. networks, servers, desktops, laptops, wireless devices, data and connectivity.

Priorities for IT recovery must be consistent with the priorities for recovery of critical business functions and processes that have been identified by an effective BIA.

IT resources required to support critical business functions and processes must also be identified. The recovery time for an IT resource should be commensurate with the recovery time objective (RTO) for the business function or process that depends on that IT resource. The RTO is the time within which a business process must be restored, and a stated minimum level of service / functionality achieved following a disruption, to avoid unacceptable consequences associated with disruption to that service. For each system, this must be identified via the BIA by the business area which is the "owner" or prime user of that system.
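To illustrate what "commensurate" means in practice, the following Python sketch compares the estimated recovery time of each IT resource with the RTO of the business process that depends on it, and lists the resources that cannot currently meet that RTO. All process names, resource names and durations are hypothetical; in practice the RTOs would come from the BIA and the recovery times from test results.

```python
from datetime import timedelta

# Hypothetical BIA output: business process -> RTO.
PROCESS_RTO = {
    "order processing": timedelta(hours=4),
    "payroll": timedelta(hours=48),
}

# Hypothetical: IT resource -> (dependent process, estimated recovery time).
RESOURCES = {
    "order database": ("order processing", timedelta(hours=6)),
    "web front end": ("order processing", timedelta(hours=2)),
    "payroll server": ("payroll", timedelta(hours=24)),
}


def rto_gaps(resources, process_rto):
    """Return resources whose recovery time exceeds the dependent process RTO."""
    return sorted(
        name
        for name, (process, recovery_time) in resources.items()
        if recovery_time > process_rto[process]
    )


print(rto_gaps(RESOURCES, PROCESS_RTO))  # ['order database']
```

Here the order database needs 6 hours to recover while order processing has a 4-hour RTO, so that resource is the one whose recovery strategy needs attention.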

Recovery strategies should be developed to anticipate the failure, or loss, of one or more of the following system components:
  *   Physical environment (data center building; computer rooms; facilities; utilities)
  *   Hardware (servers; desktop and laptop computers; wireless devices and peripherals)
  *   Connectivity (network links; equipment and services)
  *   Systems software (computer operating systems)
  *   Middleware (platform services, e.g., web servers or application services)
  *   Enabling software (shared central applications, such as electronic mail)
  *   Applications (data processing) software
  *   Data

IT disaster recovery planning (IT DRP)
Disaster recovery planning is the ongoing process of planning, developing, implementing, and testing disaster recovery management procedures and processes to ensure the efficient and effective resumption of critical functions in the event of an unscheduled interruption which might cause severe disruption.

A disaster recovery plan can only be effective if system dependencies have been identified and accounted for when developing the order of recovery, establishing recovery time and recovery point objectives and documenting the roles of required personnel. The source of this information should, again, be the BIA.
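Once dependencies are known, a valid order of recovery can be derived mechanically with a topological sort, so that no system is started before the systems it relies on. The Python sketch below uses the standard library's graphlib module (Python 3.9 or later); the system names and dependency map are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: system -> systems it depends on.
DEPENDS_ON = {
    "network": set(),
    "storage": {"network"},
    "database": {"storage", "network"},
    "application": {"database"},
    "web portal": {"application"},
}


def recovery_order(depends_on):
    """Return systems ordered so that each appears after its dependencies."""
    return list(TopologicalSorter(depends_on).static_order())


print(recovery_order(DEPENDS_ON))
# ['network', 'storage', 'database', 'application', 'web portal']
```

If the dependency map contains a circular dependency, static_order raises graphlib.CycleError, which is itself a useful finding: such a cycle means the systems involved cannot be recovered in any strict sequence without manual intervention.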

Resilience reviews
To ensure effective implementation of the BCM policy in terms of IT resilience, recovery and continuity of service to the required level, the organization should instigate a programme of reviews to establish the IT business continuity / disaster recovery (BC/DR) resilience of systems supporting its business operations, functions and processes. The reviews should be conducted at regular intervals by a group with the required professional knowledge and experience; to ensure objectivity, this should not be the internal IT function.

Reviews should be structured to assess the capability and resilience of the IT systems or services supporting business areas or processes, rather than individual IT systems. Reviews may highlight potential exposures which, in the event of an incident causing severe disruption to IT services, could delay or, in a worst-case scenario, prevent recovery of critical business processes, functions or services, with potential financial and/or reputational damage to the organization.

The review process should include evaluation of the processes, policies and procedures related to preparing for recovery or continuation of technology infrastructure, systems and applications, following an incident that may cause severe disruption to, or failure of, essential services, from whatever cause.

Recommended standards and guidelines
The following issues are the minimum considerations for inclusion in any organization's IT resilience and DR standards and guidelines.
IT disaster recovery
  *   IT disaster recovery plan
  *   System criticality and recovery objectives
  *   Testing and review
  *   IT disaster recovery plan content
  *   IT service providers: disaster recovery plan information
  *   New systems

Business impact assessment
  *   Business impact assessment evidence
  *   Business area objectives/priorities
  *   Information required
  *   Review
  *   IT systems planning

Risk assessment
  *   Risk assessment evidence
  *   Risks and exposures
  *   Categories
  *   Monitoring

IT resilience
  *   Data backup
  *   Systems software backup
  *   Systems management & infrastructure services
  *   Network resilience
  *   IT processing locations/data centers
  *   Recovery site(s) location(s)
  *   Primary and secondary site security controls
  *   Recovery system management
  *   Recovery capabilities
  *   Service supplier internal governance
  *   Service supplier contracts/SLAs
  *   Single points of failure (SPoFs)
  *   System performance & capacity review
  *   Power supplies
  *   Resilience of other utilities
  *   System software controls
  *   Applications software controls
  *   Data security
  *   Compute and storage resilience
  *   People

Conclusion
Having a business continuity management policy is not enough, even if you have supporting procedures and recovery and continuity plans. If your IT systems are important, and it would be a big surprise in this day and age if they were not, then due diligence dictates that you ensure that they are resilient. This means implementing review and risk mitigation programmes. As stated previously, reviews should be carried out by a group outside of the area directly responsible for IT systems, to retain objectivity. Risk mitigation activities are, of course, the responsibility of the area that owns the identified risk, and should be at the appropriate level to meet the organization's risk appetite.


Featured Experts - Disaster Recovery and Resilience

Nav Kaplish
Preethi Hari
