Disaster Recovery

Disaster recovery (DR) refers to an organization’s ability to restore access and functionality to IT infrastructure following a disaster event, whether natural or caused by human error or action. DR is a subset of business continuity and specifically focuses on ensuring that the IT systems that support critical business functions are operational as soon as possible after a disruptive event occurs.

What is disaster recovery?

Disaster recovery (DR) refers to an organization’s ability to respond to and recover from an event that negatively impacts business operations. The goal of DR methods is to enable the organization to regain use of critical systems and IT infrastructure as soon as possible after a disaster occurs. To prepare for this, organizations often perform an in-depth analysis of their systems and create a formal document to follow in times of crisis. This document is known as a disaster recovery plan.

Read on to learn more about the importance of DR, how it works, and the differences between disaster recovery and business continuity. You’ll also discover what to include in a disaster recovery plan, the major types of DR, as well as major DR services and vendors.

What is a disaster?

DR revolves around serious events that can disrupt or completely stop critical business operations for a period of time. These events include natural disasters, such as hurricanes, tornadoes, earthquakes, floods, and fires, as well as cyber attacks, equipment failure, epidemics or pandemics like COVID-19, sabotage, terrorist attacks or threats, and industrial accidents.

Why is disaster recovery important?

Disasters can cause various types of damage with varying levels of severity. A brief network outage could frustrate customers and cost an e-commerce business some sales. On the other hand, a hurricane or tornado could destroy an entire manufacturing facility, data center, or office.

The monetary costs of disasters can be significant. According to the Uptime Institute’s Annual Outage Analysis 2021 report, 40% of outages or service interruptions in businesses cost between $100,000 and $1 million, while about 17% cost more than $1 million. A data breach can be even more expensive, with the average cost being $3.86 million in 2020, according to the 2020 Cost of a Data Breach Report by IBM and the Ponemon Institute.

This article is part of the Business Continuity and Disaster Recovery (BCDR) guide, which also includes information on business resilience, preparing an annual schedule of business continuity activities, and a free business impact analysis (BIA) template with instructions.

Many businesses are required to create and follow plans for disaster recovery, business continuity, and data protection to meet compliance regulations. This is particularly important for organizations operating in financial, healthcare, manufacturing, and government sectors. Failure to have DR procedures in place can result in legal or regulatory penalties. Therefore, understanding how to comply with resiliency standards is crucial.

While preparing for every potential disaster may seem extreme, the COVID-19 crisis illustrated that even scenarios that seem far-fetched can come to pass. Businesses with emergency measures in place to support remote work had a clear advantage when stay-at-home orders were enacted.

Thinking about disasters before they happen and creating a plan for how to respond can provide many benefits. It raises awareness about potential disruptions and helps an organization prioritize its mission-critical functions. It also provides a forum for discussing these topics and making careful decisions about how to best respond in a low-pressure setting.

What is the difference between disaster recovery and business continuity?

On a practical level, DR and business continuity are often combined into a single corporate initiative and even abbreviated together as BCDR, but they are not the same thing. While the two disciplines have similar goals relating to an organization’s resilience, they differ greatly in scope.

Business continuity is a proactive discipline intended to minimize risk and ensure that the business can continue to deliver its products and services no matter the circumstances. It focuses especially on how employees will continue to work and how the business will continue operations while a disaster is occurring. BC is also closely related to business resilience, crisis management, and risk management, but each of these has different goals and parameters.

DR is a subset of business continuity that focuses on the IT systems that enable business functions. It addresses the specific steps an organization must take to resume technology operations following an event. DR is also a reactive process by nature. While planning for it must be done in advance, DR activity is not initiated until a disaster actually occurs.

Elements of a disaster recovery strategy

Before an organization can determine its DR strategies, it must first analyze existing assets and priorities. Two different analyses typically factor into DR decision-making:

Risk analysis

Risk analysis or risk assessment is an evaluation of all potential risks that a business could face, as well as their possible outcomes. The risks can vary greatly depending on the industry and geographic location of the organization. The assessment should identify potential hazards, determine who or what these hazards would harm, and create procedures that take these risks into account.

Business impact analysis

Business impact analysis (BIA) evaluates the effects of the risks identified above on business operations. A BIA can predict and quantify costs, both financial and non-financial. It also examines the impact of different disasters on an organization’s safety, finances, marketing, business reputation, legal compliance, and quality assurance.
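
As a rough illustration of how a BIA might quantify direct financial cost, the sketch below multiplies an assumed hourly revenue loss by the outage duration and adds a one-off recovery cost. The class, the function, and all figures are hypothetical and are not taken from any standard BIA template. Non-financial impacts such as reputation are not modeled.

```python
from dataclasses import dataclass


@dataclass
class OutageScenario:
    """Hypothetical inputs for one disaster scenario in a BIA."""
    name: str
    downtime_hours: float         # how long the business function is unavailable
    revenue_loss_per_hour: float  # assumed lost revenue per hour of downtime
    fixed_recovery_cost: float    # assumed one-off cost (repairs, overtime, fees)


def estimated_financial_impact(scenario: OutageScenario) -> float:
    """Estimate direct cost: hourly loss x duration + fixed recovery cost."""
    return (scenario.downtime_hours * scenario.revenue_loss_per_hour
            + scenario.fixed_recovery_cost)


# Illustrative numbers only.
scenarios = [
    OutageScenario("Network outage", 2, 5_000, 1_000),
    OutageScenario("Data center flood", 72, 5_000, 250_000),
]

# Rank scenarios from most to least costly.
for s in sorted(scenarios, key=estimated_financial_impact, reverse=True):
    print(f"{s.name}: ${estimated_financial_impact(s):,.0f}")
```

Ranking scenarios this way is one simple means of deciding which business functions to prioritize. A real BIA would draw its inputs from finance and operations data rather than fixed constants.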

Understanding the difference between risk analysis and BIA and conducting the assessments can also help an organization define its goals for data protection and the need for backup. Organizations generally quantify these using measurements called recovery point objective (RPO) and recovery time objective (RTO).

Recovery point objective

RPO is the maximum age of files that an organization must recover from backup storage for normal operations to resume after a disaster. The RPO determines the minimum frequency of backups. For example, if an organization has an RPO of four hours, the system must back up at least every four hours.
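
The relationship between RPO and backup frequency can be sketched as a simple check: a schedule satisfies the RPO if no gap between consecutive backups is longer than the RPO. The function names and sample timestamps below are illustrative assumptions.

```python
from datetime import datetime, timedelta


def largest_backup_gap(backup_times: list[datetime]) -> timedelta:
    """Return the longest interval between consecutive backups."""
    ordered = sorted(backup_times)
    return max((b - a for a, b in zip(ordered, ordered[1:])),
               default=timedelta(0))


def meets_rpo(backup_times: list[datetime], rpo: timedelta) -> bool:
    """A schedule meets the RPO if no gap between backups exceeds it."""
    return largest_backup_gap(backup_times) <= rpo


rpo = timedelta(hours=4)
start = datetime(2024, 1, 1, 0, 0)
every_four_hours = [start + timedelta(hours=4 * i) for i in range(7)]
every_six_hours = [start + timedelta(hours=6 * i) for i in range(5)]

print(meets_rpo(every_four_hours, rpo))  # True
print(meets_rpo(every_six_hours, rpo))   # False
```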

Recovery time objective

RTO refers to the amount of time an organization estimates its systems can be down without causing significant or irreparable damage to the business. In some cases, applications can be down for several days without severe consequences. In others, seconds can do substantial harm to the business.

RPO and RTO are both important elements in disaster recovery, but the metrics have different uses. RPOs are acted on before a disruptive event takes place to ensure data will be backed up, while RTOs come into play after an event occurs.
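
To make the distinction concrete, the sketch below evaluates a hypothetical incident against both objectives. The data-loss window runs backward from the disaster to the last backup and is compared with the RPO. The downtime runs forward from the disaster to restoration and is compared with the RTO. All timestamps and targets are made-up values.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup: datetime, disaster: datetime) -> timedelta:
    """Data written after the last backup is lost; compare this to the RPO."""
    return disaster - last_backup


def downtime(disaster: datetime, restored: datetime) -> timedelta:
    """Time systems were unavailable; compare this to the RTO."""
    return restored - disaster


# Hypothetical targets and incident timeline.
rpo = timedelta(hours=4)
rto = timedelta(hours=8)
last_backup = datetime(2024, 1, 1, 9, 0)
disaster = datetime(2024, 1, 1, 12, 0)
restored = datetime(2024, 1, 1, 22, 0)

print("RPO met:", data_loss_window(last_backup, disaster) <= rpo)  # 3 h lost
print("RTO met:", downtime(disaster, restored) <= rto)             # 10 h down
```

In this example the backup schedule was adequate (three hours of data lost against a four-hour RPO), but the recovery itself took too long (ten hours against an eight-hour RTO).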

What’s in a disaster recovery plan?

Once an organization has thoroughly reviewed its risk factors, recovery goals, and technology environment, it can write a disaster recovery (DR) plan. The DR plan is the formal document that specifies these elements and outlines how the organization will respond when disruption or disaster occurs. The plan details recovery goals, including RTO and RPO, as well as the steps the organization will take to minimize the effects of the disaster.

The components of a DR plan should include:

  • A DR policy statement, plan overview, and main goals of the plan.
  • Key personnel and DR team contact information.
  • A step-by-step description of disaster response actions immediately following an incident.
  • A diagram of the entire network and recovery site.
  • Directions for how to reach the recovery site.
  • A list of software and systems that staff will use in the recovery.
  • Sample templates for a variety of technology recoveries, including technical documentation from vendors.
  • A communication plan that includes internal and external contacts, as well as boilerplate for dealing with the media.
  • Summary of insurance coverage.
  • Proposed actions for dealing with financial and legal issues.

An organization should consider its DR plan a living document. Regular disaster recovery testing should be scheduled to ensure the plan is accurate and will work when a recovery is required. The plan should also be evaluated against consistent criteria whenever there are changes in the business or IT systems that could affect DR.

How disaster recovery works

Disaster recovery (DR) initiatives have become more attainable for businesses of all sizes due to widespread cloud adoption and availability of virtualization technologies that make backup and replication easier. However, much of the terminology and best practices developed for DR were based on enterprise efforts to recreate large-scale physical data centers. This involved plans to transfer, or fail over, workloads from a primary data center to a secondary location or DR site in order to restore data and operations.

Disaster recovery sites

An organization uses a DR site to recover and restore its data, technology infrastructure, and operations when its primary data center is unavailable. DR sites can be internal, external, or cloud-based.

An internal DR site is set up and maintained by the organization itself. Organizations with large information requirements and aggressive recovery time objectives (RTOs) are more likely to use an internal DR site, which is typically a second data center. When building an internal site, the business must consider hardware configuration, supporting equipment, power maintenance, heating and cooling of the site, layout design, location, and staff.

An external disaster recovery site is owned and operated by a third-party provider. External sites can be hot, warm, or cold.

  • Hot site: A fully functional data center with hardware and software, personnel, and customer data. It is typically staffed around the clock and operationally ready in the event of a disaster.
  • Warm site: An equipped data center that doesn’t have customer data. An organization can install additional equipment and introduce customer data following a disaster.
  • Cold site: A facility with the infrastructure to support IT systems and data, but no technology until an organization activates its DR plan and installs equipment. Cold sites are sometimes used to supplement hot and warm sites during a long-term disaster.
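
The differences between the three site types can be summarized as a small lookup structure. This is a simplified sketch based only on the descriptions above. The class and field names are hypothetical, and real sites may fall between these categories.

```python
from dataclasses import dataclass
from enum import Enum


class SiteType(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


@dataclass(frozen=True)
class SiteReadiness:
    has_infrastructure: bool  # facility able to support IT systems
    has_equipment: bool       # hardware and software already installed
    has_customer_data: bool   # customer data already present on site


READINESS = {
    SiteType.HOT: SiteReadiness(True, True, True),
    SiteType.WARM: SiteReadiness(True, True, False),
    SiteType.COLD: SiteReadiness(True, False, False),
}


def steps_before_recovery(site: SiteType) -> list[str]:
    """List what still has to happen before workloads can run at the site."""
    readiness = READINESS[site]
    steps = []
    if not readiness.has_equipment:
        steps.append("install equipment")
    if not readiness.has_customer_data:
        steps.append("load customer data")
    return steps


for site in SiteType:
    print(site.value, "->", steps_before_recovery(site) or "ready")
```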

A cloud recovery site is another option. An organization should consider site proximity, internal and external resources, operational risks, service-level agreements, and cost when contracting with cloud providers to host their DR assets or outsourcing additional services.

Disaster recovery tiers

In addition to choosing the most appropriate DR site, organizations may find it helpful to consult the tiers of disaster recovery identified by the SHARE Technical Steering Committee and IBM in the 1980s. The tiers describe a range of recovery options that organizations can use as a blueprint to determine the best DR approach for their business needs.

Another type of DR tiering involves assigning levels of importance to different types of data and applications and treating each tier differently based on the tolerance for data loss. This approach recognizes that some mission-critical functions may not be able to tolerate any data loss or downtime, while others can be offline for longer or have smaller sets of data restored.
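
A minimal sketch of this second kind of tiering follows. The tier boundaries (zero loss, up to four hours, anything longer) and the application names are assumptions chosen for illustration. Each organization would set its own thresholds.

```python
from datetime import timedelta


def assign_tier(tolerable_data_loss: timedelta) -> int:
    """Map an application's tolerance for data loss to a hypothetical tier.

    Tier 1: no data loss tolerated (mission-critical).
    Tier 2: up to four hours of data loss tolerated.
    Tier 3: longer data loss tolerated.
    """
    if tolerable_data_loss == timedelta(0):
        return 1
    if tolerable_data_loss <= timedelta(hours=4):
        return 2
    return 3


# Hypothetical applications and their tolerances.
applications = {
    "payment processing": timedelta(0),
    "order history": timedelta(hours=2),
    "internal wiki": timedelta(days=1),
}

for name, tolerance in applications.items():
    print(f"{name}: tier {assign_tier(tolerance)}")
```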

Types of disaster recovery

In addition to choosing a DR site and considering DR tiers, IT and business leaders must evaluate the best way to put their DR plan into action. This will depend on the IT environment and the technology the business chooses to support its DR strategy.

Types of DR can vary, based on the IT infrastructure and assets that need protection as well as the method of backup and recovery the organization decides to use. Depending on the size and scope of the organization, it may have separate DR plans and implementation teams specific to departments such as data centers or networking. Major types of DR include:

Data center disaster recovery

Organizations that house their own data centers must have a DR strategy that considers all the IT infrastructure within the data center as well as the physical facility. Backup to a failover site at a secondary data center or a colocation facility is often a large part of the plan (see “Disaster recovery sites” above). IT and business leaders should also document and make alternative arrangements for a wide range of facilities-related components including power systems, heating and cooling, fire safety, and physical security.

Network disaster recovery

Network connectivity is essential for internal and external communication, data sharing, and application access during a disaster. A network DR strategy must provide a plan for restoring network services, especially in terms of access to backup sites and data.

Virtualized disaster recovery

Virtualization enables DR by allowing organizations to replicate workloads in an alternate location or the cloud. The benefits of virtual DR include flexibility, ease of implementation, efficiency, and speed. Virtualized workloads have a small IT footprint, replication can be done frequently, and failover can be initiated quickly. Several data protection vendors offer virtual backup and DR as a product.

Cloud disaster recovery

The widespread adoption of cloud services now allows organizations that previously used an alternate physical location for DR to host their recovery environment in the cloud instead. Cloud DR goes beyond simply backing up to the cloud. It requires an IT team to set up automatic failover of workloads to a public cloud platform in the event of a disruption.

Disaster recovery as a service (DRaaS)

DRaaS is the commercially available version of cloud DR. In DRaaS, a third party provides replication and hosting of an organization’s physical and virtual servers. The provider assumes responsibility for implementing the DR plan when a crisis arises, based on a service-level agreement.

Disaster recovery services and vendors

Disaster recovery vendors can take many forms, as DR is more than just an IT issue. DR vendors include those selling backup and recovery software, as well as those offering hosted or managed services. Because DR is also an element of organizational risk management, some vendors couple disaster recovery with other aspects of security planning, such as incident response and emergency planning. Options include:

  • Backup and data protection platforms
  • DRaaS providers
  • Add-on services from data center and colocation providers
  • Infrastructure-as-a-service providers

Choosing the best option for an organization ultimately depends on top-level business continuity plans and data protection goals, and which option best meets those needs along with budgetary goals.

Disaster recovery from a storage point of view

Storage replication is one building block of DR. Microsoft Azure Storage, for example, offers the following replication options:

  • Locally redundant storage (LRS): Microsoft creates three copies of the data in your storage account within a single data center in your home region.
  • Zone-redundant storage (ZRS): The three copies of your data are distributed across different data centers in your home region.
  • Georedundant storage (GRS): Three copies of your data are distributed across data centers in your home region, and an additional three copies are placed in a secondary region chosen by Microsoft.
  • Read-access georedundant storage (RA-GRS): This option is similar to GRS, but it also provides read-only access to the copies in the secondary region.

The table below summarizes the largest failure that each option protects against:

Replication option    Protects against the failure of
LRS                   A storage array within a single data center in your home region
ZRS                   A single data center in your home region
GRS                   Your home region
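
The replication options above can be expressed as a small lookup that answers the question "which options would survive a failure of this size?" This is an illustrative sketch, not an Azure API. The class and function names are hypothetical, and treating RA-GRS as surviving a regional failure is an assumption based on its similarity to GRS, since the table does not list it.

```python
from dataclasses import dataclass
from enum import IntEnum


class FailureScope(IntEnum):
    """Failure sizes, ordered from smallest to largest."""
    STORAGE_ARRAY = 1
    DATA_CENTER = 2
    REGION = 3


@dataclass(frozen=True)
class ReplicationOption:
    name: str
    total_copies: int
    largest_failure_survived: FailureScope
    secondary_readable: bool


OPTIONS = [
    ReplicationOption("LRS", 3, FailureScope.STORAGE_ARRAY, False),
    ReplicationOption("ZRS", 3, FailureScope.DATA_CENTER, False),
    ReplicationOption("GRS", 6, FailureScope.REGION, False),
    # Assumption: RA-GRS protects against the same failures as GRS.
    ReplicationOption("RA-GRS", 6, FailureScope.REGION, True),
]


def options_surviving(scope: FailureScope) -> list[str]:
    """Return the names of options that protect against the given failure."""
    return [o.name for o in OPTIONS if o.largest_failure_survived >= scope]


print(options_surviving(FailureScope.DATA_CENTER))
```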

References

  • https://cloud.google.com/learn/what-is-disaster-recovery
  • https://www.vmware.com/topics/glossary/content/disaster-recovery.html
  • https://aws.amazon.com/what-is/disaster-recovery/
  • https://azure.microsoft.com/en-us/solutions/backup-and-disaster-recovery
  • https://www.techtarget.com/searchdisasterrecovery/definition/disaster-recovery
  • Jack A. Hyman - Microsoft Azure For Dummies (2023)