Data Center Resiliency: What to Know for Max Uptime

If you’ve ever had the misfortune of experiencing a data center outage, you already know something about the importance of data center resiliency. Whether the power’s out or a server component is malfunctioning, outages are expensive—TechChannel reports 91 percent of organizations say a single hour of downtime costs over $300,000.

Outages aren’t a common occurrence, but they do happen. In this post, we’ll talk about what data center resiliency actually means, how your organization can maintain a resilient data center, and some practical steps you can take to minimize downtime.

What is data center resiliency?

Data center resiliency is a data center’s ability to bounce back from an outage or service disruption. While that might sound simple, a lot of careful planning goes into building resilient data centers.

Part of business continuity planning (BCP), resiliency is a crucial pillar of data center design, along with scalability and flexibility. A scalable data center can quickly and efficiently adapt to fluctuations in workload by increasing or decreasing computing or storage capacity.

A flexible data center can accommodate changing business needs with minimal or no disruption. Flexibility includes a number of considerations, from the actual physical dimensions of data halls to cooling methods and energy sources.

Together, the three pillars of data center design maximize uptime and serve an organization’s ever-changing needs.

What are the pillars of a resilient data center?

Speaking of pillars, let’s take a look at the pillars that make up a resilient data center. There are no rules set in stone here, but in general, resiliency is made up of four components:

Redundancy. Creating multiple fail-safes for factors such as networking hardware, internet connectivity, power, and physical security measures. For example, if the power fails—the number-one cause of outages—the data center should have backup generators or batteries to prevent an outage.
Monitoring. In addition to monitoring temperature and humidity levels, technicians should track metrics such as server CPU usage and backup status to catch issues before they become problems.
Alerting. Assuming proper monitoring is in place, trigger-based alerts should be configured to notify technicians of problems.
Testing. Conducting regular stress tests and disaster drills can help prepare a data center for real emergencies and reveal areas for improvement in emergency response.
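
To make the monitoring and alerting pillars concrete, here is a minimal sketch of trigger-based alerting in Python. The metric names and threshold values are hypothetical, chosen only for illustration; a real deployment would take its limits from equipment specifications and its readings from an actual monitoring system.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    metric: str
    max_value: float


# Hypothetical limits for illustration only, not recommended values.
THRESHOLDS = [
    Threshold("inlet_temp_c", 27.0),
    Threshold("humidity_pct", 60.0),
    Threshold("cpu_usage_pct", 90.0),
]


def check_readings(readings):
    """Return one alert message per metric that is missing or over its limit."""
    alerts = []
    for t in THRESHOLDS:
        value = readings.get(t.metric)
        if value is None:
            # A silent sensor is itself worth an alert.
            alerts.append(f"{t.metric}: no reading received")
        elif value > t.max_value:
            alerts.append(f"{t.metric}: {value} exceeds limit {t.max_value}")
    return alerts


sample = {"inlet_temp_c": 29.5, "humidity_pct": 45.0, "cpu_usage_pct": 72.0}
alerts = check_readings(sample)
for message in alerts:
    print(message)
```

In this sketch a missing reading also raises an alert, on the assumption that a sensor going quiet is a problem technicians would want to hear about.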

Of course, a number of other factors affect resilience, from physical location of the data center to staffing levels. And while power failure is the biggest cause of outages, risks such as software bugs, internet disruption, and cooling issues also pose major threats.

Resilience vs. redundancy

These words may sound similar, but it’s a common misconception to think of them as one versus the other. As we said in the previous section, redundancy is a core pillar of resiliency—you’d be hard-pressed to find a resilient data center that doesn’t incorporate redundancy into its design.

Redundancy is typically planned and managed through designs called redundancy configurations. There are several redundancy configurations, including:

N+1. Where N represents the necessary capacity to power and cool a data center operating at full workload, N+1 is a redundancy configuration in which an additional backup component has been added to account for the failure of a single component. This is one of the most popular configurations.
2N. Keeping in mind our definition of N, a 2N data center is fully redundant: if the uninterruptible power supply (UPS) for a data center goes down, there is enough backup power to run the data center at full workload until power is restored.
2N+1. A data center with 100 percent available backup power, plus one extra backup component for added peace of mind.
Distributed redundancy or 3N/2. Distributed redundancy (three-to-make-two redundancy) spreads the load across three components sized so that any two of them can carry it. This configuration aims to approach 2N levels of redundancy at a cost closer to that of N+1.
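
As a rough illustration of how these configurations compare, the following Python sketch counts how many equally sized components each one installs. It assumes N is a whole number of identical units (for example, UPS modules) and treats 3N/2 simply as one and a half times N, rounded up. The function name `installed_units` is made up for this example, and real designs involve more than a unit count.

```python
def installed_units(n, config):
    """Return how many equal-sized units a configuration installs.

    n is the number of units needed to carry the full workload (the "N").
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    formulas = {
        "N": n,                    # no redundancy
        "N+1": n + 1,              # one spare unit
        "2N": 2 * n,               # a full duplicate set
        "2N+1": 2 * n + 1,         # a full duplicate set plus one spare
        "3N/2": (3 * n + 1) // 2,  # 1.5 x N, rounded up (simplification)
    }
    if config not in formulas:
        raise ValueError(f"unknown configuration: {config}")
    return formulas[config]


for name in ["N", "N+1", "2N", "2N+1", "3N/2"]:
    print(name, installed_units(4, name))
```

With N set to 4, the sketch yields 4, 5, 8, 9, and 6 units respectively.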

Is the cloud more resilient than on-prem?

While there are certainly benefits to maintaining your own on-premises data center, many organizations choose to run their workloads in the cloud. An organization might do this for a number of reasons, but when it comes to resiliency, the cloud offers some clear advantages.

Backup and recovery

Backup and recovery might be the biggest area where the cloud shines in terms of resilience. Cloud providers offer multiple options for backing up data, with some compute services performing backups automatically. In the event of a disaster, you can recover your data from backups with minimal data loss and business impact.

Some cloud providers also offer disaster recovery as a service (DRaaS). DRaaS lets organizations manage backup and recovery in a third-party cloud environment through a SaaS solution for extra control and peace of mind.

Security

Security is also a big consideration for organizations thinking about migrating to the cloud. Cloud providers like AWS and Microsoft Azure maintain strict physical security protocols as well as network security best practices.

Under the shared responsibility model these providers use, organizations are free to focus on the security of the workloads they run in the cloud, while cloud providers focus on the security of the cloud infrastructure itself. In an on-prem data center, the organization is responsible for both: the infrastructure and the workloads that run on it.

Do your homework

Running a data center is an expensive, resource-intensive process that requires highly trained staff to operate. If you’re thinking about migrating some or all of your workloads to the cloud, it’s important to research different cloud providers to make sure they satisfy your requirements.

Check uptime metrics for the providers you’re considering (most cloud providers make this information publicly available on their websites), and talk to cloud customers to learn more about their experience. A hybrid cloud approach is always an option as well if you decide it doesn’t make sense to migrate all of your workloads to a public or private cloud.
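
When comparing uptime figures, it can help to translate a percentage into the downtime it permits. The sketch below assumes a 365-day year and a simple linear conversion; the function name `allowed_downtime_hours` is made up for this example, and actual provider commitments are often measured per month and defined differently from provider to provider.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(uptime_pct):
    """Convert an annual uptime percentage into hours of permitted downtime."""
    if not 0 <= uptime_pct <= 100:
        raise ValueError("uptime_pct must be between 0 and 100")
    return HOURS_PER_YEAR * (1 - uptime_pct / 100)


for pct in [99.0, 99.9, 99.99]:
    hours = allowed_downtime_hours(pct)
    print(f"{pct}% uptime allows about {hours:.2f} hours of downtime per year")
```

Under these assumptions, 99.9 percent uptime works out to roughly 8.76 hours of downtime per year, and 99.99 percent to a little under one hour.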
