Cloud platforms, as remotely managed services, come with service-level agreements (SLAs) that guarantee an uptime percentage or your money back. These SLAs, together with the shift of responsibility for infrastructure maintenance from your organization or colocation provider to the cloud providers you use, have fostered an expectation that cloud services will “just work”, even though reality often falls short. Computing infrastructure has become faster and cheaper over time, but a server today is not meaningfully more reliable than one a decade ago, because the root causes of outages are often environmental or the result of third-party error.
Some outages over the past two years have been eyebrow-raising in their origin, effect or circumstances. The fire that destroyed OVHcloud’s Strasbourg SBG2 facility in March 2021 was the result of a faulty repair to an uninterruptible power supply. Cooling systems failed to keep pace with the London heatwave in July 2022, leading to outages at Google Cloud Platform and Oracle Cloud Infrastructure. Although not cloud-specific, the 2020 Nashville bombing damaged a significant amount of telecoms equipment, causing regional outages. Given rising global temperatures owing to climate change (and rising political temperatures), the potential for climate- or extremism-related outages is real.
Of course, comparatively mundane factors also lead to outages, such as bad software deployments, software supply chain problems, power failures and networking issues ranging in severity from tripped-over cables to fibre cuts. Naturally, no discussion of outages would be complete without a mention of DNS and BGP-related outages, which were cited as the root cause of incidents at Microsoft Azure, Salesforce, Facebook and Rogers Communications over the past two years.
Engineer Like a Storm Is Coming
If your application is mission-critical, deployment and instrumentation should reflect that. Consider where the single points of failure are; deploying only to one region in a single cloud provides no redundancy. Use of a content delivery network (CDN) can provide cached versions of pages in the event of an outage, which provides utility for serving relatively static content, though use of a CDN alone will not maintain full feature availability.
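One way to keep a single region from becoming a single point of failure is failover between regional deployments. The sketch below is a minimal illustration of that decision logic: it probes regional endpoints in priority order and routes to the first healthy one. The endpoint names and the `/healthz` path are hypothetical, not taken from any provider.

```python
# Minimal sketch of regional failover: probe endpoints in priority
# order and use the first one that reports healthy. Endpoint names
# and the /healthz path are illustrative assumptions.
from urllib.request import urlopen

REGIONS = [
    "https://eu-west.example.com",  # primary (hypothetical)
    "https://us-east.example.com",  # secondary (hypothetical)
]

def http_probe(base: str, timeout: float = 2.0) -> bool:
    """Hit a health endpoint; any exception counts as unhealthy."""
    with urlopen(base + "/healthz", timeout=timeout) as resp:
        return resp.status == 200

def first_healthy(endpoints, probe=http_probe):
    """Return the first endpoint whose probe succeeds, else None.

    None signals a total outage, at which point a CDN can still
    serve cached, mostly static pages.
    """
    for base in endpoints:
        try:
            if probe(base):
                return base
        except OSError:
            continue  # region unreachable; fail over to the next
    return None
```

In practice this logic usually lives in a managed health-checking service or a global load balancer rather than in client code; the sketch only illustrates the failover decision itself.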
Deploying to multiple regions in a single cloud is the lowest-friction means of ensuring availability, although architecting a scalable application whose constituent components can be distributed involves significant engineering time and infrastructure cost. Operating and maintaining individual service units — including data stores — that are deployed to geographically separate facilities is a significant endeavour that needs thoughtful planning and institutional support to accomplish.
Arguments could be made here for multicloud: operating parallel infrastructure to eliminate a single point of failure is enticing, but expensive, complex and duplicative, requiring institutional knowledge of two different cloud platforms and accommodating both as equals in every step of your production processes. Similarly, compelling arguments could be made in these circumstances for hybrid cloud, but this too is complex. Some of this complexity can be managed through initiatives such as AWS Outposts, Azure Stack Hub and IBM Cloud Satellite, which provide consistent operating environments across public and private infrastructure.
Using these offerings as the sole hedge against outages is short-sighted: it exchanges reliability problems for complexity problems, introducing a new avenue from which outages could occur.
You Need Site Reliability Engineering
Adopting site reliability engineering (SRE) makes it possible to embrace complexity productively and improve reliability, given careful planning, clearly articulated roles and well-defined incident management processes.
Site reliability engineers are generally tasked with reducing “toil” (repetitive, manual work directly tied to running a service) as well as defining and measuring reliability goals: the service-level indicators (SLIs) and service-level objectives (SLOs) that map onto the SLAs of a cloud or infrastructure provider. Measuring these, and application performance generally, is the job of observability tools, which let site reliability engineers and other troubleshooters ask questions of an environment that they did not know to ask before an incident.
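As a minimal illustration of how a service-level indicator relates to an objective, the sketch below computes an availability SLI from request counts and the share of the error budget (the failure rate allowed by the SLO) still unspent. The request counts and the 99.9% objective are hypothetical numbers, not figures from any provider's SLA.

```python
# Sketch of an availability SLI and error budget over a compliance
# window. All numbers below are illustrative assumptions.

def availability_sli(successful: int, total: int) -> float:
    """Fraction of requests served successfully: a common SLI."""
    return successful / total if total else 1.0

def error_budget_remaining(sli: float, slo: float) -> float:
    """Share of the error budget still unspent.

    The budget is the allowed failure rate (1 - SLO); what has been
    spent is the observed failure rate (1 - SLI).
    """
    budget = 1.0 - slo
    spent = 1.0 - sli
    return max(0.0, (budget - spent) / budget)

# Example: 9,994,000 of 10,000,000 requests succeeded against a
# 99.9% availability SLO, leaving 40% of the error budget unspent.
sli = availability_sli(9_994_000, 10_000_000)        # 0.9994
remaining = error_budget_remaining(sli, slo=0.999)   # ~0.4
```

A burn-down like this is what turns an SLA's headline uptime percentage into something engineers can act on day to day: when the remaining budget approaches zero, risky deployments can be deferred in favour of reliability work.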
Although approaches to implementing SRE vary (and with them, the responsibilities of reliability engineers), a distinction holds between site reliability engineers and platform teams. Platform teams are tasked with building out the infrastructure in an IT estate, while site reliability engineers fill multidisciplinary roles responsible for the reliability of the infrastructure, applications and tooling an organization uses to deliver a product or service to customers.
Assume the Worst, But Hope for the Best
The ubiquity of cloud platforms gives consumers a level of visibility that traditional data centre operators never had: services such as Downdetector illustrate the relationship between cloud outages and outages of the consumer brands built on those platforms. Downdetector externally, and observability tools internally, provide a real-time understanding of cloud outages that may not be reflected on a cloud platform’s service status pages. Vendor-provided dashboards require manual intervention to acknowledge a service degradation or outage, making them an editorial product rather than an automated, real-time view of a platform’s status.
That is not to imply wrongdoing — there are useful reasons to limit information, particularly to avoid tipping off threat actors about the degree to which a service is stressed by an attack.
Cloud platform operators are, naturally, working to improve reliability and reduce the effect of outages. Microsoft’s introduction of Azure Availability Zones to logically separate infrastructure in the same data centre region is one attempt to improve overall reliability, and IBM’s work to strengthen platform reliability has reduced major incidents by 90% in a year.
Disruptions in cloud platforms, network hiccups — for infrastructure or end users — and the unpredictable effects of software changes or “code rot” all mean there is practically no way to guarantee perfect uptime of an application. But thoughtful planning and resource allocation can reduce the severity of incidents. Proactively engineering for instability requires upfront investment, but this is preferable to emergency firefighting.
A version of this article was first published on Computer Weekly on 5 October 2022.