Developers
September 15, 2020

Chaos Engineering and Fault Injection: What They Are and Why They Matter

Chaos engineering and fault injection may be two of the most important contributors to cloud availability.

There are few technologies more important than cloud computing. Already the darling of the tech industry, cloud computing has become even more vital since the coronavirus pandemic.

As employees were sent home amid lockdowns and quarantines, companies began to ramp up their migration to cloud services. While cloud services have proven invaluable in helping companies remain active during the pandemic, they come with their own set of challenges.

One of the most important steps a company can take when deploying cloud services is to test exhaustively to find failure points. Why is this so important? What role do chaos engineering and fault injection play?

The Benefits and Challenges of Cloud Computing

Cloud computing has ushered in a new era of business operations, making it possible for people to work in ways that could hardly be imagined just a few decades ago.

Thanks to cloud computing, employees can work from home with the same efficiency and productivity as in the office. What’s more, employees can use a plethora of devices, including desktops, laptops, tablets, and smartphones. In many cases, Bring Your Own Device (BYOD) policies result in significant savings for corporations.

Further benefiting companies is the cost of cloud computing. Rather than requiring companies to invest millions in on-premises infrastructure, cloud computing lets them pay on demand for the infrastructure, services, and software they need. This scalability also makes it possible to rapidly spin up new services and products, often much faster than could be achieved with legacy services.

For all its many positives, cloud computing has unique challenges that organizations must deal with. One of the biggest is the decentralized, distributed nature of cloud platforms and applications.

With on-premises computing, the IT department is responsible for everything that happens. Whether it be an operating system (OS) upgrade or email troubleshooting, everything is onsite and accessible. In contrast, by its very nature, cloud computing is far more decentralized. Everything from the OS to the individual applications is running on remote computers.

Similarly, cloud computing platforms and services often rely on a variety of third-party frameworks, APIs, services, and tools. These can range from minor tools to mission-critical applications. With each third-party resource, however, there is an added layer of complexity. It’s no longer enough for IT personnel to make sure their systems, infrastructure, services, and software are operational. Now they have to worry about a third-party service bringing their entire cloud ecosystem down.

Chaos Engineering and Fault Injection to the Rescue

As a result, extensive testing is required to ensure a system has the resilience to fail gracefully and ultimately recover from disruptions.

“Chaos engineering is the practice of subjecting a system to the real-world failures and dependency disruptions it will face in production,” reads the Microsoft Azure blog. “Fault injection is the deliberate introduction of failure into a system in order to validate its robustness and error handling.

“Through the use of fault injection and the application of chaos engineering practices generally, architects can build confidence in their designs – and developers can measure, understand, and improve the resilience of their applications. Similarly, Site Reliability Engineers (SREs) and in fact anyone who holds their wider teams accountable in this space can ensure that their service level objectives are within target, and monitor system health in production. Likewise, operations teams can validate new hardware and datacenters before rolling out for customer use. Incorporation of chaos techniques in release validation gives everyone, including management, confidence in the systems that their organization is building.

“Throughout the development process, as you are hopefully doing already, test early and test often. As you prepare to take your application or service to production, follow normal testing practices by adding and running unit, functional, stress, and integration tests. Where it makes sense, add test coverage for failure cases, and use fault injection to confirm error handling and algorithm behavior. For even greater impact, and this is where chaos engineering really comes into play, augment end-to-end workloads (such as stress tests, performance benchmarks, or a synthetic workload) with fault injection. Start in a pre-production test environment before performing experiments in production, and understand how your solution behaves in a safe environment with a synthetic workload before introducing potential impact to real customer traffic.”
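To make the idea of using fault injection "to confirm error handling" concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not Microsoft's tooling: `FaultInjector`, `fetch_price`, and `get_price` are hypothetical names invented for this example, and the "third-party service" is a local stand-in function rather than a real network call.

```python
import random


class InjectedFault(ConnectionError):
    """Raised by the injector in place of a real dependency failure."""


class FaultInjector:
    """Wraps a callable and fails a configurable fraction of its calls."""

    def __init__(self, failure_rate, seed=None):
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.failure_rate = failure_rate
        self._rng = random.Random(seed)  # seeded so an experiment can be repeated
        self.injected = 0  # number of faults injected so far

    def wrap(self, func):
        def wrapper(*args, **kwargs):
            if self._rng.random() < self.failure_rate:
                self.injected += 1
                raise InjectedFault(f"injected fault in {func.__name__}")
            return func(*args, **kwargs)

        return wrapper


def fetch_price(item):
    """Hypothetical stand-in for a call to a third-party service."""
    return {"item": item, "price": 10.0, "source": "live"}


def get_price(item, fetch, retries=2):
    """Code under test: retry the dependency, then degrade to a fallback."""
    for _ in range(retries + 1):
        try:
            return fetch(item)
        except ConnectionError:
            continue  # try again until the retry budget is spent
    return {"item": item, "price": None, "source": "fallback"}


def run_experiment(failure_rate, calls=100, seed=42):
    """Drive a synthetic workload through the faulty dependency."""
    injector = FaultInjector(failure_rate, seed=seed)
    flaky_fetch = injector.wrap(fetch_price)
    results = [get_price("widget", flaky_fetch) for _ in range(calls)]
    return injector, results
```

Running `run_experiment(1.0)` simulates a total outage of the dependency, so every result should come back from the fallback path instead of raising. Running it with a rate such as `0.3` shows how often retries alone are enough to hide the failures.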

As Microsoft points out, care must be taken with fault injection: experiments should be proven in a pre-production environment before they are allowed anywhere near real customer traffic. It’s also important to control who, and how many people, can perform these kinds of tests. As with any powerful tool, extreme caution must be exercised to make sure attackers aren’t able to exploit fault injection tooling.
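One way to picture this kind of control is a gate that every experiment must pass before any fault is injected. The Python sketch below is only an assumption about how such a gate could look: the operator names, environment names, and the `authorize_experiment` function are hypothetical, and a real deployment would rely on its platform's identity and access management rather than a hard-coded list.

```python
# Hypothetical allowlist of people or roles permitted to run experiments.
ALLOWED_OPERATORS = frozenset({"sre-oncall", "chaos-admin"})

# Environments where experiments may run without extra sign-off (assumed names).
SAFE_ENVIRONMENTS = frozenset({"dev", "staging"})


class ExperimentNotAllowed(Exception):
    """Raised when a fault injection experiment fails the gate."""


def authorize_experiment(operator, environment, production_approved=False):
    """Return True if the experiment may run, otherwise raise."""
    if operator not in ALLOWED_OPERATORS:
        raise ExperimentNotAllowed(f"{operator!r} may not run experiments")
    if environment in SAFE_ENVIRONMENTS:
        return True
    if environment == "production" and production_approved:
        return True  # production requires an explicit, separate approval
    raise ExperimentNotAllowed(f"experiments are blocked in {environment!r}")
```

The design choice here is to deny by default: an unknown operator or an unrecognized environment raises an error, so a misconfiguration stops the experiment instead of letting it run.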

In today’s cloud-centric world, chaos engineering and fault injection are two important tools for ensuring a cloud platform is up to the task. When used responsibly, these tools can give an organization confidence that its cloud services will be able to weather whatever is thrown at them.

Tags: Chaos Engineering, Fault Injection, Cloud Computing
Matt Milano
Technical Writer
Matt is a tech journalist and writer with a background in web and software development.
