07.16.20

The Principles of Chaos Engineering

By Mauricio Ashimine

Resilience is a goal for anyone who uses Kubernetes to run apps and microservices in containers. A resilient system can lose a portion of its microservices and components without the entire system becoming inaccessible.

Resilience is achieved by integrating loosely coupled microservices. When a system is resilient, microservices can be updated or taken down without having to bring the entire system down. Scaling becomes easier too, since you don’t have to scale the whole cloud environment at once.

That said, resilience is not without its challenges. Building microservices that are independent yet work well together is not easy. You also have to create and maintain a reliable system with high fault tolerance. This is where Chaos Engineering comes into play.

What Is Chaos Engineering?

Chaos Engineering has been around for almost a decade now, but it is still a relevant and useful concept to apply when improving your system's architecture. In essence, Chaos Engineering is the process of deliberately triggering and injecting faults into a system. Instead of waiting for errors to occur, engineers take deliberate steps to cause (or simulate) errors in a controlled environment.
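To make the idea of deliberate fault injection concrete, here is a minimal sketch in Python. It is an illustration only, not how any particular Chaos Engineering tool works: the names `with_faults`, `InjectedFault`, `get_price`, and `get_price_with_fallback` are hypothetical, and `get_price` simply stands in for a call to a downstream microservice.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of a real failure during an experiment."""


def with_faults(func, failure_rate=0.0, delay_seconds=0.0, rng=None):
    """Wrap func so that calls may be delayed or fail on purpose.

    failure_rate is the probability (0.0 to 1.0) that a call raises
    InjectedFault. delay_seconds is extra latency added to every call.
    """
    rng = rng or random.Random()

    def wrapper(*args, **kwargs):
        if delay_seconds > 0:
            time.sleep(delay_seconds)  # simulate inter-service latency
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)

    return wrapper


def get_price(item_id):
    # Hypothetical stand-in for a call to a downstream microservice.
    return {"item_id": item_id, "price": 9.99}


def get_price_with_fallback(fetch, item_id):
    # The caller under test: it should degrade gracefully, not crash.
    try:
        return fetch(item_id)
    except InjectedFault:
        return {"item_id": item_id, "price": None}


if __name__ == "__main__":
    flaky = with_faults(get_price, failure_rate=0.5)
    for item in range(5):
        print(get_price_with_fallback(flaky, item))
```

Wrapping the dependency rather than editing it keeps the fault easy to switch off again, which matters once experiments move closer to production.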

Chaos Engineering allows for better, more advanced resilience testing. Developers can now experiment in cloud-native distributed systems. Experiments involve testing both the physical infrastructure and the cloud ecosystem.

Chaos Engineering is not a new approach. In fact, companies like Netflix have for years been running resilience tests with Chaos Monkey, an in-house Chaos Engineering framework designed to improve the strength of cloud infrastructure.

When dealing with a large-scale distributed system, Chaos Engineering provides an empirical way of building confidence by anticipating faults instead of reacting to them. The chaotic condition is triggered intentionally for this purpose.

There are a lot of analogies depicting how Chaos Engineering works, but the traffic light analogy represents the concept best. Conventional testing is similar to testing traffic lights individually to make sure that they work.

Chaos Engineering, on the other hand, means switching off the traffic lights across a busy group of intersections to see how traffic reacts to the chaos of losing them. Since the test is run deliberately, more insights can be collected from the process.

Why Use Chaos Engineering?

Chaos Engineering and fault testing overlap considerably, but fault testing checks one specific condition rather than real chaos. Chaos Engineering lets you gather more insights, down to identifying nodes that are not behaving as intended or that generate errors under certain conditions.

Chaos Engineering is not for testing every condition, but it covers a lot of ground. There are several things you can learn about the system through chaos testing.

  • How inter-service latency causes bottlenecks and grinds the system to a slow pace.
  • How drivers of different versions or variations affect system performance and stability, checked through routine driver tests.
  • How your system is affected when memory and CPU usage are maxed out, particularly in scalable Elastic environments.
  • How the system responds to fault injection, particularly the introduction of multiple faults at the same time.
  • How the system copes with a simulated catastrophic failure of the entire cloud architecture or datacenter.

Chaos Engineering, on the other hand, is not very effective when used to test specific faults, especially when the faults are siloed in one or two microservices. Chaos experiments are also more suitable for revealing effects and impacts that you are not aware of and confirming your suspicions after running into several similar errors.

The Principles of Chaos Engineering

From the previous explanations, it is clear that Chaos Engineering is meant to uncover faults and weaknesses in a system. You want to run the Chaos experiment in a controlled and close-to-ideal environment so that outside factors do not skew the results.

To run the experiment, there are several principles that you need to follow.

Formulate a Hypothesis

This is the start of the experiment. You begin by making assumptions about how the system will react when certain faults and conditions are introduced. The hypothesis lets you identify which metrics to measure.
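One way to keep a hypothesis honest is to write it down as something a program can check. The sketch below is only an illustration: the `Hypothesis` class, the `error_rate` metric, the 1% threshold, and the sample readings are all made-up assumptions, not values from a real system.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """A measurable claim about how the system behaves under a fault."""

    description: str
    metric: str          # name of the metric the hypothesis is judged on
    max_allowed: float   # highest value of that metric still considered OK

    def holds(self, observed):
        """Return True if the observed metric stays within the bound."""
        return observed[self.metric] <= self.max_allowed


# Hypothetical example values, chosen only for illustration.
hypothesis = Hypothesis(
    description="Checkout error rate stays under 1% when one replica is lost",
    metric="error_rate",
    max_allowed=0.01,
)

baseline = {"error_rate": 0.002}      # reading before the fault
during_fault = {"error_rate": 0.035}  # reading while the fault is active

if __name__ == "__main__":
    print("holds at baseline:", hypothesis.holds(baseline))
    print("holds during fault:", hypothesis.holds(during_fault))
```

Naming the metric inside the hypothesis is the point: it tells you what to measure before the experiment starts.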

Identify Variables

Chaos Engineering relies on real-life events: each Chaos experiment should reflect events and situations that could plausibly occur and that you want to anticipate. When choosing variables, weigh the probability of each variable manifesting against its estimated impact on your system.
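A simple way to weigh probability against impact is to multiply the two and sort by the result. This is a rough sketch under that assumption; the candidate faults, their probabilities, and the 1-to-10 impact scale are invented for illustration, and `rank_variables` is a hypothetical helper.

```python
def rank_variables(candidates):
    """Sort candidate faults by probability x impact, highest score first."""
    return sorted(
        candidates,
        key=lambda c: c["probability"] * c["impact"],
        reverse=True,
    )


# Hypothetical estimates: probability per month, impact on a 1-10 scale.
candidates = [
    {"name": "datacenter outage", "probability": 0.01, "impact": 10},
    {"name": "single pod crash", "probability": 0.30, "impact": 2},
    {"name": "inter-service latency spike", "probability": 0.10, "impact": 5},
]

if __name__ == "__main__":
    for candidate in rank_variables(candidates):
        score = candidate["probability"] * candidate["impact"]
        print(f"{candidate['name']}: {score:.2f}")
```

A ranking like this is only a starting point. Rare but severe events may still deserve an experiment even when their score is low.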

Automation

Automated orchestration is another important component of Chaos Engineering. Using automated orchestration, you can perform state condition checks and analyze application and data integrity more comprehensively than when you do everything manually. In fact, manual experiments are far from sustainable.
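As a rough sketch of what automation can look like, the function below runs one experiment end to end: it checks the state of the system, injects the fault, checks again, and always rolls back. Everything here is hypothetical, including `run_experiment` and the in-memory `FakeService`, which stands in for a real service so the example can run on its own.

```python
def run_experiment(check_steady_state, inject_fault, rollback):
    """Run one experiment and return a short result record."""
    # Never inject a fault into a system that is already unhealthy.
    if not check_steady_state():
        return {"status": "skipped", "reason": "unhealthy before injection"}
    try:
        inject_fault()
        healthy = check_steady_state()
    finally:
        rollback()  # undo the fault even if a step above raised
    return {"status": "passed" if healthy else "failed"}


class FakeService:
    """Hypothetical in-memory stand-in for a service with replicas."""

    def __init__(self, replicas, min_healthy):
        self.replicas = replicas
        self.min_healthy = min_healthy

    def is_healthy(self):
        return self.replicas >= self.min_healthy

    def kill_replica(self):
        self.replicas -= 1

    def restore_replica(self):
        self.replicas += 1


if __name__ == "__main__":
    service = FakeService(replicas=3, min_healthy=2)
    result = run_experiment(
        service.is_healthy, service.kill_replica, service.restore_replica
    )
    print(result, "replicas after rollback:", service.replicas)
```

Because the three steps are passed in as functions, the same runner can be scheduled repeatedly with different faults, which is where automation pays off over manual runs.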

Reduced Blast Radius

The next principle is reduced blast radius, which means running the Chaos experiment in a controlled environment to minimize the damage and side effects caused by the test. Many teams run Chaos experiments on their production environment, but with added measures put in place to prevent catastrophic failures. Alternatively, you can create a robust testing environment to further reduce the blast radius.

Scaling the Blast Radius

The blast radius can be scaled to further identify system faults. Limiting the blast radius too much may not always yield the best results, which is why it is sometimes necessary to scale the blast radius up in order to identify faults that relate to real-life behaviors of the system.
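Scaling the blast radius can be sketched as a loop that targets a growing fraction of instances and stops at the first failure. The code below is an assumption-laden illustration: `pick_targets`, `escalate`, the pod names, and the chosen fractions are all hypothetical, and `experiment_passes` stands in for whatever check your own experiment uses.

```python
import random


def pick_targets(instances, fraction, rng=None):
    """Choose a random subset: at least one instance, never more than all."""
    if not instances:
        return []
    rng = rng or random.Random()
    count = max(1, min(len(instances), round(len(instances) * fraction)))
    return rng.sample(instances, count)


def escalate(instances, fractions, experiment_passes, rng=None):
    """Widen the blast radius step by step, stopping at the first failure.

    Returns the largest fraction that passed, or None if the first step
    already failed.
    """
    last_passed = None
    for fraction in fractions:
        targets = pick_targets(instances, fraction, rng)
        if not experiment_passes(targets):
            break
        last_passed = fraction
    return last_passed


if __name__ == "__main__":
    pods = [f"pod-{i}" for i in range(10)]  # hypothetical instance names

    def tolerates_two_losses(targets):
        # Stand-in check: pretend the system survives losing up to 2 pods.
        return len(targets) <= 2

    print(escalate(pods, [0.05, 0.25, 0.5], tolerates_two_losses))
```

Starting small and stopping at the first failure keeps the damage bounded while still showing how far the system can be pushed.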

The principles governing Chaos Engineering, when implemented thoroughly, create the perfect chaos. It is a lot easier to identify faults and understand their impacts when you have controlled chaos, a robust monitoring routine, and the ability to observe real-life system behaviors when faults are introduced.

Are You Ready for Chaos Engineering?

Lastly, it is important to understand whether your organization is ready for Chaos Engineering. Chaos Engineering requires the basic weaknesses to be plugged first. This means getting your system to a point where it is resilient enough to handle network spikes and service failures. Once the prerequisites are met, you can run Chaos experiments to uncover other weaknesses.

Chaos Engineering also requires your system to have maximum visibility. Fortunately, most cloud architectures and environments are compatible with monitoring tools that maximize system visibility. Use Chaos Engineering to review your systems, seek out new systemic weaknesses, and improve your infrastructure’s fault tolerance.

