Create a Culture of Strength: Resilience Engineering

Resilience Engineering

As the saying goes, “The best defense is a good offense.” The adage holds for both football and software development. When it comes to protecting themselves against disruptions, organizations tend to bulk up after a disaster rather than move on the offensive beforehand. In manufacturing, this can take the form of an inventory stockpile (which incurs obvious storage costs) or of spending more money on equipment, people, or floor space in reaction to machinery failure.

Reactions to IT catastrophes are very similar. When disasters happen, the business instinct is to throw money, time, and people at the problem until it’s fixed, and all of these are costly responses. With governments and organizations becoming increasingly reliant on technology, the stakes are higher than ever. Software failures are no trivial matter; whole businesses, and even people’s lives, are at stake.

IT disasters trigger the same habitual reaction as manufacturing failures: adding further buffers to an already complex system in an attempt to prevent the same disruptions from repeating.

Whether your team works in a manufacturing value stream or a technology one, though, move on the offensive and take proactive steps to reduce these disasters.

DevOps encourages teams to take a systemic view following IT disasters. Rather than looking for someone to blame, DevOps practices prompt an examination of all the factors, both human and technical, that contributed to the failure, and they do so in a blame-free setting. Practices like the blameless post-mortem bring together everyone involved to find ways to prevent repeat issues with reliability, resiliency, security, and cloud service recoverability, minus the “should have done this” mindset.

Once the contributing factors are understood, the next exercise is to turn them into small, incremental changes and tasks that serve as countermeasures against future failures.

Resilience Through Destruction

Designing fault-tolerant architecture is not enough to prevent IT disasters. While cloud-based infrastructure is all about redundancy and fault-tolerance, there is no way to guarantee 100% uptime. Systems must be stronger than their weakest link. To achieve this state, teams must become better at problem-solving through self-diagnostics and self-improvement by learning from failures and mistakes alike. Only once there is strength in the working culture to accept and move past disasters without apportioning blame will technicians have the confidence to push further with actions to prevent disastrous events in the future.

Monkey Business

Netflix’s “Simian Army” is showcased so often as the model case study for resilience engineering that I won’t bore you with an overly in-depth look. For those who haven’t come across the renowned primate suite of resilience engineering tools before, here’s the catch-up:

Originally, the Simian Army was an internal suite of Netflix tools. These tools helped the company weather the 2011 AWS US-East outage that dramatically affected other internet-based businesses at the time. The Army’s mission is to keep the cloud operating in top form, and it does so by randomly disabling production instances on AWS infrastructure. Chaos Monkey, the principal member of the Simian Army, is a resiliency tool that tests whether applications such as Netflix can tolerate this kind of fault injection. The process happens within a carefully monitored environment. Doing this exercise regularly allows IT teams to build up the necessary reactions and recovery mechanisms for unplanned disruptions.
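To make the idea of randomly disabling instances more concrete, here is a minimal sketch in Python. It is not Netflix’s Chaos Monkey and makes no AWS calls: the fleet is a simulated in-memory list, and the names `Instance`, `unleash_monkey`, `serve_request`, and `recover`, along with the rule of always sparing the last healthy instance, are assumptions made purely for illustration.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """A stand-in for one production server (hypothetical, in-memory only)."""
    name: str
    healthy: bool = True


def unleash_monkey(fleet, rng, kill_probability=0.5):
    """Randomly disable instances, but never the last healthy one.

    The "spare the last one" guard is an assumption of this sketch,
    standing in for the carefully monitored environment described above.
    """
    victims = []
    for instance in fleet:
        healthy_count = sum(1 for i in fleet if i.healthy)
        if instance.healthy and healthy_count > 1 and rng.random() < kill_probability:
            instance.healthy = False
            victims.append(instance.name)
    return victims


def serve_request(fleet):
    """Route a request to the first healthy instance; fail only if none remain."""
    for instance in fleet:
        if instance.healthy:
            return f"served by {instance.name}"
    raise RuntimeError("total outage: no healthy instances")


def recover(fleet):
    """Mimic an automated recovery mechanism by restoring disabled instances."""
    restored = [i.name for i in fleet if not i.healthy]
    for instance in fleet:
        instance.healthy = True
    return restored


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so a drill can be replayed
    fleet = [Instance(f"web-{n}") for n in range(1, 5)]
    print("disabled:", unleash_monkey(fleet, rng))
    print(serve_request(fleet))  # the service should still answer
    print("restored:", recover(fleet))
```

The point of the exercise is the check in the middle: after the monkey has run, a request should still be served. If it is not, the team has found a weakness during a planned drill rather than during a real outage.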

Furthermore, the practice suits more than just cloud-based companies; the same resilience engineering can be adapted for traditional corporate IT environments too. After all, everyone experiences IT disasters.

For more on creating a just, learning culture with DevOps, check out the article Why You Need a DevOps Consultant.

Caylent is a cloud-native services company that helps organizations bring the best out of their people and technology using AWS. We are living in a software-defined world where technology is at the core of every business. To thrive in this paradigm, organizations need to empower their people and processes through technology. Caylent is uniquely positioned to fuel that engine of innovation by bringing ambitious ideas to life for our customers.

Caylent works with customers to build, scale and optimize sophisticated cloud solutions using deep subject matter expertise to deliver world-class outcomes through an agile co-delivery model.
