Chaos engineering is a kind of contradiction: it works against the very system it is protecting in order to build an environment that is more resilient and more secure. How does it work? How can deliberately introducing errors be useful, and how does it help to secure the digital environment? Understanding this discipline can lead to substantial improvements in how a system copes with failure.

What is it?

Chaos engineering rests on four principles set out by Netflix. In order, they are: define a “stable” state of the system, form a hypothesis about how that state will hold up, introduce variables that reflect real-world events, and try to disprove the hypothesis.
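The four steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Netflix's implementation: `ToyService`, `run_experiment`, and the `tolerance` parameter are hypothetical names invented for this example, and the "system" is a toy in which requests succeed as long as at least one replica is healthy.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ExperimentResult:
    baseline: float        # metric measured in the stable state
    under_fault: float     # same metric measured while the fault is active
    hypothesis_held: bool  # True if the metric stayed within tolerance


class ToyService:
    """Hypothetical stand-in for a distributed service with N replicas."""

    def __init__(self, replicas: int) -> None:
        self.healthy = [True] * replicas

    def success_rate(self) -> float:
        # Assumption: requests succeed while at least one replica is healthy.
        return 1.0 if any(self.healthy) else 0.0

    def kill_one(self) -> None:
        # Fault injection: take the first healthy replica offline.
        for i, up in enumerate(self.healthy):
            if up:
                self.healthy[i] = False
                return

    def restore(self) -> None:
        # Rollback: bring every replica back.
        self.healthy = [True] * len(self.healthy)


def run_experiment(
    measure: Callable[[], float],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
    tolerance: float,
) -> ExperimentResult:
    # 1. Define the stable state by measuring it.
    baseline = measure()
    # 2. Hypothesis: the metric stays within `tolerance` of the baseline.
    # 3. Introduce a variable that reflects a real-world event.
    inject_fault()
    try:
        # 4. Try to disprove the hypothesis by measuring under fault.
        under_fault = measure()
    finally:
        # Always undo the fault, even if the measurement raises.
        rollback()
    held = abs(under_fault - baseline) <= tolerance
    return ExperimentResult(baseline, under_fault, held)


if __name__ == "__main__":
    redundant = ToyService(replicas=2)
    print(run_experiment(redundant.success_rate, redundant.kill_one,
                         redundant.restore, tolerance=0.01))
    single = ToyService(replicas=1)
    print(run_experiment(single.success_rate, single.kill_one,
                         single.restore, tolerance=0.01))
```

With two replicas the hypothesis survives the loss of one; with a single replica it is disproved, which is exactly the kind of weakness the experiment is meant to surface.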

Through a series of tests, characteristics of the infrastructure such as availability, security, and performance are assessed. The goal is to find and resolve problems in these distributed systems in order to bolster the recovery capabilities of the system as a whole. In short, this means building systems that withstand extreme conditions.

Resilience and “antifragility”

Chaos engineering can only be understood through the definition of “antifragility”, a term coined by Nassim Nicholas Taleb. It is the precursor of chaos engineering and is, in turn, based on resilience. Resilience is defined as the ability to absorb disturbances. These disturbances are caused by stressors, or stress factors, that trigger destabilization.

Resilience is a concept widely applied to living systems (in ecology, physiology, psychology, etc.) and refers to the ability to actively overcome problems and adapt to the situation. “Antifragility” goes beyond resilience, since it implies that the system evolves: it grows from the stress to which it has been subjected and so adapts to new failures.

Panda Adaptive Defense is a tool that keeps a close eye on the principles of antifragility and adds resilience to the company, while increasing visibility into the state of the corporate network.

The Simian Army

Taking all this into account, large companies such as Netflix or Amazon see in chaos engineering a way of testing their infrastructure to make their systems more mature, more robust, and more evolved. In short, more resilient. Since repeatedly analyzing and correcting problems at increasing scale is a very difficult task, they use heuristic strategies that prioritize decision-making aimed simply at resolving problems.

Netflix, for example, uses its own suite of applications called the Simian Army, which tests the stability of its network. The Simian Army has more than a dozen stressors that test the system in various ways. Security Monkey, for example, is just one “piece” of the Simian Army. It implements a security strategy based on chaos engineering in cloud-computing platforms.

How can chaos engineering help companies?

The first question is, why should a company consider using chaos engineering?

Implementing a strategy based on chaos engineering helps to develop the antifragility of a platform, and can also help it meet the control objectives and requirements of PCI-DSS in the event of an audit. Thus, any company could benefit greatly from including a tool such as Security Monkey in its security strategy.

This would require a “chaosification” of the platform in a controlled manner, which could consist of actions such as the following: disabling Security Group (SG) rules, modifying files at random, listening on random ports, injecting malicious traffic into the VPC (Virtual Private Cloud), randomly killing processes while they are running… and the list of havoc-wreaking could go on.
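As a rough sketch of what "in a controlled manner" could mean in practice, the Python below picks a few of the actions listed above at random, defaults to a dry run, and uses a fixed seed so that a round can be replayed. Everything here is an assumption made for illustration: `ACTIONS`, `chaos_round`, and the action names are hypothetical, and the actions only write to a log rather than touching a real cloud account.

```python
import random
from typing import Callable, Dict, List

Action = Callable[[List[str]], None]


def _simulated(name: str) -> Action:
    # Hypothetical placeholder: a real action would call the platform's API.
    def action(log: List[str]) -> None:
        log.append(f"simulated: {name}")
    return action


# Catalogue of chaos actions, named after the examples in the text.
ACTIONS: Dict[str, Action] = {
    name: _simulated(name)
    for name in (
        "disable_security_group_rule",
        "modify_random_file",
        "listen_on_random_port",
        "inject_malicious_traffic_into_vpc",
        "kill_random_process",
    )
}


def chaos_round(
    actions: Dict[str, Action],
    count: int,
    seed: int,
    dry_run: bool = True,
) -> List[str]:
    """Pick `count` distinct actions at random and run (or only log) them."""
    rng = random.Random(seed)  # seeded so the same round can be replayed
    chosen = rng.sample(sorted(actions), k=min(count, len(actions)))
    log: List[str] = []
    for name in chosen:
        if dry_run:
            # Controlled mode: record what would happen, change nothing.
            log.append(f"dry-run: {name}")
        else:
            actions[name](log)
    return log


if __name__ == "__main__":
    for line in chaos_round(ACTIONS, count=2, seed=7):
        print(line)
```

Defaulting to `dry_run=True` and recording every chosen action are deliberate design choices in this sketch: the point of the exercise is to observe consequences, so each round should be reviewable and repeatable.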

Thanks to this tool (or strategy), deeper visibility into the consequences of attacks can be achieved, with the aim of improving defenses. In the long run, this is the basis of a more mature and reliable system, one capable of recovering from attacks and reducing losses in the face of a serious security incident, something that should be mandatory for any high-availability service.