Monkey See, Monkey Do, Monkey Break to Test Your Cloud Infrastructure

It may be time you met the monkeys in the cloud. If your business, like many others, is making strategic use of cloud services, proper management and monitoring are essential. When you did all your computing on site, you could control it directly. With systems and applications in the cloud, things are not the same. The cloud’s great advantage of letting you start up servers whenever you want can be a drawback as well. It becomes too easy to lose track of what is running where, or of where potential points of failure have been created. You need a way of finding out what could go wrong before it goes wrong, and of keeping your cloud consumption within bounds. A collection of software monkeys could be the answer.
Birth of the Chaos Monkey
A few years ago, Netflix, the on-demand Internet media company, created a software tool to test the systems it was running using Amazon Web Services (AWS). The company wanted to protect itself from online failure. It was unthinkable that its customers might be unable to access movies or TV shows via Netflix, because providing this content to the market was how Netflix earned its money. Netflix engineers had a smart idea. To avoid facing an unexpected systems crisis, they would cause different failures in their systems beforehand. That would let them check that their cloud infrastructure could cope while the broken part was repaired. They called their test tool the Chaos Monkey, because it was designed to run around at random and interrupt normal operations.
Meet the Monkey Family
The Chaos Monkey was a success. It helped Netflix to find weaknesses and fix them before they could affect customers. The company went further. It made the Chaos Monkey publicly available as open-source software that can run in other cloud provider environments as well as AWS. It also created its ‘Simian Army’, a set of additional software tools to further test and manage cloud computing. At the moment, the family album of cloud primates contains:

  • Chaos Monkey. Makes software processes or systems unavailable, or makes them perform poorly.
  • Chaos Gorilla. Capable of taking out a whole cloud availability zone (simulating its failure).
  • Conformity Monkey. Checks if processes are running in the way you want them to run.
  • Janitor Monkey. Seeks out and terminates processes that should have been shut down before.

Putting Fun into Making Things Fail
The original Netflix Chaos Monkey simply shut down software processes. Since then it has learnt many new tricks. Just to keep things fun, these tricks have also been given names that draw on Latin words. Here are some of the more accessible examples:

  • Simius Mortus. This is the classic Chaos Monkey trick of shutting down a process.
  • Simius Amputa. Detaches (amputates, right?) disk volumes and makes disk reads and writes fail.
  • Simius Cogitarius. Runs CPU-hungry processes to deprive the real processes of CPU power.
  • Simius Occupatus. The same idea, but with disk-hungry processes that slow down real disk reads and writes.
  • Simius Delirius. Continually kills (every second) the application processes running on an instance.
  • Simius Tardus. Increases the latency of network traffic to simulate network degradation.
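
To make the idea behind these tricks concrete, here is a toy sketch in Python. It is not the Netflix tool and it calls no real cloud API: the Instance class, the unleash_monkey function and the service_is_up check are all hypothetical names invented for this illustration. The "fleet" is just a list of objects in memory. Three of the tricks above are modelled as functions, and the monkey picks one running instance and one enabled trick at random.

```python
import random
from dataclasses import dataclass, field


# Hypothetical in-memory stand-in for a cloud instance (assumption:
# nothing here talks to AWS or any other provider).
@dataclass
class Instance:
    name: str
    running: bool = True
    volumes_attached: bool = True
    extra_latency_ms: int = 0
    faults: list = field(default_factory=list)


def simius_mortus(inst):
    """Shut the instance down."""
    inst.running = False


def simius_amputa(inst):
    """Detach the instance's disk volumes."""
    inst.volumes_attached = False


def simius_tardus(inst, delay_ms=500):
    """Add artificial network latency (delay_ms is an illustrative default)."""
    inst.extra_latency_ms += delay_ms


STRATEGIES = {
    "Simius Mortus": simius_mortus,
    "Simius Amputa": simius_amputa,
    "Simius Tardus": simius_tardus,
}


def unleash_monkey(fleet, rng, enabled=None):
    """Apply one randomly chosen enabled trick to one randomly chosen running instance.

    Returns (victim, trick_name), or None if there is nothing to break.
    """
    candidates = [inst for inst in fleet if inst.running]
    names = sorted(enabled if enabled is not None else STRATEGIES)
    if not candidates or not names:
        return None
    victim = rng.choice(candidates)
    name = rng.choice(names)
    STRATEGIES[name](victim)
    victim.faults.append(name)
    return victim, name


def service_is_up(fleet):
    """Toy health check: at least one instance is running with its disks attached."""
    return any(inst.running and inst.volumes_attached for inst in fleet)


if __name__ == "__main__":
    rng = random.Random()
    fleet = [Instance(f"web-{n}") for n in range(3)]
    victim, fault = unleash_monkey(fleet, rng)
    print(f"{fault} hit {victim.name}; service still up: {service_is_up(fleet)}")
```

The point of the exercise is the last line: after the monkey has struck, you check whether the service as a whole is still standing. With three instances and one fault, this toy service survives. With a single instance it would not, and that is exactly the kind of weakness the monkey is meant to expose.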

Actually, Now Isn’t a Good Time to Break Anything
Just in case you were wondering, you are not obliged to give these monkeys a completely free hand. You can let them out into your systems at times when real customers are not present (or less present). You can also choose the kind of havoc the monkeys cause by picking from the tricks above and others. One day, so they say, enough monkeys sitting at enough typewriters will randomly type out one of Shakespeare’s plays. Until that time, make the most of software monkeys randomly testing your cloud infrastructure to help make it work better.
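
Keeping the monkeys on a leash can be as simple as a time-window check before anything gets broken. The sketch below is an assumption-laden illustration, not the configuration of any real tool: the function name monkey_may_run, the 02:00 to 05:00 quiet window and the Monday-to-Friday rule are all made up for the example.

```python
from datetime import datetime, time

# Hypothetical settings, chosen only for illustration.
QUIET_START = time(2, 0)            # window opens at 02:00
QUIET_END = time(5, 0)              # window closes at 05:00 (exclusive)
ALLOWED_WEEKDAYS = {0, 1, 2, 3, 4}  # Monday=0 .. Friday=4


def monkey_may_run(now, start=QUIET_START, end=QUIET_END, weekdays=ALLOWED_WEEKDAYS):
    """Return True if 'now' is on an allowed weekday and inside the window [start, end).

    The weekday test uses the calendar day of 'now', even for windows
    that wrap past midnight.
    """
    if now.weekday() not in weekdays:
        return False
    current = now.time()
    if start <= end:
        return start <= current < end
    # Window wraps past midnight, for example 23:00 to 04:00.
    return current >= start or current < end


if __name__ == "__main__":
    if monkey_may_run(datetime.now()):
        print("Quiet period: the monkeys may play.")
    else:
        print("Customers are about: keep the monkeys caged.")
```

A scheduler would call a check like this before each round of chaos and simply skip the round when it returns False.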