Netflix open sources cloud-testing Chaos Monkey

July 30, 2012 By David

Grazed from GigaOM. Author: Derrick Harris.

Netflix has a gift for anybody who needs to ensure their cloud-hosted applications keep running even if some of the virtual servers on which they’re running die. It’s called a Chaos Monkey — but don’t worry, this monkey is very tameable and is now open source.

The video rental and streaming giant is one of the world’s biggest consumers of cloud computing resources, hosting the majority of its infrastructure on the Amazon Web Services cloud. It developed Chaos Monkey to ensure that its system can heal itself or keep running should instances fail. “Over the last year,” Netflix cloud engineers Cory Bennett and Ariel Tseitlin wrote in a blog post announcing the open source version, “Chaos Monkey has terminated over 65,000 instances running in our production and testing environments. Most of the time nobody notices, but we continue to find surprises caused by Chaos Monkey which allows us to isolate and resolve them so they don’t happen again.”…

Anyone scared of releasing such a wild-sounding entity into their application infrastructure (or envious that they can’t do so because they don’t run on Amazon’s cloud) need not worry. As Bennett and Tseitlin explain, Chaos Monkey is configurable and “by default, runs on non-holiday weekdays between 9am and 3pm.” It’s also flexible enough to run on clouds other than AWS, they write.
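To make that default schedule concrete, here is a rough Python sketch of such a time-window check. It is not Chaos Monkey's actual code: the function name, the holiday set, and the choice to treat 3pm as an exclusive cutoff are all assumptions made for illustration, and time zones are ignored.

```python
from datetime import date, datetime


def in_default_window(now, holidays=frozenset()):
    """Return True if `now` is a non-holiday weekday between 9am and 3pm.

    Hypothetical helper, not part of Chaos Monkey. `holidays` is an
    assumed set of datetime.date values supplied by the caller.
    """
    is_weekday = now.weekday() < 5        # Monday=0 ... Friday=4
    is_holiday = now.date() in holidays
    in_hours = 9 <= now.hour < 15         # assumption: 3pm itself is excluded
    return is_weekday and not is_holiday and in_hours


if __name__ == "__main__":
    # Assumed example holiday list.
    us_holidays = {date(2012, 7, 4)}
    print(in_default_window(datetime(2012, 7, 30, 10, 0), us_holidays))
    print(in_default_window(datetime(2012, 7, 4, 10, 0), us_holidays))
```

The point of a window like this is that any failures the monkey causes land during hours when engineers are at their desks to respond.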

Oh, and Chaos Monkey is just the first of Netflix’s Simian Army to find its way into the open source world. “The next likely candidate will be Janitor Monkey which helps keep your environment tidy and your costs down,” Bennett and Tseitlin note.

Another member of the army, Chaos Gorilla — which is designed to simulate the loss of an entire AWS Availability Zone — recently made headlines when a cascading bug took down part of Amazon’s cloud in late June.