Simply put, Chaos Monkey, developed by Netflix, is a tool that randomly terminates instances within groups of systems. For an organisation that's used to constant firefighting, the change of mindset required for deliberately starting fires is dramatic (even though it's not unheard of for firefighters to moonlight as arsonists). The reasoning for the change, however, is simple: complex systems inevitably fail. If our infrastructure is resilient enough to withstand Chaos Monkey without catastrophic failure, we have reached a genuine milestone in the maturity of our infrastructure design and configuration management.
We also set some other goals:
- We want to kill configuration drift
- We want our infrastructure to be versioned, and thus represented as code
- We want our environments to be identical across stages
- We want Dev and QA to be able to create and destroy environments on demand
- We want to save our clients money by using resources more efficiently
Finding solutions
A lot of our problems will be solved with immutable infrastructure. Instead of managing long-lived servers, resources should be considered disposable. Instead of reconfiguring running servers and deploying new versions of software, we should start from scratch whenever possible. To achieve this, our weapon of choice is Docker with Kubernetes. Obviously, this requires refactoring all of our components to work inside containers - a task that is definitely as interesting as it sounds.
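To make the idea concrete, here is a minimal sketch of what a containerised component looks like under Kubernetes (the component name and image are hypothetical examples, not our actual services). The point is that the service is declared rather than configured in place: kill a pod, and the Deployment controller simply starts a fresh one from the versioned image.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-api          # hypothetical component name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-api
  template:
    metadata:
      labels:
        app: example-api
    spec:
      containers:
        - name: example-api
          # immutable, versioned image - never reconfigured in place
          image: registry.example.com/example-api:1.4.2
          ports:
            - containerPort: 8080
```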
For tackling the snowflake-environment problem, we're relying heavily on autoscaling. The fundamental idea is that every environment should start the same, while production environments will automatically scale to match the load imposed on them. This will come with two benefits: our environments are very much alike across stages, and we can introduce cost savings and flexibility for our clients by automatically scaling the environments up or down depending on the load.
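In Kubernetes terms, this can be expressed with a HorizontalPodAutoscaler. The manifest below is an illustrative sketch (names and thresholds are hypothetical): every environment starts at the same baseline, and only production grows past it when real load arrives.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: example-api
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: example-api
  minReplicas: 2             # every environment starts the same size
  maxReplicas: 20            # production scales up only under real load
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```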
For creating environments on demand, we're building a simple tool with 5 basic functions:
- Wrap Terraform to create Kubernetes clusters on AWS EC2
- Create environments (namespaces) with specific component (Docker image) versions and services in the Kubernetes clusters
- Present stdout from containers and logs gathered by fals
- Automate data migration between environments (MySQL/MariaDB, Redis, S3)
- Manage hostnames for environments with AWS Route 53
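The first three functions are essentially thin wrappers around `terraform` and `kubectl`. A minimal sketch of that command layer might look like the following (the function names and parameters are hypothetical illustrations; the real tool's interface isn't described here):

```python
def terraform_apply_cmd(cluster_name: str, region: str) -> list[str]:
    """Build the Terraform invocation that provisions a Kubernetes cluster on EC2."""
    return [
        "terraform", "apply", "-auto-approve",
        f"-var=cluster_name={cluster_name}",
        f"-var=region={region}",
    ]

def create_namespace_cmd(env: str) -> list[str]:
    """Each on-demand environment maps to its own Kubernetes namespace."""
    return ["kubectl", "create", "namespace", env]

def deploy_component_cmd(env: str, component: str, image: str, tag: str) -> list[str]:
    """Pin a component to a specific Docker image version inside an environment."""
    return [
        "kubectl", "-n", env, "set", "image",
        f"deployment/{component}", f"{component}={image}:{tag}",
    ]
```

In practice these command lists would be handed to a process runner, with logging and error handling around them; the sketch only shows how little glue is actually needed once the cluster itself is the unit of automation.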
Effectively a hybrid between an internal PaaS and CaaS, this will allow Dev and QA to create disposable environments on their own. Among other cool stuff, it will ultimately allow us to do things like blue/green deployments to production - increasing our confidence in deployment even further.
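In Kubernetes, one common way to do blue/green is to run two Deployments side by side and flip a label selector on the Service; the fragment below is a hypothetical sketch of that pattern, not our production setup. Traffic cuts over atomically when the `track` label changes, and rolling back is just flipping it back.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: example-api
spec:
  selector:
    app: example-api
    track: green             # flip between "blue" and "green" to cut traffic over
  ports:
    - port: 80
      targetPort: 8080
```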
What will our future look like?
While our transformation is still very much in progress, I see these improvements as a big win for the business as a whole: it will increase our confidence in deployment, bringing us a step closer to the benefits of true Continuous Delivery. It will eliminate error-prone manual work, and decrease time to market for new features and projects - a real competitive advantage.
We're not only patching the holes that slow our boat down. We're adding motors.