Rethinking infrastructure

When I joined Artirix, the company had just started the journey towards Continuous Delivery. A large part of it is the cultural transformation to DevOps: improving workflows and collaboration between Dev, Ops and QA. But in addition to having the best rowers beautifully in sync, we also need a streamlined and leakless boat. Therefore an equally large part of our journey is optimising and modernising our technical environment - patching the holes that slow our boat down.

Identifying problems

One of the largest challenges we have is configuration drift, ie. the gap between actual live production configuration state, and the state our configuration management tool depicts. Ideally those states are identical. But in the real world, the two configuration sets are often wildly different. In our case this is certainly true, and can be accounted to the way we historically do configuration management. We run Puppet in a masterless configuration with Capistrano, which means that configuration changes are applied manually on demand. This is both slow (Cap rsyncs the full Puppet repository to every node and applies it locally) and non-enforcing. It's too tempting to do a quick change manually, and update configuration management to reflect that later (or, never). While a masterless setup might scale better in some circumstances, the benefits just don't outweigh the disadvantages for us.

“At some point it became easier to just SSH into production, change a setting, and go on with your life.”

Creating environments is a particularly painstaking chore for us. It takes days - sometimes even weeks - to create a new project environment. We need to provision and bootstrap new instances, modify and rewrite Puppet manifests to fit this special case, set up monitoring and backups, and add separate environment configuration in about a dozen components. It's tedious, error-prone and repeating work - in other words, a prime candidate to be automated.

Another problem with our current environments is that by design, each one is a unique snowflake. For our internal and testing environments, we tend to pack every component neatly in one box. This lowers hosting costs, but makes the environment very different from a production environment, which might span dozens of separate instances. This poses a clear problem with a Continuous Delivery pipeline: even if code functions as expected on every environment prior to production, the final step (deployment to production) is still a risky leap into the unknown.

Setting targets

As we embarked on the journey to transform our infrastructure, the first target we set was simple:

“We want to be able to run Chaos Monkey and feel good about it.”

Simply put, Chaos Monkey, developed by Netflix, is a tool that causes random failures in groups of systems. For an organisation that's used to constant firefighting, the change of mindset required for deliberately starting fires is dramatic (even though it's not unheard of for firefighters to moonlight as arsonists). The reasoning for the change, however, is simple: complex systems fail inevitably. If our infrastructure is resilient enough to withstand Chaos Monkey without catastrophic failure, it implies we have reached a certain milestone in the maturity of infrastructural design and configuration management.

We also set some other goals:

We want to kill configuration drift
We want our infrastructure to be versioned, and thus represented as code
We want our environments to be identical across stages
We want Dev and QA to be able to create and destroy environments on demand
We want to save our clients money by using resources more efficiently

Finding solutions

A lot of our problems will be solved with immutable infrastructure. Instead of managing long-lived servers, resources should be considered disposable. Instead of reconfiguring running servers and deploying new versions of software, we should start from scratch whenever possible. To achieve this, our weapon of choice is Docker with Kubernetes. Obviously, this requires refactoring all of our components to work inside containers - a task that is definitely as interesting as it sounds.

For tackling the snowflake-environment problem, we're relying heavily on autoscaling. The fundamental idea is that every environment should start the same, while production environments will automatically scale to match the load imposed on them. This will come with two benefits: our environments are very much alike across stages, and we can introduce cost savings and flexibility for our clients by automatically scaling the environments up or down depending on the load.

For creating environments on demand, we're building a simple tool with 5 basic functions:

Wrap Terraform to create Kubernetes clusters on AWS EC2
Create environments (namespaces) with specific component (Docker image) versions and services in the Kubernetes clusters
Present stdout from containers, and logs gathered by fals
Automate data migration between environments (MySQL/MariaDB, Redis, S3)
Manage hostnames for environments with AWS Route 53

Effectively a hybrid between an internal PaaS and CaaS, this will allow Dev and QA to create disposable environments on their own. Among other cool stuff, it will ultimately allow us to do things like blue/green deployments to production - increasing our confidence in deployment even further.

What will our future look like?

While our transformation is still very much in progress, I see these improvements as a big win for the business as a whole: it will increase our confidence in deployment, bringing us a step closer to the benefits of true Continuous Delivery. It will eliminate error-prone manual work, and decrease time to market for new features and projects - a real competitive advantage.

We're not only patching the holes that slow our boat down. We're adding motors.

1 Comment