Continuous Delivery with Blue/Green Deployment


At thetrainline.com we are always striving to deliver the best user experience for our customers. We want to get great ideas from conception to the customer as quickly as possible, to enhance our offerings and streamline our processes. This post talks about how harnessing Continuous Delivery has helped us achieve this.

Following on from the post (Moving to Multiple Deployments Per Week at thetrainline.com) that Matt Richardson and I published in December 2013, we have since evolved our deployments to achieve automated continuous delivery of our components from development through to production, with zero downtime.

Where we were and what we wanted to achieve

In December we had achieved automated deployments via pipelines, which was a great improvement over the previous manual process, but we still had a long way to go. During a deployment we would:

  • Place the feature into maintenance mode, making it unavailable to customers
  • Uninstall the old version of the feature
  • Install the new version of the feature
  • Wait for the application to warm-up
  • Do some manual testing
  • Take the feature out of maintenance mode

Only after all of the above steps had completed would the feature be available for customers to access. To minimise disruption to our customers we had to schedule our deployments carefully, which meant we had a limited window in which to test and roll back if necessary.

Our website receives hundreds of thousands of hits per day (24/7, 365 days a year), so there was no “good time” to take a feature offline and deploy without impacting at least some of our customers. We needed a way to deploy features with zero downtime. We also wanted to automate as much of the process as possible so that we could deploy and release a feature upgrade with minimal human involvement.

Our solution

Born from the desire to achieve continuous delivery without impacting customers, we adopted the Blue/Green deployment model for some new server-side payment features for our mobile applications. We created two “slices”: one active, serving content to our customers, and the other inactive. In the following example we start with green as the currently active slice, running version v1 of our feature:

Figure: Green slice is active; customers receive content from the GREEN slice.

To deploy v2 of the feature we would query NetScaler to determine that the blue slice was inactive, then deploy the new version to that slice. During the deployment of v2 customers would continue to receive v1 content from the active green slice. Once v2 was deployed, a suite of automated tests would run against the inactive blue slice. We could also run manual tests at this stage if required, such as checking the user experience. A sketch of this step is shown after the figure below.

Figure: Green slice active, blue slice under test; customers receive content from the GREEN slice while tests run against the BLUE slice.
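
The following is a minimal sketch of that deploy-and-test step. The routing API base URL, endpoint paths, service name and helper functions are all illustrative assumptions rather than our real internal tooling; the shape of the logic (find the inactive slice, deploy to it, test it) is the point.

```python
# A minimal sketch of the deploy-and-test step, assuming a hypothetical in-house
# routing API sitting in front of NetScaler. All URLs, endpoint paths and helper
# functions here are illustrative stand-ins, not the real internal tooling.
import requests

ROUTING_API = "https://routing.internal.example/api/v1"  # hypothetical base URL
SERVICE = "mobile-payments"                               # hypothetical service name


def active_slice(service):
    """Ask the routing API which slice (blue or green) currently serves customers."""
    resp = requests.get(f"{ROUTING_API}/services/{service}/active-slice", timeout=10)
    resp.raise_for_status()
    return resp.json()["slice"]  # e.g. "green"


def deploy(service, version, target):
    """Stand-in for the real deployment step (install and warm up) on the target slice."""
    print(f"deploying {service} {version} to the {target} slice ...")


def run_automated_tests(service, target):
    """Stand-in for the automated test suite run against the inactive slice."""
    print(f"running automated tests against the {target} slice of {service} ...")


def deploy_to_inactive_slice(service, version):
    """Deploy the new version to whichever slice is not serving traffic, then test it."""
    target = "blue" if active_slice(service) == "green" else "green"
    deploy(service, version, target)
    run_automated_tests(service, target)
    return target


if __name__ == "__main__":
    deploy_to_inactive_slice(SERVICE, "v2")
```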

Once the blue slice had passed all quality checks, a manual trigger would toggle NetScaler to route customer traffic to the blue slice. At this point all customers would start receiving the new v2 feature.

Figure: Blue slice is active; customers receive content from the BLUE slice.

After routing all customers to the blue slice we would monitor it to ensure there were no issues. If any issues were identified with v2 of the feature we could easily switch customers back to the known-working v1 by toggling NetScaler back to the green slice. This switch was instantaneous and required no downtime.
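
Below is a minimal sketch of the switch and rollback calls, reusing the hypothetical routing API from the previous example; in practice the initial switch was fired by a manual pipeline trigger once the blue slice had passed its quality checks.

```python
import requests

ROUTING_API = "https://routing.internal.example/api/v1"  # hypothetical base URL, as above


def switch_traffic(service, target):
    """Ask the routing API to point customer traffic at the given slice."""
    resp = requests.post(
        f"{ROUTING_API}/services/{service}/active-slice",
        json={"slice": target},
        timeout=10,
    )
    resp.raise_for_status()


# Release v2 by routing traffic to the freshly tested blue slice ...
switch_traffic("mobile-payments", "blue")

# ... and if monitoring later shows a problem with v2, roll back instantly by
# routing traffic back to the green slice, which is still running v1.
switch_traffic("mobile-payments", "green")
```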

Once v2 had been deemed stable we would switch off the green slice to free up resources. When v3 of the feature was ready to deploy we could recreate the green slice using our infrastructure automation and repeat the above process, this time with blue as the initially active slice.

Martin Fowler has a much more detailed write-up of Blue/Green Deployments on his blog.

Technical achievements

The main enabler was the creation of an in-house RESTful API to automate our NetScalers. This meant that our deployment pipelines (in ThoughtWorks Go or TeamCity) could query the current routing of traffic and deploy the new version to the inactive servers. The API could also toggle our NetScalers to route traffic to the new version once it had passed all of our quality gates.
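
The post does not describe the API itself, so the following is only an illustrative sketch of what such a service might expose, written here with Flask; the routes, payloads and the in-memory state standing in for the real NetScaler integration are all assumptions.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# In-memory stand-in for the real NetScaler integration, which would query and
# reconfigure the load balancer's virtual servers rather than a dictionary.
ACTIVE_SLICE = {"mobile-payments": "green"}


@app.route("/api/v1/services/<service>/active-slice", methods=["GET"])
def get_active_slice(service):
    """Report which slice (blue or green) is currently receiving customer traffic."""
    return jsonify({"service": service, "slice": ACTIVE_SLICE.get(service, "unknown")})


@app.route("/api/v1/services/<service>/active-slice", methods=["POST"])
def set_active_slice(service):
    """Toggle customer traffic to the requested slice."""
    target = request.get_json(force=True)["slice"]
    ACTIVE_SLICE[service] = target  # the real API would drive NetScaler here
    return jsonify({"service": service, "slice": target})


if __name__ == "__main__":
    app.run(port=8080)
```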

This was coupled with our continuous integration and continuous delivery pipelines so that we could commit a code change and release it into production with minimal friction. At key stages of the pipeline we would tag our code commits to ensure we had a full audit trail for later analysis.
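
As an illustration of that tagging step, a pipeline stage might shell out to git along the following lines; the tag naming scheme and the assumption that the build agent has the repository checked out are ours, not details from the post.

```python
import subprocess


def tag_commit(stage, build_number, commit="HEAD"):
    """Tag the commit that reached a pipeline stage so the release can be audited later."""
    tag = f"{stage}-{build_number}"  # e.g. "deployed-to-production-1234"
    subprocess.run(["git", "tag", tag, commit], check=True)
    subprocess.run(["git", "push", "origin", tag], check=True)


tag_commit("deployed-to-production", "1234")
```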

We rely heavily on the quality of our automated tests throughout the process to ensure that our level of confidence grows as the pipeline progresses. This enables us to deploy into production with a high level of certainty that the feature can be enabled without incident.

Benefits

Blue/Green deployments helped us to reap these benefits:

  • Zero downtime / outage, because the customer is always connected to a fully working environment
  • Small, incremental releases mean there are fewer surprises when a release goes into production
  • The ability to instantly roll back if problems are discovered means any issues are in the wild for a significantly shorter time
  • The ability to recreate a whole, working environment in an automated fashion for testing or disaster recovery

Added Bonus: Because we can deploy at any time and roll back quickly, we have the confidence not only to deploy new features but also to retire features that are no longer used or no longer meet our high standards.

Where do we go from here?

  • We want to roll out this approach to our other applications and features so that we gain the same benefits.
  • We want to enhance the visibility of our deployments in production. This will most likely be done via dashboards and metrics, allowing us to see at a glance what we have in production and what is next in the deployment pipeline.
  • We want to improve our monitoring to gain a comprehensive health check of all components. This could also feed into auto-switching, enabling roll-backs based on threshold triggers (failure rates, etc.); a sketch of this idea follows this list.
  • We want to move towards continuous deployment, with reduced manual intervention and shorter feedback cycles.
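
As a rough illustration of that threshold-triggered rollback idea, a watcher along the following lines could poll a failure-rate metric after each switch and route traffic back automatically; the metrics endpoint, threshold and polling interval are all illustrative assumptions.

```python
import time

import requests

ROUTING_API = "https://routing.internal.example/api/v1"   # hypothetical, as in the earlier sketches
METRICS_API = "https://metrics.internal.example/api/v1"    # hypothetical metrics source
FAILURE_RATE_THRESHOLD = 0.05                              # roll back if more than 5% of requests fail


def failure_rate(service):
    """Fetch the current failure rate for the service's active slice."""
    resp = requests.get(f"{METRICS_API}/services/{service}/failure-rate", timeout=10)
    resp.raise_for_status()
    return float(resp.json()["rate"])


def watch_and_rollback(service, previous_slice, minutes=15):
    """Poll the failure rate after a switch and route traffic back if the threshold is breached."""
    for _ in range(minutes):
        if failure_rate(service) > FAILURE_RATE_THRESHOLD:
            requests.post(
                f"{ROUTING_API}/services/{service}/active-slice",
                json={"slice": previous_slice},
                timeout=10,
            ).raise_for_status()
            return
        time.sleep(60)


watch_and_rollback("mobile-payments", previous_slice="green")
```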