Deployment Agility with Air-Traffic Control



From Change Control to Assumed Approval: how we first managed the operational visibility of Continuous Delivery and how it’s still in use 2 years later

Trainline has changed in many ways over the last 2½ years and, as a 4-year veteran, I have been ideally placed to watch and help enable that change. One of the big changes was the move from a project-led to a product-led organisation. Many things come with that, one of which is Continuous Delivery (CD). Its advantages are well known, and one excellent stat produced recently showed that we have achieved a:

1,122-fold improvement in deployment agility!

That is to say that, during the last 2½ years, we have moved from 1 production deployment every 6 weeks to 187 in a week, which is equivalent to 1,122 every 6 weeks (187 was the highest number of deployments in a week during 2016).

Pretty good going, but how do we keep track of that operationally, and how do we ensure our central Operations team knows what’s happening? (By the way, we definitely still feel that we need a central Operations team while processing £2.3bn in train tickets a year; the NoOps debate is outside the scope of this blog post.)

The relationship between a development team and a central Operations team changes with CD, but in what way? My epiphany came when I realised that, with CD, the relationship is very much like that of an Airline Pilot and an Airport Air Traffic Control Tower. Just as pilots need a time slot to safely land their plane, development teams need slots in which they can safely deploy their code.

Not there?

Let me explain with the use of a table:

| Airline Pilot: Objective | Airline Pilot: Action | Development Team: Objective | Development Team: Action |
| --- | --- | --- | --- |
| Get to an airport | Fly the plane in the path determined before take-off | Get software ready to go live | Follow the direction set by the Product Owner and Dev Manager |
| Get the plane on the ground safely | Get information on the holding pattern, what runway is open and local conditions (e.g. weather) | Deploy the software to Production successfully | Get information on what deployment slots are open in the calendar. Change approval is assumed if the request is accepted. If denied, the deployment is not approved. |
| Get passengers to terminal | Land the plane | Get features/defect fixes to the customer | Deploy and make live |

What is inherent in the above is that the pilot has the responsibility for landing the plane safely, just as a development team has the responsibility for successfully deploying their software.

On the other hand, the Air Traffic Controllers have the responsibility for the coordination of that landing, which is not dissimilar to the role of the central Operations team in a Continuous Delivery environment. Thus, we have the following similarities between these two groups:

 

| Airport Air Traffic Control: Objective | Airport Air Traffic Control: Action | Operations Team: Objective | Operations Team: Action |
| --- | --- | --- | --- |
| Keep the runways open for planes to land | Don’t allow more than 1 plane to land on the same runway at the same time (or they crash) | Ensure Platform is stable for deployments to happen | Don’t allow more than 1 deployment at a time for services that are dependent or interact (or they could crash) |
| Know when planes are expected and manage their path | Ensure holding pattern and landing slots information is available, accurate and up to date | Understand and have visibility of what is happening when | Keep an eye on the requests that have been accepted in the calendar |
| Manage multiple requests to land | Confirm plane can land on pre-defined runway and slot | Manage multiple requests for deployments | Have automation running on the deployment calendar to ensure that rules are adhered to, and immediately enforced and fed back |

So, taking the above analogy into account we created an ‘Air Traffic Control’ (ATC) Calendar. The calendar, created in Outlook, is open for anyone to view and make a deployment slot request by sending a meeting request. A PowerShell script, which runs every minute on these calendar requests, enforces the following rules:

  • Deployments can only happen Monday to Friday, 05:00 to 15:30 UK time (the 05:00 start accommodates development partners in other locations)
  • Only one deployment can happen at a time
  • Deployments can take a maximum of 30 minutes (any longer than this is too slow for Continuous Delivery)
  • Deployments can be booked a maximum of 3 hours in advance (no squatting: an agile mindset means booking at the last responsible moment in order to avoid wasting slots)
  • Deployments can be booked a minimum of 30 minutes in advance of the deployment time (as I said, the last responsible moment)
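To make the rules concrete, here is a minimal sketch of how such a check could look. It is written in Python for illustration and is not the PowerShell script we actually run. The names `SlotRequest` and `check_request` are hypothetical, and the sketch assumes every time has already been converted to UK local time.

```python
from dataclasses import dataclass
from datetime import datetime, time, timedelta

# Rule values taken from the list above.
WINDOW_OPEN = time(5, 0)
WINDOW_CLOSE = time(15, 30)
MAX_DURATION = timedelta(minutes=30)
MAX_LEAD = timedelta(hours=3)
MIN_LEAD = timedelta(minutes=30)


@dataclass(frozen=True)
class SlotRequest:
    """Hypothetical stand-in for a meeting request sent to the calendar."""
    requester: str
    start: datetime  # assumed to be UK local time
    end: datetime    # assumed to be UK local time


def check_request(request, accepted, now):
    """Return the list of broken rules; an empty list means the slot is accepted."""
    if request.end <= request.start:
        return ["end must be after start"]
    problems = []
    if request.start.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        problems.append("deployments run Monday to Friday only")
    if request.start.time() < WINDOW_OPEN or request.end.time() > WINDOW_CLOSE:
        problems.append("outside the 05:00 to 15:30 UK window")
    if request.end - request.start > MAX_DURATION:
        problems.append("longer than 30 minutes")
    lead = request.start - now
    if lead > MAX_LEAD:
        problems.append("booked more than 3 hours ahead")
    if lead < MIN_LEAD:
        problems.append("booked less than 30 minutes ahead")
    # Two slots overlap when each one starts before the other ends.
    if any(request.start < o.end and o.start < request.end for o in accepted):
        problems.append("overlaps an accepted deployment")
    return problems
```

Returning every broken rule at once, rather than stopping at the first, keeps the feedback to the requester quick and complete.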

On top of this, the central Operations team and certain members of the development teams with Ops responsibilities have full rights over the calendar. This means they can override any of these rules if they have good reason – this is a good example of our approach to new processes in general: lay out a clear framework, automate as much as possible (for quick feedback and with no reliance on individuals), but still leave room for out-of-band requests.

Side note: Those of you with excellent attention to detail and maths skills can see that the above rules do not leave enough slots for 187 deployments in a week. This is because the ATC slot is independent of the actual deployment package, so it is possible to deploy multiple related micro-services in a single slot, provided you can do it in the 30 minutes. From an Ops view the visibility is the same: “we can see something wrong and the last deployment was related to X”.
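For anyone who wants to check the maths, the slot capacity can be worked out in a few lines. This is only an illustrative sketch: `slots_per_week` is a hypothetical helper, and it assumes every slot is booked for the full 30 minutes.

```python
from datetime import datetime, timedelta


def slots_per_week(open_at="05:00", close_at="15:30", slot_minutes=30, days=5):
    """Count back-to-back slots that fit in the weekly deployment window."""
    fmt = "%H:%M"
    window = datetime.strptime(close_at, fmt) - datetime.strptime(open_at, fmt)
    # Floor division of two timedeltas gives a whole number of slots per day.
    return days * (window // timedelta(minutes=slot_minutes))


capacity = slots_per_week()            # 21 slots a day over 5 days
deployments_per_slot = 187 / capacity  # average needed to reach 187 in a week
```

Under that assumption the calendar offers 105 slots a week, so the record week of 187 deployments averaged a little under two deployments per slot.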

So how has the ATC Calendar been working?

great

It’s been one of those lightweight processes that sit there and work. Although we get the occasional ‘we cannot get a deployment slot’, both development teams and the central Operations team are happy with it.

What’s next for this mighty innovation?

MORE RUNWAYS!

Yes, it was always intended that the super-cool analogy could be expanded to open more ‘runways’ (calendars) that would allow more deployment slots for systems and services which could be safely deployed at the same time. The net effect is to further increase our deployment agility without increasing our operational risk.

The first new runway has been added to facilitate a major initiative we have for 2017: the move to a new eCommerce platform. It’s called ATC eCommerce and works exactly like the original ATC Calendar, but we have updated the automation to accept requests only from developers in teams that own new eCommerce services. This way we have again automated the rules and can be confident that the calendar is being used correctly.
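A requester restriction per runway could be modelled along these lines. This is a hedged sketch in Python rather than our actual automation. The `Runway` class, the `route_request` function and the example addresses are all hypothetical, and the timing rules from the original calendar would still apply on top of this check.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class Runway:
    """One deployment calendar; a missing allow-list means open to anyone."""
    name: str
    allowed_requesters: Optional[FrozenSet[str]] = None

    def accepts(self, requester: str) -> bool:
        if self.allowed_requesters is None:
            return True
        return requester.lower() in self.allowed_requesters


# Hypothetical configuration: the addresses below are placeholders.
RUNWAYS = {
    "ATC": Runway("ATC"),
    "ATC eCommerce": Runway(
        "ATC eCommerce",
        frozenset({"alice@example.com", "bob@example.com"}),
    ),
}


def route_request(calendar: str, requester: str) -> str:
    """Decide whether a requester may book on the named runway."""
    runway = RUNWAYS.get(calendar)
    if runway is None:
        return "declined: unknown runway"
    if not runway.accepts(requester):
        return "declined: requester not allowed on this runway"
    return "accepted"
```

Keeping the allow-list as data next to the runway name means that opening a further runway is a configuration change rather than a new script.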

Finally, we will soon be adding another runway for infrastructure changes since we are pretty much all infrastructure-as-code these days. So, for Trainline, unlike those who live in south-west London, the more runways the better!


About the author

David Stanley is Head of Platform Delivery at Trainline leading the teams responsible for operations, cloud and physical infrastructure, ensuring the benefits of development philosophies, such as Agile, DevOps, CD, Automation and Infrastructure-as-code, are balanced with, and adopted for, production operations.
