For several years, much of the code for the systems at thetrainline.com has been versioned and deployed together as a single ‘platform’. Recently, we have begun to divide up the platform into smaller chunks, to enable us to deliver some parts more frequently and rapidly, leaving other parts to evolve more slowly (as needed). Moving from a single version number for all subsystems to multiple version numbers for independent subsystems has implications for how code is built and released; this blog post outlines some of the work we have done so far in this area.
My colleague Owain Perry and I recently presented on this topic at the London Continuous Delivery meetup group (http://londoncd.org.uk/), and the slides we showed relate to the details in this post.
Why Release All Systems as a Platform?
The codebase of the systems which power thetrainline.com began life around 1999, when the first public-facing booking system was launched, in partnership with Virgin Trains. We had a substantial code re-write around 2006, and today, we have about 4 million lines of application code (mostly in C# on the .NET platform).
During the code re-write, we needed to be sure that we had consistency across all parts of the code when testing new features, and so it made sense to apply the same version number to all subsystems and components, and then deploy all parts of the system together as a ‘platform’. The rate of code change at the time was very high, and almost every part of the code was undergoing rapid changes, so it was also necessary to deploy almost every subsystem on a regular basis.
The subsystems were built, tested, and deployed with Continuous Integration (CI) techniques using Cruise from ThoughtWorks (we have since moved to ThoughtWorks GO for CI and deployment orchestration). Delivery of features which spanned multiple subsystems required any given team to work on any of the subsystems. Deployment to Production happened out-of-hours (overnight) and required us to take down thetrainline.com systems for many hours during the deployment activity.
As of October 2013, we deploy a major new release every six weeks, and need around five weeks to fully test each new set of changes. The production deployment itself is now fairly rapid: down from six hours in 2010 to 17 minutes today, thanks to some nifty Blue-Green deployment techniques put in place by our Deployments and Environments team.
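As a hedged illustration of why Blue-Green cutover can take minutes rather than hours: one common pattern keeps two complete environments and flips a single pointer between them. The post does not describe our actual mechanism, so everything below (paths, names) is hypothetical:

```shell
set -e
# Hypothetical sketch of a symlink-flip Blue-Green cutover; the real
# thetrainline.com mechanism is not described in this post.
root=$(mktemp -d)
mkdir "$root/blue" "$root/green"
ln -sfn "$root/blue" "$root/live"    # 'blue' currently serves traffic
# ...deploy the new release into 'green' and smoke-test it offline...
ln -sfn "$root/green" "$root/live"   # near-instant cutover; 'blue' kept for rollback
readlink "$root/live"
```

Rollback is the same flip in reverse, which is a large part of the appeal: the slow work (deploying, warming up, testing) happens off the live path.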
With the platform components, we have not so far made much use of feature toggles, which means we need release branching to manage the work required for a particular release.
At any one time, we have three active release branches for subsystems in the platform. Bugfixes from older branches (either in Production or on the way there) are merged into newer branches, and once a newer release is stable in Production, we delete (actually, deactivate) the previous branch, leading to a ‘staircase’ branching scheme without a mainline. This is not perfect, but it fits the platform release model.
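The ‘staircase’ merge flow can be sketched in Git (the repository, branch names, and commits below are hypothetical, purely to illustrate bugfixes flowing from older release branches into newer ones):

```shell
set -e
# Hypothetical Git illustration of the 'staircase': three active release
# branches, no mainline; bugfixes merge forward from older to newer.
tmp=$(mktemp -d); cd "$tmp"
git init -q
git symbolic-ref HEAD refs/heads/release-34
git config user.email dev@example.com
git config user.name dev
echo base > app; git add app; git commit -qm "release 34 work"
git branch release-35    # newer releases start from the older branch
git branch release-36
echo fix >> app; git commit -qam "bugfix on release 34"
git checkout -q release-35; git merge -q release-34 -m "merge 34 fixes"
git checkout -q release-36; git merge -q release-35 -m "merge 35 fixes"
# once release 35 is stable in Production, the oldest branch is removed:
git branch -d release-34
```

Deleting the oldest branch as each new one stabilises is what gives the scheme its staircase shape: there is never a long-lived mainline, only a moving window of three steps.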
What Are the Limitations of a Platform Release?
Building systems as a single platform does have some advantages in terms of simplifying interdependencies and whole-system testing, and was useful when the systems at thetrainline.com were evolving at an identical, rapid rate. However, our subsystems now need to change at different rates, and the release branching scheme became increasingly difficult to manage as more (and smaller) subsystems were added to the platform. Our cycle times were also quite long (around 12 weeks from the start of development to a feature being available in Production), and some systems needed to change more rapidly than that. We also identified that Conway’s Law was having a negative effect on the design of our systems: because any team could change any code anywhere, we started to see a blurring of domain boundaries between subsystems which ought to be separate – more on Conway’s Law below.
All of this led us to look at practices and techniques from Continuous Delivery, which advocates (among other things) avoiding builds and deployments if the component has not changed. We have started to extract some subsystems from the platform, making them what we’re calling ‘independent subsystems’. Most of these systems still call into parts of the platform, but we’re treating these independent systems somewhat differently, building and deploying them when they change, and pushing those changes more frequently than every 6 weeks.
We expect this to allow us to see return on investment (ROI) for new features sooner, and to make changes to our website more quickly, responding to changing market conditions (such as weather, industry news, Government policy, etc.). More simply, a software developer is far more likely to remember the details of code they wrote a few days ago than code they wrote 12 weeks ago!
Supporting Independent Subsystems which Depend on a Platform
Taking Advantage of Conway’s Law
We decided to take the implications of Conway’s Law and turn them around: if we set up our teams – and the work we assign to those teams – to reflect the ideal communication between software subsystems, then we have a good chance of building systems which work well together and avoid ‘bleed’ or incorrect coupling between domains. Following some of the great work done at Spotify on team structure, we have started to align our software development teams with groups of products and related subsystems.
We are also looking at how to achieve effective cross-team collaboration for concerns such as security, deployments, or performance (shown by the horizontal oval in the diagram, contrasting with the vertical product-focused team groupings). A team will need to honour a ‘social contract’ with other teams for the subsystems and components which it provides, cleaning up any ‘mess’ which they introduced or inherited, but gaining the authority to make appropriate changes to their systems as they see fit from an engineering viewpoint.
We have also identified the need for the widespread use of Semantic Versioning (widely known as ‘SemVer’) for communicating meaning between teams and components, to help reduce some of the complexity introduced by multiple version numbers across independent subsystems. By identifying in the version number when a change in a component will break consuming clients, teams can make effective decisions about when to upgrade to a new version of a dependency, which helps to avoid tightly-coupled changes across different teams.
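The decision rule this enables is simple enough to sketch. Under SemVer (MAJOR.MINOR.PATCH), only a MAJOR bump signals a change that can break consuming clients; the version numbers below are hypothetical:

```shell
# Hypothetical versions; under SemVer (MAJOR.MINOR.PATCH) a MAJOR bump
# signals a change that can break consuming clients.
current="2.4.1"
candidate="3.0.0"
cur_major=${current%%.*}
cand_major=${candidate%%.*}
if [ "$cand_major" -gt "$cur_major" ]; then
  echo "MAJOR bump: breaking change - review before upgrading"
else
  echo "same MAJOR: compatible - safe to pick up"
fi
```

A consuming team can therefore automate picking up MINOR and PATCH releases, and only spend coordination effort when the MAJOR number changes.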
More Frequent Deployments
Independent subsystems use a semantic versioning scheme separate from the platform version number, which allows us to communicate the impact of changes to any ‘consumers’. Even for these systems, which we want to deploy more frequently (‘interim deployments’), we still need to synchronise with the six-weekly ‘heartbeat’ of the platform release; we treat the platform release as a breaking change, because all platform components will have been rebuilt and deployed. This means that as the platform release approaches, the rate of interim deployments decreases, giving us time to test each independent subsystem against the Release Candidate of the platform.
For independent subsystems we can develop almost entirely on the mainline (‘master’ in Git, or ‘trunk’ in Subversion – we use Git). Only when we need to test against the (potentially ‘breaking’) platform release do we create a temporary release branch; once the platform release has gone live and is stable, we can merge the release branch back to the mainline (with some tags).
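The temporary release-branch flow can be sketched as follows (the repository, branch, and tag names are hypothetical; the point is mainline development, a short-lived branch for stabilising against the platform Release Candidate, then a merge back with a tag):

```shell
set -e
# Hypothetical Git sketch: mainline development, a temporary release branch
# for stabilising against the platform RC, then merge back and tag.
tmp=$(mktemp -d); cd "$tmp"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.email dev@example.com
git config user.name dev
echo a > svc; git add svc; git commit -qm "mainline work"
git checkout -qb release/platform-rc    # branch only when testing against the RC
echo fix >> svc; git commit -qam "integration fix for platform RC"
git checkout -q master
git merge -q --no-ff release/platform-rc -m "merge release fixes to mainline"
git tag platform-release                # tag the go-live point on mainline
git branch -d release/platform-rc       # the branch was only ever temporary
```

Because the branch exists only for the few weeks around a platform release, the merge back to mainline stays small, in contrast to the long-lived staircase branches used for platform components.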
In fact, in some cases we avoid the need to create a release branch by running nightly CI builds against the release candidate, and only branching if the build fails. The two branching schemes can be compared visually like this:
Eventually, we expect to be able to release changes to some systems daily; for the time being, we will retain the six-week platform ‘heartbeat’, but expect the strength of this rhythm to diminish as more systems are made independent over time.