Here at thetrainline.com we have several useful online tools for helping our customers plan and manage their train travel, including Train Times and Live Departure Boards. We recently changed the way we build, test, and deploy these kinds of applications to enable us to release new features much more frequently and easily; in fact, we shortened the deployment cycle from one deployment every few months to multiple deployments per week. These changes have produced a sea change in team culture, with a marked increase in product ownership by the team. This post describes what we’ve done so far, and where we want to go over the coming months.
Until recently, and unlike our main platform releases, we configured and deployed our train tools applications manually (zip files for binary packages, xcopy for deployments onto target servers, and so on). This led to unnecessary errors with configuration settings and file copy operations, and made deployments, well, scary.
We introduced an improved process for building and deploying the train tools applications, aiming to:
- Reduce errors and outages
- Reduce ongoing operational effort
- Improve speed of delivery of new features and changes
- Improve the auditability of the actions taken during deployments
- Improve the visibility and traceability of the progress of changes towards production for all stakeholders
We implemented a single, gated deployment pipeline per product using ThoughtWorks Go, with role-based permissions to control the flow of changes to production. The new pipelines for the train tools applications have fewer components and ‘moving parts’ than the main platform-based pipelines, leading to a lower maintenance overhead.
Early last week, at our weekly tech shindig, Burrito Club, we showcased the progress made so far.
We implemented a single pipeline to production that flows from the initial code commit (to Git) through to production deployment with smoke tests. This single pipeline lets us see at a glance what stage any particular change has reached.
- A build is triggered automatically by a code commit; it compiles the code, runs the unit tests, and outputs a single package containing the binaries.
- The “auto-validation” stage is triggered automatically, and runs self-contained, in-memory tests (using stubs and technologies such as CassiniDev and SQLite).
- The “deployment-map” stage is triggered automatically and combines the binaries package with the configuration for a given environment.
- When ready, a developer or QA can trigger the deployment to the test environment, which is automatically followed by the smoke tests of the test environment (see below).
- At that point, role-based security prevents anyone not in the QA role from triggering the “manual-test” stage. This stage is a manual sign-off checkpoint.
- Finally, the Production Support role is the only one with rights to deploy the code to production, which is automatically followed by (read-only) smoke tests against production.
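The stage flow above can be sketched as a small model of the pipeline. The stage names match the post, but the trigger types and role sets are illustrative only, not our actual GO configuration (and, as it turned out, most of the manual gates were later relaxed):

```python
# Minimal sketch of the pipeline stage flow, with role-based gates on
# the manual stages. Roles shown here are illustrative placeholders.

AUTO, MANUAL = "auto", "manual"

# (stage name, trigger type, roles allowed to trigger manually)
STAGES = [
    ("build",                AUTO,   None),
    ("auto-validation",      AUTO,   None),
    ("deployment-map",       AUTO,   None),
    ("deploy-to-test",       MANUAL, {"developer", "qa"}),
    ("manual-test",          MANUAL, {"qa"}),
    ("deploy-to-production", MANUAL, {"production-support"}),
]

def can_trigger(stage_name, role):
    """Return True if the given role may trigger the stage."""
    for name, trigger, roles in STAGES:
        if name == stage_name:
            return trigger == AUTO or role in roles
    raise ValueError("unknown stage: %s" % stage_name)
```

Automatic stages fire regardless of role; manual stages check the caller's role against the allowed set for that gate.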
Initially, we implemented role-based permissions at several stages to ensure that only authorised people could push changes towards production.
In the end, however, the only stage with role-based security ended up being the “deploy-to-production” stage. We removed the manual gates before “deploy-to-test” and “manual-test” as the culture of trust increased.
At each manual checkpoint, any user can easily see who triggered the stage.
One limitation of ThoughtWorks Go that we had to work around was the assignment of build agents to a pipeline. Because agents can only be assigned at the pipeline level, a normal build could have ended up running on the deployment agents (in the production environment!). In the end, we created a small utility that triggered a separate pipeline and returned the results.
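A utility like that might look roughly like the following sketch, which asks the GO server to schedule a run of a named pipeline over its HTTP API. The server URL, pipeline name, and credentials are placeholders; the `/go/api/pipelines/<name>/schedule` endpoint is the one GO exposes for triggering a pipeline remotely.

```python
# Hedged sketch of a pipeline-trigger utility for a GO server.
# All connection details below are illustrative placeholders.
import base64
import urllib.request

def schedule_request(server, pipeline, user, password):
    """Build the POST request that asks GO to schedule a pipeline run."""
    url = "%s/go/api/pipelines/%s/schedule" % (server.rstrip("/"), pipeline)
    req = urllib.request.Request(url, data=b"", method="POST")
    token = base64.b64encode(("%s:%s" % (user, password)).encode()).decode()
    req.add_header("Authorization", "Basic " + token)
    return req

def trigger(server, pipeline, user, password):
    """Fire the request; GO answers 202 Accepted when the run is queued."""
    with urllib.request.urlopen(
            schedule_request(server, pipeline, user, password)) as resp:
        return resp.status
```

Separating request construction from sending keeps the URL and auth logic easy to test without a live GO server.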
Our (internal) packaging and deployment technology detects a previous installation and will uninstall the old package and install the new. This enables us to easily roll back to an earlier version by simply running a previous pipeline.
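The uninstall-then-install behaviour is what makes re-running an old pipeline a rollback. A minimal sketch, using a hypothetical in-memory registry in place of our internal packaging technology:

```python
# Sketch of the "detect previous install, uninstall, then install"
# behaviour. The registry is a stand-in for real installed packages.

installed = {}  # app name -> version currently on the server

def uninstall(app, version):
    """Remove the currently installed version of `app`."""
    installed.pop(app, None)

def install(app, version):
    """Record `version` of `app` as installed."""
    installed[app] = version

def deploy(app, version):
    """Install `version`, uninstalling any previous version first.

    Returns the previously installed version, if any."""
    previous = installed.get(app)
    if previous is not None:
        uninstall(app, previous)
    install(app, version)
    return previous
```

Because `deploy` is the same operation whether the target version is newer or older, rolling back is just deploying the earlier package again.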
In order to isolate the deployment agent servers from as much of the production infrastructure as possible, the deployment agent servers are placed in a limited-access DMZ-style restricted zone. The deployment servers are able to reach target servers in production only on specific ports, and can reach their controlling GO server on the internal network, but little else.
On the left side of the diagram are the build agents responsible for compilation and unit testing. On the right is one of the several deployment environments (e.g. test, pre-prod, or production).
The most difficult part of this implementation was the cultural change required: for the developers, who were used to doing deployments the manual way; for IS Operations, who had to place much greater trust in the developers; and for the people who did the actual deployments.
In the past, the Production Support team received a change request with a one- or two-sentence description of the change. This didn’t really make them happy, but as they were involved in the deployment, it sort of balanced out. When we changed to one-click deployments, that sense of involvement was lost, leaving the team feeling completely out of the loop. We learnt the hard way that the development team had to share much more information about code and configuration changes with the people charged with first-line support. After a slightly rocky start, we’ve all ended up in a much better place where communication and involvement are key.
From a development team perspective, we’ve found a much greater sense of ownership, increased confidence, and greater happiness. The team is now much more production-focused and cares much more about getting changes into the hands of users. Probably the best indicator of this is that the “ready for deploy” column on our kanban board is now usually empty!
One of the most surprising challenges we faced was that we started delivering much faster than the business was expecting. We found that we hadn’t taken the business with us on the ride, and they were not used to things being delivered this fast. However, increasing communication helped us work through this.
Future improvements will likely include:
- Better visualisation – we found that the single pipeline was good for developers, but didn’t really show other teams what they were after. It also meant that in the event of a rollback, there was no obvious way of showing that this had occurred. We are planning a dashboard with slightly different views depending on the audience. For example, the business wants to know when deployments happen and when they can expect their change to go live, whereas Production Support wants to know what is in production right now, and what changed in the most recent release.
- Automated maintenance mode – we inadvertently caused several P2 alerts by deploying without setting our monitoring tool (SCOM) into maintenance mode. We intend to update our installation tool to set the server into and out of maintenance mode automatically, and to raise an alert if an installation fails.
- A fully trusted automated test suite – we are aiming to get to the point that our automation is trusted enough that we can remove the manual test stage (bring on Continuous Deployment!).
- Automated raising of CRs – we are contemplating generating a change request (CR) automatically from Git commit logs.
- Blue/green deployments – now that we can deploy much more easily, and once we have migrated all the train tools apps to this new process, we will implement blue/green deployments with automated switching at the load balancer level.
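The automated maintenance mode item above amounts to a simple wrapper around the install step: suppress monitoring alerts for the duration of the deploy, and raise one deliberately if the install itself fails. A sketch, where the `monitor` calls are hypothetical stand-ins for whatever the SCOM integration ends up looking like:

```python
# Sketch of a deploy wrapper that holds the server in monitoring
# maintenance mode during the install. The monitor interface
# (start_maintenance / stop_maintenance / raise_alert) is hypothetical.
from contextlib import contextmanager

@contextmanager
def maintenance_mode(server, monitor):
    monitor.start_maintenance(server)   # suppress alerts during deploy
    try:
        yield
    finally:
        monitor.stop_maintenance(server)  # always re-enable monitoring

def deploy_with_maintenance(server, monitor, install):
    """Run `install` inside a maintenance window; alert on failure."""
    with maintenance_mode(server, monitor):
        try:
            install()
        except Exception:
            monitor.raise_alert(server, "installation failed")
            raise
```

The `finally` clause guarantees the server leaves maintenance mode even when the install blows up, which is exactly the failure case that caused our P2 alerts.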
In short, we have found that the new deployment pipelines have had a surprising effect on the culture of our teams, encouraging shared ownership and a delivery focus. With a shorter cycle time, the business has been able to see changes delivered to production much sooner.