At thetrainline.com we recently transformed our software release process by rebuilding our problematic test and integration environments on a private (on-premise) PaaS cloud platform. The outcome of the 8 month project was a fully automated, repeatable infrastructure and software build process that reduced the environment build time from 12 weeks to 4.5 hours, achieving ROI within 8 months. In this post we’ll share the rationale for choosing a private cloud over the readily available public cloud offerings, details of the components, what we’ve learnt, and how we were able to use the experience to improve our other environments and processes.
Going back two years or so, we had a reliability issue with our test and integration environments that precipitated a much bigger problem with our software release process. We decided to do something about it, which ultimately resulted in us building our own private cloud (PaaS). You’re probably thinking that we lost the plot, as building your own internal cloud instead of going to the likes of Amazon and co. goes against conventional wisdom. “There is no way that an internally hosted private cloud could be more cost effective than a commercial offering”, you might think. Well, that isn’t always the case; choosing to do things internally is sometimes the preferred option, and it doesn’t always come down to cost.
Like most development organisations, we use many environments for the purpose of developing, testing (load, feature and regression), integrating and releasing code. The figure below outlines a somewhat simplified view of the environment landscape. Code is written on the developers’ machines and progressed into the test and integration (T&I) environments before being deployed to Pre-Production in preparation for release into Production.
The Test & Integration environments (red box in Figure 1) had been around for a number of years and were consistent only in the fact that there was very little consistency. They were also critical for the regression and feature testing as defects found at this stage were often much easier and cheaper to fix than later on during the Pre-Production test cycles.
Inconsistency and unreliability
As IS Director, I had overall accountability for these environments, and I can safely say that they had evolved into somewhat of a nightmare for anyone who either depended on them or was tasked with looking after them. So what were the issues, and why did they end up being so problematic?
- Availability and reliability of the environments were pretty poor, amplified by the fact that there were never enough to go around in the first place. If it wasn’t a flaky server playing up, it was an issue with the latest build that put a dent in the release schedule.
- Bespoke configuration and hand-tooling over the years had caused the environment makeup to diverge over time. The situation was made worse by the fact that features were sometimes only built in a single environment and not across the estate, creating even more bottlenecks along the way.
- Lack of ownership and governance compounded the issue as we didn’t have a dedicated owner of the environments at the time nor were there any guidelines that users agreed and adhered to. Every developer having Administrator-level access also exacerbated the configuration inconsistencies across the board.
- Developers investigating and fixing environment issues became the norm, and although it helped with availability it jeopardised the release schedule. Development velocity took a dip and the technical debt became progressively worse with every iteration.
These issues, in conjunction with a complex and extensive codebase, inevitably precipitated a “perfect storm”. Most of our testing was eventually being executed in Pre-Production, the only environment readily available and trusted as far as testing output was concerned. Moving testing to Pre-Production however only made the situation worse as defects were being identified very late in the cycle which inevitably pushed out the release dates.
An interesting thing happens when you delay a release: it becomes susceptible to scope creep as various stakeholders try to limit the long-term impact by adding in new features and increasing the number of iterations. This in turn starts a chain reaction that puts more pressure on the environments and deploys bigger payloads into live, which equates to more production issues requiring patches that in turn need environments for testing, and so on.
By now, I think you will have the picture: we needed to do something urgently to get our releases back on track. So we secured budget and resources to fix our T&I environments, seeing that is where most of our pain stemmed from. We initially thought to Make It Someone Else’s Problem, i.e. move it to the public cloud and let someone else take care of it, but that wasn’t an option available to us, as I’ll now explain.
Why the public cloud was not a good fit
We investigated the option of running these environments in a public cloud but we could not make it work for a number of reasons. To begin with, none of the cloud providers offered a Platform-as-a-Service (PaaS) or Software-as-a-Service (SaaS) platform representative of the functionality of our in-house application suite. Bearing in mind that the majority of the environment problems stemmed from configuration and integration issues, going down the Infrastructure-as-a-Service (IaaS) route wasn’t going to add any value either.
Secondly, we had the wrong kind of budget; although the project had a substantial capital expenditure (capex) budget, it only had a marginal operational (opex) budget that would be required to run services in externally hosted clouds.
And lastly, we just didn’t have the legal, technology and process experience of public clouds to risk building our environments there, so we elected to build an internally hosted system. One of the requirements we simply couldn’t compromise on was that a build of any environment should be repeatable and predictable, all the way from the virtual machine to the integrated software components that made up the hundreds of features in the software estate of thetrainline.com. Automation of build, smoke test and teardown processes was essential to the success of the project as manual processes were prone to error and expensive to run and maintain.
The components of our private on-premise cloud infrastructure
Our test and integration private cloud at thetrainline.com was built from the amalgamation of commodity infrastructure, commercial software and readily available scripting and automation tools. We did not entertain any of the “cloud-in-a-box” commercial solutions nor converged infrastructure because they were costly and limited in functionality, and we wanted to avoid vendor lock-in that could limit future optionality.
We settled on a hardware setup that consisted of:
- 15 x VMware ESXi 4.1 virtual hosts
- 1 x Solaris server (Oracle database)
- 1,440 GB pool of RAM
- 120 processor cores in the pool
Our suite of automation tools includes:
- PowerCLI (VMware’s PowerShell add-on module to control vCenter)
- Perl scripts with Netscaler
- Legacy Windows functions such as Sysoc to add Windows 2003 features
- PowerShell v2 (v3 released but not available for 2003; we will move to it in due course)
- SQL scripts to setup the databases
- ThoughtWorks GO for management of deployment pipelines and for deployment of the software builds
How we provision test environments
The project provisioned an elastic (VMware-based) virtual platform hosting 370 virtual machines plus templates, which made up the 16 identical T&I environments. Each environment accommodates in excess of a hundred features and applications and is created in 3 phases (see Figure 2):
- Infrastructure components (networking, GPOs, databases, etc.) are spun up (2 hours 30 min)
- The application stack is deployed (2 hours)
- An optional smoke test is run (a further 2 hours)
Each environment can be spun up and torn down 24 hours a day with no manual intervention along the way (apart from triggering the builds), always with the same result – a fully functional, completely integrated environment. If you consider that it previously took us up to 12 weeks to build a single new environment, this represents a monumental shift in capability and agility.
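The sequencing above can be modelled very simply. As a minimal sketch (in Python rather than the PowerCLI/PowerShell tooling we actually used, and with all names hypothetical), an environment build is just an ordered series of automated phases with an optional smoke-test stage:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Phase:
    """One automated stage of an environment build."""
    name: str
    duration_hours: float
    optional: bool = False

def build_environment(phases: List[Phase], run_optional: bool = True) -> float:
    """Run each phase in order and return the total elapsed time in hours.

    In the real pipeline each phase would trigger the provisioning,
    deployment and test scripts; here we only model the sequencing.
    """
    total = 0.0
    for phase in phases:
        if phase.optional and not run_optional:
            continue
        # A real implementation would invoke the automation scripts here.
        total += phase.duration_hours
    return total

# The three phases described above, with their approximate timings.
T_AND_I_PHASES = [
    Phase("Infrastructure (networking, GPOs, databases)", 2.5),
    Phase("Application stack deployment", 2.0),
    Phase("Smoke test", 2.0, optional=True),
]

print(build_environment(T_AND_I_PHASES, run_optional=False))  # 4.5 hours
print(build_environment(T_AND_I_PHASES))                      # 6.5 hours with smoke test
```

The point of the model is only that the phases compose into one push-button build: skipping the optional smoke test gives the 4.5 hour figure quoted above.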
What did we achieve and learn?
Return on investment
When we proposed this project to our investment board we had no way of quantifying the benefits it would deliver, other than the fact that improved test and integration environment availability would help stabilise the release management process. The actual benefits exceeded our expectations, significantly contributing to the predictable and reliable 6 week release schedule we have today.
The return on the initial investment was achieved in 8 months, with the justification based purely on the number of avoided lost development days. When you account for the project’s contribution to significantly cheaper support and run costs, improved development velocity and customer satisfaction, the return on investment is even more compelling.
Seeing that we built the platform from the ground up, we now understand the actual cost of the utility compute unit which is handy when comparing the total cost of ownership against that of commercial cloud providers. Even without the capex/opex constraints our private cloud is significantly cheaper to run over a 3 year period than Amazon EC2, for example.
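To make that comparison concrete, the calculation boils down to total cost of ownership divided by the compute capacity delivered over the period. The sketch below uses purely hypothetical figures (they are not thetrainline.com’s actual costs) to illustrate the shape of the comparison:

```python
def cost_per_core_hour(total_cost: float, cores: int, hours: float) -> float:
    """Total cost of ownership divided by the compute capacity delivered."""
    return total_cost / (cores * hours)

HOURS_3Y = 3 * 365 * 24  # 26,280 hours in a 3-year period

# Hypothetical illustrative figures only.
private = cost_per_core_hour(
    total_cost=500_000,  # capex plus 3 years of power, space and support
    cores=120,           # processor cores in the pool
    hours=HOURS_3Y,
)
public = cost_per_core_hour(
    total_cost=800_000,  # equivalent 3-year on-demand spend
    cores=120,
    hours=HOURS_3Y,
)
print(f"private: {private:.4f}, public: {public:.4f} per core-hour")
```

Once you know your own cost per compute unit, the same formula applied to a provider’s pricing gives a like-for-like benchmark, and makes it obvious when the portability option discussed below becomes worth exercising.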
The hidden gem in the platform lies in its portability; given that the environment creation is automated and could be spun up pretty much anywhere, we have the option of moving the entire platform to a public cloud as long as the compute unit cost is lower than ours. This mobility is a great option for continuously driving down the cost of the solution.
Virtualisation is not Cloud
Having a virtual machine estate in place means the platform may be virtual, but it is definitely not a cloud… unless the provisioning of the entire stack is automated. For example, if you can quickly provision the build of virtual machines (VMs) but it still takes manual intervention to deploy and integrate the applications that run on them, then the process is neither complete nor automated and therefore can’t be viewed as a cloud implementation. The automation also needs to include all of the associated infrastructure and application management and monitoring components.
With complexity come complications
Increases in the complexity of the infrastructure and application estate lead to an increase in the number of issues with automation and migration. This may seem obvious but the more complex the estate, the harder it is to deliver the automated provisioning and other “cloud” components. Having said that, the benefits of getting it right are significantly higher.
A benefit we often seem to take for granted is a people and organisational one. Blurring the delineation between Development and IS is critical to delivering projects that span both organisations’ spheres of competence, as it improves project delivery and results in a better product. The primary reason this project was delivered successfully was that the demarcation lines between the infrastructure and development project teams blurred: we started to collaborate as a single unit, building cross-functional capabilities across teams, which was critical not only to the delivery but also to the ongoing maintenance and enhancements after the project.
DIY is great for learning
We are fortunate enough to have the in-house skills to have been able to do this project ourselves, and I really do believe there is a lot of value in learning how to Do It Yourself. Apart from a better understanding of the applications and integration points, it also gives you the opportunity to fix the issues you’d ordinarily never get the chance to. Understanding the intricacies of what it takes to run the platform also puts you in good stead when negotiating SLAs with cloud providers, should you ever need to migrate the service.
We have progressed significantly from the dark days of test environment instability in 2010 that crippled our software release process. Our internally hosted PaaS cloud was built from commercially available commodity software and hardware, and has transformed our capabilities with respect to agility, release management, automation and DevOps. It has also given us opportunities to extend our cloud platform to Production and further reduce operating costs.