What the ELK!? – Log Aggregation

Everyone loves logs, right?

No…

Logs are long, complex, and full of noise, and it can take ages to find the one error message you need to fix a problem. So if you're running more than 100 servers and generating over 200GB of logs a day, how do you cut through it all to find the information that actually matters?
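For a flavour of what the ELK stack (Elasticsearch, Logstash and Kibana) makes possible, here is a minimal sketch, not our production setup: once Logstash has shipped the logs into Elasticsearch, finding the last hour's errors becomes a single query rather than a grep across a hundred servers. The endpoint, index pattern and field names below are assumptions.

```python
import requests

# Hypothetical Elasticsearch endpoint and Logstash-style index pattern.
ES_SEARCH_URL = "http://elasticsearch.example.com:9200/logstash-*/_search"

# Ask for the 20 most recent ERROR-level entries from the last hour.
query = {
    "size": 20,
    "sort": [{"@timestamp": {"order": "desc"}}],
    "query": {
        "bool": {
            "must": [{"match": {"level": "ERROR"}}],
            "filter": [{"range": {"@timestamp": {"gte": "now-1h"}}}],
        }
    },
}

response = requests.post(ES_SEARCH_URL, json=query, timeout=10)
for hit in response.json()["hits"]["hits"]:
    source = hit["_source"]
    print(source.get("host"), source.get("message"))
```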

Continue reading

Improving customer happiness at trainline with New Relic

Later this week, we're excited to be hosting the second New Relic meet-up at our London offices, and we've been given the opportunity to talk about our experiences using their products. For those who don't know, New Relic is a suite of monitoring products that gives us near real-time visibility of end-user, application and server performance, and we've been using it across our entire stack of products and services for more than a year.

Continue reading

Migrating from Gitolite to GitHub Enterprise

Recently, we performed a mass migration of our git repositories from Gitolite to GitHub Enterprise.

We had found that Gitolite required a high level of maintenance, and its configuration complexity put a real burden on the team looking after it. We were running a rather old version with some significant security flaws, on out-of-date, snowflake servers. One of the biggest issues, though, was that developers had to ask another team to create repositories, change permissions and so on, adding unnecessary delay and causing blockages.

After reviewing multiple options, we decided to migrate to GitHub Enterprise (GitHub), which runs as an on-premises VMware appliance. We chose it because of the familiarity most developers have with github.com, and because of GitHub's superior support amongst third-party tools. This allowed developers to create repositories and perform most common tasks as self-service, rather than relying on another team.

As this migration does not appear to be very common, this post shares some detail about the steps that were required.
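As a rough sketch of the repository move itself (not our exact tooling), a mirror clone and push carries every branch and tag across. The host names and repository list below are placeholders, and the target repositories are assumed to already exist on the GitHub Enterprise side.

```python
import subprocess

# Placeholder hosts: the old Gitolite server and a GitHub Enterprise organisation.
GITOLITE = "git@gitolite.example.com"
GHE = "git@github.example.com:engineering"

# Placeholder repository names; in practice you would generate this list
# from the Gitolite configuration.
repos = ["booking-api", "payments", "deployment-scripts"]

for repo in repos:
    # A mirror clone copies all refs (branches, tags, notes), not just the default branch.
    subprocess.check_call(["git", "clone", "--mirror",
                           f"{GITOLITE}:{repo}.git", f"{repo}.git"])
    # Push all refs to the pre-created repository on GitHub Enterprise.
    subprocess.check_call(["git", "-C", f"{repo}.git",
                           "push", "--mirror", f"{GHE}/{repo}.git"])
```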

Continue reading

Building AMIs with Packer

During the planning stages of our migration to AWS, we identified the need to create custom images (AMIs) as the base for new instances. While we are relatively experienced with Chef, we found that running Chef at instance launch time took much longer than was acceptable. Creating custom AMIs that are preconfigured (known as baking) allowed us to shift the heavy lifting from instance launch time to an earlier, out-of-band step.

In designing this process we set ourselves several goals: it needed to be reliable, repeatable, auditable and tested, with a fast spin-up time. This post explores our recent infrastructure automation efforts in this area.
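To illustrate the baking idea, here is a minimal sketch, not our actual template: a Python script that writes a Packer template using the amazon-ebs builder and runs the build. The region, source AMI and provisioning step are placeholders; in practice the provisioner would run configuration management (such as our Chef cookbooks) rather than a single shell command.

```python
import json
import subprocess

# Minimal Packer template: start from a base image, configure it,
# then snapshot the result as a new AMI. The region, source AMI and
# instance type below are placeholders.
template = {
    "builders": [{
        "type": "amazon-ebs",
        "region": "eu-west-1",
        "source_ami": "ami-00000000",
        "instance_type": "t2.small",
        "ssh_username": "ec2-user",
        "ami_name": "baked-web-{{timestamp}}",
    }],
    "provisioners": [{
        # The heavy lifting happens here, at bake time, instead of at instance launch.
        "type": "shell",
        "inline": ["sudo yum -y update"],
    }],
}

with open("web.json", "w") as template_file:
    json.dump(template, template_file, indent=2)

# Validate and build; AWS credentials come from the usual environment variables.
subprocess.check_call(["packer", "validate", "web.json"])
subprocess.check_call(["packer", "build", "web.json"])
```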

Continue reading

Ignoring unimportant file modifications with Git

Frequently, when compiling applications, we need to update the version number inside some source files, so that the binary ends up with the correct version metadata.  This usually means there is a build task that modifies a source file, based on the information passed from the controlling CI system.

This works well when it all happens on the CI server, and any modifications to those files are thrown away at the end of the build. However, it can be a pain when you run the build locally and end up with modified files in your working copy. You are able to run the same build that happens on the CI server locally, aren't you?

One way to avoid this is to skip the version-stamping task in your build script when it is not running under the CI server (for example, when certain environment variables are not present). The downside is that the process you test locally differs from the one that runs on the CI server.
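Here is a minimal sketch of that guard, assuming a hypothetical BUILD_NUMBER environment variable and an AssemblyInfo-style version file; the file path and regular expression are illustrative only.

```python
import os
import re
from pathlib import Path

# Hypothetical version file; adjust the path and pattern for your project.
VERSION_FILE = Path("src/Properties/AssemblyInfo.cs")


def stamp_version() -> None:
    """Rewrite the version attribute using the build number supplied by the CI server."""
    build_number = os.environ.get("BUILD_NUMBER")  # set by the CI server, absent locally
    if build_number is None:
        print("BUILD_NUMBER not set; skipping version stamping for a local build.")
        return
    text = VERSION_FILE.read_text()
    text = re.sub(r'AssemblyVersion\("[\d.]+"\)',
                  f'AssemblyVersion("1.0.{build_number}.0")', text)
    VERSION_FILE.write_text(text)


if __name__ == "__main__":
    stamp_version()
```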

Continue reading

Is DevOps the answer? Or just a key part of the Journey? Part 3

This post is part 3 in a series. Read part 1 and part 2.

Key Learnings

Part 2 finished detailing our relatively recent move to Product teams, a change that has had a big impact on our delivery process.

While this is definitely an exciting change, with product teams taking on much more responsibility from development through to live, it highlights that Development and Test environments share some needs with the Live environment, but also have differences that must be clearly understood and supported, potentially in a different way:

  • Provisioning, both Automated and Manual
    Using the same tools, processes and resources from the build farm through to live deployment is key to reducing the time taken to operate the pipeline. The market has no clear leader: Chef and Puppet are popular in this space, but both still lack many capabilities.
  • Configuration Management
    Drive as much of the infrastructure and application configuration as possible, from development through to live, using SDLC processes.
  • Change Control
    The number of gates and the level of approval have historically been driven by the risk of a change failing and by the time at which the change will be implemented. With increased automated testing, Canary Releases or Blue/Green deployments, and adequate real-time monitoring, changes can be made without a formal manual change review board. The quality of auditing then becomes more important: when did change x actually take place, and when did the system observe a change in reliability? The tooling here remains inadequate and bespoke, particularly for systems with a large fulfilment window, i.e. a UI change may not result in customer issues for many days if fulfilment is delayed. (A sketch of a simple canary gate follows this list.)
  • Incident and Problem Management
    Who to communicate issues to will vary, but as mentioned previously an outage of a test environment can be as important as a production incident. The tooling and general processes for managing problems and incidents should be the same, but the communication plans and business impact do differ, i.e. internal versus external communications and contractual liabilities. JIRA is more than adequate for managing the tickets, but an organisation's maturity in prioritising non-production incidents and problems over production ones is a benchmark of how well continuous delivery is understood.
  • Availability and Performance Management
    Application Performance Management (APM) tools, especially those with “Real User Monitoring”, are essential and must be accessible to everyone. Products such as New Relic and AppDynamics lead the field. Ensure the APM tools are also used on the build and deployment infrastructure, as well as across test and live environments.
  • Capacity Management and Scalability
    Cloud, whether public, private or hybrid, has allowed a step change in the auto-provisioning of servers. However, scaling storage, although cheap, still requires effort to implement, and there are differences between Test and Live environments that need to be handled.
  • Security, including Anti-Virus
    What are the actual threats that need to be protected against? Security should be baked into applications, development tools and processes, but there still needs to be effective dynamic monitoring of threats. The live systems will need extra protection and monitoring.
  • Patching (not the application code)
    In an ideal world all servers would be baked and rebuilt frequently with the latest patches applied, but there will still be servers that cannot be rebuilt, so effective processes still need to be implemented to allow for patching of operating systems, RDBMSs and messaging frameworks (middleware).
  • Backups
    Leave this to someone else; do not burden your Product teams with it. BUT ensure that there is clarity between systems of record and systems of engagement. Infrastructure has the capability to back up both machines and data.
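As promised under Change Control above, here is a minimal, illustrative canary gate, not a real deployment tool: it polls a hypothetical metrics endpoint for the canary instances and decides whether to promote or roll back. The URL, field name and thresholds are assumptions.

```python
import time
import requests

# Hypothetical metrics endpoint exposing the canary instances' current error rate.
METRICS_URL = "https://metrics.example.com/api/canary/error-rate"
ERROR_RATE_THRESHOLD = 0.01     # promote only if the error rate stays under 1%
CHECK_INTERVAL_SECONDS = 60
CHECKS = 10                     # observe the canary for ten minutes


def canary_healthy() -> bool:
    """Watch the canary for a while; bail out as soon as it looks unhealthy."""
    for _ in range(CHECKS):
        error_rate = requests.get(METRICS_URL, timeout=10).json()["error_rate"]
        if error_rate > ERROR_RATE_THRESHOLD:
            return False
        time.sleep(CHECK_INTERVAL_SECONDS)
    return True


if __name__ == "__main__":
    if canary_healthy():
        print("Canary healthy: promote the release to the rest of the fleet.")
    else:
        print("Canary unhealthy: roll back and record the change for the audit trail.")
```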

Summary

For any eCommerce business in 2015, Continuous Delivery is mandatory, not optional, and automation plays a key role in it. To ensure that operational requirements are built into the system being developed, as well as into how it is deployed, monitored and managed, operations and development resources MUST work together, and the culture and management of the organisation must embrace this. It does not mean that every person involved in development and operations is a full-time member of every team, as this would be impractical in any medium to large organisation, BUT they must all follow the principles of Continuous Delivery and be encouraged to do so.

Shipping early (an MVP), even with minimal features, allows for quicker feedback, which in turn drives product optimisation. Discovering early that a product is not fit for purpose is significantly cheaper than uncovering the failure at the end, when full design, development and integration costs have been incurred for the entire product. This is one of the great benefits of the agile journey.

This is important for us at thetrainline, as the time taken to release new products and resolve issues after go-live affects the next wave of development. If it takes multiple weeks for a new feature to move from the test environments into live use, that drags on product improvements. It is not possible to maintain an effective backlog of work to balance against new features if feedback from production takes many weeks: development resources are redeployed onto new features, which are then delayed, increasing the total cost and, more importantly, delaying value to the business.

Is DevOps the answer? Or just a key part of the Journey? Part 2

This post is part 2 in a series. Read part 1.

thetrainline’s Journey in Improving Throughput

From a very early point in thetrainline's journey it was clear that the web site was only the tip of the iceberg, and that a continued programme of development would be needed to improve the customer experience, adapt as web technology evolved, and extend automation in back-office processing, from fulfilment initially through to refunds most recently. In the past 14 years rail travel has doubled in size, and customers' expectations have risen with it. Although journey planning and advance-purchase ticketing are well established and carried out ahead of travel, the immediate future will see more innovation in ticketing, from smart cards supporting multiple train operators to NFC payment and, potentially, NFC ticketing. To provide the required levels of service, at the right price for the product, across all channels and devices, thetrainline will need to keep improving the throughput of ideas through to production implementation.

1999 to 2003 – The Waterfall Age

The first four years of our journey were shaped by the early dot-com bubble, with much hype over how the internet would revolutionise the modern world. With hindsight we are not quite there, but without doubt the internet has come to dominate all markets and businesses and will continue to do so for the foreseeable future. The demand for speed of change will keep growing if an enterprise is to keep growing.

thetrainline.com was first launched in 1999, and in its first 12 months as a start-up it benefited from development, support and infrastructure specialists all working closely together. Despite a typical waterfall approach, the proximity of the teams to each other, together with a single management structure, allowed operations and developers to communicate and rotate roles, and helped ensure that feedback from the production system was used in the next phase of development.

However, as the system grew and matured, as experienced people left and new members joined, and as SLAs were defined, lines were drawn between the development and operations teams. The push to reduce costs resulted in infrastructure being run by mutualised teams in remote locations. This was acceptable initially, during a period of relatively little development, but it ultimately meant the system was redeveloped with the development team having little knowledge or experience of the operational requirements or of how the system behaved in live. During this period releases were delivered every quarter, but the time from idea to live was at least six months, with significant management overhead.

2003 to 2007 – The Dawn of Agile (Almost)

The next chapter of growth saw the adoption of more agile methods of development, as the need to re-platform away from a Visual Basic and Windows 2000/NT infrastructure was clear. A waterfall approach to the re-platform would have resulted in a multi-million-pound failure and potentially the end of thetrainline.

A major platform refresh was undertaken, covering both the application software and the infrastructure. The initial approach was a traditional waterfall one, using the Rational Unified Process (RUP). During the development phase key developers played central roles in providing and maintaining the build and test environments, automating deployments where possible using Microsoft tools. BUT when the system needed to be deployed to production, significant delays and pain were felt, due to the change in the people involved and the diverse responsibilities of the development and operations teams. The effort of migrating from the old platform to the new one had also been underestimated. The lack of process and organisational structure, married with poor tooling and automation, led to severe delays.

To regain control, agile development approaches were implemented, including extreme programming, test-driven development and continuous integration. However, a separation between Development and Operations teams still remained. This provided a sense of control over changes being implemented in production, but ultimately it still left operations teams without the required functional knowledge and development teams without production feedback. The net effect was that a significant number of product improvements remained on the shelf for months, and by the time they went live the developers had already moved on to something else. It also required knowledge transfer to Operations teams to be managed, with the inevitable loss of knowledge along the way.

2007 to 2011 – The Teen Years of Agile

To address the loss of control over production, and to feed more knowledge of operational requirements back into development, 2nd- and 3rd-line application support teams were formalised, with close working between the teams a primary goal. Initially this proved successful, as it brought much-needed stability to the production systems, but again we would ultimately hit the limits of continuous improvement.

During the same period, server virtualisation was also implemented in the test and production environments. Due to physical separation, as well as commercial relationships, the resources, tools and management of the environments remained separate. Initially this did not block improvements in throughput and stability, but in the long run it led to inconsistent automation approaches and a lack of feedback to the development teams.

2011 to present – Improving on Agile

The most recent past has been focused on three key initiatives:

  1. Formalising and enhancing operational processes such as Run Books, Service Monitoring and Change Control to work with Agile delivery; without these being managed and stable, other initiatives would remain blocked.
  2. Removal of snowflakes within the test and development (build) environments. Given the number of environments, the variations in their purpose and users, and the variations in operational requirements, the need for closer collaboration and communication between development and operations staff has never been greater. Development and Test environments are production systems: if they stop working it is very costly in terms of lost productivity and velocity.
  3. Most recently, and most importantly, the implementation of Product teams responsible for their products from idea through development to live operation.

To be continued…

Part 3 will share some of our key learnings.