Spotting problems and fixing them before our customers notice

Just over a year ago we were set the challenge of ‘spotting problems and fixing them before our customers notice’. To spot the problems, we knew we had to change our behavior: instead of only opening our logs when an issue was suspected, we had to make them available to our engineers at the touch of a button, so that they could easily monitor their applications, spot problems and fix them.

Ash Powell’s blog post, ‘What the ELK!? – Log Aggregation’, explains how we did this, and here’s a short film we’ve just completed with Elastic to explain our business and ELK’s role in it:

Keeping trainline on track

Challenges

Continue reading

New Relic in action at trainline #futurestack

Toward the end of last year I was invited to present at New Relic’s FutureStack conference in their Hacker Lounge track. I had a great time, and it was a pleasure to introduce trainline to those across the pond and further afield.

All the sessions were filmed and mine has made it onto YouTube. The talk was on how we at the trainline adopted and scaled the use of New Relic across the entire organisation and the value that we’re getting out of it, including how team KPIs are judged on some of the metrics we get out of New Relic.

Video: Paul Kiddie – New Relic in action at trainline

Continue reading

What the ELK!? – Log Aggregation

Everyone loves logs, right?

No…

Logs are long, complex and full of useless information, and it takes ages to find the one error message you need to fix a problem. So if you’re working with over 100 servers and getting over 200GB of logs a day, how can you get through your logs to find the real information inside?

Continue reading

Chef on Windows – detecting and fixing WMI problems which prevent chef-client runs

At thetrainline.com we use Opscode Chef for managing our build infrastructure. Like many other tools running on Windows, the chef-client ohai framework relies on WMI for extracting information about the server machine on which scripts are being run. We found that Windows WMI repository corruption can cause chef-client runs to fail due to missing WMI classes, which causes the node to remain out of policy. The WMI repo can be repaired using winmgmt /salvagerepository, and the WMI errors can be monitored using the WMIDiag script to alert on WMI repository corruption before future chef-client runs. This post details how we detected and fixed the problem, and how to monitor for WMI repository corruption.
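As a rough illustration of the detect-then-repair idea, here is a minimal Python sketch. It is not the approach described in the full post, which monitors with the WMIDiag script; instead it swaps in `winmgmt /verifyrepository` as a simpler check and only calls `winmgmt /salvagerepository` when the check reports a problem. The function names are hypothetical, and the assumption that the verify output contains the word "consistent" or "inconsistent" should be confirmed on your own servers before relying on it.

```python
import subprocess


def repository_is_consistent(verify_output: str) -> bool:
    """Interpret the text printed by `winmgmt /verifyrepository`.

    Assumption: a healthy repository is reported with the word
    "consistent" and a corrupt one with "inconsistent".
    """
    text = verify_output.lower()
    return "consistent" in text and "inconsistent" not in text


def check_and_repair(run=subprocess.run) -> str:
    """Verify the WMI repository and salvage it if it looks corrupt.

    `run` is injectable so the logic can be exercised without Windows.
    Returns "ok" if nothing was done, "salvaged" if a repair was attempted.
    """
    result = run(["winmgmt", "/verifyrepository"],
                 capture_output=True, text=True)
    if repository_is_consistent(result.stdout):
        return "ok"
    # Repair step named in the post; needs an elevated prompt on Windows.
    run(["winmgmt", "/salvagerepository"], capture_output=True, text=True)
    return "salvaged"
```

Running something like `check_and_repair()` from a scheduled task ahead of each chef-client run is one possible way to catch corruption early; whether that fits alongside WMIDiag-based alerting is a judgment call for your own setup.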

Continue reading

thetrainline.com @ Velocity EU 2012

The Velocity EU conference was held in London this year during the first week of October. thetrainline.com sent a contingent to see the latest and greatest ideas and concepts in high performance web sites and operations.

Velocity, for those who don’t know, is “the best place on the planet for web ops and performance professionals like you to learn from your peers, exchange ideas with experts, and share best practices and lessons learned.” The conference is chaired by Steve Souders (Head Performance Engineer @ Google) & John Allspaw (VP of Tech Operations @ Etsy), so people who know their stuff then. It has been running since 2008 and now happens 3 times per year in China, the USA and Europe.

So then, what did we learn and how are we going to use it?

Continue reading

Engineering Day – Lifting the Bonnet

Over the past year there have been a huge number of changes and improvements carried out by the engineering teams at thetrainline.com, so we techies decided to invite the non-techies within the company to a day of talks and demos on 28th September to ‘lift the bonnet’ and explain what we get up to; to showcase some of the work we have been doing to improve user experience, speed up deployments, reduce operational costs, and generally Make Things Better.

thetrainline.com Engineering Day - posters

In all there were 16 sessions of around 15-20 minutes each, with presenters from almost every team within the IDG (engineering) department.

Continue reading