Clusters, IT Operations and why Availability is the wrong word

It has been a very busy time at thetrainline: transformational projects have been implemented to improve our development capacity and IT Operations, including automation of our build agent environment using Chef and more direct control over our hardware across the entire Development to Production pipeline. Consequently, the Engineering Blog has been somewhat overlooked; my apologies.

Clusters

Even though the above is an excellent list of cool stuff that needed doing, for me the most important change (aside from the arrival of a new CTO, Mark Holt) has been the shift from a project-centric development approach to ‘Clusters’. Clusters have been on the (Kanban) Cards for a while at thetrainline but have taken time to make it through the Pipeline. The brainchild of Duncan Freke, our Development Director, Clusters are about aligning the Dev and Commercial teams on a Product basis and, by doing so, empowering the Product Owners to take ownership and derive true value for the customer. Sounds like Marketing hype, right? Yes, but that does not mean it’s not true.

Where do IT Operations fit in?

Given that Clusters is primarily an evolution of the Development and Commercial departments, you would be forgiven for wondering why, as Head of IS Operations, I am writing about this change. The answer is simple: I believe IT Operations is not just about Availability but about the Service we provide to our customers.

For example, when I tell someone who I work for and the first thing they mention (after asking if I get cheap train tickets) is our booking fee, I take it personally. Not because I think they are being cheap but because we have failed them. If the experience they have received – whether it is the performance of the site, the features they used or the information we gave them throughout the booking flow – did not give them added value equal to the booking fee, then we have done something wrong.

This is why Clusters is important to me. We now have true ownership of the products we offer, and as such my teams have a consistent place to go and explain the customer experience that we see – this does not exist with a project-centric approach.

Why is Availability the wrong word?

So, after I have apologised to someone I have just met for them not getting the value out of our site, I explain what I do, but I actively avoid using the word Availability. It’s not that I don’t like the word or what it stands for, but for me it’s the wrong focus.

To explain why, I am going to digress for a minute and draw an analogy with mobile phones. You may remember some of these beauties (I even had a pencil case that looked very much like the second from the left :) ):

[Image: mobile-evolution-1]

Source: Kyle Bean (Mobile Evolution). Works licensed under a Creative Commons BY‑NC‑ND 3.0 License

Once mobile phones entered the mainstream they rapidly got smaller and smaller. However, as I was reliably informed some years ago by an Industrial Designer friend of mine, miniaturisation (in general) was not driven by consumer demand but was a natural result of improvements in technology. Availability of IT systems is no different. As the technology improves from dedicated hardware to virtualisation to Cloud computing, availability should increase as a by-product. Therefore, to focus only on availability is like making a phone from the 80s smaller and smaller. Where is the ease of use? Where is the bigger screen? Where are the Apps?!

So, if miniaturisation was not the primary consumer demand, and therefore not the right focus, what is?

Time

It’s Time. The thing that everyone wants more of, and that we are willing to pay for. Time is the great demand of consumers. Don’t believe me? Take a look at the retail cost of Laptops and see how much more you pay for a faster Laptop:

[Image: ProcessorCost]

Prices courtesy of Insight UK 03/06/14

I find it hard to believe that the cost to design and make an i7 processor is ~£600 more per unit than an i3, even when you factor in the difference in motherboard, number of cores etc. No, these are commodities, most likely made in the same factory and at similar cost. You are paying a premium for the convenience of time – saving yourself many seconds on each operation, which in total means you can get more done.

If Miniaturisation is to Availability, Time is to…

Service! When you remove the focus on availability and understand that Time is the new currency, you realise that uptime is not the be-all and end-all of IT Operations – it’s just the beginning. Service is the key.

That means we need to be measuring real customer performance, we need to be tracking errors that slow customers down, and we need to build robust, compartmentalised systems so that a problem in one area for one customer does not impede another customer.

It also means that as an IT Operations team we need to be able to respond to issues that come from the other departments (Contact Centre, Commercial teams etc.) more quickly and, crucially, first time, without the waste associated with ticket queues and handoffs.

Finally, we should see Clusters as our opportunity to work closely with Product Owners, feeding back these experiences so that they can make Brilliant Products that constantly strive to meet customer demand and expectations.

This is why the change to Clusters is important: it should be a catalyst for the IT Operations teams to move from the traditional Availability is King model to one where Customer experience is King. That means being more proactive, being more accepting of Agile, Dev-driven processes like Continuous Deployment, and always asking ourselves: are we providing the best experience possible?

In short, the new word is Service, not Availability.

How to use NUnit TestCase to simplify near-identical test cases

A situation coders often face, especially when following test-driven development, is writing very similar test cases that differ only in, for example, the expected and actual values, along with some set-up parameters. We often end up writing dozens, nay, hundreds of near-identical test cases, and end up with a test class that looks like it has suffered from a terminal case of copy-paste. This blog post shows a little-known technique for making this sort of test class a little more readable using the NUnit TestCase attribute.
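As a rough illustration of the idea (a minimal sketch, not the code from the full post), the example below collapses three near-identical tests into one parameterised test. FareCalculator and ApplyDiscount are hypothetical names invented for this example, and the sketch assumes a project that references the NUnit framework.

```csharp
using NUnit.Framework;

// Hypothetical class under test, invented for this sketch.
public static class FareCalculator
{
    // Returns the fare in pence after subtracting a whole-number percentage discount.
    public static int ApplyDiscount(int farePence, int discountPercent)
    {
        return farePence - (farePence * discountPercent / 100);
    }
}

[TestFixture]
public class FareCalculatorTests
{
    // Each TestCase attribute runs the method once with the listed arguments,
    // so one method body replaces three copy-pasted test methods.
    [TestCase(1000, 0, 1000)]
    [TestCase(1000, 25, 750)]
    [TestCase(1000, 100, 0)]
    public void ApplyDiscount_ReturnsExpectedFare(int farePence, int discountPercent, int expectedPence)
    {
        int actualPence = FareCalculator.ApplyDiscount(farePence, discountPercent);

        Assert.That(actualPence, Is.EqualTo(expectedPence));
    }
}
```

Adding another scenario then means adding one more attribute line rather than another whole test method.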

Continue reading

Moving to Multiple Deployments Per Week at thetrainline.com

Here at thetrainline.com we have several useful online tools for helping our customers plan and manage their train travel, including Train Times and Live Departure Boards. We recently changed the way we build, test, and deploy these kinds of applications to enable us to release new features much more frequently and easily; in fact, we shortened the deployment cycle from one deployment every few months to multiple deployments per week. These changes have produced a sea change in team culture, with a marked increase in product ownership by the team. This post describes what we’ve done so far, and where we want to go over the coming months.

Continue reading

thetrainline.com at Silicon Milkroundabout 6.0 – November 17th 2013

We (the tech team at thetrainline.com) will be at the Silicon MilkRoundabout recruitment fair on 17th November 2013, between 12 noon and 5pm. The event is at the Old Truman Brewery, Brick Lane, London.

Drop by and visit us on stand 17, and have a chat about what we’re up to!

[Image: thetrainline at Silicon Milk 2013]

We were at Silicon MilkRoundabout 5.0, so look out for our dark blue stand.

Leaving the Platform – Branching and Releasing for Independent Subsystems

For several years, much of the code for the systems at thetrainline.com has been versioned and deployed together as a single ‘platform’. Recently, we have begun to divide up the platform into smaller chunks, to enable us to deliver some parts more frequently and rapidly, leaving other parts to evolve more slowly (as needed). Moving from a single version number for all subsystems to multiple version numbers for independent subsystems has implications for how code is built and released; this blog post outlines some of the work we have done so far in this area.

My colleague Owain Perry and I recently presented on this topic at the London Continuous Delivery meetup group (http://londoncd.org.uk/) and the slides we showed relate to the details in this post:

Continue reading

Chef on Windows – detecting and fixing WMI problems which prevent chef-client runs

At thetrainline.com we use Opscode Chef for managing our build infrastructure. Like many other tools running on Windows, the chef-client ohai framework relies on WMI for extracting information about the server machine on which scripts are being run. We found that Windows WMI repository corruption can cause chef-client runs to fail due to missing WMI classes, which causes the node to remain out of policy. The WMI repo can be repaired using winmgmt /salvagerepository, and the WMI errors can be monitored using the WMIDiag script to alert on WMI repository corruption before future chef-client runs. This post details how we detected and fixed the problem, and how to monitor for WMI repository corruption.

Continue reading

Using Visual Studio 2010 to target .NET 3.5

In common with other big systems, thetrainline’s systems use a variety of technologies under the hood. Most of our code is written for the .NET framework, although there are bits of other technology stacks in there as well.

Recently, working with a project targeting version 3.5 of the .NET framework using Visual Studio, I came across a rather subtle gotcha.

Visual Studio 2010 was released in April 2010 and by default will target version 4 of the .NET framework. Version 4 of .NET came with, amongst other things, the following features:

  • The Parallel extensions library.
  • Dynamic dispatch.
  • Named parameters.
  • Optional parameters.

It was this last feature – optional parameters – that was the original source of this gotcha, leading to ‘error CS0241: Default parameter specifiers are not permitted’.
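For readers who have not met the feature, here is a minimal sketch of what an optional parameter looks like, alongside an overload-based equivalent that avoids the newer syntax. Greeter and GreeterCompat are hypothetical names made up for this illustration; the only assumption about tooling is that a compiler predating C# 4 rejects the first declaration with CS0241.

```csharp
public static class Greeter
{
    // Optional parameter (C# 4 syntax): callers may omit 'greeting'.
    // A compiler that predates C# 4 reports error CS0241 on this line.
    public static string Greet(string name, string greeting = "Hello")
    {
        return greeting + ", " + name + "!";
    }
}

public static class GreeterCompat
{
    // Overload-based equivalent: no default parameter specifier,
    // so it also compiles with older C# compilers.
    public static string Greet(string name)
    {
        return Greet(name, "Hello");
    }

    public static string Greet(string name, string greeting)
    {
        return greeting + ", " + name + "!";
    }
}
```

Both Greeter.Greet("world") and GreeterCompat.Greet("world") produce the same string; the difference is only in which compilers accept the declaration.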

Continue reading