Service Tech Symposium: Fault-Tolerant Cloud Design, Conway’s Law and SOA


On 24 September I went to Service Technology Symposium 2012 in London to see the latest industry thinking around cloud and datacentre automation. In the engineering team at thetrainline.com we have recently been busy fleshing out our strategy for cloud computing. We virtualised our Production infrastructure a few years ago, and we’re now looking at various forms of public and hybrid cloud (on-premise plus IaaS/PaaS) and more advanced infrastructure automation.

Two of the sessions in particular were useful: Fault-Tolerant Cloud Computing by John deVadoss of Microsoft [slides], and Conway’s Law and Service-Orientation by HP and Vodafone [slides].

Fault-Tolerant Cloud Computing

John deVadoss shared a useful 30 minutes on how to design cloud-based systems to be fault-tolerant. Much of the material echoed recent articles and Velocity talks on resilience by people such as John Allspaw, so it was good to see the same kind of messages coming from Microsoft.

One slide out of John’s deck captured the essence of the talk:

[Slide from John deVadoss: Fault-Tolerant Cloud Computing]

Here is my lo-fi version from the audience:

John deVadoss at Service Tech Symposium

The five principles of resilient cloud computing outlined by John are:

  1. Partition by Application Workload – “the most important aspect of resilient SOA systems is that you understand the workload”
  2. Establish a Lifecycle Model & Design for Operations – periods of peak demand show different trends across the day, week and year; your monitoring should expect peaks and troughs.
  3. Establish an Availability Model and Plan – do not pretend that your service can be 99.99% available if it relies on five other services which are each 99.99% available; the maximum you can expect from such a composite system is about 99.95% availability (0.9999 multiplied by itself five times). It seems that these kinds of basic resiliency calculations are often overlooked.
  4. Identify Failure Points and Failure Modes – work out the ways in which the application can commonly fail (DB connection failure, storage full, incorrect permissions) and design the system to be resilient in these scenarios.
  5. Review Resiliency Patterns – “Traffic management is critical for mature fault-tolerant cloud services”.
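
The availability arithmetic in principle 3 is easy to check. Here is a minimal sketch in Python, assuming the dependencies fail independently and that the service needs every one of them to handle a request; the function name is my own for illustration and does not come from the talk.

```python
import math


def composite_availability(dependency_availabilities):
    """Upper bound on the availability of a service that needs every dependency.

    Assumes failures are independent and that any single dependency being
    down makes the whole service unavailable (serial composition).
    """
    return math.prod(dependency_availabilities)


# Five dependencies, each 99.99% available (the example from the talk).
ceiling = composite_availability([0.9999] * 5)
print(f"{ceiling:.2%}")  # roughly 99.95%
```

If the service's own components are also only 99.99% available, that is a sixth factor in the product and the ceiling drops to roughly 99.94%.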

Of course, these are all good practice for resiliency in any engineered system, not just software. They are especially crucial for cloud computing because in cloud systems we typically have influence over the resilience of only a small part of the system, and therefore must assume that parts of the system will fail or be unavailable at times.
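
One common way to "assume failure" in code is to retry a flaky dependency a few times and then fall back to a degraded answer. The sketch below is my own illustration rather than anything shown in the talk; `call_with_fallback`, `primary` and `fallback` are hypothetical names, and treating only `ConnectionError` and `TimeoutError` as retryable is an assumption.

```python
import time


def call_with_fallback(primary, fallback, attempts=3, base_delay=0.1):
    """Try `primary` up to `attempts` times, then return `fallback()`.

    Only connection and timeout errors are treated as transient; any other
    exception propagates to the caller unchanged.
    """
    for attempt in range(attempts):
        try:
            return primary()
        except (ConnectionError, TimeoutError):
            # Back off exponentially between attempts, but not after the last.
            if attempt < attempts - 1:
                time.sleep(base_delay * 2 ** attempt)
    return fallback()


def unreachable_service():
    # Hypothetical dependency that is currently down.
    raise ConnectionError("dependency unavailable")


result = call_with_fallback(unreachable_service, lambda: "cached result",
                            base_delay=0.01)
print(result)
```

The fallback here is deliberately trivial; in a real system it might be a cached response, a default value, or a clear error to the user.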

Another Microsoft guy, Dan Rosanova, recently wrote about Hybrid Cloud architectures in Service Technology Magazine (the article came out just before the symposium). Dan’s characterisation of four main cloud deployment models (Cloud Front-end, Cloud as Reliable-secure Bridge, On-premise With Elastic Cloud Scaling, Cloud for Disaster Recovery) neatly complements John deVadoss’s talk on cloud resilience by providing examples of how different cloud deployment patterns suit certain workloads.

Conway’s Law and Service Orientation

Conway’s Law (more of a heuristic than a law) holds that the structure of a software system will reflect the communication structures within the organisation developing the system. Thus, if you have three teams developing a system, with good communication between two of the teams and poor communication with the third, you will likely end up with a software system with three parts: two which probably work well together and a third part which is difficult to integrate.

Roger Stoffers (Hewlett Packard) and Marc Schmeetz (Vodafone) spoke about how they applied Conway’s Law in reverse (that is, set up your team structure to match the software structure you know you need), an idea which is gaining popularity (e.g. Allan Kelly, Jeff Sutherland, CJ Marsh). I was struck by the extent of the change introduced at Vodafone: they changed whole team structures in order to facilitate a service-oriented operating model. Roger and Marc talked about using ‘Process Archaeology’ to really understand how your tech team communication works so you can best exploit Conway’s Law in reverse. The main message was: reconfigure your teams to deliver the architecture you want (see slide 7 in the deck).

At thetrainline.com, we’ve been moving in this service-oriented direction for a few months now, and we’re starting to see the benefits: fewer release bugs, better Dev and Ops collaboration, and faster delivery. We’ve found that moving from a traditional structure of functional teams (often called ‘silos’) towards cross-functional, service-focussed teams is not easy, but some key activities have helped enormously, such as open-forum major incident post-mortems, our weekly software forum (Burrito Club), and infrastructure automation activity, all of which bring people from different teams together.
