Supercharging your Production Monitoring


At Trainline, our development teams have come a long way from what and how we were doing things a while back to what and how we are doing things today. Here are just a few of the things that have changed completely in the last couple of years: the move to continuous delivery, a massive increase in automated testing, new infrastructure, blue-green deployments, load balancing, and alerting and monitoring.

What has caught my interest over the last year is the extent of the monitoring we now have available, and how we need to choose what to look at and what to leave out.

“Less is more, some would say, but in production more is awesome!!!”

Trainline has moved away from the model of teams creating APIs where code and configuration are written and then thrown over the wall for a different team to make them work. We have successfully made the transition to a model where the team supports an API all the way to production. The new model has made the teams more aware not only of how our code behaves in production, but also of how our infrastructure is set up and the tools that we use. The most important thing that helps us achieve this, without getting up in the middle of the night or working over the weekend, is the copious monitoring of our APIs.

Over time, we have set up monitoring at more and more different levels, including at some levels that we did not even know existed when we first started! I am sure there is much more that we can monitor, but one must be careful to distinguish noise from useful data.

So we have split our monitoring into the following aspects:

Health Checks – External Availability

The most important thing for us to ensure is that our APIs are publicly available all the time. There is no point in doing all the hard work of creating and setting up an API if it is not available to our customers. We are using New Relic’s product Synthetics for setting this up, which has given us the capability of pinging our APIs from multiple geographies. This helps Trainline, particularly now that our services are extending across Europe, to give a consistent user experience to all our customers internationally.
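
To make these external checks cheap and reliable, each API just needs a fast, unauthenticated endpoint that a monitor can hit from anywhere. As a minimal illustration (the route and response shape below are assumptions, not our actual endpoint), an ASP.NET Web API health check can look like this:

```csharp
using System.Web.Http;

// Minimal health-check endpoint that an external monitor such as
// New Relic Synthetics can ping from multiple regions.
// The route and response body here are illustrative placeholders.
public class HealthCheckController : ApiController
{
    [HttpGet]
    [Route("api/healthcheck")]
    public IHttpActionResult Get()
    {
        // A ping monitor only needs a fast, unauthenticated 2xx response
        // to confirm that the API is reachable and able to serve requests.
        return Ok(new { status = "healthy" });
    }
}
```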


Error Monitoring – Functional Availability

Once an API is externally available, it then needs to be functionally available. By this I mean that when I call my API, it should always come back with either a successful response or a handled error response. We achieve this by using New Relic APM monitoring on errors and Apdex. This is especially useful to us when we are deploying changes to production. With a commit-to-deploy time of only 10 minutes, this monitoring gives us a lot of confidence.
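
For anyone unfamiliar with Apdex: responses faster than a target threshold T count as satisfied, responses between T and 4T as tolerating, and anything slower as frustrated. Here is a small sketch of the standard calculation, purely for illustration:

```csharp
using System;
using System.Linq;

// Illustrative sketch of the standard Apdex formula:
// Apdex = (satisfied + tolerating / 2) / total samples,
// where "satisfied" responses are <= T and "tolerating" responses are <= 4T.
public static class Apdex
{
    public static double Score(TimeSpan[] responseTimes, TimeSpan threshold)
    {
        if (responseTimes.Length == 0)
            return 1.0; // no traffic, nothing to score

        int satisfied = responseTimes.Count(t => t <= threshold);
        int tolerating = responseTimes.Count(
            t => t > threshold && t <= TimeSpan.FromTicks(threshold.Ticks * 4));

        return (satisfied + tolerating / 2.0) / responseTimes.Length;
    }
}
```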


Response Time Monitoring – Usability Feedback

We are all aware that slow response times can drive potential customers away from your application. Trainline has many years' experience of doing everything to make sure that all our APIs respond well within our defined SLAs. The team I work on creates APIs that expose Trainline's core capabilities externally, so it is very important for us to monitor not only our own response times but also the response times of our underlying APIs, as any increase in their response times gets passed on to our clients. To achieve this, we use a mixture of tools such as New Relic APM monitoring and custom attributes in New Relic Insights.
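
As a rough sketch of the custom-attribute approach (the attribute name and the agent call below are assumptions, not our exact implementation), a downstream call can be timed and the latency attached to the current transaction so it can be queried in Insights alongside our own response time:

```csharp
using System.Diagnostics;
using System.Net.Http;
using System.Threading.Tasks;

// Illustrative sketch: time a downstream API call and record the latency
// as a custom attribute on the current transaction. The attribute name and
// the use of the New Relic agent's AddCustomParameter call are assumptions,
// not Trainline's actual implementation.
public class DownstreamClient
{
    private static readonly HttpClient Http = new HttpClient();

    public async Task<HttpResponseMessage> GetWithTimingAsync(string url)
    {
        var stopwatch = Stopwatch.StartNew();
        var response = await Http.GetAsync(url);
        stopwatch.Stop();

        // Attach the downstream latency to the current transaction so it
        // can be sliced and diced in New Relic Insights queries.
        NewRelic.Api.Agent.NewRelic.AddCustomParameter(
            "downstreamResponseMs", stopwatch.ElapsedMilliseconds);

        return response;
    }
}
```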


Feature Monitoring – Business Process Usage

Even with all of this monitoring in place, we have still seen that certain issues can fly under the radar if we don't have some custom monitoring set up. To give an example, on one occasion our iOS app was experiencing issues with the integration between PayPal and our PayPal client. As a result, for a period we had no PayPal bookings flowing through our booking API. Even though PayPal was not available, our total revenue and booking volume were at expected levels, because the customers who were not able to make PayPal bookings were able to pay by alternative means (i.e. cards). From a revenue perspective we were not impacted by this blip, but it was nevertheless important to be aware that a part of our API was not getting as many hits as we expected. To give ourselves this capability, we developed a custom solution which gathers API attributes from multiple sources and then publishes messages to our Slack channel when a certain threshold is breached.
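
The sketch below shows the general shape of such a check, not the actual solution: compare a feature's recent usage against an expected minimum and post to a Slack incoming webhook when it drops too low. The webhook URL, feature name and threshold are all placeholders.

```csharp
using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

// Illustrative threshold check: if a feature's observed usage falls below
// an expected minimum, post a message to a Slack incoming webhook.
// The webhook URL, feature name and numbers are placeholders.
public class FeatureUsageAlert
{
    private static readonly HttpClient Http = new HttpClient();
    private const string SlackWebhookUrl = "https://hooks.slack.com/services/XXX/YYY/ZZZ";

    public async Task CheckAsync(string feature, int observedCount, int expectedMinimum)
    {
        if (observedCount >= expectedMinimum)
            return;

        var message = $"{{\"text\":\"Feature '{feature}' had {observedCount} hits " +
                      $"in the last hour (expected at least {expectedMinimum}).\"}}";

        // Slack incoming webhooks accept a simple JSON payload with a "text" field.
        await Http.PostAsync(SlackWebhookUrl,
            new StringContent(message, Encoding.UTF8, "application/json"));
    }
}
```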


Process Monitoring – Process Status

Trainline is mostly a .NET shop and we use IIS a lot to host our APIs. There have been times when an app pool would fail to come back up after a regular recycle. To make sure this does not go unnoticed, we have set up SCOM alerts on various process events.
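
For illustration, the same kind of check can be expressed in code against the IIS management API; this is only a sketch of the idea, not our SCOM configuration:

```csharp
using System;
using Microsoft.Web.Administration; // ships with IIS / the Microsoft.Web.Administration package

// Illustrative sketch only (our real alerting is done through SCOM):
// inspect IIS application pool state and restart any pool that did not
// come back after a recycle.
public static class AppPoolWatcher
{
    public static void EnsurePoolsRunning()
    {
        using (var serverManager = new ServerManager())
        {
            foreach (ApplicationPool pool in serverManager.ApplicationPools)
            {
                if (pool.State == ObjectState.Stopped)
                {
                    Console.WriteLine($"App pool '{pool.Name}' is stopped - restarting.");
                    pool.Start();
                }
            }
        }
    }
}
```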


Hardware Monitoring – Instance Health

We use a mix of tools to monitor CPU, RAM, disk and network for our EC2 instances. The idea is to have enough headroom for spikes, but not so much that we are over-provisioned. New Relic, Graphite and SCOM help us monitor and set alerts on all of these parameters.
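
As a rough illustration of the instance-level metrics involved, the sketch below samples CPU and available memory using Windows performance counters; the counters and output are examples, not our Graphite/New Relic/SCOM setup:

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Illustrative sample of instance-level metrics (CPU and free memory)
// via Windows performance counters. The counters and thresholds you would
// alert on are examples, not our actual configuration.
public static class InstanceHealth
{
    public static void Sample()
    {
        using (var cpu = new PerformanceCounter("Processor", "% Processor Time", "_Total"))
        using (var freeMemory = new PerformanceCounter("Memory", "Available MBytes"))
        {
            // The first CPU reading is always 0, so take two samples.
            cpu.NextValue();
            Thread.Sleep(1000);

            Console.WriteLine(
                $"CPU: {cpu.NextValue():F1}%  Free RAM: {freeMemory.NextValue():F0} MB");
        }
    }
}
```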


With all this monitoring in place, we have different ways of alerting set up: we use PagerDuty for really critical incidents, Slack for threshold breaches, and we eyeball dashboards in New Relic for passive monitoring.
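
A simplified sketch of that routing rule is shown below; the severity levels and the senders are assumptions for illustration, not our actual alerting code.

```csharp
using System;
using System.Threading.Tasks;

// Illustrative routing rule: critical incidents page someone via PagerDuty,
// threshold breaches go to Slack, and everything else only shows up on
// dashboards. The severities and senders are hypothetical placeholders.
public enum AlertSeverity { Info, ThresholdBreach, Critical }

public class AlertRouter
{
    private readonly Func<string, Task> _pageViaPagerDuty; // hypothetical sender
    private readonly Func<string, Task> _postToSlack;      // hypothetical sender

    public AlertRouter(Func<string, Task> pageViaPagerDuty, Func<string, Task> postToSlack)
    {
        _pageViaPagerDuty = pageViaPagerDuty;
        _postToSlack = postToSlack;
    }

    public Task RouteAsync(AlertSeverity severity, string message)
    {
        switch (severity)
        {
            case AlertSeverity.Critical:        return _pageViaPagerDuty(message);
            case AlertSeverity.ThresholdBreach: return _postToSlack(message);
            default:                            return Task.CompletedTask; // dashboards only
        }
    }
}
```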

Conclusion

The monitoring that we have set up over the last year has deepened the team's understanding of our APIs' characteristics and infrastructure. This in turn has given our business and my team the confidence to make any type of change to code and infrastructure in production really fast, and to try new things quickly!

About the author

Akshay is a developer working with Trainline. He leads the API development initiative for the mobile teams. He has worked extensively with .NET and Java, and loves AWS, TDD, CI/CD and Agile.
