Trainline’s journey from MSMQ to SQS


Whenever I reach a milestone or finish something, I pause and reflect on the journey and what I have learnt. I’m going to talk about our journey of migrating one of our systems from MSMQ to SQS.

When we started, we had a system that looked like this:

[Diagram: the original MSMQ-based system]


Here are some of its characteristics:

  • fully asynchronous
  • built on NServiceBus using MSMQ for message transport
  • pub-sub model

It had its issues, but overall it worked well. However, when we started moving Trainline systems to the cloud (we use AWS), we carried out an audit and realised that this part of the system was not very cloud-friendly.

What were the pain points to address?

No visibility

Message queues were local to machines. Every time we needed to take a look, we had to log in remotely to those machines. We had fixed machines with fixed IP addresses. Our machines were like pets. While we were in a non-cloud-hosted environment, this approach was just about workable. But once you’re in the cloud, such practices have to go. Cloud hosting forces you to treat machines like cattle, not pets, to use the popular, if slightly grisly, analogy. Your system should be able to cope with a machine getting killed and a new one spinning up. Relying on fixed IPs is not workable, as machines and IPs will change very frequently.

Flakiness

There were too many steps to make the whole system work. Having too many queues was a big contributor to the flakiness. We had incidents where messages were stuck on outgoing queues due to network issues. Then subscriptions did not work because the subscription service had not processed the message. The asynchronous nature of failures like these meant they did not throw exceptions and went unnoticed for a long time. The system was hard to diagnose, and narrowing down issues was very time-consuming.

Not fully resilient

As the system was queue-based and asynchronous, it was resilient enough to cope with some scenarios, like a service restarting or a system being partially unavailable for a short period. But it lacked resilience in other areas. As queues were local to machines, if we lost a machine, we had no way of tracking which messages were lost with it. Workers were also dependent on the availability of the distributor, so if a distributor went down, workers would not receive any messages to process.

Scalability in AWS

A distributor managed a list of available workers. It kept track of each worker’s IP address and queue name to be able to forward messages. But in the cloud, we didn’t want to rely on fixed IPs or machines: if a worker node died, there wouldn’t be anything to tell the distributor about it.

Steep learning curve

There is a lot that you need to learn to start using NServiceBus effectively. It would take a long time for new team members to get up and running. There were lots of configurations and framework-specific idiosyncrasies that you needed to know just to understand what was going on!

We need a better queuing system!

So we concluded that we needed to get a better queuing system and, ideally, to reduce the complexity built up around NServiceBus.

SQS vs RabbitMQ

There are many queuing systems, but we eventually narrowed it down to SQS and RabbitMQ as the final contenders.

RabbitMQ

• We have experience. We already use RabbitMQ in multiple parts of our system.
• All the infrastructure is already set up, so we would not have to invest time in that.

SQS

• It’s AWS. As we are hosting our system in AWS, it makes sense to look first at the options available in our current ecosystem.
• Low maintenance. As it’s hosted by Amazon and we consume it only as a service, we would have almost no maintenance overhead.

Here is how they compared on points which were important to us:

| Criterion | RabbitMQ | Amazon SQS |
|---|---|---|
| Support | Good community support | Support from Amazon, apparently also good; steadily growing community |
| High availability / Reliability | Requires additional effort and money | Highly reliable by default (at-least-once delivery guaranteed). No additional effort needed. |
| Message loss | More moving parts and configuration; messages can be lost in multiple ways | Very little configuration and comparatively more reliable |
| Pub-sub pattern with multiple clients | Done in a standard way, easy to use | Can be implemented using Amazon Simple Notification Service: https://aws.amazon.com/sns/faqs/ (first question ‘What is Amazon Simple Notification Service (Amazon SNS)?’). See also http://www.infoq.com/articles/AmazonPubSub |
| Message consumption pattern | Push is the more usual approach | Polls at specific intervals |
| Message acknowledgement | More usual, configurable | Less usual: a received message is hidden for a visibility timeout; a consumer that needs extra time has to declare that the message is still being processed |
| NServiceBus | Supported | Community-run transport extension |
| FIFO | Yes | No, by design (https://aws.amazon.com/sqs/faqs/ question ‘Does Amazon SQS provide first-in-first-out (FIFO) access to messages?’) |
| Security | Authentication plugins | Identity and Access Management |
| Message security | Messages are in memory (non-persistent) | Messages are persisted, so we need to write code for message encryption |
| Message filtering | Yes | No, but there is a workaround: http://stackoverflow.com/questions/22196890/routing-messages-from-amazon-sns-to-sqs-with-filtering |
| Configuration and maintenance effort needed | Large | Small |
| Effort needed to achieve scalability | Large | Small |
| Max message size | Memory of the server | 256 KB |
| Price | Cheaper when the number of messages is enormous (thousands per second) | Quite cheap when the number of messages is not enormous (below thousands per second). See https://aws.amazon.com/sqs/pricing/ |
| Performance | Very good | Good |
| Should be chosen when the crucial factors are | Price, performance | Scalability, reliability, integration with services hosted in Amazon |
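The message consumption and acknowledgement points in the comparison are the least familiar part of SQS, so here is a small in-memory sketch of the idea. It is not SQS itself and not our production code: `TinyQueue` and its method names are made up for illustration, and time is passed in explicitly so the behaviour is easy to follow.

```python
class TinyQueue:
    """In-memory sketch of SQS-style receive/delete with a visibility timeout."""

    def __init__(self, visibility_timeout):
        self.visibility_timeout = visibility_timeout
        self.messages = {}  # message id -> [body, time at which it is visible again]
        self.next_id = 0

    def send(self, body):
        self.messages[self.next_id] = [body, 0]
        self.next_id += 1

    def receive(self, now):
        """Return (id, body) for the first visible message and hide it, or None."""
        for message_id, entry in self.messages.items():
            if entry[1] <= now:
                entry[1] = now + self.visibility_timeout
                return message_id, entry[0]
        return None

    def extend(self, message_id, now, extra):
        """The consumer declares it is still processing and needs more time."""
        self.messages[message_id][1] = now + extra

    def delete(self, message_id):
        """Acknowledge: the message is done and is never delivered again."""
        del self.messages[message_id]


queue = TinyQueue(visibility_timeout=30)
queue.send("book ticket")

first = queue.receive(now=0)          # delivered, then hidden until t=30
hidden = queue.receive(now=10)        # None: invisible to other consumers
redelivered = queue.receive(now=31)   # never deleted, so it is delivered again
queue.extend(redelivered[0], now=50, extra=60)  # still working: hide until t=110
still_hidden = queue.receive(now=70)  # None: the extension is in effect
queue.delete(redelivered[0])          # acknowledge once processing succeeds
empty = queue.receive(now=200)        # None: acknowledged messages are gone
```

The point to take away is that a message which is received but never deleted comes back, which is why delivery is at least once and why consumers should tolerate duplicates.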

After reviewing all the points mentioned above, we decided to go ahead with SQS.

How our system looks now

[Diagram: the current SQS-based system]


What did we gain?

Scalability:

Subscriptions are managed at the SNS level and abstracted away from the worker. So adding more workers or removing any of them does not need any subscription management. As soon as any new worker comes up, it just starts reading from a fixed queue.
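To make the scalability point concrete, here is a toy in-memory sketch of the pattern: a topic fans each message out to every subscribed queue, and any number of workers compete for messages on one fixed queue. The `Topic` class, `drain` function, and queue names are hypothetical stand-ins, not SNS or SQS APIs.

```python
from collections import deque


class Topic:
    """In-memory stand-in for a topic: copies each message to every subscribed queue."""

    def __init__(self):
        self.queues = []

    def subscribe(self, queue):
        self.queues.append(queue)

    def publish(self, message):
        for queue in self.queues:
            queue.append(message)


def drain(queue, worker_names):
    """Let workers take turns pulling from one shared queue (competing consumers)."""
    handled = {name: [] for name in worker_names}
    turn = 0
    while queue:
        name = worker_names[turn % len(worker_names)]
        handled[name].append(queue.popleft())
        turn += 1
    return handled


topic = Topic()
bookings = deque()  # hypothetical queue names
emails = deque()
topic.subscribe(bookings)
topic.subscribe(emails)

for n in range(4):
    topic.publish(f"event-{n}")

# Adding a third worker needs no change to the topic or its subscriptions.
result = drain(bookings, ["worker-a", "worker-b", "worker-c"])
```

Subscriptions exist only between the topic and the queues, so the list of workers can grow or shrink without anyone being told about it.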

Visibility:

Now queues are external and easy to look at using the AWS Console. If a worker fails to process a message, after 5 attempts the message is moved to our dead-letter queues. We have alerting set up on the dead-letter queues. All of this happens without any need to log in to any worker nodes, which is a huge relief!
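In SQS, the "move after 5 attempts" behaviour is configured on the source queue through its `RedrivePolicy` attribute. A minimal sketch of building that attribute follows; the helper name and the ARN are made up for illustration and are not our real configuration.

```python
import json


def redrive_attributes(dead_letter_queue_arn, max_receive_count=5):
    """Build the queue attributes that attach a dead-letter queue to a source queue."""
    policy = {
        "deadLetterTargetArn": dead_letter_queue_arn,
        # Number of receives allowed before SQS moves the message to the dead-letter queue.
        "maxReceiveCount": max_receive_count,
    }
    # SQS expects the policy as a JSON string under the "RedrivePolicy" attribute.
    return {"RedrivePolicy": json.dumps(policy)}


# Hypothetical ARN, for illustration only.
attributes = redrive_attributes("arn:aws:sqs:eu-west-1:123456789012:orders-dead-letter")

# With boto3 this dict could be applied to the source queue as:
#   sqs.set_queue_attributes(QueueUrl=queue_url, Attributes=attributes)
```

An alarm on the dead-letter queue's message count then covers the alerting side.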

Reliability/Robustness:

Another benefit of having external queues is that now, if any worker is taken down, we simply spin up a new one. Messages live in the queue rather than on the worker, so they are not lost with the machine. And we can increase or decrease the number of workers as and when needed. This is the “cattle” principle in action!

Simplicity:

We no longer use NServiceBus for this part of our system. AWS SQS code is no more than 10 lines! Removing NServiceBus from the equation has significantly reduced the learning curve for new team members!
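To give a feel for how little code a worker needs, here is a rough Python sketch of one polling pass. It is an illustration rather than our actual code, and Python is used only for the example. `receive_message` and `delete_message` are the boto3 SQS client calls; the sketch assumes messages arrive via SNS with its default JSON envelope (raw message delivery turned off), and `StubSqs` and the queue URL are stand-ins so it runs without AWS access.

```python
import json


def poll_once(sqs, queue_url, handle):
    """Receive one batch, process each message, and delete it on success."""
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,  # SQS returns at most 10 messages per call
        WaitTimeSeconds=20,      # long polling: wait up to 20 s for messages
    )
    processed = 0
    for message in response.get("Messages", []):
        envelope = json.loads(message["Body"])  # SNS wraps the payload in JSON
        handle(envelope["Message"])
        # Deleting is the acknowledgement. If handle() raises, the message stays
        # on the queue and becomes visible again after the visibility timeout.
        sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=message["ReceiptHandle"])
        processed += 1
    return processed


class StubSqs:
    """Stand-in for boto3.client("sqs") so the sketch runs without AWS access."""

    def __init__(self, bodies):
        self.pending = [
            {"ReceiptHandle": f"handle-{i}", "Body": json.dumps({"Message": body})}
            for i, body in enumerate(bodies)
        ]
        self.deleted = []

    def receive_message(self, QueueUrl, MaxNumberOfMessages, WaitTimeSeconds):
        batch = self.pending[:MaxNumberOfMessages]
        return {"Messages": batch} if batch else {}

    def delete_message(self, QueueUrl, ReceiptHandle):
        self.deleted.append(ReceiptHandle)
        self.pending = [m for m in self.pending if m["ReceiptHandle"] != ReceiptHandle]


seen = []
stub = StubSqs(["ticket booked", "ticket refunded"])
count = poll_once(stub, "https://example.invalid/queue", seen.append)
```

Against a real queue, the same function would be called in a loop with `boto3.client("sqs")` in place of the stub.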

Conclusion

We still use NServiceBus in other parts of our system. We think it’s a very powerful framework and very useful in many use cases, especially when you have long-running sagas. But our needs for this part of the system are much simpler, so NServiceBus and RabbitMQ were overkill for that.

If you have also gone through, or are going through, a similar phase, we would love to hear your questions, thoughts and insights.

About the author

Balpreet is a full-stack developer. He loves building back-end APIs and Mobile Apps. He tries hard to keep things simple so that the minimum of effort is required by anyone trying to learn the system.
