Whenever I reach a milestone or finish something, I pause and reflect on the journey and what I have learnt. Here, I'm going to talk about migrating one of our systems from MSMQ to SQS.
When we started, we had a system that looked like this:
Here are some of its characteristics:
- fully asynchronous
- built on NServiceBus using MSMQ for message transport
- pub-sub model
It had its issues, but overall it worked well. However, when we started moving Trainline systems to the cloud (we use AWS), we carried out an audit and realised that this part of the system was not very cloud friendly.
What were the pain points to address?
Queues tied to machines
Message queues were local to machines. Every time we needed to take a look, we had to log in remotely to those machines. We had fixed machines with fixed IP addresses. Our machines were like pets. While we were in a non-cloud-hosted environment, this approach was just about workable. But once you're in the cloud, such practices have to go. Cloud hosting forces you to treat machines like cattle, not pets, to use the popular, if slightly grisly, analogy. Your system should be able to cope with a machine getting killed and a new one spinning up. Relying on fixed IPs is not workable, as machines and IPs will change very frequently.
Too many moving parts
There were too many steps involved in making the whole system work. Having too many queues was a big contributor to flakiness. We had incidents where messages were stuck on outgoing queues due to network issues. Then subscriptions did not work because the subscription service had not processed the subscription message. The asynchronous nature of failures like these meant they did not throw exceptions and went unnoticed for a long time. The system was hard to diagnose, and narrowing down issues was very time consuming.
Not fully resilient
As the system was queue-based and asynchronous, it was resilient enough to cope with some scenarios, like a service restarting or a system being partially unavailable for a short period. But it was lacking resilience in other areas. As queues were local to machines, if we lost a machine, we had no way of tracking what was lost. Another gap was that workers depended on the availability of the distributor: if the distributor went down, workers would not receive any messages to process.
Scalability in AWS
A distributor managed a list of available workers. It kept track of each worker’s IP address and queue name to be able to forward messages. But in the cloud, we didn’t want to rely on fixed IPs or machines: if a worker node died, there wouldn’t be anything to tell the distributor about it.
Steep learning curve
There is a lot you need to learn to start using NServiceBus effectively. It would take a long time for new team members to get up and running. There was a lot of configuration and many framework-specific idiosyncrasies that you needed to know just to understand what was going on!
We need a better queuing system!
So we concluded that we needed to get a better queuing system and, ideally, to reduce the complexity built up around NServiceBus.
SQS vs RabbitMQ
There are many queuing systems, but we eventually narrowed the choice down to two final contenders: SQS and RabbitMQ.
RabbitMQ had two points in its favour:
• We have experience. We already use RabbitMQ in multiple parts of our system.
• All the infrastructure is already available and set up, so we would not have to invest time in that.
SQS had its own attractions:
• It's AWS. As we are hosting our system in AWS, it makes sense to look for options available in your current ecosystem first.
• Low maintenance. As it's hosted by Amazon and we consume it purely as a service, we would have almost no maintenance overhead.
Here is how they compared on the points that mattered to us:

| | RabbitMQ | SQS |
|---|---|---|
| Support | Good community support | Support from Amazon, reportedly also good; a steadily growing community |
| High availability / reliability | Requires additional effort and money | Highly reliable by default (at-least-once delivery guaranteed); no additional effort needed |
| Message loss | More moving parts and configuration; messages can be lost in multiple ways | Very little configuration and comparatively more reliable |
| Pub-sub pattern with multiple clients | Done in the standard way, easy to use | Can be implemented using Amazon Simple Notification Service: https://aws.amazon.com/sns/faqs/ (first question, 'What is Amazon Simple Notification Service (Amazon SNS)?') |
| Message consumption pattern | Push is the more usual approach | Pull: consumers poll at specific intervals |
| Message acknowledgement | More conventional, configurable | Less conventional: a visibility timeout hides in-flight messages, so processing a message effectively declares it as being handled |
| NServiceBus | Supported | Community-run transport extension |
| FIFO | Yes | No, by design (https://aws.amazon.com/sqs/faqs/, question 'Does Amazon SQS provide first-in-first-out (FIFO) access to messages?') |
| Security | Authentication plugins | Identity and Access Management (IAM) |
| Message security | Messages are held in memory (non-persistent) | Messages are persisted, so message encryption needs to be handled in code |
| Message filtering | Yes | No, though workarounds exist |
| Configuration and maintenance effort | Large | Small |
| Effort needed to achieve scalability | Large | Small |
| Max message size | Limited only by server memory | 256 KB |
| Price | Cheaper when message volumes are enormous (thousands per second) | Quite cheap when message volumes are below that level |
| Choose when the priorities are | Price, performance | Scalability, reliability, integration with services hosted in Amazon |
After reviewing all the points mentioned above, we decided to go ahead with SQS.
How our system looks now
What did we gain?
Subscriptions are managed at the SNS level and abstracted away from the workers, so adding or removing workers needs no subscription management. As soon as a new worker comes up, it just starts reading from a fixed queue.
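To give a feel for that fan-out wiring, here is a minimal sketch using boto3 (the AWS SDK for Python — our production code is .NET, and the topic and queue ARNs here are made up):

```python
# Sketch: attach a worker's SQS queue to the SNS topic carrying events.
# SNS then fans each published message out to every subscribed queue,
# so the workers themselves never touch subscription management.

def subscribe_worker_queue(sns, topic_arn, queue_arn):
    """Subscribe an SQS queue to an SNS topic and return the subscription ARN."""
    response = sns.subscribe(
        TopicArn=topic_arn,
        Protocol="sqs",      # deliver topic messages into an SQS queue
        Endpoint=queue_arn,
    )
    return response["SubscriptionArn"]

# In real use, `sns` would be boto3.client("sns"); the queue also needs
# a policy allowing SNS to send to it, which is omitted here.
```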
Queues are now external and easy to inspect using the AWS Console. If a worker fails to process a message, after 5 attempts the message is moved to our dead-letter queue. We have alerting set up on the dead-letter queues. All of this happens without any need to log in to worker nodes, which is a huge relief!
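The "move to a dead-letter queue after 5 attempts" behaviour is configured on the source queue via a redrive policy. A sketch of building those attributes (the DLQ ARN is a placeholder, not one of our real queues):

```python
import json

def redrive_attributes(dlq_arn, max_receives=5):
    """Build SQS queue attributes that send a message to the dead-letter
    queue after `max_receives` failed processing attempts."""
    return {
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": max_receives,  # 5 attempts, as in our setup
        })
    }

# These attributes would be passed to create_queue / set_queue_attributes
# on the main queue; alerting then watches the DLQ's message count.
```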
Another benefit of having external queues is that if a worker is taken down, we simply spin up a new one; there is no chance of losing any messages. And we can increase or decrease the number of workers as and when needed. This is the "cattle" principle in action!
We no longer use NServiceBus for this part of our system. Our AWS SQS code is no more than 10 lines! Removing NServiceBus from the equation has significantly reduced the learning curve for new team members!
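For flavour, here is roughly what a consumer loop of that size looks like, sketched in Python with boto3 rather than our actual code; the function and queue names are illustrative:

```python
def drain_once(sqs, queue_url, handle):
    """Long-poll the queue once, process each message, delete on success."""
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=10,   # SQS per-call batch ceiling
        WaitTimeSeconds=20,       # long polling: wait up to 20s per call
    )
    for msg in resp.get("Messages", []):
        handle(msg["Body"])       # raise to leave the message on the queue
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=msg["ReceiptHandle"])

# In production this runs in a loop with sqs = boto3.client("sqs").
# An unhandled exception leaves the message invisible until its
# visibility timeout expires, after which SQS redelivers it.
```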
We still use NServiceBus in other parts of our system. We think it's a very powerful framework and very useful in many use cases, especially when you have long-running sagas. But our needs for this part of the system are much simpler, so NServiceBus and RabbitMQ were overkill here.
If you have also gone through, or are going through, a similar phase, we would love to hear your questions, thoughts and insights.
About the author
Balpreet is a full-stack developer. He loves building back-end APIs and Mobile Apps. He tries hard to keep things simple so that the minimum of effort is required by anyone trying to learn the system.