StreamMachine — Managing Nginx upstreams with Consul, or Death to Reloads (depending on your attitude)

Stuart Macleod
Published in Trainline’s Blog
8 min read · Jul 17, 2018


In a microservices world there will always be challenges involved in service discovery and associated load balancing. Allowing developers to deploy, scale and toggle their services with a high degree of confidence that a given URL will be routed to the right place is key. At Trainline we use Nginx Plus load balancers, and route traffic to groups of servers referred to as Upstreams. The contents of those upstreams are therefore what matters: how do you add, remove or entirely replace hostnames or IP addresses without affecting any other services running on the same box?

Well, the first method we tried was using AWS Auto Scaling Groups and their associated DNS, but this was a bit of a disaster. We discovered that the Nginx DNS client isn’t all it could be, and at the time reloading Nginx caused it to instantly forget its DNS resolution data and serve 502s until it caught up again. With reloads being very common (one was required for any scaling event or toggle, so multiple times per minute) this was a dealbreaker. Enter Upstreamr.

- Upstreamr

Upstreamr is an open-source template manager we developed in-house. Its primary purpose is to collate the desired server list from one of several supported sources, write it to a local file and reload Nginx to apply the change. This way we remove the reliance on Nginx’s ropey DNS client and provide it with IP addresses only.
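For illustration, here is a minimal sketch of that write-a-file-then-reload pattern. This is not Upstreamr’s actual code; the names, paths and template are all hypothetical:

```python
# Hypothetical sketch of the render-file-then-reload pattern (not Upstreamr itself).
import subprocess

def render_upstream(name, servers):
    # Build a plain Nginx upstream block from a list of "ip:port" strings.
    lines = [f"upstream {name} {{"]
    lines += [f"    server {s};" for s in servers]
    lines.append("}")
    return "\n".join(lines) + "\n"

def apply(name, servers, path="/etc/nginx/conf.d/upstreams.conf"):
    with open(path, "w") as f:
        f.write(render_upstream(name, servers))
    # The painful part: every change means a reload.
    subprocess.run(["nginx", "-s", "reload"], check=True)

apply("myapp", ["10.0.1.10:8080", "10.0.1.11:8080"])
```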

And Upstreamr was largely successful, running here for around 18 months. It began by reading upstream configs directly from AWS DynamoDB before we migrated to Environment Manager, and back then we read ASG membership from another DynamoDB table before we moved to Consul service discovery. The product worked as advertised, by and large, but after a while certain issues began to creep in that we could eventually no longer tolerate.

Specifically, there were two problematic areas. Firstly, support for ASG tables, reading directly from DynamoDB, Environment Manager and Consul left the codebase much larger than we required at this stage. Add to that the fact that Upstreamr was written solely by one member of the team, who has since moved on to another company, and supporting the app when it failed was difficult.

Secondly, and more importantly, there was the issue of reloads. Nginx users will tell you reloads are safe by design, and they are sort of correct. However, the experience we had was anything but. There was a marked increase in 504 timeout errors immediately after each reload, leading to failed transactions in our platform, manual remediation and customer irritation. The reason for this remains unclear (from our investigations, reloads really shouldn’t cause these issues, but they do); our best guess is that reloads cause a TCP connection pool issue between the Nginx servers themselves and the AWS Elastic Load Balancers fronting them. We use ELBs to make Nginx highly available (its own solution for this is not practical in AWS) and this second level of load balancing seems to hugely exacerbate the issue of reloads. In any case, Upstreamr, and reloads in particular, were not popular, so we did something about it.

- Consul Trickery

The first attempt was to go back to Nginx’s DNS client, in the hope that all the cool stuff we had put in over the years since we first tried it would mitigate the client’s inherent ropeyness. And for a while things looked pretty promising. Consul services have their own DNS names, so you might wonder why we didn’t just use those. However, each upstream in our setup references one of several possible Consul services (typically one each for the blue and green slices), and if you want to change a DNS name entirely you need a reload. Boo.

However, the plan we came up with was to sort of trick Consul into doing our bidding by creating a unified service for each upstream, whose nodes were actually other Consul services. This sounds like the kind of thing that shouldn’t work, but surprisingly it works absolutely fine. So, imagine you have two slices:

- myapp-blue.service.consul

- myapp-green.service.consul

And, say, you create a third service with one or the other of those as its single node. So:

- myapp.service.consul

So, if you resolve myapp.service.consul it will CNAME to myapp-blue.service.consul, which will in turn return A records for any active nodes of that service. If you then point your Nginx upstream at myapp.service.consul and set up some kind of job (in our case an AWS Lambda) to register or deregister the slices as nodes of the myapp service when you toggle blue to green or vice versa, you get the right IPs all the time without ever reloading Nginx. And, as these DNS records are SRV records, they contain the port as well as the IPs, so you can flip between ports as you do slices. Hooray!
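To make the trick concrete, here is roughly what that toggle job might do against Consul’s catalog API. Our actual Lambda isn’t published, so treat this as a sketch of the shape rather than the real thing; the alias node name and port are assumptions:

```python
# Illustrative only: register the chosen slice as the single node behind the
# unified 'myapp' service, so myapp.service.consul CNAMEs to that slice.
import requests

CONSUL = "http://localhost:8500"

def point_unified_service_at(slice_dns, port=443):
    payload = {
        "Node": "myapp-alias",     # hypothetical alias node name
        "Address": slice_dns,      # e.g. "myapp-blue.service.consul"
        "Service": {"ID": "myapp", "Service": "myapp", "Port": port},
    }
    requests.put(f"{CONSUL}/v1/catalog/register", json=payload).raise_for_status()

# Toggling blue to green is just re-registering the alias with the new address.
point_unified_service_at("myapp-green.service.consul")
```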

But…

Booo!

The above solution works great in that it no longer requires you to reload Nginx to change an upstream, but crucially, it requires that you do not reload Nginx at all, an operation which even without upstream changes remains necessary (new load balancer settings, changes to the overall config, etc.). The reason for this is once again Nginx’s pretty rubbish (well, it used to be, more on this later) DNS client. At the time, when Nginx reloaded it ‘forgot’ the resolution of all its DNS records and had to re-resolve them. In our clever Consul setup above, a reload would mean re-resolving every single upstream. On our non-production Nginx servers (which are admittedly a fair bit more cluttered than the production ones) this meant that Nginx had to resolve the service and node CNAMEs for each of more than 1,000 upstreams. ‘Surely that wouldn’t take long, though?’ I hear you ask. Nope, more than 60s most of the time, which would be a disaster on a production server every time we reloaded.

We played around with minimising the DNS resolution time using local Dnsmasq services, resolving exclusively against the local Consul DNS endpoint, but to no avail. In every scenario it took far too long to resolve all the DNS records, and we had to abandon the idea entirely.

- Enter StreamMachine

So, back to the drawing board. We had had some success with a local agent running as a service doing the updating (Upstreamr), so we decided to go back to that design, but find a way to avoid the need to reload. The solution was to use the new on-the-fly Nginx API released with Nginx Plus R13. Nginx Plus had had an OK sort-of API before, the upstream_conf module, but we never really used it. Its interface was not especially well defined, and besides, we had Upstreamr at the time and didn’t yet know reloads were a problem.

With the new API, however, things are much clearer. It comes with Swagger docs built in, so programming against it is simple. Plus, since version 1.9.7, upstream servers configured via an API call can write their state to disk, and that state persists across reloads and even service restarts. Therefore, if we set the upstream servers correctly and check them regularly, we can have a high degree of confidence that they will be unaffected by other Nginx events. And, as it is the state that changes and not the upstream file, we can write the file once and not worry about changing it again. No template engines, no reloads, no problem.
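As a rough illustration, adding a server to an upstream via the Nginx Plus API looks something like the snippet below. The API version number, port and upstream name here are assumptions for the example, not our actual config:

```python
# Sketch: add a server to an Nginx Plus upstream via the on-the-fly API.
import requests

NGINX_API = "http://localhost:8080/api/3"   # API version is an assumption

def add_upstream_server(upstream, address):
    # POST a new server entry; with a 'state' file configured on the upstream,
    # Nginx persists this change across reloads and restarts.
    url = f"{NGINX_API}/http/upstreams/{upstream}/servers"
    resp = requests.post(url, json={"server": address})
    resp.raise_for_status()
    return resp.json()

add_upstream_server("myapp", "10.0.2.15:8080")
```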

So, we began working on a new Python app, which eventually became StreamMachine.

The first step was to collate the current state of things and the desired one. Most of this is simple REST calls to various services, which in Python is easy. But certain services (in this case Consul and Nginx) do not have a single endpoint that returns all the info we want in one call, which again means making possibly thousands of calls to iterate our way through. The solution was the Python requests_futures module and its asynchronous HTTP calls.

In essence, this is simply a method of firing off the requests to HTTP services one after another, without waiting for a response to the first before requesting the second, and so on. Then, once all the requests are away, we can look for each response in turn. Running this approach against our non-production servers, our 1,000 Consul requests went from taking around 45s sequentially to about 3s asynchronously. The requests to Nginx (of which there are a similar number) run even faster.
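In practice that looks roughly like this (the service names and worker count are placeholders, and the Consul health endpoint stands in for whatever queries you actually need):

```python
# Sketch: fire off all the Consul health queries at once, then collect the results.
from requests_futures.sessions import FuturesSession

session = FuturesSession(max_workers=50)
services = ["myapp-blue", "myapp-green"]  # in reality, hundreds of these

# Send every request without waiting for the previous one to answer...
futures = {
    svc: session.get(f"http://localhost:8500/v1/health/service/{svc}?passing")
    for svc in services
}

# ...then walk back through and read each response in turn.
nodes = {svc: f.result().json() for svc, f in futures.items()}
```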

So, having found a speedy way to collate all of our information, all we need do is compare what we have with what we want (which is just list comparison, simples) and fire off REST calls to Nginx with any changes we wish to enact. Thus, if one of our Consul services scales up or down, or we toggle a blue slice to a green one, all StreamMachine knows is that the list of servers it wants is different from the one it has, and it acts accordingly. Also, because the processing time is so low (the whole iteration on our production servers takes about 4s all in), we don’t need to cache any information on disk (occasional corruption of which was an issue with Upstreamr) and can just run dumbly time and again. Wrap the script in a systemd service, package it, give it a pithy name, and Bob’s your uncle.
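The comparison step really is just set arithmetic; something along these lines, with the add/remove callables standing in for the API calls described above (the print stubs here are purely for illustration):

```python
# Sketch of the reconcile step: desired and current server lists are just
# sets of "ip:port" strings, so the diff is plain set arithmetic.
def reconcile(upstream, desired, current, add, remove):
    for server in desired - current:
        add(upstream, server)       # e.g. a POST to the Nginx Plus API
    for server in current - desired:
        remove(upstream, server)    # e.g. a DELETE against the same endpoint

# Toy run with print stubs in place of the real API calls:
reconcile(
    "myapp",
    desired={"10.0.2.15:8080", "10.0.2.16:8080"},
    current={"10.0.2.14:8080", "10.0.2.15:8080"},
    add=lambda u, s: print("add", u, s),
    remove=lambda u, s: print("remove", u, s),
)
```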

We now run all of our Nginx servers with StreamMachine, and have seen a huge reduction in reloads, leading to a similarly pleasing reduction in 504 timeouts, as well as better insight into historical performance, which comes from Nginx’s dashboard going hours or even days at a time without resetting itself.

- The Future

So, we’ve mentioned a few times that Nginx’s DNS client had a few issues, and in particular used to forget all its resolutions on every reload. As reloads are still a necessity, this was an issue. But just as we were rolling StreamMachine out to our production servers, Nginx Plus R14 was released, which makes DNS resolutions persist across reloads. As you can imagine, this was met with equal measures of delight at the feature and irritation at the timing. Regardless, we ploughed on with StreamMachine, but started thinking back to our old Consul method.

Now, with persistent DNS, the only risk to the DNS-based approach would be service restarts. These very rarely happen (they really shouldn’t happen at all, to be fair) so it might be worth revisiting. StreamMachine, as I have said, works well, but it also uses a huge amount of CPU to do its thing. The servers we use have CPU to spare, so this isn’t actually an issue, but potentially we could downsize our boxes and save some cloud costs. Also, the DNS solution is centralised, so it has a potential advantage over an agent-based, independently failable solution: the StreamMachine service could fail on just one server, leading to issues that are difficult to pinpoint. We can and do alert on these things, but surely it is better to avoid the issue entirely.

Regardless, we now have two approaches to choose from, both of which avoid the issue of reloads and reduce error rates in our platform. Can’t say fairer than that.

Stuart Macleod

Infrastructure Architect @ trainline. Scottish Ambassador to London