At DueDil we are migrating workloads from a managed hardware provider to the Google Cloud Platform (GCP). Services have historically run on our Mesos cluster, with Marathon orchestrating long-running Docker containers. In a small number of cases, dedicated VMs have also been used for the occasional ‘bespoke’ service.
We’re making this move now in order to take advantage of PaaS and cloud-native offerings. This will help us scale and manage workloads more effectively, and standardise our processes.
The API stack has moved to production first, with DueDil’s Site and supporting services to follow in Q1 2020.
We’re now using managed Kubernetes (K8s) and Helm to orchestrate deployment of the API V4 stack.
Problems with authentication begin...
Currently, we use a third-party service for API key management. It offers a REST API, which our proxy calls to authenticate keys and authorise requests to the DueDil API. Prior to migration, all unit and integration tests were passing cleanly. Once we deployed to K8s, strange problems surfaced.
Error rates when connecting to this service sky-rocketed: 1 in 4 calls was failing with HTTP 502, with a knock-on effect on the service. We were able to fall back on our warm Redis auth cache for authentication, but default throttling limits were being applied. Not good, as throughput for clients was being restricted.
As we took stock of this we asked the following questions:
- Can I replicate this locally with curl? No.
- Can I replicate this in a K8s dev environment, within an API proxy container? No.
- Have request rates changed; are we seeing a different usage pattern? No.
- Does the same issue exist on our old infrastructure, prior to migration? No.
- Any other logging signals from the K8s cluster? No.
- Network or firewall rules causing this? These would produce a blanket block, not this intermittent behaviour. No.
- Has the application changed? Yes.
We completely rebuilt the API proxy image, basing it on an official OpenResty Docker image. The version of OpenResty in use had moved to the latest LTS.
- Were there any relevant bugs reported against this version? No.
Still, it must be the application, right?
Mitigation approach 1 - rule out differences in the application
With minimal tweaks, we ported the latest tagged image running on the old infrastructure to GCP, tested the release in our Sandbox environment, and deployed. No change. This couldn’t be an application-level issue.
Mitigation approach 2 - are there other constraints on our usage of this external API?
Okay, single requests are succeeding, so let’s try a more intensive test. Using Vegeta, we tested steadily increasing rates of throughput and concurrency, getting closer to the load generated by production. All good.
Okay, let’s have a look at the Vegeta defaults. One interesting one: HTTP keep-alives are enabled by default. A good thing if you want to conserve connections and re-use them. Let’s turn that off and re-run.
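The re-run looked roughly like this (the target URL, rate, and duration here are placeholders, not our real values):

```shell
# Hypothetical re-run: disable keep-alives so every request opens a fresh connection
echo "GET https://auth.example.com/v1/keys/verify" | \
  vegeta attack -keepalive=false -rate=100 -duration=30s | \
  vegeta report
```

With `-keepalive=false`, each request pays the full cost of establishing a new connection, which is exactly what a proxy that fails to re-use upstream connections does in production.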
…bingo! We can now re-create the issue locally. At precisely 100 concurrent connections, error rates increased exactly as we were seeing in production. Our hypothesis was that we were somehow hitting a limit that had been there all along.
Mitigation Approach 3 - find environmental differences causing increased connections
Now we have twice as many cores and four times the RAM. Surely that’s always a good thing? Looking at the top of our configuration file:
# WORKER THREADS
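The key directive here: with `worker_processes` set to `auto`, Nginx spawns one worker per CPU core, so doubling the cores silently doubles the workers. A sketch of what such a top-of-file section typically looks like (not our exact config):

```nginx
worker_processes auto;       # one worker per CPU core: 8 before, 16 after the move

events {
    worker_connections 512;  # per-worker cap on simultaneous connections
}
```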
Okay, now we have a mechanism for generating change.
The penny drops
There were actually three factors at play:
- Our worker count was doubling from 8 to 16.
- Each worker was limited to 512 connections. However, we were not effectively implementing HTTP keep-alives! For connections to be re-used, you must set a specific upstream directive and force the use of HTTP/1.1.
- On top of this, the number of deployed services had increased post-migration from 3 to 6.
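For reference, upstream connection re-use in Nginx/OpenResty takes all three of the following; this is a sketch, with the upstream name and server as placeholders:

```nginx
upstream auth_service {
    server auth.example.com:443;
    keepalive 32;                        # pool of idle keep-alive connections per worker
}

server {
    location /auth {
        proxy_pass https://auth_service;
        proxy_http_version 1.1;          # keep-alive requires HTTP/1.1 (default is 1.0)
        proxy_set_header Connection "";  # clear the default "Connection: close" header
    }
}
```

Miss any one of these and the proxy quietly opens a brand-new connection per request, which is what we were doing.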
The problem of creating more connections than needed was being magnified as we moved from a 24-core deployment (in total) to 96 cores: a factor of 4!
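Putting rough numbers on that multiplication (a back-of-the-envelope sketch using the figures above, not measured data):

```python
# Worst case with no connection re-use: every worker can hold open
# up to worker_connections sockets towards the auth service.
def max_connections(services, workers_per_service, connections_per_worker=512):
    return services * workers_per_service * connections_per_worker

before = max_connections(services=3, workers_per_service=8)   # old infrastructure
after = max_connections(services=6, workers_per_service=16)   # post-migration on GCP

print(before, after, after / before)  # 12288 49152 4.0
```

Even if the real numbers never reached those ceilings, the headroom for runaway connections quadrupled overnight.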
Graphical example using JMeter
End of user impact
We did the following to bring this incident to a close:
- Ensured keep-alives were correctly enabled
- Applied min/max connection limits per worker
- Fixed the number of Nginx workers at 8
Next steps will be to add effective monitoring of the number of connections we’re consuming. Ideally, in the medium term, we’ll bring all gateway functionality in-house.
This was a tricky one to debug. In the end, persistent curiosity was the vital ingredient in making a breakthrough.
We’re hiring!
If this post was a fun journey for you, or perhaps the source of these woes was obvious to you all along, we’re always looking for talented and driven engineers to join the team. Take a look at our open roles.