We have been using Qovery to run our production servers for the last month, following a major outage in our AWS deployment, and for the most part it has been going great. However, multiple times throughout the day we see connectivity issues resulting in dropped requests. The issue sounds very similar to: Services latency.
We have spent the past weeks diagnosing (and finding) some application issues to address, but fixing them had no impact on the problem.
Most recently we configured DataDog to gain additional insight, and we now have evidence pointing to nginx load balancing as the root cause.
Please consider the following CPU and Memory metrics for our nginx instances during two events:
The large spikes at 11:33 and 11:53 correspond to a loss in traffic.
Very eager to hear your input on this matter, and whether you believe an approach similar to the one discussed in the other thread is warranted.
Update: we did end up setting `nginx.vcpu.request_in_milli_cpu` to a value of 500 and are monitoring today to see if this helps.
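For reference, the change is a single key in the cluster's advanced settings. A sketch of what we applied (assuming the JSON advanced-settings editor, and showing only the key named above):

```json
{
  "nginx.vcpu.request_in_milli_cpu": 500
}
```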
Thank you for following up on this; we missed the notification of your previous message.
Indeed, changing the nginx minimum CPU often helps if you are seeing latency issues.
Be sure to re-deploy your cluster to apply the setting change. Only updating the setting without re-deploying is not enough.
Thanks, yes we redeployed the cluster after making the change. Will provide an update at the end of the day.
Hi @Erebe, thanks again for your help yesterday. Unfortunately we are still seeing intermittent dips in application performance correlated with spikes in the `nginx-ingress-controller` CPU. Let me ask my questions one at a time; apologies if some of these are obvious - this is well outside my area of expertise.
- You mentioned yesterday that changing the cluster instance type to 8 CPU / 16 GB RAM would save us money. This puts us into a C5/C6 compute-optimized EC2 instance; would we expect any performance benefits from the change in addition to the cost savings?
- Do I understand correctly that `nginx.vcpu.limit_in_milli_cpu` represents the maximum number of milli-vCPUs that the `nginx-ingress-controller` may draw? (Clearly capped at 2000, given our current instance type.)
- Do I understand correctly that `nginx.vcpu.request_in_milli_cpu` represents the amount of milli-vCPUs that the `nginx-ingress-controller` requests from the Kubernetes scheduler?
- If my understanding on points 2 and 3 is correct, would setting `nginx.vcpu.limit_in_milli_cpu: 2000` and `nginx.vcpu.request_in_milli_cpu: 1500` be a reasonable "max" given our current instance type?
- Additionally, I note the setting `nginx.hpa.min_number_instances` available in the configuration. Do I understand correctly that we can use this to peg the minimum number of `nginx-ingress-controller` replicas to a higher number? If so, are there any concerns with increasing this to 5 while we continue to investigate a root cause and stabilize application performance?
- When I look at this chart, which shows DataDog's system.load metric for one of the `nginx-ingress-controller` instances, I see a periodic spike in load. Is this behavior normal in your experience?
Once again appreciate your insights.
Hi back @Corky3892,
You should see some performance benefits as well, but it is hard to tell without benchmarking because it depends on your application/load behavior. If it does not tend to use much CPU and is more I/O bound, you will not see much of an improvement.
In any case, the change is not going to do any harm, and you will be able to postpone the decision about your new app resource requirements until you have real metrics.
Indeed, you are right.
Also right: the request is what is reserved for your app on the node; that is the guaranteed minimum. The limit is how far your app is allowed to go before being throttled. But the limit is not guaranteed: if other apps running on the same nodes take CPU at the same time, the kernel will apply throttling across all of them. Basically, whatever falls between [request, limit] is not guaranteed all the time. I will come back to this in a later point.
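In plain Kubernetes terms, request and limit map onto the standard container resources block. A generic illustration (not Qovery's actual nginx manifest; the values are placeholders):

```yaml
# Generic Kubernetes container spec fragment (illustrative values)
resources:
  requests:
    cpu: "1500m"      # reserved on the node at scheduling time; guaranteed
    memory: "512Mi"
  limits:
    cpu: "2000m"      # ceiling; usage between request and limit may be throttled
    memory: "768Mi"
```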
With your nodes, you can use 1800 (no worries: if the new value is too high, it will be rolled back and the cluster deployment will fail).
A node is not dedicated to a single app, so your nginx is running alongside your applications on the same node. That would be the reason you see the system load spiking: some apps running on the node are using CPU.
Still right: it forces the minimum number of nginx replicas to this value. At the moment it is 2; if you set it to 5, you will always have at least 5 nginx instances running.
If you want to take nginx out of the equation, to check whether the issue is coming from your app, I would:
- Dedicate a machine to nginx to avoid noisy neighbours. To do that, set the CPU and RAM request and limit close to the maximum of your nodes (i.e. 1800 mCPU and 6-7 GB of RAM). These resource requirements will saturate one node, and nginx will be the only app running on it.
- Increase the minimum number of nginx instances (`nginx.hpa.min_number_instances`) to sustain your peak load at all times, or go beyond to over-allocate and remove it from the equation.
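Combining both suggestions, the cluster advanced settings would look roughly like this (a sketch using only the keys quoted in this thread; the matching RAM request/limit keys are not named here, so they are omitted):

```json
{
  "nginx.vcpu.request_in_milli_cpu": 1800,
  "nginx.vcpu.limit_in_milli_cpu": 1800,
  "nginx.hpa.min_number_instances": 5
}
```

Set the RAM request and limit near the node maximum through the equivalent memory settings as well, and re-deploy the cluster for the change to take effect.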
Hope it helps!
Great information, let me review this with my team, make some configuration changes and see where we go from there.
Also, as you are running on low-end AWS machines, be sure you are not hitting the network bandwidth/IOPS limits of those nodes. If that is the case, moving to bigger instances will fix it.