For several weeks now, we’ve been experiencing consistent latency issues with our application: requests take several seconds or are refused outright.
Our service monitoring shows no anomalies, and some of the requests never appear in our logs. The problem seems to come from the ingress, but we haven’t been able to pinpoint the exact cause.
Can you please investigate these issues?
To determine if the issue is related to NGINX, could you please provide (during the same period you see latency issues):
- graphs showing the number of NGINX pods
- the resources (CPU, RAM) used by the NGINX pods
- the HPA metrics of NGINX
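If it helps, these metrics can also be pulled ad hoc with kubectl. This is just a sketch, assuming the ingress controller runs in the nginx-ingress namespace and metrics-server is installed in the cluster:

```shell
# Pod count and status of the NGINX ingress controller
kubectl get pods -n nginx-ingress

# CPU / memory usage per pod (requires metrics-server)
kubectl top pods -n nginx-ingress

# Current HPA state: utilization vs. target, min/max and current replicas
kubectl get hpa -n nginx-ingress
```

Graphs from your monitoring stack over the incident window are still the most useful, since they show the scaling spikes over time.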
Hello @Pierre_Gerbelot, I’m working with @james075 on this issue.
It appears that NGINX scales up to 10 times the usual number of pods we have and then rapidly scales back down to the standard pod count a few minutes later. These spikes in the pod count graph seem to match with the latency problems we’re experiencing.
The above graph shows the number of pods in the nginx-ingress namespace over the last 2 days.
Here are the memory and CPU metrics graphs. The running pods seem to have a 500m CPU and 768Mi memory limit.
We think the CPU resources allocated to the NGINX pods are configured too low, which is why auto-scaling is triggered so often.
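As a rough sketch of the kind of change we have in mind (the 500m CPU / 768Mi memory limits are the ones observed in the graphs above; the request values below are illustrative, not the exact setting we will apply):

```yaml
# Hypothetical pod resource block for the NGINX ingress controller.
# Raising the CPU request gives the HPA more headroom before its
# utilization target is crossed, so scale-ups trigger less often.
resources:
  requests:
    cpu: 500m        # illustrative: raised from a lower default
    memory: 768Mi
  limits:
    cpu: 500m        # current observed limit
    memory: 768Mi    # current observed limit
```

Since the HPA typically scales on utilization relative to the CPU *request*, a request that is too low makes normal traffic look like high utilization and causes the aggressive scale-up/scale-down cycles you are seeing.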
If you agree, we will manually adjust this setting to reduce auto-scaling. We will temporarily lock your cluster, and you will not be able to change any configurations.
After a few days of testing, if you encounter no more latency issues, we will deliver a feature that allows you to configure this parameter (thereby unlocking the cluster).
If you still encounter issues, we would need the same metrics again: NGINX pod count, CPU, and RAM usage.
Could you please provide your web console URL and the name of the cluster?
Yes, let’s try it! Thanks @Pierre_Gerbelot
Console URL: Qovery
It is done. Let us know in a few days if you still have the issue.
Hello @james075, @baptiste_piana
Did you encounter any other latency issues after my change to the NGINX CPU resources? You can now configure the NGINX settings through the cluster advanced settings. I’ve set nginx.vcpu.request_in_milli_cpu to 500 to reflect the manual change I made last week.
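For reference, a minimal sketch of what that looks like in the cluster advanced settings, assuming the flat key/value JSON format (only the key name and the value 500 come from this thread; everything else is illustrative):

```json
{
  "nginx.vcpu.request_in_milli_cpu": 500
}
```

This keeps the CPU request at the value applied manually, so the fix persists across deployments without us locking the cluster again.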
It has fixed our latency issues.