We are developing a high-traffic application (more than 1,000 requests per second). During our load tests we noticed 502 (Bad Gateway) errors related to the Nginx ingress.
In our cluster configuration there appear to be 11 automatically created ingress pods, each with 0.5 CPU and 768MB of RAM.
Is it possible to increase the number of pods and modify their configuration? If so, how?
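For reference, the per-pod sizing described above would correspond roughly to a `resources` block like this on the ingress controller Deployment (a sketch using standard Kubernetes spec fields; the actual values are managed by Qovery):

```yaml
# Sketch of the resources section of the NGINX ingress controller Deployment
# (values as observed in our cluster; managed by Qovery, shown for reference only)
resources:
  requests:
    cpu: 500m
    memory: 768Mi
  limits:
    cpu: 500m
    memory: 768Mi
```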
Hello @jeremy_gaudin !
It does make sense for you to be able to act on NGINX resources. As of today, we (Qovery) have a way to change them, but it's not yet exposed to end users. I'm looping in the product team to see how we can prioritize this.
Just out of curiosity, are requests sent in bursts or is it a constant 1,000 req/s? If bursts, there might be a lag while new NGINX instances autoscale.
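That burst lag comes from the autoscaler reacting to metrics with a delay. Assuming the ingress controller is driven by a standard v2 HorizontalPodAutoscaler (a sketch; the resource names are illustrative, and on Qovery this object is managed for you), the scale-up speed can be tuned like this:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-ingress-controller      # illustrative name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-ingress-controller    # illustrative name
  minReplicas: 3
  maxReplicas: 25
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0   # react immediately to bursts
      policies:
        - type: Percent
          value: 100                  # allow doubling each period
          periodSeconds: 15
```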
Will get back to you.
Thanks for the reply, it's around 1k req/s constant.
At the time of testing there were 25 instances of our application.
Link to the console: Qovery
Before supporting it, I would like to perform some tests. I can set the values you want for CPU and Memory so you can re-run your tests. It will validate if the issue is indeed on nginx or your app.
How much CPU / memory do you want?
For the moment I think it's set to 0.5 CPU and 768MB RAM (for the NGINX ingress). I don't know if these values are sufficient (based on our tests, I think 512MB RAM would be enough); I think the problem was linked to the number of pods available for the ingress (11 at most) when we had our 25 application replicas online.
Ok! Do you want to try increasing those replicas and run another test? If so, I will increase max replicas to 25 as well.
Ok give me a few minutes to update the cluster
It’s ok now, the 25 application replicas are available.
Ok, max NGINX replicas is set to 25.
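For anyone running a self-managed ingress-nginx, the equivalent knobs live in the Helm chart values (a sketch; on Qovery these are set for you, this just shows which settings are involved):

```yaml
# ingress-nginx Helm chart values (illustrative; managed by Qovery)
controller:
  autoscaling:
    enabled: true
    minReplicas: 3
    maxReplicas: 25
```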
We're now seeing 500 errors that we didn't have before, even though we're only sending 500 requests per second, which the application and the services behind it normally handle without a hitch.
Ok, from what I see there are only 20 NGINX ingress replicas running; it didn't scale up to 25.
From your app's live logs I can see some HTTP 500s; I think NGINX is propagating those.
Do you know why your app might be generating them?
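One way to tell app-generated 500s apart from nginx-generated 502s is to compare the client-facing status with the upstream status in the ingress access logs. A small sketch, using two hypothetical log lines in the default ingress-nginx access log format (where `$status` is the 9th field and `$upstream_status` is the next-to-last): a matching 500/500 pair means the app returned the error, while a 502 with upstream `-` means nginx never got a response from the app.

```shell
# Two hypothetical sample lines in the default ingress-nginx access log format.
cat <<'EOF' > /tmp/access.log
10.0.0.1 - - [01/Jan/2024:12:00:00 +0000] "GET /api HTTP/1.1" 500 42 "-" "curl/8.0" 120 0.005 [default-myapp-80] [] 10.1.2.3:8080 42 0.004 500 abc123
10.0.0.2 - - [01/Jan/2024:12:00:01 +0000] "GET /api HTTP/1.1" 502 0 "-" "curl/8.0" 120 0.001 [default-myapp-80] [] - - - - def456
EOF

# Print client-facing status ($9) next to upstream status (next-to-last field):
# 500/500 = error came from the app; 502/- = nginx got no upstream response.
awk '{print "client=" $9, "upstream=" $(NF-1)}' /tmp/access.log
```

In a live cluster you would feed this the real controller logs (e.g. via `kubectl logs` on the ingress controller pods) instead of the sample file.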
Yes, according to our dashboard the limit was 20, not 25. However, we lost all the metrics on our monitoring dashboard following this modification, so we can't see the errors behind it.
We’ll have to dig deeper.