How to scale the number of instances relative to the cluster config?

Hey guys!

I'm trying to scale up our infra a bit. We had an incident yesterday with a lot of ClientRead wait time on the DB, and we want to rule out the hypothesis that our API is not scaled enough.
But I have a hard time understanding CPU management with Kubernetes.

So right now, we have 5-8 instances with 1.5 vCPU and 2048 MB of RAM each.
To support them, we have a cluster of 6-10 nodes with 4 CPU and 16 GB of RAM each.

Everything is at the minimum, and our monitoring says our cluster is at ~50% of its CPU capacity.
My goal right now is to increase the instance CPU a bit; what would you recommend here?
Should I set up a few very large nodes, or several smaller ones?
I know this slider is super sensitive, so just want to confirm the plan with someone.


My plan as of now:

  • Change the cluster config to have 4-8 t3.2xlarge nodes (8 CPU - 32 GB RAM)
  • Increase the API resources so the 5-8 instances each get 4 vCPU

If my maths are right, at the minimum I'll have 20 vCPU of API requests running on 32 node CPUs (leaving some room for other pods: Datadog, nginx, etc.).
And at most, we'll run 32 vCPU on 64 node CPUs, depending on the auto-scaling.
That seems a bit excessive, but I'd like to eliminate the API scalability hypothesis.
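A rough sanity check of those numbers, just multiplying the figures above:

```python
# Sanity check of the planned capacity: API vCPU requests vs. raw node CPUs.
configs = {
    "min (4 nodes, 5 pods)": {"pods": 5, "vcpu_per_pod": 4, "nodes": 4, "cpu_per_node": 8},
    "max (8 nodes, 8 pods)": {"pods": 8, "vcpu_per_pod": 4, "nodes": 8, "cpu_per_node": 8},
}

for name, c in configs.items():
    requested = c["pods"] * c["vcpu_per_pod"]
    raw = c["nodes"] * c["cpu_per_node"]
    print(f"{name}: {requested} vCPU requested on {raw} node CPUs ({requested / raw:.0%})")

# min (4 nodes, 5 pods): 20 vCPU requested on 32 node CPUs (62%)
# max (8 nodes, 8 pods): 32 vCPU requested on 64 node CPUs (50%)
# The remainder is what's left for Datadog, nginx and system pods (before allocatable overhead).
```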

Is it the right approach? What are the points I should pay attention to?


Hi,

I don't have a single definitive answer, because I'd need more information, but I'll give you my advice on all the things you need to know and check.

First of all, your cluster seems to have the capacity to handle more load, as you appear to have enough free resources available.

You're guessing it comes from the CPU and think that growing it would help. Before doing this, I advise you to look at graphs to validate the resource congestion:

  1. Cluster capacity graph during that period of time (is your graph showing the current status, or the status during the time you had issues?).
  2. Your application pods' resource usage; if CPU is the limit, you should see spikes together with a growing number of pods (see the sketch after this list).
  3. Your application logs and database logs, to see if you find more useful information.
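For point 2, here is a minimal sketch of how you could pull per-pod CPU usage around the incident window. It assumes a Prometheus-compatible metrics endpoint exposing the standard container_cpu_usage_seconds_total metric; with Datadog you would run the equivalent query in their UI. The URL, namespace and pod name pattern are hypothetical.

```python
# Minimal sketch: per-pod CPU usage over the incident window from a Prometheus-compatible API.
import datetime as dt
import requests

PROM_URL = "http://prometheus.monitoring.svc:9090"  # assumption: an in-cluster Prometheus
QUERY = 'sum(rate(container_cpu_usage_seconds_total{namespace="production", pod=~"api-.*"}[5m])) by (pod)'

end = dt.datetime.now(dt.timezone.utc)      # replace with the end of your incident window
start = end - dt.timedelta(hours=2)         # and its start

resp = requests.get(
    f"{PROM_URL}/api/v1/query_range",
    params={"query": QUERY, "start": start.timestamp(), "end": end.timestamp(), "step": "60"},
    timeout=10,
)
resp.raise_for_status()

for series in resp.json()["data"]["result"]:
    values = [float(v) for _, v in series["values"]]
    if not values:
        continue
    # Compare the peak against the 1.5 vCPU per instance mentioned above.
    print(f'{series["metric"]["pod"]}: peak {max(values):.2f} vCPU, avg {sum(values)/len(values):.2f} vCPU')
```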

Without much more info I can only guess, but I would look at database logs and graphs as well; I suspect you're hitting limits on the database side that you may be able to raise.
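As a starting point on the database side, here is a minimal sketch (assuming PostgreSQL >= 9.6 and the psycopg2 driver; the connection string is hypothetical) that snapshots what the backends are currently waiting on, which is where a ClientRead pile-up would show up:

```python
# Minimal sketch: group active backends by wait event (ClientRead, locks, IO, ...).
import psycopg2

conn = psycopg2.connect("host=db.internal dbname=app user=readonly")  # hypothetical DSN

with conn, conn.cursor() as cur:
    cur.execute("""
        SELECT wait_event_type, wait_event, state, count(*)
        FROM pg_stat_activity
        WHERE datname = current_database()
        GROUP BY 1, 2, 3
        ORDER BY 4 DESC
    """)
    for wait_type, wait_event, state, n in cur.fetchall():
        print(f"{n:4d}  {state or '-':<22} {wait_type or '-'}/{wait_event or '-'}")
conn.close()
```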

Finally, your application may be too slow to start and ingest the incoming traffic, and may struggle while new pods (instances) of your application are coming up. If you observe such behavior in the graphs, you can try to reduce the autoscaler average-percentage trigger, which will start to bootstrap a new pod earlier and avoid this situation (more info here: Advanced Settings | Docs | Qovery).
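To make that trigger concrete: assuming the Qovery average-percentage setting maps to the Kubernetes HPA CPU utilization target, the HPA scales out according to desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization), so a lower target reacts earlier for the same load. A small sketch:

```python
# Minimal sketch of the Kubernetes HPA scaling rule, assuming the Qovery
# "average percentage" setting maps to the HPA CPU utilization target.
import math

def desired_replicas(current_replicas: int, current_cpu_pct: float, target_cpu_pct: float) -> int:
    """desiredReplicas = ceil(currentReplicas * currentUtilization / targetUtilization)"""
    return math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)

# Example: 5 pods averaging 85% of their CPU request during a traffic spike.
for target in (80, 60, 50):
    print(f"target {target}%: scale 5 -> {desired_replicas(5, 85, target)} pods")
# A lower target triggers the scale-out earlier (and to more pods) for the same load.
```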

The last piece of advice I can give you is to find, in the graphs and logs, the elements that give you more information to analyze and help you locate where the issue comes from. Growing numbers here and there can (sometimes) temporarily resolve the issue, but it will come back later and you'll have to fight it again.

Hope this is clear and will help you to find the root cause

Pierre


Quickly Googling the error gave me:

It may be related to the number of operations you’re doing in your transactions.

My questions are:

  • Are your transactions too big?
  • Did you activate slow-query logs on the database side to see which queries are taking so much time, leading to queries stacking up in the queue? (See the sketch after this list.)
  • As you had more requests than usual, do you have in mind any locks on tables that could prevent other queries from reading them?
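For the second and third points, here is a minimal sketch (again assuming PostgreSQL >= 9.6 and psycopg2, with a hypothetical connection string) that checks whether the slow-query log is enabled and lists sessions currently blocked on a lock:

```python
# Minimal sketch: check slow-query logging and find sessions blocked by locks.
import psycopg2

conn = psycopg2.connect("host=db.internal dbname=app user=readonly")  # hypothetical DSN

with conn, conn.cursor() as cur:
    # -1 disables the slow-query log; e.g. 500 means "log statements slower than 500 ms".
    cur.execute("SHOW log_min_duration_statement")
    print("log_min_duration_statement =", cur.fetchone()[0])

    # Sessions whose query is currently blocked by at least one other backend.
    cur.execute("""
        SELECT pid, pg_blocking_pids(pid) AS blocked_by, state, left(query, 80) AS query
        FROM pg_stat_activity
        WHERE cardinality(pg_blocking_pids(pid)) > 0
    """)
    for pid, blocked_by, state, query in cur.fetchall():
        print(f"pid {pid} ({state}) blocked by {blocked_by}: {query}")
conn.close()
```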

Once again, hope it will help you to find the issue


You can't do exactly that, because AWS doesn't give you exactly 8 allocatable vCPU, but something between 7.8 and 7.95. In addition to that, some resources are deployed on each node, like Datadog, Qovery requirements, and AWS ones as well. I advise you to keep a 0.5 vCPU (500m) margin, so count on 7.5 vCPU per node to maximize the allocation.

So set your application CPU to 3.8 vCPU instead of 4; it will reduce the reserved consumption.
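To make the margin concrete, here is a small sketch of the headroom maths under that assumption of ~7.5 usable vCPU per t3.2xlarge node (the 7.5 figure is an estimate from above, not a measured value):

```python
# Minimal sketch: cluster headroom after the API pods' CPU requests are scheduled,
# assuming ~7.5 usable vCPU per t3.2xlarge node once Datadog/Qovery/AWS daemons are counted.
ALLOCATABLE_VCPU_PER_NODE = 7.5

def cluster_headroom(nodes: int, pods: int, vcpu_request: float) -> float:
    """Usable vCPU left once the API pods' requests are placed."""
    return nodes * ALLOCATABLE_VCPU_PER_NODE - pods * vcpu_request

for request in (4.0, 3.8):
    print(f"request {request} vCPU/pod: "
          f"min {cluster_headroom(4, 5, request):.1f} vCPU free on 4 nodes, "
          f"max {cluster_headroom(8, 8, request):.1f} vCPU free on 8 nodes")
```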

2 Likes