What's the recommended way to scale down a cluster?

Hey there! :wave:

I’m currently looking at our AWS costs as they seem very high to me for our workload – happy to give more details in private. On the EKS side, our nodes seem underutilized to me, at least in terms of usage (~10% CPU / RAM on average).

So I tried to scale down the cluster:

  • I lowered the “Desired” state of the Auto Scaling Group on AWS → It ended up being modified back (by Qovery, I guess?) a few minutes later, with two more nodes than originally, hehe…
  • I lowered the range in the Qovery console and triggered an update → The range and desired state in AWS are still higher (max is 2x higher).

Something like 15 minutes after these attempts, the cluster actually scaled down a bit, but I’m not sure why. :slight_smile:

Hence my question: what’s the recommended way to scale down a cluster?

(Side question: is there any way to check nodes resource allocation?)

Edit: Looks like the current Auto Scaling Group range is now the same as on Qovery, so I guess it just took 15 minutes or so to update on AWS?

1 Like

Hello!

I’ll try to answer step by step.

First, the main control plane is actually Qovery: every modification made on the cloud provider side will be overridden by Qovery. This time it’s fine, but be careful, a wrong manipulation on the provider side could break your cluster or its link with the Qovery control plane.

I lowered the “Desired” state of the Auto Scaling Group on AWS → It ended up being modified back (by Qovery, I guess?) a few minutes later, with two more nodes than originally, hehe…

That’s because the cluster autoscaler handles the desired size by itself, based on resource consumption.
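If you’re curious about what the autoscaler currently thinks of each node group, it publishes its own status. A minimal check, assuming it runs in kube-system and uses the default status ConfigMap name (both can differ depending on how it was deployed):

    # Cluster-autoscaler's view of each node group (ConfigMap name and
    # namespace are the upstream defaults and may differ on your cluster):
    kubectl -n kube-system get configmap cluster-autoscaler-status \
      -o jsonpath='{.data.status}'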

I lowered the range in the Qovery console and triggered an update → The range and desired state in AWS are still higher (max is 2x higher).

Edit: Looks like the current Auto Scaling Group range is now the same as on Qovery, so I guess it just took 15 minutes or so to update on AWS?

When you modify the node pool, it takes around 20 minutes to become effective, which could explain the difference between the AWS UI and the Qovery UI.
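If you prefer to check from the command line rather than the AWS console, you can compare the ASG bounds directly with what you set in Qovery. A small sketch, where <your-eks-node-asg> is a placeholder for your actual Auto Scaling Group name:

    # Current min / max / desired of the node group's Auto Scaling Group:
    aws autoscaling describe-auto-scaling-groups \
      --auto-scaling-group-names <your-eks-node-asg> \
      --query 'AutoScalingGroups[].{Min:MinSize,Max:MaxSize,Desired:DesiredCapacity}'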

Something like 15 minutes after these attempts, the cluster actually scaled down a bit, but I’m not sure why.

When you change the pool settings and a scale-down is triggered, all pods running on the node to be deleted must first be migrated to other nodes. Once that’s done, the node can be deleted properly. This operation takes time.
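You can actually watch this happen: the autoscaler taints the node it wants to remove, drains it, then deletes it. For example (the node name is a placeholder):

    # Watch node-related events while a scale-down is in progress:
    kubectl get events --field-selector involvedObject.kind=Node -w

    # A node selected for removal carries this taint until it has been drained:
    kubectl describe node <node-name> | grep ToBeDeletedByClusterAutoscaler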

Hence my question: what’s the recommended way to scale down a cluster?

The best way is to let the autoscaler do the job. For your information, the scale margin is 10% CPU, so if your nodes use 90%+ CPU, a new node will be created. If the pods running on a node can fit on another one while leaving it under 90% CPU usage, the node will be scaled down. This resource consumption is checked every minute and is based on the actual workload.
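If you want to see the exact thresholds configured on your cluster, you can look at the flags the autoscaler was started with. This assumes it is deployed as a Deployment named cluster-autoscaler in kube-system; the actual name can differ depending on how it was installed:

    # Show the scale-down and scan-interval flags the autoscaler runs with
    # (deployment name and namespace are assumptions, adjust if needed):
    kubectl -n kube-system get deployment cluster-autoscaler -o yaml \
      | grep -E 'scale-down|scan-interval'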

(Side question: is there any way to check nodes resource allocation?)

A feature coming in v3 will display all the information you’ll need. In the meantime, if you’re using k9s, switch to the node view with :no, then press ctrl+w if you don’t see the resource columns. If you want more detail, press enter on a node and use the same shortcut to display the pods’ resource usage. If you’re using kubectl and want a quick overview, kubectl top nodes is what you need. For a more detailed output, use kubectl describe nodes. Again, be careful, a wrong manipulation could break it all.
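For a quick kubectl-only overview, something along these lines works (kubectl top needs metrics-server running on the cluster, which should normally already be the case):

    # Live CPU / RAM usage per node (requires metrics-server):
    kubectl top nodes

    # Allocatable capacity per node, for comparison:
    kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEMORY:.status.allocatable.memory'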

Hi @ramnes, I would also add: maybe Qovery AWS EC2 would be a better and more cost-effective solution for you? (We don’t recommend it for production workloads since it’s not resilient, but if EKS cost is an issue, it could be a good option.)

It could be great for me as well to test this :smiley:

2 Likes

TL;DR: I still don’t know why my cluster didn’t scale down by itself earlier, but at least I do know now that it can’t scale down more than it did yesterday because all nodes have a high CPU allocation.

That’s the whole point: I tried to scale down manually because nodes were using an average of 10% CPU and RAM. I wouldn’t have touched anything if the cluster had stayed at the minimum number of nodes as long as they’re under 90% CPU.

I assumed there were reasons on your side to keep more nodes alive, hence the post. If not, the real question is why my cluster doesn’t stay at the minimum number of nodes when its CPU usage is 10%. :slight_smile:

Both k9s and kubectl top nodes show usage, not allocation. What I wanted to see is how much of each node’s resources was still allocatable.

I asked the question because:

  1. I saw resource allocation as another potential reason for my cluster not scaling down, and
  2. from what I could remember, this information wasn’t available through kubectl;

but I’ve found in the meantime that kubectl describe node actually gives it!
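For the record, here’s roughly how I pull that section for every node (a quick sketch; the exact layout of the output can vary between Kubernetes versions):

    # Print the "Allocated resources" section (sum of pod requests / limits
    # versus the node's allocatable capacity) for each node:
    for node in $(kubectl get nodes -o name); do
      echo "== ${node}"
      kubectl describe "${node}" | grep -A 7 "Allocated resources"
    done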

And from its output, CPU requests / limits are very high: around 90% of allocatable CPU is requested on most nodes. So that explains why my cluster doesn’t scale down more, but not why it didn’t scale down earlier, before I tried manually.

FWIW, Qovery’s overall Kubernetes stack allocates between 12% and 61% of available CPU on each one of my nodes, with an average of 34%, so there’s probably a bit of optimization left on your side. (For example, pods in the qovery namespace seem to stay idle most of the time, maybe they don’t need 200m each?)
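Here’s the quick check I used to see where those requests come from (assuming the components live in the qovery namespace, as on my cluster; multi-container pods show one value per container):

    # CPU requested by each pod in the qovery namespace:
    kubectl -n qovery get pods \
      -o custom-columns='NAME:.metadata.name,CPU_REQUEST:.spec.containers[*].resources.requests.cpu'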

But I guess I should first and foremost switch to bigger EC2 machines (to reduce the impact of Kubernetes’ overhead) and reduce the CPU allocation of my applications a bit more, although that’s hard without CPU / RAM usage metrics over time. Any plans for a Vertical Pod Autoscaler feature within Qovery? :slight_smile:
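In the meantime, if I end up installing the upstream VPA components myself, I guess a manifest like this in recommendation-only mode would give me usage-based suggestions without ever touching the pods (my-app is just a placeholder Deployment name), applied with kubectl apply -f vpa.yaml:

    # vpa.yaml -- assumes the Kubernetes VPA components are installed on the
    # cluster; "my-app" is a placeholder Deployment name.
    apiVersion: autoscaling.k8s.io/v1
    kind: VerticalPodAutoscaler
    metadata:
      name: my-app-vpa
    spec:
      targetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: my-app
      updatePolicy:
        updateMode: "Off"   # recommendations only, never evict or resize pods

kubectl describe vpa my-app-vpa should then show the computed CPU / RAM recommendations over time.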

1 Like

We’re full of AWS credits so I’m not trying to heavily cut costs; I just want to avoid blowing these credits unnecessarily. And I want the resilience. :slight_smile:

1 Like