VPA (Vertical Pod Autoscaler) issues in EKS Cluster

To expand on this, here’s what I’m trying to understand and solve for:

  • Our cluster is running multiple nodes with 64 GB RAM and 8 CPUs each. I’ve done the math, and given the size of our application pod and our pod-scaling logic, a node should host at most one application pod and one API pod, plus whatever nginx-ingress, kube-system, logging, New Relic, cert-manager, and the other cluster add-ons consume. At a maximum, I’ve calculated the app pods plus everything else should take up to 51% of available RAM [although I see one pod using 56.9%] and 60% of available node CPU [although actual CPU usage should be much lower]. Reasonable headroom, correct? I can share the spreadsheet calculation in more detail if it’s helpful.
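    Spelling out what those budget percentages mean on a 64 GB / 8-CPU node (all numbers from the spreadsheet above):

    $$0.51 \times 64\,\mathrm{GB} \approx 32.6\,\mathrm{GB}\ \text{RAM budgeted} \quad (\text{vs. } 0.569 \times 64\,\mathrm{GB} \approx 36.4\,\mathrm{GB}\ \text{observed}), \qquad 0.60 \times 8 = 4.8\ \text{CPUs budgeted}$$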

  • Given the above, on paper we should have sufficient resources, and this seems consistent with what we see in our node monitoring.

  • However, during our highest-traffic period (7-10am UTC), the same window as the node monitoring above, we intermittently see cluster health issues, corresponding to our app slowing down and requests timing out (the indicator that traffic serving and request handling are affected is that 524 timeouts start to occur). Notably, this issue is 100% tied to our current nodes: if I redeploy the application (cycling or rebooting the nodes), the issue is resolved and the timeouts stop.

  • When I look at our cluster errors and logging, the first issue I see is with the HPA (HorizontalPodAutoscaler), specifically: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io). We patched this manually ourselves by taking the following steps:
  1. Downgrade metrics-server to v0.6.4
  2. Ensure TLS verification is turned off in the metrics-server container args
  3. Update the metrics-server container port to 4443
  4. Update the metrics-server Service to target the correct port
  5. Redeploy the metrics-server Deployment/ReplicaSet
  6. Confirm the metrics-server endpoint is functional

but then it looks like Qovery overwrote the service, because now the HPA can’t connect to the metrics-server again. How can we implement a persistent, sustainable fix for this, to ensure the HPA and VPA are working correctly and responding to resource demands?
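For reference, here is roughly what our manual patch looks like expressed declaratively, so that it could be re-applied automatically (e.g. from CI or a GitOps tool) whenever it gets reverted. This is only a sketch assuming the standard metrics-server layout in kube-system; the flags shown are standard metrics-server flags, but the exact args should be double-checked against the pinned version:

```yaml
# A sketch of our manual patch in declarative form, trimmed to the relevant
# fields (merge into the full manifests you deploy). Assumes the standard
# metrics-server layout in kube-system; we read step 2 above as
# --kubelet-insecure-tls, so adjust if your setup differs.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: metrics-server
          # Step 1: pin the version instead of tracking whatever ships next.
          image: registry.k8s.io/metrics-server/metrics-server:v0.6.4
          args:
            - --secure-port=4443                        # step 3
            - --kubelet-insecure-tls                    # step 2: skip kubelet cert verification
            - --kubelet-preferred-address-types=InternalIP
          ports:
            - name: https
              containerPort: 4443                       # must match --secure-port
              protocol: TCP
---
# Step 4: the Service forwards port 443 to the container port above.
apiVersion: v1
kind: Service
metadata:
  name: metrics-server
  namespace: kube-system
  labels:
    k8s-app: metrics-server
spec:
  selector:
    k8s-app: metrics-server
  ports:
    - name: https
      port: 443
      targetPort: https
      protocol: TCP
---
# The HPA's failing call (get pods.metrics.k8s.io) resolves through this
# APIService, which must point at the Service above.
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  version: v1beta1
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100
  service:
    name: metrics-server
    namespace: kube-system
```

After applying, `kubectl get apiservice v1beta1.metrics.k8s.io` should report Available: True and `kubectl top pods` should return data again (step 6). But if Qovery reconciles kube-system on every deploy, anything we apply ourselves will presumably keep getting overwritten, which is why we’re asking whether there is a supported, platform-side way to pin this.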

We’ve already tried modifying the nginx advanced settings in our Qovery cluster. Thanks