VPA (Vertical Pod Autoscaler) issues in EKS Cluster

To expand on this, here’s what I’m trying to understand and solve for:

  • Our cluster is running multiple nodes with 64 GB RAM and 8 CPUs each. I’ve done the math, and given the size of our application pod and our pod-scaling logic, a node should host at most one application pod and one API pod, plus whatever nginx-ingress, kube-system, logging, New Relic, cert-manager, and the other cluster add-ons consume. At a maximum, I’ve calculated the app pods plus everything else should take up to 51% of available RAM [although I see one pod using 56.9%] and 60% of available node CPU [although actual CPU usage should be much lower]. Reasonable headroom, correct? I can share the spreadsheet calculation in more detail if it’s helpful.
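    Spelling out what those budget percentages mean on a 64 GB / 8-CPU node (all numbers from the spreadsheet above):

    $$0.51 \times 64\,\mathrm{GB} \approx 32.6\,\mathrm{GB}\ \text{RAM budgeted} \quad (\text{vs. } 0.569 \times 64\,\mathrm{GB} \approx 36.4\,\mathrm{GB}\ \text{observed}), \qquad 0.60 \times 8 = 4.8\ \text{CPUs budgeted}$$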

  • Given the above, on paper we should have sufficient resources, and this seems consistent with what we see in our node monitoring.

  • However, during our highest-traffic period (7-10am UTC), the same window as the node monitoring above, we intermittently see cluster health issues, corresponding to our app slowing down and requests timing out (the indicator that traffic serving and request handling are affected is that 524 timeouts start to occur). Notably, this issue is 100% tied to our current nodes: if I redeploy the application (cycling or rebooting the nodes), the issue is resolved and the timeouts stop.

  • When I look at our cluster errors and logging, the first issue I see is with the HPA (HorizontalPodAutoscaler), specifically: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io). We patched this manually ourselves by taking the following steps:
  1. Downgrade metrics-server to v0.6.4
  2. Ensure TLS verification is turned off in the metrics-server container args
  3. Update the metrics-server container port to 4443
  4. Update the metrics-server Service to target the correct port
  5. Redeploy the metrics-server Deployment/ReplicaSet
  6. Confirm the metrics-server endpoint is functional

but then it looks like Qovery overwrote the service, because now the HPA can’t connect to the metrics-server again. How can we implement a persistent, sustainable fix for this, to ensure the HPA and VPA are working correctly and responding to resource demands?
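For reference, here is roughly what our manual patch looks like expressed declaratively, so that it could be re-applied automatically (e.g. from CI or a GitOps tool) whenever it gets reverted. This is only a sketch assuming the standard metrics-server layout in kube-system; the flags shown are standard metrics-server flags, but the exact args should be double-checked against the pinned version:

```yaml
# A sketch of our manual patch in declarative form, trimmed to the relevant
# fields (merge into the full manifests you deploy). Assumes the standard
# metrics-server layout in kube-system; we read step 2 above as
# --kubelet-insecure-tls, so adjust if your setup differs.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-server
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: metrics-server
          # Step 1: pin the version instead of tracking whatever ships next.
          image: registry.k8s.io/metrics-server/metrics-server:v0.6.4
          args:
            - --secure-port=4443                        # step 3
            - --kubelet-insecure-tls                    # step 2: skip kubelet cert verification
            - --kubelet-preferred-address-types=InternalIP
          ports:
            - name: https
              containerPort: 4443                       # must match --secure-port
              protocol: TCP
---
# Step 4: the Service forwards port 443 to the container port above.
apiVersion: v1
kind: Service
metadata:
  name: metrics-server
  namespace: kube-system
  labels:
    k8s-app: metrics-server
spec:
  selector:
    k8s-app: metrics-server
  ports:
    - name: https
      port: 443
      targetPort: https
      protocol: TCP
---
# The HPA's failing call (get pods.metrics.k8s.io) resolves through this
# APIService, which must point at the Service above.
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  version: v1beta1
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100
  service:
    name: metrics-server
    namespace: kube-system
```

After applying, `kubectl get apiservice v1beta1.metrics.k8s.io` should report Available: True and `kubectl top pods` should return data again (step 6). But if Qovery reconciles kube-system on every deploy, anything we apply ourselves will presumably keep getting overwritten, which is why we’re asking whether there is a supported, platform-side way to pin this.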

We’ve already tried modifying the nginx advanced settings in our Qovery cluster. Thanks