VPA (Vertical Pod Autoscaler) issues in EKS Cluster

Hi Qovery, in the process of trying to troubleshoot some intermittent cluster issues, we’ve noticed that there seems to be an issue with our vertical pod autoscaler (VPA) in our production cluster: https://console.qovery.com/organization/828d9c2c-fd72-4b4e-ab58-71de40ecfdd2/cluster/91913fb3-27bf-4ec9-9afa-b1e04bcdafe7/

Specifically:

1. vertical-pod-autoscaler-vpa-webhook

FailedToUpdateEndpoint Failed to update endpoint kube-system/vertical-pod-autoscaler-vpa-webhook: Operation cannot be fulfilled on endpoints "vertical-pod-autoscaler-vpa-webhook": the object has been modified; please apply your changes to the latest version and try again

2. vertical-pod-autoscaler-vpa-admission-controller-55f76b5b472j29

Unhealthy Readiness probe failed: Get "http://xx.x.xx.167:8944/health-check": dial tcp xx.x.xx.167:8944: connect: connection refused

AFAIK, our app has only and always been configured for port 3000 (external port 443). I’m not familiar with the use of port 8944.
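
For what it’s worth, this is how I looked at where that port is configured (deployment name inferred from the failing pod’s name, so it may differ):

```bash
# show the readiness probe / ports configured on the VPA admission controller
kubectl -n kube-system get deployment vertical-pod-autoscaler-vpa-admission-controller -o yaml \
  | grep -E -B2 -A6 'readinessProbe|containerPort'
```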

Everything that we’re currently running is auto-generated by Qovery. Is there any guidance we can follow for fixing this issue with our VPA? It seems like this is managed by a Helm chart. Is there a way to reapply our VPA chart, maybe at the latest version?
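
For context, this is how I’ve been checking which chart/version is installed on our side (the grep pattern is guessed from the resource names above):

```bash
# list installed Helm releases and find the VPA chart + its version
helm list -A | grep -i vertical-pod-autoscaler
```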

thanks

To expand on this, here’s what I’m trying to understand and solve for:

  • Our cluster is running multiple nodes with 64 GB RAM and 8 CPUs each. I’ve done the math, and given the size of our application pod and our pod-scaling logic, at most one node should be running 1 application pod, an API pod, and then the resources consumed by nginx-ingress, kube-system, logging, New Relic, cert-manager, and anything else running. At a maximum, I’ve calculated that the app pods + other resources should take up to 51% of available RAM [although I see one pod using 56.9%] and 60% of available node CPU [although actual CPU usage % should be much lower]. Reasonable headroom, correct? I can share the spreadsheet calculation in more detail if it’s helpful.

  • Given the above, on paper, we should have sufficient resources. And this seems to be consistent with what we see in our monitoring:

  • However, during our highest-traffic periods (7-10am UTC), the same period shown in the node graphs above, we intermittently see cluster health issues, corresponding to our app slowing down and requests incurring timeouts (an indicator that traffic and request serving/handling are affected is that 524 timeouts start to occur). That said, this issue is 100% related to our current nodes, because if I redeploy the application (cycling or rebooting the nodes), the issue gets resolved and the timeouts stop.

  • When I look at our cluster errors and logging, the first issue I see is with the HPA (HorizontalPodAutoscaler), specifically: failed to get cpu utilization: unable to get metrics for resource cpu: unable to fetch metrics from resource metrics API: the server is currently unable to handle the request (get pods.metrics.k8s.io). We actually patched this manually ourselves by taking the following steps:
  1. Downgrade metrics-server to 0.6.4
  2. Ensure TLS is turned off in the metrics-server container args
  3. Update the metrics-server container port to 4443
  4. Update the metrics-server Service to match the new port
  5. Re-deploy the metrics-server Deployment/ReplicaSet
  6. Ensure the metrics-server endpoint is functional

but then it looks like Qovery overwrote the service config, because now HPA can’t connect to the metrics-server again. How can we implement a persistent, sustainable fix for this to ensure HPA and VPA are working correctly and responding to resource demands?
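
For reference, the rough shape of our manual patch was the following (names, flags and ports are from our notes and may not exactly match what Qovery deploys; the containerPort and probe ports also had to be aligned with the new secure port):

```bash
# 1. downgrade the image
kubectl -n kube-system set image deployment/metrics-server \
  metrics-server=registry.k8s.io/metrics-server/metrics-server:v0.6.4
# 2-3. relax kubelet TLS verification and move the serving port to 4443
kubectl -n kube-system patch deployment metrics-server --type='json' -p='[
  {"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"},
  {"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--secure-port=4443"}
]'
# 4. point the Service at the new container port
kubectl -n kube-system patch service metrics-server \
  -p '{"spec":{"ports":[{"port":443,"targetPort":4443,"protocol":"TCP"}]}}'
# 5-6. roll out and confirm the metrics API answers again
kubectl -n kube-system rollout status deployment/metrics-server
kubectl top nodes
```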

We’ve already tried modifying the nginx advanced settings in our Qovery cluster. Thanks

Hi,

I’m currently working on a fix for an issue that may be related. Can you please share the resource usage of the VPA pods and the metrics-server during those hours, plus the sum across those pods?
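
For the 7-10am window you’ll need your monitoring history, but for a quick point-in-time check, something like this is enough (both live in kube-system, based on the events you shared):

```bash
# current memory/CPU of the VPA components and metrics-server
kubectl top pods -n kube-system | grep -E 'vertical-pod-autoscaler|metrics-server'
```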

Thanks

Hi @Pierre_Mavro

There is a small spike with the metrics-server around the issue period. Here’s what I see:

vertical-pod-autoscaler-vpa-admission-controller-55f76b5b4x9ptc - 32 MB, 0.1 CPU cores

I do see a spike in received KBps for this pod during the problem period:

vertical-pod-autoscaler-vpa-recommender-7fc4fc986f-lvdwr - 47 MB, 0.1 CPU cores

It also has a KBps spike:

vertical-pod-autoscaler-vpa-updater-ccff6b445-kgrqr - 34.6 MB, 0.1 CPU cores

Overall VPA resource profile during the issue period:

Metrics-server - 23.97 MB, 0.00372 CPU cores

So the total, rounded up, would be 32 + 47 + 35 + 24 = 138 MB of RAM, with pretty negligible CPU usage from what I can see. I’ll look into it more and double-check whether anything else stands out.

An update is ongoing on our side; it should be released shortly. I’ll keep you up to date. It should resolve the scaling instabilities.


The fix has been released. Please run an update on your cluster and let me know if you see the issue coming back.

Have a good weekend

Thanks @Pierre_Mavro, I have a follow-up question on this.

As a test today, I decided to clone everything:

  1. I created a duplicate cluster with the exact same config, running the same EC2 instance type in the same AZ with the same resource settings

  2. I cloned our production app and deployed it in the new cluster

Our application’s traffic flow is: User > Cloudflare > NLB > Instance

Traffic is routed to resources in the cluster via the ingress-nginx-controller.

Here, to me, is the interesting thing:

I performed a speed test of the existing prod app (in original cluster 1) and the new clone app (in new cluster 2). The existing prod app in cluster 1 is significantly and consistently faster at serving requests than the clone app in cluster 2, even though they are the exact same app with the exact same resources. Maybe there is some filesystem-based caching at play, but I don’t think that applies to Kubernetes.

But I do see one difference: by whatever default config (I have not modified anything related to this), in cluster 1 the nginx-ingress-ingress-nginx-controller pods are running on all three of our nodes, and all three instances and AZs are healthy load balancer targets:

In the new cluster, there are 3 nginx pods on node 1, 1 on node 2, and 0 on node 3, and, correspondingly, we only have two healthy targets:

My impression was that this should not matter: an unhealthy node should not affect app speed as long as the load balancer doesn’t direct traffic to it. But in our case it seems to, because some instances respond to HTTPS (e.g. https://xx.xx.xxx.xx:31980/healthz) and some do not. The load balancer for our new cluster says it has three AZs in its network mapping through the VPC, but only 2 of those AZs contain instances that respond to HTTP/HTTPS requests.
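
For reference, this is how I’m checking the pod spread and which instances answer on that health-check port (the IPs below are placeholders; 31980 is just the NodePort my cluster happens to use):

```bash
# where are the ingress controller pods actually running?
kubectl get pods -A -o wide | grep ingress-nginx-controller

# which nodes answer on the health-check NodePort? (substitute your own node IPs)
for ip in 10.0.1.10 10.0.2.10 10.0.3.10; do
  printf '%s: ' "$ip"
  curl -sk --connect-timeout 2 -o /dev/null -w '%{http_code}\n' "https://$ip:31980/healthz"
done
```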


There’s no way to control this from Qovery, correct (i.e., the allocation of nginx pods across instances, and therefore which instance targets respond over HTTPS)? I can turn on stickiness and deregister unhealthy targets in AWS (which seems to help), but nginx pod assignment is totally random, yes? Can you help me understand this relationship a little better? Thanks

Latency or throughput differences on your kind of stack are really complex to address and require a lot of tracing and tooling to really understand what’s going on. And you didn’t give any numbers either, so from my point of view it’s subjective, since it’s a feeling on your side (please correct me if I’m wrong).

In your case, there are so many network hops that the issue can come from various places:

  • Cloudflare
  • The NLB
  • Nginx Ingress
  • Kube-proxy
  • The EC2 VM where the Nginx ingress is serving traffic
  • Your application pod

Having fewer Nginx ingress pods can matter if you’re sensitive to latency, since it can create an extra hop. For example, take a 3-node cluster with 2 Nginx ingress pods: a user requests your service, the request arrives on node 1, but your app is running on node 3, so you get one more hop (kube-proxy here).

You can control the minimum number of Nginx ingress instances you have (so you can set 3 if you want to maximize zone repartition): Cluster Advanced Settings | Docs | Qovery
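
Once the setting is changed and the cluster updated, you can confirm what the controller is actually running with something like:

```bash
# replica count and autoscaling bounds of the ingress controller after the change
kubectl get deploy -A | grep ingress-nginx-controller
kubectl get hpa -A | grep -i nginx
```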

I’d like to help you more, but there are so many possibilities, and it would take a lot of investigation time to find the right info… it’s out of Qovery’s scope.

My last pieces of advice:

  • Save some time and deploy an observability tool like Datadog
  • Enable tracing everywhere you can across your complete stack so you’ll have data to compare
  • Measure, do not guess; you’ll save a lot of time
  • Try to contact AWS support; they may have more insight or advice on their side for measuring the AWS part

Pierre

Understood, and we are using cluster monitoring and tracing tools, so these are not just subjective statements. I can see traces. This also ties back to basic questions about using and instrumenting Qovery for our infrastructure with a Qovery-generated cluster (since we are not using BYO). To your point about our network stack:

  • Cloudflare - Our responsibility (or Cloudflare’s)
  • The NLB - AWS but config generated by Qovery
  • Nginx Ingress - Generated / configured by Qovery
  • Kube-proxy - Generated / configured by Qovery
  • The EC2 VM where the Nginx ingress is serving traffic - AWS but config generated by Qovery
  • Your application pod - our code, our responsibility

A lot of this infrastructure is spawned and configured by Qovery. So, assuming we are latency-sensitive, I can adjust the minimum nginx ingress instances in our cluster, but Qovery is distributing them randomly (and not equally). For example, if I increase the minimum nginx ingress setting in Qovery to 6, we end up with:

Node 1 - 0 nginx-ingress pods
Node 2 - 5 nginx-ingress pods
Node 3 - 1 nginx-ingress pod

I don’t need 5 on Node 2; I’m trying to place 1 on Node 1. There’s no possible way to do this via Qovery, right? Having some control over pod allocation across nodes would be quite helpful. Thanks

OK, so what you want, if I understand correctly, is 1 Nginx ingress per node (technically, replacing the Nginx ingress Deployment with a DaemonSet).

I don’t think it will change anything, since you have no guarantee that Nginx will forward the request to your application pod running on the same node (in your case it may be easier to control since you don’t have a lot of nodes, but it doesn’t scale).
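
Just to illustrate the mechanism you’re asking about: at the ingress-nginx chart level, a one-per-node layout looks roughly like the sketch below (assuming a recent chart version; on a Qovery-managed cluster this chart is managed for you, so treat it as an illustration rather than something to apply by hand):

```bash
# Values sketch for spreading controller pods across nodes (illustration only)
cat <<'EOF' > /tmp/ingress-spread-example-values.yaml
controller:
  # option 1: run the controller as a DaemonSet (one pod per node)
  # kind: DaemonSet
  # option 2: keep the Deployment but spread replicas evenly across nodes
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: kubernetes.io/hostname
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app.kubernetes.io/name: ingress-nginx
          app.kubernetes.io/component: controller
EOF
```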

Perfect if you already have observability in place. Can you please share what you see/have? Otherwise we may go in the wrong direction, and some config may need to be updated to get exactly what you expect.

To get a better idea, I’d like to see graphs of the different configs, the latency you observe vs. what you expect, and something detailed enough that I can easily reproduce it, validate it, and find a definitive solution for you.

Thanks