Unhealthy EC2 Target Groups

We’re diagnosing some performance issues with our application and have traced them to unhealthy EC2 targets. We currently have 2 healthy targets out of 11. The NLB is trying to forward traffic across the instances, but since most of them appear to be unresponsive, requests routed to those targets are causing the issues.

Related to cluster: https://console.qovery.com/organization/828d9c2c-fd72-4b4e-ab58-71de40ecfdd2/cluster/65665c34-242c-4833-9e78-626de9ca1ecc/settings/general

And application: https://console.qovery.com/organization/828d9c2c-fd72-4b4e-ab58-71de40ecfdd2/project/0bf26679-c2d6-48fd-a485-aed61158fb1c/environment/7d2525b8-8f28-45b0-9522-116e229905df/services/general

My questions are: is Qovery doing anything to manage target health or drain and restart unhealthy targets? NLB management is part of Qovery’s EC2 resource support, but are these specific infra settings inside or outside Qovery’s scope? I’m trying to better understand (a) what the underlying cause of this is to begin with, (b) how to increase the number of healthy targets, and (c) whether we need to monitor and manage this on the AWS side. Thanks

Hello @ChrisBolman1,

AWS monitors nodes and replaces them on failure. However, if there is disk pressure (not enough disk space), it is the user’s responsibility to reduce the data present on the disk or increase the disk space.

At the moment, we do not report this kind of issue, but we are working on it. We will be able to identify evicted pods with reasons like disk pressure.

For now, I cannot be sure if disk pressure was the reason for the issue, as it seems you redeployed the cluster with a new type of node. You may try looking into your CloudWatch logs on AWS to see if you have more information on the reason for unhealthy nodes.

If disk pressure is confirmed, you can either review the amount of data your app writes to disk or increase the node disk size.
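In the meantime, if you want to check this yourself, here is a minimal sketch using kubectl (assuming you have kubectl access to the cluster; nothing Qovery-specific about these commands):

```sh
# Look for DiskPressure=True in the node conditions
kubectl describe nodes | grep -A 6 "Conditions:"

# List failed pods (evicted pods show up here) and the related eviction events
kubectl get pods --all-namespaces --field-selector=status.phase=Failed
kubectl get events --all-namespaces --field-selector=reason=Evicted
```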

Regards

Hi @Pierre_Gerbelot, thanks for the reply. I’m still unclear on a few points here and have been researching to understand it better.

Take our specific case:

  1. Docker runtime
  2. Running EC2 instances with 32GB RAM, 8 vCPU

By all detectable metrics, this is plenty of memory and CPU for our application.

kubectl top nodes shows a typical node consuming 1-2% of CPU and ~20% of memory. Looking at the pods, we are also well under our memory and CPU limits.

kubectl get pods --namespace <our namespace> -o wide indicates we have 2 application pods running, one on EC2 “node A” and a second on “node B”.

If I look at our application’s load balancer (standard AWS us-east-1 ELB) and target groups, I see 2 healthy targets and 2 unhealthy targets. “node A” is a healthy target, “node B” is not. One of our pods is running on a healthy target (or at least one that is passing health checks), and the other is running on an unhealthy target.
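For reference, AWS also exposes a reason and description for each failing target. This is just a sketch of pulling that with the AWS CLI; the target group ARN below is a placeholder:

```sh
# Show each target's health state plus the reason/description AWS gives for failures
aws elbv2 describe-target-health \
  --target-group-arn arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/<our-tg>/0123456789abcdef \
  --query 'TargetHealthDescriptions[].[Target.Id,TargetHealth.State,TargetHealth.Reason,TargetHealth.Description]' \
  --output table
```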

Just as a test, I stopped both of our healthy targets temporarily. With only unhealthy targets left, the app couldn’t serve traffic at all, so we clearly need healthy targets in the group.

Essentially, as far as I can tell, if we have one pod on an unhealthy target and one pod on a healthy target, then up to 50% of our traffic risks issues (delayed requests, timeout risk). For example, here is a performance-related timeout from the nginx logs:

2023/10/21 20:47:33 [error] 31#31: *97199 upstream timed out (110: Operation timed out) while reading response header from upstream, client: ..., server: www.app.io, request: "GET /manage/app/path/ HTTP/2.0", upstream: "http://xx.1.64.xx:3000/manage/app/path/", host: "www.app.io"

Most other requests process fine and return 200. All our instances are generated and managed by Qovery, so they share the same AMI, security group rules, NACLs and route tables.

Investigating further, I looked more closely at the instances, and again the failing ones report an HTTP status mismatch and return a 503 for the health checks.

I tried running an nginx pod in the new EKS cluster, exec’ing into it, and curling the health check path with the following commands:

kubectl run nginx2 --image=nginx
kubectl exec -i -t nginx2 -- /bin/bash
curl -v http://<ip>:<port>/healthz

If I curl the IP of an EC2 instance that’s failing its health check I get:

< HTTP/1.1 503 Service Unavailable
< Content-Type: application/json
< X-Content-Type-Options: nosniff
< Date: Sat, 21 Oct 2023 01:19:40 GMT
< Content-Length: 125
<
{"service": {"namespace": "nginx-ingress", "name": "nginx-ingress-ingress-nginx-controller"}, "localEndpoints": 0}

The key difference here, as far as I can tell, is the 0 local endpoints, because if I curl a healthy instance I get:

> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< X-Content-Type-Options: nosniff
< Date: Sat, 21 Oct 2023 01:20:47 GMT
< Content-Length: 125
<
{"service": {"namespace": "nginx-ingress", "name": "nginx-ingress-ingress-nginx-controller"}, "localEndpoints": 1}

Here, “localEndpoints” is 1. Is it correct that this is related to nginx-ingress-ingress-nginx-controller? We have 2 nginx-ingress-ingress-nginx-controller pods running.

kubectl get endpoints nginx-ingress-ingress-nginx-controller -n nginx-ingress -o yaml also shows that the nodeName entries under the Endpoints’

subsets:
- addresses:
  - nodeName:

correspond to the two healthy EC2 target instances.
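To cross-check this, here is a small sketch (same namespace and service name as above) that lists which nodes the controller pods are actually scheduled on and which nodeName values the Endpoints object contains:

```sh
# Which nodes the ingress controller pods are running on
kubectl get pods -n nginx-ingress -o wide

# nodeName values straight from the Endpoints object
kubectl get endpoints nginx-ingress-ingress-nginx-controller -n nginx-ingress \
  -o jsonpath='{range .subsets[*].addresses[*]}{.nodeName}{"\n"}{end}'
```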

My interpretation (or hypothesis) is that one pod is on a node with a working nginx-ingress-ingress-nginx-controller and one isn’t. So again, 50% of our traffic can’t be served properly by nginx, or has to be terminated and re-routed.

We have not changed our nginx ingress-controller configuration outside of the Qovery and AWS defaults, but this seems like some sort of instance-level issue where not all load-balancer traffic is being routed to an EC2 instance that is running nginx correctly.
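One hypothesis I still want to rule out (I’m not certain this is how Qovery configures things): the 503 body above looks like the per-node health check that kube-proxy serves for a LoadBalancer Service, which returns 503 on nodes that have no local endpoint when the Service uses externalTrafficPolicy: Local. If that’s the case here, it could be checked with something like:

```sh
# Show the controller Service's traffic policy and the node port used for LB health checks
kubectl get svc nginx-ingress-ingress-nginx-controller -n nginx-ingress \
  -o jsonpath='{.spec.externalTrafficPolicy}{" "}{.spec.healthCheckNodePort}{"\n"}'
```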

One other observation (I don’t know if it’s related or a coincidence) is that our healthy EC2 instances have 3 network interfaces (ENIs), whereas our unhealthy instances only have 2.
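Just as a way to compare this across instances, a sketch with the AWS CLI (instance IDs are placeholders):

```sh
# Count the attached network interfaces per instance (placeholder instance IDs)
aws ec2 describe-instances \
  --instance-ids i-0123456789abcdef0 i-0fedcba9876543210 \
  --query 'Reservations[].Instances[].[InstanceId,length(NetworkInterfaces)]' \
  --output table
```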

At this stage I’m confident that when our application pods are scheduled onto a healthy EC2 instance with nginx configured correctly, we don’t get timeout or performance issues with our application, whereas we do when pods run on instances that are not healthy LB targets. In fact, I tried some random relaunches of pods, got both pods scheduled onto healthy EC2 target instances, and am no longer seeing application performance issues.
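In case it helps anyone reproduce this, one way to force the pods to be recreated and see where the scheduler places them (deployment and namespace names below are placeholders):

```sh
# Recreate the application pods and check which nodes they land on
# (deployment name and namespace are placeholders)
kubectl rollout restart deployment/<our-app> -n <our-namespace>
kubectl get pods -n <our-namespace> -o wide
```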

It would be immensely helpful to (a) understand this better and also (b) have some way to control, adjust or target pod scheduling to a specific node [or only deploy pods to instances that are passing their health checks or have nginx running], as ultimately I am working with the Qovery-managed default config. Thanks