Hi @Pierre_Gerbelot, thanks for the reply. I’m still unclear on a few points and have been researching to understand this better.
Take our specific case:
- Docker runtime
- Running EC2 instances with 32GB RAM, 8 vCPU
By all detectable metrics, this is plenty of memory and CPU for our application.
kubectl top nodes
shows a typical node consuming 1-2% of CPU and ~20% of memory. Looking at the pods, we are also well under our memory and CPU limits.
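For the pod-level numbers I used the standard metrics command below (this assumes metrics-server is available, which it seems to be since top nodes works):
kubectl top pods --namespace <our namespace>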
kubectl get pods --namespace <our namespace> -o wide
indicates we have 2 instances running, one on EC2 “node A” and a second on “node B”.
If I look at our application’s load balancer (standard AWS us-east-1 ELB) and target groups, I see 2 healthy targets and 2 unhealthy targets. “node A” is a healthy target, “node B” is not. One of our pods is running on a healthy target (or at least passing health checks), one of our pods is running on an unhealthy target.
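For reference, target health can also be checked from the AWS CLI (a sketch; the target group ARN is a placeholder, and this assumes an elbv2-style target group, which seems right since classic ELBs don’t have target groups):
aws elbv2 describe-target-health --target-group-arn <target-group-arn>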
Just as a test, I temporarily stopped both of our healthy targets. With only unhealthy targets left, the app couldn’t serve traffic at all, which confirms we need healthy targets to serve traffic.
Essentially, as far as I can tell, if we have one pod on an unhealthy target and one pod on a healthy target, then up to 50% of our traffic risks issues (delayed requests, timeouts). For example, here is a performance-related timeout from the nginx logs:
2023/10/21 20:47:33 [error] 31#31: *97199 upstream timed out (110: Operation timed out) while reading response header from upstream, client: ..., server: www.app.io, request: "GET /manage/app/path/ HTTP/2.0", upstream: "http://xx.1.64.xx:3000/manage/app/path/", host: "www.app.io"
Most requests, by contrast, process fine and return 200. All our instances are generated and managed by Qovery, so they share the same AMI, security group rules, NACLs and route tables.
Investigating this further, I looked more closely at the instances and, again, the failing ones are returning an HTTP status mismatch and a 503 for the health checks.
I tried running an nginx pod in the new EKS cluster, exec’ing into it, and curling the health check path with the following commands:
kubectl run nginx2 --image=nginx
kubectl exec -i -t nginx2 -- /bin/bash
curl -v http://<ip>:<port>/healthzz
If I curl the IP of an EC2 instance that’s failing its health check I get:
< HTTP/1.1 503 Service Unavailable
< Content-Type: application/json
< X-Content-Type-Options: nosniff
< Date: Sat, 21 Oct 2023 01:19:40 GMT
< Content-Length: 125
<
{
  "service": {
    "namespace": "nginx-ingress",
    "name": "nginx-ingress-ingress-nginx-controller"
  },
  "localEndpoints": 0
}
The key difference here, as far as I can tell, is the 0 local endpoints, because if I curl a healthy instance I get:
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Content-Type: application/json
< X-Content-Type-Options: nosniff
< Date: Sat, 21 Oct 2023 01:20:47 GMT
< Content-Length: 125
<
{
  "service": {
    "namespace": "nginx-ingress",
    "name": "nginx-ingress-ingress-nginx-controller"
  },
  "localEndpoints": 1
}
Here, “localEndpoints” is 1. Is it correct that this is related to nginx-ingress-ingress-nginx-controller? We have 2 nginx-ingress-ingress-nginx-controller pods running:
kubectl get endpoints nginx-ingress-ingress-nginx-controller -n nginx-ingress -o yaml
also shows that the Endpoints’ subsets.addresses[].nodeName values correspond to the two healthy EC2 target instances.
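For what it’s worth, my (possibly wrong) understanding is that this JSON response comes from kube-proxy’s per-node health check for a Service running with externalTrafficPolicy: Local, so a node with no local controller pod reports 0 endpoints and fails the LB health check. That could be confirmed from the Service spec, e.g.:
kubectl get svc nginx-ingress-ingress-nginx-controller -n nginx-ingress -o jsonpath='{.spec.externalTrafficPolicy}{"\n"}{.spec.healthCheckNodePort}{"\n"}'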
My interpretation (or hypothesis) is that one of our pods is on a node with a working nginx-ingress-ingress-nginx-controller endpoint and one isn’t. So again, 50% of our traffic can’t be served properly by nginx, or has to be terminated and re-routed.
We have not changed our nginx ingress-controller configuration beyond the Qovery and AWS defaults, but this seems like some sort of instance issue where traffic routed by the load balancer is not always reaching an EC2 instance that is running nginx correctly.
One other observation (I don’t know if it’s related or a coincidence) is that our healthy EC2 instances have 3 network interfaces (ENIs), whereas our unhealthy instances only have 2.
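One way to compare the ENI counts is via the AWS CLI (the instance ID below is a placeholder):
aws ec2 describe-network-interfaces --filters Name=attachment.instance-id,Values=<instance-id> --query 'NetworkInterfaces[].NetworkInterfaceId'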
At this stage I’m confident that when our application pods are scheduled onto a healthy EC2 instance with nginx configured correctly, we don’t get timeout or performance issues, whereas we do when pods run on instances that are not healthy LB targets. In fact, I tried some random relaunches of pods, got both pods scheduled onto healthy EC2 target instances, and am no longer seeing application performance issues.
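For anyone reading later, a more deliberate version of that relaunch would be something like the sequence below. This is a manual workaround sketch rather than a Qovery-supported setting, and the node/pod names are placeholders:
kubectl get pods --namespace <our namespace> -o wide      # confirm which node each pod landed on
kubectl cordon <unhealthy node>                           # keep new pods off the failing node
kubectl delete pod <app pod> --namespace <our namespace>  # let the Deployment reschedule it elsewhere
kubectl uncordon <unhealthy node>                         # once the node is healthy again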
It would be immensely helpful to (a) understand this better and (b) have some way to control, adjust or target pod scheduling to a specific node (or only deploy pods to instances that are passing their health checks or running nginx), as ultimately I am working with the Qovery-managed default config. Thanks!