Unhealthy EC2 Target Groups

We’re diagnosing some performance issues with our application and have traced them to unhealthy EC2 targets. We currently have 2 healthy targets out of 11. The NLB keeps trying to forward traffic to the instances, but since most of them appear to be unresponsive, it’s causing the issues.
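For context, here is how we are inspecting target health: a minimal sketch with boto3 (the target group ARN below is a placeholder for ours, and credentials/region are assumed to be configured in the environment):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN -- substitute your actual target group
TG_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef"

resp = elbv2.describe_target_health(TargetGroupArn=TG_ARN)
for desc in resp["TargetHealthDescriptions"]:
    target = desc["Target"]
    health = desc["TargetHealth"]
    # Reason/Description explain *why* a target is unhealthy
    # (absent when the target is healthy, hence .get())
    print(target["Id"], health["State"], health.get("Reason"), health.get("Description"))
```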

Related to cluster: https://console.qovery.com/organization/828d9c2c-fd72-4b4e-ab58-71de40ecfdd2/cluster/65665c34-242c-4833-9e78-626de9ca1ecc/settings/general

And application: https://console.qovery.com/organization/828d9c2c-fd72-4b4e-ab58-71de40ecfdd2/project/0bf26679-c2d6-48fd-a485-aed61158fb1c/environment/7d2525b8-8f28-45b0-9522-116e229905df/services/general

My questions are: is Qovery doing anything to manage target health, or to drain and restart unhealthy targets? NLB management is part of Qovery’s EC2 resource support, but is managing this specific infrastructure setting inside or outside Qovery’s scope? I’m trying to better understand (a) what the underlying cause of this is, (b) how to increase the number of healthy targets, and (c) whether we need to monitor and manage this on the AWS side. Thanks!

Hello @ChrisBolman1 ,

AWS monitors nodes and replaces them on failure. However, if there is disk pressure (not enough disk space), it is the user’s responsibility to reduce the data written to disk or to increase the disk size.

At the moment, we do not report this kind of issue, but we are working on it: we will be able to identify evicted pods along with the reason, such as disk pressure.

For now, I cannot be sure that disk pressure was the cause, as it seems you redeployed the cluster with a new node type. You may try looking into your CloudWatch logs on AWS to see if they contain more information on why the nodes were unhealthy.

If disk pressure is confirmed, you can review the amount of data your app writes to disk, or increase the node disk size.
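If you want to check for disk pressure yourself, here is a minimal sketch using the official Kubernetes Python client, assuming you have kubeconfig access to the cluster:

```python
from kubernetes import client, config

config.load_kube_config()  # assumes a local kubeconfig pointing at the cluster
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    for cond in node.status.conditions:
        # The kubelet sets DiskPressure=True when the node is low on disk space
        if cond.type == "DiskPressure":
            print(f"{node.metadata.name}: DiskPressure={cond.status} ({cond.reason})")
```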

Regards

Hi Pierre, thanks for the reply. We were able to figure out the issue.

It was not related to disk pressure. The issue was that, by default, our Target Group health check protocol was set to HTTP rather than TCP. Once we changed the health checks to TCP for our target groups, all instances passed immediately. There was nothing wrong with the EC2 instances or the NLB themselves; it was just this setting that got auto-populated incorrectly.
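For anyone else hitting this, the same fix can be applied programmatically. A minimal sketch with boto3 (again, the ARN is a placeholder; note that with TCP health checks the path and matcher settings no longer apply):

```python
import boto3

elbv2 = boto3.client("elbv2")

# Placeholder ARN -- substitute your actual target group
TG_ARN = "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/my-tg/0123456789abcdef"

# Switch the health check from HTTP to plain TCP connectivity
elbv2.modify_target_group(
    TargetGroupArn=TG_ARN,
    HealthCheckProtocol="TCP",
)
```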