Pod stuck in pending status

Hi

We just found that many of our services are down, and in Qovery, they are stuck in the starting status. When I checked the Kubernetes status, it seems that those pods are in a pending state. Below is the status of one pending pod. It looks like there might be an issue with Karpenter. Could you help take a look?
https://console.qovery.com/organization/b4271b12-477a-4b41-a274-9c9cba8043e8/project/42fc429e-4f98-40fe-b1c5-e6e0a348a172/environment/cde278a3-a413-40f7-b203-ad070309f579/application/d96dc289-1c98-4012-a347-b8f914876979/general

Name:                 app-zd96dc289-portal-web-7484947765-pbqxn
Namespace:            zcde278a3-prod
Priority:             1000

    Liveness:   tcp-socket :8000 delay=30s timeout=5s period=10s #success=1 #failure=3
    Readiness:  tcp-socket :8000 delay=15s timeout=1s period=10s #success=1 #failure=5
    Environment:
                Optional: false
Conditions:
  Type           Status
  PodScheduled   False
Volumes:         <none>
QoS Class:       Guaranteed
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  53s (x3 over 63s)  default-scheduler  0/7 nodes are available: 1 Insufficient cpu, 2 node(s) had untolerated taint {eks.amazonaws.com/compute-type: fargate}, 4 node(s) had untolerated taint {nodepool/stable: }. preemption: 0/7 nodes are available: 1 No preemption victims found for incoming pod, 6 Preemption is not helpful for scheduling..
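
Note for readers: the FailedScheduling message above packs several per-node reasons into one line. Here is a minimal Python sketch that tallies them, assuming the message format shown in the event above. `parse_failed_scheduling` is a hypothetical helper written for this thread, not part of Kubernetes, Karpenter, or Qovery tooling.

```python
import re

# The scheduler message copied from the event above.
MESSAGE = (
    "0/7 nodes are available: 1 Insufficient cpu, "
    "2 node(s) had untolerated taint {eks.amazonaws.com/compute-type: fargate}, "
    "4 node(s) had untolerated taint {nodepool/stable: }. "
    "preemption: 0/7 nodes are available: "
    "1 No preemption victims found for incoming pod, "
    "6 Preemption is not helpful for scheduling.."
)


def parse_failed_scheduling(message: str) -> dict[str, int]:
    """Return {reason: node_count} for the filtering part of the message.

    Assumption: reasons are separated by ", " and each starts with a node count.
    """
    # Keep only the text before the preemption summary.
    filter_part = message.split(" preemption:", 1)[0]
    # Drop the "0/7 nodes are available:" prefix.
    _, _, reasons = filter_part.partition("nodes are available:")
    counts: dict[str, int] = {}
    for chunk in reasons.split(", "):
        match = re.match(r"\s*(\d+)\s+(.*?)\.?\s*$", chunk)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


if __name__ == "__main__":
    for reason, nodes in parse_failed_scheduling(MESSAGE).items():
        print(f"{nodes} node(s): {reason}")
```

Read this way, six of the seven nodes were excluded by taints the pod does not tolerate, and the one remaining node did not have enough CPU. The pod therefore needed a new node, which is why a Karpenter problem would leave it pending.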

Hi,

I’m looking into it.

Pierre

Karpenter was in bad shape, which is why new nodes couldn’t be provisioned. We’re fixing it manually and will investigate the root cause tomorrow. From a quick look, the Karpenter pods appear to have been short on resources.

You should see your apps coming back now. Sorry for the inconvenience; the team will dig into it tomorrow. In the meantime, please do not redeploy your cluster. Thanks.

Thanks for the quick response, everything works properly now.

Great. The issue has been identified, and a fix will be released within the next 2 days.

The fix has been deployed. You can make changes on your cluster now if you want.

Pierre
