Cluster update is frozen

Hello guys!

I’m trying to update our default cluster and it seems to be frozen; it’s not even emitting logs. Is there anything we can do here?

These are the logs for the cluster update:

Several operations can take time, mostly because of flushing data to the S3 API during upgrades, restarting pods (the more nodes you have, the longer it can take), etc…

Most of the time it takes only a few minutes, but in some cases we’ve seen it take an hour. How long has it been running?

This one has been running for about 3 hours already. It’s true that when it started, some apps on the cluster had a deployment error status and were really not in a good state. Before that, I tried updating once and it ran for 1.5 hours before failing at the end.

The weird thing this time is that it has been frozen with no logs for about 2 hours. I’m wondering if there’s any way to cancel it without destroying the cluster. The apps seem to have been migrated to the second cluster and are working, but I’m afraid deleting the cluster that’s mid-update will kill the apps and then we’ll suffer some big downtime.

@Pierre_Mavro So it’s no longer frozen, but now it fails with the following error:

Transmitter: Staging_ueast1 - Error, cloud provider quotas exceeded. `vCPU` has reached its quotas of 32.

```
Error: error waiting for EKS Node Group (qovery-ze3590276:qovery-20220818195543406800000001) to create: unexpected state 'CREATE_FAILED', wanted target 'ACTIVE'. last error: 2 errors occurred:
	* eks-qovery-20220818195543406800000001-3ec15943-5001-2679-f124-7615221827f4: AsgInstanceLaunchFailures: Could not launch On-Demand Instances. VcpuLimitExceeded - You have requested more vCPU capacity than your current vCPU limit of 32 allows for the instance bucket that the specified instance type belongs to. Please visit http://aws.amazon.com/contact-us/ec2-request to request an adjustment to this limit. Launching EC2 instance failed.
	* DUMMY_0204b5ab-eccf-445e-9903-0d6d5d695c31, DUMMY_04ea64b8-e822-4548-97c7-0587cd18daee, DUMMY_177850f3-3534-4e12-94db-d4325488accb, DUMMY_22f037b3-f31c-42d6-ae4e-060b065291a1, DUMMY_25cd0d38-c6f4-47e4-9d90-f54eb9a797be, DUMMY_372e7498-6b36-4a2e-8b9d-22f219bf474d, DUMMY_6b054f67-dfe4-449f-9594-d25180642f23, DUMMY_9367f74f-e4d9-47f4-8d7f-07ab33229336, DUMMY_9f050f99-9f58-475a-a721-6011894e61bd, DUMMY_a41930f8-4bb5-46eb-b25b-515cf2ab691b, DUMMY_b5434b72-514f-493f-ae90-c07daba69007, DUMMY_b6eba629-7def-434c-bc08-5291e25cd788, DUMMY_bb23b28c-51c4-4f1c-829a-960bbaeebe26, DUMMY_beb05a25-9eef-4a4c-a80f-fd835724d064, DUMMY_cce7dc2a-94b6-40db-8b25-c2693f6953e8, DUMMY_d366ae36-d8bb-4fcf-92f3-4ca4ca181c5c, DUMMY_dd8672f5-457f-4033-b23b-8137b1e57bc0, DUMMY_e38b9aea-3dbe-489f-b701-ef9b48ea26f7, DUMMY_f4a786cd-4a84-4a33-9b88-584df5a08fae, DUMMY_f6309ca9-7dde-447e-84a6-70db4150150a, DUMMY_f8384a56-65bb-4ebe-baf6-bc64a9e03434: NodeCreationFailure: Instances failed to join the kubernetes cluster

  on eks-workers-nodes.tf line 2, in resource "aws_eks_node_group" "eks_cluster_workers_1":
   2: resource "aws_eks_node_group" "eks_cluster_workers_1" {
```

I’m unsure why there are so many DUMMY_* instances; this error is kind of confusing.

Hello @ditorojuan,

I’m looking into your issue.


Hello @ditorojuan,

The cluster update has failed due to a vCPU quota issue.
As every node is duplicated during the cluster update, your vCPU quota needs to cover twice your maximum node count, i.e. vCPU limit >= 2 x max_potential_nodes x vCPUs_per_node.

More information about how to calculate quotas in AWS: Calculate a vCPU limit increase request for an Amazon EC2 On-Demand Instance
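As a rough illustration of the calculation (a minimal sketch; the node count and instance type below are made up, not read from your cluster):

```python
# Back-of-the-envelope check of the vCPU quota needed for a cluster update,
# assuming every node is duplicated during the rolling upgrade.
# The values below are illustrative only.
max_potential_nodes = 10   # your cluster's maximum node count
vcpus_per_node = 4         # e.g. a t3.xlarge has 4 vCPUs

required_vcpu_quota = 2 * max_potential_nodes * vcpus_per_node
print(f"vCPU quota needed: at least {required_vcpu_quota}")  # 80 here
```

With numbers like these, a 32 vCPU limit is exceeded as soon as the duplicated node group starts launching, which matches the `VcpuLimitExceeded` error above.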

To unblock the situation, ask AWS to increase your quota through this link: http://aws.amazon.com/contact-us/ec2-request
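If you prefer to script the check and the request, here is a minimal boto3 sketch. It assumes your nodes fall under the "Running On-Demand Standard instances" quota (code L-1216C47A); other instance families (GPU, etc.) have their own quota codes.

```python
import boto3

# Sketch: inspect the current On-Demand vCPU quota and ask for an increase.
# Assumes the Standard (A, C, D, H, I, M, R, T, Z) family quota L-1216C47A.
client = boto3.client("service-quotas", region_name="us-east-1")

quota = client.get_service_quota(ServiceCode="ec2", QuotaCode="L-1216C47A")
print("Current vCPU limit:", quota["Quota"]["Value"])

# Pick a value >= 2 x max_potential_nodes x vCPUs per node.
client.request_service_quota_increase(
    ServiceCode="ec2",
    QuotaCode="L-1216C47A",
    DesiredValue=64.0,
)
```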

Please do not relaunch the cluster update before AWS has increased your quotas, otherwise it will fail again.

Cheers,
Melvin
