Hello, as part of our SOC2 compliance DR tests, I had to delete all worker nodes from our Kubernetes cluster, but when I did so it seems that the control plane nodes were also deleted and are not coming back. Is it possible that you're scheduling pods on the same nodes you're using as control plane nodes? Also, can you please fix our Playground cluster, because I fear it's now broken for good? I'm including the set of steps I took, for your reference. Let me know if you need any further explanation.
- I just tried changing the max number of nodes to force a redeployment, to see if it would fix the problem. The cluster says it's "Updating" and has the spinner going.
- Looking at the logs, it seems the Helm deployment is failing while trying to UPGRADE the vertical-pod-autoscaler, among other things.
- I put max nodes back to 10 to force another deployment… but I don't have high hopes given what I see in the logs.
Hi, your questions are completely legitimate. If you followed the Kubernetes documentation or a guide for some classic Kubernetes operations and did not get the expected result, it's because of some EKS specifics you may not be familiar with. Let me give you more input to quickly fix your issue and help you move on:
- "I had to delete all worker nodes": this is not how things are done on EKS. Maybe you worked with unmanaged Kubernetes in the past, but on EKS an EC2 instance is linked to the cluster, and the correlation between a Kubernetes node object and its EC2 instance is not that strong. What happened here is that you deleted the node FROM Kubernetes, but the EC2 instance is still alive, while I guess you expected it to be terminated as well. So from an AWS point of view, everything is "normal"; you just can't do it that way. To recover from this situation, go to the AWS console and delete your cluster's node group. That will terminate all the EC2 instances you expected to be removed from the Kubernetes cluster. Then re-run a Qovery cluster deployment: new nodes will be deployed and your services will come back. I advise you to look at the EKS documentation and contact AWS support for more info; you'll get more internal feedback from them. Qovery can't do much more to help you here, as this is how EKS works, and your SOC2 validation tests target the AWS stack, not the Qovery one.
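To make the distinction concrete, here is a rough sketch of the two operations. The cluster name `playground`, node group name `workers`, and node name are all placeholders for illustration; check the real names in your AWS console or with `aws eks list-nodegroups` before running anything.

```shell
# Deleting a node FROM Kubernetes only removes the Node API object;
# the backing EC2 instance keeps running:
kubectl delete node ip-10-0-1-23.ec2.internal

# To actually terminate the instances, delete the managed node group
# itself (names below are placeholders):
aws eks list-nodegroups --cluster-name playground
aws eks delete-nodegroup --cluster-name playground --nodegroup-name workers
aws eks wait nodegroup-deleted --cluster-name playground --nodegroup-name workers

# Then trigger a Qovery cluster deployment so a fresh node group
# and new worker nodes are created.
```

Note these commands require AWS credentials with EKS permissions and will delete real infrastructure; treat them as a sketch of the recovery path, not a script to paste blindly.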
- "but when I did so it seems that the control plane nodes were also deleted and are not coming back": from an AWS point of view, the control plane nodes are the Kubernetes master nodes, and on EKS they are managed by AWS. Worker nodes != master nodes/control plane. If you had actually lost the control plane, you would not have been able to connect to the Kubernetes API at all.
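As a quick sanity check (assuming your kubeconfig still points at this cluster), you can verify that the control plane is alive even with zero worker nodes:

```shell
# On EKS, `kubectl get nodes` lists only worker nodes; the AWS-managed
# control plane never appears here. With all workers gone, this list
# is simply empty:
kubectl get nodes

# The API server itself is still reachable, which shows the control
# plane was never deleted:
kubectl get --raw='/readyz'
```

If the second command answers, the control plane is fine and only the worker fleet needs to be rebuilt.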
Hope this is clear. Don't expect EKS to behave like a classic/vanilla Kubernetes cluster; the managed Kubernetes services of the various cloud providers are not equivalent. Learning those specifics helps you use Kubernetes correctly, but I know it's not obvious, unfortunately. Don't hesitate to contact AWS support for further help with your SOC2 preparation; they will be able to give you more insight and context.