Hi @prki ,
AWS support validated the issue.
TL;DR We have to make an important migration to fix this issue, leading to some small downtimes on all deployed Kubernetes clusters. This won’t be a trivial fix, requiring several tests, so don’t expect it to be fixed in the coming days. However, it’s an important one, and already started to work on it. FIX ETA: July
Here is the original answer:
I further deep dive into the issue and I can confirmed that the issue is still exist even I tried to run with Amazon EKS 1.22 which bring my attention. Then I tried to look into CloudWatch Log Insights and I can see the same behavior as issue [1] described by executing the following query.
------------------------------------------------------------------------------------------------------------------------------
fields @timestamp, verb, requestURI, responseObject.spec.type, responseObject.status.loadBalancer.ingress.0.hostname, @message
| filter @message like /K8S-SERVICE-NAME/ or @message like /NLB-RESOURCE-NAME/
| filter userAgent not like /kubectl/ # filter out log lines that generated by "kubectl"
------------------------------------------------------------------------------------------------------------------------------
From the log lines, you will find that the Kubernetes removed the annotation for NLB during the transition of “LoadBalancer” → “ClusterIP”. So it hit the issue [1] and you may find the source code at link [2] for Kubernetes-1.20 and link [3] for Kubernetes-1.22.
------------------------------------------------------------------------------------------------------------------------------
// EnsureLoadBalancerDeleted implements LoadBalancer.EnsureLoadBalancerDeleted.
func (c *Cloud) EnsureLoadBalancerDeleted(ctx context.Context, clusterName string, service *v1.Service) error {
if isLBExternal(service.Annotations) {
return nil
}
loadBalancerName := c.GetLoadBalancerName(ctx, clusterName, service)
if isNLB(service.Annotations) { // <-------- the annotation have been removed, so it would not goes into this block.
...
}
...
------------------------------------------------------------------------------------------------------------------------------
However, Amazon EKS inherit the same behavior that upstream Kubernetes does and the in-tree source code is not we can changed. I recommend you to post a “+1” or write some comment to the issue [1] to see how upstream Kuberneetes developer response. But one thing you might be noticed is that both “in-tree” controller is located under the folder “legacy-cloud-providers/…”, I’m not sure if upstream Kubernetes will refine this function or not. Since it have been marked as “legacy” back in 2019[4][5], it have been years.
As alternative solution, I tried to switch my Load Balancer Controller from Kubernetes “in-tree” Load Balancer Controller to AWS Load Balancer Controller[6][7], doing the same test again and I can confirmed there’s no issue while switching service type between “ClusterIP” and “LoadBalancer”. Here’s the log line while I tried to switch from “LoadBalancer” back to “ClusterIP” while using “AWS Load Balancer Controller”.
------------------------------------------------------------------------------------------------------------------------------
{"level":"info","ts":XXX,"logger":"controllers.service","msg":"deleting loadBalancer","arn":"arn:aws:elasticloadbalancing:us-east-1:AWS_ACCOUNT_ID:loadbalancer/net/k8s-..."}
{"level":"info","ts":XXX,"logger":"controllers.service","msg":"deleted loadBalancer","arn":"arn:aws:elasticloadbalancing:us-east-1:AWS_ACCOUNT_ID:loadbalancer/net/k8s-..."}
{"level":"info","ts":XXX,"logger":"controllers.service","msg":"deleting targetGroupBinding","targetGroupBinding":{"namespace":"default","name":"k8s-..."}}
{"level":"info","ts":XXX,"logger":"controllers.service","msg":"deleted targetGroupBinding","targetGroupBinding":{"namespace":"default","name":"k8s-..."}}
{"level":"info","ts":XXX,"logger":"controllers.service","msg":"deleting targetGroup","arn":"arn:aws:elasticloadbalancing:..."}
{"level":"info","ts":XXX,"logger":"controllers.service","msg":"deleted targetGroup","arn":"arn:aws:elasticloadbalancing:..."}
{"level":"info","ts":XXX,"logger":"controllers.service","msg":"successfully deployed model","service":{"namespace":"default","name":"..."}}
------------------------------------------------------------------------------------------------------------------------------
In summary, what I suggest you to do next would be either report the issue directly to upstream Kubernetes [1] or try to switch your Load Balancer Controller from “in-tree” to “AWS Load Balancer Controller”[6] and follow the NLB setup guidance[7] to finish the setup. I believe after switching to “AWS Load Balancer Controller” should resolved the issue that you observed.
I hope provided info helps you understand the state of “in-tree” controller support and thanks again for reporting the issue.
- [1] https://github.com/kubernetes/kubernetes/issues/95042
- [2] kubernetes/aws.go at v1.20.15 · kubernetes/kubernetes · GitHub # <---- v1.20.15
- [3] kubernetes/aws.go at v1.22.2 · kubernetes/kubernetes · GitHub # <---- v1.22.2
- [4] Staging legacy AWS cloud provider by mcrute · Pull Request #77352 · kubernetes/kubernetes · GitHub
- [5] Removing In-Tree Cloud Providers · Issue #2395 · kubernetes/enhancements · GitHub
- [6] Installing the AWS Load Balancer Controller add-on - Amazon EKS
- [7] NLB - AWS Load Balancer Controller