We had a very strange problem this week:
Our containers could no longer access the database (AWS RDS), but only with a specific combination of pod counts. In the end, we found that it only happened when the container was deployed to one specific node.
We have now drained the node and everything is working again. (Sadly, the autoscheduler killed it, so we can’t investigate properly.)
- Telnet, ping, and psql connections from the node to RDS (external URL) worked
- Connections from the container to RDS didn’t
It seems that either the k8s DNS was broken for this node or the node had dropped out of the VPC; we don’t know.
As we have never touched a node manually, we wonder how this can happen. Have you ever heard of something like this, and how can we make sure it doesn’t happen again?
Hi @Alexander_Braunreuth,
We have seen similar situations in the past. Fortunately, it’s rare and happening less and less (it varies from one cloud provider to another, but AWS is very good in this respect). What happens is that a node loses some connectivity with one or more zones and is unable to detect it or recover fast enough.
Everything is configured as multi-zone by default, but if the cloud provider encounters an issue in a zone, or the physical host where your VM (Kubernetes node) is located is having trouble, then some of your apps (pods) can no longer communicate. Most of the time, the cloud provider sees it and automatically replaces the troubled node, but in some rare situations it doesn’t detect the problem fast enough, and this is what you encountered.
Regarding the DNS issue, it confirms that this node was in bad shape. By default, we run 2 DNS pods (CoreDNS) behind a load balancer, and the 2 pods can’t run on the same node. This is why you didn’t notice it very quickly.
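If you want to verify the DNS pod placement yourself, here is a minimal Python sketch. It assumes you export the pod list as JSON first, for example with `kubectl get pods -n kube-system -l k8s-app=kube-dns -o json`. The namespace and label are the common CoreDNS defaults and are an assumption here, so adjust them to your cluster. The function names are only illustrative.

```python
from collections import defaultdict


def pods_by_node(pod_list):
    """Group pod names by the node they are scheduled on.

    `pod_list` is the parsed JSON of a Kubernetes PodList
    (the output of `kubectl get pods ... -o json`).
    """
    nodes = defaultdict(list)
    for item in pod_list.get("items", []):
        # spec.nodeName is absent while a pod is still unscheduled
        node = item.get("spec", {}).get("nodeName")
        nodes[node].append(item["metadata"]["name"])
    return dict(nodes)


def shared_nodes(pod_list):
    """Return only the nodes that host more than one of the listed pods."""
    return {
        node: names
        for node, names in pods_by_node(pod_list).items()
        if node is not None and len(names) > 1
    }
```

An empty result from `shared_nodes` means every DNS pod in the list sits on its own node, so a single broken node cannot take all of them down.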
Running multiple instances of your apps on multiple nodes in different zones is the best way to avoid an outage (which is what you’ve done).
For now, it’s hard to investigate further on our side or on the AWS side, as the node is no longer there. Feel free to contact us or AWS if it happens again so we can run an advanced diagnosis and prevent similar issues in the future.
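If it does happen again, it helps to know whether name resolution or the network path is failing before the node disappears. Below is a minimal sketch, assuming Python is available inside the affected container. The `probe` helper is hypothetical (not part of any tooling of ours) and uses only the standard library. Running it from both the pod and the node against the RDS endpoint and port shows which step breaks.

```python
import socket


def probe(host, port, timeout=3.0):
    """Classify connectivity to host:port.

    Returns "dns_failed" if the name cannot be resolved,
    "connect_failed" if it resolves but no address accepts a TCP
    connection, and "ok" otherwise.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failed"

    # Try every resolved address (IPv4 and IPv6) until one connects
    for family, socktype, proto, _canonname, addr in infos:
        try:
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(addr)
                return "ok"
        except OSError:
            continue
    return "connect_failed"
```

A `dns_failed` result from the pod while the node resolves the same name fine would point at cluster DNS. A `connect_failed` result would point at the network path between the pod and RDS instead.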