Hi,
This ticket is the continuation of this one: Random 502 Bad Gateway
As I was not able to work on the problem for 15 days, the ticket was closed.
@bchastanier I worked on the problem for a few days, but sadly I don't have the solution yet. I did more testing and more analysis, and learned as much as possible about networking in k8s (not the easiest part…).
I used locust to do load testing: the more requests there are, the more errors there are.
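For reference, the load test was nothing fancy; roughly this kind of locustfile (just a sketch, the endpoints and payload below are placeholders, not our real routes):

```python
# locustfile.py - rough sketch of the load test.
# The endpoints and payload are placeholders, not our real routes.
from locust import HttpUser, between, task


class ApiUser(HttpUser):
    # each simulated user waits 1-3 seconds between requests
    wait_time = between(1, 3)

    @task(3)
    def read_endpoint(self):
        # plain GET; 502s and connection resets show up as failures in Locust
        self.client.get("/api/health")

    @task(1)
    def large_payload(self):
        # POST with a largish body, since our real traffic has large payloads
        self.client.post("/api/items", json={"data": "x" * 100_000})
```

Run with something like `locust -f locustfile.py --host https://<ingress-host>`; the failure count grows with the number of simulated users.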
We also tried again with 3 replicas instead of a single one; the number of errors decreased, but the error still happened.
I was not able to find what could cause the connection reset in our application; there is no error or log entry matching the problem.
Since the last time we discussed the problem, this error has even happened in another application, also based on the NestJS framework. So I did a lot of research into a possible problem with this framework, but I couldn't find anything interesting.
In NewRelic I found a recurring error in a pod named `externaldns-external-dns-...`, happening every minute, but I found no correlation between the timestamps of this error and the timestamps of the `recv() failed` errors. I'm wondering whether it could be linked to our problem; maybe it is the reason why the service is sometimes unreachable?
Following multiple resources, I decided to dig deeper using tcpdump and Wireshark (I'm just getting started with these tools; I had never worked with them before).
I did a packet capture from the application pod, waiting for the `recv() failed` error to appear in NewRelic. Then I opened Wireshark to analyse the .pcap file, and I found a perfect match between the timestamp of a `recv() failed (104: Connection reset by peer)` error and the timestamp of a `[RST, ACK]` packet sent by the application pod to the nginx-ingress pod, probably not a surprise for you.
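To make this comparison a bit more systematic than scrolling through Wireshark, the same check could be scripted with scapy, listing every RST packet in the capture so the timestamps can be diffed against the `recv() failed` errors (just a rough sketch; `capture.pcap` stands for whatever file tcpdump produced):

```python
# rst_scan.py - list RST packets from a capture so their timestamps can be
# compared with the recv() failed errors in NewRelic.
# "capture.pcap" is a placeholder for the file produced by tcpdump.
from datetime import datetime, timezone

from scapy.all import IP, TCP, rdpcap

packets = rdpcap("capture.pcap")

for pkt in packets:
    # 0x04 is the RST bit in the TCP flags field
    if pkt.haslayer(IP) and pkt.haslayer(TCP) and pkt[TCP].flags & 0x04:
        ts = datetime.fromtimestamp(float(pkt.time), tz=timezone.utc)
        print(
            f"{ts.isoformat()}  "
            f"{pkt[IP].src}:{pkt[TCP].sport} -> {pkt[IP].dst}:{pkt[TCP].dport}  "
            f"flags={pkt[TCP].flags}"
        )
```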
I don't know how to do a better traffic analysis; maybe you have some tips to do it more efficiently and more precisely? Or is it useless to keep looking in this direction?
I tried some tools, but it's starting to be really too low-level for me. I'm so far out of my comfort zone…
Through all this research, I found some `connection reset by peer` issues with Kubernetes. I'm not competent enough to know whether or not there could be a link with our problem (see the conntrack-counter sketch after this list):
- "Connection reset by peer" due to invalid conntrack packets · Issue #117924 · kubernetes/kubernetes · GitHub
- Connection resets with large payload · Issue #119887 · kubernetes/kubernetes · GitHub (we do indeed have some large payloads and long requests for now, but not a lot of RPS)
- "Connection reset by peer" due to invalid conntrack packets · Issue #74839 · kubernetes/kubernetes · GitHub
- kube-proxy: Drop packets in INVALID state drops packets from outside the pod range · Issue #94861 · kubernetes/kubernetes · GitHub
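In case it is useful, here is a rough sketch of how the conntrack counters mentioned in those issues could be checked (assuming it runs directly on a node, or in a privileged pod that can see the host's /proc; the same numbers are also reported by `conntrack -S` when that tool is installed):

```python
# conntrack_stats.py - print the "invalid" and "insert_failed" conntrack
# counters that the linked Kubernetes issues point at.
# Assumes it runs on the node (or in a privileged pod seeing the host /proc);
# /proc/net/stat/nf_conntrack has a header line, then one line of hex
# counters per CPU.

STAT_FILE = "/proc/net/stat/nf_conntrack"

with open(STAT_FILE) as f:
    lines = f.read().splitlines()

fields = lines[0].split()          # column names from the header line
totals = {name: 0 for name in fields}

for line in lines[1:]:             # one row of hexadecimal counters per CPU
    for name, value in zip(fields, line.split()):
        totals[name] += int(value, 16)

for name in ("invalid", "insert_failed", "drop"):
    if name in totals:
        print(f"{name}: {totals[name]}")
```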
Can you please help me make some progress on this subject?
Thanks in advance.