Nginx-ingress random error: recv() failed (104: Connection reset by peer)

Hi,

This ticket is the continuation of this one: Random 502 Bad Gateway
As I was not able to work on the problem for 15 days, the ticket was closed.

@bchastanier I worked on the problem for a few days, but sadly I don’t have the solution yet. I did more testing and analysis and learned as much as possible about networking in k8s (not the easiest part…).
I used locust to do load testing: the more requests I send, the more errors appear (see the rough sketch below).
We also tried again with 3 replicas instead of a single one; the number of errors decreased, but the error still happened.
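For reference, the kind of load I generated looks roughly like this when sketched in Node/TypeScript instead of locust (the target URL and the numbers are just placeholders, not my real scenario):

```typescript
// load-test.ts — rough sketch, not my real locust scenario.
// The target URL and the numbers are placeholders.
import * as http from 'node:http';

const TARGET = 'http://my-ingress-host.example.com/health'; // hypothetical endpoint
const TOTAL = 1000;      // total requests to send
const CONCURRENCY = 50;  // requests in flight at once

let done = 0;
let errors = 0;

function fire(): void {
  http
    .get(TARGET, (res) => {
      res.resume(); // drain the response body
      finish();
    })
    .on('error', (err) => {
      errors++; // ECONNRESET would show up here
      console.error(err.message);
      finish();
    });
}

function finish(): void {
  done++;
  if (done + CONCURRENCY <= TOTAL) fire(); // keep the pipeline full
  if (done === TOTAL) console.log(`${TOTAL} requests sent, ${errors} errors`);
}

for (let i = 0; i < CONCURRENCY; i++) fire();
```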

I was not able to find what could cause the connection reset in our application; there is no error or log matching the problem.
Since the last time we discussed the problem, this error has even happened in another application, also based on the NestJS framework. So I did a lot of research around a possible problem with this framework, but I couldn’t find anything interesting.

In NewRelic I found a permanent error in a pod named externaldns-external-dns-..., happening every minute. But I found no correlation between these error timestamps and the recv() failed error timestamps. I’m wondering if it can be linked to our problem; maybe it could be the reason why the service is sometimes unreachable?

Following multiple resources, I decided to dig deeper using tcpdump and wireshark (I’m just starting with these tools; I had never worked with them before).
I did a packet capture from the application pod, waiting for the recv() failed to appear in NewRelic.
Then I opened wireshark to analyse the .pcap file and I found a perfect match between a recv() failed (104: Connection reset by peer) timestamp and the timestamp of a [RST, ACK] packet sent by the application pod to the nginx-ingress pod, probably not a surprise for you.
I don’t know how to do a better traffic analysis; maybe you have some tips to do it more efficiently and more precisely? Or is it useless to keep looking in this direction?
I tried some tools but it’s starting to get really too low level for me. I’m far outside my comfort zone… :face_with_spiral_eyes:

Through all this research, I found some connection reset by peer issues reported with Kubernetes itself. I’m not competent enough to know whether they could be linked to our problem:

Can you please help me make some progress on this subject?
Thanks in advance.

Hey @Mathieu_Haage,

Thanks for your investigation.

Smells like your app is sending this [RST, ACK], so IMO there is something going on in your app. Are you catching all exceptions? Is there any connection reset, crash, or anything else on that front?
Are you able to add extensive logging / debugging to this app?
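If it helps, something like this on the underlying HTTP server could surface what is resetting the connections (just a rough sketch; the module name and port are assumptions about your setup):

```typescript
// Rough sketch: log socket-level events on the Node http.Server behind NestJS.
// AppModule and the port are assumptions about your setup.
import { NestFactory } from '@nestjs/core';
import { Server } from 'http';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);
  const server: Server = app.getHttpServer();

  server.on('connection', (socket) => {
    const peer = `${socket.remoteAddress}:${socket.remotePort}`;
    socket.on('close', (hadError) => console.log(`socket ${peer} closed, hadError=${hadError}`));
    socket.on('error', (err) => console.error(`socket ${peer} error:`, err.message));
  });
  server.on('clientError', (err, socket) => {
    console.error('clientError:', err.message);
    socket.destroy();
  });

  await app.listen(3000);
}
bootstrap();
```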

Cheers

Hi @bchastanier,

Problem solved! :tada:

You were right from the beginning: it was a timeout problem, but I wasn’t able to find it immediately. The error was caused by two timeouts that didn’t work well together: the Node.js default keep-alive timeout (5s) and the proxy timeout (60s).
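For the record, the 5s default is easy to check on a bare Node server (a tiny standalone sketch, unrelated to our actual code):

```typescript
// Quick check of Node's default keep-alive timeout on a bare http.Server.
import * as http from 'node:http';

const server = http.createServer((_req, res) => res.end('ok'));
console.log(server.keepAliveTimeout); // 5000 (ms): the 5s default that closes idle sockets
```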

When you pointed me in this direction, I searched for timeout problems in Node and Nest but didn’t find anything related to my problem.
The deep analysis and a better understanding of the problem and the system probably helped me find the resources that actually matched it.

I finally found someone who had exactly the same problem, but on AWS with Node + AWS load balancer. The article was written in 2019 but the solution is still relevant.
After trying his best to monitor the Node application and not finding a clue, he did a packet analysis and found the RST packets from Node, as I did.
Then he figured out that the Node keepAliveTimeout was responsible for the connection reset:

After investigating Express, it becomes apparent that Express isn’t really handling much on the socket-layer, so it must be the underlying native Node http.Server that Express uses. And sure enough, in the docs (new with NodeJS 8.0+), is a ‘keepAliveTimeout’, which will forcefully destroy a socket after having a TCP connection sit idle for a default 5 seconds.

Thanks to this article and a few others, I was able to configure the NestJS application correctly by adding 3 lines of code in the bootstrap function: based on the proxy's 60s timeout, I set keepAliveTimeout to 61s and headersTimeout to 62s, as advised in one of the articles listed below.
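Concretely, the bootstrap now looks roughly like this (simplified; the module name and port are placeholders):

```typescript
// main.ts — simplified version of the fix; module name and port are placeholders.
import { NestFactory } from '@nestjs/core';
import { AppModule } from './app.module';

async function bootstrap() {
  const app = await NestFactory.create(AppModule);

  // The underlying Node http.Server must keep idle connections open longer
  // than the proxy (60s), otherwise Node closes the socket first and the
  // proxy's next reuse of it triggers the RST / "connection reset by peer".
  const server = app.getHttpServer();
  server.keepAliveTimeout = 61_000; // ms, just above the proxy's 60s
  server.headersTimeout = 62_000;   // ms, must stay above keepAliveTimeout

  await app.listen(3000);
}
bootstrap();
```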

If I understand correctly, another solution could have been to reduce the proxy timeout below the default Node timeout, e.g. to 4s. But after checking metrics, these settings suit our needs, and the recv() failed error has disappeared since we applied them. What a relief!

Links to the articles that helped me solve the problem:

I hope this solution can help other people who encounter a related issue, so they can avoid searching hopelessly in as many wrong directions as I did.

The good thing: I’ve learnt a lot, especially about networking in k8s! So much time spent, but no time wasted. :sweat_smile:

Thank you again for your help !

Whaaoo! Great that you solved it :slight_smile:

Thanks a lot for sharing the solution, it will definitely help others!

This topic was automatically closed 7 days after the last reply. New replies are no longer allowed.