WebSocket, webhook, and multiple worker instances - issues

ISSUE

Hello Qovery community :slightly_smiling_face:,

We managed to create a solution that interacts with some external services, but we have run into a major issue.

Our solution is a FastAPI (Python back-end framework) application that was previously hosted on another platform, but as we grow we have decided to join your project and take advantage of your solution :fist:.

I will first describe how our solution works so you can better understand the issue we are facing (you can also look at the diagram pinned below):

  • A user connects to our FastAPI back-end through a WebSocket connection interface
  • We do some hidden stuff, and everything goes pretty well so far
  • Then an external service contacts our API via a webhook
  • This webhook is caught by our API, and using the ID of an element included in the payload, we link the response to a WebSocket connection instance (you can probably see the problem coming)
  • All the WebSocket connection instances are stored inside this FastAPI back-end, which manages whether the client is connected / has to be connected / can be disconnected
  • In local tests this service worked 100% of the time, but on Heroku's platform it kept throwing errors 50% of the time ("couldn't find the connection instance")

Our guess is that the solution is not deployed on a single worker but on many workers, and sometimes the webhook reaches the right instance and sometimes another one (perhaps due to load balancing or worker rules on their side) which does not hold the WebSocket connection instance (pretty logical so far).
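To make the failure mode concrete, here is a minimal, hypothetical sketch of the in-memory connection manager pattern described above. Each worker is a separate process with its own dictionary, so a webhook routed to a different worker than the one holding the WebSocket cannot find the connection (all names here are illustrative, not our actual code):

```python
class ConnectionManager:
    """Per-process registry mapping user IDs to WebSocket connections.

    In a real FastAPI app the values would be WebSocket objects;
    strings stand in for them here to keep the sketch runnable.
    """

    def __init__(self):
        self.active = {}  # user_id -> websocket

    def connect(self, user_id, ws):
        self.active[user_id] = ws

    def find(self, user_id):
        return self.active.get(user_id)


# Two workers = two independent processes = two independent dicts.
worker_a = ConnectionManager()
worker_b = ConnectionManager()

# The user opened their WebSocket on worker A...
worker_a.connect("user-42", "<websocket of user-42>")

# ...but the load balancer delivers the webhook to worker B.
print(worker_a.find("user-42"))  # the connection is found here
print(worker_b.find("user-42"))  # None -> "couldn't find the connection instance"
```

With a single worker the lookup always succeeds, which is why local tests passed; with N workers behind a load balancer, roughly (N-1)/N of the webhooks land on a process that has never seen the connection.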

As we are switching to your solution, we would like to get your help, because we are sure that you have the 'DevOpsish' knowledge to understand our problem and may have faced it before.

Do you know one or more ways (the more the merrier) to deal with worker instances so that the webhook call reaches the correct instance, finds its user, and we get a happy ending? :grinning:

Kind regards,

Pierre from the LittleBill Dev Team

@rophilogene


Hi @PierroD , thank you for your detailed diagram and issue. I will take some time to respond in the coming hours. :slight_smile:

Hi @PierroD,

It’s great to hear that you’ve decided to join the Qovery community! I’ll do my best to provide some insights to help you resolve your issue.

You’re correct in identifying that the problem might be due to multiple worker instances causing the webhook to sometimes go to the correct instance and sometimes not. Managing stateful connections like WebSockets can be challenging in a distributed system.

Here is an approach you can consider to resolve this issue:

I would change how your connection manager works: have each instance pull data from an intermediate layer (a broker), and pull only the data related to the connections it manages.

In my diagram:

  1. Your webhook service pushes the incoming data into a broker. The broker can be anything: your current Postgres database, Redis, RabbitMQ, or others. The idea is to push the data into a layer that partitions it by user ID. You might notice that the broker is external to your FastAPI project; that is the key point!
  2. Your connection managers pull messages from the broker by user ID (or another key). That way, a given user's data is always consumed by a specific instance. The instance pulling data for a user must be the one managing their connection. There are many ways to do that, such as tracking the WebSocket connections each instance manages.
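The two steps above can be sketched as follows. This is a toy, in-process stand-in for an external broker (Redis, RabbitMQ, etc.), just to show the shape of the pattern; all names are hypothetical, and in production the broker must of course live outside the FastAPI worker processes so every worker sees the same data:

```python
from collections import defaultdict, deque


class Broker:
    """Stand-in for an external broker, partitioned by user ID."""

    def __init__(self):
        self.queues = defaultdict(deque)  # user_id -> FIFO of payloads

    def push(self, user_id, payload):
        # Step 1: the webhook endpoint (on ANY worker) just pushes the
        # incoming data here; it never touches WebSockets directly.
        self.queues[user_id].append(payload)

    def pull(self, user_id):
        # Step 2: only the worker that manages this user's WebSocket
        # pulls their messages and forwards them over the connection.
        q = self.queues[user_id]
        return q.popleft() if q else None


broker = Broker()

# Webhook lands on worker B: it simply pushes, no connection lookup needed.
broker.push("user-42", {"status": "paid"})

# Worker A, which holds user-42's WebSocket, polls the broker for the
# users it manages and forwards the payload over the open connection.
message = broker.pull("user-42")
print(message)
```

With Redis you could implement the same idea with a list per user (`LPUSH` from the webhook, `BRPOP` from the owning worker) or with pub/sub channels keyed by user ID; either way, the routing problem moves out of the load balancer and into a shared layer.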

I hope this suggestion helps you address the issue with your app. If you have any questions or need further clarification, please don’t hesitate to ask.

Romaric.


Hi @rophilogene ,

Thank you for your answer (explanations and drawing), which makes it really clear and easy to understand. I think you have just provided us with what seems to be one of, if not the, best working solutions to our problem.

We will try to work on it in the coming days/weeks and post feedback on how we managed to solve the issue (if we do), so that the next person reading this post knows whether this is a fully operational solution.

Kind regards,

Pierre from the LittleBill Dev Team
