Sudden connection timeout on our managed MongoDB instance

Issues information

Your issue

Although we have not gone live yet, our prod env has started giving the following timeout error when trying to connect to the managed MongoDB instance. Any ideas why this could be? It worked previously, and I don't think any changes have been made to prod. This issue would bring our entire application down if we were live in production. I am not sure whether it is a Qovery issue or whether we accidentally changed something without realising.

11 May, 18:39:55.639 | app-zf2f130e2-5b88677d96-fcc4h | 7b01c8 | └ MongoClient(host=['z9f5978fc-mongodb.zab943a2a.rustrocks.me:27017'], document_class=dict, tz_aware=False, connect=False, ssl=Tru...
11 May, 18:39:55.639 | app-zf2f130e2-5b88677d96-fcc4h | 7b01c8 | File "/usr/local/lib/python3.10/site-packages/pymongo/mongo_client.py", line 1921, in _get_server_session
11 May, 18:39:55.639 | app-zf2f130e2-5b88677d96-fcc4h | 7b01c8 | return self._topology.get_server_session()
11 May, 18:39:55.639 | app-zf2f130e2-5b88677d96-fcc4h | 7b01c8 | File "/usr/local/lib/python3.10/site-packages/pymongo/topology.py", line 520, in get_server_session
11 May, 18:39:55.639 | app-zf2f130e2-5b88677d96-fcc4h | 7b01c8 | session_timeout = self._check_session_support()
11 May, 18:39:55.639 | app-zf2f130e2-5b88677d96-fcc4h | 7b01c8 | └ <Topology <TopologyDescription id: 626037ada9b9df407795933a, topology_type: Single, servers: [<ServerDescription ('z9f5978fc-mon...
11 May, 18:39:55.640 | app-zf2f130e2-5b88677d96-fcc4h | 7b01c8 | File "/usr/local/lib/python3.10/site-packages/pymongo/topology.py", line 499, in _check_session_support
11 May, 18:39:55.640 | app-zf2f130e2-5b88677d96-fcc4h | 7b01c8 | self._select_servers_loop(
11 May, 18:39:55.640 | app-zf2f130e2-5b88677d96-fcc4h | 7b01c8 | File "/usr/local/lib/python3.10/site-packages/pymongo/topology.py", line 218, in _select_servers_loop
11 May, 18:39:55.640 | app-zf2f130e2-5b88677d96-fcc4h | 7b01c8 | raise ServerSelectionTimeoutError(
11 May, 18:39:55.640 | app-zf2f130e2-5b88677d96-fcc4h | 7b01c8 | pymongo.errors.ServerSelectionTimeoutError: z9f5978fc-mongodb.zab943a2a.rustrocks.me:27017: timed out, Timeout: 30s, Topology Description: <TopologyDescription id: 626037ada9b9df407795933a, topology_type: Single, servers: [<ServerDescription ('z9f5978fc-mongodb.zab943a2a.rustrocks.me', 27017) server_type: Unknown, rtt: None, error=NetworkTimeout('z9f5978fc-mongodb.zab943a2a.rustrocks.me:27017: timed out')>]>

Dockerfile content (if any)

FROM python:3.10.2-buster

COPY requirements.txt requirements.txt

RUN python -m pip install --upgrade pip

RUN pip install --no-cache-dir -r requirements.txt

COPY . .

EXPOSE 8000

CMD python main.py

Thanks,
Kevin

Even stranger, on some rare occasions the connection works fine and database operations succeed; however, most of the time a request involving a database read/write hits this timeout. This leads me to believe the issue is not on our end.

I have also tried redeploying all apps and the database in the environment and still face the same issue.
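
To rule out the app itself, I'm thinking of running a standalone connectivity probe from inside the prod environment. A minimal sketch, assuming the connection string is injected via an environment variable (the variable name is a placeholder; the fallback host is just the one from the traceback):

```python
# Minimal connectivity probe. MONGODB_URI is a placeholder for however the
# connection string is injected; a short serverSelectionTimeoutMS surfaces the
# failure quickly instead of waiting out the default 30 s.
import os

from pymongo import MongoClient
from pymongo.errors import ServerSelectionTimeoutError

uri = os.environ.get(
    "MONGODB_URI",
    "mongodb://z9f5978fc-mongodb.zab943a2a.rustrocks.me:27017",
)
client = MongoClient(uri, ssl=True, serverSelectionTimeoutMS=5000)

try:
    # "ping" forces server selection, i.e. the same code path that raises
    # ServerSelectionTimeoutError in the traceback above.
    client.admin.command("ping")
    print("connection OK")
except ServerSelectionTimeoutError as exc:
    print(f"server selection timed out: {exc}")
```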

Hello @KevinFinvault,
I'm taking a look at this and will come back to you as soon as I have more information.


The database setup on Qovery's side looks good; I am wondering if it might be a performance issue.

Could you take a look at the monitoring tools available in AWS? Maybe this would help: Monitoring with Performance Insights - Amazon DocumentDB

Also, I found out that PyMongo is not fork-safe and might create too many connections, causing the database to time out: The Art of Graceful Reloading — uWSGI 2.0 documentation
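
If your app does run under a pre-fork server such as uWSGI, the pattern that doc recommends looks roughly like this. This is only a sketch, since I don't know your actual entrypoint; the URI and pool size are placeholders:

```python
# Sketch of the uWSGI "create the client after fork" pattern: each worker gets
# its own MongoClient (and connection pool) instead of inheriting sockets from
# the master process. URI and maxPoolSize are placeholders.
from pymongo import MongoClient
from uwsgidecorators import postfork  # only importable when running under uWSGI

client = None  # set per worker below


@postfork
def init_mongo():
    global client
    client = MongoClient(
        "mongodb://z9f5978fc-mongodb.zab943a2a.rustrocks.me:27017",
        maxPoolSize=20,  # keep workers * pool size under the instance's connection limit
    )
```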

Hi @bilel,

Thanks for taking the time to look into this and to respond.

We are actually using Motor, which uses PyMongo underneath but differs from plain PyMongo in a few ways, notably in how connections are created and in that multithreading and forking are not supported.
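
For context, our setup is roughly the following (a simplified sketch rather than our exact code; the env var name and collection names are illustrative): one client created at startup and reused for all requests.

```python
# Simplified sketch of our Motor setup (illustrative, not the exact code):
# a single AsyncIOMotorClient created at application startup and shared.
import os

from motor.motor_asyncio import AsyncIOMotorClient

client = AsyncIOMotorClient(os.environ["MONGODB_URI"])  # env var name illustrative
db = client["app"]


async def get_user(user_id: str):
    # All queries go through the one shared client.
    return await db.users.find_one({"_id": user_id})
```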

Secondly, we have been using this setup for ~5 months now and have never seen this issue arise once. Our dev env is still running without issue. If this were a problem with Motor/PyMongo, I think we would have seen it intermittently from the start, rather than it suddenly happening consistently after several months.

I took a look at the monitoring dashboard on AWS the other day, but didn’t notice anything strange. Let me take a deeper look now.

@bilel I can’t see anything strange in the monitoring or Performance Insights pages. Anything in particular I should look for?

@bilel Hmm, interestingly, I noticed that each time I make a query to the database, the CPU Utilization Wait metric either drops or spikes (I'm not sure which yet). Any idea what this means?

Your dev environment works with a MongoDB container instance, whereas your prod env runs a managed MongoDB instance.

@bilel it's possible that the managed MongoDB instance is too small and can't handle the same number of connections as the dev one. AWS limits the number of connections depending on the instance type of the managed database.

@KevinFinvault can you show us the network connections graph for your managed MongoDB instance (Service: AWS DocumentDB)? Thank you.
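
If it's easier than screenshotting the console, the same data can be pulled from CloudWatch. A sketch, with the region and cluster identifier as placeholders:

```python
# Sketch: fetch the DatabaseConnections metric for the DocumentDB cluster from
# CloudWatch over the last 4 weeks. Region and DBClusterIdentifier are placeholders.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-west-3")  # placeholder region

resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/DocDB",
    MetricName="DatabaseConnections",
    Dimensions=[{"Name": "DBClusterIdentifier", "Value": "z9f5978fc-mongodb"}],  # placeholder
    StartTime=datetime.now(timezone.utc) - timedelta(days=28),
    EndTime=datetime.now(timezone.utc),
    Period=3600,
    Statistics=["Maximum"],
)

for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Maximum"])
```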

Hi @rophilogene here is the connections graph for the last 4 weeks. Seems to have been mostly at 0 for the last week or so, even when I attempt to perform a database operation, but the spike to 100 a while back seems particularly weird given that we aren’t really using our production env actively yet.

What you mentioned about the managed instance not being able to handle as many connections may be the issue, since I believe Motor creates a new connection for each database operation.
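
If the connection limit does turn out to be the culprit, one thing we could try on our side is capping the client's pool explicitly. A sketch with illustrative values:

```python
# Sketch: cap the Motor client's connection pool so this process can never open
# more connections than the managed instance allows. PyMongo/Motor default
# maxPoolSize is 100 per client; the values below are illustrative.
from motor.motor_asyncio import AsyncIOMotorClient

client = AsyncIOMotorClient(
    "mongodb://z9f5978fc-mongodb.zab943a2a.rustrocks.me:27017",
    maxPoolSize=20,            # hard cap on concurrent connections from this process
    minPoolSize=0,
    maxIdleTimeMS=60_000,      # release idle connections after a minute
    serverSelectionTimeoutMS=10_000,
)
```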

@benjaminch is going to take a look, but the number of connections is a common issue with PostgreSQL databases as well; that's why PgBouncer is so successful.

Note: to make PostgreSQL / RDS able to handle more connections, you just need to use a bigger instance than the one currently used (if that is indeed the problem, of course).

Hey @KevinFinvault,

Do you have any idea of the number of connections to expect once production is live?
As @rophilogene mentioned, increasing managed Mongo size might be an option.
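
A rough back-of-the-envelope way to estimate it, assuming one client per app replica (the numbers are placeholders):

```python
# Back-of-the-envelope estimate of peak connections (placeholder numbers):
replicas = 3           # app instances running in prod
max_pool_size = 100    # PyMongo/Motor default maxPoolSize per client
peak_connections = replicas * max_pool_size
print(peak_connections)  # 300 -> compare against the DocumentDB instance's connection limit
```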

Found this doc on Mongo’s side, maybe we can find some clues: Maximizing MongoDB Performance on AWS | MongoDB Blog

Let me know how I can help.

Cheers