[Post-mortem] HTTP Requests failing with 403

ce_gagnaire · August 30, 2024, 8:20am

Incident Postmortem Report

Incident Title: HTTP Requests failing with 403
Date of Incident: 2024/08/28 between 2 pm and 5 pm (UTC)
Duration: 3 hours
Impact: Some customers’ web services were unreachable, HTTP requests failed with 403
Incident Severity: Partial outage

1. Summary

Following a global cluster update triggered by our team, the Nginx Ingress controller configuration was wrongly modified on a subset of clusters.

Our team identified the root cause of the problem and deployed a fix on all clusters.

The websites hosted on impacted clusters were unreachable for the duration of the outage.

We apologize for any inconvenience caused.

2. Timeline

Time (UTC)	Event Description
1:45 pm	Incident started
2:00 pm	Incident detected
2:05 pm	Incident response initiated
2:40 pm	Root cause identified
3:20 pm	Manual fix on a test cluster
3:35 pm	Implementing fix
3:40 pm	Mitigation steps started (rollout)
4:40 pm	Services restored
5:00 pm	Post-mortem review scheduled

3. Root Cause Analysis

On March 27th, we shipped a new feature containing a dormant bug. This feature added a new Cluster Advanced Setting: nginx.controller.enable_client_ip
On August 28th, we updated our JSON library, Jackson.
To correctly handle the deserialization of Kotlin data classes with Jackson, we had created two paths that set default values in Qovery’s control plane. We had mistakenly set different default values in these two paths. The two Jackson versions do not use the same path: the old version used the first path, with the correct default value, while the new version uses the second path, with the wrong default value.
As a result, after Jackson’s update, the wrong default value was applied when we upgraded all the clusters.
This new default value activated “enable-real-ip” in Nginx’s configuration. The value changed only on clusters created before March 27th where no Cluster Advanced Settings had been modified using Qovery’s console.
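
The two default-value paths described above can be illustrated with a minimal Python sketch. This is not Qovery's code (the control plane uses Kotlin and Jackson); the names AdvancedSettings, FALLBACK_DEFAULTS, from_constructor_defaults, from_fallback_defaults and mismatched_defaults are hypothetical. It only shows how two paths can fill the same missing field with different values, and a check that would flag the mismatch.

```python
from dataclasses import dataclass, fields


@dataclass
class AdvancedSettings:
    # Path 1: default declared on the class itself (the intended value).
    enable_client_ip: bool = False


# Path 2: a separate table of defaults used by another deserialization path.
# Hypothetical: a wrong value here stays unnoticed until this path is used.
FALLBACK_DEFAULTS = {"enable_client_ip": True}


def from_constructor_defaults(payload: dict) -> AdvancedSettings:
    """Missing keys fall back to the dataclass defaults (path 1)."""
    return AdvancedSettings(**payload)


def from_fallback_defaults(payload: dict) -> AdvancedSettings:
    """Missing keys fall back to FALLBACK_DEFAULTS (path 2)."""
    return AdvancedSettings(**{**FALLBACK_DEFAULTS, **payload})


def mismatched_defaults() -> list:
    """Return the field names whose default differs between the two paths."""
    a = from_constructor_defaults({})
    b = from_fallback_defaults({})
    return [
        f.name
        for f in fields(AdvancedSettings)
        if getattr(a, f.name) != getattr(b, f.name)
    ]
```

A check such as `assert mismatched_defaults() == []` in a test suite would fail as long as the two paths disagree, whichever path a given library version ends up using.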

Problem illustration with a diff:

4. Impact

User Impact:

Activating “enable-real-ip” had the following impacts:

For users with IP whitelisting, the client IP seen by Nginx no longer matched the whitelisted one, resulting in a 403 error for all requests.
The default value for Nginx’s whitelist is IPv4 only (“0.0.0.0/0”) and does not allow IPv6. Customers using a CDN or a reverse proxy (like Cloudflare or CloudFront) in front of their Qovery cluster received the reverse proxy’s real IP. If this IP was an IPv6 address, the request was rejected with a 403 error.
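
The IPv4-only behaviour can be reproduced with Python's standard ipaddress module. This sketch illustrates CIDR matching in general, not Nginx's own implementation; the function name is_allowed is hypothetical, and the addresses in the usage note come from the documentation ranges.

```python
import ipaddress


def is_allowed(client_ip: str, whitelist: list) -> bool:
    """Return True if client_ip falls inside at least one CIDR range."""
    addr = ipaddress.ip_address(client_ip)
    # An IPv6 address is never contained in an IPv4 network, and vice versa.
    return any(addr in ipaddress.ip_network(cidr) for cidr in whitelist)
```

With a whitelist of only "0.0.0.0/0", `is_allowed("203.0.113.7", ["0.0.0.0/0"])` is true while `is_allowed("2001:db8::1", ["0.0.0.0/0"])` is false. Adding "::/0" to the list accepts the IPv6 address as well.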

Business Impact:

Impacted websites were unreachable from 1:45 pm UTC until 4:40 pm UTC.

5. Incident Detection

Detection Method:

A customer created a thread on the forum reporting that many of their requests were receiving 403 errors. After this first thread, we received messages from other customers, so we alerted our team.

Detection Gaps:

The 403 errors occurred on the customers’ clusters, so they did not trigger our monitoring tools.

6. Incident Response

Response Actions:

Our team was alerted following customer feedback
Investigation to understand where the problem was
We decided not to roll back because the change responsible for the bug was too old
We offered customers who used whitelists a workaround on a forum thread: manually and temporarily disable their whitelist.
Test on an impacted cluster to validate our fix
Communication on Qovery’s status page and the forum
We checked every other cluster’s advanced settings to confirm we had no other bug.

Communication:

Warn our team
Status page and forum communication
Post-mortem release

7. Resolution and Recovery

Resolution Steps:

Correct the wrong default value in our code
Redeploy the customers’ clusters

Recovery Validation:

Search customers’ Nginx ingress logs for 403 errors
Contact impacted users to confirm everything was back to normal
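
A log search of this kind could be sketched as follows. This is an illustration rather than the tooling we used: it assumes an access-log layout where the quoted request line ("METHOD /path HTTP/x.y") is immediately followed by the status code, and the names STATUS_AFTER_REQUEST and count_statuses are hypothetical.

```python
import re
from collections import Counter

# Assumed layout: ... "METHOD /path HTTP/x.y" STATUS ...
# Adjust the pattern if the cluster uses a custom log format.
STATUS_AFTER_REQUEST = re.compile(r'"[A-Z]+ \S+ HTTP/[\d.]+" (\d{3}) ')


def count_statuses(lines) -> Counter:
    """Count HTTP status codes found in an iterable of access-log lines."""
    counts = Counter()
    for line in lines:
        match = STATUS_AFTER_REQUEST.search(line)
        if match:  # lines that do not look like access logs are skipped
            counts[match.group(1)] += 1
    return counts
```

Calling `count_statuses(log_lines)["403"]` before and after the redeployment gives a quick signal of whether the rejections have stopped.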

8. Lessons Learned

What Went Well:

We managed to quickly find where the problem was
We deployed the fix on all customers’ clusters promptly

How to avoid this in the future:

Improve our communication during the incident
Stop deploying all servers at once, even if the modification is a low-impact one
Test third-party library upgrades more thoroughly to prevent side effects
During this incident, we needed at least 20 minutes to deploy the fix. We should work on reducing the deployment duration.
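
One way to avoid deploying everywhere at once is a wave-based rollout that stops as soon as a wave looks unhealthy. The sketch below is only an illustration of that idea, not our deployment engine; plan_waves, rollout, and the deploy and is_healthy callbacks are hypothetical names.

```python
def plan_waves(clusters: list, wave_sizes: list) -> list:
    """Split clusters into waves of the given sizes; leftovers form a last wave."""
    waves, start = [], 0
    for size in wave_sizes:
        if start >= len(clusters):
            break
        waves.append(clusters[start:start + size])
        start += size
    if start < len(clusters):
        waves.append(clusters[start:])
    return waves


def rollout(clusters: list, wave_sizes: list, deploy, is_healthy) -> tuple:
    """Deploy wave by wave; return (deployed clusters, success flag).

    The rollout stops after the first wave containing an unhealthy cluster,
    so later waves are never touched by a bad change.
    """
    deployed = []
    for wave in plan_waves(clusters, wave_sizes):
        for cluster in wave:
            deploy(cluster)
            deployed.append(cluster)
        if not all(is_healthy(cluster) for cluster in wave):
            return deployed, False
    return deployed, True
```

For example, with six clusters and wave_sizes of [1, 2], the plan is one cluster, then two, then the remaining three. A failed health check in the second wave leaves the last three clusters untouched.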

9. Action Items

Action Item	Owner	Priority	Due Date	Status
Improve our communication during an incident	Alessandro Carrano (Lead PM)	High	2024-09-30	Open
Improve our deployment strategy	Pierre Mavro (CTO)	High	2024-09-30	In Progress
Better test third-party library integration	Benjamin Chastanier (SRE Developer)	High	2024-10-30	Open
