Incident Postmortem Report
Incident Title: HTTP Requests failing with 403
Date of Incident: 2024/08/28 between 2 pm and 5 pm (UTC)
Duration: 3 hours
Impact: Some customers’ web services were unreachable; HTTP requests failed with a 403 error
Incident Severity: Partial outage
1. Summary
Following a global cluster update triggered by our team, the configuration of the Nginx Ingress controller was wrongly modified on a subset of clusters.
Our team identified the root cause of the problem and deployed a fix on all clusters.
The websites hosted on impacted clusters were unreachable for the duration of the outage.
We apologize for any inconvenience caused.
2. Timeline
Time (UTC) | Event Description |
---|---|
1:45 pm | Incident started |
2:00 pm | Incident detected |
2:05 pm | Incident response initiated |
2:40 pm | Root cause identified |
3:20 pm | Manual fix on a test cluster |
3:35 pm | Implementing fix |
3:40 pm | Mitigation steps started (rollout) |
4:40 pm | Services restored |
5:00 pm | Post-mortem review scheduled |
3. Root Cause Analysis
- On March 27th, we shipped a new feature containing a latent bug. This feature added a new Cluster Advanced Setting: nginx.controller.enable_client_ip
- On August 28th, we updated our JSON library, Jackson.
- To deserialize Kotlin data classes correctly with Jackson, we had created two code paths that set default values in Qovery’s control plane, and we had mistakenly given them different default values. Different Jackson versions do not use the same path: the old version used the first path, which held the correct default value, while the new version used the second path, which held the wrong one.
- As a result, when we upgraded all clusters after the Jackson update, the wrong default value was applied.
- This new default value activated “enable-real-ip” in Nginx’s configuration. The value changed only on clusters that were created before March 27th and whose Cluster Advanced Settings had never been modified through Qovery’s console.
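The mechanism described above can be illustrated with a minimal Python sketch. It is not Qovery's Kotlin code and does not use Jackson: the class `AdvancedSettings`, the two loader functions, and the `FALLBACK_DEFAULTS` table are hypothetical stand-ins, chosen only to show how two default-value paths can drift apart without anyone noticing until the active path changes.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a cluster settings class; not Qovery's actual model.
@dataclass
class AdvancedSettings:
    # Path 1: the default declared on the constructor parameter.
    enable_client_ip: bool = False


# Path 2: a separate table of defaults applied to missing fields.
# The bug pattern: this value disagrees with the constructor default.
FALLBACK_DEFAULTS = {"enable_client_ip": True}


def load_via_constructor(payload: dict) -> AdvancedSettings:
    """Mimics the old behaviour: missing fields take the constructor default."""
    return AdvancedSettings(**payload)


def load_via_fallback(payload: dict) -> AdvancedSettings:
    """Mimics the new behaviour: missing fields are filled from the table."""
    merged = {**FALLBACK_DEFAULTS, **payload}
    return AdvancedSettings(**merged)


def defaults_agree() -> bool:
    """Guard that fails as soon as the two paths produce different defaults."""
    return load_via_constructor({}) == load_via_fallback({})


stored = {}  # a cluster whose advanced settings were never customised
print(load_via_constructor(stored).enable_client_ip)  # False
print(load_via_fallback(stored).enable_client_ip)     # True
print(defaults_agree())                               # False
```

A cluster with an explicitly stored value is unaffected by either path, which matches the observation that only clusters with untouched settings changed. A check like `defaults_agree()` run in CI is one way such a divergence could be caught before a library upgrade switches paths.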
Problem illustration with a diff:
4. Impact
User Impact:
Activating “enable-real-ip” had the following effects:
- For users with IP whitelisting, the client IP checked against the whitelist was no longer the one they had whitelisted, resulting in a 403 error for all requests.
- The default value for Nginx’s whitelist is IPv4 only (“0.0.0.0/0”) and does not allow IPv6. Customers using a CDN or a reverse proxy (such as Cloudflare or CloudFront) in front of their Qovery cluster received the reverse proxy’s real IP. If this IP was an IPv6 address, the request was rejected with a 403 error.
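Both effects come down to a CIDR match that no longer succeeds. The sketch below reproduces that check with Python's standard `ipaddress` module. It is an illustration only, not how Nginx implements whitelisting; the function name `is_allowed` is hypothetical and the addresses are placeholders from the documentation ranges.

```python
import ipaddress


def is_allowed(client_ip: str, whitelist: list[str]) -> bool:
    """True if the address falls inside at least one whitelisted CIDR range."""
    address = ipaddress.ip_address(client_ip)
    # An IPv6 address is never contained in an IPv4 network, and vice versa.
    return any(address in ipaddress.ip_network(cidr) for cidr in whitelist)


default_whitelist = ["0.0.0.0/0"]  # IPv4-only "allow everything"

print(is_allowed("203.0.113.7", default_whitelist))            # True
print(is_allowed("2001:db8::1", default_whitelist))            # False
print(is_allowed("2001:db8::1", default_whitelist + ["::/0"])) # True

# A customer whitelist stops matching once the evaluated address changes.
customer_whitelist = ["198.51.100.10/32"]
print(is_allowed("198.51.100.10", customer_whitelist))  # True
print(is_allowed("203.0.113.7", customer_whitelist))    # False
```

The `::/0` line only shows that an IPv6 range would be needed for an IPv6 address to match; the fix applied during the incident was to restore the correct default value.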
Business Impact:
- Impacted websites were unreachable from 1:45 pm UTC until 4:40 pm UTC.
5. Incident Detection
Detection Method:
- A customer created a thread on the forum reporting that many of their requests were receiving 403 errors. After this first thread, we received messages from other customers, so we alerted our team.
Detection Gaps:
- The 403 errors were returned by the customers’ clusters, so they did not trigger our monitoring tools.
6. Incident Response
Response Actions:
- Our team was alerted following customer feedback
- We investigated to locate the problem
- We decided not to roll back because the change responsible for the bug was too old
- We offered customers who used whitelists a workaround in a forum thread: manually and temporarily disabling their whitelist
- We tested our fix on an impacted cluster to validate it
- We communicated on Qovery’s status page and the forum
- We checked every other cluster’s advanced settings to confirm we had no other bug.
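A check of this kind can be expressed as a comparison of two settings snapshots per cluster. The sketch below is a hedged illustration, not the tooling used during the incident: the function `changed_settings` and the setting `some.other.setting` are hypothetical, and only `nginx.controller.enable_client_ip` comes from this report.

```python
def changed_settings(before: dict, after: dict) -> dict:
    """Return {key: (old, new)} for every setting whose value differs.

    A key absent from one snapshot is reported with None on that side.
    """
    changes = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key), after.get(key)
        if old != new:
            changes[key] = (old, new)
    return changes


# Hypothetical snapshots taken before and after the library upgrade.
before = {"nginx.controller.enable_client_ip": False, "some.other.setting": 10}
after = {"nginx.controller.enable_client_ip": True, "some.other.setting": 10}

print(changed_settings(before, after))
# {'nginx.controller.enable_client_ip': (False, True)}
```

An empty result for a cluster means its effective settings were not altered by the upgrade.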
Communication:
- Warn our team
- Status page and forum communication
- Post-mortem release
7. Resolution and Recovery
Resolution Steps:
- Correct the wrong default value in our code
- Redeploy the customers’ clusters
Recovery Validation:
- Search customers’ Nginx ingress logs for 403 errors
- Contact impacted users to confirm everything was back to normal
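The log search can be sketched as a small status-code counter. This is an assumption-laden example rather than the command we ran: the regular expression targets the common Nginx access-log layout (quoted request line followed by the status code), the function names are hypothetical, and the two sample lines are made up for illustration.

```python
import re
from collections import Counter

# Matches the quoted request line followed by the three-digit status code,
# e.g. ... "GET / HTTP/1.1" 403 146 ...
STATUS_AFTER_REQUEST = re.compile(r'"[A-Z]+ \S+ HTTP/[\d.]+" (\d{3}) ')


def count_statuses(log_lines):
    """Count responses per HTTP status code; lines that do not match are skipped."""
    counts = Counter()
    for line in log_lines:
        match = STATUS_AFTER_REQUEST.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def forbidden_ratio(log_lines):
    """Share of parsed requests answered with 403 (0.0 when nothing was parsed)."""
    counts = count_statuses(log_lines)
    total = sum(counts.values())
    return counts["403"] / total if total else 0.0


# Made-up sample lines, not real customer logs.
sample = [
    '203.0.113.7 - - [28/Aug/2024:14:02:11 +0000] "GET / HTTP/1.1" 403 146 "-" "curl/8.0"',
    '203.0.113.7 - - [28/Aug/2024:16:45:03 +0000] "GET / HTTP/1.1" 200 512 "-" "curl/8.0"',
]

print(forbidden_ratio(sample))  # 0.5
```

A ratio like this, tracked per cluster, is also one possible signal for closing the detection gap noted in section 5, since it would surface a spike in 403 responses without waiting for customer reports.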
8. Lessons Learned
What Went Well:
- We managed to quickly find where the problem was
- We deployed the fix on all customers’ clusters promptly
How to avoid this in the future:
- Improve our communication during the incident
- Stop deploying all servers at once, even if the modification is a low-impact one
- Test third-party library upgrades more thoroughly to prevent side effects
- During this incident, we needed at least 20 minutes to deploy the fix. We should work on reducing the deployment duration.
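One way to stop deploying everything at once is a batched rollout that halts as soon as a batch looks unhealthy. The sketch below shows the idea under stated assumptions: `rollout_in_batches`, the `deploy` and `is_healthy` callbacks, and the cluster names are all hypothetical and do not describe Qovery's deployment engine.

```python
def rollout_in_batches(clusters, deploy, is_healthy, batch_size=1):
    """Deploy batch by batch and stop after the first unhealthy batch.

    Returns (deployed, halted): the clusters that received the change,
    and True if the rollout stopped before reaching every cluster.
    """
    deployed = []
    for start in range(0, len(clusters), batch_size):
        batch = clusters[start:start + batch_size]
        for cluster in batch:
            deploy(cluster)
            deployed.append(cluster)
        # Check the whole batch before touching the next one.
        if not all(is_healthy(cluster) for cluster in batch):
            return deployed, True
    return deployed, False


clusters = ["canary", "cluster-a", "cluster-b", "cluster-c"]
deploy_log = []

result = rollout_in_batches(
    clusters,
    deploy=deploy_log.append,                     # stand-in for a real deployment
    is_healthy=lambda name: name != "cluster-a",  # pretend cluster-a starts failing
)
print(result)  # (['canary', 'cluster-a'], True)
```

In this run the remaining two clusters are never touched, which limits the blast radius of a bad change. The trade-off is a longer total rollout, so batch size and health-check duration need to be balanced against the goal of reducing deployment time.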
9. Action Items
Action Item | Owner | Priority | Due Date | Status |
---|---|---|---|---|
Improve our communication during an incident | Alessandro Carrano (Lead PM) | High | 2024-09-30 | Open |
Improve our deployment strategy | Pierre Mavro (CTO) | High | 2024-09-30 | In Progress |
Better test third-party library integration | Benjamin Chastanier (SRE Developer) | High | 2024-10-30 | Open |