0x19. Postmortem

Incident Report: Database Timeout Error

Date and Time of Incident: 2024/05/10, 11:30 AM SAST

Issue Summary:

On 2024/05/10 at 11:30 AM SAST, our monitoring system detected a database timeout error affecting portfolio loading on the freelancer page. Users attempting to upload and update their portfolios experienced significant delays and timeouts, resulting in a degraded user experience.

Timeline

  • 11:24 AM SAST: Configuration push begins

  • 11:30 AM SAST: Error 504 (database timeout) detected

  • 11:30 AM SAST: Pagers alert the on-call teams

  • 12:15 PM SAST: Configuration push fails; issue escalated to a senior engineer

  • 12:30 PM SAST: Configuration succeeds; task resolved

  • 12:45 PM SAST: Server restarts begin

  • 1:30 PM SAST: 100% of traffic back online

Root Cause:

Upon investigation, it was discovered that the database timeout error was caused by a bottleneck in the database query execution process. The query responsible for retrieving portfolio data was not optimized, leading to excessive load times that ultimately exceeded the configured timeout threshold. Users experienced delays and timeouts when attempting to upload and update their portfolios, which degraded the user experience and may have resulted in frustration and decreased engagement with the platform.
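
For illustration, the minimal sketch below shows how an unindexed lookup of this kind slows down as the table grows, and how adding an index changes the query plan. The portfolios table, its columns, and the use of SQLite are assumptions made for the example; the production schema and database engine were not part of this report.

```python
import sqlite3

# Assumed schema for illustration only: a "portfolios" table keyed by freelancer_id.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE portfolios (
        id INTEGER PRIMARY KEY,
        freelancer_id INTEGER NOT NULL,
        title TEXT,
        updated_at TEXT
    );
""")

query = "SELECT * FROM portfolios WHERE freelancer_id = ?"

# Without an index, the lookup scans the whole table, so latency grows with table size.
print(conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())
# detail column reads roughly: SCAN portfolios

# An index on the filter column lets the engine seek directly to the matching rows.
conn.execute("CREATE INDEX idx_portfolios_freelancer ON portfolios (freelancer_id)")
print(conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall())
# detail column reads roughly: SEARCH portfolios USING INDEX idx_portfolios_freelancer
```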

Resolutions and Recovery:

  1. Immediate Response (11:30 AM SAST): Upon detection of the Error 504 and database timeout, the incident response team was promptly alerted via pagers, ensuring the appropriate personnel were aware of the issue and could begin investigating.

  2. Investigation and Escalation (11:30 AM — 12:15 PM SAST): The incident was triaged to determine the root cause of the database timeout while efforts were made to mitigate the impact on users. When the initial troubleshooting steps proved unsuccessful, the issue was escalated to a senior engineer for further investigation.

  3. Resolution (12:15 PM — 12:30 PM SAST): The senior engineer took charge of resolving the configuration issue, conducting a thorough analysis of the problem and implementing the fixes needed to restore functionality. This may have involved adjusting database settings, optimizing queries, or addressing other underlying issues contributing to the timeout.

  4. Recovery (12:30 PM — 1:30 PM SAST): With the configuration issue resolved, server restarts were initiated so the changes would take effect. Throughout this process, monitoring systems tracked the progress of the restarts and verified that services were coming back online as expected. Once all servers had restarted successfully, traffic was gradually routed back to the affected services to minimize further disruption (a minimal sketch of this ramp-up follows this list). By 1:30 PM SAST, 100% of traffic was back online and normal operations were restored.
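
The sketch below illustrates the gradual ramp-up described in step 4. The health-check URL, the ramp percentages, and the set_traffic_weight placeholder are all assumptions for the example; the real load balancer API and health-check endpoints were not described in this report.

```python
import time
import urllib.request

HEALTH_URL = "https://example.com/health"  # hypothetical health-check endpoint
RAMP_STEPS = [10, 25, 50, 100]             # assumed percentage of traffic per step

def healthy() -> bool:
    """Return True if the restarted service answers its health check with HTTP 200."""
    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False

def set_traffic_weight(percent: int) -> None:
    """Placeholder for the load balancer call; the real API was not part of the report."""
    print(f"Routing {percent}% of traffic to the restarted servers")

for percent in RAMP_STEPS:
    if not healthy():
        print("Health check failed; holding traffic at the previous step")
        break
    set_traffic_weight(percent)
    time.sleep(60)  # pause to watch dashboards before the next step
```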

Corrective and Preventative Measures:

  1. Regular Performance Testing: Implement a robust performance testing regimen to identify and address potential bottlenecks before they impact production.

  2. Code Review: Institute a thorough code review process to ensure that database queries are optimized for efficiency and scalability.

  3. Continuous Monitoring: Maintain vigilant monitoring of database performance metrics to promptly identify and address any anomalies or degradation in performance (a minimal monitoring sketch follows this list).

  4. Capacity Planning: Conduct regular capacity planning exercises to anticipate future growth and ensure that our infrastructure can scale to meet increasing demand.
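
As an illustration of the continuous monitoring measure, the sketch below raises an alert when the 95th-percentile query latency approaches the timeout threshold. The thresholds and the sampled latencies are assumed values for the example, not figures from our production systems.

```python
import statistics

ALERT_THRESHOLD_S = 5.0   # assumed alerting threshold, well below the hard timeout
TIMEOUT_S = 30.0          # assumed gateway timeout behind the Error 504

def check_latency(samples_s):
    """Alert when the 95th-percentile query latency approaches the timeout."""
    p95 = statistics.quantiles(samples_s, n=20)[-1]
    if p95 >= ALERT_THRESHOLD_S:
        print(f"ALERT: p95 query latency {p95:.2f}s "
              f"(alert threshold {ALERT_THRESHOLD_S}s, timeout {TIMEOUT_S}s)")
    else:
        print(f"OK: p95 query latency {p95:.2f}s")

# Example usage with made-up latency samples (seconds)
check_latency([0.8, 1.2, 0.9, 6.4, 1.1, 0.7, 5.9, 1.0])
```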

The database timeout error affecting portfolio loading on the freelancer page has been resolved, and normal functionality has been restored. We apologize for any inconvenience this incident may have caused and remain committed to providing a seamless and reliable experience for our users.