NA4 Outage

Incident Report for QLess Linebuster

Postmortem

On September 22nd, our NA4 environment experienced a downtime event. Our Platform Engineers have determined that the issue was caused by a database access problem on NA4. We are working with our partners at Amazon Web Services on a root cause analysis to determine whether the issue is related to the known, persistent concurrency issue in our main application. Once the analysis is complete, we will post an update with the findings and our remediation steps.

**Update after investigation** 9/24/21

After further investigation, we determined that the root cause of the service outage was an errant backup process. This process, which exists at the database root level with Amazon Web Services, was erroneously triggered and locked database tables, which then cascaded into a full server outage. To prevent this from recurring, the backup process will no longer have access to the production database for creating backups. It will instead use one of our mirrored database instances, which have read-only access, so that backups cannot lock tables that need to be written to in production.
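
As an illustration only, the remediation rule above (backups read from a read-only mirror, never from production) could be expressed as a guard like the following. This is a minimal sketch under stated assumptions: the `DbInstance` record, the `pick_backup_source` helper, and the instance names are hypothetical and do not represent our actual tooling or any Amazon Web Services API.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DbInstance:
    """Hypothetical inventory record for one database instance."""
    name: str        # illustrative label, not a real hostname
    role: str        # "primary" (production, writable) or "mirror"
    read_only: bool  # True if the instance only accepts reads


class NoSafeBackupSource(Exception):
    """Raised when no read-only mirror is available for backups."""


def pick_backup_source(instances: Iterable[DbInstance]) -> DbInstance:
    """Return the first read-only mirror; never return the primary.

    Failing loudly is deliberate in this sketch: skipping a backup is
    treated as safer than locking tables on the production database.
    """
    for inst in instances:
        if inst.role == "mirror" and inst.read_only:
            return inst
    raise NoSafeBackupSource("no read-only mirror available; backup skipped")


if __name__ == "__main__":
    # Hypothetical inventory for demonstration purposes.
    inventory = [
        DbInstance(name="na4-primary", role="primary", read_only=False),
        DbInstance(name="na4-mirror-1", role="mirror", read_only=True),
    ]
    print(pick_backup_source(inventory).name)
```

In this sketch the backup job has no fallback to the primary, so a missing mirror results in a skipped backup and an error rather than a lock on production tables.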

Posted Sep 23, 2021 - 10:09 PDT

Resolved

QLess services are now operational.

Posted Sep 22, 2021 - 11:52 PDT

Investigating

We are currently investigating this issue.

Posted Sep 22, 2021 - 11:46 PDT

This incident affected: NA4.