NA1 Outage
Incident Report for QLess
Postmortem

On September 21st, QLess experienced a downtime event. Our platform engineers have determined that the issue was caused by a database access problem in our NA1 environment. We are working with our partners at Amazon Web Services on a root cause analysis to determine whether the issue is related to the known concurrency issue in our main application. Once the analysis is complete, we will provide an update with the findings and our remediation steps.

**Update after investigation** 9/24/21

After further investigation, we determined that the root cause of the service outage was an errant backup process. This process, which exists at the database root level with Amazon Web Services, was erroneously triggered and locked database tables, which then cascaded into a full server outage. To prevent this from recurring, the backup process will no longer have access to the production database. Instead, it will create backups from one of our mirrored database instances, which have read-only access, so that it cannot lock tables that production needs to write to.
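
As an illustration only, the rule described above (backups read from a read-only mirror and never from production) could be sketched as follows. The instance names and the `DbInstance` and `pick_backup_source` helpers are hypothetical and are not part of QLess or Amazon Web Services tooling. The sketch shows the selection logic under those assumptions, not the actual remediation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DbInstance:
    name: str        # hypothetical instance identifier
    read_only: bool  # True for a mirrored instance with read-only access


def pick_backup_source(instances):
    """Return the first read-only instance; never fall back to a writable one."""
    for inst in instances:
        if inst.read_only:
            return inst
    # Failing loudly is safer than backing up from production and locking tables.
    raise RuntimeError("no read-only mirror available; refusing to use production")


# Hypothetical inventory: one writable primary and one read-only mirror.
instances = [
    DbInstance("na1-primary", read_only=False),
    DbInstance("na1-mirror-1", read_only=True),
]

source = pick_backup_source(instances)
print(source.name)
```

In this sketch, a missing mirror raises an error rather than silently falling back to the production instance, so a skipped backup is the worst outcome instead of locked production tables.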

Posted Sep 23, 2021 - 10:09 PDT

Resolved
QLess services are now operational
Posted Sep 21, 2021 - 15:13 PDT
Investigating
We are currently investigating this issue.
Posted Sep 21, 2021 - 15:04 PDT
This incident affected: NA1.