We would like to provide additional detail about the downtime that occurred on 10/12/2021.
What happened?
At 6:28am PST (duration 8 min) and 10:41am PST (duration 3 min), the US2 server environment experienced downtime, during which QLess applications on the server were temporarily inaccessible.
Cause
The root cause of the outage was a race condition triggered on one of our backend servers. This race condition is a known issue and can be triggered by heavy transactional load or other rare circumstances.
After the initial issue was resolved, one of our reverse proxy nodes was not rebooted properly, which caused continued service degradation and slow performance and resulted in a second downtime event. We then had to perform a controlled reboot of all proxy nodes.
Remediation
Upon receiving a monitoring alert notification, QLess engineers restarted the service.
Prevention
QLess has initiated a complete refactoring of the affected application. The new service application will be more fault-tolerant, more available, and self-healing.