SMS Service Outage
Incident Report for QLess
Postmortem

What happened

On Friday, February 26, 2021, one of Twilio's internal services suffered a service disruption that impacted a broad set of Twilio products from 5:00am PST to 7:30am PST — a total duration of 2.5 hours, which far exceeds our targets for diagnosis and correction.

During the impact period, customers experienced increased latency, errors from our API, inaccessible web interfaces, and/or undelivered messages. Impacted products included SMS, Flex, Console, and others. Although the service disruption was detected and our on-call engineering team was notified within 1 minute, our Status Page did not update for 25 minutes, which led to further customer uncertainty.

Root cause

The root cause of the service disruption was the overloading of a critical service that manages feature-enablement for many Twilio products. Multiple Twilio products that rely on this feature-enablement service did not handle its failure gracefully and began to fail themselves, manifesting as customer-facing API errors and increased latency.

Resolution

To resolve the immediate issue, we increased server capacity and added caching to reduce the load on the service. This took longer than anticipated because our standard procedure for bringing additional capacity online didn’t fully account for the sustained load generated as other services continuously retried their failed requests.
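The retry load described above is commonly reduced by having clients wait a randomized, exponentially growing interval between attempts. The sketch below is illustrative only and is not Twilio's code: `backoff_delays`, `call_with_retries`, and their parameters are hypothetical names, and the base delay and cap are assumed values.

```python
import random
import time


def backoff_delays(base=0.5, cap=30.0, attempts=5, rng=random.random):
    """Yield one capped, jittered delay (in seconds) per retry."""
    for attempt in range(attempts):
        # Exponential ceiling: base, 2*base, 4*base, ... limited to `cap`.
        ceiling = min(cap, base * (2 ** attempt))
        # Full jitter: scale by a random factor so clients do not retry in lockstep.
        yield rng() * ceiling


def call_with_retries(request, attempts=5, sleep=time.sleep):
    """Call `request()` up to `attempts` times, backing off between failures."""
    delays = list(backoff_delays(attempts=attempts - 1))
    for attempt in range(attempts):
        try:
            return request()
        except Exception:  # a real client would catch narrower error types
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(delays[attempt])
```

Because each client draws its own random delay, retries from many callers spread out over time instead of arriving together, which leaves room for newly added capacity to catch up.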

These changes will remain in place to prevent recurrence while we deploy additional, permanent protections and process improvements.

Our path forward 

During our review of this disruption, we identified several improvements that will prevent this specific issue from recurring. We will be making the following changes:

  • Reconfiguring the service with more aggressive auto-scaling behavior to better handle traffic spikes.
  • Removing this service from critical paths and making client-side caching the default behavior to prevent service unavailability.
  • Reducing the service’s request timeout and refactoring the service’s API to increase scalability.
  • Reconfiguring the service’s failover mechanism to increase resilience in the event of a failure.
  • Refactoring the server’s approach to caching to decrease workloads.
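As a rough illustration of client-side caching for a feature-enablement lookup, the sketch below keeps recent answers locally and, when a lookup fails, falls back to the last known value or a caller-supplied default. It is not Twilio's implementation: `FlagCache`, `fetch`, and the 60-second TTL are assumptions made for this example.

```python
import time


class FlagCache:
    """Client-side cache for feature flags that tolerates lookup failures."""

    def __init__(self, fetch, ttl=60.0, clock=time.monotonic):
        self._fetch = fetch      # hypothetical callable: flag name -> bool
        self._ttl = ttl          # seconds before a cached value is refreshed
        self._clock = clock
        self._entries = {}       # flag name -> (value, time fetched)

    def is_enabled(self, name, default=False):
        entry = self._entries.get(name)
        now = self._clock()
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]      # fresh hit: no request reaches the service
        try:
            value = self._fetch(name)
        except Exception:
            # Service unavailable: serve the stale value if one exists,
            # otherwise fall back to the caller's default.
            return entry[0] if entry is not None else default
        self._entries[name] = (value, now)
        return value
```

With this shape, a dependent product keeps answering from its local copy while the feature-enablement service is overloaded, and fresh cache hits reduce the request volume the service sees in normal operation.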

We are also reviewing our tooling and procedures for communicating with customers during disruptions, including via our status.twilio.com page, to ensure you have accurate and up-to-date information.

Our post-incident review process remains at a relatively early stage. To prevent similar issues with other services, we are taking the following steps across our engineering organization:

  • Conducting an audit of our codebase to identify services with similar risk characteristics and remediate as appropriate.
  • Instituting common architecture best practices for client services to degrade more gracefully.
  • Improving our deployment tooling and on-call runbooks to better manage server fleet capacity across all our services, eliminating manual steps and shortening future time-to-recovery.

Further action items will be identified and shared with you as this process progresses.

Posted Mar 03, 2021 - 12:19 PST

Resolved
QLess SMS notification service restored.
Posted Feb 26, 2021 - 10:21 PST
Investigating
Our SMS provider is down https://status.twilio.com/
Posted Feb 26, 2021 - 06:43 PST
This incident affected: AU1, CA1, EUR, US, US1, US2, NA1, NA2, NA3, NA4, and NA6.