On Friday, February 26, 2021, one of Twilio's internal services suffered a service disruption that impacted a broad set of Twilio products from 5:00am PST to 7:30am PST — a total duration of 2.5 hours, which far exceeds our targets for diagnosis and correction.
During the impact period, customers experienced increased latency, errors from our API, inaccessible web interfaces, and/or undelivered messages. Impacted products included SMS, Flex, Console, and others. Although the service disruption was detected and our on-call engineering team was notified within 1 minute, our Status Page was not updated for 25 minutes, which led to further customer uncertainty.
The root cause of the service disruption was the overloading of a critical service that manages feature enablement for many Twilio products. Multiple Twilio products that rely on this feature-enablement service did not handle its failure gracefully and began to fail themselves, which customers saw as API errors and increased latency.
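As an illustration only, and not a description of Twilio's actual implementation, the sketch below shows one common way a dependent service can tolerate a failing feature-enablement lookup: serve a recent cached answer when one exists, and otherwise fall back to a safe default instead of returning an error. The `FeatureFlagClient` class, its `fetch` callable, and the `ttl_seconds` parameter are hypothetical names chosen for this example.

```python
import time
from typing import Callable, Dict, Tuple


class FeatureFlagClient:
    """Hypothetical wrapper: remote flag lookup with a local cache and a safe default."""

    def __init__(
        self,
        fetch: Callable[[str], bool],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # assumed remote call; may raise on failure
        self._ttl = ttl_seconds      # how long a cached value counts as fresh
        self._clock = clock
        self._cache: Dict[str, Tuple[bool, float]] = {}

    def is_enabled(self, flag: str, default: bool = False) -> bool:
        now = self._clock()
        cached = self._cache.get(flag)
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # fresh cache hit: no remote call at all
        try:
            value = self._fetch(flag)
        except Exception:
            # Remote service failed: prefer a stale value, else the caller's default.
            return cached[0] if cached is not None else default
        self._cache[flag] = (value, now)
        return value
```

With this shape, an outage of the flag service degrades to slightly stale or default behavior in the caller rather than to a customer-facing error. Whether a stale value or a default is the safer choice depends on the feature being gated.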
To resolve the immediate issue, we increased server capacity and added caching to reduce the load on the service. This took longer than anticipated because our standard procedure for bringing additional capacity online didn’t fully account for the sustained load generated as other services continuously retried their failed requests.
These changes will remain in place to prevent recurrence while we deploy additional, permanent protections and process improvements.
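Continuous retries from dependent services can keep an overloaded service saturated even after capacity is added. A widely used client-side mitigation is to cap the number of retries and space them out with exponential backoff and random jitter. The sketch below is a generic example under assumed parameters, not the specific change Twilio made; `backoff_delays`, `call_with_retries`, and their default values are hypothetical.

```python
import random
import time
from typing import Callable, Iterator, TypeVar

T = TypeVar("T")


def backoff_delays(
    attempts: int,
    base: float = 0.1,
    cap: float = 5.0,
    rng: Callable[[], float] = random.random,
) -> Iterator[float]:
    """Yield one randomized delay per retry (exponential backoff with full jitter)."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
        yield rng() * ceiling                      # jitter spreads clients apart


def call_with_retries(
    operation: Callable[[], T],
    max_retries: int = 4,
    sleep: Callable[[float], None] = time.sleep,
    **backoff_kwargs,
) -> T:
    """Run operation, retrying at most max_retries times before giving up."""
    delays = backoff_delays(max_retries, **backoff_kwargs)
    while True:
        try:
            return operation()
        except Exception:
            delay = next(delays, None)
            if delay is None:
                raise  # retry budget exhausted: surface the last error
            sleep(delay)
```

Bounding the retry count limits how much extra load a failing dependency receives, and the jitter keeps many clients from retrying in lockstep.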
During our review of this disruption, we identified several improvements that will prevent this specific issue from recurring. We will be making the following changes:
We are also reviewing our tooling and procedures for communicating with customers during disruptions, including via our status.twilio.com page, to ensure you have accurate and up-to-date information.
Our post-incident review process remains at a relatively early stage. To prevent similar issues with other services, we are taking the following steps across our engineering organization:
Further action items will be identified and shared with you as this process progresses.