Ensuring Resiliency For Engineering

July 20, 2021
Written by Twilio

Twilio suffered a significant service disruption on Feb 26, 2021.

When we fall short, as we did with the service disruption on Feb 26, it motivates us to learn and to make our services more resilient and reliable. Nothing is more important than regaining your trust. We want the opportunity to show that Twilio can and will continue to be a reliable and consistent partner.

We are committed to telling you whenever an incident disrupts your customer communications. “No shenanigans” is our ethos. Striving to act in an honest, direct, and transparent way is a value every Twilion lives by, and in that spirit we want to share the improvements we completed in Q2.

Recap of the Feb 26, 2021 service disruption:

On Friday, February 26, 2021, one of Twilio's internal services suffered a service disruption that impacted a broad set of Twilio products from 5:00am PST to 7:30am PST. A major contributing factor to the service disruption was an overload of a critical service that manages feature enablement for many Twilio products. Although the disruption was detected and our on-call engineering team was notified within 1 minute, our Status Page did not update for 25 minutes, which led to further customer uncertainty. To resolve the immediate issue, we increased server capacity and added additional caching to reduce the load on the service. To read more about the service disruption, head over to this blog.

We have identified a total of 37 technical improvements to the Feature Enablement Service, which was the cause of the failure on Feb 26, 2021. Of these 37 improvement opportunities, we have completed 32 to date. Examples of critical completed and in-flight technical improvements:

Completed in Q1 2021:

  • Reconfigured our Feature Service with more aggressive auto-scaling behavior to better handle traffic spikes.
  • Removed the Feature Service from critical paths and made client-side caching the default behavior, so requests can complete even when the service is unavailable.
  • Reduced the Feature Service’s request timeout and refactored the API to increase scalability (see the sketch after this list).
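
Twilio has not published the Feature Service’s client internals, so the following is only a minimal sketch, in Go, of the general pattern the second and third items describe: a flag lookup that is served from a local cache by default, refreshes over the network with a short request timeout, and falls back to the last known value (or a caller-supplied default) when the service is slow or unavailable. Every name here (Client, IsEnabled, fetchFlag, the /flags endpoint) is an illustrative assumption, not Twilio’s actual API.

```go
package feature

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"sync"
	"time"
)

// cachedFlag is the last value seen for a flag and when it was fetched.
type cachedFlag struct {
	value     bool
	fetchedAt time.Time
}

// Client is a hypothetical feature-flag client: reads are served from a
// local cache so the Feature Service is not on the request critical path,
// and remote lookups use a short timeout with a cached/default fallback.
type Client struct {
	baseURL    string
	httpClient *http.Client
	ttl        time.Duration

	mu    sync.RWMutex
	cache map[string]cachedFlag
}

// NewClient builds a client with an aggressive request timeout, mirroring
// the "reduced request timeout" improvement described above.
func NewClient(baseURL string) *Client {
	return &Client{
		baseURL:    baseURL,
		httpClient: &http.Client{Timeout: 200 * time.Millisecond},
		ttl:        30 * time.Second,
		cache:      make(map[string]cachedFlag),
	}
}

// IsEnabled answers from the local cache when possible and degrades to the
// last known value, then to defaultValue, if the Feature Service is failing.
func (c *Client) IsEnabled(ctx context.Context, flag string, defaultValue bool) bool {
	c.mu.RLock()
	entry, ok := c.cache[flag]
	c.mu.RUnlock()
	if ok && time.Since(entry.fetchedAt) < c.ttl {
		return entry.value // fresh cache hit: no network call at all
	}

	// Cache miss or stale entry: try a remote read, bounded by the short timeout.
	if v, err := c.fetchFlag(ctx, flag); err == nil {
		c.mu.Lock()
		c.cache[flag] = cachedFlag{value: v, fetchedAt: time.Now()}
		c.mu.Unlock()
		return v
	}

	// Feature Service slow or unavailable: serve stale data, then the default.
	if ok {
		return entry.value
	}
	return defaultValue
}

// fetchFlag calls a hypothetical HTTP endpoint exposing one flag's state.
func (c *Client) fetchFlag(ctx context.Context, flag string) (bool, error) {
	url := fmt.Sprintf("%s/flags/%s", c.baseURL, flag)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return false, err
	}
	resp, err := c.httpClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("feature service returned %d", resp.StatusCode)
	}
	var body struct {
		Enabled bool `json:"enabled"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return false, err
	}
	return body.Enabled, nil
}
```

The key property of this pattern is that a slow or unavailable Feature Service costs a caller at most one short timeout, and repeated reads of a fresh flag never leave the process.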

Completed in Q2 2021:

  • Reconfigured the Feature Service’s failover mechanisms to be more resilient in the event of system failure.
  • Refactored the Feature Service’s approach to internal caching to decrease workloads.
  • Ensured all client services of the Feature Service are configured to degrade more gracefully in the event of downstream failures (see the sketch after this list).
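
The specifics of how each client service degrades are internal to those teams; the sketch below shows one common way to get the behavior the last item describes, again in Go: a small circuit-breaker wrapper that stops calling a failing dependency for a cool-off period, so callers immediately fall back to safe defaults instead of adding load to a struggling Feature Service. The Breaker type and its parameters are hypothetical, not Twilio’s implementation.

```go
package degrade

import (
	"errors"
	"sync"
	"time"
)

// ErrOpen is returned while calls to the failing dependency are suspended.
var ErrOpen = errors.New("circuit open: dependency calls suspended")

// Breaker is a minimal circuit breaker: after maxFailures consecutive
// failures it stops calling the dependency for a cool-off period, so the
// client service serves defaults instead of piling more load onto an
// already struggling Feature Service.
type Breaker struct {
	maxFailures int
	cooldown    time.Duration

	mu       sync.Mutex
	failures int
	openedAt time.Time
}

func NewBreaker(maxFailures int, cooldown time.Duration) *Breaker {
	return &Breaker{maxFailures: maxFailures, cooldown: cooldown}
}

// Do runs the dependency call unless the breaker is open. Callers treat any
// error, including ErrOpen, as "use the safe default".
func (b *Breaker) Do(call func() (bool, error)) (bool, error) {
	b.mu.Lock()
	open := b.failures >= b.maxFailures && time.Since(b.openedAt) < b.cooldown
	b.mu.Unlock()
	if open {
		return false, ErrOpen
	}

	v, err := call()

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.failures >= b.maxFailures {
			b.openedAt = time.Now() // (re)open and start the cool-off period
		}
		return false, err
	}
	b.failures = 0 // a success closes the breaker again
	return v, nil
}
```

Because the breaker surfaces its own open state as an ordinary error, a client service has a single degraded path: wrap each flag lookup in Do, treat any error as “serve the default value”, and keep handling traffic.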

In-flight with a completion ETA in Q3 2021:

  • Further hardening of non-critical services that depend on the Feature Service.
  • Redefining the caching layer of non-critical services that use the Feature Service.

Mitigating the risk of similar issues across other services:

To improve our engineering operational processes and prevent similar failures from recurring, we completed a few holistic changes throughout our engineering organization in Q2.

Completed in Q2 2021:

  • Completed an audit of our production systems to identify services with similar risk characteristics.
  • Improved our deployment tooling and on-call runbooks to better manage server fleet capacity across all our services, eliminating manual steps and shortening future time-to-recovery.
  • Published our Software Change Management process.

Later this month (July 2021), we will also publish an update to our Business Continuity and Disaster Recovery (BC/DR) plans.

We’ve also identified a few longer-term initiatives (greater than 180 days) that our engineering teams will be working on to further improve our operational maturity.

In-flight Longer-Term Initiatives:

  • Continuously identifying and mitigating risks using our updated company-wide risk assessment process.
  • Developing plans to further harden and automate the deployment process.
  • Introducing new standardized incident management tooling and incident response processes to further optimize our response time and provide effective communications.

Final note:

Once again, we’d like to apologize for the inconvenience the disruption caused and thank you for being a valued customer. We will provide a further update on our Q3 improvements across processes and functions.