On April 4th, 2025, Temporal Cloud in AWS eu-west-1 experienced a service outage from 1647 UTC to 1717 UTC. During this 30 minute window all customers of this region saw high API latencies and a peak gRPC error rate of 87%.
The root cause was a latent internal TLS configuration change that had previous rolled out to the cell, with the trigger being an unrelated History service restart during a WAL rollout. As each History service restarted more and more of the overall service pool was unable to connect to the WAL and cell performance and availability deteriorated.
The issue was mitigated by reverting the latent configuration change. After which we identified the sequence of events which lead to the configuration change laying dormant instead of being immediately applied and have instituted process improvements to prevent it from happening in the future.