Customers in AWS eu-west-1 may experience elevated latencies and error rates

Incident Report for Temporal

Postmortem

On April 4th, 2025, Temporal Cloud in AWS eu-west-1 experienced a service outage from 16:47 UTC to 17:17 UTC. During this 30-minute window, all customers in this region saw high API latencies and a peak gRPC error rate of 87%.

The root cause was a latent internal TLS configuration change that had previously been rolled out to the cell; the trigger was an unrelated History service restart during a WAL rollout. As each History service restarted, more and more of the overall service pool was unable to connect to the WAL, and cell performance and availability deteriorated.

The issue was mitigated by reverting the latent configuration change. We then identified the sequence of events that led to the configuration change lying dormant instead of being applied immediately, and we have instituted process improvements to prevent this from happening in the future.

Posted Apr 08, 2025 - 13:52 UTC

Resolved

This incident has been resolved.

Posted Apr 04, 2025 - 17:20 UTC

Investigating

Some customers in AWS eu-west-1 may experience elevated API latencies.

Posted Apr 04, 2025 - 16:59 UTC

This incident affected: Amazon Web Services (AWS) (eu-west-1).