By Grego — yoDEV
On Monday, May 19 at 22:20 UTC, my applications stopped responding. It wasn’t a bug on my end. It wasn’t a broken deploy. It was Google Cloud.
That night, Google Cloud’s automated systems incorrectly placed Railway’s production account in a suspended state without warning, without human review, and without a single notification. The incident lasted approximately 8 hours and affected all Railway customers across all regions, including everyone who, like me, had their apps running there.
What happened that night is one of the clearest cases I’ve seen in years for why concentrating dependencies on a single cloud provider is a real operational risk — not a hypothetical architecture scenario.
What happened, technically
The postmortem published by Railway on May 20 is one of the most honest I’ve read in a long time. What it describes is a cascading failure that any architect should study.
GCP’s automated suspension disabled Railway’s dashboard, API, and network control plane — all hosted on Google Cloud. Up to that point, it was a provider failure. The real problem came after.
Railway’s edge proxies maintain a cache of routing tables, populated from that control plane hosted on GCP. While the cache held up, workloads on Railway Metal and AWS kept running. When the cache expired, the mesh couldn’t re-populate the routes — and workloads across all regions, even those running on AWS and Metal with nothing to do with GCP, started returning 404s.
A single dependency in the hot path of network discovery became a single point of failure for the entire platform.
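One common mitigation for this failure mode is to keep serving the last known-good routing table when a refresh fails, rather than letting entries expire into 404s. The sketch below illustrates that idea only. It is not Railway's implementation, and `RouteCache`, `fetch`, and `serving_stale` are hypothetical names chosen for the example.

```python
import time
from typing import Callable, Dict, Optional


class RouteCache:
    """Routing-table cache that keeps the last good copy when a refresh fails."""

    def __init__(self, fetch: Callable[[], Dict[str, str]], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch        # hypothetical call to the control plane
        self._ttl = ttl_seconds
        self._clock = clock
        self._routes: Optional[Dict[str, str]] = None
        self._fetched_at = 0.0
        self.serving_stale = False  # True while we are running on an expired table

    def lookup(self, host: str) -> Optional[str]:
        now = self._clock()
        expired = self._routes is None or now - self._fetched_at >= self._ttl
        if expired:
            try:
                self._routes = self._fetch()
                self._fetched_at = now
                self.serving_stale = False
            except Exception:
                # Control plane unreachable: keep the old table instead of
                # dropping every route. With no table at all, re-raise.
                if self._routes is None:
                    raise
                self.serving_stale = True
        return self._routes.get(host)
```

Serving stale routes trades freshness for availability, so a real proxy would also want an alert on `serving_stale` and some limit on how often it retries the failed refresh.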
As if that weren’t enough, when Google restored access to the account, individual services didn’t come back automatically. Disks, compute instances, and networking each required separate recovery. Network and edge routing restoration took until approximately 01:30 UTC on May 20, more than three hours after the outage began. And when the system came back, the volume of retries saturated GitHub OAuth integrations and webhooks, blocking logins and builds for another hour.
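Retry storms like the one described above are usually dampened on the client side with exponential backoff plus random jitter, so that recovering clients do not all retry at the same moment. A minimal sketch, assuming a "full jitter" policy; the function name and the default values are illustrative and not taken from the postmortem.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Return a randomized delay in seconds for a 0-based retry attempt."""
    source = rng or random
    # The ceiling doubles with each attempt and is clamped at `cap`.
    ceiling = min(cap, base * (2 ** attempt))
    # Full jitter: pick uniformly between zero and the ceiling.
    return source.uniform(0.0, ceiling)
```

A caller would sleep for `backoff_delay(n)` before retry number `n`, which spreads the load over time instead of sending it in synchronized waves.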
Railway’s own founder, Jake Cooper, called Google’s behavior “gobsmacking.” That a provider Railway had spent over 10 million dollars with would execute a massive automated suspension without human review or prior communication is, indeed, hard to process.
The real problem isn’t Google
It would be easy to turn this into a “Google bad, Railway victim” story. But Railway itself framed it very differently in its postmortem:
“Railway owns our vendor choices, and we ultimately own this one.”
And they’re right. The architectural problem existed before this suspension. The network control plane — the piece that decides where traffic goes — had a hard dependency on GCP. That’s a design decision. Google didn’t make it.
For those of us managing platforms and services, the takeaway isn’t “don’t use Railway” or “don’t use Google Cloud.” The takeaway is: where are your dependencies in the hot path?
It’s not enough to have workloads distributed across multiple clouds if the control plane connecting them lives in just one. Multi-AZ high availability within one provider isn’t the same as resilience against total loss of that provider.
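The hot-path question can be made concrete with a small inventory check. The sketch below uses a made-up inventory, loosely modeled on the incident description rather than on Railway's actual topology. Each component maps to the set of providers that can serve it at request time, and a component breaks when a failed provider is its only option.

```python
from typing import Dict, List, Set

# Hypothetical inventory: component -> providers able to serve it at request time.
# A component with a single provider has a hard dependency on that provider.
HOT_PATH: Dict[str, Set[str]] = {
    "workloads": {"aws", "metal", "gcp"},
    "edge-proxies": {"aws", "metal"},
    "route-discovery": {"gcp"},
    "dashboard-api": {"gcp"},
}


def broken_by(provider: str, hot_path: Dict[str, Set[str]]) -> List[str]:
    """List components that cannot run if `provider` disappears entirely."""
    return sorted(
        name for name, providers in hot_path.items()
        # Broken when the failed provider was the only one available.
        if providers and providers <= {provider}
    )
```

In this toy inventory, `broken_by("gcp", HOT_PATH)` flags route discovery and the dashboard API even though the workloads themselves span three providers, which is the shape of the failure described above.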
What Railway is doing
The postmortem outlines three concrete measures:
1. Remove the hard dependency on GCP in the network control plane. The goal is to turn the network into a true mesh where if any interconnection fails, there’s always an alternate path between clouds.
2. Extend high-availability database shards to AWS and Metal. If all instances of one cloud disappear at once, the database quorum keeps everything running and fails over immediately.
3. Remove Google Cloud from the hot path of the data plane and control plane. GCP would be relegated to a secondary/failover role while a new architecture for both planes is implemented.
These are exactly the right measures. But they’ll take time.
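The second measure comes down to majority arithmetic: a replicated database keeps its quorum after losing a whole provider only if the remaining replicas still form a strict majority. A minimal sketch, assuming simple majority voting with one vote per replica; the replica placements in the usage note are invented for illustration.

```python
from collections import Counter
from typing import Dict, List


def survives_provider_loss(replicas: List[str]) -> Dict[str, bool]:
    """For each provider, report whether a majority remains if it vanishes.

    `replicas` names the provider hosting each voting replica.
    """
    total = len(replicas)
    quorum = total // 2 + 1  # strict majority of the original membership
    per_provider = Counter(replicas)
    return {p: total - n >= quorum for p, n in per_provider.items()}
```

With three replicas all on one provider, losing that provider loses quorum. Placing one replica each on three providers lets the shard tolerate the loss of any single one.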
What this should trigger in your organization
If your applications were down during this outage and you’re responsible for platform or infrastructure, this incident is the business case you needed for the conversation you’ve been putting off about disaster recovery planning (DRP) and multi-provider strategy.
Some concrete areas to evaluate:
Map your dependencies in the hot path. Not your workloads — your control and discovery dependencies. Where does your control plane live? What happens if that provider disappears for 8 hours?
Compare your actual availability with your contractual SLA. Google Cloud offers 99.9% SLAs for many services, but an SLA doesn’t give back your 8 hours of downtime or your lost customers. It covers billing credits, not business impact.
Consider provider concentration as operational risk. Not as a technical decision — as a risk to report and manage. The PaaS you choose to deploy on isn’t just a productivity tool; it’s critical infrastructure.
Define what level of multi-provider makes sense for your context. It’s not the same for a 3-person startup as for a platform with committed customer SLAs. But both need to have that answer clear before the incident arrives.
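To put the SLA point above in numbers: a 99.9% target over a 30-day month allows roughly 43 minutes of downtime, while a single 8-hour outage brings that month down to about 98.9% availability. The helper names below are illustrative, and the 30-day month is an assumption, since providers define their own measurement periods.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Downtime budget in minutes for an availability target over `days`."""
    return days * 24 * 60 * (1 - sla_percent / 100)


def achieved_availability(outage_hours: float, days: int = 30) -> float:
    """Availability percentage for a period containing one outage."""
    return 100 * (1 - outage_hours / (days * 24))
```

Comparing `allowed_downtime_minutes(99.9)` with the 480 minutes of an 8-hour outage shows the gap between the contractual number and what customers actually experienced.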
The irony of all this is that Railway is precisely a platform designed so you don’t have to worry about infrastructure. And yet an automated decision by an upstream provider left them without service for 8 hours. The risk of dependency doesn’t disappear by abstracting it — it just shifts.
Closing
I started writing this note with my apps still down. I’m finishing it with a renewed conviction about something I already knew but that daily urgency tends to push aside: resilience isn’t a project for when we have time. It’s the technical debt that hurts the most when it comes due.
Railway will emerge from this with better architecture. The question is whether you’ll emerge with a clearer DRP strategy.
Sources: Railway Incident Report — May 19, 2026 · The Register
