Audience: Backend / platform engineers
Format: Architecture analysis
Context: Reliable and maintainable AI systems in production
TL;DR
- OpenAI is adding durability primitives to the Agents SDK
- The focus is no longer just generation; it’s reliable execution
- AI workflows are starting to adopt classic distributed systems patterns
The important shift
For much of 2024 and early 2025, AI agents looked more like sophisticated demos than operational systems.
They worked well when:
- everything went perfectly
- the workflow was short
- there were no interruptions
But real systems don’t work that way.
The silent problem
Most AI workflows fail for reasons that have very little to do with “AI”:
- timeouts
- slow APIs
- network errors
- partial executions
- tools returning errors or malformed output
- state loss
In other words:
classic infrastructure problems
What OpenAI is adding
The new approach to the Agents SDK introduces primitives for:
- automatic retries
- resumable execution
- state persistence
- failure recovery
This changes what you can build: workflows that survive a crash or a timeout instead of starting over.
Before vs now
Before:
prompt → response → done
Now:
persistent and recoverable workflow
Why this matters
Because modern agents are no longer:
- a single model call
They are now:
- long-running workflows
- multiple tools
- dependent steps
- asynchronous execution
- coordination across systems
The parallel with backend engineering
This starts to look very familiar.
The same problems we solve in:
- microservices
- distributed jobs
- message queues
- data pipelines
Now appear in AI workflows.
Retry is no longer optional
Simple example:
1. agent analyzes PR
2. runs tests
3. consults documentation
4. generates summary
5. posts comment
What if step 4 fails?
Without durability:
the entire workflow is lost
With resumable execution:
the system continues from the right point
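A minimal sketch of that idea, assuming a JSON checkpoint file and step results that are JSON-serializable. The names `run_workflow` and `Step` are hypothetical and this is not the Agents SDK API; it only illustrates the pattern of persisting after each step and skipping completed steps on the next run.

```python
import json
from pathlib import Path
from typing import Callable

# A step is a (name, function) pair; the function receives the state so far.
Step = tuple[str, Callable[[dict], object]]


def run_workflow(steps: list[Step], checkpoint: Path) -> dict:
    """Run steps in order, skipping any whose result is already checkpointed."""
    state: dict = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    for name, step in steps:
        if name in state:
            continue  # finished in an earlier run; do not redo it
        state[name] = step(state)  # may raise; earlier results stay on disk
        # Write to a temp file and rename, so a crash never leaves a
        # half-written checkpoint behind.
        tmp = checkpoint.with_suffix(".tmp")
        tmp.write_text(json.dumps(state))
        tmp.replace(checkpoint)
    return state
```

If step 4 raises, the checkpoint already holds the results of steps 1 to 3. Calling `run_workflow` again with the same checkpoint path starts at step 4.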
The real shift: statefulness
State persistence is probably the most important part.
Because it enables:
- long-running workflows
- coordination between steps
- reliable recovery
- operational traceability
What’s interesting
Public discussion still centers on:
- benchmarks
- reasoning
- context
But OpenAI is clearly pushing in another direction:
operational infrastructure for agents
The new AI stack starts to look like this
LLM
↓
Workflow Runtime
↓
State Persistence
↓
Retries + Recovery
↓
Tool Execution
↓
Observability
The model is just one layer.
What changes for platform teams
The work is no longer simply:
- integrating an AI API
It now involves:
- designing durable workflows
- managing persistent state
- controlling retries
- preventing infinite loops
- monitoring execution
The risks that come with it
When you add automatic retries and persistence:
you also amplify failure modes
Examples:
- retry storms
- autonomous loops
- unexpected costs
- repetition of sensitive actions
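One common guard against retry storms and unexpected costs is a hard cap on attempts plus exponential backoff with jitter. A minimal sketch, assuming a hypothetical helper named `retry_with_backoff` with illustrative defaults; it is not taken from the Agents SDK.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `call` up to max_attempts times, sleeping with jitter between tries."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return call()
        except Exception:
            if attempt == max_attempts:
                raise  # budget exhausted: surface the failure instead of looping
            # The delay ceiling doubles per attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** (attempt - 1))
            # A random delay in [0, ceiling] spreads out concurrent retries.
            sleep(random.uniform(0, ceiling))
    raise AssertionError("unreachable")
```

In practice you would retry only errors known to be transient. Catching `Exception` here keeps the sketch short.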
Patterns that start to matter
Idempotence
Repeating an action must have the same effect as running it once, because a retry or resume may run it again.
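One way to get there is an idempotency key: derive a stable key from the workflow, the step, and its input, and record the result under that key. A minimal sketch, assuming an in-memory dict in place of a durable store; `IdempotentExecutor` and `key_for` are hypothetical names.

```python
import hashlib
import json
from typing import Callable


class IdempotentExecutor:
    """Runs a side-effecting action at most once per idempotency key."""

    def __init__(self) -> None:
        # Assumption: an in-memory dict stands in for a durable store.
        self._results: dict[str, object] = {}

    @staticmethod
    def key_for(workflow_id: str, step: str, payload: dict) -> str:
        # sort_keys makes the key independent of dict ordering.
        raw = json.dumps([workflow_id, step, payload], sort_keys=True)
        return hashlib.sha256(raw.encode()).hexdigest()

    def run(self, key: str, action: Callable[[], object]) -> object:
        if key in self._results:
            return self._results[key]  # replay: return the recorded result
        result = action()
        self._results[key] = result
        return result
```

This sketch still has a gap: a crash after `action()` but before the result is recorded lets the action run twice. Closing it usually means passing the key to the downstream API, or recording the result in the same transaction as the side effect.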
Clear timeouts
Workflows need explicit limits on wall-clock time and step count, so a stuck run fails fast instead of hanging.
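A minimal sketch of such limits, assuming a hypothetical `RunBudget` that the workflow loop charges before every step. It enforces both a deadline and a step cap, which also bounds runaway agent loops.

```python
import time
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when a run uses up its time or step allowance."""


class RunBudget:
    def __init__(
        self,
        max_seconds: float,
        max_steps: int,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._clock = clock
        self._deadline = clock() + max_seconds
        self._steps_left = max_steps

    def charge(self) -> None:
        """Call before each step; raises once either limit is hit."""
        if self._clock() >= self._deadline:
            raise BudgetExceeded("deadline exceeded")
        if self._steps_left <= 0:
            raise BudgetExceeded("step limit exceeded")
        self._steps_left -= 1
```

The budget is only checked between steps, so it cannot interrupt a step that is already hanging. Each tool or model call still needs its own timeout at the client level.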
Circuit breakers
Stop calling a dependency that keeps failing, so one broken tool does not take down every workflow that uses it.
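A minimal sketch of the classic pattern, assuming a hypothetical `CircuitBreaker` that wraps each tool call. The thresholds are illustrative and the code is single-threaded for brevity.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(
        self,
        failure_threshold: int = 3,
        reset_after: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def call(self, fn: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = fn()
        except Exception:
            self._failures += 1
            half_open = self._opened_at is not None
            if half_open or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, calls fail immediately without touching the dependency. That gives the dependency time to recover and caps the cost of a failing tool.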
Observability
You need:
- tracing
- logs
- replay
- audit trails
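A minimal sketch of the logging and audit-trail part, assuming a hypothetical `traced_step` context manager that appends one structured record per step to a list. A real system would send these records to a tracing backend or an append-only log.

```python
import json
import time
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def traced_step(events: list[dict], run_id: str, step: str) -> Iterator[None]:
    """Record status, error, and duration for one workflow step."""
    start = time.monotonic()
    record: dict[str, object] = {"run_id": run_id, "step": step, "status": "ok"}
    try:
        yield
    except Exception as exc:
        record["status"] = "error"
        record["error"] = repr(exc)
        raise  # tracing must not swallow the failure
    finally:
        record["duration_s"] = round(time.monotonic() - start, 6)
        events.append(record)


def to_jsonl(events: list[dict]) -> str:
    """Serialize the trail as one JSON object per line."""
    return "\n".join(json.dumps(e, sort_keys=True) for e in events)
```

Wrapping each step in `with traced_step(events, run_id, "run_tests"):` produces a per-run trail. It shows which step failed, with what error, and how long each step took.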
What separates demos from production
An AI demo:
- generates something impressive once
Production requires:
- recovery
- resilience
- consistency
- operational control
Perspective for lean teams
This is especially important for small teams.
Because AI workflows without durability:
- require constant manual intervention
- make debugging chaotic
- scale poorly operationally
Reliability matters more when your operational margin is thin.
Verdict
OpenAI’s move is a strong signal:
agents are entering their “infrastructure” phase
And that means classic backend engineering concepts are central again.
Final thought
The next generation of AI systems probably won’t win on:
- better prompts
- more context
- more speed
It will win on:
- resilience
- recovery
- observability
- operational reliability
Because in the end:
a useful AI workflow isn’t the one that impresses once.
It’s the one that keeps working when things go wrong.
