Death by a Thousand Handshakes
Why persistent connections are the future of AI agent infrastructure, and how one protocol change cuts agent latency in half.

March 9, 2026
Every tool call an AI coding agent makes is a full HTTP round trip. The agent asks the model for a plan. The model responds with a tool call. The harness executes the tool. Then it sends the result back to the model, along with the entire conversation history from the beginning. For an agentic coding workflow that makes 20–30 tool calls, that overhead adds up fast.
OpenAI just shipped WebSocket mode for the Responses API. Instead of creating a new HTTP connection per turn, you keep a persistent WebSocket connection open and send only incremental inputs. The server keeps the conversation state in memory. No full-context resends. No connection setup per turn.
The early benchmarks are significant: up to 39% faster on complex multi-file workflows, with best cases hitting 50%. This isn't a prompting trick or a model improvement. It's an infrastructure change. And it matters more than most people realize.
How HTTP Mode Works (and Why It's Slow)
In standard HTTP mode, every turn in an agent loop is a standalone request. The client opens a connection, sends the full conversation context (system prompt, all prior messages, all tool definitions, all previous tool call results), and waits for the model to respond. When the response arrives, the connection closes.
For the next turn, the same thing happens again: new connection, full context resent, response, close. The payload grows linearly with conversation length. By tool call 15, you're sending kilobytes of redundant data per request. The model has already seen all of this context, but you're paying to transmit it again every single turn.
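The cost structure is easy to see in a toy model. The byte counts below are invented for illustration, not measurements; the point is the shape of the growth:

```python
def http_bytes_per_turn(turn, base=4000, per_turn=1500):
    """Approximate HTTP request size at a given turn: a fixed chunk
    (system prompt + tool definitions) plus every prior turn's
    messages and tool results, which must all be resent."""
    return base + per_turn * turn

# The per-request size grows linearly with turn count, so the
# session total grows quadratically.
session_total = sum(http_bytes_per_turn(t) for t in range(20))
```

By this (made-up) accounting, turn 15 alone carries 26.5 KB, and a 20-turn session ships 365 KB over the wire, almost all of it context the model has already seen.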
It's a Groundhog Day situation. Every time the model “wakes up” for the next turn, there's nothing left in its brain. It's stateless. It has no memory of what just happened. So all of the state has to be shipped back every single time. The agent might ingest 100,000 tokens you sent over the API and then respond with eight words. That's a real thing that happens every day.
And if you're thinking “isn't that what prompt caching is for?”, the cache only reduces compute, not bandwidth. The cache key is a hash of your full history. You still send the entire context every time so the server can hash it and check what's cached. Caching makes processing faster and cheaper, but it does not change how much data crosses the wire.
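A simplified sketch of why the cache can't save bandwidth: the key is derived from content the client must transmit. (Real providers key on token prefixes and their exact scheme is internal; hashing the serialized history is a stand-in here.)

```python
import hashlib
import json

def cache_key(messages):
    """Server-side view: the key is a hash of the serialized context,
    so the server needs the full context in hand before it can even
    check the cache."""
    blob = json.dumps(messages, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

history = [{"role": "user", "content": "refactor utils.py"}]
key_turn_1 = cache_key(history)

history.append({"role": "assistant", "content": "done"})
key_turn_2 = cache_key(history)
# Every appended message produces a new key, and computing any key
# requires the whole history on the server's side of the wire.
```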
There's a deeper architectural reason this happens. OpenAI runs thousands of inference servers. When you make an HTTP request, it gets routed to whichever server is available. The next request might land on a completely different machine. There's no affinity, no memory of who you are. Each server has to reconstruct your entire session from the payload you sent. The protocol itself makes statefulness impossible.
There's also the connection lifecycle overhead: DNS resolution, TCP handshake, TLS negotiation, HTTP framing. These costs are small individually but compound across 20+ sequential round trips.
One Connection, Zero Resends
WebSocket mode flips this model. You open a single persistent connection to wss://api.openai.com/v1/responses and keep it open for the entire session. Each turn, you send a response.create event with only incremental inputs: the new tool outputs and the previous_response_id.
The server keeps the most recent response state in a connection-local in-memory cache. No disk writes, no context reconstruction. When you reference the previous response ID, the server reuses the cached state directly. This is what makes continuation fast.
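A minimal sketch of one incremental turn. The `response.create` event type and `previous_response_id` field come from the description above; the rest of the payload shape, and the helper itself, are assumptions for illustration:

```python
import json

def make_turn_event(prev_id, tool_outputs):
    """Build one incremental `response.create` turn: only the new tool
    outputs plus the id of the response the server still holds in its
    connection-local cache."""
    return json.dumps({
        "type": "response.create",
        "previous_response_id": prev_id,  # resume from cached state
        "input": tool_outputs,            # just the delta
    })

# Over the open socket (e.g. with the `websockets` package) each turn
# then reduces to:  await ws.send(make_turn_event(prev_id, outputs))
event = json.loads(make_turn_event("resp_abc", [{"output": "2 tests passed"}]))
```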
There's also a warmup pattern: send response.create with generate: false to pre-load your tools, instructions, and context before the first real turn. The server prepares request state so the first generated turn starts faster. This is the WebSocket equivalent of warming the cache.
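The warmup step can be sketched the same way. `generate: false` is from the text above; the other field names are assumed:

```python
import json

def make_warmup_event(instructions, tools):
    """Pre-load instructions and tool definitions before the first real
    turn, so the server prepares request state without generating."""
    return json.dumps({
        "type": "response.create",
        "generate": False,  # prepare state only; don't run the model
        "instructions": instructions,
        "tools": tools,
    })

warmup = json.loads(make_warmup_event("You are a coding agent.", []))
```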
The best part: WebSocket mode is compatible with store=false and Zero Data Retention. The in-memory cache exists only on the active connection. Nothing is persisted to disk. When the connection closes, the state is gone.
WebSocket mode is less a protocol change than a guarantee. It guarantees that your requests hit the same server for the duration of the connection. That's what makes the in-memory cache possible. HTTP can't make that promise. Each request might land anywhere. A WebSocket connection pins you to a single server, and that server affinity is what unlocks everything else.
(Figure: per-turn payload comparison. Turn 1: both modes send similar initial payloads.)
The Real Numbers
Cline tested WebSocket mode with OpenAI's Codex against the standard API. The results:
- Simple tasks: ~15% faster. Short sessions have few round trips, so the upfront WebSocket handshake isn't fully amortized.
- Complex multi-file workflows: ~39% faster. Dozens of tool calls amortize the handshake cost, and the savings compound with every round trip.
- Best cases: up to 50% faster. These are workflows with 20+ tool calls, where eliminating the overhead is dramatic.
The key insight: the speedup scales with the number of tool calls. More round trips means more saved handshakes, more eliminated redundant context. It's not a fixed percentage improvement. It compounds. On bandwidth alone, the reduction is over 90%. Instead of resending the full conversation every turn, you're sending just the delta.
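The compounding shows up even in a toy transport-overhead model. The millisecond constants here are invented for illustration:

```python
def transport_overhead_ms(calls, handshake=120, resend_per_prior_turn=40):
    """Transport overhead only, ignoring model compute. HTTP pays a
    handshake per call plus retransmission of the growing history;
    WebSocket pays one handshake and then sends deltas."""
    http = calls * handshake + resend_per_prior_turn * calls * (calls - 1) // 2
    ws = handshake
    return http, ws

# At 1 call the two are even; by 30 calls the quadratic resend term
# dominates the HTTP side while the WebSocket side stays flat.
http_30, ws_30 = transport_overhead_ms(30)
```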
When It Matters Most
Agent coding workflows are the ideal use case. A typical agentic session involves reading files, grepping for patterns, making edits, validating syntax, and self-correcting, all as separate tool calls. 20+ tool calls per session is normal. 30–50 is common for complex tasks.
In HTTP mode, each of those tool calls means: full context resent (growing with every turn) + connection overhead. In WebSocket mode: just the tool output + previous_response_id on an already-open connection.
For our agent at Headstarter, a single prompt can trigger 15 iterations with multiple tool calls per iteration, plus self-healing loops that add more round trips when the compiler catches errors. That's 30–50 total round trips per session. The cumulative savings from eliminating per-call handshake and context overhead are massive at that scale.
Limitations and Trade-offs
WebSocket mode is not a universal replacement for HTTP. There are real constraints to design around:
- 60-minute connection limit. The server closes the connection after an hour. You need reconnection logic. For most agent sessions this is fine, but long-running workflows need to handle it.
- No multiplexing. One in-flight response at a time per connection. If you need parallel runs, open multiple connections.
- Cache eviction on failure. If a turn fails (4xx or 5xx), the server evicts the referenced previous_response_id from the cache. With store=false, there's no fallback, so you'll need to resend full context or use a compacted window.
- TTFT overhead on short tasks. The WebSocket handshake adds latency upfront. For single-turn requests, HTTP is actually faster. The crossover point where WebSocket wins is around 3–4 tool calls.
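A sketch of the fallback logic these constraints imply. The session object, exception names, and compaction helper are hypothetical scaffolding, not an SDK API; the strategy (reconnect, then resend a compacted window) is what the constraints above call for:

```python
class ConnectionClosed(Exception):
    pass

class CacheEvicted(Exception):
    pass

def compact(history, keep=6):
    """Naive compaction: system prompt plus the most recent messages."""
    return history[:1] + history[-keep:]

def run_turn(session, tool_outputs):
    try:
        return session.send_delta(tool_outputs)  # normal incremental path
    except ConnectionClosed:
        session.reconnect()  # hit the 60-minute limit: rebuild state
        return session.send_full(compact(session.history))
    except CacheEvicted:
        # A failed turn evicted the cached previous_response_id; with
        # store=false there's no server-side copy, so resend context.
        return session.send_full(compact(session.history))
```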
None of these constraints undermines the core value. WebSocket mode is a targeted optimization for the workload that matters most: long-running, tool-call-heavy agent loops.
Where Agent Infrastructure is Heading
We got here because tool calling was stapled onto APIs that were designed for single-turn request-response. Nobody planned for agents making 30 sequential calls. The chat completions API was built for chatbots, not orchestration loops. Agent loops are conversational, and the transport mismatch was always there, but we accepted it because there was no alternative at the API level.
WebSocket mode is the first major API-level acknowledgment that agent workflows need their own transport. Persistent connections, incremental state updates, server-side context management. These are the primitives that agent harnesses actually need. Not bigger context windows. Not faster models. Better infrastructure.
We expect this pattern to spread. Every API provider serving agentic workloads will need to solve the same problem: how do you make 30+ sequential round trips fast enough that the human waiting for results doesn't lose patience?
OpenAI has also open-sourced their Responses API format as the Open Responses standard, with the WebSocket piece expected to follow. If that happens, any provider could implement the same persistent connection semantics. The protocol won't be locked to one vendor.
The agents that win won't just have better prompts or better models. They'll have better infrastructure. The transport layer is part of that infrastructure. And persistent connections are the right answer for conversational workloads.