Part 1 of 7 · The Execution-State Continuity Layer

The Missing Layer: Why AI-Native Systems Need Execution-State Continuity

TL;DR

The execution-state continuity layer is the missing third layer of an AI-native system, sitting beneath memory and orchestration. Its job is to keep the live runtime — process tree, PTY, file descriptors, and local sockets — alive as a single-homed, ownerless object with its own identity, so humans, AI agents, devices, and services attach to one running execution as operators instead of each client owning a runtime that dies when its connection drops. We built persistent memory and workflow orchestration well; we never built the layer that keeps the live execution alive, and every long-horizon agent now hits that wall. This is the shift from a controller model to an operator model.

An agent has been working for an hour. It cloned the repo, installed the toolchain, started a dev server, opened a connection to the database, and is now nine steps into a ten-step migration. Then the laptop lid closes. Or the desktop app ships an auto-update and restarts. Or the train goes into a tunnel and the WebSocket drops for forty seconds.

When you come back, the agent’s memory is pristine. It remembers every decision, every file it touched, the summary of the plan, the note it wrote to itself about the edge case in step seven. What it does not have is the dev server, the database connection, the half-applied migration, or the shell that was waiting on a sudo prompt. The autobiography survived. The runtime did not.

Figure: A client disconnects from a running agent. Memory (transcripts, notes) is safe, but the live runtime (process tree, dev server, DB connection, shell) is lost.

That closed lid is the version of the wall everyone has felt in their own hands — the work was right there, and then it wasn’t. It is just the most relatable version, though, not the deepest one. Move the agent to the cloud — run it server-side, where every serious runtime already runs it — and the same wall reappears the moment two operators, two devices, or a host migration enter the picture. Surviving your own disconnect is the easy half; the hard half shows up when the execution has to be reachable by someone other than the process that started it.

This is not a bug in any particular product. It is a missing layer in the entire stack. The industry has, over the last few years, built two of the three layers an AI-native system needs — and built them very well. It has not yet built the third. This article names that third layer, traces why it is missing, and shows the evidence that the whole field is now converging on it from different directions at once.

Two layers we got right

Step back and look at the architecture every serious agent system has converged on. There are two layers almost everyone now agrees on.

The first is memory. This is the durable record of what happened: conversation transcripts, vector embeddings, rolling summaries, profile and rules files, retrieval pipelines over episodic and semantic stores. An enormous amount of excellent engineering has gone here. Memory serializes cleanly to disk, replicates trivially, and is explicitly designed to outlive any single session. When people say an agent “has long-term memory,” this is the layer they mean. It is, by now, a solved-enough problem that it has commodity infrastructure.

The second is orchestration. This is the logic that decides what to do next: the agent loop, the planner, the task graph, the tool-dispatch layer, the subagent fan-out, the durable-workflow engine that guarantees a multi-step process completes even across restarts. This layer too is mature. Temporal and the wider durable-execution lineage, with neighbors such as Orleans and Dapr, solved the hard problem of making a logical workflow survive failure, most characteristically by deterministically replaying it. Agent frameworks layered planning and tool-calling on top. When people say an agent “can run a long task reliably,” they mean this layer.

Memory answers what did I learn and decide. Orchestration answers what should I do next, and how do I make sure the plan finishes. Between them they cover a remarkable amount of ground. And yet the opening scenario — the closed laptop, the dropped socket, the reaped process — is untouched by either of them. Memory remembered the plan. Orchestration would happily re-issue the next step. Neither one kept the live runtime alive.

The layer we skipped

There is a third object in the system, and it is neither memory nor orchestration. It is the live execution state: the running process tree, the pseudo-terminal with its scrollback and line discipline, the open file descriptors with their offsets and locks, the bound listening socket, the live variables resident in the process’s user-space address space. This is not what happened and it is not what to do next. It is what is happening, right now, on a real machine. (One thing that looks like it belongs in this tuple but does not: the in-flight TCP connection to an external database or exchange. Half of that connection lives in a remote peer’s kernel, which no local layer can hold — it belongs to the application protocol, not the execution-state layer. The boundary paragraph below makes this precise.)

And in almost every system shipping today, that object has no independent existence. It is an implementation detail of whichever client happened to spawn it. The process tree is parented to a session that dies when the client disconnects. The PTY belongs to the terminal that opened it. The socket lives and dies with the connection that made it. When the transport drops — disconnect, restart, crash, device switch — the execution state is annihilated, silently, by construction. There is no layer whose job is to keep it alive.
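The coupling described above can be seen directly on a POSIX host. The following is a minimal sketch using only Python's standard library: by default a child inherits the spawning client's session, so it is exposed to the hangup delivered when that session's terminal goes away, while `start_new_session=True` (which makes the child call `setsid()`) gives it a session of its own. The helper name `child_session_id` is illustrative, and detaching a session is only the first ingredient of continuity, not a continuity layer.

```python
import os
import subprocess
import sys

# Tiny child program: report which session this process belongs to.
CHILD = "import os; print(os.getsid(0))"

def child_session_id(detached: bool) -> int:
    """Spawn a child and return the session ID it reports for itself."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        capture_output=True,
        text=True,
        check=True,
        start_new_session=detached,  # True -> the child calls setsid()
    )
    return int(result.stdout.strip())

parent_sid = os.getsid(0)
inherited_sid = child_session_id(detached=False)
detached_sid = child_session_id(detached=True)

print("parent session:   ", parent_sid)
print("inherited session:", inherited_sid)  # same session as the client
print("detached session: ", detached_sid)   # a session of its own
```

The inherited child lives and dies inside the client's session; the detached one no longer does, but nothing yet gives it an identity that another client could address.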

That is the missing layer. Call it the command-operator execution layer — descriptively, the execution-state continuity layer: it makes the live runtime — process tree, PTY, file descriptors, and local sockets — a first-class, single-homed, ownerless object with its own identity, so that humans, AI agents, devices, and services attach to one running execution as operators (detaching and reattaching across client, transport, and device) instead of each client owning a runtime that dies when its connection drops. The execution is single-homed; what is distributed is the set of operators attaching to it. This is the shift from a controller model, where whichever client holds the connection is the holder of the runtime, to an operator model, where the runtime is the durable thing and every client — including the one that spawned it — is a replaceable attached reference. (“Operator” here is the human-factors sense — an actor that operates a live system from inside it — not the Kubernetes Operator pattern, which is itself a controller reconciling desired state from above; the two are nearly opposite.)

One honest clarification up front, because the tuple above mixes things of very different difficulty. There are three distinct continuity regimes hiding in that scenario, and a serious continuity layer must be precise about which it owns.

  1. The client detaches while the host lives: laptop lid, dropped socket, app restart. The runtime keeps running; a later client re-attaches. This is the regime the layer owns outright, and it is the one this series is about.
  2. The host itself dies or migrates. Now you are in checkpoint/restore territory (the CRIU lineage, productized by pause/resume sandboxes), solvable for memory and process state, with real cost and limits. In practice this regime is approached on a spectrum: from persisting and recovering session state across restarts (available today) toward full live-memory checkpoint/restore (the harder end of the same axis).
  3. A live external connection survives a host migration: the in-flight socket to a database or an exchange. This one is not a layer problem at all: the peer on the other end holds its own half of the connection in its own kernel, and no amount of local continuity can rewrite a remote server’s socket state. That regime is owned by the application protocol (reconnect, resync by sequence number, idempotent operations), not by the runtime. A continuity layer that claimed otherwise would be lying about physics.

So when this series says the layer keeps the live execution alive, it means regime (1) as the core, reaching into (2); regime (3) it deliberately hands back to the protocol. Naming that boundary is not a weakness of the category; it is the category, drawn honestly. (Part 6 makes the boundary explicit; Part 7 walks each failure mode.)
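What regime (3) asks of the application protocol can be sketched in a few lines. This is an illustrative toy, not any particular database or exchange protocol: `Peer` and `Client` are hypothetical names, and the peer is assumed to keep an ordered event log with 1-based sequence numbers. After a reconnect the client asks only for what it missed, and applying an already-seen event is a no-op.

```python
from dataclasses import dataclass, field

@dataclass
class Peer:
    """Hypothetical remote service that keeps an ordered event log."""
    log: list[str] = field(default_factory=list)

    def events_after(self, seq: int) -> list[tuple[int, str]]:
        # Sequence numbers are 1-based positions in the log.
        return [(i + 1, e) for i, e in enumerate(self.log) if i + 1 > seq]

@dataclass
class Client:
    """Local side: remembers the last sequence number it applied."""
    last_seq: int = 0
    applied: list[str] = field(default_factory=list)

    def apply(self, seq: int, event: str) -> bool:
        # Idempotent: an event at or below last_seq was already applied.
        if seq <= self.last_seq:
            return False
        self.applied.append(event)
        self.last_seq = seq
        return True

    def resync(self, peer: Peer) -> int:
        """After a reconnect, fetch and apply only what was missed."""
        missed = peer.events_after(self.last_seq)
        return sum(self.apply(seq, event) for seq, event in missed)

peer = Peer(log=["create table", "copy rows"])
client = Client()
client.resync(peer)                       # first connection: two events
peer.log.append("drop old table")         # happens while the client is offline
missed = client.resync(peer)              # reconnect: only the new event
duplicate = client.apply(2, "copy rows")  # a redelivered event is ignored
print(missed, duplicate, client.applied)
```

Nothing here keeps the original socket alive. The connection is allowed to die, and correctness comes from the sequence number and the idempotent apply, which is why this regime belongs to the protocol rather than to the runtime.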

Here is the whole mental model in one picture.

A clarifying note on the geometry before the picture. The three are best read as three independent concerns, not a strict vertical stack — memory and execution state are orthogonal axes (Part 2 makes that precise), and orchestration is a third axis again. The reason continuity is drawn at the bottom is not that the others are built out of it byte-for-byte, but that both of the others quietly assume a live runtime exists: memory correlates its transcript to a runtime, and orchestration re-issues steps into one. Continuity is the concern the other two take for granted. That is what the diagram means by “beneath.”

Figure: The three concerns of an AI-native system. Memory and Orchestration are well-built and both assume a live runtime; Execution-State Continuity is the missing concern they take for granted.

Two of those boxes have a decade of infrastructure behind them. The bottom box, in most stacks, is empty — or, more precisely, it is filled by accident, by whichever transport happened to open the connection, and it evaporates the moment that transport goes away.

A short lineage of the live session

The strange thing is that the problem of keeping a live session alive across a dying client is one of the oldest themes in systems software. The field has solved it, partially, over and over — and each solution stopped one slice short of the general object.

Figure: Evolution of the live session. Screen, tmux, tmate, Jupyter, Guacamole, cloud workspaces, Live Share, and agent runtimes each kept one slice persistent while the runtime stayed lost, until the execution-state layer.

GNU Screen (1987) and then tmux (2007) decoupled the terminal UI from the parent shell. A background daemon held the PTY master/slave pairs and the screen buffers, so when your SSH connection dropped, the shell and everything under it kept running, ready to re-attach. It solved local process survival, but it died with the host, and it knew nothing beyond the terminal.
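The holder idea can be illustrated with Python's standard `pty` module on a POSIX host. This is a toy under stated assumptions, not how Screen or tmux are implemented: the function `run_under_pty` and the echoed message are made up for the example. A holder process owns the PTY master and a scrollback buffer, the child only ever sees the slave side, and any later viewer is served from the holder's buffer.

```python
import os
import pty
import subprocess

def run_under_pty(command: list[str]) -> bytes:
    """Run a command attached to a fresh PTY and return everything it wrote."""
    master_fd, slave_fd = pty.openpty()
    child = subprocess.Popen(
        command,
        stdin=slave_fd, stdout=slave_fd, stderr=slave_fd,
        start_new_session=True,  # no tie to the caller's terminal session
    )
    os.close(slave_fd)  # the holder keeps only the master side
    scrollback = bytearray()
    while True:
        try:
            chunk = os.read(master_fd, 1024)
        except OSError:   # Linux reports EIO once the child side is closed
            break
        if not chunk:     # other platforms signal end-of-file instead
            break
        scrollback.extend(chunk)
    child.wait()
    os.close(master_fd)
    return bytes(scrollback)

scrollback = run_under_pty(["sh", "-c", "echo migration step 9 done"])
# Any later "client" is served from the holder's buffer, not from the child.
first_viewer = scrollback.decode()
second_viewer = scrollback.decode()
print(repr(first_viewer))
```

The limits named above show up immediately: the buffer lives in one process on one host, and the only thing it knows how to hold is a terminal.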

tmate extended that across the network, opening an outbound tunnel to a relay and minting a session token so that multiple remote clients could attach to one live PTY through NAT and firewalls. It solved relay-mediated, multi-client terminal sharing — but it was still, fundamentally, a terminal.

The Jupyter kernel generalized the idea past terminals entirely. An independent kernel process holds your variables, imports, and connections in memory and speaks a ZeroMQ-based messaging protocol, so notebook front ends can disconnect and reconnect without losing the computation. It solved decoupling a live heap from the UI, but it bound that heap to one kernel process, and it was for interactive notebooks, not the general runtime.

Apache Guacamole carried the theme to the graphical desktop, with a guacd daemon translating RDP/VNC/SSH into a standardized display stream delivered to a browser over WebSocket. It solved clientless remote access — but, tellingly, what it persists is a display surface. It normalizes heterogeneous output protocols into pixels a viewer renders; the client can only watch, never become the execution. That is the altitude difference worth holding onto: a reattaching client of a display proxy resumes a video, whereas a reattaching operator of an execution object grabs the controls. The execution-state layer normalizes clients to one execution object that exposes observe-and-mutate rights with per-operator attribution over the live tuple — “view a render stream” versus “hold transferable authority over the live execution.”

Cloud workspaces — Gitpod, early Replit and Daytona containers — moved the whole environment off the laptop and bound a persistent disk volume to a branch. They solved environment reproducibility and storage durability — but when the workspace stops, only the disk is backed up. The running compiler, the loaded variables, the open socket, the half-applied migration are discarded; a fresh container is provisioned on restart. This is storage-level persistence, not execution continuity.

VS Code Live Share pushed hardest of all on the multi-actor edge. It already puts a human — and, as of the January 2026 “Agent Sessions” work in VS Code 1.109, an AI agent — into one shared terminal, with shared servers and grant-based asymmetric view/edit access. By the standard of every node before it, this is the closest the field came to the described object. And the durable-host objection has a real answer: pair Live Share with a cloud-hosted backend — Codespaces running the host, a shipping combination — and the session no longer dies when the laptop lid closes. So the surviving distinction is not host-durability; the hybrid has that, to the same degree this series concedes for the pause-resume sandboxes below. The distinction is ownership of identity. Live Share — even cloud-hosted — always has a privileged host-owner: one occupant whose VS Code instance is the session, through whom every guest is routed, and whose departure ends it. The guests are projections of that owner’s runtime. The operator model requires the inverse: the execution state itself is the durable holder, with no privileged host-occupant — every operator, including the one that spawned it, is a replaceable attached reference, and none of them leaving ends the execution. The collaboration was real; the identity stayed owned. What the lineage never reached is ownerless identity.

Then came the agent runtimes — Devin, Cursor, OpenHands and the rest — which added autonomous loops running shells, editors, and headless browsers. And here the gap got worse before it got better, because these systems layered a magnificent memory architecture on top while leaving the runtime as disposable as it had always been. The agent could remember everything and keep nothing alive between turns.

Each phase persisted a little more of the live world. Screen and tmux persisted a terminal. Jupyter persisted a heap. Guacamole persisted a display. Cloud workspaces persisted a disk. Live Share persisted a shared session — but anchored to a privileged owner. None of them persisted the execution state itself — the full live tuple of process tree, PTY, descriptors, and local sockets — as a first-class, ownerless object that outlives whatever client opened it. The lineage was converging on something none of these names quite captured.

There is a reflexive rebuttal that this lineage seems to invite, and it is worth killing on the spot: just run the agent server-side, in a tmux session or a long-lived container, and the problem disappears. That fix is correct and insufficient — which is exactly why every serious agent runtime already does it. Running server-side survives the disconnect (regime 1) for clients that can reach that one host’s socket — tmux even lets several attach at once, but only locally, unattributed, and only while the host lives; that is the whole reason the convergence evidence below exists. What it does not give you is the execution as a first-class, addressable object that multiple operators — a phone, a CLI, an AI — attach to and hand off, with one identity across transport and host. The gap the lineage keeps circling is not survival. The runtimes solved survival a decade ago. The gap is shape: whether the live execution is a thing you can name and route to independently of who is currently holding it.

The vocabulary problem

Part of why the layer stayed missing is that we lacked the words to point at it. The same loose terms get reused for fundamentally different objects, and the conflation hides the gap. It is worth being precise, because the rest of this series — and arguably the next phase of the field — depends on holding these distinctions.

Weak / generic framing | Strong / precise framing
AI agents | command-operator execution systems
memory persistence | execution-state persistence
workflow orchestration | execution continuity
tool calling | persistent execution identity
workspace lifecycle | session-state primitive
AI as controller (controller model) | AI as operator (operator model)
transport layer | execution continuity layer
execution = running/stopped | execution state as a persistent object
workflow graph (logical/replay) | execution graph (live OS state)

The left column is how the gap gets talked around. “Persistence” gets used for both a vector store and a live socket, as if they were two grades of the same thing rather than different objects with different lifespans and different failure modes. “Orchestration” gets used both for a planner and for the substrate the plan runs on. The right column is the language that lets you say what is actually missing: not better memory, not a better planner, but execution state as a persistent object, addressed independently of any client.

Four edges of one category

This is the flagship of a seven-part series, and each of the other articles is a single edge of the category named here. The next four sharpen the core distinctions; the final two stress-test the category at its boundary and its failure modes. In one line apiece, the four edges developed next:

  1. Memory is not execution state. Remembering that you started a server is not the same as a server being alive on a socket; the most common category error in agent design conflates the autobiography with the pulse.
  2. Steered, not replayed. Durable-execution engines reconstruct a logical workflow by deterministic replay; an execution-state system observes and steers the live OS state — these are different graphs.
  3. AI as operator, not controller. Tool-dispatch puts the model above the runtime (the controller model); the operator model puts humans, agents, and services inside the same execution through the same observe-and-mutate interface. Operators are equal in access to the mechanism, not in authority: authority is asymmetric but transferable, and where the substrate enforces a non-cooperative halt, a human can take control back. Preemption that depends on the agent yielding is not preemption.
  4. The session as a primitive. A workspace persists a disk volume; the session-state primitive persists the live execution and survives client, transport, and device — attach, detach, reattach.

Each is a consequence of taking the missing layer seriously. Each is developed in its own article.
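The replay half of edge 2 can be made concrete with a toy. This is a deliberately simplified sketch of deterministic replay, not Temporal's or any other engine's API: `ReplayContext`, `migration`, and the step names are hypothetical. Completed steps are answered from a durable history, so a resumed run re-executes only what the crashed run never reached.

```python
from typing import Callable

class ReplayContext:
    """Toy durable-execution context: results are recorded in a history list."""

    def __init__(self, history: list[str]):
        self.history = history          # durable record of completed step results
        self.position = 0               # how far this run has progressed
        self.executed: list[str] = []   # steps that really ran in this run

    def step(self, name: str, action: Callable[[], str]) -> str:
        if self.position < len(self.history):
            result = self.history[self.position]  # replay: reuse the record
        else:
            result = action()                     # first time: do the work
            self.history.append(result)
            self.executed.append(name)
        self.position += 1
        return result

def migration(ctx: ReplayContext) -> list[str]:
    # The workflow must be deterministic: same steps, same order, every run.
    return [
        ctx.step("backup", lambda: "backup ok"),
        ctx.step("alter", lambda: "alter ok"),
        ctx.step("verify", lambda: "verify ok"),
    ]

history: list[str] = []
first = ReplayContext(history)
first.step("backup", lambda: "backup ok")
first.step("alter", lambda: "alter ok")   # imagine a crash right after this

resumed = ReplayContext(history)          # new process, same history
results = migration(resumed)
print(resumed.executed)                   # only the step the first run never reached
```

What replay recovers is the logical position in the workflow. Nothing in `history` can bring back a dev server or a shell that died with the first process; that live OS state is the other graph, and it can only be observed and steered while it is still running.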

Why now

For a single-turn assistant, none of this bites. You ask, it runs one command, it answers; if the runtime evaporates afterward, nobody notices. The gap was tolerable precisely because agents were short.

They are not short anymore. Long-horizon agents now run for hours across hundreds of sequential tool calls — multi-file refactors, staged migrations, deep-research loops, overnight build-and-test pipelines. Over that horizon the cost of a missing execution-state layer compounds. Every disconnect demolishes a runtime the agent then rebuilds from memory. Every crash turns “resume” into “redo,” and redo against a partially-mutated environment is how you get duplicate writes and corrupted state. The economics are unforgiving: LLM calls are slow and expensive, and an agent that fails at step nine and can only restart burns the whole budget and risks compounding the damage on retry.

So it should be no surprise that the most serious long-horizon systems are independently, and visibly, reaching for the same primitive — under different names, from different starting points. Read the public record as evidence of convergence, not as a scoreboard:

  • Devin runs its execution sandbox (“Devbox”) in a cloud tenant connected to the controller over an outbound relay, so a task continues after the developer’s laptop closes — a clean separation of a stateless controller from a long-lived execution plane.
  • Warp has been moving terminal execution toward a background daemon and a cloud-relayed, shareable agent session, where multiple participants attach to one live terminal in real time — multi-actor attachment to a running execution.
  • OpenHands routes every action through a persistent tmux session and a resident IPython kernel inside its workspace, so directories, environment variables, and in-memory state survive across hundreds of discrete actions, and a human can attach to the same live filesystem mid-task — daemon-managed runtime continuity.
  • Claude Code drives long-running orchestration and large parallel subagent fan-out, pushing on exactly the long-horizon coordination that exposes how disposable the underlying runtime still is.
  • E2B makes the live sandbox itself durable — whole-microVM (Firecracker) pause-and-resume of memory, process trees, and loaded variables; stable addressing that survives hibernation and host migration — treating the running environment as a serializable object. Daytona pushes the same direction one notch weaker: it persists the workspace filesystem across stops but clears volatile memory, so the environment survives while the live process state does not — durable environment, not durable live execution.

Figure: Convergence map. Devin, Warp, OpenHands, Claude Code, E2B, and Daytona each drift from a different starting point toward a shared execution-state continuity layer.

These are different teams solving different immediate problems: secure code execution, terminal collaboration, agent reliability, sandbox cost. But the shape they are each converging toward is identical. They are all, by different routes, building the layer that keeps the live execution alive and addressable independent of the client. When that many capable teams independently rediscover the same missing primitive, the primitive is real — it has simply been unnamed.

Naming the layer

So name it. The third layer of an AI-native system, sitting beneath memory and beneath orchestration, is an execution-state continuity layer. Its single responsibility is to keep the live tuple of process tree, PTY, file descriptors, and local sockets alive, observable, and addressable, decoupled from any client or transport. What it holds is an execution-state object, not a normalized display stream, and it hands in-flight external connections to the application protocol (per the boundary above). The payoff is that the question “is the server still running?” has an authoritative answer that does not depend on what an agent happens to remember doing.

Calling it a “layer” earns its keep only if there is an upward interface — something the concerns above it actually call. State it once, as an invariant: orchestration addresses execution by session identity, not by holding the connection; memory references that identity to correlate transcripts to a live runtime. The layer’s upward contract is exactly that narrow: hand me a session identity, I give you back an addressable live execution. That attach/detach/reattach contract is not merely evidence of the category — it is the interface of the category, the joint other systems bind against, and protocols are what win category wars (MCP, LSP, OAuth each became the category by being the contract, not the implementation). What runs above binds to the identity, never to the transport. This is not a fresh invention: Devin’s controller already addresses a persistent devbox it does not hold open, and Temporal’s stable workflow ID is the logical-layer analogue of exactly this contract — the same shape, drawn one layer down at the live execution. That single contract is what makes “layer” a structural claim rather than a diagram convention.
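The attach/detach/reattach contract can be sketched as an in-memory registry. This is a hypothetical illustration of the shape of the interface, not cmdop's or any vendor's API: `ContinuityLayer`, `Execution`, and the operator names are invented for the example, and a real execution would hold processes and PTYs rather than a flag. The point it shows is that callers hold a session identity, never a connection, and that no operator leaving ends the execution.

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Execution:
    """Stand-in for a live runtime; a real one would hold processes and PTYs."""
    session_id: str
    operators: set[str] = field(default_factory=set)
    alive: bool = True

class ContinuityLayer:
    """Hypothetical registry: callers hold a session identity, never a connection."""

    def __init__(self) -> None:
        self._executions: dict[str, Execution] = {}

    def spawn(self, operator: str) -> str:
        session_id = uuid.uuid4().hex
        self._executions[session_id] = Execution(session_id, {operator})
        return session_id

    def attach(self, session_id: str, operator: str) -> Execution:
        execution = self._executions[session_id]  # KeyError if unknown
        execution.operators.add(operator)
        return execution

    def detach(self, session_id: str, operator: str) -> None:
        # Detaching never stops the execution, even for the spawner.
        self._executions[session_id].operators.discard(operator)

layer = ContinuityLayer()
sid = layer.spawn("laptop-cli")
layer.attach(sid, "agent")
layer.detach(sid, "laptop-cli")   # the spawner leaves
layer.detach(sid, "agent")        # so does everyone else
execution = layer.attach(sid, "phone")
print(execution.alive, sorted(execution.operators))
```

The spawner gets no special treatment in `detach`, which is the ownerless property in miniature: the identity belongs to the execution, and every operator, including the first, is a replaceable reference to it.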

There is a fair skeptic’s reply here: granted the concern is coherent, why must it be a horizontal layer rather than a feature baked into each runtime, the way retry logic lives inside every framework and never became a shared layer of its own? The answer is heterogeneity of attach. A retry concern is single-actor — the runtime retries its own call, and nobody outside it ever needs a handle on that retry — so it can stay buried inside the runtime forever. Execution-state continuity is the opposite: its whole point is that a phone, a CLI, a third-party agent, and a background service must reach the same live execution. A continuity concern sealed inside one runtime cannot be attached-to by a client of a different vendor or a different modality — there is no handle exposed below the runtime for them to grab. The moment heterogeneous, cross-vendor, cross-modality clients must converge on one execution, the concern has to be exposed beneath all of them, as a shared object they can each address. That heterogeneous-attach requirement is exactly what forces the concern out of any single runtime and makes it horizontal — and it is precisely the requirement retry never has.

The same heterogeneity-of-attach argument answers a second, opposite objection — that this names a feature, not a category, and that the real category is “the agent runtime,” with execution-state continuity as one section of it. But “the agent runtime” is a product category: a thing a single vendor ships. Execution-state continuity is a cross-vendor interface category — the thing heterogeneous runtimes must each expose in order to interoperate, the shared object a phone, a CLI, and a third party’s agent all address regardless of who built the runtime underneath. TCP/IP is a layer, not a feature of any one network appliance, precisely because it is the contract appliances from different vendors must meet to interoperate; execution-state continuity sits at the same altitude. A feature lives inside one product; an interface is what independent products converge on — and the heterogeneous-attach requirement is exactly what makes this the latter.

Stated as positioning, the category is this: a command-operator execution layer for AI-native computing, where humans, AI agents, devices, and services attach as operators to the same long-lived, single-homed execution. Not a transport layer. Not an agent platform. A continuity layer for the live runtime — the box at the bottom of the diagram that the field has been leaving empty.

Memory and orchestration were the right two layers to build first, and the work on them was not wasted. But an agent with a perfect memory and a reliable planner, running on a runtime that dies with its client, is an agent with a flawless autobiography and amnesia about the machine in front of it. The next phase of AI-native infrastructure is the layer that closes that gap.

cmdop (cmdop.com) is one reference implementation of this category: the live execution state kept continuous and addressable beneath whichever client, human or agent, attaches to it. It owns regime (1), a running process re-attachable by any client over a relay, and addresses regime (2) today through session persistence and recovery across restarts and reconnects, with deeper live-state checkpointing on the roadmap; regime (3) it hands to the application protocol, and says so. The broader point stands regardless of any implementation: the missing layer has a shape, it has a name, and the whole industry is already building toward it. The rest of this series walks its edges.


See it in the product: the shipped embodiment of this layer is the cmdop architecture spine — single-homed, ownerless, operator-model — and its block-diagram realization.