Failure Modes of a Continuity Layer
An execution-state continuity layer is stress-tested by nine failure modes, and a serious claim must say, mode by mode, what it guarantees versus what it hands back to the application: client disconnect (layer owns), host death / migration (reaches in via checkpoint/restore, must fence against split-brain), external-connection drop (handed off to the application protocol), fork / branch (internal state copy-on-write, external effects to idempotency + sagas), replay divergence (avoided by construction — it steers live state, never replays), partial / half-applied mutation (a transaction problem, not a continuity one), multi-actor conflict (serialized, attributed writes — but semantic conflict is policy), cross-actor poisoning / confused deputy (bind the originator, expose a gating seam), and idle / no-return reaping (an eviction discipline must exist). And the one explicit non-invariant: a live external connection surviving a host migration — the remote peer holds its own half of the socket in its own kernel, so the layer refuses to promise it. Naming the thing it will not guarantee is what makes everything it does guarantee believable.
A category essay names a thing. An infrastructure claim has to survive contact with the failure cases. When Part 1 of this series named the execution-state continuity layer — the live tuple of process tree, PTY, file descriptors, and sockets, elevated to a first-class object that outlives any client — the fairest expert objection was not “this category doesn’t exist.” It was sharper: formalize the failure modes. Disconnect, migration, fork, replay divergence. Show, mode by mode, what the layer actually guarantees and what it quietly leaves to someone else. Until you do that, “continuity layer” is a slogan, not an architecture.
This article does exactly that. It enumerates every way a continuity layer is stress-tested and states, for each, four things: the trigger, what is at risk, what the layer guarantees, and — the part most marketing omits — what the layer cannot guarantee and therefore hands to the application, plus the observable signature by which you recognize the mode in production.
The discipline here matters more than the rhetoric. A continuity layer that claims to guarantee everything is lying about physics, and the lie surfaces precisely at the failure boundary. The honest version is more useful: it owns some modes outright, reaches into others at a stated cost, and refuses a few entirely — and it tells you which is which before the incident, not during it.
The three regimes, restated
Part 1 drew the line that this article walks. Three continuity regimes hide inside “keep the execution alive,” and they are of wildly different difficulty:

- Regime 1 — the client detaches while the host lives. Laptop lid, dropped WebSocket, app restart. The runtime keeps running; a later client re-attaches. The layer owns this outright.
- Regime 2 — the host itself dies or migrates. OOM kill, node failure, scale-down. Now you are in checkpoint/restore territory — the CRIU lineage from Part 3 — solvable for process and memory state, at real cost and with real limits.
- Regime 3 — a live external connection must survive a host migration. The in-flight socket to a database, an exchange, a third-party API. This is not a layer problem at all. The peer on the other end holds its own half of the connection in its own kernel, and no amount of local continuity rewrites a remote server’s socket state.
Every failure mode below lands in one of these regimes, and the regime tells you, in advance, who owns the recovery.
1. Client disconnect (regime 1)
Trigger. The transport between a client and the live execution drops: a laptop sleeps, Wi-Fi changes, the desktop app auto-updates and restarts, a phone goes through a tunnel.
At risk. In a client-owned runtime — the default everywhere — everything. The process tree is parented to the session the client opened; the PTY belongs to the terminal; the socket dies with the connection. Disconnect equals annihilation, silently, by construction. That is the gap Part 1 named.
Layer guarantees. The runtime and its execution state survive the transport’s death. The process tree keeps running, the PTY keeps its line discipline and scrollback, file descriptors keep their offsets and locks. A later client — the same device, a different device, a human or an agent — re-attaches and the view is restored. Output accumulated while detached is held in a bounded buffer and made available to the re-attaching client — continuity of the live object, not a replayed historical log — with the honest caveat that very-long-detached output can roll off the bounded window.
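The bounded-buffer behaviour described above can be sketched in a few lines. This is an illustrative model only: `DetachedOutputBuffer`, `max_lines`, and `reattach` are hypothetical names, and a line-count window is an assumption (a real layer might bound by bytes or by time instead).

```python
from __future__ import annotations

from collections import deque


class DetachedOutputBuffer:
    """Holds the most recent output lines while no client is attached."""

    def __init__(self, max_lines: int) -> None:
        # deque(maxlen=...) discards the oldest entry once the window is full.
        self._lines: deque[str] = deque(maxlen=max_lines)
        self._total_written = 0

    def write(self, line: str) -> None:
        self._lines.append(line)
        self._total_written += 1

    def reattach(self) -> tuple[list[str], int]:
        """Return (retained lines, number of lines that rolled off the window)."""
        retained = list(self._lines)
        rolled_off = self._total_written - len(retained)
        return retained, rolled_off


buf = DetachedOutputBuffer(max_lines=3)
for i in range(5):
    buf.write(f"line {i}")

retained, rolled_off = buf.reattach()
# retained is ["line 2", "line 3", "line 4"]; rolled_off is 2
```

Reporting the rolled-off count on re-attach is one way to make the caveat visible to the client rather than silent: the client learns that history exists which the live object no longer holds.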
Cannot guarantee / hands to app. The live state is preserved end to end — this is the regime the layer owns. Two residuals, though, are not nothing. First, history: output beyond the bounded buffer window of a long-detached interval is not held by the live object; that detached-interval history falls to the memory layer (consistent with Part 2’s split between live state and durable memory), not to the continuity layer. Second, application-level: a client that assumed it was the sole owner of the runtime may need to tolerate finding the state advanced when it returns.
There is also a cost the rest of the series, which prices only LLM spend, never names: keeping a session live across a detach is not free. Every idle detached session pins a process tree, its resident memory, and a slot on a host — a standing residency cost that accrues whether or not any client is attached. This is the category’s structural trade, the mirror image of a replay engine’s scale-to-zero: replay buys cheap idling at the price of cold reconstruction; the operator layer buys live continuity at the price of standing residency. It is a property of the category, not a policy choice of any one implementation.
Observable signature. A client gap followed by a clean re-attach to a still-live runtime; PID lineage unchanged; buffered output available on re-attach (bounded); no cold start.
2. Host death / migration (regime 2)
Trigger. The host holding the execution disappears or must move: an OOM kill, a node hardware fault, a spot-instance reclaim, a deliberate scale or bin-packing event.
At risk. The live memory image — heap, registers, the populated REPL namespace, the half-built in-memory data — plus the local OS state (open files, the process tree itself). Unlike regime 1, the substrate beneath the state is gone.
Layer guarantees. Where checkpoint/restore is supported, process and memory state are restorable: the CRIU lineage (ptrace seizure, memory-page dump, register and fd capture) makes a faithful freeze-and-thaw possible on a compatible host. The execution recovers as an identity — addressable independent of which host now holds it — rather than as a cold restart from disk.
Cannot guarantee / hands to app. Three honest limits. First, cost and compatibility: a checkpoint is not free to take and a restore is not instantaneous; large memory images move slowly, and restore generally demands a matching kernel and ISA (Part 3’s caveat). Second, restore latency is not merely a cost; it can lose a race. A multi-gigabyte restore can take minutes, while the disconnect tolerance that triggers re-homing is measured in seconds. For large images the practical outcome may be a degraded or cold result even though the logical identity is preserved: the user may reconnect before the restore lands, or the regime-2 “reach” may simply lose to the clock. Identity continuity does not imply latency continuity. Third, and this is the hard one, a remote socket does not survive the migration. TCP_REPAIR can re-establish local socket bookkeeping, but the peer on the far end never agreed to the move. The moment migration touches an outbound connection to a database or an API, you have left regime 2 and entered regime 3, where the layer no longer owns the outcome.
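The race between restore latency and disconnect tolerance is simple arithmetic. The sketch below uses made-up numbers (an 8 GiB image, 100 MiB/s of effective restore throughput, a 10-second tolerance); none of them come from a measured system, and the model ignores fixed overheads such as scheduling the new host.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def restore_seconds(image_bytes: int, throughput_bytes_per_s: float) -> float:
    """Time to move and load the image, ignoring fixed per-restore overhead."""
    return image_bytes / throughput_bytes_per_s


def restore_wins_race(image_bytes: int, throughput_bytes_per_s: float,
                      tolerance_s: float) -> bool:
    """True if the restore lands within the disconnect tolerance."""
    return restore_seconds(image_bytes, throughput_bytes_per_s) <= tolerance_s


# Hypothetical figures, chosen only to illustrate the shape of the problem.
large = restore_seconds(8 * GIB, 100 * MIB)          # 81.92 seconds
large_wins = restore_wins_race(8 * GIB, 100 * MIB, tolerance_s=10.0)
small_wins = restore_wins_race(256 * MIB, 100 * MIB, tolerance_s=10.0)
```

Under these assumed figures the small image restores inside the tolerance and the large one does not, which is the sense in which identity continuity can hold while latency continuity fails.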
Fencing — the split-brain hazard, and the invariant it forces. A migration assumes the old host is gone. But the third partition case — host alive but unreachable — breaks that assumption: if the layer re-homes a session while the original incarnation is still running, it has produced two live incarnations of the same session, and the “single coherent state” property is violated by the recovery mechanism itself. This is classic split-brain. Note the choice this forces: under partition the layer chooses consistency over availability — it refuses to re-home rather than risk two live incarnations, a clean CP choice. A safe operator model must therefore make at-most-one-attachable-incarnation an invariant: the layer must refuse to re-home a session without fencing the prior incarnation. None of this is the operator model’s invention — fencing and lease revocation are standard distributed-systems hygiene (lease-based leader election, fencing tokens); the operator model inherits the requirement, it does not originate it. Scope matters, because the host-alive-but-unreachable trigger is a partition — the coordination point may be unable to reach the old host at all, so it cannot “shut it off” remotely. What the fence can enforce is twofold and neither act depends on reaching the old host: the coordination point invalidates the lease so no client can attach to or be routed to the old incarnation, and the old incarnation self-fences on lease-loss — where “observes the lease is gone” is itself a local lease-expiry timeout under partition, so there is a bounded window (the lease interval) in which the old incarnation may still act locally before it quiesces. The attachable/routable invariant holds throughout — lease invalidation at the coordination point is unilateral and immediate — while any in-window local effects fall to the same regime-3 idempotency / fencing-token discipline as external ones. 
The honest invariant is therefore at most one live incarnation that is attachable/routable; effects the old incarnation already has in flight to external systems are not reached by lease revocation and remain a regime-3 concern (see mode 3). Without an enforced fence, regime-2 recovery becomes a correctness hazard rather than a recovery. (The fencing/lease invariant is category-level; how a given implementation issues, observes, and revokes the lease is out of scope here.)
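A minimal sketch of the fencing discipline described above, assuming a single coordination point, leases with a fixed interval, and a downstream resource that checks fencing tokens. All class and method names are hypothetical, lease renewal is omitted for brevity, and time is passed in explicitly so the behaviour is easy to follow.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Lease:
    host: str
    token: int         # fencing token, strictly increasing per grant
    expires_at: float  # the holder must self-fence at or after this time


class Coordinator:
    def __init__(self, lease_interval_s: float) -> None:
        self.lease_interval_s = lease_interval_s
        self._next_token = 1
        self._lease: Lease | None = None
        self._routable = False

    def grant(self, host: str, now: float) -> Lease:
        # Refuse to re-home until the prior lease is known-expired.
        if self._lease is not None and now < self._lease.expires_at:
            raise RuntimeError("prior incarnation is not fenced yet")
        self._lease = Lease(host, self._next_token, now + self.lease_interval_s)
        self._next_token += 1
        self._routable = True
        return self._lease

    def revoke(self) -> None:
        # Unilateral and immediate: needs no contact with the old host.
        self._routable = False

    def route(self) -> str | None:
        return self._lease.host if self._lease and self._routable else None


class Incarnation:
    def __init__(self, lease: Lease) -> None:
        self.lease = lease

    def may_act(self, now: float) -> bool:
        # Local check only: under partition this is all the old host can see.
        return now < self.lease.expires_at


class FencedResource:
    """External system that rejects writes carrying a stale fencing token."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.values: list[str] = []

    def write(self, token: int, value: str) -> bool:
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.values.append(value)
        return True


coord = Coordinator(lease_interval_s=10.0)
old = Incarnation(coord.grant("host-a", now=0.0))
resource = FencedResource()

coord.revoke()                          # partition: host-a is unreachable
routable_after_revoke = coord.route()   # None: nothing routes to host-a
old_window_open = old.may_act(now=5.0)  # True: the bounded in-window period

try:
    coord.grant("host-b", now=5.0)
    early_rehome_refused = False
except RuntimeError:
    early_rehome_refused = True         # lease not yet known-expired

new = Incarnation(coord.grant("host-b", now=10.0))
new_write_ok = resource.write(new.lease.token, "from host-b")
stale_write_ok = resource.write(old.lease.token, "late write from host-a")
```

The sketch separates the two acts the text distinguishes: `revoke` removes routability at once, while `grant` to a new host waits out the lease interval. The token check at `FencedResource` stands in for the regime-3 discipline that catches effects the old incarnation emits late.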
Observable signature. A host-level event (OOM, reclaim) followed by a restore on a new host with the same logical identity; a measurable cold-restore interval proportional to image size; and — the tell — any external connection the process held now reads as reset or stale.
3. External-connection drop (regime 3)
Trigger. Any host migration or network partition that affects an outbound socket the execution holds to something it does not control: a database, an exchange, a message broker, a third-party API.
At risk. The correctness of in-flight remote operations: a query whose result never returned, an order whose acknowledgement was lost, a batch half-sent.
Layer guarantees. Here the layer’s honesty is the whole point: it guarantees nothing about the remote half of the connection, and it says so. It can keep the local execution alive and consistent, but the remote half lives in a kernel the layer has no authority over — the hard physical boundary of the entire category, which Part 6 draws in full (TCP state split across two kernels, TCP_REPAIR, QUIC). No local continuity mechanism can reach across the wire and rewrite that.
Cannot guarantee / hands to app. The recovery is owned, fully, by the application protocol, and the classics are exactly the tools: reconnect, resync by sequence number (replay from the last acknowledged offset, as a Kafka consumer or a FIX session does), and idempotent operations keyed so a retried write is recognized and de-duplicated rather than applied twice. These are decades-old, well-understood patterns. The continuity layer’s job is not to replace them — it is to not pretend it has.
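The two patterns named above, resync from the last acknowledged sequence number and idempotency keys, can be sketched together. This is a toy in-memory model: `RemoteLog`, `IdempotentSink`, and `Consumer` are hypothetical stand-ins and do not reflect the API of Kafka, FIX, or any real client library.

```python
class RemoteLog:
    """Stand-in for a remote peer that retains messages by sequence number."""

    def __init__(self, messages):
        self._messages = list(messages)

    def read_from(self, seq):
        return [(i, m) for i, m in enumerate(self._messages) if i >= seq]


class IdempotentSink:
    """Applies each keyed write at most once."""

    def __init__(self):
        self.applied = {}

    def apply(self, key, value):
        if key in self.applied:
            return False  # duplicate recognised and suppressed
        self.applied[key] = value
        return True


class Consumer:
    def __init__(self, log, sink):
        self.log = log
        self.sink = sink
        self.next_seq = 0  # last acknowledged position + 1
        self.duplicates_suppressed = 0

    def poll(self, drop_before_ack_at=None):
        for seq, msg in self.log.read_from(self.next_seq):
            if not self.sink.apply(f"msg-{seq}", msg):
                self.duplicates_suppressed += 1
            if seq == drop_before_ack_at:
                return  # connection lost after the effect, before the ack
            self.next_seq = seq + 1


log = RemoteLog(["a", "b", "c"])
sink = IdempotentSink()
consumer = Consumer(log, sink)

consumer.poll(drop_before_ack_at=1)  # "b" is applied but never acknowledged
seq_after_drop = consumer.next_seq   # still 1
consumer.poll()                      # reconnect: resume from seq 1
```

After the second poll, message 1 has been delivered twice but applied once: the resync re-sends it because it was never acknowledged, and the idempotency key collapses the repeat.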
Observable signature. A connection reset or partition on an outbound socket; the application’s own reconnect-and-resync logic engaging; idempotency keys suppressing duplicate effects. If you see the layer claiming it transparently healed a remote socket, you are looking at a bug or a lie.
This is the mode that separates a serious continuity claim from an overclaim. A layer that owns regimes 1 and 2 and openly hands regime 3 to the protocol is drawing the boundary where physics actually puts it.
4. Fork / branch
Trigger. Speculative exploration: an agent wants to try two approaches from the same point, or parallel-sample several attempts and keep the best, without destroying the shared starting state.
At risk. State integrity across the branches — and, far more dangerously, the external side effects the branches produce.
Layer guarantees. Forking internal execution state is tractable. Copy-on-write lets two branches share an unmodified base and diverge only where they actually write, so the process and its local state can be branched cheaply and coherently. The layer can own this: two live branches from one parent state, each internally consistent.
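As a mapping-level analogy for copy-on-write branching, Python's `collections.ChainMap` lets two branches share a read-only base and record only their own writes. This illustrates the sharing semantics, not the page-level copy-on-write an operating system performs on process memory; the state keys are invented for the example.

```python
from collections import ChainMap
from types import MappingProxyType

# The shared starting state, wrapped read-only so no branch can alter it.
base = MappingProxyType({"cwd": "/work", "step": 3})

# Each branch writes into its own empty overlay; reads fall through to base.
branch_a = ChainMap({}, base)
branch_b = ChainMap({}, base)

branch_a["step"] = 4            # only branch_a's overlay changes
branch_b["cwd"] = "/tmp/try-b"  # only branch_b's overlay changes

overlay_a = branch_a.maps[0]    # {"step": 4}: the part that diverged
```

Each branch pays only for what it writes, and discarding a branch means dropping its overlay while the base stays untouched.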
Cannot guarantee / hands to app. You can fork a process; you cannot fork the email you already sent. Internal state is copy-on-write; external side effects are not. If a branch charged a card, dispatched an order, or wrote to a shared database, that effect exists in the world exactly once and belongs to no single branch. Reconciling forked external effects is owned by idempotency (so a repeated effect across branches collapses to one) and compensation — the saga pattern’s compensating transactions, which undo an effect that a discarded branch should not have had. The non-forkability of external side effects is a property of the world, not a deficiency of the layer.
Observable signature. Two live branches sharing a copy-on-write base, each with coherent internal state; and at the external boundary, either idempotency keys collapsing duplicate effects or compensating actions unwinding the effects of an abandoned branch.
5. Replay divergence
Trigger. A recovery strategy that rebuilds state by logical replay — the durable-execution model from Part 3 — re-executes its workflow code and hits non-determinism: a wall-clock read, a random value, an unrecorded side effect. The rebuilt state no longer matches reality.
At risk. In a replay-based system, silent state corruption — which is why those engines raise a non-determinism error to halt rather than continue on a divergent rebuild.
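To make the halting behaviour concrete, here is a toy replay checker. It is a sketch of the general idea only: `record`, `replay`, and `NonDeterminismError` are hypothetical names and do not mirror the API of any particular durable-execution engine. The workflow's branch depends on an unrecorded value, so a replay that sees a different value emits a different command.

```python
class NonDeterminismError(Exception):
    pass


def workflow(coin_flip, emit):
    emit("reserve")
    # The branch depends on a value that the history does not record.
    emit("charge" if coin_flip() else "skip")


def record(wf, coin_flip):
    """First execution: capture the commands the workflow emits."""
    history = []
    wf(coin_flip, history.append)
    return history


def replay(wf, coin_flip, history):
    """Re-execute and halt as soon as a command departs from the history."""
    position = 0

    def check(command):
        nonlocal position
        if position >= len(history) or history[position] != command:
            raise NonDeterminismError(
                f"step {position}: replay emitted {command!r}"
            )
        position += 1

    wf(coin_flip, check)
    if position != len(history):
        raise NonDeterminismError("replay ended before the recorded history")


history = record(workflow, coin_flip=lambda: True)

replay(workflow, lambda: True, history)  # same value: replay matches

try:
    replay(workflow, lambda: False, history)
    diverged = False
except NonDeterminismError:
    diverged = True  # the engine halts rather than continue on a bad rebuild
```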
Layer guarantees. A live-execution-state layer avoids this entire failure class by construction. It does not replay. It observes and steers live OS state (Part 3’s “steered, not replayed”), so there is no event log to re-execute and therefore no determinism contract to violate. The non-determinism that breaks replay — concurrent mutation, real-time I/O, the messy reality of a running OS — is simply the medium the layer operates in, not a hazard it must forbid.
Cannot guarantee / hands to app. The converse cost, stated honestly. Because the layer holds live state rather than deriving it from a journal, it cannot cheaply reconstruct from an event log the way a replay engine can. It cannot “sleep for a month at near-zero cost” by freeing memory and replaying later; it cannot answer “re-derive the state as of step 7” from a compact history. If your problem genuinely wants cheap, deterministic, scale-to-zero logical durability, replay is the right tool and this layer is the wrong one. The two paradigms trade a determinism contract for a live-state contract; neither dominates.
Observable signature. Where the live graph survived, what you observe in recovery is a live execution graph being re-homed (re-attached to a new host so that an existing process tree resumes running) rather than state being re-derived from an event log. Where it did not survive, recovery re-establishes from persisted session state, which is not an event log either, so the no-replay point still holds. There is no replay phase in the recovery path at all, so there is no determinism-check step that could fire: the recovery sequence has no stage at which a journal is re-executed. The cost is the mirror image and equally observable: there is no log-based time-travel either. You cannot reconstruct an arbitrary past step, or scale to zero and rebuild later, because the only state that exists is the live one being re-homed.
6. Partial / half-applied mutation
Trigger. A crash mid-operation: a database migration applied to four of seven tables, a file half-written, a batch of API calls partially sent.
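At risk. The consistency of the external resource that was being mutated when the crash hit. This is the reviewer’s exact example, “a half-applied migration,” and the accompanying hazard is the temptation to treat it as a continuity problem.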
Layer guarantees. The layer can preserve the process and its local state across the crash (via regimes 1 and 2): the shell that issued the migration, the script’s local variables, the file handles. It keeps alive the agent of the operation.
Cannot guarantee / hands to app. It cannot make a multi-step external operation atomic. Correctness of “apply seven schema changes” or “send this batch exactly once” is owned by transactions (a real DB transaction makes the migration all-or-nothing at the database), sagas (compensating steps for operations too long or too distributed for one transaction), and idempotency (so re-issuing the migration recognizes what already landed). This is the precise answer to the reviewer’s point: a half-applied migration is a transaction problem, not a continuity problem. Keeping the process alive does not make a non-transactional multi-step mutation correct, and no continuity layer should claim it does. The layer’s contribution is narrow and real — it preserves the actor so the recovery logic (the transaction retry, the saga compensation) can run against accurate local state — but the atomicity guarantee lives in the data layer, not the runtime.
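A small illustration of atomicity living in the data layer, using Python's standard `sqlite3` module with explicit `BEGIN`/`COMMIT`/`ROLLBACK`. The table names and the `apply_migration` helper are invented for the example; SQLite is used only because its DDL is transactional and it needs no server.

```python
import sqlite3


def apply_migration(conn, statements):
    """Apply every statement or none. Returns True on commit, False on rollback."""
    conn.execute("BEGIN")
    try:
        for stmt in statements:
            conn.execute(stmt)
    except sqlite3.Error:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        return False
    conn.execute("COMMIT")
    return True


def table_names(conn):
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]


# isolation_level=None puts the connection in autocommit mode, so the
# transaction boundaries are exactly the explicit BEGIN/COMMIT/ROLLBACK above.
conn = sqlite3.connect(":memory:", isolation_level=None)

broken = [
    "CREATE TABLE users (id INTEGER PRIMARY KEY)",
    "CREATE TABLE orders (id INTEGER PRIMARY KEY)",
    "CREATE TABLE users (id INTEGER PRIMARY KEY)",  # fails: already exists
]
first_ok = apply_migration(conn, broken)
tables_after_failure = table_names(conn)  # nothing half-applied

fixed = broken[:2]
second_ok = apply_migration(conn, fixed)
tables_after_success = table_names(conn)
```

The runtime that issued the migration contributes nothing to the all-or-nothing outcome here; the database does. What a surviving process adds is the ability to notice the `False` result and decide what to run next.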
Observable signature. A surviving process with intact local state, sitting atop a partially-mutated external resource; recovery proceeds via DB rollback, saga compensation, or idempotent re-apply — never via the runtime claiming the external mutation completed.
7. Multi-actor conflict (concurrent observation, serialized writes)
Trigger. Many actors observe the live state concurrently while write access is serialized — a human and an AI both attached to the same shell and contending for the keyboard, or two agents acting on one execution (Part 4’s multi-actor model). The contention is over the write turn, not simultaneous writing.
At risk. Coherence of the single shared state and the ability to attribute and order the writes once they are serialized.
Layer guarantees. Part 4’s invariants apply: a single coherent execution state (not per-actor copies that drift and later have to be merged), uniform mechanics across all actor inputs regardless of origin (coherence, ordering, and attribution treat inputs the same way, though permission stays per-actor and may be asymmetric), per-actor provenance on every mutation, and a serialization order imposed on the discrete inputs actors submit (commands, edits, events) at the single canonical state, each attributed, so that “who saw what before acting” has a defined answer. Because the state is single-homed on one host with a single point of serialization, the model differentiates itself not by simultaneous writing but by attribution, transferable authority (turn-taking and handoff), and heterogeneous modality: many actors observe concurrently; write access is serialized and attributed. That ordering applies to discrete submitted inputs, not to raw keystrokes: the layer does not pretend to merge two simultaneous streams of co-typing on one stdin into a coherent intent. A byte stream is not a CRDT; two writers feeding the same terminal at once produce noise, not a mergeable structure. Concurrency at that raw level is prevented, not reconciled: it is handled by a control discipline (turn-taking, soft-locking, explicit handoff, the transferable authority of Part 4) rather than by merging. What the layer guarantees is that one host holds one serialization point and that the discrete actions taken against that single-homed state are ordered and attributed rather than left to silently clobber one another.
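One way to picture a single serialization point with a transferable write turn is the sketch below. It is a simplified model under stated assumptions: one in-process lock stands in for the single host, and `SerializationPoint`, `submit`, and `handoff` are hypothetical names rather than any implementation's interface.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class Applied:
    seq: int      # position in the single total order
    actor: str    # who submitted the input
    payload: str  # the discrete input itself (a command, an edit, an event)


class SerializationPoint:
    def __init__(self, initial_writer):
        self._lock = threading.Lock()
        self._writer = initial_writer
        self._log = []

    def submit(self, actor, payload):
        with self._lock:
            # Raw concurrency is excluded, not merged: only the turn holder writes.
            if actor != self._writer:
                raise PermissionError(f"{actor} does not hold the write turn")
            entry = Applied(len(self._log) + 1, actor, payload)
            self._log.append(entry)
            return entry

    def handoff(self, from_actor, to_actor):
        with self._lock:
            if from_actor != self._writer:
                raise PermissionError(f"{from_actor} cannot hand off the turn")
            self._writer = to_actor

    def observe(self):
        # Any actor may observe; observation does not require the write turn.
        with self._lock:
            return tuple(self._log)


point = SerializationPoint(initial_writer="human")
point.submit("human", "git status")

try:
    point.submit("agent", "rm -rf build")
    agent_blocked = False
except PermissionError:
    agent_blocked = True

point.handoff("human", "agent")
point.submit("agent", "make test")
ordered = [(e.seq, e.actor) for e in point.observe()]
```

Arrival order at the lock is the total order, and every entry carries its actor, so ordering and attribution come from the structure itself. Whether the agent's command should have been allowed at all is a question the sketch deliberately does not answer.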
Cannot guarantee / hands to app. The layer can guarantee ordering and attribution of discrete inputs; it cannot guarantee semantic non-conflict. If a human and an agent issue logically contradictory intents — one deletes the directory the other is building in — the layer will order and attribute both faithfully, but resolving the meaning of the conflict (which intent wins, whether to abort) is application and policy. Coordination at the level of intent belongs above the execution state. Nor does the guarantee reach down to the raw input level: simultaneous co-typing into one stream is not something the layer makes coherent. Keystroke-level concurrency is resolved by a turn-taking or handoff discipline that decides whose input the stream carries at a given moment — it is excluded, not merged.
Observable signature. Discrete inputs from multiple actors against one live state — each carrying actor identity, applied in a consistent order — while raw input concurrency is gated by a turn-taking/handoff discipline (only one writer holds the stream at a time) rather than two keystroke streams being interleaved into one. Semantic conflicts are surfaced for application-level resolution rather than silently merged.
8. Cross-actor state poisoning (confused deputy)
Trigger. One actor writes into the shared mutable live state — an environment variable, PATH, a shell alias, a staged command, an LD_PRELOAD hook — and a second actor then executes against that state. The first actor sets the trap; the second springs it, under the second actor’s authority.
At risk. Authority and attribution integrity. The same single shared state that makes the operator model possible is a single trust boundary: actor A poisons it, actor B acts on it, and the effect runs with B’s permissions. Worse for forensics, naive provenance attributes the effect to B — the actor who triggered it — not to A, who staged it. That is the textbook confused-deputy shape (Hardy 1988, “The Confused Deputy”), and it punches a hole in the safety story if attribution is treated as the whole of safety (see Part 4’s distinction between detective attribution and preventive gating). The per-actor permission envelope that bounds each operator is object-capability thinking (the object-capability model, Miller).
Layer guarantees. The mechanics the layer owns are coherence, ordering, and attribution of the trigger — it can always say which actor’s input caused the execution. What a safe operator model must additionally make true is two category-level invariants: authority is evaluated at the moment of the acting actor’s input against the then-current state (not once at attach time), and provenance binds the originator of a staged effect, not merely the actor who triggered it — so a poisoned PATH or alias is attributable to whoever wrote it.
Cannot guarantee / hands to app. Isolation and policy. The layer does not, by itself, decide which cross-actor writes are legitimate — that is a permission/policy question handed to a neighbor (the per-actor permission envelopes and the pre-commit gating seam of Part 4). The layer’s obligation is to expose enough — originator-bound provenance and a pre-execution interposition point — that a policy layer can gate the confused-deputy path; deciding the policy is not the continuity layer’s job.
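The two obligations named above, originator-bound provenance and a pre-execution interposition point, can be sketched as follows. The names are hypothetical, and the sketch assumes the set of state keys a command reads is already known; working that set out for a real shell is a hard problem this example does not attempt.

```python
class SharedEnv:
    """Shared mutable state that remembers who last wrote each key."""

    def __init__(self):
        self._values = {}
        self._writers = {}

    def set(self, actor, key, value):
        self._values[key] = value
        self._writers[key] = actor  # bind the originator at write time

    def originators(self, keys):
        return {self._writers[k] for k in keys if k in self._writers}


def execute(env, actor, command, reads, gate):
    """Evaluate the gate at execution time against the then-current state.

    gate(actor, originators) -> bool is supplied by a policy layer; the
    layer only exposes the seam and records the decision.
    """
    originators = env.originators(reads)
    return {
        "command": command,
        "trigger": actor,
        "originators": sorted(originators),
        "allowed": gate(actor, originators),
    }


def only_own_state(actor, originators):
    # Example policy (an assumption): act only on state the actor wrote itself.
    return originators <= {actor}


env = SharedEnv()
env.set("agent-a", "PATH", "/tmp/staged:/usr/bin")  # actor A stages the state
env.set("human-b", "EDITOR", "vim")

poisoned = execute(env, "human-b", "deploy", reads=["PATH"], gate=only_own_state)
clean = execute(env, "human-b", "edit notes", reads=["EDITOR"], gate=only_own_state)
```

The record for the first call names both the trigger and the originator, which is the difference between attributing the deputy and attributing the manipulator. The policy function is deliberately trivial; choosing a realistic one is the neighbor's job.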
Observable signature. A staged mutation by actor A (env/PATH/alias/staged command) followed by an execution triggered by actor B; correct provenance binds the originator of the staged effect, not only the trigger; and a pre-commit gating point exists where policy can refuse B’s execution against A-shaped state. If the only record is “B ran it,” the layer is attributing the deputy and missing the manipulator.
9. Idle / no-return (reaping)
Trigger. A session detaches and no client ever comes back — the laptop is never reopened, the agent run is abandoned, the device is lost. The live state sits resident indefinitely with no future attach.
At risk. Host capacity against continuity. Because the layer’s promise is standing residency (mode 1’s cost note), an unbounded population of never-returning sessions is a slow resource leak: each pins a process tree, memory, and a host slot forever. But the obvious fix — reap aggressively — directly attacks the layer’s core promise: reap too early and you break continuity for a client that would have returned; reap too late and idle sessions leak hosts.
Layer guarantees. That a reaping / eviction discipline exists is a category invariant: a continuity layer that never reclaims idle sessions is not durable, it is leaking. The existence of an eviction policy — the property that abandoned sessions are eventually reclaimed — is what the category must guarantee.
Cannot guarantee / hands to app. The specific threshold — how long is “abandoned,” what TTL or signal triggers reclamation, whether a session is checkpointed-then-evicted or destroyed — is implementation and policy, not a category property (and a concrete TTL is deliberately out of scope here). The tension between continuity and capacity is real and is tuned, not solved; where the line sits is handed to operators and policy.
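A reaping pass reduces to a selection over detached sessions with the threshold supplied from outside. In this sketch `idle_ttl_s` is a policy parameter with no recommended value, and `Session` and `select_for_reaping` are hypothetical names; whether a selected session is destroyed or checkpointed first is left to the caller.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    session_id: str
    last_detached_at: Optional[float]  # None means a client is attached


def select_for_reaping(sessions, now, idle_ttl_s):
    """Return ids of detached sessions idle for at least idle_ttl_s seconds."""
    return [
        s.session_id
        for s in sessions
        if s.last_detached_at is not None
        and now - s.last_detached_at >= idle_ttl_s
    ]


sessions = [
    Session("s1", last_detached_at=100.0),  # detached for 900 s
    Session("s2", last_detached_at=700.0),  # detached for 300 s
    Session("s3", last_detached_at=None),   # still attached
]
to_reap = select_for_reaping(sessions, now=1000.0, idle_ttl_s=600.0)
```

Lowering the threshold reclaims hosts sooner and breaks continuity for more late returners; raising it does the opposite. The function only makes that trade explicit.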
Observable signature. A detached session with no re-attach over a policy window, followed by reclamation (eviction or checkpoint-then-evict); a bounded, not unbounded, population of idle resident sessions; and a reaping event in the layer’s telemetry rather than silent unbounded growth.
The table

The image above is the glance view; the table below is the detail — every mode against the same four columns, plus the observable signature that lets you recognize it in production.
| Failure mode | Trigger | Layer guarantees | Hands to application | Observable signature |
|---|---|---|---|---|
| 1. Client disconnect (regime 1) | Laptop / network / app transport drops | Runtime + exec state survive; re-attach restores view; bounded buffer available on re-attach | Client tolerating an advanced state on return | Clean re-attach to live runtime; PID lineage intact; no cold start |
| 2. Host death / migration (regime 2) | OOM / node fail / spot reclaim / scale | Process + memory restore via checkpoint/restore (CRIU lineage) | Cold-restore cost; remote sockets do NOT survive (→ regime 3) | Restore on new host, same identity; restore latency ~ image size |
| 3. External-connection drop (regime 3) | Migration / partition hits an outbound socket | Nothing about the remote half; keeps LOCAL state alive + consistent | Reconnect + resync by seq number + idempotent ops (the app protocol owns it) | Outbound socket reset; app reconnect/resync engages; dup suppression |
| 4. Fork / branch | Speculative / parallel attempts | Copy-on-write of internal state; coherent branches | Reconciling external side effects across branches: idempotency + compensation (sagas) | COW branches, coherent internally; idempotency/compensation at boundary |
| 5. Replay divergence | Logical-replay recovery hits non-determinism | Avoids the class entirely (observes live, doesn’t replay) | Cannot cheaply rebuild from an event log (no scale-to-zero time-travel) | Live graph re-homed (proc tree resumes); no replay phase; no log time-travel |
| 6. Partial / half-applied mutation | Crash mid multi-step operation | Preserves process + LOCAL state | Atomicity via transactions / sagas / idempotency (a TRANSACTION problem) | Live process over a partly-mutated resource; recovery via DB/saga |
| 7. Multi-actor conflict (serialized, attributed writes) | Many observe; write turn is serialized (turn-taking/handoff, not co-typing) | One coherent state; uniform mechanics; per-actor provenance; serialization order on discrete inputs | Semantic conflict resolution (which intent wins): app + policy; raw input concurrency is excluded by turn-taking/handoff, not merged | Discrete attributed inputs serialized + ordered; raw co-typing gated by turn-taking, not merged; conflicts surfaced |
| 8. Cross-actor state poisoning (confused deputy) | Actor A poisons shared state (env/PATH/alias/staged), actor B executes it | Coherence + ordering + provenance; must bind the ORIGINATOR of a staged effect, not the trigger; eval authority vs then-current state | Isolation + which cross-actor writes are legit (permission envelopes + gating seam) — to a policy neighbor | Staged write by A, exec by B; provenance binds the ORIGINATOR not just the trigger; pre-commit gate exists where policy can refuse B-on-A-shaped state |
| 9. Idle / no-return (reaping) | Session detaches and no client returns | A reaping/eviction discipline EXISTS (idle sessions reclaimed) | The specific TTL/threshold + reap-vs-checkpoint choice — impl + policy | Bounded (not unbounded) idle resident population; reaping event in telemetry |
The shape of the table is the argument. Mode 1 is a column the layer fills (at a standing residency cost it must own honestly); mode 2 it reaches into along a spectrum (session-state persistence at the near end, full live-memory checkpoint/restore at the costly far end) — and must fence against split-brain when it does. Modes 3 and 6 are columns the layer deliberately hands off — to the application protocol and to the data layer respectively — and saying so plainly is what makes the rest credible. Modes 4, 5, and 7 are split: the layer owns the internal-state half and hands the external-effect or semantic half to the classic disciplines (idempotency, sagas, transactions, application policy). Modes 8 and 9 are the operator model’s own bills coming due: cross-actor poisoning is the price of one shared trust boundary (the layer must bind the originator and expose a gating seam, then hand isolation/policy to a neighbor), and idle/no-return is the price of standing residency (the layer must own that a reaping discipline exists, while the threshold is policy). None of these is a free win; each is a column the table makes the layer name out loud.
Invariants — and the one explicit non-invariant
State the guarantees as properties, not as implementation. These are what a continuity layer must make true; the mechanisms that satisfy them are an implementation concern (and, in some systems, a patented one — not the subject of this article).

-
Continuity across client transitions. The live execution outlives any client’s connection. Detach and re-attach — from the same device, a different device, a human, or an agent — do not destroy or restart the runtime. (Owns: mode 1.)
-
State identity across transports. The execution is addressable as a stable identity independent of which transport currently carries it and, where checkpoint/restore applies, which host currently holds it. Recovery is re-homing an identity, not re-deriving from a log or cold-starting from disk. (Owns: mode 2, within stated cost.)
-
Serialized, attributed ordering across actors. The discrete inputs submitted by multiple actors — commands, edits, events — are applied to one canonical state through a single serialization point, each attributed to its actor; there is no second, drifting copy that must later be merged. The ordering is best stated precisely: each actor’s inputs carry a happens-before partial order (the Lamport 1978 sense), totalized by arrival at the single serialization point. The single serialization point is what makes the total order trivial — it is not a logical-clock protocol negotiating order among hosts; it is one canonical state on one host that arrival order alone serializes. This is an ordering over discrete attributed inputs, not a claim that raw interleaved keystrokes are semantically merged — concurrency at the raw input level is held off by a turn-taking/handoff discipline, not reconciled into one intent. (Owns: mode 7’s coherence half.)
- Linearizable by construction within an incarnation; across a re-home, linearizable only because the fence orders the serializers. “Coherent” has a precise meaning here, and it is worth stating at category altitude. Within a single incarnation, because there is one host with one serialization point, operations against the state are linearizable (the Herlihy & Wing 1990 sense) by construction: each takes effect at a single point between its submission and its observed result, in an order all actors agree on. This single-serializer (sequential-bottleneck) linearizability result is textbook; single-homing inherits it rather than inventing it. Across a regime-2 re-home the serializer itself moves to a new host, so there are two serialization points across time, and “by construction” no longer carries the claim for free. Linearizability over the object’s whole lifetime holds only because the fencing invariant guarantees that the new incarnation’s serializer begins strictly after the old one’s is fenced. Under partition, “fenced” means “the old lease is known-expired,” so a safe re-home pays a lease-expiry wait on top of restore latency; the happens-before edge is bought with that wait, not granted instantaneously by the revocation. No operation is accepted by the old serializer after the fence, so the two local orders compose into a single global order with a real happens-before edge at the migration. Linearizability across migration thus rests on the fence being correct, not on construction alone. “Coherent” denotes that ordering guarantee; it is explicitly not a merge or replication guarantee (there is nothing to merge and no replica to reconcile).
  Single-homing here is a consequence of coherence, not a limitation of it. Two live copies of one coherent execution would demand consensus over a non-mergeable byte stream (a byte stream is not a CRDT), which is incoherent by construction, so “just add replication” is the wrong axis for live shared mutable execution, not a missing feature. How the single-serializer shape and the fence are built is out of scope. (Underpins modes 7 and 8; depends on the fencing invariant across mode 2.)
- At most one live incarnation, attachable/routable (fencing). A session has, at any time, at most one live incarnation that any client can attach to or be routed to. The recovery path must not manufacture a second: re-homing a session under host-alive-but-unreachable conditions requires fencing the prior incarnation. Honest scope matters here, because the host-alive-but-unreachable trigger is precisely a partition: the coordination point may be unable to reach the old host at all. So the fence is the conjunction of two acts, neither of which assumes the relay can touch the unreachable host. (1) The coordination point invalidates the session’s lease, so that no client can attach to or be routed to the old incarnation. (2) The old incarnation self-fences on lease-loss: it ceases to act as the session the moment it observes it no longer holds the lease, rather than waiting to be told so by a relay it may be partitioned from. Under partition, “observes it no longer holds the lease” is itself a local lease-expiry timeout, so there is a bounded window (the lease interval) in which the old incarnation may still act locally before it quiesces. The attachable/routable invariant holds throughout that window (lease invalidation at the coordination point is unilateral and immediate), while any in-window local effects fall to the same regime-3 idempotency / fencing-token discipline as external ones. The invariant the fence actually enforces is therefore at most one live incarnation that is attachable/routable, which is what “single coherent state” requires of the recovery path. Without it, regime-2 recovery can violate that property via the recovery mechanism itself (split-brain).
  One boundary stays explicit: lease invalidation reaches attach and routing, not effects the old incarnation already has in flight to external systems. Those are unreached by lease revocation and remain a regime-3 concern, resolved on the external side (idempotency, or a monotonic fencing token the external resource checks). The fencing/lease property is the invariant; how the lease is issued, observed, or revoked is implementation. And one distinction is worth stating outright, because it looks like a contradiction until it is named: this lease lives in the coordination/control plane, not in the execution graph. The single-homed, no-consensus claim is about the live execution state (one host, no replica to reconcile), while leader-election-style fencing is a property of the routing/control layer that addresses it. The two are different planes; conflating them is what makes the tension look real. (Guards: mode 2; defers external in-flight effects to mode 3.)
- A reaping discipline exists. Idle, never-returning sessions are eventually reclaimed; an eviction/reaping discipline is part of the contract. A continuity layer with no reclamation is not durable; it is leaking hosts. The existence of the discipline is the invariant; the specific threshold (TTL, signal, reap-versus-checkpoint) is policy and implementation. (Owns: mode 9’s existence half; hands the threshold to policy.)
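The ordering and fencing invariants above can be made concrete with a small sketch. It is illustrative only and is not how any particular implementation builds its serializer or fence, which the article treats as out of scope. The names `Incarnation`, `rehome`, `ExternalResource`, and the integer `epoch` used as a fencing token are all assumptions introduced here, and the sketch ignores the lease-expiry wait that a real re-home pays under partition.

```python
from dataclasses import dataclass, field


@dataclass
class Incarnation:
    """Hypothetical live incarnation: one host, one serialization point."""
    epoch: int                     # assumed lease epoch; higher means newer
    fenced: bool = False
    log: list = field(default_factory=list)

    def submit(self, actor: str, payload: str) -> tuple:
        # Arrival order at this single point is the total order.
        if self.fenced:
            raise RuntimeError("incarnation is fenced; input rejected")
        entry = (self.epoch, len(self.log), actor, payload)  # attributed entry
        self.log.append(entry)
        return entry


def rehome(old: Incarnation) -> Incarnation:
    """Fence the old serializer first, then start the new one."""
    old.fenced = True
    return Incarnation(epoch=old.epoch + 1)


class ExternalResource:
    """A peer the layer does not control; it checks a monotonic fencing token."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.writes: list = []

    def write(self, token: int, value: str) -> bool:
        if token < self.highest_token:
            return False           # stale incarnation: effect refused
        self.highest_token = token
        self.writes.append(value)
        return True


first = Incarnation(epoch=1)
first.submit("alice", "ls")
second = rehome(first)
second.submit("agent-7", "make test")

# The two local orders compose into one global order across the re-home.
history = first.log + second.log
assert history == sorted(history)

# An in-flight effect from the old incarnation is rejected on the external side.
db = ExternalResource()
db.write(second.epoch, "from new incarnation")
db.write(first.epoch, "late write from old incarnation")
```

The split mirrors the text: `rehome` covers attach and routing, while the token check in `ExternalResource.write` is the discipline the external side has to supply for effects that lease revocation cannot reach.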
Two properties of the operator model are stated honestly as costs the category carries, not guarantees it dissolves:
- Standing residency is a cost, not a leak to be wished away. Live continuity is paid for: every detached-but-alive session pins a process tree, its memory, and a host slot for as long as it lives. This is the structural mirror of replay’s scale-to-zero — the operator layer trades cheap idling for live continuity — and the reaping invariant above is what keeps the cost bounded rather than unbounded.
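A minimal sketch of what a reaping discipline might look like follows, assuming an idle-TTL threshold. The `Session` fields, the `reap_idle` function, and the TTL policy are hypothetical; the text fixes only that some discipline exists, not which threshold it uses.

```python
from dataclasses import dataclass


@dataclass
class Session:
    session_id: str
    last_attached_at: float      # seconds on any monotonic clock
    attached_clients: int = 0


def reap_idle(sessions: dict, now: float, idle_ttl: float) -> list:
    """Remove sessions that are detached and idle past idle_ttl; return their ids."""
    reaped = [
        sid for sid, s in sessions.items()
        if s.attached_clients == 0 and now - s.last_attached_at > idle_ttl
    ]
    for sid in reaped:
        del sessions[sid]        # a real layer might checkpoint before reclaiming
    return reaped


sessions = {
    "a": Session("a", last_attached_at=0.0),
    "b": Session("b", last_attached_at=90.0),
    "c": Session("c", last_attached_at=0.0, attached_clients=1),
}
reaped = reap_idle(sessions, now=100.0, idle_ttl=60.0)
```

Here only session `a` is reclaimed: `b` was attached too recently and `c` still has a client. Choosing `idle_ttl`, or replacing it with a signal or a reap-versus-checkpoint decision, is the policy half that the layer hands off.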
And the non-invariants, stated as plainly as the invariants — because refusing to overclaim is itself the discipline:
- Remote-connection survival is NOT an invariant of this layer. The layer does not, and physically cannot, guarantee that an outbound connection to a peer it does not control survives a host migration or partition. That peer holds its own half of the connection in its own kernel. Recovery of the remote half is owned by the application protocol (reconnect, resync by sequence number, idempotent operations), and the layer’s correctness depends on being honest that this is so. (Hands off: modes 3 and 6’s external half.)
- Isolation policy is NOT an invariant of this layer, but the pre-commit interposition seam is. Draw the split exactly. Provenance gives attribution: a detective, after-the-fact answer to “who did this.” It does not, by itself, prevent a confused-deputy execution (mode 8). A safe operator model needs a preventive property too (authority evaluated against the then-current state before an effect commits), and the seam at which that evaluation happens is something the category must expose, not hand away. The pre-commit interposition point is an invariant the layer provides, because it is the only place where authority over the live tuple can be checked before an effect on that tuple commits, and a layer that buried it would leave no object-capability handle for any neighbor to gate against. What is external is the policy that runs at the seam, not the seam itself, and keeping the seam inside the category is what stops a governance layer from being built around the runtime rather than into it. What the layer may not do is let attribution masquerade as control, nor let the high-value safety seam leak out of the category as a mere convention. The category’s obligation is therefore to expose originator-bound provenance and a pre-commit interposition point as invariants; deciding which cross-actor effects are legitimate is the only part that is policy. (Provides the seam; hands off only mode 8’s policy.)
- Attach admission (authenticating and authorizing the attaching client, and isolating one session from another) is NOT an invariant of this layer. Addressing a session by its identity is not entitlement to attach to it (the same move as provenance ≠ control): nothing in “reference the session by its identity” stops one actor attaching to another’s session, so the per-actor authority model is meaningless unless attach itself authenticates the client and authorizes it for this session, and unless sessions are isolated from one another on a shared relay or host. That admission-and-isolation property is handed to a security neighbor (authn/authz and tenant isolation), not assumed to fall out of identity-based addressing. (Hands off: the attach-admission and isolation policy.)
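The split between the seam (provided by the layer) and the policy (supplied by a neighbor) can be sketched as follows. This is a hedged illustration, not a description of any real system: `Effect`, `Seam`, and `only_trusted_originators` are hypothetical names, and the policy shown is a deliberately trivial allow-list.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Effect:
    originator: str   # who first caused this effect (bound provenance)
    submitter: str    # the actor or agent that actually submitted it
    action: str


Policy = Callable[[Effect, dict], bool]


class Seam:
    """Hypothetical pre-commit interposition point over one live state."""

    def __init__(self, policy: Policy) -> None:
        self.policy = policy                 # supplied by a neighbor, not the layer
        self.state = {"committed": []}
        self.audit: list = []

    def commit(self, effect: Effect) -> bool:
        allowed = self.policy(effect, self.state)   # evaluated before commit
        self.audit.append((effect, allowed))        # attribution either way
        if allowed:
            self.state["committed"].append(effect.action)
        return allowed


def only_trusted_originators(effect: Effect, state: dict) -> bool:
    # Example policy only: deciding what is legitimate is not the layer's job.
    return effect.originator in {"alice"}


seam = Seam(only_trusted_originators)
ok = seam.commit(Effect("alice", "agent-7", "run tests"))
blocked = seam.commit(Effect("web-page", "agent-7", "rm -rf build"))
```

Both effects arrive from the same submitter, `agent-7`; only the bound originator distinguishes them. The audit list records both attempts, which is attribution, while the refusal of the second is prevention, and it happens at the seam before anything commits.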
A category essay can assert that a missing layer exists. A whitepaper has to draw the line where the layer stops, and stand behind it. The non-invariants above, remote-connection survival first among them, are that line. Everything the layer guarantees is more believable because it names the things it refuses to guarantee.
A reference point
This honesty is not a rhetorical posture; it is a design constraint that an implementation either meets or fails. cmdop is one reference implementation of this category, and the relevant property here is its failure-mode posture: it owns the client-disconnect regime (mode 1) outright, reaches into host migration (mode 2) at the session-state-persistence end of that axis rather than claiming full live-memory checkpoint/restore, and hands the external-connection and partial-mutation regimes (modes 3 and 6) back to the application protocol and the data layer (reconnect-and-resync, idempotency, transactions, sagas) rather than claiming to absorb them. That division of labor is the point. A continuity layer earns its name by what it keeps continuous; it earns trust by what it admits it cannot.
The mechanisms that make modes 1, 2, 4, 7, and 8 hold internally — how coherence of a single live state is maintained, how identity is preserved across transports, how branches diverge cheaply, how a fence revokes a lease, how a gating seam interposes — are implementation concerns, and in some systems patented ones, and they are deliberately out of scope here. What this article fixes is the contract: nine failure modes, the properties the layer guarantees, the ones it explicitly refuses or hands to a neighbor, and an observable signature for each.
One caveat about those signatures, lest the “checkable in production” claim overreach: several of them (the buffered-replay-free re-home, per-actor attribution, originator-bound provenance, the absence of a replay phase, a reaping event in telemetry) are not visible through stock ps, ss, or off-the-shelf Prometheus exporters. They are signatures the layer must export about itself. The contract is checkable in production provided the layer instruments these properties; they do not fall out of generic observability, and a layer that does not emit them leaves its own contract unverifiable.
See it in the product: how cmdop handles these modes in practice — the multi-operator runtime (serialized, attributed input under transferable authority) and the troubleshooting guide for disconnect, recovery, and reattach.
This is Part 7 of 7 — the close of a seven-part series on the command-operator execution layer. Part 1 named the missing layer; Part 2 separated memory from execution state; Part 3 separated steering from replay; Part 4 set out the operator model; Part 5 named the session primitive; Part 6 drew the category’s boundary. This final part formalizes what happens when that boundary is tested.