Kata Federation¶

Federation lets multiple kata daemons share selected projects while each user keeps a local daemon and local database. It is opt-in per project. A normal single-user kata installation with no federated bindings should not read federation credentials or make federation network calls unless an operator invokes federation commands.

Federation is not a replacement for a shared daemon. Use a shared daemon when users need immediate single-copy reads and writes, centralized authorization, or strict online-only arbitration. Federation deliberately chooses local-first availability, durable offline queues, and deterministic convergence for routine work, with documented consistency limits.

Terms¶

Hub: the authoritative daemon for a federated project. It owns enrollment tokens, lease arbitration, purge/reset authority, and the canonical project event stream.
Spoke: a daemon with a local replica bound to a hub project.
Binding: a local row in federation_bindings that marks one project as a hub or spoke replica and stores pull/push cursors.
Enrollment: a hub-side credential in federation_enrollments. A token is bound to one spoke instance UID, optional project scope, and capabilities such as pull, push, and claim (the transport capability name for leases).
Origin instance UID: the durable daemon identity stamped on events. It is how replicas distinguish local-origin from foreign-origin work.
Pull cursor: the highest hub event ID consumed by a spoke.
Push cursor: the highest spoke-local event ID accepted by the hub.
Replay horizon: the hub event ID from which a spoke can bootstrap. History before that point is represented by baseline snapshot events rather than by a replay-complete event stream.
Lease: a hub-authoritative write lease for one existing issue. Mutating existing issue work on federated projects, including comments, requires a live lease. Creating new issues remains lease-free because there is no existing issue to lease. The internal storage and audit events still use the claim name.
Quarantine: a local operator stop marker for a poisoned federation batch. It prevents hot-looping and requires an explicit operator decision.

Tokens And Trust Boundaries¶

Kata federation uses two different bearer-token systems. They are intentionally separate:

Daemon API tokens identify clients talking to a daemon's normal API. They are configured with KATA_AUTH_TOKEN or [auth].token and managed with kata tokens ... when token identity is required. Operator commands such as kata federation enroll, kata federation revoke, kata federation status, kata federation quarantine skip, and hub-local force-release use this normal daemon API auth surface.
Federation enrollment tokens authorize one spoke to call hub federation transport routes for an enrolled scope and capability set. They are created by kata federation enroll, stored hashed on the hub, stored plaintext only in the spoke federation credentials file, and used for pull, push, join metadata fetches, and forwarded lease actions. They are not general daemon API tokens.

Lease commands have two hops on a spoke. The CLI first talks to the local spoke daemon using the normal daemon API auth rules. The spoke then forwards the lease request to the hub with its federation enrollment token; the hub derives holder_instance_uid from that enrollment. A hub-local lease command has only the first hop and uses daemon API auth. When [auth].require_token_identity = true, local operator and lease commands must use DB-backed API tokens from kata tokens create; enrollment tokens still only authenticate spoke-to-hub federation transport.

When a daemon listens on a non-loopback address, configure a daemon API token and explicitly trust the private network:

[auth]
token = "..."
trust_private_network = true

The equivalent environment variables are KATA_AUTH_TOKEN and KATA_TRUST_PRIVATE_NETWORK=1. Plain HTTP private-network clients that attach bearer tokens also require this trust opt-in. That includes normal CLI/TUI access to a remote daemon and spoke-to-hub federation calls that use enrollment tokens. Without the opt-in, kata refuses to put bearer tokens on plaintext non-loopback HTTP connections. HTTPS, Unix sockets, and loopback HTTP do not need the private-network trust opt-in.

Implementation Map¶

internal/db: schema, federation bindings/enrollments, event ingest, materialization, lease state, quarantine, reset guards, and JSONL cutover.
internal/daemon: daemon HTTP routes, enrollment-aware transport auth, local/admin operator APIs, lease forwarding, purge/reset behavior, and status responses.
internal/client: generic first-party daemon discovery, auto-start, bearer attachment, Unix-socket HTTP clients, TCP remote selection, and SSE clients.
internal/federation: spoke-side hub HTTP client, pull/push runner, pending lease retry, failpoints, and federation runner tests.
cmd/kata: federation operator CLI, lease CLI, daemon runner startup, and normal CLI client wiring.
e2e and docker/federation: multi-daemon, randomized stress, failpoint, and Docker Compose smoke coverage.

Setup Model¶

Federation setup is an operator workflow. The kata federation command is visible in CLI help, but it remains separate from ordinary issue commands so users who never opt into a federated project do not see daemon prompts, credential reads, or network calls.

On the hub:

Create or register the project first. In a workspace that should become the hub project, use the normal project setup command:

kata init --project fedlab

Enable federation for a project when you want an explicit enable step:

kata federation enable --project fedlab

This records project.federation_enabled and baseline issue.snapshot events at the replay horizon. This step is optional when the next command is kata federation enroll, because enrollment auto-enables the project if it is not already federated. 3. Create one enrollment token per trusted spoke:

kata federation enroll --project fedlab \
  --spoke-instance 01H... \
  --hub-url http://<private-hub-ip>:7787

The hub stores only the token hash. The enrollment records the spoke instance UID, optional project scope, and capabilities. The CLI prints a pasteable kata federation join ... command using the binary name that invoked enroll, and containing the generated token; treat that command as secret-bearing material.

On each spoke:

Read the spoke instance UID when creating the hub enrollment:

kata federation identity

Run the join command printed by the hub:

kata federation join --project fedlab \
  --hub-url http://<private-hub-ip>:7787 \
  --hub-project-id 1 \
  --token ... \
  --push

join fetches the hub project UID and replay horizon from the hub using the enrollment token, so the hub must be reachable at join time and the token must include pull. The metadata flags (--hub-project-uid, --replay-horizon, and --baseline-through) remain available as explicit overrides for scripts. The command creates a local replica project bound to the hub project UID and replay horizon, stores the hub URL/project/token in the local federation credentials file, and enables push only when --push is present.

If the spoke already has a non-federated local project that should join the hub under the same name, opt in explicitly with --adopt-existing, which requires --push:

kata federation join --project fedlab \
  --hub-url http://<private-hub-ip>:7787 \
  --hub-project-id 1 \
  --token ... \
  --push \
  --adopt-existing

Adoption works even when the local project name differs from the hub project name; choose the local project with --project and the hub with the hub selector:

kata federation join --project shared-foo \
  --hub-url http://<private-hub-ip>:7787 \
  --hub-project-id 1 \
  --token ... \
  --push \
  --adopt-existing

Adoption preserves the current state of local issues, including closed and soft-deleted issues, comments, labels, metadata, priority, owner, and same-project links. It does not preserve the old local event history. Instead it removes those pre-adoption local events and queues fresh snapshots for the hub with same-project links embedded in the snapshot payloads. After adoption, existing issues are ordinary federated spoke issues. Future edits remain local-first; use live hub leases only when you want exclusive coordination.

Adoption is a current-state cutover rather than a history merge. Operators who need the pre-adoption event timeline for audit or rollback context should run a scoped JSONL export before --adopt-existing, for example kata --project <project> export --output <path>.jsonl. The federated event stream starts from the adoption snapshots; kata does not currently retain a separate in-product archive of pre-adoption events.

Enrollment capabilities and local spoke behavior are separate knobs: --capabilities pull,push,lease on the hub says what the token may do, while --push on the spoke says this replica should actually push local-origin events back to the hub. If the token has push but join is run without --push, the spoke remains pull-only and the CLI prints a warning.

The transport routes use enrollment bearer tokens. Local operator routes, including kata federation status, quarantine skip, force-release, and purge, use the normal daemon local/admin auth surface. Federated hub purge uses the same live-lease conflict gate as other issue mutations; an operator can force-release first when an abandoned lease blocks destructive maintenance.

The daemon federation runner polls every 30 seconds by default. For tests, short-lived labs, or latency-sensitive private deployments, set KATA_FEDERATION_PULL_INTERVAL_MS=<milliseconds> on the daemon process. Very short intervals are useful for smoke tests but increase wakeups and database traffic.

Schema And Upgrade¶

Federation is one schema bump: upstream schema 11 upgrades to schema 12. Existing schema-11 databases upgrade through the JSONL cutover path. The v11 source schema does not have comments.uid or event replay fields (events.hlc_physical_ms, events.hlc_counter, events.content_hash), so the exporter keeps v11 on the legacy comment/event projections and the importer backfills those fields while loading the fresh schema-12 database.

Federation push requests also carry a schema_version field in the wire body. The hub rejects requests that omit it or report a schema newer than the hub's own schema. This is intentionally conservative: an old hub must not blindly materialize events from a newer spoke whose payloads or fold semantics may have changed. Upgrade hubs before push-enabled spokes when rolling out a new federation schema.

Pull Replication¶

A spoke polls the hub transport route for events after its pull cursor. It applies hub events in order, deduplicates by event UID and content hash, folds portable payloads into the local projection, and advances its pull cursor only after successful application.

Baseline snapshots bridge the fact that pre-horizon history is not replay-complete. Events after the baseline are ordered by their event/HLC data and materialized normally. If the hub reports reset_required, the spoke refreshes hub metadata and re-bootstrap from the current horizon.

Push Replication¶

A push-enabled spoke scans for local-origin events above its push cursor and sends them to the hub as an all-or-nothing batch. The hub authenticates the enrollment token, verifies project scope and capability, checks that each event belongs to the bound spoke origin, verifies the spoke's declared schema version, deduplicates same-hash retries, rejects same-UID/different-hash conflicts, materializes the batch, and returns the advanced push cursor.

If the response is lost after the hub commits, retrying the same batch is safe: the hub treats fully duplicated same-hash batches as successful and returns an advanced cursor. Permanent validation failures or hash conflicts record a quarantine on the spoke instead of retrying forever.

Leases And Write Gates¶

Leases are hub-authoritative. A spoke forwards acquire, renew, release, and status requests to the hub using an enrollment token with claim capability. The hub derives holder_instance_uid from that enrollment token; clients provide the human-readable holder string.

Leases exist for agent coordination, not for general edit permission. An agent acquires a lease to signal "I am actively working this issue" and to get temporary exclusivity against other non-comment mutations while the lease is live. Status and audit surfaces can then show who is actively working. A lease does not replace durable owner, does not serialize all collaboration, and is not required before ordinary federated edits.

For federated projects, ordinary edits of existing issues are local-first and converge by LWW. Creating new issues is also local-first. A lease is optional coordination: when another holder has a live lease on an affected existing issue, non-comment mutations are denied until the lease is released or expires. Comments bypass leases because they are append-only.

Spokes refresh cached lease state before checking exclusivity when online. When offline, cached hard leases can still be used as a continuity hint, but they are not proof that exclusivity still holds. Timed leases expire by hub time and stop blocking edits once expired.

The hub checks pushed work against the live lease state at ingest time. Work that conflicts with another holder's live lease is not dropped; the hub records claim.violated. Work on unleased issues is normal and is not a violation. This is best-effort, not a causal proof that the work was unauthorized when originally performed. An offline edit that was covered at edit time can arrive after another holder acquires a lease and be marked violated because the hub checks current state during ingest.

kata show surfaces the current lease and recent unresolved lease violations for a federated issue. kata federation status shows project-level counts and recent violation summaries.

Operator Commands¶

kata federation identity
kata federation enable --project <project>
kata federation enroll --project <project> --spoke-instance <uid> --hub-url <url>
kata federation join --project <project> --hub-url <url> --hub-project-id <id> \
  --token <token> [--push]
kata federation join --project <existing-project> --hub-url <url> --hub-project-id <id> \
  --token <token> --push --adopt-existing
kata federation enrollments list
kata federation revoke <enrollment-id>
kata federation status
kata federation status --json
kata federation lease acquire <issue-ref> [--ttl 30m]
kata federation lease release <issue-ref>

kata federation enrollments list audits hub-side spoke grants without showing token hashes or plaintext tokens. kata federation revoke <enrollment-id> marks an enrollment inactive so the token no longer authorizes pull, push, or lease transport calls. Project-level leave and disable teardown commands are not currently exposed; operators should treat those as a separate reset/removal workflow because local pending pushes and replicated state need explicit policy.

kata federation status reports one entry for each local federation binding:

project name, role, enabled state, and push-enabled state
pull and push cursors
pending push depth and high-water event ID
last sync, pull, push, reset, and error timestamps
enrollment count on hubs
live and pending lease counts
active quarantine count and reset blocker
unresolved lease violation count and recent violation summaries

A daemon with no federation bindings returns an empty status list and prints no federation bindings in text mode.

Quarantine¶

A spoke records an active quarantine when it sees a permanently poisoned push batch. Quarantine blocks further push and can block reset. Operators can inspect it with status and intentionally skip it:

kata federation quarantine skip <id> \
  --confirm "SKIP FEDERATION BATCH <id>" \
  --reason "operator accepted the skipped outbound batch"

Skipping advances the spoke push cursor past the quarantined event range and marks the quarantine skipped. It does not delete local events and it does not make the skipped work appear on the hub. Use it only when the operator accepts that the local batch will not be federated.

Purge And Reset¶

Hard purge is hub-admin-only for federated projects. A spoke rejects hard purge with federated_admin_required. A hub purge uses the normal local/admin daemon auth surface, the existing exact confirmation string, and the same live-lease gate as other issue mutations.

When a hub purge removes replay history, it records a reset boundary in purge_log and writes a fresh federation baseline for the remaining project state. A spoke whose pull cursor is below that boundary receives reset_required and re-bootstrap from the current federation horizon. A push-enabled spoke refuses to reset while it has unaccepted local-origin events or an active quarantine. kata federation status reports these reset blockers.

Recurrences, Merge, And Other Boundaries¶

Recurrences remain hub-owned for federated projects. Spoke recurrence mutation is blocked rather than partially synchronized. Project merge refuses projects with federation bindings because local integer identities, remote UIDs, and federation cursors would otherwise become ambiguous.

Federation does not add a global user registry. Actor strings remain audit metadata. Origin instance UIDs disambiguate daemon identity.

Design Rationale¶

The operational sections above describe what federation does. This section records why it converges and why the lease model is shaped the way it is — the reasoning that is not recoverable from the behavior alone.

Why Federation Converges¶

For federated projects the event log is the source of truth and the issue/comment/label/link/metadata tables are a deterministic projection — a pure Fold over the events. All mutable state is modeled as a CRDT: scalar fields and metadata leaves are last-writer-wins registers, labels and links are per-element LWW sets with tombstones, and comments are a grow-only log keyed by UID. Given a deterministic total order over the retained event set, the fold is commutative-after-sort: any two nodes holding the same events compute identical state regardless of arrival order, so there is no irreconcilable conflict to resolve by hand. A late event with a lower clock simply loses to an already-applied higher-clock write for the same field, without being discarded — it stays in the audit log and remains visible. This is property-tested by shuffling arrival order and asserting a byte-identical projection.

Each mutation appends a complete event and applies its projection delta in one transaction, never "mutate the row, best-effort event afterward." That keeps the local write path and the replay path from drifting apart.

The Hybrid Logical Clock And Merge Order¶

The total order is (hlc, origin_instance_uid, event_uid). The HLC is a hybrid logical clock — a (physical_ms, counter) pair stamped when an event is emitted and advanced past the clock of every foreign event applied — so causally later work always sorts after what it saw. The local autoincrement event id is only a delivery cursor, never the merge order, and wall-clock created_at is kept for display and digests but is too skew-prone to order merges. An immutable content_hash over each event's identity and canonical contents lets a duplicate UID with the same hash be treated as a harmless retry, while a duplicate UID with a different hash is rejected as an integrity violation before it can be folded. An earlier groundwork column, origin_seq, was deliberately not revived as the merge clock; if gap detection is ever needed it can return as an optional per-origin continuity counter, orthogonal to the HLC.

The Drift-Guard Invariant¶

The guarantee that event-truth can be trusted rests on one invariant: for every non-federated project, which still uses the direct-write path, direct_write_projection == Fold(project_events). It is asserted at project scope, not per issue, because links and project metadata are project-scoped — an event on one issue can affect another issue's projection — so a per-issue comparison would be too narrow to catch a missing field. Holding this invariant across every mutation type proves events are replay-complete and keeps the direct-write and replay paths from diverging. It is the gate that had to pass before federation could be turned on.

Replay-Complete Events¶

Before federation, events did not carry enough to reconstruct state: issue.updated was written with an empty payload and issue.created omitted the title and body, so folding from zero was impossible. Federation required every mutation event to carry the resulting state of what it touched, and added a uid to comments so they have a stable identity to fold on. Because pre-federation history is not replay-complete, enabling federation emits a baseline of issue.snapshot events at a replay horizon, captured from one consistent single-writer view; replicas fold forward from that horizon and keep older events as audit-only.

Metadata Merge Semantics¶

Issue and project metadata is a per-key map, and unreserved keys may hold arbitrary nested JSON. To salvage concurrent edits to different sub-fields of the same key, the fold diffs each metadata event's per-key {from, to} down to JSON-pointer paths and resolves each leaf as an LWW register. Deletion keys off structural absence, never off JSON null: a path present in from but absent in to is a tombstone — covering the whole subtree if it was an object — while a leaf whose value is null in to is a real null value and is preserved. Arrays are atomic leaves, with no element-level array CRDT, so a checklist is replaced wholesale. The only place null acts as a marker is the top-level "clear this key" convention, which surfaces as structural absence anyway.

Why Leases Are Optional Coordination, Not An Edit Gate¶

A lease answers "who is actively working this issue right now," which is a different question from owner ("who is responsible"). Leases exist to help agents avoid double-work; they are deliberately not a prerequisite for editing a federated issue. This is the point the implementation and docs were realigned to after some early drift toward treating a lease as a write gate.

The reason is the local-first goal. Edits are asynchronous and may happen offline, so requiring a live lease before every edit would force a synchronous hub round-trip and trade away the offline editing that federation exists to preserve. Ordinary edits are therefore always local-first and converge by LWW. A lease adds only temporary exclusivity against conflicting non-comment work by another holder while it is live — the hub guarantees at most one live holder. Unleased work is normal and is never a violation; pending or expired leases do not block edits; and comments always pass because they are append-only.

Because the hub cannot reject a mutation that already happened offline, it does not try to. When pushed work conflicts with another holder's live lease the hub keeps the data and records a best-effort claim.violated annotation rather than dropping anything. Lease state — especially timed-lease expiry after renewals — is authoritative from the hub, not folded from events, so a spoke treats cached lease state as a hint and confirms against the hub before relying on exclusivity, never as proof that exclusivity still holds.

Rejected And Deferred Alternatives¶

Hard purge of federated history is hub-admin-only and establishes a reset boundary rather than being routine, because deleting replay history fights event-truth. Recurrence materialization stays hub-only: running the scheduler on multiple spokes would each materialize duplicate occurrences, since the occurrence-uniqueness constraint is per-database and UIDs differ across nodes. Issue moves are allowed only between two federated projects on the same hub; a move across a federation boundary has no coherent single event log spanning it and is disallowed. Deferred work includes element-level array CRDTs for metadata, HLC skew-guard thresholds, a snapshot fast-path for very large projects, and federated authoring of recurrences from spokes.

Consistency Limitations¶

The following are expected federation behaviors. They are the main reasons to prefer a shared daemon for stricter collaboration.

Reads Can Be Stale¶

Spokes read from their local database. A hub or another spoke can accept work that has not reached this spoke yet. Users may see stale title, status, labels, priority, lease status, or violation counts until the next pull succeeds.

Use a shared daemon when every user must see the latest committed state before acting.

Writes Are Not Globally Serialized At Edit Time¶

Local spoke writes happen before the hub has accepted them. The hub serializes event ingest later. For ordinary fields, deterministic materialization converges replicas, but users can temporarily see different local projections and the later accepted event order may not match what each user saw locally.

Use a shared daemon when conflicting edits must be prevented synchronously.

Offline Cached Hard Leases Can Be Superseded¶

A spoke can allow work under a cached hard lease while offline. During the outage, a hub operator can force-release the lease or another actor can acquire a later lease. When the offline work reconnects, the hub keeps the data but can record claim.violated.

Use a shared daemon when every write must be serialized through one online authority and stale offline lease state is unacceptable.

Lease Violation Signals Are Best-Effort¶

Violation annotation checks live hub lease state at ingest time, not historical lease state at the event's HLC timestamp. It is an operational signal for "work arrived while another holder currently has a live lease", not a complete audit proof of user intent or causal authorization.

Use a shared daemon when lease compliance must be decided at the exact write time.

Poisoned Push Batches Require Operator Choice¶

A validation error, hash conflict, or schema divergence can quarantine a push batch. Until an operator fixes the data or skips the batch, push remains blocked and reset may be blocked. Skipping is explicit data divergence: the hub will not receive those local events.

Use a shared daemon when local writes must either commit centrally or fail synchronously with no later operator reconciliation.

Hub Outage Changes The User Experience¶

During a hub outage, spokes can still read local state and may perform allowed local work, but lease acquisition, timed-lease refresh, push, pull, and status freshness degrade. Pending lease requests are not authoritative until the hub accepts them.

Use a shared daemon when lack of hub connectivity should stop all shared work.

Purge Causes Re-bootstrap¶

Hub purge creates a reset boundary. Spokes may temporarily show old state until they pull the reset signal and re-bootstrap. Push-enabled spokes can delay reset if they have unaccepted local work or quarantine.

Use a shared daemon when destructive administrative actions must be instantly visible to all users.

No Multi-Tenant Authorization Model¶

Enrollment tokens authorize spokes, not individual global users. The local daemon remains a single-user local tool. Project-scoped user ACLs and a global identity model belong to shared-daemon or later hardening work.

Use a shared daemon when different human users require centrally enforced roles on the same project.

Verification¶

Routine checks:

make test
make vet
make lint
make nilaway

Federation-specific checks:

make test-stress
make test-federation-docker

make test-stress runs Go-based randomized and failpoint tests. When Rapid prints a failing seed, reproduce it with:

RAPID_SEED=<seed> go test -tags federation_stress ./e2e -run TestFederationStressRandomizedWorkload -count=1 -timeout 2m

make test-federation-docker builds the current checkout into Docker images, starts one hub daemon and two spoke daemons in separate containers, and drives a lease-gated convergence scenario across real process and network boundaries.

For manual Docker debugging:

docker compose -f docker/federation/docker-compose.yml -p kata-federation-smoke up --build
docker compose -f docker/federation/docker-compose.yml -p kata-federation-smoke down --volumes --remove-orphans

Use an isolated project name for parallel runs:

KATA_FEDERATION_DOCKER_PROJECT=kata-federation-smoke-$USER make test-federation-docker

Historical Design Context¶

This file is the canonical description of implemented federation behavior, its design rationale, and its limitations. The original phased design spec and the per-phase implementation plans were folded into this document and the architecture and data model notes after the work shipped; the superseded drafts remain available in version control.