PRD: Consolidated Fleet Pod Management

Status: Draft · Owner: TBD · Created: 2026-06-05 Modules touched: contract/, agent/, control-plane/, dashboard/, deploy/

1. Summary

Expand Niro’s k3s fleet management with a single place to:

List all pods across every cluster in the org subtree (consolidated inventory).
Inspect per-pod resource usage (CPU/memory, requests/limits, restarts) with history.
Stream pod logs live to triage issues without leaving the dashboard.

Shipped in 3 phases. Pod inventory + usage are pushed in the heartbeat (always-fresh). Logs use a new real-time SSE/WebSocket transport (the documented v2 path) gated behind an opt-in escalated agent RBAC role.

2. Problem & Goals

Today: Heartbeat carries only aggregate counts (pods, pods_running, pods_failed, namespaces). Operators see per-cluster tiles but cannot see individual pods, drill into resource usage, or read logs — they must kubectl into each cluster separately. With NAT’d clusters and outbound-only agents, that’s painful. Goals

G1: One cross-cluster pod table, filterable by cluster / namespace / status / search.
G2: Per-pod CPU & memory (usage vs requests/limits) + restart/age, with short history chart.
G3: Live log tailing for a chosen pod/container from the browser.
G4: Preserve the secure-by-default, outbound-only, multi-tenant (org-subtree) architecture.

Non-goals (this PRD)

kubectl exec / shell, port-forward, pod editing/deletion, scaling.
Cross-cluster log search/aggregation/retention (live tail only; no log storage).
Full historical pod metrics warehouse (only short rolling window, plan-gated).

3. Decisions (locked)

Decision	Choice	Rationale
Pod inventory delivery	Push in heartbeat	Always-fresh consolidated view, no per-view round-trips, fits dozens-of-clusters scale.
Log transport	Build SSE/WebSocket now	True real-time tailing; reusable v2 transport for future exec/port-forward.
RBAC posture	Opt-in escalated role	Keeps read-only default; logs/pod-metrics require explicit per-cluster enablement.
Scope	Phased (1→2→3)	Ship value incrementally; de-risk the hard streaming work last.

4. Architecture Impact

            heartbeat (15s, now carries pod inventory + per-pod metrics)
  agent  ─────────────────────────────────────────────►  control-plane ──►  Postgres
   │                                                          ▲   │
   │     NEW: persistent stream (logs)                        │   │  GET /v1/clusters/pods (fleet)
   └──────────  WS  ◄────────────────────────────────────────┘   ▼
                                                              dashboard (React Query + EventSource/WS)

Heartbeat grows (Phase 1/2): add Pods []PodDetail to HeartbeatRequest. Bump Protocol. Control-plane writes a pods snapshot per heartbeat (wholesale replace, same pattern as nodes).
New streaming transport (Phase 3): a persistent agent↔CP channel (WebSocket) so logs flow agent→CP→browser. Agent remains the dialer (outbound-only preserved). Browser↔CP uses SSE or WS.
Escalated RBAC (Phase 2/3): new opt-in ClusterRole granting pods/log (+ ensures pod metrics). Default role unchanged.

5. Phase 1 — Consolidated Pod Inventory

Outcome: A /pods page listing every pod across the org’s clusters.

5.1 Contract (`contract/heartbeat.go`)

Add:

type PodDetail struct {
    Namespace    string   `json:"namespace"`
    Name         string   `json:"name"`
    Phase        string   `json:"phase"`        // Running/Pending/Failed/Succeeded/Unknown
    NodeName     string   `json:"nodeName"`
    Restarts     int32    `json:"restarts"`
    Ready        string   `json:"ready"`        // "2/3"
    Containers   []string `json:"containers"`   // names (images optional, see Phase 2)
    StartedUnix  int64    `json:"startedUnix"`
    OwnerKind    string   `json:"ownerKind"`    // Deployment/StatefulSet/DaemonSet/Job...
    OwnerName    string   `json:"ownerName"`
}
// HeartbeatRequest gains: Pods []PodDetail `json:"pods,omitempty"`

Bump Protocol (and MinSupportedProtocol policy — keep back-compat: CP must accept heartbeats without Pods).

5.2 Agent (`agent/internal/collect/collect.go`)

Extend Snapshot() to populate req.Pods from the existing Pods("").List(...) call (already fetched — just map each item instead of only counting). Derive Ready, Restarts, owner refs from pod.OwnerReferences. No new RBAC needed (list pods already allowed).

5.3 Control-plane

Migration 00004_pods.sql: table pods (cluster_id, namespace, name, phase, node_name, restarts, ready, containers text[], owner_kind, owner_name, started_at, updated_at), PK (cluster_id, namespace, name), index on (cluster_id) and (phase).
Heartbeat handler (agent.go): inside the existing tx, DeletePodsForCluster + bulk insert req.Pods (wholesale replace, mirrors node handling). Guard payload size.
Queries (queries.sql → make sqlc): DeletePodsForCluster, InsertPod, ListPodsForOrgSubtree (joins clusters via the existing OrgSubtree CTE; supports filters + keyset/limit pagination), ListPodsForCluster.
Endpoints (server.go, new pods_api.go):
- GET /v1/clusters/pods — fleet-wide, withUser, filters: cluster, namespace, status, q, limit, cursor.
- GET /v1/clusters/{id}/pods — single-cluster.

5.4 Dashboard

New route app/pods/page.tsx + pods-table.tsx (clone patterns from clusters/fleet-table.tsx): server-seeded (PPR) then React Query 5s poll.
Columns: Cluster · Namespace · Pod · Status badge · Ready · Restarts · Node · Age. Filter bar (cluster, namespace, status) + search. Status badge reuses cluster badge styling.
Add nav entry in AppShell. lib/api.ts + lib/server.ts fetchers.

Phase 1 acceptance

Heartbeat from an agent with N pods results in N rows visible on /pods, scoped to the caller’s org subtree.
Filtering by cluster/namespace/status and search work; pagination holds at hundreds of pods.
Old-protocol agents (no Pods) still heartbeat successfully; their pods simply don’t list.

6. Phase 2 — Per-Pod Resource Usage

Outcome: Click a pod → drawer with CPU/memory usage vs requests/limits + short history; restart/age detail.

6.1 Contract

Extend PodDetail (or add parallel PodUsage) with:

CPUUsageMilli  int64 `json:"cpuUsageMilli"`
MemUsageBytes  int64 `json:"memUsageBytes"`
CPURequestMilli int64 `json:"cpuRequestMilli"`
CPULimitMilli   int64 `json:"cpuLimitMilli"`
MemRequestBytes int64 `json:"memRequestBytes"`
MemLimitBytes   int64 `json:"memLimitBytes"`

6.2 Agent

Requests/limits: sum container resources from the pod spec (no new RBAC).
Usage: call mc.MetricsV1beta1().PodMetricses("").List(...) (mirrors existing nodeUsage), join to pods by namespace/name. Best-effort — zero when metrics-server absent or RBAC not escalated. Pod metrics may require the escalated role (see §8); degrade gracefully.

6.3 Control-plane

Migration 00005_pod_metrics.sql: append-only pod_metrics (cluster_id, namespace, name, recorded_at, cpu_usage_milli, mem_usage_bytes, cpu_request_milli, cpu_limit_milli, mem_request_bytes, mem_limit_bytes). Add the usage/request/limit columns to pods for “latest” reads.
Reuse the metrics retention/plan-gating pattern from metrics_api.go (free 24h / pro 7d / enterprise 30d).
Endpoints: GET /v1/clusters/{id}/pods/{namespace}/{name} (detail incl. latest usage), GET /v1/clusters/{id}/pods/{namespace}/{name}/metrics (bucketed time-series).

6.4 Dashboard

Pod detail drawer/modal: usage tiles (CPU/mem usage vs request/limit, % bars), Recharts time-series (reuse components/charts/), restart/age/owner metadata.
Add CPU/Mem columns to the /pods table (sortable).

Phase 2 acceptance

Pod drawer shows live CPU/mem and request/limit; chart renders within plan retention window.
Clusters without metrics-server (or without escalated role) show “metrics unavailable” rather than erroring.

7. Phase 3 — Live Log Streaming

Outcome: From a pod, “View logs” opens a live tail of a chosen container, streamed browser←CP←agent.

7.1 Transport (new, reusable)

Agent↔CP: persistent WebSocket initiated by the agent (preserves outbound-only). Agent dials wss://<cp>/v1/agent/stream, authenticates with the agent key (X-Niro-Agent-Key), then multiplexes stream sessions. Coexists with existing long-poll commands; the command envelope (contract/command.go) is already designed transport-agnostic.
Log session lifecycle:
1. Browser calls POST /v1/clusters/{id}/pods/{ns}/{name}/logs {container, follow, tailLines, sinceSeconds} → CP returns a streamId.
2. CP sends a start-log-stream frame to that cluster’s agent over the WS.
3. Agent runs cs.CoreV1().Pods(ns).GetLogs(name, &PodLogOptions{Follow:true,...}).Stream(ctx) and pushes chunks back over the WS, tagged streamId.
4. CP relays chunks to the browser via SSE GET /v1/streams/{streamId} (EventSource) — or WS if duplex needed.
5. Browser closing the SSE / DELETE /v1/streams/{streamId} → CP sends stop → agent cancels the stream ctx.
Backpressure & limits: per-stream byte/line cap, idle timeout, max concurrent streams per cluster/org. Drop with a visible “stream throttled/ended” notice rather than unbounded buffering.

7.2 Contract

New frame/command types in contract/: CommandStartLogStream, CommandStopLogStream, plus a StreamChunk{StreamID, Seq, Data, EOF, Err} envelope. Add WS frame types (control vs data).

7.3 Control-plane

New internal/stream/ hub: registry of connected agents (clusterID→WS conn) and active browser subscribers (streamId→SSE writer). Relays chunks, enforces auth (browser must own a cluster in its org subtree; agent must match clusterID), applies limits.
Endpoints: GET /v1/agent/stream (WS, withAgentAuth), POST .../logs + GET /v1/streams/{id} (SSE) + DELETE /v1/streams/{id} (withUser).
Multi-replica note: if CP runs >1 replica (the repo already does zero-downtime rollouts), agent WS and browser SSE may land on different replicas. Use sticky routing for the stream pair, or a lightweight pub/sub (e.g. Postgres LISTEN/NOTIFY or in-cluster broker). Flag as a design spike before build.

7.4 Agent

New internal/stream/ client: maintains the WS, handles start/stop-log-stream, spawns a goroutine per stream calling GetLogs(...).Stream(), forwards chunks, honors ctx cancel. Requires the escalated pods/log RBAC — feature is inert (returns a clear “logs not enabled” error) when the role isn’t granted.

7.5 Dashboard

Log viewer (drawer/full-screen): container selector, follow toggle, tail-lines, pause/resume, clear, wrap, copy/download buffer, search-in-buffer. Consume via EventSource. Auto-reconnect with resume hint.

Phase 3 acceptance

Selecting a pod/container streams live logs within ~1s; new lines append in real time.
Closing the viewer terminates the agent-side stream (verify ctx cancel, no leaked goroutines).
Clusters without the escalated role show “Log streaming not enabled for this cluster” with enablement instructions.
Stream limits enforced; multi-replica path resolved (sticky or pub/sub).

8. RBAC: Opt-in Escalated Role (`deploy/manifests/`)

Add a separate ClusterRole niro-agent-escalated granting get/list on pods/log and (if needed) metrics.k8s.io pods. Default niro-agent role stays read-only — unchanged.
Enablement: operator applies the escalated ClusterRoleBinding (e.g. install.sh --enable-logs, or a documented kubectl apply). Agent detects capability (SelfSubjectAccessReview or a startup probe) and reports it to CP.
CP stores a per-cluster logs_enabled (and pod_metrics_enabled) capability flag; dashboard reflects it (disabled “View logs” with tooltip when off).
Security: document that enabling grants log read (which may contain secrets) to dashboard users in the org subtree. Audit-log stream starts (who/which pod/when).

9. Data, Scale & Performance

Heartbeat size: pod payload scales with pod count. Cap pods-per-heartbeat; if exceeded, truncate with a flag + log (no silent loss). Consider gzip on the heartbeat body if large.
Write amplification: wholesale pod replace every 15s × clusters. Use bulk insert (COPY/multi-row). pod_metrics append-only — reuse existing retention pruning job/pattern.
Fleet query: ListPodsForOrgSubtree must be indexed + paginated; never unbounded.
Streaming: bounded buffers, per-stream/per-org caps, idle timeouts.

10. Security & Multi-tenancy

All new user endpoints go through withUser + OrgSubtree scoping — a user only sees pods/logs for clusters in their org subtree. Verify on every pod and stream endpoint (defense in depth: re-check cluster ownership server-side, never trust client clusterId alone).
Logs/exec-class access strictly behind the opt-in role. Audit stream initiations.
No new endpoint uses the /_private prefix (reverse-proxy routing rule).

11. Testing

Contract/back-compat: old-protocol heartbeat (no Pods) still accepted.
Agent: unit-test Snapshot() pod mapping (ready string, restarts, owner refs) with fake clientset; metrics best-effort path.
Control-plane: handler + query tests for pods CRUD, org-subtree scoping (negative: other org’s pods invisible), pagination, retention.
Streaming: integration test agent→CP→browser chunk relay; stop/cancel cleanup; limit enforcement; multi-replica routing.
E2E: seed a kind/k3s cluster + agent; assert pods list, usage chart, live log tail.

12. Rollout

Protocol bump is additive & back-compat; deploy CP first, then agents.
Phase 3 transport behind a feature flag; escalated RBAC off by default.
Zero-downtime rollout already in place (control-plane/dashboard); ensure WS/SSE survive rolling restarts (reconnect logic).

13. Phase / Task Breakdown

Phase 1 — Inventory

contract: add PodDetail, HeartbeatRequest.Pods, bump protocol (keep back-compat).
agent/collect: populate req.Pods.
CP migration 00004_pods.sql + queries (make sqlc).
CP heartbeat handler: replace pods in tx.
CP endpoints GET /v1/clusters/pods, GET /v1/clusters/{id}/pods.
Dashboard /pods page, table, filters, nav, fetchers.
Tests + back-compat.

Phase 2 — Usage 8. contract: pod usage/request/limit fields. 9. agent: pod metrics + spec requests/limits (graceful degrade). 10. CP migration 00005_pod_metrics.sql, retention reuse, detail + metrics endpoints. 11. Dashboard pod detail drawer + usage chart + table columns. 12. Tests (metrics-absent path). Phase 3 — Logs 13. Spike: multi-replica stream routing (sticky vs pub/sub). 14. contract: stream/log command + chunk frames. 15. CP internal/stream hub; agent WS endpoint; browser SSE + control endpoints. 16. agent/stream client + GetLogs().Stream(). 17. deploy: opt-in escalated RBAC role + capability detection/flag. 18. Dashboard log viewer. 19. Streaming/cleanup/limit/multi-replica tests + E2E.

14. Open Questions

Pod scale ceiling — realistic max pods/cluster? Sets heartbeat cap & whether 15s wholesale replace is fine vs. delta encoding.
Heartbeat vs. pod-list cadence — keep pods on the 15s heartbeat, or a slower separate pod-sync interval to cut write amplification?
CP replica count in prod — single or multi? Determines if the Phase-3 pub/sub spike is mandatory now.
Browser↔CP for logs — SSE (simpler, one-way, fits tailing) vs. WebSocket (needed only if interactive input later). Default SSE unless exec is imminent.
Log retention — confirm live-tail only (no storage), or is “last N lines on open” enough without a follow stream for v1 of Phase 3?
Enablement UX — install.sh --enable-logs flag vs. documented manual kubectl apply for the escalated role?
Multi-container default — auto-pick first container, or require selection before streaming?

​PRD: Consolidated Fleet Pod Management

​1. Summary

​2. Problem & Goals

​3. Decisions (locked)

​4. Architecture Impact

​5. Phase 1 — Consolidated Pod Inventory

​5.1 Contract (contract/heartbeat.go)

​5.2 Agent (agent/internal/collect/collect.go)

​5.3 Control-plane

​5.4 Dashboard

​6. Phase 2 — Per-Pod Resource Usage

​6.1 Contract

​6.2 Agent

​6.3 Control-plane

​6.4 Dashboard

​7. Phase 3 — Live Log Streaming

​7.1 Transport (new, reusable)

​7.2 Contract

​7.3 Control-plane

​7.4 Agent

​7.5 Dashboard

​8. RBAC: Opt-in Escalated Role (deploy/manifests/)

​9. Data, Scale & Performance

​10. Security & Multi-tenancy

​11. Testing

​12. Rollout

​13. Phase / Task Breakdown

​14. Open Questions