PRD: Consolidated Fleet Pod Management
Status: Draft · Owner: TBD · Created: 2026-06-05
Modules touched: contract/, agent/, control-plane/, dashboard/, deploy/
1. Summary
Expand Niro’s k3s fleet management with a single place to:
- List all pods across every cluster in the org subtree (consolidated inventory).
- Inspect per-pod resource usage (CPU/memory, requests/limits, restarts) with history.
- Stream pod logs live to triage issues without leaving the dashboard.
Shipped in 3 phases. Pod inventory + usage are pushed in the heartbeat (always-fresh). Logs use a new real-time SSE/WebSocket transport (the documented v2 path) gated behind an opt-in escalated agent RBAC role.
2. Problem & Goals
Today: Heartbeat carries only aggregate counts (pods, pods_running, pods_failed, namespaces). Operators see per-cluster tiles but cannot see individual pods, drill into resource usage, or read logs — they must kubectl into each cluster separately. With NAT’d clusters and outbound-only agents, that’s painful.
Goals
- G1: One cross-cluster pod table, filterable by cluster / namespace / status / search.
- G2: Per-pod CPU & memory (usage vs requests/limits) + restart/age, with short history chart.
- G3: Live log tailing for a chosen pod/container from the browser.
- G4: Preserve the secure-by-default, outbound-only, multi-tenant (org-subtree) architecture.
Non-goals (this PRD)
kubectl exec / shell, port-forward, pod editing/deletion, scaling.
- Cross-cluster log search/aggregation/retention (live tail only; no log storage).
- Full historical pod metrics warehouse (only short rolling window, plan-gated).
3. Decisions (locked)
| Decision | Choice | Rationale |
|---|
| Pod inventory delivery | Push in heartbeat | Always-fresh consolidated view, no per-view round-trips, fits dozens-of-clusters scale. |
| Log transport | Build SSE/WebSocket now | True real-time tailing; reusable v2 transport for future exec/port-forward. |
| RBAC posture | Opt-in escalated role | Keeps read-only default; logs/pod-metrics require explicit per-cluster enablement. |
| Scope | Phased (1→2→3) | Ship value incrementally; de-risk the hard streaming work last. |
4. Architecture Impact
heartbeat (15s, now carries pod inventory + per-pod metrics)
agent ─────────────────────────────────────────────► control-plane ──► Postgres
│ ▲ │
│ NEW: persistent stream (logs) │ │ GET /v1/clusters/pods (fleet)
└────────── WS ◄────────────────────────────────────────┘ ▼
dashboard (React Query + EventSource/WS)
- Heartbeat grows (Phase 1/2): add
Pods []PodDetail to HeartbeatRequest. Bump Protocol. Control-plane writes a pods snapshot per heartbeat (wholesale replace, same pattern as nodes).
- New streaming transport (Phase 3): a persistent agent↔CP channel (WebSocket) so logs flow agent→CP→browser. Agent remains the dialer (outbound-only preserved). Browser↔CP uses SSE or WS.
- Escalated RBAC (Phase 2/3): new opt-in ClusterRole granting
pods/log (+ ensures pod metrics). Default role unchanged.
5. Phase 1 — Consolidated Pod Inventory
Outcome: A /pods page listing every pod across the org’s clusters.
5.1 Contract (contract/heartbeat.go)
Add:
type PodDetail struct {
Namespace string `json:"namespace"`
Name string `json:"name"`
Phase string `json:"phase"` // Running/Pending/Failed/Succeeded/Unknown
NodeName string `json:"nodeName"`
Restarts int32 `json:"restarts"`
Ready string `json:"ready"` // "2/3"
Containers []string `json:"containers"` // names (images optional, see Phase 2)
StartedUnix int64 `json:"startedUnix"`
OwnerKind string `json:"ownerKind"` // Deployment/StatefulSet/DaemonSet/Job...
OwnerName string `json:"ownerName"`
}
// HeartbeatRequest gains: Pods []PodDetail `json:"pods,omitempty"`
Bump Protocol (and MinSupportedProtocol policy — keep back-compat: CP must accept heartbeats without Pods).
5.2 Agent (agent/internal/collect/collect.go)
Extend Snapshot() to populate req.Pods from the existing Pods("").List(...) call (already fetched — just map each item instead of only counting). Derive Ready, Restarts, owner refs from pod.OwnerReferences. No new RBAC needed (list pods already allowed).
5.3 Control-plane
- Migration
00004_pods.sql: table pods (cluster_id, namespace, name, phase, node_name, restarts, ready, containers text[], owner_kind, owner_name, started_at, updated_at), PK (cluster_id, namespace, name), index on (cluster_id) and (phase).
- Heartbeat handler (
agent.go): inside the existing tx, DeletePodsForCluster + bulk insert req.Pods (wholesale replace, mirrors node handling). Guard payload size.
- Queries (
queries.sql → make sqlc): DeletePodsForCluster, InsertPod, ListPodsForOrgSubtree (joins clusters via the existing OrgSubtree CTE; supports filters + keyset/limit pagination), ListPodsForCluster.
- Endpoints (
server.go, new pods_api.go):
GET /v1/clusters/pods — fleet-wide, withUser, filters: cluster, namespace, status, q, limit, cursor.
GET /v1/clusters/{id}/pods — single-cluster.
5.4 Dashboard
- New route
app/pods/page.tsx + pods-table.tsx (clone patterns from clusters/fleet-table.tsx): server-seeded (PPR) then React Query 5s poll.
- Columns: Cluster · Namespace · Pod · Status badge · Ready · Restarts · Node · Age. Filter bar (cluster, namespace, status) + search. Status badge reuses cluster badge styling.
- Add nav entry in AppShell.
lib/api.ts + lib/server.ts fetchers.
Phase 1 acceptance
- Heartbeat from an agent with N pods results in N rows visible on
/pods, scoped to the caller’s org subtree.
- Filtering by cluster/namespace/status and search work; pagination holds at hundreds of pods.
- Old-protocol agents (no
Pods) still heartbeat successfully; their pods simply don’t list.
6. Phase 2 — Per-Pod Resource Usage
Outcome: Click a pod → drawer with CPU/memory usage vs requests/limits + short history; restart/age detail.
6.1 Contract
Extend PodDetail (or add parallel PodUsage) with:
CPUUsageMilli int64 `json:"cpuUsageMilli"`
MemUsageBytes int64 `json:"memUsageBytes"`
CPURequestMilli int64 `json:"cpuRequestMilli"`
CPULimitMilli int64 `json:"cpuLimitMilli"`
MemRequestBytes int64 `json:"memRequestBytes"`
MemLimitBytes int64 `json:"memLimitBytes"`
6.2 Agent
- Requests/limits: sum container
resources from the pod spec (no new RBAC).
- Usage: call
mc.MetricsV1beta1().PodMetricses("").List(...) (mirrors existing nodeUsage), join to pods by namespace/name. Best-effort — zero when metrics-server absent or RBAC not escalated. Pod metrics may require the escalated role (see §8); degrade gracefully.
6.3 Control-plane
- Migration
00005_pod_metrics.sql: append-only pod_metrics (cluster_id, namespace, name, recorded_at, cpu_usage_milli, mem_usage_bytes, cpu_request_milli, cpu_limit_milli, mem_request_bytes, mem_limit_bytes). Add the usage/request/limit columns to pods for “latest” reads.
- Reuse the metrics retention/plan-gating pattern from
metrics_api.go (free 24h / pro 7d / enterprise 30d).
- Endpoints:
GET /v1/clusters/{id}/pods/{namespace}/{name} (detail incl. latest usage), GET /v1/clusters/{id}/pods/{namespace}/{name}/metrics (bucketed time-series).
6.4 Dashboard
- Pod detail drawer/modal: usage tiles (CPU/mem usage vs request/limit, % bars), Recharts time-series (reuse
components/charts/), restart/age/owner metadata.
- Add CPU/Mem columns to the
/pods table (sortable).
Phase 2 acceptance
- Pod drawer shows live CPU/mem and request/limit; chart renders within plan retention window.
- Clusters without metrics-server (or without escalated role) show “metrics unavailable” rather than erroring.
7. Phase 3 — Live Log Streaming
Outcome: From a pod, “View logs” opens a live tail of a chosen container, streamed browser←CP←agent.
7.1 Transport (new, reusable)
- Agent↔CP: persistent WebSocket initiated by the agent (preserves outbound-only). Agent dials
wss://<cp>/v1/agent/stream, authenticates with the agent key (X-Niro-Agent-Key), then multiplexes stream sessions. Coexists with existing long-poll commands; the command envelope (contract/command.go) is already designed transport-agnostic.
- Log session lifecycle:
- Browser calls
POST /v1/clusters/{id}/pods/{ns}/{name}/logs {container, follow, tailLines, sinceSeconds} → CP returns a streamId.
- CP sends a
start-log-stream frame to that cluster’s agent over the WS.
- Agent runs
cs.CoreV1().Pods(ns).GetLogs(name, &PodLogOptions{Follow:true,...}).Stream(ctx) and pushes chunks back over the WS, tagged streamId.
- CP relays chunks to the browser via SSE
GET /v1/streams/{streamId} (EventSource) — or WS if duplex needed.
- Browser closing the SSE /
DELETE /v1/streams/{streamId} → CP sends stop → agent cancels the stream ctx.
- Backpressure & limits: per-stream byte/line cap, idle timeout, max concurrent streams per cluster/org. Drop with a visible “stream throttled/ended” notice rather than unbounded buffering.
7.2 Contract
New frame/command types in contract/: CommandStartLogStream, CommandStopLogStream, plus a StreamChunk{StreamID, Seq, Data, EOF, Err} envelope. Add WS frame types (control vs data).
7.3 Control-plane
- New
internal/stream/ hub: registry of connected agents (clusterID→WS conn) and active browser subscribers (streamId→SSE writer). Relays chunks, enforces auth (browser must own a cluster in its org subtree; agent must match clusterID), applies limits.
- Endpoints:
GET /v1/agent/stream (WS, withAgentAuth), POST .../logs + GET /v1/streams/{id} (SSE) + DELETE /v1/streams/{id} (withUser).
- Multi-replica note: if CP runs >1 replica (the repo already does zero-downtime rollouts), agent WS and browser SSE may land on different replicas. Use sticky routing for the stream pair, or a lightweight pub/sub (e.g. Postgres
LISTEN/NOTIFY or in-cluster broker). Flag as a design spike before build.
7.4 Agent
- New
internal/stream/ client: maintains the WS, handles start/stop-log-stream, spawns a goroutine per stream calling GetLogs(...).Stream(), forwards chunks, honors ctx cancel. Requires the escalated pods/log RBAC — feature is inert (returns a clear “logs not enabled” error) when the role isn’t granted.
7.5 Dashboard
- Log viewer (drawer/full-screen): container selector, follow toggle, tail-lines, pause/resume, clear, wrap, copy/download buffer, search-in-buffer. Consume via
EventSource. Auto-reconnect with resume hint.
Phase 3 acceptance
- Selecting a pod/container streams live logs within ~1s; new lines append in real time.
- Closing the viewer terminates the agent-side stream (verify ctx cancel, no leaked goroutines).
- Clusters without the escalated role show “Log streaming not enabled for this cluster” with enablement instructions.
- Stream limits enforced; multi-replica path resolved (sticky or pub/sub).
8. RBAC: Opt-in Escalated Role (deploy/manifests/)
- Add a separate ClusterRole
niro-agent-escalated granting get/list on pods/log and (if needed) metrics.k8s.io pods. Default niro-agent role stays read-only — unchanged.
- Enablement: operator applies the escalated ClusterRoleBinding (e.g.
install.sh --enable-logs, or a documented kubectl apply). Agent detects capability (SelfSubjectAccessReview or a startup probe) and reports it to CP.
- CP stores a per-cluster
logs_enabled (and pod_metrics_enabled) capability flag; dashboard reflects it (disabled “View logs” with tooltip when off).
- Security: document that enabling grants log read (which may contain secrets) to dashboard users in the org subtree. Audit-log stream starts (who/which pod/when).
- Heartbeat size: pod payload scales with pod count. Cap pods-per-heartbeat; if exceeded, truncate with a flag + log (no silent loss). Consider gzip on the heartbeat body if large.
- Write amplification: wholesale pod replace every 15s × clusters. Use bulk insert (
COPY/multi-row). pod_metrics append-only — reuse existing retention pruning job/pattern.
- Fleet query:
ListPodsForOrgSubtree must be indexed + paginated; never unbounded.
- Streaming: bounded buffers, per-stream/per-org caps, idle timeouts.
10. Security & Multi-tenancy
- All new user endpoints go through
withUser + OrgSubtree scoping — a user only sees pods/logs for clusters in their org subtree. Verify on every pod and stream endpoint (defense in depth: re-check cluster ownership server-side, never trust client clusterId alone).
- Logs/exec-class access strictly behind the opt-in role. Audit stream initiations.
- No new endpoint uses the
/_private prefix (reverse-proxy routing rule).
11. Testing
- Contract/back-compat: old-protocol heartbeat (no
Pods) still accepted.
- Agent: unit-test
Snapshot() pod mapping (ready string, restarts, owner refs) with fake clientset; metrics best-effort path.
- Control-plane: handler + query tests for pods CRUD, org-subtree scoping (negative: other org’s pods invisible), pagination, retention.
- Streaming: integration test agent→CP→browser chunk relay; stop/cancel cleanup; limit enforcement; multi-replica routing.
- E2E: seed a kind/k3s cluster + agent; assert pods list, usage chart, live log tail.
12. Rollout
- Protocol bump is additive & back-compat; deploy CP first, then agents.
- Phase 3 transport behind a feature flag; escalated RBAC off by default.
- Zero-downtime rollout already in place (
control-plane/dashboard); ensure WS/SSE survive rolling restarts (reconnect logic).
13. Phase / Task Breakdown
Phase 1 — Inventory
contract: add PodDetail, HeartbeatRequest.Pods, bump protocol (keep back-compat).
agent/collect: populate req.Pods.
- CP migration
00004_pods.sql + queries (make sqlc).
- CP heartbeat handler: replace pods in tx.
- CP endpoints
GET /v1/clusters/pods, GET /v1/clusters/{id}/pods.
- Dashboard
/pods page, table, filters, nav, fetchers.
- Tests + back-compat.
Phase 2 — Usage
8. contract: pod usage/request/limit fields.
9. agent: pod metrics + spec requests/limits (graceful degrade).
10. CP migration 00005_pod_metrics.sql, retention reuse, detail + metrics endpoints.
11. Dashboard pod detail drawer + usage chart + table columns.
12. Tests (metrics-absent path).
Phase 3 — Logs
13. Spike: multi-replica stream routing (sticky vs pub/sub).
14. contract: stream/log command + chunk frames.
15. CP internal/stream hub; agent WS endpoint; browser SSE + control endpoints.
16. agent/stream client + GetLogs().Stream().
17. deploy: opt-in escalated RBAC role + capability detection/flag.
18. Dashboard log viewer.
19. Streaming/cleanup/limit/multi-replica tests + E2E.
14. Open Questions
- Pod scale ceiling — realistic max pods/cluster? Sets heartbeat cap & whether 15s wholesale replace is fine vs. delta encoding.
- Heartbeat vs. pod-list cadence — keep pods on the 15s heartbeat, or a slower separate pod-sync interval to cut write amplification?
- CP replica count in prod — single or multi? Determines if the Phase-3 pub/sub spike is mandatory now.
- Browser↔CP for logs — SSE (simpler, one-way, fits tailing) vs. WebSocket (needed only if interactive input later). Default SSE unless exec is imminent.
- Log retention — confirm live-tail only (no storage), or is “last N lines on open” enough without a follow stream for v1 of Phase 3?
- Enablement UX —
install.sh --enable-logs flag vs. documented manual kubectl apply for the escalated role?
- Multi-container default — auto-pick first container, or require selection before streaming?
Last modified on June 12, 2026