Skip to main content

PRD: Consolidated Fleet Pod Management

Status: Draft · Owner: TBD · Created: 2026-06-05 Modules touched: contract/, agent/, control-plane/, dashboard/, deploy/

1. Summary

Expand Niro’s k3s fleet management with a single place to:
  1. List all pods across every cluster in the org subtree (consolidated inventory).
  2. Inspect per-pod resource usage (CPU/memory, requests/limits, restarts) with history.
  3. Stream pod logs live to triage issues without leaving the dashboard.
Shipped in 3 phases. Pod inventory + usage are pushed in the heartbeat (always-fresh). Logs use a new real-time SSE/WebSocket transport (the documented v2 path) gated behind an opt-in escalated agent RBAC role.

2. Problem & Goals

Today: Heartbeat carries only aggregate counts (pods, pods_running, pods_failed, namespaces). Operators see per-cluster tiles but cannot see individual pods, drill into resource usage, or read logs — they must kubectl into each cluster separately. With NAT’d clusters and outbound-only agents, that’s painful. Goals
  • G1: One cross-cluster pod table, filterable by cluster / namespace / status / search.
  • G2: Per-pod CPU & memory (usage vs requests/limits) + restart/age, with short history chart.
  • G3: Live log tailing for a chosen pod/container from the browser.
  • G4: Preserve the secure-by-default, outbound-only, multi-tenant (org-subtree) architecture.
Non-goals (this PRD)
  • kubectl exec / shell, port-forward, pod editing/deletion, scaling.
  • Cross-cluster log search/aggregation/retention (live tail only; no log storage).
  • Full historical pod metrics warehouse (only short rolling window, plan-gated).

3. Decisions (locked)

DecisionChoiceRationale
Pod inventory deliveryPush in heartbeatAlways-fresh consolidated view, no per-view round-trips, fits dozens-of-clusters scale.
Log transportBuild SSE/WebSocket nowTrue real-time tailing; reusable v2 transport for future exec/port-forward.
RBAC postureOpt-in escalated roleKeeps read-only default; logs/pod-metrics require explicit per-cluster enablement.
ScopePhased (1→2→3)Ship value incrementally; de-risk the hard streaming work last.

4. Architecture Impact

            heartbeat (15s, now carries pod inventory + per-pod metrics)
  agent  ─────────────────────────────────────────────►  control-plane ──►  Postgres
   │                                                          ▲   │
   │     NEW: persistent stream (logs)                        │   │  GET /v1/clusters/pods (fleet)
   └──────────  WS  ◄────────────────────────────────────────┘   ▼
                                                              dashboard (React Query + EventSource/WS)
  • Heartbeat grows (Phase 1/2): add Pods []PodDetail to HeartbeatRequest. Bump Protocol. Control-plane writes a pods snapshot per heartbeat (wholesale replace, same pattern as nodes).
  • New streaming transport (Phase 3): a persistent agent↔CP channel (WebSocket) so logs flow agent→CP→browser. Agent remains the dialer (outbound-only preserved). Browser↔CP uses SSE or WS.
  • Escalated RBAC (Phase 2/3): new opt-in ClusterRole granting pods/log (+ ensures pod metrics). Default role unchanged.

5. Phase 1 — Consolidated Pod Inventory

Outcome: A /pods page listing every pod across the org’s clusters.

5.1 Contract (contract/heartbeat.go)

Add:
type PodDetail struct {
    Namespace    string   `json:"namespace"`
    Name         string   `json:"name"`
    Phase        string   `json:"phase"`        // Running/Pending/Failed/Succeeded/Unknown
    NodeName     string   `json:"nodeName"`
    Restarts     int32    `json:"restarts"`
    Ready        string   `json:"ready"`        // "2/3"
    Containers   []string `json:"containers"`   // names (images optional, see Phase 2)
    StartedUnix  int64    `json:"startedUnix"`
    OwnerKind    string   `json:"ownerKind"`    // Deployment/StatefulSet/DaemonSet/Job...
    OwnerName    string   `json:"ownerName"`
}
// HeartbeatRequest gains: Pods []PodDetail `json:"pods,omitempty"`
Bump Protocol (and MinSupportedProtocol policy — keep back-compat: CP must accept heartbeats without Pods).

5.2 Agent (agent/internal/collect/collect.go)

Extend Snapshot() to populate req.Pods from the existing Pods("").List(...) call (already fetched — just map each item instead of only counting). Derive Ready, Restarts, owner refs from pod.OwnerReferences. No new RBAC needed (list pods already allowed).

5.3 Control-plane

  • Migration 00004_pods.sql: table pods (cluster_id, namespace, name, phase, node_name, restarts, ready, containers text[], owner_kind, owner_name, started_at, updated_at), PK (cluster_id, namespace, name), index on (cluster_id) and (phase).
  • Heartbeat handler (agent.go): inside the existing tx, DeletePodsForCluster + bulk insert req.Pods (wholesale replace, mirrors node handling). Guard payload size.
  • Queries (queries.sqlmake sqlc): DeletePodsForCluster, InsertPod, ListPodsForOrgSubtree (joins clusters via the existing OrgSubtree CTE; supports filters + keyset/limit pagination), ListPodsForCluster.
  • Endpoints (server.go, new pods_api.go):
    • GET /v1/clusters/pods — fleet-wide, withUser, filters: cluster, namespace, status, q, limit, cursor.
    • GET /v1/clusters/{id}/pods — single-cluster.

5.4 Dashboard

  • New route app/pods/page.tsx + pods-table.tsx (clone patterns from clusters/fleet-table.tsx): server-seeded (PPR) then React Query 5s poll.
  • Columns: Cluster · Namespace · Pod · Status badge · Ready · Restarts · Node · Age. Filter bar (cluster, namespace, status) + search. Status badge reuses cluster badge styling.
  • Add nav entry in AppShell. lib/api.ts + lib/server.ts fetchers.
Phase 1 acceptance
  • Heartbeat from an agent with N pods results in N rows visible on /pods, scoped to the caller’s org subtree.
  • Filtering by cluster/namespace/status and search work; pagination holds at hundreds of pods.
  • Old-protocol agents (no Pods) still heartbeat successfully; their pods simply don’t list.

6. Phase 2 — Per-Pod Resource Usage

Outcome: Click a pod → drawer with CPU/memory usage vs requests/limits + short history; restart/age detail.

6.1 Contract

Extend PodDetail (or add parallel PodUsage) with:
CPUUsageMilli  int64 `json:"cpuUsageMilli"`
MemUsageBytes  int64 `json:"memUsageBytes"`
CPURequestMilli int64 `json:"cpuRequestMilli"`
CPULimitMilli   int64 `json:"cpuLimitMilli"`
MemRequestBytes int64 `json:"memRequestBytes"`
MemLimitBytes   int64 `json:"memLimitBytes"`

6.2 Agent

  • Requests/limits: sum container resources from the pod spec (no new RBAC).
  • Usage: call mc.MetricsV1beta1().PodMetricses("").List(...) (mirrors existing nodeUsage), join to pods by namespace/name. Best-effort — zero when metrics-server absent or RBAC not escalated. Pod metrics may require the escalated role (see §8); degrade gracefully.

6.3 Control-plane

  • Migration 00005_pod_metrics.sql: append-only pod_metrics (cluster_id, namespace, name, recorded_at, cpu_usage_milli, mem_usage_bytes, cpu_request_milli, cpu_limit_milli, mem_request_bytes, mem_limit_bytes). Add the usage/request/limit columns to pods for “latest” reads.
  • Reuse the metrics retention/plan-gating pattern from metrics_api.go (free 24h / pro 7d / enterprise 30d).
  • Endpoints: GET /v1/clusters/{id}/pods/{namespace}/{name} (detail incl. latest usage), GET /v1/clusters/{id}/pods/{namespace}/{name}/metrics (bucketed time-series).

6.4 Dashboard

  • Pod detail drawer/modal: usage tiles (CPU/mem usage vs request/limit, % bars), Recharts time-series (reuse components/charts/), restart/age/owner metadata.
  • Add CPU/Mem columns to the /pods table (sortable).
Phase 2 acceptance
  • Pod drawer shows live CPU/mem and request/limit; chart renders within plan retention window.
  • Clusters without metrics-server (or without escalated role) show “metrics unavailable” rather than erroring.

7. Phase 3 — Live Log Streaming

Outcome: From a pod, “View logs” opens a live tail of a chosen container, streamed browser←CP←agent.

7.1 Transport (new, reusable)

  • Agent↔CP: persistent WebSocket initiated by the agent (preserves outbound-only). Agent dials wss://<cp>/v1/agent/stream, authenticates with the agent key (X-Niro-Agent-Key), then multiplexes stream sessions. Coexists with existing long-poll commands; the command envelope (contract/command.go) is already designed transport-agnostic.
  • Log session lifecycle:
    1. Browser calls POST /v1/clusters/{id}/pods/{ns}/{name}/logs {container, follow, tailLines, sinceSeconds} → CP returns a streamId.
    2. CP sends a start-log-stream frame to that cluster’s agent over the WS.
    3. Agent runs cs.CoreV1().Pods(ns).GetLogs(name, &PodLogOptions{Follow:true,...}).Stream(ctx) and pushes chunks back over the WS, tagged streamId.
    4. CP relays chunks to the browser via SSE GET /v1/streams/{streamId} (EventSource) — or WS if duplex needed.
    5. Browser closing the SSE / DELETE /v1/streams/{streamId} → CP sends stop → agent cancels the stream ctx.
  • Backpressure & limits: per-stream byte/line cap, idle timeout, max concurrent streams per cluster/org. Drop with a visible “stream throttled/ended” notice rather than unbounded buffering.

7.2 Contract

New frame/command types in contract/: CommandStartLogStream, CommandStopLogStream, plus a StreamChunk{StreamID, Seq, Data, EOF, Err} envelope. Add WS frame types (control vs data).

7.3 Control-plane

  • New internal/stream/ hub: registry of connected agents (clusterID→WS conn) and active browser subscribers (streamId→SSE writer). Relays chunks, enforces auth (browser must own a cluster in its org subtree; agent must match clusterID), applies limits.
  • Endpoints: GET /v1/agent/stream (WS, withAgentAuth), POST .../logs + GET /v1/streams/{id} (SSE) + DELETE /v1/streams/{id} (withUser).
  • Multi-replica note: if CP runs >1 replica (the repo already does zero-downtime rollouts), agent WS and browser SSE may land on different replicas. Use sticky routing for the stream pair, or a lightweight pub/sub (e.g. Postgres LISTEN/NOTIFY or in-cluster broker). Flag as a design spike before build.

7.4 Agent

  • New internal/stream/ client: maintains the WS, handles start/stop-log-stream, spawns a goroutine per stream calling GetLogs(...).Stream(), forwards chunks, honors ctx cancel. Requires the escalated pods/log RBAC — feature is inert (returns a clear “logs not enabled” error) when the role isn’t granted.

7.5 Dashboard

  • Log viewer (drawer/full-screen): container selector, follow toggle, tail-lines, pause/resume, clear, wrap, copy/download buffer, search-in-buffer. Consume via EventSource. Auto-reconnect with resume hint.
Phase 3 acceptance
  • Selecting a pod/container streams live logs within ~1s; new lines append in real time.
  • Closing the viewer terminates the agent-side stream (verify ctx cancel, no leaked goroutines).
  • Clusters without the escalated role show “Log streaming not enabled for this cluster” with enablement instructions.
  • Stream limits enforced; multi-replica path resolved (sticky or pub/sub).

8. RBAC: Opt-in Escalated Role (deploy/manifests/)

  • Add a separate ClusterRole niro-agent-escalated granting get/list on pods/log and (if needed) metrics.k8s.io pods. Default niro-agent role stays read-only — unchanged.
  • Enablement: operator applies the escalated ClusterRoleBinding (e.g. install.sh --enable-logs, or a documented kubectl apply). Agent detects capability (SelfSubjectAccessReview or a startup probe) and reports it to CP.
  • CP stores a per-cluster logs_enabled (and pod_metrics_enabled) capability flag; dashboard reflects it (disabled “View logs” with tooltip when off).
  • Security: document that enabling grants log read (which may contain secrets) to dashboard users in the org subtree. Audit-log stream starts (who/which pod/when).

9. Data, Scale & Performance

  • Heartbeat size: pod payload scales with pod count. Cap pods-per-heartbeat; if exceeded, truncate with a flag + log (no silent loss). Consider gzip on the heartbeat body if large.
  • Write amplification: wholesale pod replace every 15s × clusters. Use bulk insert (COPY/multi-row). pod_metrics append-only — reuse existing retention pruning job/pattern.
  • Fleet query: ListPodsForOrgSubtree must be indexed + paginated; never unbounded.
  • Streaming: bounded buffers, per-stream/per-org caps, idle timeouts.

10. Security & Multi-tenancy

  • All new user endpoints go through withUser + OrgSubtree scoping — a user only sees pods/logs for clusters in their org subtree. Verify on every pod and stream endpoint (defense in depth: re-check cluster ownership server-side, never trust client clusterId alone).
  • Logs/exec-class access strictly behind the opt-in role. Audit stream initiations.
  • No new endpoint uses the /_private prefix (reverse-proxy routing rule).

11. Testing

  • Contract/back-compat: old-protocol heartbeat (no Pods) still accepted.
  • Agent: unit-test Snapshot() pod mapping (ready string, restarts, owner refs) with fake clientset; metrics best-effort path.
  • Control-plane: handler + query tests for pods CRUD, org-subtree scoping (negative: other org’s pods invisible), pagination, retention.
  • Streaming: integration test agent→CP→browser chunk relay; stop/cancel cleanup; limit enforcement; multi-replica routing.
  • E2E: seed a kind/k3s cluster + agent; assert pods list, usage chart, live log tail.

12. Rollout

  • Protocol bump is additive & back-compat; deploy CP first, then agents.
  • Phase 3 transport behind a feature flag; escalated RBAC off by default.
  • Zero-downtime rollout already in place (control-plane/dashboard); ensure WS/SSE survive rolling restarts (reconnect logic).

13. Phase / Task Breakdown

Phase 1 — Inventory
  1. contract: add PodDetail, HeartbeatRequest.Pods, bump protocol (keep back-compat).
  2. agent/collect: populate req.Pods.
  3. CP migration 00004_pods.sql + queries (make sqlc).
  4. CP heartbeat handler: replace pods in tx.
  5. CP endpoints GET /v1/clusters/pods, GET /v1/clusters/{id}/pods.
  6. Dashboard /pods page, table, filters, nav, fetchers.
  7. Tests + back-compat.
Phase 2 — Usage 8. contract: pod usage/request/limit fields. 9. agent: pod metrics + spec requests/limits (graceful degrade). 10. CP migration 00005_pod_metrics.sql, retention reuse, detail + metrics endpoints. 11. Dashboard pod detail drawer + usage chart + table columns. 12. Tests (metrics-absent path). Phase 3 — Logs 13. Spike: multi-replica stream routing (sticky vs pub/sub). 14. contract: stream/log command + chunk frames. 15. CP internal/stream hub; agent WS endpoint; browser SSE + control endpoints. 16. agent/stream client + GetLogs().Stream(). 17. deploy: opt-in escalated RBAC role + capability detection/flag. 18. Dashboard log viewer. 19. Streaming/cleanup/limit/multi-replica tests + E2E.

14. Open Questions

  1. Pod scale ceiling — realistic max pods/cluster? Sets heartbeat cap & whether 15s wholesale replace is fine vs. delta encoding.
  2. Heartbeat vs. pod-list cadence — keep pods on the 15s heartbeat, or a slower separate pod-sync interval to cut write amplification?
  3. CP replica count in prod — single or multi? Determines if the Phase-3 pub/sub spike is mandatory now.
  4. Browser↔CP for logs — SSE (simpler, one-way, fits tailing) vs. WebSocket (needed only if interactive input later). Default SSE unless exec is imminent.
  5. Log retention — confirm live-tail only (no storage), or is “last N lines on open” enough without a follow stream for v1 of Phase 3?
  6. Enablement UXinstall.sh --enable-logs flag vs. documented manual kubectl apply for the escalated role?
  7. Multi-container default — auto-pick first container, or require selection before streaming?
Last modified on June 12, 2026