AI Agent Live Monitoring

Real-time human-in-the-loop supervision for the production AI agent. When the agent fails or degrades mid-conversation — an AI-service/engine error, an unexpected handover, a silent message-limit cap, or a low-confidence answer — a configured supervisor is alerted in real time so they can open the room and intervene before the customer is left stranded.

Today none of these events reach a human proactively: failures fall back to a default answer or a Rollbar log, and a supervisor only finds out when a customer complains or during after-the-fact review. This initiative closes that gap by turning the agent's existing quality/safety signals into push alerts, delivered through the existing notification-service chat channel.

This is the escalation counterpart to the initiative's after-the-fact quality work: the AI Activity Log datamart remains the durable system of record, while AI Agent Testing (pre-release) and Unified Agent Scorecard (post-hoc scoring) judge agent quality on their own cadence. Live Monitoring is the only piece that puts a human on the room while it is still happening.

Master index (ANCHOR)

ai-agent-monitoring-anchor.md — the ANCHOR PRD: the initiative master index (identity, Phase Index, north-star metrics, initiative-level decisions). It carries no acceptance criteria of its own (those live in the phase PRDs), so it sits at the initiative root rather than under prds/.

Phases

Phase	Scope	Status	Epic
Phase 1 — Supervisor Alerts (Confluence)	Detect the four "agent did something wrong" signals in chatbot BE, resolve the org supervisor, and emit a throttled/deduped real-time alert via notification-service chat push	PRD draft	TBD
Phase 2 — Monitoring Console	An at-a-glance "agent health" board (live failure feed, per-org rates, drill-in to room) reading the alert/telemetry stream + Activity Log datamart	Planned	TBD
Phase 3 — Scorecard / closed-loop hooks	Feed confirmed live-failure events into the Unified Agent Scorecard and explore automated containment (pause/forced-handoff)	Planned	TBD

Cross-repo & dependencies

chatbot (Rails BE) — signal detection at the existing emit chokepoints, supervisor resolution, throttle/dedup, and the outbound emit. Primary surface for Phase 1.
notification-service (Go) — delivery. Phase 1 needs a new ai_agent_alert notification category (seed + FCM whitelist line) and confirmed chat-origin push to supervisor devices. Captured as a blocking dependency with a small TECH RFC owned by the Broadcast / notification squad — this initiative does not modify that repo directly.
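
The throttle/dedup step that chatbot BE applies before the outbound emit could look roughly like the sketch below. It is illustrative only: the class name `AlertThrottle`, the dedup key of (org, room, signal), the signal symbols, and the 300-second window are all assumptions, not decisions recorded in this initiative. It also keeps state in process memory, which a multi-process Rails deployment would likely replace with a shared store.

```ruby
# Hypothetical sketch: suppress repeat alerts for the same (org, room, signal)
# inside a time window. Names and the default window are assumptions.
class AlertThrottle
  def initialize(window_seconds: 300, clock: -> { Time.now.to_f })
    @window_seconds = window_seconds
    @clock = clock          # injectable so the window can be tested deterministically
    @last_emitted = {}      # dedup key => timestamp of the last emitted alert
  end

  # Returns true when an alert should be emitted now, false when an alert
  # with the same key was already emitted inside the window.
  def allow?(org_id:, room_id:, signal:)
    key = [org_id, room_id, signal]
    now = @clock.call
    last = @last_emitted[key]
    return false if last && (now - last) < @window_seconds

    @last_emitted[key] = now
    true
  end
end

# Usage with hypothetical signal names for two of the four Phase 1 signals.
throttle = AlertThrottle.new
throttle.allow?(org_id: 1, room_id: "r1", signal: :engine_error)   # first alert passes
throttle.allow?(org_id: 1, room_id: "r1", signal: :engine_error)   # duplicate is suppressed
throttle.allow?(org_id: 1, room_id: "r1", signal: :low_confidence) # different signal passes
```

Keying on the signal as well as the room means a second, different failure in the same room still reaches the supervisor, while a burst of identical failures produces a single push.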

Scope Changes

Backend — chatbot BE signal detection + supervisor resolution + throttle/dedup + outbound emit; notification-service category/whitelist change tracked as an external dependency (Broadcast squad).
Frontend — hub-chat (Omnichannel) Notification Center renders the new ai_agent_alert category (icon/label + room deep-link) in the bell + FCM push; an incremental add owned by the Inbox squad.
Data — alert events land in the AI Activity Log datamart as the system of record; KPI (supervisor time-to-intervene) is computed there.
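
As a rough illustration of the time-to-intervene KPI, the sketch below takes the median gap between an alert and the supervisor's first intervention. The real computation lives in the datamart. The event shape (`:alerted_at`, `:intervened_at`), the choice of median, and the exclusion of alerts that were never acted on are assumptions made for this example only.

```ruby
# Hypothetical sketch: median seconds between alert and supervisor intervention.
# Each event is a hash with :alerted_at (Time) and :intervened_at (Time or nil).
def median_time_to_intervene(events)
  durations = events
    .select { |e| e[:intervened_at] }                 # skip alerts nobody acted on
    .map { |e| e[:intervened_at] - e[:alerted_at] }   # Time - Time gives Float seconds
    .sort
  return nil if durations.empty?

  mid = durations.length / 2
  durations.length.odd? ? durations[mid] : (durations[mid - 1] + durations[mid]) / 2.0
end

events = [
  { alerted_at: Time.at(100), intervened_at: Time.at(130) },
  { alerted_at: Time.at(200), intervened_at: Time.at(290) },
  { alerted_at: Time.at(300), intervened_at: nil }
]
median_time_to_intervene(events) # median of the 30 s and 90 s gaps
```

Alerts with no intervention are dropped here for simplicity; a production metric would probably report them separately, since an unanswered alert is itself a signal.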

QA Lane

Lane B: keeps a human QA gate. Rationale: the flows are real-time. Live supervisor alerts and push notifications are timing-dependent, and automation reproduces them poorly. No E2E test specs exist for this initiative yet, so the Lane-A entry bar (100% E2E, spec-mapped coverage) is unmet regardless. Classified 2026-06-29.

Contents

prds/ — phase PRDs (each with its own ACs → Jira Epic) land here.
rfcs/ — technical design proposals (incl. the notification-service ai_agent_alert TECH RFC).
tests/ — E2E / acceptance test specs.
