AI Agent Live Monitoring
Real-time human-in-the-loop supervision for the production AI agent. When the agent fails or degrades mid-conversation — an AI-service/engine error, an unexpected handover, a silent message-limit cap, or a low-confidence answer — a configured supervisor is alerted in real time so they can open the room and intervene before the customer is left stranded.
Today none of these events reach a human proactively: failures fall back to a default answer or a Rollbar log, and a supervisor only finds out when a customer complains or during after-the-fact review. This initiative closes that gap by turning the agent's existing quality/safety signals into push alerts, delivered through the existing notification-service chat channel.
This is the escalation counterpart to the initiative's after-the-fact quality work: the AI Activity Log datamart remains the durable system of record, while AI Agent Testing (pre-release) and Unified Agent Scorecard (post-hoc scoring) judge agent quality on their own cadence. Live Monitoring is the only piece that puts a human on the room while it is still happening.
Master index (ANCHOR)
ai-agent-monitoring-anchor.md — the ANCHOR PRD: the initiative master index (identity, Phase Index, north-star metrics, initiative-level decisions). It carries no acceptance criteria of its own (those live in the phase PRDs), so it sits at the initiative root rather than under prds/.
Phases
| Phase | Scope | Status | Epic |
|---|---|---|---|
| Phase 1 — Supervisor Alerts (Confluence) | Detect the four "agent did something wrong" signals in chatbot BE, resolve the org supervisor, and emit a throttled/deduped real-time alert via notification-service chat push | PRD draft | TBD |
| Phase 2 — Monitoring Console | An at-a-glance "agent health" board (live failure feed, per-org rates, drill-in to room) reading the alert/telemetry stream + Activity Log datamart | Planned | TBD |
| Phase 3 — Scorecard / closed-loop hooks | Feed confirmed live-failure events into the Unified Agent Scorecard and explore automated containment (pause/forced-handoff) | Planned | TBD |
Cross-repo & dependencies
- chatbot (Rails BE) — signal detection at the existing emit chokepoints, supervisor resolution, throttle/dedup, and the outbound emit. Primary surface for Phase 1.
- notification-service (Go) — delivery. Phase 1 needs a new ai_agent_alert notification category (seed + FCM whitelist line) and confirmed chat-origin push to supervisor devices. Captured as a blocking dependency with a small TECH RFC owned by the Broadcast / notification squad; this initiative does not modify that repo directly.
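The throttle/dedup step on the chatbot side can be illustrated with a minimal, language-neutral sketch. It is written in Python for brevity, not in the Rails code the chatbot actually uses, and it is not the Phase 1 design. The `Signal` member names, the `AlertThrottle` class, the `(room, signal)` dedup key, and the 300-second window are all assumptions made for illustration; the real keying and window are for the Phase 1 PRD to decide.

```python
from dataclasses import dataclass, field
from enum import Enum


class Signal(Enum):
    # The four Phase 1 signals named in this README (member names are hypothetical).
    ENGINE_ERROR = "engine_error"
    UNEXPECTED_HANDOVER = "unexpected_handover"
    MESSAGE_LIMIT_CAP = "message_limit_cap"
    LOW_CONFIDENCE = "low_confidence"


@dataclass
class AlertThrottle:
    """Suppress repeat alerts for the same (room, signal) pair within a time window."""

    window_seconds: float = 300.0  # hypothetical default, not a decided value
    _last_sent: dict = field(default_factory=dict)

    def should_emit(self, room_id: str, signal: Signal, now: float) -> bool:
        key = (room_id, signal)
        last = self._last_sent.get(key)
        if last is not None and now - last < self.window_seconds:
            return False  # duplicate inside the window: drop it
        self._last_sent[key] = now  # remember when this pair last alerted
        return True


throttle = AlertThrottle(window_seconds=300.0)
results = [
    throttle.should_emit("room-1", Signal.LOW_CONFIDENCE, now=0.0),    # first alert
    throttle.should_emit("room-1", Signal.LOW_CONFIDENCE, now=60.0),   # repeat, suppressed
    throttle.should_emit("room-1", Signal.ENGINE_ERROR, now=60.0),     # different signal
    throttle.should_emit("room-1", Signal.LOW_CONFIDENCE, now=400.0),  # window elapsed
]
print(results)  # [True, False, True, True]
```

Keying on the room and the signal together means a second, different failure in the same room still reaches the supervisor, while a burst of identical failures produces one alert per window. An in-memory dict only works for a single process; a production version would presumably need a shared store.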
Scope Changes
- Backend — chatbot BE signal detection + supervisor resolution + throttle/dedup + outbound emit; notification-service category/whitelist change tracked as an external dependency (Broadcast squad).
- Frontend — hub-chat (Omnichannel) Notification Center renders the new ai_agent_alert category (icon/label + room deep-link) in the bell + FCM push; an incremental add owned by the Inbox squad.
- Data — alert events land in the AI Activity Log datamart as the system of record; KPI (supervisor time-to-intervene) is computed there.
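As a rough illustration of the supervisor time-to-intervene KPI, the sketch below assumes it is measured from the moment an alert is emitted to the supervisor's first action in the room, and that alerts with no intervention are excluded from the aggregate. Both are assumptions, not definitions from the PRDs. The function name, the tuple layout, and the sample timestamps are made up for the example and do not reflect the datamart schema.

```python
from datetime import datetime
from statistics import median
from typing import Optional


def time_to_intervene_seconds(
    alerted_at: datetime, intervened_at: Optional[datetime]
) -> Optional[float]:
    """Seconds from alert emission to first supervisor action, or None if nobody intervened."""
    if intervened_at is None:
        return None
    return (intervened_at - alerted_at).total_seconds()


# Made-up sample rows: (alerted_at, intervened_at). The last alert was never picked up.
events = [
    (datetime(2026, 1, 1, 10, 0, 0), datetime(2026, 1, 1, 10, 0, 45)),
    (datetime(2026, 1, 1, 11, 0, 0), datetime(2026, 1, 1, 11, 2, 0)),
    (datetime(2026, 1, 1, 12, 0, 0), None),
]

durations = [
    d for alerted, intervened in events
    if (d := time_to_intervene_seconds(alerted, intervened)) is not None
]
median_tti = median(durations)
print(durations, median_tti)  # [45.0, 120.0] 82.5
```

Excluding unanswered alerts keeps the median meaningful, but it hides alerts that were never picked up, so a real report would likely track the unanswered share alongside it.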
QA Lane
Lane B — keeps a human QA gate. Real-time: live supervisor alerts and push notifications are timing-dependent flows automation reproduces poorly. No E2E test specs exist for this initiative yet, so the Lane-A entry bar (100% E2E, spec-mapped coverage) is unmet regardless. Classified 2026-06-29.
Contents
- prds/ — phase PRDs (each with its own ACs → Jira Epic) land here.
- rfcs/ — technical design proposals (incl. the notification-service ai_agent_alert TECH RFC).
- tests/ — E2E / acceptance test specs.