AI Agent Live Monitoring

Real-time human-in-the-loop supervision for the production AI agent. When the agent fails or degrades mid-conversation — an AI-service/engine error, an unexpected handover, a silent message-limit cap, or a low-confidence answer — a configured supervisor is alerted in real time so they can open the room and intervene before the customer is left stranded.

Today none of these events reach a human proactively: failures fall back to a default answer or a Rollbar log, and a supervisor only finds out when a customer complains or during after-the-fact review. This initiative closes that gap by turning the agent's existing quality/safety signals into push alerts, delivered through the existing notification-service chat channel.

This is the live-escalation counterpart to the initiative's other quality work: the AI Activity Log datamart remains the durable system of record, while AI Agent Testing (pre-release) and Unified Agent Scorecard (post-hoc scoring) judge agent quality on their own cadence. Live Monitoring is the only piece that puts a human in the room while the conversation is still happening.

Master index (ANCHOR)

  • ai-agent-monitoring-anchor.md — the ANCHOR PRD: the initiative master index (identity, Phase Index, north-star metrics, initiative-level decisions). It carries no acceptance criteria of its own (those live in the phase PRDs), so it sits at the initiative root rather than under prds/.

Phases

| Phase | Scope | Status | Epic |
|---|---|---|---|
| Phase 1 — Supervisor Alerts (Confluence) | Detect the four "agent did something wrong" signals in chatbot BE, resolve the org supervisor, and emit a throttled/deduped real-time alert via notification-service chat push | PRD draft | TBD |
| Phase 2 — Monitoring Console | An at-a-glance "agent health" board (live failure feed, per-org rates, drill-in to room) reading the alert/telemetry stream + Activity Log datamart | Planned | TBD |
| Phase 3 — Scorecard / closed-loop hooks | Feed confirmed live-failure events into the Unified Agent Scorecard and explore automated containment (pause/forced-handoff) | Planned | TBD |

Cross-repo & dependencies

  • chatbot (Rails BE) — signal detection at the existing emit chokepoints, supervisor resolution, throttle/dedup, and the outbound emit. Primary surface for Phase 1.
  • notification-service (Go) — delivery. Phase 1 needs a new ai_agent_alert notification category (seed + FCM whitelist line) and confirmed chat-origin push to supervisor devices. Captured as a blocking dependency with a small TECH RFC owned by the Broadcast / notification squad — this initiative does not modify that repo directly.
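
The throttle/dedup step in the chatbot BE could look roughly like the sketch below. It is an illustration only: the class name, signal symbols, 300-second window, and in-memory store are all assumptions, not names or values from the chatbot codebase or the Phase 1 PRD. A multi-process Rails deployment would presumably need a shared store rather than a per-process hash.

```ruby
# Hypothetical sketch: none of these names come from the chatbot codebase.
class AlertThrottle
  # The four signals named in this document; the symbol names are assumed.
  SIGNALS = %i[engine_error unexpected_handover message_limit_cap low_confidence].freeze

  # window_seconds: assumed suppression window per (room, signal) pair.
  # clock: injectable time source so the behavior can be tested deterministically.
  def initialize(window_seconds: 300, clock: -> { Time.now.to_f })
    @window = window_seconds
    @clock = clock
    @last_sent = {} # { [room_id, signal] => timestamp of the last emitted alert }
  end

  # Returns true when the alert should be emitted, false when the same
  # signal already fired for the same room inside the suppression window.
  def allow?(room_id:, signal:)
    raise ArgumentError, "unknown signal: #{signal}" unless SIGNALS.include?(signal)

    key = [room_id, signal]
    now = @clock.call
    last = @last_sent[key]
    return false if last && (now - last) < @window

    @last_sent[key] = now
    true
  end
end
```

Keying on the (room, signal) pair means a second low-confidence answer in the same room is suppressed, while a different signal in that room, or the same signal in another room, still alerts the supervisor. Whether that is the right dedup key is a design decision for the Phase 1 PRD.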

Scope Changes

  • Backend — chatbot BE signal detection + supervisor resolution + throttle/dedup + outbound emit; notification-service category/whitelist change tracked as an external dependency (Broadcast squad).
  • Frontend — hub-chat (Omnichannel) Notification Center renders the new ai_agent_alert category (icon/label + room deep-link) in the bell + FCM push; an incremental add owned by the Inbox squad.
  • Data — alert events land in the AI Activity Log datamart as the system of record; KPI (supervisor time-to-intervene) is computed there.
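
To make the supervisor time-to-intervene KPI concrete, here is a minimal sketch of how it might be derived from alert events. The event shape (`alerted_at`, `intervened_at`) and the choice of a median are assumptions for illustration; this document does not define the datamart schema or the aggregation, and the real computation would live in the datamart rather than in application code.

```ruby
require "time"

# Hypothetical event shape: alerted_at / intervened_at are assumed field names
# holding ISO 8601 timestamps; intervened_at is nil when nobody intervened.
AlertEvent = Struct.new(:room_id, :alerted_at, :intervened_at, keyword_init: true)

# Seconds between the alert and the supervisor's intervention,
# or nil when the alert was never acted on.
def time_to_intervene(event)
  return nil if event.intervened_at.nil?

  Time.iso8601(event.intervened_at) - Time.iso8601(event.alerted_at)
end

# Median over the events that were actually intervened on (assumed aggregation).
def median_time_to_intervene(events)
  durations = events.map { |e| time_to_intervene(e) }.compact.sort
  return nil if durations.empty?

  mid = durations.length / 2
  durations.length.odd? ? durations[mid] : (durations[mid - 1] + durations[mid]) / 2.0
end
```

Alerts that no supervisor acted on are excluded from the median here; a real KPI definition would probably track them separately, since an ignored alert is itself a signal.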

QA Lane

Lane B: keeps a human QA gate. The flows are real-time: live supervisor alerts and push notifications are timing-dependent, and automation reproduces them poorly. No E2E test specs exist for this initiative yet, so the Lane-A entry bar (100% E2E, spec-mapped coverage) is unmet regardless. Classified 2026-06-29.

Contents

  • prds/ — phase PRDs (each with its own ACs → Jira Epic) land here.
  • rfcs/ — technical design proposals (incl. the notification-service ai_agent_alert TECH RFC).
  • tests/ — E2E / acceptance test specs.