Qontak | AI Agent | AI Agent Live Monitoring — ANCHOR
This ANCHOR PRD is the initiative master index. It orchestrates all phases beneath it and carries no acceptance criteria of its own (ACs live in each phase PRD). It has been reconciled against the actual codebases: chatbot BE (Rails, which hosts the emit chokepoints) and notification-service (Go, the delivery channel).
Scope: AI Agent Live Monitoring = real-time human supervision of the production AI agent. When the agent fails or degrades mid-conversation, a configured supervisor is alerted in time to open the room and intervene. Phase 1 (Supervisor Alerts) wires the agent's existing quality/safety signals to push alerts via the existing notification-service chat channel; later phases add a monitoring console and closed-loop scorecard hooks.
HEADER BLOCK
| Field | Value |
|---|---|
| PM | Dimas Fauzi Hidayat (Product Manager, Mekari Qontak) |
| PRD Version | 1.0 |
| Status | DRAFT |
| PRD Type | ANCHOR |
| Anchor | Yes — this IS the Anchor |
| Labels | epic:qontak-chat, module:ai-agent, feature:ai-agent-monitoring |
| Last Updated | 2026-06-28 |
Status values: DRAFT → ACTIVE → DEPRECATED
Table of Contents
- HEADER BLOCK
- 1. PHASE INDEX
- 2. One-liner + Problem
- 3. Target Users + Persona Context
- 4. Success Metrics (Initiative-level)
- 5. Key Decisions + Alternatives Rejected
- 6. Open Questions
- PRD CHANGELOG
1. PHASE INDEX
| Phase | Goal | PRD Link | Epic Key | Status | Shipped |
|---|---|---|---|---|---|
| Phase 1: Supervisor Alerts | Detect the four agent-failure signals in chatbot BE, resolve the org supervisor, and emit a throttled/deduped real-time alert via notification-service chat push | Phase 1 — Supervisor Alerts | TBD | 📝 Draft | — |
| Phase 2: Monitoring Console | Live "agent health" board — failure feed, per-org rates, drill-in to room — reading the alert/telemetry stream + Activity Log datamart | — | — | ⏳ Not started | — |
| Phase 3: Scorecard / Closed-loop Hooks | Feed confirmed live-failure events into the Unified Agent Scorecard; explore automated containment (pause / forced-handoff) | — | — | ⏳ Not started | — |
Status options: 📝 Draft · 🔄 In Progress · ✅ Shipped · ⏸ Paused · ❌ Cancelled
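The "throttled/deduped" alert emission in the Phase 1 goal could look roughly like the sketch below. It is illustrative only, not the chatbot BE implementation: the `AlertGate` class, the signal names, the `event_id` field, and the 300-second window are all hypothetical, since this PRD does not define a throttle window or an event schema.

```python
from dataclasses import dataclass, field


@dataclass
class AlertGate:
    """Per-room throttle plus dedup for supervisor alerts (hypothetical sketch)."""

    throttle_seconds: float = 300.0  # assumed window; the PRD does not set one
    last_sent: dict = field(default_factory=dict, init=False)  # room_id -> last alert time
    seen: set = field(default_factory=set, init=False)  # (room_id, signal, event_id)

    def should_emit(self, room_id: str, signal: str, event_id: str, now: float) -> bool:
        key = (room_id, signal, event_id)
        if key in self.seen:
            return False  # dedup: this exact event was already processed
        self.seen.add(key)
        last = self.last_sent.get(room_id)
        if last is not None and now - last < self.throttle_seconds:
            return False  # throttle: this room was alerted too recently
        self.last_sent[room_id] = now
        return True


gate = AlertGate(throttle_seconds=60)
first = gate.should_emit("room-1", "low_confidence", "evt-1", now=0.0)    # emitted
repeat = gate.should_emit("room-1", "low_confidence", "evt-1", now=1.0)   # dropped as duplicate
burst = gate.should_emit("room-1", "message_cap", "evt-2", now=30.0)      # dropped by throttle
```

A production version would presumably keep this state in a shared store with expiry rather than in process memory, so that every emit point sees the same throttle state. That choice belongs to the Phase 1 PRD.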
2. One-liner + Problem
One-liner: Alert a human supervisor in real time whenever the production AI agent fails or degrades in a live customer conversation, so they can intervene before the customer is stranded.
Problem: When the AI agent errors, hands off unexpectedly, hits its message cap, or answers with low confidence, nothing reaches a human proactively. Failures silently fall back to a default answer or a Rollbar log, and the chatbot BE has no integration that pushes these events to a supervisor. Supervisors and CS team leads running production AI agents (the 26Q2 push targets agents across 15 customer IDs on Plus/Ultimate/360 tiers) therefore learn about a bad agent moment only after a customer complains or during after-the-fact review. The cost is unrecovered conversations, eroded trust in the AI agent, and slow human takeover: the exact failure mode that makes customers hesitant to let the agent run autonomously.
3. Target Users + Persona Context
Primary Persona: CS Supervisor / Team Lead
| Field | Detail |
|---|---|
| Role | Supervisor or team lead responsible for an organization's production AI agent and the human agents who back it up |
| Goal | Know — within seconds, not hours — when the AI agent has failed a live conversation, and jump into that specific room to take over or coach |
| Pain | No proactive signal exists; failures are invisible until a customer escalates or a manual review surfaces them after the fact |
| Workaround | Periodically eyeball the inbox, rely on customers to complain, or scrub the AI Activity Log / Rollbar after the fact — none of which is real-time |
Secondary Persona: Chatbot Specialist (Mekari)
| Field | Detail |
|---|---|
| Role | Mekari-side specialist who configures and tunes customer AI agents |
| Goal | See live failure patterns (which signal fires, how often, on which agent) to prioritise tuning and prove the agent is safe to run autonomously |
| Pain | Failure signals are scattered across logs and the datamart with no real-time, per-agent view |
| Workaround | Manual datamart queries and Rollbar spelunking, well after the conversations have ended |
4. Success Metrics (Initiative-level)
Efficiency & Impact:
| Metric | Definition | Baseline | Target |
|---|---|---|---|
| ⭐ Supervisor time-to-intervene | Median elapsed time from an agent-failure event to a human opening/acting on that room | N/A — unmeasurable today (no alert exists) | Establish a baseline in Phase 1; drive median to ≤ 5 minutes within 90 days of GA |
Quality & Accuracy:
| Metric | Definition | Baseline | Target |
|---|---|---|---|
| Failure-event human-follow-up rate | Share of triggering events that receive a supervisor action within 15 minutes | N/A — new | ≥ 70% within 90 days of GA |
| Unresolved bad-conversation rate | Share of conversations that hit a failure signal and ended without human pickup | N/A — to be measured from Activity Log | Reduce vs. the Phase-1 baseline once the console (Phase 2) lands |
Adoption & Usage:
| Metric | Definition | Baseline | Target |
|---|---|---|---|
| Monitored-agent coverage | Share of production autonomous agents (of the 15 target customer IDs) with at least one supervisor receiving alerts | 0 | 100% of targeted production agents within 30 days of GA |
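As a worked illustration of the first two metric definitions above, the sketch below computes median time-to-intervene and the 15-minute follow-up rate from a list of failure events. The event shape (`failed_at` and `acted_at` as epoch seconds, with `acted_at` set to `None` when no human acted) is an assumption made for the example, not the Activity Log schema.

```python
from statistics import median

FOLLOW_UP_WINDOW_S = 15 * 60  # 15-minute window from the follow-up rate definition


def median_time_to_intervene(events):
    """Median seconds from failure event to first human action, over events that got one."""
    deltas = [e["acted_at"] - e["failed_at"] for e in events if e["acted_at"] is not None]
    return median(deltas) if deltas else None


def follow_up_rate(events):
    """Share of all triggering events with a human action inside the window."""
    if not events:
        return None
    on_time = sum(
        1
        for e in events
        if e["acted_at"] is not None
        and e["acted_at"] - e["failed_at"] <= FOLLOW_UP_WINDOW_S
    )
    return on_time / len(events)


# Hypothetical sample: three events acted on, one never picked up.
sample = [
    {"failed_at": 0, "acted_at": 120},
    {"failed_at": 0, "acted_at": 600},
    {"failed_at": 0, "acted_at": 1200},
    {"failed_at": 0, "acted_at": None},
]
print(median_time_to_intervene(sample))  # 600 seconds, i.e. 10 minutes
print(follow_up_rate(sample))            # 0.5
```

Note one design choice in the sketch: events with no human action are excluded from the median but still count in the denominator of the follow-up rate. Whether the real metrics should treat them this way is a decision for the phase PRDs.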
5. Key Decisions + Alternatives Rejected
5a — Decisions Made
| Date | Decision | Rationale |
|---|---|---|
| 2026-06-28 | Deliver alerts through the existing notification-service chat channel rather than a net-new alerting system | The service already stores in-app notifications and pushes via FCM; reusing it avoids new infra and gives supervisors alerts on the device they already use |
| 2026-06-28 | Target a configured supervisor role per org, not the currently-assigned agent | Failure events (esp. failed handover) may have no assignee; supervision is an org-level responsibility, and a role mapping is predictable and independent of assignment state |
| 2026-06-28 | Scope the initiative as its own ANCHOR (not a phase of Autonomous AI Agent) | Live monitoring is a distinct, multi-phase program (alerts → console → scorecard hooks) spanning two repos, with its own lifecycle separate from the engine-migration initiative |
| 2026-06-28 | Keep the notification-service change as an external dependency + TECH RFC (Broadcast squad), not in-repo edits here | Preserves squad ownership boundaries; the chatbot initiative owns detection/emit, the Broadcast squad owns the delivery channel contract |
5b — Alternatives Rejected
| Alternative | Why Rejected | Date |
|---|---|---|
| Build a dedicated AI-alerting microservice | Duplicates notification-service's storage + FCM delivery; far more infra for the same supervisor-facing outcome | 2026-06-28 |
| Alert only via Rollbar / engineering tooling | Rollbar reaches engineers, not the CS supervisor who can actually take over the conversation | 2026-06-28 |
| Emit only from the central billing-deduction worker (single chokepoint) | Async and billing-coupled; loses the real-time value and conflates monitoring with billing telemetry. The worker is kept as an audit backstop, not the primary path | 2026-06-28 |
| Start with a polling dashboard instead of push | A board the supervisor must remember to watch does not deliver "intervene while it's happening"; push is the core value. Console comes in Phase 2 on top of the same event stream | 2026-06-28 |
6. Open Questions
| # | Type | Question | Owner | Deadline |
|---|---|---|---|---|
| 1 | Open Question | How is the "supervisor" role defined and resolved to one or more sso_ids per org — existing role/permission, a new org setting, or a configured list? | Dimas + Chatbot BE | 2026-07-11 |
| 2 | Open Question | Do supervisors already have notification-service FCM tokens registered with user_source="chat", or is token coverage a gap to close first? | Dimas + Broadcast squad | 2026-07-11 |
| 3 | Assumption | The numeric confidence value needed for low-confidence alerts is available from the AI-service response / Activity Log and can be surfaced at the BE emit point | Dimas + Data/ML | 2026-07-18 |
| 4 | Risk | If trigger logic is too broad, alert volume becomes noise and supervisors mute it — mitigation: per-room throttle + dedup (Phase 1) and severity tiers; revisit thresholds in the post-launch cadence | Dimas | 2026-07-18 |
Types: Assumption · Open Question · Risk
PRD CHANGELOG
| Version | Date | By | Section | Type | Summary |
|---|---|---|---|---|---|
| 1.0 | 2026-06-28 | Claude | All | CREATED | Initial ANCHOR for the AI Agent Live Monitoring initiative, grounded in chatbot BE emit chokepoints + notification-service delivery channel |