
Qontak | AI Agent | AI Agent Live Monitoring — ANCHOR

ANCHOR PRD: the initiative master index. It orchestrates all phases beneath it and carries no acceptance criteria of its own (ACs live in each phase PRD). It has been reconciled against the actual codebases: chatbot BE (Rails, emit chokepoints) and notification-service (Go, delivery channel).

Scope: AI Agent Live Monitoring = real-time human supervision of the production AI agent. When the agent fails or degrades mid-conversation, a configured supervisor is alerted in time to open the room and intervene. Phase 1 (Supervisor Alerts) wires the agent's existing quality/safety signals to push alerts via the existing notification-service chat channel; later phases add a monitoring console and closed-loop scorecard hooks.

HEADER BLOCK

| Field | Value |
| --- | --- |
| PM | Dimas Fauzi Hidayat (Product Manager, Mekari Qontak) |
| PRD Version | 1.0 |
| Status | DRAFT |
| PRD Type | ANCHOR |
| Anchor | Yes, this IS the Anchor |
| Labels | epic:qontak-chat · module:ai-agent · feature:ai-agent-monitoring |
| Last Updated | 2026-06-28 |

Status values: DRAFT · ACTIVE · DEPRECATED


1. PHASE INDEX

| Phase | Goal | PRD Link | Epic Key | Status | Shipped |
| --- | --- | --- | --- | --- | --- |
| Phase 1: Supervisor Alerts | Detect the four agent-failure signals in chatbot BE, resolve the org supervisor, and emit a throttled/deduped real-time alert via notification-service chat push | Phase 1 - Supervisor Alerts | TBD | 📝 Draft | |
| Phase 2: Monitoring Console | Live "agent health" board (failure feed, per-org rates, drill-in to room) that reads the alert/telemetry stream and the Activity Log datamart | | | ⏳ Not started | |
| Phase 3: Scorecard / Closed-loop Hooks | Feed confirmed live-failure events into the Unified Agent Scorecard; explore automated containment (pause / forced-handoff) | | | ⏳ Not started | |

Status options: ⏳ Not started · 📝 Draft · 🔄 In Progress · ✅ Shipped · ⏸ Paused · ❌ Cancelled


2. One-liner + Problem

One-liner: Alert a human supervisor in real time whenever the production AI agent fails or degrades in a live customer conversation, so they can intervene before the customer is stranded.

Problem: When the AI agent errors, hands off unexpectedly, hits its message cap, or answers with low confidence, nothing reaches a human proactively — failures silently fall back to a default answer or a Rollbar log, and the chatbot BE has no integration that pushes these events to a supervisor. Supervisors and CS team leads running production AI agents (the 26Q2 push targets agents across 15 customer IDs on Plus/Ultimate/360 tiers) therefore only learn about a bad agent moment after a customer complains or during after-the-fact review. The cost is unrecovered conversations, eroded trust in the AI agent, and slow human takeover — the exact failure mode that makes customers hesitant to let the agent run autonomously.
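
The four failure signals named above (error, unexpected handoff, message cap, low confidence) can be sketched as a small classifier. This is a minimal illustration, not the chatbot BE implementation: the `TurnOutcome` fields, the `detect_signals` helper, and the 0.5 confidence threshold are all assumptions introduced here, since this PRD does not define them.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class FailureSignal(Enum):
    ERROR = "error"
    UNEXPECTED_HANDOFF = "unexpected_handoff"
    MESSAGE_CAP = "message_cap"
    LOW_CONFIDENCE = "low_confidence"


@dataclass
class TurnOutcome:
    """Hypothetical summary of one AI-agent turn; every field name is illustrative."""
    errored: bool = False
    handed_off: bool = False
    handoff_expected: bool = False
    messages_sent: int = 0
    message_cap: Optional[int] = None
    confidence: Optional[float] = None


# Assumed threshold; the PRD does not specify a value.
LOW_CONFIDENCE_THRESHOLD = 0.5


def detect_signals(turn: TurnOutcome) -> List[FailureSignal]:
    """Return every failure signal that applies to a single turn."""
    signals: List[FailureSignal] = []
    if turn.errored:
        signals.append(FailureSignal.ERROR)
    if turn.handed_off and not turn.handoff_expected:
        signals.append(FailureSignal.UNEXPECTED_HANDOFF)
    if turn.message_cap is not None and turn.messages_sent >= turn.message_cap:
        signals.append(FailureSignal.MESSAGE_CAP)
    if turn.confidence is not None and turn.confidence < LOW_CONFIDENCE_THRESHOLD:
        signals.append(FailureSignal.LOW_CONFIDENCE)
    return signals
```

A turn can raise more than one signal at once (for example, an error with a low confidence score), which is one reason throttling and dedup matter downstream.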


3. Target Users + Persona Context

Primary Persona: CS Supervisor / Team Lead

| Field | Detail |
| --- | --- |
| Role | Supervisor or team lead responsible for an organization's production AI agent and the human agents who back it up |
| Goal | Know within seconds, not hours, when the AI agent has failed a live conversation, and jump into that specific room to take over or coach |
| Pain | No proactive signal exists; failures are invisible until a customer escalates or a manual review surfaces them after the fact |
| Workaround | Periodically eyeball the inbox, rely on customers to complain, or scrub the AI Activity Log / Rollbar after the fact; none of these is real-time |

Secondary Persona: Chatbot Specialist (Mekari)

| Field | Detail |
| --- | --- |
| Role | Mekari-side specialist who configures and tunes customer AI agents |
| Goal | See live failure patterns (which signal fires, how often, on which agent) to prioritise tuning and prove the agent is safe to run autonomously |
| Pain | Failure signals are scattered across logs and the datamart with no real-time, per-agent view |
| Workaround | Manual datamart queries and Rollbar spelunking, well after the conversations have ended |

4. Success Metrics (Initiative-level)

Efficiency & Impact:

| Metric | Definition | Baseline | Target |
| --- | --- | --- | --- |
| Supervisor time-to-intervene | Median elapsed time from an agent-failure event to a human opening/acting on that room | N/A (unmeasurable today; no alert exists) | Establish a baseline in Phase 1; drive median to ≤ 5 minutes within 90 days of GA |
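
The time-to-intervene definition can be made concrete with a short sketch. It assumes each failure event is available as a pair of timestamps (failure time, first human action time or `None`); that shape and the `median_time_to_intervene` name are illustrative, not taken from the Activity Log schema.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import Iterable, Optional, Tuple


def median_time_to_intervene(
    events: Iterable[Tuple[datetime, Optional[datetime]]],
) -> Optional[timedelta]:
    """Median of (human action time - failure time) over events that got an action.

    Events with no human action are skipped here, so this number should be read
    alongside the follow-up rate rather than on its own.
    """
    deltas = [
        (acted_at - failed_at).total_seconds()
        for failed_at, acted_at in events
        if acted_at is not None and acted_at >= failed_at
    ]
    if not deltas:
        return None
    return timedelta(seconds=median(deltas))
```

Because never-actioned events are excluded, the median can look healthy even when most failures are ignored; whether to count them as an upper bound instead is a definition choice this PRD leaves open.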

Quality & Accuracy:

| Metric | Definition | Baseline | Target |
| --- | --- | --- | --- |
| Failure-event human-follow-up rate | Share of triggering events that receive a supervisor action within 15 minutes | N/A (new) | ≥ 70% within 90 days of GA |
| Unresolved bad-conversation rate | Share of conversations that hit a failure signal and ended without human pickup | N/A (to be measured from Activity Log) | Reduce vs. the Phase-1 baseline once the console (Phase 2) lands |
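
A rough sketch of how the two quality metrics could be computed from failure events follows. The `FailureEvent` record and both function names are hypothetical; the 15-minute window comes from the metric definition, and "human pickup" is approximated here as any recorded supervisor action on the conversation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Dict, List, Optional


@dataclass
class FailureEvent:
    """Hypothetical record of one triggering event."""
    conversation_id: str
    failed_at: datetime
    supervisor_action_at: Optional[datetime] = None


FOLLOW_UP_WINDOW = timedelta(minutes=15)  # from the metric definition


def follow_up_rate(
    events: List[FailureEvent], window: timedelta = FOLLOW_UP_WINDOW
) -> Optional[float]:
    """Share of events with a supervisor action inside the window."""
    if not events:
        return None
    followed = sum(
        1
        for e in events
        if e.supervisor_action_at is not None
        and timedelta(0) <= e.supervisor_action_at - e.failed_at <= window
    )
    return followed / len(events)


def unresolved_rate(events: List[FailureEvent]) -> Optional[float]:
    """Share of failing conversations that never received any supervisor action."""
    picked_up: Dict[str, bool] = {}
    for e in events:
        acted = e.supervisor_action_at is not None
        picked_up[e.conversation_id] = picked_up.get(e.conversation_id, False) or acted
    if not picked_up:
        return None
    return sum(1 for acted in picked_up.values() if not acted) / len(picked_up)
```

Note the different denominators: the follow-up rate counts events, while the unresolved rate counts conversations, so a single room with repeated failures weighs differently in each.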

Adoption & Usage:

| Metric | Definition | Baseline | Target |
| --- | --- | --- | --- |
| Monitored-agent coverage | Share of production autonomous agents (of the 15 target customer IDs) with at least one supervisor receiving alerts | 0 | 100% of targeted production agents within 30 days of GA |
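
Coverage as defined above reduces to a simple ratio. The sketch below assumes a mapping from agent ID to the supervisors who receive its alerts; the mapping shape and the `monitored_agent_coverage` name are illustrative only.

```python
from typing import Dict, List, Set


def monitored_agent_coverage(
    production_agents: Set[str],
    alert_recipients: Dict[str, List[str]],
) -> float:
    """Fraction of production agents with at least one alert recipient."""
    if not production_agents:
        return 0.0
    covered = sum(1 for agent in production_agents if alert_recipients.get(agent))
    return covered / len(production_agents)
```

An agent with an empty recipient list counts as uncovered, the same as an agent with no entry at all.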

5. Key Decisions + Alternatives Rejected

5a — Decisions Made

| Date | Decision | Rationale |
| --- | --- | --- |
| 2026-06-28 | Deliver alerts through the existing notification-service chat channel rather than a net-new alerting system | The service already stores in-app notifications and pushes via FCM; reusing it avoids new infra and gives supervisors alerts on the device they already use |
| 2026-06-28 | Target a configured supervisor role per org, not the currently-assigned agent | Failure events (especially failed handovers) may have no assignee; supervision is an org-level responsibility, and a role mapping is predictable and independent of assignment state |
| 2026-06-28 | Scope the initiative as its own ANCHOR (not a phase of Autonomous AI Agent) | Live monitoring is a distinct, multi-phase program (alerts → console → scorecard hooks) spanning two repos, with its own lifecycle separate from the engine-migration initiative |
| 2026-06-28 | Keep the notification-service change as an external dependency + TECH RFC (Broadcast squad), not in-repo edits here | Preserves squad ownership boundaries; the chatbot initiative owns detection/emit, the Broadcast squad owns the delivery channel contract |
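
The second decision (route to a configured supervisor role, not the assignee) can be illustrated with a small sketch. Everything here is hypothetical: the `AlertRequest` shape, the `build_alerts` helper, and the org-to-supervisor mapping are placeholders, since how the role resolves to sso_ids is still an open question and the notification-service contract belongs to the TECH RFC. Only the `user_source="chat"` value is taken from this document.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class AlertRequest:
    """Hypothetical per-recipient alert; not the notification-service schema."""
    sso_id: str
    user_source: str
    room_id: str
    signal: str


def build_alerts(
    org_id: str,
    room_id: str,
    signal: str,
    assignee_sso_id: Optional[str],
    supervisors_by_org: Dict[str, List[str]],
) -> List[AlertRequest]:
    """Fan one failure event out to the org's configured supervisors.

    The assignee is accepted but deliberately ignored: failed handovers may
    have no assignee, so routing depends only on the org-level mapping.
    """
    return [
        AlertRequest(sso_id=sso_id, user_source="chat", room_id=room_id, signal=signal)
        for sso_id in supervisors_by_org.get(org_id, [])
    ]
```

An org with no configured supervisors produces no alerts in this sketch; in practice that case probably deserves its own fallback or a coverage warning.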

5b — Alternatives Rejected

| Alternative | Why Rejected | Date |
| --- | --- | --- |
| Build a dedicated AI-alerting microservice | Duplicates notification-service's storage + FCM delivery; far more infra for the same supervisor-facing outcome | 2026-06-28 |
| Alert only via Rollbar / engineering tooling | Rollbar reaches engineers, not the CS supervisor who can actually take over the conversation | 2026-06-28 |
| Emit only from the central billing-deduction worker (single chokepoint) | Async and billing-coupled; loses the real-time value and conflates monitoring with billing telemetry. Used as an audit backstop, not the primary path | 2026-06-28 |
| Start with a polling dashboard instead of push | A board the supervisor must remember to watch does not deliver "intervene while it's happening"; push is the core value. Console comes in Phase 2 on top of the same event stream | 2026-06-28 |

6. Open Questions

| # | Type | Question | Owner | Deadline |
| --- | --- | --- | --- | --- |
| 1 | Open Question | How is the "supervisor" role defined and resolved to one or more sso_ids per org: an existing role/permission, a new org setting, or a configured list? | Dimas + Chatbot BE | 2026-07-11 |
| 2 | Open Question | Do supervisors already have notification-service FCM tokens registered with user_source="chat", or is token coverage a gap to close first? | Dimas + Broadcast squad | 2026-07-11 |
| 3 | Assumption | The numeric confidence value needed for low-confidence alerts is available from the AI-service response / Activity Log and can be surfaced at the BE emit point | Dimas + Data/ML | 2026-07-18 |
| 4 | Risk | If trigger logic is too broad, alert volume becomes noise and supervisors mute it. Mitigation: per-room throttle + dedup (Phase 1) and severity tiers; revisit thresholds in the post-launch cadence | Dimas | 2026-07-18 |

Types: Assumption · Open Question · Risk
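
The per-room throttle + dedup mitigation named in the risk item could look roughly like the sketch below. It is an in-memory illustration only: the `RoomAlertGate` class, its window lengths, and the use of plain dictionaries are assumptions, and a multi-process backend would need a shared store instead.

```python
from typing import Dict, Tuple


class RoomAlertGate:
    """Decides whether an alert for (room, signal) should be sent right now.

    Dedup: the same signal in the same room is suppressed for `dedup_seconds`.
    Throttle: any alert in a room is suppressed for `throttle_seconds` after
    the last alert sent for that room. Both windows are assumed values.
    """

    def __init__(self, throttle_seconds: float = 300.0, dedup_seconds: float = 3600.0):
        self.throttle_seconds = throttle_seconds
        self.dedup_seconds = dedup_seconds
        self._last_room_alert: Dict[str, float] = {}
        self._last_signal_alert: Dict[Tuple[str, str], float] = {}

    def should_send(self, room_id: str, signal: str, now: float) -> bool:
        """Return True and record the send if neither window blocks the alert."""
        last_same = self._last_signal_alert.get((room_id, signal))
        if last_same is not None and now - last_same < self.dedup_seconds:
            return False
        last_room = self._last_room_alert.get(room_id)
        if last_room is not None and now - last_room < self.throttle_seconds:
            return False
        self._last_room_alert[room_id] = now
        self._last_signal_alert[(room_id, signal)] = now
        return True
```

Passing `now` explicitly keeps the gate deterministic and easy to test; suppressed alerts are not recorded, so they do not extend either window.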


PRD CHANGELOG

| Version | Date | By | Section | Type | Summary |
| --- | --- | --- | --- | --- | --- |
| 1.0 | 2026-06-28 | Claude | All | CREATED | Initial ANCHOR for the AI Agent Live Monitoring initiative, grounded in chatbot BE emit chokepoints + notification-service delivery channel |