Qontak | AI Agent | AI Agent Live Monitoring — ANCHOR

ANCHOR PRD — the initiative master index. It orchestrates all phases beneath it and carries no acceptance criteria of its own (ACs live in each phase PRD). Reconciled against the actual codebases: chatbot BE (Rails — emit chokepoints) and notification-service (Go — delivery channel).

Scope: AI Agent Live Monitoring = real-time human supervision of the production AI agent. When the agent fails or degrades mid-conversation, a configured supervisor is alerted in time to open the room and intervene. Phase 1 (Supervisor Alerts) wires the agent's existing quality/safety signals to push alerts via the existing notification-service chat channel; later phases add a monitoring console and closed-loop scorecard hooks.

HEADER BLOCK

Field	Value
PM	Dimas Fauzi Hidayat (Product Manager, Mekari Qontak)
PRD Version	1.0
Status	DRAFT
PRD Type	ANCHOR
Anchor	Yes — this IS the Anchor
Labels	`epic:qontak-chat` | `module:ai-agent` | `feature:ai-agent-monitoring`
Last Updated	2026-06-28

Status values: DRAFT → ACTIVE → DEPRECATED

HEADER BLOCK
1. PHASE INDEX
2. One-liner + Problem
3. Target Users + Persona Context
4. Success Metrics (Initiative-level)
5. Key Decisions + Alternatives Rejected
6. Open Questions
PRD CHANGELOG

1. PHASE INDEX

Phase	Goal	PRD Link	Epic Key	Status	Shipped
Phase 1: Supervisor Alerts	Detect the four agent-failure signals in chatbot BE, resolve the org supervisor, and emit a throttled/deduped real-time alert via notification-service chat push	Phase 1 — Supervisor Alerts	TBD	📝 Draft	—
Phase 2: Monitoring Console	Live "agent health" board — failure feed, per-org rates, drill-in to room — reading the alert/telemetry stream + Activity Log datamart	—	—	⏳ Not started	—
Phase 3: Scorecard / Closed-loop Hooks	Feed confirmed live-failure events into the Unified Agent Scorecard; explore automated containment (pause / forced-handoff)	—	—	⏳ Not started	—

Status options: 📝 Draft · 🔄 In Progress · ✅ Shipped · ⏸ Paused · ❌ Cancelled

2. One-liner + Problem

One-liner: Alert a human supervisor in real time whenever the production AI agent fails or degrades in a live customer conversation, so they can intervene before the customer is stranded.

Problem: When the AI agent errors, hands off unexpectedly, hits its message cap, or answers with low confidence, nothing reaches a human proactively — failures silently fall back to a default answer or a Rollbar log, and the chatbot BE has no integration that pushes these events to a supervisor. Supervisors and CS team leads running production AI agents (the 26Q2 push targets agents across 15 customer IDs on Plus/Ultimate/360 tiers) therefore only learn about a bad agent moment after a customer complains or during after-the-fact review. The cost is unrecovered conversations, eroded trust in the AI agent, and slow human takeover — the exact failure mode that makes customers hesitant to let the agent run autonomously.

3. Target Users + Persona Context

Primary Persona: CS Supervisor / Team Lead

Field	Detail
Role	Supervisor or team lead responsible for an organization's production AI agent and the human agents who back it up
Goal	Know — within seconds, not hours — when the AI agent has failed a live conversation, and jump into that specific room to take over or coach
Pain	No proactive signal exists; failures are invisible until a customer escalates or a manual review surfaces them after the fact
Workaround	Periodically eyeball the inbox, rely on customers to complain, or scrub the AI Activity Log / Rollbar after the fact — none of which is real-time

Secondary Persona: Chatbot Specialist (Mekari)

Field	Detail
Role	Mekari-side specialist who configures and tunes customer AI agents
Goal	See live failure patterns (which signal fires, how often, on which agent) to prioritise tuning and prove the agent is safe to run autonomously
Pain	Failure signals are scattered across logs and the datamart with no real-time, per-agent view
Workaround	Manual datamart queries and Rollbar spelunking, well after the conversations have ended

4. Success Metrics (Initiative-level)

Efficiency & Impact:

Metric	Definition	Baseline	Target
⭐ Supervisor time-to-intervene	Median elapsed time from an agent-failure event to a human opening/acting on that room	N/A — unmeasurable today (no alert exists)	Establish a baseline in Phase 1; drive median to ≤ 5 minutes within 90 days of GA
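A minimal sketch of how the median time-to-intervene could be computed once alert and room-action timestamps are available. The input shape (pairs of failure time and first human action time) and the function name are hypothetical and are not taken from the Activity Log schema.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import Iterable, Optional, Tuple

# Hypothetical record shape: (failure_at, first_human_action_at or None).
EventPair = Tuple[datetime, Optional[datetime]]


def median_time_to_intervene(pairs: Iterable[EventPair]) -> Optional[timedelta]:
    """Median elapsed time from failure event to first human action.

    Events with no human action are excluded here (an assumption);
    they are counted by the follow-up and unresolved-rate metrics instead.
    """
    deltas = [
        (acted - failed).total_seconds()
        for failed, acted in pairs
        if acted is not None and acted >= failed
    ]
    if not deltas:
        return None  # nothing to measure yet
    return timedelta(seconds=median(deltas))
```

Because events that never receive a human action are left out, this median should be read alongside the follow-up rate; on its own it can look better than the real supervisor experience.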

Quality & Accuracy:

Metric	Definition	Baseline	Target
Failure-event human-follow-up rate	Share of triggering events that receive a supervisor action within 15 minutes	N/A — new	≥ 70% within 90 days of GA
Unresolved bad-conversation rate	Share of conversations that hit a failure signal and ended without human pickup	N/A — to be measured from Activity Log	Reduce vs. the Phase-1 baseline once the console (Phase 2) lands
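The follow-up rate can be sketched in the same way. This example assumes the 15-minute window is inclusive and that each triggering event carries at most one "first supervisor action" timestamp; both are assumptions, as are the names used.

```python
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

# Window taken from the metric definition; treating it as inclusive is an assumption.
FOLLOW_UP_WINDOW = timedelta(minutes=15)

# Hypothetical record shape: (failure_at, first_supervisor_action_at or None).
EventPair = Tuple[datetime, Optional[datetime]]


def follow_up_rate(
    pairs: Iterable[EventPair],
    window: timedelta = FOLLOW_UP_WINDOW,
) -> Optional[float]:
    """Share of triggering events with a supervisor action inside the window."""
    total = 0
    followed_up = 0
    for failed, acted in pairs:
        total += 1
        if acted is not None and timedelta(0) <= acted - failed <= window:
            followed_up += 1
    if total == 0:
        return None  # no triggering events in the period
    return followed_up / total
```

Events with no action count in the denominator, so the complement of this rate gives a rough upper bound on events that were not followed up in time.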

Adoption & Usage:

Metric	Definition	Baseline	Target
Monitored-agent coverage	Share of production autonomous agents across the 15 target customer IDs that have at least one supervisor receiving alerts	0	100% of targeted production agents within 30 days of GA

5. Key Decisions + Alternatives Rejected

5a — Decisions Made

Date	Decision	Rationale
2026-06-28	Deliver alerts through the existing notification-service chat channel rather than a net-new alerting system	The service already stores in-app notifications and pushes via FCM; reusing it avoids new infra and gives supervisors alerts on the device they already use
2026-06-28	Target a configured supervisor role per org, not the currently-assigned agent	Failure events (especially failed handover) may have no assignee; supervision is an org-level responsibility, and a role mapping is predictable and independent of assignment state
2026-06-28	Scope the initiative as its own ANCHOR (not a phase of Autonomous AI Agent)	Live monitoring is a distinct, multi-phase program (alerts → console → scorecard hooks) spanning two repos, with its own lifecycle separate from the engine-migration initiative
2026-06-28	Keep the notification-service change as an external dependency + TECH RFC (Broadcast squad), not in-repo edits here	Preserves squad ownership boundaries; the chatbot initiative owns detection/emit, the Broadcast squad owns the delivery channel contract

5b — Alternatives Rejected

Alternative	Why Rejected	Date
Build a dedicated AI-alerting microservice	Duplicates notification-service's storage + FCM delivery; far more infra for the same supervisor-facing outcome	2026-06-28
Alert only via Rollbar / engineering tooling	Rollbar reaches engineers, not the CS supervisor who can actually take over the conversation	2026-06-28
Emit only from the central billing-deduction worker (single chokepoint)	Async and billing-coupled; loses the real-time value and conflates monitoring with billing telemetry. The worker is kept as an audit backstop, not the primary path	2026-06-28
Start with a polling dashboard instead of push	A board the supervisor must remember to watch does not deliver "intervene while it's happening"; push is the core value. Console comes in Phase 2 on top of the same event stream	2026-06-28

6. Open Questions

#	Type	Question	Owner	Deadline
1	Open Question	How is the "supervisor" role defined and resolved to one or more `sso_id`s per org — existing role/permission, a new org setting, or a configured list?	Dimas + Chatbot BE	2026-07-11
2	Open Question	Do supervisors already have notification-service FCM tokens registered with `user_source="chat"`, or is token coverage a gap to close first?	Dimas + Broadcast squad	2026-07-11
3	Assumption	The numeric confidence value needed for low-confidence alerts is available from the AI-service response / Activity Log and can be surfaced at the BE emit point	Dimas + Data/ML	2026-07-18
4	Risk	If trigger logic is too broad, alert volume becomes noise and supervisors mute it — mitigation: per-room throttle + dedup (Phase 1) and severity tiers; revisit thresholds in the post-launch cadence	Dimas	2026-07-18

Types: Assumption · Open Question · Risk
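The per-room throttle and dedup mitigation in item 4 can be illustrated with a small gate that decides whether an alert should be pushed. This is a sketch under stated assumptions, not the Phase 1 design: the class name, the `event_id` field, the signal labels, and the 300-second default are all hypothetical, and state is held in memory purely for illustration (a real emit path would likely need a shared store and an expiry policy).

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class AlertGate:
    """Decides whether a failure event should produce a supervisor alert."""

    throttle_seconds: float = 300.0  # hypothetical default, not from the PRD
    _last_sent: Dict[str, float] = field(default_factory=dict)  # room_id -> last alert time
    _seen: Set[Tuple[str, str, str]] = field(default_factory=set)  # processed event keys

    def should_send(self, room_id: str, signal: str, event_id: str, now: float) -> bool:
        # Dedup: the same event delivered twice (e.g. a retried emit) alerts once.
        key = (room_id, signal, event_id)
        if key in self._seen:
            return False
        self._seen.add(key)

        # Throttle: at most one alert per room per window, whatever the signal.
        last = self._last_sent.get(room_id)
        if last is not None and now - last < self.throttle_seconds:
            return False

        self._last_sent[room_id] = now
        return True
```

For example, with `AlertGate(throttle_seconds=60)`, a duplicate of the first event in a room is dropped, a different signal in the same room 30 seconds later is suppressed, and an event in another room still alerts immediately. Throttling per room rather than per signal is one possible choice; severity tiers could bypass the throttle for the most urgent signals.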

PRD CHANGELOG

Version	Date	By	Section	Type	Summary
1.0	2026-06-28	Claude	All	CREATED	Initial ANCHOR for the AI Agent Live Monitoring initiative, grounded in chatbot BE emit chokepoints + notification-service delivery channel
