Qontak | AI Agent | AI Agent Live Monitoring — Phase 1: Supervisor Alerts
HEADER BLOCK
| Field | Value |
|---|---|
| PM | Dimas Fauzi Hidayat (Product Manager, Mekari Qontak) |
| PRD Version | 1.2 |
| Status | DRAFT |
| PRD Type | NEW |
| Epic | TBD — add once Epic is created |
| Squad | BOT — Chatbot Squad |
| RFC Link | N/A — pending; notification-service delivery change tracked as a TECH RFC (Broadcast squad), see Dependencies |
| Figma Master | N/A — backend-only; alerts render in the existing notification surface |
| Anchor | Yes — AI Agent Live Monitoring — ANCHOR |
| Labels | epic:qontak-chat | module:ai-agent | feature:ai-agent-monitoring |
| Last Updated | 2026-06-28 |
Status values:
DRAFT→READY→BUILD→SHIPPED
Table of Contents
- HEADER BLOCK
- 2. One-liner + Problem
- 3. Target Users + Persona Context
- 4. Non-Goals
- 5. Constraints
- 6. New Features
- 7. API & Webhook Behavior
- 8. System Flow + User Stories
- 9. Rollout
- 10. Observability
- 11. Success Metrics
- 12. Launch Plan & Stage Gates
- 13. Dependencies
- 14. Key Decisions + Alternatives Rejected
- 15. Open Questions
- PRD CHANGELOG
2. One-liner + Problem
One-liner: Push a real-time alert to a configured organization supervisor whenever the production AI agent fails or degrades in a live customer conversation.
Problem:
The chatbot backend already detects when the AI agent fails — an AI-service or v2-engine error falls back to a default answer, an unexpected handover escalates the room, a message-limit cap silently stops the agent (Rollbar.info only), and low-confidence answers are recorded in telemetry — but none of these reach a human proactively; there is no integration that notifies a supervisor. CS supervisors running production agents (the 26Q2 target spans agents across 15 customer IDs on Plus/Ultimate/360 tiers) therefore only discover a bad agent moment after a customer complains or during after-the-fact review. The cost is unrecovered conversations, slow human takeover, and eroded trust in letting the agent run autonomously — the precise risk this Phase exists to reduce.
3. Target Users + Persona Context
Primary Persona: CS Supervisor / Team Lead
| Field | Detail |
|---|---|
| Role | Supervisor or team lead who owns an organization's production AI agent and the human agents who back it up |
| Goal | Learn within seconds, not hours, that the AI agent has failed a live conversation, and open that specific room to take over or coach |
| Pain | No proactive signal exists; failures are invisible until a customer escalates or a manual review surfaces them after the fact |
| Workaround | Periodically eyeball the inbox, wait for customer complaints, or scrub the AI Activity Log / Rollbar after conversations have ended |
(See Constraints for plan availability and feature flag scope.)
Secondary Persona: Chatbot Specialist (Mekari)
| Field | Detail |
|---|---|
| Role | Mekari-side specialist who configures and tunes customer AI agents |
| Goal | Receive the same failure signals to spot patterns (which signal, how often, which agent) and prioritise tuning |
| Pain | Failure signals are scattered across logs and the datamart with no real-time, per-agent view |
| Workaround | Manual datamart queries and Rollbar spelunking, well after the fact |
4. Non-Goals
- No new alerting infrastructure — Phase 1 delivers exclusively through the existing notification-service chat channel. We do not build a dedicated alerting service, queue, or datastore.
- No monitoring console / dashboard — the at-a-glance "agent health" board is Phase 2. Phase 1 is push-only.
- No automated containment — alerts inform a human; the system does not pause the agent, force a handoff, or take any corrective action automatically. That is explored in Phase 3.
- No alerting on normal agent behavior — a successful resolve, or a deliberate, expected handover (e.g. EVALUATE_ANSWER), is not a failure and must not alert.
- No changes to the notification-service repo by this squad — the new ai_agent_alert category, FCM whitelist line, and chat-push delivery are owned by the Broadcast squad and tracked as a dependency (Section 13).
- No end-customer-facing notifications — alerts go to internal supervisors/specialists only, never to the conversation's customer.
- No per-agent custom alert rules in this phase — the trigger set and thresholds are global defaults; per-org configurability beyond on/off and supervisor targeting is deferred.
- No retroactive alerting — only events occurring after the feature flag is enabled for an org generate alerts; we do not backfill historical failures.
5. Constraints
| Field | Value |
|---|---|
| Platform | Backend-only (chatbot Rails BE). Alert is consumed on whatever surfaces notification-service already supports — in-app notification + FCM push to the supervisor's chat client. No new screen. |
| Performance | Detection and emit must run out-of-band of the customer reply path — the alert must never add latency to, or block, the agent's response to the customer. Emit is fire-and-forget (async worker); target alert delivered to notification-service within ≤ 30s of the triggering event. |
| Throttle / dedup | At most one alert per room per signal-type per cooldown window (default 5 minutes), deduped by event_id. A looping failure on one room must not flood the supervisor. |
| Plan scope | Production AI-agent tiers only — Plus / Ultimate / 360 (the tiers already entitled for the autonomous agent). Not Starter/Free. |
| Feature flag | ai_agent_monitoring_alerts | default: OFF. Enabled per organization. |
| Read/write | Only the chatbot BE (service-to-service, via the notification-service internal API key) writes alerts. Supervisors read alerts; no one edits them. Customers never see them. |
| Dependency gate | Chat-origin push for the ai_agent_alert category must be live in notification-service (Section 13) before an org can be moved past internal testing. |
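The throttle rule in the table above can be sketched with a small in-memory store. This is a minimal illustration under stated assumptions, not the squad's implementation: ThrottleStore, throttle_key, and the key format are hypothetical names, and the hash stands in for Redis, where the same claim would typically be a single SET with the NX and EX options.

```ruby
# Hypothetical in-memory stand-in for the Redis throttle store.
class ThrottleStore
  # The clock is injectable so the cooldown can be exercised without sleeping.
  def initialize(clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @clock = clock
    @expires_at = {} # key => expiry time in seconds
  end

  # Returns true when the caller claimed the key (the alert may be sent),
  # false when a live key already exists (the alert must be suppressed).
  def claim(key, ttl_seconds)
    now = @clock.call
    return false if @expires_at.key?(key) && @expires_at[key] > now

    @expires_at[key] = now + ttl_seconds
    true
  end
end

COOLDOWN_SECONDS = 5 * 60 # default cooldown from the Constraints table

# One key per room and signal type; the prefix is an assumed naming scheme.
def throttle_key(room_id, signal_type)
  "ai_agent_alert:throttle:#{room_id}:#{signal_type}"
end
```

Because the key includes the signal type, an engine failure and a message-limit cap on the same room each get their own cooldown window, which matches the "per room per signal-type" wording.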
5.1. Data Lifecycle
Phase 1 introduces two persisted artifacts outside the main conversation model: the throttle/dedup state and the delivered alert notifications.
| Artifact Type | Retention Period | Cleanup Trigger | User-Visible Effect |
|---|---|---|---|
| Throttle/dedup key (per room + signal-type) | Cooldown window only (default 5 min) | TTL expiry on the key store (e.g. Redis) | None — internal only |
| Alert notification record (in notification-service) | Per notification-service chat retention (currently 7 days, notif_type=3) | notification-service retention policy (query-time pruning) | Alert disappears from the supervisor's notification list after the retention window |
| Alert event in AI Activity Log datamart (system of record) | Per datamart retention (durable) | Datamart lifecycle | None directly — powers KPI reporting and Phase-2 console |
Note: the notification-service 7-day window is fine for live alerting; the durable record of every failure event lives in the AI Activity Log datamart, not in notification-service.
6. New Features
No net-new screen is built, but there is a user-facing surface change: the supervisor sees the alert in the existing hub-chat (Qontak Omnichannel) Notification Center — the navbar bell — and as an FCM push. hub-chat already wires Firebase FCM (app.vue → useFirebase/initFirebase) and renders a category-aware notification list (layouts/composables/useNotificationCenter.ts, common/types/NotificationCenterTypes.ts). This is an incremental add to that existing component, not a new page — but it is not zero FE work (see Dependencies: hub-chat must register and render the new ai_agent_alert category).
| Field | Detail |
|---|---|
| Surface | hub-chat Notification Center (navbar bell) + FCM push banner. URL: the supervisor's existing Omnichannel inbox (/inbox); no new route. |
| Access | Users with the configured supervisor role for the org, logged into hub-chat (web). Mobile inbox is an open question (Section 15). |
What the supervisor sees — Notification Center list item (success state):
┌─ Notifications ───────────────────────────────────┐
│ 🔴 AI Agent Alert · 2m ago │ ← unread dot · notif_category_label · created_at (relative)
│ AI Agent failed — engine error │ ← title
│ Couldn't answer in "Order #1234" — fell back │ ← description (truncated)
│ to default reply │
│ ↳ Open room → │ ← click_action_url deep-link to the failing room
└────────────────────────────────────────────────────┘
FCM push banner (device):
● Qontak Omnichannel now
AI Agent failed — engine error
Couldn't answer in "Order #1234" room
Field → UI mapping (drives the render from the notification-service payload):
| Notification field | Renders as |
|---|---|
| notif_category = ai_agent_alert | Selects the icon + needs a registered type entry in hub-chat (the notif_category → type lookup in TheNavbar.vue/OneNavbar.vue) |
| notif_category_label | The category chip text ("AI Agent Alert") |
| title | Bold first line (the failure summary) |
| description | Secondary line (room + what happened), truncated |
| created_at / read_at | Relative timestamp + unread dot |
| click_action = OPEN_URL, click_action_url + extra.room_id | Click opens the failing room in the inbox |
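To make the mapping concrete, the sketch below builds a notification body on the chatbot BE side that carries the fields in the table above. The field names come from the table; everything else is an assumption until the RFC fixes the schema: build_alert_payload is a hypothetical helper, the placement of event_id at the top level is a guess, and the IDs and deep-link URL are placeholders.

```ruby
require "json"
require "securerandom"

# Hypothetical builder for the chat-origin notification body.
# created_at / read_at are omitted: they are assumed to be set by
# notification-service, not by the caller.
def build_alert_payload(sso_id:, title:, description:, room_url:, extra:)
  {
    sso_id: sso_id,
    notif_type: 3,
    notif_category: "ai_agent_alert",
    notif_category_label: "AI Agent Alert",
    title: title,
    description: description,
    click_action: "OPEN_URL",
    click_action_url: room_url,
    event_id: SecureRandom.uuid, # idempotency key; placement is an assumption
    extra: extra
  }
end

payload = build_alert_payload(
  sso_id: "00000000-0000-0000-0000-000000000001",            # placeholder
  title: "AI Agent failed: engine error",                    # illustrative wording
  description: "Couldn't answer in \"Order #1234\", fell back to default reply",
  room_url: "https://example.invalid/inbox?room_id=room-123", # placeholder deep-link
  extra: { room_id: "room-123", signal_type: "engine_failure" }
)

puts JSON.pretty_generate(payload)
```

hub-chat would then only need the notif_category value to pick the icon and extra.room_id (or click_action_url) to open the room.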
UI States (reuse the existing Notification Center states; the new category only adds an item type):
| State | Description |
|---|---|
| Empty | No AI agent alerts → nothing extra shown; the existing center's empty/zero state applies. |
| Loading | Existing Notification Center skeleton while fetching. |
| Error | Existing center error/retry; a push that fails to render falls back to the list on next fetch. |
| Success | The alert item renders as mocked above; click deep-links to the room. |
Figma: N/A — reuses the existing Pixel3 notification-list component; this spec is the rendering contract the hub-chat (Inbox) squad builds to. A frame can be added if the squad wants a visual sign-off.
📊 UI State Diagram — AI Agent Alert (Notification Center item)
stateDiagram-v2
[*] --> Delivered: notification-service push received
Delivered --> Unread: shown in bell (red dot) + push banner
Unread --> Read: supervisor opens the Notification Center
Read --> RoomOpened: taps "Open room" (click_action_url + extra.room_id)
Unread --> Expired: 7-day chat retention reached
Read --> Expired: 7-day chat retention reached
RoomOpened --> [*]: supervisor takes over / coaches
Expired --> [*]: pruned from list (durable record stays in Activity Log)
7. API & Webhook Behavior
PM describes behavior in plain language; Engineering resolves HTTP methods, schemas, and error codes in the RFC. The outbound contract is to notification-service's existing internal endpoint.
| # | Behavior | Entity Affected | Triggered By | Expected Behavior | Failure Behavior |
|---|---|---|---|---|---|
| 1 | Resolve alert recipient(s) | Supervisor mapping (org → sso_id[]) | A failure signal is raised for a room in org X | • Look up the configured supervisor(s) for org X (open question: existing role vs. new org setting — see Section 15) • Return one or more sso_ids to target • Cache the lookup for the cooldown window | • If no supervisor is configured for the org: drop the alert, increment ai_agent_alert_dropped with reason no_supervisor, do not error the conversation • If lookup times out: skip alert, log; never block the customer reply |
| 2 | Throttle & dedup the alert | In-memory/Redis throttle key (room_id + signal_type) | An alert is about to be emitted | • If no live key exists: proceed and set the key with TTL = cooldown (default 5 min) • Attach a stable event_id for idempotency | • If a live key exists (within cooldown): suppress this alert, increment ai_agent_alert_suppressed with the signal type • If the key store is unavailable: fail open (emit) rather than fail closed — a missed dedup is better than a missed failure alert |
| 3 | Emit the alert to notification-service | A chat-origin notification (notif_type=3, notif_category=ai_agent_alert) | Recipient resolved + not throttled | • POST to notification-service /api/v1/notifications/chat with X-Api-Key internal auth, targeting the supervisor sso_id • Title/description summarise the signal (e.g. "AI Agent failed — engine error"), click_action=OPEN_URL, click_action_url deep-links the room • Telemetry travels in the extra JSONB envelope: room_id, organization_id, conversation_id, signal_type, failed_reason/assign_reason, confidence (when present), agent_id • Runs in an async worker (Sidekiq), out-of-band of the reply path • Mirror the event to the AI Activity Log datamart (system of record) | • Non-2xx from notification-service: retry with backoff (bounded); on final failure increment ai_agent_alert_delivery_failed and log — never raise into the conversation flow • If the dependency (category/whitelist) is not yet live: alert is created but no push is delivered — covered by the Section 13 blocking gate |
Engineering resolves during RFC: exact request/response schema, retry/backoff policy, async worker boundary, and whether HTTP or the qontak_chat.public.notification_worker Kafka topic is used for emit (both are supported by notification-service).
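The "retry with backoff (bounded)" failure behavior in row 3 can be illustrated as follows. This is only a sketch: deliver_with_retry, the attempt count, and the base delay are hypothetical, and the actual policy is left to the RFC (Sidekiq's built-in job retry could serve the same purpose). The block passed in stands for the HTTP call and is expected to return the response status code.

```ruby
# Calls the block up to max_attempts times, doubling the delay between
# attempts. The sleeper is injectable so tests do not actually wait.
def deliver_with_retry(max_attempts: 3, base_delay: 0.5, sleeper: ->(s) { sleep(s) })
  attempt = 0
  loop do
    attempt += 1
    status = begin
      yield(attempt)
    rescue StandardError
      nil # a raised transport error counts as a failed attempt
    end

    return { delivered: true, attempts: attempt } if status.is_a?(Integer) && status.between?(200, 299)
    # Final failure: the caller increments ai_agent_alert_delivery_failed and logs.
    return { delivered: false, attempts: attempt } if attempt >= max_attempts

    sleeper.call(base_delay * (2**(attempt - 1)))
  end
end

# Example with a stubbed transport that succeeds on the third attempt.
statuses = [503, 500, 200]
result = deliver_with_retry(sleeper: ->(_s) {}) { |n| statuses[n - 1] }
puts result.inspect
```

The helper returns a result hash instead of raising, which keeps the "never raise into the conversation flow" rule easy to honor at the call site.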
8. System Flow + User Stories
8.1. System Flow
Flow: AI Agent Failure → Supervisor Alert Type: API Sequence
- A customer message reaches the chatbot BE and the AI agent attempts a response (get_answer / send_message_with_resolve).
- The agent pipeline hits one of the monitored conditions: AI-service non-200 (get_answer.rb:44), v2-engine failure (send_message_with_resolve.rb:1923), unexpected handover (assign_agent==true with an abnormal assign_reason, get_answer.rb:56), message-limit cap (send_message_with_resolve.rb:~1812), or a low-confidence answer (confidence below floor).
- The agent's normal fallback behavior runs unchanged (default answer / handover / cap) — the customer path is not altered or delayed.
- The detector classifies the event into a signal_type and enqueues an alert job (async, Sidekiq), passing room_id, organization_id, conversation_id, signal context, and confidence when available.
- The alert worker resolves the org's configured supervisor(s) → sso_id[].
- Decision: no supervisor configured → drop, count no_supervisor, stop.
- Decision: a throttle key for this room_id + signal_type is live (within cooldown) → suppress, count suppressed, stop. Otherwise set the key (TTL = cooldown) and continue.
- The worker POSTs the alert to notification-service /api/v1/notifications/chat (X-Api-Key, notif_type=3, notif_category=ai_agent_alert), telemetry in extra, deep-link in click_action_url.
- notification-service stores the notification and (category whitelisted) pushes via FCM to the supervisor's chat device.
- hub-chat (Omnichannel web) renders the alert in the navbar Notification Center bell and shows the FCM push banner, using the registered ai_agent_alert category for the icon/label.
- The worker mirrors the event to the AI Activity Log datamart (system of record) for KPI + Phase-2 console.
- Failure branch: notification-service returns non-2xx → bounded retry/backoff; on final failure count delivery_failed and log — never raise into the conversation.
- The supervisor sees the alert in the bell / push, taps the deep-link, and opens the failing room to take over or coach.
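The worker side of the flow above (resolve, drop, throttle, emit, mirror) can be sketched as one small orchestrator. AlertPipeline and its collaborators are hypothetical names, injected as lambdas so the ordering of the decisions is the only thing shown. Mirroring to the Activity Log only after at least one successful delivery is an assumption taken from the sequence diagram, not a confirmed rule.

```ruby
class AlertPipeline
  attr_reader :counters

  def initialize(resolver:, throttle:, emitter:, mirror:)
    @resolver = resolver # org_id -> array of supervisor sso_ids
    @throttle = throttle # (room_id, signal_type) -> true if the alert may be sent
    @emitter = emitter   # (sso_id, signal) -> true on a 2xx delivery
    @mirror = mirror     # signal -> records the event in the Activity Log
    @counters = Hash.new(0)
  end

  def call(signal)
    recipients = @resolver.call(signal[:organization_id]).uniq
    if recipients.empty?
      @counters[:dropped_no_supervisor] += 1
      return :dropped
    end

    unless @throttle.call(signal[:room_id], signal[:signal_type])
      @counters[:suppressed] += 1
      return :suppressed
    end

    results = recipients.map { |sso_id| @emitter.call(sso_id, signal) }
    @counters[:delivery_failed] += results.count { |ok| !ok }
    return :delivery_failed unless results.any?

    @mirror.call(signal)
    :delivered
  end
end

seen = {}
pipeline = AlertPipeline.new(
  resolver: ->(org_id) { org_id == "org-1" ? %w[sso-a sso-b] : [] },
  throttle: ->(room, type) { seen[[room, type]] ? false : (seen[[room, type]] = true) },
  emitter: ->(_sso_id, _signal) { true },
  mirror: ->(_signal) {}
)
signal = { organization_id: "org-1", room_id: "room-1", signal_type: "engine_failure" }

pipeline.call(signal)                                 # => :delivered
pipeline.call(signal)                                 # => :suppressed
pipeline.call(signal.merge(organization_id: "org-2")) # => :dropped
```

Resolving recipients before touching the throttle means an org with no supervisor never consumes a cooldown window, which is the order the flow lists.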
📊 System Flow — Supervisor Alerts
sequenceDiagram
participant Cust as Customer
participant BE as Chatbot BE (agent pipeline)
participant W as Alert Worker (Sidekiq)
participant Res as Supervisor Resolver
participant Thr as Throttle Store (Redis)
participant NS as notification-service
participant HC as hub-chat (Notification Center)
participant ADL as AI Activity Log
participant Spv as Supervisor
Cust->>BE: Customer message
BE-->>Cust: Reply / fallback (unchanged — no added latency)
Note over BE: Failure detected (service · engine · handover · limit · low-confidence)
BE->>W: Enqueue alert (room_id, org_id, signal_type, context)
W->>Res: Resolve supervisor(s) for org
alt No supervisor configured
Res-->>W: none
W-->>W: Drop · count no_supervisor
else Supervisor resolved
Res-->>W: sso_id[]
W->>Thr: Check room + signal-type key
alt Within cooldown
Thr-->>W: key live
W-->>W: Suppress · count suppressed
else Not throttled
Thr-->>W: set key (TTL = cooldown)
W->>NS: POST /notifications/chat (X-Api-Key · ai_agent_alert · extra)
alt Delivery non-2xx
NS-->>W: error
W-->>W: Retry/backoff · on final fail count delivery_failed
else Delivered
NS->>HC: FCM push + stored notification (ai_agent_alert)
HC->>Spv: Render in bell + push banner
W->>ADL: Mirror event (system of record)
Spv->>BE: Open room (deep-link) · take over / coach
end
end
end
8.2. User Stories
[MON-S01] — Supervisor alert delivery pipeline (resolve · throttle · emit)
| User Story | As a CS Supervisor, I want a single reliable pipeline that resolves who to notify, suppresses repeats, and delivers the alert, so that every genuine agent failure reaches me once, fast, without flooding me. |
| Before State | None — the chatbot BE has no integration with notification-service and no concept of a "supervisor alert". Failures end at a default answer or a Rollbar log. |
| After Delta | A shared async pipeline (Sidekiq worker) takes a raised failure signal, resolves the org supervisor(s), applies per-room/per-signal throttle + dedup, POSTs the alert to notification-service chat, and mirrors the event to the Activity Log datamart — out-of-band of the customer reply path. |
| Importance | Must Have |
| Mockup / Technical Notes | Figma: N/A — backend-only. Data Fields (alert payload): • sso_id (uuid, required) — resolved supervisor target• room_id (string, required) — source: room state• organization_id (uuid, required) — source: room/org context• conversation_id (string, required) — source: conversation context• signal_type (enum, required) — service_failure / engine_failure / unexpected_handover / message_limit / low_confidence• event_id (uuid, required) — idempotency/dedup key• extra (json, required) — telemetry envelope (reason, confidence, agent_id)Technical Notes: Delivery via notification-service POST /api/v1/notifications/chat (X-Api-Key, notif_type=3, notif_category=ai_agent_alert). Throttle key store = Redis with TTL = cooldown. Pipeline is invoked by all detector stories (S02–S05). |
| Acceptance Criteria | — Happy Path — • AC-1: Given a failure signal is raised for a room whose org has a configured supervisor, when the pipeline runs, then exactly one alert is POSTed to notification-service targeting the supervisor sso_id with the correct signal_type and extra telemetry. • AC-2: Given the same room raises the same signal_type again within the cooldown window, when the pipeline runs, then the second alert is suppressed and ai_agent_alert_suppressed is incremented. • AC-3 (volume/boundary): Given a room loops a failure many times within the cooldown, when the pipeline runs repeatedly, then at most one alert is delivered for that room+signal in the window. • AC-4: Given an org configures more than one supervisor, when an alert is emitted, then each configured supervisor sso_id receives the alert (deduped per recipient). — Error / Unhappy Path — • ERR-1: Given the org has no configured supervisor, when the pipeline runs, then no alert is sent, ai_agent_alert_dropped is incremented with reason no_supervisor, and the conversation flow is unaffected. • ERR-2: Given notification-service returns a non-2xx, when the pipeline emits, then it retries with bounded backoff and on final failure increments ai_agent_alert_delivery_failed and logs — without raising into the conversation. • ERR-3: Given the throttle key store is unavailable, when the pipeline runs, then it fails open (emits the alert) rather than dropping a genuine failure. — Permission Model — • CAN: chatbot BE (service-to-service via internal API key) emits; configured supervisor(s) receive. • CANNOT: end customers (never recipients); agents without supervisor configuration. • CANNOT (reversibility): an alert cannot be recalled or edited once delivered; it is fire-once and auto-expires per notification-service retention (7-day chat window). The durable record persists in the AI Activity Log datamart. • Unauthorized: if the internal API key is rejected by notification-service, emit fails closed to the supervisor (no alert) and is counted as delivery_failed. — UI States — • Loading: N/A (backend). • Empty: no supervisor configured → no alert (see ERR-1). • Error: delivery failure logged/counted, invisible to customer. • Success: alert appears in supervisor's notification surface + FCM push. — Negative Scenarios — (from Non-Goals) • NEG-1: Given the agent responds successfully (no failure signal), when the pipeline is not invoked, then no alert is generated. • NEG-2: Given a customer is in the room, when an alert is emitted, then the customer never receives it (internal-only). |
Dependencies: notification-service ai_agent_alert category + chat push (Section 13).
🧪 Test Coverage Matrix — [MON-S01]
| Dimension | Coverage | Notes |
|---|---|---|
| Boundary values | ✅ defined | AC-3 covers the loop/flood bound; AC-4 covers multi-supervisor |
| State transitions | ✅ defined | AC-1/AC-2 cover first-fire vs. within-cooldown |
| Data validation | ⚠️ TBD | ⚠️ QA: malformed/missing organization_id or room_id in the raised signal |
| Concurrency | ⚠️ partial | AC-2 covers sequential repeats; ⚠️ QA: two failures on the same room+signal racing the throttle key simultaneously |
| Network/timeout | ✅ defined | ERR-2 (delivery retry/backoff) + ERR-3 (key store unavailable, fail-open) |
[MON-S02] — Detect AI-service / v2-engine failure and raise an alert
| User Story | As a CS Supervisor, I want to be alerted when the AI service or v2 engine fails to answer, so that I can step into a room where the bot has fallen back to a default answer. |
| Before State | On get_answer.rb:44 (AI-service non-200) the BE assigns a fallback default answer; on send_message_with_resolve.rb:1923 (v2 result.code != 200) it runs _execute_ai_assist_fallback. Both are invisible to a human. |
| After Delta | Each of these failure branches additionally raises a service_failure / engine_failure signal into the S01 pipeline, with the failure reason captured in telemetry. The existing fallback behavior is unchanged. |
| Importance | Must Have |
| Mockup / Technical Notes | Figma: N/A. Data Fields: • signal_type (enum, required) — service_failure or engine_failure• failed_reason (string, required) — source: existing failed_reason ("failed to access ai service" / "Fail to get answer from AI Agent")• room_id, organization_id, conversation_id (required) — source: room/conversation contextTechnical Notes: Emit hook at app/core/repositories/ai_service/get_answer.rb:44 and app/core/use_cases/system/hub/send_message_with_resolve.rb:1923. The DeductionRequest (is_failed/failed_reason) at chatbot_ai_deduction_worker.rb is the async audit backstop. |
| Acceptance Criteria | — Happy Path — • AC-1: Given the AI service returns a non-200, when the BE applies its default-answer fallback, then a service_failure signal is raised into the pipeline with failed_reason populated.• AC-2: Given the v2 engine returns result.code != 200, when _execute_ai_assist_fallback runs, then an engine_failure signal is raised with failed_reason populated.• AC-3: Given a failure is detected, when the signal is raised, then the customer still receives the existing fallback answer with no added latency. — Error / Unhappy Path — • ERR-1: Given signal raising itself errors (e.g. enqueue fails), when the failure branch runs, then the customer-facing fallback still completes and the enqueue error is logged — detection never degrades the reply path. — Permission Model — • CAN: BE failure branches raise the signal automatically. • CANNOT: no manual/user trigger for this signal. • CANNOT (reversibility): the resulting alert cannot be recalled or edited once delivered; fire-once, auto-expires per notification-service retention. • Unauthorized: N/A (system-raised). — UI States — • Loading/Empty/Error/Success: N/A (backend); delivery states handled by S01. — Negative Scenarios — (from Non-Goals) • NEG-1: Given the AI service returns 200 and the agent answers normally, when no failure branch runs, then no service_failure/engine_failure signal is raised. |
Dependencies: [MON-S01].
🧪 Test Coverage Matrix — [MON-S02]
| Dimension | Coverage | Notes |
|---|---|---|
| Boundary values | ✅ defined | AC-1/AC-2 cover both distinct failure sources |
| State transitions | ✅ defined | AC-3: failure → fallback delivered, signal raised in parallel |
| Data validation | ⚠️ TBD | ⚠️ QA: empty/unexpected failed_reason string still produces a usable alert |
| Concurrency | ⚠️ TBD | ⚠️ QA: service failure followed immediately by a retry success on the same room |
| Network/timeout | ✅ defined | ERR-1 (enqueue failure never blocks reply path) |
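ERR-1 for this story says a failed enqueue must never degrade the reply path. One way to picture that guarantee is a wrapper that swallows and logs enqueue errors. The names below (raise_signal_safely, reply_with_fallback) are hypothetical and the reply strings are placeholders; only the failed_reason text is taken from the story.

```ruby
# Enqueues the signal; any error is logged and reported as false
# instead of propagating into the customer reply path.
def raise_signal_safely(signal, enqueue:, logger:)
  enqueue.call(signal)
  true
rescue StandardError => e
  logger.call("ai_agent_alert enqueue failed: #{e.class}: #{e.message}")
  false
end

# Simplified stand-in for the AI-service non-200 branch.
def reply_with_fallback(ai_status:, enqueue:, logger:)
  return "AI answer" if ai_status == 200

  raise_signal_safely(
    { signal_type: "service_failure", failed_reason: "failed to access ai service" },
    enqueue: enqueue, logger: logger
  )
  "default answer" # existing fallback, returned whatever the enqueue outcome
end

logs = []
broken_queue = ->(_signal) { raise "queue down" }
puts reply_with_fallback(ai_status: 503, enqueue: broken_queue, logger: ->(m) { logs << m })
puts logs.inspect
```

The same wrapper would apply to the v2-engine branch with signal_type engine_failure.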
[MON-S03] — Detect unexpected handover/escalation and raise an alert
| User Story | As a CS Supervisor, I want to be alerted when the agent hands a conversation off unexpectedly, so that I can pick up an escalation that the bot couldn't resolve rather than letting it sit unassigned. |
| Before State | At get_answer.rb:56, assign_agent==true triggers _assign_agent. Handovers happen for different reasons (TRANSFER_CONDITION, EVALUATE_ANSWER); none notify a supervisor, and a normal "evaluate answer" handoff is expected behavior. |
| After Delta | An abnormal handover (e.g. TRANSFER_CONDITION where the agent could not satisfy a transfer condition) raises an unexpected_handover signal into the pipeline; expected handovers (EVALUATE_ANSWER) do not alert. assign_reason is captured in telemetry. |
| Importance | Must Have |
| Mockup / Technical Notes | Figma: N/A. Data Fields: • signal_type (enum, required) — unexpected_handover• assign_reason (string, required) — source: AI-service response assign_reason• room_id, organization_id, conversation_id (required)Technical Notes: Emit hook at app/core/repositories/ai_service/get_answer.rb:56. The expected-vs-unexpected boundary is defined by assign_reason; the exact reason allow/deny list is confirmed with the BE team (see S15). |
| Acceptance Criteria | — Happy Path — • AC-1: Given assign_agent==true with an abnormal assign_reason (e.g. TRANSFER_CONDITION), when the handover runs, then an unexpected_handover signal is raised with assign_reason in telemetry.• AC-2: Given the handover assigns the room to a human, when the alert is emitted, then the deep-link opens that specific room for the supervisor. — Error / Unhappy Path — • ERR-1: Given the handover succeeds but signal raising fails, when the branch runs, then the handover/assignment still completes and the error is logged. — Permission Model — • CAN: BE handover branch raises automatically. • CANNOT: manual trigger. • CANNOT (reversibility): the resulting alert cannot be recalled or edited once delivered; fire-once, auto-expires per notification-service retention. • Unauthorized: N/A. — UI States — • N/A (backend); delivery via S01. — Negative Scenarios — (from Non-Goals) • NEG-1: Given a normal, expected handover ( EVALUATE_ANSWER), when it runs, then no unexpected_handover alert is raised (Non-Goal #4 — no alerting on normal behavior).• NEG-2: Given the agent resolves the conversation without handover, when it completes, then no handover signal is raised. |
Dependencies: [MON-S01].
🧪 Test Coverage Matrix — [MON-S03]
| Dimension | Coverage | Notes |
|---|---|---|
| Boundary values | ✅ defined | AC-1 vs. NEG-1 draw the expected/unexpected boundary on assign_reason |
| State transitions | ✅ defined | AC-2: handover → room assigned → deep-link targets it |
| Data validation | ⚠️ TBD | ⚠️ QA: unknown/new assign_reason value — default to alert or not? (tie to S15 allow-list) |
| Concurrency | ⚠️ TBD | ⚠️ QA: handover + a near-simultaneous failure signal on the same room (throttle is per signal-type, so both may alert) |
| Network/timeout | ✅ defined | ERR-1 (signal failure never blocks the assignment) |
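The expected-versus-unexpected boundary on assign_reason could look like the sketch below. The two lists contain only the reasons named in this story; the real allow/deny list is still to be confirmed with the BE team, and the treatment of unknown reasons is an open QA question, so it is exposed here as a hypothetical alert_on_unknown switch rather than decided.

```ruby
EXPECTED_HANDOVER_REASONS = %w[EVALUATE_ANSWER].freeze    # never alert
ABNORMAL_HANDOVER_REASONS = %w[TRANSFER_CONDITION].freeze # always alert

# Returns true when the handover should raise an unexpected_handover signal.
def unexpected_handover?(assign_agent:, assign_reason:, alert_on_unknown: true)
  return false unless assign_agent
  return false if EXPECTED_HANDOVER_REASONS.include?(assign_reason)
  return true if ABNORMAL_HANDOVER_REASONS.include?(assign_reason)

  alert_on_unknown # unknown or new reason: policy is an open question
end

puts unexpected_handover?(assign_agent: true, assign_reason: "TRANSFER_CONDITION")
puts unexpected_handover?(assign_agent: true, assign_reason: "EVALUATE_ANSWER")
```

Keeping the unknown-reason default as a parameter lets the team flip the policy once the list is agreed, without touching the call sites.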
[MON-S04] — Detect message-limit cap and raise an alert
| User Story | As a CS Supervisor, I want to be alerted when the agent hits its message limit and stops responding, so that I know a conversation has gone silent because of a cap, not because the customer left. |
| Before State | At send_message_with_resolve.rb:~1812, exceeding message_limit calls _handle_ai_agent_message_limit_reached, which only emits Rollbar.info — invisible to any supervisor. |
| After Delta | The message-limit branch additionally raises a message_limit signal into the pipeline so a human can pick up the now-silent conversation. |
| Importance | Should Have |
| Mockup / Technical Notes | Figma: N/A. Data Fields: • signal_type (enum, required) — message_limit• room_id, organization_id, conversation_id (required)• message_count / limit (int, optional) — source: Redis counter + agent configTechnical Notes: Emit hook alongside _handle_ai_agent_message_limit_reached. Should-Have: ships if Phase-1 capacity permits; otherwise rolls to a fast-follow. |
| Acceptance Criteria | — Happy Path — • AC-1: Given a room exceeds the agent message_limit, when the cap handler runs, then a message_limit signal is raised into the pipeline.• AC-2: Given the cap recurs on the same room within cooldown, when the handler runs again, then S01 throttling suppresses the repeat. — Error / Unhappy Path — • ERR-1: Given signal raising fails, when the cap handler runs, then the existing cap behavior (and Rollbar log) still completes. — Permission Model — • CAN: BE cap handler raises automatically. CANNOT: manual trigger. Unauthorized: N/A. • CANNOT (reversibility): the resulting alert cannot be recalled or edited once delivered; fire-once, auto-expires per notification-service retention. — UI States — • N/A (backend); delivery via S01. — Negative Scenarios — (from Non-Goals) • NEG-1: Given the room is below the message limit, when the agent responds, then no message_limit signal is raised. |
Dependencies: [MON-S01].
[MON-S05] — Detect low-confidence response and raise an alert
| User Story | As a CS Supervisor, I want to be alerted when the agent answers with low confidence, so that I can review a reply that the bot itself is unsure about before it costs us the customer. |
| Before State | The numeric confidence of a response is produced by the AI service and recorded in telemetry / the AI Activity Log datamart, but it is not evaluated against a threshold at the BE emit point and never surfaces to a human in real time. |
| After Delta | When a response's confidence is below a configured floor, a low_confidence signal is raised into the pipeline with the confidence value in telemetry. Requires the confidence value to be available at the BE emit point (dependency/assumption — see S13/S15). |
| Importance | Should Have |
| Mockup / Technical Notes | Figma: N/A. Data Fields: • signal_type (enum, required) — low_confidence• confidence (float, required) — source: AI-service response / Activity Log• confidence_floor (float, required) — configured threshold (default TBD with Data/ML)• room_id, organization_id, conversation_id (required)Technical Notes: Gated on confidence being surfaced into the BE response handling (it is not today). Threshold is a global default in Phase 1 (no per-agent tuning — Non-Goal #7). |
| Acceptance Criteria | — Happy Path — • AC-1: Given a response with confidence below the configured floor, when the response is produced, then a low_confidence signal is raised with the confidence value in telemetry.• AC-2 (boundary): Given confidence exactly equals the floor, when the response is produced, then it is treated as not low-confidence (strictly below triggers). — Error / Unhappy Path — • ERR-1: Given the confidence value is absent/null for a response, when evaluation runs, then no low_confidence signal is raised and ai_agent_alert_skipped is incremented with reason confidence_unavailable (fail safe, no false alerts).— Permission Model — • CAN: BE raises automatically when confidence is present and below floor. CANNOT: manual trigger. Unauthorized: N/A. • CANNOT (reversibility): the resulting alert cannot be recalled or edited once delivered; fire-once, auto-expires per notification-service retention. — UI States — • N/A (backend); delivery via S01. — Negative Scenarios — (from Non-Goals) • NEG-1: Given confidence is at or above the floor, when the agent answers, then no low_confidence signal is raised. |
Dependencies: [MON-S01]; confidence availability at the BE emit point (S13/S15).
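The acceptance criteria above (AC-1, AC-2, ERR-1, NEG-1) can be sketched as a single evaluation function. This is a minimal illustration, not the chatbot BE implementation: the names `Evaluation` and `evaluate_low_confidence` are hypothetical, and the floor of 0.5 used in the usage lines is a placeholder only, since the real default is still TBD with Data/ML.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Evaluation:
    """Outcome of checking one response (hypothetical type, for illustration)."""
    raise_signal: bool                    # True -> raise a low_confidence signal
    skipped_reason: Optional[str] = None  # set when evaluation could not run


def evaluate_low_confidence(confidence: Optional[float],
                            confidence_floor: float) -> Evaluation:
    # ERR-1: missing confidence fails safe. No signal is raised, and the caller
    # would increment ai_agent_alert_skipped with this reason.
    if confidence is None:
        return Evaluation(raise_signal=False,
                          skipped_reason="confidence_unavailable")
    # AC-1 / AC-2 / NEG-1: only strictly-below-floor triggers the signal.
    return Evaluation(raise_signal=confidence < confidence_floor)


# Usage with a placeholder floor of 0.5 (the real default is TBD).
below = evaluate_low_confidence(0.49, 0.5)
at_floor = evaluate_low_confidence(0.5, 0.5)
missing = evaluate_low_confidence(None, 0.5)
```

Returning a skip reason instead of raising an exception keeps the "no false alerts" behavior explicit and lets the caller decide how to record the skip.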
9. Rollout
| Field | Value |
|---|---|
| Feature flag | ai_agent_monitoring_alerts — default: OFF (per organization; see Constraints) |
| Stage 1 | Internal: 1–2 Mekari-owned test orgs with a Chatbot Specialist as the configured supervisor; verify each of the 4 signals end-to-end (alert received, deep-link opens room, throttle works) |
| Stage 2 | Closed pilot: 2–3 friendly production customers (Plus/Ultimate/360) running an autonomous agent, with the dependency live; supervisor opt-in |
| Stage 3 | Targeted rollout: the 26Q2 production autonomous agents (toward the 15 target customer IDs), enabled per org on request |
| GA | All eligible (Plus/Ultimate/360) orgs with a configured supervisor, flag on by org setting |
| Backward compat | Yes — purely additive. With the flag OFF, agent behavior is byte-for-byte unchanged (detection hooks are no-ops). |
| Migration | None — no data migration. New artifacts (throttle keys, alert records) are created forward-only. |
10. Observability
Key Events:
| Event Name | Trigger | Properties |
|---|---|---|
| ai_agent_failure_signal_raised | A detector (S02–S05) raises any signal | signal_type, room_id, organization_id, conversation_id, confidence?, reason?, timestamp |
| ai_agent_alert_delivered | Alert successfully POSTed to notification-service (2xx) | signal_type, organization_id, sso_id, event_id, timestamp |
| ai_agent_alert_suppressed | Alert suppressed by throttle/dedup within cooldown | signal_type, room_id, organization_id, timestamp |
| ai_agent_alert_dropped | No recipient resolved | reason (no_supervisor), organization_id, timestamp |
| ai_agent_alert_skipped | Signal not evaluated (e.g. confidence unavailable) | reason, signal_type, organization_id, timestamp |
| ai_agent_alert_delivery_failed | notification-service non-2xx after retries | signal_type, organization_id, http_status, timestamp |
| Field | Detail |
|---|---|
| Dashboard owner | Chatbot squad (BOT), with the alert/telemetry stream also landing in the AI Activity Log datamart (Data) |
| Alert 1 | ai_agent_alert_delivery_failed rate > 5% of attempts in any 1h window → Slack: #chatbot-oncall |
| Alert 2 | ai_agent_alert_suppressed / ai_agent_failure_signal_raised ratio > 50% sustained 1h (signals being throttled heavily → noisy trigger or looping failure) → Slack: #chatbot-oncall for threshold review |
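As a rough illustration of how Alert 1 and Alert 2 could be evaluated over the event counts of a 1h window, the sketch below uses two hypothetical helper functions. It assumes "attempts" means delivered plus delivery_failed (matching the delivery success rate definition in Success Metrics) and that a window with no events should never fire; neither assumption is confirmed for the real dashboard.

```python
def delivery_failure_alert(delivered: int, failed: int,
                           threshold: float = 0.05) -> bool:
    """Alert 1: delivery_failed share of attempts exceeds 5% in the window."""
    attempts = delivered + failed  # assumption: attempts = delivered + failed
    if attempts == 0:
        return False               # assumption: an empty window never fires
    return failed / attempts > threshold


def suppression_ratio_alert(suppressed: int, raised: int,
                            threshold: float = 0.50) -> bool:
    """Alert 2: suppressed / raised exceeds 50% in the window."""
    if raised == 0:
        return False               # assumption: an empty window never fires
    return suppressed / raised > threshold


# Counts from one hypothetical 1h window.
alert_1 = delivery_failure_alert(delivered=94, failed=6)    # 6% of attempts
alert_2 = suppression_ratio_alert(suppressed=50, raised=100)  # exactly 50%
```

Both comparisons are strict, so a window sitting exactly on the threshold does not page #chatbot-oncall.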
10.1. Post-Launch Monitoring Cadence
| Field | Detail |
|---|---|
| Review cadence | Weekly for the first 4 weeks post-GA, then monthly |
| Owner | Dimas (PM) + Chatbot squad |
| Review scope | ai_agent_failure_signal_raised, ai_agent_alert_delivered, ai_agent_alert_suppressed, ai_agent_alert_delivery_failed, and the ⭐ supervisor time-to-intervene KPI |
| Trigger threshold 1 | ai_agent_alert_delivery_failed > 5% of attempts in a week → investigate delivery/dependency health immediately |
| Trigger threshold 2 | Suppressed/raised ratio > 50% for 2 consecutive weeks → revisit cooldown window and per-signal thresholds (noise risk, S15 Risk #4) |
| Rollback consideration | If delivery failures or alert-storm complaints cannot be resolved within 48h, PM disables ai_agent_monitoring_alerts for the affected org(s) pending root cause. |
11. Success Metrics
Efficiency & Impact:
| Metric | Definition | Baseline | Target |
|---|---|---|---|
| ⭐ Supervisor time-to-intervene | Median elapsed time from ai_agent_failure_signal_raised to a human opening/acting on that room | N/A — unmeasurable today (no alert exists); establish baseline in Stage 1–2 | Median ≤ 5 minutes within 90 days of GA |
Adoption & Usage:
| Metric | Definition | Baseline | Target |
|---|---|---|---|
| Monitored-agent coverage | Share of targeted production autonomous agents with ≥1 supervisor receiving alerts | 0 | 100% of targeted production agents within 30 days of GA |
| Alert engagement rate | Share of delivered alerts whose deep-link is opened by a supervisor | N/A — new | ≥ 60% within 60 days of GA |
Quality & Accuracy:
| Metric | Definition | Baseline | Target |
|---|---|---|---|
| Alert delivery success rate | ai_agent_alert_delivered / (delivered + delivery_failed) | N/A — new | ≥ 98% within 30 days of GA |
| Alert noise ratio | Supervisor-reported "not useful" alerts / delivered alerts (sampled) | N/A — new | ≤ 10% within 60 days of GA (else revisit thresholds) |
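The ⭐ KPI above can be computed from pairs of timestamps once the baseline data exists. The sketch below is an assumption-laden illustration: the function names are hypothetical, each pair is taken to be (signal raised, first human action on that room), and rooms that no human ever acts on are simply left out, which the PRD does not specify.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import List, Optional, Tuple

TARGET_SECONDS = 5 * 60  # target from the table above: median <= 5 minutes


def median_time_to_intervene(
        pairs: List[Tuple[datetime, datetime]]) -> Optional[float]:
    """Median seconds from signal raised to first human action.

    Each pair is (signal_raised_at, first_human_action_at). Returns None
    when there is no data, e.g. before the Stage 1-2 baseline is captured.
    """
    durations = [(acted - raised).total_seconds() for raised, acted in pairs]
    if not durations:
        return None
    return median(durations)


def meets_target(median_seconds: Optional[float]) -> bool:
    return median_seconds is not None and median_seconds <= TARGET_SECONDS


# Three hypothetical interventions after 2, 4 and 10 minutes.
t0 = datetime(2026, 7, 1, 9, 0, 0)
sample = [(t0, t0 + timedelta(minutes=2)),
          (t0, t0 + timedelta(minutes=4)),
          (t0, t0 + timedelta(minutes=10))]
kpi_seconds = median_time_to_intervene(sample)
```

Using the median rather than the mean keeps one very slow takeover from dominating the KPI, which matches how the target is stated.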
12. Launch Plan & Stage Gates
| Stage | Audience | Duration | Success Gate to Advance | Owner |
|---|---|---|---|---|
| Internal Alpha | 1–2 Mekari test orgs | 1–2 weeks | All 4 signals deliver end-to-end; throttle verified; 0 P0/P1; agent reply path unaffected with flag ON and OFF | PM + QA |
| Closed Pilot | 2–3 friendly production customers | 2–3 weeks | Alert delivery success ≥ 98%; supervisor time-to-intervene baseline captured; noise ratio ≤ 10% (sampled); dependency live | PM + CSM |
| Targeted Rollout | 26Q2 production autonomous agents (toward 15 IDs) | 2–4 weeks | Pilot gates sustained; no alert-storm complaints unresolved > 48h | Eng Lead + PM |
| GA | All eligible orgs with a configured supervisor | Ongoing | All rollout gates sustained 2 weeks; ⭐ KPI trending to ≤ 5 min | PM |
13. Dependencies
| Dependency | Owning Team | Deliverable Needed | Blocking? |
|---|---|---|---|
| notification-service ai_agent_alert category + chat-origin FCM push | Broadcast / notification-service squad | New notif_category (value e.g. 14) seeded via migration and added to the FCM whitelist so chat-origin ai_agent_alert notifications push; documented in a small TECH RFC. Without it, alerts are stored but not pushed. | YES |
| hub-chat Notification Center rendering of ai_agent_alert | Chat / Inbox squad (hub-chat) | Register the new notif_category in the navbar type lookup (icon + notif_category_label) and ensure the bell + FCM push render it with a working room deep-link, per the S6 contract. hub-chat already does FCM + has the Notification Center, so this is incremental, not net-new. Without it, the alert arrives but renders unlabelled / may not surface. | YES |
| Supervisor FCM token coverage (user_source="chat") | Broadcast squad + Chatbot | Targeted supervisors must have notification-service FCM tokens registered for chat, or push silently no-ops | YES |
| Supervisor-role resolution (org → sso_id[]) | Chatbot BE + Platform (role/permission) | A reliable way to resolve "who is the supervisor for org X" — existing role, new org setting, or configured list (see S15 #1) | YES |
| Confidence value at the BE emit point | Data/ML + Chatbot BE | The numeric confidence surfaced into BE response handling so MON-S05 can threshold it (not available today) | NO — only blocks the Should-Have MON-S05, not Phase-1 GA |
| AI Activity Log datamart write path | Data (BI) | Accept the alert event as the durable system-of-record row (consistent with existing per-response telemetry) | NO — alerts still deliver without it; needed for KPI reporting + Phase 2 |
| Internal API key / auth to notification-service | Broadcast squad | Valid X-Api-Key for chatbot BE → notification-service service-to-service calls | YES |
14. Key Decisions + Alternatives Rejected
14a — Decisions Made
| Date | Decision | Rationale |
|---|---|---|
| 2026-06-28 | Detect at the existing failure branches and emit via an async pipeline, out-of-band of the reply path | Detection must never add latency to or block the customer reply; the failure branches are the natural, already-present hooks |
| 2026-06-28 | Target a configured org supervisor, resolved to sso_id[], not the assigned agent | A failed handover may have no assignee; supervision is org-level and must be independent of assignment state |
| 2026-06-28 | Apply per-room/per-signal throttle + dedup (5-min cooldown default) | A looping failure on one room must not flood the supervisor; signal-type granularity still surfaces distinct problems |
| 2026-06-28 | Alert only on abnormal handovers, not every escalation | A normal EVALUATE_ANSWER handoff is expected behavior; alerting on it would be noise (Non-Goal #4) |
| 2026-06-28 | Make MON-S05 (low-confidence) Should-Have, gated on confidence availability | The confidence value isn't surfaced at the BE emit point today; the 3 Must-Have stories deliver value without it |
| 2026-06-28 | Reuse notification-service; keep its change as an external dependency + TECH RFC | Avoids new infra and preserves squad ownership of the delivery channel |
| 2026-06-28 | Render in the existing hub-chat Notification Center, not a new surface; the FE work is a scoped dependency on the Inbox squad | hub-chat already does FCM + a category-aware Notification Center; reusing it means only registering one new category vs. building a notification UI. S6 is the rendering contract |
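The per-room/per-signal throttle decision above can be illustrated with a small in-memory sketch. Everything here is hypothetical: the class name `AlertThrottle`, the in-process dictionary (a production version would presumably keep its throttle keys in a shared store), and the choice to let an alert through once exactly 5 minutes have elapsed are assumptions, not the agreed design.

```python
from typing import Dict, Tuple


class AlertThrottle:
    """Fire at most one alert per (room_id, signal_type) per cooldown window."""

    def __init__(self, cooldown_seconds: float = 300.0) -> None:
        self._cooldown = cooldown_seconds              # 5-min default
        self._last_sent: Dict[Tuple[str, str], float] = {}

    def should_deliver(self, room_id: str, signal_type: str,
                       now: float) -> bool:
        key = (room_id, signal_type)
        last = self._last_sent.get(key)
        if last is not None and now - last < self._cooldown:
            # Within cooldown: the caller would emit ai_agent_alert_suppressed.
            return False
        self._last_sent[key] = now  # cooldown restarts only on delivery
        return True


# A looping failure on one room, with timestamps in seconds.
throttle = AlertThrottle()
first = throttle.should_deliver("room-1", "low_confidence", now=0.0)
repeat = throttle.should_deliver("room-1", "low_confidence", now=100.0)
other_signal = throttle.should_deliver("room-1", "message_limit", now=100.0)
after_cooldown = throttle.should_deliver("room-1", "low_confidence", now=300.0)
```

Keying on both room and signal type is what lets a second, distinct problem in the same room still reach the supervisor while the repeated one stays quiet. The signal type names in the usage lines are placeholders.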
14b — Alternatives Rejected
| Alternative | Why Rejected | Date |
|---|---|---|
| Emit synchronously inside the reply path | Risks adding latency to or breaking the customer-facing response; unacceptable for a monitoring feature | 2026-06-28 |
| Emit only from the central chatbot_ai_deduction_worker (single chokepoint) | Async + billing-coupled; loses real-time value and conflates monitoring with billing telemetry. Kept as an audit backstop, not the primary path | 2026-06-28 |
| Alert the currently-assigned agent instead of a supervisor | No assignee on failed handovers; not true supervision | 2026-06-28 |
| Alert on every occurrence (no throttle) | A repeated failure becomes a notification storm; supervisors would mute the channel | 2026-06-28 |
| Build a dedicated AI-alerting service | Duplicates notification-service storage + FCM for the same outcome; far more infra | 2026-06-28 |
15. Open Questions
| # | Type | Question | Owner | Deadline |
|---|---|---|---|---|
| 1 | Open Question | How is "supervisor" defined and resolved to sso_id[] per org — existing role/permission, a new org setting, or an explicitly configured list? Determines the MON-S01 resolution logic. | Dimas + Chatbot BE | 2026-07-11 |
| 2 | Open Question | What is the exact assign_reason allow/deny list that separates "unexpected" from "expected" handovers (MON-S03)? And how do we treat an unknown/new reason — default alert or default silent? | Dimas + Chatbot BE | 2026-07-11 |
| 3 | Assumption | The numeric confidence value is available (or can be cheaply surfaced) at the BE emit point for MON-S05; default confidence_floor to be set with Data/ML. If false, MON-S05 slips to a fast-follow. | Dimas + Data/ML | 2026-07-18 |
| 4 | Risk | Alert noise — too-broad triggers cause supervisors to mute the channel. Mitigation: per-room throttle + dedup (MON-S01), abnormal-only handover (MON-S03), noise-ratio metric + threshold-review cadence (S10.1), and feature flag to disable per org. | Dimas | 2026-07-18 |
| 5 | Risk | Push not delivered — supervisors lack user_source="chat" FCM tokens, so alerts store but never push. Mitigation: confirm token coverage in Stage 1 (S13 blocking dep); fall back to in-app notification list until tokens exist. | Dimas + Broadcast squad | 2026-07-18 |
| 6 | Assumption | notification-service 7-day chat retention is acceptable for live alerts because the durable record lives in the AI Activity Log datamart. | Dimas | 2026-07-11 |
| 7 | Open Question | Do supervisors also need the alert on the mobile inbox app (not just hub-chat web)? If yes, mobile FCM rendering of ai_agent_alert is an additional dependency. | Dimas + Chat/Mobile squad | 2026-07-18 |
Types:
Assumption·Open Question·Risk
PRD CHANGELOG
| Version | Date | By | Section | Type | Summary |
|---|---|---|---|---|---|
| 1.0 | 2026-06-28 | Claude | All | CREATED | Initial NEW PRD generated from grounding analysis of chatbot BE emit chokepoints + notification-service delivery channel |
| 1.1 | 2026-06-28 | Claude | S8 | MODIFIED | Added explicit alert-reversibility (CANNOT) line to all 5 stories and generated the S8 system-flow Mermaid diagram, per score-prd Layer 2.5 Q2 + diagram coverage |
| 1.2 | 2026-06-28 | Claude | S6,S8,S13,S14,S15 | MODIFIED | Corrected the "backend-only" framing: documented the hub-chat Notification Center as the supervisor-facing surface with a rendered look + field mapping (S6), added the hub-chat FE rendering as a blocking dependency (S13), added Frontend to scope_changes, updated the system flow + diagram, and logged the mobile-inbox open question (S15) |