Qontak | AI Agent | AI Agent Live Monitoring — Phase 1: Supervisor Alerts

HEADER BLOCK

Field	Value
PM	Dimas Fauzi Hidayat (Product Manager, Mekari Qontak)
PRD Version	1.2
Status	DRAFT
PRD Type	NEW
Epic	TBD — add once Epic is created
Squad	BOT — Chatbot Squad
RFC Link	N/A — pending; notification-service delivery change tracked as a TECH RFC (Broadcast squad), see Dependencies
Figma Master	N/A: backend-only for this squad; alerts render in the existing hub-chat Notification Center (the hub-chat FE change is tracked in Dependencies)
Anchor	Yes — AI Agent Live Monitoring — ANCHOR
Labels	`epic:qontak-chat` \| `module:ai-agent` \| `feature:ai-agent-monitoring`
Last Updated	2026-06-28

Status values: DRAFT → READY → BUILD → SHIPPED

HEADER BLOCK
2. One-liner + Problem
3. Target Users + Persona Context
4. Non-Goals
5. Constraints
- 5.1. Data Lifecycle
6. New Features
7. API & Webhook Behavior
8. System Flow + User Stories
- 8.1. System Flow
- 8.2. User Stories
9. Rollout
10. Observability
- 10.1. Post-Launch Monitoring Cadence
11. Success Metrics
12. Launch Plan & Stage Gates
13. Dependencies
14. Key Decisions + Alternatives Rejected
15. Open Questions
PRD CHANGELOG

2. One-liner + Problem

One-liner: Push a real-time alert to a configured organization supervisor whenever the production AI agent fails or degrades in a live customer conversation.

Problem: The chatbot backend already detects when the AI agent fails: an AI-service or v2-engine error falls back to a default answer, an unexpected handover escalates the room, a message-limit cap silently stops the agent (logged via Rollbar.info only), and low-confidence answers are recorded in telemetry. None of these signals reaches a human proactively, because no integration notifies a supervisor. CS supervisors running production agents (the 26Q2 target spans agents across 15 customer IDs on Plus/Ultimate/360 tiers) therefore discover a bad agent moment only after a customer complains or during after-the-fact review. The cost is unrecovered conversations, slow human takeover, and eroded trust in letting the agent run autonomously, which is the risk this phase exists to reduce.

3. Target Users + Persona Context

Primary Persona: CS Supervisor / Team Lead

Field	Detail
Role	Supervisor or team lead who owns an organization's production AI agent and the human agents who back it up
Goal	Learn within seconds, not hours, that the AI agent has failed a live conversation, and open that specific room to take over or coach
Pain	No proactive signal exists; failures are invisible until a customer escalates or a manual review surfaces them after the fact
Workaround	Periodically eyeball the inbox, wait for customer complaints, or scrub the AI Activity Log / Rollbar after conversations have ended

(See Constraints for plan availability and feature flag scope.)

Secondary Persona: Chatbot Specialist (Mekari)

Field	Detail
Role	Mekari-side specialist who configures and tunes customer AI agents
Goal	Receive the same failure signals to spot patterns (which signal, how often, which agent) and prioritise tuning
Pain	Failure signals are scattered across logs and the datamart with no real-time, per-agent view
Workaround	Manual datamart queries and Rollbar spelunking, well after the fact

4. Non-Goals

No new alerting infrastructure — Phase 1 delivers exclusively through the existing notification-service chat channel. We do not build a dedicated alerting service, queue, or datastore.
No monitoring console / dashboard — the at-a-glance "agent health" board is Phase 2. Phase 1 is push-only.
No automated containment — alerts inform a human; the system does not pause the agent, force a handoff, or take any corrective action automatically. That is explored in Phase 3.
No alerting on normal agent behavior — a successful resolve, or a deliberate, expected handover (e.g. EVALUATE_ANSWER), is not a failure and must not alert.
No changes to the notification-service repo by this squad — the new ai_agent_alert category, FCM whitelist line, and chat-push delivery are owned by the Broadcast squad and tracked as a dependency (Section 13).
No end-customer-facing notifications — alerts go to internal supervisors/specialists only, never to the conversation's customer.
No per-agent custom alert rules in this phase — the trigger set and thresholds are global defaults; per-org configurability beyond on/off and supervisor targeting is deferred.
No retroactive alerting — only events occurring after the feature flag is enabled for an org generate alerts; we do not backfill historical failures.

5. Constraints

Field	Value
Platform	Backend-only (chatbot Rails BE). Alert is consumed on whatever surfaces notification-service already supports — in-app notification + FCM push to the supervisor's chat client. No new screen.
Performance	Detection and emit must run out-of-band of the customer reply path: the alert must never add latency to, or block, the agent's response to the customer. Emit is fire-and-forget (async worker). Target: the alert reaches notification-service within 30 s of the triggering event.
Throttle / dedup	At most one alert per room per signal-type per cooldown window (default 5 minutes), deduped by `event_id`. A looping failure on one room must not flood the supervisor.
Plan scope	Production AI-agent tiers only — Plus / Ultimate / 360 (the tiers already entitled for the autonomous agent). Not Starter/Free.
Feature flag	`ai_agent_monitoring_alerts` \| default: OFF. Enabled per organization.
Read/write	Only the chatbot BE (service-to-service, via the notification-service internal API key) writes alerts. Supervisors read alerts; no one edits them. Customers never see them.
Dependency gate	Chat-origin push for the `ai_agent_alert` category must be live in notification-service (Section 13) before an org can be moved past internal testing.
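
The plan-scope and feature-flag rows combine into one eligibility check that a detector hook could run before doing any work. A minimal sketch in Python (for illustration only; the BE is Rails), assuming a hypothetical `Org` record whose `plan` and `flags` fields are not taken from the codebase:

```python
from dataclasses import dataclass, field

# Tiers entitled for the autonomous agent, per the Plan scope row.
ELIGIBLE_PLANS = {"plus", "ultimate", "360"}
FLAG_NAME = "ai_agent_monitoring_alerts"


@dataclass
class Org:
    """Hypothetical org record; field names are illustrative only."""
    org_id: str
    plan: str
    flags: set = field(default_factory=set)  # feature flags enabled for the org


def alerts_enabled(org: Org) -> bool:
    """True only when the org is on an eligible tier AND the flag is on."""
    return org.plan.lower() in ELIGIBLE_PLANS and FLAG_NAME in org.flags
```

With the flag absent the function returns False, which matches the default-OFF row: detection hooks stay no-ops until the org is explicitly enabled.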

5.1. Data Lifecycle

Phase 1 introduces three persisted artifacts outside the main conversation model: the throttle/dedup state, the delivered alert notifications, and the alert event mirrored to the AI Activity Log datamart.

Artifact Type	Retention Period	Cleanup Trigger	User-Visible Effect
Throttle/dedup key (per room + signal-type)	Cooldown window only (default 5 min)	TTL expiry on the key store (e.g. Redis)	None — internal only
Alert notification record (in notification-service)	Per notification-service chat retention (currently 7 days, `notif_type=3`)	notification-service retention policy (query-time pruning)	Alert disappears from the supervisor's notification list after the retention window
Alert event in AI Activity Log datamart (system of record)	Per datamart retention (durable)	Datamart lifecycle	None directly — powers KPI reporting and Phase-2 console

Note: the notification-service 7-day window is fine for live alerting; the durable record of every failure event lives in the AI Activity Log datamart, not in notification-service.
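
The throttle key's lifecycle (set on first alert, gone after the cooldown) can be illustrated with an in-memory stand-in for the key store. This is a sketch under stated assumptions: the class, the key format, and the injectable clock are hypothetical, and a real deployment would rely on the key store's own TTL expiry rather than this code.

```python
import time
from typing import Callable, Dict


class TtlKeyStore:
    """In-memory stand-in for a TTL key store such as Redis (illustrative)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        self._expires_at: Dict[str, float] = {}

    def set_if_absent(self, key: str, ttl_seconds: float) -> bool:
        """Set the key unless a live one exists. Returns True if it was set."""
        now = self._clock()
        expiry = self._expires_at.get(key)
        if expiry is not None and expiry > now:
            return False  # live key: still inside the cooldown window
        self._expires_at[key] = now + ttl_seconds
        return True


def throttle_key(room_id: str, signal_type: str) -> str:
    # Key format is an assumption; the PRD only fixes the (room, signal) scope.
    return f"ai_agent_alert:{room_id}:{signal_type}"
```

Because the key carries its own expiry, no cleanup job is needed: once the cooldown elapses, the next alert for the same room and signal type sets a fresh key.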

6. New Features

No net-new screen is built, but there is a user-facing surface change: the supervisor sees the alert in the existing hub-chat (Qontak Omnichannel) Notification Center (the navbar bell) and as an FCM push. hub-chat already wires FCM (app.vue → useFirebase/initFirebase) and renders a category-aware notification list (layouts/composables/useNotificationCenter.ts, common/types/NotificationCenterTypes.ts). The change is an incremental addition to that existing component rather than a new page, but it still requires FE work: hub-chat must register and render the new ai_agent_alert category (see Dependencies).

Field	Detail
Surface	hub-chat Notification Center (navbar bell) + FCM push banner. URL: the supervisor's existing Omnichannel inbox (`/inbox`); no new route.
Access	Users with the configured supervisor role for the org, logged into hub-chat (web). Mobile inbox is an open question (Section 15).

What the supervisor sees — Notification Center list item (success state):

┌─ Notifications ───────────────────────────────────┐
│ 🔴  AI Agent Alert                      · 2m ago   │   ← unread dot · notif_category_label · created_at (relative)
│     AI Agent failed — engine error                 │   ← title
│     Couldn't answer in "Order #1234" — fell back   │   ← description (truncated)
│     to default reply                               │
│     ↳ Open room →                                  │   ← click_action_url deep-link to the failing room
└────────────────────────────────────────────────────┘

FCM push banner (device):

● Qontak Omnichannel                         now
  AI Agent failed — engine error
  Couldn't answer in "Order #1234" room

Field → UI mapping (drives the render from the notification-service payload):

Notification field	Renders as
`notif_category` = `ai_agent_alert`	Selects the icon + needs a registered type entry in hub-chat (the `notif_category` → type lookup in `TheNavbar.vue`/`OneNavbar.vue`)
`notif_category_label`	The category chip text ("AI Agent Alert")
`title`	Bold first line (the failure summary)
`description`	Secondary line (room + what happened), truncated
`created_at` / `read_at`	Relative timestamp + unread dot
`click_action` = `OPEN_URL`, `click_action_url` + `extra.room_id`	Click opens the failing room in the inbox
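
The mapping table above can be read as a small transform from the notification payload to the fields a list item renders. The sketch below is written in Python purely for illustration (hub-chat itself is a Vue/TypeScript app); the `AlertListItem` type, the 80-character truncation, and the relative-time format are assumptions, not part of the rendering contract.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class AlertListItem:
    label: str
    title: str
    description: str
    unread: bool
    age: str
    target_url: Optional[str]


def relative_age(created_at: datetime, now: datetime) -> str:
    """Coarse relative timestamp such as '2m ago' (format is an assumption)."""
    seconds = max(0, int((now - created_at).total_seconds()))
    if seconds < 60:
        return "now"
    if seconds < 3600:
        return f"{seconds // 60}m ago"
    if seconds < 86400:
        return f"{seconds // 3600}h ago"
    return f"{seconds // 86400}d ago"


def to_list_item(n: dict, now: datetime, max_len: int = 80) -> AlertListItem:
    """Map a notification payload to the fields the list item renders."""
    desc = n.get("description", "")
    if len(desc) > max_len:
        desc = desc[: max_len - 1] + "…"  # truncate the secondary line
    opens_url = n.get("click_action") == "OPEN_URL"
    return AlertListItem(
        label=n.get("notif_category_label", ""),
        title=n.get("title", ""),
        description=desc,
        unread=n.get("read_at") is None,  # unread dot while read_at is unset
        age=relative_age(n["created_at"], now),
        target_url=n.get("click_action_url") if opens_url else None,
    )
```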

UI States (reuse the existing Notification Center states; the new category only adds an item type):

State	Description
Empty	No AI agent alerts → nothing extra shown; the existing center's empty/zero state applies.
Loading	Existing Notification Center skeleton while fetching.
Error	Existing center error/retry; a push that fails to render falls back to the list on next fetch.
Success	The alert item renders as mocked above; click deep-links to the room.

Figma: N/A — reuses the existing Pixel3 notification-list component; this spec is the rendering contract the hub-chat (Inbox) squad builds to. A frame can be added if the squad wants a visual sign-off.

📊 UI State Diagram — AI Agent Alert (Notification Center item)

stateDiagram-v2
    [*] --> Delivered: notification-service push received
    Delivered --> Unread: shown in bell (red dot) + push banner
    Unread --> Read: supervisor opens the Notification Center
    Read --> RoomOpened: taps "Open room" (click_action_url + extra.room_id)
    Unread --> Expired: 7-day chat retention reached
    Read --> Expired: 7-day chat retention reached
    RoomOpened --> [*]: supervisor takes over / coaches
    Expired --> [*]: pruned from list (durable record stays in Activity Log)
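
The diagram can also be expressed as a transition table, which makes the allowed edges easy to test. A minimal sketch, assuming hypothetical event names (`shown`, `open_center`, `open_room`, `retention_elapsed`) that do not appear in the payload or in hub-chat:

```python
# Allowed transitions, transcribed from the state diagram.
TRANSITIONS = {
    ("Delivered", "shown"): "Unread",
    ("Unread", "open_center"): "Read",
    ("Read", "open_room"): "RoomOpened",
    ("Unread", "retention_elapsed"): "Expired",
    ("Read", "retention_elapsed"): "Expired",
}
TERMINAL = {"RoomOpened", "Expired"}


def next_state(state: str, event: str) -> str:
    """Return the next state, or raise if the diagram has no such edge."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state!r} on {event!r}") from None
```

As drawn, the diagram has no direct edge from Unread to RoomOpened, so this table rejects it. Whether a tap on the push banner should skip the Read state is not specified here.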

7. API & Webhook Behavior

PM describes behavior in plain language; Engineering resolves HTTP methods, schemas, and error codes in the RFC. The outbound contract is to notification-service's existing internal endpoint.

#	Behavior	Entity Affected	Triggered By	Expected Behavior	Failure Behavior
1	Resolve alert recipient(s)	Supervisor mapping (org → sso_id[])	A failure signal is raised for a room in org X	• Look up the configured supervisor(s) for org X (open question: existing role vs. new org setting — see S15) • Return one or more `sso_id`s to target • Cache the lookup for the cooldown window	• If no supervisor is configured for the org: drop the alert, increment `ai_agent_alert_dropped` with reason `no_supervisor`, do not error the conversation • If lookup times out: skip alert, log; never block the customer reply
2	Throttle & dedup the alert	In-memory/Redis throttle key (`room_id` + `signal_type`)	An alert is about to be emitted	• If no live key exists: proceed and set the key with TTL = cooldown (default 5 min) • Attach a stable `event_id` for idempotency	• If a live key exists (within cooldown): suppress this alert, increment `ai_agent_alert_suppressed` with the signal type • If the key store is unavailable: fail open (emit) rather than fail closed — a missed dedup is better than a missed failure alert
3	Emit the alert to notification-service	A chat-origin notification (`notif_type=3`, `notif_category=ai_agent_alert`)	Recipient resolved + not throttled	• POST to notification-service `POST /api/v1/notifications/chat` with `X-Api-Key` internal auth, targeting the supervisor `sso_id` • Title/description summarise the signal (e.g. "AI Agent failed — engine error"), `click_action=OPEN_URL`, `click_action_url` deep-links the room • Telemetry travels in the `extra` JSONB envelope: `room_id`, `organization_id`, `conversation_id`, `signal_type`, `failed_reason`/`assign_reason`, `confidence` (when present), `agent_id` • Runs in an async worker (Sidekiq), out-of-band of the reply path • Mirror the event to the AI Activity Log datamart (system of record)	• Non-2xx from notification-service: retry with backoff (bounded); on final failure increment `ai_agent_alert_delivery_failed` and log — never raise into the conversation flow • If the dependency (category/whitelist) is not yet live: alert is created but no push is delivered — covered by the Section 13 blocking gate

Engineering resolves during RFC: exact request/response schema, retry/backoff policy, async worker boundary, and whether HTTP or the qontak_chat.public.notification_worker Kafka topic is used for emit (both are supported by notification-service).
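
Row 3 can be sketched as a payload builder plus a bounded-retry emitter. The exact request schema is left to the RFC, so the body layout below (including where `event_id` sits) is an assumption, as are the retry count and delays. The `post` callable is a hypothetical transport that returns an HTTP status code, so the same function works for an HTTP client or a test double.

```python
import time
import uuid
from typing import Callable, Optional


def build_alert(sso_id: str, signal: dict, room_url: str,
                event_id: Optional[str] = None) -> dict:
    """Assemble the chat-origin notification body described in row 3."""
    extra_keys = ("room_id", "organization_id", "conversation_id", "signal_type",
                  "failed_reason", "assign_reason", "confidence", "agent_id")
    return {
        "sso_id": sso_id,
        "notif_type": 3,
        "notif_category": "ai_agent_alert",
        "title": signal["title"],
        "description": signal["description"],
        "click_action": "OPEN_URL",
        "click_action_url": room_url,
        "event_id": event_id or str(uuid.uuid4()),
        # Only telemetry that is actually present goes into the envelope.
        "extra": {k: signal[k] for k in extra_keys if signal.get(k) is not None},
    }


def emit_with_retry(post: Callable[[dict], int], body: dict, attempts: int = 3,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> bool:
    """POST with bounded exponential backoff; never raises to the caller."""
    for attempt in range(attempts):
        try:
            status = post(body)
            if 200 <= status < 300:
                return True
        except Exception:
            pass  # treat transport errors like a non-2xx response
        if attempt < attempts - 1:
            sleep(base_delay * (2 ** attempt))
    return False  # caller increments ai_agent_alert_delivery_failed
```

Returning a boolean instead of raising keeps the failure inside the worker, in line with the requirement that delivery errors never reach the conversation flow.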

8. System Flow + User Stories

8.1. System Flow

Flow: AI Agent Failure → Supervisor Alert
Type: API Sequence

A customer message reaches the chatbot BE and the AI agent attempts a response (get_answer / send_message_with_resolve).
The agent pipeline hits one of the monitored conditions: AI-service non-200 (get_answer.rb:44), v2-engine failure (send_message_with_resolve.rb:1923), unexpected handover (assign_agent==true with an abnormal assign_reason, get_answer.rb:56), message-limit cap (send_message_with_resolve.rb:~1812), or a low-confidence answer (confidence below floor).
The agent's normal fallback behavior runs unchanged (default answer / handover / cap) — the customer path is not altered or delayed.
The detector classifies the event into a signal_type and enqueues an alert job (async, Sidekiq), passing room_id, organization_id, conversation_id, signal context, and confidence when available.
The alert worker resolves the org's configured supervisor(s) → sso_id[].
Decision: no supervisor configured → drop, count no_supervisor, stop.
Decision: a throttle key for this room_id+signal_type is live (within cooldown) → suppress, count suppressed, stop. Otherwise set the key (TTL = cooldown) and continue.
The worker POSTs the alert to notification-service /api/v1/notifications/chat (X-Api-Key, notif_type=3, notif_category=ai_agent_alert), telemetry in extra, deep-link in click_action_url.
notification-service stores the notification and (category whitelisted) pushes via FCM to the supervisor's chat device.
hub-chat (Omnichannel web) renders the alert in the navbar Notification Center bell and shows the FCM push banner, using the registered ai_agent_alert category for the icon/label.
The worker mirrors the event to the AI Activity Log datamart (system of record) for KPI + Phase-2 console.
Failure branch: notification-service returns non-2xx → bounded retry/backoff; on final failure count delivery_failed and log — never raise into the conversation.
The supervisor sees the alert in the bell / push, taps the deep-link, and opens the failing room to take over or coach.

📊 System Flow — Supervisor Alerts

sequenceDiagram
    participant Cust as Customer
    participant BE as Chatbot BE (agent pipeline)
    participant W as Alert Worker (Sidekiq)
    participant Res as Supervisor Resolver
    participant Thr as Throttle Store (Redis)
    participant NS as notification-service
    participant HC as hub-chat (Notification Center)
    participant ADL as AI Activity Log
    participant Spv as Supervisor

    Cust->>BE: Customer message
    BE-->>Cust: Reply / fallback (unchanged — no added latency)
    Note over BE: Failure detected (service · engine · handover · limit · low-confidence)
    BE->>W: Enqueue alert (room_id, org_id, signal_type, context)
    W->>Res: Resolve supervisor(s) for org
    alt No supervisor configured
        Res-->>W: none
        W-->>W: Drop · count no_supervisor
    else Supervisor resolved
        Res-->>W: sso_id[]
        W->>Thr: Check room + signal-type key
        alt Within cooldown
            Thr-->>W: key live
            W-->>W: Suppress · count suppressed
        else Not throttled
            Thr-->>W: set key (TTL = cooldown)
            W->>NS: POST /api/v1/notifications/chat (X-Api-Key · ai_agent_alert · extra)
            alt Delivery non-2xx
                NS-->>W: error
                W-->>W: Retry/backoff · on final fail count delivery_failed
            else Delivered
                NS->>HC: FCM push + stored notification (ai_agent_alert)
                HC->>Spv: Render in bell + push banner
                W->>ADL: Mirror event (system of record)
                Spv->>BE: Open room (deep-link) · take over / coach
            end
        end
    end
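
The worker-side branches of the flow (resolve, throttle, emit, mirror) can be sketched as one function with its collaborators injected. All four callables and the module-level `counters` object are hypothetical stand-ins. Mirroring only after at least one successful delivery follows the sequence diagram and is an assumption; the RFC may instead mirror every raised signal.

```python
from collections import Counter
from typing import Callable, List

counters: Counter = Counter()  # stand-in for the metrics client


def run_alert_pipeline(
    signal: dict,
    resolve_supervisors: Callable[[str], List[str]],
    acquire_throttle: Callable[[str, str], bool],
    emit: Callable[[str, dict], bool],
    mirror: Callable[[dict], None],
) -> str:
    """Run the worker-side steps for one signal and return the outcome label."""
    sso_ids = resolve_supervisors(signal["organization_id"])
    if not sso_ids:
        counters["ai_agent_alert_dropped"] += 1  # reason: no_supervisor
        return "dropped"
    if not acquire_throttle(signal["room_id"], signal["signal_type"]):
        counters["ai_agent_alert_suppressed"] += 1
        return "suppressed"
    # One emit per configured supervisor.
    delivered = [emit(sso_id, signal) for sso_id in sso_ids]
    counters["ai_agent_alert_delivered"] += delivered.count(True)
    counters["ai_agent_alert_delivery_failed"] += delivered.count(False)
    if any(delivered):
        mirror(signal)  # AI Activity Log datamart (system of record)
    return "delivered" if any(delivered) else "delivery_failed"
```

The recipient check runs before the throttle check, as in the diagram, so an org with no supervisor never consumes a throttle key.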

8.2. User Stories

[MON-S01] — Supervisor alert delivery pipeline (resolve · throttle · emit)


User Story	As a CS Supervisor, I want a single reliable pipeline that resolves who to notify, suppresses repeats, and delivers the alert, so that every genuine agent failure reaches me once, fast, without flooding me.
Before State	None — the chatbot BE has no integration with notification-service and no concept of a "supervisor alert". Failures end at a default answer or a Rollbar log.
After Delta	A shared async pipeline (Sidekiq worker) takes a raised failure signal, resolves the org supervisor(s), applies per-room/per-signal throttle + dedup, POSTs the alert to notification-service chat, and mirrors the event to the Activity Log datamart — out-of-band of the customer reply path.
Importance	Must Have
Mockup / Technical Notes	Figma: N/A — backend-only. Data Fields (alert payload): • `sso_id` (uuid, required) — resolved supervisor target • `room_id` (string, required) — source: room state • `organization_id` (uuid, required) — source: room/org context • `conversation_id` (string, required) — source: conversation context • `signal_type` (enum, required) — `service_failure` / `engine_failure` / `unexpected_handover` / `message_limit` / `low_confidence` • `event_id` (uuid, required) — idempotency/dedup key • `extra` (json, required) — telemetry envelope (reason, confidence, agent_id) Technical Notes: Delivery via notification-service `POST /api/v1/notifications/chat` (`X-Api-Key`, `notif_type=3`, `notif_category=ai_agent_alert`). Throttle key store = Redis with TTL = cooldown. Pipeline is invoked by all detector stories (S02–S05).
Acceptance Criteria	— Happy Path — • AC-1: Given a failure signal is raised for a room whose org has a configured supervisor, when the pipeline runs, then exactly one alert is POSTed to notification-service targeting the supervisor `sso_id` with the correct `signal_type` and `extra` telemetry. • AC-2: Given the same room raises the same `signal_type` again within the cooldown window, when the pipeline runs, then the second alert is suppressed and `ai_agent_alert_suppressed` is incremented. • AC-3 (volume/boundary): Given a room loops a failure many times within the cooldown, when the pipeline runs repeatedly, then at most one alert is delivered for that room+signal in the window. • AC-4: Given an org configures more than one supervisor, when an alert is emitted, then each configured supervisor `sso_id` receives the alert (deduped per recipient). — Error / Unhappy Path — • ERR-1: Given the org has no configured supervisor, when the pipeline runs, then no alert is sent, `ai_agent_alert_dropped` is incremented with reason `no_supervisor`, and the conversation flow is unaffected. • ERR-2: Given notification-service returns a non-2xx, when the pipeline emits, then it retries with bounded backoff and on final failure increments `ai_agent_alert_delivery_failed` and logs — without raising into the conversation. • ERR-3: Given the throttle key store is unavailable, when the pipeline runs, then it fails open (emits the alert) rather than dropping a genuine failure. — Permission Model — • CAN: chatbot BE (service-to-service via internal API key) emits; configured supervisor(s) receive. • CANNOT: end customers (never recipients); agents without supervisor configuration. • CANNOT (reversibility): an alert cannot be recalled or edited once delivered; it is fire-once and auto-expires per notification-service retention (7-day chat window). The durable record persists in the AI Activity Log datamart. 
• Unauthorized: if the internal API key is rejected by notification-service, emit fails closed to the supervisor (no alert) and is counted as `delivery_failed`. — UI States — • Loading: N/A (backend). • Empty: no supervisor configured → no alert (see ERR-1). • Error: delivery failure logged/counted, invisible to customer. • Success: alert appears in supervisor's notification surface + FCM push. — Negative Scenarios — (from Non-Goals) • NEG-1: Given the agent responds successfully (no failure signal), when the pipeline is not invoked, then no alert is generated. • NEG-2: Given a customer is in the room, when an alert is emitted, then the customer never receives it (internal-only).

Dependencies: notification-service ai_agent_alert category + chat push (Section 13).

🧪 Test Coverage Matrix — [MON-S01]

Dimension	Coverage	Notes
Boundary values	✅ defined	AC-3 covers the loop/flood bound; AC-4 covers multi-supervisor
State transitions	✅ defined	AC-1/AC-2 cover first-fire vs. within-cooldown
Data validation	⚠️ TBD	⚠️ QA: malformed/missing `organization_id` or `room_id` in the raised signal
Concurrency	⚠️ partial	AC-2 covers sequential repeats; ⚠️ QA: two failures on the same room+signal racing the throttle key simultaneously
Network/timeout	✅ defined	ERR-2 (delivery retry/backoff) + ERR-3 (key store unavailable, fail-open)
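
ERR-3 and the concurrency gap noted in the matrix both come down to how the throttle key is acquired. A minimal sketch, assuming a hypothetical `set_if_absent` callable that wraps an atomic set-if-not-exists with expiry (for Redis, `SET` with the `NX` and `EX` options); the key format is likewise an assumption:

```python
import logging
from typing import Callable

log = logging.getLogger("ai_agent_alert")


def should_emit(set_if_absent: Callable[[str, int], bool], room_id: str,
                signal_type: str, cooldown_seconds: int = 300) -> bool:
    """Decide whether to emit, failing open when the key store errors.

    `set_if_absent` is expected to be atomic, so two racing workers cannot
    both observe "no live key" for the same room and signal type.
    """
    key = f"ai_agent_alert:{room_id}:{signal_type}"  # key format is assumed
    try:
        return set_if_absent(key, cooldown_seconds)
    except Exception:
        # ERR-3: a missed dedup is preferred over a missed failure alert.
        log.warning("throttle store unavailable; failing open", exc_info=True)
        return True
```

Doing the check and the set in one atomic call, rather than a read followed by a write, is what closes the race between two simultaneous failures on the same room and signal.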

[MON-S02] — Detect AI-service / v2-engine failure and raise an alert


User Story	As a CS Supervisor, I want to be alerted when the AI service or v2 engine fails to answer, so that I can step into a room where the bot has fallen back to a default answer.
Before State	On `get_answer.rb:44` (AI-service non-200) the BE assigns a fallback default answer; on `send_message_with_resolve.rb:1923` (v2 `result.code != 200`) it runs `_execute_ai_assist_fallback`. Both are invisible to a human.
After Delta	Each of these failure branches additionally raises a `service_failure` / `engine_failure` signal into the S01 pipeline, with the failure reason captured in telemetry. The existing fallback behavior is unchanged.
Importance	Must Have
Mockup / Technical Notes	Figma: N/A. Data Fields: • `signal_type` (enum, required) — `service_failure` or `engine_failure` • `failed_reason` (string, required) — source: existing `failed_reason` ("failed to access ai service" / "Fail to get answer from AI Agent") • `room_id`, `organization_id`, `conversation_id` (required) — source: room/conversation context Technical Notes: Emit hook at `app/core/repositories/ai_service/get_answer.rb:44` and `app/core/use_cases/system/hub/send_message_with_resolve.rb:1923`. The `DeductionRequest` (`is_failed`/`failed_reason`) at `chatbot_ai_deduction_worker.rb` is the async audit backstop.
Acceptance Criteria	— Happy Path — • AC-1: Given the AI service returns a non-200, when the BE applies its default-answer fallback, then a `service_failure` signal is raised into the pipeline with `failed_reason` populated. • AC-2: Given the v2 engine returns `result.code != 200`, when `_execute_ai_assist_fallback` runs, then an `engine_failure` signal is raised with `failed_reason` populated. • AC-3: Given a failure is detected, when the signal is raised, then the customer still receives the existing fallback answer with no added latency. — Error / Unhappy Path — • ERR-1: Given signal raising itself errors (e.g. enqueue fails), when the failure branch runs, then the customer-facing fallback still completes and the enqueue error is logged — detection never degrades the reply path. — Permission Model — • CAN: BE failure branches raise the signal automatically. • CANNOT: no manual/user trigger for this signal. • CANNOT (reversibility): the resulting alert cannot be recalled or edited once delivered; fire-once, auto-expires per notification-service retention. • Unauthorized: N/A (system-raised). — UI States — • Loading/Empty/Error/Success: N/A (backend); delivery states handled by S01. — Negative Scenarios — (from Non-Goals) • NEG-1: Given the AI service returns 200 and the agent answers normally, when no failure branch runs, then no `service_failure`/`engine_failure` signal is raised.

Dependencies: [MON-S01].

🧪 Test Coverage Matrix — [MON-S02]

Dimension	Coverage	Notes
Boundary values	✅ defined	AC-1/AC-2 cover both distinct failure sources
State transitions	✅ defined	AC-3: failure → fallback delivered, signal raised in parallel
Data validation	⚠️ TBD	⚠️ QA: empty/unexpected `failed_reason` string still produces a usable alert
Concurrency	⚠️ TBD	⚠️ QA: service failure followed immediately by a retry success on the same room
Network/timeout	✅ defined	ERR-1 (enqueue failure never blocks reply path)
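
AC-3 and ERR-1 together say the detection hook must be unable to affect the reply path. One way to sketch that guarantee, with a hypothetical `enqueue` callable standing in for the async job enqueue and simplified return labels in place of the real fallback code:

```python
import logging
from typing import Callable

log = logging.getLogger("ai_agent_alert")


def raise_signal_safely(enqueue: Callable[[dict], None], signal: dict) -> bool:
    """Enqueue the alert job; swallow and log any error (ERR-1)."""
    try:
        enqueue(signal)
        return True
    except Exception:
        log.exception("alert enqueue failed; reply path continues")
        return False


def handle_ai_service_response(status_code: int, room: dict,
                               enqueue: Callable[[dict], None]) -> str:
    """Return the customer-facing action; the fallback itself is unchanged."""
    if status_code == 200:
        return "answer"  # NEG-1: no signal on a normal answer
    raise_signal_safely(enqueue, {
        "signal_type": "service_failure",
        "failed_reason": "failed to access ai service",
        **room,
    })
    return "default_answer"
```

The fallback label is returned whether or not the enqueue succeeds, so a broken queue degrades alerting only.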

[MON-S03] — Detect unexpected handover/escalation and raise an alert


User Story	As a CS Supervisor, I want to be alerted when the agent hands a conversation off unexpectedly, so that I can pick up an escalation that the bot couldn't resolve rather than letting it sit unassigned.
Before State	At `get_answer.rb:56`, `assign_agent==true` triggers `_assign_agent`. Handovers happen for different reasons (`TRANSFER_CONDITION`, `EVALUATE_ANSWER`); none notify a supervisor, and a normal "evaluate answer" handoff is expected behavior.
After Delta	An abnormal handover (e.g. `TRANSFER_CONDITION` where the agent could not satisfy a transfer condition) raises an `unexpected_handover` signal into the pipeline; expected handovers (`EVALUATE_ANSWER`) do not alert. `assign_reason` is captured in telemetry.
Importance	Must Have
Mockup / Technical Notes	Figma: N/A. Data Fields: • `signal_type` (enum, required) — `unexpected_handover` • `assign_reason` (string, required) — source: AI-service response `assign_reason` • `room_id`, `organization_id`, `conversation_id` (required) Technical Notes: Emit hook at `app/core/repositories/ai_service/get_answer.rb:56`. The expected-vs-unexpected boundary is defined by `assign_reason`; the exact reason allow/deny list is confirmed with the BE team (see S15).
Acceptance Criteria	— Happy Path — • AC-1: Given `assign_agent==true` with an abnormal `assign_reason` (e.g. `TRANSFER_CONDITION`), when the handover runs, then an `unexpected_handover` signal is raised with `assign_reason` in telemetry. • AC-2: Given the handover assigns the room to a human, when the alert is emitted, then the deep-link opens that specific room for the supervisor. — Error / Unhappy Path — • ERR-1: Given the handover succeeds but signal raising fails, when the branch runs, then the handover/assignment still completes and the error is logged. — Permission Model — • CAN: BE handover branch raises automatically. • CANNOT: manual trigger. • CANNOT (reversibility): the resulting alert cannot be recalled or edited once delivered; fire-once, auto-expires per notification-service retention. • Unauthorized: N/A. — UI States — • N/A (backend); delivery via S01. — Negative Scenarios — (from Non-Goals) • NEG-1: Given a normal, expected handover (`EVALUATE_ANSWER`), when it runs, then no `unexpected_handover` alert is raised (Non-Goal #4 — no alerting on normal behavior). • NEG-2: Given the agent resolves the conversation without handover, when it completes, then no handover signal is raised.

Dependencies: [MON-S01].

🧪 Test Coverage Matrix — [MON-S03]

Dimension	Coverage	Notes
Boundary values	✅ defined	AC-1 vs. NEG-1 draw the expected/unexpected boundary on `assign_reason`
State transitions	✅ defined	AC-2: handover → room assigned → deep-link targets it
Data validation	⚠️ TBD	⚠️ QA: unknown/new `assign_reason` value: default to alert or not? (tie to the Section 15 allow-list)
Concurrency	⚠️ TBD	⚠️ QA: handover + a near-simultaneous failure signal on the same room (throttle is per signal-type, so both may alert)
Network/timeout	✅ defined	ERR-1 (signal failure never blocks the assignment)
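
The expected-versus-unexpected boundary can be sketched as a lookup on `assign_reason`. Both reason sets below are placeholders built from the two examples in this story, since the real allow/deny list is still to be confirmed. The `alert_on_unknown` switch is hypothetical and exists only to make the open question about unknown reasons explicit.

```python
from typing import Optional

# Placeholder lists: the real allow/deny list is still to be confirmed with
# the BE team, so treat both sets as assumptions.
EXPECTED_REASONS = {"EVALUATE_ANSWER"}
UNEXPECTED_REASONS = {"TRANSFER_CONDITION"}


def classify_handover(assign_agent: bool, assign_reason: Optional[str],
                      alert_on_unknown: bool = False) -> Optional[str]:
    """Return 'unexpected_handover' when an alert should be raised, else None."""
    if not assign_agent:
        return None  # NEG-2: no handover, no signal
    if assign_reason in EXPECTED_REASONS:
        return None  # NEG-1: an expected handover never alerts
    if assign_reason in UNEXPECTED_REASONS:
        return "unexpected_handover"
    # Unknown or missing reason: the behaviour is an open question in this PRD.
    return "unexpected_handover" if alert_on_unknown else None
```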

[MON-S04] — Detect message-limit cap and raise an alert


User Story	As a CS Supervisor, I want to be alerted when the agent hits its message limit and stops responding, so that I know a conversation has gone silent because of a cap, not because the customer left.
Before State	At `send_message_with_resolve.rb:~1812`, exceeding `message_limit` calls `_handle_ai_agent_message_limit_reached`, which only emits `Rollbar.info` — invisible to any supervisor.
After Delta	The message-limit branch additionally raises a `message_limit` signal into the pipeline so a human can pick up the now-silent conversation.
Importance	Should Have
Mockup / Technical Notes	Figma: N/A. Data Fields: • `signal_type` (enum, required) — `message_limit` • `room_id`, `organization_id`, `conversation_id` (required) • `message_count` / `limit` (int, optional) — source: Redis counter + agent config Technical Notes: Emit hook alongside `_handle_ai_agent_message_limit_reached`. Should-Have: ships if Phase-1 capacity permits; otherwise rolls to a fast-follow.
Acceptance Criteria	— Happy Path — • AC-1: Given a room exceeds the agent `message_limit`, when the cap handler runs, then a `message_limit` signal is raised into the pipeline. • AC-2: Given the cap recurs on the same room within cooldown, when the handler runs again, then S01 throttling suppresses the repeat. — Error / Unhappy Path — • ERR-1: Given signal raising fails, when the cap handler runs, then the existing cap behavior (and Rollbar log) still completes. — Permission Model — • CAN: BE cap handler raises automatically. CANNOT: manual trigger. Unauthorized: N/A. • CANNOT (reversibility): the resulting alert cannot be recalled or edited once delivered; fire-once, auto-expires per notification-service retention. — UI States — • N/A (backend); delivery via S01. — Negative Scenarios — (from Non-Goals) • NEG-1: Given the room is below the message limit, when the agent responds, then no `message_limit` signal is raised.

Dependencies: [MON-S01].
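
The cap check reduces to one comparison plus the optional telemetry fields. In this sketch the strictly-greater comparison is inferred from the word "exceeds" in AC-1 and is an assumption; the function name and the shape of `room` are hypothetical.

```python
from typing import Optional


def message_limit_signal(message_count: int, limit: int,
                         room: dict) -> Optional[dict]:
    """Return a message_limit signal when the count exceeds the limit."""
    if message_count <= limit:
        return None  # not over the limit (strictly-greater is assumed)
    return {
        "signal_type": "message_limit",
        "message_count": message_count,  # optional telemetry fields
        "limit": limit,
        **room,
    }
```

Repeat hits on the same room are left to the shared pipeline's throttle, so this function does no deduplication of its own.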

[MON-S05] — Detect low-confidence response and raise an alert


User Story	As a CS Supervisor, I want to be alerted when the agent answers with low confidence, so that I can review a reply that the bot itself is unsure about before it costs us the customer.
Before State	The numeric confidence of a response is produced by the AI service and recorded in telemetry / the AI Activity Log datamart, but it is not evaluated against a threshold at the BE emit point and never surfaces to a human in real time.
After Delta	When a response's confidence is below a configured floor, a `low_confidence` signal is raised into the pipeline with the confidence value in telemetry. This requires the confidence value to be available at the BE emit point, which is a dependency and an open assumption (see Sections 13 and 15).
Importance	Should Have
Mockup / Technical Notes	Figma: N/A. Data Fields: • `signal_type` (enum, required) — `low_confidence` • `confidence` (float, required) — source: AI-service response / Activity Log • `confidence_floor` (float, required) — configured threshold (default TBD with Data/ML) • `room_id`, `organization_id`, `conversation_id` (required) Technical Notes: Gated on confidence being surfaced into the BE response handling (it is not today). Threshold is a global default in Phase 1 (no per-agent tuning — Non-Goal #7).
Acceptance Criteria	— Happy Path — • AC-1: Given a response with confidence below the configured floor, when the response is produced, then a `low_confidence` signal is raised with the `confidence` value in telemetry. • AC-2 (boundary): Given confidence exactly equals the floor, when the response is produced, then it is treated as not low-confidence (strictly below triggers). — Error / Unhappy Path — • ERR-1: Given the confidence value is absent/null for a response, when evaluation runs, then no `low_confidence` signal is raised and `ai_agent_alert_skipped` is incremented with reason `confidence_unavailable` (fail safe, no false alerts). — Permission Model — • CAN: BE raises automatically when confidence is present and below floor. CANNOT: manual trigger. Unauthorized: N/A. • CANNOT (reversibility): the resulting alert cannot be recalled or edited once delivered; fire-once, auto-expires per notification-service retention. — UI States — • N/A (backend); delivery via S01. — Negative Scenarios — (from Non-Goals) • NEG-1: Given confidence is at or above the floor, when the agent answers, then no `low_confidence` signal is raised.

Dependencies: [MON-S01]; confidence availability at the BE emit point (S13/S15).
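
The acceptance criteria above (AC-1, AC-2, ERR-1, NEG-1) reduce to a single comparison plus a null check. The following is a minimal sketch, not the BE implementation: the function and class names are hypothetical, and the floor of `0.4` in the usage note is a placeholder because the default `confidence_floor` is still TBD with Data/ML.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Evaluation:
    raise_signal: bool                    # True -> emit a `low_confidence` signal
    skipped_reason: Optional[str] = None  # set when evaluation could not run


def evaluate_confidence(confidence: Optional[float], confidence_floor: float) -> Evaluation:
    """Apply AC-1, AC-2, ERR-1 and NEG-1 to a single response (hypothetical helper)."""
    if confidence is None:
        # ERR-1: fail safe. The caller increments `ai_agent_alert_skipped`
        # with this reason and raises no signal.
        return Evaluation(raise_signal=False, skipped_reason="confidence_unavailable")
    # AC-2 / NEG-1: only strictly-below-floor triggers; equal to the floor does not.
    return Evaluation(raise_signal=confidence < confidence_floor)
```

With a placeholder floor of `0.4`, a confidence of `0.39` raises the signal, `0.4` does not, and `None` is skipped with reason `confidence_unavailable`.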

9. Rollout

Field	Value
Feature flag	`ai_agent_monitoring_alerts` — default: OFF (per organization; see Constraints)
Stage 1	Internal: 1–2 Mekari-owned test orgs with a Chatbot Specialist as the configured supervisor; verify each of the 4 signals end-to-end (alert received, deep-link opens room, throttle works)
Stage 2	Closed pilot: 2–3 friendly production customers (Plus/Ultimate/360) running an autonomous agent, with the dependency live; supervisor opt-in
Stage 3	Targeted rollout: the 26Q2 production autonomous agents (toward the 15 target customer IDs), enabled per org on request
GA	All eligible (Plus/Ultimate/360) orgs with a configured supervisor; flag enabled via the org setting
Backward compat	Yes — purely additive. With the flag OFF, agent behavior is byte-for-byte unchanged (detection hooks are no-ops).
Migration	None — no data migration. New artifacts (throttle keys, alert records) are created forward-only.
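
The backward-compatibility guarantee (flag OFF means detection hooks are no-ops) can be illustrated with a small wrapper. This is a sketch under stated assumptions: the per-org flag lookup is modelled as a plain dictionary, and `make_detection_hook` and `emit` are hypothetical names, not existing BE functions.

```python
from typing import Callable, Dict


def make_detection_hook(
    enabled_by_org: Dict[str, bool],
    emit: Callable[[dict], None],
) -> Callable[[dict], None]:
    """Return a hook that forwards signals only for orgs where
    `ai_agent_monitoring_alerts` is ON (hypothetical helper)."""

    def hook(signal: dict) -> None:
        # Default OFF: an org missing from the map is treated as disabled.
        if not enabled_by_org.get(signal["organization_id"], False):
            return  # no-op, nothing is emitted
        emit(signal)

    return hook
```

Because the disabled branch returns before touching `emit`, an org with the flag OFF never reaches the alert pipeline, which is the property Stage 1 needs to verify with the flag both ON and OFF.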

10. Observability

Key Events:

Event Name	Trigger	Properties
`ai_agent_failure_signal_raised`	A detector (S02–S05) raises any signal	`signal_type`, `room_id`, `organization_id`, `conversation_id`, `confidence?`, `reason?`, timestamp
`ai_agent_alert_delivered`	Alert successfully POSTed to notification-service (2xx)	`signal_type`, `organization_id`, `sso_id`, `event_id`, timestamp
`ai_agent_alert_suppressed`	Alert suppressed by throttle/dedup within cooldown	`signal_type`, `room_id`, `organization_id`, timestamp
`ai_agent_alert_dropped`	No recipient resolved	`reason` (`no_supervisor`), `organization_id`, timestamp
`ai_agent_alert_skipped`	Signal not evaluated (e.g. confidence unavailable)	`reason`, `signal_type`, `organization_id`, timestamp
`ai_agent_alert_delivery_failed`	notification-service non-2xx after retries	`signal_type`, `organization_id`, `http_status`, timestamp

Field	Detail
Dashboard owner	Chatbot squad (BOT), with the alert/telemetry stream also landing in the AI Activity Log datamart (Data)
Alert 1	`ai_agent_alert_delivery_failed` > 5% of delivery attempts (delivered + failed) in any 1h window → Slack: #chatbot-oncall
Alert 2	`ai_agent_alert_suppressed` / `ai_agent_failure_signal_raised` ratio > 50% sustained 1h (signals being throttled heavily → noisy trigger or looping failure) → Slack: #chatbot-oncall for threshold review
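
The two on-call alerts are ratios over the key-event counts. The sketch below evaluates both for one 1h window, assuming the counts have already been aggregated per window. It treats "attempts" as delivered plus failed, which matches the delivery-success definition in Success Metrics but is an assumption here, and it does not model the "sustained" condition on Alert 2. The function name and the returned labels are hypothetical.

```python
from typing import Dict, List

FAILURE_RATE_LIMIT = 0.05       # Alert 1: more than 5% of attempts failed
SUPPRESSION_RATIO_LIMIT = 0.50  # Alert 2: more than 50% of raised signals suppressed


def firing_alerts(counts: Dict[str, int]) -> List[str]:
    """Evaluate both on-call alerts for one 1h window of event counts."""
    delivered = counts.get("ai_agent_alert_delivered", 0)
    failed = counts.get("ai_agent_alert_delivery_failed", 0)
    suppressed = counts.get("ai_agent_alert_suppressed", 0)
    raised = counts.get("ai_agent_failure_signal_raised", 0)

    firing = []
    attempts = delivered + failed  # assumption: attempts = delivered + failed
    # Guard against empty windows so a quiet hour never divides by zero.
    if attempts and failed / attempts > FAILURE_RATE_LIMIT:
        firing.append("alert_1_delivery_failed")
    if raised and suppressed / raised > SUPPRESSION_RATIO_LIMIT:
        firing.append("alert_2_suppression_ratio")
    return firing
```

Both comparisons are strict, so a window sitting exactly on a threshold does not page the on-call channel.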

10.1. Post-Launch Monitoring Cadence

Field	Detail
Review cadence	Weekly for the first 4 weeks post-GA, then monthly
Owner	Dimas (PM) + Chatbot squad
Review scope	`ai_agent_failure_signal_raised`, `ai_agent_alert_delivered`, `ai_agent_alert_suppressed`, `ai_agent_alert_delivery_failed`, and the ⭐ supervisor time-to-intervene KPI
Trigger threshold 1	`ai_agent_alert_delivery_failed` > 5% of attempts in a week → investigate delivery/dependency health immediately
Trigger threshold 2	Suppressed/raised ratio > 50% for 2 consecutive weeks → revisit cooldown window and per-signal thresholds (noise risk, S15 Risk #4)
Rollback consideration	If delivery failures or alert-storm complaints cannot be resolved within 48h, PM disables `ai_agent_monitoring_alerts` for the affected org(s) pending root cause.

11. Success Metrics

Efficiency & Impact:

Metric	Definition	Baseline	Target
⭐ Supervisor time-to-intervene	Median elapsed time from `ai_agent_failure_signal_raised` to a human opening/acting on that room	N/A — unmeasurable today (no alert exists); establish baseline in Stage 1–2	Median ≤ 5 minutes within 90 days of GA
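
As an illustration of the KPI definition, the median can be computed from pairs of timestamps: when the signal was raised and when a human first opened or acted on the room. This is a sketch only. How "human opened/acted" is captured is not specified in this PRD, so the input shape and the function name are assumptions.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import Iterable, Tuple

TARGET_MINUTES = 5.0  # target from the table: median <= 5 minutes


def median_time_to_intervene(pairs: Iterable[Tuple[datetime, datetime]]) -> float:
    """Median minutes between `ai_agent_failure_signal_raised` and the
    first human action on the same room (hypothetical helper)."""
    return median((acted - raised).total_seconds() / 60 for raised, acted in pairs)
```

Note that this sketch only covers rooms a human eventually acted on. Rooms that were never opened are excluded, which would make the median look better than reality, so the baseline exercise in Stage 1–2 should decide how to count them.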

Adoption & Usage:

Metric	Definition	Baseline	Target
Monitored-agent coverage	Share of targeted production autonomous agents with ≥1 supervisor receiving alerts	0	100% of targeted production agents within 30 days of GA
Alert engagement rate	Share of delivered alerts whose deep-link is opened by a supervisor	N/A — new	≥ 60% within 60 days of GA

Quality & Accuracy:

Metric	Definition	Baseline	Target
Alert delivery success rate	`ai_agent_alert_delivered` / (`ai_agent_alert_delivered` + `ai_agent_alert_delivery_failed`)	N/A — new	≥ 98% within 30 days of GA
Alert noise ratio	Supervisor-reported "not useful" alerts / delivered alerts (sampled)	N/A — new	≤ 10% within 60 days of GA (else revisit thresholds)

12. Launch Plan & Stage Gates

Stage	Audience	Duration	Success Gate to Advance	Owner
Internal Alpha	1–2 Mekari test orgs	1–2 weeks	All 4 signals deliver end-to-end; throttle verified; 0 P0/P1; agent reply path unaffected with flag ON and OFF	PM + QA
Closed Pilot	2–3 friendly production customers	2–3 weeks	Alert delivery success ≥ 98%; supervisor time-to-intervene baseline captured; noise ratio ≤ 10% (sampled); dependency live	PM + CSM
Targeted Rollout	26Q2 production autonomous agents (toward 15 IDs)	2–4 weeks	Pilot gates sustained; no alert-storm complaints unresolved > 48h	Eng Lead + PM
GA	All eligible orgs with a configured supervisor	Ongoing	All rollout gates sustained 2 weeks; ⭐ KPI trending to ≤ 5 min	PM

13. Dependencies

Dependency	Owning Team	Deliverable Needed	Blocking?
notification-service `ai_agent_alert` category + chat-origin FCM push	Broadcast / notification-service squad	New `notif_category` (value e.g. `14`) seeded via migration and added to the FCM whitelist so chat-origin `ai_agent_alert` notifications push; documented in a small TECH RFC. Without it, alerts are stored but not pushed.	YES
hub-chat Notification Center rendering of `ai_agent_alert`	Chat / Inbox squad (hub-chat)	Register the new `notif_category` in the navbar type lookup (icon + `notif_category_label`) and ensure the bell + FCM push render it with a working room deep-link, per the S6 contract. hub-chat already does FCM + has the Notification Center, so this is incremental, not net-new. Without it, the alert arrives but renders unlabelled / may not surface.	YES
Supervisor FCM token coverage (`user_source="chat"`)	Broadcast squad + Chatbot	Targeted supervisors must have notification-service FCM tokens registered for chat; otherwise the push silently no-ops	YES
Supervisor-role resolution (org → `sso_id[]`)	Chatbot BE + Platform (role/permission)	A reliable way to resolve "who is the supervisor for org X" — existing role, new org setting, or configured list (see S15 #1)	YES
Confidence value at the BE emit point	Data/ML + Chatbot BE	The numeric confidence surfaced into BE response handling so MON-S05 can threshold it (not available today)	NO — only blocks the Should-Have MON-S05, not Phase-1 GA
AI Activity Log datamart write path	Data (BI)	Accept the alert event as the durable system-of-record row (consistent with existing per-response telemetry)	NO — alerts still deliver without it; needed for KPI reporting + Phase 2
Internal API key / auth to notification-service	Broadcast squad	Valid `X-Api-Key` for chatbot BE → notification-service service-to-service calls	YES

14. Key Decisions + Alternatives Rejected

14a — Decisions Made

Date	Decision	Rationale
2026-06-28	Detect at the existing failure branches and emit via an async pipeline, out-of-band of the reply path	Detection must never add latency to or block the customer reply; the failure branches are the natural, already-present hooks
2026-06-28	Target a configured org supervisor, resolved to `sso_id[]`, not the assigned agent	A failed handover may have no assignee; supervision is org-level and must be independent of assignment state
2026-06-28	Apply per-room/per-signal throttle + dedup (5-min cooldown default)	A looping failure on one room must not flood the supervisor; signal-type granularity still surfaces distinct problems
2026-06-28	Alert only on abnormal handovers, not every escalation	A normal `EVALUATE_ANSWER` handoff is expected behavior; alerting on it would be noise (Non-Goal #4)
2026-06-28	Make MON-S05 (low-confidence) Should-Have, gated on confidence availability	The confidence value isn't surfaced at the BE emit point today; the 3 Must-Have stories deliver value without it
2026-06-28	Reuse notification-service; keep its change as an external dependency + TECH RFC	Avoids new infra and preserves squad ownership of the delivery channel
2026-06-28	Render in the existing hub-chat Notification Center, not a new surface; the FE work is a scoped dependency on the Inbox squad	hub-chat already does FCM + a category-aware Notification Center; reusing it means only registering one new category vs. building a notification UI. S6 is the rendering contract
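
The per-room/per-signal throttle decision can be sketched as a cooldown keyed on `(room_id, signal_type)`. This is an in-memory illustration only: the real throttle keys would live in whatever shared store the BE uses, the class name is hypothetical, and the choice that a suppressed alert does not extend the cooldown is an assumption this PRD does not settle.

```python
import time
from typing import Callable, Dict, Tuple


class AlertThrottle:
    """Allow at most one alert per (room_id, signal_type) per cooldown window."""

    def __init__(self, cooldown_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._cooldown = cooldown_seconds  # 5-min default from the decision above
        self._clock = clock                # injectable so tests can control time
        self._last_delivered: Dict[Tuple[str, str], float] = {}

    def should_deliver(self, room_id: str, signal_type: str) -> bool:
        key = (room_id, signal_type)
        now = self._clock()
        last = self._last_delivered.get(key)
        if last is not None and now - last < self._cooldown:
            # Within cooldown: caller records `ai_agent_alert_suppressed`.
            return False
        # Assumption: only delivered alerts start a new cooldown window.
        self._last_delivered[key] = now
        return True
```

Keying on the pair rather than the room alone means a looping failure of one type is suppressed, while a different signal type on the same room still reaches the supervisor.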

14b — Alternatives Rejected

Alternative	Why Rejected	Date
Emit synchronously inside the reply path	Risks adding latency to or breaking the customer-facing response; unacceptable for a monitoring feature	2026-06-28
Emit only from the central `chatbot_ai_deduction_worker` (single chokepoint)	Async + billing-coupled; loses real-time value and conflates monitoring with billing telemetry. Kept as an audit backstop, not the primary path	2026-06-28
Alert the currently-assigned agent instead of a supervisor	No assignee on failed handovers; not true supervision	2026-06-28
Alert on every occurrence (no throttle)	A repeated failure becomes a notification storm; supervisors would mute the channel	2026-06-28
Build a dedicated AI-alerting service	Duplicates notification-service storage + FCM for the same outcome; far more infra	2026-06-28
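
To make the first rejected alternative concrete, the sketch below shows the chosen out-of-band shape: the reply path only enqueues a signal and returns, while a separate worker does the slow work. It uses an in-process queue and thread purely as a stand-in. The actual async pipeline and its worker technology are not specified in this PRD, and `SignalEmitter` is a hypothetical name.

```python
import queue
import threading
from typing import Callable


class SignalEmitter:
    """Hand signals to a background worker so the reply path never waits."""

    def __init__(self, handler: Callable[[dict], None], maxsize: int = 1000) -> None:
        self._queue: "queue.Queue[dict]" = queue.Queue(maxsize=maxsize)
        self._handler = handler
        threading.Thread(target=self._run, daemon=True).start()

    def emit(self, signal: dict) -> bool:
        """Called from the reply path: non-blocking and never raises."""
        try:
            self._queue.put_nowait(signal)
            return True
        except queue.Full:
            return False  # drop rather than delay the customer reply

    def _run(self) -> None:
        while True:
            signal = self._queue.get()
            try:
                self._handler(signal)  # throttle, resolve supervisor, deliver
            except Exception:
                pass  # a failing handler must not kill the worker
            finally:
                self._queue.task_done()

    def drain(self) -> None:
        """Block until queued signals are processed (useful in tests)."""
        self._queue.join()
```

The trade-off is deliberate: when the queue is full the signal is dropped, because losing one alert is preferable to slowing or breaking the customer-facing response.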

15. Open Questions

#	Type	Question	Owner	Deadline
1	Open Question	How is "supervisor" defined and resolved to `sso_id[]` per org — existing role/permission, a new org setting, or an explicitly configured list? Determines the MON-S01 resolution logic.	Dimas + Chatbot BE	2026-07-11
2	Open Question	What is the exact `assign_reason` allow/deny list that separates "unexpected" from "expected" handovers (MON-S03)? And how do we treat an unknown/new reason — default alert or default silent?	Dimas + Chatbot BE	2026-07-11
3	Assumption	The numeric confidence value is available (or can be cheaply surfaced) at the BE emit point for MON-S05; default `confidence_floor` to be set with Data/ML. If false, MON-S05 slips to a fast-follow.	Dimas + Data/ML	2026-07-18
4	Risk	Alert noise — too-broad triggers cause supervisors to mute the channel. Mitigation: per-room throttle + dedup (MON-S01), abnormal-only handover (MON-S03), noise-ratio metric + threshold-review cadence (S10.1), and feature flag to disable per org.	Dimas	2026-07-18
5	Risk	Push not delivered — supervisors lack `user_source="chat"` FCM tokens, so alerts store but never push. Mitigation: confirm token coverage in Stage 1 (S13 blocking dep); fall back to in-app notification list until tokens exist.	Dimas + Broadcast squad	2026-07-18
6	Assumption	notification-service 7-day chat retention is acceptable for live alerts because the durable record lives in the AI Activity Log datamart.	Dimas	2026-07-11
7	Open Question	Do supervisors also need the alert on the mobile inbox app (not just hub-chat web)? If yes, mobile FCM rendering of `ai_agent_alert` is an additional dependency.	Dimas + Chat/Mobile squad	2026-07-18

Types: Assumption · Open Question · Risk

PRD CHANGELOG

Version	Date	By	Section	Type	Summary
1.0	2026-06-28	Claude	All	CREATED	Initial NEW PRD generated from grounding analysis of chatbot BE emit chokepoints + notification-service delivery channel
1.1	2026-06-28	Claude	S8	MODIFIED	Added explicit alert-reversibility (CANNOT) line to all 5 stories and generated the S8 system-flow Mermaid diagram, per score-prd Layer 2.5 Q2 + diagram coverage
1.2	2026-06-28	Claude	S6,S8,S13,S14,S15	MODIFIED	Corrected the "backend-only" framing: documented the hub-chat Notification Center as the supervisor-facing surface with a rendered look + field mapping (S6), added the hub-chat FE rendering as a blocking dependency (S13), added Frontend to scope_changes, updated the system flow + diagram, and logged the mobile-inbox open question (S15)
