
Qontak | AI Agent | AI Agent Live Monitoring — Phase 1: Supervisor Alerts

HEADER BLOCK

PM: Dimas Fauzi Hidayat (Product Manager, Mekari Qontak)
PRD Version: 1.2
Status: DRAFT
PRD Type: NEW
Epic: TBD — add once Epic is created
Squad: BOT — Chatbot Squad
RFC Link: N/A — pending; notification-service delivery change tracked as a TECH RFC (Broadcast squad), see Dependencies
Figma Master: N/A — backend-only; alerts render in the existing notification surface
Anchor: Yes — AI Agent Live Monitoring — ANCHOR
Labels: epic:qontak-chat | module:ai-agent | feature:ai-agent-monitoring
Last Updated: 2026-06-28

Status values: DRAFT, READY, BUILD, SHIPPED



2. One-liner + Problem

One-liner: Push a real-time alert to a configured organization supervisor whenever the production AI agent fails or degrades in a live customer conversation.

Problem: The chatbot backend already detects when the AI agent fails — an AI-service or v2-engine error falls back to a default answer, an unexpected handover escalates the room, a message-limit cap silently stops the agent (Rollbar.info only), and low-confidence answers are recorded in telemetry — but none of these reach a human proactively; there is no integration that notifies a supervisor. CS supervisors running production agents (the 26Q2 target spans agents across 15 customer IDs on Plus/Ultimate/360 tiers) therefore only discover a bad agent moment after a customer complains or during after-the-fact review. The cost is unrecovered conversations, slow human takeover, and eroded trust in letting the agent run autonomously — the precise risk this Phase exists to reduce.


3. Target Users + Persona Context

Primary Persona: CS Supervisor / Team Lead

Role: Supervisor or team lead who owns an organization's production AI agent and the human agents who back it up
Goal: Learn within seconds, not hours, that the AI agent has failed a live conversation, and open that specific room to take over or coach
Pain: No proactive signal exists; failures are invisible until a customer escalates or a manual review surfaces them after the fact
Workaround: Periodically eyeball the inbox, wait for customer complaints, or scrub the AI Activity Log / Rollbar after conversations have ended

(See Constraints for plan availability and feature flag scope.)

Secondary Persona: Chatbot Specialist (Mekari)

Role: Mekari-side specialist who configures and tunes customer AI agents
Goal: Receive the same failure signals to spot patterns (which signal, how often, which agent) and prioritise tuning
Pain: Failure signals are scattered across logs and the datamart with no real-time, per-agent view
Workaround: Manual datamart queries and Rollbar spelunking, well after the fact

4. Non-Goals

  1. No new alerting infrastructure — Phase 1 delivers exclusively through the existing notification-service chat channel. We do not build a dedicated alerting service, queue, or datastore.
  2. No monitoring console / dashboard — the at-a-glance "agent health" board is Phase 2. Phase 1 is push-only.
  3. No automated containment — alerts inform a human; the system does not pause the agent, force a handoff, or take any corrective action automatically. That is explored in Phase 3.
  4. No alerting on normal agent behavior — a successful resolve, or a deliberate, expected handover (e.g. EVALUATE_ANSWER), is not a failure and must not alert.
  5. No changes to the notification-service repo by this squad — the new ai_agent_alert category, FCM whitelist line, and chat-push delivery are owned by the Broadcast squad and tracked as a dependency (Section 13).
  6. No end-customer-facing notifications — alerts go to internal supervisors/specialists only, never to the conversation's customer.
  7. No per-agent custom alert rules in this phase — the trigger set and thresholds are global defaults; per-org configurability beyond on/off and supervisor targeting is deferred.
  8. No retroactive alerting — only events occurring after the feature flag is enabled for an org generate alerts; we do not backfill historical failures.

5. Constraints

Platform: Backend-only (chatbot Rails BE). Alert is consumed on whatever surfaces notification-service already supports: in-app notification + FCM push to the supervisor's chat client. No new screen.
Performance: Detection and emit must run out-of-band of the customer reply path; the alert must never add latency to, or block, the agent's response to the customer. Emit is fire-and-forget (async worker). Target: alert delivered to notification-service within 30s of the triggering event.
Throttle / dedup: At most one alert per room per signal-type per cooldown window (default 5 minutes), deduped by event_id. A looping failure on one room must not flood the supervisor.
Plan scope: Production AI-agent tiers only: Plus / Ultimate / 360 (the tiers already entitled for the autonomous agent). Not Starter/Free.
Feature flag: ai_agent_monitoring_alerts | default: OFF. Enabled per organization.
Read/write: Only the chatbot BE (service-to-service, via the notification-service internal API key) writes alerts. Supervisors read alerts; no one edits them. Customers never see them.
Dependency gate: Chat-origin push for the ai_agent_alert category must be live in notification-service (Section 13) before an org can be moved past internal testing.

5.1. Data Lifecycle

Phase 1 introduces two persisted artifacts outside the main conversation model (the throttle/dedup state and the delivered alert notifications) and mirrors each alert event into the existing AI Activity Log datamart.

| Artifact Type | Retention Period | Cleanup Trigger | User-Visible Effect |
|---|---|---|---|
| Throttle/dedup key (per room + signal-type) | Cooldown window only (default 5 min) | TTL expiry on the key store (e.g. Redis) | None — internal only |
| Alert notification record (in notification-service) | Per notification-service chat retention (currently 7 days, notif_type=3) | notification-service retention policy (query-time pruning) | Alert disappears from the supervisor's notification list after the retention window |
| Alert event in AI Activity Log datamart (system of record) | Per datamart retention (durable) | Datamart lifecycle | None directly — powers KPI reporting and Phase-2 console |

Note: the notification-service 7-day window is fine for live alerting; the durable record of every failure event lives in the AI Activity Log datamart, not in notification-service.
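The throttle/dedup key lifecycle in the table above can be sketched as follows. This is a minimal illustration, not the squad's implementation: `MemoryKeyStore` is a hypothetical in-memory stand-in for Redis (its `set_if_absent` mimics what Redis `SET` with the `NX` and `EX` options provides), and the key format and class names are assumptions.

```ruby
# Hypothetical in-memory stand-in for the Redis throttle key store.
class MemoryKeyStore
  def initialize(clock: -> { Time.now.to_f })
    @clock = clock
    @expires_at = {}
  end

  # Sets the key only when it is absent or expired.
  # Returns true when the key was newly set, false when a live key exists.
  def set_if_absent(key, ttl_seconds)
    now = @clock.call
    return false if @expires_at.key?(key) && @expires_at[key] > now

    @expires_at[key] = now + ttl_seconds
    true
  end
end

class AlertThrottle
  DEFAULT_COOLDOWN_SECONDS = 5 * 60 # default cooldown from Constraints

  def initialize(store:, cooldown_seconds: DEFAULT_COOLDOWN_SECONDS)
    @store = store
    @cooldown_seconds = cooldown_seconds
  end

  # Returns :emit or :suppress for one room + signal-type pair.
  def check(room_id:, signal_type:)
    key = "ai_agent_alert:throttle:#{room_id}:#{signal_type}" # assumed key format
    @store.set_if_absent(key, @cooldown_seconds) ? :emit : :suppress
  rescue StandardError
    # Key store unavailable: fail open, since a missed dedup is
    # preferable to a missed failure alert.
    :emit
  end
end
```

Because the key carries its own TTL, no cleanup job is needed: once the cooldown elapses the next failure on the same room and signal type emits again.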


6. New Features

No net-new screen is built, but there is a user-facing surface change: the supervisor sees the alert in the existing hub-chat (Qontak Omnichannel) Notification Center (the navbar bell) and as an FCM push. hub-chat already wires Firebase FCM (app.vue: useFirebase/initFirebase) and renders a category-aware notification list (layouts/composables/useNotificationCenter.ts, common/types/NotificationCenterTypes.ts). This is an incremental addition to that existing component, not a new page, but it is not zero FE work (see Dependencies: hub-chat must register and render the new ai_agent_alert category).

Surface: hub-chat Notification Center (navbar bell) + FCM push banner. URL: the supervisor's existing Omnichannel inbox (/inbox); no new route.
Access: Users with the configured supervisor role for the org, logged into hub-chat (web). Mobile inbox is an open question (Section 15).

What the supervisor sees — Notification Center list item (success state):

┌─ Notifications ───────────────────────────────────┐
│ 🔴 AI Agent Alert · 2m ago │ ← unread dot · notif_category_label · created_at (relative)
│ AI Agent failed — engine error │ ← title
│ Couldn't answer in "Order #1234" — fell back │ ← description (truncated)
│ to default reply │
│ ↳ Open room → │ ← click_action_url deep-link to the failing room
└────────────────────────────────────────────────────┘

FCM push banner (device):

● Qontak Omnichannel now
AI Agent failed — engine error
Couldn't answer in "Order #1234" room

Field → UI mapping (drives the render from the notification-service payload):

| Notification field | Renders as |
|---|---|
| notif_category = ai_agent_alert | Selects the icon + needs a registered type entry in hub-chat (the notif_category → type lookup in TheNavbar.vue/OneNavbar.vue) |
| notif_category_label | The category chip text ("AI Agent Alert") |
| title | Bold first line (the failure summary) |
| description | Secondary line (room + what happened), truncated |
| created_at / read_at | Relative timestamp + unread dot |
| click_action = OPEN_URL, click_action_url + extra.room_id | Click opens the failing room in the inbox |

UI States (reuse the existing Notification Center states; the new category only adds an item type):

| State | Description |
|---|---|
| Empty | No AI agent alerts → nothing extra shown; the existing center's empty/zero state applies. |
| Loading | Existing Notification Center skeleton while fetching. |
| Error | Existing center error/retry; a push that fails to render falls back to the list on next fetch. |
| Success | The alert item renders as mocked above; click deep-links to the room. |

Figma: N/A — reuses the existing Pixel3 notification-list component; this spec is the rendering contract the hub-chat (Inbox) squad builds to. A frame can be added if the squad wants a visual sign-off.

📊 UI State Diagram — AI Agent Alert (Notification Center item)

stateDiagram-v2
[*] --> Delivered: notification-service push received
Delivered --> Unread: shown in bell (red dot) + push banner
Unread --> Read: supervisor opens the Notification Center
Read --> RoomOpened: taps "Open room" (click_action_url + extra.room_id)
Unread --> Expired: 7-day chat retention reached
Read --> Expired: 7-day chat retention reached
RoomOpened --> [*]: supervisor takes over / coaches
Expired --> [*]: pruned from list (durable record stays in Activity Log)

7. API & Webhook Behavior

PM describes behavior in plain language; Engineering resolves HTTP methods, schemas, and error codes in the RFC. The outbound contract is to notification-service's existing internal endpoint.

1. Resolve alert recipient(s)
Entity affected: Supervisor mapping (org → sso_id[])
Triggered by: A failure signal is raised for a room in org X
Expected behavior:
• Look up the configured supervisor(s) for org X (open question: existing role vs. new org setting; see Section 15)
• Return one or more sso_ids to target
• Cache the lookup for the cooldown window
Failure behavior:
• If no supervisor is configured for the org: drop the alert, increment ai_agent_alert_dropped with reason no_supervisor, do not error the conversation
• If lookup times out: skip alert, log; never block the customer reply

2. Throttle & dedup the alert
Entity affected: In-memory/Redis throttle key (room_id + signal_type)
Triggered by: An alert is about to be emitted
Expected behavior:
• If no live key exists: proceed and set the key with TTL = cooldown (default 5 min)
• Attach a stable event_id for idempotency
Failure behavior:
• If a live key exists (within cooldown): suppress this alert, increment ai_agent_alert_suppressed with the signal type
• If the key store is unavailable: fail open (emit) rather than fail closed; a missed dedup is better than a missed failure alert

3. Emit the alert to notification-service
Entity affected: A chat-origin notification (notif_type=3, notif_category=ai_agent_alert)
Triggered by: Recipient resolved + not throttled
Expected behavior:
• Call notification-service POST /api/v1/notifications/chat with X-Api-Key internal auth, targeting the supervisor sso_id
• Title/description summarise the signal (e.g. "AI Agent failed — engine error"), click_action=OPEN_URL, click_action_url deep-links the room
• Telemetry travels in the extra JSONB envelope: room_id, organization_id, conversation_id, signal_type, failed_reason/assign_reason, confidence (when present), agent_id
• Runs in an async worker (Sidekiq), out-of-band of the reply path
• Mirror the event to the AI Activity Log datamart (system of record)
Failure behavior:
• Non-2xx from notification-service: retry with backoff (bounded); on final failure increment ai_agent_alert_delivery_failed and log; never raise into the conversation flow
• If the dependency (category/whitelist) is not yet live: alert is created but no push is delivered (covered by the Section 13 blocking gate)

Engineering resolves during RFC: exact request/response schema, retry/backoff policy, async worker boundary, and whether HTTP or the qontak_chat.public.notification_worker Kafka topic is used for emit (both are supported by notification-service).
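The bounded retry on a non-2xx response (behavior 3) could look roughly like the sketch below. The retry count, the exponential delay, and the names `deliver_with_retry` and `DeliveryResult` are illustrative assumptions; the actual retry/backoff policy is left to the RFC. The HTTP call itself is passed in as a block so the sketch stays independent of any HTTP client.

```ruby
# Hypothetical result type for one delivery.
DeliveryResult = Struct.new(:status, :attempts, keyword_init: true)

# Calls the block (one HTTP attempt that returns a status code) up to
# max_attempts times, sleeping with exponential backoff between failures.
# It never raises, so nothing leaks into the conversation flow.
def deliver_with_retry(max_attempts: 3, base_delay: 0.5, sleeper: ->(s) { sleep(s) })
  attempts = 0
  while attempts < max_attempts
    attempts += 1
    status = begin
      yield
    rescue StandardError
      nil # transport error: treated like a failed attempt
    end
    ok = status.is_a?(Integer) && (200..299).cover?(status)
    return DeliveryResult.new(status: :delivered, attempts: attempts) if ok

    # Back off only when another attempt will follow.
    sleeper.call(base_delay * (2**(attempts - 1))) if attempts < max_attempts
  end
  # Caller increments ai_agent_alert_delivery_failed and logs.
  DeliveryResult.new(status: :delivery_failed, attempts: attempts)
end
```

Inside a Sidekiq worker the same effect might instead come from Sidekiq's own retry mechanism; the explicit loop here is only meant to make the "bounded, then count and log" contract visible.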


8. System Flow + User Stories

8.1. System Flow

Flow: AI Agent Failure → Supervisor Alert
Type: API Sequence

  1. A customer message reaches the chatbot BE and the AI agent attempts a response (get_answer / send_message_with_resolve).
  2. The agent pipeline hits one of the monitored conditions: AI-service non-200 (get_answer.rb:44), v2-engine failure (send_message_with_resolve.rb:1923), unexpected handover (assign_agent==true with an abnormal assign_reason, get_answer.rb:56), message-limit cap (send_message_with_resolve.rb:~1812), or a low-confidence answer (confidence below floor).
  3. The agent's normal fallback behavior runs unchanged (default answer / handover / cap) — the customer path is not altered or delayed.
  4. The detector classifies the event into a signal_type and enqueues an alert job (async, Sidekiq), passing room_id, organization_id, conversation_id, signal context, and confidence when available.
  5. The alert worker resolves the org's configured supervisor(s) → sso_id[].
  6. Decision: no supervisor configured → drop, count no_supervisor, stop.
  7. Decision: a throttle key for this room_id+signal_type is live (within cooldown) → suppress, count suppressed, stop. Otherwise set the key (TTL = cooldown) and continue.
  8. The worker POSTs the alert to notification-service /api/v1/notifications/chat (X-Api-Key, notif_type=3, notif_category=ai_agent_alert), telemetry in extra, deep-link in click_action_url.
  9. notification-service stores the notification and (category whitelisted) pushes via FCM to the supervisor's chat device.
  10. hub-chat (Omnichannel web) renders the alert in the navbar Notification Center bell and shows the FCM push banner, using the registered ai_agent_alert category for the icon/label.
  11. The worker mirrors the event to the AI Activity Log datamart (system of record) for KPI + Phase-2 console.
  12. Failure branch: notification-service returns non-2xx → bounded retry/backoff; on final failure count delivery_failed and log — never raise into the conversation.
  13. The supervisor sees the alert in the bell / push, taps the deep-link, and opens the failing room to take over or coach.

📊 System Flow — Supervisor Alerts

sequenceDiagram
participant Cust as Customer
participant BE as Chatbot BE (agent pipeline)
participant W as Alert Worker (Sidekiq)
participant Res as Supervisor Resolver
participant Thr as Throttle Store (Redis)
participant NS as notification-service
participant HC as hub-chat (Notification Center)
participant ADL as AI Activity Log
participant Spv as Supervisor

Cust->>BE: Customer message
BE-->>Cust: Reply / fallback (unchanged — no added latency)
Note over BE: Failure detected (service · engine · handover · limit · low-confidence)
BE->>W: Enqueue alert (room_id, org_id, signal_type, context)
W->>Res: Resolve supervisor(s) for org
alt No supervisor configured
Res-->>W: none
W-->>W: Drop · count no_supervisor
else Supervisor resolved
Res-->>W: sso_id[]
W->>Thr: Check room + signal-type key
alt Within cooldown
Thr-->>W: key live
W-->>W: Suppress · count suppressed
else Not throttled
Thr-->>W: set key (TTL = cooldown)
W->>NS: POST /notifications/chat (X-Api-Key · ai_agent_alert · extra)
alt Delivery non-2xx
NS-->>W: error
W-->>W: Retry/backoff · on final fail count delivery_failed
else Delivered
NS->>HC: FCM push + stored notification (ai_agent_alert)
HC->>Spv: Render in bell + push banner
W->>ADL: Mirror event (system of record)
Spv->>BE: Open room (deep-link) · take over / coach
end
end
end
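
The decision order in the flow above (resolve, then throttle, then emit, then mirror) can be sketched as a small orchestrator. Every collaborator is an injected callable with a hypothetical name, so this shows only the branching, not the real worker. It follows the sequence diagram in mirroring the event only after a successful delivery; whether failed deliveries should also be mirrored is not settled by this sketch.

```ruby
require "securerandom"

# Runs one raised signal through resolve -> throttle -> emit -> mirror.
class AlertPipeline
  attr_reader :counters

  def initialize(resolve:, throttle:, emit:, mirror:)
    @resolve = resolve   # organization_id -> array of sso_ids
    @throttle = throttle # (room_id, signal_type) -> true when within cooldown
    @emit = emit         # (sso_id, payload) -> true on a 2xx response
    @mirror = mirror     # payload -> writes the event to the Activity Log
    @counters = Hash.new(0)
  end

  # Returns the outcome symbol and increments the matching counter.
  def call(signal)
    sso_ids = @resolve.call(signal[:organization_id])
    return count(:dropped_no_supervisor) if sso_ids.empty?
    return count(:suppressed) if @throttle.call(signal[:room_id], signal[:signal_type])

    payload = signal.merge(event_id: SecureRandom.uuid) # idempotency key
    delivered = sso_ids.map { |sso_id| @emit.call(sso_id, payload) }
    if delivered.any?
      @mirror.call(payload)
      count(:delivered)
    else
      count(:delivery_failed)
    end
  end

  private

  def count(name)
    @counters[name] += 1
    name
  end
end
```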

8.2. User Stories

[MON-S01] — Supervisor alert delivery pipeline (resolve · throttle · emit)

User Story: As a CS Supervisor, I want a single reliable pipeline that resolves who to notify, suppresses repeats, and delivers the alert, so that every genuine agent failure reaches me once, fast, without flooding me.
Before State: None — the chatbot BE has no integration with notification-service and no concept of a "supervisor alert". Failures end at a default answer or a Rollbar log.
After Delta: A shared async pipeline (Sidekiq worker) takes a raised failure signal, resolves the org supervisor(s), applies per-room/per-signal throttle + dedup, POSTs the alert to notification-service chat, and mirrors the event to the Activity Log datamart — out-of-band of the customer reply path.
Importance: Must Have
Mockup / Technical Notes: Figma: N/A — backend-only.

Data Fields (alert payload):
sso_id (uuid, required) — resolved supervisor target
room_id (string, required) — source: room state
organization_id (uuid, required) — source: room/org context
conversation_id (string, required) — source: conversation context
signal_type (enum, required) — service_failure / engine_failure / unexpected_handover / message_limit / low_confidence
event_id (uuid, required) — idempotency/dedup key
extra (json, required) — telemetry envelope (reason, confidence, agent_id)

Technical Notes: Delivery via notification-service POST /api/v1/notifications/chat (X-Api-Key, notif_type=3, notif_category=ai_agent_alert). Throttle key store = Redis with TTL = cooldown. Pipeline is invoked by all detector stories (S02–S05).

Acceptance Criteria:
— Happy Path —
• AC-1: Given a failure signal is raised for a room whose org has a configured supervisor, when the pipeline runs, then exactly one alert is POSTed to notification-service targeting the supervisor sso_id with the correct signal_type and extra telemetry.
• AC-2: Given the same room raises the same signal_type again within the cooldown window, when the pipeline runs, then the second alert is suppressed and ai_agent_alert_suppressed is incremented.
• AC-3 (volume/boundary): Given a room loops a failure many times within the cooldown, when the pipeline runs repeatedly, then at most one alert is delivered for that room+signal in the window.
• AC-4: Given an org configures more than one supervisor, when an alert is emitted, then each configured supervisor sso_id receives the alert (deduped per recipient).

— Error / Unhappy Path —
• ERR-1: Given the org has no configured supervisor, when the pipeline runs, then no alert is sent, ai_agent_alert_dropped is incremented with reason no_supervisor, and the conversation flow is unaffected.
• ERR-2: Given notification-service returns a non-2xx, when the pipeline emits, then it retries with bounded backoff and on final failure increments ai_agent_alert_delivery_failed and logs — without raising into the conversation.
• ERR-3: Given the throttle key store is unavailable, when the pipeline runs, then it fails open (emits the alert) rather than dropping a genuine failure.

— Permission Model —
• CAN: chatbot BE (service-to-service via internal API key) emits; configured supervisor(s) receive.
• CANNOT: end customers (never recipients); agents without supervisor configuration.
• CANNOT (reversibility): an alert cannot be recalled or edited once delivered; it is fire-once and auto-expires per notification-service retention (7-day chat window). The durable record persists in the AI Activity Log datamart.
• Unauthorized: if the internal API key is rejected by notification-service, the emit fails closed (no alert reaches the supervisor) and is counted as delivery_failed.

— UI States —
• Loading: N/A (backend).
• Empty: no supervisor configured → no alert (see ERR-1).
• Error: delivery failure logged/counted, invisible to customer.
• Success: alert appears in supervisor's notification surface + FCM push.

— Negative Scenarios — (from Non-Goals)
• NEG-1: Given the agent responds successfully (no failure signal), when the reply completes, then the pipeline is not invoked and no alert is generated.
• NEG-2: Given a customer is in the room, when an alert is emitted, then the customer never receives it (internal-only).

Dependencies: notification-service ai_agent_alert category + chat push (Section 13).

🧪 Test Coverage Matrix — [MON-S01]

| Dimension | Coverage | Notes |
|---|---|---|
| Boundary values | ✅ defined | AC-3 covers the loop/flood bound; AC-4 covers multi-supervisor |
| State transitions | ✅ defined | AC-1/AC-2 cover first-fire vs. within-cooldown |
| Data validation | ⚠️ TBD | ⚠️ QA: malformed/missing organization_id or room_id in the raised signal |
| Concurrency | ⚠️ partial | AC-2 covers sequential repeats; ⚠️ QA: two failures on the same room+signal racing the throttle key simultaneously |
| Network/timeout | ✅ defined | ERR-2 (delivery retry/backoff) + ERR-3 (key store unavailable, fail-open) |
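
For the open data-validation item, a check on the raised signal might look like the sketch below. The field names come from the [MON-S01] Data Fields list; the function names and the exact JSON layout of the request body are assumptions, since the request schema is resolved in the RFC.

```ruby
require "json"

SIGNAL_TYPES = %w[service_failure engine_failure unexpected_handover
                  message_limit low_confidence].freeze
REQUIRED_FIELDS = %i[sso_id room_id organization_id conversation_id
                     signal_type event_id].freeze

# Returns a list of validation errors; an empty list means the signal is usable.
def alert_payload_errors(fields)
  errors = REQUIRED_FIELDS.select { |name| fields[name].to_s.strip.empty? }
                          .map { |name| "#{name} is missing" }
  type = fields[:signal_type]
  unknown = !type.to_s.strip.empty? && !SIGNAL_TYPES.include?(type)
  errors << "signal_type #{type.inspect} is unknown" if unknown
  errors
end

# Builds an illustrative JSON body; the key layout is an assumption.
def build_alert_body(fields, title:, description:, room_url:)
  JSON.generate({
    sso_id: fields[:sso_id],
    notif_type: 3,
    notif_category: "ai_agent_alert",
    title: title,
    description: description,
    click_action: "OPEN_URL",
    click_action_url: room_url,
    # Telemetry envelope carried in `extra`.
    extra: fields.slice(:room_id, :organization_id, :conversation_id,
                        :signal_type, :event_id)
  })
end
```

A worker following this sketch would skip emission and log when `alert_payload_errors` returns anything, rather than send a notification that cannot deep-link to a room.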

[MON-S02] — Detect AI-service / v2-engine failure and raise an alert

User Story: As a CS Supervisor, I want to be alerted when the AI service or v2 engine fails to answer, so that I can step into a room where the bot has fallen back to a default answer.
Before State: At get_answer.rb:44 (AI-service non-200) the BE assigns a fallback default answer; at send_message_with_resolve.rb:1923 (v2 result.code != 200) it runs _execute_ai_assist_fallback. Both are invisible to a human.
After Delta: Each of these failure branches additionally raises a service_failure / engine_failure signal into the S01 pipeline, with the failure reason captured in telemetry. The existing fallback behavior is unchanged.
Importance: Must Have
Mockup / Technical Notes: Figma: N/A.

Data Fields:
signal_type (enum, required) — service_failure or engine_failure
failed_reason (string, required) — source: existing failed_reason ("failed to access ai service" / "Fail to get answer from AI Agent")
room_id, organization_id, conversation_id (required) — source: room/conversation context

Technical Notes: Emit hook at app/core/repositories/ai_service/get_answer.rb:44 and app/core/use_cases/system/hub/send_message_with_resolve.rb:1923. The DeductionRequest (is_failed/failed_reason) at chatbot_ai_deduction_worker.rb is the async audit backstop.

Acceptance Criteria:
— Happy Path —
• AC-1: Given the AI service returns a non-200, when the BE applies its default-answer fallback, then a service_failure signal is raised into the pipeline with failed_reason populated.
• AC-2: Given the v2 engine returns result.code != 200, when _execute_ai_assist_fallback runs, then an engine_failure signal is raised with failed_reason populated.
• AC-3: Given a failure is detected, when the signal is raised, then the customer still receives the existing fallback answer with no added latency.

— Error / Unhappy Path —
• ERR-1: Given signal raising itself errors (e.g. enqueue fails), when the failure branch runs, then the customer-facing fallback still completes and the enqueue error is logged — detection never degrades the reply path.

— Permission Model —
• CAN: BE failure branches raise the signal automatically.
• CANNOT: no manual/user trigger for this signal.
• CANNOT (reversibility): the resulting alert cannot be recalled or edited once delivered; fire-once, auto-expires per notification-service retention.
• Unauthorized: N/A (system-raised).

— UI States —
• Loading/Empty/Error/Success: N/A (backend); delivery states handled by S01.

— Negative Scenarios — (from Non-Goals)
• NEG-1: Given the AI service returns 200 and the agent answers normally, when no failure branch runs, then no service_failure/engine_failure signal is raised.

Dependencies: [MON-S01].

🧪 Test Coverage Matrix — [MON-S02]

| Dimension | Coverage | Notes |
|---|---|---|
| Boundary values | ✅ defined | AC-1/AC-2 cover both distinct failure sources |
| State transitions | ✅ defined | AC-3: failure → fallback delivered, signal raised in parallel |
| Data validation | ⚠️ TBD | ⚠️ QA: empty/unexpected failed_reason string still produces a usable alert |
| Concurrency | ⚠️ TBD | ⚠️ QA: service failure followed immediately by a retry success on the same room |
| Network/timeout | ✅ defined | ERR-1 (enqueue failure never blocks reply path) |
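
ERR-1 and the flag-OFF no-op requirement can be illustrated with a guarded hook. This is a sketch under assumptions: `raise_failure_signal` and `answer_with_fallback` are hypothetical names, and `enqueue` and `logger` are injected stand-ins for the Sidekiq enqueue and the application logger. It is not the code at get_answer.rb:44.

```ruby
# Raises a failure signal without letting an enqueue error reach the
# customer reply path.
def raise_failure_signal(signal, enqueue:, logger:, flag_enabled: true)
  return :flag_off unless flag_enabled # hooks are no-ops when the flag is OFF

  enqueue.call(signal)
  :enqueued
rescue StandardError => e
  logger.call("ai_agent_alert enqueue failed: #{e.class}: #{e.message}")
  :enqueue_failed
end

# Hypothetical shape of the AI-service failure branch.
def answer_with_fallback(response_code, room:, enqueue:, logger:)
  return "agent answer" if response_code == 200

  raise_failure_signal(
    { signal_type: "service_failure",
      failed_reason: "failed to access ai service", **room },
    enqueue: enqueue, logger: logger
  )
  "default answer" # existing fallback, unchanged
end
```

The fallback string is returned regardless of what the hook does, which is the property AC-3 and ERR-1 ask for.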

[MON-S03] — Detect unexpected handover/escalation and raise an alert

User Story: As a CS Supervisor, I want to be alerted when the agent hands a conversation off unexpectedly, so that I can pick up an escalation that the bot couldn't resolve rather than letting it sit unassigned.
Before State: At get_answer.rb:56, assign_agent==true triggers _assign_agent. Handovers happen for different reasons (TRANSFER_CONDITION, EVALUATE_ANSWER); none notify a supervisor, and a normal "evaluate answer" handoff is expected behavior.
After Delta: An abnormal handover (e.g. TRANSFER_CONDITION where the agent could not satisfy a transfer condition) raises an unexpected_handover signal into the pipeline; expected handovers (EVALUATE_ANSWER) do not alert. assign_reason is captured in telemetry.
Importance: Must Have
Mockup / Technical Notes: Figma: N/A.

Data Fields:
signal_type (enum, required) — unexpected_handover
assign_reason (string, required) — source: AI-service response assign_reason
room_id, organization_id, conversation_id (required)

Technical Notes: Emit hook at app/core/repositories/ai_service/get_answer.rb:56. The expected-vs-unexpected boundary is defined by assign_reason; the exact reason allow/deny list is confirmed with the BE team (see Section 15).

Acceptance Criteria:
— Happy Path —
• AC-1: Given assign_agent==true with an abnormal assign_reason (e.g. TRANSFER_CONDITION), when the handover runs, then an unexpected_handover signal is raised with assign_reason in telemetry.
• AC-2: Given the handover assigns the room to a human, when the alert is emitted, then the deep-link opens that specific room for the supervisor.

— Error / Unhappy Path —
• ERR-1: Given the handover succeeds but signal raising fails, when the branch runs, then the handover/assignment still completes and the error is logged.

— Permission Model —
• CAN: BE handover branch raises automatically.
• CANNOT: manual trigger.
• CANNOT (reversibility): the resulting alert cannot be recalled or edited once delivered; fire-once, auto-expires per notification-service retention.
• Unauthorized: N/A.

— UI States —
• N/A (backend); delivery via S01.

— Negative Scenarios — (from Non-Goals)
• NEG-1: Given a normal, expected handover (EVALUATE_ANSWER), when it runs, then no unexpected_handover alert is raised (Non-Goal #4 — no alerting on normal behavior).
• NEG-2: Given the agent resolves the conversation without handover, when it completes, then no handover signal is raised.

Dependencies: [MON-S01].

🧪 Test Coverage Matrix — [MON-S03]

| Dimension | Coverage | Notes |
|---|---|---|
| Boundary values | ✅ defined | AC-1 vs. NEG-1 draw the expected/unexpected boundary on assign_reason |
| State transitions | ✅ defined | AC-2: handover → room assigned → deep-link targets it |
| Data validation | ⚠️ TBD | ⚠️ QA: unknown/new assign_reason value: default to alert or not? (tie to the Section 15 allow-list) |
| Concurrency | ⚠️ TBD | ⚠️ QA: handover + a near-simultaneous failure signal on the same room (throttle is per signal-type, so both may alert) |
| Network/timeout | ✅ defined | ERR-1 (signal failure never blocks the assignment) |
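
The expected/unexpected boundary on assign_reason could be expressed as an explicit classification, as sketched below. The two lists contain only the reasons named in this PRD and are placeholders; the real allow/deny list is still to be confirmed with the BE team, and `classify_handover` is a hypothetical name.

```ruby
# Illustrative lists only: the real allow/deny list is an open question.
EXPECTED_REASONS = %w[EVALUATE_ANSWER].freeze
UNEXPECTED_REASONS = %w[TRANSFER_CONDITION].freeze

# Returns :alert, :ignore, or :unknown for one AI-service response.
def classify_handover(assign_agent:, assign_reason:)
  return :ignore unless assign_agent # no handover, nothing to alert on
  return :ignore if EXPECTED_REASONS.include?(assign_reason)
  return :alert if UNEXPECTED_REASONS.include?(assign_reason)

  # New or unrecognised reason: the caller decides whether to alert.
  :unknown
end
```

Returning a distinct `:unknown` keeps the open data-validation question visible in code instead of silently choosing a default.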

[MON-S04] — Detect message-limit cap and raise an alert

User Story: As a CS Supervisor, I want to be alerted when the agent hits its message limit and stops responding, so that I know a conversation has gone silent because of a cap, not because the customer left.
Before State: At send_message_with_resolve.rb:~1812, exceeding message_limit calls _handle_ai_agent_message_limit_reached, which only emits Rollbar.info — invisible to any supervisor.
After Delta: The message-limit branch additionally raises a message_limit signal into the pipeline so a human can pick up the now-silent conversation.
Importance: Should Have
Mockup / Technical Notes: Figma: N/A.

Data Fields:
signal_type (enum, required) — message_limit
room_id, organization_id, conversation_id (required)
message_count / limit (int, optional) — source: Redis counter + agent config

Technical Notes: Emit hook alongside _handle_ai_agent_message_limit_reached. Should-Have: ships if Phase-1 capacity permits; otherwise rolls to a fast-follow.

Acceptance Criteria:
— Happy Path —
• AC-1: Given a room exceeds the agent message_limit, when the cap handler runs, then a message_limit signal is raised into the pipeline.
• AC-2: Given the cap recurs on the same room within cooldown, when the handler runs again, then S01 throttling suppresses the repeat.

— Error / Unhappy Path —
• ERR-1: Given signal raising fails, when the cap handler runs, then the existing cap behavior (and Rollbar log) still completes.

— Permission Model —
• CAN: BE cap handler raises automatically.
• CANNOT: manual trigger.
• CANNOT (reversibility): the resulting alert cannot be recalled or edited once delivered; fire-once, auto-expires per notification-service retention.
• Unauthorized: N/A.

— UI States —
• N/A (backend); delivery via S01.

— Negative Scenarios — (from Non-Goals)
• NEG-1: Given the room is below the message limit, when the agent responds, then no message_limit signal is raised.

Dependencies: [MON-S01].


[MON-S05] — Detect low-confidence response and raise an alert

User Story: As a CS Supervisor, I want to be alerted when the agent answers with low confidence, so that I can review a reply that the bot itself is unsure about before it costs us the customer.
Before State: The numeric confidence of a response is produced by the AI service and recorded in telemetry / the AI Activity Log datamart, but it is not evaluated against a threshold at the BE emit point and never surfaces to a human in real time.
After Delta: When a response's confidence is below a configured floor, a low_confidence signal is raised into the pipeline with the confidence value in telemetry. Requires the confidence value to be available at the BE emit point (dependency/assumption; see Sections 13 and 15).
Importance: Should Have
Mockup / Technical Notes: Figma: N/A.

Data Fields:
signal_type (enum, required) — low_confidence
confidence (float, required) — source: AI-service response / Activity Log
confidence_floor (float, required) — configured threshold (default TBD with Data/ML)
room_id, organization_id, conversation_id (required)

Technical Notes: Gated on confidence being surfaced into the BE response handling (it is not today). Threshold is a global default in Phase 1 (no per-agent tuning — Non-Goal #7).

Acceptance Criteria:
— Happy Path —
• AC-1: Given a response with confidence below the configured floor, when the response is produced, then a low_confidence signal is raised with the confidence value in telemetry.
• AC-2 (boundary): Given confidence exactly equals the floor, when the response is produced, then it is treated as not low-confidence (strictly below triggers).

— Error / Unhappy Path —
• ERR-1: Given the confidence value is absent/null for a response, when evaluation runs, then no low_confidence signal is raised and ai_agent_alert_skipped is incremented with reason confidence_unavailable (fail safe, no false alerts).

— Permission Model —
• CAN: BE raises automatically when confidence is present and below floor.
• CANNOT: manual trigger.
• CANNOT (reversibility): the resulting alert cannot be recalled or edited once delivered; fire-once, auto-expires per notification-service retention.
• Unauthorized: N/A.

— UI States —
• N/A (backend); delivery via S01.

— Negative Scenarios — (from Non-Goals)
• NEG-1: Given confidence is at or above the floor, when the agent answers, then no low_confidence signal is raised.

Dependencies: [MON-S01]; confidence availability at the BE emit point (Sections 13 and 15).
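
The three outcomes in AC-1, AC-2, and ERR-1 reduce to one comparison plus a nil check. The sketch below uses a hypothetical function name, and the floor is passed in because the default is still TBD with Data/ML.

```ruby
# Returns :low_confidence, :ok, or :skipped (confidence unavailable).
def evaluate_confidence(confidence, floor:)
  # Absent value: raise no signal; the caller increments
  # ai_agent_alert_skipped with reason confidence_unavailable.
  return :skipped if confidence.nil?

  # Strictly below the floor triggers; equal to the floor does not.
  confidence < floor ? :low_confidence : :ok
end
```

For example, with an illustrative floor of 0.6, a confidence of 0.59 yields `:low_confidence` while exactly 0.6 yields `:ok`.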


9. Rollout

Field | Value
Feature flag | ai_agent_monitoring_alerts, default: OFF (per organization; see Constraints)
Stage 1 | Internal: 1–2 Mekari-owned test orgs with a Chatbot Specialist as the configured supervisor; verify each of the 4 signals end-to-end (alert received, deep-link opens room, throttle works)
Stage 2 | Closed pilot: 2–3 friendly production customers (Plus/Ultimate/360) running an autonomous agent, with the dependency live; supervisor opt-in
Stage 3 | Targeted rollout: the 26Q2 production autonomous agents (toward the 15 target customer IDs), enabled per org on request
GA | All eligible (Plus/Ultimate/360) orgs with a configured supervisor, flag on by org setting
Backward compat | Yes, purely additive. With the flag OFF, agent behavior is byte-for-byte unchanged (detection hooks are no-ops).
Migration | None. No data migration. New artifacts (throttle keys, alert records) are created forward-only.
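
The "default OFF, hooks are no-ops" behavior can be sketched as a thin gate in front of the emit step. This is an illustration only: the org_flags dictionary stands in for whatever flag store the backend uses, and make_detection_hook and the emit callback are hypothetical names.

```python
from typing import Callable, Dict

FLAG_NAME = "ai_agent_monitoring_alerts"


def make_detection_hook(
    org_flags: Dict[str, Dict[str, bool]],
    emit: Callable[[dict], None],
) -> Callable[[str, dict], bool]:
    """Return a hook that forwards a signal only for orgs with the flag ON."""

    def hook(organization_id: str, signal: dict) -> bool:
        # Default OFF: a missing org or a missing flag both mean "disabled".
        if not org_flags.get(organization_id, {}).get(FLAG_NAME, False):
            return False  # no-op: nothing is emitted for this org
        emit(signal)
        return True

    return hook
```

Because the gate returns before emit is called, an org without the flag sees no change from the detection hooks, which is the property the backward-compat row relies on.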

10. Observability

Key Events:

Event Name | Trigger | Properties
ai_agent_failure_signal_raised | A detector (S02–S05) raises any signal | signal_type, room_id, organization_id, conversation_id, confidence?, reason?, timestamp
ai_agent_alert_delivered | Alert successfully POSTed to notification-service (2xx) | signal_type, organization_id, sso_id, event_id, timestamp
ai_agent_alert_suppressed | Alert suppressed by throttle/dedup within cooldown | signal_type, room_id, organization_id, timestamp
ai_agent_alert_dropped | No recipient resolved | reason (no_supervisor), organization_id, timestamp
ai_agent_alert_skipped | Signal not evaluated (e.g. confidence unavailable) | reason, signal_type, organization_id, timestamp
ai_agent_alert_delivery_failed | notification-service non-2xx after retries | signal_type, organization_id, http_status, timestamp

Field | Detail
Dashboard owner | Chatbot squad (BOT), with the alert/telemetry stream also landing in the AI Activity Log datamart (Data)
Alert 1 | ai_agent_alert_delivery_failed rate > 5% of attempts in any 1h window → Slack: #chatbot-oncall
Alert 2 | ai_agent_alert_suppressed / ai_agent_failure_signal_raised ratio > 50% sustained 1h (signals being throttled heavily → noisy trigger or looping failure) → Slack: #chatbot-oncall for threshold review
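
The two on-call rules can be expressed as a check over event counts from a single 1h window. The sketch below is illustrative: the function name and return labels are hypothetical, and it assumes "attempts" means delivered plus delivery_failed, borrowing the denominator used for the delivery success rate in Success Metrics.

```python
from typing import List


def oncall_alerts(raised: int, suppressed: int, delivered: int, delivery_failed: int) -> List[str]:
    """Return the names of the alert rules breached by one 1h window of counts."""
    breached: List[str] = []

    # Alert 1: delivery failures exceed 5% of attempts (assumed: delivered + failed).
    attempts = delivered + delivery_failed
    if attempts > 0 and delivery_failed / attempts > 0.05:
        breached.append("delivery_failed_rate")

    # Alert 2: more than half of raised signals were throttled.
    if raised > 0 and suppressed / raised > 0.50:
        breached.append("suppressed_ratio")

    return breached
```

Both comparisons are strict, matching the ">" in the rules, and windows with no traffic breach nothing.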

10.1. Post-Launch Monitoring Cadence

Field | Detail
Review cadence | Weekly for the first 4 weeks post-GA, then monthly
Owner | Dimas (PM) + Chatbot squad
Review scope | ai_agent_failure_signal_raised, ai_agent_alert_delivered, ai_agent_alert_suppressed, ai_agent_alert_delivery_failed, and the ⭐ supervisor time-to-intervene KPI
Trigger threshold 1 | ai_agent_alert_delivery_failed > 5% of attempts in a week → investigate delivery/dependency health immediately
Trigger threshold 2 | Suppressed/raised ratio > 50% for 2 consecutive weeks → revisit cooldown window and per-signal thresholds (noise risk, S15 Risk #4)
Rollback consideration | If delivery failures or alert-storm complaints cannot be resolved within 48h, PM disables ai_agent_monitoring_alerts for the affected org(s) pending root cause.

11. Success Metrics

Efficiency & Impact:

Metric | Definition | Baseline | Target
Supervisor time-to-intervene | Median elapsed time from ai_agent_failure_signal_raised to a human opening/acting on that room | N/A, unmeasurable today (no alert exists); establish baseline in Stage 1–2 | Median ≤ 5 minutes within 90 days of GA
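
As a rough illustration of how this KPI could be computed, the sketch below takes pairs of (signal raised, first human action) timestamps and returns the median gap. The function name is hypothetical, and excluding rooms that no human ever opened is an assumption; the PRD does not yet say how those should be counted.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import Iterable, Optional, Tuple


def median_time_to_intervene(
    pairs: Iterable[Tuple[datetime, Optional[datetime]]],
) -> Optional[timedelta]:
    """Median of (acted_at - raised_at) over signals that a human acted on."""
    durations = [
        (acted_at - raised_at).total_seconds()
        for raised_at, acted_at in pairs
        # Assumption: signals with no human action are left out of the median.
        if acted_at is not None and acted_at >= raised_at
    ]
    if not durations:
        return None  # nothing to measure yet
    return timedelta(seconds=median(durations))
```

With interventions after 2, 4, and 9 minutes plus one room never opened, the function returns 4 minutes, which would meet the ≤ 5 minute target.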

Adoption & Usage:

Metric | Definition | Baseline | Target
Monitored-agent coverage | Share of targeted production autonomous agents with ≥1 supervisor receiving alerts | 0 | 100% of targeted production agents within 30 days of GA
Alert engagement rate | Share of delivered alerts whose deep-link is opened by a supervisor | N/A (new) | ≥ 60% within 60 days of GA

Quality & Accuracy:

Metric | Definition | Baseline | Target
Alert delivery success rate | ai_agent_alert_delivered / (delivered + delivery_failed) | N/A (new) | ≥ 98% within 30 days of GA
Alert noise ratio | Supervisor-reported "not useful" alerts / delivered alerts (sampled) | N/A (new) | ≤ 10% within 60 days of GA (else revisit thresholds)

12. Launch Plan & Stage Gates

Stage | Audience | Duration | Success Gate to Advance | Owner
Internal Alpha | 1–2 Mekari test orgs | 1–2 weeks | All 4 signals deliver end-to-end; throttle verified; 0 P0/P1; agent reply path unaffected with flag ON and OFF | PM + QA
Closed Pilot | 2–3 friendly production customers | 2–3 weeks | Alert delivery success ≥ 98%; supervisor time-to-intervene baseline captured; noise ratio ≤ 10% (sampled); dependency live | PM + CSM
Targeted Rollout | 26Q2 production autonomous agents (toward 15 IDs) | 2–4 weeks | Pilot gates sustained; no alert-storm complaints unresolved > 48h | Eng Lead + PM
GA | All eligible orgs with a configured supervisor | Ongoing | All rollout gates sustained 2 weeks; ⭐ KPI trending to ≤ 5 min | PM
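
The Closed Pilot gate combines four conditions, so it can help to evaluate them together and report which ones block advancement. This is a sketch under stated assumptions: PilotStats and its field names are hypothetical, and the noise ratio is computed over a sampled set of delivered alerts as described in Success Metrics.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PilotStats:
    delivered: int            # count of ai_agent_alert_delivered
    delivery_failed: int      # count of ai_agent_alert_delivery_failed
    sampled_alerts: int       # delivered alerts reviewed with supervisors
    sampled_not_useful: int   # of those, reported "not useful"
    baseline_captured: bool   # time-to-intervene baseline recorded
    dependency_live: bool     # notification-service dependency shipped


def closed_pilot_gate_failures(stats: PilotStats) -> List[str]:
    """Return the Closed Pilot gate conditions that are not yet met."""
    failures: List[str] = []
    attempts = stats.delivered + stats.delivery_failed
    # Integer arithmetic avoids float rounding at the exact 98% and 10% boundaries.
    if attempts == 0 or stats.delivered * 100 < 98 * attempts:
        failures.append("delivery_success")
    if not stats.baseline_captured:
        failures.append("baseline_captured")
    if stats.sampled_alerts == 0 or stats.sampled_not_useful * 100 > 10 * stats.sampled_alerts:
        failures.append("noise_ratio")
    if not stats.dependency_live:
        failures.append("dependency_live")
    return failures
```

An empty list means the pilot can advance; treating "no data yet" as a failure is a deliberate choice in this sketch so that an idle pilot cannot pass by default.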

13. Dependencies

Dependency | Owning Team | Deliverable Needed | Blocking?
notification-service ai_agent_alert category + chat-origin FCM push | Broadcast / notification-service squad | New notif_category (value e.g. 14) seeded via migration and added to the FCM whitelist so chat-origin ai_agent_alert notifications push; documented in a small TECH RFC. Without it, alerts are stored but not pushed. | YES
hub-chat Notification Center rendering of ai_agent_alert | Chat / Inbox squad (hub-chat) | Register the new notif_category in the navbar type lookup (icon + notif_category_label) and ensure the bell + FCM push render it with a working room deep-link, per the S6 contract. hub-chat already does FCM + has the Notification Center, so this is incremental, not net-new. Without it, the alert arrives but renders unlabelled / may not surface. | YES
Supervisor FCM token coverage (user_source="chat") | Broadcast squad + Chatbot | Targeted supervisors must have notification-service FCM tokens registered for chat, or push silently no-ops | YES
Supervisor-role resolution (org → sso_id[]) | Chatbot BE + Platform (role/permission) | A reliable way to resolve "who is the supervisor for org X": existing role, new org setting, or configured list (see S15 #1) | YES
Confidence value at the BE emit point | Data/ML + Chatbot BE | The numeric confidence surfaced into BE response handling so MON-S05 can threshold it (not available today) | NO: only blocks the Should-Have MON-S05, not Phase-1 GA
AI Activity Log datamart write path | Data (BI) | Accept the alert event as the durable system-of-record row (consistent with existing per-response telemetry) | NO: alerts still deliver without it; needed for KPI reporting + Phase 2
Internal API key / auth to notification-service | Broadcast squad | Valid X-Api-Key for chatbot BE → notification-service service-to-service calls | YES

14. Key Decisions + Alternatives Rejected

14a — Decisions Made

Date | Decision | Rationale
2026-06-28 | Detect at the existing failure branches and emit via an async pipeline, out-of-band of the reply path | Detection must never add latency to or block the customer reply; the failure branches are the natural, already-present hooks
2026-06-28 | Target a configured org supervisor, resolved to sso_id[], not the assigned agent | A failed handover may have no assignee; supervision is org-level and must be independent of assignment state
2026-06-28 | Apply per-room/per-signal throttle + dedup (5-min cooldown default) | A looping failure on one room must not flood the supervisor; signal-type granularity still surfaces distinct problems
2026-06-28 | Alert only on abnormal handovers, not every escalation | A normal EVALUATE_ANSWER handoff is expected behavior; alerting on it would be noise (Non-Goal #4)
2026-06-28 | Make MON-S05 (low-confidence) Should-Have, gated on confidence availability | The confidence value isn't surfaced at the BE emit point today; the 3 Must-Have stories deliver value without it
2026-06-28 | Reuse notification-service; keep its change as an external dependency + TECH RFC | Avoids new infra and preserves squad ownership of the delivery channel
2026-06-28 | Render in the existing hub-chat Notification Center, not a new surface; the FE work is a scoped dependency on the Inbox squad | hub-chat already does FCM + a category-aware Notification Center; reusing it means only registering one new category vs. building a notification UI. S6 is the rendering contract
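
The per-room/per-signal throttle decision can be sketched as a cooldown keyed by (room_id, signal_type). This is a minimal in-memory illustration, not the planned implementation: the class name is hypothetical, the dictionary stands in for whatever shared store holds the throttle keys, and measuring the cooldown from the last delivered alert (rather than from the last suppressed one) is an assumption.

```python
from typing import Dict, Tuple


class AlertThrottle:
    """Allow at most one alert per (room, signal type) within the cooldown."""

    def __init__(self, cooldown_seconds: float = 300.0) -> None:
        # 300 s mirrors the 5-minute default cooldown in the decision above.
        self._cooldown = cooldown_seconds
        self._last_sent: Dict[Tuple[str, str], float] = {}

    def should_deliver(self, room_id: str, signal_type: str, now: float) -> bool:
        key = (room_id, signal_type)
        last = self._last_sent.get(key)
        if last is not None and now - last < self._cooldown:
            # Inside the cooldown: the caller would record ai_agent_alert_suppressed.
            return False
        self._last_sent[key] = now
        return True
```

A repeat of the same signal on the same room 299 seconds later is suppressed, while a different signal type on that room, or the same signal on another room, still goes through. That is the granularity the rationale describes: one looping failure cannot flood the supervisor, but distinct problems remain visible.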

14b — Alternatives Rejected

Alternative | Why Rejected | Date
Emit synchronously inside the reply path | Risks adding latency to or breaking the customer-facing response; unacceptable for a monitoring feature | 2026-06-28
Emit only from the central chatbot_ai_deduction_worker (single chokepoint) | Async + billing-coupled; loses real-time value and conflates monitoring with billing telemetry. Kept as an audit backstop, not the primary path | 2026-06-28
Alert the currently-assigned agent instead of a supervisor | No assignee on failed handovers; not true supervision | 2026-06-28
Alert on every occurrence (no throttle) | A repeated failure becomes a notification storm; supervisors would mute the channel | 2026-06-28
Build a dedicated AI-alerting service | Duplicates notification-service storage + FCM for the same outcome; far more infra | 2026-06-28

15. Open Questions

# | Type | Question | Owner | Deadline
1 | Open Question | How is "supervisor" defined and resolved to sso_id[] per org: existing role/permission, a new org setting, or an explicitly configured list? Determines the MON-S01 resolution logic. | Dimas + Chatbot BE | 2026-07-11
2 | Open Question | What is the exact assign_reason allow/deny list that separates "unexpected" from "expected" handovers (MON-S03)? And how do we treat an unknown/new reason: default alert or default silent? | Dimas + Chatbot BE | 2026-07-11
3 | Assumption | The numeric confidence value is available (or can be cheaply surfaced) at the BE emit point for MON-S05; default confidence_floor to be set with Data/ML. If false, MON-S05 slips to a fast-follow. | Dimas + Data/ML | 2026-07-18
4 | Risk | Alert noise: too-broad triggers cause supervisors to mute the channel. Mitigation: per-room throttle + dedup (MON-S01), abnormal-only handover (MON-S03), noise-ratio metric + threshold-review cadence (S10.1), and feature flag to disable per org. | Dimas | 2026-07-18
5 | Risk | Push not delivered: supervisors lack user_source="chat" FCM tokens, so alerts store but never push. Mitigation: confirm token coverage in Stage 1 (S13 blocking dep); fall back to in-app notification list until tokens exist. | Dimas + Broadcast squad | 2026-07-18
6 | Assumption | notification-service 7-day chat retention is acceptable for live alerts because the durable record lives in the AI Activity Log datamart. | Dimas | 2026-07-11
7 | Open Question | Do supervisors also need the alert on the mobile inbox app (not just hub-chat web)? If yes, mobile FCM rendering of ai_agent_alert is an additional dependency. | Dimas + Chat/Mobile squad | 2026-07-18

Types: Assumption · Open Question · Risk


PRD CHANGELOG

Version | Date | By | Section | Type | Summary
1.0 | 2026-06-28 | Claude | All | CREATED | Initial NEW PRD generated from grounding analysis of chatbot BE emit chokepoints + notification-service delivery channel
1.1 | 2026-06-28 | Claude | S8 | MODIFIED | Added explicit alert-reversibility (CANNOT) line to all 5 stories and generated the S8 system-flow Mermaid diagram, per score-prd Layer 2.5 Q2 + diagram coverage
1.2 | 2026-06-28 | Claude | S6, S8, S13, S14, S15 | MODIFIED | Corrected the "backend-only" framing: documented the hub-chat Notification Center as the supervisor-facing surface with a rendered look + field mapping (S6), added the hub-chat FE rendering as a blocking dependency (S13), added Frontend to scope_changes, updated the system flow + diagram, and logged the mobile-inbox open question (S15)