Qontak | AI Agent | Testing — Phase 1: Historical Validation
Historical Validation is the Phase 1 PRD under the AI Agent: Testing ANCHOR. It covers the first test-case source ("Generate from inbox"): sample resolved human-handled conversations, generate AI shadow answers, compare them side-by-side with the human "golden" answer, rate per question, and roll up a confidence score. Imported from Confluence and reconciled against code (chatbot, chatbot-fe, qontak-designer).
HEADER BLOCK
| Field | Value |
|---|---|
| PM | Dimas Fauzi Hidayat |
| PRD Version | 1.0 |
| Status | DRAFT |
| PRD Type | PHASE |
| Epic | BOT-3351 |
| Squad | BOT |
| RFC Link | Related: docs/rfcs/ai-agent-advanced-settings-p2-be-v2.md (chatbot repo) — dedicated RFC to be created |
| Figma Master | Figma — Bot · AI Agent Testing |
| Anchor | AI Agent: Testing — ANCHOR (Confluence) |
| Labels | epic:qontak-chatbot, module:ai-agent, feature:ai-agent-testing |
| Last Updated | 2026-06-18 |
Scope Changes
Backend · Frontend · Data — new Testing page + test-case endpoints (chatbot, chatbot-fe), and the sampling/shadow-generation pipeline (Data).
2. Phase Context
- Anchor PRD: AI Agent: Testing — ANCHOR
- Phase Number: Phase 1 of 3 (phasing by test-case source: Historical → Knowledge → Imported)
- Phase Goal: Validate AI Agent quality against a sample of historical, resolved human conversations so SPV/Admin can confidently go live. (Matches the ANCHOR Phase Index Goal for Phase 1.)
- Prior phases: N/A — this is Phase 1, no prior phases.
- This phase: The Testing page + the "Generate from inbox" test-case source — sample resolved human-handled rooms, generate AI shadow answers, compare side-by-side vs the human "golden" answer, rate per question, roll up a confidence score.
- Deferred to next: Phase 2 "Generate from knowledge" and Phase 3 "Imported question list" test-case sources (scaffolded in design/code but out of scope here).
- Cross-phase deps: Test cases bind to a specific AI Agent version (ai_agent_history_id). The ai_agent_test_cases / ai_agent_test_case_questions schema established here must remain stable for Phase 2/3, which reuse the same tables and comparison UI.
3. One-liner + Problem
One-liner: Let Qontak SPVs and Admins validate an AI Agent against a sample of their own resolved conversations — comparing AI answers to their human agents' — before going live.
Problem: Before activating an AI Agent, SPV/Admins have no way to preview its quality at scale, so they run ~6-hour/day manual "War Rooms" during onboarding to catch errors. This phase gives them a self-serve, evidence-based comparison against historical "golden" human answers, so they can reach a confidence bar and activate without babysitting. For full initiative context, see the ANCHOR PRD.
4. Target Users + Persona Context
| Persona | Role | Goal | Pain | Workaround |
|---|---|---|---|---|
| Primary — SPV / Chatbot Admin | SPV / Chatbot Admin responsible for customer-interaction quality | Reach a confidence bar that proves the AI is safe to launch, then activate it | No scalable way to preview AI answers on real customer questions before go-live | ~6 hours/day of manual War Room monitoring during onboarding |
| Secondary — Super Admin | Decision-maker who purchased the AI module | Activate the subscription fully once the SPV signs off | No objective proof to justify activation | Waits on the SPV "green light"; keeps AI inactive |
5. Non-Goals
- Live Shadow Mode — not a real-time system running in parallel during actual customer chats. This phase is strictly historical.
- Model fine-tuning UI — answers are fixed by updating the Knowledge Base/Context, not by "teaching" the AI in this workspace.
- Multi-modal validation — MVP is text-only; tickets with images/voice/attachments are excluded.
- Knowledge-based and imported test sets — deferred to Phase 2 and Phase 3.
- Editing the AI's answer in the workspace — the comparison view is read-only.
- Mobile app — Testing is web-only in this phase.
6. Constraints
- Platform: Web only (Qontak web app — Bot Automation → Testing).
- Performance: Batch generation is asynchronous (~2–5 min for ~50 items). Must respect the LLM provider's TPM/RPM so it never blocks live production traffic. Historical room fetch should read from a replica DB where available, to avoid slowing the live inbox.
- Data limits: Lookback = last 90 days. Sample = 10% of eligible rooms, capped at 50–70 rooms per batch (≤50 shown when a batch exceeds 100). Text-only tickets.
- Plan scope: All plans with the AI Agent enabled.
- Feature flag: ai_agent_testing | default: OFF. Enabled per organization during beta.
- Read/write: Roles owner / supervisor / admin can read + write (generate, rate). Standard agents have no access (menu hidden). Enforced server-side via set_role on every test-case endpoint.
- Data lifecycle: ai_agent_test_cases and ai_agent_test_case_questions are soft-deleted (deleted_at, acts_as_paranoid) when a user deletes a test case; the transient LLM inference payload is not persisted beyond the response. Hard-purge window TBD (Open Question #8).
7. Feature Changes
CHG-001 — Confidence score surfaced in the Tree Diagram
- Change Type: Modified component (Bot Flow tree diagram).
- Page: /bot-automation/bot-flow/{id} (Tree Diagram).
- Before: Adding/selecting an AI Agent node shows the agent with no quality signal.
- After: When a tested AI Agent is selected, the node shows its average Confidence Score from testing (average across that agent's completed test cases).
| Element | Before | After |
|---|---|---|
| AI Agent node (Tree Diagram) | No confidence indicator | Shows average Confidence Score from testing per AI Agent |
Backend touchpoint: get_tree_diagram_v3 (chatbot). JIRA: BOT-3976.
8. New Features
Feature: AI Agent Testing page
- URL: /bot-automation/testing. Top-level item under Bot Automation (AI agents · Resources · Actions · Testing · Analytics · Bot flow). Test-case detail opens from a row.
- Access: owner / supervisor / admin only. Standard agents: menu hidden.
Component tree:
- TestingPage (/bot-automation/testing)
  - PageHeader → "Generate test case" button → GenerateTestCaseModal
  - FilterToolbar: search + filter
  - TestCasesTable: columns Test case name · Testing type · Score · Status · Last updated · Actions
  - GenerateTestCaseModal: source picker
    - "Generate from Inbox" → GenerateFromInboxDrawer (this phase)
    - "Generate from knowledge" → GenerateFromKnowledgeDrawer (Phase 2, hidden/disabled)
    - "Imported question list" → UploadManuallyDrawer (Phase 3, hidden/disabled)
  - TestCaseGeneratingModal: async progress while the batch runs
- TestCaseDetail (per test case)
  - QuestionList (grouped by topic)
  - ComparisonView: Human answer (left) vs AI answer (right) + metrics + thumbs
UI States:
- Empty: "No test cases yet" illustration + helper; primary action = Generate test case. No table.
- Loading: skeleton rows while the list is fetched.
- Error: blank slate "Couldn't load test cases" + Retry. Log: ai_workspace_load_failed.
- Success: table of test cases with type, score, status, last updated, row actions.
Figma: Testing page — node 16743-298263. Code refs: qontak-designer app/pages/bot-automation/testing/index.vue + app/components/bot-automation/testing/*.
9. API & Webhook Behavior
Base path: /api/frontend_service/v1/ai_agent. All endpoints gated server-side to owner / supervisor / admin via set_role. Technical fields (JSON schemas, error codes) resolved during RFC.
| # | Behavior | Entity Affected | Triggered By | Expected Behavior | Failure Behavior |
|---|---|---|---|---|---|
| 1 | Create test case (triggers batch) | New ai_agent_test_cases row (bound to ai_agent_id + ai_agent_history_id) | SPV/Admin clicks Generate in the Generate-from-Inbox drawer (body: type, version_id) | Test case created (status processing); Sidekiq FetchRoomConversationsWorker (queue :ai_agent) enqueued to fetch assigned rooms, extract Q/A pairs, and (target) generate AI shadow answers; UI shows progress | Version not found → 404, no test case. Create fails → 422. Worker error on a room → skip + Rollbar; batch continues |
| 2 | List / poll test cases | Read ai_agent_test_cases (paginated) | Opening Testing page; polling during a batch | Returns test cases with status + confidence_score; UI polls status until completed | Unauthorized role → forbidden (menu should not have been visible) |
| 3 | Get test-case detail | Read one ai_agent_test_cases + its ai_agent_test_case_questions | SPV/Admin opens a test case | Returns each question: topic, question, answer (AI), parameters.human_answer (golden), response_time, confidence, sources, score, status | Test case not found → 404 |
| 4 | Rate a question | Update one ai_agent_test_case_questions row | SPV/Admin clicks thumbs up/down (score 0/1 + scored_by metadata) | Saves score + is_score + scored_by/scored_at; (target) recomputes the test case aggregate confidence_score | Question not found → 404. Invalid score (not 0/1) → 422 |
Implementation note: today RateTestCaseQuestion only persists the per-question score; the aggregate confidence_score recompute is not yet wired (Open Question #2).
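The rating behavior in row 4 can be sketched as a small validation step. This is an illustrative Python sketch, not the chatbot implementation: the function name rate_question, the dict shape, and the 200 success code are assumptions, and is_score is assumed to mean "this question has been rated". Only the 0/1 rule, the 422 on invalid input, and the score / is_score / scored_by / scored_at fields come from the table above.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Tuple


def rate_question(question: Dict[str, Any], score: Any, user_id: str) -> Tuple[int, Dict[str, Any]]:
    """Apply a thumbs rating and return (http_status, question).

    Hypothetical helper: mirrors row 4 (score must be 0 or 1, otherwise 422).
    """
    # Reject booleans explicitly, since True == 1 in Python, then anything not 0/1.
    if isinstance(score, bool) or score not in (0, 1):
        return 422, question  # invalid score: the row is left unchanged

    updated = dict(
        question,
        score=score,
        is_score=True,  # assumption: flags the question as rated
        scored_by=user_id,
        scored_at=datetime.now(timezone.utc).isoformat(),
    )
    return 200, updated  # assumption: success code is not specified in the PRD
```

The sketch deliberately does not touch confidence_score, matching the implementation note that the aggregate recompute is not yet wired.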
10. System Flow + User Stories + ACs
10.1 System Flow
- SPV/Admin opens Bot Automation → Testing.
- Clicks "Generate test case" → chooses "Generate from inbox".
- Names the test case, selects the AI Agent version, clicks Generate.
- System creates the test case (status processing) and enqueues FetchRoomConversationsWorker.
- Worker fetches assigned rooms (last 90 days), extracts customer→agent question/answer pairs, (target) samples 10% capped at 50–70, and generates an AI shadow answer per question.
- Decision point: if total eligible rooms < 10 → use all available rooms.
- Failure branch: if a room fetch / LLM call fails → skip that room, log to Rollbar, continue.
- On completion, status → completed; UI surfaces the test case with questions grouped by topic.
- SPV/Admin opens a question → sees Human answer (left) vs AI answer (right) + confidence / response-time / sources.
- SPV/Admin rates each answer thumbs up/down → (target) confidence meter updates.
- When the meter reaches ≥80%, the agent is "Ready to Launch"; SPV/Admin proceeds to activate.
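The worker step and its failure branch can be sketched as a loop that skips a failing room, logs it, and continues. This is a minimal Python sketch under stated assumptions: run_batch and its three injected callables (fetch_pairs, generate_answer, log_error) are hypothetical stand-ins for the room fetch, the LLM call, and Rollbar logging; the real worker is Sidekiq-based. Note that nothing in the loop sends a message to a customer; results are only collected as question rows.

```python
from typing import Any, Callable, Dict, Iterable, List


def run_batch(
    room_ids: Iterable[str],
    fetch_pairs: Callable[[str], List[Dict[str, str]]],
    generate_answer: Callable[[str], str],
    log_error: Callable[[str, Exception], None],
) -> Dict[str, Any]:
    """Build shadow-answer rows per room; skip and log any room that fails."""
    questions: List[Dict[str, Any]] = []
    skipped: List[str] = []
    for room_id in room_ids:
        try:
            rows = []
            for pair in fetch_pairs(room_id):
                rows.append({
                    "room_id": room_id,
                    "question": pair["question"],
                    "answer": generate_answer(pair["question"]),  # AI shadow answer
                    "parameters": {"human_answer": pair["human_answer"]},  # golden answer
                })
        except Exception as exc:  # room fetch or LLM call failed
            log_error(room_id, exc)
            skipped.append(room_id)
            continue  # the batch keeps going with the next room
        questions.extend(rows)
    return {"status": "completed", "questions": questions, "skipped_rooms": skipped}
```

Collecting a room's rows before extending the result means a room that fails halfway contributes nothing, which is one way to read "skip that room".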
10.2 User Stories
All 10 stories carry their original JIRA tickets. MoSCoW preserved from the source "Importance". Implementation-status notes flag where ACs describe target behavior not yet in code.
AITEST-S01 — Workspace access control | Must Have
Story: As an SPV/Admin, I want to access the Testing page, so that I can validate AI performance before activation.
Before: No Testing page exists; there is no place to validate an AI Agent against history. After: A Bot Automation → Testing page is visible to owner/supervisor/admin only.
Data Fields: ai_agent_id (uuid, required — route), role (enum, required — auth session)
Happy Path:
- AC-1: Given I am a Super Admin or Supervisor, when I open Bot Automation, then I see the "Testing" menu item and can open the Testing page.
- AC-2: Given I am on the Testing page, when it loads, then I can create a new test case and open existing ones.
Error Path:
- ERR-1: Given the test-case list fails to load, when the page renders, then I see a "Couldn't load" blank slate with Retry, and event ai_workspace_load_failed is logged.
Permission Model: CAN: owner/supervisor/admin. CANNOT: standard agents. Unauthorized: Testing menu not rendered; direct route forbidden.
UI States: Loading (skeleton rows), Empty ("No test cases yet" + Generate CTA), Error (blank slate + Retry), Success (test-case table).
Figma: node 16743-298263. Dependencies: None.
AITEST-S02 — Historical data sampling (10% rule) | Must Have
Story: As an SPV/Admin, I want the system to fetch a 10% sample of past resolved chats, so that I have a "golden standard" to test against.
Before: No mechanism to pull historical conversations into a test set. After: Generating a test case samples eligible resolved, human-handled rooms from the last 90 days.
Data Fields: lookback_days (int, fixed 90), sample_pct (int, 10), cap (int, 50–70)
Happy Path:
- AC-1: Given 200 eligible human-handled rooms in the last 90 days, when I generate a validation set, then the system selects ~10% (≈20) at random.
- AC-2: Given fewer than 10 eligible rooms, when I generate, then the system uses 100% of available rooms.
- AC-3: Given 5,000 eligible rooms, when I generate, then the system caps the sample at 50–70 rooms (≤50 shown when the batch exceeds 100).
Error Path:
- ERR-1: Given the room-list API is unavailable, when generation runs, then the test case surfaces an error state and can be retried.
Permission Model: CAN: owner/supervisor/admin. CANNOT: standard agents.
UI States: Loading (generating modal), Empty ("no eligible rooms found"), Error (retry), Success (sample loaded).
Figma: node 17699-52615. Dependencies: AITEST-S03; Chat Service room APIs (§15).
Implementation status: the worker currently fetches assigned rooms (90 days, paginated, LIMIT=100) and extracts Q/A pairs, but the 10% sampling + 50–70 cap are not yet implemented.
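Since the sampling rule is not yet built, the three ACs can be read as the following Python sketch. The function name sample_rooms and its parameters are hypothetical. The cap default of 50 is an assumption (the PRD gives a 50–70 range without fixing a value), and rounding the 10% up to a whole room is also an assumption.

```python
import math
import random
from typing import List, Optional, Sequence


def sample_rooms(
    eligible: Sequence[str],
    sample_pct: int = 10,
    cap: int = 50,        # assumption: any value in the PRD's 50-70 range
    min_pool: int = 10,   # below this, use every eligible room (AC-2)
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Pick a random sample of eligible room ids following AC-1 to AC-3."""
    rng = rng or random.Random()
    if len(eligible) < min_pool:
        return list(eligible)  # AC-2: fewer than 10 rooms, use 100%
    target = math.ceil(len(eligible) * sample_pct / 100)  # AC-1: ~10%
    return rng.sample(list(eligible), min(target, cap))   # AC-3: apply the cap
```

One consequence worth noting for the Data team: under this reading, an org with 9 eligible rooms yields 9 items while an org with 50 yields 5, so the "use all" threshold and the percentage may need to be reconciled.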
AITEST-S03 — Data integrity & filtering | Must Have
Story: As the System, I must filter out invalid or non-compatible chats, so that the test set is representative and safe.
Before: No filtering rules for what counts as a valid test conversation. After: Bot-only and non-text conversations are excluded from the sample.
Data Fields: participant_type (enum: customer/agent/system), message_type (enum, must be text)
Happy Path:
- AC-1: Given a room resolved entirely by a bot with no human reply, when sampling runs, then that room is excluded.
- AC-2: Given a room with at least one human-agent text reply to a customer question, when sampling runs, then it is eligible.
- AC-3: Given a ticket containing an image, voice note, or attachment, when the sample is selected, then that ticket is skipped in favor of text-only inquiries.
Error Path:
- ERR-1: Given a room returns zero eligible question/answer pairs, when extraction runs, then the room contributes no items and is not counted toward the sample.
Permission Model: System rule (runs under an owner/supervisor/admin-triggered batch).
UI States: N/A — server-side filtering; result reflected in the generated sample.
Dependencies: AITEST-S02.
Code: ExtractConversationPairs pairs a customer question with the next agent text reply; status filter = assigned (human-handled) rooms.
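The pairing and filtering rules can be illustrated with a short Python sketch. It is not the ExtractConversationPairs code: the function name extract_pairs and the message dict shape (participant_type, message_type, text) are assumptions based on the Data Fields above. It also takes one reading of AC-3, namely that any non-text message disqualifies the whole room; a looser reading would drop only the non-text messages.

```python
from typing import Dict, List


def extract_pairs(messages: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Pair each customer question with the next human-agent text reply."""
    # Assumed reading of AC-3: a room containing any non-text message is skipped.
    if any(m["message_type"] != "text" for m in messages):
        return []

    pairs: List[Dict[str, str]] = []
    pending_question = None
    for m in messages:
        if m["participant_type"] == "customer":
            # Simplification: a later customer message replaces an unanswered one.
            pending_question = m["text"]
        elif m["participant_type"] == "agent" and pending_question is not None:
            pairs.append({"question": pending_question, "human_answer": m["text"]})
            pending_question = None
        # "system" (bot) messages never count as a golden answer.
    return pairs
```

A room that returns an empty list here matches AC-1 and ERR-1: it contributes no items and is not counted toward the sample.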
AITEST-S04 — Shadow mode execution (zero leakage) | Must Have
Story: As a PM, I want the AI to generate responses in shadow mode without messaging actual customers, so that testing is safe.
Before: The AI Agent only answers live customers. After: The AI generates answers to historical inquiries in shadow mode, never sent to customers.
Data Fields: answer (text), confidence (int), sources (jsonb), status (enum)
Happy Path:
- AC-1: Given the AI generates a shadow response for a 3-month-old inquiry, when generation completes, then the send_message service is NOT triggered and no notification/email reaches the customer.
- AC-2: Given a shadow response is generated, when it is stored, then the AI answer is saved on ai_agent_test_case_questions.answer and the human answer on parameters.human_answer, not in the live conversation log.
Error Path:
- ERR-1: Given the LLM call fails for an inquiry, when generation runs, then that question is marked failed with a status_description and the batch continues.
Permission Model: CAN: system (triggered by owner/supervisor/admin). Unauthorized: not executed if flag OFF.
UI States: Loading (per-question generating), Empty (N/A), Error (per-question failed badge), Success (answer + confidence + sources shown).
Figma: node 16514-155786. Dependencies: AITEST-S03; LLM/AI service (§15).
AITEST-S05 — Side-by-side validation UI | Must Have
Story: As an SPV/Admin, I want to compare human vs AI responses side-by-side, so that I can judge accuracy effectively.
Before: No way to compare AI vs human answers. After: A side-by-side comparison per question, grouped by topic.
Data Fields: question (text), answer (text), parameters.human_answer (text), confidence (int), response_time (int), sources (jsonb), topic (string)
Happy Path:
- AC-1: Given a completed batch, when I open a validation item, then I see the inquiry at the top, the human response (golden standard) on the left, and the AI response on the right.
- AC-2: Given the AI response card, when it renders, then it shows Confidence, response time, and cited Sources.
- AC-3: Given questions in the test case, when the list renders, then questions are grouped by topic.
Error Path:
- ERR-1: Given a question failed shadow generation, when I open it, then the AI side shows a "could not generate" state instead of a blank panel.
Permission Model: CAN: owner/supervisor/admin. CANNOT: standard agents.
UI States: Loading (skeleton), Empty ("no questions in this test case"), Error (retry), Success (comparison rendered).
Figma: node 16514-155786. Dependencies: AITEST-S04.
Note: schema stores confidence + sources; there is no separate "relevance" field. If a relevance metric is required it must live in parameters (Open Question #6).
AITEST-S06 — Confidence meter & feedback | Must Have
Story: As an SPV/Admin, I want to rate AI responses, so that the system can calculate a confidence meter for launch readiness.
Before: No rating or roll-up of AI answer quality. After: Each answer can be rated thumbs up/down; ratings roll up into a confidence meter.
Data Fields: score (int 0/1, required), is_score (boolean), confidence_score (int — test-case aggregate)
Happy Path:
- AC-1: Given I am viewing a comparison, when I click thumbs up, then the item is marked "Pass" (score = 1) and the meter increments.
- AC-2: Given I previously marked an item thumbs down (score = 0), when I change it to thumbs up, then the aggregate confidence meter recalculates immediately.
- AC-3: Given a test case with N rated items, when the meter renders, then it equals (thumbs-up ÷ total sample) × 100, with <80% = Low Confidence and ≥80% = Ready to Launch.
Error Path:
- ERR-1: Given the rating save fails, when I click thumbs up/down, then the previous state is restored and an inline error is shown.
Permission Model: CAN: owner/supervisor/admin. CANNOT: standard agents.
UI States: Loading (saving rating), Empty (meter at 0% before any rating), Error (save failed inline), Success (meter updated).
Figma: node 16514-155786. Dependencies: AITEST-S05.
Implementation status: rating persists the per-question score; the confidence_score aggregate recompute is not yet wired (Open Question #2).
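The AC-3 formula can be made concrete with a small Python sketch of the not-yet-wired recompute. The function name confidence_meter is hypothetical. Unrated questions (score of None) are assumed to stay in the denominator, following the "total sample" wording, and the threshold is a parameter because Open Question #4 leaves it undecided whether 80% is fixed.

```python
from typing import Iterable, Optional, Tuple


def confidence_meter(scores: Iterable[Optional[int]], threshold: int = 80) -> Tuple[float, str]:
    """Return (percentage, label) where percentage = thumbs-up / total sample * 100."""
    scores = list(scores)
    total = len(scores)
    if total == 0:
        return 0.0, "Low Confidence"  # empty state: meter at 0%
    thumbs_up = sum(1 for s in scores if s == 1)
    pct = thumbs_up * 100 / total  # unrated (None) items count in the denominator
    label = "Ready to Launch" if pct >= threshold else "Low Confidence"
    return pct, label
```

For example, 8 thumbs-up out of a 10-question sample gives 80% and "Ready to Launch", while 7 thumbs-up with 2 questions still unrated gives 70% and "Low Confidence".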
AITEST-S07 — Activation gatekeeping | Should Have
Story: As a PM, I want to prevent "Go Live" until the AI reaches a safe quality threshold, so that low-quality agents don't go live.
Before: An AI Agent can be activated regardless of any test result. After: Activation is gated until the confidence meter reaches the ≥80% threshold.
Data Fields: confidence_score (int), threshold (int, default 80)
Happy Path:
- AC-1: Given the confidence meter is <80% (Low Confidence), when I open the agent's main settings, then the "Activate Agent" button is disabled/greyed out.
- AC-2: Given the confidence meter is ≥80%, when I open main settings, then "Activate Agent" is enabled.
Error Path:
- ERR-1: Given I attempt to activate via API while <80%, when the request is made, then it is rejected with a clear reason.
Permission Model: CAN: owner/supervisor/admin. CANNOT: standard agents.
UI States: Disabled (locked) below threshold; Enabled at/above threshold.
Figma: node 16514-155786. Dependencies: AITEST-S06.
Implementation status: publish_ai_agent.rb does not check any confidence threshold today, so no activation gate exists yet (Open Question #2). Should-Have for this phase.
AITEST-S08 — Background processing (async) | Must Have
Story: As a user, I want large batches processed in the background, so that the UI doesn't freeze.
Before: No batch generation pipeline. After: Batches run in a background queue so the UI never blocks.
Data Fields: status (enum: processing/completed/failed), test_case_id (uuid)
Happy Path:
- AC-1: Given I trigger a batch of ~50 items, when the request is sent, then the UI shows a progress/processing state and remains responsive.
- AC-2: Given the batch is running, when I poll the test case, then its status reflects progress until completed.
Error Path:
- ERR-1: Given a worker job errors, when it fails, then the failure is logged (Rollbar) and the test case surfaces an error/partial state rather than hanging.
Permission Model: CAN: owner/supervisor/admin.
UI States: Loading (TestCaseGeneratingModal / progress), Empty (N/A), Error (failed/partial badge), Success (completed).
Figma: node 16514-155786. Dependencies: AITEST-S02, AITEST-S04.
Code: implemented with Sidekiq (FetchRoomConversationsWorker, queue :ai_agent), not Kafka.
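The client side of AC-2 can be sketched as a polling loop that stops on a terminal status. This Python sketch is illustrative only: poll_test_case, the 5-second interval, and the 10-minute timeout are assumptions, and fetch_status is a hypothetical callable standing in for the list/detail endpoint. The status values come from the Data Fields above.

```python
import time
from typing import Any, Callable, Dict

TERMINAL_STATUSES = {"completed", "failed"}


def poll_test_case(
    fetch_status: Callable[[], Dict[str, Any]],
    interval_s: float = 5.0,    # assumption: not specified in the PRD
    timeout_s: float = 600.0,   # assumption: generous vs the ~2-5 min batch time
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Poll until the test case is completed/failed, or raise on timeout."""
    waited = 0.0
    while True:
        test_case = fetch_status()
        if test_case["status"] in TERMINAL_STATUSES:
            return test_case
        if waited >= timeout_s:
            # Surfaces an error state instead of hanging (ERR-1).
            raise TimeoutError("test case still processing after timeout")
        sleep(interval_s)
        waited += interval_s
```

Injecting sleep keeps the loop testable without real delays; in the UI the same shape would typically be a timer-driven refetch.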
AITEST-S09 — Manual override & audit | Could Have
Story: As an Admin, I want to force-activate the AI with a business justification, so that I'm not blocked when I have a valid reason.
Before: No override path if the gate blocks a justified activation. After: Admins can force-activate below threshold with a logged reason.
Data Fields: override_reason (text, required), actor_id (uuid), score_at_override (int)
Happy Path:
- AC-1: Given the meter is below 80%, when I choose "Force Activate", then I must provide a reason before the action proceeds.
- AC-2: Given I provide a reason and confirm, when force-activation completes, then the action is recorded in an audit trail (who, when, reason, score at time).
Error Path:
- ERR-1: Given I submit force-activate without a reason, when I confirm, then the action is blocked with a validation message.
Permission Model: CAN: owner/admin. CANNOT: supervisor (override is admin-level). Unauthorized: option not shown.
UI States: Loading (submitting), Empty (N/A), Error (validation), Success (activated + audit entry).
Figma: node 16514-155786. Dependencies: AITEST-S07.
Implementation status: depends on the (not-yet-built) activation gate. Could-Have for this phase.
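The gate from AITEST-S07 and the override from this story can be sketched together as one decision function. Neither exists in code today, so everything here is a hypothetical design sketch in Python: decide_activation, the error strings, and the overridden_at field are assumptions. The role sets, the required reason, and the audit fields (actor_id, override_reason, score_at_override) come from the stories.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional

ACTIVATE_ROLES = {"owner", "supervisor", "admin"}
OVERRIDE_ROLES = {"owner", "admin"}  # supervisors cannot force-activate


def decide_activation(
    confidence_score: int,
    role: str,
    actor_id: str,
    override_reason: Optional[str] = None,
    threshold: int = 80,
) -> Dict[str, Any]:
    """Return whether activation is allowed, plus an audit entry on override."""
    if role not in ACTIVATE_ROLES:
        return {"allowed": False, "error": "forbidden", "audit": None}
    if confidence_score >= threshold:
        return {"allowed": True, "error": None, "audit": None}
    if override_reason is None:
        return {"allowed": False, "error": "confidence below threshold", "audit": None}
    if role not in OVERRIDE_ROLES:
        return {"allowed": False, "error": "override not permitted for role", "audit": None}
    if not override_reason.strip():
        return {"allowed": False, "error": "override reason is required", "audit": None}
    audit = {
        "actor_id": actor_id,
        "override_reason": override_reason.strip(),
        "score_at_override": confidence_score,
        "overridden_at": datetime.now(timezone.utc).isoformat(),  # assumed field name
    }
    return {"allowed": True, "error": None, "audit": audit}
```

Running this check server-side, in addition to disabling the button, is what would satisfy AITEST-S07 ERR-1 for direct API calls.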
AITEST-S10 — Confidence score in Tree Diagram | Should Have
Story: As an SPV/Admin or Bot Specialist, I want to see the confidence score per AI Agent in the Tree Diagram, so that I can judge quality while composing a flow.
Before: Tree Diagram AI Agent nodes show no quality signal. After: A selected, tested AI Agent shows its average confidence score from testing.
Data Fields: ai_agent_id (uuid), avg_confidence_score (int, computed)
Happy Path:
- AC-1: Given I am on the Tree Diagram page, when I click +, choose "AI Agent", and select a configured agent, then the node shows that agent's Confidence Score from testing.
- AC-2: Given an agent has multiple completed test cases, when the score renders, then it is the average across those test cases.
- AC-3: Given an agent has no completed test cases, when selected, then the node shows "no score yet" rather than 0%.
Error Path:
- ERR-1: Given the confidence lookup fails, when the node renders, then it falls back to "no score yet" rather than erroring the diagram.
Permission Model: CAN: owner/supervisor/admin/bot-specialist.
UI States: Loading (node fetching), Empty ("no score yet"), Error (falls back to "no score yet"), Success (average score shown).
Figma: node 16514-155786. Dependencies: AITEST-S06. Backend: get_tree_diagram_v3.
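The averaging rule in AC-2 and the "no score yet" fallback in AC-3 can be sketched as follows. This is a Python illustration, not the get_tree_diagram_v3 logic: the function names and the test-case dict shape are assumptions, and rounding the average to a whole number is assumed from the int type of avg_confidence_score.

```python
from typing import Any, Dict, List, Optional


def average_confidence(test_cases: List[Dict[str, Any]]) -> Optional[int]:
    """Average confidence_score over completed test cases; None if there are none."""
    scores = [
        tc["confidence_score"]
        for tc in test_cases
        if tc["status"] == "completed" and tc.get("confidence_score") is not None
    ]
    if not scores:
        return None
    return round(sum(scores) / len(scores))  # assumption: rounded to an integer


def node_label(test_cases: List[Dict[str, Any]]) -> str:
    """Text shown on the AI Agent node: a percentage or 'no score yet'."""
    avg = average_confidence(test_cases)
    return "no score yet" if avg is None else f"{avg}%"
```

Returning None rather than 0 keeps an untested agent visually distinct from an agent that scored 0%.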
Negative Scenarios
- NEG-1: Given I am a standard agent, when I look for AI Agent testing, then the Testing menu is not rendered and the route is forbidden.
- NEG-2: Given a conversation contains only images/voice/attachments, when sampling runs, then it is excluded from the test set.
- NEG-3: Given I am in the comparison view, when I try to edit the AI's answer, then no edit affordance exists — I must update the Knowledge Base instead.
- NEG-4: Given the current phase, when I open "Generate test case", then "Generate from knowledge" and "Imported question list" are not selectable (deferred to Phase 2/3).
11. Rollout
- Feature flag: ai_agent_testing | default: OFF. Enabled per organization during beta.
- Stage 1: Internal, telesales POC org(s) (the original POC team).
- Stage 2: Closed beta, 3–5 Enterprise/Pro orgs (manually enabled).
- Stage 3: All orgs with AI Agent enabled, on request.
- GA: All orgs with AI Agent enabled (self-serve toggle).
- Backward compat: Yes — additive. AI Agents with no test cases show "no score yet"; existing publish/activate behavior is unchanged until the activation gate (AITEST-S07) ships.
- Migration: Additive tables (ai_agent_test_cases, ai_agent_test_case_questions). No backfill.
Semantic regression rollback: ai_agent_testing is the per-org kill switch. If beta orgs report the score is misleading, or the gate suppresses legitimate activations, PM toggles ai_agent_testing OFF per org (no deploy); the gate reverts to advisory-only.
12. Observability
| Event Name | Trigger | Properties |
|---|---|---|
| ai_workspace_opened | User opens the Testing page | user_id, org_id, bot_id, timestamp |
| ai_validation_generated | User clicks Generate (from inbox) | sample_size, date_range, test_case_id, timestamp |
| ai_response_graded | User clicks thumbs up/down | grade (pass/fail), confidence_score, inquiry_id, timestamp |
| ai_validation_completed | Confidence meter reaches the ≥80% threshold | total_time_spent, total_items_reviewed, timestamp |
| ai_agent_activated | User clicks Go Live | previous_validation_score, timestamp |
Dashboard owner: BOT squad (Mixpanel + Tableau).
Alerts:
- ai_validation_generated batch failure rate > 10% in 1h → Slack: #bot-ai-alerts (on-call).
- LLM error rate during shadow generation > 5% in 15m → Slack: #bot-ai-alerts + PagerDuty on-call.
Post-Launch Monitoring Cadence: Weekly for the first 4 weeks post-GA, then monthly. Owner: Dimas Fauzi Hidayat (BOT squad PM). Triggers: activation conversion drops > 10% WoW → investigate within 48h; batch failure rate > 10% for 2 consecutive weeks → PM review + eng escalation. Rollback: if batch failure rate > 20% unresolved within 24h, PM disables ai_agent_testing for affected orgs.
13. Success Metrics
⭐ Primary KPI: Conversion Rate (Configured → Live)
- Definition: % of AI Agents activated within 7 days of finishing configuration
- Baseline: N/A — new capability (today activation takes a minimum of ~3 days, often weeks)
- Target: ≥ 60% within 7 days, within 90 days of GA
Adoption: "Confidence Bar" completion rate
- Definition: % of Admins who review enough items to reach ≥80% on the bar
- Baseline: N/A — new capability
- Target: ≥ 70% of started test cases reach ≥80%
Quality: Shadow-generation success rate
- Definition: % of sampled questions that produce a valid AI shadow answer (no LLM failure)
- Baseline: N/A — new capability
- Target: ≥ 95% within 60 days of GA
Efficiency: Time-to-Confidence
- Definition: Time from first setup to a go-live-ready validation result per agent
- Baseline: ~6 hours/day of manual War Room monitoring during onboarding
- Target: < 1 hour of reviewing the comparison report per agent, within 90 days of GA
Targets assume a re-baselined launch timeline (Open Question #3); the original "May 2026 / Q1 2026" dates are past.
14. Launch Plan & Stage Gates
| Stage | Audience | Duration | Success Gate | Owner |
|---|---|---|---|---|
| Internal Alpha | Telesales POC org(s) | 2 weeks | 0 P0/P1 bugs; shadow-generation success ≥ 90%; zero customer-message leakage | PM + QA |
| Closed Beta | 3–5 Enterprise/Pro orgs | 3–4 weeks | ≥ 70% of started test cases reach ≥80% bar; batch failure rate ≤ 10% | PM + CSM |
| Open Beta | All orgs with AI Agent, on request | 3 weeks | Shadow-generation success ≥ 95% sustained 1 week; no P0/P1 open | Eng Lead |
| GA | All orgs with AI Agent enabled | Ongoing | All Open Beta gates sustained 2 weeks; PMM launch approved | PM + PMM |
15. Dependencies
| Dependency | Owning Team | Deliverable Needed | Blocking? |
|---|---|---|---|
| Chat Service (Hub) | Inbox / Platform | Assigned-room list + room messages APIs (FetchAssignedRoomIds, FetchRoomMessages) over a 90-day window | YES |
| LLM / AI service | AI squad | Batch shadow inference within TPM/RPM limits (sync_to_ai_service, qontak_nlp/predict) | YES |
| Data team | Data | 10% sampling + 50–70 cap algorithm (currently unbuilt) | YES |
| Channel Integration | Platform | Access tokens for room fetch (ChannelIntegrations::GetTokens) | YES |
| AI Agent versioning | BOT | ai_agent_histories stable — test cases bind to a version | YES |
16. Key Decisions + Alternatives Rejected
Initiative-level decisions live in the ANCHOR PRD §5. Below are phase-specific decisions.
16a — Decisions Made
| Date | Decision | Rationale |
|---|---|---|
| 2025-12-19 | Sample resolved, human-handled rooms over the last 90 days; cap 50–70 per batch | Recent data reflects current policies; cap controls token cost/latency on large accounts |
| 2026-06-18 | Store AI answer in answer and the human golden answer in parameters.human_answer on the question row | Matches the implemented schema; keeps comparison data on one record |
| 2026-06-18 | Use Sidekiq (FetchRoomConversationsWorker) for batch generation | Already the chatbot async stack; avoids introducing Kafka for this workload |
16b — Alternatives Rejected
| Alternative | Why Rejected | Date |
|---|---|---|
| Expose sampling parameters (date range, %, cap) in the Generate drawer | Adds user effort; system defaults (90d/10%/cap) are sufficient for the trust signal | 2026-06-18 |
| Separate ai_validation_sessions / ai_validation_items tables (original draft) | Superseded by the implemented ai_agent_test_cases / ai_agent_test_case_questions schema | 2026-06-18 |
17. Open Questions
| # | Type | Question | Owner | Deadline |
|---|---|---|---|---|
| 1 | Risk | Historical tickets contain PII sent to a 3rd-party LLM. Mitigation: covered by existing DPA; transient inference only; not used to train the public model. | Dimas (PM) | 2026-07-01 |
| 2 | Risk | Confidence-meter recalc (S06) + activation gate (S07) are not yet built. Mitigation: ship advisory-only for beta; enforce the gate before GA. | Eng (BOT) | 2026-07-15 |
| 3 | Risk | Original launch dates ("May 2026" GA, "late Q1 2026" beta) are now in the past. Mitigation: re-baseline the timeline with stakeholders before READY. | Dimas (PM) | 2026-07-01 |
| 4 | Open Question | Is the 80% confidence threshold fixed or org-configurable? | Dimas (PM) | 2026-07-15 |
| 5 | Open Question | Per-batch token budget across plan tiers — does the 50–70 cap hold for all? | Data team (Reza) | 2026-07-15 |
| 6 | Open Question | Is a separate "relevance" metric required (schema only has confidence)? If yes, store in parameters. | Dimas (PM) | 2026-07-15 |
| 7 | Assumption | A single human reply is a sufficient "golden answer" when a room has multiple agent messages. | Data team (Reza) | 2026-07-15 |
| 8 | Open Question | Hard-purge window for soft-deleted test cases/questions. | Eng (BOT) | 2026-07-31 |
PRD CHANGELOG
| Version | Date | By | Section | Type | Summary |
|---|---|---|---|---|---|
| 1.0 | 2026-06-18 | Claude | All | CREATED | Phase 1 PRD authored in the documents-repo template under the new AI Agent: Testing ANCHOR. Reconciled with the Confluence draft and current code (chatbot, chatbot-fe, qontak-designer): endpoints, schema (ai_agent_test_cases / ai_agent_test_case_questions), async queue (Sidekiq), and sampling status filter (assigned) aligned to implementation; not-yet-built items (10%/cap sampling, confidence-meter recalc, activation gate) flagged as Open Questions. 10 stories with composite AC ids (AITEST-S01…S10). |