Qontak | AI Agent | Testing — Phase 1: Historical Validation

Historical Validation — Phase 1 PRD under the AI Agent: Testing ANCHOR. The first test-case source ("Generate from inbox"): sample resolved human-handled conversations, generate AI shadow answers, compare side-by-side vs the human "golden" answer, rate per question, and roll up a confidence score. Imported from Confluence and reconciled against code (chatbot, chatbot-fe, qontak-designer).

HEADER BLOCK

Field	Value
PM	Dimas Fauzi Hidayat
PRD Version	1.0
Status	DRAFT
PRD Type	PHASE
Epic	BOT-3351
Squad	BOT
RFC Link	Related: `docs/rfcs/ai-agent-advanced-settings-p2-be-v2.md` (chatbot repo) — dedicated RFC to be created
Figma Master	Figma — Bot · AI Agent Testing
Anchor	AI Agent: Testing — ANCHOR (Confluence)
Labels	`epic:qontak-chatbot` | `module:ai-agent` | `feature:ai-agent-testing`
Last Updated	2026-06-18

Scope Changes

Backend · Frontend · Data — new Testing page + test-case endpoints (chatbot, chatbot-fe), and the sampling/shadow-generation pipeline (Data).

2. Phase Context

Anchor PRD: AI Agent: Testing — ANCHOR
Phase Number: Phase 1 of 3 (phasing by test-case source: Historical → Knowledge → Imported)
Phase Goal: Validate AI Agent quality against a sample of historical, resolved human conversations so SPV/Admin can confidently go live. (Matches the ANCHOR Phase Index Goal for Phase 1.)
Prior phases: N/A — this is Phase 1, no prior phases.
This phase: The Testing page + the "Generate from inbox" test-case source — sample resolved human-handled rooms, generate AI shadow answers, compare side-by-side vs the human "golden" answer, rate per question, roll up a confidence score.
Deferred to next: Phase 2 "Generate from knowledge" and Phase 3 "Imported question list" test-case sources (scaffolded in design/code but out of scope here).
Cross-phase deps: Test cases bind to a specific AI Agent version (ai_agent_history_id). The ai_agent_test_cases / ai_agent_test_case_questions schema established here must remain stable for Phase 2/3, which reuse the same tables and comparison UI.

3. One-liner + Problem

One-liner: Let Qontak SPVs and Admins validate an AI Agent against a sample of their own resolved conversations — comparing AI answers to their human agents' — before going live.

Problem: Before activating an AI Agent, SPV/Admins have no way to preview its quality at scale, so they run ~6-hour/day manual "War Rooms" during onboarding to catch errors. This phase gives them a self-serve, evidence-based comparison against historical "golden" human answers, so they can reach a confidence bar and activate without babysitting. For full initiative context, see the ANCHOR PRD.

4. Target Users + Persona Context

Persona	Role	Goal	Pain	Workaround
Primary — SPV / Chatbot Admin	SPV / Chatbot Admin responsible for customer-interaction quality	Reach a confidence bar that proves the AI is safe to launch, then activate it	No scalable way to preview AI answers on real customer questions before go-live	~6 hours/day of manual War Room monitoring during onboarding
Secondary — Super Admin	Decision-maker who purchased the AI module	Activate the subscription fully once the SPV signs off	No objective proof to justify activation	Waits on the SPV "green light"; keeps AI inactive

5. Non-Goals

Live Shadow Mode — not a real-time system running in parallel during actual customer chats. This phase is strictly historical.
Model fine-tuning UI — answers are fixed by updating the Knowledge Base/Context, not by "teaching" the AI in this workspace.
Multi-modal validation — MVP is text-only; tickets with images/voice/attachments are excluded.
Knowledge-based and imported test sets — deferred to Phase 2 and Phase 3.
Editing the AI's answer in the workspace — the comparison view is read-only.
Mobile app — Testing is web-only in this phase.

6. Constraints

Platform: Web only (Qontak web app — Bot Automation → Testing).
Performance: Batch generation is asynchronous (~2–5 min for ~50 items). Must respect the LLM provider's TPM/RPM so it never blocks live production traffic. Historical room fetch should read from a replica DB where available, to avoid slowing the live inbox.
Data limits: Lookback = last 90 days. Sample = 10% of eligible rooms, capped at 50–70 rooms per batch (≤50 shown when a batch exceeds 100). Text-only tickets.
Plan scope: All plans with the AI Agent enabled.
Feature flag: ai_agent_testing | default: OFF. Enabled per organization during beta.
Read/write: Roles owner / supervisor / admin can read + write (generate, rate). Standard agents have no access (menu hidden). Enforced server-side via set_role on every test-case endpoint.
Data lifecycle: ai_agent_test_cases and ai_agent_test_case_questions are soft-deleted (deleted_at, acts_as_paranoid) when a user deletes a test case; the transient LLM inference payload is not persisted beyond the response. Hard-purge window TBD (Open Question #8).

7. Feature Changes

CHG-001 — Confidence score surfaced in the Tree Diagram

Change Type: Modified component (Bot Flow tree diagram).
Page: /bot-automation/bot-flow/{id} (Tree Diagram).
Before: Adding/selecting an AI Agent node shows the agent with no quality signal.
After: When a tested AI Agent is selected, the node shows its average Confidence Score from testing (average across that agent's completed test cases).

Element	Before	After
AI Agent node (Tree Diagram)	No confidence indicator	Shows average Confidence Score from testing per AI Agent

Backend touchpoint: get_tree_diagram_v3 (chatbot). JIRA: BOT-3976.

8. New Features

Feature: AI Agent Testing page

URL: /bot-automation/testing — top-level item under Bot Automation (AI agents · Resources · Actions · Testing · Analytics · Bot flow). Test-case detail opens from a row.
Access: owner / supervisor / admin only. Standard agents: menu hidden.

Component tree:

TestingPage (/bot-automation/testing)
- PageHeader → "Generate test case" button → GenerateTestCaseModal
- FilterToolbar — search + filter
- TestCasesTable — columns: Test case name · Testing type · Score · Status · Last updated · Actions
- GenerateTestCaseModal — source picker:
  - "Generate from Inbox" → GenerateFromInboxDrawer (this phase)
  - "Generate from knowledge" → GenerateFromKnowledgeDrawer (Phase 2 — hidden/disabled)
  - "Imported question list" → UploadManuallyDrawer (Phase 3 — hidden/disabled)
- TestCaseGeneratingModal — async progress while the batch runs
TestCaseDetail (per test case)
- QuestionList (grouped by topic)
- ComparisonView — Human answer (left) vs AI answer (right) + metrics + thumbs

UI States:

Empty: "No test cases yet" illustration + helper; primary action = Generate test case. No table.
Loading: skeleton rows while the list is fetched.
Error: blank slate "Couldn't load test cases" + Retry. Log: ai_workspace_load_failed.
Success: table of test cases with type, score, status, last updated, row actions.

Figma: Testing page — node 16743-298263. Code refs: qontak-designer app/pages/bot-automation/testing/index.vue + app/components/bot-automation/testing/*.

9. API & Webhook Behavior

Base path: /api/frontend_service/v1/ai_agent. All endpoints gated server-side to owner / supervisor / admin via set_role. Technical fields (JSON schemas, error codes) resolved during RFC.

#	Behavior	Entity Affected	Triggered By	Expected Behavior	Failure Behavior
1	Create test case (triggers batch)	New `ai_agent_test_cases` row (bound to `ai_agent_id` + `ai_agent_history_id`)	SPV/Admin clicks Generate in the Generate-from-Inbox drawer (body: `type`, `version_id`)	Test case created (status `processing`); Sidekiq `FetchRoomConversationsWorker` (queue `:ai_agent`) enqueued to fetch assigned rooms, extract Q/A pairs, and (target) generate AI shadow answers; UI shows progress	Version not found → 404, no test case. Create fails → 422. Worker error on a room → skip + Rollbar; batch continues
2	List / poll test cases	Read `ai_agent_test_cases` (paginated)	Opening Testing page; polling during a batch	Returns test cases with `status` + `confidence_score`; UI polls `status` until `completed`	Unauthorized role → forbidden (menu should not have been visible)
3	Get test-case detail	Read one `ai_agent_test_cases` + its `ai_agent_test_case_questions`	SPV/Admin opens a test case	Returns each question: `topic`, `question`, `answer` (AI), `parameters.human_answer` (golden), `response_time`, `confidence`, `sources`, `score`, `status`	Test case not found → 404
4	Rate a question	Update one `ai_agent_test_case_questions` row	SPV/Admin clicks thumbs up/down (`score` 0/1 + scored_by metadata)	Saves `score` + `is_score` + `scored_by`/`scored_at`; (target) recomputes the test case aggregate `confidence_score`	Question not found → 404. Invalid score (not 0/1) → 422

Implementation note: today RateTestCaseQuestion only persists the per-question score; the aggregate confidence_score recompute is not yet wired (Open Question #2).
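
The validation in row 4 can be sketched as below. This is a minimal Python illustration, not the chatbot implementation: `rate_question`, the in-memory `questions` dict, and the 200 success status are hypothetical, and treating `is_score` as a "has been rated" flag is an assumption. It only shows the 404/422 split and which fields are written.

```python
from datetime import datetime, timezone

def rate_question(questions, question_id, score, user_id):
    """Return (http_status, row) for a thumbs up/down rating (hypothetical helper)."""
    row = questions.get(question_id)
    if row is None:
        return 404, None  # question not found
    # Accept only the integers 0 and 1; bools, floats and strings are rejected.
    if type(score) is not int or score not in (0, 1):
        return 422, row  # invalid score
    row["score"] = score
    row["is_score"] = True  # assumption: marks the question as rated
    row["scored_by"] = user_id
    row["scored_at"] = datetime.now(timezone.utc).isoformat()
    return 200, row
```

The aggregate `confidence_score` recompute is deliberately left out here, matching the implementation note above.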

10. System Flow + User Stories + ACs

10.1 System Flow

SPV/Admin opens Bot Automation → Testing.
Clicks "Generate test case" → chooses "Generate from inbox".
Names the test case, selects the AI Agent version, clicks Generate.
System creates the test case (status processing) and enqueues FetchRoomConversationsWorker.
Worker fetches assigned rooms (last 90 days), extracts customer→agent question/answer pairs, (target) samples 10% capped at 50–70, and generates an AI shadow answer per question.
Decision point: if total eligible rooms < 10 → use all available rooms.
Failure branch: if a room fetch / LLM call fails → skip that room, log to Rollbar, continue.
On completion, status → completed; UI surfaces the test case with questions grouped by topic.
SPV/Admin opens a question → sees Human answer (left) vs AI answer (right) + confidence / response-time / sources.
SPV/Admin rates each answer thumbs up/down → (target) confidence meter updates.
When the meter reaches ≥80%, the agent is "Ready to Launch"; SPV/Admin proceeds to activate.

10.2 User Stories

All 10 stories carry their original JIRA tickets. MoSCoW preserved from the source "Importance". Implementation-status notes flag where ACs describe target behavior not yet in code.

AITEST-S01 — Workspace access control | Must Have

Story: As an SPV/Admin, I want to access the Testing page, so that I can validate AI performance before activation.

Before: No Testing page exists; there is no place to validate an AI Agent against history. After: A Bot Automation → Testing page is visible to owner/supervisor/admin only.

Data Fields: ai_agent_id (uuid, required — route), role (enum, required — auth session)

Happy Path:

AC-1: Given I am an owner, supervisor, or admin, when I open Bot Automation, then I see the "Testing" menu item and can open the Testing page.
AC-2: Given I am on the Testing page, when it loads, then I can create a new test case and open existing ones.

Error Path:

ERR-1: Given the test-case list fails to load, when the page renders, then I see a "Couldn't load" blank slate with Retry, and event ai_workspace_load_failed is logged.

Permission Model: CAN: owner/supervisor/admin. CANNOT: standard agents. Unauthorized: Testing menu not rendered; direct route forbidden.

UI States: Loading (skeleton rows), Empty ("No test cases yet" + Generate CTA), Error (blank slate + Retry), Success (test-case table).

Figma: node 16743-298263. Dependencies: None.

AITEST-S02 — Historical data sampling (10% rule) | Must Have

Story: As an SPV/Admin, I want the system to fetch a 10% sample of past resolved chats, so that I have a "golden standard" to test against.

Before: No mechanism to pull historical conversations into a test set. After: Generating a test case samples eligible resolved, human-handled rooms from the last 90 days.

Data Fields: lookback_days (int, fixed 90), sample_pct (int, 10), cap (int, 50–70)

Happy Path:

AC-1: Given 200 eligible human-handled rooms in the last 90 days, when I generate a validation set, then the system selects ~10% (≈20) at random.
AC-2: Given fewer than 10 eligible rooms, when I generate, then the system uses 100% of available rooms.
AC-3: Given 5,000 eligible rooms, when I generate, then the system caps the sample at 50–70 rooms (≤50 shown when the batch exceeds 100).

Error Path:

ERR-1: Given the room-list API is unavailable, when generation runs, then the test case surfaces an error state and can be retried.

Permission Model: CAN: owner/supervisor/admin. CANNOT: standard agents.

UI States: Loading (generating modal), Empty ("no eligible rooms found"), Error (retry), Success (sample loaded).

Figma: node 17699-52615. Dependencies: AITEST-S03; Chat Service room APIs (§15).

Implementation status: the worker currently fetches assigned rooms (90 days, paginated, LIMIT=100) and extracts Q/A pairs, but the 10% sampling + 50–70 cap are not yet implemented.
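
Since the sampling rule is not yet built, the three ACs above can be sketched as follows. This is a minimal Python illustration under stated assumptions: `sample_rooms` and its parameter names are hypothetical, the cap is fixed at 50 (the lower bound of the 50–70 range), and the 10% is rounded up. None of these choices are confirmed by the source.

```python
import math
import random

def sample_rooms(room_ids, sample_pct=10, cap=50, min_pool=10, seed=None):
    """Pick a random sample of eligible rooms (hypothetical sketch of AC-1..AC-3)."""
    rooms = list(room_ids)
    if len(rooms) < min_pool:
        return rooms  # AC-2: fewer than 10 eligible rooms, use all of them
    target = math.ceil(len(rooms) * sample_pct / 100)  # AC-1: ~10%, rounded up (assumption)
    target = min(target, cap)  # AC-3: cap large accounts
    return random.Random(seed).sample(rooms, target)
```

With these defaults, 200 eligible rooms yield 20, 5,000 yield 50, and 8 yield all 8. One consequence of the rule as written is a cliff at the boundary: 9 eligible rooms produce 9 items, while 10 produce 1.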

AITEST-S03 — Data integrity & filtering | Must Have

Story: As the System, I must filter out invalid or non-compatible chats, so that the test set is representative and safe.

Before: No filtering rules for what counts as a valid test conversation. After: Bot-only and non-text conversations are excluded from the sample.

Data Fields: participant_type (enum: customer/agent/system), message_type (enum, must be text)

Happy Path:

AC-1: Given a room resolved entirely by a bot with no human reply, when sampling runs, then that room is excluded.
AC-2: Given a room with at least one human-agent text reply to a customer question, when sampling runs, then it is eligible.
AC-3: Given a ticket containing an image, voice note, or attachment, when the sample is selected, then that ticket is skipped in favor of text-only inquiries.

Error Path:

ERR-1: Given a room returns zero eligible question/answer pairs, when extraction runs, then the room contributes no items and is not counted toward the sample.

Permission Model: System rule (runs under an owner/supervisor/admin-triggered batch).

UI States: N/A — server-side filtering; result reflected in the generated sample.

Dependencies: AITEST-S02.

Code: ExtractConversationPairs pairs a customer question with the next agent text reply; status filter = assigned (human-handled) rooms.
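
The pairing and filtering rules can be illustrated with a short sketch. It is not the `ExtractConversationPairs` code: `eligible_pairs`, the `body` key, and the message dict shape are hypothetical, and it assumes bot replies arrive with `participant_type` = `system` and that a later customer message replaces an unanswered earlier one.

```python
def eligible_pairs(messages):
    """Return (question, human_answer) pairs, or [] if the room is ineligible."""
    # Assumption: bot/system messages play no part in pairing or in the text-only check.
    human_msgs = [m for m in messages if m["participant_type"] in ("customer", "agent")]
    # AC-3: any image, voice note or attachment makes the whole room ineligible.
    if any(m["message_type"] != "text" for m in human_msgs):
        return []
    pairs, pending = [], None
    for m in human_msgs:
        if m["participant_type"] == "customer":
            pending = m["body"]  # latest unanswered customer question
        elif pending is not None:
            pairs.append((pending, m["body"]))  # next agent text reply is the golden answer
            pending = None
    return pairs
```

A bot-only room has no agent message, so it returns no pairs and contributes nothing to the sample, which covers AC-1 and ERR-1.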

AITEST-S04 — Shadow mode execution (zero leakage) | Must Have

Story: As a PM, I want the AI to generate responses in shadow mode without messaging actual customers, so that testing is safe.

Before: The AI Agent only answers live customers. After: The AI generates answers to historical inquiries in shadow mode, never sent to customers.

Data Fields: answer (text), confidence (int), sources (jsonb), status (enum)

Happy Path:

AC-1: Given the AI generates a shadow response for a 3-month-old inquiry, when generation completes, then the send_message service is NOT triggered and no notification/email reaches the customer.
AC-2: Given a shadow response is generated, when it is stored, then the AI answer is saved on ai_agent_test_case_questions.answer and the human answer on parameters.human_answer — not in the live conversation log.

Error Path:

ERR-1: Given the LLM call fails for an inquiry, when generation runs, then that question is marked failed with a status_description and the batch continues.

Permission Model: CAN: system (triggered by owner/supervisor/admin). Unauthorized: not executed if flag OFF.

UI States: Loading (per-question generating), Empty (N/A), Error (per-question failed badge), Success (answer + confidence + sources shown).

Figma: node 16514-155786. Dependencies: AITEST-S03; LLM/AI service (§15).
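
The storage and failure rules of this story can be sketched as below. This is a Python illustration only: `build_question_row` and the injected `generate_answer` callable are hypothetical, and the per-question status values (`processing`, `completed`, `failed`) are assumed to mirror the test-case statuses. The point is that the result lands on the question row and no send path is ever invoked.

```python
def build_question_row(question, human_answer, generate_answer):
    """Build one ai_agent_test_case_questions-style row (hypothetical sketch)."""
    row = {
        "question": question,
        "answer": None,
        "parameters": {"human_answer": human_answer},  # golden answer
        "confidence": None,
        "sources": [],
        "status": "processing",
        "status_description": None,
    }
    try:
        result = generate_answer(question)  # shadow inference; nothing is sent to the customer
    except Exception as exc:
        # ERR-1: mark this question failed and let the caller continue the batch.
        row["status"] = "failed"
        row["status_description"] = str(exc)
        return row
    row["answer"] = result["answer"]
    row["confidence"] = result.get("confidence")
    row["sources"] = result.get("sources", [])
    row["status"] = "completed"
    return row
```

Passing the LLM call in as a parameter keeps the sketch free of any real service name and makes the failure branch easy to exercise.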

AITEST-S05 — Side-by-side validation UI | Must Have

Story: As an SPV/Admin, I want to compare human vs AI responses side-by-side, so that I can judge accuracy effectively.

Before: No way to compare AI vs human answers. After: A side-by-side comparison per question, grouped by topic.

Data Fields: question (text), answer (text), parameters.human_answer (text), confidence (int), response_time (int), sources (jsonb), topic (string)

Happy Path:

AC-1: Given a completed batch, when I open a validation item, then I see the inquiry at the top, the human response (golden standard) on the left, and the AI response on the right.
AC-2: Given the AI response card, when it renders, then it shows Confidence, response time, and cited Sources.
AC-3: Given questions in the test case, when the list renders, then questions are grouped by topic.

Error Path:

ERR-1: Given a question failed shadow generation, when I open it, then the AI side shows a "could not generate" state instead of a blank panel.

Permission Model: CAN: owner/supervisor/admin. CANNOT: standard agents.

UI States: Loading (skeleton), Empty ("no questions in this test case"), Error (retry), Success (comparison rendered).

Figma: node 16514-155786. Dependencies: AITEST-S04.

Note: schema stores confidence + sources; there is no separate "relevance" field — if a relevance metric is required it must live in parameters (Open Question #6).

AITEST-S06 — Confidence meter & feedback | Must Have

Story: As an SPV/Admin, I want to rate AI responses, so that the system can calculate a confidence meter for launch readiness.

Before: No rating or roll-up of AI answer quality. After: Each answer can be rated thumbs up/down; ratings roll up into a confidence meter.

Data Fields: score (int 0/1, required), is_score (boolean), confidence_score (int — test-case aggregate)

Happy Path:

AC-1: Given I am viewing a comparison, when I click thumbs up, then the item is marked "Pass" (score = 1) and the meter increments.
AC-2: Given I previously marked an item thumbs down (score = 0), when I change it to thumbs up, then the aggregate confidence meter recalculates immediately.
AC-3: Given a test case with N rated items, when the meter renders, then it equals (thumbs-up ÷ total sample) × 100, with <80% = Low Confidence and ≥80% = Ready to Launch.

Error Path:

ERR-1: Given the rating save fails, when I click thumbs up/down, then the previous state is restored and an inline error is shown.

Permission Model: CAN: owner/supervisor/admin. CANNOT: standard agents.

UI States: Loading (saving rating), Empty (meter at 0% before any rating), Error (save failed inline), Success (meter updated).

Figma: node 16514-155786. Dependencies: AITEST-S05.

Implementation status: rating persists the per-question score; the confidence_score aggregate recompute is not yet wired (Open Question #2).
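
Because the aggregate is not yet wired, the AC-3 formula is shown here as a small sketch. `confidence_meter` is a hypothetical name, and two points are assumptions rather than confirmed behavior: unrated questions count in the denominator (reading "total sample" literally), and the percentage is floored so that 79.9% never displays as 80%.

```python
def confidence_meter(scores, threshold=80):
    """scores: one entry per sampled question; 1 = thumbs up, 0 = thumbs down, None = unrated."""
    total = len(scores)
    if total == 0:
        return 0, "Low Confidence"  # empty state: meter at 0%
    passed = sum(1 for s in scores if s == 1)
    pct = passed * 100 // total  # floor to an int (assumption)
    label = "Ready to Launch" if pct >= threshold else "Low Confidence"
    return pct, label
```

For example, 8 thumbs up out of 10 sampled questions gives 80% and "Ready to Launch", while 7 thumbs up, 1 thumbs down and 2 unrated gives 70% and "Low Confidence".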

AITEST-S07 — Activation gatekeeping | Should Have

Story: As a PM, I want to prevent "Go Live" until the AI reaches a safe quality threshold, so that low-quality agents don't go live.

Before: An AI Agent can be activated regardless of any test result. After: Activation is gated until the confidence meter reaches the ≥80% threshold.

Data Fields: confidence_score (int), threshold (int, default 80)

Happy Path:

AC-1: Given the confidence meter is <80% (Low Confidence), when I open the agent's main settings, then the "Activate Agent" button is disabled/greyed out.
AC-2: Given the confidence meter is ≥80%, when I open main settings, then "Activate Agent" is enabled.

Error Path:

ERR-1: Given I attempt to activate via API while <80%, when the request is made, then it is rejected with a clear reason.

Permission Model: CAN: owner/supervisor/admin. CANNOT: standard agents.

UI States: Disabled (locked) below threshold; Enabled at/above threshold.

Figma: node 16514-155786. Dependencies: AITEST-S06.

Implementation status: publish_ai_agent.rb does not check any confidence threshold today — no activation gate exists yet (Open Question #2). Should-Have for this phase.

AITEST-S08 — Background processing (async) | Must Have

Story: As a user, I want large batches processed in the background, so that the UI doesn't freeze.

Before: No batch generation pipeline. After: Batches run in a background queue so the UI never blocks.

Data Fields: status (enum: processing/completed/failed), test_case_id (uuid)

Happy Path:

AC-1: Given I trigger a batch of ~50 items, when the request is sent, then the UI shows a progress/processing state and remains responsive.
AC-2: Given the batch is running, when I poll the test case, then its status reflects progress until completed.

Error Path:

ERR-1: Given a worker job errors, when it fails, then the failure is logged (Rollbar) and the test case surfaces an error/partial state rather than hanging.

Permission Model: CAN: owner/supervisor/admin.

UI States: Loading (TestCaseGeneratingModal / progress), Empty (N/A), Error (failed/partial badge), Success (completed).

Figma: node 16514-155786. Dependencies: AITEST-S02, AITEST-S04.

Code: implemented with Sidekiq (FetchRoomConversationsWorker, queue :ai_agent) — not Kafka.
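
AC-2's polling behavior can be sketched as a simple loop. This is a Python illustration, not the chatbot-fe code: `poll_until_done`, the 5-second interval and the 10-minute timeout are hypothetical, and `fetch_status` stands in for whatever call returns the test case `status`.

```python
import time

def poll_until_done(fetch_status, interval_s=5.0, timeout_s=600.0,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll a test case until it leaves `processing` or the timeout elapses."""
    deadline = clock() + timeout_s
    while True:
        status = fetch_status()
        if status in ("completed", "failed"):
            return status  # terminal states end polling
        if clock() >= deadline:
            return "timeout"  # surface an error state instead of hanging
        sleep(interval_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.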

AITEST-S09 — Manual override & audit | Could Have

Story: As an Admin, I want to force-activate the AI with a business justification, so that I'm not blocked when I have a valid reason.

Before: No override path if the gate blocks a justified activation. After: Admins can force-activate below threshold with a logged reason.

Data Fields: override_reason (text, required), actor_id (uuid), score_at_override (int)

Happy Path:

AC-1: Given the meter is below 80%, when I choose "Force Activate", then I must provide a reason before the action proceeds.
AC-2: Given I provide a reason and confirm, when force-activation completes, then the action is recorded in an audit trail (who, when, reason, score at time).

Error Path:

ERR-1: Given I submit force-activate without a reason, when I confirm, then the action is blocked with a validation message.

Permission Model: CAN: owner/admin. CANNOT: supervisor (override is admin-level). Unauthorized: option not shown.

UI States: Loading (submitting), Empty (N/A), Error (validation), Success (activated + audit entry).

Figma: node 16514-155786. Dependencies: AITEST-S07.

Implementation status: depends on the (not-yet-built) activation gate. Could-Have for this phase.
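
Neither the gate (AITEST-S07) nor the override exists yet, so the combined rules are sketched here for discussion. Everything named below is hypothetical: `activate`, `ActivationBlocked`, the return strings, and the in-memory `audit_log` list. The sketch assumes the agent already has a numeric `confidence_score`; behavior for untested agents is not specified in this PRD and is left out.

```python
from datetime import datetime, timezone

class ActivationBlocked(Exception):
    """Raised when activation is rejected by the gate (hypothetical)."""

def activate(confidence_score, actor_role, actor_id, audit_log,
             override_reason=None, threshold=80):
    if confidence_score >= threshold:
        return "activated"  # S07 AC-2: at or above threshold
    if override_reason is None:
        # S07 ERR-1: reject with a clear reason
        raise ActivationBlocked(f"confidence {confidence_score}% is below {threshold}%")
    if actor_role not in ("owner", "admin"):
        raise ActivationBlocked("force-activate requires owner or admin")
    if not override_reason.strip():
        raise ValueError("override_reason is required")  # S09 ERR-1
    audit_log.append({  # S09 AC-2: who, when, reason, score at time
        "actor_id": actor_id,
        "override_reason": override_reason,
        "score_at_override": confidence_score,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return "force_activated"
```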

AITEST-S10 — Confidence score in Tree Diagram | Should Have

Story: As an SPV/Admin or Bot Specialist, I want to see the confidence score per AI Agent in the Tree Diagram, so that I can judge quality while composing a flow.

Before: Tree Diagram AI Agent nodes show no quality signal. After: A selected, tested AI Agent shows its average confidence score from testing.

Data Fields: ai_agent_id (uuid), avg_confidence_score (int, computed)

Happy Path:

AC-1: Given I am on the Tree Diagram page, when I click +, choose "AI Agent", and select a configured agent, then the node shows that agent's Confidence Score from testing.
AC-2: Given an agent has multiple completed test cases, when the score renders, then it is the average across those test cases.
AC-3: Given an agent has no completed test cases, when selected, then the node shows "no score yet" rather than 0%.

Error Path:

ERR-1: Given the confidence lookup fails, when the node renders, then it falls back to "no score yet" rather than erroring the diagram.

Permission Model: CAN: owner/supervisor/admin/bot-specialist.

UI States: Loading (node fetching), Empty ("no score yet"), Error (falls back to "no score yet"), Success (average score shown).

Figma: node 16514-155786. Dependencies: AITEST-S06. Backend: get_tree_diagram_v3.
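
The averaging and fallback rules in this story can be sketched as below. This is an illustration, not the `get_tree_diagram_v3` code: `average_confidence` and `node_label` are hypothetical names, and rounding the average to the nearest integer is an assumption.

```python
def average_confidence(test_cases):
    """Average confidence_score across completed test cases, or None if there are none."""
    scores = [
        tc["confidence_score"]
        for tc in test_cases
        if tc["status"] == "completed" and tc["confidence_score"] is not None
    ]
    if not scores:
        return None  # AC-3: distinguish "never tested" from 0%
    return round(sum(scores) / len(scores))

def node_label(test_cases):
    """Text shown on the AI Agent node."""
    try:
        avg = average_confidence(test_cases)
    except Exception:
        avg = None  # ERR-1: a failed lookup must not break the diagram
    return "no score yet" if avg is None else f"{avg}%"
```

Returning `None` rather than 0 keeps an untested agent visually distinct from one that scored 0%.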

Negative Scenarios

NEG-1: Given I am a standard agent, when I look for AI Agent testing, then the Testing menu is not rendered and the route is forbidden.
NEG-2: Given a conversation contains only images/voice/attachments, when sampling runs, then it is excluded from the test set.
NEG-3: Given I am in the comparison view, when I try to edit the AI's answer, then no edit affordance exists — I must update the Knowledge Base instead.
NEG-4: Given the current phase, when I open "Generate test case", then "Generate from knowledge" and "Imported question list" are not selectable (deferred to Phase 2/3).

11. Rollout

Feature flag: ai_agent_testing | default: OFF. Enabled per organization during beta.
Stage 1: Internal — telesales POC org(s) (the original POC team).
Stage 2: Closed beta — 3–5 Enterprise/Pro orgs (manually enabled).
Stage 3: All orgs with AI Agent enabled, on request.
GA: All orgs with AI Agent enabled (self-serve toggle).
Backward compat: Yes — additive. AI Agents with no test cases show "no score yet"; existing publish/activate behavior is unchanged until the activation gate (AITEST-S07) ships.
Migration: Additive tables (ai_agent_test_cases, ai_agent_test_case_questions). No backfill.

Semantic regression rollback: ai_agent_testing is the per-org kill switch. If beta orgs report the score is misleading, or the gate suppresses legitimate activations, PM toggles ai_agent_testing OFF per org (no deploy); the gate reverts to advisory-only.

12. Observability

Event Name	Trigger	Properties
`ai_workspace_opened`	User opens the Testing page	user_id, org_id, bot_id, timestamp
`ai_validation_generated`	User clicks Generate (from inbox)	sample_size, date_range, test_case_id, timestamp
`ai_response_graded`	User clicks thumbs up/down	grade (pass/fail), confidence_score, inquiry_id, timestamp
`ai_validation_completed`	Confidence meter reaches the ≥80% threshold	total_time_spent, total_items_reviewed, timestamp
`ai_agent_activated`	User clicks Go Live	previous_validation_score, timestamp

Dashboard owner: BOT squad (Mixpanel + Tableau).

Alerts:

ai_validation_generated batch failure rate > 10% in 1h → Slack: #bot-ai-alerts (on-call).
LLM error rate during shadow generation > 5% in 15m → Slack: #bot-ai-alerts + PagerDuty on-call.

Post-Launch Monitoring Cadence: Weekly for the first 4 weeks post-GA, then monthly.
Owner: Dimas Fauzi Hidayat (BOT squad PM).
Triggers: activation conversion drops > 10% WoW → investigate within 48h; batch failure rate > 10% for 2 consecutive weeks → PM review + eng escalation.
Rollback: if the batch failure rate stays above 20% and is unresolved after 24h, PM disables ai_agent_testing for affected orgs.

13. Success Metrics

⭐ Primary KPI: Conversion Rate (Configured → Live)

Definition: % of AI Agents activated within 7 days of finishing configuration
Baseline: N/A — new capability (today activation takes a minimum of ~3 days, often weeks)
Target: ≥ 60% within 7 days, within 90 days of GA

Adoption: "Confidence Bar" completion rate

Definition: % of Admins who review enough items to reach ≥80% on the bar
Baseline: N/A — new capability
Target: ≥ 70% of started test cases reach ≥80%

Quality: Shadow-generation success rate

Definition: % of sampled questions that produce a valid AI shadow answer (no LLM failure)
Baseline: N/A — new capability
Target: ≥ 95% within 60 days of GA

Efficiency: Time-to-Confidence

Definition: Time from first setup to a go-live-ready validation result per agent
Baseline: ~6 hours/day of manual War Room monitoring during onboarding
Target: < 1 hour of reviewing the comparison report per agent, within 90 days of GA

Targets assume a re-baselined launch timeline (Open Question #3); the original "May 2026 / Q1 2026" dates are past.

14. Launch Plan & Stage Gates

Stage	Audience	Duration	Success Gate	Owner
Internal Alpha	Telesales POC org(s)	2 weeks	0 P0/P1 bugs; shadow-generation success ≥ 90%; zero customer-message leakage	PM + QA
Closed Beta	3–5 Enterprise/Pro orgs	3–4 weeks	≥ 70% of started test cases reach ≥80% bar; batch failure rate ≤ 10%	PM + CSM
Open Beta	All orgs with AI Agent, on request	3 weeks	Shadow-generation success ≥ 95% sustained 1 week; no P0/P1 open	Eng Lead
GA	All orgs with AI Agent enabled	Ongoing	All Open Beta gates sustained 2 weeks; PMM launch approved	PM + PMM

15. Dependencies

Dependency	Owning Team	Deliverable Needed	Blocking?
Chat Service (Hub)	Inbox / Platform	Assigned-room list + room messages APIs (`FetchAssignedRoomIds`, `FetchRoomMessages`) over a 90-day window	YES
LLM / AI service	AI squad	Batch shadow inference within TPM/RPM limits (`sync_to_ai_service`, `qontak_nlp/predict`)	YES
Data team	Data	10% sampling + 50–70 cap algorithm (currently unbuilt)	YES
Channel Integration	Platform	Access tokens for room fetch (`ChannelIntegrations::GetTokens`)	YES
AI Agent versioning	BOT	`ai_agent_histories` stable — test cases bind to a version	YES

16. Key Decisions + Alternatives Rejected

Initiative-level decisions live in the ANCHOR PRD §5. Below are phase-specific decisions.

16a — Decisions Made

Date	Decision	Rationale
2025-12-19	Sample resolved, human-handled rooms over the last 90 days; cap 50–70 per batch	Recent data reflects current policies; cap controls token cost/latency on large accounts
2026-06-18	Store AI answer in `answer` and the human golden answer in `parameters.human_answer` on the question row	Matches the implemented schema; keeps comparison data on one record
2026-06-18	Use Sidekiq (`FetchRoomConversationsWorker`) for batch generation	Already the chatbot async stack; avoids introducing Kafka for this workload

16b — Alternatives Rejected

Alternative	Why Rejected	Date
Expose sampling parameters (date range, %, cap) in the Generate drawer	Adds user effort; system defaults (90d/10%/cap) are sufficient for the trust signal	2026-06-18
Separate `ai_validation_sessions` / `ai_validation_items` tables (original draft)	Superseded by the implemented `ai_agent_test_cases` / `ai_agent_test_case_questions` schema	2026-06-18

17. Open Questions

#	Type	Question	Owner	Deadline
1	Risk	Historical tickets contain PII sent to a 3rd-party LLM. Mitigation: covered by existing DPA; transient inference only; not used to train the public model.	Dimas (PM)	2026-07-01
2	Risk	Confidence-meter recalc (S06) + activation gate (S07) are not yet built. Mitigation: ship advisory-only for beta; enforce the gate before GA.	Eng (BOT)	2026-07-15
3	Risk	Original launch dates ("May 2026" GA, "late Q1 2026" beta) are now in the past. Mitigation: re-baseline the timeline with stakeholders before READY.	Dimas (PM)	2026-07-01
4	Open Question	Is the 80% confidence threshold fixed or org-configurable?	Dimas (PM)	2026-07-15
5	Open Question	Per-batch token budget across plan tiers — does the 50–70 cap hold for all?	Data team (Reza)	2026-07-15
6	Open Question	Is a separate "relevance" metric required (schema only has `confidence`)? If yes, store in `parameters`.	Dimas (PM)	2026-07-15
7	Assumption	A single human reply is a sufficient "golden answer" when a room has multiple agent messages.	Data team (Reza)	2026-07-15
8	Open Question	Hard-purge window for soft-deleted test cases/questions.	Eng (BOT)	2026-07-31

PRD CHANGELOG

Version	Date	By	Section	Type	Summary
1.0	2026-06-18	Claude	All	CREATED	Phase 1 PRD authored in the documents-repo template under the new AI Agent: Testing ANCHOR. Reconciled with the Confluence draft and current code (`chatbot`, `chatbot-fe`, `qontak-designer`): endpoints, schema (`ai_agent_test_cases` / `ai_agent_test_case_questions`), async queue (Sidekiq), and sampling status filter (`assigned`) aligned to implementation; not-yet-built items (10%/cap sampling, confidence-meter recalc, activation gate) flagged as Open Questions. 10 stories with composite AC ids (AITEST-S01…S10).
