Qontak | AI Agent | Testing — Phase 1: Historical Validation

Historical Validation — Phase 1 PRD under the AI Agent: Testing ANCHOR. The first test-case source ("Generate from inbox"): sample resolved human-handled conversations, generate AI shadow answers, compare side-by-side vs the human "golden" answer, rate per question, and roll up a confidence score. Imported from Confluence and reconciled against code (chatbot, chatbot-fe, qontak-designer).

HEADER BLOCK

PM: Dimas Fauzi Hidayat
PRD Version: 1.0
Status: DRAFT
PRD Type: PHASE
Epic: BOT-3351
Squad: BOT
RFC Link: Related: docs/rfcs/ai-agent-advanced-settings-p2-be-v2.md (chatbot repo); dedicated RFC to be created
Figma Master: Figma — Bot · AI Agent Testing
Anchor: AI Agent: Testing — ANCHOR (Confluence)
Labels: epic:qontak-chatbot | module:ai-agent | feature:ai-agent-testing
Last Updated: 2026-06-18

Scope Changes

Backend · Frontend · Data — new Testing page + test-case endpoints (chatbot, chatbot-fe), and the sampling/shadow-generation pipeline (Data).


2. Phase Context

  • Anchor PRD: AI Agent: Testing — ANCHOR
  • Phase Number: Phase 1 of 3 (phasing by test-case source: Historical → Knowledge → Imported)
  • Phase Goal: Validate AI Agent quality against a sample of historical, resolved human conversations so SPV/Admin can confidently go live. (Matches the ANCHOR Phase Index Goal for Phase 1.)
  • Prior phases: N/A — this is Phase 1, no prior phases.
  • This phase: The Testing page + the "Generate from inbox" test-case source — sample resolved human-handled rooms, generate AI shadow answers, compare side-by-side vs the human "golden" answer, rate per question, roll up a confidence score.
  • Deferred to next: Phase 2 "Generate from knowledge" and Phase 3 "Imported question list" test-case sources (scaffolded in design/code but out of scope here).
  • Cross-phase deps: Test cases bind to a specific AI Agent version (ai_agent_history_id). The ai_agent_test_cases / ai_agent_test_case_questions schema established here must remain stable for Phase 2/3, which reuse the same tables and comparison UI.

3. One-liner + Problem

One-liner: Let Qontak SPVs and Admins validate an AI Agent against a sample of their own resolved conversations — comparing AI answers to their human agents' — before going live.

Problem: Before activating an AI Agent, SPV/Admins have no way to preview its quality at scale, so they run ~6-hour/day manual "War Rooms" during onboarding to catch errors. This phase gives them a self-serve, evidence-based comparison against historical "golden" human answers, so they can reach a confidence bar and activate without babysitting. For full initiative context, see the ANCHOR PRD.


4. Target Users + Persona Context

Persona | Role | Goal | Pain | Workaround
Primary: SPV / Chatbot Admin | SPV / Chatbot Admin responsible for customer-interaction quality | Reach a confidence bar that proves the AI is safe to launch, then activate it | No scalable way to preview AI answers on real customer questions before go-live | ~6 hours/day of manual War Room monitoring during onboarding
Secondary: Super Admin | Decision-maker who purchased the AI module | Activate the subscription fully once the SPV signs off | No objective proof to justify activation | Waits on the SPV "green light"; keeps AI inactive

5. Non-Goals

  1. Live Shadow Mode — not a real-time system running in parallel during actual customer chats. This phase is strictly historical.
  2. Model fine-tuning UI — answers are fixed by updating the Knowledge Base/Context, not by "teaching" the AI in this workspace.
  3. Multi-modal validation — MVP is text-only; tickets with images/voice/attachments are excluded.
  4. Knowledge-based and imported test sets — deferred to Phase 2 and Phase 3.
  5. Editing the AI's answer in the workspace — the comparison view is read-only.
  6. Mobile app — Testing is web-only in this phase.

6. Constraints

  • Platform: Web only (Qontak web app — Bot Automation → Testing).
  • Performance: Batch generation is asynchronous (~2–5 min for ~50 items). Must respect the LLM provider's TPM/RPM so it never blocks live production traffic. Historical room fetch should read from a replica DB where available, to avoid slowing the live inbox.
  • Data limits: Lookback = last 90 days. Sample = 10% of eligible rooms, capped at 50–70 rooms per batch (≤50 shown when a batch exceeds 100). Text-only tickets.
  • Plan scope: All plans with the AI Agent enabled.
  • Feature flag: ai_agent_testing | default: OFF. Enabled per organization during beta.
  • Read/write: Roles owner / supervisor / admin can read + write (generate, rate). Standard agents have no access (menu hidden). Enforced server-side via set_role on every test-case endpoint.
  • Data lifecycle: ai_agent_test_cases and ai_agent_test_case_questions are soft-deleted (deleted_at, acts_as_paranoid) when a user deletes a test case; the transient LLM inference payload is not persisted beyond the response. Hard-purge window TBD (Open Question #8).

7. Feature Changes

CHG-001 — Confidence score surfaced in the Tree Diagram

  • Change Type: Modified component (Bot Flow tree diagram).
  • Page: /bot-automation/bot-flow/{id} (Tree Diagram).
  • Before: Adding/selecting an AI Agent node shows the agent with no quality signal.
  • After: When a tested AI Agent is selected, the node shows its average Confidence Score from testing (average across that agent's completed test cases).

Element | Before | After
AI Agent node (Tree Diagram) | No confidence indicator | Shows average Confidence Score from testing per AI Agent

Backend touchpoint: get_tree_diagram_v3 (chatbot). JIRA: BOT-3976.


8. New Features

Feature: AI Agent Testing page

  • URL: /bot-automation/testing — top-level item under Bot Automation (AI agents · Resources · Actions · Testing · Analytics · Bot flow). Test-case detail opens from a row.
  • Access: owner / supervisor / admin only. Standard agents: menu hidden.

Component tree:

  • TestingPage (/bot-automation/testing)
    • PageHeader → "Generate test case" button → GenerateTestCaseModal
    • FilterToolbar — search + filter
    • TestCasesTable — columns: Test case name · Testing type · Score · Status · Last updated · Actions
    • GenerateTestCaseModal — source picker:
      • "Generate from Inbox" → GenerateFromInboxDrawer (this phase)
      • "Generate from knowledge" → GenerateFromKnowledgeDrawer (Phase 2 — hidden/disabled)
      • "Imported question list" → UploadManuallyDrawer (Phase 3 — hidden/disabled)
    • TestCaseGeneratingModal — async progress while the batch runs
  • TestCaseDetail (per test case)
    • QuestionList (grouped by topic)
    • ComparisonView — Human answer (left) vs AI answer (right) + metrics + thumbs

UI States:

  • Empty: "No test cases yet" illustration + helper; primary action = Generate test case. No table.
  • Loading: skeleton rows while the list is fetched.
  • Error: blank slate "Couldn't load test cases" + Retry. Log: ai_workspace_load_failed.
  • Success: table of test cases with type, score, status, last updated, row actions.

Figma: Testing page — node 16743-298263. Code refs: qontak-designer app/pages/bot-automation/testing/index.vue + app/components/bot-automation/testing/*.


9. API & Webhook Behavior

Base path: /api/frontend_service/v1/ai_agent. All endpoints are gated server-side to owner / supervisor / admin via set_role. Technical fields (JSON schemas, error codes) will be resolved during the RFC.

# | Behavior | Entity Affected | Triggered By | Expected Behavior | Failure Behavior
1 | Create test case (triggers batch) | New ai_agent_test_cases row (bound to ai_agent_id + ai_agent_history_id) | SPV/Admin clicks Generate in the Generate-from-Inbox drawer (body: type, version_id) | Test case created (status processing); Sidekiq FetchRoomConversationsWorker (queue :ai_agent) enqueued to fetch assigned rooms, extract Q/A pairs, and (target) generate AI shadow answers; UI shows progress | Version not found → 404, no test case. Create fails → 422. Worker error on a room → skip + Rollbar; batch continues
2 | List / poll test cases | Read ai_agent_test_cases (paginated) | Opening Testing page; polling during a batch | Returns test cases with status + confidence_score; UI polls status until completed | Unauthorized role → forbidden (menu should not have been visible)
3 | Get test-case detail | Read one ai_agent_test_cases + its ai_agent_test_case_questions | SPV/Admin opens a test case | Returns each question: topic, question, answer (AI), parameters.human_answer (golden), response_time, confidence, sources, score, status | Test case not found → 404
4 | Rate a question | Update one ai_agent_test_case_questions row | SPV/Admin clicks thumbs up/down (score 0/1 + scored_by metadata) | Saves score + is_score + scored_by/scored_at; (target) recomputes the test case aggregate confidence_score | Question not found → 404. Invalid score (not 0/1) → 422

Implementation note: today RateTestCaseQuestion only persists the per-question score; the aggregate confidence_score recompute is not yet wired (Open Question #2).
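
The role gate and rating rules for behavior 4 can be sketched as below. This is a minimal, framework-free illustration in Python, not the chatbot service's code: rate_question, the in-memory questions dict, and the (status, payload) return shape are hypothetical. The 403 status is an assumption, since the table only says "forbidden", and is_score is assumed to mean "this question has been rated".

```python
from datetime import datetime, timezone

ALLOWED_ROLES = {"owner", "supervisor", "admin"}  # roles named in the PRD


def rate_question(questions, question_id, score, role, scored_by):
    """Apply a thumbs up/down rating; returns (http_status, payload)."""
    if role not in ALLOWED_ROLES:
        return 403, {"error": "forbidden"}  # assumed status code
    question = questions.get(question_id)
    if question is None:
        return 404, {"error": "question not found"}
    # bool is a subclass of int in Python, so reject True/False explicitly.
    if type(score) is not int or score not in (0, 1):
        return 422, {"error": "score must be 0 or 1"}
    question.update(
        score=score,
        is_score=True,  # assumption: marks the question as rated
        scored_by=scored_by,
        scored_at=datetime.now(timezone.utc),
    )
    return 200, question
```

The sketch persists only the per-question fields, which mirrors the implementation note above; an aggregate recompute would hook in after the update.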


10. System Flow + User Stories + ACs

10.1 System Flow

  1. SPV/Admin opens Bot Automation → Testing.
  2. Clicks "Generate test case" → chooses "Generate from inbox".
  3. Names the test case, selects the AI Agent version, clicks Generate.
  4. System creates the test case (status processing) and enqueues FetchRoomConversationsWorker.
  5. Worker fetches assigned rooms (last 90 days), extracts customer→agent question/answer pairs, (target) samples 10% capped at 50–70, and generates an AI shadow answer per question.
  6. Decision point: if total eligible rooms < 10 → use all available rooms.
  7. Failure branch: if a room fetch / LLM call fails → skip that room, log to Rollbar, continue.
  8. On completion, status → completed; UI surfaces the test case with questions grouped by topic.
  9. SPV/Admin opens a question → sees Human answer (left) vs AI answer (right) + confidence / response-time / sources.
  10. SPV/Admin rates each answer thumbs up/down → (target) confidence meter updates.
  11. When the meter reaches ≥80%, the agent is "Ready to Launch"; SPV/Admin proceeds to activate.

10.2 User Stories

All 10 stories carry their original JIRA tickets. MoSCoW preserved from the source "Importance". Implementation-status notes flag where ACs describe target behavior not yet in code.


AITEST-S01 — Workspace access control | Must Have

Story: As an SPV/Admin, I want to access the Testing page, so that I can validate AI performance before activation.

Before: No Testing page exists; there is no place to validate an AI Agent against history. After: A Bot Automation → Testing page is visible to owner/supervisor/admin only.

Data Fields: ai_agent_id (uuid, required — route), role (enum, required — auth session)

Happy Path:

  • AC-1: Given I am an owner, supervisor, or admin, when I open Bot Automation, then I see the "Testing" menu item and can open the Testing page.
  • AC-2: Given I am on the Testing page, when it loads, then I can create a new test case and open existing ones.

Error Path:

  • ERR-1: Given the test-case list fails to load, when the page renders, then I see a "Couldn't load" blank slate with Retry, and event ai_workspace_load_failed is logged.

Permission Model: CAN: owner/supervisor/admin. CANNOT: standard agents. Unauthorized: Testing menu not rendered; direct route forbidden.

UI States: Loading (skeleton rows), Empty ("No test cases yet" + Generate CTA), Error (blank slate + Retry), Success (test-case table).

Figma: node 16743-298263. Dependencies: None.


AITEST-S02 — Historical data sampling (10% rule) | Must Have

Story: As an SPV/Admin, I want the system to fetch a 10% sample of past resolved chats, so that I have a "golden standard" to test against.

Before: No mechanism to pull historical conversations into a test set. After: Generating a test case samples eligible resolved, human-handled rooms from the last 90 days.

Data Fields: lookback_days (int, fixed 90), sample_pct (int, 10), cap (int, 50–70)

Happy Path:

  • AC-1: Given 200 eligible human-handled rooms in the last 90 days, when I generate a validation set, then the system selects ~10% (≈20) at random.
  • AC-2: Given fewer than 10 eligible rooms, when I generate, then the system uses 100% of available rooms.
  • AC-3: Given 5,000 eligible rooms, when I generate, then the system caps the sample at 50–70 rooms (≤50 shown when the batch exceeds 100).

Error Path:

  • ERR-1: Given the room-list API is unavailable, when generation runs, then the test case surfaces an error state and can be retried.

Permission Model: CAN: owner/supervisor/admin. CANNOT: standard agents.

UI States: Loading (generating modal), Empty ("no eligible rooms found"), Error (retry), Success (sample loaded).

Figma: node 17699-52615. Dependencies: AITEST-S03; Chat Service room APIs (§15).

Implementation status: the worker currently fetches assigned rooms (90 days, paginated, LIMIT=100) and extracts Q/A pairs, but the 10% sampling + 50–70 cap are not yet implemented.
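
Since the sampling rule is not yet built, the three acceptance criteria can be expressed as a short sketch. It is written in Python for illustration only; sample_size and pick_rooms are hypothetical names. Two choices are assumptions: the cap uses the upper bound (70) of the 50–70 range, and 10% is rounded up to a whole room.

```python
import random

SAMPLE_PCT = 10  # from the PRD
MIN_POOL = 10    # below this, every eligible room is used (AC-2)
CAP = 70         # assumption: upper bound of the 50-70 range


def sample_size(eligible: int, pct: int = SAMPLE_PCT,
                min_pool: int = MIN_POOL, cap: int = CAP) -> int:
    """Number of rooms to draw from `eligible` rooms."""
    if eligible < min_pool:
        return eligible
    target = (eligible * pct + 99) // 100  # integer ceiling of pct% (assumed rounding)
    return min(target, cap)


def pick_rooms(room_ids, seed=None):
    """Draw a uniform random sample without replacement."""
    pool = list(room_ids)
    return random.Random(seed).sample(pool, sample_size(len(pool)))
```

With these defaults, 200 eligible rooms yield 20, 9 rooms yield all 9, and 5,000 rooms are capped at 70. Passing a seed makes a batch reproducible, which may help when debugging a disputed sample.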


AITEST-S03 — Data integrity & filtering | Must Have

Story: As the System, I must filter out invalid or non-compatible chats, so that the test set is representative and safe.

Before: No filtering rules for what counts as a valid test conversation. After: Bot-only and non-text conversations are excluded from the sample.

Data Fields: participant_type (enum: customer/agent/system), message_type (enum, must be text)

Happy Path:

  • AC-1: Given a room resolved entirely by a bot with no human reply, when sampling runs, then that room is excluded.
  • AC-2: Given a room with at least one human-agent text reply to a customer question, when sampling runs, then it is eligible.
  • AC-3: Given a ticket containing an image, voice note, or attachment, when the sample is selected, then that ticket is skipped in favor of text-only inquiries.

Error Path:

  • ERR-1: Given a room returns zero eligible question/answer pairs, when extraction runs, then the room contributes no items and is not counted toward the sample.

Permission Model: System rule (runs under an owner/supervisor/admin-triggered batch).

UI States: N/A — server-side filtering; result reflected in the generated sample.

Dependencies: AITEST-S02.

Code: ExtractConversationPairs pairs a customer question with the next agent text reply; status filter = assigned (human-handled) rooms.
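
The pairing and filtering rules in this story can be sketched as follows. This Python illustration is not the ExtractConversationPairs implementation: the Message shape and extract_pairs are hypothetical, bot replies are assumed to arrive as participant_type "system", and when a customer sends several messages in a row the latest one is assumed to be the question.

```python
from dataclasses import dataclass


@dataclass
class Message:
    participant_type: str  # "customer", "agent", or "system"
    message_type: str      # "text", "image", "voice", ...
    text: str = ""


def extract_pairs(messages):
    """Return (question, human_answer) pairs; [] means the room is ineligible."""
    if any(m.message_type != "text" for m in messages):
        return []  # AC-3: rooms with non-text content are skipped
    pairs, pending_question = [], None
    for m in messages:
        if m.participant_type == "customer":
            pending_question = m.text  # assumption: latest customer message wins
        elif m.participant_type == "agent" and pending_question is not None:
            pairs.append((pending_question, m.text))
            pending_question = None  # only the next agent reply is paired
        # "system" messages (assumed to include bot replies) are ignored
    return pairs
```

A room that returns an empty list contributes no items, which covers both the bot-only case (AC-1) and ERR-1.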


AITEST-S04 — Shadow mode execution (zero leakage) | Must Have

Story: As a PM, I want the AI to generate responses in shadow mode without messaging actual customers, so that testing is safe.

Before: The AI Agent only answers live customers. After: The AI generates answers to historical inquiries in shadow mode, never sent to customers.

Data Fields: answer (text), confidence (int), sources (jsonb), status (enum)

Happy Path:

  • AC-1: Given the AI generates a shadow response for a 3-month-old inquiry, when generation completes, then the send_message service is NOT triggered and no notification/email reaches the customer.
  • AC-2: Given a shadow response is generated, when it is stored, then the AI answer is saved on ai_agent_test_case_questions.answer and the human answer on parameters.human_answer — not in the live conversation log.

Error Path:

  • ERR-1: Given the LLM call fails for an inquiry, when generation runs, then that question is marked failed with a status_description and the batch continues.

Permission Model: CAN: system (triggered by owner/supervisor/admin). Unauthorized: not executed if flag OFF.

UI States: Loading (per-question generating), Empty (N/A), Error (per-question failed badge), Success (answer + confidence + sources shown).

Figma: node 16514-155786. Dependencies: AITEST-S03; LLM/AI service (§15).


AITEST-S05 — Side-by-side validation UI | Must Have

Story: As an SPV/Admin, I want to compare human vs AI responses side-by-side, so that I can judge accuracy effectively.

Before: No way to compare AI vs human answers. After: A side-by-side comparison per question, grouped by topic.

Data Fields: question (text), answer (text), parameters.human_answer (text), confidence (int), response_time (int), sources (jsonb), topic (string)

Happy Path:

  • AC-1: Given a completed batch, when I open a validation item, then I see the inquiry at the top, the human response (golden standard) on the left, and the AI response on the right.
  • AC-2: Given the AI response card, when it renders, then it shows Confidence, response time, and cited Sources.
  • AC-3: Given questions in the test case, when the list renders, then questions are grouped by topic.

Error Path:

  • ERR-1: Given a question failed shadow generation, when I open it, then the AI side shows a "could not generate" state instead of a blank panel.

Permission Model: CAN: owner/supervisor/admin. CANNOT: standard agents.

UI States: Loading (skeleton), Empty ("no questions in this test case"), Error (retry), Success (comparison rendered).

Figma: node 16514-155786. Dependencies: AITEST-S04.

Note: schema stores confidence + sources; there is no separate "relevance" field — if a relevance metric is required it must live in parameters (Open Question #6).


AITEST-S06 — Confidence meter & feedback | Must Have

Story: As an SPV/Admin, I want to rate AI responses, so that the system can calculate a confidence meter for launch readiness.

Before: No rating or roll-up of AI answer quality. After: Each answer can be rated thumbs up/down; ratings roll up into a confidence meter.

Data Fields: score (int 0/1, required), is_score (boolean), confidence_score (int — test-case aggregate)

Happy Path:

  • AC-1: Given I am viewing a comparison, when I click thumbs up, then the item is marked "Pass" (score = 1) and the meter increments.
  • AC-2: Given I previously marked an item thumbs down (score = 0), when I change it to thumbs up, then the aggregate confidence meter recalculates immediately.
  • AC-3: Given a test case with N rated items, when the meter renders, then it equals (thumbs-up ÷ total sample) × 100, with <80% = Low Confidence and ≥80% = Ready to Launch.

Error Path:

  • ERR-1: Given the rating save fails, when I click thumbs up/down, then the previous state is restored and an inline error is shown.

Permission Model: CAN: owner/supervisor/admin. CANNOT: standard agents.

UI States: Loading (saving rating), Empty (meter at 0% before any rating), Error (save failed inline), Success (meter updated).

Figma: node 16514-155786. Dependencies: AITEST-S05.

Implementation status: rating persists the per-question score; the confidence_score aggregate recompute is not yet wired (Open Question #2).
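
The meter formula in AC-3 can be sketched as below, as a Python illustration rather than the service code. confidence_score and readiness are hypothetical helpers. Unrated items (None) are assumed to count toward the total sample, and the percentage is rounded down so that a value just under 80 never displays as ready.

```python
THRESHOLD = 80  # from the PRD


def confidence_score(scores):
    """scores: one entry per question: 1 (thumbs up), 0 (thumbs down), None (unrated)."""
    total = len(scores)
    if total == 0:
        return 0
    thumbs_up = sum(1 for s in scores if s == 1)
    return 100 * thumbs_up // total  # assumption: round down to a whole percent


def readiness(score, threshold=THRESHOLD):
    return "Ready to Launch" if score >= threshold else "Low Confidence"
```

For example, four thumbs up out of five sampled items gives 80 and "Ready to Launch", while one thumbs up among four items with two still unrated gives 25.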


AITEST-S07 — Activation gatekeeping | Should Have

Story: As a PM, I want to prevent "Go Live" until the AI reaches a safe quality threshold, so that low-quality agents don't go live.

Before: An AI Agent can be activated regardless of any test result. After: Activation is gated until the confidence meter reaches the ≥80% threshold.

Data Fields: confidence_score (int), threshold (int, default 80)

Happy Path:

  • AC-1: Given the confidence meter is <80% (Low Confidence), when I open the agent's main settings, then the "Activate Agent" button is disabled/greyed out.
  • AC-2: Given the confidence meter is ≥80%, when I open main settings, then "Activate Agent" is enabled.

Error Path:

  • ERR-1: Given I attempt to activate via API while <80%, when the request is made, then it is rejected with a clear reason.

Permission Model: CAN: owner/supervisor/admin. CANNOT: standard agents.

UI States: Disabled (locked) below threshold; Enabled at/above threshold.

Figma: node 16514-155786. Dependencies: AITEST-S06.

Implementation status: publish_ai_agent.rb does not check any confidence threshold today — no activation gate exists yet (Open Question #2). Should-Have for this phase.


AITEST-S08 — Background processing (async) | Must Have

Story: As a user, I want large batches processed in the background, so that the UI doesn't freeze.

Before: No batch generation pipeline. After: Batches run in a background queue so the UI never blocks.

Data Fields: status (enum: processing/completed/failed), test_case_id (uuid)

Happy Path:

  • AC-1: Given I trigger a batch of ~50 items, when the request is sent, then the UI shows a progress/processing state and remains responsive.
  • AC-2: Given the batch is running, when I poll the test case, then its status reflects progress until completed.

Error Path:

  • ERR-1: Given a worker job errors, when it fails, then the failure is logged (Rollbar) and the test case surfaces an error/partial state rather than hanging.

Permission Model: CAN: owner/supervisor/admin.

UI States: Loading (TestCaseGeneratingModal / progress), Empty (N/A), Error (failed/partial badge), Success (completed).

Figma: node 16514-155786. Dependencies: AITEST-S02, AITEST-S04.

Code: implemented with Sidekiq (FetchRoomConversationsWorker, queue :ai_agent) — not Kafka.
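
The skip-and-continue behavior described for worker errors can be sketched as a plain loop. This is a Python illustration, not the Sidekiq worker: run_batch, process_room, and report_error are hypothetical. Treating a batch as failed only when every room fails is also an assumption, since the PRD's status enum has no "partial" value.

```python
def run_batch(room_ids, process_room, report_error):
    """Process each room; a failure on one room is reported and skipped."""
    room_ids = list(room_ids)
    items, skipped = [], []
    for room_id in room_ids:
        try:
            items.append(process_room(room_id))
        except Exception as exc:  # one bad room must not stop the batch
            report_error(room_id, exc)  # e.g. forward to an error tracker
            skipped.append(room_id)
    # Assumption: "failed" only when there were rooms and none succeeded.
    status = "failed" if room_ids and not items else "completed"
    return {"status": status, "items": items, "skipped_room_ids": skipped}
```

Returning the skipped room IDs alongside the status gives the UI enough information to show a partial badge without adding a new status value.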


AITEST-S09 — Manual override & audit | Could Have

Story: As an Admin, I want to force-activate the AI with a business justification, so that I'm not blocked when I have a valid reason.

Before: No override path if the gate blocks a justified activation. After: Admins can force-activate below threshold with a logged reason.

Data Fields: override_reason (text, required), actor_id (uuid), score_at_override (int)

Happy Path:

  • AC-1: Given the meter is below 80%, when I choose "Force Activate", then I must provide a reason before the action proceeds.
  • AC-2: Given I provide a reason and confirm, when force-activation completes, then the action is recorded in an audit trail (who, when, reason, score at time).

Error Path:

  • ERR-1: Given I submit force-activate without a reason, when I confirm, then the action is blocked with a validation message.

Permission Model: CAN: owner/admin. CANNOT: supervisor (override is admin-level). Unauthorized: option not shown.

UI States: Loading (submitting), Empty (N/A), Error (validation), Success (activated + audit entry).

Figma: node 16514-155786. Dependencies: AITEST-S07.

Implementation status: depends on the (not-yet-built) activation gate. Could-Have for this phase.
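
Because neither the gate nor the override exists yet, the combined rule from AITEST-S07 and AITEST-S09 can be sketched as one function. This Python illustration is hypothetical throughout (activate, ActivationBlocked, AuditEntry). It assumes a fixed threshold of 80 and that only owner and admin may override.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

THRESHOLD = 80
OVERRIDE_ROLES = {"owner", "admin"}  # supervisors cannot override


class ActivationBlocked(Exception):
    """Raised when activation is rejected; the message carries the reason."""


@dataclass
class AuditEntry:
    actor_id: str
    override_reason: str
    score_at_override: int
    created_at: datetime


def activate(score: int, actor_id: str, role: str,
             override_reason: Optional[str] = None,
             threshold: int = THRESHOLD) -> Optional[AuditEntry]:
    """Return None for a normal activation, or an AuditEntry for a forced one."""
    if score >= threshold:
        return None
    if role not in OVERRIDE_ROLES:
        raise ActivationBlocked(f"confidence {score}% is below {threshold}%")
    reason = (override_reason or "").strip()
    if not reason:
        raise ActivationBlocked("a reason is required to force-activate")
    return AuditEntry(actor_id, reason, score, datetime.now(timezone.utc))
```

A caller would persist the returned AuditEntry before publishing the agent, so the audit trail records who overrode the gate, when, why, and at what score.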


AITEST-S10 — Confidence score in Tree Diagram | Should Have

Story: As an SPV/Admin or Bot Specialist, I want to see the confidence score per AI Agent in the Tree Diagram, so that I can judge quality while composing a flow.

Before: Tree Diagram AI Agent nodes show no quality signal. After: A selected, tested AI Agent shows its average confidence score from testing.

Data Fields: ai_agent_id (uuid), avg_confidence_score (int, computed)

Happy Path:

  • AC-1: Given I am on the Tree Diagram page, when I click +, choose "AI Agent", and select a configured agent, then the node shows that agent's Confidence Score from testing.
  • AC-2: Given an agent has multiple completed test cases, when the score renders, then it is the average across those test cases.
  • AC-3: Given an agent has no completed test cases, when selected, then the node shows "no score yet" rather than 0%.

Error Path:

  • ERR-1: Given the confidence lookup fails, when the node renders, then it falls back to "no score yet" rather than erroring the diagram.

Permission Model: CAN: owner/supervisor/admin/bot-specialist.

UI States: Loading (node fetching), Empty ("no score yet"), Error (falls back to "no score yet"), Success (average score shown).

Figma: node 16514-155786. Dependencies: AITEST-S06. Backend: get_tree_diagram_v3.
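
The averaging and fallback rules in AC-2, AC-3, and ERR-1 can be sketched as below. This is a Python illustration, not the get_tree_diagram_v3 code: the dict shape for a test case, average_confidence, and node_label are hypothetical, and rounding the average down is an assumption.

```python
def average_confidence(test_cases):
    """Average confidence_score over completed test cases; None if there are none."""
    scores = [
        tc["confidence_score"]
        for tc in test_cases
        if tc.get("status") == "completed" and tc.get("confidence_score") is not None
    ]
    if not scores:
        return None
    return sum(scores) // len(scores)  # assumption: round down


def node_label(test_cases):
    """Text for the AI Agent node; never raises, so the diagram keeps rendering."""
    try:
        average = average_confidence(test_cases)
    except Exception:
        average = None  # ERR-1: a failed lookup falls back to "no score yet"
    return "no score yet" if average is None else f"{average}%"
```

Returning None rather than 0 keeps an untested agent distinct from a tested agent that scored 0%.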


Negative Scenarios

  • NEG-1: Given I am a standard agent, when I look for AI Agent testing, then the Testing menu is not rendered and the route is forbidden.
  • NEG-2: Given a conversation contains only images/voice/attachments, when sampling runs, then it is excluded from the test set.
  • NEG-3: Given I am in the comparison view, when I try to edit the AI's answer, then no edit affordance exists — I must update the Knowledge Base instead.
  • NEG-4: Given the current phase, when I open "Generate test case", then "Generate from knowledge" and "Imported question list" are not selectable (deferred to Phase 2/3).

11. Rollout

  • Feature flag: ai_agent_testing | default: OFF. Enabled per organization during beta.
  • Stage 1: Internal — telesales POC org(s) (the original POC team).
  • Stage 2: Closed beta — 3–5 Enterprise/Pro orgs (manually enabled).
  • Stage 3: All orgs with AI Agent enabled, on request.
  • GA: All orgs with AI Agent enabled (self-serve toggle).
  • Backward compat: Yes — additive. AI Agents with no test cases show "no score yet"; existing publish/activate behavior is unchanged until the activation gate (AITEST-S07) ships.
  • Migration: Additive tables (ai_agent_test_cases, ai_agent_test_case_questions). No backfill.

Semantic regression rollback: ai_agent_testing is the per-org kill switch. If beta orgs report the score is misleading, or the gate suppresses legitimate activations, PM toggles ai_agent_testing OFF per org (no deploy); the gate reverts to advisory-only.


12. Observability

Event Name | Trigger | Properties
ai_workspace_opened | User opens the Testing page | user_id, org_id, bot_id, timestamp
ai_validation_generated | User clicks Generate (from inbox) | sample_size, date_range, test_case_id, timestamp
ai_response_graded | User clicks thumbs up/down | grade (pass/fail), confidence_score, inquiry_id, timestamp
ai_validation_completed | Confidence meter reaches the ≥80% threshold | total_time_spent, total_items_reviewed, timestamp
ai_agent_activated | User clicks Go Live | previous_validation_score, timestamp

Dashboard owner: BOT squad (Mixpanel + Tableau).

Alerts:

  • ai_validation_generated batch failure rate > 10% in 1h → Slack: #bot-ai-alerts (on-call).
  • LLM error rate during shadow generation > 5% in 15m → Slack: #bot-ai-alerts + PagerDuty on-call.

Post-Launch Monitoring Cadence:

  • Cadence: Weekly for the first 4 weeks post-GA, then monthly.
  • Owner: Dimas Fauzi Hidayat (BOT squad PM).
  • Triggers: activation conversion drops > 10% WoW → investigate within 48h; batch failure rate > 10% for 2 consecutive weeks → PM review + eng escalation.
  • Rollback: if batch failure rate > 20% unresolved within 24h, PM disables ai_agent_testing for affected orgs.


13. Success Metrics

Primary KPI: Conversion Rate (Configured → Live)

  • Definition: % of AI Agents activated within 7 days of finishing configuration
  • Baseline: N/A — new capability (today activation takes a minimum of ~3 days, often weeks)
  • Target: ≥ 60% within 7 days, within 90 days of GA

Adoption: "Confidence Bar" completion rate

  • Definition: % of started test cases in which the reviewer rates enough items for the bar to reach ≥80%
  • Baseline: N/A — new capability
  • Target: ≥ 70% of started test cases reach ≥80%

Quality: Shadow-generation success rate

  • Definition: % of sampled questions that produce a valid AI shadow answer (no LLM failure)
  • Baseline: N/A — new capability
  • Target: ≥ 95% within 60 days of GA

Efficiency: Time-to-Confidence

  • Definition: Time from first setup to a go-live-ready validation result per agent
  • Baseline: ~6 hours/day of manual War Room monitoring during onboarding
  • Target: < 1 hour of reviewing the comparison report per agent, within 90 days of GA

Targets assume a re-baselined launch timeline (Open Question #3), because the original "May 2026 / Q1 2026" dates are past.


14. Launch Plan & Stage Gates

Stage | Audience | Duration | Success Gate | Owner
Internal Alpha | Telesales POC org(s) | 2 weeks | 0 P0/P1 bugs; shadow-generation success ≥ 90%; zero customer-message leakage | PM + QA
Closed Beta | 3–5 Enterprise/Pro orgs | 3–4 weeks | ≥ 70% of started test cases reach ≥80% bar; batch failure rate ≤ 10% | PM + CSM
Open Beta | All orgs with AI Agent, on request | 3 weeks | Shadow-generation success ≥ 95% sustained 1 week; no P0/P1 open | Eng Lead
GA | All orgs with AI Agent enabled | Ongoing | All Open Beta gates sustained 2 weeks; PMM launch approved | PM + PMM

15. Dependencies

Dependency | Owning Team | Deliverable Needed | Blocking?
Chat Service (Hub) | Inbox / Platform | Assigned-room list + room messages APIs (FetchAssignedRoomIds, FetchRoomMessages) over a 90-day window | YES
LLM / AI service | AI squad | Batch shadow inference within TPM/RPM limits (sync_to_ai_service, qontak_nlp/predict) | YES
Data team | Data | 10% sampling + 50–70 cap algorithm (currently unbuilt) | YES
Channel Integration | Platform | Access tokens for room fetch (ChannelIntegrations::GetTokens) | YES
AI Agent versioning | BOT | ai_agent_histories stable; test cases bind to a version | YES

16. Key Decisions + Alternatives Rejected

Initiative-level decisions live in the ANCHOR PRD §5. Below are phase-specific decisions.

16a — Decisions Made

Date | Decision | Rationale
2025-12-19 | Sample resolved, human-handled rooms over the last 90 days; cap 50–70 per batch | Recent data reflects current policies; cap controls token cost/latency on large accounts
2026-06-18 | Store AI answer in answer and the human golden answer in parameters.human_answer on the question row | Matches the implemented schema; keeps comparison data on one record
2026-06-18 | Use Sidekiq (FetchRoomConversationsWorker) for batch generation | Already the chatbot async stack; avoids introducing Kafka for this workload

16b — Alternatives Rejected

Alternative | Why Rejected | Date
Expose sampling parameters (date range, %, cap) in the Generate drawer | Adds user effort; system defaults (90d/10%/cap) are sufficient for the trust signal | 2026-06-18
Separate ai_validation_sessions / ai_validation_items tables (original draft) | Superseded by the implemented ai_agent_test_cases / ai_agent_test_case_questions schema | 2026-06-18

17. Open Questions

# | Type | Question | Owner | Deadline
1 | Risk | Historical tickets contain PII sent to a 3rd-party LLM. Mitigation: covered by existing DPA; transient inference only; not used to train the public model. | Dimas (PM) | 2026-07-01
2 | Risk | Confidence-meter recalc (S06) + activation gate (S07) are not yet built. Mitigation: ship advisory-only for beta; enforce the gate before GA. | Eng (BOT) | 2026-07-15
3 | Risk | Original launch dates ("May 2026" GA, "late Q1 2026" beta) are now in the past. Mitigation: re-baseline the timeline with stakeholders before READY. | Dimas (PM) | 2026-07-01
4 | Open Question | Is the 80% confidence threshold fixed or org-configurable? | Dimas (PM) | 2026-07-15
5 | Open Question | Per-batch token budget across plan tiers: does the 50–70 cap hold for all? | Data team (Reza) | 2026-07-15
6 | Open Question | Is a separate "relevance" metric required (schema only has confidence)? If yes, store in parameters. | Dimas (PM) | 2026-07-15
7 | Assumption | A single human reply is a sufficient "golden answer" when a room has multiple agent messages. | Data team (Reza) | 2026-07-15
8 | Open Question | Hard-purge window for soft-deleted test cases/questions. | Eng (BOT) | 2026-07-31

PRD CHANGELOG

Version | Date | By | Section | Type | Summary
1.0 | 2026-06-18 | Claude | All | CREATED | Phase 1 PRD authored in the documents-repo template under the new AI Agent: Testing ANCHOR. Reconciled with the Confluence draft and current code (chatbot, chatbot-fe, qontak-designer): endpoints, schema (ai_agent_test_cases / ai_agent_test_case_questions), async queue (Sidekiq), and sampling status filter (assigned) aligned to implementation; not-yet-built items (10%/cap sampling, confidence-meter recalc, activation gate) flagged as Open Questions. 10 stories with composite AC ids (AITEST-S01…S10).