
Qontak | AI Agent | Testing — ANCHOR

ANCHOR PRD — the initiative master index. It orchestrates all phases beneath it and carries no acceptance criteria of its own (ACs live in each phase PRD). Synced with the canonical Confluence AI Agent: Testing ANCHOR (QON 51223396435) and reconciled against the actual codebase (chatbot, chatbot-fe, qontak-designer).

Scope: AI Agent Testing is a dedicated Testing workspace where SPVs/Admins validate an AI Agent's quality before go-live, replacing the manual ~6-hour/day "War Room". Phases are organised by test-case source: Historical (real resolved conversations), Knowledge (synthesised questions), and Imported (curated list).

HEADER BLOCK

PM: Dimas Fauzi Hidayat
PRD Version: 1.0
Status: ACTIVE
PRD Type: ANCHOR
Labels: epic:qontak-chatbot | module:ai-agent | feature:ai-agent-testing
Last Updated: 2026-06-18

Phase Index

| Phase | Goal | PRD Link | Epic | Status |
| --- | --- | --- | --- | --- |
| Phase 1: Historical Validation | Validate AI Agent quality against a sample of historical, resolved human conversations so SPV/Admin can confidently go live | prds/historical-validation.md | BOT-3351 | 🔄 In Progress |
| Phase 2: Generate from Knowledge | Generate test questions from the AI Agent's knowledge source to validate coverage and accuracy | TBD | TBD | ⏳ Not started |
| Phase 3: Imported Question List | Validate the AI Agent against a PM/SPV-curated question set uploaded into a test case (single + multi-turn) | prds/imported-test-cases.md | TBD | 📝 Draft |

Status options: 📝 Draft · 🔄 In Progress · ✅ Shipped · ⏸ Paused · ❌ Cancelled

Phases 2–3 are seeded from the test-case sources already scaffolded in the design/code (GenerateTestCaseModal). They are placeholders until their phase PRDs are written.


2. One-liner + Problem

One-liner: A dedicated Testing workspace that lets Qontak SPVs and Admins validate AI Agent quality before go-live — across historical, knowledge-based, and imported test sets.

Problem: Admins and SPVs don't trust the AI Agent to handle live customers immediately, so today they verify quality with a high-effort "War Room" where leads spend ~6 hours/day manually monitoring active rooms to catch errors. This is unsustainable and unscalable, delaying activation and inflating implementation cost. Without a self-serve way to prove the AI matches their best human agents, clients buy the AI module but stall at the "Activate" step.


3. Target Users + Persona Context

| Persona | Role | Goal | Pain | Workaround |
| --- | --- | --- | --- | --- |
| Primary: SPV / Chatbot Admin | Operational manager / team lead responsible for customer-interaction quality | Validate the AI Agent's quality so they can confidently report it is safe to launch | No way to preview AI performance at scale before going live; fear of a wrong answer to a VIP | ~6 hours/day in a manual "War Room" watching live chats during onboarding |
| Secondary: Super Admin | Decision-maker who purchased the AI module | Activate the AI subscription fully once quality is proven | Relies on the SPV "green light" but has no objective proof | Delays full activation; keeps the AI configured but inactive |

Data-source role — Human Agents: their historical resolved conversations are the "Golden Standard" the AI is compared against (not a primary user; their data powers the validation).


4. Success Metrics (initiative-level)

Primary KPI: Conversion Rate (Configured → Live)

  • Definition: % of AI Agents activated within 7 days of finishing configuration
  • Baseline: N/A — new capability (today activation takes a minimum of ~3 days, often weeks)
  • Target: ≥ 60% of AI Agents activated within 7 days of finishing configuration, reached within 90 days of GA
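
To make the KPI definition concrete, here is a minimal sketch of how the Configured → Live conversion rate could be computed. The `AgentRecord` shape and its field names are hypothetical illustrations, not the actual chatbot schema, and the 7-day window is treated as inclusive, which is an assumption the PRD does not state.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class AgentRecord:
    # Hypothetical record shape; field names are assumptions, not the real schema.
    agent_id: str
    configured_at: datetime                   # when configuration was finished
    activated_at: Optional[datetime] = None   # None if the agent never went live


def conversion_rate(agents: List[AgentRecord], window_days: int = 7) -> float:
    """Share of configured agents activated within `window_days` of configuration."""
    if not agents:
        return 0.0
    window = timedelta(days=window_days)
    converted = sum(
        1
        for a in agents
        if a.activated_at is not None
        # Inclusive window (assumption): activation exactly on day 7 counts.
        and timedelta(0) <= a.activated_at - a.configured_at <= window
    )
    return converted / len(agents)
```

Comparing the returned value against 0.60 gives the pass/fail check for the target. Agents that were never activated stay in the denominator, so stalled activations lower the rate rather than disappearing from it.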

Quality: "Confidence Bar" completion rate

  • Definition: % of Admins who review enough items to reach High Confidence (≥80% on the bar)
  • Baseline: N/A — new capability
  • Target: ≥ 70% of started test cases reach ≥80%
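
As an illustration of the arithmetic, the sketch below computes the completion rate over started test cases, following the wording of the Target line. It assumes each started test case exposes its current Confidence Bar value as a fraction between 0 and 1; how the bar itself is derived is not specified here, so the function takes the values as given. The function names are hypothetical.

```python
from typing import Iterable

HIGH_CONFIDENCE = 0.80   # "High Confidence" threshold from the metric definition
TARGET_RATE = 0.70       # target share of started test cases reaching the threshold


def completion_rate(bar_values: Iterable[float], threshold: float = HIGH_CONFIDENCE) -> float:
    """Fraction of started test cases whose Confidence Bar reached `threshold`."""
    values = list(bar_values)
    if not values:
        return 0.0
    reached = sum(1 for v in values if v >= threshold)
    return reached / len(values)


def meets_target(rate: float, target: float = TARGET_RATE) -> bool:
    """True when the completion rate is at or above the target."""
    return rate >= target
```

For example, four started test cases with bar values 0.9, 0.8, 0.5, and 0.85 give a completion rate of 0.75, which meets the 70% target.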

Efficiency: Time-to-Confidence

  • Definition: Hours from first setup to a go-live-ready validation result per agent
  • Baseline: ~6 hours/day of manual War Room monitoring during onboarding
  • Target: < 1 hour of reviewing the comparison report per agent, within 90 days of GA

5. Key Decisions + Alternatives Rejected

5a — Decisions Made

| Date | Decision | Rationale |
| --- | --- | --- |
| 2025-12-19 | Build a standalone Testing workspace (not real-time shadow mode) that validates against historical resolved conversations | Historical "golden" human answers give the strongest psychological-safety signal using the customer's own data; batch testing is far cheaper than real-time shadow infrastructure |
| 2026-06-18 | Structure Testing as an ANCHOR with phased test-case sources (Historical → Knowledge → Imported) | The design/code already scaffolds three generation sources; phasing lets Historical Validation ship first and de-risks the broader testing surface |

5b — Alternatives Rejected

| Alternative | Why Rejected | Date |
| --- | --- | --- |
| Manual Playground / Sandbox (type hypothetical questions one by one) | High user effort to invent representative test cases; doesn't prove behavior on real customers | 2025-12-19 |
| Generic "Golden" test set (pre-made 100 common CS questions) | Low relevance: every business has unique products and conversation behavior | 2025-12-19 |
| Live Shadow Mode (run AI in parallel during real customer chats) | Complex, higher risk of leakage; historical batch testing achieves the trust signal without touching live traffic | 2025-12-19 |

6. Open Questions

| # | Type | Question | Owner | Deadline |
| --- | --- | --- | --- | --- |
| 1 | Open Question | Is the 80% confidence threshold fixed, or should it be org-configurable per AI Agent? | Dimas (PM) | 2026-07-15 |
| 2 | Open Question | Per-batch token budget across plan tiers: does the 50–70 room cap hold for all? | Data team (Reza) | 2026-07-15 |
| 3 | Risk | Historical tickets contain PII sent to a 3rd-party LLM. Mitigation: covered by existing DPA; transient inference only; not used to train the public model. | Dimas (PM) | 2026-07-01 |
| 4 | Risk | Launch dates referenced in the original draft ("May 2026" GA, "late Q1 2026" beta) are now in the past. Mitigation: re-baseline the timeline with stakeholders before Phase 1 READY. | Dimas (PM) | 2026-07-01 |

PRD CHANGELOG

| Version | Date | By | Section | Type | Summary |
| --- | --- | --- | --- | --- | --- |
| 1.0 | 2026-06-18 | Claude | All | CREATED | ANCHOR created from the "AI Agent Testing (Historical Validation)" Confluence draft. Historical Validation set as Phase 1; Generate-from-Knowledge and Imported-Question-List seeded as Phase 2/3 placeholders from the existing design/code scaffolding. |