Qontak | AI Agent | Testing — ANCHOR
ANCHOR PRD: the initiative master index. It orchestrates all phases beneath it and carries no acceptance criteria of its own (ACs live in each phase PRD). Synced with the canonical Confluence AI Agent: Testing ANCHOR (QON 51223396435) and reconciled against the actual codebase (chatbot, chatbot-fe, qontak-designer).
Scope: AI Agent Testing = a dedicated Testing workspace where SPV/Admins validate an AI Agent's quality before go-live, replacing the manual ~6-hour/day "War Room". Phases are organised by test-case source: Historical (real resolved conversations), Knowledge (synthesised questions), and Imported (curated list).
HEADER BLOCK
| Field | Value |
|---|---|
| PM | Dimas Fauzi Hidayat |
| PRD Version | 1.0 |
| Status | ACTIVE |
| PRD Type | ANCHOR |
| Labels | epic:qontak-chatbot, module:ai-agent, feature:ai-agent-testing |
| Last Updated | 2026-06-18 |
1. Phase Index
| Phase | Goal | PRD Link | Epic | Status |
|---|---|---|---|---|
| Phase 1: Historical Validation | Validate AI Agent quality against a sample of historical, resolved human conversations so SPV/Admin can confidently go live | prds/historical-validation.md | BOT-3351 | 🔄 In Progress |
| Phase 2: Generate from Knowledge | Generate test questions from the AI Agent's knowledge source to validate coverage and accuracy | TBD | TBD | ⏳ Not started |
| Phase 3: Imported Question List | Validate the AI Agent against a PM/SPV-curated question set uploaded into a test case — single + multi-turn | prds/imported-test-cases.md | TBD | 📝 Draft |
Status options: 📝 Draft · 🔄 In Progress · ✅ Shipped · ⏸ Paused · ❌ Cancelled
Phases 2–3 are seeded from the test-case sources already scaffolded in the design/code (GenerateTestCaseModal). They are placeholders until their phase PRDs are written.
2. One-liner + Problem
One-liner: A dedicated Testing workspace that lets Qontak SPVs and Admins validate AI Agent quality before go-live — across historical, knowledge-based, and imported test sets.
Problem: Admins and SPVs don't trust the AI Agent to handle live customers immediately, so today they verify quality with a high-effort "War Room" where leads spend ~6 hours/day manually monitoring active rooms to catch errors. This is unsustainable and unscalable, delaying activation and inflating implementation cost. Without a self-serve way to prove the AI matches their best human agents, clients buy the AI module but stall at the "Activate" step.
3. Target Users + Persona Context
| Persona | Role | Goal | Pain | Workaround |
|---|---|---|---|---|
| Primary — SPV / Chatbot Admin | Operational manager / team lead responsible for customer-interaction quality | Validate the AI Agent's quality so they can confidently report it is safe to launch | No way to preview AI performance at scale before going live; fear of a wrong answer to a VIP | ~6 hours/day in a manual "War Room" watching live chats during onboarding |
| Secondary — Super Admin | Decision-maker who purchased the AI module | Activate the AI subscription fully once quality is proven | Relies on the SPV "green light" but has no objective proof | Delays full activation; keeps the AI configured but inactive |
Data-source role — Human Agents: their historical resolved conversations are the "Golden Standard" the AI is compared against (not a primary user; their data powers the validation).
4. Success Metrics (initiative-level)
⭐ Primary KPI: Conversion Rate (Configured → Live)
- Definition: % of AI Agents activated within 7 days of finishing configuration
- Baseline: N/A — new capability (today activation takes a minimum of ~3 days, often weeks)
- Target: ≥ 60% of configured AI Agents activated within 7 days, reached within 90 days of GA
Quality: "Confidence Bar" completion rate
- Definition: % of Admins who review enough items to reach High Confidence (≥80% on the bar)
- Baseline: N/A — new capability
- Target: ≥ 70% of started test cases reach ≥80%
Efficiency: Time-to-Confidence
- Definition: Hours from first setup to a go-live-ready validation result per agent
- Baseline: ~6 hours/day of manual War Room monitoring during onboarding
- Target: < 1 hour of reviewing the comparison report per agent, within 90 days of GA
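Illustrative sketch (not a spec): the two rate metrics above could be computed roughly as follows. The record shape, field names, and function names are hypothetical and do not come from the codebase. It also assumes that an activation exactly 7 days after configuration still counts, and that each started test case exposes its confidence-bar value as a fraction between 0 and 1.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence

ACTIVATION_WINDOW = timedelta(days=7)  # from the Primary KPI definition
HIGH_CONFIDENCE = 0.80                 # "High Confidence" threshold on the bar


@dataclass(frozen=True)
class AgentRecord:
    """Hypothetical record shape; field names are illustrative only."""
    configured_at: datetime
    activated_at: Optional[datetime]  # None if the agent never went live


def conversion_rate(agents: Sequence[AgentRecord]) -> float:
    """Share of configured agents activated within 7 days of configuration."""
    if not agents:
        return 0.0
    converted = sum(
        1
        for a in agents
        if a.activated_at is not None
        # Assumption: the window is inclusive of exactly 7 days.
        and timedelta(0) <= a.activated_at - a.configured_at <= ACTIVATION_WINDOW
    )
    return converted / len(agents)


def confidence_completion_rate(bar_values: Sequence[float]) -> float:
    """Share of started test cases whose confidence bar reached >= 80%."""
    if not bar_values:
        return 0.0
    reached = sum(1 for v in bar_values if v >= HIGH_CONFIDENCE)
    return reached / len(bar_values)


# Example with made-up data.
agents = [
    AgentRecord(datetime(2026, 7, 1), datetime(2026, 7, 4)),   # 3 days: counts
    AgentRecord(datetime(2026, 7, 1), datetime(2026, 7, 20)),  # 19 days: too late
    AgentRecord(datetime(2026, 7, 1), None),                   # never activated
    AgentRecord(datetime(2026, 7, 2), datetime(2026, 7, 9)),   # exactly 7 days: counts
]
print(conversion_rate(agents))                           # 0.5
print(confidence_completion_rate([0.9, 0.8, 0.5, 0.1]))  # 0.5
```

In this made-up sample, both rates come out at 50%, which would fall short of the ≥ 60% and ≥ 70% targets. The denominators (all configured agents, all started test cases) follow the definitions and targets above and would need confirming with the data team before instrumenting.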
5. Key Decisions + Alternatives Rejected
5a — Decisions Made
| Date | Decision | Rationale |
|---|---|---|
| 2025-12-19 | Build a standalone Testing workspace (not real-time shadow mode) that validates against historical resolved conversations | Historical "golden" human answers give the strongest psychological-safety signal using the customer's own data; batch testing is far cheaper than real-time shadow infrastructure |
| 2026-06-18 | Structure Testing as an ANCHOR with phased test-case sources (Historical → Knowledge → Imported) | The design/code already scaffolds three generation sources; phasing lets Historical Validation ship first and de-risks the broader testing surface |
5b — Alternatives Rejected
| Alternative | Why Rejected | Date |
|---|---|---|
| Manual Playground / Sandbox (type hypothetical questions one by one) | High user effort to invent representative test cases; doesn't prove behavior on real customers | 2025-12-19 |
| Generic "Golden" test set (pre-made 100 common CS questions) | Low relevance — every business has unique products and conversation behavior | 2025-12-19 |
| Live Shadow Mode (run AI in parallel during real customer chats) | Complex, higher risk of leakage; historical batch testing achieves the trust signal without touching live traffic | 2025-12-19 |
6. Open Questions
| # | Type | Question | Owner | Deadline |
|---|---|---|---|---|
| 1 | Open Question | Is the 80% confidence threshold fixed, or should it be org-configurable per AI Agent? | Dimas (PM) | 2026-07-15 |
| 2 | Open Question | Per-batch token budget across plan tiers — does the 50–70 room cap hold for all? | Data team (Reza) | 2026-07-15 |
| 3 | Risk | Historical tickets contain PII sent to a 3rd-party LLM. Mitigation: covered by existing DPA; transient inference only; not used to train the public model. | Dimas (PM) | 2026-07-01 |
| 4 | Risk | Launch dates referenced in the original draft ("May 2026" GA, "late Q1 2026" beta) are now in the past. Mitigation: re-baseline the timeline with stakeholders before Phase 1 READY. | Dimas (PM) | 2026-07-01 |
PRD CHANGELOG
| Version | Date | By | Section | Type | Summary |
|---|---|---|---|---|---|
| 1.0 | 2026-06-18 | Claude | All | CREATED | ANCHOR created from the "AI Agent Testing (Historical Validation)" Confluence draft. Historical Validation set as Phase 1; Generate-from-Knowledge and Imported-Question-List seeded as Phase 2/3 placeholders from the existing design/code scaffolding. |