Qontak | AI Agent | Testing — ANCHOR

ANCHOR PRD — the initiative master index. It orchestrates all phases beneath it and carries no acceptance criteria of its own (ACs live in each phase PRD). Synced with the canonical Confluence AI Agent: Testing ANCHOR (QON 51223396435) and reconciled against the actual codebase (chatbot, chatbot-fe, qontak-designer).

Scope: AI Agent Testing is a dedicated Testing workspace where SPVs and Admins validate an AI Agent's quality before go-live, replacing the manual ~6-hour/day "War Room". Phases are organised by test-case source: Historical (real resolved conversations), Knowledge (synthesised questions), and Imported (curated list).

HEADER BLOCK

Field	Value
PM	Dimas Fauzi Hidayat
PRD Version	1.0
Status	ACTIVE
PRD Type	ANCHOR
Labels	`epic:qontak-chatbot` | `module:ai-agent` | `feature:ai-agent-testing`
Last Updated	2026-06-18

Phase Index

Phase	Goal	PRD Link	Epic	Status
Phase 1: Historical Validation	Validate AI Agent quality against a sample of historical, resolved human conversations so SPV/Admin can confidently go live	`prds/historical-validation.md`	BOT-3351	🔄 In Progress
Phase 2: Generate from Knowledge	Generate test questions from the AI Agent's knowledge source to validate coverage and accuracy	TBD	TBD	⏳ Not started
Phase 3: Imported Question List	Validate the AI Agent against a PM/SPV-curated question set uploaded into a test case — single + multi-turn	`prds/imported-test-cases.md`	TBD	📝 Draft

Status options: 📝 Draft · 🔄 In Progress · ✅ Shipped · ⏸ Paused · ❌ Cancelled

Phases 2–3 are seeded from the test-case sources already scaffolded in the design/code (GenerateTestCaseModal). They are placeholders until their phase PRDs are written.

2. One-liner + Problem

One-liner: A dedicated Testing workspace that lets Qontak SPVs and Admins validate AI Agent quality before go-live — across historical, knowledge-based, and imported test sets.

Problem: Admins and SPVs don't trust the AI Agent to handle live customers immediately, so today they verify quality with a high-effort "War Room" where leads spend ~6 hours/day manually monitoring active rooms to catch errors. This is unsustainable and unscalable, delaying activation and inflating implementation cost. Without a self-serve way to prove the AI matches their best human agents, clients buy the AI module but stall at the "Activate" step.

3. Target Users + Persona Context

Persona	Role	Goal	Pain	Workaround
Primary — SPV / Chatbot Admin	Operational manager / team lead responsible for customer-interaction quality	Validate the AI Agent's quality so they can confidently report it is safe to launch	No way to preview AI performance at scale before going live; fear of a wrong answer to a VIP	~6 hours/day in a manual "War Room" watching live chats during onboarding
Secondary — Super Admin	Decision-maker who purchased the AI module	Activate the AI subscription fully once quality is proven	Relies on the SPV "green light" but has no objective proof	Delays full activation; keeps the AI configured but inactive

Data-source role — Human Agents: their historical resolved conversations are the "Golden Standard" the AI is compared against (not a primary user; their data powers the validation).

4. Success Metrics (initiative-level)

⭐ Primary KPI: Conversion Rate (Configured → Live)

Definition: % of AI Agents activated within 7 days of finishing configuration
Baseline: N/A — new capability (today activation takes a minimum of ~3 days, often weeks)
Target: ≥ 60% of configured AI Agents activated within 7 days, reached within 90 days of GA
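
As a rough illustration of how the primary KPI could be computed, the sketch below counts the share of configured agents that were activated within the 7-day window. The `AgentRecord` shape and its field names are hypothetical and are not taken from the chatbot codebase; only the 7-day window and the 60% target come from this PRD.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

ACTIVATION_WINDOW = timedelta(days=7)  # window from the KPI definition
TARGET_RATE = 0.60                     # target from the KPI definition


@dataclass
class AgentRecord:
    # Hypothetical record shape; field names are illustrative only.
    agent_id: str
    configured_at: datetime
    activated_at: Optional[datetime] = None  # None = configured but never activated


def conversion_rate(agents: list[AgentRecord]) -> float:
    """Share of configured agents activated within 7 days of configuration."""
    if not agents:
        return 0.0
    converted = sum(
        1
        for a in agents
        if a.activated_at is not None
        and timedelta(0) <= a.activated_at - a.configured_at <= ACTIVATION_WINDOW
    )
    return converted / len(agents)


def meets_target(agents: list[AgentRecord]) -> bool:
    """True when the conversion rate reaches the 60% target."""
    return conversion_rate(agents) >= TARGET_RATE
```

Agents that are configured but never activated stay in the denominator, which is one reasonable reading of "Configured → Live"; the phase PRD may define the cohort differently.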

Quality: "Confidence Bar" completion rate

Definition: % of Admins who review enough items to reach High Confidence (≥80% on the bar)
Baseline: N/A — new capability
Target: ≥ 70% of started test cases reach High Confidence (≥ 80% on the bar)
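
The completion-rate target can be sketched the same way. The example assumes each started test case exposes its current Confidence Bar value as a percentage from 0 to 100; how the bar itself is calculated is not specified here, so the input list is a stand-in. The threshold is a parameter because Open Question 1 leaves its configurability undecided.

```python
HIGH_CONFIDENCE_THRESHOLD = 80.0  # percent on the bar, per the metric definition
TARGET_COMPLETION_RATE = 0.70     # share of started test cases, per the target


def completion_rate(
    bar_values: list[float],
    threshold: float = HIGH_CONFIDENCE_THRESHOLD,
) -> float:
    """Share of started test cases whose Confidence Bar is at or above the threshold.

    `bar_values` is a hypothetical input: one bar percentage per started test case.
    """
    if not bar_values:
        return 0.0
    reached = sum(1 for value in bar_values if value >= threshold)
    return reached / len(bar_values)


def meets_completion_target(bar_values: list[float]) -> bool:
    """True when at least 70% of started test cases reach High Confidence."""
    return completion_rate(bar_values) >= TARGET_COMPLETION_RATE
```

For example, bar values of 85, 80, 40, and 90 give three of four test cases at High Confidence, which clears the 70% target.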

Efficiency: Time-to-Confidence

Definition: Hours from first setup to a go-live-ready validation result per agent
Baseline: ~6 hours/day of manual War Room monitoring during onboarding
Target: < 1 hour of reviewing the comparison report per agent, within 90 days of GA

5. Key Decisions + Alternatives Rejected

5a — Decisions Made

Date	Decision	Rationale
2025-12-19	Build a standalone Testing workspace (not real-time shadow mode) that validates against historical resolved conversations	Historical "golden" human answers give the strongest psychological-safety signal using the customer's own data; batch testing is far cheaper than real-time shadow infrastructure
2026-06-18	Structure Testing as an ANCHOR with phased test-case sources (Historical → Knowledge → Imported)	The design/code already scaffolds three generation sources; phasing lets Historical Validation ship first and de-risks the broader testing surface

5b — Alternatives Rejected

Alternative	Why Rejected	Date
Manual Playground / Sandbox (type hypothetical questions one by one)	High user effort to invent representative test cases; doesn't prove behavior on real customers	2025-12-19
Generic "Golden" test set (pre-made 100 common CS questions)	Low relevance — every business has unique products and conversation behavior	2025-12-19
Live Shadow Mode (run AI in parallel during real customer chats)	Complex, higher risk of leakage; historical batch testing achieves the trust signal without touching live traffic	2025-12-19

6. Open Questions

#	Type	Question	Owner	Deadline
1	Open Question	Is the 80% confidence threshold fixed, or should it be org-configurable per AI Agent?	Dimas (PM)	2026-07-15
2	Open Question	Per-batch token budget across plan tiers: does the 50–70 room cap hold for all tiers?	Data team (Reza)	2026-07-15
3	Risk	Historical tickets contain PII that is sent to a third-party LLM. Mitigation: covered by existing DPA; transient inference only; not used to train the public model.	Dimas (PM)	2026-07-01
4	Risk	Launch dates referenced in the original draft ("May 2026" GA, "late Q1 2026" beta) are now in the past. Mitigation: re-baseline the timeline with stakeholders before Phase 1 READY.	Dimas (PM)	2026-07-01

PRD CHANGELOG

Version	Date	By	Section	Type	Summary
1.0	2026-06-18	Claude	All	CREATED	ANCHOR created from the "AI Agent Testing (Historical Validation)" Confluence draft. Historical Validation set as Phase 1; Generate-from-Knowledge and Imported-Question-List seeded as Phase 2/3 placeholders from the existing design/code scaffolding.
