Qontak | AI Agent | Testing — Phase 3: Imported Test Cases (Single + Multi-Turn)
Imported Test Cases — Phase 3 PRD under the AI Agent: Testing ANCHOR. The third test-case source ("Imported question list"): a Specialist/Bot Admin uploads a curated
.csv/.xls/.xlsxfile, the system replays each scenario against the configured AI Agent, scores responses against the Expected Answer, and saves structured results. Single- and multi-turn are both in scope — a multi-turn scenario is a sequence of user turns sharing conversation context (single-turn = a one-turn conversation). Imported from Confluence and reconciled against code (chatbot,chatbot-fe).
HEADER BLOCK
| Field | Value |
|---|---|
| PM | Dimas Fauzi Hidayat |
| PRD Version | 1.1 |
| Status | DRAFT |
| PRD Type | PHASE |
| Epic | TBD (Phase 3 epic not yet created) |
| Squad | BOT |
| RFC Link | To be created (rfc-starter from this PRD) |
| Figma Master | Figma — Bot · AI Agent Testing |
| Anchor | AI Agent: Testing — ANCHOR (Confluence) |
| Labels | epic:qontak-chatbot | module:ai-agent | feature:ai-agent-testing |
| Last Updated | 2026-06-30 |
Scope Changes
Backend · Frontend · Data — file parse/validate + a new test-run orchestration endpoint that reuses the existing bot-preview execution path (chatbot), the import + multi-turn results UI on the Testing page (chatbot-fe), and the semantic-similarity scoring dependency (Data / AI Service).
2. Phase Context
- Anchor PRD: AI Agent: Testing — ANCHOR
- Phase Number: Phase 3 of 3 (phasing by test-case source: Historical → Knowledge → Imported)
- Phase Goal: Validate the AI Agent against a PM/SPV-curated question set uploaded into a test case — covering both single-turn questions and multi-turn conversations — so specialists replace 5+ hours of manual one-by-one bot-preview testing per go-live. (Matches the ANCHOR Phase Index goal for Phase 3.)
- Prior phases: Phase 1 — Historical Validation (
prds/historical-validation.md) establishes the Testing page, theai_agent_test_cases/ai_agent_test_case_questionsschema, the side-by-side comparison UI, and the Sidekiq async pattern. Phase 2 — Generate from Knowledge (not yet authored). - This phase: The "Imported question list" test-case source — file import, multi-turn replay against the agent, semantic-similarity auto-scoring (with a confidence + human-review fallback), and results saved to the Testing Index.
- Deferred to later: Branching multi-turn conversations, an LLM user-simulator, configurable thresholds, exporting results, and re-running only failed conversations (see §5 Non-Goals).
- Cross-phase deps: This phase extends the Phase 1 schema (
ai_agent_test_cases/ai_agent_test_case_questions) rather than adding new tables — it adds a conversation grouping + turn ordering. The schema additions here must stay compatible with Phase 1's comparison/rating UI.
3. One-liner + Problem
One-liner: Upload an Excel/CSV of single- or multi-turn test scenarios and auto-run them against the AI Agent — replacing 5+ hours of manual go-live testing.
Problem: Bot Implementation Specialists follow a 100-scenario SOP before every go-live, tested manually one-by-one in the bot preview — 5+ hours per go-live, repeated after every bot configuration change. Many real scenarios are multi-turn (the bot asks for an order ID, the user replies, then the bot confirms), so a test that only checks isolated single questions does not reflect the real use case. This phase lets specialists import the test files they already maintain and run them — including multi-turn conversations with shared context — so they hit the H+14 AHA-moment window. For full initiative context, see the ANCHOR PRD.
4. Target Users + Persona Context
| Persona | Role | Goal | Pain | Workaround |
|---|---|---|---|---|
| Primary — Bot Implementation Specialist | Internal Qontak specialist configuring AI Agents for go-live | Run the 100-scenario SOP (incl. multi-turn flows) in minutes, not hours | 5+ hours of manual one-by-one bot-preview testing per go-live | Tests each scenario by hand in the tree-diagram bot preview (~5 hrs/go-live) |
| Secondary — Client Bot Admin | Power user on the client side who maintains the AI Agent post go-live | Regression-test repeatably after every bot update | No structured re-test process; changes can silently break scenarios | Ad-hoc manual re-checks after each change (hours, error-prone) |
5. Non-Goals
- Branching conversations — multi-turn turns are linear/scripted (fixed user turns regardless of the bot's actual reply); conditional branches are out of this phase.
- LLM user-simulator — no persona agent that dynamically reacts to bot replies; user turns are author-defined.
- Configurable scoring threshold — thresholds are fixed (≥80% Pass / 60–79% Review / <60% Fail) in this phase.
- Export test results to file — not in this phase.
- Re-run only failed test cases — re-runs are at the conversation grain (re-import + re-run), not per-turn.
- Test result history / version comparison across runs — not in this phase.
- Mobile app — Testing is web-only in this phase.
6. Constraints
- Platform: Web only (Qontak web app — Bot Automation → Testing).
- Plan scope: All plans with the AI Agent module enabled.
- File limits:
.csv/.xls/.xlsxonly; ≤500 rows (turns) per file; ≤5 MB; recommended per-conversation turn cap ≤20 (engineering to confirm — Open Question #4). Larger suites must be split. - File format: Long format — columns
Topic,Conversation ID,Turn,Question,Expected Answer. Rows sharing a Conversation ID form one conversation, executed in Turn order. Expected Answer is optional per turn (blank = a context/setup turn, not scored). A single-turn case is a one-turn conversation. - Execution model: Conversations run in parallel; turns within a conversation run sequentially through one shared bot session so context (memory, slot state, agentic actions) carries across turns. Reuses the live bot-preview execution path (
draft_state=truerooms) — not a new context engine. - Performance: Asynchronous batch run with live progress. Within-conversation latency is sequential, so throughput is expressed as conversations × avg turns. Interim target: ~30 conversations (≈100 turns) scored in ≤5 minutes at max concurrency (10–20 parallel conversations); the final SLA is re-baselined from Phase 1's per-row target (Open Question #3). Must respect the LLM provider's TPM/RPM so it never blocks live production traffic.
- Parsing: File parsing is backend-side — the frontend has no xlsx/csv parser today.
- Read/write: Specialist (internal) + Client Bot Admin roles can import and run. Standard agents have no access (menu hidden). Enforced server-side on every test-run endpoint.
- Feature flag:
ai_agent_testing| default: OFF — enabled per organization during beta; the Imported source is gated within this initiative flag. - Data lifecycle: The raw uploaded file is discarded after parsing (only structured
ai_agent_test_case*rows persist; soft-deleted viaacts_as_paranoidon delete). Test conversations run indraft_state=truebot-preview rooms, purged after the run completes so test rooms/histories never pollute real conversation data (exact purge window — Open Question #5).
7. Feature Changes
CHG-001 — "Imported question list" source enabled in the Generate Test Case modal
- Change Type: Modified component (
GenerateTestCaseModal, scaffolded in Phase 1). - Page:
/bot-automation/testing. - Before: The "Imported question list" source option is hidden/disabled (Phase 3 placeholder).
- After: The option is enabled and opens the import drawer (template download + file upload + preview).
| Element | Before | After |
|---|---|---|
| Generate source picker | "Imported question list" disabled | Enabled → opens ImportTestCaseDrawer |
Figma: GenerateTestCaseModal — node 15576-205254 (feat/ai-agents-testing Phase 1 prototype). Phase 3 change: enable the "Add or upload manually" card and route it to ImportTestCaseDrawer instead of UploadManuallyDrawer.
8. New Features
Feature: Import Test Case (single + multi-turn)
- URL:
/bot-automation/testing→ Generate test case → Imported question list (opensImportTestCaseDrawer). - Access: Specialist + Bot Admin (server-enforced). Standard agents: option hidden.
Design status (2026-06-30):
testing/index.vue,GenerateTestCaseModal, and the Phase 1UploadManuallyDrawerare prototyped inqontak-designer(feat/ai-agents-testing). The Phase 3-specific components —ImportTestCaseDrawer(5-column template + "Run Test" + import preview),TestRunProgress, andTestRunResults— have no prototype yet (Figma master frame16514-155786exists; per-story frames TBD). See prototype notes in S01 and S11.
Component tree:
Phase 1 — already prototyped in feat/ai-agents-testing (reused or adapted):
testing/index.vue— node 16743-298263 — Testing Index; "Imported question list" already in the type filter; columns: Test case name · AI agent · Testing type · Score · Status · Last updatedGenerateTestCaseModal— node 15576-205254 — 3-card picker; Phase 3 enables "Add or upload manually" → routes toImportTestCaseDrawerQuestionComparisonCard— node 16820-169848 — side-by-side AI / Human answer card;aiOnlymode (hides Human col — applies to imported type); confidence % in purple footer; source pills; may be adapted for Phase 3 per-turn view
Phase 3 — new components, no prototype yet (frames TBD):
ImportTestCaseDrawer— replaces Phase 1UploadManuallyDrawer(node 16677-210564 partial stub; Phase 1 uses 3-column format and 45 MB cap — both differ from Phase 3 spec)- Template download (
.xlsx: Topic, Conversation ID, Turn, Question, Expected Answer) FileUpload(reusespixel-input-file-upload.vue; backend parses)ImportPreview— conversations detected + total turns + skipped rows (row number + reason)- "Run Test" → triggers async execution
- Template download (
TestRunProgress— live "X of Y conversations completed" + Pass/Review/Fail counts (polling)TestRunResults- Conversation rows (Topic, conversation verdict, turn count) — expandable
- Per-turn rows: Question · Expected · Actual · Score % · Status (✅/⚠️/❌); context turns shown "not scored"
- Summary banner: X Pass · X Review · X Fail
- Conversation rows (Topic, conversation verdict, turn count) — expandable
- Saved into the existing Testing Index list (Phase 1
TestCasesTable).
UI States:
- Empty: no imported runs yet → import CTA.
- Loading: parse/preview skeleton; per-conversation progress during a run.
- Error: invalid file blank slate + template link; partial-results banner if a run crashes.
- Success: results grouped by conversation, expandable to per-turn.
📊 UI state diagram (import → results):
stateDiagram-v2
[*] --> Empty
Empty --> Uploading: select file
Uploading --> Error: invalid / oversized
Error --> Empty: retry / re-upload
Uploading --> Preview: valid (conversations + turns parsed)
Preview --> Running: Run Test
Running --> Results: run complete
Running --> Partial: job crash mid-run
Results --> [*]
Partial --> [*]: partial results banner
Figma: Bot · AI Agent Testing — Testing page (phase-specific frames TBD). Code refs: chatbot-fe store/ai-agent/, AiAgentValidationForm.vue, endpoint.ts (v1.ai_agents.test_cases.*).
9. API & Webhook Behavior
Namespaced under the existing v1.ai_agents.test_cases.* surface (already wired in FE endpoint.ts), with turn-aware semantics added. All endpoints gated server-side to Specialist/Bot Admin. Technical fields (JSON schemas, error codes) resolved during RFC.
| # | Behavior | Entity Affected | Triggered By | Expected Behavior | Failure Behavior |
|---|---|---|---|---|---|
| 1 | Upload + validate file | New batch (AiAgentTestRun or batch_id) + parsed conversations/turns | User uploads file in the import drawer | Parse headers (case-insensitive), enforce 5 MB / 500-row limits, validate Conversation ID + Turn (unique, positive int), skip invalid rows; return preview { total_rows, conversation_count, valid_turns, skipped_rows: [{ row_index, reason }] } | Wrong columns/type → reject whole file + template link. >5 MB / >500 rows → reject with limit message. Empty file → 422 |
| 2 | Trigger execution | Update batch → processing; enqueue per-conversation jobs | User clicks "Run Test" | Returns immediately { status: "processing" }; Sidekiq (:ai_agent, retry:false) runs conversations in parallel, turns sequential per conversation through one shared room/session | Create fails → 422, no run started. Concurrency exceeded → queue, no rows dropped |
| 3 | Poll run status | Read batch + counts | FE polls during a run | Returns status + live counts (X of Y conversations, Pass/Review/Fail) until completed | Unauthorized role → forbidden. Transient read error → client retries, run unaffected |
| 4 | Get run results | Read conversations + per-turn rows | User opens a completed run | Returns results grouped by conversation, each turn: topic, question, expected_answer, actual_response, similarity_score/confidence, status, turn_index, turn_type | Run not found → 404. Unauthorized role → forbidden |
Implementation note: the execute endpoint is a new orchestration endpoint that internally reuses the existing bot-preview receive path (
bot_previews_controller.rb→ProcessIncomingMessageBotPreviewWorker→ hubSendMessageWithResolve), which already runs a single turn with context (Historyrows +conversation_id+SendContext→ Mekari RAG thread). Multi-turn = orchestrating a sequence of these per conversation.
10. System Flow + User Stories + ACs
10.1 System Flow
Flow name: Import → multi-turn replay → score → save · Type: User Journey + System Sequence (async batch).
- Specialist/Bot Admin opens Bot Automation → Testing → Generate test case → Imported question list.
- Downloads the template (optional), fills Topic, Conversation ID, Turn, Question, Expected Answer; uploads the file.
- Backend parses + validates; UI shows a preview (conversations, turns, skipped rows with reasons).
- Decision point: invalid format / >5 MB / >500 rows → reject whole file with template link (no run).
- User clicks Run Test → batch created (
processing); per-conversation jobs enqueued (Sidekiq:ai_agent). - For each conversation: allocate one bot-preview room (
draft_state=true); replay turns inTurnorder, writingHistory+ pushingSendContextso context accumulates; capture each turn's response. - Decision point: a turn with a blank Expected Answer is a context turn — sent to build state, not scored.
- Failure branch: a turn that errors marks that turn
error, skips the remaining turns of that conversation, and continues other conversations. - After responses are collected, score each scored turn by semantic similarity (dependency) — fallback to AI confidence + human review if the engine is unavailable.
- Failure branch: job crash mid-run → persist completed turns/conversations, mark run
failed, surface partial results. - Roll up each conversation verdict, shown two ways: per-turn pass rate + goal-turn (final scored turn) result.
- On completion the run is saved to the Testing Index with timestamp, file name, and counts; results display grouped by conversation, expandable to per-turn.
📊 System flow diagram:
flowchart TD
A[Open Testing → Imported question list] --> B[Download template / fill file]
B --> C[Upload file]
C --> D{Valid format + within limits?}
D -- No --> E[Reject whole file + template link]
D -- Yes --> F[Parse + row-level validation]
F --> G[Preview: conversations, turns, skipped rows]
G --> H[Run Test]
H --> I[Batch: conversations in parallel - Sidekiq :ai_agent]
I --> J[Per conversation: replay turns in order, shared session + SendContext]
J --> K{Turn error?}
K -- Yes --> L[Mark turn error, skip rest of this conversation]
K -- No --> M{Expected Answer blank?}
M -- Yes --> N[Context turn: feeds state, not scored]
M -- No --> O[Score turn: semantic similarity OR confidence + human review]
L --> P[Roll up conversation verdict: per-turn rate + goal-turn]
N --> P
O --> P
P --> Q{Job crash mid-run?}
Q -- Yes --> R[Persist completed, mark run failed, show partial]
Q -- No --> S[Save run to Testing Index + grouped results]
10.2 User Stories
MoSCoW preserved from the Confluence source. Implementation-status notes flag where ACs describe target behavior built on existing code vs not-yet-wired pieces. Figma frames are phase-specific TBD — links point to the AI Agent Testing master frame.
IMPORT-S01 — Import entry & template download | Must Have
Story: As a Specialist, I want to download a test case template, so that I know the correct format.
Before: No import entry; "Imported question list" is a disabled placeholder. After: Enabling the source opens the import drawer with a downloadable template.
Data Fields:
| Field | Type | Required | Source |
|---|---|---|---|
| agent_id | uuid | Yes | route |
| template_columns | fixed list | Yes (system) | Topic, Conversation ID, Turn, Question, Expected Answer |
Happy Path:
- AC-1: Given I am on the Testing page, when I open Generate test case → Imported question list, then the import drawer opens.
- AC-2: Given the import drawer, when I click "Download template", then I get an
.xlsxwith columns Topic, Conversation ID, Turn, Question, Expected Answer. - AC-3: Given I open the template, when I read the header row, then each column is pre-labeled and the Expected Answer column notes "optional = context turn".
Error Path:
- ERR-1: Given the template download fails, when I click it, then an inline error with retry is shown and no drawer state is lost.
Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: the "Imported question list" option is not rendered; direct route returns forbidden.
UI States: Loading (drawer opening), Empty (no file selected yet → template CTA), Error (download failed inline), Success (template downloaded).
Figma: GenerateTestCaseModal — node 15576-205254 (Phase 1 prototype · feat/ai-agents-testing). ImportTestCaseDrawer frame TBD. Dependencies: Phase 1 Testing page (GenerateTestCaseModal).
Prototype note (S01): The modal's 3rd card (amber
docicon) is labeled "Add or upload manually" with CTA "Add" — this maps to the "Imported question list" type. In Phase 1 it opensUploadManuallyDrawer(node 16677-210564); Phase 3 replaces that withImportTestCaseDrawer. The Phase 1 drawer is a partial stub: dropzone + upload-progress animation + content-preview panel, but uses a 3-column format (Topics/Intent · Question · Expected Answer) and a 45 MB cap — both differ from the Phase 3 spec (5-column / 5 MB / "Run Test"). Phase 3 needs a redesigned drawer. The modal also has a dev-tools state "Not enough conversation" that disables the inbox card with tooltip "You must have at least 50 past inbox conversations" — threshold not currently in the PRD.
IMPORT-S02 — Upload & validate long-format file | Must Have
Story: As a Specialist, I want to upload a .csv/.xls/.xlsx file and see a preview, so that I can confirm before running.
Before: No file import path into a test case. After: A validated upload shows a preview of conversations, turns, and skipped rows.
Data Fields:
| Field | Type | Required | Source |
|---|---|---|---|
| file | binary | Yes | user upload |
| topic | string | Yes | file row |
| conversation_id | string | Yes | file row |
| turn | int | Yes | file row |
| question | string | Yes | file row |
| expected_answer | string | No (blank = context turn) | file row |
Happy Path:
- AC-1: Given I upload a valid file, when it passes validation, then I see a preview: conversation_count, total turns, and skipped rows with reasons.
- AC-2: Given columns differ in case (e.g.
expected answer), when validated, then matching is case-insensitive and the file is accepted. - AC-3: Given a valid multi-turn file, when the preview renders, then conversations are grouped (one preview row per Conversation ID with its turn count).
Error Path:
- ERR-1: Given the upload fails mid-transfer, when it errors, then no DB state is written and I can retry.
Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: upload endpoint returns forbidden.
UI States: Loading (parsing skeleton), Empty (no valid rows → guidance), Error (upload failed + retry), Success (preview shown).
Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S01.
Implementation status: parsing is backend-side (FE has no xlsx/csv parser). Extends the existing
ai_agent_test_cases/ai_agent_test_case_questionsschema.
IMPORT-S03 — Invalid format rejection | Must Have
Story: As a Specialist, I want a clear error when my file format is wrong, so that I can fix it fast.
Before: No import validation. After: Files with wrong columns or unsupported types are rejected with a template link.
Data Fields:
| Field | Type | Required | Source |
|---|---|---|---|
| file_type | enum (.csv/.xls/.xlsx) | Yes | upload |
| reason | enum: invalid_format / row_limit_exceeded / size_exceeded | Yes (system) | validator |
Happy Path:
- AC-1: Given I upload a file with missing/wrong columns, when validated, then the whole file is rejected with an error message + a link to download the correct template.
- AC-2: Given I upload an
.xlsxwith the correct columns but a renamed sheet, when validated, then the first sheet's header row is used and the file is accepted.
Error Path:
- ERR-1: Given I upload an unsupported file type (e.g.
.pdf), when validated, then it is rejected before any processing with a clear type error. - ERR-2: Given the file is a valid type but corrupt/unreadable, when parsing runs, then it is rejected with a "could not read file" error + retry.
Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: endpoint returns forbidden.
UI States: Loading (validating), Empty (N/A — rejection path), Error (rejection + template link), Success (N/A — acceptance handled in S02).
Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S02.
IMPORT-S04 — Row-level validation & skip | Must Have
Story: As a Specialist, I want rows with missing/invalid data skipped (not blocking the run), so that valid scenarios still proceed.
Before: No row-level validation. After: Invalid rows are flagged by row number + reason; valid rows proceed.
Data Fields:
| Field | Type | Required | Source |
|---|---|---|---|
| row_index | int | Yes (system) | parser |
| reason | string | Yes (system) | validator |
| conversation_id | string | Yes | file row |
| turn | int | Yes | file row |
Happy Path:
- AC-1: Given rows with an empty Topic/Conversation ID/Turn/Question, when I upload, then those rows are skipped and flagged by row number and reason; valid rows proceed.
- AC-2: Given two rows share the same (Conversation ID, Turn), when validated, then the duplicate is skipped + flagged (turn ordering must be unique).
- AC-3: Given a conversation whose every turn has a blank Expected Answer, when validated, then it is flagged (nothing to score).
Error Path:
- ERR-1: Given Turn is not a positive integer (e.g.
0,1.5, text), when validated, then that row is skipped + flagged.
Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: endpoint returns forbidden.
UI States: Loading (validating), Empty (all rows skipped → "no valid rows" guidance), Error (per-row reason shown in preview), Success (preview lists valid + skipped rows).
Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S02.
IMPORT-S05 — Multi-turn execution (parallel across, sequential within) | Must Have
Story: As a Specialist, I want to click "Run Test" and have all scenarios — including multi-turn — run automatically, so that I replace manual testing.
Before: Each scenario is tested manually one-by-one in bot preview. After: Conversations run in parallel; turns within a conversation run sequentially through one shared session with carried context.
Data Fields:
| Field | Type | Required | Source |
|---|---|---|---|
| room_id | uuid (per conversation) | Yes (system) | bot-preview room alloc |
| conversation_id | string | Yes | parsed file |
| turn_index | int | Yes | parsed file |
| actual_response | text | Yes (system) | AI agent |
Happy Path:
- AC-1: Given I reviewed the preview, when I click "Run Test", then each conversation is replayed turn-by-turn in Turn order through one shared bot session, and conversations run in parallel.
- AC-2: Given a multi-turn conversation, when turn N runs, then it shares the conversation context (memory/slot state) accumulated from turns 1..N-1.
- AC-3: Given I navigate away mid-run, when I return to the Testing Index, then the run has continued in the background and its result is saved.
Error Path:
- ERR-1: Given the run is triggered, when the request returns, then it returns immediately
{ status: "processing" }and the run continues in the background.
Rollback / reversibility: A run is an immutable record. A partial or unwanted run can be deleted from the Testing Index (soft-delete, acts_as_paranoid) and re-created by re-importing — there is no in-place mutation of a completed run. Test rooms are draft_state=true and purged after the run, so a discarded run leaves no live-data residue.
Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: execute endpoint returns forbidden.
UI States: Loading (TestRunProgress), Empty (N/A — run already has conversations), Error (partial-results banner on crash), Success (run completes, results shown).
Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S04; bot-preview execution path (§15).
Implementation status: reuses the live bot-preview execution path (
SendMessageWithResolve) +History+conversation_id+SendContext(Mekari RAG thread). Concurrency unit = conversation (Sidekiq:ai_agent,retry:false).
IMPORT-S06 — Semantic-similarity scoring (with fallback) | Must Have
Story: As a Specialist, I want each scored turn graded Pass/Review/Fail against its Expected Answer, so that I get an objective signal.
Before: Quality is judged manually. After: Each scored turn is auto-graded by semantic similarity; if the engine is unavailable, the system falls back to AI confidence + human review.
Data Fields:
| Field | Type | Required | Source |
|---|---|---|---|
| similarity_score | float (0–1) | Yes (system) | semantic engine |
| confidence | int | Conditional (fallback) | AI agent response |
| status | enum: pass/review/fail/error | Yes (system) | scorer |
Happy Path:
- AC-1: Given a turn with an Expected Answer and an agent response, when scoring runs, then similarity ≥80% → Pass, 60–79% → Review, <60% → Fail.
- AC-2: Given the agent response is empty/errored, when scoring runs, then score = 0 and status = Fail.
- AC-3: Given a batch of scored turns, when scoring runs, then it is sent as a single batch request (chunked if the engine's max batch size is exceeded).
Error Path:
- ERR-1: Given the semantic engine is unavailable, when scoring runs, then the turn falls back to AI confidence + human review (rated via the Phase 1 rating flow) rather than blocking the run, and the turn is badged "scored by fallback".
Permission Model: CAN: Specialist, Bot Admin (scoring runs system-side under their triggered run). CANNOT: standard agents. Unauthorized: not executed if flag OFF.
UI States: Loading (scoring), Empty (N/A), Error (fallback-to-review badge), Success (Pass/Review/Fail badge with score).
Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S05; semantic-similarity engine (§15, dependency).
Implementation status: ⚠️ the semantic-similarity engine is not present in the chatbot BE repo (only the KB vector store, for retrieval). It must be confirmed with the AI Service / ML team (Open Question #1). Fallback reuses the existing
RateTestCaseQuestion(confidence + human 0/1).
IMPORT-S07 — Conversation verdict (two views) | Must Have
Story: As a Specialist, I want each conversation's verdict shown two ways, so that I see both intermediate quality and end-state success.
Before: No conversation-level roll-up. After: A conversation shows (a) per-turn pass rate across scored turns and (b) the goal-turn (final scored turn) result, side by side.
Data Fields:
| Field | Type | Required | Source |
|---|---|---|---|
| per_turn_pass_rate | float | Yes (system) | roll-up |
| goal_turn_status | enum: pass/review/fail/error | Yes (system) | final scored turn |
| turn_count | int | Yes (system) | conversation |
Happy Path:
- AC-1: Given a completed multi-turn conversation, when the verdict renders, then it shows the per-turn pass rate AND the goal-turn result side by side.
- AC-2: Given a single-turn conversation, when the verdict renders, then both views collapse to that one turn's status.
- AC-3: Given context turns in a conversation, when the pass rate computes, then context turns are excluded from the denominator.
Error Path:
- ERR-1: Given a conversation aborted on a turn error, when the verdict renders, then it is marked incomplete (error) rather than computing a misleading pass rate.
Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: results endpoint returns forbidden.
UI States: Loading (computing), Empty (N/A), Error (incomplete badge), Success (both verdicts shown).
Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S06.
IMPORT-S08 — Results grouped by conversation (expandable) | Must Have
Story: As a Specialist, I want results grouped by conversation and expandable to per-turn, so that I can see exactly which turn broke.
Before: No grouped results view. After: Conversation rows expand to per-turn rows (Question, Expected, Actual, Score, Status).
Data Fields:
| Field | Type | Required | Source |
|---|---|---|---|
| topic | string | Yes | file row |
| question | text | Yes | file row |
| expected_answer | text | No (context turn) | file row |
| actual_response | text | Yes (system) | AI agent |
| similarity_score | float | Conditional | scorer |
| status | enum: pass/review/fail/error | Yes (system) | scorer |
Happy Path:
- AC-1: Given results, when I expand a conversation row, then I see each turn's Question / Expected / Actual / Score / Status, plus the conversation verdict.
- AC-2: Given a summary banner, when results render, then it shows total Pass / Review / Fail across the run.
- AC-3: Given a context turn, when I expand it, then it is shown with a "not scored" badge rather than a score.
Error Path:
- ERR-1: Given a turn failed generation, when I expand it, then the Actual side shows a "could not generate" state, not a blank panel.
Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: results endpoint returns forbidden.
UI States: Loading (skeleton rows), Empty (no results in this run), Error (per-turn failed state), Success (grouped table rendered).
Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S07. Reuses AiAgentValidationForm.vue table + score badges.
IMPORT-S09 — Context turns are not scored | Must Have
Story: As a Specialist, I want context-only turns (blank Expected Answer) to set up state without being scored, so that setup steps don't distort results.
Before: No notion of an unscored setup turn. After: A blank-Expected-Answer turn feeds context and is shown "not scored".
Data Fields:
| Field | Type | Required | Source |
|---|---|---|---|
| turn_type | enum: user / context | Yes (system) | derived (blank Expected = context) |
| expected_answer | text | No | file row |
Happy Path:
- AC-1: Given a conversation turn with a blank Expected Answer, when it runs, then it feeds context and is shown as "not scored" (excluded from the pass rate).
- AC-2: Given a context turn, when the agent responds, then the response is still captured and displayed (for human inspection) but not scored.
- AC-3: Given a conversation of all context turns except one scored turn, when it runs, then only the scored turn contributes to the verdict.
Error Path:
- ERR-1: Given a context turn's agent call errors, when it runs, then the conversation is aborted per IMPORT-S10 (downstream turns depend on it).
Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: not executed if flag OFF.
UI States: Loading (running), Empty (N/A), Error (aborted-conversation badge), Success ("not scored" badge on the turn).
Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S05.
IMPORT-S10 — Turn-error isolation | Must Have
Story: As a Specialist, I want a failed turn to stop only its conversation, not the whole run, so that one bad scenario doesn't block the rest.
Before: No error isolation model. After: A turn error aborts only its conversation; other conversations continue.
Data Fields:
| Field | Type | Required | Source |
|---|---|---|---|
| status | enum: error | Yes (system) | runner |
| aborted_at_turn | int | Conditional | runner |
Happy Path:
- AC-1: Given a turn errors (timeout/agent error) mid-conversation, when the run continues, then the remaining turns of that conversation are skipped and other conversations are unaffected.
- AC-2: Given a conversation aborted on a turn error, when results render, then it shows which turn failed and that downstream turns were skipped.
- AC-3: Given several conversations and one aborts, when the run completes, then the run summary counts the aborted conversation as
errorwithout failing the whole run.
Error Path:
- ERR-1: Given the run job crashes mid-batch, when it stops, then completed turns/conversations are persisted, the run is marked
failed, and partial results are shown with a banner.
Rollback / reversibility: A failed or partial run can be deleted (soft-delete) and re-run by re-importing; no live-conversation state is touched (test rooms are draft_state=true and purged).
Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: not executed if flag OFF.
UI States: Loading (running), Empty (N/A), Error (per-conversation error badge / partial-run banner), Success (unaffected conversations complete).
Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S05.
IMPORT-S11 — Save to Testing Index & view past runs | Must Have
Story: As a Specialist, I want the run saved to the Testing Index, so that I can revisit and regression-test later.
Before: No persistence of imported runs. After: Each completed run is saved with timestamp, file name, and counts, and is listed in the Testing Index.
Data Fields:
| Field | Type | Required | Source |
|---|---|---|---|
| file_name | string | Yes | upload |
| created_at | timestamp | Yes (system) | run |
| pass_count / review_count / fail_count / error_count | int | Yes (system) | roll-up |
Happy Path:
- AC-1: Given a run completes, when I go to the Testing Index, then I see the saved run with timestamp, file name, and total Pass/Review/Fail counts.
- AC-2: Given a saved run, when I open it, then I see its conversation-grouped results.
- AC-3: Given multiple past runs for an agent, when the Testing Index renders, then runs are listed newest-first.
Error Path:
- ERR-1: Given the Testing Index fails to load, when the page renders, then a "Couldn't load" blank slate with Retry is shown.
Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: Testing menu hidden; direct route forbidden.
UI States: Loading (list skeleton), Empty (no runs yet → import CTA), Error (retry), Success (run listed + openable).
Figma: testing/index.vue — node 16743-298263 (Phase 1 prototype · feat/ai-agents-testing). Dependencies: IMPORT-S08. Reuses Phase 1 TestCasesTable.
Prototype note (S11): Testing Index fully designed. Table columns: Test case name · AI agent · Testing type · Score · Status · Last updated · Actions (sticky right col). The type filter dropdown includes "Imported question list". Columns are sortable (name, score, last updated). Pagination: 10 rows/page. Delete confirmation modal: "Delete test case?" with test case name in bold; soft-delete on confirm. Dev tools: Filled / Empty state / Loading / Search not found / Filter not found. Score column shows
X%(integer); Status badge is Passed (green) or Need review (amber) — Phase 3 may extend to a three-state Pass / Review / Fail badge aligned with semantic-similarity thresholds (≥80% / 60–79% / <60%).
IMPORT-S12 — Async progress / live polling | Must Have
Story: As a Specialist, I want a live progress indicator while a batch runs, so that the UI never freezes.
Before: No progress feedback for a batch. After: The run shows "X of Y conversations completed" with live Pass/Review/Fail and remains responsive.
Data Fields:
| Field | Type | Required | Source |
|---|---|---|---|
| status | enum: processing/completed/failed | Yes (system) | run |
| conversations_done | int | Yes (system) | run |
| conversations_total | int | Yes (system) | run |
Happy Path:
- AC-1: Given a run is in progress, when I poll the run, then
statusreflects progress (X of Y conversations) untilcompleted. - AC-2: Given the run is running, when I stay on the page, then the UI remains responsive and updates incrementally.
- AC-3: Given the run completes while I am away, when I reopen it, then it shows the final state (not a stuck progress bar).
Error Path:
- ERR-1: Given a poll request fails transiently, when it retries, then progress resumes without losing the run.
Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: status endpoint returns forbidden.
UI States: Loading (progress indicator), Empty (N/A), Error (transient retry indicator), Success (completed state).
Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S05. Precedent: ai-assist.ts UUID status poll.
IMPORT-S13 — File row/size caps | Must Have
Story: As a Specialist, I want the system to cap files at 500 rows and 5 MB, so that runs stay performant.
Before: No file caps. After: Oversized files are rejected with a clear message.
Data Fields:
| Field | Type | Required | Source |
|---|---|---|---|
| row_count | int (≤500) | Yes (system) | parser |
| file_size | bytes (≤5 MB) | Yes (system) | upload |
Happy Path:
- AC-1: Given I upload a file with >500 rows, when validated, then it is rejected with: "File exceeds 500 row limit. Please split into multiple files."
- AC-2: Given I upload a file >5 MB, when validated, then it is rejected with a size-limit error.
- AC-3: Given a file exactly at a limit boundary (500 rows / 5 MB), when validated, then it is accepted (limits are inclusive of the stated max).
Error Path:
- ERR-1: Given a file exceeds the recommended per-conversation turn cap, when validated, then the affected conversation is flagged (engineering to confirm hard vs soft — OQ#4).
Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: endpoint returns forbidden.
UI States: Loading (validating), Empty (N/A), Error (limit message + template link), Success (within-limit file accepted → preview).
Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S02.
Negative Scenarios
- NEG-1: Given I am a standard agent, when I look for the Imported test-case import, then the option is not rendered and the route is forbidden.
- NEG-2: Given a multi-turn conversation, when a turn errors, then later turns in that conversation are not executed (they would run on a broken context).
- NEG-3: Given a context turn (blank Expected Answer), when results render, then it never counts toward Pass/Review/Fail.
- NEG-4: Given the semantic engine is down, when a run executes, then it falls back to confidence + human review rather than marking every turn Fail.
11. Rollout
- Feature flag:
ai_agent_testing(initiative flag, default OFF); Imported source gated within it. - Stage 1: Internal — Bot Implementation Specialist team (the primary users).
- Stage 2: Closed beta — select Bot Admin clients.
- Stage 3: All orgs with AI Agent enabled, on request.
- GA: All orgs with AI Agent enabled.
- Backward compat: Yes — additive. Extends the existing
ai_agent_test_cases/ai_agent_test_case_questionsschema with conversation/turn fields; Phase 1 flows unchanged. During the transition, existing Phase 1 test cases (noturn_index/turn_type) render as single-turn conversations — old and new rows coexist without migration. - Migration: Additive columns (
turn_index,expected_answer,turn_type) + a thin batch parent. No backfill; existing rows default to a one-turn conversation.
Semantic-regression rollback: ai_agent_testing is the per-org kill switch. If multi-turn replay or semantic scoring proves unreliable in beta (e.g. scoring disagrees with human review on >20% of a sample), PM toggles ai_agent_testing OFF per org (no deploy); scoring reverts to confidence + human review.
12. Observability
| Event Name | Trigger | Properties |
|---|---|---|
test_case_import_clicked | User opens the import drawer | user_role, agent_id, workspace_id |
template_downloaded | User downloads the template | user_role |
file_uploaded | File submitted for validation | file_format, row_count, conversation_count, file_size_kb |
file_validation_failed | File rejected | reason (invalid_format / row_limit_exceeded / size_exceeded) |
test_run_started | User clicks Run Test | conversation_count, turn_count, skipped_rows |
test_run_completed | Run finishes | total_pass, total_review, total_fail, conversation_count, duration_seconds |
test_run_viewed | User opens a saved run | agent_id, workspace_id |
Dashboard owner: BOT squad (Mixpanel + Tableau).
Alerts:
- Run failure rate > 10% in 1h → Slack: #bot-ai-alerts (on-call).
- Semantic-engine error rate during scoring > 5% in 15m → Slack: #bot-ai-alerts (on-call).
Post-Launch Monitoring Cadence: Weekly for the first 4 weeks post-GA, then monthly. Owner: Dimas Fauzi Hidayat (BOT squad PM). Trigger: run failure rate > 20% unresolved within 24h → PM disables ai_agent_testing for affected orgs.
13. Success Metrics
⭐ Primary KPI: Reduction in average AI Agent testing time per go-live
- Definition: median specialist time from "start testing" to "all scenarios scored"
- Baseline: 5+ hours (manual one-by-one bot preview)
- Target: ≤ 30 minutes, within 60 days of launch
Adoption: % of specialist-led go-lives using Import Test Case
- Definition: share of go-lives where the imported source was used
- Baseline: N/A — new capability
- Target: ≥ 80% within 60 days of launch
Quality: Multi-turn run success rate
- Definition: % of conversations that complete all turns without an execution error
- Baseline: N/A — new capability
- Target: ≥ 95% within 60 days of GA
Efficiency: Time-to-go-live (implementation phase duration)
- Definition: implementation-phase duration vs baseline
- Baseline: current implementation duration
- Target: decrease ≥ 30% vs baseline
14. Launch Plan & Stage Gates
| Stage | Audience | Duration | Success Gate | Owner |
|---|---|---|---|---|
| Internal | Bot Implementation Specialist team | 2 weeks | No major bugs; testing-time reduction vs manual baseline confirmed; multi-turn replay verified against bot preview | PM + QA |
| Closed Beta | Select Bot Admin clients | 3–4 weeks | ≥ 80% of early users find value; run failure rate ≤ 10% | PM + CSM |
| Open Beta | All orgs with AI Agent, on request | 3 weeks | Testing time ≤ 30 min sustained 1 week; no P0/P1 open | Eng Lead |
| GA | All orgs with AI Agent enabled | Ongoing | Open Beta gates sustained 2 weeks; PMM launch approved | PM + PMM |
15. Dependencies
| Dependency | Owning Team | Deliverable Needed | Blocking? |
|---|---|---|---|
| Semantic-similarity engine | AI / ML squad | Scoring endpoint (batch input, Bahasa Indonesia, latency at scale) — confirm contract; not present in chatbot BE | YES |
| Bot-preview execution path | BOT | Reuse SendMessageWithResolve + History + conversation_id + SendContext for per-conversation replay | YES |
| AI Agent versioning | BOT | ai_agent_histories stable — test runs bind to a version | YES |
| Existing test-case schema | BOT | Extend ai_agent_test_cases / ai_agent_test_case_questions (grain to confirm) | YES |
| Data / Analytics | Data | Instrumentation events wired (Mixpanel) | NO |
16. Key Decisions + Alternatives Rejected
Initiative-level decisions live in the ANCHOR PRD §5. Below are phase-specific decisions.
16a — Decisions Made
| Date | Decision | Rationale |
|---|---|---|
| 2026-06-26 | Merge multi-turn into the MVP; single-turn = a one-turn conversation | The long-format file makes multi-turn near-zero added friction and mirrors the real conversational use case |
| 2026-06-26 | Long-format file (Topic, Conversation ID, Turn, Question, Expected Answer) | Variable conversation length, per-turn scoring, and backward-compatible (single-turn files still work) |
| 2026-06-26 | Extend existing ai_agent_test_cases / ai_agent_test_case_questions; do not create new tables | The schema already ships (Phase 1); adds conversation grouping + turn ordering |
| 2026-06-26 | Parallel across conversations, sequential within | Reuses the live Room + conversation-context path; correct semantics for shared-context replay |
| 2026-06-26 | Conversation verdict shown two ways (per-turn pass rate + goal-turn) | Specialists need both intermediate quality and end-state success |
| 2026-06-26 | Semantic engine = open dependency with confidence + human-review fallback | The engine is not in the BE repo; de-risks the build without blocking it |
16b — Alternatives Rejected
| Alternative | Why Rejected | Date |
|---|---|---|
New TestRun / TestRunResult tables (original Confluence draft) | Superseded by extending the already-shipped ai_agent_test_cases schema | 2026-06-26 |
| Single-turn-only import | Does not reflect the real (multi-turn) use case; single-turn is just the one-turn special case | 2026-06-26 |
| Wide file format (Q1/A1/Q2/A2 columns) | Rigid (fixed max turns), awkward for variable length and per-turn scoring | 2026-06-26 |
| Branching / LLM user-simulator in MVP | Hard to make deterministic and scoreable; deferred | 2026-06-26 |
17. Open Questions
| # | Type | Question | Owner | Deadline |
|---|---|---|---|---|
| 1 | Risk | Semantic-similarity engine is not in the chatbot BE — confirm the AI Service contract (endpoint, batch size, Bahasa Indonesia, latency). Mitigation: fallback to AI confidence + human review. | Dimas (PM) / AI squad | 2026-07-15 |
| 2 | Open Question | Exact grain of ai_agent_test_cases today (per-conversation vs per-batch) — decides new batch table vs folded-in column. | Eng (BOT — Puji/Eko) | 2026-07-15 |
| 3 | Open Question | Re-baselined throughput SLA now that turns are sequential within a conversation (conversations × avg turns). Interim target: ~30 conversations / ≈100 turns in ≤5 min. | Eng (BOT) | 2026-07-15 |
| 4 | Open Question | Per-conversation turn cap (e.g. ≤20) — safe ceiling, hard vs soft enforcement. | Eng (BOT) | 2026-07-15 |
| 5 | Risk | Test-room retention/cleanup so draft_state test rooms/histories don't pollute real conversation data. Mitigation: purge test rooms after each run; confirm purge window. | Eng (BOT) | 2026-07-31 |
| 6 | Risk | Imported files may contain client PII in questions/answers. Mitigation: scope to agent/workspace; discard raw file after parsing; confirm scoring stays internal. | Dimas (PM) | 2026-07-15 |
PRD CHANGELOG
| Version | Date | By | Section | Type | Summary |
|---|---|---|---|---|---|
| 1.2 | 2026-06-30 | Claude | §7, §8, S01, S11 | UPDATED | Inserted design status from qontak-designer feat/ai-agents-testing prototype. Updated §8 component tree to distinguish Phase 1 designed components (testing/index.vue node 16743-298263, GenerateTestCaseModal node 15576-205254, UploadManuallyDrawer node 16677-210564, QuestionComparisonCard node 16820-169848) from Phase 3 new components (ImportTestCaseDrawer, TestRunProgress, TestRunResults — frames TBD). Updated §7 CHG-001 Figma link to actual node. Added prototype notes to S01 (modal "Add or upload manually" card, Phase 1 stub discrepancies) and S11 (Testing Index columns, dev tools, score badge). Updated Last Updated header. |
| 1.1 | 2026-06-26 | Claude | S1, S4, S6, S8 | UPDATED | Phase 3 (Imported Test Cases) PRD authored in the documents-repo PHASE template under the AI Agent: Testing ANCHOR. Reconciled the Confluence "Import Test Case" page (QON 51068960826) with current code (chatbot, chatbot-fe): multi-turn merged into MVP (long-format Conversation ID + Turn), schema extends the existing ai_agent_test_cases / ai_agent_test_case_questions models, execution reuses the live bot-preview Room + conversation_id + SendContext path (parallel across conversations, sequential within), conversation verdict shown two ways, and the semantic-similarity engine flagged as an open dependency with a confidence + human-review fallback. 13 stories with composite AC ids (IMPORT-S01…S13). |
| 1.1 | 2026-06-26 | Claude | S1, S4, S6, S8 | UPDATED | Coaching pass after score-prd v3.3. Trimmed one-liner to ≤25 words; added plan scope + feature-flag default state + interim throughput SLA + explicit data-lifecycle/purge to Constraints; added a UI-state Mermaid diagram (S6/S8 New Features) and a system-flow Mermaid diagram (§10.1, with flow type declared); strengthened every story with a Data Fields table (type/required/source), a frame-level Figma link, an explicit Unauthorized clause, all four UI states, and a third AC where thin (S03/S09/S10/S13); added explicit rollback/reversibility to the execution stories (S05/S10). |