Qontak | AI Agent | Testing — Phase 3: Imported Test Cases (Single + Multi-Turn)

Imported Test Cases — Phase 3 PRD under the AI Agent: Testing ANCHOR. The third test-case source ("Imported question list"): a Specialist/Bot Admin uploads a curated .csv/.xls/.xlsx file, the system replays each scenario against the configured AI Agent, scores responses against the Expected Answer, and saves structured results. Single- and multi-turn are both in scope — a multi-turn scenario is a sequence of user turns sharing conversation context (single-turn = a one-turn conversation). Imported from Confluence and reconciled against code (chatbot, chatbot-fe).

HEADER BLOCK

Field	Value
PM	Dimas Fauzi Hidayat
PRD Version	1.1
Status	DRAFT
PRD Type	PHASE
Epic	TBD (Phase 3 epic not yet created)
Squad	BOT
RFC Link	To be created (`rfc-starter` from this PRD)
Figma Master	Figma — Bot · AI Agent Testing
Anchor	AI Agent: Testing — ANCHOR (Confluence)
Labels	`epic:qontak-chatbot` | `module:ai-agent` | `feature:ai-agent-testing`
Last Updated	2026-06-30

Scope Changes

Backend · Frontend · Data — file parse/validate + a new test-run orchestration endpoint that reuses the existing bot-preview execution path (chatbot), the import + multi-turn results UI on the Testing page (chatbot-fe), and the semantic-similarity scoring dependency (Data / AI Service).

2. Phase Context

Anchor PRD: AI Agent: Testing — ANCHOR
Phase Number: Phase 3 of 3 (phasing by test-case source: Historical → Knowledge → Imported)
Phase Goal: Validate the AI Agent against a PM/SPV-curated question set uploaded into a test case — covering both single-turn questions and multi-turn conversations — so specialists replace 5+ hours of manual one-by-one bot-preview testing per go-live. (Matches the ANCHOR Phase Index goal for Phase 3.)
Prior phases: Phase 1 — Historical Validation (prds/historical-validation.md) establishes the Testing page, the ai_agent_test_cases / ai_agent_test_case_questions schema, the side-by-side comparison UI, and the Sidekiq async pattern. Phase 2 — Generate from Knowledge (not yet authored).
This phase: The "Imported question list" test-case source — file import, multi-turn replay against the agent, semantic-similarity auto-scoring (with a confidence + human-review fallback), and results saved to the Testing Index.
Deferred to later: Branching multi-turn conversations, an LLM user-simulator, configurable thresholds, exporting results, and re-running only failed conversations (see §5 Non-Goals).
Cross-phase deps: This phase extends the Phase 1 schema (ai_agent_test_cases / ai_agent_test_case_questions) rather than adding new tables — it adds a conversation grouping + turn ordering. The schema additions here must stay compatible with Phase 1's comparison/rating UI.

3. One-liner + Problem

One-liner: Upload an Excel/CSV of single- or multi-turn test scenarios and auto-run them against the AI Agent — replacing 5+ hours of manual go-live testing.

Problem: Bot Implementation Specialists follow a 100-scenario SOP before every go-live, tested manually one-by-one in the bot preview — 5+ hours per go-live, repeated after every bot configuration change. Many real scenarios are multi-turn (the bot asks for an order ID, the user replies, then the bot confirms), so a test that only checks isolated single questions does not reflect the real use case. This phase lets specialists import the test files they already maintain and run them — including multi-turn conversations with shared context — so they hit the H+14 AHA-moment window. For full initiative context, see the ANCHOR PRD.

4. Target Users + Persona Context

Persona	Role	Goal	Pain	Workaround
Primary — Bot Implementation Specialist	Internal Qontak specialist configuring AI Agents for go-live	Run the 100-scenario SOP (incl. multi-turn flows) in minutes, not hours	5+ hours of manual one-by-one bot-preview testing per go-live	Tests each scenario by hand in the tree-diagram bot preview (~5 hrs/go-live)
Secondary — Client Bot Admin	Power user on the client side who maintains the AI Agent post go-live	Regression-test repeatably after every bot update	No structured re-test process; changes can silently break scenarios	Ad-hoc manual re-checks after each change (hours, error-prone)

5. Non-Goals

Branching conversations — turns in a multi-turn conversation are linear and scripted (user turns are fixed regardless of the bot's actual reply); conditional branches are out of this phase.
LLM user-simulator — no persona agent that dynamically reacts to bot replies; user turns are author-defined.
Configurable scoring threshold — thresholds are fixed in this phase: similarity ≥80% is Pass, ≥60% and <80% is Review, <60% is Fail.
Export test results to file — not in this phase.
Re-run only failed test cases — a re-run repeats the whole imported file (re-import + re-run); re-running only failed conversations or turns is not in this phase.
Test result history / version comparison across runs — not in this phase.
Mobile app — Testing is web-only in this phase.

6. Constraints

Platform: Web only (Qontak web app — Bot Automation → Testing).
Plan scope: All plans with the AI Agent module enabled.
File limits: .csv / .xls / .xlsx only; ≤500 rows (turns) per file; ≤5 MB; recommended per-conversation turn cap ≤20 (engineering to confirm — Open Question #4). Larger suites must be split.
File format: Long format — columns Topic, Conversation ID, Turn, Question, Expected Answer. Rows sharing a Conversation ID form one conversation, executed in Turn order. Expected Answer is optional per turn (blank = a context/setup turn, not scored). A single-turn case is a one-turn conversation.
Execution model: Conversations run in parallel; turns within a conversation run sequentially through one shared bot session so context (memory, slot state, agentic actions) carries across turns. Reuses the live bot-preview execution path (draft_state=true rooms) — not a new context engine.
Performance: Asynchronous batch run with live progress. Within-conversation latency is sequential, so throughput is expressed as conversations × avg turns. Interim target: ~30 conversations (≈100 turns) scored in ≤5 minutes at max concurrency (10–20 parallel conversations); the final SLA is re-baselined from Phase 1's per-row target (Open Question #3). Must respect the LLM provider's TPM/RPM so it never blocks live production traffic.
Parsing: File parsing is backend-side — the frontend has no xlsx/csv parser today.
Read/write: Specialist (internal) + Client Bot Admin roles can import and run. Standard agents have no access (menu hidden). Enforced server-side on every test-run endpoint.
Feature flag: ai_agent_testing | default: OFF — enabled per organization during beta; the Imported source is gated within this initiative flag.
Data lifecycle: The raw uploaded file is discarded after parsing (only structured ai_agent_test_case* rows persist; soft-deleted via acts_as_paranoid on delete). Test conversations run in draft_state=true bot-preview rooms, purged after the run completes so test rooms/histories never pollute real conversation data (exact purge window — Open Question #5).
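
The interim performance target implies a per-turn latency budget. The following is a rough back-of-the-envelope sketch only: it assumes every conversation has the average number of turns, that conversations run in equal-sized waves, and that scoring and queueing time are negligible. None of these assumptions is stated in this PRD, and the function name is hypothetical.

```python
import math


def per_turn_budget_seconds(conversations: int, total_turns: int,
                            concurrency: int, sla_seconds: float) -> float:
    """Rough per-turn latency budget under the interim target.

    Assumes average-length conversations run in equal-sized waves;
    scoring and queueing time are ignored.
    """
    avg_turns = total_turns / conversations
    waves = math.ceil(conversations / concurrency)
    # Turns are sequential within a conversation, and waves are sequential.
    sequential_turns = waves * avg_turns
    return sla_seconds / sequential_turns


for workers in (10, 20):
    budget = per_turn_budget_seconds(30, 100, workers, 5 * 60)
    print(f"{workers} parallel conversations: ~{budget:.0f} s per turn")
```

Under these assumptions the budget works out to roughly 30 s per turn at 10 parallel conversations and roughly 45 s at 20, which may help when re-baselining the SLA in Open Question #3.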

7. Feature Changes

CHG-001 — "Imported question list" source enabled in the Generate Test Case modal

Change Type: Modified component (GenerateTestCaseModal, scaffolded in Phase 1).
Page: /bot-automation/testing.
Before: The "Imported question list" source option is hidden/disabled (Phase 3 placeholder).
After: The option is enabled and opens the import drawer (template download + file upload + preview).

Element	Before	After
Generate source picker	"Imported question list" disabled	Enabled → opens `ImportTestCaseDrawer`

Figma: GenerateTestCaseModal — node 15576-205254 (feat/ai-agents-testing Phase 1 prototype). Phase 3 change: enable the "Add or upload manually" card and route it to ImportTestCaseDrawer instead of UploadManuallyDrawer.

8. New Features

Feature: Import Test Case (single + multi-turn)

URL: /bot-automation/testing → Generate test case → Imported question list (opens ImportTestCaseDrawer).
Access: Specialist + Bot Admin (server-enforced). Standard agents: option hidden.

Design status (2026-06-30): testing/index.vue, GenerateTestCaseModal, and the Phase 1 UploadManuallyDrawer are prototyped in qontak-designer (feat/ai-agents-testing). The Phase 3-specific components — ImportTestCaseDrawer (5-column template + "Run Test" + import preview), TestRunProgress, and TestRunResults — have no prototype yet (Figma master frame 16514-155786 exists; per-story frames TBD). See prototype notes in S01 and S11.

Component tree:

Phase 1 — already prototyped in feat/ai-agents-testing (reused or adapted):

testing/index.vue — node 16743-298263 — Testing Index; "Imported question list" already in the type filter; columns: Test case name · AI agent · Testing type · Score · Status · Last updated
GenerateTestCaseModal — node 15576-205254 — 3-card picker; Phase 3 enables "Add or upload manually" → routes to ImportTestCaseDrawer
QuestionComparisonCard — node 16820-169848 — side-by-side AI / Human answer card; aiOnly mode (hides Human col — applies to imported type); confidence % in purple footer; source pills; may be adapted for Phase 3 per-turn view

Phase 3 — new components, no prototype yet (frames TBD):

ImportTestCaseDrawer — replaces Phase 1 UploadManuallyDrawer (node 16677-210564 partial stub; Phase 1 uses 3-column format and 45 MB cap — both differ from Phase 3 spec)
- Template download (.xlsx: Topic, Conversation ID, Turn, Question, Expected Answer)
- FileUpload (reuses pixel-input-file-upload.vue; backend parses)
- ImportPreview — conversations detected + total turns + skipped rows (row number + reason)
- "Run Test" → triggers async execution
TestRunProgress — live "X of Y conversations completed" + Pass/Review/Fail counts (polling)
TestRunResults
- Conversation rows (Topic, conversation verdict, turn count) — expandable
  - Per-turn rows: Question · Expected · Actual · Score % · Status (✅/⚠️/❌); context turns shown "not scored"
- Summary banner: X Pass · X Review · X Fail
Saved into the existing Testing Index list (Phase 1 TestCasesTable).

UI States:

Empty: no imported runs yet → import CTA.
Loading: parse/preview skeleton; per-conversation progress during a run.
Error: invalid file blank slate + template link; partial-results banner if a run crashes.
Success: results grouped by conversation, expandable to per-turn.

📊 UI state diagram (import → results):

stateDiagram-v2
  [*] --> Empty
  Empty --> Uploading: select file
  Uploading --> Error: invalid / oversized
  Error --> Empty: retry / re-upload
  Uploading --> Preview: valid (conversations + turns parsed)
  Preview --> Running: Run Test
  Running --> Results: run complete
  Running --> Partial: job crash mid-run
  Results --> [*]
  Partial --> [*]: partial results banner

Figma: Bot · AI Agent Testing — Testing page (phase-specific frames TBD). Code refs: chatbot-fe store/ai-agent/, AiAgentValidationForm.vue, endpoint.ts (v1.ai_agents.test_cases.*).

9. API & Webhook Behavior

Namespaced under the existing v1.ai_agents.test_cases.* surface (already wired in FE endpoint.ts), with turn-aware semantics added. All endpoints gated server-side to Specialist/Bot Admin. Technical fields (JSON schemas, error codes) resolved during RFC.

#	Behavior	Entity Affected	Triggered By	Expected Behavior	Failure Behavior
1	Upload + validate file	New batch (`AiAgentTestRun` or batch_id) + parsed conversations/turns	User uploads file in the import drawer	Parse headers (case-insensitive), enforce 5 MB / 500-row limits, validate that each (Conversation ID, Turn) pair is unique and Turn is a positive integer, skip invalid rows; return preview `{ total_rows, conversation_count, valid_turns, skipped_rows: [{ row_index, reason }] }`	Wrong columns/type → reject whole file + template link. >5 MB / >500 rows → reject with limit message. Empty file → 422
2	Trigger execution	Update batch → `processing`; enqueue per-conversation jobs	User clicks "Run Test"	Returns immediately `{ status: "processing" }`; Sidekiq (`:ai_agent`, `retry:false`) runs conversations in parallel, turns sequential per conversation through one shared room/session	Create fails → 422, no run started. Concurrency exceeded → queue, no rows dropped
3	Poll run status	Read batch + counts	FE polls during a run	Returns `status` + live counts (X of Y conversations, Pass/Review/Fail) until `completed`	Unauthorized role → forbidden. Transient read error → client retries, run unaffected
4	Get run results	Read conversations + per-turn rows	User opens a completed run	Returns results grouped by conversation, each turn: topic, question, expected_answer, actual_response, similarity_score/confidence, status, turn_index, turn_type	Run not found → 404. Unauthorized role → forbidden
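
The polling behavior in row 3 can be sketched as a small client loop. This is an illustration in Python rather than the actual frontend code. The `completed` / `total` field names, the `failed` terminal status, the interval, and the attempt cap are all assumptions; the real payload is resolved during RFC.

```python
import time
from typing import Callable

# "completed" comes from the table above; "failed" is assumed from the crash path.
TERMINAL_STATUSES = {"completed", "failed"}


def poll_run(fetch_status: Callable[[], dict], interval_s: float = 2.0,
             max_attempts: int = 150,
             sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the run reaches a terminal status; retry transient read errors."""
    for _ in range(max_attempts):
        try:
            last = fetch_status()
        except ConnectionError:
            # Transient read error: the run itself is unaffected, so retry.
            sleep(interval_s)
            continue
        if last.get("status") in TERMINAL_STATUSES:
            return last
        sleep(interval_s)
    raise TimeoutError("run did not finish within the polling budget")


# Fake status source: one in-progress read, one transient error, then done.
_responses = iter([
    {"status": "processing", "completed": 1, "total": 3},
    ConnectionError("temporary read failure"),
    {"status": "completed", "completed": 3, "total": 3},
])


def fake_fetch() -> dict:
    item = next(_responses)
    if isinstance(item, Exception):
        raise item
    return item


result = poll_run(fake_fetch, sleep=lambda seconds: None)
print(result["status"])
```

Injecting `sleep` keeps the loop testable without real delays; the transient error in the middle does not stop the loop, matching the "client retries, run unaffected" failure behavior.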

Implementation note: the execute endpoint is a new orchestration endpoint that internally reuses the existing bot-preview receive path (bot_previews_controller.rb → ProcessIncomingMessageBotPreviewWorker → hub SendMessageWithResolve), which already runs a single turn with context (History rows + conversation_id + SendContext → Mekari RAG thread). Multi-turn = orchestrating a sequence of these per conversation.

10. System Flow + User Stories + ACs

10.1 System Flow

Flow name: Import → multi-turn replay → score → save · Type: User Journey + System Sequence (async batch).

Specialist/Bot Admin opens Bot Automation → Testing → Generate test case → Imported question list.
Downloads the template (optional), fills Topic, Conversation ID, Turn, Question, Expected Answer; uploads the file.
Backend parses + validates; UI shows a preview (conversations, turns, skipped rows with reasons).
Decision point: invalid format / >5 MB / >500 rows → reject whole file with template link (no run).
User clicks Run Test → batch created (processing); per-conversation jobs enqueued (Sidekiq :ai_agent).
For each conversation: allocate one bot-preview room (draft_state=true); replay turns in Turn order, writing History + pushing SendContext so context accumulates; capture each turn's response.
Decision point: a turn with a blank Expected Answer is a context turn — sent to build state, not scored.
Failure branch: a turn that errors marks that turn error, skips the remaining turns of that conversation, and continues other conversations.
After responses are collected, score each scored turn by semantic similarity (dependency) — fallback to AI confidence + human review if the engine is unavailable.
Failure branch: job crash mid-run → persist completed turns/conversations, mark run failed, surface partial results.
Roll up each conversation verdict, shown two ways: per-turn pass rate + goal-turn (final scored turn) result.
On completion the run is saved to the Testing Index with timestamp, file name, and counts; results display grouped by conversation, expandable to per-turn.

📊 System flow diagram:

flowchart TD
  A[Open Testing → Imported question list] --> B[Download template / fill file]
  B --> C[Upload file]
  C --> D{Valid format + within limits?}
  D -- No --> E[Reject whole file + template link]
  D -- Yes --> F[Parse + row-level validation]
  F --> G[Preview: conversations, turns, skipped rows]
  G --> H[Run Test]
  H --> I[Batch: conversations in parallel - Sidekiq :ai_agent]
  I --> J[Per conversation: replay turns in order, shared session + SendContext]
  J --> K{Turn error?}
  K -- Yes --> L[Mark turn error, skip rest of this conversation]
  K -- No --> M{Expected Answer blank?}
  M -- Yes --> N[Context turn: feeds state, not scored]
  M -- No --> O[Score turn: semantic similarity OR confidence + human review]
  L --> P[Roll up conversation verdict: per-turn rate + goal-turn]
  N --> P
  O --> P
  P --> Q{Job crash mid-run?}
  Q -- Yes --> R[Persist completed, mark run failed, show partial]
  Q -- No --> S[Save run to Testing Index + grouped results]

10.2 User Stories

MoSCoW preserved from the Confluence source. Implementation-status notes flag where ACs describe target behavior built on existing code vs not-yet-wired pieces. Figma frames are phase-specific TBD — links point to the AI Agent Testing master frame.

IMPORT-S01 — Import entry & template download | Must Have

Story: As a Specialist, I want to download a test case template, so that I know the correct format.

Before: No import entry; "Imported question list" is a disabled placeholder. After: The source is enabled, and selecting it opens the import drawer with a downloadable template.

Data Fields:

Field	Type	Required	Source
agent_id	uuid	Yes	route
template_columns	fixed list	Yes (system)	Topic, Conversation ID, Turn, Question, Expected Answer

Happy Path:

AC-1: Given I am on the Testing page, when I open Generate test case → Imported question list, then the import drawer opens.
AC-2: Given the import drawer, when I click "Download template", then I get an .xlsx with columns Topic, Conversation ID, Turn, Question, Expected Answer.
AC-3: Given I open the template, when I read the header row, then each column is pre-labeled and the Expected Answer column notes "optional = context turn".

Error Path:

ERR-1: Given I click "Download template", when the download fails, then an inline error with retry is shown and no drawer state is lost.

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: the "Imported question list" option is not rendered; direct route returns forbidden.

UI States: Loading (drawer opening), Empty (no file selected yet → template CTA), Error (download failed inline), Success (template downloaded).

Figma: GenerateTestCaseModal — node 15576-205254 (Phase 1 prototype · feat/ai-agents-testing). ImportTestCaseDrawer frame TBD. Dependencies: Phase 1 Testing page (GenerateTestCaseModal).

Prototype note (S01): The modal's 3rd card (amber doc icon) is labeled "Add or upload manually" with CTA "Add" — this maps to the "Imported question list" type. In Phase 1 it opens UploadManuallyDrawer (node 16677-210564); Phase 3 replaces that with ImportTestCaseDrawer. The Phase 1 drawer is a partial stub: dropzone + upload-progress animation + content-preview panel, but uses a 3-column format (Topics/Intent · Question · Expected Answer) and a 45 MB cap — both differ from the Phase 3 spec (5-column / 5 MB / "Run Test"). Phase 3 needs a redesigned drawer. The modal also has a dev-tools state "Not enough conversation" that disables the inbox card with tooltip "You must have at least 50 past inbox conversations" — threshold not currently in the PRD.

IMPORT-S02 — Upload & validate long-format file | Must Have

Story: As a Specialist, I want to upload a .csv/.xls/.xlsx file and see a preview, so that I can confirm before running.

Before: No file import path into a test case. After: A validated upload shows a preview of conversations, turns, and skipped rows.

Data Fields:

Field	Type	Required	Source
file	binary	Yes	user upload
topic	string	Yes	file row
conversation_id	string	Yes	file row
turn	int	Yes	file row
question	string	Yes	file row
expected_answer	string	No (blank = context turn)	file row

Happy Path:

AC-1: Given I upload a valid file, when it passes validation, then I see a preview: conversation_count, total turns, and skipped rows with reasons.
AC-2: Given column headers differ only in letter case (e.g. expected answer instead of Expected Answer), when validated, then matching is case-insensitive and the file is accepted.
AC-3: Given a valid multi-turn file, when the preview renders, then conversations are grouped (one preview row per Conversation ID with its turn count).

Error Path:

ERR-1: Given an upload is in progress, when the transfer fails partway, then no DB state is written and I can retry.

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: upload endpoint returns forbidden.

UI States: Loading (parsing skeleton), Empty (no valid rows → guidance), Error (upload failed + retry), Success (preview shown).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S01.

Implementation status: parsing is backend-side (FE has no xlsx/csv parser). Extends the existing ai_agent_test_cases / ai_agent_test_case_questions schema.
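
To make the long format concrete, here is a minimal sketch of grouping rows by Conversation ID and ordering them by Turn, with case-insensitive header matching. It is written in Python for illustration only (the backend is not Python), handles CSV only, and assumes rows are already clean; row-level validation is covered in IMPORT-S04. The function name and sample data are hypothetical.

```python
import csv
import io
from collections import defaultdict

REQUIRED = ["topic", "conversation id", "turn", "question", "expected answer"]


def parse_long_format(csv_text: str) -> dict[str, list[dict]]:
    """Group rows by Conversation ID and order each group by Turn."""
    reader = csv.DictReader(io.StringIO(csv_text))
    # Map lower-cased header -> original header so matching is case-insensitive.
    headers = {h.strip().lower(): h for h in (reader.fieldnames or [])}
    missing = [c for c in REQUIRED if c not in headers]
    if missing:
        raise ValueError(f"invalid_format: missing columns {missing}")

    def cell(row: dict, column: str) -> str:
        return (row.get(headers[column]) or "").strip()

    conversations: dict[str, list[dict]] = defaultdict(list)
    for row in reader:
        expected = cell(row, "expected answer")
        conversations[cell(row, "conversation id")].append({
            "topic": cell(row, "topic"),
            "turn": int(cell(row, "turn")),
            "question": cell(row, "question"),
            "expected_answer": expected or None,
            # A blank Expected Answer marks a context (setup) turn.
            "turn_type": "user" if expected else "context",
        })
    for turns in conversations.values():
        turns.sort(key=lambda t: t["turn"])
    return dict(conversations)


sample = """Topic,Conversation ID,Turn,Question,expected answer
Order,C1,2,It is 12345,Your order 12345 has shipped
Order,C1,1,Where is my order?,
FAQ,C2,1,What are your hours?,We are open 9 to 5
"""
parsed = parse_long_format(sample)
print({cid: len(turns) for cid, turns in parsed.items()})
```

In the sample, C1 is a two-turn conversation whose first turn is a context turn, and C2 is a single-turn case, that is, a one-turn conversation.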

IMPORT-S03 — Invalid format rejection | Must Have

Story: As a Specialist, I want a clear error when my file format is wrong, so that I can fix it fast.

Before: No import validation. After: Files with wrong columns or unsupported types are rejected with a template link.

Data Fields:

Field	Type	Required	Source
file_type	enum (.csv/.xls/.xlsx)	Yes	upload
reason	enum: invalid_format / row_limit_exceeded / size_exceeded	Yes (system)	validator

Happy Path:

AC-1: Given I upload a file with missing/wrong columns, when validated, then the whole file is rejected with an error message + a link to download the correct template.
AC-2: Given I upload an .xlsx with the correct columns but a renamed sheet, when validated, then the first sheet's header row is used and the file is accepted.

Error Path:

ERR-1: Given I upload an unsupported file type (e.g. .pdf), when validated, then it is rejected before any processing with a clear type error.
ERR-2: Given the file is a valid type but corrupt/unreadable, when parsing runs, then it is rejected with a "could not read file" error + retry.

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: endpoint returns forbidden.

UI States: Loading (validating), Empty (N/A — rejection path), Error (rejection + template link), Success (N/A — acceptance handled in S02).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S02.
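
The whole-file rejection rules can be sketched as a single check that returns one of the `reason` values above. This is a hedged illustration: the function name, the check order, the choice to treat 5 MB as 5 × 1024 × 1024 bytes, and mapping an unsupported file type to `invalid_format` are all assumptions, since the PRD does not define them.

```python
from pathlib import PurePath
from typing import Optional

MAX_BYTES = 5 * 1024 * 1024  # assumed interpretation of the 5 MB cap
MAX_ROWS = 500               # rows (turns) per file
ALLOWED_EXTENSIONS = {".csv", ".xls", ".xlsx"}
REQUIRED_COLUMNS = {"topic", "conversation id", "turn", "question", "expected answer"}


def file_rejection_reason(filename: str, size_bytes: int,
                          header: list[str], row_count: int) -> Optional[str]:
    """Return a whole-file rejection reason, or None if the file may proceed."""
    # Type check first, so unsupported files are rejected before any processing.
    if PurePath(filename).suffix.lower() not in ALLOWED_EXTENSIONS:
        return "invalid_format"
    if size_bytes > MAX_BYTES:
        return "size_exceeded"
    normalised = {h.strip().lower() for h in header}
    if not REQUIRED_COLUMNS <= normalised:
        return "invalid_format"
    if row_count > MAX_ROWS:
        return "row_limit_exceeded"
    return None


good_header = ["Topic", "Conversation ID", "Turn", "Question", "Expected Answer"]
print(file_rejection_reason("suite.pdf", 10, [], 0))
print(file_rejection_reason("suite.xlsx", 1024, good_header, 501))
print(file_rejection_reason("suite.csv", 1024, good_header, 100))
```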

IMPORT-S04 — Row-level validation & skip | Must Have

Story: As a Specialist, I want rows with missing/invalid data skipped (not blocking the run), so that valid scenarios still proceed.

Before: No row-level validation. After: Invalid rows are flagged by row number + reason; valid rows proceed.

Data Fields:

Field	Type	Required	Source
row_index	int	Yes (system)	parser
reason	string	Yes (system)	validator
conversation_id	string	Yes	file row
turn	int	Yes	file row

Happy Path:

AC-1: Given rows with an empty Topic/Conversation ID/Turn/Question, when I upload, then those rows are skipped and flagged by row number and reason; valid rows proceed.
AC-2: Given two rows share the same (Conversation ID, Turn), when validated, then the duplicate is skipped + flagged (turn ordering must be unique).
AC-3: Given a conversation whose every turn has a blank Expected Answer, when validated, then it is flagged (nothing to score).

Error Path:

ERR-1: Given Turn is not a positive integer (e.g. 0, 1.5, text), when validated, then that row is skipped + flagged.

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: endpoint returns forbidden.

UI States: Loading (validating), Empty (all rows skipped → "no valid rows" guidance), Error (per-row reason shown in preview), Success (preview lists valid + skipped rows).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S02.
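
The skip-and-flag rules in this story can be sketched as one pass over the parsed rows. This is a sketch under assumptions: `row_index` is taken to be the spreadsheet row number with the header as row 1, the reason strings are illustrative, and both function names are hypothetical.

```python
REQUIRED_FIELDS = ("topic", "conversation_id", "turn", "question")


def _text(value) -> str:
    return "" if value is None else str(value).strip()


def validate_rows(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split rows into (valid, skipped); skipped entries carry row_index + reason."""
    valid, skipped, seen = [], [], set()
    for index, row in enumerate(rows, start=2):  # assumed: header is row 1
        missing = [f for f in REQUIRED_FIELDS if not _text(row.get(f))]
        if missing:
            skipped.append({"row_index": index, "reason": "missing " + ", ".join(missing)})
            continue
        turn_text = _text(row["turn"])
        if not turn_text.isdecimal() or int(turn_text) < 1:
            skipped.append({"row_index": index, "reason": "turn must be a positive integer"})
            continue
        key = (_text(row["conversation_id"]), int(turn_text))
        if key in seen:
            skipped.append({"row_index": index, "reason": "duplicate (conversation_id, turn)"})
            continue
        seen.add(key)
        valid.append({**row, "turn": int(turn_text)})
    return valid, skipped


def unscorable_conversations(valid: list[dict]) -> list[str]:
    """Conversation IDs whose every turn has a blank Expected Answer."""
    scored = {r["conversation_id"] for r in valid if _text(r.get("expected_answer"))}
    return sorted({r["conversation_id"] for r in valid} - scored)


rows = [
    {"topic": "Order", "conversation_id": "C1", "turn": "1",
     "question": "Where is my order?", "expected_answer": ""},
    {"topic": "Order", "conversation_id": "C1", "turn": "1",
     "question": "Duplicate turn", "expected_answer": "x"},
    {"topic": "Order", "conversation_id": "C1", "turn": "1.5",
     "question": "Bad turn", "expected_answer": "x"},
    {"topic": "", "conversation_id": "C2", "turn": "1",
     "question": "No topic", "expected_answer": "x"},
]
valid, skipped = validate_rows(rows)
print(skipped)
print(unscorable_conversations(valid))
```

In this sample only the first row survives, so C1 ends up with a single context turn and is flagged as having nothing to score.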

IMPORT-S05 — Multi-turn execution (parallel across, sequential within) | Must Have

Story: As a Specialist, I want to click "Run Test" and have all scenarios — including multi-turn — run automatically, so that I replace manual testing.

Before: Each scenario is tested manually one-by-one in bot preview. After: Conversations run in parallel; turns within a conversation run sequentially through one shared session with carried context.

Data Fields:

Field	Type	Required	Source
room_id	uuid (per conversation)	Yes (system)	bot-preview room alloc
conversation_id	string	Yes	parsed file
turn_index	int	Yes	parsed file
actual_response	text	Yes (system)	AI agent

Happy Path:

AC-1: Given I reviewed the preview, when I click "Run Test", then each conversation is replayed turn-by-turn in Turn order through one shared bot session, and conversations run in parallel.
AC-2: Given a multi-turn conversation, when turn N runs, then it shares the conversation context (memory/slot state) accumulated from turns 1..N-1.
AC-3: Given I navigate away mid-run, when I return to the Testing Index, then the run has continued in the background and its result is saved.

Error Path:

ERR-1: Given the run is triggered, when the request returns, then it returns immediately { status: "processing" } and the run continues in the background.

Rollback / reversibility: A run is an immutable record. A partial or unwanted run can be deleted from the Testing Index (soft-delete, acts_as_paranoid) and re-created by re-importing — there is no in-place mutation of a completed run. Test rooms are draft_state=true and purged after the run, so a discarded run leaves no live-data residue.

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: execute endpoint returns forbidden.

UI States: Loading (TestRunProgress), Empty (N/A — run already has conversations), Error (partial-results banner on crash), Success (run completes, results shown).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S04; bot-preview execution path (§15).

Implementation status: reuses the live bot-preview execution path (SendMessageWithResolve) + History + conversation_id + SendContext (Mekari RAG thread). Concurrency unit = conversation (Sidekiq :ai_agent, retry:false).
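
The "parallel across, sequential within" shape can be illustrated with a thread pool and a per-conversation session object. This is only a sketch of the concurrency model: production would use Sidekiq jobs and the bot-preview path described above, and `PreviewSession`, `run_all`, and the toy agent are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

SendFn = Callable[[list[str], str], str]


class PreviewSession:
    """Stand-in for one bot-preview room; holds the history shared by all turns."""

    def __init__(self, send: SendFn):
        self._send = send
        self.history: list[str] = []

    def ask(self, question: str) -> str:
        answer = self._send(self.history, question)  # the agent sees prior turns
        self.history += [question, answer]
        return answer


def run_conversation(turns: list[str], send: SendFn) -> list[str]:
    session = PreviewSession(send)          # one shared session per conversation
    return [session.ask(q) for q in turns]  # turns run sequentially, in order


def run_all(conversations: dict[str, list[str]], send: SendFn,
            max_workers: int = 10) -> dict[str, list[str]]:
    # Conversations are independent, so they can run in parallel.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {cid: pool.submit(run_conversation, turns, send)
                   for cid, turns in conversations.items()}
        return {cid: future.result() for cid, future in futures.items()}


def echo_agent(history: list[str], question: str) -> str:
    # Toy agent: reports how many earlier messages it can see.
    return f"seen {len(history)} earlier messages; you said: {question}"


results = run_all(
    {"C1": ["Where is my order?", "It is 12345"], "C2": ["Opening hours?"]},
    echo_agent,
)
print(results["C1"][1])
```

The second turn of C1 sees the two earlier messages (the first question and its answer), while C2 starts from an empty history, which is the context-isolation property AC-2 describes.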

IMPORT-S06 — Semantic-similarity scoring (with fallback) | Must Have

Story: As a Specialist, I want each scored turn graded Pass/Review/Fail against its Expected Answer, so that I get an objective signal.

Before: Quality is judged manually. After: Each scored turn is auto-graded by semantic similarity; if the engine is unavailable, the system falls back to AI confidence + human review.

Data Fields:

Field	Type	Required	Source
similarity_score	float (0–1)	Yes (system)	semantic engine
confidence	int	Conditional (fallback)	AI agent response
status	enum: pass/review/fail/error	Yes (system)	scorer

Happy Path:

AC-1: Given a turn with an Expected Answer and an agent response, when scoring runs, then similarity ≥80% → Pass, ≥60% and <80% → Review, <60% → Fail (similarity_score × 100, so 0.795 is Review).
AC-2: Given the agent response is empty/errored, when scoring runs, then score = 0 and status = Fail.
AC-3: Given a batch of scored turns, when scoring runs, then it is sent as a single batch request (chunked if the engine's max batch size is exceeded).

Error Path:

ERR-1: Given the semantic engine is unavailable, when scoring runs, then the turn falls back to AI confidence + human review (rated via the Phase 1 rating flow) rather than blocking the run, and the turn is badged "scored by fallback".

Permission Model: CAN: Specialist, Bot Admin (scoring runs system-side under their triggered run). CANNOT: standard agents. Unauthorized: not executed if flag OFF.

UI States: Loading (scoring), Empty (N/A), Error (fallback-to-review badge), Success (Pass/Review/Fail badge with score).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S05; semantic-similarity engine (§15, dependency).

Implementation status: ⚠️ the semantic-similarity engine is not present in the chatbot BE repo (only the KB vector store, for retrieval). It must be confirmed with the AI Service / ML team (Open Question #1). Fallback reuses the existing RateTestCaseQuestion (confidence + human 0/1).
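
A minimal sketch of the threshold mapping, batch chunking, and fallback path, assuming the engine takes a list of (expected, actual) pairs and returns 0–1 scores. The engine interface, the `max_batch` value, the `scored_by` field, and giving fallback turns a `review` status are all assumptions, since the real engine is still to be confirmed (Open Question #1). The toy engine below is a word-overlap stand-in, not semantic similarity.

```python
from typing import Callable, Iterator

Pair = tuple[str, str]  # (expected_answer, actual_response)


def status_for(similarity: float) -> str:
    """Map a 0-1 similarity score to the fixed Pass/Review/Fail thresholds."""
    if similarity >= 0.80:
        return "pass"
    if similarity >= 0.60:
        return "review"
    return "fail"


def chunked(items: list, size: int) -> Iterator[list]:
    for start in range(0, len(items), size):
        yield items[start:start + size]


def score_turns(pairs: list[Pair], engine: Callable[[list[Pair]], list[float]],
                max_batch: int = 64) -> list[dict]:
    results: list = [None] * len(pairs)
    to_score = []
    for i, (_, actual) in enumerate(pairs):
        if not actual.strip():
            # Empty agent response: score 0 and Fail, without calling the engine.
            results[i] = {"similarity_score": 0.0, "status": "fail", "scored_by": "similarity"}
        else:
            to_score.append(i)
    for batch in chunked(to_score, max_batch):
        try:
            scores = engine([pairs[i] for i in batch])
        except ConnectionError:
            # Engine unavailable: route these turns to confidence + human review.
            for i in batch:
                results[i] = {"similarity_score": None, "status": "review", "scored_by": "fallback"}
            continue
        for i, score in zip(batch, scores):
            results[i] = {"similarity_score": score, "status": status_for(score),
                          "scored_by": "similarity"}
    return results


def toy_engine(batch: list[Pair]) -> list[float]:
    # Word-overlap stand-in; a real engine would compare meaning, not tokens.
    scores = []
    for expected, actual in batch:
        a, b = set(expected.lower().split()), set(actual.lower().split())
        scores.append(len(a & b) / len(a | b) if a | b else 0.0)
    return scores


scored = score_turns([("open 9 to 5", "open 9 to 5"), ("order shipped", "")], toy_engine)
print([s["status"] for s in scored])
```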

IMPORT-S07 — Conversation verdict (two views) | Must Have

Story: As a Specialist, I want each conversation's verdict shown two ways, so that I see both intermediate quality and end-state success.

Before: No conversation-level roll-up. After: A conversation shows (a) per-turn pass rate across scored turns and (b) the goal-turn (final scored turn) result, side by side.

Data Fields:

Field	Type	Required	Source
per_turn_pass_rate	float	Yes (system)	roll-up
goal_turn_status	enum: pass/review/fail/error	Yes (system)	final scored turn
turn_count	int	Yes (system)	conversation

Happy Path:

AC-1: Given a completed multi-turn conversation, when the verdict renders, then it shows the per-turn pass rate AND the goal-turn result side by side.
AC-2: Given a single-turn conversation, when the verdict renders, then both views collapse to that one turn's status.
AC-3: Given context turns in a conversation, when the pass rate computes, then context turns are excluded from the denominator.

Error Path:

ERR-1: Given a conversation aborted on a turn error, when the verdict renders, then it is marked incomplete (error) rather than computing a misleading pass rate.

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: results endpoint returns forbidden.

UI States: Loading (computing), Empty (N/A), Error (incomplete badge), Success (both verdicts shown).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S06.
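
The two-view roll-up can be sketched as follows. One point the ACs leave open is whether a Review turn counts toward the pass rate; this sketch assumes only Pass counts, and it represents a context turn as a `None` status. Both choices, and the function name, are assumptions.

```python
def conversation_verdict(turns: list[dict]) -> dict:
    """Roll up one conversation into per-turn pass rate + goal-turn status.

    Each turn has "status": "pass" | "review" | "fail" | "error",
    or None for a context turn (blank Expected Answer, not scored).
    """
    if any(t["status"] == "error" for t in turns):
        # Aborted conversation: report incomplete rather than a misleading rate.
        return {"per_turn_pass_rate": None, "goal_turn_status": "error",
                "turn_count": len(turns)}
    scored = [t for t in turns if t["status"] is not None]
    if not scored:
        return {"per_turn_pass_rate": None, "goal_turn_status": None,
                "turn_count": len(turns)}
    passed = sum(1 for t in scored if t["status"] == "pass")
    return {
        "per_turn_pass_rate": passed / len(scored),  # context turns excluded
        "goal_turn_status": scored[-1]["status"],    # final scored turn
        "turn_count": len(turns),
    }


multi = conversation_verdict([{"status": None}, {"status": "pass"}, {"status": "review"}])
single = conversation_verdict([{"status": "pass"}])
aborted = conversation_verdict([{"status": "pass"}, {"status": "error"}])
print(multi, single, aborted, sep="\n")
```

For the three-turn example the context turn is dropped from the denominator, so the pass rate is 1 of 2 scored turns while the goal-turn result is Review; the single-turn case collapses both views to that one turn.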

IMPORT-S08 — Results grouped by conversation (expandable) | Must Have

Story: As a Specialist, I want results grouped by conversation and expandable to per-turn, so that I can see exactly which turn broke.

Before: No grouped results view. After: Conversation rows expand to per-turn rows (Question, Expected, Actual, Score, Status).

Data Fields:

Field	Type	Required	Source
topic	string	Yes	file row
question	text	Yes	file row
expected_answer	text	No (context turn)	file row
actual_response	text	Yes (system)	AI agent
similarity_score	float	Conditional	scorer
status	enum: pass/review/fail/error	Yes (system)	scorer

Happy Path:

AC-1: Given results, when I expand a conversation row, then I see each turn's Question / Expected / Actual / Score / Status, plus the conversation verdict.
AC-2: Given a completed run, when results render, then the summary banner shows total Pass / Review / Fail across the run.
AC-3: Given a context turn, when I expand it, then it is shown with a "not scored" badge rather than a score.

Error Path:

ERR-1: Given a turn failed generation, when I expand it, then the Actual side shows a "could not generate" state, not a blank panel.

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: results endpoint returns forbidden.

UI States: Loading (skeleton rows), Empty (no results in this run), Error (per-turn failed state), Success (grouped table rendered).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S07. Reuses AiAgentValidationForm.vue table + score badges.

IMPORT-S09 — Context turns are not scored | Must Have

Story: As a Specialist, I want context-only turns (blank Expected Answer) to set up state without being scored, so that setup steps don't distort results.

Before: No notion of an unscored setup turn. After: A blank-Expected-Answer turn feeds context and is shown "not scored".

Data Fields:

Field	Type	Required	Source
turn_type	enum: user / context	Yes (system)	derived (blank Expected = context)
expected_answer	text	No	file row

Happy Path:

AC-1: Given a conversation turn with a blank Expected Answer, when it runs, then it feeds context and is shown as "not scored" (excluded from the pass rate).
AC-2: Given a context turn, when the agent responds, then the response is still captured and displayed (for human inspection) but not scored.
AC-3: Given a conversation of all context turns except one scored turn, when it runs, then only the scored turn contributes to the verdict.

Error Path:

ERR-1: Given a context turn's agent call errors, when it runs, then the conversation is aborted per IMPORT-S10 (downstream turns depend on it).

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: not executed if flag OFF.

UI States: Loading (running), Empty (N/A), Error (aborted-conversation badge), Success ("not scored" badge on the turn).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S05.

IMPORT-S10 — Turn-error isolation | Must Have

Story: As a Specialist, I want a failed turn to stop only its conversation, not the whole run, so that one bad scenario doesn't block the rest.

Before: No error isolation model. After: A turn error aborts only its conversation; other conversations continue.

Data Fields:

Field	Type	Required	Source
status	enum: error	Yes (system)	runner
aborted_at_turn	int	Conditional	runner

Happy Path:

AC-1: Given a turn errors (timeout/agent error) mid-conversation, when the run continues, then the remaining turns of that conversation are skipped and other conversations are unaffected.
AC-2: Given a conversation aborted on a turn error, when results render, then it shows which turn failed and that downstream turns were skipped.
AC-3: Given several conversations and one aborts, when the run completes, then the run summary counts the aborted conversation as error without failing the whole run.

Error Path:

ERR-1: Given the run job crashes mid-batch, when it stops, then completed turns/conversations are persisted, the run is marked failed, and partial results are shown with a banner.

Rollback / reversibility: A failed or partial run can be deleted (soft-delete) and re-run by re-importing; no live-conversation state is touched (test rooms are draft_state=true and purged).

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: the run is not executed when the ai_agent_testing flag is OFF.

UI States: Loading (running), Empty (N/A), Error (per-conversation error badge / partial-run banner), Success (unaffected conversations complete).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S05.
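
The isolation model in IMPORT-S10 could look roughly like the sketch below. The function names, the result dictionaries, and the `flaky_agent` stub are all hypothetical; the real runner reuses the bot-preview path and runs conversations in parallel, whereas this sketch iterates them one by one to keep the abort logic easy to read.

```python
from typing import Callable


def run_conversation(turns: list[str], send: Callable[[str], str]) -> dict:
    """Replay turns in order; stop this conversation at the first turn error."""
    results = []
    for index, question in enumerate(turns, start=1):
        try:
            results.append({"turn": index, "status": "done", "response": send(question)})
        except Exception as exc:  # timeout or agent error
            results.append({"turn": index, "status": "error", "detail": str(exc)})
            # Later turns would run on a broken context, so mark them skipped.
            for skipped in range(index + 1, len(turns) + 1):
                results.append({"turn": skipped, "status": "skipped"})
            return {"status": "error", "aborted_at_turn": index, "turns": results}
    return {"status": "completed", "aborted_at_turn": None, "turns": results}


def run_batch(conversations: dict[str, list[str]], send: Callable[[str], str]) -> dict[str, dict]:
    # Each conversation is isolated: an abort in one never stops the others.
    return {cid: run_conversation(turns, send) for cid, turns in conversations.items()}


def flaky_agent(question: str) -> str:
    """Stand-in for the agent call; raises on a marker word to simulate a timeout."""
    if "boom" in question:
        raise TimeoutError("agent timed out")
    return f"answer to: {question}"


batch = run_batch({"c1": ["hello", "boom", "never sent"], "c2": ["hi", "thanks"]}, flaky_agent)
for cid, outcome in batch.items():
    print(cid, outcome["status"], outcome["aborted_at_turn"])
```

Recording skipped turns explicitly, instead of dropping them, is what lets the results view show which turn failed and which downstream turns never ran (AC-2).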

IMPORT-S11 — Save to Testing Index & view past runs | Must Have

Story: As a Specialist, I want the run saved to the Testing Index, so that I can revisit and regression-test later.

Before: No persistence of imported runs. After: Each completed run is saved with timestamp, file name, and counts, and is listed in the Testing Index.

Data Fields:

Field	Type	Required	Source
file_name	string	Yes	upload
created_at	timestamp	Yes (system)	run
pass_count / review_count / fail_count / error_count	int	Yes (system)	roll-up

Happy Path:

AC-1: Given a run completes, when I go to the Testing Index, then I see the saved run with timestamp, file name, and total Pass/Review/Fail/Error counts.
AC-2: Given a saved run, when I open it, then I see its conversation-grouped results.
AC-3: Given multiple past runs for an agent, when the Testing Index renders, then runs are listed newest-first.

Error Path:

ERR-1: Given the Testing Index fails to load, when the page renders, then a "Couldn't load" blank slate with Retry is shown.

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: Testing menu hidden; direct route forbidden.

UI States: Loading (list skeleton), Empty (no runs yet → import CTA), Error (retry), Success (run listed + openable).

Figma: testing/index.vue — node 16743-298263 (Phase 1 prototype · feat/ai-agents-testing). Dependencies: IMPORT-S08. Reuses Phase 1 TestCasesTable.

Prototype note (S11): Testing Index fully designed. Table columns: Test case name · AI agent · Testing type · Score · Status · Last updated · Actions (sticky right col). The type filter dropdown includes "Imported question list". Columns are sortable (name, score, last updated). Pagination: 10 rows/page. Delete confirmation modal: "Delete test case?" with test case name in bold; soft-delete on confirm. Dev tools: Filled / Empty state / Loading / Search not found / Filter not found. Score column shows X% (integer); Status badge is Passed (green) or Need review (amber) — Phase 3 may extend to a three-state Pass / Review / Fail badge aligned with semantic-similarity thresholds (≥80% / 60–79% / <60%).
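
If Phase 3 adopts the three-state badge, the mapping from the integer score to a badge might be as simple as the sketch below. The function name `status_badge` is hypothetical; the thresholds are the ones quoted in the prototype note (≥80% / 60–79% / <60%).

```python
def status_badge(score_percent: int) -> str:
    """Map an integer score (0-100) to the proposed three-state badge."""
    if not 0 <= score_percent <= 100:
        raise ValueError("score must be between 0 and 100")
    if score_percent >= 80:
        return "Pass"
    if score_percent >= 60:
        return "Review"
    return "Fail"


for score in (80, 79, 60, 59):
    print(score, status_badge(score))
```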

IMPORT-S12 — Async progress / live polling | Must Have

Story: As a Specialist, I want a live progress indicator while a batch runs, so that the UI never freezes.

Before: No progress feedback for a batch. After: The run shows "X of Y conversations completed" with live Pass/Review/Fail and remains responsive.

Data Fields:

Field	Type	Required	Source
status	enum: processing/completed/failed	Yes (system)	run
conversations_done	int	Yes (system)	run
conversations_total	int	Yes (system)	run

Happy Path:

AC-1: Given a run is in progress, when I poll the run, then status reflects progress (X of Y conversations) until completed.
AC-2: Given the run is running, when I stay on the page, then the UI remains responsive and updates incrementally.
AC-3: Given the run completes while I am away, when I reopen it, then it shows the final state (not a stuck progress bar).

Error Path:

ERR-1: Given a poll request fails transiently, when the client retries, then progress resumes without losing the run.

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: status endpoint returns forbidden.

UI States: Loading (progress indicator), Empty (N/A), Error (transient retry indicator), Success (completed state).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S05. Precedent: ai-assist.ts UUID status poll.
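
The polling behaviour in IMPORT-S12 (including the transient-failure case in ERR-1) can be sketched as a loop like the one below. Everything here is illustrative: `poll_run`, its parameters, and the scripted `fake_fetch` are hypothetical, and a real client would call the status endpoint instead. The status values and the done/total fields follow the Data Fields table.

```python
import time
from typing import Callable

TERMINAL = ("completed", "failed")


def poll_run(fetch_status: Callable[[], dict], interval_s: float = 2.0,
             max_transient_failures: int = 5,
             sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the run reaches a terminal status, tolerating transient failures."""
    failures = 0
    while True:
        try:
            state = fetch_status()
        except ConnectionError:
            failures += 1
            if failures > max_transient_failures:
                raise  # persistent failure: surface it instead of spinning forever
            sleep(interval_s)
            continue
        failures = 0  # a good response resets the transient-failure budget
        print(f'{state["conversations_done"]} of {state["conversations_total"]} conversations completed')
        if state["status"] in TERMINAL:
            return state
        sleep(interval_s)


# Scripted responses stand in for the real status endpoint.
script = [
    {"status": "processing", "conversations_done": 1, "conversations_total": 3},
    ConnectionError("transient network error"),
    {"status": "completed", "conversations_done": 3, "conversations_total": 3},
]


def fake_fetch() -> dict:
    item = script.pop(0)
    if isinstance(item, Exception):
        raise item
    return item


final = poll_run(fake_fetch, interval_s=0, sleep=lambda _seconds: None)
```

Because the loop always re-reads the run state, reopening the page after the run has finished returns the terminal state on the first poll, which is the behaviour AC-3 asks for.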

IMPORT-S13 — File row/size caps | Must Have

Story: As a Specialist, I want the system to cap files at 500 rows and 5 MB, so that runs stay performant.

Before: No file caps. After: Oversized files are rejected with a clear message.

Data Fields:

Field	Type	Required	Source
row_count	int (≤500)	Yes (system)	parser
file_size	bytes (≤5 MB)	Yes (system)	upload

Happy Path:

AC-1: Given I upload a file with >500 rows, when validated, then it is rejected with: "File exceeds 500 row limit. Please split into multiple files."
AC-2: Given I upload a file >5 MB, when validated, then it is rejected with a size-limit error.
AC-3: Given a file exactly at a limit boundary (500 rows / 5 MB), when validated, then it is accepted (limits are inclusive of the stated max).

Error Path:

ERR-1: Given a file contains a conversation that exceeds the recommended per-conversation turn cap, when validated, then the affected conversation is flagged (engineering to confirm hard vs soft — OQ#4).

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: endpoint returns forbidden.

UI States: Loading (validating), Empty (N/A), Error (limit message + template link), Success (within-limit file accepted → preview).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S02.
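
A minimal sketch of the inclusive caps in IMPORT-S13. The function name `validate_caps` is hypothetical, the size-limit wording is invented for the example (the story only specifies the row-limit message), and reading "5 MB" as 5 × 1024 × 1024 bytes is an assumption to confirm with engineering.

```python
MAX_ROWS = 500
MAX_BYTES = 5 * 1024 * 1024  # assumption: "5 MB" is read as 5 MiB


def validate_caps(row_count: int, file_size_bytes: int) -> list[str]:
    """Return the list of cap violations; an empty list means the file is accepted."""
    errors = []
    if row_count > MAX_ROWS:  # strictly greater: exactly 500 rows is accepted
        errors.append("File exceeds 500 row limit. Please split into multiple files.")
    if file_size_bytes > MAX_BYTES:  # strictly greater: exactly 5 MB is accepted
        errors.append("File exceeds 5 MB size limit.")  # hypothetical wording
    return errors


print(validate_caps(500, MAX_BYTES))
print(validate_caps(501, MAX_BYTES + 1))
```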

Negative Scenarios

NEG-1: Given I am a standard agent, when I look for the Imported test-case import, then the option is not rendered and the route is forbidden.
NEG-2: Given a multi-turn conversation, when a turn errors, then later turns in that conversation are not executed (they would run on a broken context).
NEG-3: Given a context turn (blank Expected Answer), when results render, then it never counts toward Pass/Review/Fail.
NEG-4: Given the semantic engine is down, when a run executes, then it falls back to confidence + human review rather than marking every turn Fail.

11. Rollout

Feature flag: ai_agent_testing (initiative flag, default OFF); Imported source gated within it.
Stage 1: Internal — Bot Implementation Specialist team (the primary users).
Stage 2: Closed beta — select Bot Admin clients.
Stage 3: Open beta (all orgs with AI Agent enabled, on request).
GA: All orgs with AI Agent enabled.
Backward compat: Yes — additive. Extends the existing ai_agent_test_cases / ai_agent_test_case_questions schema with conversation/turn fields; Phase 1 flows unchanged. During the transition, existing Phase 1 test cases (no turn_index/turn_type) render as single-turn conversations — old and new rows coexist without migration.
Migration: Additive columns (turn_index, expected_answer, turn_type) + a thin batch parent. No backfill; existing rows default to a one-turn conversation.
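
The no-backfill approach implies that readers of the table fill in defaults for rows written before the new columns existed. A rough sketch, assuming rows are available as dictionaries and that a missing `turn_type` should be treated as a scored `user` turn (both are assumptions of this example; `as_conversation_turn` is a hypothetical helper):

```python
def as_conversation_turn(row: dict) -> dict:
    """Fill conversation/turn defaults for rows that predate the new columns."""
    normalized = dict(row)  # copy so the stored row is never mutated
    if normalized.get("turn_index") is None:
        normalized["turn_index"] = 1  # legacy row = the only turn of its conversation
    if normalized.get("turn_type") is None:
        normalized["turn_type"] = "user"  # assumption: Phase 1 questions are scored turns
    return normalized


legacy_row = {"id": 7, "question": "What are your opening hours?"}
new_row = {"id": 8, "question": "And on Sunday?", "turn_index": 2, "turn_type": "user"}
print(as_conversation_turn(legacy_row))
print(as_conversation_turn(new_row))
```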

Semantic-regression rollback: ai_agent_testing is the per-org kill switch. If multi-turn replay or semantic scoring proves unreliable in beta (e.g. scoring disagrees with human review on >20% of a sample), PM toggles ai_agent_testing OFF per org (no deploy); scoring reverts to confidence + human review.
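
The example rollback criterion (scoring disagrees with human review on more than 20% of a sample) reduces to a simple ratio. The sketch below is illustrative only: the helper names and the sample verdicts are made up, and how the review sample is drawn is not specified in this PRD.

```python
ROLLBACK_THRESHOLD = 0.20  # from the example criterion: disagreement on >20% of a sample


def disagreement_rate(sample: list[tuple[str, str]]) -> float:
    """sample holds (engine_verdict, human_verdict) pairs for reviewed turns."""
    if not sample:
        raise ValueError("sample must not be empty")
    disagreements = sum(1 for engine, human in sample if engine != human)
    return disagreements / len(sample)


def should_roll_back(sample: list[tuple[str, str]]) -> bool:
    return disagreement_rate(sample) > ROLLBACK_THRESHOLD


# Illustrative data: 10 reviewed turns, 2 disagreements.
sample = [("pass", "pass")] * 7 + [("pass", "fail"), ("review", "fail"), ("fail", "fail")]
print(disagreement_rate(sample), should_roll_back(sample))
```

Note that the comparison is strict, so a sample sitting exactly at 20% does not trigger a rollback under this reading.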

12. Observability

Event Name	Trigger	Properties
`test_case_import_clicked`	User opens the import drawer	user_role, agent_id, workspace_id
`template_downloaded`	User downloads the template	user_role
`file_uploaded`	File submitted for validation	file_format, row_count, conversation_count, file_size_kb
`file_validation_failed`	File rejected	reason (invalid_format / row_limit_exceeded / size_exceeded)
`test_run_started`	User clicks Run Test	conversation_count, turn_count, skipped_rows
`test_run_completed`	Run finishes	total_pass, total_review, total_fail, conversation_count, duration_seconds
`test_run_viewed`	User opens a saved run	agent_id, workspace_id

Dashboard owner: BOT squad (Mixpanel + Tableau).

Alerts:

Run failure rate > 10% in 1h → Slack: #bot-ai-alerts (on-call).
Semantic-engine error rate during scoring > 5% in 15m → Slack: #bot-ai-alerts (on-call).

Post-Launch Monitoring Cadence: Weekly for the first 4 weeks post-GA, then monthly. Owner: Dimas Fauzi Hidayat (BOT squad PM). Trigger: run failure rate > 20% unresolved within 24h → PM disables ai_agent_testing for affected orgs.

13. Success Metrics

⭐ Primary KPI: Reduction in average AI Agent testing time per go-live

Definition: median specialist time from "start testing" to "all scenarios scored"
Baseline: 5+ hours (manual one-by-one bot preview)
Target: ≤ 30 minutes, within 60 days of launch

Adoption: % of specialist-led go-lives using Import Test Case

Definition: share of go-lives where the imported source was used
Baseline: N/A — new capability
Target: ≥ 80% within 60 days of launch

Quality: Multi-turn run success rate

Definition: % of conversations that complete all turns without an execution error
Baseline: N/A — new capability
Target: ≥ 95% within 60 days of GA

Efficiency: Time-to-go-live (implementation phase duration)

Definition: implementation-phase duration vs baseline
Baseline: current implementation duration
Target: decrease ≥ 30% vs baseline

14. Launch Plan & Stage Gates

Stage	Audience	Duration	Success Gate	Owner
Internal	Bot Implementation Specialist team	2 weeks	No major bugs; testing-time reduction vs manual baseline confirmed; multi-turn replay verified against bot preview	PM + QA
Closed Beta	Select Bot Admin clients	3–4 weeks	≥ 80% of early users find value; run failure rate ≤ 10%	PM + CSM
Open Beta	All orgs with AI Agent, on request	3 weeks	Testing time ≤ 30 min sustained 1 week; no P0/P1 open	Eng Lead
GA	All orgs with AI Agent enabled	Ongoing	Open Beta gates sustained 2 weeks; PMM launch approved	PM + PMM

15. Dependencies

Dependency	Owning Team	Deliverable Needed	Blocking?
Semantic-similarity engine	AI / ML squad	Scoring endpoint (batch input, Bahasa Indonesia, latency at scale) — confirm contract; not present in chatbot BE	YES
Bot-preview execution path	BOT	Reuse `SendMessageWithResolve` + `History` + `conversation_id` + `SendContext` for per-conversation replay	YES
AI Agent versioning	BOT	`ai_agent_histories` stable — test runs bind to a version	YES
Existing test-case schema	BOT	Extend `ai_agent_test_cases` / `ai_agent_test_case_questions` (grain to confirm)	YES
Data / Analytics	Data	Instrumentation events wired (Mixpanel)	NO

16. Key Decisions + Alternatives Rejected

Initiative-level decisions live in the ANCHOR PRD §5. Below are phase-specific decisions.

16a — Decisions Made

Date	Decision	Rationale
2026-06-26	Merge multi-turn into the MVP; single-turn = a one-turn conversation	The long-format file makes multi-turn near-zero added friction and mirrors the real conversational use case
2026-06-26	Long-format file (Topic, Conversation ID, Turn, Question, Expected Answer)	Variable conversation length, per-turn scoring, and backward-compatible (single-turn files still work)
2026-06-26	Extend existing `ai_agent_test_cases` / `ai_agent_test_case_questions`; do not create new tables	The schema already ships (Phase 1); adds conversation grouping + turn ordering
2026-06-26	Parallel across conversations, sequential within	Reuses the live Room + conversation-context path; correct semantics for shared-context replay
2026-06-26	Conversation verdict shown two ways (per-turn pass rate + goal-turn)	Specialists need both intermediate quality and end-state success
2026-06-26	Semantic engine = open dependency with confidence + human-review fallback	The engine is not in the BE repo; de-risks the build without blocking it
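
To make the long-format decision concrete, the sketch below groups rows by Conversation ID and orders them by Turn. The column names come from the decision above, but the exact header strings, the sample rows, and the `group_conversations` helper are assumptions for illustration; the real parser also handles .xls/.xlsx and row-level validation.

```python
import csv
import io
from collections import defaultdict

SAMPLE = """Topic,Conversation ID,Turn,Question,Expected Answer
Shipping,C1,2,When will it arrive?,Within 3 days
Shipping,C1,1,I want to check my order,
Billing,C2,1,How do I get an invoice?,Invoices are emailed monthly
"""


def group_conversations(csv_text: str) -> dict[str, list[dict]]:
    """Group long-format rows into conversations, each sorted by turn number."""
    conversations = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        conversations[row["Conversation ID"]].append({
            "turn": int(row["Turn"]),
            "question": row["Question"],
            # A blank Expected Answer becomes None (a context-only turn).
            "expected_answer": row["Expected Answer"] or None,
        })
    for turns in conversations.values():
        turns.sort(key=lambda t: t["turn"])
    return dict(conversations)


grouped = group_conversations(SAMPLE)
for cid, turns in grouped.items():
    print(cid, [t["turn"] for t in turns])
```

A single-turn file is just the case where every Conversation ID appears once, which is why the format stays backward-compatible.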

16b — Alternatives Rejected

Alternative	Why Rejected	Date
New `TestRun` / `TestRunResult` tables (original Confluence draft)	Superseded by extending the already-shipped `ai_agent_test_cases` schema	2026-06-26
Single-turn-only import	Does not reflect the real (multi-turn) use case; single-turn is just the one-turn special case	2026-06-26
Wide file format (Q1/A1/Q2/A2 columns)	Rigid (fixed max turns), awkward for variable length and per-turn scoring	2026-06-26
Branching / LLM user-simulator in MVP	Hard to make deterministic and scoreable; deferred	2026-06-26

17. Open Questions

#	Type	Question	Owner	Deadline
1	Risk	Semantic-similarity engine is not in the chatbot BE — confirm the AI Service contract (endpoint, batch size, Bahasa Indonesia, latency). Mitigation: fallback to AI confidence + human review.	Dimas (PM) / AI squad	2026-07-15
2	Open Question	Exact grain of `ai_agent_test_cases` today (per-conversation vs per-batch) — decides new batch table vs folded-in column.	Eng (BOT — Puji/Eko)	2026-07-15
3	Open Question	Re-baselined throughput SLA now that turns are sequential within a conversation (conversations × avg turns). Interim target: ~30 conversations / ≈100 turns in ≤5 min.	Eng (BOT)	2026-07-15
4	Open Question	Per-conversation turn cap (e.g. ≤20) — safe ceiling, hard vs soft enforcement.	Eng (BOT)	2026-07-15
5	Risk	Test-room retention/cleanup so `draft_state` test rooms/histories don't pollute real conversation data. Mitigation: purge test rooms after each run; confirm purge window.	Eng (BOT)	2026-07-31
6	Risk	Imported files may contain client PII in questions/answers. Mitigation: scope to agent/workspace; discard raw file after parsing; confirm scoring stays internal.	Dimas (PM)	2026-07-15
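
For Open Question 3, a back-of-the-envelope model shows what the interim target implies per turn. This is a rough sketch under stated assumptions: conversations are processed in equal "waves" across a fixed number of parallel workers, every conversation has the average number of turns, and scoring time is ignored. The `workers` parameter and the function name are hypothetical; actual concurrency is for engineering to confirm.

```python
import math


def per_turn_latency_budget(conversations: int, total_turns: int,
                            budget_s: float, workers: int) -> float:
    """Seconds available per turn if the whole run must finish within budget_s."""
    waves = math.ceil(conversations / workers)  # batches of parallel conversations
    avg_turns = total_turns / conversations     # turns run sequentially within each
    return budget_s / (waves * avg_turns)


# Interim target from the table: ~30 conversations / ~100 turns in <= 5 minutes.
for workers in (1, 10, 30):
    budget = per_turn_latency_budget(30, 100, 300, workers)
    print(f"{workers} workers -> about {budget:.0f} s per turn")
```

Under this model, fully sequential execution leaves about 3 seconds per turn, while 10 parallel conversations leave about 30 seconds, which is why the SLA depends on the concurrency engineering settles on.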

PRD CHANGELOG

Version	Date	By	Section	Type	Summary
1.2	2026-06-30	Claude	§7, §8, S01, S11	UPDATED	Inserted design status from `qontak-designer` `feat/ai-agents-testing` prototype. Updated §8 component tree to distinguish Phase 1 designed components (`testing/index.vue` node 16743-298263, `GenerateTestCaseModal` node 15576-205254, `UploadManuallyDrawer` node 16677-210564, `QuestionComparisonCard` node 16820-169848) from Phase 3 new components (`ImportTestCaseDrawer`, `TestRunProgress`, `TestRunResults` — frames TBD). Updated §7 CHG-001 Figma link to actual node. Added prototype notes to S01 (modal "Add or upload manually" card, Phase 1 stub discrepancies) and S11 (Testing Index columns, dev tools, score badge). Updated Last Updated header.
1.1	2026-06-26	Claude	S1, S4, S6, S8	UPDATED	Phase 3 (Imported Test Cases) PRD authored in the documents-repo PHASE template under the AI Agent: Testing ANCHOR. Reconciled the Confluence "Import Test Case" page (QON 51068960826) with current code (`chatbot`, `chatbot-fe`): multi-turn merged into MVP (long-format Conversation ID + Turn), schema extends the existing `ai_agent_test_cases` / `ai_agent_test_case_questions` models, execution reuses the live bot-preview Room + `conversation_id` + `SendContext` path (parallel across conversations, sequential within), conversation verdict shown two ways, and the semantic-similarity engine flagged as an open dependency with a confidence + human-review fallback. 13 stories with composite AC ids (IMPORT-S01…S13).
1.1	2026-06-26	Claude	S1, S4, S6, S8	UPDATED	Coaching pass after score-prd v3.3. Trimmed one-liner to ≤25 words; added plan scope + feature-flag default state + interim throughput SLA + explicit data-lifecycle/purge to Constraints; added a UI-state Mermaid diagram (S6/S8 New Features) and a system-flow Mermaid diagram (§10.1, with flow type declared); strengthened every story with a Data Fields table (type/required/source), a frame-level Figma link, an explicit Unauthorized clause, all four UI states, and a third AC where thin (S03/S09/S10/S13); added explicit rollback/reversibility to the execution stories (S05/S10).
