Qontak | AI Agent | Testing — Phase 3: Imported Test Cases (Single + Multi-Turn)

Imported Test Cases — Phase 3 PRD under the AI Agent: Testing ANCHOR. The third test-case source ("Imported question list"): a Specialist/Bot Admin uploads a curated .csv/.xls/.xlsx file, the system replays each scenario against the configured AI Agent, scores responses against the Expected Answer, and saves structured results. Single- and multi-turn are both in scope — a multi-turn scenario is a sequence of user turns sharing conversation context (single-turn = a one-turn conversation). Imported from Confluence and reconciled against code (chatbot, chatbot-fe).

HEADER BLOCK

| Field | Value |
| --- | --- |
| PM | Dimas Fauzi Hidayat |
| PRD Version | 1.1 |
| Status | DRAFT |
| PRD Type | PHASE |
| Epic | TBD (Phase 3 epic not yet created) |
| Squad | BOT |
| RFC Link | To be created (rfc-starter from this PRD) |
| Figma Master | Figma — Bot · AI Agent Testing |
| Anchor | AI Agent: Testing — ANCHOR (Confluence) |
| Labels | epic:qontak-chatbot \| module:ai-agent \| feature:ai-agent-testing |
| Last Updated | 2026-06-30 |

Scope Changes

Backend · Frontend · Data — file parse/validate + a new test-run orchestration endpoint that reuses the existing bot-preview execution path (chatbot), the import + multi-turn results UI on the Testing page (chatbot-fe), and the semantic-similarity scoring dependency (Data / AI Service).


2. Phase Context

  • Anchor PRD: AI Agent: Testing — ANCHOR
  • Phase Number: Phase 3 of 3 (phasing by test-case source: Historical → Knowledge → Imported)
  • Phase Goal: Validate the AI Agent against a PM/SPV-curated question set uploaded into a test case — covering both single-turn questions and multi-turn conversations — so specialists replace 5+ hours of manual one-by-one bot-preview testing per go-live. (Matches the ANCHOR Phase Index goal for Phase 3.)
  • Prior phases: Phase 1 — Historical Validation (prds/historical-validation.md) establishes the Testing page, the ai_agent_test_cases / ai_agent_test_case_questions schema, the side-by-side comparison UI, and the Sidekiq async pattern. Phase 2 — Generate from Knowledge (not yet authored).
  • This phase: The "Imported question list" test-case source — file import, multi-turn replay against the agent, semantic-similarity auto-scoring (with a confidence + human-review fallback), and results saved to the Testing Index.
  • Deferred to later: Branching multi-turn conversations, an LLM user-simulator, configurable thresholds, exporting results, and re-running only failed conversations (see §5 Non-Goals).
  • Cross-phase deps: This phase extends the Phase 1 schema (ai_agent_test_cases / ai_agent_test_case_questions) rather than adding new tables — it adds a conversation grouping + turn ordering. The schema additions here must stay compatible with Phase 1's comparison/rating UI.

3. One-liner + Problem

One-liner: Upload an Excel/CSV of single- or multi-turn test scenarios and auto-run them against the AI Agent — replacing 5+ hours of manual go-live testing.

Problem: Bot Implementation Specialists follow a 100-scenario SOP before every go-live, tested manually one-by-one in the bot preview — 5+ hours per go-live, repeated after every bot configuration change. Many real scenarios are multi-turn (the bot asks for an order ID, the user replies, then the bot confirms), so a test that only checks isolated single questions does not reflect the real use case. This phase lets specialists import the test files they already maintain and run them — including multi-turn conversations with shared context — so they hit the H+14 AHA-moment window. For full initiative context, see the ANCHOR PRD.


4. Target Users + Persona Context

| Persona | Role | Goal | Pain | Workaround |
| --- | --- | --- | --- | --- |
| Primary — Bot Implementation Specialist | Internal Qontak specialist configuring AI Agents for go-live | Run the 100-scenario SOP (incl. multi-turn flows) in minutes, not hours | 5+ hours of manual one-by-one bot-preview testing per go-live | Tests each scenario by hand in the tree-diagram bot preview (~5 hrs/go-live) |
| Secondary — Client Bot Admin | Power user on the client side who maintains the AI Agent post go-live | Regression-test repeatably after every bot update | No structured re-test process; changes can silently break scenarios | Ad-hoc manual re-checks after each change (hours, error-prone) |

5. Non-Goals

  1. Branching conversations — turns in a multi-turn conversation are linear and scripted (user turns are fixed regardless of the bot's actual reply); conditional branches are out of scope for this phase.
  2. LLM user-simulator — no persona agent that dynamically reacts to bot replies; user turns are author-defined.
  3. Configurable scoring threshold — thresholds are fixed (≥80% Pass / 60–79% Review / <60% Fail) in this phase.
  4. Export test results to file — not in this phase.
  5. Re-run only failed test cases — re-runs are at the conversation grain (re-import + re-run), not per-turn.
  6. Test result history / version comparison across runs — not in this phase.
  7. Mobile app — Testing is web-only in this phase.

6. Constraints

  • Platform: Web only (Qontak web app — Bot Automation → Testing).
  • Plan scope: All plans with the AI Agent module enabled.
  • File limits: .csv / .xls / .xlsx only; ≤500 rows (turns) per file; ≤5 MB; recommended per-conversation turn cap ≤20 (engineering to confirm — Open Question #4). Larger suites must be split.
  • File format: Long format — columns Topic, Conversation ID, Turn, Question, Expected Answer. Rows sharing a Conversation ID form one conversation, executed in Turn order. Expected Answer is optional per turn (blank = a context/setup turn, not scored). A single-turn case is a one-turn conversation.
  • Execution model: Conversations run in parallel; turns within a conversation run sequentially through one shared bot session so context (memory, slot state, agentic actions) carries across turns. Reuses the live bot-preview execution path (draft_state=true rooms) — not a new context engine.
  • Performance: Asynchronous batch run with live progress. Within-conversation latency is sequential, so throughput is expressed as conversations × avg turns. Interim target: ~30 conversations (≈100 turns) scored in ≤5 minutes at max concurrency (10–20 parallel conversations); the final SLA is re-baselined from Phase 1's per-row target (Open Question #3). Must respect the LLM provider's TPM/RPM so it never blocks live production traffic.
  • Parsing: File parsing is backend-side — the frontend has no xlsx/csv parser today.
  • Read/write: Specialist (internal) + Client Bot Admin roles can import and run. Standard agents have no access (menu hidden). Enforced server-side on every test-run endpoint.
  • Feature flag: ai_agent_testing | default: OFF — enabled per organization during beta; the Imported source is gated within this initiative flag.
  • Data lifecycle: The raw uploaded file is discarded after parsing (only structured ai_agent_test_case* rows persist; soft-deleted via acts_as_paranoid on delete). Test conversations run in draft_state=true bot-preview rooms, purged after the run completes so test rooms/histories never pollute real conversation data (exact purge window — Open Question #5).
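
To illustrate the long format described in the File format constraint, here is a minimal sketch that groups parsed rows by Conversation ID and orders each group by Turn. The dict keys and the `group_conversations` helper are hypothetical names chosen for this example, not taken from the chatbot codebase, and the sketch assumes the rows have already passed validation.

```python
from collections import defaultdict


def group_conversations(rows):
    """Group long-format rows by Conversation ID and order each group by Turn."""
    grouped = defaultdict(list)
    for row in rows:
        expected = (row.get("expected_answer") or "").strip()
        grouped[row["conversation_id"]].append({
            "turn": int(row["turn"]),
            "question": row["question"],
            "expected_answer": expected or None,
            # A blank Expected Answer marks a context/setup turn that is not scored.
            "scored": bool(expected),
        })
    return {
        conversation_id: sorted(turns, key=lambda t: t["turn"])
        for conversation_id, turns in grouped.items()
    }


rows = [
    {"conversation_id": "C1", "turn": 2, "question": "ORD-123", "expected_answer": "Your order is confirmed."},
    {"conversation_id": "C1", "turn": 1, "question": "Where is my order?", "expected_answer": ""},
    {"conversation_id": "C2", "turn": 1, "question": "When are you open?", "expected_answer": "9am to 5pm."},
]
conversations = group_conversations(rows)
for conversation_id, turns in conversations.items():
    print(conversation_id, [(t["turn"], t["scored"]) for t in turns])
```

In this example C1 is a two-turn conversation whose first turn is a context turn, and C2 is a single-turn case, which is simply a one-turn conversation.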

7. Feature Changes

CHG-001 — "Imported question list" source enabled in the Generate Test Case modal

  • Change Type: Modified component (GenerateTestCaseModal, scaffolded in Phase 1).
  • Page: /bot-automation/testing.
  • Before: The "Imported question list" source option is hidden/disabled (Phase 3 placeholder).
  • After: The option is enabled and opens the import drawer (template download + file upload + preview).

| Element | Before | After |
| --- | --- | --- |
| Generate source picker | "Imported question list" disabled | Enabled → opens ImportTestCaseDrawer |

Figma: GenerateTestCaseModal — node 15576-205254 (feat/ai-agents-testing Phase 1 prototype). Phase 3 change: enable the "Add or upload manually" card and route it to ImportTestCaseDrawer instead of UploadManuallyDrawer.


8. New Features

Feature: Import Test Case (single + multi-turn)

  • URL: /bot-automation/testing → Generate test case → Imported question list (opens ImportTestCaseDrawer).
  • Access: Specialist + Bot Admin (server-enforced). Standard agents: option hidden.

Design status (2026-06-30): testing/index.vue, GenerateTestCaseModal, and the Phase 1 UploadManuallyDrawer are prototyped in qontak-designer (feat/ai-agents-testing). The Phase 3-specific components — ImportTestCaseDrawer (5-column template + "Run Test" + import preview), TestRunProgress, and TestRunResults — have no prototype yet (Figma master frame 16514-155786 exists; per-story frames TBD). See prototype notes in S01 and S11.

Component tree:

Phase 1 — already prototyped in feat/ai-agents-testing (reused or adapted):

Phase 3 — new components, no prototype yet (frames TBD):

  • ImportTestCaseDrawer — replaces Phase 1 UploadManuallyDrawer (node 16677-210564 partial stub; Phase 1 uses 3-column format and 45 MB cap — both differ from Phase 3 spec)
    • Template download (.xlsx: Topic, Conversation ID, Turn, Question, Expected Answer)
    • FileUpload (reuses pixel-input-file-upload.vue; backend parses)
    • ImportPreview — conversations detected + total turns + skipped rows (row number + reason)
    • "Run Test" → triggers async execution
  • TestRunProgress — live "X of Y conversations completed" + Pass/Review/Fail counts (polling)
  • TestRunResults
    • Conversation rows (Topic, conversation verdict, turn count) — expandable
      • Per-turn rows: Question · Expected · Actual · Score % · Status (✅/⚠️/❌); context turns shown "not scored"
    • Summary banner: X Pass · X Review · X Fail
  • Saved into the existing Testing Index list (Phase 1 TestCasesTable).

UI States:

  • Empty: no imported runs yet → import CTA.
  • Loading: parse/preview skeleton; per-conversation progress during a run.
  • Error: invalid file blank slate + template link; partial-results banner if a run crashes.
  • Success: results grouped by conversation, expandable to per-turn.

📊 UI state diagram (import → results):

```mermaid
stateDiagram-v2
    [*] --> Empty
    Empty --> Uploading: select file
    Uploading --> Error: invalid / oversized
    Error --> Empty: retry / re-upload
    Uploading --> Preview: valid (conversations + turns parsed)
    Preview --> Running: Run Test
    Running --> Results: run complete
    Running --> Partial: job crash mid-run
    Results --> [*]
    Partial --> [*]: partial results banner
```

Figma: Bot · AI Agent Testing — Testing page (phase-specific frames TBD). Code refs: chatbot-fe store/ai-agent/, AiAgentValidationForm.vue, endpoint.ts (v1.ai_agents.test_cases.*).


9. API & Webhook Behavior

Namespaced under the existing v1.ai_agents.test_cases.* surface (already wired in FE endpoint.ts), with turn-aware semantics added. All endpoints gated server-side to Specialist/Bot Admin. Technical fields (JSON schemas, error codes) resolved during RFC.

| # | Behavior | Entity Affected | Triggered By | Expected Behavior | Failure Behavior |
| --- | --- | --- | --- | --- | --- |
| 1 | Upload + validate file | New batch (AiAgentTestRun or batch_id) + parsed conversations/turns | User uploads file in the import drawer | Parse headers (case-insensitive), enforce 5 MB / 500-row limits, validate Conversation ID + Turn (unique, positive int), skip invalid rows; return preview { total_rows, conversation_count, valid_turns, skipped_rows: [{ row_index, reason }] } | Wrong columns/type → reject whole file + template link. >5 MB / >500 rows → reject with limit message. Empty file → 422 |
| 2 | Trigger execution | Update batch → processing; enqueue per-conversation jobs | User clicks "Run Test" | Returns immediately { status: "processing" }; Sidekiq (:ai_agent, retry:false) runs conversations in parallel, turns sequential per conversation through one shared room/session | Create fails → 422, no run started. Concurrency exceeded → queue, no rows dropped |
| 3 | Poll run status | Read batch + counts | FE polls during a run | Returns status + live counts (X of Y conversations, Pass/Review/Fail) until completed | Unauthorized role → forbidden. Transient read error → client retries, run unaffected |
| 4 | Get run results | Read conversations + per-turn rows | User opens a completed run | Returns results grouped by conversation, each turn: topic, question, expected_answer, actual_response, similarity_score/confidence, status, turn_index, turn_type | Run not found → 404. Unauthorized role → forbidden |
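
The polling behavior in row 3 can be sketched as a small client loop. This is an illustration only: `fetch_status` is a hypothetical callable standing in for the real status request, the `completed` and `failed` terminal values are taken from the run states this PRD mentions, and the interval and attempt budget are assumed values rather than agreed ones.

```python
import time


def poll_run(fetch_status, interval_seconds=2.0, max_attempts=150, sleep=time.sleep):
    """Poll a run until it reaches a terminal status, retrying transient read errors."""
    for _ in range(max_attempts):
        try:
            payload = fetch_status()
        except ConnectionError:
            # Transient read error: the client retries and the run itself is unaffected.
            sleep(interval_seconds)
            continue
        if payload["status"] in ("completed", "failed"):
            return payload
        sleep(interval_seconds)
    raise TimeoutError("run did not finish within the polling budget")
```

Injecting `sleep` keeps the loop testable without real waiting. The frontend would implement the same idea in its own stack.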

Implementation note: the execute endpoint is a new orchestration endpoint that internally reuses the existing bot-preview receive path (bot_previews_controller.rb → ProcessIncomingMessageBotPreviewWorker → hub SendMessageWithResolve), which already runs a single turn with context (History rows + conversation_id + SendContext → Mekari RAG thread). Multi-turn = orchestrating a sequence of these per conversation.


10. System Flow + User Stories + ACs

10.1 System Flow

Flow name: Import → multi-turn replay → score → save · Type: User Journey + System Sequence (async batch).

  1. Specialist/Bot Admin opens Bot Automation → Testing → Generate test case → Imported question list.
  2. Downloads the template (optional), fills Topic, Conversation ID, Turn, Question, Expected Answer; uploads the file.
  3. Backend parses + validates; UI shows a preview (conversations, turns, skipped rows with reasons).
  4. Decision point: invalid format / >5 MB / >500 rows → reject whole file with template link (no run).
  5. User clicks Run Test → batch created (processing); per-conversation jobs enqueued (Sidekiq :ai_agent).
  6. For each conversation: allocate one bot-preview room (draft_state=true); replay turns in Turn order, writing History + pushing SendContext so context accumulates; capture each turn's response.
  7. Decision point: a turn with a blank Expected Answer is a context turn — sent to build state, not scored.
  8. Failure branch: a turn that errors marks that turn error, skips the remaining turns of that conversation, and continues other conversations.
  9. After responses are collected, score each scored turn by semantic similarity (dependency) — fallback to AI confidence + human review if the engine is unavailable.
  10. Failure branch: job crash mid-run → persist completed turns/conversations, mark run failed, surface partial results.
  11. Roll up each conversation verdict, shown two ways: per-turn pass rate + goal-turn (final scored turn) result.
  12. On completion the run is saved to the Testing Index with timestamp, file name, and counts; results display grouped by conversation, expandable to per-turn.

📊 System flow diagram:

```mermaid
flowchart TD
    A[Open Testing → Imported question list] --> B[Download template / fill file]
    B --> C[Upload file]
    C --> D{Valid format + within limits?}
    D -- No --> E[Reject whole file + template link]
    D -- Yes --> F[Parse + row-level validation]
    F --> G[Preview: conversations, turns, skipped rows]
    G --> H[Run Test]
    H --> I[Batch: conversations in parallel - Sidekiq :ai_agent]
    I --> J[Per conversation: replay turns in order, shared session + SendContext]
    J --> K{Turn error?}
    K -- Yes --> L[Mark turn error, skip rest of this conversation]
    K -- No --> M{Expected Answer blank?}
    M -- Yes --> N[Context turn: feeds state, not scored]
    M -- No --> O[Score turn: semantic similarity OR confidence + human review]
    L --> P[Roll up conversation verdict: per-turn rate + goal-turn]
    N --> P
    O --> P
    P --> Q{Job crash mid-run?}
    Q -- Yes --> R[Persist completed, mark run failed, show partial]
    Q -- No --> S[Save run to Testing Index + grouped results]
```

10.2 User Stories

MoSCoW preserved from the Confluence source. Implementation-status notes flag where ACs describe target behavior built on existing code vs not-yet-wired pieces. Figma frames are phase-specific TBD — links point to the AI Agent Testing master frame.


IMPORT-S01 — Import entry & template download | Must Have

Story: As a Specialist, I want to download a test case template, so that I know the correct format.

Before: No import entry; "Imported question list" is a disabled placeholder. After: Enabling the source opens the import drawer with a downloadable template.

Data Fields:

| Field | Type | Required | Source |
| --- | --- | --- | --- |
| agent_id | uuid | Yes | route |
| template_columns | fixed list | Yes (system) | Topic, Conversation ID, Turn, Question, Expected Answer |

Happy Path:

  • AC-1: Given I am on the Testing page, when I open Generate test case → Imported question list, then the import drawer opens.
  • AC-2: Given the import drawer, when I click "Download template", then I get an .xlsx with columns Topic, Conversation ID, Turn, Question, Expected Answer.
  • AC-3: Given I open the template, when I read the header row, then each column is pre-labeled and the Expected Answer column notes "optional = context turn".

Error Path:

  • ERR-1: Given the template download fails, when I click it, then an inline error with retry is shown and no drawer state is lost.

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: the "Imported question list" option is not rendered; direct route returns forbidden.

UI States: Loading (drawer opening), Empty (no file selected yet → template CTA), Error (download failed inline), Success (template downloaded).

Figma: GenerateTestCaseModal — node 15576-205254 (Phase 1 prototype · feat/ai-agents-testing). ImportTestCaseDrawer frame TBD. Dependencies: Phase 1 Testing page (GenerateTestCaseModal).

Prototype note (S01): The modal's 3rd card (amber doc icon) is labeled "Add or upload manually" with CTA "Add" — this maps to the "Imported question list" type. In Phase 1 it opens UploadManuallyDrawer (node 16677-210564); Phase 3 replaces that with ImportTestCaseDrawer. The Phase 1 drawer is a partial stub: dropzone + upload-progress animation + content-preview panel, but uses a 3-column format (Topics/Intent · Question · Expected Answer) and a 45 MB cap — both differ from the Phase 3 spec (5-column / 5 MB / "Run Test"). Phase 3 needs a redesigned drawer. The modal also has a dev-tools state "Not enough conversation" that disables the inbox card with tooltip "You must have at least 50 past inbox conversations" — threshold not currently in the PRD.


IMPORT-S02 — Upload & validate long-format file | Must Have

Story: As a Specialist, I want to upload a .csv/.xls/.xlsx file and see a preview, so that I can confirm before running.

Before: No file import path into a test case. After: A validated upload shows a preview of conversations, turns, and skipped rows.

Data Fields:

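| Field | Type | Required | Source |
| --- | --- | --- | --- |
| file | binary | Yes | user upload |
| topic | string | Yes | file row |
| conversation_id | string | Yes | file row |
| turn | int | Yes | file row |
| question | string | Yes | file row |
| expected_answer | string | No (blank = context turn) | file row |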

Happy Path:

  • AC-1: Given I upload a valid file, when it passes validation, then I see a preview: conversation_count, total turns, and skipped rows with reasons.
  • AC-2: Given columns differ in case (e.g. expected answer), when validated, then matching is case-insensitive and the file is accepted.
  • AC-3: Given a valid multi-turn file, when the preview renders, then conversations are grouped (one preview row per Conversation ID with its turn count).

Error Path:

  • ERR-1: Given the upload fails mid-transfer, when it errors, then no DB state is written and I can retry.

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: upload endpoint returns forbidden.

UI States: Loading (parsing skeleton), Empty (no valid rows → guidance), Error (upload failed + retry), Success (preview shown).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S01.

Implementation status: parsing is backend-side (FE has no xlsx/csv parser). Extends the existing ai_agent_test_cases / ai_agent_test_case_questions schema.
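
The case-insensitive header matching in AC-2 might look like the following sketch. The `map_headers` helper and the whitespace normalization are assumptions made for illustration. The PRD itself only requires case-insensitive matching.

```python
REQUIRED_COLUMNS = ["topic", "conversation id", "turn", "question", "expected answer"]


def normalize_header(cell):
    """Lower-case a header cell and collapse surrounding/internal whitespace (assumed)."""
    return " ".join(str(cell).strip().lower().split())


def map_headers(header_row):
    """Return {canonical_column: column_index}; raise ValueError if a column is missing."""
    found = {}
    for index, cell in enumerate(header_row):
        name = normalize_header(cell)
        if name in REQUIRED_COLUMNS and name not in found:
            found[name] = index
    missing = [column for column in REQUIRED_COLUMNS if column not in found]
    if missing:
        raise ValueError("invalid_format: missing columns " + ", ".join(missing))
    return found


print(map_headers(["TOPIC", "Conversation ID", "turn", "Question", "expected answer"]))
```

A header row that lacks any of the five columns raises, which corresponds to the whole-file rejection path in IMPORT-S03.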


IMPORT-S03 — Invalid format rejection | Must Have

Story: As a Specialist, I want a clear error when my file format is wrong, so that I can fix it fast.

Before: No import validation. After: Files with wrong columns or unsupported types are rejected with a template link.

Data Fields:

| Field | Type | Required | Source |
| --- | --- | --- | --- |
| file_type | enum (.csv/.xls/.xlsx) | Yes | upload |
| reason | enum: invalid_format / row_limit_exceeded / size_exceeded | Yes (system) | validator |

Happy Path:

  • AC-1: Given I upload a file with missing/wrong columns, when validated, then the whole file is rejected with an error message + a link to download the correct template.
  • AC-2: Given I upload an .xlsx with the correct columns but a renamed sheet, when validated, then the first sheet's header row is used and the file is accepted.

Error Path:

  • ERR-1: Given I upload an unsupported file type (e.g. .pdf), when validated, then it is rejected before any processing with a clear type error.
  • ERR-2: Given the file is a valid type but corrupt/unreadable, when parsing runs, then it is rejected with a "could not read file" error + retry.
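
As a rough sketch of the whole-file checks, the function below returns one of the `reason` values from the Data Fields table or `None` when the file may proceed to parsing. Several details are assumptions: that 5 MB means 5 × 1024 × 1024 bytes, that an unsupported extension maps to `invalid_format`, and that type is checked before size and row count. `file_rejection_reason` is a hypothetical name.

```python
import os

ALLOWED_EXTENSIONS = {".csv", ".xls", ".xlsx"}
MAX_BYTES = 5 * 1024 * 1024  # assumed interpretation of the 5 MB cap
MAX_ROWS = 500               # rows (turns) per file


def file_rejection_reason(file_name, size_bytes, row_count):
    """Return a rejection reason for the whole file, or None if it passes."""
    extension = os.path.splitext(file_name)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        return "invalid_format"  # assumed mapping for unsupported types such as .pdf
    if size_bytes > MAX_BYTES:
        return "size_exceeded"
    if row_count > MAX_ROWS:
        return "row_limit_exceeded"
    return None


print(file_rejection_reason("suite.pdf", 1024, 10))
print(file_rejection_reason("suite.xlsx", 1024, 120))
```

Checking the extension first means an unsupported type is rejected before any parsing work, which matches ERR-1.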

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: endpoint returns forbidden.

UI States: Loading (validating), Empty (N/A — rejection path), Error (rejection + template link), Success (N/A — acceptance handled in S02).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S02.


IMPORT-S04 — Row-level validation & skip | Must Have

Story: As a Specialist, I want rows with missing/invalid data skipped (not blocking the run), so that valid scenarios still proceed.

Before: No row-level validation. After: Invalid rows are flagged by row number + reason; valid rows proceed.

Data Fields:

| Field | Type | Required | Source |
| --- | --- | --- | --- |
| row_index | int | Yes (system) | parser |
| reason | string | Yes (system) | validator |
| conversation_id | string | Yes | file row |
| turn | int | Yes | file row |

Happy Path:

  • AC-1: Given rows with an empty Topic/Conversation ID/Turn/Question, when I upload, then those rows are skipped and flagged by row number and reason; valid rows proceed.
  • AC-2: Given two rows share the same (Conversation ID, Turn), when validated, then the duplicate is skipped + flagged (turn ordering must be unique).
  • AC-3: Given a conversation whose every turn has a blank Expected Answer, when validated, then it is flagged (nothing to score).

Error Path:

  • ERR-1: Given Turn is not a positive integer (e.g. 0, 1.5, text), when validated, then that row is skipped + flagged.
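
The row-level rules above (AC-1, AC-2, ERR-1) can be sketched as a single pass over the parsed rows. This is a hedged illustration: `validate_rows`, the reason strings, and the assumption that data starts on spreadsheet row 2 are all hypothetical, and the all-context-turns check from AC-3 is left out for brevity.

```python
REQUIRED_FIELDS = ("topic", "conversation_id", "turn", "question")


def validate_rows(rows, first_data_row=2):
    """Split rows into (valid, skipped); skipped entries carry row_index and reason."""
    valid, skipped, seen = [], [], set()
    for row_index, row in enumerate(rows, start=first_data_row):
        reason = None
        for field in REQUIRED_FIELDS:
            value = row.get(field)
            if value is None or not str(value).strip():
                reason = f"missing {field}"
                break
        if reason is None:
            turn_text = str(row["turn"]).strip()
            # Only plain ASCII digits count, so 0, 1.5, -1 and text are all rejected.
            if not (turn_text.isascii() and turn_text.isdigit()) or int(turn_text) < 1:
                reason = "turn must be a positive integer"
        if reason is None:
            key = (str(row["conversation_id"]).strip(), int(turn_text))
            if key in seen:
                reason = "duplicate (conversation_id, turn)"
            else:
                seen.add(key)
        if reason is None:
            valid.append(row)
        else:
            skipped.append({"row_index": row_index, "reason": reason})
    return valid, skipped
```

The `skipped` list has the same shape as `skipped_rows` in the upload preview, so the drawer can show each row number next to its reason.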

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: endpoint returns forbidden.

UI States: Loading (validating), Empty (all rows skipped → "no valid rows" guidance), Error (per-row reason shown in preview), Success (preview lists valid + skipped rows).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S02.


IMPORT-S05 — Multi-turn execution (parallel across, sequential within) | Must Have

Story: As a Specialist, I want to click "Run Test" and have all scenarios — including multi-turn — run automatically, so that I replace manual testing.

Before: Each scenario is tested manually one-by-one in bot preview. After: Conversations run in parallel; turns within a conversation run sequentially through one shared session with carried context.

Data Fields:

| Field | Type | Required | Source |
| --- | --- | --- | --- |
| room_id | uuid (per conversation) | Yes (system) | bot-preview room alloc |
| conversation_id | string | Yes | parsed file |
| turn_index | int | Yes | parsed file |
| actual_response | text | Yes (system) | AI agent |

Happy Path:

  • AC-1: Given I reviewed the preview, when I click "Run Test", then each conversation is replayed turn-by-turn in Turn order through one shared bot session, and conversations run in parallel.
  • AC-2: Given a multi-turn conversation, when turn N runs, then it shares the conversation context (memory/slot state) accumulated from turns 1..N-1.
  • AC-3: Given I navigate away mid-run, when I return to the Testing Index, then the run has continued in the background and its result is saved.

Error Path:

  • ERR-1: Given the run is triggered, when the request returns, then it returns immediately { status: "processing" } and the run continues in the background.

Rollback / reversibility: A run is an immutable record. A partial or unwanted run can be deleted from the Testing Index (soft-delete, acts_as_paranoid) and re-created by re-importing — there is no in-place mutation of a completed run. Test rooms are draft_state=true and purged after the run, so a discarded run leaves no live-data residue.

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: execute endpoint returns forbidden.

UI States: Loading (TestRunProgress), Empty (N/A — run already has conversations), Error (partial-results banner on crash), Success (run completes, results shown).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S04; bot-preview execution path (§15).

Implementation status: reuses the live bot-preview execution path (SendMessageWithResolve) + History + conversation_id + SendContext (Mekari RAG thread). Concurrency unit = conversation (Sidekiq :ai_agent, retry:false).
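
The execution model (parallel across conversations, sequential within one) can be illustrated with a small sketch. It is not the Sidekiq implementation: a thread pool stands in for per-conversation jobs, a plain dict stands in for the shared bot-preview room, and `send_turn` and `fake_agent` are hypothetical stand-ins for the real agent call.

```python
from concurrent.futures import ThreadPoolExecutor


def run_conversation(turns, send_turn):
    """Replay turns in Turn order through one shared session; stop at the first error."""
    session = {"history": []}  # stands in for the shared bot-preview room/session
    results = []
    for turn in sorted(turns, key=lambda t: t["turn"]):
        try:
            reply = send_turn(session, turn["question"])
        except Exception as exc:
            results.append({"turn": turn["turn"], "status": "error", "detail": str(exc)})
            break  # remaining turns of this conversation are skipped
        session["history"].append((turn["question"], reply))
        results.append({"turn": turn["turn"], "status": "answered", "actual_response": reply})
    return results


def run_batch(conversations, send_turn, max_parallel=10):
    """Run conversations in parallel; one failing conversation does not affect the others."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        futures = {
            conversation_id: pool.submit(run_conversation, turns, send_turn)
            for conversation_id, turns in conversations.items()
        }
        return {conversation_id: future.result() for conversation_id, future in futures.items()}


def fake_agent(session, question):
    """Toy agent: replies depend on accumulated history; 'boom' simulates a turn error."""
    if question == "boom":
        raise RuntimeError("agent timeout")
    return f"reply #{len(session['history']) + 1} to {question}"
```

Because each conversation owns its session, turn N sees the history from turns 1..N-1, and an error in one conversation only ends that conversation's loop.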


IMPORT-S06 — Semantic-similarity scoring (with fallback) | Must Have

Story: As a Specialist, I want each scored turn graded Pass/Review/Fail against its Expected Answer, so that I get an objective signal.

Before: Quality is judged manually. After: Each scored turn is auto-graded by semantic similarity; if the engine is unavailable, the system falls back to AI confidence + human review.

Data Fields:

| Field | Type | Required | Source |
| --- | --- | --- | --- |
| similarity_score | float (0–1) | Yes (system) | semantic engine |
| confidence | int | Conditional (fallback) | AI agent response |
| status | enum: pass/review/fail/error | Yes (system) | scorer |

Happy Path:

  • AC-1: Given a turn with an Expected Answer and an agent response, when scoring runs, then similarity ≥80% → Pass, 60–79% → Review, <60% → Fail.
  • AC-2: Given the agent response is empty/errored, when scoring runs, then score = 0 and status = Fail.
  • AC-3: Given a batch of scored turns, when scoring runs, then it is sent as a single batch request (chunked if the engine's max batch size is exceeded).

Error Path:

  • ERR-1: Given the semantic engine is unavailable, when scoring runs, then the turn falls back to AI confidence + human review (rated via the Phase 1 rating flow) rather than blocking the run, and the turn is badged "scored by fallback".

Permission Model: CAN: Specialist, Bot Admin (scoring runs system-side under their triggered run). CANNOT: standard agents. Unauthorized: not executed if flag OFF.

UI States: Loading (scoring), Empty (N/A), Error (fallback-to-review badge), Success (Pass/Review/Fail badge with score).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S05; semantic-similarity engine (§15, dependency).

Implementation status: ⚠️ the semantic-similarity engine is not present in the chatbot BE repo (only the KB vector store, for retrieval). It must be confirmed with the AI Service / ML team (Open Question #1). Fallback reuses the existing RateTestCaseQuestion (confidence + human 0/1).
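
A minimal sketch of the threshold mapping, assuming the engine returns a similarity in the 0–1 range and returns nothing (`None`) when it is unavailable. Two points are assumptions rather than PRD decisions: scores between 79% and 80% are treated as Review, and a fallback turn is given the `review` status until a human rates it. `grade_turn` and the `scored_by` field are hypothetical names.

```python
PASS_THRESHOLD = 0.80
REVIEW_THRESHOLD = 0.60


def grade_turn(similarity_score, agent_response):
    """Map a similarity score to pass / review / fail, with an engine-unavailable fallback."""
    if not (agent_response or "").strip():
        # Empty or errored agent response: score 0 and Fail (AC-2).
        return {"score": 0.0, "status": "fail", "scored_by": "rule"}
    if similarity_score is None:
        # Engine unavailable: hand the turn to confidence + human review (ERR-1).
        return {"score": None, "status": "review", "scored_by": "fallback"}
    if similarity_score >= PASS_THRESHOLD:
        status = "pass"
    elif similarity_score >= REVIEW_THRESHOLD:
        status = "review"
    else:
        status = "fail"
    return {"score": similarity_score, "status": status, "scored_by": "semantic"}


for score in (0.92, 0.80, 0.79, 0.60, 0.41):
    print(score, grade_turn(score, "some answer")["status"])
```

Keeping the thresholds as named constants fits the Non-Goal that they are fixed in this phase while leaving one obvious place to change them later.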


IMPORT-S07 — Conversation verdict (two views) | Must Have

Story: As a Specialist, I want each conversation's verdict shown two ways, so that I see both intermediate quality and end-state success.

Before: No conversation-level roll-up. After: A conversation shows (a) per-turn pass rate across scored turns and (b) the goal-turn (final scored turn) result, side by side.

Data Fields:

| Field | Type | Required | Source |
| --- | --- | --- | --- |
| per_turn_pass_rate | float | Yes (system) | roll-up |
| goal_turn_status | enum: pass/review/fail/error | Yes (system) | final scored turn |
| turn_count | int | Yes (system) | conversation |

Happy Path:

  • AC-1: Given a completed multi-turn conversation, when the verdict renders, then it shows the per-turn pass rate AND the goal-turn result side by side.
  • AC-2: Given a single-turn conversation, when the verdict renders, then both views collapse to that one turn's status.
  • AC-3: Given context turns in a conversation, when the pass rate computes, then context turns are excluded from the denominator.

Error Path:

  • ERR-1: Given a conversation aborted on a turn error, when the verdict renders, then it is marked incomplete (error) rather than computing a misleading pass rate.
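
The two-view verdict can be sketched as a small roll-up function. The input shape (`status` plus a `scored` flag per turn) and the choice to count only `pass` in the numerator, so Review does not count as passing, are assumptions made for this example.

```python
def conversation_verdict(turns):
    """Roll up ordered turns into per-turn pass rate and goal-turn status."""
    turn_count = len(turns)
    if any(turn["status"] == "error" for turn in turns):
        # Aborted conversation: report it as incomplete instead of a misleading rate.
        return {"per_turn_pass_rate": None, "goal_turn_status": "error", "turn_count": turn_count}
    scored = [turn for turn in turns if turn["scored"]]  # context turns are excluded
    if not scored:
        return {"per_turn_pass_rate": None, "goal_turn_status": None, "turn_count": turn_count}
    passed = sum(1 for turn in scored if turn["status"] == "pass")
    return {
        "per_turn_pass_rate": passed / len(scored),
        "goal_turn_status": scored[-1]["status"],  # final scored turn
        "turn_count": turn_count,
    }


example = [
    {"status": "not_scored", "scored": False},
    {"status": "pass", "scored": True},
    {"status": "review", "scored": True},
]
print(conversation_verdict(example))
```

For a single scored turn both views collapse to that turn's result, as AC-2 requires.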

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: results endpoint returns forbidden.

UI States: Loading (computing), Empty (N/A), Error (incomplete badge), Success (both verdicts shown).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S06.


IMPORT-S08 — Results grouped by conversation (expandable) | Must Have

Story: As a Specialist, I want results grouped by conversation and expandable to per-turn, so that I can see exactly which turn broke.

Before: No grouped results view. After: Conversation rows expand to per-turn rows (Question, Expected, Actual, Score, Status).

Data Fields:

| Field | Type | Required | Source |
| --- | --- | --- | --- |
| topic | string | Yes | file row |
| question | text | Yes | file row |
| expected_answer | text | No (context turn) | file row |
| actual_response | text | Yes (system) | AI agent |
| similarity_score | float | Conditional | scorer |
| status | enum: pass/review/fail/error | Yes (system) | scorer |

Happy Path:

  • AC-1: Given results, when I expand a conversation row, then I see each turn's Question / Expected / Actual / Score / Status, plus the conversation verdict.
  • AC-2: Given a summary banner, when results render, then it shows total Pass / Review / Fail across the run.
  • AC-3: Given a context turn, when I expand it, then it is shown with a "not scored" badge rather than a score.

Error Path:

  • ERR-1: Given a turn failed generation, when I expand it, then the Actual side shows a "could not generate" state, not a blank panel.

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: results endpoint returns forbidden.

UI States: Loading (skeleton rows), Empty (no results in this run), Error (per-turn failed state), Success (grouped table rendered).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S07. Reuses AiAgentValidationForm.vue table + score badges.


IMPORT-S09 — Context turns are not scored | Must Have

Story: As a Specialist, I want context-only turns (blank Expected Answer) to set up state without being scored, so that setup steps don't distort results.

Before: No notion of an unscored setup turn. After: A blank-Expected-Answer turn feeds context and is shown "not scored".

Data Fields:

| Field | Type | Required | Source |
| --- | --- | --- | --- |
| turn_type | enum: user / context | Yes (system) | derived (blank Expected = context) |
| expected_answer | text | No | file row |

Happy Path:

  • AC-1: Given a conversation turn with a blank Expected Answer, when it runs, then it feeds context and is shown as "not scored" (excluded from the pass rate).
  • AC-2: Given a context turn, when the agent responds, then the response is still captured and displayed (for human inspection) but not scored.
  • AC-3: Given a conversation of all context turns except one scored turn, when it runs, then only the scored turn contributes to the verdict.

Error Path:

  • ERR-1: Given a context turn's agent call errors, when it runs, then the conversation is aborted per IMPORT-S10 (downstream turns depend on it).

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: not executed if flag OFF.

UI States: Loading (running), Empty (N/A), Error (aborted-conversation badge), Success ("not scored" badge on the turn).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S05.


IMPORT-S10 — Turn-error isolation | Must Have

Story: As a Specialist, I want a failed turn to stop only its conversation, not the whole run, so that one bad scenario doesn't block the rest.

Before: No error isolation model. After: A turn error aborts only its conversation; other conversations continue.

Data Fields:

| Field | Type | Required | Source |
| --- | --- | --- | --- |
| status | enum: error | Yes (system) | runner |
| aborted_at_turn | int | Conditional | runner |

Happy Path:

  • AC-1: Given a turn errors (timeout/agent error) mid-conversation, when the run continues, then the remaining turns of that conversation are skipped and other conversations are unaffected.
  • AC-2: Given a conversation aborted on a turn error, when results render, then it shows which turn failed and that downstream turns were skipped.
  • AC-3: Given several conversations and one aborts, when the run completes, then the run summary counts the aborted conversation as error without failing the whole run.

Error Path:

  • ERR-1: Given the run job crashes mid-batch, when it stops, then completed turns/conversations are persisted, the run is marked failed, and partial results are shown with a banner.

Rollback / reversibility: A failed or partial run can be deleted (soft-delete) and re-run by re-importing; no live-conversation state is touched (test rooms are draft_state=true and purged).

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: not executed if flag OFF.

UI States: Loading (running), Empty (N/A), Error (per-conversation error badge / partial-run banner), Success (unaffected conversations complete).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S05.
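
The isolation rule in this story can be sketched as follows, assuming a hypothetical send_turn callable that stands in for the real agent call and raises on a timeout or agent error. The "completed" status value, the skipped_turns key, and the 1-based turn numbering are illustrative assumptions; only status = error and aborted_at_turn come from the Data Fields above.

```python
from typing import Callable


def run_conversation(turns: list[str], send_turn: Callable[[str], str]) -> dict:
    """Run turns in order and stop at the first turn that raises."""
    responses: list[str] = []
    for index, question in enumerate(turns, start=1):  # assumption: 1-based turns
        try:
            responses.append(send_turn(question))
        except Exception as exc:  # timeout or agent error
            return {
                "status": "error",
                "aborted_at_turn": index,
                "skipped_turns": len(turns) - index,  # downstream turns not executed
                "responses": responses,
                "error": str(exc),
            }
    return {
        "status": "completed",  # assumption: value not specified in this story
        "aborted_at_turn": None,
        "skipped_turns": 0,
        "responses": responses,
        "error": None,
    }


def run_all(conversations: dict[str, list[str]],
            send_turn: Callable[[str], str]) -> dict[str, dict]:
    """An aborted conversation is recorded and the remaining ones still run."""
    return {cid: run_conversation(turns, send_turn)
            for cid, turns in conversations.items()}
```

The outer loop here is sequential for brevity; the same per-conversation function could be submitted to a worker pool without changing the isolation behaviour.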


IMPORT-S11 — Save to Testing Index & view past runs | Must Have

Story: As a Specialist, I want the run saved to the Testing Index, so that I can revisit and regression-test later.

Before: No persistence of imported runs. After: Each completed run is saved with timestamp, file name, and counts, and is listed in the Testing Index.

Data Fields:

Field | Type | Required | Source
file_name | string | Yes | upload
created_at | timestamp | Yes (system) | run
pass_count / review_count / fail_count / error_count | int | Yes (system) | roll-up

Happy Path:

  • AC-1: Given a run completes, when I go to the Testing Index, then I see the saved run with timestamp, file name, and total Pass/Review/Fail counts.
  • AC-2: Given a saved run, when I open it, then I see its conversation-grouped results.
  • AC-3: Given multiple past runs for an agent, when the Testing Index renders, then runs are listed newest-first.

Error Path:

  • ERR-1: Given the Testing Index fails to load, when the page renders, then a "Couldn't load" blank slate with Retry is shown.

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: Testing menu hidden; direct route forbidden.

UI States: Loading (list skeleton), Empty (no runs yet → import CTA), Error (retry), Success (run listed + openable).

Figma: testing/index.vue — node 16743-298263 (Phase 1 prototype · feat/ai-agents-testing). Dependencies: IMPORT-S08. Reuses Phase 1 TestCasesTable.

Prototype note (S11): Testing Index fully designed. Table columns: Test case name · AI agent · Testing type · Score · Status · Last updated · Actions (sticky right col). The type filter dropdown includes "Imported question list". Columns are sortable (name, score, last updated). Pagination: 10 rows/page. Delete confirmation modal: "Delete test case?" with test case name in bold; soft-delete on confirm. Dev tools: Filled / Empty state / Loading / Search not found / Filter not found. Score column shows X% (integer); Status badge is Passed (green) or Need review (amber) — Phase 3 may extend to a three-state Pass / Review / Fail badge aligned with semantic-similarity thresholds (≥80% / 60–79% / <60%).
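
If the three-state badge is adopted, the thresholds in the prototype note could map to verdicts and roll-up counts roughly as below. This is a sketch under that assumption: the verdict and roll_up function names are hypothetical, scores are treated as integer percentages as in the Score column, and error_count is left out because it comes from execution errors rather than from scores.

```python
from collections import Counter


def verdict(score_percent: int) -> str:
    """Map an integer similarity score to Pass (>=80), Review (60-79), or Fail (<60)."""
    if score_percent >= 80:
        return "Pass"
    if score_percent >= 60:
        return "Review"
    return "Fail"


def roll_up(scores: list[int]) -> dict[str, int]:
    """Aggregate per-turn verdicts into the run-level counts."""
    counts = Counter(verdict(s) for s in scores)
    return {
        "pass_count": counts["Pass"],
        "review_count": counts["Review"],
        "fail_count": counts["Fail"],
    }
```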


IMPORT-S12 — Async progress / live polling | Must Have

Story: As a Specialist, I want a live progress indicator while a batch runs, so that the UI never freezes.

Before: No progress feedback for a batch. After: The run shows "X of Y conversations completed" with live Pass/Review/Fail and remains responsive.

Data Fields:

Field | Type | Required | Source
status | enum: processing/completed/failed | Yes (system) | run
conversations_done | int | Yes (system) | run
conversations_total | int | Yes (system) | run

Happy Path:

  • AC-1: Given a run is in progress, when I poll the run, then status reflects progress (X of Y conversations) until completed.
  • AC-2: Given the run is running, when I stay on the page, then the UI remains responsive and updates incrementally.
  • AC-3: Given the run completes while I am away, when I reopen it, then it shows the final state (not a stuck progress bar).

Error Path:

  • ERR-1: Given a poll request fails transiently, when it retries, then progress resumes without losing the run.

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: status endpoint returns forbidden.

UI States: Loading (progress indicator), Empty (N/A), Error (transient retry indicator), Success (completed state).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S05. Precedent: ai-assist.ts UUID status poll.
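
A hedged sketch of the polling behaviour in AC-1 and ERR-1, written in Python for illustration only (the real client lives in chatbot-fe). The fetch_status callable, the polling interval, and the retry limit are assumptions; the status values and the conversations_done / conversations_total fields come from the Data Fields above.

```python
import time
from typing import Callable


def poll_run(fetch_status: Callable[[], dict],
             interval_s: float = 2.0,
             max_transient_failures: int = 3,
             sleep: Callable[[float], None] = time.sleep) -> dict:
    """Poll until the run is completed or failed, tolerating transient errors."""
    failures = 0
    while True:
        try:
            status = fetch_status()
            failures = 0  # a successful poll resets the failure streak
        except ConnectionError:
            failures += 1
            if failures > max_transient_failures:
                raise  # no longer transient; surface the error to the caller
            sleep(interval_s)
            continue
        if status["status"] in ("completed", "failed"):
            return status
        sleep(interval_s)
```

Because the run state lives on the server, a failed poll loses nothing: the next successful poll returns the current conversations_done count, which also covers reopening a run that finished while the user was away.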


IMPORT-S13 — File row/size caps | Must Have

Story: As a Specialist, I want the system to cap files at 500 rows and 5 MB, so that runs stay performant.

Before: No file caps. After: Oversized files are rejected with a clear message.

Data Fields:

Field | Type | Required | Source
row_count | int (≤500) | Yes (system) | parser
file_size | bytes (≤5 MB) | Yes (system) | upload

Happy Path:

  • AC-1: Given I upload a file with >500 rows, when validated, then it is rejected with: "File exceeds 500 row limit. Please split into multiple files."
  • AC-2: Given I upload a file >5 MB, when validated, then it is rejected with a size-limit error.
  • AC-3: Given a file exactly at a limit boundary (500 rows / 5 MB), when validated, then it is accepted (limits are inclusive of the stated max).

Error Path:

  • ERR-1: Given a conversation in the file exceeds the recommended per-conversation turn cap, when validated, then the affected conversation is flagged (engineering to confirm hard vs soft — OQ#4).

Permission Model: CAN: Specialist, Bot Admin. CANNOT: standard agents. Unauthorized: endpoint returns forbidden.

UI States: Loading (validating), Empty (N/A), Error (limit message + template link), Success (within-limit file accepted → preview).

Figma: AI Agent Testing — master frame (phase frame TBD). Dependencies: IMPORT-S02.
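
The inclusive limits in AC-1 to AC-3 can be sketched as a single check. Two details are assumptions here: 5 MB is interpreted as 5 × 1024 × 1024 bytes, and the size-limit message wording is hypothetical because the story only quotes the row-limit message. The validate_caps name is illustrative.

```python
MAX_ROWS = 500
MAX_BYTES = 5 * 1024 * 1024  # assumption: "5 MB" read as 5 MiB


def validate_caps(row_count: int, file_size: int) -> list[str]:
    """Return rejection messages; an empty list means the file is within limits."""
    errors: list[str] = []
    if row_count > MAX_ROWS:  # exactly 500 rows is accepted
        errors.append("File exceeds 500 row limit. Please split into multiple files.")
    if file_size > MAX_BYTES:  # exactly the size cap is accepted
        errors.append("File exceeds 5 MB size limit.")  # hypothetical wording
    return errors
```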


Negative Scenarios

  • NEG-1: Given I am a standard agent, when I look for the Imported test-case option, then the option is not rendered and the route is forbidden.
  • NEG-2: Given a multi-turn conversation, when a turn errors, then later turns in that conversation are not executed (they would run on a broken context).
  • NEG-3: Given a context turn (blank Expected Answer), when results render, then it never counts toward Pass/Review/Fail.
  • NEG-4: Given the semantic engine is down, when a run executes, then it falls back to confidence + human review rather than marking every turn Fail.

11. Rollout

  • Feature flag: ai_agent_testing (initiative flag, default OFF); Imported source gated within it.
  • Stage 1: Internal — Bot Implementation Specialist team (the primary users).
  • Stage 2: Closed beta — select Bot Admin clients.
  • Stage 3: All orgs with AI Agent enabled, on request.
  • GA: All orgs with AI Agent enabled.
  • Backward compat: Yes — additive. Extends the existing ai_agent_test_cases / ai_agent_test_case_questions schema with conversation/turn fields; Phase 1 flows unchanged. During the transition, existing Phase 1 test cases (no turn_index/turn_type) render as single-turn conversations — old and new rows coexist without migration.
  • Migration: Additive columns (turn_index, expected_answer, turn_type) + a thin batch parent. No backfill; existing rows default to a one-turn conversation.
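
One way to read "no backfill; existing rows default to a one-turn conversation" is a read-time default, sketched below. The normalize_row helper is hypothetical, and defaulting legacy rows to turn_index 1 and turn_type "user" is an assumption about how Phase 1 rows would be interpreted, not a confirmed schema default.

```python
def normalize_row(row: dict) -> dict:
    """Fill the additive columns for rows that predate them, without mutating the input."""
    normalized = dict(row)
    if normalized.get("turn_index") is None:
        normalized["turn_index"] = 1  # assumption: legacy row is turn 1 of its own conversation
    if normalized.get("turn_type") is None:
        normalized["turn_type"] = "user"  # assumption: legacy rows are scored turns
    normalized.setdefault("expected_answer", None)
    return normalized
```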

Semantic-regression rollback: ai_agent_testing is the per-org kill switch. If multi-turn replay or semantic scoring proves unreliable in beta (e.g. scoring disagrees with human review on >20% of a sample), PM toggles ai_agent_testing OFF per org (no deploy); scoring reverts to confidence + human review.
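
The example rollback trigger (scoring disagrees with human review on more than 20% of a sample) could be computed as below. This is an illustrative sketch: the function names and the pair-of-verdicts input shape are assumptions, and the decision to toggle the flag stays with the PM.

```python
def disagreement_rate(pairs: list[tuple[str, str]]) -> float:
    """Share of sampled turns where the engine verdict differs from the human verdict."""
    if not pairs:
        raise ValueError("sample must not be empty")
    return sum(1 for engine, human in pairs if engine != human) / len(pairs)


def should_roll_back(pairs: list[tuple[str, str]], threshold: float = 0.20) -> bool:
    """True when disagreement is strictly above the threshold."""
    return disagreement_rate(pairs) > threshold
```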


12. Observability

Event Name | Trigger | Properties
test_case_import_clicked | User opens the import drawer | user_role, agent_id, workspace_id
template_downloaded | User downloads the template | user_role
file_uploaded | File submitted for validation | file_format, row_count, conversation_count, file_size_kb
file_validation_failed | File rejected | reason (invalid_format / row_limit_exceeded / size_exceeded)
test_run_started | User clicks Run Test | conversation_count, turn_count, skipped_rows
test_run_completed | Run finishes | total_pass, total_review, total_fail, conversation_count, duration_seconds
test_run_viewed | User opens a saved run | agent_id, workspace_id

Dashboard owner: BOT squad (Mixpanel + Tableau).

Alerts:

  • Run failure rate > 10% in 1h → Slack: #bot-ai-alerts (on-call).
  • Semantic-engine error rate during scoring > 5% in 15m → Slack: #bot-ai-alerts (on-call).

Post-Launch Monitoring Cadence: Weekly for the first 4 weeks post-GA, then monthly. Owner: Dimas Fauzi Hidayat (BOT squad PM). Trigger: run failure rate > 20% unresolved within 24h → PM disables ai_agent_testing for affected orgs.


13. Success Metrics

Primary KPI: Reduction in median AI Agent testing time per go-live

  • Definition: median specialist time from "start testing" to "all scenarios scored"
  • Baseline: 5+ hours (manual one-by-one bot preview)
  • Target: ≤ 30 minutes, within 60 days of launch

Adoption: % of specialist-led go-lives using Import Test Case

  • Definition: share of go-lives where the imported source was used
  • Baseline: N/A — new capability
  • Target: ≥ 80% within 60 days of launch

Quality: Multi-turn run success rate

  • Definition: % of conversations that complete all turns without an execution error
  • Baseline: N/A — new capability
  • Target: ≥ 95% within 60 days of GA

Efficiency: Time-to-go-live (implementation phase duration)

  • Definition: implementation-phase duration vs baseline
  • Baseline: current implementation duration
  • Target: decrease ≥ 30% vs baseline

14. Launch Plan & Stage Gates

Stage | Audience | Duration | Success Gate | Owner
Internal | Bot Implementation Specialist team | 2 weeks | No major bugs; testing-time reduction vs manual baseline confirmed; multi-turn replay verified against bot preview | PM + QA
Closed Beta | Select Bot Admin clients | 3–4 weeks | ≥ 80% of early users find value; run failure rate ≤ 10% | PM + CSM
Open Beta | All orgs with AI Agent, on request | 3 weeks | Testing time ≤ 30 min sustained 1 week; no P0/P1 open | Eng Lead
GA | All orgs with AI Agent enabled | Ongoing | Open Beta gates sustained 2 weeks; PMM launch approved | PM + PMM

15. Dependencies

Dependency | Owning Team | Deliverable Needed | Blocking?
Semantic-similarity engine | AI / ML squad | Scoring endpoint (batch input, Bahasa Indonesia, latency at scale); confirm contract; not present in chatbot BE | YES
Bot-preview execution path | BOT | Reuse SendMessageWithResolve + History + conversation_id + SendContext for per-conversation replay | YES
AI Agent versioning | BOT | ai_agent_histories stable; test runs bind to a version | YES
Existing test-case schema | BOT | Extend ai_agent_test_cases / ai_agent_test_case_questions (grain to confirm) | YES
Data / Analytics | Data | Instrumentation events wired (Mixpanel) | NO

16. Key Decisions + Alternatives Rejected

Initiative-level decisions live in the ANCHOR PRD §5. Below are phase-specific decisions.

16a — Decisions Made

Date | Decision | Rationale
2026-06-26 | Merge multi-turn into the MVP; single-turn = a one-turn conversation | The long-format file makes multi-turn near-zero added friction and mirrors the real conversational use case
2026-06-26 | Long-format file (Topic, Conversation ID, Turn, Question, Expected Answer) | Variable conversation length, per-turn scoring, and backward-compatible (single-turn files still work)
2026-06-26 | Extend existing ai_agent_test_cases / ai_agent_test_case_questions; do not create new tables | The schema already ships (Phase 1); adds conversation grouping + turn ordering
2026-06-26 | Parallel across conversations, sequential within | Reuses the live Room + conversation-context path; correct semantics for shared-context replay
2026-06-26 | Conversation verdict shown two ways (per-turn pass rate + goal-turn) | Specialists need both intermediate quality and end-state success
2026-06-26 | Semantic engine = open dependency with confidence + human-review fallback | The engine is not in the BE repo; de-risks the build without blocking it
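
To illustrate the long-format decision, the sketch below groups rows by Conversation ID and orders them by Turn using Python's standard csv module. The column names follow the decision above; the exact header spelling in the real template, the group_conversations name, and the sample rows are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict


def group_conversations(csv_text: str) -> dict[str, list[dict]]:
    """Group long-format rows into conversations, each sorted by turn number."""
    reader = csv.DictReader(io.StringIO(csv_text))
    grouped: dict[str, list[dict]] = defaultdict(list)
    for row in reader:
        grouped[row["Conversation ID"]].append(row)
    for turns in grouped.values():
        turns.sort(key=lambda r: int(r["Turn"]))  # replay order within a conversation
    return dict(grouped)


# Hypothetical sample: one two-turn conversation and one single-turn conversation.
SAMPLE = (
    "Topic,Conversation ID,Turn,Question,Expected Answer\n"
    "Billing,c1,2,How much?,Rp 100\n"
    "Billing,c1,1,Hi,\n"
    "Hours,c2,1,When are you open?,9 to 5\n"
)
```

A single-turn file is simply the case where every Conversation ID appears once, which is why the same parser covers both shapes.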

16b — Alternatives Rejected

Alternative | Why Rejected | Date
New TestRun / TestRunResult tables (original Confluence draft) | Superseded by extending the already-shipped ai_agent_test_cases schema | 2026-06-26
Single-turn-only import | Does not reflect the real (multi-turn) use case; single-turn is just the one-turn special case | 2026-06-26
Wide file format (Q1/A1/Q2/A2 columns) | Rigid (fixed max turns), awkward for variable length and per-turn scoring | 2026-06-26
Branching / LLM user-simulator in MVP | Hard to make deterministic and scoreable; deferred | 2026-06-26

17. Open Questions

# | Type | Question | Owner | Deadline
1 | Risk | Semantic-similarity engine is not in the chatbot BE; confirm the AI Service contract (endpoint, batch size, Bahasa Indonesia, latency). Mitigation: fallback to AI confidence + human review. | Dimas (PM) / AI squad | 2026-07-15
2 | Open Question | Exact grain of ai_agent_test_cases today (per-conversation vs per-batch); decides new batch table vs folded-in column. | Eng (BOT: Puji/Eko) | 2026-07-15
3 | Open Question | Re-baselined throughput SLA now that turns are sequential within a conversation (conversations × avg turns). Interim target: ~30 conversations / ≈100 turns in ≤5 min. | Eng (BOT) | 2026-07-15
4 | Open Question | Per-conversation turn cap (e.g. ≤20): safe ceiling, hard vs soft enforcement. | Eng (BOT) | 2026-07-15
5 | Risk | Test-room retention/cleanup so draft_state test rooms/histories don't pollute real conversation data. Mitigation: purge test rooms after each run; confirm purge window. | Eng (BOT) | 2026-07-31
6 | Risk | Imported files may contain client PII in questions/answers. Mitigation: scope to agent/workspace; discard raw file after parsing; confirm scoring stays internal. | Dimas (PM) | 2026-07-15
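
For OQ#3, a rough back-of-envelope model may help frame the throughput discussion. It assumes conversations run in waves across a fixed number of workers, turns run one after another inside each conversation, and every turn takes the same time. The worker count and per-turn latency used in the usage note are purely hypothetical inputs, not measurements.

```python
import math


def estimated_wall_time_s(conversations: int, total_turns: int,
                          workers: int, turn_latency_s: float) -> float:
    """Rough estimate: waves of parallel conversations x average turns x latency."""
    avg_turns = total_turns / conversations
    waves = math.ceil(conversations / workers)  # conversations processed per wave = workers
    return waves * avg_turns * turn_latency_s
```

For the interim target (30 conversations, about 100 turns), a hypothetical 5 workers at 10 seconds per turn gives roughly 200 seconds, inside the 5-minute budget, while a single worker at the same latency would need roughly 1,000 seconds.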

PRD CHANGELOG

Version | Date | By | Section | Type | Summary
1.2 | 2026-06-30 | Claude | §7, §8, S01, S11 | UPDATED | Inserted design status from qontak-designer feat/ai-agents-testing prototype. Updated §8 component tree to distinguish Phase 1 designed components (testing/index.vue node 16743-298263, GenerateTestCaseModal node 15576-205254, UploadManuallyDrawer node 16677-210564, QuestionComparisonCard node 16820-169848) from Phase 3 new components (ImportTestCaseDrawer, TestRunProgress, TestRunResults; frames TBD). Updated §7 CHG-001 Figma link to actual node. Added prototype notes to S01 (modal "Add or upload manually" card, Phase 1 stub discrepancies) and S11 (Testing Index columns, dev tools, score badge). Updated Last Updated header.
1.1 | 2026-06-26 | Claude | S1, S4, S6, S8 | UPDATED | Phase 3 (Imported Test Cases) PRD authored in the documents-repo PHASE template under the AI Agent: Testing ANCHOR. Reconciled the Confluence "Import Test Case" page (QON 51068960826) with current code (chatbot, chatbot-fe): multi-turn merged into MVP (long-format Conversation ID + Turn), schema extends the existing ai_agent_test_cases / ai_agent_test_case_questions models, execution reuses the live bot-preview Room + conversation_id + SendContext path (parallel across conversations, sequential within), conversation verdict shown two ways, and the semantic-similarity engine flagged as an open dependency with a confidence + human-review fallback. 13 stories with composite AC ids (IMPORT-S01…S13).
1.1 | 2026-06-26 | Claude | S1, S4, S6, S8 | UPDATED | Coaching pass after score-prd v3.3. Trimmed one-liner to ≤25 words; added plan scope + feature-flag default state + interim throughput SLA + explicit data-lifecycle/purge to Constraints; added a UI-state Mermaid diagram (S6/S8 New Features) and a system-flow Mermaid diagram (§10.1, with flow type declared); strengthened every story with a Data Fields table (type/required/source), a frame-level Figma link, an explicit Unauthorized clause, all four UI states, and a third AC where thin (S03/S09/S10/S13); added explicit rollback/reversibility to the execution stories (S05/S10).