RFC: AI Agent Testing — Phase 1: Historical Validation

Document Conventions (do not remove)

This RFC follows the Qontak RFC Template format for governance — the metadata table, Confluence sections 1–6, and Comment logs are mandatory.

It is also agent-execution-ready: §1 Design References (FE half) + §1 PRD-to-Schema Derivation (BE half), §2 Repo Reading Guide (Detail 2.0) for both layers, mermaid diagrams, §2.G Cross-Layer Contract Verification, and §4 Agent Execution Plan + Verification & Rollback Recipe are complete.

Delivery & project management live elsewhere. This RFC is the technical artifact only — no staffing, effort, timeline, or rollout schedule. Those live in the initiative's delivery/ folder. Until handed to delivery, the Delivery row reads `not yet handed to delivery`.

Grounding note (important). The PRD describes target behavior. This RFC is reconciled against the current code in chatbot, chatbot-fe, and qontak-designer (see §2.0 Source Verification). Where the PRD describes behavior that is not yet built, the RFC says so explicitly and scopes it as work. The biggest such gap: FetchRoomConversationsWorker today only fetches rooms and extracts Q/A pairs, then logs — it does not sample, generate AI shadow answers, persist question rows, or update test-case status.

Metadata

Field	Value	Notes
Status	RFC (IDEA)	Human label; YAML `status: draft`
DRI	Dimas Fauzi Hidayat	Accountable owner carried from PRD; eng tech-lead co-owner to be named in `delivery/`
Team	chatbot (BOT squad)	Advisory slug carried from PRD
Author(s)	Claude (from PRD + repo grounding)
Reviewers	BOT Backend Lead · BOT Frontend Lead · AI Squad Lead · Data Team (Reza)	Cross-squad: BOT, AI, Data, Platform (Chat Service)
Approver(s)	BOT Tech Lead · InfoSec Approver	InfoSec required: historical PII → 3rd-party LLM (Open Q #1)
Submitted Date	2026-06-20
Last Updated	2026-06-20
Target Release	2026-Q3	Re-baselined; original "May 2026" dates are past (PRD Open Q #3)
Target Quarter	2026-Q3
Delivery	not yet handed to delivery
Related	`../prds/historical-validation.md` · `../ai-agent-testing-anchor.md`
Discussion	`#bot-ai-alerts`

Type: full-stack · Frontend sub-type: new-feature · Backend sub-type: new-feature

Sections at a Glance

Overview (Design References — FE; PRD-to-Schema Derivation — BE; traceability)
Technical Design (Repo Reading Guide → end-to-end mermaid → DDL → APIs → cross-layer verification)
High-Availability & Security
Backwards Compatibility and Rollout Plan
Concern, Questions, or Known Limitations
Comment logs
Ready for agent execution

1. Overview

Phase 1 of the AI Agent: Testing initiative lets a Qontak SPV/Admin validate an AI Agent against a sample of their own resolved, human-handled conversations before going live. The system samples eligible historical rooms (last 90 days), generates an AI "shadow" answer per extracted customer question (never sent to a real customer), and presents a side-by-side comparison of the human "golden" answer vs the AI answer. The SPV rates each answer thumbs up/down; ratings roll up into a confidence meter. At ≥80% the agent is "Ready to Launch"; an activation gate (Should-Have) prevents go-live below threshold.

This RFC is a delta on substantial existing scaffolding, not a greenfield build. The data model, the four read/write endpoints, the rating use case, the Sidekiq worker shell, the conversation-pair extractor, and the chatbot-fe Pinia store + typed API client already exist (see §2.0). The missing pieces — the engineering core of this phase — are:

Sampling (10% / 50–70 cap) in the worker.
Shadow-answer generation (LLM call per question) + persistence of ai_agent_test_case_questions rows.
Test-case status lifecycle (pending → processing → completed/failed).
Confidence-score aggregate recompute on rating.
Activation gate in publish.
Tree-diagram average-confidence surface.
The chatbot-fe Testing page (list + detail/comparison + meter) under the new bot-automation module.
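The status lifecycle named in the list above (pending → processing → completed/failed) can be pictured as a small transition table. The following is a minimal sketch, not code from the chatbot repo: the module name `TestCaseStatusSketch`, its method names, and the choice to treat `completed` and `failed` as terminal are all assumptions made for illustration.

```ruby
# Hypothetical sketch of the test-case status lifecycle; not repo code.
module TestCaseStatusSketch
  # Allowed next states per status. Terminal states map to an empty list
  # (assumption: completed and failed are never left once reached).
  TRANSITIONS = {
    'pending'    => %w[processing],
    'processing' => %w[completed failed],
    'completed'  => [],
    'failed'     => []
  }.freeze

  # True when moving from `from` to `to` is a legal step.
  def self.can_transition?(from, to)
    TRANSITIONS.fetch(from, []).include?(to)
  end

  # Returns the new status, or raises when the step is not allowed.
  def self.transition!(from, to)
    raise ArgumentError, "illegal transition #{from} -> #{to}" unless can_transition?(from, to)

    to
  end
end
```

A worker following this sketch would call `transition!('pending', 'processing')` when the batch starts and `transition!('processing', 'completed')` or `transition!('processing', 'failed')` when it ends, so an out-of-order update fails loudly instead of silently overwriting the status.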

Success Criteria

Zero customer-message leakage: no send_message/notification fires for any historical inquiry during shadow generation (AITEST-S04/AC-1). Provable by spec + zero SendMessageWorker enqueues during a batch.
Shadow-generation success rate: ≥ 95% of sampled questions produce a valid AI answer, measured within 60 days of GA (PRD §13 Quality KPI).
Batch latency: a ~50-item batch reaches completed in ≈2–5 min without blocking live production traffic (queue isolation :ai_agent).
Confidence meter equals (thumbs-up ÷ total sample) × 100, recomputed server-side on every rating (AITEST-S06/AC-3).
Primary product KPI (PRD §13): Configured→Live conversion ≥ 60% within 7 days, within 90 days of GA.
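The confidence-meter formula in the criteria above, (thumbs-up ÷ total sample) × 100, can be illustrated with a few lines of Ruby. This is a sketch under stated assumptions: the method name `confidence_score` is hypothetical, unrated questions (`nil`) are counted in the denominator because the formula divides by the total sample, the result is rounded to the nearest integer to fit the integer `confidence_score` column, and an empty sample yields `nil`.

```ruby
# Hypothetical helper; rounding rule and nil-for-empty are assumptions.
# `scores` holds one entry per sampled question:
#   1 = thumbs up, 0 = thumbs down, nil = not rated yet.
def confidence_score(scores)
  return nil if scores.empty?

  thumbs_up = scores.count(1)
  # Float division, then round to the nearest whole percent.
  (thumbs_up * 100.0 / scores.size).round
end
```

With these assumptions, four thumbs-up out of five sampled questions gives 80, which is exactly the "Ready to Launch" threshold, while one thumbs-up among four questions (two still unrated) gives 25.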

Out of Scope

Live shadow mode (real-time parallel answering) — strictly historical.
Model fine-tuning UI; editing the AI answer in the workspace (comparison is read-only).
Multi-modal validation (images/voice/attachments) — text-only.
"Generate from knowledge" (Phase 2) and "Imported question list" (Phase 3) sources — scaffolded/disabled only.
Mobile — web only.
The qontak-designer prototype itself — it is a design reference, not a deployable target (see Decision D-1).

PRD: ../prds/historical-validation.md (BOT-3351)
Initiative anchor: ../ai-agent-testing-anchor.md
Confluence source: https://jurnal.atlassian.net/wiki/spaces/QON/pages/50815303687
Figma master: Bot · AI Agent Testing

Assumptions

A single human agent text reply to a customer question is a sufficient "golden answer" (PRD Open Q #7 — Data team to confirm).
The 90-day lookback and 10%/50–70 cap hold across plan tiers for the beta token budget (PRD Open Q #5).
ai_agent_histories is stable; a test case binds to one ai_agent_history_id.
Chat Service room/message APIs and QontakNlp prediction are reachable from the :ai_agent Sidekiq worker pool with the org's channel access token.
Production frontend is chatbot-fe (it owns auth, the API client, and Pinia stores); qontak-designer has no API/auth layer (Decision D-1).

Dependencies

Dependency	Owning team	Deliverable needed	Availability	Blocking?
Chat Service (Hub)	Inbox / Platform	`Hub::ChatService::Rooms::List` (status `assigned`, date window), `Messages::GetByRoom` over 90 days	exists (`app/core/repositories/chat_service/`, `lib/hub/chat_service/`)	YES
LLM / AI service (QontakNLP)	AI squad	Batch shadow inference within TPM/RPM limits	exists for live predict (`lib/qontak_nlp/inference.rb#prediction`); batch/shadow path needs building	YES
Data team	Data	10% sampling + 50–70 cap algorithm	needs building (not in worker today)	YES
Channel Integration	Platform	Access tokens for room fetch	exists (`Repositories::ChannelIntegrations::GetTokens`)	YES
AI Agent versioning	BOT	`ai_agent_histories` stable	exists (`app/models/ai_agent_history.rb`)	YES
Design (Pixel3)	Design system	`@mekari/pixel3` Drawer/Modal/Table/Badge	exists (`@mekari/pixel3@^1.0.12` in chatbot-fe)	NO

Design References (frontend half — required)

The PRD's UI is specified in Figma; the qontak-designer prototype is the in-code design reference (pixel layout + component decomposition) but is itself a static prototype (no API/auth) — see Decision D-1. Production implementation lands in chatbot-fe.

PRD-named surface	Figma / design link	Frame name	Design system version	Design QA contact	Notes
Testing page (list)	node 16743-298263	Testing page	`@mekari/pixel3@^1.0.12` (chatbot-fe)	BOT Design QA	In-code ref: `qontak-designer app/pages/bot-automation/testing/index.vue`
Generate test case modal + Generate-from-Inbox drawer	node 16514-155786	Generate flow	`@mekari/pixel3@^1.0.12`	BOT Design QA	In-code ref: `qontak-designer app/components/bot-automation/testing/{GenerateTestCaseModal,GenerateFromInboxDrawer,TestCaseGeneratingModal}.vue`
Sampling / generating progress	node 17699-52615	Generating modal	`@mekari/pixel3@^1.0.12`	BOT Design QA	Async progress while batch runs
Test-case detail — side-by-side comparison + confidence meter	node 16514-155786	Comparison view	`@mekari/pixel3@^1.0.12`	BOT Design QA	No prototype component exists in qontak-designer for the detail view — build fresh (see §2.0)
Activation gate (AI agent main settings)	node 16514-155786	Activate button	`@mekari/pixel3@^1.0.12`	BOT Design QA	`chatbot-fe modules/bot-automation/components/AiAgentEditor.vue` footer
Tree-diagram AI Agent node confidence	node 16514-155786	Tree node	`@mekari/pixel3@^1.0.12`	BOT Design QA	Backend `get_tree_diagram_v3`

PRD-to-Schema Derivation (backend half — required)

PRD entity / attribute / rule	Persisted as (table.column)	Exposed via	Enforced where	Source
A test case binds to an agent + a version	`ai_agent_test_cases.ai_agent_id`, `.ai_agent_history_id` (uuid, NOT NULL)	`POST /api/v1/ai_agents/:id/test_cases`	`CreateTestCases` use case (404 if version missing)	PRD §9 #1
Test case has a lifecycle status	`ai_agent_test_cases.status` (string)	list + detail responses	worker transitions (to build); created as `'pending'` today	PRD §10.1, AITEST-S08
Test case aggregate confidence	`ai_agent_test_cases.confidence_score` (integer, nullable)	detail + list	recompute on rating (to build)	AITEST-S06
Sampled question + extracted Q/A	`ai_agent_test_case_questions.question` (text), `.topic` (string)	detail response	`ExtractConversationPairs` + worker persistence (to build)	AITEST-S02/S03
AI shadow answer	`ai_agent_test_case_questions.answer` (text)	detail	shadow-gen worker (to build)	AITEST-S04/AC-2
Human "golden" answer	`ai_agent_test_case_questions.parameters` (jsonb) → `human_answer`	detail	worker persistence (to build)	PRD §16a (2026-06-18)
AI metrics	`.confidence` (int), `.response_time` (int), `.sources` (jsonb `[{id,name,type}]`)	detail	shadow-gen worker (to build)	AITEST-S05/AC-2
Per-question rating	`.score` (int 0/1), `.is_score` (bool), `.scored_by`/`_email`/`_name`, `.scored_at`	`PATCH .../questions/:question_id`	`RateTestCaseQuestion` (exists; aggregate recompute to build)	AITEST-S06
Per-question failure	`.status` (string), `.status_description` (text)	detail	shadow-gen worker (to build)	AITEST-S04/ERR-1
Soft delete	`.deleted_at` (`acts_as_paranoid`) on both tables	DELETE endpoint (to build — see §2.4)	model `acts_as_paranoid`	PRD §6
Activation gate threshold	`confidence_score` vs `threshold` (default 80)	`POST /api/v1/ai_agents/:id/publish`	`Repositories::Publish` (gate to build)	AITEST-S07
Tree-diagram avg confidence	computed avg over agent's completed `ai_agent_test_cases.confidence_score`	`GET /api/v3/paths/:id/tree_diagram`	`Repositories::Paths::GetTreeDiagramV3#add_ai_agent` (to build)	AITEST-S10

Every §2.3 DDL column and §2.4 endpoint traces back to a row here.
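The tree-diagram row above describes an average over an agent's completed test cases, with a "no score yet" state when none exist. A minimal sketch of that computation follows. It is not the `add_ai_agent` implementation: the `TestCaseRow` struct and `average_confidence` method are hypothetical, and skipping completed rows with a `nil` score and rounding to an integer are assumptions.

```ruby
# Hypothetical in-memory stand-in for an ai_agent_test_cases row.
TestCaseRow = Struct.new(:status, :confidence_score, keyword_init: true)

# Averages confidence_score over completed test cases only.
# Returns nil when nothing qualifies, which a caller could render as
# "no score yet".
def average_confidence(test_cases)
  scores = test_cases
           .select { |tc| tc.status == 'completed' && !tc.confidence_score.nil? }
           .map(&:confidence_score)
  return nil if scores.empty?

  (scores.sum.to_f / scores.size).round
end
```

Returning `nil` rather than `0` keeps "never tested" distinguishable from "tested and scored zero", which matters if the node badge renders the two states differently.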

Detail 1.A — PRD Traceability (cross-layer)

Composite AC ids per documents/CLAUDE.md (story-qualified, e.g. AITEST-S01/AC-1).

Forward (PRD AC → RFC):

PRD composite AC id	FE section / component	BE section / endpoint
AITEST-S01/AC-1, AC-2	Testing page + nav item (chatbot-fe)	`GET /api/v1/ai_agents/:id/test_cases` (`set_role`)
AITEST-S01/ERR-1	Error blank-slate + `ai_workspace_load_failed`	list endpoint failure path
AITEST-S02/AC-1..3	generating modal	`FetchRoomConversationsWorker` sampling (§2.F)
AITEST-S03/AC-1..3	n/a — server-side	`ExtractConversationPairs` filtering (§2.2)
AITEST-S04/AC-1, AC-2, ERR-1	per-question loading/failed state	shadow-gen worker step (§2.2, §2.F)
AITEST-S05/AC-1..3	`TestCaseComparison` / `QuestionList`	`GET .../test_cases/:id` detail (§2.4)
AITEST-S06/AC-1..3, ERR-1	`ConfidenceMeter` + thumbs	`PATCH .../questions/:id` + aggregate recompute (§2.4, §2.F)
AITEST-S07/AC-1, AC-2, ERR-1	Activate button enable/disable	`POST /api/v1/ai_agents/:id/publish` gate (§2.4)
AITEST-S08/AC-1, AC-2, ERR-1	generating modal + polling	worker async + status lifecycle (§2.F)
AITEST-S09/AC-1, AC-2, ERR-1	Force-activate modal (Could-Have)	publish override + audit (PaperTrail)
AITEST-S10/AC-1..3, ERR-1	Tree-diagram node badge	`GetTreeDiagramV3#add_ai_agent`

Reverse (RFC → PRD AC):

New FE component / BE endpoint / dependency	PRD composite AC id it serves
chatbot-fe `pages/bot-automation/testing/index.vue`	AITEST-S01/AC-1
chatbot-fe `TestCaseComparison.vue` + `ConfidenceMeter.vue`	AITEST-S05/AC-1, AITEST-S06/AC-3
BE worker sampling step	AITEST-S02/AC-1..3
BE worker shadow-gen + persistence step	AITEST-S04/AC-2
BE confidence aggregate recompute	AITEST-S06/AC-2
BE publish gate	AITEST-S07/AC-1
BE `DELETE .../test_cases/:test_case_id` (new)	PRD §6 soft delete
BE `GetTreeDiagramV3` avg-confidence	AITEST-S10/AC-2

UI / Consumer Surface Coverage

PRD-named surface	Consumer	Required reads (BE)	Required writes (BE)	FE component	Status surface
Testing page (list)	web	`GET /api/v1/ai_agents/:id/test_cases`	—	`pages/bot-automation/testing/index.vue`	`status`, `score` columns
Generate-from-Inbox drawer	web	`GET .../ai_agents/:id` (versions)	`POST .../test_cases`	`GenerateFromInboxDrawer.vue`	`status=pending→processing`
Generating modal	web	`GET .../test_cases` (poll)	—	`TestCaseGeneratingModal.vue`	polls `status` until `completed`
Test-case detail / comparison	web	`GET .../test_cases/:id`	`PATCH .../questions/:id`	`TestCaseComparison.vue`	per-question `status`, `confidence`
Confidence meter	web	(from detail payload)	—	`ConfidenceMeter.vue`	`confidence_score` aggregate
AI agent main settings (Activate)	web	`GET .../ai_agents/:id`	`POST .../publish`	`AiAgentEditor.vue` footer	`confidence_score` vs 80
Tree-diagram node	web	`GET /api/v3/paths/:id/tree_diagram`	—	bot-flow node (chatbot-fe)	`avg_confidence_score`

Role Coverage

PRD role	Authorization mechanism	Endpoints permitted (BE)	UI surface visibility (FE)	Cross-tenant?	Audit trail
owner	`set_role(%w[owner supervisor admin])` (JWT `current_user['role']`)	all test-case + publish	full	no (org-scoped)	PaperTrail on test_cases/questions
supervisor	`set_role`	all test-case; publish (not force-override)	full	no	PaperTrail
admin	`set_role`	all test-case + publish + force-override	full	no	PaperTrail (+ override reason)
standard agent	`set_role` rejects (403)	none	menu hidden; route forbidden	no	n/a
Super Admin (PRD secondary)	inherits owner/admin role	activate after SPV sign-off	full	no	PaperTrail
bot-specialist (S10)	`set_role` on tree-diagram (`owner/supervisor/admin`)	`GET .../tree_diagram`	tree-diagram node	no	n/a (read)

Menu visibility in chatbot-fe is feature-flag + subscription gated today (not role-gated) — see Decision D-3; server-side set_role is the authoritative guard.
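The role table above reduces to two allow-lists: one for the test-case and publish endpoints, and a narrower one for force-override. The sketch below models that as plain Ruby. It is not the repo's `set_role` helper: the module `RoleGuardSketch`, the action symbols, and the returned status integers are hypothetical, and restricting force-override to `admin` is read from the table rather than from code.

```ruby
# Hypothetical model of the role matrix; the real guard is `set_role`.
module RoleGuardSketch
  TEST_CASE_ROLES      = %w[owner supervisor admin].freeze
  FORCE_OVERRIDE_ROLES = %w[admin].freeze

  # Returns the HTTP status a request would receive for the given
  # role and action (:test_cases, :publish, or :force_override).
  def self.status_for(role, action)
    allowed = action == :force_override ? FORCE_OVERRIDE_ROLES : TEST_CASE_ROLES
    allowed.include?(role) ? 200 : 403
  end
end
```

Because the FE menu is gated by flag and subscription rather than role, a check of this shape has to run on every request server-side, whatever the menu shows.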

PRD Section Coverage

PRD §	Title	Where covered
2	Phase Context	§1 Overview
3	One-liner + Problem	§1 Overview
4	Target Users / Persona	§1 (Role Coverage)
5	Non-Goals	§1 Out of Scope
6	Constraints	§3 (perf, security, data lifecycle), §4 (flag)
7	Feature Changes (CHG-001 tree)	AITEST-S10 → §2.4, §2.F.2
8	New Features (Testing page)	§1 Design References, §2.A, Detail 1.C
9	API & Webhook Behavior	§2.4
10	System Flow / Stories / ACs	Detail 1.A, 1.C, §2.2
11	Rollout	§4
12	Observability	§3 Monitoring
13	Success Metrics	§1 Success Criteria, §3
14	Launch Plan & Stage Gates	§4 (delivery owns schedule)
15	Dependencies	§1 Dependencies, §2.F.1
16	Key Decisions	Detail 1.B, §2 Technical Decisions
17	Open Questions	§5

Detail 1.B — Decisions Closed (cross-layer)

Decision	Chosen option	Alternatives rejected	Why rejected	Layer
D-1 Frontend target repo	chatbot-fe (production); `qontak-designer` is design reference only	Build in `qontak-designer`	`qontak-designer` has zero API client + only mock localStorage auth + no roles (`app/composables/useAuth.ts`) — cannot satisfy `set_role` or real data	both
D-2 Storage	Reuse existing `ai_agent_test_cases` / `ai_agent_test_case_questions` (Postgres, uuid, `acts_as_paranoid`)	New `ai_validation_sessions`/`_items` tables	Superseded by implemented schema (PRD §16b)	BE
D-3 Menu gating	Server-side `set_role` is authoritative; FE menu reuses existing `rollout_ai_agent` + subscription flag pattern	Role-gate the FE menu only	FE menu gating today is flag/subscription-based (`layouts/bot-automation.vue`); BE must enforce regardless	both
D-4 Batch processing	Async Sidekiq `FetchRoomConversationsWorker`, queue `:ai_agent`	Kafka; synchronous request	Already the chatbot async stack (PRD §16a); sync would block & exceed LLM TPM	BE
D-5 Shadow inference	Reuse `QontakNlp` predict path per question, token-bucket throttled in worker (RPM cap `SystemPreference`, default 60; 429→backoff+requeue→fail-question)	New batch endpoint on AI service	Reuse the proven `lib/qontak_nlp/inference.rb#prediction`; throttle contract fully specified in §3 Performance (REV-1)	BE
D-6 Confidence aggregate	Recompute `confidence_score` server-side on each rating write	Compute on read; FE-side	Single source of truth; needed by tree-diagram + gate; avoids drift	BE
D-7 Activation gate	Add advisory→enforced threshold check in `Repositories::Publish` behind `ai_agent_testing_gate` flag; threshold is org-configurable via `SystemPreference` (`group_code: 'engine'`, `code: 'ai_agent_testing_threshold'`, default `80`)	Hard gate from day one; hard-coded 80 constant	Ship advisory for beta (PRD Open Q #2), enforce before GA; configurable threshold resolves PRD Open Q #4 (REV-4) without a redeploy	BE
D-8 Delete semantics	Soft delete via `acts_as_paranoid`; add missing `DELETE` endpoint	Hard delete	PRD §6 soft-delete + restore; chatbot-fe already calls a delete route that BE lacks	both
D-9 Per-status lifecycle	`pending → processing → completed`/`failed` (test case); `pending → processing → completed`/`failed` (question)	Single boolean done flag	Needs partial/failed surfacing (AITEST-S08/ERR-1)	BE
D-10 Sampling cap	10% random, capped 50–70, ≤50 shown if batch > 100; all rooms if < 10 eligible	Expose params in drawer	Adds user effort; defaults are the trust signal (PRD §16b)	BE/Data

Minimum-coverage decisions: storage (D-2), sync/async (D-4), caching (no alternative considered — no read-cache introduced this phase; detail reads are infrequent), third-party (D-5), consistency (D-6, server-authoritative/strong within request), multi-tenancy (set_role + org-scoped queries), reuse-vs-new (§2.4 Reuse? column).
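The sampling rule in D-10 (10% random, capped, all rooms when fewer than 10 are eligible) can be sketched as follows. Everything named here is hypothetical: the module `SamplingSketch`, its constants, and in particular the reading of "capped 50–70" as a single configurable upper bound that defaults to 70. The Data team owns the real algorithm, so treat this as an illustration of the arithmetic only.

```ruby
# Hypothetical sketch of the D-10 sampling rule; not the Data team's algorithm.
module SamplingSketch
  SAMPLE_PERCENT = 10 # 10% of eligible rooms
  SMALL_POOL     = 10 # below this, take every eligible room
  DEFAULT_CAP    = 70 # assumption: upper end of the "50-70" cap

  # Number of rooms to sample for a given eligible count.
  def self.sample_size(eligible_count, cap: DEFAULT_CAP)
    return eligible_count if eligible_count < SMALL_POOL

    # Integer ceiling of 10% avoids floating-point rounding surprises.
    ten_percent = (eligible_count * SAMPLE_PERCENT + 99) / 100
    [ten_percent, cap].min
  end

  # Uniform random sample without replacement. Pass a seeded Random
  # to make the draw reproducible in specs.
  def self.sample(room_ids, cap: DEFAULT_CAP, random: Random.new)
    room_ids.sample(sample_size(room_ids.size, cap: cap), random: random)
  end
end
```

Under these assumptions 200 eligible rooms yield 20 samples, 5000 rooms yield 70 (or 50 with `cap: 50`), and 7 rooms yield all 7, which matches the worker-spec expectations listed for AITEST-S02.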

Detail 1.C — Per-Story Change Map

Story id	Title	Layer scope	FE changes	BE changes	Composite AC ids	Acceptance criteria (verifiable)	RFC anchors
AITEST-S01	Workspace access control	FE + BE	`pages/bot-automation/testing/index.vue`; nav item in `layouts/bot-automation.vue`; `FETCH_TEST_CASES` (exists)	`GET .../test_cases` (exists, `set_role`)	S01/AC-1, AC-2, ERR-1, NEG-1	rspec: 403 for `standard`; vitest: error slate fires `ai_workspace_load_failed`	§2.4 row1 · §2.A · §3 authz
AITEST-S02	Historical sampling (10%)	BE-only	n/a — server-side	`FetchRoomConversationsWorker` sampling step (new)	S02/AC-1, AC-2, AC-3, ERR-1	worker spec: 200 rooms→~20; <10→all; 5000→cap 50–70	§2.F job spec · §4.D chunk 3
AITEST-S03	Data integrity & filtering	BE-only	n/a	`ExtractConversationPairs` (exists) — confirm non-text/system skip	S03/AC-1, AC-2, AC-3, ERR-1, NEG-2	extractor spec: bot-only excluded; image-only skipped	§2.2 · existing spec
AITEST-S04	Shadow execution (zero leakage)	BE-only	per-question failed badge	worker shadow-gen + question persistence (new); `QontakNlp` predict	S04/AC-1, AC-2, ERR-1	worker spec: 0 `SendMessageWorker` enqueues; `answer` + `parameters.human_answer` persisted	§2.2 · §2.F · §4.D chunk 4
AITEST-S05	Side-by-side validation UI	FE + BE	`TestCaseComparison.vue`, `QuestionList.vue` (grouped by topic)	`GET .../test_cases/:id` detail (exists)	S05/AC-1, AC-2, AC-3, ERR-1, NEG-3	vitest: renders human-left/AI-right + confidence/time/sources; failed→"could not generate"	§2.4 row3 · §2.A
AITEST-S06	Confidence meter & feedback	FE + BE	`ConfidenceMeter.vue`; thumbs via `UPDATE_TEST_CASE_QUESTION` (exists, optimistic)	aggregate recompute on rate (new)	S06/AC-1, AC-2, AC-3, ERR-1	rspec: rating recomputes `confidence_score=(up÷total)×100`; vitest: rollback on save fail	§2.4 row4 · §2.F.2 · §4.D chunk 5
AITEST-S07	Activation gatekeeping	FE + BE	Activate button enable/disable in `AiAgentEditor.vue` footer	publish gate in `Repositories::Publish` (new, flagged)	S07/AC-1, AC-2, ERR-1	rspec: publish 422 when `<80` & gate on; FE button disabled `<80`	§2.4 row5 · §4.D chunk 8
AITEST-S08	Background processing (async)	BE + FE	generating modal + poll	worker status lifecycle (new)	S08/AC-1, AC-2, ERR-1	worker spec: status `processing→completed`; failure→`failed` + Rollbar	§2.F · §2.1 state
AITEST-S09	Manual override & audit	FE + BE	Force-activate modal (reason)	publish override path + PaperTrail reason (new)	S09/AC-1, AC-2, ERR-1	rspec: override requires reason; PaperTrail row w/ reason + score	§2.4 row5 · §3 audit
AITEST-S10	Confidence in Tree Diagram	FE + BE	node badge in bot-flow tree (chatbot-fe)	`GetTreeDiagramV3#add_ai_agent` avg score (new)	S10/AC-1, AC-2, AC-3, ERR-1	rspec: `add_ai_agent` returns avg over completed; no test cases→"no score yet"	§2.4 row6 · §2.F.2

Every FE + BE row has both columns filled. S02/S03/S04 are BE-only (server-side pipeline); their UI effects are covered by S05/S08 surfaces.
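The gate behavior described for AITEST-S07 and AITEST-S09 (publish returns 422 below the threshold when the gate is on; an override requires a reason) can be written as a pure decision function. The sketch below is illustrative and not the `Repositories::Publish` change itself: `PublishGateResult` and `evaluate_publish_gate` are hypothetical names, a missing score is assumed to count as below threshold, and the role check and PaperTrail write for overrides are left out.

```ruby
# Hypothetical result object for the gate decision.
PublishGateResult = Struct.new(:http_status, :reason, keyword_init: true)

# Decides whether a publish may proceed.
# gate_enabled    - value of the ai_agent_testing_gate flag
# threshold       - org-configurable, default 80
# override_reason - free text supplied with a force-activate request
def evaluate_publish_gate(confidence_score:, gate_enabled:, threshold: 80, override_reason: nil)
  # Advisory mode: the gate never blocks while the flag is off.
  return PublishGateResult.new(http_status: 200, reason: 'gate disabled') unless gate_enabled

  if !confidence_score.nil? && confidence_score >= threshold
    return PublishGateResult.new(http_status: 200, reason: 'threshold met')
  end

  # Below threshold: only a non-blank reason lets the publish through.
  unless override_reason.to_s.strip.empty?
    return PublishGateResult.new(http_status: 200, reason: 'forced override')
  end

  PublishGateResult.new(http_status: 422, reason: "confidence below #{threshold}")
end
```

Keeping the decision free of side effects makes the advisory-to-enforced switch a flag flip: the same function serves both modes, and a spec can cover each branch without touching the database.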

2. Technical Design

Detail 2.0 — Repo Reading Guide (read this first)

Repo Map (mermaid, both layers)

flowchart LR
  subgraph fe["chatbot-fe (Nuxt + Pinia)"]
    page["pages/bot-automation/testing/"]
    comp["modules/bot-automation/components/testing/"]
    store["store/ai-agent/{actions,getters,interface}.ts"]
    svc["common/services/main/v1/ai-agents.ts"]
  end
  subgraph be["chatbot (Rails + Grape)"]
    ctrl["api/frontend_service/v1/ai_agent/*_controller.rb"]
    uc["use_cases/{create_test_cases,rate_test_case_question,publish_ai_agent}.rb"]
    repo["repositories/{create_test_case,rate_test_case_question,publish}.rb"]
    worker["workers/fetch_room_conversations_worker.rb"]
    chat["core/repositories/chat_service/*"]
    nlp["lib/qontak_nlp/inference.rb"]
    tree["core/repositories/paths/get_tree_diagram_v3.rb"]
  end
  subgraph infra
    db[("Postgres: ai_agent_test_cases / _questions")]
    q[["Sidekiq queue :ai_agent"]]
    hub(["Hub Chat Service (HTTP)"])
    ai(["QontakNLP AI service (HTTP)"])
  end
  svc --> ctrl
  ctrl --> uc --> repo --> db
  uc --> q --> worker
  worker --> chat --> hub
  worker --> nlp --> ai
  worker --> db
  ctrl --> tree --> db

Existing Code Anchors

Layer	Path	Why the agent reads it	What pattern it teaches
BE	`app/api/frontend_service/v1/ai_agent/test_cases_controller.rb`	The 3 live routes + `set_role` + result-matcher	Grape route + `Dry::Matcher::ResultMatcher` + `success_response`/`error_response`
BE	`app/api/frontend_service/v1/ai_agent/use_cases/create_test_cases.rb`	Create flow, validation, worker enqueue	`APIAbstractUseCase` + `Dry::Monads::Do` + `Repositories::*.call`
BE	`app/api/frontend_service/v1/ai_agent/repositories/create_test_case.rb`	How a test case is built (`status='pending'`)	`AbstractRepository` write pattern
BE	`app/api/frontend_service/v1/ai_agent/repositories/rate_test_case_question.rb`	Rating write (no aggregate today)	per-field update; extension point for recompute
BE	`app/workers/fetch_room_conversations_worker.rb`	Worker shell (fetch+extract+log only)	`sidekiq_options queue: :ai_agent, retry: false`; per-room `rescue`→Rollbar
BE	`app/core/repositories/chat_service/extract_conversation_pairs.rb`	Q/A pairing, system/non-text skip	customer-question → next agent text reply
BE	`app/core/repositories/chat_service/fetch_assigned_room_ids.rb`	Assigned-room fetch (`status:'assigned'`, `LIMIT`)	Hub HTTP + cursor pagination
BE	`lib/qontak_nlp/inference.rb`	`prediction(...)` shape + `timeout: 60`	`@http.call(method:'POST', url:, body:, open/read_timeout:)`
BE	`app/core/repositories/paths/get_tree_diagram_v3.rb`	`add_ai_agent` (L850–909) node assembly	where to add avg-confidence
BE	`app/api/frontend_service/v1/ai_agent/repositories/publish.rb`	Publish = set `active_version_id` (no gate)	extension point for gate
BE	`app/core/repositories/system_preferences/feature_flag.rb`	`FeatureFlag.enabled?(group_code, code)`	org-level flag mechanism
FE	`common/services/main/v1/ai-agents.ts`	5 test-case client methods (incl. `deleteTestCase`)	`$apiMain` + `endpoint.v1.ai_agents.test_cases.*`
FE	`store/ai-agent/interface.ts`	`TestCase`, `TestCaseQuestion`, `TestCaseDetail` types	typed payloads/responses
FE	`store/ai-agent/actions.ts`	`CREATE/FETCH/FETCH_DETAIL/DELETE/UPDATE_QUESTION`	`$patch` fetchStatus pending/resolved/rejected + optimistic rollback
FE	`modules/bot-automation/components/ai-agents/AiAgentsTable.vue`	list table loading/empty/pagination	`tableContent` + empty illustration
FE	`modules/bot-automation/components/AiAgentEditor.vue`	settings footer (Save button)	where Activate/gate lands (L1712–1733)
FE	`layouts/bot-automation.vue`	menu `listMenu` + flag gating (L209–349)	where Testing nav item lands
Design	`qontak-designer app/pages/bot-automation/testing/index.vue`	table columns + states (design ref)	6 columns: name/type/score/status/updated/actions

Existing Contracts to Reuse, Extend, or Replace (BE)

Contract	Status	Justification	Owner
`GET /api/v1/ai_agents/:id/test_cases`	reuse	exists, `set_role`	BOT
`POST /api/v1/ai_agents/:id/test_cases`	extend	exists; add `name` persist, status→`processing`, real pipeline	BOT
`GET /api/v1/ai_agents/:ai_agent_id/test_cases/:id`	reuse	exists (`GetAiAgentTestCaseDetail`, serializes `confidence_score`)	BOT
`PATCH .../test_cases/:test_case_id/questions/:question_id`	extend	exists; add aggregate recompute	BOT
`DELETE /api/v1/ai_agents/:id/test_cases/:test_case_id`	new-with-justification	chatbot-fe `deleteTestCase` calls it but no BE route exists (only `delete '/:id'` deletes the agent); PRD §6 soft delete needs it	BOT
`POST /api/v1/ai_agents/:id/publish`	extend	exists; add confidence gate (flagged)	BOT
`GET /api/v3/paths/:id/tree_diagram`	extend	exists; `add_ai_agent` add avg confidence	BOT
`FetchRoomConversationsWorker`	extend	exists; add sampling + shadow-gen + persistence + status	BOT
`lib/qontak_nlp/inference.rb#prediction`	reuse	live-predict path; call per question with throttle	AI squad
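Reusing the live predict path "per question with throttle" implies a rate limiter inside the worker. Decision D-5 names a token bucket with a default cap of 60 requests per minute, so a minimal sketch of that technique is given here. The class `TokenBucketSketch` and its interface are hypothetical, the injected clock exists only to make the example testable, and reading the cap from `SystemPreference` and the 429 backoff path are deliberately omitted.

```ruby
# Hypothetical token bucket; not code from the chatbot repo.
class TokenBucketSketch
  # rate_per_minute - steady-state requests allowed per minute (also the burst size)
  # clock           - callable returning seconds as a Float (monotonic by default)
  def initialize(rate_per_minute: 60,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @capacity = rate_per_minute.to_f
    @refill_per_second = rate_per_minute / 60.0
    @tokens = @capacity
    @clock = clock
    @last_refill = clock.call
  end

  # Takes one token if available. Returns true when the caller may
  # issue a request now, false when it should wait or requeue.
  def try_acquire
    refill
    return false if @tokens < 1.0

    @tokens -= 1.0
    true
  end

  private

  # Adds tokens for the time elapsed since the last call, up to capacity.
  def refill
    now = @clock.call
    @tokens = [@capacity, @tokens + (now - @last_refill) * @refill_per_second].min
    @last_refill = now
  end
end
```

A worker loop would call `try_acquire` before each predict call and sleep or requeue the question when it returns `false`. Note that an in-process bucket limits only one Sidekiq process; a cap shared across processes would need shared state, which this sketch does not attempt.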

Patterns to Follow (and where to find them)

Layer	Concern	Pattern in repo	Reference file	Deviation?
FE	State management	Pinia store w/ `fetchStatus` enum	`store/ai-agent/actions.ts`	none
FE	Error/optimistic	snapshot + rollback on reject	`store/ai-agent/actions.ts` `UPDATE_TEST_CASE_QUESTION`	none
FE	List loading/empty	`tableContent` + empty illustration	`modules/bot-automation/components/ai-agents/AiAgentsTable.vue`	none
FE	API client	`$apiMain` + `endpoint` map	`common/services/main/v1/ai-agents.ts`	none
BE	HTTP handler	Grape + `ResultMatcher` + `success_response`	`test_cases_controller.rb`	none
BE	Use case	`APIAbstractUseCase` + `Dry::Monads::Do.for(:result)`	`create_test_cases.rb`	none
BE	Repository write	`AbstractRepository#call`	`create_test_case.rb`	none
BE	Worker	`Sidekiq::Worker` + `sidekiq_options queue:` + per-item `rescue`→Rollbar	`fetch_room_conversations_worker.rb`, `ask_airene_predict_worker.rb`	none
BE	Feature flag	`SystemPreferences::FeatureFlag.enabled?`	`feature_flag.rb`	none
BE	Error shape	`ErrorException(message:[], code:, errors:, error_code:)`	`helpers/error_response_helpers.rb`	none
Cross	snake_case API → FE	FE consumes snake_case JSON directly	`store/ai-agent/interface.ts`	none

Reading Order for the Agent

chatbot/app/api/frontend_service/v1/ai_agent/test_cases_controller.rb — live routes + auth.
chatbot/app/api/frontend_service/v1/ai_agent/use_cases/create_test_cases.rb — create + enqueue.
chatbot/app/workers/fetch_room_conversations_worker.rb — the worker to extend (the core gap).
chatbot/app/core/repositories/chat_service/extract_conversation_pairs.rb — Q/A extraction.
chatbot/lib/qontak_nlp/inference.rb — the predict call to reuse for shadow gen.
chatbot/app/api/frontend_service/v1/ai_agent/repositories/rate_test_case_question.rb — recompute extension point.
chatbot/app/api/frontend_service/v1/ai_agent/repositories/publish.rb — gate extension point.
chatbot/app/core/repositories/paths/get_tree_diagram_v3.rb (add_ai_agent) — tree surface.
chatbot-fe/store/ai-agent/{actions,interface}.ts — the FE store/types already wired.
chatbot-fe/modules/bot-automation/components/ai-agents/AiAgentsTable.vue — list/empty/loading pattern to mirror.

Source Verification (anti-hallucination — required)

Layer	Anchor / contract	Verified by	Evidence
BE	`ai_agent_test_cases` schema	read migration	`db/migrate/20260512000001_create_ai_agent_test_cases.rb`: cols `ai_agent_history_id uuid NOT NULL`, `status string`, `confidence_score integer`, `type string`, `deleted_at`; uuid PK
BE	`ai_agent_test_case_questions` schema	read migration	`db/migrate/20260512000002_..._questions.rb`: `topic, question(text), answer(text), is_score(bool default false), score(int), scored_by(uuid)/_email/_name, scored_at, response_time(int), confidence(int), status, status_description(text), sources(jsonb default []), parameters(jsonb default {}), deleted_at`
BE	Models soft-delete	read	`app/models/ai_agent_test_case.rb` L4 `acts_as_paranoid`, L5 `has_paper_trail`; same in `ai_agent_test_case_question.rb`
BE	DB dialect / migrator	read	`config/database.yml` `adapter: postgresql`; `db/schema.rb` `ActiveRecord::Schema[7.1]`, `enable_extension "pgcrypto"`
BE	Routes + mount	read	`app/api/frontend_service/api.rb` L47-48 `mount V1::AiAgent::TestCasesController => '/v1/ai_agents'`; `config/routes.rb` mounts `APIBase => '/api/'` → full `/api/v1/ai_agents`
BE	3 live routes	read	`test_cases_controller.rb` L32 `get '/:id/test_cases'`, L75 `post`, L115 `patch '/:id/test_cases/:test_case_id/questions/:question_id'` — all `set_role(%w[owner supervisor admin])`
BE	No delete test-case route	grep	only `ai_agents_controller.rb` L237 `delete '/:id'` (deletes agent) — no test-case delete
BE	Create status `pending`	read	`repositories/create_test_case.rb` L19 `record.status = 'pending'`; use case enqueues `FetchRoomConversationsWorker.perform_async`
BE	Worker does NOT sample/generate/persist	read full file	`app/workers/fetch_room_conversations_worker.rb` L1-43: fetch rooms → extract pairs → `Rails.logger.info{...}`; no LLM, no question insert, no status update
BE	Queue `:ai_agent`	read	worker L5 `sidekiq_options queue: :ai_agent, retry: false`; `config/sidekiq.yml` lists `ai_agent` queue
BE	Extraction logic	read	`extract_conversation_pairs.rb` L40-66: skip `SYSTEM`; customer text → `pending_question` if `question?`; agent text reply → pair; non-text skipped
BE	Rating no aggregate	read	`repositories/rate_test_case_question.rb` L13-19 sets score/is_score/scored_by*/scored_at only; no `confidence_score` write
BE	Publish no gate	read	`repositories/publish.rb` L12-18 `@ai_agent.update!(active_version_id: @ai_agent.version_id)`; no threshold
BE	Tree diagram v3	read	route `app/api/frontend_service/v3/path.rb` L17 `get ':id/tree_diagram'`; `core/repositories/paths/get_tree_diagram_v3.rb` `add_ai_agent` L850-909 builds node sans confidence
BE	QontakNLP predict	read	`lib/qontak_nlp/inference.rb#prediction` `timeout: 60`, `@http.call(method:'POST', ...)`; `core/repositories/qontak_nlp/predict.rb` resolves timeout via system pref
BE	Chat Service fetch	read	`fetch_assigned_room_ids.rb` `Hub::ChatService::Rooms::List` `status:'assigned'`, `limit: LIMIT`; `fetch_room_messages.rb` `Messages::GetByRoom`
BE	Token fetch	read	`channel_integrations/get_tokens.rb` `access_token` from `chatbot_tokens_encrypted` (lockbox)
BE	Error/success shape	read	`helpers/success_response_helpers.rb` `{status, code, message, data, meta}`; `error_response_helpers.rb` `ErrorException(message:[], code:, errors:, error_code:)`
BE	Feature flag	read	`feature_flag.rb` `FeatureFlag.enabled?(group_code, code, default:)`; no `ai_agent_testing` flag exists yet
BE	Test commands	read	`bin/rspec_pipeline.sh` `RAILS_ENV=test bundle exec rspec spec/...`; `bitbucket-pipelines.yml` `bundle exec rubocop`; `Gemfile` has `brakeman`
BE	Existing specs	ls	`spec/api/frontend_service/v1/ai_agent/{create_test_cases,get_test_cases,rate_test_case_question}_spec.rb`; `spec/workers/fetch_room_conversations_worker_spec.rb`; `spec/core/repositories/chat_service/extract_conversation_pairs_spec.rb`
FE	Test-case API client	read	`common/services/main/v1/ai-agents.ts` L250-351: `createTestCase/getTestCases/getTestCaseDetail/deleteTestCase/updateTestCaseQuestion`; `endpoint.ts` L207-214 paths
FE	Types	read	`store/ai-agent/interface.ts` L268-365: `TestCase, CreateTestCasePayload, TestCaseQuestion, TestCaseDetail, UpdateTestCaseQuestionPayload(score:0\|1)`
FE	Pinia actions	read	`store/ai-agent/actions.ts` `CREATE_TEST_CASE`(745), `FETCH_TEST_CASES`(803), `FETCH_TEST_CASE_DETAIL`(849), `DELETE_TEST_CASE`(902), `UPDATE_TEST_CASE_QUESTION`(950, optimistic rollback)
FE	No Testing page yet	ls	`pages/bot-automation/` has actions/ai-agents/ai-agent[id]; no `/testing`; old UI in `modules/ai-agent/components/forms/ValidationDetailPanel.vue`
FE	Menu gating flag/sub	read	`layouts/bot-automation.vue` L209-349 `listMenu` gated by `rolloutAIAgentPreferences`/`aiAgentEnabled`/`isNewAIAgentEngine` — not roles
FE	Activate button absent	read	`modules/bot-automation/components/AiAgentEditor.vue` L1712-1733 footer shows "Save changes" only
FE	Design system	read	chatbot-fe `package.json` `@mekari/pixel3@^1.0.12`; qontak-designer `@mekari/pixel3@1.0.13-dev.0`
FE	Test commands	read	chatbot-fe `package.json`: `test: vitest run`, `test:e2e: playwright test`, `lint`, `build: nuxt build`
Design	qontak-designer is static prototype	read/grep	no `api/` folder, no `$fetch/useFetch`; `app/composables/useAuth.ts` mock localStorage, no roles

Design ↔ Code Mapping (frontend half)

Figma frame / component	Implementing file (chatbot-fe)	Reuse vs new	Tokens	Backing API	Deviation
Testing page (list)	`pages/bot-automation/testing/index.vue` + `modules/bot-automation/components/testing/TestCasesTable.vue`	new (mirror `AiAgentsTable.vue`)	`color.surface.*`, `space.*`, `text.body*`	`GET .../test_cases`	none (pattern-faithful)
Generate modal + Inbox drawer	`modules/bot-automation/components/testing/{GenerateTestCaseModal,GenerateFromInboxDrawer}.vue`	new (port from qontak-designer layout)	Pixel3 `MpModal`/`MpDrawer`	`POST .../test_cases`	adds version selector (prototype lacks it — see §5 Q-A)
Generating modal	`.../testing/TestCaseGeneratingModal.vue`	new	`MpModal` + progress	poll `GET .../test_cases`	none
Comparison + question list	`.../testing/TestCaseComparison.vue`, `QuestionList.vue`	new (no prototype exists)	`MpAccordion`, `MpBadge`	`GET .../test_cases/:id`	reference old `modules/ai-agent/.../ValidationDetailPanel.vue` for layout
Confidence meter	`.../testing/ConfidenceMeter.vue`	new	`MpBadge`/progress	from detail payload	none
Activate gate	`modules/bot-automation/components/AiAgentEditor.vue` (footer)	extend	`MpButton`	`POST .../publish`	none

The Comparison/detail view has no qontak-designer prototype — flag for Design QA before the chunk lands (§5 Q-A).

Detail 2.1 — Architecture (mermaid)

End-to-end component diagram

flowchart TB
  user([SPV/Admin]) --> page["chatbot-fe Testing page"]
  page --> store["Pinia ai-agent store"]
  store --> client["ai-agents.ts client"]
  client --> ctrl["/api/v1/ai_agents/.../test_cases/"]
  ctrl --> ucCreate[CreateTestCases UC]
  ucCreate --> repoCreate[(CreateTestCase repo)]
  repoCreate --> db[("ai_agent_test_cases")]
  ucCreate --> q[["Sidekiq :ai_agent"]]
  q --> worker[FetchRoomConversationsWorker]
  worker --> chat["ChatService repos"] --> hub(["Hub Chat Service"])
  worker --> nlp["QontakNlp predict"] --> ai(["AI service"])
  worker --> dbq[("ai_agent_test_case_questions")]
  ctrl --> ucRate[RateTestCaseQuestion UC] --> dbq
  ucRate --> agg[["recompute confidence_score"]] --> db
  ctrl --> tree["GetTreeDiagramV3#add_ai_agent"] --> db

Data model (mermaid erDiagram)

erDiagram
  AI_AGENTS ||--o{ AI_AGENT_HISTORIES : versions
  AI_AGENTS ||--o{ AI_AGENT_TEST_CASES : has
  AI_AGENT_HISTORIES ||--o{ AI_AGENT_TEST_CASES : binds
  AI_AGENT_TEST_CASES ||--o{ AI_AGENT_TEST_CASE_QUESTIONS : has
  AI_AGENT_TEST_CASES {
    uuid id PK
    uuid ai_agent_id FK
    uuid ai_agent_history_id FK
    int organization_id
    string status
    int confidence_score
    string type
    datetime deleted_at
  }
  AI_AGENT_TEST_CASE_QUESTIONS {
    uuid id PK
    uuid ai_agent_test_case_id FK
    string topic
    text question
    text answer
    int score
    bool is_score
    int confidence
    int response_time
    string status
    text status_description
    jsonb sources
    jsonb parameters
    datetime deleted_at
  }

State machine — test-case status

stateDiagram-v2
  [*] --> pending: POST create
  pending --> processing: worker starts
  processing --> completed: all questions generated
  processing --> failed: fatal worker error
  completed --> completed: ratings update (no status change)
  failed --> processing: retry (re-enqueue)
  completed --> [*]

State machine — question status

stateDiagram-v2
  [*] --> pending: row created
  pending --> processing: shadow-gen starts
  processing --> completed: LLM answer stored
  processing --> failed: LLM error (status_description set)
  completed --> [*]
  failed --> [*]

Branch & skip flow — sampling & filtering

flowchart TD
  start([worker: rooms fetched]) --> elig{eligible rooms count}
  elig -- "< 10" --> all[use 100% of rooms]
  elig -- ">= 10" --> sample["random 10%"]
  sample --> cap{"> 50-70 cap?"}
  cap -- yes --> capped[truncate to cap]
  cap -- no --> kept[keep sample]
  all --> extract[ExtractConversationPairs]
  capped --> extract
  kept --> extract
  extract --> nonText{text-only Q/A?}
  nonText -- no --> skip[skip room, not counted]
  nonText -- yes --> gen[shadow-generate + persist question]
  skip --> done([batch continues])
  gen --> done
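
The sampling branch above can be sketched as follows. This is an illustration only: the worker itself is Ruby, and `sampleSize` / `sampleRooms` are hypothetical names. Two details are assumptions because the RFC does not fix them: 10% is rounded up, and the cap (stated as 50-70) is passed in by the caller.

```typescript
// Hypothetical sketch of the sampling rule; not the worker implementation.
type Rng = () => number; // returns a float in [0, 1)

// How many rooms to keep for a given pool size and cap.
function sampleSize(eligible: number, cap: number): number {
  if (eligible < 10) return eligible;          // small pools: use 100% of rooms
  const tenPercent = Math.ceil(eligible / 10); // assumption: round 10% up
  return Math.min(tenPercent, cap);            // truncate to the cap
}

// Picks a uniform random subset without mutating the input array.
function sampleRooms(roomIds: string[], cap: number, rng: Rng = Math.random): string[] {
  const pool = roomIds.slice();
  const k = sampleSize(pool.length, cap);
  // Partial Fisher-Yates shuffle: after k steps the first k slots hold the sample.
  for (let i = 0; i < k; i++) {
    const j = i + Math.floor(rng() * (pool.length - i));
    const tmp = pool[i];
    pool[i] = pool[j];
    pool[j] = tmp;
  }
  return pool.slice(0, k);
}
```

Injecting `rng` keeps the selection reproducible in specs; production code would use the default.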

Detail 2.2 — Sequence (mermaid, end-to-end incl. failure)

Happy path — generate test case (async batch with shadow gen)

sequenceDiagram
  actor U as SPV (chatbot-fe)
  participant LB as LB / API gateway
  participant API as chatbot Grape API
  participant UC as CreateTestCases
  participant DBW as Postgres primary
  participant Q as Sidekiq :ai_agent
  participant W as FetchRoomConversationsWorker
  participant HUB as Hub Chat Service
  participant NLP as QontakNLP AI service

  U->>LB: POST /api/v1/ai_agents/:id/test_cases {type, version_id, name}
  LB->>API: HTTP
  API->>API: set_role(owner/supervisor/admin)
  API->>UC: handle
  UC->>DBW: INSERT ai_agent_test_cases (status='processing')
  UC->>Q: FetchRoomConversationsWorker.perform_async
  UC-->>API: 201 {data: test_case}
  API-->>U: 201 (UI shows generating modal, polls status)
  Note over Q,W: async — worker picks up within seconds
  W->>HUB: GET assigned rooms (status=assigned, 90d, limit=100)
  HUB-->>W: room_ids
  W->>W: sample 10% (cap 50-70; all if <10)
  loop per sampled room
    W->>HUB: GET messages by room
    HUB-->>W: messages
    W->>W: ExtractConversationPairs (text-only)
    loop per Q/A pair
      W->>DBW: INSERT question (status='processing', parameters.human_answer)
      W->>NLP: POST predict {message: question}  (NOT send_message)
      Note right of NLP: timeout 60s; throttle for TPM/RPM
      NLP-->>W: {answer, confidence, sources, response_time}
      W->>DBW: UPDATE question (answer, confidence, sources, status='completed')
    end
  end
  W->>DBW: UPDATE test_case status='completed'

Failure path — LLM error on one question (batch continues)

sequenceDiagram
  participant W as Worker
  participant DBW as Postgres primary
  participant NLP as QontakNLP

  W->>DBW: INSERT question (status='processing')
  W->>NLP: POST predict
  Note right of NLP: timeout after 60s / 5xx
  NLP--xW: error
  W->>W: Rollbar.error(test_case_id, room_id)
  W->>DBW: UPDATE question status='failed', status_description=error
  Note over W: continue with next question; test_case still reaches 'completed' (partial)

Failure path — Chat Service room-list unavailable (whole batch)

sequenceDiagram
  participant W as Worker
  participant HUB as Hub Chat Service
  participant DBW as Postgres primary
  W->>HUB: GET assigned rooms
  HUB--xW: 5xx / timeout
  W->>W: Rollbar.error
  W->>DBW: UPDATE test_case status='failed'
  Note over W: UI surfaces error + Retry (AITEST-S02/ERR-1)

Detail 2.3 — Database Model (DDL)

No new tables. Both tables exist (migrations 20260512000001, 20260512000002; Postgres, ActiveRecord::Migration[7.1], uuid PK via pgcrypto). Partial/failed surfacing needs no schema change because status_description already exists per the schema. This phase adds one additive migration (the name column, specified below). No destructive change.

Current shape (verified — for the agent's reference, not re-created):

-- db/migrate/20260512000001_create_ai_agent_test_cases.rb (EXISTS)
CREATE TABLE ai_agent_test_cases (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  ai_agent_history_id uuid NOT NULL,
  ai_agent_id uuid NOT NULL,
  organization_id integer NOT NULL,
  company_id varchar NOT NULL,
  status varchar,
  confidence_score integer,
  type varchar,
  deleted_at timestamp,
  created_at timestamp NOT NULL,
  updated_at timestamp NOT NULL
);
-- indexes: organization_id, company_id, ai_agent_history_id, ai_agent_id, status, type

-- db/migrate/20260512000002_create_ai_agent_test_case_questions.rb (EXISTS)
CREATE TABLE ai_agent_test_case_questions (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  ai_agent_test_case_id uuid NOT NULL,  -- FK ON DELETE CASCADE
  organization_id integer NOT NULL,
  company_id varchar NOT NULL,
  topic varchar, question text, answer text,
  is_score boolean DEFAULT false, score integer,
  scored_by uuid, scored_by_email varchar, scored_by_name varchar, scored_at timestamp,
  started_at timestamp, completed_at timestamp,
  response_time integer, confidence integer,
  status varchar, status_description text,
  sources jsonb DEFAULT '[]', parameters jsonb DEFAULT '{}',
  deleted_at timestamp, created_at timestamp NOT NULL, updated_at timestamp NOT NULL
);
-- indexes: ai_agent_test_case_id, organization_id, company_id, status, score, topic

Additive migration (this phase — required, not conditional). The FE TestCase type carries name (store/ai-agent/interface.ts) and the Generate drawer submits it, but BE create does not persist a name today (repositories/create_test_case.rb sets no name). Chunk 1 adds and persists the column so the create round-trip and the list name column are non-empty (resolves the §2.G partial row, REV-2):

-- db/migrate/2026XXXXXXXXXX_add_name_to_ai_agent_test_cases.rb
ALTER TABLE ai_agent_test_cases ADD COLUMN name varchar;
CREATE INDEX index_ai_agent_test_cases_on_name ON ai_agent_test_cases (name);

name is nullable for backward compatibility (existing rows have none); CreateTestCase persists params[:name] (length-bounded ≤ 24, validated in the use-case contract).
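
A client-side pre-check mirroring that bound might look like the sketch below. It is hedged: the authoritative validation is the BE use-case contract, `validateCreateTestCase` is a hypothetical name, and treating all three fields as required on the FE is an assumption (the column itself is nullable).

```typescript
// Hypothetical FE pre-check; the BE contract remains the source of truth.
interface CreateTestCaseInput {
  type: string;
  version_id: string;
  name: string;
}

const MAX_NAME_LENGTH = 24; // bound stated for params[:name] in this RFC

// Returns a list of human-readable problems; an empty list means "submit".
function validateCreateTestCase(input: CreateTestCaseInput): string[] {
  const errors: string[] = [];
  if (input.type.trim() === "") errors.push("type is required");
  if (input.version_id.trim() === "") errors.push("version_id is required");
  const name = input.name.trim();
  if (name === "") {
    errors.push("name is required"); // assumption: FE requires a name
  } else if (name.length > MAX_NAME_LENGTH) {
    errors.push("name must be at most " + MAX_NAME_LENGTH + " characters");
  }
  return errors;
}
```

For example, the name "War room sample #1" (18 characters) passes, while a 25-character name is rejected before the request is sent.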

Cardinality: ~1 test case per agent per validation run; questions 10–70 per case.
Growth: bounded by cap (≤70 questions/case). PII: question, answer, parameters.human_answer contain customer/agent text → see §3 Compliance.
Retention: soft-delete (deleted_at); hard-purge window TBD (Open Q #8).

Per-status lifecycle — ai_agent_test_cases.status:

Status	Visibility	Retention	Restore	Transitions allowed
`pending`	list (transient)	until processed	n/a	→ processing
`processing`	list w/ spinner	during batch	n/a	→ completed / failed
`completed`	default list	until soft-deleted	restore via paranoid	(ratings only)
`failed`	list w/ error	until soft-deleted	re-run (re-enqueue)	→ processing
(soft-deleted)	hidden	until hard-purge (TBD)	`restore` (paranoid)	—

Per-status lifecycle — ai_agent_test_case_questions.status:

Status	Visibility	Retention	Restore	Transitions
`pending`	n/a (transient)	during batch	n/a	→ processing
`processing`	per-question spinner	during batch	n/a	→ completed / failed
`completed`	comparison shown	with parent	with parent	(rating only)
`failed`	"could not generate"	with parent	re-run	→ processing
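
The two lifecycle tables can be read as transition maps. The sketch below is illustrative (names are hypothetical, and the backend is Ruby). Note one assumption: the question state diagram shows `failed` as terminal, while the table above allows `failed` → `processing` on re-run; the sketch follows the table.

```typescript
// Hypothetical transition guard derived from the per-status lifecycle tables.
type Status = "pending" | "processing" | "completed" | "failed";
type TransitionTable = Record<Status, Status[]>;

const TEST_CASE_TRANSITIONS: TransitionTable = {
  pending: ["processing"],
  processing: ["completed", "failed"],
  completed: [],          // ratings update scores, never the status
  failed: ["processing"], // retry re-enqueues the worker
};

const QUESTION_TRANSITIONS: TransitionTable = {
  pending: ["processing"],
  processing: ["completed", "failed"],
  completed: [],          // rating only
  failed: ["processing"], // re-run, per the table (assumption noted above)
};

// True when the table lists `to` as a legal successor of `from`.
function canTransition(table: TransitionTable, from: Status, to: Status): boolean {
  return table[from].indexOf(to) !== -1;
}
```

A guard of this shape lets both the worker and the rating path reject illegal writes (for example `completed` → `processing`) in one place.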

Detail 2.4 — APIs

Base: /api/v1/ai_agents (verified mount, §2.0). All endpoints enforce set_role(%w[owner supervisor admin]). Success responses have the shape {status, code, message, data, meta?}; errors are raised as ErrorException.

Outbound endpoints (consumers call us)

Endpoint	Method	AuthN/AuthZ	Request	Response	Status codes	Idempotency	Versioning	Reuse?
`/api/v1/ai_agents/:id/test_cases`	GET	api_auth + `set_role`	query: `page,limit,query,status,order_by,order_direction`	`{data:[TestCase], meta}`	200, 403	n/a (read)	v1	reuse
`/api/v1/ai_agents/:id/test_cases`	POST	api_auth + `set_role`	`{type, version_id, name}`	`{data: TestCase}` (status `processing`)	201, 404 (version), 422, 403	client-side dedupe; server idempotent per `(agent,version,name)` recommended	v1	extend
`/api/v1/ai_agents/:ai_agent_id/test_cases/:id`	GET	api_auth + `set_role`	path	`{data: TestCaseDetail{questions[]}}` incl. `confidence_score`, per-question `answer`/`parameters.human_answer`/`confidence`/`sources`/`response_time`/`score`/`status`	200, 404	n/a	v1	reuse
`/api/v1/ai_agents/:id/test_cases/:test_case_id/questions/:question_id`	PATCH	api_auth + `set_role`	`{score: 0\|1}`	`{data: question}` + recomputed `confidence_score`	200, 404, 422 (score not 0/1)	last-write-wins per question	v1	extend
`/api/v1/ai_agents/:id/test_cases/:test_case_id`	DELETE	api_auth + `set_role`	path	`{status:"success", data:{id}}` (soft delete)	200, 404, 403	idempotent in effect (already-deleted → 404, per the DELETE contract below)	v1	new-with-justification (FE client exists; BE route missing)
`/api/v1/ai_agents/:id/publish`	POST	api_auth + `set_role`	`{override_reason?}`	`{data: ai_agent}`	200, 404, 422 (below threshold when gate on)	idempotent	v1	extend (add gate)
`/api/v3/paths/:id/tree_diagram`	GET	api_auth + `set_role`	path	`{data: tree}` w/ `ai_agent.avg_confidence_score`	200, 404	n/a	v3	extend

Inbound webhooks (other services call us)

N/A — reason: this phase introduces no inbound webhooks. Shadow generation is a synchronous outbound call from the worker to QontakNLP (no callback); room/message fetch is outbound to Hub Chat Service.

DELETE contract (REV-3) — full specification (this is a new BE route; chatbot-fe's deleteTestCase already calls it):

Request: no body. Path params :id (ai_agent), :test_case_id.
Behavior: soft delete via acts_as_paranoid — DeleteTestCase use case loads the test case scoped to current_user['chatbot_organization_id'] and calls .destroy (paranoid sets deleted_at). Child ai_agent_test_case_questions are soft-deleted via the model's dependent: :destroy (also paranoid). No hard delete.
Response: 200 {status:"success", code:200, message:"OK", data:{ id }}.
Status codes: 200 on success; 404 "Test case not found" when the id does not exist or belongs to another org (cross-tenant reads return 404, not 403, to avoid id enumeration); 403 only when set_role rejects the role outright; idempotent — deleting an already-soft-deleted case returns 404 (the paranoid default scope hides it).
Restore: out of scope for the API this phase; soft-deleted rows are restorable via acts_as_paranoid .restore if a future undo surface is added (hard-purge window is Open Q #8). No restore endpoint is exposed now.

Example create request/response:

// POST /api/v1/ai_agents/7e.../test_cases
{ "type": "inbox", "version_id": "a1...", "name": "War room sample #1" }
// 201
{ "status":"success","code":201,"message":"OK",
  "data": { "id":"c3...","type":"inbox","version_id":"a1...","name":"War room sample #1","status":"processing","score":null } }

Detail 2.A — UI Contract

ConfidenceMeter.vue (new)

Figma: node 16514-155786 · file chatbot-fe/modules/bot-automation/components/testing/ConfidenceMeter.vue
Props:

interface ConfidenceMeterProps {
  scorePercent: number;        // 0..100, = (thumbsUp / total) * 100
  threshold?: number;          // default 80
  totalRated: number;
  totalSample: number;
}

State owner: derived from store/ai-agent testCaseDetail (no local source of truth).
Events: none emitted; analytics ai_validation_completed fires from parent when scorePercent >= threshold.
Conditional render: < threshold → "Low Confidence" (warning); >= threshold → "Ready to Launch" (success).
A11y: role="progressbar", aria-valuenow/min/max, label "Confidence meter".
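
The conditional render and A11y attributes above can be derived from the props in one pure function. A minimal sketch, assuming a hypothetical helper name `meterState` and that out-of-range or non-finite inputs are clamped (the RFC does not say how they are handled):

```typescript
// Hypothetical derivation of ConfidenceMeter display state from its props.
function meterState(scorePercent: number, threshold: number = 80) {
  // Assumption: clamp to 0..100 and treat non-finite input as 0.
  const value = Number.isFinite(scorePercent)
    ? Math.min(100, Math.max(0, scorePercent))
    : 0;
  const ready = value >= threshold;
  return {
    label: ready ? "Ready to Launch" : "Low Confidence",
    tone: ready ? "success" : "warning",
    aria: {
      role: "progressbar",
      "aria-valuenow": value,
      "aria-valuemin": 0,
      "aria-valuemax": 100,
      "aria-label": "Confidence meter",
    },
  };
}
```

Keeping this outside the component makes the threshold boundary (79.9 vs 80) easy to cover in a vitest unit test.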

TestCaseComparison.vue (new)

Figma: node 16514-155786 · file .../testing/TestCaseComparison.vue
Props:

interface TestCaseComparisonProps {
  question: TestCaseQuestion;  // from store/ai-agent/interface.ts
  readonly: true;              // comparison is read-only (NEG-3)
}

Events: @rate { questionId: string; score: 0 | 1 } → dispatches UPDATE_TEST_CASE_QUESTION.
Conditional: question.status === 'failed' → AI panel shows "could not generate" (S05/ERR-1); else human-left / AI-right with confidence/response_time/sources.
A11y: thumbs are <button> with aria-pressed.

Detail 2.B — Data-Fetching Strategy

Library: Pinia store + $apiMain (ofetch) — existing (store/ai-agent/actions.ts).
Cache key: store state slices testCases, testCaseDetail (no external cache lib).
TTL / refetch: refetch on mount; poll GET .../test_cases every ~3 s while any row status ∈ {pending, processing} (S08/AC-2), stop at completed/failed.
SWR: no — explicit fetchStatus enum (pending/resolved/rejected).
Optimistic updates: yes for rating — UPDATE_TEST_CASE_QUESTION snapshots questions and rolls back on reject (existing). On success, dispatch a meter recompute read (the BE returns recomputed aggregate).
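
The polling rule and the optimistic rating snapshot can be sketched as two pure helpers. The names are hypothetical and this is not the existing store code; it also assumes `is_score` means "this question has been rated".

```typescript
// Hypothetical helpers for the data-fetching strategy; not the store implementation.
type Status = "pending" | "processing" | "completed" | "failed";

interface TestCaseRow { id: string; status: Status }
interface Question { id: string; score: 0 | 1 | null; is_score: boolean }

// Poll only while at least one row is still being generated.
function shouldKeepPolling(rows: TestCaseRow[]): boolean {
  return rows.some((r) => r.status === "pending" || r.status === "processing");
}

// Applies a rating to a copy of the list and returns a rollback to the snapshot.
function applyRatingOptimistically(questions: Question[], questionId: string, score: 0 | 1) {
  const snapshot: Question[] = questions.map((q) => ({ ...q }));
  const next: Question[] = questions.map((q) =>
    q.id === questionId ? { ...q, score, is_score: true } : q
  );
  return { next, rollback: (): Question[] => snapshot };
}
```

On a rejected PATCH the caller would commit `rollback()` to the store; on success it would keep `next` and then read the recomputed aggregate from the response.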

Detail 2.C — UI State Matrix

Surface	Loading	Empty	Error	Partial	Success
Testing list	skeleton rows (mirror `AiAgentsTable`)	"No test cases yet" + Generate CTA	blank slate "Couldn't load" + Retry; log `ai_workspace_load_failed`	some rows `processing` (poll)	table rows
Generate drawer	submit spinner	n/a	inline validation (name/version)	n/a	drawer closes → generating modal
Generating modal	progress bar + count	n/a	`failed` badge + Retry	partial count shown	→ list shows `completed`
Detail/comparison	skeleton	"no questions in this test case"	retry	some questions `failed` (per-card)	human/AI panels
Confidence meter	0% until first rating	0% (no ratings)	n/a (derived)	partial as more rated	meter + Ready-to-Launch

Detail 2.D — Data Integrity Matrix

Write path	Transaction scope	Partial failure	Idempotency	Consistency	Duplicate handling	Stale read
Create test case	single INSERT	422 if save fails (no worker enqueued)	recommend unique-ish `(agent,version,name)` guard	strong (single row)	second click → second row (FE disables button)	n/a
Worker: per-question INSERT+UPDATE	per-question (not one big trx)	failed question → `status=failed`, batch continues	re-run replaces by deleting prior questions for the case (re-enqueue)	eventual (batch)	re-run guard: clear existing questions before regen	poll reflects within ~3 s
Rate question + recompute	question UPDATE then aggregate UPDATE in one trx	rollback both on failure → FE restores prior	last-write-wins per question	strong within request	repeated same score → idempotent	meter refetched after write
Publish (gate)	single UPDATE `active_version_id`	422 if `<80` & gate on	idempotent	strong	n/a	n/a

Detail 2.E — Concurrency Collision Map

Resource	Writers	Collision	Resolution	On failure
`ai_agent_test_cases.confidence_score`	concurrent raters on same case	two thumbs near-simultaneously	recompute reads current question scores in-trx (no stored delta)	last recompute wins; value is deterministic from question rows
`ai_agent_test_case_questions` of a case	worker (regen) vs rater	rate while re-run in flight	re-run sets case `processing` → FE disables rating; reject rating with 409/422 if case not `completed`	FE shows "regenerating"
`ai_agents.active_version_id`	publish vs publish	double activate	DB row update idempotent	last write wins
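
The first row's resolution works because the aggregate is a pure function of the question rows, so the order of concurrent recomputes does not matter. A sketch of that function follows; it is illustrative only (the real recompute would run in Ruby inside the rating transaction). Two assumptions are not confirmed by the RFC: the denominator is rated questions only, and the result is rounded to the nearest integer to fit the integer column.

```typescript
// Hypothetical recompute of ai_agent_test_cases.confidence_score from question rows.
interface RatedQuestion {
  score: 0 | 1 | null;
  is_score: boolean;
  deleted_at: string | null;
}

// Returns an integer 0..100, or null when nothing has been rated yet.
function recomputeConfidenceScore(questions: RatedQuestion[]): number | null {
  const rated = questions.filter(
    (q) => q.deleted_at === null && q.is_score && q.score !== null
  );
  if (rated.length === 0) return null; // matches the "no ratings" default
  const thumbsUp = rated.filter((q) => q.score === 1).length;
  // Assumption: denominator is rated rows; multiply first to avoid float drift.
  return Math.round((thumbsUp * 100) / rated.length);
}
```

Because no delta is stored, two raters committing near-simultaneously both converge on the same value once the later recompute reads the committed rows.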

Detail 2.F — Async Job / Event Consumer Spec

Job	Trigger	Input	Retry	DLQ	Concurrency	Idempotency	Per-msg timeout	Poison handling
`FetchRoomConversationsWorker` (extend)	`perform_async` from `CreateTestCases`	`{test_case_id, organization_id}`	`retry: false` today → change to bounded retry (e.g. 3, exp backoff) for transient Hub/NLP errors; set `failed` on exhaustion	none (Sidekiq dead set; alert on failure)	queue `:ai_agent` (`sidekiq.yml` concurrency 5 staging / 10 prod)	re-run clears prior questions for `test_case_id` before regen	per-question NLP `read/open_timeout: 60s`	fatal → `Rollbar.error` + test_case `status=failed`; per-room/per-question errors skip-and-continue (existing pattern)

Detail 2.F.1 — Responsibility Boundary Matrix

Step	Owning squad / service	Inbound trigger	Outbound effect	Failure handler	PRD anchor
1. Create test case	BOT (chatbot API)	SPV POST	row + worker enqueue	404/422	§9 #1, S01
2. Fetch assigned rooms	Platform / Hub Chat Service	worker	room_ids	Rollbar + case `failed`	§15, S02
3. Sample 10%/cap	Data (algorithm) / BOT (impl)	worker	sampled rooms	n/a (deterministic)	§15, S02
4. Extract Q/A	BOT (`ExtractConversationPairs`)	worker	pairs	room skipped	S03
5. Shadow generate	AI squad (QontakNLP)	worker per question	answer/confidence/sources	question `failed`, continue	§15, S04
6. Persist questions + status	BOT	worker	DB rows	partial; case `completed`	S04, S08
7. Rate + recompute	BOT	SPV PATCH	aggregate score	rollback	S06
8. Gate publish	BOT	SPV POST publish	activate or 422	422 below threshold	S07
9. Tree-diagram avg	BOT	tree-diagram read	node score	"no score yet"	S10

Step 3 ownership (Data vs BOT) and Step 5 throttling contract (TPM/RPM) are the two cross-squad items to confirm before build (Open Q #5, §5).

Detail 2.F.2 — State Surface Contract

Entity	State field / event	Defaults	Updated by	Read via	Stale window
Test case	`status`	`pending`	worker	`GET .../test_cases` (poll)	~3 s (poll interval)
Test case	`confidence_score`	`null`	rate recompute	detail + list	immediate (write-through)
Question	`status` / `status_description`	`pending`	worker shadow-gen	detail	batch duration
Question	`score` / `is_score`	`null` / false	`RateTestCaseQuestion`	detail	immediate
AI agent (tree)	`avg_confidence_score` (computed)	"no score yet"	`GetTreeDiagramV3`	tree-diagram	per request

Detail 2.G — Cross-Layer Contract Verification

Endpoint	BE response schema	FE expected schema	Match?	Gaps
`GET .../test_cases`	`{status,code,message,data:[...],meta}`	`TestCasesResponse {data: TestCase[]}`	yes	FE reads `data`; `meta` optional
`POST .../test_cases`	`{data: test_case}` (snake_case, incl. `name`)	`CreateTestCasePayload {agent_id,type,name,version_id}` → `CreateTestCaseResponse{data:TestCase}`	yes (after chunk 1)	resolved (REV-2): `name` column added + persisted in §2.3/§4.D chunk 1 — no longer conditional
`GET .../test_cases/:id`	`{data: {..., questions:[...]}}` incl. `confidence_score`, `parameters.human_answer`	`TestCaseDetailResponse {data: TestCaseDetail{questions}}`	yes	FE `TestCaseQuestion` fields align 1:1 with schema
`PATCH .../questions/:id`	`{data: question}` + recomputed aggregate	`UpdateTestCaseQuestionPayload {score:0\|1}`	yes (after chunk 5)	aggregate recompute is in-scope work (§4.D chunk 5); meter is static until it lands
`DELETE .../test_cases/:id`	`200 {data:{id}}` soft delete (full contract §2.4)	`deleteTestCase` client exists	yes (after chunk 6)	resolved (REV-3): BE route + contract specified in §2.4; built in §4.D chunk 6

All rows now reach Match? = yes once their named execution chunk lands — there is no silent cross-layer divergence. The three former gaps (name persistence → chunk 1, aggregate recompute → chunk 5, delete endpoint → chunk 6) are explicit in-scope work.

Detail 2.H — End-to-End Data Flow

SPV clicks Generate → GenerateFromInboxDrawer @confirm {name, version} → store CREATE_TEST_CASE → ai-agents.ts POST /api/v1/ai_agents/:id/test_cases → CreateTestCases UC → INSERT (status processing) + Sidekiq enqueue → 201 → FE generating modal polls GET .../test_cases → worker (sample → extract → per-question NLP predict → persist) → status completed → SPV opens detail GET .../test_cases/:id → TestCaseComparison renders → @rate PATCH .../questions/:id → recompute confidence_score → meter updates → at ≥80% Activate enabled → POST .../publish.

Side effects: PaperTrail versions on test_cases/questions; analytics events (§3); Rollbar on errors.
Ownership: FE (chatbot-fe) owns step 1 (the Generate click), detail render, and rating UI; BE (chatbot) owns create/worker/rate/publish/tree; Platform (Hub) owns rooms; AI (NLP) owns predict.

Detail 2.I — Scope Boundaries

BE create: chatbot/app/workers/fetch_room_conversations_worker.rb (extend), new repos/services for sampling + shadow-gen + persistence under app/api/frontend_service/v1/ai_agent/, extend rate_test_case_question.rb + publish.rb + get_tree_diagram_v3.rb, new DELETE route, one additive migration.
BE modify: repositories/create_test_case.rb (persist name, status processing), test_cases_controller.rb (add delete route + name param).
BE NOT touched: live SendMessageWorker/inbox send path (must remain uncalled), ai_agent_histories schema.
FE create: chatbot-fe/pages/bot-automation/testing/index.vue + modules/bot-automation/components/testing/* (Table, GenerateModal, GenerateDrawer, GeneratingModal, Comparison, QuestionList, ConfidenceMeter).
FE modify: layouts/bot-automation.vue (nav item), modules/bot-automation/components/AiAgentEditor.vue (Activate gate), bot-flow node (tree confidence badge).
FE NOT touched: legacy modules/ai-agent/components/forms/Validation*.vue (old module — reference only, not extended).
Shared: Pinia store/ai-agent already has the actions/types — extend, don't fork. @mekari/pixel3 components reused.

Detail 2.J — Asset Inventory

Asset	Type	Source	Format & sizes	Path
"No test cases yet" empty illustration	illustration	reuse existing (`/images/not-found-search-illustration.png` used by `AiAgentsTable`) or new export	PNG @1x/2x	`chatbot-fe/public/images/`
thumbs up/down, info icons	icon	`@mekari/pixel3` `MpIcon`	SVG (DS)	n/a (DS)

No new fonts/lotties. Any net-new illustration for the comparison empty/failed state flagged for Design QA (§5 Q-A).

3. High-Availability & Security

The Testing pipeline is off the live request path: it runs in the isolated :ai_agent Sidekiq queue, reading from Hub Chat Service and calling QontakNLP. If Hub or NLP is slow/down, only batches degrade (the test case goes failed and is retriable) — live inbox and live AI answering are unaffected. Reads should target a replica where the chatbot DB topology offers one (PRD §6); writes go to primary.

Performance Requirement

Frontend: list LCP < 2.5 s; rating INP < 200 ms; CLS < 0.1; bundle delta small (reuses Pixel3 + existing store). Browser support per chatbot-fe baseline; web only.
Backend: a ~50-item batch completes in ≈2–5 min (PRD §6); per-question NLP p99 governed by qontak_nlp_prediction_timeout (default 60 s). Worker concurrency bounded by queue config (5 staging / 10 prod). No added live-path RPS.
NLP throttle contract (REV-1). Per-question shadow-generation calls are paced by an in-worker token-bucket so the :ai_agent batch never starves live prediction traffic on the shared AI service:
- Ceiling: a per-org RPM cap read from SystemPreference (group_code: 'engine', code: 'ai_agent_testing_nlp_rpm', default 60 req/min — i.e. one question/sec, well within a ~50–70 item batch's 2–5 min budget). The TPM dimension is bounded indirectly by the cap (PRD Open Q #5 — AI squad confirms the production ceiling before beta; default is conservative).
- Pacing: acquire a bucket token before each QontakNlp::Predict call; if empty, sleep until refill (bucket lives in the worker process; batch is single-worker per test case so no cross-process coordination is needed this phase).
- On HTTP 429 / rate-limit from the AI service: exponential backoff (e.g. 1 s → 2 s → 4 s, max 3 attempts per question) and re-queue the question for a later pass; on attempts-exhausted, mark the question status='failed' (status_description='rate_limited') and continue the batch (consistent with §2.2 failure path). A 429 never fails the whole test case.
- This makes Decision D-5 fully specified (the throttle was previously deferred).
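
The pacing and backoff rules above can be illustrated with a small token bucket. This is a sketch, not the worker code (which is Ruby): `TokenBucket` and `backoffDelaysMs` are hypothetical names, the clock is injected so the example is deterministic, and the burst size is an assumption (a burst of 1 gives strict one-per-second pacing at 60 req/min).

```typescript
// Hypothetical in-process token bucket for per-question NLP pacing.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private burst: number;
  private msPerToken: number;

  constructor(requestsPerMinute: number, burst: number, nowMs: number) {
    this.burst = burst;
    this.msPerToken = 60000 / requestsPerMinute;
    this.tokens = burst; // assumption: start with a full bucket
    this.lastRefillMs = nowMs;
  }

  // Returns 0 when a token was taken, otherwise the ms to sleep before retrying.
  tryAcquire(nowMs: number): number {
    const elapsed = Math.max(0, nowMs - this.lastRefillMs);
    this.tokens = Math.min(this.burst, this.tokens + elapsed / this.msPerToken);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return 0;
    }
    return Math.ceil((1 - this.tokens) * this.msPerToken);
  }
}

// Exponential backoff schedule for HTTP 429: 1 s, 2 s, 4 s by default.
function backoffDelaysMs(maxAttempts: number = 3, baseMs: number = 1000): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    delays.push(baseMs * Math.pow(2, attempt));
  }
  return delays;
}
```

A worker loop would call `tryAcquire(now)` before each predict call and sleep for the returned duration when it is non-zero; once the backoff schedule is exhausted the question is marked `failed` with `status_description='rate_limited'` and the batch continues.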

Monitoring & Alerting

FE analytics (Mixpanel, names from PRD §12): ai_workspace_opened, ai_validation_generated {sample_size,date_range,test_case_id}, ai_response_graded {grade,confidence_score,inquiry_id}, ai_validation_completed, ai_agent_activated. Error slate logs ai_workspace_load_failed.
BE: Rollbar (existing) on worker errors with test_case_id/room_id. Structured Rails.logger.info already emits {worker, test_case_id, room_ids_count, conversation_pairs_count} — extend with generated_count, failed_count.
Alerts (PRD §12): batch failure rate > 10%/1h → #bot-ai-alerts; NLP error rate > 5%/15m → #bot-ai-alerts + PagerDuty.
Cross-layer: propagate request/job id from create response into worker logs for trace.

Logging

BE: structured worker log (above) + Rollbar; FE: console error → Sentry/Datadog per chatbot-fe.
PII scrub: do not log question, answer, or parameters.human_answer bodies (customer/agent text). Log only ids, counts, statuses.
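
One way to enforce the scrub rule is an allowlist rather than a denylist, so a newly added text field is dropped by default. A minimal sketch, assuming a hypothetical `scrubForLog` helper; the key list is taken from the log fields named in this section plus `room_id` and `status`:

```typescript
// Hypothetical allowlist scrubber: only ids, counts, and statuses reach the log.
const LOGGABLE_KEYS: string[] = [
  "worker",
  "test_case_id",
  "room_id",
  "room_ids_count",
  "conversation_pairs_count",
  "generated_count",
  "failed_count",
  "status",
];

// Copies allowlisted keys only; question/answer/parameters bodies never pass.
function scrubForLog(payload: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of LOGGABLE_KEYS) {
    if (Object.prototype.hasOwnProperty.call(payload, key)) {
      out[key] = payload[key];
    }
  }
  return out;
}
```

The same idea applies to Rollbar context: pass the scrubbed object, never the raw question row.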

Security Implications

Threat model: (a) cross-tenant data leak via test-case ids — mitigated by org-scoped queries + set_role; (b) customer-message leakage during shadow gen — mitigated by calling predict only, never send_message/notification (spec-asserted, S04/AC-1); (c) PII to 3rd-party LLM — covered by existing DPA, transient inference, not used to train public model (Open Q #1, InfoSec approval required); (d) privilege escalation — set_role server-side on every endpoint.

Role × Endpoint Authorization Matrix

Role	Endpoint(s)	Methods	Tenant scope	UI visibility	Constraint	Audit
owner	all test-case + publish + tree	GET/POST/PATCH/DELETE	own org	full	—	PaperTrail
supervisor	all test-case + publish + tree	GET/POST/PATCH/DELETE	own org	full	no force-override (admin-only, S09)	PaperTrail
admin	all test-case + publish (+override) + tree	GET/POST/PATCH/DELETE	own org	full	override requires reason	PaperTrail + reason
standard agent	none	—	—	menu hidden	403 on direct route	n/a
bot-specialist	tree-diagram read	GET	own org	tree only	read-only	n/a

Every role from Detail 1.A appears here. standard agent is explicitly denied.
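
The publish constraints in the matrix (gate on, threshold 80, admin-only override with a reason) can be summarized as a decision function. This is a hedged sketch: `evaluatePublish` is a hypothetical name, the real check would live in the Ruby publish use case, and treating a missing score as below threshold is an assumption.

```typescript
// Hypothetical publish-gate decision; mirrors the role matrix constraints.
type Role = "owner" | "supervisor" | "admin";

interface PublishRequest {
  role: Role;
  confidenceScore: number | null;
  gateEnabled: boolean;
  overrideReason?: string;
  threshold?: number; // defaults to 80
}

type PublishDecision =
  | { allowed: true; overridden: boolean }
  | { allowed: false; http: 422; message: string };

function evaluatePublish(req: PublishRequest): PublishDecision {
  const threshold = req.threshold ?? 80;
  if (!req.gateEnabled) return { allowed: true, overridden: false };
  // Assumption: a null score (no ratings yet) counts as below threshold.
  if (req.confidenceScore !== null && req.confidenceScore >= threshold) {
    return { allowed: true, overridden: false };
  }
  const reason = (req.overrideReason ?? "").trim();
  if (req.role === "admin" && reason !== "") {
    return { allowed: true, overridden: true }; // audit stores reason + score
  }
  return { allowed: false, http: 422, message: "Confidence below threshold" };
}
```

A supervisor sending `override_reason` still receives 422, which matches the "no force-override" constraint in the matrix.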

Ownership validation: queries scoped by current_user['chatbot_organization_id'] (existing use-case pattern). Enforcement: use-case layer + set_role.
Input validation: score ∈ {0,1} (422 otherwise); type, version_id, name (length-bounded, e.g. ≤ 24 per prototype) via dry-validation contract.
Injection: ActiveRecord parameterized; outbound URLs (Hub/NLP) from config/env (no user input in URL) → SSRF-safe.
Secrets: channel access token via lockbox-encrypted chatbot_tokens_encrypted; NLP base URL from org settings/env. No hard-coded keys.
Audit: PaperTrail (has_paper_trail) on both tables; force-activate writes reason + score-at-override (S09).
Rate limiting: per-question NLP throttle (TPM/RPM); create endpoint guarded by FE button disable + recommended server idempotency.
Static analysis: bundle exec brakeman (Gemfile) + bundle exec rubocop.
ISO 27001/27701: PII processing logged + access-controlled; see Compliance below.

Detail 3.A — Failure Mode Catalog (merged)

Surface	FE behavior on failure	BE response on failure	Code-shape consistency
List load	blank slate + Retry; `ai_workspace_load_failed`	403/500 `ErrorException`	yes
Create	inline error; button re-enabled	404 (version) / 422 / 403	yes
Generating	`failed` badge + Retry	worker sets `status=failed` (Rollbar)	yes (poll reads status)
Per-question gen	"could not generate" card	question `status=failed` + `status_description`	yes
Rate	optimistic rollback + inline error	404/422	yes
Publish below threshold	button disabled; if forced API → reason	422 with reason (gate on)	yes

Detail 3.A.1 — Branch & Skip Catalog

Branch trigger	Where checked	Downstream effect	Audit	User-visible?
Eligible rooms < 10	worker sampling step (BOT)	use 100% rooms (no 10% cut)	log count	no (result reflected)
Batch > 100	worker sampling	show ≤ 50 (cap)	log	indirectly (count)
Bot-only room (no human reply)	`ExtractConversationPairs`	room excluded, not counted	log	no (NEG-2)
Non-text message (image/voice)	`ExtractConversationPairs`	message skipped	log	no (NEG-2)
Phase 2/3 sources	FE Generate modal	"from knowledge"/"imported" disabled	n/a	yes (NEG-4)
`ai_agent_testing` flag OFF	BE + FE	menu hidden / endpoints inert	n/a	yes

Detail 3.B — Error Response Catalog (BE)

{ "status": "error", "code": 422, "message": ["..."], "errors": {}, "error_code": null }

Endpoint	Code	HTTP	Message	When	User-facing?
POST test_cases	—	404	"Version not found"	`version_id` invalid	yes
POST test_cases	—	422	"Failed to create test case"	save fails	yes
PATCH question	—	422	"Invalid score"	score ∉ {0,1}	yes
GET detail	—	404	"Test case not found"	bad id / cross-tenant	yes
any	—	403	"Permission denied"	`set_role` reject	no (menu hidden)
POST publish	—	422	"Confidence below threshold"	`<80` & gate on	yes
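
To make the envelope and the `Invalid score` row concrete, here is a small sketch in plain Ruby. It assumes nothing about the repo's `ErrorException` class: `error_envelope` and `validate_score` are hypothetical helpers, and the strict integer check is an assumption (the catalog only states score ∈ {0,1}).

```ruby
require "json"

# Hypothetical builder mirroring the envelope shown above; not the repo's ErrorException.
def error_envelope(http_status, message, errors: {}, error_code: nil)
  { status: "error", code: http_status, message: Array(message),
    errors: errors, error_code: error_code }
end

# Score check for PATCH question: only the integers 0 and 1 pass.
# Rejecting values such as "1" or 1.0 is an assumption of this sketch.
def validate_score(score)
  return nil if score.is_a?(Integer) && [0, 1].include?(score)

  error_envelope(422, "Invalid score")
end

puts JSON.generate(validate_score(2))
```

Returning `nil` for a valid score keeps the helper usable as a guard: the caller renders the envelope only when one comes back.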

Detail 3.C — Error Message Catalog (FE)

Error code	User-facing message (i18n key)	Surface	User-facing?
list_load_failed	"Couldn't load test cases" + Retry	blank slate	yes
create_failed	"Couldn't generate test case"	toast/inline	yes
rate_failed	"Couldn't save your rating"	inline (after rollback)	yes
gen_failed	"Could not generate" (per question)	comparison card	yes

Detail 3.D — Compliance & Data Governance

Trigger: PII present. question, answer, parameters.human_answer are customer & agent conversation text sent to a 3rd-party LLM.

Field	Classification	Legal basis	Retention	Encryption	Access audit	Right-to-delete
`question` / `answer` / `parameters.human_answer`	PII (customer content)	DPA (existing); UU PDP	soft-delete; hard-purge TBD (Open Q #8)	TLS in transit; DB at rest per platform	PaperTrail + `set_role`	soft-delete + paranoid; align purge to DPA
`scored_by_email` / `_name`	PII (internal user)	legitimate interest	with row	at rest	PaperTrail	with row

Transient inference: NLP prediction payload is not persisted beyond the stored answer/confidence/sources (PRD §6). InfoSec sign-off required before beta (Open Q #1).

Detail 3.E — Accessibility

WCAG AA. Keyboard: drawer/modal focus-trap (Pixel3 defaults); thumbs reachable via Tab with aria-pressed; meter role="progressbar". Focus returns to trigger on modal close. Contrast verified against Pixel3 tokens. prefers-reduced-motion honored for the generating progress animation.

4. Backwards Compatibility and Rollout Plan

Compatibility

BE: additive only — new name column (nullable), new DELETE route, extended worker/publish/tree behavior behind the ai_agent_testing flag. Existing endpoints' request/response shapes unchanged (only name added to create response; nullable).
FE: new page + components; menu item additive. No change to saved client state.
Cross-layer: snake_case JSON unchanged; FE already consumes the contract.

Rollout Strategy

Deploy order: BE first, then FE. BE adds the pipeline + endpoints behind the flag; FE Testing page ships after the contract is live (FE polls real status).
Feature flag: ai_agent_testing via SystemPreferences::FeatureFlag.enabled? (group_code: 'rollout', code: 'ai_agent_testing', default: false) — per-org, default OFF. Kill-switch: toggle OFF per org → menu hidden + endpoints inert + gate reverts to advisory (no deploy). FE menu reuses the existing rollout-preference read pattern (layouts/bot-automation.vue).
Migration sequence: add nullable name column (no backfill needed) → deploy BE → enable flag for internal org → deploy FE.
Stages (audience/gates owned by delivery/; PRD §11/§14): Internal (telesales POC) → Closed beta (3–5 orgs) → Open beta (on request) → GA.
Rollback trigger: batch failure rate > 20% unresolved 24 h, or any customer-message leakage → disable flag for affected orgs.
Rollback mechanism: flag OFF (instant); for DDL, name column is nullable and inert when unused — no down-migration needed for rollback; data written stays (soft-deletable).

Detail 4.A — Cross-Layer Rollout Compatibility Matrix

Scenario	FE	BE	Works?	Mitigation
Pre-deploy	Old	Old	yes	baseline (no Testing page)
Backend first	Old	New	yes	new endpoints unused by old FE; flag OFF
Frontend first	New	Old	no	avoid — deploy BE first; if FE leads, gate page behind flag tied to BE readiness
Both deployed	New	New	yes	target
Backend rollback	New	Old	partial	FE page errors gracefully (error slate); disable flag
Frontend rollback	Old	New	yes	BE endpoints simply unused

Detail 4.B — Configuration Contract

Layer	Env var / flag	Type	Default	Required	Provisioner	Secret?
BE	`ai_agent_testing` (system_preferences `rollout`/`ai_agent_testing`)	bool	false	yes	SystemPreference (per org)	no
BE	`ai_agent_testing_gate` (publish gate enforce)	bool	false (advisory)	no	SystemPreference	no
BE	`ai_agent_testing_threshold` (`engine`) — gate % (REV-4)	int	80	no	SystemPreference (per org)	no
BE	`ai_agent_testing_nlp_rpm` (`engine`) — shadow-gen RPM cap (REV-1)	int	60	no	SystemPreference (per org)	no
BE	`qontak_nlp_prediction_timeout`	int (s)	60	no	SystemPreference (`engine`)	no
BE	sampling pct / cap (if configurable)	int	10% / 50–70	no	const or SystemPreference	no
BE	`QONTAK_NLP_PREDICTION_ENDPOINT`, `AI_SERVICE_BASE_URL`	url	—	yes (existing)	env/org settings	no
FE	rollout pref read (`rollout_ai_agent` pattern)	bool	false	yes	preferences store	no
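
The backend rows with concrete defaults can be expressed as a defaults map with per-org overrides. This is a sketch only: `resolve_config` and the `org_prefs` hash are hypothetical stand-ins for whatever the SystemPreference lookup returns, and the env-provided URLs and the sampling row are left out because the table gives them no single default.

```ruby
# Defaults copied from the Configuration Contract table; keys follow the table.
DEFAULTS = {
  "ai_agent_testing" => false,          # feature flag, per org
  "ai_agent_testing_gate" => false,     # false = advisory publish gate
  "ai_agent_testing_threshold" => 80,   # gate percentage
  "ai_agent_testing_nlp_rpm" => 60,     # shadow-generation RPM cap
  "qontak_nlp_prediction_timeout" => 60 # seconds
}.freeze

# Hypothetical resolver: org_prefs stands in for the per-org preference values.
# Unknown keys raise so a typo in a preference name fails loudly.
def resolve_config(org_prefs = {})
  unknown = org_prefs.keys - DEFAULTS.keys
  raise ArgumentError, "unknown preference(s): #{unknown.join(', ')}" unless unknown.empty?

  DEFAULTS.merge(org_prefs)
end
```

For example, `resolve_config({ "ai_agent_testing_threshold" => 90 })` keeps every other default and raises the gate to 90 for that org.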

Detail 4.C — Test Plan (commands from repo)

Layer	Command (source)	What it proves
BE unit/use-case	`RAILS_ENV=test bundle exec rspec spec/api/frontend_service/v1/ai_agent` (`bin/rspec_pipeline.sh`)	create/list/rate + new sampling/gen behavior
BE worker	`RAILS_ENV=test bundle exec rspec spec/workers/fetch_room_conversations_worker_spec.rb`	sampling, no `send_message`, persistence, status
BE repo	`RAILS_ENV=test bundle exec rspec spec/core/repositories/chat_service/extract_conversation_pairs_spec.rb`	text-only filtering (S03)
BE tree	`RAILS_ENV=test bundle exec rspec spec/core/repositories/paths/get_tree_diagram_v3_spec.rb`	avg confidence in node
BE lint/security	`bundle exec rubocop` · `bundle exec brakeman` (`bitbucket-pipelines.yml`, Gemfile)	style + security scan
FE unit	`pnpm test` → `vitest run` (`chatbot-fe/package.json`)	components + store rating rollback
FE e2e	`pnpm test:e2e` → `playwright test`	generate→poll→detail→rate→meter flow
FE lint + build	`pnpm lint` · `pnpm build`	typecheck + bundle

Detail 4.D — Agent Execution Plan

Order	Layer	Chunk	Files	Commands	Acceptance criteria
1	BE	Add `name` column + persist; create status `processing`	`db/migrate/2026XXXX_add_name_to_ai_agent_test_cases.rb`, `repositories/create_test_case.rb`, `use_cases/create_test_cases.rb`, `test_cases_controller.rb`	`bundle exec rails db:migrate`; rspec create spec	migration up/down; create persists `name`, returns `status='processing'`
2	BE	Test-case status lifecycle helper	`repositories/` (new `update_test_case_status.rb`)	rspec	status transitions `pending→processing→completed/failed` covered
3	BE	Worker sampling step	`app/workers/fetch_room_conversations_worker.rb`, new `repositories/.../sample_rooms.rb`	`rspec spec/workers/...`	200→~20; <10→all; 5000→cap 50–70 (S02)
4	BE	Worker shadow-gen + question persistence	worker, new `services/.../generate_shadow_answer.rb` (wraps `QontakNlp::Predict`), question repo	`rspec spec/workers/...`	0 `SendMessageWorker` enqueues (S04/AC-1); `answer`+`parameters.human_answer`+`confidence`+`sources` persisted; failed→`status=failed`+desc
5	BE	Confidence aggregate recompute on rate	`repositories/rate_test_case_question.rb` (extend), new `recompute_confidence_score.rb`	`rspec spec/api/frontend_service/v1/ai_agent/use_cases/rate_test_case_question_spec.rb`	`confidence_score = round(up/total*100)` after rate (S06/AC-3)
6	BE	DELETE test-case route (soft delete)	`test_cases_controller.rb`, new `use_cases/delete_test_case.rb` + repo	`rspec`	DELETE soft-deletes (`deleted_at` set); 404 cross-tenant; FE client now resolves
7	BE	Tree-diagram avg confidence	`core/repositories/paths/get_tree_diagram_v3.rb` (`add_ai_agent`)	`rspec spec/core/repositories/paths/get_tree_diagram_v3_spec.rb`	node returns avg over completed; none→"no score yet" (S10)
8	BE	Publish confidence gate (flagged)	`repositories/publish.rb`, `use_cases/publish_ai_agent.rb`	`rspec`	422 when `<80` & gate on; advisory when off (S07)
9	FE	Testing list page + table	`pages/bot-automation/testing/index.vue`, `modules/bot-automation/components/testing/TestCasesTable.vue`	`pnpm test`; `pnpm lint`	renders list/empty/loading/error; `ai_workspace_load_failed` on error (S01)
10	FE	Generate modal + Inbox drawer (with version selector)	`.../testing/{GenerateTestCaseModal,GenerateFromInboxDrawer,TestCaseGeneratingModal}.vue`	`pnpm test`	create dispatch + poll to `completed` (S08)
11	FE	Detail comparison + question list + meter	`.../testing/{TestCaseComparison,QuestionList,ConfidenceMeter}.vue`	`pnpm test`; `pnpm test:e2e`	human-left/AI-right + metrics; failed→"could not generate"; meter = up/total (S05,S06)
12	FE	Activate gate + nav item	`layouts/bot-automation.vue`, `modules/bot-automation/components/AiAgentEditor.vue`	`pnpm test`	button disabled `<80`; Testing nav behind flag (S07,S01)
13	FE	Tree-diagram node badge	bot-flow node component (chatbot-fe)	`pnpm test`	node shows avg / "no score yet" (S10)
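
The acceptance criteria for chunks 5, 7 and 8 reduce to three small rules, sketched here in plain Ruby. The function names are hypothetical. The zero-total case, the handling of a missing score at the gate, and rounding the node average are assumptions that the table does not settle.

```ruby
# Chunk 5: confidence_score = round(up / total * 100).
# Returns nil when nothing is rated yet (assumption: total == 0 is undefined in the RFC).
def confidence_score(up, total)
  return nil if total.zero?

  (up.to_f / total * 100).round
end

# Chunk 8: block publish only when the gate is enforced and the score is below threshold.
# Treating a nil score as below threshold is an assumption.
def publish_allowed?(score, gate_on:, threshold: 80)
  return true unless gate_on

  !score.nil? && score >= threshold
end

# Chunk 7: node value is the average over completed test cases, or "no score yet".
def node_confidence(completed_scores)
  return "no score yet" if completed_scores.empty?

  (completed_scores.sum.to_f / completed_scores.size).round
end
```

With the gate off, `publish_allowed?` always returns true, which matches the advisory mode described for beta.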

Detail 4.E — Verification & Rollback Recipe

Pre-merge (BE): 1) bundle exec rubocop 2) RAILS_ENV=test bundle exec rspec spec/api/frontend_service/v1/ai_agent spec/workers spec/core/repositories/chat_service spec/core/repositories/paths 3) bundle exec brakeman
Pre-merge (FE): 1) pnpm lint 2) pnpm test 3) pnpm test:e2e 4) pnpm build
Post-deploy signals: Rollbar batch error rate < 10%/1h (#bot-ai-alerts); Mixpanel ai_validation_generated firing; worker log failed_count/generated_count ratio healthy; zero SendMessageWorker enqueues correlated with :ai_agent batches.
Rollback: 1) toggle ai_agent_testing OFF for affected org(s) (no deploy) 2) if publish gate misbehaves, toggle ai_agent_testing_gate OFF (advisory) 3) if needed, revert FE PR (BE endpoints become unused) 4) confirm Rollbar error rate normal in 15 min.

Detail 4.F — Resource & Cost Notes

Compute: bounded by :ai_agent queue concurrency (5/10); no new pods required.
DB: +1 nullable column; question rows ≤70/case — negligible growth.
Egress: per-question HTTPS to QontakNLP (cost = batch size × token usage) — the cap controls this; confirm per-tier budget (Open Q #5).
No new infra components.

5. Concern, Questions, or Known Limitations

Review findings ledger (from `historical-validation-review.md`, R1)

rfc-reviewer R1 (score 7.5/10, PROCEED) raised four material findings; all are now addressed inline in this revision:

Id	Severity	Finding	Resolution	Status
REV-1	major	NLP throttle contract unspecified (RPM/TPM, 429 behavior)	§3 Performance: token-bucket, `ai_agent_testing_nlp_rpm` pref (default 60), 429→backoff+requeue→fail-question; D-5 specified	resolved (in-RFC) — production ceiling still confirmed by AI squad (Open Q #5)
REV-2	major	`name` column left conditional	§2.3: column added + persisted unconditionally (chunk 1); §2.G create row now `yes`	resolved
REV-3	major	DELETE endpoint contract thin	§2.4: full DELETE contract (soft-delete, 404-not-403 cross-tenant, idempotency, restore out-of-scope)	resolved
REV-4	minor	publish-gate threshold source unresolved	D-7 + §4.B: `ai_agent_testing_threshold` pref (default 80), org-configurable — resolves PRD Open Q #4	resolved

Minor follow-ups still open after R2 (none blocking):

REV-5: worker job retry intervals and dead-set alert depth (§2.F, item 10).
REV-6: no Figma frame for the comparison/detail view (§1 Design References, item Q-A; design dependency).
REV-7: re-run trigger path absent from §2.4, plus the clear-before-regen transaction (item 9, Detail 2.D).
REV-8: create idempotency is "recommended" but not decided (§2.4 POST).
REV-9 (new in R2): the §3 NLP throttle RPM cap is enforced by an in-worker token bucket and is therefore per worker process. With queue concurrency 5/10, concurrent same-org batches can exceed the "per-org" ceiling in aggregate. Mitigation: low likelihood (one SPV generates at a time); if concurrency proves real, make it a true per-org ceiling via a Redis-backed counter keyed by org.

R2 score 8.5/10, verdict PROCEED (see historical-validation-review.md).
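
The REV-9 mitigation (a counter keyed by org) can be sketched as follows. This is a fixed-window counter, not the token bucket described in §3, and it is deliberately simplified: `OrgRpmLimiter` is a hypothetical class, and the in-memory Hash only stands in for a shared store, so as written the limit is still per process. The sketch shows the keying scheme only.

```ruby
# Hypothetical per-org limiter: a fixed-window counter keyed by org id and minute.
# The Hash store is a stand-in for a shared store such as Redis.
class OrgRpmLimiter
  def initialize(rpm:, store: Hash.new(0), clock: -> { Time.now.to_i })
    @rpm = rpm
    @store = store
    @clock = clock # injectable so specs can control time
  end

  # Returns true and counts the call when the org is under its cap
  # for the current one-minute window; returns false otherwise.
  def allow?(org_id)
    key = "nlp_rpm:#{org_id}:#{@clock.call / 60}"
    return false if @store[key] >= @rpm

    @store[key] += 1
    true
  end
end
```

The check-then-increment here is not atomic. A shared-store version would need an atomic increment with key expiry (Redis `INCR` plus `EXPIRE`, for instance) so that concurrent workers cannot both pass the check.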

Open questions

Carried from PRD §17, scoped to engineering:

Q-A (Design): Generate-from-Inbox drawer needs a version selector (bind test case to ai_agent_history_id); the qontak-designer prototype only has a name field. The comparison/detail view has no prototype — needs a Figma frame + Design QA.
Open Q #1 (Risk/InfoSec): PII to 3rd-party LLM — DPA-covered, transient inference; InfoSec approval required before beta.
Open Q #2 (Risk): confidence recompute (S06) + activation gate (S07) ship as this RFC's work; gate is advisory for beta, enforced before GA.
Open Q #4 — resolved (REV-4): threshold is org-configurable via ai_agent_testing_threshold SystemPreference (default 80); no redeploy to change.
Open Q #5 (Data/AI): per-batch token budget + the production RPM/TPM ceiling across plan tiers — the §3 throttle ships with a conservative default (ai_agent_testing_nlp_rpm = 60); AI squad confirms the real ceiling before beta (REV-1).
Open Q #6: separate "relevance" metric? Schema has only confidence — if needed, store in parameters (no schema change).
Open Q #7 (Data): single human reply as "golden answer" when a room has multiple agent messages — ExtractConversationPairs pairs the next agent text reply.
Open Q #8 (Eng): hard-purge window for soft-deleted test cases/questions (DPA).
Known limitation: re-running a test case must clear prior questions to avoid duplicates (Detail 2.D) — define re-run UX with Design.
Known limitation: FetchRoomConversationsWorker is retry: false today; this RFC changes it to bounded retry — confirm idempotent re-entry (clear-before-regen).

6. Comment logs

Date	Comment(s) From	Action Item(s)
2026-06-20	RFC author (Claude)	Initial draft from PRD + grounded against chatbot / chatbot-fe / qontak-designer code. Flagged worker gap, FE target (chatbot-fe), missing DELETE route, name-column gap.
2026-06-20	`rfc-reviewer` R1 (7.5/10, PROCEED)	Raised REV-1…REV-4 (NLP throttle, `name` column, DELETE contract, gate threshold). See `historical-validation-review.md`.
2026-06-20	RFC author (Claude)	Addressed REV-1…REV-4 inline: §3 throttle contract + D-5; §2.3 `name` column unconditional; §2.4 full DELETE contract; D-7 + §4.B configurable threshold/RPM prefs; §2.G all rows → `yes`; §5 ledger added.
2026-06-20	`rfc-reviewer` R2 (8.5/10, PROCEED)	Confirmed REV-1…REV-4 fixed; decisions 10/10 resolved; CSS 6.5→8.0. Raised REV-9 (per-process vs per-org throttle enforcement, minor). REV-5/6/7/8/9 carry open.

7. Ready for agent execution

yes

All execution-readiness gates are met against verified repo state:

§1 Design References — Figma frames + DS version (@mekari/pixel3@^1.0.12) + Design QA named; the detail/comparison frame and the drawer version-selector are flagged in §5 Q-A (Design QA must confirm before chunks 10–11 land).
§1 PRD-to-Schema Derivation — every entity/attribute/rule mapped to table.column + endpoint + enforcement.
Detail 1.C Per-Story Change Map — all 10 stories, one row each, FE+BE columns filled, verifiable AC.
Repo Reading Guide (2.0) — anchors, contracts (reuse/extend/new), reading order, Source Verification with concrete evidence per row (no unverified claims).
Design ↔ Code Mapping — frames → chatbot-fe files + tokens + backing endpoints.
Mermaid: repo map, component, ER, two state machines, branch/skip, happy + 2 failure sequences.
DDL — existing schema verified; one additive migration; per-status lifecycle tables for both enums.
APIs — outbound table with reuse/extend/new tags; inbound N/A — reason; cross-layer verification flags the 3 closeable gaps.
Failure Mode + Branch & Skip + Error catalogs complete; Role × Endpoint matrix covers every role.
Configuration Contract complete; ai_agent_testing flag named, default OFF.
Agent Execution Plan — 13 ordered chunks, each with files + repo-sourced commands + assertable AC.
Verification & Rollback Recipe — runnable per-layer commands; named signals; flag-first rollback.

Optional next step: hand to rfc-reviewer for a further review pass (historical-validation-review.md).
