
RFC: AI Agent Testing — Phase 1: Historical Validation

Document Conventions (do not remove)

This RFC follows the Qontak RFC Template format for governance — the metadata table, Confluence sections 1–6, and Comment logs are mandatory.

It is also agent-execution-ready: §1 Design References (FE half) + §1 PRD-to-Schema Derivation (BE half), §2 Repo Reading Guide (Detail 2.0) for both layers, mermaid diagrams, §2.G Cross-Layer Contract Verification, and §4 Agent Execution Plan + Verification & Rollback Recipe are complete.

Delivery & project management live elsewhere. This RFC is the technical artifact only — no staffing, effort, timeline, or rollout schedule. Those live in the initiative's delivery/ folder. Until handed to delivery, the Delivery row reads not yet handed to delivery.

Grounding note (important). The PRD describes target behavior. This RFC is reconciled against the current code in chatbot, chatbot-fe, and qontak-designer (see §2.0 Source Verification). Where the PRD describes behavior that is not yet built, the RFC says so explicitly and scopes it as work. The biggest such gap: FetchRoomConversationsWorker today only fetches rooms and extracts Q/A pairs, then logs — it does not sample, generate AI shadow answers, persist question rows, or update test-case status.

Metadata

| Field | Value | Notes |
| --- | --- | --- |
| Status | RFC (IDEA) | Human label; YAML status: draft |
| DRI | Dimas Fauzi Hidayat | Accountable owner carried from PRD; eng tech-lead co-owner to be named in delivery/ |
| Team | chatbot (BOT squad) | Advisory slug carried from PRD |
| Author(s) | Claude (from PRD + repo grounding) | |
| Reviewers | BOT Backend Lead · BOT Frontend Lead · AI Squad Lead · Data Team (Reza) | Cross-squad: BOT, AI, Data, Platform (Chat Service) |
| Approver(s) | BOT Tech Lead · InfoSec Approver | InfoSec required: historical PII → 3rd-party LLM (Open Q #1) |
| Submitted Date | 2026-06-20 | |
| Last Updated | 2026-06-20 | |
| Target Release | 2026-Q3 | Re-baselined; original "May 2026" dates are past (PRD Open Q #3) |
| Target Quarter | 2026-Q3 | |
| Delivery | not yet handed to delivery | |
| Related | ../prds/historical-validation.md · ../ai-agent-testing-anchor.md | |
| Discussion | #bot-ai-alerts | |

Type: full-stack · Frontend sub-type: new-feature · Backend sub-type: new-feature

Sections at a Glance

  1. Overview (Design References — FE; PRD-to-Schema Derivation — BE; traceability)
  2. Technical Design (Repo Reading Guide → end-to-end mermaid → DDL → APIs → cross-layer verification)
  3. High-Availability & Security
  4. Backwards Compatibility and Rollout Plan
  5. Concern, Questions, or Known Limitations
  6. Comment logs
  7. Ready for agent execution

1. Overview

Phase 1 of the AI Agent: Testing initiative lets a Qontak SPV/Admin validate an AI Agent against a sample of their own resolved, human-handled conversations before going live. The system samples eligible historical rooms (last 90 days), generates an AI "shadow" answer per extracted customer question (never sent to a real customer), and presents a side-by-side comparison of the human "golden" answer vs the AI answer. The SPV rates each answer thumbs up/down; ratings roll up into a confidence meter. At ≥80% the agent is "Ready to Launch"; an activation gate (Should-Have) prevents go-live below threshold.

This RFC is a delta on substantial existing scaffolding, not a greenfield build. The data model, the four read/write endpoints, the rating use case, the Sidekiq worker shell, the conversation-pair extractor, and the chatbot-fe Pinia store + typed API client already exist (see §2.0). The missing pieces — the engineering core of this phase — are:

  1. Sampling (10% / 50–70 cap) in the worker.
  2. Shadow-answer generation (LLM call per question) + persistence of ai_agent_test_case_questions rows.
  3. Test-case status lifecycle (pending → processing → completed/failed).
  4. Confidence-score aggregate recompute on rating.
  5. Activation gate in publish.
  6. Tree-diagram average-confidence surface.
  7. The chatbot-fe Testing page (list + detail/comparison + meter) under the new bot-automation module.
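The sampling rule in item 1 can be illustrated as a pure function. This is a minimal sketch, not the worker's implementation: the name `sample_rooms` and its `percent:`, `cap:` and `min_pool:` parameters are hypothetical, the hard cap of 70 is an assumption (the RFC states the cap only as the range 50–70), and the small-pool rule (fewer than 10 eligible rooms means all rooms are used) is taken from Decision D-10.

```ruby
# Hypothetical helper: picks which eligible room ids go into one batch.
# percent:  share of eligible rooms to sample (10 per the RFC)
# cap:      upper bound on the sample size (assumed 70; the RFC says 50-70)
# min_pool: below this many eligible rooms, every room is used
def sample_rooms(room_ids, percent: 10, cap: 70, min_pool: 10, random: Random.new)
  return room_ids.dup if room_ids.size < min_pool

  # Integer numerator keeps the 10% computation free of float drift.
  target = (room_ids.size * percent).fdiv(100).ceil
  size = [target, cap].min

  # Array#sample draws without replacement, so no room is picked twice.
  room_ids.sample(size, random: random)
end

sample_rooms((1..200).to_a).size   # => 20
sample_rooms((1..7).to_a).size     # => 7
sample_rooms((1..5000).to_a).size  # => 70
```

Passing `random:` explicitly lets a worker spec seed the generator and assert on a deterministic sample.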

Success Criteria

  • Zero customer-message leakage: no send_message/notification fires for any historical inquiry during shadow generation (AITEST-S04/AC-1). Provable by spec + zero SendMessageWorker enqueues during a batch.
  • Shadow-generation success rate: ≥ 95% of sampled questions produce a valid AI answer, measured within 60 days of GA (PRD §13 Quality KPI).
  • Batch latency: a ~50-item batch reaches completed in ≈2–5 min without blocking live production traffic (queue isolation :ai_agent).
  • Confidence meter equals (thumbs-up ÷ total sample) × 100, recomputed server-side on every rating (AITEST-S06/AC-3).
  • Primary product KPI (PRD §13): Configured→Live conversion ≥ 60% within 7 days, measured within 90 days of GA.
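The confidence-meter formula above, (thumbs-up ÷ total sample) × 100, can be sketched as follows. The helper name `confidence_score` is hypothetical; rounding to the nearest integer and returning `nil` for an empty sample are assumptions made here because the column is described as a nullable integer. Unrated questions stay in the denominator, since the formula is defined over the total sample.

```ruby
# Hypothetical helper: aggregate confidence for one test case.
# scores has one entry per sampled question:
#   1 = thumbs up, 0 = thumbs down, nil = not rated yet.
def confidence_score(scores)
  return nil if scores.empty? # assumption: no sample means no score

  thumbs_up = scores.count(1)
  (thumbs_up * 100.0 / scores.size).round
end

confidence_score([1, 1, 0, nil])        # => 50
confidence_score([1] * 41 + [0] * 9)    # => 82
confidence_score([])                    # => nil
```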

Out of Scope

  1. Live shadow mode (real-time parallel answering) — strictly historical.
  2. Model fine-tuning UI; editing the AI answer in the workspace (comparison is read-only).
  3. Multi-modal validation (images/voice/attachments) — text-only.
  4. "Generate from knowledge" (Phase 2) and "Imported question list" (Phase 3) sources — scaffolded/disabled only.
  5. Mobile — web only.
  6. The qontak-designer prototype itself — it is a design reference, not a deployable target (see Decision D-1).

Assumptions

  • A single human agent text reply to a customer question is a sufficient "golden answer" (PRD Open Q #7 — Data team to confirm).
  • The 90-day lookback and 10%/50–70 cap hold across plan tiers for the beta token budget (PRD Open Q #5).
  • ai_agent_histories is stable; a test case binds to one ai_agent_history_id.
  • Chat Service room/message APIs and QontakNlp prediction are reachable from the :ai_agent Sidekiq worker pool with the org's channel access token.
  • Production frontend is chatbot-fe (it owns auth, the API client, and Pinia stores); qontak-designer has no API/auth layer (Decision D-1).

Dependencies

| Dependency | Owning team | Deliverable needed | Availability | Blocking? |
| --- | --- | --- | --- | --- |
| Chat Service (Hub) | Inbox / Platform | Hub::ChatService::Rooms::List (status assigned, date window), Messages::GetByRoom over 90 days | exists (app/core/repositories/chat_service/*, lib/hub/chat_service/*) | YES |
| LLM / AI service (QontakNLP) | AI squad | Batch shadow inference within TPM/RPM limits | exists for live predict (lib/qontak_nlp/inference.rb#prediction); batch/shadow path needs building | YES |
| Data team | Data | 10% sampling + 50–70 cap algorithm | needs building (not in worker today) | YES |
| Channel Integration | Platform | Access tokens for room fetch | exists (Repositories::ChannelIntegrations::GetTokens) | YES |
| AI Agent versioning | BOT | ai_agent_histories stable | exists (app/models/ai_agent_history.rb) | YES |
| Design (Pixel3) | Design system | @mekari/pixel3 Drawer/Modal/Table/Badge | exists (@mekari/pixel3@^1.0.12 in chatbot-fe) | NO |

Design References (frontend half — required)

The PRD's UI is specified in Figma; the qontak-designer prototype is the in-code design reference (pixel layout + component decomposition) but is itself a static prototype (no API/auth) — see Decision D-1. Production implementation lands in chatbot-fe.

| PRD-named surface | Figma / design link | Frame name | Design system version | Design QA contact | Notes |
| --- | --- | --- | --- | --- | --- |
| Testing page (list) | node 16743-298263 | Testing page | @mekari/pixel3@^1.0.12 (chatbot-fe) | BOT Design QA | In-code ref: qontak-designer app/pages/bot-automation/testing/index.vue |
| Generate test case modal + Generate-from-Inbox drawer | node 16514-155786 | Generate flow | @mekari/pixel3@^1.0.12 | BOT Design QA | In-code ref: qontak-designer app/components/bot-automation/testing/{GenerateTestCaseModal,GenerateFromInboxDrawer,TestCaseGeneratingModal}.vue |
| Sampling / generating progress | node 17699-52615 | Generating modal | @mekari/pixel3@^1.0.12 | BOT Design QA | Async progress while batch runs |
| Test-case detail: side-by-side comparison + confidence meter | node 16514-155786 | Comparison view | @mekari/pixel3@^1.0.12 | BOT Design QA | No prototype component exists in qontak-designer for the detail view; build fresh (see §2.0) |
| Activation gate (AI agent main settings) | node 16514-155786 | Activate button | @mekari/pixel3@^1.0.12 | BOT Design QA | chatbot-fe modules/bot-automation/components/AiAgentEditor.vue footer |
| Tree-diagram AI Agent node confidence | node 16514-155786 | Tree node | @mekari/pixel3@^1.0.12 | BOT Design QA | Backend get_tree_diagram_v3 |

PRD-to-Schema Derivation (backend half — required)

| PRD entity / attribute / rule | Persisted as (table.column) | Exposed via | Enforced where | Source |
| --- | --- | --- | --- | --- |
| A test case binds to an agent + a version | ai_agent_test_cases.ai_agent_id, .ai_agent_history_id (uuid, NOT NULL) | POST /api/v1/ai_agents/:id/test_cases | CreateTestCases use case (404 if version missing) | PRD §9 #1 |
| Test case has a lifecycle status | ai_agent_test_cases.status (string) | list + detail responses | worker transitions (to build); created as 'pending' today | PRD §10.1, AITEST-S08 |
| Test case aggregate confidence | ai_agent_test_cases.confidence_score (integer, nullable) | detail + list | recompute on rating (to build) | AITEST-S06 |
| Sampled question + extracted Q/A | ai_agent_test_case_questions.question (text), .topic (string) | detail response | ExtractConversationPairs + worker persistence (to build) | AITEST-S02/S03 |
| AI shadow answer | ai_agent_test_case_questions.answer (text) | detail | shadow-gen worker (to build) | AITEST-S04/AC-2 |
| Human "golden" answer | ai_agent_test_case_questions.parameters (jsonb) → human_answer | detail | worker persistence (to build) | PRD §16a (2026-06-18) |
| AI metrics | .confidence (int), .response_time (int), .sources (jsonb [{id,name,type}]) | detail | shadow-gen worker (to build) | AITEST-S05/AC-2 |
| Per-question rating | .score (int 0/1), .is_score (bool), .scored_by/_email/_name/_at | PATCH .../questions/:question_id | RateTestCaseQuestion (exists; aggregate recompute to build) | AITEST-S06 |
| Per-question failure | .status (string), .status_description (text) | detail | shadow-gen worker (to build) | AITEST-S04/ERR-1 |
| Soft delete | .deleted_at (acts_as_paranoid) on both tables | DELETE endpoint (to build; see §2.4) | model acts_as_paranoid | PRD §6 |
| Activation gate threshold | confidence_score vs threshold (default 80) | POST /api/v1/ai_agents/:id/publish | Repositories::Publish (gate to build) | AITEST-S07 |
| Tree-diagram avg confidence | computed avg over agent's completed ai_agent_test_cases.confidence_score | GET /api/v3/paths/:id/tree_diagram | Repositories::Paths::GetTreeDiagramV3#add_ai_agent (to build) | AITEST-S10 |

Every §2.3 DDL column and §2.4 endpoint traces back to a row here.
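The tree-diagram row of the derivation (average confidence over an agent's completed test cases) could be computed along these lines. This is a sketch under stated assumptions: `TestCaseRow` and `average_confidence` are hypothetical names standing in for the real model and the to-be-built logic in `add_ai_agent`, rounding to an integer is assumed, and `nil` is used to represent the "no score yet" state.

```ruby
# Hypothetical stand-in for an ai_agent_test_cases row.
TestCaseRow = Struct.new(:status, :confidence_score, keyword_init: true)

# Averages confidence_score over completed test cases only.
# Returns nil when nothing is scoreable, so the caller can render
# "no score yet" instead of a misleading 0.
def average_confidence(test_cases)
  scores = test_cases
           .select { |tc| tc.status == 'completed' && !tc.confidence_score.nil? }
           .map(&:confidence_score)
  return nil if scores.empty?

  (scores.sum.to_f / scores.size).round
end

rows = [
  TestCaseRow.new(status: 'completed', confidence_score: 80),
  TestCaseRow.new(status: 'completed', confidence_score: 90),
  TestCaseRow.new(status: 'processing', confidence_score: nil)
]
average_confidence(rows) # => 85
average_confidence([])   # => nil
```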

Detail 1.A — PRD Traceability (cross-layer)

Composite AC ids per documents/CLAUDE.md (story-qualified, e.g. AITEST-S01/AC-1).

Forward (PRD AC → RFC):

| PRD composite AC id | FE section / component | BE section / endpoint |
| --- | --- | --- |
| AITEST-S01/AC-1, AC-2 | Testing page + nav item (chatbot-fe) | GET /api/v1/ai_agents/:id/test_cases (set_role) |
| AITEST-S01/ERR-1 | Error blank-slate + ai_workspace_load_failed | list endpoint failure path |
| AITEST-S02/AC-1..3 | generating modal | FetchRoomConversationsWorker sampling (§2.F) |
| AITEST-S03/AC-1..3 | n/a (server-side) | ExtractConversationPairs filtering (§2.2) |
| AITEST-S04/AC-1, AC-2, ERR-1 | per-question loading/failed state | shadow-gen worker step (§2.2, §2.F) |
| AITEST-S05/AC-1..3 | TestCaseComparison / QuestionList | GET .../test_cases/:id detail (§2.4) |
| AITEST-S06/AC-1..3, ERR-1 | ConfidenceMeter + thumbs | PATCH .../questions/:id + aggregate recompute (§2.4, §2.F) |
| AITEST-S07/AC-1, AC-2, ERR-1 | Activate button enable/disable | POST /api/v1/ai_agents/:id/publish gate (§2.4) |
| AITEST-S08/AC-1, AC-2, ERR-1 | generating modal + polling | worker async + status lifecycle (§2.F) |
| AITEST-S09/AC-1, AC-2, ERR-1 | Force-activate modal (Could-Have) | publish override + audit (PaperTrail) |
| AITEST-S10/AC-1..3, ERR-1 | Tree-diagram node badge | GetTreeDiagramV3#add_ai_agent |

Reverse (RFC → PRD AC):

| New FE component / BE endpoint / dependency | PRD composite AC id it serves |
| --- | --- |
| chatbot-fe pages/bot-automation/testing/index.vue | AITEST-S01/AC-1 |
| chatbot-fe TestCaseComparison.vue + ConfidenceMeter.vue | AITEST-S05/AC-1, AITEST-S06/AC-3 |
| BE worker sampling step | AITEST-S02/AC-1..3 |
| BE worker shadow-gen + persistence step | AITEST-S04/AC-2 |
| BE confidence aggregate recompute | AITEST-S06/AC-2 |
| BE publish gate | AITEST-S07/AC-1 |
| BE DELETE .../test_cases/:test_case_id (new) | PRD §6 soft delete |
| BE GetTreeDiagramV3 avg-confidence | AITEST-S10/AC-2 |

UI / Consumer Surface Coverage

| PRD-named surface | Consumer | Required reads (BE) | Required writes (BE) | FE component | Status surface |
| --- | --- | --- | --- | --- | --- |
| Testing page (list) | web | GET /api/v1/ai_agents/:id/test_cases | | pages/bot-automation/testing/index.vue | status, score columns |
| Generate-from-Inbox drawer | web | GET .../ai_agents/:id (versions) | POST .../test_cases | GenerateFromInboxDrawer.vue | status=pending→processing |
| Generating modal | web | GET .../test_cases (poll) | | TestCaseGeneratingModal.vue | polls status until completed |
| Test-case detail / comparison | web | GET .../test_cases/:id | PATCH .../questions/:id | TestCaseComparison.vue | per-question status, confidence |
| Confidence meter | web | (from detail payload) | | ConfidenceMeter.vue | confidence_score aggregate |
| AI agent main settings (Activate) | web | GET .../ai_agents/:id | POST .../publish | AiAgentEditor.vue footer | confidence_score vs 80 |
| Tree-diagram node | web | GET /api/v3/paths/:id/tree_diagram | | bot-flow node (chatbot-fe) | avg_confidence_score |

Role Coverage

| PRD role | Authorization mechanism | Endpoints permitted (BE) | UI surface visibility (FE) | Cross-tenant? | Audit trail |
| --- | --- | --- | --- | --- | --- |
| owner | set_role(%w[owner supervisor admin]) (JWT current_user['role']) | all test-case + publish | full | no (org-scoped) | PaperTrail on test_cases/questions |
| supervisor | set_role | all test-case; publish (not force-override) | full | no | PaperTrail |
| admin | set_role | all test-case + publish + force-override | full | no | PaperTrail (+ override reason) |
| standard agent | set_role rejects (403) | none | menu hidden; route forbidden | no | n/a |
| Super Admin (PRD secondary) | inherits owner/admin role | activate after SPV sign-off | full | no | PaperTrail |
| bot-specialist (S10) | set_role on tree-diagram (owner/supervisor/admin) | GET .../tree_diagram | tree-diagram node | no | n/a (read) |

Menu visibility in chatbot-fe is feature-flag + subscription gated today (not role-gated) — see Decision D-3; server-side set_role is the authoritative guard.
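The server-side guard described above can be pictured with a small stand-alone sketch. The RFC does not show the body of `set_role`, so everything here is an illustration of the stated behavior only: `authorize_role!`, `ForbiddenError` and `ALLOWED_ROLES` are hypothetical names, and the user is modelled as a plain hash with a `'role'` key as in the JWT payload the table mentions.

```ruby
# Roles the RFC lists as permitted on test-case endpoints.
ALLOWED_ROLES = %w[owner supervisor admin].freeze

# Hypothetical error type; a real handler would map it to an HTTP 403 body.
class ForbiddenError < StandardError
  def status
    403
  end
end

# Returns the role when permitted, raises ForbiddenError otherwise.
def authorize_role!(current_user, allowed = ALLOWED_ROLES)
  role = current_user['role']
  unless allowed.include?(role)
    raise ForbiddenError, "role #{role.inspect} is not permitted"
  end

  role
end

authorize_role!({ 'role' => 'supervisor' }) # => "supervisor"
# authorize_role!({ 'role' => 'agent' }) raises ForbiddenError (status 403)
```

Because the check runs on every request, hiding the menu item in chatbot-fe stays a convenience and never becomes the security boundary.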

PRD Section Coverage

| PRD § | Title | Where covered |
| --- | --- | --- |
| 2 | Phase Context | §1 Overview |
| 3 | One-liner + Problem | §1 Overview |
| 4 | Target Users / Persona | §1 (Role Coverage) |
| 5 | Non-Goals | §1 Out of Scope |
| 6 | Constraints | §3 (perf, security, data lifecycle), §4 (flag) |
| 7 | Feature Changes (CHG-001 tree) | AITEST-S10 → §2.4, §2.F.2 |
| 8 | New Features (Testing page) | §1 Design References, §2.A, Detail 1.C |
| 9 | API & Webhook Behavior | §2.4 |
| 10 | System Flow / Stories / ACs | Detail 1.A, 1.C, §2.2 |
| 11 | Rollout | §4 |
| 12 | Observability | §3 Monitoring |
| 13 | Success Metrics | §1 Success Criteria, §3 |
| 14 | Launch Plan & Stage Gates | §4 (delivery owns schedule) |
| 15 | Dependencies | §1 Dependencies, §2.F.1 |
| 16 | Key Decisions | Detail 1.B, §2 Technical Decisions |
| 17 | Open Questions | §5 |

Detail 1.B — Decisions Closed (cross-layer)

| Decision | Chosen option | Alternatives rejected | Why rejected | Layer |
| --- | --- | --- | --- | --- |
| D-1 Frontend target repo | chatbot-fe (production); qontak-designer is design reference only | Build in qontak-designer | qontak-designer has zero API client + only mock localStorage auth + no roles (app/composables/useAuth.ts); cannot satisfy set_role or real data | both |
| D-2 Storage | Reuse existing ai_agent_test_cases / ai_agent_test_case_questions (Postgres, uuid, acts_as_paranoid) | New ai_validation_sessions/_items tables | Superseded by implemented schema (PRD §16b) | BE |
| D-3 Menu gating | Server-side set_role is authoritative; FE menu reuses existing rollout_ai_agent + subscription flag pattern | Role-gate the FE menu only | FE menu gating today is flag/subscription-based (layouts/bot-automation.vue); BE must enforce regardless | both |
| D-4 Batch processing | Async Sidekiq FetchRoomConversationsWorker, queue :ai_agent | Kafka; synchronous request | Already the chatbot async stack (PRD §16a); sync would block & exceed LLM TPM | BE |
| D-5 Shadow inference | Reuse QontakNlp predict path per question, token-bucket throttled in worker (RPM cap SystemPreference, default 60; 429→backoff+requeue→fail-question) | New batch endpoint on AI service | Reuse the proven lib/qontak_nlp/inference.rb#prediction; throttle contract fully specified in §3 Performance (REV-1) | BE |
| D-6 Confidence aggregate | Recompute confidence_score server-side on each rating write | Compute on read; FE-side | Single source of truth; needed by tree-diagram + gate; avoids drift | BE |
| D-7 Activation gate | Add advisory→enforced threshold check in Repositories::Publish behind ai_agent_testing_gate flag; threshold is org-configurable via SystemPreference (group_code: 'engine', code: 'ai_agent_testing_threshold', default 80) | Hard gate from day one; hard-coded 80 constant | Ship advisory for beta (PRD Open Q #2), enforce before GA; configurable threshold resolves PRD Open Q #4 (REV-4) without a redeploy | BE |
| D-8 Delete semantics | Soft delete via acts_as_paranoid; add missing DELETE endpoint | Hard delete | PRD §6 soft-delete + restore; chatbot-fe already calls a delete route that BE lacks | both |
| D-9 Per-status lifecycle | pending → processing → completed/failed (test case); pending → processing → completed/failed (question) | Single boolean done flag | Needs partial/failed surfacing (AITEST-S08/ERR-1) | BE |
| D-10 Sampling cap | 10% random, capped 50–70, ≤50 shown if batch > 100; all rooms if < 10 eligible | Expose params in drawer | Adds user effort; defaults are the trust signal (PRD §16b) | BE/Data |

Minimum-coverage decisions: storage (D-2), sync/async (D-4), caching (no alternative considered — no read-cache introduced this phase; detail reads are infrequent), third-party (D-5), consistency (D-6, server-authoritative/strong within request), multi-tenancy (set_role + org-scoped queries), reuse-vs-new (§2.4 Reuse? column).
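Decision D-7 (advisory during beta, enforced before GA, with a configurable threshold) can be sketched as a pure check that the publish path would consult. The names `check_activation_gate` and `GateResult`, and the three `mode:` values, are hypothetical; treating a missing score as below threshold is also an assumption, not something the RFC states.

```ruby
# Hypothetical result object: whether publish may proceed, plus an optional
# warning for the UI or the error body.
GateResult = Struct.new(:allowed, :warning, keyword_init: true)

# mode: :off      -> flag disabled, never blocks
#       :advisory -> warns below threshold but still allows publish
#       :enforced -> blocks publish below threshold
def check_activation_gate(confidence_score:, threshold: 80, mode: :advisory)
  return GateResult.new(allowed: true, warning: nil) if mode == :off

  # Assumption: an agent with no score yet counts as below threshold.
  below = confidence_score.nil? || confidence_score < threshold
  return GateResult.new(allowed: true, warning: nil) unless below

  message = "confidence #{confidence_score.inspect} is below threshold #{threshold}"
  GateResult.new(allowed: mode != :enforced, warning: message)
end

check_activation_gate(confidence_score: 80).allowed                   # => true
check_activation_gate(confidence_score: 72).allowed                   # => true (advisory)
check_activation_gate(confidence_score: 72, mode: :enforced).allowed  # => false
```

In enforced mode the caller would translate `allowed: false` into the 422 response named in AITEST-S07; the threshold argument is where the org-level preference value would be passed in.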

Detail 1.C — Per-Story Change Map

| Story id | Title | Layer scope | FE changes | BE changes | Composite AC ids | Acceptance criteria (verifiable) | RFC anchors |
| --- | --- | --- | --- | --- | --- | --- | --- |
| AITEST-S01 | Workspace access control | FE + BE | pages/bot-automation/testing/index.vue; nav item in layouts/bot-automation.vue; FETCH_TEST_CASES (exists) | GET .../test_cases (exists, set_role) | S01/AC-1, AC-2, ERR-1, NEG-1 | rspec: 403 for standard; vitest: error slate fires ai_workspace_load_failed | §2.4 row1 · §2.A · §3 authz |
| AITEST-S02 | Historical sampling (10%) | BE-only | n/a (server-side) | FetchRoomConversationsWorker sampling step (new) | S02/AC-1, AC-2, AC-3, ERR-1 | worker spec: 200 rooms→~20; <10→all; 5000→cap 50–70 | §2.F job spec · §4.D chunk 3 |
| AITEST-S03 | Data integrity & filtering | BE-only | n/a | ExtractConversationPairs (exists); confirm non-text/system skip | S03/AC-1, AC-2, AC-3, ERR-1, NEG-2 | extractor spec: bot-only excluded; image-only skipped | §2.2 · existing spec |
| AITEST-S04 | Shadow execution (zero leakage) | BE-only | per-question failed badge | worker shadow-gen + question persistence (new); QontakNlp predict | S04/AC-1, AC-2, ERR-1 | worker spec: 0 SendMessageWorker enqueues; answer + parameters.human_answer persisted | §2.2 · §2.F · §4.D chunk 4 |
| AITEST-S05 | Side-by-side validation UI | FE + BE | TestCaseComparison.vue, QuestionList.vue (grouped by topic) | GET .../test_cases/:id detail (exists) | S05/AC-1, AC-2, AC-3, ERR-1, NEG-3 | vitest: renders human-left/AI-right + confidence/time/sources; failed→"could not generate" | §2.4 row3 · §2.A |
| AITEST-S06 | Confidence meter & feedback | FE + BE | ConfidenceMeter.vue; thumbs via UPDATE_TEST_CASE_QUESTION (exists, optimistic) | aggregate recompute on rate (new) | S06/AC-1, AC-2, AC-3, ERR-1 | rspec: rating recomputes confidence_score=(up÷total)×100; vitest: rollback on save fail | §2.4 row4 · §2.F.2 · §4.D chunk 5 |
| AITEST-S07 | Activation gatekeeping | FE + BE | Activate button enable/disable in AiAgentEditor.vue footer | publish gate in Repositories::Publish (new, flagged) | S07/AC-1, AC-2, ERR-1 | rspec: publish 422 when <80 & gate on; FE button disabled <80 | §2.4 row5 · §4.D chunk 8 |
| AITEST-S08 | Background processing (async) | BE + FE | generating modal + poll | worker status lifecycle (new) | S08/AC-1, AC-2, ERR-1 | worker spec: status processing→completed; failure→failed + Rollbar | §2.F · §2.1 state |
| AITEST-S09 | Manual override & audit | FE + BE | Force-activate modal (reason) | publish override path + PaperTrail reason (new) | S09/AC-1, AC-2, ERR-1 | rspec: override requires reason; PaperTrail row w/ reason + score | §2.4 row5 · §3 audit |
| AITEST-S10 | Confidence in Tree Diagram | FE + BE | node badge in bot-flow tree (chatbot-fe) | GetTreeDiagramV3#add_ai_agent avg score (new) | S10/AC-1, AC-2, AC-3, ERR-1 | rspec: add_ai_agent returns avg over completed; no test cases→"no score yet" | §2.4 row6 · §2.F.2 |

Every FE + BE row has both columns filled. S02/S03/S04 are BE-only (server-side pipeline); their UI effects are covered by S05/S08 surfaces.
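The status lifecycle that S08 depends on (pending → processing → completed/failed) can be written down as an explicit transition table, which makes illegal jumps fail loudly in a worker spec. This is a sketch only: `ALLOWED_TRANSITIONS`, `InvalidTransition` and `next_status` are hypothetical names, and treating `completed` and `failed` as terminal is an assumption drawn from the arrows in Decision D-9.

```ruby
# Allowed moves for both test cases and questions, per the RFC's lifecycle.
ALLOWED_TRANSITIONS = {
  'pending'    => %w[processing],
  'processing' => %w[completed failed],
  'completed'  => [],  # assumed terminal
  'failed'     => []   # assumed terminal
}.freeze

class InvalidTransition < StandardError; end

# Returns the target status if the move is legal, raises otherwise.
def next_status(current, target)
  allowed = ALLOWED_TRANSITIONS.fetch(current) do
    raise InvalidTransition, "unknown status #{current.inspect}"
  end
  unless allowed.include?(target)
    raise InvalidTransition, "#{current} -> #{target} is not allowed"
  end

  target
end

next_status('pending', 'processing')   # => "processing"
next_status('processing', 'failed')    # => "failed"
# next_status('pending', 'completed') raises InvalidTransition
```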


2. Technical Design

Detail 2.0 — Repo Reading Guide (read this first)

Repo Map (mermaid, both layers)

flowchart LR
subgraph fe["chatbot-fe (Nuxt + Pinia)"]
page["pages/bot-automation/testing/"]
comp["modules/bot-automation/components/testing/"]
store["store/ai-agent/{actions,getters,interface}.ts"]
svc["common/services/main/v1/ai-agents.ts"]
end
subgraph be["chatbot (Rails + Grape)"]
ctrl["api/frontend_service/v1/ai_agent/*_controller.rb"]
uc["use_cases/{create_test_cases,rate_test_case_question,publish_ai_agent}.rb"]
repo["repositories/{create_test_case,rate_test_case_question,publish}.rb"]
worker["workers/fetch_room_conversations_worker.rb"]
chat["core/repositories/chat_service/*"]
nlp["lib/qontak_nlp/inference.rb"]
tree["core/repositories/paths/get_tree_diagram_v3.rb"]
end
subgraph infra
db[("Postgres: ai_agent_test_cases / _questions")]
q[["Sidekiq queue :ai_agent"]]
hub(["Hub Chat Service (HTTP)"])
ai(["QontakNLP AI service (HTTP)"])
end
svc --> ctrl
ctrl --> uc --> repo --> db
uc --> q --> worker
worker --> chat --> hub
worker --> nlp --> ai
worker --> db
ctrl --> tree --> db

Existing Code Anchors

| Layer | Path | Why the agent reads it | What pattern it teaches |
| --- | --- | --- | --- |
| BE | app/api/frontend_service/v1/ai_agent/test_cases_controller.rb | The 3 live routes + set_role + result-matcher | Grape route + Dry::Matcher::ResultMatcher + success_response/error_response |
| BE | app/api/frontend_service/v1/ai_agent/use_cases/create_test_cases.rb | Create flow, validation, worker enqueue | APIAbstractUseCase + Dry::Monads::Do + Repositories::*.call |
| BE | app/api/frontend_service/v1/ai_agent/repositories/create_test_case.rb | How a test case is built (status='pending') | AbstractRepository write pattern |
| BE | app/api/frontend_service/v1/ai_agent/repositories/rate_test_case_question.rb | Rating write (no aggregate today) | per-field update; extension point for recompute |
| BE | app/workers/fetch_room_conversations_worker.rb | Worker shell (fetch+extract+log only) | sidekiq_options queue: :ai_agent, retry: false; per-room rescue→Rollbar |
| BE | app/core/repositories/chat_service/extract_conversation_pairs.rb | Q/A pairing, system/non-text skip | customer-question → next agent text reply |
| BE | app/core/repositories/chat_service/fetch_assigned_room_ids.rb | Assigned-room fetch (status:'assigned', LIMIT) | Hub HTTP + cursor pagination |
| BE | lib/qontak_nlp/inference.rb | prediction(...) shape + timeout: 60 | @http.call(method:'POST', url:, body:, open/read_timeout:) |
| BE | app/core/repositories/paths/get_tree_diagram_v3.rb | add_ai_agent (L850–909) node assembly | where to add avg-confidence |
| BE | app/api/frontend_service/v1/ai_agent/repositories/publish.rb | Publish = set active_version_id (no gate) | extension point for gate |
| BE | app/core/repositories/system_preferences/feature_flag.rb | FeatureFlag.enabled?(group_code, code) | org-level flag mechanism |
| FE | common/services/main/v1/ai-agents.ts | 5 test-case client methods (incl. deleteTestCase) | $apiMain + endpoint.v1.ai_agents.test_cases.* |
| FE | store/ai-agent/interface.ts | TestCase, TestCaseQuestion, TestCaseDetail types | typed payloads/responses |
| FE | store/ai-agent/actions.ts | CREATE/FETCH/FETCH_DETAIL/DELETE/UPDATE_QUESTION | $patch fetchStatus pending/resolved/rejected + optimistic rollback |
| FE | modules/bot-automation/components/ai-agents/AiAgentsTable.vue | list table loading/empty/pagination | tableContent + empty illustration |
| FE | modules/bot-automation/components/AiAgentEditor.vue | settings footer (Save button) | where Activate/gate lands (L1712–1733) |
| FE | layouts/bot-automation.vue | menu listMenu + flag gating (L209–349) | where Testing nav item lands |
| Design | qontak-designer app/pages/bot-automation/testing/index.vue | table columns + states (design ref) | 6 columns: name/type/score/status/updated/actions |

Existing Contracts to Reuse, Extend, or Replace (BE)

| Contract | Status | Justification | Owner |
| --- | --- | --- | --- |
| GET /api/v1/ai_agents/:id/test_cases | reuse | exists, set_role | BOT |
| POST /api/v1/ai_agents/:id/test_cases | extend | exists; add name persist, status→processing, real pipeline | BOT |
| GET /api/v1/ai_agents/:ai_agent_id/test_cases/:id | reuse | exists (GetAiAgentTestCaseDetail, serializes confidence_score) | BOT |
| PATCH .../test_cases/:test_case_id/questions/:question_id | extend | exists; add aggregate recompute | BOT |
| DELETE /api/v1/ai_agents/:id/test_cases/:test_case_id | new-with-justification | chatbot-fe deleteTestCase calls it but no BE route exists (only delete '/:id' deletes the agent); PRD §6 soft delete needs it | BOT |
| POST /api/v1/ai_agents/:id/publish | extend | exists; add confidence gate (flagged) | BOT |
| GET /api/v3/paths/:id/tree_diagram | extend | exists; add_ai_agent add avg confidence | BOT |
| FetchRoomConversationsWorker | extend | exists; add sampling + shadow-gen + persistence + status | BOT |
| lib/qontak_nlp/inference.rb#prediction | reuse | live-predict path; call per question with throttle | AI squad |
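The "call per question with throttle" contract (token bucket with an RPM cap, default 60, per Decision D-5) can be illustrated with a small in-process bucket. This is a sketch under assumptions rather than the planned worker code: the `TokenBucket` class, its method names and the injectable `clock:` are hypothetical, and a bucket shared across several Sidekiq processes would need external state that is out of scope here.

```ruby
# Hypothetical in-process token bucket: allows up to rpm requests per minute,
# refilling continuously at rpm / 60 tokens per second.
class TokenBucket
  def initialize(rpm: 60, clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @capacity = rpm.to_f
    @refill_per_second = rpm / 60.0
    @clock = clock
    @tokens = @capacity
    @last_refill = @clock.call
  end

  # Consumes one token and returns true if a request may be sent now.
  def try_acquire
    refill
    return false if @tokens < 1.0

    @tokens -= 1.0
    true
  end

  # Seconds until the next token is available (0.0 if one is ready).
  def wait_time
    refill
    return 0.0 if @tokens >= 1.0

    (1.0 - @tokens) / @refill_per_second
  end

  private

  def refill
    now = @clock.call
    @tokens = [@capacity, @tokens + (now - @last_refill) * @refill_per_second].min
    @last_refill = now
  end
end

bucket = TokenBucket.new(rpm: 60)
if bucket.try_acquire
  # safe to issue one shadow-inference request for the next question
else
  sleep(bucket.wait_time) # or re-enqueue the question with a delay
end
```

Injecting the clock keeps the bucket testable without real sleeps; the 429 backoff and requeue path named in D-5 would sit around the request itself and is not shown.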

Patterns to Follow (and where to find them)

| Layer | Concern | Pattern in repo | Reference file | Deviation? |
| --- | --- | --- | --- | --- |
| FE | State management | Pinia store w/ fetchStatus enum | store/ai-agent/actions.ts | none |
| FE | Error/optimistic | snapshot + rollback on reject | store/ai-agent/actions.ts UPDATE_TEST_CASE_QUESTION | none |
| FE | List loading/empty | tableContent + empty illustration | modules/bot-automation/components/ai-agents/AiAgentsTable.vue | none |
| FE | API client | $apiMain + endpoint map | common/services/main/v1/ai-agents.ts | none |
| BE | HTTP handler | Grape + ResultMatcher + success_response | test_cases_controller.rb | none |
| BE | Use case | APIAbstractUseCase + Dry::Monads::Do.for(:result) | create_test_cases.rb | none |
| BE | Repository write | AbstractRepository#call | create_test_case.rb | none |
| BE | Worker | Sidekiq::Worker + sidekiq_options queue: + per-item rescue→Rollbar | fetch_room_conversations_worker.rb, ask_airene_predict_worker.rb | none |
| BE | Feature flag | SystemPreferences::FeatureFlag.enabled? | feature_flag.rb | none |
| BE | Error shape | ErrorException(message:[], code:, errors:, error_code:) | helpers/error_response_helpers.rb | none |
| Cross | snake_case API → FE | FE consumes snake_case JSON directly | store/ai-agent/interface.ts | none |

Reading Order for the Agent

  1. chatbot/app/api/frontend_service/v1/ai_agent/test_cases_controller.rb — live routes + auth.
  2. chatbot/app/api/frontend_service/v1/ai_agent/use_cases/create_test_cases.rb — create + enqueue.
  3. chatbot/app/workers/fetch_room_conversations_worker.rb — the worker to extend (the core gap).
  4. chatbot/app/core/repositories/chat_service/extract_conversation_pairs.rb — Q/A extraction.
  5. chatbot/lib/qontak_nlp/inference.rb — the predict call to reuse for shadow gen.
  6. chatbot/app/api/frontend_service/v1/ai_agent/repositories/rate_test_case_question.rb — recompute extension point.
  7. chatbot/app/api/frontend_service/v1/ai_agent/repositories/publish.rb — gate extension point.
  8. chatbot/app/core/repositories/paths/get_tree_diagram_v3.rb (add_ai_agent) — tree surface.
  9. chatbot-fe/store/ai-agent/{actions,interface}.ts — the FE store/types already wired.
  10. chatbot-fe/modules/bot-automation/components/ai-agents/AiAgentsTable.vue — list/empty/loading pattern to mirror.

Source Verification (anti-hallucination — required)

| Layer | Anchor / contract | Verified by | Evidence |
| --- | --- | --- | --- |
| BE | ai_agent_test_cases schema | read migration | db/migrate/20260512000001_create_ai_agent_test_cases.rb: cols ai_agent_history_id uuid NOT NULL, status string, confidence_score integer, type string, deleted_at; uuid PK |
| BE | ai_agent_test_case_questions schema | read migration | db/migrate/20260512000002_..._questions.rb: topic, question(text), answer(text), is_score(bool default false), score(int), scored_by(uuid)/_email/_name, scored_at, response_time(int), confidence(int), status, status_description(text), sources(jsonb default []), parameters(jsonb default {}), deleted_at |
| BE | Models soft-delete | read | app/models/ai_agent_test_case.rb L4 acts_as_paranoid, L5 has_paper_trail; same in ai_agent_test_case_question.rb |
| BE | DB dialect / migrator | read | config/database.yml adapter: postgresql; db/schema.rb ActiveRecord::Schema[7.1], enable_extension "pgcrypto" |
| BE | Routes + mount | read | app/api/frontend_service/api.rb L47-48 mount V1::AiAgent::TestCasesController => '/v1/ai_agents'; config/routes.rb mounts APIBase => '/api/' → full /api/v1/ai_agents |
| BE | 3 live routes | read | test_cases_controller.rb L32 get '/:id/test_cases', L75 post, L115 patch '/:id/test_cases/:test_case_id/questions/:question_id'; all set_role(%w[owner supervisor admin]) |
| BE | No delete test-case route | grep | only ai_agents_controller.rb L237 delete '/:id' (deletes agent); no test-case delete |
| BE | Create status pending | read | repositories/create_test_case.rb L19 record.status = 'pending'; use case enqueues FetchRoomConversationsWorker.perform_async |
| BE | Worker does NOT sample/generate/persist | read full file | app/workers/fetch_room_conversations_worker.rb L1-43: fetch rooms → extract pairs → Rails.logger.info{...}; no LLM, no question insert, no status update |
| BE | Queue :ai_agent | read | worker L5 sidekiq_options queue: :ai_agent, retry: false; config/sidekiq.yml lists ai_agent queue |
| BE | Extraction logic | read | extract_conversation_pairs.rb L40-66: skip SYSTEM; customer text → pending_question if question?; agent text reply → pair; non-text skipped |
| BE | Rating no aggregate | read | repositories/rate_test_case_question.rb L13-19 sets score/is_score/scored_by*/scored_at only; no confidence_score write |
| BE | Publish no gate | read | repositories/publish.rb L12-18 @ai_agent.update!(active_version_id: @ai_agent.version_id); no threshold |
| BE | Tree diagram v3 | read | route app/api/frontend_service/v3/path.rb L17 get ':id/tree_diagram'; core/repositories/paths/get_tree_diagram_v3.rb add_ai_agent L850-909 builds node sans confidence |
| BE | QontakNLP predict | read | lib/qontak_nlp/inference.rb#prediction timeout: 60, @http.call(method:'POST', ...); core/repositories/qontak_nlp/predict.rb resolves timeout via system pref |
| BE | Chat Service fetch | read | fetch_assigned_room_ids.rb Hub::ChatService::Rooms::List status:'assigned', limit: LIMIT; fetch_room_messages.rb Messages::GetByRoom |
| BE | Token fetch | read | channel_integrations/get_tokens.rb access_token from chatbot_tokens_encrypted (lockbox) |
| BE | Error/success shape | read | helpers/success_response_helpers.rb {status, code, message, data, meta}; error_response_helpers.rb ErrorException(message:[], code:, errors:, error_code:) |
| BE | Feature flag | read | feature_flag.rb FeatureFlag.enabled?(group_code, code, default:); no ai_agent_testing flag exists yet |
| BE | Test commands | read | bin/rspec_pipeline.sh RAILS_ENV=test bundle exec rspec spec/...; bitbucket-pipelines.yml bundle exec rubocop; Gemfile has brakeman |
| BE | Existing specs | ls | spec/api/frontend_service/v1/ai_agent/{create_test_cases,get_test_cases,rate_test_case_question}_spec.rb; spec/workers/fetch_room_conversations_worker_spec.rb; spec/core/repositories/chat_service/extract_conversation_pairs_spec.rb |
| FE | Test-case API client | read | common/services/main/v1/ai-agents.ts L250-351: createTestCase/getTestCases/getTestCaseDetail/deleteTestCase/updateTestCaseQuestion; endpoint.ts L207-214 paths |
| FE | Types | read | store/ai-agent/interface.ts L268-365: `TestCase, CreateTestCasePayload, TestCaseQuestion, TestCaseDetail, UpdateTestCaseQuestionPayload(score:0 |
| FE | Pinia actions | read | store/ai-agent/actions.ts CREATE_TEST_CASE(745), FETCH_TEST_CASES(803), FETCH_TEST_CASE_DETAIL(849), DELETE_TEST_CASE(902), UPDATE_TEST_CASE_QUESTION(950, optimistic rollback) |
| FE | No Testing page yet | ls | pages/bot-automation/ has actions/ai-agents/ai-agent[id]; no /testing; old UI in modules/ai-agent/components/forms/ValidationDetailPanel.vue |
| FE | Menu gating flag/sub | read | layouts/bot-automation.vue L209-349 listMenu gated by rolloutAIAgentPreferences/aiAgentEnabled/isNewAIAgentEngine; not roles |
| FE | Activate button absent | read | modules/bot-automation/components/AiAgentEditor.vue L1712-1733 footer shows "Save changes" only |
| FE | Design system | read | chatbot-fe package.json @mekari/pixel3@^1.0.12; qontak-designer @mekari/pixel3@1.0.13-dev.0 |
| FE | Test commands | read | chatbot-fe package.json: test: vitest run, test:e2e: playwright test, lint, build: nuxt build |
| Design | qontak-designer is static prototype | read/grep | no api/ folder, no $fetch/useFetch; app/composables/useAuth.ts mock localStorage, no roles |
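The "Extraction logic" evidence above (skip system messages, hold a customer question, pair it with the next agent text reply, skip non-text) describes a small state machine. The following is an independent sketch of that rule for readers who have not opened the extractor; it is not the contents of `extract_conversation_pairs.rb`. The `ChatMessage` struct, the sender and type strings, and especially the `question?` heuristic are hypothetical, since the RFC does not show how the real predicate decides.

```ruby
# Hypothetical message shape; the real Hub payload is not shown in the RFC.
ChatMessage = Struct.new(:sender, :type, :text, keyword_init: true)

# Placeholder heuristic only: the real question? check is not specified here.
def question?(text)
  text.strip.end_with?('?')
end

# Pairs each customer question with the next human-agent text reply.
def extract_pairs(messages)
  pairs = []
  pending_question = nil

  messages.each do |msg|
    next if msg.sender == 'system'   # system events never form a pair
    next unless msg.type == 'text'   # images, files, voice are skipped

    case msg.sender
    when 'customer'
      pending_question = msg.text if question?(msg.text)
    when 'agent'
      next if pending_question.nil?  # reply with no open question

      pairs << { question: pending_question, human_answer: msg.text }
      pending_question = nil
    end
  end

  pairs
end

conversation = [
  ChatMessage.new(sender: 'system', type: 'text', text: 'Room assigned'),
  ChatMessage.new(sender: 'customer', type: 'text', text: 'Can I change my plan?'),
  ChatMessage.new(sender: 'customer', type: 'image', text: nil),
  ChatMessage.new(sender: 'agent', type: 'text', text: 'Yes, from the billing page.')
]
extract_pairs(conversation)
# => [{ question: "Can I change my plan?", human_answer: "Yes, from the billing page." }]
```

Messages from any other sender (for example a bot) fall through the `case` without creating a pair, which matches the "bot-only excluded" expectation in the extractor spec.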

Design ↔ Code Mapping (frontend half)

| Figma frame / component | Implementing file (chatbot-fe) | Reuse vs new | Tokens | Backing API | Deviation |
| --- | --- | --- | --- | --- | --- |
| Testing page (list) | pages/bot-automation/testing/index.vue + modules/bot-automation/components/testing/TestCasesTable.vue | new (mirror AiAgentsTable.vue) | color.surface.*, space.*, text.body* | GET .../test_cases | none (pattern-faithful) |
| Generate modal + Inbox drawer | modules/bot-automation/components/testing/{GenerateTestCaseModal,GenerateFromInboxDrawer}.vue | new (port from qontak-designer layout) | Pixel3 MpModal/MpDrawer | POST .../test_cases | adds version selector (prototype lacks it; see §5 Q-A) |
| Generating modal | .../testing/TestCaseGeneratingModal.vue | new | MpModal + progress | poll GET .../test_cases | none |
| Comparison + question list | .../testing/TestCaseComparison.vue, QuestionList.vue | new (no prototype exists) | MpAccordion, MpBadge | GET .../test_cases/:id | reference old modules/ai-agent/.../ValidationDetailPanel.vue for layout |
| Confidence meter | .../testing/ConfidenceMeter.vue | new | MpBadge/progress | from detail payload | none |
| Activate gate | modules/bot-automation/components/AiAgentEditor.vue (footer) | extend | MpButton | POST .../publish | none |

The Comparison/detail view has no qontak-designer prototype — flag for Design QA before the chunk lands (§5 Q-A).

Detail 2.1 — Architecture (mermaid)

End-to-end component diagram

flowchart TB
user([SPV/Admin]) --> page["chatbot-fe Testing page"]
page --> store["Pinia ai-agent store"]
store --> client["ai-agents.ts client"]
client --> ctrl["/api/v1/ai_agents/.../test_cases/"]
ctrl --> ucCreate[CreateTestCases UC]
ucCreate --> repoCreate[(CreateTestCase repo)]
repoCreate --> db[("ai_agent_test_cases")]
ucCreate --> q[["Sidekiq :ai_agent"]]
q --> worker[FetchRoomConversationsWorker]
worker --> chat["ChatService repos"] --> hub(["Hub Chat Service"])
worker --> nlp["QontakNlp predict"] --> ai(["AI service"])
worker --> dbq[("ai_agent_test_case_questions")]
ctrl --> ucRate[RateTestCaseQuestion UC] --> dbq
ucRate --> agg[["recompute confidence_score"]] --> db
ctrl --> tree["GetTreeDiagramV3#add_ai_agent"] --> db

Data model (mermaid erDiagram)

erDiagram
AI_AGENTS ||--o{ AI_AGENT_HISTORIES : versions
AI_AGENTS ||--o{ AI_AGENT_TEST_CASES : has
AI_AGENT_HISTORIES ||--o{ AI_AGENT_TEST_CASES : binds
AI_AGENT_TEST_CASES ||--o{ AI_AGENT_TEST_CASE_QUESTIONS : has
AI_AGENT_TEST_CASES {
uuid id PK
uuid ai_agent_id FK
uuid ai_agent_history_id FK
int organization_id
string status
int confidence_score
string type
datetime deleted_at
}
AI_AGENT_TEST_CASE_QUESTIONS {
uuid id PK
uuid ai_agent_test_case_id FK
string topic
text question
text answer
int score
bool is_score
int confidence
int response_time
string status
text status_description
jsonb sources
jsonb parameters
datetime deleted_at
}

State machine — test-case status

stateDiagram-v2
[*] --> pending: POST create
pending --> processing: worker starts
processing --> completed: all questions generated
processing --> failed: fatal worker error
completed --> completed: ratings update (no status change)
failed --> processing: retry (re-enqueue)
completed --> [*]

State machine — question status

stateDiagram-v2
[*] --> pending: row created
pending --> processing: shadow-gen starts
processing --> completed: LLM answer stored
processing --> failed: LLM error (status_description set)
completed --> [*]
failed --> [*]

Branch & skip flow — sampling & filtering

flowchart TD
start([worker: rooms fetched]) --> elig{eligible rooms count}
elig -- "< 10" --> all[use 100% of rooms]
elig -- ">= 10" --> sample["random 10%"]
sample --> cap{"> 50-70 cap?"}
cap -- yes --> capped[truncate to cap]
cap -- no --> kept[keep sample]
all --> extract[ExtractConversationPairs]
capped --> extract
kept --> extract
extract --> nonText{text-only Q/A?}
nonText -- no --> skip[skip room, not counted]
nonText -- yes --> gen[shadow-generate + persist question]
skip --> done([batch continues])
gen --> done
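
The sampling branch above can be sketched as a small pure function. This is a minimal illustration, not the worker's implementation: `sampleRooms`, the injectable `rng`, the default cap of 50, and rounding the 10% up are all assumptions (the RFC leaves the cap at "50-70").

```typescript
// Hypothetical helper for illustration; names and defaults are assumptions.
type Rng = () => number; // returns a float in [0, 1)

function sampleRooms(roomIds: string[], cap = 50, rng: Rng = Math.random): string[] {
  // Fewer than 10 eligible rooms: use 100% of them.
  if (roomIds.length < 10) return roomIds.slice();

  // Otherwise take a random 10% (rounded up, an assumption), truncated to the cap.
  const target = Math.min(Math.ceil(roomIds.length * 0.1), cap);

  // Draw without replacement by removing each picked id from the pool.
  const pool = roomIds.slice();
  const picked: string[] = [];
  while (picked.length < target) {
    const idx = Math.floor(rng() * pool.length);
    picked.push(...pool.splice(idx, 1));
  }
  return picked;
}
```

Injecting `rng` keeps the function deterministic under test; text-only filtering happens later, in ExtractConversationPairs, and is not part of this sketch.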

Detail 2.2 — Sequence (mermaid, end-to-end incl. failure)

Happy path — generate test case (async batch with shadow gen)

sequenceDiagram
actor U as SPV (chatbot-fe)
participant LB as LB / API gateway
participant API as chatbot Grape API
participant UC as CreateTestCases
participant DBW as Postgres primary
participant Q as Sidekiq :ai_agent
participant W as FetchRoomConversationsWorker
participant HUB as Hub Chat Service
participant NLP as QontakNLP AI service

U->>LB: POST /api/v1/ai_agents/:id/test_cases {type, version_id, name}
LB->>API: HTTP
API->>API: set_role(owner/supervisor/admin)
API->>UC: handle
UC->>DBW: INSERT ai_agent_test_cases (status='processing')
UC->>Q: FetchRoomConversationsWorker.perform_async
UC-->>API: 201 {data: test_case}
API-->>U: 201 (UI shows generating modal, polls status)
Note over Q,W: async — worker picks up within seconds
W->>HUB: GET assigned rooms (status=assigned, 90d, limit=100)
HUB-->>W: room_ids
W->>W: sample 10% (cap 50-70; all if <10)
loop per sampled room
W->>HUB: GET messages by room
HUB-->>W: messages
W->>W: ExtractConversationPairs (text-only)
loop per Q/A pair
W->>DBW: INSERT question (status='processing', parameters.human_answer)
W->>NLP: POST predict {message: question} (NOT send_message)
Note right of NLP: timeout 60s; throttle for TPM/RPM
NLP-->>W: {answer, confidence, sources, response_time}
W->>DBW: UPDATE question (answer, confidence, sources, status='completed')
end
end
W->>DBW: UPDATE test_case status='completed'

Failure path — LLM error on one question (batch continues)

sequenceDiagram
participant W as Worker
participant DBW as Postgres primary
participant NLP as QontakNLP

W->>DBW: INSERT question (status='processing')
W->>NLP: POST predict
Note right of NLP: timeout after 60s / 5xx
NLP--xW: error
W->>W: Rollbar.error(test_case_id, room_id)
W->>DBW: UPDATE question status='failed', status_description=error
Note over W: continue with next question; test_case still reaches 'completed' (partial)

Failure path — Chat Service room-list unavailable (whole batch)

sequenceDiagram
participant W as Worker
participant HUB as Hub Chat Service
participant DBW as Postgres primary
W->>HUB: GET assigned rooms
HUB--xW: 5xx / timeout
W->>W: Rollbar.error
W->>DBW: UPDATE test_case status='failed'
Note over W: UI surfaces error + Retry (AITEST-S02/ERR-1)

Detail 2.3 — Database Model (DDL)

No new tables. Both tables exist (migrations 20260512000001, 20260512000002, Postgres, ActiveRecord::Migration[7.1], uuid PK via pgcrypto). This phase requires one additive migration: the name column described below. Partial/failed surfacing needs no migration, because status_description already exists per schema. No destructive change.

Current shape (verified — for the agent's reference, not re-created):

-- db/migrate/20260512000001_create_ai_agent_test_cases.rb (EXISTS)
CREATE TABLE ai_agent_test_cases (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
ai_agent_history_id uuid NOT NULL,
ai_agent_id uuid NOT NULL,
organization_id integer NOT NULL,
company_id varchar NOT NULL,
status varchar,
confidence_score integer,
type varchar,
deleted_at timestamp,
created_at timestamp NOT NULL,
updated_at timestamp NOT NULL
);
-- indexes: organization_id, company_id, ai_agent_history_id, ai_agent_id, status, type

-- db/migrate/20260512000002_create_ai_agent_test_case_questions.rb (EXISTS)
CREATE TABLE ai_agent_test_case_questions (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
ai_agent_test_case_id uuid NOT NULL, -- FK ON DELETE CASCADE
organization_id integer NOT NULL,
company_id varchar NOT NULL,
topic varchar, question text, answer text,
is_score boolean DEFAULT false, score integer,
scored_by uuid, scored_by_email varchar, scored_by_name varchar, scored_at timestamp,
started_at timestamp, completed_at timestamp,
response_time integer, confidence integer,
status varchar, status_description text,
sources jsonb DEFAULT '[]', parameters jsonb DEFAULT '{}',
deleted_at timestamp, created_at timestamp NOT NULL, updated_at timestamp NOT NULL
);
-- indexes: ai_agent_test_case_id, organization_id, company_id, status, score, topic

Additive migration (this phase — required, not conditional). The FE TestCase type carries name (store/ai-agent/interface.ts) and the Generate drawer submits it, but BE create does not persist a name today (repositories/create_test_case.rb sets no name). Chunk 1 adds and persists the column so the create round-trip and the list name column are non-empty (resolves the §2.G partial row, REV-2):

-- db/migrate/2026XXXXXXXXXX_add_name_to_ai_agent_test_cases.rb
ALTER TABLE ai_agent_test_cases ADD COLUMN name varchar;
CREATE INDEX index_ai_agent_test_cases_on_name ON ai_agent_test_cases (name);

name is nullable for backward compatibility (existing rows have none); CreateTestCase persists params[:name] (length-bounded ≤ 24, validated in the use-case contract).

  • Cardinality: ~1 test case per agent per validation run; questions 10–70 per case.
  • Growth: bounded by cap (≤70 questions/case). PII: question, answer, parameters.human_answer contain customer/agent text → see §3 Compliance.
  • Retention: soft-delete (deleted_at); hard-purge window TBD (Open Q #8).

Per-status lifecycle — ai_agent_test_cases.status:

| Status | Visibility | Retention | Restore | Transitions allowed |
| --- | --- | --- | --- | --- |
| pending | list (transient) | until processed | n/a | → processing |
| processing | list w/ spinner | during batch | n/a | → completed / failed |
| completed | default list | until soft-deleted | restore via paranoid | (ratings only) |
| failed | list w/ error | until soft-deleted | re-run (re-enqueue) | → processing |
| (soft-deleted) | hidden | until hard-purge (TBD) | restore (paranoid) | |

Per-status lifecycle — ai_agent_test_case_questions.status:

| Status | Visibility | Retention | Restore | Transitions |
| --- | --- | --- | --- | --- |
| pending | n/a (transient) | during batch | n/a | → processing |
| processing | per-question spinner | during batch | n/a | → completed / failed |
| completed | comparison shown | with parent | with parent | (rating only) |
| failed | "could not generate" | with parent | re-run | → processing |
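
The transitions listed in the two lifecycle tables can be expressed as a lookup that a store or test can check against. This is a sketch under the assumption that both entities share the same four statuses; `TRANSITIONS` and `canTransition` are hypothetical names, not existing code.

```typescript
type Status = "pending" | "processing" | "completed" | "failed";

// Allowed next statuses per the lifecycle tables. Ratings never change status,
// so "completed" has no outgoing transition.
const TRANSITIONS: Record<Status, readonly Status[]> = {
  pending: ["processing"],
  processing: ["completed", "failed"],
  completed: [],
  failed: ["processing"], // re-run / re-enqueue
};

function canTransition(from: Status, to: Status): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}
```

Keeping the table as data makes it cheap to assert in a spec that, for example, a completed case is never moved back to processing by a rating.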

Detail 2.4 — APIs

Base: /api/v1/ai_agents (verified mount, §2.0). All endpoints enforce set_role(%w[owner supervisor admin]). Success responses use {status, code, message, data, meta?}; errors raise ErrorException.

Outbound endpoints (consumers call us)

| Endpoint | Method | AuthN/AuthZ | Request | Response | Status codes | Idempotency | Versioning | Reuse? |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| /api/v1/ai_agents/:id/test_cases | GET | api_auth + set_role | query: page, limit, query, status, order_by, order_direction | {data:[TestCase], meta} | 200, 403 | n/a (read) | v1 | reuse |
| /api/v1/ai_agents/:id/test_cases | POST | api_auth + set_role | {type, version_id, name} | {data: TestCase} (status processing) | 201, 404 (version), 422, 403 | client-side dedupe; server idempotent per (agent,version,name) recommended | v1 | extend |
| /api/v1/ai_agents/:ai_agent_id/test_cases/:id | GET | api_auth + set_role | path | {data: TestCaseDetail{questions[]}} incl. confidence_score, per-question answer/parameters.human_answer/confidence/sources/response_time/score/status | 200, 404 | n/a | v1 | reuse |
| /api/v1/ai_agents/:id/test_cases/:test_case_id/questions/:question_id | PATCH | api_auth + set_role | {score: 0\|1} | {data: question} + recomputed confidence_score | 200, 404, 422 (score not 0/1) | last-write-wins per question | v1 | extend |
| /api/v1/ai_agents/:id/test_cases/:test_case_id | DELETE | api_auth + set_role | path | {status:success} (soft delete) | 200, 404, 403 | idempotent (already-deleted → 404, per the DELETE contract below) | v1 | new-with-justification (FE client exists; BE route missing) |
| /api/v1/ai_agents/:id/publish | POST | api_auth + set_role | {override_reason?} | {data: ai_agent} | 200, 404, 422 (below threshold when gate on) | idempotent | v1 | extend (add gate) |
| /api/v3/paths/:id/tree_diagram | GET | api_auth + set_role | path | {data: tree} w/ ai_agent.avg_confidence_score | 200, 404 | n/a | v3 | extend |

Inbound webhooks (other services call us)

N/A — reason: this phase introduces no inbound webhooks. Shadow generation is a synchronous outbound call from the worker to QontakNLP (no callback); room/message fetch is outbound to Hub Chat Service.

DELETE contract (REV-3) — full specification (this is a new BE route; chatbot-fe's deleteTestCase already calls it):

  • Request: no body. Path params :id (ai_agent), :test_case_id.
  • Behavior: soft delete via acts_as_paranoid. The DeleteTestCase use case loads the test case scoped to current_user['chatbot_organization_id'] and calls .destroy (paranoid sets deleted_at). Child ai_agent_test_case_questions are soft-deleted via the model's dependent: :destroy (also paranoid). No hard delete.
  • Response: 200 {status:"success", code:200, message:"OK", data:{ id }}.
  • Status codes: 200 on success; 404 "Test case not found" when the id does not exist or belongs to another org (cross-tenant reads return 404, not 403, to avoid id enumeration); 403 only when set_role rejects the role outright; idempotent — deleting an already-soft-deleted case returns 404 (the paranoid default scope hides it).
  • Restore: out of scope for the API this phase; soft-deleted rows are restorable via acts_as_paranoid .restore if a future undo surface is added (hard-purge window is Open Q #8). No restore endpoint is exposed now.
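
The status-code rules of the DELETE contract can be summarized as one decision function. This is an illustrative sketch only: `TestCaseRow` and `resolveDeleteStatus` are hypothetical names, and the real route would be a Ruby use case rather than TypeScript.

```typescript
// Hypothetical row shape for illustration; not the chatbot model.
interface TestCaseRow {
  id: string;
  organizationId: number;
  deletedAt: string | null; // set by the soft delete
}

type DeleteStatus = 200 | 403 | 404;

function resolveDeleteStatus(
  roleAllowed: boolean,
  row: TestCaseRow | undefined,
  callerOrgId: number,
): DeleteStatus {
  // 403 only when the role check rejects the caller outright.
  if (!roleAllowed) return 403;
  // Missing, cross-tenant, and already-deleted rows all look the same to the
  // caller, which avoids id enumeration.
  if (row === undefined || row.organizationId !== callerOrgId || row.deletedAt !== null) {
    return 404;
  }
  return 200;
}
```

Collapsing the three "not visible" cases into a single 404 branch mirrors the effect of an org-scoped query under the paranoid default scope.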

Example create request/response:

// POST /api/v1/ai_agents/7e.../test_cases
{ "type": "inbox", "version_id": "a1...", "name": "War room sample #1" }
// 201
{ "status":"success","code":201,"message":"OK",
"data": { "id":"c3...","type":"inbox","version_id":"a1...","name":"War room sample #1","status":"processing","score":null } }

Detail 2.A — UI Contract

ConfidenceMeter.vue (new)

  • Figma: node 16514-155786 · file chatbot-fe/modules/bot-automation/components/testing/ConfidenceMeter.vue
  • Props:
interface ConfidenceMeterProps {
scorePercent: number; // 0..100, = (thumbsUp / total) * 100
threshold?: number; // default 80
totalRated: number;
totalSample: number;
}
  • State owner: derived from store/ai-agent testCaseDetail (no local source of truth).
  • Events: none emitted; analytics ai_validation_completed fires from parent when scorePercent >= threshold.
  • Conditional render: < threshold → "Low Confidence" (warning); >= threshold → "Ready to Launch" (success).
  • A11y: role="progressbar", aria-valuenow/min/max, label "Confidence meter".
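
A minimal sketch of how the meter's value and label could be derived from question scores. Two points are assumptions, since the RFC does not pin them down: the denominator is the number of rated questions (not the full sample), and the percentage is rounded to the nearest integer. `meterView` is a hypothetical helper name.

```typescript
type MeterLabel = "Low Confidence" | "Ready to Launch";

interface MeterView {
  scorePercent: number;
  label: MeterLabel;
}

// scores holds one entry per question; null means "not yet rated".
function meterView(scores: Array<0 | 1 | null>, threshold = 80): MeterView {
  const rated = scores.filter((s) => s !== null);
  const thumbsUp = rated.filter((s) => s === 1).length;
  // 0% until the first rating, matching the UI state matrix.
  const scorePercent = rated.length === 0 ? 0 : Math.round((thumbsUp / rated.length) * 100);
  const label: MeterLabel = scorePercent >= threshold ? "Ready to Launch" : "Low Confidence";
  return { scorePercent, label };
}
```

For example, four thumbs up out of five rated questions gives 80, which meets the default threshold.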

TestCaseComparison.vue (new)

  • Figma: node 16514-155786 · file .../testing/TestCaseComparison.vue
  • Props:
interface TestCaseComparisonProps {
question: TestCaseQuestion; // from store/ai-agent/interface.ts
readonly: true; // comparison is read-only (NEG-3)
}
  • Events: @rate { questionId: string; score: 0 | 1 } → dispatches UPDATE_TEST_CASE_QUESTION.
  • Conditional: question.status === 'failed' → AI panel shows "could not generate" (S05/ERR-1); else human-left / AI-right with confidence/response_time/sources.
  • A11y: thumbs are <button> with aria-pressed.

Detail 2.B — Data-Fetching Strategy

  • Library: Pinia store + $apiMain (ofetch) — existing (store/ai-agent/actions.ts).
  • Cache key: store state slices testCases, testCaseDetail (no external cache lib).
  • TTL / refetch: refetch on mount; poll GET .../test_cases every ~3 s while any row status ∈ {pending, processing} (S08/AC-2), stop at completed/failed.
  • SWR: no — explicit fetchStatus enum (pending/resolved/rejected).
  • Optimistic updates: yes for rating — UPDATE_TEST_CASE_QUESTION snapshots questions and rolls back on reject (existing). On success, dispatch a meter recompute read (the BE returns recomputed aggregate).
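
The polling stop condition and the optimistic rating snapshot described above can be sketched as two pure helpers. The names `shouldKeepPolling` and `rateOptimistically` and the reduced `Question` shape are assumptions for illustration; the existing store action may be structured differently.

```typescript
type RowStatus = "pending" | "processing" | "completed" | "failed";

interface Question {
  id: string;
  score: 0 | 1 | null;
}

// Polling continues while any row is still in a non-terminal status.
function shouldKeepPolling(statuses: RowStatus[]): boolean {
  return statuses.some((s) => s === "pending" || s === "processing");
}

// Returns the optimistic list plus a snapshot to restore if the PATCH rejects.
function rateOptimistically(
  questions: Question[],
  id: string,
  score: 0 | 1,
): { next: Question[]; snapshot: Question[] } {
  const snapshot = questions.map((q) => ({ ...q }));
  const next = questions.map((q) => (q.id === id ? { ...q, score } : q));
  return { next, snapshot };
}
```

The caller would commit `next` to the store immediately and swap `snapshot` back in when the request is rejected.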

Detail 2.C — UI State Matrix

| Surface | Loading | Empty | Error | Partial | Success |
| --- | --- | --- | --- | --- | --- |
| Testing list | skeleton rows (mirror AiAgentsTable) | "No test cases yet" + Generate CTA | blank slate "Couldn't load" + Retry; log ai_workspace_load_failed | some rows processing (poll) | table rows |
| Generate drawer | submit spinner | n/a | inline validation (name/version) | n/a | drawer closes → generating modal |
| Generating modal | progress bar + count | n/a | failed badge + Retry | partial count shown | → list shows completed |
| Detail/comparison | skeleton | "no questions in this test case" | retry | some questions failed (per-card) | human/AI panels |
| Confidence meter | 0% until first rating | 0% (no ratings) | n/a (derived) | partial as more rated | meter + Ready-to-Launch |

Detail 2.D — Data Integrity Matrix

| Write path | Transaction scope | Partial failure | Idempotency | Consistency | Duplicate handling | Stale read |
| --- | --- | --- | --- | --- | --- | --- |
| Create test case | single INSERT | 422 if save fails (no worker enqueued) | recommend unique-ish (agent,version,name) guard | strong (single row) | second click → second row (FE disables button) | n/a |
| Worker: per-question INSERT+UPDATE | per-question (not one big trx) | failed question → status=failed, batch continues | re-run replaces by deleting prior questions for the case (re-enqueue) | eventual (batch) | re-run guard: clear existing questions before regen | poll reflects within ~3 s |
| Rate question + recompute | question UPDATE then aggregate UPDATE in one trx | rollback both on failure → FE restores prior | last-write-wins per question | strong within request | repeated same score → idempotent | meter refetched after write |
| Publish (gate) | single UPDATE active_version_id | 422 if <80 & gate on | idempotent | strong | n/a | n/a |

Detail 2.E — Concurrency Collision Map

| Resource | Writers | Collision | Resolution | On failure |
| --- | --- | --- | --- | --- |
| ai_agent_test_cases.confidence_score | concurrent raters on same case | two thumbs near-simultaneously | recompute reads current question scores in-trx (no stored delta) | last recompute wins; value is deterministic from question rows |
| ai_agent_test_case_questions of a case | worker (regen) vs rater | rate while re-run in flight | re-run sets case processing → FE disables rating; reject rating with 409/422 if case not completed | FE shows "regenerating" |
| ai_agents.active_version_id | publish vs publish | double activate | DB row update idempotent | last write wins |

Detail 2.F — Async Job / Event Consumer Spec

| Job | Trigger | Input | Retry | DLQ | Concurrency | Idempotency | Per-msg timeout | Poison handling |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FetchRoomConversationsWorker (extend) | perform_async from CreateTestCases | {test_case_id, organization_id} | retry: false today → change to bounded retry (e.g. 3, exp backoff) for transient Hub/NLP errors; set failed on exhaustion | none (Sidekiq dead set; alert on failure) | queue :ai_agent (sidekiq.yml concurrency 5 staging / 10 prod) | re-run clears prior questions for test_case_id before regen | per-question NLP read/open_timeout: 60s | fatal → Rollbar.error + test_case status=failed; per-room/per-question errors skip-and-continue (existing pattern) |

Detail 2.F.1 — Responsibility Boundary Matrix

| Step | Owning squad / service | Inbound trigger | Outbound effect | Failure handler | PRD anchor |
| --- | --- | --- | --- | --- | --- |
| 1. Create test case | BOT (chatbot API) | SPV POST | row + worker enqueue | 404/422 | §9 #1, S01 |
| 2. Fetch assigned rooms | Platform / Hub Chat Service | worker | room_ids | Rollbar + case failed | §15, S02 |
| 3. Sample 10%/cap | Data (algorithm) / BOT (impl) | worker | sampled rooms | n/a (deterministic) | §15, S02 |
| 4. Extract Q/A | BOT (ExtractConversationPairs) | worker | pairs | room skipped | S03 |
| 5. Shadow generate | AI squad (QontakNLP) | worker per question | answer/confidence/sources | question failed, continue | §15, S04 |
| 6. Persist questions + status | BOT | worker | DB rows | partial; case completed | S04, S08 |
| 7. Rate + recompute | BOT | SPV PATCH | aggregate score | rollback | S06 |
| 8. Gate publish | BOT | SPV POST publish | activate or 422 | 422 below threshold | S07 |
| 9. Tree-diagram avg | BOT | tree-diagram read | node score | "no score yet" | S10 |

Step 3 ownership (Data vs BOT) and Step 5 throttling contract (TPM/RPM) are the two cross-squad items to confirm before build (Open Q #5, §5).

Detail 2.F.2 — State Surface Contract

| Entity | State field / event | Defaults | Updated by | Read via | Stale window |
| --- | --- | --- | --- | --- | --- |
| Test case | status | pending | worker | GET .../test_cases (poll) | ~3 s (poll interval) |
| Test case | confidence_score | null | rate recompute | detail + list | immediate (write-through) |
| Question | status / status_description | pending | worker shadow-gen | detail | batch duration |
| Question | score / is_score | null / false | RateTestCaseQuestion | detail | immediate |
| AI agent (tree) | avg_confidence_score (computed) | "no score yet" | GetTreeDiagramV3 | tree-diagram | per request |

Detail 2.G — Cross-Layer Contract Verification

| Endpoint | BE response schema | FE expected schema | Match? | Gaps |
| --- | --- | --- | --- | --- |
| GET .../test_cases | {status,code,message,data:[...],meta} | TestCasesResponse {data: TestCase[]} | yes | FE reads data; meta optional |
| POST .../test_cases | {data: test_case} (snake_case, incl. name) | CreateTestCasePayload {agent_id,type,name,version_id}; CreateTestCaseResponse {data:TestCase} | yes (after chunk 1) | resolved (REV-2): name column added + persisted in §2.3/§4.D chunk 1; no longer conditional |
| GET .../test_cases/:id | {data: {..., questions:[...]}} incl. confidence_score, parameters.human_answer | TestCaseDetailResponse {data: TestCaseDetail{questions}} | yes | FE TestCaseQuestion fields align 1:1 with schema |
| PATCH .../questions/:id | {data: question} + recomputed aggregate | UpdateTestCaseQuestionPayload {score:0\|1} | yes (after chunk 5) | aggregate recompute is in-scope work (§4.D chunk 5); meter is static until it lands |
| DELETE .../test_cases/:id | 200 {data:{id}} soft delete (full contract §2.4) | deleteTestCase client exists | yes (after chunk 6) | resolved (REV-3): BE route + contract specified in §2.4; built in §4.D chunk 6 |

All rows now reach Match? = yes once their named execution chunk lands — there is no silent cross-layer divergence. The three former gaps (name persistence → chunk 1, aggregate recompute → chunk 5, delete endpoint → chunk 6) are explicit in-scope work.

Detail 2.H — End-to-End Data Flow

SPV clicks Generate → GenerateFromInboxDrawer @confirm {name, version} → store CREATE_TEST_CASE → ai-agents.ts POST /api/v1/ai_agents/:id/test_cases → CreateTestCases UC → INSERT (status processing) + Sidekiq enqueue → 201 → FE generating modal polls GET .../test_cases → worker (sample → extract → per-question NLP predict → persist) → status completed → SPV opens detail GET .../test_cases/:id → TestCaseComparison renders → @rate PATCH .../questions/:id → recompute confidence_score → meter updates → at ≥80% Activate enabled → POST .../publish.

  • Side effects: PaperTrail versions on test_cases/questions; analytics events (§3); Rollbar on errors.
  • Ownership: FE (chatbot-fe) steps 1, detail render, rating UI; BE (chatbot) create/worker/rate/publish/tree; Platform (Hub) rooms; AI (NLP) predict.

Detail 2.I — Scope Boundaries

  • BE create: chatbot/app/workers/fetch_room_conversations_worker.rb (extend), new repos/services for sampling + shadow-gen + persistence under app/api/frontend_service/v1/ai_agent/, extend rate_test_case_question.rb + publish.rb + get_tree_diagram_v3.rb, new DELETE route, one additive migration.
  • BE modify: repositories/create_test_case.rb (persist name, status processing), test_cases_controller.rb (add delete route + name param).
  • BE NOT touched: live SendMessageWorker/inbox send path (must remain uncalled), ai_agent_histories schema.
  • FE create: chatbot-fe/pages/bot-automation/testing/index.vue + modules/bot-automation/components/testing/* (Table, GenerateModal, GenerateDrawer, GeneratingModal, Comparison, QuestionList, ConfidenceMeter).
  • FE modify: layouts/bot-automation.vue (nav item), modules/bot-automation/components/AiAgentEditor.vue (Activate gate), bot-flow node (tree confidence badge).
  • FE NOT touched: legacy modules/ai-agent/components/forms/Validation*.vue (old module — reference only, not extended).
  • Shared: Pinia store/ai-agent already has the actions/types — extend, don't fork. @mekari/pixel3 components reused.

Detail 2.J — Asset Inventory

| Asset | Type | Source | Format & sizes | Path |
| --- | --- | --- | --- | --- |
| "No test cases yet" empty illustration | illustration | reuse existing (/images/not-found-search-illustration.png used by AiAgentsTable) or new export | PNG @1x/2x | chatbot-fe/public/images/ |
| thumbs up/down, info icons | icon | @mekari/pixel3 MpIcon | SVG (DS) | n/a (DS) |

No new fonts/lotties. Any net-new illustration for the comparison empty/failed state flagged for Design QA (§5 Q-A).


3. High-Availability & Security

The Testing pipeline is off the live request path: it runs in the isolated :ai_agent Sidekiq queue, reading from Hub Chat Service and calling QontakNLP. If Hub or NLP is slow/down, only batches degrade (the test case goes failed and is retriable) — live inbox and live AI answering are unaffected. Reads should target a replica where the chatbot DB topology offers one (PRD §6); writes go to primary.

Performance Requirement

  • Frontend: list LCP < 2.5 s; rating INP < 200 ms; CLS < 0.1; bundle delta small (reuses Pixel3 + existing store). Browser support per chatbot-fe baseline; web only.
  • Backend: a ~50-item batch completes in ≈2–5 min (PRD §6); per-question NLP p99 governed by qontak_nlp_prediction_timeout (default 60 s). Worker concurrency bounded by queue config (5 staging / 10 prod). No added live-path RPS.
  • NLP throttle contract (REV-1). Per-question shadow-generation calls are paced by an in-worker token-bucket so the :ai_agent batch never starves live prediction traffic on the shared AI service:
    • Ceiling: a per-org RPM cap read from SystemPreference (group_code: 'engine', code: 'ai_agent_testing_nlp_rpm', default 60 req/min — i.e. one question/sec, well within a ~50–70 item batch's 2–5 min budget). The TPM dimension is bounded indirectly by the cap (PRD Open Q #5 — AI squad confirms the production ceiling before beta; default is conservative).
    • Pacing: acquire a bucket token before each QontakNlp::Predict call; if empty, sleep until refill (bucket lives in the worker process; batch is single-worker per test case so no cross-process coordination is needed this phase).
    • On HTTP 429 / rate-limit from the AI service: exponential backoff (e.g. 1 s → 2 s → 4 s, max 3 attempts per question) and re-queue the question for a later pass; on attempts-exhausted, mark the question status='failed' (status_description='rate_limited') and continue the batch (consistent with §2.2 failure path). A 429 never fails the whole test case.
    • This makes Decision D-5 fully specified (the throttle was previously deferred).
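
The pacing and backoff rules above can be sketched as follows. This is a hedged illustration, not the worker code: `TokenBucket` and `backoffDelaysMs` are hypothetical names, the clock is injected so the logic can be tested without sleeping, and a bucket capacity of 1 (strict one-per-interval pacing, no bursts) is an assumption.

```typescript
// Sketch only: class, method, and parameter names are hypothetical.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly msPerToken: number;
  private readonly capacity: number;

  constructor(ratePerMinute: number, nowMs: number, capacity = 1) {
    this.msPerToken = 60000 / ratePerMinute;
    this.capacity = capacity;
    this.tokens = capacity; // start full
    this.lastRefillMs = nowMs;
  }

  // Returns 0 when a token was taken, otherwise the milliseconds to wait before retrying.
  acquire(nowMs: number): number {
    const refilled = (nowMs - this.lastRefillMs) / this.msPerToken;
    this.tokens = Math.min(this.capacity, this.tokens + refilled);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return 0;
    }
    return Math.ceil((1 - this.tokens) * this.msPerToken);
  }
}

// Exponential backoff schedule for HTTP 429: 1 s, 2 s, 4 s with the defaults.
function backoffDelaysMs(attempts = 3, baseMs = 1000): number[] {
  const delays: number[] = [];
  for (let i = 0; i < attempts; i++) delays.push(baseMs * Math.pow(2, i));
  return delays;
}
```

At the default 60 req/min the bucket yields one token per second; a caller sleeps for the returned wait before the next predict call, and walks the backoff schedule only after a 429.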

Monitoring & Alerting

  • FE analytics (Mixpanel, names from PRD §12): ai_workspace_opened, ai_validation_generated {sample_size,date_range,test_case_id}, ai_response_graded {grade,confidence_score,inquiry_id}, ai_validation_completed, ai_agent_activated. Error slate logs ai_workspace_load_failed.
  • BE: Rollbar (existing) on worker errors with test_case_id/room_id. Structured Rails.logger.info already emits {worker, test_case_id, room_ids_count, conversation_pairs_count} — extend with generated_count, failed_count.
  • Alerts (PRD §12): batch failure rate > 10%/1h → #bot-ai-alerts; NLP error rate > 5%/15m → #bot-ai-alerts + PagerDuty.

  • Cross-layer: propagate request/job id from create response into worker logs for trace.

Logging

  • BE: structured worker log (above) + Rollbar; FE: console error → Sentry/Datadog per chatbot-fe.
  • PII scrub: do not log question, answer, or parameters.human_answer bodies (customer/agent text). Log only ids, counts, statuses.

Security Implications

  • Threat model: (a) cross-tenant data leak via test-case ids — mitigated by org-scoped queries + set_role; (b) customer-message leakage during shadow gen — mitigated by calling predict only, never send_message/notification (spec-asserted, S04/AC-1); (c) PII to 3rd-party LLM — covered by existing DPA, transient inference, not used to train public model (Open Q #1, InfoSec approval required); (d) privilege escalation — set_role server-side on every endpoint.

Role × Endpoint Authorization Matrix

| Role | Endpoint(s) | Methods | Tenant scope | UI visibility | Constraint | Audit |
| --- | --- | --- | --- | --- | --- | --- |
| owner | all test-case + publish + tree | GET/POST/PATCH/DELETE | own org | full | | PaperTrail |
| supervisor | all test-case + publish + tree | GET/POST/PATCH/DELETE | own org | full | no force-override (admin-only, S09) | PaperTrail |
| admin | all test-case + publish (+override) + tree | GET/POST/PATCH/DELETE | own org | full | override requires reason | PaperTrail + reason |
| standard agent | none | | | menu hidden | 403 on direct route | n/a |
| bot-specialist | tree-diagram read | GET | own org | tree only | read-only | n/a |

Every role from Detail 1.A appears here. standard agent is explicitly denied.

  • Ownership validation: queries scoped by current_user['chatbot_organization_id'] (existing use-case pattern). Enforcement: use-case layer + set_role.
  • Input validation: score ∈ {0,1} (422 otherwise); type, version_id, name (length-bounded, e.g. ≤ 24 per prototype) via dry-validation contract.
  • Injection: ActiveRecord parameterized; outbound URLs (Hub/NLP) from config/env (no user input in URL) → SSRF-safe.
  • Secrets: channel access token via lockbox-encrypted chatbot_tokens_encrypted; NLP base URL from org settings/env. No hard-coded keys.
  • Audit: PaperTrail (has_paper_trail) on both tables; force-activate writes reason + score-at-override (S09).
  • Rate limiting: per-question NLP throttle (TPM/RPM); create endpoint guarded by FE button disable + recommended server idempotency.
  • Static analysis: bundle exec brakeman (Gemfile) + bundle exec rubocop.
  • ISO 27001/27701: PII processing logged + access-controlled; see Compliance below.
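
The input-validation rules listed above (score restricted to 0 or 1, name bounded at 24 characters) can be mirrored on the client before a request is sent. This is a sketch with hypothetical helper names; the authoritative check remains the server-side dry-validation contract, and treating all three create fields as required is an assumption.

```typescript
interface CreateTestCaseInput {
  type: string;
  version_id: string;
  name: string;
}

// Returns a list of messages; an empty list means the payload passed.
function validateCreateInput(input: CreateTestCaseInput, maxNameLength = 24): string[] {
  const errors: string[] = [];
  if (input.type.trim() === "") errors.push("type is required");
  if (input.version_id.trim() === "") errors.push("version_id is required");
  const name = input.name.trim();
  if (name === "") {
    errors.push("name is required");
  } else if (name.length > maxNameLength) {
    errors.push("name must be at most " + maxNameLength + " characters");
  }
  return errors;
}

// Only the numbers 0 and 1 pass; "1", true, and 2 are rejected.
function isValidScore(score: unknown): score is 0 | 1 {
  return score === 0 || score === 1;
}
```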

Detail 3.A — Failure Mode Catalog (merged)

| Surface | FE behavior on failure | BE response on failure | Code-shape consistency |
| --- | --- | --- | --- |
| List load | blank slate + Retry; ai_workspace_load_failed | 403/500 ErrorException | yes |
| Create | inline error; button re-enabled | 404 (version) / 422 / 403 | yes |
| Generating | failed badge + Retry | worker sets status=failed (Rollbar) | yes (poll reads status) |
| Per-question gen | "could not generate" card | question status=failed + status_description | yes |
| Rate | optimistic rollback + inline error | 404/422 | yes |
| Publish below threshold | button disabled; if forced API → reason | 422 with reason (gate on) | yes |

Detail 3.A.1 — Branch & Skip Catalog

| Branch trigger | Where checked | Downstream effect | Audit | User-visible? |
| --- | --- | --- | --- | --- |
| Eligible rooms < 10 | worker sampling step (BOT) | use 100% rooms (no 10% cut) | log count | no (result reflected) |
| Batch > 100 | worker sampling | show ≤ 50 (cap) | log | indirectly (count) |
| Bot-only room (no human reply) | ExtractConversationPairs | room excluded, not counted | log | no (NEG-2) |
| Non-text message (image/voice) | ExtractConversationPairs | message skipped | log | no (NEG-2) |
| Phase 2/3 sources | FE Generate modal | "from knowledge"/"imported" disabled | n/a | yes (NEG-4) |
| ai_agent_testing flag OFF | BE + FE | menu hidden / endpoints inert | n/a | yes |

Detail 3.B — Error Response Catalog (BE)

{ "status": "error", "code": 422, "message": ["..."], "errors": {}, "error_code": null }
| Endpoint | Code | HTTP | Message | When | User-facing? |
| --- | --- | --- | --- | --- | --- |
| POST test_cases | | 404 | "Version not found" | version_id invalid | yes |
| POST test_cases | | 422 | "Failed to create test case" | save fails | yes |
| PATCH question | | 422 | "Invalid score" | score ∉ {0,1} | yes |
| GET detail | | 404 | "Test case not found" | bad id / cross-tenant | yes |
| any | | 403 | "Permission denied" | set_role reject | no (menu hidden) |
| POST publish | | 422 | "Confidence below threshold" | <80 & gate on | yes |
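
On the consuming side, the error envelope shown above can be narrowed with a type guard before its messages are displayed. This is a sketch: `ApiError`, `isApiError`, and `errorText` are hypothetical names, and the field types are inferred from the single JSON example rather than from a published schema.

```typescript
// Shape inferred from the example envelope; field types are assumptions.
interface ApiError {
  status: "error";
  code: number;
  message: string[];
  errors: Record<string, unknown>;
  error_code: string | null;
}

function isApiError(body: unknown): body is ApiError {
  if (typeof body !== "object" || body === null) return false;
  const b = body as Record<string, unknown>;
  return b["status"] === "error" && typeof b["code"] === "number" && Array.isArray(b["message"]);
}

// Joins the message array into one display string.
function errorText(err: ApiError): string {
  return err.message.join("; ");
}
```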

Detail 3.C — Error Message Catalog (FE)

| Error code | User-facing message (i18n key) | Surface | User-facing? |
| --- | --- | --- | --- |
| list_load_failed | "Couldn't load test cases" + Retry | blank slate | yes |
| create_failed | "Couldn't generate test case" | toast/inline | yes |
| rate_failed | "Couldn't save your rating" | inline (after rollback) | yes |
| gen_failed | "Could not generate" (per question) | comparison card | yes |

Detail 3.D — Compliance & Data Governance

Trigger: PII present. question, answer, parameters.human_answer are customer & agent conversation text sent to a 3rd-party LLM.

| Field | Classification | Legal basis | Retention | Encryption | Access audit | Right-to-delete |
| --- | --- | --- | --- | --- | --- | --- |
| question / answer / parameters.human_answer | PII (customer content) | DPA (existing); UU PDP | soft-delete; hard-purge TBD (Open Q #8) | TLS in transit; DB at rest per platform | PaperTrail + set_role | soft-delete + paranoid; align purge to DPA |
| scored_by_email / _name | PII (internal user) | legitimate interest | with row | at rest | PaperTrail | with row |

Transient inference: NLP prediction payload is not persisted beyond the stored answer/confidence/sources (PRD §6). InfoSec sign-off required before beta (Open Q #1).

Detail 3.E — Accessibility

WCAG AA. Keyboard: drawer/modal focus-trap (Pixel3 defaults); thumbs reachable via Tab with aria-pressed; meter role="progressbar". Focus returns to trigger on modal close. Contrast verified against Pixel3 tokens. prefers-reduced-motion honored for the generating progress animation.


4. Backwards Compatibility and Rollout Plan

Compatibility

  • BE: additive only — new name column (nullable), new DELETE route, extended worker/publish/tree behavior behind the ai_agent_testing flag. Existing endpoints' request/response shapes unchanged (only name added to create response; nullable).
  • FE: new page + components; menu item additive. No change to saved client state.
  • Cross-layer: snake_case JSON unchanged; FE already consumes the contract.

Rollout Strategy

  • Deploy order: BE first, then FE. BE adds the pipeline + endpoints behind the flag; FE Testing page ships after the contract is live (FE polls real status).
  • Feature flag: ai_agent_testing via SystemPreferences::FeatureFlag.enabled? (group_code: 'rollout', code: 'ai_agent_testing', default: false) — per-org, default OFF. Kill-switch: toggle OFF per org → menu hidden + endpoints inert + gate reverts to advisory (no deploy). FE menu reuses the existing rollout-preference read pattern (layouts/bot-automation.vue).
  • Migration sequence: add nullable name column (no backfill needed) → deploy BE → enable flag for internal org → deploy FE.
  • Stages (audience/gates owned by delivery/; PRD §11/§14): Internal (telesales POC) → Closed beta (3–5 orgs) → Open beta (on request) → GA.
  • Rollback trigger: batch failure rate > 20% that remains unresolved for 24 h, or any customer-message leakage → disable flag for affected orgs.
  • Rollback mechanism: flag OFF (instant); for DDL, name column is nullable and inert when unused — no down-migration needed for rollback; data written stays (soft-deletable).
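
The kill-switch and gate behavior in the bullets above can be sketched as one small pure function. This is a minimal illustration under stated assumptions, not the repo's implementation: publish_decision, PublishDecision, and the keyword arguments are hypothetical names; the two booleans stand in for the ai_agent_testing and ai_agent_testing_gate preference reads; and treating a missing score as "below threshold" is an assumption the RFC does not state.

```ruby
# Hypothetical sketch; names are illustrative and not taken from the chatbot repo.
PublishDecision = Struct.new(:allowed, :http_status, :message)

# feature_on    - stand-in for the ai_agent_testing rollout flag of the org
# gate_enforced - stand-in for ai_agent_testing_gate (false means advisory)
# confidence    - latest confidence score (0..100), or nil when none exists
# threshold     - ai_agent_testing_threshold (default 80 per Detail 4.B)
def publish_decision(feature_on:, gate_enforced:, confidence:, threshold: 80)
  # Kill-switch: with the flag OFF the gate is never enforced.
  return PublishDecision.new(true, 200, nil) unless feature_on

  # Assumption: an agent without any score is treated as below threshold.
  below = confidence.nil? || confidence < threshold
  return PublishDecision.new(true, 200, nil) unless below

  if gate_enforced
    # Enforced gate: publish is rejected with the 422 from the error catalog.
    PublishDecision.new(false, 422, 'Confidence below threshold')
  else
    # Advisory gate: publish goes through, the message is only a warning.
    PublishDecision.new(true, 200, 'Confidence below threshold')
  end
end

blocked = publish_decision(feature_on: true, gate_enforced: true, confidence: 72)
puts blocked.http_status # => 422
```

Keeping the decision free of I/O means the same function can back both the publish endpoint and a spec table covering the flag OFF, advisory, and enforced cases.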

Detail 4.A — Cross-Layer Rollout Compatibility Matrix

| Scenario | FE | BE | Works? | Mitigation |
|---|---|---|---|---|
| Pre-deploy | Old | Old | yes | baseline (no Testing page) |
| Backend first | Old | New | yes | new endpoints unused by old FE; flag OFF |
| Frontend first | New | Old | no | avoid: deploy BE first; if FE leads, gate page behind flag tied to BE readiness |
| Both deployed | New | New | yes | target |
| Backend rollback | New | Old | partial | FE page errors gracefully (error slate); disable flag |
| Frontend rollback | Old | New | yes | BE endpoints simply unused |

Detail 4.B — Configuration Contract

| Layer | Env var / flag | Type | Default | Required | Provisioner | Secret? |
|---|---|---|---|---|---|---|
| BE | ai_agent_testing (system_preferences rollout/ai_agent_testing) | bool | false | yes | SystemPreference (per org) | no |
| BE | ai_agent_testing_gate (publish gate enforce) | bool | false (advisory) | no | SystemPreference | no |
| BE | ai_agent_testing_threshold (engine): gate % (REV-4) | int | 80 | no | SystemPreference (per org) | no |
| BE | ai_agent_testing_nlp_rpm (engine): shadow-gen RPM cap (REV-1) | int | 60 | no | SystemPreference (per org) | no |
| BE | qontak_nlp_prediction_timeout | int (s) | 60 | no | SystemPreference (engine) | no |
| BE | sampling cap / pct (if configurable) | int | 10% / 50–70 | no | const or SystemPreference | no |
| BE | QONTAK_NLP_PREDICTION_ENDPOINT, AI_SERVICE_BASE_URL | url | | yes (existing) | env/org settings | no |
| FE | rollout pref read (rollout_ai_agent pattern) | bool | false | yes | preferences store | no |

Detail 4.C — Test Plan (commands from repo)

| Layer | Command (source) | What it proves |
|---|---|---|
| BE unit/use-case | RAILS_ENV=test bundle exec rspec spec/api/frontend_service/v1/ai_agent (bin/rspec_pipeline.sh) | create/list/rate + new sampling/gen behavior |
| BE worker | RAILS_ENV=test bundle exec rspec spec/workers/fetch_room_conversations_worker_spec.rb | sampling, no send_message, persistence, status |
| BE repo | RAILS_ENV=test bundle exec rspec spec/core/repositories/chat_service/extract_conversation_pairs_spec.rb | text-only filtering (S03) |
| BE tree | RAILS_ENV=test bundle exec rspec spec/core/repositories/paths/get_tree_diagram_v3_spec.rb | avg confidence in node |
| BE lint/security | bundle exec rubocop · bundle exec brakeman (bitbucket-pipelines.yml, Gemfile) | style + security scan |
| FE unit | pnpm test (vitest run; chatbot-fe/package.json) | components + store rating rollback |
| FE e2e | pnpm test:e2e (playwright test) | generate→poll→detail→rate→meter flow |
| FE lint | pnpm lint · build: pnpm build | typecheck + bundle |

Detail 4.D — Agent Execution Plan

| Order | Layer | Chunk | Files | Commands | Acceptance criteria |
|---|---|---|---|---|---|
| 1 | BE | Add name column + persist; create status processing | db/migrate/2026XXXX_add_name_to_ai_agent_test_cases.rb, repositories/create_test_case.rb, use_cases/create_test_cases.rb, test_cases_controller.rb | bundle exec rails db:migrate; rspec create spec | migration up/down; create persists name, returns status='processing' |
| 2 | BE | Test-case status lifecycle helper | repositories/ (new update_test_case_status.rb) | rspec | status transitions pending→processing→completed/failed covered |
| 3 | BE | Worker sampling step | app/workers/fetch_room_conversations_worker.rb, new repositories/.../sample_rooms.rb | rspec spec/workers/... | 200→~20; <10→all; 5000→cap 50–70 (S02) |
| 4 | BE | Worker shadow-gen + question persistence | worker, new services/.../generate_shadow_answer.rb (wraps QontakNlp::Predict), question repo | rspec spec/workers/... | 0 SendMessageWorker enqueues (S04/AC-1); answer+parameters.human_answer+confidence+sources persisted; failed→status=failed+desc |
| 5 | BE | Confidence aggregate recompute on rate | repositories/rate_test_case_question.rb (extend), new recompute_confidence_score.rb | rspec spec/api/frontend_service/v1/ai_agent/use_cases/rate_test_case_question_spec.rb | confidence_score = round(up/total*100) after rate (S06/AC-3) |
| 6 | BE | DELETE test-case route (soft delete) | test_cases_controller.rb, new use_cases/delete_test_case.rb + repo | rspec | DELETE soft-deletes (deleted_at set); 404 cross-tenant; FE client now resolves |
| 7 | BE | Tree-diagram avg confidence | core/repositories/paths/get_tree_diagram_v3.rb (add_ai_agent) | rspec spec/core/repositories/paths/get_tree_diagram_v3_spec.rb | node returns avg over completed; none→"no score yet" (S10) |
| 8 | BE | Publish confidence gate (flagged) | repositories/publish.rb, use_cases/publish_ai_agent.rb | rspec | 422 when <80 & gate on; advisory when off (S07) |
| 9 | FE | Testing list page + table | pages/bot-automation/testing/index.vue, modules/bot-automation/components/testing/TestCasesTable.vue | pnpm test; pnpm lint | renders list/empty/loading/error; ai_workspace_load_failed on error (S01) |
| 10 | FE | Generate modal + Inbox drawer (with version selector) | .../testing/{GenerateTestCaseModal,GenerateFromInboxDrawer,TestCaseGeneratingModal}.vue | pnpm test | create dispatch + poll to completed (S08) |
| 11 | FE | Detail comparison + question list + meter | .../testing/{TestCaseComparison,QuestionList,ConfidenceMeter}.vue | pnpm test; pnpm test:e2e | human-left/AI-right + metrics; failed→"could not generate"; meter = up/total (S05,S06) |
| 12 | FE | Activate gate + nav item | layouts/bot-automation.vue, modules/bot-automation/components/AiAgentEditor.vue | pnpm test | button disabled <80; Testing nav behind flag (S07,S01) |
| 13 | FE | Tree-diagram node badge | bot-flow node component (chatbot-fe) | pnpm test | node shows avg / "no score yet" (S10) |
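
The acceptance criteria for chunks 3 and 5 reduce to two small calculations, sketched here so the expected numbers are unambiguous. The helper names, the fixed cap of 70 (the RFC gives a 50–70 range), and the choice to return nil when nothing has been rated are assumptions for illustration; they are not taken from the chatbot repo.

```ruby
# Hypothetical helpers; constants are one reading of the S02 / S06 criteria.
SAMPLE_RATE = 0.10 # "200 -> ~20" implies roughly 10%
SAMPLE_CAP  = 70   # assumed upper end of the RFC's 50-70 cap range
SMALL_BATCH = 10   # batches smaller than this are kept whole ("<10 -> all")

# Number of rooms to sample from a batch of total_rooms.
def sample_size(total_rooms, rate: SAMPLE_RATE, cap: SAMPLE_CAP)
  return total_rooms if total_rooms < SMALL_BATCH

  [(total_rooms * rate).round, cap].min
end

# confidence_score = round(up / total * 100); nil stands for "no score yet".
# Float arithmetic avoids Ruby's integer division (7 / 10 would be 0).
def confidence_score(thumbs_up, total_rated)
  return nil if total_rated.zero?

  (thumbs_up * 100.0 / total_rated).round
end

puts sample_size(200)        # => 20
puts sample_size(5000)       # => 70
puts confidence_score(7, 10) # => 70
```

The RFC only pins the three sampling points listed in chunk 3; how batches between 10 and 200 rooms should behave (for example, whether a minimum sample applies) is not specified and is left as written here.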

Detail 4.E — Verification & Rollback Recipe

  • Pre-merge (BE): 1) bundle exec rubocop 2) RAILS_ENV=test bundle exec rspec spec/api/frontend_service/v1/ai_agent spec/workers spec/core/repositories/chat_service spec/core/repositories/paths 3) bundle exec brakeman
  • Pre-merge (FE): 1) pnpm lint 2) pnpm test 3) pnpm test:e2e 4) pnpm build
  • Post-deploy signals: Rollbar batch error rate < 10%/1h (#bot-ai-alerts); Mixpanel ai_validation_generated firing; worker log failed_count/generated_count ratio healthy; zero SendMessageWorker enqueues correlated with :ai_agent batches.
  • Rollback: 1) toggle ai_agent_testing OFF for affected org(s) (no deploy) 2) if publish gate misbehaves, toggle ai_agent_testing_gate OFF (advisory) 3) if needed, revert FE PR (BE endpoints become unused) 4) confirm Rollbar error rate normal in 15 min.

Detail 4.F — Resource & Cost Notes

  • Compute: bounded by :ai_agent queue concurrency (5/10); no new pods required.
  • DB: +1 nullable column; question rows ≤70/case — negligible growth.
  • Egress: per-question HTTPS to QontakNLP (cost = batch size × token usage) — the cap controls this; confirm per-tier budget (Open Q #5).
  • No new infra components.

5. Concern, Questions, or Known Limitations

Review findings ledger (from historical-validation-review.md, R1)

rfc-reviewer R1 (score 7.5/10, PROCEED) raised four material findings; all are now addressed inline in this revision:

| Id | Severity | Finding | Resolution | Status |
|---|---|---|---|---|
| REV-1 | major | NLP throttle contract unspecified (RPM/TPM, 429 behavior) | §3 Performance: token-bucket, ai_agent_testing_nlp_rpm pref (default 60), 429→backoff+requeue→fail-question; D-5 specified | resolved (in-RFC); production ceiling still to be confirmed by AI squad (Open Q #5) |
| REV-2 | major | name column left conditional | §2.3: column added + persisted unconditionally (chunk 1); §2.G create row now yes | resolved |
| REV-3 | major | DELETE endpoint contract thin | §2.4: full DELETE contract (soft-delete, 404-not-403 cross-tenant, idempotency, restore out-of-scope) | resolved |
| REV-4 | minor | publish-gate threshold source unresolved | D-7 + §4.B: ai_agent_testing_threshold pref (default 80), org-configurable; resolves PRD Open Q #4 | resolved |

Minor follow-ups still open after R2 (none blocking):

  • REV-5: worker job retry intervals + dead-set alert depth (§2.F, item 10).
  • REV-6: no Figma frame for the comparison/detail view (§1 Design References, item Q-A; design dependency).
  • REV-7: re-run trigger path absent from §2.4 + clear-before-regen transaction (item 9, Detail 2.D).
  • REV-8: create idempotency "recommended" not decided (§2.4 POST).
  • REV-9 (new in R2): the §3 NLP throttle RPM cap is enforced by an in-worker token-bucket and is therefore per worker process; with queue concurrency 5/10, concurrent same-org batches can aggregate past the "per-org" ceiling. Mitigation: low likelihood (one SPV generates at a time); make it a true per-org ceiling via a Redis-backed counter keyed by org if concurrency proves real.

R2 score 8.5/10, verdict PROCEED (see historical-validation-review.md).
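
The REV-9 mitigation (a shared counter keyed by org) can be illustrated with a minimal sketch. It uses a fixed one-minute window rather than the token bucket named in §3, purely because it is shorter to show. OrgRateLimiter, the key format, and the injected clock are hypothetical, and a plain Hash stands in for Redis so the example runs on its own; a real shared store would need an atomic increment and key expiry.

```ruby
# Hypothetical sketch of a per-org, fixed-window request counter.
class OrgRateLimiter
  # store - shared key/value store (a Hash here; Redis in the REV-9 proposal)
  # rpm   - requests allowed per org per minute (ai_agent_testing_nlp_rpm)
  # clock - callable returning the current epoch seconds (injectable for tests)
  def initialize(store:, rpm: 60, clock: -> { Time.now.to_i })
    @store = store
    @rpm = rpm
    @clock = clock
  end

  # Counts the call against the org's current minute and returns true while
  # the org is still within budget.
  def allow?(org_id)
    window = @clock.call / 60
    key = "ai_agent_testing:nlp_rpm:#{org_id}:#{window}"
    count = (@store[key] = @store.fetch(key, 0) + 1)
    count <= @rpm
  end
end

# Two limiters model two worker processes sharing one store.
shared = {}
worker_a = OrgRateLimiter.new(store: shared, rpm: 3)
worker_b = OrgRateLimiter.new(store: shared, rpm: 3)
puts [worker_a.allow?(42), worker_b.allow?(42)].inspect
```

Because both limiters read and write the same keys, the budget is consumed per org regardless of which process makes the call, which is the property an in-process bucket cannot provide.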

Open questions

Carried from PRD §17, scoped to engineering:

  1. Q-A (Design): Generate-from-Inbox drawer needs a version selector (bind test case to ai_agent_history_id); the qontak-designer prototype only has a name field. The comparison/detail view has no prototype — needs a Figma frame + Design QA.
  2. Open Q #1 (Risk/InfoSec): PII to 3rd-party LLM — DPA-covered, transient inference; InfoSec approval required before beta.
  3. Open Q #2 (Risk): confidence recompute (S06) + activation gate (S07) ship as this RFC's work; gate is advisory for beta, enforced before GA.
  4. Open Q #4 — resolved (REV-4): threshold is org-configurable via ai_agent_testing_threshold SystemPreference (default 80); no redeploy to change.
  5. Open Q #5 (Data/AI): per-batch token budget + the production RPM/TPM ceiling across plan tiers — the §3 throttle ships with a conservative default (ai_agent_testing_nlp_rpm = 60); AI squad confirms the real ceiling before beta (REV-1).
  6. Open Q #6: separate "relevance" metric? Schema has only confidence — if needed, store in parameters (no schema change).
  7. Open Q #7 (Data): single human reply as "golden answer" when a room has multiple agent messages — ExtractConversationPairs pairs the next agent text reply.
  8. Open Q #8 (Eng): hard-purge window for soft-deleted test cases/questions (DPA).
  9. Known limitation: re-running a test case must clear prior questions to avoid duplicates (Detail 2.D) — define re-run UX with Design.
  10. Known limitation: FetchRoomConversationsWorker is retry: false today; this RFC changes it to bounded retry — confirm idempotent re-entry (clear-before-regen).
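
Items 9 and 10 both depend on clear-before-regen making a re-run or retry idempotent. The sketch below shows the intended property with an in-memory stand-in; InMemoryQuestionStore and its methods are hypothetical, and in a database the clear and the writes would sit inside one transaction.

```ruby
# Hypothetical in-memory stand-in for the question repository.
class InMemoryQuestionStore
  def initialize
    @rows = Hash.new { |hash, key| hash[key] = [] }
  end

  # Returns a copy of the questions stored for a test case.
  def questions_for(test_case_id)
    @rows[test_case_id].dup
  end

  # Clear-before-regen: drop prior rows, then write the new batch, so running
  # the same job twice leaves the same rows as running it once.
  def regenerate(test_case_id, pairs)
    @rows[test_case_id] = []
    pairs.each { |pair| @rows[test_case_id] << pair }
    @rows[test_case_id].size
  end
end

store = InMemoryQuestionStore.new
pairs = [{ question: 'Where is my order?' }, { question: 'Can I get a refund?' }]
store.regenerate(1, pairs)
store.regenerate(1, pairs) # simulated retry of the same job
puts store.questions_for(1).size # => 2
```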

6. Comment logs

| Date | Comment(s) From | Action Item(s) |
|---|---|---|
| 2026-06-20 | RFC author (Claude) | Initial draft from PRD + grounded against chatbot / chatbot-fe / qontak-designer code. Flagged worker gap, FE target (chatbot-fe), missing DELETE route, name-column gap. |
| 2026-06-20 | rfc-reviewer R1 (7.5/10, PROCEED) | Raised REV-1…REV-4 (NLP throttle, name column, DELETE contract, gate threshold). See historical-validation-review.md. |
| 2026-06-20 | RFC author (Claude) | Addressed REV-1…REV-4 inline: §3 throttle contract + D-5; §2.3 name column unconditional; §2.4 full DELETE contract; D-7 + §4.B configurable threshold/RPM prefs; §2.G all rows → yes; §5 ledger added. |
| 2026-06-20 | rfc-reviewer R2 (8.5/10, PROCEED) | Confirmed REV-1…REV-4 fixed; decisions 10/10 resolved; CSS 6.5→8.0. Raised REV-9 (per-process vs per-org throttle enforcement, minor). REV-5/6/7/8/9 carry open. |

7. Ready for agent execution

  • yes

All execution-readiness gates are met against verified repo state:

  • §1 Design References — Figma frames + DS version (@mekari/pixel3@^1.0.12) + Design QA named; the detail/comparison frame and the drawer version-selector are flagged in §5 Q-A (Design QA must confirm before chunks 10–11 land).
  • §1 PRD-to-Schema Derivation — every entity/attribute/rule mapped to table.column + endpoint + enforcement.
  • Detail 1.C Per-Story Change Map — all 10 stories, one row each, FE+BE columns filled, verifiable AC.
  • Repo Reading Guide (2.0) — anchors, contracts (reuse/extend/new), reading order, Source Verification with concrete evidence per row (no unverified claims).
  • Design ↔ Code Mapping — frames → chatbot-fe files + tokens + backing endpoints.
  • Mermaid: repo map, component, ER, two state machines, branch/skip, happy + 2 failure sequences.
  • DDL — existing schema verified; one additive migration; per-status lifecycle tables for both enums.
  • APIs — outbound table with reuse/extend/new tags; inbound N/A — reason; cross-layer verification flags the 3 closeable gaps.
  • Failure Mode + Branch & Skip + Error catalogs complete; Role × Endpoint matrix covers every role.
  • Configuration Contract complete; ai_agent_testing flag named, default OFF.
  • Agent Execution Plan — 13 ordered chunks, each with files + repo-sourced commands + assertable AC.
  • Verification & Rollback Recipe — runnable per-layer commands; named signals; flag-first rollback.

Optional next step: hand to rfc-reviewer for a further review pass (historical-validation-review.md).