RFC Review: AI Agent Testing — Phase 1: Historical Validation

Companion review for historical-validation.md, produced by the rfc-reviewer skill. Lives beside the RFC; valid only for the RFC revision in reviewed_rfc_last_updated (2026-06-20).

This is a cumulative review. Current cycle = R3 (re-grounded against chatbot fa6dd8b79 / chatbot-fe f5b80d5a). R1 (pre-fix draft, 7.5) and R2 (post-fix, 8.5) are preserved as history. The RFC last_updated (2026-06-20) still equals the prior stamp → refresh/confirm cycle: no RFC content changed since R2. R3 re-grounded EVERY load-bearing citation in the RFC §2.0 Source Verification table + prior-review spot-checks against the current code (chatbot moved 88b0685 → fa6dd8b79; chatbot-fe at f5b80d5a). Result: zero citation drift, zero breakage — every file:line, route, migration, schema column, and FE symbol still resolves to exactly what is claimed (full re-grounding log in Evidence Notes). No new code-grounded issue surfaced, so no new REV-n is minted. The ledger carries forward unchanged: 4 fixed (R1 majors, confirmed still fixed against current code), 5 minor open (REV-5/6/7/8/9). Score holds 8.5; verdict PROCEED. See the Findings Ledger and Review History for the delta.

Executive Summary

Overall Score: 8.5/10 (R3 — held from R2; was 7.5 at R1)
Rating: Strong
RFC Type: full-stack
Sub-Type: frontend: new-feature · backend: new-feature
Assessment Confidence: High — all 18 categories have direct evidence across both layers; R3 re-grounded every load-bearing citation against current code (chatbot fa6dd8b79, chatbot-fe f5b80d5a): worker still 43 lines fetch+extract+log only; publish.rb:13 only sets active_version_id (no gate); rate_test_case_question.rb:13-19 writes no aggregate; routes GET L32/POST L75/PATCH L115 with set_role; no DELETE test-case route; FE deleteTestCase at ai-agents.ts:313; TestCase type carries name (interface.ts:272) but BE create_test_case.rb:19 still does not persist it; name column still absent from schema. Zero drift, zero breakage.
Applied Caps/Gates: None triggered. No category scored < 5.0; ACV and ROL clear their < 5.0 caps; no cross-layer contract mismatch (after R2, every §2.G row reaches Match? = yes). The RFC does not reach 9.0+ because a few merged categories still sit 8.0 (DIC, FMC, OBS) on minor open items (REV-5/7/8) and one new caveat (REV-9, per-process throttle).
Implementation Readiness Verdict: PROCEED — all four R1 major findings are fixed (REV-1…REV-4). An agent can execute the full 13-chunk plan from the RFC alone. The remaining items are minors + external confirmations (production RPM ceiling Open Q #5, InfoSec Open Q #1) and one design dependency (REV-6: no Figma frame for the comparison/detail view) that blocks pixel-faithful chunk 11 only.
Report Path: chatbot/ai-agent-testing/rfcs/historical-validation-review.md
RFC Author: Claude (from PRD + repo grounding) | Reviewed: 2026-06-20 (R1 draft → R2 commit 88b0685)

This RFC is unusually execution-ready for a draft: it is explicitly framed as a delta on existing scaffolding and backs almost every claim with a verified file path, line range, and the exact gap to close. An AI agent could read §2.0 (Repo Reading Guide + Source Verification), the §2.4 API table, and §4.D (13 ordered chunks with files + commands + acceptance criteria) and start producing correct code for most chunks without a clarification meeting. The biggest strength is cross-layer contract honesty — §2.G names the three real frontend/backend gaps (name persistence, aggregate recompute, missing DELETE) instead of hiding them. The biggest gap is that the engineering core of the phase (worker sampling → shadow-gen → persistence) is unbuilt and, while well-specified at the orchestration level, leaves one hard external contract open: the per-question QontakNLP TPM/RPM throttle (Open Q #5), which is the dimension most likely to make the batch fail or get rate-limited in production. The one thing that must change before that chunk is agentic-safe: pin the throttle contract (concrete RPM/TPM ceiling, inter-call delay or token-bucket, and behavior on 429) — today an agent would invent it.

R2 update (commit 88b0685). That "one thing" is done: §3 Performance now specifies a token-bucket throttle (ai_agent_testing_nlp_rpm pref, default 60; on-429 → backoff + requeue → fail-question), and the other three majors are closed (name column unconditional §2.3; full DELETE contract §2.4; org-configurable gate threshold D-7/§4.B). The engineering core (chunk 4) is now agentic-safe. The one residual scaling caveat R2 raises (REV-9): the cap is enforced per worker process, so concurrent same-org batches (queue concurrency 5/10) can aggregate past the per-org ceiling — minor, with a Redis-counter mitigation noted below. Net: a Strong RFC, no remaining majors, PROCEED.

Quick Verdict

Why this RFC can be implemented agentically:

§2.0 Source Verification gives a verified anchor (file + line + evidence) for every claim, and §4.D decomposes the work into 13 ordered chunks each with files, repo-sourced test commands, and assertable acceptance criteria — the rubric's gold standard for task decomposition.
The contract surface is mostly closed: §2.4 API table (method/path/auth/request/response/status/idempotency/versioning/reuse-tag), §3.B/3.C error catalogs, §2.A typed component props, and §2.G cross-layer verification table mean the agent rarely has to guess shapes.

Why this RFC will cause agent guessing or rework:

The per-question NLP throttle contract is unresolved (Open Q #5, §3 Performance, §2.F): no concrete RPM/TPM number, no inter-call pacing mechanism, no documented 429 behavior — yet this governs whether the core shadow-gen chunk (4) works at batch scale. The RFC says "throttle … to respect TPM/RPM" but never specifies it.
Two contract micro-decisions are left as "if required / recommended": the name column (§2.3 "only if … required"; §2.G says BE doesn't persist it today) and create-endpoint idempotency ("recommended"). An agent hitting chunk 1 and the create path will guess.

Findings Ledger (carry-forward)

Stable, never-renumbered finding ids. R1 minted REV-1…REV-8; R2 reconciled them and minted REV-9. R3 re-grounded every cited anchor against current code and reconciled the ledger unchanged — the four R1 majors (REV-1…4) remain fixed (their fixes are RFC-text fixes; the underlying code state they describe — no name persisted, no gate, no DELETE route, no throttle in worker — is still the live reality the RFC proposes to change, exactly as documented). No drift ⇒ no new REV-n minted.

ID	Severity	Finding (one line)	RFC location	Status	First seen	Resolved in	Evidence / fix
`REV-1`	major	Per-question NLP throttle (TPM/RPM) contract unspecified — no concrete rate ceiling, pacing mechanism, or 429 behavior, despite governing the core shadow-gen chunk	§3 Performance, §2.F job spec, Open Q #5	fixed	R1	R2 (88b0685)	§3 "NLP throttle contract (REV-1)": token-bucket; `ai_agent_testing_nlp_rpm` pref (default 60); on-429 → exp backoff (1→2→4s, max 3) + requeue → fail-question after budget; D-5 updated; §4.B config row added. (Production ceiling still confirmed by AI squad — Open Q #5; tracked, not blocking.)
`REV-2`	major	`name` column persistence left conditional ("only if required"); §2.G marks `POST .../test_cases` cross-layer match as `partial` because BE does not persist `name` today	§2.3 migration, §2.G row 2, §4.D chunk 1	fixed	R1	R2 (88b0685)	§2.3 now "required, not conditional" — column added + `params[:name]` persisted (≤24, validated); §2.G create row → `yes (after chunk 1)`.
`REV-3`	major	DELETE endpoint contract depth thinner than reuse/extend endpoints — soft-delete body, restore semantics, cross-tenant 404-vs-403 not pinned	§2.4 DELETE row, §4.D chunk 6	fixed	R1	R2 (88b0685)	§2.4 "DELETE contract (REV-3)": no body; soft-delete via `acts_as_paranoid` org-scoped; `200 {data:{id}}`; cross-tenant → 404 (not 403) to avoid id enumeration; idempotent; restore explicitly out-of-scope this phase. §2.G delete row → `yes`.
`REV-4`	major	Publish-gate threshold configurability unresolved — 80 hard-coded vs org-configurable (Open Q #4); §4.B "threshold pref" had no key/default/provisioner	§2.4 publish, §4.B, D-7, Open Q #4	fixed	R1	R2 (88b0685)	D-7 + §4.B: `ai_agent_testing_threshold` SystemPreference (`engine`, int, default 80, per-org); §5 Open Q #4 marked resolved.
`REV-5`	minor	Worker `retry: false` → "bounded retry (e.g. 3, exp backoff)" is exemplary, not pinned; no concrete backoff intervals or dead-set alert threshold	§2.F job spec, Open limitation #10	open	R1	—	Note: REV-1's fix pins the per-question NLP 429 backoff (1→2→4s), but the worker-level job retry intervals + dead-set alert depth are still "e.g. 3". Pin them.
`REV-6`	minor	Comparison/detail view has no qontak-designer prototype and no dedicated Figma frame; blocks pixel-faithful chunk 11	§1 Design References row 4, Design↔Code map, §5 Q-A	open	R1	—	Design dependency, not a doc edit — produce a Figma frame before chunk 11. Carried; surfaced in §5 Q-A.
`REV-7`	minor	Re-run trigger (which endpoint/param re-enqueues) + its idempotency guard not in the §2.4 API table, though clear-before-regen is stated in §2.D	§2.D row 2, §2.E, Open limitation #9	open	R1	—	Add the re-run path to §2.4 and pin the clear-before-regen transaction.
`REV-8`	minor	Create-endpoint idempotency "recommended", not decided (`(agent,version,name)` guard not specified as constraint/check)	§2.4 POST row, §2.D create row	open	R1	—	Decide: partial unique index / app-level dedup, or accept duplicates (FE button-disable only).
`REV-9`	minor	(new in R2) §3 calls the RPM cap "per-org" but enforces it with an in-worker token-bucket; with queue concurrency 5/10, concurrent same-org batches each get their own bucket → aggregate can exceed the per-org ceiling	§3 Performance "NLP throttle contract"	open	R2	—	Either scope the claim ("per-worker pacing; same-org concurrent batches are rare") or make it a true per-org ceiling via a Redis-backed counter keyed by org. Low likelihood (one SPV generates at a time), so minor.

Ledger summary (R2): 5 open (0 blocker / 0 major / 5 minor), 4 fixed this cycle (all R1 majors), 0 accepted-risk. No blockers and no open majors → verdict PROCEED. Still-open findings (REV-5/6/7/8/9) are promoted to the RFC Open-Questions / Concerns surface by id.

Ledger summary (R3): unchanged — 5 open (0 blocker / 0 major / 5 minor), 0 fixed this cycle, 0 new. R3 is a re-grounding/confirm cycle: every citation re-verified against current code (chatbot fa6dd8b79 / chatbot-fe f5b80d5a), zero drift or breakage, RFC text unchanged since R2, so the ledger is identical. REV-5/6/7/8/9 remain promoted to the RFC §5 surface. Verdict PROCEED.

PRD → RFC Traceability Matrix

PRD: ../prds/historical-validation.md (BOT-3351). 10 stories (AITEST-S01…S10), CHG-001, 4 negative scenarios, plus sections §2–§17.

Standard format (PRD exists)

PRD Element	RFC Section	Coverage
AITEST-S01 Workspace access control	Detail 1.C row S01, §2.4 GET test_cases, §3 authz matrix, §4.D chunk 9/12	Full
AITEST-S02 Historical sampling (10%)	Detail 1.C row S02, §2.1 branch/skip flow, §2.F, §4.D chunk 3	Full — sampling logic + cap branches specified; ownership (Data vs BOT) flagged (Open Q)
AITEST-S03 Data integrity & filtering	Detail 1.C row S03, §2.0 anchor (`ExtractConversationPairs`), §3.A.1 branch catalog	Full — relies on existing extractor + spec
AITEST-S04 Shadow execution (zero leakage)	Detail 1.C row S04, §2.2 sequence, §2.F, §4.D chunk 4	Partial — orchestration fully specified; the NLP throttle contract that makes it scale is open (REV-1)
AITEST-S05 Side-by-side validation UI	Detail 1.C row S05, §2.A, §2.4 GET detail, §4.D chunk 11	Partial — props/states specified; no Figma frame for the detail view (REV-6)
AITEST-S06 Confidence meter & feedback	Detail 1.C row S06, §2.A ConfidenceMeter, §2.D rate row, §4.D chunk 5	Full — aggregate `(up/total)*100` recompute specified server-side (D-6)
AITEST-S07 Activation gatekeeping	Detail 1.C row S07, §2.4 publish, §4.D chunk 8, D-7	Partial — gate specified; threshold configurability open (REV-4)
AITEST-S08 Background processing (async)	Detail 1.C row S08, §2.F, §2.B polling, state machines	Full — Sidekiq, queue, poll interval, status lifecycle all pinned
AITEST-S09 Manual override & audit	Detail 1.C row S09, §2.4 publish `override_reason`, §3 audit (PaperTrail)	Full — override path + audit fields specified (Could-Have)
AITEST-S10 Confidence in Tree Diagram	Detail 1.C row S10, §2.4 tree_diagram, §4.D chunk 7	Full — avg over completed, "no score yet" fallback specified
CHG-001 Tree-diagram confidence	Detail 1.A, §2.4 `/api/v3/paths/:id/tree_diagram`, §2.F.2	Full
NEG-1 standard agent denied	§3 authz matrix (standard agent → 403), Detail 1.C S01 NEG-1	Full
NEG-2 non-text excluded	§3.A.1 branch catalog, §2.1 branch/skip flow	Full
NEG-3 comparison read-only	§2.A `readonly: true` prop	Full
NEG-4 Phase 2/3 sources disabled	§3.A.1 branch catalog, Out of Scope #4	Full
PRD §12 Observability events	§3 Monitoring (5 Mixpanel events + alerts)	Full — event names match PRD §12 1:1
PRD §13 Success Metrics	§1 Success Criteria	Full — primary KPI, quality KPI carried
PRD §6 Soft delete + restore	§2.3 lifecycle, D-8, §2.4 DELETE	Partial — soft-delete covered; restore endpoint not specified (REV-3)
PRD Open Q #5 (token budget/TPM)	§5 Open Q #5, §3 Performance	Carried as open — not resolved (REV-1)

Reverse (RFC decisions → PRD driver): every new FE component and BE endpoint in the Detail 1.A reverse table traces to a PRD AC. The one RFC addition without a direct PRD line item — the DELETE endpoint — is justified by PRD §6 (soft delete) + the existing FE client; not scope creep. No unjustified RFC additions found.

Summary: 19 of 19 tracked PRD elements covered (14 Full, 5 Partial, 0 Missing). 0 RFC decisions without PRD justification. Forward + reverse traceability is explicit (Detail 1.A) — exemplary for PRT.

Scorecard

Full-Stack Scorecard (18 categories)

#	Category	Source	Score	Evidence-Based Rationale
1	PRT — PRD Traceability	Merged	`9.0`	FE+BE: explicit bidirectional matrix (Detail 1.A forward + reverse), PRD-§ coverage table, PRD-to-Schema Derivation mapping every entity→table.column→endpoint→enforcement. 19/19 elements covered; no scope creep. Two strong non-conflicting evidence points → clears 9.0.
2	TDC — Technical Decisions	Merged	`8.0`	10 decisions D-1…D-10 each with chosen option + rejected alternative + reason + layer (Detail 1.B). Grounded in real files. Drag: D-7 (gate threshold configurability, REV-4) and D-10 (sampling cap, but throttle in D-5 left open) are decided in principle but leave a sub-contract open.
3	CNT — Contract Specificity	FE	`8.0`	§2.A gives typed props for `ConfidenceMeter`/`TestCaseComparison`; §2.B data-fetching strategy (Pinia + `$apiMain`, poll ~3s, optimistic rollback); types reused from verified `store/ai-agent/interface.ts`. Drag: `name` blank-until-built (REV-2); detail-view contract leans on a non-existent prototype (REV-6).
4	SCB — Scope Boundaries	FE	`9.0`	§2.I lists FE create / modify / NOT-touched files by path, plus shared-code note ("extend, don't fork" the Pinia store). §2.0 reading order + anchors. An agent can produce a file-by-file plan directly.
5	DEP — Dependencies	FE	`8.5`	§1 Dependencies table with owning team + availability (exists/needs-building) + blocking flag; Design-system version pinned (`@mekari/pixel3@^1.0.12`, verified). Backend-endpoint availability checked per endpoint (§2.0 contracts table). Drag: NLP batch path "needs building" with open throttle.
6	NFS — Non-Functional Specificity	FE	`8.0`	§3 Performance: LCP<2.5s, INP<200ms, CLS<0.1, bundle delta small; §3.E Accessibility WCAG AA with focus-trap, `aria-pressed`, `role=progressbar`, `prefers-reduced-motion`. Browser support "per chatbot-fe baseline" is slightly soft (no explicit version matrix).
7	TPS — Test Plan Specificity	FE	`8.0`	§4.C maps each layer to a real repo command (vitest, playwright, rspec paths) + what it proves; §4.D acceptance criteria per chunk are assertable ("0 SendMessageWorker enqueues"). Drag: FE visual-regression for the no-prototype detail view not addressed.
8	DMS — Data Model & Schema	BE	`8.5`	§2.3 gives verified current DDL (both tables, columns, types, indexes), per-status lifecycle tables, cardinality (≤70 q/case), PII classification, retention (soft-delete; hard-purge TBD). No new tables; one additive nullable `name` column. Drag: hard-purge window TBD (Open Q #8) and `name` conditional (REV-2).
9	ACV — API Contract & Versioning	BE	`8.0`	§2.4 outbound table: method/path/auth/request/response/status codes/idempotency/versioning/reuse-tag for all 7 endpoints; §3.B error catalog; example create payload. Clears the `<5.0` cap comfortably. Drag: DELETE contract depth (REV-3), create idempotency "recommended" (REV-8), re-run path absent from table (REV-7).
10	DIC — Data Integrity & Consistency	BE	`8.0`	§2.D matrix per write path (transaction scope, partial failure, idempotency, consistency, duplicate, stale read); rate+recompute "in one trx"; worker per-question (not one big trx) with clear-before-regen. Drag: clear-before-regen transaction not pinned to a step (REV-7); create idempotency open (REV-8).
11	FMC — Failure Mode Coverage	Merged	`8.0`	FE: §2.C UI state matrix (5 states/surface) + §3.C error message catalog with i18n keys + optimistic rollback. BE: §3.A failure catalog, §3.B error catalog (code/HTTP/when/user-facing), two failure sequence diagrams, per-question fail→continue. Cross: §3.A confirms FE handles exactly what BE returns. Drag: retry/backoff "e.g. 3" not pinned (REV-5).
12	CSS — Concurrency & Scaling	BE	`6.5`	§2.E collision map (concurrent raters; worker-vs-rater with `processing` lock + 409/422; double-publish idempotent) is solid. But the load-facing half is the weakest in the RFC: the NLP throttle (the actual scaling control for the core path) is unspecified (REV-1); queue concurrency is given (5/10) but per-question pacing is not. This is the lowest score and the main thing holding the overall down.
13	SAS — Security & Authorization	BE	`8.5`	§3 Security: threat model (4 vectors incl. zero-leakage + PII-to-LLM), `set_role` per endpoint + Role×Endpoint matrix, ownership via `current_user['chatbot_organization_id']`, input validation (`score∈{0,1}`, name length-bound), SSRF-safe outbound, lockbox secrets, PaperTrail audit, brakeman/rubocop. Two strong evidence points.
14	ROL — Rollout & Rollback	Merged	`8.5`	FE: flag `ai_agent_testing` (named, default OFF, provisioner). BE: additive migration sequence, nullable-column rollback. Cross (mandatory checks): deploy order specified (BE first) with rationale; §4.A compatibility matrix covers all 6 scenarios incl. the one "no" (Frontend-first) with mitigation; coordinated rollback = flag-OFF; flag coordination (testing + gate as two independent flags). Clears the deploy-order cap.
15	OBS — Observability	Merged	`8.0`	FE: 5 Mixpanel events with properties + `ai_workspace_load_failed`. BE: Rollbar with `test_case_id/room_id`, structured worker log extended with `generated/failed_count`, alert thresholds (batch>10%/1h, NLP>5%/15m→PagerDuty). Cross: propagate request/job id into worker logs for trace. Slightly short of distributed-trace spans → not 9.0.
16	SBC — Service Boundary & Coupling	BE	`8.5`	§2.F.1 Responsibility Boundary Matrix (9 steps, owning squad, trigger, effect, failure handler, PRD anchor); pipeline isolated to `:ai_agent` queue off the live path; sync outbound to Hub/NLP, no new sync coupling on the request path. Clear ownership boundaries.
17	CPA — Pattern Alignment	Merged	`9.0`	FE: §2.0 "Patterns to Follow" (Pinia `fetchStatus`, optimistic rollback, `AiAgentsTable` list pattern) with reference files + "Deviation? none". BE: Grape+ResultMatcher, APIAbstractUseCase, AbstractRepository, Sidekiq worker pattern, FeatureFlag, ErrorException shape — each with a named reference file. Cross: snake_case JSON consumed directly (verified). The single intentional deviation (drawer version selector) is flagged.
18	CDG — Compliance & Data Governance	BE	`8.0`	Active (PII trigger: customer/agent text → 3rd-party LLM). §3.D classification table (legal basis DPA/UU PDP, retention, encryption, access audit, right-to-delete), PII-scrub logging rule (don't log question/answer bodies), transient inference, InfoSec sign-off required (Open Q #1). Drag: hard-purge window TBD (Open Q #8) keeps it off 9.0.

R2 re-score deltas (the table above is the R1 baseline)

The R1 category scores are preserved above as history. R2 fixes (REV-1…REV-4) move five categories:

#	Category	R1 → R2	Why it moved
2	TDC — Technical Decisions	`8.0 → 8.5`	D-5, D-7, D-8 closed → 10/10 decisions resolved; sub-contracts (throttle, threshold) now specified
3	CNT — Contract Specificity (FE)	`8.0 → 8.5`	REV-2 fixed (`name` now persisted); residual drag only REV-6 (no detail-view Figma frame)
8	DMS — Data Model & Schema	`8.5 → 9.0`	REV-2 fixed (`name` column unconditional + persisted); verified DDL + lifecycle + two strong evidence points → clears 9.0 (hard-purge TBD remains advisory, Open Q #8)
9	ACV — API Contract & Versioning	`8.0 → 8.5`	REV-3 fixed (full DELETE contract); residual REV-7 (re-run path) + REV-8 (idempotency) keep it off 9.0
12	CSS — Concurrency & Scaling	`6.5 → 8.0`	REV-1 fixed — the throttle (the scaling control for the core path) is now specified (token-bucket + RPM pref + 429). New REV-9 (per-process vs per-org enforcement) is the only residual, and is minor → caps the rise at 8.0

Unchanged: DIC 8.0, FMC 8.0, OBS 8.0 (REV-5/7/8 still open); all others as R1. Overall: 7.5 → 8.5.

Resource & Cost Advisory (non-blocking)

§4.F: bounded by :ai_agent queue concurrency (5/10), +1 nullable column, ≤70 question rows/case (negligible growth), per-question HTTPS egress to QontakNLP capped by sample size. No new infra. This is a healthy advisory note; the only cost dimension that matters — per-tier token budget — is correctly routed to Open Q #5 (Data/AI) rather than blocking readiness.

Decision Closure Assessment

Decision Index

#	Decision	Status	Critical Gaps
D-1	Frontend target = chatbot-fe (qontak-designer is reference only)	Resolved	none
D-2	Reuse existing `ai_agent_test_cases`/`_questions` tables	Resolved	none
D-3	Server-side `set_role` authoritative; FE menu flag/subscription-gated	Resolved	none
D-4	Async Sidekiq worker, queue `:ai_agent`	Resolved	none
D-5	Reuse `QontakNlp` predict per question, throttled in-worker	Resolved (R2)	throttle contract now specified (token-bucket, RPM pref, 429 behavior) — REV-1 fixed. Residual: per-process enforcement caveat (REV-9, minor)
D-6	Recompute `confidence_score` server-side on each rating	Resolved	none
D-7	Activation gate advisory→enforced behind flag	Resolved (R2)	threshold key pinned: `ai_agent_testing_threshold` (default 80, per-org) — REV-4 fixed
D-8	Soft delete via `acts_as_paranoid` + add DELETE endpoint	Resolved (R2)	full DELETE contract specified; restore scoped out-of-scope explicitly — REV-3 fixed
D-9	Per-status lifecycle `pending→processing→completed/failed`	Resolved	none
D-10	Sampling 10% / cap 50–70 / all if <10	Resolved	step-3 ownership (Data vs BOT) is org-process, not a code gap

Aggregate (R2): 10 of 10 Resolved, 0 Partial, 0 Dangling (R1 was 7/10 + 3 Partial; D-5, D-7, D-8 closed this cycle).

Decision: D-5 — Reuse QontakNlp predict per question, throttled in-worker

Status: Partial

What was decided

"Reuse QontakNlp predict path per question, throttled, in worker" (Detail 1.B D-5); alternative "New batch endpoint on AI service" rejected "Reuse the proven lib/qontak_nlp/inference.rb#prediction; throttle in-worker to respect TPM/RPM."

Alternatives considered

Documented: new batch endpoint on the AI service, rejected to reuse the proven live-predict path. Sound reasoning grounded in a verified file.

Grounding in existing code

Strongly grounded: lib/qontak_nlp/inference.rb#prediction (verified, timeout: 60), core/repositories/qontak_nlp/predict.rb. Reading order item 5.

Interface specification

The call shape is specified (predict, not send_message; timeout 60s; returns answer/confidence/sources/response_time). What is NOT specified: the throttle itself — no RPM/TPM ceiling, no inter-call delay or token-bucket size, no per-tier number. §3 Performance and §2.F both say "throttle … to respect TPM/RPM" and defer the contract to Open Q #5.

Failure handling

Per-question failure is well-handled (failure sequence diagram: Rollbar + status=failed + status_description, batch continues). But the specific failure that the throttle exists to prevent — a 429 from the AI service — has no documented behavior (backoff? requeue the question? count toward failed?).

Challenge results

Scale: This is exactly where it breaks first. A 50–70 question batch firing predict calls with no concrete pacing will, at the upper cap across concurrent batches (queue concurrency 10), plausibly exceed the provider's RPM and get throttled or fail. The throttle is the scaling control and it is unspecified.
Reversibility: Easy to change pacing later (worker-internal), so this is low-blast-radius to defer — but the first implementation will guess.
Consistency: Consistent with D-4 (async) and the FMC failure handling.
Agent implementability: An agent cannot implement the throttle without inventing numbers and a mechanism. It would likely add a naive sleep or no throttle at all.

Gaps and suggestions

Missing: concrete RPM/TPM ceiling (per tier if tier-dependent), pacing mechanism, on-429 behavior. Suggested resolution: pin a conservative default for beta (e.g. token-bucket at N RPM derived from the AI squad's documented per-org limit), with on-429 = exponential backoff then requeue the single question (not the batch), and fail the question after the retry budget. Make the ceiling a SystemPreference so it is tunable per tier without deploy (matches §4.B style). Open questions for author: What is the QontakNLP per-org RPM/TPM ceiling? Is it tier-dependent? On 429, backoff-and-retry the question or mark failed?

Decision: D-7 — Activation gate advisory→enforced behind flag

Status: Partial

What was decided

Add a confidence-threshold check in Repositories::Publish, advisory for beta and enforced before GA, behind ai_agent_testing_gate (default false/advisory). Alternative "hard gate from day one" rejected (PRD Open Q #2). Grounded in verified repositories/publish.rb (no gate today — confirmed: only update!(active_version_id:)).

Interface specification

The publish endpoint contract is specified (POST, 422 below threshold when gate on, override_reason?). The gap: the threshold value's source is unresolved — Open Q #4 asks fixed-80 vs org-configurable; §4.B lists a "threshold pref" but with no key name/type/default/provisioner, unlike every other row in that table.

Failure handling

422 with "Confidence below threshold" is catalogued (§3.B). Override path (S09) handled. Fine.

Challenge results

Scale/Reversibility: trivial; gate is a single comparison. Toggling advisory↔enforced is a flag.
Consistency: consistent with D-6 (server-authoritative aggregate) and S09 override.
Agent implementability: an agent can build the comparison but will guess where 80 lives (constant vs config key). Chunk 8 acceptance criterion says "422 when <80" — implies a constant, contradicting the §4.B "threshold pref" hint.

Gaps and suggestions

Missing: the threshold config key spec (name/type/default/provisioner) OR an explicit "constant 80 this phase." Suggested resolution: for this phase, hard-code 80 as a named constant and add a SystemPreference key ai_agent_testing_threshold (int, default 80) only if Open Q #4 resolves to configurable; document which in §4.B. Open questions for author: Is 80 fixed for the phase or org-configurable on day one?

Decision: D-8 — Soft delete + add DELETE endpoint

Status: Partial

What was decided

Soft delete via acts_as_paranoid; add the missing DELETE /api/v1/ai_agents/:id/test_cases/:test_case_id route (verified: FE deleteTestCase client exists at ai-agents.ts:313; no BE test-case delete route found — confirmed by grep). Alternative hard-delete rejected per PRD §6.

Interface specification

Partially specified: method/path/auth/idempotency/status codes given in §2.4. Gaps: the success response body is {status:success} but doesn't echo the deleted resource; restore is mentioned (D-8 "soft delete + restore"; §2.3 lifecycle "restore via paranoid") but no restore endpoint appears in §2.4; cross-tenant handling is stated as 404 in one place (detail) and 403 in the role matrix — pick one for the DELETE.

Challenge results

Reversibility: soft-delete is the reversible choice; good.
Agent implementability: an agent can build the soft-delete DELETE from chunk 6, but will guess on the response body and whether to also build restore.

Gaps and suggestions

Missing: restore endpoint scope; exact DELETE response body; cross-tenant 404-vs-403 consistency. Suggested resolution: scope restore OUT for this phase (state it explicitly — paranoid restore is a console/operational action, not a UI endpoint), pin DELETE success body, and standardize cross-tenant → 404 (don't reveal existence). Open questions for author: Is restore a user-facing endpoint this phase or operational-only?

Decisions D-1, D-2, D-3, D-4, D-6, D-9, D-10 — Resolved (abbreviated)

Each names the chosen option, rejects a concrete alternative with a reason tied to verified codebase reality, references specific files, and (where it produces an interface) specifies it. D-1 (chatbot-fe vs qontak-designer — the latter has no API/auth, verified). D-2 (reuse tables — verified migrations). D-3 (set_role authoritative — verified per-route). D-4 (Sidekiq :ai_agent — verified queue config). D-6 (server-side recompute, single source of truth). D-9 (status lifecycle — two state-machine diagrams). D-10 (sampling cap with branch/skip flow diagram). No dangling alternatives; an agent can implement each without asking.

UI State Audit

Component	Loading	Empty	Error	Partial	Success	Assessment
Testing list	defined	defined	defined	defined	defined	5/5 (§2.C)
Generate drawer	defined	n/a	defined	n/a	defined	5/5 (n/a justified)
Generating modal	defined	n/a	defined	defined	defined	5/5
Detail/comparison	defined	defined	defined	defined	defined	5/5
Confidence meter	defined	defined	n/a (derived)	defined	defined	5/5

Summary: 5 of 5 data-driven surfaces have all applicable states defined (§2.C UI State Matrix). This is the strongest UI-state coverage the reviewer expects to see; the only soft spot is that the comparison view's "success" rendering leans on a non-existent prototype (REV-6), not a missing state.

Performance Budget Check

Metric	Target	Current Baseline	Source	Assessment
LCP (list)	< 2.5s	not stated	§3 Performance	adequate target, no baseline
INP (rating)	< 200ms	not stated	§3 Performance	adequate
CLS	< 0.1	not stated	§3 Performance	adequate
Bundle delta	"small (reuses Pixel3 + existing store)"	not stated	§3	vague — no numeric budget, but low-risk given reuse
Batch latency (BE)	~2–5 min for ~50 items	n/a (new)	§3, PRD §6	adequate

Frontend Core Web Vitals are quantified (good for a new-feature RFC). Bundle delta is qualitative; acceptable here because the page reuses the design system and existing store, so the blast radius is small — not a blocker.

Accessibility Review

Aspect	Specified?	Details	Assessment
Keyboard navigation flow	yes	drawer/modal focus-trap (Pixel3); thumbs via Tab	adequate (§3.E)
Focus management (modal/route)	yes	focus returns to trigger on modal close	adequate
ARIA labels	yes	thumbs `aria-pressed`; meter `role=progressbar`+`aria-valuenow/min/max`	adequate
Heading hierarchy	no	not addressed	minor gap
Color contrast	yes	"verified against Pixel3 tokens"	adequate (relies on DS)
Motion sensitivity	yes	`prefers-reduced-motion` for generating animation	adequate
Screen reader behavior	partial	meter labeled; comparison-panel SR reading order not described	minor gap

WCAG AA committed with concrete focus/ARIA specifics — strong for a draft. Minor: heading hierarchy and comparison-panel screen-reader order unaddressed.

Pattern Alignment Check

Pattern	RFC Approach	Assessment
FE state management (Pinia `fetchStatus`)	follows (`store/ai-agent/actions.ts`)	adequate — reference file named
FE optimistic update/rollback	follows (`UPDATE_TEST_CASE_QUESTION`)	adequate
FE list/empty/loading	follows (`AiAgentsTable.vue`)	adequate
BE Grape handler	follows (`test_cases_controller.rb` + ResultMatcher)	adequate
BE worker	follows (`fetch_room_conversations_worker.rb` pattern; per-item rescue→Rollbar)	adequate
BE feature flag	follows (`FeatureFlag.enabled?`)	adequate
Cross-layer naming	snake_case JSON consumed directly by FE (verified)	adequate — no ad-hoc transform
Analytics events	preserved/added (PRD §12 names matched)	adequate

No silent new patterns introduced. The one intentional deviation (Generate drawer version selector absent from the prototype) is explicitly flagged (§5 Q-A). Exemplary CPA.

Data Integrity Deep-Dive

Write Path	Transaction Scope	Partial Failure Behavior	Idempotency Key	Consistency Guarantee	Duplicate Handling
Create test case	single INSERT	422; no worker enqueued	`(agent,version,name)` "recommended" (NOT pinned — REV-8)	strong (single row)	second click → second row (FE button-disable only)
Worker per-question INSERT+UPDATE	per-question (not one big trx)	failed q → `status=failed`, batch continues	re-run clears prior questions before regen (trx not pinned — REV-7)	eventual (batch)	re-run clear-before-regen
Rate + recompute	question UPDATE + aggregate UPDATE in one trx	rollback both → FE restores prior	last-write-wins per question	strong within request	repeated same score idempotent
Publish (gate)	single UPDATE `active_version_id`	422 if `<80` & gate on	idempotent	strong	n/a

Strong overall (§2.D is a real matrix). Two soft spots flagged: create idempotency is "recommended" not decided (REV-8), and the clear-before-regen transaction is described as a rule but not tied to a transactional step (REV-7) — an agent could orphan questions if it deletes-then-fails mid-regen.

Concurrency Collision Map

#	Shared Resource	Writers	Collision Scenario	Resolution Mechanism	Lock Failure Behavior	Assessment
1	`ai_agent_test_cases.confidence_score`	concurrent raters	two thumbs near-simultaneously	recompute reads current scores in-trx (no stored delta)	last recompute wins; deterministic from rows	adequate
2	`ai_agent_test_case_questions` of a case	worker (regen) vs rater	rate during re-run	case→`processing` disables rating; reject 409/422 if not `completed`	FE "regenerating"	adequate
3	`ai_agents.active_version_id`	publish vs publish	double activate	idempotent row update	last write wins	adequate
—	AI-service rate budget	concurrent worker batches	N batches × ≤70 questions exceed provider RPM/TPM	NOT SPECIFIED	NOT SPECIFIED	gap (REV-1) — the one shared external resource with no resolution

§2.E covers the DB collision points well. The missing collision is the shared external AI-service rate budget across concurrent batches (queue concurrency up to 10) — this is the CSS-dragging gap (REV-1).

API Contract Completeness Check

Endpoint	Request Schema	Response Schema	Error Taxonomy	Auth Spec	Idempotency	Example Payloads	Assessment
GET `/test_cases`	complete	complete	complete	specific (`set_role`)	n/a (read)	partial	6/6
POST `/test_cases`	complete	complete	complete (404/422/403)	specific	"recommended" (REV-8)	yes	5.5/6
GET `/test_cases/:id`	complete	complete	complete	specific	n/a	partial	6/6
PATCH `/questions/:id`	complete	complete	complete	specific	last-write-wins	partial	6/6
DELETE `/test_cases/:id`	complete	partial (no resource echo; restore unscoped)	partial (404 vs 403)	specific	idempotent	no	4/6 (REV-3)
POST `/publish`	complete (`override_reason?`)	complete	complete (422)	specific	idempotent	no	5.5/6
GET `/tree_diagram`	complete	complete (`avg_confidence_score`)	partial	specific	n/a	no	5.5/6

Strong overall. The DELETE row is the weakest (REV-3); POST idempotency is soft (REV-8). Example payloads are given for create only — acceptable since the other shapes are reused from verified types.

Async Job / Event Consumer Spec

Job/Consumer	Trigger	Input Shape	Retry Policy	DLQ	Concurrency Limit	Idempotency Key	Timeout	Assessment
`FetchRoomConversationsWorker` (extend)	`perform_async` from CreateTestCases	`{test_case_id, organization_id}`	"bounded retry (e.g. 3, exp backoff)" — NOT pinned (REV-5)	"none (Sidekiq dead set; alert on failure)" — no depth threshold	queue `:ai_agent` (5 staging / 10 prod)	re-run clears prior questions before regen	per-question NLP 60s	5.5/7

The job spec is present and mostly complete (input, concurrency, idempotency, per-msg timeout, poison handling all stated). The two soft cells — retry intervals ("e.g. 3") and dead-set alert depth — are the FMC/CSS calibration gaps (REV-5). The throttle (REV-1) is the bigger of the two and belongs here too: per-question NLP pacing is not in the spec.

Compliance Trigger Check

Trigger	Found?	Data Location	Classification	Assessment
PII (name, email, content)	yes	`question`, `answer`, `parameters.human_answer` (customer/agent text); `scored_by_email`/`_name`	PII (customer content + internal user)	handled (§3.D)
Payment data	no	—	—	n/a
Health data	no	—	—	n/a
User content with retention	yes	both tables, soft-delete	retention: soft-delete; hard-purge TBD (Open Q #8)	partially handled
Auth/session data	no (uses existing JWT)	—	—	n/a
Cross-border data transfer	partial	PII → 3rd-party LLM	DPA-covered, transient	flagged, InfoSec sign-off required (Open Q #1)

CDG Status: Active — scored 8.0. Classification, legal basis, encryption, access audit, right-to-delete, and PII-scrub logging are all specified; the hard-purge window (Open Q #8) and InfoSec sign-off (Open Q #1) are the open items, correctly surfaced.

Cross-Layer Contract Verification

Endpoint	Backend Response Schema	Frontend Expected Schema	Match?	Gaps
GET `/test_cases`	`{status,code,message,data:[TestCase],meta}`	`TestCasesResponse {data: TestCase[]}`	Yes	`meta` optional
POST `/test_cases`	`{data: TestCase}` snake_case	`CreateTestCaseResponse {data:TestCase}`	Partial	BE does not persist `name` today → blank `TestCase.name` until chunk 1 (REV-2)
GET `/test_cases/:id`	`{data:{...,questions[]}}` incl. `confidence_score`, `parameters.human_answer`	`TestCaseDetailResponse {data:TestCaseDetail}`	Yes	fields align 1:1 with verified types
PATCH `/questions/:id`	`{data:question}` + recomputed aggregate	`UpdateTestCaseQuestionPayload {score:0\|1}`	Partial	aggregate recompute not built → meter static until chunk 5 (in-scope work)
DELETE `/test_cases/:id`	route missing	`deleteTestCase` client exists (`ai-agents.ts:313`)	No	BE route must be built (chunk 6) (REV-3)

Checks performed:

Field casing — snake_case consumed directly, verified, no ad-hoc transform
Nullability — confidence_score nullable; FE handles "no score"
Error shape — BE ErrorException ↔ FE error catalog (§3.A code-shape consistency column = "yes")
Pagination — page/limit query params; meta returned
Auth token — existing JWT/set_role, unchanged

Mismatches found: 0 true mismatches. The three Partial/No rows are disclosed in-scope work, not silent contract divergence — so the "cross-layer mismatch → cap 6.5" gate does not fire. This is the single most important reason this RFC scores Strong rather than Needs Work: it found its own integration gaps and put them on the build plan (chunks 1, 5, 6).

Cross-Layer Rollout Compatibility Matrix

Scenario	Frontend	Backend	Works?	Notes
Pre-deploy (baseline)	Old	Old	Yes	no Testing page
Backend first	Old	New	Yes	new endpoints unused by old FE; flag OFF
Frontend first	New	Old	No	avoided by deploy order; if FE leads, page gated behind flag tied to BE readiness
Both deployed (target)	New	New	Yes	target
Backend rollback	New	Old	Partial	FE page errors gracefully (error slate); disable flag
Frontend rollback	Old	New	Yes	BE endpoints unused

Deploy order: Backend first (specified, §4 Rollout Strategy, with rationale). Incompatible scenarios: 1 (Frontend-first) — explicitly avoided by deploy order and flag-gating. ROL is not capped: deploy order is specified, so the < 5.0 and deploy-order caps do not fire.

End-to-End Data Flow

Flow: Generate → rate → activate (the primary user action)

SPV clicks Generate
  → FE: GenerateFromInboxDrawer @confirm {name, version} → store CREATE_TEST_CASE
  → API: POST /api/v1/ai_agents/:id/test_cases {type, version_id, name}
  → Backend: CreateTestCases UC → CreateTestCase repo
  → DB: INSERT ai_agent_test_cases (status='processing')
  → Side effect: Sidekiq enqueue FetchRoomConversationsWorker; PaperTrail version
  → Response: 201 {data: TestCase status='processing'}
  → FE: generating modal polls GET /test_cases every ~3s
  → Worker (async): Hub rooms → sample 10%/cap → ExtractConversationPairs → per question: INSERT q (processing) → NLP predict (NOT send_message) → UPDATE q (answer/confidence/sources, completed)
  → DB: UPDATE test_case status='completed'
  → FE: SPV opens detail GET /test_cases/:id → TestCaseComparison renders human-left/AI-right
  → @rate: PATCH /questions/:id {score} → recompute confidence_score (one trx)
  → FE: meter updates → at ≥80% Activate enabled → POST /publish (gate)

Gaps in flow: The full path is traceable from §2.H + the §2.2 sequence diagram — a genuine strength. The only place one layer's output isn't pinned to the next layer's input is the worker→NLP step pacing (REV-1) and the name round-trip (REV-2, the create response echoes a name the BE doesn't persist yet). No structural gaps; the flow is implementable step-by-step.

Agentic Readiness Deep-Dive

Vague Word Audit

#	Word/Phrase	Location	Impact	Concrete Replacement
1	"throttle … to respect TPM/RPM"	§3 Performance, §2.F	agent invents pacing	concrete RPM ceiling + token-bucket + on-429 (REV-1)
2	"bounded retry (e.g. 3, exp backoff)"	§2.F	agent picks intervals	"3 retries: 1s, 4s, 16s" (REV-5)
3	"recommend … unique-ish `(agent,version,name)` guard"	§2.4, §2.D	agent guesses constraint vs check	decide index or app-level dedup (REV-8)
4	"only if a name column is required"	§2.3	makes chunk-1 migration optional	decide yes/no (REV-2)
5	"threshold pref"	§4.B	no key name/default	pin config key or "constant 80" (REV-4)
6	"bundle delta small"	§3	no numeric budget	low-risk; optional numeric budget

Total vague words in spec sections: 6 (low for an 1,200-line RFC; 4 map to the major findings).

Dangling Alternatives

#	Alternatives	Location	Impact
—	none	—	No "we could do X or Y" left unresolved. Every D-1…D-10 picks one option.

Total dangling alternatives: 0.

Task Decomposition Assessment

Chunk	Acceptance Criteria	Assessment
1–8 (BE)	each has files + repo command + assertable AC	verifiable (§4.D)
9–13 (FE)	each has files + `pnpm test`/`e2e` + AC	verifiable

§4.D is a 13-chunk ordered plan with files, commands, and "done when" criteria per chunk — among the best decomposition the rubric anticipates. Chunks 1, 4, 6, 8 carry the open findings (REV-2, REV-1, REV-3, REV-4) but are still actionable with the suggested resolutions.

Strengths

Verified grounding (§2.0 Source Verification). Every load-bearing claim carries a file + line + evidence; the reviewer spot-checked four (worker is fetch+extract+log only at 43 lines; publish.rb only sets active_version_id; rate_test_case_question.rb writes no aggregate; FE deleteTestCase exists at ai-agents.ts:313 with no BE route) — all confirmed. This earns High confidence and is why an agent can trust the anchors.
Cross-layer honesty (§2.G). The RFC names its own three integration gaps (name persistence, aggregate recompute, missing DELETE) as in-scope work rather than letting them surface mid-build — turning what is usually a blocker-capping mismatch into a clean PROCEED.
Execution plan + traceability (§4.D, Detail 1.A/1.C). 13 ordered chunks with assertable AC + bidirectional PRD traceability (19/19 covered) + per-story change map mean an agent always knows what it's building and why.

Biggest Gaps

NLP throttle contract (REV-1, §3 Performance / §2.F / Open Q #5). The one external contract that governs whether the core shadow-gen chunk works at batch scale is unspecified — no RPM/TPM ceiling, pacing mechanism, or 429 behavior. This is the single biggest agentic-execution risk: an agent would ship chunk 4 with naive or no pacing and the batch would get rate-limited in production. Drags CSS to 6.5.
name column left conditional (REV-2, §2.3 / §2.G). The create-path cross-layer match is partial precisely because BE doesn't persist name today, yet §2.3 frames the migration as "only if required." Chunk 1's first migration is effectively optional — an agent could skip it and ship a blank TestCase.name column in the list UI.
DELETE-endpoint depth + gate-threshold source (REV-3, REV-4). The newest endpoint (DELETE) has the thinnest contract (response body, restore scope, 404-vs-403), and the publish gate's threshold source (constant vs config key) is unresolved between §4.B and chunk 8's AC — two spots where an agent guesses.

Priority Actions

Ordered by impact on implementation readiness. Close before agent execution of the named chunks.

§2.F / §3 Performance / Open Q #5 (REV-1) — Pin the per-question NLP throttle: concrete RPM/TPM ceiling (per tier if applicable), token-bucket or fixed inter-call delay, and explicit on-429 behavior (backoff-and-requeue the single question, fail after retry budget). Make the ceiling a SystemPreference. Unblocks chunk 4 (the engineering core).
§2.3 / §4.D chunk 1 (REV-2) — Decide the name column definitively: add column + persist (chunk 1 already lists the migration) and remove the "only if required" hedge, OR drop name from the FE type and update §2.G. Unblocks the create round-trip.
§2.4 DELETE row + D-8 (REV-3) — Specify the DELETE success body, scope restore explicitly OUT (operational-only) for this phase, and standardize cross-tenant → 404. Unblocks chunk 6.
§4.B / §2.4 publish / Open Q #4 (REV-4) — Resolve the gate threshold source: pin a ai_agent_testing_threshold config key (int, default 80, provisioner) like the other §4.B rows, or state "constant 80, configurability deferred." Align chunk 8's AC. Then tighten REV-5/REV-7/REV-8 (retry intervals, clear-before-regen transaction, create idempotency) — minor, but they remove the last guesses.

Backend Contract Addendum

Endpoint Contract Details

Endpoint	Method/Path	AuthZ	Request Contract	Response Contract	Error Contract	Idempotency/Versioning	Status
Create	POST `/api/v1/ai_agents/:id/test_cases`	`set_role` + org scope	`{type, version_id, name}`	`{data: TestCase status='processing'}`	404/422/403	idempotency "recommended" — Missing decision (REV-8)	Missing [idempotency, name persist]
Delete	DELETE `/api/v1/ai_agents/:id/test_cases/:test_case_id`	`set_role` + org scope	path	`{status:success}` (no resource echo)	404/403 (inconsistent)	idempotent	Missing [response body, restore scope, 404-vs-403] (REV-3)
Publish	POST `/api/v1/ai_agents/:id/publish`	`set_role` (+admin for override)	`{override_reason?}`	`{data: ai_agent}`	422 below threshold	idempotent	Missing [threshold source] (REV-4)

Database Changes Details

Change	Table/Entity	DDL / Shape Diff	Data Migration Plan	Rollback Plan	Compatibility Window	Status
Add `name` (conditional)	`ai_agent_test_cases`	`ADD COLUMN name varchar` + index	none (nullable, no backfill)	nullable + inert; no down-migration	additive, indefinite	Missing [the "conditional" must become definite] (REV-2)

Async Worker

The worker contract (§2.F) is mostly complete; missing the throttle (REV-1) and concrete retry intervals (REV-5).

Implementation Readiness Checklist

Unblocked (agent can proceed)

All types:

PRD → RFC traceability matrix complete (19/19)
Technical decisions resolved (7/10 resolved, 3 partial — none dangling)
Failure modes handled with error catalogs (§3.A/3.B/3.C)
Configuration contract (§4.B; one gap REV-4)
Pattern alignment verified (§2.0)
Rollout plan with flag + rollback (§4)
Observability metrics + alerts (§3)
Task decomposition with AC per chunk (§4.D)
Zero vague words — 6 present, 4 map to majors

Frontend:

Interfaces/prop types specified (§2.A)
All UI states defined (§2.C, 5/5)
Performance budget (CWV quantified; bundle qualitative)
Accessibility (§3.E; minor gaps)
Browser support matrix — "per chatbot-fe baseline" (soft)

Full-stack:

Cross-layer contract verified (§2.G; 0 true mismatches)
Deploy order specified (BE first)
Rollout compatibility matrix (§4.A; 1 "No" mitigated)
End-to-end data flow (§2.H + §2.2)
Feature flag coordination (two independent flags)

Backend:

Schema at DDL precision (§2.3)
API contracts — DELETE thin (REV-3), create idempotency open (REV-8)
Transaction boundaries + idempotency per write path (§2.D; 2 soft spots)
Concurrency collision points (§2.E; external rate budget missing — REV-1)
Security (§3)
Migration plan (§4)
Service boundary (§2.F.1)
Compliance (§3.D; hard-purge TBD)

Blocked (must fix first)

None are hard blockers. The four majors (REV-1..4) are PROCEED-with-notes items: close them before executing chunks 4, 1, 6, 8 respectively.

Verdict: Fix 4 notes inline (REV-1..4) — no hard blockers; proceed on the unblocked chunks in parallel

Task Manifest

The RFC already specifies a 13-chunk plan (§4.D). Verified below; the reviewer adds the open-finding each chunk must close.

Order	Chunk	Files to Create/Modify	Acceptance Criteria	Dependencies / Note
1	Add `name` + persist; create status `processing`	migration, `create_test_case.rb`, `create_test_cases.rb`, `test_cases_controller.rb`	migration up/down; create persists `name`, returns `processing`	Close REV-2 first
2	Status lifecycle helper	new `update_test_case_status.rb`	transitions covered	None
3	Worker sampling step	worker, new `sample_rooms.rb`	200→~20; <10→all; 5000→cap	None
4	Worker shadow-gen + persistence	worker, new `generate_shadow_answer.rb`, question repo	0 `SendMessageWorker` enqueues; fields persisted	Close REV-1 first
5	Confidence aggregate recompute	`rate_test_case_question.rb`, new `recompute_confidence_score.rb`	`(up/total)*100` after rate	Chunk 1
6	DELETE route (soft delete)	`test_cases_controller.rb`, new `delete_test_case.rb` + repo	soft-deletes; 404 cross-tenant	Close REV-3
7	Tree-diagram avg confidence	`get_tree_diagram_v3.rb`	avg over completed; none→"no score yet"	Chunk 5
8	Publish gate (flagged)	`publish.rb`, `publish_ai_agent.rb`	422 `<80` & gate on	Close REV-4
9–13	FE list/generate/detail/gate/tree	`pages/bot-automation/testing/`, `modules/bot-automation/components/testing/`, `layouts/bot-automation.vue`, `AiAgentEditor.vue`	per §4.D	chunk 11 needs Figma frame (REV-6)

Dangling Decisions Log

None fully dangling. The three Partial decisions are listed for closure tracking.

#	Decision	Location	Owner	Deadline
1	D-5 NLP throttle contract (REV-1)	§3 / §2.F / Open Q #5	Data/AI (Reza)	2026-07-15 (PRD Open Q #5)
2	D-7 gate threshold source (REV-4)	§4.B / Open Q #4	Dimas (PM)	2026-07-15
3	D-8 DELETE depth + restore scope (REV-3)	§2.4 DELETE	Eng (BOT)	unset

Open Questions

Surfaced during review. Promoted to the RFC's own Open-Questions surface by REV-id.

#	Question	Category	Severity
REV-1	What is the QontakNLP per-org RPM/TPM ceiling, the pacing mechanism, and on-429 behavior for shadow-gen?	CSS / FMC	Important
REV-2	Is the `name` column added and persisted this phase (making chunk-1 migration mandatory), or is `name` dropped from the FE type?	CNT / ACV / DMS	Important
REV-3	DELETE: exact success body, is restore a user-facing endpoint this phase, and cross-tenant 404 vs 403?	ACV	Important
REV-4	Is the 80% threshold a constant this phase or an org-configurable `SystemPreference` (key/default/provisioner)?	ROL / TDC	Important
REV-5	Concrete worker retry count + backoff intervals and Sidekiq dead-set alert depth?	FMC	Nice-to-have
REV-6	Figma frame for the comparison/detail view (no prototype exists) — Design QA before chunk 11.	NFS / CNT	Important
REV-7	Re-run trigger (which endpoint/param re-enqueues) and the clear-before-regen transaction boundary?	DIC / ACV	Nice-to-have
REV-8	Create-endpoint idempotency: unique index, app-level dedup, or accept duplicates?	DIC / ACV	Nice-to-have

Evidence Notes

§2.0 Source Verification — every anchor row carries verifiable evidence; reviewer confirmed 4 load-bearing claims against the sibling repos (worker 43 lines no LLM/persist; publish.rb:13 only active_version_id; rate_test_case_question.rb:14-19 no aggregate; FE deleteTestCase at ai-agents.ts:313, no BE route). Drove High confidence and the PRT/CPA/SBC scores.
§2.G Cross-Layer Contract Verification — the disclosure of three integration gaps as in-scope work (not silent mismatch) is the reason the cross-layer mismatch cap did NOT fire; pivotal to the Strong rating.
§3 Performance + §2.F — the absence of a concrete NLP throttle contract is the single most consequential gap; held CSS to 6.5 and is the lead open finding (REV-1).
§4.D Agent Execution Plan — 13 verifiable chunks; the basis for the Task Manifest and the PROCEED-with-notes verdict.
§2.3 DDL — current schema verified from migrations; the name column's conditional framing is the second-biggest gap (REV-2).

R3 Re-Grounding Log (every cited anchor re-verified against current code)

Cycle R3 opened the actual files in chatbot fa6dd8b79 and chatbot-fe f5b80d5a (chatbot advanced from the R2 stamp 88b0685). Every anchor below was re-confirmed; none drifted or broke.

Anchor (RFC claim)	Re-grounded result
`app/workers/fetch_room_conversations_worker.rb` — 43 lines, `queue: :ai_agent, retry: false`, fetch+extract+log only	✓ exact — 43 lines, `sidekiq_options queue: :ai_agent, retry: false` (L5); no LLM/persist/status
`repositories/publish.rb` — only `update!(active_version_id:)`, no gate	✓ — `def call` L12, `@ai_agent.update!(active_version_id: @ai_agent.version_id)` L13; no threshold/confidence (R2 cited "L12-18"; substance unchanged)
`repositories/create_test_case.rb` — `status='pending'`, no `name` persist	✓ — `record.status = 'pending'` L19; no `name` assignment (REV-2 underlying gap still live)
`repositories/rate_test_case_question.rb` — score/scored_by* only, no aggregate	✓ — L13-19 set score/is_score/scored_by*/scored_at; no `confidence_score` write
`test_cases_controller.rb` — GET L32 / POST L75 / PATCH L115, all `set_role(%w[owner supervisor admin])`	✓ exact lines + roles
No DELETE test-case route; agent delete at `ai_agents_controller.rb`	✓ — only `delete '/:id'` (agent) at `ai_agents_controller.rb:237`; no test-case delete
Mount `api.rb` L47-48 `TestCasesController => '/v1/ai_agents'`	✓ — `api.rb:48`
Migrations `20260512000001/2`; `name` column NOT present	✓ — both migrations exist; `ai_agent_test_cases` schema has no `name` column (RFC chunk-1 migration not yet built — correct for pre-impl RFC)
`ai_agent_test_case_questions` columns (status_description, parameters/sources jsonb, scored_*, confidence, response_time, deleted_at, etc.)	✓ — all present in `db/schema.rb`, indexes match
`extract_conversation_pairs.rb` — skip SYSTEM, `question?`, pair next agent reply (L40-66)	✓ — `SYSTEM_PARTICIPANT_TYPE` L8/L45, `question?` L50/L70, pairing L54-61
`lib/qontak_nlp/inference.rb#prediction` — `timeout: 60`, `@http.call(method:'POST')`	✓ — `prediction` L10 (default `timeout: 60`), `@http.call(method:'POST', ... open_timeout:/read_timeout:)` L22
`get_tree_diagram_v3.rb#add_ai_agent` L850-909, no confidence	✓ — `def add_ai_agent` L850 (called L250); node built without avg confidence
tree route `v3/path.rb` L17 `get ':id/tree_diagram'`; `feature_flag.rb` `enabled?(group_code, code, default:)`	✓ — `path.rb:17`; `FeatureFlag.self.enabled?` L34
FE `ai-agents.ts` — `createTestCase` L250, `getTestCases` L275, `getTestCaseDetail` L292, `deleteTestCase` L313, `updateTestCaseQuestion` L334	✓ — all present; `deleteTestCase` at L313 exactly
FE `endpoint.ts` test_cases paths L207-214	✓ — create/list/detail/delete/update_question at L208-213
FE `interface.ts` — `TestCase` (carries `name`), `TestCaseQuestion` (`score:0\|1\|null`), `UpdateTestCaseQuestionPayload` (`score:0\|1`)	✓ — `TestCase` `name:string` L272; payload `score:0\|1` L359-364
FE `actions.ts` — CREATE L745 / FETCH L803 / FETCH_DETAIL L849 / DELETE L902 / UPDATE_QUESTION L950 (optimistic rollback)	✓ — all exact; optimistic snapshot+rollback confirmed in UPDATE_TEST_CASE_QUESTION (L960-998)
FE `AiAgentEditor.vue` footer "Save changes" only L1712-1733; `layouts/bot-automation.vue` flag-gated L209-349; no `pages/bot-automation/testing/`; `@mekari/pixel3@^1.0.12`; legacy `ValidationDetailPanel.vue`	✓ — all confirmed

Net: 34+ citations re-verified, 0 drift, 0 breakage. The RFC's §2.0 Source-Verification table is accurate against current HEADs, which is why R3 keeps High confidence and holds the 8.5/PROCEED verdict without minting a new finding.

Review History

Cycle	Date	Reviewed RFC revision (`last_updated` / commit)	Score	Verdict	Findings open → fixed	Notes
`R1`	2026-06-20	`2026-06-20` / working tree (uncommitted)	`7.5`	PROCEED with notes	8 open (4 major / 4 minor), 0 fixed	First cycle. Strong, well-grounded full-stack RFC; no hard blockers. Lead gap: NLP throttle contract (REV-1).
`R2`	2026-06-20	`2026-06-20` / commit `88b0685`	`8.5`	PROCEED	4 fixed (REV-1…4, all majors), 1 new minor (REV-9), 5 carry-open (REV-5/6/7/8/9)	Delta re-review. All R1 majors closed; decisions now 10/10 resolved; §2.G fully `yes`. CSS 6.5→8.0 (throttle specified). No open majors → PROCEED.
`R3`	2026-06-20	`2026-06-20` / working tree — chatbot `fa6dd8b79` / chatbot-fe `f5b80d5a`	`8.5`	PROCEED	0 fixed, 0 new, 5 carry-open (REV-5/6/7/8/9)	Re-grounding/confirm cycle (RFC unchanged since R2). Re-verified all §2.0 Source-Verification anchors + prior spot-checks against current HEADs (chatbot advanced `88b0685→fa6dd8b79`). Zero citation drift, zero breakage — every file:line, route, migration, schema column, FE symbol still resolves as claimed. Ledger + score held.

Executive Summary​

Quick Verdict​

Findings Ledger (carry-forward)​

PRD → RFC Traceability Matrix​

Standard format (PRD exists)​

Scorecard​

Full-Stack Scorecard (18 categories)​

R2 re-score deltas (the table above is the R1 baseline)​

Resource & Cost Advisory (non-blocking)​

Decision Closure Assessment​

Decision Index​

Decision: D-5 — Reuse QontakNlp predict per question, throttled in-worker​

What was decided​

Alternatives considered​

Grounding in existing code​

Interface specification​

Failure handling​

Challenge results​

Gaps and suggestions​

Decision: D-7 — Activation gate advisory→enforced behind flag​

What was decided​

Interface specification​

Failure handling​

Challenge results​

Gaps and suggestions​

Decision: D-8 — Soft delete + add DELETE endpoint​

What was decided​

Interface specification​

Challenge results​

Gaps and suggestions​

Decisions D-1, D-2, D-3, D-4, D-6, D-9, D-10 — Resolved (abbreviated)​

UI State Audit​

Performance Budget Check​

Accessibility Review​

Pattern Alignment Check​

Data Integrity Deep-Dive​

Concurrency Collision Map​

API Contract Completeness Check​

Async Job / Event Consumer Spec​

Compliance Trigger Check​

Cross-Layer Contract Verification​

Cross-Layer Rollout Compatibility Matrix​

End-to-End Data Flow​

Flow: Generate → rate → activate (the primary user action)​

Agentic Readiness Deep-Dive​

Vague Word Audit​

Dangling Alternatives​

Task Decomposition Assessment​

Strengths​

Biggest Gaps​

Priority Actions​

Backend Contract Addendum​

Endpoint Contract Details​

Database Changes Details​

Async Worker​

Implementation Readiness Checklist​

Unblocked (agent can proceed)​

Blocked (must fix first)​

Task Manifest​

Dangling Decisions Log​

Open Questions​

Evidence Notes​

R3 Re-Grounding Log (every cited anchor re-verified against current code)​

Review History​

Executive Summary

Quick Verdict

Findings Ledger (carry-forward)

PRD → RFC Traceability Matrix

Standard format (PRD exists)

Scorecard

Full-Stack Scorecard (18 categories)

R2 re-score deltas (the table above is the R1 baseline)

Resource & Cost Advisory (non-blocking)

Decision Closure Assessment

Decision Index

Decision: D-5 — Reuse QontakNlp predict per question, throttled in-worker

What was decided

Alternatives considered

Grounding in existing code

Interface specification

Failure handling

Challenge results

Gaps and suggestions

Decision: D-7 — Activation gate advisory→enforced behind flag

What was decided

Interface specification

Failure handling

Challenge results

Gaps and suggestions

Decision: D-8 — Soft delete + add DELETE endpoint

What was decided

Interface specification

Challenge results

Gaps and suggestions

Decisions D-1, D-2, D-3, D-4, D-6, D-9, D-10 — Resolved (abbreviated)

UI State Audit

Performance Budget Check

Accessibility Review

Pattern Alignment Check

Data Integrity Deep-Dive

Concurrency Collision Map

API Contract Completeness Check

Async Job / Event Consumer Spec

Compliance Trigger Check

Cross-Layer Contract Verification

Cross-Layer Rollout Compatibility Matrix

End-to-End Data Flow

Flow: Generate → rate → activate (the primary user action)

Agentic Readiness Deep-Dive

Vague Word Audit

Dangling Alternatives

Task Decomposition Assessment

Strengths

Biggest Gaps

Priority Actions

Backend Contract Addendum

Endpoint Contract Details

Database Changes Details

Async Worker

Implementation Readiness Checklist

Unblocked (agent can proceed)

Blocked (must fix first)

Task Manifest

Dangling Decisions Log

Open Questions

Evidence Notes

R3 Re-Grounding Log (every cited anchor re-verified against current code)

Review History