Skip to main content

RFC Review: AI Agent Testing — Phase 1: Historical Validation

Companion review for historical-validation.md, produced by the rfc-reviewer skill. Lives beside the RFC; valid only for the RFC revision in reviewed_rfc_last_updated (2026-06-20).

This is a cumulative review. Current cycle = R3 (re-grounded against chatbot fa6dd8b79 / chatbot-fe f5b80d5a). R1 (pre-fix draft, 7.5) and R2 (post-fix, 8.5) are preserved as history. The RFC last_updated (2026-06-20) still equals the prior stamp → refresh/confirm cycle: no RFC content changed since R2. R3 re-grounded EVERY load-bearing citation in the RFC §2.0 Source Verification table + prior-review spot-checks against the current code (chatbot moved 88b0685 → fa6dd8b79; chatbot-fe at f5b80d5a). Result: zero citation drift, zero breakage — every file:line, route, migration, schema column, and FE symbol still resolves to exactly what is claimed (full re-grounding log in Evidence Notes). No new code-grounded issue surfaced, so no new REV-n is minted. The ledger carries forward unchanged: 4 fixed (R1 majors, confirmed still fixed against current code), 5 minor open (REV-5/6/7/8/9). Score holds 8.5; verdict PROCEED. See the Findings Ledger and Review History for the delta.

Executive Summary

  • Overall Score: 8.5/10 (R3 — held from R2; was 7.5 at R1)
  • Rating: Strong
  • RFC Type: full-stack
  • Sub-Type: frontend: new-feature · backend: new-feature
  • Assessment Confidence: High — all 18 categories have direct evidence across both layers; R3 re-grounded every load-bearing citation against current code (chatbot fa6dd8b79, chatbot-fe f5b80d5a): worker still 43 lines fetch+extract+log only; publish.rb:13 only sets active_version_id (no gate); rate_test_case_question.rb:13-19 writes no aggregate; routes GET L32/POST L75/PATCH L115 with set_role; no DELETE test-case route; FE deleteTestCase at ai-agents.ts:313; TestCase type carries name (interface.ts:272) but BE create_test_case.rb:19 still does not persist it; name column still absent from schema. Zero drift, zero breakage.
  • Applied Caps/Gates: None triggered. No category scored < 5.0; ACV and ROL clear their < 5.0 caps; no cross-layer contract mismatch (after R2, every §2.G row reaches Match? = yes). The RFC does not reach 9.0+ because a few merged categories still sit 8.0 (DIC, FMC, OBS) on minor open items (REV-5/7/8) and one new caveat (REV-9, per-process throttle).
  • Implementation Readiness Verdict: PROCEED — all four R1 major findings are fixed (REV-1…REV-4). An agent can execute the full 13-chunk plan from the RFC alone. The remaining items are minors + external confirmations (production RPM ceiling Open Q #5, InfoSec Open Q #1) and one design dependency (REV-6: no Figma frame for the comparison/detail view) that blocks pixel-faithful chunk 11 only.
  • Report Path: chatbot/ai-agent-testing/rfcs/historical-validation-review.md
  • RFC Author: Claude (from PRD + repo grounding) | Reviewed: 2026-06-20 (R1 draft → R2 commit 88b0685)

This RFC is unusually execution-ready for a draft: it is explicitly framed as a delta on existing scaffolding and backs almost every claim with a verified file path, line range, and the exact gap to close. An AI agent could read §2.0 (Repo Reading Guide + Source Verification), the §2.4 API table, and §4.D (13 ordered chunks with files + commands + acceptance criteria) and start producing correct code for most chunks without a clarification meeting. The biggest strength is cross-layer contract honesty — §2.G names the three real frontend/backend gaps (name persistence, aggregate recompute, missing DELETE) instead of hiding them. The biggest gap is that the engineering core of the phase (worker sampling → shadow-gen → persistence) is unbuilt and, while well-specified at the orchestration level, leaves one hard external contract open: the per-question QontakNLP TPM/RPM throttle (Open Q #5), which is the dimension most likely to make the batch fail or get rate-limited in production. The one thing that must change before that chunk is agentic-safe: pin the throttle contract (concrete RPM/TPM ceiling, inter-call delay or token-bucket, and behavior on 429) — today an agent would invent it.

R2 update (commit 88b0685). That "one thing" is done: §3 Performance now specifies a token-bucket throttle (ai_agent_testing_nlp_rpm pref, default 60; on-429 → backoff + requeue → fail-question), and the other three majors are closed (name column unconditional §2.3; full DELETE contract §2.4; org-configurable gate threshold D-7/§4.B). The engineering core (chunk 4) is now agentic-safe. The one residual scaling caveat R2 raises (REV-9): the cap is enforced per worker process, so concurrent same-org batches (queue concurrency 5/10) can aggregate past the per-org ceiling — minor, with a Redis-counter mitigation noted below. Net: a Strong RFC, no remaining majors, PROCEED.


Quick Verdict

Why this RFC can be implemented agentically:

  • §2.0 Source Verification gives a verified anchor (file + line + evidence) for every claim, and §4.D decomposes the work into 13 ordered chunks each with files, repo-sourced test commands, and assertable acceptance criteria — the rubric's gold standard for task decomposition.
  • The contract surface is mostly closed: §2.4 API table (method/path/auth/request/response/status/idempotency/versioning/reuse-tag), §3.B/3.C error catalogs, §2.A typed component props, and §2.G cross-layer verification table mean the agent rarely has to guess shapes.

Why this RFC will cause agent guessing or rework:

  • The per-question NLP throttle contract is unresolved (Open Q #5, §3 Performance, §2.F): no concrete RPM/TPM number, no inter-call pacing mechanism, no documented 429 behavior — yet this governs whether the core shadow-gen chunk (4) works at batch scale. The RFC says "throttle … to respect TPM/RPM" but never specifies it.
  • Two contract micro-decisions are left as "if required / recommended": the name column (§2.3 "only if … required"; §2.G says BE doesn't persist it today) and create-endpoint idempotency ("recommended"). An agent hitting chunk 1 and the create path will guess.

Findings Ledger (carry-forward)

Stable, never-renumbered finding ids. R1 minted REV-1…REV-8; R2 reconciled them and minted REV-9. R3 re-grounded every cited anchor against current code and reconciled the ledger unchanged — the four R1 majors (REV-1…4) remain fixed (their fixes are RFC-text fixes; the underlying code state they describe — no name persisted, no gate, no DELETE route, no throttle in worker — is still the live reality the RFC proposes to change, exactly as documented). No drift ⇒ no new REV-n minted.

IDSeverityFinding (one line)RFC locationStatusFirst seenResolved inEvidence / fix
REV-1majorPer-question NLP throttle (TPM/RPM) contract unspecified — no concrete rate ceiling, pacing mechanism, or 429 behavior, despite governing the core shadow-gen chunk§3 Performance, §2.F job spec, Open Q #5fixedR1R2 (88b0685)§3 "NLP throttle contract (REV-1)": token-bucket; ai_agent_testing_nlp_rpm pref (default 60); on-429 → exp backoff (1→2→4s, max 3) + requeue → fail-question after budget; D-5 updated; §4.B config row added. (Production ceiling still confirmed by AI squad — Open Q #5; tracked, not blocking.)
REV-2majorname column persistence left conditional ("only if required"); §2.G marks POST .../test_cases cross-layer match as partial because BE does not persist name today§2.3 migration, §2.G row 2, §4.D chunk 1fixedR1R2 (88b0685)§2.3 now "required, not conditional" — column added + params[:name] persisted (≤24, validated); §2.G create row → yes (after chunk 1).
REV-3majorDELETE endpoint contract depth thinner than reuse/extend endpoints — soft-delete body, restore semantics, cross-tenant 404-vs-403 not pinned§2.4 DELETE row, §4.D chunk 6fixedR1R2 (88b0685)§2.4 "DELETE contract (REV-3)": no body; soft-delete via acts_as_paranoid org-scoped; 200 {data:{id}}; cross-tenant → 404 (not 403) to avoid id enumeration; idempotent; restore explicitly out-of-scope this phase. §2.G delete row → yes.
REV-4majorPublish-gate threshold configurability unresolved — 80 hard-coded vs org-configurable (Open Q #4); §4.B "threshold pref" had no key/default/provisioner§2.4 publish, §4.B, D-7, Open Q #4fixedR1R2 (88b0685)D-7 + §4.B: ai_agent_testing_threshold SystemPreference (engine, int, default 80, per-org); §5 Open Q #4 marked resolved.
REV-5minorWorker retry: false → "bounded retry (e.g. 3, exp backoff)" is exemplary, not pinned; no concrete backoff intervals or dead-set alert threshold§2.F job spec, Open limitation #10openR1Note: REV-1's fix pins the per-question NLP 429 backoff (1→2→4s), but the worker-level job retry intervals + dead-set alert depth are still "e.g. 3". Pin them.
REV-6minorComparison/detail view has no qontak-designer prototype and no dedicated Figma frame; blocks pixel-faithful chunk 11§1 Design References row 4, Design↔Code map, §5 Q-AopenR1Design dependency, not a doc edit — produce a Figma frame before chunk 11. Carried; surfaced in §5 Q-A.
REV-7minorRe-run trigger (which endpoint/param re-enqueues) + its idempotency guard not in the §2.4 API table, though clear-before-regen is stated in §2.D§2.D row 2, §2.E, Open limitation #9openR1Add the re-run path to §2.4 and pin the clear-before-regen transaction.
REV-8minorCreate-endpoint idempotency "recommended", not decided ((agent,version,name) guard not specified as constraint/check)§2.4 POST row, §2.D create rowopenR1Decide: partial unique index / app-level dedup, or accept duplicates (FE button-disable only).
REV-9minor(new in R2) §3 calls the RPM cap "per-org" but enforces it with an in-worker token-bucket; with queue concurrency 5/10, concurrent same-org batches each get their own bucket → aggregate can exceed the per-org ceiling§3 Performance "NLP throttle contract"openR2Either scope the claim ("per-worker pacing; same-org concurrent batches are rare") or make it a true per-org ceiling via a Redis-backed counter keyed by org. Low likelihood (one SPV generates at a time), so minor.

Ledger summary (R2): 5 open (0 blocker / 0 major / 5 minor), 4 fixed this cycle (all R1 majors), 0 accepted-risk. No blockers and no open majors → verdict PROCEED. Still-open findings (REV-5/6/7/8/9) are promoted to the RFC Open-Questions / Concerns surface by id.

Ledger summary (R3): unchanged — 5 open (0 blocker / 0 major / 5 minor), 0 fixed this cycle, 0 new. R3 is a re-grounding/confirm cycle: every citation re-verified against current code (chatbot fa6dd8b79 / chatbot-fe f5b80d5a), zero drift or breakage, RFC text unchanged since R2, so the ledger is identical. REV-5/6/7/8/9 remain promoted to the RFC §5 surface. Verdict PROCEED.


PRD → RFC Traceability Matrix

PRD: ../prds/historical-validation.md (BOT-3351). 10 stories (AITEST-S01…S10), CHG-001, 4 negative scenarios, plus sections §2–§17.

Standard format (PRD exists)

PRD ElementRFC SectionCoverage
AITEST-S01 Workspace access controlDetail 1.C row S01, §2.4 GET test_cases, §3 authz matrix, §4.D chunk 9/12Full
AITEST-S02 Historical sampling (10%)Detail 1.C row S02, §2.1 branch/skip flow, §2.F, §4.D chunk 3Full — sampling logic + cap branches specified; ownership (Data vs BOT) flagged (Open Q)
AITEST-S03 Data integrity & filteringDetail 1.C row S03, §2.0 anchor (ExtractConversationPairs), §3.A.1 branch catalogFull — relies on existing extractor + spec
AITEST-S04 Shadow execution (zero leakage)Detail 1.C row S04, §2.2 sequence, §2.F, §4.D chunk 4Partial — orchestration fully specified; the NLP throttle contract that makes it scale is open (REV-1)
AITEST-S05 Side-by-side validation UIDetail 1.C row S05, §2.A, §2.4 GET detail, §4.D chunk 11Partial — props/states specified; no Figma frame for the detail view (REV-6)
AITEST-S06 Confidence meter & feedbackDetail 1.C row S06, §2.A ConfidenceMeter, §2.D rate row, §4.D chunk 5Full — aggregate (up/total)*100 recompute specified server-side (D-6)
AITEST-S07 Activation gatekeepingDetail 1.C row S07, §2.4 publish, §4.D chunk 8, D-7Partial — gate specified; threshold configurability open (REV-4)
AITEST-S08 Background processing (async)Detail 1.C row S08, §2.F, §2.B polling, state machinesFull — Sidekiq, queue, poll interval, status lifecycle all pinned
AITEST-S09 Manual override & auditDetail 1.C row S09, §2.4 publish override_reason, §3 audit (PaperTrail)Full — override path + audit fields specified (Could-Have)
AITEST-S10 Confidence in Tree DiagramDetail 1.C row S10, §2.4 tree_diagram, §4.D chunk 7Full — avg over completed, "no score yet" fallback specified
CHG-001 Tree-diagram confidenceDetail 1.A, §2.4 /api/v3/paths/:id/tree_diagram, §2.F.2Full
NEG-1 standard agent denied§3 authz matrix (standard agent → 403), Detail 1.C S01 NEG-1Full
NEG-2 non-text excluded§3.A.1 branch catalog, §2.1 branch/skip flowFull
NEG-3 comparison read-only§2.A readonly: true propFull
NEG-4 Phase 2/3 sources disabled§3.A.1 branch catalog, Out of Scope #4Full
PRD §12 Observability events§3 Monitoring (5 Mixpanel events + alerts)Full — event names match PRD §12 1:1
PRD §13 Success Metrics§1 Success CriteriaFull — primary KPI, quality KPI carried
PRD §6 Soft delete + restore§2.3 lifecycle, D-8, §2.4 DELETEPartial — soft-delete covered; restore endpoint not specified (REV-3)
PRD Open Q #5 (token budget/TPM)§5 Open Q #5, §3 PerformanceCarried as open — not resolved (REV-1)

Reverse (RFC decisions → PRD driver): every new FE component and BE endpoint in the Detail 1.A reverse table traces to a PRD AC. The one RFC addition without a direct PRD line item — the DELETE endpoint — is justified by PRD §6 (soft delete) + the existing FE client; not scope creep. No unjustified RFC additions found.

Summary: 19 of 19 tracked PRD elements covered (14 Full, 5 Partial, 0 Missing). 0 RFC decisions without PRD justification. Forward + reverse traceability is explicit (Detail 1.A) — exemplary for PRT.


Scorecard

Full-Stack Scorecard (18 categories)

#CategorySourceScoreEvidence-Based Rationale
1PRT — PRD TraceabilityMerged9.0FE+BE: explicit bidirectional matrix (Detail 1.A forward + reverse), PRD-§ coverage table, PRD-to-Schema Derivation mapping every entity→table.column→endpoint→enforcement. 19/19 elements covered; no scope creep. Two strong non-conflicting evidence points → clears 9.0.
2TDC — Technical DecisionsMerged8.010 decisions D-1…D-10 each with chosen option + rejected alternative + reason + layer (Detail 1.B). Grounded in real files. Drag: D-7 (gate threshold configurability, REV-4) and D-10 (sampling cap, but throttle in D-5 left open) are decided in principle but leave a sub-contract open.
3CNT — Contract SpecificityFE8.0§2.A gives typed props for ConfidenceMeter/TestCaseComparison; §2.B data-fetching strategy (Pinia + $apiMain, poll ~3s, optimistic rollback); types reused from verified store/ai-agent/interface.ts. Drag: name blank-until-built (REV-2); detail-view contract leans on a non-existent prototype (REV-6).
4SCB — Scope BoundariesFE9.0§2.I lists FE create / modify / NOT-touched files by path, plus shared-code note ("extend, don't fork" the Pinia store). §2.0 reading order + anchors. An agent can produce a file-by-file plan directly.
5DEP — DependenciesFE8.5§1 Dependencies table with owning team + availability (exists/needs-building) + blocking flag; Design-system version pinned (@mekari/pixel3@^1.0.12, verified). Backend-endpoint availability checked per endpoint (§2.0 contracts table). Drag: NLP batch path "needs building" with open throttle.
6NFS — Non-Functional SpecificityFE8.0§3 Performance: LCP<2.5s, INP<200ms, CLS<0.1, bundle delta small; §3.E Accessibility WCAG AA with focus-trap, aria-pressed, role=progressbar, prefers-reduced-motion. Browser support "per chatbot-fe baseline" is slightly soft (no explicit version matrix).
7TPS — Test Plan SpecificityFE8.0§4.C maps each layer to a real repo command (vitest, playwright, rspec paths) + what it proves; §4.D acceptance criteria per chunk are assertable ("0 SendMessageWorker enqueues"). Drag: FE visual-regression for the no-prototype detail view not addressed.
8DMS — Data Model & SchemaBE8.5§2.3 gives verified current DDL (both tables, columns, types, indexes), per-status lifecycle tables, cardinality (≤70 q/case), PII classification, retention (soft-delete; hard-purge TBD). No new tables; one additive nullable name column. Drag: hard-purge window TBD (Open Q #8) and name conditional (REV-2).
9ACV — API Contract & VersioningBE8.0§2.4 outbound table: method/path/auth/request/response/status codes/idempotency/versioning/reuse-tag for all 7 endpoints; §3.B error catalog; example create payload. Clears the <5.0 cap comfortably. Drag: DELETE contract depth (REV-3), create idempotency "recommended" (REV-8), re-run path absent from table (REV-7).
10DIC — Data Integrity & ConsistencyBE8.0§2.D matrix per write path (transaction scope, partial failure, idempotency, consistency, duplicate, stale read); rate+recompute "in one trx"; worker per-question (not one big trx) with clear-before-regen. Drag: clear-before-regen transaction not pinned to a step (REV-7); create idempotency open (REV-8).
11FMC — Failure Mode CoverageMerged8.0FE: §2.C UI state matrix (5 states/surface) + §3.C error message catalog with i18n keys + optimistic rollback. BE: §3.A failure catalog, §3.B error catalog (code/HTTP/when/user-facing), two failure sequence diagrams, per-question fail→continue. Cross: §3.A confirms FE handles exactly what BE returns. Drag: retry/backoff "e.g. 3" not pinned (REV-5).
12CSS — Concurrency & ScalingBE6.5§2.E collision map (concurrent raters; worker-vs-rater with processing lock + 409/422; double-publish idempotent) is solid. But the load-facing half is the weakest in the RFC: the NLP throttle (the actual scaling control for the core path) is unspecified (REV-1); queue concurrency is given (5/10) but per-question pacing is not. This is the lowest score and the main thing holding the overall down.
13SAS — Security & AuthorizationBE8.5§3 Security: threat model (4 vectors incl. zero-leakage + PII-to-LLM), set_role per endpoint + Role×Endpoint matrix, ownership via current_user['chatbot_organization_id'], input validation (score∈{0,1}, name length-bound), SSRF-safe outbound, lockbox secrets, PaperTrail audit, brakeman/rubocop. Two strong evidence points.
14ROL — Rollout & RollbackMerged8.5FE: flag ai_agent_testing (named, default OFF, provisioner). BE: additive migration sequence, nullable-column rollback. Cross (mandatory checks): deploy order specified (BE first) with rationale; §4.A compatibility matrix covers all 6 scenarios incl. the one "no" (Frontend-first) with mitigation; coordinated rollback = flag-OFF; flag coordination (testing + gate as two independent flags). Clears the deploy-order cap.
15OBS — ObservabilityMerged8.0FE: 5 Mixpanel events with properties + ai_workspace_load_failed. BE: Rollbar with test_case_id/room_id, structured worker log extended with generated/failed_count, alert thresholds (batch>10%/1h, NLP>5%/15m→PagerDuty). Cross: propagate request/job id into worker logs for trace. Slightly short of distributed-trace spans → not 9.0.
16SBC — Service Boundary & CouplingBE8.5§2.F.1 Responsibility Boundary Matrix (9 steps, owning squad, trigger, effect, failure handler, PRD anchor); pipeline isolated to :ai_agent queue off the live path; sync outbound to Hub/NLP, no new sync coupling on the request path. Clear ownership boundaries.
17CPA — Pattern AlignmentMerged9.0FE: §2.0 "Patterns to Follow" (Pinia fetchStatus, optimistic rollback, AiAgentsTable list pattern) with reference files + "Deviation? none". BE: Grape+ResultMatcher, APIAbstractUseCase, AbstractRepository, Sidekiq worker pattern, FeatureFlag, ErrorException shape — each with a named reference file. Cross: snake_case JSON consumed directly (verified). The single intentional deviation (drawer version selector) is flagged.
18CDG — Compliance & Data GovernanceBE8.0Active (PII trigger: customer/agent text → 3rd-party LLM). §3.D classification table (legal basis DPA/UU PDP, retention, encryption, access audit, right-to-delete), PII-scrub logging rule (don't log question/answer bodies), transient inference, InfoSec sign-off required (Open Q #1). Drag: hard-purge window TBD (Open Q #8) keeps it off 9.0.

R2 re-score deltas (the table above is the R1 baseline)

The R1 category scores are preserved above as history. R2 fixes (REV-1…REV-4) move five categories:

#CategoryR1 → R2Why it moved
2TDC — Technical Decisions8.0 → 8.5D-5, D-7, D-8 closed → 10/10 decisions resolved; sub-contracts (throttle, threshold) now specified
3CNT — Contract Specificity (FE)8.0 → 8.5REV-2 fixed (name now persisted); residual drag only REV-6 (no detail-view Figma frame)
8DMS — Data Model & Schema8.5 → 9.0REV-2 fixed (name column unconditional + persisted); verified DDL + lifecycle + two strong evidence points → clears 9.0 (hard-purge TBD remains advisory, Open Q #8)
9ACV — API Contract & Versioning8.0 → 8.5REV-3 fixed (full DELETE contract); residual REV-7 (re-run path) + REV-8 (idempotency) keep it off 9.0
12CSS — Concurrency & Scaling6.5 → 8.0REV-1 fixed — the throttle (the scaling control for the core path) is now specified (token-bucket + RPM pref + 429). New REV-9 (per-process vs per-org enforcement) is the only residual, and is minor → caps the rise at 8.0

Unchanged: DIC 8.0, FMC 8.0, OBS 8.0 (REV-5/7/8 still open); all others as R1. Overall: 7.5 → 8.5.

Resource & Cost Advisory (non-blocking)

  • §4.F: bounded by :ai_agent queue concurrency (5/10), +1 nullable column, ≤70 question rows/case (negligible growth), per-question HTTPS egress to QontakNLP capped by sample size. No new infra. This is a healthy advisory note; the only cost dimension that matters — per-tier token budget — is correctly routed to Open Q #5 (Data/AI) rather than blocking readiness.

Decision Closure Assessment

Decision Index

#DecisionStatusCritical Gaps
D-1Frontend target = chatbot-fe (qontak-designer is reference only)Resolvednone
D-2Reuse existing ai_agent_test_cases/_questions tablesResolvednone
D-3Server-side set_role authoritative; FE menu flag/subscription-gatedResolvednone
D-4Async Sidekiq worker, queue :ai_agentResolvednone
D-5Reuse QontakNlp predict per question, throttled in-workerResolved (R2)throttle contract now specified (token-bucket, RPM pref, 429 behavior) — REV-1 fixed. Residual: per-process enforcement caveat (REV-9, minor)
D-6Recompute confidence_score server-side on each ratingResolvednone
D-7Activation gate advisory→enforced behind flagResolved (R2)threshold key pinned: ai_agent_testing_threshold (default 80, per-org) — REV-4 fixed
D-8Soft delete via acts_as_paranoid + add DELETE endpointResolved (R2)full DELETE contract specified; restore scoped out-of-scope explicitly — REV-3 fixed
D-9Per-status lifecycle pending→processing→completed/failedResolvednone
D-10Sampling 10% / cap 50–70 / all if <10Resolvedstep-3 ownership (Data vs BOT) is org-process, not a code gap

Aggregate (R2): 10 of 10 Resolved, 0 Partial, 0 Dangling (R1 was 7/10 + 3 Partial; D-5, D-7, D-8 closed this cycle).


Decision: D-5 — Reuse QontakNlp predict per question, throttled in-worker

Status: Partial

What was decided

"Reuse QontakNlp predict path per question, throttled, in worker" (Detail 1.B D-5); alternative "New batch endpoint on AI service" rejected "Reuse the proven lib/qontak_nlp/inference.rb#prediction; throttle in-worker to respect TPM/RPM."

Alternatives considered

Documented: new batch endpoint on the AI service, rejected to reuse the proven live-predict path. Sound reasoning grounded in a verified file.

Grounding in existing code

Strongly grounded: lib/qontak_nlp/inference.rb#prediction (verified, timeout: 60), core/repositories/qontak_nlp/predict.rb. Reading order item 5.

Interface specification

The call shape is specified (predict, not send_message; timeout 60s; returns answer/confidence/sources/response_time). What is NOT specified: the throttle itself — no RPM/TPM ceiling, no inter-call delay or token-bucket size, no per-tier number. §3 Performance and §2.F both say "throttle … to respect TPM/RPM" and defer the contract to Open Q #5.

Failure handling

Per-question failure is well-handled (failure sequence diagram: Rollbar + status=failed + status_description, batch continues). But the specific failure that the throttle exists to prevent — a 429 from the AI service — has no documented behavior (backoff? requeue the question? count toward failed?).

Challenge results

  • Scale: This is exactly where it breaks first. A 50–70 question batch firing predict calls with no concrete pacing will, at the upper cap across concurrent batches (queue concurrency 10), plausibly exceed the provider's RPM and get throttled or fail. The throttle is the scaling control and it is unspecified.
  • Reversibility: Easy to change pacing later (worker-internal), so this is low-blast-radius to defer — but the first implementation will guess.
  • Consistency: Consistent with D-4 (async) and the FMC failure handling.
  • Agent implementability: An agent cannot implement the throttle without inventing numbers and a mechanism. It would likely add a naive sleep or no throttle at all.

Gaps and suggestions

Missing: concrete RPM/TPM ceiling (per tier if tier-dependent), pacing mechanism, on-429 behavior. Suggested resolution: pin a conservative default for beta (e.g. token-bucket at N RPM derived from the AI squad's documented per-org limit), with on-429 = exponential backoff then requeue the single question (not the batch), and fail the question after the retry budget. Make the ceiling a SystemPreference so it is tunable per tier without deploy (matches §4.B style). Open questions for author: What is the QontakNLP per-org RPM/TPM ceiling? Is it tier-dependent? On 429, backoff-and-retry the question or mark failed?


Decision: D-7 — Activation gate advisory→enforced behind flag

Status: Partial

What was decided

Add a confidence-threshold check in Repositories::Publish, advisory for beta and enforced before GA, behind ai_agent_testing_gate (default false/advisory). Alternative "hard gate from day one" rejected (PRD Open Q #2). Grounded in verified repositories/publish.rb (no gate today — confirmed: only update!(active_version_id:)).

Interface specification

The publish endpoint contract is specified (POST, 422 below threshold when gate on, override_reason?). The gap: the threshold value's source is unresolved — Open Q #4 asks fixed-80 vs org-configurable; §4.B lists a "threshold pref" but with no key name/type/default/provisioner, unlike every other row in that table.

Failure handling

422 with "Confidence below threshold" is catalogued (§3.B). Override path (S09) handled. Fine.

Challenge results

  • Scale/Reversibility: trivial; gate is a single comparison. Toggling advisory↔enforced is a flag.
  • Consistency: consistent with D-6 (server-authoritative aggregate) and S09 override.
  • Agent implementability: an agent can build the comparison but will guess where 80 lives (constant vs config key). Chunk 8 acceptance criterion says "422 when <80" — implies a constant, contradicting the §4.B "threshold pref" hint.

Gaps and suggestions

Missing: the threshold config key spec (name/type/default/provisioner) OR an explicit "constant 80 this phase." Suggested resolution: for this phase, hard-code 80 as a named constant and add a SystemPreference key ai_agent_testing_threshold (int, default 80) only if Open Q #4 resolves to configurable; document which in §4.B. Open questions for author: Is 80 fixed for the phase or org-configurable on day one?


Decision: D-8 — Soft delete + add DELETE endpoint

Status: Partial

What was decided

Soft delete via acts_as_paranoid; add the missing DELETE /api/v1/ai_agents/:id/test_cases/:test_case_id route (verified: FE deleteTestCase client exists at ai-agents.ts:313; no BE test-case delete route found — confirmed by grep). Alternative hard-delete rejected per PRD §6.

Interface specification

Partially specified: method/path/auth/idempotency/status codes given in §2.4. Gaps: the success response body is {status:success} but doesn't echo the deleted resource; restore is mentioned (D-8 "soft delete + restore"; §2.3 lifecycle "restore via paranoid") but no restore endpoint appears in §2.4; cross-tenant handling is stated as 404 in one place (detail) and 403 in the role matrix — pick one for the DELETE.

Challenge results

  • Reversibility: soft-delete is the reversible choice; good.
  • Agent implementability: an agent can build the soft-delete DELETE from chunk 6, but will guess on the response body and whether to also build restore.

Gaps and suggestions

Missing: restore endpoint scope; exact DELETE response body; cross-tenant 404-vs-403 consistency. Suggested resolution: scope restore OUT for this phase (state it explicitly — paranoid restore is a console/operational action, not a UI endpoint), pin DELETE success body, and standardize cross-tenant → 404 (don't reveal existence). Open questions for author: Is restore a user-facing endpoint this phase or operational-only?


Decisions D-1, D-2, D-3, D-4, D-6, D-9, D-10 — Resolved (abbreviated)

Each names the chosen option, rejects a concrete alternative with a reason tied to verified codebase reality, references specific files, and (where it produces an interface) specifies it. D-1 (chatbot-fe vs qontak-designer — the latter has no API/auth, verified). D-2 (reuse tables — verified migrations). D-3 (set_role authoritative — verified per-route). D-4 (Sidekiq :ai_agent — verified queue config). D-6 (server-side recompute, single source of truth). D-9 (status lifecycle — two state-machine diagrams). D-10 (sampling cap with branch/skip flow diagram). No dangling alternatives; an agent can implement each without asking.


UI State Audit

ComponentLoadingEmptyErrorPartialSuccessAssessment
Testing listdefineddefineddefineddefineddefined5/5 (§2.C)
Generate drawerdefinedn/adefinedn/adefined5/5 (n/a justified)
Generating modaldefinedn/adefineddefineddefined5/5
Detail/comparisondefineddefineddefineddefineddefined5/5
Confidence meterdefineddefinedn/a (derived)defineddefined5/5

Summary: 5 of 5 data-driven surfaces have all applicable states defined (§2.C UI State Matrix). This is the strongest UI-state coverage the reviewer expects to see; the only soft spot is that the comparison view's "success" rendering leans on a non-existent prototype (REV-6), not a missing state.


Performance Budget Check

MetricTargetCurrent BaselineSourceAssessment
LCP (list)< 2.5snot stated§3 Performanceadequate target, no baseline
INP (rating)< 200msnot stated§3 Performanceadequate
CLS< 0.1not stated§3 Performanceadequate
Bundle delta"small (reuses Pixel3 + existing store)"not stated§3vague — no numeric budget, but low-risk given reuse
Batch latency (BE)~2–5 min for ~50 itemsn/a (new)§3, PRD §6adequate

Frontend Core Web Vitals are quantified (good for a new-feature RFC). Bundle delta is qualitative; acceptable here because the page reuses the design system and existing store, so the blast radius is small — not a blocker.


Accessibility Review

AspectSpecified?DetailsAssessment
Keyboard navigation flowyesdrawer/modal focus-trap (Pixel3); thumbs via Tabadequate (§3.E)
Focus management (modal/route)yesfocus returns to trigger on modal closeadequate
ARIA labelsyesthumbs aria-pressed; meter role=progressbar+aria-valuenow/min/maxadequate
Heading hierarchynonot addressedminor gap
Color contrastyes"verified against Pixel3 tokens"adequate (relies on DS)
Motion sensitivityyesprefers-reduced-motion for generating animationadequate
Screen reader behaviorpartialmeter labeled; comparison-panel SR reading order not describedminor gap

WCAG AA committed with concrete focus/ARIA specifics — strong for a draft. Minor: heading hierarchy and comparison-panel screen-reader order unaddressed.


Pattern Alignment Check

PatternRFC ApproachAssessment
FE state management (Pinia fetchStatus)follows (store/ai-agent/actions.ts)adequate — reference file named
FE optimistic update/rollbackfollows (UPDATE_TEST_CASE_QUESTION)adequate
FE list/empty/loadingfollows (AiAgentsTable.vue)adequate
BE Grape handlerfollows (test_cases_controller.rb + ResultMatcher)adequate
BE workerfollows (fetch_room_conversations_worker.rb pattern; per-item rescue→Rollbar)adequate
BE feature flagfollows (FeatureFlag.enabled?)adequate
Cross-layer namingsnake_case JSON consumed directly by FE (verified)adequate — no ad-hoc transform
Analytics eventspreserved/added (PRD §12 names matched)adequate

No silent new patterns introduced. The one intentional deviation (Generate drawer version selector absent from the prototype) is explicitly flagged (§5 Q-A). Exemplary CPA.


Data Integrity Deep-Dive

Write PathTransaction ScopePartial Failure BehaviorIdempotency KeyConsistency GuaranteeDuplicate Handling
Create test casesingle INSERT422; no worker enqueued(agent,version,name) "recommended" (NOT pinned — REV-8)strong (single row)second click → second row (FE button-disable only)
Worker per-question INSERT+UPDATEper-question (not one big trx)failed q → status=failed, batch continuesre-run clears prior questions before regen (trx not pinned — REV-7)eventual (batch)re-run clear-before-regen
Rate + recomputequestion UPDATE + aggregate UPDATE in one trxrollback both → FE restores priorlast-write-wins per questionstrong within requestrepeated same score idempotent
Publish (gate)single UPDATE active_version_id422 if <80 & gate onidempotentstrongn/a

Strong overall (§2.D is a real matrix). Two soft spots flagged: create idempotency is "recommended" not decided (REV-8), and the clear-before-regen transaction is described as a rule but not tied to a transactional step (REV-7) — an agent could orphan questions if it deletes-then-fails mid-regen.


Concurrency Collision Map

#Shared ResourceWritersCollision ScenarioResolution MechanismLock Failure BehaviorAssessment
1ai_agent_test_cases.confidence_scoreconcurrent raterstwo thumbs near-simultaneouslyrecompute reads current scores in-trx (no stored delta)last recompute wins; deterministic from rowsadequate
2ai_agent_test_case_questions of a caseworker (regen) vs raterrate during re-runcase→processing disables rating; reject 409/422 if not completedFE "regenerating"adequate
3ai_agents.active_version_idpublish vs publishdouble activateidempotent row updatelast write winsadequate
AI-service rate budgetconcurrent worker batchesN batches × ≤70 questions exceed provider RPM/TPMNOT SPECIFIEDNOT SPECIFIEDgap (REV-1) — the one shared external resource with no resolution

§2.E covers the DB collision points well. The missing collision is the shared external AI-service rate budget across concurrent batches (queue concurrency up to 10) — this is the CSS-dragging gap (REV-1).


API Contract Completeness Check

EndpointRequest SchemaResponse SchemaError TaxonomyAuth SpecIdempotencyExample PayloadsAssessment
GET /test_casescompletecompletecompletespecific (set_role)n/a (read)partial6/6
POST /test_casescompletecompletecomplete (404/422/403)specific"recommended" (REV-8)yes5.5/6
GET /test_cases/:idcompletecompletecompletespecificn/apartial6/6
PATCH /questions/:idcompletecompletecompletespecificlast-write-winspartial6/6
DELETE /test_cases/:idcompletepartial (no resource echo; restore unscoped)partial (404 vs 403)specificidempotentno4/6 (REV-3)
POST /publishcomplete (override_reason?)completecomplete (422)specificidempotentno5.5/6
GET /tree_diagramcompletecomplete (avg_confidence_score)partialspecificn/ano5.5/6

Strong overall. The DELETE row is the weakest (REV-3); POST idempotency is soft (REV-8). Example payloads are given for create only — acceptable since the other shapes are reused from verified types.


Async Job / Event Consumer Spec

Job/ConsumerTriggerInput ShapeRetry PolicyDLQConcurrency LimitIdempotency KeyTimeoutAssessment
FetchRoomConversationsWorker (extend)perform_async from CreateTestCases{test_case_id, organization_id}"bounded retry (e.g. 3, exp backoff)" — NOT pinned (REV-5)"none (Sidekiq dead set; alert on failure)" — no depth thresholdqueue :ai_agent (5 staging / 10 prod)re-run clears prior questions before regenper-question NLP 60s5.5/7

The job spec is present and mostly complete (input, concurrency, idempotency, per-msg timeout, poison handling all stated). The two soft cells — retry intervals ("e.g. 3") and dead-set alert depth — are the FMC/CSS calibration gaps (REV-5). The throttle (REV-1) is the bigger of the two and belongs here too: per-question NLP pacing is not in the spec.


Compliance Trigger Check

TriggerFound?Data LocationClassificationAssessment
PII (name, email, content)yesquestion, answer, parameters.human_answer (customer/agent text); scored_by_email/_namePII (customer content + internal user)handled (§3.D)
Payment datanon/a
Health datanon/a
User content with retentionyesboth tables, soft-deleteretention: soft-delete; hard-purge TBD (Open Q #8)partially handled
Auth/session datano (uses existing JWT)n/a
Cross-border data transferpartialPII → 3rd-party LLMDPA-covered, transientflagged, InfoSec sign-off required (Open Q #1)

CDG Status: Active — scored 8.0. Classification, legal basis, encryption, access audit, right-to-delete, and PII-scrub logging are all specified; the hard-purge window (Open Q #8) and InfoSec sign-off (Open Q #1) are the open items, correctly surfaced.


Cross-Layer Contract Verification

EndpointBackend Response SchemaFrontend Expected SchemaMatch?Gaps
GET /test_cases{status,code,message,data:[TestCase],meta}TestCasesResponse {data: TestCase[]}Yesmeta optional
POST /test_cases{data: TestCase} snake_caseCreateTestCaseResponse {data:TestCase}PartialBE does not persist name today → blank TestCase.name until chunk 1 (REV-2)
GET /test_cases/:id{data:{...,questions[]}} incl. confidence_score, parameters.human_answerTestCaseDetailResponse {data:TestCaseDetail}Yesfields align 1:1 with verified types
PATCH /questions/:id{data:question} + recomputed aggregateUpdateTestCaseQuestionPayload {score:0|1}Partialaggregate recompute not built → meter static until chunk 5 (in-scope work)
DELETE /test_cases/:idroute missingdeleteTestCase client exists (ai-agents.ts:313)NoBE route must be built (chunk 6) (REV-3)

Checks performed:

  • Field casing — snake_case consumed directly, verified, no ad-hoc transform
  • Nullability — confidence_score nullable; FE handles "no score"
  • Error shape — BE ErrorException ↔ FE error catalog (§3.A code-shape consistency column = "yes")
  • Pagination — page/limit query params; meta returned
  • Auth token — existing JWT/set_role, unchanged

Mismatches found: 0 true mismatches. The three Partial/No rows are disclosed in-scope work, not silent contract divergence — so the "cross-layer mismatch → cap 6.5" gate does not fire. This is the single most important reason this RFC scores Strong rather than Needs Work: it found its own integration gaps and put them on the build plan (chunks 1, 5, 6).


Cross-Layer Rollout Compatibility Matrix

ScenarioFrontendBackendWorks?Notes
Pre-deploy (baseline)OldOldYesno Testing page
Backend firstOldNewYesnew endpoints unused by old FE; flag OFF
Frontend firstNewOldNoavoided by deploy order; if FE leads, page gated behind flag tied to BE readiness
Both deployed (target)NewNewYestarget
Backend rollbackNewOldPartialFE page errors gracefully (error slate); disable flag
Frontend rollbackOldNewYesBE endpoints unused

Deploy order: Backend first (specified, §4 Rollout Strategy, with rationale). Incompatible scenarios: 1 (Frontend-first) — explicitly avoided by deploy order and flag-gating. ROL is not capped: deploy order is specified, so the < 5.0 and deploy-order caps do not fire.


End-to-End Data Flow

Flow: Generate → rate → activate (the primary user action)

SPV clicks Generate
→ FE: GenerateFromInboxDrawer @confirm {name, version} → store CREATE_TEST_CASE
→ API: POST /api/v1/ai_agents/:id/test_cases {type, version_id, name}
→ Backend: CreateTestCases UC → CreateTestCase repo
→ DB: INSERT ai_agent_test_cases (status='processing')
→ Side effect: Sidekiq enqueue FetchRoomConversationsWorker; PaperTrail version
→ Response: 201 {data: TestCase status='processing'}
→ FE: generating modal polls GET /test_cases every ~3s
→ Worker (async): Hub rooms → sample 10%/cap → ExtractConversationPairs → per question: INSERT q (processing) → NLP predict (NOT send_message) → UPDATE q (answer/confidence/sources, completed)
→ DB: UPDATE test_case status='completed'
→ FE: SPV opens detail GET /test_cases/:id → TestCaseComparison renders human-left/AI-right
→ @rate: PATCH /questions/:id {score} → recompute confidence_score (one trx)
→ FE: meter updates → at ≥80% Activate enabled → POST /publish (gate)

Gaps in flow: The full path is traceable from §2.H + the §2.2 sequence diagram — a genuine strength. The only place one layer's output isn't pinned to the next layer's input is the worker→NLP step pacing (REV-1) and the name round-trip (REV-2, the create response echoes a name the BE doesn't persist yet). No structural gaps; the flow is implementable step-by-step.


Agentic Readiness Deep-Dive

Vague Word Audit

#Word/PhraseLocationImpactConcrete Replacement
1"throttle … to respect TPM/RPM"§3 Performance, §2.Fagent invents pacingconcrete RPM ceiling + token-bucket + on-429 (REV-1)
2"bounded retry (e.g. 3, exp backoff)"§2.Fagent picks intervals"3 retries: 1s, 4s, 16s" (REV-5)
3"recommend … unique-ish (agent,version,name) guard"§2.4, §2.Dagent guesses constraint vs checkdecide index or app-level dedup (REV-8)
4"only if a name column is required"§2.3makes chunk-1 migration optionaldecide yes/no (REV-2)
5"threshold pref"§4.Bno key name/defaultpin config key or "constant 80" (REV-4)
6"bundle delta small"§3no numeric budgetlow-risk; optional numeric budget

Total vague words in spec sections: 6 (low for an 1,200-line RFC; 4 map to the major findings).

Dangling Alternatives

#AlternativesLocationImpact
noneNo "we could do X or Y" left unresolved. Every D-1…D-10 picks one option.

Total dangling alternatives: 0.

Task Decomposition Assessment

ChunkAcceptance CriteriaAssessment
1–8 (BE)each has files + repo command + assertable ACverifiable (§4.D)
9–13 (FE)each has files + pnpm test/e2e + ACverifiable

§4.D is a 13-chunk ordered plan with files, commands, and "done when" criteria per chunk — among the best decomposition the rubric anticipates. Chunks 1, 4, 6, 8 carry the open findings (REV-2, REV-1, REV-3, REV-4) but are still actionable with the suggested resolutions.


Strengths

  • Verified grounding (§2.0 Source Verification). Every load-bearing claim carries a file + line + evidence; the reviewer spot-checked four (worker is fetch+extract+log only at 43 lines; publish.rb only sets active_version_id; rate_test_case_question.rb writes no aggregate; FE deleteTestCase exists at ai-agents.ts:313 with no BE route) — all confirmed. This earns High confidence and is why an agent can trust the anchors.
  • Cross-layer honesty (§2.G). The RFC names its own three integration gaps (name persistence, aggregate recompute, missing DELETE) as in-scope work rather than letting them surface mid-build — turning what is usually a blocker-capping mismatch into a clean PROCEED.
  • Execution plan + traceability (§4.D, Detail 1.A/1.C). 13 ordered chunks with assertable AC + bidirectional PRD traceability (19/19 covered) + per-story change map mean an agent always knows what it's building and why.

Biggest Gaps

  • NLP throttle contract (REV-1, §3 Performance / §2.F / Open Q #5). The one external contract that governs whether the core shadow-gen chunk works at batch scale is unspecified — no RPM/TPM ceiling, pacing mechanism, or 429 behavior. This is the single biggest agentic-execution risk: an agent would ship chunk 4 with naive or no pacing and the batch would get rate-limited in production. Drags CSS to 6.5.
  • name column left conditional (REV-2, §2.3 / §2.G). The create-path cross-layer match is partial precisely because BE doesn't persist name today, yet §2.3 frames the migration as "only if required." Chunk 1's first migration is effectively optional — an agent could skip it and ship a blank TestCase.name column in the list UI.
  • DELETE-endpoint depth + gate-threshold source (REV-3, REV-4). The newest endpoint (DELETE) has the thinnest contract (response body, restore scope, 404-vs-403), and the publish gate's threshold source (constant vs config key) is unresolved between §4.B and chunk 8's AC — two spots where an agent guesses.

Priority Actions

Ordered by impact on implementation readiness. Close before agent execution of the named chunks.

  1. §2.F / §3 Performance / Open Q #5 (REV-1) — Pin the per-question NLP throttle: concrete RPM/TPM ceiling (per tier if applicable), token-bucket or fixed inter-call delay, and explicit on-429 behavior (backoff-and-requeue the single question, fail after retry budget). Make the ceiling a SystemPreference. Unblocks chunk 4 (the engineering core).
  2. §2.3 / §4.D chunk 1 (REV-2) — Decide the name column definitively: add column + persist (chunk 1 already lists the migration) and remove the "only if required" hedge, OR drop name from the FE type and update §2.G. Unblocks the create round-trip.
  3. §2.4 DELETE row + D-8 (REV-3) — Specify the DELETE success body, scope restore explicitly OUT (operational-only) for this phase, and standardize cross-tenant → 404. Unblocks chunk 6.
  4. §4.B / §2.4 publish / Open Q #4 (REV-4) — Resolve the gate threshold source: pin a ai_agent_testing_threshold config key (int, default 80, provisioner) like the other §4.B rows, or state "constant 80, configurability deferred." Align chunk 8's AC. Then tighten REV-5/REV-7/REV-8 (retry intervals, clear-before-regen transaction, create idempotency) — minor, but they remove the last guesses.

Backend Contract Addendum

Endpoint Contract Details

EndpointMethod/PathAuthZRequest ContractResponse ContractError ContractIdempotency/VersioningStatus
CreatePOST /api/v1/ai_agents/:id/test_casesset_role + org scope{type, version_id, name}{data: TestCase status='processing'}404/422/403idempotency "recommended" — Missing decision (REV-8)Missing [idempotency, name persist]
DeleteDELETE /api/v1/ai_agents/:id/test_cases/:test_case_idset_role + org scopepath{status:success} (no resource echo)404/403 (inconsistent)idempotentMissing [response body, restore scope, 404-vs-403] (REV-3)
PublishPOST /api/v1/ai_agents/:id/publishset_role (+admin for override){override_reason?}{data: ai_agent}422 below thresholdidempotentMissing [threshold source] (REV-4)

Database Changes Details

ChangeTable/EntityDDL / Shape DiffData Migration PlanRollback PlanCompatibility WindowStatus
Add name (conditional)ai_agent_test_casesADD COLUMN name varchar + indexnone (nullable, no backfill)nullable + inert; no down-migrationadditive, indefiniteMissing [the "conditional" must become definite] (REV-2)

Async Worker

The worker contract (§2.F) is mostly complete; missing the throttle (REV-1) and concrete retry intervals (REV-5).


Implementation Readiness Checklist

Unblocked (agent can proceed)

All types:

  • PRD → RFC traceability matrix complete (19/19)
  • Technical decisions resolved (7/10 resolved, 3 partial — none dangling)
  • Failure modes handled with error catalogs (§3.A/3.B/3.C)
  • Configuration contract (§4.B; one gap REV-4)
  • Pattern alignment verified (§2.0)
  • Rollout plan with flag + rollback (§4)
  • Observability metrics + alerts (§3)
  • Task decomposition with AC per chunk (§4.D)
  • Zero vague words — 6 present, 4 map to majors

Frontend:

  • Interfaces/prop types specified (§2.A)
  • All UI states defined (§2.C, 5/5)
  • Performance budget (CWV quantified; bundle qualitative)
  • Accessibility (§3.E; minor gaps)
  • Browser support matrix — "per chatbot-fe baseline" (soft)

Full-stack:

  • Cross-layer contract verified (§2.G; 0 true mismatches)
  • Deploy order specified (BE first)
  • Rollout compatibility matrix (§4.A; 1 "No" mitigated)
  • End-to-end data flow (§2.H + §2.2)
  • Feature flag coordination (two independent flags)

Backend:

  • Schema at DDL precision (§2.3)
  • API contracts — DELETE thin (REV-3), create idempotency open (REV-8)
  • Transaction boundaries + idempotency per write path (§2.D; 2 soft spots)
  • Concurrency collision points (§2.E; external rate budget missing — REV-1)
  • Security (§3)
  • Migration plan (§4)
  • Service boundary (§2.F.1)
  • Compliance (§3.D; hard-purge TBD)

Blocked (must fix first)

  • None are hard blockers. The four majors (REV-1..4) are PROCEED-with-notes items: close them before executing chunks 4, 1, 6, 8 respectively.

Verdict: Fix 4 notes inline (REV-1..4) — no hard blockers; proceed on the unblocked chunks in parallel


Task Manifest

The RFC already specifies a 13-chunk plan (§4.D). Verified below; the reviewer adds the open-finding each chunk must close.

OrderChunkFiles to Create/ModifyAcceptance CriteriaDependencies / Note
1Add name + persist; create status processingmigration, create_test_case.rb, create_test_cases.rb, test_cases_controller.rbmigration up/down; create persists name, returns processingClose REV-2 first
2Status lifecycle helpernew update_test_case_status.rbtransitions coveredNone
3Worker sampling stepworker, new sample_rooms.rb200→~20; <10→all; 5000→capNone
4Worker shadow-gen + persistenceworker, new generate_shadow_answer.rb, question repo0 SendMessageWorker enqueues; fields persistedClose REV-1 first
5Confidence aggregate recomputerate_test_case_question.rb, new recompute_confidence_score.rb(up/total)*100 after rateChunk 1
6DELETE route (soft delete)test_cases_controller.rb, new delete_test_case.rb + reposoft-deletes; 404 cross-tenantClose REV-3
7Tree-diagram avg confidenceget_tree_diagram_v3.rbavg over completed; none→"no score yet"Chunk 5
8Publish gate (flagged)publish.rb, publish_ai_agent.rb422 <80 & gate onClose REV-4
9–13FE list/generate/detail/gate/treepages/bot-automation/testing/*, modules/bot-automation/components/testing/*, layouts/bot-automation.vue, AiAgentEditor.vueper §4.Dchunk 11 needs Figma frame (REV-6)

Dangling Decisions Log

None fully dangling. The three Partial decisions are listed for closure tracking.

#DecisionLocationOwnerDeadline
1D-5 NLP throttle contract (REV-1)§3 / §2.F / Open Q #5Data/AI (Reza)2026-07-15 (PRD Open Q #5)
2D-7 gate threshold source (REV-4)§4.B / Open Q #4Dimas (PM)2026-07-15
3D-8 DELETE depth + restore scope (REV-3)§2.4 DELETEEng (BOT)unset

Open Questions

Surfaced during review. Promoted to the RFC's own Open-Questions surface by REV-id.

#QuestionCategorySeverity
REV-1What is the QontakNLP per-org RPM/TPM ceiling, the pacing mechanism, and on-429 behavior for shadow-gen?CSS / FMCImportant
REV-2Is the name column added and persisted this phase (making chunk-1 migration mandatory), or is name dropped from the FE type?CNT / ACV / DMSImportant
REV-3DELETE: exact success body, is restore a user-facing endpoint this phase, and cross-tenant 404 vs 403?ACVImportant
REV-4Is the 80% threshold a constant this phase or an org-configurable SystemPreference (key/default/provisioner)?ROL / TDCImportant
REV-5Concrete worker retry count + backoff intervals and Sidekiq dead-set alert depth?FMCNice-to-have
REV-6Figma frame for the comparison/detail view (no prototype exists) — Design QA before chunk 11.NFS / CNTImportant
REV-7Re-run trigger (which endpoint/param re-enqueues) and the clear-before-regen transaction boundary?DIC / ACVNice-to-have
REV-8Create-endpoint idempotency: unique index, app-level dedup, or accept duplicates?DIC / ACVNice-to-have

Evidence Notes

  • §2.0 Source Verification — every anchor row carries verifiable evidence; reviewer confirmed 4 load-bearing claims against the sibling repos (worker 43 lines no LLM/persist; publish.rb:13 only active_version_id; rate_test_case_question.rb:14-19 no aggregate; FE deleteTestCase at ai-agents.ts:313, no BE route). Drove High confidence and the PRT/CPA/SBC scores.
  • §2.G Cross-Layer Contract Verification — the disclosure of three integration gaps as in-scope work (not silent mismatch) is the reason the cross-layer mismatch cap did NOT fire; pivotal to the Strong rating.
  • §3 Performance + §2.F — the absence of a concrete NLP throttle contract is the single most consequential gap; held CSS to 6.5 and is the lead open finding (REV-1).
  • §4.D Agent Execution Plan — 13 verifiable chunks; the basis for the Task Manifest and the PROCEED-with-notes verdict.
  • §2.3 DDL — current schema verified from migrations; the name column's conditional framing is the second-biggest gap (REV-2).

R3 Re-Grounding Log (every cited anchor re-verified against current code)

Cycle R3 opened the actual files in chatbot fa6dd8b79 and chatbot-fe f5b80d5a (chatbot advanced from the R2 stamp 88b0685). Every anchor below was re-confirmed; none drifted or broke.

Anchor (RFC claim)Re-grounded result
app/workers/fetch_room_conversations_worker.rb — 43 lines, queue: :ai_agent, retry: false, fetch+extract+log only✓ exact — 43 lines, sidekiq_options queue: :ai_agent, retry: false (L5); no LLM/persist/status
repositories/publish.rb — only update!(active_version_id:), no gate✓ — def call L12, @ai_agent.update!(active_version_id: @ai_agent.version_id) L13; no threshold/confidence (R2 cited "L12-18"; substance unchanged)
repositories/create_test_case.rbstatus='pending', no name persist✓ — record.status = 'pending' L19; no name assignment (REV-2 underlying gap still live)
repositories/rate_test_case_question.rb — score/scored_by* only, no aggregate✓ — L13-19 set score/is_score/scored_by*/scored_at; no confidence_score write
test_cases_controller.rb — GET L32 / POST L75 / PATCH L115, all set_role(%w[owner supervisor admin])✓ exact lines + roles
No DELETE test-case route; agent delete at ai_agents_controller.rb✓ — only delete '/:id' (agent) at ai_agents_controller.rb:237; no test-case delete
Mount api.rb L47-48 TestCasesController => '/v1/ai_agents'✓ — api.rb:48
Migrations 20260512000001/2; name column NOT present✓ — both migrations exist; ai_agent_test_cases schema has no name column (RFC chunk-1 migration not yet built — correct for pre-impl RFC)
ai_agent_test_case_questions columns (status_description, parameters/sources jsonb, scored_*, confidence, response_time, deleted_at, etc.)✓ — all present in db/schema.rb, indexes match
extract_conversation_pairs.rb — skip SYSTEM, question?, pair next agent reply (L40-66)✓ — SYSTEM_PARTICIPANT_TYPE L8/L45, question? L50/L70, pairing L54-61
lib/qontak_nlp/inference.rb#predictiontimeout: 60, @http.call(method:'POST')✓ — prediction L10 (default timeout: 60), @http.call(method:'POST', ... open_timeout:/read_timeout:) L22
get_tree_diagram_v3.rb#add_ai_agent L850-909, no confidence✓ — def add_ai_agent L850 (called L250); node built without avg confidence
tree route v3/path.rb L17 get ':id/tree_diagram'; feature_flag.rb enabled?(group_code, code, default:)✓ — path.rb:17; FeatureFlag.self.enabled? L34
FE ai-agents.tscreateTestCase L250, getTestCases L275, getTestCaseDetail L292, deleteTestCase L313, updateTestCaseQuestion L334✓ — all present; deleteTestCase at L313 exactly
FE endpoint.ts test_cases paths L207-214✓ — create/list/detail/delete/update_question at L208-213
FE interface.tsTestCase (carries name), TestCaseQuestion (score:0|1|null), UpdateTestCaseQuestionPayload (score:0|1)✓ — TestCase name:string L272; payload score:0|1 L359-364
FE actions.ts — CREATE L745 / FETCH L803 / FETCH_DETAIL L849 / DELETE L902 / UPDATE_QUESTION L950 (optimistic rollback)✓ — all exact; optimistic snapshot+rollback confirmed in UPDATE_TEST_CASE_QUESTION (L960-998)
FE AiAgentEditor.vue footer "Save changes" only L1712-1733; layouts/bot-automation.vue flag-gated L209-349; no pages/bot-automation/testing/; @mekari/pixel3@^1.0.12; legacy ValidationDetailPanel.vue✓ — all confirmed

Net: 34+ citations re-verified, 0 drift, 0 breakage. The RFC's §2.0 Source-Verification table is accurate against current HEADs, which is why R3 keeps High confidence and holds the 8.5/PROCEED verdict without minting a new finding.


Review History

CycleDateReviewed RFC revision (last_updated / commit)ScoreVerdictFindings open → fixedNotes
R12026-06-202026-06-20 / working tree (uncommitted)7.5PROCEED with notes8 open (4 major / 4 minor), 0 fixedFirst cycle. Strong, well-grounded full-stack RFC; no hard blockers. Lead gap: NLP throttle contract (REV-1).
R22026-06-202026-06-20 / commit 88b06858.5PROCEED4 fixed (REV-1…4, all majors), 1 new minor (REV-9), 5 carry-open (REV-5/6/7/8/9)Delta re-review. All R1 majors closed; decisions now 10/10 resolved; §2.G fully yes. CSS 6.5→8.0 (throttle specified). No open majors → PROCEED.
R32026-06-202026-06-20 / working tree — chatbot fa6dd8b79 / chatbot-fe f5b80d5a8.5PROCEED0 fixed, 0 new, 5 carry-open (REV-5/6/7/8/9)Re-grounding/confirm cycle (RFC unchanged since R2). Re-verified all §2.0 Source-Verification anchors + prior spot-checks against current HEADs (chatbot advanced 88b0685→fa6dd8b79). Zero citation drift, zero breakage — every file:line, route, migration, schema column, FE symbol still resolves as claimed. Ledger + score held.