Qontak | Chatbot & AI | Unified Agent Quality Scorecard — Phase 2: AI Auto-Scoring & In-Room Scorecard

Template: PHASE PRD v1.2 · Companion to PRD Section Reference v1.5 + Hierarchy v1.0
Note: Phase 2 of the Unified Agent Quality Scorecard initiative. This phase builds the scoring engine + in-room panel, consumes Phase 1 config, and produces the scores that Phase 3 (report) and Phase 5 (gate) read.

HEADER BLOCK

Field	Value
PM	Dimas Fauzi Hidayat
PRD Version	1.2
Status	DRAFT
PRD Type	PHASE
Epic	QC-XXXXX — add once Epic is created
Squad	BOT — Bot, AI & Automation
RFC Link	Pending — RFC to follow via `rfc-starter`
Figma Master	Pending — in-room panel AI mode not yet designed (Stitch prompt in Appendix B)
Anchor	Qontak | Chatbot & AI | Unified Agent Quality Scorecard — ANCHOR
Labels	`epic:qontak-chatbot-ai` | `module:chatbot-ai` | `feature:unified-agent-scorecard`
Last Updated	2026-06-19

Table of Contents

HEADER BLOCK
2. CONDITIONAL BLOCK: PHASE CONTEXT
3. One-liner + Problem
4. What Happens If We Don't Ship This Phase
5. Target Users + Persona Context
6. Non-Goals
7. Constraints
- 7.1 Data Lifecycle
8. Feature Changes
9. API & Webhook Behavior
10. System Flow + User Stories + ACs
- 10.1 System Flow
- 10.2 User Stories
11. Rollout
- 11.1 Semantic Regression Rollback
12. Observability
- 12.1 Post-Launch Monitoring Cadence
13. Success Metrics
14. Launch Plan & Stage Gates
15. Dependencies
16. Key Decisions + Alternatives Rejected
17. Open Questions
Appendix A — AI Scoring Rubric
Appendix B — Stitch UI Prompt
PRD CHANGELOG

2. CONDITIONAL BLOCK: PHASE CONTEXT

Field	Detail
Anchor PRD	Qontak | Chatbot & AI | Unified Agent Quality Scorecard — ANCHOR
Phase	Phase 2 of 8
Phase Goal	Score every AI conversation on the two-tier rubric (per actor/segment, with veto metrics) and surface it in the in-room Scorecard panel.
Prior phases	Phase 1 — Scorecard Settings & Rubric Config (shipped): the extended `is_auto_score` (now also enabling AI-agent scoring), the AI pass threshold, the custom-param "AI judging rubric" editor (`prompt` widened `string`→`text`), and the read-only default-rubric viewer. This phase consumes all of it. Note: the human auto-scorer (`auto_agent_scoring.rb`) already exists — this phase adds the AI-agent 9-metric path alongside it.
Deferred to later phases	Analytics report + export → Phase 3; validation/testing harness (pre-launch scores) → Phase 4; go-live gate → Phase 5.
Cross-phase dependencies	(1) Phase 1 config must be live — scoring reads the (extended) `is_auto_score`, the AI pass threshold, and the custom-param rubrics. (2) DSAI evaluator output contract (Strategy doc Appendix A.3) — the per-conversation 9-metric output this phase ingests.

3. One-liner + Problem

One-liner: Auto-score every AI conversation on the two-tier rubric and show each actor's score, reasons, and sources in the in-room Scorecard panel.

Problem: The SkillPack engine already grades every AI conversation with its 9-metric evaluator, but the output is discarded: no product surface reads it, so QA leads cannot see whether an AI agent is performing well. A GPT auto-scorer (`auto_agent_scoring.rb`) already fills the scorecard for the first human agent on the manual categories, but nothing scores the AI agent against the engine's 9-metric rubric, handles multiple actors, applies veto, or shows reasons and sources where supervisors work; the in-room Scorecard panel today supports only manual binary scoring of a single agent. As a result, AI quality stays invisible, hallucinations go unmeasured (so the <2% target cannot be tracked), and the paid Scorecard still delivers no AI value.

4. What Happens If We Don't Ship This Phase

AI quality stays invisible — the engine emits the signal on every conversation but no one can see it; measured containment and hallucination rate slip another quarter past Q3 2026.
Phase 1's config is stranded — the rubric + threshold defined in Phase 1 (June 2026) have nothing computing against them; the investment delivers no value until this ships.
Phases 3 and 5 are blocked — both the Analytics report (P3) and the go-live gate (P5) read the per-conversation scores this phase produces; neither can start.

5. Target Users + Persona Context

Primary Persona: QA Lead / Supervisor

Field	Detail
Role	QA Lead or Supervisor accountable for conversation quality across human and AI agents
Goal	See each AI agent's score per conversation — with reasons and cited sources — in the inbox panel where they already review, without hand-scoring
Pain	AI quality is invisible today; they can only manually spot-check, and the panel has no way to show a graded AI score
Workaround	Reading transcripts one by one in the inbox; no score, no trend, no reasons

Secondary Persona: Bot / AI Builder (Agent Owner)

Field	Detail
Role	The Bot/AI specialist/admin who configures and ships AI agents (SkillPack)
Goal	See exactly where their AI agent fails (which metric, which conversation) so they can fix the SkillPack or KB
Pain	No objective per-conversation quality signal; problems are found anecdotally
Workaround	Ad-hoc manual testing in the preview pane; no record

6. Non-Goals

Not the Analytics report / trends / export — the unified report surface is Phase 3.
Not the go-live gate — gate decision + advisory/enforced modes are Phase 5.
Not the validation/testing harness — this phase scores only real conversations; pre-launch scoring is Phase 4.
Not the settings or rubric editor — Phase 1 owns config; this phase consumes it.
Not tuning the 9-metric weights: this is a DSAI dependency; weights stay uniform here (1/9 ≈ 0.11 each).
No change to human manual scoring — the manual panel mode stays exactly as today; AI mode is added alongside.
No mobile — web (Qontak omnichannel) only.
Not real-time per-message scoring — scoring runs per conversation/segment at terminal exit or handoff, not as a live overlay.

7. Constraints

Field	Value
Platform	Web only — Qontak omnichannel web app
Performance	Per-conversation auto-score available ≤ 60s after conversation close or handoff (async). In-room panel render ≤ 2s P95.
Data limits	Per scored conversation: 9 metric scores + tier-1/tier-2 verdicts + cited source refs + judge reason text. Retention: transcript snapshot 90 days; score records 13 months (see 7.1).
Plan scope	Professional + Enterprise only. Not Starter/Free.
Feature flag	`ai_qa_unified_scorecard` | default: OFF. This phase turns scoring + the AI panel on for enrolled orgs (the same flag Phase 1 shipped behind).
Read/write	Read: QA Lead/Supervisor (team scope), Bot/AI Builder/Admin (own agents). Write (manual override of an AI score): Supervisor/Admin only. End CS agents: no access to others' scores.

7.1 Data Lifecycle

Artifact Type	Retention Period	Cleanup Trigger	User-Visible Effect
AI score record (9 metrics + verdict + reasons)	13 months	TTL from `created_at`; nightly cleanup cron	Panel/trend shows up to 13 months; older drops off
Conversation transcript snapshot (judge input)	90 days	TTL from conversation close	After 90 days, the panel shows scores but the transcript reads "expired"
Cited source references (KB chunk ids)	90 days (with transcript)	TTL from conversation close	Source links expire with the transcript
Failed scoring job payload	7 days	Nightly cleanup after retry exhaustion	None — internal; surfaced as "scoring unavailable" on the record
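
The retention rules above can be sketched as an expiry lookup. This is a minimal illustration, not the cleanup cron itself: the artifact keys and function names are hypothetical, and anchoring the 7-day window at retry exhaustion is an assumption read from the table's cleanup trigger.

```python
import calendar
from datetime import datetime, timedelta


def add_months(ts: datetime, months: int) -> datetime:
    """Add calendar months, clamping the day to the target month's length."""
    month_index = ts.month - 1 + months
    year = ts.year + month_index // 12
    month = month_index % 12 + 1
    day = min(ts.day, calendar.monthrange(year, month)[1])
    return ts.replace(year=year, month=month, day=day)


def expires_at(artifact: str, anchor: datetime) -> datetime:
    """Return the expiry time for an artifact (hypothetical keys).

    anchor is created_at for score records, conversation close for
    transcripts and cited sources, and retry exhaustion (assumed) for
    failed job payloads.
    """
    if artifact == "score_record":
        return add_months(anchor, 13)
    if artifact in ("transcript_snapshot", "cited_sources"):
        return anchor + timedelta(days=90)
    if artifact == "failed_job_payload":
        return anchor + timedelta(days=7)
    raise ValueError(f"unknown artifact: {artifact}")


def is_expired(artifact: str, anchor: datetime, now: datetime) -> bool:
    """True once the nightly cleanup would be allowed to drop the artifact."""
    return now >= expires_at(artifact, anchor)
```

Because cited sources share the transcript's 90-day window, a panel opened on day 91 would still show the score record but treat both the transcript and its source links as expired.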

8. Feature Changes

Change ID: CHG-003 — In-room Scorecard panel: AI auto-scored mode + actor selector

Field	Detail
Change Type	Modified component (in-room Scorecard side panel)
Page	/inbox — conversation right panel → Scorecard tab
Page Intent	A supervisor reviews each actor's quality for the open conversation
Before	• Manual-only: each parameter is rated with a binary 👎/👍; Total is summed from manual ratings; the subject is a single human agent; groups are the org's manual categories. • No concept of an AI subject, auto-score, graded metric, judge reason, cited source, veto, or multiple actors.
After	• When the conversation's responder is an AI agent, the panel renders in "Auto-scored by AI" mode: each metric shows a graded score + an expandable judge reason; Groundedness surfaces the cited KB source; veto failures (Groundedness/Policy) show a red "Failed" flag that floors the total. • The 9 defaults appear as a "Qontak AI Quality (default)" group; org tier-2 custom params appear in their own group. • Total score is auto-computed with a pass/fail badge vs `passing_grade`. • A supervisor can override any auto-score → "edited by [SPV]" with audit. • An actor selector lets a room with multiple actors (the AI agent + human handler[s]) be reviewed per actor, each scored on its own segment. • Human-agent conversations are unchanged (manual mode as today).

Element	Before	After
Parameter rating control	Binary 👎/👍 (manual)	Graded score + reason (+ override) for AI; binary retained for human/manual
Scorecard groups	Org categories only	+ "Qontak AI Quality (default)" 9-metric group + tier-2 custom group
Score source label	"Scored by [SPV]" implicit	"Auto-scored by AI" badge vs "Scored by [SPV]"; override → "edited by [SPV]"
Agent / actor selector	Single human agent	Lists every actor who served the room (AI agent + human[s]); each shows its own scorecard scored on its segment
Groundedness / Policy	n/a	Veto flag + cited source surfaced
Total score	Manual sum	Auto-computed % + pass/fail vs `passing_grade`

Figma: Pending — Stitch prompt in Appendix B.

9. API & Webhook Behavior

Behavior 1: Ingest AI evaluator output and score the conversation (per actor/segment)

Field	Detail
Entity affected	New AI agent score record (`agent_scorecard` + details, source = auto / actor = AI)
Triggered by	An AI conversation reaches a terminal exit reason (`skill_completed`, `user_request_human_handoff`, etc.) or hands off to a human; the SkillPack engine emits its 9-metric evaluator result for the AI-handled turns
Information passed	Org, room/conversation id, AI agent id, the 9 metric scores, exit reason, cited source refs, per-metric judge reasons, segment boundary
Expected behavior	• Create an AI auto-score record scoped to the AI-handled turns; compute the tier-1 weighted score. • For each org tier-2 custom param with a non-empty rubric (from Phase 1), request judge scoring and merge. • Apply veto: if a veto metric (Groundedness/Policy) fails, `is_pass` = false regardless of weighted total. • Compute total vs `passing_grade` → `is_pass`; persist per actor with audit.
Failure behavior	• Evaluator output missing/malformed → record marked "scoring unavailable"; retry up to N. • Judge call for a tier-2 param fails → that param marked "unscored", record flagged "partial", tier-1 score still stored. • Scoring never blocks live conversation handling.
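
The tier-1 computation and veto rule in Behavior 1 can be sketched as follows. This is a hedged sketch under stated assumptions: metric scores are taken to be on a 0 to 1 scale, `passing_grade` is taken to be a percentage, the metric codes are hypothetical, and the point at which a veto metric counts as "failed" (`veto_fail_below`) is a placeholder because this PRD does not define it.

```python
from dataclasses import dataclass, field

# Hypothetical metric codes for the two veto metrics (Groundedness, Policy).
VETO_METRICS = frozenset({"groundedness", "policy_safety"})


@dataclass
class Verdict:
    total_pct: float
    is_pass: bool
    veto_failed: list = field(default_factory=list)


def score_tier1(metrics: dict, passing_grade: float,
                veto_fail_below: float = 0.5) -> Verdict:
    """Compute the uniformly weighted tier-1 total and apply the veto rule.

    metrics maps metric code -> score in [0, 1] (assumed scale).
    veto_fail_below is an assumed failure cut-off, not a value from the PRD.
    """
    if not metrics:
        raise ValueError("evaluator output is empty")
    weight = 1.0 / len(metrics)  # uniform weights until DSAI tunes them
    total_pct = 100.0 * sum(weight * score for score in metrics.values())
    veto_failed = sorted(
        code for code, score in metrics.items()
        if code in VETO_METRICS and score < veto_fail_below
    )
    # A veto failure forces is_pass to False regardless of the weighted total.
    is_pass = (not veto_failed) and total_pct >= passing_grade
    return Verdict(round(total_pct, 2), is_pass, veto_failed)
```

With eight metrics at 1.0 and Groundedness at 0.0, the weighted total is about 88.89%, which clears a 75% threshold, yet the record still fails because of the veto.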

Behavior 2: Override an AI auto-score (manual)

Field	Detail
Entity affected	An AI agent score record's per-metric value + total
Triggered by	Supervisor/Admin edits a metric score in the in-room panel
Information passed	Score record id, metric id, override value, optional reason
Expected behavior	Persist the override; recompute total + `is_pass`; mark the metric "edited by [SPV]"; record via paper_trail audit
Failure behavior	• Override value out of range → validation error, not saved. • Unauthorized role → control not rendered. • Save fails → error + retry; `scorecard_override_failed` logged.

Claude resolves during RFC: HTTP method, path, request/response JSON schema, error codes.

10. System Flow + User Stories + ACs

10.1 System Flow

Flow: AI Conversation Scored and Shown in the In-Room Panel
Type: User Journey + API Sequence

1. An AI agent (SkillPack) handles a conversation in Qontak omnichannel.
2. The conversation reaches a terminal exit reason or hands off to a human; the engine emits its 9-metric result scoped to the AI-handled turns.
3. System ingests the output → creates an AI auto-score record (tier-1 weighted score), scoped to the AI actor's segment.
4. Decision: for each org tier-2 custom param, is the rubric non-empty? Yes → judge scores it, merge; No → exclude.
5. Decision: did a veto metric (Groundedness/Policy) fail? Yes → `is_pass` = false regardless of total. No → compute total vs `passing_grade`.
6. Persist per actor with paper_trail audit.
7. Failure branch: evaluator output missing/malformed → mark "scoring unavailable", retry up to N, log; live handling never blocked.
8. A QA Lead opens the in-room Scorecard panel; the actor selector lists the AI agent + any human handler.
9. Selecting the AI actor → graded 9-metric group + tier-2 group, reasons, cited sources, veto flags, total + pass/fail badge.
10. Decision: does the supervisor override a metric? Yes → recompute total/`is_pass`, mark "edited by [SPV]", audit. No → done.

📊 System Flow — Scoring + In-Room Panel

sequenceDiagram
    participant Eng as SkillPack Engine
    participant Score as Scoring Pipeline
    participant Judge as Tier-2 Judge
    participant Store as Scorecard Store
    participant QA as QA Lead
    Eng->>Score: 9-metric result (terminal exit / handoff, AI-handled turns)
    Score->>Score: Tier-1 weighted score (AI actor segment)
    alt custom param rubric non-empty
        Score->>Judge: Score tier-2 param
        Judge-->>Score: Tier-2 result
    else empty rubric
        Note over Score: Exclude param
    end
    Score->>Score: Veto check (Groundedness/Policy)
    Score->>Store: Persist per actor → is_pass
    Note over Score,Store: Malformed output → "scoring unavailable", retry N
    QA->>Store: Open panel, select actor
    Store-->>QA: Graded metrics + reasons + sources + veto + total
    QA->>Store: Override a metric (optional)
    Store-->>QA: Recompute + "edited by SPV" + audit
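
The tier-2 gate and the failure branches in the flow above can be sketched like this. It is an illustrative outline only: `judge` is a hypothetical callable standing in for the tier-2 judge, `max_retries` stands in for the unspecified retry count N, and the status strings mirror the labels used in this PRD.

```python
def validate_output(output) -> bool:
    """Treat the evaluator output as well-formed only if it carries 9 metrics."""
    metrics = output.get("metrics") if isinstance(output, dict) else None
    return isinstance(metrics, dict) and len(metrics) == 9


def score_tier2(params, judge, max_retries=3) -> dict:
    """Score each custom param whose rubric (`prompt`) is non-empty."""
    results = {}
    for param in params:
        rubric = (param.get("prompt") or "").strip()
        if not rubric:
            results[param["id"]] = "manual-only"  # empty rubric: excluded
            continue
        for _ in range(max_retries):
            try:
                results[param["id"]] = judge(rubric)
                break
            except Exception:
                continue  # retry the judge call
        else:
            results[param["id"]] = "unscored"  # retries exhausted
    return results


def build_record(output, params, judge, max_retries=3) -> dict:
    """Assemble a score record; tier-1 is kept even when tier-2 is partial."""
    if not validate_output(output):
        return {"status": "scoring unavailable"}
    tier2 = score_tier2(params, judge, max_retries)
    partial = "unscored" in tier2.values()
    return {
        "status": "partial" if partial else "scored",
        "tier1": output["metrics"],
        "tier2": tier2,
    }
```

The sketch runs synchronously for readability; in the flow above this work happens asynchronously so that live conversation handling is never blocked.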

10.2 User Stories

[P2-S01] — Auto-score an AI conversation on the two-tier rubric (per actor/segment, with veto)


Field	Detail
User Story	As a QA Lead, I want every AI agent conversation automatically scored on the two-tier rubric, scoped to the turns the AI handled, so that I can measure AI quality without hand-scoring.
Before State	AI conversations aren't scored against the engine's 9-metric rubric; the existing GPT auto-scorer (`auto_agent_scoring.rb`) scores only the first human agent on the manual categories, and the engine's 9-metric output is discarded.
After Delta	On each AI conversation's terminal exit/handoff, the engine's 9-metric result is ingested into an AI auto-score record (tier-1) for the AI-handled segment, merged with tier-2 custom-param judge scores where a rubric exists, veto-checked, scored vs `passing_grade`, and persisted per actor.
Importance	Must Have
Mockup / Technical Notes	Figma: N/A — backend; surfaces in CHG-003 panel Data Fields: • `organization_id` (string, required) — Auth session • `room_id` (string, required) — conversation • `agent_id` (string, required) — AI agent (actor) • `metrics[]` ({code, score}, required) — engine output (9 metrics) • `exit_reason` (enum, required) — engine • `segment` ({from_turn, to_turn}, required) — AI-handled turns • `is_auto_score` / `passing_grade` (from Phase 1 prefs) Technical Notes: Tier-1 weights uniform 0.11 (DSAI tunes later — S17 Q#4). GATE rule: tier-2 param auto-scored only if rubric non-empty (Phase 1). Veto: Groundedness/Policy failure floors `is_pass`.
Acceptance Criteria	— Happy Path — • AC-1: Given `is_auto_score` is ON and an AI conversation reaches a terminal exit/handoff, when the engine emits its 9-metric result, then an AI auto-score record is created with the tier-1 weighted score within 60s. • AC-2: Given an org tier-2 custom param with a non-empty rubric, when scored, then the judge scores it and merges into the total. • AC-3: Given the total, when compared to `passing_grade`, then `is_pass` is true if total ≥ `passing_grade` else false. — Edge — • AC-4: Given a tier-2 param with an EMPTY rubric, when scored, then it is excluded and marked "manual-only"; tier-1 unaffected. • AC-5: Given a veto metric (Groundedness/Policy) fails, when the total is computed, then `is_pass` = false regardless of the weighted total, flagged with the veto reason. • AC-6: Given a room handled by an AI agent for turns 1–N then a human from N+1, when scored, then only the AI-handled turns are evaluated and stored against the AI actor — a separate record from any human score (`(org, room_id, agent_id)` key). — Error / Unhappy Path — • ERR-1: Given the evaluator output is missing/malformed, when ingestion runs, then the record is marked "scoring unavailable", retried up to N, and `scorecard_autoscore_failed` is logged; live handling never blocked. • ERR-2: Given a tier-2 judge call fails after retries, when scoring completes, then the param is "unscored", the record is "partial", tier-1 is still stored, and `scorecard_tier2_judge_failed` is logged. — Permission Model — • CAN: System (automated) for orgs with `is_auto_score` ON + flag ON. • CANNOT: end CS agents cannot trigger/alter auto-scores. • Unauthorized: N/A — automated pipeline. — UI States — (record surfaces in CHG-003 panel) • Loading: record shows "scoring…". • Empty: no AI conversation → no record. • Error: "scoring unavailable". • Success: total + pass/fail badge. — Negative Scenarios — • NEG-1: Given a human agent conversation, when processed, then it is NOT auto-scored by the AI evaluator. • NEG-2: Given a live, unresolved conversation, when messages are exchanged, then no per-message live score is produced.

Dependencies: Phase 1 config (rubric + threshold); DSAI evaluator contract — see S15.
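
AC-6's per-actor segmentation can be illustrated with a short sketch. It assumes each turn has already been attributed to the actor handling the room at that point (for example via the handover events named in Open Question 3); that attribution step, the tuple layout, and the function names are hypothetical, while `from_turn` / `to_turn` follow the `segment` data field above.

```python
from itertools import groupby


def split_segments(turns):
    """Split ordered (turn_no, actor_id) pairs into contiguous actor segments."""
    segments = []
    for actor_id, group in groupby(turns, key=lambda turn: turn[1]):
        turn_numbers = [turn_no for turn_no, _ in group]
        segments.append({
            "actor_id": actor_id,
            "from_turn": turn_numbers[0],
            "to_turn": turn_numbers[-1],
        })
    return segments


def record_key(org_id, room_id, agent_id):
    """One score record per actor per room, keyed as in AC-6."""
    return (org_id, room_id, agent_id)
```

For a room where the AI handles turns 1 to 3 and a human handles turns 4 to 5, this yields two segments and therefore two distinct record keys, so the AI score is never mixed with the human score.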

[P2-S02] — View an AI agent's auto-score in the in-room Scorecard panel


Field	Detail
User Story	As a QA Lead, I want to see an AI agent's auto-score — metrics, reasons, sources, veto, total — in the in-room Scorecard panel, choosing the AI actor when a human also served the room, so that I can review AI quality where I already work.
Before State	The panel is manual binary 👎/👍, human-agent only; no AI score, no reasons, no actor selector.
After Delta	An "Auto-scored by AI" mode renders graded metrics + judge reasons + cited sources + veto flags + total/pass-fail, with an actor selector for multi-actor rooms (per CHG-003).
Importance	Must Have
Mockup / Technical Notes	Figma: Pending — Appendix B Stitch prompt Data Fields: • `room_id` (string, required) — conversation • `actor_id` (string, required) — selected actor • `actor_type` (enum ai|human, required) — drives panel mode • `score_record` (API response) — per-actor score
Acceptance Criteria	— Happy Path — • AC-1: Given a conversation an AI agent handled, when the QA Lead opens the Scorecard panel and selects the AI actor, then the 9-metric group + any tier-2 group render with graded scores, total %, and pass/fail badge. • AC-2: Given a metric, when expanded, then its judge reason is shown; for Groundedness, the cited KB source link is shown. • AC-3: Given a veto metric failed, when the panel renders, then a red "Failed" flag is shown and the total reflects `is_pass` = false. — Edge — • AC-4: Given a room served by both an AI agent and a human, when the QA Lead opens the panel, then the actor selector lists both; selecting the human shows manual categories (today's behavior), selecting the AI shows the AI group. • AC-5: Given a transcript snapshot past 90-day retention, when expanded, then scores render but the transcript reads "expired". — Error / Unhappy Path — • ERR-1: Given the AI score record is "scoring unavailable", when the panel opens, then it shows "scoring unavailable" with a retry indicator (no crash), and `scorecard_panel_load_failed` is logged on a hard load error. — Permission Model — • CAN: QA Lead/Supervisor (team scope), Bot/AI Builder/Admin (own agents). • CANNOT: end CS agents (cannot view others' scores). • Unauthorized: panel/actor not shown. — UI States — • Loading: "scoring…" / skeleton metrics. • Empty: "Not scored yet" before the first AI conversation completes. • Error: "scoring unavailable" + retry. • Success: graded metrics + total + pass/fail. — Negative Scenarios — • NEG-1: Given a human-only conversation, when the panel opens, then only manual mode shows (no AI group).

Dependencies: P2-S01 (score record to display).
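
How `actor_type` and the score record drive the panel's mode and UI state might look like the sketch below. The `status` values on the record ("pending", "scoring unavailable") and the returned dictionary shape are assumptions for illustration; the user-facing messages are the ones listed under UI States.

```python
def panel_view(actor_type, score_record=None):
    """Pick the panel mode and UI state for the selected actor."""
    if actor_type == "human":
        # Human actors keep today's manual binary scorecard.
        return {"mode": "manual", "state": "success"}
    if score_record is None:
        return {"mode": "ai", "state": "empty", "message": "Not scored yet"}
    status = score_record.get("status")
    if status == "pending":
        return {"mode": "ai", "state": "loading", "message": "scoring…"}
    if status == "scoring unavailable":
        return {"mode": "ai", "state": "error", "message": "scoring unavailable"}
    return {
        "mode": "ai",
        "state": "success",
        "veto_failed": bool(score_record.get("veto_failed")),
        "is_pass": bool(score_record.get("is_pass")),
    }
```

Keeping this decision in one function makes NEG-1 easy to check: a human-only conversation can only ever reach the manual branch.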

[P2-S03] — Override an AI auto-score


Field	Detail
User Story	As a Supervisor/Admin, I want to override an AI metric score in the panel, so that I can correct a judge mistake and keep the record trustworthy.
Before State	AI scores would be immutable; no human correction path.
After Delta	A per-metric override recomputes the total/`is_pass`, marks the metric "edited by [SPV]", and audits the change.
Importance	Should Have
Mockup / Technical Notes	Figma: Pending Data Fields: • `score_record_id` (uuid, required) — record • `metric_id` (string, required) — the metric overridden • `override_value` (float, required) — new value • `reason` (text, optional) — override note
Acceptance Criteria	— Happy Path — • AC-1: Given a Supervisor viewing an AI score, when they override a metric value within range and save, then the total + `is_pass` recompute and the metric shows "edited by [SPV]". • AC-2: Given an override, when saved, then it is recorded via paper_trail audit. — Edge — • AC-3: Given an override value out of range, when saved, then a validation error is shown and nothing is persisted. • AC-4: Given the existing scorecard allows only one correction (`edit_count` / `correction_at`), when a supervisor overrides an AI metric, then it consumes that single correction slot, and a further override is blocked with "already corrected". — Error / Unhappy Path — • ERR-1: Given the override save fails, when saving, then an error + Retry is shown, no partial state persists, and `scorecard_override_failed` is logged. — Permission Model — • CAN: Supervisor/Admin. • CANNOT: QA Lead (view only), end CS agents. • Unauthorized: override control not rendered. — UI States — • Loading: spinner on the metric during save. • Empty: N/A. • Error: as ERR-1. • Success: "edited by [SPV]" + recomputed total. — Negative Scenarios — • NEG-1: Given a QA Lead (non-supervisor), when viewing an AI score, then no override control is shown.

Dependencies: P2-S01, P2-S02.
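
The override rules in P2-S03 (role gate, range validation, single correction slot, recompute) can be sketched as below. This is a simplified model, not the `hub-chat` implementation: the class, role strings, and the 0 to 1 score range are assumptions, and the recompute here omits the veto check and the paper_trail audit for brevity.

```python
from dataclasses import dataclass, field


class OverrideError(Exception):
    """Raised when an override is rejected; nothing is persisted."""


@dataclass
class ScoreRecord:
    metrics: dict            # metric_id -> score in [0, 1] (assumed scale)
    passing_grade: float     # percentage threshold
    edit_count: int = 0      # mirrors the existing one-edit correction slot
    edited_metrics: set = field(default_factory=set)

    @property
    def total_pct(self) -> float:
        return 100.0 * sum(self.metrics.values()) / len(self.metrics)

    @property
    def is_pass(self) -> bool:
        return self.total_pct >= self.passing_grade


def apply_override(record: ScoreRecord, role: str, metric_id: str, value: float) -> None:
    """Validate first, then mutate, so a rejected override leaves no partial state."""
    if role not in {"supervisor", "admin"}:
        raise OverrideError("unauthorized")
    if record.edit_count >= 1:
        raise OverrideError("already corrected")
    if metric_id not in record.metrics:
        raise OverrideError("unknown metric")
    if not 0.0 <= value <= 1.0:
        raise OverrideError("out of range")
    record.metrics[metric_id] = value
    record.edit_count += 1
    record.edited_metrics.add(metric_id)  # rendered as "edited by [SPV]"
```

Because the total and `is_pass` are derived properties, they are recomputed automatically once the metric value changes.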

11. Rollout

Field	Value
Feature flag	`ai_qa_unified_scorecard` — default: OFF (the flag Phase 1 shipped behind; this phase turns scoring + AI panel on)
Stage 1	Internal QA: 3–5 internal accounts + the 15-CID validation projects (real conversations scored)
Stage 2	Closed beta: TransGo, Talenta LMS + 3 partner accounts, manually enabled
Stage 3	All Professional + Enterprise on request
GA	All Professional + Enterprise (flag on)
Backward compat	Yes — the manual human scorecard is unaffected; AI mode is additive
Migration	None to existing records. New: AI auto-score source/actor fields on `agent_scorecard`.

11.1 Semantic Regression Rollback

Field	Detail
Model flag	`ai_qa_unified_scorecard_v2_weights` | default: OFF (per org); guards a weight/rubric change
Regression metric	Judge-vs-human agreement rate on the calibration sample
Rollback threshold	Agreement drops ≥10% vs the prior calibrated baseline, OR auto-score pass rate shifts ≥15% with no human-verified cause
Rollback path	Toggle the weights flag back to the prior version per org (no deploy); Bot/AI + DSAI; within 4 hours of alert
Monitoring	`scorecard_autoscore_completed` + `judge_human_agreement` reviewed daily during the calibration window
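
The rollback threshold can be expressed as a small check. This sketch assumes both thresholds are measured in percentage points against the prior calibrated baseline; the table does not say whether the 10% and 15% figures are absolute or relative, so treat that reading, and the function name, as assumptions.

```python
def should_roll_back(baseline_agreement: float, current_agreement: float,
                     baseline_pass_rate: float, current_pass_rate: float,
                     cause_verified: bool = False) -> bool:
    """Return True when either semantic-regression threshold is breached.

    All inputs are percentages (e.g. 88.0 for 88%).
    """
    agreement_drop = baseline_agreement - current_agreement
    pass_rate_shift = abs(current_pass_rate - baseline_pass_rate)
    unexplained_shift = pass_rate_shift >= 15.0 and not cause_verified
    return agreement_drop >= 10.0 or unexplained_shift
```

For example, agreement falling from 88% to 78% would trip the first condition, while a pass-rate move from 60% to 76% would trip the second unless a human-verified cause is recorded.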

12. Observability

Key Events:

Event Name	Trigger	Properties
`scorecard_autoscore_completed`	An AI conversation is scored	org_id, agent_id, total_score, is_pass, tier2_count, veto_failed, duration_ms
`scorecard_autoscore_failed`	Evaluator ingestion/scoring failed	org_id, room_id, reason, retry_count
`scorecard_tier2_judge_failed`	A tier-2 custom-param judge call failed	org_id, custom_param_id, reason
`scorecard_panel_load_failed`	In-room panel hard load error	org_id, room_id, reason
`scorecard_override_saved`	A supervisor overrode an AI metric	org_id, score_record_id, metric_id
`scorecard_override_failed`	Override save failed	org_id, reason

Field	Detail
Dashboard owner	Bot, AI & Automation (squad: BOT)
Alert 1	`scorecard_autoscore_failed` rate > 5% of AI conversations in 1h → Slack: #bot-ai-oncall
Alert 2	Judge-vs-human agreement < 85% during the calibration window → Slack: #bot-ai-quality

12.1 Post-Launch Monitoring Cadence

Field	Detail
Review cadence	Weekly for the first 4 weeks post-GA, then monthly
Owner	Dimas Fauzi Hidayat (PM) + BOT squad
Review scope	`scorecard_autoscore_completed` / `_failed`, `scorecard_override_saved`, judge-vs-human agreement
Trigger threshold 1	`scorecard_autoscore_failed` > 5% week-over-week → investigate ingestion / engine contract
Trigger threshold 2	Auto-score pass rate shifts > 15% for 2 consecutive weeks with no weight change → investigate calibration drift
Rollback consideration	If judge-vs-human agreement falls below baseline − 10% and is unresolved within 48h, PM toggles the weights flag back to the prior version (S11.1).

13. Success Metrics

Adoption & Usage:

Metric	Definition	Baseline	Target
⭐ AI-QA scoring coverage	% of AI agent conversations auto-scored	0% — no AI-agent scoring (existing auto-scorer covers only the human agent)	≥95% within 30 days of GA
In-room panel usage	% of enrolled accounts whose supervisors open an AI scorecard weekly	N/A — new	≥50% within 60 days of GA

Quality & Accuracy:

Metric	Definition	Baseline	Target
Hallucination rate (product facts)	Share of AI answers flagged ungrounded by the judge (Groundedness veto)	Unmeasured	<2% within 60 days of GA
Judge-vs-human agreement	Agreement between AI auto-scores and supervisor overrides/verdicts on the calibration sample	N/A	≥85% before weight tuning (gates Phase 5)

Efficiency & Impact:

Metric	Definition	Baseline	Target
Manual QA effort displaced	% of scored conversations that were scored automatically rather than by hand	0% — 100% manual	≥70% within 90 days of GA
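
The ratio metrics above can be computed along these lines. The sketch is illustrative: the field names are hypothetical, and defining judge-vs-human agreement as "same pass/fail verdict" is an assumption, since this PRD does not fix how agreement is measured.

```python
def pct(numerator: int, denominator: int) -> float:
    """Percentage helper that returns 0.0 for an empty denominator."""
    return 100.0 * numerator / denominator if denominator else 0.0


def scoring_coverage(conversations) -> float:
    """% of AI agent conversations that received an auto-score."""
    ai = [c for c in conversations if c["is_ai"]]
    return pct(sum(1 for c in ai if c["auto_scored"]), len(ai))


def judge_human_agreement(sample) -> float:
    """% of calibration pairs (judge_pass, human_pass) with the same verdict."""
    return pct(sum(1 for judge, human in sample if judge == human), len(sample))


def manual_effort_displaced(auto_scored: int, hand_scored: int) -> float:
    """% of all scored conversations that were scored automatically."""
    return pct(auto_scored, auto_scored + hand_scored)
```

Under this reading, 19 auto-scored conversations out of 20 AI conversations gives 95% coverage, and 17 matching verdicts out of 20 calibration pairs gives 85% agreement.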

14. Launch Plan & Stage Gates

Stage	Audience	Duration	Success Gate to Advance	Owner
Internal Alpha	3–5 internal QA accounts + 15-CID set	2 weeks	0 P0/P1; `scorecard_autoscore_failed` ≤5%; engine ingestion contract confirmed with DSAI; segment boundary validated	PM + QA
Closed Beta	TransGo, Talenta LMS + 3 partners	3 weeks	Judge-vs-human agreement ≥85% on calibration sample; panel render ≤2s P95	PM + BOT
Open Beta	All Pro+Ent on request	3 weeks	Auto-score coverage ≥90% for enrolled; hallucination <2% on sample; no P0 for 2 weeks	Eng Lead
GA	All Pro+Ent	Ongoing	All Open Beta gates sustained 2 weeks; PMM launch approved	PM + PMM

15. Dependencies

Dependency	Owning Team	Deliverable Needed	Blocking?
SkillPack engine evaluator output contract	DSAI / engine team	Stable per-conversation 9-metric output + segment boundary + `GET /models`, `/thread/message` confirmed (Strategy doc Appendix A.3)	YES
Phase 1 config (settings + rubrics)	BOT (Phase 1)	`is_auto_score`, `passing_grade`, custom-param rubrics live in production	YES
PII hashing + credential vault (G7)	Security / Infosec	PII hashed before judge/log; plaintext credential leak closed before transcripts are stored	YES
Design / UX	Design squad	Frames for the in-room panel AI mode + actor selector (CHG-003)	YES
9-metric weight tuning from 15-CID	DSAI	Tuned weights replacing the uniform 0.11	NO for P2 (scores are measurement-only) · gates Phase 5

16. Key Decisions + Alternatives Rejected

16a — Decisions Made

Date	Decision	Rationale
2026-06-19	Consume the engine's 9-metric evaluator output; do not build a judge in the scorecard	The judge already shipped in SkillPack; avoids two drifting judges
2026-06-19	Veto metrics (Groundedness, Policy) floor `is_pass` regardless of the weighted total	A single hallucination or policy breach should fail a conversation; ties to the <2% hallucination target
2026-06-19	Score each actor on its own segment; one room can hold multiple actor scorecards	AI handles first contact then hands to a human; each must be measured on the turns it handled
2026-06-19	AI scores are overridable by Supervisor/Admin with audit	Judge mistakes need a human correction path to keep the record trustworthy and the judge calibratable
2026-06-19	This phase surfaces scores as measurement only — no go-live gate	Weights are untuned (0.11); gating on them would be premature (gate is Phase 5 after tuning)
2026-06-19	AI-score override reuses the existing one-edit correction slot (`edit_count` / `correction_at`), not a new edit mechanism	The scorecard already enforces "edit once"; reusing it keeps the audit model consistent (verified in hub-chat `ScorecardForm.vue`)
2026-06-19	Extend, rather than rebuild, the existing pieces in the cloned `hub-chat` repo: the in-room panel, the per-room `GET/POST/PATCH /agent_scorecards/{roomId}` API, the `roomParticipant[]` array, and the handover events	Verified against `hub-chat` `agent-scorecard/` + event handlers (2026-06-19); lowers build risk

16b — Alternatives Rejected

Alternative	Why Rejected	Date
Build a separate AI-QA judge inside the scorecard tables	Duplicates the shipped 9-metric evaluator; two judges drift	2026-06-19
One judge call per parameter	Multiplies token cost per conversation; all params are scored in one call instead	2026-06-19
Make AI scores immutable (no override)	Judge mistakes would be uncorrectable and erode trust; no calibration signal	2026-06-19
Score the whole transcript regardless of actor	Unfairly credits/penalizes the AI for the human's turns (and vice versa)	2026-06-19

17. Open Questions

#	Type	Question	Owner	Deadline
1	Open Question	Confirm the exact definitions and order of the 9 engine metrics with DSAI (Appendix A is PROPOSED).	Bot/AI + DSAI	2026-07-15
2	Assumption	The engine exposes a stable per-conversation evaluator output incl. a segment boundary (Strategy doc Appendix A.3 lists `GET /models`, `/thread/message` as OPEN).	DSAI	2026-07-01
3	Risk	The AI/human segment boundary may be ambiguous, mis-attributing turns. Mitigation: delimit segments using the existing hub-chat handover events (`agent_take_room` / `remove_agent` / `handover_id`) + message timestamps (mechanism confirmed in code); validate against 15-CID transcripts in Internal Alpha, and confirm these events fire for the AI→human takeover case.	Bot/AI + Omnichannel	2026-07-15
4	Risk	The 9-metric weights are uniform 0.11 and untuned → scores may be miscalibrated. Mitigation: this phase surfaces scores as measurement only (no gate); tune from 15-CID + monitor judge-vs-human agreement before Phase 5.	DSAI	2026-07-31
5	Risk	Storing scored transcripts widens the blast radius of the live plaintext-credential / PII gap (Strategy doc G7). Mitigation: hash PII before judge/log; close the G7 credential leak before transcripts are stored; 90-day TTL on snapshots.	Security / Infosec	Before GA
6	Risk	The AI agent may not be a scorable actor in the inbox. In hub-chat, messages/participants are only `Models::User` (human) / `Models::Customer`; Airene is side-panel assist, not a message participant — there is no bot/AI-agent participant type. This phase assumes the SkillPack agent is a selectable, scorable actor in the room. Mitigation: confirm with Bot/Automation how the agent's turns are represented; if the bot is not a room participant, attach the AI score to the room + agent-version rather than a participant row.	Bot/AI + Omnichannel	2026-07-15
7	Open Question	Scorecard RBAC today is org-wide Usman `inbox_scorecard_view` / `inbox_scorecard_manage` — there is no team/division scoping (verified in hub-chat `UsmanStore`). The "team scope" references in this PRD require net-new scoping. Resolve: descope to org-wide for Phase 2, or add per-team scoping as an explicit requirement.	PM + Platform	2026-07-15

Appendix A — AI Scoring Rubric

Status: PROPOSED — pending DSAI confirmation (Open Q#1). The 9 metrics are owned by the SkillPack engine. Tier-1 weights uniform 0.11 until DSAI tunes from 15-CID (Open Q#4). Full judging prompts live in the Phase 1 PRD Appendix A and the default-rubric viewer.

#	Metric	What it measures	Veto?
1	Groundedness / factual accuracy	Claims backed by KB sources or customer data; no invented product facts	🛑 Veto
2	Resolution / task completion	Did it resolve the goal (`skill_completed` signal)	—
3	Relevance / intent understanding	Addressed the real intent, not a different question	—
4	Policy & safety adherence	Stayed within "what to avoid"; no unsafe content / PII leak	🛑 Veto
5	Tone & brand voice	Matched configured `tone_of_voice`; courteous	—
6	Language quality (Bahasa)	Fluent target language; no broken/mixed language	—
7	Handoff appropriateness	No false handover (Pattern A); no missed escalation	—
8	Tool / action correctness	Right action, right params, not skipped (Pattern B)	—
9	Conversation efficiency	No loops / re-asking; resolved within turn budget	—

🛑 Veto (P2-S01/AC-5): a clear breach of metric 1 or 4 floors `is_pass` regardless of the weighted total.
Tier-2: org custom params (added in Phase 1) are scored by the judge using their `prompt` rubric and merged (P2-S01/AC-2).
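
For reference, the proposed rubric can be written down as data. The metric codes below are hypothetical placeholders (the real codes belong to the SkillPack engine and are pending DSAI confirmation); the labels and veto flags follow the table above.

```python
from fractions import Fraction

# (hypothetical code, label from the table, veto?)
RUBRIC = [
    ("groundedness", "Groundedness / factual accuracy", True),
    ("resolution", "Resolution / task completion", False),
    ("relevance", "Relevance / intent understanding", False),
    ("policy_safety", "Policy & safety adherence", True),
    ("tone", "Tone & brand voice", False),
    ("language", "Language quality (Bahasa)", False),
    ("handoff", "Handoff appropriateness", False),
    ("tool", "Tool / action correctness", False),
    ("efficiency", "Conversation efficiency", False),
]

# Uniform tier-1 weights: exactly 1/9 each, so they sum to 1.
WEIGHTS = {code: Fraction(1, len(RUBRIC)) for code, _, _ in RUBRIC}

VETO_CODES = {code for code, _, is_veto in RUBRIC if is_veto}
```

Using exact ninths rather than a literal 0.11 matters slightly for the total: nine weights of 0.11 sum to 0.99, so a perfect conversation would top out at 99% instead of 100%.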

Appendix B — Stitch UI Prompt

Generated proactively because the Figma for the in-room panel AI mode is still pending. Use in Stitch; hand the output to Design as structural reference.

=== SHARED PREAMBLE ===
Product: Mekari Qontak — Omnichannel inbox
Users: QA Lead / Supervisor, Bot/AI Builder
Design tone: Enterprise B2B SaaS — dense, professional, clean white surfaces, purple accent; match the existing Qontak inbox shell
Persistent UI: left icon rail + top bar; the Scorecard is a right-hand panel over the conversation
=== END PREAMBLE ===

#	Screen	Stitch Prompt (paste in full after the preamble)
1	In-room Scorecard panel — AI mode (CHG-003)	Screen: right-hand Scorecard panel over a conversation, "Auto-scored by AI" mode. Purpose: supervisor reviews each actor's score for this conversation. Components: an actor selector at top listing every actor who served this room (AI agent + human handler[s]) — selecting the AI actor shows "Auto-scored by AI" + the 9-metric group; selecting a human shows the org's manual categories with binary thumbs (today's behavior). For the AI actor: a "Qontak AI Quality (default)" group of 9 metrics, each a graded score chip + expandable judge reason, Groundedness showing a cited KB source link, a red "Failed" veto flag on Groundedness/Policy when breached; a separate "Custom (org)" group for tier-2 params; a per-metric override control (edit → "edited by [SPV]"); Total score % with a pass/fail badge vs threshold; Remarks. Generate states: Loading ("scoring…"); Success (graded); Error ("scoring unavailable"); Veto-failed (red banner). Do NOT include: the Analytics report, the go-live gate, binary-only thumbs as the sole control for AI rows.

PRD CHANGELOG

Version	Date	By	Section	Type	Summary
1.0	2026-06-19	Claude	All	CREATED	Phase 2 PRD (AI Auto-Scoring & In-Room Scorecard) — cut from the superset draft: two-tier scoring per actor/segment with veto, and the in-room panel AI mode + actor selector.
1.1	2026-06-19	Claude	CB, S1, S8, S13	MODIFIED	Corrected premise vs cloned code: `auto_scoring.rb` "stub" → the live `auto_agent_scoring.rb` already scores the human agent; this phase adds the AI-agent 9-metric path alongside it. Phase count 7→8.
1.2	2026-06-19	Claude	S8, S16, S17	MODIFIED	hub-chat grounding: added AI-agent-as-participant risk (Q6) + permission-scope question (Q7), answered segment-boundary with handover events (Q3), added edit-once override AC + decision, and a code-grounding decision.
