Qontak | Chatbot & AI | Unified Agent Quality Scorecard — Phase 2: AI Auto-Scoring & In-Room Scorecard
Template: PHASE PRD v1.2 · Companion to PRD Section Reference v1.5 + Hierarchy v1.0
Note: Phase 2 of the Unified Agent Quality Scorecard initiative. It builds the scoring engine and the in-room panel: it consumes the Phase 1 config and produces the scores that Phase 3 (report) and Phase 5 (gate) read.
HEADER BLOCK
| Field | Value |
|---|---|
| PM | Dimas Fauzi Hidayat |
| PRD Version | 1.2 |
| Status | DRAFT |
| PRD Type | PHASE |
| Epic | QC-XXXXX — add once Epic is created |
| Squad | BOT — Bot, AI & Automation |
| RFC Link | Pending — RFC to follow via rfc-starter |
| Figma Master | Pending — in-room panel AI mode not yet designed (Stitch prompt in Appendix B) |
| Anchor | Qontak \| Chatbot & AI \| Unified Agent Quality Scorecard — ANCHOR |
| Labels | epic:qontak-chatbot-ai, module:chatbot-ai, feature:unified-agent-scorecard |
| Last Updated | 2026-06-19 |
Table of Contents
- HEADER BLOCK
- 2. CONDITIONAL BLOCK: PHASE CONTEXT
- 3. One-liner + Problem
- 4. What Happens If We Don't Ship This Phase
- 5. Target Users + Persona Context
- 6. Non-Goals
- 7. Constraints
- 8. Feature Changes
- 9. API & Webhook Behavior
- 10. System Flow + User Stories + ACs
- 11. Rollout
- 12. Observability
- 13. Success Metrics
- 14. Launch Plan & Stage Gates
- 15. Dependencies
- 16. Key Decisions + Alternatives Rejected
- 17. Open Questions
- Appendix A — AI Scoring Rubric
- Appendix B — Stitch UI Prompt
- PRD CHANGELOG
2. CONDITIONAL BLOCK: PHASE CONTEXT
| Field | Detail |
|---|---|
| Anchor PRD | Qontak \| Chatbot & AI \| Unified Agent Quality Scorecard — ANCHOR |
| Phase | Phase 2 of 8 |
| Phase Goal | Score every AI conversation on the two-tier rubric (per actor/segment, with veto metrics) and surface it in the in-room Scorecard panel. |
| Prior phases | Phase 1 — Scorecard Settings & Rubric Config (shipped): the extended is_auto_score (now also enabling AI-agent scoring), the AI pass threshold, the custom-param "AI judging rubric" editor (prompt widened string→text), and the read-only default-rubric viewer. This phase consumes all of it. Note: the human auto-scorer (auto_agent_scoring.rb) already exists — this phase adds the AI-agent 9-metric path alongside it. |
| Deferred to later phases | Analytics report + export → Phase 3; validation/testing harness (pre-launch scores) → Phase 4; go-live gate → Phase 5. |
| Cross-phase dependencies | (1) Phase 1 config must be live — scoring reads the (extended) is_auto_score, the AI pass threshold, and the custom-param rubrics. (2) DSAI evaluator output contract (Strategy doc Appendix A.3) — the per-conversation 9-metric output this phase ingests. |
3. One-liner + Problem
One-liner: Auto-score every AI conversation on the two-tier rubric and show each actor's score, reasons, and sources in the in-room Scorecard panel.
Problem:
The SkillPack engine already grades every AI conversation with its 9-metric evaluator, but the output is discarded: no product surface reads it, so QA leads cannot see whether an AI agent is good. A GPT auto-scorer (auto_agent_scoring.rb) already fills the scorecard for the first human agent on the manual categories, but nothing scores the AI agent against the engine's 9-metric rubric, handles multiple actors, applies veto, or shows reasons/sources where supervisors work. The in-room Scorecard panel today supports only manual binary scoring of a single agent. As a result, AI quality stays invisible, hallucinations go unmeasured (so the <2% target cannot be checked), and the paid Scorecard still offers no AI value.
4. What Happens If We Don't Ship This Phase
- AI quality stays invisible — the engine emits the signal on every conversation but no one can see it; measured containment and hallucination rate slip another quarter past Q3 2026.
- Phase 1's config is stranded — the rubric + threshold defined in Phase 1 (June 2026) have nothing computing against them; the investment delivers no value until this ships.
- Phases 3 and 5 are blocked — both the Analytics report (P3) and the go-live gate (P5) read the per-conversation scores this phase produces; neither can start.
5. Target Users + Persona Context
Primary Persona: QA Lead / Supervisor
| Field | Detail |
|---|---|
| Role | QA Lead or Supervisor accountable for conversation quality across human and AI agents |
| Goal | See each AI agent's score per conversation — with reasons and cited sources — in the inbox panel where they already review, without hand-scoring |
| Pain | AI quality is invisible today; they can only manually spot-check, and the panel has no way to show a graded AI score |
| Workaround | Reading transcripts one by one in the inbox; no score, no trend, no reasons |
Secondary Persona: Bot / AI Builder (Agent Owner)
| Field | Detail |
|---|---|
| Role | The Bot/AI specialist/admin who configures and ships AI agents (SkillPack) |
| Goal | See exactly where their AI agent fails (which metric, which conversation) so they can fix the SkillPack or KB |
| Pain | No objective per-conversation quality signal; problems are found anecdotally |
| Workaround | Ad-hoc manual testing in the preview pane; no record |
6. Non-Goals
- Not the Analytics report / trends / export — the unified report surface is Phase 3.
- Not the go-live gate — gate decision + advisory/enforced modes are Phase 5.
- Not the validation/testing harness — this phase scores only real conversations; pre-launch scoring is Phase 4.
- Not the settings or rubric editor — Phase 1 owns config; this phase consumes it.
- Not tuning the 9-metric weights — DSAI dependency; weights stay uniform 0.11 here.
- No change to human manual scoring — the manual panel mode stays exactly as today; AI mode is added alongside.
- No mobile — web (Qontak omnichannel) only.
- Not real-time per-message scoring — scoring runs per conversation/segment at terminal exit or handoff, not as a live overlay.
7. Constraints
| Field | Value |
|---|---|
| Platform | Web only — Qontak omnichannel web app |
| Performance | Per-conversation auto-score available ≤ 60s after conversation close or handoff (async). In-room panel render ≤ 2s P95. |
| Data limits | Per scored conversation: 9 metric scores + tier-1/tier-2 verdicts + cited source refs + judge reason text. Transcript snapshot 90 days; aggregate scores 13 months. |
| Plan scope | Professional + Enterprise only. Not Starter/Free. |
| Feature flag | ai_qa_unified_scorecard (default: OFF). This phase turns scoring + the AI panel on for enrolled orgs (the same flag Phase 1 shipped behind). |
| Read/write | Read: QA Lead/Supervisor (team scope), Bot/AI Builder/Admin (own agents). Write (manual override of an AI score): Supervisor/Admin only. End CS agents: no access to others' scores. |
7.1 Data Lifecycle
| Artifact Type | Retention Period | Cleanup Trigger | User-Visible Effect |
|---|---|---|---|
| AI score record (9 metrics + verdict + reasons) | 13 months | TTL from created_at; nightly cleanup cron | Panel/trend shows up to 13 months; older drops off |
| Conversation transcript snapshot (judge input) | 90 days | TTL from conversation close | After 90 days, the panel shows scores but the transcript reads "expired" |
| Cited source references (KB chunk ids) | 90 days (with transcript) | TTL from conversation close | Source links expire with the transcript |
| Failed scoring job payload | 7 days | Nightly cleanup after retry exhaustion | None — internal; surfaced as "scoring unavailable" on the record |
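The retention rules in the table above can be expressed as a small lookup plus an expiry check. This is a minimal sketch, not the cleanup cron itself: the artifact keys and the `is_expired` helper are hypothetical names, and 13 months is approximated as 396 days because the PRD does not state a day count.

```python
from datetime import datetime, timedelta, timezone

# Retention per artifact type, taken from the Data Lifecycle table.
# Keys are hypothetical; 13 months is approximated as 396 days (assumption).
RETENTION = {
    "ai_score_record": timedelta(days=396),
    "transcript_snapshot": timedelta(days=90),
    "cited_source_refs": timedelta(days=90),
    "failed_job_payload": timedelta(days=7),
}


def is_expired(artifact_type, anchor, now):
    """Return True once `now` is at or past the TTL measured from `anchor`.

    `anchor` is created_at for score records and the conversation close
    time for transcript snapshots and cited sources.
    """
    return now >= anchor + RETENTION[artifact_type]


now = datetime(2026, 10, 1, tzinfo=timezone.utc)
closed_at = now - timedelta(days=91)
print(is_expired("transcript_snapshot", closed_at, now))  # transcript is past 90 days
print(is_expired("ai_score_record", closed_at, now))      # score is still retained
```

In this state the panel would still show the score while the transcript reads "expired", which matches the user-visible effect described in the table.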
8. Feature Changes
Change ID: CHG-003 — In-room Scorecard panel: AI auto-scored mode + actor selector
| Field | Detail |
|---|---|
| Change Type | Modified component (in-room Scorecard side panel) |
| Page | /inbox — conversation right panel → Scorecard tab |
| Page Intent | A supervisor reviews each actor's quality for the open conversation |
| Before | • Manual-only: each parameter is rated with a binary 👎/👍; Total is summed from manual ratings; the subject is a single human agent; groups are the org's manual categories. • No concept of an AI subject, auto-score, graded metric, judge reason, cited source, veto, or multiple actors. |
| After | • When the conversation's responder is an AI agent, the panel renders in "Auto-scored by AI" mode: each metric shows a graded score + an expandable judge reason; Groundedness surfaces the cited KB source; veto failures (Groundedness/Policy) show a red "Failed" flag and force is_pass = false regardless of the total. • The 9 defaults appear as a "Qontak AI Quality (default)" group; org tier-2 custom params appear in their own group. • Total score is auto-computed with a pass/fail badge vs passing_grade. • A supervisor can override any auto-score → "edited by [SPV]" with audit. • An actor selector lets a room with multiple actors (the AI agent + human handler[s]) be reviewed per actor, each scored on its own segment. • Human-agent conversations are unchanged (manual mode as today). |
| Element | Before | After |
|---|---|---|
| Parameter rating control | Binary 👎/👍 (manual) | Graded score + reason (+ override) for AI; binary retained for human/manual |
| Scorecard groups | Org categories only | + "Qontak AI Quality (default)" 9-metric group + tier-2 custom group |
| Score source label | "Scored by [SPV]" implicit | "Auto-scored by AI" badge vs "Scored by [SPV]"; override → "edited by [SPV]" |
| Agent / actor selector | Single human agent | Lists every actor who served the room (AI agent + human[s]); each shows its own scorecard scored on its segment |
| Groundedness / Policy | n/a | Veto flag + cited source surfaced |
| Total score | Manual sum | Auto-computed % + pass/fail vs passing_grade |
Figma: Pending — Stitch prompt in Appendix B.
9. API & Webhook Behavior
Behavior 1: Ingest AI evaluator output and score the conversation (per actor/segment)
| Field | Detail |
|---|---|
| Entity affected | New AI agent score record (agent_scorecard + details, source = auto / actor = AI) |
| Triggered by | An AI conversation reaches a terminal exit reason (skill_completed, user_request_human_handoff, etc.) or hands off to a human; the SkillPack engine emits its 9-metric evaluator result for the AI-handled turns |
| Information passed | Org, room/conversation id, AI agent id, the 9 metric scores, exit reason, cited source refs, per-metric judge reasons, segment boundary |
| Expected behavior | • Create an AI auto-score record scoped to the AI-handled turns; compute the tier-1 weighted score. • For each org tier-2 custom param with a non-empty rubric (from Phase 1), request judge scoring and merge. • Apply veto: if a veto metric (Groundedness/Policy) fails, is_pass = false regardless of weighted total. • Compute total vs passing_grade → is_pass; persist per actor with audit. |
| Failure behavior | • Evaluator output missing/malformed → record marked "scoring unavailable"; retry up to N. • Judge call for a tier-2 param fails → that param marked "unscored", record flagged "partial", tier-1 score still stored. • Scoring never blocks live conversation handling. |
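The tier-1 and veto rules in Behavior 1 can be sketched as follows. This is an illustration under stated assumptions, not the RFC design: metric scores are assumed to be normalised to 0.0–1.0, a veto breach is assumed to arrive as a `failed` flag on the metric, and the metric codes and the `score_conversation` name are hypothetical. Because nine uniform weights of 0.11 sum to 0.99, the sketch divides by the weight sum so that a perfect conversation reaches 100%.

```python
from dataclasses import dataclass

# Hypothetical metric codes; the real codes come from the DSAI evaluator contract.
VETO_METRICS = {"groundedness", "policy"}


@dataclass
class MetricResult:
    code: str
    score: float          # assumed normalised to 0.0-1.0
    failed: bool = False  # assumed flag: the judge reported a clear breach


def score_conversation(metrics, passing_grade, weight=0.11):
    """Return (total_percent, is_pass, veto_reasons) for one actor segment."""
    total_weight = weight * len(metrics)
    weighted = sum(m.score * weight for m in metrics)
    # Normalise by the weight sum: 9 x 0.11 = 0.99, not 1.0.
    total = round(100 * weighted / total_weight, 2) if total_weight else 0.0
    veto_reasons = [m.code for m in metrics if m.code in VETO_METRICS and m.failed]
    # A veto failure forces is_pass to False regardless of the weighted total.
    is_pass = total >= passing_grade and not veto_reasons
    return total, is_pass, veto_reasons


codes = ["groundedness", "resolution", "relevance", "policy", "tone",
         "language", "handoff", "tool", "efficiency"]
clean = [MetricResult(c, 0.9) for c in codes]
print(score_conversation(clean, passing_grade=80))
```

Marking `groundedness` as failed in the same input keeps the total unchanged but flips `is_pass` to false, which is the behaviour AC-5 describes.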
Behavior 2: Override an AI auto-score (manual)
| Field | Detail |
|---|---|
| Entity affected | An AI agent score record's per-metric value + total |
| Triggered by | Supervisor/Admin edits a metric score in the in-room panel |
| Information passed | Score record id, metric id, override value, optional reason |
| Expected behavior | Persist the override; recompute total + is_pass; mark the metric "edited by [SPV]"; record via paper_trail audit |
| Failure behavior | • Override value out of range → validation error, not saved. • Unauthorized role → control not rendered. • Save fails → error + retry; scorecard_override_failed logged. |
Claude resolves during RFC: HTTP method, path, request/response JSON schema, error codes.
10. System Flow + User Stories + ACs
10.1 System Flow
Flow: AI Conversation Scored and Shown in the In-Room Panel
Type: User Journey + API Sequence
- An AI agent (SkillPack) handles a conversation in Qontak omnichannel.
- The conversation reaches a terminal exit reason or hands off to a human; the engine emits its 9-metric result scoped to the AI-handled turns.
- System ingests the output → creates an AI auto-score record (tier-1 weighted score), scoped to the AI actor's segment.
- Decision — for each org tier-2 custom param: rubric non-empty? Yes → judge scores it, merge; No → exclude.
- Decision — a veto metric (Groundedness/Policy) failed? Yes → is_pass = false regardless of total. No → compute total vs passing_grade.
- Persist per actor with paper_trail audit.
- Failure branch — evaluator output missing/malformed → mark "scoring unavailable", retry up to N, log; live handling never blocked.
- A QA Lead opens the in-room Scorecard panel; the actor selector lists the AI agent + any human handler.
- Selecting the AI actor → graded 9-metric group + tier-2 group, reasons, cited sources, veto flags, total + pass/fail badge.
- Decision — supervisor overrides a metric? Yes → recompute total/is_pass, mark "edited by [SPV]", audit. No → done.
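The tier-2 decision in the flow above (rubric non-empty → judge; empty → exclude; judge failure → "unscored" and a "partial" record) can be sketched as below. The `judge` callable, the parameter dictionaries, and the status strings other than those quoted from this PRD are hypothetical stand-ins; how tier-2 scores are merged into the total is left out because the PRD does not define it.

```python
def score_tier2(custom_params, judge):
    """Apply the tier-2 gate to each org custom param.

    custom_params: list of dicts with 'id' and 'rubric' (hypothetical shape).
    judge: callable(param) -> float; may raise when the judge call fails.
    Returns (results_by_param_id, partial_flag).
    """
    results = {}
    partial = False
    for param in custom_params:
        rubric = (param.get("rubric") or "").strip()
        if not rubric:
            # Empty rubric: excluded from auto-scoring, left for manual scoring.
            results[param["id"]] = {"status": "manual-only", "score": None}
            continue
        try:
            results[param["id"]] = {"status": "scored", "score": judge(param)}
        except Exception:
            # Judge failure never discards tier-1; it only flags the record.
            results[param["id"]] = {"status": "unscored", "score": None}
            partial = True
    return results, partial


def flaky_judge(param):
    # Stand-in judge used only for this example.
    if param["id"] == "upsell":
        raise RuntimeError("judge timeout")
    return 0.75


params = [
    {"id": "empathy", "rubric": "Acknowledges the customer's frustration."},
    {"id": "upsell", "rubric": "Offers a relevant add-on."},
    {"id": "greeting", "rubric": ""},
]
print(score_tier2(params, flaky_judge))
```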
📊 System Flow — Scoring + In-Room Panel
sequenceDiagram
participant Eng as SkillPack Engine
participant Score as Scoring Pipeline
participant Judge as Tier-2 Judge
participant Store as Scorecard Store
participant QA as QA Lead
Eng->>Score: 9-metric result (terminal exit / handoff, AI-handled turns)
Score->>Score: Tier-1 weighted score (AI actor segment)
alt custom param rubric non-empty
Score->>Judge: Score tier-2 param
Judge-->>Score: Tier-2 result
else empty rubric
Note over Score: Exclude param
end
Score->>Score: Veto check (Groundedness/Policy)
Score->>Store: Persist per actor → is_pass
Note over Score,Store: Malformed output → "scoring unavailable", retry N
QA->>Store: Open panel, select actor
Store-->>QA: Graded metrics + reasons + sources + veto + total
QA->>Store: Override a metric (optional)
Store-->>QA: Recompute + "edited by SPV" + audit
10.2 User Stories
[P2-S01] — Auto-score an AI conversation on the two-tier rubric (per actor/segment, with veto)
| User Story | As a QA Lead, I want every AI agent conversation automatically scored on the two-tier rubric, scoped to the turns the AI handled, so that I can measure AI quality without hand-scoring. |
| Before State | AI conversations aren't scored against the engine's 9-metric rubric; the existing GPT auto-scorer (auto_agent_scoring.rb) scores only the first human agent on the manual categories, and the engine's 9-metric output is discarded. |
| After Delta | On each AI conversation's terminal exit/handoff, the engine's 9-metric result is ingested into an AI auto-score record (tier-1) for the AI-handled segment, merged with tier-2 custom-param judge scores where a rubric exists, veto-checked, scored vs passing_grade, and persisted per actor. |
| Importance | Must Have |
| Mockup / Technical Notes | Figma: N/A — backend; surfaces in CHG-003 panel. Data Fields: • organization_id (string, required) — Auth session • room_id (string, required) — conversation • agent_id (string, required) — AI agent (actor) • metrics[] ({code, score}, required) — engine output (9 metrics) • exit_reason (enum, required) — engine • segment ({from_turn, to_turn}, required) — AI-handled turns • is_auto_score / passing_grade (from Phase 1 prefs). Technical Notes: Tier-1 weights uniform 0.11 (DSAI tunes later — S17 Q#4). GATE rule: tier-2 param auto-scored only if rubric non-empty (Phase 1). Veto: Groundedness/Policy failure floors is_pass. |
| Acceptance Criteria | — Happy Path — • AC-1: Given is_auto_score is ON and an AI conversation reaches a terminal exit/handoff, when the engine emits its 9-metric result, then an AI auto-score record is created with the tier-1 weighted score within 60s. • AC-2: Given an org tier-2 custom param with a non-empty rubric, when scored, then the judge scores it and the result is merged into the total. • AC-3: Given the total, when compared to passing_grade, then is_pass is true if total ≥ passing_grade and no veto metric failed (see AC-5), else false. — Edge — • AC-4: Given a tier-2 param with an EMPTY rubric, when scored, then it is excluded and marked "manual-only"; tier-1 unaffected. • AC-5: Given a veto metric (Groundedness/Policy) fails, when the total is computed, then is_pass = false regardless of the weighted total, flagged with the veto reason. • AC-6: Given a room handled by an AI agent for turns 1–N then a human from N+1, when scored, then only the AI-handled turns are evaluated and stored against the AI actor, as a separate record from any human score (keyed on (org, room_id, agent_id)). — Error / Unhappy Path — • ERR-1: Given the evaluator output is missing/malformed, when ingestion runs, then the record is marked "scoring unavailable", retried up to N, and scorecard_autoscore_failed is logged; live handling is never blocked. • ERR-2: Given a tier-2 judge call fails after retries, when scoring completes, then the param is "unscored", the record is "partial", tier-1 is still stored, and scorecard_tier2_judge_failed is logged. — Permission Model — • CAN: System (automated) for orgs with is_auto_score ON + flag ON. • CANNOT: end CS agents cannot trigger/alter auto-scores. • Unauthorized: N/A — automated pipeline. — UI States — (record surfaces in CHG-003 panel) • Loading: record shows "scoring…". • Empty: no AI conversation → no record. • Error: "scoring unavailable". • Success: total + pass/fail badge. — Negative Scenarios — • NEG-1: Given a human agent conversation, when processed, then it is NOT auto-scored by the AI evaluator. • NEG-2: Given a live, unresolved conversation, when messages are exchanged, then no per-message live score is produced. |
Dependencies: Phase 1 config (rubric + threshold); DSAI evaluator contract — see S15.
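ERR-1 depends on deciding when evaluator output counts as "missing/malformed". One possible check, using the Data Fields listed for P2-S01, is sketched below. The field names come from this PRD; the `validate_evaluator_payload` helper, the exact rules (nine metrics, numeric scores, from_turn ≤ to_turn), and the message strings are assumptions until the DSAI contract is confirmed.

```python
REQUIRED_FIELDS = (
    "organization_id", "room_id", "agent_id", "metrics", "exit_reason", "segment",
)
EXPECTED_METRIC_COUNT = 9  # the engine's 9-metric rubric


def validate_evaluator_payload(payload):
    """Return a list of problems; an empty list means the payload can be scored."""
    problems = [
        f"missing {name}"
        for name in REQUIRED_FIELDS
        if payload.get(name) in (None, "", [], {})
    ]
    metrics = payload.get("metrics") or []
    if metrics and len(metrics) != EXPECTED_METRIC_COUNT:
        problems.append(f"expected {EXPECTED_METRIC_COUNT} metrics, got {len(metrics)}")
    for metric in metrics:
        well_formed = (
            isinstance(metric, dict)
            and "code" in metric
            and isinstance(metric.get("score"), (int, float))
        )
        if not well_formed:
            problems.append("malformed metric entry")
            break
    segment = payload.get("segment") or {}
    if segment:
        start, end = segment.get("from_turn"), segment.get("to_turn")
        if not (isinstance(start, int) and isinstance(end, int) and start <= end):
            problems.append("invalid segment boundary")
    return problems


payload = {
    "organization_id": "org-1", "room_id": "room-9", "agent_id": "agent-3",
    "metrics": [{"code": f"m{i}", "score": 0.8} for i in range(9)],
    "exit_reason": "skill_completed",
    "segment": {"from_turn": 1, "to_turn": 6},
}
print(validate_evaluator_payload(payload))
```

A non-empty result would mark the record "scoring unavailable" and schedule a retry rather than raise into the live conversation path.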
[P2-S02] — View an AI agent's auto-score in the in-room Scorecard panel
| User Story | As a QA Lead, I want to see an AI agent's auto-score — metrics, reasons, sources, veto, total — in the in-room Scorecard panel, choosing the AI actor when a human also served the room, so that I can review AI quality where I already work. |
| Before State | The panel is manual binary 👎/👍, human-agent only; no AI score, no reasons, no actor selector. |
| After Delta | An "Auto-scored by AI" mode renders graded metrics + judge reasons + cited sources + veto flags + total/pass-fail, with an actor selector for multi-actor rooms (per CHG-003). |
| Importance | Must Have |
| Mockup / Technical Notes | Figma: Pending — Appendix B Stitch prompt. Data Fields: • room_id (string, required) — conversation • actor_id (string, required) — selected actor • actor_type (enum: ai / human, required) — drives panel mode • score_record (API response) — per-actor score |
| Acceptance Criteria | — Happy Path — • AC-1: Given a conversation an AI agent handled, when the QA Lead opens the Scorecard panel and selects the AI actor, then the 9-metric group + any tier-2 group render with graded scores, total %, and pass/fail badge. • AC-2: Given a metric, when expanded, then its judge reason is shown; for Groundedness, the cited KB source link is shown. • AC-3: Given a veto metric failed, when the panel renders, then a red "Failed" flag is shown and the total reflects is_pass = false. — Edge — • AC-4: Given a room served by both an AI agent and a human, when the QA Lead opens the panel, then the actor selector lists both; selecting the human shows manual categories (today's behavior), selecting the AI shows the AI group. • AC-5: Given a transcript snapshot past 90-day retention, when expanded, then scores render but the transcript reads "expired". — Error / Unhappy Path — • ERR-1: Given the AI score record is "scoring unavailable", when the panel opens, then it shows "scoring unavailable" with a retry indicator (no crash), and scorecard_panel_load_failed is logged on a hard load error. — Permission Model — • CAN: QA Lead/Supervisor (team scope), Bot/AI Builder/Admin (own agents). • CANNOT: end CS agents (cannot view others' scores). • Unauthorized: panel/actor not shown. — UI States — • Loading: "scoring…" / skeleton metrics. • Empty: "Not scored yet" before the first AI conversation completes. • Error: "scoring unavailable" + retry. • Success: graded metrics + total + pass/fail. — Negative Scenarios — • NEG-1: Given a human-only conversation, when the panel opens, then only manual mode shows (no AI group). |
Dependencies: P2-S01 (score record to display).
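The UI States for P2-S02 reduce to a small decision on `actor_type` and the score record. The sketch below shows one way to express that decision; the state names, the `status` field, and its values are hypothetical, since the API response schema is left to the RFC.

```python
def panel_state(actor_type, score_record):
    """Pick the Scorecard panel state for the selected actor.

    actor_type: "ai" or "human" (from the Data Fields above).
    score_record: dict or None; the 'status' key and its values are assumed.
    """
    if actor_type == "human":
        return "manual"                 # unchanged manual binary mode
    if score_record is None:
        return "not_scored_yet"         # Empty state
    status = score_record.get("status")
    if status == "pending":
        return "scoring"                # Loading state
    if status == "unavailable":
        return "scoring_unavailable"    # Error state, shown with retry
    return "auto_scored"                # Success: graded metrics + total


print(panel_state("ai", {"status": "scored", "total": 88.9}))
print(panel_state("human", None))
```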
[P2-S03] — Override an AI auto-score
| User Story | As a Supervisor/Admin, I want to override an AI metric score in the panel, so that I can correct a judge mistake and keep the record trustworthy. |
| Before State | AI scores would be immutable; no human correction path. |
| After Delta | A per-metric override recomputes the total/is_pass, marks the metric "edited by [SPV]", and audits the change. |
| Importance | Should Have |
| Mockup / Technical Notes | Figma: Pending. Data Fields: • score_record_id (uuid, required) — record • metric_id (string, required) — the metric overridden • override_value (float, required) — new value • reason (text, optional) — override note |
| Acceptance Criteria | — Happy Path — • AC-1: Given a Supervisor viewing an AI score, when they override a metric value within range and save, then the total + is_pass recompute and the metric shows "edited by [SPV]". • AC-2: Given an override, when saved, then it is recorded via paper_trail audit. — Edge — • AC-3: Given an override value out of range, when saved, then a validation error is shown and nothing is persisted. • AC-4: Given the existing scorecard allows only one correction (edit_count / correction_at), when a supervisor overrides an AI metric, then it consumes that single correction slot, and a further override is blocked with "already corrected". — Error / Unhappy Path — • ERR-1: Given the override save fails, when saving, then an error + Retry is shown, no partial state persists, and scorecard_override_failed is logged. — Permission Model — • CAN: Supervisor/Admin. • CANNOT: QA Lead (view only), end CS agents. • Unauthorized: override control not rendered. — UI States — • Loading: spinner on the metric during save. • Empty: N/A. • Error: as ERR-1. • Success: "edited by [SPV]" + recomputed total. — Negative Scenarios — • NEG-1: Given a QA Lead (non-supervisor), when viewing an AI score, then no override control is shown. |
Dependencies: P2-S01, P2-S02.
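The override rules in P2-S03 (range validation, a single correction slot, nothing persisted on error) can be sketched as a pure function that returns an updated copy. The record shape, the 0.0–1.0 range, and the `apply_override` name are assumptions; the sketch treats each saved override as consuming the slot, recomputes the total as a uniform-weight mean, and leaves is_pass and the veto check to the scoring step.

```python
import copy


class OverrideError(ValueError):
    """Raised when an override must be rejected; nothing is persisted."""


def apply_override(record, metric_id, value, editor, lo=0.0, hi=1.0):
    """Return an updated copy of `record`; the input is never mutated."""
    if record.get("edit_count", 0) >= 1:
        raise OverrideError("already corrected")
    if metric_id not in record["metrics"]:
        raise OverrideError("unknown metric")
    if not lo <= value <= hi:
        raise OverrideError("override value out of range")
    updated = copy.deepcopy(record)
    updated["metrics"][metric_id] = {"score": value, "edited_by": editor}
    updated["edit_count"] = 1  # consumes the single correction slot
    scores = [m["score"] for m in updated["metrics"].values()]
    # Uniform weights: the total is the mean of the metric scores, as a percent.
    updated["total"] = round(100 * sum(scores) / len(scores), 2)
    return updated


record = {
    "metrics": {"tone": {"score": 0.4}, "relevance": {"score": 0.8}},
    "edit_count": 0,
}
corrected = apply_override(record, "tone", 0.8, editor="spv-1")
print(corrected["total"], corrected["metrics"]["tone"])
```

Returning a copy keeps the "no partial state persists" guarantee simple: a caller only writes the result if the function returns without raising.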
11. Rollout
| Field | Value |
|---|---|
| Feature flag | ai_qa_unified_scorecard — default: OFF (the flag Phase 1 shipped behind; this phase turns scoring + AI panel on) |
| Stage 1 | Internal QA: 3–5 internal accounts + the 15-CID validation projects (real conversations scored) |
| Stage 2 | Closed beta: TransGo, Talenta LMS + 3 partner accounts, manually enabled |
| Stage 3 | All Professional + Enterprise on request |
| GA | All Professional + Enterprise (flag on) |
| Backward compat | Yes — the manual human scorecard is unaffected; AI mode is additive |
| Migration | None to existing records. New: AI auto-score source/actor fields on agent_scorecard. |
11.1 Semantic Regression Rollback
| Field | Detail |
|---|---|
| Model flag | ai_qa_unified_scorecard_v2_weights (default: OFF, per org); guards a weight/rubric change |
| Regression metric | Judge-vs-human agreement rate on the calibration sample |
| Rollback threshold | Agreement drops ≥10% vs the prior calibrated baseline, OR auto-score pass rate shifts ≥15% with no human-verified cause |
| Rollback path | Toggle the weights flag back to the prior version per org (no deploy); Bot/AI + DSAI; within 4 hours of alert |
| Monitoring | scorecard_autoscore_completed + judge_human_agreement reviewed daily during the calibration window |
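The rollback threshold above combines two conditions with OR. A minimal sketch of that check follows; it assumes both thresholds are measured in percentage points (the PRD does not say whether "≥10%" is absolute or relative), and the function and argument names are hypothetical.

```python
def should_rollback(baseline_agreement, current_agreement,
                    baseline_pass_rate, current_pass_rate,
                    human_verified_cause=False):
    """All inputs are percentages on a 0-100 scale (assumed percentage points)."""
    agreement_drop = baseline_agreement - current_agreement
    pass_rate_shift = abs(current_pass_rate - baseline_pass_rate)
    # Condition 1: judge-vs-human agreement dropped by at least 10 points.
    if agreement_drop >= 10:
        return True
    # Condition 2: pass rate moved by at least 15 points with no verified cause.
    return pass_rate_shift >= 15 and not human_verified_cause


print(should_rollback(90.0, 80.0, 70.0, 72.0))
print(should_rollback(90.0, 88.0, 70.0, 88.0, human_verified_cause=True))
```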
12. Observability
Key Events:
| Event Name | Trigger | Properties |
|---|---|---|
| scorecard_autoscore_completed | An AI conversation is scored | org_id, agent_id, total_score, is_pass, tier2_count, veto_failed, duration_ms |
| scorecard_autoscore_failed | Evaluator ingestion/scoring failed | org_id, room_id, reason, retry_count |
| scorecard_tier2_judge_failed | A tier-2 custom-param judge call failed | org_id, custom_param_id, reason |
| scorecard_panel_load_failed | In-room panel hard load error | org_id, room_id, reason |
| scorecard_override_saved | A supervisor overrode an AI metric | org_id, score_record_id, metric_id |
| scorecard_override_failed | Override save failed | org_id, reason |
| Field | Detail |
|---|---|
| Dashboard owner | Bot, AI & Automation (squad: BOT) |
| Alert 1 | scorecard_autoscore_failed rate > 5% of AI conversations in 1h → Slack: #bot-ai-oncall |
| Alert 2 | Judge-vs-human agreement < 85% during the calibration window → Slack: #bot-ai-quality |
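The two alert rules can be read as simple threshold checks over a one-hour window. The sketch below only illustrates the comparisons; the `evaluate_alerts` helper and its inputs are hypothetical, while the thresholds and Slack channel names are the ones in the table above.

```python
def evaluate_alerts(failed_count, ai_conversation_count, agreement_pct=None):
    """Return the Slack channels that should be notified for this window.

    failed_count / ai_conversation_count: scorecard_autoscore_failed events
    versus AI conversations in the last hour.
    agreement_pct: judge-vs-human agreement (0-100), or None outside the
    calibration window.
    """
    channels = []
    if ai_conversation_count > 0 and failed_count / ai_conversation_count > 0.05:
        channels.append("#bot-ai-oncall")   # Alert 1: failure rate above 5%
    if agreement_pct is not None and agreement_pct < 85:
        channels.append("#bot-ai-quality")  # Alert 2: agreement below 85%
    return channels


print(evaluate_alerts(failed_count=6, ai_conversation_count=100, agreement_pct=90))
```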
12.1 Post-Launch Monitoring Cadence
| Field | Detail |
|---|---|
| Review cadence | Weekly for the first 4 weeks post-GA, then monthly |
| Owner | Dimas Fauzi Hidayat (PM) + BOT squad |
| Review scope | scorecard_autoscore_completed / _failed, scorecard_override_saved, judge-vs-human agreement |
| Trigger threshold 1 | scorecard_autoscore_failed > 5% week-over-week → investigate ingestion / engine contract |
| Trigger threshold 2 | Auto-score pass rate shifts > 15% for 2 consecutive weeks with no weight change → investigate calibration drift |
| Rollback consideration | If judge-vs-human agreement < baseline − 10% and unresolved within 48h, PM reverts to the prior weights flag (S11.1). |
13. Success Metrics
Adoption & Usage:
| Metric | Definition | Baseline | Target |
|---|---|---|---|
| ⭐ AI-QA scoring coverage | % of AI agent conversations auto-scored | 0% — no AI-agent scoring (existing auto-scorer covers only the human agent) | ≥95% within 30 days of GA |
| In-room panel usage | % of enrolled accounts whose supervisors open an AI scorecard weekly | N/A — new | ≥50% within 60 days of GA |
Quality & Accuracy:
| Metric | Definition | Baseline | Target |
|---|---|---|---|
| Hallucination rate (product facts) | Share of AI answers flagged ungrounded by the judge (Groundedness veto) | Unmeasured | <2% within 60 days of GA |
| Judge-vs-human agreement | Agreement between AI auto-scores and supervisor overrides/verdicts on the calibration sample | N/A | ≥85% before weights tune (gates Phase 5) |
Efficiency & Impact:
| Metric | Definition | Baseline | Target |
|---|---|---|---|
| Manual QA effort displaced | % of scored conversations generated automatically vs. hand-scored | 0% — 100% manual | ≥70% within 90 days of GA |
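The judge-vs-human agreement metric needs a concrete definition before it can gate anything. One simple reading, assumed here rather than taken from the DSAI contract, is verdict-level agreement: the share of calibration conversations where the AI is_pass matches the supervisor's verdict. The `agreement_rate` helper is hypothetical.

```python
def agreement_rate(pairs):
    """pairs: iterable of (ai_is_pass, human_is_pass) booleans.

    Returns agreement as a percentage, or None when the sample is empty.
    """
    pairs = list(pairs)
    if not pairs:
        return None
    matches = sum(1 for ai_verdict, human_verdict in pairs if ai_verdict == human_verdict)
    return 100.0 * matches / len(pairs)


sample = [(True, True), (True, False), (False, False), (True, True)]
print(agreement_rate(sample))  # 3 of 4 verdicts match
```

A per-metric comparison within a tolerance would be a stricter alternative; which one applies is a question for the calibration design.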
14. Launch Plan & Stage Gates
| Stage | Audience | Duration | Success Gate to Advance | Owner |
|---|---|---|---|---|
| Internal Alpha | 3–5 internal QA accounts + 15-CID set | 2 weeks | 0 P0/P1; scorecard_autoscore_failed ≤5%; engine ingestion contract confirmed with DSAI; segment boundary validated | PM + QA |
| Closed Beta | TransGo, Talenta LMS + 3 partners | 3 weeks | Judge-vs-human agreement ≥85% on calibration sample; panel render ≤2s P95 | PM + BOT |
| Open Beta | All Pro+Ent on request | 3 weeks | Auto-score coverage ≥90% for enrolled; hallucination <2% on sample; no P0 for 2 weeks | Eng Lead |
| GA | All Pro+Ent | Ongoing | All Open Beta gates sustained 2 weeks; PMM launch approved | PM + PMM |
15. Dependencies
| Dependency | Owning Team | Deliverable Needed | Blocking? |
|---|---|---|---|
| SkillPack engine evaluator output contract | DSAI / engine team | Stable per-conversation 9-metric output + segment boundary + GET /models, /thread/message confirmed (Strategy doc Appendix A.3) | YES |
| Phase 1 config (settings + rubrics) | BOT (Phase 1) | is_auto_score, passing_grade, custom-param rubrics live in production | YES |
| PII hashing + credential vault (G7) | Security / Infosec | PII hashed before judge/log; plaintext credential leak closed before transcripts are stored | YES |
| Design / UX | Design squad | Frames for the in-room panel AI mode + actor selector (CHG-003) | YES |
| 9-metric weight tuning from 15-CID | DSAI | Tuned weights replacing the uniform 0.11 | NO for P2 (scores are measurement-only) · gates Phase 5 |
16. Key Decisions + Alternatives Rejected
8a — Decisions Made
| Date | Decision | Rationale |
|---|---|---|
| 2026-06-19 | Consume the engine's 9-metric evaluator output; do not build a judge in the scorecard | The judge already shipped in SkillPack; avoids two drifting judges |
| 2026-06-19 | Veto metrics (Groundedness, Policy) floor is_pass regardless of the weighted total | A single hallucination or policy breach should fail a conversation; ties to the <2% hallucination target |
| 2026-06-19 | Score each actor on its own segment; one room can hold multiple actor scorecards | AI handles first contact then hands to a human; each must be measured on the turns it handled |
| 2026-06-19 | AI scores are overridable by Supervisor/Admin with audit | Judge mistakes need a human correction path to keep the record trustworthy and the judge calibratable |
| 2026-06-19 | This phase surfaces scores as measurement only — no go-live gate | Weights are untuned (0.11); gating on them would be premature (gate is Phase 5 after tuning) |
| 2026-06-19 | AI-score override reuses the existing one-edit correction slot (edit_count / correction_at), not a new edit mechanism | The scorecard already enforces "edit once"; reusing it keeps the audit model consistent (verified in hub-chat ScorecardForm.vue) |
| 2026-06-19 | Extend the existing hub-chat building blocks rather than rebuild them: the in-room panel, the per-room GET/POST/PATCH /agent_scorecards/{roomId} API, the roomParticipant[] array, and the handover events | Verified against the cloned hub-chat agent-scorecard/ + event handlers (2026-06-19); lowers build risk |
8b — Alternatives Rejected
| Alternative | Why Rejected | Date |
|---|---|---|
| Build a separate AI-QA judge inside the scorecard tables | Duplicates the shipped 9-metric evaluator; two judges drift | 2026-06-19 |
| One judge call per parameter | Multiplies token cost per conversation; all params are scored in one call instead | 2026-06-19 |
| Make AI scores immutable (no override) | Judge mistakes would be uncorrectable and erode trust; no calibration signal | 2026-06-19 |
| Score the whole transcript regardless of actor | Unfairly credits/penalizes the AI for the human's turns (and vice versa) | 2026-06-19 |
17. Open Questions
| # | Type | Question | Owner | Deadline |
|---|---|---|---|---|
| 1 | Open Question | Confirm the exact definitions and order of the 9 engine metrics with DSAI (Appendix A is PROPOSED). | Bot/AI + DSAI | 2026-07-15 |
| 2 | Assumption | The engine exposes a stable per-conversation evaluator output incl. a segment boundary (Strategy doc Appendix A.3 lists GET /models, /thread/message as OPEN). | DSAI | 2026-07-01 |
| 3 | Risk | The AI/human segment boundary may be ambiguous, mis-attributing turns. Mitigation: delimit segments using the existing hub-chat handover events (agent_take_room / remove_agent / handover_id) + message timestamps (mechanism confirmed in code); validate against 15-CID transcripts in Internal Alpha, and confirm these events fire for the AI→human takeover case. | Bot/AI + Omnichannel | 2026-07-15 |
| 4 | Risk | The 9-metric weights are uniform 0.11 and untuned → scores may be miscalibrated. Mitigation: this phase surfaces scores as measurement only (no gate); tune from 15-CID + monitor judge-vs-human agreement before Phase 5. | DSAI | 2026-07-31 |
| 5 | Risk | Storing scored transcripts widens the blast radius of the live plaintext-credential / PII gap (Strategy doc G7). Mitigation: hash PII before judge/log; close the G7 credential leak before transcripts are stored; 90-day TTL on snapshots. | Security / Infosec | Before GA |
| 6 | Risk | The AI agent may not be a scorable actor in the inbox. In hub-chat, messages/participants are only Models::User (human) / Models::Customer; Airene is side-panel assist, not a message participant — there is no bot/AI-agent participant type. This phase assumes the SkillPack agent is a selectable, scorable actor in the room. Mitigation: confirm with Bot/Automation how the agent's turns are represented; if the bot is not a room participant, attach the AI score to the room + agent-version rather than a participant row. | Bot/AI + Omnichannel | 2026-07-15 |
| 7 | Open Question | Scorecard RBAC today is org-wide Usman inbox_scorecard_view / inbox_scorecard_manage — there is no team/division scoping (verified in hub-chat UsmanStore). The "team scope" references in this PRD require net-new scoping. Resolve: descope to org-wide for Phase 2, or add per-team scoping as an explicit requirement. | PM + Platform | 2026-07-15 |
Appendix A — AI Scoring Rubric
Status: PROPOSED — pending DSAI confirmation (Open Q#1). The 9 metrics are owned by the SkillPack engine. Tier-1 weights uniform 0.11 until DSAI tunes from 15-CID (Open Q#4). Full judging prompts live in the Phase 1 PRD Appendix A and the default-rubric viewer.
| # | Metric | What it measures | Veto? |
|---|---|---|---|
| 1 | Groundedness / factual accuracy | Claims backed by KB sources or customer data; no invented product facts | 🛑 Veto |
| 2 | Resolution / task completion | Did it resolve the goal (skill_completed signal) | — |
| 3 | Relevance / intent understanding | Addressed the real intent, not a different question | — |
| 4 | Policy & safety adherence | Stayed within "what to avoid"; no unsafe content / PII leak | 🛑 Veto |
| 5 | Tone & brand voice | Matched configured tone_of_voice; courteous | — |
| 6 | Language quality (Bahasa) | Fluent target language; no broken/mixed language | — |
| 7 | Handoff appropriateness | No false handover (Pattern A); no missed escalation | — |
| 8 | Tool / action correctness | Right action, right params, not skipped (Pattern B) | — |
| 9 | Conversation efficiency | No loops / re-asking; resolved within turn budget | — |
🛑 Veto (P2-S01/AC-5): a clear breach of metric 1 or 4 floors is_pass regardless of the weighted total. Tier-2: org custom params (added in Phase 1) are scored by the judge using their prompt rubric and merged (P2-S01/AC-2).
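The tier-1 total and the veto rule can be illustrated with a short sketch. It assumes details this appendix does not fix: each metric is scored on a 0 to 1 scale, the judge reports veto breaches as a set of metric keys, the uniform 0.11 weights are normalised by their sum (9 × 0.11 = 0.99), and the threshold value is supplied by the Phase 1 config. The metric keys and the score_ai_conversation function are hypothetical names; the tier-2 merge is left out because its rule is not specified here.

```python
from typing import Dict, Set

# Hypothetical keys for the 9 tier-1 metrics listed in the table above.
METRICS = [
    "groundedness", "resolution", "relevance", "policy_safety", "tone",
    "language_quality", "handoff", "tool_correctness", "efficiency",
]
VETO_METRICS = {"groundedness", "policy_safety"}  # metrics 1 and 4
UNIFORM_WEIGHT = 0.11  # proposed weight, pending DSAI tuning


def score_ai_conversation(
    metric_scores: Dict[str, float],
    breached: Set[str],
    pass_threshold: float,
) -> Dict[str, object]:
    """Return the weighted total and is_pass for one AI segment.

    Assumptions: scores are in [0, 1]; `breached` holds the metric keys the
    judge flagged as a clear breach; weights are normalised by their sum.
    """
    weights = {m: UNIFORM_WEIGHT for m in METRICS}
    weight_sum = sum(weights.values())
    total = sum(weights[m] * metric_scores[m] for m in METRICS) / weight_sum
    vetoed = bool(breached & VETO_METRICS)
    # A veto breach fails the conversation regardless of the weighted total.
    is_pass = (not vetoed) and total >= pass_threshold
    return {"total": total, "vetoed": vetoed, "is_pass": is_pass}
```

For example, a segment that scores 1.0 on every metric but is flagged for a groundedness breach would still return is_pass as False under this sketch, while the same scores without a breach would pass any threshold up to 1.0.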
Appendix B — Stitch UI Prompt
Generated proactively because the in-room panel AI mode is Figma: Pending. Use in Stitch; hand the output to Design as structural reference.
=== SHARED PREAMBLE ===
Product: Mekari Qontak — Omnichannel inbox
Users: QA Lead / Supervisor, Bot/AI Builder
Design tone: Enterprise B2B SaaS — dense, professional, clean white surfaces, purple accent; match the existing Qontak inbox shell
Persistent UI: left icon rail + top bar; the Scorecard is a right-hand panel over the conversation
=== END PREAMBLE ===
| # | Screen | Stitch Prompt (paste in full after the preamble) |
|---|---|---|
| 1 | In-room Scorecard panel — AI mode (CHG-003) | Screen: right-hand Scorecard panel over a conversation, "Auto-scored by AI" mode. Purpose: supervisor reviews each actor's score for this conversation. Components: an actor selector at top listing every actor who served this room (AI agent + human handler[s]) — selecting the AI actor shows "Auto-scored by AI" + the 9-metric group; selecting a human shows the org's manual categories with binary thumbs (today's behavior). For the AI actor: a "Qontak AI Quality (default)" group of 9 metrics, each a graded score chip + expandable judge reason, Groundedness showing a cited KB source link, a red "Failed" veto flag on Groundedness/Policy when breached; a separate "Custom (org)" group for tier-2 params; a per-metric override control (edit → "edited by [SPV]"); Total score % with a pass/fail badge vs threshold; Remarks. Generate states: Loading ("scoring…"); Success (graded); Error ("scoring unavailable"); Veto-failed (red banner). Do NOT include: the Analytics report, the go-live gate, binary-only thumbs as the sole control for AI rows. |
PRD CHANGELOG
| Version | Date | By | Section | Type | Summary |
|---|---|---|---|---|---|
| 1.0 | 2026-06-19 | Claude | All | CREATED | Phase 2 PRD (AI Auto-Scoring & In-Room Scorecard) — cut from the superset draft: two-tier scoring per actor/segment with veto, and the in-room panel AI mode + actor selector. |
| 1.1 | 2026-06-19 | Claude | CB, S1, S8, S13 | MODIFIED | Corrected premise vs cloned code: auto_scoring.rb "stub" → the live auto_agent_scoring.rb already scores the human agent; this phase adds the AI-agent 9-metric path alongside it. Phase count 7→8. |
| 1.2 | 2026-06-19 | Claude | S8, S16, S17 | MODIFIED | hub-chat grounding: added AI-agent-as-participant risk (Q6) + permission-scope question (Q7), answered segment-boundary with handover events (Q3), added edit-once override AC + decision, and a code-grounding decision. |