Qontak | Chatbot & AI | Unified Agent Quality Scorecard — Phase 2: AI Auto-Scoring & In-Room Scorecard

Template: PHASE PRD v1.2 · Companion to PRD Section Reference v1.5 + Hierarchy v1.0
Note: Phase 2 of the Unified Agent Quality Scorecard initiative. It builds the scoring engine and the in-room panel: it consumes Phase 1 config and produces the scores that Phase 3 (report) and Phase 5 (gate) read.


HEADER BLOCK

PM: Dimas Fauzi Hidayat
PRD Version: 1.2
Status: DRAFT
PRD Type: PHASE
Epic: QC-XXXXX (add once Epic is created)
Squad: BOT (Bot, AI & Automation)
RFC Link: Pending; RFC to follow via rfc-starter
Figma Master: Pending; in-room panel AI mode not yet designed (Stitch prompt in Appendix B)
Anchor: Qontak | Chatbot & AI | Unified Agent Quality Scorecard — ANCHOR
Labels: epic:qontak-chatbot-ai | module:chatbot-ai | feature:unified-agent-scorecard
Last Updated: 2026-06-19


2. CONDITIONAL BLOCK: PHASE CONTEXT

Anchor PRD: Qontak | Chatbot & AI | Unified Agent Quality Scorecard — ANCHOR
Phase: Phase 2 of 8
Phase Goal: Score every AI conversation on the two-tier rubric (per actor/segment, with veto metrics) and surface it in the in-room Scorecard panel.
Prior phases: Phase 1, Scorecard Settings & Rubric Config (shipped): the extended is_auto_score (now also enabling AI-agent scoring), the AI pass threshold, the custom-param "AI judging rubric" editor (prompt widened string → text), and the read-only default-rubric viewer. This phase consumes all of it. Note: the human auto-scorer (auto_agent_scoring.rb) already exists; this phase adds the AI-agent 9-metric path alongside it.
Deferred to later phases: Analytics report + export → Phase 3; validation/testing harness (pre-launch scores) → Phase 4; go-live gate → Phase 5.
Cross-phase dependencies: (1) Phase 1 config must be live: scoring reads the (extended) is_auto_score, the AI pass threshold, and the custom-param rubrics. (2) DSAI evaluator output contract (Strategy doc Appendix A.3): the per-conversation 9-metric output this phase ingests.

3. One-liner + Problem

One-liner: Auto-score every AI conversation on the two-tier rubric and show each actor's score, reasons, and sources in the in-room Scorecard panel.

Problem: The SkillPack engine already grades every AI conversation with its 9-metric evaluator, but the output is discarded: no product surface reads it, so QA leads cannot see whether an AI agent is good. A GPT auto-scorer (auto_agent_scoring.rb) already fills the scorecard for the first human agent on the manual categories, but nothing scores the AI agent against the engine's 9-metric rubric, handles multiple actors, applies veto, or shows reasons and sources where supervisors work. The in-room Scorecard panel today supports only manual binary scoring of a single agent. As a result, AI quality stays invisible, hallucinations go unmeasured (no <2% floor), and the paid Scorecard still has no AI value.


4. What Happens If We Don't Ship This Phase

  • AI quality stays invisible — the engine emits the signal on every conversation but no one can see it; measured containment and hallucination rate slip another quarter past Q3 2026.
  • Phase 1's config is stranded — the rubric + threshold defined in Phase 1 (June 2026) have nothing computing against them; the investment delivers no value until this ships.
  • Phases 3 and 5 are blocked — both the Analytics report (P3) and the go-live gate (P5) read the per-conversation scores this phase produces; neither can start.

5. Target Users + Persona Context

Primary Persona: QA Lead / Supervisor

Role: QA Lead or Supervisor accountable for conversation quality across human and AI agents
Goal: See each AI agent's score per conversation, with reasons and cited sources, in the inbox panel where they already review, without hand-scoring
Pain: AI quality is invisible today; they can only manually spot-check, and the panel has no way to show a graded AI score
Workaround: Reading transcripts one by one in the inbox; no score, no trend, no reasons

Secondary Persona: Bot / AI Builder (Agent Owner)

Role: The Bot/AI specialist/admin who configures and ships AI agents (SkillPack)
Goal: See exactly where their AI agent fails (which metric, which conversation) so they can fix the SkillPack or KB
Pain: No objective per-conversation quality signal; problems are found anecdotally
Workaround: Ad-hoc manual testing in the preview pane; no record

6. Non-Goals

  1. Not the Analytics report / trends / export — the unified report surface is Phase 3.
  2. Not the go-live gate — gate decision + advisory/enforced modes are Phase 5.
  3. Not the validation/testing harness — this phase scores only real conversations; pre-launch scoring is Phase 4.
  4. Not the settings or rubric editor — Phase 1 owns config; this phase consumes it.
  5. Not tuning the 9-metric weights — DSAI dependency; weights stay uniform 0.11 here.
  6. No change to human manual scoring — the manual panel mode stays exactly as today; AI mode is added alongside.
  7. No mobile — web (Qontak omnichannel) only.
  8. Not real-time per-message scoring — scoring runs per conversation/segment at terminal exit or handoff, not as a live overlay.

7. Constraints

Platform: Web only (Qontak omnichannel web app)
Performance: Per-conversation auto-score available ≤ 60s after conversation close or handoff (async). In-room panel render ≤ 2s P95.
Data limits: Per scored conversation: 9 metric scores + tier-1/tier-2 verdicts + cited source refs + judge reason text. Transcript snapshot 90 days; aggregate scores 13 months.
Plan scope: Professional + Enterprise only. Not Starter/Free.
Feature flag: ai_qa_unified_scorecard | default: OFF. This phase turns scoring + the AI panel on for enrolled orgs (the same flag Phase 1 shipped behind).
Read/write: Read: QA Lead/Supervisor (team scope), Bot/AI Builder/Admin (own agents). Write (manual override of an AI score): Supervisor/Admin only. End CS agents: no access to others' scores.
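
The plan, flag, and setting constraints above combine into a single eligibility check. The following is a minimal sketch, assuming a hypothetical OrgContext shape; only the plan names, the flag name, and is_auto_score come from this PRD.

```python
from dataclasses import dataclass

# Plan names and the flag name come from the Constraints section.
ELIGIBLE_PLANS = frozenset({"professional", "enterprise"})
FEATURE_FLAG = "ai_qa_unified_scorecard"


@dataclass(frozen=True)
class OrgContext:
    """Hypothetical view of an org; field names are illustrative."""
    plan: str
    enabled_flags: frozenset
    is_auto_score: bool  # Phase 1 setting


def ai_scoring_enabled(org: OrgContext) -> bool:
    """True only when plan scope, feature flag, and the Phase 1 setting all allow scoring."""
    return (
        org.plan.lower() in ELIGIBLE_PLANS
        and FEATURE_FLAG in org.enabled_flags
        and org.is_auto_score
    )
```

Keeping the three conditions in one predicate makes it easy to reuse the same gate in the scoring pipeline and in the panel.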

7.1 Data Lifecycle

| Artifact Type | Retention Period | Cleanup Trigger | User-Visible Effect |
|---|---|---|---|
| AI score record (9 metrics + verdict + reasons) | 13 months | TTL from created_at; nightly cleanup cron | Panel/trend shows up to 13 months; older drops off |
| Conversation transcript snapshot (judge input) | 90 days | TTL from conversation close | After 90 days, the panel shows scores but the transcript reads "expired" |
| Cited source references (KB chunk ids) | 90 days (with transcript) | TTL from conversation close | Source links expire with the transcript |
| Failed scoring job payload | 7 days | Nightly cleanup after retry exhaustion | None (internal); surfaced as "scoring unavailable" on the record |
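
The retention rules above determine what the panel can still show for a given record. A rough sketch follows; the function name and return shape are hypothetical, and "13 months" is approximated as 396 days because the exact calendar rule is not specified here.

```python
from datetime import datetime, timedelta, timezone

TRANSCRIPT_TTL = timedelta(days=90)   # from conversation close
SCORE_TTL = timedelta(days=396)       # assumption: approximates 13 months


def panel_visibility(score_created_at, conversation_closed_at, now):
    """Report whether the score is still shown and whether the transcript has expired."""
    return {
        "score_visible": now - score_created_at < SCORE_TTL,
        "transcript_expired": now - conversation_closed_at >= TRANSCRIPT_TTL,
    }
```

For a conversation closed 100 days ago, this yields a visible score with an expired transcript, matching the "expired" state described in the table.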

8. Feature Changes

Change ID: CHG-003 — In-room Scorecard panel: AI auto-scored mode + actor selector

Change Type: Modified component (in-room Scorecard side panel)
Page: /inbox, conversation right panel → Scorecard tab
Page Intent: A supervisor reviews each actor's quality for the open conversation
Before:
• Manual-only: each parameter is rated with a binary 👎/👍; Total is summed from manual ratings; the subject is a single human agent; groups are the org's manual categories.
• No concept of an AI subject, auto-score, graded metric, judge reason, cited source, veto, or multiple actors.
After:
• When the conversation's responder is an AI agent, the panel renders in "Auto-scored by AI" mode: each metric shows a graded score + an expandable judge reason; Groundedness surfaces the cited KB source; veto failures (Groundedness/Policy) show a red "Failed" flag that floors the total.
• The 9 defaults appear as a "Qontak AI Quality (default)" group; org tier-2 custom params appear in their own group.
• Total score is auto-computed with a pass/fail badge vs passing_grade.
• A supervisor can override any auto-score → "edited by [SPV]" with audit.
• An actor selector lets a room with multiple actors (the AI agent + human handler[s]) be reviewed per actor, each scored on its own segment.
• Human-agent conversations are unchanged (manual mode as today).

| Element | Before | After |
|---|---|---|
| Parameter rating control | Binary 👎/👍 (manual) | Graded score + reason (+ override) for AI; binary retained for human/manual |
| Scorecard groups | Org categories only | + "Qontak AI Quality (default)" 9-metric group + tier-2 custom group |
| Score source label | "Scored by [SPV]" implicit | "Auto-scored by AI" badge vs "Scored by [SPV]"; override → "edited by [SPV]" |
| Agent / actor selector | Single human agent | Lists every actor who served the room (AI agent + human[s]); each shows its own scorecard scored on its segment |
| Groundedness / Policy | n/a | Veto flag + cited source surfaced |
| Total score | Manual sum | Auto-computed % + pass/fail vs passing_grade |

Figma: Pending — Stitch prompt in Appendix B.


9. API & Webhook Behavior

Behavior 1: Ingest AI evaluator output and score the conversation (per actor/segment)

Entity affected: New AI agent score record (agent_scorecard + details, source = auto / actor = AI)
Triggered by: An AI conversation reaches a terminal exit reason (skill_completed, user_request_human_handoff, etc.) or hands off to a human; the SkillPack engine emits its 9-metric evaluator result for the AI-handled turns
Information passed: Org, room/conversation id, AI agent id, the 9 metric scores, exit reason, cited source refs, per-metric judge reasons, segment boundary
Expected behavior:
• Create an AI auto-score record scoped to the AI-handled turns; compute the tier-1 weighted score.
• For each org tier-2 custom param with a non-empty rubric (from Phase 1), request judge scoring and merge.
• Apply veto: if a veto metric (Groundedness/Policy) fails, is_pass = false regardless of weighted total.
• Compute total vs passing_grade → is_pass; persist per actor with audit.
Failure behavior:
• Evaluator output missing/malformed → record marked "scoring unavailable"; retry up to N.
• Judge call for a tier-2 param fails → that param marked "unscored", record flagged "partial", tier-1 score still stored.
• Scoring never blocks live conversation handling.
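
The tier-1 computation and veto rule above can be sketched as follows. This is an illustration under stated assumptions, not the engine's contract: metric codes, the 0.0 to 1.0 score scale, the per-metric failed verdict, and a passing_grade expressed as a percentage are all hypothetical until the DSAI output contract is confirmed.

```python
from dataclasses import dataclass

# Hypothetical codes for the two veto metrics (Groundedness, Policy).
VETO_METRICS = frozenset({"groundedness", "policy"})


@dataclass(frozen=True)
class MetricResult:
    code: str             # hypothetical metric code
    score: float          # assumed normalized to 0.0..1.0
    failed: bool = False  # assumed evaluator-provided failure verdict


def score_conversation(metrics, weights, passing_grade):
    """Weighted total as a percentage, then veto, then pass/fail vs passing_grade."""
    total_weight = sum(weights[m.code] for m in metrics)
    weighted = sum(weights[m.code] * m.score for m in metrics)
    # Divide by the weight sum so uniform 0.11 weights (sum 0.99) still reach 100%.
    total = round(100.0 * weighted / total_weight, 2)
    veto_failed = sorted(m.code for m in metrics if m.code in VETO_METRICS and m.failed)
    is_pass = (not veto_failed) and total >= passing_grade
    return {"total": total, "is_pass": is_pass, "veto_failed": veto_failed}
```

Note that a failed veto metric forces is_pass to false even when the weighted total clears the threshold, which is the behavior the veto bullet describes.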

Behavior 2: Override an AI auto-score (manual)

Entity affected: An AI agent score record's per-metric value + total
Triggered by: Supervisor/Admin edits a metric score in the in-room panel
Information passed: Score record id, metric id, override value, optional reason
Expected behavior: Persist the override; recompute total + is_pass; mark the metric "edited by [SPV]"; record via paper_trail audit
Failure behavior:
• Override value out of range → validation error, not saved.
• Unauthorized role → control not rendered.
• Save fails → error + retry; scorecard_override_failed logged.
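
A minimal sketch of the override path described above, assuming a dictionary-shaped record, a 0.0 to 1.0 score range, and a caller-supplied recompute function; all of these are illustrative, and the real request schema is deferred to the RFC.

```python
ALLOWED_ROLES = frozenset({"supervisor", "admin"})
MIN_SCORE, MAX_SCORE = 0.0, 1.0  # assumption: the valid range is not specified here


def apply_override(record, metric_id, value, role, recompute):
    """Return an updated copy of the record; the input is never mutated.

    recompute is a hypothetical callable mapping scores to {"total", "is_pass"}.
    """
    if role not in ALLOWED_ROLES:
        raise PermissionError("only Supervisor/Admin may override")
    if metric_id not in record["scores"]:
        raise KeyError(metric_id)
    if not MIN_SCORE <= value <= MAX_SCORE:
        raise ValueError("override value out of range")
    scores = {**record["scores"], metric_id: value}
    edited = sorted(set(record.get("edited_metrics", [])) | {metric_id})
    return {**record, "scores": scores, "edited_metrics": edited, **recompute(scores)}
```

Because validation runs before any copy is made and the original record is left untouched, a rejected override leaves nothing half-saved.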

Claude resolves during RFC: HTTP method, path, request/response JSON schema, error codes.


10. System Flow + User Stories + ACs

10.1 System Flow

Flow: AI Conversation Scored and Shown in the In-Room Panel Type: User Journey + API Sequence

  1. An AI agent (SkillPack) handles a conversation in Qontak omnichannel.
  2. The conversation reaches a terminal exit reason or hands off to a human; the engine emits its 9-metric result scoped to the AI-handled turns.
  3. System ingests the output → creates an AI auto-score record (tier-1 weighted score), scoped to the AI actor's segment.
  4. Decision: for each org tier-2 custom param, is the rubric non-empty? Yes → judge scores it, merge; No → exclude.
  5. Decision: did a veto metric (Groundedness/Policy) fail? Yes → is_pass = false regardless of total. No → compute total vs passing_grade.
  6. Persist per actor with paper_trail audit.
  7. Failure branch: evaluator output missing/malformed → mark "scoring unavailable", retry up to N, log; live handling never blocked.
  8. A QA Lead opens the in-room Scorecard panel; the actor selector lists the AI agent + any human handler.
  9. Selecting the AI actor → graded 9-metric group + tier-2 group, reasons, cited sources, veto flags, total + pass/fail badge.
  10. Decision: does the supervisor override a metric? Yes → recompute total/is_pass, mark "edited by [SPV]", audit. No → done.
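
The failure branch in step 7 can be sketched as a bounded retry around ingestion. The retry count N is not fixed in this PRD, so it is a required parameter here; the payload field names follow the P2-S01 data fields, while the function names, the "scored" status, and the log callback are hypothetical.

```python
REQUIRED_FIELDS = ("organization_id", "room_id", "agent_id",
                   "metrics", "exit_reason", "segment")


def validate_payload(payload):
    """Raise ValueError when the evaluator output is missing or malformed."""
    if not isinstance(payload, dict):
        raise ValueError("payload is not an object")
    missing = [f for f in REQUIRED_FIELDS if f not in payload]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    metrics = payload["metrics"]
    if not isinstance(metrics, list) or len(metrics) != 9:
        raise ValueError("expected 9 metric entries")
    for entry in metrics:
        if not isinstance(entry, dict) or "code" not in entry or "score" not in entry:
            raise ValueError("metric entry needs code and score")


def ingest_with_retry(fetch_output, max_retries, log_event):
    """Try once, then retry up to max_retries times; never raise to the caller."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            payload = fetch_output()
            validate_payload(payload)
            return {"status": "scored", "payload": payload, "retry_count": attempt}
        except ValueError as exc:
            last_error = str(exc)
    log_event("scorecard_autoscore_failed",
              {"reason": last_error, "retry_count": max_retries})
    return {"status": "scoring unavailable", "retry_count": max_retries}
```

Returning a status instead of raising keeps the scoring path from ever blocking live conversation handling.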

📊 System Flow — Scoring + In-Room Panel

sequenceDiagram
participant Eng as SkillPack Engine
participant Score as Scoring Pipeline
participant Judge as Tier-2 Judge
participant Store as Scorecard Store
participant QA as QA Lead
Eng->>Score: 9-metric result (terminal exit / handoff, AI-handled turns)
Score->>Score: Tier-1 weighted score (AI actor segment)
alt custom param rubric non-empty
Score->>Judge: Score tier-2 param
Judge-->>Score: Tier-2 result
else empty rubric
Note over Score: Exclude param
end
Score->>Score: Veto check (Groundedness/Policy)
Score->>Store: Persist per actor → is_pass
Note over Score,Store: Malformed output → "scoring unavailable", retry N
QA->>Store: Open panel, select actor
Store-->>QA: Graded metrics + reasons + sources + veto + total
QA->>Store: Override a metric (optional)
Store-->>QA: Recompute + "edited by SPV" + audit

10.2 User Stories

[P2-S01] — Auto-score an AI conversation on the two-tier rubric (per actor/segment, with veto)

User Story: As a QA Lead, I want every AI agent conversation automatically scored on the two-tier rubric, scoped to the turns the AI handled, so that I can measure AI quality without hand-scoring.
Before State: AI conversations aren't scored against the engine's 9-metric rubric; the existing GPT auto-scorer (auto_agent_scoring.rb) scores only the first human agent on the manual categories, and the engine's 9-metric output is discarded.
After Delta: On each AI conversation's terminal exit/handoff, the engine's 9-metric result is ingested into an AI auto-score record (tier-1) for the AI-handled segment, merged with tier-2 custom-param judge scores where a rubric exists, veto-checked, scored vs passing_grade, and persisted per actor.
Importance: Must Have
Mockup / Technical Notes: Figma: N/A (backend); surfaces in CHG-003 panel

Data Fields:
organization_id (string, required) — Auth session
room_id (string, required) — conversation
agent_id (string, required) — AI agent (actor)
metrics[] ({code, score}, required) — engine output (9 metrics)
exit_reason (enum, required) — engine
segment ({from_turn, to_turn}, required) — AI-handled turns
is_auto_score / passing_grade (from Phase 1 prefs)

Technical Notes: Tier-1 weights uniform 0.11 (DSAI tunes later; see S17 Q#4). GATE rule: tier-2 param auto-scored only if rubric non-empty (Phase 1). Veto: Groundedness/Policy failure floors is_pass.
Acceptance Criteria:
— Happy Path —
• AC-1: Given is_auto_score is ON and an AI conversation reaches a terminal exit/handoff, when the engine emits its 9-metric result, then an AI auto-score record is created with the tier-1 weighted score within 60s.
• AC-2: Given an org tier-2 custom param with a non-empty rubric, when scored, then the judge scores it and merges into the total.
• AC-3: Given the total, when compared to passing_grade, then is_pass is true if total ≥ passing_grade else false.

— Edge —
• AC-4: Given a tier-2 param with an EMPTY rubric, when scored, then it is excluded and marked "manual-only"; tier-1 unaffected.
• AC-5: Given a veto metric (Groundedness/Policy) fails, when the total is computed, then is_pass = false regardless of the weighted total, flagged with the veto reason.
• AC-6: Given a room handled by an AI agent for turns 1–N then a human from N+1, when scored, then only the AI-handled turns are evaluated and stored against the AI actor, as a separate record from any human score, keyed by (org, room_id, agent_id).

— Error / Unhappy Path —
• ERR-1: Given the evaluator output is missing/malformed, when ingestion runs, then the record is marked "scoring unavailable", retried up to N, and scorecard_autoscore_failed is logged; live handling never blocked.
• ERR-2: Given a tier-2 judge call fails after retries, when scoring completes, then the param is "unscored", the record is "partial", tier-1 is still stored, and scorecard_tier2_judge_failed is logged.

— Permission Model —
• CAN: System (automated) for orgs with is_auto_score ON + flag ON.
• CANNOT: end CS agents cannot trigger/alter auto-scores.
• Unauthorized: N/A — automated pipeline.

— UI States — (record surfaces in CHG-003 panel)
• Loading: record shows "scoring…".
• Empty: no AI conversation → no record.
• Error: "scoring unavailable".
• Success: total + pass/fail badge.

— Negative Scenarios —
• NEG-1: Given a human agent conversation, when processed, then it is NOT auto-scored by the AI evaluator.
• NEG-2: Given a live, unresolved conversation, when messages are exchanged, then no per-message live score is produced.

Dependencies: Phase 1 config (rubric + threshold); DSAI evaluator contract — see S15.
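
The tier-2 gate (AC-2, AC-4) and the judge-failure path (ERR-2) can be illustrated with a small sketch. The judge callable, JudgeError, the parameter dictionaries, and the "scored" status are hypothetical; retries are omitted for brevity.

```python
class JudgeError(Exception):
    """Raised by the hypothetical judge client when a call fails."""


def score_tier2(params, judge):
    """Classify each custom param as scored, manual-only, or unscored.

    Returns (results, partial) where partial is True if any judge call failed.
    """
    results, partial = [], False
    for param in params:
        rubric = (param.get("rubric") or "").strip()
        if not rubric:
            # Empty rubric: excluded from auto-scoring (AC-4).
            results.append({"id": param["id"], "status": "manual-only", "score": None})
            continue
        try:
            score = judge(param["id"], rubric)
            results.append({"id": param["id"], "status": "scored", "score": score})
        except JudgeError:
            # Judge failure: param stays unscored and the record is partial (ERR-2).
            partial = True
            results.append({"id": param["id"], "status": "unscored", "score": None})
    return results, partial
```

Tier-1 is computed independently of this function, so a tier-2 failure never prevents the tier-1 score from being stored.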


[P2-S02] — View an AI agent's auto-score in the in-room Scorecard panel

User Story: As a QA Lead, I want to see an AI agent's auto-score (metrics, reasons, sources, veto, total) in the in-room Scorecard panel, choosing the AI actor when a human also served the room, so that I can review AI quality where I already work.
Before State: The panel is manual binary 👎/👍, human-agent only; no AI score, no reasons, no actor selector.
After Delta: An "Auto-scored by AI" mode renders graded metrics + judge reasons + cited sources + veto flags + total/pass-fail, with an actor selector for multi-actor rooms (per CHG-003).
Importance: Must Have
Mockup / Technical Notes: Figma: Pending (Appendix B Stitch prompt)

Data Fields:
room_id (string, required) — conversation
actor_id (string, required) — selected actor
actor_type (enum ai|human, required) — drives panel mode
score_record (API response) — per-actor score
Acceptance Criteria:
— Happy Path —
• AC-1: Given a conversation an AI agent handled, when the QA Lead opens the Scorecard panel and selects the AI actor, then the 9-metric group + any tier-2 group render with graded scores, total %, and pass/fail badge.
• AC-2: Given a metric, when expanded, then its judge reason is shown; for Groundedness, the cited KB source link is shown.
• AC-3: Given a veto metric failed, when the panel renders, then a red "Failed" flag is shown and the total reflects is_pass = false.

— Edge —
• AC-4: Given a room served by both an AI agent and a human, when the QA Lead opens the panel, then the actor selector lists both; selecting the human shows manual categories (today's behavior), selecting the AI shows the AI group.
• AC-5: Given a transcript snapshot past 90-day retention, when expanded, then scores render but the transcript reads "expired".

— Error / Unhappy Path —
• ERR-1: Given the AI score record is "scoring unavailable", when the panel opens, then it shows "scoring unavailable" with a retry indicator (no crash), and scorecard_panel_load_failed is logged on a hard load error.

— Permission Model —
• CAN: QA Lead/Supervisor (team scope), Bot/AI Builder/Admin (own agents).
• CANNOT: end CS agents (cannot view others' scores).
• Unauthorized: panel/actor not shown.

— UI States —
• Loading: "scoring…" / skeleton metrics.
• Empty: "Not scored yet" before the first AI conversation completes.
• Error: "scoring unavailable" + retry.
• Success: graded metrics + total + pass/fail.

— Negative Scenarios —
• NEG-1: Given a human-only conversation, when the panel opens, then only manual mode shows (no AI group).

Dependencies: P2-S01 (score record to display).
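
How actor_type and the record state drive the panel can be sketched as a small view resolver. The mode names, the "pending" status, and the returned dictionary shape are hypothetical; the user-facing messages are the ones listed in the UI States above.

```python
def panel_view(actor_type, record=None):
    """Resolve panel mode and UI state for the selected actor."""
    if actor_type == "human":
        # Human actors keep today's manual binary scorecard.
        return {"mode": "manual", "state": "success"}
    if actor_type != "ai":
        raise ValueError(f"unknown actor_type: {actor_type!r}")
    view = {"mode": "auto_scored_by_ai"}
    if record is None:
        return {**view, "state": "empty", "message": "Not scored yet"}
    status = record.get("status")
    if status == "pending":  # hypothetical status for an in-flight scoring job
        return {**view, "state": "loading", "message": "scoring…"}
    if status == "scoring unavailable":
        return {**view, "state": "error", "message": "scoring unavailable",
                "show_retry": True}
    return {**view, "state": "success", "total": record["total"],
            "is_pass": record["is_pass"],
            "veto_flag": bool(record.get("veto_failed"))}
```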


[P2-S03] — Override an AI auto-score

User Story: As a Supervisor/Admin, I want to override an AI metric score in the panel, so that I can correct a judge mistake and keep the record trustworthy.
Before State: AI scores would be immutable; no human correction path.
After Delta: A per-metric override recomputes the total/is_pass, marks the metric "edited by [SPV]", and audits the change.
Importance: Should Have
Mockup / Technical Notes: Figma: Pending

Data Fields:
score_record_id (uuid, required) — record
metric_id (string, required) — the metric overridden
override_value (float, required) — new value
reason (text, optional) — override note
Acceptance Criteria:
— Happy Path —
• AC-1: Given a Supervisor viewing an AI score, when they override a metric value within range and save, then the total + is_pass recompute and the metric shows "edited by [SPV]".
• AC-2: Given an override, when saved, then it is recorded via paper_trail audit.

— Edge —
• AC-3: Given an override value out of range, when saved, then a validation error is shown and nothing is persisted.
• AC-4: Given the existing scorecard allows only one correction (edit_count / correction_at), when a supervisor overrides an AI metric, then it consumes that single correction slot, and a further override is blocked with "already corrected".

— Error / Unhappy Path —
• ERR-1: Given the override save fails, when saving, then an error + Retry is shown, no partial state persists, and scorecard_override_failed is logged.

— Permission Model —
• CAN: Supervisor/Admin.
• CANNOT: QA Lead (view only), end CS agents.
• Unauthorized: override control not rendered.

— UI States —
• Loading: spinner on the metric during save.
• Empty: N/A.
• Error: as ERR-1.
• Success: "edited by [SPV]" + recomputed total.

— Negative Scenarios —
• NEG-1: Given a QA Lead (non-supervisor), when viewing an AI score, then no override control is shown.

Dependencies: P2-S01, P2-S02.
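
The edit-once rule in AC-4 can be sketched as a guard on the correction slot. The field names edit_count and correction_at come from this PRD; how the existing scorecard actually populates them is an assumption, as are the exception and function names.

```python
from datetime import datetime, timezone


class AlreadyCorrected(Exception):
    """Raised when the single correction slot has been used."""


def consume_correction_slot(record, now):
    """Return a copy of the record with the slot consumed, or raise if already used."""
    # Assumption: a used slot shows as edit_count >= 1 or a set correction_at.
    if record.get("edit_count", 0) >= 1 or record.get("correction_at") is not None:
        raise AlreadyCorrected("already corrected")
    return {**record, "edit_count": 1, "correction_at": now}
```

Checking the slot before applying the override means a blocked second edit leaves the stored scores unchanged.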


11. Rollout

Feature flag: ai_qa_unified_scorecard, default: OFF (the flag Phase 1 shipped behind; this phase turns scoring + AI panel on)
Stage 1: Internal QA: 3–5 internal accounts + the 15-CID validation projects (real conversations scored)
Stage 2: Closed beta: TransGo, Talenta LMS + 3 partner accounts, manually enabled
Stage 3: All Professional + Enterprise on request
GA: All Professional + Enterprise (flag on)
Backward compat: Yes. The manual human scorecard is unaffected; AI mode is additive.
Migration: None to existing records. New: AI auto-score source/actor fields on agent_scorecard.

11.1 Semantic Regression Rollback

Model flag: ai_qa_unified_scorecard_v2_weights | default: OFF (per org). Guards a weight/rubric change.
Regression metric: Judge-vs-human agreement rate on the calibration sample
Rollback threshold: Agreement drops ≥10% vs the prior calibrated baseline, OR auto-score pass rate shifts ≥15% with no human-verified cause
Rollback path: Toggle the weights flag back to the prior version per org (no deploy); Bot/AI + DSAI; within 4 hours of alert
Monitoring: scorecard_autoscore_completed + judge_human_agreement reviewed daily during the calibration window
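
The rollback threshold above can be expressed as a simple check. This sketch assumes the 10% and 15% figures are percentage points (the PRD does not say whether they are absolute or relative) and that agreement is the share of calibration conversations where the AI verdict matches the human verdict; the function names are hypothetical.

```python
def agreement_rate(pairs):
    """Percent of (ai_is_pass, human_is_pass) pairs that match."""
    if not pairs:
        raise ValueError("calibration sample is empty")
    matches = sum(1 for ai, human in pairs if ai == human)
    return 100.0 * matches / len(pairs)


def should_roll_back(baseline_agreement, current_agreement,
                     baseline_pass_rate, current_pass_rate,
                     human_verified_cause=False):
    """All inputs are percentages on a 0 to 100 scale."""
    agreement_drop = baseline_agreement - current_agreement
    pass_rate_shift = abs(current_pass_rate - baseline_pass_rate)
    return agreement_drop >= 10.0 or (pass_rate_shift >= 15.0 and not human_verified_cause)
```

If the thresholds turn out to be relative rather than absolute, only the two comparison lines need to change.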

12. Observability

Key Events:

| Event Name | Trigger | Properties |
|---|---|---|
| scorecard_autoscore_completed | An AI conversation is scored | org_id, agent_id, total_score, is_pass, tier2_count, veto_failed, duration_ms |
| scorecard_autoscore_failed | Evaluator ingestion/scoring failed | org_id, room_id, reason, retry_count |
| scorecard_tier2_judge_failed | A tier-2 custom-param judge call failed | org_id, custom_param_id, reason |
| scorecard_panel_load_failed | In-room panel hard load error | org_id, room_id, reason |
| scorecard_override_saved | A supervisor overrode an AI metric | org_id, score_record_id, metric_id |
| scorecard_override_failed | Override save failed | org_id, reason |

Dashboard owner: Bot, AI & Automation (squad: BOT)
Alert 1: scorecard_autoscore_failed rate > 5% of AI conversations in 1h → Slack: #bot-ai-oncall
Alert 2: Judge-vs-human agreement < 85% during the calibration window → Slack: #bot-ai-quality

12.1 Post-Launch Monitoring Cadence

Review cadence: Weekly for the first 4 weeks post-GA, then monthly
Owner: Dimas Fauzi Hidayat (PM) + BOT squad
Review scope: scorecard_autoscore_completed / _failed, scorecard_override_saved, judge-vs-human agreement
Trigger threshold 1: scorecard_autoscore_failed > 5% week-over-week → investigate ingestion / engine contract
Trigger threshold 2: Auto-score pass rate shifts > 15% for 2 consecutive weeks with no weight change → investigate calibration drift
Rollback consideration: If judge-vs-human agreement < baseline − 10% and unresolved within 48h, PM reverts to the prior weights flag (S11.1).

13. Success Metrics

Adoption & Usage:

| Metric | Definition | Baseline | Target |
|---|---|---|---|
| AI-QA scoring coverage | % of AI agent conversations auto-scored | 0%: no AI-agent scoring (existing auto-scorer covers only the human agent) | ≥95% within 30 days of GA |
| In-room panel usage | % of enrolled accounts whose supervisors open an AI scorecard weekly | N/A (new) | ≥50% within 60 days of GA |

Quality & Accuracy:

| Metric | Definition | Baseline | Target |
|---|---|---|---|
| Hallucination rate (product facts) | Share of AI answers flagged ungrounded by the judge (Groundedness veto) | Unmeasured | <2% within 60 days of GA |
| Judge-vs-human agreement | Agreement between AI auto-scores and supervisor overrides/verdicts on the calibration sample | N/A | ≥85% before weights tune (gates Phase 5) |

Efficiency & Impact:

| Metric | Definition | Baseline | Target |
|---|---|---|---|
| Manual QA effort displaced | % of scored conversations generated automatically vs. hand-scored | 0% (100% manual) | ≥70% within 90 days of GA |

14. Launch Plan & Stage Gates

| Stage | Audience | Duration | Success Gate to Advance | Owner |
|---|---|---|---|---|
| Internal Alpha | 3–5 internal QA accounts + 15-CID set | 2 weeks | 0 P0/P1; scorecard_autoscore_failed ≤5%; engine ingestion contract confirmed with DSAI; segment boundary validated | PM + QA |
| Closed Beta | TransGo, Talenta LMS + 3 partners | 3 weeks | Judge-vs-human agreement ≥85% on calibration sample; panel render ≤2s P95 | PM + BOT |
| Open Beta | All Pro+Ent on request | 3 weeks | Auto-score coverage ≥90% for enrolled; hallucination <2% on sample; no P0 for 2 weeks | Eng Lead |
| GA | All Pro+Ent | Ongoing | All Open Beta gates sustained 2 weeks; PMM launch approved | PM + PMM |

15. Dependencies

| Dependency | Owning Team | Deliverable Needed | Blocking? |
|---|---|---|---|
| SkillPack engine evaluator output contract | DSAI / engine team | Stable per-conversation 9-metric output + segment boundary + GET /models, /thread/message confirmed (Strategy doc Appendix A.3) | YES |
| Phase 1 config (settings + rubrics) | BOT (Phase 1) | is_auto_score, passing_grade, custom-param rubrics live in production | YES |
| PII hashing + credential vault (G7) | Security / Infosec | PII hashed before judge/log; plaintext credential leak closed before transcripts are stored | YES |
| Design / UX | Design squad | Frames for the in-room panel AI mode + actor selector (CHG-003) | YES |
| 9-metric weight tuning from 15-CID | DSAI | Tuned weights replacing the uniform 0.11 | NO for P2 (scores are measurement-only) · gates Phase 5 |

16. Key Decisions + Alternatives Rejected

16a. Decisions Made

| Date | Decision | Rationale |
|---|---|---|
| 2026-06-19 | Consume the engine's 9-metric evaluator output; do not build a judge in the scorecard | The judge already shipped in SkillPack; avoids two drifting judges |
| 2026-06-19 | Veto metrics (Groundedness, Policy) floor is_pass regardless of the weighted total | A single hallucination or policy breach should fail a conversation; ties to the <2% hallucination target |
| 2026-06-19 | Score each actor on its own segment; one room can hold multiple actor scorecards | AI handles first contact then hands to a human; each must be measured on the turns it handled |
| 2026-06-19 | AI scores are overridable by Supervisor/Admin with audit | Judge mistakes need a human correction path to keep the record trustworthy and the judge calibratable |
| 2026-06-19 | This phase surfaces scores as measurement only, with no go-live gate | Weights are untuned (0.11); gating on them would be premature (gate is Phase 5 after tuning) |
| 2026-06-19 | AI-score override reuses the existing one-edit correction slot (edit_count / correction_at), not a new edit mechanism | The scorecard already enforces "edit once"; reusing it keeps the audit model consistent (verified in hub-chat ScorecardForm.vue) |
| 2026-06-19 | Grounded in cloned hub-chat: extend, rather than rebuild, the in-room panel, the per-room GET/POST/PATCH /agent_scorecards/{roomId} API, the roomParticipant[] array, and the handover events | Verified against hub-chat agent-scorecard/ + event handlers (2026-06-19); lowers build risk |

16b. Alternatives Rejected

| Alternative | Why Rejected | Date |
|---|---|---|
| Build a separate AI-QA judge inside the scorecard tables | Duplicates the shipped 9-metric evaluator; two judges drift | 2026-06-19 |
| One judge call per parameter | Multiplies token cost per conversation; score all params in one call | 2026-06-19 |
| Make AI scores immutable (no override) | Judge mistakes would be uncorrectable and erode trust; no calibration signal | 2026-06-19 |
| Score the whole transcript regardless of actor | Unfairly credits/penalizes the AI for the human's turns (and vice versa) | 2026-06-19 |

17. Open Questions

| # | Type | Question | Owner | Deadline |
|---|---|---|---|---|
| 1 | Open Question | Confirm the exact definitions and order of the 9 engine metrics with DSAI (Appendix A is PROPOSED). | Bot/AI + DSAI | 2026-07-15 |
| 2 | Assumption | The engine exposes a stable per-conversation evaluator output incl. a segment boundary (Strategy doc Appendix A.3 lists GET /models, /thread/message as OPEN). | DSAI | 2026-07-01 |
| 3 | Risk | The AI/human segment boundary may be ambiguous, mis-attributing turns. Mitigation: delimit segments using the existing hub-chat handover events (agent_take_room / remove_agent / handover_id) + message timestamps (mechanism confirmed in code); validate against 15-CID transcripts in Internal Alpha, and confirm these events fire for the AI→human takeover case. | Bot/AI + Omnichannel | 2026-07-15 |
| 4 | Risk | The 9-metric weights are uniform 0.11 and untuned → scores may be miscalibrated. Mitigation: this phase surfaces scores as measurement only (no gate); tune from 15-CID + monitor judge-vs-human agreement before Phase 5. | DSAI | 2026-07-31 |
| 5 | Risk | Storing scored transcripts widens the blast radius of the live plaintext-credential / PII gap (Strategy doc G7). Mitigation: hash PII before judge/log; close the G7 credential leak before transcripts are stored; 90-day TTL on snapshots. | Security / Infosec | Before GA |
| 6 | Risk | The AI agent may not be a scorable actor in the inbox. In hub-chat, messages/participants are only Models::User (human) / Models::Customer; Airene is side-panel assist, not a message participant, so there is no bot/AI-agent participant type. This phase assumes the SkillPack agent is a selectable, scorable actor in the room. Mitigation: confirm with Bot/Automation how the agent's turns are represented; if the bot is not a room participant, attach the AI score to the room + agent-version rather than a participant row. | Bot/AI + Omnichannel | 2026-07-15 |
| 7 | Open Question | Scorecard RBAC today is org-wide Usman inbox_scorecard_view / inbox_scorecard_manage; there is no team/division scoping (verified in hub-chat UsmanStore). The "team scope" references in this PRD require net-new scoping. Resolve: descope to org-wide for Phase 2, or add per-team scoping as an explicit requirement. | PM + Platform | 2026-07-15 |
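
The mitigation in Q3 (delimiting the AI segment with handover events and message timestamps) can be illustrated with a deliberately simple sketch. It assumes the AI handles the room from turn 1, that messages arrive in turn order, and that a single takeover timestamp has already been extracted from an agent_take_room event; the real event payloads are not specified here, so the function and its inputs are hypothetical.

```python
def ai_segment(message_timestamps, takeover_at=None):
    """Return {"from_turn", "to_turn"} for the AI-handled turns, or None if there are none.

    message_timestamps: timestamps in turn order (turn 1 first).
    takeover_at: timestamp of the human takeover, or None if no handoff occurred.
    """
    ai_turns = [
        turn
        for turn, ts in enumerate(message_timestamps, start=1)
        if takeover_at is None or ts < takeover_at
    ]
    if not ai_turns:
        return None
    return {"from_turn": ai_turns[0], "to_turn": ai_turns[-1]}
```

Rooms with repeated handoffs back and forth would need a list of segments rather than a single range, which this sketch does not attempt.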

Appendix A — AI Scoring Rubric

Status: PROPOSED — pending DSAI confirmation (Open Q#1). The 9 metrics are owned by the SkillPack engine. Tier-1 weights uniform 0.11 until DSAI tunes from 15-CID (Open Q#4). Full judging prompts live in the Phase 1 PRD Appendix A and the default-rubric viewer.

| # | Metric | What it measures | Veto? |
|---|---|---|---|
| 1 | Groundedness / factual accuracy | Claims backed by KB sources or customer data; no invented product facts | 🛑 Veto |
| 2 | Resolution / task completion | Did it resolve the goal (skill_completed signal) | |
| 3 | Relevance / intent understanding | Addressed the real intent, not a different question | |
| 4 | Policy & safety adherence | Stayed within "what to avoid"; no unsafe content / PII leak | 🛑 Veto |
| 5 | Tone & brand voice | Matched configured tone_of_voice; courteous | |
| 6 | Language quality (Bahasa) | Fluent target language; no broken/mixed language | |
| 7 | Handoff appropriateness | No false handover (Pattern A); no missed escalation | |
| 8 | Tool / action correctness | Right action, right params, not skipped (Pattern B) | |
| 9 | Conversation efficiency | No loops / re-asking; resolved within turn budget | |

🛑 Veto (P2-S01/AC-5): a clear breach of metric 1 or 4 floors is_pass regardless of the weighted total. Tier-2: org custom params (added in Phase 1) are scored by the judge using their prompt rubric and merged (P2-S01/AC-2).
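
The rubric table can be held as plain data, which also makes one arithmetic detail visible: nine uniform weights of 0.11 sum to 0.99, not 1.0, so a weighted total should be normalized by the weight sum. The metric codes below are hypothetical placeholders; the names and veto flags are taken from the table.

```python
# (hypothetical code, metric name from the rubric, veto?)
RUBRIC = [
    ("groundedness", "Groundedness / factual accuracy", True),
    ("resolution", "Resolution / task completion", False),
    ("relevance", "Relevance / intent understanding", False),
    ("policy", "Policy & safety adherence", True),
    ("tone", "Tone & brand voice", False),
    ("language", "Language quality (Bahasa)", False),
    ("handoff", "Handoff appropriateness", False),
    ("tool", "Tool / action correctness", False),
    ("efficiency", "Conversation efficiency", False),
]

UNIFORM_WEIGHT = 0.11
weights = {code: UNIFORM_WEIGHT for code, _, _ in RUBRIC}
veto_codes = {code for code, _, is_veto in RUBRIC if is_veto}


def normalized(raw_weights):
    """Rescale weights so they sum to 1.0."""
    total = sum(raw_weights.values())
    return {code: w / total for code, w in raw_weights.items()}
```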


Appendix B — Stitch UI Prompt

Generated proactively because the in-room panel AI mode is Figma: Pending. Use in Stitch; hand the output to Design as structural reference.

=== SHARED PREAMBLE ===
Product: Mekari Qontak — Omnichannel inbox
Users: QA Lead / Supervisor, Bot/AI Builder
Design tone: Enterprise B2B SaaS — dense, professional, clean white surfaces, purple accent; match the existing Qontak inbox shell
Persistent UI: left icon rail + top bar; the Scorecard is a right-hand panel over the conversation
=== END PREAMBLE ===
Screen 1: In-room Scorecard panel, AI mode (CHG-003)
Stitch Prompt (paste in full after the preamble): Screen: right-hand Scorecard panel over a conversation, "Auto-scored by AI" mode. Purpose: supervisor reviews each actor's score for this conversation. Components: an actor selector at top listing every actor who served this room (AI agent + human handler[s]); selecting the AI actor shows "Auto-scored by AI" + the 9-metric group; selecting a human shows the org's manual categories with binary thumbs (today's behavior). For the AI actor: a "Qontak AI Quality (default)" group of 9 metrics, each a graded score chip + expandable judge reason, Groundedness showing a cited KB source link, a red "Failed" veto flag on Groundedness/Policy when breached; a separate "Custom (org)" group for tier-2 params; a per-metric override control (edit → "edited by [SPV]"); Total score % with a pass/fail badge vs threshold; Remarks. Generate states: Loading ("scoring…"); Success (graded); Error ("scoring unavailable"); Veto-failed (red banner). Do NOT include: the Analytics report, the go-live gate, binary-only thumbs as the sole control for AI rows.

PRD CHANGELOG

| Version | Date | By | Section | Type | Summary |
|---|---|---|---|---|---|
| 1.0 | 2026-06-19 | Claude | All | CREATED | Phase 2 PRD (AI Auto-Scoring & In-Room Scorecard), cut from the superset draft: two-tier scoring per actor/segment with veto, and the in-room panel AI mode + actor selector. |
| 1.1 | 2026-06-19 | Claude | CB, S1, S8, S13 | MODIFIED | Corrected premise vs cloned code: auto_scoring.rb "stub" → the live auto_agent_scoring.rb already scores the human agent; this phase adds the AI-agent 9-metric path alongside it. Phase count 7→8. |
| 1.2 | 2026-06-19 | Claude | S8, S16, S17 | MODIFIED | hub-chat grounding: added AI-agent-as-participant risk (Q6) + permission-scope question (Q7), answered segment-boundary with handover events (Q3), added edit-once override AC + decision, and a code-grounding decision. |