Qontak | Chatbot & AI | Unified Agent Quality Scorecard (AI + Human) — ANCHOR

Template: ANCHOR PRD v1.2 · Companion to PRD Section Reference v1.5 + Hierarchy v1.0

HEADER BLOCK

Field	Value
PM	Dimas Fauzi Hidayat
PRD Version	1.4
Status	DRAFT
PRD Type	ANCHOR
Anchor	Yes — this IS the Anchor
Labels	`epic:qontak-chatbot-ai` | `module:chatbot-ai` | `feature:unified-agent-scorecard`
Last Updated	2026-06-19

Status values: DRAFT → ACTIVE → DEPRECATED

Table of Contents
HEADER BLOCK
2. PHASE INDEX
3. One-liner + Problem
4. What Happens If We Don't Build This
5. Target Users + Persona Context
6. Success Metrics (Initiative-level)
7. Key Decisions + Alternatives Rejected
8. Open Questions
PRD CHANGELOG

2. PHASE INDEX

Source of truth for all phases. Goal column must match each child Phase PRD's "Phase Goal" exactly.

Phase	Goal	PRD Link	Epic Key	Status	Shipped
Phase 1: Scorecard Settings & Rubric Config	Build the config layer — extend `is_auto_score` to enable AI-agent scoring, set the AI pass threshold, and let QA leads / bot admins define the rubric (9 Qontak defaults + custom params) that will score AI agents.	Phase 1 PRD (draft — link on publish)	QC-XXXXX	📝 Draft	—
Phase 2: AI Auto-Scoring & In-Room Scorecard	Score every AI conversation on the two-tier rubric (per actor/segment, with veto metrics) and surface it in the in-room Scorecard panel.	Phase 2 PRD (draft — link on publish)	QC-XXXXX	📝 Draft	—
Phase 3: Unified Analytics Report	Ship the AI + human Agent Scorecard Report — KPI cards, per-agent trends, conversation drill-down, PDF/CSV export.	Phase 3 PRD (draft — link on publish)	QC-XXXXX	📝 Draft	—
Phase 4: Validation / Testing Harness	Reuse the AI Agent builder's "Testing" surface (strategy doc: currently `NOT BUILT`) as the validation harness. Run an agent against a validation set (15-CID / historical inbox replay) so the scorecard can produce pre-launch scores. Likely shared with or owned by the AI Agent builder roadmap; the scorecard scores its run outputs.	—	—	⏳ Not started	—
Phase 5: Go-Live Gate	Gate an AI agent's go-live on scorecard-pass (advisory → enforced after weight tuning), using Phase 4 validation scores.	—	—	⏳ Not started	—
Phase 6: Self-Improvement Loop	Loop judge findings + unanswered/low-confidence questions back into the KB and SkillPack (Strategy doc G5).	—	—	⏳ Not started	—
Phase 7: Calibration & Ecosystem	Productize ongoing 9-metric weight re-calibration and extend unified outcome reporting across channels and the Mekari ecosystem.	—	—	⏳ Not started	—
Phase 8: Multi-Agent Scoring & Selectable Scorecard	Score every agent who served a room — each on a selectable scorecard template — via the Agent + Scorecard selectors. (Parked — human-QA scaling, sequenced after the AI-agent core; today the panel handles only `roomParticipant[0]`.)	Phase 8 PRD (draft — link on publish)	QC-XXXXX	📝 Draft	—

Status options: 📝 Draft · 🔄 In Progress · ✅ Shipped · ⏸ Paused · ❌ Cancelled

3. One-liner + Problem

One-liner: Give QA leads and bot builders one scorecard that grades AI and human agents on the same quality lens, and gates an AI agent's go-live on a passing score.

Problem: Qontak runs two disconnected quality systems. The first is the human-QA Agent Scorecard: supervisors hand-score agents, and a GPT auto-scorer (`auto_agent_scoring.rb`) already fills scores for the first human agent against the manual categories when `is_auto_score` is on. That auto-scorer does not score the AI agent, does not use the engine's 9-metric rubric, handles only one agent per room, and surfaces no reasons or sources. The second is a shipped-but-headless 9-metric AI evaluator inside the SkillPack engine whose per-conversation output no one can see, because the Analytics screen is an empty placeholder. As a result, QA leads and Bot/AI builders across every Qontak omnichannel account have no way to measure whether an AI agent is actually good before or after it ships. Meanwhile, the Agent Scorecard is already the lowest-adoption feature at every tier (Enterprise 28% / Mid-Market 19% / SMB 12%) and has declined for three consecutive months. The cost: autonomous agents go live with no quality floor (hallucination rate unmeasured, containment at a ~0% baseline), Qontak can't counter Intercom Fin's unified "CX Score + Scorecards + Monitors" narrative, and a paid differentiator keeps bleeding adoption.

4. What Happens If We Don't Build This

The paid differentiator keeps dying. Agent Scorecard already has the lowest adoption at every tier and has declined three straight months — a manual-only scorecard with no AI value has no path to reverse that.
No measured AI quality = no competitive answer. Intercom Fin ships CX Score + Custom Scorecards + Monitors; Freshworks (80%) and Kata.ai (81%) market autonomous-resolution rates. Differentiation in 2026 is proven outcomes, not "having AI" — and we currently publish none.
Autonomous agents ship blind. With no go-live gate and no surfaced hallucination rate (G3, target <2%), agents reach production with no quality floor — a live risk the moment containment moves off its ~0% baseline.
An explicit Enterprise ask stays unanswered. NPS verbatim already requests "team-level trends over time, SLA dashboards, PDF export" — exactly the Analytics surface this initiative builds.

5. Target Users + Persona Context

Primary Persona: QA Lead / Supervisor

Field	Detail
Role	QA Lead or Supervisor in a Qontak omnichannel account, accountable for conversation quality across both human agents and AI agents
Goal	Know — at a glance and at scale — whether each agent (human or AI) meets the quality bar, on one consistent rubric, without hand-scoring every conversation
Pain	Today they can only sample-score human agents manually; AI agent quality is invisible (the 9-metric evaluator output has no surface), so they cannot compare, trend, or gate on it
Workaround	Manual spot-checks of a handful of conversations in the inbox; spreadsheets for any trend view; AI agents are effectively un-QA'd

Secondary Persona: Bot / AI Builder (Agent Owner)

Field	Detail
Role	The Bot/AI specialist or admin who configures and ships AI agents (SkillPack) for an account
Goal	Ship an AI agent only when it has demonstrably passed a quality bar, and see exactly where it fails so they can fix the SkillPack or KB
Pain	No objective, enforced quality signal exists at go-live; they ship on intuition and find quality problems in production
Workaround	Ad-hoc manual testing in the preview pane; no pass/fail record, no gate, no historical trend

6. Success Metrics (Initiative-level)

Adoption & Usage:

Metric	Definition	Baseline	Target
⭐ Unified Scorecard pass-rate gating coverage	% of new AI agent go-lives gated by a unified (AI + human) scorecard-pass decision	N/A — evaluator and human scorecard are separate; gate does not exist	Gate defined and applied to 100% of new AI agent go-lives within 90 days of Phase 1 GA
AI-QA scoring coverage	% of AI agent conversations automatically scored by the unified rubric	0% (no AI-agent scoring today; the existing GPT auto-scorer covers only the human agent on the manual categories)	≥95% of AI agent conversations auto-scored within 30 days of Phase 1 GA

Quality & Accuracy:

Metric	Definition	Baseline	Target
Hallucination rate (product facts)	Share of AI answers flagged ungrounded by the judge, surfaced in the unified scorecard	Unmeasured	<2% (via judge) within 60 days of Phase 1 GA
Autonomous containment rate (visible)	% of AI conversations resolved without human handoff, made visible via the Analytics surface	~0% (old engine) — unmeasured on new engine	Measured and published from 15-CID; 15–20% in Q1 of GA, climbing

Efficiency & Impact:

Metric	Definition	Baseline	Target
Manual QA effort displaced	Share of scoring generated automatically vs. hand-scored by an SPV	0% — 100% manual	≥70% of scored conversations auto-generated within 90 days of Phase 1 GA
Agent Scorecard adoption (paid feature)	L3M adoption of the Scorecard feature across tiers	Ent 28% / MM 19% / SMB 12%, declining 3 months	Reverse the decline; +10pp at each tier within two quarters of Phase 1 GA
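
The coverage, hallucination, and containment definitions in the tables above can be sketched as simple ratios. This is an illustrative sketch only: the `ScoredConversation` record and all of its field names are hypothetical, not the scorecard's real data model, and it assumes the judge reports a per-conversation count of ungrounded answers.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ScoredConversation:
    """Hypothetical record; field names are illustrative, not the real schema."""
    is_ai_agent: bool          # conversation was served by an AI agent
    auto_scored: bool          # the unified rubric produced a score automatically
    ai_answers: int            # number of AI answers in the conversation
    ungrounded_answers: int    # answers the judge flagged as ungrounded
    handed_off_to_human: bool  # conversation was escalated to a human agent


def pct(numerator: int, denominator: int) -> float:
    """Percentage rounded to one decimal; 0.0 when there is nothing to measure."""
    return round(100.0 * numerator / denominator, 1) if denominator else 0.0


def initiative_metrics(conversations: Iterable[ScoredConversation]) -> dict:
    ai = [c for c in conversations if c.is_ai_agent]
    scored = [c for c in ai if c.auto_scored]
    return {
        # % of AI agent conversations automatically scored
        "ai_qa_scoring_coverage_pct": pct(len(scored), len(ai)),
        # share of AI answers flagged ungrounded, over scored conversations only
        "hallucination_rate_pct": pct(
            sum(c.ungrounded_answers for c in scored),
            sum(c.ai_answers for c in scored),
        ),
        # % of AI conversations that ended without a human handoff
        "containment_rate_pct": pct(
            sum(not c.handed_off_to_human for c in ai), len(ai)
        ),
    }
```

Restricting the hallucination rate to auto-scored conversations is a design assumption: an unscored conversation has no judge verdict, so counting it would understate the rate.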

7. Key Decisions + Alternatives Rejected

7a. Decisions Made

Date	Decision	Rationale
2026-06-19	Unify the SkillPack 9-metric AI evaluator with the existing human QA Scorecard into one CRM-native lens, rather than building a separate AI-QA tool.	The judge already exists in the shipped engine; Intercom unifies CX Score + Scorecards + Monitors across AI and human; two disconnected surfaces fragment the quality story. (Strategy doc Move 3 / G2.)
2026-06-19	Adopt a two-tier rubric: Qontak-calibrated AI defaults (the 9-metric evaluator) + org-owned custom parameters (the existing `scorecard_custom_parameter.prompt` field).	Defaults give a trustworthy, centrally calibrated baseline; custom params give org flexibility with the org bearing calibration risk.
2026-06-19	Make scorecard-pass an enforced agent go-live gate.	Turns quality from a passive report into a hard gate — a stronger product hook than Intercom's report-only model.
2026-06-19	The AI judge stays owned by DSAI / the engine team; this initiative consumes its per-conversation output.	Avoids duplicating the evaluator and keeps a clean engineering hand-off; prevents two judges from drifting.
2026-06-19	Phase 4 validation = the AI Agent builder's "Testing" surface (strategy doc `NOT BUILT`), not a scorecard-specific harness; the scorecard scores its run outputs.	The Testing screen is already on the AI Agent roadmap; building a parallel harness would duplicate it. Keeps the go-live gate (P5) as a thin consumer of a shared surface.
2026-06-19	Extend the existing GPT auto-scorer rather than building from a stub. The scorecard already auto-scores the human agent via `auto_agent_scoring.rb` (GPT, on room resolve, when `is_auto_score` is on); the initiative adds the AI-agent 9-metric/two-tier path alongside it and reuses this machinery.	Verified against the cloned `hub-chat` + `chatbot` code (2026-06-19). The earlier "auto-scoring is a stub" premise referenced a different, dead file (`frontend_service/.../auto_scoring.rb`); the live path is `gpt/omnichannel/auto_agent_scoring.rb`.
2026-06-19	Multi-agent scoring + selectable per-agent scorecard (the design mock's Agent + Scorecard dropdowns) is a parked item, sequenced as Phase 8 after the AI-agent core.	Today the scorer (`auto_agent_scoring.rb` — "first agent") and the in-room panel (`roomParticipant[0]`) both handle only one agent per room; multi-agent + template selection is a separate human-QA scaling effort.
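
A minimal sketch of how the two-tier rubric (Qontak defaults plus org-owned custom parameters) and veto metrics could combine into one pass/fail result. Everything here is an assumption for illustration: the `MetricResult` fields, the placeholder metric names, the veto-floor semantics, and the normalization by total weight are hypothetical and do not describe the SkillPack evaluator's actual output or formula. The uniform 0.11 weight is the untuned default noted in this PRD.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class MetricResult:
    """Hypothetical per-metric judge output, normalized to 0.0-1.0."""
    name: str
    score: float
    weight: float
    is_veto: bool = False     # assumption: a veto metric can fail the whole scorecard
    veto_floor: float = 0.0   # assumption: a veto fires when score drops below this


def score_conversation(
    defaults: Iterable[MetricResult],
    custom: Iterable[MetricResult],
    pass_threshold: float,
) -> dict:
    """Weighted average over both tiers; any triggered veto forces a fail."""
    results = [*defaults, *custom]
    total_weight = sum(m.weight for m in results)
    if total_weight <= 0:
        raise ValueError("rubric needs at least one positively weighted metric")
    # Normalize by total weight so nine 0.11 weights (sum 0.99) still span 0-1.
    weighted = sum(m.score * m.weight for m in results) / total_weight
    vetoed_by = [m.name for m in results if m.is_veto and m.score < m.veto_floor]
    return {
        "score": round(weighted, 3),
        "passed": weighted >= pass_threshold and not vetoed_by,
        "vetoed_by": vetoed_by,
    }
```

Returning `vetoed_by` alongside the score keeps the failure reason visible, which matters because the existing auto-scorer surfaces no reasons or sources.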

7b. Alternatives Rejected

Alternative	Why Rejected	Date
Build a new standalone AI-QA scoring engine inside the chatbot scorecard tables	Duplicates the already-shipped 9-metric evaluator; two judges would drift and disagree	2026-06-19
Keep AI and human QA as two separate surfaces	Customers (Enterprise NPS) want unified team-level reporting; Intercom unifies; split surfaces fragment the quality narrative	2026-06-19
Auto-score only default parameters; leave custom params manual	`scorecard_custom_parameter` already carries a `prompt` field designed for LLM scoring; orgs expect their own criteria evaluated too	2026-06-19
Ship the go-live gate with the untuned uniform 0.11 weights immediately	Untuned weights would gate the wrong agents; gate must start advisory and harden only after weights are tuned from 15-CID	2026-06-19

8. Open Questions

#	Type	Question	Owner	Deadline
1	Open Question	Do the 9 evaluator metrics map cleanly onto existing scorecard categories/parameters, or is a translation layer required between the engine output and the scorecard data model?	Bot/AI + DSAI	2026-07-15
2	Assumption	The SkillPack engine exposes a stable per-conversation evaluator output (Appendix A.3 lists `GET /models`, `/thread/message` as OPEN).	DSAI	2026-07-01
3	Risk	The 9-metric weights are uniform 0.11 and untuned — a gate built on them may be miscalibrated and block good agents (or pass bad ones). Mitigation: ship the go-live gate in advisory mode first; enforce only after weights are tuned from the 15-CID results.	DSAI	2026-07-31
4	Risk	Storing scored conversation content widens the blast radius of the live plaintext-credential / PII gap (Strategy doc G7). Mitigation: hash PII before judge/log; close the G7 credential leak before the scorecard persists transcripts.	Security / Infosec	Before GA
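
The advisory-then-enforced mitigation for the untuned weights can be sketched as a small decision function. This is a hedged illustration, not a specified design: `GateMode`, `go_live_decision`, and the returned keys are hypothetical names, and the sketch assumes the validation run has already been reduced to a single pass/fail boolean.

```python
from enum import Enum


class GateMode(Enum):
    ADVISORY = "advisory"   # report the failure but never block go-live
    ENFORCED = "enforced"   # block go-live when validation fails


def go_live_decision(validation_passed: bool, mode: GateMode) -> dict:
    """Decide whether an AI agent may go live under the given gate mode."""
    blocked = mode is GateMode.ENFORCED and not validation_passed
    return {
        "allow_go_live": not blocked,
        # Warn in both modes so an advisory failure is still visible to the builder.
        "show_warning": not validation_passed,
        "mode": mode.value,
    }
```

In advisory mode a failing agent still ships, but the warning leaves a record that can be compared against outcomes once the weights are tuned from the 15-CID results.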

PRD CHANGELOG

Version	Date	By	Section	Type	Summary
1.0	2026-06-19	Claude	All	CREATED	Initial ANCHOR PRD created from grooming session — unifies AI evaluator + human QA Scorecard, indexes Phases 1–3.
1.1	2026-06-19	Claude	Phase Index	MODIFIED	Re-phased the initiative per-part: split the old Phase 1 into Settings (P1), Auto-Scoring + In-Room Panel (P2), Analytics Report (P3), Validation/Testing Harness (P4), Go-Live Gate (P5); self-improvement + ecosystem become P6/P7.
1.2	2026-06-19	Claude	Phase Index, S7	MODIFIED	Clarified Phase 4 = the AI Agent builder's "Testing" surface (strategy doc NOT BUILT), shared with the AI Agent roadmap; scorecard consumes its run outputs. Added matching decision.
1.3	2026-06-19	Claude	S1, S6, S7, Phase Index	MODIFIED	Corrected the "auto-scoring is a stub" premise after verifying the cloned hub-chat/chatbot code: `auto_agent_scoring.rb` already auto-scores the human agent; the initiative extends it. Added Phase 8 (parked multi-agent + selectable scorecard) and two decisions.
1.4	2026-06-19	Claude	Phase Index	MODIFIED	Phase 8 PRD written + linked; goal cleaned to match the Phase PRD's CB Phase Goal; status → Draft.
