Qontak | Chatbot & AI | Unified Agent Quality Scorecard (AI + Human) — ANCHOR
Template: ANCHOR PRD v1.2 · Companion to PRD Section Reference v1.5 + Hierarchy v1.0
HEADER BLOCK
| Field | Value |
|---|---|
| PM | Dimas Fauzi Hidayat |
| PRD Version | 1.4 |
| Status | DRAFT |
| PRD Type | ANCHOR |
| Anchor | Yes — this IS the Anchor |
| Labels | epic:qontak-chatbot-ai · module:chatbot-ai · feature:unified-agent-scorecard |
| Last Updated | 2026-06-19 |
Status values: DRAFT → ACTIVE → DEPRECATED
Table of Contents
- HEADER BLOCK
- 2. PHASE INDEX
- 3. One-liner + Problem
- 4. What Happens If We Don't Build This
- 5. Target Users + Persona Context
- 6. Success Metrics (Initiative-level)
- 7. Key Decisions + Alternatives Rejected
- 8. Open Questions
- PRD CHANGELOG
2. PHASE INDEX
Source of truth for all phases. Goal column must match each child Phase PRD's "Phase Goal" exactly.
| Phase | Goal | PRD Link | Epic Key | Status | Shipped |
|---|---|---|---|---|---|
| Phase 1: Scorecard Settings & Rubric Config | Build the config layer — extend is_auto_score to enable AI-agent scoring, set the AI pass threshold, and let QA leads / bot admins define the rubric (9 Qontak defaults + custom params) that will score AI agents. | Phase 1 PRD (draft — link on publish) | QC-XXXXX | 📝 Draft | — |
| Phase 2: AI Auto-Scoring & In-Room Scorecard | Score every AI conversation on the two-tier rubric (per actor/segment, with veto metrics) and surface it in the in-room Scorecard panel. | Phase 2 PRD (draft — link on publish) | QC-XXXXX | 📝 Draft | — |
| Phase 3: Unified Analytics Report | Ship the AI + human Agent Scorecard Report — KPI cards, per-agent trends, conversation drill-down, PDF/CSV export. | Phase 3 PRD (draft — link on publish) | QC-XXXXX | 📝 Draft | — |
| Phase 4: Validation / Testing Harness | Reuse the AI Agent builder's "Testing" surface (listed in the strategy doc as NOT BUILT): run an agent against a validation set (15-CID / historical inbox replay) so the scorecard can produce pre-launch scores. Likely shared with or owned by the AI Agent builder roadmap; the scorecard scores its run outputs. | — | — | ⏳ Not started | — |
| Phase 5: Go-Live Gate | Gate an AI agent's go-live on scorecard-pass (advisory → enforced after weight tuning), using Phase 4 validation scores. | — | — | ⏳ Not started | — |
| Phase 6: Self-Improvement Loop | Loop judge findings + unanswered/low-confidence questions back into the KB and SkillPack (Strategy doc G5). | — | — | ⏳ Not started | — |
| Phase 7: Calibration & Ecosystem | Productize ongoing 9-metric weight re-calibration and extend unified outcome reporting across channels and the Mekari ecosystem. | — | — | ⏳ Not started | — |
| Phase 8: Multi-Agent Scoring & Selectable Scorecard | Score every agent who served a room — each on a selectable scorecard template — via the Agent + Scorecard selectors. (Parked — human-QA scaling, sequenced after the AI-agent core; today the panel handles only roomParticipant[0].) | Phase 8 PRD (draft — link on publish) | QC-XXXXX | 📝 Draft | — |
Status options: 📝 Draft · 🔄 In Progress · ✅ Shipped · ⏸ Paused · ❌ Cancelled
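
The Phase 1 and Phase 2 goals imply a scoring rule: combine per-metric scores with weights, apply veto metrics, and compare the result against the AI pass threshold. The Python sketch below shows one possible rule and is illustrative only. The metric names, the 0..1 score scale, the veto floors, and the rule that any veto failure fails the conversation are assumptions, since this PRD does not define them. The uniform 0.11 weight is the untuned value quoted in this PRD.

```python
from dataclasses import dataclass

# Hypothetical metric names: this PRD does not list the 9 evaluator metrics.
DEFAULT_METRICS = [f"metric_{i}" for i in range(1, 10)]
UNIFORM_WEIGHT = 0.11  # untuned uniform weight quoted in this PRD


@dataclass
class RubricResult:
    score: float   # weighted average on an assumed 0..1 scale
    vetoed: bool   # True if any veto metric fell below its floor
    passed: bool


def score_conversation(metric_scores, weights, veto_floors, pass_threshold):
    """Weighted average of per-metric scores, with an assumed veto override."""
    total_weight = sum(weights[m] for m in metric_scores)
    # Normalize so weights summing to 0.99 (9 x 0.11) still give a 0..1 score.
    score = sum(metric_scores[m] * weights[m] for m in metric_scores) / total_weight
    vetoed = any(metric_scores[m] < floor for m, floor in veto_floors.items())
    passed = (not vetoed) and score >= pass_threshold
    return RubricResult(score=score, vetoed=vetoed, passed=passed)


weights = {m: UNIFORM_WEIGHT for m in DEFAULT_METRICS}
scores = {m: 0.9 for m in DEFAULT_METRICS}
scores["metric_3"] = 0.2  # one badly failing metric

no_veto = score_conversation(scores, weights, {}, pass_threshold=0.8)
with_veto = score_conversation(scores, weights, {"metric_3": 0.5}, pass_threshold=0.8)
```

With these sample values the weighted score is about 0.82, so the conversation passes on score alone but fails once `metric_3` is treated as a veto metric.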
3. One-liner + Problem
One-liner: Give QA leads and bot builders one scorecard that grades AI and human agents on the same quality lens, and gates an AI agent's go-live on a passing score.
Problem:
Qontak runs two disconnected quality systems. The first is the human-QA Agent Scorecard: supervisors hand-score agents, and a GPT auto-scorer (auto_agent_scoring.rb) already fills scores for the first human agent against the manual categories when is_auto_score is on. That auto-scorer does not score the AI agent, does not use the engine's 9-metric rubric, handles only one agent per room, and surfaces no reasons or sources. The second is a shipped-but-headless 9-metric AI evaluator inside the SkillPack engine, whose per-conversation output no one can see because the Analytics screen is an empty placeholder. As a result, QA leads and Bot/AI builders across every Qontak omnichannel account have no way to measure whether an AI agent is actually good before or after it ships. Meanwhile, the Agent Scorecard is already the lowest-adoption feature at every tier (Enterprise 28% / Mid-Market 19% / SMB 12%) and has declined for three consecutive months. The cost: autonomous agents go live with no quality floor (hallucination rate unmeasured, containment at a ~0% baseline), Qontak can't counter Intercom Fin's unified "CX Score + Scorecards + Monitors" narrative, and a paid differentiator keeps bleeding adoption.
4. What Happens If We Don't Build This
- The paid differentiator keeps dying. Agent Scorecard already has the lowest adoption at every tier and has declined three straight months — a manual-only scorecard with no AI value has no path to reverse that.
- No measured AI quality = no competitive answer. Intercom Fin ships CX Score + Custom Scorecards + Monitors; Freshworks (80%) and Kata.ai (81%) market autonomous-resolution rates. Differentiation in 2026 is proven outcomes, not "having AI" — and we currently publish none.
- Autonomous agents ship blind. With no go-live gate and no surfaced hallucination rate (G3, target <2%), agents reach production with no quality floor — a live risk the moment containment moves off its ~0% baseline.
- An explicit Enterprise ask stays unanswered. An NPS verbatim already requests "team-level trends over time, SLA dashboards, PDF export", which is exactly the Analytics surface this initiative builds.
5. Target Users + Persona Context
Primary Persona: QA Lead / Supervisor
| Field | Detail |
|---|---|
| Role | QA Lead or Supervisor in a Qontak omnichannel account, accountable for conversation quality across both human agents and AI agents |
| Goal | Know — at a glance and at scale — whether each agent (human or AI) meets the quality bar, on one consistent rubric, without hand-scoring every conversation |
| Pain | Today they can only sample-score human agents manually; AI agent quality is invisible (the 9-metric evaluator output has no surface), so they cannot compare, trend, or gate on it |
| Workaround | Manual spot-checks of a handful of conversations in the inbox; spreadsheets for any trend view; AI agents are effectively un-QA'd |
Secondary Persona: Bot / AI Builder (Agent Owner)
| Field | Detail |
|---|---|
| Role | The Bot/AI specialist or admin who configures and ships AI agents (SkillPack) for an account |
| Goal | Ship an AI agent only when it has demonstrably passed a quality bar, and see exactly where it fails so they can fix the SkillPack or KB |
| Pain | No objective, enforced quality signal exists at go-live; they ship on intuition and find quality problems in production |
| Workaround | Ad-hoc manual testing in the preview pane; no pass/fail record, no gate, no historical trend |
6. Success Metrics (Initiative-level)
Adoption & Usage:
| Metric | Definition | Baseline | Target |
|---|---|---|---|
| ⭐ Unified Scorecard pass-rate gating coverage | % of new AI agent go-lives gated by a unified (AI + human) scorecard-pass decision | N/A — evaluator and human scorecard are separate; gate does not exist | Gate defined and applied to 100% of new AI agent go-lives within 90 days of Phase 1 GA |
| AI-QA scoring coverage | % of AI agent conversations automatically scored by the unified rubric | 0% (no AI-agent scoring today; the existing GPT auto-scorer covers only the human agent on the manual categories) | ≥95% of AI agent conversations auto-scored within 30 days of Phase 1 GA |
Quality & Accuracy:
| Metric | Definition | Baseline | Target |
|---|---|---|---|
| Hallucination rate (product facts) | Share of AI answers flagged ungrounded by the judge, surfaced in the unified scorecard | Unmeasured | <2% (via judge) within 60 days of Phase 1 GA |
| Autonomous containment rate (visible) | % of AI conversations resolved without human handoff, made visible via the Analytics surface | ~0% (old engine) — unmeasured on new engine | Measured and published from 15-CID; 15–20% in Q1 of GA, climbing |
Efficiency & Impact:
| Metric | Definition | Baseline | Target |
|---|---|---|---|
| Manual QA effort displaced | Share of scoring generated automatically vs. hand-scored by a supervisor (SPV) | 0% automatic (100% manual) | ≥70% of scored conversations auto-generated within 90 days of Phase 1 GA |
| Agent Scorecard adoption (paid feature) | L3M adoption of the Scorecard feature across tiers | Ent 28% / MM 19% / SMB 12%, declining 3 months | Reverse the decline; +10pp at each tier within two quarters of Phase 1 GA |
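
The metric definitions above reduce to simple ratios over AI conversations. The Python sketch below shows how three of them could be computed. The `AiConversation` record and its field names are hypothetical, not an existing data model, and it assumes that a conversation without a human handoff counts as contained.

```python
from dataclasses import dataclass


@dataclass
class AiConversation:
    # Hypothetical record shape; field names are illustrative only.
    auto_scored: bool         # a unified-rubric score exists
    answers: int              # AI answers in the conversation
    ungrounded_answers: int   # answers the judge flagged as ungrounded
    handed_off: bool          # escalated to a human agent


def initiative_metrics(conversations):
    """Compute coverage, hallucination rate, and containment as ratios."""
    total = len(conversations)
    answers = sum(c.answers for c in conversations)
    if total == 0 or answers == 0:
        raise ValueError("need at least one conversation with one AI answer")
    return {
        "scoring_coverage": sum(c.auto_scored for c in conversations) / total,
        "hallucination_rate": sum(c.ungrounded_answers for c in conversations) / answers,
        "containment_rate": sum(not c.handed_off for c in conversations) / total,
    }


sample = [
    AiConversation(auto_scored=True, answers=4, ungrounded_answers=0, handed_off=False),
    AiConversation(auto_scored=True, answers=5, ungrounded_answers=1, handed_off=True),
    AiConversation(auto_scored=True, answers=6, ungrounded_answers=0, handed_off=True),
    AiConversation(auto_scored=False, answers=5, ungrounded_answers=0, handed_off=True),
]
metrics = initiative_metrics(sample)
```

On this made-up sample, coverage is 75%, the hallucination rate is 5% (1 of 20 answers), and containment is 25%. The hallucination rate is taken per answer here because the definition speaks of a "share of AI answers".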
7. Key Decisions + Alternatives Rejected
7a — Decisions Made
| Date | Decision | Rationale |
|---|---|---|
| 2026-06-19 | Unify the SkillPack 9-metric AI evaluator with the existing human QA Scorecard into one CRM-native lens, rather than building a separate AI-QA tool. | The judge already exists in the shipped engine; Intercom unifies CX Score + Scorecards + Monitors across AI and human; two disconnected surfaces fragment the quality story. (Strategy doc Move 3 / G2.) |
| 2026-06-19 | Adopt a two-tier rubric: Qontak-calibrated AI defaults (the 9-metric evaluator) + org-owned custom parameters (the existing scorecard_custom_parameter.prompt field). | Defaults give a trustworthy, centrally calibrated baseline; custom params give org flexibility with the org bearing calibration risk. |
| 2026-06-19 | Make scorecard-pass an agent go-live gate: advisory at first, enforced once the weights are tuned. | Turns quality from a passive report into a hard gate, a stronger product hook than Intercom's report-only model. |
| 2026-06-19 | The AI judge stays owned by DSAI / the engine team; this initiative consumes its per-conversation output. | Avoids duplicating the evaluator and keeps a clean engineering hand-off; prevents two judges from drifting. |
| 2026-06-19 | Phase 4 validation = the AI Agent builder's "Testing" surface (marked NOT BUILT in the strategy doc), not a scorecard-specific harness; the scorecard scores its run outputs. | The Testing screen is already on the AI Agent roadmap; building a parallel harness would duplicate it. Keeps the go-live gate (P5) as a thin consumer of a shared surface. |
| 2026-06-19 | Extend the existing GPT auto-scorer rather than building from a stub. The scorecard already auto-scores the human agent via auto_agent_scoring.rb (GPT, on room resolve, when is_auto_score is on); the initiative adds the AI-agent 9-metric/two-tier path alongside it and reuses this machinery. | Verified against the cloned hub-chat + chatbot code (2026-06-19). The earlier "auto-scoring is a stub" premise referenced a different, dead file (frontend_service/.../auto_scoring.rb); the live path is gpt/omnichannel/auto_agent_scoring.rb. |
| 2026-06-19 | Multi-agent scoring + selectable per-agent scorecard (the design mock's Agent + Scorecard dropdowns) is a parked item, sequenced as Phase 8 after the AI-agent core. | Today the scorer (auto_agent_scoring.rb — "first agent") and the in-room panel (roomParticipant[0]) both handle only one agent per room; multi-agent + template selection is a separate human-QA scaling effort. |
7b — Alternatives Rejected
| Alternative | Why Rejected | Date |
|---|---|---|
| Build a new standalone AI-QA scoring engine inside the chatbot scorecard tables | Duplicates the already-shipped 9-metric evaluator; two judges would drift and disagree | 2026-06-19 |
| Keep AI and human QA as two separate surfaces | Customers (Enterprise NPS) want unified team-level reporting; Intercom unifies; split surfaces fragment the quality narrative | 2026-06-19 |
| Auto-score only default parameters; leave custom params manual | scorecard_custom_parameter already carries a prompt field designed for LLM scoring; orgs expect their own criteria evaluated too | 2026-06-19 |
| Ship the go-live gate with the untuned uniform 0.11 weights immediately | Untuned weights would gate the wrong agents; gate must start advisory and harden only after weights are tuned from 15-CID | 2026-06-19 |
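
The advisory-then-enforced gate described above can be pictured as one decision function with a mode switch. The Python sketch below is a rough illustration. `GateMode`, `go_live_decision`, and the rule that every validation run must meet the pass threshold are assumptions; this PRD does not specify how Phase 4 validation scores are aggregated.

```python
from enum import Enum


class GateMode(Enum):
    ADVISORY = "advisory"   # warn, but never block go-live
    ENFORCED = "enforced"   # block go-live on a failing scorecard


def go_live_decision(validation_scores, pass_threshold, mode):
    """Return (allowed, message) for an agent's go-live request."""
    if not validation_scores:
        raise ValueError("no validation scores to gate on")
    # Assumed rule: every validation run must meet the threshold.
    failing = [s for s in validation_scores if s < pass_threshold]
    if not failing:
        return True, "scorecard passed"
    note = f"{len(failing)} of {len(validation_scores)} validation runs below {pass_threshold}"
    if mode is GateMode.ADVISORY:
        return True, f"warning: {note}"
    return False, f"blocked: {note}"


advisory = go_live_decision([0.9, 0.7], 0.8, GateMode.ADVISORY)
enforced = go_live_decision([0.9, 0.7], 0.8, GateMode.ENFORCED)
```

Here the same failing run produces a warning in advisory mode and a block in enforced mode, so hardening the gate after weight tuning would be a mode change rather than a new code path.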
8. Open Questions
| # | Type | Question | Owner | Deadline |
|---|---|---|---|---|
| 1 | Open Question | Do the 9 evaluator metrics map cleanly onto existing scorecard categories/parameters, or is a translation layer required between the engine output and the scorecard data model? | Bot/AI + DSAI | 2026-07-15 |
| 2 | Assumption | The SkillPack engine exposes a stable per-conversation evaluator output (Appendix A.3 lists GET /models, /thread/message as OPEN). | DSAI | 2026-07-01 |
| 3 | Risk | The 9-metric weights are uniform 0.11 and untuned — a gate built on them may be miscalibrated and block good agents (or pass bad ones). Mitigation: ship the go-live gate in advisory mode first; enforce only after weights are tuned from the 15-CID results. | DSAI | 2026-07-31 |
| 4 | Risk | Storing scored conversation content widens the blast radius of the live plaintext-credential / PII gap (Strategy doc G7). Mitigation: hash PII before judge/log; close the G7 credential leak before the scorecard persists transcripts. | Security / Infosec | Before GA |
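
The mitigation for item 4 (hash PII before judge/log) could look roughly like the Python sketch below. It is a sketch under stated assumptions, not the planned implementation: the regex patterns are deliberately simple, the `pseudonymize` helper and the `<pii:...>` token format are hypothetical, and the use of a keyed hash (HMAC-SHA256) instead of a plain hash is a choice made here, not one taken from this PRD.

```python
import hashlib
import hmac
import re

# Deliberately simple patterns for illustration; real PII detection needs more.
# One combined pattern means replacement tokens are never re-scanned.
PII = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+|\+?\d[\d\s-]{7,}\d")


def pseudonymize(text, key):
    """Replace emails and phone-like numbers with keyed-hash tokens."""
    def token(match):
        digest = hmac.new(key, match.group().encode("utf-8"), hashlib.sha256)
        return f"<pii:{digest.hexdigest()[:12]}>"
    return PII.sub(token, text)


# Made-up sample transcript line and demo key.
redacted = pseudonymize("Reach me at budi@example.com or +62 812-3456-7890.", b"demo-key")
```

Because the hash is keyed and deterministic, the same value always maps to the same token within one key, which keeps conversations linkable for trend reports without storing the raw value.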
PRD CHANGELOG
| Version | Date | By | Section | Type | Summary |
|---|---|---|---|---|---|
| 1.0 | 2026-06-19 | Claude | All | CREATED | Initial ANCHOR PRD created from grooming session — unifies AI evaluator + human QA Scorecard, indexes Phases 1–3. |
| 1.1 | 2026-06-19 | Claude | Phase Index | MODIFIED | Re-phased the initiative per-part: split the old Phase 1 into Settings (P1), Auto-Scoring + In-Room Panel (P2), Analytics Report (P3), Validation/Testing Harness (P4), Go-Live Gate (P5); self-improvement + ecosystem become P6/P7. |
| 1.2 | 2026-06-19 | Claude | Phase Index, S7 | MODIFIED | Clarified Phase 4 = the AI Agent builder's "Testing" surface (strategy doc NOT BUILT), shared with the AI Agent roadmap; scorecard consumes its run outputs. Added matching decision. |
| 1.3 | 2026-06-19 | Claude | S1, S6, S7, Phase Index | MODIFIED | Corrected the "auto-scoring is a stub" premise after verifying the cloned hub-chat/chatbot code: auto_agent_scoring.rb already auto-scores the human agent; the initiative extends it. Added Phase 8 (parked multi-agent + selectable scorecard) and two decisions. |
| 1.4 | 2026-06-19 | Claude | Phase Index | MODIFIED | Phase 8 PRD written + linked; goal cleaned to match the Phase PRD's CB Phase Goal; status → Draft. |