
Qontak | Chatbot & AI | Unified Agent Quality Scorecard (AI + Human) — ANCHOR

Template: ANCHOR PRD v1.2 · Companion to PRD Section Reference v1.5 + Hierarchy v1.0


HEADER BLOCK

| Field | Value |
|---|---|
| PM | Dimas Fauzi Hidayat |
| PRD Version | 1.4 |
| Status | DRAFT |
| PRD Type | ANCHOR |
| Anchor | Yes — this IS the Anchor |
| Labels | epic:qontak-chatbot-ai, module:chatbot-ai, feature:unified-agent-scorecard |
| Last Updated | 2026-06-19 |

Status values: DRAFT · ACTIVE · DEPRECATED



2. PHASE INDEX

Source of truth for all phases. Goal column must match each child Phase PRD's "Phase Goal" exactly.

| Phase | Goal | PRD Link | Epic Key | Status | Shipped |
|---|---|---|---|---|---|
| Phase 1: Scorecard Settings & Rubric Config | Build the config layer — extend is_auto_score to enable AI-agent scoring, set the AI pass threshold, and let QA leads / bot admins define the rubric (9 Qontak defaults + custom params) that will score AI agents. | Phase 1 PRD (draft — link on publish) | QC-XXXXX | 📝 Draft | |
| Phase 2: AI Auto-Scoring & In-Room Scorecard | Score every AI conversation on the two-tier rubric (per actor/segment, with veto metrics) and surface it in the in-room Scorecard panel. | Phase 2 PRD (draft — link on publish) | QC-XXXXX | 📝 Draft | |
| Phase 3: Unified Analytics Report | Ship the AI + human Agent Scorecard Report — KPI cards, per-agent trends, conversation drill-down, PDF/CSV export. | Phase 3 PRD (draft — link on publish) | QC-XXXXX | 📝 Draft | |
| Phase 4: Validation / Testing Harness | = the AI Agent builder's "Testing" surface (strategy doc — currently NOT BUILT). Run an agent against a validation set (15-CID / historical inbox replay) so the scorecard can produce pre-launch scores. Likely shared with / owned by the AI Agent builder roadmap; the scorecard scores its run outputs. | | | ⏳ Not started | |
| Phase 5: Go-Live Gate | Gate an AI agent's go-live on scorecard-pass (advisory → enforced after weight tuning), using Phase 4 validation scores. | | | ⏳ Not started | |
| Phase 6: Self-Improvement Loop | Loop judge findings + unanswered/low-confidence questions back into the KB and SkillPack (Strategy doc G5). | | | ⏳ Not started | |
| Phase 7: Calibration & Ecosystem | Productize ongoing 9-metric weight re-calibration and extend unified outcome reporting across channels and the Mekari ecosystem. | | | ⏳ Not started | |
| Phase 8: Multi-Agent Scoring & Selectable Scorecard | Score every agent who served a room — each on a selectable scorecard template — via the Agent + Scorecard selectors. (Parked — human-QA scaling, sequenced after the AI-agent core; today the panel handles only roomParticipant[0].) | Phase 8 PRD (draft — link on publish) | QC-XXXXX | 📝 Draft | |

Status options: 📝 Draft · 🔄 In Progress · ✅ Shipped · ⏸ Paused · ❌ Cancelled
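The Phase 2 goal (a weighted rubric with veto metrics, compared against the pass threshold set in Phase 1) can be illustrated with a minimal sketch. Everything named here is an assumption for illustration only: the metric names, the `veto_floor` cut-off, and the rule that a failed veto metric fails the whole scorecard are not specified in this PRD, and the real judge output belongs to DSAI.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class MetricResult:
    """One scored rubric parameter (a Qontak default or an org custom param)."""
    name: str                 # hypothetical metric name
    score: float              # normalized to the range 0.0-1.0
    weight: float             # relative weight inside the rubric
    veto: bool = False        # assumption: a failed veto metric fails the scorecard
    veto_floor: float = 0.5   # assumption: minimum acceptable score for a veto metric


def score_conversation(results: List[MetricResult], pass_threshold: float) -> Dict:
    """Combine metric results into a weighted score and a pass/fail verdict."""
    total_weight = sum(r.weight for r in results)
    if total_weight <= 0:
        raise ValueError("rubric has no positive weights")
    weighted = sum(r.score * r.weight for r in results) / total_weight
    vetoed_by = [r.name for r in results if r.veto and r.score < r.veto_floor]
    return {
        "score": round(weighted, 4),
        "vetoed_by": vetoed_by,
        # A conversation passes only if no veto fired AND the score clears the bar.
        "passed": not vetoed_by and weighted >= pass_threshold,
    }


# Example: the weighted score clears the threshold, but a veto metric still fails it.
example = score_conversation(
    [
        MetricResult("groundedness", score=0.2, weight=1.0, veto=True),
        MetricResult("tone", score=1.0, weight=1.0),
    ],
    pass_threshold=0.5,
)
print(example)
```

The design point this shows is that a veto metric is evaluated independently of the weighted average, so a high overall score cannot mask a single critical failure.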


3. One-liner + Problem

One-liner: Give QA leads and bot builders one scorecard that grades AI and human agents on the same quality lens, and gates an AI agent's go-live on a passing score.

Problem: Qontak runs two disconnected quality systems. The first is a human-QA Agent Scorecard in which supervisors hand-score agents. A GPT auto-scorer, auto_agent_scoring.rb, already fills scores for the first human agent against the manual categories when is_auto_score is on, but it does not score the AI agent, does not score against the engine's 9-metric rubric, handles only one agent, and surfaces no reasons or sources. The second is a shipped-but-headless 9-metric AI evaluator inside the SkillPack engine whose per-conversation output no one can see, because the Analytics screen is an empty placeholder. As a result, QA leads and Bot/AI builders across every Qontak omnichannel account have no way to measure whether an AI agent is actually good before or after it ships. Meanwhile, the Agent Scorecard is already the lowest-adoption feature at every tier (Enterprise 28% / Mid-Market 19% / SMB 12%) and has declined for three consecutive months. The cost: autonomous agents go live with no quality floor (hallucination rate unmeasured, containment ~0% baseline), Qontak can't counter Intercom Fin's unified "CX Score + Scorecards + Monitors" narrative, and a paid differentiator keeps bleeding adoption.


4. What Happens If We Don't Build This

  • The paid differentiator keeps dying. Agent Scorecard already has the lowest adoption at every tier and has declined three straight months — a manual-only scorecard with no AI value has no path to reverse that.
  • No measured AI quality = no competitive answer. Intercom Fin ships CX Score + Custom Scorecards + Monitors; Freshworks (80%) and Kata.ai (81%) market autonomous-resolution rates. Differentiation in 2026 is proven outcomes, not "having AI" — and we currently publish none.
  • Autonomous agents ship blind. With no go-live gate and no surfaced hallucination rate (G3, target <2%), agents reach production with no quality floor — a live risk the moment containment moves off its ~0% baseline.
  • An explicit Enterprise ask stays unanswered. NPS verbatim already requests "team-level trends over time, SLA dashboards, PDF export" — exactly the Analytics surface this initiative builds.

5. Target Users + Persona Context

Primary Persona: QA Lead / Supervisor

| Field | Detail |
|---|---|
| Role | QA Lead or Supervisor in a Qontak omnichannel account, accountable for conversation quality across both human agents and AI agents |
| Goal | Know — at a glance and at scale — whether each agent (human or AI) meets the quality bar, on one consistent rubric, without hand-scoring every conversation |
| Pain | Today they can only sample-score human agents manually; AI agent quality is invisible (the 9-metric evaluator output has no surface), so they cannot compare, trend, or gate on it |
| Workaround | Manual spot-checks of a handful of conversations in the inbox; spreadsheets for any trend view; AI agents are effectively un-QA'd |

Secondary Persona: Bot / AI Builder (Agent Owner)

| Field | Detail |
|---|---|
| Role | The Bot/AI specialist or admin who configures and ships AI agents (SkillPack) for an account |
| Goal | Ship an AI agent only when it has demonstrably passed a quality bar, and see exactly where it fails so they can fix the SkillPack or KB |
| Pain | No objective, enforced quality signal exists at go-live; they ship on intuition and find quality problems in production |
| Workaround | Ad-hoc manual testing in the preview pane; no pass/fail record, no gate, no historical trend |

6. Success Metrics (Initiative-level)

Adoption & Usage:

| Metric | Definition | Baseline | Target |
|---|---|---|---|
| ⭐ Unified Scorecard pass-rate gating coverage | % of new AI agent go-lives gated by a unified (AI + human) scorecard-pass decision | N/A — evaluator and human scorecard are separate; gate does not exist | Gate defined and applied to 100% of new AI agent go-lives within 90 days of Phase 1 GA |
| AI-QA scoring coverage | % of AI agent conversations automatically scored by the unified rubric | 0% (no AI-agent scoring today; the existing GPT auto-scorer covers only the human agent on the manual categories) | ≥95% of AI agent conversations auto-scored within 30 days of Phase 1 GA |

Quality & Accuracy:

| Metric | Definition | Baseline | Target |
|---|---|---|---|
| Hallucination rate (product facts) | Share of AI answers flagged ungrounded by the judge, surfaced in the unified scorecard | Unmeasured | <2% (via judge) within 60 days of Phase 1 GA |
| Autonomous containment rate (visible) | % of AI conversations resolved without human handoff, made visible via the Analytics surface | ~0% (old engine) — unmeasured on new engine | Measured and published from 15-CID; 15–20% in Q1 of GA, climbing |
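The two quality metrics above reduce to simple ratios once per-conversation judge output is available. The sketch below is one possible reading of the definitions, not the engine's actual schema: the `ScoredConversation` record and its fields (`answers_ungrounded`, `resolved`, `handed_off`) are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ScoredConversation:
    """Hypothetical shape of one scored AI conversation."""
    answers_ungrounded: List[bool]  # one flag per AI answer; True = judge flagged it ungrounded
    resolved: bool                  # conversation reached a resolved state
    handed_off: bool                # a human agent took over at some point


def hallucination_rate(convs: List[ScoredConversation]) -> Optional[float]:
    """Share of AI answers flagged ungrounded (answer-level, not conversation-level)."""
    flags = [flag for c in convs for flag in c.answers_ungrounded]
    if not flags:
        return None  # unmeasured: no AI answers were scored
    return sum(flags) / len(flags)


def containment_rate(convs: List[ScoredConversation]) -> Optional[float]:
    """Share of AI conversations resolved without a human handoff."""
    if not convs:
        return None
    contained = sum(1 for c in convs if c.resolved and not c.handed_off)
    return contained / len(convs)


sample = [
    ScoredConversation([False, False], resolved=True, handed_off=False),
    ScoredConversation([True, False], resolved=True, handed_off=True),
    ScoredConversation([False], resolved=False, handed_off=False),
]
print(hallucination_rate(sample), containment_rate(sample))
```

Returning `None` for an empty sample keeps "unmeasured" distinct from a true 0%, which matters because both baselines in the table are currently unmeasured rather than zero.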

Efficiency & Impact:

| Metric | Definition | Baseline | Target |
|---|---|---|---|
| Manual QA effort displaced | Share of scoring generated automatically vs. hand-scored by an SPV | 0% — 100% manual | ≥70% of scored conversations auto-generated within 90 days of Phase 1 GA |
| Agent Scorecard adoption (paid feature) | L3M adoption of the Scorecard feature across tiers | Ent 28% / MM 19% / SMB 12%, declining 3 months | Reverse the decline; +10pp at each tier within two quarters of Phase 1 GA |

7. Key Decisions + Alternatives Rejected

7a — Decisions Made

| Date | Decision | Rationale |
|---|---|---|
| 2026-06-19 | Unify the SkillPack 9-metric AI evaluator with the existing human QA Scorecard into one CRM-native lens, rather than building a separate AI-QA tool. | The judge already exists in the shipped engine; Intercom unifies CX Score + Scorecards + Monitors across AI and human; two disconnected surfaces fragment the quality story. (Strategy doc Move 3 / G2.) |
| 2026-06-19 | Adopt a two-tier rubric: Qontak-calibrated AI defaults (the 9-metric evaluator) + org-owned custom parameters (the existing scorecard_custom_parameter.prompt field). | Defaults give a trustworthy, centrally calibrated baseline; custom params give org flexibility with the org bearing calibration risk. |
| 2026-06-19 | Make scorecard-pass an enforced agent go-live gate. | Turns quality from a passive report into a hard gate — a stronger product hook than Intercom's report-only model. |
| 2026-06-19 | The AI judge stays owned by DSAI / the engine team; this initiative consumes its per-conversation output. | Avoids duplicating the evaluator and keeps a clean engineering hand-off; prevents two judges from drifting. |
| 2026-06-19 | Phase 4 validation = the AI Agent builder's "Testing" surface (strategy doc NOT BUILT), not a scorecard-specific harness; the scorecard scores its run outputs. | The Testing screen is already on the AI Agent roadmap; building a parallel harness would duplicate it. Keeps the go-live gate (P5) as a thin consumer of a shared surface. |
| 2026-06-19 | Extend the existing GPT auto-scorer, not build from a stub. The scorecard already auto-scores the human agent via auto_agent_scoring.rb (GPT, on room resolve, when is_auto_score is on); the initiative adds the AI-agent 9-metric/two-tier path alongside it and reuses this machinery. | Verified against the cloned hub-chat + chatbot code (2026-06-19). The earlier "auto-scoring is a stub" premise referenced a different, dead file (frontend_service/.../auto_scoring.rb) — the live path is gpt/omnichannel/auto_agent_scoring.rb. |
| 2026-06-19 | Multi-agent scoring + selectable per-agent scorecard (the design mock's Agent + Scorecard dropdowns) is a parked item, sequenced as Phase 8 after the AI-agent core. | Today the scorer (auto_agent_scoring.rb — "first agent") and the in-room panel (roomParticipant[0]) both handle only one agent per room; multi-agent + template selection is a separate human-QA scaling effort. |

7b — Alternatives Rejected

| Alternative | Why Rejected | Date |
|---|---|---|
| Build a new standalone AI-QA scoring engine inside the chatbot scorecard tables | Duplicates the already-shipped 9-metric evaluator; two judges would drift and disagree | 2026-06-19 |
| Keep AI and human QA as two separate surfaces | Customers (Enterprise NPS) want unified team-level reporting; Intercom unifies; split surfaces fragment the quality narrative | 2026-06-19 |
| Auto-score only default parameters; leave custom params manual | scorecard_custom_parameter already carries a prompt field designed for LLM scoring; orgs expect their own criteria evaluated too | 2026-06-19 |
| Ship the go-live gate with the untuned uniform 0.11 weights immediately | Untuned weights would gate the wrong agents; gate must start advisory and harden only after weights are tuned from 15-CID | 2026-06-19 |
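The advisory-then-enforced gate described in these decisions can be sketched as a small decision function. This is an illustration under stated assumptions, not the planned implementation: `GateMode`, `go_live_decision`, and the returned field names are hypothetical, and the only values taken from this PRD are the nine metrics and their uniform weight of roughly 0.11.

```python
from enum import Enum
from typing import Dict


class GateMode(Enum):
    ADVISORY = "advisory"   # report the verdict, never block the go-live
    ENFORCED = "enforced"   # block the go-live when the agent fails


NUM_METRICS = 9
UNIFORM_WEIGHT = 1 / NUM_METRICS  # about 0.11: the untuned default noted in this PRD


def go_live_decision(validation_score: float, pass_threshold: float, mode: GateMode) -> Dict[str, bool]:
    """Decide whether an AI agent may go live, given its validation score."""
    passed = validation_score >= pass_threshold
    blocked = mode is GateMode.ENFORCED and not passed
    return {
        "passed": passed,
        "blocked": blocked,
        # In advisory mode a failing agent still ships, but the failure is surfaced.
        "warning": not passed and not blocked,
    }


# The same failing score is a warning in advisory mode and a block once enforced.
print(go_live_decision(0.55, pass_threshold=0.70, mode=GateMode.ADVISORY))
print(go_live_decision(0.55, pass_threshold=0.70, mode=GateMode.ENFORCED))
```

Keeping the mode as an explicit input means the switch from advisory to enforced after weight tuning would be a configuration change rather than a code change.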

8. Open Questions

| # | Type | Question | Owner | Deadline |
|---|---|---|---|---|
| 1 | Open Question | Do the 9 evaluator metrics map cleanly onto existing scorecard categories/parameters, or is a translation layer required between the engine output and the scorecard data model? | Bot/AI + DSAI | 2026-07-15 |
| 2 | Assumption | The SkillPack engine exposes a stable per-conversation evaluator output (Appendix A.3 lists GET /models, /thread/message as OPEN). | DSAI | 2026-07-01 |
| 3 | Risk | The 9-metric weights are uniform 0.11 and untuned — a gate built on them may be miscalibrated and block good agents (or pass bad ones). Mitigation: ship the go-live gate in advisory mode first; enforce only after weights are tuned from the 15-CID results. | DSAI | 2026-07-31 |
| 4 | Risk | Storing scored conversation content widens the blast radius of the live plaintext-credential / PII gap (Strategy doc G7). Mitigation: hash PII before judge/log; close the G7 credential leak before the scorecard persists transcripts. | Security / Infosec | Before GA |
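The "hash PII before judge/log" mitigation in item 4 could look roughly like the sketch below, which replaces matched values with a keyed hash token before a transcript is passed on. It is a simplified illustration: the two regular expressions cover only obvious email and phone patterns and are not a complete PII detector, and the `pseudonymize` function, the token format, and the key handling are all assumptions rather than anything specified by Security / Infosec.

```python
import hashlib
import hmac
import re

# Illustrative patterns only: real PII detection needs far broader coverage.
# Both patterns live in ONE regex so that inserted tokens are never re-scanned.
PII_RE = re.compile(
    r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"   # email address
    r"|\+?\d[\d\s-]{7,}\d"                              # phone-like digit run
)


def pseudonymize(text: str, key: bytes) -> str:
    """Replace each matched PII value with a stable, keyed hash token."""
    def _token(match: "re.Match[str]") -> str:
        digest = hmac.new(key, match.group(0).encode("utf-8"), hashlib.sha256).hexdigest()
        return f"<pii:{digest[:12]}>"

    return PII_RE.sub(_token, text)


# Hypothetical key for the example; a real key would come from a secret store.
demo_key = b"example-only-key"
masked = pseudonymize("Reach me at budi@example.com or +62 812-3456-7890.", demo_key)
print(masked)
```

A keyed hash (HMAC) is used instead of a plain hash so that the same value maps to the same token for trend analysis, while someone without the key cannot confirm a guessed email or phone number by hashing it themselves.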

PRD CHANGELOG

| Version | Date | By | Section | Type | Summary |
|---|---|---|---|---|---|
| 1.0 | 2026-06-19 | Claude | All | CREATED | Initial ANCHOR PRD created from grooming session — unifies AI evaluator + human QA Scorecard, indexes Phases 1–3. |
| 1.1 | 2026-06-19 | Claude | Phase Index | MODIFIED | Re-phased the initiative per-part: split the old Phase 1 into Settings (P1), Auto-Scoring + In-Room Panel (P2), Analytics Report (P3), Validation/Testing Harness (P4), Go-Live Gate (P5); self-improvement + ecosystem become P6/P7. |
| 1.2 | 2026-06-19 | Claude | Phase Index, S7 | MODIFIED | Clarified Phase 4 = the AI Agent builder's "Testing" surface (strategy doc NOT BUILT), shared with the AI Agent roadmap; scorecard consumes its run outputs. Added matching decision. |
| 1.3 | 2026-06-19 | Claude | S1, S6, S7, Phase Index | MODIFIED | Corrected the "auto-scoring is a stub" premise after verifying the cloned hub-chat/chatbot code: auto_agent_scoring.rb already auto-scores the human agent; the initiative extends it. Added Phase 8 (parked multi-agent + selectable scorecard) and two decisions. |
| 1.4 | 2026-06-19 | Claude | Phase Index | MODIFIED | Phase 8 PRD written + linked; goal cleaned to match the Phase PRD's CB Phase Goal; status → Draft. |