Qontak | Chatbot & AI | Unified Agent Quality Scorecard — Phase 1: Scorecard Settings & Rubric Config
Template: NEW PRD v1.2 · Companion to PRD Section Reference v1.5 + Hierarchy v1.0
Note: Phase 1 of the Unified Agent Quality Scorecard initiative. Builds the config layer only — no scoring, in-room panel, report, or gate (those are Phases 2–5). The detailed superset draft is preserved at unified_agent_scorecard_SUPERSET_allphases_19Jun.md.
HEADER BLOCK
| Field | Value |
|---|---|
| PM | Dimas Fauzi Hidayat |
| PRD Version | 1.2 |
| Status | DRAFT |
| PRD Type | NEW |
| Epic | QC-XXXXX — add once Epic is created |
| Squad | BOT — Bot, AI & Automation |
| RFC Link | Pending — RFC to follow via rfc-starter |
| Figma Master | Pending — settings + rubric editor not yet designed (Stitch prompts in Appendix B) |
| Anchor | Yes — Qontak \| Chatbot & AI \| Unified Agent Quality Scorecard — ANCHOR |
| Labels | epic:qontak-chatbot-ai \| module:chatbot-ai \| feature:unified-agent-scorecard |
| Last Updated | 2026-06-19 |
Table of Contents
- HEADER BLOCK
- 2. One-liner + Problem
- 3. What Happens If We Don't Build This
- 4. Target Users + Persona Context
- 5. Non-Goals
- 6. Constraints
- 7. Feature Changes
- 8. New Features
- 9. API & Webhook Behavior
- 10. System Flow + User Stories + ACs
- 11. Rollout
- 12. Observability
- 13. Success Metrics
- 14. Launch Plan & Stage Gates
- 15. Dependencies
- 16. Key Decisions + Alternatives Rejected
- 17. Open Questions
- Appendix A — AI Scoring Rubric
- Appendix B — Stitch UI Prompts
- PRD CHANGELOG
2. One-liner + Problem
One-liner: Let admins enable AI auto-scoring and set the pass bar, and let QA leads / bot admins define the rubric that scores AI agents.
Problem:
There is no configuration layer for AI-agent scoring today. The existing is_auto_score already drives a GPT auto-scorer (auto_agent_scoring.rb) that scores the human agent on the manual categories on room resolve — but there is no way to turn on AI-agent (two-tier) scoring, no AI pass threshold for it, and scorecard_custom_parameter.prompt exists in the schema yet is unused and unsurfaced. Before any AI conversation can be scored (Phase 2), QA leads and bot admins across Qontak omnichannel accounts need a place to define what "good" means for the AI agent — which metrics apply, what the pass bar is, and any org-specific criteria. Without this foundation, every later phase (scoring, report, gate) has nothing to score the AI agent against.
3. What Happens If We Don't Build This
- Every later phase is blocked — Phase 2 scoring (targeted Q3 2026), the report (P3), and the gate (P5) all consume the rubric + threshold this phase defines; each quarter of slippage pushes measured AI quality out another quarter.
- The custom-param prompt field stays unused — it has sat in the schema since the Nov 2024 migration with no surface, and there is no way to turn on AI-agent scoring at all; without this phase, orgs can't express AI scoring criteria.
- The adoption decline continues — Agent Scorecard is already the lowest-adoption paid feature at every tier (declining 3 consecutive months); with no AI value to define, there is nothing to reverse it.
4. Target Users + Persona Context
Primary Persona: QA Lead / Supervisor
| Field | Detail |
|---|---|
| Role | QA Lead or Supervisor accountable for conversation quality across human and AI agents |
| Goal | Define the quality bar and the rubric (defaults + org-specific criteria) the AI agent will be scored against |
| Pain | No way to configure AI scoring; the existing scorecard config is manual-human-only |
| Workaround | Quality expectations live in spreadsheets/training docs, not in the product |
Secondary Persona: Bot / AI Admin (Agent Owner)
| Field | Detail |
|---|---|
| Role | The Bot/AI specialist/admin who configures AI agents |
| Goal | Add org-specific scoring criteria (e.g. BANT capture, promo accuracy) for their agents |
| Pain | Cannot express bespoke success criteria for the AI agent |
| Workaround | None — bespoke criteria are tracked manually, if at all |
5. Non-Goals
- Not the scoring pipeline — ingesting the engine's 9-metric output and computing scores is Phase 2.
- Not the in-room Scorecard panel changes — AI-mode display, actor selector, multi-actor scoring are Phase 2.
- Not the Analytics report — the unified report + export is Phase 3.
- Not the validation/testing harness — pre-launch scoring is Phase 4.
- Not the go-live gate — gate decision + advisory/enforced modes are Phase 5.
- No change to human manual scoring — the existing manual scorecard config is unchanged.
- No mobile — web (Qontak omnichannel) only.
- No billing/packaging change.
6. Constraints
| Field | Value |
|---|---|
| Platform | Web only — Qontak omnichannel web app |
| Performance | Settings/rubric save ≤ 500ms P95 |
| Data limits | Custom-param rubric (prompt) max length: see Open Q#2 (proposed ~4,000 chars) |
| Plan scope | Professional + Enterprise only. Not Starter/Free. |
| Feature flag | ai_qa_unified_scorecard (default: OFF). Phase 1 surfaces sit behind this flag and become customer-visible together with Phase 2 scoring. |
| Read/write | Read: QA Lead/Supervisor, Bot/AI Admin. Write threshold + is_auto_score: Supervisor/Admin. Custom-param rubric config: QA Lead/Supervisor or Bot/AI Admin. End CS agents: no access. |
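The plan scope, feature flag, and read/write rows above combine into one access check. The following is a minimal sketch of that matrix, not the production authorization code. The role keys (`supervisor`, `admin`, `qa_lead`, `bot_ai_admin`, `cs_agent`), the plan strings, the action names, and the `can` helper are all hypothetical, and it assumes that anyone who can write can also read.

```python
from dataclasses import dataclass

# Hypothetical identifiers: the PRD names the roles and plans but not their keys.
ELIGIBLE_PLANS = {"professional", "enterprise"}
SETTINGS_WRITERS = {"supervisor", "admin"}                  # threshold + is_auto_score
RUBRIC_WRITERS = {"qa_lead", "supervisor", "bot_ai_admin"}  # custom-param rubric
READERS = SETTINGS_WRITERS | RUBRIC_WRITERS                 # assumption: writers can read


@dataclass(frozen=True)
class Org:
    plan: str
    flag_enabled: bool  # state of ai_qa_unified_scorecard for this org


def can(org: Org, role: str, action: str) -> bool:
    """Return True if `role` may perform `action` on the AI scoring config."""
    # Plan gate and feature flag are checked before any role check.
    if org.plan not in ELIGIBLE_PLANS or not org.flag_enabled:
        return False
    allowed = {
        "read": READERS,
        "write_settings": SETTINGS_WRITERS,
        "write_rubric": RUBRIC_WRITERS,
    }
    return role in allowed.get(action, set())
```

With these assumptions, a QA Lead on a Professional org with the flag on can edit rubrics but not the threshold, and an end CS agent gets no access at all.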
7. Feature Changes
Change ID: CHG-001 — Surface and persist AI auto-scoring settings
| Field | Detail |
|---|---|
| Change Type | Modified component (Scorecard settings) |
| Page | /settings/scorecard |
| Page Intent | Admin configures how AI agents will be scored and what counts as a pass |
| Before | • is_auto_score already drives the existing GPT auto-scorer (auto_agent_scoring.rb) that scores the human agent on the manual categories on room resolve. • There is no way to enable AI-agent (two-tier) scoring; passing_grade applies only to the human scorecard. |
| After | • A Scorecard settings section extends is_auto_score to also enable AI-agent (two-tier) scoring and adds an AI pass threshold, persisted per org (the existing human auto-score is untouched). • Persisted config is consumed by the Phase 2 AI scoring pipeline. Enabling it in Phase 1 records intent — no AI scores are produced until Phase 2. |
| Element | Before | After |
|---|---|---|
| is_auto_score scope | Drives human auto-scoring only (auto_agent_scoring.rb) | Also enables AI-agent two-tier scoring |
| AI pass threshold | None — passing_grade is human-scorecard only | New AI pass threshold (0–100), persisted |
Figma: Pending.
Change ID: CHG-002 — Wire the custom-parameter judging rubric
| Field | Detail |
|---|---|
| Change Type | Modified component (custom parameter editor) |
| Page | /settings/scorecard/custom-parameters |
| Page Intent | Org defines its own scoring parameters beyond the Qontak defaults |
| Before | • scorecard_custom_parameter.prompt exists (string) but is not surfaced or used; custom params are manual-only. |
| After | • prompt becomes an editable "AI judging rubric" input (widened string→text). • A QA Lead/Supervisor or Bot/AI Admin can add a custom param and write its rubric; a non-empty rubric marks it auto-scorable (consumed by Phase 2 tier-2 scoring). Empty rubric → manual-only. |
| Element | Before | After |
|---|---|---|
| Custom param prompt | In schema, unused, string | Editable "AI judging rubric" textarea, text |
| Who can configure | Supervisor/Admin (manual params) | QA Lead/Supervisor or Bot/AI Admin |
Figma: Pending.
8. New Features
Feature: AI Judging Rubric editor + Default Rubric viewer (new components within Scorecard settings)
| Field | Detail |
|---|---|
| URL | /settings/scorecard/custom-parameters (editor) · /settings/scorecard (default viewer) |
| Access | QA Lead/Supervisor and Bot/AI Admin (add/edit custom rubric); both roles are read-only on the default rubric |
Component Tree:
| Component | Parent | Purpose |
|---|---|---|
| ScorecardSettingsPage | — | Container for AI scoring config |
| AutoScoreToggle | ScorecardSettingsPage | Enable AI auto-scoring + passing-grade input |
| DefaultRubricViewer | ScorecardSettingsPage | Read-only list of the 9 Qontak default metrics (+ veto flags) |
| CustomParamEditor | ScorecardSettingsPage | Add/edit a custom param + "AI judging rubric" textarea + auto-scorable indicator |
UI States:
| State | Description |
|---|---|
| Empty | No custom params yet → "No custom parameters. Add one to score the AI agent on your own criteria." |
| Loading | Skeleton form fields while fetching saved config. |
| Error | "Couldn't save. Try again." + Retry. Log: scorecard_settings_save_failed. |
| Success | Saved state with confirmation; auto-scorable indicator lit when a rubric is present. |
Figma: Pending — Stitch prompts in Appendix B.
📊 UI State Diagram — Scorecard Settings & Rubric Editor
stateDiagram-v2
[*] --> Loading: Open Scorecard settings
Loading --> Empty: No custom params yet
Loading --> Success: Saved config loaded
Loading --> Error: Load / save fails
Error --> Loading: Retry
Empty --> Success: Add first custom param
Success --> [*]: Config saved
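The diagram above can also be read as a lookup table of allowed transitions. The sketch below shows that reading; the event names are hypothetical labels for the diagram's edge captions, and `next_state` is an illustrative helper rather than part of any existing front-end code.

```python
# Transition table transcribed from the state diagram above.
# Event names are hypothetical labels for the edge captions.
TRANSITIONS = {
    ("Loading", "no_custom_params"): "Empty",
    ("Loading", "config_loaded"): "Success",
    ("Loading", "request_failed"): "Error",
    ("Error", "retry"): "Loading",
    ("Empty", "first_param_added"): "Success",
}


def next_state(state: str, event: str) -> str:
    """Return the next UI state, or raise if the diagram has no such edge."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state!r} on {event!r}") from None
```

Keeping the edges in one table makes it easy to check that Error can only be left through Retry, which matches the Error row in the UI States table.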
9. API & Webhook Behavior
Behavior 1: Persist Scorecard preference (AI auto-scoring + threshold)
| Field | Detail |
|---|---|
| Entity affected | scorecard_preference (is_auto_score, passing_grade) |
| Triggered by | Supervisor/Admin saves Scorecard settings |
| Information passed | Org, is_auto_score, passing_grade |
| Expected behavior | Persist per org (unique per org); audit via paper_trail |
| Failure behavior | • passing_grade outside 0–100 → validation error, not saved. • Save fails → error + retry; scorecard_settings_save_failed logged. |
Behavior 2: Create/update a custom parameter + rubric
| Field | Detail |
|---|---|
| Entity affected | scorecard_custom_parameter (name, prompt) |
| Triggered by | QA Lead/Supervisor or Bot/AI Admin saves a custom parameter |
| Information passed | Org, name, prompt (rubric, optional) |
| Expected behavior | Persist; non-empty prompt marks the param auto-scorable; audit via paper_trail |
| Failure behavior | • Rubric over max length → validation error. • Save fails → error + retry; scorecard_custom_param_save_failed logged. |
Claude resolves during RFC: HTTP method, path, request/response JSON schema, error codes.
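For Behavior 2, the rule that a non-empty prompt marks a parameter auto-scorable can be sketched as a derived property instead of a stored flag. This is an illustration only: `CustomParameter` and `build_custom_parameter` are hypothetical names, the 4,000-character limit is the value proposed in Open Q#2 and not yet decided, and treating a whitespace-only rubric as empty is an assumption.

```python
from dataclasses import dataclass

# Proposed in Open Q#2; not a confirmed limit.
MAX_RUBRIC_CHARS = 4000


@dataclass(frozen=True)
class CustomParameter:
    name: str
    prompt: str  # the "AI judging rubric"; empty means manual-only

    @property
    def auto_scorable(self) -> bool:
        # Assumption: whitespace-only rubrics count as empty.
        return bool(self.prompt.strip())


def build_custom_parameter(name: str, prompt: str = "") -> CustomParameter:
    """Validate input and return the record that would be persisted."""
    if not name.strip():
        raise ValueError("name is required")
    if len(prompt) > MAX_RUBRIC_CHARS:
        raise ValueError(f"rubric exceeds {MAX_RUBRIC_CHARS} characters")
    return CustomParameter(name=name.strip(), prompt=prompt)
```

Deriving `auto_scorable` from the rubric keeps the two from drifting apart: clearing the rubric automatically returns the parameter to manual-only.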
10. System Flow + User Stories + ACs
10.1 System Flow
Flow: Configure AI scoring for an organization
Type: User Journey
- A Supervisor/Admin opens Scorecard settings.
- They toggle is_auto_score ON and set passing_grade (0–100).
- Decision point — threshold within 0–100? No → validation error, not saved. Yes → persist preference.
- A QA Lead or Bot/AI Admin opens the custom-parameter editor and adds a parameter with an "AI judging rubric".
- Decision point — rubric non-empty? Yes → param marked auto-scorable. No → param saved manual-only.
- Failure branch — if a save fails, show error + Retry and log the failure; no partial state persists.
- Any authorized user can open the read-only Default Rubric viewer to see the 9 Qontak default metrics (+ veto flags).
- Config is now ready to be consumed by the Phase 2 scoring pipeline (no scores produced in this phase).
📊 System Flow — Configure AI Scoring
graph TD
A[Supervisor/Admin opens Scorecard settings] --> B[Toggle is_auto_score ON + set passing_grade]
B --> C{Threshold within 0-100?}
C -->|No| D[Validation error — not saved]
C -->|Yes| E[Persist preference]
E --> F[QA Lead / Bot Admin adds custom parameter + rubric]
F --> G{Rubric non-empty?}
G -->|Yes| H[Param marked auto-scorable]
G -->|No| I[Param saved manual-only]
H --> J{Save succeeds?}
I --> J
J -->|No| K[Error + Retry — no partial state, log failure]
J -->|Yes| L[Config ready for Phase 2 scoring]
K --> F
10.2 User Stories
[UASC-S01] — Enable AI auto-scoring and set the pass threshold
| User Story | As a Supervisor/Admin, I want to turn on AI-agent scoring and set its pass threshold for my org, so that AI agents will be scored against a defined bar when scoring ships. |
| Before State | is_auto_score already drives the human auto-scorer only (auto_agent_scoring.rb); it does not yet enable AI-agent scoring, and passing_grade applies only to the human scorecard. |
| After Delta | A settings section extends is_auto_score to enable AI-agent scoring and persists an AI pass threshold per org; consumed by Phase 2. Enabling records intent — no AI scores yet. |
| Importance | Must Have |
| Mockup / Technical Notes | Figma: Pending. Data Fields: • organization_id (string, required) — Auth session • is_auto_score (bool, required) — user input • passing_grade (float 0–100, required) — user input |
| Acceptance Criteria | — Happy Path — • AC-1: Given an admin in Scorecard settings, when they toggle is_auto_score ON and save, then the preference persists for the org and an info note indicates AI scoring runs once scoring is available (Phase 2). • AC-2: Given a passing_grade within 0–100, when the admin saves, then it persists as the AI pass threshold. — Edge — • AC-3: Given a passing_grade outside 0–100, when the admin saves, then a validation error is shown and nothing is persisted. — Error / Unhappy Path — • ERR-1: Given the save API fails, when the admin saves, then an error + Retry is shown, no partial state persists, and scorecard_settings_save_failed is logged. — Permission Model — • CAN: Supervisor/Admin. • CANNOT: QA Lead (read-only on threshold), end CS agents. • Unauthorized: controls not rendered. — UI States — • Loading: fields disabled + spinner on save. • Empty: defaults shown. • Error: as ERR-1. • Success: "Saved". — Negative Scenarios — (from Non-Goals) • NEG-1: Given a Starter/Free org, when a user opens Scorecard settings, then the AI scoring settings are not available (plan-gated). |
Dependencies: None.
[UASC-S02] — Add and configure a custom parameter with an AI judging rubric
| User Story | As a QA Lead or Bot/AI Admin, I want to add a custom parameter with a judging rubric for the AI agent, so that the AI judge will score my org's bespoke criteria alongside the 9 defaults. |
| Before State | scorecard_custom_parameter.prompt exists (string) but is unused/unsurfaced; custom params are manual-only. |
| After Delta | prompt becomes an editable "AI judging rubric" textarea (string→text); a non-empty rubric marks the param auto-scorable (consumed by Phase 2 tier-2 scoring). |
| Importance | Must Have |
| Mockup / Technical Notes | Figma: Pending. Data Fields: • custom_parameter_id (uuid, required) — record • name (string, required) — user input • prompt (text, optional) — user input (the rubric). Technical Notes: Empty rubric → manual-only (the GATE rule; full enforcement in Phase 2 scoring). Rubric examples in Appendix A. |
| Acceptance Criteria | — Happy Path — • AC-1: Given a QA Lead or Bot/AI Admin adding a custom parameter, when they enter a name + non-empty rubric and save, then it persists and is marked auto-scorable. • AC-2: Given a saved custom parameter, when viewed, then its rubric and auto-scorable state are shown. — Edge — • AC-3: Given an empty rubric, when saved, then the param persists as manual-only and is flagged "not auto-scored". • AC-4: Given a rubric over the max length, when saved, then a validation message caps input and the save is rejected until shortened. — Error / Unhappy Path — • ERR-1: Given the save fails, when saving, then an error + Retry is shown, no partial state persists, and scorecard_custom_param_save_failed is logged. — Permission Model — • CAN: QA Lead/Supervisor and Bot/AI Builder/Admin. • CANNOT: end CS agents. • Unauthorized: editor not rendered; read-only list of params. — UI States — • Loading: textarea disabled + spinner on save. • Empty: "No custom parameters" + helper. • Error: as ERR-1. • Success: "Saved — will be auto-scored when scoring ships." — Negative Scenarios — (from Non-Goals) • NEG-1: Given an empty rubric, when saved, then the param is NOT marked auto-scorable (no hallucinated score later — the rubric gate). |
Dependencies: None.
[UASC-S03] — View the Qontak default AI rubric (the 9 metrics)
| User Story | As a QA Lead or Bot/AI Admin, I want to see the 9 Qontak default metrics the AI agent will be scored on, so that I understand the default rubric before enabling auto-scoring. |
| Before State | No visibility into what AI will be scored on. |
| After Delta | A read-only "Qontak AI Quality (default)" list shows the 9 metrics + descriptions + veto flags. |
| Importance | Should Have |
| Mockup / Technical Notes | Figma: Pending. Technical Notes: Content from Appendix A; marked PROPOSED pending DSAI (Open Q#1). |
| Acceptance Criteria | — Happy Path — • AC-1: Given an authorized user in Scorecard settings, when they open the default rubric, then the 9 metrics with descriptions and veto flags are listed read-only. • AC-2: Given the rubric is PROPOSED pending DSAI, when displayed, then a "subject to confirmation" note is shown. • AC-3: Given the default rubric is shown, when a metric is a veto metric (Groundedness or Policy), then it is visually flagged as a veto metric. — Error / Unhappy Path — • ERR-1: Given the default-rubric fetch fails, when the user opens it, then "Couldn't load the default rubric." + Retry is shown, and default_rubric_load_failed is logged. — Permission Model — • CAN: QA Lead/Supervisor, Bot/AI Admin (read-only). • CANNOT: end CS agents. • Unauthorized: section not rendered. — UI States — • Loading: skeleton list. • Empty: N/A — the 9 defaults always exist. • Error: as ERR-1. • Success: 9 metrics listed. |
Dependencies: None.
11. Rollout
| Field | Value |
|---|---|
| Feature flag | ai_qa_unified_scorecard — default: OFF |
| Stage 1 | Internal QA: 3–5 internal accounts — validate config persistence |
| Stage 2 | Closed beta: TransGo, Talenta LMS + 3 partners (config only; surfaces flagged on internally) |
| Stage 3 | Held — Phase 1 settings become customer-visible together with Phase 2 scoring |
| GA | With Phase 2 (no standalone customer GA — settings without scoring have no user value) |
| Backward compat | Yes — manual human scorecard config unaffected; AI config is additive |
| Migration | Widen scorecard_custom_parameter.prompt (string→text). No data backfill. |
12. Observability
Key Events:
| Event Name | Trigger | Properties |
|---|---|---|
| scorecard_settings_updated | Admin saves AI scoring settings | org_id, is_auto_score, passing_grade |
| scorecard_settings_save_failed | Settings save failed | org_id, reason |
| scorecard_custom_param_saved | Custom param + rubric saved | org_id, custom_param_id, has_rubric |
| scorecard_custom_param_save_failed | Custom param save failed | org_id, reason |
| default_rubric_viewed | Default rubric opened | org_id, user_role |
| default_rubric_load_failed | Default-rubric fetch failed | org_id, reason |
| Field | Detail |
|---|---|
| Dashboard owner | Bot, AI & Automation (squad: BOT) |
| Alert 1 | scorecard_settings_save_failed + scorecard_custom_param_save_failed rate > 5% in 1h → Slack: #bot-ai-oncall |
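Alert 1 depends on how the failure rate is computed, which the table leaves open. The sketch below takes one plausible reading: failed saves divided by all save attempts, where an attempt is any of the four save events in Key Events. It assumes the caller has already filtered events to the 1-hour window; the helper names are hypothetical.

```python
FAILURE_RATE_THRESHOLD = 0.05  # 5%, from Alert 1

FAILURE_EVENTS = {
    "scorecard_settings_save_failed",
    "scorecard_custom_param_save_failed",
}
SUCCESS_EVENTS = {
    "scorecard_settings_updated",
    "scorecard_custom_param_saved",
}


def save_failure_rate(event_names: list[str]) -> float:
    """Failed saves divided by all save attempts in the window."""
    failures = sum(1 for e in event_names if e in FAILURE_EVENTS)
    successes = sum(1 for e in event_names if e in SUCCESS_EVENTS)
    attempts = failures + successes
    # Assumption: an empty window counts as a 0% failure rate.
    return failures / attempts if attempts else 0.0


def should_alert(event_names: list[str]) -> bool:
    """True when the window's failure rate is strictly above the threshold."""
    return save_failure_rate(event_names) > FAILURE_RATE_THRESHOLD
```

Under this reading, 6 failures in 100 attempts fires the alert, while exactly 5 in 100 does not, because the condition is "greater than 5%". Rubric view events are ignored since they are not save attempts.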
12.1 Post-Launch Monitoring Cadence
| Field | Detail |
|---|---|
| Review cadence | Weekly during internal alpha + closed beta |
| Owner | Dimas Fauzi Hidayat (PM) + BOT squad |
| Review scope | scorecard_settings_updated, scorecard_custom_param_saved, both _failed events |
| Trigger threshold 1 | Save-failure rate > 5% week-over-week → investigate the settings/custom-param API |
| Trigger threshold 2 | 0 custom params created across beta orgs after 2 weeks → revisit the rubric-editor UX |
| Rollback consideration | If save failures persist > 48h, PM disables the flag for affected orgs pending fix. |
13. Success Metrics
Phase 1 ships no scores, so metrics are config-readiness leading indicators for Phase 2.
Adoption & Usage:
| Metric | Definition | Baseline | Target |
|---|---|---|---|
| ⭐ Config readiness | % of beta Pro+Ent orgs that have enabled is_auto_score and have either accepted the default rubric or added ≥1 custom param | N/A — config doesn't exist | ≥80% of beta orgs configured before Phase 2 GA |
| Custom params created | # custom parameters with a non-empty rubric created across beta orgs | 0 | ≥1 per beta org |
Quality & Accuracy:
| Metric | Definition | Baseline | Target |
|---|---|---|---|
| Settings save success rate | Successful saves / total save attempts | N/A | ≥99% |
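The Config readiness metric can be computed per org and then averaged. The sketch below assumes the definition reads as "is_auto_score enabled AND (default rubric accepted OR at least one custom param with a rubric)". The `accepted_default_rubric` field is a hypothetical signal, since this PRD does not define how acceptance is recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BetaOrgConfig:
    is_auto_score: bool
    accepted_default_rubric: bool  # hypothetical signal; not defined in the PRD
    custom_params_with_rubric: int


def is_configured(org: BetaOrgConfig) -> bool:
    # Assumed reading: auto-score enabled AND (default accepted OR >=1 custom param).
    return org.is_auto_score and (
        org.accepted_default_rubric or org.custom_params_with_rubric >= 1
    )


def config_readiness(orgs: list[BetaOrgConfig]) -> float:
    """Share of beta orgs that count as configured, from 0.0 to 1.0."""
    if not orgs:
        return 0.0
    return sum(is_configured(o) for o in orgs) / len(orgs)
```

With five beta orgs, four configured orgs gives 0.8, which is exactly the ≥80% target.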
14. Launch Plan & Stage Gates
| Stage | Audience | Duration | Success Gate to Advance | Owner |
|---|---|---|---|---|
| Internal Alpha | 3–5 internal QA accounts | 1 week | 0 P0/P1; settings + custom params persist correctly; save success ≥99% | PM + QA |
| Closed Beta | TransGo, Talenta LMS + 3 partners | 2 weeks | ≥80% of beta orgs configured; ≥1 custom param each; no P0 | PM + BOT |
| Hold for Phase 2 | — | — | Config surfaces ship dark; customer-visible GA happens with Phase 2 scoring | PM |
15. Dependencies
| Dependency | Owning Team | Deliverable Needed | Blocking? |
|---|---|---|---|
| Custom-param prompt widen (string→text) | BOT (this PRD) | Schema migration | NO — in scope |
| Human Agent Scorecard data model | Chat / CRM (existing) | scorecard_preference, scorecard_custom_parameter tables available | NO — already shipped |
| Design / UX | Design squad | Frames for Scorecard settings (CHG-001) + custom-param rubric editor (CHG-002) | YES |
| DSAI — 9-metric definitions | DSAI | Confirm the default rubric content seeded in the viewer (Appendix A is PROPOSED) | NO for build · advisory for accuracy |
16. Key Decisions + Alternatives Rejected
16a — Decisions Made
| Date | Decision | Rationale |
|---|---|---|
| 2026-06-19 | Phase 1 ships the config layer only, behind the flag, with no customer-visible scores until Phase 2 | Keeps each phase shippable; settings without scoring have no standalone user value but are the prerequisite for all later phases |
| 2026-06-19 | The 9 AI metrics live in a separate "Qontak AI Quality (default)" group, not mapped onto the human categories | Human categories are CS-conversation-shaped and don't correspond to AI metrics; separate groups keep both lenses legible |
| 2026-06-19 | Custom parameters for AI scoring can be added by QA Lead/Supervisor or Bot/AI Admin | Both personas need to extend the AI rubric with org-specific criteria |
| 2026-06-19 | Widen scorecard_custom_parameter.prompt string→text; gate auto-scoring on a non-empty rubric | A real rubric won't fit a single-line string; empty/vague prompts would produce hallucinated scores in Phase 2 |
16b — Alternatives Rejected
| Alternative | Why Rejected | Date |
|---|---|---|
| Make is_auto_score actually score inline in Phase 1 | The scoring pipeline is Phase 2; bundling it breaks the per-part phasing | 2026-06-19 |
| Reuse the human default parameters for AI scoring | Human params (e.g. "responded within X sec") are meaningless for an instant AI; AI needs its own default set (the 9 metrics) | 2026-06-19 |
| Supervisor-only custom-param config | Blocks the bot-building persona who also needs to add AI criteria | 2026-06-19 |
17. Open Questions
| # | Type | Question | Owner | Deadline |
|---|---|---|---|---|
| 1 | Open Question | Confirm the exact definitions and order of the 9 engine metrics with DSAI (Appendix A is PROPOSED) so the default-rubric viewer is accurate. | Bot/AI + DSAI | 2026-07-15 |
| 2 | Open Question | Max length for the custom-param judging rubric (prompt)? Proposed ~4,000 chars. | BOT + PM | 2026-07-15 |
| 3 | Assumption | Enabling is_auto_score before Phase 2 scoring exists is acceptable as a recorded preference (no customer-visible effect until Phase 2). | PM | 2026-07-01 |
Appendix A — AI Scoring Rubric
Status: PROPOSED — pending DSAI confirmation (Open Q#1). The 9 metrics are owned by the SkillPack engine; this is the proposed default set seeded into the Phase 1 default-rubric viewer. Tier-2 examples illustrate the custom-param prompt. (Scoring/weighting is applied in Phase 2.)
Tier-1 — Qontak-calibrated AI defaults (the 9 metrics)
| # | Metric | What it measures | Veto? |
|---|---|---|---|
| 1 | Groundedness / factual accuracy | Claims backed by KB sources or customer data; no invented product facts | 🛑 Veto |
| 2 | Resolution / task completion | Did it resolve the goal (skill_completed signal) | — |
| 3 | Relevance / intent understanding | Addressed the real intent, not a different question | — |
| 4 | Policy & safety adherence | Stayed within "what to avoid"; no unsafe content / PII leak | 🛑 Veto |
| 5 | Tone & brand voice | Matched configured tone_of_voice; courteous | — |
| 6 | Language quality (Bahasa) | Fluent target language; no broken/mixed language | — |
| 7 | Handoff appropriateness | No false handover (Pattern A); no missed escalation | — |
| 8 | Tool / action correctness | Right action, right params, not skipped (Pattern B) | — |
| 9 | Conversation efficiency | No loops / re-asking; resolved within turn budget | — |
🛑 Veto metrics (Groundedness, Policy) will floor is_pass in Phase 2 regardless of the weighted total.
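To make "floor is_pass regardless of the weighted total" concrete, here is a sketch of how a veto could override an otherwise passing score. Everything numeric is an assumption: Phase 2 owns scoring and weighting, so the equal weights, the `VETO_FLOOR` cutoff of 40, and the metric keys are hypothetical placeholders and not decisions made in this PRD.

```python
from typing import Optional

VETO_METRICS = {"groundedness", "policy_safety"}  # hypothetical keys for metrics 1 and 4
VETO_FLOOR = 40.0  # hypothetical cutoff; the PRD does not define one


def is_pass(scores: dict[str, float], passing_grade: float,
            weights: Optional[dict[str, float]] = None) -> bool:
    """Pass only if no veto metric is below the floor and the weighted
    total meets the passing grade."""
    if not scores:
        return False
    # A veto overrides the weighted total, however high it is.
    if any(scores[m] < VETO_FLOOR for m in VETO_METRICS if m in scores):
        return False
    if weights is None:
        weights = {m: 1.0 for m in scores}  # assumption: equal weights
    total_weight = sum(weights[m] for m in scores)
    weighted_total = sum(scores[m] * weights[m] for m in scores) / total_weight
    return weighted_total >= passing_grade
```

For example, scores of 30, 95, 100, and 100 average 81.25 and would clear a passing grade of 70, but the Groundedness score of 30 falls below the assumed floor, so the conversation still fails.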
Tier-1 judging prompts (LLM-as-judge instruction per metric)
- Groundedness — "Given the transcript and the KB sources the agent retrieved, score how well every factual claim is supported. Product facts, prices, policies, and availability must be source-backed. 0–100 (<40 = a hallucinated product fact). Return score + worst unsupported claim, or 'none'."
- Resolution — "Score whether the agent resolved the customer's goal. Use the exit reason as a signal but judge from the transcript. 100 = fully completed; partial = advanced but unfinished; 0 = unmet. Score + reason."
- Relevance / intent — "Score how well the agent addressed the customer's actual intent. Penalize answering a different question, ignoring a follow-up, or generic non-answers. 0–100 + worst miss."
- Policy & safety (veto) — "Score compliance with the agent's policies, its 'what to avoid' rules, and safety. Penalize prohibited advice, out-of-policy commitments, data exposure, or unsafe content. Any clear breach ≤20. Score + breach, or 'none'."
- Tone & voice — "Score whether messages match the configured tone_of_voice and stay courteous. Penalize rudeness, robotic curtness, off-brand tone. 0–100 + one sentence."
- Language quality — "Score language quality in the conversation's primary language (often Bahasa Indonesia). Penalize grammar errors, unnatural phrasing, untranslated English, or mixed-language replies. 0–100 + one sentence."
- Handoff appropriateness — "Score whether human handoff was handled correctly. Penalize a FALSE handover (escalating when resolvable) AND a MISSED handover (continuing when a human was requested). Correct skill_completed with no needed handoff = 100. 0–100 + one sentence."
- Tool / action correctness — "Score whether the agent invoked the right tools with correct inputs at the right time. Penalize skipping a required action, wrong action, or wrong params. 0–100 + worst tool error, or 'none'."
- Conversation efficiency — "Score how efficiently the agent reached the outcome. Penalize repeated questions, loops, re-asking, or burning turns without progress. 0–100 + one sentence."
Tier-2 — org-owned custom params (example rubric prompts for the prompt field)
| Use case | Example rubric prompt |
|---|---|
| Sales B2C — Upsell relevance | "Score whether the agent made a relevant, non-pushy upsell/cross-sell when a natural opening arose. Penalize missing an obvious opening or pushing irrelevant items. 0–100 + reason." |
| Sales B2B — BANT capture | "Score how completely the agent captured Budget, Authority, Need, Timeline before creating the deal / handing to an AE. 25 points per element. 0–100 + which were missed." |
| Service — Empathy on complaints | "When the customer expressed frustration, score whether the agent acknowledged the emotion before solving. 100 only if explicit acknowledgement preceded the fix. 0–100 + reason." |
| Commerce — Promo accuracy (org veto) | "Score whether any promo/discount quoted is currently valid per the promo source. Penalize expired or non-existent promos. 0–100 + the invalid promo, or 'none'." |
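The BANT row's "25 points per element" rule is plain arithmetic, and writing it out shows what a well-behaved judge should return for that rubric. This sketch is only a reference for checking judge output by hand. It is not how scores are produced, since scoring is done by the LLM judge in Phase 2, and `bant_score` is a hypothetical helper.

```python
BANT_ELEMENTS = ("budget", "authority", "need", "timeline")
POINTS_PER_ELEMENT = 25


def bant_score(captured: set[str]) -> tuple[int, list[str]]:
    """Return (score, missed elements) for a set of captured BANT elements."""
    missed = [e for e in BANT_ELEMENTS if e not in captured]
    score = POINTS_PER_ELEMENT * (len(BANT_ELEMENTS) - len(missed))
    return score, missed
```

A conversation that captured budget, need, and timeline but not authority should score 75 with "authority" reported as missed. Rubrics with this kind of explicit point scheme are easier to spot-check than open-ended ones.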
Appendix B — Stitch UI Prompts
Generated proactively because the Phase 1 surfaces are Figma: Pending. Use in Stitch in order; paste each Generated Image as the reference for the next. Hand outputs to Design.
=== SHARED PREAMBLE (paste at the start of every prompt) ===
Product: Mekari Qontak — Omnichannel (customer-service inbox + chatbot/AI agent platform)
Users: QA Lead / Supervisor, Bot/AI Admin
Design tone: Enterprise B2B SaaS — dense, professional, clean white surfaces, purple primary accent, rounded cards; match the existing Qontak settings shell
Persistent UI: left vertical icon rail + top bar (workspace switcher, notifications, user avatar)
Cross-screen consistency: from Screen 2 on, attach the previous Generated Image and match its palette, type scale, spacing, and component style exactly.
=== END PREAMBLE ===
| # | Screen | Stitch Prompt (paste in full after the preamble) |
|---|---|---|
| 1 | Scorecard settings (CHG-001 + Default Rubric viewer) | Screen: Scorecard settings. Purpose: admin enables AI auto-scoring, sets the pass bar, and reviews the default rubric. Components: is_auto_score toggle with helper text; passing-grade input (0–100); a read-only "Qontak AI Quality (default)" list of the 9 metrics each with a short description and a small 🛑 veto tag on Groundedness + Policy; plan-gating note (Pro+Ent). Generate states: Loading (skeleton form); Success (saved); Error (validation on out-of-range threshold); Disabled (Starter/Free locked). Do NOT include: scores, charts, the in-room panel, the report. |
| 2 | Custom-parameter rubric editor (CHG-002 + UASC-S02) | Screen: Custom parameter editor. Purpose: a QA Lead or Bot/AI Admin adds a custom parameter and writes its AI judging rubric. Components: parameter name input; the "AI judging rubric" multi-line textarea (the prompt field) with helper "Add a rubric to let AI score this parameter; leave empty for manual-only"; an auto-scorable indicator chip that lights when the rubric is non-empty; length counter; list of existing custom params with auto-scorable/manual badges. Generate states: Empty (no params → add-first hint); Success ("Saved — will be auto-scored when scoring ships"); Error (save failed + Retry); Over-limit (length validation). Do NOT include: the default 9 metrics (not editable here), any scores. |
PRD CHANGELOG
| Version | Date | By | Section | Type | Summary |
|---|---|---|---|---|---|
| 1.0 | 2026-06-19 | Claude | All | CREATED | Phase 1 PRD (Scorecard Settings & Rubric Config) — re-scoped from the superset draft to the config layer only (settings, custom-param rubric editor, default-rubric viewer). Scoring, in-room panel, report, validation harness, and gate moved to Phases 2–5. |
| 1.1 | 2026-06-19 | Claude | S1, S1b, S8, S10 | MODIFIED | Post-score polish: added system-flow + UI-state diagrams, tightened one-liner to ≤25 words, added time/magnitude to "What Happens", strengthened UASC-S03 (AC-3, ERR-1 Gherkin, Empty N/A), added default_rubric_load_failed event. |
| 1.2 | 2026-06-19 | Claude | S1, S1b, S7, S8 | MODIFIED | Corrected premise vs cloned code: is_auto_score is NOT a no-op — it already drives auto_agent_scoring.rb (human auto-scoring). Reframed CHG-001 + UASC-S01 as extending is_auto_score to AI-agent scoring + adding an AI pass threshold. |