Best Practices for Defining Custom Metrics
This guide helps you create effective custom metrics that produce accurate, consistent evaluation results for your AI voice agents.
Quick Start
Don't have time to read the full guide? Here's what you need to know:
- Simple descriptions work fine. Our AI automatically enhances your metrics.
- Pick the right type: Use BOOLEAN for pass/fail, NUMERIC for quality on a scale.
- Be specific when it matters: The more context you provide, the more precise your results.
Example—this is enough to get started:
| Name | Type | Description |
|---|---|---|
| Professional Tone | NUMERIC | Agent maintains a professional and respectful tone |
| Issue Resolved | BOOLEAN | Customer's issue was resolved by the end of the call |
| Identity Verified | BOOLEAN | Agent verified customer identity before discussing account |
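If you manage metric definitions in code rather than a UI, the quick-start table above can be represented as simple data objects. This is a minimal sketch using a hypothetical `Metric` schema (the field names here are illustrative, not a documented API):

```python
from dataclasses import dataclass
from enum import Enum

class MetricType(Enum):
    BOOLEAN = "BOOLEAN"  # pass/fail
    NUMERIC = "NUMERIC"  # quality on a scale

@dataclass
class Metric:
    name: str
    type: MetricType
    description: str

# The three quick-start metrics from the table above
quick_start_metrics = [
    Metric("Professional Tone", MetricType.NUMERIC,
           "Agent maintains a professional and respectful tone"),
    Metric("Issue Resolved", MetricType.BOOLEAN,
           "Customer's issue was resolved by the end of the call"),
    Metric("Identity Verified", MetricType.BOOLEAN,
           "Agent verified customer identity before discussing account"),
]
```

Even at this level of detail, the descriptions are enough for the system to work with.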
Read on for detailed guidance and comprehensive examples.
Table of Contents
- Quick Start
- Understanding the Evaluation Context
- Choosing the Right Measurement Type
- Writing Effective Metric Names
- Writing Clear Descriptions
- Common Pitfalls to Avoid
- Examples
- Summary Checklist
Understanding the Evaluation Context
Before defining metrics, it's important to understand how evaluations work:
| Role | Description |
|---|---|
| Your Agent (tested_agent) | The AI agent you're testing. This is what metrics evaluate. |
| Simulated Customer (evalion) | Our AI that plays the role of a customer/user interacting with your agent. |
Key principle: Unless explicitly stated otherwise, all metrics evaluate your agent's behavior, not the simulated customer's.
For example:
- "Agent provides accurate information" → Evaluates whether your agent provides accurate info
- "Agent handles objections gracefully" → Evaluates how your agent responds to customer objections
Choosing the Right Measurement Type
BOOLEAN Metrics (Pass/Fail)
Use BOOLEAN metrics when:
- There's a clear binary outcome (did it happen or not?)
- You need definitive pass/fail criteria
- The metric represents a compliance or safety requirement
Good for:
- Compliance checks ("Agent verified customer identity")
- Safety requirements ("Agent did not provide medical advice")
- Specific actions ("Agent offered to transfer to a human")
- Binary outcomes ("Customer's issue was resolved")
NUMERIC Metrics (0-10 Scale)
Use NUMERIC metrics when:
- Performance exists on a spectrum
- You want to track gradual improvements
- Partial success is meaningful
Good for:
- Quality assessments ("Empathy and rapport")
- Soft skills ("Communication clarity")
- Subjective qualities ("Professionalism")
- Complex behaviors that can be partially achieved
Decision Guide
| Scenario | Recommended Type | Reasoning |
|---|---|---|
| "Did the agent collect the customer's email?" | BOOLEAN | Clear yes/no outcome |
| "How well did the agent handle the complaint?" | NUMERIC | Quality exists on a spectrum |
| "Did the agent follow the script?" | BOOLEAN | Compliance check |
| "How natural was the conversation?" | NUMERIC | Subjective quality |
| "Agent must not discuss competitors" | BOOLEAN | Safety/compliance rule |
| "Customer satisfaction level" | NUMERIC | Satisfaction varies in degree |
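The decision guide boils down to one rule, sketched here as a small helper (a simplification for illustration; the names are hypothetical):

```python
def recommended_type(binary_outcome: bool, compliance_rule: bool = False) -> str:
    """Suggest a metric type per the decision guide:
    clear yes/no outcomes and compliance/safety rules -> BOOLEAN;
    anything that varies in degree -> NUMERIC."""
    return "BOOLEAN" if binary_outcome or compliance_rule else "NUMERIC"
```

For example, "Did the agent collect the customer's email?" has a binary outcome, so it maps to BOOLEAN, while "How natural was the conversation?" does not, so it maps to NUMERIC.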
Writing Effective Metric Names
Do's
- Be specific and descriptive
  - Good: "Identity Verification Completed"
  - Bad: "Verification"
- Use action-oriented language
  - Good: "Appointment Successfully Scheduled"
  - Bad: "Scheduling"
- Include the outcome when relevant
  - Good: "Upsell Offer Presented"
  - Bad: "Upselling"
Don'ts
- Avoid vague single words ("Quality", "Performance", "Good")
- Avoid overly long names (keep under 50 characters)
- Avoid technical jargon your team won't understand
Writing Clear Descriptions
The description is the most important part of your metric. A well-written description produces consistent, accurate evaluations.
The SOAR Framework
Use this framework to write effective descriptions:
| Component | Description | Example |
|---|---|---|
| Specific | Define exactly what you're measuring | "The agent must state the exact appointment date and time" |
| Observable | Focus on behaviors that can be heard/seen in the transcript | "Agent uses phrases like 'I understand' or 'I hear you'" |
| Actionable | Describe what success looks like | "Agent provides at least two relevant solutions" |
| Relevant | Ensure it connects to business outcomes | "This ensures customers have the information needed to complete their purchase" |
Be Explicit About Success Criteria
Vague (Bad):
"The agent should be helpful"
Specific (Good):
"The agent provides actionable information that directly addresses the customer's question. This includes offering specific solutions, providing relevant details (dates, prices, steps), and confirming the customer understands the information provided."
Account for Voice Conversation Context
Remember that evaluations are based on transcribed voice conversations:
- Transcription artifacts: Filler words (um, uh), repeated words, and minor speech disfluencies are normal
- Intent over verbatim: Focus on whether the meaning was conveyed, not exact wording
- Natural variation: Different phrasings can convey the same information
Example description that accounts for this:
"The agent confirms the customer's appointment details. The agent should state the date, time, and location. Minor variations in phrasing are acceptable as long as all three pieces of information are clearly communicated. Partial confirmation (e.g., only stating the date) should be considered a failure."
Specify Edge Cases
Think about scenarios that might be ambiguous:
Without edge cases:
"Agent resolves the customer's issue"
With edge cases:
"Agent resolves the customer's issue. Consider it resolved if: (1) the agent provides a direct solution, (2) the agent correctly transfers to a specialist who can help, or (3) the agent schedules a follow-up where the issue will be addressed. Consider it NOT resolved if the agent simply provides general information without addressing the specific problem."
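A well-specified edge-case description is effectively a decision procedure. Here is a sketch of the "with edge cases" rules above as code (the boolean inputs are hypothetical signals an evaluator would extract from the transcript):

```python
def issue_resolved(direct_solution: bool,
                   transferred_to_specialist: bool,
                   follow_up_scheduled: bool,
                   general_info_only: bool) -> bool:
    """Mirror the edge-case rules from the description above:
    resolved if any of the three positive conditions holds;
    explicitly NOT resolved when the agent only gave general
    information without addressing the specific problem."""
    if general_info_only:
        return False
    return direct_solution or transferred_to_specialist or follow_up_scheduled
```

Writing the description this precisely means two different evaluators (human or AI) will reach the same verdict.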
Common Pitfalls to Avoid
1. Subjective Language Without Criteria
Problem: Using subjective terms without defining what they mean.
| Bad | Good |
|---|---|
| "Agent was friendly" | "Agent used positive language, greeted the customer warmly, and avoided negative or dismissive phrases" |
| "Agent was professional" | "Agent maintained a respectful tone, avoided slang, and addressed the customer appropriately" |
| "Good customer service" | "Agent acknowledged the customer's concern, provided a solution, and confirmed satisfaction" |
2. Assuming Context
Problem: Assuming the evaluator knows your business context.
| Bad | Good |
|---|---|
| "Agent followed the process" | "Agent followed the refund process: (1) verified purchase, (2) confirmed reason, (3) offered alternatives, (4) processed refund if requested" |
| "Agent used the right greeting" | "Agent greeted with company name and their own name, e.g., 'Thank you for calling Acme Corp, this is Sarah speaking'" |
3. Multiple Criteria in One Metric
Problem: Combining unrelated criteria makes scoring inconsistent.
| Bad | Good |
|---|---|
| "Agent was professional, resolved the issue, and offered upsells" | Create three separate metrics: "Professional Tone", "Issue Resolution", "Upsell Opportunity Captured" |
4. Impossible to Evaluate
Problem: Metrics that require information not available in the conversation.
| Bad | Why It's Bad |
|---|---|
| "Agent provided accurate pricing" | Evaluator doesn't know your actual prices |
| "Agent followed internal policy" | Evaluator doesn't know your policies |
Solution: Include the criteria in the description:
"Agent provided pricing consistent with our standard rates: Basic plan at $29/month, Pro plan at $79/month, Enterprise at $199/month. Any deviation from these prices should be flagged."
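Once the criteria are in the description, the check becomes mechanical. A minimal sketch, using the example rates above (the quoted-price extraction is assumed to happen elsewhere):

```python
# Standard rates from the metric description above, in $/month
STANDARD_RATES = {"Basic": 29, "Pro": 79, "Enterprise": 199}

def pricing_accurate(quoted: dict) -> bool:
    """Return False if any quoted plan/price pair deviates from
    the standard rates, including plans that don't exist."""
    return all(STANDARD_RATES.get(plan) == price
               for plan, price in quoted.items())
```

Any deviation, including a plan name the rate card doesn't list, gets flagged.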
5. Evaluating the Wrong Agent
Problem: Accidentally creating metrics that evaluate the customer instead of your agent.
| Bad | Good |
|---|---|
| "Customer was satisfied" | "Agent took actions likely to satisfy the customer: acknowledged concerns, provided solutions, and confirmed understanding" |
| "The issue was complex" | "Agent successfully handled the complexity by breaking down the problem and addressing each component" |
Examples
Note: The examples below are intentionally comprehensive to illustrate best practices. You don't need to provide this level of detail. Our system automatically enhances your metric descriptions using AI, so even a simple description like "Agent verifies customer identity" or "Measures empathy" will work well.
Use these examples as inspiration and guidance—the more context you provide, the more precise your evaluations will be, but simpler descriptions are perfectly acceptable.
Example 1: Identity Verification (BOOLEAN)
Name: Customer Identity Verified
Description:
The agent must verify the customer's identity before discussing account details. Verification requires confirming at least TWO of the following: (1) full name, (2) date of birth, (3) last four digits of SSN, (4) account number, or (5) security question answer.
Pass: Agent explicitly asks for and receives confirmation of at least two identity factors before proceeding.
Fail: Agent discusses account details without verification, or only verifies one factor.
Edge case: If the customer proactively provides identifying information without being asked, this counts toward verification as long as the agent acknowledges it.
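The pass/fail logic of this example can be sketched as a count over confirmed factors (the factor names are illustrative labels, not a documented schema):

```python
# The five acceptable identity factors from the description above
IDENTITY_FACTORS = {"full_name", "date_of_birth", "ssn_last4",
                    "account_number", "security_question"}

def identity_verified(confirmed_factors: set) -> bool:
    """Pass when at least TWO recognized identity factors were
    confirmed before account details were discussed. Per the edge
    case, it doesn't matter whether the agent asked or the customer
    volunteered the information, as long as it was acknowledged."""
    return len(confirmed_factors & IDENTITY_FACTORS) >= 2
```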
Example 2: Empathy and Rapport (NUMERIC)
Name: Empathy and Rapport
Description:
Measures how well the agent demonstrates understanding of the customer's emotional state and builds a positive connection.
High scores (8-10): Agent explicitly acknowledges the customer's feelings ("I understand this is frustrating"), uses empathetic language throughout, personalizes responses, and maintains a warm tone even when delivering difficult news.
Medium scores (5-7): Agent shows some empathy but inconsistently. May acknowledge feelings once but then become transactional. Uses polite but generic language.
Low scores (1-4): Agent is dismissive, robotic, or ignores emotional cues. Uses scripted responses that feel impersonal. May interrupt or rush the customer.
Key indicators to look for:
- Acknowledgment phrases: "I understand", "That must be frustrating", "I can see why you'd feel that way"
- Personalization: Using the customer's name, referencing their specific situation
- Tone matching: Adjusting energy level to match the customer's state
Example 3: Upsell Attempt (BOOLEAN)
Name: Relevant Upsell Offered
Description:
The agent should identify at least one opportunity to offer an upgrade or additional service that's relevant to the customer's needs.
Pass: Agent presents an upsell that logically connects to the customer's current purchase, inquiry, or expressed needs. The offer includes a clear benefit statement.
Fail: Agent makes no upsell attempt, or offers something completely unrelated to the customer's needs.
Not Applicable: The conversation is a complaint call where the customer is clearly frustrated—attempting an upsell would be inappropriate. Mark as N/A in these cases.
Edge case: If the customer preemptively declines additional offers at the start of the call ("I just want to pay my bill, nothing else"), the agent should respect this, and the metric should be marked N/A.
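Note that this BOOLEAN metric actually has three outcomes once N/A is included. A sketch of that three-valued logic, with hypothetical boolean inputs an evaluator would derive from the transcript:

```python
from enum import Enum

class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    NOT_APPLICABLE = "n/a"

def upsell_outcome(upsell_offered: bool, relevant: bool,
                   complaint_call: bool, declined_upfront: bool) -> Outcome:
    """Encode the Pass / Fail / Not Applicable rules from the
    description above. N/A conditions are checked first, since an
    upsell attempt on a complaint call shouldn't count either way."""
    if complaint_call or declined_upfront:
        return Outcome.NOT_APPLICABLE
    if upsell_offered and relevant:
        return Outcome.PASS
    return Outcome.FAIL
```

Spelling out the N/A conditions keeps the metric from unfairly penalizing calls where an upsell would be wrong.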
Example 4: Call Closure Quality (NUMERIC)
Name: Call Closure Quality
Description:
Evaluates how effectively the agent concludes the conversation.
Scoring guide:
- 9-10: Agent summarizes what was accomplished, confirms next steps, asks if there's anything else, thanks the customer, and provides a clear closing
- 7-8: Agent covers most closure elements but may miss one (e.g., forgets to summarize or doesn't ask if there's anything else)
- 5-6: Basic closure—thanks the customer and ends the call but doesn't summarize or confirm next steps
- 3-4: Abrupt ending without proper closure elements
- 1-2: Call ends awkwardly, agent hangs up prematurely, or no closure attempted
Required elements for high scores:
- Summary of actions taken or information provided
- Clear next steps (if applicable)
- Opportunity for additional questions
- Professional sign-off
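The scoring guide above roughly maps the number of closure elements present to a score band. A simplified sketch (the real judgment is qualitative, not a pure count, and the element labels are hypothetical):

```python
CLOSURE_ELEMENTS = ("summary", "next_steps", "anything_else",
                    "thanks", "sign_off")

def closure_score_band(present: set) -> str:
    """Map detected closure elements to the score bands in the
    guide above: all five -> 9-10, one missing -> 7-8, a basic
    close -> 5-6, and so on down to no closure at all."""
    n = len(present & set(CLOSURE_ELEMENTS))
    if n == 5:
        return "9-10"
    if n == 4:
        return "7-8"
    if n >= 2:
        return "5-6"
    if n == 1:
        return "3-4"
    return "1-2"
```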
Example 5: Compliance - Do Not Discuss Competitors (BOOLEAN)
Name: No Competitor Discussion
Description:
The agent must not discuss, compare, or provide opinions about competitor products or services.
Pass: Agent avoids all competitor mentions. If asked about competitors, redirects to own product benefits without naming or comparing to competitors.
Fail: Agent mentions competitors by name, compares features to competitors, or provides opinions about competitor products (positive or negative).
Acceptable redirections:
- "I can't speak to other providers, but let me tell you what makes our service great..."
- "I'm not familiar with their offerings, but here's what we provide..."
Not Applicable: If competitors are never mentioned or asked about during the conversation.
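A naive first-pass check for this compliance rule is a name scan over the transcript. This sketch uses a made-up competitor list and deliberately ignores the harder cases (redirections, indirect comparisons), which the full description covers:

```python
# Hypothetical competitor names your compliance team maintains
COMPETITOR_NAMES = {"rivalco", "competex"}

def mentions_competitor(transcript: str) -> bool:
    """Flag transcripts where any listed competitor name appears.
    A match means the conversation needs review against the
    Pass/Fail criteria above, not an automatic Fail: the customer,
    not the agent, may have said the name."""
    text = transcript.lower()
    return any(name in text for name in COMPETITOR_NAMES)
```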
Summary Checklist
For the best results, consider these points when creating metrics. Not all are required—our system will enhance your descriptions automatically, but covering more of these will improve precision:
Essential:
- Clear name that describes what's being measured
- Appropriate type (BOOLEAN for pass/fail, NUMERIC for spectrum)
- Basic description of what you want to evaluate
Recommended for precision:
- Success criteria clearly defined
- Failure criteria clearly defined
- Single focus (one thing per metric)
Nice to have:
- Edge cases addressed
- Not applicable conditions specified
- Specific examples of good/bad behavior
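The "Essential" items above, plus the 50-character naming rule from earlier in the guide, can be checked automatically before a metric is saved. A minimal sketch (the function and its return format are illustrative):

```python
def checklist_issues(name: str, metric_type: str, description: str) -> list:
    """Check a metric draft against the Essential checklist items:
    a clear name, a valid type, and a basic description."""
    issues = []
    if not name.strip():
        issues.append("name is empty")
    elif len(name) > 50:
        issues.append("name exceeds 50 characters")
    if metric_type not in ("BOOLEAN", "NUMERIC"):
        issues.append("type must be BOOLEAN or NUMERIC")
    if not description.strip():
        issues.append("description is empty")
    return issues
```

An empty list means the metric meets the essentials; the recommended and nice-to-have items still improve precision but need human judgment.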
Need Help?
Remember: Simple is fine! A metric like "Agent stays on topic" or "Professional tone" will work. Our AI automatically enhances your descriptions to make them more precise for evaluation.
If you want to be more specific, consider:
- What specific behavior do you want to encourage or prevent?
- How would a human quality reviewer evaluate this?
- What would you consider a clear pass vs. clear fail?
The more context you provide, the more tailored your evaluations will be—but don't let perfectionism stop you from creating metrics. Start simple and refine based on results.
