Best Practices for Defining Custom Metrics
This guide helps you create effective custom metrics that produce accurate, consistent evaluation results for your AI voice agents.
Quick Start
Don't have time to read the full guide? Here's what you need to know:
- Simple descriptions work fine. Our AI automatically enhances your metrics.
- Pick the right type: Use BOOLEAN for pass/fail, NUMERIC for quality on a scale.
- Be specific when it matters: The more context you provide, the more precise your results.
Example—this is enough to get started:
| Name | Type | Description |
|---|---|---|
| Professional Tone | NUMERIC | Agent maintains a professional and respectful tone |
| Issue Resolved | BOOLEAN | Customer's issue was resolved by the end of the call |
| Identity Verified | BOOLEAN | Agent verified customer identity before discussing account |
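If you manage metric definitions in code rather than a UI, the quick-start table above can be represented as simple data objects. This is a minimal sketch using a hypothetical `Metric` schema (the field names here are illustrative, not a documented API):

```python
from dataclasses import dataclass
from enum import Enum

class MetricType(Enum):
    BOOLEAN = "BOOLEAN"  # pass/fail
    NUMERIC = "NUMERIC"  # quality on a scale

@dataclass
class Metric:
    name: str
    type: MetricType
    description: str

# The three quick-start metrics from the table above
quick_start_metrics = [
    Metric("Professional Tone", MetricType.NUMERIC,
           "Agent maintains a professional and respectful tone"),
    Metric("Issue Resolved", MetricType.BOOLEAN,
           "Customer's issue was resolved by the end of the call"),
    Metric("Identity Verified", MetricType.BOOLEAN,
           "Agent verified customer identity before discussing account"),
]
```

Even at this level of detail, the descriptions are enough for the system to work with.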
Read on for detailed guidance and comprehensive examples.
Table of Contents
- Quick Start
- Understanding the Evaluation Context
- Choosing the Right Measurement Type
- Writing Effective Metric Names
- Writing Clear Descriptions
- Common Pitfalls to Avoid
- Examples
- Summary Checklist
Understanding the Evaluation Context
Before defining metrics, it's important to understand how evaluations work:
| Role | Description |
|---|---|
| Your Agent (tested_agent) | The AI agent you're testing. This is what metrics evaluate. |
| Simulated Customer (evalion) | Our AI that plays the role of a customer/user interacting with your agent. |
Key principle: Unless explicitly stated otherwise, all metrics evaluate your agent's behavior, not the simulated customer's.
For example:
- "Agent provides accurate information" → Evaluates whether your agent provides accurate info
- "Agent handles objections gracefully" → Evaluates how your agent responds to customer objections
Choosing the Right Measurement Type
BOOLEAN Metrics (Pass/Fail)
Use BOOLEAN metrics when:
- There's a clear binary outcome (did it happen or not?)
- You need definitive pass/fail criteria
- The metric represents a compliance or safety requirement
Good for:
- Compliance checks ("Agent verified customer identity")
- Safety requirements ("Agent did not provide medical advice")
- Specific actions ("Agent offered to transfer to a human")
- Binary outcomes ("Customer's issue was resolved")
NUMERIC Metrics (0-10 Scale)
Use NUMERIC metrics when:
- Performance exists on a spectrum
- You want to track gradual improvements
- Partial success is meaningful
Good for:
- Quality assessments ("Empathy and rapport")
- Soft skills ("Communication clarity")
- Subjective qualities ("Professionalism")
- Complex behaviors that can be partially achieved
Decision Guide
| Scenario | Recommended Type | Reasoning |
|---|---|---|
| "Did the agent collect the customer's email?" | BOOLEAN | Clear yes/no outcome |
| "How well did the agent handle the complaint?" | NUMERIC | Quality exists on a spectrum |
| "Did the agent follow the script?" | BOOLEAN | Compliance check |
| "How natural was the conversation?" | NUMERIC | Subjective quality |
| "Agent must not discuss competitors" | BOOLEAN | Safety/compliance rule |
| "Customer satisfaction level" | NUMERIC | Satisfaction varies in degree |
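The decision guide boils down to one rule, sketched here as a small helper (a simplification for illustration; the names are hypothetical):

```python
def recommended_type(binary_outcome: bool, compliance_rule: bool = False) -> str:
    """Suggest a metric type per the decision guide:
    clear yes/no outcomes and compliance/safety rules -> BOOLEAN;
    anything that varies in degree -> NUMERIC."""
    return "BOOLEAN" if binary_outcome or compliance_rule else "NUMERIC"
```

For example, "Did the agent collect the customer's email?" has a binary outcome, so it maps to BOOLEAN, while "How natural was the conversation?" does not, so it maps to NUMERIC.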
Writing Effective Metric Names
Do's
- Be specific and descriptive
  - Good: "Identity Verification Completed"
  - Bad: "Verification"
- Use action-oriented language
  - Good: "Appointment Successfully Scheduled"
  - Bad: "Scheduling"
- Include the outcome when relevant
  - Good: "Upsell Offer Presented"
  - Bad: "Upselling"
Don'ts
- Avoid vague single words ("Quality", "Performance", "Good")
- Avoid overly long names (keep under 50 characters)
- Avoid technical jargon your team won't understand
Writing Clear Descriptions
The description is the most important part of your metric. A well-written description produces consistent, accurate evaluations.
The SOAR Framework
Use this framework to write effective descriptions:
| Component | Description | Example |
|---|---|---|
| Specific | Define exactly what you're measuring | "The agent must state the exact appointment date and time" |
| Observable | Focus on behaviors that can be heard/seen in the transcript | "Agent uses phrases like 'I understand' or 'I hear you'" |
| Actionable | Describe what success looks like | "Agent provides at least two relevant solutions" |
| Relevant | Ensure it connects to business outcomes | "This ensures customers have the information needed to complete their purchase" |
Be Explicit About Success Criteria
Vague (Bad):
"The agent should be helpful"
Specific (Good):
"The agent provides actionable information that directly addresses the customer's question. This includes offering specific solutions, providing relevant details (dates, prices, steps), and confirming the customer understands the information provided."
Account for Voice Conversation Context
Remember that evaluations are based on transcribed voice conversations:
- Transcription artifacts: Filler words (um, uh), repeated words, and minor speech disfluencies are normal
- Intent over verbatim: Focus on whether the meaning was conveyed, not exact wording
- Natural variation: Different phrasings can convey the same information
Example description that accounts for this:
"The agent confirms the customer's appointment details. The agent should state the date, time, and location. Minor variations in phrasing are acceptable as long as all three pieces of information are clearly communicated. Partial confirmation (e.g., only stating the date) should be considered a failure."
Specify Edge Cases
Think about scenarios that might be ambiguous:
Without edge cases:
"Agent resolves the customer's issue"
With edge cases:
"Agent resolves the customer's issue. Consider it resolved if: (1) the agent provides a direct solution, (2) the agent correctly transfers to a specialist who can help, or (3) the agent schedules a follow-up where the issue will be addressed. Consider it NOT resolved if the agent simply provides general information without addressing the specific problem."
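A well-specified edge-case description is effectively a decision procedure. Here is a sketch of the "with edge cases" rules above as code (the boolean inputs are hypothetical signals an evaluator would extract from the transcript):

```python
def issue_resolved(direct_solution: bool,
                   transferred_to_specialist: bool,
                   follow_up_scheduled: bool,
                   general_info_only: bool) -> bool:
    """Mirror the edge-case rules from the description above:
    resolved if any of the three positive conditions holds;
    explicitly NOT resolved when the agent only gave general
    information without addressing the specific problem."""
    if general_info_only:
        return False
    return direct_solution or transferred_to_specialist or follow_up_scheduled
```

Writing the description this precisely means two different evaluators (human or AI) will reach the same verdict.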
Common Pitfalls to Avoid
1. Subjective Language Without Criteria
Problem: Using subjective terms without defining what they mean.
| Bad | Good |
|---|---|
| "Agent was friendly" | "Agent used positive language, greeted the customer warmly, and avoided negative or dismissive phrases" |
| "Agent was professional" | "Agent maintained a respectful tone, avoided slang, and addressed the customer appropriately" |
| "Good customer service" | "Agent acknowledged the customer's concern, provided a solution, and confirmed satisfaction" |
2. Assuming Context
Problem: Assuming the evaluator knows your business context.
| Bad | Good |
|---|---|
| "Agent followed the process" | "Agent followed the refund process: (1) verified purchase, (2) confirmed reason, (3) offered alternatives, (4) processed refund if requested" |
| "Agent used the right greeting" | "Agent greeted with company name and their own name, e.g., 'Thank you for calling Acme Corp, this is Sarah speaking'" |
3. Multiple Criteria in One Metric
Problem: Combining unrelated criteria makes scoring inconsistent.
| Bad | Good |
|---|---|
| "Agent was professional, resolved the issue, and offered upsells" | Create three separate metrics: "Professional Tone", "Issue Resolution", "Upsell Opportunity Captured" |
4. Impossible to Evaluate
Problem: Metrics that require information not available in the conversation.
| Bad | Why It's Bad |
|---|---|
| "Agent provided accurate pricing" | Evaluator doesn't know your actual prices |
| "Agent followed internal policy" | Evaluator doesn't know your policies |
Solution: Include the criteria in the description:
"Agent provided pricing consistent with our standard rates: Basic plan at $29/month, Pro plan at $79/month, Enterprise at $199/month. Any deviation from these prices should be flagged."
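Once the criteria are in the description, the check becomes mechanical. A minimal sketch, using the example rates above (the quoted-price extraction is assumed to happen elsewhere):

```python
# Standard rates from the metric description above, in $/month
STANDARD_RATES = {"Basic": 29, "Pro": 79, "Enterprise": 199}

def pricing_accurate(quoted: dict) -> bool:
    """Return False if any quoted plan/price pair deviates from
    the standard rates, including plans that don't exist."""
    return all(STANDARD_RATES.get(plan) == price
               for plan, price in quoted.items())
```

Any deviation, including a plan name the rate card doesn't list, gets flagged.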
5. Evaluating the Wrong Agent
Problem: Accidentally creating metrics that evaluate the customer instead of your agent.
| Bad | Good |
|---|---|
| "Customer was satisfied" | "Agent took actions likely to satisfy the customer: acknowledged concerns, provided solutions, and confirmed understanding" |
| "The issue was complex" | "Agent successfully handled the complexity by breaking down the problem and addressing each component" |
Examples
Note: The examples below are intentionally comprehensive to illustrate best practices. You don't need to provide this level of detail. Our system automatically enhances your metric descriptions using AI, so even a simple description like "Agent verifies customer identity" or "Measures empathy" will work well.
Use these examples as inspiration and guidance—the more context you provide, the more precise your evaluations will be, but simpler descriptions are perfectly acceptable.
Example 1: Identity Verification (BOOLEAN)
Name: Customer Identity Verified
Description:
The agent must verify the customer's identity before discussing account details. Verification requires confirming at least TWO of the following: (1) full name, (2) date of birth, (3) last four digits of SSN, (4) account number, or (5) security question answer.
Pass: Agent explicitly asks for and receives confirmation of at least two identity factors before proceeding.
Fail: Agent discusses account details without verification, or only verifies one factor.
Edge case: If the customer proactively provides identifying information without being asked, this counts toward verification as long as the agent acknowledges it.
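The pass/fail logic of this example can be sketched as a count over confirmed factors (the factor names are illustrative labels, not a documented schema):

```python
# The five acceptable identity factors from the description above
IDENTITY_FACTORS = {"full_name", "date_of_birth", "ssn_last4",
                    "account_number", "security_question"}

def identity_verified(confirmed_factors: set) -> bool:
    """Pass when at least TWO recognized identity factors were
    confirmed before account details were discussed. Per the edge
    case, it doesn't matter whether the agent asked or the customer
    volunteered the information, as long as it was acknowledged."""
    return len(confirmed_factors & IDENTITY_FACTORS) >= 2
```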
Example 2: Empathy and Rapport (NUMERIC)
Name: Empathy and Rapport
Description:
Measures how well the agent demonstrates understanding of the customer's emotional state and builds a positive connection.
High scores (8-10): Agent explicitly acknowledges the customer's feelings ("I understand this is frustrating"), uses empathetic language throughout, personalizes responses, and maintains a warm tone even when delivering difficult news.
Medium scores (5-7): Agent shows some empathy but inconsistently. May acknowledge feelings once but then become transactional. Uses polite but generic language.
Low scores (1-4): Agent is dismissive, robotic, or ignores emotional cues. Uses scripted responses that feel impersonal. May interrupt or rush the customer.
Key indicators to look for:
- Acknowledgment phrases: "I understand", "That must be frustrating", "I can see why you'd feel that way"
- Personalization: Using the customer's name, referencing their specific situation
- Tone matching: Adjusting energy level to match the customer's state
Example 3: Upsell Attempt (BOOLEAN)
Name: Relevant Upsell Offered
Description:
The agent should identify at least one opportunity to offer an upgrade or additional service that's relevant to the customer's needs.
Pass: Agent presents an upsell that logically connects to the customer's current purchase, inquiry, or expressed needs. The offer includes a clear benefit statement.
Fail: Agent makes no upsell attempt, or offers something completely unrelated to the customer's needs.
Not Applicable: The conversation is a complaint call where the customer is clearly frustrated—attempting an upsell would be inappropriate. Mark as N/A in these cases.
Edge case: If the customer preemptively declines additional offers at the start of the call ("I just want to pay my bill, nothing else"), the agent should respect this, and the metric should be marked N/A.
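Note that this BOOLEAN metric actually has three outcomes once N/A is included. A sketch of that three-valued logic, with hypothetical boolean inputs an evaluator would derive from the transcript:

```python
from enum import Enum

class Outcome(Enum):
    PASS = "pass"
    FAIL = "fail"
    NOT_APPLICABLE = "n/a"

def upsell_outcome(upsell_offered: bool, relevant: bool,
                   complaint_call: bool, declined_upfront: bool) -> Outcome:
    """Encode the Pass / Fail / Not Applicable rules from the
    description above. N/A conditions are checked first, since an
    upsell attempt on a complaint call shouldn't count either way."""
    if complaint_call or declined_upfront:
        return Outcome.NOT_APPLICABLE
    if upsell_offered and relevant:
        return Outcome.PASS
    return Outcome.FAIL
```

Spelling out the N/A conditions keeps the metric from unfairly penalizing calls where an upsell would be wrong.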
Example 4: Call Closure Quality (NUMERIC)
Name: Call Closure Quality
Description:
Evaluates how effectively the agent concludes the conversation.
Scoring guide:
- 9-10: Agent summarizes what was accomplished, confirms next steps, asks if there's anything else, thanks the customer, and provides a clear closing
- 7-8: Agent covers most closure elements but may miss one (e.g., forgets to summarize or doesn't ask if there's anything else)
- 5-6: Basic closure—thanks the customer and ends the call but doesn't summarize or confirm next steps
- 3-4: Abrupt ending without proper closure elements
- 1-2: Call ends awkwardly, agent hangs up prematurely, or no closure attempted
Required elements for high scores:
- Summary of actions taken or information provided
- Clear next steps (if applicable)
- Opportunity for additional questions
- Professional sign-off
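The scoring guide above roughly maps the number of closure elements present to a score band. A simplified sketch (the real judgment is qualitative, not a pure count, and the element labels are hypothetical):

```python
CLOSURE_ELEMENTS = ("summary", "next_steps", "anything_else",
                    "thanks", "sign_off")

def closure_score_band(present: set) -> str:
    """Map detected closure elements to the score bands in the
    guide above: all five -> 9-10, one missing -> 7-8, a basic
    close -> 5-6, and so on down to no closure at all."""
    n = len(present & set(CLOSURE_ELEMENTS))
    if n == 5:
        return "9-10"
    if n == 4:
        return "7-8"
    if n >= 2:
        return "5-6"
    if n == 1:
        return "3-4"
    return "1-2"
```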
Example 5: Compliance - Do Not Discuss Competitors (BOOLEAN)
Name: No Competitor Discussion
Description:
The agent must not discuss, compare, or provide opinions about competitor products or services.
Pass: Agent avoids all competitor mentions. If asked about competitors, redirects to own product benefits without naming or comparing to competitors.
Fail: Agent mentions competitors by name, compares features to competitors, or provides opinions about competitor products (positive or negative).
Acceptable redirections:
- "I can't speak to other providers, but let me tell you what makes our service great..."
- "I'm not familiar with their offerings, but here's what we provide..."
Not Applicable: If competitors are never mentioned or asked about during the conversation.
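A naive first-pass check for this compliance rule is a name scan over the transcript. This sketch uses a made-up competitor list and deliberately ignores the harder cases (redirections, indirect comparisons), which the full description covers:

```python
# Hypothetical competitor names your compliance team maintains
COMPETITOR_NAMES = {"rivalco", "competex"}

def mentions_competitor(transcript: str) -> bool:
    """Flag transcripts where any listed competitor name appears.
    A match means the conversation needs review against the
    Pass/Fail criteria above, not an automatic Fail: the customer,
    not the agent, may have said the name."""
    text = transcript.lower()
    return any(name in text for name in COMPETITOR_NAMES)
```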
Summary Checklist
For the best results, consider these points when creating metrics. Not all are required—our system will enhance your descriptions automatically, but covering more of these will improve precision:
Essential:
- Clear name that describes what's being measured
- Appropriate type (BOOLEAN for pass/fail, NUMERIC for spectrum)
- Basic description of what you want to evaluate
Recommended for precision:
- Success criteria clearly defined
- Failure criteria clearly defined
- Single focus (one thing per metric)
Nice to have:
- Edge cases addressed
- Not applicable conditions specified
- Specific examples of good/bad behavior
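The "Essential" items above, plus the 50-character naming rule from earlier in the guide, can be checked automatically before a metric is saved. A minimal sketch (the function and its return format are illustrative):

```python
def checklist_issues(name: str, metric_type: str, description: str) -> list:
    """Check a metric draft against the Essential checklist items:
    a clear name, a valid type, and a basic description."""
    issues = []
    if not name.strip():
        issues.append("name is empty")
    elif len(name) > 50:
        issues.append("name exceeds 50 characters")
    if metric_type not in ("BOOLEAN", "NUMERIC"):
        issues.append("type must be BOOLEAN or NUMERIC")
    if not description.strip():
        issues.append("description is empty")
    return issues
```

An empty list means the metric meets the essentials; the recommended and nice-to-have items still improve precision but need human judgment.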
Need Help?
Remember: Simple is fine! A metric like "Agent stays on topic" or "Professional tone" will work. Our AI automatically enhances your descriptions to make them more precise for evaluation.
If you want to be more specific, consider:
- What specific behavior do you want to encourage or prevent?
- How would a human quality reviewer evaluate this?
- What would you consider a clear pass vs. clear fail?
The more context you provide, the more tailored your evaluations will be—but don't let perfectionism stop you from creating metrics. Start simple and refine based on results.
