Metrics

A Metric in Evalion defines measurable criteria for evaluating your AI agent's performance during testing. Metrics provide quantitative and qualitative assessments of how well your agent meets specific success standards, enabling data-driven insights into conversation quality, technical performance, and user experience.

Metrics serve as the foundation for determining whether test cases pass or fail, and help identify specific areas where your agent excels or needs improvement.

Metric Types

Evalion supports two main categories of metrics to provide comprehensive agent evaluation:

Semantic Metrics

Evaluate your AI Agent's conversational understanding, response quality, and goal achievement:

  • Custom creation: Define specific criteria based on your business requirements.
  • Evaluation methods: Boolean (pass/fail) or Numeric (scored with thresholds); a sketch of both follows below.
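
For illustration only, a custom semantic metric can be pictured as a small definition like the two below. The field names and structure are assumptions made for this sketch, not Evalion's actual schema; the metric names echo the examples used later on this page.

```python
# Hypothetical sketches of custom semantic metric definitions.
# Field names and structure are illustrative assumptions, not Evalion's schema.

proper_introduction = {
    "name": "Proper Introduction",
    "type": "boolean",        # pass/fail evaluation
    "description": "The agent introduces itself properly in its first response.",
}

information_accuracy = {
    "name": "Information Accuracy and Conciseness",
    "type": "numeric",        # scored evaluation
    "description": "Responses are factually correct and avoid unnecessary detail.",
    "score_range": (0, 10),   # typical 0-10 scale
    "pass_threshold": 7,      # minimum score required to pass
}
```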

Technical Metrics

Measure the performance and technical reliability of your AI Agent system during interactions.

Evalion's built-in technical metrics are automatically included in all test suites:

  • CSAT (Customer Satisfaction): AI-generated satisfaction score based on conversation analysis.
  • AVG_Latency: Average response time between user input and agent response.

These technical metrics require no configuration and provide consistent baseline measurements across all test runs.
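
Conceptually, AVG_Latency is just the mean gap between each user message and the agent's reply. A minimal sketch of that calculation, with turn timestamps assumed purely for illustration:

```python
# Minimal sketch: average response latency across one simulated conversation.
# The turn structure and timestamps are assumptions for illustration.

turns = [
    {"user_sent_at": 0.0,  "agent_replied_at": 1.2},
    {"user_sent_at": 5.0,  "agent_replied_at": 6.1},
    {"user_sent_at": 12.0, "agent_replied_at": 13.8},
]

latencies = [t["agent_replied_at"] - t["user_sent_at"] for t in turns]
avg_latency = sum(latencies) / len(latencies)

print(f"AVG_Latency: {avg_latency:.2f}s")  # AVG_Latency: 1.37s
```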

Metric Components

Each metric consists of several key elements:

1. Metric Name

A clear, descriptive identifier for the measurement (e.g., "Proper Introduction" or "Information Accuracy and Conciseness").

2. Metric Type

Choose between Boolean (pass/fail) or Numeric (scored) evaluation methods based on what you need to measure:

  • Boolean Metric: Binary pass/fail measurements for specific requirements (e.g., "Agent introduces itself properly" - either it does or it doesn't).
  • Numeric Metric: Scored evaluations with configurable thresholds (see the sketch after this list).
      • Score range: Typically a 0-10 scale.
      • Pass threshold: Minimum score required for success.
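
In practice, the difference between the two types comes down to how a result is judged. A rough sketch of that logic, with the pass threshold value assumed for illustration (not Evalion's implementation):

```python
# Illustrative pass/fail logic for the two metric types; assumed, not Evalion's implementation.

def metric_passes(metric_type: str, result, pass_threshold: float = 7) -> bool:
    if metric_type == "boolean":
        return bool(result)             # boolean metrics pass or fail directly
    return result >= pass_threshold     # numeric metrics compare the score to the threshold

print(metric_passes("boolean", True))   # True
print(metric_passes("numeric", 7.5))    # True  (7.5 >= 7)
print(metric_passes("numeric", 6.0))    # False (below the threshold)
```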

3. Metric Description

A detailed explanation of what the metric measures and how it should be evaluated, providing context for consistent assessment.

4. Pass Threshold (Numeric Only)

For scored metrics, set the minimum score required to consider the metric successful (e.g., 7 out of 10).
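
Pulling the four components together, one way to picture a metric's overall shape is a small record like the sketch below; the class and field names are assumptions for illustration, not Evalion's data model.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative sketch of the components a metric carries; not Evalion's data model.

@dataclass
class Metric:
    name: str                               # e.g. "Proper Introduction"
    metric_type: str                        # "boolean" or "numeric"
    description: str                        # what to evaluate and how to assess it
    pass_threshold: Optional[float] = None  # numeric metrics only, e.g. 7 on a 0-10 scale
```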

The Role of Metrics in Testing

Metrics are applied at the test suite level and evaluate agent performance across all scenario-persona combinations, enabling analysis such as the following (a rough aggregation sketch appears after this list):

  • Performance Evaluation: Each simulation is assessed against all assigned metrics, providing comprehensive performance data for analysis and improvement.
  • Failure Identification: When metrics fall below defined thresholds, they trigger failure analysis and generate specific recommendations for improvement.
  • Test Trend Analysis: Multiple test runs create performance trends that help track improvement over time and identify consistent problem areas.
  • Benchmarking: Consistent application of metrics enables comparison between different agent configurations, scenarios, and improvement iterations.
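
As a rough illustration of how failure identification and benchmarking can build on per-simulation results, the sketch below aggregates pass rates per metric across simulations. The result structure is an assumption for illustration, not Evalion's reporting format.

```python
# Illustrative aggregation of metric outcomes across scenario-persona simulations.
# The result records are assumptions, not Evalion's reporting format.

results = [
    {"metric": "Proper Introduction", "passed": True},
    {"metric": "Proper Introduction", "passed": False},
    {"metric": "Information Accuracy and Conciseness", "passed": True},
    {"metric": "Information Accuracy and Conciseness", "passed": True},
]

outcomes_by_metric: dict[str, list[bool]] = {}
for r in results:
    outcomes_by_metric.setdefault(r["metric"], []).append(r["passed"])

for metric, outcomes in outcomes_by_metric.items():
    rate = sum(outcomes) / len(outcomes)
    print(f"{metric}: {rate:.0%} pass rate")
# Proper Introduction: 50% pass rate
# Information Accuracy and Conciseness: 100% pass rate
```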

This systematic measurement approach ensures that your agent evaluation is objective, comprehensive, and aligned with your quality and performance requirements.