These are standard, domain-agnostic metrics defined by Cekura. They are organized into four categories based on which aspect of your voice agent they evaluate.

Accuracy Metrics

These metrics evaluate whether your agent provides correct and consistent information.
Result: Pass / Review Required / Failed | Cost: 0 credits
Evaluates whether the Main Agent achieved the goal specified in the evaluator’s expected outcome prompt. An LLM analyzes the conversation to determine if the intended outcome was reached.
Requirements: Set expected_outcome_prompt in your evaluator configuration describing what success looks like.
Interpretation:
  • Pass: Main Agent achieved the expected outcome
  • Review Required: Outcome unclear, manual review recommended
  • Failed: Main Agent did not achieve the expected outcome
Result: True/False | Cost: 0.2 credits per call
Detects when the Main Agent provides information that contradicts or isn’t supported by the uploaded Knowledge Base files. A Knowledge Base is a collection of files uploaded to the agent containing reference information for fact-checking. An LLM compares Main Agent responses against the Knowledge Base content.
Requirements: Upload Knowledge Base files to your agent containing the source of truth.
Interpretation:
  • True: No hallucinations detected (Main Agent stayed factual)
  • False: Main Agent provided unsupported or contradictory information
Result: True/False | Cost: 0.2 credits per call
Evaluates whether the Main Agent’s responses were relevant and appropriate to the conversation context. An LLM analyzes each response to determine if it addressed what the Testing Agent was asking.
Interpretation:
  • True: Responses were relevant and on-topic
  • False: Main Agent gave off-topic or inappropriate responses
Result: True/False | Cost: 0.2 credits per call
Detects inconsistencies in the Main Agent’s responses during a call. An LLM checks for two specific issues:
  1. Testing Agent provides information (e.g., their name) and the Main Agent repeats it back incorrectly
  2. Main Agent makes contradictory statements (e.g., says one thing early in the call, then contradicts it later)
Interpretation:
  • True: Main Agent maintained consistent information throughout
  • False: Inconsistencies or contradictions detected
Result: True/False | Cost: 0 credits
Checks whether any tool calls made by the Main Agent resulted in errors.
Interpretation:
  • True: All tool calls succeeded
  • False: One or more tool calls returned an error
Result: Score (0-100) | Cost: 0 credits (Runs) / 1.0 credits per minute (Call Logs)
Evaluates speech-to-text accuracy differently depending on the context.
For Call Logs: Uses two separate state-of-the-art transcription models to generate ground-truth transcriptions. These are compared against the candidate transcript to find inconsistencies, with the score based on the number of errors found.
For Runs/Simulations: Scores based on the number of transcription errors made by the Testing Agent. Errors in names, nouns, and numbers count as 1.0; verb errors count as 0.5; other words are ignored. Scoring: 5 = 0 errors, 4 = 1-3 errors, 3 = 4-6 errors, 2 = 7-12 errors, 1 = 13+ errors.
The explanation also includes the standard Word Error Rate (WER) percentage.
Interpretation: Higher scores indicate better transcription accuracy.
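The Runs/Simulations bucketing above can be sketched as a small helper. This is a hypothetical function, not part of any Cekura API; the weights and buckets come from the description, and it assumes the "error" count used for bucketing is the weighted sum:

```python
def transcription_score(name_noun_number_errors: int, verb_errors: int) -> int:
    """Map weighted transcription errors to the 1-5 score described above.

    Names, nouns, and numbers count 1.0 each; verbs count 0.5 each;
    other words are ignored.
    """
    weighted = name_noun_number_errors * 1.0 + verb_errors * 0.5
    if weighted == 0:
        return 5
    if weighted <= 3:
        return 4
    if weighted <= 6:
        return 3
    if weighted <= 12:
        return 2
    return 1
```

For example, two name errors plus two verb errors give a weighted total of 3.0, which falls in the 1-3 bucket and scores 4.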
Result: True/False | Cost: 0.2 credits per call
Detects whether the call reached a voicemail system instead of a live person. An LLM analyzes the transcript for voicemail indicators (greeting messages, beeps, recording prompts).
Interpretation:
  • True: Call reached voicemail
  • False: Call connected to a live person

Conversation Quality Metrics

These metrics evaluate the flow and dynamics of the conversation.
Result: Count (number of interruptions) | Cost: 0 credits
Counts how many times the Main Agent started speaking while the Testing Agent was still talking. An interruption occurs when one speaker begins talking while the other is still speaking. Uses Voice Activity Detection (VAD) to find the timestamps of each speaker turn and precisely detect overlapping speech.
Requirements: Stereo audio recording with separate channels for each speaker.
Interpretation: Lower is better. Frequent interruptions indicate the Main Agent isn’t properly waiting for the Testing Agent to finish speaking.
Result: Numeric (milliseconds) | Cost: 0 credits
Measures how long it takes the Main Agent to stop speaking after the Testing Agent interrupts. Uses VAD to find the timestamps for each turn and detect when the Testing Agent starts speaking over the Main Agent.
Requirements: Stereo audio recording with separate channels for each speaker.
Interpretation: Lower is better. A responsive Main Agent should stop quickly (under 500ms) when interrupted.
Result: Count (number of interruptions) | Cost: 0 credits
Counts how many times the Testing Agent started speaking while the Main Agent was still talking. Uses VAD to find the timestamps for each turn and detect overlapping speech.
Requirements: Stereo audio recording with separate channels for each speaker.
Interpretation: High counts may indicate Testing Agent frustration, the Main Agent speaking too long, or poor turn-taking.
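Both interruption counts reduce to the same check on VAD turn timestamps. A minimal sketch, with a hypothetical helper (times in seconds, assuming each speaker’s own turns do not overlap):

```python
def count_interruptions(speaker_turns, interrupter_turns):
    """Count how many of the interrupter's turns begin while a turn of
    the other speaker is still in progress (VAD (start, end) pairs)."""
    return sum(
        1
        for i_start, _ in interrupter_turns
        if any(s_start < i_start < s_end for s_start, s_end in speaker_turns)
    )

# Main Agent interruptions: Main Agent turns starting inside a Testing Agent turn.
testing_agent_turns = [(0.0, 5.0), (10.0, 15.0)]
main_agent_turns = [(3.0, 4.0), (6.0, 7.0), (12.0, 13.0)]
interruptions = count_interruptions(testing_agent_turns, main_agent_turns)
```

Swapping the argument order counts Testing Agent interruptions instead.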
Result: Numeric (average milliseconds) | Cost: 0 credits
Measures the response time between the Testing Agent finishing speaking and the Main Agent starting its response. Latency is calculated for every Main Agent turn throughout the call. With stereo audio, VAD is used to find precise timestamps for each speaker turn. We compute and display percentile statistics: P25, P50, P75, P90, P95, and P99.
Requirements: Audio recording. Stereo audio provides more accurate measurements.
Interpretation: Lower is better. Latency under 2000ms is generally good.
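A sketch of how such percentiles could be computed from per-turn latency samples. This is plain linear-interpolation percentile math, not Cekura’s actual implementation, and the sample values are invented:

```python
def percentile(samples, p):
    """Linear-interpolation percentile of a list of samples (0 <= p <= 100)."""
    xs = sorted(samples)
    k = (len(xs) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

latencies_ms = [820, 950, 1100, 1300, 2400]  # hypothetical per-turn latencies
stats = {f"P{p}": percentile(latencies_ms, p) for p in (25, 50, 75, 90, 95, 99)}
```

With these samples, P50 is 1100 ms; the P99 tail is dominated by the single 2400 ms outlier, which is exactly why the tail percentiles are worth watching.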
Result: Count (number of repetitions) | Cost: 0.2 credits per call
Identifies instances where the Main Agent unnecessarily repeated information it had already provided. An LLM analyzes the conversation for redundant statements.
Interpretation: Lower is better. Repetition wastes time and can frustrate the Testing Agent.
Result: True/False | Cost: 0 credits
Detects prolonged silence periods where both the Main Agent and Testing Agent are silent, which may indicate technical issues or agent problems.
Requirements: Audio recording.
Configuration: Set silence_duration in the metric configuration (default: 10 seconds).
Interpretation:
  • True: No problematic silence detected
  • False: Extended mutual silence exceeding the threshold was detected
Result: True/False | Cost: 0 credits
Detects when the Main Agent fails to respond within the configured timeout after the Testing Agent finishes their turn, indicating potential infrastructure or connectivity problems.
Requirements: Audio recording.
Configuration: Set infra_issues_timeout in the metric configuration (default: 10 seconds).
Interpretation:
  • True: No infrastructure issues detected
  • False: Main Agent failed to respond within the timeout after the Testing Agent finished speaking
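The two thresholds above might sit together in a metric configuration. A hypothetical sketch: only the field names and defaults come from the descriptions, the surrounding structure is assumed:

```python
# Hypothetical metric configuration; only the field names come from the docs.
metric_config = {
    "silence_duration": 10,      # seconds of mutual silence before flagging
    "infra_issues_timeout": 10,  # seconds the Main Agent may take to respond
}
```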
Result: True/False | Cost: 0.2 credits per call
Evaluates whether the Main Agent ended the call appropriately. An LLM analyzes whether the Main Agent wrapped up the conversation properly before ending.
Interpretation:
  • True: Call was ended appropriately by the Main Agent
  • False: Main Agent ended call abruptly or inappropriately
Result: True/False | Cost: 0.2 credits per call
Evaluates whether the Testing Agent ended the call early, which may indicate a poor experience or unresolved issues. An LLM analyzes the conversation to determine if the call ended prematurely.
Interpretation:
  • True: Call ended at a natural conclusion point
  • False: Testing Agent ended call early, suggesting dissatisfaction

Customer Experience Metrics

These metrics evaluate the Testing Agent’s experience and satisfaction with the conversation.
Result: Score (0-100) | Cost: 0.2 credits per call
Evaluates overall customer satisfaction based on two dimensions:
1. Customer Sentiment - Evaluates the Testing Agent’s (customer’s) tone throughout the call:
  • Positive: Clear expressions of gratitude like “Thank you so much for your help” = 5 points
  • Neutral: Simple thanks, cooperative tone, matter-of-fact responses = 5 points
  • Negative: Explicit frustration, harsh language, complaints = 1 point
2. Cooperation - Evaluates whether the Main Agent helped the Testing Agent:
  • Fully cooperative / No issues = 5 points
  • Somewhat uncooperative = 3 points
  • Refused to help / Obstructed = 1 point
The final score is the average of both dimensions, scaled to 0-100.
Interpretation: Higher is better. Scores above 70 indicate good customer satisfaction.
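A sketch of the scoring arithmetic, assuming "scaled to 0-100" means dividing the 1-5 average by 5; the docs do not spell out the exact scaling, so treat the formula as an assumption:

```python
def satisfaction_score(sentiment_points: int, cooperation_points: int) -> float:
    """Average the two 1-5 dimension scores and scale to 0-100
    (scaling formula assumed, not confirmed by the docs)."""
    average = (sentiment_points + cooperation_points) / 2
    return average / 5 * 100
```

Under this assumption, a neutral but fully cooperative call (5 + 5) scores 100, while an explicitly frustrated call with an obstructive agent (1 + 1) scores 20.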
Result: Enum (one of your configured stages) | Cost: 0.2 credits per call
Identifies at which stage of the conversation the call dropped or ended. An LLM maps the conversation endpoint to one of your predefined stages.
Requirements: Configure dropoff_nodes on your agent with the conversation stages you want to track (e.g., “greeting”, “information_gathering”, “resolution”, “closing”).
Interpretation: Helps identify where in your conversation flow Testing Agents are dropping off, enabling targeted improvements.
Result: Enum (positive / neutral / negative) | Cost: 0.2 credits per call
Determines the Testing Agent’s overall sentiment toward the Main Agent based on the conversation transcript. An LLM analyzes the Testing Agent’s language, tone, and responses.
Classification criteria:
  • Positive: Only when the Testing Agent is clearly very grateful with phrases like “Thank you so much for your help”, “I really appreciate this”, “You’ve been so helpful”
  • Negative: Explicit frustration, harsh language, complaints, or aggressive tone
  • Neutral: Simple “thanks”, cooperative tone, matter-of-fact responses, or when sentiment is unclear
Note: Neutral is the default value when sentiment is uncertain.
Interpretation:
  • Positive: Testing Agent seemed very satisfied or grateful
  • Neutral: Testing Agent showed no strong emotion
  • Negative: Testing Agent seemed frustrated or dissatisfied
Result: Enum (one of your configured topics) | Cost: 0.2 credits per call
Categorizes the call into one of your predefined topics. An LLM analyzes the conversation to determine the primary subject matter.
Requirements: Configure topic_nodes on your agent with the topics you want to track (e.g., “billing”, “technical_support”, “sales”, “general_inquiry”).
Interpretation: Helps understand call volume distribution across different topics for resource planning and analysis.

Speech Quality Metrics

These metrics evaluate the audio and speech characteristics of the Main Agent.
Result: Numeric (Hertz) | Cost: 0 credits
Measures the average pitch frequency of the Main Agent’s voice during the call using pitch extraction algorithms.
Requirements: Audio recording.
Result: True/False | Cost: 0.2 credits per minute
Detects nonsensical or garbled speech from the Main Agent. A multimodal LLM analyzes the Main Agent’s audio to identify unintelligible segments, nonsense sounds, or garbled speech.
Requirements: Stereo audio recording with separate channels.
Interpretation:
  • True: Speech was clear and intelligible
  • False: Gibberish or garbled speech detected
Result: True/False | Cost: 0.2 credits per call
Checks whether certain words were spelled out letter-by-letter correctly in the audio (e.g., spelling out “J-O-H-N” for a name). A multimodal LLM analyzes the Main Agent’s audio to verify spelling.
Requirements: Audio recording. Configure spelling_word_types on your agent specifying which types of words should be spelled out (e.g., “name”, “email”, “confirmation_code”).
Interpretation:
  • True: Every instance of every word of the selected category was correctly spelled out in the audio
  • False: Spelling errors detected or words not spelled out when required
Result: Score (0-100) | Cost: 0.2 credits per call
Evaluates pronunciation accuracy for specific words you define.
Requirements: Audio recording. Configure pronunciation_words on your agent as a list of word-phoneme pairs (e.g., [["Cekura", "suh-KYUR-uh"]]).
Interpretation: Higher scores indicate better pronunciation accuracy. Useful for brand names or technical terms.
Result: True/False | Cost: 0.2 credits per call
Detects abrupt or unnatural changes in the Main Agent’s speaking rate during the call using an ML model.
Requirements: Audio recording. Currently supports English only.
Interpretation:
  • True: Speaking rate was consistent and natural
  • False: Unnatural speaking rate changes detected
Result: Numeric (0.0 to 1.0) | Cost: 0 credits
Calculates the ratio of Main Agent speaking time to total call duration. Uses VAD to find the timestamps of each speaker turn for accurate speaker separation.
Requirements: Stereo audio recording with separate channels.
Interpretation: A ratio around 0.4-0.6 is typical. Very high ratios may indicate the Main Agent is dominating the conversation; very low ratios may indicate the Main Agent isn’t being helpful enough.
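The ratio itself is simple arithmetic over VAD segments. A minimal sketch with a hypothetical helper (times in seconds):

```python
def talk_ratio(agent_turns, call_duration_s):
    """Ratio of Main Agent speaking time to total call duration,
    given VAD (start, end) timestamps in seconds."""
    speaking_s = sum(end - start for start, end in agent_turns)
    return speaking_s / call_duration_s
```

For example, an agent that speaks for two 10-second stretches of a 50-second call has a ratio of 0.4, at the low end of the typical range.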
Result: True/False | Cost: 0.2 credits per call
Detects unexpected speaker changes during the Main Agent’s speaking turns using an ML model for voice analysis.
Requirements: Audio recording.
Interpretation:
  • True: Consistent speaker throughout Main Agent turns
  • False: Unexpected voice change detected (may indicate system issues)
Result: Score (0-100) | Cost: 0.2 credits per call
Evaluates the overall voice quality of the Main Agent’s audio using an ML model. Specifically analyzes clarity (how clear and understandable the voice is) and jitter (variations in audio timing that can affect quality) on the Main Agent’s audio channel.
Requirements: Audio recording.
Interpretation: Higher is better. Scores above 70 indicate good voice quality. Low scores may indicate audio issues, background noise, or voice synthesis problems.
Result: Numeric (words per minute) | Cost: 0 credits
Calculates the Main Agent’s speaking speed based on transcript word count and speaking duration.
Requirements: Audio recording.
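The calculation reduces to word count over speaking minutes. A minimal sketch with a hypothetical helper (real transcripts would need tokenization beyond whitespace splitting):

```python
def words_per_minute(transcript: str, speaking_duration_s: float) -> float:
    """Main Agent speaking speed from transcript word count and
    total speaking time in seconds."""
    word_count = len(transcript.split())
    return word_count / (speaking_duration_s / 60)
```

So six words spoken over 30 seconds of speaking time comes out to 12 WPM; conversational agents typically land far higher, around 120-160 WPM.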