These are standard, domain-agnostic metrics defined by Cekura. They are organized into four categories based on which aspect of your voice agent they evaluate.

Accuracy Metrics

These metrics evaluate whether your agent provides correct and consistent information.
Result: Pass / Review Required / Failed | Cost: 0 credits
Evaluates whether the Main Agent achieved the goal specified in the evaluator’s expected outcome prompt. An LLM analyzes the conversation to determine if the intended outcome was reached.
Requirements: Set expected_outcome_prompt in your evaluator configuration describing what success looks like.
Interpretation:
  • Pass: Main Agent achieved the expected outcome
  • Review Required: Outcome unclear, manual review recommended
  • Failed: Main Agent did not achieve the expected outcome
Result: True/False | Cost: 0.2 credits per call
Detects when the Main Agent provides information that contradicts or isn’t supported by the uploaded Knowledge Base files. A Knowledge Base is a collection of files uploaded to the agent containing reference information for fact-checking. An LLM compares Main Agent responses against the Knowledge Base content.
Requirements: Upload Knowledge Base files to your agent containing the source of truth.
Interpretation:
  • True: No hallucinations detected (Main Agent stayed factual)
  • False: Main Agent provided unsupported or contradictory information
Result: True/False | Cost: 0.2 credits per call
Evaluates whether the Main Agent’s responses were relevant and appropriate to the conversation context. An LLM analyzes each response to determine if it addressed what the Testing Agent was asking.
Interpretation:
  • True: Responses were relevant and on-topic
  • False: Main Agent gave off-topic or inappropriate responses
Result: True/False | Cost: 0.2 credits per call
Detects inconsistencies in the Main Agent’s responses during a call. An LLM checks for two specific issues:
  1. Testing Agent provides information (e.g., their name) and the Main Agent repeats it back incorrectly
  2. Main Agent makes contradictory statements (e.g., says one thing early in the call, then contradicts it later)
Interpretation:
  • True: Main Agent maintained consistent information throughout
  • False: Inconsistencies or contradictions detected
Result: True/False | Cost: 0 credits
Checks whether any tool calls made by the Main Agent resulted in errors.
Interpretation:
  • True: All tool calls succeeded
  • False: One or more tool calls returned an error
Result: Score (0-100) | Cost: 0 credits (Runs) / 1.0 credits per minute (Call Logs)
Evaluates speech-to-text accuracy differently depending on the context.
For Call Logs: Uses two separate state-of-the-art transcription models to generate ground-truth transcriptions. These are compared against the candidate transcript to find inconsistencies, with the score based on the number of errors found.
For Runs/Simulations: Scores based on the number of transcription errors made by the Testing Agent. Errors in names, nouns, and numbers count as 1.0; verb errors count as 0.5; other words are ignored. Scoring: 5 = 0 errors, 4 = 1-3 errors, 3 = 4-6 errors, 2 = 7-12 errors, 1 = 13+ errors.
The explanation also includes the standard Word Error Rate (WER) percentage.
Interpretation: Higher scores indicate better transcription accuracy.
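The Runs/Simulations bucketing above can be sketched as a small helper. This is a hypothetical function, not part of any Cekura API; the weights and buckets come from the description, and it assumes the "error" count used for bucketing is the weighted sum:

```python
def transcription_score(name_noun_number_errors: int, verb_errors: int) -> int:
    """Map weighted transcription errors to the 1-5 score described above.

    Names, nouns, and numbers count 1.0 each; verbs count 0.5 each;
    other words are ignored.
    """
    weighted = name_noun_number_errors * 1.0 + verb_errors * 0.5
    if weighted == 0:
        return 5
    if weighted <= 3:
        return 4
    if weighted <= 6:
        return 3
    if weighted <= 12:
        return 2
    return 1
```

For example, two name errors plus two verb errors give a weighted total of 3.0, which falls in the 1-3 bucket and scores 4.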
Result: True/False | Cost: 0.2 credits per call
Detects whether the call reached a voicemail system instead of a live person. An LLM analyzes the transcript for voicemail indicators (greeting messages, beeps, recording prompts).
Interpretation:
  • True: Call reached voicemail
  • False: Call connected to a live person

Conversation Quality Metrics

These metrics evaluate the flow and dynamics of the conversation.
Result: Count (number of interruptions) | Cost: 0 credits
Counts how many times the Main Agent started speaking while the Testing Agent was still talking. An interruption occurs when one speaker begins talking while the other is still speaking. Uses Voice Activity Detection (VAD) to find the timestamps of each speaker turn and precisely detect overlapping speech.
Requirements: Stereo audio recording with separate channels for each speaker.
Interpretation: Lower is better. Frequent interruptions indicate the Main Agent isn’t properly waiting for the Testing Agent to finish speaking.
Result: Numeric (milliseconds) | Cost: 0 credits
Measures how long it takes the Main Agent to stop speaking after the Testing Agent interrupts. Uses VAD to find the timestamps for each turn and detect when the Testing Agent starts speaking over the Main Agent.
Requirements: Stereo audio recording with separate channels for each speaker.
Interpretation: Lower is better. A responsive Main Agent should stop quickly (under 500ms) when interrupted.
Result: Count (number of interruptions) | Cost: 0 credits
Counts how many times the Testing Agent started speaking while the Main Agent was still talking. Uses VAD to find the timestamps for each turn and detect overlapping speech.
Requirements: Stereo audio recording with separate channels for each speaker.
Interpretation: High counts may indicate Testing Agent frustration, the Main Agent speaking too long, or poor turn-taking.
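Both interruption counts reduce to the same check on VAD turn timestamps. A minimal sketch, with a hypothetical helper (times in seconds, assuming each speaker’s own turns do not overlap):

```python
def count_interruptions(speaker_turns, interrupter_turns):
    """Count how many of the interrupter's turns begin while a turn of
    the other speaker is still in progress (VAD (start, end) pairs)."""
    return sum(
        1
        for i_start, _ in interrupter_turns
        if any(s_start < i_start < s_end for s_start, s_end in speaker_turns)
    )

# Main Agent interruptions: Main Agent turns starting inside a Testing Agent turn.
testing_agent_turns = [(0.0, 5.0), (10.0, 15.0)]
main_agent_turns = [(3.0, 4.0), (6.0, 7.0), (12.0, 13.0)]
interruptions = count_interruptions(testing_agent_turns, main_agent_turns)
```

Swapping the argument order counts Testing Agent interruptions instead.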
Result: Numeric (average milliseconds) | Cost: 0 credits
Measures the response time between the Testing Agent finishing speaking and the Main Agent starting its response. Latency is calculated for every Main Agent turn throughout the call. With stereo audio, VAD is used to find precise timestamps for each speaker turn. We compute and display percentile statistics: P25, P50, P75, P90, P95, and P99.
Requirements: Audio recording. Stereo audio provides more accurate measurements.
Interpretation: Lower is better. Latency under 2000ms is generally good.
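A sketch of how such percentiles could be computed from per-turn latency samples. This is plain linear-interpolation percentile math, not Cekura’s actual implementation, and the sample values are invented:

```python
def percentile(samples, p):
    """Linear-interpolation percentile of a list of samples (0 <= p <= 100)."""
    xs = sorted(samples)
    k = (len(xs) - 1) * p / 100
    lo = int(k)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (k - lo)

latencies_ms = [820, 950, 1100, 1300, 2400]  # hypothetical per-turn latencies
stats = {f"P{p}": percentile(latencies_ms, p) for p in (25, 50, 75, 90, 95, 99)}
```

With these samples, P50 is 1100 ms; the P99 tail is dominated by the single 2400 ms outlier, which is exactly why the tail percentiles are worth watching.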
Result: Count (number of repetitions) | Cost: 0.2 credits per call
Identifies instances where the Main Agent unnecessarily repeated information it had already provided. An LLM analyzes the conversation for redundant statements.
Interpretation: Lower is better. Repetition wastes time and can frustrate the Testing Agent.
Result: True/False | Cost: 0 credits
Detects prolonged silence periods where both the Main Agent and Testing Agent are silent, which may indicate technical issues or agent problems.
Requirements: Audio recording.
Configuration: Set silence_duration in the metric configuration (default: 10 seconds).
Interpretation:
  • True: No problematic silence detected
  • False: Extended mutual silence exceeding the threshold was detected
Result: True/False | Cost: 0 credits
Detects when the Main Agent fails to respond within the configured timeout after the Testing Agent finishes their turn, indicating potential infrastructure or connectivity problems.
Requirements: Audio recording.
Configuration: Set infra_issues_timeout in the metric configuration (default: 10 seconds).
Interpretation:
  • True: No infrastructure issues detected
  • False: Main Agent failed to respond within the timeout after the Testing Agent finished speaking
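The two thresholds above might sit together in a metric configuration. A hypothetical sketch: only the field names and defaults come from the descriptions, the surrounding structure is assumed:

```python
# Hypothetical metric configuration; only the field names come from the docs.
metric_config = {
    "silence_duration": 10,      # seconds of mutual silence before flagging
    "infra_issues_timeout": 10,  # seconds the Main Agent may take to respond
}
```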
Result: True/False | Cost: 0.2 credits per call
Evaluates whether the Main Agent ended the call appropriately. An LLM analyzes whether the Main Agent wrapped up the conversation properly before ending.
Interpretation:
  • True: Call was ended appropriately by the Main Agent
  • False: Main Agent ended call abruptly or inappropriately
Result: True/False | Cost: 0.2 credits per call
Evaluates whether the Testing Agent ended the call early, which may indicate a poor experience or unresolved issues. An LLM analyzes the conversation to determine if the call ended prematurely.
Interpretation:
  • True: Call ended at a natural conclusion point
  • False: Testing Agent ended call early, suggesting dissatisfaction

Customer Experience Metrics

These metrics evaluate the Testing Agent’s experience and satisfaction with the conversation.
Result: Score (0-100) | Cost: 0.2 credits per call
Evaluates overall customer satisfaction based on two dimensions:
1. Customer Sentiment - Evaluates the Testing Agent’s (customer’s) tone throughout the call:
  • Positive: Clear expressions of gratitude like “Thank you so much for your help” = 5 points
  • Neutral: Simple thanks, cooperative tone, matter-of-fact responses = 5 points
  • Negative: Explicit frustration, harsh language, complaints = 1 point
2. Cooperation - Evaluates whether the Main Agent helped the Testing Agent:
  • Fully cooperative / No issues = 5 points
  • Somewhat uncooperative = 3 points
  • Refused to help / Obstructed = 1 point
The final score is the average of both dimensions, scaled to 0-100.
Interpretation: Higher is better. Scores above 70 indicate good customer satisfaction.
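A sketch of the scoring arithmetic, assuming "scaled to 0-100" means dividing the 1-5 average by 5; the docs do not spell out the exact scaling, so treat the formula as an assumption:

```python
def satisfaction_score(sentiment_points: int, cooperation_points: int) -> float:
    """Average the two 1-5 dimension scores and scale to 0-100
    (scaling formula assumed, not confirmed by the docs)."""
    average = (sentiment_points + cooperation_points) / 2
    return average / 5 * 100
```

Under this assumption, a neutral but fully cooperative call (5 + 5) scores 100, while an explicitly frustrated call with an obstructive agent (1 + 1) scores 20.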
Result: Enum (one of your configured stages) | Cost: 0.2 credits per call
Identifies at which stage of the conversation the call dropped or ended. An LLM maps the conversation endpoint to one of your predefined stages.
Requirements: Configure dropoff_nodes on your agent with the conversation stages you want to track (e.g., “greeting”, “information_gathering”, “resolution”, “closing”).
Interpretation: Helps identify where in your conversation flow Testing Agents are dropping off, enabling targeted improvements.
Result: Enum (positive / neutral / negative) | Cost: 0.2 credits per call
Determines the Testing Agent’s overall sentiment toward the Main Agent based on the conversation transcript. An LLM analyzes the Testing Agent’s language, tone, and responses.
Classification criteria:
  • Positive: Only when the Testing Agent is clearly very grateful with phrases like “Thank you so much for your help”, “I really appreciate this”, “You’ve been so helpful”
  • Negative: Explicit frustration, harsh language, complaints, or aggressive tone
  • Neutral: Simple “thanks”, cooperative tone, matter-of-fact responses, or when sentiment is unclear
Note: Neutral is the default value when sentiment is uncertain.
Interpretation:
  • Positive: Testing Agent seemed very satisfied or grateful
  • Neutral: Testing Agent showed no strong emotion
  • Negative: Testing Agent seemed frustrated or dissatisfied
Result: Enum (one of your configured topics) | Cost: 0.2 credits per call
Categorizes the call into one of your predefined topics. An LLM analyzes the conversation to determine the primary subject matter.
Requirements: Configure topic_nodes on your agent with the topics you want to track (e.g., “billing”, “technical_support”, “sales”, “general_inquiry”).
Interpretation: Helps understand call volume distribution across different topics for resource planning and analysis.

Speech Quality Metrics

These metrics evaluate the audio and speech characteristics of the Main Agent.
Result: Numeric (Hertz) | Cost: 0 credits
Measures the average pitch frequency of the Main Agent’s voice during the call using pitch extraction algorithms.
Requirements: Audio recording.
Result: True/False | Cost: 0.2 credits per minute
Detects nonsensical or garbled speech from the Main Agent. A multimodal LLM analyzes the Main Agent’s audio to identify unintelligible segments, nonsense sounds, or garbled speech.
Requirements: Stereo audio recording with separate channels.
Interpretation:
  • True: Speech was clear and intelligible
  • False: Gibberish or garbled speech detected
Result: True/False | Cost: 0.2 credits per call
Checks whether certain words were spelled out letter-by-letter correctly in the audio (e.g., spelling out “J-O-H-N” for a name). A multimodal LLM analyzes the Main Agent’s audio to verify spelling.
Requirements: Audio recording. Configure spelling_word_types on your agent specifying which types of words should be spelled out (e.g., “name”, “email”, “confirmation_code”).
Interpretation:
  • True: Every instance of every word of the selected category was correctly spelled out in the audio
  • False: Spelling errors detected or words not spelled out when required
Result: Score (0-100) | Cost: 0.2 credits per call
Evaluates pronunciation accuracy for specific words you define.
Requirements: Audio recording. Configure pronunciation_words on your agent as a list of word-phoneme pairs (e.g., [["Cekura", "suh-KYUR-uh"]]).
Interpretation: Higher scores indicate better pronunciation accuracy. Useful for brand names or technical terms.
Result: True/False | Cost: 0.2 credits per call
Detects abrupt or unnatural changes in the Main Agent’s speaking rate during the call using an ML model.
Requirements: Audio recording. Currently supports English only.
Interpretation:
  • True: Speaking rate was consistent and natural
  • False: Unnatural speaking rate changes detected
Result: Numeric (0.0 to 1.0) | Cost: 0 credits
Calculates the ratio of Main Agent speaking time to total call duration. Uses VAD to find the timestamps of each speaker turn for accurate speaker separation.
Requirements: Stereo audio recording with separate channels.
Interpretation: A ratio around 0.4-0.6 is typical. Very high ratios may indicate the Main Agent is dominating the conversation; very low ratios may indicate the Main Agent isn’t being helpful enough.
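The ratio itself is simple arithmetic over VAD segments. A minimal sketch with a hypothetical helper (times in seconds):

```python
def talk_ratio(agent_turns, call_duration_s):
    """Ratio of Main Agent speaking time to total call duration,
    given VAD (start, end) timestamps in seconds."""
    speaking_s = sum(end - start for start, end in agent_turns)
    return speaking_s / call_duration_s
```

For example, an agent that speaks for two 10-second stretches of a 50-second call has a ratio of 0.4, at the low end of the typical range.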
Result: True/False | Cost: 0.2 credits per call
Detects unexpected speaker changes during the Main Agent’s speaking turns using an ML model for voice analysis.
Requirements: Audio recording.
Interpretation:
  • True: Consistent speaker throughout Main Agent turns
  • False: Unexpected voice change detected (may indicate system issues)
Result: Score (0-100) | Cost: 0.2 credits per call
Evaluates the overall voice quality of the Main Agent’s audio using an ML model. Specifically analyzes clarity (how clear and understandable the voice is) and jitter (variations in audio timing that can affect quality) on the Main Agent’s audio channel.
Requirements: Audio recording.
Interpretation: Higher is better. Scores above 70 indicate good voice quality. Low scores may indicate audio issues, background noise, or voice synthesis problems.
Result: Numeric (words per minute) | Cost: 0 credits
Calculates the Main Agent’s speaking speed based on transcript word count and speaking duration.
Requirements: Audio recording.
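The calculation reduces to word count over speaking minutes. A minimal sketch with a hypothetical helper (real transcripts would need tokenization beyond whitespace splitting):

```python
def words_per_minute(transcript: str, speaking_duration_s: float) -> float:
    """Main Agent speaking speed from transcript word count and
    total speaking time in seconds."""
    word_count = len(transcript.split())
    return word_count / (speaking_duration_s / 60)
```

So six words spoken over 30 seconds of speaking time comes out to 12 WPM; conversational agents typically land far higher, around 120-160 WPM.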