Python Metric
Python Metric allows you to write custom evaluation logic in Python to evaluate your AI agent’s performance. This gives you complete control over the evaluation process and enables complex analysis that goes beyond simple prompt-based metrics.

Overview

Custom code metrics are executed in a secure Python environment with access to call data including transcripts, metadata, and dynamic variables. Your code must set specific output variables to provide the evaluation result and explanation.

Available Data Variables

When writing your custom code, you have access to different variables depending on the evaluation context.

Quick Reference
| Variable | Simulation | Observability |
|---|---|---|
| transcript | ✅ | ✅ |
| transcript_json | ✅ | ✅ |
| call_duration | ✅ | ✅ |
| call_end_reason | ✅ | ✅ |
| voice_recording | ✅ | ✅ |
| agent_description | ✅ | ✅ |
| date | ✅ | ✅ |
| timestamp | ✅ | ✅ |
| metadata | ✅ | ✅ |
| recording_data | ✅ | ✅ |
| Metric Results | ✅ | ✅ |
| topic | ❌ | ✅ |
| Latency Metrics | ✅ | ✅ |
| dynamic_variables | ❌ | ✅ |
| call_log_id | ❌ | ✅ |
| tags | ✅ | ❌ |
| provider_call_id | ✅ | ❌ |
| provider_call_data | ✅ | ❌ |
| cekura_transcript_json | ✅ | ❌ |
| test_profile | ✅ | ❌ |
| run_id | ✅ | ❌ |
| expected_outcome | ✅ | ❌ |
| expected_outcome_explanation | ✅ | ❌ |
Detailed Field Documentation
Available in Both Simulation & Observability
transcript
Availability: ✅ Simulation | ✅ Observability
Full conversation transcript as a formatted string with timestamps.

transcript_json
Availability: ✅ Simulation | ✅ Observability
Transcript as a structured list with detailed timing and speaker information.

call_duration
Availability: ✅ Simulation | ✅ Observability
Call duration in seconds as a float.

call_end_reason
Availability: ✅ Simulation | ✅ Observability
Reason why the call ended.

voice_recording
Availability: ✅ Simulation | ✅ Observability
URL to the voice recording file.

agent_description
Availability: ✅ Simulation | ✅ Observability
Description of the AI agent used in the call.

metadata
Availability: ✅ Simulation | ✅ Observability
Additional context metadata as a dictionary.

date
Availability: ✅ Simulation | ✅ Observability
Current date in YYYY-MM-DD format.

timestamp
Availability: ✅ Simulation | ✅ Observability
ISO 8601 formatted timestamp of when the call/run occurred.

recording_data
Availability: ✅ Simulation | ✅ Observability
Audio metadata and analysis results as a dictionary containing:
- has_audio_data (boolean) - Whether audio data is available for analysis
- sample_rate (integer) - Audio sample rate in Hz (e.g., 8000, 16000)
- shape (list) - Audio dimensions as [total_samples, channels]
- separable_channels (boolean) - Whether stereo channels can be separated into distinct speaker channels
- total_duration (float) - Total audio duration in seconds
- main_speaking (list) - Speaking segments for the main/agent channel as [[start, end], …] in seconds
- testing_speaking (list) - Speaking segments for the testing/user channel as [[start, end], …] in seconds
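For instance, these fields can be used to measure how much of the call the agent spent speaking. This is a sketch; the sample values assigned to recording_data below are invented for illustration, and in real custom code the platform provides the variable directly.

```python
# Sketch: compute total agent talk time from recording_data speaking segments.
# The sample values are invented; the platform injects the real dictionary.
recording_data = {
    "has_audio_data": True,
    "sample_rate": 16000,
    "total_duration": 12.0,
    "main_speaking": [[0.5, 3.0], [5.0, 9.5]],   # agent channel segments, seconds
    "testing_speaking": [[3.2, 4.8]],            # user channel segments, seconds
}

if recording_data.get("has_audio_data"):
    # Sum the (end - start) lengths of every agent speaking segment.
    agent_talk_time = sum(end - start for start, end in recording_data["main_speaking"])
    talk_ratio = agent_talk_time / recording_data["total_duration"]
```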
Observability Only
dynamic_variables
Availability: ❌ Simulation | ✅ Observability
Dynamic variables configured for the agent as a dictionary.

call_log_id
Availability: ❌ Simulation | ✅ Observability
CallLog ID for observability calls.

topic
Availability: ❌ Simulation | ✅ Observability
Call topic/subject.
Simulation Only
tags
Availability: ✅ Simulation | ❌ Observability

provider_call_id
Availability: ✅ Simulation | ❌ Observability
Provider-specific call identifier.

provider_call_data
Availability: ✅ Simulation | ❌ Observability
Provider-specific call details as a dictionary.

cekura_transcript_json
Availability: ✅ Simulation | ❌ Observability
Cekura-specific transcript format.

test_profile
Availability: ✅ Simulation | ❌ Observability
Test scenario data configured for simulation runs.

run_id
Availability: ✅ Simulation | ❌ Observability
Run ID for simulation runs.

expected_outcome
Availability: ✅ Simulation | ❌ Observability
Expected outcome value for the test scenario.

expected_outcome_explanation
Availability: ✅ Simulation | ❌ Observability
List of explanation strings for the expected outcome.
Metric Results Access
These results are available in both Simulation and Observability contexts:

Individual Metric Results
Availability: ✅ Simulation | ✅ Observability
Access any evaluated metric result directly by name.

Metric Explanations
Availability: ✅ Simulation | ✅ Observability
List of explanation strings for each metric.

Latency Metrics
Availability: ✅ Simulation | ✅ Observability
Latency metrics for performance analysis. The latency_data list contains detailed information about each turn’s latency:
- latency: The duration of the latency in milliseconds.
- speaker: The entity associated with the latency (e.g., “Main Agent”).
- start_time: The timestamp when the turn started, in seconds.
Required Output Variables
Your Python code must set these two variables:
- _result - The evaluation outcome (can be boolean, numeric, string, etc.)
- _explanation - A string explaining the reasoning behind the result
Example Code
Here’s a simple example that checks if the agent mentioned a specific product:

Latency Threshold Example
This example detects if latency is under a threshold in each turn.

Complete Data Reference
Here’s the complete structure of data available to your custom Python code:

Data Flow and Execution Order
Important: Custom Python code metrics execute after all other metrics (Basic, Advanced, and pre-defined metrics). This means:
- Non-custom metrics evaluate first
- Results are structured and merged into the data dictionary
- Custom code receives ALL previous results via direct dictionary access
- Custom code can build upon or combine existing metric results
Using Metric Results
You can access the results of other metrics that were evaluated for the same call directly by metric name using data["Metric Name"]. You can also access their explanations using data["explanation"]["Metric Name"].
Example usage:
Calling LLM Judge Metrics from Python
Function Reference: evaluate_llm_judge_metric
The evaluate_llm_judge_metric function allows you to evaluate LLM Judge metrics directly from your Python code. This function sends your data and evaluation criteria to Cekura’s LLM judge system and returns the evaluation result.
Function Signature:
- data: The same data object available in your custom Python code, with access to transcript, metadata, and other call data.
- API key: Your Cekura API key for authentication.
- description: The evaluation prompt/description that guides the LLM judge on how to evaluate the metric. You can use context variables in the description using {{variable}} syntax (e.g., {{metadata.instructions}}). See LLM Judge Available Variables for a complete list of available variables.
- eval_type: The type of evaluation to perform. Supported values:
  - "binary_workflow_adherence" - Binary evaluation (returns 0 or 5)
  - "binary_qualitative" - Binary qualitative assessment (returns 0 or 5)
  - "numeric" - Numeric evaluation (returns integer or float)
  - "continuous_qualitative" - Continuous scale from 0 to 5
  - "enum" - Enumerated values (requires enum_values parameter)
- enum_values: List of possible values when using eval_type="enum". Only applicable for ENUM type evaluations. Example: ["Excellent", "Good", "Fair", "Poor"]

Returns:
- result: The evaluated metric value (type depends on eval_type)
  - Binary types: 0 or 5
  - Numeric: int or float
  - Continuous: float between 0 and 5
  - Enum: string from enum_values
- explanation: List of strings explaining the evaluation result or error message