Python Metric
Python Metric allows you to write custom evaluation logic in Python to evaluate your AI agent’s performance. This gives you complete control over the evaluation process and enables complex analysis that goes beyond simple prompt-based metrics.

Overview

Custom code metrics are executed in a secure Python environment with access to call data including transcripts, metadata, and dynamic variables. Your code must set specific output variables to provide the evaluation result and explanation.

Available Data Variables

When writing your custom code, you have access to different variables depending on the evaluation context.

Quick Reference
| Variable | Simulation | Observability |
|---|---|---|
| transcript | ✅ | ✅ |
| transcript_json | ✅ | ✅ |
| call_duration | ✅ | ✅ |
| call_end_reason | ✅ | ✅ |
| voice_recording | ✅ | ✅ |
| agent_description | ✅ | ✅ |
| date | ✅ | ✅ |
| timestamp | ✅ | ✅ |
| metadata | ✅ | ✅ |
| recording_data | ✅ | ✅ |
| Metric Results | ✅ | ✅ |
| topic | ❌ | ✅ |
| Latency Metrics | ✅ | ✅ |
| dynamic_variables | ❌ | ✅ |
| call_log_id | ❌ | ✅ |
| tags | ✅ | ❌ |
| provider_call_id | ✅ | ❌ |
| provider_call_data | ✅ | ❌ |
| cekura_transcript_json | ✅ | ❌ |
| test_profile | ✅ | ❌ |
| run_id | ✅ | ❌ |
| expected_outcome | ✅ | ❌ |
| expected_outcome_explanation | ✅ | ❌ |
Detailed Field Documentation
Available in Both Simulation & Observability
transcript
Availability: ✅ Simulation | ✅ Observability
Full conversation transcript as a formatted string with timestamps.

transcript_json
Availability: ✅ Simulation | ✅ Observability
Transcript as a structured list with detailed timing and speaker information.

call_duration
Availability: ✅ Simulation | ✅ Observability
Call duration in seconds as a float.

call_end_reason
Availability: ✅ Simulation | ✅ Observability
Reason why the call ended.

voice_recording
Availability: ✅ Simulation | ✅ Observability
URL to the voice recording file.

agent_description
Availability: ✅ Simulation | ✅ Observability
Description of the AI agent used in the call.

metadata
Availability: ✅ Simulation | ✅ Observability
Additional context metadata as a dictionary.

date
Availability: ✅ Simulation | ✅ Observability
Current date in YYYY-MM-DD format.

timestamp
Availability: ✅ Simulation | ✅ Observability
ISO 8601 formatted timestamp of when the call/run occurred.

recording_data
Availability: ✅ Simulation | ✅ Observability
Audio metadata and analysis results as a dictionary containing:
- has_audio_data (boolean) - Whether audio data is available for analysis
- sample_rate (integer) - Audio sample rate in Hz (e.g., 8000, 16000)
- shape (list) - Audio dimensions as [total_samples, channels]
- separable_channels (boolean) - Whether stereo channels can be separated into distinct speaker channels
- total_duration (float) - Total audio duration in seconds
- main_speaking (list) - Speaking segments for the main/agent channel as [[start, end], …] in seconds
- testing_speaking (list) - Speaking segments for the testing/user channel as [[start, end], …] in seconds
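For instance, these fields can be used to measure how much of the call the agent spent speaking. This is a sketch; the sample values assigned to recording_data below are invented for illustration, and in real custom code the platform provides the variable directly.

```python
# Sketch: compute total agent talk time from recording_data speaking segments.
# The sample values are invented; the platform injects the real dictionary.
recording_data = {
    "has_audio_data": True,
    "sample_rate": 16000,
    "total_duration": 12.0,
    "main_speaking": [[0.5, 3.0], [5.0, 9.5]],   # agent channel segments, seconds
    "testing_speaking": [[3.2, 4.8]],            # user channel segments, seconds
}

if recording_data.get("has_audio_data"):
    # Sum the (end - start) lengths of every agent speaking segment.
    agent_talk_time = sum(end - start for start, end in recording_data["main_speaking"])
    talk_ratio = agent_talk_time / recording_data["total_duration"]
```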
Observability Only
dynamic_variables
Availability: ❌ Simulation | ✅ Observability
Dynamic variables configured for the agent as a dictionary.

call_log_id
Availability: ❌ Simulation | ✅ Observability
CallLog ID for observability calls.

topic
Availability: ❌ Simulation | ✅ Observability
Call topic/subject.
Simulation Only
tags
Availability: ✅ Simulation | ❌ Observability

provider_call_id
Availability: ✅ Simulation | ❌ Observability
Provider-specific call identifier.

provider_call_data
Availability: ✅ Simulation | ❌ Observability
Provider-specific call details as a dictionary.

cekura_transcript_json
Availability: ✅ Simulation | ❌ Observability
Cekura-specific transcript format.

test_profile
Availability: ✅ Simulation | ❌ Observability
Test scenario data configured for simulation runs.

run_id
Availability: ✅ Simulation | ❌ Observability
Run ID for simulation runs.

expected_outcome
Availability: ✅ Simulation | ❌ Observability
Expected outcome value for the test scenario.

expected_outcome_explanation
Availability: ✅ Simulation | ❌ Observability
List of explanation strings for the expected outcome.
Metric Results Access
These results are available in both Simulation and Observability contexts:

Individual Metric Results
Availability: ✅ Simulation | ✅ Observability
Access any evaluated metric result directly by name.

Metric Explanations
Availability: ✅ Simulation | ✅ Observability
List of explanation strings for each metric.

Latency Metrics
Availability: ✅ Simulation | ✅ Observability
Latency metrics for performance analysis. The latency_data list contains detailed information about each turn’s latency:
- latency: The duration of the latency in milliseconds.
- speaker: The entity associated with the latency (e.g., “Main Agent”).
- start_time: The timestamp when the turn started, in seconds.
Required Output Variables
Your Python code must set these two variables:
- _result - The evaluation outcome (can be boolean, numeric, string, etc.)
- _explanation - A string explaining the reasoning behind the result
Example Code
Here’s a simple example that checks if the agent mentioned a specific product:

Latency Threshold Example
This example detects if latency is under a threshold in each turn.

Complete Data Reference
Here’s the complete structure of data available to your custom Python code:

Data Flow and Execution Order
Important: Custom Python code metrics execute after all other metrics (Basic, Advanced, and pre-defined metrics). This means:
- Non-custom metrics evaluate first
- Results are structured and merged into the data dictionary
- Custom code receives ALL previous results via direct dictionary access
- Custom code can build upon or combine existing metric results
Using Metric Results
You can access the results of other metrics that were evaluated for the same call directly by metric name using data["Metric Name"]. You can also access their explanations using data["explanation"]["Metric Name"].
Example usage:
Calling LLM Judge Metrics from Python
Function Reference: evaluate_llm_judge_metric
The evaluate_llm_judge_metric function allows you to evaluate LLM Judge metrics directly from your Python code. This function sends your data and evaluation criteria to Cekura’s LLM judge system and returns the evaluation result.
Function Signature:
- data: The same data object available in your custom Python code, with access to transcript, metadata, and other call data.
- API key: Your Cekura API key for authentication.
- description: The evaluation prompt/description that guides the LLM judge on how to evaluate the metric. You can use context variables in the description using {{variable}} syntax (e.g., {{metadata.instructions}}). See LLM Judge Available Variables for a complete list of available variables.
- eval_type: The type of evaluation to perform. Supported values:
  - "binary_workflow_adherence" - Binary evaluation (returns 0 or 5)
  - "binary_qualitative" - Binary qualitative assessment (returns 0 or 5)
  - "numeric" - Numeric evaluation (returns integer or float)
  - "continuous_qualitative" - Continuous scale from 0 to 5
  - "enum" - Enumerated values (requires enum_values parameter)
- enum_values: List of possible values when using eval_type="enum". Only applicable for ENUM type evaluations. Example: ["Excellent", "Good", "Fair", "Poor"]

Returns:
- result: The evaluated metric value (type depends on eval_type)
  - Binary types: 0 or 5
  - Numeric: int or float
  - Continuous: float between 0 and 5
  - Enum: string from enum_values
- explanation: List of strings explaining the evaluation result or error message