LLM outputs are inherently non-deterministic, so what works in a demo or MVP often fails in production; in fact, most AI agents never reach production at all. Human intuition and prompt engineering alone don't scale without systematic measurement and holistic evaluation, especially for RAG systems. To earn enterprise trust, organizations must invest in compliance, auditability, cost and latency optimization, and deep observability across agentic workflows.
The system automatically inspects trace patterns to discover "Agents" and infers their black-box structure (Inputs, Outputs, and Child Tools) without manual configuration.
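The discovery idea can be sketched roughly as follows. This is an illustrative reconstruction, not the platform's actual code: the span shape, field names, and grouping heuristic are all assumptions.

```python
# Hypothetical sketch of trace-based agent discovery (span fields are
# illustrative assumptions, not a real API). Each span carries a type,
# an optional parent, and input/output payloads.

def discover_agents(spans):
    """Group spans by parent to infer each agent's inputs, outputs, and child tools."""
    agents = {}
    for span in spans:
        if span["type"] == "agent":
            agents[span["id"]] = {
                "name": span["name"],
                "inputs": list(span.get("input", {}).keys()),
                "outputs": list(span.get("output", {}).keys()),
                "child_tools": [],
            }
    for span in spans:
        # Any tool span whose parent is a discovered agent becomes a child tool.
        if span["type"] == "tool" and span.get("parent_id") in agents:
            agents[span["parent_id"]]["child_tools"].append(span["name"])
    return agents

trace = [
    {"id": "s1", "type": "agent", "name": "planner",
     "input": {"question": "..."}, "output": {"plan": "..."}},
    {"id": "s2", "type": "tool", "name": "web_search", "parent_id": "s1"},
    {"id": "s3", "type": "tool", "name": "calculator", "parent_id": "s1"},
]
agents = discover_agents(trace)
# agents["s1"]["child_tools"] == ["web_search", "calculator"]
```

The point is that the black-box structure (inputs, outputs, child tools) falls out of the trace topology itself, with no manual registration step.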
Why It Matters
Unlike evaluation platforms that rely on passive tracing, this system actively "reverse engineers" your agent's behavior, making complex, evolving agentic workflows easier to understand without manual intervention.
Evaluation Config allows a single evaluation template to be targeted dynamically at either Live Tracing (production) or Dataset Runs (testing) with a simple toggle.
Why It Matters
Eliminates the need to maintain separate "Testing" and "Monitoring" logic. Your CI/CD tests can become your production monitors instantly, reducing maintenance overhead and ensuring consistency.
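A minimal sketch of the toggle, assuming hypothetical field names (`template`, `target`, `sampling_rate` are illustrative, not the platform's real schema):

```python
# One evaluation config; only the "target" toggle decides whether it scores
# live production traces or a dataset run in CI. Field names are assumptions.

def build_eval_config(template_name, target):
    assert target in ("live_tracing", "dataset_run")
    return {
        "template": template_name,  # shared evaluator logic, defined once
        "target": target,           # the only field that changes
        # Sample production traffic; score every row of a test dataset.
        "sampling_rate": 0.1 if target == "live_tracing" else 1.0,
    }

monitor = build_eval_config("hallucination", "live_tracing")
ci_test = build_eval_config("hallucination", "dataset_run")
# Same template drives both: monitor["template"] == ci_test["template"]
```

Because both configs reference the same template, a prompt fix to the evaluator propagates to testing and monitoring at once.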
A built-in "Marketplace" of system-level Library Templates lets users browse and "Clone" templates into their project with one click, inheriting best-practice prompt engineering.
Why It Matters
Lowers the barrier to entry for evaluation. Users don't start from scratch; they start with industry-standard evaluators (Hallucination, Conciseness, etc.), accelerating time-to-value.
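The clone semantics can be sketched as a deep copy into the user's project, so edits never mutate the shared library original. All names here are illustrative assumptions:

```python
import copy

# Hypothetical clone flow: a system library template is deep-copied into the
# user's project; the clone is editable, the shared original is immutable.

LIBRARY = {
    "hallucination": {"prompt": "Does the answer contradict the context? ...",
                      "scope": "trace", "owner": "system"},
}

def clone_template(name, project_id):
    template = copy.deepcopy(LIBRARY[name])
    template["owner"] = project_id  # the clone now belongs to the project
    return template

mine = clone_template("hallucination", "proj-42")
mine["prompt"] += " Answer strictly yes/no."
# LIBRARY["hallucination"]["prompt"] is unchanged
```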
The creation logic includes "Smart Filter Detection" which automatically targets evaluations to specific spans (e.g., "Agent" vs "Tool" spans) based on the semantic content of the prompt.
Why It Matters
Reduces misconfiguration. The system "knows" that a generic "Tool Usage" metric should only run on Tool spans, not whole traces, ensuring accurate and relevant evaluations.
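One way such detection could work is a simple keyword heuristic over the evaluator prompt. This is a sketch of the idea only; the hint table and fallback are assumptions:

```python
# Sketch of "Smart Filter Detection": inspect the evaluator prompt's wording
# and pick the span level it should run on. Heuristics are illustrative.

SPAN_HINTS = {
    "tool": ("tool", "function call", "api call"),
    "agent": ("agent", "plan", "delegation"),
}

def detect_target_span(prompt_text):
    text = prompt_text.lower()
    for span_type, keywords in SPAN_HINTS.items():
        if any(k in text for k in keywords):
            return span_type
    return "trace"  # default: evaluate the whole trace

detect_target_span("Did the model pick the right tool for the task?")  # -> "tool"
detect_target_span("Is the final answer faithful to the context?")     # -> "trace"
```

A "Tool Usage" metric thus lands on Tool spans automatically, instead of being misapplied to entire traces.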
Built-in support for a configurable delay (in seconds) in evaluation configs, enabling "Shadow Mode" evaluations that handle eventual consistency and long-running async agent tasks.
Why It Matters
Critical for agentic workflows where the "result" may settle seconds after the trace finishes. Prevents race conditions in scoring and ensures evaluation accuracy for delayed operations.
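The scheduling rule itself is trivial, which is the point. A sketch, with the `delay_seconds` field name assumed:

```python
# Minimal sketch of delayed ("Shadow Mode") scoring: the evaluator is not
# eligible to run until delay_seconds after the trace ends, giving
# late-arriving async spans time to settle. Field names are assumptions.

def schedule_evaluation(trace_end_ts, config):
    """Return the earliest timestamp at which the evaluator may run."""
    return trace_end_ts + config.get("delay_seconds", 0)

config = {"template": "tool_usage", "delay_seconds": 30}
run_at = schedule_evaluation(trace_end_ts=1_700_000_000, config=config)
# run_at == 1_700_000_030: scoring waits 30s for async results to land
```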
Datasets support explicit input schema and output schema mapping/validation, treating datasets as "Golden Samples" with strict contracts rather than just loose CSV uploads.
Why It Matters
Ensures high-quality testing data. Prevents "Garbage In, Garbage Out" in evaluation runs by validating data structure before execution, maintaining dataset integrity and reliability.
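The contract check can be sketched as a type validation pass over every dataset item before a run starts. The schema shape below is an assumption, not the platform's real format:

```python
# Illustrative "golden sample" contract: every dataset item must match the
# declared input/output schemas, or the run is rejected up front.

INPUT_SCHEMA = {"question": str}
OUTPUT_SCHEMA = {"answer": str, "sources": list}

def validate_item(item):
    for field, typ in INPUT_SCHEMA.items():
        if not isinstance(item["input"].get(field), typ):
            raise ValueError(f"input.{field} must be {typ.__name__}")
    for field, typ in OUTPUT_SCHEMA.items():
        if not isinstance(item["expected_output"].get(field), typ):
            raise ValueError(f"expected_output.{field} must be {typ.__name__}")

validate_item({"input": {"question": "What is RAG?"},
               "expected_output": {"answer": "...", "sources": ["doc1"]}})  # passes
```

A loose CSV upload with a numeric `question` or a missing `sources` column fails here, before it can poison an evaluation run.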
A dedicated Model Pricing table allows users to override input/output costs per model/provider, supporting custom fine-tuned models or negotiated enterprise rates.
Why It Matters
Essential for enterprises using proxies or private models where public API pricing doesn't apply. Enables accurate cost tracking and budgeting for custom deployments.
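The override lookup reduces to "custom row wins over public list price." A sketch with made-up rates and names:

```python
# Sketch of per-model pricing overrides: a custom row (e.g. a negotiated
# enterprise rate for a fine-tuned model) takes precedence over public list
# prices when computing cost. All rates and names below are made up.

PUBLIC_PRICING = {("openai", "gpt-4o"): {"input": 2.50, "output": 10.00}}   # $/1M tokens
OVERRIDES = {("acme-proxy", "gpt-4o-ft"): {"input": 1.10, "output": 4.40}}  # $/1M tokens

def cost_usd(provider, model, input_tokens, output_tokens):
    rates = OVERRIDES.get((provider, model)) or PUBLIC_PRICING[(provider, model)]
    return (input_tokens * rates["input"] + output_tokens * rates["output"]) / 1_000_000

c = cost_usd("acme-proxy", "gpt-4o-ft", 10_000, 2_000)
# c == 0.0198: (10_000 * 1.10 + 2_000 * 4.40) / 1_000_000
```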
The architecture is built from the ground up with Project and User isolation boundaries enforced at the API and Database level.
Why It Matters
Makes the platform ready for SaaS-style deployment or internal Platform Engineering teams serving multiple business units out of the box, with zero additional configuration.
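The enforcement pattern can be sketched as a mandatory tenant filter that callers cannot bypass. Data shapes and names are illustrative assumptions:

```python
# Minimal multi-tenancy sketch: every read path is forced through a
# project_id filter applied server-side, so one tenant can never see
# another tenant's rows. Shapes and names are assumptions.

TRACES = [
    {"id": "t1", "project_id": "proj-a", "name": "checkout-agent"},
    {"id": "t2", "project_id": "proj-b", "name": "support-agent"},
]

def list_traces(project_id):
    # The filter lives inside the data-access layer; callers cannot opt out.
    return [t for t in TRACES if t["project_id"] == project_id]

def get_trace(project_id, trace_id):
    for t in list_traces(project_id):
        if t["id"] == trace_id:
            return t
    raise PermissionError(f"trace {trace_id} not visible to {project_id}")

list_traces("proj-a")        # only proj-a rows
# get_trace("proj-a", "t2")  # raises PermissionError (cross-tenant access)
```

Routing every lookup through the scoped `list_traces` means isolation holds even if a new endpoint forgets to think about tenancy.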