Evaluation closes the loop: you ran traces, you tracked cost, now you need to know how good the answers are. Cognis ships a small evaluation harness under cognis::eval for offline runs over a dataset — pair it with scores to push results into Langfuse.

Mental model

Three moving parts:
  • Cases: (input, expected) pairs of type EvalCase<I, O>.
  • A runnable under test — anything Runnable<I, O>. An agent, a chain, or a custom impl.
  • An evaluator: Evaluator<O> produces a score in [0.0, 1.0] for an actual O against a reference O.
EvalRunner invokes every case through the runnable, scores each output, and produces an EvalReport.

Quick example

use std::sync::Arc;
use cognis::eval::{EvalCase, EvalRunner, ExactMatch};
use cognis::prelude::*;

let cases = vec![
    EvalCase::new("greeting".to_string(), "hello".to_string()).with_name("greet"),
    EvalCase::new("math".to_string(), "4".to_string()).with_name("two-plus-two"),
];

let runner = EvalRunner::new(
    Arc::new(my_runnable),       // anything Runnable<String, String>
    Arc::new(ExactMatch),        // built-in evaluator
    cases,
)
.with_concurrency(8);

let report = runner.run().await?;
println!("mean score: {:.2}", report.mean());
println!("pass rate (>= 0.8): {:.2}", report.pass_rate(0.8));
EvalReport<O> exposes mean(), pass_rate(threshold), passing(threshold), best(), worst(), plus the raw rows: Vec<EvalRow<O>>. Source: examples/observability/evaluation_framework.rs.
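
Beyond the mean, the raw rows support quick failure triage. A hedged sketch, assuming worst() hands back an Option over a row reference and that each row carries the name, score, and actual fields used elsewhere on this page:
// Surface the weakest case first (worst() is assumed to return an Option of a row).
if let Some(worst) = report.worst() {
    println!("worst case: {:?} scored {:.2}", worst.name, worst.score);
}

// List every case under the pass threshold next to what the runnable produced.
for row in report.rows.iter().filter(|r| r.score < 0.8) {
    println!("{:?}: {:.2} -> {}", row.name, row.score, row.actual);
}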

Built-in evaluators

  • ExactMatch: 1.0 iff actual == expected, else 0.0
  • Contains: 1.0 iff the actual string contains the expected substring
  • LlmJudge: asks a Client to score the actual against the expected; useful when “exact match” isn’t appropriate
LlmJudge takes a Client plus a rubric prompt; see crates/cognis/src/eval/evaluators.rs.
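
A hedged sketch of wiring the judge into a runner, reusing my_runnable and cases from the quick example and assuming an in-scope Client named client; the LlmJudge::new(client, rubric) constructor shape is an assumption, so check evaluators.rs for the real signature:
use std::sync::Arc;
use cognis::eval::{EvalRunner, LlmJudge};

// Constructor shape is an assumption; see crates/cognis/src/eval/evaluators.rs
// for the real signature. `client` is any configured cognis Client.
let judge = LlmJudge::new(
    client,
    "Score 1.0 if the answer matches the reference in meaning, 0.0 if it contradicts it, partial credit otherwise.",
);

let runner = EvalRunner::new(Arc::new(my_runnable), Arc::new(judge), cases.clone());
let report = runner.run().await?;
println!("judge mean: {:.2}", report.mean());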

Custom evaluators

The trait is one async method:
use async_trait::async_trait;
use cognis::eval::Evaluator;
use cognis::Result;

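// Scores 1.0 when the lengths match and decays linearly to 0.0 as they diverge.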
struct LengthSimilar;

#[async_trait]
impl Evaluator<String> for LengthSimilar {
    async fn score(&self, actual: &String, expected: &String) -> Result<f32> {
        let diff = (actual.len() as f32 - expected.len() as f32).abs();
        let scale = expected.len().max(1) as f32;
        Ok((1.0 - diff / scale).max(0.0))
    }
}
Stack evaluators by running multiple EvalRunners over the same cases — most evals run a fast deterministic evaluator (exact match, length check) plus an LlmJudge for nuanced quality.
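
For instance, a minimal sketch of that stacking, reusing my_runnable and cases from the quick example and LengthSimilar from above (an LlmJudge slots into the second runner the same way):
use std::sync::Arc;
use cognis::eval::{EvalRunner, ExactMatch};

let target = Arc::new(my_runnable);   // share one runnable across both passes

// Fast deterministic pass.
let exact = EvalRunner::new(target.clone(), Arc::new(ExactMatch), cases.clone())
    .with_concurrency(8);
let exact_report = exact.run().await?;

// Nuanced pass over the same cases.
let judged = EvalRunner::new(target, Arc::new(LengthSimilar), cases);
let judged_report = judged.run().await?;

println!("exact: {:.2}  length: {:.2}", exact_report.mean(), judged_report.mean());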

Pushing scores to Langfuse

Combine the eval report with LangfuseScorer (see Prompts and scores) to push every score record into Langfuse, tied to the trace of each case.
use cognis_trace::{ScoreRecord, ScoreValue};
use cognis_trace::exporters::langfuse::{LangfuseConfig, LangfuseScorer};

let scorer = LangfuseScorer::new(LangfuseConfig::from_env()?)?;
for (row, run_id) in report.rows.iter().zip(case_run_ids) {
    scorer.submit(ScoreRecord {
        run_id,
        // Set at least one of `trace_id` or `session_id` — `LangfuseScorer`
        // skips submission when both are None (a score with nothing to link
        // to is dropped).
        trace_id: Some(run_id),       // attach to the case's trace
        session_id: None,
        name: "exact_match".into(),
        value: ScoreValue::Numeric(row.score as f64),
        comment: row.name.clone(),
    }).await?;
}
Run cases under a tracing observer to get the run_id per case (case_run_ids above); see Trace with Langfuse for the wiring.

How it works

  • Two passes per run: the runner invokes every case under a concurrency cap, then scores each actual output against its expected value.
  • Concurrency follows with_concurrency(n) — defaults to 4. Tune for your provider’s quota when the runnable is a Client-backed chain.
  • Errors per case propagate from runner.run().await?. Wrap your runnable with with_max_retries if you want transient errors absorbed.
  • Reports are read-only summaries. EvalReport and EvalRow aren’t Serialize today — for snapshotting against baselines, build your own thin record from (name, score, actual) and serialize that.
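
A minimal sketch of such a snapshot record, assuming serde and serde_json are available and that EvalRow exposes the name, score, and actual fields used earlier on this page:
use serde::Serialize;

// Hypothetical snapshot record; not part of cognis::eval. Field names and
// types are assumptions about EvalRow, so adjust to the real shape.
#[derive(Serialize)]
struct SnapshotRow {
    name: Option<String>,
    score: f32,
    actual: String,
}

let snapshot: Vec<SnapshotRow> = report
    .rows
    .iter()
    .map(|row| SnapshotRow {
        name: row.name.clone(),
        score: row.score,
        actual: row.actual.clone(),
    })
    .collect();

std::fs::write("eval_baseline.json", serde_json::to_string_pretty(&snapshot)?)?;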

See also

  • Prompts and scores: push eval scores into Langfuse.
  • Trace with Langfuse: see production and eval runs side by side.