Evaluation closes the loop: you ran traces, you tracked cost, now you need to know how good the answers are. Cognis ships a small evaluation harness under cognis::eval for offline runs over a dataset — pair it with scores to push results into Langfuse.

Mental model

Three moving parts:
  • Cases: (input, expected) pairs of type EvalCase<I, O>.
  • A runnable under test — anything Runnable<I, O>. An agent, a chain, or a custom impl.
  • An evaluator: Evaluator<O> produces a score in [0.0, 1.0] for an actual O against a reference O.
EvalRunner invokes every case through the runnable, scores each output, and produces an EvalReport.

Quick example

use std::sync::Arc;
use cognis::eval::{EvalCase, EvalRunner, ExactMatch};
use cognis::prelude::*;

let cases = vec![
    EvalCase::new("greeting".to_string(), "hello".to_string()).with_name("greet"),
    EvalCase::new("math".to_string(), "4".to_string()).with_name("two-plus-two"),
];

let runner = EvalRunner::new(
    Arc::new(my_runnable),       // anything Runnable<String, String>
    Arc::new(ExactMatch),        // built-in evaluator
    cases,
)
.with_concurrency(8);

let report = runner.run().await?;
println!("mean score: {:.2}", report.mean());
println!("pass rate (>= 0.8): {:.2}", report.pass_rate(0.8));
EvalReport<O> exposes mean(), pass_rate(threshold), passing(threshold), best(), worst(), plus the raw rows: Vec<EvalRow<O>>. Source: examples/observability/evaluation_framework.rs.
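
Beyond the mean, the raw rows support quick failure triage. A hedged sketch, assuming worst() hands back an Option over a row reference and that each row carries the name, score, and actual fields used elsewhere on this page:
// Surface the weakest case first (worst() is assumed to return an Option of a row).
if let Some(worst) = report.worst() {
    println!("worst case: {:?} scored {:.2}", worst.name, worst.score);
}

// List every case under the pass threshold next to what the runnable produced.
for row in report.rows.iter().filter(|r| r.score < 0.8) {
    println!("{:?}: {:.2} -> {}", row.name, row.score, row.actual);
}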

Built-in evaluators

  • ExactMatch: 1.0 iff actual == expected, else 0.0
  • Contains: 1.0 iff the actual string contains the expected substring
  • LlmJudge: asks a Client to score the actual against the expected; useful when “exact match” isn’t appropriate
LlmJudge takes a Client plus a rubric prompt; see crates/cognis/src/eval/evaluators.rs.
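
A hedged sketch of wiring the judge into a runner, reusing my_runnable and cases from the quick example and assuming an in-scope Client named client; the LlmJudge::new(client, rubric) constructor shape is an assumption, so check evaluators.rs for the real signature:
use std::sync::Arc;
use cognis::eval::{EvalRunner, LlmJudge};

// Constructor shape is an assumption; see crates/cognis/src/eval/evaluators.rs
// for the real signature. `client` is any configured cognis Client.
let judge = LlmJudge::new(
    client,
    "Score 1.0 if the answer matches the reference in meaning, 0.0 if it contradicts it, partial credit otherwise.",
);

let runner = EvalRunner::new(Arc::new(my_runnable), Arc::new(judge), cases.clone());
let report = runner.run().await?;
println!("judge mean: {:.2}", report.mean());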

Custom evaluators

The trait is one async method:
use async_trait::async_trait;
use cognis::eval::Evaluator;
use cognis::Result;

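// Scores 1.0 when the lengths match and decays linearly to 0.0 as they diverge.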
struct LengthSimilar;

#[async_trait]
impl Evaluator<String> for LengthSimilar {
    async fn score(&self, actual: &String, expected: &String) -> Result<f32> {
        let diff = (actual.len() as f32 - expected.len() as f32).abs();
        let scale = expected.len().max(1) as f32;
        Ok((1.0 - diff / scale).max(0.0))
    }
}
Stack evaluators by running multiple EvalRunners over the same cases — most evals run a fast deterministic evaluator (exact match, length check) plus an LlmJudge for nuanced quality.
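
For instance, a minimal sketch of that stacking, reusing my_runnable and cases from the quick example and LengthSimilar from above (an LlmJudge slots into the second runner the same way):
use std::sync::Arc;
use cognis::eval::{EvalRunner, ExactMatch};

let target = Arc::new(my_runnable);   // share one runnable across both passes

// Fast deterministic pass.
let exact = EvalRunner::new(target.clone(), Arc::new(ExactMatch), cases.clone())
    .with_concurrency(8);
let exact_report = exact.run().await?;

// Nuanced pass over the same cases.
let judged = EvalRunner::new(target, Arc::new(LengthSimilar), cases);
let judged_report = judged.run().await?;

println!("exact: {:.2}  length: {:.2}", exact_report.mean(), judged_report.mean());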

Pushing scores to Langfuse

Combine the eval report with LangfuseScorer (see Prompts and scores) to push every score record into Langfuse, tied to the trace of each case.
use cognis_trace::{ScoreRecord, ScoreValue};
use cognis_trace::exporters::langfuse::{LangfuseConfig, LangfuseScorer};

let scorer = LangfuseScorer::new(LangfuseConfig::from_env()?)?;
for (row, run_id) in report.rows.iter().zip(case_run_ids) {
    scorer.submit(ScoreRecord {
        run_id,
        // Set at least one of `trace_id` or `session_id` — `LangfuseScorer`
        // skips submission when both are None (a score with nothing to link
        // to is dropped).
        trace_id: Some(run_id),       // attach to the case's trace
        session_id: None,
        name: "exact_match".into(),
        value: ScoreValue::Numeric(row.score as f64),
        comment: row.name.clone(),
    }).await?;
}
Run cases under a tracing observer to get the run_id per case (case_run_ids above); see Trace with Langfuse for the wiring.

How it works

  • Two passes per run: the runner invokes every case under a concurrency cap, then scores each actual output against its expected value.
  • Concurrency follows with_concurrency(n) — defaults to 4. Tune for your provider’s quota when the runnable is a Client-backed chain.
  • Errors per case propagate from runner.run().await?. Wrap your runnable with with_max_retries if you want transient errors absorbed.
  • Reports are read-only summaries. EvalReport and EvalRow aren’t Serialize today — for snapshotting against baselines, build your own thin record from (name, score, actual) and serialize that.
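
A minimal sketch of such a snapshot record, assuming serde and serde_json are available and that EvalRow exposes the name, score, and actual fields used earlier on this page:
use serde::Serialize;

// Hypothetical snapshot record; not part of cognis::eval. Field names and
// types are assumptions about EvalRow, so adjust to the real shape.
#[derive(Serialize)]
struct SnapshotRow {
    name: Option<String>,
    score: f32,
    actual: String,
}

let snapshot: Vec<SnapshotRow> = report
    .rows
    .iter()
    .map(|row| SnapshotRow {
        name: row.name.clone(),
        score: row.score,
        actual: row.actual.clone(),
    })
    .collect();

std::fs::write("eval_baseline.json", serde_json::to_string_pretty(&snapshot)?)?;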

See also

  • Prompts and scores: push eval scores into Langfuse.
  • Trace with Langfuse: see production and eval runs side by side.