Evaluation closes the loop: you ran traces, you tracked cost, now you need to know how good the answers are. Cognis ships a small evaluation harness under `cognis::eval` for offline runs over a dataset — pair it with scores to push results into Langfuse.
Mental model
Three moving parts:

- Cases — `(input, expected)` pairs of type `EvalCase<I, O>`.
- A runnable under test — anything `Runnable<I, O>`. An agent, a chain, or a custom impl.
- An evaluator — `Evaluator<O>` produces a score in `[0.0, 1.0]` for an actual `O` against a reference `O`.
`EvalRunner` invokes every case through the runnable, scores each output, and produces an `EvalReport`.
Quick example
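A minimal sketch of a run. The constructor and builder shapes here (`EvalCase::new`, `EvalRunner::new`, passing cases to `run`, and `my_agent` itself) are assumptions; `with_concurrency`, `run().await?`, `mean()`, and `pass_rate()` are documented on this page, and the source file below has the real wiring.

```rust
use cognis::eval::{EvalCase, EvalRunner, ExactMatch};

// Sketch only: EvalCase::new, EvalRunner::new, and run(cases) are
// assumed shapes; see the example source below for the real code.
let cases: Vec<EvalCase<String, String>> = vec![
    EvalCase::new("capital of France?", "Paris"),
    EvalCase::new("capital of Japan?", "Tokyo"),
];

// my_agent: any Runnable<String, String> under test.
let report = EvalRunner::new(my_agent, ExactMatch)
    .with_concurrency(4) // the documented default
    .run(cases)
    .await?;

println!(
    "mean = {:.2}, pass rate @0.5 = {:.0}%",
    report.mean(),
    report.pass_rate(0.5) * 100.0
);
```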
`EvalReport<O>` exposes `mean()`, `pass_rate(threshold)`, `passing(threshold)`, `best()`, `worst()`, plus the raw rows: `Vec<EvalRow<O>>`.
Source: `examples/observability/evaluation_framework.rs`.
Built-in evaluators
| Evaluator | Score |
|---|---|
| `ExactMatch` | 1.0 iff actual == expected, else 0.0 |
| `Contains` | 1.0 iff the actual string contains the expected substring |
| `LlmJudge` | Asks a `Client` to score the actual against the expected — useful when “exact match” isn’t appropriate |
`LlmJudge` takes a `Client` plus a rubric prompt; see `crates/cognis/src/eval/evaluators.rs`.
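For illustration, constructing a judge might look like the sketch below; the constructor shape is an assumption, so check `evaluators.rs` for the real signature.

```rust
// Assumed constructor shape; see crates/cognis/src/eval/evaluators.rs.
let judge = LlmJudge::new(
    client, // any cognis Client: this model does the grading
    "Score 1.0 if the answer states the same fact as the reference, \
     0.0 if it contradicts it, and something in between for partial credit.",
);
```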
Custom evaluators
The trait is one async method; a sketch follows below. You can run multiple `EvalRunner`s over the same cases — most evals run a fast deterministic evaluator (exact match, length check) plus an `LlmJudge` for nuanced quality.
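As an illustration, a deterministic length check might look like this. The method name and signature are assumptions (the page only promises one async method yielding a score in `[0.0, 1.0]`), and the sketch assumes the trait uses native async-fn-in-trait:

```rust
use cognis::eval::Evaluator;

// Deterministic check: full credit when the answer stays within a
// character budget, zero otherwise.
struct LengthWithin {
    max_chars: usize,
}

impl Evaluator<String> for LengthWithin {
    // Assumed method shape: one async method returning a score in [0.0, 1.0].
    async fn evaluate(&self, actual: &String, _expected: &String) -> f64 {
        if actual.len() <= self.max_chars { 1.0 } else { 0.0 }
    }
}
```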
Pushing scores to Langfuse
Combine the eval report with `LangfuseScorer` (see Prompts and scores) to push every score record into Langfuse, tied to the trace of each case.
Record a `run_id` per case (the `case_run_ids` map in the sketch below); see Trace with Langfuse for the wiring.
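A hypothetical end-to-end push. The `LangfuseScorer` methods, the `EvalRow` fields (`name`, `score`), and the `case_run_ids` map are all assumed shapes; Prompts and scores documents the real interface.

```rust
use std::collections::HashMap;

// Hypothetical wiring: push_score and the EvalRow fields are assumed
// shapes, not the documented API.
async fn push_report(
    scorer: &LangfuseScorer,
    report: &EvalReport<String>,
    case_run_ids: &HashMap<String, String>, // run_id recorded per case
) -> anyhow::Result<()> {
    for row in &report.rows {
        scorer
            .push_score(&case_run_ids[&row.name], "exact_match", row.score)
            .await?;
    }
    Ok(())
}
```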
How it works
- Two passes per run: the runner invokes every case under a concurrency cap, then scores actuals against expecteds.
- Concurrency follows `with_concurrency(n)` — defaults to 4. Tune for your provider’s quota when the runnable is a `Client`-backed chain.
- Errors per case propagate from `runner.run().await?`. Wrap your runnable with `with_max_retries` if you want transient errors absorbed.
- Reports are read-only summaries. `EvalReport` and `EvalRow` aren’t `Serialize` today — for snapshotting against baselines, build your own thin record from `(name, score, actual)` and serialize that, as sketched below.
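For example, a snapshot record can be this small. Field access on `EvalReport` and `EvalRow` (`rows`, `name`, `score`, `actual`) is an assumption:

```rust
use serde::Serialize;

// Thin record for baseline snapshots; the row field names are assumed.
#[derive(Serialize)]
struct SnapshotRow {
    name: String,
    score: f64,
    actual: String,
}

let snapshot: Vec<SnapshotRow> = report
    .rows
    .iter()
    .map(|r| SnapshotRow {
        name: r.name.clone(),
        score: r.score,
        actual: r.actual.clone(),
    })
    .collect();

std::fs::write("baseline.json", serde_json::to_string_pretty(&snapshot)?)?;
```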
See also
- Prompts and scores: push eval scores into Langfuse.
- Trace with Langfuse: see production and eval runs side by side.