

Middleware is how Cognis adds production discipline around Client calls — retry, fallback, rate limits, redaction, prompt caching, planning, summarization. Each middleware wraps a Client and runs on every chat call. Multiple middlewares compose into a MiddlewarePipeline.

How it works

A middleware implements cognis::middleware::Middleware, a trait with one async method (call) that receives a MiddlewareCtx and an Arc<dyn Next>. The pipeline runs them in reverse-push order — the most-recently-pushed layer is the outermost wrapper.
use std::sync::Arc;
use cognis::middleware::{ModelRetry, RegexRedactor, MiddlewarePipeline};
use cognis_llm::{Client, provider::Provider};

let primary = Client::from_env()?;

let pipelined = MiddlewarePipeline::new()
    .push(ModelRetry::new(3))         // innermost (closest to the client)
    .push(RegexRedactor::new())       // outermost (sees the request first)
    .build(primary);

// Use directly:
let resp = pipelined.invoke(messages, tool_defs, opts).await?;
The chain executes outside-in: RegexRedactor::call runs first, then ModelRetry::call, then the raw client. Push order is “innermost first.”
Middleware is not auto-wired into AgentBuilder in v0.3. To run middleware inside an agent loop, wrap your client into a PipelinedClient and serve it through a custom LLMProvider — see Wiring middleware into an agent below.

What’s in the box

The full catalog lives under cognis::middleware::*. Reach for these by job:

Resilience

Middleware | Constructor
ModelRetry | ModelRetry::new(max_attempts) — exponential backoff (100ms initial, 2x, 30s cap) by default
ModelFallback | ModelFallback::new(fallback_client: Client)
Recovery (FixedRecovery, FnRecovery) | Custom recovery on errors
ToolRetry (ToolRetryClassifier) | ToolRetry::new(max_attempts) for retrying failed tool calls emitted by the model
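How you order ModelRetry and ModelFallback matters. A minimal sketch, assuming ModelFallback catches errors bubbling up from the layers inside it and replays the call on its fallback client (the production stack below uses the same ordering):
use cognis::middleware::{MiddlewarePipeline, ModelFallback, ModelRetry};
use cognis_llm::Client;

let primary = Client::from_env()?;
let backup = Client::from_env()?; // stand-in; build your real fallback client here

// ModelFallback pushed first sits innermost, so every retry attempt can fall
// back to `backup` if `primary` errors.
let resilient = MiddlewarePipeline::new()
    .push(ModelFallback::new(backup))
    .push(ModelRetry::new(3))
    .build(primary);

// Reversing the two pushes would exhaust all retries against `primary`
// before ModelFallback makes a single call to `backup`.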

Rate and cost

Middleware | Constructor
RateLimit | RateLimit::new(Arc::new(TokenBucket::new(rate_per_sec, burst))) — also accepts SlidingWindow, Composite, CostBased
ModelCallLimit | ModelCallLimit::new(cap) — hard cap on calls per pipeline run
ToolCallLimit | ToolCallLimit::new(cap)
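These compose like any other middleware. A minimal sketch with illustrative parameters, assuming TokenBucket::new takes (rate_per_sec, burst) as listed above:
use std::sync::Arc;
use cognis::middleware::{MiddlewarePipeline, ModelCallLimit, RateLimit, TokenBucket, ToolCallLimit};
use cognis_llm::Client;

let client = Client::from_env()?;

let limited = MiddlewarePipeline::new()
    .push(ModelCallLimit::new(10))                               // at most 10 model calls per run
    .push(ToolCallLimit::new(25))                                // at most 25 tool calls per run
    .push(RateLimit::new(Arc::new(TokenBucket::new(5.0, 100)))) // ~5 req/s, burst of 100; outermost
    .build(client);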

Privacy

Middleware | Constructor
PiiRedactor | PiiRedactor::new() — masks common PII patterns (emails, phones, etc.)
RegexRedactor | RegexRedactor::new() — bring your own patterns
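Placement matters for redaction: anything pushed before the redactor sits inside it and therefore only sees masked content. A minimal sketch using PiiRedactor (wiring custom patterns into RegexRedactor is not shown here):
use cognis::middleware::{MiddlewarePipeline, ModelRetry, PiiRedactor};
use cognis_llm::Client;

let client = Client::from_env()?;

let redacted = MiddlewarePipeline::new()
    .push(ModelRetry::new(3))   // inner layer: only ever sees masked messages
    .push(PiiRedactor::new())   // pushed last, runs outermost, masks before anything else runs
    .build(client);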

Prompt and context

Middleware | Constructor
PromptCaching | PromptCaching::new() (or ::default()) — Anthropic prompt-cache markers
ContextEditing | ContextEditing::new(policy) — mutate messages before they go to the model
ContextInjection | ContextInjection::new(provider) — inject context derived from your app state
Summarization | Summarization::new(keep_last) — compress old turns when the transcript grows

Planning and todos

Middleware | Constructor
Planning | Planning::new()
TodoMiddleware | Maintain an internal task list the agent can read and update

Tools

Middleware | Effect
ToolEmulator (MapEmulator, EmulatorSource) | Replay tool calls deterministically — great for tests
ToolSelection | Steer the model toward a subset of tools per turn
PatchToolCalls (FnToolCallPatcher) | Fix or rewrite the model’s tool calls before dispatch
For tool-call gating (require human approval before specific tools run), use Approver + AgentBuilder::with_approver, not middleware. See Human-in-the-loop.

Workspace and subagents

Middleware | Effect
FilesystemMiddleware | Expose a virtual workspace to the model
SubagentMiddleware (SubagentRouter) | Spawn subagents from inside the pipeline for context isolation

Quick example — production stack

A reasonable defaults stack for a customer-facing client:
use std::sync::Arc;
use cognis::middleware::{
    ModelRetry, ModelFallback, RateLimit, TokenBucket,
    PromptCaching, PiiRedactor, Summarization, MiddlewarePipeline,
};
use cognis_llm::{Client, provider::Provider};

let primary = Client::from_env()?;
let backup  = Client::builder()
    .provider(Provider::OpenAI)
    .api_key(std::env::var("OPENAI_API_KEY")?)
    .model("gpt-4o-mini")
    .build()?;

// Innermost first, outermost last (reverse-push semantics).
let pipelined = MiddlewarePipeline::new()
    .push(ModelFallback::new(backup))                          // last-resort fallback
    .push(ModelRetry::new(3))                                  // retry transients
    .push(Summarization::new(8))                               // keep last 8 turns
    .push(PromptCaching::new())                                // mark cacheable
    .push(PiiRedactor::new())                                  // redact before send
    .push(RateLimit::new(Arc::new(TokenBucket::new(1000.0, 60_000)))) // outermost
    .build(primary);
Read top-to-bottom (push order): fallback sits inside retry, retry inside redaction, and redaction inside the rate limiter — so incoming requests are throttled before anything else runs, and the model never sees raw PII. Because ModelRetry replays only the layers pushed before it, the outer rate limiter counts one call per run, not one per attempt; push RateLimit before ModelRetry if each attempt should consume a token, as in the sketch below.
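A minimal variation of the stack above (same constructors, only the push order changes), assuming RateLimit acquires a token on every call it sees:
// RateLimit pushed before ModelRetry now sits inside the retry loop,
// so every attempt, not just the first, consumes a token.
let pipelined = MiddlewarePipeline::new()
    .push(ModelFallback::new(backup))
    .push(RateLimit::new(Arc::new(TokenBucket::new(1000.0, 60_000))))
    .push(ModelRetry::new(3))
    .push(Summarization::new(8))
    .push(PromptCaching::new())
    .push(PiiRedactor::new())   // still outermost
    .build(primary);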

Wiring middleware into an agent

AgentBuilder accepts a raw Client; it doesn’t take a PipelinedClient directly. There are two ways to bridge.

Option 1 — use the PipelinedClient standalone, for code that calls the model directly without the agent harness:
let resp = pipelined.invoke(messages, tool_defs, opts).await?;
Option 2 — a custom LLMProvider that delegates to the pipeline. Wrap that provider in a Client, then hand it to AgentBuilder:
use std::sync::Arc;
use async_trait::async_trait;
use cognis_llm::{Client, provider::{LLMProvider, Provider}};
use cognis_llm::chat::{ChatOptions, ChatResponse, HealthStatus, StreamChunk};
use cognis::middleware::PipelinedClient;
use cognis_core::{Message, Result, RunnableStream};

struct PipelinedProvider(Arc<PipelinedClient>);

#[async_trait]
impl LLMProvider for PipelinedProvider {
    fn name(&self) -> &str { "pipelined" }
    fn provider_type(&self) -> Provider { self.0.client().provider().provider_type() }
    async fn chat_completion(&self, messages: Vec<Message>, opts: ChatOptions) -> Result<ChatResponse> {
        self.0.invoke(messages, vec![], opts).await
    }
    async fn chat_completion_with_tools(&self, messages: Vec<Message>, tools: Vec<cognis_llm::tools::ToolDefinition>, opts: ChatOptions) -> Result<ChatResponse> {
        self.0.invoke(messages, tools, opts).await
    }
    async fn chat_completion_stream(&self, m: Vec<Message>, o: ChatOptions) -> Result<RunnableStream<StreamChunk>> {
        self.0.client().provider().chat_completion_stream(m, o).await
    }
    async fn health_check(&self) -> Result<HealthStatus> {
        self.0.client().provider().health_check().await
    }
}

let agent = cognis::AgentBuilder::new()
    .with_llm(Client::new(Arc::new(PipelinedProvider(Arc::new(pipelined)))))
    .build()?;
This is a power-user pattern; reach for it when you need every agent-driven LLM call to go through the same middleware chain.

Writing your own middleware

The trait is small:
use std::sync::Arc;
use async_trait::async_trait;
use cognis::middleware::{Middleware, MiddlewareCtx, Next};
use cognis_llm::chat::ChatResponse;
use cognis_core::Result;

struct LogTokens;

#[async_trait]
impl Middleware for LogTokens {
    async fn call(&self, ctx: MiddlewareCtx, next: Arc<dyn Next>) -> Result<ChatResponse> {
        let resp = next.invoke(ctx).await?;
        if let Some(usage) = &resp.usage {
            tracing::info!(input = usage.input_tokens, output = usage.output_tokens, "llm call");
        }
        Ok(resp)
    }

    fn name(&self) -> &str { "LogTokens" }
}
Push it onto the pipeline like any other:
let pipelined = MiddlewarePipeline::new()
    .push(LogTokens)
    .push(ModelRetry::new(3))
    .build(client);
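Because call wraps next.invoke, a middleware can do work on both sides of the inner chain. A minimal sketch in the same shape as LogTokens that times each model call:
use std::sync::Arc;
use std::time::Instant;
use async_trait::async_trait;
use cognis::middleware::{Middleware, MiddlewareCtx, Next};
use cognis_llm::chat::ChatResponse;
use cognis_core::Result;

struct LogLatency;

#[async_trait]
impl Middleware for LogLatency {
    async fn call(&self, ctx: MiddlewareCtx, next: Arc<dyn Next>) -> Result<ChatResponse> {
        let started = Instant::now();       // before any inner layer runs
        let resp = next.invoke(ctx).await;  // everything pushed earlier, then the client
        tracing::info!(
            elapsed_ms = started.elapsed().as_millis() as u64,
            ok = resp.is_ok(),
            "llm call finished"
        );
        resp
    }

    fn name(&self) -> &str { "LogLatency" }
}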

See also

Production → Resilience

Patterns for ModelRetry, ModelFallback, and Recovery.

Production → Security

PII redaction, deny-lists, SSRF protection.

Human-in-the-loop

Approval-gated tools — different from middleware.

Reference → cognis

Full middleware re-export list.