

Middleware is how Cognis adds production discipline around Client calls — retry, fallback, rate limits, redaction, prompt caching, planning, summarization. Each middleware wraps a Client and runs on every chat call. Multiple middlewares compose into a MiddlewarePipeline.

How it works

A middleware implements cognis::middleware::Middleware, a trait with one async method (call) that receives a MiddlewareCtx and an Arc<dyn Next>. The pipeline runs them in reverse-push order — the most-recently-pushed layer is the outermost wrapper.
use std::sync::Arc;
use cognis::middleware::{ModelRetry, RegexRedactor, MiddlewarePipeline};
use cognis_llm::{Client, provider::Provider};

let primary = Client::from_env()?;

let pipelined = MiddlewarePipeline::new()
    .push(ModelRetry::new(3))         // innermost (closest to the client)
    .push(RegexRedactor::new())       // outermost (sees the request first)
    .build(primary);

// Use directly:
let resp = pipelined.invoke(messages, tool_defs, opts).await?;
The chain executes outside-in: RegexRedactor::call runs first, then ModelRetry::call, then the raw client. Push order is “innermost first.”
Middleware is not auto-wired into AgentBuilder in v0.3. To run middleware inside an agent loop, wrap your client into a PipelinedClient and serve it through a custom LLMProvider — see Wiring middleware into an agent below.

What’s in the box

The full catalog lives under cognis::middleware::*. Reach for these by job:

Resilience

Middleware | Constructor
ModelRetry | ModelRetry::new(max_attempts) — exponential backoff (100ms initial, 2x, 30s cap) by default
ModelFallback | ModelFallback::new(fallback_client: Client)
Recovery (FixedRecovery, FnRecovery) | Custom recovery on errors
ToolRetry (ToolRetryClassifier) | ToolRetry::new(max_attempts) for retrying failed tool calls emitted by the model
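How you order ModelRetry and ModelFallback matters. A minimal sketch, assuming ModelFallback catches errors bubbling up from the layers inside it and replays the call on its fallback client (the production stack below uses the same ordering):
use cognis::middleware::{MiddlewarePipeline, ModelFallback, ModelRetry};
use cognis_llm::Client;

let primary = Client::from_env()?;
let backup = Client::from_env()?; // stand-in; build your real fallback client here

// ModelFallback pushed first sits innermost, so every retry attempt can fall
// back to `backup` if `primary` errors.
let resilient = MiddlewarePipeline::new()
    .push(ModelFallback::new(backup))
    .push(ModelRetry::new(3))
    .build(primary);

// Reversing the two pushes would exhaust all retries against `primary`
// before ModelFallback makes a single call to `backup`.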

Rate and cost

Middleware | Constructor
RateLimit | RateLimit::new(Arc::new(TokenBucket::new(rate_per_sec, burst))) — also accepts SlidingWindow, Composite, CostBased
ModelCallLimit | ModelCallLimit::new(cap) — hard cap on calls per pipeline run
ToolCallLimit | ToolCallLimit::new(cap)
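These compose like any other middleware. A minimal sketch with illustrative parameters, assuming TokenBucket::new takes (rate_per_sec, burst) as listed above:
use std::sync::Arc;
use cognis::middleware::{MiddlewarePipeline, ModelCallLimit, RateLimit, TokenBucket, ToolCallLimit};
use cognis_llm::Client;

let client = Client::from_env()?;

let limited = MiddlewarePipeline::new()
    .push(ModelCallLimit::new(10))                               // at most 10 model calls per run
    .push(ToolCallLimit::new(25))                                // at most 25 tool calls per run
    .push(RateLimit::new(Arc::new(TokenBucket::new(5.0, 100)))) // ~5 req/s, burst of 100; outermost
    .build(client);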

Privacy

Middleware | Constructor
PiiRedactor | PiiRedactor::new() — masks common PII patterns (emails, phones, etc.)
RegexRedactor | RegexRedactor::new() — bring your own patterns
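Placement matters for redaction: anything pushed before the redactor sits inside it and therefore only sees masked content. A minimal sketch using PiiRedactor (wiring custom patterns into RegexRedactor is not shown here):
use cognis::middleware::{MiddlewarePipeline, ModelRetry, PiiRedactor};
use cognis_llm::Client;

let client = Client::from_env()?;

let redacted = MiddlewarePipeline::new()
    .push(ModelRetry::new(3))   // inner layer: only ever sees masked messages
    .push(PiiRedactor::new())   // pushed last, runs outermost, masks before anything else runs
    .build(client);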

Prompt and context

Middleware | Constructor
PromptCaching | PromptCaching::new() (or ::default()) — Anthropic prompt-cache markers
ContextEditing | ContextEditing::new(policy) — mutate messages before they go to the model
ContextInjection | ContextInjection::new(provider) — inject context derived from your app state
Summarization | Summarization::new(keep_last) — compress old turns when the transcript grows

Planning and todos

Middleware | Constructor
Planning | Planning::new()
TodoMiddleware | Maintain an internal task list the agent can read and update

Tools

Middleware | Effect
ToolEmulator (MapEmulator, EmulatorSource) | Replay tool calls deterministically — great for tests
ToolSelection | Steer the model toward a subset of tools per turn
PatchToolCalls (FnToolCallPatcher) | Fix or rewrite the model’s tool calls before dispatch
For tool-call gating (require human approval before specific tools run), use Approver + AgentBuilder::with_approver, not middleware. See Human-in-the-loop.

Workspace and subagents

Middleware | Effect
FilesystemMiddleware | Expose a virtual workspace to the model
SubagentMiddleware (SubagentRouter) | Spawn subagents from inside the pipeline for context isolation

Quick example — production stack

A reasonable defaults stack for a customer-facing client:
use std::sync::Arc;
use cognis::middleware::{
    ModelRetry, ModelFallback, RateLimit, TokenBucket,
    PromptCaching, PiiRedactor, Summarization, MiddlewarePipeline,
};
use cognis_llm::{Client, provider::Provider};

let primary = Client::from_env()?;
let backup  = Client::builder()
    .provider(Provider::OpenAI)
    .api_key(std::env::var("OPENAI_API_KEY")?)
    .model("gpt-4o-mini")
    .build()?;

// Innermost first, outermost last (reverse-push semantics).
let pipelined = MiddlewarePipeline::new()
    .push(ModelFallback::new(backup))                          // last-resort fallback
    .push(ModelRetry::new(3))                                  // retry transients
    .push(Summarization::new(8))                               // keep last 8 turns
    .push(PromptCaching::new())                                // mark cacheable
    .push(PiiRedactor::new())                                  // redact before send
    .push(RateLimit::new(Arc::new(TokenBucket::new(1000.0, 60_000)))) // outermost
    .build(primary);
Read top-to-bottom (push order): fallback sits inside retry, retry inside redaction, and redaction inside the rate limiter — so incoming requests are throttled before anything else runs, and the model never sees raw PII. Because ModelRetry replays only the layers pushed before it, the outer rate limiter counts one call per run, not one per attempt; push RateLimit before ModelRetry if each attempt should consume a token, as in the sketch below.
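A minimal variation of the stack above (same constructors, only the push order changes), assuming RateLimit acquires a token on every call it sees:
// RateLimit pushed before ModelRetry now sits inside the retry loop,
// so every attempt, not just the first, consumes a token.
let pipelined = MiddlewarePipeline::new()
    .push(ModelFallback::new(backup))
    .push(RateLimit::new(Arc::new(TokenBucket::new(1000.0, 60_000))))
    .push(ModelRetry::new(3))
    .push(Summarization::new(8))
    .push(PromptCaching::new())
    .push(PiiRedactor::new())   // still outermost
    .build(primary);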

Wiring middleware into an agent

AgentBuilder accepts a raw Client; it doesn’t take a PipelinedClient directly. There are two ways to bridge.

Option 1 — use the PipelinedClient standalone, for code that calls the model directly without the agent harness:
let resp = pipelined.invoke(messages, tool_defs, opts).await?;
Option 2 — a custom LLMProvider that delegates to the pipeline. Wrap that provider in a Client, then hand it to AgentBuilder:
use std::sync::Arc;
use async_trait::async_trait;
use cognis_llm::{Client, provider::{LLMProvider, Provider}};
use cognis_llm::chat::{ChatOptions, ChatResponse, HealthStatus, StreamChunk};
use cognis::middleware::PipelinedClient;
use cognis_core::{Message, Result, RunnableStream};

struct PipelinedProvider(Arc<PipelinedClient>);

#[async_trait]
impl LLMProvider for PipelinedProvider {
    fn name(&self) -> &str { "pipelined" }
    fn provider_type(&self) -> Provider { self.0.client().provider().provider_type() }
    async fn chat_completion(&self, messages: Vec<Message>, opts: ChatOptions) -> Result<ChatResponse> {
        self.0.invoke(messages, vec![], opts).await
    }
    async fn chat_completion_with_tools(&self, messages: Vec<Message>, tools: Vec<cognis_llm::tools::ToolDefinition>, opts: ChatOptions) -> Result<ChatResponse> {
        self.0.invoke(messages, tools, opts).await
    }
    async fn chat_completion_stream(&self, m: Vec<Message>, o: ChatOptions) -> Result<RunnableStream<StreamChunk>> {
        self.0.client().provider().chat_completion_stream(m, o).await
    }
    async fn health_check(&self) -> Result<HealthStatus> {
        self.0.client().provider().health_check().await
    }
}

let agent = cognis::AgentBuilder::new()
    .with_llm(Client::new(Arc::new(PipelinedProvider(Arc::new(pipelined)))))
    .build()?;
This is a power-user pattern; reach for it when you need every agent-driven LLM call to go through the same middleware chain.

Writing your own middleware

The trait is small:
use std::sync::Arc;
use async_trait::async_trait;
use cognis::middleware::{Middleware, MiddlewareCtx, Next};
use cognis_llm::chat::ChatResponse;
use cognis_core::Result;

struct LogTokens;

#[async_trait]
impl Middleware for LogTokens {
    async fn call(&self, ctx: MiddlewareCtx, next: Arc<dyn Next>) -> Result<ChatResponse> {
        let resp = next.invoke(ctx).await?;
        if let Some(usage) = &resp.usage {
            tracing::info!(input = usage.input_tokens, output = usage.output_tokens, "llm call");
        }
        Ok(resp)
    }

    fn name(&self) -> &str { "LogTokens" }
}
Push it onto the pipeline like any other:
let pipelined = MiddlewarePipeline::new()
    .push(LogTokens)
    .push(ModelRetry::new(3))
    .build(client);
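Because call wraps next.invoke, a middleware can do work on both sides of the inner chain. A minimal sketch in the same shape as LogTokens that times each model call:
use std::sync::Arc;
use std::time::Instant;
use async_trait::async_trait;
use cognis::middleware::{Middleware, MiddlewareCtx, Next};
use cognis_llm::chat::ChatResponse;
use cognis_core::Result;

struct LogLatency;

#[async_trait]
impl Middleware for LogLatency {
    async fn call(&self, ctx: MiddlewareCtx, next: Arc<dyn Next>) -> Result<ChatResponse> {
        let started = Instant::now();       // before any inner layer runs
        let resp = next.invoke(ctx).await;  // everything pushed earlier, then the client
        tracing::info!(
            elapsed_ms = started.elapsed().as_millis() as u64,
            ok = resp.is_ok(),
            "llm call finished"
        );
        resp
    }

    fn name(&self) -> &str { "LogLatency" }
}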

See also

Production → Resilience

Patterns for ModelRetry, ModelFallback, and Recovery.

Production → Security

PII redaction, deny-lists, SSRF protection.

Human-in-the-loop

Approval-gated tools — different from middleware.

Reference → cognis

Full middleware re-export list.