Production checklist

A working agent is a starting point. Going to production is about adding the layers that aren’t about the agent’s behavior — they’re about keeping it up, watching it, and bounding what it can do. This page is a checklist. Each section links to the deeper guide.

Resilience

  • Wrap the LLM client with with_max_retries(3) and with_timeout(Duration::from_secs(30)).
  • Add a fallback model via Client::with_fallback(...) or ModelFallback::new(backup_client) middleware. A cheaper backup beats a 5xx.
  • Retry on rate limits. RetryPolicy::new(n) with backoff handles 429s correctly.
  • Cap retry attempts. Infinite retry is a bug. 3–5 attempts with exponential backoff.
  • Cap loop iterations. Always set with_max_iterations(n) on agents.
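
A minimal sketch of the list above, reusing the exact client and middleware calls from the reference setup at the bottom of this page (the backup model is a placeholder; point it at your cheap tier):

use std::time::Duration;
use cognis::middleware::{MiddlewarePipeline, ModelFallback, ModelRetry};
use cognis_llm::{Client, provider::Provider};

let primary = Client::from_env()?
    .with_max_retries(3)                     // bounded attempts, never infinite
    .with_timeout(Duration::from_secs(30));  // bound every call
let backup = Client::builder()
    .provider(Provider::OpenAI)
    .api_key(std::env::var("OPENAI_API_KEY")?)
    .model("gpt-4o-mini")                    // placeholder cheap tier
    .build()?;
let resilient = MiddlewarePipeline::new()
    .push(ModelFallback::new(backup))        // a cheaper backup beats a 5xx
    .push(ModelRetry::new(3))                // backs off on 429s and transient errors
    .build(primary);
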
See Resilience for the patterns.

Rate and cost

  • Rate-limit upstream. RateLimit::new(Arc::new(TokenBucket::new(rate_per_sec, burst))) with bucket sized to your provider tier.
  • Track cost. Wire cognis-trace with with_default_pricing() so every call has a USD cost attached.
  • Cap cost per run. ModelCallLimit and ToolCallLimit middleware bound how much one user request can spend.
  • Cache obvious repeats. CachedEmbeddings for indexing; with_memory_cache for LLM calls; PromptCaching middleware for Anthropic-style prefix caching.
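
A sketch of the cost layer around a `client` built as in the Resilience sketch above. The limits are placeholders, and ToolCallLimit's module path is assumed to match ModelCallLimit's:

use std::sync::Arc;
use cognis::middleware::{
    MiddlewarePipeline, ModelCallLimit, PromptCaching, RateLimit, TokenBucket, ToolCallLimit,
};

let bounded = MiddlewarePipeline::new()
    .push(ModelCallLimit::new(20))  // at most 20 model calls per run
    .push(ToolCallLimit::new(10))   // at most 10 tool calls per run (assumed path)
    .push(PromptCaching::new())     // prefix caching on repeated system prompts
    .push(RateLimit::new(Arc::new(TokenBucket::new(
        10.0, // refill per second; size to your provider tier
        20,   // burst capacity
    ))))
    .build(client);
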
See Cost tracking and Caching.

Observability

  • Wire cognis-trace. LangfuseExporter::from_env(), a TracingHandler, and the wrapped HandlerObserver on every RunnableConfig; see the sketch after this list.
  • Set trace metadata. TraceMeta::session(...), user(...), release(...), environment(...). Without these, you can’t filter your dashboards.
  • Drain on shutdown. handler.shutdown().await before exit so batches flush.
  • Log structured errors. Cognis emits through tracing; make sure a subscriber is installed.
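
A wiring sketch. The four type names and the shutdown call come from the list above; every constructor shape below is an assumption:

use cognis_trace::{HandlerObserver, LangfuseExporter, TraceMeta, TracingHandler};

let exporter = LangfuseExporter::from_env()?;          // reads LANGFUSE_* env vars
let handler = TracingHandler::new(exporter);           // assumed constructor
let observer = HandlerObserver::new(handler.clone());  // assumed constructor

let meta = TraceMeta::session("sess-123")              // assumed builder chain
    .user("user-42")
    .release(env!("CARGO_PKG_VERSION"))
    .environment("production");

// ... attach `observer` and `meta` to every RunnableConfig, run the agent ...

handler.shutdown().await;                              // drain batches before exit
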
See Trace with Langfuse.

Security

  • PII redaction. PiiRedactor::default() middleware on every customer-facing agent. Add RegexRedactor for domain-specific patterns.
  • Tool deny-lists. Even if the model has no reason to call delete_account, deny-list it.
  • HITL for the riskiest tools. Money, customer data, irreversible actions get an Approver.
  • SSRF-safe HTTP tools. Use the built-in protected client; allow-list explicit hosts.
  • Sandboxed FS for file-writing agents. SandboxedFsBackend over a scratch directory.
  • No .env files. Use envchain / direnv / your secret manager.
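
A redaction-and-sandbox sketch. PiiRedactor::default() and the type names are from the list above; RegexRedactor's signature, its module path, and SandboxedFsBackend's constructor are assumptions:

use cognis::middleware::{MiddlewarePipeline, PiiRedactor, RegexRedactor};

let safe = MiddlewarePipeline::new()
    .push(PiiRedactor::default())                            // emails, phones, card numbers
    .push(RegexRedactor::new(r"ACC-\d{8}", "[account-id]"))  // assumed signature; domain pattern
    .build(client);

// Confine file-writing tools to a scratch directory (assumed constructor).
let fs = SandboxedFsBackend::new("/var/agent-scratch");
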
See Security.

Persistence

  • Pick a checkpointer. SqliteCheckpointer for single-host, PostgresCheckpointer for multi-process.
  • Use stable thread ids. Tie them to your auth (user id, conversation id) so resume is deterministic.
  • Test resume paths. Kill a process mid-run; verify the next call resumes correctly.
  • Bound history. Don’t keep checkpoints forever — implement a TTL on your checkpointer.
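
A sketch of the multi-process variant. PostgresCheckpointer is named above, but its constructor shape is an assumption (SqliteCheckpointer::open in the reference setup below follows the same pattern); the thread-id scheme is illustrative:

use std::sync::Arc;
use cognis_graph::PostgresCheckpointer;

let cp = Arc::new(PostgresCheckpointer::open(&std::env::var("DATABASE_URL")?).await?);

// Stable thread ids: derive them from auth, never a fresh UUID per request,
// so a killed process resumes the same conversation on the next call.
let user_id = "user-42";         // from your auth layer
let conversation_id = "conv-7";  // from your conversation store
let thread_id = format!("{user_id}:{conversation_id}");
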
See Checkpointing.

Memory

  • Pick a memory variant. SummaryBufferMemory is the safe default; switch when usage tells you otherwise.
  • Pin the system prompt. Use with_system(...) on the memory so it survives summarization.
  • Bound the budget. SummaryBufferMemory::new(client, max_tokens) with max_tokens matched to your model and your latency target.
See Memory.

Streaming

  • Stream by default. Long agents feel broken when they don’t stream. Use stream_events or stream_mode.
  • Handle disconnects. When the client disconnects, cancel the run with cancel_token.
  • Protect against backpressure. Bounded channels; drop or summarize on overflow.
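
A disconnect-handling sketch. stream_events and cancel_token are named above; the exact signatures, and how the token attaches to the run, are assumptions:

use futures::StreamExt;
use tokio_util::sync::CancellationToken;

let cancel = CancellationToken::new();  // pass into the run config as its cancel_token
let mut events = agent.stream_events("summarize the incident").await?; // assumed signature

while let Some(event) = events.next().await {
    if connection_closed() {  // hypothetical: your transport's disconnect signal
        cancel.cancel();      // stop the run instead of orphaning it
        break;
    }
    // Forward `event` over a *bounded* channel; drop or summarize on
    // overflow rather than buffering unbounded output in memory.
}
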
See Streaming.

Evals and CI

  • Build a small golden set. 30–100 cases of “I know what the answer should be.”
  • Run evals on every change. EvalRunner over the golden set; gate CI on regressions.
  • Push scores to Langfuse. Tie eval scores to traces so you can drill from “this regressed” to “the trace that broke it.”
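
A CI-gate sketch. EvalRunner is named above; its constructor, run method, and report shape are all assumptions:

let golden = vec![
    ("What is our refund window?", "30 days"),
    ("Do we ship to the EU?", "Yes, from the Dublin warehouse."),
    // ... 30-100 cases with known answers
];
let runner = EvalRunner::new(agent);      // assumed constructor
let report = runner.run(&golden).await?;  // assumed method
// Fail the build on regression against the score you shipped last release.
assert!(report.pass_rate() >= 0.95, "eval regression; failing the build"); // assumed accessor
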
See Evaluation.

Deploys

  • Build with the right features. cargo build --release --features all-providers,langfuse,vectorstore-faiss (adapt to your stack).
  • Pin secrets in your runner’s secret store. Same env-var names as local dev.
  • Health checks. Wire a tiny endpoint that calls client.provider().health_check().await so your platform’s load balancer can detect bad pods.
  • Graceful shutdown. Drain the trace handler, cancel in-flight requests, then exit.
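
A health-endpoint sketch. The health_check call is the one named above; axum is an assumption, and any HTTP framework works the same way:

use axum::{http::StatusCode, routing::get, Router};

let app = Router::new().route("/healthz", get(move || {
    let client = client.clone();  // the configured LLM client
    async move {
        match client.provider().health_check().await {
            Ok(_) => (StatusCode::OK, "ok"),
            Err(_) => (StatusCode::SERVICE_UNAVAILABLE, "llm unreachable"),
        }
    }
}));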

Cost containers

  • Per-tenant limits. Different rate buckets for different customers. Use one bucket per tenant; share the same RateLimit middleware.
  • Daily caps. Sliding-window or composite limiter that stops a runaway use case before it surprises your bill.
  • Fallback to cheaper. Production stack: try the good model, fall back to a cheap one, then return a graceful error.
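
A per-tenant sketch with the TokenBucket and RateLimit middleware from the reference setup below; the bucket sizes and tenant scheme are placeholders:

use std::collections::HashMap;
use std::sync::Arc;
use cognis::middleware::{MiddlewarePipeline, RateLimit, TokenBucket};

// One bucket per tenant, shared across all of that tenant's requests.
let mut buckets: HashMap<String, Arc<TokenBucket>> = HashMap::new();
buckets.insert("free".into(), Arc::new(TokenBucket::new(1.0, 5)));    // placeholder sizes
buckets.insert("pro".into(), Arc::new(TokenBucket::new(20.0, 100)));

// Per request: look up the tenant's bucket, wrap the shared client with it.
let tenant = "pro";  // from the request's auth
let limited = MiddlewarePipeline::new()
    .push(RateLimit::new(buckets[tenant].clone()))
    .build(client.clone());  // `client` as built in the reference setup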

How it all fits

A reference production agent setup:
use std::sync::Arc;
use std::time::Duration;
use cognis::prelude::*;
use cognis::AgentBuilder;
use cognis::middleware::{
    MiddlewarePipeline, ModelRetry, ModelFallback, RateLimit, TokenBucket,
    PromptCaching, PiiRedactor, Summarization, ModelCallLimit,
};
use cognis_llm::{Client, provider::Provider};
use cognis_graph::SqliteCheckpointer;
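// Runs inside an async fn (e.g. a #[tokio::main] main returning anyhow::Result<()>).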

let primary = Client::from_env()?
    .with_max_retries(3)
    .with_timeout(Duration::from_secs(30));
let backup = Client::builder()
    .provider(Provider::OpenAI)
    .api_key(std::env::var("OPENAI_API_KEY")?)
    .model("gpt-4o-mini")
    .build()?;

// 1. Wrap the LLM client with middleware. Innermost first, outermost last.
let pipelined = MiddlewarePipeline::new()
    .push(ModelFallback::new(backup))
    .push(ModelRetry::new(3))
    .push(Summarization::new(8))
    .push(PromptCaching::new())
    .push(PiiRedactor::new())
    .push(ModelCallLimit::new(20))
    .push(RateLimit::new(Arc::new(TokenBucket::new(1000.0, 60_000)))) // sized to your provider tier
    .build(primary.clone());

// 2. Persist agent graph state across processes via the checkpointer.
let cp = Arc::new(SqliteCheckpointer::open("./state.db").await?);
let custom_graph = cognis::default_react_graph()
    .compile()?
    .with_checkpointer(cp);

let memory = SummaryBufferMemory::new(primary.clone(), 2000)
    .with_system("You are a careful assistant.");

let agent = AgentBuilder::new()
    // For middleware to wrap every model call inside the agent, see
    // /building-agents/middleware#wiring-middleware-into-an-agent
    // (build a custom LLMProvider that delegates to `pipelined`).
    .with_llm(primary)
    .with_graph(custom_graph)
    .with_tools(my_tools) // your tool set, defined elsewhere
    .with_memory(memory)
    .with_max_iterations(8)
    .stateful()
    .build()?;
That’s not a starting point — it’s the end state. Build the simple version, then layer on what your usage tells you you need.

See also

  • Resilience: retries, fallbacks, recovery.
  • Observability: where your runs land.
  • Security: PII, tools, sandboxes.
  • Caching: don't pay twice.