
LLM calls are expensive and slow. Cognis has three caching layers, each fitting a different access pattern. Pick the one that matches what you’re trying to skip.

Pick a cache

Layer | Hits | When
In-memory cache | Same input, same call | Per-process; tests; short-lived workers.
SQLite cache | Same input across processes | Single-host services; CLI tools that retain state.
Provider prompt caching | Same prefix across calls | Anthropic-style cached prompts; long system contexts.
The three layers compose — you can have all of them on at once. Each catches a different category of “we already did this.”

In-memory cache

with_memory_cache(key_fn) wraps a Runnable with a hash-keyed in-memory cache. The closure produces the cache key from the input.
use cognis::prelude::*;
use cognis_llm::Client;

let client = Client::from_env()?;

// Key the cache on the rendered messages.
let cached = client.with_memory_cache(|messages: &Vec<Message>| {
    messages.iter().map(|m| m.content().to_string()).collect::<Vec<_>>().join("|")
});
Identical inputs return immediately. The cache lives for the lifetime of the wrapped value.
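
A quick sketch of what a hit looks like. The invoke call and Message::user constructor are assumed names for illustration, not confirmed API; adapt to however you actually call the wrapped value.
// Hypothetical usage: `invoke` and `Message::user` are assumptions.
let msgs = vec![Message::user("Summarize the release notes")];
let first = cached.invoke(msgs.clone()).await?;   // goes to the provider
let second = cached.invoke(msgs).await?;          // same key, served from memory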

SQLite cache

For multi-process or long-running services, use the SQLite-backed cache (feature cognis/cache-sqlite):
use cognis::cache_sqlite::SqliteCache;

let cache = SqliteCache::new("./cache.db")?;
// Wire it via wrappers / middleware that accept a CacheBackend.
The cache survives restarts and is shared across processes that point at the same file. Good for CLI tools, batch jobs, and single-host services.
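See also the sketch below. How the backend gets attached isn't shown above; this is a minimal sketch assuming a hypothetical with_cache(backend, key_fn) wrapper that accepts any CacheBackend. Check the wrappers/middleware docs for the real entry point.
use cognis::prelude::*;
use cognis::cache_sqlite::SqliteCache;
use cognis_llm::Client;

let client = Client::from_env()?;
let cache = SqliteCache::new("./cache.db")?;

// Hypothetical wiring: key on the joined message contents, store results in SQLite.
let cached = client.with_cache(cache, |messages: &Vec<Message>| {
    messages.iter().map(|m| m.content().to_string()).collect::<Vec<_>>().join("|")
});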

Provider prompt caching

Anthropic and some other providers support marking the start of a prompt as cacheable on the provider side. You pay full price the first time and a fraction on subsequent calls. Use the PromptCaching middleware:
use cognis::middleware::{PromptCaching, MiddlewarePipeline};

let pipelined = MiddlewarePipeline::new()
    .push(PromptCaching::new())
    .build(client);
The middleware adds the right markers; the provider does the deduplication. Particularly powerful for agents with long, stable system prompts and tool definitions — those rarely change between turns. To wire this inside an AgentBuilder agent, see Middleware → Wiring middleware into an agent.

How it composes

A typical stack:
use cognis::prelude::*;
use cognis::middleware::{PromptCaching, ModelRetry, MiddlewarePipeline};
use cognis_llm::Client;

// Two layers do similar jobs at different levels:
//   - in-memory dedup of repeat calls -> Runnable wrapper on the Client
//   - prompt-cache markers for the provider + retries -> middleware pipeline
//
// `MiddlewarePipeline::build` requires a raw `Client`, so wrap that *first*,
// then layer the Runnable wrappers around the pipelined client when you
// invoke it (or via `with_max_retries` on the Client surface for code that
// uses Client directly without the pipeline).
let client = Client::from_env()?;

let pipelined = MiddlewarePipeline::new()
    .push(ModelRetry::new(3))                 // innermost: retry transient errors
    .push(PromptCaching::new())               // outermost: mark cacheable
    .build(client);
This stack marks prompts as cacheable for the provider and retries transient errors. For an in-memory dedup cache on top, use client.with_memory_cache(key_fn) directly when calling the Client outside the pipeline, or layer your own caching middleware in the pipeline; the sketch below shows the wrapper layering the comment above describes.
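
A sketch of that layering, assuming the pipelined client is itself a Runnable that exposes with_memory_cache (an assumption; adjust to however your version exposes the wrapper).
// Assumed: the pipelined client still exposes `with_memory_cache`.
// Repeat calls short-circuit here, before the retry/prompt-caching middleware runs.
let stacked = pipelined.with_memory_cache(|messages: &Vec<Message>| {
    messages.iter().map(|m| m.content().to_string()).collect::<Vec<_>>().join("|")
});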

Embedder cache

For RAG, wrap your embedder with CachedEmbeddings:
use std::sync::Arc;
use cognis_rag::{CachedEmbeddings, Embeddings, OpenAIEmbeddings};

let raw = OpenAIEmbeddings::new(key);
let cached = CachedEmbeddings::new(raw);
let embed: Arc<dyn Embeddings> = Arc::new(cached);
Same chunk, embedded twice → second call free. This is hugely valuable when re-running indexing pipelines on slowly changing corpora.
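
A usage sketch, assuming Embeddings exposes an async embed method over a slice of strings; the method name and signature are assumptions, so check the Embeddings trait for the real one.
// Assumed method name: `embed`.
let chunks = vec!["How caching works in Cognis".to_string()];
let first = embed.embed(&chunks).await?;   // calls the embedding provider
let second = embed.embed(&chunks).await?;  // identical chunk, served from the cache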

Cache invalidation

A cache only helps if a hit returns the right answer. Three invalidation knobs:
  • Key carefully. Include everything that influences output: messages, model name, temperature, tool list, system prompt. Two prompts with the same text but different tools are different cache entries (see the sketch after this list).
  • Bypass on demand. with_memory_cache doesn’t have a force-bypass switch by design; if you need one, wrap your own cache backend.
  • Capacity. The default in-memory cache is unbounded; explicitly limit it with MemoryCache::with_capacity(n) if you’re worried about memory.
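
A sketch of a more careful key, folding model and temperature into the string alongside the messages. The values here are illustrative and captured by the closure, since they aren't part of the call input.
let model = "example-model";   // illustrative; use whatever you actually configure
let temperature = 0.2;

let cached = client.with_memory_cache(move |messages: &Vec<Message>| {
    let body = messages.iter()
        .map(|m| m.content().to_string())
        .collect::<Vec<_>>()
        .join("|");
    // Different model or temperature -> different cache entry, even for identical text.
    format!("{model}|{temperature}|{body}")
});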

How it works

  • Hashing happens in your key_fn. Whatever string the closure returns is the cache key. No automatic introspection.
  • Cache entries are typed. A Cache<R, I, O, K, B> only stores O. Different output types → separate caches.
  • Provider prompt caching is opaque to the client. The middleware adds protocol markers; the provider reports whether the cache was hit. Trace via cost tracking to see cache_read_tokens.

See also

Resilience

Combine caching with retries and fallbacks.

Cost tracking

See provider cache hits in your cost numbers.

Embeddings

CachedEmbeddings for RAG.