
LLM calls are expensive and slow. Cognis has three caching layers, each fitting a different access pattern. Pick the one that matches what you’re trying to skip.

Pick a cache

Layer | Hits | When
In-memory cache | Same input, same call | Per-process; tests; short-lived workers.
SQLite cache | Same input across processes | Single-host services; CLI tools that retain state.
Provider prompt caching | Same prefix across calls | Anthropic-style cached prompts; long system contexts.
The three layers compose — you can have all of them on at once. Each catches a different category of “we already did this.”

In-memory cache

with_memory_cache(key_fn) wraps a Runnable with a hash-keyed in-memory cache. The closure produces the cache key from the input.
use cognis::prelude::*;
use cognis_llm::Client;

let client = Client::from_env()?;

// Key the cache on the rendered messages.
let cached = client.with_memory_cache(|messages: &Vec<Message>| {
    messages.iter().map(|m| m.content().to_string()).collect::<Vec<_>>().join("|")
});
Identical inputs return immediately. The cache lives for the lifetime of the wrapped value.
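
A quick sketch of what a hit looks like. The invoke call and Message::user constructor are assumed names for illustration, not confirmed API; adapt to however you actually call the wrapped value.
// Hypothetical usage: `invoke` and `Message::user` are assumptions.
let msgs = vec![Message::user("Summarize the release notes")];
let first = cached.invoke(msgs.clone()).await?;   // goes to the provider
let second = cached.invoke(msgs).await?;          // same key, served from memory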

SQLite cache

For multi-process or long-running services, use the SQLite-backed cache (feature cognis/cache-sqlite):
use cognis::cache_sqlite::SqliteCache;

let cache = SqliteCache::new("./cache.db")?;
// Wire it via wrappers / middleware that accept a CacheBackend.
The cache survives restarts and is shared across processes that point at the same file. Good for CLI tools, batch jobs, and single-host services.
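See also the sketch below. How the backend gets attached isn't shown above; this is a minimal sketch assuming a hypothetical with_cache(backend, key_fn) wrapper that accepts any CacheBackend. Check the wrappers/middleware docs for the real entry point.
use cognis::prelude::*;
use cognis::cache_sqlite::SqliteCache;
use cognis_llm::Client;

let client = Client::from_env()?;
let cache = SqliteCache::new("./cache.db")?;

// Hypothetical wiring: key on the joined message contents, store results in SQLite.
let cached = client.with_cache(cache, |messages: &Vec<Message>| {
    messages.iter().map(|m| m.content().to_string()).collect::<Vec<_>>().join("|")
});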

Provider prompt caching

Anthropic and some other providers support marking the start of a prompt as cacheable on the provider side. You pay full price the first time and a fraction on subsequent calls. Use the PromptCaching middleware:
use cognis::middleware::{PromptCaching, MiddlewarePipeline};

let pipelined = MiddlewarePipeline::new()
    .push(PromptCaching::new())
    .build(client);
The middleware adds the right markers; the provider does the deduplication. Particularly powerful for agents with long, stable system prompts and tool definitions — those rarely change between turns. To wire this inside an AgentBuilder agent, see Middleware → Wiring middleware into an agent.

How it composes

A typical stack:
use cognis::prelude::*;
use cognis::middleware::{PromptCaching, ModelRetry, MiddlewarePipeline};
use cognis_llm::Client;

// Two layers do similar jobs at different levels:
//   - in-memory dedup of repeat calls -> Runnable wrapper on the Client
//   - prompt-cache markers for the provider + retries -> middleware pipeline
//
// `MiddlewarePipeline::build` requires a raw `Client`, so wrap that *first*,
// then layer the Runnable wrappers around the pipelined client when you
// invoke it (or via `with_max_retries` on the Client surface for code that
// uses Client directly without the pipeline).
let client = Client::from_env()?;

let pipelined = MiddlewarePipeline::new()
    .push(ModelRetry::new(3))                 // innermost: retry transient errors
    .push(PromptCaching::new())               // outermost: mark cacheable
    .build(client);
This stack marks prompts as cacheable for the provider and retries transient errors. For an in-memory dedup cache on top, use client.with_memory_cache(key_fn) directly when calling the Client outside the pipeline, or layer your own caching middleware in the pipeline; the sketch below shows the wrapper layering the comment above describes.
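
A sketch of that layering, assuming the pipelined client is itself a Runnable that exposes with_memory_cache (an assumption; adjust to however your version exposes the wrapper).
// Assumed: the pipelined client still exposes `with_memory_cache`.
// Repeat calls short-circuit here, before the retry/prompt-caching middleware runs.
let stacked = pipelined.with_memory_cache(|messages: &Vec<Message>| {
    messages.iter().map(|m| m.content().to_string()).collect::<Vec<_>>().join("|")
});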

Embedder cache

For RAG, wrap your embedder with CachedEmbeddings:
use std::sync::Arc;
use cognis_rag::{CachedEmbeddings, Embeddings, OpenAIEmbeddings};

let raw = OpenAIEmbeddings::new(key);
let cached = CachedEmbeddings::new(raw);
let embed: Arc<dyn Embeddings> = Arc::new(cached);
Same chunk, embedded twice → second call free. This is hugely valuable when re-running indexing pipelines on slowly changing corpora.
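
A usage sketch, assuming Embeddings exposes an async embed method over a slice of strings; the method name and signature are assumptions, so check the Embeddings trait for the real one.
// Assumed method name: `embed`.
let chunks = vec!["How caching works in Cognis".to_string()];
let first = embed.embed(&chunks).await?;   // calls the embedding provider
let second = embed.embed(&chunks).await?;  // identical chunk, served from the cache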

Cache invalidation

A cache only helps if a hit returns the right answer. Three invalidation knobs:
  • Key carefully. Include everything that influences output: messages, model name, temperature, tool list, system prompt. Two prompts with the same text but different tools are different cache entries (see the sketch after this list).
  • Bypass on demand. with_memory_cache doesn’t have a force-bypass switch by design; if you need one, wrap your own cache backend.
  • Capacity. The default in-memory cache is unbounded; explicitly limit it with MemoryCache::with_capacity(n) if you’re worried about memory.
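
A sketch of a more careful key, folding model and temperature into the string alongside the messages. The values here are illustrative and captured by the closure, since they aren't part of the call input.
let model = "example-model";   // illustrative; use whatever you actually configure
let temperature = 0.2;

let cached = client.with_memory_cache(move |messages: &Vec<Message>| {
    let body = messages.iter()
        .map(|m| m.content().to_string())
        .collect::<Vec<_>>()
        .join("|");
    // Different model or temperature -> different cache entry, even for identical text.
    format!("{model}|{temperature}|{body}")
});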

How it works

  • Hashing happens in your key_fn. Whatever string the closure returns is the cache key. No automatic introspection.
  • Cache entries are typed. A Cache<R, I, O, K, B> only stores O. Different output types → separate caches.
  • Provider prompt caching is opaque to the client. The middleware adds protocol markers; the provider reports whether the cache was hit. Trace via cost tracking to see cache_read_tokens.

See also

Resilience

Combine caching with retries and fallbacks.

Cost tracking

See provider cache hits in your cost numbers.

Embeddings

CachedEmbeddings for RAG.