LLM calls are expensive and slow. Cognis has three caching layers, each fitting a different access pattern. Pick the one that matches what you're trying to skip.
## Pick a cache

| Layer | Hits on | Use for |
|---|---|---|
| In-memory cache | Same input, same call | Per-process; tests; short-lived workers. |
| SQLite cache | Same input across processes | Single-host services; CLI tools that retain state. |
| Provider prompt caching | Same prefix across calls | Anthropic-style cached prompts; long system contexts. |
## In-memory cache
`with_memory_cache(key_fn)` wraps a `Runnable` with a hash-keyed in-memory cache. The closure produces the cache key from the input.
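A minimal sketch. Only `with_memory_cache(key_fn)` is documented here; the client, the `ChatRequest` fields, and the `run` call are illustrative assumptions:

```rust
// Sketch only: with_memory_cache(key_fn) is from this page; the request
// type, its fields, and run() are illustrative assumptions.
let cached = client.with_memory_cache(|req: &ChatRequest| {
    // Whatever string this closure returns becomes the cache key.
    format!("{}|{}|{}", req.model, req.temperature, req.prompt)
});

let first = cached.run(request.clone()).await?; // miss: calls the provider
let second = cached.run(request).await?;        // hit: served from memory
```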
## SQLite cache
For multi-process or long-running services, use the SQLite-backed cache (feature `cognis/cache-sqlite`):
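A sketch under assumptions: the feature flag is from this page, while `SqliteCache::open` and the `with_cache` wrapper are placeholder names for whatever the crate actually exposes:

```rust
// Enable the feature in Cargo.toml: cognis = { features = ["cache-sqlite"], ... }
// Sketch: SqliteCache::open and with_cache are assumed names.
let cache = SqliteCache::open("/var/lib/myapp/cognis-cache.db")?;
let cached = client.with_cache(cache, |req: &ChatRequest| {
    format!("{}|{}", req.model, req.prompt)
});
// Entries persist across restarts and are shared by every process on the host.
```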
## Provider prompt caching
Anthropic and some other providers support marking the start of a prompt as cacheable on the provider side. You pay full price the first time and a fraction on subsequent calls. Use the `PromptCaching` middleware:
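A sketch of the wiring; `PromptCaching` and `AgentBuilder` appear on this page, but the exact builder method is an assumption:

```rust
// Sketch: the .middleware(...) call is an assumed wiring; see the
// Middleware page for the real thing.
let agent = AgentBuilder::new(client)
    .middleware(PromptCaching::default()) // marks the prompt prefix cacheable
    .build();
// The first call with a given prefix pays full input-token price; later
// calls reusing that prefix are billed at the provider's cached rate.
```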
For the full wiring into an `AgentBuilder` agent, see Middleware → Wiring middleware into an agent.
## How it composes
A typical stack combines the layers: use `client.with_memory_cache(key_fn)` directly when calling the `Client` outside the pipeline, or layer your own caching middleware in the pipeline.
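A sketch of that composition, reusing the assumed names from the earlier sketches (only `PromptCaching`, `AgentBuilder`, and `with_memory_cache` come from this page):

```rust
// Provider prompt caching inside the pipeline, an in-memory layer on top.
let agent = AgentBuilder::new(client)
    .middleware(PromptCaching::default()) // assumed wiring
    .build();

// A repeated identical input never reaches the provider; a fresh input
// still benefits from the provider-side cached prefix.
let cached_agent = agent.with_memory_cache(|req: &ChatRequest| {
    format!("{}|{}", req.model, req.prompt)
});
```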
## Embedder cache
For RAG, wrap your embedder with `CachedEmbeddings`:
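A sketch; `CachedEmbeddings` is the documented name, while the constructor, the inner embedder, and the `embed` call are assumptions:

```rust
// Sketch: CachedEmbeddings is from this page; everything else is assumed.
let embedder = CachedEmbeddings::new(base_embedder); // wraps any embedder
let vectors = embedder.embed(&["what is prompt caching?"]).await?;
// Re-embedding an unchanged chunk is a cache hit, so re-indexing documents
// that haven't changed costs nothing.
```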
## Cache invalidation
A cache only helps if a hit returns the right answer. Three invalidation knobs:

- **Key carefully.** Include everything that influences output: messages, model name, temperature, tool list, system prompt. Two prompts with the same text but different tools are different cache entries (see the sketch after this list).
- **Bypass on demand.** `with_memory_cache` doesn't have a force-bypass switch by design; if you need one, wrap your own cache backend.
- **TTL.** The default in-memory cache is unbounded; explicitly limit it with `MemoryCache::with_capacity(n)` if you're worried about memory.
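A self-contained sketch of careful keying. The parameter list is illustrative; the hashing is plain `std`:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Folds everything that influences output into one key. DefaultHasher is
/// not stable across processes or Rust versions, so for the SQLite cache
/// prefer a stable digest such as SHA-256.
fn cache_key(model: &str, temperature: f32, system: &str, tools: &[&str], messages: &str) -> String {
    let mut h = DefaultHasher::new();
    model.hash(&mut h);
    temperature.to_bits().hash(&mut h); // f32 has no Hash impl; hash its bits
    system.hash(&mut h);
    tools.hash(&mut h); // same prompt text, different tools => different key
    messages.hash(&mut h);
    format!("{:016x}", h.finish())
}
```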
## How it works
- **Hashing happens in your `key_fn`.** Whatever string the closure returns is the cache key. No automatic introspection.
- **Cache entries are typed.** A `Cache<R, I, O, K, B>` only stores `O`. Different output types get separate caches (see the sketch after this list).
- **Provider prompt caching is opaque to the client.** The middleware adds protocol markers; the provider reports whether the cache hit. Trace via cost tracking to see `cache_read_tokens`.
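An illustrative consequence of typed entries; apart from `with_memory_cache`, the names here are stand-ins:

```rust
// Two wrappers, two independent caches: one stores chat responses, the
// other stores embedding vectors, even if their key strings collide.
let chat = chat_client.with_memory_cache(|req: &ChatRequest| req.prompt.clone());
let embed = embedder.with_memory_cache(|text: &String| text.clone());
```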
## See also

- Resilience: combine caching with retries and fallbacks.
- Cost tracking: see provider cache hits in your cost numbers.
- Embeddings: `CachedEmbeddings` for RAG.