Modern models have huge context windows, but stuffing them is rarely the right move: quality drops, costs balloon, and the “lost in the middle” effect bites hard. This pattern handles long documents with a map-reduce approach: split, summarize each chunk, summarize the summaries, and optionally reorder.
## What you’ll build
A function that takes a long document and returns a one-page summary, scaling to inputs that wouldn’t fit in any single context window.

## How it works
- Split the document into chunk-sized pieces with RecursiveCharSplitter.
- Map: summarize each chunk in parallel with the LLM.
- Reduce: summarize the chunk summaries together. If even those don’t fit, recurse.
- Reorder the final summary’s source list (best-first → edge-first) for the model that consumes the summary downstream.
## The code
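A minimal sketch of the map → reduce flow, assuming the names used elsewhere on this page. The `Client::chat` stub and the hand-rolled `split_with_overlap` are simplified stand-ins for the real chat client and RecursiveCharSplitter; swap in the actual cognis types:

```rust
use futures::stream::{self, StreamExt};

type Error = Box<dyn std::error::Error>;

// Stub for the chat client; `chat` sends one prompt and returns the reply.
pub struct Client;
impl Client {
    pub async fn chat(&self, _prompt: &str) -> Result<String, Error> {
        unimplemented!("call your LLM provider here")
    }
}

// Character splitter with overlap; a stand-in for RecursiveCharSplitter.
// Assumes size > overlap, or the window never advances.
fn split_with_overlap(doc: &str, size: usize, overlap: usize) -> Vec<String> {
    let chars: Vec<char> = doc.chars().collect();
    let (mut out, mut start) = (Vec::new(), 0);
    while start < chars.len() {
        let end = (start + size).min(chars.len());
        out.push(chars[start..end].iter().collect());
        if end == chars.len() { break; }
        start = end - overlap; // shared boundary text keeps cross-chunk claims intact
    }
    out
}

pub async fn summarize_document(client: &Client, doc: &str) -> Result<String, Error> {
    // Split: ~4k-char chunks (~1k tokens) with 200 chars of overlap.
    let chunks = split_with_overlap(doc, 4000, 200);

    // Map: summarize up to 8 chunks concurrently. buffer_unordered yields results
    // as they finish, so carry the index along and restore document order after.
    let mut indexed: Vec<(usize, String)> = stream::iter(chunks.into_iter().enumerate())
        .map(|(i, c)| async move {
            client.chat(&format!("Summarize:\n\n{c}")).await.map(|s| (i, s))
        })
        .buffer_unordered(8)
        .collect::<Vec<_>>()
        .await
        .into_iter()
        .collect::<Result<_, _>>()?;
    indexed.sort_by_key(|&(i, _)| i);

    // Reduce: one more pass over the joined chunk summaries.
    let body = indexed.into_iter().map(|(_, s)| s).collect::<Vec<_>>().join("\n\n");
    client.chat(&format!("Merge these partial summaries into one page:\n\n{body}")).await
}
```

Carrying the index through buffer_unordered keeps the reduce prompt in document order even though completions arrive out of order.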
## How it works
- Concurrency comes from buffer_unordered. Eight chunks summarize in parallel; your provider’s rate limit is the real throttle, so tune the buffer size to your quota.
- with_overlap(200) keeps cross-chunk references stable. A claim that spans two chunks survives because both chunks share the boundary text.
- Two passes scale linearly. A 200k-token doc with 1k-token chunks is 200 chunks → 200 summaries (5–10k tokens) → one final summary. Each LLM call is cheap.
- Recurse when needed. If the combined summaries themselves don’t fit, treat them as a new doc and run the same map-reduce.
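One way to wire the recursion, replacing the final reduce call in the sketch above (reusing its `Client` and `Error`); the ~4-chars-per-token budget is an assumption, not a cognis constant:

```rust
use std::future::Future;
use std::pin::Pin;

// Assumed budget: ~12k tokens at roughly 4 chars per token; tune to your model.
const REDUCE_BUDGET_CHARS: usize = 48_000;

// Async recursion needs a boxed future. If the joined summaries still exceed
// the budget, they become the next round's input document.
fn reduce_or_recurse<'a>(
    client: &'a Client,
    combined: String,
) -> Pin<Box<dyn Future<Output = Result<String, Error>> + 'a>> {
    Box::pin(async move {
        if combined.len() > REDUCE_BUDGET_CHARS {
            summarize_document(client, &combined).await // recurse on the summaries
        } else {
            client
                .chat(&format!("Merge these partial summaries into one page:\n\n{combined}"))
                .await
        }
    })
}
```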
## When the document has structure
For docs with sections (Markdown, HTML, books with chapters), use MarkdownSplitter or HtmlSplitter so chunks align to natural boundaries. Hierarchical summaries (per-section, then per-chapter, then overall) preserve structure better than flat map-reduce.
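A sketch of the hierarchical variant, again reusing the `Client` stub from above; `split_markdown_sections` is a hand-rolled stand-in for MarkdownSplitter and only handles `## ` headings:

```rust
// Hierarchical pass: summarize each section, then reduce in document order.
async fn summarize_structured(client: &Client, doc: &str) -> Result<String, Error> {
    let mut section_summaries = Vec::new();
    for (heading, body) in split_markdown_sections(doc) {
        let s = client
            .chat(&format!("Summarize the section '{heading}':\n\n{body}"))
            .await?;
        section_summaries.push(format!("{heading}: {s}"));
    }
    // Per-section summaries preserve the outline; reduce them as one document.
    client
        .chat(&format!(
            "Summarize, preserving section order:\n\n{}",
            section_summaries.join("\n\n")
        ))
        .await
}

// Naive splitter on `## ` headings; a stand-in for MarkdownSplitter.
fn split_markdown_sections(doc: &str) -> Vec<(String, String)> {
    doc.split("\n## ")
        .map(|block| {
            let (heading, body) = block.split_once('\n').unwrap_or((block, ""));
            (heading.trim_start_matches("## ").to_string(), body.to_string())
        })
        .collect()
}
```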
## Long-context reorder
If you’re feeding a list of summaries (not a flattened blob) into a final model, use LongContextReorder to put the most-relevant ones at the edges, where the model’s attention is best:
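A sketch of the edge-first shuffle that LongContextReorder performs; the library type’s exact signature is an assumption here:

```rust
use std::collections::VecDeque;

/// Given items ranked best-first, place the strongest at the two edges and the
/// weakest in the middle, where long-context attention is poorest.
fn long_context_reorder<T>(best_first: Vec<T>) -> Vec<T> {
    let mut out = VecDeque::with_capacity(best_first.len());
    // Walk worst-first, alternating edges; the last (best) items land outermost.
    for (i, item) in best_first.into_iter().rev().enumerate() {
        if i % 2 == 0 { out.push_front(item) } else { out.push_back(item) }
    }
    out.into()
}
```

For ranks [1, 2, 3, 4, 5] (1 best) this yields [1, 3, 5, 4, 2]: the two strongest summaries sit at the edges and the weakest lands in the middle.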
## Production considerations
| Concern | Add |
|---|---|
| Retries on transient errors | Wrap the client with with_max_retries(3). |
| Cost cap | RateLimit::new(Arc::new(TokenBucket::new(rate, burst))) in a MiddlewarePipeline; or precompute token counts and refuse oversized inputs. |
| Caching | Hash chunks; skip the map call when the chunk has been seen. |
| Monitoring | Wire cognis-trace so each map call is its own generation span — see Trace with Langfuse. |
| Determinism | Set temperature=0 (via Client::builder().temperature(...) or ChatOptions) for reproducible summaries in tests. |
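For the caching row, a minimal in-process sketch; note that DefaultHasher is stable within a build but not guaranteed across Rust versions, so use a stable hash (e.g. SHA-256) if the cache outlives the process:

```rust
use std::collections::{hash_map::DefaultHasher, HashMap};
use std::hash::{Hash, Hasher};

// Content-addressed map-phase cache: identical chunks (re-runs, unchanged
// sections) skip the LLM call entirely.
struct SummaryCache {
    seen: HashMap<u64, String>,
}

impl SummaryCache {
    fn key(chunk: &str) -> u64 {
        let mut h = DefaultHasher::new();
        chunk.hash(&mut h);
        h.finish()
    }

    async fn summarize(&mut self, client: &Client, chunk: &str) -> Result<String, Error> {
        let k = Self::key(chunk);
        if let Some(hit) = self.seen.get(&k) {
            return Ok(hit.clone()); // cache hit: no API call, no cost
        }
        let summary = client.chat(&format!("Summarize:\n\n{chunk}")).await?;
        self.seen.insert(k, summary.clone());
        Ok(summary)
    }
}
```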
## When to skip this pattern
- Document fits comfortably in one context. Just send it.
- You only need a quick gist. Truncate, then summarize.
- The doc is highly redundant (logs, API responses). Deduplicate first; you may not need to summarize at all.
## See also
- Documents and splitters: pick the right chunker for your input.
- Reranking and compression: compress and reorder retrieved docs.
- Middleware → Summarization: auto-summarize an agent’s transcript when context grows.