Indexing turns documents into a searchable vector store. Doing it once is easy; doing it over and over — when source files change, when you edit chunking strategy, when you add new docs — is where the cost lives. Cognis' IndexingPipeline makes incremental indexing the default: tell it which docs you care about, give it a way to fingerprint them, and it only re-embeds what changed.
The shape
Quick example
Full runnable code: examples/retrieval/indexing_rag.rs.
The output of the second run reports only what changed — changed=1, added=1, unchanged=1. Doc A wasn’t re-embedded.
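Those counts fall out of simple fingerprint bookkeeping. The sketch below is not the crate's API — `fingerprint`, `classify`, and the doc tuples are illustrative names — but it shows, with std hashing only, how a second run over mostly-unchanged docs produces exactly that report:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

// Illustrative content fingerprint (not the crate's locked algorithm).
fn fingerprint(content: &str) -> u64 {
    let mut h = DefaultHasher::new();
    content.hash(&mut h);
    h.finish()
}

// Classify one load of (key, content) pairs against stored fingerprints,
// updating the table and returning (added, changed, unchanged) counts.
fn classify(
    table: &mut HashMap<String, u64>,
    docs: &[(&str, &str)],
) -> (usize, usize, usize) {
    let (mut added, mut changed, mut unchanged) = (0, 0, 0);
    for (key, content) in docs {
        let fp = fingerprint(content);
        match table.insert(key.to_string(), fp) {
            None => added += 1,                    // key never seen before
            Some(old) if old != fp => changed += 1, // re-embed and replace
            Some(_) => unchanged += 1,              // skip embedding entirely
        }
    }
    (added, changed, unchanged)
}

fn main() {
    let mut table = HashMap::new();
    // First run: everything is new.
    classify(&mut table, &[("a", "alpha v1"), ("b", "beta v1")]);
    // Second run: doc B edited, doc C added, doc A untouched.
    let (added, changed, unchanged) =
        classify(&mut table, &[("a", "alpha v1"), ("b", "beta v2"), ("c", "gamma v1")]);
    println!("changed={changed}, added={added}, unchanged={unchanged}");
    // prints: changed=1, added=1, unchanged=1
}
```

Only doc B's chunks would hit the embedder on the second run; A's stored vectors are left alone.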
Incremental vs full
Two run methods, two trade-offs:

| Method | Behavior | When |
|---|---|---|
| `pipeline.run().await?` | Re-index everything. | Initial population. Splitter or embedder change. |
| `pipeline.run_incremental(record_manager, group, key_fn).await?` | Only re-embed new or changed docs. | Steady state. Most production loops. |
`group` is a namespace for the record manager — you can keep multiple indices in one manager (e.g., `"docs"`, `"code"`, `"tickets"`).
`key_fn` returns the document key the record manager uses for fingerprinting. The simplest choice: `|d| d.id.clone()`. For sources without stable ids, hash the content.
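Both choices can be sketched with the standard library. `Doc` here is a hypothetical stand-in for the pipeline's document type, not the crate's actual struct:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Hypothetical stand-in for the pipeline's document type.
struct Doc {
    id: String,
    content: String,
}

// key_fn for sources with stable ids: just clone the id.
fn key_by_id(d: &Doc) -> String {
    d.id.clone()
}

// key_fn for sources without stable ids: derive the key from content.
// Caveat: with a pure content key, an edited doc looks like a delete +
// add rather than a change, so prefer real ids whenever you have them.
fn key_by_content(d: &Doc) -> String {
    let mut h = DefaultHasher::new();
    d.content.hash(&mut h);
    format!("{:016x}", h.finish())
}
```

One caveat on `DefaultHasher`: it is deterministic within a build but not guaranteed stable across Rust releases, so if your record manager persists keys, pick a fixed algorithm (e.g., SHA-256) instead.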
Record managers
The record manager is the bookkeeping layer. It stores fingerprints (key + content hash) so the pipeline can spot changes without diffing the vector store.

| Implementation | Notes |
|---|---|
| `InMemoryRecordManager::default()` | Lives in process. Lost on restart. |
| (custom) | Implement `RecordManager` for SQLite, Postgres, Redis, S3. |
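Whatever the backend, the state a record manager persists is small. This sketch does not reproduce the actual `RecordManager` trait — it just models the minimal operations any backend needs (an upsert, a lookup, a per-group key listing for deletion detection, and a delete), with the group as part of the row key:

```rust
use std::collections::HashMap;

// Toy model of what a record-manager backend persists.
// (group, key) -> content hash. The group column is what lets one
// manager hold several indices ("docs", "code", "tickets").
#[derive(Default)]
struct FingerprintStore {
    rows: HashMap<(String, String), u64>,
}

impl FingerprintStore {
    fn upsert(&mut self, group: &str, key: &str, hash: u64) {
        self.rows.insert((group.to_string(), key.to_string()), hash);
    }
    fn get(&self, group: &str, key: &str) -> Option<u64> {
        self.rows.get(&(group.to_string(), key.to_string())).copied()
    }
    // Keys currently recorded for one group — comparing this set against
    // the current load is how deletions are found.
    fn keys(&self, group: &str) -> Vec<String> {
        self.rows
            .keys()
            .filter(|(g, _)| g.as_str() == group)
            .map(|(_, k)| k.clone())
            .collect()
    }
    fn delete(&mut self, group: &str, key: &str) {
        self.rows.remove(&(group.to_string(), key.to_string()));
    }
}
```

A SQLite or Postgres backend would map this to a single table with a unique index on (group, key).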
How “changed” is detected
The pipeline computes a stable content fingerprint per doc:

- Same key + same content hash = unchanged.
- Different content for the same key = changed (re-embed and replace).
- New key = added.
- Key seen previously but not in this load = deleted (removed from the store).

Fingerprint stability matters: since v0.3.0 the algorithm is locked to be deterministic across Rust releases (see PR #26). You won't get spurious re-indexing from compiler upgrades.
How it works
- Splitting happens before fingerprinting. Each chunk inherits the parent doc’s key and is fingerprinted alongside it.
- The store is updated atomically per doc. Old chunks for a changed doc are deleted before new ones are added, so there’s no window where the old and new versions are both queryable.
- Errors per doc don’t poison the run. A failing embedder call for one document is reported in the result but doesn’t prevent others from indexing.
- Concurrency follows `RunnableConfig::max_concurrency`. Tune for your embedder's rate limits.
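The error-isolation point is worth seeing concretely. This is a minimal stand-alone sketch, not the crate's internals — `embed` and `index_all` are hypothetical names — showing one failing embed call being reported in the result while the other docs still index:

```rust
// Stand-in embedder: fails on empty input, succeeds otherwise.
fn embed(content: &str) -> Result<Vec<f32>, String> {
    if content.is_empty() {
        Err("embedder rejected empty input".to_string())
    } else {
        Ok(vec![content.len() as f32]) // toy embedding
    }
}

// Index every doc; collect per-doc errors instead of aborting the run.
fn index_all(docs: &[(&str, &str)]) -> (usize, Vec<(String, String)>) {
    let mut indexed = 0;
    let mut errors = Vec::new();
    for (key, content) in docs {
        match embed(content) {
            Ok(_vector) => indexed += 1, // would be written to the store
            Err(e) => errors.push((key.to_string(), e)), // reported, run continues
        }
    }
    (indexed, errors)
}

fn main() {
    let (indexed, errors) = index_all(&[("a", "alpha"), ("b", ""), ("c", "gamma")]);
    println!("indexed={indexed}, failed={}", errors.len());
    // prints: indexed=2, failed=1
}
```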
See also
Documents and splitters
What goes into the pipeline.
Embeddings and vector stores
What comes out.
Patterns → Code Q&A
A worked indexing-then-retrieval flow.