RAG starts with documents — the unit of retrieval.
A `Document` is text plus metadata. Before you can embed and store documents, long ones need to be split into chunks: small enough for the embedder’s context window, and small enough that retrieval surfaces relevant pieces, not whole files.
## What a document is
| Field | Type | Purpose |
|---|---|---|
| `content` | `String` | The text. |
| `id` | `Option<String>` | Stable identifier — set this for incremental indexing. |
| `metadata` | `HashMap<String, Value>` | Arbitrary key/value pairs you control. Filter retrieval by these. |
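A minimal sketch of constructing such a document. The struct below is an illustrative stand-in for the crate’s type, mirroring the fields in the table; it uses `String` metadata values instead of `Value` to stay self-contained, and `handbook_doc` is a hypothetical helper.

```rust
use std::collections::HashMap;

// Illustrative stand-in for the Document shape described above —
// not the crate's actual type.
#[derive(Debug, Clone)]
struct Document {
    content: String,
    id: Option<String>,
    metadata: HashMap<String, String>,
}

fn handbook_doc() -> Document {
    let mut metadata = HashMap::new();
    // Arbitrary pairs you control; retrieval can filter on them later.
    metadata.insert("source".to_string(), "handbook.md".to_string());
    Document {
        content: "RAG starts with documents.".to_string(),
        // A stable id is what enables incremental indexing.
        id: Some("handbook-v1".to_string()),
        metadata,
    }
}

fn main() {
    let doc = handbook_doc();
    assert_eq!(doc.id.as_deref(), Some("handbook-v1"));
}
```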
## Loading documents
The `DocumentLoader` trait is a single async method that returns a stream of `Document`. Built-in loaders are feature-gated:
| Source | Feature flag |
|---|---|
| Plain text / files in a directory | always available |
| CSV | `cognis-rag/csv-loader` |
| HTML | `cognis-rag/html-loader` |
| YAML | `cognis-rag/yaml-loader` |
| TOML | `cognis-rag/toml-loader` |
| Web fetch | `cognis-rag/web-loader` |
| PDF | `cognis-rag/pdf-loader` |
Implement `DocumentLoader` yourself for sources that aren’t in the box (databases, APIs, S3 buckets).
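The real trait is async and streams documents; below is a simplified synchronous sketch of the same idea. Every name here (`SliceLoader`, this `Document`, the `load` signature) is an illustrative assumption, not the crate’s API.

```rust
use std::collections::HashMap;

// Simplified stand-in types; the real trait is async and returns
// a Stream of Document rather than a Vec.
#[derive(Debug, Clone)]
struct Document {
    content: String,
    id: Option<String>,
    metadata: HashMap<String, String>,
}

trait DocumentLoader {
    fn load(&self) -> Vec<Document>;
}

// Example custom source: an in-memory slice standing in for rows
// from a database or objects in an S3 bucket.
struct SliceLoader<'a> {
    rows: &'a [(&'a str, &'a str)], // (id, content)
}

impl<'a> DocumentLoader for SliceLoader<'a> {
    fn load(&self) -> Vec<Document> {
        self.rows
            .iter()
            .map(|(id, content)| Document {
                content: content.to_string(),
                id: Some(id.to_string()), // stable ids from the source system
                metadata: HashMap::new(),
            })
            .collect()
    }
}

fn main() {
    let loader = SliceLoader { rows: &[("a", "first doc"), ("b", "second doc")] };
    let docs = loader.load();
    assert_eq!(docs.len(), 2);
    assert_eq!(docs[0].id.as_deref(), Some("a"));
}
```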
## Splitters

All splitters implement `TextSplitter`: `split` returns chunks for one doc; `split_all` is a convenience wrapper for many.
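A rough sketch of that relationship, with assumed (not actual) signatures: `split` chunks one piece of text, and `split_all` is a default method that maps it over many. The `ParagraphSplitter` is a hypothetical implementor for illustration.

```rust
// Illustrative shape of the TextSplitter relationship: one required
// method for a single doc, a default wrapper for many.
trait TextSplitter {
    fn split(&self, content: &str) -> Vec<String>;

    // Convenience wrapper: apply split to each document in turn.
    fn split_all(&self, docs: &[&str]) -> Vec<Vec<String>> {
        docs.iter().map(|d| self.split(d)).collect()
    }
}

// Hypothetical implementor: split on blank lines.
struct ParagraphSplitter;

impl TextSplitter for ParagraphSplitter {
    fn split(&self, content: &str) -> Vec<String> {
        content
            .split("\n\n")
            .map(|s| s.trim().to_string())
            .filter(|s| !s.is_empty())
            .collect()
    }
}

fn main() {
    let splitter = ParagraphSplitter;
    let chunks = splitter.split("Intro paragraph.\n\nSecond paragraph.");
    assert_eq!(chunks.len(), 2);
    let all = splitter.split_all(&["a\n\nb", "c"]);
    assert_eq!(all[0].len(), 2);
    assert_eq!(all[1].len(), 1);
}
```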
### Pick a splitter
| Splitter | When to use |
|---|---|
| `RecursiveCharSplitter` | Default. Tries paragraph → line → sentence → char. Good for prose. |
| `CharacterSplitter` | Simple split on a single separator. Predictable; fast. |
| `TokenAwareSplitter` | Chunks by token count using a `Tokenizer`. Most accurate budgeting for embedders. |
| `MarkdownSplitter` | Header-aware Markdown chunking — preserves H1/H2/H3 structure. |
| `SentenceSplitter` | Sentence boundaries. Good for short-form text. |
| `CodeSplitter` | Language-tuned separators (`fn`, `class`, etc.). |
| `HtmlSplitter` | DOM-aware HTML chunking. |
| `JsonSplitter` | Structured-data chunking. |
Splitters share the same configuration knobs: `chunk_size`, `overlap`, and a `separators` list.
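As a rough sketch of how `chunk_size` and `overlap` interact (ignoring `separators` for brevity, and making no claim about the crate’s actual algorithm): a sliding window advances by `chunk_size - overlap` characters, so neighboring chunks share an overlap-sized window.

```rust
// Hypothetical sliding-window chunker: the window advances by
// (chunk_size - overlap), so adjacent chunks overlap by `overlap` chars.
fn window_chunks(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(overlap < chunk_size, "overlap must be smaller than chunk_size");
    let chars: Vec<char> = text.chars().collect();
    let step = chunk_size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break; // last chunk reached the end of the text
        }
        start += step;
    }
    chunks
}

fn main() {
    let chunks = window_chunks("abcdefghij", 4, 2);
    // Windows: abcd, cdef, efgh, ghij; each neighbor shares 2 chars.
    assert_eq!(chunks, vec!["abcd", "cdef", "efgh", "ghij"]);
}
```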
## How chunks land in retrieval

Each chunk inherits its parent’s metadata, plus a position-in-parent marker. So when you retrieve a chunk, you also know:

- Which document it came from (via metadata, especially if you set `with_id`).
- Roughly where in that document it sits.
- Any custom metadata you attached.
## Tuning chunk size

Two competing forces:

- Smaller chunks = sharper retrieval (the right idea, not surrounding noise) but less context for the LLM to reason from.
- Larger chunks = more context but worse signal-to-noise — the embedding averages over a lot of unrelated text.
| Use case | Chunk size | Overlap |
|---|---|---|
| Conversational FAQ | 200–500 chars | 50 |
| Long-form prose | 800–1200 chars | 100–200 |
| Source code | 500–1000 chars (line-aligned) | 50 |
| Markdown docs | 1000–2000 chars (header-bounded) | 0 |
## How it works

- Splitting is lossless. No characters disappear; overlap means neighboring chunks share a window.
- Document `id` is preserved through splits. Each chunk gets its own derived id (so you can de-duplicate later) but knows its parent.
- Splitters don’t know about embedders. That decoupling means you can switch embedders without re-splitting.
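The bookkeeping above can be sketched as follows. The `"{parent}#{index}"` id scheme, the field names, and `derive_chunks` are illustrative assumptions, not the crate’s actual derivation.

```rust
use std::collections::HashMap;

// Each chunk inherits the parent's metadata, records its position,
// and gets a derived id so it can be de-duplicated and traced back.
#[derive(Debug)]
struct Chunk {
    content: String,
    id: String,
    metadata: HashMap<String, String>,
}

fn derive_chunks(
    parent_id: &str,
    metadata: &HashMap<String, String>,
    pieces: Vec<String>,
) -> Vec<Chunk> {
    pieces
        .into_iter()
        .enumerate()
        .map(|(i, content)| {
            let mut md = metadata.clone(); // inherit parent metadata
            md.insert("chunk_index".into(), i.to_string()); // position-in-parent marker
            Chunk {
                content,
                id: format!("{parent_id}#{i}"), // assumed derived-id scheme
                metadata: md,
            }
        })
        .collect()
}

fn main() {
    let mut md = HashMap::new();
    md.insert("source".to_string(), "handbook.md".to_string());
    let chunks = derive_chunks("doc-1", &md, vec!["first".into(), "second".into()]);
    assert_eq!(chunks[1].id, "doc-1#1");
    assert_eq!(chunks[1].metadata.get("source").map(String::as_str), Some("handbook.md"));
}
```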
## See also

- Embeddings and vector stores — turn chunks into vectors, store, search.
- Indexing pipeline — keep your store in sync with the source of truth.
- Patterns → Code Q&A — a worked end-to-end RAG over a Rust codebase.