What you’ll build
A local agent that:
- chats via a local Ollama model
- embeds via a local Ollama embedder
- searches a small in-memory knowledge base
- streams tokens in real time
Step 0 — Install Ollama and pull models
qwen2.5:3b, phi3, mistral-nemo, etc. For tool-calling, prefer models that support function calling natively (llama3.1, qwen2.5, mistral-nemo).
Step 1 — Add cognis with the ollama feature
The ollama feature is in the default feature set, so cognis = "0.3" is all you strictly need. Being explicit doesn’t hurt, though.
Step 2 — Configure the env
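As a rough illustration of what "configure by environment" means, here is a std-only sketch that resolves a provider from COGNIS_PROVIDER. It is not the cognis implementation: the Provider enum, resolve_provider, and provider_from_env are hypothetical names, and falling back to Ollama when the variable is unset is an assumption of this sketch.

```rust
use std::env;

/// Which backend the process should talk to (hypothetical type for this sketch).
#[derive(Debug, PartialEq, Eq)]
enum Provider {
    Ollama,
    OpenAi,
}

/// Resolve the provider through a lookup function.
/// Passing the lookup in keeps the logic testable without touching
/// the real process environment.
fn resolve_provider<F>(lookup: F) -> Result<Provider, String>
where
    F: Fn(&str) -> Option<String>,
{
    // COGNIS_PROVIDER is the variable named in this guide. Defaulting to
    // "ollama" when it is unset is an assumption, not documented behaviour.
    let raw = lookup("COGNIS_PROVIDER").unwrap_or_else(|| "ollama".to_string());
    match raw.trim().to_ascii_lowercase().as_str() {
        "ollama" => Ok(Provider::Ollama),
        "openai" => Ok(Provider::OpenAi),
        other => Err(format!("unsupported COGNIS_PROVIDER value: {}", other)),
    }
}

/// Convenience wrapper that reads the real process environment.
fn provider_from_env() -> Result<Provider, String> {
    resolve_provider(|key| env::var(key).ok())
}
```

Calling provider_from_env() once at startup and branching on the result keeps the rest of the program provider-agnostic, which is the property this guide relies on.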
Step 3 — Build a tool-calling agent
llama3.2:1b etc), tool calling can be flaky — switch to a model that’s known to handle it (llama3.1, qwen2.5).
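Because small models sometimes emit tool names that don’t exist, it helps to make the dispatch step defensive. The sketch below is plain Rust and does not use the cognis agent API: ToolCall, ToolRegistry, and the string-in, string-out tool signature are all hypothetical simplifications. It shows one way to return a readable error that can be fed back to the model instead of panicking.

```rust
use std::collections::HashMap;

/// A tool is a named function from a string argument to a string result.
type Tool = Box<dyn Fn(&str) -> String>;

/// A tool call as a model might emit it (hypothetical shape for this sketch).
struct ToolCall {
    name: String,
    argument: String,
}

/// Registry that maps tool names to handlers.
struct ToolRegistry {
    tools: HashMap<String, Tool>,
}

impl ToolRegistry {
    fn new() -> Self {
        ToolRegistry { tools: HashMap::new() }
    }

    /// Register a handler under the given tool name.
    fn register<F>(&mut self, name: &str, handler: F)
    where
        F: Fn(&str) -> String + 'static,
    {
        self.tools.insert(name.to_string(), Box::new(handler));
    }

    /// Run the requested tool, or return an error that lists the known
    /// tools so the caller can hand it back to the model for a retry.
    fn dispatch(&self, call: &ToolCall) -> Result<String, String> {
        match self.tools.get(&call.name) {
            Some(handler) => Ok(handler(&call.argument)),
            None => {
                let mut known: Vec<&str> = self.tools.keys().map(|k| k.as_str()).collect();
                known.sort(); // stable, readable ordering in the error message
                Err(format!(
                    "unknown tool '{}'; available: {}",
                    call.name,
                    known.join(", ")
                ))
            }
        }
    }
}
```

If the error branch fires often with a given model, that is a reasonable signal to move to one of the models listed above as handling tool calls more reliably.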
Step 4 — Add local RAG
Swap OllamaEmbeddings for OpenAIEmbeddings later when you want higher quality; the rest of the pipeline doesn’t change.
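The reason the embedder can be swapped freely is that the search step only sees vectors. Here is a minimal sketch of an in-memory knowledge base ranked by cosine similarity. It is independent of cognis: Doc, cosine_similarity, and top_k are hypothetical names, and in a real pipeline the embedding vectors would come from the embedder rather than being written by hand.

```rust
use std::cmp::Ordering;

/// One knowledge-base entry: the text plus its embedding vector.
struct Doc {
    text: String,
    embedding: Vec<f32>,
}

/// Cosine similarity of two vectors. Returns 0.0 for mismatched lengths,
/// empty input, or zero-length vectors instead of dividing by zero.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    if a.len() != b.len() || a.is_empty() {
        return 0.0;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 0.0;
    }
    dot / (norm_a * norm_b)
}

/// Score every document against the query and return the best `k`,
/// highest similarity first.
fn top_k<'a>(docs: &'a [Doc], query: &[f32], k: usize) -> Vec<(&'a str, f32)> {
    let mut scored: Vec<(&'a str, f32)> = docs
        .iter()
        .map(|d| (d.text.as_str(), cosine_similarity(&d.embedding, query)))
        .collect();
    // Sort descending by score; treat incomparable values (NaN) as equal.
    scored.sort_by(|l, r| r.1.partial_cmp(&l.1).unwrap_or(Ordering::Equal));
    scored.truncate(k);
    scored
}
```

A linear scan like this is fine for a small knowledge base. Embeddings from different models are not comparable with each other, so switching embedders means re-embedding the stored documents as well as the query.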
Step 5 — Stream tokens
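To show the consumer side of streaming without depending on a running model, the sketch below fakes the token source with a thread and a channel, then writes each token as soon as it arrives. None of this is cognis API: fake_token_stream and render_stream are hypothetical helpers, and a real stream would come from the client rather than a hard-coded list.

```rust
use std::io::{self, Write};
use std::sync::mpsc;
use std::thread;

/// Spawn a producer that emits tokens one at a time, standing in for a model.
fn fake_token_stream(tokens: Vec<String>) -> mpsc::Receiver<String> {
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        for token in tokens {
            // Stop early if the consumer has gone away.
            if tx.send(token).is_err() {
                break;
            }
        }
        // `tx` is dropped here, which closes the channel and ends the stream.
    });
    rx
}

/// Write each token as it arrives, flushing after every write so partial
/// output shows up immediately, and return the assembled text.
fn render_stream<W: Write>(rx: mpsc::Receiver<String>, out: &mut W) -> io::Result<String> {
    let mut full = String::new();
    for token in rx {
        out.write_all(token.as_bytes())?;
        out.flush()?;
        full.push_str(&token);
    }
    Ok(full)
}
```

Passing io::stdout() as the writer gives the familiar typewriter effect in a terminal. The flush after each token matters there, because standard output is line-buffered and tokens rarely end in a newline.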
How it works
- Client::from_env() reads COGNIS_PROVIDER=ollama and points at the daemon. Same code as any other provider.
- OllamaEmbeddings::new(model) talks to the same daemon for embeddings. No second service to install.
- No keys, ever. The Ollama wire protocol uses no auth. Don’t expose your daemon to untrusted networks.
- Speed depends on hardware. A 7B model on Apple Silicon is interactive; on a CPU-only laptop it’s slow. Pick small models when iterating.
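Since chat and embeddings both go through one local daemon, a quick pre-flight check can save some confusion when nothing responds. The sketch below uses only the standard library and assumes the daemon listens on Ollama’s default port, 11434. daemon_reachable and is_loopback are hypothetical helpers, not part of cognis.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Default address of a local Ollama daemon (11434 is Ollama's default port).
const DEFAULT_OLLAMA_ADDR: &str = "127.0.0.1:11434";

/// Return true if a TCP connection to `addr` succeeds within `timeout`.
/// This only proves something is listening, not that it is Ollama.
fn daemon_reachable(addr: &str, timeout: Duration) -> bool {
    match addr.parse::<SocketAddr>() {
        Ok(sock) => TcpStream::connect_timeout(&sock, timeout).is_ok(),
        Err(_) => false, // not an "ip:port" literal
    }
}

/// Return true if `addr` is a loopback address, i.e. reachable only from
/// this machine. Useful as a guard, given that the protocol has no auth.
fn is_loopback(addr: &str) -> bool {
    addr.parse::<SocketAddr>()
        .map(|sock| sock.ip().is_loopback())
        .unwrap_or(false)
}
```

Checking daemon_reachable(DEFAULT_OLLAMA_ADDR, Duration::from_secs(1)) before building the agent lets you report that the daemon is not running instead of surfacing a lower-level connection error later.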
When to graduate to a hosted provider
| Local strength | Hosted strength |
|---|---|
| Zero cost | Larger models (Claude Opus, GPT-4o) |
| Privacy | Faster cold start |
| Offline iteration | Better tool-calling reliability on smaller prompts |
| Predictable latency on your hardware | Higher quality on hard tasks |
Set COGNIS_PROVIDER=openai when you want production quality, and toggle between local and hosted by environment.
See also
Models and providers
All providers, all builder knobs.
Embeddings and vector stores
Local embedders and stores.
Examples → Quickstart V2
The numbered demo set, all of which work against Ollama.