You can build a complete Cognis app — agent, RAG, even multi-agent — without leaving localhost. Ollama runs LLMs and embedders locally; Cognis ships clients for both. This pattern walks through a fully local research-style assistant with an in-memory vector store.

What you’ll build

A local agent that:
  • chats via a local Ollama model
  • embeds via a local Ollama embedder
  • searches a small in-memory knowledge base
  • streams tokens in real time
No API keys. No outbound traffic.

Step 0 — Install Ollama and pull models

ollama pull llama3.1                  # ~4.7 GB
ollama pull nomic-embed-text          # ~270 MB embedder
You can substitute any model: qwen2.5:3b, phi3, mistral-nemo, etc. For tool-calling, prefer models that support function calling natively (llama3.1, qwen2.5, mistral-nemo).

Step 1 — Add cognis with the ollama feature

[dependencies]
cognis = { version = "0.3", features = ["ollama"] }
tokio = { version = "1", features = ["full"] }
ollama is in the default feature set, so you actually only need cognis = "0.3". But being explicit doesn’t hurt.

Step 2 — Configure the env

export COGNIS_PROVIDER=ollama
export COGNIS_OLLAMA_MODEL=llama3.1
# Optional: export COGNIS_OLLAMA_BASE_URL=http://localhost:11434
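Optional but handy: a one-shot smoke test to confirm the env is wired up before building anything bigger. This is a minimal sketch that reuses only calls shown elsewhere on this page (Client::from_env, invoke, Message::human, resp.content()); it assumes the Step 2 variables are exported and the Step 0 model is pulled.

use cognis::prelude::*;
use cognis_llm::Client;

#[tokio::main]
async fn main() -> Result<()> {
    // Picks up COGNIS_PROVIDER and COGNIS_OLLAMA_MODEL from Step 2.
    let client = Client::from_env()?;

    // One short round trip: if this prints a reply, the daemon and model are reachable.
    let resp = client.invoke(vec![Message::human("Reply with the single word: ready")]).await?;
    println!("{}", resp.content());
    Ok(())
}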

Step 3 — Build a tool-calling agent

use std::sync::Arc;
use cognis::prelude::*;
use cognis::{AgentBuilder, Calculator};
use cognis_llm::Client;

#[tokio::main]
async fn main() -> Result<()> {
    let client = Client::from_env()?;

    let mut agent = AgentBuilder::new()
        .with_llm(client)
        .with_tool(Arc::new(Calculator::new()))
        .with_system_prompt(
            "You are a math assistant. Use the calculator for any arithmetic. \
             Always state the final answer clearly."
        )
        .with_max_iterations(4)
        .build()?;

    let resp = agent.run(Message::human("What is 23 * 17 + 4?")).await?;
    println!("{}", resp.content);
    Ok(())
}
If your model is small (llama3.2:1b, etc.), tool calling can be flaky — switch to a model that’s known to handle it (llama3.1, qwen2.5).

Step 4 — Add local RAG

use std::sync::Arc;
use cognis::prelude::*;
use cognis_llm::Client;
use cognis_rag::{
    Document, Embeddings, InMemoryVectorStore, OllamaEmbeddings,
    RecursiveCharSplitter, TextSplitter, VectorStore,
};

#[tokio::main]
async fn main() -> Result<()> {
    let docs = vec![
        Document::new("Cognis is a Rust LLM framework."),
        Document::new("cognisgraph offers a Pregel-style stateful graph engine."),
        Document::new("cognis-rag bundles embeddings, vector stores, and retrievers."),
    ];
    let chunks = RecursiveCharSplitter::new()
        .with_chunk_size(120)
        .split_all(&docs);

    let emb: Arc<dyn Embeddings> = Arc::new(OllamaEmbeddings::new("nomic-embed-text"));
    let mut store = InMemoryVectorStore::new(emb);
    let texts: Vec<_> = chunks.iter().map(|c| c.content.clone()).collect();
    store.add_texts(texts, None).await?;

    let q = "What does cognis-rag include?";
    let hits = store.similarity_search(q, 2).await?;
    let context: String = hits.iter().map(|h| format!("- {}", h.text)).collect::<Vec<_>>().join("\n");

    let client = Client::from_env()?;
    let prompt = format!("Answer using only:\n{context}\n\nQ: {q}\nA:");
    let resp = client.invoke(vec![Message::human(prompt)]).await?;
    println!("{}", resp.content());
    Ok(())
}
Replace OllamaEmbeddings with OpenAIEmbeddings later when you want higher-quality embeddings; the rest of the pipeline doesn’t change.
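As a sketch of that swap, only the line that builds the embedder changes; the OpenAIEmbeddings constructor below is assumed to mirror OllamaEmbeddings::new (check the embeddings page for the real signature).

use std::sync::Arc;
use cognis_rag::{Embeddings, InMemoryVectorStore, OllamaEmbeddings};

fn main() {
    // Local embedder, exactly as in Step 4.
    let emb: Arc<dyn Embeddings> = Arc::new(OllamaEmbeddings::new("nomic-embed-text"));

    // Hosted swap (hypothetical constructor shape, mirroring OllamaEmbeddings::new):
    // let emb: Arc<dyn Embeddings> = Arc::new(OpenAIEmbeddings::new("text-embedding-3-small"));

    // Everything downstream (splitting, add_texts, similarity_search) is unchanged.
    let _store = InMemoryVectorStore::new(emb);
}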

Step 5 — Stream tokens

use cognis::prelude::*;
use futures::StreamExt;
use cognis_llm::Client;

#[tokio::main]
async fn main() -> Result<()> {
    let client = Client::from_env()?;
    let mut s = client.stream(vec![Message::human("Tell me a one-line joke.")]).await?;
    while let Some(chunk) = s.next().await {
        print!("{}", chunk?.content);
    }
    println!();
    Ok(())
}
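If you also want the complete text after streaming (for logging or a follow-up turn), one small variant is to accumulate the chunks while printing them. A sketch using only the calls shown above, assuming chunk.content is string-like (the print! above already treats it as displayable):

use cognis::prelude::*;
use futures::StreamExt;
use cognis_llm::Client;

#[tokio::main]
async fn main() -> Result<()> {
    let client = Client::from_env()?;
    let mut s = client.stream(vec![Message::human("Tell me a one-line joke.")]).await?;

    // Print tokens as they arrive and keep the full response around.
    let mut full = String::new();
    while let Some(chunk) = s.next().await {
        let piece = chunk?.content.to_string();
        print!("{piece}");
        full.push_str(&piece);
    }
    println!();
    println!("[{} chars streamed]", full.len());
    Ok(())
}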

How it works

  • Client::from_env() reads COGNIS_PROVIDER=ollama and points at the daemon. Same code as any other provider.
  • OllamaEmbeddings::new(model) talks to the same daemon for embeddings. No second service to install.
  • No keys, ever. The Ollama wire protocol uses no auth. Don’t expose your daemon to untrusted networks.
  • Speed depends on hardware. A 7B model on Apple Silicon is interactive; on a CPU-only laptop it’s slow. Pick small models when iterating.

When to graduate to a hosted provider

Local strength | Hosted strength
Zero cost | Larger models (Claude Opus, GPT-4o)
Privacy | Faster cold start
Offline iteration | Better tool-calling reliability on smaller prompts
Predictable latency on your hardware | Higher quality on hard tasks
The transition is one env var. Build your app local-first, set COGNIS_PROVIDER=openai when you want production quality, and toggle by environment.
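A small sketch of what that toggle looks like from the application's side; the only assumption is the COGNIS_PROVIDER variable already introduced in Step 2.

use cognis::prelude::*;
use cognis_llm::Client;

#[tokio::main]
async fn main() -> Result<()> {
    // COGNIS_PROVIDER=ollama -> everything on this page, fully local.
    // COGNIS_PROVIDER=openai -> hosted models; that provider's credentials must be set.
    // The call sites below stay identical either way.
    // (The fallback string is only for this log line; Client::from_env reads the env itself.)
    let provider = std::env::var("COGNIS_PROVIDER").unwrap_or_else(|_| "ollama".to_string());
    eprintln!("provider: {provider}");

    let client = Client::from_env()?;
    let resp = client.invoke(vec![Message::human("One-line status check, please.")]).await?;
    println!("{}", resp.content());
    Ok(())
}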

See also

Models and providers

All providers, all builder knobs.

Embeddings and vector stores

Local embedders and stores.

Examples → Quickstart V2

The numbered demo set, all of which work against Ollama.