Provider Setup

Providers connect your agents to large language models. Each provider plugin handles API authentication, streaming, tool calling format differences, and thinking/reasoning support so your agent config stays clean.

RivetOS ships with five provider plugins:

| Provider | Models | Thinking Support | Notes |
|---|---|---|---|
| Anthropic | Claude Opus, Sonnet, Haiku | | Extended thinking, OAuth login |
| xAI | Grok 3, Grok 4 | | Responses API, conversation caching |
| Google | Gemini 2.5 Pro, Flash | | Thought signatures for function calling |
| Ollama | Any local model | | Local inference, no API key needed |
| llama.cpp server | Any model served by llama-server | | Local llama-server binary (native sampling, tags, lenient tool calling) |

Anthropic

  1. Go to the Anthropic Console
  2. Sign up or log in
  3. Go to API Keys → Create Key
  4. Copy the key (starts with sk-ant-)

Alternatively, use OAuth login (no API key needed):

npx rivetos login

This opens a browser, authenticates with Anthropic, and stores tokens locally. The provider auto-detects OAuth tokens vs API keys.

Add your key to .env:

ANTHROPIC_API_KEY=sk-ant-...your-key-here

Add to config.yaml:

providers:
  anthropic:
    model: claude-sonnet-4-20250514
    max_tokens: 8192
agents:
  myagent:
    provider: anthropic
    default_thinking: medium

| Key | Type | Default | Description |
|---|---|---|---|
| model | string | claude-opus-4-6 | Model identifier |
| max_tokens | number | 8192 | Maximum output tokens |
| api_key | string | ${ANTHROPIC_API_KEY} | API key. Use env var |
| base_url | string | https://api.anthropic.com | API endpoint (for proxies) |
| token_path | string | | Path to OAuth token file (set automatically by rivetos login) |
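As an illustration of the api_key and base_url options, the fragment below routes requests through a proxy while still reading the key from the environment. The proxy URL is a hypothetical placeholder, not a real endpoint, and the quoted ${...} reference simply mirrors the default shown in the table.

```yaml
providers:
  anthropic:
    model: claude-sonnet-4-20250514
    max_tokens: 8192
    api_key: "${ANTHROPIC_API_KEY}"                 # read from .env, never hard-coded
    base_url: https://anthropic-proxy.example.com  # hypothetical proxy; omit to use the default endpoint
```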

When default_thinking is set on the agent, the provider requests extended thinking with a token budget:

| Level | Budget | Best For |
|---|---|---|
| off | | Simple questions, fast responses |
| low | 2,000 tokens | Light reasoning |
| medium | 10,000 tokens | Code review, planning |
| high | 50,000 tokens | Complex architecture, deep analysis |

| Model | Speed | Intelligence | Context |
|---|---|---|---|
| claude-opus-4-6 | Slow | Highest | 200K |
| claude-sonnet-4-20250514 | Fast | High | 200K |
| claude-3-5-haiku-20241022 | Fastest | Good | 200K |
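To make the thinking levels concrete, here is a minimal sketch of an agent that uses the high level. The agent name architect is hypothetical. Anthropic's API requires max_tokens to be larger than the thinking budget, and this page does not say whether the provider adjusts that automatically, so the fragment raises max_tokens above the 50,000-token budget as a precaution.

```yaml
providers:
  anthropic:
    model: claude-sonnet-4-20250514
    max_tokens: 64000          # assumption: set above the 50,000-token "high" budget to be safe
agents:
  architect:                   # hypothetical agent name
    provider: anthropic
    default_thinking: high     # requests a 50,000-token thinking budget
```

For quick, low-latency agents, the same shape with default_thinking: off (or the key omitted) avoids spending tokens on reasoning.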

Docs: Anthropic API Reference


xAI

  1. Go to console.x.ai
  2. Sign up or log in
  3. Create an API key
  4. Copy the key (starts with xai-)

Add your key to .env:

XAI_API_KEY=xai-...your-key-here

Add to config.yaml:

providers:
  xai:
    model: grok-4-1-fast-reasoning
agents:
  grok:
    provider: xai

| Key | Type | Default | Description |
|---|---|---|---|
| model | string | grok-4.20-reasoning | Model identifier |
| api_key | string | ${XAI_API_KEY} | API key |
| base_url | string | https://api.x.ai/v1 | API endpoint |
| temperature | number | | Sampling temperature (not used with reasoning models) |
| store | boolean | true | Server-side conversation storage. When enabled, only new messages are sent each turn |
| timeout_ms | number | 3600000 | Request timeout in milliseconds (default: 1 hour for reasoning) |

When store: true (the default), xAI stores the conversation server-side, so each turn sends only the new messages, which reduces token usage and latency. The provider manages previous_response_id automatically.
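If you would rather not keep conversations on xAI's servers, a fragment along these lines turns storage off. It is a sketch under one assumption: that with store: false the provider falls back to sending the full conversation each turn, which the table implies but does not state outright. The shorter timeout is purely illustrative.

```yaml
providers:
  xai:
    model: grok-4-1-fast-reasoning
    store: false         # assumption: full history is sent each turn when storage is off
    timeout_ms: 600000   # 10 minutes instead of the 1-hour default
```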

| Model | Type | Notes |
|---|---|---|
| grok-4.20-reasoning | Flagship | 2M context, fast + agentic, $2.00/$6.00 per M tokens |
| grok-4-1-fast-reasoning | Fast | 10x cheaper ($0.20/$0.50), good for compaction/fallback |

Docs: xAI API Documentation


Google

  1. Go to Google AI Studio
  2. Click Create API Key
  3. Select or create a Google Cloud project
  4. Copy the key

Add your key to .env:

GOOGLE_API_KEY=AIza...your-key-here

Add to config.yaml:

providers:
  google:
    model: gemini-2.5-pro
agents:
  gemini:
    provider: google
    default_thinking: medium

| Key | Type | Default | Description |
|---|---|---|---|
| model | string | gemini-2.5-pro | Model identifier |
| api_key | string | ${GOOGLE_API_KEY} | API key |
| max_tokens | number | 8192 | Maximum output tokens |
| base_url | string | https://generativelanguage.googleapis.com/v1beta | API endpoint |

| Level | Budget |
|---|---|
| off | 0 |
| low | 1,024 tokens |
| medium | 8,192 tokens |
| high | 32,768 tokens |

| Model | Speed | Context | Notes |
|---|---|---|---|
| gemini-2.5-pro | Medium | 1M | Best reasoning |
| gemini-2.5-flash | Fast | 1M | Good balance of speed and quality |

Docs: Gemini API Documentation


Ollama

Ollama runs models locally on your machine. No API key needed, no usage costs, just hardware.

# Linux
curl -fsSL https://ollama.com/install.sh | sh
# macOS
brew install ollama
# Or download from https://ollama.com/download

Then pull a model:

ollama pull qwen2.5:32b

Browse available models at ollama.com/library.

No .env needed — Ollama runs locally without authentication.

providers:
  ollama:
    model: qwen2.5:32b
    base_url: http://localhost:11434
agents:
  local:
    provider: ollama
    local: true # Extended context (tokens are free)

| Key | Type | Default | Description |
|---|---|---|---|
| model | string | llama3.1 | Model name (must be pulled via ollama pull) |
| base_url | string | http://localhost:11434 | Ollama API endpoint |
| temperature | number | 0.7 | Sampling temperature |
| top_p | number | 0.9 | Nucleus sampling threshold |
| num_ctx | number | model default | Context window size in tokens |
| keep_alive | string | 30m | How long to keep model loaded in memory |

  • Set local: true on the agent. This includes extended workspace context (CAPABILITIES.md, daily notes), since tokens are free with local inference.
  • num_ctx is critical for tool-using agents. Most models default to 2048-4096 tokens, which isn’t enough. Set 8192 or higher.
  • keep_alive controls how long the model stays in VRAM after the last request. Set to 0 to unload immediately, or 24h to keep it warm.
  • Remote Ollama: If Ollama runs on a different machine, change base_url to point at it (e.g., http://192.0.2.50:11434).
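Putting the tips above together, a tool-using agent backed by a remote Ollama host might be configured roughly like this. The values are illustrative choices, not recommendations from the RivetOS authors; the address is the example one from the last tip.

```yaml
providers:
  ollama:
    model: qwen2.5:32b
    base_url: http://192.0.2.50:11434  # remote Ollama host (example address)
    num_ctx: 16384                      # comfortably above the 8192 suggested for tool use
    keep_alive: 24h                     # keep the model loaded between requests
agents:
  local:
    provider: ollama
    local: true                         # include the extended workspace context
```

A larger num_ctx increases memory use on the Ollama host, so size it to the hardware that actually runs the model.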

Docs: Ollama API Documentation


llama.cpp server

The native provider for llama-server, the built-in HTTP server from the llama.cpp project.

It uses the native /completion and /infill endpoints (not the OpenAI compat layer). This gives full access to llama.cpp sampling parameters (typical_p, mirostat, repeat_last_n, seed, etc.), native <think> / <thinking> tag support, and lenient JSON tool-call parsing.

# Build from source (recommended for latest features)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Current llama.cpp builds with CMake; add backend flags such as -DGGML_CUDA=ON for NVIDIA GPUs
cmake -B build
cmake --build build --config Release -j
# Or use prebuilt binaries from https://github.com/ggerganov/llama.cpp/releases
# Run with a model (adjust -m, --host, --port)
./build/bin/llama-server -m models/Meta-Llama-3.1-70B-Instruct-Q4_K_M.gguf \
  --host 0.0.0.0 --port 8080 \
  -c 32768 --n-gpu-layers 99
Add to config.yaml:

providers:
  local:
    provider_type: llama-server # or just use the default "llama-server"
    base_url: http://localhost:8080
    model: llama3.1:70b # any model name your server knows
    num_ctx: 32768
    typical_p: 0.9
    repeat_last_n: 64
    mirostat: 2
    mirostat_tau: 5.0
    seed: 42
agents:
  local:
    provider: local
    local: true

| Key | Type | Default | Description |
|---|---|---|---|
| base_url | string | http://localhost:8080 | Must point to your llama-server (no /v1) |
| model | string | default | Model alias or path known to the server |
| num_ctx | number | 8192 | Context window (matches server -c) |
| temperature | number | 0.7 | Sampling temperature |
| top_p | number | 0.9 | Nucleus sampling |
| typical_p | number | 0.9 | Locally typical sampling (llama.cpp specific) |
| repeat_penalty | number | 1.1 | Repetition penalty |
| repeat_last_n | number | 64 | Last N tokens to consider for repetition |
| mirostat | number | 0 | 0=off, 1=Mirostat v1, 2=v2 |
| mirostat_tau | number | 5.0 | Target surprise value |
| mirostat_eta | number | 0.1 | Learning rate for Mirostat |
| seed | number | -1 | Random seed (-1 = random) |
| first_chunk_timeout_ms | number | 120000 | Timeout for first token |
| chunk_timeout_ms | number | 30000 | Timeout between tokens |

Note: This provider is llama.cpp-specific. It talks directly to the native llama-server endpoints (not the OpenAI-compat layer). A future generic openai provider is planned for OpenRouter, Together, Fireworks, vLLM, etc.


When a provider fails (429 rate limit, 503 overloaded, timeout), RivetOS can automatically try the next provider in a fallback chain.

Configure at the agent level:

agents:
  opus:
    provider: anthropic
    fallbacks:
      - "google:gemini-2.5-pro"
      - "xai:grok-4-1-fast-reasoning"

Or globally:

runtime:
  fallbacks:
    - providerId: anthropic
      fallbacks:
        - "google:gemini-2.5-pro"
        - "xai:grok-4-1-fast-reasoning"

Format: provider_id on its own uses that provider’s default model; provider_id:model overrides the model.
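The two forms can be mixed in one chain. The sketch below assumes that google, xai, and ollama are all defined under providers, and that entries are tried in the order listed; the page describes a "next provider" chain but does not spell out the ordering.

```yaml
agents:
  opus:
    provider: anthropic
    fallbacks:
      - google                          # bare provider_id: uses google's configured default model
      - "xai:grok-4-1-fast-reasoning"   # provider_id:model: overrides xai's default model
      - ollama                          # local last resort, no API key or quota involved
```

Ending the chain with a local provider means the agent can still answer when every hosted API is rate-limited or unreachable, at the cost of a weaker model.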


# Run provider connectivity checks
npx rivetos doctor
# Smoke test — send a test message to each provider
npx rivetos test
# Check which providers are loaded
npx rivetos status