By the end of this tutorial, you’ll have a fully functional local LLM running on your machine via Ollama, exposed as an OpenAI-compatible REST API, and callable from Python — zero API costs, zero data leaving your hardware. The Ollama local LLM setup takes under 10 minutes on any modern machine with at least 8GB of RAM.
If you’ve been paying per-token for every dev experiment, running classification tasks in bulk, or processing sensitive documents through a cloud API, this changes your workflow significantly. Ollama gives you a clean CLI and a local server that mimics the OpenAI API format — meaning most tools that work with GPT-4 can be pointed at your local machine with a single URL swap.
- Install Ollama — Download and install the binary for your OS (Windows, Mac, or Linux)
- Pull a model — Download an open-source model like Llama 3.1 or Mistral via CLI
- Verify the server is running — Confirm the REST API is live on localhost:11434
- Call the API from Python — Make inference requests using the OpenAI-compatible endpoint
- Tune performance for your hardware — Configure GPU layers, context size, and concurrency
- Wire it into an agent workflow — Drop in as a local backend for LangChain or a custom agent
Step 1: Install Ollama on Windows, Mac, or Linux
macOS
Download the .dmg from ollama.com/download, drag it to Applications, and launch it. Ollama runs as a menu bar app and starts the server automatically on port 11434. That’s it.
Linux
The official one-liner works reliably on Ubuntu 20.04+, Debian, and most systemd-based distros:
curl -fsSL https://ollama.com/install.sh | sh
This installs the binary to /usr/local/bin/ollama and registers a systemd service. The service starts automatically on boot. To check it:
sudo systemctl status ollama
# Should show: active (running)
Windows
Download the OllamaSetup.exe installer from the same download page. It installs as a background service accessible via the system tray. WSL2 users can also run the Linux install script inside WSL — GPU passthrough works if you’ve configured NVIDIA drivers for WSL.
Hardware note: Ollama runs on CPU if no GPU is detected, but expect 3-10x slower inference. An M-series Mac or any NVIDIA GPU with 8GB+ VRAM will give you usable speeds. On Apple Silicon, Ollama uses the Metal backend automatically — no configuration needed.
Step 2: Pull a Model and Run It
Ollama hosts models on its registry. Pull your first model:
# Llama 3.1 8B — good balance of quality and speed, ~4.7GB download
ollama pull llama3.1
# Mistral 7B — faster, slightly lower quality on reasoning tasks
ollama pull mistral
# Qwen2.5 3B — runs well on CPU-only machines with 8GB RAM
ollama pull qwen2.5:3b
# See what's downloaded locally
ollama list
To run a model interactively in the terminal:
ollama run llama3.1
# >>> type your prompt here, /bye to exit
The first run after a pull takes a few seconds to load the model into memory. Subsequent calls are fast because the model stays loaded for 5 minutes after the last request (configurable via OLLAMA_KEEP_ALIVE).
For production use, you’ll want to interact via the API rather than the interactive CLI.
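You can also control residency per request: the native API accepts a keep_alive field, and sending a request with no prompt just loads (or unloads) the model. A minimal sketch, assuming llama3.1 is pulled and the server is running:

```python
import json
from urllib.request import Request, urlopen

# "keep_alive" overrides OLLAMA_KEEP_ALIVE for this model:
# a duration like "30m", "-1" to pin it in memory, or "0" to unload it now.
payload = {"model": "llama3.1", "keep_alive": "30m"}

def post(payload, host="http://localhost:11434"):
    req = Request(f"{host}/api/generate",
                  data=json.dumps(payload).encode(),
                  headers={"Content-Type": "application/json"})
    return urlopen(req)

# post(payload)  # uncomment with the server running
```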
Step 3: Verify the REST API Is Running
Ollama runs a local HTTP server at http://localhost:11434 the moment the app or service starts — you don’t need to call ollama run first.
# Check the server is alive
curl http://localhost:11434
# Should return: Ollama is running
# List available models via API
curl http://localhost:11434/api/tags
The native Ollama API uses /api/generate and /api/chat, but the more useful endpoint for most developers is the OpenAI-compatible layer added in Ollama 0.1.24+:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1",
"messages": [{"role": "user", "content": "What is 2+2?"}]
}'
This returns a response in the exact same JSON schema as OpenAI’s API — which means you can swap any OpenAI client to point at your local Ollama instance.
Step 4: Call the API from Python
You have two options: use the native ollama Python package, or use the openai SDK with a base URL override. I’d go with the OpenAI SDK approach because it means zero code changes when you want to run the same agent against Claude or GPT-4 in production. If you’re thinking about cost tradeoffs between hosted and self-hosted models, our breakdown of self-hosting vs Claude API costs is worth reading before you commit to an architecture.
from openai import OpenAI
# Point the OpenAI client at your local Ollama instance
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Required by the SDK but not validated by Ollama
)
response = client.chat.completions.create(
model="llama3.1", # Must match a model you've pulled
messages=[
{"role": "system", "content": "You are a concise assistant."},
{"role": "user", "content": "Summarize quantum entanglement in two sentences."}
],
temperature=0.3, # Lower = more deterministic
max_tokens=200
)
print(response.choices[0].message.content)
Streaming works the same way:
stream = client.chat.completions.create(
model="llama3.1",
messages=[{"role": "user", "content": "Write a haiku about Python."}],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
Install the dependency: pip install openai. No special Ollama SDK required.
Step 5: Tune Performance for Your Hardware
Default settings are conservative. Here’s how to get more out of your hardware.
GPU Layer Offloading
Ollama auto-detects GPUs and offloads as many layers as fit in VRAM. To control the split yourself, set the num_gpu option — the number of transformer layers to offload, not the number of GPUs — either per request or baked into a Modelfile:
# Modelfile for a fully-offloaded variant
FROM llama3.1
PARAMETER num_gpu 999   # request all layers on GPU; Ollama caps this at what fits
# Build it, then check what was actually offloaded
ollama create llama3.1-gpu -f Modelfile
ollama ps   # Shows running models and their CPU/GPU memory split
Context Window and Concurrency
The default context length is 2048 tokens for most models. For document processing or longer conversations, increase it — but this scales memory usage linearly:
# Ollama-specific options aren't part of the OpenAI request schema, so
# set them per-request through the native /api/chat endpoint instead
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1",
        "messages": [...],
        "stream": False,
        "options": {
            "num_ctx": 8192,    # Context window size
            "num_thread": 8,    # CPU threads (for CPU inference)
            "num_gpu": 99       # Layers to offload to GPU (not GPU count)
        }
    }
)
Environment Variables Worth Setting
export OLLAMA_MAX_LOADED_MODELS=2 # Keep 2 models in memory simultaneously
export OLLAMA_NUM_PARALLEL=4 # Handle 4 concurrent requests
export OLLAMA_KEEP_ALIVE=30m # Keep model loaded for 30 minutes
On a machine with 16GB RAM and an RTX 3080, I get roughly 35-45 tokens/sec with Llama 3.1 8B fully offloaded to GPU. CPU-only on a modern 8-core machine gives around 5-8 tokens/sec with the same model. If inference latency matters for your agent loops, understanding how temperature and sampling settings affect generation speed is worth your time — lower top-p values can modestly improve throughput.
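Rather than trusting my numbers, you can measure throughput on your own hardware: non-streamed native API responses include eval_count (generated tokens) and eval_duration (in nanoseconds). A small helper built on those two fields:

```python
# eval_count and eval_duration come back in native /api/generate and
# /api/chat responses; eval_duration is in nanoseconds.
def tokens_per_sec(eval_count: int, eval_duration_ns: int) -> float:
    return eval_count / (eval_duration_ns / 1e9)

# e.g. data = resp.json(); tokens_per_sec(data["eval_count"], data["eval_duration"])
print(tokens_per_sec(350, 10_000_000_000))  # 35.0
```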
Step 6: Wire It Into an Agent Workflow
Because Ollama exposes an OpenAI-compatible endpoint, dropping it into LangChain, LlamaIndex, or a custom agent takes one line. Here’s a minimal LangChain agent using a local model:
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.tools import tool
# Point LangChain at Ollama
llm = ChatOpenAI(
model="llama3.1",
base_url="http://localhost:11434/v1",
api_key="ollama",
temperature=0
)
@tool
def get_word_count(text: str) -> int:
"""Count the number of words in a text string."""
return len(text.split())
tools = [get_word_count]
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant with access to tools."),
("human", "{input}"),
MessagesPlaceholder("agent_scratchpad"),
])
agent = create_openai_tools_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
result = executor.invoke({"input": "How many words are in this sentence: The quick brown fox?"})
print(result["output"])
This pattern means your agent code is fully portable — swap the base_url and model to point at OpenAI, Anthropic (via a proxy), or any other OpenAI-compatible endpoint. If you’re building more complex pipelines, the same local model can back a multi-agent orchestration setup where Ollama handles cheap sub-tasks and a hosted model handles final synthesis.
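One way to make that swap concrete is to resolve the backend from environment variables. A sketch using hypothetical variable names (LLM_MODEL, LLM_BASE_URL, LLM_API_KEY are my own convention, not an Ollama or LangChain feature):

```python
import os

# Defaults point at the local Ollama instance; override via environment
# to target any other OpenAI-compatible endpoint without code changes.
def llm_config() -> dict:
    return {
        "model": os.getenv("LLM_MODEL", "llama3.1"),
        "base_url": os.getenv("LLM_BASE_URL", "http://localhost:11434/v1"),
        "api_key": os.getenv("LLM_API_KEY", "ollama"),
    }

# llm = ChatOpenAI(**llm_config())  # same call works against a hosted
#                                   # backend once the env vars change
```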
Common Errors and How to Fix Them
Error: “model not found” or 404 on /v1/chat/completions
You’re requesting a model name that doesn’t exist locally. Run ollama list to see what’s pulled. Model names are case-sensitive and must match exactly — llama3.1 not llama-3.1. Pull the model first with ollama pull <model-name>.
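A defensive pattern is to check /api/tags before the first request. A sketch, assuming the server is up on localhost:11434 (the name-matching logic is the part worth copying, since /api/tags returns tagged names like llama3.1:latest):

```python
import json
from urllib.request import urlopen

def pulled_models(host="http://localhost:11434"):
    # /api/tags lists every locally pulled model
    with urlopen(f"{host}/api/tags") as r:
        return [m["name"] for m in json.load(r)["models"]]

def is_pulled(requested: str, names: list) -> bool:
    # "llama3.1" should match "llama3.1:latest"; matching is case-sensitive
    base = requested.split(":")[0]
    return any(n == requested or n.split(":")[0] == base for n in names)

# if not is_pulled("llama3.1", pulled_models()):
#     raise SystemExit("Run: ollama pull llama3.1")
```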
Error: “context length exceeded” or garbled output mid-response
The default context window (2048 tokens) is being exceeded. Pass num_ctx in the request options as shown in Step 5. Note that doubling the context roughly doubles the KV cache's share of VRAM — model weights stay constant — so watch your ollama ps output for memory pressure.
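For a rough sense of how context length drives memory, here is a back-of-envelope KV-cache estimate, assuming an fp16 cache and a Llama 3.1 8B-like shape (32 layers, 8 KV heads, head dimension 128):

```python
# 2x accounts for keys plus values; bytes_per=2 assumes fp16 cache entries.
# Layer/head numbers are the Llama 3.1 8B shape; other models differ.
def kv_cache_bytes(num_ctx, layers=32, kv_heads=8, head_dim=128, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * bytes_per * num_ctx

print(kv_cache_bytes(8192) / 2**30)  # 1.0  (GiB at an 8k context)
print(kv_cache_bytes(2048) / 2**30)  # 0.25 (GiB at the 2048 default)
```

The scaling in num_ctx is exactly linear, which is why a large context window can cost as much VRAM as the quantized weights themselves.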
Extremely slow inference (< 2 tokens/sec)
Ollama is running purely on CPU, or the model doesn’t fit in VRAM and is swapping to system RAM. Check ollama ps — the “size” column shows memory usage. If GPU utilization is 0% during inference (nvidia-smi on Linux/Windows, Activity Monitor on Mac), Ollama didn’t detect your GPU. On Linux, verify CUDA drivers with nvidia-smi before installing Ollama. On Windows with an NVIDIA card, ensure you have the latest Game Ready or Studio drivers installed.
What to Build Next: A Privacy-First Document Q&A Tool
The natural extension of this setup is a RAG pipeline where sensitive documents never leave your machine. The pattern: chunk documents locally → embed with a local embedding model (Ollama supports nomic-embed-text via ollama pull nomic-embed-text) → store vectors in a local Qdrant or Chroma instance → query your local Llama model at inference time.
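The retrieval half of that pattern needs no framework at all. A minimal sketch of chunking and cosine-similarity ranking; the embedding step (omitted here) would call Ollama's /api/embeddings with nomic-embed-text to turn each chunk into a vector:

```python
import math

def chunk(text: str, size: int = 500, overlap: int = 50):
    # Fixed-size character chunks with overlap so ideas aren't cut mid-sentence
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, doc_vecs, k: int = 3):
    # Indices of the k chunks most similar to the query vector
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    return ranked[:k]
```

A vector store like Qdrant or Chroma replaces top_k once you have more than a few thousand chunks, but the logic is the same.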
Every piece runs on your hardware. For a legal or compliance context, this matters — no document content hits a third-party API at any point. If you need to evaluate whether RAG or fine-tuning is the right approach for your use case, we’ve covered the cost and performance tradeoffs between RAG and fine-tuning in detail.
The Ollama local LLM setup is genuinely production-ready for use cases where latency tolerances are in the 500ms+ range, data privacy is non-negotiable, or you’re running high-volume batch tasks where per-token API costs would add up fast. It’s not a replacement for frontier models on complex reasoning tasks — Llama 3.1 8B is not GPT-4o — but for classification, summarization, structured extraction, and routing logic in agent pipelines, it’s more than capable. If you’re managing API costs at scale, pairing a local Ollama instance for cheap tasks with a hosted model for high-value ones is one of the more effective strategies.
Bottom line by reader type:
- Solo founder doing 50k+ API calls per month on classification or extraction tasks — go self-hosted; the hardware pays for itself fast.
- Team using LLMs for sensitive HR or legal documents — Ollama is the obvious choice; stop sending that data to the cloud.
- Developer wanting to experiment without burning API budget — the Ollama local LLM setup takes 10 minutes and costs nothing per query.
Frequently Asked Questions
What hardware do I need to run Ollama effectively?
For CPU-only inference, 16GB RAM is the practical minimum for 7B models — you’ll get 5-8 tokens/sec which is usable but slow. An NVIDIA GPU with 8GB VRAM (e.g. RTX 3060/3070) runs 7-8B models at 30-50 tokens/sec. Apple Silicon Macs (M1/M2/M3) are excellent — an M2 Pro with 16GB unified memory handles Llama 3.1 8B at around 30 tokens/sec with no configuration needed.
Is Ollama compatible with the OpenAI Python SDK?
Yes, from version 0.1.24+ Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1. Set base_url="http://localhost:11434/v1" and api_key="ollama" (any non-empty string works) in the OpenAI client — chat completions, streaming, and function calling all work. Tool use reliability varies by model; Llama 3.1 and Mistral NeMo handle it better than smaller models.
Which models work best with Ollama for coding tasks?
For coding specifically, codellama:13b and deepseek-coder-v2 outperform general-purpose 7B models. If you have the VRAM for it (24GB+), qwen2.5-coder:32b is competitive with GPT-3.5 on many coding benchmarks. For machines with limited memory, qwen2.5-coder:7b is a solid pick — it punches above its weight class on code generation and completion.
Can I run Ollama on a remote server and access it from another machine?
By default Ollama only listens on 127.0.0.1. To expose it on your network, set OLLAMA_HOST=0.0.0.0:11434 before starting Ollama. Put it behind an nginx reverse proxy with authentication before exposing it beyond your local network — the API has no built-in auth. Don’t expose port 11434 directly to the public internet.
How do I run multiple models simultaneously in Ollama?
Set OLLAMA_MAX_LOADED_MODELS=2 (or more) in your environment before starting Ollama. Each model stays in memory according to OLLAMA_KEEP_ALIVE after its last request. Keep in mind each loaded model occupies its full VRAM/RAM footprint simultaneously — running two 8B models requires roughly double the memory of one. ollama ps shows currently loaded models and their memory usage.
What’s the difference between Ollama’s native API and the OpenAI-compatible endpoint?
The native API (/api/generate, /api/chat) offers Ollama-specific options like raw mode, model templates, and more granular sampling parameters. The OpenAI-compatible endpoint (/v1/chat/completions) is a compatibility layer that accepts the standard OpenAI request format — useful for dropping Ollama into existing tooling without code changes. For new projects, I’d use the OpenAI-compatible endpoint unless you specifically need Ollama-only features.
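To make the difference concrete, here are the two payload shapes side by side for the same request (both assume llama3.1 is pulled):

```python
# Native endpoint: POST http://localhost:11434/api/chat
native = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "options": {"num_ctx": 4096},  # Ollama-specific knobs live here
    "stream": False,               # the native API streams by default
}

# OpenAI-compatible endpoint: POST http://localhost:11434/v1/chat/completions
openai_style = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "max_tokens": 50,              # standard OpenAI fields only
}
```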
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

