By the end of this tutorial, you’ll have a fully functional local LLM running on your machine via Ollama, exposed as an OpenAI-compatible REST API, and callable from Python — zero API costs, zero data leaving your hardware. The Ollama local LLM setup takes under 10 minutes on any modern machine with at least 8GB of RAM.
If you’ve been paying per-token for every dev experiment, running classification tasks in bulk, or processing sensitive documents through a cloud API, this changes your workflow significantly. Ollama gives you a clean CLI and a local server that mimics the OpenAI API format — meaning most tools that work with GPT-4 can be pointed at your local machine with a single URL swap.
- Install Ollama — Download and install the binary for your OS (Windows, Mac, or Linux)
- Pull a model — Download an open-source model like Llama 3.1 or Mistral via CLI
- Verify the server is running — Confirm the REST API is live on localhost:11434
- Call the API from Python — Make inference requests using the OpenAI-compatible endpoint
- Tune performance for your hardware — Configure GPU layers, context size, and concurrency
- Wire it into an agent workflow — Drop in as a local backend for LangChain or a custom agent
Step 1: Install Ollama on Windows, Mac, or Linux
macOS
Download the .dmg from ollama.com/download, drag it to Applications, and launch it. Ollama runs as a menu bar app and starts the server automatically on port 11434. That’s it.
Linux
The official one-liner works reliably on Ubuntu 20.04+, Debian, and most systemd-based distros:
curl -fsSL https://ollama.com/install.sh | sh
This installs the binary to /usr/local/bin/ollama and registers a systemd service. The service starts automatically on boot. To check it:
sudo systemctl status ollama
# Should show: active (running)
Windows
Download the OllamaSetup.exe installer from the same download page. It installs as a background service accessible via the system tray. WSL2 users can also run the Linux install script inside WSL — GPU passthrough works if you’ve configured NVIDIA drivers for WSL.
Hardware note: Ollama runs on CPU if no GPU is detected, but expect 3-10x slower inference. An M-series Mac or any NVIDIA GPU with 8GB+ VRAM will give you usable speeds. On Apple Silicon, Ollama uses the Metal backend automatically — no configuration needed.
Step 2: Pull a Model and Run It
Ollama hosts models on its registry. Pull your first model:
# Llama 3.1 8B — good balance of quality and speed, ~4.7GB download
ollama pull llama3.1
# Mistral 7B — faster, slightly lower quality on reasoning tasks
ollama pull mistral
# Qwen2.5 3B — runs well on CPU-only machines with 8GB RAM
ollama pull qwen2.5:3b
# See what's downloaded locally
ollama list
To run a model interactively in the terminal:
ollama run llama3.1
# >>> type your prompt here, /bye to exit
The first run after a pull takes a few seconds to load the model into memory. Subsequent calls are fast because the model stays loaded for 5 minutes after the last request (configurable via OLLAMA_KEEP_ALIVE).
For production use, you’ll want to interact via the API rather than the interactive CLI.
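You can also control residency per request: the native API accepts a keep_alive field, and sending a request with no prompt just loads (or unloads) the model. A minimal sketch, assuming llama3.1 is pulled and the server is running:

```python
import json
from urllib.request import Request, urlopen

# "keep_alive" overrides OLLAMA_KEEP_ALIVE for this model:
# a duration like "30m", "-1" to pin it in memory, or "0" to unload it now.
payload = {"model": "llama3.1", "keep_alive": "30m"}

def post(payload, host="http://localhost:11434"):
    req = Request(f"{host}/api/generate",
                  data=json.dumps(payload).encode(),
                  headers={"Content-Type": "application/json"})
    return urlopen(req)

# post(payload)  # uncomment with the server running
```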
Step 3: Verify the REST API Is Running
Ollama runs a local HTTP server at http://localhost:11434 the moment the app or service starts — you don’t need to call ollama run first.
# Check the server is alive
curl http://localhost:11434
# Should return: Ollama is running
# List available models via API
curl http://localhost:11434/api/tags
The native Ollama API uses /api/generate and /api/chat, but the more useful endpoint for most developers is the OpenAI-compatible layer added in Ollama 0.1.24+:
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama3.1",
"messages": [{"role": "user", "content": "What is 2+2?"}]
}'
This returns a response in the exact same JSON schema as OpenAI’s API — which means you can swap any OpenAI client to point at your local Ollama instance.
Step 4: Call the API from Python
You have two options: use the native ollama Python package, or use the openai SDK with a base URL override. I’d go with the OpenAI SDK approach because it means zero code changes when you want to run the same agent against Claude or GPT-4 in production. If you’re thinking about cost tradeoffs between hosted and self-hosted models, our breakdown of self-hosting vs Claude API costs is worth reading before you commit to an architecture.
from openai import OpenAI
# Point the OpenAI client at your local Ollama instance
client = OpenAI(
base_url="http://localhost:11434/v1",
api_key="ollama" # Required by the SDK but not validated by Ollama
)
response = client.chat.completions.create(
model="llama3.1", # Must match a model you've pulled
messages=[
{"role": "system", "content": "You are a concise assistant."},
{"role": "user", "content": "Summarize quantum entanglement in two sentences."}
],
temperature=0.3, # Lower = more deterministic
max_tokens=200
)
print(response.choices[0].message.content)
Streaming works the same way:
stream = client.chat.completions.create(
model="llama3.1",
messages=[{"role": "user", "content": "Write a haiku about Python."}],
stream=True
)
for chunk in stream:
if chunk.choices[0].delta.content:
print(chunk.choices[0].delta.content, end="", flush=True)
Install the dependency: pip install openai. No special Ollama SDK required.
Step 5: Tune Performance for Your Hardware
Default settings are conservative. Here’s how to get more out of your hardware.
GPU Layer Offloading
Ollama auto-detects GPUs and offloads as many layers as fit in VRAM. To control the split yourself, set the num_gpu option — the number of transformer layers to offload, not the number of GPUs — either per request or baked into a Modelfile:
# Modelfile for a fully-offloaded variant
FROM llama3.1
PARAMETER num_gpu 999   # request all layers on GPU; Ollama caps this at what fits
# Build it, then check what was actually offloaded
ollama create llama3.1-gpu -f Modelfile
ollama ps   # Shows running models and their CPU/GPU memory split
Context Window and Concurrency
The default context length is 2048 tokens for most models. For document processing or longer conversations, increase it — but this scales memory usage linearly:
# Ollama-specific options aren't part of the OpenAI request schema, so
# set them per-request through the native /api/chat endpoint instead
import requests

response = requests.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1",
        "messages": [...],
        "stream": False,
        "options": {
            "num_ctx": 8192,    # Context window size
            "num_thread": 8,    # CPU threads (for CPU inference)
            "num_gpu": 99       # Layers to offload to GPU (not GPU count)
        }
    }
)
Environment Variables Worth Setting
export OLLAMA_MAX_LOADED_MODELS=2 # Keep 2 models in memory simultaneously
export OLLAMA_NUM_PARALLEL=4 # Handle 4 concurrent requests
export OLLAMA_KEEP_ALIVE=30m # Keep model loaded for 30 minutes
On a machine with 16GB RAM and an RTX 3080, I get roughly 35-45 tokens/sec with Llama 3.1 8B fully offloaded to GPU. CPU-only on a modern 8-core machine gives around 5-8 tokens/sec with the same model. If inference latency matters for your agent loops, understanding how temperature and sampling settings affect generation speed is worth your time — lower top-p values can modestly improve throughput.
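Rather than trusting my numbers, you can measure throughput on your own hardware: non-streamed native API responses include eval_count (generated tokens) and eval_duration (in nanoseconds). A small helper built on those two fields:

```python
# eval_count and eval_duration come back in native /api/generate and
# /api/chat responses; eval_duration is in nanoseconds.
def tokens_per_sec(eval_count: int, eval_duration_ns: int) -> float:
    return eval_count / (eval_duration_ns / 1e9)

# e.g. data = resp.json(); tokens_per_sec(data["eval_count"], data["eval_duration"])
print(tokens_per_sec(350, 10_000_000_000))  # 35.0
```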
Step 6: Wire It Into an Agent Workflow
Because Ollama exposes an OpenAI-compatible endpoint, dropping it into LangChain, LlamaIndex, or a custom agent takes one line. Here’s a minimal LangChain agent using a local model:
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor, create_openai_tools_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.tools import tool
# Point LangChain at Ollama
llm = ChatOpenAI(
model="llama3.1",
base_url="http://localhost:11434/v1",
api_key="ollama",
temperature=0
)
@tool
def get_word_count(text: str) -> int:
"""Count the number of words in a text string."""
return len(text.split())
tools = [get_word_count]
prompt = ChatPromptTemplate.from_messages([
("system", "You are a helpful assistant with access to tools."),
("human", "{input}"),
MessagesPlaceholder("agent_scratchpad"),
])
agent = create_openai_tools_agent(llm, tools, prompt)
executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
result = executor.invoke({"input": "How many words are in this sentence: The quick brown fox?"})
print(result["output"])
This pattern means your agent code is fully portable — swap the base_url and model to point at OpenAI, Anthropic (via a proxy), or any other OpenAI-compatible endpoint. If you’re building more complex pipelines, the same local model can back a multi-agent orchestration setup where Ollama handles cheap sub-tasks and a hosted model handles final synthesis.
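One way to make that swap concrete is to resolve the backend from environment variables. A sketch using hypothetical variable names (LLM_MODEL, LLM_BASE_URL, LLM_API_KEY are my own convention, not an Ollama or LangChain feature):

```python
import os

# Defaults point at the local Ollama instance; override via environment
# to target any other OpenAI-compatible endpoint without code changes.
def llm_config() -> dict:
    return {
        "model": os.getenv("LLM_MODEL", "llama3.1"),
        "base_url": os.getenv("LLM_BASE_URL", "http://localhost:11434/v1"),
        "api_key": os.getenv("LLM_API_KEY", "ollama"),
    }

# llm = ChatOpenAI(**llm_config())  # same call works against a hosted
#                                   # backend once the env vars change
```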
Common Errors and How to Fix Them
Error: “model not found” or 404 on /v1/chat/completions
You’re requesting a model name that doesn’t exist locally. Run ollama list to see what’s pulled. Model names are case-sensitive and must match exactly — llama3.1 not llama-3.1. Pull the model first with ollama pull <model-name>.
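A defensive pattern is to check /api/tags before the first request. A sketch, assuming the server is up on localhost:11434 (the name-matching logic is the part worth copying, since /api/tags returns tagged names like llama3.1:latest):

```python
import json
from urllib.request import urlopen

def pulled_models(host="http://localhost:11434"):
    # /api/tags lists every locally pulled model
    with urlopen(f"{host}/api/tags") as r:
        return [m["name"] for m in json.load(r)["models"]]

def is_pulled(requested: str, names: list) -> bool:
    # "llama3.1" should match "llama3.1:latest"; matching is case-sensitive
    base = requested.split(":")[0]
    return any(n == requested or n.split(":")[0] == base for n in names)

# if not is_pulled("llama3.1", pulled_models()):
#     raise SystemExit("Run: ollama pull llama3.1")
```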
Error: “context length exceeded” or garbled output mid-response
The default context window (2048 tokens) is being exceeded. Pass num_ctx in the request options as shown in Step 5. Note that doubling the context roughly doubles the KV cache's share of VRAM — model weights stay constant — so watch your ollama ps output for memory pressure.
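For a rough sense of how context length drives memory, here is a back-of-envelope KV-cache estimate, assuming an fp16 cache and a Llama 3.1 8B-like shape (32 layers, 8 KV heads, head dimension 128):

```python
# 2x accounts for keys plus values; bytes_per=2 assumes fp16 cache entries.
# Layer/head numbers are the Llama 3.1 8B shape; other models differ.
def kv_cache_bytes(num_ctx, layers=32, kv_heads=8, head_dim=128, bytes_per=2):
    return 2 * layers * kv_heads * head_dim * bytes_per * num_ctx

print(kv_cache_bytes(8192) / 2**30)  # 1.0  (GiB at an 8k context)
print(kv_cache_bytes(2048) / 2**30)  # 0.25 (GiB at the 2048 default)
```

The scaling in num_ctx is exactly linear, which is why a large context window can cost as much VRAM as the quantized weights themselves.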
Extremely slow inference (< 2 tokens/sec)
Ollama is running purely on CPU, or the model doesn’t fit in VRAM and is swapping to system RAM. Check ollama ps — the “size” column shows memory usage. If GPU utilization is 0% during inference (nvidia-smi on Linux/Windows, Activity Monitor on Mac), Ollama didn’t detect your GPU. On Linux, verify CUDA drivers with nvidia-smi before installing Ollama. On Windows with an NVIDIA card, ensure you have the latest Game Ready or Studio drivers installed.
What to Build Next: A Privacy-First Document Q&A Tool
The natural extension of this setup is a RAG pipeline where sensitive documents never leave your machine. The pattern: chunk documents locally → embed with a local embedding model (Ollama supports nomic-embed-text via ollama pull nomic-embed-text) → store vectors in a local Qdrant or Chroma instance → query your local Llama model at inference time.
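The retrieval half of that pattern needs no framework at all. A minimal sketch of chunking and cosine-similarity ranking; the embedding step (omitted here) would call Ollama's /api/embeddings with nomic-embed-text to turn each chunk into a vector:

```python
import math

def chunk(text: str, size: int = 500, overlap: int = 50):
    # Fixed-size character chunks with overlap so ideas aren't cut mid-sentence
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec, doc_vecs, k: int = 3):
    # Indices of the k chunks most similar to the query vector
    ranked = sorted(range(len(doc_vecs)),
                    key=lambda i: cosine(query_vec, doc_vecs[i]), reverse=True)
    return ranked[:k]
```

A vector store like Qdrant or Chroma replaces top_k once you have more than a few thousand chunks, but the logic is the same.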
Every piece runs on your hardware. For a legal or compliance context, this matters — no document content hits a third-party API at any point. If you need to evaluate whether RAG or fine-tuning is the right approach for your use case, we’ve covered the cost and performance tradeoffs between RAG and fine-tuning in detail.
The Ollama local LLM setup is genuinely production-ready for use cases where latency tolerances are in the 500ms+ range, data privacy is non-negotiable, or you’re running high-volume batch tasks where per-token API costs would add up fast. It’s not a replacement for frontier models on complex reasoning tasks — Llama 3.1 8B is not GPT-4o — but for classification, summarization, structured extraction, and routing logic in agent pipelines, it’s more than capable. If you’re managing API costs at scale, pairing a local Ollama instance for cheap tasks with a hosted model for high-value ones is one of the more effective strategies.
Bottom line by reader type:
- Solo founder doing 50k+ API calls per month on classification or extraction tasks — go self-hosted; the hardware pays for itself fast.
- Team using LLMs for sensitive HR or legal documents — Ollama is the obvious choice; stop sending that data to the cloud.
- Developer wanting to experiment without burning API budget — the Ollama local LLM setup takes 10 minutes and costs nothing per query.
Frequently Asked Questions
What hardware do I need to run Ollama effectively?
For CPU-only inference, 16GB RAM is the practical minimum for 7B models — you’ll get 5-8 tokens/sec which is usable but slow. An NVIDIA GPU with 8GB VRAM (e.g. RTX 3060/3070) runs 7-8B models at 30-50 tokens/sec. Apple Silicon Macs (M1/M2/M3) are excellent — an M2 Pro with 16GB unified memory handles Llama 3.1 8B at around 30 tokens/sec with no configuration needed.
Is Ollama compatible with the OpenAI Python SDK?
Yes, from version 0.1.24+ Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1. Set base_url="http://localhost:11434/v1" and api_key="ollama" (any non-empty string works) in the OpenAI client — chat completions, streaming, and function calling all work. Tool use reliability varies by model; Llama 3.1 and Mistral NeMo handle it better than smaller models.
Which models work best with Ollama for coding tasks?
For coding specifically, codellama:13b and deepseek-coder-v2 outperform general-purpose 7B models. If you have the VRAM for it (24GB+), qwen2.5-coder:32b is competitive with GPT-3.5 on many coding benchmarks. For machines with limited memory, qwen2.5-coder:7b is a solid pick — it punches above its weight class on code generation and completion.
Can I run Ollama on a remote server and access it from another machine?
By default Ollama only listens on 127.0.0.1. To expose it on your network, set OLLAMA_HOST=0.0.0.0:11434 before starting Ollama. Put it behind an nginx reverse proxy with authentication before exposing it beyond your local network — the API has no built-in auth. Don’t expose port 11434 directly to the public internet.
How do I run multiple models simultaneously in Ollama?
Set OLLAMA_MAX_LOADED_MODELS=2 (or more) in your environment before starting Ollama. Each model stays in memory according to OLLAMA_KEEP_ALIVE after its last request. Keep in mind each loaded model occupies its full VRAM/RAM footprint simultaneously — running two 8B models requires roughly double the memory of one. ollama ps shows currently loaded models and their memory usage.
What’s the difference between Ollama’s native API and the OpenAI-compatible endpoint?
The native API (/api/generate, /api/chat) offers Ollama-specific options like raw mode, model templates, and more granular sampling parameters. The OpenAI-compatible endpoint (/v1/chat/completions) is a compatibility layer that accepts the standard OpenAI request format — useful for dropping Ollama into existing tooling without code changes. For new projects, I’d use the OpenAI-compatible endpoint unless you specifically need Ollama-only features.
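To make the difference concrete, here are the two payload shapes side by side for the same request (both assume llama3.1 is pulled):

```python
# Native endpoint: POST http://localhost:11434/api/chat
native = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "options": {"num_ctx": 4096},  # Ollama-specific knobs live here
    "stream": False,               # the native API streams by default
}

# OpenAI-compatible endpoint: POST http://localhost:11434/v1/chat/completions
openai_style = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "What is 2+2?"}],
    "max_tokens": 50,              # standard OpenAI fields only
}
```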
Editorial note: API pricing, model capabilities, and tool features change frequently — always verify current details on the vendor’s website before building in production. Code examples are tested at time of writing; pin your dependency versions to avoid breaking changes. Some links in this article may be affiliate links — we may earn a commission if you sign up, at no extra cost to you.

