Running Claude Code on Local Inference Backends
- Claude Code agents can run on local backends to eliminate per-token costs and rate limits.
- Developers use ANTHROPIC_BASE_URL to redirect API calls to Ollama, LM Studio, or llama.cpp servers.
- Recommended local coding models include glm-4.7-flash, devstral-small-2:24b, and gemma4:26b for agentic tasks.
Claude Code, a tool for automated software engineering tasks, can run against local inference backends, which avoids per-token costs and rate limits. Because agentic sessions often consume 10-50x more tokens than standard chat interfaces, running models locally can cut costs substantially. Redirection works by setting the ANTHROPIC_BASE_URL environment variable so that requests go to a local server that supports the Anthropic Messages API format.
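The redirection described above can be sketched in a few lines of Python. This is a minimal illustration, not an official launcher: it assumes the CLI is installed as `claude` on the PATH and that the local server listens on Ollama's default address, http://localhost:11434. The helper names `build_local_env` and `launch_claude` are hypothetical.

```python
import os
import subprocess
from typing import Dict, Optional

# Assumption: Ollama's default listen address. Change this for LM Studio or llama.cpp.
DEFAULT_LOCAL_URL = "http://localhost:11434"


def build_local_env(
    base_url: str = DEFAULT_LOCAL_URL,
    base_env: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with ANTHROPIC_BASE_URL set to base_url."""
    env = dict(os.environ if base_env is None else base_env)
    env["ANTHROPIC_BASE_URL"] = base_url
    return env


def launch_claude(base_url: str = DEFAULT_LOCAL_URL) -> int:
    """Start the `claude` CLI (assumed to be on PATH) with the redirected environment."""
    return subprocess.call(["claude"], env=build_local_env(base_url))
```

Calling `launch_claude()` would start a session whose API requests go to the local server. Copying the environment instead of mutating `os.environ` keeps the redirect scoped to the child process.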
Three primary backends support this integration as of June 12, 2026. Ollama supports the Anthropic Messages API natively; recommended models include glm-4.7-flash (8GB VRAM, 128K context) and devstral-small-2:24b (16GB VRAM). LM Studio (version 0.4.1 or later) provides a graphical interface for model management and exposes a /v1/messages endpoint. For users who need granular control, llama.cpp exposes inference parameters directly, such as setting the context size to 128K and passing --n-gpu-layers 99 to offload layers to the GPU.
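As a rough illustration of the llama.cpp settings mentioned above, the sketch below assembles a `llama-server` command line. It assumes that "128K" means 131072 tokens, that the binary is named `llama-server`, and that port 8080 is acceptable. The model path and the helper name `llama_server_command` are hypothetical placeholders.

```python
import shlex
from typing import List


def llama_server_command(
    model_path: str,
    ctx_size: int = 131072,   # assumption: 128K context = 128 * 1024 tokens
    gpu_layers: int = 99,     # value from the text; offloads up to 99 layers to the GPU
    port: int = 8080,         # assumption: any free local port works
) -> List[str]:
    """Build the argument list for a llama.cpp server with the given settings."""
    return [
        "llama-server",
        "--model", model_path,
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", str(gpu_layers),
        "--port", str(port),
    ]


# Print the command for inspection; the model path is a placeholder.
print(shlex.join(llama_server_command("models/example.gguf")))
```

Building the arguments as a list rather than a single string means the result can be passed straight to `subprocess.run` without shell quoting issues.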
Successful implementation requires specific configuration updates. Developers must set a dummy API key, map Claude Code's model tiers to local model names, and include CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: "1" in settings.json to avoid header errors. Recommended coding models include gemma4:26b (20GB VRAM, 256K context, 77% coding benchmark accuracy) and qwen3-coder (20GB VRAM, 128K context). If tool calls fail, users should confirm that Ollama 0.14.3 or newer is running, since that version is needed for compatibility with the required API specification.
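The configuration steps above might be combined into a settings.json fragment along the following lines. Only CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS and ANTHROPIC_BASE_URL come from the text. The `env` wrapper key, the ANTHROPIC_AUTH_TOKEN variable used for the dummy key, and the three ANTHROPIC_DEFAULT_*_MODEL tier variables are assumptions that should be checked against the Claude Code version in use. The helper name `build_settings` is hypothetical.

```python
import json
from typing import Dict


def build_settings(
    local_model: str,
    base_url: str = "http://localhost:11434",  # assumption: Ollama's default address
) -> Dict[str, Dict[str, str]]:
    """Return a settings.json-style dict that points every model tier at local_model."""
    return {
        "env": {  # assumption: settings.json carries environment variables under "env"
            "ANTHROPIC_BASE_URL": base_url,
            # Placeholder only; a local server does not validate this value.
            "ANTHROPIC_AUTH_TOKEN": "dummy-key",
            # Assumed tier variable names; each tier maps to the same local model.
            "ANTHROPIC_DEFAULT_OPUS_MODEL": local_model,
            "ANTHROPIC_DEFAULT_SONNET_MODEL": local_model,
            "ANTHROPIC_DEFAULT_HAIKU_MODEL": local_model,
            # From the text: avoids header errors on local backends.
            "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1",
        }
    }


# Print the fragment so it can be reviewed before being merged into settings.json.
print(json.dumps(build_settings("qwen3-coder"), indent=2))
```

Mapping all tiers to one model is the simplest choice for a single-GPU setup. A machine with more VRAM could instead assign a smaller model to the lightest tier.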