KDnuggets · June 10, 2026, 23:00 · about 28 min read

Local Agentic Programming on the Cheap: Claude Code, Ollama, and Gemma4

#LLM #AgenticProgramming #Gemma4 #Ollama #OpenSourceModels #LocalInference

TL;DR

KDnuggets describes how to combine Claude Code with Ollama and Gemma4 to run agentic programming in a local environment without depending on expensive cloud services.

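Key points

Building agents in a local environment

Claude Code's agent workflow, pointed at a Gemma4 model served through Ollama, makes it possible to build an autonomous programming agent that does not depend on external APIs.

Cost reduction and data privacy

The setup avoids expensive cloud services and per-call API fees, and it keeps confidential code inside the local network, which lowers security risk.

Concrete implementation details

The article walks through running a lightweight LLM locally with Ollama and integrating it into the Claude Code workflow to get an efficient, inexpensive development environment.

Key quotes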

Local Agentic Programming on the Cheap

Claude Code + Ollama + Gemma4

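Impact analysis

This approach reduces dependence on large cloud LLM services and opens a path for individual developers and small companies to use capable AI agents at low cost. For projects that handle confidential information in particular, being able to run everything locally is an important option for meeting security requirements.

Editor's comment

A practical way to use current AI agent technology locally while avoiding the risks of cloud dependence. Adoption looks most likely in security-conscious development teams.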

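# Introduction

Picture this: a multi-agent workflow that reads files, writes patches, runs tests, and iterates across four services, making 400 API calls in a single afternoon. The notification arrives. You have crossed the soft limit again. Every token costs money, every prompt sends your proprietary code to a third-party server, and the rate limits interrupt long-running sessions. The only solution on offer is paying more.

Gemma 4 26B MoE activates only 3.8 billion of its 26 billion parameters per forward pass. It scores 77.1% on LiveCodeBench v6 and roughly 79% on τ2-bench, the benchmark that specifically tests what happens when a model has to call tools, execute steps, and handle errors across a multi-step workflow; the 31B Dense variant reaches 86.4%. The previous generation, Gemma 3 27B, scored 6.6% on that same benchmark. That is not a small upgrade. It is the difference between a model that cannot reliably call tools and one that can run a Claude Code agentic loop without constantly malforming its function call parameters.

This article builds the full stack: Ollama serving Gemma 4 locally, the Modelfile that prevents context window failures in agentic sessions, the settings.json that wires Claude Code to the local endpoint, a verification script that confirms everything is working before you use it on real code, and an honest rundown of what breaks and how to fix it. The audience is engineers who already understand what large language models (LLMs) are and what agentic loops cost. No hand-holding on the basics.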

# Why Gemma 4?

Released on April 2, 2026 under Apache 2.0, Gemma 4 is Google DeepMind's most capable open-weight model family to date. Four variants shipped: E2B (2B effective), E4B (4B effective), 26B MoE, and 31B Dense. The 26B MoE uses 128 small experts and activates only 8 per token plus one shared expert, delivering near-31B quality at dramatically lower compute cost.
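To make the "8 of 128 experts plus one shared expert" figure concrete, here is a minimal top-k routing sketch. It is not Gemma 4's actual router: the function name `route_token` and the random scores are illustrative assumptions, and only the counts (128, 8, 1) come from the text above.

```python
import random

NUM_EXPERTS = 128   # routed experts, per the article
TOP_K = 8           # routed experts activated per token, per the article
SHARED_EXPERTS = 1  # always-on shared expert, per the article


def route_token(router_scores, k=TOP_K):
    """Return the indices of the k highest-scoring experts for one token.

    Hypothetical helper: a real router computes these scores from the
    token's hidden state with learned weights.
    """
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i],
                    reverse=True)
    return sorted(ranked[:k])


random.seed(0)
# Stand-in router scores; assumed values for illustration only.
scores = [random.random() for _ in range(NUM_EXPERTS)]

active = route_token(scores)
active_count = len(active) + SHARED_EXPERTS
print(f"{active_count} of {NUM_EXPERTS + SHARED_EXPERTS} experts run for this token")
```

Only the selected experts run their feed-forward computation for that token, which is why the per-token compute tracks the active parameter count rather than the total.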

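Previous Gemma versions used a custom Google license with commercial use restrictions ambiguous enough that enterprise legal teams routinely flagged it as a blocker. Gemma 4 is Apache 2.0, a first for the Gemma family. If your team wants to embed this in internal tooling, ship products on top of it, or run it in production pipelines without legal review overhead, that change matters operationally.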

// The Numbers That Matter for Coding Agents

| Benchmark | Gemma 3 27B | Gemma 4 26B MoE | Gemma 4 31B Dense |
|---|---|---|---|
| τ2-bench (agentic tool use) | 6.6% | ~79% | 86.4% |
| LiveCodeBench v6 | 29.1% | 77.1% | 80.0% |
| GPQA Diamond | 42.4% | 82.3% | 84.3% |
| AIME 2026 (math) | 20.8% | 88.3% | 89.2% |
| Arena AI ELO | 1365 | 1441 | 1452 |

// Hardware Requirements

Before pulling an 18 GB model, know what you are actually working with. The Gemma 4 family was designed to span edge devices through workstations, and the four variants reflect that range.

| Variant | Ollama tag | Active params | VRAM at Q4 | Context window |
|---|---|---|---|---|
| Edge 4B | gemma4:e4b | (not listed) | ~6 GB | 128K |
| 26B MoE (Mixture of Experts) | gemma4:26b | 3.8B | ~16–18 GB | 256K |
| 31B Dense | gemma4:31b | 31B | ~24–32 GB | 256K |

// Installing Ollama, Gemma 4, and Claude Code

Step 1: Install Ollama

# macOS and Linux -- one-line install
curl -fsSL https://ollama.com/install.sh | sh

# Verify version -- must be 0.14.0+ for Anthropic Messages API support
# The Anthropic-compatible endpoint was added in January 2026
ollama --version
# Expected: ollama version is 0.22.x or higher (as of May 2026)

# Windows: download the native installer from https://ollama.com
# WSL2 is recommended if you want GPU passthrough on Windows

After installation, Ollama starts as a background service on port 11434. Verify it is up:

curl http://localhost:11434
# Expected response: Ollama is running

Step 2: Pull Gemma 4

# The 26B MoE -- recommended for this setup (~18 GB download)
# ollama pull prints its own progress bar while the download runs
ollama pull gemma4:26b

# Optional: also pull the 31B for comparison on capable hardware
ollama pull gemma4:31b

# Confirm the pull completed
ollama list
# Should show gemma4:26b with size and modification date

Step 3: Install Claude Code

# Prerequisites: Node.js 18 or later
node --version   # Confirm you are on 18+

# Install Claude Code CLI globally
npm install -g @anthropic-ai/claude-code

# Verify the install
claude --version

With Ollama running and Gemma 4 pulled, the natural next instinct is to export the environment variables and launch Claude Code immediately. Hold off: Ollama's default context window has to be overridden first, which is what the Modelfile in the next section does.

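# The Modelfile

Ollama's default context window for Gemma 4 is 4K tokens. Gemma 4's actual context window is 128K–256K. That 4K default is not a suggestion; it is what Ollama will use unless you override it. In a Claude Code agentic session that reads source files, holds conversation history, and maintains tool call results across multiple turns, 4K tokens is exhausted in seconds.

Without the context override, Claude Code loses track of file contents mid-edit, forgets earlier instructions, and produces fragmented changes. Specifically: when an agent tries to refactor a 200-line service class, it cleanly forgets the second half exists. The agent does not raise an error. It just silently works on an incomplete view of the file and produces partially correct output that breaks downstream.

The fix is a Modelfile that bakes the correct context size and other inference parameters into a named model variant. Create this file: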

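# ~/.ollama/Modelfiles/gemma4-claude
# Gemma 4 26B MoE variant tuned for Claude Code agentic sessions.
# Bakes context window, temperature, and system prompt into the model
# so every Claude Code session starts with the correct configuration.
#
# Build with:
#   mkdir -p ~/.ollama/Modelfiles
#   ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude

FROM gemma4:26b

# Context window -- 65536 tokens (64K) is the tested-safe floor for real
# codebases without triggering swap on 16-18 GB VRAM systems.
# Increase to 131072 (128K) if you have headroom on 24 GB+ systems.
# Do not go above 131072 unless you have profiled your memory usage
# under load -- Ollama pre-allocates the full KV cache upfront.
PARAMETER num_ctx 65536

# Temperature -- 0.2 is deliberately low for agentic coding.
# Higher temperature introduces variability in tool call parameter
# formatting that causes Claude Code's tool validator to reject calls.
# For creative tasks, you would set this higher. For agentic loops: low.
PARAMETER temperature 0.2

# top_p -- nucleus sampling threshold. 0.9 keeps generation focused
# while avoiding the repetition loops that top_p=1.0 can produce on
# long agentic sessions.
PARAMETER top_p 0.9

# repeat_penalty -- penalizes the model for repeating tokens.
# 1.15 helps prevent tool call loops where Gemma 4 retries the same
# failed tool call with nearly identical parameters indefinitely.
PARAMETER repeat_penalty 1.15

# num_predict -- maximum tokens per response. 4096 is sufficient for
# most code patches. Increase to 8192 if you regularly generate
# large files in a single generation.
PARAMETER num_predict 4096

# System prompt -- reinforces coding agent behavior and explicit
# tool use discipline. Gemma 4 benefits from being reminded to
# commit to tool calls rather than describing what it would do.
SYSTEM """You are a senior software engineer operating as a coding agent.

When working with code:
- Read files before editing them. Never assume file contents.
- Make one focused change at a time and verify it before proceeding.
- When a tool call fails, examine the error carefully before retrying.
  Do not retry with identical parameters. Diagnose first.
- Prefer surgical edits over full file rewrites.
- Run tests after each meaningful change, not after a batch of changes.
- If you are uncertain about the codebase structure, read more files
  rather than guessing.

Be precise and methodical. Avoid explaining what you are about to do
when you could simply do it."""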

Build the variant:

# Create the Modelfiles directory if it does not exist
mkdir -p ~/.ollama/Modelfiles

# Save the Modelfile content from above to this path, then build:
ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude

# Verify the variant was created
ollama list
# Should show gemma4-claude alongside gemma4:26b

# Quick smoke test -- verify it loads and responds
ollama run gemma4-claude "What is the time complexity of binary search and why?"
# Expect a clear, concise technical response within a few seconds
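Because a missing context override fails silently, it can help to confirm that the built variant really carries `num_ctx`. The sketch below assumes Ollama's `/api/show` endpoint accepts a JSON body with a `model` key and returns the Modelfile parameters as a whitespace-separated text block under `parameters`; the helper names are hypothetical, and the parsing is exercised here on a sample string rather than a live server.

```python
import json
import urllib.request


def parse_parameters(parameters_text):
    """Turn 'name   value' lines into a dict (a repeated name keeps its last value)."""
    params = {}
    for line in parameters_text.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            params[parts[0]] = parts[1].strip()
    return params


def fetch_parameters(model, base_url="http://localhost:11434"):
    """Ask a running Ollama server for a model's baked-in parameters.

    Assumes POST /api/show with {"model": ...}; not called in this sketch.
    """
    body = json.dumps({"model": model}).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/api/show",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_parameters(json.load(response).get("parameters", ""))


def context_is_overridden(params, minimum=65536):
    """True when num_ctx is present and at least the expected size."""
    return int(params.get("num_ctx", 0)) >= minimum


# Sample text in the assumed format, mirroring the Modelfile above.
sample = "num_ctx                        65536\ntemperature                    0.2\n"
sample_params = parse_parameters(sample)
print(sample_params, context_is_overridden(sample_params))
```

Against a live server, `context_is_overridden(fetch_parameters("gemma4-claude"))` would be the call to make after `ollama create`.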

# Wiring Claude Code to the Local Model

With the model variant built, the configuration layer connects Claude Code to Ollama. Two environment variables are the core of this, but three additional variables prevent the most common failure modes.

Set the base URL to http://localhost:11434, not http://localhost:11434/v1. Claude Code speaks the Anthropic Messages API and appends the /v1/messages path to the base URL itself, so a base URL that already ends in /v1 produces a doubled path and requests that fail with errors or unexpected behavior. The verification script later in this article posts to the same /v1/messages path directly.

// Global Settings — ~/.claude/settings.json

This configuration applies to every Claude Code session across all projects. It is the right choice unless you are switching between local and cloud models frequently per project.

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": "",
    "ANTHROPIC_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "gemma4-claude",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1"
  }
}

Why each variable matters:

ANTHROPIC_BASE_URL redirects all Claude Code API calls from Anthropic's servers to your local Ollama instance.

ANTHROPIC_AUTH_TOKEN must be set to any non-empty string; Ollama ignores the value but Claude Code requires the header to be present.

ANTHROPIC_API_KEY: "" explicitly empties the key so Claude Code cannot fall back to a real Anthropic API key if one happens to be set in your shell environment. Without this, a misconfigured ANTHROPIC_BASE_URL might silently fail over to the paid API.

ANTHROPIC_MODEL is the primary model name Claude Code sends in requests. Set this to your custom Modelfile variant, gemma4-claude, not gemma4:26b. The raw model tag does not carry the context window override.

ANTHROPIC_DEFAULT_SONNET_MODEL, ANTHROPIC_DEFAULT_HAIKU_MODEL, and ANTHROPIC_DEFAULT_OPUS_MODEL: Claude Code internally routes different task types to different model tiers. Setting all three to the same local model ensures every request lands at your Ollama instance regardless of which tier Claude Code internally selects.

CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: "1" strips the Anthropic-specific beta headers that Claude Code adds to requests. Local inference servers do not recognize these headers and reject requests that include them. Setting this variable prevents that error without affecting any core Claude Code functionality.
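The rules above are easy to get subtly wrong, so a small lint pass over the `env` block can catch mistakes before a session starts. This is a sketch of one possible check, not part of Claude Code or Ollama; `lint_env` and its messages are hypothetical, and the rules simply restate the variable descriptions above.

```python
MODEL_KEYS = (
    "ANTHROPIC_MODEL",
    "ANTHROPIC_DEFAULT_SONNET_MODEL",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL",
    "ANTHROPIC_DEFAULT_OPUS_MODEL",
)
REQUIRED_KEYS = (
    "ANTHROPIC_BASE_URL",
    "ANTHROPIC_AUTH_TOKEN",
    "ANTHROPIC_API_KEY",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS",
) + MODEL_KEYS


def lint_env(env):
    """Return a list of human-readable problems found in a settings 'env' dict."""
    problems = [f"missing {key}" for key in REQUIRED_KEYS if key not in env]

    # The base URL must be the server root; the client adds the API path.
    if env.get("ANTHROPIC_BASE_URL", "").rstrip("/").endswith("/v1"):
        problems.append("ANTHROPIC_BASE_URL should not end in /v1")

    # Any non-empty token works, but it has to be present.
    if "ANTHROPIC_AUTH_TOKEN" in env and not env["ANTHROPIC_AUTH_TOKEN"]:
        problems.append("ANTHROPIC_AUTH_TOKEN must be a non-empty string")

    # An empty key blocks fallback to a real, billable key.
    if env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY should be an empty string")

    # Every tier should resolve to the same local model.
    if len({env.get(key) for key in MODEL_KEYS}) > 1:
        problems.append("model tier variables point at different models")

    return problems


good_env = {key: "gemma4-claude" for key in MODEL_KEYS}
good_env.update({
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": "",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1",
})
bad_env = dict(good_env, ANTHROPIC_BASE_URL="http://localhost:11434/v1",
               ANTHROPIC_MODEL="gemma4:26b")

print(lint_env(good_env))
print(lint_env(bad_env))
```

To use it on a real file, load `~/.claude/settings.json` with `json.load` and pass its `env` value to `lint_env`.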

// Per-Project Configuration: .claude/settings.json

For projects where you want local inference isolated from your global setup (private repositories, sensitive codebases, or projects with specific model requirements), use a project-level settings file instead:

# In your project root
mkdir -p .claude

cat > .claude/settings.json << 'EOF'
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": "",
    "ANTHROPIC_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "gemma4-claude",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1"
  }
}
EOF

Claude Code reads the project-level .claude/settings.json when it exists, overriding global settings for that project. Add .claude/settings.json to your .gitignore if the settings contain anything environment-specific, or commit it if you want the entire team running local inference on that project.
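The precedence described above can be pictured as a dictionary merge in which project values win. This is only an illustration of that rule under the assumption that `env` entries are merged key by key; `effective_env` is a hypothetical helper, not Claude Code's actual loader.

```python
def effective_env(global_settings, project_settings):
    """Merge two settings dicts' 'env' blocks, with project values taking precedence."""
    merged = dict(global_settings.get("env", {}))
    merged.update(project_settings.get("env", {}))
    return merged


# Assumed example: a global config pointing at the cloud default,
# and a project config redirecting to the local Ollama variant.
global_settings = {"env": {"ANTHROPIC_MODEL": "some-cloud-model"}}
project_settings = {"env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_MODEL": "gemma4-claude",
}}

resolved = effective_env(global_settings, project_settings)
print(resolved)
```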

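// Verifying the Setup

Before running Claude Code against a real codebase, verify three things: Ollama is serving correctly, the model responds to API calls in the Anthropic Messages format, and tool calling specifically works. The third point is non-negotiable: tool calling is how Claude Code reads files, writes patches, and executes commands. A model that cannot format tool calls correctly will loop and fail on basic agentic tasks.

Prerequisites:

pip install httpx   # HTTP client used by the verification script

The full verification script: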

#!/usr/bin/env python3
"""
verify_local_setup.py

Verifies the full Claude Code + Ollama + Gemma 4 stack before use.
Runs four checks in sequence:
  1. Ollama health
  2. Model availability
  3. Basic Anthropic Messages API call
  4. Tool calling round-trip

Prerequisites:
    pip install httpx

How to run:
    python verify_local_setup.py

Expected output on a working setup:
    [PASS] Ollama is running at http://localhost:11434
    [PASS] Model 'gemma4-claude' is available
    [PASS] Anthropic Messages API call succeeded
    [PASS] Tool calling: model produced a valid tool_use block
    All 4 checks passed.
    Claude Code + Ollama + Gemma 4 is ready.
"""

import httpx
import json
import sys

# ── Configuration ─────────────────────────────────────────────────────────────
OLLAMA_BASE_URL = "http://localhost:11434"
MODEL_NAME = "gemma4-claude"  # Must match your Modelfile variant name
TIMEOUT = 120.0               # Seconds -- the first call can be slow while the model loads

def check_ollama_health() -> bool:
    """
    Check 1: Verify Ollama is running and responsive.
    Hits the root endpoint, which returns 'Ollama is running' when healthy.
    """
    print("\nCheck 1: Ollama health")
    try:
        response = httpx.get(OLLAMA_BASE_URL, timeout=5.0)
        if "Ollama is running" in response.text:
            print(f"  [PASS] Ollama is running at {OLLAMA_BASE_URL}")
            return True
        else:
            print(f"  [FAIL] Unexpected response: {response.text[:100]}")
            return False
    except httpx.ConnectError:
        print(f"  [FAIL] Cannot connect to {OLLAMA_BASE_URL}")
        print("         Is Ollama running? Try: ollama serve")
        return False
    except httpx.HTTPError as e:
        # Timeouts and other transport errors should fail the check, not crash it
        print(f"  [FAIL] Request to {OLLAMA_BASE_URL} failed: {e}")
        return False

def check_model_available() -> bool:
    """
    Check 2: Verify the specific model variant is available in Ollama.
    Uses the /api/tags endpoint, which lists all pulled models.
    """
    print("\nCheck 2: Model availability")
    try:
        response = httpx.get(f"{OLLAMA_BASE_URL}/api/tags", timeout=5.0)
        data = response.json()
        models = [m["name"] for m in data.get("models", [])]

        # Normalize: Ollama may append ":latest" when no tag is specified
        normalized = [m.split(":")[0] for m in models]

        if MODEL_NAME in models or MODEL_NAME in normalized:
            print(f"  [PASS] Model '{MODEL_NAME}' is available")
            return True
        else:
            print(f"  [FAIL] Model '{MODEL_NAME}' not found")
            print(f"         Available models: {', '.join(models) or 'none'}")
            print(f"         Run: ollama create {MODEL_NAME} -f ~/.ollama/Modelfiles/gemma4-claude")
            return False
    except Exception as e:
        print(f"  [FAIL] Error checking model list: {e}")
        return False

def check_messages_api() -> bool:
    """
    Check 3: Send a basic Anthropic Messages API call to the local endpoint.
    Verifies the request format, model routing, and basic generation work.

    Posts to /v1/messages, the path Claude Code requests when
    ANTHROPIC_BASE_URL is set to the root URL http://localhost:11434.
    """
    print("\nCheck 3: Anthropic Messages API call")

    payload = {
        "model": MODEL_NAME,
        "max_tokens": 100,
        "messages": [
            {
                "role": "user",
                "content": "Reply with exactly: VERIFICATION_OK"
            }
        ]
    }

    headers = {
        "Content-Type": "application/json",
        "x-api-key": "ollama",             # Required by the API spec; the value is ignored locally
        "anthropic-version": "2023-06-01"  # Required version header
    }

    try:
        response = httpx.post(
            f"{OLLAMA_BASE_URL}/v1/messages",
            json=payload,
            headers=headers,
            timeout=TIMEOUT
        )

        if response.status_code != 200:
            print(f"  [FAIL] HTTP {response.status_code}: {response.text[:200]}")
            return False

        data = response.json()

        # Anthropic Messages API response structure:
        # { "content": [{"type": "text", "text": "..."}], "stop_reason": "..." }
        content_blocks = data.get("content", [])
        text_blocks = [b for b in content_blocks if b.get("type") == "text"]

        if not text_blocks:
            print(f"  [FAIL] No text content in response: {json.dumps(data, indent=2)}")
            return False

        response_text = text_blocks[0].get("text", "")
        print("  [PASS] Anthropic Messages API call succeeded")
        print(f"         Model response: {response_text[:80]}")
        return True

    except Exception as e:
        print(f"  [FAIL] Request failed: {e}")
        return False

def check_tool_calling() -> bool:
    """
    Check 4: Verify tool calling works end to end.

    This is the most important check for agentic use of Claude Code.
    Claude Code depends on the model producing correct tool_use blocks
    for all file operations, shell commands, and code execution.

    Sends a simple tool definition and a prompt that should trigger it,
    then verifies the model returns a tool_use block rather than text
    that merely describes a tool call.
    """
    print("\nCheck 4: Tool calling verification")

    # Minimal tool definition using the Anthropic tool schema
    tools = [
        {
            "name": "read_file",
            "description": "Read the contents of a file at the given path.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "Absolute or relative path of the file to read"
                    }
                },
                "required": ["path"]
            }
        }
    ]

    payload = {
        "model": MODEL_NAME,
        "max_tokens": 256,
        "tools": tools,
        # Force the model to call a tool instead of answering in text.
        # tool_choice: {"type": "any"} requires the use of some tool.
        # Remove this if you are testing whether the model self-selects tools.
        "tool_choice": {"type": "any"},
        "messages": [
            {
                "role": "user",
                "content": "Please read the file at /tmp/test.py and show me its contents."
            }
        ]
    }

    headers = {
        "Content-Type": "application/json",
        "x-api-key": "ollama",
        "anthropic-version": "2023-06-01"
    }

    try:
        response = httpx.post(
            f"{OLLAMA_BASE_URL}/v1/messages",
            json=payload,
            headers=headers,
            timeout=TIMEOUT
        )

        if response.status_code != 200:
            print(f"  [FAIL] HTTP {response.status_code}: {response.text[:200]}")
            return False

        data = response.json()
        content_blocks = data.get("content", [])
        tool_blocks = [b for b in content_blocks if b.get("type") == "tool_use"]

        if not tool_blocks:
            print("  [FAIL] Model did not produce a tool_use block")
            print("         This means tool calling is not working correctly.")
            print("         Agentic Claude Code sessions will fail on file operations.")
            print(f"         Full response: {json.dumps(data, indent=2)}")
            return False

        tool_call = tool_blocks[0]
        tool_name = tool_call.get("name", "")
        tool_input = tool_call.get("input", {})

        print("  [PASS] Tool calling: model produced a valid tool_use block")
        print(f"         Tool called: {tool_name}")
        print(f"         Parameters: {json.dumps(tool_input)}")

        # Sanity check: did it call the right tool with the right parameter?
        if tool_name == "read_file" and "path" in tool_input:
            print("         Tool name and parameter are correct.")
        else:
            print("         WARNING: Unexpected tool name or missing 'path' parameter.")
            print("         The model called a tool but not the expected one.")

        return True

    except Exception as e:
        print(f"  [FAIL] Request failed: {e}")
        return False

def main():
    print("=" * 60)
    print("Claude Code + Ollama + Gemma 4 Setup Verification")
    print("=" * 60)

    checks = [
        check_ollama_health,
        check_model_available,
        check_messages_api,
        check_tool_calling,
    ]

    results = [check() for check in checks]

    print("\n" + "=" * 60)
    passed = sum(results)
    total = len(results)

    if all(results):
        print(f"All {total} checks passed.")
        print("Claude Code + Ollama + Gemma 4 is ready.")
        print("\nLaunch with: claude")
        sys.exit(0)
    else:
        failed_checks = [i + 1 for i, r in enumerate(results) if not r]
        print(f"{passed}/{total} checks passed. Failed: {failed_checks}")
        print("Resolve the failures above before using Claude Code locally.")
        sys.exit(1)


if __name__ == "__main__":
    main()

How to run:

pip install httpx

python verify_local_setup.py

# Agentic Task Walkthrough

With verification passing, here is what a real agentic session looks like. The task: take an existing Python module with no tests, analyze it, write a test suite, run the tests, and fix any failures.

# Navigate to a project directory
cd ~/projects/my-service

# Confirm Claude Code sees the local configuration
claude --version
# Verify it does not prompt for an Anthropic API key -- if it does,
# the settings.json is not being read correctly

# Start an agentic session
claude

Inside Claude Code, give the agent a concrete task:

> Analyze the UserService class in src/user_service.py.

> Write a pytest test suite covering all public methods.

> Run the tests and fix any failures.

> The goal is a clean pytest run with no skips.

What the Claude Code tool call trace looks like during this session:

→ read_file("src/user_service.py")
  Reading 247 lines...

→ list_files("src/")
  Found: user_service.py, models.py, db.py, exceptions.py

→ read_file("src/models.py")
  Reading 89 lines...

→ write_file("tests/test_user_service.py", [test content])
  Written: 312 lines

→ bash("python -m pytest tests/test_user_service.py -v 2>&1")
  Running 14 tests...
  FAILED tests/test_user_service.py::test_update_email_invalid
  AssertionError: Expected ValidationError, got None

→ read_file("src/user_service.py") [targeted re-read of the update_email method]
  ...

→ write_file("tests/test_user_service.py", [corrected test])
  Patched test_update_email_assertion

→ bash("python -m pytest tests/test_user_service.py -v 2>&1")
  14 passed in 1.23s

Gemma 4 handles this pattern reliably: reading files before editing, running tests after changes, and diagnosing failures from error output rather than retrying blindly. Complex architectural decisions spanning many files are where cloud models still have an edge. For the task above (analysis, test generation, and targeted fixes), the local setup is fully capable.

What to watch for: if the agent produces "Invalid tool parameters" errors and retries repeatedly with the same parameters, the temperature is probably too high or the model is not using the gemma4-claude Modelfile variant. Both the temperature setting and the context window override are baked into that variant. The raw gemma4:26b tag has neither.

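// What Breaks and How to Fix It

Tool parameter format errors

Symptom: Claude Code repeatedly reports "Invalid tool parameters". The agent apologizes, retries with identical or nearly identical parameters, and falls into a loop.

Cause: This is documented in Ollama's GitHub issues. The tool call JSON the model generates does not match the schema Claude Code expects. Most commonly the field names are wrong, required fields are missing, or a nested object appears where a scalar value is expected.

Fix: Confirm you are running gemma4-claude (the Modelfile variant) rather than gemma4:26b directly. The temperature of 0.2 and the system prompt in the Modelfile reduce this problem substantially. If it persists, lower the temperature in the Modelfile to 0.1 and rebuild.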
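The three failure shapes named above (wrong field name, missing required field, nested object in place of a scalar) can be reproduced with a small checker, which is handy when inspecting a captured tool call by hand. This is a simplified sketch, not Claude Code's actual validator: `tool_input_errors` is a hypothetical helper that looks only at `required` and top-level `type`.

```python
# Mapping from JSON Schema type names to Python types (top level only).
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def tool_input_errors(tool_input, input_schema):
    """Compare a tool_use 'input' dict against a tool's input_schema."""
    errors = []
    properties = input_schema.get("properties", {})

    for name in input_schema.get("required", []):
        if name not in tool_input:
            errors.append(f"missing required field: {name}")

    for name, value in tool_input.items():
        if name not in properties:
            errors.append(f"unknown field: {name}")
            continue
        type_name = properties[name].get("type")
        expected = JSON_TYPES.get(type_name)
        if expected is not None and not isinstance(value, expected):
            errors.append(f"wrong type for {name}: expected {type_name}")

    return errors


# Same schema as the read_file tool in the verification script.
schema = {
    "type": "object",
    "properties": {"path": {"type": "string"}},
    "required": ["path"],
}

print(tool_input_errors({"path": "/tmp/test.py"}, schema))
print(tool_input_errors({"file": "/tmp/test.py"}, schema))
print(tool_input_errors({"path": {"value": "/tmp/test.py"}}, schema))
```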

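Context window swapping to disk

Symptom: Generation slows dramatically after a few turns. The output of ollama ps shows GPU utilization dropping. The OS is paging the KV cache to disk.

Fix:

# Option 1: Reduce the context window in the Modelfile
# Edit ~/.ollama/Modelfiles/gemma4-claude
# Change: PARAMETER num_ctx 65536
# To:     PARAMETER num_ctx 32768
# Then rebuild: ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude

# Option 2: Enable KV cache quantization to reduce memory usage
# Requires Flash Attention (OLLAMA_FLASH_ATTENTION=1), and both variables
# must be set in the environment of the Ollama server process
export OLLAMA_KV_CACHE_TYPE=q8_0
# This quantizes the KV cache itself, cutting memory use at a small quality cost
# Restart Ollama after setting it: pkill ollama && ollama serve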
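Both options shrink the same quantity, and rough arithmetic shows why they help. The sketch below uses the textbook estimate for a plain attention KV cache: 2 (keys and values) × layers × KV heads × head dimension × context length × bytes per value. The layer count, head count, and head dimension are placeholder assumptions, not Gemma 4's published architecture, and the formula ignores optimizations such as sliding-window layers, so treat the result as an order-of-magnitude guide only.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_value=2.0):
    """Upper-bound KV cache size for plain attention.

    bytes_per_value: 2.0 for f16; roughly 1.0 for an 8-bit cache such as q8_0.
    """
    return int(2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value)


GIB = 2 ** 30

# Hypothetical model dimensions, chosen only to make the arithmetic concrete.
dims = dict(n_layers=48, n_kv_heads=8, head_dim=128)

full_f16 = kv_cache_bytes(n_ctx=65536, **dims)
half_ctx = kv_cache_bytes(n_ctx=32768, **dims)
full_q8 = kv_cache_bytes(n_ctx=65536, bytes_per_value=1.0, **dims)

print(f"64K context, f16 cache:   {full_f16 / GIB:.1f} GiB")
print(f"32K context, f16 cache:   {half_ctx / GIB:.1f} GiB")
print(f"64K context, 8-bit cache: {full_q8 / GIB:.1f} GiB")
```

Halving `num_ctx` and switching to an 8-bit cache each cut the estimate roughly in half. Since the Modelfile comments note that Ollama pre-allocates the full KV cache, that saving applies from the first request rather than only once the context fills.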

Model unloading between agent turns

Symptom: A significant cold-start delay at the beginning of each Claude Code message. Ollama unloads the model after its inactivity timeout and reloads it on each request.

Fix:

# Keep the model loaded indefinitely during work sessions
export OLLAMA_KEEP_ALIVE=-1

# Or set it in your shell profile for a persistent effect
echo 'export OLLAMA_KEEP_ALIVE=-1' >> ~/.zshrc

# Alternatively, pin the model through the Ollama API
curl http://localhost:11434/api/generate \
  -d '{"model": "gemma4-claude", "keep_alive": -1}'
# This pins the model until you explicitly unload it or restart Ollama

Beta header rejection errors

Symptom: An "Unexpected value(s)" error for the anthropic-beta header when Claude Code starts or during a session.

Fix:

Confirm that CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: "1" is set in settings.json. If you set it with a shell export instead of through settings.json, verify that it is exported in the same shell session where claude runs:

echo $CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS
# Must print: 1

# Wrapping Up

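The stack described in this article is not a proof of concept. It is a real production configuration that engineers have been running daily since Ollama added Anthropic Messages API support in January 2026. The Modelfile is not optional; it is the difference between tooling that works on multi-file tasks and tooling that silently keeps producing incomplete output. The verification script catches configuration problems before they surface as confusing agent failures mid-session.

The setup built in this article is a private coding agent with zero per-token cost that handles the majority of daily engineering tasks (code analysis, test generation, targeted refactoring, and debugging) at practical generation speeds on modern hardware.

This setup is not a full replacement for cloud inference on complex architectural reasoning that requires deep repository understanding across a large codebase, or on SWE-bench-class tasks.

Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.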

Original text

Local Agentic Programming on the Cheap: Claude Code + Ollama + Gemma4

# Introduction

Visualize this: a multi-agent workflow that reads files, writes patches, runs tests, and iterates across four services, making 400 API calls in a single afternoon. The notification arrives. You have crossed the soft limit again. Every token costs money, every prompt sends your proprietary code to a third-party server, and the rate limits interrupt long-running sessions — the only solution is paying more.

Gemma 4 26B MoE activates only 3.8 billion of its 26 billion parameters per forward pass. It scores 77.1% on LiveCodeBench v6 and roughly 79% on τ2-bench, the benchmark that specifically tests what happens when a model has to call tools, execute steps, and handle errors across a multi-step workflow; the 31B Dense variant reaches 86.4%. The previous generation, Gemma 3 27B, scored 6.6% on that same benchmark. That is not a small upgrade. It is the difference between a model that cannot reliably call tools and one that can run a Claude Code agentic loop without constantly malforming its function call parameters.

This article builds the full stack: Ollama serving Gemma 4 locally, the Modelfile that prevents context window failures in agentic sessions, the settings.json that wires Claude Code to the local endpoint, a verification script that confirms everything is working before you use it on real code, and an honest rundown of what breaks and how to fix it. The audience is engineers who already understand what large language models (LLMs) are and what agentic loops cost. No hand-holding on the basics.

# Why Gemma 4?

Released on April 2, 2026 under Apache 2.0, Gemma 4 is Google DeepMind's most capable open-weight model family to date. Four variants shipped: E2B (2B effective), E4B (4B effective), 26B MoE, and 31B Dense. The 26B MoE uses 128 small experts and activates only 8 per token plus one shared expert, delivering near-31B quality at dramatically lower compute cost.

Previous Gemma versions used a custom Google license with commercial use restrictions ambiguous enough that enterprise legal teams routinely flagged it as a blocker. Gemma 4 is Apache 2.0, a first for the Gemma family. If your team wants to embed this in internal tooling, ship products on top of it, or run it in production pipelines without legal review overhead, that change matters operationally.

// The Numbers That Matter for Coding Agents

| Benchmark | Gemma 3 27B | Gemma 4 26B MoE | Gemma 4 31B Dense |
|---|---|---|---|
| τ2-bench (agentic tool use) | 6.6% | ~79% | 86.4% |
| LiveCodeBench v6 | 29.1% | 77.1% | 80.0% |
| GPQA Diamond | 42.4% | 82.3% | 84.3% |
| AIME 2026 (math) | 20.8% | 88.3% | 89.2% |
| Arena AI ELO | 1365 | 1441 | 1452 |

// Hardware Requirements

Before pulling an 18 GB model, know what you are actually working with. The Gemma 4 family was designed to span edge devices through workstations, and the four variants reflect that range.

| Variant | Ollama tag | Active params | VRAM at Q4 | Context window |
|---|---|---|---|---|
| Edge 4B | gemma4:e4b | (not listed) | ~6 GB | 128K |
| 26B MoE | gemma4:26b | 3.8B | ~16–18 GB | 256K |
| 31B Dense | gemma4:31b | 31B | ~24–32 GB | 256K |

// Installing Ollama, Gemma 4, and Claude Code

Step 1: Install Ollama

code

# macOS and Linux -- one-line install
curl -fsSL https://ollama.com/install.sh | sh

# Verify version -- must be 0.14.0+ for Anthropic Messages API support
# The Anthropic-compatible endpoint was added in January 2026
ollama --version
# Expected: ollama version is 0.22.x or higher (as of May 2026)

# Windows: download the native installer from https://ollama.com
# WSL2 is recommended if you want GPU passthrough on Windows

After installation, Ollama starts as a background service on port 11434. Verify it is up:

code

curl http://localhost:11434
# Expected response: Ollama is running

Step 2: Pull Gemma 4

code

# The 26B MoE -- recommended for this setup (~18 GB download)
ollama pull gemma4:26b

# ollama pull prints its own progress bar while the download runs;
# ollama ps lists models loaded in memory, not downloads in progress

# Optional: also pull the 31B for comparison on capable hardware
ollama pull gemma4:31b

# Confirm the pull completed
ollama list
# Should show gemma4:26b with size and modification date

Step 3: Install Claude Code

code

# Prerequisites: Node.js 18 or later
node --version   # Confirm you are on 18+

# Install Claude Code CLI globally
npm install -g @anthropic-ai/claude-code

# Verify the install
claude --version

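With Ollama running and Gemma 4 pulled, the natural next instinct is to export the environment variables and launch Claude Code immediately. Hold off: Ollama's default context window has to be overridden first, which is what the Modelfile in the next section does.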

# The Modelfile

Ollama's default context window for Gemma 4 is 4K tokens. Gemma 4's actual context window is 128K–256K. That 4K default is not a suggestion; it is what Ollama will use unless you override it. In a Claude Code agentic session that reads source files, holds conversation history, and maintains tool call results across multiple turns, 4K tokens is exhausted in seconds.

Without the context override, Claude Code loses track of file contents mid-edit, forgets earlier instructions, and produces fragmented changes. Specifically: when an agent tries to refactor a 200-line service class, it cleanly forgets the second half exists. The agent does not raise an error. It just silently works on an incomplete view of the file and produces partially correct output that breaks downstream.

The fix is a Modelfile that bakes the correct context size and other inference parameters into a named model variant. Create this file:

code

# ~/.ollama/Modelfiles/gemma4-claude
# Gemma 4 26B MoE variant tuned for Claude Code agentic sessions.
# Bakes context window, temperature, and system prompt into the model
# so every Claude Code session starts with the correct configuration.
#
# Build with:
#   mkdir -p ~/.ollama/Modelfiles
#   ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude

FROM gemma4:26b

# Context window -- 65536 tokens (64K) is the tested-safe floor for real
# codebases without triggering swap on 16-18 GB VRAM systems.
# Increase to 131072 (128K) if you have headroom on 24 GB+ systems.
# Do not go above 131072 unless you have profiled your memory usage
# under load -- Ollama pre-allocates the full KV cache upfront.
PARAMETER num_ctx 65536

# Temperature -- 0.2 is deliberately low for agentic coding.
# Higher temperature introduces variability in tool call parameter
# formatting that causes Claude Code's tool validator to reject calls.
# For creative tasks, you would set this higher. For agentic loops: low.
PARAMETER temperature 0.2

# top_p -- nucleus sampling threshold. 0.9 keeps generation focused
# while avoiding the repetition loops that top_p=1.0 can produce on
# long agentic sessions.
PARAMETER top_p 0.9

# repeat_penalty -- penalizes the model for repeating tokens.
# 1.15 helps prevent tool call loops where Gemma 4 retries the same
# failed tool call with nearly identical parameters indefinitely.
PARAMETER repeat_penalty 1.15

# num_predict -- maximum tokens per response. 4096 is sufficient for
# most code patches. Increase to 8192 if you regularly generate
# large files in a single generation.
PARAMETER num_predict 4096

# System prompt -- reinforces coding agent behavior and explicit
# tool use discipline. Gemma 4 benefits from being reminded to
# commit to tool calls rather than describing what it would do.
SYSTEM """You are a senior software engineer operating as a coding agent.

When working with code:
- Read files before editing them. Never assume file contents.
- Make one focused change at a time and verify it before proceeding.
- When a tool call fails, examine the error carefully before retrying.
  Do not retry with identical parameters. Diagnose first.
- Prefer surgical edits over full file rewrites.
- Run tests after each meaningful change, not after a batch of changes.
- If you are uncertain about the codebase structure, read more files
  rather than guessing.

Be precise and methodical. Avoid explaining what you are about to do
when you could simply do it."""

Build the variant:

code

# Create the Modelfiles directory if it does not exist
mkdir -p ~/.ollama/Modelfiles

# Save the Modelfile content from above to this path, then build:
ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude

# Verify the variant was created
ollama list
# Should show gemma4-claude alongside gemma4:26b

# Quick smoke test -- verify it loads and responds
ollama run gemma4-claude "What is the time complexity of binary search and why?"
# Expect a clear, concise technical response within a few seconds

# Wiring Claude Code to the Local Model

Set the base URL to http://localhost:11434, not http://localhost:11434/v1. Claude Code speaks the Anthropic Messages API and appends the /v1/messages path to the base URL itself, so a base URL that already ends in /v1 produces a doubled path and requests that fail with errors or unexpected behavior.

// Global Settings — ~/.claude/settings.json

This configuration applies to every Claude Code session across all projects. It is the right choice unless you are switching between local and cloud models frequently per project.

code

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",

    "ANTHROPIC_AUTH_TOKEN": "ollama",

    "ANTHROPIC_API_KEY": "",

    "ANTHROPIC_MODEL": "gemma4-claude",

    "ANTHROPIC_DEFAULT_SONNET_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "gemma4-claude",

    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1"
  }
}

Why each variable matters:

ANTHROPIC_BASE_URL redirects all Claude Code API calls from Anthropic's servers to your local Ollama instance.

ANTHROPIC_AUTH_TOKEN must be set to any non-empty string; Ollama ignores the value but Claude Code requires the header to be present.

ANTHROPIC_API_KEY: "" explicitly empties the key so Claude Code cannot fall back to a real Anthropic API key if one happens to be set in your shell environment. Without this, a misconfigured ANTHROPIC_BASE_URL might silently fail over to the paid API.

ANTHROPIC_MODEL is the primary model name Claude Code sends in requests. Set this to your custom Modelfile variant, gemma4-claude not gemma4:26b. The raw model tag does not carry the context window override.

ANTHROPIC_DEFAULT_SONNET_MODEL, ANTHROPIC_DEFAULT_HAIKU_MODEL, and ANTHROPIC_DEFAULT_OPUS_MODEL: Claude Code internally routes different task types to different model tiers. Setting all three to the same local model ensures every request lands at your Ollama instance regardless of which tier Claude Code internally selects.

CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: "1" strips the Anthropic-specific beta headers that Claude Code adds to requests. Local inference servers do not recognize these headers and reject requests that include them. Setting this variable prevents that error without affecting any core Claude Code functionality.
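The rules in the list above are easy to check mechanically before launching a session. The following is a minimal sketch of such a check; `lint_env` is a hypothetical helper written for this article's configuration, and the rules it enforces are simply the ones described above, not an official Claude Code validation routine.

```python
import json

MODEL_KEYS = [
    "ANTHROPIC_MODEL",
    "ANTHROPIC_DEFAULT_SONNET_MODEL",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL",
    "ANTHROPIC_DEFAULT_OPUS_MODEL",
]


def lint_env(env: dict) -> list:
    """Return a list of human-readable problems found in a settings 'env' block."""
    problems = []

    base = env.get("ANTHROPIC_BASE_URL", "")
    if not base:
        problems.append("ANTHROPIC_BASE_URL is missing")
    elif base.rstrip("/").endswith("/v1"):
        problems.append("ANTHROPIC_BASE_URL should be the root URL, without /v1")

    if not env.get("ANTHROPIC_AUTH_TOKEN"):
        problems.append("ANTHROPIC_AUTH_TOKEN must be a non-empty string")

    # The key must be present AND empty, so a key from the shell cannot be used
    if env.get("ANTHROPIC_API_KEY") != "":
        problems.append('ANTHROPIC_API_KEY should be set explicitly to ""')

    models = {env.get(key) for key in MODEL_KEYS}
    if None in models:
        problems.append("one or more model variables are missing")
    elif len(models) > 1:
        problems.append(f"model variables disagree: {sorted(models)}")

    if env.get("CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS") != "1":
        problems.append('CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS should be "1"')

    return problems


settings = json.loads("""
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": "",
    "ANTHROPIC_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "gemma4-claude",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1"
  }
}
""")

print(lint_env(settings["env"]))  # [] when every rule is satisfied
```

To lint a real file, replace the inline JSON with the parsed contents of your own settings.json.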

// Per-Project Configuration — .claude/settings.json

For projects where you want local inference isolated from your global setup — private repositories, sensitive codebases, or projects with specific model requirements — use a project-level settings file instead:

code

# In your project root
mkdir -p .claude

cat > .claude/settings.json << 'EOF'
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": "",
    "ANTHROPIC_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "gemma4-claude",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1"
  }
}
EOF

Claude Code reads the project-level .claude/settings.json when it exists, overriding global settings for that project. Add .claude/settings.json to your .gitignore if the settings contain anything environment-specific, or commit it if you want the entire team running local inference on that project.
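The override behavior can be pictured as a key-by-key merge in which project values win. The sketch below is only an illustration of that precedence under the assumption that "env" entries are merged per key; `merge_env` and the placeholder global values are hypothetical, and Claude Code's real settings resolution may be more involved.

```python
def merge_env(global_settings: dict, project_settings: dict) -> dict:
    """Return the effective env block: project values win on conflicting keys."""
    merged = dict(global_settings.get("env", {}))
    merged.update(project_settings.get("env", {}))
    return merged


# Hypothetical global file that still points at a cloud model
global_settings = {"env": {"ANTHROPIC_MODEL": "cloud-model-placeholder"}}

# Project-level file from this section (abridged)
project_settings = {
    "env": {
        "ANTHROPIC_BASE_URL": "http://localhost:11434",
        "ANTHROPIC_MODEL": "gemma4-claude",
    }
}

effective = merge_env(global_settings, project_settings)
print(effective["ANTHROPIC_MODEL"])     # gemma4-claude
print(effective["ANTHROPIC_BASE_URL"])  # http://localhost:11434
```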

// Verifying the Setup

Before running Claude Code against a real codebase, verify three things: Ollama is serving correctly, the model responds to API calls in the Anthropic Messages format, and tool calling specifically works. The third point is non-negotiable: tool calling is how Claude Code reads files, writes patches, and executes commands. A model that cannot format tool calls correctly will loop and fail on basic agentic tasks.

Prerequisites:

code

pip install httpx   # HTTP client used by the verification script (called synchronously)

The full verification script:

code


#!/usr/bin/env python3
"""
verify_local_setup.py

Verifies the full Claude Code + Ollama + Gemma 4 stack before use.
Runs four checks in sequence:
  1. Ollama health
  2. Model availability
  3. Basic Anthropic Messages API call
  4. Tool calling round-trip

Prerequisites:
  pip install httpx

How to run:
  python verify_local_setup.py

Expected output on a working setup (abridged):
  [PASS] Ollama is running on http://localhost:11434
  [PASS] Model 'gemma4-claude' is available
  [PASS] Anthropic Messages API call successful
  [PASS] Tool calling: model produced a valid tool_use block
  All 4 checks passed.
  Claude Code + Ollama + Gemma 4 is ready.
"""

import httpx
import json
import sys

# ── Configuration ─────────────────────────────────────────────────────────────
OLLAMA_BASE_URL = "http://localhost:11434"
MODEL_NAME      = "gemma4-claude"   # Must match your Modelfile variant name
TIMEOUT         = 120.0             # Seconds -- generation can be slow on first call

def check_ollama_health() -> bool:
    """
    Check 1: Verify Ollama is running and responding.
    Hits the root endpoint which returns 'Ollama is running' when healthy.
    """
    print("\nCheck 1: Ollama health")
    try:
        response = httpx.get(OLLAMA_BASE_URL, timeout=5.0)
        if "Ollama is running" in response.text:
            print(f"  [PASS] Ollama is running on {OLLAMA_BASE_URL}")
            return True
        else:
            print(f"  [FAIL] Unexpected response: {response.text[:100]}")
            return False
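    except httpx.ConnectError:
        print(f"  [FAIL] Cannot connect to {OLLAMA_BASE_URL}")
        print("         Is Ollama running? Try: ollama serve")
        return False
    except httpx.HTTPError as e:
        # Timeouts and other transport errors would otherwise crash the script
        print(f"  [FAIL] Request to {OLLAMA_BASE_URL} failed: {e}")
        return False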

def check_model_available() -> bool:
    """
    Check 2: Verify the specific model variant is available in Ollama.
    Uses the /api/tags endpoint which lists all pulled models.
    """
    print("\nCheck 2: Model availability")
    try:
        response = httpx.get(f"{OLLAMA_BASE_URL}/api/tags", timeout=5.0)
        data     = response.json()
        models   = [m["name"] for m in data.get("models", [])]

        # Ollama lists an untagged model as "<name>:latest", so accept both forms.
        # (Stripping the tag from every listed model would let "gemma4:26b" match "gemma4".)
        candidates = {MODEL_NAME, f"{MODEL_NAME}:latest"}

        if candidates & set(models):
            print(f"  [PASS] Model '{MODEL_NAME}' is available")
            return True
        else:
            print(f"  [FAIL] Model '{MODEL_NAME}' not found")
            print(f"         Available models: {', '.join(models) or 'none'}")
            print(f"         Run: ollama create {MODEL_NAME} -f ~/.ollama/Modelfiles/gemma4-claude")
            return False
    except Exception as e:
        print(f"  [FAIL] Error checking model list: {e}")
        return False

def check_messages_api() -> bool:
    """
    Check 3: Send a basic Anthropic Messages API call to the local endpoint.
    Verifies the request format, model routing, and basic generation work.
    Posts to /v1/messages with the same request schema that Claude Code uses.
    Note: ANTHROPIC_BASE_URL stays at http://localhost:11434 (root); the client
    appends /v1/messages itself, which is why the path is added explicitly here.
    """
    print("\nCheck 3: Anthropic Messages API call")

    payload = {
        "model": MODEL_NAME,
        "max_tokens": 100,
        "messages": [
            {
                "role": "user",
                "content": "Reply with exactly: VERIFICATION_OK"
            }
        ]
    }

    headers = {
        "Content-Type":      "application/json",
        "x-api-key":         "ollama",            # Required by the API spec; value ignored locally
        "anthropic-version": "2023-06-01"         # Required version header
    }

    try:
        response = httpx.post(
            f"{OLLAMA_BASE_URL}/v1/messages",
            json=payload,
            headers=headers,
            timeout=TIMEOUT
        )

        if response.status_code != 200:
            print(f"  [FAIL] HTTP {response.status_code}: {response.text[:200]}")
            return False

        data = response.json()

        # Anthropic Messages API response structure:
        # { "content": [{"type": "text", "text": "..."}], "stop_reason": "..." }
        content_blocks = data.get("content", [])
        text_blocks    = [b for b in content_blocks if b.get("type") == "text"]

        if not text_blocks:
            print(f"  [FAIL] No text content in response: {json.dumps(data, indent=2)}")
            return False

        response_text = text_blocks[0].get("text", "")
        print(f"  [PASS] Anthropic Messages API call successful")
        print(f"         Model response: {response_text[:80]}")
        return True

    except Exception as e:
        print(f"  [FAIL] Request failed: {e}")
        return False

def check_tool_calling() -> bool:
    """
    Check 4: Verify tool calling works end-to-end.
    This is the most important check for Claude Code agentic use.
    Claude Code relies on the model correctly producing tool_use blocks
    for every file operation, shell command, and code execution.

    Sends a simple tool definition and a prompt that should trigger it.
    Verifies the model returns a tool_use block (not just text describing the call).
    """
    print("\nCheck 4: Tool calling verification")

    # A minimal tool definition using the Anthropic function calling schema
    tools = [
        {
            "name": "read_file",
            "description": "Read the contents of a file at the given path.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "The absolute or relative file path to read"
                    }
                },
                "required": ["path"]
            }
        }
    ]

    payload = {
        "model": MODEL_NAME,
        "max_tokens": 256,
        "tools": tools,
        # Force the model to call a tool rather than respond in text.
        # tool_choice: {"type": "any"} requires any tool use.
        # Remove this if testing whether the model self-selects tools.
        "tool_choice": {"type": "any"},
        "messages": [
            {
                "role": "user",
                "content": "Read the file at /tmp/test.py and show me its contents."
            }
        ]
    }

    headers = {
        "Content-Type":      "application/json",
        "x-api-key":         "ollama",
        "anthropic-version": "2023-06-01"
    }

    try:
        response = httpx.post(
            f"{OLLAMA_BASE_URL}/v1/messages",
            json=payload,
            headers=headers,
            timeout=TIMEOUT
        )

        if response.status_code != 200:
            print(f"  [FAIL] HTTP {response.status_code}: {response.text[:200]}")
            return False

        data           = response.json()
        content_blocks = data.get("content", [])
        tool_blocks    = [b for b in content_blocks if b.get("type") == "tool_use"]

        if not tool_blocks:
            print("  [FAIL] Model did not produce a tool_use block")
            print("         This means tool calling is not working correctly.")
            print("         Agentic Claude Code sessions will fail on file operations.")
            print(f"         Full response: {json.dumps(data, indent=2)}")
            return False

        tool_call  = tool_blocks[0]
        tool_name  = tool_call.get("name", "")
        tool_input = tool_call.get("input", {})

        print(f"  [PASS] Tool calling: model produced a valid tool_use block")
        print(f"         Tool called: {tool_name}")
        print(f"         Parameters:  {json.dumps(tool_input)}")

        # Sanity check: did it call the right tool with the right parameter?
        if tool_name == "read_file" and "path" in tool_input:
            print(f"         Tool name and parameter are correct.")
        else:
            print(f"         WARNING: Unexpected tool name or missing 'path' parameter.")
            print(f"         The model called a tool but not the expected one.")

        return True

    except Exception as e:
        print(f"  [FAIL] Request failed: {e}")
        return False

def main():
    print("=" * 60)
    print("Claude Code + Ollama + Gemma 4 Setup Verification")
    print("=" * 60)

    checks = [
        check_ollama_health,
        check_model_available,
        check_messages_api,
        check_tool_calling,
    ]

    results = [check() for check in checks]

    print("\n" + "=" * 60)
    passed = sum(results)
    total  = len(results)

    if all(results):
        print(f"All {total} checks passed.")
        print("Claude Code + Ollama + Gemma 4 is ready.")
        print(f"\nLaunch with: claude")
        sys.exit(0)
    else:
        failed_checks = [i + 1 for i, r in enumerate(results) if not r]
        print(f"{passed}/{total} checks passed. Failed: {failed_checks}")
        print("Resolve the failures above before using Claude Code locally.")
        sys.exit(1)

if __name__ == "__main__":
    main()

How to run:

code

pip install httpx
python verify_local_setup.py

# Agentic Task Walkthrough

With verification passing, here is what a real agentic session looks like. The task: take an existing Python module with no tests, analyze it, write a test suite, run the tests, and fix any failures.

code

# Navigate to a project directory
cd ~/projects/my-service

# Confirm Claude Code sees the local configuration
claude --version
# Verify it does not prompt for an Anthropic API key -- if it does,
# the settings.json is not being read correctly

# Start an agentic session
claude

# Inside Claude Code, give the agent a concrete task:
# > Analyze the UserService class in src/user_service.py.
# > Write a pytest test suite covering all public methods.
# > Run the tests and fix any failures.
# > The goal is a clean pytest run with no skips.

What the Claude Code tool call trace looks like during this session:

code

→ read_file("src/user_service.py")
  Reading 247 lines...

→ list_files("src/")
  Found: user_service.py, models.py, db.py, exceptions.py

→ read_file("src/models.py")
  Reading 89 lines...

→ write_file("tests/test_user_service.py", [test content])
  Written: 312 lines

→ bash("python -m pytest tests/test_user_service.py -v 2>&1")
  Running 14 tests...
  FAILED tests/test_user_service.py::test_update_email_invalid
    AssertionError: Expected ValidationError, got None

→ read_file("src/user_service.py")  [targeted re-read of update_email method]
  ...

→ write_file("tests/test_user_service.py", [corrected test])
  Patched test_update_email_invalid assertion

→ bash("python -m pytest tests/test_user_service.py -v 2>&1")
  14 passed in 1.23s

Gemma 4 handles this pattern reliably — reading files before editing, running tests after changes, and diagnosing failures from error output rather than retrying blindly. The behavior on complex architectural decisions across many files is where cloud models still have an edge. For the task above (analysis, test generation, and targeted fixes), the local setup is fully capable.

What to watch for: If you see the agent produce "Invalid tool parameters" errors and then retry with the same parameters repeatedly, either the temperature is too high or the model is not the gemma4-claude Modelfile variant. Both the temperature and the context window override are baked into the variant; the raw gemma4:26b tag carries neither.
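If you log the agent's tool calls, this retry pattern can be spotted automatically. The sketch below assumes a hypothetical trace format (a list of dicts with "name" and "input" keys, mirroring the fields of a tool_use block); `find_retry_loop` is an illustrative helper, not something Claude Code provides.

```python
import json
from typing import Optional


def find_retry_loop(calls: list, threshold: int = 3) -> Optional[dict]:
    """Return the first tool call repeated `threshold` times in a row, else None."""
    streak, previous = 0, None
    for call in calls:
        # Canonical JSON so that key order cannot hide an identical retry
        fingerprint = json.dumps(
            {"name": call["name"], "input": call["input"]}, sort_keys=True
        )
        streak = streak + 1 if fingerprint == previous else 1
        previous = fingerprint
        if streak >= threshold:
            return call
    return None


# Hypothetical trace: the agent repeats the same write three times in a row
looping_trace = [
    {"name": "read_file", "input": {"path": "src/user_service.py"}},
    {"name": "write_file", "input": {"path": "tests/test_user_service.py"}},
    {"name": "write_file", "input": {"path": "tests/test_user_service.py"}},
    {"name": "write_file", "input": {"path": "tests/test_user_service.py"}},
]
healthy_trace = looping_trace[:2]

stuck = find_retry_loop(looping_trace)
print(stuck["name"] if stuck else "no loop")  # write_file
print(find_retry_loop(healthy_trace))         # None
```

A threshold of three consecutive identical calls is an arbitrary starting point; tune it to how noisy your own sessions are.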

// What Breaks and How to Fix It

Tool Parameter Formatting Errors

Symptom: Claude Code reports Invalid tool parameters repeatedly. The agent apologizes and retries with identical or nearly identical parameters, then loops.

Cause: This is documented in the Ollama GitHub issues. The model produces tool call JSON that does not match the schema Claude Code expects. Most commonly: wrong field names, missing required fields, or nested objects where scalars are expected.

Fix: Confirm you are running gemma4-claude (the Modelfile variant), not gemma4:26b directly. The temperature of 0.2 and the system prompt in the Modelfile significantly reduce this. If the issue persists, lower the temperature to 0.1 in the Modelfile (PARAMETER temperature 0.1) and rebuild the variant with ollama create.
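To see which of the three failure modes you are hitting (wrong field names, missing required fields, or nested objects where scalars are expected), it can help to check a logged tool call against the tool's input_schema yourself. The following is a minimal sketch using only the standard library; it covers required fields and top-level primitive types, is not a full JSON Schema validator, and does not reproduce how Claude Code validates calls internally.

```python
JSON_TYPES = {
    "string": str,
    "integer": int,
    "number": (int, float),
    "boolean": bool,
    "object": dict,
    "array": list,
}


def validate_tool_input(tool_input, schema: dict) -> list:
    """Return a list of mismatches between a tool call's input and its schema."""
    if not isinstance(tool_input, dict):
        return ["input must be a JSON object"]

    errors = []
    for field in schema.get("required", []):
        if field not in tool_input:
            errors.append(f"missing required field '{field}'")

    properties = schema.get("properties", {})
    for field, value in tool_input.items():
        if field not in properties:
            errors.append(f"unexpected field '{field}'")
            continue
        type_name = properties[field].get("type")
        expected = JSON_TYPES.get(type_name)
        if expected and not isinstance(value, expected):
            errors.append(f"field '{field}' should be of type {type_name}")
    return errors


# Schema of the read_file tool used in the verification script
read_file_schema = {
    "type": "object",
    "properties": {"path": {"type": "string"}},
    "required": ["path"],
}

print(validate_tool_input({"path": "/tmp/test.py"}, read_file_schema))
print(validate_tool_input({"file_path": "/tmp/test.py"}, read_file_schema))
print(validate_tool_input({"path": {"value": "/tmp/test.py"}}, read_file_schema))
```

The first call returns an empty list; the second reports a missing required field plus an unexpected one; the third reports a nested object where a string was expected.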

Context Window Swapping to Disk

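Symptom: Generation slows to a crawl after several turns. ollama ps shows the model split between CPU and GPU instead of running fully on the GPU. The model weights plus the KV cache no longer fit in VRAM, so part of the work spills into system memory and, under memory pressure, the OS pages it to disk.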

Fix:

code

# Option 1: Reduce context window in the Modelfile
# Edit ~/.ollama/Modelfiles/gemma4-claude
# Change: PARAMETER num_ctx 65536
# To:     PARAMETER num_ctx 32768
# Then rebuild: ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude

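# Option 2: Enable KV cache quantization to reduce memory footprint
# Quantized KV cache requires flash attention, so enable both.
# Set these in the environment of the Ollama server process, not just the client shell.
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
# This quantizes the KV cache itself, reducing memory at a small quality cost
# Restart Ollama after setting this: pkill ollama && ollama serve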
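Both options work because KV cache size scales linearly with context length and with the bytes stored per value. Here is a back-of-the-envelope sketch of that arithmetic for a plain full-attention model. The layer count, KV head count, and head dimension are hypothetical placeholders, not Gemma 4's real dimensions, and q8_0 is approximated as one byte per value, so treat the results as relative rather than absolute.

```python
def kv_cache_bytes(num_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: float) -> float:
    """Estimate KV cache size: keys and values for every layer, head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_value


# Hypothetical architecture values for illustration only
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128

scenarios = [
    ("num_ctx 65536, f16 cache", 65536, 2),
    ("num_ctx 32768, f16 cache", 32768, 2),   # Option 1: halve the context
    ("num_ctx 65536, q8_0 cache", 65536, 1),  # Option 2: roughly 1 byte per value
]

for label, num_ctx, width in scenarios:
    gib = kv_cache_bytes(num_ctx, LAYERS, KV_HEADS, HEAD_DIM, width) / 2**30
    print(f"{label}: {gib:.1f} GiB")
# num_ctx 65536, f16 cache: 8.0 GiB
# num_ctx 32768, f16 cache: 4.0 GiB
# num_ctx 65536, q8_0 cache: 4.0 GiB
```

Under these assumptions either option roughly halves the cache, and combining them quarters it, which is why they are the first things to try when generation starts to crawl.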

Model Unloading Between Agent Turns

Symptom: Noticeable cold-start delay at the beginning of each Claude Code message. Ollama is unloading the model after an inactivity timeout and reloading it for each request.

Fix:

code

# Keep the model loaded indefinitely during your work session
# (the Ollama server reads this at startup, so set it before running ollama serve)
export OLLAMA_KEEP_ALIVE=-1

# Or set it in your shell profile for permanent effect
echo 'export OLLAMA_KEEP_ALIVE=-1' >> ~/.zshrc

# Alternatively, use the Ollama API to pin the model
curl http://localhost:11434/api/generate \
  -d '{"model": "gemma4-claude", "keep_alive": -1}'
# This pins the model until you explicitly unload it or restart Ollama

Beta Header Rejection Errors

Symptom: Claude Code produces Unexpected value(s) for the anthropic-beta header errors on launch or mid-session.

Fix: Confirm CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: "1" is in your settings.json. If you set it via shell export instead of settings.json, verify it is exported in the same shell session where claude is running:

code

echo $CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS
# Must print: 1

# Wrapping Up

The stack described in this article is not a proof of concept. It is a working production configuration that engineers have been running daily since Ollama added Anthropic Messages API support in January 2026. The Modelfile is not optional; it is the difference between a tool that works and one that silently produces incomplete outputs on multi-file tasks. The verification script catches configuration issues before they surface mid-session as confusing agent failures.

The setup built in this article is a private, zero-per-token-cost coding agent that handles the majority of daily engineering tasks — code analysis, test generation, targeted refactoring, and debugging — at generation speeds that are usable on modern hardware.

This setup is not a replacement for cloud inference on complex architectural reasoning across large codebases or SWE-bench class tasks that require deep repository understanding at scale.

Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.

# Wiring Claude Code to the Local Model

// Global Settings — ~/.claude/settings.json

This configuration applies to every Claude Code session across all projects. It is the right choice unless you are switching between local and cloud models frequently per project.

{

"env": {

"ANTHROPIC_BASE_URL": "http://localhost:11434",

"ANTHROPIC_AUTH_TOKEN": "ollama",

"ANTHROPIC_API_KEY": "",

"ANTHROPIC_MODEL": "gemma4-claude",

"ANTHROPIC_DEFAULT_SONNET_MODEL": "gemma4-claude",

"ANTHROPIC_DEFAULT_HAIKU_MODEL": "gemma4-claude",

"ANTHROPIC_DEFAULT_OPUS_MODEL": "gemma4-claude",

"CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1"

}

Why each variable matters:

ANTHROPIC_BASE_URL は、Anthropic のサーバーから送信されるすべての Claude Code API 呼び出しを、ローカルの Ollama インスタンスへリダイレクトします。

ANTHROPIC_AUTH_TOKEN は、空でない文字列のいずれかに設定する必要があります。Ollama はこの値を無視しますが、Claude Code はヘッダーが存在することを要求します。

ANTHROPIC_API_KEY: "" と明示的にキーを空にすることで、シェル環境に実際の Anthropic API キーが設定されていた場合でも、Claude Code がそれにフォールバックできないようにします。これを設定しないと、誤設定された ANTHROPIC_BASE_URL が、有料の API に対して静かにフォールバックしてしまう可能性があります。

ANTHROPIC_MODEL は、Claude Code がリクエストに送信する主要なモデル名です。これは、コンテキストウィンドウのオーバーライドを保持していない生のモデルタグ（gemma4:26b など）ではなく、カスタムの Modelfile バリアントである gemma4-claude に設定してください。

ANTHROPIC_DEFAULT_SONNET_MODEL、ANTHROPIC_DEFAULT_HAIKU_MODEL、および ANTHROPIC_DEFAULT_OPUS_MODEL：Claude Code は内部で異なるタスクタイプを異なるモデルティアにルーティングします。これら 3 つすべてを同じローカルモデルに設定することで、Claude Code が内部でどのティアを選択したかに関わらず、すべてのリクエストが Ollama インスタントに到達するように確保できます。

CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: "1" と設定すると、Claude Code がリクエストに追加する Anthropic 固有のベータヘッダーが削除されます。ローカル推論サーバーはこれらのヘッダーを認識せず、含まれるリクエストを拒否します。この変数を設定することで、コアとなる Claude Code の機能に影響を与えることなく、そのエラーを防ぐことができます。

// プロジェクトごとの設定 — .claude/settings.json

プロジェクトのルートディレクトリで

mkdir -p .claude

cat > .claude/settings.json << 'EOF'

{

"env": {

"ANTHROPIC_BASE_URL": "http://localhost:11434",

"ANTHROPIC_AUTH_TOKEN": "ollama",

"ANTHROPIC_API_KEY": "",

"ANTHROPIC_MODEL": "gemma4-claude",

"ANTHROPIC_DEFAULT_SONNET_MODEL": "gemma4-claude",

"ANTHROPIC_DEFAULT_HAIKU_MODEL": "gemma4-claude",

"ANTHROPIC_DEFAULT_OPUS_MODEL": "gemma4-claude",

"CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1"

}

EOF

// 設定の検証

必須要件：**

pip install httpx # 検証スクリプト用の非同期 HTTP クライアント

完全な検証スクリプト:

#!/usr/bin/env python3

"""

verify_local_setup.py

使用前に、Claude Code + Ollama + Gemma 4 のスタック全体を検証します。

以下の 3 つのチェックを順次実行します:

Ollama の稼働状況とモデルの利用可能性
Anthropic Messages API への基本的な呼び出し
ツール呼び出しの往復テスト

前提条件:

pip install httpx

実行方法:

python verify_local_setup.py

正常に動作しているセットアップでの期待される出力:

[PASS] Ollama が localhost:11434 で稼働中

[PASS] モデル 'gemma4-claude' が利用可能

[PASS] Anthropic Messages API の呼び出しが成功

[PASS] ツール呼び出し：モデルが有効な tool_use ブロックを生成

すべてのチェックに合格 -- Claude Code + Ollama + Gemma 4 は使用準備完了。

"""

import httpx

import json

import sys

── 設定 ─────────────────────────────────────────────────────────────

OLLAMA_BASE_URL = "http://localhost:11434"

MODEL_NAME = "gemma4-claude" # モデルファイルのバリアント名と一致させる必要があります

TIMEOUT = 120.0 # 秒数 -- 初回呼び出し時は生成に時間がかかる場合があります

def check_ollama_health() -> bool:

"""

確認 1: Ollama が実行中で応答しているかを確認します。

正常な状態では 'Ollama is running' を返すルートエンドポイントにアクセスします。

"""

print("\n確認 1: Ollama のヘルスチェック")

try:

response = httpx.get(OLLAMA_BASE_URL, timeout=5.0)

if "Ollama is running" in response.text:

print(f" [PASS] {OLLAMA_BASE_URL} で Ollama が実行中です")

return True

else:

print(f" [FAIL] 予期せぬ応答: {response.text[:100]}")

return False

except httpx.ConnectError:

print(f" [FAIL] {OLLAMA_BASE_URL} に接続できません")

print(" Ollama は実行されていますか？試行：ollama serve")

return False

def check_model_available() -> bool:

"""

確認 2: 特定のモデルバリアントが Ollama で利用可能かを確認します。

すべてのプル済みモデルをリストアップする /api/tags エンドポイントを使用します。

"""

print("\n確認 2: モデルの利用可能性")

try:

response = httpx.get(f"{OLLAMA_BASE_URL}/api/tags", timeout=5.0)

data = response.json()

models = [m["name"] for m in data.get("models", [])]

# 正規化: 指定されていない場合、Ollama は ":latest" を追加する場合があります

normalized = [m.split(":")[0] for m in models]

if MODEL_NAME in models or MODEL_NAME in normalized:

print(f" [PASS] Model '{MODEL_NAME}' is available")

return True

else:

print(f" [FAIL] Model '{MODEL_NAME}' not found")

print(f" Available models: {', '.join(models) or 'none'}")

print(f" Run: ollama create {MODEL_NAME} -f ~/.ollama/Modelfiles/gemma4-claude")

return False

except Exception as e:

print(f" [FAIL] Error checking model list: {e}")

return False

def check_messages_api() -> bool:

"""

Check 3: Send a basic Anthropic Messages API call to the local endpoint.

Verifies the request format, model routing, and basic generation work.

Uses the same /v1/messages path and request schema that Claude Code uses.

Note: Claude Code uses http://localhost:11434 (root), not /v1.

The Anthropic-compatible API is at /api/chat or the root -- Ollama routes it.

"""

print("\nCheck 3: Anthropic Messages API call")

payload = {

"model": MODEL_NAME,

"max_tokens": 100,

"messages": [

{

"role": "user",

"content": "Reply with exactly: VERIFICATION_OK"

}

]

}

headers = {

"Content-Type": "application/json",

"x-api-key": "ollama", # API 仕様で必須ですが、ローカルでは値は無視されます

"anthropic-version": "2023-06-01" # 必須のバージョンヘッダー

}

try:

response = httpx.post(

f"{OLLAMA_BASE_URL}/v1/messages",

json=payload,

headers=headers,

timeout=TIMEOUT

)

if response.status_code != 200:

print(f" [FAIL] HTTP {response.status_code}: {response.text[:200]}")

return False

data = response.json()

# Anthropic Messages API のレスポンス構造:

# { "content": [{"type": "text", "text": "..."}], "stop_reason": "..." }

content_blocks = data.get("content", [])

text_blocks = [b for b in content_blocks if b.get("type") == "text"]

if not text_blocks:

print(f" [FAIL] レスポンスにテキストコンテンツがありません: {json.dumps(data, indent=2)}")

return False

response_text = text_blocks[0].get("text", "")

print(f" [PASS] Anthropic Messages API の呼び出しが成功しました")

print(f" モデルのレスポンス: {response_text[:80]}")

return True

except Exception as e:

print(f" [FAIL] リクエストに失敗しました: {e}")

return False

def check_tool_calling() -> bool:

"""

チェック 4：ツール呼び出しがエンドツーエンドで機能することを確認します。

これは Claude Code のエージェント利用において最も重要なチェックです。

Claude Code は、ファイル操作、シェルコマンド、コード実行のすべてに対して、モデルが正しく tool_use ブロックを生成することに依存しています。

単純なツールの定義と、それをトリガーするはずのプロンプトを送信します。

モデルがツール呼び出しの説明テキストではなく、tool_use ブロックを返すことを確認します。

"""

print("\nCheck 4: Tool calling verification")

# Anthropic の関数呼び出しスキーマを使用した最小限のツール定義

tools = [

{

"name": "read_file",

"description": "指定されたパスにあるファイルの内容を読み取ります。",

"input_schema": {

"type": "object",

"properties": {

"path": {

"type": "string",

"description": "読み取る絶対パスまたは相対パスのファイルパス"

}

"required": ["path"]

}

]

payload = {

"model": MODEL_NAME,

"max_tokens": 256,

"tools": tools,

# モデルがテキストで応答するのではなく、ツールを呼び出すよう強制します。

# tool_choice: {"type": "any"} は任意のツールの使用を要求します。

# モデルがツールを自己選択するかテストしている場合はこれを削除してください。

"tool_choice": {"type": "any"},

"messages": [

{

"role": "user",

"content": "/tmp/test.py のファイルを読み込んで、その内容を見せてください。"

}

]

}

headers = {

"Content-Type": "application/json",

"x-api-key": "ollama",

"anthropic-version": "2023-06-01"

}

{"translation": "翻訳全文"}

try:

response = httpx.post(

f"{OLLAMA_BASE_URL}/v1/messages",

json=payload,

headers=headers,

timeout=TIMEOUT

)

if response.status_code != 200:

print(f" [FAIL] HTTP {response.status_code}: {response.text[:200]}")

return False

{"translation": "翻訳全文"}

データ = レスポンス.json()

コンテンツブロック = データ.get("content", [])

ツールブロック = [b for b in コンテンツブロック if b.get("type") == "tool_use"]

{"translation": "翻訳全文"}

if not tool_blocks:

print(" [FAIL] Model did not produce a tool_use block")

print(" This means tool calling is not working correctly.")

print(" Agentic Claude Code sessions will fail on file operations.")

print(f" Full response: {json.dumps(data, indent=2)}")

return False

tool_call = tool_blocks[0]

tool_name = tool_call.get("name", "")

tool_input = tool_call.get("input", {})

print(f" [PASS] Tool calling: model produced a valid tool_use block")

print(f" Tool called: {tool_name}")

print(f" Parameters: {json.dumps(tool_input)}")

# Sanity check: did it call the right tool with the right parameter?

if tool_name == "read_file" and "path" in tool_input:

print(f" Tool name and parameter are correct.")

else:

print(f" WARNING: Unexpected tool name or missing 'path' parameter.")

print(f" The model called a tool but not the expected one.")

return True

except Exception as e:

print(f" [FAIL] Request failed: {e}")

return False

def main():

print("=" * 60)

print("Claude Code + Ollama + Gemma 4 Setup Verification")

print("=" * 60)

checks = [

check_ollama_health,

check_model_available,

check_messages_api,

check_tool_calling,

]

results = [check() for check in checks]

print("\n" + "=" * 60)

passed = sum(results)

total = len(results)

if all(results):

print(f"All {total} checks passed.")

print("Claude Code + Ollama + Gemma 4 is ready.")

print(f"\nLaunch with: claude")

sys.exit(0)

else:

failed_checks = [i + 1 for i, r in enumerate(results) if not r]

print(f"{passed}/{total} checks passed. Failed: {failed_checks}")

print("Resolve the failures above before using Claude Code locally.")

sys.exit(1)

if __name__ == "__main__":

main()

How to run:

pip install httpx

python verify_local_setup.py

# Agentic Task Walkthrough

With verification passing, here is what a real agentic session looks like. The task: take an existing Python module with no tests, analyze it, write a test suite, run the tests, and fix any failures.

Navigate to a project directory

cd ~/projects/my-service

Confirm Claude Code sees the local configuration

claude --version

Verify it does not prompt for an Anthropic API key -- if it does,

the settings.json is not being read correctly

Start an agentic session

claude

Inside Claude Code, give the agent a concrete task:

> Analyze the UserService class in src/user_service.py.

> Write a pytest test suite covering all public methods.

> Run the tests and fix any failures.

> The goal is a clean pytest run with no skips.

What the Claude Code tool call trace looks like during this session:

→ read_file("src/user_service.py")
  Reading 247 lines...

→ list_files("src/")
  Found: user_service.py, models.py, db.py, exceptions.py

→ read_file("src/models.py")
  Reading 89 lines...

→ write_file("tests/test_user_service.py", [test content])
  Written: 312 lines

→ bash("python -m pytest tests/test_user_service.py -v 2>&1")
  Running 14 tests...
  FAILED tests/test_user_service.py::test_update_email_invalid
    AssertionError: Expected ValidationError, got None

→ read_file("src/user_service.py")  [targeted re-read of update_email method]
  ...

→ write_file("tests/test_user_service.py", [corrected test])
  Patched test_update_email_invalid assertion

→ bash("python -m pytest tests/test_user_service.py -v 2>&1")
  14 passed in 1.23s

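// What Breaks and How to Fix It

Tool Parameter Formatting Errors

Context Window Swapping to Disk

Fix:

# Option 1: Reduce context window in the Modelfile
# Edit ~/.ollama/Modelfiles/gemma4-claude
# Change: PARAMETER num_ctx 65536
# To:     PARAMETER num_ctx 32768
# Then rebuild: ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude

# Option 2: Enable KV cache quantization to reduce memory footprint
# KV cache quantization only takes effect when Flash Attention is enabled
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
# This quantizes the KV cache itself, reducing memory at a small quality cost
# Restart Ollama after setting this: pkill ollama && ollama serve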

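Model Unloading Between Agent Turns

Fix:

# Keep the model loaded indefinitely during your work session.
# Set this in the environment of the Ollama server process, then restart it.
export OLLAMA_KEEP_ALIVE=-1

# Or set it in your shell profile for permanent effect
echo 'export OLLAMA_KEEP_ALIVE=-1' >> ~/.zshrc

# Alternatively, use the Ollama API to pin the model
curl http://localhost:11434/api/generate \
  -d '{"model": "gemma4-claude", "keep_alive": -1}'
# This pins the model until you explicitly unload it or restart Ollama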

Beta Header Rejection Errors

Symptom: Claude Code produces Unexpected value(s) for the anthropic-beta header errors on launch or mid-session.

Fix:

echo $CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS
# Must print: 1

# Wrapping Up


# Introduction

# Why Gemma 4?

// The Numbers That Matter for Coding Agents

| Benchmark | Gemma 3 27B | Gemma 4 26B MoE | Gemma 4 31B Dense |
|---|---|---|---|
| τ2-bench (agentic tool use) | 6.6% | ~79% | 86.4% |
| LiveCodeBench v6 | 29.1% | 77.1% | 80.0% |
| GPQA Diamond | 42.4% | 82.3% | 84.3% |
| AIME 2026 (math) | 20.8% | 88.3% | 89.2% |
| Arena AI ELO | 1365 | 1441 | 1452 |

// Hardware Requirements

Before pulling an 18 GB model, know what you are actually working with. The Gemma 4 family was designed to span edge devices through workstations, and the four variants reflect that range.

| Variant | Ollama tag | Active params | VRAM at Q4 | Context window |
|---|---|---|---|---|
| Edge 4B | gemma4:e4b | | ~6 GB | 128K |
| 26B MoE | gemma4:26b | 3.8B | ~16–18 GB | 256K |
| 31B Dense | gemma4:31b | 31B | ~24–32 GB | 256K |

// Installing Ollama, Gemma 4, and Claude Code

Step 1: Install Ollama

code

# macOS and Linux -- one-line install
curl -fsSL https://ollama.com/install.sh | sh

# Verify version -- must be 0.14.0+ for Anthropic Messages API support
# The Anthropic-compatible endpoint was added in January 2026
ollama --version
# Expected: ollama version is 0.22.x or higher (as of May 2026)

# Windows: download the native installer from https://ollama.com
# WSL2 is recommended if you want GPU passthrough on Windows

After installation, Ollama starts as a background service on port 11434. Verify it is up:

code

curl http://localhost:11434
# Expected response: Ollama is running
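
The version requirement above can also be checked programmatically. The following is a minimal sketch, assuming the CLI prints a string containing a MAJOR.MINOR.PATCH number (as in "ollama version is 0.22.x"); the helper names `parse_version`, `meets_minimum`, and `installed_ollama_version` are illustrative, not part of Ollama.

```python
import re
import subprocess

MIN_VERSION = (0, 14, 0)  # minimum stated in the install step above


def parse_version(text: str) -> tuple:
    """Extract the first MAJOR.MINOR.PATCH triple found in a version string."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", text)
    if match is None:
        raise ValueError(f"no version number found in: {text!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum(text: str, minimum: tuple = MIN_VERSION) -> bool:
    """True if the version embedded in `text` is at least `minimum`."""
    return parse_version(text) >= minimum


def installed_ollama_version() -> str:
    """Return the raw output of `ollama --version` (needs ollama on PATH)."""
    result = subprocess.run(
        ["ollama", "--version"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip()
```

Calling `meets_minimum(installed_ollama_version())` before continuing gives an early, explicit failure instead of a confusing 404 from the Messages endpoint later on.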

Step 2: Pull Gemma 4

code

# The 26B MoE -- recommended for this setup (~18 GB download)
ollama pull gemma4:26b

# Download progress is shown by the pull command itself.
# ollama ps lists models currently loaded in memory, not downloads.
ollama ps

# Optional: also pull the 31B for comparison on capable hardware
ollama pull gemma4:31b

# Confirm the pull completed
ollama list
# Should show gemma4:26b with size and modification date

Step 3: Install Claude Code

code

# Prerequisites: Node.js 18 or later
node --version   # Confirm you are on 18+

# Install Claude Code CLI globally
npm install -g @anthropic-ai/claude-code

# Verify the install
claude --version

With Ollama running and Gemma 4 pulled, the natural next instinct is to export the environment variables and launch Claude Code immediately.

# The Modelfile

The fix is a Modelfile that bakes the correct context size and other inference parameters into a named model variant. Create this file:

code

# ~/.ollama/Modelfiles/gemma4-claude
# Gemma 4 26B MoE variant tuned for Claude Code agentic sessions.
# Bakes context window, temperature, and system prompt into the model
# so every Claude Code session starts with the correct configuration.
#
# Build with:
#   mkdir -p ~/.ollama/Modelfiles
#   ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude

FROM gemma4:26b

# Context window -- 65536 tokens (64K) is the tested-safe floor for real
# codebases without triggering swap on 16-18 GB VRAM systems.
# Increase to 131072 (128K) if you have headroom on 24 GB+ systems.
# Do not go above 131072 unless you have profiled your memory usage
# under load -- Ollama pre-allocates the full KV cache upfront.
PARAMETER num_ctx 65536

# Temperature -- 0.2 is deliberately low for agentic coding.
# Higher temperature introduces variability in tool call parameter
# formatting that causes Claude Code's tool validator to reject calls.
# For creative tasks, you would set this higher. For agentic loops: low.
PARAMETER temperature 0.2

# top_p -- nucleus sampling threshold. 0.9 keeps generation focused
# while avoiding the repetition loops that top_p=1.0 can produce on
# long agentic sessions.
PARAMETER top_p 0.9

# repeat_penalty -- penalizes the model for repeating tokens.
# 1.15 helps prevent tool call loops where Gemma 4 retries the same
# failed tool call with nearly identical parameters indefinitely.
PARAMETER repeat_penalty 1.15

# num_predict -- maximum tokens per response. 4096 is sufficient for
# most code patches. Increase to 8192 if you regularly generate
# large files in a single generation.
PARAMETER num_predict 4096

# System prompt -- reinforces coding agent behavior and explicit
# tool use discipline. Gemma 4 benefits from being reminded to
# commit to tool calls rather than describing what it would do.
SYSTEM """You are a senior software engineer operating as a coding agent.

When working with code:
- Read files before editing them. Never assume file contents.
- Make one focused change at a time and verify it before proceeding.
- When a tool call fails, examine the error carefully before retrying.
  Do not retry with identical parameters. Diagnose first.
- Prefer surgical edits over full file rewrites.
- Run tests after each meaningful change, not after a batch of changes.
- If you are uncertain about the codebase structure, read more files
  rather than guessing.

Be precise and methodical. Avoid explaining what you are about to do
when you could simply do it."""

Build the variant:

code

# Create the Modelfiles directory if it does not exist
mkdir -p ~/.ollama/Modelfiles

# Save the Modelfile content from above to this path, then build:
ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude

# Verify the variant was created
ollama list
# Should show gemma4-claude alongside gemma4:26b

# Quick smoke test -- verify it loads and responds
ollama run gemma4-claude "What is the time complexity of binary search and why?"
# Expect a clear, concise technical response within a few seconds
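
Before relying on the variant, it can help to confirm that the parameters were actually baked in. The sketch below assumes Ollama's `/api/show` endpoint returns a `parameters` field containing one `name value` pair per line; `parse_parameters`, `fetch_parameters`, and `mismatches` are hypothetical helper names, and the expected values are the ones from the Modelfile above.

```python
import json
import urllib.request

# Values taken from the Modelfile above.
EXPECTED = {"num_ctx": 65536, "temperature": 0.2, "num_predict": 4096}


def parse_parameters(parameters: str) -> dict:
    """Turn 'name   value' lines into a dict of strings."""
    parsed = {}
    for line in parameters.splitlines():
        parts = line.split(None, 1)
        if len(parts) == 2:
            parsed[parts[0]] = parts[1].strip()
    return parsed


def fetch_parameters(model: str, base_url: str = "http://localhost:11434") -> dict:
    """Ask the local Ollama server for the model's baked-in parameters."""
    request = urllib.request.Request(
        f"{base_url}/api/show",
        data=json.dumps({"model": model}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        body = json.load(response)
    return parse_parameters(body.get("parameters", ""))


def mismatches(actual: dict, expected: dict) -> dict:
    """Return {name: reported_value} for every parameter that is absent or different."""
    wrong = {}
    for name, want in expected.items():
        got = actual.get(name)
        if got is None or float(got) != float(want):
            wrong[name] = got
    return wrong
```

An empty result from `mismatches(fetch_parameters("gemma4-claude"), EXPECTED)` suggests the rebuild picked up the Modelfile; a non-empty one usually means the variant was built from an older copy of the file.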

# Wiring Claude Code to the Local Model

// Global Settings — ~/.claude/settings.json

This configuration applies to every Claude Code session across all projects. It is the right choice unless you are switching between local and cloud models frequently per project.

code

{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",

    "ANTHROPIC_AUTH_TOKEN": "ollama",

    "ANTHROPIC_API_KEY": "",

    "ANTHROPIC_MODEL": "gemma4-claude",

    "ANTHROPIC_DEFAULT_SONNET_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "gemma4-claude",

    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1"
  }
}

Why each variable matters:

ANTHROPIC_BASE_URL redirects all Claude Code API calls from Anthropic's servers to your local Ollama instance.

ANTHROPIC_AUTH_TOKEN must be set to any non-empty string; Ollama ignores the value but Claude Code requires the header to be present.

ANTHROPIC_API_KEY: "" explicitly empties the key so Claude Code cannot fall back to a real Anthropic API key if one happens to be set in your shell environment. Without this, a misconfigured ANTHROPIC_BASE_URL might silently fail over to the paid API.

ANTHROPIC_MODEL is the primary model name Claude Code sends in requests. Set this to your custom Modelfile variant, gemma4-claude, not gemma4:26b. The raw model tag does not carry the context window override.

ANTHROPIC_DEFAULT_SONNET_MODEL, ANTHROPIC_DEFAULT_HAIKU_MODEL, and ANTHROPIC_DEFAULT_OPUS_MODEL: Claude Code internally routes different task types to different model tiers. Setting all three to the same local model ensures every request lands at your Ollama instance regardless of which tier Claude Code internally selects.

CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: "1" strips the Anthropic-specific beta headers that Claude Code adds to requests. Local inference servers do not recognize these headers and reject requests that include them. Setting this variable prevents that error without affecting any core Claude Code functionality.
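
The rules described above are easy to get subtly wrong when editing JSON by hand. The following sketch lints an `env` block against them; `lint_env` and `lint_file` are hypothetical helpers written for this article, not part of Claude Code, and the checks only encode what the text states.

```python
import json
from pathlib import Path

MODEL_KEYS = (
    "ANTHROPIC_MODEL",
    "ANTHROPIC_DEFAULT_SONNET_MODEL",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL",
    "ANTHROPIC_DEFAULT_OPUS_MODEL",
)


def lint_env(env: dict) -> list:
    """Return a list of human-readable problems; an empty list means the block looks fine."""
    problems = []
    if not env.get("ANTHROPIC_BASE_URL", "").startswith("http"):
        problems.append("ANTHROPIC_BASE_URL is missing or not an http(s) URL")
    if not env.get("ANTHROPIC_AUTH_TOKEN"):
        problems.append("ANTHROPIC_AUTH_TOKEN must be a non-empty string")
    if env.get("ANTHROPIC_API_KEY") != "":
        problems.append("ANTHROPIC_API_KEY should be present and empty")
    models = {env.get(key) for key in MODEL_KEYS}
    if None in models or len(models) != 1:
        problems.append("all four model variables should name the same local model")
    if env.get("CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS") != "1":
        problems.append('CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS should be "1"')
    return problems


def lint_file(path: Path = Path.home() / ".claude" / "settings.json") -> list:
    """Load a settings.json file and lint its env block."""
    settings = json.loads(path.read_text())
    return lint_env(settings.get("env", {}))
```

Running `lint_file()` after editing the global settings, or `lint_file(Path(".claude/settings.json"))` inside a project, catches the most common slip: one model tier left pointing at a cloud model name.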

// Per-Project Configuration — .claude/settings.json

code

# In your project root
mkdir -p .claude

cat > .claude/settings.json << 'EOF'
{
  "env": {
    "ANTHROPIC_BASE_URL": "http://localhost:11434",
    "ANTHROPIC_AUTH_TOKEN": "ollama",
    "ANTHROPIC_API_KEY": "",
    "ANTHROPIC_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_HAIKU_MODEL": "gemma4-claude",
    "ANTHROPIC_DEFAULT_OPUS_MODEL": "gemma4-claude",
    "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1"
  }
}
EOF

// Verifying the Setup

Prerequisites:

code

pip install httpx   # HTTP client used by the verification script

The full verification script:

code


#!/usr/bin/env python3
"""
verify_local_setup.py

Verifies the full Claude Code + Ollama + Gemma 4 stack before use.
Runs four checks in sequence:
  1. Ollama health
  2. Model availability
  3. Basic Anthropic Messages API call
  4. Tool calling (model returns a tool_use block)

Prerequisites:
  pip install httpx

How to run:
  python verify_local_setup.py

Expected output on a working setup:
  [PASS] Ollama is running on localhost:11434
  [PASS] Model 'gemma4-claude' is available
  [PASS] Anthropic Messages API call successful
  [PASS] Tool calling: model produced a valid tool_use block
  All checks passed -- Claude Code + Ollama + Gemma 4 is ready.
"""

import httpx
import json
import sys

# ── Configuration ─────────────────────────────────────────────────────────────
OLLAMA_BASE_URL = "http://localhost:11434"
MODEL_NAME      = "gemma4-claude"   # Must match your Modelfile variant name
TIMEOUT         = 120.0             # Seconds -- generation can be slow on first call

def check_ollama_health() -> bool:
    """
    Check 1: Verify Ollama is running and responding.
    Hits the root endpoint which returns 'Ollama is running' when healthy.
    """
    print("\nCheck 1: Ollama health")
    try:
        response = httpx.get(OLLAMA_BASE_URL, timeout=5.0)
        if "Ollama is running" in response.text:
            print(f"  [PASS] Ollama is running on {OLLAMA_BASE_URL}")
            return True
        else:
            print(f"  [FAIL] Unexpected response: {response.text[:100]}")
            return False
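    except httpx.ConnectError:
        print(f"  [FAIL] Cannot connect to {OLLAMA_BASE_URL}")
        print("         Is Ollama running? Try: ollama serve")
        return False
    except httpx.HTTPError as e:
        # Timeouts and other transport errors should fail the check, not crash the script
        print(f"  [FAIL] Request to {OLLAMA_BASE_URL} failed: {e}")
        return False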

def check_model_available() -> bool:
    """
    Check 2: Verify the specific model variant is available in Ollama.
    Uses the /api/tags endpoint which lists all pulled models.
    """
    print("\nCheck 2: Model availability")
    try:
        response = httpx.get(f"{OLLAMA_BASE_URL}/api/tags", timeout=5.0)
        data     = response.json()
        models   = [m["name"] for m in data.get("models", [])]

        # Normalize: Ollama may add ":latest" if not specified
        normalized = [m.split(":")[0] for m in models]

        if MODEL_NAME in models or MODEL_NAME in normalized:
            print(f"  [PASS] Model '{MODEL_NAME}' is available")
            return True
        else:
            print(f"  [FAIL] Model '{MODEL_NAME}' not found")
            print(f"         Available models: {', '.join(models) or 'none'}")
            print(f"         Run: ollama create {MODEL_NAME} -f ~/.ollama/Modelfiles/gemma4-claude")
            return False
    except Exception as e:
        print(f"  [FAIL] Error checking model list: {e}")
        return False

def check_messages_api() -> bool:
    """
    Check 3: Send a basic Anthropic Messages API call to the local endpoint.
    Verifies the request format, model routing, and basic generation work.
    Uses the same /v1/messages path and request schema that Claude Code uses.
    Note: ANTHROPIC_BASE_URL is set to the root (http://localhost:11434) with
    no /v1 suffix, because the client appends /v1/messages to the base URL itself.
    """
    print("\nCheck 3: Anthropic Messages API call")

    payload = {
        "model": MODEL_NAME,
        "max_tokens": 100,
        "messages": [
            {
                "role": "user",
                "content": "Reply with exactly: VERIFICATION_OK"
            }
        ]
    }

    headers = {
        "Content-Type":      "application/json",
        "x-api-key":         "ollama",            # Required by the API spec; value ignored locally
        "anthropic-version": "2023-06-01"         # Required version header
    }

    try:
        response = httpx.post(
            f"{OLLAMA_BASE_URL}/v1/messages",
            json=payload,
            headers=headers,
            timeout=TIMEOUT
        )

        if response.status_code != 200:
            print(f"  [FAIL] HTTP {response.status_code}: {response.text[:200]}")
            return False

        data = response.json()

        # Anthropic Messages API response structure:
        # { "content": [{"type": "text", "text": "..."}], "stop_reason": "..." }
        content_blocks = data.get("content", [])
        text_blocks    = [b for b in content_blocks if b.get("type") == "text"]

        if not text_blocks:
            print(f"  [FAIL] No text content in response: {json.dumps(data, indent=2)}")
            return False

        response_text = text_blocks[0].get("text", "")
        print(f"  [PASS] Anthropic Messages API call successful")
        print(f"         Model response: {response_text[:80]}")
        return True

    except Exception as e:
        print(f"  [FAIL] Request failed: {e}")
        return False

def check_tool_calling() -> bool:
    """
    Check 4: Verify tool calling works end-to-end.
    This is the most important check for Claude Code agentic use.
    Claude Code relies on the model correctly producing tool_use blocks
    for every file operation, shell command, and code execution.

    Sends a simple tool definition and a prompt that should trigger it.
    Verifies the model returns a tool_use block (not just text describing the call).
    """
    print("\nCheck 4: Tool calling verification")

    # A minimal tool definition using the Anthropic function calling schema
    tools = [
        {
            "name": "read_file",
            "description": "Read the contents of a file at the given path.",
            "input_schema": {
                "type": "object",
                "properties": {
                    "path": {
                        "type": "string",
                        "description": "The absolute or relative file path to read"
                    }
                },
                "required": ["path"]
            }
        }
    ]

    payload = {
        "model": MODEL_NAME,
        "max_tokens": 256,
        "tools": tools,
        # Force the model to call a tool rather than respond in text.
        # tool_choice: {"type": "any"} requires any tool use.
        # Remove this if testing whether the model self-selects tools.
        "tool_choice": {"type": "any"},
        "messages": [
            {
                "role": "user",
                "content": "Read the file at /tmp/test.py and show me its contents."
            }
        ]
    }

    headers = {
        "Content-Type":      "application/json",
        "x-api-key":         "ollama",
        "anthropic-version": "2023-06-01"
    }

    try:
        response = httpx.post(
            f"{OLLAMA_BASE_URL}/v1/messages",
            json=payload,
            headers=headers,
            timeout=TIMEOUT
        )

        if response.status_code != 200:
            print(f"  [FAIL] HTTP {response.status_code}: {response.text[:200]}")
            return False

        data           = response.json()
        content_blocks = data.get("content", [])
        tool_blocks    = [b for b in content_blocks if b.get("type") == "tool_use"]

        if not tool_blocks:
            print("  [FAIL] Model did not produce a tool_use block")
            print("         This means tool calling is not working correctly.")
            print("         Agentic Claude Code sessions will fail on file operations.")
            print(f"         Full response: {json.dumps(data, indent=2)}")
            return False

        tool_call  = tool_blocks[0]
        tool_name  = tool_call.get("name", "")
        tool_input = tool_call.get("input", {})

        print(f"  [PASS] Tool calling: model produced a valid tool_use block")
        print(f"         Tool called: {tool_name}")
        print(f"         Parameters:  {json.dumps(tool_input)}")

        # Sanity check: did it call the right tool with the right parameter?
        if tool_name == "read_file" and "path" in tool_input:
            print(f"         Tool name and parameter are correct.")
        else:
            print(f"         WARNING: Unexpected tool name or missing 'path' parameter.")
            print(f"         The model called a tool but not the expected one.")

        return True

    except Exception as e:
        print(f"  [FAIL] Request failed: {e}")
        return False

def main():
    print("=" * 60)
    print("Claude Code + Ollama + Gemma 4 Setup Verification")
    print("=" * 60)

    checks = [
        check_ollama_health,
        check_model_available,
        check_messages_api,
        check_tool_calling,
    ]

    results = [check() for check in checks]

    print("\n" + "=" * 60)
    passed = sum(results)
    total  = len(results)

    if all(results):
        print(f"All {total} checks passed.")
        print("Claude Code + Ollama + Gemma 4 is ready.")
        print(f"\nLaunch with: claude")
        sys.exit(0)
    else:
        failed_checks = [i + 1 for i, r in enumerate(results) if not r]
        print(f"{passed}/{total} checks passed. Failed: {failed_checks}")
        print("Resolve the failures above before using Claude Code locally.")
        sys.exit(1)

if __name__ == "__main__":
    main()

How to run:

code

pip install httpx
python verify_local_setup.py
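
Check 4 only confirms the first half of a tool exchange: the model emits a `tool_use` block. An agent loop also has to send the tool's output back. The sketch below builds that second-turn payload in the Anthropic Messages format (an assistant turn echoing the `tool_use` block, then a user turn carrying a `tool_result` with the matching `tool_use_id`). `build_followup` is a hypothetical helper, and whether the local endpoint handles `tool_result` blocks the same way as the hosted API is an assumption worth testing on your own setup.

```python
def build_followup(first_payload: dict, first_response: dict, tool_output: str) -> dict:
    """Build the second request of a tool round-trip from the first request/response pair."""
    # Find the tool call the model made in its first reply.
    tool_use = next(
        block for block in first_response["content"] if block.get("type") == "tool_use"
    )

    followup = dict(first_payload)
    # Drop the forced tool choice so the model is free to answer in text this time.
    followup.pop("tool_choice", None)
    followup["messages"] = first_payload["messages"] + [
        # Echo the assistant turn exactly as it was returned.
        {"role": "assistant", "content": first_response["content"]},
        # Return the tool output, linked to the call by its id.
        {
            "role": "user",
            "content": [
                {
                    "type": "tool_result",
                    "tool_use_id": tool_use["id"],
                    "content": tool_output,
                }
            ],
        },
    ]
    return followup
```

Posting the returned payload to the same `/v1/messages` path and checking for a text block that mentions the file contents would complete the round-trip.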

# Agentic Task Walkthrough

With verification passing, here is what a real agentic session looks like. The task: take an existing Python module with no tests, analyze it, write a test suite, run the tests, and fix any failures.

code

# Navigate to a project directory
cd ~/projects/my-service

# Confirm Claude Code sees the local configuration
claude --version
# Verify it does not prompt for an Anthropic API key -- if it does,
# the settings.json is not being read correctly

# Start an agentic session
claude

# Inside Claude Code, give the agent a concrete task:
# > Analyze the UserService class in src/user_service.py.
# > Write a pytest test suite covering all public methods.
# > Run the tests and fix any failures.
# > The goal is a clean pytest run with no skips.

What the Claude Code tool call trace looks like during this session:

code

→ read_file("src/user_service.py")
  Reading 247 lines...

→ list_files("src/")
  Found: user_service.py, models.py, db.py, exceptions.py

→ read_file("src/models.py")
  Reading 89 lines...

→ write_file("tests/test_user_service.py", [test content])
  Written: 312 lines

→ bash("python -m pytest tests/test_user_service.py -v 2>&1")
  Running 14 tests...
  FAILED tests/test_user_service.py::test_update_email_invalid
    AssertionError: Expected ValidationError, got None

→ read_file("src/user_service.py")  [targeted re-read of update_email method]
  ...

→ write_file("tests/test_user_service.py", [corrected test])
  Patched test_update_email_invalid assertion

→ bash("python -m pytest tests/test_user_service.py -v 2>&1")
  14 passed in 1.23s

// What Breaks and How to Fix It

Tool Parameter Formatting Errors

Symptom: Claude Code reports Invalid tool parameters repeatedly. The agent apologizes and retries with identical or nearly identical parameters, then loops.

Cause: This is documented in the Ollama GitHub issues. The model produces tool call JSON that does not match the schema Claude Code expects. Most commonly: wrong field names, missing required fields, or nested objects where scalars are expected.

Fix: Confirm you are running gemma4-claude (the Modelfile variant), not gemma4:26b directly. The temperature of 0.2 and the system prompt in the Modelfile significantly reduce this. If the issue persists, drop the temperature to 0.1 in the Modelfile and rebuild.

Context Window Swapping to Disk

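Symptom: Generation slows to a crawl after several turns. In ollama ps, the PROCESSOR column shows the GPU share dropping below 100% as the model and KV cache spill into system memory, and under heavier pressure the OS starts paging to disk.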

Fix:

code

# Option 1: Reduce context window in the Modelfile
# Edit ~/.ollama/Modelfiles/gemma4-claude
# Change: PARAMETER num_ctx 65536
# To:     PARAMETER num_ctx 32768
# Then rebuild: ollama create gemma4-claude -f ~/.ollama/Modelfiles/gemma4-claude

# Option 2: Enable KV cache quantization to reduce memory footprint
# KV cache quantization only takes effect when Flash Attention is enabled
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
# This quantizes the KV cache itself, reducing memory at a small quality cost
# Restart Ollama after setting this: pkill ollama && ollama serve
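
A rough back-of-the-envelope estimate shows why both options help. For standard attention, KV cache size grows linearly with context length: two tensors (keys and values) per layer, each holding `n_kv_heads * head_dim` values per token. The architecture numbers below are placeholders chosen for illustration, not Gemma 4's real dimensions, and models that use sliding-window attention on some layers will need less than this formula suggests.

```python
def kv_cache_bytes(
    num_ctx: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: float = 2.0,  # 2.0 for f16; roughly 1.06 for q8_0
) -> int:
    """Estimate KV cache size for full (non-sliding-window) attention."""
    # Factor of 2 accounts for storing both keys and values.
    return int(2 * n_layers * n_kv_heads * head_dim * num_ctx * bytes_per_value)


def to_gib(num_bytes: int) -> float:
    return num_bytes / 2**30


# Hypothetical architecture, for illustration only.
ARCH = dict(n_layers=48, n_kv_heads=8, head_dim=128)

full_f16 = kv_cache_bytes(65536, **ARCH)           # 64K context, f16 cache
half_ctx = kv_cache_bytes(32768, **ARCH)           # Option 1: halve num_ctx
q8_cache = kv_cache_bytes(65536, **ARCH, bytes_per_value=1.0625)  # Option 2

print(f"64K f16 : {to_gib(full_f16):.1f} GiB")   # 12.0 GiB with these placeholders
print(f"32K f16 : {to_gib(half_ctx):.1f} GiB")   # 6.0 GiB
print(f"64K q8_0: {to_gib(q8_cache):.1f} GiB")
```

With these placeholder dimensions, halving `num_ctx` halves the cache, and q8_0 gets close to the same saving while keeping the full 64K window.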

Model Unloading Between Agent Turns

Symptom: Noticeable cold-start delay at the beginning of each Claude Code message. Ollama is unloading the model after an inactivity timeout and reloading it for each request.

Fix:

code

# Keep the model loaded indefinitely during your work session.
# Set this in the environment of the Ollama server process, then restart it.
export OLLAMA_KEEP_ALIVE=-1

# Or set it in your shell profile for permanent effect
echo 'export OLLAMA_KEEP_ALIVE=-1' >> ~/.zshrc

# Alternatively, use the Ollama API to pin the model
curl http://localhost:11434/api/generate \
  -d '{"model": "gemma4-claude", "keep_alive": -1}'
# This pins the model until you explicitly unload it or restart Ollama
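
To confirm the model is actually staying resident between turns, the list of loaded models can be polled. This sketch assumes Ollama's `/api/ps` endpoint returns a `models` array whose entries carry `name` and `expires_at` fields; `fetch_ps` and `find_loaded` are hypothetical helper names.

```python
import json
import urllib.request


def fetch_ps(base_url: str = "http://localhost:11434") -> dict:
    """Fetch the list of currently loaded models from the local Ollama server."""
    with urllib.request.urlopen(f"{base_url}/api/ps", timeout=5) as response:
        return json.load(response)


def find_loaded(ps_payload: dict, model: str):
    """Return the entry for `model` if it is loaded, else None.

    Matches both the exact name and the name without its tag,
    since Ollama may report 'gemma4-claude:latest'.
    """
    for entry in ps_payload.get("models", []):
        name = entry.get("name", "")
        if name == model or name.split(":")[0] == model:
            return entry
    return None
```

If `find_loaded(fetch_ps(), "gemma4-claude")` returns `None` a few minutes after a request, the keep-alive setting is not reaching the server process; otherwise the entry's `expires_at` value shows when Ollama plans to unload the model.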

Beta Header Rejection Errors

Symptom: Claude Code produces Unexpected value(s) for the anthropic-beta header errors on launch or mid-session.

Fix: Confirm CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS: "1" is in your settings.json. If you set it via shell export instead of settings.json, verify it is exported in the same shell session where claude is running:

code

echo $CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS
# Must print: 1

# Wrapping Up

This setup is not a replacement for cloud inference on complex architectural reasoning across large codebases or SWE-bench class tasks that require deep repository understanding at scale.
