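GLM-5.2 OpenAI-Compatible API: A Hands-On Guide to Reasoning Effort, Function Calling, and Long-Context Retrieval
Zhipu AI's latest model, GLM-5.2, is available through an OpenAI-compatible API, and a comprehensive guide has been published on implementing advanced features such as thinking-process control and long-context processing.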
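Key points
Available through multiple providers
GLM-5.2 is offered as an OpenAI-compatible API on several platforms, including Z.ai, OpenRouter, Together AI, and Hugging Face, so users can pick whichever suits them.
Control over the thinking process
Thinking mode and reasoning effort (high/max) can be set as parameters, letting you trade reasoning cost against speed when solving complex problems.
Advanced tool calling and structured output
Function calling for building agents and structured JSON output are both supported, which makes practical automation tasks easier to build.
Long context and cost management
The guide combines retrieval over a long context window with cost estimates based on input and output token counts, so large-scale data processing can be run with its cost in view.
Multi-provider support and API key management
A single codebase supports five providers (Z.ai, OpenRouter, Together, Requesty, and Hugging Face) and loads the API key automatically from Google Colab secrets or environment variables.
GLM-5.2-specific parameter control
GLM-5.2-specific reasoning parameters such as `thinking` (on/off) and `reasoning_effort` (high/max) are passed via `extra_body`, so they can be set through the standard OpenAI client.
Reasoning trace extraction
A `get_reasoning` function looks up the provider-specific `reasoning_content` field, so the model's hidden reasoning trace can be retrieved and displayed programmatically.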
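Impact analysis
This news shows that a high-performance LLM from China is API-compatible with the largely Western OpenAI ecosystem, which broadens developers' options for using a state-of-the-art reasoning model while avoiding vendor lock-in. The thinking-process control in particular is likely to be an important option for complex tasks in production settings where cost efficiency matters.
Editorial comment
The spread of OpenAI-compatible APIs has sharply lowered the barrier to adopting Chinese models. Fine-grained control over reasoning cost is especially attractive for large-scale production deployments.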
Original text
In this tutorial, we work with GLM-5.2 and use its hosted, OpenAI-compatible API instead of running the full model locally. We begin by setting up multiple provider options, securely loading the API key, and creating a reusable chat wrapper that supports normal chat, thinking mode, streaming, tool calling, and token tracking. Then we move beyond a simple chatbot example and test the model in more practical situations, including reasoning-effort control, streamed reasoning and answers, function calling, a small tool-using agent, structured JSON output, long-context retrieval, and cost estimation.
Setting Up the GLM-5.2 OpenAI-Compatible Client and Reusable Chat Wrapper
import sys, subprocess
subprocess.run([sys.executable, "-m", "pip", "install", "-q", "-U", "openai"], check=False)

import os, re, json, time, getpass
from openai import OpenAI

# Each provider exposes GLM-5.2 behind an OpenAI-compatible endpoint.
PROVIDERS = {
    "zai": {"base_url": "https://api.z.ai/api/paas/v4/", "model": "glm-5.2", "env": "ZAI_API_KEY"},
    "openrouter": {"base_url": "https://openrouter.ai/api/v1", "model": "z-ai/glm-5.2", "env": "OPENROUTER_API_KEY"},
    "together": {"base_url": "https://api.together.xyz/v1", "model": "zai-org/GLM-5.2", "env": "TOGETHER_API_KEY"},
    "requesty": {"base_url": "https://router.requesty.ai/v1", "model": "zai/glm-5.2", "env": "REQUESTY_API_KEY"},
    "huggingface": {"base_url": "https://router.huggingface.co/v1", "model": "zai-org/GLM-5.2", "env": "HF_TOKEN"},
}
PROVIDER = "zai"
CFG = PROVIDERS[PROVIDER]
MODEL = CFG["model"]

def load_api_key(env_name):
    # Lookup order: Colab secrets, then environment variable, then interactive prompt.
    try:
        from google.colab import userdata
        v = userdata.get(env_name)
        if v: return v
    except Exception:
        pass
    if os.environ.get(env_name):
        return os.environ[env_name]
    return getpass.getpass(f"Enter your {env_name}: ")

client = OpenAI(api_key=load_api_key(CFG["env"]), base_url=CFG["base_url"])

# Prices per 1M tokens (input, output), used for the cost estimate at the end.
PRICE_IN_PER_M, PRICE_OUT_PER_M = 1.40, 4.40
_USAGE = {"in": 0, "out": 0, "calls": 0}

def _track(usage):
    if usage:
        _USAGE["in"] += getattr(usage, "prompt_tokens", 0) or 0
        _USAGE["out"] += getattr(usage, "completion_tokens", 0) or 0
        _USAGE["calls"] += 1

def get_reasoning(obj):
    """Pull GLM's hidden reasoning trace from a message/delta (a provider-extra field)."""
    val = getattr(obj, "reasoning_content", None)
    if val: return val
    extra = getattr(obj, "model_extra", None) or {}
    if extra.get("reasoning_content"): return extra["reasoning_content"]
    try: return obj.to_dict().get("reasoning_content")
    except Exception: return None

def chat(messages, effort=None, thinking=True, tools=None, tool_choice="auto",
         stream=False, max_tokens=2048, temperature=1.0, tool_stream=False):
    """
    effort: None | "high" | "max" (GLM-5.2 thinking-effort level; max is the model default)
    thinking: True -> deep thinking on; False -> off (fast, cheap, low-latency)
    GLM-specific params go through extra_body so any OpenAI client works.
    """
    extra = {"thinking": {"type": "enabled" if thinking else "disabled"}}
    if effort and thinking: extra["reasoning_effort"] = effort
    if tool_stream: extra["tool_stream"] = True
    kwargs = dict(model=MODEL, messages=messages, max_tokens=max_tokens,
                  temperature=temperature, stream=stream, extra_body=extra)
    if tools:
        kwargs.update(tools=tools, tool_choice=tool_choice)
    if stream:
        # Ask for a final chunk that carries token usage.
        kwargs["stream_options"] = {"include_usage": True}
    return client.chat.completions.create(**kwargs)
We set up the complete foundation for using GLM-5.2 through an OpenAI-compatible API. We define multiple provider options, load the API key securely, create the OpenAI client, and set up token-cost tracking for the entire notebook. We also build a reusable chat wrapper so that every subsequent demo can use thinking mode, reasoning effort, streaming, tool calling, and provider-specific parameters cleanly.
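To see how the reasoning lookup behaves without calling the API, the sketch below re-declares `get_reasoning` and runs it against stand-in objects. The `SimpleNamespace` objects are hypothetical substitutes for SDK message objects, not real API responses, and the trace strings are made up.

```python
from types import SimpleNamespace

def get_reasoning(obj):
    """Same lookup order as the wrapper: attribute, then model_extra, then to_dict()."""
    val = getattr(obj, "reasoning_content", None)
    if val:
        return val
    extra = getattr(obj, "model_extra", None) or {}
    if extra.get("reasoning_content"):
        return extra["reasoning_content"]
    try:
        return obj.to_dict().get("reasoning_content")
    except Exception:
        return None

# Hypothetical stand-ins for the shapes a provider might return.
as_attribute = SimpleNamespace(reasoning_content="step 1 ...")
as_extra = SimpleNamespace(model_extra={"reasoning_content": "step 2 ..."})
no_trace = SimpleNamespace(content="final answer only")

results = [get_reasoning(o) for o in (as_attribute, as_extra, no_trace)]
print(results)
```

An object with no trace falls through all three lookups and yields `None`, which is why the demos can treat the reasoning channel as optional.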
Basic Chat, Thinking-Effort Control, and Streamed Reasoning with GLM-5.2
def demo_basic():
    print("\n=== 1. BASIC CHAT / SANITY CHECK =========================")
    resp = chat([{"role": "system", "content": "You are a concise technical assistant."},
                 {"role": "user", "content": "In one sentence, what is GLM-5.2 best at?"}],
                thinking=False, max_tokens=200)
    _track(resp.usage)
    print((resp.choices[0].message.content or "").strip())

def demo_effort():
    print("\n=== 2. THINKING-EFFORT CONTROL (off / high / max) ========")
    problem = ("Train A leaves city A at 9:00 going 60 km/h toward city B. "
               "Train B leaves B (420 km away) at 9:30 going 90 km/h toward A. "
               "At what clock time do they meet? Show the key steps briefly.")
    # Same problem under three settings so latency and output tokens can be compared.
    for label, kw in [("thinking OFF", dict(thinking=False)),
                      ("effort=high", dict(thinking=True, effort="high")),
                      ("effort=max", dict(thinking=True, effort="max"))]:
        t0 = time.time()
        resp = chat([{"role": "user", "content": problem}], max_tokens=2000, **kw)
        dt = time.time() - t0
        _track(resp.usage)
        msg, u = resp.choices[0].message, resp.usage
        print(f"\n--- {label} | {dt:0.1f}s | out_tokens={getattr(u,'completion_tokens',0)} ---")
        r = get_reasoning(msg)
        if r:
            print("  [reasoning, first 220 chars]: " + " ".join(r.split())[:220] + " ...")
        print("  [answer]: " + " ".join((msg.content or '').split())[:350])

def demo_streaming():
    print("\n=== 3. STREAMING: reasoning channel vs answer channel ====")
    stream = chat([{"role": "user", "content":
                    "Explain why the sky is blue, then give a one-line TL;DR."}],
                  thinking=True, effort="high", stream=True, max_tokens=1200)
    saw_r = saw_a = False
    usage = None
    for chunk in stream:
        if getattr(chunk, "usage", None): usage = chunk.usage
        if not chunk.choices: continue
        delta = chunk.choices[0].delta
        r = get_reasoning(delta)
        if r:
            # Print the reasoning header once, then stream the trace.
            if not saw_r:
                print("\n[thinking] ", end="", flush=True)
                saw_r = True
            print(r, end="", flush=True)
        if getattr(delta, "content", None):
            # Print the answer header once, then stream the answer.
            if not saw_a:
                print("\n\n[answer] ", end="", flush=True)
                saw_a = True
            print(delta.content, end="", flush=True)
    print()
    _track(usage)
We start testing GLM-5.2 with basic chat, reasoning-effort control, and streaming output. We first run a simple sanity check, then compare the same problem across thinking-off, high-effort, and max-effort modes to observe changes in latency and output tokens. We also stream the model response so we can view the reasoning channel and the final answer separately as the response is being generated.
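When comparing the three modes it helps to know the correct answer to the train problem in advance. The following is a small independent check of the arithmetic, not part of the tutorial's code; the helper name `meeting_time` and the placeholder date are assumptions made for illustration.

```python
from datetime import datetime, timedelta

def meeting_time(distance_km, speed_a, start_a, speed_b, start_b):
    """Clock time at which two trains heading toward each other meet.

    Assumes train A departs no later than train B and that they have
    not already met by the time B departs.
    """
    head_start_h = (start_b - start_a).total_seconds() / 3600
    remaining_km = distance_km - speed_a * head_start_h   # gap when B departs
    hours_after_b = remaining_km / (speed_a + speed_b)    # closing speed is the sum
    return start_b + timedelta(hours=hours_after_b)

# The date is an arbitrary placeholder; only the clock time matters.
start_a = datetime(2000, 1, 1, 9, 0)
start_b = datetime(2000, 1, 1, 9, 30)
meet = meeting_time(420, 60, start_a, 90, start_b)
print(meet.strftime("%H:%M"))  # 12:06
```

By 9:30 train A has covered 30 km, leaving 390 km to close at 150 km/h, which takes 2.6 hours (2 h 36 min), so the trains meet at 12:06.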
Function Calling and a Multi-Step Tool-Using GLM-5.2 Agent
def tool_calculator(expression: str):
    # Allow only digits, arithmetic operators, parentheses, dots, spaces and %.
    if not re.fullmatch(r"[0-9+\-*/(). %]+", expression or ""):
        return {"error": "unsupported characters"}
    try: return {"result": eval(expression, {"__builtins__": {}}, {})}
    except Exception as e: return {"error": str(e)}

_CITY_POP = {"tokyo": 37_400_068, "delhi": 32_900_000, "shanghai": 28_500_000,
             "sao paulo": 22_400_000, "mexico city": 21_800_000}

def tool_city_population(city: str):
    return {"city": city, "population": _CITY_POP.get((city or "").strip().lower())}

TOOLS = [
    {"type": "function", "function": {
        "name": "calculator", "description": "Evaluate basic arithmetic like '37400068/21800000'.",
        "parameters": {"type": "object", "properties": {"expression": {"type": "string"}},
                       "required": ["expression"]}}},
    {"type": "function", "function": {
        "name": "city_population", "description": "Look up the metro population of a city.",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}},
                       "required": ["city"]}}},
]
TOOL_IMPLS = {"calculator": tool_calculator, "city_population": tool_city_population}

def run_tool_loop(messages, max_rounds=6, effort="max"):
    """Full loop: model -> tool_calls -> execute -> feed results back -> repeat."""
    for _ in range(max_rounds):
        resp = chat(messages, tools=TOOLS, thinking=True, effort=effort,
                    max_tokens=1500, temperature=0.3)
        _track(resp.usage)
        m = resp.choices[0].message
        if not getattr(m, "tool_calls", None):
            return m.content
        # Echo the assistant turn, including its tool calls, back into the history.
        messages.append({
            "role": "assistant", "content": m.content or "",
            "tool_calls": [{"id": tc.id, "type": "function",
                            "function": {"name": tc.function.name,
                                         "arguments": tc.function.arguments}}
                           for tc in m.tool_calls]})
        for tc in m.tool_calls:
            try: args = json.loads(tc.function.arguments or "{}")
            except json.JSONDecodeError: args = {}
            # Unpack the parsed arguments as keyword arguments for the tool.
            impl = TOOL_IMPLS.get(tc.function.name, lambda **k: {"error": "unknown"})
            try: result = impl(**args)
            except TypeError as e: result = {"error": f"bad arguments: {e}"}
            print(f"  ↳ {tc.function.name}({args}) -> {result}")
            messages.append({"role": "tool", "tool_call_id": tc.id,
                             "content": json.dumps(result)})
    return "(stopped: max tool rounds reached)"

def demo_tools():
    print("\n=== 4. FUNCTION / TOOL CALLING ===========================")
    q = ("How many times larger is Tokyo's metro population than Mexico City's? "
         "Use the tools, then answer with the ratio to one decimal place.")
    print("Final:", " ".join((run_tool_loop([{"role": "user", "content": q}]) or "").split()))

def demo_agent():
    print("\n=== 5. MINI MULTI-STEP AGENT (tools + max effort) ========")
    task = ("Rank Tokyo, Delhi, and Shanghai by metro population (largest first), "
            "then compute the combined population of the top two and report it. "
            "Use the tools for every lookup and sum; never guess numbers.")
    ans = run_tool_loop([{"role": "system", "content": "You are a careful analyst."},
                         {"role": "user", "content": task}])
    print("Final:", " ".join((ans or "").split()))
We connect GLM-5.2 to external tools and build a small tool-using workflow. We define a calculator and a city-population lookup tool, register them in an OpenAI-style tool schema, and create a loop in which the model requests tool calls and receives tool results. We then use this setup for a direct function-calling task and a small multi-step agent that looks up populations, ranks cities, and performs calculations without guessing.
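The calculator tool filters characters with a regex and then calls `eval`. That regex still admits inputs such as `9**9**9`, because `*` is an allowed character. As one possible hardening, here is a minimal sketch that walks the parsed expression with Python's `ast` module and permits only `+ - * / %` on numeric literals. The names `safe_arith` and `tool_calculator_ast` are hypothetical and not part of the tutorial.

```python
import ast
import operator

# Whitelisted operators; anything else (including **) is rejected.
_BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub, ast.Mult: operator.mul,
            ast.Div: operator.truediv, ast.Mod: operator.mod}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}

def safe_arith(expression: str):
    """Evaluate basic arithmetic by walking the AST instead of calling eval()."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BIN_OPS:
            return _BIN_OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](ev(node.operand))
        raise ValueError("unsupported expression")
    return ev(ast.parse(expression, mode="eval"))

def tool_calculator_ast(expression: str):
    """Drop-in shape for the calculator tool: returns {'result': ...} or {'error': ...}."""
    try:
        return {"result": safe_arith(expression or "")}
    except (ValueError, SyntaxError, ZeroDivisionError) as e:
        return {"error": str(e)}

ratio = tool_calculator_ast("37400068/21800000")     # Tokyo vs Mexico City
combined = tool_calculator_ast("37400068+32900000")  # Tokyo + Delhi
power = tool_calculator_ast("9**9**9")               # rejected, never evaluated
print(round(ratio["result"], 1), combined["result"], power)
```

The first two calls also give the reference answers for the two tool demos: a ratio of about 1.7, and a combined population of 70,300,068 for Tokyo and Delhi, the top two of the three cities in the lookup table.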
Structured JSON Output and Long-Context Retrieval with GLM-5.2
We focus on reliable, structured output and long-context retrieval. We create a JSON extraction helper, ask the model to return a strict JSON object, and retry once if the first response is not valid JSON. We also build a synthetic long document with a hidden “needle” and send it to GLM-5.2 to check whether the model retrieves the exact launch code from the provided context.
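The listing for this section does not include the helpers the paragraph describes, so the following is a minimal sketch under stated assumptions rather than the original code. The names `extract_json`, `ask_json`, and `build_haystack`, the prompt wording, the filler text, and the launch code value are all hypothetical. `ask` stands for any callable that takes a prompt string and returns the model's text.

```python
import json

def extract_json(text):
    """Return the first JSON object found in text, or None if nothing parses."""
    text = (text or "").strip()
    try:
        obj = json.loads(text)
        return obj if isinstance(obj, dict) else None
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost {...} span, which also covers replies
    # that wrap the object in prose or a code fence.
    start, end = text.find("{"), text.rfind("}")
    if start != -1 and end > start:
        try:
            obj = json.loads(text[start:end + 1])
            return obj if isinstance(obj, dict) else None
        except json.JSONDecodeError:
            return None
    return None

def ask_json(ask, prompt, retries=1):
    """Call ask(prompt); retry with a stricter instruction if the reply is not valid JSON."""
    for attempt in range(retries + 1):
        suffix = "" if attempt == 0 else "\nReturn ONLY a valid JSON object, no prose."
        obj = extract_json(ask(prompt + suffix))
        if obj is not None:
            return obj
    return None

def build_haystack(needle, n_filler=2000, position=0.6):
    """Synthetic long document: filler lines with one needle line inserted."""
    lines = [f"Log entry {i}: routine status nominal." for i in range(n_filler)]
    lines.insert(int(n_filler * position), needle)
    return "\n".join(lines)

# Offline check with a fake model that fails once, then returns JSON inside prose.
replies = iter(["Sure, here you go",
                'Here is the JSON: {"name": "GLM-5.2", "ok": true} Hope that helps.'])
parsed = ask_json(lambda prompt: next(replies), "Describe the model as JSON.")
doc = build_haystack("The launch code is ZEBRA-4417.")
print(parsed, len(doc.splitlines()))
```

To run this against the model, `ask` could be a thin wrapper such as `lambda p: chat([{"role": "user", "content": p}], thinking=False).choices[0].message.content`. The driver loop later in the tutorial lists `demo_structured` and `demo_long_context`; small functions built on helpers like these would be one way to supply them.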
Running All Demos with GLM-5.2 Token and Cost Accounting
def cost_summary():
    print("\n=== 8. TOKEN + COST ACCOUNTING ===========================")
    cost = _USAGE["in"]/1e6*PRICE_IN_PER_M + _USAGE["out"]/1e6*PRICE_OUT_PER_M
    print(f"  calls: {_USAGE['calls']} | input: {_USAGE['in']:,} tok | output: {_USAGE['out']:,} tok")
    print(f"  estimated spend @ ${PRICE_IN_PER_M}/{PRICE_OUT_PER_M} per 1M: ${cost:0.4f}")

DEMOS = [demo_basic, demo_effort, demo_streaming, demo_tools,
         demo_agent, demo_structured, demo_long_context]

print(f"Provider={PROVIDER} model={MODEL}")
for fn in DEMOS:
    try: fn()
    except Exception as e:
        print(f"  [skipped {fn.__name__}: {type(e).__name__}: {e}]")
cost_summary()
print("\nDone. Tweak PROVIDER / effort / max_tokens and re-run any demo function.")
We finish the tutorial by collecting usage information and running all demos from top to bottom. We calculate the estimated cost from total input and output tokens, then print a compact summary of calls, token counts, and spend. We also use a driver loop so that a single failed demo does not halt the entire notebook, making the tutorial easier to run, debug, and reuse.
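The cost estimate is a linear formula: tokens divided by one million, multiplied by the price per million, summed over input and output. Here is a standalone worked example using the tutorial's default prices; the function name `estimate_cost` and the token totals are made up for illustration.

```python
def estimate_cost(input_tokens, output_tokens, price_in_per_m=1.40, price_out_per_m=4.40):
    """Estimated spend: tokens / 1e6 * price per 1M tokens, summed over input and output."""
    return input_tokens / 1e6 * price_in_per_m + output_tokens / 1e6 * price_out_per_m

# Hypothetical totals for one notebook run.
cost = estimate_cost(input_tokens=50_000, output_tokens=20_000)
print(f"${cost:0.4f}")  # $0.1580
```

Here 50,000 input tokens contribute 0.07 and 20,000 output tokens contribute 0.088, for a total of 0.158. Actual prices vary by provider, so the defaults should be adjusted when `PROVIDER` changes.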
Conclusion
In conclusion, we now have a practical, reusable workflow for using GLM-5.2 in Python applications. We learned how to control its reasoning behavior, compare different thinking modes, connect it with tools, validate structured outputs, test long-context inputs, and monitor token usage with estimated cost. This gives us a strong starting point for building more advanced systems such as research assistants, document analysis tools, coding agents, long-context retrieval workflows, or API-based reasoning pipelines. The setup is lightweight enough for Colab yet close to how we would build with GLM-5.2 in a real project.
The post GLM-5.2 OpenAI-Compatible API: A Hands-On Guide to Reasoning Effort, Function Calling, and Long-Context Retrieval appeared first on MarkTechPost.