Vercel Blog · May 12, 2026, 13:00 · about 14 min read

AI Gateway Production Index

#LLM #Reasoning #Agentic AI #Cost Optimization #Vercel

TL;DR

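Production data published by Vercel's AI Gateway shows that model choice in practice is driven by cost-effectiveness and use case rather than by benchmarks. It also points to a division of roles between Anthropic and Google and to the rise of OSS models.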

AI deep analysis · May 14, 2026, 06:03


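Key points

The inversion between spend and volume

In April 2026, Anthropic held 61% of spend while Google led token volume at 38%, a clear division of roles between high-quality reasoning and cheap, fast processing.

Rapid growth of agentic workloads

Agentic processing accounts for 59% of all token volume, double the share of six months earlier, as AI usage moves quickly from static Q&A to autonomous task execution.

Model selection by use case has taken hold

Work that carries back-office or legal risk goes to high-cost models (Claude Opus), while more tolerant uses such as personal assistants run on low-cost models (Gemini Flash). The cost of an error is the selection criterion.

Cost per token depends on the cost of failure

Cheap models are enough for personal assistants, but work with legal or financial risk needs higher-quality reasoning, which raises spend per token.

Providers specialize by use case, and there is no single winner

Anthropic leads in the high-stakes layer and Google in consumer, while OpenAI is spread evenly across layers, which makes it the least exposed to disruption in any single one.

The question is not "who is winning" but "in which use case"

There is no single dominant provider because there is no dominant use case. Each lab competes for a different layer of the same stack.

The changing shape of AI requests and the cost structure

The share of requests that involve tool calls roughly doubled, and those requests now account for 58.9% of token usage, shifting the center of gravity from chat to agents.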


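Impact analysis

This report is strong evidence that enterprise AI adoption strategy has shifted from a pure performance race to optimizing cost efficiency and risk management. Rather than committing to one model, developers will need an architecture (such as AI Gateway) that switches dynamically among multiple models according to each use case's tolerance for error.

Editorial comment

In an industry that tends to fixate on publicizing benchmark numbers, this is valuable data that reframes model selection in terms of money and risk in production. The surge in agentic usage in particular carries important implications for how the AI ecosystem is designed going forward.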

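Ask which AI model is best, and the answer changes before the ink dries. That is what happens in an industry where new models are released every week.

Every benchmark measures a different race, and every race crowns its own winner, but Vercel has a unique view of the industry through production workloads. AI Gateway serves tens of trillions of tokens across hundreds of models through real applications and agents.

What we are seeing:

Anthropic leads in spend despite a higher unit price, while Google leads in volume

OSS models are gaining traction, but there is no loyalty to specific labs

OpenAI's spend share is growing quickly after recent model updates

High-volume workloads route across more than 30 distinct models on average

Agentic workloads carry 59% of all token volume (doubling over 6 months)

This report is built on seven months of production traffic data from AI Gateway and reflects usage by more than 200,000 teams.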

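Anthropic leads in spend, Google leads in volume

Cost and volume rankings disagree because they measure different workloads, even for the same customer.

By spend in April 2026, Anthropic took 61%, Google 21%, and OpenAI 12%.

By token volume, the picture flipped. 38% of April traffic through AI Gateway was routed to Google, 26% to Anthropic, 13% to OpenAI, and 10% to xAI. Smaller labs split the remainder.

Some models aim to win by keeping cost per token low enough to carry huge volume, while others are priced so high that they suit only quality-critical work. These models are not competing for the same call. In aggregate, the same customer base sits on both leaderboards: premium reasoning calls go to Claude Opus, and cheap, fast calls go to Gemini Flash. Spend follows the high-stakes calls and volume follows the low-stakes ones, with each lab holding a different layer of the same applications.

The relationship between volume and spend also changes quickly at the lab level. A few specific signals:

Gemini Flash helped Google take the volume lead with a smaller share of spend

Claude Opus put Anthropic in the spend lead with less volume than Google

OpenAI's spend share tripled from March to April after the GPT-5.4/5.5 releases

Google's spend share rose from 8% in March to 21% in April as Gemini Flash usage scaled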

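Spend follows the cost of being wrong

The same divide between cost and volume appears at a finer grain within specific kinds of workloads:

Personal assistants account for 20% of cost on 40% of token volume

Coding agents are roughly balanced at about 22% of cost on 20% of tokens

Back-office agents run at 6% of cost on 15% of tokens

App generation runs at 7% of cost on 11% of tokens

Cost per token is a function of how expensive a wrong answer is for the use case. Personal assistants can run on cheap, fast models because mistakes affect only individual users and can be corrected quickly. Back-office workflows pay for stronger reasoning because errors can create legal, financial, or operational risks that outweigh the per-call savings. Per-token economics work like a stake map: the higher the cost of a mistake, the more an application spends per token.

The same pattern holds in the broader split between B2C and B2B. B2C applications generate many low-cost calls, while B2B applications run fewer, more expensive ones. Per token, B2B costs about twice as much as B2C.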

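No single provider wins every use case

Cutting the data by use case shows a fragmented provider landscape:

Anthropic leads, notably in software building

Google over-indexes in consumer

OpenAI is the most evenly distributed

xAI and others are spread across coding, consumer, and long-tail use cases

Anthropic's pattern is concentration in the high-stakes layer. As the workload moves from back office to consumer, Anthropic's token share falls from 71% to 7%. Its cost share follows a shallower curve and keeps the lead in three of the four categories. Revenue concentrates wherever the answer has to be right, regardless of how much volume passes through.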

⟦CODE_0⟧

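Google has the inverse shape. Its footprint is concentrated in consumer, where Gemini Flash carries 28% of tokens at 15% of cost, and it barely appears on the cost chart elsewhere. The position is a bet on a single SKU that rises and falls with Flash adoption.

xAI is a price wedge. Grok carries 20% of building tokens and 18% of outreach tokens at materially smaller cost shares in each. xAI wins on price-to-quality fit, and whoever matches its price closes the wedge.

OpenAI is the most balanced of the four, at 6% of building cost, 18% of consumer cost, and 28% of outreach cost. No single layer is load-bearing for OpenAI's overall share, which makes it the least exposed of the four to disruption in any one layer.

Open-weights families such as Kimi, MiniMax, and GLM rotate through the consumer and building tiers, where the cost ceiling is lowest. Their cost share stays small, but their token share within consumer and building is large enough that a cost-only view of the market understates them.

There is no single dominant provider across the market because there is no single dominant use case. The right question is not "Who is winning AI?" but "Which models are winning the use case I care about?" Labs that look closest to even on a blended chart are competing for different layers of the same stack.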

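Apps are becoming more agentic

Underneath all of this, the shape of production AI requests has changed. In April 2026, 22.2% of AI Gateway requests ended with a tool call, up from 11.4% in October 2025. Measured by tokens, the shift is larger: 58.9% of all tokens are now in requests that involve tool calls, up from 31.6% six months earlier.

By both measures, the agentic share roughly doubled in half a year, but the more telling number is the gap between the two shares. With 22.2% of requests carrying 58.9% of tokens, a tool-using request is about 2.6 times as token-heavy as the average request. The cost structure of AI has shifted from chat-shaped to agent-shaped, while headline request counts have barely changed.

Every round trip bills against the same meter, whether it is a function execution, an API call, a database query, or a code run, so an agent that makes ten tool calls bills roughly ten times the tokens a chat would. A chat bills one round trip per prompt; an agent bills a whole chain.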

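Leaderboards rank one model, but production teams at scale use 35 or more

At scale, multi-model stops being a choice and becomes the standard agent architecture.

Teams handling 1,000 to 10,000 requests use 3 distinct models on average. In the bucket of 10 million or more requests, the average is 35 models in regular use. The jump from 18 models in the 1 million to 10 million bucket to 35 in the 10 million-plus bucket marks the inflection point.

A 35-model fleet runs as a routing graph: a cheap classifier for intent detection, a frontier model for the reasoning step, an embedding model for retrieval, a fast model for summarization, and a vision model for screenshots. Every one of these models is swappable. If a provider raises prices, degrades in quality, or has an outage, traffic redistributes across the rest within hours. At the scale that generates most of the spend on the leaderboards, switching labs is closer to a config change than a vendor migration, and the usual story about lab lock-in inverts as you move up the request-volume curve.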

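New models are adopted rapidly

The same fleet design explains why new releases are absorbed so quickly. When a new version ships within a model family, traffic moves to it within weeks.

Claude Sonnet 4.6 absorbed most of the Sonnet family's share by its first full month after launch.

The Opus family is now on the same trajectory, with Claude Opus 4.7 taking share from Opus 4.6 along a nearly identical curve.

Predecessor models stayed live and routable on AI Gateway throughout both periods, but teams moved anyway. The migration is a config change, and the labs no longer set the upgrade timeline for their own product lines.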

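Provider outages have a hidden cost

About 3.5% of requests on AI Gateway complete after a fallback. That means the initial route hit an error, a rate limit, or a timeout, and the gateway reissued the request to a healthy alternative quickly enough that the user still received a successful response.

Measured in tokens, the rescue rate is 5.1%, and in dollars it is 4.9%. The token-weighted and cost-weighted rates are higher than the request-weighted rate because rescued requests are, on average, larger and more expensive. Long context windows hit rate limits more often than short ones, multi-step agent runs accumulate failures across steps, and heavy reasoning calls time out under sustained load. Each of these failure modes targets the expensive end of the workload, which is why the dollar rate sits above the request rate.

A provider's SLA measures request-level uptime, but a production application experiences cost-weighted uptime, and the two diverge on exactly the calls that paid for the model.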

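Conclusion: build for the workload, not the lab

Production workloads are designed for efficiency, reliability, and flexibility, not to track the latest model leaderboard.

Across six cuts of the same data, the underlying shape is consistent. Different labs win different layers of the same application, and the architecture that handles those layers is the one production teams at scale have already built.

This echoes the early cloud era. Teams first expanded compute (more instances, more regions, more redundancy) and cut per-unit cost later. The 35-model fleets at the top of the spend curve show the same pattern at a faster pace, and the optimization that follows happens at the routing layer.

For anyone shipping AI today:

Plan for multiple models across providers

Assume you will need fallbacks to optimize uptime and cost

Design routing as a core unit of the architecture from the start

We expect to revisit this data regularly as the patterns shift. The latest model rankings are available on the AI Gateway Leaderboards.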

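About this data

This analysis is based on anonymized, aggregate routing data from the Vercel AI Gateway through April 2026.

A few notes on measurement:

Spend uses market-rate pricing (published list price) to give a normalized view that is comparable across teams that bring their own API keys.

Volume counts the tokens routed through AI Gateway.

B2C, B2B, and use-case classifications are aggregates. No individual team or workload is identified.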

Original text

Ask which AI model is best, and the answer changes before the ink dries. That's what happens in an industry where new models are released weekly.

Every benchmark measures a different race, and every race crowns its own winner, but Vercel has a unique view of the industry through production workloads. AI Gateway serves tens of trillions of tokens across hundreds of models through real applications and agents.

What we're seeing:

Anthropic leads in spend despite a higher unit price, Google leads in volume

OSS models are gaining traction, but there is no loyalty to specific labs

OpenAI spend share is growing quickly after recent model updates

High-volume workloads route across 30+ distinct models on average

Agentic workloads carry 59% of all token volume (up 2x over 6 months)

This report is built on data from seven months of production traffic from AI Gateway, with usage from more than 200K unique teams.

Anthropic leads in spend; Google leads in volume

Cost and volume rankings disagree because they measure two different workloads, even for the same customer.

By spend in April 2026, Anthropic took 61%, Google 21%, and OpenAI 12%.

By token volume, the picture flipped. 38% of April traffic through AI Gateway routed to Google, 26% to Anthropic, 13% to OpenAI, and 10% to xAI. Smaller labs split the rest.

Some models are positioned to win by being cheap enough per token to carry huge volume, while others are priced high enough to make sense only for quality-critical work. The different models are not competing for the same call. In aggregate the same customer base sits on both leaderboards, with premium reasoning calls landing on Claude Opus and cheap fast calls landing on Gemini Flash. Spend follows the high-stakes calls, and volume follows the low-stakes ones, with the labs each holding a different layer of the same applications.
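As a rough illustration of how the two leaderboards relate, dividing a lab's spend share by its token share gives its spend per token relative to the blended average across the gateway. The sketch below uses only the April 2026 shares quoted above; the function name `relative_price_index` is an assumption made for this example, and xAI is left out because no spend share is quoted for it.

```python
# April 2026 shares quoted in the text, in percent of the gateway total.
SPEND_SHARE = {"Anthropic": 61, "Google": 21, "OpenAI": 12}
TOKEN_SHARE = {"Anthropic": 26, "Google": 38, "OpenAI": 13}


def relative_price_index(spend_share, token_share):
    """Return spend share / token share for each lab present in both maps.

    A value of 1.0 means the lab's tokens cost the blended average;
    above 1.0 means more spend per token than average, below means less.
    """
    return {
        lab: spend_share[lab] / token_share[lab]
        for lab in spend_share
        if lab in token_share
    }


index = relative_price_index(SPEND_SHARE, TOKEN_SHARE)
for lab, value in sorted(index.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{lab}: {value:.2f}x the blended average spend per token")
```

On these shares the index is well above 1 for Anthropic and well below 1 for Google, which is the spend-versus-volume split described above expressed as a single number per lab.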

Volume-vs-spend also changes quickly at the lab level. A few specific signals:

Gemini Flash helped Google take the lead on volume at a smaller share of spend

Claude Opus helps Anthropic lead on spend with less volume than Google

OpenAI's spend share tripled from March to April after the GPT-5.4/5.5 releases

Google's spend share climbed from 8% in March to 21% in April as Gemini Flash usage scaled

Spend follows the cost of being wrong

The same cost/volume divide exists at a finer grain inside specific kinds of workloads:

Personal assistants account for 20% of cost on 40% of token volume

Coding agents sit roughly balanced at 22% of cost on 20% of tokens

Back office agents run at 6% of cost on 15% of tokens

App generation runs at 7% of cost on 11% of tokens

What a workload spends per token is a function of how expensive a wrong answer is to the use case. Personal assistants can run on cheap, fast models because mistakes only impact individual users and are quickly corrected. Back-office workflows pay for stronger reasoning because errors can trigger legal, financial, or operational risks that outweigh the per-call savings. The per-token economics are a stake map: applications spend more per token when mistakes cost more.

The same pattern holds in a broader B2C/B2B split. B2C applications generate many low-cost calls, while B2B applications run fewer, more expensive ones. On a per-token basis, B2B costs roughly two times as much as B2C.

No single provider wins across use cases

Cutting the data by use case shows a fragmented provider landscape:

Anthropic notably leads in software building

Google over-indexes in consumer

OpenAI is the most evenly distributed

xAI and others are split across coding, consumer, and long-tail use cases

Anthropic's pattern is concentration at the high-stakes layer. As the workload moves from back office to consumer, Anthropic's token share drops from 71% down to 7%. Its cost share follows a much shallower curve and keeps the lead through three of the four categories. The revenue concentrates wherever the answer has to be right, regardless of how much volume passes through.

Google is the inverse shape. Its footprint concentrates in consumer, where Gemini Flash carries 28% of tokens at 15% of cost, and barely appears on the cost chart outside it. The position is a single-SKU bet that rises and falls with Flash adoption.

xAI is a price wedge. Grok carries 20% of building tokens and 18% of outreach tokens at materially smaller cost shares in each. xAI wins on price-to-quality fit, and whoever matches the price closes the wedge.

OpenAI is the most balanced of the four at 6% of building cost, 18% of consumer cost, and 28% of outreach cost. No single layer is load-bearing for OpenAI's overall share, which makes the company the least exposed of the four to disruption in any one layer.

Open-weights families like Kimi, MiniMax, and GLM rotate through the consumer and building tiers where the cost ceiling is lowest. Their cost share stays small, and their token share inside consumer and building is large enough that any cost-only view of the market understates them.

There is no single dominant provider across the whole market because there is no single dominant use case. The right question is not "Who is winning AI?" but "Which models are winning the use case I care about?" The labs that look closest to even on a blended chart are competing for different layers of the same stack.

Apps are becoming more agentic

The shape of production AI requests has changed underneath all of this. In April 2026, 22.2% of AI Gateway requests ended with a tool call, up from 11.4% in October 2025. Measured by tokens, the shift is bigger. 58.9% of all tokens are now in tool-call requests, up from 31.6% six months ago.

By both measures the agentic share roughly doubled in half a year, but the more telling number is the gap between the two shares. 22.2% of requests carry 58.9% of tokens, which means a tool-using request is about 2.6× as token-heavy as the average request (58.9 / 22.2). The cost surface of AI has shifted from chat-shaped to agent-shaped, while headline request counts barely budged.
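The arithmetic behind the token-intensity figure can be checked directly. The sketch below assumes the figure is the token share divided by the request share, which compares a tool-call request with the average request. It also computes the comparison against non-tool requests only, which is a different and larger ratio. The function name `token_intensity` is hypothetical.

```python
def token_intensity(request_share, token_share):
    """Compare tokens per tool-call request against two baselines.

    Both shares are fractions between 0 and 1 (exclusive).
    Returns (vs_average, vs_rest):
      vs_average: tokens per tool-call request / tokens per average request
      vs_rest:    tokens per tool-call request / tokens per non-tool request
    """
    vs_average = token_share / request_share
    rest_vs_average = (1 - token_share) / (1 - request_share)
    vs_rest = vs_average / rest_vs_average
    return vs_average, vs_rest


# April 2026 shares quoted in the text.
vs_average, vs_rest = token_intensity(request_share=0.222, token_share=0.589)
print(f"vs. the average request:  {vs_average:.2f}x")
print(f"vs. non-tool requests:    {vs_rest:.2f}x")
```

Which baseline is appropriate depends on the question being asked: the first describes how far tool-call requests sit above the blended average, the second how far they sit above plain chat-style requests.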

Every kind of round trip bills against the same meter, whether it's a function execution, an API call, a database query, or a code run, so an agent shipping ten tool calls bills roughly ten times the tokens a chat would. Where a chat bills one round trip per prompt, an agent bills a chain.

Leaderboards rank one model, but production teams use 35+ at scale

At scale, multi-model stops being a choice and becomes standard agent architecture.

Teams running 1K to 10K requests averaged 3 distinct models. By the 10M+ requests bucket, the average is 35 models in regular use. The jump from 18 models in the 1M to 10M bucket to 35 in the 10M+ bucket is the inflection point.

A 35-model fleet runs as a routing graph, with a cheap classifier for intent detection, a frontier model for the reasoning step, an embedding model for retrieval, a fast model for summarization, and a vision model for screenshots. Every one of those models is swappable. If a provider raises prices, degrades quality, or has an outage, traffic redistributes across the rest in hours. At the scale that produces most of the spend on the leaderboards, switching between labs is closer to a config change than to a vendor migration, and the standard story about lab lock-in inverts the higher you go on the request-volume curve.
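A minimal sketch of what "swappable" can look like in practice: each step of the routing graph maps to an ordered list of candidate models, so changing labs is an edit to a table rather than to application code. The task names, model names, and the `pick_model` helper are all hypothetical placeholders, not AI Gateway configuration.

```python
# Hypothetical routing table: task -> candidate models in preference order.
# Model names are placeholders, not real model identifiers.
ROUTES = {
    "intent": ["cheap-classifier-a", "cheap-classifier-b"],
    "reasoning": ["frontier-model-a", "frontier-model-b"],
    "retrieval": ["embedding-model-a"],
    "summarize": ["fast-model-a", "fast-model-b"],
    "vision": ["vision-model-a"],
}


def pick_model(task, unavailable=frozenset()):
    """Return the first preferred model for a task that is not marked unavailable."""
    for model in ROUTES[task]:
        if model not in unavailable:
            return model
    raise LookupError(f"no available model for task {task!r}")


# Normal operation: the first choice is used.
print(pick_model("reasoning"))
# After an outage or price change, the same call resolves to the next candidate.
print(pick_model("reasoning", unavailable={"frontier-model-a"}))
```

Keeping the preference order in data is the design choice that makes a switch between labs look like a config change: the calling code asks for a task, not for a vendor.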

New models are adopted rapidly

The same fleet design explains how fast new releases get absorbed. When a new version ships inside a model family, traffic moves to it within weeks.

Claude Sonnet 4.6 absorbed most of the Sonnet family's share by its first full month after launch.

The Opus family is moving through the same shape now, with Claude Opus 4.7 taking share from Opus 4.6 on a near-identical curve.

Predecessor models stayed live and routable on AI Gateway throughout both windows, but teams moved anyway. The migration is a config change, and the labs no longer set the upgrade timeline of their own product lines.

Provider outages have a hidden cost

Roughly 3.5% of requests on AI Gateway complete after a fallback. That means the initial route hit an error, a rate limit, or a timeout, and the gateway reissued the request to a healthy alternative fast enough that the user still got a successful response.
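The fallback behavior described above can be sketched as a loop over providers in preference order. This is an illustration under stated assumptions, not AI Gateway's implementation: `ProviderError`, `call_with_fallback`, and the two stand-in provider functions are hypothetical, and a real gateway would also handle retries, deadlines, and streaming.

```python
class ProviderError(Exception):
    """Hypothetical error type standing in for provider errors and rate limits."""


def call_with_fallback(prompt, providers):
    """Try each (name, callable) in order and return the first successful result.

    The result records whether the request was "rescued", meaning it
    completed on a provider other than the first one tried.
    """
    errors = []
    for position, (name, call) in enumerate(providers):
        try:
            text = call(prompt)
        except (ProviderError, TimeoutError) as exc:
            errors.append((name, repr(exc)))
            continue
        return {"provider": name, "text": text, "rescued": position > 0}
    raise RuntimeError(f"all providers failed: {errors}")


# Stand-in providers for the example.
def rate_limited_provider(prompt):
    raise ProviderError("rate limited")


def healthy_provider(prompt):
    return f"echo: {prompt}"


result = call_with_fallback(
    "hello",
    [("primary", rate_limited_provider), ("backup", healthy_provider)],
)
print(result)
```

From the caller's point of view the request simply succeeded; the `rescued` flag is what lets a gateway count how many requests completed only after a fallback.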

Measured in tokens the rescue rate runs at 5.1%, and in dollars at 4.9%. The token-weighted and cost-weighted rates run higher than the request-weighted rate because the requests that get rescued are, on average, bigger and more expensive than the ones that don't. Long context windows hit rate limits more often than short ones, multi-step agent runs accumulate failure across steps, and heavy reasoning calls time out under sustained load. Each of those failure modes targets the expensive end of the workload, which is why the dollar rate sits higher than the request rate.
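To see why the three rates differ, the same set of requests can be weighted three ways. The sketch below uses made-up records purely for illustration (they are not AI Gateway data), and the `RequestRecord` type and `rescue_rates` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RequestRecord:
    tokens: int
    cost_usd: float
    rescued: bool  # True if the request completed only after a fallback


def rescue_rates(records):
    """Return the rescue rate weighted by request count, by tokens, and by cost."""
    rescued = [r for r in records if r.rescued]
    return {
        "by_request": len(rescued) / len(records),
        "by_tokens": sum(r.tokens for r in rescued) / sum(r.tokens for r in records),
        "by_cost": sum(r.cost_usd for r in rescued) / sum(r.cost_usd for r in records),
    }


# Illustrative data: nine small requests that succeed directly,
# plus one large request that needed a fallback.
records = [RequestRecord(tokens=1_000, cost_usd=0.01, rescued=False) for _ in range(9)]
records.append(RequestRecord(tokens=11_000, cost_usd=0.11, rescued=True))

rates = rescue_rates(records)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

In this toy data one request in ten is rescued, but because it is the largest one it accounts for more than half of the tokens and cost, which is the same direction of skew the text describes at a much smaller magnitude.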

A provider's SLA measures request-level uptime, but a production application experiences cost-weighted uptime, and the two come apart on exactly the calls that paid for the model.

Conclusion: Build for workload, not the lab

Production workloads are designed for efficiency, reliability, and flexibility, not to match the latest model leaderboards.

Across six cuts of the same data, the shape underneath stays the same. Different labs win different layers of the same applications, and the architecture that handles those layers is the one production teams at scale have already built for.

This echoes the early cloud era. Teams expanded compute first (more instances, regions, redundancy) and squeezed per-unit cost later. The 35-model fleets visible at the top of the spend curve are the same pattern at a faster cadence; the optimization that follows happens at the routing layer.

For anyone shipping AI today:

Plan for multiple models across providers

Assume the need for fallbacks to optimize for uptime and cost

Design routing as a core unit of architecture from the beginning

We expect to revisit this data on a recurring cadence as the patterns shift. Live model rankings are available on the AI Gateway Leaderboards.

About this data

This analysis is based on anonymized, aggregate routing data from the Vercel AI Gateway through April 2026.

A few notes on measurement:

Spend uses market-rate pricing (published list price) to provide a normalized view across teams that bring their own API keys.

Volume counts tokens routed through AI Gateway.

B2C, B2B, and use-case classifications are aggregate. No individual team or workload is identified.
