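Powering the agents: Workers AI now runs large models, starting with Kimi K2.5
Cloudflare has added Moonshot AI's Kimi K2.5 model to Workers AI, pairing a large context window with low-cost inference to strengthen its platform for building agents.
Key points
Kimi K2.5 integration, tuned for agents
Workers AI now offers a model suited to complex agent workloads, with a 256k context window, multi-turn tool calling, vision inputs, and structured outputs.
Demonstrated price-performance
Drawing on production use in its internal security review agent, Cloudflare reports a 77% cut in inference costs compared with running the same workload on a mid-tier proprietary model.
The whole agent lifecycle on one platform
Combining Durable Objects, Workflows, and Dynamic Workers with Workers AI models lets teams develop, deploy, and run agents on a single platform.
Infrastructure for always-on agents
Anticipating a surge in inference demand as personal and coding agents spread, Cloudflare is building infrastructure that balances scalability with cost efficiency.
Cost pressure drives a move to open-source models
As agents consume ever more tokens, the cost of proprietary models becomes the main barrier to scaling, so enterprises are expected to move to open-source models that offer frontier-level reasoning at lower cost.
Infrastructure optimized for large models
Workers AI built custom kernels for Kimi K2.5 on top of its Infire inference engine and uses data, tensor, and expert parallelization along with disaggregated prefill to raise GPU utilization and throughput.
A simpler developer experience
Because the platform handles model optimization and infrastructure management, developers can deploy and scale agents with API calls alone, without ML engineering or SRE expertise.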
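Impact analysis
By integrating frontier open-source models directly into its own platform, Cloudflare addresses the split between infrastructure and model in agent development and substantially improves its cost competitiveness. In particular, the combination of long context and low inference cost is likely to speed the practical adoption of autonomous agents that run around the clock, and it puts competitive pressure on existing cloud AI providers.
Editorial comment
The article backs its claims with data from Cloudflare's own internal tools and a concrete cost-reduction figure, showing economic viability at a practical level rather than merely announcing a feature.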
Original text
We're making Cloudflare the best place for building and deploying agents. But reliable agents aren't built on prompts alone; they require a robust, coordinated infrastructure of underlying primitives.
At Cloudflare, we have been building these primitives for years: Durable Objects for state persistence, Workflows for long running tasks, and Dynamic Workers or Sandbox containers for secure execution. Powerful abstractions like the Agents SDK are designed to help you build agents on top of Cloudflare’s Developer Platform.
But these primitives only provided the execution environment. The agent still needed a model capable of powering it.
Starting today, Workers AI is officially in the big models game. We now offer frontier open-source models on our AI inference platform. We’re starting by releasing Moonshot AI’s Kimi K2.5 model on Workers AI. With a full 256k context window and support for multi-turn tool calling, vision inputs, and structured outputs, the Kimi K2.5 model is excellent for all kinds of agentic tasks. By bringing a frontier-scale model directly into the Cloudflare Developer Platform, we’re making it possible to run the entire agent lifecycle on a single, unified platform.
The heart of an agent is the AI model that powers it, and that model needs to be smart, with high reasoning capabilities and a large context window. Workers AI now runs those models.
The price-performance sweet spot
We spent the last few weeks testing Kimi K2.5 as the engine for our internal development tools. Within our OpenCode environment, Cloudflare engineers use Kimi as a daily driver for agentic coding tasks. We have also integrated the model into our automated code review pipeline; you can see this in action via our public code review agent, Bonk, on Cloudflare GitHub repos. In production, the model has proven to be a fast, efficient alternative to larger proprietary models without sacrificing quality.
Serving Kimi K2.5 began as an experiment, but it quickly became critical after reviewing how the model performs and how cost-efficient it is. As an illustrative example: we have an agent that does security reviews of Cloudflare’s codebases. This agent processes over 7B tokens per day, and using Kimi, it has caught more than 15 confirmed issues in a single codebase. Doing some rough math, if we had run this agent on a mid-tier proprietary model, we would have spent $2.4M a year for this single use case, on a single codebase. Running this agent with Kimi K2.5 cost just a fraction of that: we cut costs by 77% simply by making the switch to Workers AI.
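The rough math above can be reproduced in a few lines. This is only a back-of-the-envelope sketch: it treats "over 7B tokens per day" as exactly 7B, assumes 365 days of operation, and collapses input, output, and cached tokens into a single blended rate, none of which the post states. The implied per-token figures are therefore illustrative, not published prices.

```javascript
// Figures quoted in the post (treated here as exact values, which is an assumption).
const tokensPerDay = 7e9;              // "over 7B tokens per day"
const proprietaryCostPerYear = 2.4e6;  // "$2.4M a year" on a mid-tier proprietary model
const savingsFraction = 0.77;          // "we cut costs by 77%"

// Derived values.
const tokensPerYear = tokensPerDay * 365;
const kimiCostPerYear = proprietaryCostPerYear * (1 - savingsFraction);

// Blended cost per one million tokens implied by a yearly bill.
const perMillion = (costPerYear) => costPerYear / (tokensPerYear / 1e6);

console.log(`tokens per year: ${tokensPerYear.toExponential(3)}`);
console.log(`implied Kimi cost per year: $${Math.round(kimiCostPerYear)}`);
console.log(`implied $/1M tokens, proprietary: ${perMillion(proprietaryCostPerYear).toFixed(2)}`);
console.log(`implied $/1M tokens, Kimi: ${perMillion(kimiCostPerYear).toFixed(2)}`);
```

Under these assumptions the 77% saving corresponds to roughly $552k a year instead of $2.4M for this one agent.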
As AI adoption increases, we are seeing a fundamental shift not only in how engineering teams are operating, but how individuals are operating. It is becoming increasingly common for people to have a personal agent like OpenClaw running 24/7. The volume of inference is skyrocketing.
This new rise in personal and coding agents means that cost is no longer a secondary concern; it is the primary blocker to scaling. When every employee has multiple agents processing hundreds of thousands of tokens per hour, the math for proprietary models stops working. Enterprises will look to transition to open-source models that offer frontier-level reasoning without the proprietary price tag. Workers AI is here to facilitate this shift, providing everything from serverless endpoints for a personal agent to dedicated instances powering autonomous agents across an entire organization.
The large model inference stack
Workers AI has served models, including LLMs, since its launch two years ago, but we’ve historically prioritized smaller models. Part of the reason was that for some time, open-source LLMs fell far behind the models from frontier model labs. This changed with models like Kimi K2.5, but to serve this type of very large LLM, we had to make changes to our inference stack. We wanted to share with you some of what goes on behind the scenes to support a model like Kimi.
We’ve been working on custom kernels for Kimi K2.5 to optimize how we serve the model, which is built on top of our proprietary Infire inference engine. Custom kernels improve the model’s performance and GPU utilization, unlocking gains that would otherwise go unclaimed if you were just running the model out of the box. There are also multiple techniques and hardware configurations that can be leveraged to serve a large model. Developers typically use a combination of data, tensor, and expert parallelization techniques to optimize model performance. Strategies like disaggregated prefill are also important, in which you separate the prefill and generation stages onto different machines in order to get better throughput or higher GPU utilization. Implementing these techniques and incorporating them into the inference stack takes a lot of dedicated experience to get right.
Workers AI has already done the experimentation with serving techniques to yield excellent throughput on Kimi K2.5. A lot of this does not come out of the box when you self-host an open-source model. The benefit of using a platform like Workers AI is that you don’t need to be a Machine Learning Engineer, a DevOps expert, or a Site Reliability Engineer to do the optimizations required to host it: we’ve already done the hard part, you just need to call an API.
Beyond the model — platform improvements for agentic workloads
In concert with this launch, we’ve also improved our platform and are releasing several new features to help you build better agents.
Prefix caching and surfacing cached tokens
When you work with agents, you are likely sending a large number of input tokens as part of the context: this could be detailed system prompts, tool definitions, MCP server tools, or entire codebases. Inputs can be as large as the model context window, so in theory, you could be sending requests with almost 256k input tokens. That’s a lot of tokens.
When an LLM processes a request, the request is broken down into two stages: the prefill stage processes input tokens and the output stage generates output tokens. These stages are usually sequential, where input tokens have to be fully processed before you can generate output tokens. This means that sometimes the GPU is not fully utilized while the model is doing prefill.
With multi-turn conversations, when you send a new prompt, the client sends all the previous prompts, tools, and context from the session to the model as well. The delta between consecutive requests is usually just a few new lines of input; all the other context has already gone through the prefill stage during a previous request. This is where prefix caching helps. Instead of doing prefill on the entire request, we can cache the input tensors from a previous request, and only do prefill on the new input tokens. This saves a lot of time and compute from the prefill stage, which means a faster Time to First Token (TTFT) and a higher Tokens Per Second (TPS) throughput as you’re not blocked on prefill.
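To make the idea concrete, here is a toy model of prefix caching. It is a simplification and not how Workers AI implements it: tokens are plain numbers, the "cache" is just the previous request's token list, and the function names are hypothetical. Real inference engines typically cache intermediate tensors in blocks rather than comparing token by token.

```javascript
// Length of the common leading run shared by two token sequences.
function sharedPrefixLength(cachedTokens, requestTokens) {
  const limit = Math.min(cachedTokens.length, requestTokens.length);
  let i = 0;
  while (i < limit && cachedTokens[i] === requestTokens[i]) i++;
  return i;
}

// Split a new request into tokens served from cache and tokens that still need prefill.
function prefillPlan(cachedTokens, requestTokens) {
  const cached = sharedPrefixLength(cachedTokens, requestTokens);
  return { cachedTokens: cached, tokensToPrefill: requestTokens.length - cached };
}

const turn1 = [101, 7, 7, 42, 9];      // system prompt, tools, first user message
const turn2 = [...turn1, 55, 56, 57];  // same history plus a short new message

console.log(prefillPlan(turn1, turn2)); // { cachedTokens: 5, tokensToPrefill: 3 }
```

Only the three new tokens go through prefill in the second turn, which is why a long shared history makes little difference to TTFT once it is cached.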
Workers AI has always done prefix caching, but we are now surfacing cached tokens as a usage metric and offering a discount on cached tokens compared to input tokens. (Pricing can be found on the model page.) We also have new techniques for you to leverage in order to get a higher prefix cache hit rate, reducing your costs.
New session affinity header for higher cache hit rates
In order to route to the same model instance and take advantage of prefix caching, we use a new x-session-affinity header. When you send this header, you'll improve your cache hit ratio, leading to more cached tokens and, in turn, a faster TTFT, higher TPS, and lower inference costs.
You can pass the new header like below, with a unique string per session or per agent. Some clients like OpenCode implement this automatically out of the box. Our Agents SDK starter has already set up the wiring to do this for you, too.
curl -X POST \
"https://api.cloudflare.com/client/v4/accounts/{ACCOUNT_ID}/ai/run/@cf/moonshotai/kimi-k2.5" \
-H "Authorization: Bearer {API_TOKEN}" \
-H "Content-Type: application/json" \
-H "x-session-affinity: ses_12345678" \
-d '{
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "What is prefix caching and why does it matter?"
}
],
"max_tokens": 2400,
"stream": true
}'
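The same request can be assembled in JavaScript. The sketch below only builds the arguments for `fetch()` and does not send anything. The URL and the `x-session-affinity` header come from the curl example; the helper name `buildKimiRequest` and its parameter names are hypothetical.

```javascript
// Hypothetical helper: builds fetch() arguments for the REST endpoint shown in the curl example.
function buildKimiRequest({ accountId, apiToken, sessionId, messages }) {
  const url =
    `https://api.cloudflare.com/client/v4/accounts/${accountId}` +
    `/ai/run/@cf/moonshotai/kimi-k2.5`;
  const init = {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${apiToken}`,
      "Content-Type": "application/json",
      // Reuse the same value for every request in a session or agent
      // so that requests can be routed to the same model instance.
      "x-session-affinity": sessionId,
    },
    body: JSON.stringify({ messages }),
  };
  return { url, init };
}

const { url, init } = buildKimiRequest({
  accountId: "ACCOUNT_ID",   // placeholder
  apiToken: "API_TOKEN",     // placeholder
  sessionId: "ses_12345678",
  messages: [{ role: "user", content: "What is prefix caching and why does it matter?" }],
});
// fetch(url, init) would send the request.
console.log(url, init.headers["x-session-affinity"]);
```

Keeping the session ID stable across turns is the point of the design: a fresh ID on every request would defeat the routing that makes cache hits likely.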
Redesigned async APIs
Serverless inference is really hard. With a pay-per-token business model, it’s cheaper on a single request basis because you don’t need to pay for entire GPUs to service your requests. But there’s a trade-off: you have to contend with other people’s traffic and capacity constraints, and there’s no strict guarantee that your request will be processed. This is not unique to Workers AI — it’s evidently the case across serverless model providers, given the frequent news reports of overloaded providers and service disruptions. While we always strive to serve your request and have built-in autoscaling and rebalancing, there are hard limitations (like hardware) that make this a challenge.
For volumes of requests that would exceed synchronous rate limits, you can submit batches of inferences to be completed asynchronously. We’re introducing a revamped Asynchronous API, which means that for asynchronous use cases, you won’t run into Out of Capacity errors and inference will execute durably at some point. Our async API looks more like flex processing than a batch API, where we process requests in the async queue as long as we have headroom in our model instances. With internal testing, our async requests usually execute within 5 minutes, but this will depend on what live traffic looks like. As we bring Kimi to the public, we will tune our scaling accordingly, but the async API is the best way to make sure you don’t run into capacity errors in durable workflows. This is perfect for use cases that are not real-time, such as code scanning agents or research agents.
Workers AI previously had an asynchronous API, but we’ve recently revamped the systems under the hood. We now rely on a pull-based system versus the historical push-based system, allowing us to pull in queued requests as soon as we have capacity. We’ve also added better controls to tune the throughput of async requests, monitoring GPU utilization in real-time and pulling in async requests when utilization is low, so that critical synchronous requests get priority while still processing asynchronous requests efficiently.
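The pull-based idea can be illustrated with a toy scheduler. This is a sketch of the general technique only, not Cloudflare's implementation: the class name, the single utilization number, and the 0.7 threshold are all assumptions made for illustration.

```javascript
// Toy pull-based queue: model instances ask for async work only when they have headroom.
class AsyncPuller {
  constructor(utilizationThreshold = 0.7) { // threshold value is an assumption
    this.queue = [];
    this.utilizationThreshold = utilizationThreshold;
  }

  // Async requests always land in the queue; they never fail for lack of capacity.
  enqueue(request) {
    this.queue.push(request);
  }

  // Called by a model instance with its current GPU utilization (0 to 1).
  // Returns up to maxBatch queued requests, oldest first, or nothing when busy.
  pull(currentUtilization, maxBatch = 1) {
    if (currentUtilization >= this.utilizationThreshold) return [];
    return this.queue.splice(0, maxBatch);
  }
}

const puller = new AsyncPuller();
["a", "b", "c"].forEach((r) => puller.enqueue(r));

const whenBusy = puller.pull(0.95, 2); // synchronous traffic has priority: nothing pulled
const whenIdle = puller.pull(0.3, 2);  // headroom available: oldest two requests pulled

console.log(whenBusy, whenIdle, puller.queue); // [] [ 'a', 'b' ] [ 'c' ]
```

With a push-based design the queue would have to guess when an instance can accept work; letting the instance pull means that decision is made where the utilization data lives.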
To use the asynchronous API, you would send your requests as seen below. We also have a way to set up event notifications so that you can know when the inference is complete instead of polling for the request.
// (1.) Push a batch of requests onto the queue by passing queueRequest: true.
// This snippet is meant to run inside a Worker handler where env.AI is bound.
const queued = await env.AI.run("@cf/moonshotai/kimi-k2.5", {
  requests: [
    { messages: [{ role: "user", content: "Tell me a joke" }] },
    { messages: [{ role: "user", content: "Explain the Pythagoras theorem" }] },
    // ...add more requests to the batch here
  ],
}, {
  queueRequest: true,
});

// (2.) Grab the request ID
let request_id;
if (queued && queued.request_id) {
  request_id = queued.request_id;
}

// (3.) Poll the status
const res = await env.AI.run("@cf/moonshotai/kimi-k2.5", {
  request_id: request_id,
});
if (res && (res.status === "queued" || res.status === "running")) {
  // Still pending: retry by polling again
  // ...
} else {
  return Response.json(res); // This will contain the final completed response
}
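The "retry by polling again" step is left open in the snippet. One possible way to fill it in is a small polling helper with exponential backoff, sketched here. The helper name, the delay values, and the attempt limit are assumptions; only the "queued" and "running" status values come from the example above. The status check is passed in as a function so the helper stays independent of any particular binding.

```javascript
// Hypothetical helper: calls getStatus() until the result is no longer pending.
// getStatus is any async function that resolves to an object with a `status` field.
async function pollUntilDone(
  getStatus,
  { initialDelayMs = 1000, maxDelayMs = 30000, maxAttempts = 20 } = {}
) {
  let delay = initialDelayMs;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const res = await getStatus();
    const pending = res && (res.status === "queued" || res.status === "running");
    if (!pending) return res; // final response (or an error payload) from the queue

    if (attempt < maxAttempts) {
      // Wait before the next check, doubling the delay up to the cap.
      await new Promise((resolve) => setTimeout(resolve, delay));
      delay = Math.min(delay * 2, maxDelayMs);
    }
  }
  throw new Error(`request still pending after ${maxAttempts} attempts`);
}

// Usage inside a Worker, assuming the request shape from the snippet above:
// const final = await pollUntilDone(() =>
//   env.AI.run("@cf/moonshotai/kimi-k2.5", { request_id })
// );
```

Since async requests may take minutes to complete, long polling inside a single invocation may not always be practical; the event notifications mentioned above avoid polling altogether.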
Try it out today
Get started with Kimi K2.5 on Workers AI today. You can read our developer docs to find out model information and pricing, and how to take advantage of prompt caching via session affinity headers and the asynchronous API. The Agents SDK starter also now uses Kimi K2.5 as its default model. You can also connect to Kimi K2.5 on Workers AI via OpenCode. For a live demo, try it in our playground.
And if this set of problems around serverless inference, ML optimizations, and GPU infrastructure sounds interesting to you, we're hiring!