読み込み中…

Cloudflare Blog·2026年4月16日 23:00·約10分

CloudflareのAIプラットフォーム：エージェント向けに設計された推論レイヤー

#AIエージェント #推論インフラ #モデルルーティング #Cloudflare AI Gateway #マルチプロバイダー

TL;DR

Cloudflareはエージェント開発向けに、複数プロバイダーの70以上のAIモデルを統一APIでアクセス可能な推論層「AI Gateway」を提供開始した。

AI深層分析2026年4月16日 23:41

重要/ 5段階

深度40%

キーポイント

エージェント向け推論層の統一化

複雑なエージェントワークフローに対応するため、単一APIで複数モデルの呼び出しとコスト・レイテンシ管理を一元化する。

モデル切り替えの簡素化

WorkersのAI.run()バインディングを流用し、プロバイダー変更を1行コードで可能にする。

多様なモデルカタログの提供

OpenAIやAnthropicなど12以上のプロバイダーから70以上のモデル（画像・動画・音声含む）を統一クレジットで利用可能。

信頼性向上機能の強化

アップストリーム障害時の自動リトライやログ制御を標準搭載し、エージェントの連鎖呼び出しにおける失敗カスケードを防ぐ。

Cogコンテナのデプロイと将来機能

`cog build`でモデルをコンテナ化しWorkers AIにデプロイ可能。今後wranglerコマンドやGPUスナップショットによるコールドスタートの高速化を実現し、一般公開を目指す。

エージェント向け超低遅延ネットワーク

AI Gatewayは330都市のグローバルネットワークを活用し最初のトークン生成時間を最小化。エージェント最適化モデルを同一ネットワーク内でホストすることで余分なインターネット経由を排除し遅延を削減。

自動フェイルオーバーによる高信頼性

AI Gatewayはプロバイダー障害時に自動的に代替プロバイダーへルーティングするため、カスタムフェイルオーバーロジックを記述せずにエージェントワークフローの信頼性を確保できる。

重要な引用

This means you need access to all the models, without tying yourself financially and operationally to a single provider.

An agent might chain ten calls together to complete a single task and suddenly, a single slow provider doesn't add 50ms, it adds 500ms.

Today, we’re making Cloudflare into a unified inference layer: one API to access any AI model from any provider, built to be fast and reliable.

Using Workers AI models with AI Gateway is particularly powerful if you’re building live agents – where a user's perception of speed hinges on time to first token or how quickly the agent starts responding, rather than how long the full response takes.

Through AI Gateway, if you're calling a model that's available on multiple providers and one provider goes down, we'll automatically route to another available provider without you having to write any failover logic of your own.

If your agent is interrupted mid-inference, it can reconnect to AI Gateway and retrieve the response without having to make a new inference call or paying twice for the same output tokens.

影響分析・編集コメントを表示

影響分析

本発表は、AIエージェントの実装において「モデルベンダーロックイン」を回避し、柔軟なプロビジョニングを可能にするインフラ標準化の動きを示している。開発者は技術選定の自由度が高まり、大規模エージェント運用時のコスト最適化と障害耐性が現実的なものとなる。これにより、次世代AIアプリ開発のハードルが下がり、マルチモデル活用が業界標準となる可能性が高い。

編集コメント

単一モデル依存のリスクを解消する「推論層」のアブストラクションは、エージェント時代の必須インフラとなり得る。ただし、モデルカタログの拡大に伴う品質管理とコスト透明性の監視が今後の運用課題となる。

AIモデルは急速に進化しています：今日エージェント型コーディング（agentic coding）に最適なモデルが、3ヶ月後には全く異なるプロバイダーの別モデルになっている可能性があります。さらに、実際のユースケースでは複数のモデルを呼び出す必要があることがよくあります。カスタマーサポートエージェントは、ユーザーのメッセージを分類するために高速で低コストなモデルを使用し、行動計画を立てるために大規模な推論モデル（reasoning model）を、個々のタスクを実行するために軽量なモデルを使用するかもしれません。

つまり、単一のプロバイダーに金銭的・運用面で縛られることなく、すべてのモデルにアクセスできる環境が必要です。また、プロバイダー間でコストを監視し、いずれかのプロバイダーに障害が発生した際の信頼性を確保し、ユーザーの場所に関わらずレイテンシ（latency）を管理するための適切なシステムを整備することも不可欠です。

これらの課題はAIを活用した開発において常に存在しますが、エージェント（agents）を構築する際にはさらに緊急性が増します。シンプルなチャットボットは、ユーザーのプロンプトごとに1回の推論呼び出し（inference call）を行うだけで済みますが、エージェントは単一のタスクを完了するために10回の呼び出しをチェーンする可能性があります。すると、単一のプロバイダーの遅延が50msではなく500ms追加されることになります。1回の失敗したリクエストは単なる再試行で済むはずですが、突然、下流の障害が連鎖する事態に陥ります。

AI GatewayとWorkers AIのリリース以来、Cloudflare上でAI搭載アプリケーションを構築する開発者から素晴らしい採用実績を得ており、追いつくために高速でリリースを続けてきました！過去数ヶ月の間に、ダッシュボードを更新し、セットアップ不要のデフォルトゲートウェイ（default gateways）を追加し、アップストリーム障害時の自動リトライ機能とより細粒度なログ制御（logging controls）を導入しました。本日、Cloudflareを統一された推論レイヤー（unified inference layer）へと進化させます。あらゆるプロバイダーのAIモデルにアクセスできる単一APIであり、高速かつ信頼性の高い構築を目指しています。

1つのカタログ、1つの統一エンドポイント

今日から、すでにWorkers AIでご利用いただいている同じAI.run()バインディング（AI.run() binding）を使用して、サードパーティ製モデルを呼び出すことができます。Workersをご利用の場合、Cloudflareホスト型モデルからOpenAI、Anthropic、またはその他のプロバイダーのモデルへ切り替えるのは、1行の変更で済みます。

javascript

const response = await env.AI.run('anthropic/claude-opus-4-6',{
input: 'What is Cloudflare?',
}, {
gateway: { id: "default" },
});

Workersをご利用でない方も、今後数週間以内にREST APIサポート（REST API support）をリリースする予定です。これにより、あらゆる環境からフルモデルカタログにアクセスできるようになります。

また、12以上のプロバイダーにまたがる70以上のモデルにアクセスできるようになったことをご報告できることを嬉しく思います。これらはすべて単一APIで、モデル間の切り替えは1行のコードで、支払いには単一のクレジット体系を使用します。さらに、私たちは進行中にこれを急速に拡大しています。

ユースケースに最適なモデルをお探しの方は、Cloudflare Workers AI上でホストされているオープンソースモデル（open-source models）から、主要なモデルプロバイダーが提供する独自モデル（proprietary models）まで、当社のモデルカタログから閲覧して選定できます。Alibaba Cloud、AssemblyAI、Bytedance、Google、InWorld、MiniMax、OpenAI、Pixverse、Recraft、Runway、Viduからのモデル提供をAI Gatewayを通じて拡大できることを嬉しく思います。特筆すべきは、画像・動画・音声モデルも提供範囲に追加し、マルチモーダルアプリケーション（multimodal applications）の構築が可能になった点です。

image

1つのAPIを通じてすべてのモデルにアクセスできるということは、AIコスト（AI spend）を一元管理できることを意味します。現在、多くの企業は複数のプロバイダーから平均3.5種類のモデルを呼び出しており、どの単一プロバイダーもAI利用状況を包括的に把握することはできません。AI Gatewayを利用すれば、AIコストの監視と管理を行うための一元化されたプラットフォームが得られます。

リクエストにカスタムメタデータ（custom metadata）を含めることで、無料ユーザーと有料ユーザーの支出、個別顧客、アプリ内の特定のワークフロー（workflows）など、最も関心の高い属性に基づいたコスト内訳を取得できます。

const response = await env.AI.run('@cf/moonshotai/kimi-k2.5',

{

prompt: 'What is AI Gateway?'

{

metadata: { "teamId": "AI", "userId": 12345 }

}

);

image

Bring your own model

AI Gatewayは、1つのAPIを通じてすべてのプロバイダーのモデルにアクセスできる機能を提供します。ただし、独自のデータでファインチューニング（fine-tuned）されたモデルや、特定のユースケースに最適化されたモデルを実行する必要がある場合もあります。そのために、当社はユーザーが独自モデルをWorkers AIに持ち込むことができるよう開発を進めています。

当社のトラフィックの圧倒的多数は、プラットフォーム上でカスタムモデルを実行するエンタープライズ顧客（Enterprise customers）向けの専用インスタンス（dedicated instances）から発生しており、この機能をより多くの顧客に提供したいと考えています。そのために、機械学習モデルのコンテナ化（containerize）を支援するReplicateのCog技術を活用しています。

Cogは非常にシンプルに設計されています。cog.yamlファイルに依存関係を記述し、Pythonファイルに推論コード（inference code）を記述するだけです。Cogは、CUDA依存関係（CUDA dependencies）、Pythonのバージョン、重み読み込み（weight loading）など、MLモデルのパッケージ化（packaging ML models）に関する複雑な処理をすべて抽象化しています。

cog.yamlファイルの例:

build:

python_version: "3.13"

python_requirements: requirements.txt

predict: "predict.py:Predictor"

predict.pyファイルの例（モデルの設定を行う関数と、推論リクエスト（予測）を受信した際に実行される関数を備えたファイル）:

from cog import BasePredictor, Path, Input

import torch

class Predictor(BasePredictor):

def setup(self):

"""複数の予測を効率的に実行するためにモデルをメモリに読み込む"""

self.net = torch.load("weights.pth")

def predict(self,

image: Path = Input(description="拡大する画像"),

scale: float = Input(description="画像を拡大する倍率", default=1.5)

) -> Path:

"""モデルに対して単一の予測を実行する"""

# ... 前処理 ...

output = self.net(input)

# ... 後処理 ...

return output

次に、cog buildコマンドを実行してコンテナイメージをビルドし、CogコンテナをWorkers AIにプッシュできます。モデルのデプロイとサービングは当社が行い、その後、通常のWorkers AI APIを通じてアクセスしていただけます。

さらに、より多くの顧客にこの機能を提供できるよう、カスタマー向けAPIやwranglerコマンド（wrangler commands）の実装、およびGPUスナップショット（GPU snapshotting）によるコールドスタートの高速化といった大規模プロジェクトに取り組んでいます。Cloudflare社内チームや、当社のビジョンを導いてくれる外部顧客数名を対象に内部テストを実施しており、もし設計パートナーとしてご興味がある場合はお気軽にお問い合わせください。まもなく、誰でもモデルをパッケージ化し、Workers AIを通じて利用できるようになります。

初回トークン到達時間の高速化

AI GatewayとWorkers AIモデルを組み合わせることは、特にリアルタイムエージェント（live agents）の構築において強力です。これはユーザーが速度を認識する基準が、完全な応答にかかる時間ではなく、初回トークン到達時間（time to first token）やエージェントが応答を始める速さに依存するためです。全体の推論に3秒かかっても、最初のトークンを50ms早く取得できれば、キビキビとしたエージェントと重たいエージェントとの違いが生じます。

世界中の330都市に展開するCloudflareのデータセンターネットワークにより、AI Gatewayはユーザーと推論エンドポイント（inference endpoints）の両方に近接した位置に配置され、ストリーミング開始前のネットワーク時間を最小限に抑えています。

Workers AIの公開カタログにはオープンソースモデル（open-source models）もホストされており、現在ではKimi K2.5やリアルタイム音声モデル（real-time voice models）など、エージェント用に特別に設計された大規模モデルが含まれています。これらのCloudflareホスト型モデルをAI Gateway経由で呼び出す場合、コードと推論が同じグローバルネットワーク上で実行されるため、パブリックインターネットを介した追加のホップが発生せず、エージェントに可能な限り低いレイテンシを提供します。

自動フェールオーバーによる信頼性の構築

エージェントを構築する際、速度がユーザーが気にする唯一の要素ではありません。信頼性も重要です。エージェントワークフローのすべてのステップは、その前のステップに依存しています。1つの呼び出しが失敗すると全体の後続チェーンに影響を与える可能性があるため、信頼性の高い推論はエージェントにとって不可欠です。

AI Gatewayを通じて、複数のプロバイダーで利用可能なモデルを呼び出している際に1つのプロバイダーがダウンした場合、独自のフェールオーバーロジック（failover logic）を記述する必要なく、自動的に他の利用可能なプロバイダーへルーティングされます。

Agents SDKを使用して長時間実行されるエージェントを構築している場合、ストリーミング推論呼び出し (streaming inference calls) も切断 (disconnects) に対して耐性があります。AI Gatewayはエージェントの存続期間とは無関係に、生成されるストリーミング応答をバッファリングします。推論中にエージェントが中断された場合、AI Gatewayに再接続して応答を取得でき、新しい推論呼び出しを行う必要も、同じ出力トークン (output tokens) に対して二度課金されることもありません。Agents SDKの組み込みチェックポイント機能 (checkpointing) と組み合わせることで、エンドユーザーはこれらの事象に気づくことなく利用できます。

Replicate

Replicateチームは正式に当社のAI Platformチームに統合され、もはや別々のチームとして区別することすらなくなりました。ReplicateとCloudflare間の統合作業に注力しており、その一環としてすべてのReplicateモデルをAI Gatewayへ移行し、ホストされているモデルをCloudflareインフラストラクチャ (Cloudflare infrastructure) 上で再プラットフォーム化しています。まもなく、Replicateで愛用していたモデルをAI Gateway経由でアクセスできるようになるだけでなく、ReplicateにデプロイしたモデルをWorkers AI上でホストすることも可能になります。

はじめる

使い始めるには、AI GatewayまたはWorkers AIのドキュメントをご覧ください。Agents SDKを通じてCloudflare上でエージェントを構築する方法について詳しく学びましょう。

Cloudflare TVで視聴

原文を表示

AI models are changing quickly: the best model to use for agentic coding today might in three months be a completely different model from a different provider. On top of this, real-world use cases often require calling more than one model. Your customer support agent might use a fast, cheap model to classify a user's message; a large, reasoning model to plan its actions; and a lightweight model to execute individual tasks.

This means you need access to all the models, without tying yourself financially and operationally to a single provider. You also need the right systems in place to monitor costs across providers, ensure reliability when one of them has an outage, and manage latency no matter where your users are.

These challenges are present whenever you’re building with AI, but they get even more pressing when you’re building agents. A simple chatbot might make one inference call per user prompt. An agent might chain ten calls together to complete a single task and suddenly, a single slow provider doesn't add 50ms, it adds 500ms. One failed request isn't a retry, but suddenly a cascade of downstream failures.

Since launching AI Gateway and Workers AI, we’ve seen incredible adoption from developers building AI-powered applications on Cloudflare and we’ve been shipping fast to keep up! In just the past few months, we've refreshed the dashboard, added zero-setup default gateways, automatic retries on upstream failures, and more granular logging controls. Today, we’re making Cloudflare into a unified inference layer: one API to access any AI model from any provider, built to be fast and reliable.

One catalog, one unified endpoint

Starting today, you can call third-party models using the same AI.run() binding you already use for Workers AI. If you’re using Workers, switching from a Cloudflare-hosted model to one from OpenAI, Anthropic, or any other provider is a one-line change.

const response = await env.AI.run('anthropic/claude-opus-4-6',{

input: 'What is Cloudflare?',

}, {

gateway: { id: "default" },

});

For those who don’t use Workers, we’ll be releasing REST API support in the coming weeks, so you can access the full model catalog from any environment.

We’re also excited to share that you'll now have access to 70+ models across 12+ providers — all through one API, one line of code to switch between them, and one set of credits to pay for them. And we’re quickly expanding this as we go.

You can browse through our model catalog to find the best model for your use case, from open-source models hosted on Cloudflare Workers AI to proprietary models from the major model providers. We’re excited to be expanding access to models from Alibaba Cloud, AssemblyAI, Bytedance, Google, InWorld, MiniMax, OpenAI, Pixverse, Recraft, Runway, and Vidu — who will provide their models through AI Gateway. Notably, we’re expanding our model offerings to include image, video, and speech models so that you can build multimodal applications

image

Accessing all your models through one API also means you can manage all your AI spend in one place. Most companies today are calling an average of 3.5 models across multiple providers, which means no one provider is able to give you a holistic view of your AI usage. With AI Gateway, you’ll get one centralized place to monitor and manage AI spend.

By including custom metadata with your requests, you can get a breakdown of your costs on the attributes that you care about most, like spend by free vs. paid users, by individual customers, or by specific workflows in your app.

const response = await env.AI.run('@cf/moonshotai/kimi-k2.5',

{

prompt: 'What is AI Gateway?'

{

metadata: { "teamId": "AI", "userId": 12345 }

}

);

image

Bring your own model

AI Gateway gives you access to models from all the providers through one API. But sometimes you need to run a model you've fine-tuned on your own data or one optimized for your specific use case. For that, we are working on letting users bring their own model to Workers AI.

The overwhelming majority of our traffic comes from dedicated instances for Enterprise customers who are running custom models on our platform, and we want to bring this to more customers. To do this, we leverage Replicate’s Cog technology to help you containerize machine learning models.

Cog is designed to be quite simple: all you need to do is write down dependencies in a cog.yaml file, and your inference code in a Python file. Cog abstracts away all the hard things about packaging ML models, such as CUDA dependencies, Python versions, weight loading, etc.

Example of a cog.yaml file:

build:

python_version: "3.13"

python_requirements: requirements.txt

predict: "predict.py:Predictor"

Example of a predict.py file, which has a function to set up the model and a function that runs when you receive an inference request (a prediction):

from cog import BasePredictor, Path, Input

import torch

class Predictor(BasePredictor):

def setup(self):

"""Load the model into memory to make running multiple predictions efficient"""

self.net = torch.load("weights.pth")

def predict(self,

image: Path = Input(description="Image to enlarge"),

scale: float = Input(description="Factor to scale image by", default=1.5)

) -> Path:

"""Run a single prediction on the model"""

# ... pre-processing ...

output = self.net(input)

# ... post-processing ...

return output

Then, you can run cog build to build your container image, and push your Cog container to Workers AI. We will deploy and serve the model for you, which you then access through your usual Workers AI APIs.

We’re working on some big projects to be able to bring this to more customers, like customer-facing APIs and wrangler commands so that you can push your own containers, as well as faster cold starts through GPU snapshotting. We’ve been testing this internally with Cloudflare teams and some external customers who are guiding our vision. If you’re interested in being a design partner with us, please reach out! Soon, anyone will be able to package their model and use it through Workers AI.

The fast path to first token

Using Workers AI models with AI Gateway is particularly powerful if you’re building live agents – where a user's perception of speed hinges on time to first token or how quickly the agent starts responding, rather than how long the full response takes. Even if total inference is 3 seconds, getting that first token 50ms faster makes the difference between an agent that feels zippy and one that feels sluggish.

Cloudflare's network of data centers in 330 cities around the world means AI Gateway is positioned close to both users and inference endpoints, minimizing the network time before streaming begins.

Workers AI also hosts open-source models on its public catalog, which now includes large models purpose-built for agents, including Kimi K2.5 and real-time voice models. When you call these Cloudflare-hosted models through AI Gateway, there's no extra hop over the public Internet since your code and inference run on the same global network, giving your agents the lowest latency possible.

Built for reliability with automatic failover

When building agents, speed is not the only factor that users care about – reliability matters too. Every step in an agent workflow depends on the steps before it. Reliable inference is crucial for agents because one call failing can affect the entire downstream chain.

Through AI Gateway, if you're calling a model that's available on multiple providers and one provider goes down, we'll automatically route to another available provider without you having to write any failover logic of your own.

If you’re building long-running agents with Agents SDK, your streaming inference calls are also resilient to disconnects. AI Gateway buffers streaming responses as they’re generated, independently of your agent's lifetime. If your agent is interrupted mid-inference, it can reconnect to AI Gateway and retrieve the response without having to make a new inference call or paying twice for the same output tokens. Combined with the Agents SDK's built-in checkpointing, the end user never notices.

Replicate

The Replicate team has officially joined our AI Platform team, so much so that we don’t even consider ourselves separate teams anymore. We’ve been hard at work on integrations between Replicate and Cloudflare, which include bringing all the Replicate models onto AI Gateway and replatforming the hosted models onto Cloudflare infrastructure. Soon, you’ll be able to access the models you loved on Replicate through AI Gateway, and host the models you deployed on Replicate on Workers AI as well.

Get started

To get started, check out our documentation for AI Gateway or Workers AI. Learn more about building agents on Cloudflare through Agents SDK.

Watch on Cloudflare TV

この記事をシェア

Cloudflare Blog重要度42026年7月18日 05:18

Cloudflare、WordPress の重大脆弱性対策を公開

Cloudflare Blog重要度42026年7月14日 22:00

DNSSEC ロールオーバー失敗で .AL ドメインが停止、Cloudflare は検証バイパスを検知して通知

Cloudflare Blog重要度42026年7月13日 22:00

Precursor の紹介：継続的なクライアント側信号を用いたエージェント型行動の検出

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

Cloudflare Blog·2026年4月16日 23:00·約10分

CloudflareのAIプラットフォーム：エージェント向けに設計された推論レイヤー

#AIエージェント #推論インフラ #モデルルーティング #Cloudflare AI Gateway #マルチプロバイダー

TL;DR

Cloudflareはエージェント開発向けに、複数プロバイダーの70以上のAIモデルを統一APIでアクセス可能な推論層「AI Gateway」を提供開始した。

AI深層分析2026年4月16日 23:41

重要/ 5段階

深度40%

キーポイント

エージェント向け推論層の統一化

複雑なエージェントワークフローに対応するため、単一APIで複数モデルの呼び出しとコスト・レイテンシ管理を一元化する。

モデル切り替えの簡素化

WorkersのAI.run()バインディングを流用し、プロバイダー変更を1行コードで可能にする。

多様なモデルカタログの提供

OpenAIやAnthropicなど12以上のプロバイダーから70以上のモデル（画像・動画・音声含む）を統一クレジットで利用可能。

信頼性向上機能の強化

アップストリーム障害時の自動リトライやログ制御を標準搭載し、エージェントの連鎖呼び出しにおける失敗カスケードを防ぐ。

Cogコンテナのデプロイと将来機能

エージェント向け超低遅延ネットワーク

自動フェイルオーバーによる高信頼性

重要な引用

This means you need access to all the models, without tying yourself financially and operationally to a single provider.

An agent might chain ten calls together to complete a single task and suddenly, a single slow provider doesn't add 50ms, it adds 500ms.

Today, we’re making Cloudflare into a unified inference layer: one API to access any AI model from any provider, built to be fast and reliable.

Using Workers AI models with AI Gateway is particularly powerful if you’re building live agents – where a user's perception of speed hinges on time to first token or how quickly the agent starts responding, rather than how long the full response takes.

Through AI Gateway, if you're calling a model that's available on multiple providers and one provider goes down, we'll automatically route to another available provider without you having to write any failover logic of your own.

If your agent is interrupted mid-inference, it can reconnect to AI Gateway and retrieve the response without having to make a new inference call or paying twice for the same output tokens.

影響分析・編集コメントを表示

影響分析

編集コメント

1つのカタログ、1つの統一エンドポイント

javascript

const response = await env.AI.run('anthropic/claude-opus-4-6',{
input: 'What is Cloudflare?',
}, {
gateway: { id: "default" },
});

image

const response = await env.AI.run('@cf/moonshotai/kimi-k2.5',

{

prompt: 'What is AI Gateway?'

{

metadata: { "teamId": "AI", "userId": 12345 }

}

);

image

Bring your own model

cog.yamlファイルの例:

build:

python_version: "3.13"

python_requirements: requirements.txt

predict: "predict.py:Predictor"

predict.pyファイルの例（モデルの設定を行う関数と、推論リクエスト（予測）を受信した際に実行される関数を備えたファイル）:

from cog import BasePredictor, Path, Input

import torch

class Predictor(BasePredictor):

def setup(self):

"""複数の予測を効率的に実行するためにモデルをメモリに読み込む"""

self.net = torch.load("weights.pth")

def predict(self,

image: Path = Input(description="拡大する画像"),

scale: float = Input(description="画像を拡大する倍率", default=1.5)

) -> Path:

"""モデルに対して単一の予測を実行する"""

# ... 前処理 ...

output = self.net(input)

# ... 後処理 ...

return output

初回トークン到達時間の高速化

自動フェールオーバーによる信頼性の構築

Replicate

はじめる

Cloudflare TVで視聴

原文を表示

One catalog, one unified endpoint

const response = await env.AI.run('anthropic/claude-opus-4-6',{

input: 'What is Cloudflare?',

}, {

gateway: { id: "default" },

});

For those who don’t use Workers, we’ll be releasing REST API support in the coming weeks, so you can access the full model catalog from any environment.

image

const response = await env.AI.run('@cf/moonshotai/kimi-k2.5',

{

prompt: 'What is AI Gateway?'

{

metadata: { "teamId": "AI", "userId": 12345 }

}

);

image

Bring your own model

Example of a cog.yaml file:

build:

python_version: "3.13"

python_requirements: requirements.txt

predict: "predict.py:Predictor"

Example of a predict.py file, which has a function to set up the model and a function that runs when you receive an inference request (a prediction):

from cog import BasePredictor, Path, Input

import torch

class Predictor(BasePredictor):

def setup(self):

"""Load the model into memory to make running multiple predictions efficient"""

self.net = torch.load("weights.pth")

def predict(self,

image: Path = Input(description="Image to enlarge"),

scale: float = Input(description="Factor to scale image by", default=1.5)

) -> Path:

"""Run a single prediction on the model"""

# ... pre-processing ...

output = self.net(input)

# ... post-processing ...

return output

The fast path to first token

Cloudflare's network of data centers in 330 cities around the world means AI Gateway is positioned close to both users and inference endpoints, minimizing the network time before streaming begins.

Built for reliability with automatic failover

Replicate

Get started

To get started, check out our documentation for AI Gateway or Workers AI. Learn more about building agents on Cloudflare through Agents SDK.

Watch on Cloudflare TV

この記事をシェア

Cloudflare Blog重要度42026年7月18日 05:18

Cloudflare、WordPress の重大脆弱性対策を公開

Cloudflare Blog重要度42026年7月14日 22:00

DNSSEC ロールオーバー失敗で .AL ドメインが停止、Cloudflare は検証バイパスを検知して通知

Cloudflare Blog重要度42026年7月13日 22:00

Precursor の紹介：継続的なクライアント側信号を用いたエージェント型行動の検出

今日のまとめ

AI日報で今日の重要ニュースをまとめ読み

ニュース一覧に戻る元記事を読む

CloudflareのAIプラットフォーム：エージェント向けに設計された推論レイヤー

キーポイント

重要な引用

影響分析

編集コメント

関連記事

CloudflareのAIプラットフォーム：エージェント向けに設計された推論レイヤー

キーポイント

重要な引用

影響分析

編集コメント

関連記事