Together AI Blog · May 19, 2026, 09:00 · approx. 11 min read

Benchmarking Coding Agents for Large-Scale Inference

#LLM Inference #Coding Agents #Performance Benchmarking #KV Cache Optimization

TL;DR

Together AI released a new benchmark modeled on production coding agent load and showed that its own engine delivers 31% higher throughput than the competing OSS engine, along with 2× better TTFT.

AI Deep Analysis · July 5, 2026, 18:09

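Key points

A benchmark defined around production conditions

Unlike conventional single-user benchmarks, the test simulates coding agent load: many concurrent requests with long contexts (45k to 200k tokens).

Demonstrating a decisive performance gain

On the same hardware, Together Inference Engine delivered 31% more TPS than the fastest OSS engine and 2× better TTFT at saturation.

Technical gains from full-stack optimization

The results came from a multi-pronged optimization effort: ThunderMLA, custom kernel rewrites, and end-to-end profiling on real traffic.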

Key quotes

On a production coding agent workload, Together Inference Engine delivers 31% more TPS than the next fastest OSS engine on the same hardware

Most inference benchmarks measure a single user hitting a dedicated endpoint. The numbers look great. They're also useless for reasoning about production.

For coding agents, TTFT is the metric that determines whether the tool feels fast or broken.

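Impact analysis

This announcement is an occasion to rethink how LLM inference benchmarks are done, pushing toward a standard that values behavior under real production load rather than peak performance alone. For complex workloads such as coding agents in particular, it shows clearly that optimization at the infrastructure layer translates directly into developer productivity and reliability, and it raises the bar for competition in the inference engine market.

Editor's comment

Rather than simply comparing numbers, the post explains logically why conventional benchmarks are not useful for production, which makes it highly instructive for infrastructure engineers.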

Summary

On a production coding agent workload, Together Inference Engine delivers 31% more TPS than the next fastest OSS engine on the same hardware, and maintains 2× better TTFT at saturation. The gains come from full-stack optimization: ThunderMLA, custom kernel rewrites, and end-to-end profiling on real traffic.

Most inference benchmarks measure a single user hitting a dedicated endpoint. The numbers look great. They're also useless for reasoning about production.

In production, you're running dozens or hundreds of concurrent requests. They compete for the same KV cache, the same memory bandwidth, the same GPU cycles. What matters is what happens to every user when the system is under load.

We built this benchmark to answer that question for coding agents. It's a workload that hits inference hard: long inputs, high concurrency, and no tolerance for latency degradation under load.

This is version one. We'll update it as we build.

What a coding agent workload looks like

Coding agent requests carry a lot of context. The file being edited, surrounding code, conversation history, retrieved snippets. Inputs are long. Outputs are meaningful but bounded; you're generating a function, not an essay.

The harder challenge is concurrency. Many users hit the endpoint simultaneously, and those requests interact in ways single-user benchmarks never capture. As traffic increases, KV cache fills. Scheduling pressure mounts. Per-user throughput drops. Time to first token (TTFT) climbs. At some point, the system stops being useful. Different engines reach that point at very different traffic levels.

We designed a high-traffic benchmark to stress-test this, modeled on the request distributions we see serving production coding agent traffic at scale. Prompt lengths range from ~45k to 200k tokens, simulating realistic coding session growth, and generation lengths average around 450 tokens. The key metrics are TPM (input tokens per minute), TPS (tokens per second) per user, and p50 TTFT.
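The three metrics can be made concrete with a small sketch. The post does not spell out exactly how each metric is computed, so the definitions here are assumptions: input TPM is total prompt tokens over the measurement window, per-user TPS is output tokens divided by each request's streaming interval, and TTFT is the gap between submit and first token. The `RequestRecord` fields and the sample values are hypothetical.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestRecord:
    """One completed request. Field names are illustrative, not from the benchmark."""
    input_tokens: int
    output_tokens: int
    submit_s: float       # time the request was sent
    first_token_s: float  # time the first output token arrived
    done_s: float         # time the last output token arrived


def summarize(records, window_s):
    """Return input TPM, median per-user TPS, and p50 TTFT for one window."""
    input_tpm = sum(r.input_tokens for r in records) / (window_s / 60.0)
    # Per-user TPS (assumed definition): output tokens over the streaming interval.
    per_user_tps = [
        r.output_tokens / (r.done_s - r.first_token_s)
        for r in records
        if r.done_s > r.first_token_s
    ]
    ttfts = [r.first_token_s - r.submit_s for r in records]
    return {
        "input_tpm": input_tpm,
        "tps_per_user_p50": median(per_user_tps),
        "ttft_p50_s": median(ttfts),
    }


# Hypothetical records spanning the 45k to 200k prompt range.
records = [
    RequestRecord(45_000, 300, 0.0, 0.5, 5.5),
    RequestRecord(120_000, 450, 1.0, 1.9, 10.9),
    RequestRecord(200_000, 600, 2.0, 3.2, 13.2),
]
stats = summarize(records, window_s=60.0)
print(stats)
```

Reporting the median rather than the mean keeps one very slow request from dominating the TTFT figure, which matches the post's choice of p50.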

What inference has to get right

For coding agents, TTFT is the metric that determines whether the tool feels fast or broken. A developer who submits a request sees nothing until the first token arrives. That gap — between submit and stream — is where trust is won or lost. Output speed matters, but it's secondary: once tokens are streaming, the experience feels fluid even at moderate generation rates.

The second constraint is concurrency under long context. Coding agent requests aren't just long — they're long *and* simultaneous. Dozens of developers hitting the same endpoint at once, each carrying 80k+ tokens of context, creates KV cache pressure that single-user benchmarks never surface. As cache fills, the scheduler has less room to maneuver. Prefill latency climbs. TTFT degrades. At high enough traffic, the system stops being useful before it formally fails.
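A back-of-the-envelope estimate shows why concurrent long contexts strain the KV cache. The sketch below assumes no prefix sharing between requests and uses placeholder architecture numbers (61 layers, 576 cached values per token per layer, 2 bytes per value); these are illustrative and are not taken from the post or from Kimi K2.5's published configuration. The figure of 50 concurrent requests is likewise hypothetical.

```python
def kv_bytes_per_token(num_layers, cached_values_per_layer, bytes_per_value):
    """Bytes of KV cache held for a single token across all layers."""
    return num_layers * cached_values_per_layer * bytes_per_value


def kv_cache_gb(concurrent_requests, context_tokens, per_token_bytes):
    """Total KV cache in decimal GB, assuming no prefix sharing between requests."""
    return concurrent_requests * context_tokens * per_token_bytes / 1e9


# Placeholder values for an MLA-style compressed cache (assumption, see lead-in).
per_token = kv_bytes_per_token(num_layers=61, cached_values_per_layer=576, bytes_per_value=2)

one_request_gb = kv_cache_gb(1, 80_000, per_token)
fifty_requests_gb = kv_cache_gb(50, 80_000, per_token)

print(f"{per_token} bytes/token")
print(f"1 request at 80k tokens:   {one_request_gb:.2f} GB")
print(f"50 requests at 80k tokens: {fifty_requests_gb:.2f} GB")
```

Under these assumptions the cache grows linearly in both concurrency and context length, so doubling either one halves the headroom the scheduler has left.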

The third constraint is output shape. You're generating a function, not an essay. Generation lengths are bounded — averaging around 450 tokens — which means throughput-at-saturation looks different here than in summarization or document-generation workloads. The system isn't under sustained decode pressure; it's under sustained *prefill* pressure, with frequent short bursts of decode. Engines optimized for long decode runs won't necessarily win here.

These three constraints — TTFT sensitivity, concurrent long-context load, and prefill-heavy output shape — are what the benchmark is designed to stress.

Methodology

Hardware: 4× NVIDIA B200 per engine (SGLang: 8× B200 — see note below).

Workload: Long prompts, high concurrency, realistic churn. Prompt lengths range from ~45k to 200k tokens, simulating realistic coding session growth. Generation lengths average 450 tokens (p50: 293, p99: 2,230). Difficulty scales with traffic: at higher QPS, longer prompts and growing KV caches create more prefill pressure, more context to maintain, and more KV cache thrashing as session churn increases.
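For readers who want to build a comparable synthetic workload, one option is to fit a distribution to the reported generation-length percentiles. The post does not say which distribution the lengths follow; the sketch below assumes a lognormal purely for illustration and fits it to the reported p50 (293) and p99 (2,230). The function names are hypothetical.

```python
import math
import random
from statistics import NormalDist, median


def lognormal_from_quantiles(p50, p99):
    """Fit lognormal parameters (mu, sigma) to a median and a 99th percentile."""
    mu = math.log(p50)                 # the median of a lognormal is exp(mu)
    z99 = NormalDist().inv_cdf(0.99)   # standard normal 99th-percentile z-score
    sigma = (math.log(p99) - mu) / z99
    return mu, sigma


def sample_generation_lengths(n, mu, sigma, seed=0):
    """Draw n synthetic generation lengths (at least 1 token each)."""
    rng = random.Random(seed)
    return [max(1, round(rng.lognormvariate(mu, sigma))) for _ in range(n)]


mu, sigma = lognormal_from_quantiles(293, 2230)
implied_mean = math.exp(mu + sigma ** 2 / 2)  # closed-form lognormal mean
lengths = sample_generation_lengths(20_000, mu, sigma)

print(f"sigma={sigma:.3f}, implied mean={implied_mean:.0f}")
print(f"sample median={median(lengths)}")
```

Under this assumed fit the implied mean lands in the low 400s, in the same neighborhood as the reported average of 450, which suggests a heavy-tailed shape is a reasonable stand-in when the real trace is not available.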

EAGLE speculative decoding: 3 draft tokens. Acceptance rate (~70%) emerges naturally from the realistic synthetic prompt data — we're not forcing it.
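To get a feel for what 3 draft tokens at roughly 70% acceptance buys, a common simplification treats each draft token as accepted independently with the same probability, and counts the one token the target model always contributes per verification step. That independence assumption is mine, not the post's, and the post's acceptance rate may be defined differently, so the result is a rough guide rather than a measured figure.

```python
def expected_tokens_per_step(acceptance_rate, draft_tokens):
    """Expected tokens emitted per verification step under i.i.d. acceptance.

    Sums the geometric series 1 + a + a^2 + ... + a^k, where a is the
    per-token acceptance probability and k is the number of draft tokens.
    """
    if not 0.0 <= acceptance_rate < 1.0:
        raise ValueError("acceptance_rate must be in [0, 1)")
    return (1 - acceptance_rate ** (draft_tokens + 1)) / (1 - acceptance_rate)


tokens_per_step = expected_tokens_per_step(acceptance_rate=0.7, draft_tokens=3)
print(f"{tokens_per_step:.3f} tokens per verification step")
```

With these inputs the model yields about 2.5 tokens per verification step, compared with 1 token per step without speculation; the actual speedup is smaller once the cost of drafting is included.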

Engine configs: TensorRT-LLM is well-tuned for this workload and represents a strong baseline. SGLang was configured to match where possible; we didn't run exhaustive tuning experiments, so there may be marginal room for improvement. All engines are configured for low latency. This is distinct from a throughput-optimized config, which would increase max decode batch size and use prefill-decode disaggregation to trade output TPS for higher input TPM.

What we optimized

Our performance gains came from treating inference as a full-stack problem: profiling end-to-end, identifying the most expensive operations, and eliminating them one by one.

ThunderMLA. Kimi K2.5 uses DeepSeek's Multi-head Latent Attention (MLA) architecture. Standard implementations run two separate kernel launches per decode step. Our ThunderMLA — part of our ThunderKittens kernel library — fuses these into a single megakernel, eliminating launch overhead and the tail effects between them. On representative decode workloads, ThunderMLA is 20–35% faster than DeepSeek's own FlashMLA.

Beyond ThunderMLA, we profiled the full stack — driver behavior, memory layout, kernel execution — and removed every bottleneck we found. Some required configuration changes. Others required writing kernels from scratch. The kernels we wrote outperform TensorRT-LLM's open-source equivalents on this workload.

Here's how that translates to the full system under load.

Results

We compared Together Inference Engine against two baselines on Kimi K2.5 with EAGLE speculative decoding:

TensorRT-LLM — 4 x NVIDIA B200 GPUs
SGLang — 8 x NVIDIA B200 GPUs

A note on SGLang: Running Kimi K2.5 with EAGLE on SGLang at TP4 ran out of memory — SGLang's EAGLE implementation requires more memory than TensorRT-LLM's on this model. We used TP8 (8 GPUs) to run it. TensorRT-LLM and Together Inference Engine ran on 4 GPUs.

At 625K TPM per GPU (2.5M TPM total across 4 GPUs), Together Inference Engine delivers 31% more TPS than TensorRT-LLM and is the only engine still under 1s TTFT.

The degradation curve

The shape of the curve matters more than any single data point. Every inference engine eventually saturates: KV cache fills, scheduling pressure increases, TTFT climbs. What differs between engines is when that happens and how fast.

At 2.5M TPM, every engine is past its comfortable range:

Engine | GPUs | p50 TTFT
---|---|---
Together IE | 4 | 0.71s
TensorRT-LLM | 4 | 1.1s
SGLang | 8 | 5.1s

Measured at 2.5M TPM.

At the traffic level where all engines are degrading, Together IE's p50 TTFT is about 1.5× lower than TensorRT-LLM's (0.71s vs. 1.1s) and about 7× lower than SGLang's (0.71s vs. 5.1s). The system has more headroom: functional at loads where other engines are not.
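As a quick check on the table, the ratios between the p50 TTFT values and the per-GPU traffic can be computed directly. This uses only the three reported figures and the 2.5M TPM total on 4 GPUs; it says nothing about other points on the degradation curve.

```python
# p50 TTFT in seconds at 2.5M TPM, as reported in the table.
p50_ttft_s = {"Together IE": 0.71, "TensorRT-LLM": 1.1, "SGLang": 5.1}

baseline = p50_ttft_s["Together IE"]
ratios = {
    name: ttft / baseline
    for name, ttft in p50_ttft_s.items()
    if name != "Together IE"
}

# Input traffic per GPU for the engines running on 4 GPUs.
per_gpu_tpm = 2_500_000 / 4

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x the Together IE p50 TTFT")
print(f"per-GPU input TPM on 4 GPUs: {per_gpu_tpm:,.0f}")
```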

Cost and quality

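The performance benchmarks in this post are on Kimi K2.5. Kimi K2.6 is now available on Together, and on coding benchmarks it is on par with or ahead of Claude Opus 4.6: it leads on SWE-Bench Pro, LiveCodeBench v6, and Terminal-Bench 2.0, and trails slightly on SWE-Bench Verified (80.2 vs. 80.8).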

Benchmark | Kimi K2.6 | Claude Opus 4.6
---|---|---
SWE-Bench Verified | 80.2 | 80.8
SWE-Bench Pro | 58.6 | 53.4
LiveCodeBench v6 | 89.6 | 88.8
Terminal-Bench 2.0 | 66.7 | 65.4

At that quality level, the cost difference is significant. For a typical request on this workload — ~80k-100k input tokens, ~450 output tokens:

Model | Cost per request
---|---
Kimi K2.6 on Together | $0.108
Claude Opus 4.6 | $0.451

76% cheaper per request. A 30-person engineering team running a coding agent at 1.5M TPM for 5 hours a day (250 working days) saves ~$440K/year on inference costs vs. Claude Opus 4.6.
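The savings figure can be roughly reproduced from the numbers in this section. The sketch below assumes the per-request costs stay constant across the year and that a request carries a fixed number of input tokens; the 90k midpoint is my assumption, since the post only gives a range of about 80k to 100k.

```python
KIMI_COST_PER_REQUEST = 0.108  # USD, from the table above
OPUS_COST_PER_REQUEST = 0.451  # USD, from the table above


def percent_cheaper(cheap, expensive):
    """Relative saving of the cheaper option, in percent."""
    return (1 - cheap / expensive) * 100


def annual_savings(input_tpm, hours_per_day, days_per_year, input_tokens_per_request):
    """Yearly saving in USD, assuming a fixed input size per request."""
    tokens_per_year = input_tpm * 60 * hours_per_day * days_per_year
    requests_per_year = tokens_per_year / input_tokens_per_request
    return requests_per_year * (OPUS_COST_PER_REQUEST - KIMI_COST_PER_REQUEST)


saving_pct = percent_cheaper(KIMI_COST_PER_REQUEST, OPUS_COST_PER_REQUEST)
midpoint = annual_savings(1_500_000, 5, 250, input_tokens_per_request=90_000)
low = annual_savings(1_500_000, 5, 250, input_tokens_per_request=100_000)
high = annual_savings(1_500_000, 5, 250, input_tokens_per_request=80_000)

print(f"{saving_pct:.1f}% cheaper per request")
print(f"annual savings: ${low:,.0f} to ${high:,.0f} (midpoint ${midpoint:,.0f})")
```

The quoted figure of about $440K sits inside the range this produces, so it is consistent with an average request size somewhere between 80k and 100k input tokens.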

This is version one

These results reflect where Together Inference Engine stands today, on this workload, on this hardware configuration. We're publishing them because we think benchmarks should be meaningful: based on real workload shapes, transparent about methodology, and honest about where things start to break.

Each update will be additive. The goal is a running record of what optimization actually buys you on a workload you can reason about. When the next one ships, we'll show you exactly what changed and why the numbers moved.

If you're running a coding agent at scale and want to understand what this means for your workload, reach out.
