TLDR AI · May 14, 2026, 09:00 · about 14 min read

DeepSeek V4 Pro and Flash tested against Claude Opus 4.7 and Kimi K2.6

#LLM #OpenSource #DeepSeek V4 #Benchmark #MIT License

TL;DR

DeepSeek released V4 Pro, its first new architecture since V3, together with the lightweight Flash model under the MIT license, and both were tested against Claude Opus 4.7 and Kimi K2.6.

AI Deep Analysis · July 5, 2026, 19:14


Depth: 40%

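Key Points

New architecture and license

On April 24, 2026, DeepSeek released V4 Pro and Flash under the MIT license. They are its first new architecture since V3 and its first open-weight lineup with two tiers.

Performance comparison against competing models

The article compares DeepSeek's new models directly with Claude Opus 4.7 and Kimi K2.6 and examines the differences in score, cost, and failure patterns in detail.

Division of roles between the flagship and the lightweight model

Pro is positioned as the flagship and Flash as the lightweight model, so users can pick a tier to fit the use case.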


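Impact Analysis

DeepSeek's new models could redefine the performance baseline for open-source LLMs. In particular, the ease of commercial use under the MIT license could significantly affect cost structures and development speed across the industry. The direct comparison with closed or hybrid models such as Claude and Kimi is expected to serve as a new reference point for users choosing a model.

Editorial Comment

DeepSeek's new architecture is significant news that could mark a turning point where open-source models begin to threaten the closed frontier models. The MIT-licensed release in particular is likely to accelerate adoption in the developer community.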



DeepSeek V4 Pro and DeepSeek V4 Flash launched together on April 24, 2026 under MIT license. They are DeepSeek’s first new architecture since V3, and their first open-weight lineup with two tiers (Pro as the flagship, Flash as the lightweight model).

We ran both through the same FlowGraph spec we used for Claude Opus 4.7 vs Kimi K2.6, with the same prompt and the same scoring rubric.

TL;DR: DeepSeek V4 Pro scored 77/100 for $2.25 and lands between Opus 4.7 (91) and Kimi K2.6 (68) in terms of performance. DeepSeek V4 Flash scored 60/100 for $0.02, a price point we have not seen on this test before, but its build failed and the output is missing some key pieces.

DeepSeek V4 Flash is the cheapest model in the comparison by a wide margin. Output tokens cost less than 1/14th of Kimi K2.6 and roughly 1/89th of Claude Opus 4.7.

DeepSeek is also running a 75% off promotion on DeepSeek V4 Pro through May 31, 2026. Under the discount, DeepSeek V4 Pro input drops to roughly $0.036/M and output drops to $0.87/M, putting it below Kimi K2.6 on both axes. DeepSeek separately cut input cache pricing across the lineup to one-tenth of previous levels as a permanent change.

This is the same FlowGraph spec we used in the Opus 4.7 vs Kimi K2.6 run: a workflow orchestration backend with 20 endpoints, persistent state, lease management, retries, and event streaming. It is a heavier infrastructure test than our usual coding benchmarks, chosen to push the models to their limits.

We ran DeepSeek V4 Pro and DeepSeek V4 Flash through the same setup to see where the new DeepSeek lineup lands on cost and first-pass quality next to Claude Opus 4.7 and Kimi K2.6.

We ran both DeepSeek models in Kilo CLI with the same prompt we used for Opus 4.7 and Kimi K2.6:

“Read @SPEC.md and build the project in the current directory. Treat @SPEC.md as the source of truth. Do not simplify this into a mock, toy app, or basic CRUD scaffold. Create all code, configuration, Prisma schema, tests, and README needed for a runnable project.…”

Both DeepSeek models ran in thinking mode, each in its own empty directory with no shared state.

DeepSeek V4 Pro passed its own test suite but the TypeScript build failed. DeepSeek V4 Flash’s test suite never ran because its setup script tried to force-reset the database in a way that errored out before the first test executed.

If we had stopped at the model summaries, both DeepSeek implementations would look closer to Claude Opus 4.7 than they actually were. A direct code review plus targeted reproductions against isolated SQLite databases revealed the problems in both model outputs.

DeepSeek V4 Pro got the broad shape of the system right. The endpoints are wired up, the test suite passes, and the project layout is reasonable. The issues we found are concentrated in the same places as Kimi K2.6: lease expiry handling, scheduling, validation, and build integrity.

When a worker claims a step, the system gives it a lease that expires after a set timeout. If the worker stalls or crashes, the lease should expire and another worker should be free to pick up the step. Once the lease has expired, the original worker is no longer the owner of that step and shouldn’t be able to mark it as done.

DeepSeek V4 Pro enforces this on heartbeats but not on completions. We claimed a step, pushed its lease expiry into the past, then asked the API to mark the step as successfully completed. The API returned 200 and recorded the step as succeeded. The original worker effectively reached past its expired lease and finalized work it no longer owned.

DeepSeek V4 Pro’s own README says workers cannot complete after their lease expires, but the implementation does not enforce that.
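As a rough illustration of the missing check, the sketch below rejects a completion unless the caller still holds an unexpired lease. The `Step` shape, the `completeStep` name, and the 409 status code are assumptions made for this example; they are not taken from the FlowGraph spec or from either model's output.

```typescript
// Hypothetical step record; the real Prisma model in the generated project may differ.
type StepStatus = "running" | "succeeded";

interface Step {
  id: string;
  status: StepStatus;
  leaseOwner: string | null;
  leaseExpiresAt: Date | null;
}

type CompleteResult =
  | { ok: true; step: Step }
  | { ok: false; httpStatus: 409; reason: string }; // 409 is an assumed status code

// Reject the completion unless the caller still owns an unexpired lease.
function completeStep(step: Step, workerId: string, now: Date): CompleteResult {
  if (step.status !== "running") {
    return { ok: false, httpStatus: 409, reason: "step is not running" };
  }
  if (step.leaseOwner !== workerId) {
    return { ok: false, httpStatus: 409, reason: "lease held by another worker" };
  }
  if (step.leaseExpiresAt === null || step.leaseExpiresAt.getTime() <= now.getTime()) {
    return { ok: false, httpStatus: 409, reason: "lease expired" };
  }
  // Lease is valid: finalize the step and release the lease.
  return {
    ok: true,
    step: { ...step, status: "succeeded", leaseOwner: null, leaseExpiresAt: null },
  };
}
```

Against a real database the same condition would need to be part of the update itself (for example a conditional update on owner and expiry) so that the check and the write cannot be separated by a concurrent claim.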

A workflow run can declare a maximum number of steps it is allowed to run in parallel. When that cap is reached, the saturated run shouldn’t accept more work, but other runs sharing the same queue should keep moving.

DeepSeek V4 Pro’s claim logic checks one candidate at a time. If that candidate happens to belong to a run that is already at its parallel cap, the function gives up and returns nothing, instead of moving on to the next candidate.

We reproduced this with two active runs sharing a queue. Run A was at its parallel limit. Run B had capacity and a higher-priority step ready to go. The next claim request came back empty. In production this would look like workers idling while there is real work to do, just because the first run on the queue happens to be saturated.
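The behavior the spec asks for can be sketched as a loop that skips candidates from saturated runs instead of returning on the first one. The `Candidate` and `RunCapacity` types and the in-memory `runs` record are hypothetical stand-ins for whatever the real claim query returns.

```typescript
// Hypothetical in-memory view of the queue; a real implementation would query the database.
interface Candidate {
  stepId: string;
  runId: string;
}

interface RunCapacity {
  maxParallel: number;
  running: number;
}

// Walk the queue in order and skip runs that are at their parallel cap,
// rather than giving up when the first candidate is not claimable.
function claimNext(queue: Candidate[], runs: Record<string, RunCapacity>): Candidate | null {
  for (const candidate of queue) {
    const run = runs[candidate.runId];
    if (run === undefined) continue; // unknown run: not claimable
    if (run.running >= run.maxParallel) continue; // saturated run: try the next candidate
    run.running += 1;
    return candidate;
  }
  return null; // nothing claimable anywhere in the queue
}
```

With run A at its cap and run B idle, a queue that lists A's step first still yields B's step, which is the case the reproduction above exercised.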

npm test passes but npm run build does not. Even after the build errors are fixed, the project still would not be runnable through npm start. The TypeScript config is set to not emit any compiled output, while package.json expects npm start to run that compiled output. A user following DeepSeek V4 Pro’s own README on a clean checkout would not get a working server.

At $0.02 for the entire run, DeepSeek V4 Flash is in territory we have not tested before. The internal logic is plausible. The public API is where it falls apart.

To use this system, a client first creates a workflow run by calling a specific endpoint. Without that endpoint working, nothing else can happen. There is no run for workers to claim from, no events to stream, no step to complete.

DeepSeek V4 Flash wrote the handler for this endpoint, but mounted it under the wrong route prefix. The spec requires it at /workflows/key/:key/runs. DeepSeek V4 Flash actually serves it at /runs/key/:key/runs. A request to the spec path against the running server returned 404 Endpoint not found. The README documents the spec path, but the server does not serve it.

DeepSeek V4 Flash’s tests call internal functions directly rather than going through the HTTP API. From the test suite’s perspective, everything was fine. From an actual client’s perspective, the entry point to the system was missing.
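To make the mismatch concrete, here is a small standalone matcher for Express-style patterns. It is not the router the generated project uses, and the key `billing` is a made-up example value; the two patterns are the ones quoted above.

```typescript
// Match a pattern such as "/workflows/key/:key/runs" against a concrete request path.
// Returns the captured parameters, or null when the path does not match (a 404 in practice).
function matchRoute(pattern: string, path: string): Record<string, string> | null {
  const patternParts = pattern.split("/").filter((s) => s.length > 0);
  const pathParts = path.split("/").filter((s) => s.length > 0);
  if (patternParts.length !== pathParts.length) return null;

  const params: Record<string, string> = {};
  for (let i = 0; i < patternParts.length; i++) {
    const expected = patternParts[i];
    const actual = pathParts[i];
    if (expected.charAt(0) === ":") {
      params[expected.slice(1)] = decodeURIComponent(actual);
    } else if (expected !== actual) {
      return null;
    }
  }
  return params;
}

const specPattern = "/workflows/key/:key/runs"; // path required by the spec
const servedPattern = "/runs/key/:key/runs"; // path the server actually mounted
const request = "/workflows/key/billing/runs"; // hypothetical client request

const specMatch = matchRoute(specPattern, request); // matches and captures key = "billing"
const servedMatch = matchRoute(servedPattern, request); // null: no handler for the spec path
```

A test that sends a request to the spec path over HTTP would hit the second case and fail, which is what calling the internal functions directly hides.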

Once a workflow run fails (because one of its steps used up all its retry attempts), every other step in that run should stop. The spec calls for the remaining steps to move into a blocked state so workers will not pick them up.

DeepSeek V4 Flash’s recovery logic loads all expired steps at the start, then handles them one by one. If the first expired step exhausts its retries and fails the parent run, a later step in the same batch can still be promoted to a “ready to retry” state, even though the run it belongs to is already over.

We reproduced this with two expired steps in one run:

Step a had no retry attempts left and was correctly marked dead
The parent run was correctly marked failed
Step b ended up in waiting_retry instead of blocked

A worker polling for new work would still receive step b and execute it for a workflow that had already failed. Claude Opus 4.7 had a related multi-expired lease bug. Kimi K2.6 missed live event streaming entirely. Recovery under contention keeps being the hardest part of this spec for any model to get right on the first pass.
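One way to avoid the stale-batch problem is to re-check the run status before each transition and sweep the batch once more at the end. The sketch below uses the state names from the reproduction (`dead`, `failed`, `waiting_retry`, `blocked`); the types and the `recoverExpired` function are assumptions made for illustration, not code from any of the tested models.

```typescript
type StepState = "running" | "waiting_retry" | "dead" | "blocked";
type RunStatus = "running" | "failed";

// Hypothetical in-memory records standing in for database rows.
interface ExpiredStep {
  id: string;
  attempts: number;
  maxAttempts: number;
  state: StepState;
}

interface Run {
  status: RunStatus;
}

// Handle expired steps one by one, but consult the current run status each time
// instead of trusting the snapshot taken when the batch was loaded.
function recoverExpired(run: Run, expired: ExpiredStep[]): void {
  for (const step of expired) {
    if (run.status === "failed") {
      step.state = "blocked"; // run already over: never make the step claimable again
      continue;
    }
    if (step.attempts >= step.maxAttempts) {
      step.state = "dead";
      run.status = "failed"; // exhausting retries fails the parent run
    } else {
      step.state = "waiting_retry";
    }
  }
  // Steps promoted earlier in the batch must also be blocked if the run failed later.
  if (run.status === "failed") {
    for (const step of expired) {
      if (step.state === "waiting_retry") step.state = "blocked";
    }
  }
}
```

This only covers the expired batch. Steps in the same run that were never leased would need the same blocking transition, which a real implementation would likely do in one transaction with the run update.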

DeepSeek V4 Flash has the same expired-lease completion bug as DeepSeek V4 Pro. An expired lease can still finalize the work, even though the original worker no longer owns the step.

It also rejects valid request payloads. The spec says workflow run input and metadata can carry arbitrary JSON, which includes arrays, strings, and numbers. DeepSeek V4 Flash’s validation only accepts JSON objects. A client sending a JSON array as input would get a 400 response even though the spec accepts it.
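The difference between the two validation rules can be shown with a pair of small predicates. Both function names are hypothetical, and treating `null` and booleans as valid input is an assumption about what "arbitrary JSON" means in the spec.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Accept any JSON value: primitives, arrays, and objects, checked recursively.
function isJsonValue(value: unknown): value is Json {
  if (value === null) return true;
  if (typeof value === "string" || typeof value === "boolean") return true;
  if (typeof value === "number") return isFinite(value); // NaN and Infinity are not JSON
  if (Array.isArray(value)) return value.every((item) => isJsonValue(item));
  if (typeof value === "object") {
    const record = value as Record<string, unknown>;
    return Object.keys(record).every((key) => isJsonValue(record[key]));
  }
  return false; // undefined, functions, symbols, bigint
}

// The stricter rule described above: only plain JSON objects pass.
function isObjectOnly(value: unknown): boolean {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}
```

An input such as `[1, 2, 3]` passes `isJsonValue` but fails `isObjectOnly`, which matches the 400 response described above.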

The bugs above are about the output DeepSeek V4 Flash produced. Tool calling is a separate axis: how the model performed inside Kilo CLI. On that axis, the model held up surprisingly well. It read files before editing them, installed dependencies and ran the test suite at sensible points, and did not get stuck in retry loops on broken commands. The agent loop ran cleanly even when the code it produced had gaps.

That is not what we expected from a model at this price tier. Tool calling reliability is usually where cheaper models break down first, with malformed arguments, hallucinated file paths, or runaway loops that burn through tokens without making progress. DeepSeek V4 Flash avoided those failure modes in our run.

We used the same 7-category rubric as the Opus vs Kimi post.

DeepSeek V4 Pro slots between Claude Opus 4.7 and Kimi K2.6. The gap with Opus is concentrated in build quality and lease handling. DeepSeek V4 Flash sits below Kimi K2.6, with deductions in nearly every category.

DeepSeek V4 Flash’s cost per point is roughly 30x cheaper than Kimi K2.6 and 100x cheaper than Opus 4.7 on this benchmark. The score is lower, but the absolute dollar amount is so small that running the same task three or four times to compare attempts is still cheaper than one Kimi K2.6 run.

DeepSeek V4 Pro is more expensive than Kimi K2.6 in this run because we ran it before applying the official discount. With DeepSeek’s 75% promo applied to current rates, the same run would have cost closer to $0.55, putting it below Kimi K2.6 in absolute cost while scoring 9 points higher.
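The arithmetic behind these figures is simple enough to check directly. The sketch below uses only the scores and costs reported in this post for the two DeepSeek runs; the `RunResult` type and function names are illustrative.

```typescript
interface RunResult {
  model: string;
  score: number; // out of 100
  costUsd: number; // total cost of the run
}

// Dollars spent per rubric point earned.
function costPerPoint(result: RunResult): number {
  return result.costUsd / result.score;
}

// Cost of the same run with a percentage discount applied.
function discountedCost(result: RunResult, discount: number): number {
  return result.costUsd * (1 - discount);
}

const pro: RunResult = { model: "DeepSeek V4 Pro", score: 77, costUsd: 2.25 };
const flash: RunResult = { model: "DeepSeek V4 Flash", score: 60, costUsd: 0.02 };

const proPerPoint = costPerPoint(pro); // about $0.029 per point
const flashPerPoint = costPerPoint(flash); // about $0.00033 per point
const proPromoCost = discountedCost(pro, 0.75); // $0.5625, in line with the "closer to $0.55" estimate
```

A flat 75% discount on the whole run is a simplification, since the run's actual mix of input, cached, and output tokens is not given here.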

The pattern from previous comparisons keeps holding. The gap in surface coverage between open-weight models and frontier proprietary models is narrow. The gap in correctness inside hard code paths (lease recovery, cross-run scheduling, expired-lease rejection) is still there but is also narrowing.

DeepSeek V4 Pro is the practical step up from Kimi K2.6 based on our test. Same general failure pattern, but cleaner overall structure and fewer spec-level gaps. With DeepSeek’s official discount in effect, the price gap with Kimi closes and the quality gap stays.

DeepSeek V4 Flash is a different conversation. At full price it is cheaper than the existing budget tier (Gemini 3.1 Flash Lite, Claude Haiku 4.5) by a wide margin. A 60/100 score on this spec is not a reason to use it on its own, but the cost is. For tasks where you can absorb a rough first pass and a human review, $0.02 per attempt changes the math considerably.

Claude Opus 4.7 still pulls ahead. The trickier parts of the spec (anything involving timing, recovery, or coordination between moving pieces) are where every other model lost points. Claude Opus 4.7 had only one reproducible bug while the other three had more.

DeepSeek V4 Pro performed better than Kimi K2.6 in this run. It scored 9 points higher, runs at a lower per-token list price, and produces about the same shape of failures under review. With DeepSeek’s official discount through May 31, the cost gap is even larger.

DeepSeek V4 Flash is a new category. It is not fully reliable for complex backend builds without a cleanup pass. But $0.02 for a first-pass attempt at a backend of this size is a price point that did not exist before. If you can absorb imperfect output, the math changes.
