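Vercel Blog · April 7, 2026, 10:01 · about a 16-minute read

58% of PRs in the largest monorepo merge without human review

#LLM #AIAgents #DevWorkflowAutomation #CodeReview #RiskClassification #Monorepo

TL;DR

Vercel announced that it built a workflow for its large monorepo in which an LLM-based PR classifier assesses risk and low-risk PRs are auto-merged without human review. An agent now merges 58% of PRs this way, and average merge time fell 62%, from 29 hours to 10.9 hours.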

AI deep analysis · April 7, 2026, 10:41

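Key points

Resolving the review bottleneck

In a large monorepo that sees more than 400 PRs per week, average merge time was 29 hours, and reviews of low-risk changes in particular had become a bottleneck on development speed.

Automating risk-based PR classification

Vercel built an LLM-based classifier using Gemini that reads each PR's diff, title, and description and automatically separates "high-risk" changes such as authentication and payments from "low-risk" changes such as UI and documentation.

Efficiency gains from auto-merge

An agent reviews and merges 58% of PRs without a human reviewer, all of them classified as low-risk, and average merge time fell by 62%.

Clarifying the purpose of review

The post splits PR review into agreeing on what to build (alignment) and confirming that what was built works correctly (verification), and argues that AI can handle the latter effectively.

How risk classification works and is tuned

The LLM extracts evidence from the code diff and returns LOW when it finds no evidence of risk. Because a false LOW is more costly than a false HIGH, the classifier is tuned to prefer false HIGHs over false LOWs.

Phased rollout and safety criteria

The workflow was rolled out in three phases (silent classification, visible labels, enforcement), with safety thresholds on revert rate and rollback rate defined before the experiment began.

Hard rules and feedback loop

PRs that change 100 or more files are always HIGH risk, and CODEOWNERS-protected paths always require human review. Misclassifications are reported through an "Incorrect?" link and, once confirmed, added to the evaluation dataset.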

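Impact analysis

The article presents a practical approach that treats AI not as a mere code-generation tool but as an integrated part of the development workflow, drawing a clear line between what needs human judgment and what can be automated. As an answer to the common problem of balancing development speed with quality assurance, especially in large projects, it is likely to influence the wider industry.

Editorial comment

An important case study showing that AI adoption has moved beyond the "handy tool" stage and into redesigning an organization's development process itself. Its most innovative aspect is that it used data to challenge the assumption that every PR needs human review.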

One of our oldest and largest Next.js apps is a monorepo that contains multiple critical properties: the Vercel marketing site, our docs, the sign-up flow, dashboards, and internal tooling. The repo sees over 400 pull requests per week on average. Until recently, every one of them required human approval before merging.

Today, an agent reviews and merges 58% of those pull requests without a human reviewer, and average merge time has dropped 62%, from 29 hours to 10.9 hours.

Merging agent-generated code can be dangerous. This is a real example of how you can use agents themselves to deploy to production, safely.

The problem: review bottlenecks

Critical design updates and A/B tests need to go live as quickly as possible, but they weren't making it to production as fast as we wanted. After analyzing PRs on the repo, we learned that average time from ready-for-review to merge was 29 hours. That's more than a full working day spent waiting.

Digging deeper, we discovered that over half of PRs were approved (by humans) with zero comments. 18% were rubber-stamped in under 5 minutes.

So we asked ourselves: If most reviews aren't catching anything, what are they actually protecting against?

Pull requests can easily conflate two distinct activities:

Alignment is agreeing on what to build and how: the architecture, structure, and design decisions

Verification is confirming that what was built works correctly

Most changes in a mature codebase like this one only need verification, and AI can handle verification well. Requiring a human to approve every CSS tweak and docs update doesn't make the codebase safer; it makes engineers slower and delays basic updates unnecessarily.

Ironically, AI is making the problem worse: more PRs flow into the bottleneck as agents generate code. But the answer isn't asking engineers to review harder and faster. It's building systems that can distinguish between changes that need human judgment and those that don't.

Here's how we built our auto-merge workflow and what we learned along the way.

Start with a risk framework

The key insight was plain in the initial analysis: not all PRs carry equal risk. A documentation fix and an authentication change have fundamentally different blast radiuses. We needed a way to classify that risk automatically.

We built an LLM-based PR classifier using Gemini that evaluates every PR based on its diff, title, and description. The classifier assigns one of two labels.

HIGH risk includes changes to authentication, payments, data integrity, security, and infrastructure. These always require human review.

LOW risk includes UI changes, styling, tests, documentation, refactors, and feature flags that are turned off. These are candidates for auto-approval.

The classifier returns structured JSON:

The schema puts evidenceQuotes first and riskLevel last. This forces the model to extract verbatim diff snippets and reason about them before it can classify. If it can’t find evidence of risk in the actual code, it defaults to LOW. The decision is grounded in the diff, not the PR title.
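As a rough illustration of the ordering described above, the sketch below declares the assessment with the evidence field first and the verdict last, and serializes it in that order. Only evidenceQuotes and riskLevel are named in the post; the reasoning field, the RiskAssessment class, and the sample diff line are assumptions made for this example, not the actual schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class RiskAssessment:
    # Field order mirrors the described schema: evidence first, verdict last.
    evidenceQuotes: List[str] = field(default_factory=list)
    reasoning: str = ""       # hypothetical field name, not from the post
    riskLevel: str = "LOW"    # default when no evidence of risk is found


def to_json(assessment: RiskAssessment) -> str:
    # asdict() and json.dumps() both preserve declaration order,
    # so evidenceQuotes is emitted before riskLevel.
    return json.dumps(asdict(assessment))


# Made-up example of a diff line quoted as evidence.
example = RiskAssessment(
    evidenceQuotes=['+  if (session.role === "admin") {'],
    reasoning="The diff changes a role check on an authenticated route.",
    riskLevel="HIGH",
)
print(to_json(example))
```

Because a model generates fields in order, putting the verdict last means the quoted evidence is already written by the time the label is produced.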

The classifier is also tuned to prefer false HIGHs over false LOWs. A false HIGH costs one unnecessary review. A false LOW lets risky code ship unreviewed. It flags 93% of data integrity PRs and 92% of security PRs as HIGH risk. On the other end of the spectrum, 0.2% of styling PRs and 0.4% of docs PRs get flagged HIGH.

These categories aren't fixed. Every risk assessment includes an "Incorrect?" link that logs the response to Datadog and routes a notification to Slack. When an engineer flags a misclassification, we review it, and if the classifier was wrong, we add the PR to our evals.

Two hard rules bypass the LLM: PRs with 100+ changed files are always HIGH, and CODEOWNERS-protected paths always require human review.
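One way to picture the two hard rules is as deterministic checks that run before the model is consulted at all. The sketch below is an assumption about how such routing could look: the function name, the classify callback, and the use of fnmatch (a simplification of real CODEOWNERS pattern matching) are all hypothetical.

```python
from fnmatch import fnmatch
from typing import Callable, Iterable, List

MAX_FILES = 100  # from the post: PRs with 100+ changed files are always HIGH


def requires_human_review(
    changed_files: Iterable[str],
    protected_patterns: Iterable[str],
    classify: Callable[[List[str]], str],
) -> bool:
    """Apply the hard rules first; only then fall through to the LLM."""
    files = list(changed_files)
    patterns = list(protected_patterns)

    # Hard rule 1: very large PRs skip the classifier entirely.
    if len(files) >= MAX_FILES:
        return True

    # Hard rule 2: any file under a protected path needs a human.
    # fnmatch is a stand-in here; real CODEOWNERS patterns follow
    # gitignore-style rules that this does not fully reproduce.
    if any(fnmatch(f, p) for f in files for p in patterns):
        return True

    # Otherwise defer to the classifier's label ("LOW" or "HIGH").
    return classify(files) != "LOW"
```

Keeping these checks outside the model means a manipulated or mistaken classification cannot override them.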

All LLM calls route through Vercel AI Gateway for caching, rate limiting, and observability. The cost is ~$0.054 per assessment, or about $51/week.

This approach puts into practice what we recently described as executable guardrails. Instead of a wiki page listing what counts as risky, we encoded that judgment into the pipeline itself.

Testing, validation, and rollout

We rolled this out in three phases, each designed to build confidence before increasing the level of merge autonomy. Before starting the test, we defined kill switches. The experiment would end if:

The revert rate exceeded 3x our baseline of 1.7% (a 5.1% threshold)

The rollback rate exceeded 3x baseline (7.2/week threshold)

Team sentiment turned negative
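The three kill switches above amount to a simple check that could be evaluated each week. The following is a minimal sketch under that assumption; the function name and its inputs are hypothetical, and the team-sentiment signal is reduced to a boolean because the post does not say how it was measured.

```python
BASELINE_REVERT_RATE = 0.017                   # 1.7% baseline from the post
REVERT_THRESHOLD = 3 * BASELINE_REVERT_RATE    # the 5.1% threshold
ROLLBACK_THRESHOLD_PER_WEEK = 7.2              # 3x baseline, as stated


def should_stop_experiment(
    reverts: int,
    merged_prs: int,
    rollbacks_this_week: float,
    sentiment_negative: bool,
) -> bool:
    """Return True if any of the three kill switches has tripped."""
    revert_rate = reverts / merged_prs if merged_prs else 0.0
    return (
        revert_rate > REVERT_THRESHOLD
        or rollbacks_this_week > ROLLBACK_THRESHOLD_PER_WEEK
        or sentiment_negative
    )
```

Fixing the thresholds before the experiment starts keeps the stop decision from being renegotiated once results come in.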

Here are the phases of the experiment and what happened in each:

Phase 1: silent classification

The LLM began labeling every PR as LOW or HIGH risk. The only visible signal was an informational GitHub check that surfaced the classification. Nothing changed operationally.

We collected data and validated accuracy against our own assessment of risk. It took about three weeks of prompt iteration to meet our accuracy thresholds. At that point, we were ready to validate results with our engineering team.

Phase 2: visible labels

Vercel Agent started commenting on every PR with the risk classification and rationale. Engineers could see the reasoning, challenge it, and click “Incorrect?” to flag mistakes.

Phase 3: enforcement

In this phase, LOW-risk PRs were auto-approved by Vercel Agent, satisfying branch protection without a human reviewer. HIGH-risk PRs got a warning comment and still required human approval.

Engineers were still able to request review on any PR they submitted. The change was that review was no longer a blocker for low-risk changes.

The results cleared every safety threshold, and the workflow is now default for the repo.

SOC-2 compliance was maintained throughout the experimentation and enforcement process. We cover details in the compliance section below.

Results

Skipping review didn't increase reverts

This was the question that mattered most. If we let low-risk PRs skip review, would more bad code reach production?

671 low-risk PRs skipped review. Zero were reverted. (Wilson 95% CI upper bound: 0.6%, well below our 1% safety threshold.)

The control group (low-risk PRs that still received review) had the same revert rate: 2 out of 1,047 (0.2%). Skipping review made no measurable difference.
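The quoted upper bound can be reproduced with the standard Wilson score interval. The sketch below uses z = 1.96 for a two-sided 95% interval; the function name is ours, and whether the original analysis used exactly this variant (for example, without a continuity correction) is an assumption.

```python
import math


def wilson_upper_bound(events: int, n: int, z: float = 1.96) -> float:
    """Upper limit of the Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = events / n
    denom = 1 + z * z / n
    center = p + z * z / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center + margin) / denom


# Zero reverts among 671 auto-approved PRs.
upper = wilson_upper_bound(0, 671)
print(f"{upper:.2%}")
```

With zero observed reverts the bound depends only on the sample size, which is why 671 PRs is enough to put the plausible revert rate below the 1% safety threshold.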

Deployment rollbacks decreased from 2.8 per week to 1.9 per week. None of the rollbacks during the experiment were caused by an auto-approved PR. We mapped each rolled-back deployment to the triggering PR via the Vercel Activity Log.

The one incident-causing rollback was a middleware redirect change. The classifier flagged it HIGH. A human reviewed it, approved it, and merged it. The classifier caught the dangerous change, but the human let it through.

Engineers shipped 62% faster

PRs that skipped review had a median merge time of 0.5 hours, compared to 2.3 hours for reviewed PRs. The gap widens at the tail: at p90, skipped PRs were 58.3 hours faster than reviewed PRs.

Adoption was immediate. The week enforcement turned on, 61% of low-risk PRs skipped review.

Individual human throughput increased 46%. PRs per active author went from 2.6 per week to 3.8 per week.

Peak merge time shifted from 2-4pm to 6-10pm PST. Off-hours merges increased by 7.5 percentage points, weekend merges by 6.3 percentage points. Engineers merge when the work is done, not when a reviewer is online.

Human review got better where it matters

Time-to-first-review on HIGH-risk, large-diff PRs dropped from 24.7 hours to 9.0 hours, a 2.7x improvement. When a risky change needs human eyes, it gets them faster.

Reviewer workload also decreased from 13 PRs per week to just over 5. With fewer PRs to review, the reviews that do happen are more thorough.

Rubber-stamp rates on HIGH-risk PRs held steady (11.9% vs 12.4% baseline). Security concerns flagged in reviews jumped from 6.3% to 27.2% (n=261 HIGH-risk PRs in phase 2). Review depth on small HIGH-risk diffs improved (Cohen's d = 0.44).

Do engineers agree with the classifier?

We measured engineer disagreement through behavior, not surveys.

Signal | Rate
CHANGES_REQUESTED on LOW-risk PRs | 0.9%
LOW-risk PRs reverted | 0.2%

43% of low-risk PRs still received voluntary reviews even though they weren’t required. 70% of those had zero comments. Some teams still prefer review for collaboration and knowledge sharing, but the point is making review a choice, not a gate.

Adversarial hardening

The classifier processes user-controlled input and auto-approves based on the result. This is an adversarial surface.

Architecture

The system is designed so that even a fully compromised LLM output can’t cause serious harm:

Zero tools. The LLM outputs structured JSON. No code execution, no file access, no API calls.

Constrained output, pre-determined actions. The model can only return two valid risk levels, which map to two possible system actions: APPROVE or WARNING COMMENT. There is no path from LLM output to arbitrary behavior.

Fail-open. If the LLM fails or returns invalid output, the PR falls back to standard human review.
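The three properties above can be sketched as a single pure function: it takes the raw model output, accepts only the two known risk levels, and maps anything else to the human-review fallback. The action identifiers and the function name are assumptions for illustration; the post names the actions only as APPROVE and WARNING COMMENT.

```python
import json

# The only two actions the system can take on a valid classification.
ACTIONS = {"LOW": "APPROVE", "HIGH": "WARNING_COMMENT"}
FALLBACK = "HUMAN_REVIEW"  # standard review when the output is unusable


def decide_action(raw_output) -> str:
    """Map raw LLM output to one of a fixed set of actions."""
    try:
        data = json.loads(raw_output)
    except (TypeError, ValueError):
        # Covers non-string input and malformed JSON alike.
        return FALLBACK
    if not isinstance(data, dict):
        return FALLBACK
    level = data.get("riskLevel")
    if not isinstance(level, str):
        return FALLBACK
    # Unknown labels never reach an action; they fall back to a human.
    return ACTIONS.get(level, FALLBACK)
```

Since the function only looks up a label in a fixed table, nothing in the model's text can select an action outside that table.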

Input hardening

Invisible Unicode stripping. We strip tag characters (U+E0000-E007F), variation selectors, and bidi overrides from all LLM inputs. These invisible characters can smuggle instructions into prompts. GitHub preserves them in diffs. This was exploited in the GlassWorm campaign (151+ repositories, invisible prompt injection).

Output sanitization. Model-generated text is sanitized before posting to GitHub. Non-HTTPS links and image embeds are stripped.

Author gating. PRs from untrusted authors (first-time contributors, placeholder accounts) get classification and a posted assessment, but never auto-approval.

Adversarial eval suite. Three prompt injection scenarios (Unicode smuggling, XML tag injection, code comment manipulation) run on every deploy with a 100% accuracy gate.
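The invisible-Unicode stripping step could look roughly like the sketch below. The tag-character range comes from the post; the exact variation-selector and bidirectional-control ranges are our assumption about what "variation selectors" and "bidi overrides" cover, and the function name is hypothetical.

```python
import re

# Character class of code points removed before text reaches the model.
_INVISIBLE = re.compile(
    "["
    "\U000E0000-\U000E007F"  # tag characters (range given in the post)
    "\uFE00-\uFE0F"          # variation selectors (assumed range)
    "\U000E0100-\U000E01EF"  # variation selectors supplement (assumed)
    "\u202A-\u202E"          # bidi embeddings and overrides (assumed)
    "\u2066-\u2069"          # bidi isolates (assumed)
    "]"
)


def strip_invisible(text: str) -> str:
    """Remove invisible characters that can hide instructions in a diff."""
    return _INVISIBLE.sub("", text)


# A diff line with hidden tag characters and a right-to-left override.
dirty = "ok" + "\U000E0049\U000E0047" + "\u202e" + "!"
print(strip_invisible(dirty))
```

One trade-off of this sketch is that removing U+FE0F also changes how some emoji render, which is harmless for classifier input but would matter if the cleaned text were shown to users.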

What we considered and rejected

We explored additional approaches to hardening, but didn't implement them.

Defense | Why
Salted XML tags | Chatbot threat model, not structured classification. Breaks prompt caching.
Sandwich defense | 5.5-85.6% attack success rate in published research; >95% against adaptive attacks
XML-escaping inputs | Mangles legitimate code

Limitations

Classification is probabilistic. Carlini et al. (2025) showed 12 published prompt injection defenses bypassed at 71-100% success rates. No single defense is absolute.

The classifier speeds up easy decisions. It doesn’t replace judgment on hard ones. HIGH-risk changes always require human review. The worst case for a successful attack is one low-risk PR getting auto-approved, and we monitor revert and rollback rates continuously to catch drift.

Compliance

A common question from engineering leaders is: what about compliance?

Compliance frameworks require a well-defined change management program, not specifically manual peer review. What matters is that changes are authorized, documented, tested, and approved through a consistent, auditable process.

Adding an LLM-based risk classifier strengthens our change management process in three ways:

Better documentation. Every PR now gets a structured risk assessment with reasoning, evidence, and a classification. Under mandatory review, 52% of approvals had no documentation at all. The audit trail went from a single approval click to a full risk rationale per PR.

Risk-based routing. Instead of treating every change the same, the classifier routes human attention to HIGH-risk changes where it matters most. LOW-risk changes flow through a consistent, auditable approval process. Security-sensitive paths still require designated reviewers via CODEOWNERS.

Continuous monitoring. Revert rates, rollback rates, and classifier accuracy are tracked weekly. This creates a feedback loop that mandatory review never had: we can measure whether the process is working, not just assume it is.

Including an LLM-based PR classifier modeled after our internal risk-based approach has enhanced our change management processes and provided additional documentation for auditability.

What we learned

Mandatory review was already theater. 52% of reviews produced nothing. Auto-approve didn’t remove a functioning safety net. It stopped requiring one that wasn’t working. 671 skipped reviews, same revert rate as reviewed PRs, 62% faster merges.

The real gain is focus, not just speed. Reviewers reach HIGH-risk PRs 2.7x faster. The bottleneck for critical PRs was never review capacity. It was review allocation.

Conservative classification is the right default. The cost of a false HIGH is one unnecessary review. The cost of a false LOW is risky code shipping unreviewed. Over-flag and let engineers opt out.

Review became a choice, not a gate. 43% of low-risk PRs still received voluntary reviews. Some teams prefer review for collaboration and knowledge sharing, which is fine.

What's next

The skip-review workflow is now permanent on our largest monorepo. We’re rolling it out to more repositories using the same three-phase approach.

As agents generate more code and PR volume increases, the review bottleneck will only get worse for teams that don’t adapt. The answer isn’t reviewing harder. It’s building systems that encode risk judgment into the pipeline.

The scarce resource is human judgment. Spend it where it counts.

