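Making Turborepo 96% faster with agents, sandboxes, and humans
Over eight days of optimization work that combined AI agents, sandboxes, and conventional engineering practices, Vercel made Turborepo's task graph computation up to 96% faster, substantially cutting run times in large monorepos.
Key points
Major performance gains
Turborepo's task graph computation is up to 96% faster, and turbo run now feels instant in monorepos with more than 1,000 packages.
Automated optimization with AI agents
Eight background coding agents ran overnight, and three of them produced practical improvements (a hashing optimization, a dependency replacement, and an algorithmic improvement).
Limits of AI agents and lessons learned
The agents never ran benchmarks on their own and tended to fixate on their first idea, showing that they fall short without proper context engineering.
A combined optimization approach
The gains came not from a single optimization but from eight days of combining AI agents, Vercel Sandboxes, and conventional engineering practices.
Agent limitations and the need for human involvement
The agents chased big numbers in microbenchmarks that contributed almost nothing to real-world performance, and they never wrote regression tests or used the profiling flag, so stronger testing, a better verification loop, and human involvement were needed.
Better agent performance through a better profile format
JSON profiles in Chrome Trace Event Format were hard for both humans and agents to read. Converting them to Markdown dramatically improved the quality of the agents' optimization suggestions.
Establishing an iterative loop for efficient optimization
With Markdown profiles in hand, the author established a cycle in which the agent, in Plan Mode, is instructed to create a profile and identify hotspots, followed by human-driven verification, implementation, and testing. This made efficient optimization possible.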
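Impact analysis
This article is a concrete example of AI agents delivering measurable performance gains in real product development, and it signals that AI-assisted development is moving into practical use. At the same time, it makes the current limits of AI agents and the need for human oversight clear, and it reaffirms the importance of collaboration between AI and humans.
Editorial comment
A valuable case study that shows both the practical results of AI agents and their limits. An important read for setting realistic expectations about using AI in day-to-day development.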
Turborepo is now 81-91% faster to compute its task graph in our repositories, scaling with repo size. On our 1,000+ package monorepo, turbo run now feels instant. Time to First Task is now 11x faster.
After testing my changes with some open source Turborepos and asking Vercel customers to try canary releases on their repositories, I found the performance improvement could get as high as 96% depending on the size and complexity of the repository.
The process behind earning these performance gains is worth sharing, because it wasn't one optimization or one technique. It was eight days of mixing AI agents, Vercel Sandboxes, and typical, boring engineering practices.
How Turborepo schedules your tasks
Every turbo run starts by analyzing your monorepo's structure, scripts, and dependencies to build a task graph. That graph determines execution order, creates parallelism, and powers caching so you never repeat the same work twice.
Building the task graph is overhead you pay before your repository's work begins. The larger the repo, the higher the cost. On our 1,000-package monorepo, that cost was around 10 seconds on an M4 Pro Max. I don't know about you, but I found that unacceptable.
Starting with unattended agents
I wanted to see what agents could do about this without much guidance. I spun up 8 background coding agents from my phone before bed, each targeting a different part of the Rust codebase I suspected was too slow.
In each prompt, I replaced the part of the codebase I was interested in with a new target. I was curious what the agents would accomplish with plenty of ambiguity, as a baseline.
By morning, 3 of the 8 had produced outputs that I could turn into shippable wins:
PR #11872 netted a ~25% reduction in wall-clock time, reducing allocation pressure through hashing by reference instead of cloning an entire HashMap.
PR #11874 replaced twox-hash, one of our Rust dependency crates, with xxhash-rust. A near 1:1 replacement that uses a faster hashing algorithm, creating a ~6% win.
PR #11878 came from an existing TODO comment that we hadn't gotten to yet. We needed to replace an unnecessary Floyd-Warshall algorithm with a multi-source depth-first search (DFS). This wasn't on the hot path of turbo run, but my prompts didn't specify which hot path, did they? Fair.
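The first of these wins, hashing by reference instead of cloning a map, can be illustrated with a minimal sketch. The function names and the map contents are hypothetical, the sketch uses only the standard library, and it is not the code from PR #11872.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Clone-based version: copies every key and value just to sort and hash them.
fn hash_by_clone(map: &HashMap<String, String>) -> u64 {
    let mut entries: Vec<(String, String)> = map.clone().into_iter().collect();
    entries.sort();
    let mut hasher = DefaultHasher::new();
    entries.hash(&mut hasher);
    hasher.finish()
}

/// Reference-based version: sorts borrowed pairs, so no String is copied.
fn hash_by_ref(map: &HashMap<String, String>) -> u64 {
    let mut entries: Vec<(&String, &String)> = map.iter().collect();
    entries.sort();
    let mut hasher = DefaultHasher::new();
    entries.hash(&mut hasher);
    hasher.finish()
}
```

Both functions sort the entries first so the result does not depend on the map's iteration order. The reference-based version allocates only the vector of borrowed pairs, which is the kind of allocation pressure the PR description refers to.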
These are undoubtedly meaningful successes, but reviewing all 8 chat sessions and code outputs taught me just as much about where unattended, state-of-the-art agents without proper context engineering will fall short today.
The agent never realized it could benchmark the improvements on the Turborepo codebase itself. Turborepo dogfoods Turborepo, so it could have easily built a binary and run it right on the source code to get end-to-end results.
The agent would hyperfixate on the first idea that it came up with and force it to work, rather than backing up and thinking abstractly about the problem (even though the chat logs showed it trying to do so).
The agent would chase the biggest number it could get, creating microbenchmarks that were relatively meaningless when it came to real-world performance. It would then crank out a 97% improvement for the benchmark, which actually amounted to a 0.02% real-world improvement.
Never once did an agent write a regression test.
Never once did an agent use the --profile flag in the turbo CLI.
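The gap between a 97% microbenchmark win and a 0.02% real-world win follows from how small a share of the total run the benchmarked code represents. A small sketch of that arithmetic, where the function name and the example share are assumptions chosen for illustration:

```rust
/// Overall fractional improvement when code that accounts for `share` of the
/// total run time becomes `local_improvement` faster (both in 0.0..=1.0).
///
/// Time saved = share * local_improvement * total, so the overall
/// improvement is simply the product of the two fractions.
fn overall_improvement(share: f64, local_improvement: f64) -> f64 {
    share * local_improvement
}
```

If the benchmarked code accounts for about 0.0206% of the run (an assumed figure), `overall_improvement(0.000206, 0.97)` comes out to roughly 0.0002, or about 0.02% end to end. This is why end-to-end measurements matter more than the largest isolated number.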
The agents running unattended produced some good wins, but I could tell this wouldn't be sustainable. We needed stronger testing, and a better verification loop. I had to be more involved.
Making profiling work for agents and humans
The first normal engineering thing I did was take a profile. Shocking, I know.
I ran turbo run build --profile on our largest repo and opened the trace in Perfetto.
Flame graphs are informative, but can be slow to work with. As much as I do enjoy reading flame graphs and grinding out a win, Turborepo has a lot of shipping to do. I have a duty to users of Turborepo to work efficiently and effectively, using the best tools that I have at my disposal.
Maybe Chrome Tracing JSON isn't the best format
Turborepo's profiles are JSON files in Chrome Trace Event Format.
An LLM can theoretically read through and parse all this, but...well...just look at it. Function identifiers split across lines, irrelevant metadata mixed in with timing data, not grep-friendly. I pointed an agent at the file and watched it struggle through grep calls, trying to piece together function names from different lines, unsuccessfully trying to filter out noise. It was fumbling through this file in the same way I would.
One of my favorite heuristics for working with coding agents is that if something is poorly designed for me to work with, it's poorly designed for an agent, too. This isn't necessarily a comment about work quantity, but more so about interfaces. If something is hard for me to read, it stands to reason it's hard for an agent to read, too. This idea has its limits, but you'll see it quickly pay dividends in a moment.
Building LLM-friendly profiles
A week prior, I saw a tweet from Jarred Sumner about how Bun shipped a new flag: --cpu-prof-md. It outputs profiles as Markdown, which easily fits into my view of how agents work best.
In #11880, I added a new turborepo-profile-md crate that generates a companion .md file alongside every trace. Hot functions sorted by self-time, call trees sorted by total-time, caller/callee relationships. All greppable, all on single lines.
The difference in the agent's output quality was dramatic. Same model, same codebase, same data, same agent harness. Different format, radically better optimization suggestions. The profile data was finally in a format that both I and the agent could read at a glance.
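To make the idea of a greppable, one-function-per-line profile concrete, here is a minimal sketch of a "hot functions" table sorted by self-time. The `Span` type and `hot_functions_markdown` function are hypothetical and standard-library only. This is not the turborepo-profile-md crate, whose actual output format is not shown here.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified span: a function name and its self-time in microseconds.
struct Span {
    name: &'static str,
    self_time_us: u64,
}

/// Render a "hot functions" Markdown table, one function per line,
/// sorted by total self-time (highest first).
fn hot_functions_markdown(spans: &[Span]) -> String {
    // name -> (accumulated self-time, number of calls)
    let mut totals: HashMap<&str, (u64, u64)> = HashMap::new();
    for span in spans {
        let entry = totals.entry(span.name).or_insert((0, 0));
        entry.0 += span.self_time_us;
        entry.1 += 1;
    }

    let mut rows: Vec<(&str, u64, u64)> = totals
        .into_iter()
        .map(|(name, (self_time, calls))| (name, self_time, calls))
        .collect();
    // Highest self-time first; break ties by name so the output is stable.
    rows.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));

    let mut out = String::from("| Function | Self-time (us) | Calls |\n|---|---|---|\n");
    for (name, self_time, calls) in rows {
        out.push_str(&format!("| {name} | {self_time} | {calls} |\n"));
    }
    out
}
```

Because every function lands on a single line with its numbers beside it, a plain `grep` for a function name returns the whole record, which is the property the JSON trace lacked.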
The iterative loop
With Markdown profiles, I settled into a rhythm.
1. Put the agent in Plan Mode with instructions to create a profile and find hotspots in the Markdown output
2. Review the proposed optimizations and decide which ones were worth pursuing
3. Have the agent implement the good proposal(s)
4. Validate with end-to-end hyperfine benchmarks
5. Make a PR
6. Repeat
This loop produced over 20 performance PRs in four days. The wins fell into three categories. I'll give some examples.
Parallelization was the largest. Building the git index, walking the filesystem for glob matches, parsing lockfiles, and loading package.json files were all sequential operations that could run concurrently. PRs #11889, #11902, #11927, and #11918 parallelized these hot paths.
Allocation elimination removed redundant copies and clones throughout the pipeline, including reference-based hashing in SCM operations (#11916), pre-compiling glob exclusion filters (#11891), and using a shared HTTP client instead of constructing a new one per request (#11929).
Syscall reduction batched per-package git subprocess calls into a single repo-wide index (#11887), replaced git subprocesses with libgit2 library calls (#11938), and then replaced libgit2 with the faster gix-index altogether (#11950).
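As an illustration of the parallelization category, the sketch below runs independent startup steps concurrently with scoped threads from the standard library. The three step functions are hypothetical stand-ins that only sleep and return a count. How the actual PRs parallelized these paths is not described in this article.

```rust
use std::thread;
use std::time::Duration;

// Hypothetical stand-ins for independent startup steps; each just sleeps
// briefly and returns a made-up item count.
fn build_git_index() -> usize {
    thread::sleep(Duration::from_millis(20));
    3
}

fn parse_lockfile() -> usize {
    thread::sleep(Duration::from_millis(20));
    5
}

fn load_package_jsons() -> usize {
    thread::sleep(Duration::from_millis(20));
    7
}

/// Run the independent steps concurrently and collect their results.
fn load_all_parallel() -> (usize, usize, usize) {
    thread::scope(|s| {
        let index = s.spawn(build_git_index);
        let lockfile = s.spawn(parse_lockfile);
        // Reuse the current thread for the third step instead of idling.
        let packages = load_package_jsons();
        (index.join().unwrap(), lockfile.join().unwrap(), packages)
    })
}
```

When the steps do not depend on each other, the wall-clock cost approaches that of the slowest step rather than the sum of all of them.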
Again, it's typical, normal, boring software engineering stuff. I did try to turn this into a Ralph Wiggum loop, but it repeatedly made too many mistakes. The combination of the model, the harness, and the loop simply wasn't dependable enough, and it could move too much code out from underneath me too quickly. Maybe if I were working on a side project, I would have accepted it, but Turborepo powers some of the largest repositories in the world. I have to be fast and responsible.
Your source code is the best feedback loop
The most interesting pattern I noticed during this phase was how the codebase itself served as the agent's strongest feedback mechanism.
I'd point out a performance issue in code the agent was working on. We'd fix it together. Then I'd ask, "Do you see anywhere else where we can improve in the same way?" The agent would find more instances of the same pattern across the codebase. Depending on the size of the changes, I would either add the change to the PR or write it down to do later.
In places where the existing code had a sloppy pattern, the agent would write new code in the same style. Once I corrected one instance, the agent followed the correction going forward. In future conversations, without any memory or context carrying across chats, the agent would see the merged improvements in the source and stop reproducing the old patterns.
Over time, I noticed the agent spontaneously writing tests when I wasn't expecting it to. I saw it creating abstractions that matched what I would have done, which wasn't happening before. I would revisit a place in the codebase where the agent had previously been ineffective, and, with no changes to model or harness, it would produce better code outputs.
It turns out your own source code is the best reinforcement learning out there.
Hitting a wall at 85%
By the end of the week, Turborepo was roughly 85% faster on our largest repo. Before I started, I had arbitrarily set a goal of 95% better. The remaining gains were feeling within reach.
The problem became measurement. I had been running all benchmarks on my MacBook, and the hyperfine reports were getting increasingly noisy. As the code gets faster, system noise matters more. Syscalls, memory, and disk I/O all have their variance.
The profiles were noisy too. I had gotten the codebase to a point where the individual functions were fast enough that background activity on my laptop was drowning out any good signal.
Was the change I made really 2% faster, or did I just get lucky with a quiet run? I couldn't confidently distinguish real improvements from noise. I needed a quieter lab for my science.
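One crude way to frame the "real or lucky" question is to compare the difference between two sets of timings against their run-to-run spread. The sketch below is a rough heuristic with hypothetical function names. It is not a proper statistical test and not the method used in this work.

```rust
/// Sample mean and sample standard deviation. Needs at least two samples.
fn mean_and_stddev(samples: &[f64]) -> (f64, f64) {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let variance = samples.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / (n - 1.0);
    (mean, variance.sqrt())
}

/// Rough heuristic: treat a change as distinguishable from noise only when
/// the gap between the means is larger than the two standard deviations combined.
fn exceeds_noise(before: &[f64], after: &[f64]) -> bool {
    let (mean_before, sd_before) = mean_and_stddev(before);
    let (mean_after, sd_after) = mean_and_stddev(after);
    (mean_before - mean_after).abs() > sd_before + sd_after
}
```

On a noisy machine the standard deviations grow while the true difference stays small, so a check like this fails more often as the code gets faster. Lowering the noise floor is the other way to make it pass.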
Vercel Sandbox for benchmarking
Vercel Sandboxes are ephemeral Linux containers that only have what you put in them. No background daemons, no Slack notifications pulling CPU, no background programs making network requests. The machine's resources are entirely focused on what you're running.
I wrote a bash script that automated the entire benchmarking workflow. I'll put an abbreviated version of the full gist below.
You'll notice that, at the end of this script, I'm downloading the profiles back to my laptop. My agent could then inspect the benchmark results and Markdown profiles locally, and I could confidently tell whether a change was a real improvement or noise.
Breaking through the wall
With clean signal from Sandbox, I could see real breakthroughs in low-level changes that were invisible on my noisy laptop.
Stack-allocated git OIDs (#11984)
Every file in the git index stored its 40-character SHA-1 hash as a heap-allocated String. On our largest repo, new_from_gix_index alone was creating over 10,000 individual 40-byte heap allocations.
OidHash implements Deref<Target=str> so existing consumers work unchanged, and Copy means cloning is a 40-byte memcpy on the stack instead of a heap allocation. Profile data showed new_from_gix_index self-time dropped 15% and get_package_file_hashes_from_index dropped 17%.
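A minimal sketch of what such a type could look like follows. The name `OidHash`, the `Deref<Target=str>` implementation, and `Copy` come from the description above. The field layout, the `from_hex` constructor, and its validation are assumptions for illustration, not the code from #11984.

```rust
use std::ops::Deref;

/// Sketch of a stack-allocated, 40-character hex object ID.
/// `Copy` makes duplication a plain 40-byte copy with no heap allocation.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
struct OidHash([u8; 40]);

impl OidHash {
    /// Hypothetical constructor: accepts exactly 40 ASCII hex characters.
    fn from_hex(hex: &str) -> Option<Self> {
        let src = hex.as_bytes();
        if src.len() != 40 || !src.iter().all(|b| b.is_ascii_hexdigit()) {
            return None;
        }
        let mut bytes = [0u8; 40];
        bytes.copy_from_slice(src);
        Some(OidHash(bytes))
    }
}

impl Deref for OidHash {
    type Target = str;

    fn deref(&self) -> &str {
        // from_hex only admits ASCII hex digits, which are always valid UTF-8.
        std::str::from_utf8(&self.0).expect("OidHash holds ASCII hex")
    }
}
```

Because of the `Deref` implementation, callers can keep using `str` methods such as `starts_with` or `len` on an `OidHash`, which is what lets existing consumers work unchanged.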
| Repo size | Before | After | Change |
|---|---|---|---|
| ~1,000 packages | 1.463s ± 0.052s | 1.466s ± 0.027s | Same speed, 48% less variance |
| ~125 packages | 658.6ms ± 144.6ms | 592.1ms ± 62.9ms | 10% faster, 57% less variance |
| 6 packages | 96.8ms ± 46.7ms | 75.0ms ± 18.4ms | 22% faster, 61% less variance |
The most notable improvement across all three sizes was the reduction in run-to-run variance, which agrees with our theory of less allocator pressure and more predictable performance.
Syscall elimination (#11985)
Every cache fetch was performing three syscalls: stat(.tar), which returned ENOENT, then stat(.tar.zst), then open(.tar.zst). Weird pattern.
After some digging, I figured out that the .tar fallback existed for cache artifacts from Turborepo's Golang era (2021-2022). No modern version writes uncompressed cache entries, and cache entries rotate out constantly.
Across 962 cache fetches on our largest repo, fetch self-time dropped from 200.5ms to 129.6ms, a 35% reduction.
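One way to avoid probe calls of this kind is to open the compressed artifact directly and treat "not found" as a cache miss. The sketch below shows that pattern with a hypothetical `open_cache_entry` function and file naming scheme. Whether #11985 removed only the `.tar` probe or also the second `stat` is not stated here, so treat this as an illustration of the idea rather than the actual change.

```rust
use std::fs::File;
use std::io;
use std::path::Path;

/// Hypothetical cache lookup: open `<hash>.tar.zst` directly and map
/// "not found" to a cache miss, with no stat() probes beforehand.
fn open_cache_entry(cache_dir: &Path, hash: &str) -> io::Result<Option<File>> {
    let path = cache_dir.join(format!("{hash}.tar.zst"));
    match File::open(&path) {
        Ok(file) => Ok(Some(file)),
        // A missing file is an ordinary miss, not an error.
        Err(err) if err.kind() == io::ErrorKind::NotFound => Ok(None),
        Err(err) => Err(err),
    }
}
```

Opening first and handling the error afterwards also sidesteps the small race between checking that a file exists and opening it.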
Move instead of clone (#11986)
The visitor dispatch loop was deep-cloning a (String, HashMap<String, String>) from a precomputed map for each of roughly 1,700 tasks. Since each task ID appears exactly once in the dispatch stream, HashMap::remove() can move the value out at zero cost instead of cloning.
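The difference between the two approaches can be sketched as follows. The `TaskEnv` alias and both function names are hypothetical. Only the tuple shape and the use of `HashMap::remove()` come from the description above.

```rust
use std::collections::HashMap;

/// The per-task payload shape described above.
type TaskEnv = (String, HashMap<String, String>);

/// Clone-based lookup: deep-copies the String and the inner map on every call.
fn take_by_clone(precomputed: &HashMap<String, TaskEnv>, task_id: &str) -> Option<TaskEnv> {
    precomputed.get(task_id).cloned()
}

/// Move-based lookup: removes the entry and hands ownership to the caller,
/// so nothing is copied. Valid only because each task ID is requested once.
fn take_by_move(precomputed: &mut HashMap<String, TaskEnv>, task_id: &str) -> Option<TaskEnv> {
    precomputed.remove(task_id)
}
```

The move-based version relies on the invariant that each task ID appears exactly once in the dispatch stream: a second request for the same ID would return `None`.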
Results
After eight days, Time to First Task on our largest repo dropped from 8.1 seconds to 716 milliseconds.
| Repo size | v2.8.0 | v2.9.0 | Improvement |
|---|---|---|---|
| ~1,000 packages | 8.1s | 0.716s | 91% faster |
| 132 packages | 1.9s | 0.361s | 81% faster |
| 6 packages | 0.676s | 0.132s | 80% faster |
I estimate this would have taken at least two months without agents, but I hope this article shows you that they didn't do the work for me. I was leading the entire time, deciding what to profile, which proposals to pursue, when to change tools, and when to change strategy. But the combination of my existing engineering knowledge, giving agents better tooling, and a clean benchmarking environment let me move at a pace that wouldn't have been possible six months ago.
Released in Turborepo 2.9
These performance gains are now stable and ready for you to use. Visit the Turborepo 2.9 release post to learn more about the latest in Turborepo.