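An AI Agent Coding Skeptic Tries It Out in Detail
A data scientist skeptical of AI agent coding ran hands-on experiments using LLMs to improve code, and shares practical findings about their current usefulness and limits.
Key points
Skepticism toward AI agent coding
The author questions the excessive hype around AI agent coding and points to problems with agents' unpredictability, cost, and reliability.
Hands-on LLM experiments
The author fed the code of the gemimg Python package into the latest LLMs on OpenRouter and asked them to identify and fix issues in order to raise code quality.
Practical improvements from LLMs
The LLMs identified genuinely useful improvements, such as adding function docstrings, improving type hints, and suggesting more Pythonic implementations.
Limited usefulness of GitHub Copilot
For data science work, GitHub Copilot running Claude Sonnet 4.5 tended to produce overly verbose Jupyter Notebooks and was not helpful.
The importance of hands-on evaluation of AI tools
The author evaluates AI tools on actual usage rather than hype, and is open to adopting agents once the conditions are right.
The importance of the AGENTS.md file and its customization
The AGENTS.md file is important for controlling an AI agent's behavior, and the author sets detailed rules in it for Python code quality. The most important rules are emphasized in capital letters.
A detailed approach to prompt design
Rather than asking an LLM vague questions, the author recommends writing detailed prompts in Markdown files and tracking them in git.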
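Impact analysis
This article offers a hands-on check on inflated expectations for AI agent coding and contributes to a healthier industry discussion. By showing both the realistic potential and the limits of LLMs, it gives developers and organizations guidance for evaluating and adopting AI tools appropriately.
Editorial comment
The article is not swept along by AI tool hype; it offers a level-headed evaluation grounded in hands-on testing, and a perspective that working developers will find useful.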
You’ve likely seen many blog posts about AI agent coding/vibecoding where the author talks about all the wonderful things agents can now do supported by vague anecdata, how agents will lead to the atrophy of programming skills, how agents impugn the sovereignty of the human soul, etc etc. This is NOT one of those posts. You’ve been warned.
Last May, I wrote a blog post titled As an Experienced LLM User, I Actually Don’t Use Generative LLMs Often as a contrasting response to the hype around the rising popularity of agentic coding. In that post, I noted that while LLMs are most definitely not useless and they can answer simple coding questions, with sufficient accuracy, faster than it would take for me to write the code myself, agents are a tougher sell: they are unpredictable, expensive, and the hype around them was wildly disproportionate given the results I had seen in personal usage. However, I concluded that I was open to agents if LLMs improved enough such that all my concerns were addressed and agents were more dependable.
In the months since, I continued my real-life work as a Data Scientist while keeping up-to-date on the latest LLMs popping up on OpenRouter. In August, Google announced the release of their Nano Banana generative image AI with a corresponding API that’s difficult to use, so I open-sourced the gemimg Python package that serves as an API wrapper. It’s not a thrilling project: there’s little room or need for creative implementation and my satisfaction with it was the net present value of what it enabled rather than writing the tool itself. Therefore as an experiment, I plopped the feature-complete code into various up-and-coming LLMs on OpenRouter and prompted the models to identify and fix any issues with the Python code: if they failed, it’s a good test for the current capabilities of LLMs; if they succeeded, then it’s a software quality increase for potential users of the package and I have no moral objection to it. The LLMs actually were helpful: in addition to adding good function docstrings and type hints, they identified more Pythonic implementations of various code blocks.
Around this time, my coworkers were pushing GitHub Copilot within Visual Studio Code as a coding aid, particularly around then-new Claude Sonnet 4.5. For my data science work, Sonnet 4.5 in Copilot was not helpful and tended to create overly verbose Jupyter Notebooks so I was not impressed. However, in November, Google then released Nano Banana Pro which necessitated an immediate update to gemimg
Create a grid.py file that implements the Grid class as described in issue #15
In November, just a few days before Thanksgiving, Anthropic released Claude Opus 4.5 and naturally my coworkers were curious if it was a significant improvement over Sonnet 4.5. It was very suspicious that Anthropic released Opus 4.5 right before a major holiday since companies typically do that in order to bury underwhelming announcements as your prospective users will be too busy gathering with family and friends to notice. Fortunately, I had no friends and no family in San Francisco so I had plenty of bandwidth to test the new Opus.
A Foreword on AGENTS.md
One aspect of agents I hadn’t researched but knew was necessary to getting good results from agents was the concept of the AGENTS.md file: a file which can control specific behaviors of the agents such as code formatting. If the file is present in the project root, the agent will automatically read the file and in theory obey all the rules within. This is analogous to system prompts for normal LLM calls and if you’ve been following my writing, I have an unhealthy addiction to highly nuanced system prompts with additional shenanigans such as ALL CAPS for increased adherence to more important rules (yes, that’s still effective). I could not find a good starting point for a Python-oriented AGENTS.md
Add an AGENTS.md file oriented for good Python code quality. It should be intricately details. More important rules should use caps, e.g. MUST
I then added a few more personal preferences and suggested tools from my previous failures working with agents in Python: use uv
NEVER use emoji, or unicode that emulates emoji (e.g. ✓, ✗).
Agents also tend to leave a lot of redundant code comments, so I added another rule to prevent that:
MUST avoid including redundant comments which are tautological or self-demonstating (e.g. cases where it is easily parsable what the code does at a glance or its function name giving sufficient information as to what the code does, so the comment does nothing other than waste user time)
My up-to-date AGENTS.md
As a side note if you are using Claude Code, the file must be named CLAUDE.md
Opus First Contact
With my AGENTS.md
From the Claude Code quickstart.
Anthropic’s prompt suggestions are simple, but you can’t give an LLM an open-ended question like that and expect the results you want! You, the user, are likely subconsciously picky, and there are always functional requirements that the agent won’t magically apply because it cannot read minds and behaves as a literal genie. My approach to prompting is to write the potentially-very-large individual prompt in its own Markdown file (which can be tracked in git
I completely ignored Anthropic’s advice and wrote a more elaborate test prompt based on a use case I’m familiar with and therefore can audit the agent’s code quality. In 2021, I wrote a script to scrape YouTube video metadata from videos on a given channel using YouTube’s Data API, but the API is poorly and counterintuitively documented and my Python scripts aren’t great. I subscribe to the SiIvagunner YouTube account which, as a part of the channel’s gimmick (musical swaps with different melodies than the ones expected), posts hundreds of videos per month with nondescript thumbnails and titles, making it nonobvious which videos are the best other than the view counts. The video metadata could be used to surface good videos I missed, so I had a fun idea to test Opus 4.5:
Create a robust Python script that, given a YouTube Channel ID, can scrape the YouTube Data API and store all video metadata in a SQLite database. The YOUTUBE_API_KEY is present in .env. Documentation on the channel endpoint: https://developers.google.com/youtube/v3/guides/implementation/channels The test channel ID to scrape is: UC9ecwl3FTG66jIKA9JRDtmg You MUST obey ALL the FOLLOWING rules in your implementation. - Do not use the Google Client SDK. Use the REST API with httpx. - Include sensible aggregate metrics, e.g. number of comments on the video. - Incude channel_id and retrieved_at in the database schema.
The resulting script is available here, and it worked first try to scrape up to 20,000 videos (the max limit). The resulting Python script has very Pythonic code quality following the copious rules provided by the AGENTS.md
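As a rough illustration of the storage requirement in the prompt above (not the script Opus generated), the sketch below shows one way a SQLite table could carry channel_id and retrieved_at alongside aggregate metrics. The table name, the other column names, and the save_videos helper are all hypothetical; the HTTP calls to the YouTube Data API are deliberately left out.

```python
import sqlite3
from datetime import datetime, timezone

# Hypothetical schema: only channel_id and retrieved_at come from the prompt;
# every other column name here is an assumption for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS videos (
    video_id      TEXT PRIMARY KEY,
    channel_id    TEXT NOT NULL,
    title         TEXT NOT NULL,
    published_at  TEXT NOT NULL,
    view_count    INTEGER,
    comment_count INTEGER,
    retrieved_at  TEXT NOT NULL
)
"""


def save_videos(conn, channel_id, items):
    """Insert or refresh one row per video so re-scraping updates the metrics."""
    retrieved_at = datetime.now(timezone.utc).isoformat()
    conn.execute(SCHEMA)
    rows = [
        (
            item["video_id"],
            channel_id,
            item["title"],
            item["published_at"],
            item.get("view_count"),
            item.get("comment_count"),
            retrieved_at,
        )
        for item in items
    ]
    conn.executemany(
        "INSERT OR REPLACE INTO videos VALUES (?, ?, ?, ?, ?, ?, ?)", rows
    )
    conn.commit()
    return len(rows)
```

Keying on video_id means a second scrape of the same channel overwrites stale view and comment counts instead of duplicating rows, while retrieved_at records when each snapshot was taken.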
I asked a more data-science-oriented followup prompt to test Opus 4.5’s skill at data-sciencing:
Create a Jupyter Notebook that, using polars to process the data, does a thorough exploratory data analysis of data saved in youtube_videos.db, for all columns. This analysis should be able to be extended to any arbitrary input channel_id.
The resulting Jupyter Notebook is…indeed thorough. That’s on me for specifying “for all columns”, although it was able to infer the need for temporal analysis (e.g. total monthly video uploads over time) despite not explicitly being mentioned in the prompt.
The monthly analysis gave me an idea: could Opus 4.5 design a small webapp to view the top videos by month? That gives me the opportunity to try another test of how well Opus 4.5 works with less popular frameworks than React or other JavaScript component frameworks that LLMs push by default. Here, I’ll try FastAPI, Pico CSS for the front end (because we don’t need a JavaScript framework for this), and HTMX for lightweight client/server interactivity:
Create a Hacker News-worthy FastAPI application using HTMX for interactivity and PicoCSS for styling to build a YouTube-themed application that leverages youtube_videos.db to create an interactive webpage that shows the top videos for each month, including embedded YouTube videos which can be clicked.
The FastAPI webapp Python code is good with logical integration of HTMX routes and partials, but Opus 4.5 had fun with the “YouTube-themed” aspect of the prompt: the video thumbnail simulates a YouTube thumbnail with video duration that loads an embedded video player when clicked! The full code is open-source in this GitHub repository.
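The core query behind a "top videos for each month" page can be sketched without any web framework. The snippet below is a minimal illustration, assuming each video is a dict with an ISO 8601 published_at string and a view_count integer; those field names and the top_videos_by_month helper are hypothetical and are not taken from the generated app.

```python
from collections import defaultdict


def top_videos_by_month(videos, n=3):
    """Group videos by 'YYYY-MM' and keep the n most-viewed videos per month."""
    by_month = defaultdict(list)
    for video in videos:
        # The first seven characters of an ISO 8601 timestamp are 'YYYY-MM'.
        by_month[video["published_at"][:7]].append(video)
    return {
        month: sorted(group, key=lambda v: v["view_count"], reverse=True)[:n]
        for month, group in sorted(by_month.items())
    }
```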
All of these tests performed far better than what I expected given my prior poor experiences with agents. Did I gaslight myself by being an agent skeptic? How did an LLM sent to die finally solve my agent problems? Despite the holiday, X and Hacker News were abuzz with similar stories about the massive difference between Sonnet 4.5 and Opus 4.5, so something did change.
Obviously an API scraper and data viewer alone do not justify an OPUS 4.5 CHANGES EVERYTHING declaration on social media, but it’s enough to be less cynical and more optimistic about agentic coding. It’s an invitation to continue creating more difficult tasks for Opus 4.5 to solve. From this point going forward, I will also switch to the terminal Claude Code, since my pipeline is simple enough and doesn’t warrant a UI or other shenanigans.
Getting Rusty At Coding
If you’ve spent enough time on programming forums such as Hacker News, you’ve probably seen the name “Rust”, often in the context of snark. Rust is a relatively niche compiled programming language that touts two important features: speed, which is evident in framework benchmarks where it can perform 10x as fast as the fastest Python library, and memory safety enforced at compile time through its ownership and borrowing systems which mitigates many potential problems. For over a decade, the slogan “Rewrite it in Rust” became a meme where advocates argued that everything should be rewritten in Rust due to its benefits, including extremely mature software that’s infeasible to actually rewrite in a different language. Even the major LLM companies are looking to Rust to eke out as much performance as possible: OpenAI President Greg Brockman recently tweeted “rust is a perfect language for agents, given that if it compiles it’s ~correct” which — albeit that statement is silly at a technical level since code can still be logically incorrect — shows that OpenAI is very interested in Rust, and if they’re interested in writing Rust code, they need their LLMs to be able to code well in Rust.
I myself am not very proficient in Rust. Rust has a famously excellent interactive tutorial, but a persistent issue with Rust is that there are few resources for those with intermediate knowledge: there’s little between the tutorial and “write an operating system from scratch.” That was around 2020 and I decided to wait and see if the ecosystem corrected this point (in 2026 it has not), but I’ve kept an eye on Hacker News for all the new Rust blog posts and library crates so that one day I too will be able to write the absolutely highest performing code possible.
Historically, LLMs have been poor at generating Rust code due to its nicheness relative to Python and JavaScript. Over the years, one of my test cases for evaluating new LLMs was to ask it to write a relatively simple application such as Create a Rust app that can create "word cloud" data visualizations given a long input text.
However, due to modern LLM post-training paradigms, it’s entirely possible that newer LLMs are specifically RLHF-trained to write better code in Rust despite its relative scarcity. I ran more experiments with Opus 4.5 and using LLMs in Rust on some fun pet projects, and my results were far better than I expected. Here are four such projects:
As someone who primarily works in Python, what first caught my attention about Rust is the PyO3 crate: a crate that allows accessing Rust code through Python with all the speed and memory benefits that entails while the Python end-user is none-the-wiser. My first exposure to pyo3
I decided to start with a very simple project: a project that can take icons from an icon font file such as the ones provided by Font Awesome and render them into images at any arbitrary resolution.
I made this exact project in Python in 2021, and it’s very hacky by pulling together several packages and cannot easily be maintained. A better version in Rust with Python bindings is a good way to test Opus 4.5.
The very first thing I did was create an AGENTS.md
With that, I built a gigaprompt to ensure Opus 4.5 accounted for both the original Python implementation and a few new ideas I had, such as supersampling to antialias the output.
Create a Rust/Python package (through pyo3 and maturin) that efficiently and super-quickly takes an Icon Font and renders an image based on the specified icon. The icon fonts are present in assets, and the CSS file which maps the icon name to the corresponding reference in the icon font is in fontawesome.css. You MUST obey ALL the FOLLOWING implementation notes: - If the icon name has solid in it, it is referencing fa-solid.otf. - fa-brands.otf and fa-regular.otf can be combined. - The package MUST also support Python (via pyo3 and maturin). - The package MUST be able to output the image rendered as an optimized PNG and WEBP. with a default output resolution of 1024 x 1024. - The image rendering MUST support supersampling for antialiased text and points (2x by default) - The package MUST implement fontdue as its text rendering method. - Allow the user to specify the color of the icon and the color of the background (both hex and RGB) - Allow transparent backgrounds. - Allow user to specify the icon size and canvas size separately. - Allow user to specify the anchor positions (horizontal and vertical) for the icon relative to the canvas (default: center and center) - Allow users to specify a horizontal and vertical pixel offset for the icon relative to the canvas. After your base implementation is complete, you MUST: - Write a comprehensive Python test suite using pytest. - Write a Python Jupyter Notebook - Optimize the Rust binary file size and the Python package file size.
It completed the assignment in one-shot, accounting for all of the many feature constraints specified. The “Python Jupyter Notebook” notebook command at the end is how I manually tested whether the pyo3
The generated icons, at a high resolution, show signs of not having curves and instead showing discrete edges (image attached). Investigate the fontdue font renderer to see if there's an issue there. In the event that it's not possible to fix this in fontdue, investigate using ab_glyph instead.
Opus 4.5 used its Web Search tool to confirm the issue is expected with fontdue
icon-to-image is available open-source on GitHub. There were around 10 prompts total adding tweaks and polish, but through all of them Opus 4.5 never failed the assignment as written. Of course, generating icon images in Rust-with-Python-bindings is an order of magnitude faster than my old hacky method, and thanks to the better text rendering and supersampling it also looks much better than the Python equivalent.
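For readers unfamiliar with supersampling, the idea is to render at a multiple of the target resolution and then average blocks of samples down to one pixel, which turns hard stair-stepped edges into intermediate shades. The sketch below is a pure-Python illustration that uses a filled disc as a stand-in for a glyph; it is not how icon-to-image, fontdue, or ab_glyph implement rendering, and all function names are hypothetical.

```python
def render_disc(size, scale=1):
    """Rasterize a filled disc as a (size * scale)-square grid of 0/1 samples."""
    n = size * scale
    center = n / 2
    radius = n * 0.4
    return [
        [
            1 if (x + 0.5 - center) ** 2 + (y + 0.5 - center) ** 2 <= radius ** 2 else 0
            for x in range(n)
        ]
        for y in range(n)
    ]


def downsample(grid, scale):
    """Average each scale x scale block of samples into one pixel (box filter)."""
    size = len(grid) // scale
    return [
        [
            sum(
                grid[y * scale + dy][x * scale + dx]
                for dy in range(scale)
                for dx in range(scale)
            )
            / (scale * scale)
            for x in range(size)
        ]
        for y in range(size)
    ]


def render_supersampled(size, scale=2):
    """Render at scale-times resolution, then reduce to size x size coverage values."""
    return downsample(render_disc(size, scale), scale)
```

With scale=2, each output pixel is the mean of four samples, so pixels along the edge take fractional values such as 0.25 or 0.5 instead of snapping to fully on or fully off.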
There’s a secondary pro and con to this pipeline: since the code is compiled, it avoids having to specify as many dependencies in Python itself; in this package’s case, Pillow for image manipulation in Python is optional and the Python package won’t break if Pillow changes its API. The con is that compiling the Rust code into Python wheels is difficult to automate especially for multiple OS targets: fortunately, GitHub provides runner VMs for this pipeline and a little bit of back-and-forth with Opus 4.5 created a GitHub Workflow which runs the build for all target OSes on publish, so there’s no extra effort needed on my end.
Word Clouds In The Browser
When I used word clouds in Rust as my test case for LLM Rust knowledge, I had an ulterior motive: I love word clouds. Back in 2019, I open-sourced a Python package titled stylecloud: a package built on top of Python’s word cloud, but with the added ability to add more color gradients and masks based on icons to easily conform it into shapes (sound familiar?)
However, stylecloud was hacky and fragile, and a number of features I wanted to add such as non-90-degree word rotation, transparent backgrounds, and SVG output flat-out were not possible to add due to its dependency on Python’s wordcloud/matplotlib, and also the package was really slow. The only way to add the features I wanted was to build something from scratch: Rust fit the bill.
The pipeline was very similar to icon-to-image
After more back-and-forth with design nitpicks and more features to add, the package is feature complete. However, it needs some more polish and a more unique design before I can release it, and I got sidetracked by something more impactful…
Create a music player in the terminal using Rust
miditui is available open-sourced on GitHub, and the prompts used to build it are here.
During development I encountered a caveat: Opus 4.5 can’t test or view a terminal output, especially one with unusual functional requirements. But despite being blind, it knew enough about the ratatui terminal framework to implement whatever UI changes I asked. There were a large number of UI bugs that likely were caused by Opus’s inability to create test cases, namely failures to account for scroll offsets resulting in incorrect click locations. As someone who spent 5 years as a black box Software QA Engineer who was unable to review the underlying code, this situation was my specialty. I put my QA skills to work by messing around with miditui
One night — after a glass of wine — I had another idea: one modern trick with ASCII art is the use of Braille unicode characters to allow for very high detail. That reminded me of ball physics simulations, so what about building a full physics simulator also in the terminal? So I asked Opus 4.5 to create a terminal physics simulator with the rapier 2D physics engine and a detailed explanation of the Braille character trick: this time Opus did better and completed it in one-shot, so I spent more time making it colorful and fun. I pessimistically thought the engine would only be able to handle a few hundred balls: instead, the Rust codebase can handle over 10,000 logical balls!
I explicitly prompted Opus to make the Colors button have a different color for each letter.
ballin is available open-sourced on GitHub, and the prompts used to build it are here.
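The Braille trick mentioned above works because each character in the Unicode Braille Patterns block (U+2800 to U+28FF) encodes a 2x4 grid of dots, so one terminal cell can display eight independent "pixels". The following is a minimal Python sketch of that mapping, not code from ballin (which is written in Rust); the to_braille helper is hypothetical.

```python
# Bit value for each dot, indexed [row][column] within a 2-wide, 4-tall cell.
# This follows the dot numbering of the Unicode Braille Patterns block.
DOT_BITS = (
    (0x01, 0x08),
    (0x02, 0x10),
    (0x04, 0x20),
    (0x40, 0x80),
)


def to_braille(pixels):
    """Convert a grid of 0/1 pixels into lines of Braille characters."""
    height = len(pixels)
    width = len(pixels[0]) if pixels else 0
    lines = []
    for top in range(0, height, 4):
        chars = []
        for left in range(0, width, 2):
            bits = 0
            for row in range(4):
                for col in range(2):
                    y, x = top + row, left + col
                    # Pixels past the edge of the grid are treated as off.
                    if y < height and x < width and pixels[y][x]:
                        bits |= DOT_BITS[row][col]
            chars.append(chr(0x2800 + bits))
        lines.append("".join(chars))
    return lines
```

An 80x24 terminal therefore offers an effective 160x96 dot canvas, which is what makes dense scenes such as thousands of small balls legible.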
The main lesson I learnt from working on these projects is that agents work best when you have approximate knowledge of many things with enough domain expertise to know what should and should not work. Opus 4.5 is good enough to let me finally do side projects where I know precisely what I want but not necessarily how to implement it. These specific projects aren’t the Next Big Thing™ that justifies the existence of an industry taking billions of dollars in venture capital, but they make my life better and since they are open-sourced, hopefully they make someone else’s life better. However, I still wanted to push agents to do more impactful things in an area that might be more worth it.
It’s Not AI Psychosis If It Works
Before I wrote my blog post about how I use LLMs, I wrote a tongue-in-cheek blog post titled Can LLMs write better code if you keep asking them to “write better code”? which is exactly as the name suggests. It was an experiment to determine how LLMs interpret the ambiguous command “write better code”: in this case, it was to prioritize making the code more convoluted with more helpful features, but if instead given commands to optimize the code, it did make the code faster successfully albeit at the cost of significant readability. In software engineering, one of the greatest sins is premature optimization, where you sacrifice code readability and thus maintainability to chase performance gains that slow down development time and may not be worth it. Buuuuuuut with agentic coding, we implicitly accept that our interpretation of the code is fuzzy: could agents iteratively applying optimizations for the sole purpose of minimizing benchmark runtime — and therefore faster code in typical use cases if said benchmarks are representative — now actually be a good idea? People complain about how AI-generated code is slow, but if AI can now reliably generate fast code, that changes the debate.
Multiplication and division are too slow for Opus 4.6.
As a data scientist, I’ve been frustrated that there haven’t been any impactful new Python data science tools released in the past few years other than polars
This month, OpenAI announced their Codex app and my coworkers were asking questions. So I downloaded it, and as a test case for the GPT-5.2-Codex (high) model, I asked it to reimplement the UMAP algorithm in Rust. UMAP is a dimensionality reduction technique that can take in a high-dimensional matrix of data and simultaneously cluster and visualize data in lower dimensions. However, it is a very computationally-intensive algorithm and the only tool that can do it quickly is NVIDIA’s cuML which requires CUDA dependency hell. If I can create a UMAP package in Rust that’s superfast with minimal dependencies, that is a massive productivity gain for the type of work I do and can enable fun applications if fast enough.
After OpenAI released GPT-5.3-Codex (high) which performed substantially better and faster at these types of tasks than GPT-5.2-Codex, I asked Codex to write a UMAP implementation from scratch in Rust, which at a glance seemed to work and gave reasonable results. I also instructed it to create benchmarks that test a wide variety of representative input matrix sizes. Rust has a popular benchmarking crate in criterion, which outputs the benchmark results in an easy-to-read format, which, most importantly, agents can easily parse.
Example output from criterion
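To make "easy to parse" concrete, here is a small Python sketch that pulls the middle time estimate out of criterion-style result lines. The sample line format (name, then "time:" and a bracketed low/estimate/high triple) reflects criterion's usual terminal output, but the benchmark names in the usage below are invented, and the sketch assumes the name and the timing sit on the same line, which may not hold for long benchmark names.

```python
import re

# Conversion factors to seconds; both spellings of microseconds are accepted.
UNIT_TO_SECONDS = {
    "ps": 1e-12,
    "ns": 1e-9,
    "us": 1e-6,
    "µs": 1e-6,
    "ms": 1e-3,
    "s": 1.0,
}

# Matches lines such as: "name    time:   [1.00 ms 1.10 ms 1.20 ms]"
LINE = re.compile(
    r"^(?P<name>\S.*?)\s+time:\s+\["
    r"[\d.]+ \S+ "
    r"(?P<mid>[\d.]+) (?P<unit>\S+) "
    r"[\d.]+ \S+\]"
)


def parse_criterion(text):
    """Return {benchmark name: middle time estimate in seconds}."""
    results = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            results[match["name"]] = float(match["mid"]) * UNIT_TO_SECONDS[match["unit"]]
    return results
```

Normalizing every estimate to seconds lets two runs be compared directly even when criterion switches units between them.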
At first glance, the benchmarks and their construction looked good (i.e. no cheating) and the code was much faster than working with UMAP in Python. To test further, I asked the agents to implement additional useful machine learning algorithms such as HDBSCAN as individual projects, with each repo starting with this 8-prompt plan, run in sequence:
1. Implement the package with the specific functional requirements and design goals; afterwards, create benchmarks with specific matrix sizes that are representative of typical use cases
2. Do a second pass to clean up the code/comments and make further optimizations
3. Scan the crate to find areas of algorithmic weaknesses in extreme cases, and write a sentence for each describing the problem, the potential solution, and quantifying the impact of the solution
4. Leveraging the findings found, optimize the crate such that ALL benchmarks run 60% or quicker (1.4x faster). Use any techniques to do so, and repeat until benchmark performance converges, but don’t game the benchmarks by overfitting on the benchmark inputs alone
5. Create custom tuning profiles that take advantage of the inherent quantities of the input data and CPU thread saturation/scheduling/parallelization to optimize the crate such that ALL benchmarks run 60% or quicker (1.4x faster). You can use the flamegraph crate to help with the profiling
6. Add Python bindings using pyo3
7. Create corresponding benchmarks in Python, and write a comparison script between the Python bindings and an existing Python package
8. Accuse the agent of potentially cheating its algorithm implementation while pursuing its optimizations, so tell it to optimize for the similarity of outputs against a known good implementation (e.g. for a regression task, minimize the mean absolute error in predictions between the two approaches)
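The parity check in the last step of the plan can be illustrated with a few lines of Python. This is a minimal sketch for the regression-style case named in the prompt, assuming both implementations return equal-length sequences of numeric predictions; the function names and the tolerance value are hypothetical, not taken from the projects themselves.

```python
def mean_absolute_error(reference, candidate):
    """Average absolute difference between two equal-length prediction lists."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same length")
    if not reference:
        raise ValueError("outputs must not be empty")
    return sum(abs(r - c) for r, c in zip(reference, candidate)) / len(reference)


def matches_reference(reference, candidate, tolerance=1e-6):
    """True when the candidate stays within the (assumed) tolerance of the reference."""
    return mean_absolute_error(reference, candidate) <= tolerance
```

Feeding the same inputs to the known good package and to the agent-written one, then tracking this single number across optimization passes, gives the agent a concrete target that is harder to satisfy by quietly changing the algorithm.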
The simultaneous constraints of code quality requirements via AGENTS.md
Codex 5.3 after optimizing a principal component analysis implementation.
I’m not content with only 2-3x speedups: nowadays in order for this agentic code to be meaningful and not just another repo on GitHub, it has to be the fastest implementation possible. In a moment of sarcastic curiosity, I tried to see if Codex and Opus had different approaches to optimizing Rust code by chaining them:
1. Instruct Codex to optimize benchmarks to 60% of runtime
2. Instruct Opus to optimize benchmarks to 60% of runtime
3. Instruct Opus to minimize differences between agentic implementation and known good implementation without causing more than a 5% speed regression on any benchmarks
This works. From my tests with the algorithms, Codex can often speed up the algorithm by 1.5x-2x, then Opus somehow speeds up that optimized code again to a greater degree. This has been the case of all the Rust code I’ve tested: I also ran the ico
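The numbers in the chained plan can be made concrete with a little arithmetic. Taken literally, running in 60% of the original time is about a 1.67x speedup, and two such passes compound to 36% of the original runtime, or about 2.78x. The sketch below also shows one way the "no more than a 5% speed regression" constraint could be checked from two sets of benchmark timings; the function names and the dict-of-seconds input shape are assumptions for illustration.

```python
def speedup(runtime_fraction):
    """Speedup factor implied by running in a given fraction of the old time."""
    return 1.0 / runtime_fraction


def worst_regression(before, after):
    """Largest relative slowdown across benchmarks (0.05 means 5% slower)."""
    return max(after[name] / before[name] - 1.0 for name in before)


def accept_change(before, after, max_regression=0.05):
    """Accept a change only if no benchmark slows down by more than the limit."""
    return worst_regression(before, after) <= max_regression
```

For example, accept_change({"fit": 1.0}, {"fit": 1.04}) passes while a timing of 1.06 does not, which is the kind of gate that lets a correctness-focused pass proceed without giving back the earlier gains.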