コパイロット応用科学におけるエージェント駆動開発
GitHub Copilot Applied ScienceチームのAI研究者が、コーディングエージェントの評価作業を自動化する「eval-agents」ツールを開発し、チーム全体の知的労働の効率化と高速開発サイクルを実現した事例を紹介している。
キーポイント
知的労働の自動化による業務変革
AI研究者がコーディングエージェントの評価作業(数十万行の軌跡データ分析)を自動化し、従来の手動分析から解放されることで、より創造的な業務に集中できる環境を構築した。
GitHub Copilotを活用した開発効率化
GitHub Copilotを活用したパターン抽出と調査の繰り返し作業を自動化することで、個人だけでなくチーム全体の開発ループを高速化するツール「eval-agents」を開発した。
エージェント駆動開発の実践的アプローチ
エージェントの共有・使用の容易さ、新規エージェント作成の簡便さ、コーディングエージェントを主要貢献手段とするという3つの設計目標を掲げて実装を進めた。
チーム全体への波及効果
個人で開発した自動化ツールをチーム全体で維持・活用することで、Copilot Applied Scienceチームの全メンバーが同様の知的労働自動化を実現できる基盤を構築した。
エージェント駆動開発のコア原則
プロンプト戦略、アーキテクチャ戦略、反復戦略の3つの原則が、高速な開発とコラボレーションを可能にした。
効果的なプロンプト戦略
AIエージェントをエンジニアのように扱い、会話的で詳細な説明と計画モードの活用が、複雑な問題解決に効果的である。
エージェントファーストのリポジトリ構築における優先事項
リファクタリング、テスト作成、ドキュメント整備など、コードベースの保守性向上が新機能開発よりも優先される。これによりCopilotがコードベースを理解しやすくなり、機能開発が容易になる。
影響分析・編集コメントを表示
影響分析
この記事は、AIツールの開発者自身がそのツールを使って自身の業務を自動化するという「メタ自動化」の具体例を示しており、AI開発の現場における生産性革命の一端を垣間見せる。GitHub Copilotの応用範囲が単なるコード補助から業務プロセス全体の最適化へと拡大していることを示唆し、AI支援開発の新たな段階に入ったことを示している。
編集コメント
AIツールの開発者自身が「使う側」としての体験を語る貴重な事例。技術の進化が開発者の業務内容そのものを変容させている現実を実感させる内容で、業界関係者に多くの示唆を与える。
人間によるレビュー
ここで、前のセクションで議論したパターンを適用します。
さらに、機能開発のループの外では、以下のことをCopilotに早期かつ頻繁に指示するようにしてください:
/plan コードを見て、不足しているテスト、壊れている可能性のあるテスト、デッドコードがないか確認する
/plan コードを見て、重複や抽象化の機会がないか確認する
/plan ドキュメントとコードを見て、ドキュメントギャップを特定する。関連する変更を反映するためにcopilot-instructions.mdを必ず更新する
私はこれらを毎週自動的に実行していますが、新しい機能や修正が入るにつれて、エージェント駆動開発環境を維持するために、週を通して自分で実行することもよくあります。
これをあなたのものにしてください
不可能なくらい反復的な分析タスクへの不満から始まったことが、はるかに興味深いものに変わりました。それは、ソフトウェアの構築方法、協力方法、エンジニアとして成長する方法についての新しい考え方です。
エージェントファーストの考え方でコーディングエージェントを構築することは、私の働き方を根本的に変えました。それは単に自動化の成果だけではありません。4人の科学者が3日以内に11のエージェント、4つのスキル、そしてまったく新しいコンセプトをリリースするのを目の当たりにするのは、まさに驚くべきことです。この開発スタイルは、あなたに優先順位を強いるからです:クリーンアーキテクチャ、徹底的なドキュメント、意味のあるテスト、そして思慮深い設計——私たちが常に重要だと知りながらも時間を割けなかったものについてです。
ジュニアエンジニアへの類推は、常に有効です。彼らをうまくオンボーディングし、明確なコンテキストを与え、彼らのミスが大問題にならないようにガードレールを構築し、そして成長を信頼します。何か問題が起きたら、プロセスを責めます。エージェントを責めません。もし一つだけ持ち帰ってほしいことがあるとすれば、それは次のことです。あなたを優れたエンジニアと優れたチームメイトにするスキルは、Copilotで効果的に構築するためのスキルと同じです。テクノロジーは新しいものですが、原則はそうではありません。
だから、そのコードベースを整理し、先延ばしにしていたドキュメントを書き、あなたのCopilotをチームの最新メンバーのように扱い始めてください。あなたは、キャリアで最も興味深い仕事へと自分自身を自動化するかもしれません。
私が狂っていると思いますか?では、これを試してみてください:
- Copilot CLIをダウンロードする
- 任意のリポジトリでCopilot CLIを有効化する:
cd <repo_path> && copilot - 次の指示を貼り付ける:
/plan <このブログ投稿へのリンク>を読み、このリポジトリをエージェントファースト開発のためにどのように改善できるか計画を立てるのを手伝ってください
投稿「Agent-driven development in Copilot Applied Science」は The GitHub Blog に最初に掲載されました。
原文を表示
I may have just automated myself into a completely different job…
This is a familiar pattern among software engineers, who often, through inspiration, frustration, or sometimes even laziness, build systems to remove toil and focus on more creative work. We then end up owning and maintaining those systems, unlocking that automated goodness for the rest of those around us.
As an AI researcher, I recently took this beyond what was previously possible and have automated away my intellectual toil. And now I find myself maintaining this tool to enable all my peers on the Copilot Applied Science team to do the same.
During this process, I learned a lot about how to effectively create and collaborate using GitHub Copilot. Applying these learnings has unlocked an incredibly fast development loop for myself as well as enabled my team mates to build solutions to fit their needs.
Before I get into explaining how I made this possible, let me set the stage for what spawned this project so you better understand the scope of what you can do with GitHub Copilot.
The impetus
A large part of my job involves analyzing coding agent performance as measured against standardized evaluation benchmarks, like TerminalBench2 or SWEBench-Pro. This often involves poring through tons of what are called trajectories, which are essentially lists of the thought processes and actions agents take while performing tasks.
Each task in an evaluation dataset produces its own trajectory, showing how the agent attempted to solve that task. These trajectories are often .json files with hundreds of lines of code. Multiply that over dozens of tasks in a benchmark set and again over the many benchmark runs needing analysis on any given day, and we’re talking hundreds of thousands of lines of code to analyze.
It’s an impossible task to do alone, so I would typically turn to AI to help. When analyzing new benchmark runs, I found that I kept repeating the same loop: I used GitHub Copilot to surface patterns in the trajectories then investigated them myself—reducing the number of lines of code I had to read from hundreds of thousands to a few hundred.
However, the engineer in me saw this repetitive task and said, “I want to automate that.” Agents provide us with the means to automate this kind of intellectual work, and thus eval-agents was born.
The plan
Engineering and science teams work better together. That was my guiding principle as I set about solving this new challenge.
Thus, I approached the design and implementation strategy of this project with a couple of goals in mind:
Make these agents easy to share and use
Make it easy to author new agents
Make coding agents the primary vehicle for contributions
Bullets one and two are in GitHub’s lifeblood and are values and skills I’ve gained throughout my career, especially during my stint as an OSS maintainer on the GitHub CLI.
However, goal three shaped the project the most. I noticed that when I set GitHub Copilot up to help me build the tool effectively, it also made the project easier to use and collaborate on. That experience taught me a few key lessons, which ultimately helped push the first and second goals forward in ways I didn’t expect.
Making coding agents your primary contributor
I’ll start by describing my agentic coding setup:
Coding agent: Copilot CLI
Model used: Claude Opus 4.6
IDE: VSCode
It’s also noteworthy that I leveraged the Copilot SDK to accelerate agent creation, which is powered under the hood by the Copilot CLI. This gave me access to existing tools and MCP servers, a way to register new tools and skills, and a whole bunch of other agentic goodness out of the box that I didn’t have to reinvent myself.
With that out of the way, I could streamline the whole development process very quickly by following a few core principles:
Prompting strategies: agents work best when you’re conversational, verbose, and when you leverage planning modes before agent modes.
Architectural strategies: refactor often, update docs often, clean up often.
Iteration strategies: “trust but verify” is now “blame process, not agents.”
Uncovering and following these strategies led to an incredible phenomenon: adding new agents and features was fast and easy. We had five folks jump into the project for the first time, and we created a total of 11 new agents, four new skills, and the concept of eval-agent workflows (think scientist streams of reasoning) in less than three days. That amounted to a change of +28,858/-2,884 lines of code across 345 files.
Holy crap!
Below, I’ll go into detail about these three principles and how they enabled this amazing feat of collaboration and innovation.
Prompting strategies
We know that AI coding agents are really good at solving well-scoped problems but need handholding for the more complex problems you’d only entrust to your more senior engineers.
So, if you want your agent to act like an engineer, treat it like one. Guide its thinking, over-explain your assumptions, and leverage its research speed to plan before jumping into changes. I found it far more effective to put some stream-of-consciousness musings about a problem I was chewing on into a prompt and working with Copilot in planning mode than to give it a terse problem statement or solution.
Here’s an example of a prompt I wrote to add more robust regression tests to the tool:
/plan I've recently observed Copilot happily updating tests to fit its new paradigms even though those tests shouldn't be updated. How can I create a reserved test space that Copilot can't touch or must reserve to protect against regressions?
This resulted in a back and forth that ultimately led to a series of guardrails akin to contract testing that can only be updated by humans. I had an idea of what I wanted, and through conversation, Copilot helped me get to the right solution.
It turns out that the things that make human engineers the most effective at doing their jobs are the same things that make these agents effective at doing theirs.
Architectural strategies
Engineers, rejoice! Remember all those refactors you wanted to do to make the codebase more readable, the tests you never had time to write, and the docs you wish had existed when you onboarded? They’re now the most important thing you can be working on when building an agent-first repository.
Gone are the days where deprioritizing this work over new feature work was necessary, because delivering features with Copilot becomes trivial when you have a well-maintained, agent-first project.
I’ve spent most of my time on this project refactoring names and file structures, documenting new features or patterns, and adding test cases for problems that I’ve uncovered as I go. I’ve even spent a few cycles cleaning up the dead code that the agents (like your junior engineers) may have missed while implementing all these new features and changes.
This work makes it easy for Copilot to navigate the codebase and understand the patterns, just like it would for any other engineer.
I can even ask, “Knowing what I know now, how would I design this differently?” And I can then justify actually going back and rearchitecting the whole project (with the help of Copilot, of course).
It’s a dream come true!
And this leads me to my last bit of guidance.
Iteration strategies
As agents and models have improved, I have moved from a “trust but verify” mindset to one that is more trusting than doubtful. This mirrors how the industry treats human teams: “blame process, not people.” It’s how the most effective teams operate, because people make mistakes, so we build systems around that reality.
This idea of blameless culture provides psychological safety for teams to iterate and innovate, knowing that they won’t be blamed if they make a mistake. The core principle is that we implement processes and guardrails to protect against mistakes, and if a mistake does happen, we learn from it and introduce new processes and guardrails so that our teams won’t make the same mistake again.
Applying this same philosophy to agent-driven development has been fundamental to unlocking this incredibly rapid iteration pipeline. That means we add processes and guardrails to help prevent the agent from making mistakes, but when it does make a mistake, we add additional guardrails and processes—like more robust tests and better prompts—so the agent can’t make the same mistake again. Taking this one step further means that practicing good CI/CD principles is a must.
Practices like strict typing ensure the agent conforms to interfaces. Robust linters impose implementation rules on the agent that keep it following good patterns and practices. And integration, end-to-end, and contract tests—which can be expensive to build manually—become much cheaper to implement with agent assistance, while giving you confidence that new changes don’t break existing features.
When Copilot has these tools available in its development loop, it can check its own work. You’re setting it up for success, much in the same way you’d set up a junior engineer for success in your project.
Putting it all together
Here’s what all this means for your development loop when you’ve got your codebase set up for agent-driven development:
Plan a new feature with Copilot using /plan.
Iterate on the plan.
Ensure that testing is included in the plan.
Ensure that docs updates are included in the plan and done before code is implemented. These can serve as additional guidelines that live beside your plan.
Let Copilot implement the feature on /autopilot.
Prompt Copilot to initiate a review loop with the Copilot Code Review agent. For me, it’s often something like: request Copilot Code Review, wait for the review to finish, address any relevant comments, and then re-request review. Continue this loop until there are no more relevant comments.
Human review. This is where I enforce the patterns I discussed in the previous sections.
Additionally, outside of your feature loop, be sure you’re prompting Copilot early and often with the following:
/plan Review the code for any missing tests, any tests that may be broken, and dead code
/plan Review the code for any duplication or opportunities for abstraction
/plan Review the documentation and code to identify any documentation gaps. Be sure to update the copilot-instructions.md to reflect any relevant changes
I have these run automatically once a week, but I often find myself running them throughout the week as new features and fixes go in to maintain my agent-driven development environment.
Take this with you
What started as a frustration with an impossibly repetitive analysis task turned into something far more interesting: a new way of thinking about how we build software, how we collaborate, and how we grow as engineers.
Building agents with a coding agent-first mindset has fundamentally changed how I work. It’s not just about the automation wins—though watching four scientists ship 11 agents, four skills, and a brand-new concept in under three days is nothing short of remarkable. It’s about what this style of development forces you to prioritize: clean architecture, thorough documentation, meaningful tests, and thoughtful design—the things we always knew mattered but never had time for.
The analogy to a junior engineer keeps proving itself out. You onboard them well, give them clear context, build guardrails so their mistakes don’t become disasters, and then trust them to grow. If something goes wrong, you blame the process. Not the agent. If there’s one thing I want you to take away from this, it’s that the skills that make you a great engineer and a great teammate are the same skills that make you great at building with Copilot. The technology is new. The principles aren’t.
So go clean up that codebase, write that documentation you’ve been putting off, and start treating your Copilot like the newest member of your team. You might just automate yourself into the most interesting work of your career.
Think I’m crazy? Well, try this:
Download Copilot CLI
Activate Copilot CLI in any repo: cd <repo_path> && copilot
Paste in the following prompt: /plan Read <link to this blog post> and help me plan how I could best improve this repo for agent-first development
The post Agent-driven development in Copilot Applied Science appeared first on The GitHub Blog.
関連記事
今日のまとめ
AI日報で今日の重要ニュースをまとめ読み