How AI Researchers Can Automate Intellectual Toil with GitHub Copilot
The Problem: Analyzing Thousands of Agent Trajectories
As an AI researcher specializing in coding agent performance, my daily routine involved analyzing massive amounts of data from standardized benchmarks like TerminalBench2 and SWEBench-Pro. Each benchmark run produces dozens of trajectories—JSON files containing the step-by-step thoughts and actions an agent took while solving a task. With hundreds of tasks per dataset and multiple runs per day, I was facing hundreds of thousands of lines of trajectory data to review manually.
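For concreteness, a single trajectory might look roughly like the sketch below. The field names here are invented for illustration; they are not the actual TerminalBench2 or SWEBench-Pro schema.

```python
# Illustrative shape of one trajectory; the field names are hypothetical,
# not the real TerminalBench2 / SWEBench-Pro format.
trajectory = {
    "task_id": "fix-broken-import-001",
    "resolved": False,
    "steps": [
        {
            "thought": "The test suite fails because the import path changed.",
            "action": {"tool": "bash", "command": "grep -rn 'old_module' src/"},
            "observation": "src/app.py:12: from old_module import helper",
        },
        # ...often dozens or hundreds more steps per task
    ],
}
```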

Reading through these files was not only tedious but also prone to human error. I needed a way to surface patterns quickly without drowning in raw data. That’s where GitHub Copilot first came into the picture.
The Repetitive Loop That Inspired Automation
I started using GitHub Copilot to help analyze these trajectories. The typical workflow was: prompt Copilot to find common failure modes, then manually investigate the results—reducing the lines I had to read from hundreds of thousands to a few hundred. It worked, but it was still a manual loop I was repeating multiple times a day.
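A minimal sketch of that triage step, assuming one JSON file per task and a hypothetical `resolved` flag in the schema, might look like this:

```python
import json
from pathlib import Path

# Hypothetical triage helper: surface only the failed trajectories so a
# reviewer (or a Copilot prompt) starts from a few hundred lines instead
# of hundreds of thousands. The `resolved` and `steps` fields are assumed.
def failed_trajectories(run_dir: str) -> list[dict]:
    failures = []
    for path in sorted(Path(run_dir).glob("*.json")):
        with path.open() as f:
            traj = json.load(f)
        if not traj.get("resolved", False):
            failures.append({"task": path.stem, "num_steps": len(traj.get("steps", []))})
    return failures

print(failed_trajectories("runs/terminalbench2/latest"))
```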
The software engineer in me rebelled. Why not automate this entire intellectual process? Agents themselves are designed to automate cognitive tasks, so I realized I could build an agent that performs the analysis for me. That insight gave birth to eval-agents, a tool that automates the analysis of coding agent trajectories using GitHub Copilot.
The Solution: Eval-Agents
Eval-agents is a framework that enables researchers and engineers to create and share agents that automate the analysis of benchmark results. Instead of manually reviewing each trajectory, users can now define a custom agent that pinpoints relevant patterns, generates summaries, and even suggests improvements. The entire process is driven by GitHub Copilot, which provides the intelligence to understand complex JSON structures and reason about agent behavior.
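The post doesn't show the framework's internals, but conceptually the pipeline reduces to something like the sketch below, where `ask_copilot` is a stand-in for whatever Copilot-backed model call eval-agents actually makes:

```python
import json

# Conceptual pipeline only: `ask_copilot` is a placeholder, not a real API.
def ask_copilot(prompt: str) -> str:
    raise NotImplementedError("stand-in for the actual Copilot-backed call")

def analyze_run(trajectories: list[dict], instructions: str) -> str:
    """Render the agent's analysis task plus the raw data into one prompt."""
    prompt = (
        instructions
        + "\n\nTrajectories (JSON, one per line):\n"
        + "\n".join(json.dumps(t) for t in trajectories)
    )
    return ask_copilot(prompt)
```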
The initial version focused on my own needs, but I quickly realized that my colleagues on the Copilot Applied Science team could benefit from the same automation. So I designed eval-agents to be easy to share and easy to extend, turning a personal productivity hack into a team-wide force multiplier.
Design Principles for Collaborative Agent Development
When building eval-agents, I followed three core principles to ensure the tool would be adopted and improved by others:
Make Agents Easy to Share and Use
GitHub has always been about collaboration. I wanted eval-agents to work seamlessly across the team, so I built it with GitHub Copilot Chat integration and packaged the agents as reusable scripts. Any team member can run an agent against a new benchmark run without needing to understand the underlying code.
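As a purely illustrative example of what "run without reading the code" could look like, here is a hypothetical invocation. The `eval_agents` import, the `run_agent` function, and its parameters are all invented for this sketch; the post does not document the real interface.

```python
# Hypothetical entry point, invented for illustration only.
from eval_agents import run_agent  # assumed import, not a documented API

report = run_agent(
    agent="failure-modes",                     # a shared agent definition
    run_dir="runs/terminalbench2/2025-01-15",  # a fresh benchmark run
)
print(report)
```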

Make It Easy to Author New Agents
The key to long-term success is enabling others to create their own agents. I designed the agent definition format to be as simple as possible—requiring only a few lines of configuration to specify the analysis task and what data to extract. This low barrier to entry encourages experimentation and customization.
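The post doesn't reproduce the actual format, but a definition in that spirit could be as small as this hypothetical example:

```python
# Hypothetical agent definition; the real eval-agents format may differ.
FAILURE_MODES_AGENT = {
    "name": "failure-modes",
    "task": "Group failed trajectories by root cause and summarize each group.",
    "extract": ["task_id", "resolved", "steps[*].action", "steps[*].observation"],
}
```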
Make Coding Agents the Primary Vehicle for Contributions
Instead of relying solely on documentation or manual processes, eval-agents treats agents themselves as the primary artifacts. Contributions come in the form of new agents that analyze different aspects of trajectories, or improvements to existing ones. This shifts the focus from writing reports to writing code that does the work—a mindset I learned from maintaining open-source projects like the GitHub CLI.
Lessons Learned from Using GitHub Copilot
Developing eval-agents taught me several important lessons about effective collaboration with AI:
- Prompt engineering is key. The quality of the analysis depends directly on how well you phrase your requests to Copilot, and iterating on prompts is a skill that improves with practice (see the example after this list).
- Automate the boring parts, keep the creative ones. Let AI handle pattern recognition while you focus on interpreting results and making strategic decisions.
- Share your automation. What starts as a personal script can become a team asset if you design for reuse from the start.
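To illustrate the first point, here is an invented before/after pair showing the kind of prompt iteration that pays off:

```python
# Invented example of prompt iteration: the second version constrains the
# output and demands evidence, which makes the results reviewable.
VAGUE_PROMPT = "Why did these runs fail?"

SHARPER_PROMPT = (
    "For each failed trajectory, find the first step where the agent's plan "
    "diverged from the task requirements. Group the failures into at most "
    "five categories and cite step indices as evidence for each category."
)
```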
Today, my role has shifted from manually analyzing trajectories to maintaining the eval-agents framework and helping my peers build their own agents. I’ve essentially automated myself out of a tedious job and into a more creative, impactful one. And that’s exactly the kind of evolution that makes software engineering so rewarding.