How to Build a Virtual Agent Fleet for Automated Testing and Triage
Introduction
Imagine a team of AI agents that autonomously test your product, triage issues, post release notes, and even fix bugs—all running in CI without human intervention. That's exactly what the Coding Agent Sandboxes (sbx) team at Docker accomplished with their 'Fleet'. This how-to guide walks you through creating your own virtual agent team using the same principles: role-based skills (not scripts), local-first development, and seamless CI integration. By the end, you'll have a replicable process to ship faster with AI agent autonomy.

What You Need
Before you start building your fleet, gather these prerequisites:
- A sandboxing tool (like sbx or any microVM-based isolation system) that provides secure, disposable environments for agents.
- Access to a coding agent framework (e.g., Claude Code, Gemini, Codex) that supports skill files—markdown-based role definitions with personas, responsibilities, and allowed tools.
- A CLI tool to manage sandbox lifecycles (create, start, stop, remove, network config, workspace mount) on your target platforms (macOS, Linux, Windows).
- CI infrastructure (GitHub Actions, GitLab CI, etc.) with runners for each platform you support.
- Version control for your skill files and workflows (e.g., a Git repository).
- Basic understanding of AI agent behavior—specifically that skills enable judgment, not just step-by-step scripts.
Step-by-Step Guide
Step 1: Define Your Agent Roles and Responsibilities
Start by listing the tasks you want your fleet to handle autonomously. For example:
- Exploratory tester (e.g., /cli-tester): runs all CLI commands, validates outputs, finds edge cases.
- Release note generator: summarizes shipped features and fixes, posts to a channel.
- Issue triager: scans backlog, categorizes, assigns priority, and may even attempt fixes.
- Build engineer: compiles binaries on multiple platforms, checks upgrade paths, monitors resource leaks.
Each role must have a clear persona—think of a human colleague with specific expertise. Role descriptions should emphasize decision-making, not just execution.
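One simple way to capture these roles is one skill file per role in your repository. This layout and these filenames are illustrative, not taken from the sbx team's actual setup:

```
skills/
├── cli-tester.md        # exploratory tester
├── release-notes.md     # release note generator
├── issue-triager.md     # issue triager
└── build-engineer.md    # build engineer
```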
Step 2: Create a Skill File for Each Role
Skill files are markdown files that define:
- Persona: A character description (e.g., "You are a meticulous QA engineer with 10 years of experience testing CLI tools.")
- Responsibilities: What the agent should accomplish (e.g., "Test every command with valid and invalid arguments.")
- Allowed tools: Which APIs, commands, or sandboxes the agent can use (e.g., "You may create and destroy sandboxes, execute shell commands inside them, and read logs.")
- Constraints: Boundaries (e.g., "Do not modify files outside the sandbox.")
Critically, a skill is not a script. It guides the agent's judgment. For example, if a test fails unexpectedly, a script stops—but a role investigates. Write each skill to encourage the agent to explore, learn, and adapt.
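A minimal skill file following this structure might look like the sketch below. The headings and wording are illustrative; your agent framework's exact skill format may differ:

```markdown
# /cli-tester

## Persona
You are a meticulous QA engineer with 10 years of experience testing CLI tools.

## Responsibilities
- Test every command with valid and invalid arguments.
- When a test fails unexpectedly, investigate: reproduce it, vary inputs, capture logs.
- File a structured bug report for every confirmed defect.

## Allowed tools
- Create and destroy sandboxes.
- Execute shell commands inside a sandbox.
- Read sandbox logs.

## Constraints
- Do not modify files outside the sandbox.
- Do not dismiss a failure as flaky without attempting to reproduce it.
```

Note how the Responsibilities section tells the agent to investigate failures rather than prescribing exact commands: that is the difference between a role and a script.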
Step 3: Test Each Skill Locally First
Never start by wiring a skill into CI. Instead, run it from your terminal using the sandbox CLI. For example, invoke sbx run /cli-tester on your laptop. Watch the agent think—observe where it gets confused, where it succeeds, and whether it follows the intended logic. Tweak the skill file (markdown) and re-invoke immediately. This local-first approach turns iteration cycles from minutes (commit-push-wait-read-logs) into seconds (edit-file-run).
During local testing, verify:
- Does the agent correctly interpret its persona?
- Does it use allowed tools appropriately?
- Does it report failures in a useful way (e.g., structured logs, actual bug reports)?
- Does it stay within its sandbox boundary?
Only promote a skill to CI once it consistently produces reliable results on your machine.
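Before promoting a skill to CI, you can also mechanically check that the file still has the expected shape. Below is a minimal linting sketch in Python; the required section names follow the structure from Step 2, and the markdown layout it assumes is an illustration, not a standard format:

```python
import re

# Section headings every skill file is expected to contain (see Step 2).
REQUIRED_SECTIONS = ["Persona", "Responsibilities", "Allowed tools", "Constraints"]

def lint_skill(markdown: str) -> list[str]:
    """Return a list of problems with a skill file; an empty list means it passes."""
    problems = []
    for section in REQUIRED_SECTIONS:
        # Accept any heading level, e.g. "## Persona" or "# Allowed tools".
        if not re.search(rf"(?im)^#+\s*{re.escape(section)}\b", markdown):
            problems.append(f"missing section: {section}")
    # A skill guides judgment; a long run of numbered steps suggests a script instead.
    if len(re.findall(r"(?m)^\s*\d+\.\s", markdown)) > 5:
        problems.append("reads like a numbered script, not a decision guide")
    return problems

example = """# /cli-tester
## Persona
You are a meticulous QA engineer.
## Responsibilities
Test every command with valid and invalid arguments.
## Allowed tools
Create and destroy sandboxes; run commands inside them; read logs.
## Constraints
Do not modify files outside the sandbox.
"""
print(lint_skill(example))  # → []
```

A check like this catches structural drift; it cannot judge whether the persona is well written, which is what the local runs are for.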

Step 4: Set Up CI Workflows for Each Role
Now wire your skill into CI. The key principle: CI is just another runtime for the same skill file. Do not create a separate CI version. Your workflow should simply:
- Set up the environment (checkout code, install sandbox CLI, configure platform-specific dependencies).
- Invoke the skill exactly as you did locally (e.g., sbx run /cli-tester).
- Collect results (logs, generated reports, issue links).
For the /cli-tester example, Docker runs it nightly on macOS, Linux, and Windows runners—all using the exact same skill file. The workflow does not add any custom logic. This ensures consistency across runtimes and eliminates translation errors.
Step 5: Integrate and Iterate
Once individual skills run in CI, chain them into a fleet. For instance:
- The build engineer skill runs first, producing binaries.
- The exploratory tester skill then exercises those binaries on each platform.
- If tests fail, an issue triager skill can automatically classify and assign the bug.
- A release note skill runs after every successful release.
Monitor the fleet's output and feedback loops. If an agent misbehaves (e.g., triages incorrectly or generates poor notes), revert to local mode, debug the skill file, and redeploy. Since all skills are runtime-agnostic, improvements propagate instantly to both local dev and CI.
Tips for Success
- Keep skills simple and focused. Don't overload one agent with too many personas. A dedicated tester skill will outperform a jack-of-all-trades.
- Write skills as decision guides, not checklists. The power of AI agents is judgment. Let them decide how to test; you define what to test.
- Invest in local debugging. The faster you can iterate on a skill file, the better your fleet will perform. Avoid the commit-push-wait cycle at all costs.
- Use version control for skill files. Treat them like code—review changes, roll back bad tweaks, and document why a persona works.
- Monitor agent behavior over time. As products evolve, your skills may need updates. Schedule periodic reviews of agent decisions to catch drift.
- Start small. Build one or two critical roles first (e.g., tester and triager), then expand. A smaller, reliable fleet beats a large, brittle one.
- Embrace failure as learning. When an agent makes a mistake, that's a chance to improve the skill file. The fleet learns not from training data but from human refinement of its role description.
By following these steps, you can assemble a virtual agent team that ships faster, reduces manual toil, and gives you back time for creative engineering. The Docker sbx team proved that local-first, skill-based agents scale—and now you can too.
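For concreteness, the CI wiring described in Steps 4 and 5 could be sketched as a single GitHub Actions workflow. The job names, runner labels, and the install script below are assumptions for illustration, not Docker's actual configuration:

```yaml
name: nightly-fleet
on:
  schedule:
    - cron: "0 2 * * *"   # run nightly

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/install-sbx.sh   # hypothetical install step
      - run: sbx run /build-engineer    # build engineer produces binaries

  test:
    needs: build                        # chain: tester runs after the build
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/install-sbx.sh
      - run: sbx run /cli-tester        # exact same skill file as local runs

  triage:
    needs: test
    if: ${{ failure() }}                # only when tests fail
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/install-sbx.sh
      - run: sbx run /issue-triager
```

Note that every job invokes the same `sbx run` command you use locally; the workflow only sequences the roles and provides platform runners.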