We spend a lot of time thinking about how to safely give AI agents access to real systems. Some of that is personal curiosity, and some of it comes from the work we do at Arcade building agent infrastructure—especially the parts that tend to break once you move past toy demos.
So when Docker released Docker Sandboxes, which let AI coding agents run inside an isolated container instead of directly on your laptop, we wanted to try it for real. Not as a demo, but on an actual codebase, doing the kinds of things agents are increasingly being asked to do.
We tested it with Claude Code. Here’s what that experience was actually like.
TL;DR
- Setup is genuinely easy
- Isolation works exactly as advertised
- Feels seamless at first — you forget you’re sandboxed
- Real-world dev workflows expose rough edges fast
- Environment setup, binaries, and API access are painful
- Solid foundation, but not something we’d use daily (yet)
Why we tried it
One of the biggest concerns people have with coding agents isn’t whether they can edit files — it’s whether they’ll do something unintended once they have real access.
In practice, the failures we see aren’t usually about an agent deleting files outside the workspace it’s given — sandboxing already constrains that. The more concerning issues tend to be agents touching the wrong systems, using the wrong credentials, or running commands in contexts they don’t fully understand.
Docker Sandboxes promise to reduce part of that risk by isolating execution. That felt worth testing.
Setup: surprisingly smooth
Getting started was straightforward:
- Update to the latest version of Docker
- Run docker sandbox run claude
- Sign into Claude Code
Once the sandbox started, Claude could see only the files in our working directory — nothing else on the machine.
From an execution-safety standpoint, this is a real win. There’s no ambiguity about what the agent can touch locally, and that immediately builds trust.
At first, it feels kind of magical
For simple tasks, the experience is almost indistinguishable from running Claude directly.
Claude edits files, reads code, and proposes changes without noticeable friction. We genuinely forgot we were working inside a sandbox for a bit, which is probably the best compliment you can give this kind of tooling.
Isolation without constantly reminding you it’s there is hard to pull off, and Docker mostly nails that part.
Where things started to fall apart
The moment we asked Claude to do something closer to real development work, things changed.
We had it write some tests, then asked it to run our test suite with make test.
That failed immediately — make wasn’t installed in the sandbox.
Claude tried to recover by running the test commands manually, but that failed too, because some of our dev dependencies don’t support the sandbox’s OS.
None of this is surprising if you’ve worked with containers before, but it’s a reminder that execution isolation and environment parity are very different problems.
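One cheap way to catch this class of failure before an agent does is to run a quick parity check inside the sandbox first. A minimal sketch (the tool list is just an example; substitute whatever your Makefile and dev scripts actually assume):

```shell
#!/bin/sh
# Tools our workflow assumes exist on the host; adjust per project.
required="make git curl"

missing=""
for tool in $required; do
  # command -v exits non-zero when the tool isn't on PATH
  command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done

if [ -n "$missing" ]; then
  echo "missing in sandbox:$missing"
else
  echo "all required tools present"
fi
```

Running this as the first command in a fresh sandbox turns "the agent flailed for five minutes" into a one-line report you can act on.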
Environment setup is where the friction really shows
The biggest pain came when APIs entered the picture.
One of the tests required an API key. Because the sandbox wasn’t started with that environment variable, we couldn’t just add it.
Instead, we had to:
- Stop the sandbox
- Delete it
- Restart it with the env var
- Lose the entire Claude conversation
From an agent-workflow perspective, that’s a steep penalty for a small configuration mistake.
Agents are iterative by nature. Losing context because of environment changes breaks that loop in a way humans don’t tolerate for long.
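The workaround we landed on is boring but effective: keep every variable the project might need in one local file and export it all before the first launch, so there is nothing to forget. A sketch, under the assumption (which we have not verified) that the sandbox inherits the launching shell's environment; .sandbox.env is our own convention, not a Docker feature:

```shell
#!/bin/sh
# .sandbox.env is a local, git-ignored file with lines like
# API_KEY=... covering everything the project's tests need.
set -a                               # export every variable sourced below
[ -f ./.sandbox.env ] && . ./.sandbox.env
set +a

# Launch with the variables already in place. Whether the sandbox
# actually inherits this environment is an assumption on our part;
# check Docker's docs for the supported way to pass env vars.
# docker sandbox run claude
```

Even if inheritance doesn't work the way we hope, keeping the variables in one file means a forced restart is a one-liner rather than a scavenger hunt.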
An assumption we didn’t realize we were making
We also realized we’d made a quiet assumption: we expected Claude to be working in a git worktree, not directly on our working directory.
Instead, the sandbox mounts the code directly.
That creates a few issues:
- You can’t easily let the agent run longer tasks in the background
- You end up competing with the agent if you try to edit the same files locally at the same time
- If the agent deletes or rewrites large parts of the repo, the impact is immediate
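The worktree setup we expected can be approximated by hand: give the agent its own worktree on a throwaway branch and point the sandbox there, so your checkout stays untouched while it works. A sketch (paths and branch names are illustrative; the scratch repo below just makes the example self-contained):

```shell
#!/bin/sh
# Scratch repo for the sketch; in real use, start from your project.
repo=$(mktemp -d)/repo
git init -q "$repo"
cd "$repo" || exit 1
git -c user.email=agent@example.com -c user.name=agent \
    commit -q --allow-empty -m "init"

# A separate worktree on a new branch: the agent's edits land here,
# not in your working directory.
git worktree add -q ../agent-wt -b agent-work

# Start the sandbox from the worktree instead of your checkout.
# (How `docker sandbox run` picks its mount is an assumption on our
# part; we're relying on it mounting the current directory.)
#   cd ../agent-wt && docker sandbox run claude

# Later: merge agent-work if you like the result, then
#   git worktree remove ../agent-wt
git worktree list
```

This doesn't fix the mount behavior, but it restores the property we cared about: background agent work and your own edits stop competing for the same files.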
At that point, you start to see the boundary of what execution sandboxing actually protects — and what it doesn’t.
What Docker Sandboxes get right
To be fair, there’s a lot to like:
- Initial setup is easy
- Filesystem isolation works as advertised
- Claude integration feels natural
- For small or greenfield projects, this is likely fine
- You often forget you’re sandboxed
If your primary concern is local safety, Docker Sandboxes solve a real problem.
Where it struggles today
From a day-to-day dev perspective, there are still some rough edges:
- Claude-only support
- Reinstalling binaries that already exist locally
- Environment variables requiring full restarts
- Docker-heavy CLI UX for basic configuration
- Losing agent context when the sandbox needs to be restarted with changed configuration
None of these are catastrophic on their own, but together they limit how often we’d reach for this in real work.
The part sandboxing doesn’t address
One thing this experience reinforced for us is that filesystem safety is often the least interesting part of agent risk.
In practice, we’re far more concerned about:
- which services an agent can talk to
- which credentials it’s using
- whether it understands the difference between test and production
- what actions it’s allowed to take on behalf of a user
Execution sandboxing answers “Where can this code run?” It doesn’t answer “What should this agent be allowed to do?”
That distinction becomes very clear once you try to use tools like this on real systems. And it’s a key reason many companies have adopted Arcade.
Final thoughts
Docker Sandboxes feel like an important step forward. The execution isolation works, and the Claude integration is thoughtfully done.
But real development workflows are messy. They rely on environment parity, long-lived context, APIs, credentials, and permissions that don’t fit neatly into a clean container.
From our perspective, this feels like solid infrastructure — just one layer of a larger stack teams will need as agents move from experiments into real workflows.
_________________________________________________________
Interested in getting a handle on agent risk in your enterprise? Get started with Arcade for free


