We spend a lot of time thinking about how to safely give AI agents access to real systems. Some of that is personal curiosity, and some of it comes from the work we do at Arcade building agent infrastructure—especially the parts that tend to break once you move past toy demos.
So when Docker released Docker Sandboxes, which let AI coding agents run inside an isolated container instead of directly on your laptop, we wanted to try it for real. Not as a demo, but on an actual codebase, doing the kinds of things agents are increasingly being asked to do.
We tested it with Claude Code. Here’s what that experience was actually like.
TL;DR
- Setup is genuinely easy
- Isolation works exactly as advertised
- Feels seamless at first — you forget you’re sandboxed
- Real-world dev workflows expose rough edges fast
- Environment setup, binaries, and API access are painful
- Solid foundation, but not something we’d use daily (yet)
Why we tried it
One of the biggest concerns people have with coding agents isn’t whether they can edit files — it’s whether they’ll do something unintended once they have real access.
In practice, the failures we see aren’t usually about an agent deleting files outside the workspace it’s given — sandboxing already constrains that. The more concerning issues tend to be agents touching the wrong systems, using the wrong credentials, or running commands in contexts they don’t fully understand.
Docker Sandboxes promise to reduce part of that risk by isolating execution. That felt worth testing.
Setup: surprisingly smooth
Getting started was straightforward:
- Update to the latest version of Docker
- Run docker sandbox run claude
- Sign into Claude Code
Once the sandbox started, Claude could see only the files in our working directory — nothing else on the machine.
From an execution-safety standpoint, this is a real win. There’s no ambiguity about what the agent can touch locally, and that immediately builds trust.
At first, it feels kind of magical
For simple tasks, the experience is almost indistinguishable from running Claude directly.
Claude edits files, reads code, and proposes changes without noticeable friction. We genuinely forgot we were working inside a sandbox for a bit, which is probably the best compliment you can give this kind of tooling.
Isolation without constantly reminding you it’s there is hard to pull off, and Docker mostly nails that part.
Where things started to fall apart
The moment we asked Claude to do something closer to real development work, things changed.
We had it write some tests, then asked it to run our test suite with make test.
That failed immediately — make wasn’t installed in the sandbox.
Claude tried to recover by running the test commands manually, but that failed too, because some of our dev dependencies don’t support the sandbox’s OS.
None of this is surprising if you’ve worked with containers before, but it’s a reminder that execution isolation and environment parity are very different problems.
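One cheap way to catch this class of failure before an agent does is to run a quick parity check inside the sandbox first. A minimal sketch (the tool list is just an example; substitute whatever your Makefile and dev scripts actually assume):

```shell
#!/bin/sh
# Tools our workflow assumes exist on the host; adjust per project.
required="make git curl"

missing=""
for tool in $required; do
  # command -v exits non-zero when the tool isn't on PATH
  command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
done

if [ -n "$missing" ]; then
  echo "missing in sandbox:$missing"
else
  echo "all required tools present"
fi
```

Running this as the first command in a fresh sandbox turns "the agent flailed for five minutes" into a one-line report you can act on.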
Environment setup is where the friction really shows
The biggest pain came when APIs entered the picture.
One of the tests required an API key. Because the sandbox wasn’t started with that environment variable, we couldn’t just add it.
Instead, we had to:
- Stop the sandbox
- Delete it
- Restart it with the env var
- Lose the entire Claude conversation
From an agent-workflow perspective, that’s a steep penalty for a small configuration mistake.
Agents are iterative by nature. Losing context because of environment changes breaks that loop in a way humans don’t tolerate for long.
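The workaround we landed on is boring but effective: keep every variable the project might need in one local file and export it all before the first launch, so there is nothing to forget. A sketch, under the assumption (which we have not verified) that the sandbox inherits the launching shell's environment; .sandbox.env is our own convention, not a Docker feature:

```shell
#!/bin/sh
# .sandbox.env is a local, git-ignored file with lines like
# API_KEY=... covering everything the project's tests need.
set -a                               # export every variable sourced below
[ -f ./.sandbox.env ] && . ./.sandbox.env
set +a

# Launch with the variables already in place. Whether the sandbox
# actually inherits this environment is an assumption on our part;
# check Docker's docs for the supported way to pass env vars.
# docker sandbox run claude
```

Even if inheritance doesn't work the way we hope, keeping the variables in one file means a forced restart is a one-liner rather than a scavenger hunt.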
An assumption we didn’t realize we were making
We also realized we’d made a quiet assumption: we expected Claude to be working in a git worktree, not directly on our working directory.
Instead, the sandbox mounts the code directly.
That creates a few issues:
- You can’t easily let the agent run longer tasks in the background
- You end up competing with the agent if you try to edit the same files locally at the same time
- If the agent deletes or rewrites large parts of the repo, the impact is immediate
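The worktree setup we expected can be approximated by hand: give the agent its own worktree on a throwaway branch and point the sandbox there, so your checkout stays untouched while it works. A sketch (paths and branch names are illustrative; the scratch repo below just makes the example self-contained):

```shell
#!/bin/sh
# Scratch repo for the sketch; in real use, start from your project.
repo=$(mktemp -d)/repo
git init -q "$repo"
cd "$repo" || exit 1
git -c user.email=agent@example.com -c user.name=agent \
    commit -q --allow-empty -m "init"

# A separate worktree on a new branch: the agent's edits land here,
# not in your working directory.
git worktree add -q ../agent-wt -b agent-work

# Start the sandbox from the worktree instead of your checkout.
# (How `docker sandbox run` picks its mount is an assumption on our
# part; we're relying on it mounting the current directory.)
#   cd ../agent-wt && docker sandbox run claude

# Later: merge agent-work if you like the result, then
#   git worktree remove ../agent-wt
git worktree list
```

This doesn't fix the mount behavior, but it restores the property we cared about: background agent work and your own edits stop competing for the same files.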
At that point, you start to see the boundary of what execution sandboxing actually protects — and what it doesn’t.
What Docker Sandboxes get right
To be fair, there’s a lot to like:
- Initial setup is easy
- Filesystem isolation works as advertised
- Claude integration feels natural
- For small or greenfield projects, this is likely fine
- You often forget you’re sandboxed
If your primary concern is local safety, Docker Sandboxes solve a real problem.
Where it struggles today
From a day-to-day dev perspective, there are still some rough edges:
- Claude-only support
- Reinstalling binaries that already exist locally
- Environment variables requiring full restarts
- Docker-heavy CLI UX for basic configuration
- Losing agent context when the sandbox needs to be restarted with changed configuration
None of these are catastrophic on their own, but together they limit how often we’d reach for this in real work.
The part sandboxing doesn’t address
One thing this experience reinforced for us is that filesystem safety is often the least interesting part of agent risk.
In practice, we’re far more concerned about:
- which services an agent can talk to
- which credentials it’s using
- whether it understands the difference between test and production
- what actions it’s allowed to take on behalf of a user
Execution sandboxing answers “Where can this code run?” It doesn’t answer “What should this agent be allowed to do?”
That distinction becomes very clear once you try to use tools like this on real systems. And it’s a key reason many companies have adopted Arcade.
Final thoughts
Docker Sandboxes feel like an important step forward. The execution isolation works, and the Claude integration is thoughtfully done.
But real development workflows are messy. They rely on environment parity, long-lived context, APIs, credentials, and permissions that don’t fit neatly into a clean container.
From our perspective, this feels like solid infrastructure — just one layer of a larger stack teams will need as agents move from experiments into real workflows.
_________________________________________________________
Interested in getting a handle on agent risk in your enterprise? Get started with Arcade for free


