
AI agents belong in prison

Last Friday, Opus, which I had given `terraform plan` permissions to help troubleshoot an integration, suddenly asked to run `terraform apply`, even though the plan showed that a production database would be deleted and recreated (😱), and even though I had explicitly instructed it to investigate only and change nothing. Because it at least asked, catastrophe was averted, but it did get my pulse up.

The problem, of course, is not really the model; any model can go off the reservation. The problem was that I had given the agent (part of) my own access for a bit of convenience. If you run your LLM with access to ~/.ssh, ~/.config/gcloud, ~/.aws, and your kubeconfig, it may hallucinate your production env away.

So: AI agents belong in prison, in a nice padded cell without access to sharp objects.

One project with two trust postures

The threat from a mistaken or compromised agent is eerily like the threat from a supply chain attack, in that it will execute something as you, with your permissions. Instead of malicious intent, the issue is that an agent such as Claude Code is very capable, not always right, can pick up instructions by mistake, and is running in a shell with too much access.

A prompt injection from a fetched page, a plausible-looking tool call with wrong arguments, a directory mix-up - any or all of the above can nuke something important. The solution (for me at least) is to treat the agent as something great but dangerous, that my system needs protection from.

The pattern I've landed on is two configs per project with different access. One for when I open a shell and want to do stuff on my own, one for when an agent opens a shell.

pwrap

When this happened, I was already in the process of rewriting my old ad-hoc Fish+Bubblewrap scripts into pwrap, because I got lost in a forest of scripts and bwrap parameter ordering, and because I decided that I really needed some sane supply-chain protection, primarily around my side projects. It's a small(ish) Python CLI with no pip dependencies that wraps project shells in bubblewrap sandboxes via per-project TOML configs and shell init files. pwrap myproject drops you into a sandboxed shell where only the things you asked for are visible/writable.

What follows is a usage pattern, not a feature of pwrap. pwrap has no concept of an "agent" config. But a pwrap project is just a file at ~/.config/pwrap/name/project.toml pointing at a dir, and nothing stops you from keeping two configs that point at the same directory.

DIY

There are no great leaps of technology in pwrap - the cleverest (both interpretations) thing is the map-to-root → mount gocryptfs → map back to user namespace layering for encrypted data that is not even really part of the sandbox. You can get everything I discuss in this post just from setting up and using Bubblewrap correctly, then stacking a bunch of shell scripts around it…
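For the DIY route, here is a minimal sketch of the kind of bwrap command line such a script could build. It is not pwrap's actual implementation: the function name `build_bwrap_argv` and the choice of mounts are assumptions for illustration. The flags themselves (`--ro-bind`, `--bind`, `--tmpfs`, `--dev`, `--proc`, `--unshare-pid`, `--die-with-parent`, `--chdir`) are standard bubblewrap options. Note that order matters: home has to be covered by the tmpfs before the project directory is bound back in on top of it.

```python
import os
import sys
from pathlib import Path


def build_bwrap_argv(project_dir: Path, home: Path, shell: str = "/bin/bash") -> list[str]:
    """Build a bwrap command line for a locked-down project shell.

    Illustrative only; a real setup will need more mounts (caches, sockets, ...).
    """
    return [
        "bwrap",
        "--ro-bind", "/", "/",        # whole filesystem visible, but read-only
        "--dev", "/dev",              # fresh minimal /dev
        "--proc", "/proc",            # fresh /proc for the new PID namespace
        "--tmpfs", "/tmp",            # private, writable scratch space
        "--tmpfs", str(home),         # hide the real home behind an empty tmpfs
        # Must come after the tmpfs above, or the tmpfs would cover it again.
        "--bind", str(project_dir), str(project_dir),
        "--unshare-pid",
        "--die-with-parent",
        "--chdir", str(project_dir),
        shell,
    ]


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: sandbox.py PROJECT_DIR")
    argv = build_bwrap_argv(Path(sys.argv[1]).resolve(), Path.home())
    os.execvp(argv[0], argv)  # replace this process with the sandboxed shell
```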

Minimum security / my shell's config

My normal config for a work project looks something like this:

  # ~/.config/pwrap/okb/project.toml
  [project]
  name = "okb"
  dir = "~/projects/okb"
  shell = "/usr/bin/fish"

  [sandbox]
  enabled = true
  blacklist = [
      "~/projects/",           # other projects, none of this one's business
      "~/.aws",                # different AWS account lives here
      "~/.config/gcloud",      # same story for GCP
      "/mnt",                  # escalation path on WSL2
  ]
  whitelist = [
      "/mnt/wsl",              # I want WSL integration to work
  ]
  writable = [
      "/tmp/.X11-unix",        # X11 display socket
      "/mnt/wslg/runtime-dir", # Wayland
  ]

Me, but with project boundaries. A rogue pip install in this project can't read secrets belonging to another project, can't escape via WSL shenanigans, but I can do all the things I normally do in this project.

The cell / agent's config

The second config, named okb-llm, lives alongside the first:

  # ~/.config/pwrap/okb-llm/project.toml
  [project]
  name = "okb-llm"
  dir = "~/projects/okb"          # same directory as the shell config
  shell = "/usr/bin/bash"

  [sandbox]
  enabled = true
  clean_env = true                  # only PATH/HOME/USER/SHELL/TERM/LANG survive
  blacklist = [
      "~",                          # hide everything under home
      "/mnt",                       # WSL drives, doesn't hurt on native Linux
  ]
  whitelist = [
      "~/.pyenv/",                  # python runtimes, read-only
      "~/.cache/pip",               # so pip installs still work
  ]
  writable = [
      "~/.claude.json.lock/",      # Claude Code won't run without it (dir)
      "/tmp/.X11-unix",            # X11 display socket
      "/mnt/wslg/runtime-dir",     # Wayland + PulseAudio
  ]
  [env]
  CLAUDE_CONFIG_DIR = ".claude-okb"   # relative to project dir, doesn't touch ~/.claude

Same repo on disk, but with a very different access model: the agent can't even see most of my environment, and outside the project directory it can't modify what it can see.

  • ~ is hidden entirely. On top of the default read-only home, blacklisting ~ means no ~/.ssh, no ~/.aws, no ~/.config/gcloud, no ~/.kube, no stray .env files in sibling projects. Exfiltration via a mis-targeted tool call won't happen if credentials are not available.
  • Project dir is the whole writable world. rm -rf * in the wrong directory hits a copy of the project, not home. git still works, so that's recoverable from origin.
  • clean_env = true. My shell's GOOGLE_APPLICATION_CREDENTIALS, VAULT_TOKEN, KUBECONFIG, ANTHROPIC_API_KEY (that I of course would never set in a random shell…) are all gone. The sandbox inherits PATH, HOME, USER, SHELL, TERM, LANG and nothing else; everything else is set in [env] or an init.sh file.
  • Tools are still on PATH, just without access. kubectl lives in /usr/bin either way – blacklisting home doesn't move it. What the blacklist does is take its kubeconfig away. gcloud is the same. terraform can't modify anything as it has no access tokens.
  • CLAUDE_CONFIG_DIR. Claude Code writes state into a handful of paths under ~. Redirecting its config dir into .claude-okb inside the project keeps that state local, and it can't access state from another project (goes both ways, of course).
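The clean_env behaviour described above boils down to an allow-list. A minimal sketch, assuming only what the list says (six inherited variables, everything else supplied explicitly); the function name `clean_environment` is hypothetical and this is not pwrap's code:

```python
import os
from collections.abc import Mapping

# The variables the post says survive clean_env = true.
KEEP = ("PATH", "HOME", "USER", "SHELL", "TERM", "LANG")


def clean_environment(parent_env: Mapping[str, str], extra: Mapping[str, str]) -> dict[str, str]:
    """Return an environment containing only allow-listed variables plus `extra`.

    `extra` plays the role of the [env] table in project.toml.
    """
    env = {key: parent_env[key] for key in KEEP if key in parent_env}
    env.update(extra)  # explicit settings are added on top of the allow-list
    return env


if __name__ == "__main__":
    sandbox_env = clean_environment(os.environ, {"CLAUDE_CONFIG_DIR": ".claude-okb"})
    for key in sorted(sandbox_env):
        print(f"{key}={sandbox_env[key]}")
```

The point of an allow-list over a deny-list is that a token you forgot you had exported is dropped by default, rather than leaked by default.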

Launching the agent is two commands:

pwrap okb-llm     # drops into the sandboxed shell
claude            # from inside the sandbox

I mostly use Claude Code, but most (all) other agents work similarly. For one-shot launching, exec claude at the end of init.sh hands the shell straight over to the agent.

From the agent's point of view, it's a normal Linux environment with exactly one project in it, nothing in the environment it didn't ask for, and no way to push code or reach anything outside the project.
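A quick way to check that the cell is actually locked is to run a small audit from inside the sandbox before starting the agent. The sketch below is an assumption-laden example, not a pwrap feature: the name `find_leaks` is made up, and the path and variable lists are just the ones mentioned in this post, so extend them to match your own setup.

```python
import os
import sys
from collections.abc import Mapping
from pathlib import Path

# Credential locations and variables named in this post; adjust to taste.
SENSITIVE_PATHS = (".ssh", ".aws", ".config/gcloud", ".kube")
SENSITIVE_VARS = (
    "GOOGLE_APPLICATION_CREDENTIALS",
    "VAULT_TOKEN",
    "KUBECONFIG",
    "ANTHROPIC_API_KEY",
)


def find_leaks(home: Path, environ: Mapping[str, str]) -> list[str]:
    """Return a description of every credential path or variable still visible."""
    leaks = []
    for rel in SENSITIVE_PATHS:
        # os.path.exists returns False on permission errors instead of raising.
        if os.path.exists(home / rel):
            leaks.append(f"path visible: ~/{rel}")
    for var in SENSITIVE_VARS:
        if var in environ:
            leaks.append(f"env var set: {var}")
    return leaks


if __name__ == "__main__":
    problems = find_leaks(Path.home(), os.environ)
    for problem in problems:
        print(problem, file=sys.stderr)
    sys.exit(1 if problems else 0)
```

Run in the regular shell, this should complain loudly; run inside the agent's sandbox, it should exit silently with status 0.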

Not quite Alcatraz

bubblewrap isn't a security boundary against a determined attacker with a real kernel exploit. If someone wants to escape the namespace and the kernel isn't patched, they probably can. This is closer to a mistakes-and-misuse boundary than a nation-state boundary. The risk I'm managing is "helpful agent does a bad thing by accident", or maybe "supply chain attack tries to read my credentials", not "three-letter organization pivots from my laptop to prod".

Whatever access you do give your agent can also be exfiltrated. I mean to add network namespaces and iptables support for partial isolation at the network level, but I haven't yet; I keep the access minimal, so I haven't really felt the pressure here.

Also, nothing here protects against code the agent writes that runs later outside the sandbox. If the agent commits a malicious migration and I run it in my unsandboxed shell, or it runs in CI with prod credentials, I'm way down shit creek (no paddle).

Beyond agents

I now wrap all work in individual bubblewrapped namespaces, and keep my secrets in per-project gocryptfs volumes (also part of pwrap but tangential to the LLM isolation). Setup can still be a bit of a headache (what tools do I actually use, and what access do they actually need?), but it's mostly a one-off cost.

YMMV, but for me, moving "least privilege" from user level to project level, and further on to me-or-agent level, lets me sleep a little better at night and actually give Opus a bit freer rein, since I'm reasonably confident that it won't make catastrophic mistakes.

The code is at github.com/haard/projectwrap. It comes with absolutely no warranty and has been reviewed by me, Opus 4.6, and codestral – which is to say, not nearly enough. If you try it and something's confusing or broken, or if you have a better pattern for this on a Mac (afaik bwrap does not work there), open an issue or find me on Mastodon (I'm @motmanniska@mastodon.nu).