Testing

Agent Safehouse includes a comprehensive test suite to verify policy behavior and prevent regressions. Tests use sandbox-exec to validate that allowed operations succeed and denied operations fail.

Running Tests

Tests must run outside an existing sandbox because sandbox-exec cannot be nested. If your terminal session is already sandboxed, the tests fail immediately.

Run the full test suite from the repository root:

./tests/run.sh

Expected output:

=== Default Workdir Access (No Git Root) ===
  PASS  write and read file in CWD
  PASS  create file in CWD
  PASS  create directory in CWD

=== Denied Writes Outside Default Workdir ===
  PASS  write to HOME root
  PASS  write to HOME directory outside grants

===========================================
  Total: 42  |  Pass: 40  |  Fail: 0  |  Skip: 2
===========================================

Skipped tests typically occur when optional dependencies (like git, docker, kubectl) are not installed on the test system.
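One way such a skip could be decided is a command -v probe before the test body runs. The sketch below is an assumption about the mechanism, not the harness's actual code: sketch_skip_unless_installed, SKIP_COUNT, and the SKIP line format are hypothetical names chosen to mirror the output shown above.

```shell
# Hypothetical sketch: skip a test when an optional tool is missing.
# None of these names are taken from the real harness.
SKIP_COUNT=0

# sketch_skip_unless_installed TOOL DESCRIPTION
# Returns 0 if TOOL is found on PATH. Otherwise it counts one skip,
# prints a SKIP line, and returns 1 so the caller can bypass the test.
sketch_skip_unless_installed() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  SKIP_COUNT=$((SKIP_COUNT + 1))
  printf '  SKIP  %s (%s not installed)\n' "$2" "$1"
  return 1
}
```

A caller would guard the sandboxed assertion with it, for example running assert_allowed only when sketch_skip_unless_installed git "run git status" returns 0.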

Test Structure

Tests are organized under tests/sections/ by functional area:

tests/
├── run.sh                    # Main test harness
├── lib/
│   ├── common.sh            # Assertion helpers
│   └── setup.sh             # Environment setup
└── sections/
    ├── 10-filesystem.sh     # Workdir and path grant tests
    ├── 20-integrations.sh   # git, Docker, kubectl tests
    ├── 30-runtime.sh        # System runtime behavior
    ├── 40-tooling.sh        # npm, cargo, Python tests
    ├── 50-policy-behavior.sh # Policy assembly tests
    ├── 60-wrapper-cli.sh    # CLI flag behavior
    └── 70-cli-edge-cases.sh # Error handling tests

Test Helpers

All test sections use helpers from tests/lib/common.sh:

# Assert that a command succeeds under the sandbox
assert_allowed "$POLICY_PATH" "read from workdir" \
  /bin/cat "${TEST_CWD}/file.txt"
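To make the calling convention concrete, here is a minimal sketch of how an allow/deny assertion pair could work. It is not the code in tests/lib/common.sh: every sketch_* name, the counters, and the swappable SANDBOX_RUNNER are assumptions made for illustration. Only the argument order (policy path, description, command) follows the example above.

```shell
# Hypothetical sketch of allow/deny assertions. The real helpers in
# tests/lib/common.sh may differ; every sketch_* name is illustrative.
PASS_COUNT=0
FAIL_COUNT=0

# Default runner: execute a command under a policy file with sandbox-exec.
run_sandboxed() {
  runner_policy="$1"
  shift
  sandbox-exec -f "$runner_policy" "$@"
}

# Assumption of this sketch: the runner is swappable, so the helpers can be
# exercised with a stub on systems that lack sandbox-exec.
SANDBOX_RUNNER="${SANDBOX_RUNNER:-run_sandboxed}"

# sketch_record PASS|FAIL DESCRIPTION: bump a counter, print one result line.
sketch_record() {
  if [ "$1" = "PASS" ]; then
    PASS_COUNT=$((PASS_COUNT + 1))
  else
    FAIL_COUNT=$((FAIL_COUNT + 1))
  fi
  printf '  %s  %s\n' "$1" "$2"
}

# sketch_assert_allowed POLICY DESCRIPTION COMMAND [ARGS...]
# Passes when the command exits 0 under the policy.
sketch_assert_allowed() {
  assert_policy="$1"
  assert_desc="$2"
  shift 2
  if "$SANDBOX_RUNNER" "$assert_policy" "$@" >/dev/null 2>&1; then
    sketch_record PASS "$assert_desc"
  else
    sketch_record FAIL "$assert_desc"
  fi
}

# sketch_assert_denied POLICY DESCRIPTION COMMAND [ARGS...]
# Passes when the command exits non-zero under the policy.
sketch_assert_denied() {
  assert_policy="$1"
  assert_desc="$2"
  shift 2
  if "$SANDBOX_RUNNER" "$assert_policy" "$@" >/dev/null 2>&1; then
    sketch_record FAIL "$assert_desc"
  else
    sketch_record PASS "$assert_desc"
  fi
}
```

A deny check based only on exit status cannot tell a sandbox denial from a command that fails for unrelated reasons, which is why confirming the command succeeds unsandboxed is a useful companion step.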

Writing New Tests

When adding or modifying policy behavior:

Create or update test section

Add a new function in the appropriate tests/sections/*.sh file:

tests/sections/20-integrations.sh

run_section_integrations() {
  section_begin "Git Integration"
  assert_allowed "$POLICY_DEFAULT" \
    "read .gitconfig" \
    /bin/cat "${HOME}/.gitconfig"
  
  assert_allowed_if_exists "$POLICY_DEFAULT" \
    "run git status" \
    "git" \
    git status
}

register_section run_section_integrations

Use descriptive test names

Test descriptions should clearly state what behavior is being verified:

✅ Good: "write to --add-dirs path"
❌ Bad: "test 3"

Run tests locally

Validate your changes:

./tests/run.sh

Always call register_section at the end of your test file:

register_section run_section_integrations

Policy Assembly Tests

For changes to policy assembly logic or module dependencies, use structure and ordering assertions:

# Verify a profile was included
assert_policy_contains "$POLICY_PATH" \
  "includes docker profile" \
  "mach-lookup (global-name \"com.docker.vmnetd\")"

# Verify rule ordering (critical for deny-after-allow overrides)
assert_policy_order_literal "$POLICY_PATH" \
  "base rules load before integrations" \
  "#safehouse-test-id:10-system-runtime#" \
  "#safehouse-test-id:50-integrations-core#"

The #safehouse-test-id:*# markers in .sb files are used by ordering tests. Preserve these when editing profiles.
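A containment or ordering check of this kind can be reduced to fixed-string searches over the generated policy file. The sketch below shows one possible approach using grep -F. The function names are hypothetical, and the real assert_policy_contains and assert_policy_order_literal may be implemented differently.

```shell
# Hypothetical sketch of literal containment and ordering checks.
# These are not the harness's own functions.

# first_line_of FILE LITERAL
# Prints the number of the first line containing LITERAL (matched as a
# fixed string, not a regex), or prints nothing if it never appears.
first_line_of() {
  grep -n -F -- "$2" "$1" | head -n 1 | cut -d: -f1
}

# sketch_policy_contains FILE LITERAL
# Succeeds if LITERAL appears anywhere in FILE.
sketch_policy_contains() {
  grep -q -F -- "$2" "$1"
}

# sketch_policy_order FILE FIRST SECOND
# Succeeds only if both literals occur and the first occurrence of FIRST
# is on an earlier line than the first occurrence of SECOND.
sketch_policy_order() {
  line_a=$(first_line_of "$1" "$2")
  line_b=$(first_line_of "$1" "$3")
  [ -n "$line_a" ] && [ -n "$line_b" ] && [ "$line_a" -lt "$line_b" ]
}
```

Matching with -F matters here because policy rules contain characters such as parentheses and asterisks that a regex search would interpret.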

CI Validation

GitHub Actions runs the tests automatically on all pull requests and on pushes to main. Jobs run on macOS runners only, because sandbox-exec is macOS-specific.

CI also validates that dist/ artifacts are up-to-date when policy or runtime files change.

Test Environment

The test harness (tests/lib/setup.sh) creates isolated directories:

| Variable | Purpose |
| --- | --- |
| `TEST_CWD` | Temporary working directory for test commands |
| `TEST_HOME_CANARY` | File path outside workdir (should be denied) |
| `TEST_RO_DIR` | Directory used with `--add-dirs-ro` tests |
| `TEST_RW_DIR` | Directory used with `--add-dirs` tests |
| `TEST_GIT_REPO` | Temporary git repository for auto-detection tests |

All test artifacts are cleaned up automatically on exit.
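Isolation plus automatic cleanup is commonly built from mktemp -d and an EXIT trap. The sketch below illustrates that pattern for three of the variables listed above. The directory layout, TEST_ROOT, and the sketch_* function names are assumptions, and tests/lib/setup.sh may do this differently.

```shell
# Hypothetical sketch of isolated test directories with cleanup on exit.
# Variable names follow the documentation; the layout is an assumption.
TEST_ROOT=""

# Create a fresh root under /tmp and the per-purpose directories inside it.
sketch_setup_env() {
  TEST_ROOT=$(mktemp -d /tmp/safehouse-test-XXXXXX) || return 1
  TEST_CWD="$TEST_ROOT/cwd"
  TEST_RO_DIR="$TEST_ROOT/ro"
  TEST_RW_DIR="$TEST_ROOT/rw"
  mkdir -p "$TEST_CWD" "$TEST_RO_DIR" "$TEST_RW_DIR"
}

# Remove everything created by sketch_setup_env; safe to call repeatedly.
sketch_cleanup_env() {
  if [ -n "$TEST_ROOT" ] && [ -d "$TEST_ROOT" ]; then
    rm -rf "$TEST_ROOT"
  fi
  TEST_ROOT=""
}

# Run the cleanup whenever the shell exits, whether tests passed or failed.
trap sketch_cleanup_env EXIT
```

Keeping every directory under a single root means cleanup is one rm -rf on a path the script created itself, which lowers the risk of deleting anything outside the test area.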

Preflight Checks

The test runner performs these checks before starting:

Sandbox nesting check

Verifies the current session is not already sandboxed (tests cannot run inside a sandbox).

Binary validation

Confirms sandbox-exec is available and bin/safehouse.sh exists.

Environment setup

Creates temporary directories and generates test policies.

If preflight fails, tests exit with status 2 and an explanation.
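The checks above could be sketched roughly as follows. This is an illustration under stated assumptions, not the runner's real code: sketch_preflight, SANDBOX_EXEC, and SAFEHOUSE_BIN are hypothetical names, and the nesting probe (applying a permissive profile to a trivial command) is one plausible way to detect an existing sandbox. Binary validation comes first in this sketch because the probe needs sandbox-exec to exist.

```shell
# Hypothetical preflight sketch; names are illustrative, not the harness's.
# Both variables are overridable so the function can be tried with stubs.
SANDBOX_EXEC="${SANDBOX_EXEC:-sandbox-exec}"
SAFEHOUSE_BIN="${SAFEHOUSE_BIN:-bin/safehouse.sh}"

# Returns 0 when all checks pass, 2 (with a message on stderr) otherwise.
sketch_preflight() {
  # Binary validation: sandbox-exec must be runnable, the wrapper must exist.
  if ! command -v "$SANDBOX_EXEC" >/dev/null 2>&1; then
    echo "preflight: $SANDBOX_EXEC not found" >&2
    return 2
  fi
  if [ ! -e "$SAFEHOUSE_BIN" ]; then
    echo "preflight: $SAFEHOUSE_BIN not found" >&2
    return 2
  fi
  # Nesting check (assumed technique): inside an existing sandbox, applying
  # even a fully permissive profile to a trivial command is expected to fail.
  if ! "$SANDBOX_EXEC" -p '(version 1) (allow default)' /usr/bin/true \
      >/dev/null 2>&1; then
    echo "preflight: session appears to be sandboxed already" >&2
    return 2
  fi
  return 0
}
```

A runner would call it once before any section, for example sketch_preflight || exit $?, so a failed check surfaces as exit status 2.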

Debugging Test Failures

Check policy contents

Generated test policies are in /tmp/safehouse-test-*/:

cat /tmp/safehouse-test-*/policy-default.sb

Run commands manually

Execute test commands directly with sandbox-exec:

sandbox-exec -f /tmp/policy.sb -- touch /tmp/test.txt

Watch denial logs

Stream sandbox denials while running tests:

/usr/bin/log stream --predicate 'eventMessage CONTAINS "deny("'

Verify outside sandbox

Confirm the command works unsandboxed:

touch /tmp/test.txt  # Should succeed

E2E and Live Agent Tests

For heavier integration testing:

# Terminal-based workflow simulation
./tests/e2e/run.sh

# Real agent CLI testing (requires API keys)
export ANTHROPIC_API_KEY="..."
./tests/e2e/live/run.sh

E2E and live agent tests may incur API usage costs and are not run by default in CI.
