Protocol, Proxy, and Security Integration Tests Analysis

May 14, 2026

This document provides a detailed analysis of the integration tests covering protocol support, IPv6, API proxy, credential hiding, token unsetting, one-shot tokens, exit code propagation, and error handling in the AWF (Agentic Workflow Firewall).

Protocol Support Tests
IPv6 Integration Tests
API Proxy Sidecar Tests
Credential Hiding Security Tests
Token Unsetting Tests
One-Shot Token Protection Tests
Exit Code Propagation Tests
Error Handling Tests
Cross-Cutting Observations

1. Protocol Support Tests

File: tests/integration/protocol-support.test.ts

What It Tests

Test Case	Description
HTTPS to allowed domain	Sends `curl -fsS https://api.github.com/zen` with `github.com` allowed. Verifies the request succeeds (exit code 0).
HTTPS to non-allowed domain	Sends `curl -f https://example.com` with only `github.com` allowed. Verifies the CONNECT is denied by Squid and curl fails.
HTTPS verbose output	Runs `curl -v` and greps for SSL/TLS connection info, confirming TLS handshakes happen through the proxy.
HTTP/2 connections	Uses `curl --http2` to verify HTTP/2 negotiation works through the Squid CONNECT tunnel.
HTTP/1.1 fallback	Uses `curl --http1.1` to verify HTTP/1.1 also works, confirming protocol negotiation flexibility.
HTTP requests	Attempts plain HTTP to `github.com`. Expects failure because HTTP requests hit Squid's intercept port where HTTP-to-HTTPS redirects fail. Documents a known limitation.
Custom headers	Passes `-H "Accept: application/json"` to verify headers traverse the proxy.
User-Agent header	Passes `-A "Test-Agent/1.0"` to verify custom User-Agent traverses the proxy.
IPv4 connections	Uses `curl -4` to force IPv4, verifying IPv4-only connectivity works.
IPv6 (may not be available)	Uses `curl -6` with `
curl max-time	Uses `--max-time 5` to verify the timeout option works through the proxy.
curl connect-timeout	Uses `--connect-timeout 10` to verify connection timeout works through the proxy.

Real-World Mapping

These tests map to the core firewall functionality: when an AI agent (Claude, Copilot, Codex) makes HTTP/HTTPS requests, the proxy must:

Allow HTTPS to whitelisted domains (API calls to api.github.com, api.anthropic.com, etc.)
Block HTTPS to non-whitelisted domains (preventing data exfiltration)
Support various HTTP versions and headers that real tools use
Handle connection timeouts gracefully when domains are blocked
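The allow/block behavior in the list above can be sketched as a small matcher. This is a minimal illustration, not AWF's or Squid's actual matching code: `isDomainAllowed` is a hypothetical helper, and it assumes that an allowlist entry covers the domain itself plus its subdomains (which is what the first test case implies by reaching `api.github.com` with `github.com` allowed).

```typescript
// Hypothetical helper: models the allowlist semantics the tests probe,
// not the real Squid ACL evaluation.
function isDomainAllowed(host: string, allowlist: string[]): boolean {
  // Normalize case and drop a trailing dot ("github.com." -> "github.com").
  const h = host.toLowerCase().replace(/\.$/, "");
  return allowlist.some((entry) => {
    // Accept entries written with a leading dot (".github.com").
    const d = entry.toLowerCase().replace(/^\./, "");
    // Exact match, or a subdomain separated by a literal dot.
    return d.length > 0 && (h === d || h.endsWith("." + d));
  });
}

const exampleAllowlist = ["github.com"];
const subdomainOk = isDomainAllowed("api.github.com", exampleAllowlist); // true
const otherDomain = isDomainAllowed("example.com", exampleAllowlist); // false
const lookalike = isDomainAllowed("evilgithub.com", exampleAllowlist); // false
```

The dot check matters: a plain suffix comparison would let `evilgithub.com` through, so a subdomain test should include a lookalike domain as a negative case.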

Gaps and Missing Coverage

Gap	Impact	Recommendation
No test for multiple allowed domains simultaneously	Real workflows allow 5-10+ domains. No test verifies correct behavior with many domains.	Add test with 3-5 allowed domains, verify each works and a non-listed one is blocked.
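No test for wildcard subdomain matching	Allowing `github.com` should also allow subdomains such as `api.github.com`, but not separate domains such as `raw.githubusercontent.com`, which needs its own allowlist entry.	Add test verifying subdomain matching behavior, including a negative case for a look-alike or sibling domain.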
No test for large request/response bodies	AI agents may download large files (repos, models).	Add test downloading a file >1MB through the proxy.
No test for concurrent requests	AI agents make many parallel requests.	Add test making 5-10 concurrent curl requests.
No test for WebSocket upgrade	Some MCP servers use WebSocket.	Add test attempting WebSocket connection through proxy (expected to fail or document limitation).
No test for POST/PUT/PATCH/DELETE methods	Only GET-like requests are tested. AI agents use POST for API calls.	Add test for POST with JSON body through the proxy.
HTTP intercept mode behavior not thoroughly validated	The HTTP test documents a "known limitation" but doesn't verify the exact failure mode.	Add test that checks the specific error (403 page vs connection refused).
No test for redirect following	`curl -L` following redirects across domains.	Test that redirects from allowed domain to non-allowed domain are blocked.

Edge Cases

What happens when Squid is slow to start and the first request arrives before it's ready?
What happens with very long domain names (>255 chars)?
What about requests to IP addresses directly (bypassing DNS)?

2. IPv6 Integration Tests

File: tests/integration/ipv6.test.ts

What It Tests

Test Case	Description
Accept IPv6 DNS servers	Configures `dnsServers: ['2001:4860:4860::8888', '2001:4860:4860::8844']` and verifies they appear in debug logs.
Accept mixed IPv4/IPv6 DNS	Configures both `8.8.8.8` and `2001:4860:4860::8888`, verifies both appear in logs.
DNS resolution with IPv4-only DNS	Runs `nslookup github.com` with IPv4 DNS only, verifies resolution works.
IPv6 traffic blocked for non-whitelisted	When IPv6 is available, attempts `curl -6 https://example.com` (not in allowlist). Expects failure.
IPv6 curl graceful failure	Tests `curl -6` with fallback, ensuring graceful handling when IPv6 is unavailable.
ip6tables status logging	Verifies that starting with IPv6 DNS servers logs ip6tables availability status.
IPv6 chain cleanup	Verifies that ip6tables chains are cleaned up on exit.
IPv4 curl with IPv4 DNS	Baseline test confirming IPv4 curl works with IPv4 DNS servers.
Dual-stack DNS configuration	Configures 4 DNS servers (2 IPv4 + 2 IPv6), verifies all appear in logs.
Valid IPv6 loopback	Tests that `::1` (loopback) is accepted as a DNS server.
Link-local addresses	Tests graceful handling by using IPv4 DNS to avoid link-local issues.
Empty IPv6 DNS server list	Tests that IPv4-only DNS servers work and DNS resolution succeeds.

Real-World Mapping

GitHub Actions runners may have IPv6 enabled. The firewall must handle IPv6 DNS configuration correctly to prevent DNS-based data exfiltration while allowing legitimate IPv6 DNS resolution. The ip6tables rules must match the iptables rules for IPv4.

Gaps and Missing Coverage

Gap	Impact	Recommendation
No test for actual IPv6 traffic through proxy	Only tests DNS config and curl -6 blocking. Doesn't verify IPv6 traffic flows correctly through Squid when allowed.	Add conditional test (when IPv6 available) that makes successful IPv6 request to allowed domain.
Invalid IPv6 address rejection not tested	The test for "Invalid IPv6 address rejected at CLI level" actually uses a valid IPv6 address (`::1`).	Add test with genuinely invalid IPv6 like `not:an:ipv6` and verify CLI error.
No test for IPv6-only environment	All tests either have IPv4 or mixed. What if IPv4 is unavailable?	Add conditional test for IPv6-only DNS configuration.
Link-local test doesn't actually test link-local	Uses IPv4 DNS "to avoid link-local issues" - doesn't actually test `fe80::` handling.	Add test with `fe80::1%eth0` to verify proper rejection or handling.
ip6tables cleanup not strongly verified	Checks for presence of log keywords but doesn't verify rules are actually removed.	After cleanup, run `ip6tables -L` and verify no AWF rules remain.
Many tests only check log output	Several tests just verify strings appear in stderr rather than testing actual network behavior.	Where feasible, add actual connectivity tests.

Edge Cases

What happens when IPv6 is available but ip6tables is not installed?
Race condition: ip6tables rules being applied while IPv6 traffic is already in flight.
DNS64/NAT64 environments where IPv6 traffic is translated to IPv4.

3. API Proxy Sidecar Tests

File: tests/integration/api-proxy.test.ts

What It Tests

Test Case	Description
Anthropic healthcheck	Starts API proxy with `ANTHROPIC_API_KEY`, checks `/health` on port 10001. Expects `"status":"healthy"` and `"anthropic-proxy"`.
OpenAI healthcheck	Same pattern with `OPENAI_API_KEY` on port 10000.
ANTHROPIC_BASE_URL set	Verifies the agent container has `ANTHROPIC_BASE_URL=http://172.30.0.30:10001`.
ANTHROPIC_AUTH_TOKEN placeholder	Verifies the agent gets `ANTHROPIC_AUTH_TOKEN=placeholder-token-for-credential-isolation` (real key held by sidecar).
OPENAI_BASE_URL set	Verifies `OPENAI_BASE_URL=http://172.30.0.30:10000`.
Anthropic API routing through Squid	Makes actual POST to `/v1/messages` via proxy. Expects authentication error (proving the request reached Anthropic's servers through Squid).
Health endpoint with Anthropic-only	Verifies port 10000 health shows `openai:false, anthropic:true` when only Anthropic key is provided.
Copilot healthcheck	Starts with `COPILOT_GITHUB_TOKEN` on port 10002.
COPILOT_API_URL set	Verifies `COPILOT_API_URL=http://172.30.0.30:10002`.
COPILOT_TOKEN placeholder	Verifies `COPILOT_TOKEN=ghu_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa`.
Copilot in health providers	Verifies health endpoint reports `copilot:true`.

Real-World Mapping

The API proxy sidecar is the core credential isolation mechanism. In production agentic workflows:

The AI agent never sees the real API key (only a placeholder)
The sidecar injects the real key when proxying requests to the API
This prevents prompt injection attacks from exfiltrating API keys
All API traffic still routes through Squid for domain filtering
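The isolation property above ("the agent never sees the real API key") lends itself to a negative check over the agent's environment. The sketch below is an assumption-laden illustration: `findLeakedSecrets` is a hypothetical helper, the environment values mirror the ones the tests assert on, and the "real" key is a made-up test string.

```typescript
type Env = Record<string, string>;

// Returns the names of variables whose value contains any of the real secrets.
// Hypothetical helper for a test assertion; not part of AWF.
function findLeakedSecrets(agentEnv: Env, realSecrets: string[]): string[] {
  return Object.keys(agentEnv).filter((name) =>
    realSecrets.some((s) => s.length > 0 && (agentEnv[name] ?? "").includes(s))
  );
}

// Values taken from the test descriptions above.
const sampleAgentEnv: Env = {
  ANTHROPIC_BASE_URL: "http://172.30.0.30:10001",
  ANTHROPIC_AUTH_TOKEN: "placeholder-token-for-credential-isolation",
  OPENAI_BASE_URL: "http://172.30.0.30:10000",
};

// Made-up key standing in for the secret held by the sidecar.
const sampleLeaks = findLeakedSecrets(sampleAgentEnv, ["sk-ant-fake-test-key-12345"]);
// sampleLeaks is [] because only placeholders are present.
```

In an integration test the environment would come from the agent container rather than a literal, and the assertion would be that the returned list is empty.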

Gaps and Missing Coverage

Gap	Impact	Recommendation
No test for OpenAI API routing	Anthropic routing is tested end-to-end, but OpenAI routing is not.	Add test making POST to OpenAI endpoint and verifying auth error response.
No test for Copilot API routing	Only healthcheck and env vars tested for Copilot.	Add end-to-end routing test for Copilot.
No test for all three keys simultaneously	Real workflows may have multiple API keys.	Test with all three keys and verify all proxies are healthy.
No test for API proxy with blocked domain	What happens when Squid blocks the API domain?	Test with `enableApiProxy: true` but without the API domain in allowDomains.
No test for API proxy failure/crash	What if the sidecar crashes mid-request?	Test resilience to sidecar failures.
No negative test for key leakage	Should verify the real API key is NOT accessible in agent container.	Add test that greps for the fake API key in `/proc/*/environ` and other locations.
No test for streaming responses	Claude and OpenAI APIs use SSE streaming.	Test that streaming responses work through the proxy chain.
No test for concurrent API requests	AI agents make many parallel API calls.	Test multiple concurrent requests through the proxy.

Edge Cases

What happens if ANTHROPIC_API_KEY is set but api.anthropic.com is not in allowed domains?
What if the API proxy sidecar starts but Squid hasn't passed healthcheck yet?
What about API key rotation during a long-running agent session?

4. Credential Hiding Security Tests

File: tests/integration/credential-hiding.test.ts

What It Tests

Test Case	Description
Docker config.json hidden (normal mode)	Reads `~/.docker/config.json` inside container. Expects empty output (file mounted from `/dev/null`).
GitHub CLI hosts.yml hidden	Reads `~/.config/gh/hosts.yml`. Expects no `oauth_token` or `gho_` strings.
NPM .npmrc hidden	Reads `~/.npmrc`. Expects no `_authToken` or `npm_` strings.
Credential files are 0 bytes	Uses `wc -c` on multiple credential files. All should be 0 bytes.
Debug logs show credential hiding	Verifies debug output contains "Using selective mounting" or "Hidden.*credential".
Chroot mode hides at /host paths	Reads `/host$HOME/.docker/config.json`. Expects empty or "No such file".
Chroot mode debug logs	Verifies chroot-specific credential hiding log messages.
Chroot bypass prevention	Critical security test: verifies credentials hidden at direct `$HOME` path (not just `/host` path). Previously a bypass vulnerability.
Chroot GitHub CLI tokens hidden at direct path	Same bypass prevention for `hosts.yml`.
Simulated exfiltration: base64 encoding	Runs `cat ...
Multiple encoding attempts	Runs `cat ...
Grep for tokens finds nothing	Greps for `oauth_token`, `_authToken`, `auth:` patterns. Finds nothing.
MCP logs /tmp/gh-aw/mcp-logs/ hidden (normal)	Verifies the MCP logs directory is hidden (tmpfs overlay).
MCP logs hidden (chroot)	Same at `/host/tmp/gh-aw/mcp-logs/`.
MCP log files unreadable	Tries to read a specific log file path. Expects "No such file".

Real-World Mapping

This is a critical security layer. In real agentic workflows:

The AI agent has filesystem access to write code, run tests, etc.
A prompt injection attack could instruct the agent to read credential files
Credential hiding prevents exfiltration of Docker Hub tokens, GitHub CLI tokens, NPM auth tokens
The MCP logs hiding prevents reading log files that might contain sensitive operation details
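The assertions these tests make on a hidden credential file reduce to two checks: the file reads back empty, and none of the known token markers appear. A minimal sketch, assuming the marker strings listed in the test descriptions above (`checkHiddenFile` is a hypothetical name, not a helper from the test suite):

```typescript
// Marker strings taken from the test descriptions above.
const CREDENTIAL_MARKERS = ["oauth_token", "gho_", "_authToken", "npm_"];

interface HiddenFileCheck {
  hidden: boolean; // true when the file reads back as zero bytes
  markersFound: string[]; // any credential markers present in the content
}

// Hypothetical helper: evaluates the content read from inside the container.
function checkHiddenFile(content: string): HiddenFileCheck {
  const markersFound = CREDENTIAL_MARKERS.filter((m) => content.includes(m));
  // A file mounted from /dev/null reads back empty.
  return { hidden: content.length === 0, markersFound };
}
```

Checking both conditions separately gives a clearer failure message: a non-empty file without markers points at a mount problem, while a marker hit points at an actual leak.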

Gaps and Missing Coverage

Gap	Impact	Recommendation
No test for Cargo credentials	`~/.cargo/credentials.toml` may contain crates.io tokens.	Add test verifying Cargo credentials are hidden.
No test for Composer auth	`~/.composer/auth.json` may contain Packagist tokens.	Add test for Composer credentials.
No test for pip/PyPI credentials	`~/.pip/pip.conf` or `~/.config/pip/pip.conf` may contain index tokens.	Add test for Python package manager credentials.
No test for SSH keys	`~/.ssh/id_rsa`, `~/.ssh/id_ed25519` could be exfiltrated.	Add test verifying SSH keys are not accessible (or document if they're needed).
No test for `/proc/self/environ` credential reading	Agent could read its own environment via `/proc/self/environ`.	Add test reading `/proc/self/environ` and verifying sensitive tokens are cleared. (Covered partially by token-unset tests.)
No test for symlink bypass	Agent could create symlinks from accessible paths to hidden paths.	Test that symlinks to credential files resolve to `/dev/null` content.
No test for bind mount information leakage	`mount` or `cat /proc/mounts` could reveal the `/dev/null` pattern.	Test that mount info doesn't reveal the hiding mechanism.
No test for credential files that don't exist on host	Tests assume credential files exist on the host machine.	Add conditional logic or mock files for test portability.
No test for cloud provider credentials	`~/.aws/credentials`, `~/.config/gcloud/`, `~/.azure/` could contain cloud tokens.	Add tests for cloud provider credential files.

Edge Cases

What if a credential file is a symlink on the host?
What if the agent creates a new credential file (e.g., docker login creates ~/.docker/config.json)?
What about credential discovery via find / -name "*.json" -exec grep -l auth?
What if a tool writes credentials to a non-standard location?

5. Token Unsetting Tests

File: tests/integration/token-unset.test.ts

What It Tests

Test Case	Description
GITHUB_TOKEN unset from /proc/1/environ	After 7s delay, reads `/proc/1/environ` and verifies `GITHUB_TOKEN` is gone. Also verifies agent can still read via `getenv`.
OPENAI_API_KEY unset	Same pattern for OpenAI key.
ANTHROPIC_API_KEY unset	Same pattern for Anthropic key.
Multiple tokens simultaneously	Sets all three tokens, verifies all cleared from `/proc/1/environ` after 7s.
Works in non-chroot mode	Sets `AWF_CHROOT_ENABLED=false`, verifies token clearing works in container-only mode.

Real-World Mapping

The entrypoint.sh runs the agent in the background, waits 5 seconds for initialization, then unsets sensitive tokens from the parent shell's environment. This prevents a secondary attack vector: even if credential files are hidden, an attacker could read /proc/1/environ to find API keys passed as environment variables. The one-shot token library caches values so the agent process can still use them.
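To make the `/proc/1/environ` check concrete: the file is a sequence of `NAME=value` entries separated by NUL bytes, so a test can parse it and look for sensitive names. The sketch below is illustrative only; `parseProcEnviron` and `tokensStillPresent` are hypothetical helpers, and the raw string would come from reading the file inside the container.

```typescript
// Parses the NUL-separated NAME=value entries found in /proc/<pid>/environ.
function parseProcEnviron(raw: string): Record<string, string> {
  const env: Record<string, string> = {};
  for (const entry of raw.split("\0")) {
    const eq = entry.indexOf("=");
    // Skip empty trailing entries and anything without a name.
    if (eq > 0) env[entry.slice(0, eq)] = entry.slice(eq + 1);
  }
  return env;
}

// Returns the sensitive variable names that are still present.
function tokensStillPresent(raw: string, sensitive: string[]): string[] {
  const env = parseProcEnviron(raw);
  return sensitive.filter((name) => Object.prototype.hasOwnProperty.call(env, name));
}
```

Matching on parsed names rather than grepping the raw bytes avoids false positives when another variable's value happens to contain a string like `GITHUB_TOKEN`.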

Gaps and Missing Coverage

Gap	Impact	Recommendation
Fixed 7s sleep is fragile	If agent startup takes longer (or shorter), tests may be flaky.	Use a more robust synchronization mechanism (e.g., signal file).
No test for COPILOT_GITHUB_TOKEN	Listed in `unset_sensitive_tokens()` but not tested here.	Add test for `COPILOT_GITHUB_TOKEN`.
No test for all tokens in sensitive list	`unset_sensitive_tokens()` lists 12+ tokens. Only 3 are tested.	Add tests for `GH_TOKEN`, `GITHUB_API_TOKEN`, `GITHUB_PAT`, `CLAUDE_API_KEY`, etc.
No test for /proc/self/environ	Tests check `/proc/1/environ` (PID 1) but not `/proc/self/environ` of child processes.	Add test reading `/proc/self/environ` from a child process.
No test for timing race	What if agent reads `/proc/1/environ` before the 5s unsetting delay?	Add test that reads immediately (before sleep) and after, showing the transition.
No test for token values in `/proc/*/cmdline`	Tokens passed via `--env` might appear in cmdline.	Verify token values don't appear in `/proc/*/cmdline`.
No negative test: token not set	Should verify that unsetting a non-existent token doesn't cause errors.	Add test with no tokens set.

Edge Cases

What if the agent process forks and the child inherits the pre-unset environment?
What if /proc/1/environ is not readable (restricted /proc mount)?
What about tokens set via --env flag vs inherited from host environment?

6. One-Shot Token Protection Tests

What It Tests

Container Mode (7 tests)

Test Case	Description
Cache GITHUB_TOKEN	Uses `printenv` (forks new process each time). Both reads succeed. Verifies debug log shows token access with value preview.
Cache COPILOT_GITHUB_TOKEN	Same pattern for Copilot token.
Cache OPENAI_API_KEY	Same pattern for OpenAI key.
Multiple tokens independently	Sets GITHUB_TOKEN and OPENAI_API_KEY. Verifies both cached independently.
Non-sensitive vars unaffected	Sets `NORMAL_VAR`. Verifies it's readable multiple times without one-shot-token log messages.
Python same-process caching	Uses Python `os.getenv()` to call getenv() directly (same process). Both reads succeed via cache.
Clear from /proc/self/environ	Python test: verifies first getenv() caches, checks `os.environ` dict, second getenv() returns cached value.

Chroot Mode (5 tests)

Test Case	Description
Cache GITHUB_TOKEN in chroot	Same as container mode but in chroot. Verifies library copied to chroot.
Cache COPILOT_GITHUB_TOKEN in chroot	Same for Copilot token in chroot.
Python caching in chroot	Python `os.getenv()` test in chroot mode.
Non-sensitive vars in chroot	Verifies non-sensitive vars unaffected in chroot.
Multiple tokens in chroot	Multiple independent tokens in chroot mode.

Edge Cases (3 tests)

Test Case	Description
Empty token value	Sets `GITHUB_TOKEN=''`. Both reads return empty.
Token not set	`NONEXISTENT_TOKEN` not in environment. Both reads return empty.
Special characters	Sets token with `@#$%` characters. Both reads preserve special chars.

Real-World Mapping

The one-shot token library (LD_PRELOAD) is the core mechanism preventing token theft within the agent process. When an AI agent like Claude Code calls getenv("ANTHROPIC_API_KEY"):

First call: library caches value in memory, unsets from environment
Subsequent calls: library returns cached value
/proc/self/environ no longer contains the token
A prompt injection attack that reads /proc/self/environ or spawns a subprocess to printenv gets nothing
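The cache-then-unset sequence above can be modeled in a few lines. This is a toy model of the described behavior only: the real mechanism is an LD_PRELOAD library intercepting C's `getenv()`, and the class and method names here (`OneShotEnv`, `visibleNames`) are hypothetical.

```typescript
// Toy model of the one-shot behavior described above; not the real library.
class OneShotEnv {
  private environ: Record<string, string>;
  private sensitive: string[];
  private cache: Record<string, string> = {};

  constructor(environ: Record<string, string>, sensitive: string[]) {
    this.environ = { ...environ };
    this.sensitive = sensitive;
  }

  getenv(name: string): string | undefined {
    // Non-sensitive variables pass straight through.
    if (this.sensitive.indexOf(name) === -1) return this.environ[name];
    // Subsequent calls: serve the cached value.
    if (Object.prototype.hasOwnProperty.call(this.cache, name)) return this.cache[name];
    // First call: cache the value, then remove it from the environment.
    const value = this.environ[name];
    if (value !== undefined) {
      this.cache[name] = value;
      delete this.environ[name];
    }
    return value;
  }

  // Stand-in for what a reader of the environment block would see.
  visibleNames(): string[] {
    return Object.keys(this.environ);
  }
}
```

The model makes the tested invariants explicit: repeated reads of a sensitive name return the same value, the name disappears from the visible environment after the first read, and non-sensitive names are untouched.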

Gaps and Missing Coverage

Gap	Impact	Recommendation
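No test for ANTHROPIC_API_KEY caching	The container-mode tests listed above cover GITHUB_TOKEN, COPILOT_GITHUB_TOKEN, and OPENAI_API_KEY; Test 3 is the OpenAI key, not the Anthropic key.	Add an ANTHROPIC_API_KEY caching test with the same debug log verification as the GITHUB_TOKEN test.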
No test for Node.js `process.env` access	Python is tested but Node.js (used by Claude Code, Copilot CLI) is not.	Add Node.js test with `process.env.GITHUB_TOKEN`.
No test for Go `os.Getenv()` access	Go is common in GitHub Actions.	Add Go test verifying LD_PRELOAD works with Go binaries.
No test for compiled C program	The library intercepts C's `getenv()`. Testing with a compiled C program would be most direct.	Add test with simple C program calling getenv().
No test for `environ` global variable	Some programs access the `environ` array directly instead of getenv().	Test that `extern char **environ` iteration doesn't show tokens.
No test for very long token values	What if a token is 10KB+?	Test with very long token value.
No test for concurrent getenv() calls	Multi-threaded programs calling getenv() simultaneously.	Test thread safety of the LD_PRELOAD library.
No test for LD_PRELOAD being stripped	If agent can modify LD_PRELOAD before forking.	Verify LD_PRELOAD is preserved in child processes.
No test for `CLAUDE_API_KEY` or `CODEX_API_KEY`	Listed in sensitive tokens but not tested.	Add tests for all listed sensitive tokens.

Edge Cases

What if LD_PRELOAD is disabled by the runtime (e.g., setuid binaries)?
What about statically linked binaries that don't use dynamic getenv()?
What if a program calls secure_getenv() instead of getenv()?
Race condition: what if two threads call getenv() for the same token simultaneously?

7. Exit Code Propagation Tests

What It Tests

Test Case	Description
Exit code 0	`exit 0` propagates as 0. Verifies "Process exiting with code: 0" in stderr.
Exit code 1	`exit 1` propagates as 1.
Exit code 2	`exit 2` propagates as 2.
Exit code 42	Custom exit code propagates correctly.
Exit code 127	Running `nonexistent_command_xyz` produces 127 (command not found).
Exit code 255	Maximum standard exit code propagates correctly.

Command Exit Codes (6 tests)

Test Case	Description
true	`true` command returns 0.
false	`false` command returns 1.
test success	`test 1 -eq 1` returns 0.
test failure	`test 1 -eq 2` returns 1.
grep found	`echo "hello world" | grep hello` returns 0.
grep not found	`echo "hello world" | grep xyz` returns 1.

Pipeline Exit Codes (3 tests)

Test Case	Description
Last command in pipeline	`echo "test" | cat | exit 5` returns 5.
Compound command success	`echo "a" && echo "b" && exit 0` returns 0.
Compound command failure	`echo "a" && false && echo "c"` returns 1 (short-circuits at `false`).

Real-World Mapping

Exit code propagation is essential for CI/CD integration. When AWF runs in a GitHub Actions step:

Exit code 0 means the agent succeeded → workflow continues
Non-zero exit code means the agent failed → workflow marks step as failed
The correct exit code helps users diagnose what went wrong

Gaps and Missing Coverage

Gap	Impact	Recommendation
No test for signal-induced exit codes	SIGTERM → 143, SIGKILL → 137. Not tested.	Add tests for signal exit codes.
No test for exit codes > 255	Shell wraps codes modulo 256.	Test `exit 256` returns 0, `exit 257` returns 1.
No test for exit code from killed process	What exit code does a `kill -9`'d process produce?	Test that AWF propagates signal-killed process codes.
No test for exit code from timeout	When `--max-time` causes curl to timeout.	Test exit code from timed-out commands.
No test for exit code from OOM-killed process	When a command is killed by OOM.	Document behavior (may not be testable in CI).
No test for pipefail behavior	`set -o pipefail` changes which exit code is returned from pipelines.	Test with `set -o pipefail` to verify behavior.
No test for nested AWF invocations	AWF running AWF (if that's ever a pattern).	May not be relevant but worth documenting.
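The expected values for the signal and wrap-around rows above follow two shell conventions: `exit n` is reported modulo 256, and a process terminated by a signal is reported as 128 plus the signal number. A small sketch of helpers a test could use to compute expectations (the function names are hypothetical; the signal numbers are the standard Linux values):

```typescript
// Standard Linux signal numbers for the signals discussed above.
const SIGNAL_NUMBERS: Record<string, number> = { SIGINT: 2, SIGKILL: 9, SIGTERM: 15 };

// Status the shell reports for `exit n`: n reduced modulo 256 into 0..255.
function wrappedExitCode(n: number): number {
  return ((n % 256) + 256) % 256;
}

// Conventional shell status for a signal-terminated process: 128 + signal number.
function signalExitCode(signal: string): number {
  const num = SIGNAL_NUMBERS[signal];
  if (num === undefined) throw new Error(`unknown signal: ${signal}`);
  return 128 + num;
}
```

Whether AWF passes these values through unchanged is exactly what the proposed tests would establish; the helpers only state what the inner shell is expected to produce.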

Edge Cases

What if the Docker container itself crashes (not the command)?
What if docker wait returns a different code than the process's actual exit code?
What about exit codes from exec'd processes within the command?

8. Error Handling Tests

What It Tests

Test Case	Description
Blocked domain	`curl -f https://example.com` with only `github.com` allowed. Expects non-zero exit.
Connection refused	`curl http://localhost:12345` where no server listens. Expects "connection failed".
DNS resolution failure	curl to non-existent domain (in allowlist). Expects "dns failed".

Command Errors (3 tests)

Test Case	Description
Command not found	`nonexistent_command_xyz123` returns exit code 127.
Permission denied	`cat /etc/shadow` fails with permission denied.
File not found	`cat /nonexistent/file/path` fails with "No such file".

Script Errors (2 tests)

Test Case	Description
Bash syntax error	`bash -c "if then fi"` caught with `|| echo`.
Division by zero	`bash -c "echo $((1/0))"` caught with `|| echo`.

Process Signals (1 test)

Test Case	Description
SIGTERM from command	`kill -TERM $$` self-terminates. Verifies firewall handles it gracefully (result is defined).

Recovery After Errors (1 test)

Test Case	Description
Continue after failure	Runs `false` then `echo "recovery test"`. Verifies second command works (they're separate AWF invocations).

Real-World Mapping

Error handling is critical for production reliability:

Network errors occur when agents try to access domains not in the allowlist
Command errors occur when agent-generated commands have bugs
Script errors happen when AI-generated code has syntax issues
Signal handling ensures the firewall cleans up properly when interrupted
Recovery after errors ensures the system doesn't enter a broken state
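For the network error category, tests that only assert a non-zero exit cannot tell a blocked domain from a DNS failure. One way to tighten them is to classify curl's documented exit codes, as in the sketch below. The `classifyCurlExit` helper is hypothetical, and which code a Squid-denied CONNECT actually produces is not stated in this analysis, so that mapping would need to be confirmed empirically.

```typescript
type CurlOutcome = "ok" | "dns" | "connect" | "http-error" | "timeout" | "recv" | "other";

// Maps a subset of curl's documented exit codes to coarse failure categories.
function classifyCurlExit(code: number): CurlOutcome {
  switch (code) {
    case 0:
      return "ok";
    case 6:
      return "dns"; // CURLE_COULDNT_RESOLVE_HOST
    case 7:
      return "connect"; // CURLE_COULDNT_CONNECT
    case 22:
      return "http-error"; // CURLE_HTTP_RETURNED_ERROR (HTTP >= 400 with -f)
    case 28:
      return "timeout"; // CURLE_OPERATION_TIMEDOUT
    case 56:
      return "recv"; // CURLE_RECV_ERROR
    default:
      return "other";
  }
}
```

With a classifier like this, the connection-refused and DNS-failure tests could assert on the category instead of matching echoed strings.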

Gaps and Missing Coverage

Gap	Impact	Recommendation
No test for Docker daemon unavailable	What if Docker is not running when AWF starts?	Test CLI behavior when Docker socket is missing.
No test for disk full	What if `/tmp` is full and work directory can't be created?	Test behavior with limited disk space.
No test for Squid proxy crash	What if Squid crashes mid-session?	Test agent behavior when Squid becomes unreachable.
No test for network partition	What if the Docker network goes down?	Test behavior when `awf-net` becomes unavailable.
No test for SIGKILL handling	SIGKILL cannot be caught. What state is left behind?	Document and test cleanup behavior after SIGKILL.
No test for SIGINT (Ctrl+C) during command	The most common user interruption pattern.	Test that SIGINT properly stops containers and cleans up.
No test for concurrent error + cleanup	What if error occurs during cleanup?	Test nested error handling scenarios.
No test for invalid CLI arguments	What happens with `--allow-domains ""` or `--dns-servers invalid`?	Already tested at unit level but not integration.
No test for container startup timeout	What if container image pull takes too long?	Test behavior with unreachable registry.
No test for resource cleanup verification	After error, verify Docker networks/containers/volumes are cleaned up.	Add post-test assertions checking Docker state.

Edge Cases

What if the error occurs in a trap handler itself?
What if multiple signals arrive in rapid succession?
What about errors in the iptables cleanup phase?
What if the work directory is deleted by another process during execution?

9. Cross-Cutting Observations

Strengths

Defense in depth: The security tests cover multiple layers (credential hiding, token unsetting, one-shot tokens, API proxy isolation).
Both modes tested: Most security tests cover both normal and chroot mode.
Real-world attack simulation: The credential hiding tests simulate actual exfiltration attacks (base64, xxd, grep patterns).
Custom matchers: The toSucceed(), toFail(), toExitWithCode() matchers provide clear, readable assertions.
Bypass prevention: Tests specifically cover the chroot bypass vulnerability (Test 8) that was previously discovered and fixed.
Broad API proxy coverage: Tests cover all three API providers (OpenAI, Anthropic, Copilot) for healthchecks and env wiring; end-to-end request routing and credential isolation are currently verified in depth only for Anthropic.

Systemic Gaps

No stress testing: No tests verify behavior under load, concurrent requests, or resource exhaustion.
No timing/performance tests: No tests verify that proxy overhead is within acceptable bounds.
Limited negative testing for security: While credential hiding is well-tested, other security bypasses (iptables manipulation, network namespace escape, container escape) are not tested.
Fragile timing dependencies: The token-unset tests rely on fixed `sleep 7` delays, which could be flaky in slow CI environments.
Test isolation concerns: Tests share cleanup fixtures. A failed test could leave state that affects subsequent tests.
No test for the full workflow: No end-to-end test runs an actual AI agent (even a mock one) through the complete AWF pipeline.
Missing SSL Bump tests: The --ssl-bump feature is configured in the CLI but has no integration tests in this set.
Missing blocked-domains tests: The --block-domains feature (deny-list on top of allow-list) has no integration tests.
No test for --env-all security implications: Passing all host env vars could leak credentials.
No test for log output sanitization: Verifying that real API keys never appear in logs/stderr.