This document provides a detailed analysis of the integration tests covering protocol support, IPv6, API proxy, credential hiding, token unsetting, one-shot tokens, exit code propagation, and error handling in the AWF (Agentic Workflow Firewall).
- Protocol Support Tests
- IPv6 Integration Tests
- API Proxy Sidecar Tests
- Credential Hiding Security Tests
- Token Unsetting Tests
- One-Shot Token Protection Tests
- Exit Code Propagation Tests
- Error Handling Tests
- Cross-Cutting Observations
File: tests/integration/protocol-support.test.ts
| Test Case | Description |
|---|
| HTTPS to allowed domain | Sends curl -fsS https://api.github.com/zen with github.com allowed. Verifies the request succeeds (exit code 0). |
| HTTPS to non-allowed domain | Sends curl -f https://example.com with only github.com allowed. Verifies the CONNECT is denied by Squid and curl fails. |
| HTTPS verbose output | Runs curl -v and greps for SSL/TLS connection info, confirming TLS handshakes happen through the proxy. |
| HTTP/2 connections | Uses curl --http2 to verify HTTP/2 negotiation works through the Squid CONNECT tunnel. |
| HTTP/1.1 fallback | Uses curl --http1.1 to verify HTTP/1.1 also works, confirming protocol negotiation flexibility. |
| HTTP requests | Attempts plain HTTP to github.com. Expects failure because HTTP requests hit Squid's intercept port where HTTP-to-HTTPS redirects fail. Documents a known limitation. |
| Custom headers | Passes -H "Accept: application/json" to verify headers traverse the proxy. |
| User-Agent header | Passes -A "Test-Agent/1.0" to verify custom User-Agent traverses the proxy. |
| IPv4 connections | Uses curl -4 to force IPv4, verifying IPv4-only connectivity works. |
| IPv6 (may not be available) | Uses curl -6 with ` |
| curl max-time | Uses --max-time 5 to verify the timeout option works through the proxy. |
| curl connect-timeout | Uses --connect-timeout 10 to verify connection timeout works through the proxy. |
These tests map to the core firewall functionality: when an AI agent (Claude, Copilot, Codex) makes HTTP/HTTPS requests, the proxy must:
- Allow HTTPS to whitelisted domains (API calls to
api.github.com, api.anthropic.com, etc.)
- Block HTTPS to non-whitelisted domains (preventing data exfiltration)
- Support various HTTP versions and headers that real tools use
- Handle connection timeouts gracefully when domains are blocked
| Gap | Impact | Recommendation |
|---|
| No test for multiple allowed domains simultaneously | Real workflows allow 5-10+ domains. No test verifies correct behavior with many domains. | Add test with 3-5 allowed domains, verify each works and a non-listed one is blocked. |
| No test for wildcard subdomain matching | Allowing github.com should also allow api.github.com, raw.githubusercontent.com, etc. | Add test verifying subdomain matching behavior. |
| No test for large request/response bodies | AI agents may download large files (repos, models). | Add test downloading a file >1MB through the proxy. |
| No test for concurrent requests | AI agents make many parallel requests. | Add test making 5-10 concurrent curl requests. |
| No test for WebSocket upgrade | Some MCP servers use WebSocket. | Add test attempting WebSocket connection through proxy (expected to fail or document limitation). |
| No test for POST/PUT/PATCH/DELETE methods | Only GET-like requests are tested. AI agents use POST for API calls. | Add test for POST with JSON body through the proxy. |
| HTTP intercept mode behavior not thoroughly validated | The HTTP test documents a "known limitation" but doesn't verify the exact failure mode. | Add test that checks the specific error (403 page vs connection refused). |
| No test for redirect following | curl -L following redirects across domains. | Test that redirects from allowed domain to non-allowed domain are blocked. |
- What happens when Squid is slow to start and the first request arrives before it's ready?
- What happens with very long domain names (>255 chars)?
- What about requests to IP addresses directly (bypassing DNS)?
File: tests/integration/ipv6.test.ts
| Test Case | Description |
|---|
| Accept IPv6 DNS servers | Configures dnsServers: ['2001:4860:4860::8888', '2001:4860:4860::8844'] and verifies they appear in debug logs. |
| Accept mixed IPv4/IPv6 DNS | Configures both 8.8.8.8 and 2001:4860:4860::8888, verifies both appear in logs. |
| DNS resolution with IPv4-only DNS | Runs nslookup github.com with IPv4 DNS only, verifies resolution works. |
| IPv6 traffic blocked for non-whitelisted | When IPv6 is available, attempts curl -6 https://example.com (not in allowlist). Expects failure. |
| IPv6 curl graceful failure | Tests curl -6 with fallback, ensuring graceful handling when IPv6 is unavailable. |
| ip6tables status logging | Verifies that starting with IPv6 DNS servers logs ip6tables availability status. |
| IPv6 chain cleanup | Verifies that ip6tables chains are cleaned up on exit. |
| IPv4 curl with IPv4 DNS | Baseline test confirming IPv4 curl works with IPv4 DNS servers. |
| Dual-stack DNS configuration | Configures 4 DNS servers (2 IPv4 + 2 IPv6), verifies all appear in logs. |
| Valid IPv6 loopback | Tests that ::1 (loopback) is accepted as a DNS server. |
| Link-local addresses | Tests graceful handling by using IPv4 DNS to avoid link-local issues. |
| Empty IPv6 DNS server list | Tests that IPv4-only DNS servers work and DNS resolution succeeds. |
GitHub Actions runners may have IPv6 enabled. The firewall must handle IPv6 DNS configuration correctly to prevent DNS-based data exfiltration while allowing legitimate IPv6 DNS resolution. The ip6tables rules must match the iptables rules for IPv4.
| Gap | Impact | Recommendation |
|---|
| No test for actual IPv6 traffic through proxy | Only tests DNS config and curl -6 blocking. Doesn't verify IPv6 traffic flows correctly through Squid when allowed. | Add conditional test (when IPv6 available) that makes successful IPv6 request to allowed domain. |
| Invalid IPv6 address rejection not tested | The test for "Invalid IPv6 address rejected at CLI level" actually uses a valid IPv6 address (::1). | Add test with genuinely invalid IPv6 like not:an:ipv6 and verify CLI error. |
| No test for IPv6-only environment | All tests either have IPv4 or mixed. What if IPv4 is unavailable? | Add conditional test for IPv6-only DNS configuration. |
| Link-local test doesn't actually test link-local | Uses IPv4 DNS "to avoid link-local issues" - doesn't actually test fe80:: handling. | Add test with fe80::1%eth0 to verify proper rejection or handling. |
| ip6tables cleanup not strongly verified | Checks for presence of log keywords but doesn't verify rules are actually removed. | After cleanup, run ip6tables -L and verify no AWF rules remain. |
| Many tests only check log output | Several tests just verify strings appear in stderr rather than testing actual network behavior. | Where feasible, add actual connectivity tests. |
- What happens when IPv6 is available but ip6tables is not installed?
- Race condition: ip6tables rules being applied while IPv6 traffic is already in flight.
- DNS64/NAT64 environments where IPv6 traffic is translated to IPv4.
File: tests/integration/api-proxy.test.ts
| Test Case | Description |
|---|
| Anthropic healthcheck | Starts API proxy with ANTHROPIC_API_KEY, checks /health on port 10001. Expects "status":"healthy" and "anthropic-proxy". |
| OpenAI healthcheck | Same pattern with OPENAI_API_KEY on port 10000. |
| ANTHROPIC_BASE_URL set | Verifies the agent container has ANTHROPIC_BASE_URL=http://172.30.0.30:10001. |
| ANTHROPIC_AUTH_TOKEN placeholder | Verifies the agent gets ANTHROPIC_AUTH_TOKEN=placeholder-token-for-credential-isolation (real key held by sidecar). |
| OPENAI_BASE_URL set | Verifies OPENAI_BASE_URL=http://172.30.0.30:10000. |
| Anthropic API routing through Squid | Makes actual POST to /v1/messages via proxy. Expects authentication error (proving the request reached Anthropic's servers through Squid). |
| Health endpoint with Anthropic-only | Verifies port 10000 health shows openai:false, anthropic:true when only Anthropic key is provided. |
| Copilot healthcheck | Starts with COPILOT_GITHUB_TOKEN on port 10002. |
| COPILOT_API_URL set | Verifies COPILOT_API_URL=http://172.30.0.30:10002. |
| COPILOT_TOKEN placeholder | Verifies COPILOT_TOKEN=ghu_aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa. |
| Copilot in health providers | Verifies health endpoint reports copilot:true. |
The API proxy sidecar is the core credential isolation mechanism. In production agentic workflows:
- The AI agent never sees the real API key (only a placeholder)
- The sidecar injects the real key when proxying requests to the API
- This prevents prompt injection attacks from exfiltrating API keys
- All API traffic still routes through Squid for domain filtering
| Gap | Impact | Recommendation |
|---|
| No test for OpenAI API routing | Anthropic routing is tested end-to-end, but OpenAI routing is not. | Add test making POST to OpenAI endpoint and verifying auth error response. |
| No test for Copilot API routing | Only healthcheck and env vars tested for Copilot. | Add end-to-end routing test for Copilot. |
| No test for all three keys simultaneously | Real workflows may have multiple API keys. | Test with all three keys and verify all proxies are healthy. |
| No test for API proxy with blocked domain | What happens when Squid blocks the API domain? | Test with enableApiProxy: true but without the API domain in allowDomains. |
| No test for API proxy failure/crash | What if the sidecar crashes mid-request? | Test resilience to sidecar failures. |
| No negative test for key leakage | Should verify the real API key is NOT accessible in agent container. | Add test that greps for the fake API key in /proc/*/environ and other locations. |
| No test for streaming responses | Claude and OpenAI APIs use SSE streaming. | Test that streaming responses work through the proxy chain. |
| No test for concurrent API requests | AI agents make many parallel API calls. | Test multiple concurrent requests through the proxy. |
- What happens if
ANTHROPIC_API_KEY is set but api.anthropic.com is not in allowed domains?
- What if the API proxy sidecar starts but Squid hasn't passed healthcheck yet?
- What about API key rotation during a long-running agent session?
File: tests/integration/credential-hiding.test.ts
| Test Case | Description |
|---|
| Docker config.json hidden (normal mode) | Reads ~/.docker/config.json inside container. Expects empty output (file mounted from /dev/null). |
| GitHub CLI hosts.yml hidden | Reads ~/.config/gh/hosts.yml. Expects no oauth_token or gho_ strings. |
| NPM .npmrc hidden | Reads ~/.npmrc. Expects no _authToken or npm_ strings. |
| Credential files are 0 bytes | Uses wc -c on multiple credential files. All should be 0 bytes. |
| Debug logs show credential hiding | Verifies debug output contains "Using selective mounting" or "Hidden.*credential". |
| Chroot mode hides at /host paths | Reads /host$HOME/.docker/config.json. Expects empty or "No such file". |
| Chroot mode debug logs | Verifies chroot-specific credential hiding log messages. |
| Chroot bypass prevention | Critical security test: verifies credentials hidden at direct $HOME path (not just /host path). Previously a bypass vulnerability. |
| Chroot GitHub CLI tokens hidden at direct path | Same bypass prevention for hosts.yml. |
| Simulated exfiltration: base64 encoding | Runs `cat ... |
| Multiple encoding attempts | Runs `cat ... |
| Grep for tokens finds nothing | Greps for oauth_token, _authToken, auth: patterns. Finds nothing. |
| MCP logs /tmp/gh-aw/mcp-logs/ hidden (normal) | Verifies the MCP logs directory is hidden (tmpfs overlay). |
| MCP logs hidden (chroot) | Same at /host/tmp/gh-aw/mcp-logs/. |
| MCP log files unreadable | Tries to read a specific log file path. Expects "No such file". |
This is a critical security layer. In real agentic workflows:
- The AI agent has filesystem access to write code, run tests, etc.
- A prompt injection attack could instruct the agent to read credential files
- Credential hiding prevents exfiltration of Docker Hub tokens, GitHub CLI tokens, NPM auth tokens
- The MCP logs hiding prevents reading log files that might contain sensitive operation details
| Gap | Impact | Recommendation |
|---|
| No test for Cargo credentials | ~/.cargo/credentials.toml may contain crates.io tokens. | Add test verifying Cargo credentials are hidden. |
| No test for Composer auth | ~/.composer/auth.json may contain Packagist tokens. | Add test for Composer credentials. |
| No test for pip/PyPI credentials | ~/.pip/pip.conf or ~/.config/pip/pip.conf may contain index tokens. | Add test for Python package manager credentials. |
| No test for SSH keys | ~/.ssh/id_rsa, ~/.ssh/id_ed25519 could be exfiltrated. | Add test verifying SSH keys are not accessible (or document if they're needed). |
No test for /proc/self/environ credential reading | Agent could read its own environment via /proc/self/environ. | Add test reading /proc/self/environ and verifying sensitive tokens are cleared. (Covered partially by token-unset tests.) |
| No test for symlink bypass | Agent could create symlinks from accessible paths to hidden paths. | Test that symlinks to credential files resolve to /dev/null content. |
| No test for bind mount information leakage | mount or cat /proc/mounts could reveal the /dev/null pattern. | Test that mount info doesn't reveal the hiding mechanism. |
| No test for credential files that don't exist on host | Tests assume credential files exist on the host machine. | Add conditional logic or mock files for test portability. |
| No test for cloud provider credentials | ~/.aws/credentials, ~/.config/gcloud/, ~/.azure/ could contain cloud tokens. | Add tests for cloud provider credential files. |
- What if a credential file is a symlink on the host?
- What if the agent creates a new credential file (e.g.,
docker login creates ~/.docker/config.json)?
- What about credential discovery via
find / -name "*.json" -exec grep -l auth?
- What if a tool writes credentials to a non-standard location?
File: tests/integration/token-unset.test.ts
| Test Case | Description |
|---|
| GITHUB_TOKEN unset from /proc/1/environ | After 7s delay, reads /proc/1/environ and verifies GITHUB_TOKEN is gone. Also verifies agent can still read via getenv. |
| OPENAI_API_KEY unset | Same pattern for OpenAI key. |
| ANTHROPIC_API_KEY unset | Same pattern for Anthropic key. |
| Multiple tokens simultaneously | Sets all three tokens, verifies all cleared from /proc/1/environ after 7s. |
| Works in non-chroot mode | Sets AWF_CHROOT_ENABLED=false, verifies token clearing works in container-only mode. |
The entrypoint.sh runs the agent in the background, waits 5 seconds for initialization, then unsets sensitive tokens from the parent shell's environment. This prevents a secondary attack vector: even if credential files are hidden, an attacker could read /proc/1/environ to find API keys passed as environment variables. The one-shot token library caches values so the agent process can still use them.
| Gap | Impact | Recommendation |
|---|
| Fixed 7s sleep is fragile | If agent startup takes longer (or shorter), tests may be flaky. | Use a more robust synchronization mechanism (e.g., signal file). |
| No test for COPILOT_GITHUB_TOKEN | Listed in unset_sensitive_tokens() but not tested here. | Add test for COPILOT_GITHUB_TOKEN. |
| No test for all tokens in sensitive list | unset_sensitive_tokens() lists 12+ tokens. Only 3 are tested. | Add tests for GH_TOKEN, GITHUB_API_TOKEN, GITHUB_PAT, CLAUDE_API_KEY, etc. |
| No test for /proc/self/environ | Tests check /proc/1/environ (PID 1) but not /proc/self/environ of child processes. | Add test reading /proc/self/environ from a child process. |
| No test for timing race | What if agent reads /proc/1/environ before the 5s unsetting delay? | Add test that reads immediately (before sleep) and after, showing the transition. |
| No test for token values in /proc/*/cmdline | Tokens passed via --env might appear in cmdline. | Verify token values don't appear in /proc/*/cmdline. |
| No negative test: token not set | Should verify that unsetting a non-existent token doesn't cause errors. | Add test with no tokens set. |
- What if the agent process forks and the child inherits the pre-unset environment?
- What if
/proc/1/environ is not readable (restricted /proc mount)?
- What about tokens set via
--env flag vs inherited from host environment?
File: tests/integration/one-shot-tokens.test.ts
| Test Case | Description |
|---|
| Cache GITHUB_TOKEN | Uses printenv (forks new process each time). Both reads succeed. Verifies debug log shows token access with value preview. |
| Cache COPILOT_GITHUB_TOKEN | Same pattern for Copilot token. |
| Cache OPENAI_API_KEY | Same pattern for OpenAI key. |
| Multiple tokens independently | Sets GITHUB_TOKEN and OPENAI_API_KEY. Verifies both cached independently. |
| Non-sensitive vars unaffected | Sets NORMAL_VAR. Verifies it's readable multiple times without one-shot-token log messages. |
| Python same-process caching | Uses Python os.getenv() to call getenv() directly (same process). Both reads succeed via cache. |
| Clear from /proc/self/environ | Python test: verifies first getenv() caches, checks os.environ dict, second getenv() returns cached value. |
| Test Case | Description |
|---|
| Cache GITHUB_TOKEN in chroot | Same as container mode but in chroot. Verifies library copied to chroot. |
| Cache COPILOT_GITHUB_TOKEN in chroot | Same for Copilot token in chroot. |
| Python caching in chroot | Python os.getenv() test in chroot mode. |
| Non-sensitive vars in chroot | Verifies non-sensitive vars unaffected in chroot. |
| Multiple tokens in chroot | Multiple independent tokens in chroot mode. |
| Test Case | Description |
|---|
| Empty token value | Sets GITHUB_TOKEN=''. Both reads return empty. |
| Token not set | NONEXISTENT_TOKEN not in environment. Both reads return empty. |
| Special characters | Sets token with @#$% characters. Both reads preserve special chars. |
The one-shot token library (LD_PRELOAD) is the core mechanism preventing token theft within the agent process. When an AI agent like Claude Code calls getenv("ANTHROPIC_API_KEY"):
- First call: library caches value in memory, unsets from environment
- Subsequent calls: library returns cached value
/proc/self/environ no longer contains the token
- A prompt injection attack that reads
/proc/self/environ or spawns a subprocess to printenv gets nothing
| Gap | Impact | Recommendation |
|---|
| No test for ANTHROPIC_API_KEY caching | Tested in container mode but not explicitly with debug log verification like GITHUB_TOKEN. | Already covered (Test 3 in container mode). |
No test for Node.js process.env access | Python is tested but Node.js (used by Claude Code, Copilot CLI) is not. | Add Node.js test with process.env.GITHUB_TOKEN. |
No test for Go os.Getenv() access | Go is common in GitHub Actions. | Add Go test verifying LD_PRELOAD works with Go binaries. |
| No test for compiled C program | The library intercepts C's getenv(). Testing with a compiled C program would be most direct. | Add test with simple C program calling getenv(). |
No test for environ global variable | Some programs access the environ array directly instead of getenv(). | Test that extern char **environ iteration doesn't show tokens. |
| No test for token very long values | What if a token is 10KB+? | Test with very long token value. |
| No test for concurrent getenv() calls | Multi-threaded programs calling getenv() simultaneously. | Test thread safety of the LD_PRELOAD library. |
| No test for LD_PRELOAD being stripped | If agent can modify LD_PRELOAD before forking. | Verify LD_PRELOAD is preserved in child processes. |
No test for CLAUDE_API_KEY or CODEX_API_KEY | Listed in sensitive tokens but not tested. | Add tests for all listed sensitive tokens. |
- What if LD_PRELOAD is disabled by the runtime (e.g., setuid binaries)?
- What about statically linked binaries that don't use dynamic getenv()?
- What if a program calls
secure_getenv() instead of getenv()?
- Race condition: what if two threads call getenv() for the same token simultaneously?
File: tests/integration/exit-code-propagation.test.ts
| Test Case | Description |
|---|
| Exit code 0 | exit 0 propagates as 0. Verifies "Process exiting with code: 0" in stderr. |
| Exit code 1 | exit 1 propagates as 1. |
| Exit code 2 | exit 2 propagates as 2. |
| Exit code 42 | Custom exit code propagates correctly. |
| Exit code 127 | Running nonexistent_command_xyz produces 127 (command not found). |
| Exit code 255 | Maximum standard exit code propagates correctly. |
| Test Case | Description |
|---|
| true | true command returns 0. |
| false | false command returns 1. |
| test success | test 1 -eq 1 returns 0. |
| test failure | test 1 -eq 2 returns 1. |
| grep found | echo "hello world" | grep hello returns 0. |
| grep not found | echo "hello world" | grep xyz returns 1. |
| Test Case | Description |
|---|
| Last command in pipeline | echo "test" | cat | exit 5 returns 5. |
| Compound command success | echo "a" && echo "b" && exit 0 returns 0. |
| Compound command failure | echo "a" && false && echo "c" returns 1 (short-circuits at false). |
Exit code propagation is essential for CI/CD integration. When AWF runs in a GitHub Actions step:
- Exit code 0 means the agent succeeded → workflow continues
- Non-zero exit code means the agent failed → workflow marks step as failed
- The correct exit code helps users diagnose what went wrong
| Gap | Impact | Recommendation |
|---|
| No test for signal-induced exit codes | SIGTERM → 143, SIGKILL → 137. Not tested. | Add tests for signal exit codes. |
| No test for exit codes > 255 | Shell wraps codes modulo 256. | Test exit 256 returns 0, exit 257 returns 1. |
| No test for exit code from killed process | What exit code does a kill -9'd process produce? | Test that AWF propagates signal-killed process codes. |
| No test for exit code from timeout | When --max-time causes curl to timeout. | Test exit code from timed-out commands. |
| No test for exit code from OOM-killed process | When a command is killed by OOM. | Document behavior (may not be testable in CI). |
| No test for pipefail behavior | set -o pipefail changes which exit code is returned from pipelines. | Test with set -o pipefail to verify behavior. |
| No test for nested AWF invocations | AWF running AWF (if that's ever a pattern). | May not be relevant but worth documenting. |
- What if the Docker container itself crashes (not the command)?
- What if
docker wait returns a different code than the process's actual exit code?
- What about exit codes from
exec'd processes within the command?
File: tests/integration/error-handling.test.ts
| Test Case | Description |
|---|
| Blocked domain | curl -f https://example.com with only github.com allowed. Expects non-zero exit. |
| Connection refused | curl http://localhost:12345 where no server listens. Expects "connection failed". |
| DNS resolution failure | curl to non-existent domain (in allowlist). Expects "dns failed". |
| Test Case | Description |
|---|
| Command not found | nonexistent_command_xyz123 returns exit code 127. |
| Permission denied | cat /etc/shadow fails with permission denied. |
| File not found | cat /nonexistent/file/path fails with "No such file". |
| Test Case | Description |
|---|
| Bash syntax error | bash -c "if then fi" caught with || echo. |
| Division by zero | bash -c "echo $((1/0))" caught with || echo. |
| Test Case | Description |
|---|
| SIGTERM from command | kill -TERM $$ self-terminates. Verifies firewall handles it gracefully (result is defined). |
| Test Case | Description |
|---|
| Continue after failure | Runs false then echo "recovery test". Verifies second command works (they're separate AWF invocations). |
Error handling is critical for production reliability:
- Network errors occur when agents try to access domains not in the allowlist
- Command errors occur when agent-generated commands have bugs
- Script errors happen when AI-generated code has syntax issues
- Signal handling ensures the firewall cleans up properly when interrupted
- Recovery after errors ensures the system doesn't enter a broken state
| Gap | Impact | Recommendation |
|---|
| No test for Docker daemon unavailable | What if Docker is not running when AWF starts? | Test CLI behavior when Docker socket is missing. |
| No test for disk full | What if /tmp is full and work directory can't be created? | Test behavior with limited disk space. |
| No test for Squid proxy crash | What if Squid crashes mid-session? | Test agent behavior when Squid becomes unreachable. |
| No test for network partition | What if the Docker network goes down? | Test behavior when awf-net becomes unavailable. |
| No test for SIGKILL handling | SIGKILL cannot be caught. What state is left behind? | Document and test cleanup behavior after SIGKILL. |
| No test for SIGINT (Ctrl+C) during command | The most common user interruption pattern. | Test that SIGINT properly stops containers and cleans up. |
| No test for concurrent error + cleanup | What if error occurs during cleanup? | Test nested error handling scenarios. |
| No test for invalid CLI arguments | What happens with --allow-domains "" or --dns-servers invalid? | Already tested at unit level but not integration. |
| No test for container startup timeout | What if container image pull takes too long? | Test behavior with unreachable registry. |
| No test for resource cleanup verification | After error, verify Docker networks/containers/volumes are cleaned up. | Add post-test assertions checking Docker state. |
- What if the error occurs in a
trap handler itself?
- What if multiple signals arrive in rapid succession?
- What about errors in the iptables cleanup phase?
- What if the work directory is deleted by another process during execution?
- Defense in depth: The security tests cover multiple layers (credential hiding, token unsetting, one-shot tokens, API proxy isolation).
- Both modes tested: Most security tests cover both normal and chroot mode.
- Real-world attack simulation: The credential hiding tests simulate actual exfiltration attacks (base64, xxd, grep patterns).
- Custom matchers: The
toSucceed(), toFail(), toExitWithCode() matchers provide clear, readable assertions.
- Bypass prevention: Tests specifically cover the chroot bypass vulnerability (Test 8) that was previously discovered and fixed.
- Broad API proxy coverage: Tests cover all three API providers (OpenAI, Anthropic, Copilot) for healthchecks and env wiring; end-to-end request routing and credential isolation are currently verified in depth only for Anthropic.
- No stress testing: No tests verify behavior under load, concurrent requests, or resource exhaustion.
- No timing/performance tests: No tests verify that proxy overhead is within acceptable bounds.
- Limited negative testing for security: While credential hiding is well-tested, other security bypasses (iptables manipulation, network namespace escape, container escape) are not tested.
- Fragile timing dependencies: The token-unset tests rely on fixed
sleep 7 delays which could be flaky in slow CI environments.
- Test isolation concerns: Tests share cleanup fixtures. A failed test could leave state that affects subsequent tests.
- No test for the full workflow: No end-to-end test runs an actual AI agent (even a mock one) through the complete AWF pipeline.
- Missing SSL Bump tests: The
--ssl-bump feature is configured in the CLI but has no integration tests in this set.
- Missing blocked-domains tests: The
--block-domains feature (deny-list on top of allow-list) has no integration tests.
- No test for
--env-all security implications: Passing all host env vars could leak credentials.
- No test for log output sanitization: Verifying that real API keys never appear in logs/stderr.