[awf] ARC/DinD: remaining workarounds needed for zero-config chroot mode on Kubernetes runners #4399

@lpcox

Context

Despite significant ARC/DinD improvements in AWF (PRs #2839, #2843, #3218, #3554, #3852, #3914, #4026), real-world users on ARC/DinD runners still need a composite action with ~100 lines of shell workarounds to get agentic workflows running. See the comment on gh-aw#34896 for the full workaround code.

The core infrastructure (path prefix, socket detection, /etc synthesis) is fixed. This issue tracks the remaining gaps that prevent a truly zero-workaround experience.


Gap 1: Copilot HOME/identity vars not forwarded in chroot mode

Problem: AWF chroot mode passes HOME=/home/runner, USER=root, LOGNAME=root to the agent exec regardless of engine.env settings. The Copilot CLI can't write to ~/.copilot in the chrooted DinD filesystem and exits silently with status 1.

Current workaround: Users create a shell shim (copilot.real + wrapper) that forces HOME=/tmp/gh-aw/home USER=runner LOGNAME=runner before exec'ing the real binary.

Proposed fix:

  • Add chroot.identity to stdin config:
    {
      "chroot": {
        "identity": {
          "home": "/tmp/gh-aw/home",
          "user": "runner",
          "uid": 1001,
          "gid": 1001
        }
      }
    }
  • AWF's entrypoint.sh reads these from config and sets HOME, USER, LOGNAME after the chroot pivot, overriding the defaults.
  • Document in awf-config-schema.json and docs/chroot-mode.md.
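
The override described above can be sketched as a small merge function. This is only an illustration in TypeScript; the issue places the real logic in entrypoint.sh. The names `IdentityEnv` and `resolveIdentityEnv` are hypothetical, and setting LOGNAME to the same value as `user` is an assumption modelled on the current shim workaround.

```typescript
// Hypothetical shape of the proposed chroot.identity block (field names from this issue).
interface ChrootIdentity {
  home?: string;
  user?: string;
  uid?: number;
  gid?: number;
}

interface IdentityEnv {
  HOME: string;
  USER: string;
  LOGNAME: string;
}

// Values chroot mode passes today, per the problem statement above.
const DEFAULT_IDENTITY_ENV: IdentityEnv = {
  HOME: "/home/runner",
  USER: "root",
  LOGNAME: "root",
};

// Hypothetical helper: configured values win, anything unset keeps the default.
function resolveIdentityEnv(identity?: ChrootIdentity): IdentityEnv {
  const env: IdentityEnv = { ...DEFAULT_IDENTITY_ENV };
  if (identity?.home) {
    env.HOME = identity.home;
  }
  if (identity?.user) {
    env.USER = identity.user;
    env.LOGNAME = identity.user; // assumption: LOGNAME mirrors USER, as in the shim
  }
  return env;
}

const resolvedEnv = resolveIdentityEnv({
  home: "/tmp/gh-aw/home",
  user: "runner",
  uid: 1001,
  gid: 1001,
});
console.log(resolvedEnv);
```

The sketch leaves uid and gid untouched because the proposal does not say how they would be applied after the chroot pivot.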

Gap 2: /tmp/gh-aw directory tree pre-staging inside DinD daemon

Problem: On ARC/DinD, the Docker daemon's /tmp is a separate filesystem from the runner's /tmp. AWF writes files to the runner's /tmp/gh-aw/, but the daemon (which creates containers) can't see them. Users must pre-create the directory tree with correct permissions inside the daemon's filesystem before AWF runs.

Current workaround: Users run docker run --rm ... -v /tmp:/host-tmp:rw ... mkdir -p /host-tmp/gh-aw/{.cache,.config,.local/state,home,mcp-logs,...} && chmod -R 0777 as a pre-agent step.

Proposed fix:

  • Add dind.preStageDirs to stdin config:
    {
      "dind": {
        "preStageDirs": true,
        "workDir": "/tmp/gh-aw",
        "stagingImage": "ghcr.io/github/gh-aw-firewall/agent:latest"
      }
    }
  • When preStageDirs: true and a DinD environment is detected, AWF runs a lightweight init container to create the required directory tree with open permissions before starting the compose stack.
  • This reuses the existing DinD detection logic from PR #3554 (feat: auto-detect DinD split filesystem via sentinel probe).
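
As a rough sketch of what the init container invocation could look like, the function below builds a `docker run` argument list from the proposed config. `buildPreStageArgs` and `KNOWN_SUBDIRS` are hypothetical names. The subdirectory list contains only the entries named in the workaround, which is truncated in this issue, so it is incomplete. Using `--entrypoint sh` assumes the staging image ships a POSIX shell.

```typescript
interface DindPreStageConfig {
  preStageDirs: boolean;
  workDir: string;
  stagingImage: string;
}

// Subdirectories named in the current workaround; the source list is truncated,
// so a real implementation would need the full set.
const KNOWN_SUBDIRS = [".cache", ".config", ".local/state", "home", "mcp-logs"];

// Single-quote a value for safe use inside an sh -c script.
const shellQuote = (s: string): string => "'" + s.replace(/'/g, "'\\''") + "'";

// Hypothetical helper: returns docker CLI arguments, or null when no staging is needed.
function buildPreStageArgs(
  cfg: DindPreStageConfig,
  dindDetected: boolean,
): string[] | null {
  if (!cfg.preStageDirs || !dindDetected) {
    return null;
  }
  const slash = cfg.workDir.lastIndexOf("/");
  const parent = slash <= 0 ? "/" : cfg.workDir.slice(0, slash); // e.g. "/tmp"
  const leaf = cfg.workDir.slice(slash + 1); // e.g. "gh-aw"
  const root = `/host-tmp/${leaf}`; // mount point name taken from the workaround
  const dirs = KNOWN_SUBDIRS.map((d) => shellQuote(`${root}/${d}`)).join(" ");
  const script = `mkdir -p ${dirs} && chmod -R 0777 ${shellQuote(root)}`;
  return [
    "run", "--rm",
    "--entrypoint", "sh",
    "-v", `${parent}:/host-tmp:rw`, // resolved by the daemon, so this is the daemon's /tmp
    cfg.stagingImage,
    "-c", script,
  ];
}

const preStageArgs = buildPreStageArgs(
  {
    preStageDirs: true,
    workDir: "/tmp/gh-aw",
    stagingImage: "ghcr.io/github/gh-aw-firewall/agent:latest",
  },
  true,
);
console.log(preStageArgs);
```

Returning an argument array rather than a command string keeps the caller free to pass it to a process spawner without a second layer of shell quoting.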

Gap 3: Engine binary staging into DinD daemon's /usr/local/bin

Problem: The Copilot CLI is installed on the runner at runtime by gh-aw. But in DinD mode, the runner's filesystem is not visible to containers created by the daemon. The binary must be copied into the daemon's filesystem so AWF's /usr:/host/usr:ro mount exposes it inside the chroot.

Current workaround: Users run docker run ... -v /usr/local/bin:/daemon-usr-local-bin:rw ... cp copilot /daemon-usr-local-bin/ after installation.

Proposed fix:

  • Add dind.stageEngineBinary to stdin config:
    {
      "dind": {
        "stageEngineBinary": {
          "path": "/usr/local/bin/copilot",
          "targetPath": "/usr/local/bin/copilot"
        }
      }
    }
  • AWF detects the DinD split filesystem, locates the engine binary on the runner, and stages it into the daemon's filesystem via a short-lived container before starting the agent.
  • The binary path comes from config (non-sensitive); no secrets involved.
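
One possible way to stage the binary, sketched below, is to stream it over the container's stdin, since the Docker client forwards stdin to the daemon even when the runner's filesystem is not mounted. This transfer technique is an assumption and is not taken from the issue or the existing workaround, whose details are elided. `buildStageBinaryArgs` is a hypothetical name, and `--entrypoint sh` assumes the staging image ships a POSIX shell.

```typescript
interface StageEngineBinaryConfig {
  path: string; // location of the binary on the runner
  targetPath: string; // desired location in the daemon's filesystem
}

// Hypothetical helper: builds docker CLI arguments for a short-lived container that
// writes its stdin to targetPath. The caller is expected to pipe the file at
// cfg.path into the process's stdin.
function buildStageBinaryArgs(
  cfg: StageEngineBinaryConfig,
  stagingImage: string,
): string[] {
  const slash = cfg.targetPath.lastIndexOf("/");
  const targetDir = slash <= 0 ? "/" : cfg.targetPath.slice(0, slash);
  const fileName = cfg.targetPath.slice(slash + 1);
  // Mount point name borrowed from the current workaround.
  const dest = `/daemon-usr-local-bin/${fileName}`;
  return [
    "run", "--rm", "-i",
    "--entrypoint", "sh",
    "-v", `${targetDir}:/daemon-usr-local-bin:rw`, // resolved by the daemon
    stagingImage,
    // sh -c <script> sh <dest>: "$1" inside the script expands to dest.
    "-c", 'cat > "$1" && chmod 0755 "$1"',
    "sh", dest,
  ];
}

const stageArgs = buildStageBinaryArgs(
  { path: "/usr/local/bin/copilot", targetPath: "/usr/local/bin/copilot" },
  "ghcr.io/github/gh-aw-firewall/agent:latest",
);
console.log(stageArgs);
```

Passing the destination as a positional parameter rather than interpolating it into the script avoids quoting problems if the path contains unusual characters.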

Gap 4: MCP DOCKER_HOST env for DinD socket

Problem: MCP servers (github-mcp-server, mcpg) need to know the Docker socket location when running inside DinD. The user currently has to manually set sandbox.mcp.env.DOCKER_HOST.

Current workaround: sandbox.mcp.env.DOCKER_HOST: tcp://localhost:2375 in workflow frontmatter.

Proposed fix:

  • When AWF detects DinD mode (already supported), automatically propagate the detected Docker host to MCP server containers as DOCKER_HOST.
  • No config change needed — this is implicit behavior when --enable-dind or auto-detection is active.
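
A minimal sketch of the implicit propagation follows. `withDockerHost` is a hypothetical helper name. Letting an explicit user-supplied DOCKER_HOST take precedence over the detected value is an assumption; the issue does not specify which should win.

```typescript
// Hypothetical helper: returns a copy of the MCP container env with DOCKER_HOST filled in
// from DinD detection, unless the user already set it or nothing was detected.
function withDockerHost(
  mcpEnv: Record<string, string>,
  detectedDockerHost?: string,
): Record<string, string> {
  if (!detectedDockerHost || "DOCKER_HOST" in mcpEnv) {
    return { ...mcpEnv };
  }
  return { ...mcpEnv, DOCKER_HOST: detectedDockerHost };
}

// Example values: tcp://localhost:2375 is the host used in the current workaround.
const autoEnv = withDockerHost({ LOG_LEVEL: "info" }, "tcp://localhost:2375");
const userEnv = withDockerHost({ DOCKER_HOST: "unix:///var/run/docker.sock" }, "tcp://localhost:2375");
console.log(autoEnv, userEnv);
```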

Design Principles

All proposed config fields follow AWF's existing conventions:

| Parameter | Location | Rationale |
| --- | --- | --- |
| chroot.identity.home | stdin config | Non-sensitive path configuration |
| chroot.identity.user | stdin config | Non-sensitive identity hint |
| chroot.identity.uid/gid | stdin config | Non-sensitive numeric IDs |
| dind.preStageDirs | stdin config | Boolean flag, no secrets |
| dind.workDir | stdin config | Non-sensitive path |
| dind.stagingImage | stdin config | Non-sensitive image reference |
| dind.stageEngineBinary.path | stdin config | Non-sensitive filesystem path |
| API keys, tokens | env vars only | Never in config; passed via -e flags |

Documentation requirements

  • All new fields MUST be added to src/awf-config-schema.json with descriptions
  • All new fields MUST be reflected in src/types/ TypeScript interfaces
  • docs/chroot-mode.md MUST document the ARC/DinD identity override behavior
  • A new docs/arc-dind.md guide should consolidate all ARC/DinD configuration in one place
  • The AWF spec (awf-config-spec.yaml if applicable) MUST include the new fields
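
For the TypeScript side, the new fields might be reflected roughly as follows. The interface names, the choice of which fields are optional, and the `hasValidIds` check are all assumptions for illustration; only the field names come from this issue.

```typescript
// Hypothetical interfaces mirroring the fields proposed in this issue.
interface ChrootIdentityConfig {
  home: string;
  user: string;
  uid: number;
  gid: number;
}

interface StageEngineBinaryConfig {
  path: string;
  targetPath: string;
}

interface DindConfig {
  preStageDirs?: boolean;
  workDir?: string;
  stagingImage?: string;
  stageEngineBinary?: StageEngineBinaryConfig;
}

interface ProposedConfigAdditions {
  chroot?: { identity?: ChrootIdentityConfig };
  dind?: DindConfig;
}

// Hypothetical sanity check: uid and gid must be non-negative integers.
function hasValidIds(identity: ChrootIdentityConfig): boolean {
  return [identity.uid, identity.gid].every((n) => Number.isInteger(n) && n >= 0);
}

const exampleIdentity: ChrootIdentityConfig = {
  home: "/tmp/gh-aw/home",
  user: "runner",
  uid: 1001,
  gid: 1001,
};

const exampleConfig: ProposedConfigAdditions = {
  chroot: { identity: exampleIdentity },
  dind: {
    preStageDirs: true,
    workDir: "/tmp/gh-aw",
    stagingImage: "ghcr.io/github/gh-aw-firewall/agent:latest",
    stageEngineBinary: {
      path: "/usr/local/bin/copilot",
      targetPath: "/usr/local/bin/copilot",
    },
  },
};
console.log(hasValidIds(exampleIdentity), exampleConfig);
```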

Success Criteria

A user on ARC/DinD runners can run an agentic workflow with only standard workflow frontmatter fields (no composite action, no pre-agent-steps, no resources: block). The AWF binary handles all filesystem staging internally based on the stdin config provided by the gh-aw compiler.
