Skip to content

Reduce SSH connection churn in SSHRemoteJobOperator under high fan-out#68115

Open
kaxil wants to merge 2 commits into
apache:mainfrom
astronomer:ssh-remote-job-load-hardening
Open

Reduce SSH connection churn in SSHRemoteJobOperator under high fan-out#68115
kaxil wants to merge 2 commits into
apache:mainfrom
astronomer:ssh-remote-job-load-hardening

Conversation

@kaxil
Copy link
Copy Markdown
Member

@kaxil kaxil commented Jun 6, 2026

SSHRemoteJobOperator currently opens a brand-new SSH connection for every remote command. A large .expand() fan-out against a single host drives the connection rate past the remote sshd MaxStartups limit, which drops connections. This showed up in load testing two ways: submit-time failures (paramiko ... Error reading SSH protocol banner) and job directories left behind on the remote host.

This PR cuts the connection rate at the source and hardens the retry and cleanup paths. Defaults preserve existing behavior.

Root cause

The banner error appears within a few milliseconds of the connect attempt and the underlying exception is **EOFError**: the server closed the socket before sending its banner. That is sshd MaxStartups (default 10:30:100) throttling concurrent unauthenticated connections, not a slow banner, so raising banner_timeout does not help.

The trigger reconnected 2-3 times per poll (completion check, log size, log read), per task, for the whole job, so a 200-way fan-out sustained a very high handshake rate against one server. The retries failed for the same reason: synchronized across the fleet while the rest of the fleet kept the server saturated. Cleanup ran only on completion and silently swallowed its own dropped connection, orphaning the directory.

What changed

  • Trigger holds a single connection for the whole poll loop instead of reconnecting per command, and reconnects with jittered backoff (bounded by max_reconnect_attempts) if it drops. asyncssh.Error is now treated as a reconnectable failure. The reconnect budget resets only after a full successful poll, so a connection that handshakes but whose command channel keeps failing (for example ChannelOpenError under MaxSessions) still exhausts the budget instead of deferring forever.
  • Operator reuses one connection for OS detection and submission (was two).
  • Cleanup retries (cleanup_retries) instead of orphaning the directory on a single dropped connection.
  • New conn_retry_attempts on the hook and operator so the initial submit burst tolerates transient refusals.
  • SSHHookAsync sets a keepalive on the now long-lived trigger connection.

Measured against a real OpenSSH container

before after
SSH connections for one ~9s job 20 5
MaxStartups drops, 120-way simultaneous submit 71 36

New parameters (all optional)

  • conn_retry_attempts: operator default 5, hook default 3 (unchanged behavior at 3).
  • cleanup_retries: default 3.
  • command_timeout: default 30.0, and max_reconnect_attempts: default 5, both forwarded from the operator to the trigger.

Gotchas

  • The direct lever for very high fan-out is raising MaxStartups (and MaxSessions) on the SSH server. The operator docs now call this out.
  • Cleanup only runs when a job reaches completion, so killed or timed-out tasks can still leave a directory behind. A server-side TTL reaper (for example systemd-tmpfiles) is recommended for those, and the docs mention it.

The operator and trigger opened a new SSH connection for every remote
command. A large expand() fan-out against one host drove the connection
rate past the remote sshd MaxStartups limit, which drops connections and
surfaces as "paramiko ... Error reading SSH protocol banner" (an immediate
EOF, not a banner timeout) at submit time, and left job directories behind
when the cleanup connection was dropped too.

Changes:
- Trigger holds one connection for the whole poll loop instead of
  reconnecting per command, with bounded jittered reconnect on drops and
  asyncssh.Error treated as reconnectable.
- Operator reuses one connection for OS detection and submission.
- Cleanup retries instead of orphaning the job directory on a dropped
  connection.
- Configurable conn_retry_attempts (operator/hook) for the submit burst,
  plus command_timeout and max_reconnect_attempts forwarded to the trigger.
- SSHHookAsync sets a keepalive on the long-lived trigger connection.
- _run_command decodes bytes stdout/stderr so the return matches
  tuple[int, str, str] (asyncssh types them as bytes | str).
- Drop 'jittered'/'desynchronise' from docstrings (Sphinx spellcheck).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant