Reduce SSH connection churn in `SSHRemoteJobOperator` under high fan-out by kaxil · Pull Request #68115 · apache/airflow

kaxil · 2026-06-06T01:19:40Z

SSHRemoteJobOperator currently opens a brand-new SSH connection for every remote command. A large .expand() fan-out against a single host drives the connection rate past the remote sshd MaxStartups limit, which drops connections. This showed up in load testing two ways: submit-time failures (paramiko ... Error reading SSH protocol banner) and job directories left behind on the remote host.

This PR cuts the connection rate at the source and hardens the retry and cleanup paths. Defaults preserve existing behavior.

Root cause

The banner error appears within a few milliseconds of the connect attempt and the underlying exception is **EOFError**: the server closed the socket before sending its banner. That is sshd MaxStartups (default 10:30:100) throttling concurrent unauthenticated connections, not a slow banner, so raising banner_timeout does not help.

The trigger reconnected 2-3 times per poll (completion check, log size, log read), per task, for the whole job, so a 200-way fan-out sustained a very high handshake rate against one server. The retries failed for the same reason: synchronized across the fleet while the rest of the fleet kept the server saturated. Cleanup ran only on completion and silently swallowed its own dropped connection, orphaning the directory.

What changed

Trigger holds a single connection for the whole poll loop instead of reconnecting per command, and reconnects with jittered backoff (bounded by max_reconnect_attempts) if it drops. asyncssh.Error is now treated as a reconnectable failure. The reconnect budget resets only after a full successful poll, so a connection that handshakes but whose command channel keeps failing (for example ChannelOpenError under MaxSessions) still exhausts the budget instead of deferring forever.
Operator reuses one connection for OS detection and submission (was two).
Cleanup retries (cleanup_retries) instead of orphaning the directory on a single dropped connection.
New conn_retry_attempts on the hook and operator so the initial submit burst tolerates transient refusals.
SSHHookAsync sets a keepalive on the now long-lived trigger connection.

Measured against a real OpenSSH container

	before	after
SSH connections for one ~9s job	20	5
`MaxStartups` drops, 120-way simultaneous submit	71	36

New parameters (all optional)

conn_retry_attempts: operator default 5, hook default 3 (unchanged behavior at 3).
cleanup_retries: default 3.
command_timeout: default 30.0, and max_reconnect_attempts: default 5, both forwarded from the operator to the trigger.

Gotchas

The direct lever for very high fan-out is raising MaxStartups (and MaxSessions) on the SSH server. The operator docs now call this out.
Cleanup only runs when a job reaches completion, so killed or timed-out tasks can still leave a directory behind. A server-side TTL reaper (for example systemd-tmpfiles) is recommended for those, and the docs mention it.

The operator and trigger opened a new SSH connection for every remote command. A large expand() fan-out against one host drove the connection rate past the remote sshd MaxStartups limit, which drops connections and surfaces as "paramiko ... Error reading SSH protocol banner" (an immediate EOF, not a banner timeout) at submit time, and left job directories behind when the cleanup connection was dropped too. Changes: - Trigger holds one connection for the whole poll loop instead of reconnecting per command, with bounded jittered reconnect on drops and asyncssh.Error treated as reconnectable. - Operator reuses one connection for OS detection and submission. - Cleanup retries instead of orphaning the job directory on a dropped connection. - Configurable conn_retry_attempts (operator/hook) for the submit burst, plus command_timeout and max_reconnect_attempts forwarded to the trigger. - SSHHookAsync sets a keepalive on the long-lived trigger connection.

- _run_command decodes bytes stdout/stderr so the return matches tuple[int, str, str] (asyncssh types them as bytes | str). - Drop 'jittered'/'desynchronise' from docstrings (Sphinx spellcheck).

boring-cyborg Bot added area:providers kind:documentation provider:ssh labels Jun 6, 2026

Fix mypy return type and docs spellcheck in SSH remote job trigger

a0892fe

- _run_command decodes bytes stdout/stderr so the return matches tuple[int, str, str] (asyncssh types them as bytes | str). - Drop 'jittered'/'desynchronise' from docstrings (Sphinx spellcheck).

kaxil requested review from eladkal, gopidesupavan and vatsrahul1001 June 6, 2026 02:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reduce SSH connection churn in `SSHRemoteJobOperator` under high fan-out#68115

Reduce SSH connection churn in `SSHRemoteJobOperator` under high fan-out#68115
kaxil wants to merge 2 commits into
apache:mainfrom
astronomer:ssh-remote-job-load-hardening

kaxil commented Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

kaxil commented Jun 6, 2026

Root cause

What changed

Measured against a real OpenSSH container

New parameters (all optional)

Gotchas

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant