Reduce SSH connection churn in SSHRemoteJobOperator under high fan-out #68115
Open
kaxil wants to merge 2 commits into
The operator and trigger opened a new SSH connection for every remote command. A large `expand()` fan-out against one host drove the connection rate past the remote sshd `MaxStartups` limit, which drops connections and surfaces as "paramiko ... Error reading SSH protocol banner" (an immediate EOF, not a banner timeout) at submit time, and left job directories behind when the cleanup connection was dropped too.

Changes:
- Trigger holds one connection for the whole poll loop instead of reconnecting per command, with bounded jittered reconnect on drops and `asyncssh.Error` treated as reconnectable.
- Operator reuses one connection for OS detection and submission.
- Cleanup retries instead of orphaning the job directory on a dropped connection.
- Configurable `conn_retry_attempts` (operator/hook) for the submit burst, plus `command_timeout` and `max_reconnect_attempts` forwarded to the trigger.
- `SSHHookAsync` sets a keepalive on the long-lived trigger connection.

- `_run_command` decodes bytes stdout/stderr so the return matches `tuple[int, str, str]` (asyncssh types them as `bytes | str`).
- Drop 'jittered'/'desynchronise' from docstrings (Sphinx spellcheck).
`SSHRemoteJobOperator` currently opens a brand-new SSH connection for every remote command. A large `.expand()` fan-out against a single host drives the connection rate past the remote `sshd` `MaxStartups` limit, which drops connections. This showed up in load testing two ways: submit-time failures (`paramiko ... Error reading SSH protocol banner`) and job directories left behind on the remote host.

This PR cuts the connection rate at the source and hardens the retry and cleanup paths. Defaults preserve existing behavior.
Root cause
The banner error appears within a few milliseconds of the connect attempt, and the underlying exception is **EOFError**: the server closed the socket before sending its banner. That is `sshd` `MaxStartups` (default `10:30:100`) throttling concurrent unauthenticated connections, not a slow banner, so raising `banner_timeout` does not help.

The trigger reconnected 2-3 times per poll (completion check, log size, log read), per task, for the whole job, so a 200-way fan-out sustained a very high handshake rate against one server. The retries failed for the same reason: they were synchronized across the fleet and arrived while the rest of the fleet kept the server saturated. Cleanup ran only on completion and silently swallowed its own dropped connection, orphaning the directory.
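To make the `10:30:100` default concrete, here is a minimal sketch of the drop probability that the `sshd_config` documentation describes for `MaxStartups start:rate:full`: no drops below `start`, a `rate` percent chance at `start`, rising linearly to 100% at `full`. The function name `drop_probability` is hypothetical and not part of this PR, and the sketch uses floating point rather than whatever arithmetic `sshd` uses internally, so treat the values as approximate.

```python
def drop_probability(unauthenticated: int, start: int = 10, rate: int = 30, full: int = 100) -> float:
    """Approximate chance that sshd refuses a new connection.

    `unauthenticated` is the number of connections that have not yet
    completed authentication. Defaults mirror MaxStartups 10:30:100.
    """
    if unauthenticated < start:
        return 0.0  # below the threshold nothing is dropped
    if unauthenticated >= full:
        return 1.0  # at or above `full` every attempt is refused
    # Linear ramp from rate% at `start` up to 100% at `full`.
    percent = rate + (100 - rate) * (unauthenticated - start) / (full - start)
    return percent / 100.0


if __name__ == "__main__":
    for n in (5, 10, 55, 100):
        print(n, drop_probability(n))
```

Under these assumptions, a burst that keeps a few dozen handshakes in flight at once is already losing a large share of its connection attempts, which is consistent with the immediate EOF described above.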
What changed
- [...] `max_reconnect_attempts`) if it drops. `asyncssh.Error` is now treated as a reconnectable failure. The reconnect budget resets only after a full successful poll, so a connection that handshakes but whose command channel keeps failing (for example `ChannelOpenError` under `MaxSessions`) still exhausts the budget instead of deferring forever.
- [...] `cleanup_retries`) instead of orphaning the directory on a single dropped connection.
- `conn_retry_attempts` on the hook and operator so the initial submit burst tolerates transient refusals.
- `SSHHookAsync` sets a keepalive on the now long-lived trigger connection.

Measured against a real OpenSSH container
`MaxStartups` drops, 120-way simultaneous submit

New parameters (all optional)
- `conn_retry_attempts`: operator default 5, hook default 3 (unchanged behavior at 3).
- `cleanup_retries`: default 3.
- `command_timeout`: default 30.0, and `max_reconnect_attempts`: default 5, both forwarded from the operator to the trigger.

Gotchas
- [...] `MaxStartups` (and `MaxSessions`) on the SSH server. The operator docs now call this out.
- [...] `systemd-tmpfiles`) is recommended for those, and the docs mention it.
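The reconnect behavior described under "What changed" (one held connection, a bounded reconnect budget with randomized delay, and a budget that resets only after a full successful poll) can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's code: `poll_until_done`, `ConnectionDropped`, `connect`, `poll_once`, `base_delay`, `max_delay`, and `poll_interval` are hypothetical names, and `ConnectionDropped` stands in for whatever the trigger treats as reconnectable (the PR names `asyncssh.Error`).

```python
import asyncio
import random


class ConnectionDropped(Exception):
    """Hypothetical stand-in for a reconnectable failure such as asyncssh.Error."""


async def poll_until_done(connect, poll_once, *, max_reconnect_attempts=5,
                          poll_interval=5.0, base_delay=0.5, max_delay=10.0,
                          sleep=asyncio.sleep):
    """Poll over one held connection, reconnecting with a bounded budget."""
    conn = None
    failures = 0  # failures since the last fully successful poll
    while True:
        try:
            if conn is None:
                conn = await connect()  # a refused handshake also counts as a failure
            done = await poll_once(conn)  # all commands of one poll share `conn`
        except ConnectionDropped:
            conn = None  # force a fresh connection on the next iteration
            failures += 1
            if failures > max_reconnect_attempts:
                raise  # budget exhausted: surface the error instead of deferring forever
            # Randomized exponential backoff so a fleet of tasks does not retry in lockstep.
            cap = min(max_delay, base_delay * 2 ** (failures - 1))
            await sleep(random.uniform(0, cap))
            continue
        # Only a complete poll resets the budget; a handshake alone does not.
        failures = 0
        if done:
            return
        await sleep(poll_interval)
```

Because the counter is cleared after `poll_once` returns rather than after `connect` returns, a server that accepts the handshake but keeps rejecting command channels still runs out of attempts. A real implementation would also close the old connection before dropping the reference, which this sketch omits.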