Skip to content

ZOOKEEPER-4419: The potential exception in Learner$LeaderConnector.connectToLeader may cause unnecessary re-election or service delay#1782

Open
functioner wants to merge 2 commits into
apache:masterfrom
functioner:ZOOKEEPER-4419
Open

ZOOKEEPER-4419: The potential exception in Learner$LeaderConnector.connectToLeader may cause unnecessary re-election or service delay#1782
functioner wants to merge 2 commits into
apache:masterfrom
functioner:ZOOKEEPER-4419

Conversation

@functioner

Copy link
Copy Markdown

The patch for ZOOKEEPER-4419

@functioner functioner changed the title ZOOKEEPER-4419: The potential exception in Learner.connectToLeader may cause unnecessary re-election or service delay ZOOKEEPER-4419: The potential exception in Learner$LeaderConnector.connectToLeader may cause unnecessary re-election or service delay Dec 2, 2021
@lanicc

lanicc commented Dec 2, 2021

Copy link
Copy Markdown
Contributor

good catch

@anmolnar

Copy link
Copy Markdown
Contributor

@kezhuw @eolivelli Thoughts about merging this?

@kezhuw kezhuw left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The direction looks good to me. I left inline comments about code.

try {
sock = createSocket();
if (socket.get() == null) {
continue;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?

What does this mean ?

I think all we have to do is moving createSocket under try-catach so the for-loop can continue retry.

Besides, shall we close sock on execeptional ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this either. Wouldn't be better to use the try block for creating and closing?

try (Socket sock = createSocket())  {
...
}

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@anmolnar oh you're right. try block for creating and closing is better.

@kezhuw what's the expected behavior when closing sock on exception? catch and throw out of the function?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try block for creating and closing is better.

We need the created socket on return, so "try-with-resources" statement is not going to work here.

what's the expected behavior when closing sock on exception? catch and throw out of the function?

I think it should be catch-and-swallow. Socket::connect failure is the reason to retry.

@PAC-MAJ PAC-MAJ Dec 30, 2025

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a small question if I may.
Do we really want to "continue;" and skip the backoff delay leaderConnectDelayDuringRetryMs ?
For the exceptions, it wasn't previously done neither but .close() with catch and swallow would do the trick in my opinion.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PAC-MAJ

The "continue;" you mentioned handles the case that sock = createSocket(); gives null value. In this case, the abnormal null value comes from some internal issue of the zk learner, not from leader's connection issue. Hence introducing the backoff delay leaderConnectDelayDuringRetryMs will not make sense, because that delay should focus on coordinating learn-leader connection frequency.

In a broader sense, the loop for (int tries = 0; tries < 5; tries++) is for the retry of creating socket AND connecting leader. Maybe this logic could be decoupled as two phases? For example, write a retry for loop for the leaderConnectDelayDuringRetryMs backoff delay, after successfully creating the socket.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hello,

Key comments are:

  1. Move createSocket() into the try block so socket creation failures are retried.
  2. Close the socket in the catch block before retrying.
  3. Remove the socket recreation from the catch block since the next iteration creates a fresh socket.

Then I propose the following outline.

Socket sock = null;

for (...) {
    try {
        sock = createSocket();

        remainingTimeout = ...;

        sockConnect(sock, ...);

        if (sslEnabled) {
            handshake(...);
        }

        sock.setTcpNoDelay(...);

        break; // success

    } catch (IOException e) {

        // close failed socket before retrying
        if (sock != null) {
            try {
                sock.close();
            } catch (IOException closeException) {
                LOG.debug("Failed to close socket after connection failure", closeException);
            }
            sock = null;
        }

        // keep existing timeout calculation and logging unchanged
        remainingTimeout = ...;

        if (remainingTimeout <= leaderConnectDelayDuringRetryMs) {
            LOG.error(...);
            throw e;
        } else if (tries >= 4) {
            LOG.error(...);
            throw e;
        } else {
            LOG.warn(...);
        }
    }

    Thread.sleep(leaderConnectDelayDuringRetryMs);
}

return sock;

@anmolnar anmolnar left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm.

@anmolnar

Copy link
Copy Markdown
Contributor

@kezhuw Can you approve this?

@functioner

Copy link
Copy Markdown
Author

Thanks for the review and suggestions.

@kezhuw can you approve? Then I will aim to merge this PR soon.

I would be following this comment if not advised otherwise.

…y cause unnecessary re-election or service delay
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants