ZOOKEEPER-4419: The potential exception in Learner$LeaderConnector.connectToLeader may cause unnecessary re-election or service delay#1782
Conversation
|
good catch |
|
@kezhuw @eolivelli Thoughts about merging this? |
kezhuw
left a comment
There was a problem hiding this comment.
The direction looks good to me. I left inline comments about code.
| try { | ||
| sock = createSocket(); | ||
| if (socket.get() == null) { | ||
| continue; |
There was a problem hiding this comment.
?
What does this mean ?
I think all we have to do is moving createSocket under try-catach so the for-loop can continue retry.
Besides, shall we close sock on execeptional ?
There was a problem hiding this comment.
I don't understand this either. Wouldn't be better to use the try block for creating and closing?
try (Socket sock = createSocket()) {
...
}There was a problem hiding this comment.
try block for creating and closing is better.
We need the created socket on return, so "try-with-resources" statement is not going to work here.
what's the expected behavior when closing sock on exception? catch and throw out of the function?
I think it should be catch-and-swallow. Socket::connect failure is the reason to retry.
There was a problem hiding this comment.
I have a small question if I may.
Do we really want to "continue;" and skip the backoff delay leaderConnectDelayDuringRetryMs ?
For the exceptions, it wasn't previously done neither but .close() with catch and swallow would do the trick in my opinion.
There was a problem hiding this comment.
The "continue;" you mentioned handles the case that sock = createSocket(); gives null value. In this case, the abnormal null value comes from some internal issue of the zk learner, not from leader's connection issue. Hence introducing the backoff delay leaderConnectDelayDuringRetryMs will not make sense, because that delay should focus on coordinating learn-leader connection frequency.
In a broader sense, the loop for (int tries = 0; tries < 5; tries++) is for the retry of creating socket AND connecting leader. Maybe this logic could be decoupled as two phases? For example, write a retry for loop for the leaderConnectDelayDuringRetryMs backoff delay, after successfully creating the socket.
There was a problem hiding this comment.
hello,
Key comments are:
- Move createSocket() into the try block so socket creation failures are retried.
- Close the socket in the catch block before retrying.
- Remove the socket recreation from the catch block since the next iteration creates a fresh socket.
Then I propose the following outline.
Socket sock = null;
for (...) {
try {
sock = createSocket();
remainingTimeout = ...;
sockConnect(sock, ...);
if (sslEnabled) {
handshake(...);
}
sock.setTcpNoDelay(...);
break; // success
} catch (IOException e) {
// close failed socket before retrying
if (sock != null) {
try {
sock.close();
} catch (IOException closeException) {
LOG.debug("Failed to close socket after connection failure", closeException);
}
sock = null;
}
// keep existing timeout calculation and logging unchanged
remainingTimeout = ...;
if (remainingTimeout <= leaderConnectDelayDuringRetryMs) {
LOG.error(...);
throw e;
} else if (tries >= 4) {
LOG.error(...);
throw e;
} else {
LOG.warn(...);
}
}
Thread.sleep(leaderConnectDelayDuringRetryMs);
}
return sock;|
@kezhuw Can you approve this? |
|
Thanks for the review and suggestions. @kezhuw can you approve? Then I will aim to merge this PR soon. I would be following this comment if not advised otherwise. |
…y cause unnecessary re-election or service delay
af00780 to
f0fab48
Compare
The patch for ZOOKEEPER-4419