We're using SwiftMQ 9.2.5 in a replicated HA setup. Due to some unidentified network problems, the keepalive counter reaches 0 and the API tries to reconnect. A few questions:
- Can the API be forced to retry the server it was connected to, rather than the other node?
- How can I adjust the reconnection timeout? It seems to be set to 60000 (60 seconds).
Keepalive is set to 500 to detect instance failure quickly.
The JNDI URL the application uses is: smqp://host1:4001/host2=host2;port2=4001;timeout=10000;retrydelay=1000;maxretries=50;reconnect=true
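For reference, the settings carried by that URL can be listed with a few lines of plain Java. This is only a sketch that splits the string on `/`, `;` and `=`; the class name `JndiUrlParams` and the `primary` key are made up for illustration, and this is not how the SwiftMQ client itself parses the URL. It makes it easy to see that none of the parameters in the URL has the value 60000.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of SwiftMQ: lists the settings found in the JNDI URL.
final class JndiUrlParams {

    static Map<String, String> parse(String url) {
        Map<String, String> result = new LinkedHashMap<>();
        // Drop the scheme ("smqp://") if present.
        int scheme = url.indexOf("://");
        String rest = scheme < 0 ? url : url.substring(scheme + 3);
        // Everything before the first '/' is the first host:port pair.
        int slash = rest.indexOf('/');
        result.put("primary", slash < 0 ? rest : rest.substring(0, slash));
        if (slash >= 0) {
            // The remainder is a ';'-separated list of key=value pairs.
            for (String pair : rest.substring(slash + 1).split(";")) {
                int eq = pair.indexOf('=');
                if (eq > 0) {
                    result.put(pair.substring(0, eq), pair.substring(eq + 1));
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String url = "smqp://host1:4001/host2=host2;port2=4001;timeout=10000;"
                + "retrydelay=1000;maxretries=50;reconnect=true";
        parse(url).forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```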
Thanks, but what about the timeout when trying to reconnect? Can those 60 seconds be adjusted somehow, or are they hardwired into the code? If the connection breaks or the keepalive counter ticks down, the API will try the standby node, unnecessarily delaying the reconnect by at least a minute.
Which timeout do you mean? If the client tries to reconnect and it doesn't get a connection, it immediately gets a SocketException and tries the other host.
If you mean that it takes time for the client to detect a broken network, then you may adjust the keep alive interval in the connection factory. The connection is marked as dead and disconnected after 5 missing keep alive messages.
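As a rough worked example of that rule: the detection time is about five keepalive intervals. The sketch below assumes the keepalive value of 500 mentioned earlier is in milliseconds (the thread does not state the unit), and the class and method names are invented for illustration.

```java
// Illustrative arithmetic only; names are hypothetical and not part of SwiftMQ.
final class KeepaliveDetection {

    // From the reply above: the connection is marked dead after 5 missing keepalive messages.
    static final int MISSED_KEEPALIVES_BEFORE_DEAD = 5;

    static long detectionTimeMillis(long keepaliveIntervalMillis) {
        return keepaliveIntervalMillis * MISSED_KEEPALIVES_BEFORE_DEAD;
    }

    public static void main(String[] args) {
        long interval = 500; // assumption: the configured keepalive value is in milliseconds
        System.out.println("approx. detection time: " + detectionTimeMillis(interval) + " ms");
    }
}
```

Under that assumption a dead connection would be noticed after roughly 2.5 seconds, far less than the 60 seconds observed during the reconnect.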
Well, the problem is that, due to a firewall between the two machines, the TCP connection does come up, but nothing comes through it because the port on the far end is actually not open. I don't know why it works like that; I'll talk to the network admins, as this is really weird. Anyhow, please note the 1 minute difference between the two log lines (09:11:18 vs 09:12:18) and the delay=60000 value. Is that waiting time configurable, and if so, where and how?
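One way to check this behaviour independently of SwiftMQ might be a small probe using only `java.net.Socket`: connect with a timeout, then wait briefly for any byte. This is a diagnostic sketch, not SwiftMQ API; the class name `PortProbe` and the timeout values are arbitrary. Whether a healthy router sends anything before the client speaks is not known here, so `CONNECTED_BUT_SILENT` alone proves nothing. The useful part is comparing the result for both nodes, and confirming that the connect step succeeds even where no router is listening.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical diagnostic helper built on plain java.net; not part of SwiftMQ.
final class PortProbe {

    enum Result { CONNECT_FAILED, CONNECTED_BUT_SILENT, DATA_RECEIVED, CLOSED_BY_PEER }

    static Result probe(String host, int port, int connectTimeoutMs, int readTimeoutMs) {
        Socket socket = new Socket();
        try {
            try {
                // Fails fast on refusal, unknown host, or after connectTimeoutMs.
                socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            } catch (IOException e) {
                return Result.CONNECT_FAILED;
            }
            try {
                // Wait at most readTimeoutMs for a first byte from the far end.
                socket.setSoTimeout(readTimeoutMs);
                int firstByte = socket.getInputStream().read();
                return firstByte == -1 ? Result.CLOSED_BY_PEER : Result.DATA_RECEIVED;
            } catch (SocketTimeoutException e) {
                // TCP is up, but nothing arrived within the read timeout.
                return Result.CONNECTED_BUT_SILENT;
            } catch (IOException e) {
                // For example a connection reset after the handshake.
                return Result.CLOSED_BY_PEER;
            }
        } finally {
            try {
                socket.close();
            } catch (IOException ignored) {
                // Nothing useful to do if closing fails.
            }
        }
    }

    public static void main(String[] args) {
        // host2:4001 is the second node from the JNDI URL quoted earlier in the thread.
        System.out.println(probe("host2", 4001, 3000, 3000));
    }
}
```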
I have to return to this problem. I thought it was solved for good, but today we had another event where the reconnection took more than a minute. What I see in the log is that it tried to connect to the inactive node and, again, waited 60 seconds before trying the active node. Why?
Please see the attached log (reconnect_log2.txt) and notice the 1 minute gap between 06:17:33 and 06:18:36.