This is an issue that has been affecting us for years now, across many versions of SwiftMQ.
From time to time, with no real recognizable pattern, starting the second HA instance kills the entire MQ cluster. I can't really be the only one having these difficulties, as we are running three separate HA instances and it can occur on each one.
What happens is the following.
- The second instance is started while the first is running as standalone. Please view the console output of the second instance.
This would be fine, if the first instance didn't see it differently:
+++ High Availability State: INITIALIZE/STANDALONE
+++ High Availability State: NEGOTIATE/STANDALONE
Once you are in this state, you can wait forever. Stopping the second instance leaves the first in this state:
+++ High Availability State: STANDALONE/STANDALONE
But the router is practically dead. You need to stop it with a kill; halting it in the Explorer has no effect.
After that, start the router again; it will check the queue store, as it didn't shut down properly. Now start the second instance, and you will most likely face the whole problem from the beginning again. I usually wait an hour or two, running in standalone mode, and then try to start the second instance, which usually works fine.
Very annoying, and I assume at least a slight risk to store integrity, even if the system outage is short and usually no problem for the connected clients, as they just reconnect.
OK, if that ever happens again (it surely will, but we don't stop HA instances that often), I will produce the dump. Why doesn't the STANDALONE instance work properly after the start of the second is aborted? Can't it unfreeze all of the tasks?
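For anyone else who needs to produce such a dump: externally it can be taken with jstack &lt;pid&gt; (or kill -3 &lt;pid&gt; on Unix, which writes it to stdout). The same information can also be captured from inside the JVM; a minimal sketch (the class name is mine, not part of SwiftMQ):

```java
import java.util.Map;

public class ThreadDump {
    // Build a jstack-style dump of every live thread in this JVM.
    public static String dump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(dump());
    }
}
```

The dumped stack frames look exactly like the excerpt quoted below ("at com.swiftmq...").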
at com.swiftmq.extension.snmp.agent.SwiftletMQAgent.startAgent(Unknown Source)
at com.swiftmq.extension.snmp.SNMPSwiftlet.b(Unknown Source)
at com.swiftmq.extension.snmp.SNMPSwiftlet.performTimeAction(Unknown Source)
at com.swiftmq.impl.timer.standard.d.run(Unknown Source)
at com.swiftmq.impl.threadpool.standard.PoolThread.run(Unknown Source)
There is a configurable agent-startup-delay (10 seconds by default) which ensures that all Swiftlets have been loaded after router start. The HA negotiation cuts right into that and causes a deadlock. You can work around it by setting agent-startup-delay="120" so that the HA sync is finished before the SNMP agent is started.
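For illustration, the attribute goes on the SNMP Swiftlet's entry in routerconfig.xml; the element and swiftlet name shown here are assumptions, so keep your existing entry and only add or raise the attribute:

```xml
<!-- SNMP extension swiftlet entry in routerconfig.xml (name is an assumption);
     raise agent-startup-delay so the HA sync finishes before the agent starts -->
<swiftlet name="xt$snmp" agent-startup-delay="120"/>
```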
This is a bug and will be fixed in the next release.
It's not the first time the SNMP Swiftlet has caused a little trouble. It's hard to reproduce, but from time to time after a restart (which is necessary if you want to monitor new queues and have the OID tree rebuilt), it takes way too long to work properly again, with our monitoring system only being able to retrieve data from a few queues; over time, more and more appear. It can also hang when being disabled, which then ultimately requires a failover of the router. All of this is hard to reproduce, but if it hangs on disabling again, I can provide another thread dump if that would be of help!
Anyway, HA negotiation failed again, but most annoyingly, after I finally had the former standby instance running in STANDALONE (after I had to completely restart the router to resolve the issue of failed thread freezes), the JavaMail bridges were not lost, but their configuration was incomplete.
It took me some time to figure out the cause: the bridges were defined, but all outbound mail was dumped to the error queue with NullPointerExceptions, because the "Default Header" settings had all been lost. I had to copy over a routerconfig.xml backup and then somehow try to figure out which queue the mails belonged to; at least I was too stupid to get a working selector for that, although it would have been incredibly handy.
It seems you're using an old client here, as this was fixed a long time ago.
As far as the SNMP Swiftlet is concerned: the lockup during HA negotiation is already fixed. We have replaced the intra-VM connection with an internal (direct) CLI interface, and this works fine. It will be part of 9.7.0, to be released in early January.