Error in HA negotiation

classic Classic list List threaded Threaded
18 messages Options
Reply | Threaded
Open this post in threaded view
|

Error in HA negotiation

TheQL
This post was updated on .
Hi there,

this is a matter affecting us for years now, over many versions of SwiftMQ.

From time to time, there is no real recognizable pattern, starting the second HA instance kills the entire MQ cluster. I can't really be the only one having these difficulties, as we are running three seperate HA instances and it can occur on each one.

What happens is the following.

- starting of the second instance, first running as standalone. Please view the console output of the second instance
SwiftMQ 9.4.1 Production is ready.
+++ High Availability State: UNKNOWN/ACTIVE
+++ High Availability State: INITIALIZE/ACTIVE
+++ High Availability State: NEGOTIATE/ACTIVE
... standby: Authentication Swiftlet
... standby: Store Swiftlet (HA)
... standby: Queue Manager Swiftlet (HA)
... standby: Topic Manager Swiftlet
... standby: Accounting Swiftlet
... standby: Management Swiftlet
... standby: XA Resource Manager Swiftlet
... standby: Routing Swiftlet (Unlimited Connections)
... standby: JNDI Swiftlet
... standby: JMS Swiftlet (XAASF)
... standby: AMQP Swiftlet
... standby: Deploy Swiftlet
... standby: JMS Application Container Swiftlet
... standby: Scheduler Swiftlet
... standby: Monitor Swiftlet (HA)
... standby: FileCache Swiftlet (HA)
+++ High Availability State: STANDBY-SYNC-PREPARE/STANDBY-SYNC-PREPARE

This would be fine, if not the first instance saw this differently
+++ High Availability State: INITIALIZE/STANDALONE
+++ High Availability State: NEGOTIATE/STANDALONE

Once you are in this state you can wait forever. Stopping the second instance returns the first in this state:
+++ High Availability State: STANDALONE/STANDALONE

But the router is practically dead. You need to stop it by triggering kill, halting in Explorer has no effect.
After that start the router again, it will check the queue store, as it didn't shut down properly. Now start the second instance and you will most likely just have the whole problem from the beginning. I usually wait an hour or two, running in standalone mode, then try to start the second instance, which usually then works fine.

Very annoying and I assume at least a slight risk for store integrity, even if the system outage is only short and usually no problem for the connected clients, as they just reconnect.

Thanks for any ideas and a fix to that!
Reply | Threaded
Open this post in threaded view
|

Re: Error in HA negotiation

IIT Software
Administrator
This is from the info.log of a HA instance which is in STANDALONE mode (your "first" instance):

2012-07-23 12:36:03.665/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=NEGOTIATE, lastSavedState=STANDALONE/High Availability State: NEGOTIATE/STANDALONE
2012-07-23 12:36:03.675/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: mgmt
2012-07-23 12:36:03.676/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: mgmt
2012-07-23 12:36:03.676/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: default
2012-07-23 12:36:03.676/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: default
2012-07-23 12:36:03.676/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: jms.session
2012-07-23 12:36:03.677/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: jms.session
2012-07-23 12:36:03.677/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: jndi
2012-07-23 12:36:03.677/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: jndi
2012-07-23 12:36:03.677/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: queue.cluster
2012-07-23 12:36:03.680/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: queue.cluster
2012-07-23 12:36:03.710/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: queue.redispatcher
2012-07-23 12:36:03.720/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: queue.redispatcher
2012-07-23 12:36:03.720/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: queue.timeout
2012-07-23 12:36:03.720/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: queue.timeout
2012-07-23 12:36:03.720/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: routing.exchanger
2012-07-23 12:36:03.722/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: routing.exchanger
2012-07-23 12:36:03.722/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: routing.scheduler
2012-07-23 12:36:03.723/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: routing.scheduler
2012-07-23 12:36:03.723/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: routing.service
2012-07-23 12:36:03.723/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: routing.service
2012-07-23 12:36:03.723/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: store.log
2012-07-23 12:36:03.723/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: store.log
2012-07-23 12:36:03.723/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: timer.tasks
2012-07-23 12:36:03.723/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: timer.tasks
2012-07-23 12:36:03.724/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: topic
2012-07-23 12:36:03.724/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/High Availability State: ACTIVE-SYNC-PREPARE/ACTIVE-SYNC-PREPARE


I think yours hangs in freezing a thread pool. Can you please check which "free pool" entry you see as the last entry?
Reply | Threaded
Open this post in threaded view
|

Re: Error in HA negotiation

TheQL
Hi,

ah, interesting, I didn't know that. Here is the output of the STANDALONE instance.

2014-02-04 15:01:04.816/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=NEGOTIATE, lastSavedState=STANDALONE/High Availability State: NEGOTIATE/STANDALONE
2014-02-04 15:01:04.856/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: mgmt
2014-02-04 15:01:04.857/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: mgmt
2014-02-04 15:01:04.857/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: default
2014-02-04 15:01:04.858/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: default
2014-02-04 15:01:04.858/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: jms.session
2014-02-04 15:01:04.860/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: jms.session
2014-02-04 15:01:04.860/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: jndi
2014-02-04 15:01:04.860/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: jndi
2014-02-04 15:01:04.860/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: queue.cluster
2014-02-04 15:01:04.861/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: queue.cluster
2014-02-04 15:01:04.861/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: queue.redispatcher
2014-02-04 15:01:04.862/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: queue.redispatcher
2014-02-04 15:01:04.862/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: queue.timeout
2014-02-04 15:01:04.863/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: queue.timeout
2014-02-04 15:01:04.864/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: routing.exchanger
2014-02-04 15:01:04.864/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: routing.exchanger
2014-02-04 15:01:04.864/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: routing.scheduler
2014-02-04 15:01:04.865/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: routing.scheduler
2014-02-04 15:01:04.865/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: routing.service
2014-02-04 15:01:04.866/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: routing.service
2014-02-04 15:01:04.867/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: store.log
2014-02-04 15:01:04.868/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freezed: store.log
2014-02-04 15:01:04.868/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=ACTIVE-SYNC-PREPARE, lastSavedState=ACTIVE-SYNC-PREPARE/freeze pool: timer.tasks
2014-02-04 15:01:11.361/[BlockingTCPListener, swiftlet=sys$jms, port=4001]/Listener/INFORMATION/connection accepted: someIP (edited)
2014-02-04 15:01:28.445/sys$hacontroller/INFORMATION/StateRegistry, preferredActive=false, currentState=STANDALONE, lastSavedState=STANDALONE/High Availability State: STANDALONE/STANDALONE
2014-02-04 15:01:28.445/sys$store/INFORMATION/LogManagerFactoryImpl/set force sync for the transaction log to: true

I included the part after I already stopped the second instance. As I said, it seems to resume operation but is mainly dead.
Reply | Threaded
Open this post in threaded view
|

Re: Error in HA negotiation

IIT Software
Administrator
It waits freezing the timer.task pool so obviously there is a timer that doesn't finish. To analyze it I would need a thread dump. Can you produce it if that happens again (kill -3)?
Reply | Threaded
Open this post in threaded view
|

Re: Error in HA negotiation

TheQL
Ok, if that ever happens again (it sure will, but we don't stop any HA instances that often), I will produce the dump. Why doesn't the STANDALONE instance work properly after the start of the second is aborted? Can't it unfreeze all of the tasks?
Reply | Threaded
Open this post in threaded view
|

Re: Error in HA negotiation

IIT Software
Administrator
A freeze must always go through. We haven't seen this before and so are unable to reproduce it easily. But we can try if you send your routerconfig files to bugreport@swiftmq.com
Reply | Threaded
Open this post in threaded view
|

Re: Error in HA negotiation

TheQL
Well, not always what must happen is what really happens ;)

Anyway, had no problems with a failover and rejoining of the HA cluster today. Will provide the thread dump as soon as it occurs and can then also provide you with our routerconfig.

Thanks for your help!
Reply | Threaded
Open this post in threaded view
|

Re: Error in HA negotiation

TheQL
In reply to this post by IIT Software
Finally this occured again yesterday after updating our HA routers to 9.6.0

Once on one HA setup, three times on the other. I can provide the logs and router config files. Shall I send all data to bugreport@swiftmq.com?
Reply | Threaded
Open this post in threaded view
|

Re: Error in HA negotiation

IIT Software
Administrator
Yes, please do do. Thanks.
Reply | Threaded
Open this post in threaded view
|

Re: Error in HA negotiation

IIT Software
Administrator
In reply to this post by TheQL
It seems to be the SNMP Swiftlet:

        at com.swiftmq.extension.snmp.agent.SwiftletMQAgent.startAgent(Unknown Source)
        at com.swiftmq.extension.snmp.SNMPSwiftlet.b(Unknown Source)
        at com.swiftmq.extension.snmp.SNMPSwiftlet.performTimeAction(Unknown Source)
        at com.swiftmq.impl.timer.standard.d.run(Unknown Source)
        at com.swiftmq.impl.threadpool.standard.PoolThread.run(Unknown Source)

There is configurable agent-startup-delay delay (10 sec by default) which is to ensure that all Swiftlets have been loaded after router start. The HA negotiation cut right into that and causes a dead lock. You can work around by setting the agent-startup-delay="120" so that the HA sync is finished before the SNMP agent is started.

This is a bug and will be fixed in the next release.
Reply | Threaded
Open this post in threaded view
|

Re: Error in HA negotiation

TheQL
Thanks, looking forward to that.

It's not the first time the SNMP swiftlet is a little trouble, it's hard to reproduce, but from time to time when restarted, which is necessary if you want to monitor new queues and have the OID tree rebuilt, it takes way too long work properly again, with our monitoring system only being able to retrieve data from a few queues. Over time more and more appear. Also it can hang when being disabled, which then ultimately requires a failover of the router. But all this is hard to reproduce. If it hangs again on disabling I can provide another thread dump if that would be of help!
Reply | Threaded
Open this post in threaded view
|

Re: Error in HA negotiation

IIT Software
Administrator
Not required, thanks. Please try to increase the startup delay and see if that helps.
Reply | Threaded
Open this post in threaded view
|

Re: Error in HA negotiation

TheQL
After a JMS bridging problem I had to fail over and once more the HA negotiation failed.
I have the agent-startup-delay of the SNMP swiftlet set to 120.

I saved the thread dumps and can provide them if necessary.
Reply | Threaded
Open this post in threaded view
|

Re: Error in HA negotiation

IIT Software
Administrator
Thanks. We will come back on you when we start to investigate it.
Reply | Threaded
Open this post in threaded view
|

Re: Error in HA negotiation

TheQL
Ok, I have saved a copy of the thread dumps until further notice.
Reply | Threaded
Open this post in threaded view
|

Re: Error in HA negotiation

TheQL
In reply to this post by IIT Software
This problem keeps bothering us.

On Friday on a router restart this happened again.
The restart was probably unnecessary, a colleague got irritated by these console messages:
null: req == null! Reply=[CommandReply [Reply, ok=true exception=null requestNum
ber=0 timeout=false ], result=null]
null: sem == null! Reply=[CommandReply [Reply, ok=true exception=null requestNum
ber=0 timeout=false ], result=null]

Anyway HA negotiation failed again, but most annoyingly after I finally had the former standby instance running in STANDALONE, after I had to completely restart the router to resolve the issue of failed thread freezes, the JavaMail bridges were not lost but config was incomplete.

It took me some time figure out the cause, as the bridges were defined, but all outbound mail was dumped to the error queue with NullPointerExceptions, because the "Default Header" settings were all lost. I had to copy over a routerconfig.xml backup and then somehow try to figure out which queue the mails belonged to, at least I was too stupid to get a working selector for that, although it would have been incredibly handy.
Reply | Threaded
Open this post in threaded view
|

Re: Error in HA negotiation

IIT Software
Administrator
TheQL wrote
null: req == null! Reply=[CommandReply [Reply, ok=true exception=null requestNum
ber=0 timeout=false ], result=null]
null: sem == null! Reply=[CommandReply [Reply, ok=true exception=null requestNum
ber=0 timeout=false ], result=null]
It seems you're using an old client here as this was fixed a long time ago.

As far as it concerns the SNMP Swiftlet: The lockup during HA negotiation is already fixed. We have replaced the intravm connection with an internal (direct) CLI interface and this works fine. This will be part of 9.7.0 to be released in early January.
Reply | Threaded
Open this post in threaded view
|

Re: Error in HA negotiation

TheQL
Thanks, I will check if any old client installations are around and am looking forward to 9.7.0. I hope it will resolve the issue, increasing the timeout has sadly been ineffective.