I figure this is hard to reproduce, because it has only happened twice here in over a year of running SwiftMQ 10.2.0. But since it happened twice and not just once, I thought I'd let you know.
The first occurrence came after a double failover from instance 1 to 2 and back again, which seemingly worked fine. Afterwards we noticed broken messages consumed by the JavaMail Bridge Extension Swiftlet. Upon investigation I found that on one of our many bridges a single property translation was no longer configured. I then saved the config, ran a diff of routerconfig.xml against the backup version, and could verify that only this one translation was missing. I then copied the backup file over routerconfig.xml; the watchdog picked up the change, added the property, and everything was fine.
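For anyone who wants to repeat that comparison without eyeballing a text diff, here is a minimal sketch in Python. It only assumes that entities in the config are XML elements identified by their tag and a `name` attribute; the element and attribute names in the sample fragments are hypothetical and heavily simplified, not the real routerconfig.xml schema.

```python
import xml.etree.ElementTree as ET


def entity_paths(xml_text):
    """Return the set of element paths found in an XML document.

    Each path step is the element tag plus its `name` attribute, if present.
    """
    paths = set()

    def walk(elem, prefix):
        name = elem.get("name")
        step = f"{elem.tag}[{name}]" if name is not None else elem.tag
        path = f"{prefix}/{step}"
        paths.add(path)
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return paths


def missing_entities(backup_xml, current_xml):
    """Paths that exist in the backup but not in the current config."""
    return sorted(entity_paths(backup_xml) - entity_paths(current_xml))


# Hypothetical, simplified fragments for illustration only.
BACKUP = """
<router>
  <swiftlet name="example-extension">
    <bridge name="somebridge">
      <property-translation name="subject"/>
      <property-translation name="from"/>
    </bridge>
  </swiftlet>
</router>
"""

CURRENT = """
<router>
  <swiftlet name="example-extension">
    <bridge name="somebridge">
      <property-translation name="subject"/>
    </bridge>
  </swiftlet>
</router>
"""

if __name__ == "__main__":
    for path in missing_entities(BACKUP, CURRENT):
        print("missing:", path)
```

With real files you would read both documents from disk and pass their contents to `missing_entities`. Note that this compares structure only, so a changed attribute value on an existing entity would not show up.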
Today the same thing happened on a different HA cluster: a JMS Bridge had lost the "remote_to_local" bridge definition. I could easily verify this by performing the same steps as above.
2018-08-21 14:56:27.545/SwiftletManager/INFORMATION/ConfigfileWatchdog/performTimeAction/applyNewEntities, context=/xt$bridge/servers/somebridge/bridgings, entity added=copy remote to local
2018-08-21 14:57:27.725/SwiftletManager/INFORMATION/ConfigfileWatchdog/performTimeAction/applyNewEntities, context=/xt$bridge/servers/somebridge/bridgings, entity added=copy remote to local
2018-08-21 14:58:27.841/SwiftletManager/INFORMATION/ConfigfileWatchdog/performTimeAction/applyNewEntities, context=/xt$bridge/servers/somebridge/bridgings, entity added=copy remote to local
This time the watchdog did attempt to create the bridge, but it evidently failed, because it kept retrying every minute.
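A repeating "entity added" entry like this can be spotted automatically. The following is a rough Python sketch that counts how often the watchdog reports adding the same entity in the same context. The regular expression is derived only from the three log lines quoted above, so it is an assumption that other watchdog entries follow the same layout; the function name and threshold are made up for this example.

```python
import re
from collections import Counter

# Pattern derived from the watchdog lines quoted above (assumed layout).
WATCHDOG_LINE = re.compile(
    r"^(?P<ts>[^/]+)/SwiftletManager/INFORMATION/ConfigfileWatchdog/"
    r"performTimeAction/applyNewEntities, "
    r"context=(?P<context>[^,]+), entity added=(?P<entity>.+)$"
)


def repeated_additions(lines, threshold=2):
    """Return {(context, entity): count} for entities added `threshold`+ times."""
    counts = Counter()
    for line in lines:
        match = WATCHDOG_LINE.match(line.strip())
        if match:
            counts[(match.group("context"), match.group("entity"))] += 1
    return {key: n for key, n in counts.items() if n >= threshold}


LOG = [
    "2018-08-21 14:56:27.545/SwiftletManager/INFORMATION/ConfigfileWatchdog/"
    "performTimeAction/applyNewEntities, "
    "context=/xt$bridge/servers/somebridge/bridgings, "
    "entity added=copy remote to local",
    "2018-08-21 14:57:27.725/SwiftletManager/INFORMATION/ConfigfileWatchdog/"
    "performTimeAction/applyNewEntities, "
    "context=/xt$bridge/servers/somebridge/bridgings, "
    "entity added=copy remote to local",
    "2018-08-21 14:58:27.841/SwiftletManager/INFORMATION/ConfigfileWatchdog/"
    "performTimeAction/applyNewEntities, "
    "context=/xt$bridge/servers/somebridge/bridgings, "
    "entity added=copy remote to local",
]

if __name__ == "__main__":
    for (context, entity), n in repeated_additions(LOG).items():
        print(f"{entity!r} added {n} times in {context}")
```

An entity that is "added" on every watchdog run is a hint that applying it fails each time, which matches what happened here.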
The error.log revealed this:
While thinking about it, the NPE could have been the reason for the Bridge property to disappear on failover. But I don't believe there was an NPE on the first incident on the JavaMail Bridge. I could try to dig in our logs to verify, if you'd like.
It only affects Extension Swiftlets. The difference between those and Kernel Swiftlets is that Extension Swiftlets are loaded with a delay. They are loaded in a separate thread, and the delay depends on the check interval for the deploy space of the Deploy Swiftlet.
So if you have a failover:
- active replicates config to standby on initial connect
- but only for those Swiftlets that have been registered in the Management Tree
- Extension Swiftlets register in the tree upon load which takes place after a delay
- therefore the Extension Swiftlets load their config from the last saved state in the routerconfig of the standby
This is what I need to test. Maybe a simple auto-save after a config replication on an initial connect to the standby would solve that.
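To make the suspected sequence concrete, here is a toy model in Python. It is not SwiftMQ code and makes several assumptions that are not confirmed anywhere in this thread: that the replication on initial connect carries the whole active config, that the standby applies only the sections of Swiftlets already registered in its Management Tree, and that an auto-save would write the received config to the standby's routerconfig. All names and values are hypothetical.

```python
def standby_config_after_failover(active, standby_file, registered, auto_save=False):
    """Toy model of the config a standby ends up with after taking over.

    active:       dict, Swiftlet name -> config section on the active instance
    standby_file: dict, Swiftlet name -> section last saved on the standby
    registered:   Swiftlets registered in the standby's tree at replication time
    auto_save:    assumed fix, persist the replicated config on the standby
    """
    live = {}

    # Initial connect: only sections of registered Swiftlets are applied.
    for name in registered:
        if name in active:
            live[name] = active[name]

    if auto_save:
        # Assumed fix: the standby's saved config now mirrors the active one.
        standby_file = dict(active)

    # Extension Swiftlets load later and read the standby's saved config.
    for name, section in standby_file.items():
        live.setdefault(name, section)

    return live


# Hypothetical example values.
ACTIVE = {
    "sys$queuemanager": ["q1"],
    "xt$bridge": ["local_to_remote", "remote_to_local"],
}
STALE_STANDBY_FILE = {
    "sys$queuemanager": ["q1"],
    "xt$bridge": ["local_to_remote"],  # saved before remote_to_local existed
}
REGISTERED = {"sys$queuemanager"}  # the Extension Swiftlet is not loaded yet

if __name__ == "__main__":
    print(standby_config_after_failover(ACTIVE, STALE_STANDBY_FILE, REGISTERED))
    print(standby_config_after_failover(ACTIVE, STALE_STANDBY_FILE, REGISTERED,
                                        auto_save=True))
```

In this model the bridge definition is lost only when the standby's saved config is older than the active one, which is why saving every change on both instances would avoid the problem.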
Meanwhile try to save every change to active and standby. If standby is not running and you change the config, you might run into this issue, I guess.
I have created a job to fix this. It will probably be fixed in the next release.
Anyway, usually we save the config pretty often and hardly ever is the standby instance unavailable during that. As a matter of fact the JavaMail Bridge that initially had this issue is unchanged for quite a while... Nevertheless your fixes will be an improvement, I believe.
This just happened again. I have made another observation, though: this time there was only one failover from active to standby, which gave me another perspective on the issue.
Although the standby config was saved on the same date as the active one, the property that disappeared was already missing in the saved config on the standby. I can't explain why this was the case, and especially why we now encounter this issue so regularly when it was never an issue in the past. Just wanted to let you know.