Router lost Extension Swiftlet config on Failover

TheQL
We recently updated from 9.6.0 to 9.7.3, and while I was experimenting with failovers a little, something curious happened.

I stopped the active instance by issuing a reboot, which does in fact first send a kill to the SwiftMQ process, but I did not initiate the halt via Explorer. The standby instance then did the following:

... resume: Authentication Swiftlet
... resume: Store Swiftlet (HA)
... resume: Queue Manager Swiftlet (HA)
... resume: Topic Manager Swiftlet
... resume: Accounting Swiftlet
... resume: Management Swiftlet
... resume: XA Resource Manager Swiftlet
... resume: Routing Swiftlet (Unlimited Connections)
... resume: JNDI Swiftlet
... resume: JMS Swiftlet (XAASF)
... resume: AMQP Swiftlet
... resume: Deploy Swiftlet
... resume: JMS Application Container Swiftlet
... startup: JavaMail Bridge Extension Swiftlet
... startup: SNMP Management Extension Swiftlet
... resume: Monitor Swiftlet (HA)
... resume: FileCache Swiftlet (HA)
+++ High Availability State: STANDALONE/STANDALONE
... shutdown: JavaMail Bridge Extension Swiftlet
... shutdown: SNMP Management Extension Swiftlet
... shutdown: JMS Bridge Extension Swiftlet
... startup: JavaMail Bridge Extension Swiftlet
... startup: SNMP Management Extension Swiftlet
... startup: JMS Bridge Extension Swiftlet

Now this was odd. Why did it stop and then start the swiftlets again? It also took some time until they were started again; it was not an instant restart. Afterwards I noticed our monitoring was broken, so I checked with Explorer and found the SNMP swiftlet disabled and its config lost. I could verify that on disk in routerconfig.xml:

$ diff routerconfig.xml routerconfig.xml.20160831210515726
827,829c827,831
<   <swiftlet name="xt$snmp">
<     <agent>
<       <communities/>
---
>   <swiftlet name="xt$snmp" enabled="true">
>     <agent agent-startup-delay="120">
>       <communities>
>         <community name="public" security-name="public"/>
>       </communities>

I never saved the config in this "broken" state. I restarted the first instance and switched back to it. It still has the config on disk in routerconfig.xml, but because the live config was synced when HA was re-established, the active config is now also without SNMP. The same may have happened to the JMS Bridge Swiftlet, but luckily this instance does not currently have any JMS bridges. The JavaMail bridge kept its configuration.
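
A comparison like the diff above can be automated per swiftlet. The following is a minimal sketch, not a SwiftMQ tool: it assumes that each swiftlet is a `swiftlet` element with a `name` attribute (as the diff shows), and the `<router>` root element and the sample content are made up for illustration. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Illustrative fragments only; the <router> root is an assumption.
BACKUP = """<router>
  <swiftlet name="xt$javamail"/>
  <swiftlet name="xt$snmp" enabled="true">
    <agent agent-startup-delay="120">
      <communities>
        <community name="public" security-name="public"/>
      </communities>
    </agent>
  </swiftlet>
</router>"""

CURRENT = """<router>
  <swiftlet name="xt$javamail"/>
  <swiftlet name="xt$snmp">
    <agent>
      <communities/>
    </agent>
  </swiftlet>
</router>"""


def fingerprint(element):
    """Tag, sorted attributes and recursively fingerprinted children.

    Whitespace between elements is ignored on purpose.
    """
    return (
        element.tag,
        tuple(sorted(element.attrib.items())),
        tuple(fingerprint(child) for child in element),
    )


def changed_swiftlets(current_xml, backup_xml):
    """Names of swiftlets whose config differs from (or is missing in) current."""
    current = {s.get("name"): fingerprint(s)
               for s in ET.fromstring(current_xml).iter("swiftlet")}
    backup = {s.get("name"): fingerprint(s)
              for s in ET.fromstring(backup_xml).iter("swiftlet")}
    return sorted(name for name in backup if current.get(name) != backup[name])


print(changed_swiftlets(CURRENT, BACKUP))  # ['xt$snmp']
```

Run against the real files (read with `open(...).read()`), this would list every swiftlet whose section changed between the timestamped backup and the live routerconfig.xml.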

Re: Router lost Extension Swiftlet config on Failover

TheQL
I was thinking about the startup delay, which we increased in the past due to problems with the swiftlet. If the 120 seconds had not yet passed and the swiftlet had not been started, maybe the config would not be synced during failover. I am pretty sure this was not the case, as the active instance had been up for far longer than 120 seconds. Still, if what I described is possible, it probably shouldn't be.

Re: Router lost Extension Swiftlet config on Failover

IIT Software
Administrator
The reason why it stopped and started the Extension Swiftlets is certainly the upgrade. Extension Swiftlets are hot deployed. When the instance starts, it starts the current version. Then it gets notified that there is a new version of the Swiftlet available and stops the current version. Last step is to start the new version. So that is ok.

Why it lost the config is not clear to me. It can happen if there is a failure during startup. But this is logged to the log files. Please have a look.

Re: Router lost Extension Swiftlet config on Failover

TheQL
The instance the failover occurred to was the one I updated first, so it had already been running as standalone before. There should have been no need to perform any further update tasks.

My update path was: update the standby, bring it back up, fail over, update the former active, start it. Then I switched back to the first instance, re-establishing HA. The reason for the next failover, with the incident described here, was simply that I had to reboot the active instance. Both instances had already been running before.

Anyway, there was an entry to the error.log:

2016-08-31 21:04:15.237/sys$deploy/ERROR/DeploySpaceImpl, name=extension-swiftlets/performTimeAction, xt$javamail, exception: java.lang.Exception: Unknown command, removing bundle. Correct the error and deploy again!
2016-08-31 21:04:15.395/sys$deploy/ERROR/DeploySpaceImpl, name=extension-swiftlets/performTimeAction, xt$snmp, exception: java.lang.Exception: Unknown command, removing bundle. Correct the error and deploy again!
2016-08-31 21:04:15.512/sys$deploy/ERROR/DeploySpaceImpl, name=extension-swiftlets/performTimeAction, xt$bridge, exception: java.lang.Exception: Unknown command, removing bundle. Correct the error and deploy again!

Don't get confused by the late timestamp, this is HKT.
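
To see at a glance which bundles were removed, the error.log entries above can be parsed. This is a rough sketch whose regular expression is derived only from the three lines quoted in this post; other error.log entries may look different. The function name `failed_bundles` is hypothetical.

```python
import re

# Pattern inferred from the quoted lines only:
# timestamp/swiftlet/level/DeploySpaceImpl, name=<space>/performTimeAction, <bundle>, exception: <message>
DEPLOY_ERROR = re.compile(
    r"^(?P<ts>[^/]+)/(?P<source>[^/]+)/(?P<level>[^/]+)/DeploySpaceImpl, "
    r"name=(?P<space>[^/]+)/performTimeAction, (?P<bundle>[^,]+), "
    r"exception: (?P<message>.*)$"
)

LOG_LINES = [
    "2016-08-31 21:04:15.237/sys$deploy/ERROR/DeploySpaceImpl, name=extension-swiftlets/performTimeAction, xt$javamail, exception: java.lang.Exception: Unknown command, removing bundle. Correct the error and deploy again!",
    "2016-08-31 21:04:15.395/sys$deploy/ERROR/DeploySpaceImpl, name=extension-swiftlets/performTimeAction, xt$snmp, exception: java.lang.Exception: Unknown command, removing bundle. Correct the error and deploy again!",
    "2016-08-31 21:04:15.512/sys$deploy/ERROR/DeploySpaceImpl, name=extension-swiftlets/performTimeAction, xt$bridge, exception: java.lang.Exception: Unknown command, removing bundle. Correct the error and deploy again!",
]


def failed_bundles(lines):
    """Return (timestamp, bundle) pairs for deploy errors that removed a bundle."""
    hits = []
    for line in lines:
        match = DEPLOY_ERROR.match(line.strip())
        if match and match["level"] == "ERROR" and "removing bundle" in match["message"]:
            hits.append((match["ts"], match["bundle"]))
    return hits


for ts, bundle in failed_bundles(LOG_LINES):
    print(ts, bundle)
```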

Re: Router lost Extension Swiftlet config on Failover

IIT Software
Administrator
So you have upgraded the standby with router 9.7.3 and swiftlets 9.7.3? How did you update the swiftlets, what did you copy where?

Re: Router lost Extension Swiftlet config on Failover

TheQL
It was a fresh setup on entirely new servers. So all swiftlets were copied directly from the distribution, retaining only the routerconfig.xml during the rolling upgrade. Anyway, the server was running fine as standalone, active and standby. Only after the active was rebooted and a failover happened was the config lost.

Re: Router lost Extension Swiftlet config on Failover

IIT Software
Administrator
Ok, this needs to be tested and fixed. Thanks.

Re: Router lost Extension Swiftlet config on Failover

TheQL
Thanks! Probably hard to reproduce, but as I have updated 3 HA instances in the last few days I will report if something like this happens again when a failover becomes necessary over the next weeks.

Re: Router lost Extension Swiftlet config on Failover

TheQL
In reply to this post by IIT Software
So, this happened again today.

I changed a custom .jar file containing some transformers and JMS apps, then just restarted the router and performed a failover. During this failover the SNMP, JavaMail and JMS Bridge swiftlets were once again undeployed and then re-deployed themselves, and the JMS Bridge and SNMP swiftlets lost their config once again. JavaMail survived.

Re: Router lost Extension Swiftlet config on Failover

IIT Software
Administrator
Thanks. We will check it in the course of this job fix.

Re: Router lost Extension Swiftlet config on Failover

TheQL
For your information, all hot deployed JMS apps also do this during the failover:

... resume: Deploy Swiftlet
... resume: JMS Application Container Swiftlet
+++ Hot Deployed JMS Application 'someapp' started
... resume: Scheduler Swiftlet
[...]
+++ Hot Deployed JMS Application 'someapp' stopped
+++ Hot Deployed JMS Application 'someapp' started

A bit of additional explanation of what we are doing: we have packaged the entire SwiftMQ release in our own custom RPM. This also contains our own JMS apps, transformers, etc.
So when I want to update one of these apps, I create a new RPM and deploy it. This probably touches all SwiftMQ release files but leaves the config, all _deployedXXXXX and the local DB files alone. Anyway, most files will be touched by this. This is when all of the above happens, and it seems to happen each time I do this. As I then have to stop the router, copy over the previously running routerconfig.xml and restart it, it sort of destroys the rolling update process, because I cannot avoid a short downtime while I recover the config. And it happens twice, once on each instance, during the first failover to that instance after the update.
But all I do to fix it is copy back the previous config and restart the router, and it works fine.

Re: Router lost Extension Swiftlet config on Failover

IIT Software
Administrator
Aha. If any of the files of a deployment is newer than the corresponding _deployed folder, the Swiftlet is redeployed. This is what you see.
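
The rule described here can be checked before a failover with a small script. This is only a sketch of the stated rule (a file newer than the `_deployed` folder triggers a redeploy), not SwiftMQ's actual implementation. The directory layout, the file name `snmp.jar` and the function name `needs_redeploy` are assumptions for illustration.

```python
import os
import tempfile
from pathlib import Path


def needs_redeploy(bundle_dir, deployed_dir):
    """True if any regular file in bundle_dir is newer than deployed_dir."""
    deployed_mtime = Path(deployed_dir).stat().st_mtime
    return any(
        entry.stat().st_mtime > deployed_mtime
        for entry in Path(bundle_dir).iterdir()
        if entry.is_file()
    )


# Demo in a throwaway directory; names below are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    bundle = Path(tmp) / "xt$snmp"
    deployed = bundle / "_deployed"
    deployed.mkdir(parents=True)
    jar = bundle / "snmp.jar"
    jar.write_text("placeholder")

    os.utime(deployed, (1000, 1000))  # deployed folder timestamp
    os.utime(jar, (900, 900))         # bundle file is older
    before_update = needs_redeploy(bundle, deployed)

    os.utime(jar, (2000, 2000))       # e.g. a package update touched the file
    after_update = needs_redeploy(bundle, deployed)

print(before_update, after_update)
```

A package update that rewrites the release files gives them fresh timestamps, which is why every extension swiftlet appears newer than its deployed copy afterwards.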

Re: Router lost Extension Swiftlet config on Failover

TheQL
Ok, thanks for the info. I was able to avoid this issue by removing the _deployed folder before the failover. It does not answer how the config gets lost though. Usually the JMS Swiftlet and the SNMP Swiftlet lose the config, JavaMail comes back up with an intact config.

Re: Router lost Extension Swiftlet config on Failover

IIT Software
Administrator
The config seems to get lost when the CLI commands inside the config.xml are executed (have a look). For example, if thread pools are created there and they are already defined in the routerconfig.xml (e.g. because an old config file is used), the deployment gets an error and deletes the Swiftlet config. A new deployment expects that all CLI commands can be executed. The same goes for undeployment: all undeployment CLI commands must be executed without errors.

Re: Router lost Extension Swiftlet config on Failover

TheQL
I can understand that, but why should the redeployment of the SNMP swiftlet fail? Is this caused by the startup delay? And what about the JMS swiftlet? Maybe if the JMS bridges weren't yet successfully connected it cannot be stopped correctly?

Re: Router lost Extension Swiftlet config on Failover

IIT Software
Administrator
Can you check the error.log whether the undeployment or the deployment fails?

Re: Router lost Extension Swiftlet config on Failover

TheQL
This happened:

2016-09-27 20:04:41.935/sys$deploy/ERROR/DeploySpaceImpl, name=extension-swiftlets/performTimeAction, xt$javamail, exception: java.lang.Exception: Unknown command, removing bundle. Correct the error and deploy again!
2016-09-27 20:04:41.985/sys$deploy/ERROR/DeploySpaceImpl, name=extension-swiftlets/performTimeAction, xt$snmp, exception: java.lang.Exception: Unknown command, removing bundle. Correct the error and deploy again!
2016-09-27 20:04:42.50/sys$deploy/ERROR/DeploySpaceImpl, name=extension-swiftlets/performTimeAction, xt$bridge, exception: java.lang.Exception: Unknown command, removing bundle. Correct the error and deploy again!

Then I stop the router, take the same old config, start the router and no problem.

Re: Router lost Extension Swiftlet config on Failover

IIT Software
Administrator
Seems to be the undeployment. Please check the CLI commands for undeployment in the SNMP config file against your old routerconfig to see whether all of them can be executed properly (all entities that will be removed must be defined).

Re: Router lost Extension Swiftlet config on Failover

TheQL
Hi again.

Well, I would have assumed this would work, as I can easily undeploy the SNMP swiftlet via Explorer by unchecking the box. I suppose you wanted me to run the commands from this passage:

    <after-remove>
      cc /sys$threadpool/pools
      delete snmp
    </after-remove>
  </cli>

So I did!

sr router1
router1> cc /sys$threadpool/pools
router1/sys$threadpool/pools> delete snmp
Unknown Entity: snmp
router1/sys$threadpool/pools>  lc
Entity List: Pools
Description: Threadpool Definitions

Number of Entities in this List: 33
--------------------------------
accounting.connections
accounting.events
amqp.connection
amqp.session
filecache.request
filecache.session
hacontroller.inbounddispatcher
hacontroller.outbounddispatcher
hacontroller.stagecontroller
hacontroller.timer
jac.runner
jms.connection
jms.ivm.client.connection
jms.ivm.client.session
jms.session
jndi
mgmt
net.connection
net.connection.mgr
queue.cluster
queue.redispatcher
queue.timeout
routing.connection.mgr
routing.exchanger
routing.scheduler
routing.service
routing.throttle
scheduler.job
scheduler.system
store.log
timer.dispatcher
timer.tasks
topic

So I tried undeploying via Explorer, this seems to have worked, the box disappears and I can no longer query the router via SNMP. Anyway, the console does not show that the swiftlet was undeployed or redeployed later on. But maybe I did something entirely wrong and unrelated.

Re: Router lost Extension Swiftlet config on Failover

IIT Software
Administrator
You can't undeploy a Swiftlet via Explorer. You can only disable it.

Anyway, the problem certainly lies in the way you do your updates. A Swiftlet deployment/undeployment is always consistent. That is, it creates the resources in section before-install and removes them in after-remove. You proved above that the pool snmp didn't exist, so the undeployment failed. The problem is in your routerconfig from the RPM. You may have copied the snmp config part but not the resources.

This is the problem with "customized" config processes. It is not how it is intended.
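
The consistency requirement can be checked offline before an update. The sketch below is a simplified model, not the SwiftMQ CLI: it understands only the two commands that appear in the quoted after-remove section (`cc` and `delete`), and the pool names would have to be collected from the routerconfig.xml or from an `lc` listing. The function name `missing_entities` is hypothetical.

```python
# after-remove commands as quoted from the SNMP swiftlet's config.xml
AFTER_REMOVE = """
cc /sys$threadpool/pools
delete snmp
"""

# Entities that exist per CLI context; a shortened excerpt of the lc listing.
EXISTING = {
    "/sys$threadpool/pools": {"jndi", "mgmt", "scheduler.job", "topic"},
}


def missing_entities(cli_script, existing):
    """Return (context, entity) pairs that a 'delete' would not find."""
    context = None
    missing = []
    for raw_line in cli_script.splitlines():
        parts = raw_line.split()
        if len(parts) != 2:
            continue  # blank lines and commands this sketch does not model
        command, argument = parts
        if command == "cc":
            context = argument
        elif command == "delete" and argument not in existing.get(context, set()):
            missing.append((context, argument))
    return missing


print(missing_entities(AFTER_REMOVE, EXISTING))
# [('/sys$threadpool/pools', 'snmp')]
```

An empty result would mean every entity the undeployment wants to delete is defined; a non-empty one points at the resources that were not carried over into the packaged routerconfig.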