Exception (or not) after time consuming store sync

TheQL
Currently we sadly have a rather large store (2.5GB) and HA sync takes close to 3 minutes to complete. So far so bad.

After sync is complete, I see this in the log of the active instance:

+++ High Availability State: ACTIVE-SYNC/ACTIVE-SYNC
+++ High Availability State: ACTIVE/ACTIVE
null: req == null! Reply=[CommandReply [Reply, ok=true exception=null requestNumber=0 timeout=false ], result=null]
null: sem == null! Reply=[CommandReply [Reply, ok=true exception=null requestNumber=0 timeout=false ], result=null]
null: req == null! Reply=[CommandReply [Reply, ok=true exception=null requestNumber=0 timeout=false ], result=routername
XYZ_HA]
null: sem == null! Reply=[CommandReply [Reply, ok=true exception=null requestNumber=0 timeout=false ], result=routername
XYZ_HA]

So, I understand this is not really a problem, and I recall you telling me this was fixed and can only occur if an old consumer/Explorer is connected. But the only Explorer connected during sync was mine, which is 9.7.2, with the router being 9.6.0. So where does this come from?

Since I scan the console output for the term "exception", this triggers an alert and is therefore a little irritating.

Re: Exception (or not) after time consuming store sync

IIT Software
Administrator
This is from a CLI command reply. The corresponding request doesn't exist anymore, probably because of a request timeout after 60 secs due to the long sync time. No problem.

Re: Exception (or not) after time consuming store sync

TheQL
Right, sync time is longer than 60 seconds. It would just be nice not to have this error/message in the logs. As I said, if you scan the logs for exceptions, these lines are not easy to separate out with matching patterns.
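
For anyone scanning console output the same way, one possible workaround is to drop the benign `exception=null` field before matching on the term "exception". The following is only a minimal sketch: the class and method names are made up for illustration, and it assumes the alerting is done line by line on the router's console output.

```java
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical helper, not part of SwiftMQ: decides whether a console
// line should raise an "exception" alert.
final class ConsoleAlertFilter {
    // The CommandReply lines shown above carry "exception=null",
    // which is a field of the reply and not an error.
    private static final Pattern BENIGN =
            Pattern.compile("exception=null", Pattern.CASE_INSENSITIVE);

    static boolean shouldAlert(String line) {
        // Remove the benign field, then look for the term as before.
        String stripped = BENIGN.matcher(line).replaceAll("");
        return stripped.toLowerCase(Locale.ROOT).contains("exception");
    }

    public static void main(String[] args) {
        String reply = "null: req == null! Reply=[CommandReply [Reply, ok=true "
                + "exception=null requestNumber=0 timeout=false ], result=null]";
        String timeout = "com.swiftmq.tools.requestreply.TimeoutException: "
                + "Request time out (60000) ms!";
        System.out.println(shouldAlert(reply));
        System.out.println(shouldAlert(timeout));
    }
}
```

With this filter the `req == null!` / `sem == null!` reply lines no longer match, while a real `TimeoutException` line still does.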

Re: Exception (or not) after time consuming store sync

IIT Software
Administrator
This is a client-side exception. If it shows up on the router's stdout, it might be caused by a hot deployed app that uses CLI...

Re: Exception (or not) after time consuming store sync

TheQL
Thanks, that is a possibility and at least that explains it!

Re: Exception (or not) after time consuming store sync

TheQL
In reply to this post by IIT Software
Is there a way to expand the timeout value?

com.swiftmq.tools.requestreply.TimeoutException: Request time out (60000) ms!

Somehow these 60 seconds fly around a lot and I do believe they are currently causing us additional problems.
We perform a scheduled store backup every 2 hours, and currently there are many occurrences where the SNMP swiftlet just dies. It is currently dead, so if you are interested in a dump, I could create one.

The resulting effect is that we get these replies from SNMP queries:
No valid data returned (No Such Instance currently exists at this OID)

So, what I assume, and it might be totally wrong, is that the store backup takes a little longer than 60 seconds and from time to time the SNMP swiftlet does not recover from the freeze. It does not seem like a coincidence that the swiftlet is dead shortly after a backup took place. I have to admit that we are still using 9.6.0 in production. I know there was some kind of change to the SNMP swiftlet after that release to improve the way it communicates with the router.

Re: Exception (or not) after time consuming store sync

IIT Software
Administrator
The property is called swiftmq.request.timeout. For hot deploy apps you'd need that on the router's side. More here.
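
To illustrate why the flag has to be set on the JVM that actually runs the CLI client code: a `-D` option only becomes a system property of the process it was passed to. The sketch below is not SwiftMQ's own code. It just reads the property the way any Java code could, and the class name and the assumed default of 60000 ms (taken from the exception message in this thread) are assumptions.

```java
// Hypothetical probe, not part of SwiftMQ: shows which value of
// swiftmq.request.timeout is visible inside the current JVM.
final class RequestTimeoutProbe {
    static final String PROPERTY = "swiftmq.request.timeout";
    // Assumption: 60000 ms, as printed in the TimeoutException above.
    static final long ASSUMED_DEFAULT_MS = 60_000L;

    static long effectiveTimeoutMs() {
        // Long.getLong reads a system property and falls back to the
        // given default if it is unset or not a valid number.
        return Long.getLong(PROPERTY, ASSUMED_DEFAULT_MS);
    }

    public static void main(String[] args) {
        System.out.println(PROPERTY + " = " + effectiveTimeoutMs());
    }
}
```

Started with `java -Dswiftmq.request.timeout=180000 RequestTimeoutProbe.java`, this should print the raised value. Started without the flag, it falls back to the assumed default.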

Re: Exception (or not) after time consuming store sync

TheQL
Ok, thanks. So you don't see any connection to the problems with the SNMP swiftlet? It seems to lose only the queue OIDs; HA state and active instance name can still be queried.

And for hot deployed apps I would have to add -Dswiftmq.request.timeout=xxxx to the router start script? Does the same go for my Explorer start script, or would the Explorer also respect the setting from the router?

Re: Exception (or not) after time consuming store sync

IIT Software
Administrator
Please send over the dump you've mentioned above. Is that a thread dump?

Re: Exception (or not) after time consuming store sync

TheQL
Sorry, I didn't create a dump because I didn't know which kind you'd like. I will come back to you if this happens again. From current experience, that won't take more than a week.

Re: Exception (or not) after time consuming store sync

TheQL
In reply to this post by IIT Software
So, because I am stupid, I sent SIGTERM to the process instead of creating a thread dump. Anyway, as the SNMP swiftlet is more or less unresponsive, the router shutdown does not complete. It hangs while stopping the swiftlet. I have a thread dump of that state, if you are interested.

Re: Exception (or not) after time consuming store sync

IIT Software
Administrator
Yes, please send!

Re: Exception (or not) after time consuming store sync

TheQL
I will gladly create another, better dump next time. As I said, there is less than a week between two crashes now that the store is this large. I have planned to shrink it, but that's not that simple...

console.gz

Re: Exception (or not) after time consuming store sync

IIT Software
Administrator
Thank you, that was helpful.

It seems there is a deadlock. A management disconnect request was received from the router (because of the shutdown) which leads to a close of the management endpoint inside the intravm CLI connection used from the SNMP Swiftlet. So far so good. But - due to the shutdown at the same time, which is carried out from another thread (shutdown hook) - the intravm CLI connection was closed in parallel which then led to this situation. Definitely a bug which will be fixed for the next release.
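
For readers who want to check for this kind of situation themselves: a thread dump (for example from `jstack <pid>`) is the usual way, and the JVM can also be asked programmatically. The following is a generic sketch using the standard `java.lang.management` API. The class name is made up, it has to run inside the JVM being inspected, and it only finds deadlocks on monitors and ownable synchronizers, so it may not catch every kind of hang.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Hypothetical helper, not part of SwiftMQ: reports threads that the
// JVM itself considers deadlocked.
final class DeadlockReport {
    static String describe() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        // Returns null when no deadlocked threads are found.
        long[] ids = mx.findDeadlockedThreads();
        if (ids == null) {
            return "no deadlocked threads found";
        }
        StringBuilder sb = new StringBuilder();
        // Include locked monitors and synchronizers in the report.
        for (ThreadInfo info : mx.getThreadInfo(ids, true, true)) {
            if (info != null) {
                sb.append(info);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```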

Re: Exception (or not) after time consuming store sync

IIT Software
Administrator
In reply to this post by TheQL
You are using 9.6.0. This was already fixed in 9.7.0 (see changes). We have replaced the intravm CLI connection with a direct CLI interface due to the shutdown problem.

Re: Exception (or not) after time consuming store sync

TheQL
I had something like that in mind. Wasn't sure if this issue was also fixed by that. Thanks for the info.