Stale results on SNMP queries

classic Classic list List threaded Threaded
17 messages Options
Reply | Threaded
Open this post in threaded view
|

Stale results on SNMP queries

TheQL
Hello,

since we upgraded to 9.7.1 we have the issue that our monitoring system often gets stale results from SNMP queries. Values remain unchanged although in Explorer they have in fact changed. Querying manually mostly then gets a fresh result.
This is unexpected and new behavior, maybe somehow connected to the changes made to the SNMP swiftlet. Maybe if many request come at the same time, which is the case with many hundred checks that are all started at somewhat similar times every 5 minutes. This is of course just a guess.

It does affect for instance the queue count but also the last transfer time of JMS bridges, the latter causing false alarms when the monitoring assumes there hasn't been a transfer for hours, the first maybe missing problems with broken consumers if the SNMP result is always 0.

Anyway, we couldn't really find proof for our assumption, especially with manual querying usually working. But today we once again had the issue with last transfer time not updating, but this time it didn't update with manual querying. Checked in Explorer and last transfer time was in fact shown as hours ago there as well. Watching the queues that are part of the bridges we could verify that the bridges did in fact work and messages were produced and consumed, just the time was off, even after restarting the bridge. This has never happened before but it is another clue that there might be something wrong in SwiftMQ. After a failover everything went back to normal, although I had to issue a second kill on the ACTIVE process as the JMS swiftlet didn't stop properly.

I have some screenshots that "prove" what I am saying, also I have a thread dump of the ACTIVE process before I killed it if you are interested.
Reply | Threaded
Open this post in threaded view
|

Re: Stale results on SNMP queries

IIT Software
Administrator
We will verify why the last transfer time will not be updated and fix it for the next release if necessary.

Please send the thread dump.
Reply | Threaded
Open this post in threaded view
|

Re: Stale results on SNMP queries

TheQL
The fact that the last transfer time was not updated within the JMS swiftlet might just be a side concern, the main issue is about all SNMP queryable values, including queue counts.

Anyway: thread.dump

I did include some error log there, there are multiple exceptions regarding the JMS swiftlet caused by the connecting client, but only the last one seems to have caused trouble within the swiftlet.
Reply | Threaded
Open this post in threaded view
|

Re: Stale results on SNMP queries

IIT Software
Administrator
From what I see there is a bridge to WMQ that hangs in a lock:

Lock held by:

"SwiftMQ-bridge.server-10": inconsistent?, holding [0x000000070d0598a0, 0x000000070d064b78, 0x000000070d063820, 0x000000070d055a80, 0x0000000701ccca90, 0x0000000701ac6468]
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at com.ibm.mq.jmqi.remote.internal.system.ReentrantMutex.acquire(ReentrantMutex.java:92)
        at com.ibm.mq.jmqi.remote.internal.RemoteHconn.requestDispatchLock(RemoteHconn.java:585)
        at com.ibm.mq.jmqi.remote.internal.RemoteFAP.MQCTL(RemoteFAP.java:2114)
        at com.ibm.msg.client.wmq.internal.WMQConsumerOwnerShadow.suspendAsyncService(WMQConsumerOwnerShadow.java:698)
        at com.ibm.msg.client.wmq.internal.WMQSession.suspendAsyncService(WMQSession.java:1577)
        at com.ibm.msg.client.wmq.internal.WMQAsyncConsumerShadow.deregisterMessageListener(WMQAsyncConsumerShadow.java:729)
        at com.ibm.msg.client.wmq.internal.WMQAsyncConsumerShadow.setMessageListener(WMQAsyncConsumerShadow.java:882)
        at com.ibm.msg.client.wmq.internal.WMQMessageConsumer.setMessageListener(WMQMessageConsumer.java:493)
        at com.ibm.msg.client.jms.internal.JmsMessageConsumerImpl.setMessageListener(JmsMessageConsumerImpl.java:527)
        at com.ibm.mq.jms.MQMessageConsumer.setMessageListener(MQMessageConsumer.java:318)
        at com.swiftmq.extension.bridge.RemoteQueueBridgeSource.destroy(Unknown Source)
        at com.swiftmq.extension.bridge.DestinationBridge.d(Unknown Source)
        at com.swiftmq.extension.bridge.ServerBridge.b(Unknown Source)
        at com.swiftmq.extension.bridge.ServerBridge.b(Unknown Source)
        at com.swiftmq.extension.bridge.ServerBridge.f(Unknown Source)
        at com.swiftmq.extension.bridge.BridgeSwiftlet.a(Unknown Source)
        at com.swiftmq.extension.bridge.BridgeSwiftlet.a(Unknown Source)
        at com.swiftmq.extension.bridge.BridgeSwiftlet.b(Unknown Source)
        at com.swiftmq.extension.bridge.m.run(Unknown Source)
        at com.swiftmq.impl.threadpool.standard.PoolThread.run(Unknown Source)

And this one waits:

"DispatchThread: 201": waiting to acquire [0x000000070d064b78], holding [0x000000070d063810]
        at com.ibm.msg.client.wmq.internal.WMQConsumerShadow.close(WMQConsumerShadow.java:331)
        at com.ibm.msg.client.wmq.internal.WMQMessageConsumer.close(WMQMessageConsumer.java:379)
        at com.ibm.msg.client.jms.internal.JmsMessageConsumerImpl.emergencyClose(JmsMessageConsumerImpl.java:563)
        at com.ibm.msg.client.jms.internal.JmsMessageConsumerImpl.access$400(JmsMessageConsumerImpl.java:53)
        at com.ibm.msg.client.jms.internal.JmsMessageConsumerImpl$JmsProviderMessageListener.onMessage(JmsMessageConsumerImpl.java:980)
        at com.ibm.msg.client.wmq.internal.WMQAsyncConsumerShadow.honourNoLocal(WMQAsyncConsumerShadow.java:565)
        at com.ibm.msg.client.wmq.internal.WMQAsyncConsumerShadow.consumer(WMQAsyncConsumerShadow.java:399)
        at com.ibm.mq.jmqi.remote.internal.RemoteAsyncConsume.driveConsumer(RemoteAsyncConsume.java:1526)
        at com.ibm.mq.jmqi.remote.internal.RemoteDispatchThread.run(RemoteDispatchThread.java:394)
        at com.ibm.msg.client.commonservices.workqueue.WorkQueueItem.runTask(WorkQueueItem.java:209)
        at com.ibm.msg.client.commonservices.workqueue.SimpleWorkQueueItem.runItem(SimpleWorkQueueItem.java:100)
        at com.ibm.msg.client.commonservices.workqueue.WorkQueueItem.run(WorkQueueItem.java:224)
        at com.ibm.msg.client.commonservices.workqueue.WorkQueueManager.runWorkQueueItem(WorkQueueManager.java:298)
        at com.ibm.msg.client.commonservices.j2se.workqueue.WorkQueueManagerImplementation$ThreadPoolWorker.run(WorkQueueManagerImplementation.java:1220)

Reply | Threaded
Open this post in threaded view
|

Re: Stale results on SNMP queries

TheQL
In reply to this post by IIT Software
I have another addition. I just saw stale results again and opened the Explorer. Saw the same old times (age about an hour ago) there, too, although there is constant traffic on the bridge. And after a few seconds all times in Explorer got updated, next monitoring run was also ok afterwards.

Here is some debug output from our monitoring I am currently recording:

[ INFO  ] Node1: mq1 Node2: mq2 Community: public Oid: 1.3.6.1.4.1.16683.1.2.21.1.4.2.1.1.2.23.[..].73
[       ] warn if age > 900 seconds, critical if age > 1800 seconds
[ INFO  ] Active node: mq1
[ INFO  ] Last transfer time is Wed Sep 21 11:28:03 CEST 2016, in Unixtime: 1474450083, now is: 1474454196 - 4113 seconds offset
--
[ INFO  ] Node1: mq1 Node2: mq2 Community: public Oid: 1.3.6.1.4.1.16683.1.2.21.1.4.2.1.1.2.23.[..].73
[       ] warn if age > 900 seconds, critical if age > 1800 seconds
[ INFO  ] Active node: mq1
[ INFO  ] Last transfer time is Wed Sep 21 12:41:28 CEST 2016, in Unixtime: 1474454488, now is: 1474454496 - 8 seconds offset
Reply | Threaded
Open this post in threaded view
|

Re: Stale results on SNMP queries

TheQL
In reply to this post by IIT Software
Might be true that there was a problem with a specific bridge, but all times were stale, not just of one bridge. Also my last post indicates there might be an issue with updating values in general, in Explorer and the SNMP swiftlet.
Reply | Threaded
Open this post in threaded view
|

Re: Stale results on SNMP queries

IIT Software
Administrator
I notes that and have created a job fix already. If there is an issue, it will be fixed.
Reply | Threaded
Open this post in threaded view
|

Re: Stale results on SNMP queries

TheQL
Great, thanks! Looking forward to that!
Reply | Threaded
Open this post in threaded view
|

Re: Stale results on SNMP queries

TheQL
In reply to this post by IIT Software
I was just able to reproduce what I mentioned earlier. Check result seems stale, open Explorer, see old timestamps, wait a few seconds, explorer gets updated, next check is ok. I made two screenshots a few seconds apart confirming this, but they probably won't help you in finding the matter.
Reply | Threaded
Open this post in threaded view
|

Re: Stale results on SNMP queries

IIT Software
Administrator
So if you have a connected Explorer the field is updated but with SNMP only it is not?
Reply | Threaded
Open this post in threaded view
|

Re: Stale results on SNMP queries

IIT Software
Administrator
In reply to this post by TheQL
If that is the case then we need to fix that. Whenever an admin tool connects to the Mgmt Swiftlet, an event is fired which starts the collectors in the various Swiftlets to update the usage sections. The SNMP Swiftlet in newer versions use a different path. I guess this doesn't lead to fire this event (have to check the code).

However, as a workaround just start CLI somewhere and connect to the router. It should fire that event and you should see the updates.
Reply | Threaded
Open this post in threaded view
|

Re: Stale results on SNMP queries

TheQL
Yes, that is the case at least with JMS bridges. I am not sure if this is the case 100% of the time, but when the result seems stale, opening Explorer fixes the issue.

I am unsure about queue counts, have to try if it creates similar results but there I don't get alerts if a good result, aka. 0 messages in queue, is in fact a stale result.

Also, in the past, triggering the check manually often got a good result, the scheduled checks didn't. We were wondering if some kind of caching/freezing of results happens when a query is made and many checks at the same time somehow cause most checks returning stale results, but we can just guess around as to why this happens.
Reply | Threaded
Open this post in threaded view
|

Re: Stale results on SNMP queries

TheQL
In reply to this post by IIT Software
Hi,

I could just sort of confirm the same with queue counts. A queue that didn't show any change in monitoring although a 5 minute check interval is in place just dropped instantly after I opened Explorer. Not sure what caused the first drop or why it didn't move in the mean time, but the drop at the end of the graph was clearly caused by me opening Explorer.

Reply | Threaded
Open this post in threaded view
|

Re: Stale results on SNMP queries

IIT Software
Administrator
That's what I was assuming. Just try the workaround I mentioned above until this is fixed.
Reply | Threaded
Open this post in threaded view
|

Re: Stale results on SNMP queries

TheQL
While this is issue is mostly resolved, thanks, some queue depth queries do still deliver the same result over and over until checked with Explorer. If I were to bet on it, I'd say it got worse again with 10.2.0, but I'm not sure. Could you have another look?

It's by far not as bad as before, but we are able to see this from time to time. And you can never be sure if the last result was "0" if the query result is correct or not, as this does not raise any alarms in our monitoring configuration.
Reply | Threaded
Open this post in threaded view
|

Re: Stale results on SNMP queries

IIT Software
Administrator
We will definitely check that.

Meanwhile, since you are on 10.2.0, please have a look at this admin stream which can send you alerts on any queue and any threshold.
Reply | Threaded
Open this post in threaded view
|

Re: Stale results on SNMP queries

TheQL
Thank you, I will do that!