SwiftMQ HA going to Standalone mode

SwiftMQ HA going to Standalone mode

Bali
Hi  
We are using version 9.2.1 in our production environment with two machines in HA. Both machines have 8 cores and 24 GB of memory. We have nearly 50 queues, nearly 400 physical queues and 70 topics with 100 subscribers, and we process nearly 100 messages/sec. Recently we have been facing a lot of issues with our setup going into STANDALONE or a split-brain situation very frequently (three times in the last month), and also some missing messages when the system comes up again (STANDALONE --> ACTIVE/STANDBY). It is a very critical piece of infrastructure for us and we want it to be always available. Below are a few of our GC logs in which we suspect irregular behaviour: there is a concurrent mode failure and the GC switches to a Full GC, which I think also includes a stop-the-world phase that could be responsible for this. We have also set the flowcontrol-start-queuesize and cache-size attributes to 10,000 for some queues.
Log 1 -->
  2710066.770: [Full GC (System) 2710066.771: [CMS: 2531276K->2513526K(2796224K), 14.1114080 secs] 2878935K->2513526K(3961344K), [CMS Perm : 24094K->24077K(40240K)], 14.1122230 secs] [Times: user=14.10 sys=0.02, real=14.12 secs]
5651229.872: [Full GC 5651229.872: [CMS5651230.398: [CMS-concurrent-mark: 5.008/6.532 secs] [Times: user=0.00 sys=0.00, real=6.53 secs]  
5651435.342: [Full GC 5651435.343: [CMS5651436.838: [CMS-concurrent-sweep: 3.319/3.327 secs] [Times: user=0.00 sys=0.00, real=3.32 secs]  
5651562.840: [Full GC 5651562.841: [CMS5651566.676: [CMS-concurrent-mark: 5.157/5.163 secs] [Times: user=0.00 sys=0.00, real=5.16 secs]  
5651580.286: [Full GC 5651580.287: [CMS: 2796152K->2796172K(2796224K), 15.2507540 secs] 3961176K->2805786K(3961344K), [CMS Perm : 24082K->24082K(40432K)], 15.2514030 secs] [Times: user=0.00 sys=0.00, real=15.25 secs]  
5651596.944: [Full GC 5651596.944: [CMS5651601.062: [CMS-concurrent-mark: 5.507/5.512 secs] [Times: user=0.00 sys=0.00, real=5.51 secs]
5651628.250: [Full GC 5651628.250: [CMS5651628.276: [CMS-concurrent-sweep: 3.300/3.309 secs] [Times: user=0.00 sys=0.00, real=3.32 secs]
Log 2 
CMS: abort preclean due to time 5651061.266: [CMS-concurrent-abortable-preclean: 4.999/5.005 secs] [Times: user=0.00 sys=0.00, real=5.01 secs]  
CMS: abort preclean due to time 5651078.645: [CMS-concurrent-abortable-preclean: 5.343/5.349 secs] [Times: user=0.00 sys=0.00, real=5.35 secs]  
CMS: abort preclean due to time 5651096.649: [CMS-concurrent-abortable-preclean: 5.240/5.408 secs] [Times: user=0.00 sys=0.00, real=5.41 secs]  
CMS: abort preclean due to time 5651112.335: [CMS-concurrent-abortable-preclean: 5.357/5.362 secs] [Times: user=0.00 sys=0.00, real=5.36 secs]  
CMS: abort preclean due to time 5651129.491: [CMS-concurrent-abortable-preclean: 5.409/5.416 secs] [Times: user=0.00 sys=0.00, real=5.42 secs]  
 CMS: abort preclean due to time 5651147.222: [CMS-concurrent-abortable-preclean: 5.368/5.377 secs] [Times: user=0.00 sys=0.00, real=5.38 secs]  
5651172.667: [CMS-concurrent-abortable-preclean: 1.624/1.786 secs] [Times: user=0.00 sys=0.00, real=1.79 secs]  
CMS: abort preclean due to time 5651187.243: [CMS-concurrent-abortable-preclean: 4.890/5.448 secs] [Times: user=0.00 sys=0.00, real=5.45 secs]  
5651197.809: [CMS-concurrent-abortable-preclean: 1.421/1.786 secs] [Times: user=0.00 sys=0.00, real=1.78 secs]  
5651433.398: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]
5651624.877: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]  
5651671.512: [CMS-concurrent-abortable-preclean: 0.455/3.434 secs] [Times: user=0.00 sys=0.00, real=3.43 secs]  
5651680.078: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]  
5651688.082: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]  
5651699.750: [GC 5651699.751: [ParNew5651699.824: [CMS-concurrent-abortable-preclean: 0.794/3.699 secs] [Times: user=0.00 sys=0.00, real=3.70 secs]  
5651724.107: [GC 5651724.108: [ParNew5651724.127: [CMS-concurrent-abortable-preclean: 0.158/1.260 secs] [Times: user=0.00 sys=0.00, real=1.26 secs]  
5651732.447: [CMS-concurrent-abortable-preclean: 0.000/0.000 secs] [Times: user=0.00 sys=0.00, real=0.00 secs]  
5651745.068: [GC 5651745.068: [ParNew5651745.083: [CMS-concurrent-abortable-preclean: 0.886/4.692 secs] [Times: user=0.00 sys=0.00, real=4.69 secs]  
CMS: abort preclean due to time 5651764.849: [CMS-concurrent-abortable-preclean: 2.661/5.049 secs] [Times: user=0.00 sys=0.00, real=5.05 secs]  
CMS: abort preclean due to time 5651778.198: [CMS-concurrent-abortable-preclean: 2.061/5.015 secs] [Times: user=0.00 sys=0.00, real=5.02 secs]  
Java version --> 1.6.0_26
Config options
 -->  -XX:+PrintGCDetails -XX:+PrintGCTimeStamps  -Xms8192m -Xmx8192m -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=128m -Xss1024k -XX:NewRatio=2 -XX:+UseConcMarkSweepGC -XX:SurvivorRatio=4 -XX:CompileThreshold=100
Questions -->
1. Do we need to tweak our GC settings for this to stop?
2. How do we manage connections when an ACTIVE dies and the other instance goes into STANDALONE?
3. When the machine goes into STANDALONE, we just change the HA state to STANDBY and restart the process on the server that killed itself. Is this the right way to do it?
4. Are we looking at the wrong problem, i.e. GC?
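
To put numbers on the Full GC entries in Log 1, the completed lines can be parsed for pause time and old-generation occupancy. The following is only a minimal sketch, assuming the JDK 6 `-XX:+PrintGCDetails` line format shown above; the names `FullGC`, `FULL_GC` and `parse_full_gcs` are made up for illustration. Entries that are interleaved with concurrent-phase output (such as the `[CMS5651230.398: ...` lines) do not match the pattern and are skipped.

```python
import re
from typing import Iterable, List, NamedTuple


class FullGC(NamedTuple):
    """One completed Full GC entry of the CMS old generation."""
    timestamp: float    # seconds since JVM start
    before_kb: int      # old generation used before the collection
    after_kb: int       # old generation used after the collection
    capacity_kb: int    # old generation capacity
    pause_secs: float   # duration reported for the CMS part

    @property
    def occupancy_after(self) -> float:
        """Fraction of the old generation still in use after the Full GC."""
        return self.after_kb / self.capacity_kb


# Matches e.g. "5651580.286: [Full GC ... [CMS: 2796152K->2796172K(2796224K), 15.25 secs]"
FULL_GC = re.compile(
    r"^\s*(\d+\.\d+): \[Full GC.*?"
    r"\[CMS: (\d+)K->(\d+)K\((\d+)K\), ([\d.]+) secs\]"
)


def parse_full_gcs(lines: Iterable[str]) -> List[FullGC]:
    """Return all completed Full GC entries found in the given log lines."""
    result = []
    for line in lines:
        m = FULL_GC.search(line)
        if m:
            ts, before, after, cap, secs = m.groups()
            result.append(
                FullGC(float(ts), int(before), int(after), int(cap), float(secs))
            )
    return result


# Three lines copied from Log 1; the middle one is interleaved and is skipped.
sample = [
    "  2710066.770: [Full GC (System) 2710066.771: [CMS: 2531276K->2513526K(2796224K), 14.1114080 secs] 2878935K->2513526K(3961344K), [CMS Perm : 24094K->24077K(40240K)], 14.1122230 secs] [Times: user=14.10 sys=0.02, real=14.12 secs]",
    "5651229.872: [Full GC 5651229.872: [CMS5651230.398: [CMS-concurrent-mark: 5.008/6.532 secs] [Times: user=0.00 sys=0.00, real=6.53 secs]",
    "5651580.286: [Full GC 5651580.287: [CMS: 2796152K->2796172K(2796224K), 15.2507540 secs] 3961176K->2805786K(3961344K), [CMS Perm : 24082K->24082K(40432K)], 15.2514030 secs] [Times: user=0.00 sys=0.00, real=15.25 secs]",
]

pauses = parse_full_gcs(sample)
for p in pauses:
    print(f"{p.timestamp:.3f}: pause {p.pause_secs:.2f}s, "
          f"old gen {p.occupancy_after:.1%} full afterwards")
```

In the second completed entry the old generation is at 2796172K of 2796224K after the collection, so a 15-second pause freed essentially nothing.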

Re: SwiftMQ HA going to Standalone mode

IIT Software
Administrator
1. Do we need to tweak our GC settings for this to stop.
Well, if you've figured out the settings above I guess you have more knowledge than me on this matter. Only one hint: If you need to tweak GC settings at all then you don't have a GC but a BIG memory problem.

2. How to manage connections when an ACTIVE dies and goes into STANDALONE.
You mean the client connections? They automatically switch over to the remaining STANDALONE instance due to transparent failover.

3. When the machine goes into STANDALONE we just change the HA state to STANDBY and restart the process where the Server killed itself, is this the right way to do it.
You don't need to change the HA state of the failed instance. Just start it with mode ACTIVE. It will automatically switch to STANDBY if the other instance is STANDALONE.

Manual intervention is only necessary after a split brain.

4. Are we looking at the wrong problem ie GC.
As mentioned, you have a memory problem, not a GC one. You have a support contract. Why don't you just submit an incident from your MySwiftMQ account (via the "Support" section) instead of posting it here? If I have an incident from you, we can work out and adjust every setting of your configuration to solve the memory problem.

Re: SwiftMQ HA going to Standalone mode

IIT Software
Administrator
In reply to this post by Bali
If you have so many queues, you should limit the queues' cache sizes. The default is 500 messages without a memory limit. So if you have large messages on your 400 queues and each queue stores them in its cache, you will have a memory problem.

Let's say you want to have a maximum of 200 MB (204800 KB) in queue caches. You have 400 physical queues and 100 subscriber queues, which is a total of 500 queues. The subscriber queues store the same message object within a transaction, but let's just assume they are different. 204800 / 500 = 409.6, so roughly 409 KB max cache size per queue.
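
The same arithmetic as a small helper, in case you want to try other budgets or queue counts. This is just a sketch; the function name `per_queue_cache_kb` is made up for illustration and is not part of SwiftMQ.

```python
def per_queue_cache_kb(total_cache_mb: int, queue_count: int) -> int:
    """Split a total cache budget evenly across queues, rounded down to whole KB."""
    if queue_count <= 0:
        raise ValueError("queue_count must be positive")
    total_kb = total_cache_mb * 1024  # 200 MB -> 204800 KB
    return total_kb // queue_count


# 400 physical queues + 100 subscriber queues, 200 MB budget
cache_kb = per_queue_cache_kb(200, 400 + 100)
print(cache_kb)
```

Rounding down keeps the sum of all per-queue limits at or below the chosen budget.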

For each regular queue add the cache-size-bytes-kb attribute, e.g.:

       <queue name="testqueue" cache-size-bytes-kb="409" />

Do the same at the queue controller for subscriber queues:

      <queue-controller name="01" persistence-mode="non_persistent"  cache-size-bytes-kb="409" predicate="tmp$%"/>

The best approach is to apply it dynamically with SwiftMQ Explorer / CLI and save it. If you do it in the routerconfig.xml, you'd need to shut down and restart both instances.

Re: SwiftMQ HA going to Standalone mode

Bali
The approach you suggested involves changing the attributes of all queues one by one and then doing a save.
Can we set these values in one place, i.e. apply the setting to all queues with a single change?

Re: SwiftMQ HA going to Standalone mode

IIT Software
Administrator
You could create a CLI script like the attached sample and execute it with "clis".

cachesize.cli
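
The attachment itself is not reproduced in this thread. As a rough idea of what such a script could look like, here is a sketch that generates one for a list of queue names. It rests on assumptions that should be checked against the CLI documentation before use: that the CLI commands `sr`, `cc`, `set` and `save` behave as used here, that queues live under the context `/sys$queuemanager/queues/<name>`, and that the router is called `router1`. The function name `build_cli_script` and the queue names are hypothetical.

```python
from typing import Iterable


def build_cli_script(router: str, queues: Iterable[str], cache_kb: int) -> str:
    """Build the text of a CLI script that sets cache-size-bytes-kb on each queue.

    ASSUMPTION: command names (sr, cc, set, save) and the context path
    /sys$queuemanager/queues/<name> are not confirmed by this thread.
    """
    lines = [f"sr {router}"]                      # select the router
    for name in queues:
        lines.append(f"cc /sys$queuemanager/queues/{name}")   # change to the queue's context
        lines.append(f"set cache-size-bytes-kb {cache_kb}")   # attribute named in this thread
    lines.append("save")                          # persist the configuration
    return "\n".join(lines) + "\n"


# Hypothetical router and queue names, for illustration only
script = build_cli_script("router1", ["testqueue", "orders"], 409)
print(script, end="")
```

The output can be written to a file such as cachesize.cli and adapted as needed before it is run.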

Re: SwiftMQ HA going to Standalone mode

Bali
What I meant to ask is whether there is a way to set the default size. We will set the size automatically for the existing queues now, but it would be good if there were a way to set a default that applies to newly created queues.

Re: SwiftMQ HA going to Standalone mode

IIT Software
Administrator
No, the default is fixed for regular queues. For subscriber queues it is configurable in the queue controller, as mentioned above.