Re: Split brain problem in Version 9.4.2


DouglasJD
I have posted this question before. I have now reproduced the issue on my local PC, but can you help me find the root cause of the split brain?

The connection seems to be deactivated when the heartbeat count reaches 4 rather than 0.

I took a snapshot of the JMS2 VM server from 20/10/2015 2:11:30 PM to 20/10/2015 2:11:37 PM.

Please find more log files attached.

swiftmq.zip

IIT Software
Administrator
I can't confirm that the HeartBeatProcessor closes the connection at cnt=4. Rather there is a socket exception:

2015-10-20 14:11:40.749/192.168.0.119:4444/BlockingHandler/INFORMATION/Exception, EXITING: java.net.SocketException: Connection reset

which occurred when the cnt was 4:

2015-10-20 14:11:40.749/kernel/sys$hacontroller/ChannelOutboundDispatcher/visit, po=[POConnectionRemove, connection=192.168.0.119:4444] ...
2015-10-20 14:11:40.749/kernel/sys$hacontroller/ChannelInboundDispatcher/visit, po=[POConnectionDeactivate] ...
2015-10-20 14:11:40.749/kernel/sys$hacontroller/ChannelInboundDispatcher/visit, po=[POConnectionDeactivate] done
2015-10-20 14:11:40.749/kernel/sys$hacontroller/HeartBeatProcessor, cnt=4/channelDeactivated
2015-10-20 14:11:40.749/kernel/sys$hacontroller/StageController/visit, po=[POChannelDeactivated] ...
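
To check the same ordering in a longer trace, the relevant lines can be pulled out with a short script. This is only a minimal sketch: it assumes nothing beyond the line shapes visible in the excerpt above, and the function name `summarize_trace` is illustrative, not part of SwiftMQ.

```python
import re

# Patterns assume only the line shapes visible in the excerpt above.
HEARTBEAT = re.compile(r"HeartBeatProcessor, cnt=(\d+)/(\w+)")
SOCKET_ERROR = re.compile(r"java\.net\.SocketException: (.+)$")


def summarize_trace(lines):
    """Return (timestamp, kind, detail) tuples for socket errors and heartbeat events."""
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The timestamp is everything before the first '/'.
        timestamp = line.split("/", 1)[0]
        match = SOCKET_ERROR.search(line)
        if match:
            events.append((timestamp, "socket_exception", match.group(1)))
            continue
        match = HEARTBEAT.search(line)
        if match:
            detail = f"cnt={match.group(1)} {match.group(2)}"
            events.append((timestamp, "heartbeat", detail))
    return events
```

If a socket exception shows up at or before the timestamp of the `channelDeactivated` heartbeat event, the deactivation was most likely triggered by the connection reset rather than by the heartbeat count running out.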

DouglasJD
A VMware snapshot is a copy of the virtual machine's disk file (VMDK) at a given point in time. Snapshots provide a change log for the virtual disk and are used to restore a VM to a particular point in time when a failure or system error occurs.

It seems this is the root cause of the socket exception. Is there any configuration setting in SwiftMQ that avoids the socket exception?

IIT Software
Administrator
How did you force this split brain on your local PC? Probably by terminating the network connection of the replication channel with SwiftMQ Explorer or CLI. That, of course, causes a socket exception. Another reason could be a network problem.

Have a look here for the recommended HA deployment.


hx
Is there any solution for a momentary VM suspend or network outage that lasts less than the heartbeat threshold (10 times * 2 seconds)? Is there any configuration to re-establish the connection rather than deactivate the channel?
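
For reference, the numbers quoted in this thread can be compared directly. The following is a rough sketch only: the interval and threshold are the values stated above, the timestamps are the snapshot times reported earlier in the thread, and the function names are illustrative.

```python
from datetime import datetime

HEARTBEAT_INTERVAL_S = 2  # seconds per heartbeat, as stated in the thread
HEARTBEAT_THRESHOLD = 10  # heartbeat count, as stated in the thread


def tolerated_window_s(interval_s=HEARTBEAT_INTERVAL_S, threshold=HEARTBEAT_THRESHOLD):
    """Time in seconds covered by the full heartbeat threshold."""
    return interval_s * threshold


def pause_duration_s(start, end, fmt="%d/%m/%Y %I:%M:%S %p"):
    """Seconds between two timestamps written like '20/10/2015 2:11:30 PM'."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds()


window = tolerated_window_s()
pause = pause_duration_s("20/10/2015 2:11:30 PM", "20/10/2015 2:11:37 PM")
print(f"heartbeat window: {window} s, snapshot pause: {pause} s")
```

With these values the snapshot pause is shorter than the heartbeat window, which fits the observation that the channel was deactivated by a socket exception and not by the heartbeat count reaching 0.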

hx
In reply to this post by IIT Software
I have read the doc, but I still have no idea. We are facing a production issue: split brain occurs quite often at midnight, when the VM takes a snapshot.

IIT Software
Administrator
In reply to this post by hx
Well, NOW I understand the problem. If you are running the HA instances on VMs and suspend one of them for a while, the other instance goes into standalone mode. When you resume the VM, you have a split brain.

What you need to do is shut down the HA instance before you suspend the VM and start it again after the resume. Do it with a cron job or a script that you can run from a VM hook before / after these events occur.
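
A wrapper along the following lines is one way to sketch that sequence. It is an assumption-laden outline, not a SwiftMQ tool: the function name is illustrative, and the stop, snapshot, and start commands are placeholders that have to be replaced with whatever your installation and hypervisor actually provide.

```python
import subprocess


def run_with_ha_stopped(stop_cmd, action_cmd, start_cmd, run=subprocess.run):
    """Stop the HA instance, run the suspend/snapshot action, then restart it.

    All three commands are caller-supplied argument lists (placeholders here).
    """
    # If the instance cannot be stopped, abort before touching the VM.
    run(stop_cmd, check=True)
    try:
        run(action_cmd, check=True)
    finally:
        # Restart the instance even if the snapshot/suspend step failed.
        run(start_cmd, check=True)


# Example call with hypothetical script paths (replace with your own):
# run_with_ha_stopped(
#     ["/path/to/stop_ha_instance.sh"],
#     ["/path/to/take_snapshot.sh"],
#     ["/path/to/start_ha_instance.sh"],
# )
```

The `try`/`finally` is there so that a failed snapshot does not leave the HA instance down. When the hypervisor offers separate pre and post hooks, the stop and start commands can simply be called from those hooks instead.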

DouglasJD
Thanks for your help.