Fix Self-Fencing in a 2-Server HA XCP Pool
In a two-server pool with high availability configured and enabled, a server will sometimes self-fence when one of the hosts fails. The code should be robust enough to avoid this situation. It should not be mandatory to have at least three servers in a pool for high availability to work properly.
It would be nice to see this fixed, as a pool of two is a very useful configuration and the issue clearly has a solution.
I suggest adding an external host to check (maybe 18.104.22.168) as a simple solution to prevent the split-brain situation. More hosts could be added, but this will increase the time needed to check them all.
This is the solution HA-Lizard uses for two-host pools too, so maybe it is OK for Citrix as well.
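The external-host check suggested above can be sketched roughly as follows. This is only an illustration of the idea, not how XCP or HA-Lizard actually implement it: the function names, the witness addresses (taken from the 192.0.2.0/24 documentation range) and the Linux-style `ping` flags are all assumptions.

```python
import subprocess
from typing import Callable, Iterable


def ping_once(address: str, timeout_s: int = 1) -> bool:
    """Send one ICMP echo and report whether it was answered.

    Assumes a Linux `ping` binary that accepts -c (count) and -W (timeout).
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), address],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def should_self_fence(
    peer_alive: bool,
    witnesses: Iterable[str],
    probe: Callable[[str], bool] = ping_once,
) -> bool:
    """Decide whether this host should fence itself (hypothetical policy).

    - Peer still visible: nothing is wrong, do not fence.
    - Peer lost but at least one witness answers: our network is fine,
      so the peer is the one that failed; stay up.
    - Peer lost and no witness answers: we are the isolated side; fence.
    """
    if peer_alive:
        return False
    # any() stops at the first reachable witness, which limits the extra
    # checking time that a longer witness list would otherwise cost.
    return not any(probe(address) for address in witnesses)


if __name__ == "__main__":
    # Hypothetical witness list, e.g. a gateway and a backbone switch.
    print(should_self_fence(False, ["192.0.2.1", "192.0.2.2"]))
```

Passing the probe as a parameter keeps the decision logic testable without touching the network. Note that with an empty witness list a host that loses its peer always fences, which is the self-fencing behaviour this request complains about.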
See http://www.halizard.com/ for an open-source solution to this.
See http://www.halizard.com/ as a possible alternative.
Christof Giesers commented
IMHO the "ha-SR" is not enough.
E.g. if you have two EqualLogics, DataCores, LeftHands or whatever running as a synchronous mirror and the link between the two server rooms is broken (for whatever reason), you will run into a split-brain situation, because both servers still have at least one link to their storage left and will each claim 'last man standing' for themselves.
That can be fought by e.g. declaring a special target as "Main test" or whatever.
In our situation it would be one of our backbone switches here in this building, so the other server will know that something is weird and that it shouldn't do anything by itself.
That's certainly one way to approach this. As I see it, the heartbeat SR could contain "last updated" timestamps for active servers. If a server fails to respond and create an update that's newer than another, the last successful server in a two-server pool could be declared the "winner" and made into the new master. As it is now, you lose all the VMs on a host, anyway, so you'd certainly be no worse off. The current mechanism uses this interesting representation of active pool members, and some complex masking to determine the HA integrity, but it seems to me at least that the "live" hosts issue could be dealt with more simply. That's just one idea, and I'm sure more clever people could come up with something way better than what I've proposed here.
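The "last updated" timestamp idea could look something like the sketch below. It is only a guess at the logic described here, not the existing XCP mechanism: the function name, the host names and the 30-second staleness threshold are assumptions.

```python
from typing import Dict, Optional


def pick_survivor(
    last_updated: Dict[str, float],
    now: float,
    stale_after_s: float = 30.0,
) -> Optional[str]:
    """Return the only host with a fresh heartbeat timestamp, if there is one.

    `last_updated` maps a host name to the time (in seconds) of its most
    recent write to the heartbeat SR. A host is "fresh" if it wrote within
    the last `stale_after_s` seconds.
    """
    fresh = [
        host
        for host, timestamp in last_updated.items()
        if now - timestamp <= stale_after_s
    ]
    if len(fresh) == 1:
        # Exactly one host is still updating: declare it the winner,
        # i.e. the candidate to become the new master.
        return fresh[0]
    # Both hosts fresh (no failure) or none fresh (storage unreachable):
    # no winner can be declared from timestamps alone.
    return None


if __name__ == "__main__":
    heartbeats = {"host-a": 100.0, "host-b": 60.0}
    print(pick_survivor(heartbeats, now=105.0))
```

This only helps while both hosts write to the same heartbeat SR. With mirrored storage split across two rooms, each side may keep updating its own copy, so timestamps alone would not settle the question.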
Christof Giesers commented
Alright, that's exactly what I already thought.
You could give an option to enter devices which can be checked (e.g. a backbone switch) to see which one really got lost.
If I understood that right, "at least 3 hosts" is also a flawed description.
If you have a break in your backbone, any number of hosts divisible by 2 will result in a mess if the pool is cut in half.
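The arithmetic behind this point can be checked with a small sketch. It assumes a plain strict-majority rule (a side survives only if it sees more than half of the pool), which is an assumption for illustration and not necessarily the rule XCP applies; the function names are hypothetical.

```python
from typing import List


def has_quorum(reachable_hosts: int, pool_size: int) -> bool:
    """Strict majority: a partition must see more than half of the pool."""
    return 2 * reachable_hosts > pool_size


def sides_with_quorum(side_a: int, side_b: int) -> List[bool]:
    """For a pool split into two partitions, report which sides keep quorum."""
    pool_size = side_a + side_b
    return [has_quorum(side_a, pool_size), has_quorum(side_b, pool_size)]


if __name__ == "__main__":
    # Even pools cut exactly in half: neither side has a majority.
    print(sides_with_quorum(1, 1))
    print(sides_with_quorum(2, 2))
    # An odd pool cannot be cut in half, so one side always wins.
    print(sides_with_quorum(2, 1))
```

Under this rule an even split leaves no side with quorum whatever the pool size, so "at least three hosts" only helps when the hosts cannot be divided evenly across the broken link.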