Fix Self-Fencing in a 2-Server HA XCP Pool

A two-server pool with high availability (HA) configured and enabled will sometimes result in a server self-fencing when one of the hosts fails. The HA code should be robust enough to handle this situation: it should not be mandatory to have at least three servers in a pool for high availability to work properly.

23 votes
    Tobias Kreidl shared this idea

    5 comments

      • Christof Giesers commented  · 

        IMHO the HA SR alone is not enough.
        E.g. if you have two EqualLogics, DataCores, LeftHands, or similar arrays running as a synchronous mirror, and the link between the two server rooms breaks (for whatever reason), you will run into a split-brain situation: both servers still have at least one link to their storage and will each claim "last man standing" for themselves.
        That can be countered by, e.g., declaring a special target as the "main test" device. In our situation it would be one of our backbone switches here in this building, so the surviving server would know that something is wrong and that it shouldn't do anything by itself.
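        The witness idea above can be sketched as pure decision logic. This is a hypothetical illustration, not XCP's actual HA code: `survivor_decision` and its inputs are invented names, and the reachability checks (ICMP to the peer and to the designated backbone switch) are assumed to happen elsewhere.

        ```python
        # Hypothetical sketch: before a host claims "last man standing", it also
        # checks a designated external witness (e.g. a backbone switch). If the
        # witness is unreachable too, the host assumes *it* is the isolated side
        # and stands down instead of grabbing the storage.

        def survivor_decision(peer_reachable: bool, witness_reachable: bool) -> str:
            """Decide what an HA host should do when the pool link drops."""
            if peer_reachable:
                return "normal"      # no partition, nothing to do
            if witness_reachable:
                return "take_over"   # peer looks dead; we can still see the network
            return "self_fence"      # we are the isolated side; stand down
        ```

        The key point is the asymmetry: losing sight of the peer alone is ambiguous, but losing sight of the witness as well strongly suggests the local host is the one that got cut off.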

      • Tobias Kreidl commented  · 

        That's certainly one way to approach this. As I see it, the heartbeat SR could contain "last updated" timestamps for active servers. If a server fails to respond and create an update that's newer than another, the last successful server in a two-server pool could be declared the "winner" and made into the new master. As it is now, you lose all the VMs on a host, anyway, so you'd certainly be no worse off. The current mechanism uses this interesting representation of active pool members, and some complex masking to determine the HA integrity, but it seems to me at least that the "live" hosts issue could be dealt with more simply. That's just one idea, and I'm sure more clever people could come up with something way better than what I've proposed here.
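        The "last updated" timestamp idea might look something like the following sketch. The names (`write_heartbeat`, `pick_winner`, the in-memory `store` dict standing in for the heartbeat SR, and the `STALE_AFTER` threshold) are all illustrative assumptions, not real XCP APIs.

        ```python
        # Hypothetical sketch: each host periodically writes a "last updated"
        # timestamp to the shared heartbeat SR; on a partition, the host holding
        # the freshest non-stale heartbeat is declared the winner (new master).

        STALE_AFTER = 30.0  # seconds without an update before a host counts as dead

        def write_heartbeat(store, host, now):
            """Record a heartbeat; `store` stands in for the heartbeat SR."""
            store[host] = now

        def pick_winner(store, now):
            """Return the host with the newest fresh heartbeat, or None."""
            live = {h: t for h, t in store.items() if now - t <= STALE_AFTER}
            if not live:
                return None
            # Newest heartbeat wins; ties broken by host name for determinism.
            return max(live, key=lambda h: (live[h], h))
        ```

        In a two-server pool this reduces the election to a simple comparison of two timestamps on shared storage, which is the simplification being argued for here.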

      • Christof Giesers commented  · 

        Alright, that's exactly what I was thinking.
        You could give an option to enter devices which can be checked (e.g. a backbone switch) to determine which side really got lost.
        If I understood it right, "at least 3 hosts" is also a flawed prescription: if you have a break in your backbone, any number of hosts, cut in half, will leave you with a split-brain mess.
