Hypervisor evacuation procedure summary
---------------------------------------

Hypervisor evacuation (for hardware maintenance, for instance) is done as
follows:

- Service inventory
- Service failover
- Gluster client stopping
- If GlusterFS server:

  - GlusterFS stopping

- Etcd stopping

On hypervisor reintegration, a rebalancing operation is advised.

In the following procedure, the hypervisor to evacuate is designated by
``${HYPERVISOR}``.

Note that hypervisors may work in groups for highly-available services.
Evacuating such a hypervisor can only be performed safely if enough of the
associated hypervisors remain available. Make sure that losing
``${HYPERVISOR}`` will not break the quorum of any hosted highly-available
service before going any further.

Hypervisor evacuation procedure
-------------------------------

If the hypervisor to evacuate is a GlusterFS server or an Etcd cluster
member, check beforehand that the current state is healthy and assess the
impact of this procedure.

Fleet impact assessment
^^^^^^^^^^^^^^^^^^^^^^^

On a ``fleet`` client (``admin[1-2]`` or another hypervisor), list the
services hosted on the hypervisor to evacuate:

.. code-block:: shell

   # fleetctl list-unit-files | grep "${HYPERVISOR}$"
   pcocc-vm-admin2.service    c5985fc    launched    launched    a1ff44e6.../worker1
   pcocc-vm-ns3.service       ff5cf2a    launched    launched    a1ff44e6.../worker1

GlusterFS health check
^^^^^^^^^^^^^^^^^^^^^^

Check that ``GlusterFS`` is healthy and that the hypervisor is ready to be
taken down [#]_.

Check that there is no tiering daemon:

.. code-block:: shell

   # gluster volume tier volspoms1 status
   Tiering Migration Functionality: volspoms1: failed: volume volspoms1 is not a tier volume
   Tier command failed
   # gluster volume tier volspoms2 status
   Tiering Migration Functionality: volspoms2: failed: volume volspoms2 is not a tier volume
   Tier command failed

Check that there is no rebalancing process ongoing:

.. code-block:: shell

   # gluster volume rebalance volspoms1 status
   volume rebalance: volspoms1: failed: Rebalance not started for volume volspoms1.
   # gluster volume rebalance volspoms2 status
   volume rebalance: volspoms2: failed: Rebalance not started for volume volspoms2.

Check that files are correctly replicated:

.. code-block:: shell

   # gluster volume heal volspoms1 info | grep "Number of entries" | awk '{ sum+=$4} END {print sum}'
   0
   # gluster volume heal volspoms2 info | grep "Number of entries" | awk '{ sum+=$4} END {print sum}'
   0

Etcd health check
^^^^^^^^^^^^^^^^^

Check that the ``etcd`` cluster is healthy:

.. code-block:: shell

   # etcdctl -C "https://$(facter fqdn):2379" cluster-health
   member 21ba43a3db03bf64 is healthy: got healthy result from https://top1.mg1.hpc.domain.fr:2379
   member 54de3afb81ad231e is healthy: got healthy result from https://worker1.mg1.hpc.domain.fr:2379
   member 56be7f61679e83e0 is healthy: got healthy result from https://worker2.mg1.hpc.domain.fr:2379
   member 6daceb4fdf706afd is healthy: got healthy result from https://worker3.mg1.hpc.domain.fr:2379
   member a737706dc883425f is healthy: got healthy result from https://top3.mg1.hpc.domain.fr:2379
   member e8d3e2afaf64ac7c is healthy: got healthy result from https://top2.mg1.hpc.domain.fr:2379

Fleet shutdown
^^^^^^^^^^^^^^

Fail over the services managed by ``fleet``. To do so, simply stop the
``fleetd`` daemon on the hypervisor to evacuate:

.. code-block:: shell

   # systemctl stop fleet
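As an optional sanity check, not spelled out in the procedure above, you can
confirm from another ``fleet`` client that no unit is still scheduled on the
evacuated hypervisor. The sketch below only reuses ``fleetctl`` options
already shown in this document; an empty output means the failover is
complete:

.. code-block:: shell

   # fleetctl list-units --no-legend --fields unit,hostname | grep "${HYPERVISOR}$"

If some units are still reported on ``${HYPERVISOR}``, wait for ``fleet`` to
reschedule them and run the command again before proceeding with the
shutdown steps below.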
GlusterFS shutdown
^^^^^^^^^^^^^^^^^^

Stop the ``GlusterFS`` daemons and then kill the remaining processes:

.. code-block:: shell

   # systemctl stop glusterfsd glusterd
   # /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh -g; echo $?

A non-zero return code requires some investigation to figure out what
happened.

Unmount the ``GlusterFS`` filesystems:

.. code-block:: shell

   # umount -t fuse.glusterfs -a

Etcd shutdown
^^^^^^^^^^^^^

Stop ``etcd`` with ``systemctl``:

.. code-block:: shell

   # systemctl stop etcd

Hypervisor reintegration procedure
----------------------------------

The ``GlusterFS`` and ``etcd`` daemons start at boot time. Check that
everything is OK.

GlusterFS start-up checks
^^^^^^^^^^^^^^^^^^^^^^^^^

Check that files are correctly replicated (this may take some time and/or
time out):

.. code-block:: shell

   # gluster volume heal volspoms1 info | grep "Number of entries" | awk '{ sum+=$4} END {print sum}'
   0
   # gluster volume heal volspoms2 info | grep "Number of entries" | awk '{ sum+=$4} END {print sum}'
   0

Etcd start-up check
^^^^^^^^^^^^^^^^^^^

Check that the ``etcd`` cluster is healthy:

.. code-block:: shell

   # etcdctl -C "https://$(facter fqdn):2379" cluster-health
   member 21ba43a3db03bf64 is healthy: got healthy result from https://top1.mg1.hpc.domain.fr:2379
   member 54de3afb81ad231e is healthy: got healthy result from https://worker1.mg1.hpc.domain.fr:2379
   member 56be7f61679e83e0 is healthy: got healthy result from https://worker2.mg1.hpc.domain.fr:2379
   member 6daceb4fdf706afd is healthy: got healthy result from https://worker3.mg1.hpc.domain.fr:2379
   member a737706dc883425f is healthy: got healthy result from https://top3.mg1.hpc.domain.fr:2379
   member e8d3e2afaf64ac7c is healthy: got healthy result from https://top2.mg1.hpc.domain.fr:2379

Fleet start-up
^^^^^^^^^^^^^^

Start the ``fleet`` daemon. This only starts the services that can run
solely on this particular hypervisor (pinned VMs); there is no automatic
rebalancing.

.. code-block:: shell

   # systemctl start fleet

Measure the load of each hypervisor. Launch on a running hypervisor:

.. code-block:: shell

   # declare -A weights
   # while read hyp unit
     do
         current_weight=${weights[${hyp}]:-0}
         unit_weight=$(fleetctl cat ${unit} | grep Weight | cut -f2 -d'=')
         weights[${hyp}]=$(( ${current_weight} + ${unit_weight} ))
     done <<< "$(fleetctl list-units --no-legend --fields hostname,unit)"
   # for hyp in "${!weights[@]}"; do echo "$hyp ${weights[$hyp]}"; done | sort -k 1 | column -t
   [...]
   top1     20000
   top2     24000
   top3     20000
   worker1  20000
   worker2  58000
   worker3  50000

Measure the load induced by each VM. Launch on a running hypervisor:

.. code-block:: shell

   # while read hyp unit
     do
         unit_weight=$(fleetctl cat ${unit} | grep Weight | cut -f2 -d'=')
         echo -e "${unit} on ${hyp}\t\t${unit_weight}"
     done <<< "$(fleetctl list-units --no-legend --fields hostname,unit)" | sort -k 1 | column -t

Using the information above and the criticality of each VM, determine the
VMs to fail back. To fail back a VM, launch:

.. code-block:: shell

   # fleetctl unload --no-block ${VM}
   # fleetctl start --no-block ${VM}

.. rubric:: Footnotes

.. [#] Based on the procedure scripted in
   ``/usr/share/glusterfs/scripts/stop-all-gluster-processes.sh``