Hypervisor evacuation procedure summary
---------------------------------------

Hypervisor evacuation (for hardware maintenance, for instance) is done as
follows:

- Service inventory
- Service failover
- Gluster client stopping
- If GlusterFS server:

  - GlusterFS stopping

- Etcd stopping

On hypervisor reintegration, a rebalancing operation is advised.

In the following procedure, the hypervisor to evacuate is designated by
``${HYPERVISOR}``.

Note that hypervisors may work in groups for highly-available services.
Evacuating such a hypervisor can only be performed safely if enough of the
associated hypervisors remain available. Make sure that losing
``${HYPERVISOR}`` will not break the quorum of any hosted highly-available
service before going any further.

Hypervisor evacuation procedure
-------------------------------

If the hypervisor to evacuate is a GlusterFS server or an Etcd cluster
member, check beforehand that the current state is healthy and assess the
impact of this procedure.

Fleet impact assessment
^^^^^^^^^^^^^^^^^^^^^^^

On a ``fleet`` client (``admin[1-2]`` or another hypervisor), list the
services hosted on the hypervisor to evacuate:

.. code-block:: shell

   # fleetctl list-unit-files | grep "${HYPERVISOR}$"
   pcocc-vm-admin2.service    c5985fc    launched    launched    a1ff44e6.../worker1
   pcocc-vm-ns3.service       ff5cf2a    launched    launched    a1ff44e6.../worker1

GlusterFS health check
^^^^^^^^^^^^^^^^^^^^^^

Check that ``GlusterFS`` is healthy and that the hypervisor is ready to be
taken down [#]_.

Check that there is no tiering daemon:

.. code-block:: shell

   # gluster volume tier volspoms1 status
   Tiering Migration Functionality: volspoms1: failed: volume volspoms1 is not a tier volume
   Tier command failed
   # gluster volume tier volspoms2 status
   Tiering Migration Functionality: volspoms2: failed: volume volspoms2 is not a tier volume
   Tier command failed

Check that there is no rebalancing process ongoing:

.. code-block:: shell

   # gluster volume rebalance volspoms1 status
   volume rebalance: volspoms1: failed: Rebalance not started for volume volspoms1.
   # gluster volume rebalance volspoms2 status
   volume rebalance: volspoms2: failed: Rebalance not started for volume volspoms2.

Check that files are correctly replicated:

.. code-block:: shell

   # gluster volume heal volspoms1 info | grep "Number of entries" | awk '{ sum+=$4} END {print sum}'
   0
   # gluster volume heal volspoms2 info | grep "Number of entries" | awk '{ sum+=$4} END {print sum}'
   0

Etcd health check
^^^^^^^^^^^^^^^^^

Check that the ``etcd`` cluster is healthy:

.. code-block:: shell

   # etcdctl -C "https://$(facter fqdn):2379" cluster-health
   member 21ba43a3db03bf64 is healthy: got healthy result from https://top1.mg1.hpc.domain.fr:2379
   member 54de3afb81ad231e is healthy: got healthy result from https://worker1.mg1.hpc.domain.fr:2379
   member 56be7f61679e83e0 is healthy: got healthy result from https://worker2.mg1.hpc.domain.fr:2379
   member 6daceb4fdf706afd is healthy: got healthy result from https://worker3.mg1.hpc.domain.fr:2379
   member a737706dc883425f is healthy: got healthy result from https://top3.mg1.hpc.domain.fr:2379
   member e8d3e2afaf64ac7c is healthy: got healthy result from https://top2.mg1.hpc.domain.fr:2379

Fleet shutdown
^^^^^^^^^^^^^^

Fail over the services managed by ``fleet``. To do so, simply stop the
``fleetd`` daemon on the hypervisor to evacuate:

.. code-block:: shell

   # systemctl stop fleet
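As an optional sanity check, not spelled out in the procedure above, you can
confirm from another ``fleet`` client that no unit is still scheduled on the
evacuated hypervisor. The sketch below only reuses ``fleetctl`` options
already shown in this document; an empty output means the failover is
complete:

.. code-block:: shell

   # fleetctl list-units --no-legend --fields unit,hostname | grep "${HYPERVISOR}$"

If some units are still reported on ``${HYPERVISOR}``, wait for ``fleet`` to
reschedule them and run the command again before proceeding with the
shutdown steps below.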
GlusterFS shutdown
^^^^^^^^^^^^^^^^^^

Stop the ``GlusterFS`` daemons and then kill the remaining processes:

.. code-block:: shell

   # systemctl stop glusterfsd glusterd
   # /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh -g; echo $?

A non-zero return code requires some investigation to figure out what
happened.

Unmount the ``GlusterFS`` filesystems:

.. code-block:: shell

   # umount -t fuse.glusterfs -a

Etcd shutdown
^^^^^^^^^^^^^

Stop ``etcd`` with ``systemctl``:

.. code-block:: shell

   # systemctl stop etcd

Hypervisor reintegration procedure
----------------------------------

The ``GlusterFS`` and ``etcd`` daemons start at boot time. Check that
everything is OK.

GlusterFS start-up checks
^^^^^^^^^^^^^^^^^^^^^^^^^

Check that files are correctly replicated (this may take some time and/or
time out):

.. code-block:: shell

   # gluster volume heal volspoms1 info | grep "Number of entries" | awk '{ sum+=$4} END {print sum}'
   0
   # gluster volume heal volspoms2 info | grep "Number of entries" | awk '{ sum+=$4} END {print sum}'
   0

Etcd start-up check
^^^^^^^^^^^^^^^^^^^

Check that the ``etcd`` cluster is healthy:

.. code-block:: shell

   # etcdctl -C "https://$(facter fqdn):2379" cluster-health
   member 21ba43a3db03bf64 is healthy: got healthy result from https://top1.mg1.hpc.domain.fr:2379
   member 54de3afb81ad231e is healthy: got healthy result from https://worker1.mg1.hpc.domain.fr:2379
   member 56be7f61679e83e0 is healthy: got healthy result from https://worker2.mg1.hpc.domain.fr:2379
   member 6daceb4fdf706afd is healthy: got healthy result from https://worker3.mg1.hpc.domain.fr:2379
   member a737706dc883425f is healthy: got healthy result from https://top3.mg1.hpc.domain.fr:2379
   member e8d3e2afaf64ac7c is healthy: got healthy result from https://top2.mg1.hpc.domain.fr:2379

Fleet start-up
^^^^^^^^^^^^^^

Start the ``fleet`` daemon. This only starts the services that can run
solely on this particular hypervisor (pinned VMs); there is no automatic
rebalancing.

.. code-block:: shell

   # systemctl start fleet

Measure the load of each hypervisor. Launch on a running hypervisor:

.. code-block:: shell

   # declare -A weights
   # while read hyp unit
     do
         current_weight=${weights[${hyp}]:-0}
         unit_weight=$(fleetctl cat ${unit} | grep Weight | cut -f2 -d'=')
         weights[${hyp}]=$(( ${current_weight} + ${unit_weight} ))
     done <<< "$(fleetctl list-units --no-legend --fields hostname,unit)"
   # for hyp in "${!weights[@]}"; do echo "$hyp ${weights[$hyp]}"; done | sort -k 1 | column -t
   [...]
   top1     20000
   top2     24000
   top3     20000
   worker1  20000
   worker2  58000
   worker3  50000

Measure the load induced by each VM. Launch on a running hypervisor:

.. code-block:: shell

   # while read hyp unit
     do
         unit_weight=$(fleetctl cat ${unit} | grep Weight | cut -f2 -d'=')
         echo -e "${unit} on ${hyp}\t\t${unit_weight}"
     done <<< "$(fleetctl list-units --no-legend --fields hostname,unit)" | sort -k 1 | column -t

Using the information above and the criticality of each VM, determine the
VMs to fail back. To fail back a VM, launch:

.. code-block:: shell

   # fleetctl unload --no-block ${VM}
   # fleetctl start --no-block ${VM}

.. rubric:: Footnotes

.. [#] Based on the procedure scripted in
   ``/usr/share/glusterfs/scripts/stop-all-gluster-processes.sh``