Hypervisor evacuation procedure summary¶
Hypervisor evacuation (for hardware maintenance, for instance) is done as follows:
Service inventory
Service failover
Gluster client stopping
If GlusterFS server: GlusterFS stopping
Etcd stopping
On hypervisor reintegration, a rebalancing operation is advised.
In the following procedure, the hypervisor to evacuate is designated as ${HYPERVISOR}.
Note that hypervisors may work in groups to host highly-available services.
Evacuating such a hypervisor can only be performed safely if enough of the associated hypervisors remain available.
Before going any further, make sure that losing ${HYPERVISOR} will not break the quorum required by any of the hosted highly-available services.
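As a quick sanity check for the etcd quorum (a minimal sketch, assuming the highly-available services rely on the etcd cluster shown by the cluster-health commands later in this procedure), you can count the members and verify that a majority remains once ${HYPERVISOR} is down:
# total=$(etcdctl -C "https://$(facter fqdn):2379" cluster-health | grep -c "^member")
# echo "${total} members, majority still reached with one member down: $(( total - 1 > total / 2 ))"
A result of 0 means the evacuation must not proceed until another member is back. The quorum of other highly-available services (GlusterFS replicas in particular) has to be assessed separately.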
Hypervisor evacuation procedure¶
If the hypervisor to evacuate is a GlusterFS server or an etcd cluster member, check beforehand that the current state is healthy and assess the impact of this procedure.
Fleet impact assessment¶
On a fleet client (admin[1-2] or another hypervisor), list the services hosted on the hypervisor to evacuate:
# fleetctl list-unit-files | grep "${HYPERVISOR}$"
pcocc-vm-admin2.service c5985fc launched launched a1ff44e6.../worker1
pcocc-vm-ns3.service ff5cf2a launched launched a1ff44e6.../worker1
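Optionally, this inventory can be kept aside to ease the failback step at the end of the procedure (the file path below is only an example):
# fleetctl list-unit-files | grep "${HYPERVISOR}$" | awk '{ print $1 }' > /root/units-${HYPERVISOR}.txt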
GlusterFS health check¶
Check that GlusterFS is healthy and that the hypervisor is ready to be taken down 1.
Check that there is no tiering daemon:
# gluster volume tier volspoms1 status
Tiering Migration Functionality: volspoms1: failed: volume volspoms1 is not a tier volume
Tier command failed
# gluster volume tier volspoms2 status
Tiering Migration Functionality: volspoms2: failed: volume volspoms2 is not a tier volume
Tier command failed
Check that there is no rebalancing process ongoing:
# gluster volume rebalance volspoms1 status
volume rebalance: volspoms1: failed: Rebalance not started for volume volspoms1.
# gluster volume rebalance volspoms2 status
volume rebalance: volspoms2: failed: Rebalance not started for volume volspoms2.
Check that files are correctly replicated:
# gluster volume heal volspoms1 info | grep "Number of entries" | awk '{ sum+=$4} END {print sum}'
0
# gluster volume heal volspoms2 info | grep "Number of entries" | awk '{ sum+=$4} END {print sum}'
0
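The same check can be wrapped in a small loop so that it reports explicitly when entries are still pending (a sketch reusing the volume names above):
# for vol in volspoms1 volspoms2
do
pending=$(gluster volume heal ${vol} info | grep "Number of entries" | awk '{ sum+=$4 } END { print sum }')
[ "${pending}" -eq 0 ] && echo "${vol}: OK" || echo "${vol}: ${pending} entries pending, do not evacuate yet"
done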
Etcd health check¶
Check that the etcd cluster is healthy:
# etcdctl -C "https://$(facter fqdn):2379" cluster-health
member 21ba43a3db03bf64 is healthy: got healthy result from https://top1.mg1.hpc.domain.fr:2379
member 54de3afb81ad231e is healthy: got healthy result from https://worker1.mg1.hpc.domain.fr:2379
member 56be7f61679e83e0 is healthy: got healthy result from https://worker2.mg1.hpc.domain.fr:2379
member 6daceb4fdf706afd is healthy: got healthy result from https://worker3.mg1.hpc.domain.fr:2379
member a737706dc883425f is healthy: got healthy result from https://top3.mg1.hpc.domain.fr:2379
member e8d3e2afaf64ac7c is healthy: got healthy result from https://top2.mg1.hpc.domain.fr:2379
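If a single pass/fail line is preferred, the summary printed at the end of cluster-health can be used; with the etcdctl v2 command set used above, the command should also return a non-zero exit code when the cluster is degraded:
# etcdctl -C "https://$(facter fqdn):2379" cluster-health | tail -n1
cluster is healthy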
Fleet shutdown¶
Fail over the services managed by fleet. To do so, simply stop the fleetd daemon on the hypervisor to evacuate:
# systemctl stop fleet
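From another fleet client, check that the units have been rescheduled away from ${HYPERVISOR}; the inventory command used earlier should eventually return nothing:
# fleetctl list-unit-files | grep "${HYPERVISOR}$" || echo "no unit left on ${HYPERVISOR}"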
GlusterFS shutdown¶
Stop the GlusterFS daemons and then kill the remaining processes:
# systemctl stop glusterfsd glusterd
# /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh -g; echo $?
A non-zero return code requires some investigation to figure out what happened.
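In that case, listing the GlusterFS processes still alive is a reasonable first step:
# pgrep -l gluster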
Unmount the GlusterFS filesystems:
# umount -t fuse.glusterfs -a
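Listing the remaining fuse.glusterfs mounts should now return nothing:
# mount -t fuse.glusterfs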
Hypervisor reintegration procedure¶
The GlusterFS and etcd daemons start at boot time. Check that everything is OK.
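A quick way to confirm that both daemons are indeed running after the reboot (the unit names below are the usual ones but may differ depending on the packaging):
# systemctl is-active glusterd etcd
active
active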
GlusterFS start-up checks¶
Check that files are correctly replicated (this may take some time and/or time out):
# gluster volume heal volspoms1 info | grep "Number of entries" | awk '{ sum+=$4} END {print sum}'
0
# gluster volume heal volspoms2 info | grep "Number of entries" | awk '{ sum+=$4} END {print sum}'
0
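Since self-heal can take a while once the node is back, the check can be polled until it reaches zero, for instance as follows (polling interval chosen arbitrarily; repeat for volspoms2):
# while true
do
pending=$(gluster volume heal volspoms1 info | grep "Number of entries" | awk '{ sum+=$4 } END { print sum }')
echo "volspoms1: ${pending} entries pending"
[ "${pending}" -eq 0 ] && break
sleep 60
done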
Etcd start-up check¶
Check that the etcd cluster is healthy:
# etcdctl -C "https://$(facter fqdn):2379" cluster-health
member 21ba43a3db03bf64 is healthy: got healthy result from https://top1.mg1.hpc.domain.fr:2379
member 54de3afb81ad231e is healthy: got healthy result from https://worker1.mg1.hpc.domain.fr:2379
member 56be7f61679e83e0 is healthy: got healthy result from https://worker2.mg1.hpc.domain.fr:2379
member 6daceb4fdf706afd is healthy: got healthy result from https://worker3.mg1.hpc.domain.fr:2379
member a737706dc883425f is healthy: got healthy result from https://top3.mg1.hpc.domain.fr:2379
member e8d3e2afaf64ac7c is healthy: got healthy result from https://top2.mg1.hpc.domain.fr:2379
Fleet start-up¶
Start the fleet daemon. This will only start the services that can run exclusively on this particular hypervisor (pinned VMs); there is no automatic rebalancing.
# systemctl start fleet
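To confirm that the pinned VMs came back up on ${HYPERVISOR}, you can list the units now scheduled on it (a sketch reusing the hostname and unit fields used in the script below):
# fleetctl list-units --no-legend --fields hostname,unit | awk -v h="${HYPERVISOR}" '$1 == h'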
Measure the load of each hypervisor. Launch on a running hypervisor:
# unset weights; declare -A weights
while read hyp unit
do
current_weight=${weights[${hyp}]:-0}
unit_weight=$(fleetctl cat $unit | grep Weight | cut -f2 -d'=')
weights[${hyp}]=$(( ${current_weight} + ${unit_weight} ))
done <<< "$(fleetctl list-units --no-legend --fields hostname,unit)"
for hyp in "${!weights[@]}"; do echo "$hyp ${weights[$hyp]}"; done | sort -k 1 | column -t
[...]
top1 20000
top2 24000
top3 20000
worker1 20000
worker2 58000
worker3 50000
Measure the load induced by each VM. Launch on a running hypervisor:
# while read hyp unit
do
unit_weight=$(fleetctl cat $unit | grep Weight | cut -f2 -d'=')
echo -e "${unit} on ${hyp}\t\t${unit_weight}"
done <<< "$(fleetctl list-units --no-legend --fields hostname,unit)" | sort -k 1 | column -t
Using the information above and the criticality of each VM, determine which VMs to fail back. To fail back a VM, launch:
# fleetctl unload --no-block ${VM}
# fleetctl start --no-block ${VM}
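After each failback, you can check where the VM actually landed:
# fleetctl list-units --no-legend --fields hostname,unit | grep "${VM}"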
Footnotes
1. Based on the procedure scripted in /usr/share/glusterfs/scripts/stop-all-gluster-processes.sh