Diskless management¶
Image generation¶
Thanks to Ocean Stack’s architecture, diskless images are simply virtual machine images that are exported through iSCSI. See Diskless for details about the diskless architecture.
Here, we only document the image generation procedure; compute node configuration management is out of scope.
First, to generate a diskless image, use the Add a new service VM procedure to add a new VM that will serve as our compute image. This reference VM will be designated as COMPUTE_VM.
The procedure for generating a complete image is the following:
If present, back up the old reference image:
mv /volspoms1/pcocc/persistent_drives/${COMPUTE_VM}.qcow2 /volspoms1/pcocc/persistent_drives/${COMPUTE_VM}.qcow2.$(date +%F)
Create a new reference image file:
# prepare-ocean-image.sh ${COMPUTE_VM}
Formatting '/volspoms1/pcocc/persistent_drives/${COMPUTE_VM}.qcow2', fmt=qcow2 size=53687091200 backing_file='/volspoms1/pcocc/persistent_drives/rhel.latest.qcow2' encryption=off cluster_size=65536 lazy_refcounts=off
On a working hypervisor, launch the reference VM:
# . /etc/sysconfig/pcocc-vm-${COMPUTE_VM}
# pcocc alloc ${COMPUTE_VM}
Follow the bootstrap process using the pcocc CLI:
(pcocc/XXXXX) # pcocc console
[...]
[ 48.578821] cloud-init[902]: + cloud-init-per instance distro_sync yum distribution-synchronization -y
[ 48.822081] cloud-init[902]: Loaded plugins: priorities, search-disabled-repos
[ 53.809157] cloud-init[902]: 437 packages excluded due to repository priority protections
[ 58.596408] cloud-init[902]: Resolving Dependencies
[ 58.598198] cloud-init[902]: --> Running transaction check
[ 58.599632] cloud-init[902]: ---> Package bind-libs-lite.x86_64 32:9.9.4-74.el7_6.1 will be updated
[ 58.854444] cloud-init[902]: ---> Package bind-libs-lite.x86_64 32:9.9.4-74.el7_6.2 will be an update
[ 58.873073] cloud-init[902]: ---> Package bind-license.noarch 32:9.9.4-74.el7_6.1 will be updated
[...]
Note
Please note that the VM might not be reachable immediately, because of the delay between boot and the effective configuration of the SSH daemon. Be patient, and check the console output for any error that could prevent SSH from listening correctly.
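Rather than retrying SSH by hand, a small polling helper can wait for the SSH port to open. This is a hypothetical sketch, not part of the Ocean tooling; the host and port arguments are whatever the VM exposes:

```shell
# Poll a TCP port until it accepts connections, or give up after TIMEOUT seconds.
# Hypothetical helper; relies on the bash /dev/tcp pseudo-device.
wait_for_port() {
    local host=$1 port=$2 timeout=$3 waited=0
    while ! timeout 1 bash -c "echo > /dev/tcp/${host}/${port}" 2> /dev/null; do
        sleep 1
        waited=$((waited + 1))
        [ "${waited}" -ge "${timeout}" ] && return 1
    done
    return 0
}
```

For example, `wait_for_port <vm-address> 422 120` before the first pcocc ssh attempt.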
Poll the VM for cloud-init completion: if the /run/cloud-init/result.json file is present, the cloud-init process is complete:
(pcocc/XXXXX) # pcocc ssh -J ${COMPUTE_VM} -p 422 vm0 cat /run/cloud-init/result.json
{
  "v1": {
    "datasource": "DataSourceNoCloud [seed=/dev/sr0][dsmode=net]",
    "errors": [
      "('users-groups', TypeError(\"Can not create sudoers rule addition with type u'bool'\",))",
      "('scripts-user', RuntimeError('Runparts: 1 failures in 1 attempted commands',))"
    ]
  }
}
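The errors array in result.json can also be checked programmatically instead of by eye. A minimal sketch, assuming python3 is available where the JSON is read; the inline document below is a sample, not real node output:

```shell
# Exit non-zero when cloud-init recorded errors.
# Inline sample document; on a real node, pipe `cat /run/cloud-init/result.json` instead.
result='{"v1": {"datasource": "DataSourceNoCloud", "errors": []}}'
echo "${result}" | python3 -c '
import json, sys
doc = json.load(sys.stdin)
sys.exit(1 if doc["v1"]["errors"] else 0)
' && echo "cloud-init completed without errors"
```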
Do a first sanity reboot to make sure that the correct kernel is booted.
(pcocc/XXXXX) # pcocc ssh -J ${COMPUTE_VM} -p 422 vm0 reboot
The VM may boot using the DisklessTrap initramfs image. To jump out of the trap, exit the shell present on the console.
(pcocc/XXXXX) # pcocc console
root@${COMPUTE_VM}_DisklessTrap:/root# exit
[...]
Apply a puppet run once again to make sure that kernel-related changes are correctly applied to the running kernel.
(pcocc/XXXXX) # pcocc ssh -J ${COMPUTE_VM} -p 422 vm0 puppet-apply
[...]
Notice: Applied catalog in 121.63 seconds
Rebuild the initramfs, then reboot:
(pcocc/XXXXX) # pcocc ssh -J ${COMPUTE_VM} -p 422 vm0
# dracut -fMv
[...]
*** Creating initramfs image file '/boot/initramfs-3.10.0-957.35.2.el7.x86_64.img' done ***
# reboot
Again, the VM will boot using the DisklessTrap initramfs image. To jump out of the trap, exit the shell present on the console.
(pcocc/XXXXX) # pcocc console
root@${COMPUTE_VM}_DisklessTrap:/root# exit
[...]
Extract the initramfs and vmlinuz files from the image
(pcocc/XXXXX) # pcocc ssh -p 422 vm0 'tar -C /boot -czO initramfs-$(uname -r).img vmlinuz-$(uname -r)' | tar -C /volspoms1/pub/boot/diskless/ --transform 's/$/.new/' -xzf -
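The tar-over-a-pipe pattern above (create on the VM, extract locally with a `.new` suffix appended by `--transform`) can be checked with purely local, illustrative paths:

```shell
# Local demo of the tar pipe with a .new suffix on extraction.
# Paths under /tmp are illustrative; the real command streams from the VM.
mkdir -p /tmp/tar_src /tmp/tar_dst
echo kernel > /tmp/tar_src/vmlinuz-demo
echo initrd > /tmp/tar_src/initramfs-demo.img
tar -C /tmp/tar_src -czf - vmlinuz-demo initramfs-demo.img \
    | tar -C /tmp/tar_dst --transform 's/$/.new/' -xzf -
ls /tmp/tar_dst
```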
Shut down the VM
(pcocc/XXXXX) # ^D
Terminating the cluster...
Define destination image and key
export KEY=/volspoms1/diskless/keys/stacker-image-$(date +%F).key
export RAW_IMG=/volspoms1/diskless/images/raw/stacker-image-$(date +%F).raw
export ENC_IMG=/volspoms1/diskless/images/encrypted/stacker-image-$(date +%F).img
Copy the qcow2 image into a raw image using qemu-img
# qemu-img convert -f qcow2 -O raw gluster://top1/volspoms1/pcocc/persistent_drives/${COMPUTE_VM}.qcow2 gluster://top1${RAW_IMG}
[2020-01-14 14:43:29.803547] E [MSGID: 108006] [afr-common.c:5214:__afr_handle_child_down_event] 0-volspoms1-replicate-1: All subvolumes are down. Going offline until atleast one of them comes back up.
[2020-01-14 14:43:29.803936] E [MSGID: 108006] [afr-common.c:5214:__afr_handle_child_down_event] 0-volspoms1-replicate-2: All subvolumes are down. Going offline until atleast one of them comes back up.
[2020-01-14 14:43:29.804297] E [MSGID: 108006] [afr-common.c:5214:__afr_handle_child_down_event] 0-volspoms1-replicate-3: All subvolumes are down. Going offline until atleast one of them comes back up.
[2020-01-14 14:43:29.804641] E [MSGID: 108006] [afr-common.c:5214:__afr_handle_child_down_event] 0-volspoms1-replicate-4: All subvolumes are down. Going offline until atleast one of them comes back up.
[2020-01-14 14:43:29.804978] E [MSGID: 108006] [afr-common.c:5214:__afr_handle_child_down_event] 0-volspoms1-replicate-5: All subvolumes are down. Going offline until atleast one of them comes back up.
[...]
Finally, encrypt (while copying) the image.
# stacker lio encrypt -k ${KEY} -s ${RAW_IMG} -d ${ENC_IMG}
Note
Stacker will create ${KEY} and encrypt ${RAW_IMG} with it.
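Since the boot procedure later opens the image with cryptsetup luksOpen, a quick sanity check is to look for the LUKS header magic at the start of the encrypted image. A sketch using a fabricated demo file; on the real system, point it at ${ENC_IMG}:

```shell
# A LUKS container starts with the magic bytes "LUKS\xba\xbe".
# The demo file below is fabricated for illustration only.
img=/tmp/enc_img.demo
printf 'LUKS\xba\xbe' > "${img}"
if [ "$(head -c 4 "${img}")" = "LUKS" ]; then
    echo "LUKS container detected"
fi
```

On a host with cryptsetup installed, `cryptsetup isLuks ${ENC_IMG}` performs the same check more robustly.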
Exporting image to compute nodes¶
Once the diskless image is generated, export it to the compute nodes.
Define the nodes
export COMPUTE_NODES=ocean[1-1000]
Export the previously generated image
IMG_NAME="compute_img-$(date +%F)"
clush -S -bw iscsi_srv[1-2] stacker lio export -n ${IMG_NAME} -W ${IMG_NAME} -d ${ENC_IMG} -w ${COMPUTE_NODES}
clush -S -bw iscsi_srv[1-2] stacker lio config --save
Note
The keyfile and image are accessed through GlusterFS; their paths come from the ${KEY} and ${ENC_IMG} variable definitions above.
Accessing image on the compute node¶
As described in the previous sections, compute nodes boot on the DisklessTrap. After the boot process, the node needs to be configured to mount the exported compute image.
Configure iSCSI client
cat << EOF | clush -bw ${COMPUTE_NODES}
cat > /etc/iscsi/iscsid.conf << EO_ISCSI
iscsid.startup = /bin/systemctl start iscsid.socket iscsiuio.socket
node.startup = automatic
node.leading_login = No
node.session.timeo.replacement_timeout = 15
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.initial_login_retry_max = 8
node.session.cmds_max = 128
node.session.queue_depth = 32
node.session.xmit_thread_priority = -20
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
node.conn[0].iscsi.MaxXmitDataSegmentLength = 0
discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768
node.conn[0].iscsi.HeaderDigest = None
node.session.nr_sessions = 1
node.session.iscsi.FastAbort = Yes
node.session.scan = auto
discovery.sendtargets.auth.authmethod = CHAP
discovery.sendtargets.auth.username = disco_user
discovery.sendtargets.auth.password = disco_pass
discovery.sendtargets.auth.username_in = disco_mutual_user
discovery.sendtargets.auth.password_in = disco_mutual_pass
node.session.auth.authmethod = CHAP
node.session.auth.username_in = node_mutual_user
node.session.auth.password_in = node_mutual_pass
node.session.auth.username = node_user
node.session.auth.password = node_pass
EO_ISCSI
EOF
Discover the iSCSI server targets
for server in $(nodeset -e iscsi_srv[1-2])
do
  iscsi_prefix=$(ssh ${server} "awk -F= '/^wwn_target_prefix/ {print \$2}' /etc/stacker/stacker.conf")
  clush -bw ${COMPUTE_NODES} iscsiadm -m discovery -t st -p ${server}
  clush -bw ${COMPUTE_NODES} iscsiadm -m node -T ${iscsi_prefix}${IMG_NAME} -p ${server}:3260 -l
done
Note
The default port for iSCSI server is 3260.
Note
iscsi_srv[1-2] are the iSCSI servers serving ${COMPUTE_NODES}. This list should be adapted to the cluster architecture.
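The wwn_target_prefix extraction in the discovery loop can be exercised locally against a sample configuration; the prefix value below is a made-up example, not a real Ocean IQN:

```shell
# Fabricated stacker.conf fragment, for demonstration purposes only
cat > /tmp/stacker.conf.sample << 'EOF'
# iSCSI target naming
wwn_target_prefix=iqn.2003-01.org.example.stacker:
EOF
awk -F= '/^wwn_target_prefix/ {print $2}' /tmp/stacker.conf.sample
# prints: iqn.2003-01.org.example.stacker:
```

This prefix is what gets prepended to ${IMG_NAME} to form the target name passed to iscsiadm -m node -T.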
Configure and launch multipath
cat << EOF | clush -bw ${COMPUTE_NODES}
cat > /etc/multipath.conf << EO_MULTIPATH
defaults {
polling_interval 10
failback immediate
no_path_retry queue
user_friendly_names yes
find_multipaths yes
prio random
uid_attribute ID_FS_UUID
}
blacklist {
devnode "^zram.*"
}
EO_MULTIPATH
multipathd
EOF
Copy the image key to the nodes
clush -bw ${COMPUTE_NODES} --copy ${KEY} --dest /dev/shm/luksKey
Open luks device
cat << 'EOF' | clush -bw ${COMPUTE_NODES}
scsi_id=$(udevadm info -q property /dev/sda | grep ^ID_FS_UUID= | sed 's/ID_FS_UUID=//')
cryptsetup luksOpen -d /dev/shm/luksKey /dev/mapper/${scsi_id} luks_root
EOF
Note
The device used here is /dev/sda; it should be adapted to the exported images and node configuration.
Cleanup key
clush -bw ${COMPUTE_NODES} rm /dev/shm/luksKey
Get device partitions
cat << 'EOF' | clush -bw ${COMPUTE_NODES}
partprobe /dev/mapper/luks_root
for device in /dev/mapper/* ; do
  blockdev --setro $device
  lvm lvchange -ay $device
done
EOF
Note
We explicitly set devices read-only here to work around a misdetection of the read-only property in the device_mapper kernel module.
Prepare directory tree
clush -bw ${COMPUTE_NODES} "mkdir -p /overlay/upper/{root,var} /overlay/work/{root,var} /overlay/lower/{root,var}"
Prepare zram for the overlay
cat <<EOF | clush -bw ${COMPUTE_NODES}
mkfs.xfs -f /dev/zram0
mount /dev/zram0 /overlay
EOF
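mkfs.xfs can only run once the zram device exists and has a non-zero disksize. A common sizing heuristic, given here as an assumption rather than an Ocean default, is a fraction of physical RAM:

```shell
# Compute half of physical RAM (in KiB) as a candidate zram size.
mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
zram_kb=$((mem_kb / 2))
echo "candidate zram0 disksize: ${zram_kb}K"
# On the node (root required, zram module loaded):
#   echo "${zram_kb}K" > /sys/block/zram0/disksize
```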
Mount device partitions
cat << EOF | clush -bw ${COMPUTE_NODES}
mount -o ro,_netdev /dev/mapper/luks_root1 /overlay/lower/root
mount -o ro,_netdev /dev/mapper/system-var /overlay/lower/var
EOF
Mount the root filesystem
cat << EOF | clush -bw ${COMPUTE_NODES}
mount -t overlay overlay -olowerdir=/overlay/lower/root,upperdir=/overlay/upper/root,workdir=/overlay/work/root,_netdev /sysroot || exit 1
mount -t overlay overlay -olowerdir=/overlay/lower/var,upperdir=/overlay/upper/var,workdir=/overlay/work/var,_netdev /sysroot/var || exit 1
EOF
Configure needed services
cat << EOF | clush -bw ${COMPUTE_NODES}
mkdir -p /sysroot/etc/iscsi
cp /etc/iscsi/iscsid.conf /sysroot/etc/iscsi/iscsid.conf
cp /etc/multipath.conf /sysroot/etc/multipath.conf
cp /etc/iscsi/initiatorname.iscsi /sysroot/etc/iscsi/initiatorname.iscsi
systemctl --root /sysroot/ enable iscsid.service
systemctl --root /sysroot/ enable multipathd.service
rm -f /sysroot/.autorelabel
hostname > /sysroot/etc/hostname
EOF
Launch the boot sequence
clush -bw ${COMPUTE_NODES} systemctl stop dracut-emergency.service
Deactivate multipath¶
In the previous section, access to the iSCSI server was configured with multipath; this section describes how to deactivate this feature.
iSCSI configuration
cat << EOF | clush -bw ${COMPUTE_NODES}
cat > /etc/iscsi/iscsid.conf << EO_ISCSI
iscsid.startup = /bin/systemctl start iscsid.socket iscsiuio.socket
node.startup = automatic
node.leading_login = No
node.session.timeo.replacement_timeout = 600
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 0
node.conn[0].timeo.noop_out_timeout = 0
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.initial_login_retry_max = 8
node.session.cmds_max = 128
node.session.queue_depth = 32
node.session.xmit_thread_priority = -20
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
node.conn[0].iscsi.MaxXmitDataSegmentLength = 0
discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768
node.conn[0].iscsi.HeaderDigest = None
node.session.nr_sessions = 1
node.session.iscsi.FastAbort = Yes
node.session.scan = auto
EO_ISCSI
EOF
Remove multipath configuration
cat << EOF | clush -bw ${COMPUTE_NODES}
rm /etc/multipath.conf
pkill multipathd
EOF
When multipath is deactivated, adapt the LUKS opening step to use the iSCSI block device directly:
clush -bw ${COMPUTE_NODES} cryptsetup luksOpen -d /dev/shm/luksKey /dev/sda luks_root
Note
The device used here is /dev/sda and should be modified regarding server exported images and node configuration.
DisklessTrap initramfs¶
We provide a dracut module to manage diskless boot.
It generates an initramfs that traps the node boot process. Once the node is in this state, it can be accessed through SSH with the tools needed to boot with any diskless method supported by Ocean.
Installation is done automatically by puppet during the VM boot.
It can be done manually by installing the dracut-ccc-modules package.
dnf install -y dracut-ccc-modules
Puppet will configure DisklessTrap in order to generate a full featured diskless initramfs image.
To update the initramfs image content and behaviour, check /etc/dracut-ccc-modules.conf, then regenerate the image with dracut -fMv.
More information is available in man DisklessTrap.