Diskless management

Image generation

Thanks to Ocean Stack’s architecture, diskless images are simply virtual machine images exported through iSCSI. See Diskless for details about the diskless architecture.

Here, we will only document the image generation procedure. The compute node configuration management is out of scope.
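
For instance, once an image has been generated, its backing chain can be inspected with qemu-img; a minimal sketch using the paths from the procedure below:

# qemu-img info /volspoms1/pcocc/persistent_drives/${COMPUTE_VM}.qcow2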

To generate a diskless image, first use the Add a new service VM procedure to add a new VM that will serve as the compute image. This reference VM is designated COMPUTE_VM below.

The procedure for generating a complete image is the following:

  • If present, back up the old reference image:

mv /volspoms1/pcocc/persistent_drives/${COMPUTE_VM}.qcow2 /volspoms1/pcocc/persistent_drives/${COMPUTE_VM}.qcow2.$(date +%F)
  • Create a new reference image file:

# prepare-ocean-image.sh ${COMPUTE_VM}
Formatting '/volspoms1/pcocc/persistent_drives/${COMPUTE_VM}.qcow2', fmt=qcow2 size=53687091200 backing_file='/volspoms1/pcocc/persistent_drives/rhel.latest.qcow2' encryption=off cluster_size=65536 lazy_refcounts=off
  • On a working hypervisor, launch the reference VM:

# . /etc/sysconfig/pcocc-vm-${COMPUTE_VM}
# pcocc alloc ${COMPUTE_VM}
  • Follow the bootstrap process using the pcocc CLI:

(pcocc/XXXXX) # pcocc console
[...]
[   48.578821] cloud-init[902]: + cloud-init-per instance distro_sync yum distribution-synchronization -y
[   48.822081] cloud-init[902]: Loaded plugins: priorities, search-disabled-repos
[   53.809157] cloud-init[902]: 437 packages excluded due to repository priority protections
[   58.596408] cloud-init[902]: Resolving Dependencies
[   58.598198] cloud-init[902]: --> Running transaction check
[   58.599632] cloud-init[902]: ---> Package bind-libs-lite.x86_64 32:9.9.4-74.el7_6.1 will be updated
[   58.854444] cloud-init[902]: ---> Package bind-libs-lite.x86_64 32:9.9.4-74.el7_6.2 will be an update
[   58.873073] cloud-init[902]: ---> Package bind-license.noarch 32:9.9.4-74.el7_6.1 will be updated
[...]

Note

Please note that the VM might not be reachable immediately: there is a delay between boot and the SSH daemon being fully configured. Be patient and check the console output for any error that could prevent SSH from listening correctly.
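
A minimal wait loop (a sketch, run from the pcocc allocation shell) can be used instead of retrying by hand:

until pcocc ssh -J ${COMPUTE_VM} -p 422 vm0 true 2>/dev/null; do
    sleep 10   # retry until the SSH daemon accepts connections
done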

  • Poll the VM for cloud-init completion; once the /run/cloud-init/result.json file is present, the cloud-init process is complete:

(pcocc/XXXXX) # pcocc ssh -J ${COMPUTE_VM} -p 422 vm0 cat /run/cloud-init/result.json
{
 "v1": {
  "datasource": "DataSourceNoCloud [seed=/dev/sr0][dsmode=net]",
  "errors": [
   "('users-groups', TypeError(\"Can not create sudoers rule addition with type u'bool'\",))",
   "('scripts-user', RuntimeError('Runparts: 1 failures in 1 attempted commands',))"
  ]
 }
}
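
The polling can also be scripted; a minimal sketch, assuming it is run from the allocation shell:

until pcocc ssh -J ${COMPUTE_VM} -p 422 vm0 test -e /run/cloud-init/result.json 2>/dev/null
do
    sleep 10   # wait for cloud-init to write its result file
done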
  • Do a first sanity reboot to make sure that the correct kernel is booted.

(pcocc/XXXXX) # pcocc ssh -J ${COMPUTE_VM} -p 422 vm0 reboot
  • The VM may boot using the DisklessTrap initramfs image. To jump out of the trap, exit the shell present on the console.

(pcocc/XXXXX) # pcocc console
root@${COMPUTE_VM}_DisklessTrap:/root# exit
[...]
  • Apply another puppet run to make sure that kernel-related changes are correctly applied to the running kernel.

(pcocc/XXXXX) # pcocc ssh -J ${COMPUTE_VM} -p 422 vm0 puppet-apply
[...]
Notice: Applied catalog in 121.63 seconds
  • Rebuild the initramfs, then reboot

(pcocc/XXXXX) # pcocc ssh -J ${COMPUTE_VM} -p 422 vm0
# dracut -fMv
[...]
*** Creating initramfs image file '/boot/initramfs-3.10.0-957.35.2.el7.x86_64.img' done ***
# reboot
  • Again, the VM will boot using the DisklessTrap initramfs image. To jump out of the trap, exit the shell present on the console.

(pcocc/XXXXX) # pcocc console
root@${COMPUTE_VM}_DisklessTrap:/root# exit
[...]
  • Extract the initramfs and vmlinuz files from the image

(pcocc/XXXXX) # pcocc ssh -J ${COMPUTE_VM} -p 422 vm0 "tar -C /boot -czO initramfs-\$(uname -r).img vmlinuz-\$(uname -r)" | tar -C /volspoms1/pub/boot/diskless/ --transform 's/$/.new/' -xzf -
  • Shut down the VM

(pcocc/XXXXX) # ^D
Terminating the cluster...
  • Define destination image and key

export KEY=/volspoms1/diskless/keys/stacker-image-$(date +%F).key
export RAW_IMG=/volspoms1/diskless/images/raw/stacker-image-$(date +%F).raw
export ENC_IMG=/volspoms1/diskless/images/encrypted/stacker-image-$(date +%F).img
  • Copy the qcow2 image into a raw image using qemu-img

# qemu-img convert -f qcow2 -O raw gluster://top1/volspoms1/pcocc/persistent_drives/${COMPUTE_VM}.qcow2 gluster://top1${RAW_IMG}
[2020-01-14 14:43:29.803547] E [MSGID: 108006] [afr-common.c:5214:__afr_handle_child_down_event] 0-volspoms1-replicate-1: All subvolumes are down. Going offline until atleast one of them comes back up.
[2020-01-14 14:43:29.803936] E [MSGID: 108006] [afr-common.c:5214:__afr_handle_child_down_event] 0-volspoms1-replicate-2: All subvolumes are down. Going offline until atleast one of them comes back up.
[2020-01-14 14:43:29.804297] E [MSGID: 108006] [afr-common.c:5214:__afr_handle_child_down_event] 0-volspoms1-replicate-3: All subvolumes are down. Going offline until atleast one of them comes back up.
[2020-01-14 14:43:29.804641] E [MSGID: 108006] [afr-common.c:5214:__afr_handle_child_down_event] 0-volspoms1-replicate-4: All subvolumes are down. Going offline until atleast one of them comes back up.
[2020-01-14 14:43:29.804978] E [MSGID: 108006] [afr-common.c:5214:__afr_handle_child_down_event] 0-volspoms1-replicate-5: All subvolumes are down. Going offline until atleast one of them comes back up.
[...]
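
Before encrypting, the converted image can be sanity-checked; a quick sketch (it should report the raw format and the expected virtual size):

# qemu-img info gluster://top1${RAW_IMG}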
  • Finally, encrypt the image while copying it.

# stacker lio encrypt -k ${KEY} -s ${RAW_IMG} -d ${ENC_IMG}

Note

Stacker will create ${KEY} and encrypt ${RAW_IMG} with it.

Exporting the image to compute nodes

Once the diskless image has been generated, export it to the nodes.

  • Define the nodes

export COMPUTE_NODES=ocean[1-1000]
  • Export the previously generated image

IMG_NAME="compute_img-$(date +%F)"

clush -S -bw iscsi_srv[1-2] stacker lio export -n ${IMG_NAME} -W ${IMG_NAME} -d ${ENC_IMG} -w ${COMPUTE_NODES}
clush -S -bw iscsi_srv[1-2] stacker lio config --save

Note

The key file and image are accessed through GlusterFS; their paths are the ones defined above in the image and key variable definitions.

Accessing the image on the compute nodes

As described in the previous sections, compute nodes boot on the DisklessTrap initramfs. After boot, each node needs to be configured to mount the exported compute image.

  • Configure iSCSI client

cat << EOF | clush -bw ${COMPUTE_NODES}
cat > /etc/iscsi/iscsid.conf << EO_ISCSI
iscsid.startup = /bin/systemctl start iscsid.socket iscsiuio.socket
node.startup = automatic
node.leading_login = No
node.session.timeo.replacement_timeout = 15
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.initial_login_retry_max = 8
node.session.cmds_max = 128
node.session.queue_depth = 32
node.session.xmit_thread_priority = -20
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
node.conn[0].iscsi.MaxXmitDataSegmentLength = 0
discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768
node.conn[0].iscsi.HeaderDigest = None
node.session.nr_sessions = 1
node.session.iscsi.FastAbort = Yes
node.session.scan = auto
discovery.sendtargets.auth.authmethod = CHAP
discovery.sendtargets.auth.username = disco_user
discovery.sendtargets.auth.password = disco_pass
discovery.sendtargets.auth.username_in = disco_mutual_user
discovery.sendtargets.auth.password_in = disco_mutual_pass
node.session.auth.authmethod = CHAP
node.session.auth.username_in = node_mutual_user
node.session.auth.password_in = node_mutual_pass
node.session.auth.username = node_user
node.session.auth.password = node_pass
EO_ISCSI
EOF
  • Discover iSCSI server targets

for server in $(nodeset -e iscsi_srv[1-2])
do
  iscsi_prefix=$(ssh ${server} "awk -F= '/^wwn_target_prefix/ {print \$2}' /etc/stacker/stacker.conf")
  clush -bw ${COMPUTE_NODES} iscsiadm -m discovery -t st -p ${server}
  clush -bw ${COMPUTE_NODES} iscsiadm -m node -T ${iscsi_prefix}${IMG_NAME} -p ${server}:3260 -l
done
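
Once discovery and login are done, the active sessions can be verified; each node should report one session per iSCSI server:

clush -bw ${COMPUTE_NODES} iscsiadm -m session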

Note

The default iSCSI server port is 3260.

Note

iscsi_srv[1-2] are the iSCSI servers serving ${COMPUTE_NODES}. This list should be adapted to the cluster architecture.

  • Configure and launch multipath

cat << EOF | clush -bw ${COMPUTE_NODES}
cat > /etc/multipath.conf << EO_MULTIPATH
defaults {
      polling_interval        10
      failback                immediate
      no_path_retry           queue
      user_friendly_names     yes
      find_multipaths         yes
      prio                    random
      uid_attribute           ID_FS_UUID
}
blacklist {
      devnode "^zram.*"
}
EO_MULTIPATH
multipathd
EOF
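
The exported image should now appear on every node as a single multipath device with one path per iSCSI server; this can be checked with:

clush -bw ${COMPUTE_NODES} multipath -ll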
  • Copy the image key to the nodes

clush -bw ${COMPUTE_NODES} --copy ${KEY} --dest /dev/shm/luksKey
  • Open the LUKS device

cat << 'EOF' | clush -bw ${COMPUTE_NODES}
scsi_id=$(udevadm info -q property /dev/sda | grep ^ID_FS_UUID= | sed 's/ID_FS_UUID=//')
cryptsetup luksOpen -d /dev/shm/luksKey /dev/mapper/${scsi_id} luks_root
EOF

Note

The device used here is /dev/sda; it should be adapted to the images exported by the servers and to the node configuration.
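
Before cleaning up the key, the mapping can be verified; a short sketch:

clush -bw ${COMPUTE_NODES} cryptsetup status luks_root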

  • Cleanup key

clush -bw ${COMPUTE_NODES} rm /dev/shm/luksKey
  • Get device partitions

cat << 'EOF' | clush -bw ${COMPUTE_NODES}
partprobe /dev/mapper/luks_root
for device in /dev/mapper/* ; do
  blockdev --setro $device
  lvm lvchange -ay $device
done
EOF

Note

We explicitly set devices read-only here to work around a misdetection of the read-only property in the device-mapper kernel module.
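
The partitions and logical volumes detected on the opened image can be listed with, for example:

clush -bw ${COMPUTE_NODES} lsblk /dev/mapper/luks_root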

  • Prepare directory tree

clush -bw ${COMPUTE_NODES} "mkdir -p /overlay/upper/{root,var} /overlay/work/{root,var} /overlay/lower/{root,var}"
  • Prepare zram for the overlay

cat <<EOF | clush -bw ${COMPUTE_NODES}
mkfs.xfs -f /dev/zram0
mount /dev/zram0 /overlay
EOF
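
This assumes /dev/zram0 already exists with a suitable size (for example, set up by the DisklessTrap initramfs). If it does not, it can be sized through sysfs first; a sketch, where the 4G size is an arbitrary example to adapt:

clush -bw ${COMPUTE_NODES} "echo 4G > /sys/block/zram0/disksize"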
  • Mount device partitions

cat << EOF | clush -bw ${COMPUTE_NODES}
mount -o ro,_netdev /dev/mapper/luks_root1 /overlay/lower/root
mount -o ro,_netdev /dev/mapper/system-var /overlay/lower/var
EOF
  • Mount the root filesystem

cat << EOF | clush -bw ${COMPUTE_NODES}
mount -t overlay overlay -olowerdir=/overlay/lower/root,upperdir=/overlay/upper/root,workdir=/overlay/work/root,_netdev /sysroot || exit 1
mount -t overlay overlay -olowerdir=/overlay/lower/var,upperdir=/overlay/upper/var,workdir=/overlay/work/var,_netdev /sysroot/var || exit 1
EOF
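
The overlay mounts can be verified before going further:

clush -bw ${COMPUTE_NODES} findmnt -t overlay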
  • Configure needed services

cat << EOF | clush -bw ${COMPUTE_NODES}
mkdir -p /sysroot/etc/iscsi
cp /etc/iscsi/iscsid.conf /sysroot/etc/iscsi/iscsid.conf
cp /etc/multipath.conf /sysroot/etc/multipath.conf
cp /etc/iscsi/initiatorname.iscsi /sysroot/etc/iscsi/initiatorname.iscsi
systemctl --root /sysroot/ enable iscsid.service
systemctl --root /sysroot/ enable multipathd.service
rm -f /sysroot/.autorelabel
hostname > /sysroot/etc/hostname
EOF
  • Launch the boot sequence. Stopping the emergency service lets the initramfs leave the trap and resume the normal boot on the prepared root.

clush -bw ${COMPUTE_NODES} systemctl stop dracut-emergency.service

Deactivate multipath

In the previous section, the iSCSI servers are accessed with multipath configured; this section describes how to deactivate that feature.

  • iSCSI configuration

cat << EOF | clush -bw ${COMPUTE_NODES}
cat > /etc/iscsi/iscsid.conf << EO_ISCSI
iscsid.startup = /bin/systemctl start iscsid.socket iscsiuio.socket
node.startup = automatic
node.leading_login = No
node.session.timeo.replacement_timeout = 600
node.conn[0].timeo.login_timeout = 15
node.conn[0].timeo.logout_timeout = 15
node.conn[0].timeo.noop_out_interval = 0
node.conn[0].timeo.noop_out_timeout = 0
node.session.err_timeo.abort_timeout = 15
node.session.err_timeo.lu_reset_timeout = 30
node.session.err_timeo.tgt_reset_timeout = 30
node.session.initial_login_retry_max = 8
node.session.cmds_max = 128
node.session.queue_depth = 32
node.session.xmit_thread_priority = -20
node.session.iscsi.InitialR2T = No
node.session.iscsi.ImmediateData = Yes
node.session.iscsi.FirstBurstLength = 262144
node.session.iscsi.MaxBurstLength = 16776192
node.conn[0].iscsi.MaxRecvDataSegmentLength = 262144
node.conn[0].iscsi.MaxXmitDataSegmentLength = 0
discovery.sendtargets.iscsi.MaxRecvDataSegmentLength = 32768
node.conn[0].iscsi.HeaderDigest = None
node.session.nr_sessions = 1
node.session.iscsi.FastAbort = Yes
node.session.scan = auto
EO_ISCSI
EOF
  • Remove multipath configuration

cat << EOF | clush -bw ${COMPUTE_NODES}
rm /etc/multipath.conf
pkill multipathd
EOF
  • Replace the Open the LUKS device step with a direct open of the iSCSI device:

clush -bw ${COMPUTE_NODES} cryptsetup luksOpen -d /dev/shm/luksKey /dev/sda luks_root

Note

The device used here is /dev/sda; it should be adapted to the images exported by the servers and to the node configuration.

DisklessTrap initramfs

We provide a dracut module to manage diskless boot.

It generates an initramfs that traps the node boot process. Once the node is in this state, it can be accessed through SSH and provides the tools needed to boot with any diskless method supported by Ocean.

Installation is done automatically by puppet during the VM boot. It can be done manually by installing the dracut-ccc-modules package.

dnf install -y dracut-ccc-modules

Puppet will configure DisklessTrap in order to generate a full-featured diskless initramfs image.

To update the initramfs image content and behaviour, check /etc/dracut-ccc-modules.conf, then regenerate the image with dracut -fMv.
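
To check that the module is available to dracut (the exact module name shipped by dracut-ccc-modules may differ), the installed modules can be listed; a sketch:

dracut --list-modules | grep -i diskless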

More information is available in man DisklessTrap.